Today I'm working on the question-answering part. The first step is to preprocess the text; the code is as follows:
    import numpy as np

    def dealline(line):
        # Each line: video name, then repeated groups of
        # (question, answer1, answer2, answer3), all comma-separated.
        lineArr = line.strip().split(',')
        name = lineArr[0]
        questionslist = []
        # Step by 4, since each group is one question plus three answers.
        for index in range(1, len(lineArr) - 3, 4):
            question = lineArr[index]
            answers = lineArr[index + 1:index + 4]
            questionslist.append({question: answers})
        return name, questionslist

    videodic = {}
    rootdir = r"D:\ai\AIE04\VQADatasetA_20180815\train.txt"
    with open(rootdir, 'r', encoding="utf-8") as f:
        for line in f:
            name, questionlist = dealline(line)
            videodic[name] = questionlist
            print(name)
    # The "npz" directory must already exist.
    # The dict is stored as a pickled 0-d object array.
    np.savez("npz/question.npz", question=videodic)
    print('finish')
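Because `np.savez` is given a plain dict, NumPy wraps it in a 0-d object array and pickles it, so reading it back needs `allow_pickle=True` plus `.item()`. Here is a minimal round-trip sketch. The helper names `save_questions` and `load_questions`, the sample video name, and the temporary path are all made up for illustration and are not part of the original script.

```python
import os
import tempfile

import numpy as np

def save_questions(path, videodic):
    # A dict is stored as a 0-d object array, which NumPy pickles internally.
    np.savez(path, question=videodic)

def load_questions(path):
    # allow_pickle=True is required to read object arrays back;
    # .item() unwraps the 0-d array into the original dict.
    with np.load(path, allow_pickle=True) as data:
        return data["question"].item()

# Hypothetical sample in the same shape the preprocessing produces:
# {video name: [{question: [answer1, answer2, answer3]}, ...]}
sample = {"video_0001": [{"what is in the video": ["dog", "dog", "puppy"]}]}

tmp_path = os.path.join(tempfile.mkdtemp(), "question.npz")
save_questions(tmp_path, sample)
restored = load_questions(tmp_path)
```

Only load pickled files you created yourself, since unpickling can execute arbitrary code.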
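Once the data is in structured form, the next step is to split each question into word groups. For example, "what is" is one group, "in front of" is one group, "the person" is one group, and "in video" is one group. My idea for the grouping is to add words one at a time and count how often each resulting word sequence occurs, working from the highest counts down; sequences that occur often are treated as one group. There may also be mature chunking algorithms for English already, so I need to look that up as well.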
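The counting idea can be sketched roughly as follows. This is only one possible reading of it: count every 2-word and 3-word sequence across all questions, then walk through a question and greedily take the longest sequence that is frequent enough. The function names, the `max_n` and `min_count` parameters, and the sample questions are assumptions made for illustration. It is a simple frequency heuristic, not an established chunking algorithm.

```python
from collections import Counter

def count_ngrams(questions, max_n=3):
    """Count every word sequence of length 2..max_n across all questions."""
    counts = Counter()
    for q in questions:
        words = q.lower().split()
        for n in range(2, max_n + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

def chunk_question(question, counts, max_n=3, min_count=2):
    """Greedily split a question, preferring the longest frequent sequence."""
    words = question.lower().split()
    chunks = []
    i = 0
    while i < len(words):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_n, len(words) - i), 1, -1):
            candidate = " ".join(words[i:i + n])
            if counts[candidate] >= min_count:
                chunks.append(candidate)
                i += n
                break
        else:
            # No frequent sequence starts here; keep the single word.
            chunks.append(words[i])
            i += 1
    return chunks

# Hypothetical sample questions, not taken from the dataset.
questions = [
    "what is in front of the person in video",
    "what is the person doing in video",
    "who is in front of the person",
]
counts = count_ngrams(questions)
chunks = chunk_question(questions[0], counts)
print(chunks)  # ['what is', 'in front of', 'the person', 'in video']
```

On a real dataset `min_count` would need tuning. The greedy longest match can also pick a frequent but unnatural sequence, which is one reason to compare it against an existing chunker.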