This study addresses the scarcity of training text for new domains in voice search systems. Starting from the small amount of in-domain text available, we first extract semantic classes using a method based on co-occurrence probability. Evaluated by precision, recall, and F_1, this similarity measure outperforms the commonly used distance metric based on Kullback-Leibler divergence. The extracted semantic classes and the structure of the text are then used to generate sentence templates. Domain information is added to the templates, which are expanded into a large domain-specific corpus. Finally, the generated corpus is used for language model adaptation, which raises the speech recognition result (character accuracy) from 85.2% to 91%. The experiments show that the shortage of text in a voice search domain can be effectively addressed by generating domain-specific text from templates built on the extracted semantic classes.
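The template step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the semantic classes are assumed to be already extracted, and the seed sentences, class names, lexicon entries, and the helper functions `make_template` and `expand` are all hypothetical. Each seed sentence is turned into a template by replacing class members with slots, and every template is then expanded with a domain lexicon.

```python
from itertools import product

def make_template(tokens, class_of):
    """Replace every token that belongs to a semantic class with a <CLASS> slot."""
    return tuple(f"<{class_of[t]}>" if t in class_of else t for t in tokens)

def expand(template, lexicon):
    """Yield each sentence obtained by filling every slot with every class member."""
    choices = [
        lexicon[tok[1:-1]] if tok.startswith("<") and tok.endswith(">") else [tok]
        for tok in template
    ]
    for combo in product(*choices):
        yield " ".join(combo)

# Hypothetical seed sentences from the small in-domain text.
seed = [
    ["find", "restaurants", "in", "boston"],
    ["find", "hotels", "in", "chicago"],
]

# Hypothetical semantic classes, assumed to come from the extraction step.
classes = {"CITY": ["boston", "chicago"], "PLACE": ["restaurants", "hotels"]}
class_of = {word: name for name, words in classes.items() for word in words}

# Identical templates collapse into one entry of the set.
templates = {make_template(sentence, class_of) for sentence in seed}

# Hypothetical domain lexicon: the extracted classes plus new domain entries.
lexicon = {
    "CITY": ["boston", "chicago", "seattle"],
    "PLACE": ["restaurants", "hotels", "pharmacies"],
}

corpus = sorted(s for t in templates for s in expand(t, lexicon))
```

Here both seed sentences reduce to the same template, `find <PLACE> in <CITY>`, and the generated corpus contains one sentence per combination of lexicon entries. In a real system the generated sentences would then be fed to a language model toolkit for adaptation, a step this sketch does not cover.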