This paper proposes a method for resolving overlapping (intersection-type) ambiguous strings in Chinese automatic word segmentation. It uses two statistics computed between adjacent characters within a sentence: mutual information and the difference of t-test. Both are calculated from Chinese character bigrams, which are acquired automatically from a raw (unannotated) corpus. Preliminary experiments show that the method correctly resolves 90.3% of the overlapping ambiguous strings.
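The following minimal Python sketch illustrates how the two statistics could be estimated from character bigram counts. It is not the paper's implementation. The abstract does not say how mutual information and the t-test difference are combined, so the `resolve` helper below decides by mutual information alone, which is an assumption. The function names, the toy corpus, and the variance estimate (bigram count over squared unigram count, a common approximation) are also illustrative assumptions.

```python
import math
from collections import Counter

def train(lines):
    """Count characters and adjacent-character bigrams in raw text."""
    uni, bi = Counter(), Counter()
    for line in lines:
        uni.update(line)
        bi.update(zip(line, line[1:]))
    return uni, bi

def mutual_information(uni, bi, x, y):
    """log2(p(x,y) / (p(x) p(y))); -inf if the bigram was never seen."""
    if bi[(x, y)] == 0:
        return float("-inf")
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    p_xy = bi[(x, y)] / n_bi
    return math.log2(p_xy / ((uni[x] / n_uni) * (uni[y] / n_uni)))

def t_score(uni, bi, x, y, z):
    """t-test of y in context x y z: positive if y binds more to z than to x."""
    if uni[x] == 0 or uni[y] == 0:
        return 0.0
    p_z_given_y = bi[(y, z)] / uni[y]
    p_y_given_x = bi[(x, y)] / uni[x]
    # Assumed variance estimate: count(bigram) / count(first char)^2.
    var = bi[(y, z)] / uni[y] ** 2 + bi[(x, y)] / uni[x] ** 2
    return (p_z_given_y - p_y_given_x) / math.sqrt(var) if var > 0 else 0.0

def delta_t(uni, bi, v, x, y, w):
    """Difference of t-test for the pair x, y in context v x y w.
    Larger values suggest x and y belong together."""
    return t_score(uni, bi, v, x, y) - t_score(uni, bi, x, y, w)

def resolve(uni, bi, a, b, c):
    """Split the overlapping string abc (both ab and bc are words)
    at the weaker link. Assumption: decide by mutual information only."""
    if mutual_information(uni, bi, a, b) >= mutual_information(uni, bi, b, c):
        return (a + b, c)
    return (a, b + c)

# Toy corpus with placeholder characters, for illustration only.
corpus = ["ab"] * 8 + ["bc"] * 2 + ["c"] * 6
uni, bi = train(corpus)
split = resolve(uni, bi, "a", "b", "c")  # ("ab", "c") on this toy corpus
```

In the toy corpus, `a` and `b` co-occur far more often than `b` and `c`, so the sketch cuts after `ab`. A fuller version would presumably also consult `delta_t` when the mutual information values are close, but that combination rule is not given in the abstract.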