Based on a survey and analysis of a large number of Uyghur-language websites, this paper studies the identification of Uyghur web pages among multilingual mixed web pages, a task that plays a key role in Uyghur information processing. The paper first examines the rules and principles of character-encoding conversion for non-standard Uyghur web pages and uses them to normalize non-standard Uyghur characters. It then introduces two identification methods: one based on a modified N-Gram model, and one based on feature vectors of common Uyghur words, which combines a corpus of candidate common Uyghur words with the Vector Space Model (VSM). Three different types of Uyghur web page text serve as the data set, on which the proposed methods are validated and compared experimentally. The results show that the N-Gram-based method performs best on news and forum pages with longer body text, whereas the common-word feature-vector method outperforms N-Gram on pages with short text. Overall identification performance on Uyghur web pages exceeds 90%, confirming the effectiveness of both methods.
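The two scoring ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how the N-Gram method was modified or how the common-word vectors are weighted. The trigram size, the cosine-similarity scoring, the uniform reference vector, and all function names (`char_ngrams`, `ngram_score`, `common_word_score`) are assumptions made here for illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after normalizing whitespace.

    A single space is added at both ends so that word-boundary n-grams
    are captured. The choice n=3 is an assumption, not taken from the paper.
    """
    padded = " " + " ".join(text.split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def ngram_score(text, profile, n=3):
    """Similarity between a page's character n-gram counts and a language
    profile built with char_ngrams() from known text in that language."""
    return cosine(char_ngrams(text, n), profile)


def common_word_score(text, common_words):
    """Vector-space score over a common-word vocabulary.

    The page is represented by its counts of words found in common_words
    and compared with a uniform reference vector (an assumed weighting).
    """
    counts = Counter(w for w in text.split() if w in common_words)
    reference = Counter({w: 1 for w in common_words})
    return cosine(counts, reference)
```

In use, a profile would be built once from known Uyghur text with `char_ngrams`, and a page would be labeled Uyghur when its score passes a threshold tuned on held-out data. Very short pages yield few n-grams and therefore noisy profiles, which is one plausible reason for the abstract's finding that a common-word vector does better on short text.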