A Novel Approach to Improve the Mongolian Language Model using Intermediate Characters

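Source: The 15th China National Conference on Computational Linguistics (CCL2016) and the 4th International Symposium on Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD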
  In Mongolian, many words share the same presentation form but are distinct words with different codes. Because typists usually enter words according to their presentation forms and sometimes cannot distinguish the codes, Mongolian corpora contain many coding errors, which make statistics and retrieval on such corpora very difficult. To solve this problem, this paper proposes a method that merges words with the same presentation form by means of Intermediate characters and then uses the corpus in Intermediate-character form to build a Mongolian language model. Experimental results show that, compared with a model trained on the unprocessed corpus, the proposed method reduces the perplexity and the word error rate of the 3-gram language model by 41% and 30%, respectively. The proposed approach significantly improves the performance of the Mongolian language model and greatly enhances the accuracy of Mongolian speech recognition.
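
The merging step described in the abstract can be illustrated with a minimal sketch. It assumes a lookup table that maps every code in a group of identically rendered codes to one shared intermediate symbol. The table `INTERMEDIATE`, the helper names, and the Latin placeholder words are all hypothetical and stand in for real Mongolian code points; they are not the mapping used in the paper.

```python
from collections import Counter

# Hypothetical table: codes that look the same on the page map to one
# intermediate symbol. Latin placeholders only, NOT the paper's real mapping.
INTERMEDIATE = {"o": "O", "u": "O", "t": "T", "d": "T"}


def to_intermediate(word: str) -> str:
    """Rewrite a word code by code into its intermediate-character form."""
    return "".join(INTERMEDIATE.get(ch, ch) for ch in word)


def normalize_corpus(sentences):
    """Tokenize on whitespace and convert every token to intermediate form."""
    return [[to_intermediate(w) for w in s.split()] for s in sentences]


# Toy corpus in which the same visual word was typed with different codes.
corpus = ["ordu tala", "urdu tala", "ordo dala"]
normalized = normalize_corpus(corpus)

# Vocabulary before and after merging: variants collapse into one entry,
# so their counts are pooled instead of being split across spellings.
raw_vocab = Counter(w for s in corpus for w in s.split())
merged_vocab = Counter(w for s in normalized for w in s)
```

In this toy corpus the five distinct typed forms collapse into two intermediate forms. An n-gram model trained on the normalized tokens would therefore see pooled counts, which is one plausible reading of why the paper reports lower perplexity.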