The Unicode encoding of traditional Mongolian has a serious problem: several pairs of vowels that share the same glyphs but differ in pronunciation are assigned different code points. We illustrate the severity of the problem with examples from our Mongolian corpus and propose two ways to alleviate it: first, a publicly available Mongolian input method that helps users choose the correct encoding; second, a normalization method that addresses the data sparseness caused by the proliferation of homographs. Experiments on search engines and statistical machine translation show that our methods are effective.
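As a rough illustration of the normalization idea, the sketch below folds visually identical vowels onto a single representative code point, so that differently encoded spellings of the same written form collapse to one key. The mapping table (O/U and OE/UE) and the names `HOMOGRAPH_FOLD`, `normalize`, and `group_by_written_form` are assumptions made for this example; the abstract does not specify the actual rules used in the paper.

```python
# Assumed fold table: vowels that are drawn with the same glyphs
# are mapped to one representative code point. This is an illustrative
# guess, not the rule set described in the paper.
HOMOGRAPH_FOLD = {
    "\u1824": "\u1823",  # MONGOLIAN LETTER U  -> MONGOLIAN LETTER O
    "\u1826": "\u1825",  # MONGOLIAN LETTER UE -> MONGOLIAN LETTER OE
}
_FOLD_TABLE = str.maketrans(HOMOGRAPH_FOLD)


def normalize(text: str) -> str:
    """Replace each homographic vowel with its representative code point."""
    return text.translate(_FOLD_TABLE)


def group_by_written_form(words):
    """Group encoded spellings that look identical once normalized."""
    groups = {}
    for word in words:
        groups.setdefault(normalize(word), set()).add(word)
    return groups


# Two encodings of a hypothetical word that differ only in O vs. U.
spelling_a = "\u182A\u1823\u182F"  # BA + O + LA
spelling_b = "\u182A\u1824\u182F"  # BA + U + LA
groups = group_by_written_form([spelling_a, spelling_b])
```

Under this assumption, a search index or a translation model keyed on `normalize(word)` would treat both spellings as one entry, which is the kind of sparseness reduction the abstract describes.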