This paper examines how words are used in the natural-language queries that users formulate during information retrieval. Using a corpus of query statements that describe information needs at varying levels of granularity, with a general Chinese corpus as a control, it draws on statistics such as word frequency and the text coverage of words. According to whether a word needs to appear in the target text, either directly or in some other form, the words in query statements are divided into three classes: general stop words, which are of general significance for Chinese text processing; special stop words, which serve only to phrase retrieval requests; and information content words, which relate to the specific need. Distinguishing these kinds of word use adds a stripping step to natural-language query processing at the front end of an information system, preventing the loss of efficiency and accuracy that results from using every token of the segmented query as a search term.
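The three-way division described above can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes that "text coverage" means the fraction of texts in which a word occurs at least once, that the input is already segmented into tokens, and that a single cutoff (the hypothetical `threshold` parameter) separates the classes. The function names, labels, and toy data are all invented for illustration.

```python
from collections import Counter

def coverage(texts):
    """Fraction of texts in which each word occurs at least once (assumed definition)."""
    doc_freq = Counter()
    for tokens in texts:
        doc_freq.update(set(tokens))  # count each word once per text
    n = len(texts)
    return {word: count / n for word, count in doc_freq.items()}

def classify_terms(query_texts, general_texts, threshold=0.5):
    """Label every word seen in the query corpus.

    Heuristic (an assumption, not the paper's rule):
    - high coverage in the general corpus      -> general stop word
    - otherwise high coverage in query corpus  -> retrieval-specific stop word
    - otherwise                                -> information content word
    """
    q_cov = coverage(query_texts)
    g_cov = coverage(general_texts)
    labels = {}
    for word, cov in q_cov.items():
        if g_cov.get(word, 0.0) >= threshold:
            labels[word] = "general_stop"
        elif cov >= threshold:
            labels[word] = "retrieval_stop"
        else:
            labels[word] = "content"
    return labels

def strip_query(tokens, labels):
    """Keep only content words; unseen words are treated as content."""
    return [t for t in tokens if labels.get(t, "content") == "content"]

# Toy, pre-segmented data standing in for the two corpora.
QUERY_TEXTS = [
    ["please", "find", "papers", "about", "the", "stop", "words"],
    ["find", "the", "documents", "about", "query", "segmentation"],
    ["please", "find", "the", "reports", "about", "corpus", "statistics"],
]
GENERAL_TEXTS = [
    ["the", "weather", "is", "fine"],
    ["the", "train", "left", "early"],
    ["she", "read", "the", "book"],
]

labels = classify_terms(QUERY_TEXTS, GENERAL_TEXTS, threshold=0.5)
print(strip_query(QUERY_TEXTS[0], labels))  # ['papers', 'stop', 'words']
```

In this toy run, "the" is filtered as a general stop word, while "please", "find", and "about" are filtered as retrieval-specific stop words because they are common only in the query corpus. In practice the cutoff would need tuning against real corpora, and the segmentation step would come from a Chinese word segmenter.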