Microblog text carries many types of geographic event information, which can compensate for the limitations of traditional fixed-site monitoring and improve the quality of emergency response. However, because large-scale annotated corpora are generally scarce, supervised learning cannot readily be used to identify microblog texts that contain geographic event information. This paper therefore proposes a method for automatically identifying microblog messages that contain geographic events, using rapidly acquired corpus resources to improve identification. The method exploits the ability of topic models to extract the set of topics in a document: candidate corpus texts are filtered by topic, so that a geographic event corpus is extracted automatically. In addition, a distributed-representation word vector model is introduced into the event relevance calculation, and the semantic information implicit in the word vectors enriches the context of short microblog texts, further improving the identification of event messages. Experiments with Sina Weibo as the data source show that the proposed method achieves an F1 score of up to 71.41% when identifying message texts drawn from event-related Weibo topics, 10.79% higher than a classic supervised learning method based on an SVM model. On a dataset of 5 million microblog posts that simulates a real Weibo environment, the identification accuracy reaches 60%.
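The abstract does not state how event relevance is computed from the word vectors. As a minimal sketch of one common choice, the Python below scores a message by the cosine similarity between its averaged word vector and the averaged vector of a few event seed words. The function names, the seed words, and the three-dimensional toy vectors are all hypothetical and chosen for illustration; they are not the authors' formula or data. In practice the vectors would come from a trained word embedding model.

```python
import math

# Hypothetical toy word vectors (3 dimensions) standing in for a trained
# distributed word-vector model. The values are made up for illustration.
TOY_VECTORS = {
    "earthquake": [0.9, 0.1, 0.0],
    "tremor": [0.8, 0.2, 0.1],
    "collapsed": [0.7, 0.3, 0.0],
    "lunch": [0.0, 0.1, 0.9],
    "noodles": [0.1, 0.0, 0.8],
}


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def event_relevance(message_tokens, seed_tokens, vectors):
    """Assumed relevance score: cosine of the two averaged vectors."""
    message_vec = mean_vector(message_tokens, vectors)
    seed_vec = mean_vector(seed_tokens, vectors)
    if message_vec is None or seed_vec is None:
        return 0.0  # no token has a vector, so nothing can be compared
    return cosine(message_vec, seed_vec)


if __name__ == "__main__":
    seeds = ["earthquake"]
    print(event_relevance(["tremor", "collapsed"], seeds, TOY_VECTORS))
    print(event_relevance(["lunch", "noodles"], seeds, TOY_VECTORS))
```

With these toy vectors, the message that mentions a tremor and a collapse scores much higher against the seed word than the message about lunch, even though neither message contains the seed word itself. This is how word vectors can supply context that a short text lacks.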