This paper studies a method for extracting bibliographic information from book web pages. The method first uses LDA (Latent Dirichlet Allocation) to model the mixed text formed by combining the book title with each descriptive paragraph, and then computes the similarity between the title and each paragraph to extract the bibliographic information. This avoids a weakness of traditional methods, which do not reflect the similarity between documents well. Experiments show that the model achieves an extraction accuracy of 87.4% on bibliographic information from book web pages, a significant improvement over traditional methods. The work also lays a foundation for research on the organization, management, and automatic classification of book web page information.
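The two steps described above (topic modeling, then title-to-paragraph similarity) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state how LDA is fitted, which similarity measure is used, or how the text is tokenized. The sketch assumes a small collapsed Gibbs sampler, cosine similarity over per-document topic distributions, and whitespace tokenization (Chinese text would need a word segmenter first). The function names `lda_gibbs`, `cosine`, and `rank_paragraphs`, and all hyperparameter values, are hypothetical.

```python
import math
import random


def lda_gibbs(docs, num_topics=2, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Fit LDA with a tiny collapsed Gibbs sampler.

    docs is a list of token lists; returns one topic distribution per document.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]      # count of topic k in document d
    topic_word = [dict() for _ in range(num_topics)]  # count of word w under topic k
    topic_total = [0] * num_topics                    # total tokens assigned to topic k
    assign = []                                       # current topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    # Smoothed topic proportions for each document.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


def cosine(p, q):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rank_paragraphs(title, paragraphs, **lda_kwargs):
    """Model title and paragraphs together, then rank paragraphs by similarity to the title."""
    docs = [title.lower().split()] + [p.lower().split() for p in paragraphs]
    thetas = lda_gibbs(docs, **lda_kwargs)
    scores = [cosine(thetas[0], theta) for theta in thetas[1:]]
    return sorted(zip(scores, paragraphs), reverse=True)


if __name__ == "__main__":
    # Illustrative, made-up page content.
    title = "python programming guide"
    paragraphs = [
        "this guide teaches python programming step by step",
        "free shipping on all orders placed today",
    ]
    for score, paragraph in rank_paragraphs(title, paragraphs):
        print(round(score, 3), paragraph)
```

In a setting like the one the abstract describes, the highest-ranked paragraphs (or those above a chosen threshold) would be kept as bibliographic information. On very short texts such as a single title, topic estimates from a sampler like this are noisy, so the ranking on toy input should be read as illustrative only.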