Semi-Supervised Learning in Large Scale Text Categorization

Source: Journal of Shanghai Jiao Tong University (English Edition)
The rapid development of the Internet has brought a wide variety of original information, including text and audio. However, the sheer volume of this information makes it difficult to find the most useful knowledge quickly and accurately. Automatic text classification based on machine learning can assign large numbers of natural language documents to their corresponding subject categories according to their semantics, which helps readers grasp the text information directly. A traditional supervised classifier for text categorization (TC) is obtained by learning from a set of hand-labeled documents. However, labeling all data by hand is labor intensive and time consuming. To address this problem, some scholars have proposed semi-supervised learning methods for training the classifier, but these are infeasible for the great variety and volume of Web data because they still require some hand-labeled data. In 2012, Li et al. proposed a fully automatic categorization approach for text (FACT) based on supervised learning, which requires no manual labeling effort. But automatically labeling all data introduces noise, so the results cannot meet the accuracy requirement. We put forward a new idea: a portion of the data is automatically labeled with high accuracy based on the semantics of the category name, and a semi-supervised method is then used to train the classifier with both labeled and unlabeled data, ultimately achieving precise classification of massive text data. Empirical experiments show that the method outperforms the supervised support vector machine (SVM) in terms of both F1 performance and classification accuracy in most cases, which demonstrates the effectiveness of the semi-supervised algorithm in automatic TC.
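The two-stage idea in the abstract (automatic seed labeling from category names, followed by semi-supervised training) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: seed labeling is reduced to exact matching of the category name in the document, the semi-supervised step is plain self-training, and a small multinomial Naive Bayes classifier stands in for the SVM-based models the abstract refers to. The function names, the `threshold` parameter, and the toy documents are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def seed_label(docs, categories):
    """Auto-label a document only when exactly one category name occurs in it."""
    labeled, unlabeled = {}, []
    for i, doc in enumerate(docs):
        tokens = set(tokenize(doc))
        hits = [c for c in categories if c in tokens]
        if len(hits) == 1:
            labeled[i] = hits[0]
        else:
            unlabeled.append(i)
    return labeled, unlabeled

def train_nb(docs, labeled):
    """Collect the counts needed for multinomial Naive Bayes."""
    word_counts, class_counts, vocab = defaultdict(Counter), Counter(), set()
    for i, c in labeled.items():
        tokens = tokenize(docs[i])
        word_counts[c].update(tokens)
        class_counts[c] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict(model, doc):
    """Return (best label, posterior probability of that label)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for c in class_counts:
        denom = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
        score = math.log(class_counts[c] / total)
        for t in tokenize(doc):
            if t in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][t] + 1) / denom)
        scores[c] = score
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())

def self_train(docs, categories, threshold=0.6, max_rounds=5):
    """Self-training: repeatedly absorb confident predictions as new labels."""
    labeled, unlabeled = seed_label(docs, categories)
    for _ in range(max_rounds):
        model = train_nb(docs, labeled)
        newly = {}
        for i in unlabeled:
            label, prob = predict(model, docs[i])
            if prob >= threshold:
                newly[i] = label
        if not newly:
            break
        labeled.update(newly)
        unlabeled = [i for i in unlabeled if i not in newly]
    return train_nb(docs, labeled), labeled

# Toy corpus (hypothetical): only the first two documents contain a category name.
docs = [
    "sports football match goal team",
    "finance bank stock market shares",
    "football team wins the match",
    "stock market falls as bank shares drop",
]
categories = ["sports", "finance"]
model, labeled = self_train(docs, categories)
print(labeled)
print(predict(model, "goal for the team"))
```

The confidence threshold controls the trade-off the abstract describes: a higher value admits fewer but cleaner pseudo-labels, while a lower value labels more data at the cost of more noise.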