A machine learning approach to query generation in plagiarism source retrieval

Source: Frontiers of Information Technology & Electronic Engineering
Plagiarism source retrieval is the core task of plagiarism detection. The standard approach is to extract queries from a suspicious document and use them to retrieve the plagiarism sources, which makes query generation one of the most important steps in source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic has its own advantages, and none statistically outperforms the others on all suspicious document segments. Further improvements to heuristic methods rely mainly on expert experience, which makes it difficult to propose new heuristics that overcome the shortcomings of existing ones. This paper proposes a statistical machine learning approach that selects the best queries from a set of candidates. Query generation for source retrieval is formulated as a ranking problem that aims to achieve the optimal source retrieval performance for each suspicious document segment, and the proposed method uses learning to rank to choose queries from the candidates. To our knowledge, this is the first work to apply machine learning to query generation for source retrieval. Because no training data exist for learning to rank in this setting, we also construct training samples for source retrieval. We evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that the proposed machine-learning-based query generation method yields statistically significant improvements in source retrieval effectiveness over the established baselines.
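The idea of ranking candidate queries per suspicious segment can be illustrated with a minimal sketch. It is not the paper's method: the abstract does not name the learning-to-rank algorithm or the features, so the sketch assumes a simple perceptron-style pairwise ranker, two hypothetical features per candidate query, and a hypothetical label equal to the retrieval effectiveness that the candidate achieved on its segment.

```python
from typing import List, Sequence, Tuple

# One candidate query: (feature vector, observed retrieval effectiveness).
# Both the features and the label definition are assumptions for illustration.
Candidate = Tuple[Sequence[float], float]


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_ranker(
    segments: List[List[Candidate]], epochs: int = 50, lr: float = 0.1
) -> List[float]:
    """Learn weights w so that, within each segment, a candidate with a
    higher label scores above a candidate with a lower label."""
    w = [0.0] * len(segments[0][0][0])
    for _ in range(epochs):
        for candidates in segments:
            for xa, ya in candidates:
                for xb, yb in candidates:
                    if ya <= yb:
                        continue  # only pairs where a is preferred over b
                    diff = [a - b for a, b in zip(xa, xb)]
                    if dot(w, diff) <= 0:  # pair is mis-ordered: update
                        w = [wi + lr * d for wi, d in zip(w, diff)]
    return w


def select_query(w: Sequence[float], features: List[Sequence[float]]) -> int:
    """Return the index of the highest-scoring candidate for one segment."""
    scores = [dot(w, x) for x in features]
    return max(range(len(scores)), key=scores.__getitem__)
```

Pairs are formed only within a segment because the goal stated in the abstract is the best retrieval performance for each suspicious document segment, not a global ordering of all queries. At prediction time, `select_query` is applied to the candidates that the heuristic generators produce for a new segment.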