Can Automatic Classification Help to Increase Accuracy in Data Collection?

Source: Journal of Data and Information Science | Citations: 0
Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.

Design/methodology/approach: The paper is centered on cleaning datasets gathered from publishers and online resources by the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested against manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme, and we tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.

Findings: We found that the performance of the algorithms varies with the size of the training sample. For the classification exercise in this paper, the best-performing algorithms were SVM and Boosting. The combination of these two algorithms achieved high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.

Research limitations: The dataset gathered has significantly more records related to the topic of interest than unrelated records. This may affect the performance of some algorithms, especially in their identification of unrelated papers.
Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is big. With accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which opens the possibility of incorporating these algorithms into software specifically designed for data cleaning and classification.

Originality/value: We analyzed the performance of seven algorithms and whether combinations of these algorithms improve accuracy in data collection. Use of these algorithms could reduce the time needed for manual data cleaning.
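The voting scheme and coverage indicator described in the abstract can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the classifier predictions below are hypothetical stand-ins for the outputs of two trained models (e.g. SVM and Boosting), and the function names are our own. The idea is that a record receives a label only when both classifiers agree; coverage is the fraction of records that receive a label, and accuracy is then measured over the covered records only.

```python
from typing import List, Optional

def vote(pred_a: List[str], pred_b: List[str]) -> List[Optional[str]]:
    """Combine two classifiers: keep a label only where both agree."""
    return [a if a == b else None for a, b in zip(pred_a, pred_b)]

def coverage(combined: List[Optional[str]]) -> float:
    """Fraction of records on which the pair agreed (and thus got a label)."""
    return sum(p is not None for p in combined) / len(combined)

def accuracy_on_covered(combined: List[Optional[str]],
                        truth: List[str]) -> float:
    """Accuracy computed only over the records the ensemble labels."""
    covered = [(p, t) for p, t in zip(combined, truth) if p is not None]
    if not covered:
        return 0.0
    return sum(p == t for p, t in covered) / len(covered)

# Hypothetical predictions from two classifiers for five records
svm_pred   = ["related", "related",   "unrelated", "related", "unrelated"]
boost_pred = ["related", "unrelated", "unrelated", "related", "unrelated"]
truth      = ["related", "related",   "unrelated", "related", "related"]

combined = vote(svm_pred, boost_pred)
print(coverage(combined))                     # 0.8 (4 of 5 records agreed)
print(accuracy_on_covered(combined, truth))   # 0.75
```

Records where the two classifiers disagree (the uncovered remainder) are the ones that would still be sent to manual coding, which is how such a combination reduces, rather than eliminates, manual cleaning work.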