Can Automatic Classification Help to Increase Accuracy in Data Collection?

Source: Journal of Data and Information Science | Citations: 0
Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.

Design/methodology/approach: The paper is centered on cleaning datasets gathered from publishers and online resources by the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested against manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme, and we tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.

Findings: We found that the performance of the algorithms varies with the size of the training sample. For the classification exercise in this paper, the best-performing algorithms were SVM and Boosting. The combination of these two algorithms achieved high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.

Research limitations: The dataset gathered has significantly more records related to the topic of interest than unrelated records. This may affect the performance of some algorithms, especially in their identification of unrelated papers.
Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is big. With accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which opens the possibility of incorporating these algorithms into software specifically designed for data cleaning and classification.

Originality/value: We analyzed the performance of seven algorithms and whether combinations of these algorithms improve accuracy in data collection. Use of these algorithms could reduce the time needed for manual data cleaning.
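The voting scheme and coverage indicator described in the abstract can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: the classifier predictions below are hypothetical stand-ins for the outputs of two trained models (e.g. SVM and Boosting), and the function names are our own. The idea is that a record receives a label only when both classifiers agree; coverage is the fraction of records that receive a label, and accuracy is then measured over the covered records only.

```python
from typing import List, Optional

def vote(pred_a: List[str], pred_b: List[str]) -> List[Optional[str]]:
    """Combine two classifiers: keep a label only where both agree."""
    return [a if a == b else None for a, b in zip(pred_a, pred_b)]

def coverage(combined: List[Optional[str]]) -> float:
    """Fraction of records on which the pair agreed (and thus got a label)."""
    return sum(p is not None for p in combined) / len(combined)

def accuracy_on_covered(combined: List[Optional[str]],
                        truth: List[str]) -> float:
    """Accuracy computed only over the records the ensemble labels."""
    covered = [(p, t) for p, t in zip(combined, truth) if p is not None]
    if not covered:
        return 0.0
    return sum(p == t for p, t in covered) / len(covered)

# Hypothetical predictions from two classifiers for five records
svm_pred   = ["related", "related",   "unrelated", "related", "unrelated"]
boost_pred = ["related", "unrelated", "unrelated", "related", "unrelated"]
truth      = ["related", "related",   "unrelated", "related", "related"]

combined = vote(svm_pred, boost_pred)
print(coverage(combined))                     # 0.8 (4 of 5 records agreed)
print(accuracy_on_covered(combined, truth))   # 0.75
```

Records where the two classifiers disagree (the uncovered remainder) are the ones that would still be sent to manual coding, which is how such a combination reduces, rather than eliminates, manual cleaning work.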