GA-iForest: An Efficient Isolated Forest Framework Based on Genetic Algorithm for Numerical Data Outl

Source: Journal of Nanjing University of Aeronautics and Astronautics (English Edition)
As data volumes grow, data quality has become a major concern. Outlier detection, a branch of data mining, bears directly on data quality. The isolation forest (iForest) algorithm is one of the more prominent outlier detection algorithms for numerical data in recent years. However, as iForest keeps generating isolation trees, the differences between trees gradually shrink and may vanish altogether, which wastes memory and lowers detection efficiency. In addition, some of the constructed isolation trees cannot detect outliers at all. This paper proposes GA-iForest, an improved iForest-based method. It optimizes the forest by selecting the better isolation trees according to their detection accuracy and their difference from one another, thereby discarding duplicate, similar, and poorly performing trees and improving the accuracy and stability of outlier detection. The experimental environment is built on Ubuntu and the Spark platform, and the outlier datasets provided by ODDS are used for testing. Performance is evaluated in terms of accuracy, recall, ROC curves, AUC, and execution time. The experimental results show that the proposed method not only improves the accuracy and stability of outlier detection but also reduces the number of isolation trees by 20%-40% compared with the original iForest method.
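The tree-selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the fitness function, the genetic operators, or any parameters, so everything below is an assumption made for illustration. That includes the `fitness` formula (ensemble accuracy plus a weighted pairwise-disagreement term), the tournament/crossover/mutation scheme, the toy 2-D data, and the use of raw path length without the usual iForest normalization.

```python
import random

def build_tree(points, depth, max_depth, rng):
    """Grow one isolation tree with random axis-parallel splits."""
    if len(points) <= 1 or depth >= max_depth:
        return None  # leaf
    f = rng.randrange(len(points[0]))
    lo, hi = min(p[f] for p in points), max(p[f] for p in points)
    if lo == hi:
        return None  # cannot split on a constant feature
    s = rng.uniform(lo, hi)
    left = [p for p in points if p[f] < s]
    right = [p for p in points if p[f] >= s]
    return (f, s, build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(node, x):
    """Splits passed before x reaches a leaf (shorter = more anomalous)."""
    depth = 0
    while node is not None:
        f, s, left, right = node
        node = left if x[f] < s else right
        depth += 1
    return depth

def flag_outliers(depths, k):
    """Indices of the k points with the shortest path length."""
    return set(sorted(range(len(depths)), key=lambda i: depths[i])[:k])

def fitness(mask, depth_table, labels, k, w_div=0.2):
    """Assumed fitness: ensemble accuracy + w_div * mean tree disagreement."""
    chosen = [d for d, bit in zip(depth_table, mask) if bit]
    if not chosen:
        return -1.0  # an empty forest is never acceptable
    n = len(labels)
    avg = [sum(d[i] for d in chosen) / len(chosen) for i in range(n)]
    flagged = flag_outliers(avg, k)
    acc = sum((i in flagged) == bool(labels[i]) for i in range(n)) / n
    flags = [flag_outliers(d, k) for d in chosen]
    pairs = [(a, b) for i, a in enumerate(flags) for b in flags[i + 1:]]
    div = sum(len(a ^ b) for a, b in pairs) / (2 * k * len(pairs)) if pairs else 0.0
    return acc + w_div * div

def select_trees(depth_table, labels, k, rng, pop_size=10, generations=10, p_mut=0.05):
    """Simple GA over bitmasks; bit i == 1 means tree i is kept."""
    n_trees = len(depth_table)
    score = lambda m: fitness(m, depth_table, labels, k)
    # Seed the population with the full forest so the result is never worse than it.
    pop = [[1] * n_trees] + [[rng.randint(0, 1) for _ in range(n_trees)]
                             for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        nxt = pop[:2]  # elitism: carry the two best masks over unchanged
        while len(nxt) < pop_size:
            a = max(rng.sample(pop, 3), key=score)  # tournament selection
            b = max(rng.sample(pop, 3), key=score)
            cut = rng.randrange(1, n_trees)         # one-point crossover
            nxt.append([bit ^ (rng.random() < p_mut) for bit in a[:cut] + b[cut:]])
        pop = nxt
    return max(pop, key=score)

# Toy data (hypothetical): 60 inliers near the origin, 5 outliers far away.
rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
data += [(rng.uniform(5, 8), rng.uniform(5, 8)) for _ in range(5)]
labels = [0] * 60 + [1] * 5
k = sum(labels)
trees = [build_tree(rng.sample(data, 32), 0, 6, rng) for _ in range(12)]
depth_table = [[path_length(t, x) for x in data] for t in trees]
best = select_trees(depth_table, labels, k, rng)
print(sum(best), "of", len(trees), "trees kept")
```

Because the initial population contains the full forest and the best masks are carried over each generation, the selected subset scores at least as well as the unpruned forest under this assumed fitness. Whether it reproduces the 20%-40% reduction reported in the abstract depends on the data and on the paper's actual fitness definition.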