Many real-world applications involve continuous numerical attributes, yet many current machine learning algorithms require the attributes they process to take discrete values. Based on fundamental principles of information theory, this paper proposes a new supervised discretization algorithm, WILD. The algorithm can be viewed as an extension of decision-tree discretization; its main improvement is to take into account the frequency with which observations occur within each interval, using weighted information loss as the measure for interval discretization, thereby overcoming the unbalanced discretization problem of decision-tree algorithms. The algorithm naturally adopts a bottom-up interval merging scheme that can merge multiple adjacent intervals at once, which helps to speed up discretization. Experimental results show that the algorithm can improve the accuracy of machine learning algorithms.
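The bottom-up merging scheme described above can be sketched as follows. The abstract does not give the exact weighted-information-loss formula, so the `merge_cost` below is an assumption: the class entropy of the merged interval weighted by its observation count, minus the weighted entropies of the two parts. For simplicity the sketch also merges one adjacent pair per step, whereas the paper's algorithm can merge several adjacent intervals simultaneously.

```python
# A minimal sketch of supervised, bottom-up interval merging driven by
# a weighted information-loss measure. The cost formula is an assumed
# stand-in for the paper's WILD measure, not the published definition.
from math import log2
from collections import Counter

def entropy(labels):
    """Class entropy of the labels falling in one interval."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def merge_cost(a, b):
    # Assumed weighted information loss: the entropy increase caused by
    # merging, weighted by the number of observations (frequency) involved.
    merged = a + b
    return (len(merged) * entropy(merged)
            - len(a) * entropy(a)
            - len(b) * entropy(b))

def discretize(values, labels, n_intervals):
    """Return cut points partitioning `values` into `n_intervals` bins."""
    # Bottom-up: start with one interval per distinct attribute value.
    pairs = sorted(zip(values, labels))
    intervals = []  # list of (left boundary value, labels in interval)
    for v, y in pairs:
        if intervals and intervals[-1][0] == v:
            intervals[-1][1].append(y)
        else:
            intervals.append((v, [y]))
    # Greedily merge the adjacent pair with the smallest weighted loss.
    while len(intervals) > n_intervals:
        costs = [merge_cost(intervals[i][1], intervals[i + 1][1])
                 for i in range(len(intervals) - 1)]
        i = costs.index(min(costs))
        intervals[i] = (intervals[i][0],
                        intervals[i][1] + intervals[i + 1][1])
        del intervals[i + 1]
    # Cut points are the left boundaries of all intervals but the first.
    return [iv[0] for iv in intervals[1:]]
```

Because merging two pure same-class intervals costs zero entropy while merging across a class boundary is penalized in proportion to how many observations it affects, the cheapest merges happen inside class-homogeneous regions and the surviving cut points fall near class boundaries.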