Feature extraction is the main factor determining the quality of classification results. Traditional feature extraction methods have several shortcomings: when the category and feature distributions are highly imbalanced, they cannot handle low-frequency words effectively, and improper handling of individual features leads to locally optimal solutions. To address these problems, a feature extraction algorithm based on the χ² statistic and a genetic algorithm is proposed. The method introduces the χ² statistic of each term into the feature vectors and uses these vectors as the initial population of a genetic algorithm that performs a heuristic search. In addition, a new fitness function and a new crossover rule are proposed to suit the nature of feature extraction. Experiments show that the χ² statistic-genetic algorithm method selects feature terms that accurately characterize text categories, and that applying it in a text classification system effectively improves classification accuracy.
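As a rough illustration of the first step described above, the sketch below computes the standard χ² statistic of a term with respect to a category from a 2×2 document-count table, then seeds a genetic-algorithm population of bit masks from those scores. The abstract does not give the paper's encoding, fitness function, or crossover rule, so the function names and the seeding rule (each term is switched on with probability proportional to its score) are assumptions made for illustration only, not the authors' method.

```python
import random


def chi2_term_category(a, b, c, d):
    """Chi-square statistic of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A degenerate table carries no association information.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def seed_population(scores, pop_size, rng):
    """Hypothetical seeding rule (an assumption, not from the paper).

    Each individual is a bit mask over the terms; a term is switched on
    with probability score / max(scores), so high-chi-square terms are
    favoured in the initial population.
    """
    top = max(scores) or 1.0
    return [
        [1 if rng.random() < s / top else 0 for s in scores]
        for _ in range(pop_size)
    ]


if __name__ == "__main__":
    # Toy counts for three terms: (a, b, c, d) per term.
    tables = [(10, 0, 0, 10), (5, 5, 5, 5), (8, 2, 2, 8)]
    scores = [chi2_term_category(*t) for t in tables]
    population = seed_population(scores, pop_size=4, rng=random.Random(0))
    print(scores)
    print(population)
```

Seeding from χ² scores rather than uniformly at random gives the search a head start on informative terms, while later selection and crossover can still bring back terms the statistic undervalues, such as low-frequency words.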