Objective: To examine how the classification accuracy of statistical classification methods changes as the number of normally distributed variables in metabolomics data gradually increases. Methods: Eleven metabolomics data sets were first simulated, with the number of normally distributed variables increasing from one data set to the next. The data were then analyzed with traditional non-machine-learning statistical methods [Bayes discriminant analysis, Fisher discriminant analysis, and partial least squares discriminant analysis (PLS-DA)] and with machine learning methods [random forest (RF) and support vector machine (SVM)], and the changes in classification accuracy were compared. Finally, two case examples were analyzed to assess whether the simulation results were reasonable. Results: The normality of metabolomics data had a considerable effect on the results of Bayes discriminant analysis, Fisher discriminant analysis, and PLS-DA: their classification accuracy rose as the number of normally distributed variables increased. Normality had essentially no effect on RF or SVM. Conclusion: Traditional non-machine-learning methods place some requirements on data normality during statistical analysis, whereas machine learning methods have essentially no such requirement and maintain consistently high, stable classification accuracy.
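The simulation design described in the Methods can be sketched roughly as follows. This is only an illustrative sketch, not the authors' procedure: the abstract does not state the number of variables, the sample sizes, the class separation, or how the non-normal variables were generated. The function name `simulate_dataset`, the choice of 10 variables, 50 samples per class, a unit mean shift, and the log-normal distribution for the non-normal variables are all assumptions made here for illustration.

```python
import math
import random


def simulate_dataset(n_normal, n_vars=10, n_per_class=50, shift=1.0, seed=0):
    """Simulate a two-class data set (hypothetical design, not from the paper).

    The first `n_normal` variables are normally distributed; the remaining
    ones are log-normal (right-skewed). All sizes and distributions are
    assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        # Class 1 is shifted by `shift` on the underlying Gaussian scale.
        mu = shift * label
        for _ in range(n_per_class):
            row = []
            for j in range(n_vars):
                z = rng.gauss(mu, 1.0)
                # Normal variables are kept as drawn; the others are
                # exponentiated, which makes them log-normal.
                row.append(z if j < n_normal else math.exp(z))
            X.append(row)
            y.append(label)
    return X, y


# Eleven data sets with 0, 1, ..., 10 normally distributed variables.
# The same seed is reused, so the data sets differ only in which
# variables are transformed.
datasets = [simulate_dataset(k) for k in range(11)]
```

Each `(X, y)` pair could then be passed to the classifiers named in the abstract (for example through a cross-validation loop) to trace accuracy against the number of normal variables. Reusing one seed across the eleven data sets is a design choice of this sketch: it keeps the underlying draws fixed so that any accuracy change reflects the distributional shape alone.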