Discovering new oncogenes is one of the main goals of cancer research. Bioinformatics methods can help accelerate the discovery of oncogenes, clarify the mechanisms by which cancer arises, and identify drug targets. By integrating network properties, sequence features, and functional annotation information, we built a classifier for predicting potential oncogenes. Testing showed that 55 features differed significantly between oncogenes and non-oncogenes. Fourteen cancer-related features were used to train the classifier. Four machine learning methods (logistic regression, support vector machines, Bayesian networks, and decision trees) were explored to distinguish oncogenes from non-oncogenes. The effectiveness of the models was assessed by 5-fold cross-validation, and the areas under the ROC curve for the four methods were 0.834, 0.740, 0.800, and 0.782, respectively. Finally, the logistic regression classifier based on multiple biological features was applied to the genes in the Entrez database and identified 1976 potential oncogenes. The study found that the integrated prediction method outperforms prediction models based on a single type of evidence, and that network features and functional annotation information have stronger predictive power than sequence features.
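The evaluation described above (5-fold cross-validation scored by area under the ROC curve) can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not name the 14 features or the software used, so the data here are synthetic, the logistic regression is a plain gradient-descent stand-in, and every function name (`roc_auc`, `fit_logistic`, `cross_validated_auc`, `make_toy_data`) is hypothetical.

```python
import math
import random

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def predict(w, b, x):
    """Logistic model probability; z is clipped to avoid overflow in exp."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def cross_validated_auc(X, y, k=5, seed=0):
    """Mean AUC over k stratified folds (each fold keeps both classes)."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = [i for i, yi in enumerate(y) if yi == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    folds = [pos[i::k] + neg[i::k] for i in range(k)]
    aucs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = [predict(w, b, X[i]) for i in held_out]
        aucs.append(roc_auc([y[i] for i in held_out], scores))
    return sum(aucs) / k

def make_toy_data(n=200, seed=1):
    """Synthetic stand-in for gene features: one informative column, one noise column."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # 1 = "oncogene", 0 = "non-oncogene" (toy labels only)
        X.append([rng.gauss(1.5 * label, 1.0), rng.gauss(0.0, 1.0)])
        y.append(label)
    return X, y

X, y = make_toy_data()
mean_auc = cross_validated_auc(X, y)
print(f"mean 5-fold AUC: {mean_auc:.3f}")
```

To compare several methods as the study does, the same folds would be reused with each model in place of `fit_logistic` and `predict`, and the resulting mean AUC values compared side by side.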