Objective: To study the factors that influence missing-value imputation of single nucleotide polymorphism (SNP) data and the resulting imputation accuracy, in order to provide a scientific basis for gene-disease association studies that use SNP data. Methods: Using data from the International HapMap Project as the source data, simulated SNP genotype data were generated with the HAPGEN2 software. Missing data were then created artificially and imputed, and the imputation error rate was analyzed under different conditions (four levels of missing proportion and four levels of reference sample size). Results: The smaller the proportion of missing data and the larger the reference sample, the lower the imputation error rate (mean error rates of 7.01%, 5.92%, 5.67% and 5.26% for reference sample sizes of 50, 100, 150 and 200, respectively). Of the two missingness patterns, when the missing proportion was larger (r² = 0.825), random missingness gave a lower error rate (mean 5.64%) than fixed-site missingness (mean 9.10%); when the missing proportion was smaller (r² = 0.9), the fixed-site pattern gave the lower error rate (mean 4.96%). Across all conditions, the imputation error rate of IMPUTE2 ranged from 3% to 13%. Conclusion: The missing proportion, the reference sample size and the missingness pattern all have some influence on the accuracy of missing-data imputation. Imputing the missing values in tag SNP data before further analysis is an effective strategy.
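The evaluation described in the Methods (hide known genotypes, impute them, count mismatches) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' pipeline: the 0/1/2 genotype coding, the function names and the two masking helpers are assumptions made here, and the imputation step itself (IMPUTE2 in the study) is left outside the code.

```python
import numpy as np

# Assumption: genotypes are coded 0/1/2 (copies of one allele) in a
# samples x SNPs integer matrix. All function names are hypothetical.

def mask_random(genotypes, missing_rate, rng):
    """Random missingness: hide each genotype independently."""
    return rng.random(genotypes.shape) < missing_rate

def mask_fixed_sites(genotypes, missing_rate, rng):
    """Fixed-site missingness: hide whole SNP columns for every sample."""
    n_sites = genotypes.shape[1]
    n_hidden = int(round(missing_rate * n_sites))
    hidden = rng.choice(n_sites, size=n_hidden, replace=False)
    mask = np.zeros(genotypes.shape, dtype=bool)
    mask[:, hidden] = True
    return mask

def imputation_error_rate(truth, imputed, mask):
    """Share of hidden genotypes whose imputed value differs from the truth."""
    n_masked = int(mask.sum())
    if n_masked == 0:
        raise ValueError("mask hides no genotypes")
    return float((truth[mask] != imputed[mask]).sum()) / n_masked

# Usage: `imputed` would come from an external imputation tool run on the
# masked data; here it is a hand-made stand-in.
truth = np.array([[0, 1, 2], [1, 1, 0]])
imputed = np.array([[0, 2, 2], [1, 1, 1]])
mask = np.array([[True, True, False], [False, True, True]])
error = imputation_error_rate(truth, imputed, mask)
```

Only the hidden entries enter the error rate, so genotypes that were never masked cannot dilute the figure. Averaging this value over repeated masks for each missing proportion and each reference sample size would give a table of mean error rates of the kind reported in the Results.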