The 21st century is an era of information explosion. The rapid development of computer technology has greatly facilitated data collection and storage, and organizations of all kinds accumulate large volumes of data every day, such as commercial bank transaction records, supermarket sales records, and the financial statements of small and medium-sized enterprises in government statistics. At the same time, the dimensionality of these data keeps growing: studies of the relationship between genes and cancer involve tens of thousands of genes, and credit scoring uses thousands of independent variables. Data sources are also diverse, including business records, sensor data, third-party data, and even data crawled from the web. In addition, data formats are increasingly varied, including structured data