The Internet Archive, a San Francisco-based nonprofit, aims to provide universal access to human knowledge and culture using digital storage and network technology. The Archive’s most extensive and well-known collection is eight years of snapshots of the public Internet web content, including tens of billions of web pages and associated resources. These snapshots come from a commercial partner organization, and may be browsed via the Archive’s public website. To augment this general dataset with new approaches, the Archive began development in 2003 of new open source web crawling software called Heritrix. Heritrix is designed to be a generic crawling framework suitable for many crawling use cases. With collaborative support from national libraries, Heritrix is now available in its 1.0.0 version, with many features making it well suited for focused crawling. Future work by the Archive and others will further extend Heritrix, making it better suited for broad and continuous crawling.