A prominent difficulty in building a digital library is digitizing the massive volume of books, documents, and other information materials. In digitizing its own materials, the National Intellectual Property Press used the “Hanwang OCR Input Factory Software System” to handle the entry of a large number of patent documents. Each year the Press publishes more than one hundred titles of books and electronic publications on intellectual property, the social sciences, and the natural sciences, and it provides online query and retrieval of patent documents. Patent documents differ from ordinary documents: they comprise patent application files from many disciplines (for example chemistry, biology, and physics) and contain specialized symbols and figures such as chemical formulas and molecular formulas. In addition, layout information must be tagged when the page layout is reconstructed, so that the result can be output in XML format. The Press therefore selected the “Hanwang OCR Input Factory Software System” developed by Hanwang Technology Co., Ltd., a complete solution for high-volume text entry that includes peripheral management functions (its network structure is shown in Figure 1).