[Purpose/Significance] This paper studies how to evaluate the quality of scientific and technical corpora in an open environment. [Method/Process] Drawing on a review of existing research and on theoretical analysis, it proposes an evaluation index system with five first-level indicators: representativeness, scale, correctness, stability, and openness. Domain representativeness, a second-level indicator under representativeness, is measured with a statistic. Openness is divided into two second-level indicators, diversity and usage volume, and diversity is measured with information entropy. The statistic and the information entropy are tested on four open corpora and one self-built corpus to examine how they perform in practical evaluation, and data analysis shows that the two indicators are not correlated in a statistically significant sense. [Result/Conclusion] The results can be used for evaluation and quality control when libraries and scientific and technical information service institutions build corpora from scientific literature.
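As a rough illustration of the entropy-based diversity measure mentioned in the abstract, the following is a minimal sketch of Shannon entropy over a categorical distribution. The abstract does not say which distribution the entropy is computed over, so the use of per-document subject labels here is an assumption, and the function name `shannon_entropy` and the sample labels are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Return the Shannon entropy (in bits) of the empirical label distribution.

    Assumption: each corpus document carries one categorical label
    (for example a subject area); the paper may use a different unit.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty corpus has no diversity to measure
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical label sets: one evenly spread corpus and one dominated by a single subject.
balanced = ["physics", "chemistry", "biology", "medicine"]
skewed = ["physics"] * 9 + ["chemistry"]

print(shannon_entropy(balanced))
print(shannon_entropy(skewed))
```

Under this reading, a corpus whose documents are spread evenly across categories scores higher than one concentrated in a single category, which matches the intuition of diversity as a component of openness.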