A Large Chinese Text Dataset in the Wild

Source: Journal of Computer Science and Technology (English edition)
In this paper, we introduce a very large dataset of Chinese text in the wild. While optical character recognition (OCR) in document images is well studied and many commercial tools are available, detecting and recognizing text in natural images remains a challenging problem, especially for more complicated character sets such as Chinese. Lack of training data has always been a problem, especially for deep learning methods, which require massive training data. We provide details of a newly created dataset containing about 1 million Chinese characters, drawn from 3850 unique characters and annotated by experts in over 30000 street view images. This is a challenging dataset with good diversity: it contains planar text, raised text, text under poor illumination, distant text, partially occluded text, etc. For each character, the annotation includes its underlying character, its bounding box, and six attributes. The attributes indicate the character's background complexity, appearance, style, etc. Besides the dataset, we give baseline results using state-of-the-art methods for three tasks: character recognition (top-1 accuracy of 80.5%), character detection (AP of 70.9%), and text line detection (AED of 22.1). The dataset, source code, and trained models are publicly available.
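To make the annotation layout and the recognition metric concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the dataset's actual schema or the authors' evaluation code: the class name `CharAnnotation`, the `(x, y, width, height)` box convention, the attribute name `occluded`, and the helper `top1_accuracy` are all hypothetical. Top-1 accuracy is taken in its usual sense, the fraction of characters whose single best prediction matches the annotated character.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CharAnnotation:
    """One annotated character instance (hypothetical layout, not the real schema)."""
    text: str                                # the underlying character
    bbox: Tuple[float, float, float, float]  # assumed (x, y, width, height) in pixels
    # Attribute flags; the key names used below are placeholders only.
    attributes: Dict[str, bool] = field(default_factory=dict)


def top1_accuracy(predictions: List[str], annotations: List[CharAnnotation]) -> float:
    """Fraction of characters whose best prediction equals the annotated character."""
    if len(predictions) != len(annotations):
        raise ValueError("predictions and annotations must have the same length")
    if not annotations:
        return 0.0
    correct = sum(p == a.text for p, a in zip(predictions, annotations))
    return correct / len(annotations)


# Four made-up samples; the third prediction is a look-alike character and is wrong.
samples = [
    CharAnnotation("中", (10.0, 20.0, 32.0, 32.0), {"occluded": False}),
    CharAnnotation("文", (50.0, 20.0, 32.0, 32.0), {"occluded": True}),
    CharAnnotation("字", (90.0, 20.0, 32.0, 32.0)),
    CharAnnotation("符", (130.0, 20.0, 32.0, 32.0)),
]
accuracy = top1_accuracy(["中", "文", "宇", "符"], samples)
print(accuracy)  # 0.75
```

Keeping the attributes in a plain dictionary lets an evaluation script filter samples by any attribute, for example to report accuracy on occluded characters only, without hard-coding the six attribute names.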