Long-term OCR services aim to provide high-quality output to their users at competitive costs. Because users load increasingly complex data, it is essential to keep upgrading the models. Service providers encourage users who supply data on which the OCR model fails by rewarding them based on data complexity, readability, and the available budget. Hitherto, OCR work has prepared models on standard datasets without considering the end users. We propose a strategy of consistently upgrading an existing handwritten Hindi OCR model three times on the dataset of 15 users, fixing a budget of 4 users for each iteration. In the first iteration, the model trains directly on the dataset from the first four users. In each subsequent iteration, all remaining users write a page each, which the service provider analyzes to select the 4 (new) best users based on the quality of predictions on the human-readable words. Selected users then write 23 more pages for upgrading the model. We upgrade the model with Curriculum Learning (CL) on the data available in the current iteration and a subset from previous iterations. The upgraded model is tested on a held-out set of one page each from all 23 users. We provide insights from our investigations into the effect of CL, user selection, and especially data from unseen writing styles. Our work can be used for long-term OCR services in crowd-sourcing scenarios by service providers and end users.
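The user-selection step described above can be sketched in a few lines. This is one plausible reading of the abstract, not the paper's exact rule: each remaining user submits a page, the current model's word accuracy on the human-readable words is measured, and the `budget` users whose pages the model handles worst are selected, since their writing styles add the most new information. All user IDs, accuracies, and the scoring rule below are illustrative assumptions.

```python
def select_users(page_accuracy, already_selected, budget=4):
    """Pick the `budget` unselected users with the lowest word accuracy
    under the current model (assumed selection rule, for illustration)."""
    candidates = {u: acc for u, acc in page_accuracy.items()
                  if u not in already_selected}
    # Lowest accuracy first: the model fails most on these writing styles.
    ranked = sorted(candidates, key=lambda u: candidates[u])
    return ranked[:budget]

# Hypothetical per-user word accuracies on the one-page probe.
accuracy = {"u05": 0.91, "u06": 0.62, "u07": 0.88, "u08": 0.45,
            "u09": 0.70, "u10": 0.83, "u11": 0.58}
chosen = select_users(accuracy, already_selected={"u01", "u02", "u03", "u04"})
print(chosen)  # ['u08', 'u11', 'u06', 'u09'] — the four hardest styles
```

The selected users would then contribute their additional pages for the next curriculum-learning round.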
Handwritten Text Recognition (HTR) is more interesting and challenging than printed text recognition due to uneven variations in the handwriting styles of writers, content, and time. HTR becomes more challenging for Indic languages because of (i) multiple characters combining to form conjuncts, which increases the number of characters in the respective languages, and (ii) nearly 100 unique basic Unicode characters in each Indic script. Recently, many recognition methods based on the encoder-decoder framework have been proposed to handle such problems. They still face many challenges, such as image blur and incomplete characters due to varying writing styles and ink density. We argue that most encoder-decoder methods are based on local visual features without explicit global semantic information. In this work, we enhance the performance of Indic handwritten text recognizers using global semantic information. We use a semantic module in an encoder-decoder framework for extracting global semantic information to recognize Indic handwritten texts. The semantic information is used in both the encoder for supervision and the decoder for initialization. The semantic information is predicted from the word embedding of a pre-trained language model. Extensive experiments demonstrate that the proposed framework achieves state-of-the-art results on handwritten texts of ten Indic languages.
Scene text recognition in low-resource Indian languages is challenging because of complexities such as multiple scripts, fonts, text sizes, and orientations. In this work, we investigate the power of transfer learning for all the layers of deep scene text recognition networks from English to two common Indian languages. We perform experiments on the conventional CRNN model and STAR-Net to ensure generality. To study the effect of change in different scripts, we initially run our experiments on synthetic word images rendered using Unicode fonts. We show that transferring English models to simple synthetic datasets of Indian languages is not practical. Instead, we propose to apply transfer learning techniques among Indian languages, due to the similarity of their n-gram distributions and visual features such as vowels and conjunct characters. We then study transfer learning among six Indian languages with varying complexities in fonts and word-length statistics. We also demonstrate that the learned features of models transferred from other Indian languages are visually closer (and sometimes even better) than those transferred from English. We finally set new benchmarks for scene text recognition, achieving word recognition rate (WRR) gains of 6%, 5%, 2%, and 23% over previous works, including on MLT-17. We further improve the MLT-17 Bangla results by plugging a novel correction BiLSTM into our model. We additionally release a dataset of around 440 scene images containing 500 Gujarati and 2535 Tamil words. WRRs improve over the baselines by 8%, 4%, 5%, and 3% on the MLT-19 Hindi and Bangla datasets and on the Gujarati and Tamil datasets.
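Layer-wise transfer of this kind can be sketched without any framework: copy every source-model layer whose shape matches into the target model, and leave the final classification layer target-specific because the two scripts have different character sets. A minimal sketch with models represented as name-to-weights dicts; the layer names and the shape test are assumptions for illustration.

```python
def transfer_weights(src, tgt, skip=("classifier",)):
    """Copy matching-shape layers from src into tgt, skipping the
    script-specific output layer. Returns the names transferred."""
    transferred = []
    for name, w in src.items():
        if name in tgt and not name.startswith(skip) and len(w) == len(tgt[name]):
            tgt[name] = list(w)  # copy source weights into the target model
            transferred.append(name)
    return transferred

# Hypothetical tiny "models": the classifier sizes differ with the alphabet.
hindi = {"cnn.0": [0.1, 0.2], "lstm.0": [0.3, 0.4, 0.5], "classifier": [0.6] * 110}
telugu = {"cnn.0": [0.0, 0.0], "lstm.0": [0.0, 0.0, 0.0], "classifier": [0.0] * 130}
moved = transfer_weights(hindi, telugu)
print(moved)  # ['cnn.0', 'lstm.0'] — the classifier stays target-specific
```

The target model would then be fine-tuned on the new language's data, which is the transfer-learning setting the abstract studies.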
Scene text recognition is remarkably better in Latin languages than in non-Latin languages due to several factors, such as multiple fonts, simpler vocabulary statistics, updated data generation tools, and writing systems. This paper examines the possible reasons for low accuracy by comparing English datasets with non-Latin languages. We compare various features, such as the size (width and height) of word images and word-length statistics. Over the last decade, generating synthetic datasets with powerful deep learning techniques has tremendously improved scene text recognition. Several controlled experiments are performed on English by varying the number of (i) fonts used to create the synthetic data and (ii) word images created. We find that these factors are critical for scene text recognition systems. English synthetic datasets utilize over 1400 fonts, while Arabic and other non-Latin datasets use fewer than 100 fonts for data generation. Since some of these languages are spoken in different regions, we gather additional fonts through a region-based search to improve the scene text recognition models for Arabic and Devanagari. We improve the Word Recognition Rates (WRRs) on the Arabic MLT-17 and MLT-19 datasets by 24.54% and 2.32% compared to previous works or baselines. We achieve WRR gains of 7.88% and 3.72% on the IIIT-ILST and MLT-19 Devanagari datasets.
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
Handwritten Text Recognition (HTR) is an open problem at the intersection of computer vision and natural language processing. When dealing with historical manuscripts, the main challenges are due to the preservation of the paper support, the variability of the handwriting (even of the same author over a wide time span), and the scarcity of data from ancient, poorly represented languages. To foster research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic split and a date-based split that takes the age of the author into account. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same author in time periods for which no training data are available. For both configurations, we analyze quantitative and qualitative characteristics, also with respect to other line-level HTR benchmarks, and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam.
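The date-based split described above is easy to picture in code: line samples are partitioned by year so that the test set contains only time periods the model never saw in training. A minimal sketch; the field names, years, and period boundaries are illustrative assumptions, not LAM's actual splits.

```python
def date_based_split(lines, train_periods, test_periods):
    """Partition line records by year into held-in and held-out time periods."""
    train = [l for l in lines if any(a <= l["year"] <= b for a, b in train_periods)]
    test = [l for l in lines if any(a <= l["year"] <= b for a, b in test_periods)]
    return train, test

# Hypothetical line records spanning the author's writing life.
lines = [{"text": "...", "year": y} for y in (1684, 1695, 1710, 1723, 1740)]
train, test = date_based_split(lines, train_periods=[(1680, 1715)],
                               test_periods=[(1716, 1750)])
print(len(train), len(test))  # 3 2
```

A basic split would instead shuffle the same records regardless of year, which is why the two configurations probe different generalization abilities.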
This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite for facilitating research with different machine learning algorithms. However, because of the very large variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of deterioration), the different formats for data and label representation, and the different evaluation processes and benchmarks, finding appropriate datasets is a difficult task. This work fills this gap, presenting a meta-study on existing datasets. After a systematic selection process (according to the PRISMA guidelines), we select 56 studies based on different factors, such as the year of publication, the number of methods implemented in the article, the reliability of the chosen algorithms, the dataset size, and the journal outlet. We summarize each study by assigning it to one of three pre-defined tasks: document classification, layout structure, or semantic analysis. For every dataset, we present statistics, document type, language, task, input visual aspects, and ground truth information. In addition, we provide benchmark tasks and results from these papers or recent competitions. We further discuss gaps and challenges in this field. We advocate for providing conversion tools to common formats (e.g., the COCO format for computer vision tasks) and for always providing a set of evaluation metrics, instead of just one, to make results comparable across studies.
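The conversion the review advocates can be illustrated concretely: wrapping page-level layout annotations into the standard COCO skeleton of `images`, `annotations`, and `categories`. A minimal sketch; the input record layout and the single `text_region` category are assumptions, but the output fields shown are the standard COCO ones.

```python
import json

def to_coco(pages):
    """Convert simple page records into a COCO-style annotation dict."""
    coco = {"images": [], "annotations": [],
            "categories": [{"id": 1, "name": "text_region"}]}
    ann_id = 1
    for img_id, page in enumerate(pages, start=1):
        coco["images"].append({"id": img_id, "file_name": page["file"],
                               "width": page["w"], "height": page["h"]})
        for x, y, w, h in page["regions"]:
            coco["annotations"].append({
                "id": ann_id, "image_id": img_id, "category_id": 1,
                "bbox": [x, y, w, h], "area": w * h, "iscrowd": 0})
            ann_id += 1
    return coco

coco = to_coco([{"file": "p1.jpg", "w": 800, "h": 1200,
                 "regions": [(10, 20, 300, 50), (10, 90, 400, 60)]}])
print(json.dumps(coco["annotations"][0]))
```

Once in this shape, a dataset can be consumed by the many tools that already read COCO, which is exactly the comparability argument the review makes.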
Unconstrained handwritten text recognition is a challenging computer vision task. It is traditionally handled by a two-step approach combining line segmentation followed by text line recognition. For the first time, we propose an end-to-end segmentation-free architecture for the handwritten document recognition task: the Document Attention Network. In addition to text recognition, the model is trained to label text parts using begin and end tags in an XML-like fashion. The model consists of an FCN encoder for feature extraction and a stack of transformer decoder layers for a recurrent token prediction process. It takes whole text documents as input and sequentially outputs characters as well as logical layout tokens. Contrary to existing segmentation-based approaches, the model is trained without using any segmentation labels. We achieve competitive results on the READ 2016 dataset at page level and double-page level, with CERs of 3.43% and 3.70%, respectively. We also provide page-level results for the RIMES 2009 dataset, reaching a CER of 4.54%. We provide all source code and pre-trained model weights at https://github.com/factodeeplearning/dan.
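The interleaving of characters and layout tokens can be made concrete by showing how a training target sequence might be serialized from a structured page. A sketch under assumed tag names (`<page>`, `<body>`, ...); the paper's actual token set may differ.

```python
def serialize(page):
    """Flatten (region, text) pairs into one sequence of layout tokens
    and per-character tokens, XML-style."""
    seq = ["<page>"]
    for region, text in page:
        seq.append(f"<{region}>")
        seq.extend(text)  # one token per character
        seq.append(f"</{region}>")
    seq.append("</page>")
    return seq

target = serialize([("body", "Hi"), ("margin", "n.1")])
print(target)
```

A decoder trained against such targets emits the document's logical structure and its text in a single output stream, which is what makes explicit segmentation labels unnecessary.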
This work proposes an attention-based sequence-to-sequence model for handwritten word recognition and explores transfer learning for data-efficient training of HTR systems. To overcome training data scarcity, this work leverages models pre-trained on scene text images as a starting point for tailoring handwriting recognition models. ResNet feature extraction and bidirectional-LSTM-based sequence modeling stages together form the encoder. The prediction stage consists of a decoder and a content-based attention mechanism. The effectiveness of the proposed end-to-end HTR system has been empirically evaluated on the novel multi-writer dataset IMGUR5K and on the IAM dataset. Experimental results evaluate the performance of the HTR framework, further supported by an in-depth analysis of the error cases. Source code and pre-trained models are available at https://github.com/dmitrijsk/attentionhtr.
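One step of content-based attention, as used in such seq2seq recognizers, can be sketched in plain Python: score each encoder state against the current decoder state, softmax the scores, and form a context vector as the weighted sum. Real systems score through learned projections; the raw dot product here is a simplifying assumption.

```python
import math

def attention_step(decoder_state, encoder_states):
    """Dot-product content-based attention: returns (weights, context)."""
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(len(encoder_states[0]))]
    return weights, context

weights, context = attention_step([1.0, 0.0],
                                  [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print([round(w, 3) for w in weights])  # first and third states score equally high
```

At each decoding step, the context vector conditions the next-character prediction, which is how the decoder "looks at" different parts of the word image over time.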
Annotating words in a historical document image archive for word image recognition demands time and skilled human resources (such as historians and paleographers). In a real-life scenario, obtaining sample images for all possible words is also not feasible. However, zero-shot learning methods can aptly be used to recognize unseen/out-of-lexicon words in such historical document images. Building on the previous state-of-the-art method for zero-shot word recognition, Pho(SC)Net, we propose a hybrid model based on the CTC framework (Pho(SC)-CTC) that takes advantage of the rich features learned by Pho(SC)Net, followed by a connectionist temporal classification (CTC) framework to perform the final classification. Encouraging results were obtained on two publicly available historical document datasets and one synthetic handwritten dataset, which justifies the efficacy of Pho(SC)-CTC and Pho(SC)Net.
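Zero-shot word recognition of this family rests on pyramidal attribute vectors such as PHOC (pyramidal histogram of characters), which represent any word, seen or unseen, as a fixed-length binary vector. A simplified sketch: at each pyramid level the word is split into equal regions, and a character flips a region's bit when most of its span falls inside that region. The 50%-overlap rule and two-level pyramid follow common PHOC practice; this is an illustration, not the Pho(SC)Net feature itself.

```python
def phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz", levels=(1, 2)):
    """Simplified pyramidal histogram of characters for a lowercase word."""
    vec = []
    n = len(word)
    for L in levels:
        for r in range(L):
            region = [0] * len(alphabet)
            lo, hi = r / L, (r + 1) / L          # region interval in [0, 1]
            for i, ch in enumerate(word):
                c_lo, c_hi = i / n, (i + 1) / n  # character's span in the word
                overlap = max(0.0, min(hi, c_hi) - max(lo, c_lo))
                if overlap / (c_hi - c_lo) >= 0.5 and ch in alphabet:
                    region[alphabet.index(ch)] = 1
            vec.extend(region)
    return vec

v = phoc("ab")
print(len(v), sum(v))  # 78 4: 26 bits per region, 3 regions over levels (1, 2)
```

Because the vector is computed from the spelling alone, a word never seen in training still has a well-defined target, which is what enables out-of-lexicon recognition.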
Handwritten Digit Recognition (HDR) is one of the most challenging tasks in the field of Optical Character Recognition (OCR). Irrespective of language, HDR has some inherent challenges, mostly due to variations in writing style across individuals, changes in writing medium and environment, the inability to maintain the same strokes when writing any digit repeatedly, and so on. Beyond that, the structural complexity of a particular language's digits may introduce ambiguity into HDR. Over the years, researchers have developed numerous offline and online HDR pipelines in which different image processing techniques are combined with traditional Machine Learning (ML)-based and/or Deep Learning (DL)-based architectures. Although the literature contains extensive review studies on HDR for languages such as English, Arabic, Indian languages, Farsi, and Chinese, there are few surveys on Bengali HDR (BHDR), and those that exist lack a comprehensive analysis of the challenges, the underlying recognition process, and possible future directions. In this paper, the characteristics and inherent ambiguities of Bengali handwritten digits are analyzed, along with comprehensive insight into two decades of state-of-the-art datasets and approaches for offline BHDR. Several real-life, application-specific studies involving BHDR are also discussed in detail. This paper will serve as a compilation for researchers interested in the science behind offline BHDR, instigating the exploration of new avenues of relevant research that may lead to better offline recognition of Bengali handwritten digits in different application areas.
Unconstrained handwritten text recognition remains challenging for computer vision systems. Paragraph recognition is traditionally achieved by two models: the first for line segmentation and the second for text line recognition. We propose a unified end-to-end model using hybrid attention to tackle this task. The model aims to iteratively process a paragraph image line by line. It can be split into three modules. An encoder generates feature maps from the whole paragraph image. An attention module then recurrently generates a vertical weighted mask, enabling focus on the current text line features; in this way, it performs a kind of implicit line segmentation. For each set of text line features, a decoder module recognizes the associated character sequence, leading to the recognition of the whole paragraph. We achieve state-of-the-art character error rates at paragraph level on three popular datasets: 1.91% for RIMES, 4.45% for IAM, and 3.59% for READ 2016. Our code and trained model weights are available at https://github.com/FactoDeepLearning/VerticalAttentionOCR.
We introduce two data augmentation techniques which, used together with a ResNet-BiLSTM-CTC network, significantly reduce the Word Error Rate (WER) and Character Error Rate (CER) beyond the best reported results on handwritten text recognition (HTR) tasks. We apply a novel augmentation that simulates crossed-out text (Handwritten Blots) and a handwritten text generation method based on printed text (StackMix), both of which proved to be very effective for HTR. StackMix uses a weakly supervised framework to obtain character boundaries. Because these data augmentation techniques are independent of the network used, they could also be applied to enhance the performance of other networks and other approaches to HTR. Extensive experiments on ten handwritten text datasets show that the Handwritten Blots augmentation and StackMix significantly improve the quality of HTR models.
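The StackMix idea can be sketched very compactly: once character boundaries are known, a new word image is assembled by horizontally stacking character crops taken from existing samples. The tiny 2D lists below stand in for pixel arrays, and the single-crop-per-character lookup is a simplifying assumption (the real method samples crops from many writers).

```python
def stack_word(target, segments):
    """Assemble an image for `target` by concatenating per-character crops.
    segments: char -> 2D crop (rows x cols); all crops share the row count."""
    rows = len(next(iter(segments.values())))
    out = [[] for _ in range(rows)]
    for ch in target:
        crop = segments[ch]
        for r in range(rows):
            out[r].extend(crop[r])  # stack the crop to the right
    return out

# Hypothetical 2x2 "crops" cut at weakly-supervised character boundaries.
segments = {"a": [[1, 0], [0, 1]], "b": [[1, 1], [1, 0]]}
img = stack_word("ab", segments)
print(img)  # [[1, 0, 1, 1], [0, 1, 1, 0]]
```

Because any target string can be assembled this way, the training distribution can be extended with words (and blot-style corruptions applied on top) that never occur in the original corpus.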
Automatic Arabic handwritten recognition is one of the recently studied problems in the field of Machine Learning. Unlike Latin languages, Arabic is a Semitic language that poses a harder challenge, especially given the variability of patterns caused by factors such as writer age. Most studies have focused on adults, with only one recent study on children. Moreover, many recent Machine Learning methods have focused on Convolutional Neural Networks (CNNs), a powerful class of neural networks that can extract complex features from images. In this paper we propose a CNN model that recognizes children's handwriting with an accuracy of 91% on the Hijja dataset, a recent dataset built by collecting images of Arabic characters written by children, and 97% on the Arabic Handwritten Character Dataset (AHCD). The results show a good improvement over the model proposed by the Hijja dataset authors, yet they reveal that Arabic handwritten character recognition for children remains a harder challenge to solve. Moreover, we propose a new approach using multiple models instead of a single model, based on the number of strokes in a character, and merge Hijja with AHCD, reaching an average prediction accuracy of 96%.
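The multi-model idea can be sketched as a dispatcher: instead of one classifier for every Arabic character, a separate model handles each stroke-count group and a sample is routed by its stroke count. The models below are stand-in callables, and the clamp-to-largest-group rule is an assumption for illustration, not the paper's exact grouping.

```python
def make_dispatcher(models):
    """models: stroke-count group -> classifier callable."""
    def predict(sample):
        group = min(sample["strokes"], max(models))  # clamp to known groups
        return models[group](sample["image"])
    return predict

# Hypothetical per-group classifiers (real ones would be trained CNNs).
models = {1: lambda img: "alif", 2: lambda img: "ta", 3: lambda img: "tha"}
predict = make_dispatcher(models)
print(predict({"strokes": 2, "image": None}))  # routed to the 2-stroke model
```

Each per-group model sees a smaller, more homogeneous label set, which is the intuition behind the reported accuracy gain over a single shared model.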
Cursive handwritten text recognition is a challenging research problem in the field of pattern recognition. Current state-of-the-art approaches include models based on convolutional recurrent neural networks and multi-dimensional long short-term memory recurrent neural networks. These are highly computationally extensive models that are also complex at the design level. In recent studies, combinations of convolutional neural network and gated convolutional neural network based models have shown fewer parameters than convolutional recurrent neural network based models. In the direction of reducing the total number of trainable parameters, in this work we use depthwise convolutions in place of standard convolutions, combined with a gated convolutional neural network and bidirectional gated recurrent units. Additionally, we include a lexicon-based word beam search decoder at the testing step, which also helps improve the overall accuracy of the model. We obtain a 3.84% character error rate and a 9.40% word error rate on the IAM dataset, and a 14.56% word error rate on the George Washington dataset.
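The parameter saving from the depthwise substitution is easy to verify by counting weights: a standard k x k convolution needs k*k*c_in*c_out parameters, while a depthwise-separable one needs k*k*c_in (one spatial filter per input channel) plus c_in*c_out for the 1x1 pointwise mix. The layer sizes below are example values, not the paper's architecture.

```python
def standard_params(k, c_in, c_out):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k filters plus a 1x1 pointwise projection."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
print(standard_params(k, c_in, c_out))             # 73728
print(depthwise_separable_params(k, c_in, c_out))  # 8768
```

For this example layer the depthwise-separable form uses under 12% of the standard layer's weights, which is the kind of reduction that motivates the design.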
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.
Recognizing handwritten text in scripts similar to Arabic, such as Persian and Urdu, is more challenging than in Latin scripts. This is due to the presence of a two-dimensional structure, context-dependent character shapes, spaces and overlaps, and the placement of diacritics. Not much research exists on offline handwritten Urdu, the 10th most spoken language in the world. We propose an attention-based encoder-decoder model that learns to read Urdu in context. A novel localization penalty is introduced to encourage the model to attend to only one location at a time when recognizing the next character. In addition, we comprehensively refine the only complete and publicly available handwritten Urdu dataset in terms of ground-truth annotations. We evaluate the model on Urdu and Arabic datasets and show that contextual attention localization outperforms both simple attention and multi-directional LSTM models.
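A localization penalty in the spirit described above can be sketched by penalizing attention distributions that are spread over many positions, pushing the model toward one location per character. Using one minus the peak weight per step is an illustrative choice, not necessarily the paper's exact formulation.

```python
def localization_penalty(attention_per_step):
    """Average of (1 - peak weight) over decoding steps: 0 when every step
    puts all its mass on a single location, larger when attention is diffuse."""
    return sum(1.0 - max(step) for step in attention_per_step) / len(attention_per_step)

# Hypothetical attention weights over 3 image positions for 2 decoding steps.
sharp = [[0.9, 0.05, 0.05], [0.1, 0.85, 0.05]]
diffuse = [[0.4, 0.3, 0.3], [0.35, 0.35, 0.3]]
print(localization_penalty(sharp) < localization_penalty(diffuse))  # True
```

Added to the recognition loss with a small weight, such a term rewards peaky attention maps without changing what the decoder is asked to predict.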
Purpose. Handwriting is one of the most frequently occurring patterns in everyday life, and with it come challenging applications such as handwriting recognition (HWR), writer identification, and signature verification. In contrast to offline HWR, which uses only spatial information (i.e., images), online HWR (OnHWR) uses richer spatio-temporal information (i.e., trajectory data or inertial data). While many offline HWR datasets exist, there is only little data available for developing OnHWR methods on paper, as this requires hardware-integrated pens. Methods. This paper presents data and benchmark models for real-time sequence-to-sequence (seq2seq) learning and single character-based recognition. Our data are recorded by a sensor-enhanced ballpoint pen, yielding sensor data streams from a triaxial accelerometer, a gyroscope, a magnetometer, and a force sensor at 100 Hz. We propose a variety of datasets, including equations and words, for writer-dependent and writer-independent tasks. Our datasets allow a comparison between classical OnHWR on tablets and on paper with sensor-enhanced pens. We provide evaluation benchmarks for seq2seq and single character-based HWR using recurrent and temporal convolutional networks and transformers, combined with a connectionist temporal classification (CTC) loss and cross-entropy (CE) losses. Results. Our convolutional network combined with BiLSTMs outperforms transformer-based architectures, is on par with InceptionTime for sequence-based classification tasks, and yields better results than 28 state-of-the-art techniques. Time-series augmentation methods improve the sequence-based tasks, and we show that CE variants can improve the single-character classification task.
The handwritten text recognition problem, a sub-domain of pattern recognition, is widely studied by researchers in the computer vision community because of its scope for improvement and its applicability to daily life. The increase in computational power over the past few decades has helped neural-network-based systems provide state-of-the-art handwritten text recognizers. In the same direction, we take two state-of-the-art neural network systems and merge an attention mechanism with them. The attention technique has been widely used in neural machine translation and automatic speech recognition, and is now being implemented in the text recognition domain. In this study, after merging attention with the existing Flor et al. architecture, we achieve a 4.15% character error rate and a 9.72% word error rate on the IAM dataset, and a 7.07% character error rate and a 16.14% word error rate on the GW dataset. For further analysis, we also use a system similar to the Shi et al. neural network system with a greedy decoder, and observe a 23.27% improvement in character error rate over the base model.
Image-based sequence recognition has been a longstanding research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.
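The transcription layer of such a network is trained with CTC, and decoding at inference can be as simple as best-path (greedy) decoding: take the argmax label per frame, collapse repeated labels, then drop blanks. A minimal sketch with made-up frame probabilities; real systems decode over softmax outputs of the recurrent layers.

```python
BLANK = "-"

def ctc_greedy_decode(frame_probs, labels):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [labels[max(range(len(f)), key=f.__getitem__)] for f in frame_probs]
    out, prev = [], None
    for ch in best:
        if ch != prev and ch != BLANK:
            out.append(ch)
        prev = ch
    return "".join(out)

labels = ["-", "c", "a", "t"]
frames = [[0.1, 0.8, 0.05, 0.05],   # c
          [0.1, 0.8, 0.05, 0.05],   # c (repeat, collapsed)
          [0.7, 0.1, 0.1, 0.1],     # blank separates segments
          [0.1, 0.1, 0.7, 0.1],     # a
          [0.1, 0.1, 0.1, 0.7]]     # t
print(ctc_greedy_decode(frames, labels))  # cat
```

This per-frame independence is what lets the architecture handle sequences of arbitrary length without character segmentation; lexicon-based decoding replaces the greedy step with a constrained search.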