This paper presents a Bangla handwriting dataset named BanglaWriting, which contains single-page handwriting of 260 individuals of different personalities and ages. Each page includes bounding boxes that bound each word, along with the Unicode representation of the writing. The dataset contains 21,234 words and 32,787 characters in total. Moreover, it includes 5,470 unique words of the Bangla vocabulary. Apart from usual words, the dataset also contains 261 comprehensible overwritings and 450 handwritten strikes and mistakes. All bounding boxes and word labels are manually generated. The dataset can be used for complex optical character/word recognition, writer identification, handwritten word segmentation, and word generation. Furthermore, it is suitable for extracting age-based and gender-based variations of handwriting.
This paper presents a systematic literature review of image datasets for document image analysis, focusing on historical documents such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms. However, because of the large variety of the actual data (e.g., scripts, tasks, dates, support systems, and amount of degradation), the different formats for data and label representation, and the different evaluation processes and benchmarks, finding appropriate datasets is a difficult task. This work fills this gap by presenting a meta-study on existing datasets. After a systematic selection process (following the PRISMA guidelines), we selected 56 studies based on different factors, such as the year of publication, the number of methods implemented in the article, the reliability of the chosen algorithms, the dataset size, and the journal outlet. We summarize each study by assigning it to one of three pre-defined tasks: document classification, layout structure, or semantic analysis. For every dataset, we provide statistics, document type, language, tasks, input visual aspects, and ground-truth information. In addition, we provide benchmark tasks and results from these papers or from recent competitions. We further discuss the gaps and challenges in this field. We advocate providing conversion tools to common formats (e.g., COCO format for computer-vision tasks) and always reporting a set of evaluation metrics, rather than just one, to make results comparable across studies.
We present a dataset that contains object annotations with unique object identities (IDs) for the High Efficiency Video Coding (HEVC) v1 Common Test Conditions (CTC) sequences. Ground-truth annotations for 13 sequences were prepared and released as the dataset called SFU-HW-Tracks-v1. For each video frame, the ground-truth annotations include the object class ID, the object ID, and the bounding-box location and its dimensions. The dataset can be used to evaluate object tracking performance on uncompressed video sequences and to study the relationship between video compression and object tracking.
Indian license plate detection is a problem that has not been much explored at the open-source level. Proprietary solutions are available, but no large open-source dataset exists for running experiments and testing different approaches. The available large datasets are from countries such as China and Brazil, but models trained on these datasets perform poorly on Indian plates because font styles and plate designs vary greatly from country to country. This paper introduces an Indian license plate dataset with 16,192 images and 21,683 plates, annotated with four points per plate and with each character of the corresponding plate. We present a benchmark model that uses semantic segmentation to solve number plate detection. We propose a two-stage approach in which the first stage localizes the plate and the second stage reads the text in the cropped plate image. We tested benchmark object detection and semantic segmentation models; for the second stage, we used LPRNet-based OCR.
This project deals with the technology of OCR (optical character recognition), which touches various research facets of computer science. The idea is to take a picture of a character and process it to recognize that character the way a human brain recognizes the various digits. The project involves deep ideas from image processing techniques and from the large research area of machine learning, together with the building block of machine learning called the neural network. The project has two distinct parts. The training part follows the idea of training a child by showing it many similar, but not identical, characters and telling it what their outputs are; in the same way, one must train the newly built neural network with many characters. This part contains some new algorithms that were self-created and upgraded as the project required. The testing part involves testing on a new dataset and always comes after the training part: first one must teach a child how to recognize a character, and then one must test whether it gives the right answers or not. If not, one must train it harder by providing new datasets and new entries. In the same way, the algorithm must also be tested. There are many elements of statistical modeling and optimization involved, such as optimizer techniques and the filtering process, where the mathematics behind the filtering or the algorithm is one of the essential requirements for the final creation of the predictive model. Machine learning algorithms are built from prediction and programming concepts.
Most low-resource languages do not have the necessary resources to create even a substantial monolingual corpus. These languages may often be found in government proceedings, but mainly in Portable Document Format (PDF) files that contain legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging due to legacy font usage and printer-friendly encoding, which are not optimized for text extraction. Therefore, we propose a simple, automatic, and novel idea that can scale for Tamil, Sinhala, and English, and for many documents, along with parallel corpora. Since Tamil and Sinhala are low-resource languages, we improved the performance of Tesseract by employing LSTM-based training on more than 20 legacy fonts to recognize printed characters in these languages. In particular, our model detects code-mixed text, numbers, and special characters in the printed documents. It is shown that this approach can reduce the character-level error rate of Tesseract from 6.03 to 2.61 for Tamil (a 3.42 percentage-point reduction) and from 7.61 to 4.74 for Sinhala (a 2.87 percentage-point reduction), as well as the word-level error rate from 39.68 to 20.61 for Tamil (a 19.07 percentage-point reduction) and from 35.04 to 26.58 for Sinhala (an 8.46 percentage-point reduction) on the test set. Also, our newly created parallel corpus consists of 185.4k, 168.9k, and 181.04k sentences and 2.11M, 2.22M, and 2.33M words in Tamil, Sinhala, and English respectively. This study shows that fine-tuning Tesseract models on multiple new fonts helps the models understand the texts and enhances the performance of the OCR. We made the newly trained models and the source code for fine-tuning Tesseract freely available.
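The character- and word-level error rates reported above are standard edit-distance metrics. As an illustration only (not the paper's own evaluation code), the character error rate (CER) is conventionally the Levenshtein distance between the recognized text and the reference, divided by the reference length:

```python
def levenshtein(ref, hyp):
    """Classic dynamic-programming edit distance between two strings."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))               # distances against the empty prefix of ref
    for i in range(1, m + 1):
        prev = dp[0]                      # holds dp[i-1][j-1]
        dp[0] = i
        for j in range(1, n + 1):
            cur = dp[j]                   # dp[i-1][j] before overwrite
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution or match
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate in percent: edit distance over reference length."""
    return 100.0 * levenshtein(ref, hyp) / max(len(ref), 1)
```

The word-level rate is computed the same way over token sequences instead of characters.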
Handwritten text recognition (HTR) is an open problem at the intersection of computer vision and natural language processing. The main challenges when dealing with historical manuscripts are due to the preservation of the paper support, the variability of the handwriting (even of the same author over a wide time span), and the scarcity of data from ancient, poorly represented languages. To promote research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts edited by a single author over 60 years. The dataset comes in two configurations: a basic split and a date-based split that takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same author in periods for which no training data is available. For both configurations, we analyze quantitative and qualitative characteristics in comparison with other line-level HTR benchmarks and present the recognition performance of state-of-the-art HTR architectures. The dataset is available for download at https://aimagelab.ing.unimore.it/go/lam.
Manually analyzing spermatozoa is a tremendous task for biologists due to the many fast-moving spermatozoa, causing inconsistencies in the quality of the assessments. Therefore, computer-assisted sperm analysis (CASA) has become a popular solution. Despite this, more data is needed to train supervised machine learning approaches in order to improve accuracy and reliability. In this regard, we provide a dataset called VISEM-Tracking with 20 video recordings of 30s of spermatozoa with manually annotated bounding-box coordinates and a set of sperm characteristics analyzed by experts in the domain. VISEM-Tracking is an extension of the previously published VISEM dataset. In addition to the annotated data, we provide unlabeled video clips for easy-to-use access and analysis of the data. As part of this paper, we present baseline sperm detection performances using the YOLOv5 deep learning model trained on the VISEM-Tracking dataset. As a result, the dataset can be used to train complex deep-learning models to analyze spermatozoa. The dataset is publicly available at https://zenodo.org/record/7293726.
Object detection has always been practical. There are so many things in our world that recognizing them can not only increase our automatic knowledge of the surroundings, but can also be profitable for those interested in starting new businesses. One of these attractive objects is the license plate (LP). Besides the security uses of license plate detection, it can also be used to build creative businesses. With the development of object detection methods based on deep learning models, an appropriate and comprehensive dataset becomes doubly important. However, due to the frequent commercial use of license plate datasets, they are limited not only in Iran but also worldwide. The largest Iranian dataset for license plate detection has 1,466 images, and the largest Iranian dataset for recognizing license plate characters has 5,000 images. We have prepared a complete dataset including 20,967 car images, together with all the detection annotations for the whole license plate and its characters, which is useful for various purposes. In addition, the total number of license plate images available for the character recognition application is 27,745 images.
Furigana are pronunciation notes used in Japanese writing. Being able to detect them can help improve optical character recognition (OCR) performance, or produce more accurate digital copies of Japanese written media by displaying furigana correctly. This project focuses on detecting furigana in Japanese books and comics. Although the detection of Japanese text has been researched, there are currently no proposed methods for detecting furigana. We construct a new dataset containing Japanese written media and annotations of furigana. We propose an evaluation metric for such data, which is similar to the evaluation protocols used in object detection, except that it allows a group of objects to be marked by one annotation. We propose a method for furigana detection based on mathematical morphology and connected component analysis. We evaluate detection on the dataset and compare different methods of text extraction. We also evaluate different types of images, such as books and comics, separately, and discuss the challenges of each. The proposed method reaches an F1-score of 76% on the dataset. It performs well on regular books, but less so on comics and books of irregular format. Finally, we show that the proposed method can improve the performance of OCR by 5% on the Manga109 dataset. The source code is available via https://github.com/nikolajkb/furiganadetection.
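The detection method above combines mathematical morphology with connected component analysis. As a rough sketch of the latter ingredient only (the paper's actual pipeline is not reproduced here), a 4-connected component labelling over a binary grid can be written as:

```python
from collections import deque

def connected_components(grid):
    """4-connected component labelling on a binary grid (list of lists).

    Returns the number of components and a grid of labels (0 = background,
    1..n = component index)."""
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if grid[y][x] and not labels[y][x]:
                count += 1                      # start a new component
                labels[y][x] = count
                q = deque([(y, x)])
                while q:                        # breadth-first flood fill
                    cy, cx = q.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = cy + dy, cx + dx
                        if 0 <= ny < h and 0 <= nx < w \
                                and grid[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            q.append((ny, nx))
    return count, labels
```

In a furigana detector, the resulting components (small glyph blobs beside the main text) would then be filtered by size and position; those heuristics are the paper's contribution and are not shown here.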
Language is a method for individuals to express their thoughts. Each language has its own set of alphabetic and numeric characters, and people can communicate with each other through speech or writing. Every language also has a counterpart: individuals who are deaf and/or mute communicate through sign language. Bangla likewise has a sign language, called BDSL. This dataset consists of Bangla hand sign images; the collection contains images of 49 individual Bangla alphabet signs. BDSL49 is a dataset consisting of 29,490 images with 49 labels. During data collection, images of 14 different adults were recorded, each with a different background and appearance. Several strategies were used during preparation to eliminate noise from the dataset. The dataset is freely available to researchers, who can use it to develop automated systems with machine learning, computer vision, and deep learning techniques. In addition, two models were applied to the dataset: the first for detection and the second for recognition.
This study forms a technical report of various tasks performed on the materials collected and published by the Finnish ethnographer and linguist Matthias Alexander Castrén (1813-1852). The Finno-Ugrian Society is publishing Castrén's manuscripts as new critical and digital editions, while at the same time different research groups have also turned their attention to these materials. We discuss the workflows and technical infrastructure used, and consider how datasets that benefit different computational tasks could be created to further improve the usability of these materials and to help in the processing of comparable archived collections. We focus on the parts of the collection that have been processed in a way that improves their usability in more technical applications, complementing the earlier work on the cultural and linguistic aspects of these materials. Most of these datasets are openly available in Zenodo. The study points to specific areas where further research is needed, and provides benchmarks for text recognition tasks.
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
This study focuses on improving the optical character recognition (OCR) data for panels in the COMICS dataset, the largest dataset containing text and images from comic books. To do this, we developed a pipeline for OCR processing and labeling of comic books and created the first text detection and recognition datasets for western comics, called "COMICS Text+: Detection" and "COMICS Text+: Recognition". We evaluated the performance of state-of-the-art text detection and recognition models on these datasets and found significant improvement in word accuracy and normalized edit distance compared to the text in COMICS. We also created a new dataset called "COMICS Text+", which contains the extracted text from the textboxes in the COMICS dataset. Using the improved text data of COMICS Text+ in the comics processing model from resulted in state-of-the-art performance on cloze-style tasks without changing the model architecture. The COMICS Text+ dataset can be a valuable resource for researchers working on tasks including text detection, recognition, and high-level processing of comics, such as narrative understanding, character relations, and story generation. All the data and inference instructions can be accessed in https://github.com/gsoykan/comics_text_plus.
Long-term OCR services aim to provide high-quality output to their users at competitive costs. It is essential to upgrade the models because of the complex data uploaded by the users. The service providers encourage the users who provide data on which the OCR model fails by rewarding them based on data complexity, readability, and the available budget. Hitherto, OCR works have prepared models on standard datasets without considering the end users. We propose a strategy of consistently upgrading an existing handwritten Hindi OCR model three times on the dataset of 15 users. We fix a budget of 4 users for each iteration. For the first iteration, the model trains directly on the dataset from the first four users. For each subsequent iteration, all remaining users write a page each, which the service providers later analyze to select the 4 (new) best users based on the quality of the predictions on the human-readable words. The selected users write 23 more pages for upgrading the model. We upgrade the model with Curriculum Learning (CL) on the data available in the current iteration together with a subset from previous iterations. The upgraded model is tested on a held-out set of one page each from all 23 users. We provide insights from our investigations into the effect of CL, user selection, and especially the data from unseen writing styles. Our work can be used for long-term OCR services in crowd-sourcing scenarios, for both the service providers and the end users.
Digitization of scanned receipts aims to extract text from receipt images and save it into structured documents. This is usually split into two sub-tasks: text localization and optical character recognition (OCR). Most existing OCR models only focus on cropped text instance images, which require the bounding-box information provided by a text region detection model. Introducing an additional detector to identify the text instance images in advance is inefficient; however, instance-level OCR models have very low accuracy when processing the whole image for document-level OCR, such as receipt images containing multiple text lines arranged in various layouts. To this end, we propose a localization-free document-level OCR model for transcribing all the characters in a receipt image into an ordered sequence end-to-end. Specifically, we finetune the pretrained Transformer-based instance-level model TrOCR with randomly cropped image chunks, and gradually increase the image chunk size to generalize the recognition ability from instance images to full-page images. In our experiments on the SROIE receipt OCR dataset, the model finetuned with our strategy achieved a 64.4 F1-score and a 22.8% character error rate (CER) on the word-level and character-level metrics, respectively, which outperforms the baseline results of a 48.5 F1-score and 50.6% CER. The best model, which splits the full image into 15 equally sized chunks, gives an 87.8 F1-score and 4.98% CER with minimal additional pre- or post-processing of the output. Moreover, the characters in the generated document-level sequences are arranged in reading order, which is practical for real-world applications.
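The chunking strategy above is described only at a high level; how the 15 equally sized chunks are cut is an implementation detail not spelled out in the abstract. A plausible helper, assuming a simple vertical split of the page into near-equal horizontal strips, might look like:

```python
def split_into_chunks(height, n_chunks):
    """Split an image of the given pixel height into n_chunks horizontal
    strips of near-equal height; returns (top, bottom) row intervals,
    bottom-exclusive, covering the image without gaps or overlaps."""
    base, extra = divmod(height, n_chunks)
    chunks, top = [], 0
    for i in range(n_chunks):
        h = base + (1 if i < extra else 0)   # spread the remainder over the first strips
        chunks.append((top, top + h))
        top += h
    return chunks
```

Each interval would then be cropped from the page image and fed to the finetuned TrOCR model, with the per-chunk transcriptions concatenated in order.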
Analyzing the layout of a document to identify headers, sections, tables, figures, etc. is critical to understanding its content. Deep-learning-based approaches for detecting the layout structure of document images have been promising. However, these methods require a large number of annotated examples during training, which are both expensive and time-consuming to obtain. We describe here a synthetic document generator that automatically produces realistic documents with labels for the spatial positions, extents, and categories of the layout elements. The proposed generative process treats every physical component of a document as a random variable and models their intrinsic dependencies using a Bayesian network graph. Our hierarchical formulation using stochastic templates allows parameter sharing between documents for retaining broad themes, while the distributional characteristics produce visually unique samples, thereby capturing complex and diverse layouts. We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model trained on real documents.
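The generator above samples every physical component as a random variable whose dependencies are given by a Bayesian network. As a toy illustration only (the variables and distributions here are invented, not the paper's), ancestral sampling over such a graph draws each child variable conditioned on its already-sampled parents:

```python
import random

def sample_layout(rng):
    """Toy ancestral sampling from a tiny Bayesian-network-style layout model.

    Root variables are drawn first; child variables are drawn from
    distributions conditioned on their parents' sampled values."""
    n_columns = rng.choice([1, 2])          # root: page column count
    has_title = rng.random() < 0.9          # root: most pages carry a title
    # child: a two-column page has room for more figures (toy dependency)
    max_figures = 2 if n_columns == 1 else 4
    n_figures = rng.randint(0, max_figures)
    return {"columns": n_columns, "title": has_title, "figures": n_figures}
```

A real generator of this kind would continue downward, sampling positions and extents for each element and rendering them with labels attached, yielding annotated training pages for free.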
In this paper, we introduce an enormous dataset, HaGRID (HAnd Gesture Recognition Image Dataset), for hand gesture recognition (HGR) systems. The dataset contains 552,992 samples divided into 18 classes of gestures. The annotations consist of bounding boxes of hands with gesture labels and markups of leading hands. The proposed dataset allows for building HGR systems that can be used in video conferencing services, home automation systems, the automotive sector, services for people with speech and hearing impairments, etc. We are especially focused on interaction with devices in order to manage them. That is why all 18 chosen gestures are functional, familiar to the majority of people, and can serve as an incentive to take some action. In addition, we used crowdsourcing platforms to collect the dataset and took various parameters into account to ensure data diversity. We describe the challenges of using existing HGR datasets for our task and provide a detailed overview of them. Furthermore, baselines for the hand detection and gesture classification tasks are proposed.
Text line segmentation is one of the pre-stages of modern optical character recognition systems. The algorithmic approach proposed in this paper was designed for this exact purpose. Its main characteristic is the combination of two different techniques: morphological image operations and horizontal histogram projections. The method was developed to be applied on historical data collections, which commonly suffer from quality issues such as degraded paper, blurred text, or the presence of noise. For that reason, the segmenter in question could be of particular interest to cultural institutions that want robust access to line bounding boxes for a given historical document. Owing to the promising segmentation results, paired with a low computational cost, the algorithm was incorporated into the OCR pipeline of the National Library of Luxembourg, within the context of its initiative on historical newspaper collections. The general contribution of this paper is to outline the approach and to evaluate the gains in terms of accuracy and speed, comparing it to the segmentation algorithm bundled with the open-source OCR software used.
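Of the two techniques the segmenter combines, the horizontal histogram projection is the simpler one. A minimal sketch, assuming a binarized page array where nonzero pixels are ink (this is not the library's actual implementation, which additionally applies morphological operations before projecting):

```python
import numpy as np

def segment_lines(binary):
    """Return inclusive (top, bottom) row index pairs for each text line,
    found as maximal runs of rows whose horizontal projection contains ink."""
    profile = binary.sum(axis=1)          # horizontal histogram projection
    lines, start = [], None
    for i, ink in enumerate(profile > 0):
        if ink and start is None:
            start = i                     # a new line band begins
        elif not ink and start is not None:
            lines.append((start, i - 1))  # line band ends at the previous row
            start = None
    if start is not None:                 # band running to the last row
        lines.append((start, len(profile) - 1))
    return lines
```

On degraded scans, thresholding the profile at zero is fragile; a real system would smooth the profile or threshold it relative to the page's noise level, which is where the morphological pre-processing comes in.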
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.