这项研究是有关阿拉伯历史文档的光学特征识别(OCR)的一系列研究的第二阶段,并研究了不同的建模程序如何与问题相互作用。第一项研究研究了变压器对我们定制的阿拉伯数据集的影响。首次研究的弊端之一是训练数据的规模,由于缺乏资源,我们的3000万张图像中仅15000张图像。另外,我们添加了一个图像增强层,时间和空间优化和后校正层,以帮助该模型预测正确的上下文。值得注意的是,我们提出了一种使用视觉变压器作为编码器的端到端文本识别方法,即BEIT和Vanilla Transformer作为解码器,消除了CNNs以进行特征提取并降低模型的复杂性。实验表明,我们的端到端模型优于卷积骨架。该模型的CER为4.46%。
translated by 谷歌翻译
手写的文本识别问题是由计算机视觉社区的研究人员广泛研究的,因为它的改进和适用于日常生活的范围,它是模式识别的子域。自从过去几十年以来,基于神经网络的系统的计算能力提高了计算能力,因此有助于提供最新的手写文本识别器。在同一方向上,我们采用了两个最先进的神经网络系统,并将注意力机制合并在一起。注意技术已被广泛用于神经机器翻译和自动语音识别的领域,现在正在文本识别域中实现。在这项研究中,我们能够在IAM数据集上达到4.15%的字符错误率和9.72%的单词错误率,7.07%的字符错误率和GW数据集的16.14%单词错误率与现有的Flor合并后,GW数据集的单词错误率等。建筑学。为了进一步分析,我们还使用了类似于Shi等人的系统。具有贪婪解码器的神经网络系统,观察到基本模型的字符错误率提高了23.27%。
translated by 谷歌翻译
连接视觉和语言在生成智能中起着重要作用。因此,已经致力于图像标题的大型研究工作,即用句法和语义有意义的句子描述图像。从2015年开始,该任务通常通过由Visual Encoder组成的管道和文本生成的语言模型来解决任务。在这些年来,两种组件通过对象区域,属性,介绍多模态连接,完全关注方法和伯特早期融合策略的利用而显着发展。但是,无论令人印象深刻的结果,图像标题的研究还没有达到结论性答案。这项工作旨在提供图像标题方法的全面概述,从视觉编码和文本生成到培训策略,数据集和评估度量。在这方面,我们量化地比较了许多相关的最先进的方法来确定架构和培训策略中最有影响力的技术创新。此外,讨论了问题的许多变体及其开放挑战。这项工作的最终目标是作为理解现有文献的工具,并突出显示计算机视觉和自然语言处理的研究领域的未来方向可以找到最佳的协同作用。
translated by 谷歌翻译
文本识别是文档数字化的长期研究问题。现有的方法通常是基于CNN构建的,以用于图像理解,并为Char-Level文本生成而建立RNN。此外,通常需要另一种语言模型来提高整体准确性作为后处理步骤。在本文中,我们提出了一种使用预训练的图像变压器和文本变压器模型(即Trocr)提出的端到端文本识别方法,该模型利用了变压器体系结构,以实现图像理解和文字级级文本生成。TROR模型很简单,但有效,可以通过大规模合成数据进行预训练,并通过人体标记的数据集进行微调。实验表明,TROR模型的表现优于印刷,手写和场景文本识别任务上的当前最新模型。Trocr模型和代码可在\ url {https://aka.ms/trocr}上公开获得。
translated by 谷歌翻译
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
translated by 谷歌翻译
无约束的手写文本识别是一项具有挑战性的计算机视觉任务。传统上,它是通过两步方法来处理的,结合了线细分,然后是文本线识别。我们第一次为手写文档识别任务提出了无端到端的无分段体系结构:文档注意网络。除文本识别外,该模型还接受了使用类似XML的方式使用开始和结束标签标记文本零件的训练。该模型由用于特征提取的FCN编码器和用于复发令牌预测过程的变压器解码器层组成。它将整个文本文档作为输入和顺序输出字符以及逻辑布局令牌。与现有基于分割的方法相反,该模型是在不使用任何分割标签的情况下进行训练的。我们在页面级别的Read 2016数据集以及CER分别为3.43%和3.70%的双页级别上获得了竞争成果。我们还为Rimes 2009数据集提供了页面级别的结果,达到CER的4.54%。我们在https://github.com/factodeeplearning/dan上提供所有源代码和预训练的模型权重。
translated by 谷歌翻译
确保适当的标点符号和字母外壳是朝向应用复杂的自然语言处理算法的关键预处理步骤。这对于缺少标点符号和壳体的文本源,例如自动语音识别系统的原始输出。此外,简短的短信和微博的平台提供不可靠且经常错误的标点符号和套管。本调查概述了历史和最先进的技术,用于恢复标点符号和纠正单词套管。此外,突出了当前的挑战和研究方向。
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译
This study focuses on improving the optical character recognition (OCR) data for panels in the COMICS dataset, the largest dataset containing text and images from comic books. To do this, we developed a pipeline for OCR processing and labeling of comic books and created the first text detection and recognition datasets for western comics, called "COMICS Text+: Detection" and "COMICS Text+: Recognition". We evaluated the performance of state-of-the-art text detection and recognition models on these datasets and found significant improvement in word accuracy and normalized edit distance compared to the text in COMICS. We also created a new dataset called "COMICS Text+", which contains the extracted text from the textboxes in the COMICS dataset. Using the improved text data of COMICS Text+ in the comics processing model from resulted in state-of-the-art performance on cloze-style tasks without changing the model architecture. The COMICS Text+ dataset can be a valuable resource for researchers working on tasks including text detection, recognition, and high-level processing of comics, such as narrative understanding, character relations, and story generation. All the data and inference instructions can be accessed in https://github.com/gsoykan/comics_text_plus.
translated by 谷歌翻译
The rapid advancement of AI technology has made text generation tools like GPT-3 and ChatGPT increasingly accessible, scalable, and effective. This can pose serious threat to the credibility of various forms of media if these technologies are used for plagiarism, including scientific literature and news sources. Despite the development of automated methods for paraphrase identification, detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained. In this study, we review traditional and current approaches to paraphrase identification and propose a refined typology of paraphrases. We also investigate how this typology is represented in popular datasets and how under-representation of certain types of paraphrases impacts detection capabilities. Finally, we outline new directions for future research and datasets in the pursuit of more effective paraphrase detection using AI.
translated by 谷歌翻译
Spelling error correction is the task of identifying and rectifying misspelled words in texts. It is a potential and active research topic in Natural Language Processing because of numerous applications in human language understanding. The phonetically or visually similar yet semantically distinct characters make it an arduous task in any language. Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods which we found rather inefficient. In particular, machine learning-based approaches, which exhibit superior performance to rule-based and statistical methods, are ineffective as they correct each character regardless of its appropriateness. In this work, we propose a novel detector-purificator-corrector framework based on denoising transformers by addressing previous issues. Moreover, we present a method for large-scale corpus creation from scratch which in turn resolves the resource limitation problem of any left-to-right scripted language. The empirical outcomes demonstrate the effectiveness of our approach that outperforms previous state-of-the-art methods by a significant margin for Bangla spelling error correction. The models and corpus are publicly available at https://tinyurl.com/DPCSpell.
translated by 谷歌翻译
Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than pure visual classification. However, how to effectively model the linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous suggests enforcing explicitly language modeling by decoupling the recognizer into vision model and language model and blocking gradient flow between both models. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for the language model which can effectively alleviate the impact of noise input. Finally, to polish ABINet++ in long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module which integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments especially on low-quality images. Besides, extensive experiments including in English and Chinese also prove that, a text spotter that incorporates our language modeling method can significantly improve its performance both in accuracy and speed compared with commonly used attention-based recognizers.
translated by 谷歌翻译
社交媒体网络已成为人们生活的重要方面,它是其思想,观点和情感的平台。因此,自动化情绪分析(SA)对于以其他信息来源无法识别人们的感受至关重要。对这些感觉的分析揭示了各种应用,包括品牌评估,YouTube电影评论和医疗保健应用。随着社交媒体的不断发展,人们以不同形式发布大量信息,包括文本,照片,音频和视频。因此,传统的SA算法已变得有限,因为它们不考虑其他方式的表现力。通过包括来自各种物质来源的此类特征,这些多模式数据流提供了新的机会,以优化基于文本的SA之外的预期结果。我们的研究重点是多模式SA的最前沿领域,该领域研究了社交媒体网络上发布的视觉和文本数据。许多人更有可能利用这些信息在这些平台上表达自己。为了作为这个快速增长的领域的学者资源,我们介绍了文本和视觉SA的全面概述,包括数据预处理,功能提取技术,情感基准数据集以及适合每个字段的多重分类方法的疗效。我们还简要介绍了最常用的数据融合策略,并提供了有关Visual Textual SA的现有研究的摘要。最后,我们重点介绍了最重大的挑战,并调查了一些重要的情感应用程序。
translated by 谷歌翻译
提取手写文本是数字化信息的最重要组成部分之一,并使其可用于大规模设置。手写光学角色读取器(OCR)是计算机视觉和自然语言处理计算的研究问题,对于英语,已经完成了许多工作,但是不幸的是,对于乌尔都语(例如乌尔都语)的低资源语言,几乎没有完成工作。乌尔都语语言脚本非常困难,因为它具有基于其相对位置的角色形状的草书性质和变化,因此,需要提出一个模型,该模型可以理解复杂的特征并将其推广到各种手写样式。在这项工作中,我们提出了一个基于变压器的乌尔都语手写文本提取模型。由于变压器在自然语言理解任务中非常成功,因此我们进一步探索它们以了解复杂的乌尔都语手写。
translated by 谷歌翻译
我们介绍了两个数据增强技术,它与Reset-Bilstm-CTC网络一起使用,显着降低了在手写文本识别(HTR)任务上的最佳报告结果之外的字错误率(WER)和字符错误率(CER)。我们应用了一种基于打印文本(StackMix)的删除文本(手写污染)和手写文本生成方法的新型增强,这被证明在HTR任务中非常有效。StackMix使用弱监督框架来获得字符边界。因为这些数据增强技术与所使用的网络无关,所以也可以应用于增强其他网络的性能和HTR的方法。十个手写文本数据集的广泛实验表明,手写墨水增强和StackMix显着提高了HTR模型的质量
translated by 谷歌翻译
我们提出了一种自我监督的预培训方法,用于学习手写和印刷历史文档转录的丰富视觉语言表示。监督我们预先调整我们预先培训的编码器表示两种语言的低资源文件转录后,(1)异构手写伊斯兰制稿件图像和(2)早期现代英语印刷文件,我们展现了有意义的认可改善从划痕培训的同一监督模型的准确性,只需30个线图像转录即可训练。我们屏蔽的语言模型式预培训策略,其中模型训练,以便能够识别从同一行中采样的患者的真正蒙面的视觉表示,鼓励学习强大的上下文化语言表示不变于抄写方式和打印噪声横跨文件。
translated by 谷歌翻译
我们想要模型的文本单位是什么?从字节到多字表达式,可以在许多粒度下分析和生成文本。直到最近,大多数自然语言处理(NLP)模型通过单词操作,将那些作为离散和原子令牌处理,但从字节对编码(BPE)开始,基于次字的方法在许多领域都变得占主导地位,使得仍然存在小词汇表允许快速推断。是道路字符级模型的结束或字节级处理吗?在这项调查中,我们通过展示和评估基于学习分割的词语和字符以及基于子字的方法的混合方法以及基于学习的分割的杂交方法,连接多行工作。我们得出结论,对于所有应用来说,并且可能永远不会成为所有应用的银子弹奇异解决方案,并且严重思考令牌化对许多应用仍然很重要。
translated by 谷歌翻译
在本文中,我们试图通过引入深度学习模型的句法归纳偏见来建立两所学校之间的联系。我们提出了两个归纳偏见的家族,一个家庭用于选区结构,另一个用于依赖性结构。选区归纳偏见鼓励深度学习模型使用不同的单位(或神经元)分别处理长期和短期信息。这种分离为深度学习模型提供了一种方法,可以从顺序输入中构建潜在的层次表示形式,即更高级别的表示由高级表示形式组成,并且可以分解为一系列低级表示。例如,在不了解地面实际结构的情况下,我们提出的模型学会通过根据其句法结构组成变量和运算符的表示来处理逻辑表达。另一方面,依赖归纳偏置鼓励模型在输入序列中找到实体之间的潜在关系。对于自然语言,潜在关系通常被建模为一个定向依赖图,其中一个单词恰好具有一个父节点和零或几个孩子的节点。将此约束应用于类似变压器的模型之后,我们发现该模型能够诱导接近人类专家注释的有向图,并且在不同任务上也优于标准变压器模型。我们认为,这些实验结果为深度学习模型的未来发展展示了一个有趣的选择。
translated by 谷歌翻译
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译