Scene text recognition (STR) is the task of reading text in cropped images of natural scenes. Conventional STR models employ a convolutional neural network (CNN) followed by a recurrent neural network in an encoder-decoder framework. Recently, the transformer architecture has been widely adopted in STR because it shows a strong capability for capturing long-term dependency, which appears to be prominent in scene text images. Many researchers have used the transformer as part of a hybrid CNN-transformer encoder, often followed by a transformer decoder. However, such methods only make use of the long-term dependency mid-way through the encoding process. Although the vision transformer (ViT) is able to capture such dependency at an early stage, its use remains largely unexplored in STR. This work proposes a transformer-only model as a simple baseline which outperforms hybrid CNN-transformer models. Furthermore, two key areas for improvement were identified. Firstly, the first decoded character has the lowest prediction accuracy. Secondly, images of different original aspect ratios react differently to the patch resolutions, while ViT employs only one fixed patch resolution. To explore these areas, Pure Transformer with Integrated Experts (PTIE) is proposed. PTIE is a transformer model that can process multiple patch resolutions and decode in both the original and reverse character orders. It is examined on 7 commonly used benchmarks and compared with over 20 state-of-the-art methods. The experimental results show that the proposed method outperforms them and obtains state-of-the-art results in most benchmarks.
Scene text images have different shapes and are subject to various distortions, e.g. perspective distortions. To handle these challenges, the state-of-the-art methods rely on a rectification network, which is connected to the text recognition network. They form a linear pipeline which applies text rectification to all input images, even images that can be recognized without it. Undoubtedly, the rectification network improves the overall text recognition performance. However, in some cases, the rectification network introduces unnecessary distortions, resulting in incorrect predictions for images that would otherwise have been recognized correctly. To alleviate the unnecessary distortions, the portmanteauing of features is proposed. The portmanteau feature, inspired by the portmanteau word, is a feature containing information from both the original text image and the rectified image. To generate the portmanteau feature, a non-linear input pipeline with a block matrix initialization is presented. In this work, the transformer is chosen as the recognition network due to its use of attention and inherent parallelism, which can effectively handle the portmanteau feature. The proposed method is examined on 6 benchmarks and compared with 13 state-of-the-art methods. The experimental results show that the proposed method outperforms the state-of-the-art methods on several of the benchmarks.
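Scene text recognition (STR) is an important bridge between images and text and has attracted abundant research attention. While convolutional neural networks (CNNs) have achieved remarkable progress in this task, most existing works need an extra module (a context modeling module) to help the CNN capture global dependencies, in order to address its inductive bias and strengthen the relationships between text features. Recently, the transformer has been proposed as a promising network for global context modeling through the self-attention mechanism, but one of its main shortcomings when applied to recognition is efficiency. We propose a 1-D split to address the complexity challenge and replace the CNN with a transformer encoder to reduce the need for a context modeling module. Furthermore, recent methods use a frozen initial embedding to guide the decoder in decoding the features to text, leading to a loss of accuracy. We propose to use a learnable initial embedding learned from the transformer encoder, making it adaptive to different input images. Above all, we introduce a novel text recognition architecture, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG), composed of three stages (transformation, feature extraction, and prediction). Extensive experiments show that our approach can achieve state-of-the-art results on text recognition benchmarks.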
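Text recognition is a long-standing research problem in document digitization. Existing approaches are usually built on CNNs for image understanding and RNNs for char-level text generation. In addition, another language model is usually needed as a post-processing step to improve the overall accuracy. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on printed, handwritten and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.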
Leveraging advances in natural language processing, most recent scene text recognizers adopt an encoder-decoder architecture where text images are first converted to representative features and then to a sequence of characters via 'sequential decoding'. However, scene text images suffer from rich noise from different sources, such as complex backgrounds and geometric distortions, which often confuses the decoder and leads to incorrect alignment of visual features at noisy decoding time steps. This paper presents I2C2W, a novel scene text recognition technique that is tolerant to geometric and photometric degradation by decomposing scene text recognition into two inter-connected tasks. The first task focuses on image-to-character (I2C) mapping, which detects a set of character candidates from images based on different alignments of visual features in a non-sequential way. The second task tackles character-to-word (C2W) mapping, which recognizes scene text by decoding words from the detected character candidates. Learning directly from character semantics (instead of noisy image features) effectively corrects falsely detected character candidates, which greatly improves the final text recognition accuracy. Extensive experiments over nine public datasets show that the proposed I2C2W outperforms the state-of-the-art by large margins on challenging scene text datasets with various curvature and perspective distortions. It also achieves very competitive recognition performance on multiple normal scene text datasets.
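Context-aware STR methods typically use internal autoregressive (AR) language models (LMs). Inherent limitations of AR models motivated two-stage methods which employ an external LM. The conditional independence of the external LM from the input image may cause it to erroneously rectify correct predictions, leading to significant inefficiencies. Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling. It unifies context-free non-AR and context-aware AR inference, as well as iterative refinement using bidirectional context. Using synthetic training data, PARSeq achieves state-of-the-art (SOTA) results on STR benchmarks (91.9% accuracy) and on more challenging datasets. It establishes new SOTA results (96.0% accuracy) when trained on real data. PARSeq is optimal on accuracy versus parameter count, FLOPS, and latency because of its simple, unified structure and parallel token processing. Due to its extensive use of attention, it is robust on arbitrarily oriented text, which is common in real-world images. Code, pretrained weights, and data are available at: https://github.com/baudm/parseq.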
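The attention-based encoder-decoder framework is widely used in the scene text recognition task. However, for the current state-of-the-art (SOTA) methods, there is room for improvement in the efficient use of the local visual and global context information of the input text image, as well as in the robust correlation between the scene processing module (encoder) and the text processing module (decoder). In this paper, we propose a Representation and Correlation Enhanced Encoder-Decoder Framework (RCEED) to address these deficiencies and break the performance bottleneck. In the encoder module, local visual features, global context features, and position information are aligned and fused to generate a small-size comprehensive feature map. In the decoder module, two methods are used to enhance the correlation between the scene and text feature spaces. 1) The decoder initialization is guided by the holistic feature and the global glimpse vector exported from the encoder. 2) The feature-enriched glimpse vector produced by Multi-Head General Attention is used to assist the RNN iteration and the character prediction at each time step. Meanwhile, we also design a LayerNorm-Dropout LSTM cell to improve the model's generalization to changeable texts. Extensive experiments on the benchmarks demonstrate the advantageous performance of RCEED in scene text recognition tasks, especially the irregular ones.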
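Modeling semantic information is helpful for scene text recognition. In this work, we propose to model semantic and visual information jointly with a Visual-Semantic Transformer (VST). The VST first explicitly extracts primary semantic information from visual feature maps with a transformer module and a primary visual-semantic alignment module. The semantic information is then joined with the visual feature maps (viewed as a sequence) to form a pseudo multi-domain sequence combining visual and semantic information, which is subsequently fed into a transformer-based interaction module to enable learning of the interactions between visual and semantic features. In this way, the visual features can be enhanced by the semantic information and vice versa. The enhanced visual features are further decoded by a secondary visual-semantic alignment module which shares weights with the primary one. Finally, the decoded visual features and the enhanced semantic features are jointly processed by a third transformer module to obtain the final text prediction. Experiments on seven public benchmarks, including regular and irregular text recognition datasets, verify the effectiveness of our proposed model, which reaches state-of-the-art results on four of the seven benchmarks.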
Scene text spotting is of great importance to the computer vision community due to its wide variety of applications. Recent methods attempt to introduce linguistic knowledge for challenging recognition rather than pure visual classification. However, how to effectively model the linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from 1) implicit language modeling; 2) unidirectional feature representation; and 3) a language model with noisy input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet++ for scene text spotting. Firstly, the autonomous principle suggests enforcing explicit language modeling by decoupling the recognizer into a vision model and a language model and blocking gradient flow between the two models. Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose an iterative correction scheme for the language model, which can effectively alleviate the impact of noisy input. Finally, to polish ABINet++ for long text recognition, we propose to aggregate horizontal features by embedding Transformer units inside a U-Net, and design a position and content attention module which integrates character order and content to attend to character features precisely. ABINet++ achieves state-of-the-art performance on both scene text recognition and scene text spotting benchmarks, which consistently demonstrates the superiority of our method in various environments, especially on low-quality images. In addition, extensive experiments in English and Chinese also show that a text spotter that incorporates our language modeling method can significantly improve its performance in both accuracy and speed compared with commonly used attention-based recognizers.
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
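To make "sequences of image patches" concrete, the following minimal sketch splits a small single-channel image into non-overlapping patches and flattens each one into a token. The function name `image_to_patches`, the toy 4x8 image, and the 2x4 patch size are illustrative assumptions, not details taken from the abstract; in ViT the flattened patches would additionally be linearly projected and combined with position embeddings, which is omitted here.

```python
def image_to_patches(image, patch_h, patch_w):
    """Split a 2-D image (list of rows) into flattened, non-overlapping
    patches, returned in row-major order of patch position."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            # Flatten one patch row by row into a single token vector.
            patch = [
                image[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            ]
            patches.append(patch)
    return patches


# Hypothetical 4x8 "image" whose pixel value encodes its position.
image = [[r * 8 + c for c in range(8)] for r in range(4)]
patches = image_to_patches(image, patch_h=2, patch_w=4)
print(len(patches), len(patches[0]))  # 4 patches, 8 values each
```

Changing `patch_h` and `patch_w` changes both the sequence length and the content of each token, which is why the choice of patch resolution interacts with the aspect ratio of the input image.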
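A novel scene text recognizer based on a Vision-Language Transformer (VLT) is presented. Inspired by the Levenshtein Transformer from the field of NLP, the proposed method (named Levenshtein OCR, or LevOCR for short) explores an alternative way of automatically transcribing textual content from cropped natural images. Specifically, we cast the problem of scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer to interact and fuse with the visual features, so as to progressively approximate the ground truth. The refinement process is accomplished via two basic character-level operations, deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change and good interpretability. Quantitative experiments clearly demonstrate that LevOCR achieves state-of-the-art performance on standard benchmarks, and qualitative analyses verify the effectiveness and advantages of the proposed LevOCR algorithm. Code will be released soon.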
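The two character-level operations can be illustrated with a small sketch of one refinement step. This only shows the mechanics of applying deletions and insertions to a character sequence; the helper names `apply_deletions` and `apply_insertions`, the example word, and the hand-written edit decisions are assumptions for illustration. In the described method those decisions would be predicted by the model, which is not reproduced here.

```python
def apply_deletions(chars, delete_flags):
    """Drop every character whose flag is True (all positions at once)."""
    return [c for c, flag in zip(chars, delete_flags) if not flag]


def apply_insertions(chars, insertions):
    """Insert characters before given positions.

    `insertions` maps an index i to the list of characters to place
    before chars[i]; the key len(chars) appends at the end.
    """
    out = []
    for i, c in enumerate(chars):
        out.extend(insertions.get(i, []))
        out.append(c)
    out.extend(insertions.get(len(chars), []))
    return out


# Hypothetical initial prediction "hcuse" for an image that reads "house".
initial = list("hcuse")
after_delete = apply_deletions(initial, [False, True, False, False, False])
refined = apply_insertions(after_delete, {1: ["o"]})
print("".join(after_delete), "".join(refined))
```

Because each operation is applied to all positions in one pass, the sequence length can shrink or grow between iterations, which is the "dynamic length change" mentioned above.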
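In recent decades, scene text recognition has gained worldwide attention from both the academic community and actual users due to its importance in a wide range of applications. Despite achievements in optical character recognition, scene text recognition remains challenging due to inherent problems such as distortions or irregular layouts. Most existing approaches mainly leverage recurrent or convolution-based neural networks. However, recurrent neural networks (RNNs) usually suffer from slow training speed due to sequential computation and encounter problems such as vanishing gradients or bottlenecks, while CNNs face a trade-off between complexity and performance. In this paper, we introduce SAFL, a self-attention-based neural network model with focal loss for scene text recognition, to overcome the limitations of existing approaches. Using focal loss instead of negative log-likelihood helps the model focus more on training with low-frequency samples. Moreover, to deal with distorted and irregular text, we use a Spatial Transformer Network (STN) to rectify the text before passing it to the recognition network. We perform experiments to compare the performance of the proposed model with other methods on seven benchmarks. The numerical results show that our model achieves the best performance.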
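The difference between focal loss and negative log-likelihood can be seen in a small numerical sketch. It uses the standard focal loss form, -(1 - p_t)^gamma * log(p_t), for a single prediction; the value gamma = 2.0, the absence of a class-weighting term, and the example probabilities are assumptions for illustration, since the abstract does not state the configuration used by SAFL.

```python
import math


def nll_loss(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Focal loss: NLL scaled by (1 - p_t)^gamma, which shrinks the
    contribution of predictions that are already confident."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Hypothetical predicted distributions over three characters.
easy = [0.9, 0.05, 0.05]  # target class 0 predicted confidently
hard = [0.2, 0.5, 0.3]    # target class 0 predicted poorly

for name, probs in (("easy", easy), ("hard", hard)):
    print(name, round(nll_loss(probs, 0), 4), round(focal_loss(probs, 0), 4))
```

With gamma set to 0 the focal loss reduces to plain negative log-likelihood; larger values of gamma shift more of the total loss toward poorly predicted samples.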
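Image outpainting, which has been well studied with convolutional neural network (CNN) based frameworks, has recently drawn more attention in computer vision. However, CNNs rely on inherent inductive biases to achieve effective sample learning, which may lower the performance ceiling. In this paper, motivated by the flexible self-attention mechanism with minimal inductive biases in the transformer architecture, we reframe the generalized image outpainting problem as a patch-wise sequence-to-sequence autoregression problem, enabling query-based image outpainting. Specifically, we propose a novel hybrid vision-transformer-based encoder-decoder framework, named Query Outpainting TRansformer (QueryOTR), for extrapolating visual context on all sides of a given image. The global modeling capacity of the patch-wise mode allows us to extrapolate images from the query standpoint of the attention mechanism. A novel Query Expansion Module (QEM) is designed to integrate information from the predicted queries based on the encoder's output, thereby accelerating the convergence of the pure transformer even with a relatively small dataset. To further enhance the connectivity between patches, the proposed Patch Smoothing Module (PSM) re-allocates and averages the overlapping regions, thus providing seamless predicted images. We show experimentally that QueryOTR can generate visually appealing results smoothly and realistically compared with state-of-the-art image outpainting approaches.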
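Scene text recognition (STR) has been an active research topic in computer vision for years. To tackle this challenging problem, numerous innovative methods have been proposed, and incorporating linguistic knowledge into STR models has recently become a prominent trend. In this work, we first draw inspiration from the recent progress in Vision Transformer (ViT) to construct a conceptually simple yet powerful vision STR model, which is built upon ViT and outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods. To integrate linguistic knowledge, we further propose a Multi-Granularity Prediction strategy to inject information from the language modality into the model in an implicit way: subword representations widely used in NLP (BPE and WordPiece) are introduced into the output space, in addition to the conventional character-level representation, while no independent language model (LM) is adopted. The resulting algorithm (termed MGP-STR) is able to push the performance envelope of STR to an even higher level. Specifically, it achieves an average recognition accuracy of 93.35% on standard benchmarks. Code will be released soon.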
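Linguistic knowledge has brought great benefits to scene text recognition by providing semantics to refine character sequences. However, since linguistic knowledge has been applied individually to the output sequence, previous methods have not fully used the semantics to understand visual clues for text recognition. This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN), that enables interactions between visual and semantic features for better recognition performance. Specifically, MATRN identifies visual and semantic feature pairs and encodes spatial information into the semantic features. Based on the spatial encoding, visual and semantic features are enhanced by referring to related features in the other modality. Furthermore, MATRN encourages combining semantic features into visual features by hiding visual clues related to the character during the training phase. Our experiments demonstrate that MATRN achieves state-of-the-art performance on seven benchmarks by large margins, while naive combinations of the two modalities show only marginal improvements. Further ablation studies prove the effectiveness of our proposed components. Our implementation will be made publicly available.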
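The attention-based encoder-decoder framework is becoming popular in scene text recognition, largely due to its superiority in integrating recognition clues from both the visual and semantic domains. However, recent studies show that the two clues might be misaligned in difficult text (e.g., with rare text shapes) and introduce constraints such as character position to alleviate the problem. Despite some success, a content-free positional embedding can hardly be associated stably with meaningful local image regions. In this paper, we propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position encoding. MDCDP uses the positional embedding to query both visual and semantic features following the attention mechanism. It naturally encodes the positional clue, which describes both the visual and semantic distances among characters. We develop a novel architecture named CDistNet that stacks MDCDP several times to guide precise distance modeling. Thus, the visual-semantic alignment is well built even when various difficulties are present. We apply CDistNet to two augmented datasets and six public benchmarks. The experiments demonstrate that CDistNet achieves state-of-the-art recognition accuracy, while the visualization also shows that CDistNet achieves proper attention localization in both the visual and semantic domains. We will release our code upon acceptance.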
Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes. Recent STR models benefit from taking linguistic information into consideration in addition to visual cues. We propose novel Masked Vision-Language Transformers (MVLT) to capture both the explicit and the implicit linguistic information. Our encoder is a Vision Transformer, and our decoder is a multi-modal Transformer. MVLT is trained in two stages: in the first stage, we design an STR-tailored pretraining method based on a masking strategy; in the second stage, we fine-tune our model and adopt an iterative correction method to improve the performance. MVLT attains superior results compared to state-of-the-art STR models on several benchmarks. Our code and model are available at https://github.com/onealwj/MVLT.
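Transformers have achieved great success in natural language processing. Owing to the powerful capability of the self-attention mechanism in transformers, researchers have developed vision transformers for a variety of computer vision tasks, such as image recognition, object detection, image segmentation, pose estimation, and 3D reconstruction. This paper presents a comprehensive overview of the literature on different architecture designs and training tricks (including self-supervised learning) for vision transformers. Our goal is to provide a systematic review together with open research opportunities.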
Multivariate time series forecasting (MTSF) is a fundamental problem in numerous real-world applications. Recently, the Transformer has become the de facto solution for MTSF, especially for long-term cases. However, except for the one forward operation, the basic configurations in existing MTSF Transformer architectures have rarely been carefully verified. In this study, we point out that the current tokenization strategy in MTSF Transformer architectures ignores the token uniformity inductive bias of Transformers. As a result, the vanilla MTSF transformer struggles to capture details in time series and shows inferior performance. Based on this observation, we make a series of evolutionary changes to the basic architecture of the vanilla MTSF transformer. We vary the flawed tokenization strategy, along with the decoder structure and embeddings. Surprisingly, the evolved simple transformer architecture is highly effective: it avoids the over-smoothing phenomenon in the vanilla MTSF transformer, achieves more detailed and accurate predictions, and even substantially outperforms state-of-the-art Transformers that are well designed for MTSF.
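The goal of this paper is to learn strong lip reading models that can recognize speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets while using an order of magnitude less data. Our best model achieves a 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.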