Fig. 1: FigureSeer is an end-to-end framework for parsing result-figures in research papers. It automatically localizes figures, classifies them, and analyses their content (center). FigureSeer enables detailed indexing, retrieval, and redesign of result-figures, such as highlighting specific results (top-left), reformatting results (bottom-left), complex query answering (top-right), and results summarization (bottom-right).

Abstract. 'Which are the pedestrian detectors that yield a precision above 95% at 25% recall?' Answering such a complex query involves identifying and analyzing the results reported in figures within several research papers. Despite the availability of excellent academic search engines, retrieving such information poses a cumbersome challenge today, as these systems have primarily focused on understanding the text content of scholarly documents. In this paper, we introduce FigureSeer, an end-to-end framework for parsing result-figures that enables powerful search and retrieval of results in research papers. Our proposed approach automatically localizes figures from research papers, classifies them, and analyses the content of the result-figures. The key challenge in analyzing the figure content is the extraction of the plotted data and its association with the legend entries. We address this challenge by formulating a novel graph-based reasoning approach using a CNN-based similarity metric. We present a thorough evaluation on a real-world annotated dataset to demonstrate the efficacy of our approach.
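The parsing step described above — associating each extracted curve with the right legend entry — can be framed as a maximum-weight bipartite matching over pairwise similarity scores. The sketch below is a minimal illustration of that formulation, assuming the CNN-based similarity scores are already available as a plain matrix; it is not the paper's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate_curves_with_legend(similarity):
    """Assign each extracted curve to a legend entry.

    similarity: (n_curves, n_legend) matrix of appearance-similarity
    scores (e.g. produced by a learned patch-similarity model).
    Returns (curve_index, legend_index) pairs maximizing total similarity,
    i.e. a maximum-weight bipartite matching.
    """
    # linear_sum_assignment minimizes cost, so negate the similarities.
    rows, cols = linear_sum_assignment(-similarity)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy example: 3 curves, 3 legend entries with hypothetical scores.
sim = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.8, 0.3],
                [0.1, 0.3, 0.7]])
print(associate_curves_with_legend(sim))  # [(0, 0), (1, 1), (2, 2)]
```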
With the rise and development of deep learning, computer vision has been tremendously transformed and reshaped. As an important research area of computer vision, scene text detection and recognition has been inevitably influenced by this wave of revolution and has thereby entered the era of deep learning. In recent years, the community has made substantial progress in mindset, methodology, and performance. This survey aims to summarize and analyze the major changes and significant advances in scene text detection and recognition in the deep learning era. Through this article, we devote ourselves to: (1) introducing new insights and ideas; (2) highlighting recent techniques and benchmarks; (3) looking ahead to future trends. Specifically, we will emphasize the dramatic differences brought by deep learning and the grand challenges that still remain. We hope this review article can serve as a reference for researchers in this field. Related resources are also collected and compiled in our GitHub repository: https://github.com/Jyouhou/SceneTextPapers.
We introduce FigureQA, a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. Resolving such questions often requires reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as a strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.
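As a rough illustration of how template questions of this kind can be generated from the underlying numerical data (which the corpus ships for every figure), the following sketch produces maximum- and area-under-the-curve-style question-answer pairs from a toy set of named series. The wording and templates here are hypothetical stand-ins, not FigureQA's actual 15 templates.

```python
import numpy as np

def generate_questions(series):
    """series: dict mapping a plot-element name to its list of y-values.
    Yields (question, answer) pairs from two hypothetical templates."""
    aucs = {name: np.trapz(ys) for name, ys in series.items()}
    peaks = {name: max(ys) for name, ys in series.items()}
    best_auc = max(aucs, key=aucs.get)
    best_peak = max(peaks, key=peaks.get)
    for name in series:
        yield f"Does {name} have the maximum area under the curve?", name == best_auc
        yield f"Is {name} the curve with the highest value?", name == best_peak

data = {"AlexNet": [0.2, 0.4, 0.5], "VGG": [0.3, 0.5, 0.7]}
for question, answer in generate_questions(data):
    print(question, answer)
```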
Quality control is a fundamental component of many manufacturing processes, especially those involving casting or welding. However, manual quality control procedures are often time-consuming and error-prone. To meet the growing demand for high-quality products, the use of intelligent visual inspection systems is becoming essential in production lines. Recently, convolutional neural networks (CNNs) have shown outstanding performance in both image classification and localization tasks. In this paper, a system for identifying casting defects in X-ray images is proposed, based on the Mask Region-based CNN architecture. The proposed defect detection system performs defect detection and segmentation on input images simultaneously, making it suitable for a range of defect detection tasks. It is shown that training the network to simultaneously perform defect detection and defect instance segmentation leads to higher defect detection accuracy than defect detection alone. Transfer learning is leveraged to reduce the training data requirements and to improve the prediction accuracy of the trained model. More specifically, the model is first trained with two large, openly available image datasets before being fine-tuned on a relatively small metal-casting X-ray dataset. The accuracy of the trained model exceeds the state-of-the-art performance on the GRIMA database of X-ray images (GDXray) Castings dataset, and the system is fast enough to be used in a production setting. The system also performs well on the GDXray Welds dataset. Several in-depth studies are conducted to explore how transfer learning, multi-task learning, and multi-class learning affect the performance of the trained system.
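A transfer-learning setup of the kind described above — a Mask R-CNN pre-trained on large generic datasets and then fine-tuned for a small number of defect classes — can be sketched with torchvision as follows. This is the standard torchvision fine-tuning recipe and an assumption about the setup, not the authors' code; the class count and hidden size are placeholders.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def build_defect_maskrcnn(num_classes=2):  # background + "casting defect"
    # Start from COCO-pretrained weights (torchvision >= 0.13 API), then
    # replace the box and mask heads so the model predicts defect classes.
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    in_feats = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, num_classes)
    in_feats_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_feats_mask, 256, num_classes)
    return model

model = build_defect_maskrcnn()
# Fine-tune on the small X-ray casting dataset with a standard training loop.
```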
This paper introduces a novel rotation-based framework for arbitrary-oriented text detection in natural scene images. We present the Rotation Region Proposal Networks (RRPN), which are designed to generate inclined proposals with text orientation angle information. The angle information is then adapted for bounding box regression to make the proposals more accurately fit into the text region in terms of the orientation. The Rotation Region-of-Interest (RRoI) pooling layer is proposed to project arbitrary-oriented proposals to a feature map for a text region classifier. The whole framework is built upon a region-proposal-based architecture, which ensures the computational efficiency of the arbitrary-oriented text detection compared with previous text detection systems. We conduct experiments using the rotation-based framework on three real-world scene text detection datasets and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.
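The inclined proposals described above start from anchors that carry an explicit angle. A minimal sketch of such angle-augmented anchor generation (scales × ratios × angles at a single feature-map location) might look as follows; the particular scales, ratios, and angle set are illustrative, not the paper's exact configuration.

```python
import itertools
import math

def rotated_anchors(cx, cy, scales=(8, 16, 32), ratios=(0.2, 0.5, 1.0),
                    angles=(-60, -30, 0, 30, 60, 90)):
    """Return (cx, cy, w, h, angle_deg) anchors at one location."""
    anchors = []
    for s, r, a in itertools.product(scales, ratios, angles):
        w = s * math.sqrt(r)  # the ratio splits the scale into width/height
        h = s / math.sqrt(r)
        anchors.append((cx, cy, w, h, a))
    return anchors

print(len(rotated_anchors(16.0, 16.0)))  # 3 scales * 3 ratios * 6 angles = 54
```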
Vehicle detection with orientation estimation in aerial images has received widespread interest as it is important for intelligent traffic management. This is a challenging task, not only because of the complex background and relatively small size of the target, but also because of the various orientations of vehicles in aerial images captured from the top view. The existing methods for oriented vehicle detection need several post-processing steps to generate final detection results with orientation, which are not efficient enough. Moreover, they can only get discrete orientation information for each target. In this paper, we present an end-to-end single convolutional neural network to generate arbitrarily-oriented detection results directly. Our approach, named Oriented_SSD (Single Shot MultiBox Detector, SSD), uses a set of default boxes with various scales on each feature map location to produce detection bounding boxes. Meanwhile, offsets are predicted for each default box to better match the object shape; these offsets include an angle parameter used to generate the oriented bounding boxes. Evaluation results on the public DLR Vehicle Aerial dataset and the Vehicle Detection in Aerial Imagery (VEDAI) dataset demonstrate that our method can detect both the location and the orientation of vehicles with high accuracy and fast speed. For test images in the DLR Vehicle Aerial dataset with a size of 5616 × 3744, our method achieves 76.1% average precision (AP) and 78.7% correct direction classification at 5.17 s on an NVIDIA GTX-1060.
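The offset regression described above can be sketched as a small decoding step: each default box predicts offsets for centre, size, and rotation, which are applied to the prior to recover an oriented box. The parameterization below (centre shifts scaled by the prior size, log-space width/height, additive angle) is a common convention assumed here for illustration rather than taken from the paper.

```python
import math

def decode_oriented_box(default_box, offsets):
    """default_box: (cx, cy, w, h, angle_rad) prior at a feature-map location.
    offsets: (dx, dy, dw, dh, dtheta) predicted by the network.
    Returns the decoded oriented box (cx, cy, w, h, angle_rad)."""
    cx_a, cy_a, w_a, h_a, a_a = default_box
    dx, dy, dw, dh, dth = offsets
    cx = cx_a + dx * w_a    # centre shift, relative to the prior's size
    cy = cy_a + dy * h_a
    w = w_a * math.exp(dw)  # log-space size offsets
    h = h_a * math.exp(dh)
    angle = a_a + dth       # additive rotation offset
    return cx, cy, w, h, angle

print(decode_oriented_box((50, 50, 20, 10, 0.0), (0.1, -0.2, 0.0, 0.1, 0.3)))
```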
Scene text detection is an important step of a scene text recognition system and also a challenging problem. Different from general object detection, the main challenges of scene text detection lie in the arbitrary orientations, small sizes, and significantly variant aspect ratios of text in natural images. In this paper, we present an end-to-end trainable fast scene text detector, named TextBoxes++, which detects arbitrary-oriented scene text with both high accuracy and efficiency in a single network forward pass. No post-processing other than an efficient non-maximum suppression is involved. We have evaluated the proposed TextBoxes++ on four public datasets. In all experiments, TextBoxes++ outperforms competing methods in terms of text localization accuracy and runtime. More specifically, TextBoxes++ achieves an f-measure of 0.817 at 11.6 fps for 1024×1024 ICDAR 2015 Incidental text images, and an f-measure of 0.5591 at 19.8 fps for 768×768 COCO-Text images. Furthermore, combined with a text recognizer, TextBoxes++ significantly outperforms the state-of-the-art approaches for word spotting and end-to-end text recognition tasks on popular benchmarks. Code is available at: https://github.com/MhLiao/TextBoxes_plusplus.
Annotating a large number of training images is very time-consuming. Against this background, this paper focuses on learning from easily acquired web data and leveraging the learned models for fine-grained image classification on labeled datasets. Currently, the performance gain obtained by training with web data is incremental — as the saying goes, "better than nothing, but not by much". Conventionally, the community has tried to correct noisy web labels and to select informative samples. In this work, we first systematically study the built-in gap between web and standard datasets, i.e., the different data distributions of the two kinds of data. Then, in addition to using web labels, we propose an unsupervised object-localization method that provides critical insight into the density and scale of objects in web images. Specifically, we design two web-data constraints to substantially reduce the difference between the web and standard data distributions. First, we propose a method to control the scale, localization, and number of objects in the detected regions. Second, we propose to select regions containing objects that are consistent with the web labels. Based on these two constraints, we are able to process web images to narrow the gap, and the processed web data are used to better assist a standard dataset in training CNNs. Experiments on several fine-grained image classification datasets confirm that our method compares favorably with state-of-the-art approaches.
Recognition of text in natural scene images is becoming a prominent research area due to the widespread availability of imaging devices in low-cost consumer products like mobile phones. To evaluate the performance of recent algorithms in detecting and recognizing text from complex images, the ICDAR 2011 Robust Reading Competition was organized. Challenge 2 of the competition dealt specifically with detecting/recognizing text in natural scene images. This paper presents an overview of the approaches that the participants used, the evaluation measure, and the dataset used in Challenge 2 of the contest. We also report the performance of all participating methods for the text localization and word recognition tasks and compare their results using standard methods of area precision/recall and edit distance.
Automatic firearm detection is essential for enhancing the security and safety of people, but it is a challenging task owing to the wide variations in the shape, size, and appearance of firearms. To handle these challenges, we propose an Orientation Aware Object Detector (OAOD) with improved firearm detection and localization performance. The proposed detector has two stages. In Stage 1, it predicts the orientation of the object, which is used to rotate the object proposals. Maximum-area rectangles are selected from the rotated object proposals and are again classified and localized in Stage 2 of the algorithm. The oriented object proposals are mapped back to the original coordinates, yielding oriented bounding boxes that localize weapons better than axis-aligned bounding boxes. Being explicitly orientation-aware, our non-maximum suppression is able to avoid multiple detections of the same object and can better resolve objects that lie close to each other. This two-stage system leverages OAOD to predict oriented bounding boxes while being trained only on axis-aligned boxes in the ground truth. To train the object detector for firearm detection, a dataset consisting of approximately eleven thousand firearm images was collected from the Internet and manually annotated. The proposed ITU Firearms (ITUF) dataset contains a wide variety of guns and rifles. The OAOD algorithm is evaluated on the ITUF dataset and compared with current state-of-the-art object detectors. Our experiments demonstrate the excellent performance of the proposed detector on the firearm detection task.
The requirement for large amounts of annotated training data has become a common constraint on various deep learning systems. In this paper, we propose a weakly supervised scene text detection method (WeText) that trains robust and accurate scene text detection models by learning from unannotated or weakly annotated data. With a "light" supervised model trained on a small fully annotated dataset, we explore semi-supervised and weakly supervised learning on a large unannotated dataset and a large weakly annotated dataset, respectively. For the semi-supervised learning, the light supervised model is applied to the unannotated dataset to search for more character training samples, which are further combined with the small annotated dataset to retrain a superior character detection model. For the weakly supervised learning, the character searching is guided by high-level annotations of words/text lines that are widely available and also much easier to prepare. In addition, we design a unified scene character detector by adapting regression-based deep networks, which greatly relieves the error accumulation issue that widely exists in most traditional approaches. Extensive experiments across different unannotated and weakly annotated datasets show that the scene text detection performance can be clearly boosted under both scenarios, where the weakly supervised learning can achieve state-of-the-art performance by using only 229 fully annotated scene text images.
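The semi-supervised mining step — apply the lightly supervised model to unannotated data, keep only its confident predictions as extra training samples, and retrain — is essentially a self-training loop. The toy sketch below shows that control flow with a scikit-learn classifier on synthetic features purely for illustration; it is not the WeText character detector.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_small, y_small = rng.normal(size=(40, 5)), rng.integers(0, 2, 40)  # small labeled set
X_unlabeled = rng.normal(size=(400, 5))                              # large unlabeled set

model = LogisticRegression().fit(X_small, y_small)  # the "light" supervised model

for _ in range(3):  # a few mining-and-retraining rounds
    proba = model.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) > 0.9              # keep confident pseudo-labels only
    X_mined = X_unlabeled[confident]
    y_mined = model.predict(X_unlabeled)[confident]
    X_train = np.vstack([X_small, X_mined])
    y_train = np.concatenate([y_small, y_mined])
    model = LogisticRegression().fit(X_train, y_train)  # retrain on the combined data
```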
In this paper, we propose a novel scene text detection method named TextMountain. The key idea of TextMountain is to make full use of border-center information. Unlike previous works that treat the center-border region as a binary classification problem, we predict the text center-border probability (TCBP) and the text center direction (TCD). TCBP is like a mountain whose top is the text center and whose foot is the text border. The mountaintop can separate text instances, which is not easily achieved with a semantic segmentation map alone, and its uphill direction can plan a road to the top for each pixel at the mountain foot during the grouping stage. TCD helps TCBP learn better. Our labeling rules do not introduce the ambiguity problem of angle transformation, so the proposed method is suited to multi-oriented text and can also handle curved text well. In the inference stage, each pixel at the mountain foot needs to search for a path to the mountaintop, and this process can be carried out efficiently in parallel, which accounts for the efficiency of our method compared with others. Experiments on the MLT, ICDAR 2015, RCTW-17, and SCUT-CTW1500 databases demonstrate that the proposed method achieves better or comparable performance in terms of both accuracy and efficiency. It is worth mentioning that our method achieves an F-measure of 76.85% on MLT, outperforming previous methods. The code will be made available.
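The grouping stage sketched above — every pixel on the mountain foot follows the predicted center direction uphill until it reaches a mountaintop (text-center) region and inherits that instance's label — can be written down compactly as follows. The sketch assumes the center-probability and direction maps are already predicted and takes simple quantized steps; it is an illustrative reconstruction, not the paper's parallel implementation.

```python
import numpy as np
from scipy import ndimage

def group_pixels(tcbp, direction, center_thr=0.7, text_thr=0.3, max_steps=50):
    """tcbp: (H, W) text center-border probability map; direction: (H, W, 2)
    unit vectors (dy, dx) pointing toward the text center. Returns an
    instance-label map of the same shape as tcbp."""
    centers, _ = ndimage.label(tcbp > center_thr)  # mountaintops = text instances
    instance = centers.copy()
    h, w = tcbp.shape
    ys, xs = np.nonzero((tcbp > text_thr) & (centers == 0))  # foot pixels to climb
    for y, x in zip(ys, xs):
        cy, cx = y, x
        for _ in range(max_steps):  # walk uphill one quantized step at a time
            dy, dx = direction[cy, cx]
            cy = int(np.clip(cy + int(np.rint(dy)), 0, h - 1))
            cx = int(np.clip(cx + int(np.rint(dx)), 0, w - 1))
            if centers[cy, cx] > 0:  # reached a mountaintop
                instance[y, x] = centers[cy, cx]
                break
    return instance

# Tiny demo: one text instance whose center is the middle column.
tcbp = np.array([[0.4, 0.6, 0.9, 0.6, 0.4]])
direction = np.zeros((1, 5, 2))
direction[0, :2] = [0, 1]   # left side points right, toward the center
direction[0, 3:] = [0, -1]  # right side points left
print(group_pixels(tcbp, direction))  # [[1 1 1 1 1]]
```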
This paper analyzes, compares, and contrasts technical challenges, methods, and the performance of text detection and recognition research in color imagery. It summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems. Existing techniques are categorized as either stepwise or integrated and sub-problems are highlighted including text localization, verification, segmentation and recognition. Special issues associated with the enhancement of degraded text and the processing of video text, multi-oriented, perspectively distorted and multilingual text are also addressed. The categories and sub-categories of text are illustrated, benchmark datasets are enumerated, and the performance of the most representative approaches is compared. This review provides a fundamental comparison and analysis of the remaining problems in the field.
Text, as one of the most influential inventions of humanity, has played an important role in human life since ancient times. The rich and precise information embodied in text is very useful in a wide range of vision-based applications; therefore, text detection and recognition in natural scenes have become important, active research topics in computer vision and document analysis. Especially in recent years, the community has seen a surge of research efforts and substantial progress in these fields, though a variety of challenges (e.g. noise, blur, distortion, occlusion and variation) still remain. The purposes of this survey are threefold: (1) introduce up-to-date works, (2) identify state-of-the-art algorithms, and (3) predict potential research directions in the future. Moreover, this paper provides comprehensive links to publicly available resources, including benchmark datasets, source codes, and online demos. In summary, this literature review can serve as a good reference for researchers in the areas of scene text detection and recognition.
Text data present in images and video contain useful information for automatic annotation, indexing, and structuring of images. Extraction of this information involves detection, localization, tracking, extraction, enhancement, and recognition of the text from a given image. However, variations of text due to differences in size, style, orientation, and alignment, as well as low image contrast and complex background, make the problem of automatic text extraction extremely challenging. While comprehensive surveys of related problems such as face detection, document analysis, and image & video indexing can be found, the problem of text information extraction is not well surveyed. A large number of techniques have been proposed to address this problem, and the purpose of this paper is to classify and review these algorithms, discuss benchmark data and performance evaluation, and point out promising directions for future research.
Unmanned aerial vehicles (UAVs) are increasingly used for surveillance and traffic monitoring thanks to their high mobility and their ability to cover areas at different altitudes and locations. One of the major challenges is to use aerial images to accurately detect cars and count them in real time for traffic-monitoring purposes. Several deep learning techniques based on convolutional neural networks (CNNs) have recently been proposed for real-time classification and recognition in computer vision. However, their performance depends on the scenarios in which they are used. In this paper, we investigate the performance of two state-of-the-art CNN algorithms, namely Faster R-CNN and YOLOv3, for detecting cars in aerial images. We trained and tested these two models on a large car dataset captured from UAVs. We show in this paper that YOLOv3 outperforms Faster R-CNN in sensitivity and processing time, although they are comparable in the precision metric.
Recently, models based on deep neural networks have dominated the fields of text detection and recognition. In this paper, we investigate the problem of scene text spotting, which aims at simultaneous text detection and recognition in natural images. An end-to-end trainable neural network model for scene text spotting is proposed. The proposed model, named Mask TextSpotter, is inspired by the newly published work Mask R-CNN. Unlike previous methods that also accomplish text spotting with end-to-end trainable deep neural networks, Mask TextSpotter takes advantage of a simple, smooth end-to-end learning procedure, in which precise text detection and recognition are obtained via semantic segmentation. Moreover, it is superior to previous methods in handling text instances of irregular shapes (e.g., curved text). Experiments on ICDAR 2013, ICDAR 2015, and Total-Text demonstrate that the proposed method achieves state-of-the-art results in both scene text detection and end-to-end text recognition tasks.
Textual information found in scene images provides high-level semantic information about the image and its context, and can be exploited for better scene understanding. In this paper, we address the problem of scene text retrieval: given a text query, the system must return all images that contain the queried text. The novelty of the proposed model lies in the use of a single-shot CNN architecture that simultaneously predicts bounding boxes and a compact text representation of the words within them. In this way, the text-based image retrieval task can be cast as a simple nearest-neighbour search of the query text representation over the CNN outputs for the entire image database. Our experiments demonstrate that the proposed architecture outperforms the previous state of the art while offering a significant increase in processing speed.
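The retrieval step described above — casting text-based image search as a nearest-neighbour search of the query's text representation over the per-box representations predicted for the whole database — can be sketched with plain cosine similarity, as below. The random embeddings are stand-ins for the learned representation, which is not reproduced here.

```python
import numpy as np

def cosine_scores(query_vec, db_vecs):
    """query_vec: (d,) text representation of the query word.
    db_vecs: (n_boxes, d) representations predicted for every detected box
    across the image database. Returns one similarity score per box."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    d = db_vecs / (np.linalg.norm(db_vecs, axis=1, keepdims=True) + 1e-8)
    return d @ q

# Toy database: each detected box remembers which image it came from.
db_vecs = np.random.rand(1000, 128)             # stand-in box text embeddings
box_to_image = np.random.randint(0, 200, 1000)  # owning image id per box
query = np.random.rand(128)                     # stand-in query embedding

scores = cosine_scores(query, db_vecs)
top5 = np.argsort(-scores)[:5]
print("best-matching boxes:", top5, "from images:", box_to_image[top5])
```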
Ship detection is of great importance and full of challenges in the field of remote sensing. The complexity of application scenarios, the redundancy of detection regions, and the difficulty of dense ship detection are the main obstacles that keep traditional methods from succeeding at ship detection. In this paper, we propose a brand-new detection model based on a multiscale rotational region convolutional neural network to solve the problems above. The model mainly consists of five consecutive parts: a Dense Feature Pyramid Network (DFPN), adaptive region-of-interest (ROI) alignment, rotational bounding-box regression, forward direction prediction, and rotational non-maximum suppression (R-NMS). First, low-level location information and high-level semantic information are fully utilized through the multiscale feature network. Then, we design adaptive ROI alignment to obtain high-quality proposals that preserve complete spatial and semantic information. Unlike most previous methods, the predictions obtained by our method are the minimum bounding rectangles of the objects, with fewer redundant regions. Therefore, the rotational-region detection framework is better suited to detecting dense objects than traditional detection models. In addition, the predictions allow us to determine the berthing and sailing direction of ships. Detailed evaluations on the SRSS and DOTA rotation detection datasets show that our detection method is competitive.
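Rotational non-maximum suppression (R-NMS), the final stage listed above, differs from ordinary NMS only in how overlap is measured: the IoU is computed between rotated rectangles rather than axis-aligned ones. A minimal sketch using shapely polygons follows; the corner-point box representation and the 0.3 threshold are assumptions for illustration, not the paper's settings.

```python
import numpy as np
from shapely.geometry import Polygon

def rotated_nms(quads, scores, iou_thr=0.3):
    """quads: list of 4x2 corner arrays for rotated boxes; scores: confidences.
    Returns the indices of boxes kept after rotated NMS."""
    polys = [Polygon(q) for q in quads]
    order = np.argsort(scores)[::-1]  # process highest-scoring boxes first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        remaining = []
        for j in order[1:]:
            inter = polys[i].intersection(polys[j]).area
            union = polys[i].area + polys[j].area - inter
            if union == 0 or inter / union <= iou_thr:  # keep non-overlapping boxes
                remaining.append(j)
        order = np.array(remaining, dtype=int)
    return keep

quads = [np.array([[0, 0], [4, 0], [4, 2], [0, 2]]),
         np.array([[1, 0], [5, 1], [4.5, 3], [0.5, 2]]),
         np.array([[10, 10], [14, 10], [14, 12], [10, 12]])]
print(rotated_nms(quads, np.array([0.9, 0.8, 0.7])))
```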
This paper presents a scene text detection technique that exploits bootstrapping and text-border semantics to accurately localize text content. A novel bootstrapping technique is designed that samples multiple subsections of a word or text line, which effectively relieves the constraint of limited training data. At the same time, the repeated sampling of text subsections improves the consistency of the predicted text feature maps, which is critical for predicting a single complete box rather than multiple broken boxes for long words or text lines. In addition, a semantics-aware text-border detection technique is designed that produces four types of text-border segments for each scene text. With the semantics-aware text borders, scene text can be localized more accurately by regressing text pixels around the ends of a word or text line instead of all text pixels, which often leads to inaccurate localization when dealing with long words or text lines. Extensive experiments demonstrate the effectiveness of the proposed techniques, and superior performance is obtained on several public datasets, e.g., an f-score of 80.1 on MSRA-TD500 and 67.1 on ICDAR2017-RCTW.
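The bootstrapping idea above — repeatedly sampling shorter subsections of each annotated word or text line so that a long instance contributes many partial training samples — can be illustrated as a simple augmentation of axis-aligned word boxes. The cropping ratios below are placeholders, not the paper's settings, and real text lines would additionally carry an orientation.

```python
import random

def sample_subsections(box, n=4, min_frac=0.5):
    """box: (x1, y1, x2, y2) of an annotated word or text line. Returns n boxes
    that keep the full height but cover a random horizontal subsection."""
    x1, y1, x2, y2 = box
    width = x2 - x1
    subs = []
    for _ in range(n):
        frac = random.uniform(min_frac, 1.0)      # keep at least half the word
        offset = random.uniform(0.0, 1.0 - frac)  # random start inside the word
        subs.append((x1 + offset * width, y1, x1 + (offset + frac) * width, y2))
    return subs

print(sample_subsections((10, 20, 210, 60)))
```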