This paper presents an approach for grounding phrases in images that jointly learns multiple text-conditioned embeddings in a single end-to-end model. To differentiate text phrases into semantically distinct subspaces, we propose a concept weight branch that automatically assigns phrases to embeddings, whereas prior work predefined such assignments. Our proposed solution simplifies the representation requirements of individual embeddings and allows underrepresented concepts to take advantage of shared representations before feeding them into concept-specific layers. Comprehensive experiments verify the effectiveness of our approach on three phrase grounding datasets, Flickr30K Entities, the ReferIt Game, and Visual Genome, where we obtain (respectively) 4%, 3%, and 4% improvements in grounding performance over a phrase-region embedding baseline.
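As a concrete illustration of the concept weight branch idea, here is a minimal PyTorch sketch in which a phrase is softly assigned to several concept-specific embedding layers on top of a shared representation; module names, dimensions, and the number of concepts are placeholders, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptWeightedEmbedding(nn.Module):
    """Sketch of a concept weight branch: phrases are softly assigned to K
    concept-specific projections instead of a single shared one (illustrative)."""

    def __init__(self, phrase_dim=300, embed_dim=256, num_concepts=4):
        super().__init__()
        self.shared = nn.Linear(phrase_dim, embed_dim)               # shared representation
        self.concept_heads = nn.ModuleList(
            [nn.Linear(embed_dim, embed_dim) for _ in range(num_concepts)])
        self.concept_weights = nn.Linear(phrase_dim, num_concepts)   # assignment branch

    def forward(self, phrase_feat):                       # (B, phrase_dim)
        shared = F.relu(self.shared(phrase_feat))         # (B, embed_dim)
        w = F.softmax(self.concept_weights(phrase_feat), dim=-1)     # (B, K) soft assignment
        per_concept = torch.stack([h(shared) for h in self.concept_heads], dim=1)  # (B, K, D)
        return (w.unsqueeze(-1) * per_concept).sum(dim=1)            # weighted combination

phrase = torch.randn(8, 300)
print(ConceptWeightedEmbedding()(phrase).shape)           # torch.Size([8, 256])
```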
Thanks to the success of object detection techniques, we can retrieve objects of specified classes even from huge image collections. However, current state-of-the-art object detectors (e.g., Faster R-CNN) can only handle pre-specified classes, and training them requires large numbers of positive and negative samples. In this paper, we address the problem of open-vocabulary object retrieval and localization, where the target object is specified by a text query (e.g., a word or phrase). We first propose Query-Adaptive R-CNN, a simple extension of Faster R-CNN to open-vocabulary queries, which transforms a text embedding vector into an object classifier and a localization regressor. Then, for discriminative training, we propose Negative Phrase Augmentation (NPA) to mine hard negative samples that are visually similar to the query and, at the same time, semantically mutually exclusive with it. The proposed method can retrieve and localize objects specified by a text query from one million images in only 0.5 seconds with high accuracy.
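The key mechanism, transforming a text embedding into a per-query classifier and box regressor, can be sketched roughly as follows (layer names, dimensions, and the plain linear mappings are assumptions for illustration):

```python
import torch
import torch.nn as nn

class QueryAdaptiveHead(nn.Module):
    """Maps a query text embedding to a detector head: classifier weights and
    bias plus a 4-d box regressor, applied to region features (illustrative)."""

    def __init__(self, text_dim=300, region_dim=2048):
        super().__init__()
        self.to_cls_w = nn.Linear(text_dim, region_dim)       # classifier weights
        self.to_cls_b = nn.Linear(text_dim, 1)                # classifier bias
        self.to_reg_w = nn.Linear(text_dim, region_dim * 4)   # box regressor weights

    def forward(self, query_emb, region_feats):
        # query_emb: (text_dim,), region_feats: (N, region_dim)
        w = self.to_cls_w(query_emb)                          # (region_dim,)
        b = self.to_cls_b(query_emb)                          # (1,)
        scores = region_feats @ w + b                         # (N,) relevance per region
        reg_w = self.to_reg_w(query_emb).view(-1, 4)          # (region_dim, 4)
        box_deltas = region_feats @ reg_w                     # (N, 4) box refinement
        return scores, box_deltas

head = QueryAdaptiveHead()
scores, deltas = head(torch.randn(300), torch.randn(10, 2048))
print(scores.shape, deltas.shape)                             # (10,) and (10, 4)
```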
Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities. We propose two network structures that produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. Compared to standard triplet sampling, we perform improved neighborhood sampling that takes neighborhood information into consideration while constructing mini-batches. The second network structure, referred to as a similarity network, fuses the two branches via element-wise product and is trained with regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on Flickr30K and MSCOCO datasets.
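For the second structure (the similarity network), a minimal sketch of the element-wise-product fusion and regression-style training might look like this; feature dimensions, layer sizes, and the MSE objective are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityNetwork(nn.Module):
    """Two-branch network fused by element-wise product and trained to
    regress a match score directly (illustrative sketch)."""

    def __init__(self, img_dim=2048, txt_dim=300, hidden=512):
        super().__init__()
        self.img_branch = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden))
        self.txt_branch = nn.Sequential(nn.Linear(txt_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, hidden))
        self.score = nn.Linear(hidden, 1)

    def forward(self, img_feat, txt_feat):
        fused = self.img_branch(img_feat) * self.txt_branch(txt_feat)  # element-wise product
        return self.score(fused).squeeze(-1)

net = SimilarityNetwork()
img, txt = torch.randn(16, 2048), torch.randn(16, 300)
target = torch.randint(0, 2, (16,)).float()        # 1 = matching pair, 0 = non-matching
loss = F.mse_loss(net(img, txt), target)           # simple regression objective
loss.backward()
```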
This paper presents a framework for localization or grounding of phrases in images using a large collection of linguistic and visual cues. We model the appearance, size, and position of entity bounding boxes, adjectives that contain attribute information, and spatial relationships between pairs of entities connected by verbs or prepositions. Special attention is given to relationships between people and clothing or body part mentions, as they are useful for distinguishing individuals. We automatically learn weights for combining these cues and at test time, perform joint inference over all phrases in a caption. The resulting system produces state of the art performance on phrase localization on the Flickr30k Entities dataset [33] and visual relationship detection on the Stanford VRD dataset [27].
Grounding (i.e., localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications in human-computer interaction and image-text reference resolution. Few datasets provide ground-truth spatial localization of phrases, so it is desirable to learn from data with no or little grounding supervision. We propose a novel approach that learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, the phrase is encoded with a recurrent network language model, which then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available, it can be applied directly via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr30K Entities and ReferItGame datasets with different levels of supervision, ranging from no supervision, over partial supervision, to full supervision. Our supervised variant improves over the state of the art on both datasets by a large margin.
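The attend-and-reconstruct idea can be sketched as follows: the phrase attends over region features, the attended feature is decoded back toward the phrase, and the attention weights themselves constitute the grounding. The encoders and the decoder are collapsed into single linear layers here, so this is illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttendAndReconstruct(nn.Module):
    """Latent attention over region proposals, trained by reconstructing the
    phrase from the attended visual feature (illustrative sketch)."""

    def __init__(self, phrase_dim=300, region_dim=2048, hidden=512, vocab=1000):
        super().__init__()
        self.q = nn.Linear(phrase_dim, hidden)
        self.k = nn.Linear(region_dim, hidden)
        self.v = nn.Linear(region_dim, hidden)
        self.decode = nn.Linear(hidden, vocab)        # stand-in for an RNN phrase decoder

    def forward(self, phrase_feat, region_feats):
        # phrase_feat: (B, phrase_dim); region_feats: (B, N, region_dim)
        attn = torch.einsum('bh,bnh->bn', self.q(phrase_feat), self.k(region_feats))
        attn = F.softmax(attn, dim=-1)                            # (B, N): the grounding
        attended = torch.einsum('bn,bnh->bh', attn, self.v(region_feats))
        return attn, self.decode(attended)                        # reconstruction logits

model = AttendAndReconstruct()
attn, logits = model(torch.randn(4, 300), torch.randn(4, 20, 2048))
best_region = attn.argmax(dim=-1)                                 # predicted box per phrase
```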
In this paper, we propose a novel weakly supervised model, Multi-scale Anchored Transformer Network (MATN), to accurately localize free-form textual phrases with only image-level supervision. The proposed MATN takes region proposals as localization anchors, and learns a multi-scale correspondence network to continuously search for phrase regions referring to the anchors. In this way, MATN can exploit useful cues from these anchors to reliably reason about locations of the regions described by the phrases given only image-level supervision. Through differentiable sampling on image spatial feature maps, MATN introduces a novel training objective to simultaneously minimize a contrastive reconstruction loss between different phrases from a single image and a set of triplet losses among multiple images with similar phrases. Superior to existing region proposal based methods, MATN searches for the optimal bounding box over the entire feature map instead of selecting a sub-optimal one from discrete region proposals. We evaluate MATN on the Flickr30K Entities and ReferItGame datasets. The experimental results show that MATN significantly outperforms the state-of-the-art methods.
This paper proposes a method for learning joint embeddings of images and text using a two-branch neural network with multiple layers of linear projections followed by nonlinearities. The network is trained using a large-margin objective that combines cross-view ranking constraints with within-view neighborhood structure preservation constraints inspired by metric learning literature. Extensive experiments show that our approach gains significant improvements in accuracy for image-to-text and text-to-image retrieval. Our method achieves new state-of-the-art results on the Flickr30K and MSCOCO image-sentence datasets and shows promise on the new task of phrase localization on the Flickr30K Entities dataset.
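To make the ranking objective concrete, here is a minimal sketch of a bi-directional max-margin loss of the kind described; the cosine-similarity formulation, in-batch negatives, and margin value are assumptions, and the paper's neighborhood-preservation terms are omitted:

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(img_emb, txt_emb, margin=0.2):
    """Cross-view max-margin loss over a batch of matched (i-th image, i-th text)
    pairs; every other item in the batch acts as a negative (illustrative)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                      # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                    # (B, 1) matched-pair scores
    mask = ~torch.eye(sim.size(0), dtype=torch.bool)
    loss_i2t = F.relu(margin + sim - pos)[mask].mean()      # image -> text hinge
    loss_t2i = F.relu(margin + sim.t() - pos)[mask].mean()  # text -> image hinge
    return loss_i2t + loss_t2i

loss = bidirectional_ranking_loss(torch.randn(8, 512), torch.randn(8, 512))
```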
Phrase grounding aims to detect and localize objects in an image that are referred to and queried by natural language phrases. It finds applications in tasks such as Visual Dialog, Visual Search, and image-text co-reference resolution. In this paper, we present a framework that leverages information such as phrase category, relationships among neighboring phrases in a sentence, and context to improve the performance of phrase grounding systems. We propose three modules: a Proposal Indexing Network (PIN), an Inter-phrase Regression Network (IRN), and a Proposal Ranking Network (PRN), each of which analyzes an image's region proposals at an increasing level of detail by incorporating the above information. Additionally, for the case where ground-truth spatial locations of the phrases are unavailable (weak supervision), we propose a knowledge transfer mechanism that builds on the PIN module of the framework. We demonstrate the effectiveness of our approach on the Flickr30K Entities and ReferItGame datasets, on which we achieve improvements over state-of-the-art approaches in both the supervised and weakly supervised variants.
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description tasks jointly, we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state-of-the-art approaches in both generation and retrieval settings.
Current Zero-Shot Learning (ZSL) approaches are restricted to recognition of a single dominant unseen object category in a test image. We hypothesize that this setting is ill-suited for real-world applications where unseen objects appear only as a part of a complex scene, warranting both the 'recognition' and 'localization' of an unseen category. To address this limitation, we introduce a new 'Zero-Shot Detection' (ZSD) problem setting, which aims at simultaneously recognizing and locating object instances belonging to novel categories without any training examples. We also propose a new experimental protocol for ZSD based on the highly challenging ILSVRC dataset, adhering to practical issues, e.g., the rarity of unseen objects. To the best of our knowledge, this is the first end-to-end deep network for ZSD that jointly models the interplay between visual and semantic domain information. To overcome the noise in the automatically derived semantic descriptions, we utilize the concept of meta-classes to design an original loss function that achieves synergy between max-margin class separation and semantic space clustering. Furthermore, we present a baseline approach extended from recognition to detection setting. Our extensive experiments show significant performance boost over the baseline on the imperative yet difficult ZSD problem.
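The loss described above combines max-margin class separation with semantic-space clustering via meta-classes; the sketch below illustrates one plausible form of such an objective (the exact formulation, weighting, and tensor shapes are assumptions, not the paper's):

```python
import torch
import torch.nn.functional as F

def zsd_loss(scores, labels, sem_pred, meta_centroids, meta_of_class,
             margin=1.0, lam=0.1):
    """scores: (B, C) class scores; sem_pred: (B, D) predicted semantic vectors;
    meta_centroids: (M, D); meta_of_class: (C,) meta-class id per class. Illustrative."""
    pos = scores.gather(1, labels.unsqueeze(1))                    # (B, 1) true-class score
    mask = F.one_hot(labels, num_classes=scores.size(1)).bool()
    hinge = F.relu(margin + scores - pos).masked_fill(mask, 0.0)   # max-margin separation
    margin_loss = hinge.mean()
    centers = meta_centroids[meta_of_class[labels]]                # (B, D) target centroid
    cluster_loss = F.mse_loss(sem_pred, centers)                   # pull toward meta-class
    return margin_loss + lam * cluster_loss

B, C, D, M = 4, 10, 300, 3
loss = zsd_loss(torch.randn(B, C), torch.randint(0, C, (B,)),
                torch.randn(B, D), torch.randn(M, D), torch.randint(0, M, (C,)))
```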
Associating image regions with text queries has been recently explored as a new way to bridge visual and linguistic representations. A few pioneering approaches have been proposed based on recurrent neural language models trained generatively (e.g., generating captions), but achieving somewhat limited localization accuracy. To better address natural-language-based visual entity localization, we propose a discriminative approach. We formulate a discriminative bimodal neural network (DBNet), which can be trained by a classifier with extensive use of negative samples. Our training objective encourages better localization on single images, incorporates text phrases in a broad range, and properly pairs image regions with text phrases into positive and negative examples. Experiments on the Visual Genome dataset demonstrate the proposed DBNet significantly outperforms previous state-of-the-art methods both for localization on single images and for detection on multiple images. We also establish an evaluation protocol for natural-language visual detection. Code is available at: http://ytzhang.net/projects/dbnet .
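As a rough illustration of the discriminative training described (region-phrase pairs labeled positive or negative and fed to a classifier), here is a toy binary classifier over fused region and phrase features; the fusion scheme, dimensions, and sampling are placeholders rather than the DBNet architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairClassifier(nn.Module):
    """Binary classifier over fused region-phrase features (illustrative)."""

    def __init__(self, region_dim=2048, phrase_dim=300, hidden=512):
        super().__init__()
        self.region = nn.Linear(region_dim, hidden)
        self.phrase = nn.Linear(phrase_dim, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, region_feat, phrase_feat):
        fused = torch.tanh(self.region(region_feat) + self.phrase(phrase_feat))
        return self.out(fused).squeeze(-1)

clf = PairClassifier()
regions, phrases = torch.randn(32, 2048), torch.randn(32, 300)
labels = torch.randint(0, 2, (32,)).float()   # 1: phrase describes region, 0: negative pair
loss = F.binary_cross_entropy_with_logits(clf(regions, phrases), labels)
loss.backward()
```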
We introduce and tackle the problem of zero-shot object detection (ZSD), whose goal is to detect object categories that are not observed during training. We work with a challenging set of object classes, not restricting ourselves to similar and/or fine-grained categories as in prior work on zero-shot classification. We present a principled approach by first adapting visual-semantic embeddings for ZSD. We then discuss the problems associated with selecting a background class and motivate two background-aware approaches for learning robust detectors: one model uses a fixed background class, and the other is based on iterative latent assignments. We also outline the challenges associated with using a limited number of training classes and propose a solution based on dense sampling of the semantic label space using auxiliary data with a large number of categories. We propose novel splits of two standard detection datasets, MSCOCO and Visual Genome, and report preliminary experimental results in both the conventional and the generalized zero-shot settings to highlight the benefits of the proposed approach. We provide useful algorithmic insights and conclude by posing some open questions to encourage further research.
In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object. Natural language object retrieval differs from text-based image retrieval task as it involves spatial information about objects within the scene and global scene context. To address this issue, we propose a novel Spatial Context Recurrent ConvNet (SCRC) model as scoring function on candidate boxes for object retrieval, integrating spatial configurations and global scene-level contextual information into the network. Our model processes query text, local image descriptors, spatial configurations and global context features through a recurrent network, outputs the probability of the query text conditioned on each candidate box as a score for the box, and can transfer visual-linguistic knowledge from image captioning domain to our task. Experimental results demonstrate that our method effectively utilizes both local and global information, outperforming previous baseline methods significantly on different datasets and scenarios, and can exploit large scale vision and language datasets for knowledge transfer.
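The scoring function described above conditions on local descriptors, spatial configuration, and global context; the following sketch shows one plausible (non-recurrent) simplification, with all feature sizes and the 8-d spatial encoding chosen for illustration:

```python
import torch
import torch.nn as nn

def spatial_config(boxes, img_w, img_h):
    """8-d spatial feature per box: normalized corners, centre, width, height (illustrative)."""
    x1, y1, x2, y2 = boxes.unbind(-1)
    w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
    cx, cy = (x1 + x2) / (2 * img_w), (y1 + y2) / (2 * img_h)
    return torch.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, cx, cy, w, h], dim=-1)

class BoxScorer(nn.Module):
    """Scores each candidate box against a query using local, spatial, and
    global-context features concatenated together (illustrative)."""

    def __init__(self, query_dim=512, local_dim=2048, global_dim=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(query_dim + local_dim + 8 + global_dim, 512), nn.ReLU(),
            nn.Linear(512, 1))

    def forward(self, query, local_feats, boxes, global_feat, img_w, img_h):
        n = local_feats.size(0)
        x = torch.cat([query.expand(n, -1), local_feats,
                       spatial_config(boxes, img_w, img_h),
                       global_feat.expand(n, -1)], dim=-1)
        return self.mlp(x).squeeze(-1)            # one score per candidate box

scores = BoxScorer()(torch.randn(512), torch.randn(5, 2048),
                     torch.rand(5, 4) * 400, torch.randn(2048), 640, 480)
```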
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains linking mentions of the same entities in images, as well as 276k manually annotated bounding boxes corresponding to each entity. Such annotation is essential for continued progress in automatic image description and grounded language understanding. We present experiments demonstrating the usefulness of our annotations for text-to-image reference resolution, or the task of localizing textual entity mentions in an image, and for bidirectional image-sentence retrieval. These experiments confirm that we can further improve the accuracy of state-of-the-art retrieval methods by training with explicit region-to-phrase correspondence, but at the same time, they show that accurately inferring this correspondence given an image and a caption remains challenging.
In this paper we introduce a new approach to phrase localization: grounding phrases in sentences to image regions. We propose a structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions. We formulate structured matching as a discrete optimization problem and relax it to a linear program. We use neural networks to embed regions and phrases into vectors, which then define the similarities (matching weights) between regions and phrases. We integrate structured matching with neural networks to enable end-to-end training. Experiments on Flickr30K Entities demonstrate the empirical effectiveness of our approach.
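As a concrete illustration of relaxing matching to a linear program, the sketch below solves a plain phrase-to-region assignment LP with SciPy; the structured (relational) terms of the paper are omitted, so this is only the unstructured core of the idea:

```python
import numpy as np
from scipy.optimize import linprog

def match_phrases_to_regions(sim):
    """LP relaxation of phrase-region matching: maximize total similarity, with
    each phrase's (fractional) assignment weights summing to 1. Illustrative."""
    P, R = sim.shape
    c = -sim.ravel()                                   # maximize => minimize negative
    A_eq = np.zeros((P, P * R))
    for i in range(P):
        A_eq[i, i * R:(i + 1) * R] = 1.0               # each phrase's weights sum to 1
    res = linprog(c, A_eq=A_eq, b_eq=np.ones(P), bounds=(0.0, 1.0), method="highs")
    x = res.x.reshape(P, R)
    return x.argmax(axis=1)                            # decode: best region per phrase

sim = np.random.rand(3, 6)                             # phrase-region similarity scores
print(match_phrases_to_regions(sim))
```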
Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.
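The factorization described above scores objects and predicates independently and modulates the result with a language prior; a toy version of that combination might look like the following (the multiplicative form and the exponent are assumptions):

```python
def relationship_score(vis_subj, vis_pred, vis_obj, lang_prior, alpha=1.0):
    """Combine independently trained visual confidences for the subject,
    predicate and object with a language prior for the full triple (e.g.
    estimated from word-embedding similarity to annotated triples). Illustrative."""
    visual = vis_subj * vis_pred * vis_obj
    return visual * (lang_prior ** alpha)

# e.g. "man riding bicycle": strong object detections, rarer predicate, but a
# high language prior keeps the composite score competitive
print(relationship_score(vis_subj=0.9, vis_pred=0.4, vis_obj=0.8, lang_prior=0.7))
```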
Video description is one of the most challenging problems in vision and language understanding due to the large variability on both the video and the language side. Models therefore tend to shortcut the recognition difficulty and generate plausible sentences that are based on priors but are not necessarily grounded in the video. In this work, we explicitly link such sentences to evidence in the video by annotating each noun phrase in a sentence with the corresponding bounding box in one frame of the video. Our novel dataset, ActivityNet-Entities, is based on the challenging ActivityNet Captions dataset and augments it with 158k bounding box annotations, each grounding a noun phrase. This allows training video description models on these data and, importantly, evaluating how grounded or 'true' such models are to the videos they describe. To generate grounded captions, we propose a novel video description model that is able to exploit these bounding box annotations. We demonstrate the effectiveness of our model on ActivityNet-Entities, but also show how it can be applied to image description on the Flickr30K Entities dataset. We achieve state-of-the-art performance on video description, video paragraph description, and image description, and demonstrate that our generated sentences are better grounded in the video.
We introduce a novel framework for image captioning that can produce natural language explicitly grounded in entities that object detectors find in the image. Our approach reconciles classical slot filling approaches (that are generally better grounded in images) with modern neural captioning approaches (that are generally more natural sounding and accurate). Our approach first generates a sentence 'template' with slot locations explicitly tied to specific image regions. These slots are then filled in by visual concepts identified in the regions by object detectors. The entire architecture (sentence template generation and slot filling with object detectors) is end-to-end differentiable. We verify the effectiveness of our proposed model on different image captioning tasks. On standard image captioning and novel object captioning, our model reaches state-of-the-art on both COCO and Flickr30k datasets. We also demonstrate that our model has unique advantages when the train and test distributions of scene compositions -- and hence language priors of associated captions -- are different. Code has been made available at: https://github.com/jiasenlu/NeuralBabyTalk
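A toy, non-neural illustration of the slot-filling step described above: a caption template with explicit slots is filled with the categories of detected regions. The real model generates the template and grounds the slots end-to-end; all names here are invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str      # object category from the detector
    box: tuple      # (x1, y1, x2, y2)

def fill_template(template, slot_to_region, detections):
    """Fill caption 'slots' with the categories of the detected regions they
    point to; a toy stand-in for the differentiable slot-filling step."""
    words = []
    for token in template.split():
        if token.startswith("<") and token.endswith(">"):
            words.append(detections[slot_to_region[token]].label)
        else:
            words.append(token)
    return " ".join(words)

dets = [Detection("dog", (10, 40, 200, 220)), Detection("frisbee", (230, 60, 300, 120))]
print(fill_template("a <slot0> catching a <slot1>", {"<slot0>": 0, "<slot1>": 1}, dets))
# -> "a dog catching a frisbee"
```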
Localizing natural language phrases in images is a challenging problem that requires a joint understanding of both the textual and visual modalities. In the unsupervised setting, the lack of supervisory signals exacerbates this difficulty. In this paper, we propose a novel framework for unsupervised visual grounding that uses concept learning as a proxy task to obtain self-supervision. The simple intuition behind this idea is to encourage the model to localize to regions that can explain some semantic property of the data; in our case, the property is the presence of a concept in a set of images. We provide extensive quantitative and qualitative experiments to demonstrate the effectiveness of our approach, showing a 5.6% improvement over the current state of the art on the Visual Genome dataset, a 5.8% improvement on the ReferItGame dataset, and performance comparable to the state of the art on the Flickr30k dataset.
Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can acquire visual concepts from alternative data sources (such as object detection datasets), we present the first large-scale benchmark for this task. Dubbed 'nocaps', for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, more than 500 object classes seen in the test images have no training captions (hence, nocaps). We evaluate several existing novel object captioning methods on this challenging benchmark. In automatic evaluations, these methods show only modest improvements over a strong baseline trained solely on image-caption data. However, even when using ground-truth object detections, the results are significantly weaker than our human baseline, indicating substantial room for improvement.