Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.
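A minimal sketch of the re-ranking idea above, assuming a learned projection of the object-pair word embeddings onto predicate-embedding space (the parameters W, b and the multiplicative combination are illustrative, not the paper's exact formulation):

```python
import numpy as np

def rank_predicates(vis_scores, subj_emb, obj_emb, pred_embs, W, b, alpha=1.0):
    # vis_scores: (P,) visual confidences per predicate from a vision model
    # subj_emb, obj_emb: (D,) word embeddings of the two object categories
    # pred_embs: (P, D) word embeddings of the P predicate labels
    # W: (D, 2D), b: (D,) -- hypothetical learned projection of the object pair
    pair = np.concatenate([subj_emb, obj_emb])                 # (2D,)
    proj = W @ pair + b                                        # (D,)
    proj = proj / (np.linalg.norm(proj) + 1e-8)
    sims = pred_embs @ proj / (np.linalg.norm(pred_embs, axis=1) + 1e-8)  # cosine language prior
    return vis_scores * np.exp(alpha * sims)                   # combined ranking score

# np.argsort(-scores) then gives predicates ordered from most to least likely.
```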
Scene graph generation has gained much attention in computer vision research with its increasing use in image understanding projects such as visual question answering, image captioning, self-driving cars, crowd behavior analysis, activity recognition, and more. A scene graph, a visually grounded graphical structure of an image, greatly helps to simplify image understanding tasks. In this work, we introduce a post-processing algorithm called Geometric Context to understand visual scenes better geometrically. We use this post-processing algorithm to add and refine the geometric relationships between object pairs on top of a prior model. We exploit this context by calculating the direction and distance between object pairs. We use Knowledge Embedded Routing Network (KERN) as our baseline model, extend the work with our algorithm, and show results comparable to recent state-of-the-art algorithms.
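A minimal sketch of the geometric cue described above, computing direction and distance from bounding boxes (the normalization by the first box's diagonal is an assumption):

```python
import math

def geometric_context(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns the direction of b relative to a in degrees
    # (image coordinates, so negative angles point upward) and the centre distance
    # scaled by box_a's diagonal.
    cxa, cya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cxb, cyb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = cxb - cxa, cyb - cya
    angle = math.degrees(math.atan2(dy, dx))
    diag_a = math.hypot(box_a[2] - box_a[0], box_a[3] - box_a[1])
    distance = math.hypot(dx, dy) / (diag_a + 1e-8)
    return angle, distance

# e.g. a small distance with an angle near -90 degrees could refine a generic
# predicate such as "near" toward "above".
```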
Deep learning techniques have led to remarkable breakthroughs in the field of generic object detection and have spawned a lot of scene-understanding tasks in recent years. Scene graphs have been the focus of research because of their powerful semantic representation and applications to scene understanding. Scene Graph Generation (SGG) refers to the task of automatically mapping an image into a semantically structured scene graph, which requires correctly labeling detected objects and their relationships. Although this is a challenging task, the community has proposed many SGG approaches and achieved good results. In this paper, we provide a comprehensive survey of recent achievements in this field brought about by deep learning techniques. We review 138 representative works covering different input modalities and systematically summarize existing image-based SGG methods from the perspective of feature extraction and fusion. We attempt to connect and systematize existing visual relationship detection methods, and to summarize and interpret the mechanisms and strategies of SGG in a comprehensive way. Finally, we conclude this survey with a deep discussion of current open problems and future research directions. This survey will help readers better understand the current state of research and its ideas.
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image.
A scene graph is a structured representation of a scene that can clearly express the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, they look forward to a higher level of understanding and reasoning about visual scenes. For example, given an image, we want to not only detect and recognize the objects in it, but also know the relationships between objects (visual relationship detection) and generate a text description based on the image content (image captioning). Or, we may want the machine to tell us what the little girl in the image is doing (Visual Question Answering (VQA)), or even remove the dog from the image and find similar images (image editing and retrieval). These tasks require a higher level of understanding and reasoning about visual scenes, and the scene graph is exactly such a powerful tool for scene understanding. Therefore, scene graphs have attracted the attention of a large number of researchers, and the related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs currently exists. To this end, this survey conducts a comprehensive investigation of current scene graph research. More specifically, we first summarize the general definition of the scene graph, and then provide a comprehensive and systematic discussion of scene graph generation (SGG) methods and of SGG with the aid of prior knowledge. We then investigate the main applications of scene graphs and summarize the most commonly used datasets. Finally, we offer some insights into the future development of scene graphs. We believe this will be a very helpful foundation for future research on scene graphs.
Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by DETR, which excels in object detection, we view scene graph generation as a set prediction problem and propose an end-to-end scene graph generation model, RelTR, with an encoder-decoder architecture. The encoder reasons about the visual feature context, while the decoder infers a fixed-size set of subject-predicate-object triplets using different types of attention mechanisms with coupled subject and object queries. We design a set prediction loss that performs the matching between ground-truth and predicted triplets. In contrast to most existing scene graph generation methods, RelTR is a one-stage method that predicts a set of relationships directly, using only visual appearance, without combining entities and labeling all possible predicates. Extensive experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.
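The triplet set matching can be illustrated with a Hungarian-assignment sketch; the classification-only cost below is a simplification (a DETR-style matcher would normally also include box terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_triplets(pred_probs, gt_labels):
    # pred_probs: dict with 'sub', 'prd', 'obj' arrays of shape (Q, C_*) -- softmax
    #             class probabilities for Q triplet queries
    # gt_labels : (T, 3) integer class labels for T ground-truth triplets
    Q = pred_probs['sub'].shape[0]
    T = gt_labels.shape[0]
    cost = np.zeros((Q, T))
    for t, (s, p, o) in enumerate(gt_labels):
        cost[:, t] = -(pred_probs['sub'][:, s]
                       + pred_probs['prd'][:, p]
                       + pred_probs['obj'][:, o])
    rows, cols = linear_sum_assignment(cost)   # one query assigned per ground-truth triplet
    return list(zip(rows, cols))               # (query index, gt index) pairs
```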
Visual Relationship Detection (VRD) impels a computer vision model to "see" beyond individual object instances and "understand" how different objects in a scene are related. The traditional way of doing VRD is to first detect objects in an image and then separately predict the relationships between the detected object instances. Such a disjoint approach is prone to predicting redundant relationship labels (i.e., predicates) with similar semantic meaning between the same object pair, or labels that have a meaning similar to the ground truth but are semantically incorrect. To remedy this, we propose to jointly train a VRD model with visual object features and semantic relationship features. To this end, we propose VReBERT, a BERT-like transformer model for visual relationship detection with a multi-stage training strategy to jointly process visual and semantic features. We show that our simple BERT-like model is able to outperform state-of-the-art VRD models in predicate prediction. Furthermore, we show that by using the pre-trained VReBERT model, our model pushes the state of the art in zero-shot predicate prediction by a significant margin (+8.49 R@50 and +8.99 R@100).
Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such structured scene representation from an input image. The model solves the scene graph inference problem using standard RNNs and learns to iteratively improve its predictions via message passing. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods for generating scene graphs on the Visual Genome dataset and for inferring support relations on the NYU Depth v2 dataset.
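A toy sketch of the iterative refinement idea, with the learned RNN updates replaced by plain feature averaging purely for illustration:

```python
import numpy as np

def message_passing(node_feats, edge_feats, edges, steps=3):
    # node_feats: (N, D) object features; edge_feats: (E, D) relationship features;
    # edges: list of (subject_idx, object_idx), one per candidate relationship.
    for _ in range(steps):
        new_nodes = node_feats.copy()
        for n in range(len(node_feats)):
            msgs = [edge_feats[e] for e, (s, o) in enumerate(edges) if n in (s, o)]
            if msgs:  # nodes with no incident edge keep their features
                new_nodes[n] = 0.5 * node_feats[n] + 0.5 * np.mean(msgs, axis=0)
        new_edges = np.stack([
            0.5 * edge_feats[e] + 0.25 * node_feats[s] + 0.25 * node_feats[o]
            for e, (s, o) in enumerate(edges)
        ]) if edges else edge_feats
        node_feats, edge_feats = new_nodes, new_edges
    return node_feats, edge_feats
```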
We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image. We also propose an attentional Graph Convolutional Network (aGCN) that effectively captures contextual information between objects and relations. Finally, we introduce a new evaluation metric that is more holistic and realistic than existing metrics. We report state-of-the-art performance on scene graph generation as evaluated using both existing and our proposed metrics.
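A rough sketch of an attention-weighted graph convolution in the spirit of the aGCN (the parameterization is illustrative, not the paper's exact one):

```python
import numpy as np

def attentional_gcn_layer(node_feats, adj, W_att, W_out):
    # node_feats: (N, D) node features; adj: (N, N) relation-proposal adjacency mask;
    # W_att: (D, D') attention projection; W_out: (D, D_out) output projection.
    H = node_feats @ W_att                                 # projected features
    logits = H @ H.T                                       # pairwise attention logits
    logits = np.where(adj > 0, logits, -1e9)               # attend only along proposed edges
    att = np.exp(logits - logits.max(axis=1, keepdims=True))
    att = att / att.sum(axis=1, keepdims=True)             # row-wise softmax weights
    return np.maximum(att @ (node_feats @ W_out), 0.0)     # ReLU of aggregated messages
```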
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We present new quantitative insights on such repeated structures in the Visual Genome dataset. Our analysis shows that object labels are highly predictive of relation labels but not vice-versa. We also find that there are recurring patterns even in larger subgraphs: more than 50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. This baseline improves on the previous state-of-the-art by an average of 3.6% relative improvement across evaluation settings. We then introduce Stacked Motif Networks, a new architecture designed to capture higher order motifs in scene graphs that further improves over our strong baseline by an average 7.1% relative gain. Our code is available at github.com/rowanz/neural-motifs.
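The frequency baseline is simple enough to sketch directly:

```python
from collections import Counter, defaultdict

def build_freq_baseline(training_triplets):
    # For each (subject, object) label pair, remember the most frequent predicate
    # observed in the training set.
    counts = defaultdict(Counter)
    for subj, pred, obj in training_triplets:
        counts[(subj, obj)][pred] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}

# Usage: given detected labels ("man", "horse"), the baseline predicts whatever
# predicate dominated that pair in training.
freq_table = build_freq_baseline([
    ("man", "riding", "horse"), ("man", "feeding", "horse"), ("man", "riding", "horse"),
])
assert freq_table[("man", "horse")] == "riding"
```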
We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
Today's scene graph generation (SGG) task is still far from practical, mainly due to the severe training bias, e.g., collapsing diverse human walk on/sit on/lay on beach into human on beach. Given such SGG, downstream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG is not trivial because traditional debiasing methods cannot distinguish between the good and bad bias, e.g., good context prior (e.g., person read book rather than eat) and bad long-tailed bias (e.g., near dominating behind/in front of). In this paper, we present a novel SGG framework based on causal inference rather than the conventional likelihood. We first build a causal graph for SGG and perform traditional biased training with the graph. Then, we propose to draw the counterfactual causality from the trained graph to infer the effect of the bad bias, which should be removed. In particular, we use Total Direct Effect as the final predicate score for unbiased SGG. Note that our framework is agnostic to any SGG model and thus can be widely applied by anyone in the community seeking unbiased predictions. By using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observed significant improvements over the previous state-of-the-art methods.
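A sketch of the Total Direct Effect scoring step described above (the numbers in the usage example are made up):

```python
import numpy as np

def total_direct_effect(factual_logits, counterfactual_logits):
    # Subtract the logits obtained from a counterfactual input (e.g. the visual
    # content of the object pair wiped out) from the factual logits, so that the
    # context-only, long-tailed bias cancels and the content-specific effect remains.
    return factual_logits - counterfactual_logits

# Hypothetical numbers: logits for the predicates ["on", "parked on"].
factual = np.array([4.0, 2.5])           # logits with the pair's visual content
counterfactual = np.array([3.5, 0.5])    # logits with that content removed
print(total_direct_effect(factual, counterfactual))  # [0.5, 2.0] -> "parked on" wins
```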
Human-Object Interaction (HOI) detection is a fundamental task for high-level, human-centric scene understanding. We propose PhraseHOI, which contains a HOI branch and a novel phrase branch, to leverage language priors and improve relation expression. Specifically, the phrase branch is supervised by semantic embeddings whose ground truths are automatically converted from the original HOI annotations without extra human effort. Meanwhile, a novel label composition method is proposed to deal with the long-tailed problem in HOI, which composes novel phrase labels from semantic neighbors. Furthermore, to optimize the phrase branch, a loss composed of a distillation loss and a balanced triplet loss is proposed. Extensive experiments demonstrate the effectiveness of the proposed PhraseHOI, which achieves significant improvement over the baseline and surpasses previous state-of-the-art methods on the HICO-DET benchmark.
The single-hidden-layer Randomly Weighted Feature Network (RWFN), introduced by Hong and Pavlic (2021), was developed as an alternative to neural tensor network approaches for relational learning tasks. Its relatively small footprint, combined with the use of two randomized input projections -- an insect-brain-inspired input representation and random Fourier features -- allows it to achieve rich expressiveness about relations at relatively low training cost. In particular, when Hong and Pavlic compared RWFNs to Logic Tensor Networks (LTNs) for Semantic Image Interpretation (SII) tasks, which extract structured semantic descriptions of images, they showed that the RWFN integration of the two hidden layers better captures relationships between inputs with a faster training process, even though it uses fewer learnable parameters. In this paper, we use RWFNs to perform Visual Relationship Detection (VRD) tasks, which are more challenging SII tasks. A zero-shot learning approach is used with RWFNs that can exploit similarities to other, seen relationships and background knowledge -- expressed as logical constraints between subjects, relations, and objects -- to achieve the ability to predict triples that do not appear in the training set. Experiments on the Visual Relationship Dataset comparing the performance of RWFNs and LTNs, one of the leading statistical relational learning frameworks, show that RWFNs outperform LTNs on the predicate detection task while using a smaller number of adaptable parameters (1:56 ratio). Furthermore, the background knowledge represented by RWFNs can be used to alleviate the incompleteness of the training set, even though the space complexity of RWFNs is much smaller than that of LTNs (1:27 ratio).
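The random Fourier feature projection mentioned above follows the standard construction approximating an RBF kernel; a sketch with placeholder hyperparameters:

```python
import numpy as np

def random_fourier_features(X, n_features=256, gamma=1.0, seed=0):
    # Fixed random cosine projection approximating an RBF kernel; the projection
    # weights stay random (untrained), so only the layer on top of these features
    # needs to be learned.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))  # random weights
    b = rng.uniform(0, 2 * np.pi, size=n_features)                  # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Usage: Z = random_fourier_features(np.random.rand(10, 64)); Z @ Z.T approximates
# the RBF kernel matrix of the inputs.
```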
Scene Graph Generation (SGG) serves as a comprehensive representation of images for human understanding as well as for visual understanding tasks. Due to the long-tail bias of the object and predicate labels in the available annotated data, the scene graphs generated by current methodologies can be biased toward common, non-informative relationship labels. Relationships can sometimes be non-mutually exclusive and can be described from multiple perspectives, such as geometrical or semantic relationships, making it even more challenging to predict the most suitable relationship label. In this work, we propose the SG-Shuffle pipeline for scene graph generation with three components: 1) a Parallel Transformer Encoder, which learns to predict object relationships in a more exclusive manner by grouping relationship labels into groups of similar purpose; 2) a Shuffle Transformer, which learns to select the final relationship labels from the category-specific features generated in the previous step; and 3) a weighted CE loss, used to alleviate the training bias caused by the imbalanced dataset.
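A minimal sketch of a weighted CE loss, assuming inverse-frequency class weights (the exact weighting scheme used in SG-Shuffle is not specified here):

```python
import numpy as np

def weighted_cross_entropy(logits, labels, class_counts, eps=1e-12):
    # logits: (B, C) raw scores; labels: (B,) integer predicate classes;
    # class_counts: per-class frequencies in the training set.
    counts = np.asarray(class_counts, dtype=float)
    weights = counts.sum() / (len(counts) * counts)       # rare classes get larger weights
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = probs / probs.sum(axis=1, keepdims=True)      # softmax
    picked = probs[np.arange(len(labels)), labels]        # probability of the true class
    return float(np.mean(weights[labels] * -np.log(picked + eps)))
```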
This paper develops a novel framework for semantic image retrieval based on the notion of a scene graph. Our scene graphs represent objects ("man", "boat"), attributes of objects ("boat is white") and relationships between objects ("man standing on boat"). We use these scene graphs as queries to retrieve semantically related images. To this end, we design a conditional random field model that reasons about possible groundings of scene graphs to test images. The likelihoods of these groundings are used as ranking scores for retrieval. We introduce a novel dataset of 5,000 human-generated scene graphs grounded to images and use this dataset to evaluate our method for image retrieval. In particular, we evaluate retrieval using full scene graphs and small scene subgraphs, and show that our method outperforms retrieval methods that use only objects or low-level image features. In addition, we show that our full model can be used to improve object localization compared to baseline methods.
Visual relationship detection aims to detect the interactions between objects in an image; however, this task suffers from combinatorial explosion due to the diversity of objects and interactions. Since the interactions associated with the same object are dependent, we explore this dependency to reduce the search space. We explicitly model objects and interactions with an interaction graph and then propose a message-passing-style algorithm to propagate contextual information. We thus call the proposed method Neural Message Passing (NMP). We further integrate language priors and spatial cues to rule out unrealistic interactions and capture spatial interactions. Experimental results on two benchmark datasets demonstrate the superiority of our proposed method. Our code is available at https://github.com/phyllish/nmp.
Scene graph generation from images is a task of great interest to applications such as robotics, because graphs are the main way to represent knowledge about the world and regulate human-robot interactions in tasks such as Visual Question Answering (VQA). Unfortunately, its corresponding area of machine learning is still relatively in its infancy, and the solutions currently offered do not specialize well in concrete usage scenarios. Specifically, they do not take existing "expert" knowledge about the domain world into account; and that might indeed be necessary in order to provide the level of reliability demanded by the use case scenarios. In this paper, we propose an initial approximation to a framework called Ontology-Guided Scene Graph Generation (OG-SGG), that can improve the performance of an existing machine learning based scene graph generator using prior knowledge supplied in the form of an ontology (specifically, using the axioms defined within); and we present results evaluated on a specific scenario founded in telepresence robotics. These results show quantitative and qualitative improvements in the generated scene graphs.
Scene graph generation (SGG) aims to capture a wide variety of interactions between pairs of objects, which is essential for full scene understanding. Existing SGG methods trained on the entire set of relations fail to acquire complex reasoning about visual and textual correlations due to various biases in the training data. Learning trivial relations that indicate generic spatial configurations, such as "on" instead of informative ones such as "parked on", does not enforce such complex reasoning and hurts generalization. To address this problem, we propose a novel SGG training framework that exploits relation labels based on their informativeness. Our model-agnostic training procedure imputes missing informative relations for less informative samples in the training data and trains an SGG model on the imputed labels along with the existing annotations. We show that this approach can successfully be used in combination with state-of-the-art SGG methods and significantly improves their performance on the standard Visual Genome benchmark. Furthermore, we obtain considerable improvements for unseen triplets in the more challenging zero-shot setting.