We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We present new quantitative insights on such repeated structures in the Visual Genome dataset. Our analysis shows that object labels are highly predictive of relation labels but not vice-versa. We also find that there are recurring patterns even in larger subgraphs: more than 50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. This baseline improves on the previous state-of-the-art by an average of 3.6% relative improvement across evaluation settings. We then introduce Stacked Motif Networks, a new architecture designed to capture higher order motifs in scene graphs that further improves over our strong baseline by an average 7.1% relative gain. Our code is available at github.com/rowanz/neural-motifs.
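The frequency baseline described above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: the function names, the `no_relation` fallback, and the toy triplets are assumptions made for the example.

```python
from collections import Counter, defaultdict


def fit_frequency_baseline(triplets):
    """Count how often each relation occurs for every (subject, object) label pair."""
    counts = defaultdict(Counter)
    for subj, rel, obj in triplets:
        counts[(subj, obj)][rel] += 1
    return counts


def predict_relation(counts, subj, obj, default="no_relation"):
    """Return the most frequent training relation for the label pair.

    Falls back to `default` (a hypothetical placeholder) for unseen pairs.
    """
    pair_counts = counts.get((subj, obj))
    if not pair_counts:
        return default
    return pair_counts.most_common(1)[0][0]


# Toy training triplets (illustrative only, not taken from Visual Genome).
train = [
    ("man", "wearing", "shirt"),
    ("man", "wearing", "shirt"),
    ("man", "holding", "shirt"),
    ("cup", "on", "table"),
]
counts = fit_frequency_baseline(train)
```

Here `predict_relation(counts, "man", "shirt")` picks the relation seen most often for that label pair, using no visual features at all, which is what makes it a useful reference point for learned models.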
Translated by Google Translate
We propose a novel scene graph generation model called Graph R-CNN, that is both effective and efficient at detecting objects and their relations in images. Our model contains a Relation Proposal Network (RePN) that efficiently deals with the quadratic number of potential relations between objects in an image. We also propose an attentional Graph Convolutional Network (aGCN) that effectively captures contextual information between objects and relations. Finally, we introduce a new evaluation metric that is more holistic and realistic than existing metrics. We report state-of-the-art performance on scene graph generation as evaluated using both existing and our proposed metrics.
Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such a structured scene representation from an input image. The model solves the scene graph inference problem using standard RNNs and learns to iteratively improve its predictions via message passing. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods for generating scene graphs on the Visual Genome dataset and for inferring support relations on the NYU Depth v2 dataset.
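Deep learning techniques have led to remarkable breakthroughs in generic object detection and have spawned many scene-understanding tasks in recent years. Scene graphs have been a focus of research because of their powerful semantic representation and their applications to scene understanding. Scene graph generation (SGG) refers to the task of automatically mapping an image into a semantic structural scene graph, which requires correctly labeling the detected objects and their relationships. Although this is a challenging task, the community has proposed many SGG approaches and achieved good results. In this paper, we provide a comprehensive survey of recent achievements in this field brought about by deep learning techniques. We review 138 representative works that cover different input modalities, and systematically summarize existing image-based SGG methods from the perspective of feature extraction and fusion. We attempt to connect and systematize the existing visual relationship detection methods, and to summarize and interpret the mechanisms and strategies of SGG in a comprehensive way. Finally, we conclude this survey with an in-depth discussion of current problems and future research directions. This survey will help readers develop a better understanding of the current state of research and its ideas.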
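A scene graph is a structured representation of a scene that clearly expresses the objects, attributes, and relationships between objects in the scene. As computer vision technology continues to develop, people are no longer satisfied with simply detecting and recognizing objects in images; instead, they look forward to a higher level of understanding and reasoning about visual scenes. For example, given an image, we want not only to detect and recognize the objects in it, but also to know the relationships between those objects (visual relationship detection) and to generate a text description based on the image content (image captioning). Alternatively, we might want the machine to tell us what the little girl in the image is doing (visual question answering, VQA), or even to remove the dog from the image and find similar images (image editing and retrieval). These tasks require a higher level of understanding and reasoning in image vision tasks, and the scene graph is just such a powerful tool for scene understanding. Scene graphs have therefore attracted the attention of a large number of researchers, and the related research is often cross-modal, complex, and rapidly developing. However, no relatively systematic survey of scene graphs exists at present. To this end, this survey conducts a comprehensive investigation of current scene graph research. More specifically, we first summarize the general definition of the scene graph, and then provide a comprehensive and systematic discussion of scene graph generation (SGG) methods and of SGG with the aid of prior knowledge. We then investigate the main applications of scene graphs and summarize the most commonly used datasets. Finally, we provide some insights into the future development of scene graphs. We believe this will be a very helpful foundation for future research on scene graphs.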
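Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by DETR, which excels at object detection, we view scene graph generation as a set prediction problem and propose RelTR, an end-to-end scene graph generation model with an encoder-decoder architecture. The encoder reasons about the visual feature context, while the decoder infers a fixed-size set of subject-predicate-object triplets using different types of attention mechanisms with coupled subject and object queries. We design a set prediction loss that performs the matching between the ground-truth and predicted triplets. In contrast to most existing scene graph generation methods, RelTR is a one-stage method that predicts a set of relationships directly, using only visual appearance, without combining entities and labeling all possible predicates. Extensive experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.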
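The matching step behind a set prediction loss can be illustrated with a toy example. This is only a sketch under stated assumptions: the cost function (number of mismatched labels in a triplet) and the brute-force search over assignments are stand-ins chosen for readability, not the cost or the matching algorithm used by the model, and all names are hypothetical.

```python
from itertools import permutations


def triplet_cost(pred, gt):
    """Toy cost: number of positions (subject, predicate, object) that differ."""
    return sum(p != g for p, g in zip(pred, gt))


def match_triplets(preds, gts):
    """Find the one-to-one assignment of predictions to ground truths with minimum cost.

    Assumes len(preds) >= len(gts). Returns (assignment, total_cost), where
    assignment[i] is the index of the prediction matched to gts[i].
    Brute force is only practical for tiny sets; it stands in for a proper
    bipartite matching solver.
    """
    best_cost, best_assign = None, None
    for assign in permutations(range(len(preds)), len(gts)):
        cost = sum(triplet_cost(preds[p], gts[g]) for g, p in enumerate(assign))
        if best_cost is None or cost < best_cost:
            best_cost, best_assign = cost, assign
    return best_assign, best_cost


# Three predicted triplets (a fixed-size set) and two ground-truth triplets.
preds = [("dog", "on", "grass"), ("man", "holding", "cup"), ("man", "wearing", "hat")]
gts = [("man", "wearing", "hat"), ("dog", "on", "grass")]
assignment, cost = match_triplets(preds, gts)
```

Once each ground-truth triplet is paired with exactly one prediction, a loss can be computed pair by pair, and unmatched predictions can be treated as background.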
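Scene graph generation has gained much attention in computer vision research with the growing demand from image understanding projects such as visual question answering, image captioning, self-driving cars, crowd behavior analysis, activity recognition, and more. A scene graph, a visually grounded graphical structure of an image, greatly helps to simplify image understanding tasks. In this work, we introduce a post-processing algorithm called Geometric Context to better understand visual scenes geometrically. We use this post-processing algorithm to add and refine the geometric relationships between object pairs produced by a prior model. We exploit this context by calculating the direction and distance between object pairs. We use the Knowledge-Embedded Routing Network (KERN) as our baseline model, extend it with our algorithm, and show results comparable to recent state-of-the-art algorithms.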
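Direction and distance between an object pair can be computed from bounding boxes alone. The sketch below is an illustration under assumptions, not the paper's algorithm: it assumes boxes in `(x1, y1, x2, y2)` pixel format, measures everything between box centers, and uses a hypothetical four-way direction bucketing.

```python
import math


def box_center(box):
    """Center (x, y) of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def geometric_context(subj_box, obj_box):
    """Return (distance, angle_degrees, coarse_direction) from subject to object.

    The angle is measured counter-clockwise from the positive x-axis; the y
    difference is negated because image y-coordinates grow downward.
    """
    sx, sy = box_center(subj_box)
    ox, oy = box_center(obj_box)
    dx, dy = ox - sx, oy - sy
    distance = math.hypot(dx, dy)
    angle = math.degrees(math.atan2(-dy, dx)) % 360.0
    if angle < 45 or angle >= 315:
        direction = "right"
    elif angle < 135:
        direction = "above"
    elif angle < 225:
        direction = "left"
    else:
        direction = "below"
    return distance, angle, direction


# The object box sits directly to the right of the subject box.
dist, ang, direction = geometric_context((0, 0, 10, 10), (20, 0, 30, 10))
```

A post-processing step could then map such features to geometric predicates (for example, a small distance combined with the direction "above"), though the exact rules would depend on the dataset's predicate vocabulary.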
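Image captioning has been shown to achieve better performance when scene graphs are used to represent the relations between objects in the image. Current captioning encoders generally use a Graph Convolutional Network (GCN) to represent the relation information and merge it with the object region features via concatenation or convolution to obtain the final input for sentence decoding. However, the GCN-based encoders in existing methods are less effective for captioning, for two reasons. First, using image captioning as the objective (i.e., maximum likelihood estimation) rather than a relation-centric loss cannot fully exploit the potential of the encoder. Second, using a pre-trained model instead of the encoder itself to extract the relationships is inflexible and does not contribute to the explainability of the model. To improve the quality of image captioning, we propose a novel architecture, ReFormer, a relational transformer that generates features with embedded relation information and explicitly expresses the pairwise relationships between objects in the image. ReFormer combines the objective of scene graph generation with that of image captioning using one modified Transformer model. This design allows ReFormer not only to generate better image captions, thanks to the strong relational image features it extracts, but also to generate scene graphs that explicitly describe the pairwise relationships. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.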
Recent scene graph generation (SGG) frameworks have focused on learning complex relationships among multiple objects in an image. Because message passing neural networks (MPNNs) model high-order interactions between objects and their neighboring objects, they are the dominant representation learning modules for SGG. However, existing MPNN-based frameworks treat the scene graph as a homogeneous graph, which restricts the context-awareness of visual relations between objects. That is, they overlook the fact that relations tend to be highly dependent on the objects with which they are associated. In this paper, we propose an unbiased heterogeneous scene graph generation (HetSGG) framework that captures relation-aware context using message passing neural networks. We devise a novel message passing layer, called the relation-aware message passing neural network (RMP), that aggregates the contextual information of an image while considering the predicate type between objects. Our extensive evaluations demonstrate that HetSGG outperforms state-of-the-art methods, especially on tail predicate classes.
The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as state-of-the-art methods modeling object interrelationships.
This paper presents a framework for jointly grounding objects that follow certain semantic relationship constraints given in a scene graph. A typical natural scene contains several objects, often exhibiting visual relationships of varied complexities between them. These inter-object relationships provide strong contextual cues toward improving grounding performance compared to a traditional object query-only-based localization task. A scene graph is an efficient and structured way to represent all the objects and their semantic relationships in the image. In an attempt towards bridging these two modalities representing scenes and utilizing contextual information for improving object localization, we rigorously study the problem of grounding scene graphs on natural images. To this end, we propose a novel graph neural network-based approach referred to as Visio-Lingual Message PAssing Graph Neural Network (VL-MPAG Net). In VL-MPAG Net, we first construct a directed graph with object proposals as nodes and an edge between a pair of nodes representing a plausible relation between them. Then a three-step inter-graph and intra-graph message passing is performed to learn the context-dependent representation of the proposals and query objects. These object representations are used to score the proposals to generate object localization. The proposed method significantly outperforms the baselines on four public datasets.
Scene graph generation from images is a task of great interest to applications such as robotics, because graphs are the main way to represent knowledge about the world and regulate human-robot interactions in tasks such as Visual Question Answering (VQA). Unfortunately, its corresponding area of machine learning is still relatively in its infancy, and the solutions currently offered do not specialize well in concrete usage scenarios. Specifically, they do not take existing "expert" knowledge about the domain world into account; and that might indeed be necessary in order to provide the level of reliability demanded by the use case scenarios. In this paper, we propose an initial approximation to a framework called Ontology-Guided Scene Graph Generation (OG-SGG), that can improve the performance of an existing machine learning based scene graph generator using prior knowledge supplied in the form of an ontology (specifically, using the axioms defined within); and we present results evaluated on a specific scenario founded in telepresence robotics. These results show quantitative and qualitative improvements in the generated scene graphs.
Scene Graph Generation (SGG) serves as a comprehensive representation of images for human understanding as well as for visual understanding tasks. Due to the long-tail bias of the object and predicate labels in the available annotated data, the scene graphs generated by current methods can be biased toward common, non-informative relationship labels. Relationships can sometimes be non-mutually exclusive and can be described from multiple perspectives, such as geometric or semantic relationships, making it even more challenging to predict the most suitable relationship label. In this work, we propose the SG-Shuffle pipeline for scene graph generation with three components: 1) a Parallel Transformer Encoder, which learns to predict object relationships in a more exclusive manner by grouping relationship labels into groups of similar purpose; 2) a Shuffle Transformer, which learns to select the final relationship labels from the category-specific features generated in the previous step; and 3) a weighted CE loss, used to alleviate the training bias caused by the imbalanced dataset.
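A weighted cross-entropy loss of the kind listed as the third component can be sketched as follows. The abstract does not say how the weights are chosen, so the inverse-frequency scheme here is an assumption, as are the function names and the toy labels.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    `probs` is a list of dicts mapping class name to predicted probability.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Imbalanced toy labels: "on" is common, "parked on" is rare.
labels = ["on", "on", "on", "parked on"]
weights = inverse_frequency_weights(labels)

# A model that always favors the common class.
probs = [{"on": 0.9, "parked on": 0.1} for _ in labels]
loss = weighted_cross_entropy(probs, labels, weights)
unweighted = sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
```

With these weights, the single rare sample counts as much as the three common ones together, so a model that ignores the tail class is penalized more heavily than under the unweighted loss.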
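The task of scene graph generation entails identifying object entities and their corresponding interaction predicates in a given image (or video). Because the solution space is combinatorially large, existing approaches to scene graph generation assume a certain factorization of the joint distribution to make estimation feasible (e.g., assuming that objects are conditionally independent of predicate predictions). However, such a fixed factorization is not ideal in all scenarios (e.g., for images where an object involved in an interaction is small and not discernible on its own). In this work, we propose a novel framework for scene graph generation that addresses this limitation and introduces dynamic conditioning on the image, using message passing in a Markov random field. This is implemented as an iterative refinement procedure in which each modification is conditioned on the graph generated in the previous iteration. This conditioning across refinement steps allows joint reasoning over entities and relations. The framework is realized via a novel, end-to-end trainable transformer-based architecture. In addition, the proposed framework can improve the performance of existing approaches. Through extensive experiments on the Visual Genome and Action Genome benchmark datasets, we show improved performance on scene graph generation.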
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in
Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format. This representation has proven useful in several tasks, such as question answering, captioning, and even object detection, to name a few. Current approaches take a generation-by-classification approach where the scene graph is generated through labeling of all possible edges between objects in a scene, which adds computational overhead to the approach. This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction. Using two transformer-based components, we first sample a possible scene graph structure from detected objects and their visual features. We then perform predicate classification on the sampled edges to generate the final scene graph. This approach allows us to efficiently generate scene graphs from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the proposed approach. Without bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across different settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while offering competitive performance to unbiased SGG approaches.
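Scene graph generation (SGG) aims to capture a wide variety of interactions between pairs of objects, which is essential for full scene understanding. Existing SGG methods trained on the entire set of relations fail to acquire complex reasoning about visual and textual correlations because of various biases in the training data. Learning on trivial relations that indicate a generic spatial configuration, such as "on", instead of informative relations such as "parked on", does not enforce this complex reasoning and harms generalization. To address this problem, we propose a novel framework for SGG training that exploits relation labels based on their informativeness. Our model-agnostic training procedure imputes missing informative relations for the less informative samples in the training data and trains an SGG model on the imputed labels along with the existing annotations. We show that this approach can be successfully combined with state-of-the-art SGG methods and significantly improves their performance on the standard Visual Genome benchmark. Furthermore, we obtain considerable improvements for unseen triplets in a more challenging zero-shot setting.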
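Scene graph generation (SGG) remains a challenging visual understanding task because of its complex compositional property. Most previous works adopt a bottom-up two-stage approach or a point-based one-stage approach, which often suffer from high time complexity or sub-optimal design assumptions. In this work, we propose a novel SGG method that addresses these issues by formulating the task as a bipartite graph construction problem. To solve the problem, we develop a transformer-based end-to-end framework that first generates the entity and predicate proposal sets and then infers directed edges to form the relation triplets. In particular, we develop a new entity-aware predicate representation, based on a structural predicate generator, that leverages the compositional property of relationships. Moreover, we design a graph assembling module that infers the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on two challenging benchmarks, surpassing most existing approaches while enjoying higher inference efficiency. We hope our model can serve as a strong baseline for transformer-based scene graph generation.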
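Existing research addresses scene graph generation (SGG), a critical technology for scene understanding in images, from a detection perspective: objects are detected using bounding boxes, and their pairwise relationships are then predicted. We argue that this paradigm causes several problems that impede the progress of the field. For instance, bounding-box-based labels in current datasets usually contain redundant classes such as hair, and leave out background information that is crucial to understanding context. In this work, we introduce panoptic scene graph generation (PSG), a new problem task that requires the model to generate a more comprehensive scene graph representation based on panoptic segmentations rather than rigid bounding boxes. A high-quality PSG dataset, which contains 49k well-annotated overlapping images from COCO and Visual Genome, is created for the community to keep track of its progress. For benchmarking, we build four two-stage baselines, which are modified from classic SGG methods, and two one-stage baselines called PSGTR and PSGFormer, which are based on the efficient transformer-based detector DETR. While PSGTR uses a set of queries to directly learn triplets, PSGFormer separately models the objects and relations in the form of queries from two transformer decoders, followed by a prompting-like relation-object matching mechanism. Finally, we share insights on open challenges and future directions.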
Today's scene graph generation (SGG) task is still far from practical, mainly due to severe training bias, e.g., collapsing the diverse "human walk on / sit on / lay on beach" into "human on beach". Given such SGG, downstream tasks such as VQA can hardly infer better scene structures than merely a bag of objects. However, debiasing in SGG is not trivial because traditional debiasing methods cannot distinguish between good and bad bias, e.g., a good context prior ("person read book" rather than "eat") and a bad long-tailed bias ("near" dominating "behind / in front of"). In this paper, we present a novel SGG framework based on causal inference rather than the conventional likelihood. We first build a causal graph for SGG and perform traditional biased training with the graph. Then, we propose to draw the counterfactual causality from the trained graph to infer the effect of the bad bias, which should be removed. In particular, we use the Total Direct Effect as the proposed final predicate score for unbiased SGG. Note that our framework is agnostic to any SGG model and thus can be widely applied by anyone in the community who seeks unbiased predictions. Using the proposed Scene Graph Diagnosis toolkit on the SGG benchmark Visual Genome and several prevailing models, we observed significant improvements over the previous state-of-the-art methods.
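The idea of scoring predicates by a Total Direct Effect can be illustrated with a toy calculation. This sketch assumes the effect is computed as the difference between the predicate scores from the real input and the scores from a counterfactual input in which the pair's visual evidence is wiped out while the context is kept; the function name and all numbers are invented for illustration and do not come from the paper.

```python
def total_direct_effect(factual_scores, counterfactual_scores):
    """Per-predicate difference: factual score minus counterfactual score."""
    return [f - c for f, c in zip(factual_scores, counterfactual_scores)]


predicates = ["on", "sitting on", "near"]

# Hypothetical scores from a biased model given the real image features.
factual = [2.0, 1.8, 0.5]
# Hypothetical scores when the visual evidence for the pair is masked out,
# so that only context and dataset bias drive the prediction.
counterfactual = [1.9, 0.6, 0.4]

tde = total_direct_effect(factual, counterfactual)
biased_choice = predicates[max(range(len(factual)), key=factual.__getitem__)]
debiased_choice = predicates[max(range(len(tde)), key=tde.__getitem__)]
```

In this toy case the raw scores favor the frequent predicate "on", but most of that score survives without the visual evidence, so subtracting the counterfactual scores leaves "sitting on" as the highest-scoring predicate.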