We present a novel computational model, "SAViR-T", for the family of visual reasoning problems embodied in Raven's Progressive Matrices (RPM). Our model considers the explicit spatial semantics of visual elements within each image of a puzzle, encoded as spatio-visual tokens, and learns the intra-image as well as inter-image token dependencies that are highly relevant to the visual reasoning task. Through the token relations modeled by the transformer-based SAViR-T architecture, the model extracts group-driven (row- or column-wise) representations by leveraging group-rule coherence, and uses them as an inductive bias to extract the underlying rule representation from the top two rows (or columns) of the RPM for each token. We use this relational representation to locate the correct choice image that completes the last row or column of the RPM. Extensive experiments on synthetic RPM benchmarks, including RAVEN, I-RAVEN, RAVEN-FAIR, and PGM, as well as the natural-image-based "V-PROM", demonstrate that SAViR-T sets a new state of the art for visual reasoning, surpassing the performance of prior models.
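To make the row-rule mechanism concrete, here is a minimal PyTorch sketch of scoring answer candidates by comparing a rule embedding pooled from the first two rows against each candidate-completed third row. The module structure, mean pooling, and cosine scoring are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Hedged sketch: a transformer encodes the visual tokens of one RPM row, a
# pooled "rule" embedding of the first two rows serves as the inductive bias,
# and each answer choice is scored by how well its completed row matches it.
class RowRuleScorer(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.rule_head = nn.Linear(dim, dim)

    def embed_row(self, row_tokens):            # (B, tokens_per_row, dim)
        h = self.encoder(row_tokens)            # intra-/inter-image token relations
        return self.rule_head(h.mean(dim=1))    # pooled rule embedding (B, dim)

    def forward(self, row1, row2, row3_candidates):
        rule = 0.5 * (self.embed_row(row1) + self.embed_row(row2))
        scores = [torch.cosine_similarity(rule, self.embed_row(r), dim=-1)
                  for r in row3_candidates]     # one completed row per choice
        return torch.stack(scores, dim=-1)      # argmax gives the predicted answer
```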
Computational learning approaches to solving visual reasoning tests, such as Raven's Progressive Matrices (RPM), depend strongly on identifying the visual concepts used in the test (i.e., the representation) as well as the latent rules based on those concepts (i.e., the reasoning). However, learning representation and reasoning is a challenging and ill-posed task, often approached in a stage-wise manner (first representation, then reasoning). In this work, we propose an end-to-end joint representation-reasoning learning framework, which leverages a weak form of inductive bias to improve both tasks jointly. Specifically, we introduce a general generative graphical model for RPMs, GM-RPM, and apply it to solving the reasoning test. We accomplish this via a novel learning framework, the Disentangling-based Abstract Reasoning Network (DAReN), built on the principles of GM-RPM. We empirically evaluate DAReN on multiple benchmark datasets. DAReN shows consistent improvements over state-of-the-art (SOTA) models on both the reasoning and the disentanglement tasks, demonstrating the strong correlation between disentangled latent representations and the ability to solve abstract visual reasoning tasks.
Abstract reasoning refers to the ability to analyze information, discover rules at an intangible level, and solve problems in innovative ways. The Raven's Progressive Matrices (RPM) test is typically used to examine this capability. The subject is asked to identify the correct choice from an answer set to fill the missing panel at the bottom right of the RPM (e.g., a 3×3 matrix), following the underlying rules inside the matrix. Recent studies leveraging Convolutional Neural Networks (CNNs) have achieved encouraging progress on the RPM test. However, they partly ignore necessary inductive biases of an RPM solver, such as order sensitivity within each row/column and incremental rule induction. To address this problem, in this paper we propose a Stratified Rule-Aware Network (SRAN) to generate rule embeddings for two input sequences. Our SRAN learns rule embeddings at multiple levels of granularity and incrementally integrates the stratified embedding flows through a gated fusion module. With the help of these embeddings, a rule similarity metric is applied so that SRAN can not only be trained with a tuplet loss but also infer the best answer efficiently. We further point out severe defects in the popular RAVEN dataset for RPM tests, which prevent a fair evaluation of abstract reasoning ability. To fix the defects, we propose an answer-set generation algorithm called Attribute Bisection Tree (ABT), forming an improved dataset (I-RAVEN for short). Extensive experiments on both the PGM and I-RAVEN datasets show that our SRAN outperforms state-of-the-art models.
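As an illustration of the gated fusion step, the sketch below merges rule embeddings from two adjacent hierarchy levels through a learned sigmoid gate; the gating form, layer names, and sizes are assumptions for exposition, not SRAN's actual code.

```python
import torch
import torch.nn as nn

# Hedged sketch of gated fusion: a sigmoid gate decides, per feature, how much
# of the lower-level rule embedding to keep versus the higher-level one.
class GatedFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, low, high):
        # low, high: (B, dim) rule embeddings from adjacent granularity levels
        g = self.gate(torch.cat([low, high], dim=-1))
        return g * low + (1.0 - g) * high   # convex per-feature combination
```

At inference, a rule-similarity score between the fused embedding of the first two rows and each candidate-completed row can then rank the answer choices.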
Raven's Progressive Matrices (RPMs) are frequently used to evaluate human visual reasoning ability. Researchers have made considerable efforts in developing systems for RPMs, typically through black-box, end-to-end convolutional neural networks (CNNs) for both the visual recognition and the logical reasoning tasks. Towards the goal of developing a highly explainable solution, we propose One-Shot Human-Understandable ReaSoning (OS-HURS), a two-step framework consisting of a perception module and a reasoning module, to tackle the challenges of real-world visual recognition and the subsequent logical reasoning task, respectively. For the reasoning module, we propose a "2+1" formulation that can be better understood by humans and significantly reduces model complexity. As a result, a precise reasoning rule can be deduced from only one RPM example, which is not feasible for existing solution methods. The proposed reasoning module is also capable of yielding a set of reasoning rules that precisely model human knowledge for solving the RPM problem. To validate the proposed method on real-world applications, we construct an RPM-like One-shot Frame-prediction (ROF) dataset, where visual reasoning is conducted on RPMs built from real-world video frames rather than synthetic images. Experimental results on various RPM-like datasets demonstrate that the proposed OS-HURS achieves significant and consistent performance gains compared with state-of-the-art models.
Humans continue to vastly outperform modern AI systems in their ability to parse and flexibly understand complex visual scenes. Attention and memory are two systems known to play a critical role in our ability to selectively maintain and manipulate behaviorally relevant visual information in order to solve some of the most challenging visual reasoning tasks. Here, we present a novel architecture for visual reasoning inspired by the cognitive-science literature: the Memory- and Attention-based (visual) REasOning (MAREO) architecture. MAREO instantiates an active-vision theory, which posits that the brain solves complex visual reasoning problems compositionally, by learning to combine previously learned elementary visual operations into more complex visual routines. MAREO learns to solve visual reasoning tasks via sequences of attention shifts that route and maintain task-relevant visual information in a memory bank, through a multi-head transformer module. Visual routines are then deployed by a dedicated reasoning module trained to judge various relations between objects in the scene. Experiments on four types of reasoning tasks demonstrate MAREO's ability to learn visual routines in a robust and sample-efficient manner.
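The sketch below illustrates one plausible reading of the memory-routing step: attended scene features are written into a bank of memory slots via multi-head attention. Slot count, residual write, and shapes are assumptions for illustration; MAREO's actual interfaces are not specified in this abstract.

```python
import torch
import torch.nn as nn

# Hedged sketch: a learned set of memory slots queries the currently attended
# scene tokens and is updated residually, i.e. task-relevant information is
# "routed" into the memory bank by a multi-head attention module.
class MemoryWriter(nn.Module):
    def __init__(self, dim=128, slots=8, heads=4):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(slots, dim))   # learned slots
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, scene_tokens):            # (B, N, dim) attended features
        B = scene_tokens.size(0)
        mem = self.memory.unsqueeze(0).expand(B, -1, -1)
        update, _ = self.attn(query=mem, key=scene_tokens, value=scene_tokens)
        return mem + update                     # residual write into the bank
```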
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a sub-field of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains, and as a result the integration of vision and language has attracted a great deal of attention. The tasks have been created in such a way that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and extensive review of state-of-the-art approaches and key model design principles, and discuss existing datasets, methods, their problem formulations, and evaluation measures for VQA and visual reasoning tasks, in order to understand vision and language representation learning. We also present some potential future paths in this field of research, with the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
The Visual Question Answering (VQA) task leverages both visual images and language analysis to answer a textual question about an image. It has been a popular research topic with an increasing number of real-world applications over the last decade. This paper introduces our recent research on AliceMind-MMU (Alibaba's collection of encoder-decoders from the Machine Intelligence lab of DAMO Academy, for Multimedia Understanding), which obtains similar or even slightly better results than humans on VQA. This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representations; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge-mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with the corresponding expertise plays an important role in boosting the performance of our VQA architecture up to the human level. Extensive experiments and analysis are conducted to demonstrate the effectiveness of the new research work.
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, as compared to recurrent networks, e.g., Long Short-Term Memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
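Since self-attention is the first of the fundamental concepts the survey covers, a bare-bones sketch may help fix notation: queries, keys, and values are linear projections of the same token sequence, and scaled dot-product weights model pairwise (potentially long-range) dependencies. Single-head, unmasked, and without output projection, purely for illustration.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Minimal scaled dot-product self-attention.

    x: (B, N, d) token embeddings; w_q, w_k, w_v: (d, d) projections.
    Illustrative single-head form; real transformer layers add multiple
    heads, masking, an output projection, and residual connections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # pairwise affinities
    weights = torch.softmax(scores, dim=-1)  # each token attends to all others
    return weights @ v
```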
A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years, with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet, a major gap remains between humans and AI systems in terms of the sample efficiency with which they learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been at least partially attributed to their ability to harness compositionality, such that they can efficiently take advantage of previously acquired knowledge when learning new tasks. Here, we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress towards the development of more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and generating the corresponding image datasets. Our proposed benchmark includes measures of sample efficiency, generalization, and transfer across task rules, as well as the ability to leverage compositionality. We systematically evaluate modern neural architectures and find that, surprisingly, convolutional architectures surpass transformer-based architectures across all performance measures in most data regimes. However, all computational models are much less data-efficient than humans, even after learning informative visual representations using self-supervised learning. Overall, we hope our challenge will spur interest in developing neural architectures that can learn to harness compositionality towards more efficient learning.
Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, question answering (such as ChatGPT), etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle while retaining their solution algorithm. To benchmark the performance on the SMART-101 dataset, we propose a vision-and-language meta-learning model using varied state-of-the-art backbone neural networks. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles that they are trained on, they are no better than random accuracy when analyzed for generalization. We also evaluate the recent ChatGPT large language model on a subset of our dataset and find that while ChatGPT exhibits convincing reasoning ability, its answers are often incorrect.
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
Relational reasoning is a central component of generally intelligent behavior, but has proven difficult for neural networks to learn. In this paper we describe how to use Relation Networks (RNs) as a simple plug-and-play module to solve problems that fundamentally hinge on relational reasoning. We tested RN-augmented networks on three tasks: visual question answering using a challenging dataset called CLEVR, on which we achieve state-of-the-art, super-human performance; text-based question answering using the bAbI suite of tasks; and complex reasoning about dynamic physical systems. Then, using a curated dataset called Sort-of-CLEVR we show that powerful convolutional networks do not have a general capacity to solve relational questions, but can gain this capacity when augmented with RNs. Our work shows how a deep learning architecture equipped with an RN module can implicitly discover and learn to reason about entities and their relations.
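The RN composite function is compact enough to sketch directly: a shared MLP g scores every ordered object pair, the pair scores are summed, and a second MLP f maps the aggregate to the output, i.e. RN(O) = f_phi(sum over i,j of g_theta(o_i, o_j)). MLP sizes below are illustrative, and the CLEVR variant additionally conditions g on a question embedding.

```python
import torch
import torch.nn as nn

# Sketch of a Relation Network: RN(O) = f_phi( sum_{i,j} g_theta(o_i, o_j) ).
class RelationNetwork(nn.Module):
    def __init__(self, obj_dim=64, hidden=256, out_dim=10):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim))

    def forward(self, objects):                     # objects: (B, N, obj_dim)
        B, N, D = objects.shape
        oi = objects.unsqueeze(2).expand(B, N, N, D)
        oj = objects.unsqueeze(1).expand(B, N, N, D)
        pairs = torch.cat([oi, oj], dim=-1)         # all N*N ordered pairs
        relations = self.g(pairs).sum(dim=(1, 2))   # permutation-invariant sum
        return self.f(relations)
```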
In this paper, we propose an end-to-end structured multimodal attention (SMA) neural network, mainly to address the first two issues above. SMA first uses a structural graph representation to encode the object-object, object-text, and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it. Finally, the outputs of the above modules are processed by a global-local attentional answering module that, following M4C, iteratively generates an answer spliced together from both the OCR and the general vocabulary. Our proposed model outperforms the SOTA models on the TextVQA dataset and on both tasks of the ST-VQA dataset among all models except those based on pre-training, such as TAP. Demonstrating strong reasoning ability, it also won first place in the TextVQA Challenge. We extensively test different OCR methods with several reasoning models and investigate the impact of gradually improved OCR performance on the TextVQA benchmark. With better OCR results, different models achieve dramatic improvements in VQA accuracy, but our model benefits the most, thanks to its strong textual-visual reasoning ability. To provide an upper bound for our method and a fair testing base for further work, we also provide human-annotated ground-truth OCR annotations for the TextVQA dataset, which were not given in the original release. The code and ground-truth OCR annotations for the TextVQA dataset are available at https://github.com/chenyugao-cs/sma
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
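The patchification step that makes a pure transformer applicable to images is simple enough to sketch: split the image into fixed-size patches, flatten each, and project it linearly to a token embedding. The sketch below uses the common convolution-with-stride trick and illustrative ViT-Base-like defaults; positional embeddings and the class token are omitted.

```python
import torch
import torch.nn as nn

# Hedged sketch of ViT-style patch embedding. A conv whose kernel and stride
# both equal the patch size extracts and linearly projects every
# non-overlapping patch in a single call.
class PatchEmbed(nn.Module):
    def __init__(self, img_size=224, patch=16, chans=3, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(chans, dim, kernel_size=patch, stride=patch)
        self.num_patches = (img_size // patch) ** 2   # 14 * 14 = 196

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.proj(x)                      # (B, dim, 14, 14)
        return x.flatten(2).transpose(1, 2)   # (B, 196, dim) token sequence
```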
Image recognition based on deep convolutional neural networks (CNNs) has made significant progress in the past few years, largely owing to such networks' strong ability to mine discriminative object poses and part information from texture and shape. This is often ill-suited to fine-grained visual classification (FGVC), which exhibits high intra-class and low inter-class variance due to occlusion, deformation, illumination, and other factors, making it difficult to find the subtle variations that fully characterize an object/scene. To this end, we propose a method that effectively captures subtle changes by aggregating context-aware features from the most relevant image regions, together with their importance for discriminating fine-grained categories, without requiring bounding-box and/or distinguishable-part annotations. Our approach is inspired by recent self-attention and graph neural network (GNN) methods, and is trained end to end. Our model is evaluated on eight benchmark datasets consisting of fine-grained objects and human-object interactions, and it outperforms state-of-the-art methods in recognition accuracy by a considerable margin.
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
In this paper, we propose a pure-attention bottom-up approach, termed ViGAT, that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network that processes these features for the task of event recognition and explanation in video. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to effectively capture both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices of the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the network's decision. A comprehensive evaluation study is conducted, showing that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, Mini-Kinetics, ActivityNet).
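The explanation mechanism reduces to simple bookkeeping over the attention matrices, sketched below: summing each column of a GAT block's adjacency (attention) matrix gives the weighted in-degree of a node, i.e. how much attention an object or frame receives, and ranking by it surfaces the most salient ones. This is a sketch of the idea under an assumed adjacency convention, not ViGAT's exact computation.

```python
import torch

def weighted_in_degrees(adjacency):
    """Weighted in-degrees (WiDs) from a GAT attention/adjacency matrix.

    adjacency: (N, N), where entry [i, j] is the attention weight from node i
    to node j (objects or frames). The column sum is the total attention a
    node receives; high-WiD nodes are candidate explanations of the decision.
    """
    wids = adjacency.sum(dim=0)                       # attention received per node
    ranking = torch.argsort(wids, descending=True)    # most salient nodes first
    return wids, ranking
```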
Progress in digital pathology is hindered by high-resolution images and the prohibitive cost of exhaustive localized annotations. The commonly used paradigm for categorizing pathology images is patch-based processing, which often incorporates multiple instance learning (MIL) to aggregate local patch-level representations into an image-level prediction. Nonetheless, diagnostically relevant regions may take up only a small fraction of the whole tissue, and current MIL-based approaches often process images uniformly, discarding inter-patch interactions. To alleviate these issues, we propose ScoreNet, a new efficient transformer that exploits a differentiable recommendation stage to extract discriminative image regions and dedicate computational resources accordingly. The proposed transformer leverages local and global attention over a few dynamically recommended high-resolution regions at an efficient computational cost. We further introduce a novel mixing data augmentation, ScoreMix, which leverages the semantic distribution of the images to guide the data mixing and produce coherent sample-label pairs. ScoreMix is embarrassingly simple and mitigates the pitfalls of previous augmentations, which assume a uniform semantic distribution and risk mislabeling samples. Thorough experiments and ablation studies on three breast cancer haematoxylin and eosin (H&E) histology datasets validate the superiority of our approach over prior art, including transformer-based models for tumour region-of-interest (TRoI) classification. ScoreNet equipped with the proposed ScoreMix augmentation demonstrates better generalization and achieves new state-of-the-art (SOTA) results with only 50% of the data. Finally, ScoreNet yields high efficacy and outperforms SOTA efficient transformers, namely TransPath and SwinTransformer.
While convolutional neural networks (CNNs) have shown remarkable results in many vision tasks, they are still strained by simple yet challenging visual reasoning problems. Inspired by the recent success of transformer networks in computer vision, in this paper we introduce the Recurrent Vision Transformer (RViT) model. Thanks to the impact of recurrent connections and spatial attention on reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems from the SVRT dataset. Weight sharing in both the spatial and depth dimensions regularizes the model, allowing it to learn with far fewer free parameters, using only 28k training samples. A comprehensive ablation study confirms the importance of the hybrid CNN + transformer architecture and the role of the feedback connections, which iteratively refine the internal representation until a stable prediction is obtained. Finally, this study lays the ground for a deeper understanding of the role of attention and recurrent connections in solving visual abstract reasoning tasks.
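A minimal sketch of the recurrence may clarify the "iterate until stable" idea: a single weight-shared transformer block is applied repeatedly to the CNN feature tokens, and iteration stops once the prediction changes by less than a tolerance. Block choice, pooling, and stopping rule are assumptions for illustration, not the RViT implementation.

```python
import torch
import torch.nn as nn

# Hedged sketch: depth-wise weight sharing means the same block is reused at
# every step, so the feedback loop refines the representation without adding
# parameters per iteration.
class RecurrentRefiner(nn.Module):
    def __init__(self, dim=128, max_steps=8, tol=1e-3):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                batch_first=True)
        self.head = nn.Linear(dim, 1)
        self.max_steps, self.tol = max_steps, tol

    def forward(self, tokens):                       # (B, N, dim) CNN features
        pred = self.head(tokens.mean(dim=1))
        for _ in range(self.max_steps):
            tokens = self.block(tokens)              # feedback connection
            new_pred = self.head(tokens.mean(dim=1))
            if (new_pred - pred).abs().max() < self.tol:
                break                                # prediction has stabilized
            pred = new_pred
        return pred
```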
Prior work such as VizWiz has found that Visual Question Answering (VQA) systems that can read and reason about text in images are useful in application areas such as assisting visually impaired people. TextVQA is a VQA dataset targeting this problem, where the questions require an answering system to read and reason about both visual objects and text objects in images. One key challenge in TextVQA is the design of a system that effectively reasons not only about visual and text objects individually, but also about the spatial relationships between these objects. This motivates the use of "edge features", i.e., information about the relationship between each pair of objects. Some current TextVQA models address this problem, but they either use only relation categories (rather than edge feature vectors) or do not use edge features within a transformer architecture. To overcome these shortcomings, we propose a Graph Relation Transformer (GRT), which, in addition to node information, uses edge information for the graph attention computation in the transformer. We find that, without using any other optimizations, the proposed GRT method outperforms the accuracy of the M4C baseline model by 0.65% on the val set and by 0.57% on the test set. Qualitatively, we observe that the GRT has superior spatial reasoning ability to M4C.
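One simple way to use edge information in the attention computation, consistent with the idea described here, is to project each pair's edge feature vector to a scalar bias added to the attention logits. The sketch below is a hedged single-head illustration; the projection form and shapes are assumptions, not necessarily the GRT design.

```python
import torch
import torch.nn as nn

# Hedged sketch of edge-aware graph attention in a transformer: pairwise edge
# features (e.g. relative geometry between OCR tokens and visual objects)
# bias the node-to-node attention logits.
class EdgeAwareAttention(nn.Module):
    def __init__(self, dim=128, edge_dim=16):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.edge_bias = nn.Linear(edge_dim, 1)   # scalar bias per node pair

    def forward(self, nodes, edges):
        # nodes: (B, N, dim); edges: (B, N, N, edge_dim)
        q, k, v = self.q(nodes), self.k(nodes), self.v(nodes)
        logits = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        logits = logits + self.edge_bias(edges).squeeze(-1)  # edge-conditioned
        return torch.softmax(logits, dim=-1) @ v
```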