A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even trying to evaluate one is a challenge, due to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This is exacerbated by humans' tendency to anthropomorphize: to assume that a system able to recognize one instance of a concept must also understand other instances, as a human would. In this paper, we argue that understanding a concept requires the ability to use it in varied contexts. Accordingly, we propose systematic, concept-centered evaluations, in which a system is probed on its ability to use a given concept across many different instantiations. We present case studies of such evaluations on two domains -- RAVEN (inspired by Raven's Progressive Matrices) and the Abstraction and Reasoning Corpus (ARC) -- that have been used to develop and assess abstraction abilities in AI systems. Our concept-based approach to evaluation reveals information about AI systems that conventional test sets would have left hidden.
A fundamental component of human vision is our ability to parse complex visual scenes and judge the relations between their constituent objects. AI benchmarks for visual reasoning have driven rapid progress in recent years, with state-of-the-art systems now reaching human accuracy on some of these benchmarks. Yet a large gap remains between humans and AI systems in the sample efficiency with which they learn new visual reasoning tasks. Humans' remarkable efficiency at learning has been attributed at least in part to their ability to harness compositionality, which lets them efficiently reuse previously acquired knowledge when learning new tasks. Here we introduce a novel visual reasoning benchmark, Compositional Visual Relations (CVR), to drive progress toward more data-efficient learning algorithms. We take inspiration from fluid intelligence and non-verbal reasoning tests and describe a novel method for creating compositions of abstract rules and the associated image datasets at scale. Our proposed benchmark includes measures of sample efficiency, generalization, and transfer across task rules, as well as the ability to leverage compositionality. We systematically evaluate modern neural architectures and find that, surprisingly, convolutional architectures surpass transformer-based architectures on all performance measures in most data regimes. However, all computational models are much less data efficient than humans, even after learning informative visual representations with self-supervised learning. Overall, we hope our challenge will spur interest in developing neural architectures that can learn to harness compositionality for more efficient learning.
Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, question answering (such as ChatGPT), etc. Such dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6-8 age group. Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle while retaining their solution algorithm. To benchmark performance on the SMART-101 dataset, we propose a vision-and-language meta-learning model using varied state-of-the-art backbone neural networks. Our experiments reveal that while powerful deep models offer reasonable performance on puzzles that they are trained on, they are no better than random accuracy when analyzed for generalization. We also evaluate the recent ChatGPT large language model on a subset of our dataset and find that while ChatGPT exhibits convincing reasoning abilities, the answers are often incorrect.
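Since the abstract hinges on programmatically generating new instances of a puzzle while keeping its solution algorithm fixed, a minimal Python sketch of that idea may help; the puzzle, generator, and option layout below are hypothetical illustrations, not taken from SMART-101.

```python
# A minimal sketch (not the SMART-101 generator) of programmatic instance
# replication: the puzzle's solution algorithm stays fixed while the surface
# parameters (numbers, wording) are re-sampled for every new instance.
import random
from dataclasses import dataclass

@dataclass
class PuzzleInstance:
    question: str
    options: list
    answer: int

def solution_algorithm(a: int, b: int) -> int:
    """The fixed solving procedure shared by all instances of this puzzle."""
    return a * b - (a + b)

def generate_instance(rng: random.Random) -> PuzzleInstance:
    """Re-sample surface parameters; the underlying algorithm never changes."""
    a, b = rng.randint(2, 9), rng.randint(2, 9)
    answer = solution_algorithm(a, b)
    distractors = {answer + d for d in (-2, -1, 1, 2)}
    options = sorted(distractors | {answer})
    question = f"{a} boxes each hold {b} toys; {a + b} toys are removed. How many remain?"
    return PuzzleInstance(question, options, answer)

if __name__ == "__main__":
    rng = random.Random(0)
    for inst in (generate_instance(rng) for _ in range(3)):
        print(inst.question, inst.options, "->", inst.answer)
```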
When building artificial intelligence systems that can reason and answer questions about visual data, we need diagnostic tests to analyze our progress and discover shortcomings. Existing benchmarks for visual question answering can help, but have strong biases that models can exploit to correctly answer questions without reasoning. They also conflate multiple sources of error, making it hard to pinpoint model weaknesses. We present a diagnostic dataset that tests a range of visual reasoning abilities. It contains minimal biases and has detailed annotations describing the kind of reasoning each question requires. We use this dataset to analyze a variety of modern visual reasoning systems, providing novel insights into their abilities and limitations.
The Abstraction and Reasoning Corpus (ARC) aims at benchmarking the performance of general artificial intelligence algorithms. The ARC's focus on broad generalization and few-shot learning has made it difficult to solve using pure machine learning. A more promising approach has been to perform program synthesis within an appropriately designed Domain Specific Language (DSL). However, these too have seen limited success. We propose Abstract Reasoning with Graph Abstractions (ARGA), a new object-centric framework that first represents images using graphs and then performs a search for a correct program in a DSL that is based on the abstracted graph space. The complexity of this combinatorial search is tamed through the use of constraint acquisition, state hashing, and Tabu search. An extensive set of experiments demonstrates the promise of ARGA in tackling some of the complicated object-centric tasks of the ARC rather efficiently, producing programs that are correct and easy to understand.
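As an illustration of the object-centric graph abstraction that ARGA builds on, here is a minimal Python sketch (not the ARGA implementation) that groups same-colored cells of an ARC-style grid into object nodes and derives a simple same-color relation between them; a DSL program search would then operate over such graphs rather than raw pixels.

```python
# A minimal sketch, not the ARGA code: abstract an ARC grid into an object
# graph, where nodes are 4-connected components of same-colored, non-background
# cells and edges record a simple "same color" relation between objects.
from collections import deque

def grid_to_objects(grid):
    """Group 4-connected, same-colored, non-background cells into objects."""
    h, w = len(grid), len(grid[0])
    seen, objects = set(), []
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 0 or (r, c) in seen:
                continue
            color, cells, queue = grid[r][c], [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen and grid[ny][nx] == color:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            objects.append({"color": color, "cells": cells})
    return objects

def same_color_edges(objects):
    return [(i, j) for i in range(len(objects)) for j in range(i + 1, len(objects))
            if objects[i]["color"] == objects[j]["color"]]

grid = [[0, 1, 1], [2, 0, 0], [2, 0, 1]]
objs = grid_to_objects(grid)
print(len(objs), "objects;", same_color_edges(objs))   # 3 objects; [(0, 2)]
```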
Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
The recent advent of large language models - large neural networks trained on a simple predictive objective over a massive corpus of natural language - has reinvigorated debate over whether human cognitive capacities might emerge in such generic models given sufficient training data. Of particular interest is the ability of these models to reason about novel problems zero-shot, without any direct training on those problems. In human cognition, this capacity is closely tied to an ability to reason by analogy. Here, we performed a direct comparison between human reasoners and a large language model (GPT-3) on a range of analogical tasks, including a novel text-based matrix reasoning task closely modeled on Raven's Progressive Matrices. We found that GPT-3 displayed a surprisingly strong capacity for abstract pattern induction, matching or even surpassing human capabilities in most settings. Our results indicate that large language models such as GPT-3 have acquired an emergent ability to find zero-shot solutions to a broad range of analogy problems.
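To make the kind of stimulus concrete, the following toy Python snippet constructs a text-based matrix problem in the spirit of the task described above; the exact format, rule, and prompt wording are assumptions for illustration, not the paper's stimuli.

```python
# A toy illustration (assumed format, not the paper's exact stimuli) of a
# text-based matrix reasoning problem in the spirit of Raven's Progressive
# Matrices: a 3x3 digit matrix follows a rule along each row, the final cell
# is blanked out, and a language model would be prompted to fill it zero-shot.
def make_progression_matrix(start=1, step=2):
    """Each row is an arithmetic progression; rows shift the starting value."""
    return [[start + r + c * step for c in range(3)] for r in range(3)]

def to_prompt(matrix):
    rows = [" ".join(str(v) for v in row) for row in matrix]
    rows[-1] = rows[-1].rsplit(" ", 1)[0] + " ?"   # blank out the last cell
    return "Complete the pattern:\n" + "\n".join(rows) + "\nAnswer:"

matrix = make_progression_matrix()
print(to_prompt(matrix))
print("expected answer:", matrix[2][2])            # the held-out cell
```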
Raven's Progressive Matrices is a family of classical intelligence tests that are widely used in both research and clinical settings. There have been many exciting efforts in the AI community to computationally model various aspects of solving such image-based analogical reasoning problems. In this paper, we present a series of computational models that solve Raven's Progressive Matrices using analogies and image transformations. We run the models following three different strategies commonly adopted by human test-takers. The models are tested on the standard version of Raven's Progressive Matrices, on which they solve 57 of the problems. Analogies and image transformations thus prove to be effective for solving RPM problems.
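A minimal Python sketch of the analogy-by-image-transformation idea (a simplification with a fixed pool of transforms, not the authors' models): infer the transformation that best explains one row, execute it on the incomplete row, and pick the answer choice closest to the prediction.

```python
# A minimal sketch (assumptions: 2x2 matrices of tiny binary images, a fixed
# pool of candidate transforms) of solving an RPM-style item by analogy.
import numpy as np

TRANSFORMS = {
    "identity": lambda x: x,
    "rot90":    lambda x: np.rot90(x),
    "flip_lr":  lambda x: np.fliplr(x),
    "flip_ud":  lambda x: np.flipud(x),
}

def best_transform(a, b):
    """Transform whose output on `a` is closest (L1 distance) to `b`."""
    return min(TRANSFORMS, key=lambda name: np.abs(TRANSFORMS[name](a) - b).sum())

def solve(top_left, top_right, bottom_left, choices):
    t = best_transform(top_left, top_right)        # infer the row rule by analogy
    prediction = TRANSFORMS[t](bottom_left)        # execute it on the bottom row
    scores = [np.abs(prediction - c).sum() for c in choices]
    return int(np.argmin(scores)), t

# Toy example: the rule along the row is a left-right flip.
tl = np.array([[1, 0], [1, 0]]); tr = np.fliplr(tl)
bl = np.array([[0, 1], [1, 1]])
choices = [bl, np.fliplr(bl), np.rot90(bl)]
print(solve(tl, tr, bl, choices))                  # -> (1, 'flip_lr')
```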
We propose a novel computational model, "SAVIR-T", for the family of visual reasoning problems embodied in Raven's Progressive Matrices (RPM). Our model considers the explicit spatial semantics of the visual elements within each image of the puzzle, encoded as spatio-visual tokens, and learns both the intra-image and the inter-image token dependencies that are highly relevant to the visual reasoning task. By modeling token-wise relationships with the transformer-based SAVIR-T architecture, the model extracts group-driven (row or column) representations, exploiting group-rule coherence and using it as an inductive bias to extract, for each token, the underlying rule representation from the top two rows (or columns) of the RPM. We use this relational representation to locate the correct choice image that completes the last row or column of the RPM. Extensive experiments on synthetic RPM benchmarks, including RAVEN, I-RAVEN, RAVEN-FAIR, and PGM, as well as the natural-image-based V-PROM, show that SAVIR-T sets a new state of the art for visual reasoning, surpassing the performance of prior models.
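The following rough PyTorch sketch is an assumed simplification of the token-based idea described above, not the SAVIR-T code: panels are split into patch tokens, a transformer encoder mixes intra- and inter-panel dependencies, and each candidate answer is scored by similarity to the pooled context representation.

```python
# A rough sketch (an assumed simplification, not SAVIR-T itself) of reasoning
# over patch tokens: self-attention over all context-panel tokens, then each
# candidate answer is scored against the pooled context encoding.
import torch
import torch.nn as nn

class TokenReasoner(nn.Module):
    def __init__(self, patch_dim=48, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def encode(self, panels):
        # panels: (batch, n_panels, n_patches, patch_dim) -> one pooled vector
        b, p, t, d = panels.shape
        tokens = self.embed(panels.reshape(b, p * t, d))
        return self.encoder(tokens).mean(dim=1)

    def forward(self, context, candidates):
        ctx = self.encode(context)                        # (batch, d_model)
        scores = []
        for k in range(candidates.shape[1]):              # score each answer choice
            cand = self.encode(candidates[:, k:k + 1])
            scores.append(torch.cosine_similarity(ctx, cand, dim=-1))
        return torch.stack(scores, dim=1)                 # (batch, n_candidates)

model = TokenReasoner()
context = torch.randn(2, 8, 16, 48)       # 8 context panels, 16 patches each
candidates = torch.randn(2, 4, 16, 48)    # 4 answer choices
print(model(context, candidates).shape)   # torch.Size([2, 4])
```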
Is intelligence realized by the connectionist or the classicist? While connectionist approaches have achieved superhuman performance, there is growing evidence that such task-specific superiority is particularly fragile under systematic generalization. This observation lies at the heart of the debate between connectionists and classicists, with the latter continually advocating an algebraic treatment in cognitive architectures. In this work, we follow the classicist's call and propose a hybrid approach to improve systematic generalization in reasoning. Specifically, we showcase a prototype with algebraic representations for the abstract spatial-temporal reasoning task of Raven's Progressive Matrices (RPM) and present the ALgebra-Aware Neuro-Semi-Symbolic (ALANS) learner. The ALANS learner is motivated by abstract algebra and representation theory. It consists of a neural visual perception frontend and an algebraic abstract reasoning backend: the frontend summarizes the visual information into object-based representations, while the backend transforms them into an algebraic structure and induces the hidden operator on the fly. The induced operator is then executed to predict the representation of the answer, and the choice most similar to the prediction is selected as the solution. Extensive experiments show that, by incorporating an algebraic treatment, the ALANS learner outperforms various pure connectionist models in domains requiring systematic generalization. We further show that the learned algebraic representation can be decoded via isomorphism to generate an answer.
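As a toy illustration of the "induce a hidden operator, execute it, select the closest choice" loop, the NumPy sketch below uses a linear operator as a stand-in for the algebraic machinery of ALANS; the rule, attribute vectors, and fitting procedure are assumptions for illustration only.

```python
# A toy sketch of "induce an operator, execute it, pick the closest choice"
# (a linear stand-in, not the actual ALANS algebra): the per-row rule is a
# matrix M mapping one panel's attribute vector to the next; M is fit from the
# two complete rows and executed on the last row to predict the missing panel.
import numpy as np

def induce_operator(sources, targets):
    """Least-squares fit of M such that M @ source ~= target for each pair."""
    X = np.stack(sources, axis=1)            # (d, n_pairs)
    Y = np.stack(targets, axis=1)
    return Y @ np.linalg.pinv(X)

def solve(rows, choices):
    # rows[0], rows[1] are complete (left, middle, right); rows[2] lacks `right`.
    pairs = [(l, m) for l, m, _ in rows[:2]] + [(m, r) for _, m, r in rows[:2]]
    M = induce_operator([s for s, _ in pairs], [t for _, t in pairs])
    prediction = M @ rows[2][1]              # execute the operator on the last row
    return int(np.argmin([np.linalg.norm(prediction - c) for c in choices]))

# Toy rule: each panel is the previous one rotated by 90 degrees.
R = np.array([[0., -1.], [1., 0.]])
row = lambda v: (v, R @ v, R @ R @ v)
rows = [row(np.array([1., 0.])), row(np.array([2., 1.])),
        (np.array([1., 1.]), R @ np.array([1., 1.]), None)]
choices = [np.array([-1., -1.]), np.array([1., -1.]), np.array([0., 2.])]
print(solve(rows, choices))                  # -> 0, i.e. [-1., -1.]
```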
We consider the problem of combining machine learning models to perform higher-level cognitive tasks with explicit specifications. We propose the novel problem of Visual Discrimination Puzzles (VDP), which requires finding interpretable discriminators that classify images according to a logical specification. Humans can solve these puzzles with ease and provide robust, verifiable, and interpretable discriminators as answers. We propose a compositional neurosymbolic framework that combines a neural network for detecting objects and relations with a symbolic learner that finds interpretable discriminators. We create a large set of VDP datasets involving natural images and show that our neurosymbolic framework performs favorably compared to several purely neural approaches.
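To illustrate the symbolic half of such a neurosymbolic pipeline, here is a small Python sketch (not the paper's framework): given object/relation facts assumed to have been extracted from each image by a perception model, it searches for a conjunction of predicates that holds for all positive images and none of the negatives.

```python
# An illustrative sketch (not the paper's method) of a symbolic discriminator
# search: find a small conjunction of facts separating positives from negatives.
from itertools import combinations

def discriminates(clause, positives, negatives):
    return all(clause <= facts for facts in positives) and \
           not any(clause <= facts for facts in negatives)

def find_discriminator(positives, negatives, max_size=2):
    vocabulary = sorted(set().union(*positives))
    for size in range(1, max_size + 1):
        for clause in combinations(vocabulary, size):
            if discriminates(set(clause), positives, negatives):
                return set(clause)
    return None

# Hypothetical per-image fact sets produced by an object/relation detector.
positives = [{"cat", "on(cat, sofa)"}, {"cat", "on(cat, sofa)", "lamp"}]
negatives = [{"cat", "on(cat, floor)"}, {"dog", "on(dog, sofa)"}]
print(find_discriminator(positives, negatives))   # -> {'on(cat, sofa)'}
```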
The Abstraction and Reasoning Corpus (ARC) is a set of procedural tasks that tests an agent's ability to flexibly solve novel problems. While most ARC tasks are easy for humans, they are challenging for state-of-the-art AI. What makes building intelligent systems that can generalize to novel situations, such as ARC, so difficult? We posit that an answer may be found by studying a difference of language: while humans readily generate and interpret instructions in a general language, computer systems are shackled to narrow, domain-specific languages that they can precisely execute. We present LARC, the Language-complete ARC: a collection of natural language descriptions from human participants who instructed each other on how to solve ARC tasks using language alone, containing successful instructions for 88% of the ARC tasks. We analyze the collected instructions as "natural programs" and find that, while they resemble computer programs, they differ in two ways: first, they contain a wide variety of primitives; second, they frequently use communicative strategies beyond directly executable code. We demonstrate that these two distinctions prevent current program synthesis techniques from leveraging LARC to its full potential, and give concrete suggestions on how to build the next generation of program synthesizers.
Euclidean geometry is among the earliest forms of mathematical thinking. While the geometric primitives underlying its constructions, such as perfect lines and circles, do not often occur in the natural world, humans rarely struggle to perceive and reason with them. Will computer vision models trained on natural images show the same sensitivity to Euclidean geometry? Here we explore these questions by studying few-shot generalization in the universe of Euclidean geometry constructions. We introduce Geoclidean, a domain-specific language for Euclidean geometry, and use it to generate two datasets of geometric concept learning tasks for benchmarking generalization judgements of humans and machines. We find that humans are indeed sensitive to Euclidean geometry and generalize strongly from a few visual examples of a geometric concept. In contrast, low-level and high-level visual features from standard computer vision models pretrained on natural images do not support correct generalization. Thus Geoclidean represents a novel few-shot generalization benchmark for geometric concept learning, where the performance of humans and of AI models diverge. The Geoclidean framework and dataset are publicly available for download.
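As a concrete reminder of what such constructions look like computationally, the short Python snippet below realizes Euclid's first proposition, the equilateral triangle, as the intersection of two circles; it is a generic illustration, not the Geoclidean DSL.

```python
# A generic illustration (not the Geoclidean language) of a compass-and-
# straightedge style construction: Euclid I.1, the equilateral triangle, as
# the intersection of two equal circles centered on a segment's endpoints.
import math

def circle_intersections(c1, r1, c2, r2):
    """Intersection points of two circles, assuming they intersect."""
    (x1, y1), (x2, y2) = c1, c2
    d = math.hypot(x2 - x1, y2 - y1)
    a = (r1**2 - r2**2 + d**2) / (2 * d)          # distance from c1 to the chord
    h = math.sqrt(max(r1**2 - a**2, 0.0))         # half-length of the chord
    mx, my = x1 + a * (x2 - x1) / d, y1 + a * (y2 - y1) / d
    ox, oy = h * (y2 - y1) / d, h * (x2 - x1) / d
    return (mx + ox, my - oy), (mx - ox, my + oy)

A, B = (0.0, 0.0), (1.0, 0.0)
r = math.dist(A, B)
C, _ = circle_intersections(A, r, B, r)           # one apex of the triangle
print("apex:", C)                                  # approximately (0.5, -0.866)
```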
The visual oddity task is considered a universal, culture-independent test of analytic intelligence in humans. Advances in artificial intelligence have led to important breakthroughs, yet competing with humans on such analytic intelligence tasks remains challenging and typically resorts to architectures that are not biologically plausible. We present a biologically realistic system that receives inputs from synthetic eye movements - saccades - and processes them with neurons incorporating the dynamics of neocortical neurons. We introduce a procedurally generated visual oddity dataset to train an architecture that extends conventional relational networks, as well as our proposed system. Both approaches surpass human accuracy, and we find that both share the same essential underlying mechanism of reasoning. Finally, we show that the biologically inspired network achieves superior accuracy, learns faster, and requires fewer parameters than the conventional network.
People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms-for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world's alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several "visual Turing tests" probing the model's creative generalization abilities, which in many cases are indistinguishable from human behavior.
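The Bayesian program-induction idea can be caricatured in a few lines of Python: each concept is a tiny "program" that renders an image, and a one-shot example is classified by the program with the highest posterior (prior times likelihood). This is a toy sketch of the abstract's framing, not the authors' BPL model.

```python
# A toy sketch of the Bayesian ideas in the abstract (not the authors' model):
# each concept is a tiny "program" that renders a binary image; an observed
# example is scored by prior(program) * likelihood(image | program), and
# one-shot classification picks the program with the highest posterior.
import math
import numpy as np

def render_vertical():   return np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
def render_horizontal(): return np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
def render_cross():      return np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]])

PROGRAMS = {  # concept name -> (renderer, prior probability)
    "vertical":   (render_vertical,   0.4),
    "horizontal": (render_horizontal, 0.4),
    "cross":      (render_cross,      0.2),
}

def log_likelihood(image, template, flip_prob=0.1):
    """Pixels match the template independently, each flipped with flip_prob."""
    agree = (image == template)
    return float(np.where(agree, math.log(1 - flip_prob), math.log(flip_prob)).sum())

def classify(image):
    scores = {name: math.log(prior) + log_likelihood(image, render())
              for name, (render, prior) in PROGRAMS.items()}
    return max(scores, key=scores.get)

noisy_cross = render_cross()
noisy_cross[0, 0] = 1                      # one corrupted pixel
print(classify(noisy_cross))               # -> "cross"
```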
Current deep learning approaches show good in-distribution generalization performance, but struggle with out-of-distribution generalization. This is especially true for tasks involving abstract relations, such as recognizing the rule in a sequence, as found in many intelligence tests. Recent work has explored how forcing relational representations to remain distinct from sensory representations, as appears to be the case in the brain, can help artificial systems. Building on this work, we further explore and formalize the advantages afforded by "partitioned" representations of relations and sensory details, and how this inductive bias can help recompose learned relational structure in newly encountered settings. We introduce a simple architecture based on similarity scores, which we name the Compositional Relational Network (CoRelNet). Using this model, we investigate a series of inductive biases that ensure abstract relations are learned and represented distinctly from sensory data, and explore their effects on out-of-distribution generalization on a range of relational psychophysics tasks. We find that simple architectural choices can outperform existing models in out-of-distribution generalization. Taken together, these results suggest that partitioning relational representations from other information streams may be a simple way to make existing network architectures more robust when performing out-of-distribution relational computations.
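A minimal PyTorch sketch of a similarity-score relational module in the spirit of the description above (an assumption-laden simplification, not the authors' CoRelNet code): the decision head sees only the pairwise similarity matrix of the object embeddings, keeping relational information partitioned from sensory detail.

```python
# A minimal sketch of a similarity-score relational module: objects are
# encoded, all pairwise similarities are computed, and only that similarity
# matrix -- never the sensory embeddings themselves -- reaches the decision
# head, partitioning relational information from sensory detail.
import torch
import torch.nn as nn

class SimilarityRelationalNet(nn.Module):
    def __init__(self, in_dim: int, embed_dim: int, n_objects: int, n_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU(),
                                     nn.Linear(embed_dim, embed_dim))
        # The head only sees the flattened n_objects x n_objects similarity matrix.
        self.head = nn.Sequential(nn.Linear(n_objects * n_objects, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))

    def forward(self, objects: torch.Tensor) -> torch.Tensor:
        # objects: (batch, n_objects, in_dim)
        z = self.encoder(objects)                       # (batch, n, d)
        z = torch.nn.functional.normalize(z, dim=-1)
        sim = torch.bmm(z, z.transpose(1, 2))           # cosine similarity matrix
        return self.head(sim.flatten(start_dim=1))

model = SimilarityRelationalNet(in_dim=16, embed_dim=32, n_objects=4, n_classes=2)
logits = model(torch.randn(8, 4, 16))                   # 8 sets of 4 objects
print(logits.shape)                                     # torch.Size([8, 2])
```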
Artificial Intelligence (AI) and its applications have sparked extraordinary interest in recent years. This achievement can be ascribed in part to advances in AI subfields including Machine Learning (ML), Computer Vision (CV), and Natural Language Processing (NLP). Deep learning, a sub-field of machine learning that employs artificial neural network concepts, has enabled the most rapid growth in these domains. As a result, the integration of vision and language has attracted a great deal of attention, and the associated tasks have been designed so that they properly exemplify the concepts of deep learning. In this review paper, we provide a thorough and extensive review of state-of-the-art approaches and key model design principles, and discuss existing datasets, methods, problem formulations, and evaluation measures for VQA and visual reasoning tasks, in order to understand vision-and-language representation learning. We also present some potential future paths in this field of research, with the hope that our study may generate new ideas and novel approaches to handle existing difficulties and develop new applications.
This paper reviews concepts, modeling approaches, and recent findings along a spectrum of different levels of abstraction of neural network models, including generalization across (1) samples, (2) distributions, (3) domains, (4) tasks, (5) modalities, and (6) scopes. Results on (1) sample generalization show that, for ImageNet, nearly all recent improvements reduced training error while overfitting stayed flat; with training error nearly eliminated, future progress will need to focus on reducing overfitting. Perspectives from statistics highlight how (2) distribution generalization can be viewed alternately as a change in sample weights or a change in the input-output relationship. Transfer learning approaches to (3) domain generalization are summarized, along with recent advances and the wealth of domain adaptation benchmark datasets. Recent breakthroughs surveyed in (4) task generalization include few-shot meta-learning approaches and the BERT NLP engine, as are recent (5) modality generalization studies that integrate image and text data and apply biologically inspired networks across olfactory, visual, and auditory modalities. Recent (6) scope generalization results that embed knowledge graphs into deep NLP approaches are also reviewed. In addition, concepts from the modular structure of the brain, and how steps toward abstract thinking arise from dopamine-driven conditioning, are discussed.
Raven's Progressive Matrices (RPMs) are frequently used to evaluate humans' visual reasoning ability. Researchers have made considerable efforts to develop systems that solve RPMs, often through black-box, end-to-end convolutional neural networks (CNNs) for both the visual recognition and the logical reasoning tasks. Toward the goal of a highly explainable solution, we propose the one-shot human-understandable reasoner (OS-HURS), a two-step framework consisting of a perception module and a reasoning module, to tackle the challenges of real-world visual recognition and the subsequent logical reasoning task. For the reasoning module, we propose a "2+1" formulation that can be better understood by humans and significantly reduces model complexity. As a result, a precise reasoning rule can be deduced from only one RPM example, which is not feasible for existing solution methods. The proposed reasoning module is also able to yield a set of reasoning rules that precisely model human knowledge for solving RPM problems. To validate the proposed method on real-world applications, we construct an RPM-like one-shot frame-prediction (ROF) dataset, in which visual reasoning is conducted on RPMs built from real-world video frames instead of synthetic images. Experimental results on various RPM-like datasets demonstrate that the proposed OS-HURS achieves significant and consistent performance gains compared with state-of-the-art models.
A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies. Such composite structures can induce rich semantic concepts and relations, and thus play an important role in the interpretation and organization of visual signals, as well as in the generalization of visual perception and reasoning. However, existing visual reasoning benchmarks mostly focus on objects rather than parts. Visual reasoning based on the full part-whole hierarchy is much more challenging than object-centric reasoning, owing to finer-grained concepts, richer geometric relations, and more complex physics. Therefore, to better serve part-based conceptual, relational, and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR contains around 70k RGBD synthetic images with ground-truth object- and part-level annotations regarding semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 700k machine-generated questions covering various types of reasoning, making them a good testbed for visual reasoning models. We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We believe this dataset will open up new opportunities for part-based reasoning.