Euclidean geometry is among the earliest forms of mathematical thinking. While the geometric primitives underlying its constructions, such as perfect lines and circles, do not often occur in the natural world, humans rarely struggle to perceive and reason with them. Will computer vision models trained on natural images show the same sensitivity to Euclidean geometry? Here we explore this question by studying few-shot generalization in the universe of Euclidean geometry constructions. We introduce Geoclidean, a domain-specific language for Euclidean geometry, and use it to generate two datasets of geometric concept learning tasks for benchmarking generalization judgements of humans and machines. We find that humans are indeed sensitive to Euclidean geometry and generalize strongly from a few visual examples of a geometric concept. In contrast, low-level and high-level visual features from standard computer vision models pretrained on natural images do not support correct generalization. Thus Geoclidean represents a novel few-shot generalization benchmark for geometric concept learning, where the performance of humans and that of AI models diverge. The Geoclidean framework and dataset are publicly available for download.
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at
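Large language models (LLMs) trained on code completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be repurposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) followed by the corresponding policy code (via few-shot prompting), LLMs can take in new commands and autonomously recompose API calls to generate new policy code. By chaining classic logic structures and referencing third-party libraries (e.g., NumPy, Shapely) to perform arithmetic, LLMs used in this way can write robot policies that (i) exhibit spatial-geometric reasoning, (ii) generalize to new instructions, and (iii) prescribe precise values (e.g., velocities) for ambiguous descriptions (e.g., "faster") depending on context (i.e., behavioral commonsense). This paper presents code as policies: a robot-centric formulation of language model generated programs (LMPs) that can represent reactive policies (e.g., impedance controllers) as well as waypoint-based policies (vision-based pick and place, trajectory-based control), demonstrated across multiple real robot platforms. Central to our approach is prompting hierarchical code generation (recursively defining undefined functions), which can write more complex code and also improves on the state of the art, solving 39.8% of the problems on the HumanEval [1] benchmark. Code and videos are available at https://code-as-policies.github.io.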
People learning new concepts can often generalize successfully from just a single example, yet machine learning algorithms typically require tens or hundreds of examples to perform with similar accuracy. People can also use learned concepts in richer ways than conventional algorithms: for action, imagination, and explanation. We present a computational model that captures these human learning abilities for a large class of simple visual concepts: handwritten characters from the world's alphabets. The model represents concepts as simple programs that best explain observed examples under a Bayesian criterion. On a challenging one-shot classification task, the model achieves human-level performance while outperforming recent deep learning approaches. We also present several "visual Turing tests" probing the model's creative generalization abilities, which in many cases are indistinguishable from human behavior.
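A critical aspect of human visual perception is the ability to parse visual scenes into individual objects and further into object parts, forming part-whole hierarchies. Such composite structures can induce a rich set of semantic concepts and relations, and thus play an important role in the interpretation and organization of visual signals, as well as in the generalization of visual perception and reasoning. However, existing visual reasoning benchmarks mostly focus on objects rather than parts. Visual reasoning based on the full part-whole hierarchy is much more challenging than object-centric reasoning because of finer-grained concepts, richer geometric relations, and more complex physics. Therefore, to better serve part-based conceptual, relational, and physical reasoning, we introduce a new large-scale diagnostic visual reasoning dataset named PTR. PTR contains around 70k RGBD synthetic images with ground-truth object-level and part-level annotations covering semantic instance segmentation, color attributes, spatial and geometric relationships, and certain physical properties such as stability. These images are paired with 700k machine-generated questions covering various types of reasoning, making the dataset a good testbed for visual reasoning models. We examine several state-of-the-art visual reasoning models on this dataset and observe that they still make many surprising mistakes in situations where humans can easily infer the correct answer. We believe this dataset will open up new opportunities for part-based reasoning.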
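Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner. Given a high-level, symbolic description of a novel concept in terms of previously learned visual concepts and their relations, humans can recognize the novel concept without seeing any examples. Moreover, they can acquire new concepts by parsing and communicating symbolic structures using learned visual concepts and relations. Endowing machines with these capabilities is pivotal to improving their generalization at inference time. In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot way. ZeroC represents a concept as a graph of constituent concept models (as nodes) and their relations (as edges). To allow composition at inference time, we employ energy-based models (EBMs) to model concepts and relations. We design the ZeroC architecture so that it allows a one-to-one mapping between the symbolic graph structure of a concept and its corresponding EBM, which, for the first time, allows acquiring new concepts, communicating their graph structure, and applying them to classification and detection tasks (even across domains) at inference time. We introduce algorithms for learning and inference with ZeroC. We evaluate ZeroC on a challenging grid-world dataset designed to probe zero-shot concept recognition and acquisition, and demonstrate its capability.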
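Multiple sketch datasets have been proposed to understand how people draw 3D objects. However, such datasets are often small in scale and cover a small set of objects or categories. In addition, these datasets mostly contain freehand sketches from expert users, which makes it difficult to compare the drawings of expert and novice users, even though such comparisons are critical for informing more effective sketch-based interfaces for either user group. These observations motivate us to analyze how differently people with and without adequate drawing skills sketch 3D objects. We invited 70 novice users and 38 expert users to sketch 136 3D objects, which were presented as 362 images rendered from multiple views. This leads to a new dataset of 3,620 freehand multi-view sketches, which are registered with their corresponding 3D objects under certain views. Our dataset is an order of magnitude larger than existing datasets. We analyze the collected data at three levels, under both spatial and temporal characteristics, and within and across groups of creators. We find that the drawings of professionals and novices differ significantly at the stroke level, both intrinsically and extrinsically. We demonstrate the usefulness of our dataset in two applications: (i) freehand-style sketch synthesis, and (ii) use as a potential benchmark for sketch-based 3D reconstruction. Our dataset and code are available at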
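Given everyday artifacts such as tables and chairs, humans recognize high-level regularities within them, such as the symmetry of a table and the repetition of its legs, while also possessing low-level priors about their geometry, e.g., that surfaces are smooth and edges are sharp. This kind of knowledge constitutes an important part of human perceptual understanding and reasoning. How such knowledge is represented, how to reason with it, and how it is acquired remain open questions in artificial intelligence (AI) and cognitive science. Building on the previously proposed \emph{3D shape programs} representation, together with the accompanying neural generator and executor of \citet{tian2019llear}, we propose an analytical yet differentiable executor that is more faithful and controllable in interpreting shape programs (particularly in extrapolation) and more sample efficient (it requires no training). These properties facilitate the generator's learning when ground-truth programs are not available, and should be especially useful when new shape-program components are enrolled, either by human designers or, in the context of library learning, by the algorithms themselves. Preliminary experiments on using the executor for adaptation illustrate the above advantages of the proposed module, and encourage the exploration of similar methods for building machines that learn to reason with this kind of knowledge, and even learn the knowledge itself.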
Recent progress in artificial intelligence (AI) has renewed interest in building systems that learn and think like people. Many advances have come from using deep neural networks trained end-to-end in tasks such as object recognition, video games, and board games, achieving performance that equals or even beats humans in some respects. Despite their biological inspiration and performance achievements, these systems differ from human intelligence in crucial ways. We review progress in cognitive science suggesting that truly human-like learning and thinking machines will have to reach beyond current engineering trends in both what they learn, and how they learn it. Specifically, we argue that these machines should (a) build causal models of the world that support explanation and understanding, rather than merely solving pattern recognition problems; (b) ground learning in intuitive theories of physics and psychology, to support and enrich the knowledge that is learned; and (c) harness compositionality and learning-to-learn to rapidly acquire and generalize knowledge to new tasks and situations. We suggest concrete challenges and promising routes towards these goals that can combine the strengths of recent neural network advances with more structured cognitive models.
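The visual oddity task was conceived as a universal, ethnicity-independent test of analytic intelligence for humans. Advances in artificial intelligence have led to important breakthroughs, yet competing with humans on such analytic intelligence tasks remains challenging and typically resorts to architectures that are not biologically plausible. We present a biologically realistic system that receives its input from synthetic eye movements (saccades) and processes it with neurons that incorporate the dynamics of neocortical neurons. We introduce a procedurally generated visual oddity dataset to train both an architecture that extends conventional relational networks and our proposed system. Both approaches surpass human accuracy, and we find that both share the same essential underlying reasoning mechanism. Finally, we show that the biologically inspired network achieves superior accuracy, learns faster, and requires fewer parameters than the conventional network.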
We study the problem of object recognition for categories for which we have no training examples, a task also called zero-data or zero-shot learning. This situation has hardly been studied in computer vision research, even though it occurs frequently; the world contains tens of thousands of different object classes, and image collections have been formed and suitably annotated for only a few of them. To tackle the problem, we introduce attribute-based classification: Objects are identified based on a high-level description that is phrased in terms of semantic attributes, such as the object's color or shape. Because the identification of each such property transcends the specific learning task at hand, the attribute classifiers can be prelearned independently, for example, from existing image data sets unrelated to the current task. Afterward, new classes can be detected based on their attribute representation, without the need for a new training phase. In this paper, we also introduce a new data set, Animals with Attributes, of over 30,000 images of 50 animal classes, annotated with 85 semantic attributes. Extensive experiments on this and two more data sets show that attribute-based classification indeed is able to categorize images without access to any training images of the target classes.
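Numerous visio-linguistic (V+L) representation learning methods have been developed, yet existing datasets do not evaluate the extent to which they represent visual and linguistic concepts in a unified space. Inspired by the cross-lingual transfer and psycholinguistics literature, we propose a novel evaluation setting for V+L models: zero-shot cross-modal transfer. Existing V+L benchmarks also often report global accuracy scores over the entire dataset, which makes it difficult to pinpoint the specific reasoning tasks at which models fail and succeed. To address this issue and enable the evaluation of cross-modal transfer, we present TraVLR, a synthetic dataset comprising four V+L reasoning tasks. Each example encodes the scene bimodally, so that either modality can be dropped during training or testing with no loss of relevant information. TraVLR's training and testing distributions are also constrained along task-relevant dimensions, enabling the evaluation of out-of-distribution generalization. We evaluate four state-of-the-art V+L models and find that, although they perform well on test sets from the same modality, all models fail to transfer cross-modally and have limited success accommodating the addition or deletion of one modality. In alignment with prior work, we also find that these models require large amounts of data to learn simple spatial relationships. We release TraVLR as an open challenge for the research community.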
Understanding the 3D world from 2D images involves more than detection and segmentation of the objects within the scene. It also includes the interpretation of the structure and arrangement of the scene elements. Such understanding is often rooted in recognizing the physical world and its limitations, and in prior knowledge as to how similar typical scenes are arranged. In this research we pose a new challenge for neural network (or other) scene understanding algorithms - can they distinguish between plausible and implausible scenes? Plausibility can be defined both in terms of physical properties and in terms of functional and typical arrangements. Hence, we define plausibility as the probability of encountering a given scene in the real physical world. We build a dataset of synthetic images containing both plausible and implausible scenes, and test the success of various vision models in the task of recognizing and understanding plausibility.
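A long-held objective in AI is to build systems that understand concepts in a humanlike way. Setting aside the difficulty of building such a system, even trying to evaluate one is a challenge, owing to present-day AI's relative opacity and its proclivity for finding shortcut solutions. This is exacerbated by humans' tendency to anthropomorphize, assuming that a system that can recognize one instance of a concept must also understand other instances, as a human would. In this paper, we argue that understanding a concept requires the ability to use it in varied contexts. Accordingly, we propose systematic, concept-centered evaluations that probe a system's ability to use a given concept in many different instantiations. We present case studies of such evaluations in two domains that have been used to develop and assess abstraction abilities in AI systems: RAVEN (inspired by Raven's Progressive Matrices) and the Abstraction and Reasoning Corpus (ARC). Our concept-based approach to evaluation reveals information about AI systems that conventional test sets would have left hidden.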