A core process in human cognition is analogical mapping: the ability to identify a similar relational structure between different situations. We introduce a novel task, Visual Analogies of Situation Recognition, adapting the classical word-analogy task into the visual domain. Given a triplet of images, the task is to select an image candidate B' that completes the analogy (A to A' is like B to what?). Unlike previous work on visual analogy that focused on simple image transformations, we tackle complex analogies requiring understanding of scenes. We leverage situation recognition annotations and the CLIP model to generate a large set of 500k candidate analogies. Crowdsourced annotations for a sample of the data indicate that humans agree with the dataset label ~80% of the time (chance level 25%). Furthermore, we use human annotations to create a gold-standard dataset of 3,820 validated analogies. Our experiments demonstrate that state-of-the-art models do well when distractors are chosen randomly (~86%), but struggle with carefully chosen distractors (~53%, compared to 90% human accuracy). We hope our dataset will encourage the development of new analogy-making models. Website: https://vasr-dataset.github.io/
translated by 谷歌翻译
虽然视觉和语言模型在视觉问题回答等任务上表现良好,但在基本的人类常识性推理技能方面,它们会挣扎。在这项工作中,我们介绍了Winogavil:在线游戏,以收集视觉和语言协会(例如,狼人到满月),用作评估最先进模型的动态基准。受欢迎的纸牌游戏代号的启发,Spymaster提供了与几个视觉候选者相关的文本提示,另一个玩家必须识别它们。人类玩家因创建对竞争对手AI模型而具有挑战性的联想而获得了回报,但仍然可以由其他人类玩家解决。我们使用游戏来收集3.5k实例,发现它们对人类的直观(> 90%的Jaccard索引),但对最先进的AI模型充满挑战,其中最佳模型(Vilt)的得分为52% ,成功的位置在视觉上是显着的。我们的分析以及我们从玩家那里收集的反馈表明,收集的关联需要多种推理技能,包括一般知识,常识,抽象等。我们发布数据集,代码和交互式游戏,旨在允许未来的数据收集,可用于开发具有更好关联能力的模型。
translated by 谷歌翻译
人类具有出色的能力来推理绑架并假设超出图像的字面内容的内容。通过识别散布在整个场景中的具体视觉线索,我们几乎不禁根据我们的日常经验和对世界的知识来提出可能的推论。例如,如果我们在道路旁边看到一个“ 20英里 /小时”的标志,我们可能会假设街道位于居民区(而不是在高速公路上),即使没有房屋。机器可以执行类似的视觉推理吗?我们提出了Sherlock,这是一个带注释的103K图像的语料库,用于测试机器能力,以超出字面图像内容的绑架推理。我们采用免费观看范式:参与者首先观察并识别图像中的显着线索(例如,对象,动作),然后给定线索,然后提供有关场景的合理推论。我们总共收集了363K(线索,推理)对,该对形成了首个绑架的视觉推理数据集。使用我们的语料库,我们测试了三个互补的绑架推理轴。我们评估模型的能力:i)从大型候选人语料库中检索相关推论; ii)通过边界框来定位推论的证据,iii)比较合理的推论,以匹配人类在新收集的19k李克特级判断的诊断语料库上的判断。尽管我们发现具有多任务目标的微调夹RN50x64优于强大的基准,但模型性能与人类一致之间存在着重要的净空。可在http://visualabduction.com/上获得数据,模型和排行榜
translated by 谷歌翻译
为了使AI安全地在医院,学校和工作场所等现实世界中安全部署,它必须能够坚定地理解物理世界。这种推理的基础是物理常识:了解可用对象的物理特性和提供的能力,如何被操纵以及它们如何与其他对象进行交互。物理常识性推理从根本上是一项多感官任务,因为物理特性是通过多种模式表现出来的,其中两个是视觉和声学。我们的论文通过贡献PACS来朝着现实世界中的物理常识推理:第一个用于物理常识属性注释的视听基准。 PACS包含13,400对答案对,涉及1,377个独特的物理常识性问题和1,526个视频。我们的数据集提供了新的机会来通过将音频作为此多模式问题的核心组成部分来推进物理推理的研究领域。使用PACS,我们在我们的新挑战性任务上评估了多种最先进的模型。尽管某些模型显示出令人鼓舞的结果(精度为70%),但它们都没有人类的绩效(精度为95%)。我们通过证明多模式推理的重要性并为未来的研究提供了可能的途径来结束本文。
translated by 谷歌翻译
我们介绍了遮阳板,一个新的像素注释的新数据集和一个基准套件,用于在以自我为中心的视频中分割手和活动对象。遮阳板注释Epic-kitchens的视频,其中带有当前视频分割数据集中未遇到的新挑战。具体而言,我们需要确保像素级注释作为对象经历变革性相互作用的短期和长期一致性,例如洋葱被剥皮,切成丁和煮熟 - 我们旨在获得果皮,洋葱块,斩波板,刀,锅以及表演手的准确像素级注释。遮阳板引入了一条注释管道,以零件为ai驱动,以进行可伸缩性和质量。总共,我们公开发布257个对象类的272K手册语义面具,990万个插值密集口罩,67K手动关系,涵盖36小时的179个未修剪视频。除了注释外,我们还引入了视频对象细分,互动理解和长期推理方面的三个挑战。有关数据,代码和排行榜:http://epic-kitchens.github.io/visor
translated by 谷歌翻译
我们挑战AI模型,以“展示”对《纽约客》标题比赛的复杂多模式幽默的理解。具体而言,我们开发了三个精心限制的任务,以掌握图像和标题之间的潜在复杂和意外的关系,并且对人类经验的广泛品种产生了复杂和意外的寓意;这些是纽约口径卡通的标志。我们调查了直接将卡通像素和字幕输入的视觉和语言模型,以及仅通过提供图像的文本描述来规避图像处理的仅限语言模型。即使我们为卡通图像提供了丰富的多方面注释,我们也可以确定高质量的机器学习模型(例如,微调,175b参数语言模型)和人类之间的性能差距。我们公开发布我们的语料库,包括描述图像的位置/实体的注释,场景的不寻常以及对笑话的解释。
translated by 谷歌翻译
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the chal-
translated by 谷歌翻译
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in
translated by 谷歌翻译
我们提出Valse(视觉和语言结构化评估),这是一种新的基准,专为测试通用净化的视觉和语言(V&L)模型而设计,用于对特定语言现象的视野 - 语言接地能力。Valse提供涵盖各种语言构建体的六种测试套件。解决这些需要模型在视觉模型中地对语言现象,允许比迄今为止更细粒度的评估。我们使用支持有效箔的构造的方法构建Valse,并通过评估五种广泛使用的V&L模型的报告结果。我们的实验表明,目前的模型有很大的困难解决了大多数现象。因此,我们预计Valse就可以作为一种重要的基准,从语言角度来衡量预训过的V&L模型的未来进展,补充规范任务为中心的V&L评价。
translated by 谷歌翻译
大型语言模型越来越能够通过相对较少的特定任务的监督产生流畅的出现文本。但这些模型可以准确解释分类决策吗?我们考虑使用少量人写的例子(即,以几滴方式)生成自由文本解释的任务。我们发现(1)创作更高质量的例子,以提示导致更高质量的世代; (2)令人惊讶的是,在头到头比较中,人群公司通常更喜欢GPT-3生成的解释,以众包中包含的人性写入的解释。然而,Crowdworker评级也表明,虽然模型产生了事实,语法和充分的解释,但它们具有改进的空间,例如沿着提供新颖信息和支持标签的轴。我们创建了一种管道,该管道将GPT-3与监督过滤器结合起来,该过滤器通过二进制可接受性判断来包含人类循环。尽管具有重要的主观性内在的判断可接受性,但我们的方法能够始终如一地过滤人类可接受的GPT-3生成的解释。
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
We introduce KiloGram, a resource for studying abstract visual reasoning in humans and machines. Drawing on the history of tangram puzzles as stimuli in cognitive science, we build a richly annotated dataset that, with >1k distinct stimuli, is orders of magnitude larger and more diverse than prior resources. It is both visually and linguistically richer, moving beyond whole shape descriptions to include segmentation maps and part labels. We use this resource to evaluate the abstract visual reasoning capacities of recent multi-modal models. We observe that pre-trained weights demonstrate limited abstract reasoning, which dramatically improves with fine-tuning. We also observe that explicitly describing parts aids abstract reasoning for both humans and models, especially when jointly encoding the linguistic and visual inputs. KiloGram is available at https://lil.nlp.cornell.edu/kilogram .
translated by 谷歌翻译
图像文本匹配(ITM)是评估视觉和语言(VL)模型的常见任务。但是,现有的ITM基准有一个重大限制。他们有许多缺失的信件,源自数据构建过程本身。例如,标题仅与一个图像匹配,尽管标题可以与其他类似图像匹配,反之亦然。为了纠正大规模的虚假负面因素,我们通过提供与机器和人类注释者的缺失关联来构建扩展的可可验证(ECCV)标题数据集。我们在注释过程中采用五个具有不同属性的最先进的ITM模型。与原始的MS-Coco相比,我们的数据集提供了X3.6的X3.6积极图像到支撑关联和X8.5字幕到图像关联。我们还建议使用基于等级的公制映射@r,而不是流行的召回@k(r@k)。我们在现有和拟议的基准测试中重新评估了现有的25个VL模型。我们的发现是现有的基准测试,例如可可1K r@k,可可5k r@k,cxc r@1彼此高度相关,而当我们转移到eccv map@r时,排名会改变。最后,我们深入研究机器注释者选择引入的偏差的效果。源代码和数据集可从https://github.com/naver-ai/eccv-caption获得
translated by 谷歌翻译
情绪分析中最突出的任务是为文本分配情绪,并了解情绪如何在语言中表现出来。自然语言处理的一个重要观察结果是,即使没有明确提及情感名称,也可以通过单独参考事件来隐式传达情绪。在心理学中,被称为评估理论的情感理论类别旨在解释事件与情感之间的联系。评估可以被形式化为变量,通过他们认为相关的事件的人们的认知评估来衡量认知评估。其中包括评估事件是否是新颖的,如果该人认为自己负责,是否与自己的目标以及许多其他人保持一致。这样的评估解释了哪些情绪是基于事件开发的,例如,新颖的情况会引起惊喜或不确定后果的人可能引起恐惧。我们在文本中分析了评估理论对情绪分析的适用性,目的是理解注释者是否可以可靠地重建评估概念,如果可以通过文本分类器预测,以及评估概念是否有助于识别情感类别。为了实现这一目标,我们通过要求人们发短信描述触发特定情绪并披露其评估的事件来编译语料库。然后,我们要求读者重建文本中的情感和评估。这种设置使我们能够衡量是否可以纯粹从文本中恢复情绪和评估,并为判断模型的绩效指标提供人体基准。我们将文本分类方法与人类注释者的比较表明,两者都可以可靠地检测出具有相似性能的情绪和评估。我们进一步表明,评估概念改善了文本中情绪的分类。
translated by 谷歌翻译
为了实现长文档理解的构建和测试模型,我们引入质量,具有中文段的多项选择QA DataSet,具有约5,000个令牌的平均长度,比典型的当前模型更长。与经过段落的事先工作不同,我们的问题是由阅读整个段落的贡献者编写和验证的,而不是依赖摘要或摘录。此外,只有一半的问题是通过在紧缩时间限制下工作的注释器来应答,表明略读和简单的搜索不足以一直表现良好。目前的模型在此任务上表现不佳(55.4%),并且落后于人类性能(93.5%)。
translated by 谷歌翻译
Visual Entity Linking (VEL) is a task to link regions of images with their corresponding entities in Knowledge Bases (KBs), which is beneficial for many computer vision tasks such as image retrieval, image caption, and visual question answering. While existing tasks in VEL either rely on textual data to complement a multi-modal linking or only link objects with general entities, which fails to perform named entity linking on large amounts of image data. In this paper, we consider a purely Visual-based Named Entity Linking (VNEL) task, where the input only consists of an image. The task is to identify objects of interest (i.e., visual entity mentions) in images and link them to corresponding named entities in KBs. Since each entity often contains rich visual and textual information in KBs, we thus propose three different sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL). In addition, we present a high-quality human-annotated visual person linking dataset, named WIKIPerson. Based on WIKIPerson, we establish a series of baseline algorithms for the solution of each sub-task, and conduct experiments to verify the quality of proposed datasets and the effectiveness of baseline methods. We envision this work to be helpful for soliciting more works regarding VNEL in the future. The codes and datasets are publicly available at https://github.com/ict-bigdatalab/VNEL.
translated by 谷歌翻译
The Flickr30k dataset has become a standard benchmark for sentence-based image description. This paper presents Flickr30k Entities, which augments the 158k captions from Flickr30k with 244k coreference chains, linking mentions of the same entities across different captions for the same image, and associating them with 276k manually annotated bounding boxes. Such annotations are essential for continued progress in automatic image description and grounded language understanding. They enable us to define a new benchmark for localization of textual entity mentions in an image. We present a strong baseline for this task that combines an image-text embedding, detectors for common objects, a color classifier, and a bias towards selecting larger objects. While our baseline rivals in accuracy more complex state-of-the-art models, we show that its gains cannot be easily parlayed into improvements on such tasks as image-sentence retrieval, thus underlining the limitations of current methods and the need for further research.
translated by 谷歌翻译
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
translated by 谷歌翻译
Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer.Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and highquality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (∼45%).To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (∼65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.
translated by 谷歌翻译
最近,对建立问题的兴趣越来越兴趣,其中跨多种模式(如文本和图像)的原因。但是,使用图像的QA通常仅限于从预定义的选项集中挑选答案。此外,在现实世界中的图像,特别是在新闻中,具有与文本共同参考的对象,其中来自两个模态的互补信息。在本文中,我们提出了一种新的QA评估基准,并在新闻文章中提出了1,384个问题,这些文章需要跨媒体接地图像中的物体接地到文本上。具体地,该任务涉及需要推理图像标题对的多跳问题,以识别接地的视觉对象,然后从新闻正文文本中预测跨度以回答问题。此外,我们介绍了一种新颖的多媒体数据增强框架,基于跨媒体知识提取和合成问题答案生成,自动增强可以为此任务提供弱监管的数据。我们在我们的基准测试中评估了基于管道和基于端到端的预先预测的多媒体QA模型,并表明他们实现了有希望的性能,而在人类性能之后大幅滞后,因此留下了未来工作的大型空间,以便在这一具有挑战性的新任务上的工作。
translated by 谷歌翻译