Visual dialog requires answering a sequence of questions about an image, using the dialog history as context. Beyond the challenges found in visual question answering (VQA), which can be seen as a single round of dialog, visual dialog contains several more. We focus on one such problem, called visual coreference resolution, which involves determining which words, typically noun phrases and pronouns, co-refer to the same entity or object instance in the image. This is crucial especially for pronouns (e.g., 'it'): the dialog agent must first link the pronoun to a prior coreference (e.g., 'boat') before it can rely on the visual grounding of that coreference 'boat' to reason about the pronoun 'it'. Prior work (in visual dialog) models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question, rather than explicitly at the granularity of phrases. In this work, we propose a neural module network architecture for visual dialog, introducing two novel modules -- Refer and Exclude -- that perform explicit, grounded coreference resolution at the finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-rich dataset, where it achieves near-perfect accuracy, and on VisDial, a large and challenging visual dialog dataset of real images, where our model outperforms other approaches and is qualitatively more interpretable, grounded, and consistent.
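As a rough illustration (not the authors' code) of how a Refer-style module might resolve a pronoun at the word level, the sketch below softly attends over embeddings of earlier noun phrases and reuses their visual groundings; all names, shapes, and the exact scoring function are illustrative assumptions.

# A minimal sketch of a Refer-style module: given the embedding of the
# phrase to resolve (e.g., "it") and a memory of previously grounded
# phrases, softly retrieve the visual attention map of the most
# compatible earlier referent. Hypothetical interfaces throughout.
import torch
import torch.nn.functional as F

def refer(phrase_emb, past_phrase_embs, past_attention_maps):
    # phrase_emb: (d,) embedding of the word/phrase to resolve
    # past_phrase_embs: (n, d) embeddings of earlier noun phrases
    # past_attention_maps: (n, H, W) visual groundings of those phrases
    scores = past_phrase_embs @ phrase_emb           # (n,) compatibility
    weights = F.softmax(scores, dim=0)               # soft coreference link
    # Weighted combination of earlier groundings resolves the pronoun
    return torch.einsum('n,nhw->hw', weights, past_attention_maps)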
We introduce the first goal-driven training for visual question answering and dialog agents. Specifically, we pose a cooperative 'image guessing' game between two agents -- Qbot and Abot -- who communicate in natural language dialog so that Qbot can select an unseen image from a lineup of images. We use deep reinforcement learning (RL) to learn the policies of these agents end-to-end -- from pixels to multi-agent multi-round dialog to game reward.

We demonstrate two experimental results. First, as a 'sanity check' demonstration of pure RL (from scratch), we show results on a synthetic world, where the agents communicate in ungrounded vocabulary, i.e., symbols with no pre-specified meanings (X, Y, Z). We find that two bots invent their own communication protocol and start using certain symbols to ask/answer about certain visual attributes (shape/color/style). Thus, we demonstrate the emergence of grounded language and communication among 'visual' dialog agents with no human supervision. Second, we conduct large-scale real-image experiments on the VisDial dataset, where we pretrain with supervised dialog data and show that the RL 'fine-tuned' agents significantly outperform SL agents. Interestingly, the RL Qbot learns to ask questions that Abot is good at, ultimately resulting in more informative dialog and a better team.
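To make the training signal concrete, here is a hedged REINFORCE-style sketch of the kind of end-to-end policy-gradient update the abstract describes; the function signature and reward are hypothetical stand-ins, not the authors' implementation.

# A minimal policy-gradient (REINFORCE) step for the cooperative game:
# both agents' token log-probabilities over one dialog episode are
# scaled by the shared game reward. All names here are assumptions.
import torch

def reinforce_step(qbot_logprobs, abot_logprobs, reward, optimizer):
    # qbot_logprobs / abot_logprobs: summed log-probabilities of the
    # tokens each agent emitted over a multi-round dialog episode
    # reward: scalar game reward, e.g., improvement in Qbot's ability
    # to pick the unseen image from the lineup
    loss = -(qbot_logprobs + abot_logprobs) * reward  # maximize E[reward]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()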
We study the problem of designing models for machine learning tasks defined on \emph{sets}. In contrast to traditional approaches that operate on fixed-dimensional vectors, we consider objective functions defined on sets that are invariant to permutations. Such problems are widespread, ranging from the estimation of population statistics, to anomaly detection in piezometer data of embankment dams \cite{Jung15Exploration}, to cosmology \cite{Ntampaka16Dynamical,Ravanbakhsh16ICML1}. Our main theorem characterizes permutation invariant functions and provides a family of functions to which any permutation invariant objective function must belong. This family of functions has a special structure that enables us to design a deep network architecture that can operate on sets and can be deployed in a variety of scenarios, including unsupervised and supervised learning tasks. We also derive the necessary and sufficient conditions for permutation equivariance in deep models. We demonstrate the applicability of our method on population statistic estimation, point cloud classification, set expansion, and anomaly detection.
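The special structure the theorem identifies is, roughly, f(X) = rho(sum over x in X of phi(x)) for suitable functions rho and phi, where the inner sum makes f insensitive to the ordering of the set. The sketch below shows a minimal network of this form; the layer sizes are arbitrary illustrative choices, not the paper's configuration.

# A minimal permutation-invariant ("Deep Sets"-style) network:
# per-element encoder phi, sum-pooling over the set, then decoder rho.
import torch
import torch.nn as nn

class DeepSet(nn.Module):
    def __init__(self, in_dim=3, hid=64, out_dim=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(),
                                 nn.Linear(hid, hid))
        self.rho = nn.Sequential(nn.Linear(hid, hid), nn.ReLU(),
                                 nn.Linear(hid, out_dim))

    def forward(self, x):                  # x: (batch, set_size, in_dim)
        # Sum-pooling over the set dimension gives permutation invariance
        return self.rho(self.phi(x).sum(dim=1))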
We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and contains 1 dialog with 10 question-answer pairs on ~120k images from COCO, with a total of ~1.2M dialog question-answer pairs.

We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders -- Late Fusion, Hierarchical Recurrent Encoder and Memory Network -- and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and evaluated on metrics such as mean-reciprocal-rank of human response. We quantify gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first 'visual chatbot'! Our dataset, code, trained models and visual chatbot are available on https://visualdialog.org
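As an illustration of the retrieval-based protocol, the toy function below computes the mean reciprocal rank (MRR) of the human response from candidate-answer scores; the data and names are made up for the example, not taken from the benchmark code.

# Mean reciprocal rank: the agent scores each candidate answer, the
# candidates are sorted by score, and MRR averages 1/rank of the
# human (ground-truth) answer across questions.
def mean_reciprocal_rank(score_lists, human_idxs):
    # score_lists[i]: model scores for the candidate answers of question i
    # human_idxs[i]: index of the human answer among those candidates
    rr = []
    for scores, gt in zip(score_lists, human_idxs):
        ranking = sorted(range(len(scores)), key=lambda j: -scores[j])
        rr.append(1.0 / (ranking.index(gt) + 1))   # ranks are 1-indexed
    return sum(rr) / len(rr)

# e.g., two questions with 4 candidates each:
print(mean_reciprocal_rank([[0.1, 0.9, 0.3, 0.2], [0.5, 0.4, 0.2, 0.6]],
                           [1, 0]))  # -> (1/1 + 1/2) / 2 = 0.75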