Automatically generating a natural language description of an image has attracted interests recently both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks.The selection and fusion form a feedback connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm significantly outperforms the state-of-the-art approaches consistently across different evaluation metrics.
translated by 谷歌翻译
自动在自然语言中自动生成图像的描述称为图像字幕。这是一个积极的研究主题,位于人工智能,计算机视觉和自然语言处理中两个主要领域的交集。图像字幕是图像理解中的重要挑战之一,因为它不仅需要识别图像中的显着对象,还需要其属性及其相互作用的方式。然后,系统必须生成句法和语义上正确的标题,该标题描述了自然语言的图像内容。鉴于深度学习模型的重大进展及其有效编码大量图像并生成正确句子的能力,最近已经提出了几种基于神经的字幕方法,每种方法都试图达到更好的准确性和标题质量。本文介绍了一个基于编码器的图像字幕系统,其中编码器使用以RESNET-101作为骨干为骨干来提取图像中每个区域的空间和全局特征。此阶段之后是一个精致的模型,该模型使用注意力进行注意的机制来提取目标图像对象的视觉特征,然后确定其相互作用。解码器由一个基于注意力的复发模块和一个反思性注意模块组成,该模块会协作地将注意力应用于视觉和文本特征,以增强解码器对长期顺序依赖性建模的能力。在两个基准数据集(MSCOCO和FLICKR30K)上进行的广泛实验显示了提出的方法和生成的字幕的高质量。
translated by 谷歌翻译
连接视觉和语言在生成智能中起着重要作用。因此,已经致力于图像标题的大型研究工作,即用句法和语义有意义的句子描述图像。从2015年开始,该任务通常通过由Visual Encoder组成的管道和文本生成的语言模型来解决任务。在这些年来,两种组件通过对象区域,属性,介绍多模态连接,完全关注方法和伯特早期融合策略的利用而显着发展。但是,无论令人印象深刻的结果,图像标题的研究还没有达到结论性答案。这项工作旨在提供图像标题方法的全面概述,从视觉编码和文本生成到培训策略,数据集和评估度量。在这方面,我们量化地比较了许多相关的最先进的方法来确定架构和培训策略中最有影响力的技术创新。此外,讨论了问题的许多变体及其开放挑战。这项工作的最终目标是作为理解现有文献的工具,并突出显示计算机视觉和自然语言处理的研究领域的未来方向可以找到最佳的协同作用。
translated by 谷歌翻译
图像标题是自动生成句子的任务,以最好的方式生成描述输入图像。最近用于自动生成图像标题的最成功的技术最近使用了细心的深度学习模型。设计了深入学习模型的设计方式有变化。在本调查中,我们为图像标题的细心深度学习模型提供了相关的文献述评。而不是对深度图像标题模型的所有先前工作进行全面审查,我们解释了用于深度学习模型中的图像标题任务的各种类型的注意机制。用于图像标题的最成功的深度学习模型遵循编码器解码器架构,尽管这些模型采用注意机制的方式存在差异。通过分析图像标题的不同细节深层模型的性能结果,我们的目标是在图像标题中找到深度模型中最成功的注意机制。柔软的关注,自下而上的关注和多主题是一种广泛应用于图像标题的最先进的深度学习模型的关注机构的类型。在当前时,最佳结果是从多针关注的变体实现的,以自下而上的关注。
translated by 谷歌翻译
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-theart performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
translated by 谷歌翻译
Visual attention has been successfully applied in structural prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial, i.e., the attention is modeled as spatial probabilities that re-weight the last conv-layer feature map of a CNN encoding an input image. However, we argue that such spatial attention does not necessarily conform to the attention mechanism -a dynamic feature extractor that combines contextual fixations over time, as CNN features are naturally spatial, channel-wise and multi-layer. In this paper, we introduce a novel convolutional neural network dubbed SCA-CNN that incorporates Spatial and Channelwise Attentions in a CNN. In the task of image captioning, SCA-CNN dynamically modulates the sentence generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the proposed SCA-CNN architecture on three benchmark image captioning datasets: Flickr8K, Flickr30K, and MSCOCO. It is consistently observed that SCA-CNN significantly outperforms state-of-the-art visual attention-based image captioning methods.1 Each convolutional layer is optionally followed by a pooling, downsampling, normalization, or a fully connected layer.
translated by 谷歌翻译
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
translated by 谷歌翻译
描述使用自然语言的图像被广泛称为图像标题,这是由于计算机视觉和自然语言生成技术的发展而达成了一致的进展。虽然传统的标题模型基于流行度量的高精度,即BLEU,苹果酒和香料,探索了标题与其他类似图像中的标题的能力。为了产生独特的标题,一些先驱采用对比学习或重新加权地面真理标题,其侧重于一个输入图像。然而,忽略了类似图像组中对象之间的关系(例如,相同专辑中的项目或属性或细粒度事件中的物品)。在本文中,我们使用基于组的独特标题模型(Gdiscap)来提高图像标题的独特性,其将每个图像与一个类似的组中的其他图像进行比较,并突出显示每个图像的唯一性。特别是,我们提出了一种基于组的内存注意力(GMA)模块,其存储在图像组中是唯一的对象特征(即,与其他图像中的对象的低相似性)。生成字幕时突出显示这些唯一的对象功能,从而产生更有独特的标题。此外,选择地面标题中的独特单词来监督语言解码器和GMA。最后,我们提出了一种新的评估度量,独特的单词率(Diswordrate)来测量标题的独特性。定量结果表明,该方法显着提高了几种基线模型的独特性,并实现了精度和独特性的最先进的性能。用户学习的结果与定量评估一致,并证明了新的公制Diswordrate的合理性。
translated by 谷歌翻译
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and topdown attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
translated by 谷歌翻译
It is always well believed that modeling relationships between objects would be helpful for representing and eventually describing an image. Nevertheless, there has not been evidence in support of the idea on image description generation. In this paper, we introduce a new design to explore the connections between objects for image captioning under the umbrella of attention-based encoder-decoder framework. Specifically, we present Graph Convolutional Networks plus Long Short-Term Memory (dubbed as GCN-LSTM) architecture that novelly integrates both semantic and spatial object relationships into image encoder. Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections. The representations of each region proposed on objects are then refined by leveraging graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on LSTM-based captioning framework with attention mechanism for sentence generation. Extensive experiments are conducted on COCO image captioning dataset, and superior results are reported when comparing to state-of-the-art approaches. More remarkably, GCN-LSTM increases CIDEr-D performance from 120.1% to 128.7% on COCO testing set.
translated by 谷歌翻译
图像字幕是当前的研究任务,用于使用场景中的对象及其关系来描述图像内容。为了应对这项任务,使用了两个重要的研究领域,人为的视觉和自然语言处理。在图像字幕中,就像在任何计算智能任务中一样,性能指标对于知道方法的性能(或坏)至关重要。近年来,已经观察到,基于n-gram的经典指标不足以捕获语义和关键含义来描述图像中的内容。为了衡量或不进行最新指标的集合,在本手稿中,我们对使用众所周知的COCO数据集进行了对几种图像字幕指标的评估以及它们之间的比较。为此,我们设计了两种情况。 1)一组人工构建字幕,以及2)比较某些最先进的图像字幕方法的比较。我们试图回答问题:当前的指标是否有助于制作高质量的标题?实际指标如何相互比较?指标真正测量什么?
translated by 谷歌翻译
以无监督的方式训练图像标题模型而不利用注释的图像标题对是朝向更广泛的文本和图像语料库的重要步骤。在监督设置中,图像标题对“良好匹配”,其中句子中提到的所有对象都显示在相应的图像中。然而,这些配对在无监督的环境中不可用。为了克服这一点,主要是在克服这方面有效的主要研究学院是根据它们对物体的重叠来构建训练集中的图像和文本的对。与监督设置不同,然而,这些构造的配对不保证具有完全重叠的对象集。我们本文的工作通过从训练集中收获对应于给定句子的对象来克服了这一点,即使它们不属于同一图像也是如此。当用作变压器的输入时,如果不是完整的对象覆盖,并且当由相应的句子监督时,这些物体的混合使得产生的结果通过显着的余量产生艺术无监督方法的最佳状态。在此发现时,我们进一步展示了(1)对象与物体属性之间关系的其他信息也有助于提高性能; (2)我们的方法也很好地延伸到非英语图像标题,这通常遭受稀缺的注释水平。我们的研究结果得到了强大的经验结果。
translated by 谷歌翻译
自动音频字幕是一项跨模式翻译任务,旨在为给定的音频剪辑生成自然语言描述。近年来,随着免费可用数据集的发布,该任务受到了越来越多的关注。该问题主要通过深度学习技术解决。已经提出了许多方法,例如研究不同的神经网络架构,利用辅助信息,例如关键字或句子信息来指导字幕生成,并采用了不同的培训策略,这些策略极大地促进了该领域的发展。在本文中,我们对自动音频字幕的已发表贡献进行了全面综述,从各种现有方法到评估指标和数据集。我们还讨论了公开挑战,并设想可能的未来研究方向。
translated by 谷歌翻译
图像美学质量评估在过去十年中很受欢迎。除数值评估外,还提出了自然语言评估(美学字幕)来描述图像的一般美学印象。在本文中,我们提出了美学属性评估,即审美属性字幕,即评估诸如组成,照明使用和颜色布置之类的美学属性。标记美学属性的注释是一项非平凡的任务,该评论限制了相应数据集的规模。我们以半自动方式构建了一个名为DPC-CAPTIONSV2的新型数据集。知识从带有完整注释的小型数据集转移到摄影网站的大规模专业评论。 DPC-CAPTIONSV2的图像包含最多4个美学属性的注释:组成,照明,颜色和主题。然后,我们根据BUTD模型和VLPSA模型提出了一种新版本的美学多属性网络(AMANV2)。 AMANV2融合了带有完整注释的小规模PCCD数据集和带有完整注释的大规模DPCCAPTIONSV2数据集的混合物的功能。 DPCCAPTIONSV2的实验结果表明,我们的方法可以预测对4种美学属性的评论,这些评论比上一个Aman模型所产生的方法更接近美学主题。通过图像字幕的评估标准,专门设计的AMANV2模型对CNN-LSTM模型和AMAN模型更好。
translated by 谷歌翻译
We propose Scene Graph Auto-Encoder (SGAE) that incorporates the language inductive bias into the encoderdecoder image captioning framework for more human-like captions. Intuitively, we humans use the inductive bias to compose collocations and contextual inference in discourse. For example, when we see the relation "person on bike", it is natural to replace "on" with "ride" and infer "person riding bike on a road" even the "road" is not evident. Therefore, exploiting such bias as a language prior is expected to help the conventional encoder-decoder models less likely overfit to the dataset bias and focus on reasoning. Specifically, we use the scene graph -a directed graph (G) where an object node is connected by adjective nodes and relationship nodes -to represent the complex structural layout of both image (I) and sentence (S). In the textual domain, we use SGAE to learn a dictionary (D) that helps to reconstruct sentences in the S → G → D → S pipeline, where D encodes the desired language prior; in the vision-language domain, we use the shared D to guide the encoder-decoder in the I → G → D → S pipeline. Thanks to the scene graph representation and shared dictionary, the inductive bias is transferred across domains in principle. We validate the effectiveness of SGAE on the challenging MS-COCO image captioning benchmark, e.g., our SGAE-based single-model achieves a new state-of-theart 127.8 CIDEr-D on the Karpathy split, and a competitive 125.5 CIDEr-D (c40) on the official server even compared to other ensemble models. Code has been made available at: https://github.com/yangxuntu/SGAE.
translated by 谷歌翻译
We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
translated by 谷歌翻译
图像标题将复杂的视觉信息转换为抽象的自然语言以获得表示的抽象自然语言,这可以帮助计算机快速了解世界。但是,由于真实环境的复杂性,它需要识别关键对象并实现其连接,并进一步生成自然语言。整个过程涉及视觉理解模块和语言生成模块,它为深度神经网络的设计带来了比其他任务的深度神经网络的更具挑战。神经架构搜索(NAS)在各种图像识别任务中显示了它的重要作用。此外,RNN在图像标题任务中起重要作用。我们介绍了一种自动调用方法,可以更好地设计图像标题的解码器模块,其中我们使用NAS自动设计称为Autornn的解码器模块。我们使用基于共享参数的加固学习方法有效地自动设计Autornn。 AutoCaption的搜索空间包括图层之间的连接和层次的操作,它可以使Autornn快递更多的架构。特别是,RNN等同于搜索空间的子集。 MSCOCO数据集上的实验表明,我们的自动驾统模型可以比传统的手工设计方法实现更好的性能。
translated by 谷歌翻译
观察一组图像及其相应的段落限制,一个具有挑战性的任务是学习如何生成语义连贯的段落来描述图像的视觉内容。受到将语义主题纳入此任务的最新成功的启发,本文开发了插件的层次结构引导图像段落生成框架,该框架将视觉提取器与深层主题模型相结合,以指导语言模型的学习。为了捕获图像和文本在多个抽象层面上的相关性并从图像中学习语义主题,我们设计了一个变异推理网络,以构建从图像功能到文本字幕的映射。为了指导段落的生成,学习的层次主题和视觉特征被整合到语言模型中,包括长期的短期记忆(LSTM)和变压器,并共同优化。公共数据集上的实验表明,在标准评估指标方面具有许多最先进的方法竞争的拟议模型可用于提炼可解释的多层语义主题并产生多样的和相干的标题。我们在https://github.com/dandanguo1993/vtcm aseal-image-image-paragraph-caption.git上发布代码
translated by 谷歌翻译
本文对过去二十年来对自然语言生成(NLG)的研究提供了全面的审查,特别是与数据到文本生成和文本到文本生成深度学习方法有关,以及NLG的新应用技术。该调查旨在(a)给出关于NLG核心任务的最新综合,以及该领域采用的建筑;(b)详细介绍各种NLG任务和数据集,并提请注意NLG评估中的挑战,专注于不同的评估方法及其关系;(c)强调一些未来的强调和相对近期的研究问题,因为NLG和其他人工智能领域的协同作用而增加,例如计算机视觉,文本和计算创造力。
translated by 谷歌翻译
Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average on encoded vectors is generated at each time step to guide the caption decoding process. However, the decoder has little idea of whether or how well the attended vector and the given attention query are related, which could make the decoder give misled results. In this paper, we propose an "Attention on Attention" (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries. AoA first generates an "information vector" and an "attention gate" using the attention result and the current context, then adds another attention by applying element-wise multiplication to them and finally obtains the "attended information", the expected useful knowledge. We apply AoA to both the encoder and the decoder of our image captioning model, which we name as AoA Network (AoANet). Experiments show that AoANet outperforms all previously published methods and achieves a new state-ofthe-art performance of 129.8 CIDEr-D score on MS COCO "Karpathy" offline test split and 129.6 CIDEr-D (C40) score on the official online testing server. Code is available at https://github.com/husthuaan/AoANet.
translated by 谷歌翻译