Dialog systems need to understand dynamic visual scenes in order to converse with users about the objects and events around them. Scene-aware dialog systems for real-world applications can be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained on dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human actions. Each dialog is a typed conversation consisting of 10 question-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs for roughly 9,000 videos. Using this new dataset for Audio Visual Scene-Aware Dialog (AVSD), we trained an end-to-end conversational model that generates responses in a dialog about a video. Our experiments demonstrate that using the multimodal features developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code, and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge.
Real-world web videos often contain cues to supplement visual information for generating natural language descriptions. In this paper we propose a sequence-to-sequence model which explores such auxiliary information. In particular, audio and the topic of the video are used in addition to the visual information in a multimodal framework to generate coherent descriptions of videos "in the wild". In contrast to current encoder-decoder based models which exploit visual information only during the encoding stage, our model fuses multiple sources of information judiciously, showing improvement over using the different modalities separately. We based our multimodal video description network on the state-of-the-art sequence to sequence video to text (S2VT) [26] model and extended it to take advantage of multiple modalities. Extensive experiments on the challenging MSR-VTT dataset are carried out to show the superior performance of the proposed approach on natural videos found in the web.
Current methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has demonstrated the advantages of integrating temporal attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames. Such methods typically use two different types of features: image features (from an object classification model), and motion features (from an action recognition model), combined by naïve concatenation in the model input. Because different feature modalities may carry task-relevant information at different times, fusing them by naïve concatenation may limit the model's ability to dynamically determine the relevance of each type of feature to different parts of the description. In this paper, we incorporate audio features in addition to the image and motion features. To fuse these three modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. Combining our new multimodal attention model with standard temporal attention outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.
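The modality-level attention described above can be sketched compactly: at each decoding step, the decoder state yields one relevance weight per modality, and the fused context is the weighted sum of the (already temporally attended) per-modality vectors rather than their concatenation. The PyTorch module below is only an illustrative approximation under assumed layer names and sizes (`MultimodalAttention`, `fused_dim`), not the authors' released code.

```python
import torch
import torch.nn as nn

class MultimodalAttention(nn.Module):
    """Toy modality-level attention: weight image, motion, and audio
    context vectors per decoding step instead of concatenating them."""

    def __init__(self, feat_dims, hidden_dim, fused_dim):
        super().__init__()
        # project each modality into a common space
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in feat_dims])
        # one scalar relevance score per modality, conditioned on the decoder state
        self.score = nn.ModuleList(
            [nn.Linear(hidden_dim + fused_dim, 1) for _ in feat_dims])

    def forward(self, decoder_state, modality_feats):
        # decoder_state: (batch, hidden_dim)
        # modality_feats: list of (batch, feat_dims[i]) temporally attended vectors
        projected, scores = [], []
        for proj, score, feat in zip(self.proj, self.score, modality_feats):
            p = torch.tanh(proj(feat))                        # (batch, fused_dim)
            s = score(torch.cat([decoder_state, p], dim=-1))  # (batch, 1)
            projected.append(p)
            scores.append(s)
        alpha = torch.softmax(torch.cat(scores, dim=-1), dim=-1)  # (batch, n_modalities)
        stacked = torch.stack(projected, dim=1)                   # (batch, n_modalities, fused_dim)
        return (alpha.unsqueeze(-1) * stacked).sum(dim=1)         # fused context vector


# Example: image (2048-d), motion (1024-d), and audio (128-d) features for a batch of 4
att = MultimodalAttention([2048, 1024, 128], hidden_dim=512, fused_dim=256)
h = torch.randn(4, 512)
feats = [torch.randn(4, 2048), torch.randn(4, 1024), torch.randn(4, 128)]
context = att(h, feats)   # (4, 256), passed to the word decoder at this step
```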
In this paper, we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making data and code available for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, in order to reach a deeper understanding of multimodality in language processing.
This paper extends research on automated image captioning in the dimension of language, studying how to generate Chinese sentence descriptions for unlabeled images. To evaluate image captioning in this novel context, we present Flickr8k-CN, a bilingual extension of the popular Flickr8k set. The new multimedia dataset can be used to quantitatively assess the performance of Chinese captioning and English-Chinese machine translation. The possibility of re-using existing English data and models via machine translation is investigated. Our study reveals to some extent that a computer can master two distinct languages, English and Chinese, at a similar level for describing the visual world. Data is publicly available at http://tinyurl.com/flickr8kcn.
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. Finally, given the interest in this task, a competition was organized in 2015 using the newly released COCO dataset. We describe and analyze the various improvements we applied to our own baseline and show the resulting performance in the competition, which we won ex-aequo with a team from Microsoft Research, and we provide an open-source implementation in TensorFlow.
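The core recipe in this abstract, a CNN image encoder feeding an LSTM language decoder trained by maximum likelihood, fits in a few lines. The sketch below is a simplified stand-in rather than the authors' TensorFlow release; the ResNet backbone, vocabulary size, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

class ShowTellStyleCaptioner(nn.Module):
    """Minimal CNN-encoder / LSTM-decoder captioner trained to maximize
    the likelihood of the reference caption given the image."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        backbone = models.resnet18()                 # randomly initialized; any CNN encoder works
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.encoder = backbone
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # the image embedding acts as the first input token; teacher forcing after that
        img = self.encoder(images).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions[:, :-1])         # all caption words but the last
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(hidden)                      # (B, T, vocab) next-word logits


model = ShowTellStyleCaptioner(vocab_size=10000)
images, captions = torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12))
logits = model(images, captions)                     # (2, 12, 10000)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), captions.reshape(-1))
```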
Recent research in artificial intelligence has focused on generating narrative stories about visual scenes. This has the potential to enable more human-like understanding than merely generating basic descriptions of images in sequence. In this work, we propose a solution for generating stories for sequences of images based on a sequence-to-sequence model. As a novelty, our encoder consists of two separate encoders: one that models the behavior of the image sequence, and one that models the sentence story generated for the previous images in the sequence. By using the image-sequence encoder we capture the temporal dependencies between the image sequence and the sentence story, and by using the previous-sentence encoder we achieve better story flow. Our solution generates long, human-like stories that not only describe the visual context of the image sequence but also contain narrative and evaluative language. The obtained results are confirmed by manual human evaluation.
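A minimal sketch of the two-encoder idea follows: one GRU summarizes the image-sequence features, another summarizes the previously generated sentence, and the decoder is initialized from a fusion of both summaries. All layer names and sizes are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TwoEncoderStoryteller(nn.Module):
    """Sketch of a story generator with an image-sequence encoder and a
    previous-sentence encoder whose summaries jointly seed the decoder."""

    def __init__(self, vocab_size, img_dim=2048, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.img_encoder = nn.GRU(img_dim, hidden_dim, batch_first=True)
        self.sent_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_seq_feats, prev_sentence, cur_sentence):
        # image_seq_feats: (B, n_images, img_dim); sentences: token ids (B, T)
        _, h_img = self.img_encoder(image_seq_feats)               # (1, B, H)
        _, h_sent = self.sent_encoder(self.embed(prev_sentence))   # (1, B, H)
        h0 = torch.tanh(self.fuse(torch.cat([h_img, h_sent], dim=-1)))
        dec_out, _ = self.decoder(self.embed(cur_sentence[:, :-1]), h0)
        return self.out(dec_out)   # logits for the next sentence of the story
```

Conditioning the decoder on the previous sentence is what the abstract credits for better story flow; dropping the sentence encoder reduces the model to ordinary per-image captioning.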
Image captioning is a multimodal task involving computer vision and natural language processing, whose goal is to learn a mapping from an image to its natural-language description. Typically, the mapping function is learned from a training set of image-caption pairs. However, for some languages, a large-scale image-caption paired corpus may not be available. We present an approach to this unpaired image captioning problem via language pivoting. Our method effectively captures the characteristics of an image captioner from pivot-language (Chinese) data and aligns it to the target language (English) using a pivot-target (Chinese-English) parallel sentence corpus. We evaluate our method on two image-to-English benchmark datasets: MSCOCO and Flickr30K. Quantitative comparisons against several baseline approaches demonstrate the effectiveness of our method.
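At the pipeline level, language pivoting composes two separately trained components: a captioner trained on image-Chinese pairs produces a pivot caption, and a Chinese-English translation model maps it into the target language (the paper additionally adapts the two stages toward each other). The sketch below only shows that composition; `PivotCaptioner` and `PivotTranslator` are hypothetical placeholders, not the authors' models.

```python
from typing import Protocol


class PivotCaptioner(Protocol):
    def caption(self, image_path: str) -> str: ...        # image -> Chinese caption


class PivotTranslator(Protocol):
    def translate(self, chinese_sentence: str) -> str: ...  # Chinese -> English


def describe_in_english(image_path: str,
                        captioner: PivotCaptioner,
                        translator: PivotTranslator) -> str:
    """Unpaired image captioning via a pivot language: no image-English pairs
    are required, only image-Chinese pairs for the captioner and a
    Chinese-English parallel corpus for the translator."""
    pivot_caption = captioner.caption(image_path)
    return translator.translate(pivot_caption)
```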
The explosive growth of video data on the Internet calls for effective and efficient techniques that automatically generate captions for people who cannot watch the videos. Although video captioning research has made great progress, especially on video feature encoding, the language decoder is still largely based on popular RNN decoders such as LSTMs, which tend to prefer frequent words that align with the video. In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU-based language decoder working as a global (caption-level) language model and a low-level GRU-based language decoder working as a local (phrase-level) language model. Most importantly, we introduce a binary gate into the low-level GRU language decoder to detect language boundaries. Combined with other advanced components, including joint video prediction, shared soft attention, and boundary-aware video encoding, our integrated video captioning framework can discover hierarchical language information and distinguish between the subject and the object of a sentence, which are often confused during language generation. Extensive experiments on two widely used video captioning datasets, MSR-Video-to-Text (MSR-VTT) \cite{xu2016msr} and YouTube-to-Text (MSVD) \cite{chen2011collecting}, show that our approach is highly competitive with state-of-the-art methods.
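The boundary gate described here can be viewed as a learned switch on the low-level word GRU: when it fires, the phrase summary is pushed up to the high-level caption GRU and the low-level state is reset. The single decoding step below is an illustrative, soft-gated approximation under assumed names and sizes; the paper's actual binary gate would need a hard or straight-through relaxation that is omitted here.

```python
import torch
import torch.nn as nn

class BoundaryAwareDecoderStep(nn.Module):
    """One step of a two-level language decoder: a low-level (phrase) GRU
    with a boundary gate and a high-level (caption) GRU updated only when
    a phrase boundary is detected."""

    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.low = nn.GRUCell(embed_dim, hidden_dim)
        self.high = nn.GRUCell(hidden_dim, hidden_dim)
        self.boundary = nn.Linear(hidden_dim, 1)    # soft binary boundary gate

    def forward(self, word_emb, h_low, h_high):
        h_low = self.low(word_emb, h_low)
        b = torch.sigmoid(self.boundary(h_low))     # close to 1 at a phrase boundary
        # at a boundary: push the phrase summary into the high-level GRU
        h_high = b * self.high(h_low, h_high) + (1 - b) * h_high
        # at a boundary: reset the low-level (phrase) state
        h_low = (1 - b) * h_low
        return h_low, h_high, b
```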
There has been an explosion of work in the vision & language community during the past few years, from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance, which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.
Movie description
Audio description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. We introduce the Large Scale Movie Description Challenge (LSMDC) which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total). The goal of the challenge is to automatically generate descriptions for the movie clips. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in the challenges organized in the context of two workshops at ICCV 2015 and ECCV 2016.
This paper introduces the seventh Dialog System Technology Challenge (DSTC), which uses shared datasets to explore the problem of building dialog systems. Recently, end-to-end dialog modeling approaches have been applied to a variety of dialog tasks. The seventh DSTC (DSTC7) focuses on developing technologies related to end-to-end dialog systems for (1) sentence selection, (2) sentence generation, and (3) audio visual scene-aware dialog. This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the different tracks and the provided datasets. We also describe overall trends among the submitted systems and the key results. Each track introduced new datasets, and participants achieved impressive results using state-of-the-art end-to-end technologies.
Describing visual data in natural language is a very challenging task at the intersection of computer vision, natural language processing, and machine learning. Language goes far beyond describing physical objects and their interactions and can convey the same abstract concept in many ways. With both content at the highest semantic level and fluent form in mind, we propose an approach that describes videos in natural language by reaching a consensus among multiple encoder-decoder networks. Such a consensual description, which shares common properties with a larger group, has a better chance of conveying the correct meaning. We propose and train several network architectures using different types of image, audio, and video features. Each model produces its own description of the input video, and the best one is selected through an efficient two-stage consensus process. We demonstrate the advantage of our approach by obtaining state-of-the-art results on the challenging MSR-VTT dataset.
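The consensus step itself is simple to state: each candidate description (one per network) is scored by its average similarity to the other candidates, and the highest-scoring one is returned. Below is a small sketch that uses a unigram-overlap F1 as the similarity; the paper's actual similarity measure and two-stage procedure may differ, so treat this as an assumption-laden illustration.

```python
from collections import Counter
from typing import List


def unigram_f1(a: str, b: str) -> float:
    """Simple unigram-overlap F1 between two sentences (stand-in similarity)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(ca.values()), overlap / sum(cb.values())
    return 2 * prec * rec / (prec + rec)


def consensus_caption(candidates: List[str]) -> str:
    """Pick the candidate that agrees most, on average, with the other candidates."""
    def avg_sim(indexed):
        i, c = indexed
        others = [o for j, o in enumerate(candidates) if j != i]
        return sum(unigram_f1(c, o) for o in others) / max(len(others), 1)
    return max(enumerate(candidates), key=avg_sim)[1]


# Candidate descriptions from different encoder-decoder networks (toy example)
captions = [
    "a man is playing a guitar on stage",
    "a person plays guitar",
    "a man is cooking in a kitchen",
]
print(consensus_caption(captions))   # -> "a man is playing a guitar on stage"
```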
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to a human performance of around 69. We also show BLEU-1 improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Finally, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state of the art.
With recent advances in artificial intelligence, intelligent virtual assistants (IVAs) have become a ubiquitous part of every home. Going forward, we are witnessing the impact of vision, speech, and dialog system technologies that enable IVAs to learn audio-visual groundings of utterances and to converse with users about the objects, activities, and events surrounding them. As part of the 7th Dialog System Technology Challenge (DSTC7), for the Audio Visual Scene-Aware Dialog (AVSD) track, we feed the "topic" of the dialog into the architecture as an important contextual feature while exploring multimodal attention. We also incorporate AclNet, an end-to-end audio classification ConvNet, into our models. We present a detailed experimental analysis and show that some of our model variants outperform the baseline system provided for this task.
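One simple way to realize "topic as a contextual feature" alongside an audio embedding is to learn a topic embedding and concatenate it with pooled audio features (e.g., from an audio ConvNet such as AclNet) and pooled visual features before decoding. The snippet below is only a schematic of that fusion; the dimensions and the name `TopicAudioContext` are invented for illustration and do not reflect the submission's actual architecture.

```python
import torch
import torch.nn as nn

class TopicAudioContext(nn.Module):
    """Fuse a learned video-topic embedding with pooled audio and visual
    features into a single context vector for a dialog response decoder."""

    def __init__(self, n_topics, topic_dim=64, audio_dim=128,
                 visual_dim=2048, ctx_dim=512):
        super().__init__()
        self.topic_embed = nn.Embedding(n_topics, topic_dim)
        self.fuse = nn.Linear(topic_dim + audio_dim + visual_dim, ctx_dim)

    def forward(self, topic_id, audio_feat, visual_feat):
        # topic_id: (B,) integer labels; audio/visual feats: pooled (B, dim) vectors
        t = self.topic_embed(topic_id)
        return torch.tanh(self.fuse(torch.cat([t, audio_feat, visual_feat], dim=-1)))


ctx = TopicAudioContext(n_topics=20)
context = ctx(torch.tensor([3, 7]), torch.randn(2, 128), torch.randn(2, 2048))  # (2, 512)
```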
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.
Understanding audio-visual content and being able to have an informative conversation about it are both challenging for intelligent systems. The Audio Visual Scene-Aware Dialog (AVSD) challenge, a track of the Dialog System Technology Challenge 7 (DSTC7), proposes a combined task in which a system must answer questions about a video given the video itself and a dialog of previous question-answer pairs. We propose a hierarchical encoder-decoder model for this task, which computes a multimodal embedding of the dialog context. It first embeds the dialog history using two LSTMs. We extract video and audio frames at regular intervals and compute semantic features using pre-trained I3D and VGGish models, respectively. Before summarizing both modalities into fixed-length vectors with LSTMs, we condition them on the embedding of the current question using FiLM blocks, which allows us to reduce the dimensionality considerably. Finally, we use an LSTM decoder that we train with scheduled sampling and evaluate with beam search. Compared to the modality-fusing baseline model released by the AVSD challenge organizers, our model achieves relative improvements of more than 16%, scoring 0.36 BLEU-4, and more than 33%, scoring 0.997 CIDEr.
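The FiLM conditioning mentioned here amounts to predicting a per-channel scale and shift from the question embedding and applying them to each frame-level feature before an LSTM summarizes the sequence. The block below is a generic FiLM sketch under assumed sizes, not the submission's exact code.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Feature-wise linear modulation: condition frame features on the
    embedding of the current question before an LSTM summarizes them."""

    def __init__(self, question_dim, feat_dim):
        super().__init__()
        self.gamma = nn.Linear(question_dim, feat_dim)   # per-channel scale
        self.beta = nn.Linear(question_dim, feat_dim)    # per-channel shift

    def forward(self, frame_feats, question_emb):
        # frame_feats: (B, T, feat_dim), e.g. I3D or VGGish features per segment
        # question_emb: (B, question_dim)
        g = self.gamma(question_emb).unsqueeze(1)        # (B, 1, feat_dim)
        b = self.beta(question_emb).unsqueeze(1)
        return g * frame_feats + b


film = FiLMLayer(question_dim=256, feat_dim=2048)
modulated = film(torch.randn(4, 10, 2048), torch.randn(4, 256))   # (4, 10, 2048)
# an LSTM can then summarize `modulated` into a fixed-length, question-aware video vector
```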
With the recent popularity of animated GIFs on social media, there is a need for ways to index them with rich metadata. To advance research on animated GIF understanding, we collected a new dataset, Tumblr GIF (TGIF), with 100K animated GIFs from Tumblr and 120K natural language descriptions obtained via crowdsourcing. The motivation for this work is to develop a testbed for image sequence description systems, where the task is to generate natural language descriptions for animated GIFs or video clips. To ensure a high quality dataset, we developed a series of novel quality controls to validate free-form text input from crowd-workers. We show that there is unambiguous association between visual content and natural language descriptions in our dataset, making it an ideal benchmark for the visual content captioning task. We perform extensive statistical analyses to compare our dataset to existing image and video description datasets. Next, we provide baseline results on the animated GIF description task, using three representative techniques: nearest neighbor, statistical machine translation, and recurrent neural networks. Finally, we show that models fine-tuned on our animated GIF description dataset can be helpful for automatic movie description.
The use of Recurrent Neural Networks for video captioning has recently gained a lot of attention, since they can be used both to encode the input video and to generate the corresponding description. In this paper, we present a recurrent video encoding scheme which can discover and leverage the hierarchical structure of the video. Unlike the classical encoder-decoder approach, in which a video is encoded continuously by a recurrent layer, we propose a novel LSTM cell which can identify discontinuity points between frames or segments and modify the temporal connections of the encoding layer accordingly. We evaluate our approach on three large-scale datasets: the Montreal Video Annotation dataset, the MPII Movie Description dataset and the Microsoft Video Description Corpus. Experiments show that our approach can discover appropriate hierarchical representations of input videos and improve the state-of-the-art results on movie description datasets.
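The key mechanism here, an encoding LSTM whose temporal connections are modified when a discontinuity is detected, can be approximated with a soft reset: a boundary score computed from the current input and previous state scales down the carried-over hidden and cell states, so the next segment starts almost fresh. This is a minimal, soft-gated sketch under assumed names; the paper proposes a modified LSTM cell with its own boundary detector.

```python
import torch
import torch.nn as nn

class BoundaryAwareEncoder(nn.Module):
    """Video encoder whose recurrent state is (softly) reset whenever a
    learned boundary detector fires, so each detected segment is encoded
    nearly independently of the previous one."""

    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.detect = nn.Linear(feat_dim + hidden_dim, 1)

    def forward(self, frames):
        # frames: (B, T, feat_dim) per-frame CNN features
        B, T, _ = frames.shape
        h = frames.new_zeros(B, self.cell.hidden_size)
        c = frames.new_zeros(B, self.cell.hidden_size)
        outputs = []
        for t in range(T):
            x = frames[:, t, :]
            s = torch.sigmoid(self.detect(torch.cat([x, h], dim=-1)))  # boundary probability
            # cut the temporal connection when a boundary is detected
            h, c = self.cell(x, ((1 - s) * h, (1 - s) * c))
            outputs.append(h)
        return torch.stack(outputs, dim=1)   # (B, T, hidden_dim)


enc = BoundaryAwareEncoder(feat_dim=2048, hidden_dim=512)
encoded = enc(torch.randn(2, 16, 2048))      # (2, 16, 512)
```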
While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR-Video to Text") which is a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentence and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.