Real-world web videos often contain cues that supplement the visual information for generating natural language descriptions. In this paper we propose a sequence-to-sequence model which exploits such auxiliary information. In particular, audio and the topic of the video are used in addition to the visual information in a multimodal framework to generate coherent descriptions of videos "in the wild". In contrast to current encoder-decoder based models which exploit visual information only during the encoding stage, our model fuses multiple sources of information judiciously, showing improvement over using the different modalities separately. We base our multimodal video description network on the state-of-the-art sequence-to-sequence video-to-text (S2VT) [26] model and extend it to take advantage of multiple modalities. Extensive experiments on the challenging MSR-VTT dataset demonstrate the superior performance of the proposed approach on natural videos found on the web.
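The abstract above does not include code; as a rough illustration of the kind of late fusion it describes, the hedged sketch below (names, dimensions, and the fusion rule are assumptions, not the authors' implementation) projects visual, audio, and topic features into a shared space and feeds the fused context to an LSTM decoder at every step.

```python
# Hypothetical sketch of multimodal fusion for caption decoding (not the
# authors' released code): each modality is projected to a shared space,
# fused, and consumed by an LSTM decoder at every step.
import torch
import torch.nn as nn

class MultimodalFusionDecoder(nn.Module):
    def __init__(self, vis_dim, aud_dim, topic_dim, hid_dim, vocab_size):
        super().__init__()
        self.proj_vis = nn.Linear(vis_dim, hid_dim)
        self.proj_aud = nn.Linear(aud_dim, hid_dim)
        self.proj_topic = nn.Linear(topic_dim, hid_dim)
        self.embed = nn.Embedding(vocab_size, hid_dim)
        # decoder input = word embedding + fused multimodal context
        self.lstm = nn.LSTM(hid_dim * 2, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, vis, aud, topic, prev_words):
        # fuse the three modalities into one context vector per video
        fused = torch.tanh(self.proj_vis(vis) + self.proj_aud(aud) + self.proj_topic(topic))
        ctx = fused.unsqueeze(1).expand(-1, prev_words.size(1), -1)
        x = torch.cat([self.embed(prev_words), ctx], dim=-1)
        h, _ = self.lstm(x)
        return self.out(h)  # per-step vocabulary logits

# toy usage with random features
dec = MultimodalFusionDecoder(2048, 128, 20, 256, 5000)
logits = dec(torch.randn(2, 2048), torch.randn(2, 128), torch.randn(2, 20),
             torch.randint(0, 5000, (2, 8)))
print(logits.shape)  # (2, 8, 5000)
```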
Dialog systems need to understand dynamic visual scenes in order to converse with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained on dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human actions. Each dialog is a typed conversation consisting of 10 question-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs for approximately 9,000 videos. Using this new dataset for Audio Visual Scene-Aware Dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using the multimodal features developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code, and pretrained models will be made publicly available for a new Video Scene-Aware Dialog challenge.
In this paper we introduce How2, a multimodal collection of instructional videos with English subtitles and crowdsourced Portuguese translations. We also present integrated sequence-to-sequence baselines for machine translation, automatic speech recognition, spoken language translation, and multimodal summarization. By making data and code available for several multimodal natural language tasks, we hope to stimulate more research on these and similar challenges, to obtain a deeper understanding of multimodality in language processing.
Current methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has demonstrated the advantages of integrating temporal attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames. Such methods typically use two different types of features: image features (from an object classification model), and motion features (from an action recognition model), combined by naïve concatenation in the model input. Because different feature modalities may carry task-relevant information at different times, fusing them by naïve concatenation may limit the model's ability to dynamically determine the relevance of each type of feature to different parts of the description. In this paper, we incorporate audio features in addition to the image and motion features. To fuse these three modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. Combining our new multimodal attention model with standard temporal attention outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.
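A minimal sketch of the multimodal attention idea described above, assuming a standard additive attention form (the module name, dimensions, and scoring function are illustrative, not taken from the paper): given one context vector per modality and the current decoder state, compute a weight per modality and return their weighted sum.

```python
# Minimal sketch (not the authors' code) of attention over modalities:
# given per-modality context vectors and the decoder state, compute one
# weight per modality and return their weighted sum for the current word.
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    def __init__(self, ctx_dim, state_dim, att_dim):
        super().__init__()
        self.w_ctx = nn.Linear(ctx_dim, att_dim)
        self.w_state = nn.Linear(state_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, contexts, state):
        # contexts: (batch, n_modalities, ctx_dim); state: (batch, state_dim)
        e = self.score(torch.tanh(self.w_ctx(contexts) + self.w_state(state).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # one weight per modality
        return (alpha * contexts).sum(dim=1), alpha.squeeze(-1)

att = ModalityAttention(ctx_dim=512, state_dim=512, att_dim=128)
ctx, weights = att(torch.randn(4, 3, 512), torch.randn(4, 512))
print(ctx.shape, weights.shape)  # torch.Size([4, 512]) torch.Size([4, 3])
```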
Movie description
Audio description (AD) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed ADs, which are temporally aligned to full-length movies. In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions. We introduce the Large Scale Movie Description Challenge (LSMDC) which contains a parallel corpus of 128,118 sentences aligned to video clips from 200 movies (around 150 h of video in total). The goal of the challenge is to automatically generate descriptions for the movie clips. First we characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing ADs to scripts, we find that ADs are more visual and describe precisely what is shown rather than what should happen according to the scripts created prior to movie production. Furthermore, we present and compare the results of several teams who participated in the challenges organized in the context of two workshops at ICCV 2015 and ECCV 2016.
This paper introduces the Seventh Dialog System Technology Challenge (DSTC), which uses shared datasets to explore the problem of building dialog systems. Recently, end-to-end dialog modeling approaches have been applied to a variety of dialog tasks. The seventh DSTC (DSTC7) focuses on developing technologies related to end-to-end dialog systems for (1) sentence selection, (2) sentence generation, and (3) audio-visual scene-aware dialog. This paper summarizes the overall setup and results of DSTC7, including detailed descriptions of the different tracks and the provided datasets. We also describe the overall trends in the submitted systems and the key results. Each track introduced new datasets, and participants achieved impressive results using state-of-the-art end-to-end technologies.
During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss appropriate choice of evaluation metrics for the learning objective.
Understanding audio-visual content, and the ability to converse about it informatively, are both challenging for intelligent systems. The Audio Visual Scene-Aware Dialog (AVSD) challenge, a track of the Dialog System Technology Challenge 7 (DSTC7), poses a combined task in which a system must answer questions about a video within a dialog, given the previous question-answer pairs and the video itself. We propose a hierarchical encoder-decoder model for this task that computes a multimodal representation of the dialog context. It first embeds the dialog history using two LSTMs. We extract video and audio frames at regular intervals and compute semantic features using pre-trained I3D and VGGish models, respectively. Before summarizing both modalities into fixed-length vectors with LSTMs, we condition them on the embedding of the current question using FiLM blocks, which allows us to reduce the dimensionality considerably. Finally, we use an LSTM decoder, which we train with scheduled sampling and evaluate with beam search. Compared to the modality-fusion baseline model released by the AVSD challenge organizers, our model achieves relative improvements of more than 16%, scoring 0.36 BLEU-4, and more than 33%, scoring 0.997 CIDEr.
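The FiLM conditioning step is the least standard piece of the pipeline described above; the sketch below (an assumed, simplified form following the general FiLM recipe, not the authors' code) shows how a question embedding can predict a per-feature scale and shift that modulate the I3D/VGGish frame features before an LSTM summarizes them.

```python
# Sketch of FiLM-style conditioning (assumed form): the question embedding
# predicts a scale (gamma) and shift (beta) that modulate each audio/video
# feature frame before an LSTM summarizes the sequence.
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        self.gamma = nn.Linear(cond_dim, feat_dim)
        self.beta = nn.Linear(cond_dim, feat_dim)

    def forward(self, feats, cond):
        # feats: (batch, time, feat_dim); cond: (batch, cond_dim)
        g = self.gamma(cond).unsqueeze(1)
        b = self.beta(cond).unsqueeze(1)
        return g * feats + b

film = FiLMBlock(feat_dim=1024, cond_dim=256)
video_feats = torch.randn(2, 40, 1024)       # e.g. I3D features per segment
question_emb = torch.randn(2, 256)           # LSTM embedding of the question
modulated = film(video_feats, question_emb)
summary, _ = nn.LSTM(1024, 256, batch_first=True)(modulated)
print(summary[:, -1].shape)                  # (2, 256) fixed-length summary
```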
Recent research in artificial intelligence has focused on generating narrative stories about visual scenes. This has the potential to enable more human-like understanding than merely generating basic descriptions of the images in sequence. In this work, we propose a solution for generating stories from sequences of images based on a sequence-to-sequence model. As a novelty, our encoder consists of two separate encoders: one that models the behavior of the image sequence, and another that models the story sentences generated so far for the previous images in the sequence. By using the image-sequence encoder we capture the temporal dependencies between the image sequence and the sentence story, and by using the previous-sentence encoder we achieve a better story flow. Our solution generates long, human-like stories that not only describe the visual context of the image sequence but also contain narrative and evaluative language. The obtained results are confirmed by manual human evaluation.
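A hedged sketch of the two-encoder design described above (module names and dimensions are hypothetical, not the paper's code): one GRU encodes the image-feature sequence, a second GRU encodes the story sentences generated so far, and their final states are combined to initialize the sentence decoder.

```python
# Illustrative dual-encoder sketch: encode the image sequence and the
# previously generated story separately, then combine the final states
# into an initial hidden state for the story decoder.
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, img_dim, word_dim, hid_dim):
        super().__init__()
        self.img_enc = nn.GRU(img_dim, hid_dim, batch_first=True)
        self.txt_enc = nn.GRU(word_dim, hid_dim, batch_first=True)
        self.combine = nn.Linear(hid_dim * 2, hid_dim)

    def forward(self, img_seq, prev_story):
        # img_seq: (batch, n_images, img_dim); prev_story: (batch, n_words, word_dim)
        _, h_img = self.img_enc(img_seq)
        _, h_txt = self.txt_enc(prev_story)
        init = torch.tanh(self.combine(torch.cat([h_img[-1], h_txt[-1]], dim=-1)))
        return init  # used as the initial hidden state of the story decoder

enc = DualEncoder(img_dim=2048, word_dim=300, hid_dim=512)
h0 = enc(torch.randn(2, 5, 2048), torch.randn(2, 30, 300))
print(h0.shape)  # torch.Size([2, 512])
```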
This paper extends research on automated image captioning in the dimension of language, studying how to generate Chinese sentence descriptions for unlabeled images. To evaluate image captioning in this novel context, we present Flickr8k-CN, a bilingual extension of the popular Flickr8k set. The new multimedia dataset can be used to quantitatively assess the performance of Chinese captioning and English-Chinese machine translation. The possibility of re-using existing English data and models via machine translation is investigated. Our study reveals to some extent that a computer can master two distinct languages, English and Chinese, at a similar level for describing the visual world. Data is publicly available at http://tinyurl.com/flickr8kcn.
Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application to video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that allows going beyond local temporal modeling and learns to automatically select the most relevant temporal segments given the text-generating RNN. Our approach exceeds the current state-of-the-art on the Youtube2Text dataset for both the BLEU and METEOR metrics. We also present results on a new, larger, and more challenging dataset of paired videos and natural language descriptions.
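To make the temporal attention mechanism concrete, the sketch below (an assumed, simplified greedy decoding loop, not the paper's implementation) shows how the text-generating RNN state can score every encoded temporal segment at each step so that the attended context conditions the next word.

```python
# Illustrative sketch of one decoding step with temporal attention: the RNN
# state scores each encoded temporal segment, and the attended context is
# fed together with the previous word into the recurrent cell.
import torch
import torch.nn as nn

feat_dim, hid_dim, att_dim, vocab = 512, 512, 128, 1000
w_feat, w_state, v = nn.Linear(feat_dim, att_dim), nn.Linear(hid_dim, att_dim), nn.Linear(att_dim, 1)
cell = nn.GRUCell(feat_dim + hid_dim, hid_dim)   # input: attended context + word embedding
embed, out = nn.Embedding(vocab, hid_dim), nn.Linear(hid_dim, vocab)

segments = torch.randn(2, 26, feat_dim)          # encoded temporal segments (e.g. 3-D CNN)
h = torch.zeros(2, hid_dim)
word = torch.zeros(2, dtype=torch.long)          # hypothetical <bos> token id 0

for _ in range(5):                               # generate a few words greedily
    scores = v(torch.tanh(w_feat(segments) + w_state(h).unsqueeze(1)))
    alpha = torch.softmax(scores, dim=1)         # relevance of each segment for this word
    context = (alpha * segments).sum(dim=1)
    h = cell(torch.cat([context, embed(word)], dim=-1), h)
    word = out(h).argmax(dim=-1)
    print(word.tolist())                         # untrained, so the ids are arbitrary
```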
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. Finally, given the recent surge of interest in this task, a competition was organized in 2015 using the newly released COCO dataset. We describe and analyze the various improvements we applied to our own baseline and show the resulting performance in the competition, which we won jointly with a team from Microsoft Research, and we provide an open-source implementation in TensorFlow.
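The training objective described above can be written compactly; the sketch below (architecture details such as the dimensions and the single-layer LSTM are assumptions) initializes an LSTM decoder from a CNN image feature and maximizes the log-likelihood of the ground-truth caption via per-word cross-entropy.

```python
# Minimal sketch (assumed details) of the maximum-likelihood captioning
# objective: a CNN image feature initializes an LSTM decoder and the loss
# is the per-word cross-entropy of the target caption.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionModel(nn.Module):
    def __init__(self, img_dim, hid_dim, vocab):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hid_dim)
        self.embed = nn.Embedding(vocab, hid_dim)
        self.lstm = nn.LSTM(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab)

    def forward(self, img_feat, caption):
        h0 = torch.tanh(self.img_proj(img_feat)).unsqueeze(0)     # (1, batch, hid)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(caption[:, :-1]), (h0, c0))
        logits = self.out(hidden)
        # maximizing log p(caption | image) == minimizing per-word cross-entropy
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               caption[:, 1:].reshape(-1))

model = CaptionModel(img_dim=2048, hid_dim=256, vocab=8000)
loss = model(torch.randn(4, 2048), torch.randint(0, 8000, (4, 12)))
loss.backward()
print(float(loss))
```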
The explosive growth of video data on the Internet calls for effective and efficient techniques to automatically generate captions for people who are unable to watch the videos. Although great progress has been made in video captioning research, especially in video feature encoding, the language decoder is still largely based on popular RNN decoders such as LSTMs, which tend to prefer frequent words aligned with the video. In this paper, we propose a boundary-aware hierarchical language decoder for video captioning, which consists of a high-level GRU-based language decoder working as a global (caption-level) language model and a low-level GRU-based language decoder working as a local (phrase-level) language model. Most importantly, we introduce a binary gate into the low-level GRU language decoder to detect linguistic boundaries. Together with other advanced components, including joint video prediction, shared soft attention, and boundary-aware video encoding, our integrated video captioning framework can discover hierarchical language information and distinguish the subject and the object in a sentence, which are usually confused during language generation. Extensive experiments on two widely used video captioning datasets, MSR-Video-to-Text (MSR-VTT) \cite{xu2016msr} and YouTube-to-Text (MSVD) \cite{chen2011collecting}, show that our method is highly competitive with the state-of-the-art methods.
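A hedged sketch of the boundary-gate idea (my own simplification for illustration, not the paper's formulation): at each word a learned gate estimates whether a phrase boundary occurs; when it fires, the low-level (phrase) GRU state is reset and a summary of the finished phrase advances the high-level (caption) GRU.

```python
# Simplified illustration of a boundary gate coupling a low-level (phrase)
# GRU and a high-level (caption) GRU: the gate decides when to push the
# phrase summary up and reset the low-level state.
import torch
import torch.nn as nn

class BoundaryAwareStep(nn.Module):
    def __init__(self, word_dim, hid_dim):
        super().__init__()
        self.low = nn.GRUCell(word_dim, hid_dim)
        self.high = nn.GRUCell(hid_dim, hid_dim)
        self.gate = nn.Linear(hid_dim + word_dim, 1)

    def forward(self, word_emb, h_low, h_high):
        boundary = torch.sigmoid(self.gate(torch.cat([h_low, word_emb], dim=-1)))
        # when boundary ~ 1: push the phrase summary up and reset the low level
        h_high = self.high(boundary * h_low, h_high)
        h_low = self.low(word_emb, (1.0 - boundary) * h_low)
        return h_low, h_high, boundary

step = BoundaryAwareStep(word_dim=300, hid_dim=512)
h_low, h_high = torch.zeros(2, 512), torch.zeros(2, 512)
for t in range(6):
    h_low, h_high, b = step(torch.randn(2, 300), h_low, h_high)
print(b.squeeze(-1))  # soft boundary probabilities at the last step
```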
Negotiating as a seller or buyer is an essential and complicated aspect of online shopping. It is challenging for an intelligent agent because it requires (1) extracting and utilizing information from multiple sources (e.g., photos, texts, and numbers), (2) predicting a suitable price for the product to reach the best possible agreement, (3) expressing the intention, conditioned on the price, in natural language, and (4) consistent pricing. Conventional dialog systems do not address these problems well. For example, we believe the price should be the driving factor of the negotiation and should be understood by the agent. Traditionally, however, the price is simply treated as a word token, i.e., as part of a sentence sharing the same word-embedding space as other words. To this end, we propose our Visual Negotiator, an end-to-end deep learning model that predicts an initial agreement price and updates it while creating a compelling supporting dialog. For (1), our Visual Negotiator uses an attention mechanism to extract relevant information from images and text descriptions, and feeds the price (and later the refined price) as a separate important input at several stages of the system, rather than merely as part of a sentence; for (2), we use attention to learn a price embedding for estimating the initial value; subsequently, for (3), we generate the supporting dialog in an encoder-decoder fashion that leverages the price embedding. Moreover, we use a hierarchical recurrent model that learns to refine the price at one level while generating the supporting dialog at another; for (4), this hierarchical model provides consistent pricing. Empirically, we show that our model significantly improves negotiation on the CraigslistBargain dataset in terms of the agreement price, price consistency, and language quality.
Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.
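In the spirit of the learned evaluator described above, the sketch below (details such as the GRU encoders, the bilinear match terms, and the squared-error fit are assumptions rather than the exact ADEM formulation) encodes the dialogue context, the reference response, and the model response, and regresses a scalar score against human ratings.

```python
# Hedged sketch of a learned dialogue-response evaluator: score the model
# response against the context and the reference with bilinear terms and
# fit the scalar output to human ratings with a squared-error loss.
import torch
import torch.nn as nn

class LearnedDialogueEvaluator(nn.Module):
    def __init__(self, word_dim, hid_dim):
        super().__init__()
        self.enc = nn.GRU(word_dim, hid_dim, batch_first=True)
        self.M = nn.Parameter(torch.eye(hid_dim))   # context-response match
        self.N = nn.Parameter(torch.eye(hid_dim))   # reference-response match

    def encode(self, seq):
        _, h = self.enc(seq)
        return h[-1]

    def forward(self, context, reference, response):
        c, r, rr = self.encode(context), self.encode(reference), self.encode(response)
        score = (c * (rr @ self.M)).sum(-1) + (r * (rr @ self.N)).sum(-1)
        return score  # regressed against human judgements during training

ev = LearnedDialogueEvaluator(word_dim=300, hid_dim=256)
s = ev(torch.randn(2, 40, 300), torch.randn(2, 15, 300), torch.randn(2, 15, 300))
loss = ((s - torch.tensor([3.0, 4.0])) ** 2).mean()   # hypothetical human scores
print(s.shape, float(loss))
```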
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance of around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Finally, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR-Video to Text"), which is a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentences and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with a soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.
This paper gives an overview of the evaluation campaign results of the 7th International Workshop on Spoken Language Translation (IWSLT 2010). This year, we focused on three spoken language tasks: (1) public speeches on a variety of topics (TALK) from English to French, (2) spoken dialog in travel situations (DIALOG) between Chinese and English, and (3) traveling expressions (BTEC) from Arabic, Turkish, and French to English. In total, 28 teams (including 7 first-time participants) took part in the shared tasks, submitting 60 primary and 112 contrastive runs. Automatic and subjective evaluations of the primary runs were carried out in order to investigate the impact of different communication modalities, spoken language styles and semantic context on automatic speech recognition (ASR) and machine translation (MT) system performances.
While deep convolutional neural networks frequently approach or exceed human-level performance in benchmark tasks involving static images, extending this success to moving images is not straightforward. Video understanding is of interest for many applications, including content recommendation, prediction, summarization, event/object detection, and understanding human visual perception. However, many domains lack sufficient data to explore and perfect video models. In order to address the need for a simple, quantitative benchmark for developing and understanding video, we present MovieFIB, a fill-in-the-blank question-answering dataset with over 300,000 examples, based on descriptive video annotations for the visually impaired. In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models' predictions, and compare these with human performance. We investigate the relative importance of language, static (2D) visual features, and moving (3D) visual features; the effects of increasing dataset size, of the number of frames sampled, and of vocabulary size. We illustrate that: this task is not solvable by a language model alone; our model combining 2D and 3D visual information indeed provides the best result; all models perform significantly worse than human-level. We provide human evaluation for responses given by different models and find that accuracy on the MovieFIB evaluation corresponds well with human judgment. We suggest avenues for improving video models, and hope that the MovieFIB challenge can be useful for measuring and encouraging progress in this very interesting field.
There has been an explosion of work in the vision & language community during the past few years, from image captioning to video transcription, and answering questions about images. These tasks have focused on literal descriptions of the image. To move beyond the literal, we choose to explore how questions about an image are often directed at commonsense inference and the abstract events evoked by objects in the image. In this paper, we introduce the novel task of Visual Question Generation (VQG), where the system is tasked with asking a natural and engaging question when shown an image. We provide three datasets which cover a variety of images from object-centric to event-centric, with considerably more abstract training data than provided to state-of-the-art captioning systems thus far. We train and test several generative and retrieval models to tackle the task of VQG. Evaluation results show that while such models ask reasonable questions for a variety of images, there is still a wide gap with human performance, which motivates further work on connecting images with commonsense knowledge and pragmatics. Our proposed task offers a new challenge to the community which we hope furthers interest in exploring deeper connections between vision & language.