Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new triplet-based method of collecting human annotations to measure consensus, a new automated metric (CIDEr) that captures consensus, and two new datasets, PASCAL-50S and ABSTRACT-50S, that contain 50 sentences describing each image. Our simple metric captures human judgment of consensus better than existing metrics across sentences generated by various sources. We also evaluate five state-of-the-art image description approaches using this new protocol and provide a benchmark for future comparisons. A version of CIDEr named CIDEr-D is available as part of the MS COCO evaluation server to enable systematic evaluation and benchmarking.
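For concreteness, a minimal sketch of the TF-IDF-weighted n-gram matching that underlies CIDEr might look like the following. It assumes a single n-gram size and a simplified IDF term, whereas the published metric averages over n = 1 to 4 (and CIDEr-D adds further robustness terms); the function and variable names are illustrative, not taken from the paper's code.

```python
# Sketch of a CIDEr-style consensus score for one n-gram size.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_images):
    # Term frequency of each n-gram, down-weighted by how common it is across images.
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log(num_images / (1.0 + doc_freq.get(g, 0)))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_n(candidate, references, n, doc_freq, num_images):
    cand_vec = tfidf_vector(candidate.split(), n, doc_freq, num_images)
    ref_vecs = [tfidf_vector(r.split(), n, doc_freq, num_images) for r in references]
    # Consensus: average similarity of the candidate to all human references.
    return sum(cosine(cand_vec, rv) for rv in ref_vecs) / len(ref_vecs)
```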
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as "which caption-generator best understands colors?" and "can caption-generators count?"
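To illustrate the idea, a much-simplified SPICE-style F-score over scene-graph tuples could be computed as below. It assumes the candidate and reference captions have already been parsed into object, attribute, and relation tuples, and it uses exact tuple matching, whereas the published metric builds the scene graphs with a dependency parser and also allows WordNet synonym matches; names here are illustrative.

```python
# Sketch of an F-score over scene-graph tuples, e.g. ("girl",), ("girl", "young"),
# ("girl", "ride", "horse").
def spice_like_f1(candidate_tuples, reference_tuples):
    cand, ref = set(candidate_tuples), set(reference_tuples)
    matched = len(cand & ref)  # exact matching only in this sketch
    if not cand or not ref or not matched:
        return 0.0
    precision = matched / len(cand)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)
```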
Image captioning is a current research task whose goal is to describe the content of an image using the objects in the scene and their relationships. Two major research areas are involved in this task: computer vision and natural language processing. In image captioning, as in any computational intelligence task, performance metrics are crucial for knowing how well (or how badly) a method performs. In recent years it has been observed that classical n-gram-based metrics are not sufficient to capture the semantics and the critical meaning needed to describe the content of an image. To measure how well (or not) a set of recent metrics does this, in this manuscript we carry out an evaluation of several image captioning metrics, and a comparison among them, using the well-known COCO dataset. To this end we designed two scenarios: 1) a set of artificially constructed captions, and 2) a comparison of some state-of-the-art image captioning methods. We try to answer the questions: Do the current metrics help to produce high-quality captions? How do the current metrics compare to one another? What do the metrics really measure?
Current metrics for video captioning are mostly based on text-level comparisons between reference and candidate captions. However, they have some intrinsic drawbacks: for example, they cannot handle videos without references, and they may lead to biased evaluation due to the one-to-many nature of video-to-text mapping and their neglect of visual relevance. From the viewpoint of human evaluators, a high-quality caption should be consistent with the given video, but it does not necessarily have to resemble the references literally or semantically. Inspired by human evaluation, we propose EMScore (Embedding Matching-based score), a novel reference-free metric for video captioning that directly measures the similarity between a video and a candidate caption. Benefiting from recent developments in large-scale pre-trained models, we exploit a well pre-trained vision-language model to extract the visual and linguistic embeddings used to compute EMScore. Specifically, EMScore combines matching scores at the coarse-grained (video and caption) and fine-grained (frames and words) levels, which takes both the overall understanding and the detailed characteristics of the video into account. Furthermore, considering the potential information gain, EMScore can be flexibly extended to the setting where human-labeled references are available. Last but not least, we collect the VATEX-EVAL and ActivityNet-FOIL datasets to systematically evaluate existing metrics. The VATEX-EVAL experiments show that EMScore has higher human correlation and lower reference dependency. The ActivityNet-FOIL experiments verify that EMScore can effectively identify "hallucinated" captions. The datasets will be released to facilitate the development of video captioning metrics. The code is available at: https://github.com/shiyaya/emcore.
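A rough sketch of how coarse- and fine-grained embedding matching can be combined is given below. It assumes the frame and word embeddings have already been produced by a pre-trained vision-language model (e.g. a CLIP-like encoder) and L2-normalised, and it uses an equal weighting of the two levels; the helper names and the averaging choices are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def emscore_like(frame_emb, word_emb):
    """frame_emb: (F, d) array, word_emb: (W, d) array, rows L2-normalised."""
    # Coarse-grained: cosine similarity between pooled video and caption embeddings.
    video_vec = frame_emb.mean(axis=0)
    caption_vec = word_emb.mean(axis=0)
    coarse = float(video_vec @ caption_vec /
                   (np.linalg.norm(video_vec) * np.linalg.norm(caption_vec)))
    # Fine-grained: greedy matching of each word to its best frame and vice versa.
    sim = word_emb @ frame_emb.T            # (W, F) cosine similarities
    precision = sim.max(axis=1).mean()      # each word matched to some frame
    recall = sim.max(axis=0).mean()         # each frame matched to some word
    fine = 2 * precision * recall / (precision + recall + 1e-8)
    return (coarse + fine) / 2.0            # assumed equal weighting
```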
In this paper, we construct two automatic evaluation metrics for assessing the association between machine-generated captions and the ground-truth style: overtyle and 风格德.
The open-ended nature of visual captioning makes it a challenging area for evaluation. The majority of proposed models rely on specialized training to improve human correlation, resulting in limited adoption and generality. We introduce "typicality", a new formulation of evaluation rooted in information theory, which is uniquely suited to problems lacking a definite ground truth. Typicality serves as our framework for developing a novel semantic comparison, SPARCS, as well as referenceless fluency evaluation metrics. Over the course of our analysis, two separate dimensions of fluency naturally emerge: style, captured by the metric SPURTS, and grammar, captured in the form of grammatical-outlier penalties. Through extensive experiments and ablation studies on benchmark datasets, we show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences. Our proposed metrics, together with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
Traditional automated metrics for evaluating conditional natural language generation rely on pairwise comparisons between a single generated text and the best-matching gold-standard ground-truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground-truth data (i.e., dispersion of the distribution of conditional texts) can be attributed to noise, as in automatic speech recognition, it does not allow for robust evaluation in cases where diversity in the ground truths represents signal for the model. In this work we argue that existing metrics are not suitable for domains such as visual description or summarization, where ground truths are semantically diverse and where the diversity of those captions captures useful additional information about the context. We propose a new paradigm for the multi-candidate evaluation of conditional language generation models, and a new family of metrics that compare the distributions of reference and model-generated caption sets using small sets of samples from each. We demonstrate the utility of our approach with a case study in visual description: we show that existing models optimize for single-description quality over diversity, and we gain some insights into how sampling methods and temperature affect description quality and diversity.
In recent years, researchers have created and introduced a large number of diverse code generation models. Since human evaluation of every new model version is infeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate the results of human judgment. These metrics originate from the machine translation domain, and it is unclear whether they are applicable to the code generation task and how well they agree with human evaluation on this task. There are also two metrics, CodeBLEU and RUBY, which were developed to estimate the similarity of code and which take code properties into account. However, for these metrics there are hardly any studies of their agreement with human evaluation. Nevertheless, minimal differences in metric scores are still used to claim the superiority of some code generation models. In this paper, we present a study on the applicability of six metrics - BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY - for evaluating code generation models. We conducted a study on two different code generation datasets and used human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgment with greater than 95% certainty if the difference in model scores is less than 5 points. For the Hearthstone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over another. Based on our findings, we derive several recommendations on using metrics to estimate model performance on the code generation task.
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes measures to automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans. The measures count the number of overlapping units such as n-grams, word sequences, and word pairs between the computer-generated summary to be evaluated and the ideal summaries created by humans. This paper introduces four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S, included in the ROUGE summarization evaluation package, and their evaluations. Three of them have been used in the Document Understanding Conference (DUC) 2004, a large-scale summarization evaluation sponsored by NIST.
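As a concrete example of the overlap counting described above, a minimal sketch of ROUGE-N recall might look as follows; it assumes whitespace tokenisation and a single reference summary, whereas the released package also handles multiple references, stemming, and stopword removal.

```python
from collections import Counter

def rouge_n_recall(candidate, reference, n=2):
    # Count clipped n-gram overlap and normalise by the reference length (recall).
    def ngram_counts(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngram_counts(candidate), ngram_counts(reference)
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    return overlap / max(sum(ref.values()), 1)
```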
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU-1 score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU-1 score improvements on Flickr30k, from 56 to 66, and on SBU, from 19 to 28. Lastly, on the newly released COCO dataset, we achieve a BLEU-4 of 27.7, which is the current state-of-the-art.
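As an illustration of the kind of architecture the abstract describes, a minimal encoder-decoder sketch in PyTorch might look like the following; the feature dimension, module names, and the convention of feeding the image feature as the first input step are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Toy show-and-tell style decoder: CNN feature in, word logits out."""
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)
        self.img_to_embed = nn.Linear(feat_dim, embed_dim)  # assumed CNN feature size

    def forward(self, image_feats, captions):
        # The image feature acts as the first "token" of the input sequence.
        img_tok = self.img_to_embed(image_feats).unsqueeze(1)      # (B, 1, E)
        word_toks = self.embed(captions[:, :-1])                   # (B, T-1, E)
        hidden, _ = self.lstm(torch.cat([img_tok, word_toks], dim=1))
        return self.proj(hidden)  # logits for maximum-likelihood (cross-entropy) training
```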
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained using the same datasets designed for perceptual tasks. To achieve success at cognitive tasks, models need to understand the interactions and relationships between objects in an image.
Widely used evaluation metrics for text generation either do not work well with longer texts or fail to evaluate all aspects of text quality. In this paper, we introduce a new metric called SMART to mitigate such limitations. Specifically, we treat sentences, rather than tokens, as the basic unit of matching, and use a sentence matching function to match candidate and reference sentences. Candidate sentences are also compared to sentences in the source document to allow grounded (e.g., factuality) evaluation. Our results show that our proposed metric with a model-based matching function outperforms all competing metrics in system-level correlation on the SummEval summarization meta-evaluation dataset, while a string-based variant remains competitive; the latter does not use any neural model, which is useful during model development phases where resources may be limited and fast evaluation is required. Finally, we also conduct extensive analyses showing that our proposed metrics work well with longer summaries and are less biased towards specific models.
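To make the sentence-as-unit idea concrete, a simplified SMART-style score with a pluggable sentence matching function might be sketched as below; the token-overlap F1 used here merely stands in for the string- or model-based matchers the abstract refers to, and the soft precision/recall combination is an assumed simplification.

```python
def token_f1(a, b):
    # Illustrative sentence matching function: token-set overlap F1.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    overlap = len(sa & sb)
    if not sa or not sb or not overlap:
        return 0.0
    p, r = overlap / len(sa), overlap / len(sb)
    return 2 * p * r / (p + r)

def smart_like(candidate_sents, reference_sents, match=token_f1):
    # Soft recall: each reference sentence is matched to its best candidate sentence;
    # soft precision works the other way round; combine as an F-measure.
    recall = sum(max(match(r, c) for c in candidate_sents)
                 for r in reference_sents) / len(reference_sents)
    precision = sum(max(match(c, r) for r in reference_sents)
                    for c in candidate_sents) / len(candidate_sents)
    return 2 * precision * recall / (precision + recall + 1e-8)
```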
We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTSCORE is more robust to challenging examples when compared to existing metrics.
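The greedy token matching described above can be sketched in a few lines, assuming the candidate and reference token embeddings have already been produced by a contextual encoder and L2-normalised; the released BERTSCORE additionally supports IDF weighting and baseline rescaling, which are omitted here.

```python
import numpy as np

def bertscore_like(cand_emb, ref_emb):
    """cand_emb: (Tc, d), ref_emb: (Tr, d), rows L2-normalised."""
    sim = cand_emb @ ref_emb.T               # pairwise cosine similarities
    precision = sim.max(axis=1).mean()       # each candidate token to its best reference token
    recall = sim.max(axis=0).mean()          # each reference token to its best candidate token
    return 2 * precision * recall / (precision + recall + 1e-8)
```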
A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention in recent years with the release of freely available datasets. The problem has been addressed predominantly with deep learning techniques. Numerous approaches have been proposed, such as investigating different neural network architectures, exploiting auxiliary information such as keywords or sentence information to guide caption generation, and adopting different training strategies, which have greatly facilitated the development of this field. In this paper, we present a comprehensive review of the published contributions to automated audio captioning, from a variety of existing approaches to evaluation metrics and datasets. We also discuss open challenges and envisage possible future research directions.
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ∼0.25M images, ∼0.76M questions, and ∼10M answers (www.visualqa.org), and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset: performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection allows us to approach the extremely challenging problem of description generation using relatively simple non-parametric methods and produces surprisingly effective results. We also develop methods incorporating many state-of-the-art, but fairly noisy, estimates of image content to produce even more pleasing results. Finally we introduce a new objective performance measure for image captioning.
Human evaluations of machine translation are extensive but expensive. Human evaluations can take months to finish and involve human labor that can not be reused. We propose a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run. We present this method as an automated understudy to skilled human judges which substitutes for them when there is need for quick or frequent evaluations.
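For reference, a minimal sketch of the BLEU computation, combining clipped (modified) n-gram precisions with a brevity penalty, might look like the following; it assumes a single reference, uniform weights over n = 1 to 4, and sentence-level scoring, whereas the metric is normally reported at corpus level.

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clipped counts: a candidate n-gram is credited at most as often as it
        # appears in the reference.
        clipped = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        log_precisions.append(math.log(max(clipped, 1e-9) / total))
    # Brevity penalty discourages very short candidates.
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return brevity * math.exp(sum(log_precisions) / max_n)
```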
This paper provides a comprehensive review of the research on natural language generation (NLG) over the past two decades, especially concerning data-to-text generation and text-to-text generation with deep learning methods, as well as new applications of NLG technology. The survey aims to (a) give an up-to-date synthesis of the core tasks in NLG and of the architectures adopted in the field; (b) detail the various NLG tasks and datasets, and draw attention to the challenges in NLG evaluation, focusing on different evaluation methods and their relationships; and (c) highlight some future directions and relatively recent research questions that arise from the growing synergies between NLG and other areas of artificial intelligence, such as computer vision, text, and computational creativity.
We establish a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall), as well as other aspects that measure text quality (fluency, conciseness, and inclusive language). Our evaluation reveals several critical problems with current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in the coverage of salient information (i.e., recall), while all automatic metrics say the opposite. Our rubric-based results show that CLIPScore, a recent metric that uses image features, correlates better with human judgments because it is more sensitive to recall. We hope this work will promote more transparent evaluation protocols for image captioning and automatic metrics.