视频标题的当前度量主要基于参考和候选字幕之间的文本级别比较。然而,它们具有一些不可能的缺点,例如,它们不能在没有参考的情况下处理视频,并且由于视频到文本的一对多性质和忽视视觉相关性的一对多性质,它们可能导致偏见的评估。从人类评估者的观点来看,高质量的标题应与提供的视频一致,但不一定类似于文字或语义中的参考。灵感来自人类评估,我们提出了Emscore(基于匹配的分数),是视频字幕的一种新颖的无参考度量,其直接测量视频和候选字幕之间的相似性。受益于最近的大规模预训练模型的发展,我们利用了一个良好的预先训练的视觉语言模型来提取用于计算Emscore的视觉和语言嵌入。具体地,Emscore将粗粒(视频和标题)和细粒度(帧和单词)水平的匹配分数组合,这将考虑到视频的整体理解和详细特征。此外,考虑到潜在的信息增益,Emscore可以灵活地扩展到人类标记的参考可用的条件。最后但并非最不重要的是,我们收集Vatex-eval和ActivityNet-Foil数据集以系统地评估现有的度量标准。 Vatex-emp实验表明,Emscore具有更高的人类相关性和较低的参考依赖性。 ActivityNet-Foil实验验证Emscore可以有效地识别“幻觉”标题。将释放数据集以促进视频标题度量的开发。代码可在:https://github.com/shiyaya/emcore。
translated by 谷歌翻译
图像字幕是当前的研究任务,用于使用场景中的对象及其关系来描述图像内容。为了应对这项任务,使用了两个重要的研究领域,人为的视觉和自然语言处理。在图像字幕中,就像在任何计算智能任务中一样,性能指标对于知道方法的性能(或坏)至关重要。近年来,已经观察到,基于n-gram的经典指标不足以捕获语义和关键含义来描述图像中的内容。为了衡量或不进行最新指标的集合,在本手稿中,我们对使用众所周知的COCO数据集进行了对几种图像字幕指标的评估以及它们之间的比较。为此,我们设计了两种情况。 1)一组人工构建字幕,以及2)比较某些最先进的图像字幕方法的比较。我们试图回答问题:当前的指标是否有助于制作高质量的标题?实际指标如何相互比较?指标真正测量什么?
translated by 谷歌翻译
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of offline videos and textual query sentences. However, in real scenarios, online videos are frequently accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This inspires us to generate associated captions from offline videos to help with existing text-video retrieval methods. To do so, we propose to use the zero-shot video captioner with knowledge of pre-trained web-scale models (e.g., CLIP and GPT-2) to generate captions for offline videos without any training. Given the captions, one question naturally arises: what can auxiliary captions do for text-video retrieval? In this paper, we present a novel framework Cap4Video, which makes use of captions from three aspects: i) Input data: The video and captions can form new video-caption pairs as data augmentation for training. ii) Feature interaction: We perform feature interaction between video and caption to yield enhanced video representations. iii) Output score: The Query-Caption matching branch can be complementary to the original Query-Video matching branch for text-video retrieval. We conduct thorough ablation studies to demonstrate the effectiveness of our method. Without any post-processing, our Cap4Video achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%).
translated by 谷歌翻译
在本文中,我们构建了两个自动评估度量,用于评估机器生成的标题和地面真理体型中的关联:overtyle和风格德。
translated by 谷歌翻译
视觉标题的开放性质使其成为评估的具有挑战性的区域。大多数拟议模型依赖于专业培训来改善人类关联,导致采用有限,普遍性和索引。我们介绍了“典型性”,一种新的评价制定,根植于信息理论,这是唯一适合缺乏明确的实践的问题。典型程度是我们开发新颖语义比较,SPARC的框架,以及引用的流畅评估度量。在我们的分析过程中,流利的两个单独的流利程度自然出现:风格,由公制刺激和语法捕获,以语法异常罚款的形式捕获。通过对基准数据集进行广泛的实验和消融研究,我们展示了这些语义和流畅程度的这些分解维度如何为标题差异提供更大的系统级洞察。与其他基于规则的评估指标相比,我们拟议的指标与他们的组合,SMURF,达到了人为判断的最先进的相关性。
translated by 谷歌翻译
There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as which caption-generator best understands colors? and can caption-generators count?
translated by 谷歌翻译
We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTSCORE is more robust to challenging examples when compared to existing metrics.
translated by 谷歌翻译
在过去的几年中,引起了独特的图像字幕(DIC)(DIC) - 生成独特的标题来描述目标图像的独特细节。最近的DIC工作建议通过将目标图像与一组语义相似的参考图像(即基于参考的DIC(REF-DIC))进行比较来生成独特的字幕。它的目的是使生成的字幕可以分开目标图像和参考图像。不幸的是,现有参考作品使用的参考图像易于区分:这些参考图像仅类似于场景级别的目标图像,并且几乎没有常见的对象,因此,即使不考虑该模型,Ref-DIC模型也可以微不足道地生成独特的字幕参考图像。为了确保Ref-DIC模型真正了解目标图像中的唯一对象(或属性),我们首先提出了两个新的Ref-DIC基准。具体而言,我们设计了一个两阶段的匹配机制,该机制严格控制对象 - /属性级别的目标和参考图像之间的相似性(相对于场景级别)。其次,为了产生独特的标题,我们开发了一个强大的基于变压器的ref-DIC基线,称为传播。它不仅从目标图像中提取视觉特征,而且还编码目标和参考图像中对象之间的差异。最后,为了获得更值得信赖的基准测试,我们提出了一个新的评估度量指标,名为Ref-DIC的Discider,评估生成的字幕的准确性和独特性。实验结果表明,我们的传统可以产生独特的标题。此外,它在不同指标上的两个新基准测试中的几个最先进的模型都优于多种最先进的模型。
translated by 谷歌翻译
连接视觉和语言在生成智能中起着重要作用。因此,已经致力于图像标题的大型研究工作,即用句法和语义有意义的句子描述图像。从2015年开始,该任务通常通过由Visual Encoder组成的管道和文本生成的语言模型来解决任务。在这些年来,两种组件通过对象区域,属性,介绍多模态连接,完全关注方法和伯特早期融合策略的利用而显着发展。但是,无论令人印象深刻的结果,图像标题的研究还没有达到结论性答案。这项工作旨在提供图像标题方法的全面概述,从视觉编码和文本生成到培训策略,数据集和评估度量。在这方面,我们量化地比较了许多相关的最先进的方法来确定架构和培训策略中最有影响力的技术创新。此外,讨论了问题的许多变体及其开放挑战。这项工作的最终目标是作为理解现有文献的工具,并突出显示计算机视觉和自然语言处理的研究领域的未来方向可以找到最佳的协同作用。
translated by 谷歌翻译
Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new triplet-based method of collecting human annotations to measure consensus, a new automated metric (CIDEr) that captures consensus, and two new datasets: PASCAL-50S and ABSTRACT-50S that contain 50 sentences describing each image. Our simple metric captures human judgment of consensus better than existing metrics across sentences generated by various sources. We also evaluate five state-of-the-art image description approaches using this new protocol and provide a benchmark for future comparisons. A version of CIDEr named CIDEr-D is available as a part of MS COCO evaluation server to enable systematic evaluation and benchmarking.
translated by 谷歌翻译
自动音频字幕是一项跨模式翻译任务,旨在为给定的音频剪辑生成自然语言描述。近年来,随着免费可用数据集的发布,该任务受到了越来越多的关注。该问题主要通过深度学习技术解决。已经提出了许多方法,例如研究不同的神经网络架构,利用辅助信息,例如关键字或句子信息来指导字幕生成,并采用了不同的培训策略,这些策略极大地促进了该领域的发展。在本文中,我们对自动音频字幕的已发表贡献进行了全面综述,从各种现有方法到评估指标和数据集。我们还讨论了公开挑战,并设想可能的未来研究方向。
translated by 谷歌翻译
我们介绍了一种零拍的视频字幕方法,该方法采用了两个冷冻网络:GPT-2语言模型和剪辑图像文本匹配模型。匹配分数用于引导语言模型生成一个句子,该句子的平均匹配分数高于视频帧的一个子集。与零拍图像字幕方法不同,我们的工作立即考虑整个句子。这是通过在生成过程中优化从头开始的一部分,通过在提示中修改所有其他令牌的表示,并通过迭代重复该过程,逐渐提高生成句子的特殊性和全面性来实现。我们的实验表明,生成的字幕是连贯的,并显示了广泛的现实知识。我们的代码可在以下网址找到:https://github.com/yoadtew/zero-shot-video-to-text
translated by 谷歌翻译
Current metrics for evaluating factuality for abstractive document summarization have achieved high correlations with human judgment, but they do not account for the vision modality and thus are not adequate for vision-and-language summarization. We propose CLIPBERTScore, a simple weighted combination of CLIPScore and BERTScore to leverage the robustness and strong factuality detection performance between image-summary and document-summary, respectively. Next, due to the lack of meta-evaluation benchmarks to evaluate the quality of multimodal factuality metrics, we collect human judgments of factuality with respect to documents and images. We show that this simple combination of two metrics in the zero-shot setting achieves higher correlations than existing factuality metrics for document summarization, outperforms an existing multimodal summarization metric, and performs competitively with strong multimodal factuality metrics specifically fine-tuned for the task. Our thorough analysis demonstrates the robustness and high correlation of CLIPBERTScore and its components on four factuality metric-evaluation benchmarks. Finally, we demonstrate two practical downstream applications of our CLIPBERTScore metric: for selecting important images to focus on during training, and as a reward for reinforcement learning to improve factuality of multimodal summary generation w.r.t automatic and human evaluation. Our data and code are publicly available at https://github.com/meetdavidwan/faithful-multimodal-summ
translated by 谷歌翻译
用于评估有条件自然语言生成的传统自动化指标使用单个生成的文本和最佳匹配的金标准地面真相文本之间的成对比较。当有多个基础真相可用时,分数将使用参考中的平均或最大操作进行汇总。尽管这种方法在地面真相数据中的多样性(即有条件文本的分布的分散)可以归因于噪声,例如自动语音识别中,但在地面上的多样性的情况下,它不允许进行强有力的评估。真理代表模型的信号。在这项工作中,我们认为现有的指标不适合诸如视觉描述或摘要之类的域,而地面真理在语义上是多样的,并且这些字幕中的多样性捕获了有关上下文的有用的其他信息。我们提出了一种新的范式,用于对条件语言生成模型的多键入评估以及一个新的指标家族,该指标家族使用每种少量样本集比较参考和模型生成的字幕集的分布。我们通过视觉描述中的案例研究证明了方法的实用性:我们在其中证明现有模型优化了单描述质量而不是多样性,并获得了对采样方法和温度影响如何描述质量和多样性的一些见解。
translated by 谷歌翻译
新颖的对象字幕(NOC)旨在描述包含对象的图像,而无需在训练过程中观察其地面真相标题。由于缺乏字幕注释,无法通过序列到序列训练或苹果酒优化直接优化字幕模型。结果,我们提出了启用释义(P2C),这是一个针对NOC的两阶段学习框架,它将通过释义通过释义来优化输出字幕。使用P2C,字幕模型首先从仅在文本语料库中预先训练的语言模型中学习释义,从而扩展了Bank一词以提高语言流利度。为了进一步实施足够描述输入图像的视觉内容的输出字幕,我们对引入的忠诚度和充分性目标进行字幕模型执行自我贴形。由于在训练过程中没有任何地面真相标题可用于新颖的对象图像,因此我们的P2C利用交叉模式(图像文本)关联模块可以确保可以正确保留上述字幕特征。在实验中,我们不仅表明我们的P2C在NOCAPS和COCO字幕数据集上实现了最先进的性能,而且还通过替换NOC的语言和跨模式关联模型来验证学习框架的有效性和灵活性。实施详细信息和代码可在补充材料中找到。
translated by 谷歌翻译
Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs that are aligned at different levels, the inherent noises (e.g., misaligned pairs) make it difficult to learn a precise captioning model. While the filtering strategy can effectively remove noisy data, however, it leads to a decrease in learnable knowledge and sometimes brings about a new problem of data deficiency. To take the best of both worlds, we propose a noise-aware learning framework, which learns rich knowledge from the whole web-crawled data while being less affected by the noises. This is achieved by the proposed quality controllable model, which is learned using alignment levels of the image-text pairs as an additional control signal during training. The alignment-conditioned training allows the model to generate high-quality captions of well-aligned by simply setting the control signal to desired alignment level at inference time. Through in-depth analysis, we show that our controllable captioning model is effective in handling noise. In addition, with two tasks of zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate our model can produce high-quality captions in terms of descriptiveness and distinctiveness. Code is available at \url{https://github.com/kakaobrain/noc}.
translated by 谷歌翻译
A major goal of multimodal research is to improve machine understanding of images and text. Tasks include image captioning, text-to-image generation, and vision-language representation learning. So far, research has focused on the relationships between images and text. For example, captioning models attempt to understand the semantics of images which are then transformed into text. An important question is: which annotation reflects best a deep understanding of image content? Similarly, given a text, what is the best image that can present the semantics of the text? In this work, we argue that the best text or caption for a given image is the text which would generate the image which is the most similar to that image. Likewise, the best image for a given text is the image that results in the caption which is best aligned with the original text. To this end, we propose a unified framework that includes both a text-to-image generative model and an image-to-text generative model. Extensive experiments validate our approach.
translated by 谷歌翻译
对于视频标题,“预培训和微调”已成为事实上的范式,其中想象成预训练(InP)通常用于帮助编码视频内容,并且从头开始进行任务导向的网络应对标题一代。将InP与最近提出的剪辑(对比语言图像预培训)进行比较,研究了INP的潜在缺陷,用于视频标题,并探索产生准确描述的关键。具体而言,我们对INP与剪辑的实证研究表明,INP使视频标题模型棘手捕获属性的语义和对无关背景信息的敏感。相比之下,剪辑在标题质量中的显着提升突出了属性感知表示学习的重要性。因此,我们被激励引入双属性预测,需要一个辅助任务,需要视频字幕模型来学习视频内容和属性之间的对应关系以及属性之间的共同发生关系。基准数据集的广泛实验表明,我们的方法能够更好地学习属性感知的表示,这对具有不同架构和解码算法的模型带来了一致的改进。
translated by 谷歌翻译
密集的视频字幕(DVC)旨在生成多句子描述,以阐明视频中的多个事件,这是具有挑战性,需要的视觉一致性,疑惑一致性和语言多样性。现有方法主要生成各个视频段的标题,缺乏适应全局视觉上下文和快速发展的视觉内容和文本描述之间的渐进对齐,这导致冗余和拼接描述。在本文中,我们介绍了信息流的概念,以模拟跨视频序列和标题的渐进信息。通过设计跨模型信息流对准机制,捕获和对齐的视觉和文本信息流,其在事件/主题演化上以更丰富的上下文和动态赋予标题处理。基于跨模型信息流对准模块,我们进一步提出了DVCFlow框架,它由全球本地视觉编码器组成,用于捕获每个视频段的全局功能和本地特征,以及用于产生标题的预先培训的标题生成器。对流行的ActivityNet标题和Youcookii数据集的广泛实验表明,我们的方法显着优于竞争基础,并根据主题和客观测试产生更多人类文本。
translated by 谷歌翻译
我们建立了一种基于规校的图像标题模型的人类评估协议。我们的得分标准及其定义是基于MSCOCO数据集上的机器和人类生成的标题仔细开发。每个字幕沿着权衡(精确和召回)中的两个主要尺寸以及测量文本质量的其他方面(流利,简洁,包容性语言)。我们的评估表明了当前评估实践的几个关键问题。人生成的标题显示出比机器生成的字块的质量大得多,特别是在突出信息的覆盖范围内(即,召回),而所有自动度量都可以说相反。我们基于规度的标准结果表明,曲线芯片,最近使用图像特征的度量标准,与人类判断更好地相关,因为它对召回更敏感。我们希望这项工作将推动更透明的图像标题和自动指标的评估协议。
translated by 谷歌翻译