能够替换人类判断的自动评估指标对于允许快速开发新方法至关重要。因此,许多研究工作集中在制定此类指标上。在这项工作中,我们退后一步,通过比较现有的自动指标和人类指标的身体来分析最近的进度。由于指标是根据它们的排名系统的方式使用的,因此我们比较系统排名空间中的指标。我们广泛的统计分析揭示了令人惊讶的发现:自动指标 - 新老 - 与彼此相比,比人类更相似。自动指标不是互补的,等级系统也类似。令人惊讶的是,人类指标彼此相互预测要比所有用于预测人类指标的自动指标的组合要好得多。令人惊讶的是,人类指标通常被设计为独立,以捕获质量的不同方面,例如内容保真度或可读性。我们对这些发现和建议进行讨论,以在评估领域的未来工作。
translated by 谷歌翻译
自然语言处理研究人员已经确定了对生成任务的评估方法的局限性,具有新的问题,提出了自动指标和人群判断的有效性。同时,改善生成模型的努力倾向于专注于简单的n-gram重叠度量(例如,Bleu,Rouge)。我们认为,对模型和指标的新进展应该每个人都更直接受益并告知另一个。因此,我们提出了排行榜,竞争排行榜(广告牌)的概括,同时跟踪语言生成任务和指标的进展。与通过预定度量分类提交系统的传统的单向排行榜不同,广告牌可接受发电机和评估度量作为竞争条目。广告牌会自动创建一个基于跨发电机的全局分析选择和线性地组合一些指标的集合度量。此外,指标基于与人类判断的相关性进行排序。我们释放了用于机器翻译,摘要和图像标题的四个广告牌。我们展示了一些多样化度量的线性集合有时会在隔离中显着优于现有的度量。我们的混合效果模型分析表明,大多数自动度量,尤其是基于参考的机器,对人类发电的重估,展示了更新度量的重要性,将来变得更强大(也许与人类更相似)。
translated by 谷歌翻译
通过人类注释评估自然语言生成系统的质量非常昂贵。此外,人类注释运动是耗时的,包括不可重复使用的人工劳动力。在实践中,研究人员依赖于自动指标作为质量的代理。在过去的十年中,已经介绍了许多基于字符串的度量(例如,BLEU)。但是,这种指标通常依赖于完全匹配,因此,不强大地处理同义词。在本文中,我们介绍了InfolmM一系列未经培训的指标,可以被视为基于字符串的度量标准,该度量可以通过预先接受培训的屏蔽语言模型来解决上述漏洞。这家指标族也利用信息措施,允许改编Infolmm对各种评估标准。使用直接评估,我们展示Infolmm在概要和Data2Text生成的许多配置中实现了统计上显着的改进和超过10美元的相关点。
translated by 谷歌翻译
Automatically describing an image with a sentence is a long-standing challenge in computer vision and natural language processing. Due to recent progress in object detection, attribute classification, action recognition, etc., there is renewed interest in this area. However, evaluating the quality of descriptions has proven to be challenging. We propose a novel paradigm for evaluating image descriptions that uses human consensus. This paradigm consists of three main parts: a new triplet-based method of collecting human annotations to measure consensus, a new automated metric (CIDEr) that captures consensus, and two new datasets: PASCAL-50S and ABSTRACT-50S that contain 50 sentences describing each image. Our simple metric captures human judgment of consensus better than existing metrics across sentences generated by various sources. We also evaluate five state-of-the-art image description approaches using this new protocol and provide a benchmark for future comparisons. A version of CIDEr named CIDEr-D is available as a part of MS COCO evaluation server to enable systematic evaluation and benchmarking.
translated by 谷歌翻译
自动故事生成(ASG)的研究在很大程度上依赖于人类和自动评估。但是,尚无共识在哪些人类评估标准上使用,也没有分析自动标准与它们相关的良好状况。在本文中,我们建议重新评估ASG评估。我们介绍了由社会科学文学精心促进的6种正交和全面的人类标准。我们还提出了汉娜(Hanna),这是一个由10种不同ASG系统制作的1,056个故事的注释数据集。汉娜(Hanna)允许我们定量评估72个自动指标与人类标准的相关性。我们的分析强调了ASG当前指标的弱点,并使我们能够为ASG评估提出实用建议。
translated by 谷歌翻译
图像字幕是当前的研究任务,用于使用场景中的对象及其关系来描述图像内容。为了应对这项任务,使用了两个重要的研究领域,人为的视觉和自然语言处理。在图像字幕中,就像在任何计算智能任务中一样,性能指标对于知道方法的性能(或坏)至关重要。近年来,已经观察到,基于n-gram的经典指标不足以捕获语义和关键含义来描述图像中的内容。为了衡量或不进行最新指标的集合,在本手稿中,我们对使用众所周知的COCO数据集进行了对几种图像字幕指标的评估以及它们之间的比较。为此,我们设计了两种情况。 1)一组人工构建字幕,以及2)比较某些最先进的图像字幕方法的比较。我们试图回答问题:当前的指标是否有助于制作高质量的标题?实际指标如何相互比较?指标真正测量什么?
translated by 谷歌翻译
Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprised of 12k dialogue turns generated by neural dialogue systems trained on three knowledgegrounded dialogue corpora. We collect human annotations assessing the extent to which the models' responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/ google/BEGIN-dataset.
translated by 谷歌翻译
文本生成的广泛使用的评估指标要么与更长的文本效果不错,要么无法评估文本质量的所有方面。在本文中,我们引入了一个名为SMART的新指标,以减轻此类限制。具体而言,我们将句子视为匹配的基本单位,而不是代币,并使用句子匹配函数来匹配匹配候选和参考句子。还将候选句子与源文件中的句子进行了比较,以允许接地(例如,事实)评估。我们的结果表明,我们提出的指标与基于模型的匹配函数的系统级相关性优于萨姆瓦尔摘要元评估数据集上的所有竞争指标指标。后者不使用任何神经模型,这在模型开发阶段很有用,在这些阶段,资源可以受到限制且需要快速评估。最后,我们还进行了广泛的分析,表明我们提出的指标与较长的摘要很好地运行,并且对特定模型的偏见较小。
translated by 谷歌翻译
无论是在功能选择的领域还是可解释的AI领域,都有基于其重要性的“排名”功能的愿望。然后可以将这种功能重要的排名用于:(1)减少数据集大小或(2)解释机器学习模型。但是,在文献中,这种特征排名没有以系统的,一致的方式评估。许多论文都有不同的方式来争论哪些具有重要性排名最佳的特征。本文通过提出一种新的评估方法来填补这一空白。通过使用合成数据集,可以事先知道特征重要性得分,从而可以进行更系统的评估。为了促进使用新方法的大规模实验,在Python建造了一个名为FSEVAL的基准测定框架。该框架允许并行运行实验,并在HPC系统上的计算机上分布。通过与名为“权重和偏见”的在线平台集成,可以在实时仪表板上进行交互探索图表。该软件作为开源软件发布,并在PYPI平台上以包裹发行。该研究结束时,探索了一个这样的大规模实验,以在许多方面找到参与算法的优势和劣势。
translated by 谷歌翻译
近年来,研究人员创建并引入了大量各种代码生成模型。由于对每个新模型版本的人类评估都是不可行的,因此社区采用了自动评估指标,例如BLEU来近似人类判断的结果。这些指标源自机器翻译域,目前尚不清楚它们是否适用于代码生成任务,以及他们与人类对此任务的评估有多一致。还有两个指标,即Codebleu和Ruby,它们是为了估计代码的相似性并考虑了代码属性的。但是,对于这些指标,几乎没有关于他们与人类评估一致的研究。尽管如此,公制得分的最小差异仍用于声称某些代码生成模型的优越性。在本文中,我们介绍了一项有关六个指标的适用性的研究-Bleu,Rouge-L,Meteor,Chrf,Codebleu,Ruby-用于评估代码生成模型。我们对两个不同的代码生成数据集进行了一项研究,并使用人类注释来评估这些数据集上运行的所有模型的质量。结果表明,对于Python单线的Conala数据集,如果模型得分的差异小于5分,则没有一个指标可以正确模拟人类判断,而$ 95 \%$确定性,则使用$> 95 \%$确定性。对于由特定结构类别组成的炉石传说数据集,至少2分的模型得分差异足以声称一种模型比另一个模型的优越性。使用我们的发现,我们得出了有关使用指标来估计代码生成任务的模型性能的几项建议。
translated by 谷歌翻译
The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate evaluation metrics being used, which enables a better selection of metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, the observations for which may not always apply to other languages. Indian languages, having over a billion speakers, are linguistically different from English, and to date, there has not been a systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.
translated by 谷歌翻译
We propose BERTSCORE, an automatic evaluation metric for text generation. Analogously to common metrics, BERTSCORE computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, we compute token similarity using contextual embeddings. We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics. Finally, we use an adversarial paraphrase detection task to show that BERTSCORE is more robust to challenging examples when compared to existing metrics.
translated by 谷歌翻译
近年来,对话系统引起了学术界和工业的重要兴趣。特别是开放式对话系统的纪律,又名聊天,已经获得了很大的势头。然而,困扰研究人员的长期挑战是缺乏有效的自动评估指标,这导致目前研究中的障碍。评估开放式对话模型表现的常见做法涉及对最终部署模型的广泛人类评估,这是时间和成本密集的。此外,最近建立开放式聊天聊天的趋势涉及具有大量社交媒体对话数据的预训练对话模型。但是,社交媒体对话中包含的信息可能是令人反感的和不合适的。不分青红皂白种的使用可能导致不敏感和有毒的生成模型。本文介绍了对话系统技术挑战10(DSTC10)的轨道5获得的数据,基线和结果。
translated by 谷歌翻译
Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
translated by 谷歌翻译
提高对话系统的用户体验通常需要密集的开发人员努力读取对话日志,运行统计分析,并激活系统缺点的相对重要性。本文介绍了一种自动分析对话日志的新方法,了解用户系统交互与总体对话质量之间的关系。与在话语级别质量预测上的事先工作不同,我们的方法了解每个互动的影响,没有话语级注释的整体用户评级,允许基于经验证据和低成本获得所得模型结论。我们的模型识别与Chatbot设置中的与整体对话质量有着强烈相关的交互。实验表明,我们模型的自动分析同意专家判决,使这项工作首先表明这种弱监督的话语级质量预测学习是高度可取的。
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
Automatic machine translation (MT) metrics are widely used to distinguish the translation qualities of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation with the success/failure on the final task for the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable mostly because of undefined ranges. Our analysis suggests that future MT metrics be designed to produce error labels rather than scores to facilitate extrinsic evaluation.
translated by 谷歌翻译
最近提出的基于BERT的评估指标在标准评估基准方面表现良好,但容易受到对抗性攻击的影响,例如与事实错误有关。我们认为这(部分原因)是因为它们是语义相似性的模型。相反,我们根据自然语言推断(NLI)制定评估指标,我们认为这是更合适的建模。我们设计了一个基于偏好的对抗攻击框架,并表明我们的基于NLI的指标比最近基于BERT的指标更强大。在标准基准上,我们的基于NLI的指标的表现优于现有的摘要指标,但在SOTA MT指标下执行。但是,当我们将现有指标与NLI指标相结合时,我们可以获得更高的对抗性鲁棒性( +20%至 +30%)和较高质量的指标,如标准基准测量( +5%至 +25%)。
translated by 谷歌翻译
本文介绍了一个新颖的自我监督的细粒度对话评估框架(自我评估)。核心思想是建模转弯质量与整个对话质量之间的相关性。我们首先提出了一种新型的自动数据构建方法,该方法可以自动为任意对话数据分配细粒度的分数。然后,我们使用多层对比度学习模式训练\ textbf {self eval},有助于区分不同的分数水平。多个基准测试的实验结果表明,自我与人类评估高度一致,并且比最先进的模型更好。我们对本文的实验进行了详细的分析。我们的代码和数据将在GitHub上发布。
translated by 谷歌翻译
We investigate how humans perform the task of dubbing video content from one language into another, leveraging a novel corpus of 319.57 hours of video from 54 professionally produced titles. This is the first such large-scale study we are aware of. The results challenge a number of assumptions commonly made in both qualitative literature on human dubbing and machine-learning literature on automatic dubbing, arguing for the importance of vocal naturalness and translation quality over commonly emphasized isometric (character length) and lip-sync constraints, and for a more qualified view of the importance of isochronic (timing) constraints. We also find substantial influence of the source-side audio on human dubs through channels other than the words of the translation, pointing to the need for research on ways to preserve speech characteristics, as well as semantic transfer such as emphasis/emotion, in automatic dubbing systems.
translated by 谷歌翻译