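Although the problem of hallucinations in neural machine translation (NMT) has received some attention, research on this highly pathological phenomenon lacks solid ground. Previous work has been limited in several ways: it often resorts to artificial settings where the problem is amplified, it disregards some (common) types of hallucinations, and it does not validate the adequacy of detection heuristics. In this paper, we set foundations for the study of NMT hallucinations. First, we work in a natural setting, i.e., in-domain data without artificial noise in either training or inference. Next, we annotate a dataset of over 3.4k sentences indicating different kinds of critical errors and hallucinations. Then, we turn to detection methods: we revisit methods used previously and propose using glass-box uncertainty-based detectors. Overall, we show that for preventive settings, (i) previously used methods are largely inadequate, and (ii) sequence log-probability works best and performs on par with reference-based methods. Finally, we propose DeHallucinator, a simple method for alleviating hallucinations at test time that significantly reduces the hallucination rate. To ease future research, we release our annotated dataset for WMT18 German-English data, along with the model, training data, and code.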
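As a rough illustration of the sequence log-probability detector mentioned in this abstract, the sketch below averages the token log-probabilities of a translation and flags low-scoring outputs. It assumes length normalization; the function names, the example probabilities, and the threshold are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def seq_logprob(token_logprobs: List[float]) -> float:
    """Length-normalized sequence log-probability of one translation."""
    if not token_logprobs:
        raise ValueError("translation has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def flag_hallucination(token_logprobs: List[float], threshold: float) -> bool:
    """Flag a translation whose score falls below a threshold.

    The threshold is an assumption; in practice it would be tuned
    on held-out annotated data.
    """
    return seq_logprob(token_logprobs) < threshold


# Hypothetical per-token probabilities from a translation model.
confident = [math.log(p) for p in (0.9, 0.8, 0.95)]
uncertain = [math.log(p) for p in (0.2, 0.1, 0.05)]

print(flag_hallucination(confident, threshold=-1.0))
print(flag_hallucination(uncertain, threshold=-1.0))
```

The detector needs nothing beyond the translation model's own output distribution, which is what makes it a glass-box method.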
While the problem of hallucinations in neural machine translation has long been recognized, progress on alleviating it has so far been limited. Indeed, it recently turned out that, without artificially encouraging models to hallucinate, previously existing methods fall short and even the standard sequence log-probability is more informative. This means that characteristics internal to the model can give much more information than we expect, and before using external models and measures, we first need to ask: how far can we go if we use nothing but the translation model itself? We propose to use a method that evaluates the percentage of the source contribution to a generated translation. Intuitively, hallucinations are translations "detached" from the source, hence they can be identified by low source contribution. This method improves detection accuracy for the most severe hallucinations by a factor of 2 and is able to alleviate hallucinations at test time on par with the previous best approach that relies on external models. Next, if we move away from internal model characteristics and allow external tools, we show that using sentence similarity from cross-lingual embeddings further improves these results.
Neural machine translation (NMT) has become the de facto standard in real-world machine translation applications. However, NMT models can unpredictably produce severely pathological translations, known as hallucinations, that seriously undermine user trust. It thus becomes crucial to implement effective preventive strategies to guarantee their proper functioning. In this paper, we address the problem of hallucination detection in NMT by following a simple intuition: as hallucinations are detached from the source content, they exhibit encoder-decoder attention patterns that are statistically different from those of good-quality translations. We frame this problem with an optimal transport formulation and propose a fully unsupervised, plug-in detector that can be used with any attention-based NMT model. Experimental results show that our detector not only outperforms all previous model-based detectors, but is also competitive with detectors that employ large models trained on millions of samples.
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics including the submissions to the WMT 2022 metrics shared task and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.
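In this paper, we share findings from our effort to build practical machine translation (MT) systems capable of translating across more than one thousand languages. We describe results in three research domains: (i) building clean, web-mined datasets for 1500+ languages by leveraging semi-supervised pre-training for language identification and developing data-driven filtering techniques; (ii) developing practical MT models for under-served languages by leveraging massively multilingual models trained with supervised parallel data for over 100 high-resource languages and monolingual datasets for an additional 1000+ languages; and (iii) studying the limitations of evaluation metrics for these languages and conducting qualitative analysis of the outputs from our MT models, highlighting several frequent error modes of these types of models. We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.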
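Neural metrics have achieved impressive correlation with human judgements in the evaluation of machine translation systems, but before we can safely optimise towards such metrics, we should be aware of (and ideally eliminate) biases toward bad translations that receive high scores. Our experiments show that sample-based Minimum Bayes Risk decoding can be used to explore and quantify such weaknesses. When applying this strategy to COMET for en-de and de-en, we find that COMET models are not sensitive enough to discrepancies in numbers and named entities. We further show that these biases are hard to fully remove by simply training on additional synthetic data, and we release our code and data to facilitate further experiments.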
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article: we first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed, with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
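We introduce translation error correction (TEC), the task of automatically correcting human-generated translations. Imperfections in machine translation (MT) have long motivated systems for improving translations post hoc with automatic post-editing. In contrast, little attention has been devoted to the problem of automatically correcting human translations, despite the intuition that humans make distinct errors, from typos to inconsistencies in translation conventions. To investigate this, we build and release the ACED corpus with three TEC datasets. We show that human errors in TEC exhibit a more diverse range of errors and far fewer translation fluency errors than the MT errors in automatic post-editing datasets, suggesting the need for dedicated TEC models that are specialized to correct human errors. We show that pre-training on synthetic errors based on human errors improves TEC F-score by as much as 5.1 points. We conducted a human-in-the-loop user study with nine professional translation editors and found that the assistance of our TEC system led them to produce higher-quality revised translations.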
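Neural machine translation (NMT) is an open-vocabulary problem. As a result, dealing with words that do not occur during training (a.k.a. out-of-vocabulary (OOV) words) has long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE), which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results for a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that, by using BPE, NMT systems are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data. Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly. Furthermore, we highlight the slightly higher effectiveness of BPE in translating OOV words for special cases, such as named entities and when the languages involved are linguistically close to each other.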
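To make the sub-word splitting described in this abstract concrete, here is a minimal sketch of how a learned list of BPE merges segments a word, including one that never occurred in training. The merge list is hypothetical, and the sketch omits the end-of-word marker and vocabulary thresholds that real BPE implementations typically use.

```python
from typing import List, Tuple


def bpe_segment(word: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Segment a word by applying learned merges in priority order."""
    symbols = list(word)  # start from single characters
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            # Merge every adjacent occurrence of the pair (left, right).
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merges, listed in the order they were learned.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

print(bpe_segment("lower", merges))
print(bpe_segment("lowest", merges))
```

A word such as "lowest" that matches few merges falls back to many short segments, which is the situation the abstract's analysis by number of segments examines.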
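Standard automatic metrics, e.g. BLEU, are not reliable for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the discourse phenomena that cause context-agnostic translations. This paper introduces a novel automatic metric, BlonDe, to widen the scope of automatic MT evaluation from the sentence to the document level. BlonDe takes discourse coherence into consideration by categorizing discourse-related spans and calculating the similarity-based F1 measure of categorized spans. We conduct extensive comparisons on a newly constructed dataset, BWB. The experimental results show that BlonDe possesses better selectivity and interpretability at the document level, and is more sensitive to document-level nuances. In a large-scale human study, BlonDe also achieves significantly higher Pearson correlation with human judgments compared to previous metrics.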
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
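This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse automated metrics of translation quality. Automatic metrics in machine translation have made tremendous progress recently. In particular, neural metrics fine-tuned on human ratings (e.g. BLEURT or COMET) are outperforming surface metrics in terms of correlation with human judgements. Our experiments show that the combination of a neural translation model with a neural reference-based metric, BLEURT, results in significant improvements in automatic and human evaluations. This improvement is obtained with translations that differ from classical beam-search output: these translations have lower likelihood and are less favored by surface metrics like BLEU.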
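The sample-based MBR decoding described in this abstract can be sketched as follows: each candidate is scored by its average utility against the other candidates, which act as pseudo-references, and the highest-scoring candidate is returned. The unigram F1 utility is a toy stand-in chosen so the example runs without external models; it is not BLEURT, and all names and sample sentences here are hypothetical.

```python
from collections import Counter
from typing import Callable, List


def unigram_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: unigram F1 overlap (a stand-in for a neural metric)."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(candidates: List[str],
               utility: Callable[[str, str], float] = unigram_f1) -> str:
    """Return the candidate with the highest average utility
    against all other candidates (used as pseudo-references)."""
    if not candidates:
        raise ValueError("need at least one candidate")

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


# Hypothetical samples drawn from a translation model.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog barked loudly",
]
print(mbr_decode(samples))
```

Swapping `unigram_f1` for a learned metric changes which candidate wins, which is why the selected translations can differ from the most likely beam-search output.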
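Mined bitexts can contain imperfect translations that yield unreliable training signals for neural machine translation (NMT). While filtering such pairs out is known to improve final model quality, we argue that it is suboptimal in low-resource conditions where even mined data can be limited. In our work, we propose instead to refine the mined bitexts via automatic editing: given a sentence x_f in one language and a possibly imperfect translation of it, x_e, our model generates a revised version x_f' or x_e' that yields a more equivalent translation pair (i.e., <x_f, x_e'> or <x_f', x_e>). We use a simple editing strategy: (1) mining potentially imperfect translations for each sentence in a given bitext, and (2) learning a model to reconstruct the original translations and to translate, in a multi-task fashion. Experiments demonstrate that our approach successfully improves the quality of mined bitext for 5 low-resource language pairs and 10 translation directions, in most cases improving upon a competitive back-translation baseline.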
The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate the evaluation metrics being used, which enables a better selection of the metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, and the resulting observations may not always apply to other languages. Indian languages, with over a billion speakers, are linguistically different from English, and to date there has been no systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection to improving translation.
Trainable evaluation metrics for machine translation (MT) exhibit strong correlation with human judgements, but they are often hard to interpret and might produce unreliable scores under noisy or out-of-domain data. Recent work has attempted to mitigate this with simple uncertainty quantification techniques (Monte Carlo dropout and deep ensembles). However, these techniques (as we show) are limited in several ways: for example, they are unable to distinguish between different kinds of uncertainty, and they are time and memory consuming. In this paper, we propose more powerful and efficient uncertainty predictors for MT evaluation, and we assess their ability to target different sources of aleatoric and epistemic uncertainty. To this end, we develop and compare training objectives for the COMET metric to enhance it with an uncertainty prediction output, including heteroscedastic regression, divergence minimization, and direct uncertainty prediction. Our experiments show improved results on uncertainty prediction for the WMT metrics task datasets, with a substantial reduction in computational costs. Moreover, they demonstrate the ability of these predictors to address specific uncertainty causes in MT evaluation, such as low-quality references and out-of-domain data.
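Of the training objectives listed in this abstract, heteroscedastic regression is the easiest to illustrate. One common form, assumed here rather than taken from the paper, has the model predict both a mean score and a log-variance and trains it with the Gaussian negative log-likelihood (constant term dropped). The function name and example values are hypothetical.

```python
import math


def gaussian_nll(target: float, mean: float, log_var: float) -> float:
    """Per-example Gaussian negative log-likelihood, up to a constant.

    Predicting log-variance keeps the variance positive without constraints.
    """
    variance = math.exp(log_var)
    return 0.5 * (log_var + (target - mean) ** 2 / variance)


# A large error is penalized less when the model also reports high uncertainty.
overconfident = gaussian_nll(target=0.0, mean=3.0, log_var=0.0)
calibrated = gaussian_nll(target=0.0, mean=3.0, log_var=math.log(9.0))
print(overconfident, calibrated)
```

The log-variance term stops the model from claiming high uncertainty everywhere, so the predicted variance ends up large only where errors are large. That gives an uncertainty estimate from a single forward pass, unlike Monte Carlo dropout or deep ensembles.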
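Annotated data is an essential ingredient in natural language processing for training and evaluating machine learning models. It is therefore very desirable for the annotations to be of high quality. Recent work, however, has shown that several popular datasets contain a surprising number of annotation errors or inconsistencies. To alleviate this issue, many methods for annotation error detection have been devised over the years. While researchers show that their approaches work well on their newly introduced datasets, they rarely compare their methods to previous work or on the same datasets. This raises strong concerns about the methods' general performance and makes it difficult to assess their strengths and weaknesses. We therefore reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets for text classification as well as token and span labeling. In addition, we define a uniform evaluation setup including a new formalization of the annotation error detection task, an evaluation protocol, and general best practices. To facilitate future research and reproducibility, we release our datasets and implementations in an easy-to-use, open-source software package.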
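This study discusses the effect of semi-supervised learning in combination with pretrained language models for data-to-text generation. It is not known whether semi-supervised learning is still helpful when a large-scale language model is also supplemented. This study aims to answer this question by comparing a data-to-text system supplemented only with a language model to two data-to-text systems that are additionally enriched by a data augmentation or a pseudo-labeling semi-supervised learning approach. Results show that semi-supervised learning results in higher scores on diversity metrics. In terms of output quality, extending the training set of a data-to-text system using the pseudo-labeling approach did increase text quality scores, but the data augmentation approach yielded scores similar to those of the system without training set extension. These results indicate that semi-supervised learning approaches can bolster output quality and diversity, even when a language model is also present.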
Automatic machine translation (MT) metrics are widely used to distinguish the translation qualities of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting the success of a machine translation component when placed in a larger platform with a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation with the success/failure on the final task for the Translate-Test setup. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable mostly because of undefined ranges. Our analysis suggests that future MT metrics be designed to produce error labels rather than scores to facilitate extrinsic evaluation.