Traditional automatic evaluation metrics for machine translation have been widely criticised by linguists for their low accuracy, lack of transparency, focus on language mechanics rather than semantics, and low agreement with human quality evaluation. Human evaluation in the form of MQM-like scorecards has always been carried out in real industry settings by clients and translation service providers (TSPs). However, traditional human translation quality evaluation is expensive to perform, goes into great linguistic detail, raises issues of inter-rater reliability (IRR), and is not designed to measure quality levels worse than premium-quality translation. In this work we introduce HOPE, a task-oriented and human-centric evaluation framework for machine translation output based on professional post-editing annotations. It contains only a limited number of commonly occurring error types and uses a scoring model with a geometric progression of error penalty points (EPPs) reflecting the error severity level of each translation unit. Initial experimental work carried out on MT output for an English-source language pair, on marketing-content texts from a highly technical domain, reveals that our evaluation framework is very effective in reflecting MT output quality with regard to both overall system-level performance and segment-level transparency, and that it increases the IRR for error type interpretation. The approach has several key advantages, such as the ability to measure and compare less-than-perfect MT output from different systems, the ability to indicate human perception of quality, immediate estimation of the labour effort required to bring MT output to premium quality, low-cost and faster application, as well as higher IRR. Our experimental data are available at \url{https://github.com/lhan87/hope}.
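To make the scoring idea concrete, here is a minimal, illustrative sketch of a severity-weighted penalty scheme in the spirit of the EPP scoring model described above; the severity levels, the geometric base, and the example error counts are our own assumptions, not the values defined in HOPE.

```python
# Illustrative sketch only: a geometric-progression penalty scheme for error
# severities, in the spirit of HOPE's error penalty points (EPPs).
# The base and the severity-to-exponent mapping below are hypothetical.

from typing import Dict

PENALTY_BASE = 2  # each severity step up doubles the penalty (geometric progression)
SEVERITY_LEVELS = {"minor": 0, "major": 1, "critical": 2}  # assumed labels

def segment_penalty(error_counts: Dict[str, int]) -> int:
    """Sum error penalty points (EPPs) for one translation unit."""
    return sum(
        count * PENALTY_BASE ** SEVERITY_LEVELS[severity]
        for severity, count in error_counts.items()
    )

# Example: one minor and one critical error in a segment.
print(segment_penalty({"minor": 1, "critical": 1}))  # 1*1 + 1*4 = 5
```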
From the perspective of both human translation (HT) and machine translation (MT) researchers, translation quality evaluation (TQE) is an essential task. Translation service providers (TSPs) have to deliver large volumes of translation that meet customer specifications, under the harsh constraints of demanding quality levels, tight time frames, and cost. MT researchers strive to make their models better, which also requires reliable quality evaluation. While automated machine translation evaluation (MTE) metrics and quality estimation (QE) tools are widely available and easy to access, existing automated tools are not good enough, and human assessment from professional translators (HAP) is often chosen as the gold standard \cite{han-etal-2021-TQA}. Human evaluations, however, are often accused of low reliability and agreement. Is this caused by subjectivity, or by statistics? How can checking the entire text be avoided, from the point of view of cost and efficiency, and what is the optimal sample size of the translated text that allows the translation quality of the whole material to be estimated reliably? This work carries out such a motivated study to correctly estimate the confidence intervals \cite{Brown_etal2001Interval} depending on the sample size of the translated text, e.g. the number of words or sentences, that needs to be processed in a TQE workflow to achieve a confident and reliable evaluation of overall translation quality. The methodology we apply in this work comes from Bernoulli Statistical Distribution Modelling (BSDM) and Monte Carlo Sampling Analysis (MCSA).
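As a rough illustration of the sampling question posed above, the sketch below treats each sampled translation unit as a Bernoulli trial (acceptable vs. defective) and uses Monte Carlo resampling to show how the spread of the estimated error rate narrows with sample size; the population size, error rate, and percentile interval are hypothetical and do not reproduce the paper's exact BSDM/MCSA procedure.

```python
# Toy Bernoulli + Monte Carlo sketch: how wide is the estimated error rate for
# a given sample size? All numbers are hypothetical.

import random

random.seed(0)

TRUE_ERROR_RATE = 0.05   # assumed population error rate
POPULATION = 100_000     # assumed number of translation units in the material
population = [1] * int(POPULATION * TRUE_ERROR_RATE) + [0] * int(POPULATION * (1 - TRUE_ERROR_RATE))

def monte_carlo_spread(sample_size: int, n_runs: int = 2000) -> tuple:
    """Return the 2.5th and 97.5th percentile of the estimated error rate."""
    estimates = sorted(
        sum(random.sample(population, sample_size)) / sample_size
        for _ in range(n_runs)
    )
    return estimates[int(0.025 * n_runs)], estimates[int(0.975 * n_runs)]

for n in (100, 500, 2000):
    lo, hi = monte_carlo_spread(n)
    print(f"sample size {n}: estimated error rate roughly in [{lo:.3f}, {hi:.3f}]")
```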
With the fast development of Machine Translation (MT) systems, especially the new boost from Neural MT (NMT) models, MT output quality has reached a new level of accuracy. However, many researchers have criticised the current popular evaluation metrics, such as BLEU, for being unable to correctly distinguish state-of-the-art NMT systems with regard to quality differences. In this short paper, we describe the design and implementation of a linguistically motivated human-in-the-loop evaluation metric that looks into idiomatic and terminological Multi-word Expressions (MWEs). MWEs have been a bottleneck in many Natural Language Processing (NLP) tasks, including MT. MWEs can be used as one of the main factors to distinguish different MT systems by looking into their capabilities in recognising and translating MWEs in an accurate and meaning-equivalent manner.
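The sketch below illustrates one simple way such an MWE-focused check could be operationalised: a small bilingual lexicon of idiomatic and terminological MWEs is used to flag MT outputs in which a source MWE occurs but none of its accepted target renderings appears. The lexicon entries and the string-matching strategy are hypothetical illustrations, not the evaluation procedure proposed in the paper.

```python
# Hypothetical MWE coverage check (not the paper's implementation).

MWE_LEXICON = {
    # source MWE -> acceptable target renderings (hypothetical examples)
    "kick the bucket": ["passer l'arme à gauche", "mourir"],
    "machine translation": ["traduction automatique"],
}

def flag_mwe_misses(source: str, mt_output: str) -> list:
    """Return source MWEs whose accepted target renderings are all missing."""
    misses = []
    for src_mwe, target_variants in MWE_LEXICON.items():
        if src_mwe in source.lower() and not any(t in mt_output.lower() for t in target_variants):
            misses.append(src_mwe)
    return misses

print(flag_mwe_misses(
    "They say he will kick the bucket soon.",
    "On dit qu'il va donner un coup de pied dans le seau bientôt.",
))  # ['kick the bucket']
```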
In this paper we present what we believe to be one of the first attempts at video game machine translation. Our study shows that a model trained on only a limited amount of in-domain data surpasses publicly available systems, and a subsequent human evaluation reveals interesting findings in the resulting translations. The first part of the paper introduces some of the challenges of video game translation, some of the existing literature, and the systems and datasets used in this experiment. The final section discusses our analysis of the resulting translations and the potential benefits of such an automated system. One such finding highlights the model's ability to learn typical rules and patterns of video game translation from English into French. Our conclusion is therefore that the specific case of video game machine translation could prove very useful, given the encouraging results, the highly repetitive nature of the work, and the often poor working conditions translators face in this field. However, as with other use cases of MT in cultural sectors, we believe this depends largely on a proper implementation of the tool, which should be used interactively by human translators to stimulate creativity rather than for raw post-editing aimed at productivity.
We present the first in-depth post-editing effort estimation study for the English-Hindi direction, along multiple effort indicators. We conduct a controlled experiment involving professional translators, who complete assigned tasks alternately in the translation-from-scratch and the post-editing condition. We find that, compared with translating from scratch, post-editing reduces translation time (by 63%), uses fewer keystrokes (by 59%), and decreases the number of pauses (by 63%). We further verify the quality of the translations thus produced via a human evaluation task, in which we do not detect any discernible quality difference.
We introduce Translation Error Correction (TEC), the task of automatically correcting human-generated translations. Imperfections in machine translation (MT) have long motivated systems for improving translations after the fact through automatic post-editing. In contrast, little attention has been paid to the problem of automatically correcting human translations, despite the intuition that humans make distinct errors, ranging from typos to inconsistencies with translation conventions. To investigate this, we build and release the ACED corpus, consisting of three TEC datasets. We show that human errors in TEC are more diverse and contain far fewer translation fluency errors than the MT errors in automatic post-editing datasets, suggesting the need for dedicated TEC models specialised in correcting human errors. We show that pre-training on synthetic errors based on human errors improves TEC F-score by up to 5.1 points. We conducted a human user study with nine professional translation editors and found that the assistance of our TEC system led them to produce higher-quality revised translations.
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics including the submissions to the WMT 2022 metrics shared task and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.
Automatic post-editing (APE) aims to reduce manual post-editing effort by automatically correcting errors in machine translation output. Because the amount of human-annotated training data is limited, data scarcity is one of the main challenges faced by all APE systems. To alleviate the lack of genuine training data, most current APE systems adopt data augmentation methods to generate large-scale artificial corpora. Given the importance of data augmentation for APE, we separately study the effects of the construction method of the artificial corpus and of the domain of the artificial data on APE model performance. In addition, the difficulty of APE varies across different machine translation (MT) systems. We study the outputs of state-of-the-art APE models on a difficult APE dataset to analyse the problems of existing APE systems. We find that 1) an artificial corpus with high-quality source text and machine-translated text is more effective in improving APE model performance; 2) in-domain artificial training data improves APE model performance better, while irrelevant out-of-domain data can actually interfere with the model; 3) existing APE models struggle with cases containing long source text or high-quality machine-translated text; 4) the state-of-the-art APE model handles grammatical and semantic addition problems well, but its output is prone to entity and semantic omission errors.
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
We hypothesise that existing sentence-level machine translation (MT) metrics become less effective when the human reference contains ambiguities. To verify this hypothesis, we present a very simple method for extending pre-trained metrics to incorporate context at the document level. We apply our method to three popular metrics, BERTScore, Prism and COMET, and to the reference-free metric COMET-QE. We evaluate the extended metrics on the WMT 2021 metrics shared task using the provided MQM annotations. Our results show that the extended metrics outperform their sentence-level counterparts in about 85% of the test conditions when results on low-quality human references are excluded. In addition, we show that our document-level extension substantially improves accuracy on a discourse-phenomena task, outperforming a dedicated baseline by up to 6.1%. Our experimental results support our initial hypothesis and show that a simple extension of the metrics enables them to exploit context to resolve ambiguities in the reference.
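A minimal sketch of the general recipe described above, assuming the extension simply prepends the preceding sentences as context before scoring with a sentence-level metric; `sentence_metric` is a stand-in for BERTScore, Prism or COMET, and the window size and concatenation scheme are our assumptions rather than the paper's exact setup.

```python
# Hypothetical document-level extension: score each segment together with its
# preceding context on both the hypothesis and the reference side.

from typing import Callable, List

def doc_level_scores(
    hyps: List[str],
    refs: List[str],
    sentence_metric: Callable[[str, str], float],
    context_window: int = 2,  # assumed window size
) -> List[float]:
    scores = []
    for i in range(len(hyps)):
        start = max(0, i - context_window)
        hyp_with_ctx = " ".join(hyps[start:i] + [hyps[i]])
        ref_with_ctx = " ".join(refs[start:i] + [refs[i]])
        scores.append(sentence_metric(hyp_with_ctx, ref_with_ctx))
    return scores

# Toy usage with a dummy unigram-overlap metric standing in for a real one.
def dummy_metric(hyp: str, ref: str) -> float:
    h, r = set(hyp.split()), set(ref.split())
    return len(h & r) / max(len(r), 1)

hyps = ["He saw the bank.", "It was closed."]
refs = ["He saw the bank.", "The bank was closed."]
print(doc_level_scores(hyps, refs, dummy_metric))  # [1.0, 0.75]
```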
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
Standard automatic metrics such as BLEU are unreliable for document-level MT evaluation. They can neither distinguish document-level improvements in translation quality from sentence-level ones, nor identify the discourse phenomena that call for context-aware translation. This paper introduces a novel automatic metric, BlonDe, to widen the scope of automatic MT evaluation from the sentence level to the document level. BlonDe takes discourse coherence into consideration by categorising discourse-related spans and calculating a similarity-based F1 measure over the categorised spans. We conduct extensive comparisons on a newly constructed dataset, BWB. The experimental results show that BlonDe has better selectivity and interpretability at the document level and is more sensitive to document-level nuances. In a large-scale human study, BlonDe also achieves significantly higher Pearson correlation with human judgements than previous metrics.
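To give a feel for a similarity-based F1 over categorised spans, here is a toy sketch in which spans are simple (category, surface form) pairs and matching is exact; BlonDe's actual span extraction and similarity computation are considerably more involved, so this only illustrates the F1 bookkeeping.

```python
# Toy F1 over categorised spans (not BlonDe's actual algorithm).

from collections import Counter
from typing import List, Tuple

Span = Tuple[str, str]  # (category, surface form), e.g. ("pronoun", "she")

def span_f1(hyp_spans: List[Span], ref_spans: List[Span]) -> float:
    hyp_counts, ref_counts = Counter(hyp_spans), Counter(ref_spans)
    overlap = sum((hyp_counts & ref_counts).values())
    if not hyp_spans or not ref_spans or overlap == 0:
        return 0.0
    precision = overlap / len(hyp_spans)
    recall = overlap / len(ref_spans)
    return 2 * precision * recall / (precision + recall)

print(span_f1([("pronoun", "she"), ("tense", "went")],
              [("pronoun", "she"), ("tense", "goes")]))  # 0.5
```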
Word-level quality estimation (QE) for machine translation (MT) aims to find potential translation errors in a translated sentence without a reference. Conventional work on word-level QE typically predicts translation quality in terms of post-editing effort, where word labels ('OK' and 'BAD') are generated automatically by comparing the words in the MT sentence with those in its post-edited sentence using the Translation Error Rate (TER) toolkit. While post-editing effort can measure translation quality to some extent, we find that it often conflicts with human judgement of whether a word is translated well or poorly. To overcome this limitation, we first create a gold benchmark dataset, \emph{HJQE} (Human Judgement on Quality Estimation), in which expert translators directly annotate poorly translated words according to their own judgement. In addition, to further exploit the parallel corpus, we propose self-supervised pre-training with two tag-correcting strategies, namely the tag refinement strategy and the tree-based annotation strategy, to make the TER-based artificial QE corpus closer to \emph{HJQE}. We conduct substantial experiments on the publicly available WMT En-De and En-Zh corpora. The results not only show that our proposed dataset is more consistent with human judgement, but also confirm the effectiveness of the proposed tag-correcting strategies.
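The following toy example shows how 'OK'/'BAD' word tags can be derived by aligning an MT sentence against its post-edited version; the paper's tags come from the TER toolkit, whereas difflib is used here purely as a lightweight stand-in to illustrate the idea of marking unmatched MT words as 'BAD'.

```python
# Toy OK/BAD tagging by aligning MT output against its post-edited version.
# difflib is a stand-in for the TER alignment used in practice.

import difflib

def ok_bad_tags(mt_tokens, pe_tokens):
    tags = ["BAD"] * len(mt_tokens)
    matcher = difflib.SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            tags[i] = "OK"
    return tags

mt = "the house are red".split()
pe = "the house is red".split()
print(list(zip(mt, ok_bad_tags(mt, pe))))
# [('the', 'OK'), ('house', 'OK'), ('are', 'BAD'), ('red', 'OK')]
```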
With the recent advance in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, there remain several limitations concerning manual QE data creation: inevitably incurred non-trivial costs due to the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. This consists of three sub-QUAK datasets QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P, H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements by observing meaningful performance gains in QUAK-M, P when adding data up to 1.58M.
The rapid growth of machine translation (MT) systems has necessitated comprehensive studies to meta-evaluate evaluation metrics being used, which enables a better selection of metrics that best reflect MT quality. Unfortunately, most of the research focuses on high-resource languages, mainly English, the observations for which may not always apply to other languages. Indian languages, having over a billion speakers, are linguistically different from English, and to date, there has not been a systematic study of evaluating MT systems from English into Indian languages. In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics. Our results show that pre-trained metrics, such as COMET, have the highest correlations with annotator scores. Additionally, we find that the metrics do not adequately capture fluency-based errors in Indian languages, and there is a need to develop metrics focused on Indian languages. We hope that our dataset and analysis will help promote further research in this area.
Recent advances in AI and ML applications have benefited from rapid progress in NLP research. Leaderboards have emerged as a popular mechanism to track and accelerate progress in NLP through competitive model development. While this has increased interest and participation, the over-reliance on single, accuracy-based metrics has shifted focus away from other important metrics that might be equally pertinent to consider in real-world contexts. In this paper, we offer a preliminary discussion of the risks associated with focusing exclusively on accuracy metrics and draw on recent discussions to highlight prescriptive suggestions on how to develop more practical and effective leaderboards that can better reflect the real-world utility of models.
Human evaluation has always been expensive, while researchers struggle to trust automatic metrics. To address this, we propose to customise traditional metrics by taking advantage of pre-trained language models (PLMs) and a limited amount of human-labelled scores. We first re-introduce the hLEPOR metric factors, followed by the Python version we developed (ported), which achieves automatic tuning of the weighting parameters in the hLEPOR metric. We then present the customised hLEPOR (cushLEPOR), which uses the Optuna hyper-parameter optimisation framework to achieve better agreement with pre-trained language models (using LaBSE) for the exact MT language pairs on which cushLEPOR is deployed. We also optimise cushLEPOR towards professional human evaluation data, based on the MQM and pSQM frameworks, on English-German and Chinese-English language pairs. The experimental investigations show that cushLEPOR boosts hLEPOR towards better agreement with PLMs such as LaBSE at much lower cost, as well as better agreement with human evaluations, including MQM and pSQM scores, and that it performs better than BLEU (data available at \url{https://github.com/poethan/cushLEPOR}). Official results show that our submission wins three language pairs, including \textbf{English-German} and \textbf{Chinese-English} via cushLEPOR(LM), and \textbf{English-Russian} on the \textit{TED} domain via hLEPOR.
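A hedged sketch of the general tuning recipe: search metric weights with Optuna so that the metric's segment scores agree (here, maximal Pearson correlation) with a set of trusted scores such as LaBSE similarities or human MQM/pSQM scores. The toy metric, data, and search ranges below are placeholders, not the real hLEPOR formula or the cushLEPOR configuration.

```python
# Optuna-based weight tuning towards trusted scores (toy illustration).

import optuna
from scipy.stats import pearsonr

def toy_metric(hyp: str, ref: str, alpha: float, beta: float) -> float:
    """Placeholder for a tunable metric such as hLEPOR: a weighted mix of a
    length-ratio and a unigram-overlap component (not the real formula)."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    length_ratio = min(len(hyp_toks), len(ref_toks)) / max(len(hyp_toks), len(ref_toks))
    overlap = len(set(hyp_toks) & set(ref_toks)) / max(len(ref_toks), 1)
    return alpha * length_ratio + beta * overlap

# Hypothetical segments and trusted scores (e.g. LaBSE similarities or human scores).
hyps = ["the cat sat on the mat", "he go school", "red car"]
refs = ["the cat sat on the mat", "he goes to school", "a blue bicycle"]
trusted = [1.0, 0.7, 0.2]

def objective(trial: optuna.trial.Trial) -> float:
    alpha = trial.suggest_float("alpha", 0.0, 5.0)
    beta = trial.suggest_float("beta", 0.0, 5.0)
    preds = [toy_metric(h, r, alpha, beta) for h, r in zip(hyps, refs)]
    return pearsonr(preds, trusted)[0]  # maximise agreement with trusted scores

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```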
A lack of standard datasets and evaluation metrics has prevented the field of paraphrasing from making the kind of rapid progress enjoyed by the machine translation community over the last 15 years. We address both problems by presenting a novel data collection framework that produces highly parallel text data relatively inexpensively and on a large scale. The highly parallel nature of this data allows us to use simple n-gram comparisons to measure both the semantic adequacy and lexical dissimilarity of paraphrase candidates. In addition to being simple and efficient to compute, experiments show that these metrics correlate highly with human judgments.
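The sketch below illustrates the kind of n-gram comparison described above: a candidate paraphrase is scored for semantic adequacy via n-gram overlap with the other parallel references, and for lexical dissimilarity via low n-gram overlap with its own source sentence. The overlap measure here is a plain set-based ratio and the example sentences are invented, so this is only an illustration rather than the paper's exact metrics.

```python
# Toy n-gram adequacy / dissimilarity scoring for a paraphrase candidate.

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(a: str, b: str, n: int = 2) -> float:
    """Fraction of b's n-grams that also occur in a (0 if b has none)."""
    a_ngrams, b_ngrams = ngrams(a.split(), n), ngrams(b.split(), n)
    return len(a_ngrams & b_ngrams) / len(b_ngrams) if b_ngrams else 0.0

source = "the weather was terrible yesterday"
candidate = "yesterday the weather was awful"
references = ["it was awful weather yesterday", "yesterday's weather was dreadful"]

adequacy = max(overlap(ref, candidate) for ref in references)  # higher is better
dissimilarity = 1.0 - overlap(source, candidate)               # higher is better
print(adequacy, dissimilarity)
```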
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
The state-of-the-art language model-based automatic metrics, e.g. BARTScore, benefiting from large-scale contextualized pre-training, have been successfully used in a wide range of natural language generation (NLG) tasks, including machine translation, text summarization, and data-to-text. Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors (e.g. imperfections in fluency) can produce high-quality human judgments. This inspires us to approach the final goal of the evaluation metrics (human-like evaluations) by automatic error analysis. To this end, we augment BARTScore by incorporating the human-like error analysis strategies, namely BARTScore++, where the final score consists of both the evaluations of major errors and minor errors. Experimental results show that BARTScore++ can consistently improve the performance of vanilla BARTScore and outperform existing top-scoring metrics in 20 out of 25 test settings. We hope our technique can also be extended to other pre-trained model-based metrics. We will release our code and scripts to facilitate the community.