Given a document in a source language, cross-lingual summarization (CLS) aims to generate a concise summary in a different target language. Unlike in monolingual summarization (MS), naturally occurring source-language documents paired with target-language summaries are rare. To collect large-scale CLS samples, existing datasets typically involve translation in their creation. However, translated text differs from text originally written in that language, a phenomenon known as translationese. Although many efforts have been devoted to CLS, none of them has taken translationese into account. In this paper, we first confirm that different approaches to constructing CLS datasets lead to different degrees of translationese. We then design systematic experiments to investigate how translationese affects CLS model evaluation and performance when it appears in source documents or target summaries. In detail, we find that (1) translationese in the documents or summaries of test sets might lead to a discrepancy between human judgment and automatic evaluation; (2) translationese in training sets harms model performance in real-world applications; (3) although machine-translated documents involve translationese, they are very useful for building CLS systems for low-resource languages under specific training strategies. Furthermore, we give suggestions for future CLS research, including dataset and model development. We hope that our work will draw researchers' attention to translationese in CLS and encourage them to take it into account in the future.
Translated by Google Translate
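Cross-lingual summarization is the task of generating a summary in one language (e.g., English) for a given document in a different language (e.g., Chinese). Against the background of globalization, this task has attracted increasing attention from the computational linguistics community. Nevertheless, a comprehensive review of the task is still lacking. We therefore present the first systematic critical review of the datasets, approaches, and challenges in this field. Specifically, we carefully organize existing datasets and approaches according to their construction methods and solution paradigms, respectively. For each type of dataset or approach, we thoroughly introduce and summarize previous efforts and compare them with each other to provide a deeper analysis. Finally, we discuss promising directions and offer our thoughts to facilitate future research. This survey is intended for both beginners and experts in cross-lingual summarization, and we hope it will serve as a starting point as well as a source of new ideas for researchers and engineers interested in this area.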
Cross-Lingual Summarization (CLS) aims at generating summaries in one language for the given documents in another language. CLS has attracted wide research attention due to its practical significance in the multi-lingual world. Though great contributions have been made, existing CLS works typically focus on short documents, such as news articles, short dialogues and guides. Different from these short texts, long documents such as academic articles and business reports usually discuss complicated subjects and consist of thousands of words, making them non-trivial to process and summarize. To promote CLS research on long documents, we construct Perseus, the first long-document CLS dataset which collects about 94K Chinese scientific documents paired with English summaries. The average length of documents in Perseus is more than two thousand tokens. As a preliminary study on long-document CLS, we build and evaluate various CLS baselines, including pipeline and end-to-end methods. Experimental results on Perseus show the superiority of the end-to-end baseline, outperforming the strong pipeline models equipped with sophisticated machine translation systems. Furthermore, to provide a deeper understanding, we manually analyze the model outputs and discuss specific challenges faced by current approaches. We hope that our work could benchmark long-document CLS and benefit future studies.
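We introduce MTG, a new benchmark suite for training and evaluating multilingual text generation. It is the first multilingual multiway text generation dataset and has the largest amount of human-annotated data (400K). It covers four generation tasks (story generation, question generation, title generation, and text summarization) across five languages (English, German, French, Spanish, and Chinese). The multiway setup makes it possible to test a model's ability to transfer knowledge across languages and tasks. Using MTG, we train and analyze several popular multilingual generation models from different aspects. Our benchmark suite fosters improvements in model performance through more human-annotated parallel data, and it provides comprehensive evaluation across diverse generation scenarios. Code and data are available at https://github.com/zide05/mtg.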
With the recent advance in neural machine translation demonstrating its importance, research on quality estimation (QE) has been steadily progressing. QE aims to automatically predict the quality of machine translation (MT) output without reference sentences. Despite its high utility in the real world, there remain several limitations concerning manual QE data creation: inevitably incurred non-trivial costs due to the need for translation experts, and issues with data scaling and language expansion. To tackle these limitations, we present QUAK, a Korean-English synthetic QE dataset generated in a fully automatic manner. This consists of three sub-QUAK datasets QUAK-M, QUAK-P, and QUAK-H, produced through three strategies that are relatively free from language constraints. Since each strategy requires no human effort, which facilitates scalability, we scale our data up to 1.58M for QUAK-P, H and 6.58M for QUAK-M. As an experiment, we quantitatively analyze word-level QE results in various ways while performing statistical analysis. Moreover, we show that datasets scaled in an efficient way also contribute to performance improvements by observing meaningful performance gains in QUAK-M, P when adding data up to 1.58M.
Machine Translation Quality Estimation (QE) is the task of evaluating translation output in the absence of human-written references. Due to the scarcity of human-labeled QE data, previous works attempted to utilize the abundant unlabeled parallel corpora to produce additional training data with pseudo labels. In this paper, we demonstrate a significant gap between parallel data and real QE data: for QE data, it is strictly guaranteed that the source side is original texts and the target side is translated (namely translationese). However, for parallel data, it is indiscriminate and the translationese may occur on either source or target side. We compare the impact of parallel data with different translation directions in QE data augmentation, and find that using the source-original part of parallel corpus consistently outperforms its target-original counterpart. Moreover, since the WMT corpus lacks direction information for each parallel sentence, we train a classifier to distinguish source- and target-original bitext, and carry out an analysis of their difference in both style and domain. Together, these findings suggest using source-original parallel data for QE data augmentation, which brings a relative improvement of up to 4.0% and 6.4% compared to undifferentiated data on sentence- and word-level QE tasks respectively.
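Current state-of-the-art cross-lingual summarization models employ a multi-task learning paradigm, which operates on a shared vocabulary module and relies on the self-attention mechanism to attend among tokens in two languages. However, the correlation learned by self-attention is often loose and implicit, and it is inefficient at capturing crucial cross-lingual representations between languages. The problem worsens for languages with distinct morphological or structural features, which makes cross-lingual alignment more challenging and leads to a drop in performance. To overcome this problem, we propose a novel knowledge-distillation-based framework for cross-lingual summarization that seeks to explicitly construct cross-lingual correlation by distilling the knowledge of a monolingual summarization teacher into a cross-lingual summarization student. Since the representations of the teacher and the student lie in two different vector spaces, we further propose a knowledge distillation loss using the Sinkhorn divergence, an optimal-transport distance, to estimate the discrepancy between the teacher and student representations. Owing to the intuitive geometric nature of the Sinkhorn divergence, the student model can efficiently learn to align the cross-lingual hidden states it produces with the monolingual hidden states, which leads to a strong correlation between distant languages. Experiments on cross-lingual summarization datasets with pairs of distant languages show that our method outperforms state-of-the-art models in both high-resource and low-resource settings.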
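Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, which potentially limits their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few-shot and zero-shot learning capabilities on a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (+7.4% absolute accuracy improvement in the 0-shot setting and +9.4% in the 4-shot setting) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model, given 32 training examples, outperforms GPT-3 in most of the 182 translation directions, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement in surface-form robustness and in adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models on hate speech detection in five languages and find that they have limitations similar to those of GPT-3 models of comparable size.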
While the NLP community is generally aware of resource disparities among languages, we lack research that quantifies the extent and types of such disparity. Prior surveys estimating the availability of resources based on the number of datasets can be misleading as dataset quality varies: many datasets are automatically induced or translated from English data. To provide a more comprehensive picture of language resources, we examine the characteristics of 156 publicly available NLP datasets. We manually annotate how they are created, including input text and label sources and tools used to build them, and what they study, tasks they address and motivations for their creation. After quantifying the qualitative NLP resource gap across languages, we discuss how to improve data collection in low-resource languages. We survey language-proficient NLP researchers and crowd workers per language, finding that their estimated availability correlates with dataset availability. Through crowdsourcing experiments, we identify strategies for collecting high-quality multilingual data on the Mechanical Turk platform. We conclude by making macro and micro-level suggestions to the NLP community and individual researchers for future multilingual data development.
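We present CrossSum, a large-scale dataset comprising 1.65 million cross-lingual article-summary samples in more than 1,500 language pairs covering 45 languages. We use the multilingual XL-Sum dataset and align identical articles written in different languages via cross-lingual retrieval with a language-agnostic representation model. We propose a multi-stage data sampling algorithm, fine-tune mT5, a multilingual pretrained model, with explicit cross-lingual supervision from CrossSum, and introduce a new metric for evaluating cross-lingual summarization. Results on established metrics and on our proposed metric indicate that models fine-tuned on CrossSum outperform summarization+translation baselines, even when the source and target languages are linguistically distant. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and also the first that does not rely on English as the pivot language. We are releasing the dataset, the alignment and training scripts, and the models to spur future research on cross-lingual abstractive summarization. The resources can be found at https://github.com/csebuetnlp/crosssum.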
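As a rough illustration of the retrieval-based alignment idea mentioned in this abstract, the sketch below pairs each source-language article with its most similar target-language article by cosine similarity of embeddings. It is a minimal sketch under stated assumptions, not the CrossSum pipeline: the embedding vectors are hand-written toy values standing in for the output of a language-agnostic encoder, and the `align` function, the article IDs, and the `threshold` value of 0.8 are all hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(
    src: Dict[str, Vector],
    tgt: Dict[str, Vector],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> List[Tuple[str, str, float]]:
    """For each source article, keep its nearest target article if similar enough."""
    pairs: List[Tuple[str, str, float]] = []
    for src_id, src_vec in src.items():
        best_id: Optional[str] = None
        best_sim = -1.0
        for tgt_id, tgt_vec in tgt.items():
            sim = cosine(src_vec, tgt_vec)
            if sim > best_sim:
                best_id, best_sim = tgt_id, sim
        if best_id is not None and best_sim >= threshold:
            pairs.append((src_id, best_id, best_sim))
    return pairs


# Toy embeddings; a real system would obtain these from a multilingual encoder.
english = {"en-1": [1.0, 0.0, 0.1], "en-2": [0.0, 1.0, 0.0]}
bengali = {"bn-1": [0.9, 0.1, 0.1], "bn-2": [0.5, 0.5, 0.7]}

for src_id, tgt_id, sim in align(english, bengali):
    print(f"{src_id} <-> {tgt_id} (cosine = {sim:.3f})")
```

With these toy vectors only `en-1` finds a match above the threshold; `en-2` is left unaligned, which shows how a similarity cut-off trades coverage for precision.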
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
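Recent advances in abstractive summarization leverage pre-trained language models rather than training a model from scratch. However, such models are slow to train and come with substantial overhead. Researchers have proposed lightweight alternatives, such as smaller adapters, to mitigate these drawbacks. Nonetheless, it remains uncertain whether using adapters benefits the task of summarization, that is, whether they improve efficiency without an unwelcome sacrifice in performance. In this work, we carry out a multifaceted investigation of summarization tasks of varying complexity: language, domain, and task transfer. In our experiments, fine-tuning a pre-trained language model generally performs better than using adapters, and the performance gap correlates positively with the amount of training data used. Notably, adapters surpass fine-tuning under extremely low-resource conditions. We further provide insights on multilinguality, model convergence, and robustness, hoping to shed light on the practical choice between fine-tuning and adapters in abstractive summarization.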
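Recently proposed BERT-based evaluation metrics perform well on standard evaluation benchmarks but are vulnerable to adversarial attacks, for example those relating to factuality errors. We argue that this is (in part) because they are models of semantic similarity. In contrast, we develop evaluation metrics based on natural language inference (NLI), which we consider a more appropriate modeling choice. We design a preference-based adversarial attack framework and show that our NLI-based metrics are more robust to the attacks than recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics but perform below SOTA MT metrics. However, when we combine existing metrics with our NLI metrics, we obtain both higher adversarial robustness (+20% to +30%) and higher-quality metrics as measured on standard benchmarks (+5% to +25%).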
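Evaluation metrics are a key ingredient of text generation systems. In recent years, several BERT-based evaluation metrics have been proposed (including BERTScore, MoverScore, BLEURT, etc.) that correlate much better with human assessment of text generation quality than BLEU or ROUGE, which were invented decades ago. However, little is known about what these metrics, which are based on black-box language model representations, actually capture (it is typically assumed that they model semantic similarity). In this work, we use a simple regression-based global explainability technique to disentangle metric scores along linguistic factors, including semantics, syntax, morphology, and lexical overlap. We show that the different metrics capture all of these aspects to some degree, but that they are all substantially sensitive to lexical overlap, just like BLEU and ROUGE. This exposes limitations of these newly proposed metrics, which we also highlight in an adversarial test scenario.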
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, the decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and we present an extensive analysis of which factors contribute the most to effective pre-training.
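Human-translated text displays features that are distinct from those of naturally written text in the same language. This phenomenon, known as translationese, has been argued to confound machine translation (MT) evaluation. However, we find that existing work on translationese neglects some important factors, and that its conclusions are mostly correlational rather than causal. In this work, we collect CausalMT, a dataset in which the MT training data are also labeled with the human translation direction. We examine two critical factors: the train-test direction match (whether the human translation directions in the training and test sets are aligned) and the data-model direction match (whether the model learns in the same direction as the human translation direction in the dataset). We show that these two factors have a large causal effect on MT performance, in addition to the test-model direction mismatch highlighted by existing work on the impact of translationese. In light of our findings, we provide a set of suggestions for MT training and evaluation. Our code and data are at https://github.com/edisonni-hku/causalmt.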
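Sports game summarization aims to generate sports news from real-time commentaries. The task has attracted wide research attention but remains under-explored, probably because of the lack of corresponding English datasets. Therefore, in this paper, we release GOAL, the first English sports game summarization dataset. Specifically, GOAL contains 103 commentary-news pairs, and the average lengths of the commentaries and news articles are 2724.9 and 476.3 words, respectively. Moreover, to support research in the semi-supervised setting, GOAL additionally provides 2,160 unlabeled commentary documents. Based on GOAL, we build and evaluate several baselines, including extractive and abstractive ones. Experimental results show that the task remains challenging. We hope our work will promote research on sports game summarization. The dataset has been released at https://github.com/krystalan/goal.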
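Traditionally, text simplification is treated as a monolingual translation task in which sentences in source texts are aligned with their simplified counterparts. However, especially for longer input documents, summarizing the text (or dropping less relevant content altogether) plays an important role in the simplification process, which is currently not reflected in existing datasets. At the same time, resources for non-English languages are generally scarce, which is prohibitive for training new solutions. To tackle this problem, we pose core requirements for a system that can jointly summarize and simplify long source documents. We further describe the creation of a new dataset for joint simplification and summarization, based on German Wikipedia and the German children's lexicon "Klexikon" and consisting of almost 2,900 documents. We release a document-aligned version that particularly highlights the summarization aspect, and we provide statistical evidence that this resource is also well suited to simplification. Code and data are available on GitHub: https://github.com/dennlinger/klexikon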
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
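The mark-then-translate idea described in this abstract can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the EasyProject implementation: the square-bracket markers, the helper names `mark_spans` and `project_spans`, and the `toy_translate` word-lookup function (a stand-in for a real MT system) are all hypothetical, and the sketch assumes the translator passes the markers through unchanged and that the text contains no other square brackets.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def mark_spans(text: str, spans: List[Span]) -> str:
    """Wrap each non-overlapping labeled span in square-bracket markers."""
    pieces, prev = [], 0
    for start, end in sorted(spans):
        pieces.append(text[prev:start])
        pieces.append("[" + text[start:end] + "]")
        prev = end
    pieces.append(text[prev:])
    return "".join(pieces)


MARKED = re.compile(r"\[(.*?)\]")


def project_spans(translated: str) -> Tuple[str, List[Span]]:
    """Strip the markers and return the clean text plus the projected spans."""
    pieces: List[str] = []
    spans: List[Span] = []
    prev = 0      # position in the marked text
    length = 0    # length of the clean text built so far
    for match in MARKED.finditer(translated):
        before = translated[prev:match.start()]
        inner = match.group(1)
        pieces.extend([before, inner])
        length += len(before)
        spans.append((length, length + len(inner)))
        length += len(inner)
        prev = match.end()
    pieces.append(translated[prev:])
    return "".join(pieces), spans


def toy_translate(text: str) -> str:
    """Hypothetical stand-in for an MT system: a tiny English-German word lookup."""
    table = {"is": "liegt", "France": "Frankreich"}
    return re.sub(r"\w+", lambda m: table.get(m.group(0), m.group(0)), text)


source = "Paris is in France"
labels = [(0, 5), (12, 18)]  # "Paris", "France"

marked = mark_spans(source, labels)
clean, projected = project_spans(toy_translate(marked))
print(marked)
print(clean, [clean[s:e] for s, e in projected])
```

Because translation and projection happen in one pass, no separate word-alignment step is needed; the cost is that the approach depends on the MT system preserving the markers, which a real system may not always do.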
Despite the current success of multilingual pre-training, most prior works focus on leveraging monolingual data or bilingual parallel data and overlook the value of trilingual parallel data. This paper presents Triangular Document-level Pre-training (TRIP), which is the first in the field to extend conventional monolingual and bilingual pre-training to a trilingual setting by (i) grafting the same documents in two languages into one mixed document, and (ii) predicting the remaining language as the reference translation. Our experiments on document-level MT and cross-lingual abstractive summarization show that TRIP brings gains of up to 3.65 d-BLEU points and 6.2 ROUGE-L points on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including multiple strong state-of-the-art (SOTA) scores. In-depth analysis indicates that TRIP improves document-level machine translation and captures better document contexts in at least three respects: (i) tense consistency, (ii) noun consistency, and (iii) conjunction presence.