AMR parsing has experienced an unprecedented increase in performance in the last three years, due to a mixture of effects including architecture improvements and transfer learning. Self-learning techniques have also played a role in pushing performance forward. However, for most recent high-performing parsers, the effect of self-learning and silver data generation seems to be fading. In this paper we show that it is possible to overcome this diminishing return of silver data by combining Smatch-based ensembling techniques with ensemble distillation. In an extensive experimental setup, we push single-model English parser performance above 85 Smatch for the first time and return to substantial gains. We also achieve a new state-of-the-art for cross-lingual AMR parsing for Chinese, German, Italian and Spanish. Finally, we explore the impact of the proposed distillation technique on domain adaptation, and show that it can produce gains competitive with human-annotated data on QALD-9 and achieve a new state-of-the-art for BioAMR.
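The Smatch-based ensembling step lends itself to a short sketch: given candidate graphs from several parsers (or checkpoints), keep the one with the highest average Smatch against all the others, then distill a single model on the selected silver graphs. A minimal sketch in Python, where `smatch_f1` is a hypothetical stand-in for a real Smatch scorer such as the `smatch` package, not the paper's code:

```python
def smatch_f1(amr_a: str, amr_b: str) -> float:
    """Hypothetical stand-in for a real Smatch scorer (e.g. the `smatch` package)."""
    raise NotImplementedError

def mbr_select(candidates: list[str]) -> str:
    """Pick the candidate AMR graph that agrees most, on average, with the
    other ensemble members -- a maximum-Bayes-risk style consensus choice."""
    def avg_agreement(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(smatch_f1(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=avg_agreement)
    return candidates[best]

# The selected graphs can then be used as silver data to distill a single parser.
```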
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
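The core mark-then-translate trick is compact enough to sketch. A minimal version, assuming a generic `translate` function (any MT system that tends to keep markers intact) and plain square-bracket markers; the paper experiments with several marker formats, so this one is illustrative:

```python
import re

def translate(text: str, tgt_lang: str) -> str:
    """Stand-in for any MT system that tends to keep special markers intact."""
    raise NotImplementedError

def project_spans(tokens: list[str], spans: list[tuple], tgt_lang: str) -> list[tuple]:
    """Mark-then-translate: wrap each labeled span in brackets, translate the
    marked sentence, then read the surviving spans back out of the output.
    `spans` holds (start, end, label) triples with `end` exclusive."""
    marked = list(tokens)
    for start, end, label in sorted(spans, key=lambda s: -s[0]):
        marked.insert(end, "]")        # right-to-left so earlier offsets stay valid
        marked.insert(start, "[")
    translation = translate(" ".join(marked), tgt_lang)
    # Assumes markers survive translation in source order; dropped or
    # reordered markers are exactly the failure modes the paper analyzes.
    projected = re.findall(r"\[(.*?)\]", translation)
    return [(span_text, label)
            for span_text, (_, _, label) in zip(projected, sorted(spans))]
```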
Translation quality estimation (QE) is the task of predicting the quality of machine translation (MT) output without any reference. As an important component in practical applications of MT, this task has attracted increasing attention. In this paper, we first propose XLMRScore, a simple unsupervised QE method based on BERTScore computed with the XLM-RoBERTa (XLMR) model, while discussing the issues that arise when using this method. Next, we suggest two approaches to mitigate the issues: replacing untranslated words with the unknown token, and cross-lingual alignment of the pre-trained model so that aligned words are represented closer to each other. We evaluate the proposed method on the four low-resource language pairs of the WMT21 QE shared task, as well as a new English-Farsi test dataset introduced in this paper. Experiments show that our method achieves results comparable to the supervised baseline in two zero-shot scenarios, i.e., with less than 0.01 difference in Pearson correlation, while outperforming the unsupervised rivals by more than 8% on average over all the low-resource language pairs.
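The core of the proposed metric is BERTScore computed with a multilingual encoder, so source and hypothesis tokens can be compared directly across languages without a reference. A minimal sketch assuming the Hugging Face transformers library; the real XLMRScore additionally handles untranslated words and applies cross-lingual alignment, which this sketch omits:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed setup: any XLM-R checkpoint works as the embedding backbone.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

@torch.no_grad()
def embed(sentence: str) -> torch.Tensor:
    batch = tok(sentence, return_tensors="pt")
    hidden = model(**batch).last_hidden_state[0]      # (tokens, dim)
    return torch.nn.functional.normalize(hidden, dim=-1)

@torch.no_grad()
def xlmr_score(source: str, translation: str) -> float:
    """BERTScore-style F1 between source and MT tokens, computed with
    XLM-R embeddings -- no reference translation needed."""
    src, hyp = embed(source), embed(translation)
    sim = src @ hyp.T                                 # pairwise cosine similarities
    recall = sim.max(dim=1).values.mean()             # each source token's best match
    precision = sim.max(dim=0).values.mean()          # each MT token's best match
    return (2 * precision * recall / (precision + recall)).item()
```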
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in cross-lingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. For 9 out of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization). The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated testsets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing testsets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.
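The training-data construction follows the classic alignment-based projection recipe, which is easy to sketch: tag the English side, word-align the sentence pair, and copy each entity label onto the aligned target tokens. A minimal sketch, with `align` as a hypothetical stand-in for an off-the-shelf word aligner (e.g. fast_align or awesome-align); the paper's exact projection heuristics may differ:

```python
def align(src_tokens: list[str], tgt_tokens: list[str]) -> list[tuple]:
    """Stand-in for a word aligner; returns (src_index, tgt_index) pairs."""
    raise NotImplementedError

def bio_spans(tags: list[str]) -> list[tuple]:
    """Convert BIO tags to (start, end, label) spans, end exclusive."""
    spans, start = [], None
    for i, tag in enumerate(tags + ["O"]):          # sentinel closes the last span
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, label))
            start = None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans

def project_entities(src_tokens, src_tags, tgt_tokens):
    """Project BIO entity tags from a tagged English sentence onto its
    translation via word alignments."""
    links = align(src_tokens, tgt_tokens)
    tgt_tags = ["O"] * len(tgt_tokens)
    for ent_start, ent_end, label in bio_spans(src_tags):
        # All target positions aligned to any token inside the entity span.
        hits = sorted(t for s, t in links if ent_start <= s < ent_end)
        if hits:
            # Label the contiguous block from first to last aligned token.
            tgt_tags[hits[0]] = "B-" + label
            for t in range(hits[0] + 1, hits[-1] + 1):
                tgt_tags[t] = "I-" + label
    return tgt_tags
```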
This paper demonstrates that multilingual denoising pre-training produces significant performance gains across a wide variety of machine translation (MT) tasks. We present mBART, a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. mBART is the first method for pre-training a complete sequence-to-sequence model by denoising full texts in multiple languages, while previous approaches have focused only on the encoder, decoder, or reconstructing parts of the text. Pre-training a complete model allows it to be directly fine-tuned for supervised (both sentence-level and document-level) and unsupervised machine translation, with no task-specific modifications. We demonstrate that adding mBART initialization produces performance gains in all but the highest-resource settings, including up to 12 BLEU points for low-resource MT and over 5 BLEU points for many document-level and unsupervised models. We also show that it enables new types of transfer to language pairs with no bi-text or that were not in the pre-training corpus, and present extensive analysis of which factors contribute the most to effective pre-training.
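The denoising objective itself is simple to sketch: corrupt each document with span masking and sentence permutation, and train the seq2seq model to reconstruct the original. A minimal sketch of the noising function, with hyperparameters (35% masking, Poisson span lengths with lambda 3.5) taken from the mBART setup; the mask token string is illustrative:

```python
import random
import numpy as np

MASK = "<mask>"

def bart_noise(sentences: list[str], mask_ratio=0.35, poisson_lambda=3.5) -> str:
    """BART-style noising: permute the sentence order, then replace
    contiguous token spans with a single mask token (text infilling)."""
    shuffled = random.sample(sentences, k=len(sentences))   # sentence permutation
    tokens = " ".join(shuffled).split()
    budget = int(len(tokens) * mask_ratio)                  # tokens to mask overall
    while budget > 0 and len(tokens) > 1:
        span = max(1, min(budget, int(np.random.poisson(poisson_lambda))))
        start = random.randrange(0, max(1, len(tokens) - span))
        tokens[start:start + span] = [MASK]                 # whole span -> one mask
        budget -= span
    return " ".join(tokens)

# A training pair is (bart_noise(doc_sentences), original document): the
# decoder learns to reconstruct the full text from the corrupted input.
```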
We describe the JD Explore Academy submission to the general translation shared task of WMT 2022. We participated in all the high-resource tracks and one medium-resource track, namely Chinese-English, German-English, Czech-English, Russian-English and Japanese-English. We push the limits of previous work on bidirectional training for translation by scaling up two main factors, the language pairs and the model size, yielding the Vega-MT system. As for language pairs, we extend the "bidirectional" setting to a "multidirectional" one covering all participating languages, to exploit common knowledge across languages and transfer it to downstream bilingual tasks. As for model size, we scale the Transformer up to an extremely large model with nearly 4.7 billion parameters, to fully enhance the capacity of Vega-MT. In addition, we adopt data augmentation strategies, namely cycle translation for monolingual data and bidirectional self-training for bilingual and monolingual data, to comprehensively exploit both kinds of data. To adapt Vega-MT to the general-domain test set, a generalization tuning stage is designed. Based on the official automatic scores of constrained systems, in terms of the sacreBLEU shown in Figure 1, we obtain first place on {Zh-En (33.5), En-Zh (49.7), De-En (33.7), En-De (37.8), Cs-En (54.9), En-Cs (41.4) and En-Ru (32.7)}, second place on {Ru-En (45.1) and Ja-En (25.6)}, and third place on {En-Ja (41.5)}; in terms of COMET, we obtain first place on {Zh-En (45.1), En-Zh (61.7), De-En (58.0), En-De (63.2), Cs-En (74.7), Ru-En (64.9), En-Ru (69.6) and En-Ja (65.1)}, and second place on {En-Cs (95.3) and Ja-En (40.6)}. The models will be released to facilitate the MT community through GitHub and the OmniForce platform.
Multilingual language models (MLLMs), such as mBERT, XLM, XLM-R, etc., have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, there has been a surge of work in (i) building larger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero-shot cross-lingual and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.
We introduce MTG, a new benchmark suite for training and evaluating multilingual text generation. It is the first-proposed multilingual multiway text generation dataset with the largest amount of human-annotated data (400k). It includes four generation tasks (story generation, question generation, title generation and text summarization) across five languages (English, German, French, Spanish and Chinese). The multiway setup enables testing a model's knowledge-transfer capabilities across languages and tasks. Using MTG, we train and analyze several popular multilingual generation models from different aspects. Our benchmark suite fosters model performance enhancement with more human-annotated parallel data and provides comprehensive evaluations across diverse generation scenarios. Code and data are available at https://github.com/zide05/mtg.
Cross-lingual transfer (CLT) has a wide range of applications. However, labeled cross-lingual corpora are expensive or even inaccessible, especially in fields where labels are private, such as diagnostic results of symptoms in medicine and user profiles in business. Nevertheless, off-the-shelf models exist in these sensitive fields. Instead of pursuing the original labels, a workaround for CLT is to transfer knowledge from the off-the-shelf models without labels. To this end, we define a novel CLT problem named FreeTransfer-X, which aims to achieve knowledge transfer from off-the-shelf models in rich-resource languages. To address the problem, we propose a two-step knowledge distillation (KD; Hinton et al., 2015) framework based on multilingual pre-trained language models (mPLM). The significant improvement over strong neural machine translation (NMT) baselines demonstrates the effectiveness of the proposed method. In addition to reducing annotation cost and protecting private labels, the proposed method is compatible with different networks and easy to deploy. Finally, a range of analyses indicate the great potential of the proposed method.
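The backbone of the framework is plain response-based distillation, which is worth a short sketch. A minimal version of the Hinton-style soft-label loss, assuming PyTorch; how the two steps chain teachers and students is noted in the trailing comments:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """Soft-label distillation loss (Hinton et al., 2015): the student matches
    the teacher's tempered output distribution, so no gold labels are needed."""
    t = temperature
    soft_targets = F.softmax(teacher_logits / t, dim=-1)
    log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (t * t)

# Step 1: distill the off-the-shelf source-language model into an mPLM on
# unlabeled source-language text.  Step 2: run the mPLM on unlabeled
# target-language text and distill its predictions into the final model.
```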
Bilingual terminologies are important machine translation resources in the e-commerce domain, and are usually either translated manually or extracted automatically from parallel data. Human translation is costly, and e-commerce parallel corpora are very scarce. However, comparable data in different languages within the same commodity domain are abundant. In this paper, we propose a novel framework for extracting e-commerce bilingual terminologies from comparable data. Benefiting from cross-lingual pre-training in e-commerce, our framework can make full use of the deep semantic relationship between source-side terminologies and target-side sentences to extract the corresponding target terminologies. Experimental results on various language pairs show that our approach achieves significantly better performance than various strong baselines.
In the absence of readily available labeled data for a given task and language, annotation projection has been proposed as one of the possible strategies to automatically generate annotated data which may then be used to train supervised systems. Annotation projection has often been formulated as the task of projecting, on parallel corpora, some labels from a source into a target language. In this paper we present T-Projection, a new approach for annotation projection that leverages large pretrained text2text language models and state-of-the-art machine translation technology. T-Projection decomposes the label projection task into two subtasks: (i) the candidate generation step, in which a set of projection candidates is generated using a multilingual T5 model, and (ii) the candidate selection step, in which the candidates are ranked based on translation probabilities. We evaluate our method on three downstream tasks and five different languages. Our results show that T-Projection improves the average F1 score of previous methods by more than 8 points.
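The two subtasks decompose naturally into a short sketch. A minimal version assuming the Hugging Face transformers library and a generic mT5 checkpoint; the prompt template and the NMT scoring helper are assumptions for illustration, not the paper's exact implementation:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/mt5-large")
mt5 = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-large")

def generate_candidates(prompt: str, n: int = 5) -> list[str]:
    """Step (i): let the text2text model propose n target-language span
    candidates via beam search."""
    batch = tok(prompt, return_tensors="pt")
    outputs = mt5.generate(**batch, num_beams=n, num_return_sequences=n)
    return [tok.decode(o, skip_special_tokens=True) for o in outputs]

def translation_logprob(source_span: str, candidate: str) -> float:
    """Hypothetical stand-in for an NMT model's log-probability of
    `candidate` as a translation of `source_span`."""
    raise NotImplementedError

def select_projection(source_span: str, candidates: list[str]) -> str:
    """Step (ii): rank the candidates by translation probability, keep the best."""
    return max(candidates, key=lambda c: translation_logprob(source_span, c))
```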
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance of word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection, to improving translation.
Cross-lingual summarization is the task of generating a summary in one language (e.g., English) for a given document in a different language (e.g., Chinese). Under the background of globalization, this task has attracted increasing attention from the computational linguistics community. Nevertheless, there still lacks a comprehensive review of this task. Therefore, we present the first systematic critical review of the datasets, approaches and challenges in this field. Specifically, we carefully organize existing datasets and approaches according to different construction methods and solution paradigms, respectively. For each type of dataset or approach, we thoroughly introduce and summarize previous efforts, and compare them with each other to provide deeper analyses. In the end, we also discuss promising directions and offer our thoughts to facilitate future research. This survey is intended for both beginners and experts in cross-lingual summarization, and we hope it will serve as a starting point and provide new ideas for researchers and engineers interested in this area.
Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training. Target-side monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for NMT. In contrast to previous work, which combines NMT models with separately trained language models, we note that encoder-decoder NMT architectures already have the capacity to learn the same information as a language model, and we explore strategies to train with monolingual data without changing the neural network architecture. By pairing monolingual training data with an automatic back-translation, we can treat it as additional parallel training data, and we obtain substantial improvements on the WMT 15 task English↔German (+2.8-3.7 BLEU), and for the low-resourced IWSLT 14 task Turkish→English (+2.1-3.4 BLEU), obtaining new state-of-the-art results. We also show that fine-tuning on in-domain monolingual and parallel data gives substantial improvements for the IWSLT 15 task English→German.
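The back-translation recipe is compact: translate target-side monolingual text back into the source language with a reverse model, pair each synthetic source with its genuine target sentence, and train on the mix. A minimal sketch, with `reverse_model.translate` as a stand-in for any target-to-source MT system:

```python
def back_translate(target_sentences: list[str], reverse_model) -> list[tuple]:
    """Build synthetic parallel data from target-side monolingual text:
    a target->source model produces the synthetic source, and the genuine
    target sentence serves as the reference."""
    synthetic_sources = [reverse_model.translate(t) for t in target_sentences]
    return list(zip(synthetic_sources, target_sentences))

# The synthetic pairs are simply mixed with the real parallel data, and the
# forward (source->target) NMT model is trained on the union, unchanged.
```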
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
For multilingual sequence-to-sequence pretrained language models (multilingual Seq2Seq PLMs) such as mBART, the self-supervised pretraining task is trained on a wide range of monolingual languages, e.g., 25 languages from CommonCrawl, while the downstream cross-lingual tasks generally progress on a bilingual language subset, e.g., English-German. This creates a data discrepancy, namely the domain discrepancy, and a cross-lingual learning-objective discrepancy, namely the task discrepancy, between the pretraining and finetuning stages. To bridge the above cross-lingual domain and task gaps, we extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task. Specifically, the first stage employs the self-supervised code-switching restore task as a pretext task, allowing the multilingual Seq2Seq PLM to acquire some in-domain alignment information; in the second stage, we fine-tune the model on downstream data normally. Experiments on both NLG evaluation (12 bilingual translation tasks, 30 zero-shot translation tasks, and 2 cross-lingual summarization tasks) and NLU evaluation (7 cross-lingual natural language inference tasks) show that our model consistently outperforms the strong baseline mBART with standard finetuning. Analyses indicate that our approach can narrow the Euclidean distance between cross-lingual sentence representations and improve model generalization at trivial computational cost. We release the code at https://github.com/zanchangtong/csr4mbart.
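The pretext task reduces to a simple data transformation, sketched below under the assumption of a bilingual lexicon mapping each word to candidate translations; the paper's corruption scheme may differ in details such as the replacement ratio:

```python
import random

def make_code_switch_example(tokens: list[str], lexicon: dict, ratio: float = 0.3):
    """Build one training pair for the code-switching restore pretext task:
    the input replaces a fraction of the words with translations drawn from
    a bilingual lexicon; the target is the original sentence."""
    corrupted = [
        random.choice(lexicon[w]) if w in lexicon and random.random() < ratio else w
        for w in tokens
    ]
    return " ".join(corrupted), " ".join(tokens)   # (model input, restore target)
```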
We propose a transition-based system to transpile Abstract Meaning Representation (AMR) into SPARQL for Knowledge Base Question Answering (KBQA). This allows delegating part of the abstraction of the question to a strongly pre-trained semantic parser, while learning to transpile with small amounts of paired data. We build on recent work relating AMR and SPARQL constructs, but rather than applying a set of rules, we teach a BART model to selectively use these relations. Further, following recent semantic parsing works, we avoid explicitly encoding AMR and instead encode the parser state in the attention mechanism of BART. The resulting model is simple, provides supporting text for its decisions, and outperforms recent progress in AMR-based KBQA on LC-QuAD (F1 53.4) while matching it on QALD-9 (F1 30.8), exploiting the same inductive biases.
Previous work mainly focuses on improving cross-lingual transfer for NLU tasks with a multilingual pretrained encoder (MPE), or on improving the performance of supervised machine translation with BERT. However, it is under-explored whether an MPE can help facilitate the cross-lingual transferability of NMT models. In this paper, we focus on a zero-shot cross-lingual transfer task in NMT: the NMT model is trained with a parallel dataset of only one language pair and an off-the-shelf MPE, and is then directly tested on zero-shot language pairs. We propose SixT, a simple yet effective model for this task. SixT leverages the MPE with a two-stage training schedule and gains further improvement from a position-disentangled encoder and a capacity-enhanced decoder. With this method, SixT significantly outperforms mBART, a pretrained multilingual encoder-decoder model explicitly designed for NMT, with an average improvement of 7.1 BLEU on zero-shot any-to-English test sets across 14 source languages. Moreover, with much less training computation cost and training data, our model achieves better performance on 15 any-to-English test sets than CRISS and m2m-100, two strong multilingual NMT baselines.
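The two-stage schedule at the heart of SixT is easy to sketch. A minimal version in PyTorch, assuming a model object that exposes `encoder` and `decoder` submodules and some training loop `train_fn`; both are stand-ins, and the position-disentangled encoder and enlarged decoder are separate architectural changes not shown here:

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def two_stage_train(model, train_fn, parallel_data) -> None:
    """Two-stage schedule: first train the decoder against a frozen
    multilingual pretrained encoder, then unfreeze and train end to end."""
    set_trainable(model.encoder, False)   # preserve cross-lingual representations
    set_trainable(model.decoder, True)
    train_fn(model, parallel_data)        # stage 1: decoder only
    set_trainable(model.encoder, True)
    train_fn(model, parallel_data)        # stage 2: everything
```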