跨语性转移(CLT)是各种应用。但是,标记的跨语言语料库是昂贵甚至无法访问的,尤其是在标签是私人的领域,例如医学症状和业务中用户概况的诊断结果。然而,这些敏感领域有现成的模型。 CLT的解决方法不是追求原始标签,而是从没有标签的现成模型中转移知识。为此,我们定义了一个名为Freetransfer-X的新颖的CLT问题,旨在实现知识转移,以丰富的资源语言的现成模型转移。为了解决这个问题,我们提出了基于多语言预训练的语言模型(MPLM)的两步知识蒸馏(KD,Hinton等,2015)框架。对强神经转换(NMT)基线的显着改善证明了该方法的有效性。除了降低注释成本和保护专用标签外,该建议的方法还与不同的网络兼容,并且易于部署。最后,一系列分析表明该方法的巨大潜力。
translated by 谷歌翻译
Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
translated by 谷歌翻译
跨语性摘要是用一种语言(例如英语)以不同语言(例如中文)生成一种语言(例如英语)的摘要。在全球化背景下,这项任务吸引了计算语言学界的越来越多的关注。然而,对于这项任务仍然缺乏全面的审查。因此,我们在该领域的数据集,方法和挑战上介绍了第一个系统的批判性审查。具体而言,我们分别根据不同的构造方法和解决方案范例仔细组织现有的数据集和方法。对于每种类型的数据集或方法,我们彻底介绍并总结了以前的努力,并将它们相互比较以提供更深入的分析。最后,我们还讨论了有希望的方向,并提供了我们的思想,以促进未来的研究。这项调查适用于跨语性摘要的初学者和专家,我们希望它将成为起点,也可以为对该领域感兴趣的研究人员和工程师提供新的想法。
translated by 谷歌翻译
Zero-shot cross-lingual named entity recognition (NER) aims at transferring knowledge from annotated and rich-resource data in source languages to unlabeled and lean-resource data in target languages. Existing mainstream methods based on the teacher-student distillation framework ignore the rich and complementary information lying in the intermediate layers of pre-trained language models, and domain-invariant information is easily lost during transfer. In this study, a mixture of short-channel distillers (MSD) method is proposed to fully interact the rich hierarchical information in the teacher model and to transfer knowledge to the student model sufficiently and efficiently. Concretely, a multi-channel distillation framework is designed for sufficient information transfer by aggregating multiple distillers as a mixture. Besides, an unsupervised method adopting parallel domain adaptation is proposed to shorten the channels between the teacher and student models to preserve domain-invariant features. Experiments on four datasets across nine languages demonstrate that the proposed method achieves new state-of-the-art performance on zero-shot cross-lingual NER and shows great generalization and compatibility across languages and fields.
translated by 谷歌翻译
先前的研究证明,跨语性知识蒸馏可以显着提高预训练模型的跨语义相似性匹配任务的性能。但是,在此操作中,学生模型必须大。否则,其性能将急剧下降,从而使部署到内存限制设备的不切实际。为了解决这个问题,我们深入研究了跨语言知识蒸馏,并提出了一个多阶段蒸馏框架,用于构建一个小型但高性能的跨语性模型。在我们的框架中,合并了对比度学习,瓶颈和参数复发策略,以防止在压缩过程中损害性能。实验结果表明,我们的方法可以压缩XLM-R和Minilm的大小超过50 \%,而性能仅降低约1%。
translated by 谷歌翻译
我们描述了JD Explore Academy对WMT 2022共享的一般翻译任务的提交。我们参加了所有高资源曲目和一条中型曲目,包括中文英语,德语英语,捷克语英语,俄语 - 英语和日语英语。我们通过扩大两个主要因素,即语言对和模型大小,即\ textbf {vega-mt}系统来推动以前的工作的极限 - 进行翻译的双向培训。至于语言对,我们将“双向”扩展到“多向”设置,涵盖所有参与语言,以利用跨语言的常识,并将其转移到下游双语任务中。至于型号尺寸,我们将变压器限制到拥有近47亿参数的极大模型,以完全增强我们VEGA-MT的模型容量。此外,我们采用数据增强策略,例如单语数据的循环翻译以及双语和单语数据的双向自我训练,以全面利用双语和单语言数据。为了使我们的Vega-MT适应通用域测试集,设计了概括调整。根据受约束系统的官方自动分数,根据图1所示的sacrebleu,我们在{zh-en(33.5),en-zh(49.7)(49.7),de-en(33.7)上获得了第一名-de(37.8),CS-EN(54.9),En-CS(41.4)和En-Ru(32.7)},在{ru-en(45.1)和Ja-en(25.6)}和第三名上的第二名和第三名在{en-ja(41.5)}上; W.R.T彗星,我们在{zh-en(45.1),en-zh(61.7),de-en(58.0),en-de(63.2),cs-en(74.7),ru-en(ru-en(ru-en)上,我们获得了第一名64.9),en-ru(69.6)和en-ja(65.1)},分别在{en-cs(95.3)和ja-en(40.6)}上的第二名。将发布模型,以通过GitHub和Omniforce平台来促进MT社区。
translated by 谷歌翻译
We present DualNER, a simple and effective framework to make full use of both annotated source language corpus and unlabeled target language text for zero-shot cross-lingual named entity recognition (NER). In particular, we combine two complementary learning paradigms of NER, i.e., sequence labeling and span prediction, into a unified multi-task framework. After obtaining a sufficient NER model trained on the source data, we further train it on the target data in a {\it dual-teaching} manner, in which the pseudo-labels for one task are constructed from the prediction of the other task. Moreover, based on the span prediction, an entity-aware regularization is proposed to enhance the intrinsic cross-lingual alignment between the same entities in different languages. Experiments and analysis demonstrate the effectiveness of our DualNER. Code is available at https://github.com/lemon0830/dualNER.
translated by 谷歌翻译
由于包括架构改进和转移学习的效果,AMR Parsing在过去三年中经历了不起起的表现增加。自学习技术也在推动性能方面发挥作用。然而,对于最近的高性能解析器,自学和银数据生成的效果似乎褪色。在本文中,我们表明,通过将基于Spatch的集合技术与集合蒸馏组合来克服这一减少的银数据的递减。在一个广泛的实验设置中,我们首次推出超过85次Spatch以上的单一模型英语解析器性能并返回大量收益。我们还为中国,德语,意大利语和西班牙语进行了跨语态amr解析的新型最先进的。最后,我们探讨了所提出的蒸馏技术对领域适应的影响,并表明它可以产生竞争对QALD-9的人类注释数据的增益,并为生物群体实现新的最先进。
translated by 谷歌翻译
以前的工作主要侧重于改善NLU任务的交叉传输,具有多语言预用编码器(MPE),或提高与伯特的监督机器翻译的性能。然而,探索了,MPE是否可以有助于促进NMT模型的交叉传递性。在本文中,我们专注于NMT中的零射频转移任务。在此任务中,NMT模型培训,只有一个语言对的并行数据集和搁置架MPE,然后它直接测试在零拍语言对上。我们为此任务提出了Sixt,一个简单而有效的模型。 SIXT利用了两阶段培训计划利用MPE,并进一步改进了解离编码器和容量增强的解码器。使用此方法,SIMPT显着优于MBart,这是一个用于NMT的预磨削的多语言编码器解码器模型,平均改善了14个源语言的零拍摄的任何英语测试集上的7.1 BLEU。此外,培训计算成本和培训数据较少,我们的模型在15个任何英语测试组上实现了比Criss和M2M-100,两个强大的多语言NMT基线更好的性能。
translated by 谷歌翻译
While prior work has established that the use of parallel data is conducive for cross-lingual learning, it is unclear if the improvements come from the data itself, or if it is the modeling of parallel interactions that matters. Exploring this, we examine the usage of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model generated parallel data can be useful for downstream tasks, in both a general setting (continued pretraining) as well as the task-specific setting (translate-train), although our best results are still obtained using real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.
translated by 谷歌翻译
我们考虑使用最新的MultieRlex数据集中考虑法律主题分类中的零射击跨语性转移。由于原始数据集包含并行文档,这对于零拍传输不现实是不现实的,因此我们开发了一个没有并行文档的数据集的新版本。我们使用它来表明,基于翻译的方法非常优于多绘制预训练的模型,这是多曲线的最佳先前的零弹性传输方法。我们还开发了一种双语的教师零摄像转移方法,该方法利用了目标语言的其他未标记文档,并且比直接在标记的目标语言文档上进行微调的模型更好。
translated by 谷歌翻译
虽然对比学习大大提升了句子嵌入的表示,但它仍然受到现有句子数据集的大小的限制。在本文中,我们向Transaug(转换为增强),它提供了利用翻译句子对作为文本的数据增强的第一次探索,并介绍了两级范例,以提高最先进的句子嵌入。我们不是采用以其他语言设置培训的编码器,我们首先从SIMCSE编码器(以英语预先预先预订)蒸发蒸馏出一个汉语编码器,以便它们的嵌入在语义空间中靠近,这可以被后悔作为隐式数据增强。然后,我们只通过交叉语言对比学习更新英语编码器并将蒸馏的中文编码器冷冻。我们的方法在标准语义文本相似度(STS)上实现了一种新的最先进的,表现出SIMCSE和句子T5,以及由Senteval评估的传输任务的相应轨道中的最佳性能。
translated by 谷歌翻译
语言之间的大多数翻译任务都属于无法使用的零资源翻译问题。与两种通用枢轴翻译相比,多语言神经机器翻译(MNMT)可以使用所有语言的共享语义空间进行一通翻译,但通常表现不佳的基于枢轴的方法。在本文中,我们提出了一种新颖的方法,称为NMT(UM4)的统一多语言多语言多种教师模型。我们的方法统一了来源教师,目标老师和枢轴教师模型,以指导零资源翻译的学生模型。来源老师和目标教师迫使学生学习直接来源,以通过源头和目标方面的蒸馏知识进行目标翻译。枢轴教师模型进一步利用单语语料库来增强学生模型。实验结果表明,我们的72个方向模型在WMT基准测试上明显优于先前的方法。
translated by 谷歌翻译
作为自然语言处理领域(NLP)领域的广泛研究,基于方面的情感分析(ABSA)是预测文本中相对于相应方面所表达的情感的任务。不幸的是,大多数语言缺乏足够的注释资源,因此越来越多的研究人员专注于跨语义方面的情感分析(XABSA)。但是,最近的研究仅集中于跨语性数据对准而不是模型对齐。为此,我们提出了一个新颖的框架CL-XABSA:基于跨语言的情感分析的对比度学习。基于对比度学习,我们在不同的语义空间中关闭具有相同标签的样品之间的距离,从而实现了不同语言的语义空间的收敛。具体而言,我们设计了两种对比策略,即代币嵌入(TL-CTE)和情感水平的对比度学习,对代币嵌入(SL-CTE)的对比度学习,以使源语言和目标语言的语义空间正规化,以使其更加统一。由于我们的框架可以在培训期间以多种语言接收数据集,因此我们的框架不仅可以适应XABSA任务,而且可以针对基于多语言的情感分析(MABSA)进行调整。为了进一步提高模型的性能,我们执行知识蒸馏技术利用未标记的目标语言的数据。在蒸馏XABSA任务中,我们进一步探讨了不同数据(源数据集,翻译数据集和代码切换数据集)的比较有效性。结果表明,所提出的方法在XABSA,蒸馏XABSA和MABSA的三个任务中具有一定的改进。为了获得可重复性,我们的本文代码可在https://github.com/gklmip/cl-xabsa上获得。
translated by 谷歌翻译
We present, Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated testsets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing testsets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.
translated by 谷歌翻译
Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
translated by 谷歌翻译
对于多语言序列到序列预审预周序模型(多语言SEQ2SEQ PLM),例如姆巴特(Mbart),自制的预处理任务接受了多种单语言的培训,例如25种来自CommonCrawl的语言,而下游的跨语言任务通常在双语语言子集上进行,例如英语 - 德国人,存在数据差异,即领域的差异,以及跨语言学习客观差异,即在训练和填充阶段之间的任务差异。为了弥合上述跨语言域和任务差距,我们将使用额外的代码切换恢复任务扩展了香草预后管道。具体而言,第一阶段采用自我监督的代码转换还原任务作为借口任务,从而允许多语言SEQ2SEQ PLM获取一些域内对齐信息。在第二阶段,我们正常在下游数据上微调模型。 NLG评估(12个双语翻译任务,30个零射击任务和2项跨语言摘要任务)和NLU评估(7个跨语性自然语言推理任务)的实验表明,我们的模型超过了强大的基线MBART,具有标准的FINETUNNING,这表明了我们的模型策略,一致。分析表明,我们的方法可以缩小跨语性句子表示的欧几里得距离,并通过微不足道的计算成本改善模型概括。我们在:https://github.com/zanchangtong/csr4mbart上发布代码。
translated by 谷歌翻译
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
translated by 谷歌翻译
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in crosslingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
translated by 谷歌翻译
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark 1 to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
translated by 谷歌翻译