This paper aims for a potential architectural improvement for multilingual learning and asks: Can different tasks from different languages be modeled in a monolithic framework, i.e. without any task/language-specific module? The benefit of achieving this could open new doors for future multilingual research, including allowing systems trained on low resources to be further assisted by other languages as well as other tasks. We approach this goal by developing a learning framework named Polyglot Prompting to exploit prompting methods for learning a unified semantic space for different languages and tasks with multilingual prompt engineering. We performed a comprehensive evaluation of 6 tasks, namely topic classification, sentiment classification, named entity recognition, question answering, natural language inference, and summarization, covering 24 datasets and 49 languages. The experimental results demonstrated the efficacy of multilingual multitask prompt-based learning and led to inspiring observations. We also present an interpretable multilingual evaluation methodology and show how the proposed framework, multilingual multitask prompt training, works. We release all datasets prompted in the best setting and code.
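Large-scale autoregressive language models such as GPT-3 are few-shot learners that can perform a wide range of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages, and study their few- and zero-shot learning capabilities on a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size in multilingual commonsense reasoning (+7.4% absolute accuracy improvement in the 0-shot setting and +9.4% in the 4-shot setting) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 on 171 out of 182 translation directions with 32 training examples, while surpassing the official supervised baseline in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there is still room for improvement in surface-form robustness and in adaptation to tasks that do not have a natural cloze form. Finally, we evaluate our models on hate speech detection in five languages and find that they have limitations similar to those of comparably sized GPT-3 models.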
Pre-trained multilingual language models show significant performance gains for zero-shot cross-lingual model transfer on a wide range of natural language understanding (NLU) tasks. Previously, for zero-shot cross-lingual evaluation, pre-trained models are only fine-tuned on English data and tested on a variety of target languages. In this paper, we do cross-lingual evaluation on various NLU tasks (sentence classification, sequence labeling, question answering) using prompt-tuning and compare it with fine-tuning. The results show that prompt tuning achieves much better cross-lingual transfer than fine-tuning across datasets, with only 0.1% to 0.3% tuned parameters. Additionally, we demonstrate through the analysis that prompt tuning can have better cross-lingual transferability of representations on downstream tasks with better aligned decision boundaries.
Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
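The retrieval-augmented prompting idea described in the PARC abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, the prompt template, the `[MASK]` placeholder, and the toy two-dimensional vectors are all assumptions made here, and the vectors stand in for embeddings that a multilingual sentence encoder would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, pool, k=1):
    """Return the k high-resource-language entries most similar to the query.

    pool is a list of (sentence, label_or_None, vector) tuples (hypothetical layout).
    """
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return ranked[:k]

def build_prompt(query, retrieved, template="{text} In summary, it was {label}."):
    """Prepend retrieved sentences to the low-resource-language query.

    The template is an assumption. Labeled entries are filled in with their label;
    unlabeled entries are prepended as plain context.
    """
    parts = []
    for sentence, label, _vec in retrieved:
        if label is None:
            parts.append(sentence)
        else:
            parts.append(template.format(text=sentence, label=label))
    parts.append(template.format(text=query, label="[MASK]"))
    return " ".join(parts)

# Toy pool of English sentences with made-up embeddings.
pool = [
    ("The film was wonderful.", "good", [1.0, 0.0]),
    ("The food was awful.", "bad", [0.0, 1.0]),
]
top = retrieve([0.9, 0.1], pool, k=1)
print(build_prompt("Filamu ilikuwa nzuri sana.", top))
```

In a real setting the masked position would be scored by a multilingual pretrained language model; that step is omitted here.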
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
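We introduce MTG, a new benchmark suite for training and evaluating multilingual text generation. It is the first multilingual multiway text generation dataset to be proposed, and it has the largest amount of human-annotated data (400k). It covers four generation tasks (story generation, question generation, title generation, and text summarization) across five languages (English, German, French, Spanish, and Chinese). The multiway setup enables testing a model's ability to transfer knowledge across languages and tasks. Using MTG, we train and analyze several popular multilingual generation models from different aspects. Our benchmark suite fosters model performance improvements through more human-annotated parallel data, and it provides comprehensive evaluations across diverse generation scenarios. Code and data are available at https://github.com/zide05/mtg.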
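An effective method for cross-lingual transfer is to fine-tune a bilingual or multilingual model on a supervised dataset in one language and evaluate it on another language in a zero-shot manner. Translating examples at training time or at inference time is also a viable alternative. However, these methods carry costs that are rarely addressed in the literature. In this work, we analyze cross-lingual methods in terms of their effectiveness (e.g., accuracy), their development and deployment costs, and their latency at inference time. Our experiments on three tasks indicate that the best cross-lingual method is highly task-dependent. Finally, by combining zero-shot and translation methods, we achieve state-of-the-art results on the three datasets used in this work. Based on these results, we question the need for manually labeled training data in the target language. Code and the translated datasets are available at https://github.com/unicamp-dl/cross-lingsual-analysis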
We present NusaCrowd, a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.
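Previous work mainly focuses on improving cross-lingual transfer for NLU tasks with a multilingual pretrained encoder (MPE), or on improving the performance of supervised machine translation with BERT. However, whether an MPE can help facilitate the cross-lingual transferability of an NMT model remains under-explored. In this paper, we focus on a zero-shot cross-lingual transfer task in NMT. In this task, the NMT model is trained with a parallel dataset for only one language pair and an off-the-shelf MPE, and is then tested directly on zero-shot language pairs. We propose SixT, a simple yet effective model for this task. SixT leverages the MPE with a two-stage training schedule and gains further improvement from a position-disentangled encoder and a capacity-enhanced decoder. Using this method, SixT significantly outperforms mBART, a pretrained multilingual encoder-decoder model designed for NMT, with an average improvement of 7.1 BLEU on zero-shot any-to-English test sets across 14 source languages. Furthermore, with much less training computation and training data, our model achieves better performance on 15 any-to-English test sets than CRISS and m2m-100, two strong multilingual NMT baselines.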
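The goal of stance detection is to determine the viewpoint expressed in a piece of text towards a target. These viewpoints or contexts are often expressed in many different languages depending on the user and the platform, which can be a local news outlet, a social media platform, a news forum, etc. Most research on stance detection, however, has been limited to a single language and a few targets, with little work on cross-lingual stance detection. Moreover, non-English sources of labeled data are often scarce and present additional challenges. Recently, large multilingual language models have substantially improved performance on many non-English tasks, especially those with a limited number of examples. This highlights the importance of model pre-training and its ability to learn from few examples. In this paper, we present the most comprehensive study of cross-lingual stance detection to date: we experiment with 12 diverse datasets in 12 languages from 6 language families, each with 6 low-resource evaluation settings. For our experiments, we build on pattern-exploiting training and propose adding a novel label encoder to simplify the verbalization procedure. We further propose sentiment-based generation of stance data for pre-training, which shows a sizeable improvement of 6% F1 absolute in low-shot settings compared to several strong baselines.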
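Meta-learning with auxiliary languages has demonstrated promising improvements for cross-lingual natural language processing. However, previous studies sample the meta-training and meta-testing data from the same language, which limits the model's ability to transfer across languages. In this paper, we propose XLA-MAML, which performs direct cross-lingual adaptation in the meta-learning stage. We conduct zero-shot and few-shot experiments on natural language inference and question answering. The experimental results demonstrate the effectiveness of our method across different languages, tasks, and pretrained models. We also analyze various cross-lingual-specific settings for meta-learning, including sampling strategy and parallelism.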
Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
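The mark-then-translate idea in the abstract above can be illustrated with a small sketch: wrap each labeled span in markers, translate the marked sentence, then read the span boundaries back from the markers. Everything below is an assumption made for illustration, including the numbered `[i]...[/i]` marker format and the function names; it is not the EasyProject implementation, and the "translation" is a hand-written string standing in for the output of a real MT system.

```python
import re

def mark_spans(text, spans):
    """Wrap each (start, end, label) character span in numbered markers.

    Spans are assumed to be non-overlapping.
    """
    out, cursor = [], 0
    for i, (start, end, _label) in enumerate(sorted(spans)):
        out.append(text[cursor:start])
        out.append(f"[{i}]{text[start:end]}[/{i}]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

# Matches "[3]some text[/3]" and captures the index and the inner text.
MARKER = re.compile(r"\[(\d+)\](.*?)\[/\1\]")

def project_spans(marked_translation, spans):
    """Strip markers from a translated sentence and recover the labeled spans."""
    labels = [label for _start, _end, label in sorted(spans)]
    clean, projected = [], []
    cursor = 0   # position in the marked string
    offset = 0   # position in the marker-free string
    for match in MARKER.finditer(marked_translation):
        before = marked_translation[cursor:match.start()]
        clean.append(before)
        offset += len(before)
        inner = match.group(2)
        projected.append((offset, offset + len(inner), labels[int(match.group(1))]))
        clean.append(inner)
        offset += len(inner)
        cursor = match.end()
    clean.append(marked_translation[cursor:])
    return "".join(clean), projected

text = "Alice visited Paris."
spans = [(0, 5, "PER"), (14, 19, "LOC")]
marked = mark_spans(text, spans)
# Stand-in for an MT system that preserves the markers (hypothetical output).
translated = "[0]Alice[/0] besuchte [1]Paris[/1]."
clean, projected = project_spans(translated, spans)
print(marked)
print(clean, projected)
```

The sketch assumes the translation system keeps every marker pair intact; a practical pipeline would also need a fallback for sentences where markers are dropped or altered.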
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
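Multilingual pre-trained models have demonstrated their effectiveness in many multilingual NLP tasks and have enabled zero-shot or few-shot transfer from high-resource languages to low-resource ones. However, due to significant typological differences and contradictions between some languages, such models usually perform poorly on many languages and cross-lingual settings, which shows how difficult it is to learn a single model that handles a massive number of diverse languages well at the same time. To alleviate this issue, we present a new multilingual pre-training pipeline. We propose to generate language representations from multilingual pre-trained models and conduct a linguistic analysis to show that language representation similarity reflects linguistic similarity from multiple perspectives, including language family, geographical sprachbund, lexicostatistics, and syntax. We then cluster all the target languages into multiple groups and name each group a representation sprachbund. Languages in the same representation sprachbund are thus expected to boost each other in both pre-training and fine-tuning, as they share rich linguistic similarity. We pre-train one multilingual model for each representation sprachbund. Experiments are conducted on cross-lingual benchmarks, and significant improvements are achieved compared to strong baselines.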
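In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and causal language modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20-billion-parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs (Arabic, English, French, German, Hindi, Italian, Japanese, and Telugu) on the Flores-101 dataset. We also show that in the zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on the SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, PAWS-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for large language model (LLM) training.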
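Knowledge-enriched language representation learning has shown promising performance across various knowledge-intensive NLP tasks. However, existing knowledge-based language models are all trained with monolingual knowledge graph data, which limits their application to more languages. In this work, we present a novel framework to pretrain knowledge-based multilingual language models (KMLMs). We first use the Wikidata knowledge graph to generate a large amount of code-switched synthetic sentences and reasoning-based multilingual training data. Then, based on the intra- and inter-sentence structures of the generated data, we design pretraining tasks to facilitate knowledge learning, which allows the language models not only to memorize factual knowledge but also to learn useful logical patterns. Our pretrained KMLMs demonstrate significant performance improvements on a wide range of knowledge-intensive cross-lingual tasks, including named entity recognition, factual knowledge retrieval, relation classification, and a new task designed by us, namely logical reasoning. Our code and pretrained language models will be made publicly available.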
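Large language models pre-trained with self-supervised learning have demonstrated impressive zero-shot capabilities on a wide variety of tasks. In this work, we present WeLM: a well-read pre-trained language model for Chinese that is able to seamlessly perform different types of tasks with zero or few-shot demonstrations. WeLM is trained with 10B parameters by "reading" a curated high-quality corpus covering a wide range of topics. We show that WeLM has broad knowledge of various domains and languages. On 18 monolingual (Chinese) tasks, WeLM significantly outperforms existing pre-trained models of similar size and matches the performance of models up to 25 times larger. WeLM also exhibits strong capabilities in multilingual and code-switching understanding, outperforming existing multilingual language models pre-trained on 30 languages. Furthermore, we collected human-written prompts for a large set of supervised datasets in Chinese and fine-tuned WeLM with multi-prompted training. The resulting model attains strong generalization to unseen types of tasks and outperforms the unsupervised WeLM in zero-shot learning. Finally, we demonstrate that WeLM has basic skills in explaining and calibrating its own decisions, which may be a promising direction for future research. Our models can be applied for at https://welm.weixin.qq.com/docs/api/.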
Misinformation spread over social media has become an undeniable infodemic. However, not all spreading claims are made equal. If propagated, some claims can be destructive, not only on the individual level, but to organizations and even countries. Detecting claims that should be prioritized for fact-checking is considered the first step in fighting the spread of fake news. With training data limited to a handful of languages, developing supervised models to tackle the problem over lower-resource languages is currently infeasible. Therefore, our work aims to investigate whether we can use existing datasets to train models for predicting the check-worthiness of claims in tweets in other languages. We present a systematic comparative study of six approaches for cross-lingual check-worthiness estimation across pairs of five diverse languages with the help of the Multilingual BERT (mBERT) model. We run our experiments using a state-of-the-art multilingual Twitter dataset. Our results show that for some language pairs, zero-shot cross-lingual transfer is possible and can perform as well as monolingual models that are trained on the target language. We also show that in some languages, this approach outperforms (or at least is comparable to) state-of-the-art models.
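We present the results of the Workshop on Multilingual Information Access (MIA) 2022 Shared Task, which evaluates cross-lingual open-retrieval question answering (QA) systems in 16 typologically diverse languages. In this task, we adapted two large-scale cross-lingual open-retrieval QA datasets in 14 typologically diverse languages, and used newly annotated open-retrieval QA data in 2 underrepresented languages: Tagalog and Tamil. Four teams submitted their systems. The best system, which leverages iteratively mined diverse negative examples and larger pretrained models, achieves 32.2 F1, outperforming our baseline by 4.5 points. The second-best system uses entity-aware contextualized representations for document retrieval and achieves significant improvements in Tamil (20.8 F1), whereas most of the other systems score nearly zero.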