Multilingual language models have been shown to allow for non-trivial transfer across scripts and languages. In this work, we study the structure of the internal representations that enable this transfer. We focus on the representation of gender distinctions as a practical case study, and examine the extent to which the gender concept is encoded in shared subspaces across different languages. Our analysis shows that gender representations consist of several prominent components that are shared across languages, alongside language-specific components. The existence of language-independent and language-specific components provides an explanation for an intriguing empirical observation we make: while gender classification transfers well across languages, interventions for gender removal trained on a single language do not transfer easily to others.
Multilingual BERT (mBERT) has demonstrated considerable cross-lingual syntactic ability, whereby it enables effective zero-shot cross-lingual transfer of syntactic knowledge. The transfer is more successful between some languages, but it is not well understood what leads to this variation and whether it fairly reflects difference between languages. In this work, we investigate the distributions of grammatical relations induced from mBERT in the context of 24 typologically different languages. We demonstrate that the distance between the distributions of different languages is highly consistent with the syntactic difference in terms of linguistic formalisms. Such difference learnt via self-supervision plays a crucial role in the zero-shot transfer performance and can be predicted by variation in morphosyntactic properties between languages. These results suggest that mBERT properly encodes languages in a way consistent with linguistic diversity and provide insights into the mechanism of cross-lingual transfer.
Related works used indexes like CKA and variants of CCA to measure the similarity of cross-lingual representations in multilingual language models. In this paper, we argue that assumptions of CKA/CCA align poorly with one of the motivating goals of cross-lingual learning analysis, i.e., explaining zero-shot cross-lingual transfer. We highlight what valuable aspects of cross-lingual similarity these indexes fail to capture and provide a motivating case study \textit{demonstrating the problem empirically}. Then, we introduce \textit{Average Neuron-Wise Correlation (ANC)} as a straightforward alternative that is exempt from the difficulties of CKA/CCA and is good specifically in a cross-lingual context. Finally, we use ANC to construct evidence that the previously introduced ``first align, then predict'' pattern takes place not only in masked language models (MLMs) but also in multilingual models with \textit{causal language modeling} objectives (CLMs). Moreover, we show that the pattern extends to the \textit{scaled versions} of the MLMs and CLMs (up to 85x original mBERT).\footnote{Our code is publicly available at \url{https://github.com/TartuNLP/xsim}}
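A minimal sketch of what such a neuron-wise correlation can look like, assuming ANC is the mean per-neuron Pearson correlation between representations of aligned sentences in two languages; the exact definition is in the paper and the linked repository, and the function name and epsilon below are illustrative:

```python
import numpy as np

def average_neuron_wise_correlation(X: np.ndarray, Y: np.ndarray) -> float:
    """X, Y: (n_sentences, n_neurons) activations for aligned sentence pairs."""
    # Center each neuron (column) so the dot product becomes a covariance.
    Xc = X - X.mean(axis=0, keepdims=True)
    Yc = Y - Y.mean(axis=0, keepdims=True)
    # Per-neuron Pearson correlation, with a small epsilon against zero variance.
    num = (Xc * Yc).sum(axis=0)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (Yc ** 2).sum(axis=0)) + 1e-12
    return float(np.abs(num / den).mean())

# Toy example: two "language" views of 100 sentences with 768 neurons each.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 768))
Y = 0.5 * X + 0.5 * rng.normal(size=(100, 768))
print(average_neuron_wise_correlation(X, Y))
```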
Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
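As a rough illustration of the contrastive objective mSimCSE builds on (unsupervised SimCSE: two dropout-perturbed encodings of the same English sentence form a positive pair, other in-batch sentences act as negatives), here is a hedged PyTorch sketch; the encoder callable, pooling, and temperature are assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def simcse_loss(encoder, input_ids, attention_mask, temperature: float = 0.05):
    # Two forward passes: the encoder must be in training mode so each pass
    # uses a different dropout mask, yielding two views of the same sentences.
    z1 = encoder(input_ids, attention_mask)   # (batch, hidden), pooled embeddings
    z2 = encoder(input_ids, attention_mask)   # (batch, hidden)
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    # Cosine-similarity matrix between the two views, scaled by the temperature.
    sims = z1 @ z2.T / temperature            # (batch, batch)
    labels = torch.arange(sims.size(0), device=sims.device)
    # Each sentence must identify its own second view among all in-batch candidates.
    return F.cross_entropy(sims, labels)
```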
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in crosslingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
Recent work has shown that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific and a language-neutral one. This paper analyses the relationship between them, in the context of fine-tuning on two tasks, POS tagging and natural language inference, which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result supported by evidence from language identification experiments. However, further experiments on "unlearning" language-specific representations using gradient reversal and iterative adversarial learning do not yield additional improvement to the language-independent component beyond the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model's limited representational capacity, favouring language-independent representations at the expense of language-specific ones.
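The "unlearning" experiments mentioned above rely on gradient reversal; below is a generic, hedged sketch of a gradient reversal layer in PyTorch (the standard Ganin-and-Lempitsky-style construction, not the paper's exact setup):

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Identity on the forward pass, negated (scaled) gradient on the backward
        # pass, so the encoder is pushed to confuse the language classifier placed
        # on top of this layer.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# Usage: language_logits = language_classifier(grad_reverse(encoder_output))
```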
Language can be used as a means of reproducing and enforcing harmful stereotypes and biases, and has been analysed as such in numerous studies. In this paper, we present a survey of 304 papers on gender bias in natural language processing. We analyse definitions of gender and its categories within the social sciences and connect them to formal definitions of gender bias in NLP research. We survey the lexica and datasets applied in research on gender bias, and then compare and contrast approaches to detecting and mitigating it. We find that research on gender bias suffers from four core limitations: 1) most research treats gender as a binary variable, neglecting its fluidity and continuity; 2) most of the work has been conducted in monolingual setups for English or other high-resource languages; 3) despite a myriad of papers on gender bias in NLP methods, we find that most newly developed algorithms are not tested for bias and disregard the ethical considerations of their work; 4) finally, methodologies developed in this line of research are fundamentally flawed, covering very limited definitions of gender bias and lacking evaluation baselines and pipelines. We propose recommendations towards overcoming these limitations as a guide for future research.
Pre-trained multilingual language models show significant performance gains for zero-shot cross-lingual model transfer on a wide range of natural language understanding (NLU) tasks. Previously, for zero-shot cross-lingual evaluation, pre-trained models are only fine-tuned on English data and tested on a variety of target languages. In this paper, we do cross-lingual evaluation on various NLU tasks (sentence classification, sequence labeling, question answering) using prompt-tuning and compare it with fine-tuning. The results show that prompt tuning achieves much better cross-lingual transfer than fine-tuning across datasets, with only 0.1% to 0.3% tuned parameters. Additionally, we demonstrate through the analysis that prompt tuning can have better cross-lingual transferability of representations on downstream tasks with better aligned decision boundaries.
In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2019) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics including the submissions to the WMT 2022 metrics shared task and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.
Detecting and mitigating harmful biases in modern language models is widely recognised as a crucial open problem. In this paper, we take a step back and investigate how language models come to be biased in the first place. We use a relatively small language model, an LSTM architecture trained on an English Wikipedia corpus. With full access to the data and the model parameters at every step of training, we can map in detail how the representation of gender develops, which patterns in the dataset drive it, and how the model's internal state relates to bias in a downstream task (semantic textual similarity). We find that the representation of gender is dynamic and identify distinct phases during training. Furthermore, we show that gender information is increasingly represented in the model's input embeddings, and that, as a consequence, debiasing these embeddings can effectively reduce downstream bias. Monitoring the training dynamics also allows us to detect an asymmetry in how the feminine and masculine genders are represented in the input embeddings. This is important, as it may cause naive mitigation strategies to introduce new undesirable biases. We discuss the relevance of our findings for mitigation strategies more generally, as well as the prospects of extending our methods to larger language models, transformer architectures, other languages, and other undesirable biases.
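For concreteness, one common way to quantify gender information in input embeddings is to project words onto a direction derived from definitional pairs; the hedged sketch below illustrates that kind of measurement only, and is not the authors' exact procedure (the word list and pairs are illustrative):

```python
import numpy as np

def gender_projection(embeddings: dict, words: list,
                      pairs=(("he", "she"), ("man", "woman"))):
    """embeddings: word -> vector; returns each word's signed projection."""
    # Gender direction = normalised mean of the definitional difference vectors.
    diffs = [embeddings[a] - embeddings[b] for a, b in pairs]
    direction = np.mean(diffs, axis=0)
    direction /= np.linalg.norm(direction)
    # Signed scalar projection of each word onto the gender direction.
    return {w: float(embeddings[w] @ direction) for w in words}
```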
Low-resource languages, such as the Baltic languages, benefit from large multilingual language models (LMs) with remarkable cross-lingual transfer capabilities. This work is an interpretation and analysis study of the cross-lingual representations of multilingual LMs. Previous work hypothesised that these LMs internally project representations of different languages into a shared cross-lingual space. However, the literature has produced contradictory results. In this paper, we revisit prior work claiming that "BERT is not an interlingua" and show that, with an alternative selection strategy or similarity index, different languages do in fact converge to a shared space in such language models. We then perform a cross-lingual representation analysis of the two most popular multilingual LMs, using 378 pairwise language comparisons. We find that while most languages share a joint cross-lingual space, some do not. However, we observe that the Baltic languages do belong to the shared space.
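One of the standard similarity indexes used in such pairwise comparisons is linear CKA; a minimal sketch of the standard formulation follows (not necessarily the exact configuration used in the paper):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2) representations of the same n sentences in two languages."""
    # Center features so the index is invariant to per-dimension offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)
```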
Knowledge bases such as Wikidata amass vast amounts of named entity information, such as multilingual labels, which can be highly useful for various multilingual and cross-lingual applications. However, from the perspective of information consistency, there is no guarantee that such labels match across languages, which greatly limits their usefulness for fields such as machine translation. In this work, we investigate the application of word and sentence alignment techniques, coupled with a matching algorithm, to align the cross-lingual entity labels extracted from Wikidata in 10 languages. Our results indicate that the mapping between Wikidata's main labels can be improved substantially by any of the methods used (by up to 20 F1 points). We show how methods relying on sentence embeddings outperform all others, even across different scripts. We argue that the application of such techniques to measuring the similarity of label pairs, coupled with a knowledge base rich in high-quality entity labels, is an excellent asset for machine translation.
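A hedged sketch of the sentence-embedding route described above: encode the labels of both languages with a multilingual sentence encoder and match them by cosine similarity. The model name and the greedy nearest-neighbour matching are assumptions for illustration; the paper's pipeline and matching algorithm may differ:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any multilingual sentence encoder works here; this model name is illustrative.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def best_label_matches(source_labels, target_labels):
    # Encode both label lists into a shared multilingual space (unit-normalised).
    src = model.encode(source_labels, normalize_embeddings=True)
    tgt = model.encode(target_labels, normalize_embeddings=True)
    sims = src @ tgt.T
    # Greedy one-directional matching: each source label picks its nearest target.
    return [(s, target_labels[j], float(sims[i, j]))
            for i, (s, j) in enumerate(zip(source_labels, sims.argmax(axis=1)))]

print(best_label_matches(["United States of America"],
                         ["Vereinigte Staaten", "Frankreich"]))
```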
In the past few years, a surge of multilingual pre-trained language models (PLMs) has been proposed, achieving state-of-the-art performance on many cross-lingual downstream tasks. However, understanding why multilingual PLMs perform well remains an open question. For example, it is unclear whether multilingual PLMs produce consistent token attributions across different languages. To address this, in this paper we propose a Cross-lingual Consistency of Token Attributions (CCTA) evaluation framework. Extensive experiments on three downstream tasks demonstrate that multilingual PLMs assign significantly different attributions to multilingual synonyms. Moreover, we make the following observations: 1) Spanish yields the most consistent token attributions across languages when used to train PLMs; 2) the consistency of token attributions is strongly correlated with performance on downstream tasks.
We present MTG, a new benchmark suite for training and evaluating multilingual text generation. It is the first-proposed multilingual, multiway text generation dataset with the largest amount of human-annotated data (400k). It includes four generation tasks (story generation, question generation, title generation, and text summarization) across five languages (English, German, French, Spanish, and Chinese). The multiway setup enables testing models' knowledge-transfer capabilities across languages and tasks. Using MTG, we train and analyse several popular multilingual generation models from different aspects. Our benchmark suite fosters model performance enhancement with more human-annotated parallel data and provides comprehensive evaluation across diverse generation scenarios. Code and data are available at \url{https://github.com/zide05/mtg}.
Multilingual Language Models (MLLMs), such as mBERT, XLM, XLM-R, \textit{etc.}, have emerged as a viable option for bringing the power of pre-training to a large number of languages. Given their success in zero-shot transfer learning, there has been a surge of work on (i) building larger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero-shot cross-lingual, and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.
Large autoregressive language models such as GPT-3 are few-shot learners that can perform a wide variety of language tasks without fine-tuning. While these models are known to be able to jointly represent many different languages, their training data is dominated by English, potentially limiting their cross-lingual generalization. In this work, we train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages and study their few- and zero-shot learning capabilities across a wide range of tasks. Our largest model, with 7.5 billion parameters, sets a new state of the art in few-shot learning in more than 20 representative languages, outperforming GPT-3 of comparable size on multilingual commonsense reasoning (with +7.4% absolute accuracy improvement in the 0-shot setting and +9.4% in the 4-shot setting) and natural language inference (+5.4% in each of the 0-shot and 4-shot settings). On the FLORES-101 machine translation benchmark, our model outperforms GPT-3 across 182 translation directions with 32 training examples, while surpassing the official supervised baselines in 45 directions. We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning on some tasks, while there remains room for improvement on surface-form robustness and on adapting to tasks that lack a natural cloze form. Finally, we evaluate our models on hate speech detection in five languages and find that they have limitations similar to comparably sized GPT-3 models.
This paper deals with cross-lingual analysis of the Czech, English, and French languages. We perform zero-shot cross-lingual classification using five linear transformations combined with LSTM- and CNN-based classifiers. We compare the performance of the individual transformations and, in addition, we confront the transformation-based approach with existing BERT-like models. We show that, unlike in monolingual classification, pre-trained embeddings from the target domain are crucial for improving cross-lingual classification results; in monolingual classification, the effect is not so distinct.
Cross-lingual word embeddings (CLWE) have been shown to be useful in many cross-lingual tasks. However, most existing approaches to learning CLWE, including those with contextual embeddings, are sense-agnostic. In this work, we propose a novel framework to align contextual embeddings at the sense level by leveraging cross-lingual signals from bilingual dictionaries only. We operationalize our framework by first proposing a novel sense-aware cross-entropy loss to model word senses explicitly. Monolingual ELMo and BERT models pre-trained with our sense-aware cross-entropy loss demonstrate significant improvements on word sense disambiguation tasks. We then propose a sense-alignment objective, in addition to the sense-aware cross-entropy loss, for cross-lingual model pre-training, and pre-train cross-lingual models for several language pairs (English to German/Spanish/Japanese/Chinese). Compared with the best baseline results, our cross-lingual models achieve average performance improvements of 0.52%, 2.09%, and 1.29% on zero-shot, sentiment classification, and XNLI tasks, respectively.
Bias elimination and recent probing studies attempt to remove specific information from embedding spaces. Here it is important to remove as much of the target information as possible, while preserving any other information present. INLP is a popular recent method which removes specific information through iterative nullspace projections. Multiple iterations, however, increase the risk that information other than the target is negatively affected. We introduce two methods that find a single targeted projection: Mean Projection (MP, more efficient) and Tukey Median Projection (TMP, with theoretical guarantees). Our comparison between MP and INLP shows that (1) one MP projection removes linear separability based on the target and (2) MP has less impact on the overall space. Further analysis shows that applying random projections after MP leads to the same overall effects on the embedding space as the multiple projections of INLP. Applying one targeted (MP) projection hence is methodologically cleaner than applying multiple (INLP) projections that introduce random effects.
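A minimal sketch of a single mean-difference projection in the spirit of MP: remove the direction connecting the two class means (e.g., the two values of a protected attribute) from every embedding. This illustrates the idea only; see the paper for the exact Mean Projection and Tukey Median Projection definitions:

```python
import numpy as np

def mean_projection(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """X: (n, d) embeddings; y: (n,) binary protected-attribute labels."""
    # Direction between the two class means, unit-normalised.
    d = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    d /= np.linalg.norm(d)
    # Project every embedding onto the orthogonal complement of that direction.
    return X - np.outer(X @ d, d)
```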