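While multilingual vision-language pretrained models have brought several benefits, recent benchmarks across various tasks and languages show poor cross-lingual generalisation when multilingually pretrained vision-language models are applied to non-English data, with a large gap between (supervised) English performance and (zero-shot) cross-lingual transfer. In this work, we explore the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task, where models are fine-tuned on English visual-question data and evaluated on 7 typologically diverse languages. We improve cross-lingual transfer with three strategies: (1) we introduce a linguistic prior objective that augments the cross-entropy loss with a similarity-based loss to guide the model during training, (2) we learn a task-specific subnetwork that improves cross-lingual generalisation and reduces variance without modifying the model, and (3) we augment training examples using synthetic code-mixing to promote alignment of embeddings between source and target languages. Our experiments on xGQA using the pretrained multilingual multimodal transformers UC2 and M3P demonstrate the consistent effectiveness of the proposed fine-tuning strategy for 7 languages, outperforming existing transfer methods with sparse models. Code and data to reproduce our findings are publicly available.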
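As a rough illustration of the synthetic code-mixing strategy (3), the sketch below swaps some tokens of an English question for target-language entries. The lexicon `TOY_LEXICON`, the function `code_mix`, and the `ratio` and `seed` parameters are all hypothetical names chosen for this example; the abstract does not say how the authors build or sample their code-mixed data.

```python
import random

# Hypothetical toy English->German lexicon (an assumption for illustration only);
# a real setup would draw on a proper bilingual dictionary.
TOY_LEXICON = {
    "what": "was",
    "color": "farbe",
    "is": "ist",
    "the": "der",
    "dog": "hund",
}


def code_mix(question, lexicon, ratio=0.5, seed=0):
    """Replace each translatable token with its lexicon entry with probability `ratio`."""
    rng = random.Random(seed)  # local RNG so the augmentation is reproducible
    mixed = []
    for tok in question.lower().split():
        if tok in lexicon and rng.random() < ratio:
            mixed.append(lexicon[tok])  # swap in the target-language word
        else:
            mixed.append(tok)  # keep the source-language word
    return " ".join(mixed)
```

Calling `code_mix("What color is the dog", TOY_LEXICON, ratio=0.5)` yields a question with the same number of tokens, some of them in the target language; the answer label of the training example would stay unchanged.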
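Vision-and-language tasks are gaining popularity in the research community, but the focus is still mainly on English. We propose a pipeline that uses English-only vision-language models to train a monolingual model for a target language. We propose to extend OSCAR+, a model that leverages object tags as anchor points for learning image-text alignments, to train on visual question answering datasets in different languages. We propose a novel approach to knowledge distillation that trains the model in other languages using parallel sentences. Compared to other models that use the target language in the pretraining corpora, we can leverage an existing English model to transfer knowledge to the target language using significantly fewer resources. We also release a large-scale visual question answering dataset in Japanese and Hindi. Although we restrict our work to visual question answering, our model can be extended to any sequence-level classification task and to other languages as well. This paper focuses on two languages for the visual question answering task: Japanese and Hindi. Our pipeline outperforms the current state-of-the-art models by a relative increase in accuracy of 4.4% and 13.4%, respectively.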
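Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English-language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together, by both aggregating pre-existing datasets and creating new ones, visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of unlabelled textual data available for pretraining, and only weakly by the typological distance between target and source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.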
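Multilingual language models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, a large body of work has emerged in (i) building bigger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero-shot cross-lingual and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.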
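Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits of even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.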
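For multilingual sequence-to-sequence pretrained language models (multilingual Seq2Seq PLMs) such as mBART, the self-supervised pretraining task is trained on a wide range of monolingual languages, e.g. 25 languages from CommonCrawl, while downstream cross-lingual tasks generally operate on a bilingual language subset, e.g. English-German. This creates a data discrepancy (a domain discrepancy) and a cross-lingual learning objective discrepancy (a task discrepancy) between the pretraining and fine-tuning stages. To bridge these cross-lingual domain and task gaps, we extend the vanilla pretrain-finetune pipeline with an extra code-switching restore task. Specifically, the first stage employs the self-supervised code-switching restore task as a pretext task, allowing the multilingual Seq2Seq PLM to acquire some in-domain alignment information. In the second stage, we fine-tune the model on downstream data as usual. Experiments on both NLG evaluation (12 bilingual translation tasks, 30 zero-shot translation tasks, and 2 cross-lingual summarization tasks) and NLU evaluation (7 cross-lingual natural language inference tasks) show that our model consistently outperforms the strong baseline mBART with the standard fine-tuning strategy. Analyses indicate that our approach narrows the Euclidean distance between cross-lingual sentence representations and improves model generalization at trivial computational cost. We release the code at: https://github.com/zanchangtong/csr4mbart.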
Universal cross-lingual sentence embeddings map semantically similar cross-lingual sentences into a shared embedding space. Aligning cross-lingual sentence embeddings usually requires supervised cross-lingual parallel sentences. In this work, we propose mSimCSE, which extends SimCSE to multilingual settings and reveal that contrastive learning on English data can surprisingly learn high-quality universal cross-lingual sentence embeddings without any parallel data. In unsupervised and weakly supervised settings, mSimCSE significantly improves previous sentence embedding methods on cross-lingual retrieval and multilingual STS tasks. The performance of unsupervised mSimCSE is comparable to fully supervised methods in retrieving low-resource languages and multilingual STS. The performance can be further enhanced when cross-lingual NLI data is available. Our code is publicly available at https://github.com/yaushian/mSimCSE.
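For readers unfamiliar with the contrastive objective that SimCSE-style methods build on, here is a minimal pure-Python sketch of an in-batch InfoNCE loss. The function names `cosine` and `info_nce`, the temperature value, and the toy vectors in the usage note are assumptions for illustration; a real system would compute this on encoder outputs in a tensor library, and the abstract does not give the exact loss used by mSimCSE.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` acts as an in-batch negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching index as the target class.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)
```

With `batch = [[1.0, 0.0], [0.0, 1.0]]`, `info_nce(batch, batch)` is close to zero because each anchor matches its own positive, while `info_nce(batch, batch[::-1])` is large because each anchor's positive is orthogonal to it.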
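Cross-lingual machine reading comprehension (xMRC) is challenging due to the lack of training data in low-resource languages. Recent approaches use training data only in a resource-rich language such as English to fine-tune large-scale cross-lingual pretrained language models. Because of the large differences between languages, a model fine-tuned only on a source language may not perform well on target languages. Interestingly, we observe that while the top-1 results predicted by previous approaches often fail to hit the ground-truth answers, the correct answers are often contained in the top-k predicted results. Based on this observation, we develop a two-stage approach to enhance model performance. The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer. The second stage focuses on precision: an answer-aware contrastive learning (AA-CL) mechanism is developed to learn the fine differences between the accurate answer and other candidates. Our extensive experiments show that our model significantly outperforms a series of strong baselines on two cross-lingual MRC benchmark datasets.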
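The top-1 versus top-k observation can be made concrete with a small hit-rate computation. The sketch below is illustrative only: `hit_at_k` is a hypothetical helper, and the candidate lists and gold answers are made-up data, not results from the paper.

```python
def hit_at_k(ranked_predictions, gold_answers, k):
    """Fraction of examples whose gold answer appears among the first k candidates."""
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if gold in preds[:k]
    )
    return hits / len(gold_answers)


# Made-up candidate lists for three questions, best-scored candidate first.
ranked = [
    ["Paris", "Lyon", "Nice"],
    ["1990", "1989", "1991"],
    ["blue", "green", "red"],
]
gold = ["Paris", "1989", "red"]

top1 = hit_at_k(ranked, gold, k=1)  # only the first question is answered at rank 1
top3 = hit_at_k(ranked, gold, k=3)  # every gold answer is somewhere in the top 3
```

A gap between `top1` and `top3` of this kind is what motivates a recall-oriented first stage followed by a precision-oriented re-ranking stage.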
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
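Meta-learning with auxiliary languages has demonstrated promising improvements for cross-lingual natural language processing. However, previous studies sample the meta-training and meta-testing data from the same language, which limits the model's ability to transfer across languages. In this paper, we propose XLA-MAML, which performs direct cross-lingual adaptation in the meta-learning stage. We conduct zero-shot and few-shot experiments on natural language inference and question answering. The experimental results demonstrate the effectiveness of our method across different languages, tasks, and pretrained models. We also analyse various cross-lingual-specific settings for meta-learning, including sampling strategy and parallelism.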
Pre-trained multilingual language models show significant performance gains for zero-shot cross-lingual model transfer on a wide range of natural language understanding (NLU) tasks. Previously, for zero-shot cross-lingual evaluation, pre-trained models are only fine-tuned on English data and tested on a variety of target languages. In this paper, we do cross-lingual evaluation on various NLU tasks (sentence classification, sequence labeling, question answering) using prompt-tuning and compare it with fine-tuning. The results show that prompt tuning achieves much better cross-lingual transfer than fine-tuning across datasets, with only 0.1% to 0.3% tuned parameters. Additionally, we demonstrate through the analysis that prompt tuning can have better cross-lingual transferability of representations on downstream tasks with better aligned decision boundaries.
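Recent work has shown evidence that the knowledge acquired by multilingual BERT (mBERT) has two components: a language-specific one and a language-neutral one. This paper analyses the relationship between them in the context of fine-tuning on two tasks, POS tagging and natural language inference, which require the model to bring to bear different degrees of language-specific knowledge. Visualisations reveal that mBERT loses the ability to cluster representations by language after fine-tuning, a result that is supported by evidence from language identification experiments. However, further experiments on "unlearning" language-specific representations using gradient reversal and iterative adversarial learning are shown not to add further improvement to the language-independent component over and above the effect of fine-tuning. The results presented here suggest that the process of fine-tuning causes a reorganisation of the model's limited representational capacity, enhancing language-independent representations at the expense of language-specific ones.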
Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance. However, these models focus only on understanding tasks utilizing encoder-only architecture. In this paper, we propose ERNIE-UniX2, a unified cross-lingual cross-modal pre-training framework for both generation and understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms (e.g., contrastive learning and language modeling) based on encoder-decoder architecture and attempts to learn a better joint representation across languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned for varieties of generation and understanding downstream tasks. Pre-trained on both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA results on various cross-lingual cross-modal generation and understanding tasks such as multimodal machine translation and multilingual visual question answering.
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
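The mark-then-translate idea can be sketched in a few lines: wrap the labeled spans in markers, send the marked sentence through a translation system, then read the spans back off the markers in the output. In the sketch below the square-bracket marker pair, the helper names `mark_spans` and `project_spans`, and the German sentence standing in for a translation system's output are all assumptions for illustration; this is not the EasyProject implementation.

```python
import re

OPEN, CLOSE = "[", "]"  # assumed marker pair; the text must not contain these already


def mark_spans(text, spans):
    """Wrap each (start, end) character span in markers; spans must not overlap."""
    out, prev = [], 0
    for start, end in sorted(spans):
        out.append(text[prev:start])
        out.append(OPEN + text[start:end] + CLOSE)
        prev = end
    out.append(text[prev:])
    return "".join(out)


def project_spans(marked_translation):
    """Strip markers from a translated string and return (clean_text, spans)."""
    clean, spans = [], []
    pos, last = 0, 0  # pos: offset in the clean text; last: offset in the marked text
    for match in re.finditer(r"\[(.*?)\]", marked_translation):
        before = marked_translation[last:match.start()]
        clean.append(before)
        pos += len(before)
        inner = match.group(1)
        spans.append((pos, pos + len(inner)))  # span offsets in the marker-free text
        clean.append(inner)
        pos += len(inner)
        last = match.end()
    clean.append(marked_translation[last:])
    return "".join(clean), spans
```

For example, `mark_spans("Marie Curie was born in Warsaw.", [(0, 11), (24, 30)])` produces a marked sentence to translate. If a translator returned the hypothetical output `"[Marie Curie] wurde in [Warschau] geboren."`, `project_spans` would recover the clean German sentence together with character offsets for both entities.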
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without
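Visual question answering (VQA) has been studied primarily through the lens of the English language. Yet tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multilingual visual question answering (mVQA) on both the data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers. We then apply our framework to the multilingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MaVERICS-XM3600 (MaXM), a test-only VQA benchmark in 7 diverse languages. Finally, we propose an approach to unified, extensible, open-ended, and end-to-end mVQA modeling and demonstrate strong performance in 13 languages.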
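For many tasks, state-of-the-art results have been achieved with Transformer-based architectures, shifting practice from task-specific architectures to the fine-tuning of pretrained language models. The ongoing trend is to train models with ever-increasing amounts of data and parameters, which requires considerable resources. This has led to an intensive search for better resource efficiency through algorithmic and hardware improvements that are evaluated only on English. It raises questions about the usability of these improvements on small-scale learning problems where only a limited amount of training data is available, especially for tasks in under-resourced languages. The lack of appropriately sized corpora is a hindrance to applying data-driven and transfer-learning-based approaches. In this paper, we survey recent efforts dedicated to the usability of Transformer-based models and propose to evaluate these improvements on French, a language with few resources. We address the instability related to data scarcity by investigating various training strategies involving data augmentation, hyperparameter optimization, and cross-lingual transfer. We also introduce FrALBERT, a new compact model for French, which proves to be competitive in low-resource settings.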
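Cross-lingual pretraining has achieved great success using monolingual and bilingual plain-text corpora. However, most pretrained models neglect multilingual knowledge, which is language-agnostic but comprises abundant cross-lingual structure alignment. In this paper, we propose XLM-K, a cross-lingual language model that incorporates multilingual knowledge in pretraining. XLM-K augments existing multilingual pretraining with two knowledge tasks, namely a masked entity prediction task and an object entailment task. We evaluate XLM-K on MLQA, NER, and XNLI. Experimental results clearly demonstrate significant improvements over existing multilingual language models. The results on MLQA and NER show the superiority of XLM-K on knowledge-related tasks. The success on XNLI shows the better cross-lingual transferability obtained by XLM-K. Moreover, we provide a detailed probing analysis to confirm that the desired knowledge is captured by our pretraining regimen.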
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in crosslingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.