In the absence of readily available labeled data for a given task and language, annotation projection has been proposed as one of the possible strategies to automatically generate annotated data which may then be used to train supervised systems. Annotation projection has often been formulated as the task of projecting, on parallel corpora, some labels from a source into a target language. In this paper we present T-Projection, a new approach for annotation projection that leverages large pretrained text2text language models and state-of-the-art machine translation technology. T-Projection decomposes the label projection task into two subtasks: (i) The candidate generation step, in which a set of projection candidates using a multilingual T5 model is generated and, (ii) the candidate selection step, in which the candidates are ranked based on translation probabilities. We evaluate our method in three downstream tasks and five different languages. Our results show that T-projection improves the average F1 score of previous methods by more than 8 points.
translated by 谷歌翻译
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
translated by 谷歌翻译
We present, Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. In each language, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated testsets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing testsets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.
translated by 谷歌翻译
一种有效的横向传输方法是在一种语言中微调在监督数据集上的双语或多语言模型,并以零拍方式在另一种语言上进行评估。在培训时间或推理时间翻译例子也是可行的替代方案。然而,存在与文献中很少有关的这些方法相关的成本。在这项工作中,我们在其有效性(例如,准确性),开发和部署成本方面分析交叉语言方法,以及推理时间的延迟。我们的三个任务的实验表明最好的交叉方法是高度任务依赖性的。最后,通过结合零射和翻译方法,我们在这项工作中使用的三个数据集中实现了最先进的。基于这些结果,我们对目标语言手动标记的培训数据有所了解。代码和翻译的数据集可在https://github.com/unicamp-dl/cross-lingsual-analysis上获得
translated by 谷歌翻译
句子嵌入通常用于文本聚类和语义检索任务中。最先进的句子表示方法基于大量手动标记句子对集合的人工神经网络。高资源语言(例如英语或中文)可以使用足够数量的注释数据。在不太受欢迎的语言中,必须使用多语言模型,从而提供较低的性能。在本出版物中,我们通过提出一种培训有效的语言特定句子编码的方法来解决此问题,而无需手动标记数据。我们的方法是从句子对准双语文本语料库中自动构建释义对数据集。然后,我们使用收集的数据来微调具有附加复发池层的变压器语言模型。我们的句子编码器可以在不到一天的时间内在一张图形卡上进行培训,从而在各种句子级的任务上实现高性能。我们在波兰语中评估了八个语言任务的方法,并将其与最佳可用多语言句子编码器进行比较。
translated by 谷歌翻译
翻译质量估计(QE)是预测机器翻译(MT)输出质量的任务,而无需任何参考。作为MT实际应用中的重要组成部分,这项任务已越来越受到关注。在本文中,我们首先提出了XLMRScore,这是一种基于使用XLM-Roberta(XLMR)模型计算的BertScore的简单无监督的QE方法,同时讨论了使用此方法发生的问题。接下来,我们建议两种减轻问题的方法:用未知令牌和预训练模型的跨语性对准替换未翻译的单词,以表示彼此之间的一致性单词。我们在WMT21 QE共享任务的四个低资源语言对上评估了所提出的方法,以及本文介绍的新的英语FARSI测试数据集。实验表明,我们的方法可以在两个零射击方案的监督基线中获得可比的结果,即皮尔森相关性的差异少于0.01,同时在所有低资源语言对中的平均低资源语言对中的无人看管竞争对手的平均水平超过8%的平均水平超过8%。 。
translated by 谷歌翻译
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark 1 to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
translated by 谷歌翻译
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in crosslingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
translated by 谷歌翻译
The word alignment task, despite its prominence in the era of statistical machine translation (SMT), is niche and under-explored today. In this two-part tutorial, we argue for the continued relevance for word alignment. The first part provides a historical background to word alignment as a core component of the traditional SMT pipeline. We zero-in on GIZA++, an unsupervised, statistical word aligner with surprising longevity. Jumping forward to the era of neural machine translation (NMT), we show how insights from word alignment inspired the attention mechanism fundamental to present-day NMT. The second part shifts to a survey approach. We cover neural word aligners, showing the slow but steady progress towards surpassing GIZA++ performance. Finally, we cover the present-day applications of word alignment, from cross-lingual annotation projection, to improving translation.
translated by 谷歌翻译
MARCO排名数据集已广泛用于培训IR任务的深度学习模型,在不同的零射击方案上实现了相当大的效果。但是,这种类型的资源是英语以外的语言的稀缺。在这项工作中,我们呈现MMARCO,MS Marco段落的多语言版本,该数据集包括使用机器翻译创建的13种语言。我们通过微调单语和多语言重新排名模型以及此数据集的密集多语言模型进行了评估。实验结果表明,在我们翻译的数据集上微调微调的多语言模型可以单独对原始英文版的模型进行微调的卓越效果。我们蒸馏的多语言RE-RANKER与非蒸馏模型具有竞争力,而参数较少的5.4倍。最后,我们展现了翻译质量和检索效果之间的正相关性,提供了证据,即翻译方法的改进可能导致多语言信息检索的改进。翻译的数据集和微调模型可在https://github.com/unicamp-dl/mmarco.git上获得。
translated by 谷歌翻译
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate\footnote{for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done.} our datasets. To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
translated by 谷歌翻译
多语言语言模型(\ mllms),如mbert,xlm,xlm-r,\ textit {etc。}已成为一种可行的选择,使预先估计到大量语言的力量。鉴于他们的成功在零射击转移学习中,在(i)建立更大的\ mllms〜覆盖了大量语言(ii)创建覆盖更广泛的任务和语言来评估的详尽工作基准mllms〜(iii)分析单音零点,零拍摄交叉和双语任务(iv)对Monolingual的性能,了解\ mllms〜(v)增强(通常)学习的通用语言模式(如果有的话)有限的容量\ mllms〜以提高他们在已见甚至看不见语言的表现。在这项调查中,我们审查了现有的文学,涵盖了上述与\ MLLMS有关的广泛研究领域。根据我们的调查,我们建议您有一些未来的研究方向。
translated by 谷歌翻译
机器翻译(MT)的单词级质量估计(QE)旨在在不参考的情况下找出翻译句子中的潜在翻译错误。通常,关于文字级别量化宽松的传统作品旨在根据文章编辑工作来预测翻译质量,其中通过比较MT句子之间的单词来自动生成单词标签(“ OK”和“ BAD”)。通过翻译错误率(TER)工具包编辑的句子。虽然可以使用后编辑的工作来在一定程度上测量翻译质量,但我们发现它通常与人类对单词是否良好或翻译不良的判断相抵触。为了克服限制,我们首先创建了一个金色基准数据集,即\ emph {hjqe}(人类对质量估计的判断),专家翻译直接注释了对其判断的不良翻译单词。此外,为了进一步利用平行语料库,我们提出了使用两个标签校正策略的自我监督的预训练,即标记改进策略和基于树的注释策略,以使基于TER的人工量化量子ceper更接近\ emph {HJQE}。我们根据公开可用的WMT en-de和en-ZH Corpora进行实质性实验。结果不仅表明我们提出的数据集与人类的判断更加一致,而且还确认了提议的标签纠正策略的有效性。 。}
translated by 谷歌翻译
Multilingual Pretrained Language Models (MPLMs) have shown their strong multilinguality in recent empirical cross-lingual transfer studies. In this paper, we propose the Prompts Augmented by Retrieval Crosslingually (PARC) pipeline to improve the zero-shot performance on low-resource languages (LRLs) by augmenting the context with semantically similar sentences retrieved from a high-resource language (HRL) as prompts. PARC improves the zero-shot performance on three downstream tasks (binary sentiment classification, topic categorization and natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in both unlabeled settings (+5.1%) and labeled settings (+16.3%). PARC-labeled also outperforms the finetuning baseline by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between the high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.
translated by 谷歌翻译
现有的多方对话数据集用于核心分辨率是新生的,许多挑战仍然没有解决。我们根据电视成绩单为此任务创建了一个大规模数据集,多语言多方CoreF(MMC)。由于使用多种语言的黄金质量字幕可用,我们建议重复注释以通过注释投影以其他语言(中文和Farsi)创建银色核心数据。在黄金(英语)数据上,现成的模型在MMC上的性能相对较差,这表明MMC比以前的数据集更广泛地覆盖多方核心。在银数据上,我们发现成功使用它进行数据增强和从头开始训练,这有效地模拟了零击的跨语性设置。
translated by 谷歌翻译
We present DualNER, a simple and effective framework to make full use of both annotated source language corpus and unlabeled target language text for zero-shot cross-lingual named entity recognition (NER). In particular, we combine two complementary learning paradigms of NER, i.e., sequence labeling and span prediction, into a unified multi-task framework. After obtaining a sufficient NER model trained on the source data, we further train it on the target data in a {\it dual-teaching} manner, in which the pseudo-labels for one task are constructed from the prediction of the other task. Moreover, based on the span prediction, an entity-aware regularization is proposed to enhance the intrinsic cross-lingual alignment between the same entities in different languages. Experiments and analysis demonstrate the effectiveness of our DualNER. Code is available at https://github.com/lemon0830/dualNER.
translated by 谷歌翻译
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
translated by 谷歌翻译
我们提出了一个针对德国医学自然语言处理的统计模型,该模型训练了命名实体识别(NER),作为开放的公开模型。这项工作是我们第一个Gernerm模型的精致继任者,我们的工作大大优于我们的工作。我们证明了结合多种技术的有效性,以通过在预审预测的深度语言模型(LM),单词平衡和神经机器翻译上转移学习的方式来实现实体识别绩效。由于开放的公共医疗实体识别模型在德国文本上的稀疏情况,这项工作为医疗NLP作为基线模型的德国研究社区提供了好处。由于我们的模型基于公共英语数据,因此提供了其权重,而无需法律限制使用和分发。示例代码和统计模型可在以下网址获得:https://github.com/frankkramer-lab/gernermed-pp
translated by 谷歌翻译
对于多语言序列到序列预审预周序模型(多语言SEQ2SEQ PLM),例如姆巴特(Mbart),自制的预处理任务接受了多种单语言的培训,例如25种来自CommonCrawl的语言,而下游的跨语言任务通常在双语语言子集上进行,例如英语 - 德国人,存在数据差异,即领域的差异,以及跨语言学习客观差异,即在训练和填充阶段之间的任务差异。为了弥合上述跨语言域和任务差距,我们将使用额外的代码切换恢复任务扩展了香草预后管道。具体而言,第一阶段采用自我监督的代码转换还原任务作为借口任务,从而允许多语言SEQ2SEQ PLM获取一些域内对齐信息。在第二阶段,我们正常在下游数据上微调模型。 NLG评估(12个双语翻译任务,30个零射击任务和2项跨语言摘要任务)和NLU评估(7个跨语性自然语言推理任务)的实验表明,我们的模型超过了强大的基线MBART,具有标准的FINETUNNING,这表明了我们的模型策略,一致。分析表明,我们的方法可以缩小跨语性句子表示的欧几里得距离,并通过微不足道的计算成本改善模型概括。我们在:https://github.com/zanchangtong/csr4mbart上发布代码。
translated by 谷歌翻译
由于包括架构改进和转移学习的效果,AMR Parsing在过去三年中经历了不起起的表现增加。自学习技术也在推动性能方面发挥作用。然而,对于最近的高性能解析器,自学和银数据生成的效果似乎褪色。在本文中,我们表明,通过将基于Spatch的集合技术与集合蒸馏组合来克服这一减少的银数据的递减。在一个广泛的实验设置中,我们首次推出超过85次Spatch以上的单一模型英语解析器性能并返回大量收益。我们还为中国,德语,意大利语和西班牙语进行了跨语态amr解析的新型最先进的。最后,我们探讨了所提出的蒸馏技术对领域适应的影响,并表明它可以产生竞争对QALD-9的人类注释数据的增益,并为生物群体实现新的最先进。
translated by 谷歌翻译