Transformer language models (TLMs) are critical for most NLP tasks, but they are difficult to create for low-resource languages because of the amount of pretraining data they require. In this work, we investigate two techniques for training monolingual TLMs in a low-resource setting: greatly reducing TLM size, and complementing the masked language modeling objective with two linguistically rich supervised tasks (part-of-speech tagging and dependency parsing). Results from 7 diverse languages indicate that our model, MicroBERT, is able to produce marked improvements in downstream task evaluations relative to a typical monolingual TLM pretraining approach. Specifically, we find that monolingual MicroBERT models achieve gains of up to 18% for parser LAS and 11% for NER F1 compared to a multilingual baseline, mBERT, while having less than 1% of its parameter count. We conclude that reducing TLM parameter count and using labeled data for pretraining low-resource TLMs can yield large quality benefits and, in some cases, produce models that outperform multilingual approaches.
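The recipe this abstract describes, pretraining a very small transformer with masked language modeling plus supervised POS-tagging and dependency-parsing heads, can be sketched roughly as below. This is a minimal illustration, not the authors' released code: the layer sizes, the bilinear arc scorer, and the equal loss weighting are all assumptions.

```python
# Minimal sketch of MLM + POS + dependency-arc multitask pretraining (illustrative sizes).
import torch
import torch.nn as nn

class TinyMultiTaskEncoder(nn.Module):
    def __init__(self, vocab_size=8000, n_pos_tags=17, d_model=128, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mlm_head = nn.Linear(d_model, vocab_size)    # masked-token prediction
        self.pos_head = nn.Linear(d_model, n_pos_tags)    # POS tagging
        self.arc_head = nn.Bilinear(d_model, d_model, 1)  # head-dependent arc scores

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))           # (batch, seq, d_model)
        return h, self.mlm_head(h), self.pos_head(h)

    def arc_scores(self, h):
        # scores[b, i, j] = plausibility of token i being the head of token j
        n = h.size(1)
        heads = h.unsqueeze(2).expand(-1, -1, n, -1).contiguous()
        deps = h.unsqueeze(1).expand(-1, n, -1, -1).contiguous()
        return self.arc_head(heads, deps).squeeze(-1)

def multitask_loss(model, token_ids, mlm_labels, pos_labels, head_labels):
    """Label tensors use -100 at positions that should not contribute to the loss."""
    h, mlm_logits, pos_logits = model(token_ids)
    ce = nn.CrossEntropyLoss(ignore_index=-100)
    loss_mlm = ce(mlm_logits.transpose(1, 2), mlm_labels)   # (batch, vocab, seq) vs (batch, seq)
    loss_pos = ce(pos_logits.transpose(1, 2), pos_labels)
    loss_arc = ce(model.arc_scores(h), head_labels)          # classify the head of each token
    return loss_mlm + loss_pos + loss_arc                    # equal weighting is an assumption

# Toy forward/backward pass with random data.
model = TinyMultiTaskEncoder()
ids = torch.randint(0, 8000, (2, 12))
loss = multitask_loss(model, ids,
                      mlm_labels=torch.randint(0, 8000, (2, 12)),
                      pos_labels=torch.randint(0, 17, (2, 12)),
                      head_labels=torch.randint(0, 12, (2, 12)))
loss.backward()
print(float(loss))
```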
Pretrained multilingual models have become a common tool for transferring NLP capabilities to low-resource languages, often with adaptations. In this work, we study the performance, extensibility, and interaction of two such adaptations: vocabulary augmentation and script transliteration. Our evaluations on part-of-speech tagging, universal dependency parsing, and named entity recognition in nine diverse low-resource languages uphold the viability of these approaches while raising new questions about how best to adapt multilingual models to low-resource settings.
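The two adaptations named above can be illustrated with the Hugging Face API as follows; this is a hedged sketch, not the paper's pipeline. The added subwords and the character table are made-up placeholders, and the base checkpoint is just one plausible choice.

```python
# (1) vocabulary augmentation and (2) script transliteration, shown on mBERT.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# (1) Vocabulary augmentation: add frequent target-language subwords, then resize
# the embedding matrix so the new rows can be trained during adaptation.
new_subwords = ["##ologia", "##mente", "xyzword"]   # hypothetical examples
num_added = tokenizer.add_tokens(new_subwords)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab is now {len(tokenizer)}")

# (2) Script transliteration: map an unseen script onto one covered by the
# pretraining data before tokenization (toy character table shown here).
TRANSLIT = {"ё": "e", "џ": "dz"}                     # placeholder mapping
def transliterate(text):
    return "".join(TRANSLIT.get(ch, ch) for ch in text)

print(tokenizer.tokenize(transliterate("џem")))
```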
Pretrained multilingual contextual representations have shown great success, but because of the limits of their pretraining data, their benefits do not extend equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled \emph{and unlabeled} data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models' pretraining data and the target language varieties.
Multilingual language models (MLLMs) such as mBERT, XLM, XLM-R, \textit{etc.} have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, there has been extensive work on (i) building larger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero-shot cross-lingual, and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering these broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.
Large pretrained language models (LMs) have recently become increasingly popular. Training these models requires ever more computational resources, and most existing models are trained on English text only. Training such models in other languages is extremely expensive. To alleviate this problem, we introduce a method called WECHSEL to transfer English models to new languages. We exchange the tokenizer of the English model for a tokenizer in the target language and initialize the token embeddings by exploiting multilingual static word embeddings that cover English and the target language. We use WECHSEL to transfer GPT-2 and RoBERTa models to four other languages (French, German, Chinese, and Swahili). WECHSEL improves over previously proposed cross-lingual parameter transfer and outperforms models of comparable size trained from scratch in the target language, while requiring far less training. Our method makes training large language models for new languages more accessible and less damaging to the environment. We release our code and models.
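A minimal numpy sketch of the initialization idea described above: each target-language token embedding is built as a similarity-weighted combination of the source (English) token embeddings, with similarity computed in an aligned multilingual static embedding space. The shapes, the number of neighbours k, and the temperature are illustrative assumptions, and the authors' released implementation differs in detail.

```python
import numpy as np

def init_target_embeddings(src_model_emb,    # (V_src, d_model) trained source embeddings
                           src_static,       # (V_src, d_static) aligned static vectors
                           tgt_static,       # (V_tgt, d_static) aligned static vectors
                           k=10, temperature=0.1):
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    sims = l2norm(tgt_static) @ l2norm(src_static).T         # (V_tgt, V_src) cosine similarities
    tgt_emb = np.empty((tgt_static.shape[0], src_model_emb.shape[1]))
    for i, row in enumerate(sims):
        nn_idx = np.argpartition(row, -k)[-k:]                # k nearest source tokens
        w = np.exp(row[nn_idx] / temperature)
        w /= w.sum()
        tgt_emb[i] = w @ src_model_emb[nn_idx]                # weighted average of their embeddings
    return tgt_emb

# Toy usage with random stand-ins for the real embedding tables.
rng = np.random.default_rng(0)
emb = init_target_embeddings(rng.normal(size=(1000, 768)),
                             rng.normal(size=(1000, 300)),
                             rng.normal(size=(200, 300)))
print(emb.shape)   # (200, 768)
```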
Dense word vectors, or "word embeddings", which encode semantic properties of words, have now become integral to NLP tasks like machine translation (MT), question answering (QA), word sense disambiguation (WSD), and information retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place the embeddings for all these languages, namely Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu, in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements but require large amounts of resources to produce usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual embeddings for all the above languages. To show the efficacy of our embeddings, we evaluate our embedding models on XPOS, UPOS, and NER tasks for all these languages. We release a total of 436 models using 8 different approaches. We hope they are useful for resource-constrained Indian language NLP. The title of this paper refers to Forster's famous novel "A Passage to India", first published in 1924.
Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks, which greatly facilitates applications of natural language processing in low-resource languages. However, there are still some languages on which current multilingual models perform poorly. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of multilingual models on minority languages, we collect documents from Wikipedia and news websites and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at http://cino.hfl-rc.com.
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate\footnote{for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done.} our datasets. To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) Previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) Results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue for the need of more research to understand the factors underlying them. In this sense, the effect of corpus size, quality and pre-training techniques needs to be further investigated to be able to obtain Spanish monolingual models significantly better than the multilingual ones released by large private companies, especially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem which requires marrying resources (monetary and/or computational) with the best research expertise and practice.
Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, a pixel-based language encoder that suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches instead of predicting a distribution over tokens. We pretrain an 86M-parameter PIXEL model on the same English data as BERT and evaluate it on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks for scripts not found in the pretraining data, but PIXEL is somewhat weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.
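A toy sketch of the input pipeline the abstract describes: render a string to a grayscale image, split it into fixed-size patches, and mask a random subset of patches whose pixels the model would be trained to reconstruct. The canvas size, patch size, and mask ratio are illustrative assumptions; PIXEL's actual renderer and its ViT-MAE-style encoder are considerably more involved.

```python
import numpy as np
from PIL import Image, ImageDraw

def render_text(text, height=16, width=256):
    img = Image.new("L", (width, height), color=255)        # white grayscale canvas
    ImageDraw.Draw(img).text((2, 2), text, fill=0)           # default bitmap font
    return np.asarray(img, dtype=np.float32) / 255.0

def to_patches(pixels, patch=16):
    h, w = pixels.shape
    cols = w // patch
    # split the width axis into column patches: (n_patches, height, patch)
    return pixels[:, :cols * patch].reshape(h, cols, patch).transpose(1, 0, 2)

def mask_patches(patches, ratio=0.25, seed=0):
    rng = np.random.default_rng(seed)
    masked = patches.copy()
    idx = rng.choice(len(patches), size=int(ratio * len(patches)), replace=False)
    masked[idx] = 0.5                                         # grey out masked patches
    return masked, idx                                        # reconstruction targets = patches[idx]

patches = to_patches(render_text("language modelling with pixels"))
masked, idx = mask_patches(patches)
print(patches.shape, idx)
```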
Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on downstream tasks for both high-resource and low-resource languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches to adapt to a new language is \textit{language adaptive fine-tuning} (LAFT) -- fine-tuning a multilingual PLM on monolingual text in the target language with the pre-training objective. However, adapting to each target language individually takes large disk space and limits the cross-lingual transfer abilities of the resulting models, since they have been specialized to a single language. In this paper, we perform \textit{multilingual adaptive fine-tuning} (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we remove vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50%. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive with applying LAFT on individual languages while requiring significantly less disk space. Additionally, we show that our adapted PLMs also improve the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.
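The vocabulary-trimming step described above can be sketched roughly as below: drop embedding rows for subword tokens written in scripts the adapted model will not need, then rebuild a smaller embedding matrix and token-to-id mapping. The Unicode-name script test and the list of allowed scripts are crude illustrations, not the authors' exact filtering rule.

```python
import unicodedata
import torch
import torch.nn as nn

ALLOWED = ("LATIN", "ARABIC", "ETHIOPIC")   # example scripts; not the authors' exact list

def keep_token(token):
    for ch in token:
        if ch.isalpha() and not unicodedata.name(ch, "").startswith(ALLOWED):
            return False
    return True

def trim_embeddings(vocab, embedding):
    kept = sorted((i, tok) for tok, i in vocab.items() if keep_token(tok))
    keep_ids = [i for i, _ in kept]
    new_vocab = {tok: new_i for new_i, (_, tok) in enumerate(kept)}
    new_emb = nn.Embedding(len(kept), embedding.embedding_dim)
    with torch.no_grad():
        new_emb.weight.copy_(embedding.weight[keep_ids])     # reuse the pretrained rows
    return new_vocab, new_emb

# Toy usage with a miniature sentencepiece-style vocabulary.
vocab = {"<s>": 0, "▁nairobi": 1, "▁москва": 2, "▁addis": 3}
old_emb = nn.Embedding(len(vocab), 8)
new_vocab, new_emb = trim_embeddings(vocab, old_emb)
print(new_vocab)                 # the Cyrillic token is dropped
print(new_emb.weight.shape)      # (3, 8)
```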
The BLOOM model is a large open-source multilingual language model capable of zero-shot learning, but its pretraining was limited to 46 languages. To improve its zero-shot performance on unseen languages, it is desirable to adapt BLOOM, but previous works have only explored adapting small language models. In this work, we apply existing language adaptation strategies to BLOOM and benchmark its zero-shot prompting performance on eight new languages. We find language adaptation to be effective at improving zero-shot performance in new languages. Surprisingly, adapter-based finetuning is more effective than continued pretraining for large models. In addition, we discover that prompting performance is not significantly affected by language specifics, such as the writing system. It is primarily determined by the size of the language adaptation data. We also add new languages to BLOOMZ, which is a multitask finetuned version of BLOOM capable of following task instructions zero-shot. We find including a new language in the multitask fine-tuning mixture to be the most effective method to teach BLOOMZ a new language. We conclude that with sufficient training data language adaptation can generalize well to diverse languages. Our code is available at \url{https://github.com/bigscience-workshop/multilingual-modeling/}.
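As a rough illustration of adapter-based finetuning of BLOOM, the sketch below uses LoRA through the peft library as a stand-in; the paper evaluates its own set of adaptation strategies, and the checkpoint name, rank, and target modules here are illustrative choices rather than its exact configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "bigscience/bloom-560m"           # a small BLOOM checkpoint, for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["query_key_value"],         # BLOOM's fused attention projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()              # only the adapter weights are trainable

# Language adaptation would then continue causal-LM training on monolingual text
# in the new language, updating only the adapter parameters.
```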
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
While large pre-trained models have transformed the field of natural language processing (NLP), the high training cost and low cross-lingual availability of such models prevent the new advances from being equally shared by users across all languages, especially the less spoken ones. To promote equal opportunities for all language speakers in NLP research and to reduce energy consumption for sustainability, this study proposes an effective and energy-efficient framework GreenPLM that uses bilingual lexicons to directly translate language models of one language into other languages at (almost) no additional cost. We validate this approach in 18 languages and show that this framework is comparable to, if not better than, other heuristics trained with high cost. In addition, when given a low computational cost (2.5\%), the framework outperforms the original monolingual language models in six out of seven tested languages. We release language models in 50 languages translated from English and the source code here.
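The lexicon-based "translation" of a language model described above can be sketched in simplified form: each target-language word inherits the embedding of its source-language translation(s), yielding a target-language model at almost no training cost. The tiny vocabularies and lexicon below are made-up examples, and GreenPLM itself handles subword vocabularies and more bookkeeping.

```python
import numpy as np

def translate_embeddings(src_vocab, src_emb, tgt_vocab, lexicon, fallback=None):
    """lexicon: target word -> list of source-language translations."""
    fallback = fallback if fallback is not None else src_emb.mean(axis=0)
    tgt_emb = np.tile(fallback, (len(tgt_vocab), 1))
    for word, i in tgt_vocab.items():
        translations = [src_vocab[w] for w in lexicon.get(word, []) if w in src_vocab]
        if translations:
            tgt_emb[i] = src_emb[translations].mean(axis=0)   # average over translations
    return tgt_emb

# Toy usage: "translate" three English word embeddings into German slots.
src_vocab = {"dog": 0, "house": 1, "cat": 2}
tgt_vocab = {"hund": 0, "haus": 1, "katze": 2}
lexicon = {"hund": ["dog"], "haus": ["house"], "katze": ["cat"]}
tgt_emb = translate_embeddings(src_vocab, np.random.randn(3, 768), tgt_vocab, lexicon)
print(tgt_emb.shape)   # (3, 768)
```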
Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While research shows that monolingual models produce better results than multilingual ones, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare the created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve on the results of existing models on all tested tasks in most situations.
There is an ongoing debate in the NLP community on whether modern language models contain linguistic knowledge, recovered through so-called probes. In this paper, we study whether linguistic knowledge is a necessary condition for the good performance of modern language models, which we call the \textit{rediscovery hypothesis}. First, we show that language models that are significantly compressed but perform well on their pretraining objective retain good scores when probed for linguistic structure. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates the language modeling objective to linguistic information. The framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with experiments on both synthetic and real NLP tasks in English.
This paper presents the work of restoring punctuation for ASR transcripts generated by multilingual ASR systems. The focus languages are English, Mandarin, and Malay, which are three of the most popular languages in Singapore. To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as a sequential labeling task; this work, however, adopts a slot-filling approach that predicts the presence and type of punctuation marks at each word boundary. The approach is similar to the Masked-Language Model approach employed during the pre-training stages of BERT, but instead of predicting the masked word, our model predicts masked punctuation. Additionally, we find that using Jieba instead of only using the built-in SentencePiece tokenizer of XLM-R can significantly improve the performance of punctuating Mandarin transcripts. Experimental results on the English and Mandarin IWSLT2022 datasets and Malay News show that the proposed approach achieved state-of-the-art results for Mandarin with a 73.8% F1-score while maintaining reasonable F1-scores for English and Malay, i.e. 74.7% and 78% respectively. Our source code, which allows reproducing the results and building a simple web-based application for demonstration purposes, is available on GitHub.
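A small sketch of the slot-filling formulation: insert a mask slot after every word of the unpunctuated transcript and classify each slot as one of a fixed set of punctuation labels. Only the data preparation is shown; the label set is an illustrative assumption, and the actual system fine-tunes XLM-R (with Jieba word segmentation for Mandarin), which is omitted here.

```python
import re

PUNCT_LABELS = {"": 0, ",": 1, ".": 2, "?": 3}   # "none" plus three mark types (illustrative)
MASK = "<mask>"                                   # XLM-R's mask token

def make_example(punctuated):
    """Turn a punctuated sentence into a masked input and per-slot labels."""
    tokens = re.findall(r"\w+|[,.?]", punctuated)
    words, labels = [], []
    for tok in tokens:
        if tok in PUNCT_LABELS and words:
            labels[-1] = PUNCT_LABELS[tok]        # the mark attaches to the previous word's slot
        else:
            words.append(tok)
            labels.append(PUNCT_LABELS[""])
    masked_input = " ".join(f"{w} {MASK}" for w in words)
    return masked_input, labels

x, y = make_example("where are you going , john ?")
print(x)   # "where <mask> are <mask> you <mask> going <mask> john <mask>"
print(y)   # [0, 0, 0, 1, 3]  ->  comma after "going", question mark after "john"
```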
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
This study provides the first comprehensive analysis of the performance of pre-trained language models for Sinhala text classification. We test on a set of diverse Sinhala text classification tasks, and our analysis shows that among the pre-trained multilingual models that include Sinhala (XLM-R, LaBSE, and LASER), XLM-R is by far the best model for Sinhala text classification. We also pre-train two RoBERTa-based monolingual Sinhala models, which far outperform the existing pre-trained language models for Sinhala. We show that, when fine-tuned, these pre-trained language models set a very strong baseline for Sinhala text classification and are robust in situations where labeled data is insufficient for fine-tuning. We further provide a set of recommendations for using pre-trained models for Sinhala text classification. We also introduce new annotated datasets useful for future research on Sinhala text classification and publicly release our pre-trained models.
The BERT family of neural language models have become highly popular due to their ability to provide sequences of text with rich context-sensitive token encodings which are able to generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.