这项研究提供了对僧伽罗文本分类的预训练语言模型的性能的首次全面分析。我们测试了一组不同的Sinhala文本分类任务,我们的分析表明,在包括Sinhala(XLM-R,Labse和Laser)的预训练的多语言模型中,XLM-R是迄今为止Sinhala文本的最佳模型分类。我们还预先培训了两种基于罗伯塔的单语僧伽罗模型,它们远远优于僧伽罗的现有预训练的语言模型。我们表明,在微调时,这些预训练的语言模型为僧伽罗文本分类树立了非常强大的基线,并且在标记数据不足以进行微调的情况下非常强大。我们进一步提供了一组建议,用于使用预训练的模型进行Sinhala文本分类。我们还介绍了新的注释数据集,可用于僧伽罗文本分类的未来研究,并公开发布我们的预培训模型。
translated by 谷歌翻译
多语言预训练的语言模型在跨语言任务上表现出了令人印象深刻的表现。它极大地促进了自然语言处理在低资源语言上的应用。但是,当前的多语言模型仍然有些语言表现不佳。在本文中,我们提出了Cino(中国少数族裔训练的语言模型),这是一种用于中国少数语言的多语言预训练的语言模型。它涵盖了标准的中文,Yue中文和其他六种少数民族语言。为了评估多语言模型在少数族裔语言上的跨语性能力,我们从Wikipedia和新闻网站收集文档,并构建两个文本分类数据集,WCM(Wiki-Chinese-Minority)和CMNEWS(中国最少的新闻)。我们表明,Cino在各种分类任务上的表现明显优于基准。Cino模型和数据集可在http://cino.hfl-rc.com上公开获得。
translated by 谷歌翻译
Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) Previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) Results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue for the need of more research to understand the factors underlying them. In this sense, the effect of corpus size, quality and pre-training techniques need to be further investigated to be able to obtain Spanish monolingual models significantly better than the multilingual ones released by large private companies, specially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem which requires to marry resources (monetary and/or computational) with the best research expertise and practice.
translated by 谷歌翻译
特定于语言的预训练模型已被证明比单语说在单语法评估设置中更准确,阿拉伯语也不例外。但是,我们发现先前发布的阿拉伯伯特模型显着培训。在这本技术报告中,我们展示了Jaber,Junior Arabic Bert,我们的预用语言模型原型专用于阿拉伯语。我们进行实证研究,以系统地评估模型在各种现有阿拉伯语NLU任务中的性能。实验结果表明,Jaber实现了Alue的最先进的表演,这是阿拉伯语了解评估的新基准,以及成熟的内部基准
translated by 谷歌翻译
文本分类是具有各种有趣应用程序的典型自然语言处理或计算语言学任务。随着社交媒体平台上的用户数量的增加,数据加速促进了有关社交媒体文本分类(SMTC)或社交媒体文本挖掘的新兴研究。与英语相比,越南人是低资源语言之一,仍然没有集中精力并彻底利用。受胶水成功的启发,我们介绍了社交媒体文本分类评估(SMTCE)基准,作为各种SMTC任务的数据集和模型的集合。借助拟议的基准,我们实施和分析了各种基于BERT的模型(Mbert,XLM-R和Distilmbert)和基于单语的BERT模型(Phobert,Vibert,Vibert,Velectra和Vibert4news)的有效性SMTCE基准。单语模型优于多语言模型,并实现所有文本分类任务的最新结果。它提供了基于基准的多语言和单语言模型的客观评估,该模型将使越南语言中有关贝尔特兰的未来研究有利。
translated by 谷歌翻译
预先训练的上下文化文本表示模型学习自然语言的有效表示,以使IT机器可以理解。在注意机制的突破之后,已经提出了新一代预磨模的模型,以便自变压器引入以来实现了良好的性能。来自变压器(BERT)的双向编码器表示已成为语言理解的最先进的模型。尽管取得了成功,但大多数可用的型号已经在印度欧洲语言中培训,但是对代表性的语言和方言的类似研究仍然稀疏。在本文中,我们调查了培训基于单语言变换器的语言模型的可行性,以获得代表语言的特定重点是突尼斯方言。我们评估了我们的语言模型对情感分析任务,方言识别任务和阅读理解问答任务。我们表明使用嘈杂的Web爬网数据而不是结构化数据(维基百科,文章等)更方便这些非标准化语言。此外,结果表明,相对小的Web爬网数据集导致与使用较大数据集获得的那些表现相同的性能。最后,我们在所有三个下游任务中达到或改善了最先进的Tunbert模型。我们释放出Tunbert净化模型和用于微调的数据集。
translated by 谷歌翻译
多语言预训练的语言模型(PLM)在高资源和低资源语言的下游任务上表现出令人印象深刻的表现。但是,在预培训期间,尤其是非洲语言中,看不见的语言仍然有很大的表现。适应新语言的最有效方法之一是\ textit {语言自适应微调}(LAFT) - 使用预训练目标对单语言的多语言PLM进行微调。但是,适应目标语言会单独使用大磁盘空间,并限制了由此产生的模型的跨语言转移能力,因为它们已经专门用于单语言。在本文中,我们对17种最重要的非洲语言和其他三种在非洲大陆上广泛使用的高资源语言对17种最具资源的非洲语言进行\ Textit {多语言自适应微调},以鼓励跨语性转移学习。为了进一步专注于多语言PLM,我们从嵌入式层中删除了与MAFT之前的非非洲写作脚本相对应的词汇令牌,从而将模型大小降低了约50%。我们对两个多语言PLM(Afriberta和XLM-R)和三个NLP任务(NER,新闻主题分类和情感分类)的评估表明,我们的方法可以在单个语言上应用LAFT,同时需要较小的磁盘空间。此外,我们表明我们的适应性PLM还提高了参数有效微调方法的零击跨语性转移能力。
translated by 谷歌翻译
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate\footnote{for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done.} our datasets. To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
translated by 谷歌翻译
Pre-trained language models are trained on large-scale unsupervised data, and they can be fine-tuned on small-scale labeled datasets and achieve good results. Multilingual pre-trained language models can be trained on multiple languages and understand multiple languages at the same time. At present, the research on pre-trained models mainly focuses on rich-resource language, while there is relatively little research on low-resource languages such as minority languages, and the public multilingual pre-trained language model can not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained language model named MiLMo that performs better on minority language tasks, including Mongolian, Tibetan, Uyghur, Kazakh and Korean. To solve the problem of scarcity of datasets on minority languages and verify the effectiveness of the MiLMo model, this paper constructs a minority multilingual text classification dataset named MiTC, and trains a word2vec model for each language. By comparing the word2vec model and the pre-trained model in the text classification task, this paper provides an optimal scheme for the downstream task research of minority languages. The final experimental results show that the performance of the pre-trained model is better than that of the word2vec model, and it has achieved the best results in minority multilingual text classification. The multilingual pre-trained language model MiLMo, multilingual word2vec model and multilingual text classification dataset MiTC are published on https://milmo.cmli-nlp.com.
translated by 谷歌翻译
In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing tasks (NLP). However, pre-training these large multilingual language models requires a lot of training data, which is not available for African Languages. Active learning is a semi-supervised learning algorithm, in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language models pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that \textbf{AfroLM} is able to generalize well across various domains. We release the code source, and our datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
translated by 谷歌翻译
句子嵌入通常用于文本聚类和语义检索任务中。最先进的句子表示方法基于大量手动标记句子对集合的人工神经网络。高资源语言(例如英语或中文)可以使用足够数量的注释数据。在不太受欢迎的语言中,必须使用多语言模型,从而提供较低的性能。在本出版物中,我们通过提出一种培训有效的语言特定句子编码的方法来解决此问题,而无需手动标记数据。我们的方法是从句子对准双语文本语料库中自动构建释义对数据集。然后,我们使用收集的数据来微调具有附加复发池层的变压器语言模型。我们的句子编码器可以在不到一天的时间内在一张图形卡上进行培训,从而在各种句子级的任务上实现高性能。我们在波兰语中评估了八个语言任务的方法,并将其与最佳可用多语言句子编码器进行比较。
translated by 谷歌翻译
Pre-trained transformers are now the de facto models in Natural Language Processing given their state-of-the-art results in many tasks and languages. However, most of the current models have been trained on languages for which large text resources are already available (such as English, French, Arabic, etc.). Therefore, there are still a number of low-resource languages that need more attention from the community. In this paper, we study the Algerian dialect which has several specificities that make the use of Arabic or multilingual models inappropriate. To address this issue, we collected more than one million Algerian tweets, and pre-trained the first Algerian language model: DziriBERT. When compared with existing models, DziriBERT achieves better results, especially when dealing with the Roman script. The obtained results show that pre-training a dedicated model on a small dataset (150 MB) can outperform existing models that have been trained on much more data (hundreds of GB). Finally, our model is publicly available to the community.
translated by 谷歌翻译
BERT,ROBERTA或GPT-3等复杂的基于注意力的语言模型的外观已允许在许多场景中解决高度复杂的任务。但是,当应用于特定域时,这些模型会遇到相当大的困难。诸如Twitter之类的社交网络就是这种情况,Twitter是一种不断变化的信息流,以非正式和复杂的语言编写的信息流,鉴于人类的重要作用,每个信息都需要仔细评估,即使人类也需要理解。通过自然语言处理解决该领域的任务涉及严重的挑战。当将强大的最先进的多语言模型应用于这种情况下,特定语言的细微差别用来迷失翻译。为了面对这些挑战,我们提出了\ textbf {bertuit},这是迄今为止针对西班牙语提出的较大变压器,使用Roberta Optimization进行了230m西班牙推文的大规模数据集进行了预培训。我们的动机是提供一个强大的资源,以更好地了解西班牙Twitter,并用于专注于该社交网络的应用程序,特别强调致力于解决该平台中错误信息传播的解决方案。对Bertuit进行了多个任务评估,并与M-Bert,XLM-Roberta和XLM-T进行了比较,该任务非常具有竞争性的多语言变压器。在这种情况下,使用应用程序显示了我们方法的实用性:一种可视化骗局和分析作者群体传播虚假信息的零击方法。错误的信息在英语以外的其他语言等平台上疯狂地传播,这意味着在英语说话之外转移时,变形金刚的性能可能会受到影响。
translated by 谷歌翻译
Pre-training large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Although this method has proven to be effective for many domains, it might not always provide desirable benefits. In this paper, we study the effects of hateful pre-training on low-resource hate speech classification tasks. While previous studies on the English language have emphasized its importance, we aim to augment their observations with some non-obvious insights. We evaluate different variations of tweet-based BERT models pre-trained on hateful, non-hateful, and mixed subsets of a 40M tweet dataset. This evaluation is carried out for the Indian languages Hindi and Marathi. This paper is empirical evidence that hateful pre-training is not the best pre-training option for hate speech detection. We show that pre-training on non-hateful text from the target domain provides similar or better results. Further, we introduce HindTweetBERT and MahaTweetBERT, the first publicly available BERT models pre-trained on Hindi and Marathi tweets, respectively. We show that they provide state-of-the-art performance on hate speech classification tasks. We also release hateful BERT for the two languages and a gold hate speech evaluation benchmark HateEval-Hi and HateEval-Mr consisting of manually labeled 2000 tweets each. The models and data are available at https://github.com/l3cube-pune/MarathiNLP .
translated by 谷歌翻译
大型预用屏蔽语言模型已成为许多NLP问题的最先进的解决方案。虽然研究表明,单晶模型产生比多语言模型产生更好的结果,但训练数据集必须足够大。我们培训了立陶宛,拉脱维亚语和英语的三种语言Litlat Bert样模型,以及爱沙尼亚的单语Est-Roberta模型。我们在四个下游任务中评估它们的性能:命名实体识别,依赖解析,词语标记和单词类比。为了分析对单一语言的重要性以及大型培训集的重要性,我们将创建的模型与爱沙尼亚,拉脱维亚和立陶宛人进行了现有的单语和多语言伯特模型。结果表明,新创建的Litlat Bert和Est-Roberta模型在大多数情况下改善了所有测试任务的现有模型的结果。
translated by 谷歌翻译
作为世界上第四大语言家庭,Dravidian语言已成为自然语言处理(NLP)的研究热点。虽然Dravidian语言包含大量语言,但有相对较少的公众可用资源。此外,文本分类任务是自然语言处理的基本任务,如何将其与Dravidian语言中的多种语言相结合,仍然是Dravidian自然语言处理的主要困难。因此,为了解决这些问题,我们为Dravidian语言提出了一个多语言文本分类框架。一方面,该框架使用Labse预先训练的模型作为基础模型。针对多任务学习中文本信息偏见的问题,我们建议使用MLM策略选择语言特定的单词,并使用对抗训练来扰乱它们。另一方面,鉴于模型无法识别和利用语言之间的相关性的问题,我们进一步提出了一种特定于语言的表示模块,以丰富模型的语义信息。实验结果表明,我们提出的框架在多语言文本分类任务中具有重要性能,每个策略实现某些改进。
translated by 谷歌翻译
Laws and their interpretations, legal arguments and agreements\ are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.
translated by 谷歌翻译
多语言语言模型(\ mllms),如mbert,xlm,xlm-r,\ textit {etc。}已成为一种可行的选择,使预先估计到大量语言的力量。鉴于他们的成功在零射击转移学习中,在(i)建立更大的\ mllms〜覆盖了大量语言(ii)创建覆盖更广泛的任务和语言来评估的详尽工作基准mllms〜(iii)分析单音零点,零拍摄交叉和双语任务(iv)对Monolingual的性能,了解\ mllms〜(v)增强(通常)学习的通用语言模式(如果有的话)有限的容量\ mllms〜以提高他们在已见甚至看不见语言的表现。在这项调查中,我们审查了现有的文学,涵盖了上述与\ MLLMS有关的广泛研究领域。根据我们的调查,我们建议您有一些未来的研究方向。
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
语言模型预训练的最新进展利用大规模数据集创建多语言模型。但是,这些数据集中大多遗漏了低资源语言。这主要是因为网络上没有很好地表示口语,因此被排除在用于创建数据集的大规模爬网中。此外,这些模型的下游用户仅限于最初选择用于预训练的语言的选择。这项工作调查了如何最佳利用现有的预培训模型来为16种非洲语言创建低资源翻译系统。我们关注两个问题:1)如何将预训练的模型用于初始预培训中未包含的语言? 2)生成的翻译模型如何有效地转移到新域?为了回答这些问题,我们创建了一个新的非洲新闻语料库,涵盖16种语言,其中8种语言不属于任何现有评估数据集的一部分。我们证明,将两种语言转移到其他语言和其他领域的最有效策略是,以少量的高质量翻译数据微调大型预训练模型。
translated by 谷歌翻译