Pre-trained transformers are now the de facto models in Natural Language Processing given their state-of-the-art results in many tasks and languages. However, most of the current models have been trained on languages for which large text resources are already available (such as English, French, Arabic, etc.). Therefore, there are still a number of low-resource languages that need more attention from the community. In this paper, we study the Algerian dialect which has several specificities that make the use of Arabic or multilingual models inappropriate. To address this issue, we collected more than one million Algerian tweets, and pre-trained the first Algerian language model: DziriBERT. When compared with existing models, DziriBERT achieves better results, especially when dealing with the Roman script. The obtained results show that pre-training a dedicated model on a small dataset (150 MB) can outperform existing models that have been trained on much more data (hundreds of GB). Finally, our model is publicly available to the community.
translated by 谷歌翻译
预先训练的上下文化文本表示模型学习自然语言的有效表示,以使IT机器可以理解。在注意机制的突破之后,已经提出了新一代预磨模的模型,以便自变压器引入以来实现了良好的性能。来自变压器(BERT)的双向编码器表示已成为语言理解的最先进的模型。尽管取得了成功,但大多数可用的型号已经在印度欧洲语言中培训,但是对代表性的语言和方言的类似研究仍然稀疏。在本文中,我们调查了培训基于单语言变换器的语言模型的可行性,以获得代表语言的特定重点是突尼斯方言。我们评估了我们的语言模型对情感分析任务,方言识别任务和阅读理解问答任务。我们表明使用嘈杂的Web爬网数据而不是结构化数据(维基百科,文章等)更方便这些非标准化语言。此外,结果表明,相对小的Web爬网数据集导致与使用较大数据集获得的那些表现相同的性能。最后,我们在所有三个下游任务中达到或改善了最先进的Tunbert模型。我们释放出Tunbert净化模型和用于微调的数据集。
translated by 谷歌翻译
由于BERT出现,变压器语言模型和转移学习已成为自然语言理解任务的最先进。最近,一些作品适用于特定领域的预训练,专制模型,例如科学论文,医疗文件等。在这项工作中,我们呈现RoberTuito,用于西班牙语中的用户生成内容的预先训练的语言模型。我们在西班牙语中培训了罗伯特托5亿推文。关于涉及用户生成文本的4个任务的基准测试显示,罗伯特托多于西班牙语的其他预先接受的语言模型。为了帮助进一步研究,我们将罗伯特多公开可在HuggingFace Model Hub上提供。
translated by 谷歌翻译
这项研究提供了对僧伽罗文本分类的预训练语言模型的性能的首次全面分析。我们测试了一组不同的Sinhala文本分类任务,我们的分析表明,在包括Sinhala(XLM-R,Labse和Laser)的预训练的多语言模型中,XLM-R是迄今为止Sinhala文本的最佳模型分类。我们还预先培训了两种基于罗伯塔的单语僧伽罗模型,它们远远优于僧伽罗的现有预训练的语言模型。我们表明,在微调时,这些预训练的语言模型为僧伽罗文本分类树立了非常强大的基线,并且在标记数据不足以进行微调的情况下非常强大。我们进一步提供了一组建议,用于使用预训练的模型进行Sinhala文本分类。我们还介绍了新的注释数据集,可用于僧伽罗文本分类的未来研究,并公开发布我们的预培训模型。
translated by 谷歌翻译
Pre-training large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. Although this method has proven to be effective for many domains, it might not always provide desirable benefits. In this paper, we study the effects of hateful pre-training on low-resource hate speech classification tasks. While previous studies on the English language have emphasized its importance, we aim to augment their observations with some non-obvious insights. We evaluate different variations of tweet-based BERT models pre-trained on hateful, non-hateful, and mixed subsets of a 40M tweet dataset. This evaluation is carried out for the Indian languages Hindi and Marathi. This paper is empirical evidence that hateful pre-training is not the best pre-training option for hate speech detection. We show that pre-training on non-hateful text from the target domain provides similar or better results. Further, we introduce HindTweetBERT and MahaTweetBERT, the first publicly available BERT models pre-trained on Hindi and Marathi tweets, respectively. We show that they provide state-of-the-art performance on hate speech classification tasks. We also release hateful BERT for the two languages and a gold hate speech evaluation benchmark HateEval-Hi and HateEval-Mr consisting of manually labeled 2000 tweets each. The models and data are available at https://github.com/l3cube-pune/MarathiNLP .
translated by 谷歌翻译
情感分析是NLP中研究最广泛的应用程序之一,但大多数工作都集中在具有大量数据的语言上。我们介绍了尼日利亚的四种口语最广泛的语言(Hausa,Igbo,Nigerian-Pidgin和Yor \'ub \'a)的第一个大规模的人类通知的Twitter情感数据集,该数据集由大约30,000个注释的推文组成(以及每种语言的大约30,000个)(以及14,000尼日利亚猎人),其中包括大量的代码混合推文。我们提出了文本收集,过滤,处理和标记方法,使我们能够为这些低资源语言创建数据集。我们评估了数据集上的预训练模型和转移策略。我们发现特定于语言的模型和语言适应性芬通常表现最好。我们将数据集,训练的模型,情感词典和代码释放到激励措施中,以代表性不足的语言进行情感分析。
translated by 谷歌翻译
Automated offensive language detection is essential in combating the spread of hate speech, particularly in social media. This paper describes our work on Offensive Language Identification in low resource Indic language Marathi. The problem is formulated as a text classification task to identify a tweet as offensive or non-offensive. We evaluate different mono-lingual and multi-lingual BERT models on this classification task, focusing on BERT models pre-trained with social media datasets. We compare the performance of MuRIL, MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT on the HASOC 2022 test set. We also explore external data augmentation from other existing Marathi hate speech corpus HASOC 2021 and L3Cube-MahaHate. The MahaTweetBERT, a BERT model, pre-trained on Marathi tweets when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), outperforms all models with an F1 score of 98.43 on the HASOC 2022 test set. With this, we also provide a new state-of-the-art result on HASOC 2022 / MOLD v2 test set.
translated by 谷歌翻译
在本文中,我们介绍了TweetNLP,这是社交媒体中自然语言处理(NLP)的集成平台。TweetNLP支持一套多样化的NLP任务,包括诸如情感分析和命名实体识别的通用重点领域,以及社交媒体特定的任务,例如表情符号预测和进攻性语言识别。特定于任务的系统由专门用于社交媒体文本的合理大小的基于变压器的语言模型(尤其是Twitter)提供动力,无需专用硬件或云服务即可运行。TweetNLP的主要贡献是:(1)使用适合社会领域的各种特定于任务的模型,用于支持社交媒体分析的现代工具包的集成python库;(2)使用我们的模型进行无编码实验的交互式在线演示;(3)涵盖各种典型社交媒体应用的教程。
translated by 谷歌翻译
我们介绍了Twhin-Bert,这是一种多语言语言模型,该模型在流行的社交网络Twitter上训练了内域数据。Twhin-bert与先前的预训练的语言模型有所不同,因为它不仅接受了基于文本的自学训练,而且还具有基于Twitter异质信息网络(TWHIN)中丰富社交活动的社会目标。我们的模型接受了70亿条推文的培训,涵盖了100多种不同的语言,为简短,嘈杂,用户生成的文本提供了有价值的表示形式。我们对各种多语言社会建议和语义理解任务进行评估,并证明了对既定的预训练的语言模型的大幅改进。我们将自由开放源代码Twhin-Bert和我们为研究社区提供的精心策划标签预测和社会参与基准数据集。
translated by 谷歌翻译
特定于语言的预训练模型已被证明比单语说在单语法评估设置中更准确,阿拉伯语也不例外。但是,我们发现先前发布的阿拉伯伯特模型显着培训。在这本技术报告中,我们展示了Jaber,Junior Arabic Bert,我们的预用语言模型原型专用于阿拉伯语。我们进行实证研究,以系统地评估模型在各种现有阿拉伯语NLU任务中的性能。实验结果表明,Jaber实现了Alue的最先进的表演,这是阿拉伯语了解评估的新基准,以及成熟的内部基准
translated by 谷歌翻译
文本分类是具有各种有趣应用程序的典型自然语言处理或计算语言学任务。随着社交媒体平台上的用户数量的增加,数据加速促进了有关社交媒体文本分类(SMTC)或社交媒体文本挖掘的新兴研究。与英语相比,越南人是低资源语言之一,仍然没有集中精力并彻底利用。受胶水成功的启发,我们介绍了社交媒体文本分类评估(SMTCE)基准,作为各种SMTC任务的数据集和模型的集合。借助拟议的基准,我们实施和分析了各种基于BERT的模型(Mbert,XLM-R和Distilmbert)和基于单语的BERT模型(Phobert,Vibert,Vibert,Velectra和Vibert4news)的有效性SMTCE基准。单语模型优于多语言模型,并实现所有文本分类任务的最新结果。它提供了基于基准的多语言和单语言模型的客观评估,该模型将使越南语言中有关贝尔特兰的未来研究有利。
translated by 谷歌翻译
BERT,ROBERTA或GPT-3等复杂的基于注意力的语言模型的外观已允许在许多场景中解决高度复杂的任务。但是,当应用于特定域时,这些模型会遇到相当大的困难。诸如Twitter之类的社交网络就是这种情况,Twitter是一种不断变化的信息流,以非正式和复杂的语言编写的信息流,鉴于人类的重要作用,每个信息都需要仔细评估,即使人类也需要理解。通过自然语言处理解决该领域的任务涉及严重的挑战。当将强大的最先进的多语言模型应用于这种情况下,特定语言的细微差别用来迷失翻译。为了面对这些挑战,我们提出了\ textbf {bertuit},这是迄今为止针对西班牙语提出的较大变压器,使用Roberta Optimization进行了230m西班牙推文的大规模数据集进行了预培训。我们的动机是提供一个强大的资源,以更好地了解西班牙Twitter,并用于专注于该社交网络的应用程序,特别强调致力于解决该平台中错误信息传播的解决方案。对Bertuit进行了多个任务评估,并与M-Bert,XLM-Roberta和XLM-T进行了比较,该任务非常具有竞争性的多语言变压器。在这种情况下,使用应用程序显示了我们方法的实用性:一种可视化骗局和分析作者群体传播虚假信息的零击方法。错误的信息在英语以外的其他语言等平台上疯狂地传播,这意味着在英语说话之外转移时,变形金刚的性能可能会受到影响。
translated by 谷歌翻译
Covid-19已遍布全球,已经开发了几种疫苗来应对其激增。为了确定与社交媒体帖子中与疫苗相关的正确情感,我们在与Covid-19疫苗相关的推文上微调了各种最新的预训练的变压器模型。具体而言,我们使用最近引入的最先进的预训练的变压器模型Roberta,XLNet和Bert,以及在CoVID-19的推文中预先训练的域特异性变压器模型CT-Bert和Bertweet。我们通过使用基于语言模型的过采样技术(LMOTE)过采样来进一步探索文本扩展的选项,以改善这些模型的准确性,特别是对于小样本数据集,在正面,负面和中性情感类别之间存在不平衡的类别分布。我们的结果总结了我们关于用于微调最先进的预训练的变压器模型的不平衡小样本数据集的文本过采样的适用性,以及针对分类任务的域特异性变压器模型的实用性。
translated by 谷歌翻译
Understanding customer feedback is becoming a necessity for companies to identify problems and improve their products and services. Text classification and sentiment analysis can play a major role in analyzing this data by using a variety of machine and deep learning approaches. In this work, different transformer-based models are utilized to explore how efficient these models are when working with a German customer feedback dataset. In addition, these pre-trained models are further analyzed to determine if adapting them to a specific domain using unlabeled data can yield better results than off-the-shelf pre-trained models. To evaluate the models, two downstream tasks from the GermEval 2017 are considered. The experimental results show that transformer-based models can reach significant improvements compared to a fastText baseline and outperform the published scores and previous models. For the subtask Relevance Classification, the best models achieve a micro-averaged $F1$-Score of 96.1 % on the first test set and 95.9 % on the second one, and a score of 85.1 % and 85.3 % for the subtask Polarity Classification.
translated by 谷歌翻译
Due to their crucial role in all NLP, several benchmarks have been proposed to evaluate pretrained language models. In spite of these efforts, no public benchmark of diverse nature currently exists for evaluation of Arabic. This makes it challenging to measure progress for both Arabic and multilingual language models. This challenge is compounded by the fact that any benchmark targeting Arabic needs to take into account the fact that Arabic is not a single language but rather a collection of languages and varieties. In this work, we introduce ORCA, a publicly available benchmark for Arabic language understanding evaluation. ORCA is carefully constructed to cover diverse Arabic varieties and a wide range of challenging Arabic understanding tasks exploiting 60 different datasets across seven NLU task clusters. To measure current progress in Arabic NLU, we use ORCA to offer a comprehensive comparison between 18 multilingual and Arabic language models. We also provide a public leaderboard with a unified single-number evaluation metric (ORCA score) to facilitate future research.
translated by 谷歌翻译
多语言预训练的语言模型在跨语言任务上表现出了令人印象深刻的表现。它极大地促进了自然语言处理在低资源语言上的应用。但是,当前的多语言模型仍然有些语言表现不佳。在本文中,我们提出了Cino(中国少数族裔训练的语言模型),这是一种用于中国少数语言的多语言预训练的语言模型。它涵盖了标准的中文,Yue中文和其他六种少数民族语言。为了评估多语言模型在少数族裔语言上的跨语性能力,我们从Wikipedia和新闻网站收集文档,并构建两个文本分类数据集,WCM(Wiki-Chinese-Minority)和CMNEWS(中国最少的新闻)。我们表明,Cino在各种分类任务上的表现明显优于基准。Cino模型和数据集可在http://cino.hfl-rc.com上公开获得。
translated by 谷歌翻译
In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing tasks (NLP). However, pre-training these large multilingual language models requires a lot of training data, which is not available for African Languages. Active learning is a semi-supervised learning algorithm, in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language models pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that \textbf{AfroLM} is able to generalize well across various domains. We release the code source, and our datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
translated by 谷歌翻译
我们遵守社交媒体上最近的行为,用户在其中故意从阿拉伯语字母中删除Consonantal Dots,以绕过内容分类算法。内容分类通常是通过微调预先训练的语言模型来完成的,这些语言模型已被许多自然语言处理应用程序采用。在这项工作中,我们研究了在“不所需的”阿拉伯语文本上应用预训练的阿拉伯语模型的效果。我们建议使用预先训练的模型支持所需的文本的几种方法,而无需额外培训,并在两种阿拉伯语自然语言处理下游任务中衡量其性能。结果令人鼓舞;在其中一个任务中,我们的方法显示了几乎完美的性能。
translated by 谷歌翻译
Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) Previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) Results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue for the need of more research to understand the factors underlying them. In this sense, the effect of corpus size, quality and pre-training techniques need to be further investigated to be able to obtain Spanish monolingual models significantly better than the multilingual ones released by large private companies, specially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem which requires to marry resources (monetary and/or computational) with the best research expertise and practice.
translated by 谷歌翻译
The health mention classification (HMC) task is the process of identifying and classifying mentions of health-related concepts in text. This can be useful for identifying and tracking the spread of diseases through social media posts. However, this is a non-trivial task. Here we build on recent studies suggesting that using emotional information may improve upon this task. Our study results in a framework for health mention classification that incorporates affective features. We present two methods, an intermediate task fine-tuning approach (implicit) and a multi-feature fusion approach (explicit) to incorporate emotions into our target task of HMC. We evaluated our approach on 5 HMC-related datasets from different social media platforms including three from Twitter, one from Reddit and another from a combination of social media sources. Extensive experiments demonstrate that our approach results in statistically significant performance gains on HMC tasks. By using the multi-feature fusion approach, we achieve at least a 3% improvement in F1 score over BERT baselines across all datasets. We also show that considering only negative emotions does not significantly affect performance on the HMC task. Additionally, our results indicate that HMC models infused with emotional knowledge are an effective alternative, especially when other HMC datasets are unavailable for domain-specific fine-tuning. The source code for our models is freely available at https://github.com/tahirlanre/Emotion_PHM.
translated by 谷歌翻译