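The field of natural language processing (NLP) has recently seen a large shift towards using pre-trained language models for solving almost any task. Despite showing great improvements on benchmark datasets for various tasks, these models often perform sub-optimally in non-standard domains such as the clinical domain, where a large gap between pre-training documents and target documents is observed. In this paper, we aim to close this gap with domain-specific training of the language model, and we investigate its effect on a diverse set of downstream tasks and settings. We introduce the pre-trained Clin-X (Clinical XLM-R) language models and show how Clin-X outperforms other pre-trained transformer models by a large margin on ten clinical concept extraction tasks from two languages. In addition, we demonstrate how the transformer model can be further improved with our proposed task- and language-agnostic model architecture based on ensembles over random splits and cross-sentence context. Our studies in low-resource and transfer settings reveal stable model performance despite a lack of annotated data, even when only 250 labeled sentences are available. Our results highlight the importance of specialized language models such as Clin-X for concept extraction in non-standard domains, but also show that our task-agnostic model architecture is robust across the tested tasks and languages, so that domain- or task-specific adaptations are not required. The Clin-X language models and the source code for fine-tuning and transferring the model are publicly available at https://github.com/boschresearch/clin_x/ and on the Huggingface model hub.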
translated by 谷歌翻译
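We leverage pre-trained language models to solve a complex NER task for two low-resource languages: Chinese and Spanish. We use the technique of whole word masking (WWM) to improve the masked language modeling objective on large unsupervised corpora. We experiment with multiple neural network architectures on top of a fine-tuned BERT layer, combining CRFs, BiLSTMs, and linear classifiers. All of our models outperform the baseline, and our best-performing model obtains a competitive position on the evaluation leaderboard for the blind test set.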
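The whole word masking idea mentioned above can be illustrated with a minimal sketch. It assumes WordPiece-style tokens where continuation pieces start with `##`; the function name `whole_word_mask` and its parameters are hypothetical and not taken from the paper. For Chinese, word boundaries would have to come from a word segmenter instead of the `##` marker.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask whole words: a head piece together with all its '##' continuations.

    Hypothetical helper for illustration only.
    Returns (masked_tokens, labels); labels hold the original token at
    masked positions and None elsewhere.
    """
    rng = random.Random(seed)

    # Group token indices into words using the '##' continuation marker.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(tokens)
    labels = [None] * len(tokens)
    for word in words:
        # One random draw per word, so all pieces share the same fate.
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels


tokens = ["the", "phil", "##har", "##monic", "played"]
masked, labels = whole_word_mask(tokens, mask_prob=0.5, seed=1)
print(masked)
```

The masking decision is drawn once per word rather than once per piece, so the model can never recover a masked piece from its unmasked siblings.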
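Multilingual pre-trained language models (PLMs) have demonstrated impressive performance on downstream tasks for both high-resource and low-resource languages. However, there is still a large performance drop for languages unseen during pre-training, especially African languages. One of the most effective approaches for adapting to a new language is language adaptive fine-tuning (LAFT): fine-tuning a multilingual PLM on monolingual texts of a language using the pre-training objective. However, adapting to each target language individually takes a large amount of disk space and limits the cross-lingual transfer abilities of the resulting models, because they have been specialized for a single language. In this paper, we perform multilingual adaptive fine-tuning (MAFT) on the 17 most-resourced African languages and three other high-resource languages widely spoken on the African continent to encourage cross-lingual transfer learning. To further specialize the multilingual PLM, we remove vocabulary tokens from the embedding layer that correspond to non-African writing scripts before MAFT, thus reducing the model size by around 50%. Our evaluation on two multilingual PLMs (AfriBERTa and XLM-R) and three NLP tasks (NER, news topic classification, and sentiment classification) shows that our approach is competitive with applying LAFT on individual languages while requiring less disk space. Additionally, we show that our adapted PLM also improves the zero-shot cross-lingual transfer abilities of parameter-efficient fine-tuning methods.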
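The vocabulary reduction step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' procedure: the helpers `uses_allowed_script` and `prune_vocab` are hypothetical, scripts are detected from Unicode character names, and the embedding matrix is represented as a plain list of rows.

```python
import unicodedata


def uses_allowed_script(token, allowed_prefixes=("LATIN",)):
    """True if every alphabetic character's Unicode name starts with an allowed prefix."""
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if not name.startswith(allowed_prefixes):
                return False
    return True


def prune_vocab(vocab, embeddings, allowed_prefixes=("LATIN",)):
    """Drop tokens written in other scripts.

    vocab: dict mapping token -> row index in `embeddings`.
    Returns a re-indexed vocab and the matching embedding rows.
    """
    kept = sorted(
        (idx, tok)
        for tok, idx in vocab.items()
        if uses_allowed_script(tok, allowed_prefixes)
    )
    new_vocab = {tok: new_idx for new_idx, (_, tok) in enumerate(kept)}
    new_embeddings = [embeddings[old_idx] for old_idx, _ in kept]
    return new_vocab, new_embeddings


# Toy vocabulary with Latin, Cyrillic and CJK entries (illustrative data only).
vocab = {"<s>": 0, "hello": 1, "\u043f\u0440\u0438\u0432\u0435\u0442": 2, "\u4f60\u597d": 3}
embeddings = [[0.0], [1.0], [2.0], [3.0]]
new_vocab, new_embeddings = prune_vocab(vocab, embeddings)
print(new_vocab, new_embeddings)
```

Because the token embedding matrix accounts for a large share of the weights in a multilingual model with a big vocabulary, removing rows in this way is what makes a substantial size reduction plausible. A real setup would pass every script that the target languages use in `allowed_prefixes`.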
Transformer language models (TLMs) are critical for most NLP tasks, but they are difficult to create for low-resource languages because of how much pretraining data they require. In this work, we investigate two techniques for training monolingual TLMs in a low-resource setting: greatly reducing TLM size, and complementing the masked language modeling objective with two linguistically rich supervised tasks (part-of-speech tagging and dependency parsing). Results from 7 diverse languages indicate that our model, MicroBERT, is able to produce marked improvements in downstream task evaluations relative to a typical monolingual TLM pretraining approach. Specifically, we find that monolingual MicroBERT models achieve gains of up to 18% for parser LAS and 11% for NER F1 compared to a multilingual baseline, mBERT, while having less than 1% of its parameter count. We conclude that reducing TLM parameter count and using labeled data for pretraining low-resource TLMs can yield large quality benefits and in some cases produce models that outperform multilingual approaches.
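Dense word vectors, or "word embeddings", which encode semantic properties of words, have now become integral to NLP tasks such as machine translation (MT), question answering (QA), word sense disambiguation (WSD), and information retrieval (IR). In this paper, we use various existing approaches to create multiple word embeddings for 14 Indian languages. We place the embeddings for all of these languages, namely Assamese, Bengali, Gujarati, Hindi, Kannada, Konkani, Malayalam, Marathi, Nepali, Odiya, Punjabi, Sanskrit, Tamil, and Telugu, in a single repository. Relatively newer approaches that emphasize catering to context (BERT, ELMo, etc.) have shown significant improvements, but require a large amount of resources to generate usable models. We release pre-trained embeddings generated using both contextual and non-contextual approaches. We also use MUSE and XLM to train cross-lingual embeddings for all of the aforementioned languages. To show the efficacy of our embeddings, we evaluate our embedding models on XPOS, UPOS, and NER tasks for all of these languages. We release a total of 436 models using 8 different approaches. We hope they are useful for resource-constrained Indian language NLP. The title of this paper refers to the famous novel "A Passage to India" by E. M. Forster, first published in 1924.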
We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. For 9 out of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization). The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated test sets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing test sets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.
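The projection step described above can be sketched in a few lines. This is a simplified illustration, assuming BIO tags on the English side and a word alignment given as index pairs; the function `project_entities` is hypothetical, and the abstract does not state how the alignment is obtained or how conflicts are resolved.

```python
def project_entities(src_tags, alignment, tgt_len):
    """Project BIO entity tags from source tokens onto target tokens.

    src_tags: BIO tags for the source (English) tokens.
    alignment: list of (src_index, tgt_index) word-alignment pairs.
    tgt_len: number of tokens in the target sentence.
    Simplification: adjacent target tokens of the same type merge into one entity.
    """
    # Step 1: copy the entity type of each tagged source token to its aligned target token.
    tgt_types = [None] * tgt_len
    for s, t in alignment:
        tag = src_tags[s]
        if tag != "O":
            tgt_types[t] = tag.split("-", 1)[1]

    # Step 2: rebuild a well-formed BIO sequence in target word order.
    tgt_tags = []
    prev = None
    for typ in tgt_types:
        if typ is None:
            tgt_tags.append("O")
        elif typ == prev:
            tgt_tags.append("I-" + typ)
        else:
            tgt_tags.append("B-" + typ)
        prev = typ
    return tgt_tags


# English: "Sachin Tendulkar lives in Mumbai"; the target has 6 tokens in a different order.
src_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
alignment = [(0, 0), (1, 1), (4, 2), (3, 3), (2, 4)]
print(project_entities(src_tags, alignment, tgt_len=6))
```

Rebuilding the B-/I- prefixes on the target side matters because word order usually differs between the two languages, so the source prefixes cannot simply be copied.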
Understanding customer feedback is becoming a necessity for companies to identify problems and improve their products and services. Text classification and sentiment analysis can play a major role in analyzing this data by using a variety of machine and deep learning approaches. In this work, different transformer-based models are utilized to explore how efficient these models are when working with a German customer feedback dataset. In addition, these pre-trained models are further analyzed to determine if adapting them to a specific domain using unlabeled data can yield better results than off-the-shelf pre-trained models. To evaluate the models, two downstream tasks from the GermEval 2017 are considered. The experimental results show that transformer-based models can reach significant improvements compared to a fastText baseline and outperform the published scores and previous models. For the subtask Relevance Classification, the best models achieve a micro-averaged $F1$-Score of 96.1 % on the first test set and 95.9 % on the second one, and a score of 85.1 % and 85.3 % for the subtask Polarity Classification.
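We present a statistical model for German medical natural language processing, trained for named entity recognition (NER), as an open, publicly available model. This work is a refined successor to our first GERNERMED model, which it substantially outperforms. We demonstrate the effectiveness of combining multiple techniques to achieve strong entity recognition performance by means of transfer learning on pretrained deep language models (LMs), word alignment, and neural machine translation. Given the scarcity of open, public medical entity recognition models for German texts, this work benefits the German medical NLP research community as a baseline model. Since our model is based on public English data, its weights are provided without legal restrictions on usage and distribution. The sample code and the statistical model are available at: https://github.com/frankkramer-lab/gernermed-pp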
In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. In order to reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the set of lessons learned from the study constitutes valuable insights towards building biomedical language models that generalize to other less-resourced languages and different domain settings.
While large pre-trained models have transformed the field of natural language processing (NLP), the high training cost and low cross-lingual availability of such models prevent the new advances from being equally shared by users across all languages, especially the less spoken ones. To promote equal opportunities for all language speakers in NLP research and to reduce energy consumption for sustainability, this study proposes GreenPLM, an effective and energy-efficient framework that uses bilingual lexicons to directly translate language models of one language into other languages at (almost) no additional cost. We validate this approach in 18 languages and show that this framework is comparable to, if not better than, other heuristics trained with high cost. In addition, when given a low computational cost (2.5%), the framework outperforms the original monolingual language models in six out of seven tested languages. We release language models in 50 languages translated from English and the source code here.
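Natural language processing in the legal domain has benefited hugely from the emergence of Transformer-based pre-trained language models (PLMs) pre-trained on legal text. There exist PLMs trained on European and US legal text, most notably LegalBERT. However, with the rapidly increasing volume of NLP applications on Indian legal documents, and the distinguishing characteristics of Indian legal text, it has become necessary to pre-train LMs on Indian legal text as well. In this work, we introduce Transformer-based PLMs pre-trained on a large corpus of Indian legal documents. We also apply these PLMs to several benchmark legal NLP tasks over Indian legal documents, namely legal statute identification from facts, semantic segmentation of court judgements, and court judgement prediction. Our experiments demonstrate the utility of the India-specific PLMs developed in this work.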
Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller-scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) Previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) Results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue for the need for more research to understand the factors underlying them. In this sense, the effects of corpus size, quality and pre-training techniques need to be further investigated in order to obtain Spanish monolingual models significantly better than the multilingual ones released by large private companies, especially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem which requires marrying resources (monetary and/or computational) with the best research expertise and practice.
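Acronyms and long forms are commonly found in research documents, even more so in documents from the scientific and legal domains. Many acronyms used in such documents are domain-specific and are very rarely found in general text corpora. As a result, Transformer-based NLP models often treat acronym tokens as out-of-vocabulary (OOV), especially for non-English languages, and their performance suffers when linking acronyms to their long forms during extraction. Moreover, pretrained Transformer models like BERT are not specialized to handle scientific and legal documents. With these points as the overarching motivation behind this work, we propose a novel framework, CABACE: Character-Aware BERT for ACronym Extraction, which takes into account character sequences in text and is adapted to the scientific and legal domains by masked language modeling. We further use an objective with an augmented loss function, adding max loss and mask loss terms to the standard cross-entropy loss for training CABACE. We also leverage pseudo labeling and adversarial data generation to improve the generalizability of the framework. Experimental results demonstrate the superiority of the proposed framework in comparison to various baselines. Additionally, we show that the proposed framework is better suited than baseline models for zero-shot generalization to non-English languages, thus reinforcing the effectiveness of our approach. Our team BackGprop secured the highest score on the French dataset, the second-highest on Danish and Vietnamese, and the third-highest on the English legal dataset on the global leaderboard for the acronym extraction (AE) shared task at SDU AAAI-22.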
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
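Multilingual pre-trained language models have shown impressive performance on cross-lingual tasks, which greatly facilitates the application of natural language processing to low-resource languages. However, there are still some languages on which current multilingual models do not perform well. In this paper, we propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages. It covers Standard Chinese, Yue Chinese, and six other ethnic minority languages. To evaluate the cross-lingual ability of the multilingual model on ethnic minority languages, we collect documents from Wikipedia and news websites and construct two text classification datasets, WCM (Wiki-Chinese-Minority) and CMNews (Chinese-Minority-News). We show that CINO notably outperforms the baselines on various classification tasks. The CINO model and the datasets are publicly available at http://cino.hfl-rc.com.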
Motivation: Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. Results: We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts.
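African languages have recently been the subject of several studies in natural language processing (NLP), which has led to a significant increase in their representation in the field. However, most studies tend to focus more on the models than on the quality of the datasets when assessing model performance in tasks such as named entity recognition (NER). While this works well in most cases, it does not account for the limitations of doing NLP with low-resource languages, that is, the quality and quantity of the datasets at our disposal. This paper provides an analysis of the performance of various models based on dataset quality. We evaluate different pre-trained models with respect to the entity density per sentence of several African NER datasets. We hope that this study improves the way NLP research is done in the context of low-resource languages.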
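Entity density per sentence can be made concrete with a small sketch. The definition used here is an assumption, since the abstract does not spell one out: the average number of entities per sentence, counting one entity per `B-` tag in a BIO-tagged corpus. The function name `entity_density` is hypothetical.

```python
def entity_density(tagged_sentences):
    """Average number of entities per sentence, counting one entity per 'B-' tag.

    tagged_sentences: list of sentences, each a list of BIO tags.
    Assumed definition for illustration; other definitions (for example the
    share of entity tokens) are equally plausible.
    """
    if not tagged_sentences:
        return 0.0
    total_entities = sum(
        sum(1 for tag in sentence if tag.startswith("B-"))
        for sentence in tagged_sentences
    )
    return total_entities / len(tagged_sentences)


# Toy corpus (illustrative data only): three sentences, three entities in total.
corpus = [
    ["B-PER", "I-PER", "O"],
    ["O", "O"],
    ["B-LOC", "O", "B-ORG"],
]
print(entity_density(corpus))
```

A dataset with many sentences like the second one has low density, so a model sees few positive examples per training sentence.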
In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires a lot of training data, which is not available for African languages. Active learning is a semi-supervised learning algorithm, in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language model pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various NLP downstream tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM is able to generalize well across various domains. We release the source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
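Language models are pre-trained on large corpora of generic data such as BookCorpus, Common Crawl and Wikipedia, which is essential for the model to learn the linguistic characteristics of the language. Recent studies suggest using domain-adaptive pre-training (DAPT) and task-adaptive pre-training (TAPT) as an intermediate step before the final fine-tuning task. This step helps cover the target domain vocabulary and improves model performance on the downstream task. In this work, we study the impact of training only the embedding layer on model performance during TAPT and task-specific fine-tuning. Based on our study, we propose a simple approach that makes the intermediate TAPT step more efficient for BERT-based models by performing selective pre-training of BERT layers. We show that training only the BERT embedding layer during TAPT is sufficient to adapt to the vocabulary of the target domain and achieve comparable performance. Our approach is computationally efficient, with 78% fewer parameters trained during TAPT. The proposed embedding-layer fine-tuning approach can also be an efficient domain adaptation technique.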
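As a rough check on the savings from training only the embedding layer, the parameter counts can be worked out by hand. The sketch below assumes the standard BERT-base uncased configuration (30,522 WordPiece tokens, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions), which the abstract does not state, and it ignores the masked language modeling head. The function `bert_param_counts` is hypothetical.

```python
def bert_param_counts(vocab=30522, hidden=768, layers=12, ffn=3072,
                      max_pos=512, type_vocab=2):
    """Return (embedding_params, total_params) for a BERT-style encoder.

    Assumes the usual layout: token, position and segment embeddings with
    LayerNorm; per layer, four attention projections and a two-layer
    feed-forward block, each followed by LayerNorm; plus a pooler.
    """
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden) + layer_norm  # Q, K, V, output
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    total = embeddings + layers * (attention + feed_forward) + pooler
    return embeddings, total


emb, total = bert_param_counts()
share = emb / total
print(f"embedding share: {share:.1%}, frozen share: {1 - share:.1%}")
```

Under these assumptions the embedding layer holds about 22% of the weights, so freezing everything else leaves roughly 78% of the parameters untrained.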
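Clinical phenotyping enables the automatic extraction of clinical conditions from patient records, which can be beneficial to doctors and clinics worldwide. However, current state-of-the-art models are mostly applicable to clinical notes written in English. We therefore investigate cross-lingual knowledge transfer strategies to perform this task for clinics that do not use English and have only a small amount of in-domain data available. We evaluate these strategies for a Greek and a Spanish clinic, leveraging clinical notes from different clinical domains such as cardiology, oncology and the ICU. Our results reveal two strategies that outperform the state of the art: translation-based methods in combination with domain-specific encoders, and cross-lingual encoders plus adapters. We find that these strategies perform especially well for classifying rare phenotypes, and we advise on which method to prefer in which situation. Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.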