We study the performance of monolingual and multilingual language models on the task of question-answering (QA) on three diverse languages: English, Finnish and Japanese. We develop models for the tasks of (1) determining if a question is answerable given the context and (2) identifying the answer texts within the context using IOB tagging. Furthermore, we attempt to evaluate the effectiveness of a pre-trained multilingual encoder (Multilingual BERT) on cross-language zero-shot learning for both the answerability and IOB sequence classifiers.
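As a minimal sketch of the IOB formulation mentioned above, the snippet below labels the answer span inside a tokenized context and derives an answerability flag from the tags. The function names `iob_tags` and `is_answerable`, the whitespace tokenization, and the exact-match alignment are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def iob_tags(context_tokens: List[str], answer_tokens: List[str]) -> List[str]:
    """Tag the first occurrence of the answer in the context.

    The first answer token gets "B", the remaining answer tokens get "I",
    and every other context token gets "O".
    """
    tags = ["O"] * len(context_tokens)
    n = len(answer_tokens)
    if n == 0:
        return tags  # no answer given: everything stays "O"
    for start in range(len(context_tokens) - n + 1):
        if context_tokens[start:start + n] == answer_tokens:
            tags[start] = "B"
            for i in range(start + 1, start + n):
                tags[i] = "I"
            break  # only the first match is labeled in this sketch
    return tags


def is_answerable(tags: List[str]) -> bool:
    """Treat a question as answerable when at least one span was tagged."""
    return "B" in tags


context = "Helsinki is the capital of Finland".split()
tags = iob_tags(context, ["the", "capital"])
print(list(zip(context, tags)))
print(is_answerable(tags))
```

In a trained system the tags would be predicted by a sequence classifier rather than derived from a known answer; the sketch only shows how the labels for such a classifier could be constructed.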
Translated by Google Translate
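Question answering (QA) is one of the most challenging problems in natural language processing (NLP). QA systems attempt to produce answers for given questions. These answers can be generated from unstructured or structured text. QA is therefore considered an important research area that can be used to evaluate text understanding systems. A large volume of QA research has been devoted to English, investigating the most advanced techniques and achieving state-of-the-art results. However, research on Arabic question answering has progressed at a considerably slower pace because of the scarcity of research efforts in Arabic QA and the lack of large benchmark datasets. Recently, many pre-trained language models have delivered high performance on many Arabic NLP problems. In this work, we evaluate state-of-the-art pre-trained transformer models for Arabic QA using four reading comprehension datasets: Arabic-SQuAD, ARCD, AQAD, and TyDiQA-GoldP. We fine-tune and compare the performance of the AraBERTv2-base model, the AraBERTv0.2-large model, and the AraELECTRA model. Finally, we provide an analysis to understand and interpret the low performance results obtained by some models.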
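This tutorial demonstrates workflows for incorporating text data into actuarial classification and regression tasks. The main focus is on methods employing transformer-based models. A dataset of car accident descriptions with an average length of 400 words, available in English and German, and a dataset of short property insurance claims are used to demonstrate these techniques. The case studies tackle challenges related to a multilingual setting and long input sequences. They also show ways to interpret model output and to assess and improve model performance by fine-tuning the models to the domain of application or to a specific prediction task. Finally, the tutorial provides practical approaches for handling classification tasks when no or only few labeled data are available. The results achieved by using the language-understanding skills of off-the-shelf natural language processing (NLP) models with only minimal pre-processing and fine-tuning clearly demonstrate the power of transfer learning for practical applications.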
Transformer language models (TLMs) are critical for most NLP tasks, but they are difficult to create for low-resource languages because of how much pretraining data they require. In this work, we investigate two techniques for training monolingual TLMs in a low-resource setting: greatly reducing TLM size, and complementing the masked language modeling objective with two linguistically rich supervised tasks (part-of-speech tagging and dependency parsing). Results from 7 diverse languages indicate that our model, MicroBERT, is able to produce marked improvements in downstream task evaluations relative to a typical monolingual TLM pretraining approach. Specifically, we find that monolingual MicroBERT models achieve gains of up to 18% for parser LAS and 11% for NER F1 compared to a multilingual baseline, mBERT, while having less than 1% of its parameter count. We conclude that reducing TLM parameter count and using labeled data for pretraining low-resource TLMs can yield large quality benefits and in some cases produce models that outperform multilingual approaches.
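Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches instead of predicting a distribution over tokens. We pretrain the 86M-parameter PIXEL model on the same English data as BERT and evaluate it on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.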
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate our datasets (for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done). To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
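Acronyms and long forms are commonly found in research documents, more so in documents from the scientific and legal domains. Many acronyms used in such documents are domain-specific and are rarely found in normal text corpora. Owing to this, transformer-based NLP models often encounter OOV (out-of-vocabulary) acronym tokens, especially for non-English languages, and their performance suffers when linking acronyms to their long forms during extraction. Moreover, pretrained transformer models like BERT are not specialized to handle scientific and legal documents. With these points as the overarching motivation behind this work, we propose a novel framework, CABACE: Character-Aware BERT for ACronym Extraction, which takes into account character sequences in text and is adapted to the scientific and legal domains by masked language modelling. We further use an objective with an augmented loss function, adding max loss and mask loss terms to the standard cross-entropy loss for training CABACE. We further leverage pseudo labelling and adversarial data generation to improve the generalizability of the framework. Experimental results demonstrate the superiority of the proposed framework over various baselines. Additionally, we show that the proposed framework is better suited than baseline models for zero-shot generalization to non-English languages, reinforcing the effectiveness of our approach. Our team BackGprop obtained the highest score on the French dataset, top scores on the Danish and Vietnamese datasets, and the third-highest score on the English legal dataset on the global leaderboard for the acronym extraction (AE) shared task at SDU AAAI-22.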
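Multilingual language models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero-shot transfer learning, a large body of work has emerged on (i) building bigger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero-shot cross-lingual, and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.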
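Data augmentation is an important component in the robustness evaluation of natural language processing (NLP) models, as well as in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework that supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards, and robustness analysis results are publicly available in the NL-Augmenter repository (https://github.com/gem-benchmark/nl-augmenter).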
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0 respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE. Our code and pre-trained models are available at https://github.com/facebookresearch/SpanBERT.
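To make the idea of masking contiguous random spans concrete, here is a small illustrative sketch. The abstract states only that contiguous spans are masked instead of individual tokens; the geometric span-length distribution (`p=0.2`), the clip at 10 tokens, the 15% masking budget, and the names `sample_span_mask` and `apply_mask` are assumptions chosen for illustration and should not be read as the authors' code.

```python
import random
from typing import List, Set


def sample_span_mask(num_tokens: int, mask_ratio: float = 0.15,
                     p: float = 0.2, max_span: int = 10,
                     seed: int = 0) -> Set[int]:
    """Pick token positions to mask by sampling contiguous spans.

    Span lengths follow a geometric distribution clipped at max_span
    (an assumed setting); sampling stops once the budget is reached.
    """
    if num_tokens <= 0:
        return set()
    rng = random.Random(seed)
    budget = max(1, round(num_tokens * mask_ratio))
    masked: Set[int] = set()
    while len(masked) < budget:
        # Geometric length: keep extending with probability 1 - p.
        length = 1
        while length < max_span and rng.random() > p:
            length += 1
        # Never exceed the remaining budget.
        length = min(length, budget - len(masked))
        start = rng.randrange(0, num_tokens - length + 1)
        masked.update(range(start, start + length))
    return masked


def apply_mask(tokens: List[str], masked: Set[int],
               mask_token: str = "[MASK]") -> List[str]:
    """Replace every selected position with the mask token."""
    return [mask_token if i in masked else t for i, t in enumerate(tokens)]


tokens = [f"tok{i}" for i in range(40)]
positions = sample_span_mask(len(tokens))
print(apply_mask(tokens, positions))
```

The sketch covers only the span selection step; the span boundary objective described in the abstract would additionally require a model that predicts each masked token from the representations of the tokens bordering the span.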
Laws and their interpretations, legal arguments and agreements are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks.
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 percentage point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
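We leverage pre-trained language models to solve the task of complex NER for two low-resource languages: Chinese and Spanish. We use the technique of Whole Word Masking (WWM) to boost the performance of the masked language modeling objective on a large, unsupervised corpus. We experiment with multiple neural network architectures, combining CRF, BiLSTMs, and linear classifiers on top of a fine-tuned BERT layer. All our models outperform the baseline, and our best-performing model obtains a competitive position on the evaluation leaderboard for the blind test set.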
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
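We frame the semantic search problem as paraphrase span detection: given a segment of text as a query phrase, the task is to identify its paraphrase in a given document, the same modelling setup as is typically used in extractive question answering. On the Turku Paraphrase Corpus of 100,000 manually extracted Finnish paraphrase pairs, including their original document context, we find that our paraphrase span detection model outperforms two strong retrieval baselines (lexical similarity and BERT sentence embeddings) by 31.9pp and 22.4pp respectively in terms of exact match, and by 22.3pp and 12.9pp in terms of token-level F-score. This demonstrates a strong advantage of modelling the task in terms of span retrieval rather than sentence similarity. Additionally, we introduce a method for creating artificial paraphrase data through back-translation, suitable for languages where manually annotated paraphrase resources for training the span detection model are not available.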
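Idiomatic expressions can be problematic for natural language processing applications because their meaning cannot be inferred from their constituent words. A lack of successful methodological approaches and sufficiently large datasets has prevented the development of machine learning approaches for detecting idioms, especially for expressions that do not occur in the training set. We present an approach, called MICE, that uses contextual embeddings for this purpose. We present a new dataset of multi-word expressions with literal and idiomatic meanings and use it to train a classifier based on two state-of-the-art contextual word embeddings: ELMo and BERT. We show that deep neural networks using both embeddings perform better than existing approaches and are able to detect idiomatic word use, even for expressions that were not present in the training set. We demonstrate cross-lingual transfer of the developed models and analyze the size of the required dataset.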
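In recent years, low-resource machine reading comprehension (MRC) has made significant progress, with models achieving remarkable performance on datasets in various languages. However, none of these models has been customized for the Urdu language. This work explores the semi-automated creation of the Urdu Question Answering Dataset (UQuAD1.0) by combining machine-translated SQuAD with human-generated samples derived from Wikipedia articles and Urdu RC worksheets from Cambridge O-level books. UQuAD1.0 is a large-scale Urdu dataset for extractive machine reading comprehension tasks, consisting of 49k question-answer pairs in question, passage, and answer format. In UQuAD1.0, 45,000 QA pairs were generated by machine translation of the original SQuAD1.0 and approximately 4,000 pairs via crowdsourcing. In this study, we used two types of MRC models: a rule-based baseline and advanced Transformer-based models. We found that the latter outperform the former and therefore decided to concentrate on Transformer-based architectures. Using XLM-RoBERTa and multilingual BERT, we obtain F1 scores of 0.66 and 0.63, respectively.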
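In this work, we explore how to learn task-specific language models aimed at learning rich representations of keyphrases from text documents. We experiment with different masking strategies for pre-training transformer language models (LMs) in discriminative as well as generative settings. In the discriminative setting, we introduce a new pre-training objective, Keyphrase Boundary Infilling with Replacement (KBIR), which shows large gains in performance (up to 9.26 points in F1) over the SOTA when an LM pre-trained using KBIR is fine-tuned for the task of keyphrase extraction. In the generative setting, we introduce a new pre-training setup for BART, KeyBART, that reproduces the keyphrases related to the input text in the CatSeq format instead of the denoised original input. This also leads to gains in performance (up to 4.33 points in F1@M) over the SOTA for keyphrase generation. Additionally, we fine-tune the pre-trained language models on named entity recognition (NER), question answering (QA), relation extraction (RE), and abstractive summarization and achieve performance comparable to the SOTA, showing that learning rich representations of keyphrases is indeed beneficial for many other fundamental NLP tasks.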
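Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data, to regularizing the objective, to limiting the amount of data used in order to protect privacy. Based on a precise description of the goals and applications of data augmentation and a taxonomy of existing works, this survey is concerned with data augmentation methods for text classification and aims to provide a concise and comprehensive overview for researchers and practitioners. We divide more than 100 methods into 12 different groupings and provide state-of-the-art references that indicate which methods are highly promising by relating them to each other. Finally, research perspectives that may form a basis for future work are provided.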
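Named entity recognition is an information extraction task that serves as a preprocessing step for other natural language processing tasks, such as machine translation, information retrieval, and question answering. Named entity recognition identifies proper names as well as temporal and numeric expressions in open-domain text. For Semitic languages such as Arabic, Amharic, and Hebrew, the named entity recognition task is more challenging because of the heavily inflected structure of these languages. In this paper, we present an Amharic named entity recognition system based on bidirectional long short-term memory with a conditional random fields layer. We annotate a new Amharic named entity recognition dataset (8,070 sentences with 182,691 tokens) and apply the Synthetic Minority Over-sampling Technique to our dataset to mitigate the imbalanced classification problem. Our named entity recognition system achieves an F_1 score of 93%, which is a new state-of-the-art result for Amharic named entity recognition.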