The SIGMORPHON 2022 shared task on Morpheme Segmentation challenged systems to break a word down into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words across 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams; the best system averaged a 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences across 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams; the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support future research of any kind, we release all system predictions, the evaluation script, and all gold standard datasets.
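As an illustration of how word-level segmentation quality can be scored, the sketch below computes a morpheme-level F1 for a single word by comparing the multiset of predicted morphemes against the gold segmentation. It is a simplified stand-in for intuition only, not the shared task's official evaluation script.

```python
# Illustrative morpheme-level F1 for one word; the official shared-task
# evaluation script may compute scores differently.
from collections import Counter

def morpheme_f1(predicted: list[str], gold: list[str]) -> float:
    """F1 over the multisets of predicted vs. gold morphemes for a single word."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    overlap = sum((pred_counts & gold_counts).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)

# Example: the system gets two of three morphemes of "unhappiness" right.
print(morpheme_f1(["un", "happi", "ness"], ["un", "happy", "ness"]))  # ~0.67
```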
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors, respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed, with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
Data-hungry deep neural networks have established themselves as the standard for many NLP tasks, including traditional sequence tagging. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counter this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently witnessed a load of textual augmentation techniques, the field still lacks a systematic performance analysis across a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three categories of text augmentation methods that perform changes at the syntax level (e.g., cropping sub-sentences), token level (e.g., random word insertion), and character level (e.g., character swapping). We systematically compare them on part-of-speech tagging, dependency parsing, and semantic role labeling for a diverse set of language families, using various models, including architectures that rely on pretrained multilingual contextualized language models such as mBERT. Augmentation most significantly improves dependency parsing, followed by part-of-speech tagging and semantic role labeling. We find the experimented techniques to be effective on morphologically rich languages in general, rather than on analytic languages such as Vietnamese. Our results suggest that augmentation techniques can further improve over strong baselines based on mBERT. We identify the character-level methods as the most consistent performers, while synonym replacement and syntactic augmenters provide inconsistent improvements. Finally, we discuss that the results most heavily depend on the task, language pair, and model type.
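To make the augmentation categories concrete, the toy functions below sketch a token-level operation (random word insertion) and a character-level operation (character swapping); syntax-level cropping requires a parse and is omitted. These are illustrative only and are not the implementations evaluated in the study.

```python
# Toy token-level and character-level augmentations for illustration;
# not the authors' implementations.
import random

def random_word_insertion(tokens: list[str], vocab: list[str], p: float = 0.1) -> list[str]:
    """Insert a random vocabulary word after each token with probability p."""
    out = []
    for tok in tokens:
        out.append(tok)
        if random.random() < p:
            out.append(random.choice(vocab))
    return out

def character_swap(word: str, p: float = 0.1) -> str:
    """Swap adjacent characters inside a word with probability p."""
    chars = list(word)
    for i in range(len(chars) - 1):
        if random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(random_word_insertion(["the", "cat", "sat"], vocab=["quickly", "dog"]))
print(character_swap("augmentation", p=0.3))
```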
Even in highly developed countries, as many as 15-30% of the population can only understand texts written with a basic vocabulary. Their comprehension of everyday texts is limited, which prevents them from playing an active role in society and making informed decisions regarding healthcare, legal representation, or democratic choice. Lexical simplification is a natural language processing task that aims to make text understandable to everyone by replacing complex vocabulary and expressions with simpler ones, while preserving the original meaning. It has attracted considerable attention over the past 20 years, and fully automatic lexical simplification systems have been proposed for various languages. The main obstacle to progress in the field is the lack of high-quality datasets for building and evaluating lexical simplification systems. We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese, and provide details on the data selection and annotation procedures. This is the first dataset that allows a direct comparison of lexical simplification systems across three languages. To showcase the usability of the dataset, we adapt two state-of-the-art lexical simplification systems with different architectures (neural vs. non-neural) to all three languages (English, Spanish, and Brazilian Portuguese) and evaluate their performance on our new dataset. For a fairer comparison, we use several evaluation measures that capture different aspects of system efficacy, and we discuss their strengths and weaknesses. We find that the state-of-the-art neural lexical simplification systems outperform the state-of-the-art non-neural systems in all three languages. More importantly, we find that the state-of-the-art neural lexical simplification systems perform significantly better for English than for Spanish and Portuguese.
What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating them as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, keeping vocabularies small while still allowing for fast inference. Is the end of the road character-level or byte-level processing? In this survey, we connect several lines of work by showing how hybrid approaches over words and characters, as well as subword-based approaches built on learned segmentation, have been proposed and evaluated. We conclude that there is not, and likely never will be, a silver-bullet singular solution for all applications, and that thinking seriously about tokenization remains important for many of them.
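As a concrete illustration of the subword approach the survey traces back to byte-pair encoding, the following sketch performs a single BPE merge step over a toy frequency-counted corpus. It is a didactic simplification; production tokenizers differ in many details.

```python
# One byte-pair-encoding (BPE) merge step over a toy corpus, for illustration.
from collections import Counter

def most_frequent_pair(words: dict[tuple[str, ...], int]) -> tuple[str, str]:
    """Find the most frequent adjacent symbol pair across the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words: dict[tuple[str, ...], int], pair: tuple[str, str]) -> dict:
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 6}
pair = most_frequent_pair(corpus)  # e.g., ('l', 'o')
print(pair, merge_pair(corpus, pair))
```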
Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. When evaluated on our blind test set, AfroLID achieves an F1 score of 95.89. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to showcase both AfroLID's powerful capabilities and its limitations.
We study politeness phenomena in nine typologically diverse languages. Politeness is an important facet of communication and is sometimes argued to be culture-specific, yet existing computational linguistic studies are limited to English. We create TyDiP, a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples. We evaluate how well multilingual models can identify politeness levels -- they show a fairly robust zero-shot transfer ability, yet fall significantly short of estimated human accuracy. We further study mapping the English politeness strategy lexicon into nine languages via automatic translation and lexicon induction, analyzing whether each strategy's impact stays consistent across languages. Lastly, we empirically study the complicated relationship between formality and politeness through transfer experiments. We hope our dataset will support various research questions and applications, from evaluating multilingual models to constructing polite multilingual agents.
This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were expected to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks served as the main evaluation metric. 5 participating teams submitted 8 coreference prediction systems; in addition, a competitive Transformer-based baseline system was provided by the organizers at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the score averaged across all datasets for all languages).
As machine translation (MT) metrics improve their correlation with human judgement every year, it is crucial to understand the limitations of such metrics at the segment level. Specifically, it is important to investigate metric behaviour when facing accuracy errors in MT because these can have dangerous consequences in certain contexts (e.g., legal, medical). We curate ACES, a translation accuracy challenge set, consisting of 68 phenomena ranging from simple perturbations at the word/character level to more complex errors based on discourse and real-world knowledge. We use ACES to evaluate a wide range of MT metrics including the submissions to the WMT 2022 metrics shared task and perform several analyses leading to general recommendations for metric developers. We recommend: a) combining metrics with different strengths, b) developing metrics that give more weight to the source and less to surface-level overlap with the reference and c) explicitly modelling additional language-specific information beyond what is available via multilingual embeddings.
Machine translation has seen rapid progress with the advent of Transformer-based models. These models have no explicit linguistic structure built into them, yet they can still implicitly learn structured relationships by attending to relevant tokens. We hypothesize that this structure learning can be made more robust by explicitly endowing Transformers with a structural bias, and we investigate two methods for building in such a bias. One method, the TP-Transformer, augments the traditional Transformer architecture with an additional component to represent structure. The second method instills structure at the data level by segmenting the data with morphological tokenization. We test these methods on translating from English into the morphologically rich languages Turkish and Inuktitut, and consider both automatic metrics and human evaluation. We find that each of these two approaches allows the network to achieve better performance, but this improvement depends on the size of the dataset. In sum, structural encoding methods make Transformers more sample-efficient, enabling them to perform better from smaller amounts of data.
In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires a lot of training data, which is not available for African languages. Active learning is a semi-supervised learning algorithm in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning, in the context of NLP and especially multilingual language model pretraining, has received little consideration. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various downstream NLP tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that \textbf{AfroLM} is able to generalize well across various domains. We release the source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
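The self-active learning idea described above can be pictured as an uncertainty-driven selection loop. The snippet below is a schematic outline under assumed method names (`fit`, `uncertainty`); it does not reproduce the AfroLM framework itself.

```python
# Schematic uncertainty-based active learning loop; selection criterion,
# budget, and training calls are illustrative placeholders.
def active_learning_loop(model, labeled_pool, unlabeled_pool, rounds=5, budget=1000):
    for _ in range(rounds):
        model.fit(labeled_pool)                           # train on current labeled data
        # Rank unlabeled samples by model uncertainty (higher = more informative).
        scored = sorted(unlabeled_pool, key=model.uncertainty, reverse=True)
        selected, unlabeled_pool = scored[:budget], scored[budget:]
        labeled_pool += selected                          # add the selected samples
    return model
```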
Data augmentation is an important component in the robustness evaluation of natural language processing (NLP) models and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework that supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards, and robustness analysis results are publicly available on the NL-Augmenter repository (https://github.com/gem-benchmark/nl-augmenter).
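To give a sense of the two operation types the framework distinguishes, the following is a conceptual sketch of a transformation and a filter; the class and method names are illustrative assumptions and do not reproduce NL-Augmenter's actual interfaces.

```python
# Conceptual sketch of a transformation and a filter; names are illustrative
# and not the framework's real API.
class LowercaseTransformation:
    """A transformation: returns modified copies of the input sentence."""
    def generate(self, sentence: str) -> list[str]:
        return [sentence.lower()]

class ShortSentenceFilter:
    """A filter: splits data by keeping only sentences matching a condition."""
    def __init__(self, max_tokens: int = 10):
        self.max_tokens = max_tokens

    def keep(self, sentence: str) -> bool:
        return len(sentence.split()) <= self.max_tokens

print(LowercaseTransformation().generate("NLP Models Are Data Hungry."))
print(ShortSentenceFilter(max_tokens=5).keep("A short example sentence."))
```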
This paper presents a comprehensive survey of corpora and lexical resources available for the Turkish language. We review a wide range of resources, with a focus on those that are publicly available. In addition to providing information about the available linguistic resources, we present a set of recommendations and identify gaps in the data available for conducting research and building applications in Turkish linguistics and natural language processing.
The grammatical analysis of text in any human language typically involves a number of basic processing tasks, such as tokenization, morphological tagging, and dependency parsing. State-of-the-art systems can achieve high accuracy on these tasks for languages with large datasets, but they yield poor results for languages such as Tagalog that have little to no annotated data. To address this issue for Tagalog, we investigate the use of auxiliary data sources for creating task-specific models in the absence of annotated Tagalog data. We also explore the use of word embeddings and data augmentation to improve performance when only a small amount of annotated Tagalog data is available. We show that these zero-shot and few-shot approaches yield substantial improvements in the grammatical analysis of both in-domain and out-of-domain Tagalog text compared to state-of-the-art supervised baselines.
The BERT family of neural language models have become highly popular due to their ability to provide sequences of text with rich context-sensitive token encodings which are able to generalise well to many NLP tasks. We introduce gaBERT, a monolingual BERT model for the Irish language. We compare our gaBERT model to multilingual BERT and the monolingual Irish WikiBERT, and we show that gaBERT provides better representations for a downstream parsing task. We also show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance. We compare the results of fine-tuning a gaBERT model with an mBERT model for the task of identifying verbal multiword expressions, and show that the fine-tuned gaBERT model also performs better at this task. We release gaBERT and related code to the community.
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate\footnote{for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done.} our datasets. To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
Transformer-based language models have recently achieved remarkable results on many natural language tasks. However, leaderboard performance is usually achieved by leveraging massive amounts of training data, and rarely by encoding explicit linguistic knowledge into neural models. This has led many to question the relevance of linguistics for modern natural language processing. In this thesis, I present several case studies to illustrate how theoretical linguistics and neural language models remain relevant to each other. First, language models are useful to linguists by providing an objective tool to measure semantic distance, which is difficult to do with traditional methods. On the other hand, linguistic theory contributes to language modelling research by providing frameworks and sources of data for probing our language models for specific aspects of language understanding. This thesis contributes three studies that explore different aspects of the syntax-semantics interface in language models. In the first part of the thesis, I apply language models to the problem of word class flexibility. Using mBERT as a source of semantic distance measurements, I present evidence in favour of analyzing word class flexibility as a directional process. In the second part of the thesis, I propose a method to measure surprisal at intermediate layers of language models. My experiments show that sentences containing morphosyntactic anomalies trigger surprisal earlier in language models than semantic and commonsense anomalies do. Finally, in the third part of the thesis, I adapt several psycholinguistic studies to show that language models contain knowledge of argument structure constructions. In summary, my thesis develops new connections between natural language processing, linguistic theory, and psycholinguistics to provide fresh perspectives for the interpretation of language models.
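For reference, the surprisal of a token $w_t$ given its left context is conventionally defined as $S(w_t) = -\log P(w_t \mid w_{<t})$; how this quantity is extracted from intermediate layers of a language model is specific to the thesis's proposed method and is not reproduced here.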
Transformer language models (TLMs) are critical for most NLP tasks, but they are difficult to create for low-resource languages because of how much pretraining data they require. In this work, we investigate two techniques for training monolingual TLMs in a low-resource setting: greatly reducing TLM size, and complementing the masked language modeling objective with two linguistically rich supervised tasks (part-of-speech tagging and dependency parsing). Results from 7 diverse languages indicate that our model, MicroBERT, is able to produce marked improvements in downstream task evaluations relative to a typical monolingual TLM pretraining approach. Specifically, we find that monolingual MicroBERT models achieve gains of up to 18% for parser LAS and 11% for NER F1 compared to a multilingual baseline, mBERT, while having less than 1% of its parameter count. We conclude that reducing TLM parameter count and using labeled data for pretraining low-resource TLMs can yield large quality benefits, and in some cases produce models that outperform multilingual approaches.
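The second technique can be summarized as a multi-task pretraining objective. The sketch below shows one plausible way to combine the three losses; the weighting scheme and head modules are illustrative assumptions, not the authors' implementation.

```python
# Illustrative combination of MLM with supervised POS tagging and dependency
# parsing losses; weights are placeholders.
import torch

def multitask_loss(mlm_loss: torch.Tensor,
                   pos_loss: torch.Tensor,
                   dep_loss: torch.Tensor,
                   w_pos: float = 1.0,
                   w_dep: float = 1.0) -> torch.Tensor:
    """Sum the masked language modeling objective with the two supervised losses."""
    return mlm_loss + w_pos * pos_loss + w_dep * dep_loss

print(multitask_loss(torch.tensor(2.1), torch.tensor(0.8), torch.tensor(1.3)))
```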
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
Machine translation systems (MTS) are effective tools for communication, converting text or speech from one language to another. In a large multilingual environment like India, where English and a set of Indian languages (ILs) are officially used, the need for effective translation systems becomes obvious. In contrast to English, ILs are still treated as low-resource languages due to the unavailability of corpora. To address this asymmetry, multilingual neural machine translation (MNMT) systems emerge as an ideal approach in this direction. In this paper, we propose an MNMT system to address the issues related to low-resource language translation. Our model comprises two MNMT systems, one for English-Indic (one-to-many) and the other for Indic-English (many-to-one), with a shared encoder-decoder covering 15 language pairs (30 translation directions). Since most IL pairs have very little parallel corpora, insufficient to train any machine translation model, we explore various augmentation strategies to improve overall translation quality with the proposed model. A state-of-the-art Transformer architecture is used to realize the proposed model. Experiments over a large amount of data reveal its superiority over conventional models. In addition, the paper addresses the use of language relatedness (in terms of dialect, script, etc.), in particular the role of high-resource languages of the same family in boosting the performance of low-resource languages. Moreover, the experimental results also show the advantage of backtranslation and domain adaptation for ILs in enhancing the translation quality of both source and target languages. Using all these key approaches, our proposed model proves to be more effective than the baseline models in terms of evaluation metrics, i.e., BLEU (BiLingual Evaluation Understudy) score, for a set of ILs.