Event Extraction (EE) is one of the fundamental tasks in Information Extraction (IE) that aims to recognize event mentions and their arguments (i.e., participants) from text. Due to its importance, extensive methods and resources have been developed for Event Extraction. However, one limitation of current EE research is the under-exploration of non-English languages, where the lack of high-quality multilingual EE datasets for model training and evaluation has been the main hindrance. To address this limitation, we propose a novel Multilingual Event Extraction dataset (MEE) that provides annotation for more than 50K event mentions in 8 typologically different languages. MEE comprehensively annotates data for entity mentions, event triggers and event arguments. We conduct extensive experiments on the proposed dataset to reveal challenges and opportunities for multilingual EE.
Event Detection (ED) is the task of identifying and classifying trigger words of event mentions in text. Despite considerable research efforts in recent years for English text, the task of ED in other languages has been significantly less explored. For non-English languages, important research questions for ED include how well existing ED models perform on different languages, how challenging ED is in other languages, and how well ED knowledge and annotation can be transferred across languages. To answer those questions, it is crucial to obtain multilingual ED datasets that provide consistent event annotation for multiple languages. There exist some multilingual ED datasets; however, they tend to cover a handful of languages and mainly focus on popular ones. Many languages are not covered in existing multilingual ED datasets. In addition, the current datasets are often small and not accessible to the public. To overcome those shortcomings, we introduce a new large-scale multilingual dataset for ED (called MINION) that consistently annotates events for 8 different languages; 5 of them have not been supported by existing multilingual datasets. We also perform extensive experiments and analysis to demonstrate the challenges and transferability of ED across languages in MINION, which together call for more research effort in this area.
We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. For 9 out of the 11 languages, it contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location and Organization). The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language sentence. We also create manually annotated test sets for 8 languages containing approximately 1000 sentences per language. We demonstrate the utility of the obtained dataset on existing test sets and the Naamapadam-test data for 8 Indic languages. We also release IndicNER, a multilingual mBERT model fine-tuned on the Naamapadam training set. IndicNER achieves the best F1 on the Naamapadam-test set compared to an mBERT model fine-tuned on existing datasets. IndicNER achieves an F1 score of more than 80 for 7 out of 11 Indic languages. The dataset and models are available under open-source licenses at https://ai4bharat.iitm.ac.in/naamapadam.
Translating training data into many languages has emerged as a practical solution for improving cross-lingual transfer. For tasks that involve span-level annotations, such as information extraction or question answering, an additional label projection step is required to map annotated spans onto the translated texts. Recently, a few efforts have utilized a simple mark-then-translate method to jointly perform translation and projection by inserting special markers around the labeled spans in the original sentence. However, as far as we are aware, no empirical analysis has been conducted on how this approach compares to traditional annotation projection based on word alignment. In this paper, we present an extensive empirical study across 42 languages and three tasks (QA, NER, and Event Extraction) to evaluate the effectiveness and limitations of both methods, filling an important gap in the literature. Experimental results show that our optimized version of mark-then-translate, which we call EasyProject, is easily applied to many languages and works surprisingly well, outperforming the more complex word alignment-based methods. We analyze several key factors that affect end-task performance, and show EasyProject works well because it can accurately preserve label span boundaries after translation. We will publicly release all our code and data.
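A minimal sketch of the mark-then-translate idea described above, offered only as an illustration and not as the EasyProject implementation. The square-bracket markers, the function names `mark_spans` and `recover_spans`, and the assumption that the translator keeps both the markers and the order of the spans are all hypothetical simplifications; the translation step itself is left out.

```python
import re
from typing import List, Tuple

# (start, end, label) with character offsets, end exclusive.
Span = Tuple[int, int, str]


def mark_spans(text: str, spans: List[Span]) -> str:
    """Wrap each labeled span in square brackets before translation.

    Assumes spans do not overlap and the text contains no other brackets.
    """
    out, cursor = [], 0
    for start, end, _label in sorted(spans):
        out.append(text[cursor:start])
        out.append("[" + text[start:end] + "]")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def recover_spans(marked: str, labels: List[str]) -> Tuple[str, List[Span]]:
    """Strip markers from translated text and return (clean text, projected spans).

    Assumes the translator preserved the markers and the span order, so the
    i-th bracketed segment receives the i-th label.
    """
    clean, spans = [], []
    cursor = 0   # position in the marked string
    offset = 0   # length of the clean string built so far
    for match, label in zip(re.finditer(r"\[(.*?)\]", marked), labels):
        before = marked[cursor:match.start()]
        clean.append(before)
        offset += len(before)
        inner = match.group(1)
        spans.append((offset, offset + len(inner), label))
        clean.append(inner)
        offset += len(inner)
        cursor = match.end()
    clean.append(marked[cursor:])
    return "".join(clean), spans
```

For example, `mark_spans("Obama visited Paris.", [(0, 5, "PER"), (14, 19, "LOC")])` yields `"[Obama] visited [Paris]."`. If a translation system then returned `"[Obama] besuchte [Paris]."`, calling `recover_spans` on it with the labels `["PER", "LOC"]` would give the clean sentence together with the projected character spans. A real system would also need to handle dropped or reordered markers, which this sketch does not attempt.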
Much recent progress in applications of machine learning models to NLP has been driven by benchmarks that evaluate models across a wide variety of tasks. However, these broad-coverage benchmarks have been mostly limited to English, and despite an increasing interest in multilingual models, a benchmark that enables the comprehensive evaluation of such methods on a diverse range of languages and tasks is still missing. To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. There is also a wide spread of results across languages. We release the benchmark to encourage research on cross-lingual learning methods that transfer linguistic knowledge across a diverse and representative set of languages and tasks.
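Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English-language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together, by both aggregating existing datasets and creating new ones, visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of unlabelled text data available for pretraining, and only weakly by the typological distance between target and source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.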
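We present the results of the Workshop on Multilingual Information Access (MIA) 2022 Shared Task, evaluating cross-lingual open-retrieval question answering (QA) systems in 16 typologically diverse languages. In this task, we adapted two large-scale cross-lingual open-retrieval QA datasets in 14 typologically diverse languages, and used newly annotated open-retrieval QA data in 2 underrepresented languages: Tagalog and Tamil. Four teams submitted their systems. The best system, which leverages iteratively mined diverse negative examples and larger pretrained models, achieves 32.2 F1, outperforming our baseline by 4.5 points. The second-best system uses entity-aware contextualized representations for document retrieval and achieves significant improvements in Tamil (20.8 F1), whereas most of the other systems score nearly zero.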
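Modern Entity Linking (EL) systems entrench a popularity bias, yet there is no dataset focusing on tail and emerging entities in languages other than English. We present Hansel, a new benchmark in Chinese that fills the vacancy of non-English few-shot and zero-shot EL challenges. The test set of Hansel is human-annotated and reviewed, and was created with a novel method for collecting zero-shot EL datasets. It covers 10K diverse documents from news, social media posts and other web articles, with Wikidata as the target knowledge base. We demonstrate that the existing state-of-the-art EL system performs poorly on Hansel (R@1 of 36.6% on few-shot). We then establish a strong baseline that scores an R@1 of 46.2% on zero-shot on our dataset. We also show that our baseline achieves competitive results on the TAC-KBP2015 Chinese Entity Linking task.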
JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the effectiveness of transfer from large monolingual or multilingual pretrained models. While our work, along with previous work, shows that transfer from these models to low-resource languages that are unrelated to languages in their training set is not very effective, we would expect stronger results from transfer to creoles. Indeed, our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages, and help us begin to understand how the unique relationship between creoles and their high-resource base languages affects cross-lingual transfer. JamPatoisNLI, which consists of naturally occurring premises and expert-written hypotheses, is a step towards steering research into a traditionally underserved language and a useful benchmark for understanding cross-lingual NLP.
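Existing multiparty dialogue datasets for coreference resolution are nascent, and many challenges are still unaddressed. We create a large-scale dataset for this task, Multilingual Multiparty Coref (MMC), based on TV transcripts. Because gold-quality subtitles are available in multiple languages, we propose reusing the annotations to create silver coreference data in other languages (Chinese and Farsi) via annotation projection. On the gold (English) data, off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has broader coverage of multiparty coreference than prior datasets. On the silver data, we find success both in using it for data augmentation and in training from scratch, which effectively simulates the zero-shot cross-lingual setting.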
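Event extraction in the clinical domain is an under-explored research area. The lack of training data, together with the high volume of domain-specific jargon that includes long entities with vague boundaries, makes the task especially challenging. In this paper, we introduce DICE, a robust and data-efficient generative model for clinical event extraction. DICE frames event extraction as a conditional generation problem and utilizes descriptions provided by domain experts to boost performance under low-resource settings. Furthermore, DICE learns to locate and bound biomedical mentions with an auxiliary mention identification task that is trained jointly with the event extraction tasks to leverage inter-task dependencies, and it further incorporates the identified mentions as trigger and argument candidates for their respective tasks. We also introduce MacCrobat-EE, the first clinical event extraction dataset with event argument annotation. Our experiments demonstrate the robustness of DICE under low-data settings in the clinical domain and the benefits of incorporating flexible joint training and mention markers into generative approaches.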
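Clinical phenotyping enables the automatic extraction of clinical conditions from patient records, which can be beneficial to doctors and clinics worldwide. However, current state-of-the-art models are mostly applicable to clinical notes written in English. We therefore investigate cross-lingual knowledge transfer strategies to perform this task for clinics that do not use English and have only a small amount of in-domain data available. We evaluate these strategies for a Greek and a Spanish clinic, leveraging clinical notes from different clinical domains such as cardiology, oncology and the ICU. Our results reveal two strategies that outperform the state of the art: translation-based methods combined with domain-specific encoders, and cross-lingual encoders plus adapters. We find that these strategies perform especially well for classifying rare phenotypes, and we advise on which method to prefer in which situation. Our results show that using multilingual data overall improves clinical phenotyping models and can compensate for data sparseness.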
Recent directions for offensive language detection are hierarchical modeling, identifying the type and the target of offensive language, and interpretability with offensive span annotation and prediction. These improvements are focused on English and do not transfer well to other languages because of cultural and linguistic differences. In this paper, we present the Korean Offensive Language Dataset (KOLD) comprising 40,429 comments, which are annotated hierarchically with the type and the target of offensive language, accompanied by annotations of the corresponding text spans. We collect the comments from NAVER news and YouTube platform and provide the titles of the articles and videos as the context information for the annotation process. We use these annotated comments as training data for Korean BERT and RoBERTa models and find that they are effective at offensiveness detection, target classification, and target span detection while having room for improvement for target group classification and offensive span detection. We discover that the target group distribution differs drastically from the existing English datasets, and observe that providing the context information improves the model performance in offensiveness detection (+0.3), target classification (+1.5), and target group classification (+13.1). We publicly release the dataset and baseline models.
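We introduce MTG, a new benchmark suite for training and evaluating multilingual text generation. It is the first multilingual multiway text generation dataset and has the largest amount of human-annotated data (400K). It includes four generation tasks (story generation, question generation, title generation and text summarization) across five languages (English, German, French, Spanish and Chinese). The multiway setup enables testing a model's knowledge transfer capabilities across languages and tasks. Using MTG, we train and analyze several popular multilingual generation models from different aspects. Our benchmark suite fosters model performance enhancement with more human-annotated parallel data. It provides comprehensive evaluations with diverse generation scenarios. Code and data are available at https://github.com/zide05/mtg.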
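As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether, and by how much, NLP datasets match the expected needs of the language speakers. In doing so, we use entity recognition and linking systems, while also making important observations about their cross-lingual consistency and giving suggestions for more robust evaluation. Last, we explore some geographical and economic factors that may explain the observed dataset distributions. Code and data are available here: https://github.com/ffaisal93/dataset_geography. Additional visualizations are available here: https://nlp.cs.gmu.edu/project/datasetmaps/.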
State-of-the-art natural language processing systems rely on supervision in the form of annotated data to learn competent models. These models are generally trained on data in a single language (usually English), and cannot be directly used beyond that language. Since collecting data in every language is not realistic, there has been a growing interest in crosslingual language understanding (XLU) and low-resource cross-language transfer. In this work, we construct an evaluation set for XLU by extending the development and test sets of the Multi-Genre Natural Language Inference Corpus (MultiNLI) to 15 languages, including low-resource languages such as Swahili and Urdu. We hope that our dataset, dubbed XNLI, will catalyze research in cross-lingual sentence understanding by providing an informative standard evaluation task. In addition, we provide several baselines for multilingual sentence understanding, including two based on machine translation systems, and two that use parallel data to train aligned multilingual bag-of-words and LSTM encoders. We find that XNLI represents a practical and challenging evaluation suite, and that directly translating the test data yields the best performance among available baselines.
In this work, we introduce IndicXTREME, a benchmark consisting of nine diverse tasks covering 18 languages from the Indic sub-continent belonging to four different families. Across languages and tasks, IndicXTREME contains a total of 103 evaluation sets, of which 51 are new contributions to the literature. To maintain high quality, we only use human annotators to curate or translate our datasets (for IndicXParaphrase, where an automatic translation system is used, a second human verification and correction step is done). To the best of our knowledge, this is the first effort toward creating a standard benchmark for Indic languages that aims to test the zero-shot capabilities of pretrained language models. We also release IndicCorp v2, an updated and much larger version of IndicCorp that contains 20.9 billion tokens in 24 languages. We pretrain IndicBERT v2 on IndicCorp v2 and evaluate it on IndicXTREME to show that it outperforms existing multilingual language models such as XLM-R and MuRIL.
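Natural Language Inference (NLI) and Semantic Textual Similarity (STS) are widely used benchmark tasks for compositional evaluation of pre-trained language models. Despite growing interest in linguistic universals, most NLI/STS studies have focused almost exclusively on English. In particular, there are no available multilingual NLI/STS datasets in Japanese, which is typologically different from English and can shed light on the currently controversial behavior of language models, such as their sensitivity to word order and case particles. Against this background, we introduce JSICK, a Japanese NLI/STS dataset that was manually translated from the English dataset SICK. We also present a stress-test dataset for compositional inference, created by transforming the syntactic structures of sentences in JSICK to investigate whether language models are sensitive to word order and case particles. We conduct baseline experiments on different pre-trained language models and compare the performance of multilingual models when applied to Japanese and other languages. The results of the stress-test experiments suggest that current pre-trained language models are insensitive to word order and case marking.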
Open-Domain Generative Question Answering has achieved impressive performance in English by combining document-level retrieval with answer generation. These approaches, which we refer to as GenQA, can generate complete sentences, effectively answering both factoid and non-factoid questions. In this paper, we extend GenQA to the multilingual and cross-lingual settings. For this purpose, we first introduce GenTyDiQA, an extension of the TyDiQA dataset with well-formed and complete answers for Arabic, Bengali, English, Japanese, and Russian. Based on GenTyDiQA, we design a cross-lingual generative model that produces full-sentence answers by exploiting passages written in multiple languages, including languages different from the question. Our cross-lingual generative system outperforms answer sentence selection baselines for all 5 languages and monolingual generative pipelines for three out of five languages studied.
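Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.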