We describe the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. This toolkit is quite widely used, both in the research NLP community and also among commercial and government users of open source NLP technology. We suggest that this follows from a simple, approachable design, straightforward interfaces, the inclusion of robust and good quality analysis components, and not requiring use of a large amount of associated baggage.
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multiword token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.
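The value of quality treebanks is steadily increasing because of the crucial role they play in the development of natural language processing tools. Creating such treebanks is labor-intensive and time-consuming. Tools that support the annotation process are essential, especially when the size of treebanks is considered. Various annotation tools have been proposed, but they are often unsuitable for agglutinative languages such as Turkish. BoAT v1 was developed for annotating dependency relations and was subsequently used to create the manually annotated BOUN Treebank (UD_Turkish-BOUN). In this work, we report on the design and implementation of the dependency annotation tool BoAT v2, based on the experience gained from using BoAT v1, which revealed several opportunities for improvement. BoAT v2 is a multi-user, web-based dependency annotation tool designed with a focus on the annotator's user experience in order to yield valid annotations. The main objectives of the tool are to: (1) support the creation of valid and consistent annotations at increased speed, (2) significantly improve the user experience of the annotator, (3) support collaboration among annotators, and (4) provide an open-source and easily deployable web-based annotation tool with a flexible application programming interface (API) to benefit the scientific community. This paper discusses the requirements elicitation, design, and implementation of BoAT v2, along with examples.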
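We describe the design and use of the CREER dataset, a large corpus annotated with rich English grammatical and semantic attributes. The CREER dataset uses the Stanford CoreNLP annotators to capture rich linguistic structures from Wikipedia plain text. The dataset follows widely used linguistic and semantic annotation conventions, so it can be used not only for most natural language processing tasks but also for extending the dataset. This large supervised dataset can serve as a basis for improving the performance of future NLP tasks. We publish the dataset at the following link: https://140.116.82.111/share.cgi?ssid=000Doj4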
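Data augmentation is an important component of the robustness evaluation of natural language processing (NLP) models, as well as of enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework that supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards, and robustness analysis results are publicly available in the NL-Augmenter repository (https://github.com/gem-benchmark/nl-augmenter).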
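The distinction between a transformation (which modifies an example) and a filter (which selects examples by some feature) can be illustrated with a minimal sketch. The class and function names below (`Transformation`, `Filter`, `SwapLeadingChars`, `MinLengthFilter`, `augment`) are hypothetical and chosen for illustration only; they are not taken from the NL-Augmenter code base.

```python
from abc import ABC, abstractmethod
from typing import List


class Transformation(ABC):
    """Hypothetical interface: maps one text to one or more modified texts."""

    @abstractmethod
    def generate(self, sentence: str) -> List[str]:
        ...


class Filter(ABC):
    """Hypothetical interface: decides whether a text belongs to a data split."""

    @abstractmethod
    def keep(self, sentence: str) -> bool:
        ...


class SwapLeadingChars(Transformation):
    """Toy perturbation: swap the first two characters of words longer than 3."""

    def generate(self, sentence: str) -> List[str]:
        words = [
            w[1] + w[0] + w[2:] if len(w) > 3 else w
            for w in sentence.split()
        ]
        return [" ".join(words)]


class MinLengthFilter(Filter):
    """Toy filter: keep only texts with at least `min_tokens` whitespace tokens."""

    def __init__(self, min_tokens: int) -> None:
        self.min_tokens = min_tokens

    def keep(self, sentence: str) -> bool:
        return len(sentence.split()) >= self.min_tokens


def augment(corpus: List[str], transformation: Transformation, flt: Filter) -> List[str]:
    """Apply the transformation to every text selected by the filter."""
    augmented: List[str] = []
    for sentence in corpus:
        if flt.keep(sentence):
            augmented.extend(transformation.generate(sentence))
    return augmented


if __name__ == "__main__":
    print(augment(["the quick brown fox", "hi"], SwapLeadingChars(), MinLengthFilter(3)))
```

Comparing a model's predictions on the original and perturbed texts within a filtered split is one simple way to probe robustness in the sense the abstract describes.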
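Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks, and it is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG). There is a strong need for community-accessible datasets for training existing AI-based cybersecurity pipelines to extract meaningful insights from CTI efficiently and accurately. We have created an initial unstructured CTI corpus from a variety of open sources, which we use to train and test cybersecurity entity models with the spaCy framework, and we explore self-learning methods for automatically recognizing cybersecurity entities. We also describe methods for linking cybersecurity domain entities with existing world knowledge from Wikidata. Our future work will survey and test spaCy NLP tools and create methods for the continuous integration of new information extracted from text.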
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
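This paper presents the challenges of creating and managing large parallel corpora for 12 major Indian languages (soon to be extended to 23 languages) as part of a major consortium project funded by the Department of Information Technology (DIT), Government of India, and running in parallel at 10 different universities in India. To manage the creation and dissemination of these huge corpora efficiently, the web-based annotation tool ILCIANN (Indian Languages Corpora Initiative Annotation Tool), which also has a reduced stand-alone version, has been developed. It was developed primarily for POS annotation and for managing corpus annotation carried out by people with differing levels of competence who are located physically far apart. To maintain consistency and standards in the creation of the corpora, it was necessary for everyone to work on a common platform, which this tool provides.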
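Although several open-source language processing pipelines are available for Hungarian, none of them satisfies the requirements of today's NLP applications. A language processing pipeline should provide close to state-of-the-art lemmatization, morphosyntactic analysis, entity recognition, and word embeddings. Industrial text processing applications must satisfy non-functional software quality requirements, and, what is more, frameworks that support multiple languages are increasingly favored. This paper introduces HuSpaCy, an industry-ready Hungarian language processing pipeline. The presented tool provides components for the most important basic linguistic analysis tasks. It is open source and available under a permissive license. Our system is built on spaCy's NLP components, which means that it is fast, has a rich ecosystem of NLP applications and extensions, and comes with extensive documentation and a well-known API. Besides an overview of the underlying models, we also present a rigorous evaluation on common benchmark datasets. Our experiments confirm that HuSpaCy has high accuracy in all subtasks while maintaining resource-efficient prediction capabilities.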
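Named entity recognition is an information extraction task that serves as a preprocessing step for other natural language processing tasks, such as machine translation, information retrieval, and question answering. Named entity recognition identifies proper names as well as temporal and numeric expressions in open-domain text. For Semitic languages such as Arabic, Amharic, and Hebrew, the task is more challenging because of the heavily inflected structure of these languages. In this paper, we present an Amharic named entity recognition system based on bidirectional long short-term memory with a conditional random fields layer. We annotate a new Amharic named entity recognition dataset (8,070 sentences with 182,691 tokens) and apply the Synthetic Minority Over-sampling Technique to our dataset to mitigate the imbalanced classification problem. Our named entity recognition system achieves an F_1 score of 93%, which is a new state-of-the-art result for Amharic named entity recognition.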
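The core idea of the Synthetic Minority Over-sampling Technique is to create new minority-class points by interpolating between an existing minority point and one of its nearest minority-class neighbours. The following is a minimal sketch of that interpolation step on plain feature vectors. How the abstract's authors apply the technique to token-level NER data is not stated, so the function name `smote`, its parameters, and the toy vectors are illustrative assumptions rather than their implementation.

```python
import random
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(
    minority: List[Sequence[float]],
    n_new: int,
    k: int = 2,
    seed: int = 0,
) -> List[List[float]]:
    """Generate `n_new` synthetic minority samples by neighbour interpolation.

    Assumed, simplified variant: the base point is drawn uniformly at random,
    and one of its k nearest minority neighbours is chosen uniformly.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest *other* minority points.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: squared_distance(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for point in smote(toy_minority, n_new=3):
        print(point)
```

Because every synthetic point lies on a line segment between two real minority points, the new samples stay inside the region the minority class already occupies.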
We present POTATO, the Portable text annotation tool, a free, fully open-sourced annotation system that 1) supports labeling many types of text and multimodal data; 2) offers easy-to-configure features to maximize the productivity of both deployers and annotators (convenient templates for common ML/NLP tasks, active learning, keypress shortcuts, keyword highlights, tooltips); and 3) supports a high degree of customization (editable UI, inserting pre-screening questions, attention and qualification tests). Experiments over two annotation tasks suggest that POTATO improves labeling speed through its specially-designed productivity features, especially for long documents and complex tasks. POTATO is available at https://github.com/davidjurgens/potato and will continue to be updated.
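We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens drawn from speeches of the Prime Minister of India and from Assamese plays. It also contains person names, location names, and addresses. The proposed NER dataset is likely to be a significant resource for deep neural Assamese language processing. We benchmark the dataset by training NER models and evaluating them with state-of-the-art architectures for supervised named entity recognition (NER), such as FastText, BERT, XLM-R, FLAIR, and MuRIL. We implement several baseline approaches with the state-of-the-art Bi-LSTM-CRF sequence tagging architecture. The highest F1 score among all baselines is 80.69%, obtained when MuRIL is used as the word embedding method. The annotated dataset and the best-performing model are publicly available.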
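An F1 score for a sequence tagger is commonly computed at the entity level, by comparing predicted and gold entity spans. The sketch below assumes BIO-encoded tags and exact span matching; the abstract does not state the tagging scheme or the exact evaluation protocol, so both are assumptions, and the helper names `bio_to_spans` and `entity_f1` are illustrative.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity label)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of labelled spans.

    Strict reading: an I- tag that does not continue a span of the same
    label is treated as outside any entity.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                spans.add((start, i, label))
                start, label = None, None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Entity-level F1 with exact span and label matching."""
    gold_spans = bio_to_spans(gold)
    pred_spans = bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold_tags = ["B-PER", "I-PER", "O", "B-LOC"]
    pred_tags = ["B-PER", "I-PER", "O", "O"]
    print(round(entity_f1(gold_tags, pred_tags), 4))
```

In this toy case the prediction finds one of the two gold entities and makes no false predictions, so precision is 1.0, recall is 0.5, and F1 is 2/3.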
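This paper presents a comprehensive survey of the corpora and lexical resources available for Turkish. We review a broad range of resources, focusing on those that are publicly available. In addition to providing information about the available language resources, we offer a set of recommendations and identify gaps in the data available for conducting research and building applications in Turkish linguistics and natural language processing.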
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors, respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems, which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed, with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as a comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
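Current work on automatic coreference resolution has focused on the OntoNotes benchmark dataset because of its size and consistency. However, many aspects of the OntoNotes annotation scheme are not well understood by NLP practitioners, including the treatment of generic NPs, noun modifiers, indefinite anaphora, predication, and more. These often lead to counterintuitive claims, results, and system behaviors. This opinion piece aims to highlight some of the problems with the OntoNotes approach to coreference and to propose a way forward that relies on three principles: 1. a focus on semantics, not morphosyntax; 2. cross-linguistic generalizability; and 3. a separation of identity and scope, which can resolve old problems involving temporal and modal domain consistency.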
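Even in highly developed countries, as many as 15-30% of the population can only understand texts written with a basic vocabulary. Their understanding of everyday texts is limited, which prevents them from taking an active role in society and from making informed decisions about healthcare, legal representation, or democratic choice. Lexical simplification is a natural language processing task that aims to make text understandable to everyone by replacing complex vocabulary and expressions with simpler ones while preserving the original meaning. It has attracted considerable attention over the last 20 years, and fully automatic lexical simplification systems have been proposed for various languages. The main obstacle to progress in the field is the lack of high-quality datasets for building and evaluating lexical simplification systems. We present a new benchmark dataset for lexical simplification in English, Spanish, and (Brazilian) Portuguese, and provide details about the data selection and annotation procedures. This is the first dataset that allows a direct comparison of lexical simplification systems across three languages. To showcase the usability of the dataset, we adapt two state-of-the-art lexical simplification systems with different architectures (neural vs. non-neural) to all three languages (English, Spanish, and Brazilian Portuguese) and evaluate their performance on our new dataset. For a fairer comparison, we use several evaluation measures that capture different aspects of the systems' efficacy, and we discuss their strengths and weaknesses. We find that the state-of-the-art neural lexical simplification system outperforms the state-of-the-art non-neural lexical simplification system in all three languages. More importantly, we find that the state-of-the-art neural lexical simplification system performs significantly better for English than for Spanish and Portuguese.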
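Discovering what other people think has always been a key aspect of our information-gathering strategy. People can now actively use information technology to seek out and understand the ideas of others, thanks to the growing availability of opinion-rich resources such as online review sites and personal blogs. Because of its key role in understanding people's opinions, sentiment analysis (SA) is a crucial task. Existing research, however, focuses mainly on English, and only a small number of studies address low-resource languages. For sentiment analysis, this work presents a new multi-class Urdu dataset based on user reviews. The Urdu data were collected from Twitter. Our proposed dataset includes 10,000 reviews that human experts have carefully classified into two categories: positive and negative. The main purpose of this research is to construct a manually annotated dataset for Urdu sentiment analysis and to establish baseline results. Five different lexicon- and rule-based algorithms, namely NaiveBayes, Stanza, TextBlob, Vader, and Flair, are employed, and the experimental results show that Flair, with an accuracy of 70%, outperforms the other tested algorithms.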
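Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.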
To effectively train accurate Relation Extraction models, sufficient and properly labeled data is required. Adequately labeled data is difficult to obtain and annotating such data is a tricky undertaking. Previous works have shown that either accuracy has to be sacrificed or the task is extremely time-consuming, if done accurately. We are proposing an approach in order to produce high-quality datasets for the task of Relation Extraction quickly. Neural models, trained to do Relation Extraction on the created datasets, achieve very good results and generalize well to other datasets. In our study, we were able to annotate 10,022 sentences for 19 relations in a reasonable amount of time, and trained a commonly used baseline model for each relation.