Aspect-based sentiment analysis (ABSA) has become a prevalent task in recent years. However, the lack of a unified framework in current ABSA research makes it challenging to fairly compare the performance of different models. We therefore created an open-source ABSA framework, PyABSA. Moreover, previous efforts usually neglect the precursor aspect term extraction (ATE) subtask and focus on the aspect sentiment classification (ASC) subtask. Compared with previous works, PyABSA includes the features of aspect term extraction, aspect sentiment classification, and text classification, while multiple ABSA subtasks can be adapted to PyABSA thanks to its modular architecture. To facilitate ABSA applications, PyABSA seamlessly integrates multilingual modeling, automated dataset annotation, etc., which helps in deploying ABSA services. For ASC and ATE, PyABSA provides up to 33 and 7 built-in models, respectively, and all models support fast training and instant inference. In addition, PyABSA contains 180K+ ABSA instances across 21 augmented ABSA datasets for applications and research. PyABSA is available at https://github.com/yangheng95/pyabsa
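As a rough illustration of PyABSA's instant-inference workflow, the sketch below loads a pretrained joint ATE+ASC checkpoint via the checkpoint-manager interface from earlier PyABSA releases; the checkpoint name and exact entry points are assumptions, so consult the repository for the current API.

```python
from pyabsa import ATEPCCheckpointManager

# Load a pretrained joint aspect-extraction + sentiment model.
# 'multilingual' is an assumed checkpoint name; see the repository
# for the checkpoints actually published.
extractor = ATEPCCheckpointManager.get_aspect_extractor(checkpoint='multilingual')

# Extract aspect terms and predict their sentiment in one pass.
results = extractor.extract_aspect(
    inference_source=['The food was great but the service was slow.'],
    pred_sentiment=True,
)
```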
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
We introduce YATO, an open-source toolkit for text analysis with deep learning. It focuses on fundamental sequence labeling and sequence classification tasks on text. Designed in a hierarchical structure, YATO supports free combinations of three types of features: 1) traditional neural networks (CNN, RNN, etc.); 2) pre-trained language models (BERT, RoBERTa, ELECTRA, etc.); and 3) user-customized neural features, all through a simple configuration file. Benefiting from flexibility and ease of use, YATO can facilitate reproducing and refining state-of-the-art NLP models and promote the interdisciplinary application of NLP techniques. The source code, examples, and documentation are publicly available at https://github.com/jiesutd/yato.
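A minimal sketch of YATO's config-driven workflow, assuming the `YATO` entry point shown in the project README (names may differ between releases); the configuration file is where the CNN/RNN layers, pre-trained language model, and custom features are combined.

```python
from yato import YATO  # assumed entry point; check the repository README

# The configuration file selects the feature stack (word/char CNN or RNN,
# a pre-trained language model, user-defined features) and the task type.
model = YATO('demo.train.config')
model.train()
```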
We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multiword token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.
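A minimal usage example of the Stanza neural pipeline, following the library's documented interface:

```python
import stanza

stanza.download('en')        # fetch the English models (run once)
nlp = stanza.Pipeline('en')  # tokenization through parsing and NER

doc = nlp('Barack Obama was born in Hawaii.')
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.deprel)
```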
Recent progress in natural language processing has been driven by advances in both model architecture and model pretraining. Transformer architectures have facilitated building higher-capacity models and pretraining has made it possible to effectively utilize this capacity for a wide variety of tasks. Transformers is an open-source library with the goal of opening up these advances to the wider machine learning community. The library consists of carefully engineered state-of-the-art Transformer architectures under a unified API. Backing this library is a curated collection of pretrained models made by and available for the community. Transformers is designed to be extensible by researchers, simple for practitioners, and fast and robust in industrial deployments. The library is available at https://github.com/huggingface/transformers.
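The unified API is most visible in the `pipeline` helper; a minimal example:

```python
from transformers import pipeline

# pipeline() bundles a tokenizer and a pretrained model behind one call;
# with no model argument it falls back to a default English checkpoint.
classifier = pipeline('sentiment-analysis')
print(classifier('This library makes state-of-the-art NLP remarkably easy.'))
```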
Nowadays, foundation models have become one of the fundamental infrastructures of artificial intelligence, paving the way toward general intelligence. However, reality presents two urgent challenges: existing foundation models are dominated by the English-language community, and users are often given limited resources and therefore cannot always use foundation models. To support the development of the Chinese-language community, we introduce an open-source project called Fengshenbang, led by the research center for Cognitive Computing and Natural Language (CCNL). Our project has comprehensive capabilities, including large pre-trained models, user-friendly APIs, benchmarks, datasets, and more. We wrap all of these into three sub-projects: the Fengshenbang Model, the Fengshen Framework, and the Fengshen Benchmark. The open-source roadmap of Fengshenbang aims to re-evaluate the open-source community of Chinese pre-trained large models and to prompt the development of the entire Chinese large-model community. We also want to build a user-centered open-source ecosystem that allows individuals to access the desired models matching their computing resources. Furthermore, we invite companies, universities, and research institutions to collaborate with us in building the large-scale open-source model ecosystem. We hope this project will become the foundation of Chinese cognitive intelligence.
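Since Fengshenbang checkpoints are distributed through the Hugging Face Hub under the IDEA-CCNL organization, they can be loaded with the standard `transformers` calls; the specific model name below is an assumed example.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# An assumed example checkpoint from the IDEA-CCNL organization; browse
# the Hub for the models actually published under the project.
name = 'IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
```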
In this paper, we present TweetNLP, an integrated platform for natural language processing (NLP) in social media. TweetNLP supports a diverse set of NLP tasks, including generic focus areas such as sentiment analysis and named entity recognition, as well as social-media-specific tasks such as emoji prediction and offensive language identification. Task-specific systems are powered by reasonably sized Transformer-based language models specialized for social media text (in particular, Twitter), which can run without dedicated hardware or cloud services. The main contributions of TweetNLP are: (1) an integrated Python library for a modern toolkit supporting social media analysis using our various task-specific models adapted to the social domain; (2) an interactive online demo for codeless experimentation using our models; and (3) a tutorial covering a wide variety of typical social media applications.
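A minimal sketch of the TweetNLP Python library, assuming the `load_model` interface described by the project:

```python
import tweetnlp

# Load the Twitter-specialized sentiment classifier and run it on a post.
model = tweetnlp.load_model('sentiment')
print(model.sentiment('Getting my results back with TweetNLP was so quick!'))
```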
We present POTATO, the Portable text annotation tool, a free, fully open-sourced annotation system that 1) supports labeling many types of text and multimodal data; 2) offers easy-to-configure features to maximize the productivity of both deployers and annotators (convenient templates for common ML/NLP tasks, active learning, keypress shortcuts, keyword highlights, tooltips); and 3) supports a high degree of customization (editable UI, inserting pre-screening questions, attention and qualification tests). Experiments over two annotation tasks suggest that POTATO improves labeling speed through its specially-designed productivity features, especially for long documents and complex tasks. POTATO is available at https://github.com/davidjurgens/potato and will continue to be updated.
The rapid proliferation of machine learning models across domains and deployment settings has given rise to various communities (e.g., industry practitioners) that seek to benchmark models across tasks and objectives of personal value. Unfortunately, these users cannot use standard benchmark results to perform such value-driven comparisons, as traditional benchmarks evaluate models on a single objective (e.g., average accuracy) and fail to provide a standardized training framework that controls for confounding variables (e.g., computational budget), making fair comparisons difficult. To address these challenges, we introduce the open-source Ludwig Benchmarking Toolkit (LBT), a personalized benchmarking toolkit for running end-to-end benchmark studies (from hyperparameter optimization to evaluation) across an easily extensible set of tasks, deep learning models, datasets, and evaluation metrics. LBT provides a configurable interface for controlling training and customizing evaluation, a standardized training framework for eliminating confounding variables, and support for multi-objective evaluation. We show how LBT can be used to create personalized benchmark studies with a large-scale comparative analysis for text classification across 7 models and 9 datasets. We explore the trade-offs between inference latency and performance, the relationships between dataset attributes and performance, and the effects of pretraining on convergence and robustness, demonstrating how LBT can be used to satisfy various benchmarking objectives.
Text classification is a typical natural language processing or computational linguistics task with various interesting applications. As the number of users on social media platforms increases, the acceleration of data has promoted emerging research on social media text classification (SMTC) or social media text mining. Compared with English, Vietnamese is one of the low-resource languages that is still not focused on and thoroughly exploited. Inspired by the success of GLUE, we introduce the Social Media Text Classification Evaluation (SMTCE) benchmark as a collection of datasets and models for a variety of SMTC tasks. With the proposed benchmark, we implement and analyze the effectiveness of various multilingual BERT-based models (mBERT, XLM-R, and DistilmBERT) and monolingual BERT-based models (PhoBERT, viBERT, vELECTRA, and viBERT4news) on the SMTCE benchmark. Monolingual models outperform multilingual models and achieve state-of-the-art results on all text classification tasks. The benchmark provides an objective assessment of multilingual and monolingual BERT-based models, which will benefit future research on BERTology in the Vietnamese language.
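As a sketch of how such a monolingual baseline can be instantiated, PhoBERT is available on the Hugging Face Hub; the SMTCE data loading and fine-tuning loop are omitted here, and the label count is an assumption that depends on the task.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# PhoBERT expects word-segmented Vietnamese input; num_labels depends on
# the SMTCE task being fine-tuned (binary is shown here as an assumption).
tokenizer = AutoTokenizer.from_pretrained('vinai/phobert-base')
model = AutoModelForSequenceClassification.from_pretrained(
    'vinai/phobert-base', num_labels=2
)
```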
Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labeling methods that enable us to create datasets for these low-resource languages. We evaluate pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
Existing multiparty dialogue datasets for coreference resolution are nascent, and many challenges remain unaddressed. We create a large-scale dataset, Multilingual Multiparty Coref (MMC), for this task based on TV transcripts. Due to the availability of gold-quality subtitles in multiple languages, we propose reusing the annotations to create silver coreference data in other languages (Chinese and Farsi) via annotation projection. On the gold (English) data, off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has broader coverage of multiparty coreference than prior datasets. On the silver data, we find success both in using it for data augmentation and in training from scratch, which effectively simulates the zero-shot cross-lingual setting.
As an important fine-grained sentiment analysis problem, aspect-based sentiment analysis (ABSA), aiming to analyze and understand people's opinions at the aspect level, has been attracting considerable interest in the last decade. To handle ABSA in different scenarios, various tasks are introduced for analyzing different sentiment elements and their relations, including the aspect term, aspect category, opinion term, and sentiment polarity. Unlike early ABSA works focusing on a single sentiment element, many compound ABSA tasks involving multiple elements have been studied in recent years for capturing more complete aspect-level sentiment information. However, a systematic review of various ABSA tasks and their corresponding solutions is still lacking, which we aim to fill in this survey. More specifically, we provide a new taxonomy for ABSA which organizes existing studies from the axes of concerned sentiment elements, with an emphasis on recent advances of compound ABSA tasks. From the perspective of solutions, we summarize the utilization of pre-trained language models for ABSA, which improved the performance of ABSA to a new stage. Besides, techniques for building more practical ABSA systems in cross-domain/lingual scenarios are discussed. Finally, we review some emerging topics and discuss some open challenges to outlook potential future directions of ABSA.
We present NusaCrowd, a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.
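A minimal sketch of pulling a dataset through one of NusaCrowd's standardized loaders, assuming the `load_dataset` helper advertised by the project; the dataset identifier below is an assumed example.

```python
import nusacrowd

# 'smsa' (an Indonesian sentiment corpus) is an assumed identifier; the
# library's standardized loaders expose each dataset under a short name.
data = nusacrowd.load_dataset('smsa')
```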
Aspect sentiment triplet extraction (ASTE) aims to extract aspect term, sentiment and opinion term triplets from sentences. Since the initial datasets used to evaluate models on ASTE had flaws, several studies later corrected the initial datasets and released new versions of the datasets independently. As a result, different studies select different versions of datasets to evaluate their methods, which makes ASTE-related works hard to follow. In this paper, we analyze the relation between different versions of datasets and suggest that the entire-space version should be used for ASTE. Besides the sentences containing triplets and the triplets in the sentences, the entire-space version additionally includes the sentences without triplets and the aspect terms which do not belong to any triplets. Hence, the entire-space version is consistent with real-world scenarios and evaluating models on the entire-space version can better reflect the models' performance in real-world scenarios. In addition, experimental results show that evaluating models on non-entire-space datasets inflates the performance of existing models and models trained on the entire-space version can obtain better performance.
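To make the distinction concrete, here is a small self-contained sketch (not code from the paper) of micro-F1 over the entire space: sentences without gold triplets contribute no gold tuples, so any triplet predicted for them counts as a false positive and lowers precision.

```python
def triplet_f1(gold, pred):
    """Micro-F1 over (aspect, opinion, sentiment) triplets.

    gold/pred: one set of triplets per sentence, INCLUDING sentences with
    no triplets (empty sets), i.e. the entire-space convention.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# A sentence with no gold triplets still penalizes spurious predictions:
gold = [{('food', 'great', 'POS')}, set()]
pred = [{('food', 'great', 'POS')}, {('day', 'nice', 'POS')}]
print(triplet_f1(gold, pred))  # ~0.667: the false positive lowers precision
```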
Recent prominent works have shown encouraging progress in aspect-based sentiment classification (ABSC), while implicit aspect sentiment modeling remains a problem that must be addressed. Our preliminary study shows that the sentiment of an implicit aspect usually depends on the sentiments of adjacent aspects, which suggests that we can extract implicit sentiment through local sentiment dependency modeling. We formulate a local sentiment aggregation paradigm (LSA) based on empirical sentiment patterns (SP) to tackle sentiment dependency modeling. Compared with existing methods, LSA is an efficient approach that learns implicit sentiment in a local sentiment aggregation window, which addresses the efficiency problem and avoids the token-node alignment problem of syntax-based methods. Furthermore, we refine a differential weighting method based on gradient descent that guides the construction of the sentiment aggregation window. According to the experimental results, LSA is effective for all objective ABSC models and achieves state-of-the-art performance on three public datasets. LSA is an adaptive paradigm ready to be adapted to existing models, and we release the code to provide insight into improving existing ABSC models.
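As a rough, hypothetical sketch of the aggregation idea (not the authors' implementation): each aspect's feature is mixed with its window neighbors under learnable weights, so implicit sentiment can borrow signal from adjacent aspects.

```python
import torch

def local_sentiment_aggregation(aspect_feats, weights):
    """Hypothetical sketch: mix each aspect feature with its neighbors.

    aspect_feats: (num_aspects, hidden) features, one row per aspect in order.
    weights: learnable tensor of shape (3,) for [left, self, right].
    """
    left = torch.roll(aspect_feats, 1, dims=0)
    right = torch.roll(aspect_feats, -1, dims=0)
    left[0] = aspect_feats[0]      # no left neighbor for the first aspect
    right[-1] = aspect_feats[-1]   # no right neighbor for the last aspect
    w = torch.softmax(weights, dim=0)
    return w[0] * left + w[1] * aspect_feats + w[2] * right
```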
Many interpretability tools allow practitioners and researchers to explain natural language processing systems. However, each tool requires different configurations and provides explanations in different forms, hindering the possibility of assessing and comparing them. A principled, unified evaluation benchmark would guide users through the central question: which explanation method is more reliable for my use case? We introduce ferret, an easy-to-use, extensible Python library to explain Transformer-based models integrated with the Hugging Face Hub. It offers a unified benchmarking suite to test and compare a wide range of state-of-the-art explainers on any text or interpretability corpus. In addition, ferret provides convenient programming abstractions to foster the introduction of new explanation methods, datasets, or evaluation metrics.
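A minimal sketch following ferret's benchmark interface; the checkpoint name is an assumed example, and the exact method names should be verified against the library's documentation.

```python
from ferret import Benchmark
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = 'distilbert-base-uncased-finetuned-sst-2-english'  # example checkpoint
model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)

bench = Benchmark(model, tokenizer)
# Run the built-in explainers on one input and score them with the
# library's faithfulness/plausibility metrics.
explanations = bench.explain('You look stunning!', target=1)
evaluations = bench.evaluate_explanations(explanations, target=1)
```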
Data augmentation is an important component of the robustness evaluation of natural language processing (NLP) models, as well as of enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework that supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its transformations to analyze the robustness of popular natural language models. The infrastructure, datacards, and robustness analysis results are publicly available in the NL-Augmenter repository (https://github.com/gem-benchmark/nl-augmenter).
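A rough sketch of applying one transformation; the module path and class name follow the repository's one-folder-per-transformation layout but should be treated as assumptions and checked against the repository.

```python
# Module path and class name are assumed from the repo's layout;
# check the repository for the exact names in the current release.
from nlaugmenter.transformations.butter_fingers_perturbation.transformation import (
    ButterFingersPerturbation,
)

t = ButterFingersPerturbation()
# generate() returns a list of perturbed variants of the input sentence.
print(t.generate('Sentences with gapping, such as this one, are rare.'))
```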