数据增强是通过转换为机器学习的人工创建数据的人工创建,是一个跨机器学习学科的研究领域。尽管它对于增加模型的概括功能很有用,但它还可以解决许多其他挑战和问题,从克服有限的培训数据到正规化目标到限制用于保护隐私的数据的数量。基于对数据扩展的目标和应用的精确描述以及现有作品的分类法,该调查涉及用于文本分类的数据增强方法,并旨在为研究人员和从业者提供简洁而全面的概述。我们将100多种方法划分为12种不同的分组,并提供最先进的参考文献来阐述哪种方法可以通过将它们相互关联,从而阐述了哪种方法。最后,提供可能构成未来工作的基础的研究观点。
translated by 谷歌翻译
在许多机器学习的情况下,研究表明,培训数据的开发可能比分类器本身的选择和建模更高。因此,已经开发了数据增强方法来通过人为创建的培训数据来改善分类器。在NLP中,为提供新的语言模式的文本转换建立通用规则存在挑战。在本文中,我们介绍并评估一种适合于长期和短文的分类器的性能的文本生成方法。通过我们的文本生成方法的增强,我们在评估简短和长期文本任务时取得了令人鼓舞的改进。尤其是在小型数据分析方面,与NO增强基线和其他数据增强技术相比,在构建的低数据状态下,添加精度的提高到达15.53%和3.56%。由于这些构建制度的当前轨道并非普遍适用,因此我们还显示了几个现实世界中低数据任务(高达+4.84 F1得分)的重大改进。由于我们从许多角度(总共11个数据集)评估了该方法,因此我们还观察到该方法可能不合适的情况。我们讨论了在不同类型的数据集上成功应用我们的方法的含义和模式。
translated by 谷歌翻译
作为有效的策略,数据增强(DA)减轻了深度学习技术可能失败的数据稀缺方案。它广泛应用于计算机视觉,然后引入自然语言处理并实现了许多任务的改进。DA方法的主要重点之一是提高培训数据的多样性,从而帮助模型更好地推广到看不见的测试数据。在本调查中,我们根据增强数据的多样性,将DA方法框架为三类,包括释义,注释和采样。我们的论文根据上述类别,详细分析了DA方法。此外,我们还在NLP任务中介绍了他们的应用以及挑战。
translated by 谷歌翻译
由于低资源域名,新任务以及需要大量培训数据的大规模神经网络的普及,最近,数据增强最近看到了对NLP的兴趣增加。尽管最近的高潮,但由于语言数据的离散性质所带来的挑战,这一领域仍然相对望远欠了。在本文中,我们通过以结构化方式概述文献来展示对NLP的全面和统一对NLP的数据。我们首先介绍和激励NLP的数据增强,然后讨论主要的方法论代表性方法。接下来,我们突出显示用于流行NLP应用程序和任务的技术。我们通过概述当前挑战和未来研究的指示来结束。总体而言,我们的论文旨在澄清现有文学的景观,以便NLP的数据增强,并激励该领域的其他工作。我们还提供了一个GitHub存储库,纸张列表将在https://github.com/styfeng/dataaug4nlp上不断更新
translated by 谷歌翻译
数据增强是自然语言处理(NLP)模型的鲁棒性评估的重要组成部分,以及增强他们培训的数据的多样性。在本文中,我们呈现NL-Cogmenter,这是一种新的参与式Python的自然语言增强框架,它支持创建两个转换(对数据的修改)和过滤器(根据特定功能的数据拆分)。我们描述了框架和初始的117个变换和23个过滤器,用于各种自然语言任务。我们通过使用其几个转换来分析流行自然语言模型的鲁棒性来证明NL-Upmenter的功效。基础架构,Datacards和稳健性分析结果在NL-Augmenter存储库上公开可用(\ url {https://github.com/gem-benchmark/nl-augmenter})。
translated by 谷歌翻译
数据饥饿的深度神经网络已经将自己作为许多NLP任务的标准建立为包括传统序列标记的标准。尽管他们在高资源语言上表现最先进的表现,但它们仍然落后于低资源场景的统计计数器。一个方法来反击攻击此问题是文本增强,即,从现有数据生成新的合成训练数据点。虽然NLP最近目睹了一种文本增强技术的负载,但该领域仍然缺乏对多种语言和序列标记任务的系统性能分析。为了填补这一差距,我们调查了三类文本增强方法,其在语法(例如,裁剪子句子),令牌(例如,随机字插入)和字符(例如,字符交换)级别上执行更改。我们系统地将它们与语音标记,依赖解析和语义角色标记的分组进行了比较,用于使用各种模型的各种语言系列,包括依赖于诸如MBERT的普赖金的多语言语境化语言模型的架构。增强最显着改善了解析,然后是语音标记和语义角色标记的依赖性解析。我们发现实验技术通常在形态上丰富的语言,而不是越南语等分析语言。我们的研究结果表明,增强技术可以进一步改善基于MBERT的强基线。我们将字符级方法标识为最常见的表演者,而同义词替换和语法增强仪提供不一致的改进。最后,我们讨论了最大依赖于任务,语言对和模型类型的结果。
translated by 谷歌翻译
Controllable Text Generation (CTG) is emerging area in the field of natural language generation (NLG). It is regarded as crucial for the development of advanced text generation technologies that are more natural and better meet the specific constraints in practical applications. In recent years, methods using large-scale pre-trained language models (PLMs), in particular the widely used transformer-based PLMs, have become a new paradigm of NLG, allowing generation of more diverse and fluent text. However, due to the lower level of interpretability of deep neural networks, the controllability of these methods need to be guaranteed. To this end, controllable text generation using transformer-based PLMs has become a rapidly growing yet challenging new research hotspot. A diverse range of approaches have emerged in the recent 3-4 years, targeting different CTG tasks which may require different types of controlled constraints. In this paper, we present a systematic critical review on the common tasks, main approaches and evaluation methods in this area. Finally, we discuss the challenges that the field is facing, and put forward various promising future directions. To the best of our knowledge, this is the first survey paper to summarize CTG techniques from the perspective of PLMs. We hope it can help researchers in related fields to quickly track the academic frontier, providing them with a landscape of the area and a roadmap for future research.
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译
本次调查绘制了用于分析社交媒体数据的生成方法的研究状态的广泛的全景照片(Sota)。它填补了空白,因为现有的调查文章在其范围内或被约会。我们包括两个重要方面,目前正在挖掘和建模社交媒体的重要性:动态和网络。社会动态对于了解影响影响或疾病的传播,友谊的形成,友谊的形成等,另一方面,可以捕获各种复杂关系,提供额外的洞察力和识别否则将不会被注意的重要模式。
translated by 谷歌翻译
近年来,文本的风格特性吸引了计算语言学研究人员。具体来说,研究人员研究了文本样式转移(TST)任务,该任务旨在在保留其样式独立内容的同时改变文本的风格属性。在过去的几年中,已经开发了许多新颖的TST算法,而该行业利用这些算法来实现令人兴奋的TST应用程序。由于这种共生,TST研究领域迅速发展。本文旨在对有关文本样式转移的最新研究工作进行全面审查。更具体地说,我们创建了一种分类法来组织TST模型,并提供有关最新技术状况的全面摘要。我们回顾了针对TST任务的现有评估方法,并进行了大规模的可重复性研究,我们在两个公开可用的数据集上实验基准了19个最先进的TST TST算法。最后,我们扩展了当前趋势,并就TST领域的新开发发展提供了新的观点。
translated by 谷歌翻译
The rapid advancement of AI technology has made text generation tools like GPT-3 and ChatGPT increasingly accessible, scalable, and effective. This can pose serious threat to the credibility of various forms of media if these technologies are used for plagiarism, including scientific literature and news sources. Despite the development of automated methods for paraphrase identification, detecting this type of plagiarism remains a challenge due to the disparate nature of the datasets on which these methods are trained. In this study, we review traditional and current approaches to paraphrase identification and propose a refined typology of paraphrases. We also investigate how this typology is represented in popular datasets and how under-representation of certain types of paraphrases impacts detection capabilities. Finally, we outline new directions for future research and datasets in the pursuit of more effective paraphrase detection using AI.
translated by 谷歌翻译
深度学习属于人工智能领域,机器执行通常需要某种人类智能的任务。类似于大脑的基本结构,深度学习算法包括一种人工神经网络,其类似于生物脑结构。利用他们的感官模仿人类的学习过程,深入学习网络被送入(感官)数据,如文本,图像,视频或声音。这些网络在不同的任务中优于最先进的方法,因此,整个领域在过去几年中看到了指数增长。这种增长在过去几年中每年超过10,000多种出版物。例如,只有在医疗领域中的所有出版物中覆盖的搜索引擎只能在Q3 2020中覆盖所有出版物的子集,用于搜索术语“深度学习”,其中大约90%来自过去三年。因此,对深度学习领域的完全概述已经不可能在不久的将来获得,并且在不久的将来可能会难以获得难以获得子场的概要。但是,有几个关于深度学习的综述文章,这些文章专注于特定的科学领域或应用程序,例如计算机愿景的深度学习进步或在物体检测等特定任务中进行。随着这些调查作为基础,这一贡献的目的是提供对不同科学学科的深度学习的第一个高级,分类的元调查。根据底层数据来源(图像,语言,医疗,混合)选择了类别(计算机愿景,语言处理,医疗信息和其他工程)。此外,我们还审查了每个子类别的常见架构,方法,专业,利弊,评估,挑战和未来方向。
translated by 谷歌翻译
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation and data-to-text generation. However, it is also apparent that deep learning based generation is prone to hallucinate unintended text, which degrades the system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented in measuring and mitigating hallucinated texts, but these have never been reviewed in a comprehensive manner before. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
translated by 谷歌翻译
文本样式传输是自然语言生成中的重要任务,旨在控制生成的文本中的某些属性,例如礼貌,情感,幽默和许多其他特性。它在自然语言处理领域拥有悠久的历史,最近由于深神经模型带来的有希望的性能而重大关注。在本文中,我们对神经文本转移的研究进行了系统调查,自2017年首次神经文本转移工作以来跨越100多个代表文章。我们讨论了任务制定,现有数据集和子任务,评估,以及丰富的方法在存在并行和非平行数据存在下。我们还提供关于这项任务未来发展的各种重要主题的讨论。我们的策据纸张列表在https://github.com/zhijing-jin/text_style_transfer_survey
translated by 谷歌翻译
确保适当的标点符号和字母外壳是朝向应用复杂的自然语言处理算法的关键预处理步骤。这对于缺少标点符号和壳体的文本源,例如自动语音识别系统的原始输出。此外,简短的短信和微博的平台提供不可靠且经常错误的标点符号和套管。本调查概述了历史和最先进的技术,用于恢复标点符号和纠正单词套管。此外,突出了当前的挑战和研究方向。
translated by 谷歌翻译
本文对过去二十年来对自然语言生成(NLG)的研究提供了全面的审查,特别是与数据到文本生成和文本到文本生成深度学习方法有关,以及NLG的新应用技术。该调查旨在(a)给出关于NLG核心任务的最新综合,以及该领域采用的建筑;(b)详细介绍各种NLG任务和数据集,并提请注意NLG评估中的挑战,专注于不同的评估方法及其关系;(c)强调一些未来的强调和相对近期的研究问题,因为NLG和其他人工智能领域的协同作用而增加,例如计算机视觉,文本和计算创造力。
translated by 谷歌翻译
最近的自然语言处理(NLP)技术在基准数据集中实现了高性能,主要原因是由于深度学习性能的显着改善。研究界的进步导致了最先进的NLP任务的生产系统的巨大增强,例如虚拟助理,语音识别和情感分析。然而,随着对抗性攻击测试时,这种NLP系统仍然仍然失败。初始缺乏稳健性暴露于当前模型的语言理解能力中的令人不安的差距,当NLP系统部署在现实生活中时,会产生问题。在本文中,我们通过以各种维度的系统方式概述文献来展示了NLP稳健性研究的结构化概述。然后,我们深入了解稳健性的各种维度,跨技术,指标,嵌入和基准。最后,我们认为,鲁棒性应该是多维的,提供对当前研究的见解,确定文学中的差距,以建议值得追求这些差距的方向。
translated by 谷歌翻译
随着系统变得更大,更复杂,从开源的收集网络威胁智能对于维持和实现高水平的安全性变得越来越重要。但是,这些开源通常会受到信息过载的约束。因此,应用机器学习模型将信息量凝结到必要的内容很有用。然而,以前的研究和应用表明,由于其概括能力低,现有的分类器无法提取有关新兴网络安全事件的特定信息。因此,我们建议通过为每个新事件培训新的分类器来克服这个问题的系统。由于这需要使用标准培训方法进行大量标记的数据,因此我们结合了三种不同的低数据制度技术 - 转移学习,数据增强和很少的学习学习 - 从很少的标记实例中培训高质量的分类器。我们使用从2021年的Microsoft Exchange Server数据泄露中得出的新型数据集评估了我们的方法,该数据集由三名专家标记。与标准训练方法相比,与标准训练方法相比,与标准训练方法相比,F1得分的增加超过21分,与几次学习中的最新方法相比,F1得分的增加超过18分。此外,经过此方法培训的分类器和32个实例的分类器仅比接受1800个实例的分类器少于5 F1分数。
translated by 谷歌翻译
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
translated by 谷歌翻译
Deep Learning and Machine Learning based models have become extremely popular in text processing and information retrieval. However, the non-linear structures present inside the networks make these models largely inscrutable. A significant body of research has focused on increasing the transparency of these models. This article provides a broad overview of research on the explainability and interpretability of natural language processing and information retrieval methods. More specifically, we survey approaches that have been applied to explain word embeddings, sequence modeling, attention modules, transformers, BERT, and document ranking. The concluding section suggests some possible directions for future research on this topic.
translated by 谷歌翻译