Machine learning models can reach high performance on benchmark natural language processing (NLP) datasets but fail in more challenging settings. We study this issue when a pre-trained model learns dataset artifacts in natural language inference (NLI), the topic of studying the logical relationship between a pair of text sequences. We provide a variety of techniques for analyzing and locating dataset artifacts inside the crowdsourced Stanford Natural Language Inference (SNLI) corpus. We study the stylistic pattern of dataset artifacts in the SNLI. To mitigate dataset artifacts, we employ a unique multi-scale data augmentation technique with two distinct frameworks: a behavioral testing checklist at the sentence level and lexical synonym criteria at the word level. Specifically, our combination method enhances our model's resistance to perturbation testing, enabling it to continuously outperform the pre-trained baseline.
translated by 谷歌翻译
Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et al., 2015) and 53% of MultiNLI (Williams et al., 2018). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.
translated by 谷歌翻译
Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a taskagnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
translated by 谷歌翻译
数据增强是通过转换为机器学习的人工创建数据的人工创建,是一个跨机器学习学科的研究领域。尽管它对于增加模型的概括功能很有用,但它还可以解决许多其他挑战和问题,从克服有限的培训数据到正规化目标到限制用于保护隐私的数据的数量。基于对数据扩展的目标和应用的精确描述以及现有作品的分类法,该调查涉及用于文本分类的数据增强方法,并旨在为研究人员和从业者提供简洁而全面的概述。我们将100多种方法划分为12种不同的分组,并提供最先进的参考文献来阐述哪种方法可以通过将它们相互关联,从而阐述了哪种方法。最后,提供可能构成未来工作的基础的研究观点。
translated by 谷歌翻译
For natural language understanding (NLU) technology to be maximally useful, it must be able to process language in a way that is not exclusive to a single task, genre, or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation (GLUE) benchmark, a collection of tools for evaluating the performance of models across a diverse set of existing NLU tasks. By including tasks with limited training data, GLUE is designed to favor and encourage models that share general linguistic knowledge across tasks. GLUE also includes a hand-crafted diagnostic test suite that enables detailed linguistic analysis of models. We evaluate baselines based on current methods for transfer and representation learning and find that multi-task training on all tasks performs better than training a separate model per task. However, the low absolute performance of our best model indicates the need for improved general NLU systems.
translated by 谷歌翻译
语言可以用作再现和执行有害刻板印象和偏差的手段,并被分析在许多研究中。在本文中,我们对自然语言处理中的性别偏见进行了304篇论文。我们分析了社会科学中性别及其类别的定义,并将其连接到NLP研究中性别偏见的正式定义。我们调查了在对性别偏见的研究中应用的Lexica和数据集,然后比较和对比方法来检测和减轻性别偏见。我们发现对性别偏见的研究遭受了四个核心限制。 1)大多数研究将性别视为忽视其流动性和连续性的二元变量。 2)大部分工作都在单机设置中进行英语或其他高资源语言进行。 3)尽管在NLP方法中对性别偏见进行了无数的论文,但我们发现大多数新开发的算法都没有测试他们的偏见模型,并无视他们的工作的伦理考虑。 4)最后,在这一研究线上发展的方法基本缺陷涵盖性别偏差的非常有限的定义,缺乏评估基线和管道。我们建议建议克服这些限制作为未来研究的指导。
translated by 谷歌翻译
数据增强是自然语言处理(NLP)模型的鲁棒性评估的重要组成部分,以及增强他们培训的数据的多样性。在本文中,我们呈现NL-Cogmenter,这是一种新的参与式Python的自然语言增强框架,它支持创建两个转换(对数据的修改)和过滤器(根据特定功能的数据拆分)。我们描述了框架和初始的117个变换和23个过滤器,用于各种自然语言任务。我们通过使用其几个转换来分析流行自然语言模型的鲁棒性来证明NL-Upmenter的功效。基础架构,Datacards和稳健性分析结果在NL-Augmenter存储库上公开可用(\ url {https://github.com/gem-benchmark/nl-augmenter})。
translated by 谷歌翻译
Understanding entailment and contradiction is fundamental to understanding natural language, and inference about entailment and contradiction is a valuable testing ground for the development of semantic representations. However, machine learning research in this area has been dramatically limited by the lack of large-scale resources. To address this, we introduce the Stanford Natural Language Inference corpus, a new, freely available collection of labeled sentence pairs, written by humans doing a novel grounded task based on image captioning. At 570K pairs, it is two orders of magnitude larger than all other resources of its type. This increase in scale allows lexicalized classifiers to outperform some sophisticated existing entailment models, and it allows a neural network-based model to perform competitively on natural language inference benchmarks for the first time.
translated by 谷歌翻译
最近的自然语言处理(NLP)技术在基准数据集中实现了高性能,主要原因是由于深度学习性能的显着改善。研究界的进步导致了最先进的NLP任务的生产系统的巨大增强,例如虚拟助理,语音识别和情感分析。然而,随着对抗性攻击测试时,这种NLP系统仍然仍然失败。初始缺乏稳健性暴露于当前模型的语言理解能力中的令人不安的差距,当NLP系统部署在现实生活中时,会产生问题。在本文中,我们通过以各种维度的系统方式概述文献来展示了NLP稳健性研究的结构化概述。然后,我们深入了解稳健性的各种维度,跨技术,指标,嵌入和基准。最后,我们认为,鲁棒性应该是多维的,提供对当前研究的见解,确定文学中的差距,以建议值得追求这些差距的方向。
translated by 谷歌翻译
众包NLP数据集的反复挑战是,在制作示例时,人类作家通常会依靠重复的模式,从而导致缺乏语言多样性。我们介绍了一种基于工人和AI协作的数据集创建的新方法,该方法汇集了语言模型的生成力量和人类的评估力量。从现有的数据集,自然语言推理(NLI)的Multinli开始,我们的方法使用数据集制图自动识别示例来证明具有挑战性的推理模式,并指示GPT-3撰写具有相似模式的新示例。然后,机器生成的示例会自动过滤,并最终由人类人群工人修订和标记。最终的数据集Wanli由107,885个NLI示例组成,并在现有的NLI数据集上呈现出独特的经验优势。值得注意的是,培训有关Wanli的模型,而不是Multinli($ 4 $ $倍)可改善我们考虑的七个外域测试集的性能,包括汉斯(Hans)的11%和对抗性NLI的9%。此外,将Multinli与Wanli结合起来比将其与其他NLI增强集相结合更有效。我们的结果表明,自然语言生成技术的潜力是策划增强质量和多样性的NLP数据集。
translated by 谷歌翻译
大型语言模型(LLM)已在一系列自然语言理解任务上实现了最先进的表现。但是,这些LLM可能依靠数据集偏差和文物作为预测的快捷方式。这极大地损害了他们的分布(OOD)概括和对抗性鲁棒性。在本文中,我们对最新发展的综述,这些发展解决了LLMS的鲁棒性挑战。我们首先介绍LLM的概念和鲁棒性挑战。然后,我们介绍了在LLM中识别快捷方式学习行为的方法,表征了快捷方式学习的原因以及引入缓解解决方案。最后,我们确定了关键挑战,并将这一研究线的联系引入其他方向。
translated by 谷歌翻译
由于NLP模型实现了基准测试的最先进的性能并获得了广泛的应用程序,因此确保在现实世界中的安全部署这些模型的安全部署,例如,确保模型对未经调用或具有挑战性的情景稳健。尽管具有越来越多的学习主题,但它在视觉和NLP等应用中分别探讨了,具有多种研究中的各种定义,评估和缓解策略。在本文中,我们的目标是提供对如何定义,测量和提高NLP鲁棒性的统一调查。我们首先连接多种稳健性的定义,然后统一各种各样的工作方面识别稳健性失败和评估模型的鲁棒性。相应地,我们呈现了数据驱动,模型驱动和基于归纳的缓解策略,具有如何有效地改善NLP模型中的鲁棒性的更系统的观点。最后,我们通过概述开放的挑战和未来方向来促进在这一领域的进一步研究。
translated by 谷歌翻译
A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
translated by 谷歌翻译
大规模的预训练语言模型在广泛的自然语言理解(NLU)任务中取得了巨大的成功,甚至超过人类性能。然而,最近的研究表明,这些模型的稳健性可能受到精心制作的文本对抗例子的挑战。虽然已经提出了几个单独的数据集来评估模型稳健性,但仍缺少原则和全面的基准。在本文中,我们呈现对抗性胶水(AdvGlue),这是一个新的多任务基准,以定量和彻底探索和评估各种对抗攻击下现代大规模语言模型的脆弱性。特别是,我们系统地应用14种文本对抗的攻击方法来构建一个粘合的援助,这是由人类进一步验证的可靠注释。我们的调查结果总结如下。 (i)大多数现有的对抗性攻击算法容易发生无效或暧昧的对手示例,其中大约90%的含量改变原始语义含义或误导性的人的注册人。因此,我们执行仔细的过滤过程来策划高质量的基准。 (ii)我们测试的所有语言模型和强大的培训方法在AdvGlue上表现不佳,差价远远落后于良性准确性。我们希望我们的工作能够激励开发新的对抗攻击,这些攻击更加隐身,更加统一,以及针对复杂的对抗性攻击的新强大语言模型。 Advglue在https://adversarialglue.github.io提供。
translated by 谷歌翻译
We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-theart models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.
translated by 谷歌翻译
This paper introduces the Multi-Genre Natural Language Inference (MultiNLI) corpus, a dataset designed for use in the development and evaluation of machine learning models for sentence understanding. At 433k examples, this resource is one of the largest corpora available for natural language inference (a.k.a. recognizing textual entailment), improving upon available resources in both its coverage and difficulty. MultiNLI accomplishes this by offering data from ten distinct genres of written and spoken English, making it possible to evaluate systems on nearly the full complexity of the language, while supplying an explicit setting for evaluating cross-genre domain adaptation. In addition, an evaluation using existing machine learning models designed for the Stanford NLI corpus shows that it represents a substantially more difficult task than does that corpus, despite the two showing similar levels of inter-annotator agreement. This task will involve reading a line from a non-fiction article and writing three sentences that relate to it. The line will describe a situation or event. Using only this description and what you know about the world:
translated by 谷歌翻译
随着人工智能系统变得越来越强大和普遍,人们对机器的道德或缺乏道德的关注变得越来越关注。然而,向机器讲授道德是一项艰巨的任务,因为道德仍然是人类中最激烈的争论问题之一,更不用说AI了。但是,部署到数百万用户的现有AI系统已经在做出充满道德影响的决策,这构成了一个看似不可能的挑战:教学机器的道德意义,而人类继续努力努力。为了探索这一挑战,我们介绍了Delphi,这是一个基于深层神经网络的实验框架,直接训练了描述性道德判断,例如,“帮助朋友”通常是不错的,而“帮助朋友传播假新闻”不是。经验结果提供了对机器伦理的承诺和局限性的新见解。面对新的道德情况,德尔菲(Delphi)表现出强大的概括能力,而现成的神经网络模型表现出明显差的判断,包括不公正的偏见,证实了对明确教学机器的道德意义的必要性。然而,德尔菲并不完美,表现出对普遍性偏见和不一致的敏感性。尽管如此,我们还是展示了不完美的Delphi的积极用例,包括在其他不完美的AI系统中将其用作组件模型。重要的是,我们根据著名的道德理论来解释Delphi的运营化,这使我们提出了重要的未来研究问题。
translated by 谷歌翻译
微调预训练模型在标准的自然语言处理基准上取得了令人印象深刻的性能。然而,所产生的模型概括性仍然明确地理解。例如,我们不知道,性能如何导致泛化模型的完善。在这项研究中,我们使用关系提取来分析来自不同观点的微调BERT模型。我们还根据我们提出的改进来表征泛化技术的差异。从经验实验中,我们发现BERT通过随机化,对抗性和反事实试验以及偏差(即选择和语义)遭受鲁棒性而遭受瓶颈。这些发现突出了未来改进的机会。我们的开放式测试平台诊断为\ url {https://github.com/zjunlp/diagnosere}。
translated by 谷歌翻译
作为有效的策略,数据增强(DA)减轻了深度学习技术可能失败的数据稀缺方案。它广泛应用于计算机视觉,然后引入自然语言处理并实现了许多任务的改进。DA方法的主要重点之一是提高培训数据的多样性,从而帮助模型更好地推广到看不见的测试数据。在本调查中,我们根据增强数据的多样性,将DA方法框架为三类,包括释义,注释和采样。我们的论文根据上述类别,详细分析了DA方法。此外,我们还在NLP任务中介绍了他们的应用以及挑战。
translated by 谷歌翻译
Grammatical Error Correction (GEC) is the task of automatically detecting and correcting errors in text. The task not only includes the correction of grammatical errors, such as missing prepositions and mismatched subject-verb agreement, but also orthographic and semantic errors, such as misspellings and word choice errors respectively. The field has seen significant progress in the last decade, motivated in part by a series of five shared tasks, which drove the development of rule-based methods, statistical classifiers, statistical machine translation, and finally neural machine translation systems which represent the current dominant state of the art. In this survey paper, we condense the field into a single article and first outline some of the linguistic challenges of the task, introduce the most popular datasets that are available to researchers (for both English and other languages), and summarise the various methods and techniques that have been developed with a particular focus on artificial error generation. We next describe the many different approaches to evaluation as well as concerns surrounding metric reliability, especially in relation to subjective human judgements, before concluding with an overview of recent progress and suggestions for future work and remaining challenges. We hope that this survey will serve as comprehensive resource for researchers who are new to the field or who want to be kept apprised of recent developments.
translated by 谷歌翻译