Despite the success of machine learning models on natural language processing (NLP) tasks, predictions from these models frequently fail on out-of-distribution (OOD) samples. Prior work has focused on developing state-of-the-art methods for detecting OOD; the fundamental question of how OOD samples differ from in-distribution samples remains unanswered. This paper explores how data dynamics during training can be used to understand, in extensive detail, the fundamental differences between OOD and in-distribution samples. We find that the syntactic characteristics of the data samples that the model consistently predicts incorrectly in the OOD and in-distribution cases directly contradict each other. In addition, we observe preliminary evidence supporting the hypothesis that models are more likely to latch onto trivial syntactic heuristics (e.g., word overlap between two sentences) when making predictions on OOD samples. We hope our preliminary study accelerates data-centric analyses of various machine learning phenomena.
It is widely accepted in the mode connectivity literature that when two neural networks are trained similarly on the same data, they are connected by a path through parameter space along which test set accuracy is maintained. Under some circumstances, including transfer learning from pretrained models, these paths are presumed to be linear. In contrast to existing results, we find that among text classifiers (trained on MNLI, QQP, and CoLA), some pairs of finetuned models have large barriers of increasing loss on the linear paths between them. On each task, we find distinct clusters of models that are linearly connected on the test loss surface but disconnected from models outside the cluster -- models occupying separate basins on the surface. By measuring performance on specially crafted diagnostic datasets, we find that these clusters correspond to different generalization strategies: one cluster behaves like a bag-of-words model under domain shift, while another cluster uses syntactic heuristics. Our work demonstrates how the geometry of the loss surface can guide models toward different heuristic functions.
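As an illustration of the measurement behind this finding, here is a minimal sketch of probing the linear path between two finetuned checkpoints, assuming two PyTorch models with identical architectures and a user-supplied `eval_loss` callable that returns test loss on a held-out set (all names are illustrative, not the paper's code):

```python
import copy
import torch

def interpolate_state(state_a, state_b, alpha):
    """Linearly interpolate two state dicts: (1 - alpha) * a + alpha * b."""
    return {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}

def linear_path_losses(model_a, model_b, eval_loss, num_points=11):
    """Evaluate test loss along the straight line between two sets of weights.

    A large bump relative to the endpoint losses indicates a loss barrier,
    i.e. the two models sit in basins that are not linearly connected.
    """
    state_a = {k: v.detach().clone() for k, v in model_a.state_dict().items()}
    state_b = {k: v.detach().clone() for k, v in model_b.state_dict().items()}
    probe = copy.deepcopy(model_a)
    losses = []
    for i in range(num_points):
        alpha = i / (num_points - 1)
        probe.load_state_dict(interpolate_state(state_a, state_b, alpha))
        losses.append(eval_loss(probe))  # eval_loss: model -> float on held-out data
    barrier = max(losses) - max(losses[0], losses[-1])
    return losses, barrier
```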
Despite agreement on the importance of detecting out-of-distribution (OOD) examples, there is little consensus on a formal definition of OOD examples or on how best to detect them. We categorize such examples by whether they exhibit a background shift or a semantic shift, and find that the two major approaches to OOD detection, model calibration and density estimation (language modeling for text), behave differently on these types of OOD data. Across 14 pairs of in-distribution and OOD English natural language understanding datasets, we find that density estimation methods consistently beat calibration methods in background shift settings, while performing worse in semantic shift settings. In addition, we find that both methods generally fail to detect examples from challenge data, highlighting a weak spot of current methods. Since no single method works well across all settings, our results call for an explicit definition of OOD examples when evaluating different detection methods.
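A hedged sketch of the two detector families compared above, for a single text input: the calibration-style score is a classifier's maximum softmax probability, and the density-style score is the negative token-level cross-entropy under a language model; lower scores suggest OOD. The Hugging Face model identifiers in the comments are generic placeholders, not the models or datasets used in the paper:

```python
import torch
from transformers import (AutoModelForCausalLM, AutoModelForSequenceClassification,
                          AutoTokenizer)

def msp_score(text, clf, clf_tok):
    """Calibration-style OOD score: maximum softmax probability of a classifier.
    Lower values suggest the input is out-of-distribution."""
    with torch.no_grad():
        logits = clf(**clf_tok(text, return_tensors="pt")).logits
    return torch.softmax(logits, dim=-1).max().item()

def density_score(text, lm, lm_tok):
    """Density-style OOD score: average token log-likelihood under a language model.
    Lower (more negative) values suggest the input is out-of-distribution."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = lm(ids, labels=ids).loss  # mean token-level cross-entropy
    return -loss.item()

# Example wiring (any in-domain finetuned classifier and any LM would do):
# clf_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# clf = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").eval()
# lm_tok = AutoTokenizer.from_pretrained("gpt2")
# lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
```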
A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
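The lexical overlap heuristic is simple enough to state procedurally: label a pair as entailment whenever every word of the hypothesis also appears in the premise. A minimal sketch of that rule (whitespace tokenization is a simplification, and HANS itself is built from templated constructions rather than this function):

```python
def lexical_overlap_heuristic(premise: str, hypothesis: str) -> str:
    """Predict 'entailment' iff all hypothesis words occur in the premise.

    This mimics the fallible shortcut HANS is designed to expose: on
    MNLI-like data the rule is usually right, but it fails on examples
    like the non-entailed passive-voice pair below.
    """
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return "entailment" if hypothesis_words <= premise_words else "non-entailment"

# Full lexical overlap, yet the premise does not entail the hypothesis:
print(lexical_overlap_heuristic("the doctor was paid by the actor",
                                "the doctor paid the actor"))  # -> "entailment" (wrong)
```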
A recurring challenge of crowdsourced NLP datasets is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel approach to dataset creation based on worker-and-AI collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI for natural language inference (NLI), our approach uses dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. The machine-generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers. The resulting dataset, WANLI, consists of 107,885 NLI examples and presents unique empirical strengths over existing NLI datasets. Remarkably, training a model on WANLI instead of the 4x larger MultiNLI improves performance on the seven out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI. Moreover, combining MultiNLI with WANLI is more effective than combining it with other NLI augmentation sets. Our results demonstrate the potential of natural language generation techniques for curating NLP datasets of enhanced quality and diversity.
Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty -- w.r.t. a model $\mathcal{V}$ -- as the lack of $\mathcal{V}$-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for $\mathcal{V}$. We further introduce pointwise $\mathcal{V}$-information (PVI) to measure the difficulty of individual instances w.r.t. a given distribution. While standard evaluation metrics typically only compare different models on the same dataset, $\mathcal{V}$-usable information and PVI also permit the converse: for a given model $\mathcal{V}$, we can compare different datasets, as well as different instances/slices of the same dataset. Furthermore, our framework allows for the interpretability of different input attributes via transformations of the input, which we use to discover annotation artifacts in widely used NLP benchmarks.
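A hedged sketch of the pointwise quantity, following the definition PVI(x -> y) = -log2 g'(y | empty input) + log2 g(y | x), where g is a model finetuned on the dataset and g' is the same model family finetuned with the input replaced by an empty string; the two probabilities below are assumed to come from such models:

```python
import math

def pvi(p_with_input: float, p_null_input: float) -> float:
    """Pointwise V-information of a single instance (x, y).

    p_with_input : probability the finetuned model g assigns to the gold
                   label y given the real input x.
    p_null_input : probability the null model g' (finetuned on empty
                   inputs) assigns to y given an empty input.
    Higher PVI = easier instance for the model family; PVI near or below
    zero = the input x provides no usable information about y.
    """
    return -math.log2(p_null_input) + math.log2(p_with_input)

# Example: the gold label becomes much more predictable once x is shown.
print(pvi(p_with_input=0.9, p_null_input=0.34))  # ~1.40 bits
```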
Large language models (LLMs) have achieved state-of-the-art performance on a range of natural language understanding tasks. However, these LLMs may rely on dataset biases and artifacts as shortcuts for prediction, which greatly hinders their out-of-distribution (OOD) generalization and adversarial robustness. In this paper, we review recent developments that address the robustness challenge of LLMs. We first introduce the concepts of LLMs and their robustness challenge. We then introduce methods for identifying shortcut-learning behavior in LLMs, characterize the causes of shortcut learning, and present mitigation solutions. Finally, we identify key challenges and discuss the connections of this line of research to other directions.
The rapid development of deep learning models has prompted an increased demand for suitable training data. The popularity of large datasets, sometimes referred to as "big data", has shifted attention toward assessing their quality. Training on large datasets often requires excessive system resources and an infeasible amount of time. Moreover, the supervised machine learning process has not yet been fully automated: for supervised learning, large datasets require more time for manually labeling samples. We propose a method for curating smaller datasets that yield comparable model accuracy after an initial training session, using an appropriate distribution of samples selected according to how difficult they are for the model to learn from.
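One reading of the proposed procedure: after the initial training session, score every example by how hard it was for the model (e.g. by its loss) and keep a smaller subset weighted toward those hard examples before retraining. A sketch under that assumption (the scoring and selection rules are illustrative, not the authors' exact recipe):

```python
import numpy as np

def curate_hard_subset(losses: np.ndarray, keep_fraction: float = 0.3) -> np.ndarray:
    """Return indices of the hardest examples after an initial training session.

    losses        : per-example loss of the initially trained model on the
                    full training set (higher = harder to learn).
    keep_fraction : fraction of the original dataset to keep.
    """
    k = max(1, int(keep_fraction * len(losses)))
    return np.argsort(losses)[-k:]  # indices of the k highest-loss examples

# Usage: retrain on data[idx] and compare accuracy against the full-data model.
idx = curate_hard_subset(np.random.rand(1000), keep_fraction=0.3)
```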
Discriminative neural networks offer little or no performance guarantees when deployed on data not generated by the same process as the training distribution. On such out-of-distribution (OOD) inputs, the prediction may not only be erroneous, but confidently so, limiting the safe deployment of classifiers in real-world applications. One such challenging application is bacteria identification based on genomic sequences, which holds the promise of early detection of diseases, but requires a model that can output low confidence predictions on OOD genomic sequences from new bacteria that were not present in the training data. We introduce a genomics dataset for OOD detection that allows other researchers to benchmark progress on this important problem. We investigate deep generative model based approaches for OOD detection and observe that the likelihood score is heavily affected by population level background statistics. We propose a likelihood ratio method for deep generative models which effectively corrects for these confounding background statistics. We benchmark the OOD detection performance of the proposed method against existing approaches on the genomics dataset and show that our method achieves state-of-the-art performance. We demonstrate the generality of the proposed method by showing that it significantly improves OOD detection when applied to deep generative models of images.
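A minimal sketch of the likelihood-ratio idea: score each input by the gap between its log-likelihood under the full generative model and under a background model trained on perturbed inputs, so that population-level background statistics cancel. The two log-probability callables are assumed to wrap already-trained models (names are illustrative):

```python
import numpy as np

def likelihood_ratio_score(x, log_p_full, log_p_background):
    """LLR(x) = log p_theta(x) - log p_theta0(x).

    log_p_full       : log-likelihood under the model trained on in-distribution data.
    log_p_background : log-likelihood under a background model trained on perturbed
                       inputs, capturing population-level background statistics.
    Lower scores suggest the input is out-of-distribution.
    """
    return log_p_full(x) - log_p_background(x)

def detect_ood(xs, log_p_full, log_p_background, threshold):
    """Flag inputs whose likelihood-ratio score falls below a chosen threshold."""
    scores = np.array([likelihood_ratio_score(x, log_p_full, log_p_background) for x in xs])
    return scores < threshold
```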
We consider the problem of detecting out-of-distribution (OOD) input data when using deep neural networks, and we propose a simple yet effective way to improve the robustness of several popular OOD detection methods against label shift. Our work is motivated by the observation that most existing OOD detection algorithms consider all training/test data as a whole, regardless of which class each input activates (class-wise differences). Through extensive experiments, we find that this practice results in a detector whose performance is sensitive and vulnerable to label shift. To address this issue, we propose a class-wise thresholding scheme that can be applied to most existing OOD detection algorithms and can maintain similar OOD detection performance even in the presence of label shift in the test distribution.
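A hedged sketch of a class-wise thresholding scheme of the kind described: rather than one global cutoff on the OOD score, calibrate a separate threshold per predicted class on held-out in-distribution data (here at a fixed per-class true-positive rate), so that label shift at test time does not move the operating point. The exact calibration rule below is illustrative:

```python
import numpy as np

def fit_classwise_thresholds(id_scores, id_pred_classes, num_classes, tpr=0.95):
    """Per-class thresholds such that `tpr` of held-out ID samples of each
    predicted class score above their class threshold (scores: higher = more ID)."""
    thresholds = np.full(num_classes, -np.inf)
    for c in range(num_classes):
        cls_scores = id_scores[id_pred_classes == c]
        if len(cls_scores) > 0:
            thresholds[c] = np.quantile(cls_scores, 1.0 - tpr)
    return thresholds

def is_ood(test_scores, test_pred_classes, thresholds):
    """Flag a test input as OOD if its score falls below the threshold of
    the class the network predicts for it."""
    return test_scores < thresholds[test_pred_classes]
```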
For natural language processing systems, two kinds of evidence support the use of text representations from neural language models pretrained on large unannotated corpora: performance on application-inspired benchmarks (Peters et al., 2018, inter alia), and the emergence of syntactic abstractions in those representations (Tenney et al., 2019, inter alia). On the other hand, the lack of grounded supervision calls into question how well these representations can ever capture meaning (Bender and Koller, 2020). We apply novel probes to recent language models -- specifically focusing on predicate-argument structure as operationalized by semantic dependencies (Ivanova et al., 2012) -- and find that, unlike syntax, semantics is not brought to the surface by today's pretrained models. We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning, yielding benefits for natural language understanding (NLU) tasks in the GLUE benchmark. This approach demonstrates the potential of general-purpose (rather than task-specific) linguistic supervision, above and beyond conventional pretraining and finetuning. Several diagnostics help localize the benefits of our approach.
Targeted syntactic evaluations of language models ask whether models show stable preferences for syntactically acceptable content over minimal-pair unacceptable inputs. Most targeted syntactic evaluation datasets ask models to make these judgements with just a single context-free sentence as input. This does not match language models' training regime, in which input sentences are always highly contextualized by the surrounding corpus. This mismatch raises an important question: how robust are models' syntactic judgements in different contexts? In this paper, we investigate the stability of language models' performance on targeted syntactic evaluations as we vary properties of the input context: the length of the context, the types of syntactic phenomena it contains, and whether or not there are violations of grammaticality. We find that model judgements are generally robust when placed in randomly sampled linguistic contexts. However, they are substantially unstable for contexts containing syntactic structures matching those in the critical test content. Among all tested models (GPT-2 and five variants of OPT), we significantly improve models' judgements by providing contexts with matching syntactic structures, and conversely significantly worsen them using unacceptable contexts with matching but violated syntactic structures. This effect is amplified by the length of the context, except for unrelated inputs. We show that these changes in model performance are not explainable by simple features matching the context and the test inputs, such as lexical overlap and dependency overlap. This sensitivity to highly specific syntactic features of the context can only be explained by the models' implicit in-context learning abilities.
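A hedged sketch of the underlying measurement: prepend a context to a minimal pair and check whether the model still assigns higher probability to the acceptable sentence. GPT-2 via `transformers` is used here only as an example scorer, and tokenization edge effects at the context boundary are ignored:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sentence_logprob(context: str, sentence: str) -> float:
    """Total log-probability of the sentence tokens, conditioned on the context."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + " " + sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..L-1
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(len(targets)), targets]
    return token_lp[ctx_len - 1:].sum().item()              # keep only sentence tokens

def prefers_acceptable(context: str, acceptable: str, unacceptable: str) -> bool:
    """True if the model assigns higher probability to the acceptable sentence."""
    return sentence_logprob(context, acceptable) > sentence_logprob(context, unacceptable)
```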
Neural networks achieve impressive performance on in-distribution data drawn from the same distribution as the training set, but can produce overconfident results on data they have never seen. It is therefore crucial to detect whether an input is out-of-distribution (OOD) to guarantee the safety of neural networks deployed in the real world. In this paper, we propose a simple and effective post-hoc technique, WeShort, to reduce the overconfidence of neural networks on OOD data. Our method is inspired by the observation of the internal residual structure, which shows the separation of OOD and in-distribution (ID) data in the shortcut layer. Our method is compatible with different OOD detection scores and generalizes well to different network architectures. We demonstrate our method on various OOD datasets to show its competitive performance, and provide reasonable hypotheses to explain why it works. On the ImageNet benchmark, WeShort achieves state-of-the-art performance among post-hoc methods in terms of the false positive rate (FPR95) and the area under the receiver operating characteristic curve (AUROC).
Commonly used AI networks are very self-confident in their predictions, even when the evidence for a certain decision is dubious. The investigation of a deep learning model output is pivotal for understanding its decision processes and assessing its capabilities and limitations. By analyzing the distributions of raw network output vectors, it can be observed that each class has its own decision boundary and, thus, the same raw output value has different support for different classes. Inspired by this fact, we have developed a new method for out-of-distribution detection. The method offers an explanatory step beyond simple thresholding of the softmax output towards understanding and interpretation of the model learning process and its output. Instead of assigning the class label of the highest logit to each new sample presented to the network, it takes the distributions over all classes into consideration. A probability score interpreter (PSI) is created based on the joint logit values in relation to their respective correct vs wrong class distributions. The PSI suggests whether the sample is likely to belong to a specific class, whether the network is unsure, or whether the sample is likely an outlier or unknown type for the network. The simple PSI has the benefit of being applicable on already trained networks. The distributions for correct vs wrong class for each output node are established by simply running the training examples through the trained network. We demonstrate our OOD detection method on a challenging transmission electron microscopy virus image dataset. We simulate a real-world application in which images of virus types unknown to a trained virus classifier, yet acquired with the same procedures and instruments, constitute the OOD samples.
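A rough, illustrative sketch of a PSI-like interpreter: estimate, for each output node, the distribution of its logit when that class is correct versus wrong by running the training set through the trained network, then compare a new sample's logits against both. The Gaussian summaries and decision margins below are simplifying assumptions, not the authors' exact construction:

```python
import numpy as np

class SimplePSI:
    """Illustrative interpreter of raw logits against per-class
    'correct' vs 'wrong' logit distributions estimated on training data."""

    def fit(self, train_logits, train_labels):
        n_classes = train_logits.shape[1]
        self.correct = []  # (mean, std) of class-c logit when c is the true class
        self.wrong = []    # (mean, std) of class-c logit when another class is true
        for c in range(n_classes):
            on = train_logits[train_labels == c, c]
            off = train_logits[train_labels != c, c]
            self.correct.append((on.mean(), on.std() + 1e-8))
            self.wrong.append((off.mean(), off.std() + 1e-8))
        return self

    def interpret(self, logit_vec, margin=2.0):
        """Return ('class', k), 'unsure', or 'outlier' for one raw output vector."""
        k = int(np.argmax(logit_vec))
        z_correct = (logit_vec[k] - self.correct[k][0]) / self.correct[k][1]
        z_wrong = (logit_vec[k] - self.wrong[k][0]) / self.wrong[k][1]
        if z_wrong > margin and z_correct > -margin:
            return ("class", k)  # top logit looks like a typical correct value for class k
        if z_correct < -margin:
            return "outlier"     # top logit is low even for its own class
        return "unsure"
```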
The usage of deep neural networks in safety-critical systems is limited by our ability to guarantee their correct behavior. Runtime monitors are components aiming to identify unsafe predictions and discard them before they can lead to catastrophic consequences. Several recent works on runtime monitoring have focused on out-of-distribution (OOD) detection, i.e., identifying inputs that are different from the training data. In this work, we argue that OOD detection is not a well-suited framework to design efficient runtime monitors and that it is more relevant to evaluate monitors based on their ability to discard incorrect predictions. We call this setting out-ofmodel-scope detection and discuss the conceptual differences with OOD. We also conduct extensive experiments on popular datasets from the literature to show that studying monitors in the OOD setting can be misleading: 1. very good OOD results can give a false impression of safety, 2. comparison under the OOD setting does not allow identifying the best monitor to detect errors. Finally, we also show that removing erroneous training data samples helps to train better monitors.
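A hedged sketch of the evaluation shift argued for above: score a monitor not by ID/OOD separation but by how well its rejections line up with the classifier's actual errors (the metric choice here, precision/recall over incorrect predictions, is illustrative):

```python
import numpy as np

def monitor_error_detection(monitor_rejects: np.ndarray, model_correct: np.ndarray):
    """Score a runtime monitor by how well its rejections match the model's mistakes.

    monitor_rejects : boolean array, True where the monitor discards the prediction.
    model_correct   : boolean array, True where the model's prediction is correct.
    Returns precision/recall of the monitor at catching incorrect predictions.
    """
    errors = ~model_correct
    caught = monitor_rejects & errors
    precision = caught.sum() / max(monitor_rejects.sum(), 1)  # rejected inputs that are actual errors
    recall = caught.sum() / max(errors.sum(), 1)              # errors that get rejected
    return precision, recall
```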
While out-of-distribution (OOD) detection has been well explored in computer vision, there have been relatively fewer attempts for NLP classification. In this paper, we argue that these current attempts do not fully address the OOD problem and may suffer from data leakage and poor calibration of the resulting models. We present PnPOOD, a data augmentation technique that performs OOD detection via out-of-domain sample generation using the recently proposed Plug and Play Language Model (Dathathri et al., 2020). Our method generates high-quality discriminative samples close to the class boundaries, resulting in accurate OOD detection at test time. We demonstrate that our model outperforms prior models on OOD sample detection and exhibits lower calibration error on the 20 Newsgroups text and Stanford Sentiment Treebank datasets (Lang, 1995; Socher et al., 2013). We further highlight an important data leakage issue with the datasets used in prior attempts at OOD detection, and share results on a new dataset for OOD detection that does not suffer from the same problem.
To what extent do pretrained language models grasp semantic knowledge about the phenomenon of distributivity? In this paper, we introduce DistNLI, a new diagnostic dataset for natural language inference that targets the semantic difference arising from distributivity, and we employ the causal mediation analysis framework to quantify model behavior and explore the underlying mechanism in this semantically related task. We find that the extent of models' understanding is associated with model size and vocabulary size. We also provide insights into how models encode such high-level semantic knowledge.
Background. Commonly, Deep Neural Networks (DNNs) generalize well on samples drawn from a distribution similar to that of the training set. However, DNNs' predictions are brittle and unreliable when the test samples are drawn from a dissimilar distribution. This is a major concern for deployment in real-world applications, where such behavior may come at a considerable cost, such as in industrial production lines, autonomous vehicles, or healthcare applications. Contributions. We frame out-of-distribution (OOD) detection in DNNs as a statistical hypothesis testing problem. Tests generated within our proposed framework combine evidence from the entire network. Unlike previous detection heuristics, this framework returns a $p$-value for each test sample. It is guaranteed to maintain the Type I error (T1E - incorrectly identifying an OOD sample as ID) for test data. Moreover, this allows combining several detectors while maintaining the T1E. Building on this framework, we suggest a novel procedure based on low-order statistics. Our method achieves comparable or better results than state-of-the-art methods on well-accepted OOD benchmarks, without retraining the network parameters or assuming prior knowledge of the test distribution -- and at a fraction of the computational cost.
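A hedged sketch of the p-value machinery such a framework relies on: turn a per-sample statistic into an empirical p-value against a held-out in-distribution reference set, then combine several detectors with a Bonferroni correction so the overall Type I error stays controlled. The paper's specific low-order statistics are not reproduced here:

```python
import numpy as np

def empirical_p_value(test_stat: float, reference_stats: np.ndarray) -> float:
    """Empirical p-value of a test statistic w.r.t. held-out in-distribution statistics.

    Assumes larger statistics are more extreme; small p-values mean the sample's
    statistic is unusual relative to ID data.
    """
    n = len(reference_stats)
    return (1 + np.sum(reference_stats >= test_stat)) / (n + 1)

def combined_ood_test(p_values, alpha=0.05):
    """Bonferroni combination of several detectors: flag OOD if any detector's
    p-value falls below alpha / m, which keeps the overall Type I error <= alpha."""
    m = len(p_values)
    return any(p < alpha / m for p in p_values)
```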
Modern deep neural network models are known to erroneously classify out-of-distribution (OOD) test data into one of the in-distribution (ID) training classes with high confidence. This can have catastrophic consequences for safety-critical applications. A popular mitigation strategy is to train a separate classifier that can detect such OOD samples at test time. In most practical settings, OOD examples are not known at train time, and hence a key question is: how do we augment the ID data with synthetic OOD samples to train such an OOD detector? In this paper, we propose a novel compound corruption technique for OOD data augmentation, called CnC. One of the major advantages of CnC is that it does not require any hold-out data apart from the training set. Further, unlike current state-of-the-art (SOTA) techniques, CnC does not require backpropagation or ensembling at test time, making our method much faster at inference. Our extensive comparison with 20 methods from major conferences over the last four years shows that a model trained using CnC-based data augmentation outperforms the SOTA both in terms of OOD detection accuracy and inference time. We include a detailed post-hoc analysis to investigate the reasons for the success of our method, and identify the higher relative entropy and diversity of CnC samples as probable causes. We also provide theoretical insights via a piece-wise decomposition analysis on a two-dimensional dataset to reveal (visually and quantitatively) that our approach leads to a tighter boundary around the ID classes, leading to better detection of OOD samples. Source code link: https://github.com/cnc-ood
Clinical machine learning models show a significant performance drop when tested in settings not seen during training. Domain generalisation models promise to alleviate this problem, however, there is still scepticism about whether they improve over traditional training. In this work, we take a principled approach to identifying Out of Distribution (OoD) environments, motivated by the problem of cross-hospital generalization in critical care. We propose model-based and heuristic approaches to identify OoD environments and systematically compare models with different levels of held-out information. We find that access to OoD data does not translate to increased performance, pointing to inherent limitations in defining potential OoD environments potentially due to data harmonisation and sampling. Echoing similar results with other popular clinical benchmarks in the literature, new approaches are required to evaluate robust models on health records.