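We identify label errors in the test sets of 10 of the most commonly used computer vision, natural language, and audio datasets, and subsequently study the potential for these label errors to affect benchmark results. Errors in test sets are numerous and widespread: we estimate an average of at least 3.3% errors across the 10 datasets, where for example label errors comprise at least 6% of the ImageNet validation set. Putative label errors are identified using confident learning algorithms and then human-validated via crowdsourcing (51% of the algorithmically flagged candidates are indeed erroneously labeled, on average across the datasets). Traditionally, machine learning practitioners choose which model to deploy based on test accuracy. Our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets. Surprisingly, we find that lower-capacity models may be practically more useful than higher-capacity models in real-world datasets with high proportions of erroneously labeled data. For example, on ImageNet with corrected labels, ResNet-18 outperforms ResNet-50 if the prevalence of originally mislabeled test examples increases by just 6%. On CIFAR-10 with corrected labels, VGG-11 outperforms VGG-19 if the prevalence of originally mislabeled test examples increases by just 5%. Test set errors across the 10 datasets can be viewed at https://labelerrors.com, and all label errors can be reproduced with https://github.com/cleanlab/label-errors.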
Translated by Google Translate
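Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) is an alternative approach that focuses instead on label quality by characterizing and identifying label errors in datasets, based on the principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence. Whereas numerous studies have developed these principles independently, here we combine them, building on the assumption of a class-conditional noise process to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This results in a generalized CL that is provably consistent and experimentally performant. We present sufficient conditions under which CL exactly finds label errors, and show CL performance exceeding seven recent approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and to improve sentiment classification on text data in Amazon Reviews). We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating that 645 "missile" images are mislabeled as their parent class "projectile"), and to moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training. These results are replicable using the open-source cleanlab release.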
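As a rough illustration of "counting with probabilistic thresholds", the sketch below builds a count matrix between given labels and confidently predicted labels. It is a simplified reading of the idea, not the cleanlab implementation: the function names, the per-class threshold (mean predicted probability of a class over examples carrying that label), and the toy data are all assumptions made for this example, and the predicted probabilities are assumed to be out-of-sample.

```python
from typing import List, Tuple


def class_thresholds(labels: List[int], probs: List[List[float]]) -> List[float]:
    """Per-class threshold: mean predicted probability of class k
    over the examples whose given label is k (assumed definition)."""
    n_classes = len(probs[0])
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, probs):
        sums[y] += p[y]
        counts[y] += 1
    # A class with no examples gets threshold 1.0 so it is rarely selected.
    return [sums[k] / counts[k] if counts[k] else 1.0 for k in range(n_classes)]


def confident_joint(
    labels: List[int], probs: List[List[float]]
) -> Tuple[List[List[int]], List[int]]:
    """Count examples by (given label, confidently predicted label).

    Returns the count matrix and the indices of off-diagonal examples,
    which are candidate label errors.
    """
    thresholds = class_thresholds(labels, probs)
    k = len(thresholds)
    joint = [[0] * k for _ in range(k)]
    suspects = []
    for idx, (y, p) in enumerate(zip(labels, probs)):
        confident = [j for j in range(k) if p[j] >= thresholds[j]]
        if not confident:
            continue  # no class clears its threshold: example is not counted
        guess = max(confident, key=lambda j: p[j])
        joint[y][guess] += 1
        if guess != y:
            suspects.append(idx)
    return joint, suspects


# Toy data: example 2 carries label 0 but is confidently predicted as class 1.
labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
         [0.1, 0.9], [0.3, 0.7], [0.05, 0.95]]
joint, suspects = confident_joint(labels, probs)
print(joint, suspects)
```

Normalizing the count matrix would give a rough estimate of the joint distribution between given and confidently predicted labels; the flagged indices would still need human review before any relabeling or pruning.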
We show that large pre-trained language models are inherently highly capable of identifying label errors in natural language datasets: simply examining out-of-sample data points in descending order of fine-tuned task loss significantly outperforms more complex error-detection mechanisms proposed in previous work. To this end, we contribute a novel method for introducing realistic, human-originated label noise into existing crowdsourced datasets such as SNLI and TweetNLP. We show that this noise has similar properties to real, hand-verified label errors, and is harder to detect than existing synthetic noise, creating challenges for model robustness. We argue that human-originated noise is a better standard for evaluation than synthetic noise. Finally, we use crowdsourced verification to evaluate the detection of real errors on IMDB, Amazon Reviews, and Recon, and confirm that pre-trained models perform at a 9-36% higher absolute Area Under the Precision-Recall Curve than existing models.
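The ranking step described above can be sketched in a few lines. This is a minimal illustration that assumes the per-example class probabilities already come from a fine-tuned model evaluated out of sample (for example via cross-validation); the function name `rank_by_loss`, the `eps` clamp, and the toy data are hypothetical and not taken from the paper.

```python
import math
from typing import List, Tuple


def rank_by_loss(
    labels: List[int], probs: List[List[float]], eps: float = 1e-12
) -> Tuple[List[int], List[float]]:
    """Return example indices sorted by descending cross-entropy loss,
    together with the per-example losses.

    The loss of an example is -log of the probability assigned to its
    given label; eps guards against log(0).
    """
    losses = [-math.log(max(p[y], eps)) for y, p in zip(labels, probs)]
    order = sorted(range(len(labels)), key=lambda i: losses[i], reverse=True)
    return order, losses


# Toy out-of-sample predictions for four examples and two classes.
labels = [0, 1, 1, 0]
probs = [[0.9, 0.1], [0.6, 0.4], [0.05, 0.95], [0.2, 0.8]]
order, losses = rank_by_loss(labels, probs)
print(order)  # examples to inspect first come first
```

Examples at the head of the ordering are the ones whose given label the model finds least plausible, so they are the natural first candidates for manual verification.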
We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3% to 15% on CIFAR-10 and 11% to 14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
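Image classifiers are typically scored on their test set accuracy, but high accuracy can mask a subtle type of model failure. We find that high-scoring convolutional neural networks (CNNs) on popular benchmarks exhibit troubling pathologies that allow them to display high accuracy even in the absence of semantically salient features. When a model provides a high-confidence decision without salient supporting input features, we say the classifier has overinterpreted its input, finding too much class evidence in patterns that appear nonsensical to humans. Here, we demonstrate that neural networks trained on CIFAR-10 and ImageNet suffer from overinterpretation, and we find that models on CIFAR-10 make confident predictions even when 95% of each input image is masked and humans cannot discern salient features in the remaining pixel subsets. We introduce Batched Gradient SIS, a new method for discovering sufficient input subsets for complex datasets, and use this method to show the sufficiency of border pixels in ImageNet for training and testing. Although these patterns portend potential model fragility in real-world deployment, they are in fact valid statistical patterns of the benchmark that alone suffice to attain high test accuracy. Unlike adversarial examples, overinterpretation relies upon unmodified image pixels. We find that ensembling and input dropout can each help mitigate overinterpretation.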
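In the past decade, computer vision, the branch of artificial intelligence aimed at understanding the visual world, has evolved from simply recognizing objects in images to describing pictures, answering questions about images, helping robots maneuver around physical spaces, and even generating novel visual content. As these tasks and applications have modernized, so too has the reliance on more data, either for model training or for evaluation. In this chapter, we demonstrate that novel interaction strategies can enable new forms of data collection and evaluation for computer vision. First, we present a crowdsourcing interface for speeding up paid data collection by an order of magnitude, feeding the data-hungry nature of modern vision models. Second, we explore a method to increase volunteer contributions using automated social interventions. Third, we develop a system to ensure that human evaluation of generative vision models is reliable, affordable, and grounded in psychophysics theory. We conclude with future opportunities for human-computer interaction to aid computer vision.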
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the chal-
Deep neural networks may easily memorize noisy labels present in real-world data, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a metric, called susceptibility, to gauge such memorization for neural networks. Susceptibility is simple and easy to compute during training. Moreover, it does not require access to ground-truth labels and it only uses unlabeled data. We empirically show the effectiveness of our metric in tracking memorization on various architectures and datasets and provide theoretical insights into the design of the susceptibility metric. Finally, we show through extensive experiments on datasets with synthetic and real-world label noise that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data.
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
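A recent trend in artificial intelligence is the use of pretrained models for language and vision tasks, which have achieved extraordinary performance but also puzzling failures. Probing these models' abilities in diverse ways is therefore critical to the field. In this paper, we explore the reliability of models, where we define a reliable model as one that not only achieves strong predictive performance but also performs well consistently over many decision-making tasks involving uncertainty (e.g., selective prediction, open set recognition), robust generalization (e.g., accuracy and proper scoring rules such as log-likelihood on in-distribution and out-of-distribution datasets), and adaptation (e.g., active learning, few-shot uncertainty). We devise 10 types of tasks over 40 datasets in order to evaluate different aspects of reliability in both the vision and language domains. To improve reliability, we developed ViT-Plex and T5-Plex, pretrained large model extensions for the vision and language modalities, respectively. Plex greatly improves the state of the art across reliability tasks and simplifies the traditional protocol, as it improves out-of-the-box performance and does not require designing scores or tuning the model for each task. We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples. We also demonstrate Plex's capabilities on challenging tasks including zero-shot open set recognition, active learning, and uncertainty in conversational language understanding.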
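Deep learning has achieved remarkable success in numerous domains with the help of large amounts of data. However, the quality of data labels is a concern because of the lack of high-quality labels in many real-world scenarios. As noisy labels severely degrade the generalization performance of deep neural networks, learning from noisy labels (robust training) is becoming an important task in modern deep learning applications. In this survey, we first describe the problem of learning with label noise from a supervised learning perspective. Next, we provide a comprehensive review of 62 state-of-the-art robust training methods, all of which are categorized into five groups according to their methodological differences, followed by a systematic comparison of six properties used to evaluate their superiority. Subsequently, we perform an in-depth analysis of noise rate estimation and summarize the typically used evaluation methodology, including public noisy datasets and evaluation metrics. Finally, we present several promising research directions that can serve as a guideline for future studies. All the contents will be available at https://github.com/songhwanjun/awesome-noisy-labels.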
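Machine learning (ML) research has generally focused on models, while the most prominent datasets have been employed for everyday ML tasks without regard for the breadth, difficulty, and faithfulness of these datasets to the underlying problem. Neglecting the fundamental importance of datasets has caused major problems involving data cascades in real-world applications and saturation of dataset-driven criteria for model quality, hindering research growth. To solve this problem, we present DataPerf, a benchmark package for evaluating ML datasets and dataset-working algorithms. We intend it to enable the "data ratchet", in which training sets aid in evaluating test sets on the same problems, and vice versa. Such a feedback-driven strategy will generate a virtuous loop that accelerates development of data-centric AI. The MLCommons Association will maintain DataPerf.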
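High-quality data is necessary for modern machine learning. However, acquiring such data is difficult because human annotations are noisy and ambiguous. Aggregating such annotations to determine the label of an image leads to lower data quality. We propose a data-centric image classification benchmark with nine real-world datasets and multiple annotations per image to investigate and quantify the impact of such data quality issues. We take a data-centric perspective by asking how data quality could be improved. Across thousands of experiments, we show that multiple annotations allow a better approximation of the real underlying class distribution. We identify that hard labels cannot capture the ambiguity of the data, which might lead to the common issue of overconfident models. Based on the presented datasets, benchmark baselines, and analysis, we create multiple research opportunities for the future.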
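Deep neural networks (DNNs) are increasingly used in software engineering and code intelligence tasks. These are powerful tools capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, their large capacity can render them prone to memorizing data points. Recent work suggests that the memorization risk manifests especially strongly when the training dataset is noisy, involving many ambiguous or questionable samples, and memorization is the only recourse. The goal of this paper is to evaluate and compare the extent of memorization and generalization in neural code intelligence models. It aims to provide insights into how memorization may impact the learning behavior of neural models in code intelligence systems. To observe the extent of memorization in models, we add random noise to the original training dataset and use various metrics to quantify the impact of noise on various aspects of training and testing. We evaluate several state-of-the-art neural code intelligence models and benchmarks based on Java, Python, and Ruby codebases. Our results highlight important risks: millions of trainable parameters allow neural networks to memorize anything, including noisy data, and provide a false sense of generalization. We observed that all models manifest some form of memorization. This can be troublesome in most code intelligence tasks, where models rely on rather noise-prone and repetitive data sources, such as code from GitHub. To the best of our knowledge, we provide the first study to quantify memorization effects in the domain of software engineering and code intelligence systems. This work raises awareness and provides new insights into important issues of training neural models in code intelligence systems that are usually overlooked by software engineering researchers.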
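Continual learning (CL) is widely regarded as a crucial challenge for lifelong AI. However, existing CL benchmarks, e.g., Permuted-MNIST and Split-CIFAR, make use of artificial temporal variation and do not align with or generalize to the real world. In this paper, we introduce CLEAR, the first continual image classification benchmark dataset with a natural temporal evolution of visual concepts in the real world that spans a decade (2004-2014). We build CLEAR from an existing large-scale image collection (YFCC100M) through a novel and scalable low-cost approach to visio-linguistic dataset curation. Our pipeline makes use of pretrained vision-language models (e.g., CLIP) to interactively build labeled datasets, which are further validated with crowdsourcing to remove errors and even inappropriate images (hidden in the original YFCC100M). The major strength of CLEAR over prior CL benchmarks is the smooth temporal evolution of visual concepts with real-world imagery, including both high-quality labeled data and abundant unlabeled samples per time period for continual semi-supervised learning. We find that a simple unsupervised pre-training step can already boost state-of-the-art CL algorithms that only utilize fully supervised data. Our analysis also reveals that mainstream CL evaluation protocols that train and test on iid data artificially inflate the performance of CL systems. To address this, we propose novel "streaming" protocols for CL that always test on the (near) future. Interestingly, streaming protocols (a) can simplify dataset curation, since today's test set can be repurposed as tomorrow's train set, and (b) can produce more generalizable models with more accurate estimates of performance, since all labeled data from each time period is used for both training and testing (unlike classic iid train-test splits).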
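A streaming evaluation of this kind can be sketched as a generator of (train, test) splits over time-ordered buckets. This is only an illustration under stated assumptions: the name `streaming_splits` is hypothetical, and accumulating all past buckets for training (rather than using only the most recent one) is a choice made for this example, not something the abstract specifies.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def streaming_splits(
    buckets: Sequence[Sequence[T]],
) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield (train, test) pairs where the test set is always the next
    time bucket and the train set is every bucket seen so far.

    `buckets` must be ordered from oldest to newest.
    """
    for t in range(len(buckets) - 1):
        # All data up to and including period t is available for training.
        train = [x for bucket in buckets[: t + 1] for x in bucket]
        # Period t + 1 is unseen at this point, so it serves as the test set.
        test = list(buckets[t + 1])
        yield train, test


# Toy example with three time periods.
buckets = [["a", "b"], ["c"], ["d", "e"]]
for train, test in streaming_splits(buckets):
    print(train, "->", test)
```

Each bucket serves as a test set exactly once before being folded into the training data, which mirrors the point that today's test set can be reused as tomorrow's train set.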
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with label noise. However, the field lacks a comprehensive survey on the different types of label noise, their consequences and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is proposed to help practitioners choose the most suitable technique for their own particular field of application. Finally, the design of experiments is also discussed, which may interest researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information, such as confidences on labels, is assumed to be available.
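We propose a novel three-stage Find-Resolve-Label workflow for crowdsourced annotation to reduce ambiguity in task instructions and thus improve annotation quality. Stage 1 (Find) asks the crowd to find examples whose correct label seems ambiguous given the task instructions. Workers are also asked to provide a short tag that describes the ambiguous concept embodied by the specific instance found. We compare collaborative and non-collaborative designs for this stage. In Stage 2 (Resolve), the requester selects one or more of these ambiguous examples to label (resolving the ambiguity). The new labels are automatically injected back into the task instructions in order to improve clarity. Finally, in Stage 3 (Label), workers perform the actual annotation using the revised guidelines with clarifying examples. We compare three designs for using these examples: examples only, tags only, or both. We report image labeling experiments over six task designs using Amazon Mechanical Turk. Results show improved annotation accuracy and further insights regarding effective design for crowdsourced annotation tasks.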
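Building a benchmark dataset for hate speech detection presents various challenges. First, because hate speech is relatively rare, random sampling of tweets to annotate is very inefficient at finding hate speech. To address this, prior datasets often include only tweets matching known "hate words". However, restricting data to a predefined vocabulary may exclude portions of the real-world phenomenon we seek to model. A second challenge is that definitions of hate speech tend to be highly varying and subjective. Annotators with diverse prior notions of hate speech may not only disagree with one another but also struggle to conform to specified labeling guidelines. Our key insight is that the rarity and subjectivity of hate speech are akin to those of relevance in information retrieval (IR). This connection suggests that well-established methodologies for creating IR test collections can be usefully applied to create better benchmark datasets for hate speech. To intelligently and efficiently select which tweets to annotate, we apply the standard IR techniques of pooling and active learning. To improve both the consistency and the value of annotations, we apply task decomposition and annotator rationale techniques. We share a new benchmark dataset for hate speech detection on Twitter that provides broader coverage of hate than prior datasets. We also show a dramatic drop in the accuracy of existing detection models when tested on these broader forms of hate. The annotator rationales we collect not only justify labeling decisions but also enable future work on dual supervision and/or explanation generation in modeling. Further details of our approach can be found in the supplementary materials.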
Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance by an average of 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.
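To make the notion of a labeling function concrete, here is a small self-contained sketch. It does not use the Snorkel library or its API, and it deliberately swaps Snorkel's learned denoising model for a plain majority vote, which is a much cruder baseline. The three heuristics, the `ABSTAIN` constant, and the sentiment task are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, List

ABSTAIN = -1   # a labeling function may decline to vote
NEGATIVE = 0
POSITIVE = 1

LabelingFunction = Callable[[str], int]


def lf_contains_great(text: str) -> int:
    """Heuristic: the word 'great' suggests a positive review."""
    return POSITIVE if "great" in text.lower() else ABSTAIN


def lf_contains_terrible(text: str) -> int:
    """Heuristic: the word 'terrible' suggests a negative review."""
    return NEGATIVE if "terrible" in text.lower() else ABSTAIN


def lf_exclamation(text: str) -> int:
    """Weak heuristic: a trailing exclamation mark leans positive."""
    return POSITIVE if text.endswith("!") else ABSTAIN


def majority_vote(text: str, lfs: List[LabelingFunction]) -> int:
    """Combine labeling-function votes by simple majority.

    Returns ABSTAIN when no function votes or when the top two labels tie.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


lfs = [lf_contains_great, lf_contains_terrible, lf_exclamation]
for review in ["Great movie!", "Terrible plot.", "It was fine."]:
    print(review, "->", majority_vote(review, lfs))
```

A majority vote treats every heuristic as equally reliable and independent; the abstract's point is that the accuracies and correlations of labeling functions are unknown and are estimated without ground truth, which this baseline does not attempt.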