Coreset selection, which aims to select the most useful subset of training samples, is a long-standing learning problem that can benefit many downstream tasks such as data-efficient learning, continual learning, neural architecture search, and active learning. However, many existing coreset selection methods were not designed for deep learning and may suffer from high complexity and poor generalization performance. Moreover, recently proposed methods are evaluated on models, datasets, and settings of varying complexity. To advance research on coreset selection in deep learning, we contribute a comprehensive codebase, DeepCore, and provide an empirical study of popular coreset selection methods on the CIFAR10 and ImageNet datasets. Extensive experiments on CIFAR10 and ImageNet confirm that, despite advantages in certain experimental settings, random selection remains a strong baseline.
Computational cost of training state-of-the-art deep models in many learning problems is rapidly increasing due to more sophisticated models and larger datasets. A recent promising direction for reducing training cost is dataset condensation that aims to replace the original large training set with a significantly smaller learned synthetic set while preserving the original information. While training deep models on the small set of condensed images can be extremely fast, their synthesis remains computationally expensive due to the complex bi-level optimization and second-order derivative computation. In this work, we propose a simple yet effective method that synthesizes condensed images by matching feature distributions of the synthetic and original training images in many sampled embedding spaces. Our method significantly reduces the synthesis cost while achieving comparable or better performance. Thanks to its efficiency, we apply our method to more realistic and larger datasets with sophisticated neural architectures and obtain a significant performance boost. We also show promising practical benefits of our method in continual learning and neural architecture search.
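A minimal sketch of the matching step described above, under the assumption that class-wise feature means computed in freshly initialized embedding networks play the role of the sampled embedding spaces; the helper names (random_embedder, condense) and the training loop are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

def random_embedder(channels=3, width=128):
    # A freshly initialized (untrained) CNN acts as one sampled embedding space.
    return nn.Sequential(
        nn.Conv2d(channels, width, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

def condense(real_loader, syn_images, syn_labels, n_classes, iters=1000, lr=0.1):
    syn_images = syn_images.clone().requires_grad_(True)
    opt = torch.optim.SGD([syn_images], lr=lr, momentum=0.5)
    for _ in range(iters):
        net = random_embedder()                    # sample a new embedding space
        real_x, real_y = next(iter(real_loader))   # draw a batch of real images
        losses = []
        for c in range(n_classes):
            real_c, syn_c = real_x[real_y == c], syn_images[syn_labels == c]
            if len(real_c) == 0 or len(syn_c) == 0:
                continue
            # match the class-wise mean features of real and synthetic images
            losses.append(((net(real_c).mean(0) - net(syn_c).mean(0)) ** 2).sum())
        if not losses:
            continue
        loss = torch.stack(losses).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return syn_images.detach()
```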
While deep learning (DL) is data-hungry and typically relies on extensive labeled data to deliver good performance, active learning (AL) reduces labeling costs by selecting a small subset of samples from the unlabeled data for labeling and training. Deep active learning (DAL) has therefore become a feasible solution in recent years for maximizing model performance under a limited labeling cost/budget. Although a large number of DAL methods have been developed and various literature reviews conducted, a performance evaluation of DAL methods under fair comparison settings is not yet available. Our work intends to fill this gap. In this work, we construct a DAL toolkit, DeepAL+, by re-implementing 19 widely cited DAL methods. We survey and categorize DAL-related works, and construct comparative experiments over frequently used datasets and DAL algorithms. In addition, we explore several factors (e.g., batch size, number of epochs in the training process) that influence the efficacy of DAL, which provide better references for researchers when designing their DAL experiments or carrying out DAL-related applications.
Active learning (AL) is a promising ML paradigm that has the potential to parse through large amounts of unlabeled data and help reduce annotation costs in domains where labeling data can be prohibitively expensive. Recently proposed neural-network-based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that under identical experimental settings, different types of AL algorithms (uncertainty-based, diversity-based, and committee-based) produce inconsistent gains over a random sampling baseline. Through a variety of experiments that control for sources of randomness, we show that the variance in the performance metrics achieved by AL algorithms can lead to results that do not agree with previously reported findings. We also find that under strong regularization, AL methods show marginal or no advantage over the random sampling baseline across a variety of experimental conditions. Finally, we conclude with a set of recommendations on how to evaluate results obtained with a new AL algorithm so that they are reproducible and robust to changes in experimental conditions. We share our code to facilitate AL evaluation, and we believe our findings and recommendations will help advance reproducible research in AL using neural networks. We open-source our code at https://github.com/prateekmunjal/torchal
Active learning efficiently builds a labeled dataset by labeling informative samples from an unlabeled dataset. In real-world active learning scenarios, considering the diversity of the selected samples is crucial because many redundant or highly similar samples exist. Core-set approaches are promising diversity-based methods that select diverse samples based on the distances between samples. However, they perform poorly compared to uncertainty-based approaches, which select the most difficult samples on which the neural model exhibits low confidence. In this work, we analyze the feature space through the lens of density and, interestingly, observe that locally sparse regions tend to contain more informative samples than dense regions. Motivated by our analysis, we empower the core-set approach with density awareness and propose a Density-Aware Core-Set (DACS). The strategy is to estimate the density of unlabeled samples and select diverse samples mainly from sparse regions. To reduce the computational bottleneck of estimating the density, we also introduce a new density approximation based on locality-sensitive hashing. Experimental results clearly demonstrate the efficacy of DACS on both classification and regression tasks and, in particular, show that DACS can produce state-of-the-art performance in practical scenarios. Since DACS is only weakly dependent on the neural architecture, we also present a simple yet effective combination method to show that existing methods can be beneficially merged with DACS.
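A minimal sketch of the density-approximation idea, assuming a generic random-projection locality-sensitive hash over model features (bucket size stands in for local density); a full implementation would additionally enforce diversity among the sparse-region candidates:

```python
import numpy as np

def lsh_density(features, n_bits=12, seed=0):
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(features.shape[1], n_bits))
    # hash every sample into a bucket defined by which side of each hyperplane it falls on
    codes = (features @ planes > 0).astype(int) @ (1 << np.arange(n_bits))
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    return counts[inverse]              # bucket size ~ local density proxy

def select_from_sparse_regions(features, budget):
    density = lsh_density(features)
    order = np.argsort(density)         # locally sparse regions first
    return order[:budget]               # indices of unlabeled samples to query
```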
Despite significant advances, the performance of state-of-the-art continual learning approaches hinges on the unrealistic scenario of fully labeled data. In this paper, we tackle this challenge and propose an approach for continual semi-supervised learning -- a setting where not all the data samples are labeled. An underlying issue in this scenario is the model forgetting representations of unlabeled data and overfitting the labeled ones. We leverage the power of nearest-neighbor classifiers to non-linearly partition the feature space and learn a strong representation for the current task, as well as distill relevant information from previous tasks. We perform a thorough experimental evaluation and show that our method outperforms all the existing approaches by large margins, setting a strong state of the art on the continual semi-supervised learning paradigm. For example, on CIFAR100 we surpass several others even when using at least 30 times less supervision (0.8% vs. 25% of annotations).
Active learning has proven useful for minimizing labeling costs by selecting the most informative samples. However, existing active learning methods do not work well in realistic scenarios such as imbalanced or rare classes, out-of-distribution data in the unlabeled set, and redundancy. In this work, we propose SIMILAR (Submodular Information Measures based Active Learning), a unified active learning framework that uses the recently proposed submodular information measures (SIM) as acquisition functions. We argue that SIMILAR not only works in standard active learning but also easily extends to the realistic settings considered above, acting as a one-stop solution for active learning that can scale to large real-world datasets. Empirically, we show that SIMILAR significantly outperforms existing active learning algorithms by up to ~5%-10% in scenarios with rare classes on several image classification tasks such as CIFAR-10, MNIST, and ImageNet. SIMILAR is available as part of the DISTIL toolkit: "https://github.com/decile-team/distil".
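A minimal sketch of one acquisition step in this spirit, assuming a facility-location function (one member of the submodular family) greedily maximized over a cosine-similarity matrix; this is illustrative and not the exact formulation used by SIMILAR:

```python
import numpy as np

def facility_location_greedy(features, budget):
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                           # pairwise cosine similarity
    selected = []
    best = np.zeros(sim.shape[0])           # coverage provided by the current selection
    for _ in range(budget):
        # marginal gain of adding each candidate: improvement in total coverage
        gains = np.maximum(sim, best[None, :]).sum(axis=1) - best.sum()
        gains[selected] = -np.inf
        j = int(np.argmax(gains))
        selected.append(j)
        best = np.maximum(best, sim[j])
    return selected
```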
As an important data selection schema, active learning emerges as an essential component when iterating an Artificial Intelligence (AI) model. It becomes even more critical given the dominance in applications of deep neural network based models, which are composed of a large number of parameters and are data-hungry. Despite its indispensable role in developing AI models, research on active learning is not as intensive as in other research directions. In this paper, we present a review of active learning through deep active learning approaches from the following perspectives: 1) technical advancements in active learning, 2) applications of active learning in computer vision, 3) industrial systems leveraging or with potential to leverage active learning for data iteration, 4) current limitations and future research directions. We expect this paper to clarify the significance of active learning in a modern AI model manufacturing process and to bring additional research attention to active learning. By addressing data automation challenges and coping with automated machine learning systems, active learning will facilitate the democratization of AI technologies by boosting model production at scale.
In semi-supervised representation learning frameworks, when the number of labelled data points is very scarce, the quality and representativeness of these samples become increasingly important. Existing literature on semi-supervised learning randomly samples a limited number of data points for labelling. All these labelled samples are then used along with the unlabelled data throughout the training process. In this work, we ask two important questions in this context: (1) does it matter which samples are selected for labelling? (2) does it matter how the labelled samples are used throughout the training process along with the unlabelled data? To answer the first question, we explore a number of unsupervised methods for selecting specific subsets of data to label (without prior knowledge of their labels), with the goal of maximizing representativeness w.r.t. the unlabelled set. Then, for our second line of inquiry, we define a variety of different label injection strategies for the training process. Extensive experiments on four popular datasets, CIFAR-10, CIFAR-100, SVHN, and STL-10, show that unsupervised selection of samples that are more representative of the entire data improves performance by up to ~2% over existing semi-supervised frameworks such as MixMatch, ReMixMatch, FixMatch and others with random sample labelling. We show that this boost can increase to 7.5% in very few-labelled scenarios. However, our study shows that gradually injecting the labels throughout the training procedure does not considerably impact performance compared to using all the existing labels throughout the entire training.
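A minimal sketch of one representativeness-driven selection strategy in this spirit, assuming k-means on unsupervised features and taking the sample nearest each centroid; the paper studies several such unsupervised selection methods, not only this one:

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_subset(features, budget, seed=0):
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed).fit(features)
    picks = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - km.cluster_centers_[c], axis=1)
        picks.append(members[np.argmin(dists)])   # sample closest to the centroid
    return np.array(picks)                        # indices to send for labelling
```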
Labeling a large set of data is expensive. Active learning aims to address this problem by asking for annotations of the most informative data from the unlabeled set. We propose a novel active learning approach that leverages self-supervised pretext tasks and a unique data sampler to select data that are both difficult and representative. We discover that the loss of a simple self-supervised pretext task, such as rotation prediction, is closely correlated with the downstream task loss. Before the active learning iterations, the pretext task learner is trained on the unlabeled set, and the unlabeled data are sorted and grouped into batches by their pretext task loss. In each active learning iteration, the main task model is used to sample the most uncertain data in a batch to be annotated. We evaluate our method on various image classification and segmentation benchmarks and achieve compelling performance on CIFAR10, Caltech-101, ImageNet, and Cityscapes. We further show that our method performs well on imbalanced datasets and can be an effective solution to the cold-start problem, in which active learning performance is affected by the randomly sampled initial labeled set.
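A minimal sketch of one active learning round as described above, assuming per-sample pretext-task losses and main-task uncertainty scores are already computed; names are illustrative:

```python
import numpy as np

def pretext_guided_round(pretext_loss, main_uncertainty, n_rounds, round_idx, budget):
    # 1) sort the unlabeled pool by pretext-task loss and split it into batches
    order = np.argsort(pretext_loss)[::-1]        # highest pretext loss first
    batches = np.array_split(order, n_rounds)
    # 2) within this round's batch, query the samples the main model is least certain about
    batch = batches[round_idx]
    ranked = batch[np.argsort(main_uncertainty[batch])[::-1]]
    return ranked[:budget]                        # indices to annotate this round
```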
Active learning (AL) algorithms aim to identify an optimal subset of data for annotation such that deep neural networks (DNNs) can achieve better performance when trained on this labeled subset. AL is especially impactful in industrial-scale settings where data labeling costs are high and practitioners use every tool at their disposal to improve model performance. The recent success of self-supervised pretraining (SSP) highlights the importance of harnessing abundant unlabeled data to boost model performance. By combining AL with SSP, we can make use of unlabeled data while simultaneously labeling and training on particularly informative samples. In this work, we study the combination of AL and SSP on ImageNet. We find that performance on small toy datasets, the typical benchmark setting in the literature, is not representative of performance on ImageNet because of the class-imbalanced samples selected by an active learner. Among the existing baselines we test, popular AL algorithms fail to outperform random sampling across a variety of small- and large-scale settings. To remedy the class-imbalance problem, we propose Balanced Selection (BASE), a simple, scalable AL algorithm that consistently outperforms random sampling by selecting more balanced samples than existing methods. Our code is available at: https://github.com/zeyademam/active_learning.
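A minimal sketch of the class-balancing idea, assuming pseudo-labels from the current model and an arbitrary per-sample uncertainty score; this illustrates the balancing principle rather than the exact BASE algorithm:

```python
import numpy as np

def balanced_select(pseudo_labels, uncertainty, n_classes, budget):
    per_class = max(1, budget // n_classes)
    chosen = []
    for c in range(n_classes):
        idx = np.where(pseudo_labels == c)[0]
        # most uncertain samples among those the model currently assigns to class c
        top = idx[np.argsort(uncertainty[idx])[::-1][:per_class]]
        chosen.extend(top.tolist())
    return np.array(chosen[:budget])
```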
While deep learning succeeds in a wide range of tasks, it highly depends on the massive collection of annotated data, which is expensive and time-consuming. To lower the cost of data annotation, active learning has been proposed to interactively query an oracle to annotate a small proportion of informative samples in an unlabeled dataset. Inspired by the fact that samples with higher loss are usually more informative to the model than samples with lower loss, in this paper we present a novel deep active learning approach that queries the oracle for data annotation when an unlabeled sample is believed to incorporate high loss. The core of our approach is a measurement, Temporal Output Discrepancy (TOD), that estimates the sample loss by evaluating the discrepancy of outputs given by models at different optimization steps. Our theoretical investigation shows that TOD lower-bounds the accumulated sample loss, so it can be used to select informative unlabeled samples. On the basis of TOD, we further develop an effective unlabeled data sampling strategy as well as an unsupervised learning criterion for active learning. Due to the simplicity of TOD, our methods are efficient, flexible, and task-agnostic. Extensive experimental results demonstrate that our approach achieves superior performance compared to state-of-the-art active learning methods on image classification and semantic segmentation tasks. In addition, we show that TOD can be utilized to select the best model, with potentially the highest testing accuracy, from a pool of candidate models.
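A minimal sketch of the TOD score, assuming two snapshots of the same network taken a fixed number of optimization steps apart; names are illustrative:

```python
import torch

@torch.no_grad()
def tod_scores(model_t, model_t_plus, unlabeled_loader, device="cpu"):
    scores = []
    for x, _ in unlabeled_loader:
        x = x.to(device)
        out_a = model_t(x)        # outputs at optimization step t
        out_b = model_t_plus(x)   # outputs at a later step t + T
        # output discrepancy between the two steps estimates the (unknown) sample loss
        scores.append(torch.norm(out_a - out_b, dim=1))
    return torch.cat(scores)

# query the samples with the largest estimated loss, e.g.:
# query_idx = tod_scores(model_t, model_t_plus, pool_loader).topk(budget).indices
```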
The generalisation performance of a convolutional neural network (CNN) is largely determined by the quantity, quality, and diversity of its training images. All training data need to be annotated in advance, yet in many real-world applications data are easy to acquire but expensive and time-consuming to label. The goal of active learning is to draw the most informative samples from the unlabeled pool, which can be used for training after annotation. With an entirely different objective, self-supervised learning (SSL) has been gaining meteoric popularity by closing the performance gap with supervised methods on large computer vision benchmarks. SSL approaches have been shown to produce low-level representations that are invariant to distortions of the input sample and can encode invariance to artificially created distortions, e.g. rotation, solarization, and cropping, while relying on simpler and more scalable training frameworks. In this paper, we unify these two families of approaches from the angle of active learning using a self-supervised manifold and propose Deep Active Learning using Barlow Twins (DALBT), an active learning method that combines a classifier trained jointly with the self-supervised loss framework of Barlow Twins, in a setting where the model can encode invariance to artificially created distortions such as rotation, solarization, and cropping.
As deep learning becomes more label-hungry, an increasing number of papers have studied active learning (AL) for deep models. However, there are many issues in the prevalent experimental settings, mainly stemming from the lack of a unified implementation and benchmark. Issues in the current literature include sometimes contradictory observations on the performance of different AL algorithms, unintended exclusion of important generalization approaches such as data augmentation and SGD for optimization, a lack of study on evaluation facets such as the labeling efficiency of AL, and little or no clarity on the scenarios in which AL outperforms random sampling (RS). In this work, we unify and re-implement state-of-the-art AL algorithms in the context of image classification via our new open-source AL toolkit DISTIL, and we carefully study these issues as facets of effective evaluation. On the positive side, we show that AL techniques are 2x to 4x more label-efficient than RS when combined with data augmentation. Surprisingly, when data augmentation is included, there is no longer a consistent gain from using BADGE, a state-of-the-art method, over simple uncertainty sampling. We then carefully analyze how existing approaches behave with varying amounts of redundancy and numbers of examples per class. Finally, we provide several insights for AL practitioners to consider in future work, such as the effect of the AL batch size, the effect of initialization, the importance of retraining the model in each round, and other insights.
Pool-based active learning (AL) has achieved great success by sequentially selecting informative unlabeled samples from a large unlabeled data pool and querying their labels from an oracle/annotator. However, existing AL sampling strategies may not work well in out-of-distribution (OOD) data scenarios, where the unlabeled data pool contains data samples that do not belong to the target task's classes. Achieving good AL performance under OOD data scenarios is challenging because of the natural conflict between AL sampling strategies and OOD sample detection: AL tends to select data that are hard for the current base classifier to classify (e.g., samples whose predicted class probabilities have high entropy), while OOD samples tend to have more uniform predicted class probabilities (i.e., higher entropy) than in-distribution (ID) data. In this paper, we propose a sampling scheme, Monte-Carlo Pareto Optimization for Active Learning (POAL), which selects an optimal subset of unlabeled samples with a fixed batch size from the unlabeled data pool. We cast the AL sampling task as a multi-objective optimization problem and utilize Pareto optimization based on two conflicting objectives: (1) the conventional AL data sampling score (e.g., maximum entropy) and (2) the confidence that a sample is not an OOD sample. Experimental results demonstrate its effectiveness on both classical machine learning (ML) and deep learning (DL) tasks.
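A minimal sketch of extracting Pareto-optimal candidates from the two conflicting objectives, assuming per-sample entropy and in-distribution confidence are both to be maximized; the full POAL method additionally performs Monte-Carlo subset search for a fixed batch size:

```python
import numpy as np

def pareto_front(entropy, id_confidence):
    scores = np.stack([entropy, id_confidence], axis=1)
    front = []
    for i in range(len(scores)):
        # i is dominated if some other sample is at least as good on both objectives
        # and strictly better on at least one
        dominated = np.all(scores >= scores[i], axis=1) & np.any(scores > scores[i], axis=1)
        if not dominated.any():
            front.append(i)
    return np.array(front)      # candidate indices for the query batch
```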
Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, with endeavours to extend this knowledge without targeting the original task resulting in a catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern (1) a taxonomy and extensive overview of the state-of-the-art; (2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner; (3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks, considering Tiny Imagenet and large-scale unbalanced iNaturalist and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compare methods in terms of required memory, computation time and storage.
Recent studies have demonstrated that gradient matching-based dataset synthesis, or dataset condensation (DC), methods can achieve state-of-the-art performance when applied to data-efficient learning tasks. However, in this study, we show that existing DC methods can perform worse than random selection when task-irrelevant information forms a significant part of the training dataset. We attribute this to the lack of participation of the contrastive signals between classes caused by the class-wise gradient matching strategy. To address this problem, we propose Dataset Condensation with Contrastive signals (DCC) by modifying the loss function so that DC methods can effectively capture the differences between classes. In addition, we analyze the training dynamics of the new loss function by tracking the kernel velocity. Furthermore, we introduce a bi-level warm-up strategy to stabilize the optimization. Our experimental results indicate that, while existing methods are ineffective for fine-grained image classification tasks, the proposed method can successfully generate informative synthetic datasets for the same tasks. Moreover, we demonstrate that the proposed method outperforms the baselines even on benchmark datasets such as SVHN, CIFAR-10, and CIFAR-100. Finally, we demonstrate the high applicability of the proposed method by applying it to continual learning tasks.
Dataset Distillation (DD), a newly emerging field, aims at generating much smaller and high-quality synthetic datasets from large ones. Existing DD methods based on gradient matching achieve leading performance; however, they are extremely computationally intensive as they require continuously optimizing a dataset among thousands of randomly initialized models. In this paper, we assume that training the synthetic data with diverse models leads to better generalization performance. Thus we propose two model augmentation techniques, i.e., using early-stage models and weight perturbation, to learn an informative synthetic set with significantly reduced training cost. Extensive experiments demonstrate that our method achieves up to 20x speedup with performance on par with state-of-the-art baseline methods.
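A minimal sketch of one plausible reading of the weight-perturbation idea, assuming isotropic Gaussian noise added to an early-stage model's parameters; the exact perturbation scheme in the paper may differ:

```python
import copy
import torch

def perturb_weights(model, scale=0.01):
    # Return a noisy copy of an early-stage model to diversify the models
    # the synthetic set is optimized against.
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(scale * torch.randn_like(p))
    return noisy
```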
Training on web-scale data can take months. But much of the computation and time is wasted on redundant and noisy points that are already learnt or not learnable. To accelerate training, we introduce Reducible Holdout Loss Selection (RHO-LOSS), a simple but principled technique that approximately selects the training points that most reduce the model's generalization loss. As a result, RHO-LOSS mitigates the weaknesses of existing data selection methods: techniques from the optimization literature typically select "hard" (e.g., high-loss) points, but such points are often noisy (not learnable) or less task-relevant. Conversely, curriculum learning prioritizes "easy" points, but such points need not be trained on once they are learnt. In contrast, RHO-LOSS selects points that are learnable, worth learning, and not yet learnt. Compared to prior art, RHO-LOSS trains in far fewer steps, improves accuracy, and speeds up training on a wide range of datasets, hyperparameters, and architectures (MLPs, CNNs, and BERT). On the large web-scraped image dataset Clothing-1M, RHO-LOSS trains in 18x fewer steps and reaches 2% higher final accuracy than uniform data shuffling.
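A minimal sketch of per-batch selection by reducible holdout loss, assuming the holdout ("irreducible loss") model was trained on a separate holdout set; the kept fraction and batch handling are illustrative:

```python
import torch
import torch.nn.functional as F

def rho_loss_select(model, holdout_model, x, y, keep_frac=0.1):
    with torch.no_grad():
        train_loss = F.cross_entropy(model(x), y, reduction="none")
        irreducible = F.cross_entropy(holdout_model(x), y, reduction="none")
    rho = train_loss - irreducible        # learnable, worth learning, not yet learnt
    k = max(1, int(keep_frac * len(y)))
    keep = torch.topk(rho, k).indices     # train only on the highest-scoring points
    return x[keep], y[keep]
```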
Active learning enables efficient model training by leveraging interactions between machine learning agents and human annotators. We study and propose a novel framework that formulates batch active learning from the sparse approximation's perspective. Our active learning method aims to find an informative subset from the unlabeled data pool such that the corresponding training loss function approximates its full data pool counterpart. We realize the framework as sparsity-constrained discontinuous optimization problems, which explicitly balance uncertainty and representation for large-scale applications and could be solved by greedy or proximal iterative hard thresholding algorithms. The proposed method can adapt to various settings, including both Bayesian and non-Bayesian neural networks. Numerical experiments show that our work achieves competitive performance across different settings with lower computational complexity.