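Active learning (AL) is about choosing which data to annotate. Existing works attempt to select highly uncertain or informative data for annotation. Nevertheless, it remains unclear how the selected data affect the test performance of the task model used in AL. In this work, we explore this effect by theoretically proving that selecting unlabeled data with a higher gradient norm leads to a lower upper bound on the test loss, and therefore to better test performance. However, because label information is missing, directly computing the gradient norm of unlabeled data is infeasible. To address this challenge, we propose two schemes, namely expected-gradnorm and entropy-gradnorm. The former computes the gradient norm by constructing an expected empirical loss, while the latter constructs an unsupervised loss from the entropy. Furthermore, we integrate the two schemes into a universal AL framework. We evaluate our method on classical image classification and semantic segmentation tasks. To demonstrate its competency in domain applications and its robustness to noise, we also validate our method on a cellular imaging analysis task, namely cryo-electron tomography subtomogram classification. Results show that our method achieves superior performance compared with the state of the art. Our source code is available at https://github.com/xulabs/aitom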
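The entropy-based scheme can be illustrated with a small sketch. It assumes a bias-free linear softmax classifier (logits = W x), which is a simplification chosen here so that the gradient has a closed form; the function names are hypothetical and this is not the authors' implementation. For that model, the gradient of the prediction entropy with respect to W is the outer product of dH/dz and x, so its Frobenius norm is the product of the two vector norms.

```python
import math

def softmax(logits):
    # Numerically stable softmax.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_grad_wrt_logits(probs):
    # For H = -sum_i p_i log p_i with p = softmax(z):
    # dH/dz_j = -p_j * (log p_j + H).
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return [-p * (math.log(p) + entropy) if p > 0.0 else 0.0 for p in probs]

def entropy_gradnorm(weights, x):
    # Assumption: linear classifier, logits = W x (no bias, no hidden layers).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    g = entropy_grad_wrt_logits(softmax(logits))
    g_norm = math.sqrt(sum(v * v for v in g))
    x_norm = math.sqrt(sum(v * v for v in x))
    # ||g x^T||_F = ||g||_2 * ||x||_2
    return g_norm * x_norm

def select_top_k(weights, pool, k):
    # Query the k unlabeled samples with the largest gradient norm.
    scores = [entropy_gradnorm(weights, x) for x in pool]
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return order[:k]
```

For a deep network the same score would normally be obtained by backpropagating the entropy loss with an automatic differentiation library and taking the norm of the parameter gradients.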
While deep learning succeeds in a wide range of tasks, it depends heavily on massive collections of annotated data, which are expensive and time-consuming to obtain. To lower the cost of data annotation, active learning has been proposed to interactively query an oracle to annotate a small proportion of informative samples in an unlabeled dataset. Inspired by the fact that samples with higher loss are usually more informative to the model than samples with lower loss, in this paper we present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incur a high loss. The core of our approach is a measurement, Temporal Output Discrepancy (TOD), that estimates the sample loss by evaluating the discrepancy between outputs given by models at different optimization steps. Our theoretical investigation shows that TOD lower-bounds the accumulated sample loss, so it can be used to select informative unlabeled samples. On the basis of TOD, we further develop an effective unlabeled data sampling strategy as well as an unsupervised learning criterion for active learning. Due to the simplicity of TOD, our methods are efficient, flexible, and task-agnostic. Extensive experimental results demonstrate that our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks. In addition, we show that TOD can be used to select, from a pool of candidate models, the model with potentially the highest testing accuracy.
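As a rough illustration of the TOD idea, the sketch below scores each unlabeled sample by the Euclidean distance between the outputs of two model snapshots taken at different optimization steps, then queries the highest-scoring samples. The function names and the choice of the L2 distance are assumptions made for this sketch, and the outputs are passed in as plain lists rather than produced by a real network.

```python
import math

def temporal_output_discrepancy(out_early, out_late):
    # Distance between the outputs of an earlier and a later model snapshot
    # for the same input (assumed here to be the L2 distance).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(out_early, out_late)))

def select_by_tod(outputs_early, outputs_late, k):
    # Rank unlabeled samples by discrepancy and return the indices of the top k.
    scores = [temporal_output_discrepancy(a, b)
              for a, b in zip(outputs_early, outputs_late)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Because the score needs only forward passes of two checkpoints, no labels are involved at selection time.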
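We present supervised contrastive active learning (SCAL) and propose efficient active learning strategies based on feature similarity (featuresim) and principal component analysis based feature-reconstruction error (FRE) to select informative data samples with diverse feature representations. We demonstrate that our proposed method achieves state-of-the-art accuracy and model calibration, and reduces sampling bias, in an active learning setup for balanced and imbalanced datasets on image classification tasks. We also evaluate the robustness of the model to distributional shift derived from different query strategies in the active learning setting. Using extensive experiments, we show that our proposed approach outperforms high-performing compute-intensive methods by a large margin, resulting in 9.9% lower mean corruption error, 7.2% lower expected calibration error under dataset shift, and 8.9% higher AUROC for out-of-distribution detection.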
Active learning aims to develop label-efficient algorithms by sampling the most representative queries to be labeled by an oracle. We describe a pool-based semi-supervised active learning algorithm that implicitly learns this sampling mechanism in an adversarial manner. Unlike conventional active learning algorithms, our approach is task-agnostic, i.e., it does not depend on the performance of the task for which we are trying to acquire labeled data. Our method learns a latent space using a variational autoencoder (VAE) and an adversarial network trained to discriminate between unlabeled and labeled data. The minimax game between the VAE and the adversarial network is played such that while the VAE tries to trick the adversarial network into predicting that all data points are from the labeled pool, the adversarial network learns how to discriminate between dissimilarities in the latent space. We extensively evaluate our method on various image classification and semantic segmentation benchmark datasets and establish a new state of the art on CIFAR10/100, Caltech-256, ImageNet, Cityscapes, and BDD100K. Our results demonstrate that our adversarial approach learns an effective low-dimensional latent space in large-scale settings and provides a computationally efficient sampling method.
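While deep learning (DL) is data-hungry and usually relies on extensive labeled data to deliver good performance, active learning (AL) reduces labeling costs by selecting a small proportion of samples from the unlabeled data for labeling and training. Deep active learning (DAL) has therefore emerged in recent years as a feasible solution for maximizing model performance under a limited labeling cost or budget. Although many DAL methods have been developed and various literature reviews conducted, a performance evaluation of DAL methods under fair comparison settings is not yet available. Our work intends to fill this gap. In this work, we construct a DAL toolkit, DeepAL+, by re-implementing 19 highly cited DAL methods. We survey and categorize DAL-related works and construct comparative experiments across frequently used datasets and DAL algorithms. In addition, we explore some factors that influence the efficacy of DAL (e.g., batch size and the number of epochs in the training process), which provides a better reference for researchers designing their DAL experiments or carrying out DAL-related applications.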
Active learning enables efficient model training by leveraging interactions between machine learning agents and human annotators. We study and propose a novel framework that formulates batch active learning from the sparse approximation's perspective. Our active learning method aims to find an informative subset from the unlabeled data pool such that the corresponding training loss function approximates its full data pool counterpart. We realize the framework as sparsity-constrained discontinuous optimization problems, which explicitly balance uncertainty and representation for large-scale applications and could be solved by greedy or proximal iterative hard thresholding algorithms. The proposed method can adapt to various settings, including both Bayesian and non-Bayesian neural networks. Numerical experiments show that our work achieves competitive performance across different settings with lower computational complexity.
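Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled data and help reduce annotation cost in domains where labeling data can be prohibitive. Recently proposed neural network based AL methods use different heuristics to accomplish this goal. In this study, we demonstrate that under identical experimental settings, different types of AL algorithms (uncertainty-based, diversity-based, and committee-based) produce inconsistent gains over a random sampling baseline. Through a variety of experiments that control for sources of stochasticity, we show that the variance in performance metrics achieved by AL algorithms can lead to results that are not consistent with previously reported results. We also find that under strong regularization, AL methods show marginal or no advantage over the random sampling baseline under a variety of experimental conditions. Finally, we conclude with a set of recommendations on how to assess the results of a new AL algorithm to ensure that they are reproducible and robust under changes in experimental conditions. We share our code to facilitate AL evaluations. We believe our findings and recommendations will help advance reproducible research in AL using neural networks. We open-source our code at https://github.com/prateekmunjal/torchal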
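Active learning enables the efficient construction of a labeled dataset by labeling informative samples from an unlabeled dataset. In a real-world active learning scenario, considering the diversity of the selected samples is crucial because many redundant or highly similar samples exist. The core-set approach is a promising diversity-based method that selects diverse samples based on the distance between samples. However, it performs poorly compared with uncertainty-based approaches, which select the most difficult samples, those on which neural models show low confidence. In this work, we analyze the feature space through the lens of density and, interestingly, observe that locally sparse regions tend to contain more informative samples than dense regions. Motivated by our analysis, we equip the core-set approach with density awareness and propose a density-aware core-set (DACS). The strategy is to estimate the density of the unlabeled samples and select diverse samples mainly from sparse regions. To reduce the computational bottleneck of density estimation, we also introduce a new density approximation based on locality-sensitive hashing. Experimental results clearly demonstrate the efficacy of DACS in both classification and regression tasks, and specifically show that DACS can produce state-of-the-art performance in a practical scenario. Since DACS depends only weakly on the neural architecture, we present a simple yet effective combination method to show that existing methods can be beneficially combined with DACS.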
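The distance-based selection that the core-set approach relies on is commonly implemented as a greedy k-center procedure, sketched below. This is only the plain, density-unaware baseline, not DACS itself: the density estimation and the locality-sensitive hashing step are omitted, and the function name and the use of Euclidean distance on raw feature tuples are assumptions for illustration.

```python
import math

def k_center_greedy(points, labeled_idx, budget):
    # points: feature vectors of the whole pool; labeled_idx: non-empty list
    # of indices that are already labeled and act as initial centers.
    min_dist = [min(math.dist(p, points[c]) for c in labeled_idx)
                for p in points]
    selected = []
    for _ in range(budget):
        # Pick the point farthest from all current centers.
        far = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(far)
        # The new center may shrink the distance of the remaining points.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[far]))
    return selected
```

A density-aware variant would additionally weight or filter candidates by an estimate of local density so that sparse regions are preferred.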
As an important data selection schema, active learning emerges as an essential component when iterating an Artificial Intelligence (AI) model. It becomes even more critical given the dominance in applications of deep neural network based models, which have a large number of parameters and are data-hungry. Despite its indispensable role in developing AI models, research on active learning is not as intensive as other research directions. In this paper, we present a review of active learning through deep active learning approaches from the following perspectives: 1) technical advancements in active learning, 2) applications of active learning in computer vision, 3) industrial systems leveraging or with potential to leverage active learning for data iteration, 4) current limitations and future research directions. We expect this paper to clarify the significance of active learning in a modern AI model manufacturing process and to bring additional research attention to active learning. By addressing data automation challenges and coping with automated machine learning systems, active learning will facilitate democratization of AI technologies by boosting model production at scale.
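Labeling a large set of data is expensive. Active learning aims to tackle this problem by asking for annotation of only the most informative data in the unlabeled set. We propose a novel active learning approach that uses self-supervised pretext tasks and a unique data sampler to select data that are both difficult and representative. We find that the loss of a simple self-supervised pretext task, such as rotation prediction, is closely correlated with the downstream task loss. Before the active learning iterations, a pretext task learner is trained on the unlabeled set, and the unlabeled data are sorted and grouped into batches by their pretext task losses. In each active learning iteration, the main task model is used to sample the most uncertain data in a batch for annotation. We evaluate our method on various image classification and segmentation benchmarks and achieve compelling performance on CIFAR10, Caltech-101, ImageNet, and Cityscapes. We further show that our method performs well on imbalanced datasets and can be an effective solution to the cold-start problem, in which active learning performance is affected by the randomly sampled initial labeled set.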
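The sort-then-sample procedure described in this abstract can be sketched as follows. The function names are hypothetical, the pretext losses and uncertainty scores are passed in as precomputed lists, and sorting from highest to lowest pretext loss is an assumption of this sketch rather than something the abstract specifies.

```python
def split_by_pretext_loss(pretext_losses, num_batches):
    # Sort sample indices by pretext-task loss (assumed: hardest first) and
    # cut the ordering into at most num_batches contiguous batches.
    order = sorted(range(len(pretext_losses)),
                   key=lambda i: pretext_losses[i], reverse=True)
    size = -(-len(order) // num_batches)  # ceiling division
    return [order[s:s + size] for s in range(0, len(order), size)]

def sample_from_batch(batch, uncertainty, k):
    # Within one batch, query the k samples on which the main task model
    # is most uncertain.
    return sorted(batch, key=lambda i: uncertainty[i], reverse=True)[:k]
```

In use, active learning iteration t would call `sample_from_batch` on the t-th batch, with `uncertainty` recomputed from the current main task model.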
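Unsupervised domain adaptation has recently emerged as an effective paradigm for generalizing deep neural networks to new target domains. However, there is still considerable room before it reaches fully supervised performance. In this paper, we present a novel active learning strategy to assist knowledge transfer in the target domain, termed active domain adaptation. We start from the observation that energy-based models exhibit free-energy biases when the training (source) and test (target) data come from different distributions. Inspired by this inherent mechanism, we empirically show that a simple yet efficient energy-based sampling strategy selects more valuable target samples than existing approaches that require particular architectures or distance computations. Our algorithm, Energy-based Active Domain Adaptation (EADA), queries groups of target data that combine domain characteristics and instance uncertainty in every selection round. Meanwhile, by using a regularization term to align the free energy of the target data compactly around the source domain, the domain gap can be implicitly reduced. Through extensive experiments, we show that EADA surpasses state-of-the-art methods on well-known challenging benchmarks with substantial improvements, making it a useful option in the open world. Code is available at https://github.com/bit-da/eada.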
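To make the energy-based sampling idea concrete, the sketch below uses the standard free energy of a classifier viewed as an energy-based model, F(x) = -log Σ_k exp(z_k), where z are the logits. The two-stage rule (shortlist by high free energy, then keep the samples with the smallest gap between the top two logits) is an assumed illustration of combining a domain signal with instance uncertainty; the function names are hypothetical and this is not taken from the EADA code.

```python
import math

def free_energy(logits):
    # F(x) = -logsumexp(logits), computed stably by subtracting the max.
    m = max(logits)
    return -(m + math.log(sum(math.exp(z - m) for z in logits)))

def top2_margin(logits):
    # Gap between the two largest logits; a small gap means high uncertainty.
    a, b = sorted(logits, reverse=True)[:2]
    return a - b

def select_targets(pool_logits, num_candidates, k):
    # Stage 1: shortlist the samples with the highest free energy.
    by_energy = sorted(range(len(pool_logits)),
                       key=lambda i: free_energy(pool_logits[i]), reverse=True)
    candidates = by_energy[:num_candidates]
    # Stage 2: among them, query the k samples with the smallest margin.
    return sorted(candidates, key=lambda i: top2_margin(pool_logits[i]))[:k]
```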
The performance of deep neural networks improves with more annotated data. The problem is that the budget for annotation is limited. One solution to this is active learning, where a model asks a human to annotate data that it perceives as uncertain. A variety of recent methods have been proposed to apply active learning to deep networks, but most of them are either designed specifically for their target tasks or computationally inefficient for large networks. In this paper, we propose a novel active learning method that is simple but task-agnostic, and works efficiently with deep networks. We attach a small parametric module, named the "loss prediction module," to a target network, and train it to predict the target losses of unlabeled inputs. This module can then suggest data for which the target model is likely to produce a wrong prediction. The method is task-agnostic because networks are learned from a single loss regardless of target tasks. We rigorously validate our method through image classification, object detection, and human pose estimation, with recent network architectures. The results demonstrate that our method consistently outperforms previous methods across these tasks.
The generalisation performance of a convolutional neural network (CNN) is largely determined by the quantity, quality, and diversity of its training images. All training data must be annotated beforehand, yet in many real-world applications data is easy to acquire but expensive and time-consuming to label. The goal of active learning is to draw the most informative samples from the unlabeled pool, which can then be used for training after annotation. With an entirely different objective, self-supervised learning (SSL) has been gaining rapid popularity by closing the performance gap with supervised methods on large computer vision benchmarks. Recent SSL methods have been shown to produce low-level representations that are invariant to distortions of the input sample and can encode invariance to artificially created distortions, e.g. rotation, solarization, and cropping. SSL approaches also rely on simpler and more scalable frameworks for learning. In this paper, we unify these two families of approaches from the angle of active learning using the self-supervised learning manifold and propose Deep Active Learning using Barlow Twins (DALBT), an active learning method applicable to all datasets that combines a classifier trained together with the self-supervised loss framework of Barlow Twins, in a setting where the model can encode invariance to artificially created distortions, e.g. rotation, solarization, and cropping.
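We propose a new method for approximating active learning acquisition strategies that are based on retraining with hypothetically labeled candidate data points. Although this is usually infeasible with deep networks, we use the neural tangent kernel to approximate the result of retraining, and prove that this approximation holds asymptotically even in an active learning setup, approximating "look-ahead" selection criteria with far less computation. This also enables us to conduct sequential active learning, i.e., updating the model in a streaming regime, without needing to retrain the model with SGD after adding each new data point. Moreover, our querying strategy, which better captures how the model's predictions will change when new data points are added than the standard ("myopic") criteria do, beats other look-ahead strategies by large margins and achieves equal or better performance compared with state-of-the-art methods on several benchmark datasets in pool-based active learning.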
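Active learning (AL) algorithms aim to identify an optimal subset of data for annotation, such that deep neural networks (DNNs) achieve better performance when trained on this labeled subset. AL is especially impactful in industrial-scale settings, where data labeling costs are high and practitioners use every tool at their disposal to improve model performance. The recent success of self-supervised pretraining (SSP) highlights the importance of harnessing abundant unlabeled data to boost model performance. By combining AL with SSP, we can make use of unlabeled data while simultaneously labeling and training on particularly informative samples. In this work, we study a combination of AL and SSP on ImageNet. We find that performance on small toy datasets (the typical benchmark setting in the literature) is not representative of performance on ImageNet, owing to the class-imbalanced samples selected by an active learner. Among the existing baselines we test, popular AL algorithms fail to outperform random sampling across a variety of small- and large-scale settings. To remedy the class-imbalance problem, we propose Balanced Selection (BASE), a simple, scalable AL algorithm that consistently outperforms random sampling by selecting more balanced samples for annotation than existing methods. Our code is available at: https://github.com/zeyademam/active_learning.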
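Annotating enough data to satisfy sophisticated learning models can be cost-prohibitive for many real-world applications. Active learning (AL) and semi-supervised learning (SSL) are two effective, but often isolated, means of alleviating the data-hungry problem. Some recent studies have explored the potential of combining AL and SSL to better probe the unlabeled data. However, almost all of these contemporary SSL-AL works use a simple combination strategy, ignoring the inherent relation between SSL and AL. Furthermore, other methods suffer from high computational costs when dealing with large-scale, high-dimensional datasets. Motivated by the industry practice of labeling data, we propose an innovative inconsistency-based virtual adversarial active learning (IDEAL) algorithm to further investigate the potential superiority of SSL-AL and achieve mutual enhancement of AL and SSL: SSL propagates label information to unlabeled samples and provides smoothed embeddings for AL, while AL excludes samples with inconsistent predictions and considerable uncertainty for SSL. We estimate the inconsistency of unlabeled samples through augmentation strategies of different granularities, including fine-grained continuous perturbation exploration and coarse-grained data transformations. Extensive experiments in both the text and image domains validate the effectiveness of the proposed algorithm and compare it against state-of-the-art baselines. Two real-world case studies visualize the practical industrial value of applying and deploying the proposed data sampling algorithm.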
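Acquiring the most representative examples via active learning (AL) can benefit many data-dependent computer vision tasks by minimizing the effort of image-level or pixel-wise annotation. In this paper, we propose a novel Collaborative Panoptic-Regional Active Learning framework (CPRAL) to address the semantic segmentation task. For a small batch of images initially sampled with pixel-wise annotations, we employ panoptic information to make an initial selection of unlabeled samples. Considering the class imbalance in segmentation datasets, we introduce a Regional Gaussian Attention module (RGA) to achieve semantics-biased selection. The subset is highlighted by vote entropy and then attended by Gaussian kernels to maximize the biased regions. We also propose a Contextual Labels Extension (CLE) to boost regional annotations with contextual attention guidance. Through the collaboration of semantics-agnostic panoptic matching with region-biased selection and extension, CPRAL can strike a balance between labeling effort and performance and compromise the semantics distribution. We perform extensive experiments on the Cityscapes and BDD10K datasets and show that CPRAL outperforms cutting-edge methods with impressive results and a smaller labeling proportion.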
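Several recent studies have shown that using extra in-distribution data can lead to a high level of adversarial robustness. However, there is no guarantee that sufficient extra data can always be obtained for a selected dataset. In this paper, we propose a biased multi-domain adversarial training (BiaMAT) method that induces training data amplification using publicly available auxiliary datasets, without requiring the class distributions of the primary and auxiliary datasets to match. The proposed method can improve adversarial robustness on a primary dataset by leveraging auxiliary datasets via multi-domain learning. Specifically, data amplification on both robust and non-robust features can be accomplished by applying BiaMAT, as demonstrated through theoretical and empirical analysis. Moreover, we demonstrate that while existing methods are vulnerable to negative transfer because of the distributional discrepancy between auxiliary and primary data, the proposed method enables neural networks to flexibly leverage diverse image datasets for adversarial training by handling the domain discrepancy with a confidence-based selection strategy. The pre-trained models and code are available at: https://github.com/saehyung-lee/biamat.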
Deep neural networks may easily memorize noisy labels present in real-world data, which degrades their ability to generalize. It is therefore important to track and evaluate the robustness of models against noisy label memorization. We propose a metric, called susceptibility, to gauge such memorization for neural networks. Susceptibility is simple and easy to compute during training. Moreover, it does not require access to ground-truth labels and it only uses unlabeled data. We empirically show the effectiveness of our metric in tracking memorization on various architectures and datasets and provide theoretical insights into the design of the susceptibility metric. Finally, we show through extensive experiments on datasets with synthetic and real-world label noise that one can utilize susceptibility and the overall training accuracy to distinguish models that maintain a low memorization on the training set and generalize well to unseen clean data.
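Deep neural networks usually perform poorly when the training dataset suffers from extreme class imbalance. Recent studies found that directly training with out-of-distribution data (i.e., open-set samples) in a semi-supervised manner harms generalization performance. In this work, we show theoretically that, from a Bayesian perspective, out-of-distribution data can still be leveraged to augment the minority classes. Based on this motivation, we propose a novel method called Open-sampling, which uses open-set noisy labels to re-balance the class priors of the training dataset. For each open-set instance, the label is sampled from a pre-defined distribution that is complementary to the distribution of the original class priors. We show empirically that Open-sampling not only re-balances the class priors but also encourages the neural network to learn separable representations. Extensive experiments demonstrate that our proposed method significantly outperforms existing data re-balancing methods and can boost the performance of existing state-of-the-art methods.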
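The label-sampling step can be illustrated with a small sketch. The abstract does not give the exact form of the complementary distribution, so the formula used here, (1 - prior_j) / (K - 1), is only one simple assumed choice that assigns more mass to rarer classes and sums to one; the function names and the fixed seed are likewise hypothetical.

```python
import random

def complementary_distribution(class_priors):
    # Assumed form: probability of class j is proportional to (1 - prior_j).
    # Dividing by (K - 1) normalizes it when the priors sum to one.
    k = len(class_priors)
    return [(1.0 - p) / (k - 1) for p in class_priors]

def sample_open_set_labels(class_priors, num_instances, seed=0):
    # Draw one noisy label per open-set instance from the complementary
    # distribution, so minority classes receive more open-set samples.
    rng = random.Random(seed)
    probs = complementary_distribution(class_priors)
    return rng.choices(range(len(probs)), weights=probs, k=num_instances)
```

With priors `[0.7, 0.2, 0.1]`, for example, this choice yields sampling probabilities of roughly 0.15, 0.40, and 0.45, so the rarest class is sampled most often.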