We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. Performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to violations of the "cluster assumption". Finally, we illustrate that the method can also be far superior to manifold learning in high-dimensional spaces.
translated by 谷歌翻译
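As a rough illustration of the minimum entropy regularization idea in the first abstract above, the sketch below adds the mean prediction entropy on unlabeled points to a labeled negative log-likelihood. The function names, the softmax classifier, and the weight `lam` are assumptions made for this example; the abstract does not specify an implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def entropy_regularized_loss(logits_labeled, y_labeled, logits_unlabeled,
                             lam=0.5, eps=1e-12):
    """Labeled negative log-likelihood plus lam * mean entropy on unlabeled data.

    lam (hypothetical name) weights the unlabeled term; lam = 0 recovers
    plain supervised learning.
    """
    p_lab = softmax(logits_labeled)
    rows = np.arange(len(y_labeled))
    nll = -np.mean(np.log(p_lab[rows, y_labeled] + eps))
    p_unl = softmax(logits_unlabeled)
    entropy = -np.mean(np.sum(p_unl * np.log(p_unl + eps), axis=1))
    return nll + lam * entropy
```

Minimizing the second term pushes the classifier toward confident predictions on unlabeled points, which is one way to encode the cluster assumption mentioned in the abstract.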
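Semi-supervised learning (SSL) approaches in machine learning have received increasing attention for forming a classifier when the training data consist of a limited number of classified observations but a much larger number of unclassified observations. This is because procuring classified data can be quite costly, owing to high acquisition costs and the subsequent financial, time, and ethical issues that can arise in attempting to provide the true class labels for the unclassified data that have already been acquired. We provide here a review of statistical SSL approaches to this problem, focusing on the recent result that a classifier formed from a partially classified sample can actually have a smaller expected error rate than if the sample were completely classified.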
Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with it. However, the field lacks a comprehensive survey of the different types of label noise, their consequences, and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is provided to help practitioners choose the most suitable technique for their own field of application. Finally, the design of experiments is also discussed, which may interest researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information, such as confidences on labels, is assumed to be available.
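Semi-supervised learning is the problem of training an accurate predictive model by combining a small labeled dataset with a presumably much larger unlabeled dataset. Many methods for semi-supervised deep learning have been developed, including pseudolabeling, consistency regularization, and contrastive learning techniques. Pseudolabeling methods, however, are highly susceptible to confounding, in which erroneous pseudolabels are assumed to be true labels in early iterations, causing the model to reinforce its prior biases and thereby fail to generalize to strong predictive performance. We present a new approach to suppress confounding errors through a method we describe as Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE). Like basic pseudolabeling, SCOPE is related to Expectation Maximization (EM), a latent variable framework that can be extended toward understanding cluster-assumption deep semi-supervised algorithms. However, unlike basic pseudolabeling, which fails to adequately take into account the probability of the unlabeled samples given the model, SCOPE introduces an outlier suppression term designed to improve the behavior of EM iterations with a discriminative DNN backbone in the presence of outliers. Our results show that SCOPE greatly improves semi-supervised classification accuracy over a baseline and, when combined with consistency regularization, achieves the highest reported accuracy for the semi-supervised CIFAR-10 classification task using 250 and 4000 labeled samples. Moreover, we show that SCOPE reduces the prevalence of confounding errors during pseudolabeling iterations by pruning erroneous high-confidence pseudolabeled samples that would otherwise contaminate the labeled set in subsequent retraining iterations.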
We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods, including support vector machines and regularized least squares, can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide a theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches), we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally, we briefly discuss unsupervised and fully supervised learning within our general framework.
Learning curves provide insight into the dependence of a learner's generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified.
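To make the "power law" shape mentioned in the abstract above concrete, here is a minimal sketch that fits err(n) ≈ a · n^(-b) by ordinary least squares in log-log space. The function names and the choice of a two-parameter power law without an offset term are assumptions for illustration; the review itself covers several other parametric forms.

```python
import numpy as np

def fit_power_law(train_sizes, errors):
    """Fit err(n) ~ a * n**(-b) via a straight-line fit in log-log space.

    Returns (a, b). Assumes all sizes and errors are strictly positive.
    """
    slope, intercept = np.polyfit(np.log(train_sizes), np.log(errors), deg=1)
    return float(np.exp(intercept)), float(-slope)

def predict_error(n, a, b):
    # Extrapolate the fitted curve to a new training set size n.
    return a * np.asarray(n, dtype=float) ** (-b)
```

Such a fit can be used to estimate the effect of collecting more training data, with the caveat stressed in the abstract that real learning curves are not always this well-behaved.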
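We investigate the generalization properties of a self-training algorithm with halfspaces. The approach iteratively learns a list of halfspaces from labeled and unlabeled training data, where each iteration consists of two steps: exploration and pruning. In the exploration phase, a halfspace is found sequentially by maximizing the unsigned margin among unlabeled examples, and pseudo-labels are then assigned to those examples whose distance is higher than the current threshold. The pseudo-labeled examples are then added to the training set, and a new classifier is learned. This process is repeated until no unlabeled examples remain for pseudo-labeling. In the pruning phase, pseudo-labeled samples whose distance is greater than the associated unsigned margin are discarded. We prove that the misclassification error of the resulting sequence of classifiers is bounded, and show that the resulting semi-supervised approach never degrades performance compared to the classifier learned using only the initial labeled training set. Experiments carried out on a variety of benchmarks demonstrate the efficiency of the proposed approach compared to state-of-the-art methods.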
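The exploration step of halfspace self-training described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the halfspace is fitted by least squares (a stand-in learner), the threshold is fixed instead of being chosen per iteration, and the pruning phase is omitted. All function and parameter names are hypothetical.

```python
import numpy as np

def fit_halfspace(X, y):
    # Least-squares fit of a weight vector w; labels y are in {-1, +1}.
    # Stand-in learner chosen for brevity (an assumption of this sketch).
    w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)
    return w

def self_train(X_lab, y_lab, X_unl, threshold=0.5, max_iter=20):
    """Pseudo-label unlabeled points whose unsigned margin exceeds threshold."""
    X_train, y_train = X_lab.copy(), y_lab.copy()
    remaining = X_unl.copy()
    w = fit_halfspace(X_train, y_train)
    for _ in range(max_iter):
        if len(remaining) == 0:
            break
        # Signed distance of each remaining point to the hyperplane w.x = 0.
        scores = remaining @ w / np.linalg.norm(w)
        confident = np.abs(scores) > threshold
        if not confident.any():
            break
        # Move confident points into the training set with their pseudo-labels.
        X_train = np.vstack([X_train, remaining[confident]])
        pseudo = np.sign(scores[confident]).astype(int)
        y_train = np.concatenate([y_train, pseudo])
        remaining = remaining[~confident]
        w = fit_halfspace(X_train, y_train)
    return w, len(X_train) - len(X_lab)
```

The function returns the final weight vector and the number of pseudo-labeled points that were added; points that never clear the threshold are left unlabeled.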
The notion of uncertainty is of major importance in machine learning and constitutes a key element of machine learning methodology. In line with the statistical tradition, uncertainty has long been perceived as almost synonymous with standard probability and probabilistic predictions. Yet, due to the steadily increasing relevance of machine learning for practical applications and related issues such as safety requirements, new problems and challenges have recently been identified by machine learning scholars, and these problems may call for new methodological developments. In particular, this includes the importance of distinguishing between (at least) two different types of uncertainty, often referred to as aleatoric and epistemic. In this paper, we provide an introduction to the topic of uncertainty in machine learning as well as an overview of attempts so far at handling uncertainty in general and formalizing this distinction in particular.
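One common way to formalize the aleatoric/epistemic distinction mentioned above is the entropy decomposition over an ensemble of predictive distributions: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty is the average entropy of the individual predictions, and epistemic uncertainty is their difference. The sketch below assumes this particular decomposition and a hypothetical function name; the abstract does not commit to any single formalization.

```python
import numpy as np

def decompose_uncertainty(member_probs, eps=1e-12):
    """Split predictive uncertainty for one input into three parts.

    member_probs: array of shape (n_members, n_classes), one predictive
    distribution per ensemble member (or posterior sample).
    Returns (total, aleatoric, epistemic) in nats.
    """
    mean_p = member_probs.mean(axis=0)
    # Entropy of the averaged prediction.
    total = -np.sum(mean_p * np.log(mean_p + eps))
    # Average entropy of the individual predictions.
    aleatoric = -np.mean(np.sum(member_probs * np.log(member_probs + eps), axis=1))
    # Disagreement between members (a mutual information estimate).
    epistemic = total - aleatoric
    return total, aleatoric, epistemic
```

When all members agree on a uniform prediction the uncertainty is entirely aleatoric, whereas confident but conflicting members yield mostly epistemic uncertainty.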
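Social scientists often classify text documents in order to use the resulting labels as an outcome or a predictor in empirical research. Automated text classification has become a standard tool because it requires less human coding. However, scholars still need many human-labeled documents to train automated classifiers. To reduce labeling costs, we propose a new algorithm for text classification that combines a probabilistic model with active learning. The probabilistic model uses both labeled and unlabeled data, and active learning concentrates labeling effort on documents that are difficult to classify. Our validation study shows that the classification performance of our algorithm is comparable to that of state-of-the-art methods at a fraction of the computational cost. Moreover, we replicate two recently published articles and reach the same substantive conclusions with only a small proportion of the original labeled data used in those studies. We provide ActiveText, open-source software that implements our method.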
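Cluster analysis requires many decisions: the clustering method and the implied reference model, the number of clusters and, often, several hyperparameters and algorithmic tunings. In practice, one produces several partitions, and a final one is chosen based on validation or selection criteria. There exists an abundance of validation methods that, implicitly or explicitly, assume a certain clustering notion. Moreover, they are often restricted to operating on partitions obtained from a specific method. In this paper, we focus on groups that can be separated by quadratic or linear boundaries. The reference cluster concept is defined through the quadratic discriminant score function and parameters describing the clusters' size, center, and scatter. We develop two cluster-quality criteria called quadratic scores. We show that these criteria are consistent with groups generated from a general class of elliptically symmetric distributions. The quest for this type of group is common in applications. The connection with likelihood theory for mixture models and model-based clustering is investigated. Based on bootstrap resampling of the quadratic scores, we propose a selection rule that allows choosing among many clustering solutions. The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods. Extensive numerical experiments and the analysis of real data show that, even if some competing methods turn out to be superior in some settings, the proposed methodology achieves better overall performance.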
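In recent decades, science and engineering have been revolutionized by a momentous growth in the amount of available data. However, although it is now unprecedentedly easy to collect and store data, labeling data by supplementing each feature with a label remains challenging. Illustrative tasks in which the labeling process requires expert knowledge or is tedious and time-consuming include labeling X-rays with a diagnosis, protein sequences with a protein type, texts by their topic, tweets by their sentiment, or videos by their genre. In these and many other examples, only a few features can be labeled manually because of cost and time constraints. How can we best propagate label information from a small number of expensive labeled features to a vast number of unlabeled ones? This is the question addressed by semi-supervised learning (SSL). This article overviews recent foundational developments in graph-based Bayesian SSL, a probabilistic framework for label propagation that uses similarities between features. SSL is an active research area, and a thorough review of the existing literature is beyond the scope of this article. Our focus will be on topics drawn from our own research that illustrate the wide range of mathematical tools and ideas underlying the rigorous study of the statistical accuracy and computational efficiency of graph-based Bayesian SSL.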
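As a rough illustration of propagating labels through a similarity graph, the sketch below computes the classical harmonic-function solution on a Gaussian-weighted graph. Note that this is a swapped-in, non-Bayesian point estimate used only to show the mechanics; the Gaussian kernel, the `bandwidth` parameter, and the function name are assumptions of this example and are not taken from the article.

```python
import numpy as np

def harmonic_label_propagation(X, labeled_idx, y_labeled, bandwidth=1.0):
    """Propagate real-valued labels from labeled to unlabeled nodes.

    Builds a fully connected graph with Gaussian similarity weights and
    solves L_uu f_u = W_ul f_l, where L = D - W is the graph Laplacian.
    """
    n = len(X)
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    L = np.diag(W.sum(axis=1)) - W
    labeled_idx = np.asarray(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    y_l = np.asarray(y_labeled, dtype=float)
    f = np.empty(n)
    f[labeled_idx] = y_l
    f[unlabeled_idx] = np.linalg.solve(L_uu, W_ul @ y_l)
    return f
```

With labels coded as +1 and -1, the sign of each entry of the returned vector gives the predicted class of the corresponding node, and each unlabeled value is a weighted average of its neighbors' values.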
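Classification models are a fundamental component of physical-asset management technologies such as structural health monitoring (SHM) systems and digital twins. Previous work introduced risk-based active learning, an online approach for developing statistical classifiers that takes into account the decision-support context in which they are applied. Decision-making is considered by preferentially querying data labels according to the expected value of perfect information (EVPI). Although several benefits are gained by adopting a risk-based active learning approach, including improved decision-making performance, the algorithms suffer from issues relating to sampling bias as a result of the guided querying process. This sampling bias ultimately manifests as a decline in decision-making performance during the later stages of active learning, which in turn corresponds to lost resource/utility. The current paper proposes two novel approaches to counteract the effects of sampling bias: semi-supervised learning and discriminative classification models. These approaches are first demonstrated on a synthetic dataset and subsequently applied to an experimental case study, specifically the Z24 Bridge dataset. The semi-supervised learning approach is shown to have variable performance, with robustness to sampling bias depending on how suitable the generative distributions selected for the model are for each dataset. In contrast, the discriminative classifiers are shown to have excellent robustness to the effects of sampling bias. Moreover, it was found that the number of inspections made during a monitoring campaign can be reduced by carefully selecting the statistical classifiers used within a decision-supporting monitoring system.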
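Modern deep learning methods constitute incredibly powerful tools for tackling a myriad of challenging problems. However, since deep learning methods operate as black boxes, the uncertainty associated with their predictions is often challenging to quantify. Bayesian statistics offers a formalism for understanding and quantifying the uncertainty associated with deep neural network predictions. This tutorial provides an overview of the relevant literature and a complete toolset to design, implement, train, use, and evaluate Bayesian neural networks, i.e., stochastic artificial neural networks trained using Bayesian methods.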
Semi-supervised learning (SSL) provides a powerful framework for leveraging unlabeled data when labels are limited or expensive to obtain. SSL algorithms based on deep neural networks have recently proven successful on standard benchmark tasks. However, we argue that these benchmarks fail to address many issues that SSL algorithms would face in real-world applications. After creating a unified reimplementation of various widely used SSL techniques, we test them in a suite of experiments designed to address these issues. We find that the performance of simple baselines which do not use unlabeled data is often underreported, that SSL methods differ in sensitivity to the amount of labeled and unlabeled data, and that performance can degrade substantially when the unlabeled dataset contains out-of-distribution examples. To help guide SSL research towards real-world applicability, we make our unified reimplementation and evaluation platform publicly available at https://github.com/brain-research/realistic-ssl-evaluation. (32nd Conference on Neural Information Processing Systems, NeurIPS 2018.)
We consider semi-supervised binary classification for applications in which data points are naturally grouped (e.g., survey responses grouped by state) and the labeled data is biased (e.g., survey respondents are not representative of the population). The groups overlap in the feature space and consequently the input-output patterns are related across the groups. To model the inherent structure in such data, we assume the partition-projected class-conditional invariance across groups, defined in terms of the group-agnostic feature space. We demonstrate that under this assumption, the group carries additional information about the class, over the group-agnostic features, with provably improved area under the ROC curve. Further assuming invariance of partition-projected class-conditional distributions across both labeled and unlabeled data, we derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-calibrated classifier, despite the bias in the labeled data. Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.
Labeling a module defective or non-defective is an expensive task. Hence, there are often limits on how much labeled data is available for training. Semi-supervised classifiers use far fewer labels for training models, but there are numerous semi-supervised methods, including self-labeling, co-training, maximal-margin, and graph-based methods, to name a few. Only a handful of these methods have been tested in software engineering (SE) for, e.g., predicting defects, and even then, those tests have been on just a handful of projects. This paper takes a wide range of 55 semi-supervised learners and applies these to over 714 projects. We find that semi-supervised "co-training methods" work significantly better than other approaches. However, co-training needs to be used with caution, since the specific co-training method needs to be carefully selected based on a user's specific goals. Also, we warn that a commonly used co-training method ("multi-view", where different learners get different sets of columns) does not improve predictions (while adding too much to the run-time cost: 11 hours vs. 1.8 hours). Those cautions stated, we find that, using these "co-trainers," we can label just 2.5% of the data and then make predictions that are competitive with those using 100% of the data. It is an open question, worthy of future work, whether these reductions can be seen in other areas of software analytics. All the code used and datasets analyzed during the current study are available at https://GitHub.com/Suvodeep90/Semi_Supervised_Methods.
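It is well known that Gaussian processes (GPs) can provide accurate predictions and uncertainty estimates by capturing the similarity between data points through their kernel function. However, traditional GP kernels are not very effective at capturing similarity between high-dimensional data points. Neural networks can be used to learn good representations that encode intricate structures in high-dimensional data, and these representations can be used as inputs to the GP kernel. However, the huge data requirement of neural networks makes this approach ineffective in small-data settings. To solve the conflicting problems of representation learning and data efficiency, we propose to learn deep kernels on probabilistic embeddings by using a probabilistic neural network. Our approach maps high-dimensional data to a probability distribution in a low-dimensional subspace and then computes a kernel between these distributions to capture similarity. To enable end-to-end learning, we derive a functional gradient descent procedure for training the model. Experiments on a variety of datasets show that our approach outperforms the state of the art in GP kernel learning in both supervised and semi-supervised settings. We also extend our approach to other small-data paradigms such as few-shot classification, where it outperforms previous approaches on the mini-ImageNet and CUB datasets.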
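Semi-supervised learning (SSL) is a common approach to learning predictive models using not only labeled examples but also unlabeled examples. While SSL for the simple tasks of classification and regression has received a lot of attention from the research community, it has not been properly investigated for complex prediction tasks with structurally dependent variables. This is the case for multi-label classification and hierarchical multi-label classification tasks, which may require additional information, possibly coming from the underlying distribution in the descriptive space provided by unlabeled examples, to better face the challenging task of simultaneously predicting multiple class labels. In this paper, we investigate this aspect and propose a (hierarchical) multi-label classification method based on semi-supervised learning of predictive clustering trees. We also extend the method towards ensemble learning and propose a method based on the random forest approach. An extensive experimental evaluation conducted on 23 datasets shows significant advantages of the proposed method and its extension over their supervised counterparts. Moreover, the method preserves interpretability and reduces the time complexity of classical tree-based models.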
The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. Although specific domain knowledge can be used to help design representations, learning with generic priors can also be used, and the quest for AI is motivating the design of more powerful representation-learning algorithms implementing such priors. This paper reviews recent work in the area of unsupervised feature learning and deep learning, covering advances in probabilistic models, auto-encoders, manifold learning, and deep networks. This motivates longer-term unanswered questions about the appropriate objectives for learning good representations, for computing representations (i.e., inference), and the geometrical connections between representation learning, density estimation and manifold learning.
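All well-known machine learning algorithms, in both supervised and semi-supervised learning, work well only under a common assumption: the training and test data follow the same distribution. When the distribution changes, most statistical models must be rebuilt from newly collected data, which for some applications can be costly or impossible to obtain. It has therefore become necessary to develop approaches that reduce the need and the effort to obtain new labeled samples by exploiting data that are available in related areas and using them further in similar fields. This has given rise to a new machine learning framework known as transfer learning: a learning setting inspired by the human ability to extrapolate knowledge across tasks in order to learn more efficiently. Despite the large number of different transfer learning scenarios, the main objective of this survey is to provide an overview of the state-of-the-art theoretical results in a specific, and arguably the most popular, sub-field of transfer learning called domain adaptation. In this sub-field, the data distribution is assumed to change between the training and test data, while the learning task remains the same. We provide a first up-to-date description of existing results related to the domain adaptation problem, covering learning bounds based on different statistical learning frameworks.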