Supervised learning needs a huge amount of labeled data, which can be a serious bottleneck when labeling cost is high or privacy is a concern. To overcome this problem, we propose a new weakly-supervised learning setting in which only similar (S) data pairs (two examples belonging to the same class) and unlabeled (U) data points are required instead of fully labeled data, which we call SU classification. We show that an unbiased estimator of the classification risk can be obtained from SU data alone, and that the estimation error of its empirical risk minimizer achieves the optimal parametric convergence rate. Finally, we demonstrate the effectiveness of the proposed method through experiments.
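As a rough sketch of how such an unbiased estimator can arise (notation ours, not necessarily the paper's): assume the class prior $\pi = p(y=+1)$ is known with $\pi \neq 1/2$, write $\pi_S = \pi^2 + (1-\pi)^2$, and assume each point of a similar pair follows $\tilde{p}_S(x) = \{\pi^2 p(x\mid y=+1) + (1-\pi)^2 p(x\mid y=-1)\}/\pi_S$, while U points follow $p(x) = \pi p(x\mid y=+1) + (1-\pi) p(x\mid y=-1)$. Solving these two equations for the class-conditional terms and substituting them into the risk of a margin loss $\ell$ gives
$$ R(g) = \frac{\pi_S}{2\pi - 1}\,\mathbb{E}_{\tilde{p}_S}\!\left[\ell(g(x)) - \ell(-g(x))\right] + \frac{1}{2\pi - 1}\,\mathbb{E}_{p}\!\left[\pi\,\ell(-g(x)) - (1-\pi)\,\ell(g(x))\right], $$
so replacing the two expectations with sample averages over S points and U points yields an unbiased risk estimator computable from SU data alone.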
This paper aims to provide a better understanding of symmetric losses. First, we show that using a symmetric loss is advantageous both for balanced error rate (BER) minimization and for maximization of the area under the receiver operating characteristic curve (AUC) from corrupted labels. Second, we prove general theoretical properties of symmetric losses, including a classification-calibration condition, excess risk bounds, conditional risk minimizers, and an AUC-consistency condition. Third, since all non-negative symmetric losses are non-convex, we propose a convex barrier hinge loss that benefits from the symmetric condition even though it is not symmetric everywhere. Finally, we conduct experiments on BER and AUC optimization from corrupted labels to validate the relevance of the symmetric condition.
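For concreteness (notation ours): a margin loss $\ell$ is called symmetric when $\ell(z) + \ell(-z) = K$ for some constant $K$; for example, the sigmoid loss $\ell(z) = 1/(1+e^{z})$ and the ramp loss satisfy this with $K = 1$, whereas the hinge and logistic losses do not. The balanced error rate referred to above is
$$ \mathrm{BER}(g) = \tfrac{1}{2}\Big( \Pr\big(g(X) \le 0 \mid Y = +1\big) + \Pr\big(g(X) > 0 \mid Y = -1\big) \Big), $$
i.e., the average of the false negative and false positive rates, which is insensitive to the class prior.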
Empirical risk minimization (ERM), with an appropriate loss function and regularization, is the common practice of supervised classification. In this paper, we study training arbitrary (from linear to deep) binary classifiers from only unlabeled (U) data via ERM. We prove that it is impossible to estimate the risk of an arbitrary binary classifier in an unbiased manner given a single set of U data, but it becomes possible given two sets of U data with different class priors. These two facts answer a fundamental question: what is the minimal supervision for training any binary classifier from only U data? Following these findings, we propose an ERM-based learning method from two sets of U data and then prove that it is consistent. Experiments demonstrate that the proposed method can train deep models and outperforms state-of-the-art methods for learning from two sets of U data.
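A minimal sketch of the second fact (our notation; the paper's exact estimator may differ): let the two U sets be drawn from $p(x) = \theta\,p_{+}(x) + (1-\theta)\,p_{-}(x)$ and $p'(x) = \theta'\,p_{+}(x) + (1-\theta')\,p_{-}(x)$ with known class priors $\theta \neq \theta'$, and let $\pi$ be the test class prior. Solving for the class conditionals and plugging them into the risk of a margin loss $\ell$ gives
$$ R(g) = \frac{1}{\theta-\theta'} \Big\{ \mathbb{E}_{p}\big[\pi(1-\theta')\,\ell(g(x)) - (1-\pi)\theta'\,\ell(-g(x))\big] + \mathbb{E}_{p'}\big[(1-\pi)\theta\,\ell(-g(x)) - \pi(1-\theta)\,\ell(g(x))\big] \Big\}, $$
which can be estimated without bias from the two unlabeled samples; with a single U set ($\theta = \theta'$) the system of equations is not identifiable, matching the impossibility part of the abstract.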
Can we learn a binary classifier from only positive data, without any negative data or unlabeled data? We show that if one can equip positive data with confidence (positive-confidence), one can successfully learn a binary classifier, which we name positive-confidence (Pconf) classification. Our work is related to one-class classification, which aims at "describing" the positive class by clustering-related methods, but one-class classification does not have the ability to tune hyper-parameters and its aim is not to "discriminate" positive and negative classes. For the Pconf classification problem, we provide a simple empirical risk minimization framework that is model-independent and optimization-independent. We theoretically establish the consistency and an estimation error bound, and demonstrate the usefulness of the proposed method for training deep neural networks through experiments.
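To make the ERM framework concrete (a sketch in our notation): write the confidence as $r(x) = p(y\!=\!+1 \mid x)$, the class prior as $\pi_{+} = p(y\!=\!+1)$, and let $\ell$ be a margin loss. The classification risk can then be rewritten using positive data only as
$$ R(g) = \pi_{+}\,\mathbb{E}_{p(x \mid y=+1)}\!\left[ \ell\big(g(x)\big) + \frac{1 - r(x)}{r(x)}\,\ell\big(-g(x)\big) \right], $$
so minimizing the empirical average of the bracketed term over positive-confidence data (the unknown factor $\pi_{+}$ does not affect the minimizer) is an instance of the Pconf idea, independent of the model and the optimizer.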
We consider a classification problem where both labeled and unlabeled data are available. We show that for linear classifiers defined by convex margin-based surrogate losses that are decreasing, it is impossible to construct any semi-supervised approach that is guaranteed to improve over the supervised classifier as measured by this surrogate loss on the labeled and unlabeled data. For convex margin-based loss functions that also increase, we prove that safe improvement is possible.
Many of the ordinal regression models that have been proposed in the literature can be seen as methods that minimize a convex surrogate of the zero-one, absolute, or squared loss functions. A key property that allows us to study the statistical implications of such approximations is that of Fisher consistency. Fisher consistency is a desirable property for surrogate loss functions and implies that in the population setting, i.e., if the probability distribution that generates the data were available, then optimization of the surrogate would yield the best possible model. In this paper we will characterize the Fisher consistency of a rich family of surrogate loss functions used in the context of ordinal regression, including support vector ordinal regression, ORBoosting and least absolute deviation. We will see that, for a family of surrogate loss functions that subsumes support vector ordinal regression and ORBoosting, consistency can be fully characterized by the derivative of a real-valued function at zero, as happens for convex margin-based surrogates in binary classification. We also derive excess risk bounds for a surrogate of the absolute error that generalize existing risk bounds for binary classification. Finally, our analysis suggests a novel surrogate of the squared error loss. We compare this novel surrogate with competing approaches on 9 different datasets. Our method proves to be highly competitive in practice, outperforming the least squares loss on 7 out of 9 datasets.
We study the problem of multiclass classification with rejection, where a classifier can choose not to make a prediction in order to avoid critical misclassification. We consider two approaches to this problem: a traditional one based on confidence scores and a more recent one based on simultaneous training of a classifier and a rejector. For the former, existing methods focus on a specific class of losses, and their empirical performance is not fully convincing. In this paper, we propose rejection criteria for confidence-based multiclass classification with rejection that can handle more general losses and guarantee calibration to the Bayes-optimal solution. The latter approach is relatively new and, to the best of our knowledge, is only available for the binary case. Our second contribution is to prove that calibration to the Bayes-optimal solution is almost impossible with this approach in the multiclass case. Finally, we conduct experiments to validate the relevance of our theoretical findings.
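As a reference point for the confidence-based approach (a classical result, stated in our notation): with 0-1 classification cost and rejection cost $c$, the Bayes-optimal rule is Chow's rule, which predicts the most probable class when its posterior is large enough and rejects otherwise:
$$ h^{*}(x) = \begin{cases} \arg\max_{k}\, \eta_{k}(x), & \max_{k}\, \eta_{k}(x) \ge 1 - c,\\ \text{reject}, & \text{otherwise}, \end{cases} \qquad \eta_{k}(x) = p(y = k \mid x). $$
Calibration of a confidence-based method thus amounts to recovering this thresholded-posterior behavior; rejection can only occur when $c < (K-1)/K$ for $K$ classes.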
Bottlenecks of binary classification from positive and unlabeled data (PU classification) are the requirements that given unlabeled patterns are drawn from the test marginal distribution and that the penalty for a false positive error is identical to that for a false negative error. However, such requirements are often not fulfilled in practice. In this paper, we generalize PU classification to the class prior shift and asymmetric error scenarios. Based on an analysis of the Bayes optimal classifier, we show that, given a test class prior, PU classification under class prior shift is equivalent to PU classification with asymmetric error. We then propose two different frameworks to handle these problems, namely a risk minimization framework and a density ratio estimation framework. Finally, we demonstrate the effectiveness of the proposed frameworks and compare the two through experiments on benchmark datasets.
Positive-unlabeled (PU) learning addresses the problem of learning a binary classifier from positive (P) and unlabeled (U) data. It is often applied to situations where negative (N) data are difficult to label exhaustively. However, in many practical cases, it can be much easier to collect a non-representative N set that contains only a small portion of all possible N data. This paper studies a novel classification framework that incorporates such biased N (bN) data into PU learning. The fact that the training N data are biased also makes our work very different from standard semi-supervised learning. We provide a method based on empirical risk minimization to address this PUbN classification problem. Our approach can be regarded as a variant of traditional example-reweighting algorithms, where the weight of each example is computed through a preliminary step that draws inspiration from PU learning. We also derive an estimation error bound for the proposed method. Experimental results demonstrate the effectiveness of our algorithm on several benchmark datasets, not only in PUbN learning scenarios but also in ordinary PU learning scenarios.
Distributionally robust supervised learning (DRSL) is necessary for building reliable machine learning systems. When machine learning is deployed in the real world, its performance can be significantly degraded because test data may follow a distribution different from that of the training data. DRSL with f-divergences explicitly considers the worst-case distribution shift by minimizing the adversarially reweighted training loss. In this paper, we analyze this DRSL, focusing on the classification scenario. Since DRSL is explicitly formulated for distribution shift scenarios, we naturally expect it to give a robust classifier that can aggressively handle shifted distributions. However, surprisingly, we prove that the DRSL ends up giving a classifier that exactly fits the given training distribution, which is too pessimistic. This pessimism comes from two sources: the particular losses used in classification and the fact that the variety of distributions to which DRSL tries to be robust is too wide. Motivated by our analysis, we propose a simple DRSL that overcomes this pessimism and empirically demonstrate its effectiveness.
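For reference, the DRSL objective analyzed here typically takes the following adversarial reweighting form (our paraphrase; $f$ is the convex function of the chosen f-divergence and $\delta$ the robustness radius):
$$ \min_{\theta} \;\max_{\substack{w \ge 0,\; \sum_i w_i = 1 \\ \frac{1}{n}\sum_i f(n w_i) \le \delta}} \;\sum_{i=1}^{n} w_i\, \ell\big(g_{\theta}(x_i), y_i\big), $$
i.e., the training examples are adversarially reweighted within an f-divergence ball around the empirical distribution; the finding above is that, for typical classification losses, the minimizer of this objective ends up fitting the given training distribution itself.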
From only positive (P) and unlabeled (U) data, a binary classifier could be trained with PU learning, in which the state of the art is unbiased PU learning. However, if its model is very flexible, empirical risks on training data will go negative, and we will suffer from serious overfitting. In this paper, we propose a non-negative risk estimator for PU learning: when getting minimized, it is more robust against overfitting, and thus we are able to use very flexible models (such as deep neural networks) given limited P data. Moreover, we analyze the bias, consistency, and mean-squared-error reduction of the proposed risk estimator, and bound the estimation error of the resulting empirical risk minimizer. Experiments demonstrate that our risk estimator fixes the overfitting problem of its unbiased counterparts.
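A minimal sketch of the clipped risk described above (illustrative only: the variable names, the sigmoid-loss default, and the plain clipping are our choices; details of the optimization procedure are omitted):

```python
import torch

def nn_pu_risk(scores_p, scores_u, prior, loss=lambda z: torch.sigmoid(-z)):
    """Non-negative PU risk for a batch of classifier outputs.

    scores_p: outputs g(x) on positive data, scores_u: outputs on unlabeled data,
    prior: assumed positive class prior pi_p, loss: margin loss l(z).
    """
    r_p_pos = loss(scores_p).mean()    # positives predicted as positive
    r_p_neg = loss(-scores_p).mean()   # positives predicted as negative
    r_u_neg = loss(-scores_u).mean()   # unlabeled predicted as negative
    # Unbiased PU risk: pi * R_p^+ + (R_u^- - pi * R_p^-).
    # The non-negative variant clips the negative-class term at zero so the
    # empirical risk cannot go negative under very flexible models.
    neg_term = r_u_neg - prior * r_p_neg
    return prior * r_p_pos + torch.clamp(neg_term, min=0.0)
```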
Imitation learning (IL) aims to learn an optimal policy from demonstrations. However, such demonstrations are often imperfect since collecting optimal ones is costly. To effectively learn from imperfect demonstrations, we propose a novel approach that utilizes confidence scores, which describe the quality of demonstrations. More specifically, we propose two confidence-based IL methods, namely two-step importance weighting IL (2IWIL) and generative adversarial IL with imperfect demonstration and confidence (IC-GAIL). We show that confidence scores given only to a small portion of sub-optimal demonstrations significantly improve the performance of IL both theoretically and empirically.
In contrast to the standard classification paradigm, where the true (or possibly noisy) class is given to each training pattern, complementary-label learning only uses training patterns each equipped with a complementary label, which only specifies one of the classes that the pattern does not belong to. Existing work on complementary-label learning proposed an unbiased estimator of the classification risk that can be computed only from complementarily labeled data. However, it required restrictive conditions on the loss functions, making it impossible to use popular losses such as the softmax cross-entropy loss. Recently, another formulation with the softmax cross-entropy loss was proposed with a consistency guarantee. However, this formulation does not explicitly involve a risk estimator. Thus, model/hyper-parameter selection is not possible by cross-validation; we may need additional ordinarily labeled data for validation purposes, which is not available in the current setup. In this paper, we give a novel general framework of complementary-label learning and derive an unbiased risk estimator for arbitrary losses and models. We further improve the risk estimator by a non-negative correction and demonstrate its superiority through experiments.
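One way such an unbiased estimator can be written (our notation, under the common assumption that the complementary label $\bar{y}$ is chosen uniformly from the $K-1$ classes other than the true one): for a multi-class loss $\ell(f(x), k)$,
$$ R(f) = \mathbb{E}_{\bar{p}(x,\bar{y})}\!\left[ \sum_{k=1}^{K} \ell\big(f(x), k\big) \;-\; (K-1)\,\ell\big(f(x), \bar{y}\big) \right], $$
which holds for arbitrary losses and models. Because the second term enters with a negative sign, the empirical version can go negative and overfit, which is what a non-negative correction is meant to prevent.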
We address the problem of measuring the discrepancy between two domains in unsupervised domain adaptation. We point out that existing discrepancy measures are less informative when complex models such as deep neural networks are applied. In addition, estimating existing discrepancy measures can be computationally difficult and is limited to binary classification tasks. To mitigate these shortcomings, we propose a novel discrepancy measure that is very easy to estimate for many tasks, not limited to binary classification, is theoretically grounded, and can be effectively applied to complex models. We also provide easy-to-interpret generalization bounds that explain the effectiveness of a family of pseudo-labeling methods in unsupervised domain adaptation. Finally, we conduct experiments to validate the usefulness of our proposed discrepancy measure.
In this paper, we study a classification problem in which sample labels are randomly corrupted. In this scenario, there is an unobservable sample with noise-free labels. However, before being observed, the true labels are independently flipped with a probability $\rho\in[0,0.5)$, and the random label noise can be class-conditional. Here, we address two fundamental problems raised by this scenario. The first is how to best use the abundant surrogate loss functions designed for the traditional classification problem when there is label noise. We prove that any surrogate loss function can be used for classification with noisy labels by using importance reweighting, with consistency assurance that the label noise does not ultimately hinder the search for the optimal classifier of the noise-free sample. The other is the open problem of how to obtain the noise rate $\rho$. We show that the rate is upper bounded by the conditional probability $P(y|x)$ of the noisy sample. Consequently, the rate can be estimated, because the upper bound can be easily reached in classification problems. Experimental results on synthetic and real datasets confirm the efficiency of our methods.
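The reweighting identity behind the first result can be sketched as follows (our notation): since label noise leaves the marginal of $X$ unchanged, for any surrogate loss $\ell$,
$$ R_{\ell,D}(f) = \mathbb{E}_{(X,\hat{Y}) \sim D_{\rho}}\big[\beta(X,\hat{Y})\,\ell(f(X),\hat{Y})\big], \qquad \beta(x,\hat{y}) = \frac{P_{D}(\hat{y} \mid x)}{P_{D_{\rho}}(\hat{y} \mid x)}, $$
where $D$ is the clean distribution and $D_{\rho}$ the noisy one; minimizing the importance-reweighted empirical loss on noisy data therefore targets the clean risk, provided the weights (equivalently, the noise rates) can be estimated.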
In the multi-view learning paradigm, the input variable is partitioned into two different views X1 and X2 and there is a target variable Y of interest. The underlying assumption is that either view alone is sufficient to predict the target Y accurately. This provides a natural semi-supervised learning setting in which unlabeled data can be used to eliminate hypotheses from either view whose predictions tend to disagree with predictions based on the other view. This work explicitly formalizes an information theoretic, multi-view assumption and studies the multi-view paradigm in the PAC style semi-supervised framework of Balcan and Blum [2006]. Underlying the PAC style framework is the assumption that an incompatibility function is known; roughly speaking, this incompatibility function is a means to score how good a function is based on the unlabeled data alone. Here, we show how to derive incompatibility functions for certain loss functions of interest, so that minimizing this incompatibility over unlabeled data helps reduce expected loss on future test cases. In particular, we show how the class of empirically successful co-regularization algorithms falls into our framework and provide performance bounds (using the results in Rosenberg and Bartlett [2007], Farquhar et al. [2005]). We also provide a normative justification for canonical correlation analysis (CCA) as a dimensionality reduction technique. In particular, we show (for strictly convex loss functions of the form ℓ(⟨w, x⟩, y)) that we can first use CCA as a dimensionality reduction technique and (if the multi-view assumption is satisfied) this projection does not throw away much predictive information about the target Y; the benefit is that subsequent learning with a labeled set need only work in this lower dimensional space.
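A small illustration of the CCA-as-dimensionality-reduction recipe discussed above (a sketch only; the function and array names are ours, and scikit-learn's CCA stands in for whichever CCA implementation one prefers):

```python
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import RidgeClassifier

def cca_then_learn(X1_unlab, X2_unlab, X1_lab, y, n_components=10):
    """Fit CCA on the two unlabeled paired views, project the labeled view-1
    data onto the top canonical directions, then train an ordinary supervised
    model in that lower dimensional space."""
    cca = CCA(n_components=n_components)
    cca.fit(X1_unlab, X2_unlab)      # uses only the unlabeled paired views
    Z_lab = cca.transform(X1_lab)    # project labeled data into the CCA space
    clf = RidgeClassifier().fit(Z_lab, y)
    return cca, clf
```

Under the multi-view assumption, the projection is supposed to retain most of the predictive information about Y, so the (small) labeled set only has to support learning in the low-dimensional space.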
In this paper, we theoretically study the problem of binary classification in the presence of random classification noise: the learner, instead of seeing the true labels, sees labels that have independently been flipped with some small probability. Moreover, random label noise is class-conditional, i.e., the flip probability depends on the class. We provide two approaches to suitably modify any given surrogate loss function. First, we provide a simple unbiased estimator of any loss, and obtain performance bounds for empirical risk minimization in the presence of i.i.d. data with noisy labels. If the loss function satisfies a simple symmetry condition, we show that the method leads to an efficient algorithm for empirical minimization. Second, by leveraging a reduction of risk minimization under noisy labels to classification with weighted 0-1 loss, we suggest the use of a simple weighted surrogate loss, for which we are able to obtain strong empirical risk bounds. This approach has a remarkable consequence: methods used in practice, such as the biased SVM and weighted logistic regression, are provably noise-tolerant. On a synthetic non-separable dataset, our methods achieve over 88% accuracy even when 40% of the labels are corrupted, and are competitive with respect to recently proposed methods for dealing with label noise on several benchmark datasets.
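The first approach (the method of unbiased estimators) admits a compact closed form, paraphrased here in our notation: with flip rates $\rho_{+1} = P(\tilde{Y}=-1 \mid Y=+1)$ and $\rho_{-1} = P(\tilde{Y}=+1 \mid Y=-1)$ satisfying $\rho_{+1} + \rho_{-1} < 1$, define
$$ \tilde{\ell}(t, y) = \frac{(1-\rho_{-y})\,\ell(t, y) - \rho_{y}\,\ell(t, -y)}{1 - \rho_{+1} - \rho_{-1}}. $$
A direct calculation shows $\mathbb{E}_{\tilde{y}}\big[\tilde{\ell}(t, \tilde{y})\big] = \ell(t, y)$ for every score $t$ and clean label $y$, so empirical risk minimization with $\tilde{\ell}$ on noisy labels is unbiased for the clean risk.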
Unsupervised domain adaptation is the problem setting where the data generating distributions in the source and target domains are different and labels in the target domain are unavailable. An important question in unsupervised domain adaptation is how to measure the difference between the source and target domains. Previously proposed discrepancies that do not use source domain labels require high computational costs to estimate and may lead to a loose generalization error bound in the target domain. To mitigate these problems, we propose a novel discrepancy called source-guided discrepancy ($S$-disc), which exploits labels in the source domain. As a consequence, $S$-disc can be computed efficiently with a finite-sample convergence guarantee. In addition, we show that $S$-disc can provide a tighter generalization error bound than the one based on an existing discrepancy. Finally, we report experimental results that demonstrate the advantages of $S$-disc over existing discrepancies.
A common setting for novelty detection assumes that labeled examples from the nominal class are available, but that labeled examples of novelties are unavailable. The standard (inductive) approach is to declare novelties where the nominal density is low, which reduces the problem to density level set estimation. In this paper, we consider the setting where an unlabeled and possibly contaminated sample is also available at learning time. We argue that novelty detection in this semi-supervised setting is naturally solved by a general reduction to a binary classification problem. In particular, a detector with a desired false positive rate can be achieved through a reduction to Neyman-Pearson classification. Unlike the inductive approach, semi-supervised novelty detection (SSND) yields detectors that are optimal (e.g., statistically consistent) regardless of the distribution on novelties. Therefore, in novelty detection, unlabeled data have a substantial impact on the theoretical properties of the decision rule. We validate the practical utility of SSND with an extensive experimental study. We also show that SSND provides distribution-free, learning-theoretic solutions to two well known problems in hypothesis testing. First, our results provide a general solution to the two-sample problem, that is, the problem of determining whether two random samples arise from the same distribution. Second, a specialization of SSND coincides with the standard p-value approach to multiple testing under the so-called random effects model. Unlike standard rejection regions based on thresholded p-values, the general SSND framework allows for adaptation to arbitrary alternative distributions in multiple dimensions.
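A toy sketch of the reduction to binary classification described above (the estimator and the threshold rule are illustrative choices of ours, not the paper's prescription):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ssnd_detector(X_nominal, X_unlabeled, alpha=0.05):
    """Discriminate the labeled nominal sample (class 0) from the unlabeled,
    possibly contaminated sample (class 1), then pick a score threshold so
    that at most an `alpha` fraction of nominal points is flagged (an
    empirical false-positive-rate constraint in the Neyman-Pearson spirit)."""
    X = np.vstack([X_nominal, X_unlabeled])
    z = np.concatenate([np.zeros(len(X_nominal)), np.ones(len(X_unlabeled))])
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    scores_nominal = clf.predict_proba(X_nominal)[:, 1]
    threshold = np.quantile(scores_nominal, 1.0 - alpha)
    detect = lambda X_new: clf.predict_proba(X_new)[:, 1] > threshold
    return detect, threshold
```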
This paper proposes a modelling of Support Vector Machine (SVM) learning to address the problem of learning with sloppy labels. In binary classification, learning with sloppy labels is the situation where a learner is provided with labelled data, where the observed labels of each class are possibly noisy (flipped) versions of their true class and where the probability of flipping a label y to −y depends only on y. The noise probability is therefore constant and uniform within each class: learning with positive and unlabeled data is for instance a motivating example for this model. In order to learn with sloppy labels, we propose SloppySvm, an SVM algorithm that minimizes a tailored nonconvex functional that is shown to be a uniform estimate of the noise-free SVM functional. Several experiments validate the soundness of our approach.