所有著名的机器学习算法构成了受监督和半监督的学习工作,只有在一个共同的假设下:培训和测试数据遵循相同的分布。当分布变化时,大多数统计模型必须从新收集的数据中重建,对于某些应用程序,这些数据可能是昂贵或无法获得的。因此,有必要开发方法,以减少在相关领域中可用的数据并在相似领域中进一步使用这些数据,从而减少需求和努力获得新的标签样品。这引起了一个新的机器学习框架,称为转移学习:一种受人类在跨任务中推断知识以更有效学习的知识能力的学习环境。尽管有大量不同的转移学习方案,但本调查的主要目的是在特定的,可以说是最受欢迎的转移学习中最受欢迎的次级领域,概述最先进的理论结果,称为域适应。在此子场中,假定数据分布在整个培训和测试数据中发生变化,而学习任务保持不变。我们提供了与域适应性问题有关的现有结果的首次最新描述,该结果涵盖了基于不同统计学习框架的学习界限。
translated by 谷歌翻译
Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. Often, however, we have plentiful labeled training data from a source domain but wish to learn a classifier which performs well on a target domain with a different distribution and little or no labeled training data. In this work we investigate two questions. First, under what conditions can a classifier trained from source data be expected to perform well on target data? Second, given a small amount of labeled target data, how should we combine it during training with the large amount of labeled source data to achieve the lowest target error at test time?
translated by 谷歌翻译
We define notions of stability for learning algorithms and show how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error. The methods we use can be applied in the regression framework as well as in the classification one when the classifier is obtained by thresholding a real-valued function. We study the stability properties of large classes of learning algorithms such as regularization based algorithms. In particular we focus on Hilbert space regularization and Kullback-Leibler regularization. We demonstrate how to apply the results to SVM for regression and classification.1. For a qualitative discussion about sensitivity analysis with links to other resources see e.g. http://sensitivity-analysis.jrc.cec.eu.int/
translated by 谷歌翻译
比较概率分布是许多机器学习算法的关键。最大平均差异(MMD)和最佳运输距离(OT)是在过去几年吸引丰富的关注的概率措施之间的两类距离。本文建立了一些条件,可以通过MMD规范控制Wassersein距离。我们的作品受到压缩统计学习(CSL)理论的推动,资源有效的大规模学习的一般框架,其中训练数据总结在单个向量(称为草图)中,该训练数据捕获与所考虑的学习任务相关的信息。在CSL中的现有结果启发,我们介绍了H \“较旧的较低限制的等距属性(H \”较旧的LRIP)并表明这家属性具有有趣的保证对压缩统计学习。基于MMD与Wassersein距离之间的关系,我们通过引入和研究学习任务的Wassersein可读性的概念来提供压缩统计学习的保证,即概率分布之间的某些特定于特定的特定度量,可以由Wassersein界定距离。
translated by 谷歌翻译
可实现和不可知性的可读性的等价性是学习理论的基本现象。与PAC学习和回归等古典设置范围的变种,近期趋势,如对冲强劲和私人学习,我们仍然缺乏统一理论;等同性的传统证据往往是不同的,并且依赖于强大的模型特异性假设,如统一的收敛和样本压缩。在这项工作中,我们给出了第一个独立的框架,解释了可实现和不可知性的可读性的等价性:三行黑箱减少简化,统一,并在各种各样的环境中扩展了我们的理解。这包括没有已知的学报的模型,例如学习任意分布假设或一般损失,以及许多其他流行的设置,例如强大的学习,部分学习,公平学习和统计查询模型。更一般地,我们认为可实现和不可知的学习的等价性实际上是我们调用属性概括的更广泛现象的特殊情况:可以满足有限的学习算法(例如\噪声公差,隐私,稳定性)的任何理想性质假设类(可能在某些变化中)延伸到任何学习的假设类。
translated by 谷歌翻译
Due to the ability of deep neural nets to learn rich representations, recent advances in unsupervised domain adaptation have focused on learning domain-invariant features that achieve a small error on the source domain. The hope is that the learnt representation, together with the hypothesis learnt from the source domain, can generalize to the target domain. In this paper, we first construct a simple counterexample showing that, contrary to common belief, the above conditions are not sufficient to guarantee successful domain adaptation. In particular, the counterexample exhibits conditional shift: the class-conditional distributions of input features change between source and target domains. To give a sufficient condition for domain adaptation, we propose a natural and interpretable generalization upper bound that explicitly takes into account the aforementioned shift. Moreover, we shed new light on the problem by proving an information-theoretic lower bound on the joint error of any domain adaptation method that attempts to learn invariant representations. Our result characterizes a fundamental tradeoff between learning invariant representations and achieving small joint error on both domains when the marginal label distributions differ from source to target. Finally, we conduct experiments on real-world datasets that corroborate our theoretical findings. We believe these insights are helpful in guiding the future design of domain adaptation and representation learning algorithms.
translated by 谷歌翻译
转移学习或域适应性与机器学习问题有关,在这些问题中,培训和测试数据可能来自可能不同的概率分布。在这项工作中,我们在Russo和Xu发起的一系列工作之后,就通用错误和转移学习算法的过量风险进行了信息理论分析。我们的结果也许表明,也许正如预期的那样,kullback-leibler(kl)Divergence $ d(\ mu || \ mu')$在$ \ mu $和$ \ mu'$表示分布的特征中起着重要作用。培训数据和测试测试。具体而言,我们为经验风险最小化(ERM)算法提供了概括误差上限,其中两个分布的数据在训练阶段都可用。我们进一步将分析应用于近似的ERM方法,例如Gibbs算法和随机梯度下降方法。然后,我们概括了与$ \ phi $ -Divergence和Wasserstein距离绑定的共同信息。这些概括导致更紧密的范围,并且在$ \ mu $相对于$ \ mu' $的情况下,可以处理案例。此外,我们应用了一套新的技术来获得替代的上限,该界限为某些学习问题提供了快速(最佳)的学习率。最后,受到派生界限的启发,我们提出了Infoboost算法,其中根据信息测量方法对源和目标数据的重要性权重进行了调整。经验结果表明了所提出的算法的有效性。
translated by 谷歌翻译
This paper addresses the problem of unsupervised domain adaption from theoretical and algorithmic perspectives. Existing domain adaptation theories naturally imply minimax optimization algorithms, which connect well with the domain adaptation methods based on adversarial learning. However, several disconnections still exist and form the gap between theory and algorithm. We extend previous theories (Mansour et al., 2009c;Ben-David et al., 2010) to multiclass classification in domain adaptation, where classifiers based on the scoring functions and margin loss are standard choices in algorithm design. We introduce Margin Disparity Discrepancy, a novel measurement with rigorous generalization bounds, tailored to the distribution comparison with the asymmetric margin loss, and to the minimax optimization for easier training. Our theory can be seamlessly transformed into an adversarial learning algorithm for domain adaptation, successfully bridging the gap between theory and algorithm. A series of empirical studies show that our algorithm achieves the state of the art accuracies on challenging domain adaptation tasks.
translated by 谷歌翻译
A major problem in machine learning is that of inductive bias: how to choose a learner's hypothesis space so that it is large enough to contain a solution to the problem being learnt, yet small enough to ensure reliable generalization from reasonably-sized training sets. Typically such bias is supplied by hand through the skill and insights of experts. In this paper a model for automatically learning bias is investigated. The central assumption of the model is that the learner is embedded within an environment of related learning tasks. Within such an environment the learner can sample from multiple tasks, and hence it can search for a hypothesis space that contains good solutions to many of the problems in the environment. Under certain restrictions on the set of all hypothesis spaces available to the learner, we show that a hypothesis space that performs well on a sufficiently large number of training tasks will also perform well when learning novel tasks in the same environment. Explicit bounds are also derived demonstrating that learning multiple tasks within an environment of related tasks can potentially give much better generalization than learning a single task.
translated by 谷歌翻译
分发概括是将模型从实验室转移到现实世界时的关键挑战之一。现有努力主要侧重于源和目标域之间建立不变的功能。基于不变的功能,源域上的高性能分类可以在目标域上同样良好。换句话说,不变的功能是\ emph {transcorable}。然而,在实践中,没有完全可转换的功能,并且一些算法似乎学习比其他算法更学习“更可转移”的特征。我们如何理解和量化此类\ EMPH {可转录性}?在本文中,我们正式定义了一种可以量化和计算域泛化的可转换性。我们指出了与域之间的常见差异措施的差异和连接,例如总变化和Wassersein距离。然后,我们证明我们可以使用足够的样本估计我们的可转换性,并根据我们的可转移提供目标误差的新上限。经验上,我们评估现有算法学习的特征嵌入的可转换性,以获得域泛化。令人惊讶的是,我们发现许多算法并不完全学习可转让的功能,尽管很少有人仍然可以生存。鉴于此,我们提出了一种用于学习可转移功能的新算法,并在各种基准数据集中测试,包括RotationMnist,PACS,Office和Wilds-FMOW。实验结果表明,该算法在许多最先进的算法上实现了一致的改进,证实了我们的理论发现。
translated by 谷歌翻译
我们研究了广义熵的连续性属性作为潜在的概率分布的函数,用动作空间和损失函数定义,并使用此属性来回答统计学习理论中的基本问题:各种学习方法的过度风险分析。我们首先在几种常用的F分歧,Wassersein距离的熵差异导出了两个分布的熵差,这取决于动作空间的距离和损失函数,以及由熵产生的Bregman发散,这也诱导了两个分布之间的欧几里德距离方面的界限。对于每个一般结果的讨论给出了示例,使用现有的熵差界进行比较,并且基于新结果导出新的相互信息上限。然后,我们将熵差异界限应用于统计学习理论。结果表明,两种流行的学习范式,频繁学习和贝叶斯学习中的过度风险都可以用不同形式的广义熵的连续性研究。然后将分析扩展到广义条件熵的连续性。扩展为贝叶斯决策提供了不匹配的分布来提供性能范围。它也会导致第三个划分的学习范式的过度风险范围,其中决策规则是在经验分布的预定分布家族的预测下进行最佳设计。因此,我们通过广义熵的连续性建立了统计学习三大范式的过度风险分析的统一方法。
translated by 谷歌翻译
Boosting是一种著名的机器学习方法,它基于将弱和适度不准确假设与强烈而准确的假设相结合的想法。我们研究了弱假设属于界限能力类别的假设。这个假设的灵感来自共同的惯例,即虚弱的假设是“易于学习的类别”中的“人数规则”。 (Schapire和Freund〜 '12,Shalev-Shwartz和Ben-David '14。)正式,我们假设弱假设类别具有有界的VC维度。我们关注两个主要问题:(i)甲骨文的复杂性:产生准确的假设需要多少个弱假设?我们设计了一种新颖的增强算法,并证明它绕过了由Freund和Schapire('95,'12)的经典下限。虽然下限显示$ \ omega({1}/{\ gamma^2})$弱假设有时是必要的,而有时则需要使用$ \ gamma $ -margin,但我们的新方法仅需要$ \ tilde {o}({1})({1}) /{\ gamma})$弱假设,前提是它们属于一类有界的VC维度。与以前的增强算法以多数票汇总了弱假设的算法不同,新的增强算法使用了更复杂(“更深”)的聚合规则。我们通过表明复杂的聚合规则实际上是规避上述下限是必要的,从而补充了这一结果。 (ii)表现力:通过提高有限的VC类的弱假设可以学习哪些任务?可以学到“遥远”的复杂概念吗?为了回答第一个问题,我们{介绍组合几何参数,这些参数捕获增强的表现力。}作为推论,我们为认真的班级的第二个问题提供了肯定的答案,包括半空间和决策树桩。一路上,我们建立并利用差异理论的联系。
translated by 谷歌翻译
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distributionfree tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
translated by 谷歌翻译
We study the fundamental question of how to define and measure the distance from calibration for probabilistic predictors. While the notion of perfect calibration is well-understood, there is no consensus on how to quantify the distance from perfect calibration. Numerous calibration measures have been proposed in the literature, but it is unclear how they compare to each other, and many popular measures such as Expected Calibration Error (ECE) fail to satisfy basic properties like continuity. We present a rigorous framework for analyzing calibration measures, inspired by the literature on property testing. We propose a ground-truth notion of distance from calibration: the $\ell_1$ distance to the nearest perfectly calibrated predictor. We define a consistent calibration measure as one that is a polynomial factor approximation to the this distance. Applying our framework, we identify three calibration measures that are consistent and can be estimated efficiently: smooth calibration, interval calibration, and Laplace kernel calibration. The former two give quadratic approximations to the ground truth distance, which we show is information-theoretically optimal. Our work thus establishes fundamental lower and upper bounds on measuring distance to calibration, and also provides theoretical justification for preferring certain metrics (like Laplace kernel calibration) in practice.
translated by 谷歌翻译
无监督域适应(UDA)的绝大多数现有算法都集中在以一次性的方式直接从标记的源域调整到未标记的目标域。另一方面,逐渐的域适应性(GDA)假设桥接源和目标的$(t-1)$未标记的中间域,并旨在通过利用中间的路径在目标域中提供更好的概括。在某些假设下,Kumar等人。 (2020)提出了一种简单的算法,逐渐自我训练,以及按$ e^{o(t)} \ left的顺序结合的概括(\ varepsilon_0+o \ of \ left(\ sqrt {log(log(log(t)/n log(t)/n) } \ right)\ right)$对于目标域错误,其中$ \ varepsilon_0 $是源域错误,$ n $是每个域的数据大小。由于指数因素,当$ t $仅适中时,该上限变得空虚。在这项工作中,我们在更一般和放松的假设下分析了逐步的自我训练,并证明概括为$ \ varepsilon_0 + o \ left(t \ delta + t/\ sqrt {n} {n} \ right) + \ widetilde { o} \ left(1/\ sqrt {nt} \ right)$,其中$ \ delta $是连续域之间的平均分配距离。与对$ t $作为乘法因素的指数依赖性的现有界限相比,我们的界限仅取决于$ t $线性和添加性。也许更有趣的是,我们的结果意味着存在最佳的$ t $的最佳选择,从而最大程度地减少了概括性错误,并且自然也暗示了一种构造中间域路径的最佳方法,以最大程度地减少累积路径长度$ t \ delta源和目标之间的$。为了证实我们理论的含义,我们检查了对多个半合成和真实数据集的逐步自我训练,这证实了我们的发现。我们相信我们的见解为未来GDA算法设计的途径提供了前进的途径。
translated by 谷歌翻译
通过使一组基本预测因素投票根据一些权重,即对某些概率分布来获得聚合预测器。根据一些规定的概率分布,通过在一组基本预测器中采样来获得随机预测器。因此,聚合和随机预测器的共同之处包括最小化问题,而是通过对预测器集的概率分布来定义。在统计学习理论中,有一套工具旨在了解此类程序的泛化能力:Pac-Bayesian或Pac-Bayes界。由于D. Mcallester的原始Pac-Bayes界,这些工具在许多方向上得到了大大改善(例如,我们将描述社区错过的O. Catoni的定位技术的简化版本,后来被重新发现“相互信息界“)。最近,Pac-Bayes的界限受到相当大的关注:例如,在2017年的Pac-Bayes上有研讨会,“(几乎)50种贝叶斯学习:Pac-Bayesian趋势和见解”,由B. Guedj,F组织。 。巴赫和P.Merain。这一最近成功的原因之一是通过G. Dziugaite和D. Roy成功地将这些限制应用于神经网络。对Pac-Bayes理论的初步介绍仍然缺失。这是一种尝试提供这样的介绍。
translated by 谷歌翻译
Virtually all machine learning tasks are characterized using some form of loss function, and "good performance" is typically stated in terms of a sufficiently small average loss, taken over the random draw of test data. While optimizing for performance on average is intuitive, convenient to analyze in theory, and easy to implement in practice, such a choice brings about trade-offs. In this work, we survey and introduce a wide variety of non-traditional criteria used to design and evaluate machine learning algorithms, place the classical paradigm within the proper historical context, and propose a view of learning problems which emphasizes the question of "what makes for a desirable loss distribution?" in place of tacit use of the expected loss.
translated by 谷歌翻译
在数据驱动的优化和机器学习中获得概括界限的建立方法主要基于从经验风险最小化(ERM)中的解决方案,这些解决方案取决于假设类别的功能复杂性。在本文中,我们提出了一条替代途径,以从分布强劲的优化(DRO)获得解决方案上的这些界限,这是一个基于最坏情况分析的最新数据驱动的优化框架,以及设置为捕获统计不确定性的模棱两可的概念。与ERM中的假设类复杂性相反,我们的DRO界限取决于歧义集的几何形状及其与真实损耗函数的兼容性。值得注意的是,当将最大平均差异用作DRO距离度量标准时,我们的分析意味着对假设类别的依赖性的概括范围似乎是最小的:结合仅取决于真实的损失函数,独立于假设类中的任何其他候选者。据我们所知,这是文献中这种类型的第一个概括,我们希望我们的发现可以更好地了解DRO,尤其是其在损失最小化和其他机器学习应用方面的收益。
translated by 谷歌翻译
在因果推理和强盗文献中,基于观察数据的线性功能估算线性功能的问题是规范的。我们分析了首先估计治疗效果函数的广泛的两阶段程序,然后使用该数量来估计线性功能。我们证明了此类过程的均方误差上的非反应性上限:这些边界表明,为了获得非反应性最佳程序,应在特定加权$ l^2 $中最大程度地估算治疗效果的误差。 -规范。我们根据该加权规范的约束回归分析了两阶段的程序,并通过匹配非轴突局部局部最小值下限,在有限样品中建立了实例依赖性最优性。这些结果表明,除了取决于渐近效率方差之外,最佳的非质子风险除了取决于样本量支持的最富有函数类别的真实结果函数与其近似类别之间的加权规范距离。
translated by 谷歌翻译
Wasserstein的分布在强大的优化方面已成为强大估计的有力框架,享受良好的样本外部性能保证,良好的正则化效果以及计算上可易处理的双重重新纠正。在这样的框架中,通过将最接近经验分布的所有概率分布中最接近的所有概率分布中最小化的最差预期损失来最大程度地减少估计量。在本文中,我们提出了一个在噪声线性测量中估算未知参数的Wasserstein分布稳定的M估计框架,我们专注于分析此类估计器的平方误差性能的重要且具有挑战性的任务。我们的研究是在现代的高维比例状态下进行的,在该状态下,环境维度和样品数量都以相对的速度进行编码,该速率以编码问题的下/过度参数化的比例。在各向同性高斯特征假设下,我们表明可以恢复平方误差作为凸 - 串联优化问题的解,令人惊讶的是,它在最多四个标量变量中都涉及。据我们所知,这是在Wasserstein分布强劲的M估计背景下研究此问题的第一项工作。
translated by 谷歌翻译