在数据驱动的优化和机器学习中获得概括界限的建立方法主要基于从经验风险最小化(ERM)中的解决方案,这些解决方案取决于假设类别的功能复杂性。在本文中,我们提出了一条替代途径,以从分布强劲的优化(DRO)获得解决方案上的这些界限,这是一个基于最坏情况分析的最新数据驱动的优化框架,以及设置为捕获统计不确定性的模棱两可的概念。与ERM中的假设类复杂性相反,我们的DRO界限取决于歧义集的几何形状及其与真实损耗函数的兼容性。值得注意的是,当将最大平均差异用作DRO距离度量标准时,我们的分析意味着对假设类别的依赖性的概括范围似乎是最小的:结合仅取决于真实的损失函数,独立于假设类中的任何其他候选者。据我们所知,这是文献中这种类型的第一个概括,我们希望我们的发现可以更好地了解DRO,尤其是其在损失最小化和其他机器学习应用方面的收益。
translated by 谷歌翻译
Empirical risk minimization (ERM) and distributionally robust optimization (DRO) are popular approaches for solving stochastic optimization problems that appear in operations management and machine learning. Existing generalization error bounds for these methods depend on either the complexity of the cost function or dimension of the uncertain parameters; consequently, the performance of these methods is poor for high-dimensional problems with objective functions under high complexity. We propose a simple approach in which the distribution of uncertain parameters is approximated using a parametric family of distributions. This mitigates both sources of complexity; however, it introduces a model misspecification error. We show that this new source of error can be controlled by suitable DRO formulations. Our proposed parametric DRO approach has significantly improved generalization bounds over existing ERM / DRO methods and parametric ERM for a wide variety of settings. Our method is particularly effective under distribution shifts. We also illustrate the superior performance of our approach on both synthetic and real-data portfolio optimization and regression tasks.
translated by 谷歌翻译
比较概率分布是许多机器学习算法的关键。最大平均差异(MMD)和最佳运输距离(OT)是在过去几年吸引丰富的关注的概率措施之间的两类距离。本文建立了一些条件,可以通过MMD规范控制Wassersein距离。我们的作品受到压缩统计学习(CSL)理论的推动,资源有效的大规模学习的一般框架,其中训练数据总结在单个向量(称为草图)中,该训练数据捕获与所考虑的学习任务相关的信息。在CSL中的现有结果启发,我们介绍了H \“较旧的较低限制的等距属性(H \”较旧的LRIP)并表明这家属性具有有趣的保证对压缩统计学习。基于MMD与Wassersein距离之间的关系,我们通过引入和研究学习任务的Wassersein可读性的概念来提供压缩统计学习的保证,即概率分布之间的某些特定于特定的特定度量,可以由Wassersein界定距离。
translated by 谷歌翻译
所有著名的机器学习算法构成了受监督和半监督的学习工作,只有在一个共同的假设下:培训和测试数据遵循相同的分布。当分布变化时,大多数统计模型必须从新收集的数据中重建,对于某些应用程序,这些数据可能是昂贵或无法获得的。因此,有必要开发方法,以减少在相关领域中可用的数据并在相似领域中进一步使用这些数据,从而减少需求和努力获得新的标签样品。这引起了一个新的机器学习框架,称为转移学习:一种受人类在跨任务中推断知识以更有效学习的知识能力的学习环境。尽管有大量不同的转移学习方案,但本调查的主要目的是在特定的,可以说是最受欢迎的转移学习中最受欢迎的次级领域,概述最先进的理论结果,称为域适应。在此子场中,假定数据分布在整个培训和测试数据中发生变化,而学习任务保持不变。我们提供了与域适应性问题有关的现有结果的首次最新描述,该结果涵盖了基于不同统计学习框架的学习界限。
translated by 谷歌翻译
我们研究了广义熵的连续性属性作为潜在的概率分布的函数,用动作空间和损失函数定义,并使用此属性来回答统计学习理论中的基本问题:各种学习方法的过度风险分析。我们首先在几种常用的F分歧,Wassersein距离的熵差异导出了两个分布的熵差,这取决于动作空间的距离和损失函数,以及由熵产生的Bregman发散,这也诱导了两个分布之间的欧几里德距离方面的界限。对于每个一般结果的讨论给出了示例,使用现有的熵差界进行比较,并且基于新结果导出新的相互信息上限。然后,我们将熵差异界限应用于统计学习理论。结果表明,两种流行的学习范式,频繁学习和贝叶斯学习中的过度风险都可以用不同形式的广义熵的连续性研究。然后将分析扩展到广义条件熵的连续性。扩展为贝叶斯决策提供了不匹配的分布来提供性能范围。它也会导致第三个划分的学习范式的过度风险范围,其中决策规则是在经验分布的预定分布家族的预测下进行最佳设计。因此,我们通过广义熵的连续性建立了统计学习三大范式的过度风险分析的统一方法。
translated by 谷歌翻译
在包括生成建模的各种机器学习应用中的两个概率措施中,已经证明了切片分歧的想法是成功的,并且包括计算两种测量的一维随机投影之间的“基地分歧”的预期值。然而,这种技术的拓扑,统计和计算后果尚未完整地确定。在本文中,我们的目标是弥合这种差距并导出切片概率分歧的各种理论特性。首先,我们表明切片保留了公制公理和分歧的弱连续性,这意味着切片分歧将共享相似的拓扑性质。然后,我们在基本发散属于积分概率度量类别的情况下精确结果。另一方面,我们在轻度条件下建立了切片分歧的样本复杂性并不依赖于问题尺寸。我们终于将一般结果应用于几个基地分歧,并说明了我们对合成和实际数据实验的理论。
translated by 谷歌翻译
尽管现代的大规模数据集通常由异质亚群(例如,多个人口统计组或多个文本语料库)组成 - 最小化平均损失的标准实践并不能保证所有亚人群中均匀的低损失。我们提出了一个凸面程序,该过程控制给定尺寸的所有亚群中最差的表现。我们的程序包括有限样本(非参数)收敛的保证,可以保证最坏的亚群。从经验上讲,我们观察到词汇相似性,葡萄酒质量和累犯预测任务,我们最糟糕的程序学习了对不看到看不见的亚人群的模型。
translated by 谷歌翻译
We study distributionally robust optimization (DRO) with Sinkhorn distance -- a variant of Wasserstein distance based on entropic regularization. We provide convex programming dual reformulation for a general nominal distribution. Compared with Wasserstein DRO, it is computationally tractable for a larger class of loss functions, and its worst-case distribution is more reasonable. We propose an efficient first-order algorithm with bisection search to solve the dual reformulation. We demonstrate that our proposed algorithm finds $\delta$-optimal solution of the new DRO formulation with computation cost $\tilde{O}(\delta^{-3})$ and memory cost $\tilde{O}(\delta^{-2})$, and the computation cost further improves to $\tilde{O}(\delta^{-2})$ when the loss function is smooth. Finally, we provide various numerical examples using both synthetic and real data to demonstrate its competitive performance and light computational speed.
translated by 谷歌翻译
We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data dependent subspaces spanned by a random subset of the data, recovering as a special case Nystrom approaches for kernel methods. Considering random subspaces naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have been recently explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions, that might not be smooth, such as the hinge loss used in support vector machines. This unified analysis requires developing new proofs, that use different technical tools, such as sub-gaussian inputs, to achieve fast rates. Our main results show the existence of different settings, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance.
translated by 谷歌翻译
我们提出了一种统一的技术,用于顺序估计分布之间的凸面分歧,包括内核最大差异等积分概率度量,$ \ varphi $ - 像Kullback-Leibler发散,以及最佳运输成本,例如Wassersein距离的权力。这是通过观察到经验凸起分歧(部分有序)反向半角分离的实现来实现的,而可交换过滤耦合,其具有这些方法的最大不等式。这些技术似乎是对置信度序列和凸分流的现有文献的互补和强大的补充。我们构建一个离线到顺序设备,将各种现有的离线浓度不等式转换为可以连续监测的时间均匀置信序列,在任意停止时间提供有效的测试或置信区间。得到的顺序边界仅在相应的固定时间范围内支付迭代对数价格,保留对问题参数的相同依赖性(如适用的尺寸或字母大小)。这些结果也适用于更一般的凸起功能,如负差分熵,实证过程的高度和V型统计。
translated by 谷歌翻译
在因果推理和强盗文献中,基于观察数据的线性功能估算线性功能的问题是规范的。我们分析了首先估计治疗效果函数的广泛的两阶段程序,然后使用该数量来估计线性功能。我们证明了此类过程的均方误差上的非反应性上限:这些边界表明,为了获得非反应性最佳程序,应在特定加权$ l^2 $中最大程度地估算治疗效果的误差。 -规范。我们根据该加权规范的约束回归分析了两阶段的程序,并通过匹配非轴突局部局部最小值下限,在有限样品中建立了实例依赖性最优性。这些结果表明,除了取决于渐近效率方差之外,最佳的非质子风险除了取决于样本量支持的最富有函数类别的真实结果函数与其近似类别之间的加权规范距离。
translated by 谷歌翻译
Wasserstein distributionally robust optimization (DRO) has found success in operations research and machine learning applications as a powerful means to obtain solutions with favourable out-of-sample performances. Two compelling explanations for the success are the generalization bounds derived from Wasserstein DRO and the equivalency between Wasserstein DRO and the regularization scheme commonly applied in machine learning. Existing results on generalization bounds and the equivalency to regularization are largely limited to the setting where the Wasserstein ball is of a certain type and the decision criterion takes certain forms of an expected function. In this paper, we show that by focusing on Wasserstein DRO problems with affine decision rules, it is possible to obtain generalization bounds and the equivalency to regularization in a significantly broader setting where the Wasserstein ball can be of a general type and the decision criterion can be a general measure of risk, i.e., nonlinear in distributions. This allows for accommodating many important classification, regression, and risk minimization applications that have not been addressed to date using Wasserstein DRO. Our results are strong in that the generalization bounds do not suffer from the curse of dimensionality and the equivalency to regularization is exact. As a byproduct, our regularization results broaden considerably the class of Wasserstein DRO models that can be solved efficiently via regularization formulations.
translated by 谷歌翻译
转移学习或域适应性与机器学习问题有关,在这些问题中,培训和测试数据可能来自可能不同的概率分布。在这项工作中,我们在Russo和Xu发起的一系列工作之后,就通用错误和转移学习算法的过量风险进行了信息理论分析。我们的结果也许表明,也许正如预期的那样,kullback-leibler(kl)Divergence $ d(\ mu || \ mu')$在$ \ mu $和$ \ mu'$表示分布的特征中起着重要作用。培训数据和测试测试。具体而言,我们为经验风险最小化(ERM)算法提供了概括误差上限,其中两个分布的数据在训练阶段都可用。我们进一步将分析应用于近似的ERM方法,例如Gibbs算法和随机梯度下降方法。然后,我们概括了与$ \ phi $ -Divergence和Wasserstein距离绑定的共同信息。这些概括导致更紧密的范围,并且在$ \ mu $相对于$ \ mu' $的情况下,可以处理案例。此外,我们应用了一套新的技术来获得替代的上限,该界限为某些学习问题提供了快速(最佳)的学习率。最后,受到派生界限的启发,我们提出了Infoboost算法,其中根据信息测量方法对源和目标数据的重要性权重进行了调整。经验结果表明了所提出的算法的有效性。
translated by 谷歌翻译
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distributionfree tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
translated by 谷歌翻译
Bayesian methods, distributionally robust optimization methods, and regularization methods are three pillars of trustworthy machine learning hedging against distributional uncertainty, e.g., the uncertainty of an empirical distribution compared to the true underlying distribution. This paper investigates the connections among the three frameworks and, in particular, explores why these frameworks tend to have smaller generalization errors. Specifically, first, we suggest a quantitative definition for "distributional robustness", propose the concept of "robustness measure", and formalize several philosophical concepts in distributionally robust optimization. Second, we show that Bayesian methods are distributionally robust in the probably approximately correct (PAC) sense; In addition, by constructing a Dirichlet-process-like prior in Bayesian nonparametrics, it can be proven that any regularized empirical risk minimization method is equivalent to a Bayesian method. Third, we show that generalization errors of machine learning models can be characterized using the distributional uncertainty of the nominal distribution and the robustness measures of these machine learning models, which is a new perspective to bound generalization errors, and therefore, explain the reason why distributionally robust machine learning models, Bayesian models, and regularization models tend to have smaller generalization errors.
translated by 谷歌翻译
Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference; developing methods whose validity does not depend on any assumption on $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution, regardless of how $d$ scales with $n$. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for some classical problems including one-sample mean and covariance testing, and show that our tests have minimax rate-optimal power against appropriate local alternatives. In most settings, our cross U-statistic matches the high-dimensional power of the corresponding (degenerate) U-statistic up to a $\sqrt{2}$ factor.
translated by 谷歌翻译
我们研究只有历史数据时设计最佳学习和决策制定公式的问题。先前的工作通常承诺要进行特定的数据驱动配方,并随后尝试建立样本外的性能保证。我们以相反的方式采取了相反的方法。我们首先定义一个明智的院子棒,以测量任何数据驱动的公式的质量,然后寻求找到最佳的这种配方。在非正式的情况下,可以看到任何数据驱动的公式可以平衡估计成本与实际成本的接近度的量度,同时保证了样本外的性能水平。考虑到可接受的样本外部性能水平,我们明确地构建了一个数据驱动的配方,该配方比任何其他享有相同样本外部性能的其他配方都更接近真实成本。我们展示了三种不同的样本外绩效制度(超大型制度,指数状态和次指数制度)之间存在,最佳数据驱动配方的性质会经历相变的性质。最佳数据驱动的公式可以解释为超级稳定的公式,在指数方面是一种熵分布在熵上稳健的公式,最后是次指数制度中的方差惩罚公式。这个最终的观察揭示了这三个观察之间的令人惊讶的联系,乍一看似乎是无关的,数据驱动的配方,直到现在仍然隐藏了。
translated by 谷歌翻译
概率分布之间的差异措施,通常被称为统计距离,在概率理论,统计和机器学习中普遍存在。为了在估计这些距离的距离时,对维度的诅咒,最近的工作已经提出了通过带有高斯内核的卷积在测量的分布中平滑局部不规则性。通过该框架的可扩展性至高维度,我们研究了高斯平滑$ P $ -wassersein距离$ \ mathsf {w} _p ^ {(\ sigma)} $的结构和统计行为,用于任意$ p \ GEQ 1 $。在建立$ \ mathsf {w} _p ^ {(\ sigma)} $的基本度量和拓扑属性之后,我们探索$ \ mathsf {w} _p ^ {(\ sigma)}(\ hat {\ mu} _n,\ mu)$,其中$ \ hat {\ mu} _n $是$ n $独立观察的实证分布$ \ mu $。我们证明$ \ mathsf {w} _p ^ {(\ sigma)} $享受$ n ^ { - 1/2} $的参数经验融合速率,这对比$ n ^ { - 1 / d} $率对于未平滑的$ \ mathsf {w} _p $ why $ d \ geq 3 $。我们的证明依赖于控制$ \ mathsf {w} _p ^ {(\ sigma)} $ by $ p $ th-sting spoollow sobolev restion $ \ mathsf {d} _p ^ {(\ sigma)} $并导出限制$ \ sqrt {n} \,\ mathsf {d} _p ^ {(\ sigma)}(\ hat {\ mu} _n,\ mu)$,适用于所有尺寸$ d $。作为应用程序,我们提供了使用$ \ mathsf {w} _p ^ {(\ sigma)} $的两个样本测试和最小距离估计的渐近保证,使用$ p = 2 $的实验使用$ \ mathsf {d} _2 ^ {(\ sigma)} $。
translated by 谷歌翻译
We explore the ability of overparameterized shallow ReLU neural networks to learn Lipschitz, non-differentiable, bounded functions with additive noise when trained by Gradient Descent (GD). To avoid the problem that in the presence of noise, neural networks trained to nearly zero training error are inconsistent in this class, we focus on the early-stopped GD which allows us to show consistency and optimal rates. In particular, we explore this problem from the viewpoint of the Neural Tangent Kernel (NTK) approximation of a GD-trained finite-width neural network. We show that whenever some early stopping rule is guaranteed to give an optimal rate (of excess risk) on the Hilbert space of the kernel induced by the ReLU activation function, the same rule can be used to achieve minimax optimal rate for learning on the class of considered Lipschitz functions by neural networks. We discuss several data-free and data-dependent practically appealing stopping rules that yield optimal rates.
translated by 谷歌翻译
我们在对数损失下引入条件密度估计的过程,我们调用SMP(样本Minmax预测器)。该估算器最大限度地减少了统计学习的新一般过度风险。在标准示例中,此绑定量表为$ d / n $,$ d $ d $模型维度和$ n $ sample大小,并在模型拼写条目下批判性仍然有效。作为一个不当(超出型号)的程序,SMP在模型内估算器(如最大似然估计)的内部估算器上,其风险过高的风险降低。相比,与顺序问题的方法相比,我们的界限删除了SubOltimal $ \ log n $因子,可以处理无限的类。对于高斯线性模型,SMP的预测和风险受到协变量的杠杆分数,几乎匹配了在没有条件的线性模型的噪声方差或近似误差的条件下匹配的最佳风险。对于Logistic回归,SMP提供了一种非贝叶斯方法来校准依赖于虚拟样本的概率预测,并且可以通过解决两个逻辑回归来计算。它达到了$ O的非渐近风险((d + b ^ 2r ^ 2)/ n)$,其中$ r $绑定了特征的规范和比较参数的$ B $。相比之下,在模型内估计器内没有比$ \ min达到更好的速率({b r} / {\ sqrt {n}},{d e ^ {br} / {n})$。这为贝叶斯方法提供了更实用的替代方法,这需要近似的后部采样,从而部分地解决了Foster等人提出的问题。 (2018)。
translated by 谷歌翻译