Several works on implicit and explicit generative modeling have empirically observed that feature-learning discriminators outperform fixed-kernel discriminators in terms of the sample quality of the models. We provide separation results between probability metrics whose discriminators use the function classes $\mathcal{F}_2$ (fixed-kernel) and $\mathcal{F}_1$ (feature-learning), respectively. In particular, we construct pairs of distributions over hyper-spheres that cannot be discriminated by fixed-kernel ($\mathcal{F}_2$) integral probability metrics (IPM) and Stein discrepancies (SD) in high dimensions, but that can be discriminated by their feature-learning ($\mathcal{F}_1$) counterparts. To further study the separation, we provide links between the $\mathcal{F}_1$ and $\mathcal{F}_2$ IPMs. Our work suggests that fixed-kernel discriminators perform worse than their feature-learning counterparts because their corresponding metrics are weaker.
We construct pairs of distributions $\mu_d, \nu_d$ on $\mathbb{R}^d$ such that the quantity $|\mathbb{E}_{x \sim \mu_d}[F(x)] - \mathbb{E}_{x \sim \nu_d}[F(x)]|$ decreases as $\Omega(1/d^2)$ for some three-layer ReLU network $F$ with polynomial width and weights, while declining exponentially in $d$ if $F$ is any two-layer network with polynomial weights. This shows that deep GAN discriminators are able to distinguish distributions that shallow discriminators cannot. Analogously, we build pairs of distributions $\mu_d, \nu_d$ on $\mathbb{R}^d$ such that $|\mathbb{E}_{x \sim \mu_d}[F(x)] - \mathbb{E}_{x \sim \nu_d}[F(x)]|$ decreases as $\Omega(1/(d \log d))$ for two-layer ReLU networks with polynomial weights, while decreasing exponentially for bounded-norm functions in the associated RKHS. This confirms that feature learning is beneficial for discriminators. Our bounds are based on Fourier transforms.
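As a rough illustration of the quantity involved (a hedged sketch, not the paper's construction), the snippet below Monte-Carlo-estimates the discrimination gap $|\mathbb{E}_{x \sim \mu}[F(x)] - \mathbb{E}_{x \sim \nu}[F(x)]|$ for a randomly initialized two-layer ReLU discriminator; the distributions, width, and dimension are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 32, 256, 20_000

# Placeholder distributions: a standard Gaussian vs. a slightly shifted one.
mu_samples = rng.standard_normal((n, d))
nu_samples = rng.standard_normal((n, d)) + 0.1

# A two-layer ReLU discriminator F(x) = a^T relu(W x + b) with random weights.
W = rng.standard_normal((width, d)) / np.sqrt(d)
b = rng.standard_normal(width)
a = rng.standard_normal(width) / np.sqrt(width)

def F(x):
    """Evaluate the two-layer ReLU network on a batch of points."""
    return np.maximum(x @ W.T + b, 0.0) @ a

# Monte-Carlo estimate of |E_mu[F] - E_nu[F]|, the IPM-style witness gap.
gap = abs(F(mu_samples).mean() - F(nu_samples).mean())
print(f"estimated discrimination gap: {gap:.4f}")
```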
A well-known line of work (Barron, 1993; Breiman, 1993; Klusowski & Barron, 2018) provides bounds on the width $n$ that a two-layer ReLU neural network needs in order to approximate a function $f$ over the ball $\mathcal{B}_R(\mathbb{R}^d)$ up to error $\epsilon$, when the Fourier-based quantity $C_f = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \|\xi\|^2 |\hat{f}(\xi)| \, d\xi$ is finite. More recently, Ongie et al. (2019) used the Radon transform as a tool to analyze infinite-width two-layer ReLU networks. In particular, they introduced the concept of Radon-based $\mathcal{R}$-norms and showed that a function defined on $\mathbb{R}^d$ can be represented as an infinite-width two-layer neural network if and only if its $\mathcal{R}$-norm is finite. In this work, we extend the framework of Ongie et al. (2019) and define similar Radon-based semi-norms ($\mathcal{R}, \mathcal{U}$-norms) such that a function admits an infinite-width neural network representation on a bounded open set $\mathcal{U} \subseteq \mathbb{R}^d$ when its $\mathcal{R}, \mathcal{U}$-norm is finite. Building on this, we derive sparse (finite-width) neural network approximation bounds that refine those of Breiman (1993) and Klusowski & Barron (2018). Finally, we show that infinite-width neural network representations on bounded open sets are not unique and study their structure, providing a functional view of mode connectivity.
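As a hedged worked example of this Fourier quantity (under the unitary convention in which the standard Gaussian bump is its own transform), take $f(x) = e^{-\|x\|^2/2}$; then
$$C_f = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \|\xi\|^2 e^{-\|\xi\|^2/2}\, d\xi = \mathbb{E}_{Z \sim \mathcal{N}(0, I_d)}\big[\|Z\|^2\big] = d,$$
so for this particular $f$ the constant grows only linearly with the dimension.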
Comparing probability distributions is at the crux of many machine learning algorithms. Maximum mean discrepancies (MMD) and optimal transport (OT) distances are two classes of distances between probability measures that have attracted abundant attention over the past years. This paper establishes some conditions under which the Wasserstein distance can be controlled by MMD norms. Our work is motivated by compressive statistical learning (CSL) theory, a general framework for resource-efficient large-scale learning in which the training data are summarized in a single vector (called a sketch) that captures the information relevant to the considered learning task. Inspired by existing results in CSL, we introduce the Hölder lower restricted isometric property (Hölder LRIP) and show that this property comes with interesting guarantees for compressive statistical learning. Based on the relations between the MMD and the Wasserstein distance, we provide guarantees for compressive statistical learning by introducing and studying the notion of Wasserstein learnability of a learning task, that is, when some task-specific metric between probability distributions can be bounded by a Wasserstein distance.
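As a minimal, hedged illustration of the MMD side of this comparison (not the paper's sketching framework), the following computes a plug-in estimate of the squared MMD with a Gaussian kernel between two sample sets; the kernel and bandwidth are placeholder choices.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_squared(X, Y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    kxx = gaussian_kernel(X, X, bandwidth).mean()
    kyy = gaussian_kernel(Y, Y, bandwidth).mean()
    kxy = gaussian_kernel(X, Y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))          # samples from P
Y = rng.standard_normal((500, 2)) + 0.5    # samples from a shifted Q
print(f"MMD^2 estimate: {mmd_squared(X, Y):.4f}")
```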
Discrepancy measures between probability distributions, often termed statistical distances, are ubiquitous in probability theory, statistics, and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of this framework to high dimensions, we investigate the structural and statistical behavior of the Gaussian-smoothed $p$-Wasserstein distance $\mathsf{W}_p^{(\sigma)}$, for arbitrary $p \geq 1$. After establishing basic metric and topological properties of $\mathsf{W}_p^{(\sigma)}$, we explore $\mathsf{W}_p^{(\sigma)}(\hat{\mu}_n, \mu)$, where $\hat{\mu}_n$ is the empirical distribution of $n$ independent observations from $\mu$. We prove that $\mathsf{W}_p^{(\sigma)}$ enjoys a parametric empirical convergence rate of $n^{-1/2}$, which contrasts with the $n^{-1/d}$ rate of the unsmoothed $\mathsf{W}_p$ when $d \geq 3$. Our proof relies on controlling $\mathsf{W}_p^{(\sigma)}$ by the $p$th-order smooth Sobolev distance $\mathsf{d}_p^{(\sigma)}$ and deriving the limit distribution of $\sqrt{n}\,\mathsf{d}_p^{(\sigma)}(\hat{\mu}_n, \mu)$, for all dimensions $d$. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation using $\mathsf{W}_p^{(\sigma)}$, with experiments for $p = 2$ using $\mathsf{d}_2^{(\sigma)}$.
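A hedged one-dimensional sketch of the smoothing mechanism: samples from the Gaussian-smoothed measures are obtained by adding independent $\mathcal{N}(0, \sigma^2)$ noise to the original samples, after which the empirical $\mathsf{W}_1$ is computed with SciPy; a large auxiliary sample stands in for the population measure $\mu$, and all sizes are placeholders.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
sigma = 1.0
n, n_ref = 200, 100_000

mu_hat = rng.standard_normal(n)        # empirical sample from mu = N(0, 1)
mu_ref = rng.standard_normal(n_ref)    # large sample standing in for mu itself

# Samples from the smoothed measures mu_hat * N(0, sigma^2) and mu * N(0, sigma^2).
mu_hat_smooth = mu_hat + sigma * rng.standard_normal(n)
mu_ref_smooth = mu_ref + sigma * rng.standard_normal(n_ref)

w1_plain = wasserstein_distance(mu_hat, mu_ref)
w1_smooth = wasserstein_distance(mu_hat_smooth, mu_ref_smooth)
print(f"W_1(mu_hat, mu) ~ {w1_plain:.4f},  W_1^(sigma)(mu_hat, mu) ~ {w1_smooth:.4f}")
```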
We consider neural networks with a single hidden layer and non-decreasing positively homogeneous activation functions like the rectified linear units. By letting the number of hidden units grow unbounded and using classical non-Euclidean regularization tools on the output weights, they lead to a convex optimization problem and we provide a detailed theoretical analysis of their generalization performance, with a study of both the approximation and the estimation errors. We show in particular that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace. Moreover, when using sparsity-inducing norms on the input weights, we show that high-dimensional non-linear variable selection may be achieved, without any strong assumption regarding the data and with a total number of variables potentially exponential in the number of observations. However, solving this convex optimization problem in infinite dimensions is only possible if the non-convex subproblem of addition of a new unit can be solved efficiently. We provide a simple geometric interpretation for our choice of activation functions and describe simple conditions for convex relaxations of the finite-dimensional non-convex subproblem to achieve the same generalization error bounds, even when constant-factor approximations cannot be found. We were not able to find strong enough convex relaxations to obtain provably polynomial-time algorithms and leave open the existence or non-existence of such tractable algorithms with non-exponential sample complexities.
Learning high-dimensional distributions is often done with explicit likelihood modeling or implicit modeling via minimizing integral probability metrics (IPMs). In this paper, we expand this learning paradigm to stochastic orders, namely, the convex or Choquet order between probability measures. Towards this end, exploiting the relation between convex orders and optimal transport, we introduce the Choquet-Toland distance between probability measures, that can be used as a drop-in replacement for IPMs. We also introduce the Variational Dominance Criterion (VDC) to learn probability measures with dominance constraints, that encode the desired stochastic order between the learned measure and a known baseline. We analyze both quantities and show that they suffer from the curse of dimensionality and propose surrogates via input convex maxout networks (ICMNs), that enjoy parametric rates. We provide a min-max framework for learning with stochastic orders and validate it experimentally on synthetic and high-dimensional image generation, with promising results. Finally, our ICMNs class of convex functions and its derived Rademacher Complexity are of independent interest beyond their application in convex orders.
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In the presence of hidden low-dimensional structures, the resulting margin is independent of the ambient dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activations and confirm the statistical benefits of this implicit bias.
Generative adversarial networks (GANs) have achieved great success in unsupervised learning. Despite their remarkable empirical performance, theoretical studies of the statistical properties of GANs are limited. This paper provides approximation and statistical guarantees for GANs that estimate data distributions with densities in a Hölder space. Our main result shows that, if the generator and discriminator network architectures are properly chosen, GANs are consistent estimators of the data distribution under strong discrepancy metrics such as the Wasserstein-1 distance; moreover, when the data have low-dimensional structure, the resulting rates are free of the curse of ambient dimensionality. Our analysis of low-dimensional data builds on a universal approximation theory for neural networks with Lipschitz continuity guarantees, which may be of independent interest.
The idea of slicing divergences has proven successful for comparing two probability measures in various machine learning applications, including generative modeling, and consists in computing the expected value of a "base divergence" between one-dimensional random projections of the two measures. However, the topological, statistical, and computational consequences of this technique have not yet been fully established. In this paper, we aim at bridging this gap and derive various theoretical properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, which implies that sliced divergences share similar topological properties. We then make these results precise in the case where the base divergence belongs to the class of integral probability metrics. On the other hand, we establish that, under mild conditions, the sample complexity of a sliced divergence does not depend on the problem dimension. We finally apply our general results to several base divergences and illustrate our theory on synthetic and real-data experiments.
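As a hedged concrete instance of the slicing recipe, the sketch below estimates the sliced Wasserstein-1 distance by averaging the one-dimensional Wasserstein distance (the base divergence) over random projection directions; the number of projections and the test distributions are placeholders.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def sliced_wasserstein_1(X, Y, n_projections=100, seed=0):
    """Monte-Carlo estimate of the sliced W_1 distance between samples X and Y."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)                       # uniform direction on the sphere
        total += wasserstein_distance(X @ theta, Y @ theta)  # 1-D base divergence on projections
    return total / n_projections

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 10))
Y = rng.standard_normal((1000, 10)) + 0.3
print(f"sliced W_1 estimate: {sliced_wasserstein_1(X, Y):.4f}")
```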
In this work, we derive a formula for the integral representation of shallow neural networks with the rectified power unit (RePU) activation function. Our first result concerns the representation capability of univariate RePU shallow networks. The multidimensional result of this paper characterizes the set of functions that can be represented with bounded norm and possibly unbounded width.
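For reference (a standard definition rather than a statement taken from this article), the rectified power unit of order $k$ is
$$\mathrm{RePU}_k(t) = \max(t, 0)^k, \qquad k \in \mathbb{N},$$
so that $k = 1$ recovers the ReLU activation.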
This paper introduces a new simulation-based inference procedure for modeling and sampling from multi-dimensional probability distributions given access to i.i.d.\ samples, circumventing the usual approaches of explicitly modeling the density function or designing a Markov chain Monte Carlo scheme. Motivated by notions of distance and isomorphism between metric measure spaces, we propose a new notion called the reversible Gromov–Monge (RGM) distance and study how RGM can be used to design new transform samplers for simulation-based inference. Our RGM sampler can also estimate the optimal alignment between two heterogeneous metric measure spaces $(\mathcal{X}, \mu, c_{\mathcal{X}})$ and $(\mathcal{Y}, \nu, c_{\mathcal{Y}})$ from empirical data sets, with the estimated maps approximately pushing one measure $\mu$ forward onto the other $\nu$, and vice versa. We study the analytic properties of the RGM distance and derive that, under mild conditions, RGM equals the classic Gromov–Wasserstein distance. Curiously, drawing a connection to Brenier's polar decomposition, we show that the RGM sampler induces a bias towards strong isomorphism with proper choices of $c_{\mathcal{X}}$ and $c_{\mathcal{Y}}$. Statistical rates of convergence, representation, and optimization questions regarding the induced sampler are studied. Synthetic and real-world examples demonstrating the effectiveness of the RGM sampler are also presented.
Quantifying the deviation of a probability distribution is challenging when the target distribution is defined by a density with an intractable normalizing constant. The kernel Stein discrepancy (KSD) was proposed to address this problem and has been applied to various tasks including diagnosing approximate MCMC samplers and goodness-of-fit testing for unnormalized statistical models. This article investigates a convergence control property of the diffusion kernel Stein discrepancy (DKSD), an instance of the KSD proposed by Barp et al. (2019). We extend the result of Gorham and Mackey (2017), which showed that the KSD controls the bounded-Lipschitz metric, to functions of polynomial growth. Specifically, we prove that the DKSD controls the integral probability metric defined by a class of pseudo-Lipschitz functions, a polynomial generalization of Lipschitz functions. We also provide practical sufficient conditions on the reproducing kernel for the stated property to hold. In particular, we show that the DKSD detects non-convergence in moments with an appropriate kernel.
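As a hedged sketch of the plain kernel Stein discrepancy (the basic Langevin KSD with a Gaussian RBF kernel, rather than the diffusion variant analyzed in the article), the following computes the V-statistic estimate for a standard normal target, whose score $\nabla \log p(x) = -x$ is available in closed form; the bandwidth and sample sizes are placeholders.

```python
import numpy as np

def ksd_squared(X, score, bandwidth=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy with a
    Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    h2 = bandwidth ** 2
    diff = X[:, None, :] - X[None, :, :]               # (n, n, d), x_i - x_j
    sq = (diff ** 2).sum(-1)                           # ||x_i - x_j||^2
    K = np.exp(-sq / (2.0 * h2))
    S = score(X)                                       # (n, d), grad log p at each sample
    d = X.shape[1]

    term1 = (S @ S.T) * K                                   # s(x)^T s(y) k(x, y)
    term2 = np.einsum("id,ijd->ij", S, diff) / h2 * K       # s(x)^T grad_y k = s(x)^T (x - y)/h^2 k
    term3 = -np.einsum("jd,ijd->ij", S, diff) / h2 * K      # s(y)^T grad_x k = -s(y)^T (x - y)/h^2 k
    term4 = (d / h2 - sq / h2 ** 2) * K                     # trace of the mixed second derivative
    return (term1 + term2 + term3 + term4).mean()

rng = np.random.default_rng(0)
X_good = rng.standard_normal((500, 2))        # approximately from the target N(0, I)
X_bad = rng.standard_normal((500, 2)) + 1.0   # shifted sample, should score worse
score = lambda x: -x                          # score of the standard normal target
print(ksd_squared(X_good, score), ksd_squared(X_bad, score))
```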
We consider the problem of estimating the optimal transport map between a (fixed) source distribution $P$ and an unknown target distribution $Q$, based on samples from $Q$. The estimation of such optimal transport maps has become increasingly relevant in modern statistical applications, such as generative modeling. At present, estimation rates are only known in a few settings (e.g. when $P$ and $Q$ have densities bounded above and below and when the transport map lies in a H\"older class), which are often not reflected in practice. We present a unified methodology for obtaining rates of estimation of optimal transport maps in general function spaces. Our assumptions are significantly weaker than those appearing in the literature: we require only that the source measure $P$ satisfies a Poincar\'e inequality and that the optimal map be the gradient of a smooth convex function that lies in a space whose metric entropy can be controlled. As a special case, we recover known estimation rates for bounded densities and H\"older transport maps, but also obtain nearly sharp results in many settings not covered by prior work. For example, we provide the first statistical rates of estimation when $P$ is the normal distribution and the transport map is given by an infinite-width shallow neural network.
In this paper, we study the approximation properties of the variation spaces corresponding to shallow neural networks with a variety of activation functions. We introduce two main tools for estimating the metric entropy, approximation rates, and $n$-widths of these spaces. First, we introduce the notion of a smoothly parameterized dictionary and give upper bounds on the nonlinear approximation rates, metric entropy, and $n$-widths; the upper bounds depend upon the smoothness of the parameterization. This result is applied to dictionaries of ridge functions corresponding to shallow neural networks, and in many cases it improves upon existing results. Next, we provide a method for lower bounding the metric entropy and $n$-widths of variation spaces that contain certain classes of ridge functions. This result gives sharp lower bounds on the $L^2$ approximation rates, metric entropy, and $n$-widths of variation spaces for sigmoidal activation functions with bounded variation.
Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a fundamental component of statistical inference and machine learning. Modern methods for estimating these divergences rely on parametrizing an empirical variational form by a neural network (NN) and optimizing over the parameter space. Such neural estimators are used abundantly in practice, but the corresponding performance guarantees are partial and call for further exploration. In particular, there is a fundamental tradeoff between the two sources of error involved: approximation and empirical estimation. While the former needs the NN class to be rich and expressive, the latter relies on controlling complexity. We explore this tradeoff for estimators based on shallow NNs via non-asymptotic error bounds, focusing on four popular $\mathsf{f}$-divergences: Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory. The bounds reveal the tension between the NN size and the number of samples, and enable characterizing the scaling rates thereof that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with an appropriate NN growth rate are nearly minimax rate-optimal, achieving the parametric rate up to logarithmic factors.
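As a hedged toy sketch of the variational forms being parametrized, the snippet below evaluates the Donsker-Varadhan lower bound $\mathrm{KL}(P\|Q) \geq \mathbb{E}_P[T] - \log \mathbb{E}_Q[e^{T}]$ for hand-picked critics on Gaussian data; in neural estimation the critic $T$ would instead be a shallow network optimized over its parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x_p = rng.normal(1.0, 1.0, n)   # samples from P = N(1, 1)
x_q = rng.normal(0.0, 1.0, n)   # samples from Q = N(0, 1)

def dv_bound(T, x_p, x_q):
    """Donsker-Varadhan lower bound on KL(P || Q): E_P[T] - log E_Q[exp(T)]."""
    return T(x_p).mean() - np.log(np.mean(np.exp(T(x_q))))

# The log-density-ratio critic T*(x) = log p(x)/q(x) = x - 0.5 attains the bound exactly;
# a weaker critic gives a strictly smaller value.
print("optimal critic :", dv_bound(lambda x: x - 0.5, x_p, x_q))   # ~0.5, the true KL
print("weaker critic  :", dv_bound(lambda x: 0.5 * x, x_p, x_q))   # strictly below 0.5
print("closed-form KL :", 0.5 * (1.0 - 0.0) ** 2)
```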
Triangular flows, also known as Knöthe–Rosenblatt measure couplings, comprise an important building block of normalizing flow models for generative modeling and density estimation, including popular autoregressive flow models such as real-valued non-volume preserving transformations (Real NVP). We present statistical guarantees and sample complexity bounds for triangular flow statistical models. In particular, we establish the statistical consistency and finite-sample convergence rates of the Kullback–Leibler estimator of the Knöthe–Rosenblatt measure coupling using tools from empirical process theory. Our results highlight the anisotropic geometry of the function classes at play in triangular flows, shed light on optimal coordinate ordering, and lead to statistical guarantees for Jacobian flows. We conduct numerical experiments on synthetic data to illustrate the practical implications of our theoretical findings.
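As a hedged concrete instance of a Knöthe–Rosenblatt coupling: between two multivariate Gaussians, the increasing triangular map is the lower-triangular affine map built from Cholesky factors. The sketch below (with placeholder means and covariances) pushes samples through this map and checks the covariance of the pushforward.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000

# Source N(m1, S1) and target N(m2, S2), with S1, S2 positive definite.
m1, m2 = np.zeros(d), np.ones(d)
A = rng.standard_normal((d, d)); S1 = A @ A.T + d * np.eye(d)
B = rng.standard_normal((d, d)); S2 = B @ B.T + d * np.eye(d)

# Cholesky factors have positive diagonals; L2 @ inv(L1) is lower triangular with a
# positive diagonal, so x -> m2 + L2 L1^{-1} (x - m1) is the increasing triangular
# (Knothe-Rosenblatt) map pushing N(m1, S1) onto N(m2, S2).
L1, L2 = np.linalg.cholesky(S1), np.linalg.cholesky(S2)
T = L2 @ np.linalg.inv(L1)

X = rng.multivariate_normal(m1, S1, size=n)
Y = (X - m1) @ T.T + m2

print("target covariance:\n", S2)
print("pushforward covariance (empirical):\n", np.cov(Y, rowvar=False).round(2))
```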
Many applications, such as system identification, classification of time series, direct and inverse problems in partial differential equations, and uncertainty quantification lead to the question of approximation of a non-linear operator between metric spaces $\mathfrak{X}$ and $\mathfrak{Y}$. We study the problem of determining the degree of approximation of such operators on a compact subset $K_\mathfrak{X}\subset \mathfrak{X}$ using a finite amount of information. If $\mathcal{F}: K_\mathfrak{X}\to K_\mathfrak{Y}$, a well established strategy to approximate $\mathcal{F}(F)$ for some $F\in K_\mathfrak{X}$ is to encode $F$ (respectively, $\mathcal{F}(F)$) in terms of a finite number $d$ (respectively $m$) of real numbers. Together with appropriate reconstruction algorithms (decoders), the problem reduces to the approximation of $m$ functions on a compact subset of a high dimensional Euclidean space $\mathbb{R}^d$, equivalently, the unit sphere $\mathbb{S}^d$ embedded in $\mathbb{R}^{d+1}$. The problem is challenging because $d$, $m$, as well as the complexity of the approximation on $\mathbb{S}^d$ are all large, and it is necessary to estimate the accuracy keeping track of the inter-dependence of all the approximations involved. In this paper, we establish constructive methods to do this efficiently; i.e., with the constants involved in the estimates on the approximation on $\mathbb{S}^d$ being $\mathcal{O}(d^{1/6})$. We study different smoothness classes for the operators, and also propose a method for approximation of $\mathcal{F}(F)$ using only information in a small neighborhood of $F$, resulting in an effective reduction in the number of parameters involved.
We investigate the training and performance of generative adversarial networks using the Maximum Mean Discrepancy (MMD) as critic, termed MMD GANs. As our main theoretical contribution, we clarify the situation with bias in GAN loss functions raised by recent work: we show that gradient estimators used in the optimization process for both MMD GANs and Wasserstein GANs are unbiased, but learning a discriminator based on samples leads to biased gradients for the generator parameters. We also discuss the issue of kernel choice for the MMD critic, and characterize the kernel corresponding to the energy distance used for the Cramér GAN critic. Being an integral probability metric, the MMD benefits from training strategies recently developed for Wasserstein GANs. In experiments, the MMD GAN is able to employ a smaller critic network than the Wasserstein GAN, resulting in a simpler and faster-training algorithm with matching performance. We also propose an improved measure of GAN convergence, the Kernel Inception Distance, and show how to use it to dynamically adapt learning rates during GAN training.
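A hedged sketch of the unbiased (U-statistic) estimator of $\mathrm{MMD}^2$ with the cubic polynomial kernel used for the proposed Kernel Inception Distance; the feature arrays below are random placeholders rather than Inception activations.

```python
import numpy as np

def poly_kernel(X, Y):
    """Cubic polynomial kernel k(x, y) = (x^T y / d + 1)^3 (the KID choice)."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** 3

def mmd2_unbiased(X, Y):
    """Unbiased U-statistic estimate of MMD^2 between feature sets X and Y."""
    m, n = len(X), len(Y)
    kxx = poly_kernel(X, X)
    kyy = poly_kernel(Y, Y)
    kxy = poly_kernel(X, Y)
    sum_xx = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))   # drop diagonal (i = j) terms
    sum_yy = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return sum_xx + sum_yy - 2.0 * kxy.mean()

rng = np.random.default_rng(0)
feat_real = rng.standard_normal((1000, 64))          # placeholder "real" features
feat_fake = rng.standard_normal((1000, 64)) * 1.1    # placeholder "generated" features
print(f"KID-style MMD^2 estimate: {mmd2_unbiased(feat_real, feat_fake):.5f}")
```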
Generative adversarial networks (GANs) are well-known models for learning high-dimensional distributions, but the mechanism behind their generalization ability is not understood. In particular, GANs are vulnerable to the memorization phenomenon and eventually converge to the empirical distribution. We consider a simplified GAN model in which the generator is replaced by a density, and analyze how the discriminator contributes to generalization. We show that, with early stopping, the generalization error measured by the Wasserstein metric escapes the curse of dimensionality, even though memorization is unavoidable in the long run. In addition, we present a hardness-of-learning result for WGANs.