在本说明中,我们研究了如何使用单个隐藏层和RELU激活的神经网络插值数据,该数据是从径向对称分布中的,目标标签1处的目标标签1和单位球外部0,如果单位球内没有标签。通过重量衰减正则化和无限神经元的无限数据限制,我们证明存在独特的径向对称最小化器,其重量衰减正常器和Lipschitz常数分别为$ d $和$ \ sqrt {d} $。我们此外表明,如果标签$ 1 $强加于半径$ \ varepsilon $,而不仅仅是源头,则重量衰减正规剂会在$ d $中成倍增长。相比之下,具有两个隐藏层的神经网络可以近似目标函数,而不会遇到维度的诅咒。
translated by 谷歌翻译
在本文中,我们研究了与具有多种激活函数的浅神经网络相对应的变异空间的近似特性。我们介绍了两个主要工具,用于估计这些空间的度量熵,近似率和$ n $宽度。首先,我们介绍了平滑参数化词典的概念,并在非线性近似速率,度量熵和$ n $ widths上给出了上限。上限取决于参数化的平滑度。该结果适用于与浅神经网络相对应的脊功能的字典,并且在许多情况下它们的现有结果改善了。接下来,我们提供了一种方法,用于下限度量熵和$ n $ widths的变化空间,其中包含某些类别的山脊功能。该结果给出了$ l^2 $ approximation速率,度量熵和$ n $ widths的变化空间的急剧下限具有界变化的乙状结激活函数。
translated by 谷歌翻译
在这项工作中,我们通过整流电源单元激活功能导出浅神经网络的整体表示的公式。主要是,我们的第一件结果涉及REPU浅网络的非相似性表现能力。本文的多维结果表征了可以用有界规范和可能无界宽度表示的功能集。
translated by 谷歌翻译
了解通过随机梯度下降(SGD)训练的神经网络的特性是深度学习理论的核心。在这项工作中,我们采取了平均场景,并考虑通过SGD培训的双层Relu网络,以实现一个非变量正则化回归问题。我们的主要结果是SGD偏向于简单的解决方案:在收敛时,Relu网络实现输入的分段线性图,以及“结”点的数量 - 即,Relu网络估计器的切线变化的点数 - 在两个连续的训练输入之间最多三个。特别地,随着网络的神经元的数量,通过梯度流的解决方案捕获SGD动力学,并且在收敛时,重量的分布方法接近相关的自由能量的独特最小化器,其具有GIBBS形式。我们的主要技术贡献在于分析了这一最小化器产生的估计器:我们表明其第二阶段在各地消失,除了代表“结”要点的一些特定地点。我们还提供了经验证据,即我们的理论预测的不同可能发生与数据点不同的位置的结。
translated by 谷歌翻译
我们为特殊神经网络架构,称为运营商复发性神经网络的理论分析,用于近似非线性函数,其输入是线性运算符。这些功能通常在解决方案算法中出现用于逆边值问题的问题。传统的神经网络将输入数据视为向量,因此它们没有有效地捕获与对应于这种逆问题中的数据的线性运算符相关联的乘法结构。因此,我们介绍一个类似标准的神经网络架构的新系列,但是输入数据在向量上乘法作用。由较小的算子出现在边界控制中的紧凑型操作员和波动方程的反边值问题分析,我们在网络中的选择权重矩阵中促进结构和稀疏性。在描述此架构后,我们研究其表示属性以及其近似属性。我们还表明,可以引入明确的正则化,其可以从所述逆问题的数学分析导出,并导致概括属性上的某些保证。我们观察到重量矩阵的稀疏性改善了概括估计。最后,我们讨论如何将运营商复发网络视为深度学习模拟,以确定诸如用于从边界测量的声波方程中重建所未知的WAVESTED的边界控制的算法算法。
translated by 谷歌翻译
我们研究了神经网络中平方损耗训练问题的优化景观和稳定性,但通用非线性圆锥近似方案。据证明,如果认为非线性圆锥近似方案是(以适当定义的意义)比经典线性近似方法更具表现力,并且如果存在不完美的标签向量,则在方位损耗的训练问题必须在其中不稳定感知其解决方案集在训练数据中的标签向量上不连续地取决于标签向量。我们进一步证明对这些不稳定属性负责的效果也是马鞍点出现的原因和杂散的局部最小值,这可能是从全球解决方案的任意遥远的,并且既不训练问题也不是训练问题的不稳定性通常,杂散局部最小值的存在可以通过向目标函数添加正则化术语来克服衡量近似方案中参数大小的目标函数。无论可实现的可实现性是否满足,后一种结果都被证明是正确的。我们表明,我们的分析特别适用于具有可变宽度的自由结插值方案和深层和浅层神经网络的培训问题,其涉及各种激活功能的任意混合(例如,二进制,六骨,Tanh,arctan,软标志, ISRU,Soft-Clip,SQNL,Relu,Lifley Relu,Soft-Plus,Bent Identity,Silu,Isrlu和ELU)。总之,本文的发现说明了神经网络和一般非线性圆锥近似仪器的改进近似特性以直接和可量化的方式与必须解决的优化问题的不期望的性质链接,以便训练它们。
translated by 谷歌翻译
在负面的感知问题中,我们给出了$ n $数据点$({\ boldsymbol x} _i,y_i)$,其中$ {\ boldsymbol x} _i $是$ d $ -densional vector和$ y_i \ in \ { + 1,-1 \} $是二进制标签。数据不是线性可分离的,因此我们满足自己的内容,以找到最大的线性分类器,具有最大的\ emph {否定}余量。换句话说,我们想找到一个单位常规矢量$ {\ boldsymbol \ theta} $,最大化$ \ min_ {i \ le n} y_i \ langle {\ boldsymbol \ theta},{\ boldsymbol x} _i \ rangle $ 。这是一个非凸优化问题(它相当于在Polytope中找到最大标准矢量),我们在两个随机模型下研究其典型属性。我们考虑比例渐近,其中$ n,d \ to \ idty $以$ n / d \ to \ delta $,并在最大边缘$ \ kappa _ {\ text {s}}(\ delta)上证明了上限和下限)$或 - 等效 - 在其逆函数$ \ delta _ {\ text {s}}(\ kappa)$。换句话说,$ \ delta _ {\ text {s}}(\ kappa)$是overparametization阈值:以$ n / d \ le \ delta _ {\ text {s}}(\ kappa) - \ varepsilon $一个分类器实现了消失的训练错误,具有高概率,而以$ n / d \ ge \ delta _ {\ text {s}}(\ kappa)+ \ varepsilon $。我们在$ \ delta _ {\ text {s}}(\ kappa)$匹配,以$ \ kappa \ to - \ idty $匹配。然后,我们分析了线性编程算法来查找解决方案,并表征相应的阈值$ \ delta _ {\ text {lin}}(\ kappa)$。我们观察插值阈值$ \ delta _ {\ text {s}}(\ kappa)$和线性编程阈值$ \ delta _ {\ text {lin {lin}}(\ kappa)$之间的差距,提出了行为的问题其他算法。
translated by 谷歌翻译
We consider the problem of estimating a multivariate function $f_0$ of bounded variation (BV), from noisy observations $y_i = f_0(x_i) + z_i$ made at random design points $x_i \in \mathbb{R}^d$, $i=1,\ldots,n$. We study an estimator that forms the Voronoi diagram of the design points, and then solves an optimization problem that regularizes according to a certain discrete notion of total variation (TV): the sum of weighted absolute differences of parameters $\theta_i,\theta_j$ (which estimate the function values $f_0(x_i),f_0(x_j)$) at all neighboring cells $i,j$ in the Voronoi diagram. This is seen to be equivalent to a variational optimization problem that regularizes according to the usual continuum (measure-theoretic) notion of TV, once we restrict the domain to functions that are piecewise constant over the Voronoi diagram. The regression estimator under consideration hence performs (shrunken) local averaging over adaptively formed unions of Voronoi cells, and we refer to it as the Voronoigram, following the ideas in Koenker (2005), and drawing inspiration from Tukey's regressogram (Tukey, 1961). Our contributions in this paper span both the conceptual and theoretical frontiers: we discuss some of the unique properties of the Voronoigram in comparison to TV-regularized estimators that use other graph-based discretizations; we derive the asymptotic limit of the Voronoi TV functional; and we prove that the Voronoigram is minimax rate optimal (up to log factors) for estimating BV functions that are essentially bounded.
translated by 谷歌翻译
比较概率分布是许多机器学习算法的关键。最大平均差异(MMD)和最佳运输距离(OT)是在过去几年吸引丰富的关注的概率措施之间的两类距离。本文建立了一些条件,可以通过MMD规范控制Wassersein距离。我们的作品受到压缩统计学习(CSL)理论的推动,资源有效的大规模学习的一般框架,其中训练数据总结在单个向量(称为草图)中,该训练数据捕获与所考虑的学习任务相关的信息。在CSL中的现有结果启发,我们介绍了H \“较旧的较低限制的等距属性(H \”较旧的LRIP)并表明这家属性具有有趣的保证对压缩统计学习。基于MMD与Wassersein距离之间的关系,我们通过引入和研究学习任务的Wassersein可读性的概念来提供压缩统计学习的保证,即概率分布之间的某些特定于特定的特定度量,可以由Wassersein界定距离。
translated by 谷歌翻译
Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to log nfactors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential network parameters exceeding the sample size. The analysis gives some insights into why multilayer feedforward neural networks perform well in practice. Interestingly, for ReLU activation function the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that for nonparametric regression, scaling the network depth with the sample size is natural. It is also shown that under the composition assumption wavelet estimators can only achieve suboptimal rates.
translated by 谷歌翻译
We consider neural networks with a single hidden layer and non-decreasing positively homogeneous activation functions like the rectified linear units. By letting the number of hidden units grow unbounded and using classical non-Euclidean regularization tools on the output weights, they lead to a convex optimization problem and we provide a detailed theoretical analysis of their generalization performance, with a study of both the approximation and the estimation errors. We show in particular that they are adaptive to unknown underlying linear structures, such as the dependence on the projection of the input variables onto a low-dimensional subspace. Moreover, when using sparsity-inducing norms on the input weights, we show that high-dimensional non-linear variable selection may be achieved, without any strong assumption regarding the data and with a total number of variables potentially exponential in the number of observations. However, solving this convex optimization problem in infinite dimensions is only possible if the non-convex subproblem of addition of a new unit can be solved efficiently. We provide a simple geometric interpretation for our choice of activation functions and describe simple conditions for convex relaxations of the finite-dimensional non-convex subproblem to achieve the same generalization error bounds, even when constant-factor approximations cannot be found. We were not able to find strong enough convex relaxations to obtain provably polynomial-time algorithms and leave open the existence or non-existence of such tractable algorithms with non-exponential sample complexities.
translated by 谷歌翻译
本文通过引入几何深度学习(GDL)框架来构建通用馈电型型模型与可区分的流形几何形状兼容的通用馈电型模型,从而解决了对非欧国人数据进行处理的需求。我们表明,我们的GDL模型可以在受控最大直径的紧凑型组上均匀地近似任何连续目标函数。我们在近似GDL模型的深度上获得了最大直径和上限的曲率依赖性下限。相反,我们发现任何两个非分类紧凑型歧管之间始终都有连续的函数,任何“局部定义”的GDL模型都不能均匀地近似。我们的最后一个主要结果确定了数据依赖性条件,确保实施我们近似的GDL模型破坏了“维度的诅咒”。我们发现,任何“现实世界”(即有限)数据集始终满足我们的状况,相反,如果目标函数平滑,则任何数据集都满足我们的要求。作为应用,我们确认了以下GDL模型的通用近似功能:Ganea等。 (2018)的双波利馈电网络,实施Krishnan等人的体系结构。 (2015年)的深卡尔曼 - 滤波器和深度玛克斯分类器。我们构建了:Meyer等人的SPD-Matrix回归剂的通用扩展/变体。 (2011)和Fletcher(2003)的Procrustean回归剂。在欧几里得的环境中,我们的结果暗示了Kidger和Lyons(2020)的近似定理和Yarotsky和Zhevnerchuk(2019)无估计近似率的数据依赖性版本的定量版本。
translated by 谷歌翻译
着名的工作系列(Barron,1993; Bresiman,1993; Klusowski&Barron,2018)提供了宽度$ N $的界限,所需的relu两层神经网络需要近似函数$ f $超过球。 \ mathcal {b} _r(\ mathbb {r} ^ d)$最终$ \ epsilon $,当傅立叶的数量$ c_f = \ frac {1} {(2 \ pi)^ {d / 2}} \ int _ {\ mathbb {r} ^ d} \ | \ xi \ | ^ 2 | \ hat {f}(\ xi)| \ d \ xi $是有限的。最近ongie等。 (2019)将Radon变换用作分析无限宽度Relu两层网络的工具。特别是,他们介绍了基于氡的$ \ mathcal {r} $ - norms的概念,并显示$ \ mathbb {r} ^ d $上定义的函数可以表示为无限宽度的双层神经网络如果只有在$ \ mathcal {r} $ - norm是有限的。在这项工作中,我们扩展了Ongie等人的框架。 (2019)并定义类似的基于氡的半规范($ \ mathcal {r},\ mathcal {r} $ - norms),使得函数承认在有界开放式$ \ mathcal上的无限宽度神经网络表示{ u} \ subseteq \ mathbb {r} ^ d $当它$ \ mathcal {r}时,\ mathcal {u} $ - norm是有限的。建立在这方面,我们派生稀疏(有限宽度)神经网络近似界,其优化Breiman(1993); Klusowski&Barron(2018)。最后,我们表明有限开放集的无限宽度神经网络表示不是唯一的,并研究其结构,提供模式连接的功能视图。
translated by 谷歌翻译
这项调查的目的是介绍对深神经网络的近似特性的解释性回顾。具体而言,我们旨在了解深神经网络如何以及为什么要优于其他经典线性和非线性近似方法。这项调查包括三章。在第1章中,我们回顾了深层网络及其组成非线性结构的关键思想和概念。我们通过在解决回归和分类问题时将其作为优化问题来形式化神经网络问题。我们简要讨论用于解决优化问题的随机梯度下降算法以及用于解决优化问题的后传播公式,并解决了与神经网络性能相关的一些问题,包括选择激活功能,成本功能,过度适应问题和正则化。在第2章中,我们将重点转移到神经网络的近似理论上。我们首先介绍多项式近似中的密度概念,尤其是研究实现连续函数的Stone-WeierStrass定理。然后,在线性近似的框架内,我们回顾了馈电网络的密度和收敛速率的一些经典结果,然后在近似Sobolev函数中进行有关深网络复杂性的最新发展。在第3章中,利用非线性近似理论,我们进一步详细介绍了深度和近似网络与其他经典非线性近似方法相比的近似优势。
translated by 谷歌翻译
在本文中,我们通过任意大量的隐藏层研究了全连接的前馈深度Relu Ann,我们证明了在假设不正常化的概率密度函数下,在训练中具有随机初始化的GD优化方法的风险的融合在考虑的监督学习问题的输入数据的概率分布是分段多项式,假设目标函数(描述输入数据与输出数据之间的关系)是分段多项式,并且在假设风险函数下被认为的监督学习问题至少承认至少一个常规全球最低限度。此外,在浅句的特殊情况下只有一个隐藏的层和一维输入,我们还通过证明对每个LipsChitz连续目标功能的培训来验证这种假设,风险景观中存在全球最小值。最后,在具有Relu激活的深度广域的训练中,我们还研究梯度流(GF)差分方程的解决方案,并且我们证明每个非发散的GF轨迹会聚在临界点的多项式收敛速率(在限制意义上FR \'ECHET子提让性)。我们的数学融合分析造成了来自真实代数几何的工具,例如半代数函数和广义Kurdyka-Lojasiewicz不等式,从功能分析(如Arzel \)Ascoli定理等工具,在来自非本地结构的工具中作为限制FR \'echet子分子的概念,以及具有固定架构的浅印刷ANN的实现功能的事实形成由Petersen等人显示的连续功能集的封闭子集。
translated by 谷歌翻译
We consider the problem of estimating the optimal transport map between a (fixed) source distribution $P$ and an unknown target distribution $Q$, based on samples from $Q$. The estimation of such optimal transport maps has become increasingly relevant in modern statistical applications, such as generative modeling. At present, estimation rates are only known in a few settings (e.g. when $P$ and $Q$ have densities bounded above and below and when the transport map lies in a H\"older class), which are often not reflected in practice. We present a unified methodology for obtaining rates of estimation of optimal transport maps in general function spaces. Our assumptions are significantly weaker than those appearing in the literature: we require only that the source measure $P$ satisfies a Poincar\'e inequality and that the optimal map be the gradient of a smooth convex function that lies in a space whose metric entropy can be controlled. As a special case, we recover known estimation rates for bounded densities and H\"older transport maps, but also obtain nearly sharp results in many settings not covered by prior work. For example, we provide the first statistical rates of estimation when $P$ is the normal distribution and the transport map is given by an infinite-width shallow neural network.
translated by 谷歌翻译
在深度学习中的优化分析是连续的,专注于(变体)梯度流动,或离散,直接处理(变体)梯度下降。梯度流程可符合理论分析,但是风格化并忽略计算效率。它代表梯度下降的程度是深度学习理论的一个开放问题。目前的论文研究了这个问题。将梯度下降视为梯度流量初始值问题的近似数值问题,发现近似程度取决于梯度流动轨迹周围的曲率。然后,我们表明,在具有均匀激活的深度神经网络中,梯度流动轨迹享有有利的曲率,表明它们通过梯度下降近似地近似。该发现允许我们将深度线性神经网络的梯度流分析转换为保证梯度下降,其几乎肯定会在随机初始化下有效地收敛到全局最小值。实验表明,在简单的深度神经网络中,具有传统步长的梯度下降确实接近梯度流。我们假设梯度流动理论将解开深入学习背后的奥秘。
translated by 谷歌翻译
Neural networks trained to minimize the logistic (a.k.a. cross-entropy) loss with gradient-based methods are observed to perform well in many supervised classification tasks. Towards understanding this phenomenon, we analyze the training and generalization behavior of infinitely wide two-layer neural networks with homogeneous activations. We show that the limits of the gradient flow on exponentially tailed losses can be fully characterized as a max-margin classifier in a certain non-Hilbertian space of functions. In presence of hidden low-dimensional structures, the resulting margin is independent of the ambiant dimension, which leads to strong generalization bounds. In contrast, training only the output layer implicitly solves a kernel support vector machine, which a priori does not enjoy such an adaptivity. Our analysis of training is non-quantitative in terms of running time but we prove computational guarantees in simplified settings by showing equivalences with online mirror descent. Finally, numerical experiments suggest that our analysis describes well the practical behavior of two-layer neural networks with ReLU activations and confirm the statistical benefits of this implicit bias.
translated by 谷歌翻译
In this paper we prove Gamma-convergence of a nonlocal perimeter of Minkowski type to a local anisotropic perimeter. The nonlocal model describes the regularizing effect of adversarial training in binary classifications. The energy essentially depends on the interaction between two distributions modelling likelihoods for the associated classes. We overcome typical strict regularity assumptions for the distributions by only assuming that they have bounded $BV$ densities. In the natural topology coming from compactness, we prove Gamma-convergence to a weighted perimeter with weight determined by an anisotropic function of the two densities. Despite being local, this sharp interface limit reflects classification stability with respect to adversarial perturbations. We further apply our results to deduce Gamma-convergence of the associated total variations, to study the asymptotics of adversarial training, and to prove Gamma-convergence of graph discretizations for the nonlocal perimeter.
translated by 谷歌翻译
We generalize the classical universal approximation theorem for neural networks to the case of complex-valued neural networks. Precisely, we consider feedforward networks with a complex activation function $\sigma : \mathbb{C} \to \mathbb{C}$ in which each neuron performs the operation $\mathbb{C}^N \to \mathbb{C}, z \mapsto \sigma(b + w^T z)$ with weights $w \in \mathbb{C}^N$ and a bias $b \in \mathbb{C}$, and with $\sigma$ applied componentwise. We completely characterize those activation functions $\sigma$ for which the associated complex networks have the universal approximation property, meaning that they can uniformly approximate any continuous function on any compact subset of $\mathbb{C}^d$ arbitrarily well. Unlike the classical case of real networks, the set of "good activation functions" which give rise to networks with the universal approximation property differs significantly depending on whether one considers deep networks or shallow networks: For deep networks with at least two hidden layers, the universal approximation property holds as long as $\sigma$ is neither a polynomial, a holomorphic function, or an antiholomorphic function. Shallow networks, on the other hand, are universal if and only if the real part or the imaginary part of $\sigma$ is not a polyharmonic function.
translated by 谷歌翻译