We study the necessary and sufficient complexity of ReLU neural networks, in terms of depth and number of weights, required for approximating classifier functions in an $L^2$ sense. As a model class, we consider the set $E^\beta(\mathbb{R}^d)$ of possibly discontinuous piecewise $C^\beta$ functions $f : [-1/2, 1/2]^d \to \mathbb{R}$, where the different "smooth regions" of $f$ are separated by $C^\beta$ hypersurfaces. For given dimension $d \ge 2$, regularity $\beta > 0$, and accuracy $\varepsilon > 0$, we construct artificial neural networks with ReLU activation function that approximate functions from $E^\beta(\mathbb{R}^d)$ up to an $L^2$ error of $\varepsilon$. The constructed networks have a fixed number of layers, depending only on $d$ and $\beta$, and they have $O(\varepsilon^{-2(d-1)/\beta})$ many nonzero weights, which we prove to be optimal. For the proof of optimality, we establish a lower bound on the description complexity of the class $E^\beta(\mathbb{R}^d)$. By showing that a family of approximating neural networks gives rise to an encoder for $E^\beta(\mathbb{R}^d)$, we then prove that one cannot approximate a general function $f \in E^\beta(\mathbb{R}^d)$ using neural networks that are less complex than those produced by our construction. In addition to the optimality in terms of the number of weights, we show that in order to achieve this optimal approximation rate, one needs ReLU networks of a certain minimal depth. Precisely, for piecewise $C^\beta(\mathbb{R}^d)$ functions, this minimal depth is given, up to a multiplicative constant, by $\beta/d$. Up to a log factor, our constructed networks match this bound. This partly explains the benefits of depth for ReLU networks by showing that deep networks are necessary to achieve efficient approximation of (piecewise) smooth functions. Finally, we analyze approximation in high-dimensional spaces where the function $f$ to be approximated can be factorized into a smooth dimension-reducing feature map $\tau$ and a classifier function $g$, defined on a low-dimensional feature space, as $f = g \circ \tau$. We show that in this case the approximation rate depends only on the dimension of the feature space and not on the input dimension.
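In compressed form, the approximation guarantee stated above reads as follows (this is only a restatement of the claims in the abstract; the fixed layer count $L = L(d,\beta)$ and the implied constants are left unspecified): for every $f \in E^\beta(\mathbb{R}^d)$ and every $\varepsilon > 0$ there is a ReLU network $\Phi_\varepsilon$ with $L(d,\beta)$ layers such that
\[
  \bigl\| f - \Phi_\varepsilon \bigr\|_{L^2([-1/2,\,1/2]^d)} \;\le\; \varepsilon,
  \qquad
  \#\{\text{nonzero weights of } \Phi_\varepsilon\} \;=\; O\!\bigl(\varepsilon^{-2(d-1)/\beta}\bigr),
\]
and the lower bound shows that no family of ReLU networks achieving error $\varepsilon$ on all of $E^\beta(\mathbb{R}^d)$ can do so with asymptotically fewer nonzero weights.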
Deep neural networks have become the state of the art for a variety of practical machine learning tasks such as image classification, handwritten digit recognition, speech recognition, and game intelligence. This paper develops the fundamental limits of learning with deep neural networks by characterizing what is possible when no constraints are imposed on the learning algorithm or on the amount of training data. Concretely, we consider information-theoretically optimal approximation by deep neural networks, the guiding theme being the relation between the complexity of the function (class) to be approximated and the complexity of the approximating network, in terms of the connectivity and the memory requirements for storing the network topology and the associated quantized weights. The theory we develop reveals a remarkable universality of deep networks: specifically, deep networks are optimal approximants for markedly different function classes, such as affine systems and Gabor systems. This universality is afforded by a concurrent invariance property of deep networks to time shifts, scalings, and frequency shifts. In addition, deep networks provide exponential approximation accuracy, meaning that the approximation error decays exponentially in the number of nonzero weights in the network, for vastly different functions such as the squaring operation, multiplication, polynomials, sinusoidal functions, general smooth functions, and even one-dimensional oscillatory textures and fractal functions such as the Weierstrass function, the latter two of which have no known methods achieving exponential approximation accuracy. In summary, deep neural networks provide information-theoretically optimal approximation of a very wide range of functions and function classes used in mathematical signal processing.
This paper is concerned with the approximation and expressive power of deep neural networks. This is an active research area in which many interesting papers are currently appearing. The most common results in the literature prove that neural networks approximate functions with classical smoothness at the same rate as classical linear methods of approximation, for example, approximation by polynomials or by piecewise polynomials on prescribed partitions. However, approximation by neural networks depending on n parameters is a form of nonlinear approximation and should therefore be compared with other nonlinear methods, such as variable-knot splines or n-term approximation. The performance of neural networks in targeted applications, such as machine learning, indicates that they actually possess even greater approximation power than these traditional nonlinear methods. The main results of this paper prove that this is indeed the case. This is accomplished by exhibiting large classes of functions which can be efficiently captured by neural networks but for which the classical nonlinear methods fall short of the task. The paper purposefully restricts itself to studying the approximation of univariate functions by ReLU networks. Many generalizations to functions of several variables and to other activation functions can be envisioned. However, even in the simplest setting considered here, a theory that fully quantifies the approximation power of neural networks is still lacking.
In a recent paper, Caron and Fox suggest a probabilistic model for sparse graphs which are exchangeable when associating each vertex with a time parameter in $\mathbb{R}_+$. Here we show that by generalizing the classical definition of graphons as functions over probability spaces to functions over $\sigma$-finite measure spaces, we can model a large family of exchangeable graphs, including the Caron-Fox graphs and the traditional exchangeable dense graphs as special cases. Explicitly, modelling the underlying space of features by a $\sigma$-finite measure space $(S, \mathcal{S}, \mu)$ and the connection probabilities by an integrable function $W : S \times S \to [0, 1]$, we construct a random family $(G_t)_{t \ge 0}$ of growing graphs such that the vertices of $G_t$ are given by a Poisson point process on $S$ with intensity $t\mu$, with two points $x, y$ of the point process connected with probability $W(x, y)$. We call such a random family a graphon process. We prove that a graphon process has convergent subgraph frequencies (with possibly infinite limits) and that, in the natural extension of the cut metric to our setting, the sequence converges to the generating graphon. We also show that the underlying graphon is identifiable only as an equivalence class over graphons with cut distance zero. More generally, we study metric convergence for arbitrary (not necessarily random) sequences of graphs, and show that a sequence of graphs has a convergent subsequence if and only if it has a subsequence satisfying a property we call uniform regularity of tails. Finally, we prove that every graphon is equivalent to a graphon on $\mathbb{R}_+$ equipped with Lebesgue measure.
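As an illustration of the sampling scheme described above, the following is a minimal simulation sketch, assuming a concrete choice of feature space ($\mathbb{R}_+$ with Lebesgue measure, truncated to $[0, x_{\max}]$ so that the Poisson process has finitely many points) and a concrete integrable graphon $W(x, y) = e^{-x-y}$; neither choice comes from the abstract.

```python
import numpy as np

def sample_graphon_process(t, x_max=10.0, seed=0, W=lambda x, y: np.exp(-x - y)):
    """Sample G_t: a Poisson point process on [0, x_max] with intensity t * Lebesgue,
    with two points x, y connected independently with probability W(x, y).
    The truncation to [0, x_max] and the choice of W are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(t * x_max)                # number of vertices ~ Poisson(t * mu([0, x_max]))
    feats = rng.uniform(0.0, x_max, size=n)   # vertex features = points of the Poisson process
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < W(feats[i], feats[j]):
                edges.append((i, j))
    return feats, edges

feats, edges = sample_graphon_process(t=50.0)
print(len(feats), "vertices,", len(edges), "edges")
```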
Performance bounds for criteria for model selection are developed using recent theory for sieves. The model selection criteria are based on an empirical loss or contrast function with an added penalty term motivated by empirical process theory and roughly proportional to the number of parameters needed to describe the model divided by the number of observations. Most of our examples involve density or regression estimation settings and we focus on the problem of estimating the unknown density or regression function. We show that the quadratic risk of the minimum penalized empirical contrast estimator is bounded by an index of the accuracy of the sieve. This accuracy index quantifies the trade-off among the candidate models between the approximation error and parameter dimension relative to sample size. If we choose a list of models which exhibit good approximation properties with respect to different classes of smoothness, the estimator can be simultaneously minimax rate optimal in each of those classes. This is what is usually called adaptation. The type of classes of smoothness in which one gets adaptation depends heavily on the list of models. If too many models are involved in order to get accurate approximation of many wide classes of functions simultaneously, it may happen that the estimator is only approximately adaptive (typically up to a slowly varying function of the sample size). We shall provide various illustrations of our method such as penalized maximum likelihood, projection or least squares estimation. The models will involve commonly used finite dimensional expansions such as piecewise polynomials with fixed or variable knots, trigonometric polynomials, wavelets, neural nets and related nonlinear expansions defined by superposition of ridge functions.
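A schematic form of the criterion described above, written only to fix notation (the symbols $\gamma_n$, $\hat{s}_m$, $D_m$ are generic placeholders, not taken from the paper): one selects the model minimizing a penalized empirical contrast,
\[
  \hat{m} \;=\; \arg\min_{m \in \mathcal{M}} \Bigl\{ \gamma_n(\hat{s}_m) + \mathrm{pen}(m) \Bigr\},
  \qquad
  \mathrm{pen}(m) \;\asymp\; \frac{D_m}{n},
\]
where $\gamma_n$ is the empirical contrast, $\hat{s}_m$ the minimum-contrast estimator within model $m$, $D_m$ the number of parameters of model $m$, and $n$ the number of observations, matching the "parameters divided by observations" scaling of the penalty stated in the abstract.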
The 'classical' random graph models, in particular G(n, p), are 'homogeneous', in the sense that the degrees (for example) tend to be concentrated around a typical value. Many graphs arising in the real world do not have this property, having, for example, power-law degree distributions. Thus there has been a lot of recent interest in defining and studying 'inhomogeneous' random graph models. One of the most studied properties of these new models is their 'robustness', or, equivalently, the 'phase transition' as an edge density parameter is varied. For G(n, p), p = c/n, the phase transition at c = 1 has been a central topic in the study of random graphs for well over 40 years. Many of the new inhomogeneous models are rather complicated; although there are exceptions, in most cases precise questions such as determining exactly the critical point of the phase transition are approachable only when there is independence between the edges. Fortunately, some models studied have this property already, and others can be approximated by models with independence. Here we introduce a very general model of an inhomogeneous random graph with (conditional) independence between the edges, which scales so that the number of edges is linear in the number of vertices. This scaling corresponds to the p = c/n scaling for G(n, p) used to study the phase transition; also, it seems to be a property of many large real-world graphs. Our model includes as special cases many models previously studied. We show that, under one very weak assumption (that the expected number of edges is 'what it should be'), many properties of the model can be determined, in particular the critical point of the phase transition, and the size of the giant component above the transition. We do this by relating our random graphs to branching processes, which are much easier to analyze. We also consider other properties of the model, showing, for example, that when there is a giant component, it is 'stable': for a typical random graph, no matter how we add or delete o(n) edges, the size of the giant component does not change by more than o(n).
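As an illustration of the scaling being described, one common way to write such a model (this specific kernel parametrization is an assumption, not taken from the text) assigns each vertex $i$ a type $x_i$ and includes edges independently with
\[
  \Pr\bigl(\{i,j\} \in E \mid x_1, \dots, x_n\bigr) \;=\; \min\!\Bigl\{ \tfrac{\kappa(x_i, x_j)}{n},\, 1 \Bigr\},
\]
so that, for bounded integrable $\kappa$, the expected number of edges is linear in the number of vertices $n$, mirroring the $p = c/n$ scaling of $G(n, p)$ discussed above.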
We show that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network to more than a certain constant accuracy, unless its width is exponential in the dimension. The result holds for virtually all known activation functions, including rectified linear units, sigmoids and thresholds, and formally demonstrates that depth, even if increased by 1, can be exponentially more valuable than width for standard feedforward neural networks. Moreover, compared with related results in the context of Boolean functions, our result requires fewer assumptions, and the proof techniques and construction are very different.
Consider the multivariate nonparametric regression model. It is shown that estimators based on sparsely connected deep neural networks with ReLU activation function and properly chosen network architecture achieve the minimax rates of convergence (up to log n-factors) under a general composition assumption on the regression function. The framework includes many well-studied structural constraints such as (generalized) additive models. While there is a lot of flexibility in the network architecture, the tuning parameter is the sparsity of the network. Specifically, we consider large networks with number of potential network parameters exceeding the sample size. The analysis gives some insights into why multilayer feedforward neural networks perform well in practice. Interestingly, for ReLU activation function the depth (number of layers) of the neural network architectures plays an important role and our theory suggests that for nonparametric regression, scaling the network depth with the sample size is natural. It is also shown that under the composition assumption wavelet estimators can only achieve suboptimal rates.
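Schematically, the composition assumption referred to above can be written as follows; this is a recollection of the general structural form (the precise smoothness and dimension bookkeeping is in the paper, and the symbols $q$, $d_i$, $t_i$, $\beta_i$ are placeholders):
\[
  f \;=\; g_q \circ g_{q-1} \circ \cdots \circ g_0,
\]
where each $g_i$ maps a $d_i$-dimensional domain to a $d_{i+1}$-dimensional one and has components that are $\beta_i$-smooth and depend on at most $t_i$ of their arguments; additive models correspond to a shallow composition of this kind, and the convergence rate is governed by the pairs $(\beta_i, t_i)$ rather than by the ambient input dimension.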
This paper deals with sparse approximations by means of convex combinations of elements from a predetermined "basis" subset $S$ of a function space. Specifically, the focus is on the rate at which the lowest achievable error can be reduced as larger subsets of $S$ are allowed when constructing an approximant. The new results extend those given for Hilbert spaces by Jones and Barron, including in particular a computationally attractive incremental approximation scheme. Bounds are derived for broad classes of Banach spaces; in particular, for $L^p$ spaces with $1 < p < \infty$, the $O(n^{-1/2})$ bounds of Barron and Jones are recovered when $p = 2$. One motivation for the questions studied here arises from the area of "artificial neural networks," where the problem can be stated in terms of the growth in the number of "neurons" (the elements of $S$) needed in order to achieve a desired error rate. The focus on non-Hilbert spaces is due to the desire to understand approximation in the more "robust" (resistant to exemplar noise) $L^p$, $1 \le p < 2$, norms. The techniques used borrow from results regarding moduli of smoothness in functional analysis as well as from the theory of stochastic processes on function spaces.
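A schematic version of the kind of incremental (greedy) scheme referred to above, written as an assumption about its general shape rather than the paper's exact algorithm: at step $n$ the current approximant is mixed convexly with one newly selected element of $S$,
\[
  f_n \;=\; (1 - \alpha_n)\, f_{n-1} + \alpha_n\, g_n,
  \qquad g_n \in S,\ \ \alpha_n \in [0, 1],
\]
with $g_n$ and $\alpha_n$ chosen to (approximately) minimize the resulting error; in the Hilbert-space case $p = 2$, schemes of this type achieve the $O(n^{-1/2})$ rate mentioned above.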
Over the last two decades, much effort has been devoted to characterizing the situations in which reservoir computing systems exhibit the so-called echo state and fading memory properties. In mathematical terms, these important features amount to the existence and continuity of global reservoir system solutions. This paper complements that research by fully characterizing the differentiability of reservoir filters for a very general class of discrete-time deterministic inputs. The local nature of differentiability makes it possible to formulate conditions that guarantee the existence of differentiable local and global fading memory solutions, which connects with existing research on the input-dependent nature of the echo state property. Using Taylor's theorem, Volterra-type series representations of reservoir filters with semi-infinite discrete-time inputs are constructed in the analytic case, and corresponding approximation bounds are provided. Finally, it is shown as a corollary of these results that any fading memory filter can be uniformly approximated by a finite Volterra series with finite memory.
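For orientation, a schematic Volterra-type series representation of a filter acting on a semi-infinite discrete-time input $(z_t)_{t \le 0}$, written only to indicate the general shape (the precise form and convergence conditions are those of the paper and are not reproduced here):
\[
  y_0 \;=\; h_0 \;+\; \sum_{j \ge 1} \;\sum_{-\infty < t_1, \dots, t_j \le 0} h_j(t_1, \dots, t_j)\, z_{t_1} \cdots z_{t_j},
\]
where the kernels $h_j$ must decay fast enough for the series to converge; the final statement of the abstract says that truncating both the order $j$ and the memory of such a series still yields a uniform approximation of any fading memory filter.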
A new class of non-homogeneous state-affine systems is introduced for reservoir computing. Sufficient conditions are identified that guarantee, first, that the associated reservoir computers with linear readouts are causal and time-invariant and satisfy the fading memory property and, second, that a subset of this class is universal in the category of fading memory filters with stochastic, almost surely uniformly bounded inputs. This means that any discrete-time filter that satisfies the fading memory property with random inputs of that type can be uniformly approximated by elements of the non-homogeneous state-affine family.
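As a sketch (a recollection of the standard form of such systems, not a quotation from the abstract), a state-affine reservoir system with linear readout driven by an input sequence $(z_t)$ can be written as
\[
  x_t \;=\; p(z_t)\, x_{t-1} \;+\; q(z_t),
  \qquad
  y_t \;=\; W x_t,
\]
where $p$ and $q$ are polynomial (matrix- and vector-valued) functions of the input and $W$ is the trained linear readout; the universality statement above says that filters of this form, for suitable $p$, $q$, and $W$, uniformly approximate any fading memory filter of the stated type.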
We provide several new depth-based separation results for feed-forward neural networks, proving that various types of simple and natural functions can be better approximated using deeper networks than shallower ones, even if the shallower networks are much larger. This includes indicators of balls and ellipses; non-linear functions which are radial with respect to the $L_1$ norm; and smooth non-linear functions. We also show that these gaps can be observed experimentally: Increasing the depth indeed allows better learning than increasing width, when training neural networks to learn an indicator of a unit ball.
This is a survey of nonlinear approximation, especially that part of the subject which is important in numerical computation. Nonlinear approximation means that the approximants do not come from linear spaces but rather from nonlinear manifolds. The central question to be studied is what, if any, are the advantages of nonlinear approximation over the simpler, more established, linear methods. This question is answered by studying the rate of approximation which is the decrease in error versus the number of parameters in the approximant. The number of parameters usually correlates well with computational effort. It is shown that in many settings the rate of nonlinear approximation can be characterized by certain smoothness conditions which are significantly weaker than required in the linear theory. Emphasis in the survey will be placed on approximation by piecewise polynomials and wavelets as well as their numerical implementation. Results on highly nonlinear methods such as optimal basis selection and greedy algorithms (adaptive pursuit) are also given. Applications to image processing, statistical estimation, regularity for PDEs, and adaptive algorithms are discussed.
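To make the notion of nonlinear (n-term) approximation concrete, here is a minimal sketch, assuming an orthonormal wavelet-type basis so that the best n-term approximation is obtained by keeping the n coefficients of largest magnitude; the basis and data here are illustrative placeholders.

```python
import numpy as np

def n_term_approximation(coeffs, n):
    """Nonlinear n-term approximation in an orthonormal basis:
    keep the n coefficients of largest magnitude, zero out the rest.
    The set of kept indices depends on the target function, which is
    what makes the method nonlinear (approximants form no linear space)."""
    approx = np.zeros_like(coeffs)
    keep = np.argsort(np.abs(coeffs))[-n:]   # indices of the n largest |coefficients|
    approx[keep] = coeffs[keep]
    return approx

# Illustrative use: a sparse coefficient sequence is captured well by few terms.
rng = np.random.default_rng(0)
c = rng.standard_normal(1024) * (rng.random(1024) < 0.05)   # mostly zero coefficients
err = np.linalg.norm(c - n_term_approximation(c, 64))
print("l2 error with 64 terms:", err)
```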
Wavelet bases and frames consisting of band limited functions of nearly exponential localization on $\mathbb{R}^d$ are a powerful tool in harmonic analysis by making various spaces of functions and distributions more accessible for study and utilization, and providing sparse representation of natural function spaces (e.g. Besov spaces) on $\mathbb{R}^d$. Such frames are also available on the sphere and in more general homogeneous spaces, on the interval and ball. The purpose of this article is to develop band limited well-localized frames in the general setting of Dirichlet spaces with doubling measure and a local scale-invariant Poincaré inequality which lead to heat kernels with small time Gaussian bounds and Hölder continuity. As an application of this construction, band limited frames are developed in the context of Lie groups or homogeneous spaces with polynomial volume growth, complete Riemannian manifolds with Ricci curvature bounded from below and satisfying the volume doubling property, and other settings. The new frames are used for decomposition of Besov spaces in this general setting.
We elucidate the close connection between the repulsive lattice gas in equilibrium statistical mechanics and the Lovász local lemma in probabilistic combinatorics. We show that the conclusion of the Lovász local lemma holds for dependency graph $G$ and probabilities $\{p_x\}$ if and only if the independent-set polynomial for $G$ is nonvanishing in the polydisc of radii $\{p_x\}$. Furthermore, we show that the usual proof of the Lovász local lemma, which provides a sufficient condition for this to occur, corresponds to a simple inductive argument for the nonvanishing of the independent-set polynomial in a polydisc, which was discovered implicitly by Shearer [98] and explicitly by Dobrushin [37, 38]. We also present some refinements and extensions of both arguments, including a generalization of the Lovász local lemma that allows for "soft" dependencies. In addition, we prove some general properties of the partition function of a repulsive lattice gas, most of which are consequences of the alternating-sign property for the Mayer coefficients. We conclude with a brief discussion of the repulsive lattice gas on countably infinite graphs.
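For reference, the independent-set polynomial appearing in the equivalence above has the standard form (notation here is generic, not quoted from the paper), and the condition is nonvanishing on the closed polydisc of radii $\{p_x\}$:
\[
  Z_G(w) \;=\; \sum_{\substack{I \subseteq V(G) \\ I \text{ independent}}} \; \prod_{x \in I} w_x,
  \qquad
  Z_G(w) \neq 0 \ \ \text{whenever } |w_x| \le p_x \ \text{for all } x \in V(G).
\]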
In this survey we discuss various approximation-theoretic problems that arise in the multilayer feedforward perceptron (MLP) model in neural networks. The MLP model is one of the more popular and practical of the many neural network models. Mathematically it is also one of the simpler models. Nonetheless the mathematics of this model is not well understood, and many of these problems are approximation-theoretic in character. Most of the research we will discuss is of very recent vintage. We will report on what has been done and on various unanswered questions. We will not be presenting practical (algorithmic) methods. We will, however, be exploring the capabilities and limitations of this model. In the first two sections we present a brief introduction and overview of neural networks and the multilayer feedforward perceptron model. In Section 3 we discuss in great detail the question of density. When does this model have the theoretical ability to approximate any reasonable function arbitrarily well? In Section 4 we present conditions for simultaneously approximating a function and its derivatives. Section 5 considers the interpolation capability of this model. In Section 6 we study upper and lower bounds on the order of approximation of this model. The material presented in Sections 3-6 treats the single hidden layer MLP model. In Section 7 we discuss some of the differences that arise when considering more than one hidden layer. The lengthy list of references includes many papers not cited in the text, but relevant to the subject matter of this survey.
Over the past decade, differential privacy has seen remarkable success as a rigorous and practical formalization of data privacy. But it also has some well-known weaknesses: notably, it does not handle composition tightly. This weakness has motivated several recent relaxations of differential privacy based on Rényi divergences. We propose an alternative relaxation of differential privacy, which we term "$f$-differential privacy", which has a number of appealing properties and avoids some of the difficulties associated with divergence-based relaxations. First, it preserves the hypothesis testing interpretation of differential privacy, which makes its guarantees easy to interpret. It allows for lossless reasoning about composition and post-processing and, in particular, a direct way to import existing tools from differential privacy, including privacy amplification by subsampling. We define a canonical single-parameter family of definitions within our class, "Gaussian differential privacy", based on hypothesis testing of two shifted Gaussian distributions. We show that this family is focal by proving a central limit theorem, which shows that the privacy guarantees of any hypothesis-testing-based definition of privacy (including differential privacy) converge to Gaussian differential privacy in the limit under composition. We also prove a finite (Berry-Esseen style) version of the central limit theorem, which provides a useful tool for tractably analyzing the exact composition of potentially complicated expressions. We demonstrate the use of the tools we develop by giving an improved analysis of the privacy guarantees of noisy stochastic gradient descent.
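As a sketch of the hypothesis-testing viewpoint (recalled from the standard formulation, so the exact expression should be treated as an assumption rather than a quotation): Gaussian differential privacy with parameter $\mu$ corresponds to the trade-off curve of testing $N(0,1)$ against $N(\mu,1)$,
\[
  G_\mu(\alpha) \;=\; \Phi\!\bigl(\Phi^{-1}(1-\alpha) - \mu\bigr), \qquad \alpha \in [0, 1],
\]
where $\Phi$ is the standard normal CDF, $\alpha$ is the type I error of a test distinguishing the mechanism's outputs on two neighboring datasets, and $G_\mu(\alpha)$ lower-bounds the achievable type II error; a mechanism is $\mu$-GDP if its trade-off function lies everywhere above $G_\mu$.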
The paper reviews and extends an emerging body of theoretical results on deep learning, including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.
Existing depth separation results for constant-depth networks essentially show that certain radial functions in $\mathbb{R}^d$, which can be easily approximated with depth-$3$ networks, cannot be approximated by depth-$2$ networks, even up to constant accuracy, unless their size is exponential in $d$. However, the functions used to demonstrate this are rapidly oscillating, with a Lipschitz parameter scaling with the dimension $d$ (equivalently, by rescaling the function, the hardness result applies to $\mathcal{O}(1)$-Lipschitz functions only when the target accuracy $\epsilon$ is at most $\text{poly}(1/d)$). In this paper, we study whether such depth separations still hold in the natural setting of $\mathcal{O}(1)$-Lipschitz radial functions, when $\epsilon$ does not scale with $d$. Perhaps surprisingly, we show that the answer is negative: in contrast to the intuition suggested by previous works, it is possible to approximate $\mathcal{O}(1)$-Lipschitz radial functions with depth-$2$, size-$\text{poly}(d)$ networks, for every constant $\epsilon$. We complement this by showing that these functions can also be approximated with depth-$2$, size-$\text{poly}(1/\epsilon)$ networks, for every constant $d$. Finally, we prove that a simultaneous polynomial dependence on both $d$ and $1/\epsilon$ is not possible. Overall, our results indicate that in order to show depth separations for expressing $\mathcal{O}(1)$-Lipschitz functions, if this is possible at all, one would need techniques fundamentally different from those existing in the literature.
Many learning problems are described by a risk functional which in turn is defined by a loss function, and a straightforward and widely-known approach to learn such problems is to minimize a (modified) empirical version of this risk functional. However, in many cases this approach suffers from substantial problems such as computational requirements in classification or robustness concerns in regression. In order to resolve these issues many successful learning algorithms try to minimize a (modified) empirical risk of a surrogate loss function, instead. Of course, such a surrogate loss must be "reasonably related" to the original loss function since otherwise this approach cannot work well. For classification good surrogate loss functions have been recently identified, and the relationship between the excess classification risk and the excess risk of these surrogate loss functions has been exactly described. However, beyond the classification problem little is known on good surrogate loss functions up to now. In this work we establish a general theory that provides powerful tools for comparing excess risks of different loss functions. We then apply this theory to several learning problems including (cost-sensitive) classification, regression, density estimation, and density level detection.
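The kind of statement such a comparison theory delivers can be sketched as follows (a generic schematic, not the paper's actual theorem): for a target loss $L$ and a surrogate loss $L_{\mathrm{sur}}$ that are "reasonably related", there exists an increasing function $\psi$ with $\psi(0) = 0$ such that
\[
  \mathcal{R}_{L}(f) - \mathcal{R}_{L}^{*}
  \;\le\;
  \psi\!\bigl(\mathcal{R}_{L_{\mathrm{sur}}}(f) - \mathcal{R}_{L_{\mathrm{sur}}}^{*}\bigr)
  \quad \text{for all measurable } f,
\]
so that driving the excess surrogate risk to zero forces the excess target risk to zero at a quantifiable rate; the classification setting mentioned above is the case in which such relationships had already been described exactly.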