Optimal bump functions for shallow ReLU networks
Weight decay, depth separation and the curse of dimensionality

Stephan Wojtowytsch Stephan Wojtowytsch
Department of Mathematics
Texas A&M University
155 Ireland Street
College Station, TX 77840
swoj@tamu.edu
September 5, 2022
Abstract.

In this note, we study how neural networks with a single hidden layer and ReLU activation interpolate data drawn from a radially symmetric distribution with target labels 1 at the origin and 0 outside the unit ball, if no labels are known inside the unit ball. With weight decay regularization and in the infinite neuron, infinite data limit, we prove that a unique radially symmetric minimizer exists, whose weight decay regularizer and Lipschitz constant grow as and respectively.

We furthermore show that the weight decay regularizer grows exponentially in if the label is imposed on a ball of radius rather than just at the origin. By comparison, a neural networks with two hidden layers can approximate the target function without encountering the curse of dimensionality.

Key words and phrases:
Deep learning, depth separation, Barron space, Radon-BV, compact support, mollifier, weight decay, minimum norm solution, symmetry learning, explicit regularization, curse of dimensionality, radial symmetry
2020 Mathematics Subject Classification:
68T07, 65D40, 41A30

1. Introduction

Neural networks have revolutionized fields from computer vision [KSH12] to natural language processing [VSP17]. They are the driving force behind AIs which play strategy games at superhuman levels of proficiency [SHS18, SHM16, SSS17], facilitated major advances in scientific problems such as protein folding [TAW21, JEP21], and have been used for computer-assisted proofs in applied mathematics [WLGSB22]. While empirical evidence indicates that they often generalize well to previously unseen data when trained appropriately, there is little rigorous understanding of how neural networks interpolate a function between known data points.

In this article, we provide insight in the simple setting of infinitely wide ReLU networks with a single hidden layer and data which are drawn from a radially symmetric distribution on a Euclidean space . The target function satisfies and for , where denotes the Euclidean norm on . We consider a loss functional composed of an -error and a weight decay regularizer. Despite the fact that neural networks with a single hidden layer cannot represent compactly supported target functions exactly [Lu21], there are such functions which can be approximated efficiently even in high dimension. Here, we construct optimal infinitely wide networks, and show that the weight decay regularizer grows only linearly in the dimension of the data space, improving on the quadratic upper bound established by [OWSS19].

While highly idealized, this setting allows us to study several important aspects of neural network models:

  1. Learning symmetries. The target function has two important symmetries:

    • is radially symmetric. While it is impossible to fit this symmetry exactly by finite networks, it can be attained asymptotically for highly overparametrized networks. More precisely, one could ask whether regularized risk minimization leads to symmetry learning. While we show that a unique radially symmetric solution exists, it remains open whether other solutions exist which do not exhibit radial symmetry.

    • almost everywhere with respect to the data distribution. Unlike linear models, which necessarily output negative data even if all training data points are positive, neural networks have the capacity to respect this constraint. We show that risk minimization asymptotically enforces the bound everywhere on the data space, at least for the unique radially symmetric solution.

  2. Fitting random or perturbed data. It is known that overparametrized neural networks can fit random data, but due to the great generality of the result, the network weights may be prohibitively large for given data. Assuming that labels are generated by a function which can be approximated well by a neural network, compactly supported bump functions can be used to obtain an upper bound on the magnitude needed for specific labels or the increase necessary in the weight decay regularizer if the labels are perturbed.

  3. Depth separation and curse of dimensionality. We prove two complimentary results:

    • In dimension , there exists an infinitely wide ReLU network with one hidden layer with weight decay regularizer such that and if .

    • If is an infinitely wide ReLU network with one hidden layer such that for and if , then the weight decay regularizer of grows at least exponentially as in the dimension of the data space.

    The curse of dimensionality can be avoided in the second situation by using a neural network with two hidden layers, for which the weight decay regularizer only grows as .

  4. Effect of regularization. Weight decay regularization is often taken as a proxy for controlling the Lipschitz constant of a neural network, as it can be computed more easily. In this highly symmetric setting, we can compare two optimal solutions:

    1. The data is fitted optimally by the function , which attains the minimal Lipschitz constant . The function cannot be represented by a ReLU network with a single hidden layer and finite weights, even in the infinite width limit [EW20b, Example 5.19]. It can be presented by a neural network with two infinitely wide hidden layers and weight decay .

    2. The weight decay regularizer of the optimal two-layer ReLU network grows like , while its Lipschitz constant grows like .

  5. Highly localized peaks. The target function can be seen as the prototypical example of learning functions which take values at isolated points which are separated as ‘islands’ in a ‘sea’ of points with labels .

  6. Mollification. The infinitely wide neural networks constructed in this note can be used to establish approximation rates in function spaces for shallow neural networks by mollification, if the mollification width is optimized to balance the competition between approximation of the target function by the infinitely wide network and approximation of the infinitely wide network by finite neural networks.

To the best of our knowledge, this is the first time that an optimal solution for fitting data by neural networks has been computed in dimension . For technical reasons, we focus on the case that is odd. The optimal radial solution can be written as a finite sum

for some coefficients satisfying .

The article is organized as follows. In the remainder of the Introduction, we briefly review the context of this work in the literature and the notation we will use throughout the article. In Section 2, we give a brief introduction to the function spaces associated to two-layer ReLU networks with a weight decay regularizer (Barron or Radon BV spaces). Sections 3 and 4 are dedicated to the statement and proof of our main results respectively. Applications of our results can be found in Section 5. Numerical approximations of the optimal solutions can be found in Section 6. We conclude the article with a brief summary and list of open problems in Section 7.

Further numerical experiments can be found in Appendix A. Some proofs from the main part of the article are postponed to Appendix B, while proofs of results which are known in similar form are postponed to Appendix C. Slight extensions of the main results can be found in Appendix D.

1.1. Previous work

The complexity of a neural network is often measured by the number of its non-zero coefficients (weights) [LWK17, SSVB17, GKNV22] or by a measure of their magnitude. From a practical perspective, both are crucial pieces of information: a neural network with an excessive number of non-zero connections is expensive to store and evaluate, while a network with very large coefficients is likely to depend on subtle cancellations at training data points and unlikely to generalize well to unseen data.

[Bar93] realized that a large class of functions can be approximated efficiently by neural networks with a single hidden layer and any sigmoidal activation function while keeping the outer layer coefficients bounded. The function class is defined in terms of a spectral criterion and diverse enough that any linear method of approximation must face the curse of dimensionality in it.

Subsequently, function approximation by ReLU networks with a single hidden layer and bounded coefficients in both layers was studied in [Bac17, EMW18, EMW19b, EW20b]. Optimal rates of approximation were obtained in [SX19, SX21b]. A spectral criterion for this scenario in terms of the Fourier transform was developed in [KB18], and a sharp criterion in terms of the Radon transform in [OWSS19, PN21]. A detailed study of Fourier-like criteria in this context is given by [CPV20].

The norm in these function spaces is related to the popular explicit ‘weight decay’ regularizer (the -norm of the network weights). It retains significance in the context of implicit regularization, as [CB20] showed that infinitely wide two-layer ReLU networks converge to minimum norm/maximum margin classifiers with respect to the weight decay norm, when trained by a gradient flow optimizer for binary classification with logistic loss.

While the structure of the function spaces has been studied and many of their functional analytic properties are understood [EW20b, PN21, SX21b, SX21a], explicit examples remain rare. Spectral criteria have been used to show that functions in certain smoothness classes can be expressed as infinitely wide two-layer networks with finite weight-decay norm. [EW22] construct a maximum margin classifier in a simple one-dimensional scenario. A structure theorem is given in [EW20b] to easily demonstrate that certain functions cannot be expressed this way. Closest to the present work is [Han21], where the minimum norm interpolants of a finite one-dimensional data set are studied.

Much of the work on ReLU-activated two-layer networks makes heavy use of the homogeneity of the activation function. Two-layer neural networks with arbitrary activation are studied e.g. in [SX20, LMW20]. Partial (and different) extensions to deeper neural networks can be found e.g. in [PN22, EW20a], while residual neural networks of continuous depth (‘neural ODEs’) have been studied from this perspective in [EMW19a, EMW19b, EMW19c].

1.2. Notation

We denote by the average integral over a set which has finite measure for a measure , i.e. . By we mean that we integrate with respect to the (signed) measure in the variable . In this article, will always be a measure (often signed), while denotes the exterior normal vector field on a sphere.

The natural -dimensional area (Hausdorff) measure is denoted by . In this article, it will always refer to the (unnormalized) uniform distribution on a -dimensional sphere.

The total variation norm of a measure on a measurable space is defined as , where is the Hahn decomposition of the signed measure .

In the following, is always going to be a function of one variable and is going to be a radially symmetric function on . By an abuse of notation, we will also consider defined by . We denote by

a quotient related to the area of hyperspheres in dimension and , and by a constant related to the approximability of the function by polynomials of degree at most in , which also relates to the minimal value of the weight decay regularizer for fitting data as above.

The variables and are always related by , i.e. .

2. Weight decay and Barron spaces

In this section, we briefly review the theory of infinitely wide ReLU networks with a single hidden layer. Function spaces for this setting have been studied under the name in [Bac17], Barron space in [EMWW20, EMW19b, EMW19c, EMW18], Radon-BV in [PN22, PN21] and the convex hull of the ReLU dictionary or the variation space of the ReLU dictionary in [SX21a, SX21c]. In this note, we refer to them as Barron spaces in reference to the seminal work of Andrew Barron [Bar93]. Some results presented below are extensions of known results to the case where we consider a Barron semi-norm rather than the full Barron norm, corresponding to a weight decay regularizer which does not control the magnitude of the biases.

A neural network with a single hidden layer and neurons can be represented as

where are the weights of the neural network. For networks in which the size of the weights is controlled, this representation can be generalized to

where is a measure on and is a probability measure on . More generally, due to the symmetry for , can be taken to be a signed measure. Finite networks are contained in the general setting by setting

respectively, where the parameters can be chosen freely for a convenient representation. The integral is guaranteed to converge if the Barron norm

(2.3)

is finite, where denotes the total variation measure of the signed measure . The infimum must be taken since the representation of a function in this fashion is highly non-unique [EW20b, Section 2.1]. The two representations of the norm coincide by [EW20b, Section 2.4].

The norm in the parameter variable is chosen dual to the norm in the data variable such that the inequality holds. In particular, if distances in the data domain are measured in the -sense for , then distances in the parameter domain are measured in the -sense for . For compatibility with radial symmetry, we focus on the case in this note.

We refer to the space as Barron space , or at times to indicate dependence on dimension.

Due to the control over the bias, the Barron norm as defined in [EMW19b, EW20b] is not invariant under translations in the data space, i.e. the functions and generally have a different norm for . By contrast, the following Barron semi-norm is translation invariant and has useful properties which suffice in many applications:

(2.4)

We will address the convergence of the integrals in (2) without control over in Proposition 2.1. This is more in line with the approach in [OWSS19, PN21], where the magnitude of the bias is also not controlled. We opt for controlling rather than for convenience, but note that the classical Barron norm could be defined in this fashion, too. The key observation is that the ReLU activation function is positively one-homogeneous, i.e.  for all . In particular

i.e. we may normalize neurons to

without changing the output of the neural network. In particular , indicating that we could define the Barron norm in the analogous fashion by squares. Indeed, in the infinite limit it is even possible to assume that is supported on the set . For a more technically rigorous discussion, see e.g. [EW20b].

By a slight abuse of terminology, we will also refer to as Barron space from now on. We briefly note the following properties, which relate the Barron semi-norm and more well-established quantities.

Proposition 2.1.
  1. If the integral in (2.4) is finite for , then the integral defining in (2) exists for all if and only if it exists for . It may then be re-cast as

    This expression always converges if the integral in (2.4) is finite. The integral exists as a Bochner integral with values in for compact or for a probability distribution on with finite -th moments.

  2. is a norm on the modified Barron space , which makes a Banach space. Compared to classical Barron spaces, .

  3. .

  4. If , then is Lipschitz-continuous and the Lipschitz-constant of satisfies .

All statements could be given in terms of a general signed measure instead of . The proof, along with other proofs from this section, can be found in Appendix C.

Functions in Barron spaces are defined by means of an explicit representation formula. Paradoxically, this explicit characterization often makes it difficult to verify whether a given function is in Barron space. A more abstract framework was created in [OWSS19] by the means of the Radon transform, based on the observation that

i.e. the second spatial derivatives of a ReLU network with one hidden layer are superpositions of measures concentrated on the hyperplanes . This allows for a characterization of Barron spaces in terms of second derivatives. The Radon transform is used as a technical tool in order to dualize from hyperplanes to points. This convenient characterization allows the construction of some examples of functions in Barron space.

Example 2.2.
  1. Assume that is a Lipschitz-continuous function and that the (possibly non-integer) power of the Laplacian in the distributional sense exists as a measure. Then

    where denotes the total variation norm of [OWSS19, Proposition 3].

  2. If is odd, the power of the Laplacian is integer. In particular, if belongs to the Sobolev space of functions whose first (weak) partial derivatives are -integrable, then and

    for some constant , which depends on the exact choice of the norm on . In particular [OWSS19, Corollary 1].

  3. If is an odd integer and is the radial bump function given by

    then if . For , the norm bound holds according to [OWSS19, Example 3].

    In [OWSS19], also a stronger version of the statement is claimed, including an if and only if condition for and a comparable lower bound for . Those claims are based on an error in the proof of [OWSS19, Proposition 15], where the erroneous claim is made that if , then the integral of over any hyperplane is bounded from above by .

Based on the same intuition, we point out two observations. The first demonstrates that the singular set of a Barron function (i.e. the set where the functions is not differentiable) is ‘straight’ and lower dimensional. This is a stronger version of Rademacher’s theorem, which states that the singular set of a Lipschitz function is Lebesgue null, in the context of Barron spaces. The following statement has the stronger implication that is contained in a countable union of affine subspaces of and therefore has Hausdorff dimension .

Proposition 2.3.

[EW20b] Any function can be written as a countable sum where

  1. is -smooth,

  2. where

    • is an orthogonal projection for (i.e. 

    • is -smooth except at .

The fact that the singular set is straight has two immediate implications.

Corollary 2.4.
  1. If is radially symmetric, then .

  2. If is a diffeomorphism such that , then is an affine linear map [EW20b, Theorem 5.18].

A brief inspection of the proof of [EW20b, Theorem 5.18] reveals that Proposition 2.3 and 2.4 reveals that both statements remain valid for . A stronger result on radial Barron functions is proved below in Lemma 4.1. Secondly, we recall a characterization of one-dimensional Barron spaces, which is essentially the simpler one-dimensional case of the Radon transform construction. A similar statement can also be found e.g. in [EW20b, Example 4.1] and [LMW20].

Proposition 2.5.

if and only if there exists a finite signed measure such that , i.e.  for all such that (in particular, all but countably many). For all such and any , we can write

Furthermore

where

is the convex hull of the set of approximate derivatives. Conversely

We believe that the upper bound is, in fact an identity. We now recall a property of Barron spaces .

Proposition 2.6 (Direct approximation theorem).

For every and every probability measure on there exists as in (2) and such that

and

depending on the normalization in (2).

For the sake of completeness, we sketch a probablistic proof in Appendix C. This formulation of the direct approximation theorem improves on known results in two major ways:

  1. The dependence on the data distribution is only through the ‘projected second moments’ rather than the full second moments . It is easy to see that

    for any probability measure on , and that equality is attained for any measure which is the product of one-dimensional probability measures, e.g. a standard normal distribution. The constant in the bound may therefore be significantly smaller in high dimension.

  2. The bound depends on the Barron semi-norm, but not the full Barron norm.

While the constants are improved in this formulation compared to e.g. [EMW18, EMW19b, EW20b], the result is not expected to be sharp in terms of the rate which is achieved. An improvement from to in the classical setting can be found in [SX21b] at the cost of a more involved proof.

Many of the results above are somewhat specific to ReLU activation as the proofs either use positive homogeneity or the property that . Both are shared by leaky ReLU activation.

Remark 2.7.

Consider the leaky ReLU activation function for in addition to the classical ReLU activation . Since

any function which can be represented as a superposition of ReLUs can be represented as a superposition of leaky ReLUs and vice versa. The entire construction of Barron space goes through as above, leading to two semi-norms and on the same function class such that and by the explicit representation (2.7). More compactly, we write this as

Using positive one-homogeneity, it can be seen that the coefficients in the representations (2.7) are in fact optimal and thus that (2.7) is sharp. The norms induced on Barron space by ReLU and leaky ReLU activation are therefore equivalent, and all properties mentioned above survive if is replaced by .

The more subtle statements which we prove below do not survive passing to an equivalent norm. When minimizing under the constraints , the set of solutions will generally depend on . For example, consider the one-dimensional data set with two points and , which is fit exactly by for any . The solution is norm-minimizing for , but not for , where the norm-inequality is sharp (and vice versa).

The equivalence of norms estimate degenerates at , where the activation would become linear. If , a similar construction holds unless , where the any -Barron function must satisfy .

We are finally ready to state (and prove) the main results of this article rigorously.

3. Statements of Main Results

Theorem 3.1.

For every odd , there exists a unique radial function such that

Furthermore

  1. .

  2. The radial profile , is strictly monotone decreasing in in . In particular, .

  3. There exists such that is a linear, strictly monotone decreasing function of on .

As , the norm of increases linearly as

where is the inverse of the Bernstein constant.

The Bernstein constant is a quantity in classical numerical analysis and approximation theory arising when approximating the function by polynomials in , see e.g. [Tre19]. From the proof of Theorem 3.1, we obtain an algorithm to compute to arbitrary precision, which is implemented in Section 6.

The functions are radially symmetric, compactly supported and non-negative. In particular, they can serve as mollifiers to easily prove quantitative approximation results for two-layer ReLU networks in general function classes. In a companion article [Wojb], we prove that they are achieved as (radial averages of) empirical risk minimizers with a weight decay regularizer.

Note that we do not exclude the possibility that other minimizers exist which are not radially symmetric. From direct arguments, we can only conclude that the set of minimizers is convex and invariant under coordinate rotations. The existence of at least one radially symmetric minimizer follows relatively easily, while its uniqueness is established below by construction. For any minimizer , which may not not be radially symmetric, the radial average

is a radially symmetric minimizer, i.e. . Knowledge of the unique minimizer after radial averaging allows us to study optimization algorithms for implicit bias and finding global optima. This line of inquiry is pursued in upcoming work [Wojc].

We find it easier to deal with odd dimensions, as the function is a polynomial in this case. This is analogous to the observations of [OWSS19]. We remark that, if is a Barron function and , then

is also a Barron function and , so the limit

remains valid if even dimensions are considered, as can be seen when sandwiching an even integer between and .

Using a reflection argument, any radially symmetric Barron function can be written as

for some measure on the space of biases. In this context, Theorem 3.1 can be understood as a finite representer theorem, since the proof shows precisely that there exist weights and biases such that

Finally, we note that the methods in the proof of Theorem 3.1 can also be used to show the following extension.

Theorem 3.2.

For every and every odd , there exists a unique radial function which minimizes the Barron semi-norm in the class

Furthermore

  1. .

  2. The radial profile , is strictly monotone decreasing on . In particular, .

In this case, the Barron norm grows exponentially in the dimension . More precisely, there exists independent of such that

if .

We thus observe that the problem of approximating compactly supported bump functions which are constant in a neighbourhood of the origin by shallow neural networks suffers from the curse of dimensionality. We will argue below that this is not the case for ReLU networks with at least two hidden layers.

4. Proofs of the Main Results

We begin by stating three lemmas in this section, which are used to prove the main theorems. The proofs are given in Appendix B. By a slight abuse of notation, we denote for a radially symmetric function and by the radial derivative of . We first note a general result on radially symmetric Barron functions.

Lemma 4.1.

Let be a radially symmetric Barron function and odd. Then

  1. as a function of , is times continuously differentiable in . The -th radial derivative is bounded and measurable, and the -th radial derivative in the distributional sense is a bounded (Radon) measure.

  2. for every , there exists such that the Lipschitz bound

    holds for every .

The following Lemma allows us to express radial symmetry and compact support for Barron functions in odd dimensions in a one-dimensional fashion. It is based on an exchange in the order of integration in (3).

Lemma 4.2.

Assume that is a one-dimensional Barron function such that , outside of and

Then the function given by

satisfies the following properties:

  1. ,

  2. if ,

  3. is radially symmetric, and

  4. .

Conversely, if is a radially symmetric Barron function which satisfies and on , then there exists as above such that (4.2) holds, which is additionally an even function and satisfies .

Furthermore in if and only if in for the even representative of the function class .

We will show that such a Barron function indeed exists for every odd and compute the precise asymptotic growth of as . The following Lemma is the main technical tool in our proof.

Lemma 4.3.

For , set

Then is the inverse of the Bernstein constant. The minimum is attained by a unique measure where

  1. are the distinct points in at which is extremal in , where is the optimal even polynomial approximator of degree for in .

  2. are parameters satisfying the alternation criterion for and the generalized Vandermonde system

We are finally prepared to prove our main results.

Proof of Theorem 3.1.

Step 1. In this step, we construct and using Lemmas 4.2 and 4.3. Let be an odd integer and . Let be the unique function such that

in the distributional sense, where is the even reflection of the measure described in Lemma 4.3, i.e. . Note that the origin is counted twice. By construction, is piecewise linear, on and

In particular, and for all due to the moment conditions

Since is even and , we find that is even. Due to Proposition 2.5, we observe that . Integrating by parts twice, we realize that

for and , since in a neighbourhood of , so the boundary terms vanish. The integration by parts is well-established for smooth functions and can be justified in the piecewise case by mollification.

In particular, satisfies the conditions of Lemma 4.2 and induces an admissible radially symmetric function .

Step 2. Assume for now that there exists a minimizer of the Barron semi-norm in . Since the Barron semi-norm is a a convex function on the convex function class , and since furthermore both and are invariant under coordinate rotations, we note that the set of minimal semi-norm elements in is both convex and rotation invariant. In particular

is the average of with respect to all rotations. The measure is the Haar measure on the group , i.e. the -dimensional Hausdorff measure induced by the Frobenius norm on the space of -matrices.

Thus, if a minimizer exists, then there is also a radially symmetric minimizer. In this step, we illustrate that the function associated to as in Step 1 is in fact optimal. In particular, we can conclude from the proof below that a minimizer does exist..

It is easy to see that (4) is both necessary and sufficient to imply the moment conditions for in Lemma 4.2. In particular

with corresponding minimizers. As is the unique solution to the minimization problem on the right, is the unique radial minimizer on the left.

In particular

Step 3. We note that is linear on the interval , where is as in Lemma 4.3. Thus

is linear by the origin.

Step 4. In this step, we show that is strictly decreasing in radial direction inside the unit ball. As noted in Corollary 2.4, the function is -smooth away at the origin. Since and , it suffices to show that for . We compute

We make the following claim: If is an even piecewise linear function on with at most segments in and , then the function

has at most zeros in .

To prove the claim, start with . Then

As is constant in the interval by the origin, is constant (and non-zero) in , meaning that cannot have a zero in . In any interval where is constant, the function

is monotone, since does not change sign. In particular:

  1. There is no zero in the first interval .

  2. There is at most one zero in .

  3. The zero in the final interval is attained at .

Thus there are at most zeros in , which proves the claim for . Now consider . Note that

In particular, the term on the left is zero if and only if the integral on the right is zero. If there are two points on the right where the integral vanishes (and ), then by Rolle’s theorem in between there exists a point at which

The integral also vanishes at zero, where we are integrating over the empty set. We note that for any the integral vanishes at and . If, for it vanishes at interior points, then for it must vanish at points: 0, 1, and at least once in each interval. In particular, for , there are at most interior vanishing points.

Step 5. We finally note that the Lipschitz bound follows directly from the Barron norm bound on and Lemma 4.1. ∎

Example 4.4.

Let us consider the case , i.e. . The points are given by the equi-oscillating points of the best approximation of the function on by elements of the space spanned by .

The best approximation of by even quadratic polynomials in is , which attains maximal distance at . This can easily be verified as is a polynomial of degree inside , so if , then , so the most distant points are in . By Kolmogorov’s equi-oscillation theorem, all three are points of largest error. It is now easy to solve for the coefficients of .

We can find the measure for the second derivative by solving the linear system of moment conditions

So is the even continuous piecewise linear function satisfying and

Finally, since , we find that and thus

In particular, we observe that and that . It is easy to see that the first derivative of

is a continuous function, the second

is a bounded and measurable function, and the third (distributional) derivative

is a finite measure, where denotes a Dirac delta located at the point and denotes the one-dimensional Lebesgue measure of the open set .

Proof of Theorem 3.2.

The existence of a radial minimizer is proved as in Theorem 3.1. By Lemma 4.1, we find that , and since is constant in a neighbourhood of the origin, we find that . The uniqueness follows as in Theorem 3.1 by considering the optimal measure on satisfying the moment conditions, using again Lemma 4.2. The main difference lies in the greater ability to uniformly approximate the function by even polynomials on compared to .

We claim the following: Let , and a measure on such that

Then for every there exists independent of such that

if .

The claim is proved in Appendix B. Inserting the lower bound in (B), the statement is proved. ∎

5. Applications

5.1. Fitting values on a finite data set

Let be a finite data set in . For each , define to be the minimal distance between the point and the closest data point to it. Then

is a Barron function such that

  1. for all and

  2. .

In most practical data sets, the minimum -distance between data points is lower bounded as or even , meaning that the Barron norm only grows as or even . Using the direct approximation theorem for Barron functions (Proposition 2.6) in for , for every there exists a shallow neural network with neurons (and one constant shift) such that

where . Often, the labels lie in a bounded set, at least with high probability. The projected and centered second moments may well be independent of the ambient dimension , leading to a realistic, even somewhat pessimistic, expectation that

for data sets which do not heavily concentrate at a single point or exhibit heavy tail behavior. While the data can generally be fit exactly if (see e.g. [LS06] and the references therein) this estimate also controls the size of the weights of the neural network needed.

Similarly, this estimate can be used to bound the additional size of the Barron norm which is required to fit values , assuming that the Barron norm required to fit is already known.

5.2. Mollification and density

Since is a compactly supported, non-negative function, it can serve as a mollifier. Namely, for and , denote

It is well-known that in for any compact set [Dob10, Lemma 4.22]. In many situations, rates can be obtained, either in the -topology or a stronger topology, under the assumption that lies in a space of more regular functions (e.g. a Hölder or Sobolev space). We consider the following scenario:

Assume that , where is a space of functions for which it is known that for a given domain and some universal constants (which may depend on ). If is naturally defined only on and not the entire space, extension theorems can often be used to extend in the same regularity class, see e.g. [Dob10, Chapter 6].

Note furthermore that is a continuous superposition of Barron functions in . Since is a Banach space, is a Barron function with norm at most

In particular, due to the direct approximation theorem for Barron functions (Proposition 2.6), there exists a neural network with one hidden layer, ReLU activation, neurons (and an affine shift) such that

and thus

Balancing the scaling of terms , we find that it is optimal to choose , which leads to an approximation order of . We note that not only the rate, but also the constants exhibit the curse of dimensionality. Observe that it is generally impossible to approximate functions in classical function spaces by functions of low norm from any function class in which the unit ball has low Rademacher complexity, so the curse of dimensionality cannot be avoided here [EW21].

Since and outside the unit ball, we find that . The true -norm is likely even much smaller, as appears to decay rapidly close to the unit sphere. Nevertheless, we find this an easy way to obtain an explicit rate with little effort.

Example 5.1.

If is the space of Lipschitz-continuous functions on , then the approximation property holds as

The conditions above are therefore met with .

5.3. Depth separation

We have seen that any function which satisfies in and outside of has Barron semi-norm which is exponentially large in the dimension of the data space (for fixed ).

By comparison, the function

can be represented as the composition of two Barron functions and

with norm

The second norm estimate be easily obtained by Proposition 2.5, whereas the second can be obtained as in the second step in the proof of 4.1 in Appendix B – see also [EW20b, Section 4].

In particular, by the direct approximation theorem for Barron functions (Proposition 2.6), it is possible to approximate with parameters whose magnitude does not exceed . When written as a neural network with two hidden layers, the initial linear layer of and terminal linear layer of are concatenated into a single linear map. Balancing the magnitude of coefficients equally over all layers, we find that the parameters scale only like , so the weight decay regularizer grows as . Observe that weight decay does not induce a norm for deeper ReLU networks due to the mismatch in homogeneities.

Theorem 3.2 thus serves to illustrate the following depth separation phenomenon: A function which takes values on and on is much easier to approximate by ReLU networks with two hidden layers than with one. While depth separation phenomena are well established [ES16, Tel16], this is a particularly easy criterion. The fact that compositions of Barron functions correspond to certain neural networks with two hidden layers has been observed e.g. in [EW20b, PN22].

On the other hand, the result is a weaker version of a depth separations statement than others. We do not claim that the number of neurons required to approximate such a function to a certain accuracy grows exponentially in dimension, but rather that either the number of neurons or the magnitude of the parameters does. From a practical point of view, both are prohibitive.

6. Finding optimal bump functions

In this section, we compute numerical approximations of the optimal bump functions which were constructed in Theorem 3.1 for different odd dimensions beyond the case considered in Example 4.4. As previously, denote , i.e. . For simplicity, we exploit that three tasks are equivalent: Approximating in by polynomials of degree at most (or ), approximating in by polynomials of degree at most , and approximating in by even polynomials of degree . We proceed in three steps:

  1. Find the optimal approximation of by polynomials of degree in , and find the points at which the error is maximal. Take the optimal points for the approximation of by even polynomials.

  2. Solve the linear system (B) to obtain the measure . Compute the piecewise linear function by in , and .

  3. Obtain from by numerically integrating (4.2).

In our implementation, the first step is solved by the Remez algorithm [Tre19]:

  1. Initialize , e.g. as equi-distant points such that and .

  2. Solve the system , for the coefficients and the equi-oscillation parameter .

  3. Update such that , and for , is a point at which the unsigned error function has a local extremum.

  4. Iterate (ii) and (iii) until after the final update we have approximately reached equi-oscillating points of largest error:

We solve the linear system in step 2 in the iteration scheme by LU factorization. The non-linear system is solved by two nested interval constructions:

  • Given such that , we conclude that for all , there exists such that

    by the Intermediate Value Theorem. In particular, the points are distinct and ordered. We approximate by the bisection method to accuracy . Note that , since the approximating polynomial cannot match the objective function at points by the same argument as in the proof of Theorem 3.1.

  • Given , we find that there for there exists such that

    by Rolle’s Theorem. Again, we approximate by the bisection method to accuracy . By construction, all are distinct. We update

The nested interval construction is more numerically stable than Newton-Raphson iteration, as a Newton solver tends to find the same multiple times starting at different roots from the previous iteration.

The linear step (2) is solved by LU factorization. The integral in (3) is evaluated using a composite Simpson rule and 1,001 integration points. A sample implementation of the algorithm can be found in a google colab notebook at [Woja].

 We compute the piecewise linear function  We compute the piecewise linear function  We compute the piecewise linear function
Figure 1. We compute the piecewise linear function satisfying the moment conditions (B) for three different choices of break points : Optimal points found by the Remez algorithm (left), equi-distant points (middle) and the roots of Chebyshev polynomials of the second kind (right).
For the optimal choice of , we empirically observe that the collection of points concentrates on a line parallel to the horizontal axis. For equi-distant nodes, the oscillations become larger as increases, whereas they become smaller for Chebyshev nodes.

We note that the linear system (B) can be solved for any choice of distinct points . To explore the importance of using the optimal points found using the Remez algorithm, we compare and for the optimal choice of sample points and other, more classic and explicit choices in Figures 1 and 2 respectively. The Barron norm grows slowly and linearly for optimal break points and faster than linearly for other explicit choices of break points, as can be seen in the rightmost image in Figure 3. In the left and middle plot of Figure 3, we also display the known profiles of Barron functions due to [OWSS19] as well as profiles of Barron functions which are constant in a neighbourhood of the origin.

For any choice of break points which include zero, the function is a ‘wizard’s hat’ function: Monotone decreasing, flat away from the origin, monotone decreasing and convex, non-smooth at the origin. It is thus qualitatively different from previously known radial profiles due to [OWSS19].

 We compare the functions  We compare the functions  We compare the functions
Figure 2. We compare the functions computed by (4.2) for the piecewise linear functions in Figure 1 associated to three different choices of break points : Optimal (left), equi-distant (middle), Chebyshev (right). The break points and curves agree for (blue curve) and are qualitatively similar for all , in particular non-negative, monotone-decreasing and convex. The curves are steeper at the origin in higher dimensions, most noticeably for Chebyshev nodes. The curve with optimal break points appears to make the slowest transition from to .
Figure 3. Left: The radial profiles of the known Barron bump functions of [OWSS19] are smooth at the origin and non-convex in the radial direction. They are thus geometrically distinct from the profiles associated to piecewise linear functions as depicted in Figure 2. Middle: The functions associated to piecewise linear functions with break points at equidistant points in are -smooth, monotone decreasing and non-negative, but not convex in the radial direction. Right: The Barron semi-norm of functions associated to with different break points grows slowest (and linearly) for optimal the optimal choice of points and fastest for break points at Chebyshev nodes. All growth rates are ostensibly polynomial of empirical degree (optimal points), 1.4 (equi-distant points) and 1.6 (Chebyshev nodes) as determined by least squares fitting. By comparison, the norm growth for the choice of equi-distant nodes in in the middle figure is exponentially large in and is not pictured for better readability of the plots.

Additional empirical results relating to the optimal construction can be found in Appendix A.

7. Conclusion and Open Problems

We have provided an explicit construction for how neural networks optimally interpolate certain radially symmetric data with respect to a weight decay regularizer in the infinite parameter limit. While we do not prove that the optimal interpolant is radially symmetric, the radial average of all interpolants coincides with the solution constructed in this article. We show that its weight decay regularizer grows as and its Lipschitz constant grows at most as . In contrast, we identify a slight modification which necessitates exponential growth. A number of important questions remain open, even for shallow neural networks and the simple case of rotational symmetry. Deeper networks appear to be out of reach for our methodology.

Is the radially symmetric minimizer the only one? A uniqueness statement would allow us to establish that regularized risk minimization does in fact lead to symmetry learning, at least in a toy example, and would allow us to prove stronger convergence results in the companion article [Wojb]. We give further heuristic consideration to this question in Appendix D.

What happens if we modify the constraints? For example, it is not clear from the proof of Theorem 3.2 whether the constraint on induces the curse of dimensionality as the constraint on does. Similarly, it may be interesting to study the case where the boundary condition is imposed on a shell rather than the entire exterior domain. We recover the problem studied in this article in the limit , whereas the optimal solution in the case would be . Furthermore, a modified minimization problem is required to find optimal mollifiers:

It appears that subtle differences may make the difference between a solvable data fitting problem and one where we encounter the curse of dimensionality.

What more can we say about the optimal function ? For example, we do not provide a lower bound on the Lipschitz constant of , nor do we study the decay of for fixed or rigorously. We conjecture that both decay at least exponentially in , and that both sequences are monotone in . Limited evidence is provided in Appendix D.

Finally, the fact that the extrema of lie on straight lines parallel to the horizontal axis as we vary appears too specific to be random. It is not clear to us how to interpret this observation.

Acknowledgements

The author would like to thank Jonathan Siegel and Rahul Parhi for inspiring conversations.

References

Appendix A Further plots

We note the following: If is a piecewise linear function with break points which satisfies the linear moment conditions (B), then it also satisfies the same linear moment conditions for any . In particular, the function

is a Barron function such that and on for . We plot for (i.e. ) and various choices of and various choices of break points in Figure 4. In Figure 5 we fix instead and consider the influence of varying .

The larger the discrepancy between and , the more oscillatory the function is. This is reminiscent of observations from Step 4 in the proof of Theorem 3.1.

In Figure 6, we numerically investigate the decay of as varies, both pointwise in and integrated. Since is -Lipschitz and , we see that the one-dimensional integral is bounded from below by . This indeed appears to be the dominant term, and for fixed , we observe empirically that decays to zero exponentially in .

 We plot the radial profile of  We plot the radial profile of  We plot the radial profile of
Figure 4. We plot the radial profile of as in (A) for various choices of break points. The points are chosen optimally (for dimension ) on the left, equi-distant in in the middle plot and equidistant in on the right. Notably, the functions are neither monotone nor non-negative if , and the number of local extrema increases as grows.
 We plot  We plot  We plot
Figure 5. We plot as in (A) corresponding to low dimension , and . and various choices of break points: Optimal for (left), equidistant in (middle) and equi-distant in (right). The radial profiles of the Barron functions are neither monotone nor non-negative. The number of local extrema of the profiles is larger if the dimension is small compared to the number of break points. The oscillations are smallest for the optimal choice of break points and largest for break points which are bounded away from the origin.
Figure 6. Left: We plot as a function of for fixed in a logarithmic scale together with , where are chosen depending on as the least squares fit for the function . The graphs suggest that the decay is exponential in and thus comparable to explicit solutions of [OWSS19]. Middle: We graphically compare the decay of the normalized -dimensional integral of for different choices of break points. Despite graphical differences around the origin, the values of the integrals are very similar and decay roughly as . In dimension three, all three functions coincide, while their difference close to the origin becomes negligible in high dimension, where almost all measure concentrates by the boundary of the unit ball. Right: We graphically compare the decay of the -dimensional integral of the function associated to piecewise linear with break points in for different choices of break points. The integral empirically decays as for optimal points, like for equidistant points and like for Chebyshev points. The order of decay was established by a least squares regression.

Appendix B Postponed proofs

Recall the co-area formula, which allows us to integrate over a Riemannian manifold by ‘slicing’ the domain into the level sets of a function , where is another Riemannian manifold [BZ13, Theorem 13.4.2]. In the case of slicing the sphere into level sets of a coordinate projection , the formula reads as

since is the modulus of the tangential gradient of , which measures volume distortions. This can be considered a curvinlinear version of Fubini’s theorem. If only depends on , the formula further simplifies to

(B.1)

since is a -dimensional Euclidean sphere of radius . Here denotes the volume of the -dimensional unit ball and the volume of the -dimensional Euclidean unit sphere.

Proof of Lemma 4.1.

Step 1. Symmetrization. Let be a radially symmetric Barron function. Then in particular is the same as the average over for in and the average is taken with respect to the Haar measure (which coincides with the -dimensional Hausdorff measure on with respect to the Frobenius norm), i.e.

since for any , the map , pushes the Haar measure forward to the uniform distribution on . Thus can be written as a continuous linear combination of the elementary radially symmetric Barron functions

On the other hand, every function of this type is a radially symmetric Barron function. Finally, we note that

since the uniform distribution is invariant under the substitution in the first term. In particular, every radially symmetric Barron function can be written as

for some measure on and , since for any .

Step 2. Gradient bound. We note that for any by definition and

Due to radial symmetry, the gradient points in direction , i.e.

The gradient is largest as , and

independently of . In the last step, we used the coarea formula (B.1). It is now possible to evaluate the gradient

in the sense that

Consequently, for a general radially symmetric Barron function as in (B) and sufficiently large , we find that

Step 3. Higher regularity. By Corollary 2.4, any radially symmetric Barron function is -smooth except at the origin. This establishes the claim in the case .

Note that if is -smooth, then the same is true for given by by the chain rule and product rule. It thus suffices to analyze the radial profile of . In the following, we will denote both functions as by a slight abuse of notation. Consider the radial profile of the function

for . We can compute the first two derivatives of by exchanging differentiation and integration

where the second formula must be justified by approximation, as the derivative (considered as a ‘function’ of ) is not regular. For , it is easy to see that is -smooth in , and as a polynomial in also -smooth on . Clearly and all its derivatives vanish at infinity. If , is continuous except at , where it has a jump discontinuity. If , the function

vanishes as at and thus has additional derivatives which vanish at . We find for any odd dimension . The -th derivative of is bounded and continuous except at , and the -th derivative of is a finite measure associated to the regular part of the derivative in and the jump at .

It remains to show that a general radial Barron function

has the same regularity as its components , at least away from the origin. To simplify the presentation, we focus on the case . Let and observe that

Clearly, the affine linear component is -smooth except at the origin. Secondly, we note that for any , the identity holds. In particular,

where the integrals converge uniformly for due to the -bound on the -th derivative of . Similarly, the -th derivative converges in for all due to the bound on the measure-valued -th derivative. Finally, for , the integral converges weakly in the sense of Radon measures, i.e. in the weak-* sense, when we consider the space of (Radon) measures as dual to the space of continuous functions.

For the first integral, we prove convergence assuming that . Note that for some polynomial and . By induction we see that for all , where is another polynomial. This is easily seen since

Hence, as before,

The integral converges since and is a continuous function. ∎

Proof of Lemma 4.2.

First claim. Assume that is a function with the properties outlined above and is defined by (4.2). Then is radially symmetric by definition and

Furthermore, if and , then

since outside of . The integral vanishes by (4.2) if is odd, since is an even polynomial of degree at most for all .

It remains to show that is a Barron function. Take any measure such that

as in Proposition 2.5 and compute that

(B.3)

by the co-area formula (B.1). The fact that the normalizing constant is exactly

can be justified by the same co-area integration. Finally, we note that the left hand side of (B.3) is clearly a radially symmetric Barron function satisfying . Taking the infimum over all representing , we find that .

Second claim. Assume on the other hand that is a radially symmetric Barron function. If we denote by the Haar measure on the special orthogonal group , then due to radial symmetry

where for the map

It is now possible to reverse the calculations from Step 1 by setting

Taking the infimum over representing , we find that . Clearly, both and induce the same function by (4.2) due to symmetry, and so does the even representative .

It remains to show that outside of and that the moment conditions (B) hold. Assuming that outside , we find that

for all , as the integral over vanishes. The moment conditions follow easily as

for and . Taking derivatives, we find that is -orthogonal to . Lowering the order of the derivative inductively, we find that is -orthogonal to all even polynomials of degree at most .

Thus we only need to show that outside . First consider the case , i.e. and thus

Then for , we have

since . As is an even function, we conclude that for all and thus for all . We now proceed inductively: Assume that is such that

for all . Then also

since the boundary term vanishes for . In particular, we conclude that

for and . Since is odd, we can reduce the integer exponent inductively until . Then, by the same consideration as in the case , the result is proved. ∎

We now prove the abstract statement about measures on the unit interval.

Proof of Lemma 4.3.

Lower bound. Let be a finite signed measure satisfying the moment conditions

Then

by definition. Taking the infimum over the parameters on the right, we find that

i.e.

The asymptotics of

are known due to Bernstein [Ber12] and Varga and Carpenter [VC85] who proved that , so

Upper bound: Step 0. Note that due to compactness, there exist parameters such that

We fix accordingly. Further note that equality is attained in (B) if the measure is supported on the set of points

and the measure

which has density with respect to is non-negative, i.e.  has “the right sign” at all points. If such a exists, it therefore serves as a matching upper bound and the Lemma is proved. It is, however, not immediately clear whether there exists a signed measure supported on which satisfies the moment conditions (B) and positivity condition (B). In the following, we will prove that indeed does exist.

Step 1. Due to compactness, is a non-empty subset of . Additionally

since the function is either maximal or minimal at . By the fundamental theorem of algebra, is thus a finite subset of . In this step, we prove that and .

Note that is also an optimal polynomial approximation of the function in in the space of polynomials of degree at most , since the optimal approximation is an even polynomial. By Chebyshev’s equi-oscillation Theorem [KKC09, Section 6.9], there exist distinct points such that the error

satisfies

i.e. there are distinct points where the deviation from the target function is largest, and the oscillation around the target function at consecutive points goes in opposite directions.

Clearly, if is maximal at if and only if it is maximal at . Therefore, there exist at least points in . Rounding up is required since is odd, and the point counts fully towards . Thus .

It remains to show that . We prove this only if , as the case of approximation by constant functions can be solved explicitly by direct inspection by the constant polynomial .

Assume for a contradiction that . Then there exist at least distinct points in . Since is either maximal at , we conclude that for every . By Rolle’s Theorem, between any two points such that , there exists such that . In particular, has at least distinct zeros in . Since is even, it follows that has at least distinct zeros. But, since is a polynomial of degree , it follows that and thus that is a quadratic polynomial on . On the other hand, we have seen that there exist at least points in , meaning that there are points in at which vanishes. We conclude that , i.e.  is a linear polynomial on . It is easy to see that this is not optimal in terms of approximation.

Step 2. We claim that the -Vandermonde type matrix

is invertible for any distinct points . This is true by classical results [Lun17] for the Vandermonde submatrix

since the points are distinct. It remains to show that the first row is linearly independent from the others, i.e. there exist no coefficients such that at distinct points in . Assume the contrary. Then there are distinct points such that

By Rolle’s theorem, between two such points there exists such that

The contradiction follows as in Step 1 of this proof.

Step 3. Combining the results of the second and third step of this proof, we can choose and find a unique vector such that

The measure under consideration is now

by construction. Thus the moment conditions are met. It remains to show that does not change sign in order to ensure that equality is attained in Hölder’s inequality. Using Chebyshev’s equi-oscillation theorem again, it suffices to show that and have opposite signs for all .

For any , consider the unique even polynomial of degree such that for except . Then, since is an even polynomial of degree

but since has zeros at for , we find that for any . Thus and have the same sign. In order to satisfy (B), we therefore find that and must have different signs. ∎

Proof of the claim in the proof of Theorem 3.2.

To see this, we use the lower bound

from (B). By replacing the variable by , we find that

Recall that the function is an analytic function on the interval and

The coefficients decay asymptotically as since

so for every , there exists which is independent of such that the -distance of the function from the space of polynomials of degree is at most

Appendix C Brief proofs of known results

In this appendix, we merely sketch the proofs of known results. For a more detailed introduction, we recommend e.g. [EW20b]. We begin by sketching a proof of Proposition 2.1, where we established general properties of Barron

Proof of Proposition 2.1.

First claim. We note that, assuming existence of the integrals and for fixed , we have

If the first integral exists, then also the integral defining exists as the integrand is continuous and grows at most linearly. Then

Measurability is not an issue for fixed due to the continuity of the integrand. For the sake of brevity, denote . More generally, we note that the Bochner-integral

converges in for compact sets and in for and probability distributions with finite -th moments in , i.e. the function is Bochner integrable with respect to when considered as a function with values in either or . To see this, consider step functions

where are -dimensional cubes of side length whose union is and . If , then

for any . Fixing to be the square root of the side-length of , we find that pointwise for all . Furthermore, is Lipschitz continuous in uniformly in , so converges to a limit in by the compact embedding of Lipschitz functions in , which coincides with the pointwise limit . In other words, the Bochner integral exists in . The argument follows in by the dominated convergence theorem considering for all .

Second claim. In this step, we show that is a Banach space and illustrate that and are different spaces. The fact that is a Banach space follows as [SX21a, Lemma 1] from the previous claim, where we have shown the existence of as a Bochner integral in , i.e. as a continuous convex combination not only pointwise, but in a function space.

To see that , observe that any can be decomposed into a positively one-homogeneous and a bounded part due to [EW20b, Corollary 5.3]. On the other hand, in one dimension, the function satisfies and has an integrable second derivative . By Proposition 2.5, we find that . Since is not bounded but grows sub-linearly, we conclude that . The first inclusion follows from the fact that as shown next.

Third claim. The claim that is self-evident by definition, as the full Barron norm also limits the magnitude of the bias.

Fourth claim. Finally, we note that is Lipschitz-continuous, since

Taking the infimum over (and optionally noting that ), we find that . ∎

Proposition 2.3 is proved in [EW20b, Theorem 5.18] and Corollary 2.4 follows from it directly. Let us sketch how the structure of one-dimensional Barron functions can be understood.

Proof of Proposition 2.5.

Upper bound. Let and be such that . Then for we have

and for

Noting that the terms in the first expression vanish for when and vice versa, we find that

Consequently, for a measure

where denotes the atomic point measure of mass one and denotes the one-dimensional Hausdorff measure, restricted to half-lines and . and hence

By approximation, the same is true if and is merely a measure.

Lower bound direction. The bound

follows from Proposition 2.1 and the Rademacher Theorem on the differentiability of Lipschitz functions. For the second form of the lower bound, let , i.e. there exists a measure on such that

The second expression can be written as

where

i.e.  is the push-forward of the measure which has density with respect to onto the real line. If is any function, then by exchanging the order of integration and integrating by parts, we find that

we find that in the distributional sense, and thus . In particular,

We sketch a proof of the direct approximation theorem for Barron spaces.

Proof of Proposition 2.6.

Step 1. Consider the Hilbert space and observe that defined by has norm at most

We use Proposition 2.1 to write as

Step 2. Using the homogeneity relation , the distribution can be normalized such that

almost surely by considering the push-forward of along the map

if and otherwise, which satisfies . Thus for any , is in the -closed convex hull of the family

Step 3. By the Maurey-Barron-Jones Lemma [Bar93, Lemma 1], for every and every , there exist such that

As the vectors are constrained to a compact domain of and the map , is continuous, we can set and obtain the result without constant by an appropriate subsequence.

Finally, we write for compatibility with the original notation. ∎

Appendix D Further results

d.1. On the decay of for

Numerical experiments in Appendix A suggest that decays to zero exponentially fast for . While we cannot prove this in full generality, we show that

for a constant which is independent of . In particular, exponentially fast in if . To see this, observe that

for since is -orthogonal to the polynomial . Since , we may estimate

The pre-factor grows as since and

Finally, we note that holds for positive if and only if .

d.2. Non-radial minimum norm solutions

In this note, we constructed

Since both the Barron semi-norm and the class are convex and invariant under rotations of the data domain, we find that there exists at least one minimizer which is radially symmetric. By direct construction, we saw that this minimizer

is unique, at least if is odd. The biases and weights are given by the optimization process. Our proof does not exclude the existence of other minimizers, which are not radially symmetric. In fact, assume that for are functions such that

Then trivially also since , and thus

Since is the unique radial solution, we can average in the radial direction and observe that for all . The Barron norm of the combined solution is

if is so small that for all , since the function does not change signs in this case, and the integral of averages to zero. In particular, if exist such that is supported in and fails to be radial, then a non-radial minimizer exists.

By considering the behavior of at infinity, we establish two conditions: in order to have bounded, and in order to have .

Lemma D.1.

Assume there exist measures on such that

for all and . Then there exists a minimizer of the Barron semi-norm which is not radially symmetric. Without loss of generality, we may assume that is radially symmetric with respect to .

Proof.

Step 1. Assume for now that is identically zero. Let be a -probability density on the group of rotations which is supported in an -neighbourhood of the unit matrix, and let be the Haar measure on . Define the radial mollification

where

We make three observations.

  1. if or .

  2. as (pointwise and locally uniformly), so cannot be identically zero for sufficiently small .

  3. is absolutely continuous with respect to the uniform distribution on the sphere since

    Due to the uniform estimate, the Radon-Nikodym derivative is an -function.

We now fix small enough, write and note that is also a solution to (D.2). In particular, cannot be radially symmetric since is the unique radially symmetric minimizer.

Step 2. Take to be non-trivial as implied by step 1. Then there exists at least one direction such that . Without loss of generality, we may take . We can now average over all rotations which leave fixed. The resulting function is radially symmetric in all components orthogonal to , i.e. in . Since we only average over rotations which leave the -direction fixed, we have . In particular, we may assume that has the desired symmetry. ∎

The question whether there exists such that on but on can be rephrased in terms of functional analysis. Namely, if we understand as an element of the dual space of and we associate to the function given by , then we can write as a duality product.

In particular, we consider two subspaces :

Obviously . We note the following: If , then by the Hahn-Banach theorem there exists such that for all but not all . It is easy to see by contradiction that there exists in particular with such that . Note that since for any by design.

We have thus proved the following.

Corollary D.2.

Denote and , . Consider the subspaces of as in (D.2). There exists a non-radial solution of the minimization problem (D.2) if and only if .