Optimal bump functions for shallow ReLU networks
Weight decay, depth separation and the curse of dimensionality

Stephan Wojtowytsch
Department of Mathematics
Texas A&M University
155 Ireland Street
College Station, TX 77840
swoj@tamu.edu

September 5, 2022

Abstract.

In this note, we study how neural networks with a single hidden layer and ReLU activation interpolate data drawn from a radially symmetric distribution with target labels 1 at the origin and 0 outside the unit ball, if no labels are known inside the unit ball. With weight decay regularization and in the infinite neuron, infinite data limit, we prove that a unique radially symmetric minimizer exists, whose weight decay regularizer and Lipschitz constant grow as $d$ and $\sqrt{d}$ respectively.

We furthermore show that the weight decay regularizer grows exponentially in $d$ if the label $1$ is imposed on a ball of radius $ε$ rather than just at the origin. By comparison, a neural network with two hidden layers can approximate the target function without encountering the curse of dimensionality.

Key words and phrases:

Deep learning, depth separation, Barron space, Radon-BV, compact support, mollifier, weight decay, minimum norm solution, symmetry learning, explicit regularization, curse of dimensionality, radial symmetry

2020 Mathematics Subject Classification:

68T07, 65D40, 41A30

1. Introduction

Neural networks have revolutionized fields from computer vision [KSH12] to natural language processing [VSP$^{+}$17]. They are the driving force behind AIs which play strategy games at superhuman levels of proficiency [SHS$^{+}$18, SHM$^{+}$16, SSS$^{+}$17], have facilitated major advances in scientific problems such as protein folding [TAW$^{+}$21, JEP$^{+}$21], and have been used for computer-assisted proofs in applied mathematics [WLGSB22]. While empirical evidence indicates that they often generalize well to previously unseen data when trained appropriately, there is little rigorous understanding of how neural networks interpolate a function between known data points.

In this article, we provide insight in the simple setting of infinitely wide ReLU networks with a single hidden layer and data which are drawn from a radially symmetric distribution on a Euclidean space $R^{d}$ . The target function $f^{*}$ satisfies $f^{*} (0) = 1$ and $f^{*} (x) = 0$ for $| x | \geq 1$ , where $| \cdot |$ denotes the Euclidean norm on $R^{d}$ . We consider a loss functional composed of an $ℓ^{2}$ -error and a weight decay regularizer. Despite the fact that neural networks with a single hidden layer cannot represent compactly supported target functions exactly [Lu21], there are such functions which can be approximated efficiently even in high dimension. Here, we construct optimal infinitely wide networks, and show that the weight decay regularizer grows only linearly in the dimension $d$ of the data space, improving on the quadratic upper bound established by [OWSS19].

While highly idealized, this setting allows us to study several important aspects of neural network models:

Learning symmetries. The target function has two important symmetries:
- $f^{*}$ is radially symmetric. While it is impossible to fit this symmetry exactly by finite networks, it can be attained asymptotically for highly overparametrized networks. More precisely, one could ask whether regularized risk minimization leads to symmetry learning. While we show that a unique radially symmetric solution exists, it remains open whether other solutions exist which do not exhibit radial symmetry.
- $0 \leq f^{*} \leq 1$ almost everywhere with respect to the data distribution. Unlike linear models, which necessarily output negative data even if all training data points are positive, neural networks have the capacity to respect this constraint. We show that risk minimization asymptotically enforces the bound everywhere on the data space, at least for the unique radially symmetric solution.
Fitting random or perturbed data. It is known that overparametrized neural networks can fit random data, but due to the great generality of the result, the network weights may be prohibitively large for given data. Assuming that labels are generated by a function which can be approximated well by a neural network, compactly supported bump functions can be used to obtain an upper bound on the magnitude needed for specific labels or the increase necessary in the weight decay regularizer if the labels are perturbed.
Depth separation and curse of dimensionality. We prove two complementary results:
- In dimension $d$ , there exists an infinitely wide ReLU network with one hidden layer $f_{d}^{*}$ with weight decay regularizer $\sim d$ such that $f_{d}^{*} (0) = 1$ and $f_{d}^{*} (x) = 0$ if $| x | \geq 1$ .
- If $f_{d, ε}^{*}$ is an infinitely wide ReLU network with one hidden layer such that $f_{d, ε}^{*} (x) = 1$ for $| x | \leq ε$ and $f_{d, ε}^{*} (x) = 0$ if $| x | \geq 1$ , then the weight decay regularizer of $f_{d, ε}^{*}$ grows at least exponentially in the dimension $d$ of the data space, namely as $ε^{2} d^{1 / 2} (1 - ε^{2})^{- \frac{d + 1}{2}}$ .
The curse of dimensionality can be avoided in the second situation by using a neural network with two hidden layers, for which the weight decay regularizer only grows as $\sim d^{1 / 3} (1 - ε)^{- 1}$ .
Effect of regularization. Weight decay regularization is often taken as a proxy for controlling the Lipschitz constant of a neural network, as it can be computed more easily. In this highly symmetric setting, we can compare two optimal solutions:
1. The data is fitted optimally by the function $\hat{f}_{d} (x) = \max \{1 - | x |, 0\}$ , which attains the minimal Lipschitz constant $1$ . The function cannot be represented by a ReLU network with a single hidden layer and finite weights, even in the infinite width limit [EW20b, Example 5.19]. It can be represented by a neural network with two infinitely wide hidden layers and weight decay $\sim \sqrt{d}$ .
2. The weight decay regularizer of the optimal two-layer ReLU network $f_{d}^{*}$ grows like $d$ , while its Lipschitz constant grows like $\sqrt{d}$ .
Highly localized peaks. The target function can be seen as the prototypical example of learning functions which take values $y_{1}, \dots, y_{N}$ at isolated points $x_{1}, \dots, x_{N}$ which are separated as ‘islands’ in a ‘sea’ of points $x_{N + 1}, \dots, x_{M}$ with labels $y_{N + 1} = \dots = y_{M} = 0$ .
Mollification. The infinitely wide neural networks constructed in this note can be used to establish approximation rates in function spaces for shallow neural networks by mollification, if the mollification width $ε$ is optimized to balance the competition between approximation of the target function by the infinitely wide network and approximation of the infinitely wide network by finite neural networks.

To the best of our knowledge, this is the first time that an optimal solution for fitting data by neural networks has been computed in dimension $d > 1$ . For technical reasons, we focus on the case that $d$ is odd. The optimal radial solution can be written as a finite sum

f_{d} (x) = \sum_{i = 0}^{n + 1} μ_{i} \fint_{S^{d - 1}} σ (ν^{T} x - b_{i}) d H^{d - 1}, \qquad n = \frac{d - 1}{2}, \qquad 0 = b_{0} < \dots < b_{n + 1} = 1

for some coefficients $μ_{i} \in R$ satisfying $\sum_{i = 0}^{n} | μ_{i} | = γ_{n} \sim 3.7 d$ .

The article is organized as follows. In the remainder of the Introduction, we briefly review the context of this work in the literature and the notation we will use throughout the article. In Section 2, we give a brief introduction to the function spaces associated to two-layer ReLU networks with a weight decay regularizer (Barron or Radon BV spaces). Sections 3 and 4 are dedicated to the statement and proof of our main results respectively. Applications of our results can be found in Section 5. Numerical approximations of the optimal solutions $f_{d}^{*}$ can be found in Section 6. We conclude the article with a brief summary and list of open problems in Section 7.

Further numerical experiments can be found in Appendix A. Some proofs from the main part of the article are postponed to Appendix B, while proofs of results which are known in similar form are postponed to Appendix C. Slight extensions of the main results can be found in Appendix D.

1.1. Previous work

The complexity of a neural network is often measured by the number of its non-zero coefficients (weights) [LWK17, SSVB17, GKNV22] or by a measure of their magnitude. From a practical perspective, both are crucial pieces of information: a neural network with an excessive number of non-zero connections is expensive to store and evaluate, while a network with very large coefficients is likely to depend on subtle cancellations at training data points and unlikely to generalize well to unseen data.

[Bar93] realized that a large class of functions can be approximated efficiently by neural networks with a single hidden layer and any sigmoidal activation function while keeping the outer layer coefficients bounded. The function class is defined in terms of a spectral criterion and diverse enough that any linear method of approximation must face the curse of dimensionality in it.

Subsequently, function approximation by ReLU networks with a single hidden layer and bounded coefficients in both layers was studied in [Bac17, EMW18, EMW19b, EW20b]. Optimal rates of approximation were obtained in [SX19, SX21b]. A spectral criterion for this scenario in terms of the Fourier transform was developed in [KB18], and a sharp criterion in terms of the Radon transform in [OWSS19, PN21]. A detailed study of Fourier-like criteria in this context is given by [CPV20].

The norm in these function spaces is related to the popular explicit ‘weight decay’ regularizer (the $ℓ^{2}$ -norm of the network weights). It retains significance in the context of implicit regularization, as [CB20] showed that infinitely wide two-layer ReLU networks converge to minimum norm/maximum margin classifiers with respect to the weight decay norm, when trained by a gradient flow optimizer for binary classification with logistic loss.

While the structure of the function spaces has been studied and many of their functional analytic properties are understood [EW20b, PN21, SX21b, SX21a], explicit examples remain rare. Spectral criteria have been used to show that functions in certain smoothness classes can be expressed as infinitely wide two-layer networks with finite weight-decay norm. [EW22] construct a maximum margin classifier in a simple one-dimensional scenario. A structure theorem is given in [EW20b] to easily demonstrate that certain functions cannot be expressed this way. Closest to the present work is [Han21], where the minimum norm interpolants of a finite one-dimensional data set are studied.

Much of the work on ReLU-activated two-layer networks makes heavy use of the homogeneity of the activation function. Two-layer neural networks with arbitrary activation are studied e.g. in [SX20, LMW20]. Partial (and different) extensions to deeper neural networks can be found e.g. in [PN22, EW20a], while residual neural networks of continuous depth (‘neural ODEs’) have been studied from this perspective in [EMW19a, EMW19b, EMW19c].

1.2. Notation

We denote by $\fint_{A}$ the average integral over a set $A$ which has finite measure for a measure $μ$ , i.e. $\fint_{A} f (x) d μ_{x} = \frac{1}{μ (A)} \int_{A} f (x) d μ_{x}$ . By $d μ_{x}$ we mean that we integrate with respect to the (signed) measure $μ$ in the variable $x$ . In this article, $μ$ will always be a measure (often signed), while $ν$ denotes the exterior normal vector field on a sphere.

The natural $d - 1$ -dimensional area (Hausdorff) measure is denoted by $H^{d - 1}$ . In this article, it will always refer to the (unnormalized) uniform distribution on a $d - 1$ -dimensional sphere.

The total variation norm of a measure $μ$ on a measurable space $X$ is defined as $∥ μ ∥_{T V} = μ_{+} (X) + μ_{-} (X)$ , where $μ_{+}, μ_{-}$ are the positive and negative parts in the Jordan decomposition of the signed measure $μ$ .

In the following, $g$ is always going to be a function of one variable and $f$ is going to be a radially symmetric function on $R^{d}$ . By an abuse of notation, we will also consider $f : [0, \infty) \to R$ defined by $f (r) = f (r \cdot e_{1})$ . We denote by

c_{d} = \frac{| S^{d - 2} |}{| S^{d - 1} |} = \frac{1}{\int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s}

a quotient related to the area of hyperspheres in dimension $d$ and $d - 1$ , and by $γ_{n}$ a constant related to the approximability of the function $\sqrt{s}$ by polynomials of degree at most $n$ in $L^{\infty} (0, 1)$ , which also relates to the minimal value of the weight decay regularizer for fitting data as above.

The variables $d$ and $n$ are always related by $n = \frac{d - 1}{2}$ , i.e. $d = 2 n + 1$ .
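The two expressions for $c_{d}$ can be compared numerically. The following is a minimal sketch, assuming $d \geq 3$ and the standard formula $| S^{k - 1} | = 2 π^{k / 2} / Γ (k / 2)$ for the area of the unit sphere in $R^{k}$ ; the function names and the midpoint quadrature are illustrative choices, not part of the article.

```python
import math

def c_d_gamma(d):
    """c_d = |S^{d-2}| / |S^{d-1}|, using |S^{k-1}| = 2 pi^{k/2} / Gamma(k/2)."""
    def area(k):
        # Surface area of the unit sphere S^{k-1} in R^k.
        return 2.0 * math.pi ** (k / 2.0) / math.gamma(k / 2.0)
    return area(d - 1) / area(d)

def c_d_integral(d, steps=20_000):
    """c_d = 1 / int_{-1}^{1} (1 - s^2)^{(d-3)/2} ds, midpoint rule (assumes d >= 3)."""
    h = 2.0 / steps
    total = 0.0
    for j in range(steps):
        s = -1.0 + (j + 0.5) * h  # midpoint of the j-th cell
        total += (1.0 - s * s) ** ((d - 3) / 2.0)
    return 1.0 / (total * h)

if __name__ == "__main__":
    for d in (3, 5, 7, 9):
        print(d, c_d_gamma(d), c_d_integral(d))
```

For $d = 3$ the integrand is constant and both expressions give $1 / 2$ ; for $d = 5$ both give $3 / 4$ .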

2. Weight decay and Barron spaces

In this section, we briefly review the theory of infinitely wide ReLU networks with a single hidden layer. Function spaces for this setting have been studied under the name $F_{1}$ in [Bac17], Barron space in [EMWW20, EMW19b, EMW19c, EMW18], Radon-BV in [PN22, PN21] and the convex hull of the ReLU dictionary or the variation space of the ReLU dictionary in [SX21a, SX21c]. In this note, we refer to them as Barron spaces in reference to the seminal work of Andrew Barron [Bar93]. Some results presented below are extensions of known results to the case where we consider a Barron semi-norm rather than the full Barron norm, corresponding to a weight decay regularizer which does not control the magnitude of the biases.

A neural network with a single hidden layer and $m \in N$ neurons can be represented as

f_{m} (x) = \sum_{i = 1}^{m} a_{i} σ (w_{i}^{T} x + b_{i}) \quad \text{or} \quad f_{m} (x) = \frac{1}{m} \sum_{i = 1}^{m} a_{i} σ (w_{i}^{T} x + b_{i})

where $(a_{i}, w_{i}, b_{i}) \in R \times R^{d} \times R$ are the weights of the neural network. For networks in which the size of the weights is controlled, this representation can be generalized to

f_{μ} (x) = \int_{R \times R^{d} \times R} a σ (w^{T} x + b) d μ_{(a, w, b)} or f_{π} (x) = \int_{R \times R^{d} \times R} a σ (w^{T} x + b) d π_{(a, w, b)}

where $μ$ is a measure on $R^{d + 2}$ and $π$ is a probability measure on $R^{d + 2}$ . More generally, due to the symmetry $a σ (w^{T} x + b) = λ ((λ^{- 1} a) σ (w^{T} x + b))$ for $λ \neq 0$ , $μ$ can be taken to be a signed measure. Finite networks are contained in the general setting by setting

μ_{m} = \sum_{i = 1}^{m} λ_{i} δ_{(λ_{i}^{- 1} a_{i}, w_{i}, b_{i})} \quad \text{and} \quad π_{m} = \frac{1}{m} \sum_{i = 1}^{m} δ_{(a_{i}, w_{i}, b_{i})}

respectively, where the parameters $λ_{i} \neq 0$ can be chosen freely for a convenient representation. The integral is guaranteed to converge if the Barron norm

(2.3)

∥ f ∥_{B}

is finite, where $| μ | = μ^{+} + μ^{-}$ denotes the total variation measure of the signed measure $μ = μ^{+} - μ^{-}$ . The infimum must be taken since the representation of a function in this fashion is highly non-unique [EW20b, Section 2.1]. The two representations of the norm coincide by [EW20b, Section 2.4].

The norm in the parameter variable $w$ is chosen dual to the norm in the data variable $x$ such that the inequality $| w^{T} x | \leq | w | \cdot | x |$ holds. In particular, if distances in the data domain are measured in the $ℓ^{p}$ -sense for $p \in [1, \infty]$ , then distances in the parameter domain are measured in the $ℓ^{q}$ -sense for $q = \frac{p}{p - 1}$ . For compatibility with radial symmetry, we focus on the case $p = \frac{p}{p - 1} = 2$ in this note.

We refer to the space $\{f : ∥ f ∥_{B} < \infty\}$ as Barron space $B$ , or at times $B (R^{d})$ to indicate dependence on dimension.

Due to the control over the bias, the Barron norm as defined in [EMW19b, EW20b] is not invariant under translations in the data space, i.e. the functions $f$ and $f (\cdot + \bar{x})$ generally have a different norm for $\bar{x} \neq 0$ . By contrast, the following Barron semi-norm is translation invariant and has useful properties which suffice in many applications:

(2.4)

[f]_{B} = \inf_{π} \left\{\frac{1}{2} \int_{R^{d + 2}} | a |^{2} + | w |^{2} d π : f \equiv f_{π}\right\} = \inf_{μ} \left\{\frac{1}{2} \int_{R^{d + 2}} | a |^{2} + | w |^{2} d | μ | : f \equiv f_{μ}\right\} .

We will address the convergence of the integrals in (2) without control over $b$ in Proposition 2.1. This is more in line with the approach in [OWSS19, PN21], where the magnitude of the bias is also not controlled. We opt for controlling $| a |^{2} + | w |^{2}$ rather than $| a | \cdot | w |$ for convenience, but note that the classical Barron norm could be defined in this fashion, too. The key observation is that the ReLU activation function $σ (z) = \max \{z, 0\}$ is positively one-homogeneous, i.e. $σ (λ z) = λ σ (z)$ for all $λ > 0$ . In particular

a σ (w^{T} x + b) = a \sqrt{\frac{| w |}{| a |}} σ (\sqrt{\frac{| a |}{| w |}} w^{T} x + \sqrt{\frac{| a |}{| w |}} b),

i.e. we may normalize neurons $(a_{i}, w_{i}, b_{i})$ to

a_{i}^{'} = a_{i} \sqrt{\frac{| w_{i} |}{| a_{i} |}}, w_{i}^{'} = \sqrt{\frac{| a_{i} |}{| w_{i} |}} w_{i} s.t. | a_{i}^{'} |^{2} = | w_{i}^{'} |^{2} = | a_{i} | | w_{i} |

without changing the output of the neural network. In particular $| a |^{2} + | w |^{2} = 2 | a | | w |$ , indicating that we could define the Barron norm in the analogous fashion by squares. Indeed, in the infinite limit it is even possible to assume that $π$ is supported on the set $| a | = | w | = \sqrt{[f]_{B}}$ . For a more technically rigorous discussion, see e.g. [EW20b].
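The rescaling above can be checked on a single neuron. The following is a minimal sketch, assuming $a \neq 0$ and $w \neq 0$ ; the helper names `neuron` and `balance` are hypothetical and only serve the illustration.

```python
import math

def relu(z):
    return max(z, 0.0)

def neuron(a, w, b, x):
    """Output a * relu(w^T x + b) of a single neuron."""
    return a * relu(sum(wi * xi for wi, xi in zip(w, x)) + b)

def balance(a, w, b):
    """Rescale (a, w, b) so that |a'|^2 = |w'|^2 = |a| |w| (assumes a != 0, w != 0)."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    lam = math.sqrt(abs(a) / norm_w)  # the factor sqrt(|a| / |w|) from the text
    return a / lam, [lam * wi for wi in w], lam * b

if __name__ == "__main__":
    a, w, b = -8.0, [1.0, 2.0, 2.0], 0.5
    a2, w2, b2 = balance(a, w, b)
    x = [0.3, -0.1, 0.4]
    print(neuron(a, w, b, x), neuron(a2, w2, b2, x))  # same output
    print(a2 ** 2, sum(c * c for c in w2))            # both equal |a| |w|
```

Since the bias is rescaled by the same positive factor as $w$ , positive one-homogeneity of $σ$ leaves the output unchanged, while $\frac{1}{2} (| a^{'} |^{2} + | w^{'} |^{2}) = | a | | w |$ .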

By a slight abuse of terminology, we will also refer to $B_{0}$ as Barron space from now on. We briefly note the following properties, which relate the Barron semi-norm and more well-established quantities.

Proposition 2.1.

If the integral in (2.4) is finite for $π$ , then the integral defining $f_{π}$ in (2) exists for all $x \in R^{d}$ if and only if it exists for $x = 0$ . It may then be re-cast as

$f_{π} (x) = f_{π} (0) + \int_{R^{d + 2}} a [σ (w^{T} x + b) - σ (b)] d π .$

This expression always converges if the integral in (2.4) is finite. The integral exists as a Bochner integral with values in $C^{0} (K)$ for compact $K \subseteq R^{d}$ or $L^{p} (P)$ for a probability distribution $P$ on $R^{d}$ with finite $p$ -th moments.
$[f]_{B}$ is a norm on the modified Barron space $V_{0} = \{f \in C^{0} (R^{d}) : f (0) = 0, [f]_{B} < \infty\}$ , which makes $V_{0}$ a Banach space. Compared to classical Barron spaces, $V_{0} ⊈ B (R^{d})$ .
$[f]_{B} \leq ∥ f ∥_{B}$ .
If $f \in B$ , then $f$ is Lipschitz-continuous and the Lipschitz-constant of $f$ satisfies $[f]_{L i p} \leq [f]_{B}$ .

All statements could be given in terms of a general signed measure $μ$ instead of $π$ . The proof, along with other proofs from this section, can be found in Appendix C.

Functions in Barron spaces are defined by means of an explicit representation formula. Paradoxically, this explicit characterization often makes it difficult to verify whether a given function is in Barron space. A more abstract framework was created in [OWSS19] by the means of the Radon transform, based on the observation that

$Δ \left(\sum_{i = 1}^{m} a_{i} σ (w_{i}^{T} \cdot + b_{i})\right) = \sum_{i = 1}^{m} a_{i} | w_{i} | \cdot H^{d - 1} |_{W_{i}}$
$D^{2} \left(\sum_{i = 1}^{m} a_{i} σ (w_{i}^{T} \cdot + b_{i})\right) = \sum_{i = 1}^{m} a_{i} | w_{i} | \cdot \frac{w_{i}}{| w_{i} |} \otimes \frac{w_{i}}{| w_{i} |} \cdot H^{d - 1} |_{W_{i}},$

i.e. the second spatial derivatives of a ReLU network with one hidden layer are superpositions of measures concentrated on the hyperplanes $W_{i} = \{x : w_{i}^{T} x + b_{i} = 0\}$ . This allows for a characterization of Barron spaces in terms of second derivatives. The Radon transform is used as a technical tool in order to dualize from hyperplanes to points. This convenient characterization allows the construction of some examples of functions in Barron space.
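In dimension one, the hyperplanes $W_{i}$ reduce to the points $- b_{i} / w_{i}$ , and the observation above says that the derivative of the network jumps by $a_{i} | w_{i} |$ at the $i$ -th kink. The following is a minimal sketch of this special case, assuming distinct kinks; the parameter values and the finite-difference step are arbitrary illustrative choices.

```python
def relu(z):
    return max(z, 0.0)

def network(params, x):
    """One-dimensional network sum_i a_i * relu(w_i * x + b_i)."""
    return sum(a * relu(w * x + b) for a, w, b in params)

def derivative_jump(params, x0, h=1e-4):
    """Right derivative minus left derivative at x0 (one-sided differences)."""
    right = (network(params, x0 + h) - network(params, x0)) / h
    left = (network(params, x0) - network(params, x0 - h)) / h
    return right - left

if __name__ == "__main__":
    # Hypothetical weights (a_i, w_i, b_i); the kinks sit at -b_i / w_i = 0.5 and -0.5.
    params = [(2.0, 3.0, -1.5), (-1.0, -2.0, -1.0)]
    for a, w, b in params:
        kink = -b / w
        print(kink, derivative_jump(params, kink), a * abs(w))
```

Away from the kinks the network is affine, so the jump vanishes there; this is the one-dimensional shadow of the measures $a_{i} | w_{i} | \cdot H^{d - 1} |_{W_{i}}$ .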

Example 2.2.

Assume that $f$ is a Lipschitz-continuous function and that the (possibly non-integer) power $(- Δ)^{(d + 1) / 2} f$ of the Laplacian in the distributional sense exists as a measure. Then

$[f]_{B (R^{d})} \leq \frac{1}{2^{d - 1} π^{d / 2 - 1} Γ (d / 2)} ∥ (- Δ)^{(d + 1) / 2} f ∥_{T V},$

where $∥ \cdot ∥_{T V}$ denotes the total variation norm of the measure $(- Δ)^{(d + 1) / 2} f$ [OWSS19, Proposition 3].
If $d$ is odd, the power of the Laplacian is integer. In particular, if $f$ belongs to the Sobolev space $W^{d + 1, 1} (R^{d}) \subseteq C^{d + 1} (R^{d})$ of functions whose first $d + 1$ (weak) partial derivatives are $L^{1}$ -integrable, then $f \in B (R^{d})$ and

$[f]_{B} \leq c_{d} ∥ f ∥_{W^{d + 1, 1}} .$

for some constant $c_{d} > 0$ , which depends on the exact choice of the norm on $W^{d + 1, 1}$ . In particular $C_{c}^{\infty} (R^{d}) \subseteq B (R^{d})$ [OWSS19, Corollary 1].
If $d \geq 3$ is an odd integer and $f_{d, k} : R^{d} \to R$ is the radial bump function given by

$f_{d, k} (x) = \begin{cases} (1 - | x |^{2})^{k} & | x | \leq 1 \\ 0 & \text{else} \end{cases},$

then $f_{d, k} \in B_{0} (R^{d})$ if $k \geq \frac{d + 1}{2}$ . For $k_{d} = \frac{d + 1}{2} + 2$ , the norm bound $[f_{d, k_{d}}]_{B (R^{d})} \leq 2 d (d + 5)$ holds according to [OWSS19, Example 3].

In [OWSS19], also a stronger version of the statement is claimed, including an if and only if condition for $k$ and a comparable lower bound for $[f_{d, k_{d}}]_{B (R^{d})}$ . Those claims are based on an error in the proof of [OWSS19, Proposition 15], where the erroneous claim is made that if $\int_{R^{d}} | ϕ | d x = 1$ , then the integral of $ϕ$ over any hyperplane $\int_{H} | ϕ | d H^{d - 1}$ is bounded from above by $1$ .

Based on the same intuition, we point out two observations. The first demonstrates that the singular set $Σ$ of a Barron function (i.e. the set where the function is not differentiable) is ‘straight’ and lower dimensional. This is a stronger version of Rademacher’s theorem, which states that the singular set of a Lipschitz function is Lebesgue null, in the context of Barron spaces. The following statement has the stronger implication that $Σ$ is contained in a countable union of affine subspaces of $R^{d}$ and therefore has Hausdorff dimension $\leq d - 1$ .

Proposition 2.3.

[EW20b] Any function $f \in B (R^{d})$ can be written as a countable sum $f = \sum_{i = 0}^{\infty} f_{i}$ where

$f_{0} \in B (R^{d})$ is $C^{1}$ -smooth,
$f_{i} (x) = g_{i} (P_{i} x + b_{i})$ where
- $P_{i} : R^{d} \to R^{k_{i}}$ is an orthogonal projection for $1 \leq k_{i} \leq d$ (i.e. $P_{i} P_{i}^{T} = I_{k_{i} \times k_{i}}$ )
- $g_{i} \in B (R^{k_{i}})$ is $C^{1}$ -smooth except at $0 \in R^{k_{i}}$ .

The fact that the singular set is straight has two immediate implications.

Corollary 2.4.

If $f \in B (R^{d})$ is radially symmetric, then $f \in C^{1} (R^{d} ∖ \{0\})$ .
If $ϕ : R^{d} \to R^{d}$ is a diffeomorphism such that $f \in B (R^{d}) \Rightarrow f \circ ϕ \in B (R^{d})$ , then $ϕ$ is an affine linear map [EW20b, Theorem 5.18].

A brief inspection of the proof of [EW20b, Theorem 5.18] reveals that Proposition 2.3 and both statements of Corollary 2.4 remain valid for $B_{0} (R^{d})$ . A stronger result on radial Barron functions is proved below in Lemma 4.1. Secondly, we recall a characterization of one-dimensional Barron spaces, which is essentially the simpler one-dimensional case of the Radon transform construction. A similar statement can also be found e.g. in [EW20b, Example 4.1] and [LMW20].

Proposition 2.5.

$ϕ \in B_{0} (R)$ if and only if there exists a finite signed measure $μ$ such that $ϕ^{''} = μ$ , i.e. $ϕ^{'} (s) = μ ((- \infty, s])$ for all $s \in R$ such that $μ (\{s\}) = 0$ (in particular, all but countably many). For all such $ϕ$ and any $a \in R$ , we can write

ϕ (z) = ϕ (a) + ϕ^{'} (a) [σ (z - a) - σ (a - z)] + \int_{a}^{\infty} ϕ^{''} (s) σ (z - s) d s + \int_{- \infty}^{a} ϕ^{''} (s) σ (s - z) d s .

Furthermore

[ϕ]_{B} \leq ∥ ϕ^{''} ∥_{T V} + 2 \inf_{a \in R} \inf_{v \in \partial ϕ (a)} | v |

where

\partial_{a} f = \mathrm{conv} \left(\left\{v \in R : \exists x_{n} \to a \text{ s.t. } \frac{f (x_{n}) - f (a)}{x_{n} - a} \to v\right\}\right)

is the convex hull of the set of approximate derivatives. Conversely

\max \left\{∥ ϕ^{''} ∥_{T V}, \sup_{a \in R} \inf_{v \in \partial ϕ (a)} | v |\right\} \leq [ϕ]_{B} .

We believe that the upper bound is in fact an identity. We now recall a property of Barron spaces $B_{0}$ .

Proposition 2.6 (Direct approximation theorem).

For every $f \in B_{0}$ and every probability measure $P$ on $R^{d}$ there exists $f_{m}$ as in (2) and $c > 0$ such that

∥ f - f_{m} - c ∥_{L^{2} (P)}^{2} \leq \frac{[f]_{B}^{2}}{m} \max_{| ν | \leq 1} \int_{R^{d}} | ν^{T} x |^{2} d P

and

| a_{i} | = | w_{i} | = \sqrt{\frac{∥ f ∥_{B}}{m}} or | a_{i} | = | w_{i} | = \sqrt{∥ f ∥_{B}},

depending on the normalization in (2).

For the sake of completeness, we sketch a probabilistic proof in Appendix C. This formulation of the direct approximation theorem improves on known results in two major ways:

The dependence on the data distribution $P$ is only through the ‘projected second moments’ $M_{2, \mathrm{proj}} (P) := \max_{| w | \leq 1} \int_{R^{d}} | w^{T} x |^{2} d P$ rather than the full second moments $M_{2} (P) := \int_{R^{d}} | x |^{2} d P$ . It is easy to see that

$M_{2, \mathrm{proj}} (P) \leq M_{2} (P) = \sum_{i = 1}^{d} \int_{R^{d}} | e_{i}^{T} x |^{2} d P \leq d \cdot M_{2, \mathrm{proj}} (P)$

for any probability measure $P$ on $R^{d}$ , and that equality is attained for any measure $P$ which is the product of $d$ one-dimensional probability measures, e.g. a standard normal distribution. The constant in the bound may therefore be significantly smaller in high dimension.
The bound depends on the Barron semi-norm, but not the full Barron norm.
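The gap between projected and full second moments can be made concrete for empirical measures. The following is a minimal sketch, assuming that $P$ is the uniform measure on a finite point set, in which case the projected second moment is the largest eigenvalue of the matrix $\int x x^{T} d P$ ; the power iteration, its start vector and the two point sets are hypothetical illustrative choices.

```python
import math

def full_second_moment(points):
    """M_2(P) for the uniform measure P on the given points."""
    return sum(sum(c * c for c in p) for p in points) / len(points)

def projected_second_moment(points, iters=200):
    """M_{2,proj}(P): largest eigenvalue of E[x x^T], by power iteration."""
    n, d = len(points), len(points[0])
    m = [[sum(p[i] * p[j] for p in points) / n for j in range(d)] for i in range(d)]
    v = [1.0 / (i + 1) for i in range(d)]  # generic start vector (assumption)
    value = 0.0
    for _ in range(iters):
        mv = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(c * c for c in mv))
        if norm == 0.0:
            return 0.0
        # Rayleigh quotient of the current iterate.
        value = sum(v[i] * mv[i] for i in range(d)) / sum(c * c for c in v)
        v = [c / norm for c in mv]
    return value

d = 4
# Isotropic example: the points +-sqrt(d) e_i, so that E[x x^T] is the identity.
isotropic = [tuple(s * math.sqrt(d) if i == j else 0.0 for j in range(d))
             for i in range(d) for s in (1.0, -1.0)]
# Concentrated example: all mass on the first coordinate axis.
concentrated = [(2.0, 0.0, 0.0, 0.0), (-2.0, 0.0, 0.0, 0.0)]

if __name__ == "__main__":
    for name, pts in (("isotropic", isotropic), ("concentrated", concentrated)):
        print(name, projected_second_moment(pts), full_second_moment(pts))
```

In the isotropic example the upper bound $M_{2} (P) = d \cdot M_{2, \mathrm{proj}} (P)$ is attained, while in the concentrated example the two quantities coincide.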

While the constants are improved in this formulation compared to e.g. [EMW18, EMW19b, EW20b], the result is not expected to be sharp in terms of the rate which is achieved. An improvement from $m^{- 1 / 2}$ to $m^{- 1 / 2 - 3 / (2 d)}$ in the classical setting can be found in [SX21b] at the cost of a more involved proof.

Many of the results above are somewhat specific to ReLU activation as the proofs either use positive homogeneity or the property that $σ^{''} = δ$ . Both are shared by leaky ReLU activation.

Remark 2.7.

Consider the leaky ReLU activation function $σ_{ε} (z) = \max \{ε z, z\}$ for $ε \in (0, 1)$ in addition to the classical ReLU activation $σ = σ_{0}$ . Since

σ_{ε} (z) = σ (z) - ε σ (- z) and σ (z) = \frac{1}{1 - ε^{2}} σ_{ε} (z) + \frac{ε}{1 - ε^{2}} σ_{ε} (- z),

any function which can be represented as a superposition of ReLUs can be represented as a superposition of leaky ReLUs and vice versa. The entire construction of Barron space goes through as above, leading to two semi-norms $[\cdot]_{B}$ and $[\cdot]_{ε}$ on the same function class such that $[f]_{ε} \leq (1 + ε) [f]_{B}$ and $[f]_{B} \leq \frac{1 + ε}{1 - ε^{2}} [f]_{ε} = \frac{1}{1 - ε} [f]_{ε}$ by the explicit representation (2.7). More compactly, we write this as

(1 - ε) [f]_{B} \leq [f]_{ε} \leq (1 + ε) [f]_{B} \forall f \in B_{0} .

Using positive one-homogeneity, it can be seen that the coefficients in the representations (2.7) are in fact optimal and thus that (2.7) is sharp. The norms induced on Barron space by ReLU and leaky ReLU activation are therefore equivalent, and all properties mentioned above survive if $σ$ is replaced by $σ_{ε}$ .

The more subtle statements which we prove below do not survive passing to an equivalent norm. When minimizing $[f]_{ε}$ under the constraints $f (x_{i}) = y_{i}$ , the set of solutions $M_{ε} \subseteq B_{0}$ will generally depend on $ε \in [0, 1)$ . For example, consider the one-dimensional data set with two points $(x_{0}, y_{0}) = (0, 0)$ and $(x_{1}, y_{1}) = (1, 1)$ , which is fit exactly by $σ_{ε}$ for any $ε$ . The solution $σ_{ε}$ is norm-minimizing for $[\cdot]_{ε}$ , but not for $[\cdot]_{B}$ , where the norm-inequality is sharp (and vice versa).

The equivalence of norms estimate degenerates at $ε = 1$ , where the activation would become linear. If $ε < 0$ , a similar construction holds unless $ε = - 1$ , where any $σ_{ε}$ -Barron function $f$ must satisfy $\lim_{t \to \infty} f (t x) = \lim_{t \to - \infty} f (t x)$ .
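The two identities relating $σ$ and $σ_{ε}$ in Remark 2.7 are elementary and can be checked pointwise. The following is a minimal sketch, assuming $ε \in (0, 1)$ ; the function names are hypothetical.

```python
def relu(z):
    return max(z, 0.0)

def leaky_relu(z, eps):
    """Leaky ReLU max{eps * z, z} for eps in (0, 1)."""
    return max(eps * z, z)

def leaky_from_relu(z, eps):
    """sigma_eps(z) written as sigma(z) - eps * sigma(-z)."""
    return relu(z) - eps * relu(-z)

def relu_from_leaky(z, eps):
    """sigma(z) written as (sigma_eps(z) + eps * sigma_eps(-z)) / (1 - eps^2)."""
    return (leaky_relu(z, eps) + eps * leaky_relu(-z, eps)) / (1.0 - eps ** 2)

if __name__ == "__main__":
    for eps in (0.1, 0.5):
        for z in (-2.0, -0.3, 0.0, 0.7, 5.0):
            print(eps, z, leaky_relu(z, eps), leaky_from_relu(z, eps),
                  relu(z), relu_from_leaky(z, eps))
```

The absolute values of the coefficients sum to $1 + ε$ in the first identity and to $\frac{1 + ε}{1 - ε^{2}} = \frac{1}{1 - ε}$ in the second, which are the constants appearing in the norm comparison.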

We are finally ready to state (and prove) the main results of this article rigorously.

3. Statements of Main Results

Theorem 3.1.

For every odd $d \in N$ , there exists a unique radial function $f_{d}^{*} \in B (R^{d})$ such that

f_{d}^{*} \in \operatorname{argmin}_{f \in F} [f]_{B}, \qquad F := \{f \in C (R^{d}) : f (0) = 1 \text{ and } f \equiv 0 \text{ on } R^{d} ∖ B_{1} (0)\} .

Furthermore

$f_{d}^{*} \in C^{\frac{d - 1}{2}} (R^{d} ∖ \{0\})$ .
The radial profile $\hat{f}_{d}^{*} : [0, \infty) \to R$ , $\hat{f}_{d}^{*} (r) = f_{d}^{*} (r \cdot e_{1})$ is strictly monotone decreasing in $r$ on $(0, 1)$ . In particular, $0 \leq f_{d}^{*} \leq 1$ .
There exists $r_{d} > 0$ such that $\hat{f}_{d}^{*}$ is a linear, strictly monotone decreasing function of $r$ on $[0, r_{d}]$ .

As $d \to \infty$ , the norm of $f_{d}^{*}$ increases linearly as

\lim_{d \to \infty, \, d \text{ odd}} \frac{[f_{d}^{*}]_{B (R^{d})}}{d} = γ \approx 3.6,

where $γ$ is the inverse of the Bernstein constant.

The Bernstein constant is a quantity in classical numerical analysis and approximation theory arising when approximating the function $h (x) = | x |$ by polynomials in $L^{\infty} (- 1, 1)$ , see e.g. [Tre19]. From the proof of Theorem 3.1, we obtain an algorithm to compute $f_{d}^{*}$ to arbitrary precision, which is implemented in Section 6.

The functions $f_{d}^{*}$ are radially symmetric, compactly supported and non-negative. In particular, they can serve as mollifiers to easily prove quantitative approximation results for two-layer ReLU networks in general function classes. In a companion article [Wojb], we prove that they are achieved as (radial averages of) empirical risk minimizers with a weight decay regularizer.

Note that we do not exclude the possibility that other minimizers exist which are not radially symmetric. From direct arguments, we can only conclude that the set of minimizers is convex and invariant under coordinate rotations. The existence of at least one radially symmetric minimizer follows relatively easily, while its uniqueness is established below by construction. For any minimizer $\tilde{f}_{d} \in \operatorname{argmin}_{f \in F} [f]_{B}$ , which may not be radially symmetric, the radial average

\tilde{f}_{d, \mathrm{av}} (x) = \fint_{S O (d)} \tilde{f}_{d} (O x) d H_{O}

is a radially symmetric minimizer, i.e. $\tilde{f}_{d, \mathrm{av}} \equiv f_{d}^{*}$ . Knowledge of the unique minimizer after radial averaging allows us to study optimization algorithms for implicit bias and finding global optima. This line of inquiry is pursued in upcoming work [Wojc].
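The radial averaging operation can be illustrated in the plane, where $S O (2)$ is parametrized by a single angle. The following is a minimal sketch, assuming $d = 2$ (an even dimension, chosen purely for simplicity) and a uniform angular grid in place of the Haar integral; the test function `tent_l1` is a hypothetical non-radial function with $f (0) = 1$ and $f \equiv 0$ outside the unit ball, not a minimizer.

```python
import math

def radial_average(f, x, n_angles=720):
    """Average of f(O x) over a uniform grid of rotations O in SO(2)."""
    total = 0.0
    for k in range(n_angles):
        t = 2.0 * math.pi * k / n_angles
        c, s = math.cos(t), math.sin(t)
        total += f((c * x[0] - s * x[1], s * x[0] + c * x[1]))
    return total / n_angles

def tent_l1(x):
    """Non-radial bump max{1 - |x_1| - |x_2|, 0}; vanishes outside the unit ball."""
    return max(1.0 - abs(x[0]) - abs(x[1]), 0.0)

if __name__ == "__main__":
    r = 0.5
    for x in ((r, 0.0), (0.0, r), (r / math.sqrt(2.0), r / math.sqrt(2.0))):
        print(x, tent_l1(x), radial_average(tent_l1, x))
```

The averaged values agree at points of equal Euclidean norm although the values of `tent_l1` itself do not, and the constraints $f (0) = 1$ and $f \equiv 0$ outside the unit ball are preserved by the averaging.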

We find it easier to deal with odd dimensions, as the function $(1 - s^{2})^{\frac{d - 1}{2}}$ is a polynomial in this case. This is analogous to the observations of [OWSS19]. We remark that, if $f : R^{D} \to R$ is a Barron function and $d \leq D$ , then

\tilde{f} : R^{d} \to R, \qquad \tilde{f} (x) = f (x_{1}, \dots, x_{d}, 0, \dots, 0)

is also a Barron function and $[\tilde{f}]_{B (R^{d})} \leq [f]_{B (R^{D})}$ , so the limit

\lim_{d \to \infty} \frac{1}{d} \inf \left\{[f]_{B (R^{d})} : f (0) = 1 \text{ and } f \equiv 0 \text{ on } R^{d} ∖ \overline{B_{1} (0)}\right\} \approx 3.6

remains valid if even dimensions are considered, as can be seen when sandwiching an even integer $d$ between $d - 1$ and $d + 1$ .

Using a reflection argument, any radially symmetric Barron function $f : R^{d} \to R$ can be written as

f (x) = f (0) + \int_{[0, \infty)} \fint_{S^{d - 1}} σ (ν^{T} x - b) d H^{d - 1} d μ_{b}

for some measure $μ$ on the space of biases. In this context, Theorem 3.1 can be understood as a finite representer theorem, since the proof shows precisely that there exist $n + 2 = \frac{d + 3}{2} \in N$ weights $μ_{0}, \dots, μ_{n + 1}$ and biases $0 = b_{0} < \dots < b_{n + 1} = 1$ such that

f_{d}^{*} (x) = 1 + \sum_{i = 0}^{n + 1} μ_{i} \fint_{S^{d - 1}} σ (ν^{T} x - b_{i}) d H^{d - 1} .
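Each building block $\fint_{S^{d - 1}} σ (ν^{T} x - b) d H^{d - 1}$ depends on $x$ only through $r = | x |$ and can be evaluated by a one-dimensional integral. The following is a minimal sketch, assuming $d \geq 3$ and using the standard fact that $ν^{T} e_{1}$ has density $c_{d} (1 - s^{2})^{\frac{d - 3}{2}}$ on $[- 1, 1]$ when $ν$ is uniformly distributed on the sphere; the weights $μ_{i}$ and biases $b_{i}$ of $f_{d}^{*}$ are not reproduced here, and the function name and quadrature rule are illustrative.

```python
import math

def sphere_avg_relu(r, b, d, steps=20_000):
    """Spherical average of relu(nu^T x - b) over nu in S^{d-1}, for |x| = r, d >= 3.

    Uses the reduction c_d * int_{-1}^{1} relu(r*s - b) (1 - s^2)^{(d-3)/2} ds
    with a midpoint rule.
    """
    c_d = math.gamma(d / 2.0) / (math.sqrt(math.pi) * math.gamma((d - 1) / 2.0))
    h = 2.0 / steps
    total = 0.0
    for j in range(steps):
        s = -1.0 + (j + 0.5) * h  # midpoint of the j-th cell
        total += max(r * s - b, 0.0) * (1.0 - s * s) ** ((d - 3) / 2.0)
    return c_d * total * h

if __name__ == "__main__":
    # In d = 3 the density is constant and the average equals (r - b)^2 / (4 r) for r >= b >= 0.
    for r, b in ((1.0, 0.5), (0.8, 0.0), (0.3, 0.5)):
        print(r, b, sphere_avg_relu(r, b, 3))
```

For $r \leq b$ the average vanishes, so the $i$ -th building block only contributes outside the ball of radius $b_{i}$ .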

Finally, we note that the methods in the proof of Theorem 3.1 can also be used to show the following extension.

Theorem 3.2.

For every $ε \in (0, 1)$ and every odd $d \in N$ , there exists a unique radial function $f_{d, ε}^{*} \in B (R^{d})$ which minimizes the Barron semi-norm in the class

f_{d, ε}^{*} \in {argmin}_{f \in F_{ε}} [f]_{B}, F_{ε} := {f \in C (R^{d}) : f \equiv 1 on \overline{B_{ε} (0)} and f \equiv 0 on R^{d} ∖ B_{1} (0)} .

Furthermore

$f_{d, ε}^{*} \in C^{\frac{d - 1}{2}} (R^{d})$ .
The radial profile $\hat f_{d, ε}^{*} : [0, \infty) \to R$ , $\hat f_{d, ε}^{*} (r) = f_{d, ε}^{*} (r \cdot e_{1})$ is strictly monotone decreasing on $[ε, 1]$ . In particular, $0 \leq f_{d, ε}^{*} \leq 1$ .

In this case, the Barron norm grows exponentially in the dimension $d$ . More precisely, there exists $D \in N$ independent of $ε > 0$ such that

∥ f_{d, ε}^{*} ∥_{B (R^{d})} \geq \frac{ε^{2} \sqrt{d}}{(1 - ε^{2})^{\frac{d + 1}{2}}}

if $d \geq D$ .
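To get a feeling for how quickly this lower bound grows, its right-hand side can simply be evaluated. The following is a minimal sketch; the function name `lower_bound` is ours, and since the threshold $D$ is not made explicit, the printed values are only meaningful as an illustration of the growth rate for large $d$.

```python
import math

def lower_bound(d: int, eps: float) -> float:
    """Right-hand side eps^2 * sqrt(d) / (1 - eps^2)^((d + 1) / 2) of the bound in Theorem 3.2."""
    return eps**2 * math.sqrt(d) / (1.0 - eps**2) ** ((d + 1) / 2)

# Illustration only: the bound is claimed for d >= D, and D is not computed here.
for d in (11, 51, 101, 201):
    print(d, lower_bound(d, eps=0.5))
```

For fixed $ε$, the denominator $(1 - ε^{2})^{(d + 1) / 2}$ decays geometrically in $d$, which is what drives the exponential growth.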

We thus observe that the problem of approximating compactly supported bump functions which are constant in a neighbourhood of the origin by shallow neural networks suffers from the curse of dimensionality. We will argue below that this is not the case for ReLU networks with at least two hidden layers.

4. Proofs of the Main Results

We begin by stating three lemmas in this section, which are used to prove the main theorems. The proofs are given in Appendix B. By a slight abuse of notation, we denote $f (r) = f (r e_{1})$ for a radially symmetric function $f : R^{d} \to R$ and by $f^{'}$ the radial derivative of $f$ . We first note a general result on radially symmetric Barron functions.

Lemma 4.1.

Let $f : R^{d} \to R$ be a radially symmetric Barron function and $d$ odd. Then

as a function of $r$ , $f$ is $n := \frac{d - 1}{2}$ times continuously differentiable in $R^{d} ∖ {0}$ . The $n + 1$ -th radial derivative is bounded and measurable, and the $n + 2$ -th radial derivative in the distributional sense is a bounded (Radon) measure.
for every $ε > 0$ , there exists $D \in N$ such that the Lipschitz bound

$[f]_{L i p} \leq \frac{1 + ε}{\sqrt{2 π d}} [f]_{B_{0}}$

holds for every $d \geq D$ .

The following Lemma allows us to express radial symmetry and compact support for Barron functions in odd dimensions in a one-dimensional fashion. It is based on an exchange in the order of integration in (3).

Lemma 4.2.

Assume that $g \in B_{0} (R)$ is a one-dimensional Barron function such that $g (0) = 1$ , $g \equiv 0$ outside of $(- 1, 1)$ and

\int_{- 1}^{1} g (s) s^{2 k} d s = 0 for k = 1, \dots, \frac{d - 3}{2} .

Then the function $f : R^{d} \to R$ given by

f (x) = \frac{\int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} g (| x | s) d s}{\int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s}

satisfies the following properties:

$f (0) = 1$ ,
$f (x) = 0$ if $| x | \geq 1$ ,
$f$ is radially symmetric, and
$[f]_{B (R^{d})} \leq [g]_{B (R)}$ .

Conversely, if $f : R^{d} \to R$ is a radially symmetric Barron function which satisfies $f (0) = 1$ and $f \equiv 0$ on $R^{d} ∖ B_{1} (0)$ , then there exists $g$ as above such that (4.2) holds, which is additionally an even function and satisfies $[f]_{B (R^{d})} = [g]_{B (R)}$ .

Furthermore $f \equiv 1$ in $B_{ε} (0)$ if and only if $g \equiv 1$ in $(- ε, ε)$ for the even representative of the function class $g$ .

We will show that such a Barron function $g$ indeed exists for every odd $d \geq 3$ and compute the precise asymptotic growth of $[g]_{B}$ as $d \to \infty$ . The following Lemma is the main technical tool in our proof.

Lemma 4.3.

For $n \in N$ , set

γ_{n} := min {∥ μ ∥_{T V} : \int_{0}^{1} s d μ_{s} = 1, \int_{0}^{1} s^{2 k} d μ_{s} = 0 for 0 \leq k \leq n} .

Then ${lim}_{n \to \infty} \frac{γ_{n}}{n} = γ \approx 3.57$ is the inverse of the Bernstein constant. The minimum is attained by a unique measure $μ = \sum_{i = 0}^{n + 1} μ_{i} δ_{s_{i}}$ where

$0 = s_{0} < s_{1} < \dots < s_{n + 1} = 1$ are the $n + 2$ distinct points in $[0, 1]$ at which $P (s) - s$ is extremal in $[0, 1]$ , where $P$ is the optimal even polynomial approximator of degree $\leq 2 n$ for $g (s) = s$ in $L^{\infty} (0, 1)$ .

$μ_{0}, \dots, μ_{n + 1} \in R$ are parameters satisfying the alternation criterion $μ_{i + 1} μ_{i} < 0$ for $i = 0, \dots, n$ and the generalized Vandermonde system

\begin{pmatrix} s_{0} & s_{1} & \dots & s_{n + 1} \\ 1 & 1 & \dots & 1 \\ s_{0}^{2} & s_{1}^{2} & \dots & s_{n + 1}^{2} \\ s_{0}^{4} & s_{1}^{4} & \dots & s_{n + 1}^{4} \\ ⋮ & ⋮ & ⋱ & ⋮ \\ s_{0}^{2 n} & s_{1}^{2 n} & \dots & s_{n + 1}^{2 n} \end{pmatrix} \begin{pmatrix} μ_{0} \\ ⋮ \\ μ_{n + 1} \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \\ ⋮ \\ 0 \end{pmatrix} .

We are finally prepared to prove our main results.

Proof of Theorem 3.1.

Step 1. In this step, we construct $g$ and $f$ using Lemmas 4.2 and 4.3. Let $d \geq 3$ be an odd integer and $n = \frac{d - 1}{2}$ . Let $g : R \to R$ be the unique function such that

g (- 1) = 0, {lim}_{z ↗ - 1} g^{'} (z) = 0, g^{''} = μ,

in the distributional sense, where $μ$ is the even reflection of the measure $μ_{n + 1}$ described in Lemma 4.3, i.e. $μ (U) = μ_{n + 1} (U \cap [0, \infty)) + μ_{n + 1} (- U \cap [0, \infty))$ . Note that the origin is counted twice. By construction, $g$ is piecewise linear, $g \equiv 0$ outside of $(- 1, 1)$ and

g (s) = \int_{- 1}^{s} (s - z) d μ_{z} \forall s \geq - 1.

In particular, $g (0) = 1$ and $g (s) = 0$ for all $s \geq 1$ due to the moment conditions

\int_{0}^{1} 1 d μ_{s} = 0, \int_{0}^{1} s d μ_{s} = 1.

Since $g^{''} = μ$ is even and $g (- 1) = g (1)$ , we find that $g$ is even. Due to Proposition 2.5, we observe that $[g]_{B} = ∥ μ ∥_{T V} = 2 γ_{n + 1}$ . Integrating by parts twice, we realize that

\int_{- 1}^{1} g (s) s^{2 k} d s = \int_{- r}^{r} g (s) s^{2 k} d s = \frac{1}{(2 k + 2) (2 k + 1)} \int_{- r}^{r} g^{''} (s) s^{2 k + 2} d s = 0

for $r > 1$ and $k = 0, \dots, n - 1 = \frac{d - 3}{2}$ , since $g \equiv g^{'} \equiv 0$ in a neighbourhood of $r$ , so the boundary terms vanish. The integration by parts is well-established for smooth functions and can be justified in the piecewise case by mollification.

In particular, $g$ satisfies the conditions of Lemma 4.2 and induces an admissible radially symmetric function $f \in B_{0} (R^{d})$ .

Step 2. Assume for now that there exists a minimizer of the Barron semi-norm in $F$ . Since the Barron semi-norm is a convex function on the convex function class $F$ , and since furthermore both $F$ and $[\cdot]_{B}$ are invariant under coordinate rotations, we note that the set of minimal semi-norm elements in $F$ is both convex and rotation invariant. In particular

\hat f \in {argmin}_{f \in F} [f]_{B} \Rightarrow \hat f_{O} \in {argmin}_{f \in F} [f]_{B} where \hat f_{O} (x) = \int_{S O (d)} \hat f (O x) d H_{O}

is the average of $\hat f$ with respect to all rotations. The measure $H$ is the Haar measure on the group $S O (d)$ , i.e. the $d (d - 1) / 2$ -dimensional Hausdorff measure induced by the Frobenius norm on the space of $d \times d$ -matrices.

Thus, if a minimizer exists, then there is also a radially symmetric minimizer. In this step, we show that the function $f = f_{d}^{*}$ associated to $g$ as in Step 1 is in fact optimal. In particular, we can conclude from the proof below that a minimizer does exist.

It is easy to see that (4) is both necessary and sufficient to imply the moment conditions for $g$ in Lemma 4.2. In particular

{inf}_{f \in F} [f]_{B} = inf {[g]_{B} : g (0) = 1, g \equiv 0 on [1, \infty), g even and \int_{0}^{1} g^{''} (s) s^{2 k} d s = 0 for 0 \leq k \leq \frac{d - 1}{2}},

with corresponding minimizers. As $g^{''} = μ$ is the unique solution to the minimization problem on the right, $f$ is the unique radial minimizer on the left.

In particular

{lim}_{d \to \infty, d odd} \frac{[f_{d}^{*}]_{B}}{d} = {lim}_{d \to \infty, d odd} \frac{2 γ_{(d - 1) / 2}}{d} = γ \leq 3.6.

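Step 3. We note that $g$ is linear on the interval $[0, s_{1}]$ , where $s_{1}$ is as in Lemma 4.3. Thus the radial profile $r \mapsto f (r e_{1})$ , obtained from $g$ by the integral formula in Lemma 4.2, is linear in $r$ on $[0, s_{1}]$ , i.e. near the origin.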

Step 4. In this step, we show that $f$ is strictly decreasing in radial direction inside the unit ball. As noted in Corollary 2.4, the function $f$ is $C^{1}$ -smooth away from the origin. Since $f (0) = 1$ and $f (e_{1}) = 0$ , it suffices to show that $\partial_{r} f (r e_{1}) \neq 0$ for $r \in (0, 1)$ . We compute

f (r e_{1}) = c_{d} \int_{- 1}^{1} g (r s) (1 - s^{2})^{\frac{d - 3}{2}} d s, \partial_{r} f (r e_{1}) = c_{d} \int_{- 1}^{1} g^{'} (r s) s (1 - s^{2})^{\frac{d - 3}{2}} d s .

We make the following claim: If $g$ is an even piecewise linear function on $[- 1, 1]$ with at most $n + 1$ segments in $[0, 1]$ and $k \leq n - 1$ , then the function

r \mapsto \int_{- 1}^{1} g^{'} (r s) s (1 - s^{2})^{k} d s

has at most $n - 2 - k$ zeros in $(0, 1)$ .

To prove the claim, start with $k = 0$ . Then

\int_{- 1}^{1} g^{'} (r s) s d s = \frac{1}{r^{2}} \int_{- 1}^{1} g^{'} (r s) r s r d s = \frac{2}{r^{2}} \int_{0}^{r} g^{'} (z) z d z .

As $g^{'}$ is constant in the interval $[0, s_{1}]$ by the origin, $\partial_{r} f$ is constant (and non-zero) in $[0, s_{1}]$ , meaning that $\partial_{r} f$ cannot have a zero in $[0, s_{1}]$ . In any interval $(s_{i}, s_{i + 1})$ where $g^{'}$ is constant, the function

r \mapsto \int_{0}^{r} g^{'} (z) z d z = \int_{0}^{s_{i}} g^{'} (z) z d z + \int_{s_{i}}^{r} g^{'} (z) z d z

is monotone, since $g^{'} (z) \cdot z$ does not change sign. In particular:

There is no zero in the first interval $[s_{0}, s_{1}] = [0, s_{1}]$ .
There is at most one zero in $[s_{i}, s_{i + 1}]$ .
The zero in the final interval $[s_{n}, s_{n + 1}] = [s_{n}, 1]$ is attained at $s = 1$ .

Thus there are at most $n - 2$ zeros in $(0, 1)$ , which proves the claim for $k = 0$ . Now consider $k \geq 1$ . Note that

\int_{0}^{1} g^{'} (r s) s (1 - s^{2})^{k} d s = r^{- (2 + 2 k)} \int_{0}^{1} g^{'} (r s) r s (r^{2} - (r s)^{2})^{k} r d s = r^{- (2 + 2 k)} \int_{0}^{r} g^{'} (z) z (r^{2} - z^{2})^{k} d z

In particular, the term on the left is zero if and only if the integral on the right is zero. If there are two points $r_{1}, r_{2}$ on the right where the integral vanishes (and $k \geq 1$ ), then by Rolle’s theorem in between there exists a point $r \in (r_{1}, r_{2})$ at which

0 = 2 r \int_{0}^{r} g^{'} (z) z (r^{2} - z^{2})^{k - 1} d z = r^{k - 2} \int_{- 1}^{1} g^{'} (r s) s (1 - s^{2})^{k - 1} d s

The integral also vanishes at zero, where we are integrating over the empty set. We note that for any $k \geq 0$ the integral $\int_{0}^{r} g^{'} (z) z (r^{2} - z^{2})^{k} d z$ vanishes at $r = 0$ and $r = 1$ . If, for $k \geq 1$ , it vanishes at $N$ interior points, then for $k - 1$ it must vanish at $N + 1$ interior points, namely at least once in each interval between consecutive zeros, including the end points $0$ and $1$ . In particular, for $k \geq 1$ , there are at most $n - 2 - k$ interior vanishing points.

Step 5. We finally note that the Lipschitz bound follows directly from the Barron norm bound on $f_{d}^{*}$ and Lemma 4.1. ∎

Example 4.4.

Let us consider the case $d = 3$ , i.e. $n = \frac{d - 1}{2} = 1$ . The $n + 2 = 3$ points $s_{0}, s_{1}, s_{2}$ are given by the equi-oscillating points of the best approximation of the function $f (s) = s$ on $[0, 1]$ by elements of the space spanned by ${s^{0}, s^{2}, \dots, s^{2 n}} = {1, s^{2}}$ .

The best approximation of $s$ by even quadratic polynomials in $L^{\infty} (0, 1)$ is $P (s) = s^{2} + \frac{1}{8}$ , which attains maximal distance at $s = 0, 1 / 2, 1$ . This can easily be verified as $P (s) - s$ is a polynomial of degree $2$ inside $(0, 1)$ , so if $P (0) = P (1)$ , then $P (s) = α + β (s - 1 / 2)^{2}$ , so the most distant points are in ${0, 1 / 2, 1}$ . By Kolmogorov’s equi-oscillation theorem, all three are points of largest error. It is now easy to solve for the coefficients of $P$ .
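The equi-oscillation property of $P (s) = s^{2} + \frac{1}{8}$ can also be checked numerically. The sketch below evaluates the error $P (s) - s$ on a fine grid (grid size chosen arbitrarily) and at the three candidate points; the variable names are ours.

```python
import numpy as np

def P(s):
    """Candidate best even quadratic approximation of s on [0, 1]."""
    return s**2 + 1.0 / 8.0

s = np.linspace(0.0, 1.0, 100001)
err = P(s) - s                      # equals (s - 1/2)^2 - 1/8
max_err = np.max(np.abs(err))       # uniform error on the grid

nodes = np.array([0.0, 0.5, 1.0])
node_err = P(nodes) - nodes         # signed errors at the candidate extremal points
print(max_err, node_err)
```

The signed errors at $0, 1 / 2, 1$ alternate in sign and all have the magnitude of the uniform error, which is the alternation pattern required of a best approximation with $n + 2 = 3$ points.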

We can find the measure $μ = μ_{0} δ_{0} + μ_{1} δ_{1 / 2} + μ_{2} δ_{1}$ for the second derivative $g^{''} = μ$ by solving the linear system of moment conditions

\begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 / 2 & 1 \\ 0 & 1 / 4 & 1 \end{pmatrix} \begin{pmatrix} μ_{0} \\ μ_{1} \\ μ_{2} \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} \Leftrightarrow \begin{pmatrix} μ_{0} \\ μ_{1} \\ μ_{2} \end{pmatrix} = \begin{pmatrix} - 3 \\ 4 \\ - 1 \end{pmatrix} .

So $g$ is the even continuous piecewise linear function satisfying $g (0) = 1$ and

g^{'} (s) = \begin{cases} - 3 & 0 < s < 1 / 2 \\ 1 & 1 / 2 < s < 1 \\ 0 & s > 1 . \end{cases}

Finally, since $\frac{d - 3}{2} = 0$ , we find that $(1 - s^{2})^{\frac{d - 3}{2}} \equiv 1$ and thus

	$f (r e_{1})$	$= \frac{\int_{- 1}^{1} g (r s) d s}{\int_{- 1}^{1} 1 d s} = \int_{0}^{1} g (r s) d s = \frac{1}{r} \int_{0}^{r} g (s) d s = \frac{1}{r} \begin{cases} r - \frac{3}{2} r^{2} & 0 \leq r \leq 1 / 2 \\ \frac{r^{2}}{2} - r + \frac{1}{2} & 1 / 2 \leq r \leq 1 \\ 0 & r \geq 1 \end{cases}$
		$= \begin{cases} 1 - \frac{3}{2} r & 0 \leq r \leq 1 / 2 \\ \frac{r}{2} - 1 + \frac{1}{2 r} & 1 / 2 \leq r \leq 1 \\ 0 & r \geq 1 \end{cases} .$

In particular, we observe that $f \geq 0$ and that $f \in C^{1} (0, \infty)$ . It is easy to see that the first derivative of $f$

f^{'} (r) = \frac{1}{2} (- 3 \cdot χ_{(0, 1 / 2]} (r) + [1 - \frac{1}{r^{2}}] \cdot χ_{(1 / 2, 1]} (r))

is a continuous function, the second

f^{''} (r) = \frac{1}{r^{3}} \cdot χ_{(1 / 2, 1)}

is a bounded and measurable function, and the third (distributional) derivative

f^{'''} (r) = 8 \cdot δ_{1 / 2} - \frac{3}{r^{4}} \cdot L |_{(1 / 2, 1)} - δ_{1}

is a finite measure, where $δ_{x}$ denotes a Dirac delta located at the point $x$ and $L_{U}$ denotes the one-dimensional Lebesgue measure of the open set $U$ .
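The computations of this example are easy to reproduce numerically. The following sketch solves the $3 \times 3$ moment system with NumPy, builds $g$ from the resulting weights, and compares a midpoint-rule evaluation of $\int_{0}^{1} g (r s) d s$ with the closed form for $f (r e_{1})$ . The helper names `g`, `f_numeric` and `f_closed` and the number of quadrature cells are our choices.

```python
import numpy as np

# Moment conditions of Example 4.4 at the nodes 0, 1/2, 1:
# zeroth moment 0, first moment 1, second moment 0.
nodes = np.array([0.0, 0.5, 1.0])
A = np.vstack([nodes**0, nodes, nodes**2])
mu = np.linalg.solve(A, np.array([0.0, 1.0, 0.0]))

def g(s):
    """Even piecewise linear profile with g(0) = 1 and slopes mu_0, mu_0 + mu_1, 0."""
    s = np.abs(s)
    first = 1.0 + mu[0] * s
    second = 1.0 + mu[0] * 0.5 + (mu[0] + mu[1]) * (s - 0.5)
    return np.where(s <= 0.5, first, np.where(s <= 1.0, second, 0.0))

def f_numeric(r, cells=20000):
    """Midpoint rule for int_0^1 g(r s) ds, i.e. the d = 3 radial profile."""
    s = (np.arange(cells) + 0.5) / cells
    return float(np.mean(g(r * s)))

def f_closed(r):
    """Closed form of f(r e_1) derived in the example."""
    if r <= 0.5:
        return 1.0 - 1.5 * r
    if r <= 1.0:
        return r / 2 - 1.0 + 1.0 / (2 * r)
    return 0.0

print(mu)
for r in (0.25, 0.75, 1.5):
    print(r, f_numeric(r), f_closed(r))
```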

Proof of Theorem 3.2.

The existence of a radial minimizer $f_{d, ε}^{*}$ is proved as in Theorem 3.1. By Lemma 4.1, we find that $f_{d, ε}^{*} \in C^{\frac{d - 1}{2}} (R^{d} ∖ {0})$ , and since $f_{d, ε}^{*}$ is constant in a neighbourhood of the origin, we find that $f_{d, ε}^{*} \in C^{\frac{d - 1}{2}} (R^{d})$ . The uniqueness follows as in Theorem 3.1 by considering the optimal measure $μ$ on $[ε, 1]$ satisfying the moment conditions, using again Lemma 4.2. The main difference lies in the greater ability to uniformly approximate the function $f (s) = s$ by even polynomials on $[ε, 1]$ compared to $[0, 1]$ .

We claim the following: Let $ε > 0$ , $n \in N$ and $μ_{n}$ a measure on $[ε, 1]$ such that

\int_{ε}^{1} s d μ_{n} = 1, \int_{ε}^{1} s^{2 k} d μ_{n} = 0 \forall k = 0, \dots, n .

Then for every $c < 1$ there exists $N \in N$ independent of $ε$ such that

∥ μ_{n} ∥ \geq c \frac{ε^{2} \sqrt{π n}}{(1 - ε^{2})^{n + 1}}

if $n \geq N$ .

The claim is proved in Appendix B. Inserting the lower bound in (B), the statement is proved. ∎

5. Applications

5.1. Fitting values on a finite data set

Let $(x_{i}, y_{i})_{i = 1}^{N}$ be a finite data set in $R^{d} \times R$ . For each $i$ , define $r_{i} = {min}_{j \neq i} | x_{j} - x_{i} |$ to be the minimal distance between the point $x_{i}$ and the closest data point to it. Then

f (x) = \sum_{i = 1}^{N} y_{i} f_{d}^{*} (\frac{x - x_{i}}{r_{i}})

is a Barron function such that

$f (x_{i}) = y_{i}$ for all $i$ and
$∥ f ∥_{B} \leq 2 γ_{(d + 1) / 2} \sum_{i = 1}^{N} \frac{| y_{i} |}{r_{i}}$ .
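For $d = 3$ , where the profile of $f_{d}^{*}$ is known in closed form from Example 4.4, the construction can be sketched in a few lines. The code below is an illustration under that assumption only; the names `bump3` and `interpolant` and the random test data are ours.

```python
import numpy as np

def bump3(x):
    """Optimal radial bump f_3^* from Example 4.4 at points x of shape (..., 3)."""
    r = np.linalg.norm(x, axis=-1)
    safe = np.maximum(r, 0.5)  # avoids division by zero on the branch that is not selected
    outer = safe / 2 - 1.0 + 1.0 / (2 * safe)
    return np.where(r <= 0.5, 1.0 - 1.5 * r, np.where(r <= 1.0, outer, 0.0))

def interpolant(X, y):
    """Return f(x) = sum_i y_i f_3^*((x - x_i) / r_i), r_i = nearest-neighbour distance."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    radii = dist.min(axis=1)

    def f(x):
        x = np.atleast_2d(x)
        z = (x[:, None, :] - X[None, :, :]) / radii[None, :, None]
        return bump3(z) @ y

    return f

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # hypothetical data points
y = rng.normal(size=20)        # hypothetical labels
f = interpolant(X, y)
print(np.max(np.abs(f(X) - y)))
```

The interpolation property relies on the fact that the bump centred at $x_{i}$ is supported in the ball of radius $r_{i}$ , which contains no other data point in its interior.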

In most practical data sets, the minimum $ℓ^{2}$ -distance between data points is lower bounded as $Ω (1)$ or even $Ω (\sqrt{d})$ , meaning that the Barron norm only grows as $\sim d N$ or even $\sqrt{d} N$ . Using the direct approximation theorem for Barron functions (Proposition 2.6) in $L^{2} (P_{n})$ for $P_{n} = \frac{1}{n} \sum_{i = 1}^{n} δ_{x_{i}}$ , for every $m \in N$ there exists a shallow neural network $f_{m}$ with $m$ neurons (and one constant shift) such that

\frac{1}{n} \sum_{i = 1}^{n} | f_{m} (x_{i}) - y_{i} |^{2} \leq \frac{∥ f ∥_{B}^{2} {max}_{| ν | = 1} {⟨ \sum_{i = 1}^{n} (x_{i} - \bar x), ν ⟩}_{i}^{2}}{m}

where $\bar x = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$ . Often, the labels $y$ lie in a bounded set, at least with high probability. The projected and centered second moments may well be independent of the ambient dimension $d$ , leading to a realistic, even somewhat pessimistic, expectation that

for data sets which do not heavily concentrate at a single point or exhibit heavy tail behavior. While the data can generally be fit exactly if $m > n$ (see e.g. [LS06] and the references therein) this estimate also controls the size of the weights of the neural network needed.

Similarly, this estimate can be used to bound the additional size of the Barron norm which is required to fit values $y_{i}^{'} = y_{i} + ε$ , assuming that the Barron norm required to fit $y_{i}$ is already known.

5.2. Mollification and density

Since $f_{d}^{*}$ is a compactly supported, non-negative function, it can serve as a mollifier. Namely, for $ε > 0$ and $u \in L_{l o c}^{1} (R^{d})$ , denote

η_{ε} (z) = \frac{f_{d}^{*} (z / ε)}{∥ f_{d}^{*} ∥_{L^{1} (R^{d})} ε^{d}}, u_{ε} (x) = (u * η_{ε}) (x) = \int_{R^{d}} u (z) η_{ε} (x - z) d z .

It is well-known that $u_{ε} \to u$ in $L^{1} (K)$ for any compact set $K \subseteq R^{d}$ [Dob10, Lemma 4.22]. In many situations, rates can be obtained, either in the $L^{1}$ -topology or a stronger topology, under the assumption that $u$ lies in a space of more regular functions (e.g. a Hölder or Sobolev space). We consider the following scenario:

Assume that $u \in X$ , where $X \subseteq L^{1} (R^{d})$ is a space of functions $u : R^{d} \to R$ for which it is known that $∥ u_{ε} - u ∥_{L^{2} (U)} \leq C ∥ u ∥_{X} ε^{α}$ for a given domain $U \subseteq R^{d}$ and some universal constants $C, α > 0$ (which may depend on $U$ ). If $u$ is naturally defined only on $U$ and not the entire space, extension theorems can often be used to extend $u$ in the same regularity class, see e.g. [Dob10, Chapter 6].

Note furthermore that $u_{ε}$ is a continuous superposition of Barron functions in $x$ . Since $B_{0}$ is a Banach space, $u_{ε}$ is a Barron function with norm at most

[u_{ε}]_{B} \leq ∥ u ∥_{L^{1} (R^{d})} ∥ η_{ε} ∥_{B} = \frac{∥ u ∥_{L^{1} (R^{d})} [f_{d}^{*}]_{B}}{∥ f_{d}^{*} ∥_{L^{1} (R^{d})} ε^{d + 1}} .

In particular, due to the direct approximation theorem for Barron functions (Proposition 2.6), there exists a neural network $f_{m}$ with one hidden layer, ReLU activation, $m$ neurons (and an affine shift) such that

∥ f_{m} - u_{ε} ∥_{L^{2} (U)} \leq [u_{ε}]_{B} \mathrm{meas} (U) \mathrm{diam} (U) m^{- 1 / 2}

and thus

Balancing the scaling of terms $ε^{- (d + 1)} m^{- 1 / 2} = ε^{α}$ , we find that it is optimal to choose $ε \sim m^{- \frac{1}{2 (d + 1 + α)}}$ , which leads to an approximation order of $ε^{α} \sim m^{- \frac{α}{2 (d + 1 + α)}}$ . We note that not only the rate, but also the constants exhibit the curse of dimensionality. Observe that it is generally impossible to approximate functions in classical function spaces by functions of low norm from any function class in which the unit ball has low Rademacher complexity, so the curse of dimensionality cannot be avoided here [EW21].
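The balancing of the two error terms can be made concrete with a short computation. This sketch ignores all constants and only tracks the scaling in $m$ ; the function name `balanced_eps` and the sample values of $m$ , $d$ and $α$ are our choices.

```python
import math

def balanced_eps(m: int, d: int, alpha: float) -> float:
    """Mollification scale eps = m^(-1 / (2 (d + 1 + alpha))) balancing both error terms."""
    return m ** (-1.0 / (2.0 * (d + 1 + alpha)))

m, d, alpha = 10**6, 3, 1.0
eps = balanced_eps(m, d, alpha)

network_term = eps ** (-(d + 1)) * m ** -0.5   # scaling of the network approximation error
mollifier_term = eps ** alpha                  # scaling of the mollification error
rate = m ** (-alpha / (2.0 * (d + 1 + alpha)))

print(eps, network_term, mollifier_term, rate)
```

All three printed error quantities agree, and the exponent $\frac{α}{2 (d + 1 + α)}$ shrinks like $1 / d$ , which is the curse of dimensionality in the rate.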

Since $f_{d}^{*} \leq 1$ and $f_{d}^{*} \equiv 0$ outside the unit ball, we find that $∥ f_{d}^{*} ∥_{L^{1}} \leq ω_{d} \sim \frac{1}{\sqrt{π d}} {(\frac{2 π e}{d})}^{d / 2}$ . The true $L^{1}$ -norm is likely even much smaller, as $f_{d}^{*}$ appears to decay rapidly close to the unit sphere. Nevertheless, we find this an easy way to obtain an explicit rate with little effort.

Example 5.1.

If $X$ is the space of Lipschitz-continuous functions on $R^{d}$ , then the approximation property holds as

	$| u_{ε} (x) - u (x) |$	$= | \int_{R^{d}} [u (z) - u (x)] η_{ε} (x - z) d z | \leq \int_{B_{ε} (0)} η_{ε} (z) | u (x + z) - u (x) | d z$
		$\leq [u]_{L i p} \frac{\int_{0}^{ε} η_{ε} (r) r^{d} d r}{\int_{0}^{ε} η_{ε} (r) r^{d - 1} d r} \leq [u]_{L i p} ε .$

The conditions above are therefore met with $α = 1$ .

5.3. Depth separation

We have seen that any function which satisfies $f \equiv 1$ in $B_{ε} (0)$ and $f \equiv 0$ outside of $B_{1} (0)$ has Barron semi-norm which is exponentially large in the dimension $d$ of the data space (for fixed $ε \in (0, 1)$ ).

By comparison, the function

f (x) = ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ \begin{matrix} 1 & | x | \leq ε 1 - \frac{| x | - ε}{1 - ε} & ε \leq | x | \leq 1 0 & | x | \geq 1 \end{matrix}

can be represented as the composition $f = f_{1} \circ f_{2}$ of two Barron functions $f_{2} : R^{d} \to R$ and $f_{1} : R \to R$

f_{2} (x) = \frac{| x | - ε}{1 - ε}, f_{1} (z) = max {0, min {1 - z, 1}} = σ (1 - z) - σ (- z)

with norm

[f_{2}]_{B} = \frac{1}{1 - ε} \frac{\int_{S^{d - 1}} 1 d H^{d - 1}}{\int_{S^{d - 1}} σ (ν_{1}) d H_{ν}^{d - 1}} \sim \frac{2 \sqrt{d}}{1 - ε}, [f_{1}]_{B} = 2.

The second norm estimate can easily be obtained by Proposition 2.5, whereas the first can be obtained as in the second step in the proof of Lemma 4.1 in Appendix B (see also [EW20b, Section 4]).
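The identity $f = f_{1} \circ f_{2}$ is easy to confirm numerically. In the sketch below, $f_{2}$ is evaluated exactly through the Euclidean norm rather than through a shallow network approximating it, so the check only concerns the composition structure; the sample dimension, the value of $ε$ and all function names are our choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def f2(x, eps):
    """Inner function (|x| - eps) / (1 - eps)."""
    return (np.linalg.norm(x, axis=-1) - eps) / (1.0 - eps)

def f1(z):
    """Outer function sigma(1 - z) - sigma(-z), a ReLU network with two neurons."""
    return relu(1.0 - z) - relu(-z)

def target(x, eps):
    """Piecewise definition: 1 on B_eps, 0 outside B_1, linear in |x| in between."""
    r = np.linalg.norm(x, axis=-1)
    return np.clip(1.0 - (r - eps) / (1.0 - eps), 0.0, 1.0)

rng = np.random.default_rng(0)
x = 0.3 * rng.normal(size=(1000, 10))   # hypothetical sample points in dimension 10
eps = 0.25
gap = np.max(np.abs(f1(f2(x, eps)) - target(x, eps)))
print(gap)
```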

In particular, by the direct approximation theorem for Barron functions (Proposition 2.6), it is possible to approximate $f_{2}$ with parameters whose magnitude does not exceed $C (1 - ε)^{- 1} d^{1 / 2}$ . When written as a neural network with two hidden layers, the initial linear layer of $f_{1}$ and terminal linear layer of $f_{2}$ are concatenated into a single linear map. Balancing the magnitude of coefficients equally over all three layers, we find that the parameters scale only like $d^{1 / 6} (1 - ε)^{- 1 / 3}$ , so the weight decay regularizer grows as $d^{1 / 3} (1 - ε)^{- 2 / 3}$ . Observe that weight decay does not induce a norm for deeper ReLU networks due to the mismatch in homogeneities.

Theorem 3.2 thus serves to illustrate the following depth separation phenomenon: A function $f : R^{d} \to R$ which takes values $1$ on $B_{ε} (0)$ and $0$ on $R^{d} ∖ B_{1} (0)$ is much easier to approximate by ReLU networks with two hidden layers than with one. While depth separation phenomena are well established [ES16, Tel16], this is a particularly easy criterion. The fact that compositions of Barron functions correspond to certain neural networks with two hidden layers has been observed e.g. in [EW20b, PN22].

On the other hand, the result is a weaker depth separation statement than others. We do not claim that the number of neurons required to approximate such a function $f$ to a certain accuracy grows exponentially in dimension, but rather that either the number of neurons or the magnitude of the parameters does. From a practical point of view, both are prohibitive.

6. Finding optimal bump functions

In this section, we compute numerical approximations of the optimal bump functions which were constructed in Theorem 3.1 for different odd dimensions $d \in N$ beyond the case $d = 3$ considered in Example 4.4. As previously, denote $n = \frac{d - 1}{2}$ , i.e. $d = 2 n + 1$ . For simplicity, we exploit that three tasks are equivalent: Approximating $| s |$ in $L^{\infty} (- 1, 1)$ by polynomials of degree at most $2 n$ (or $2 n + 1$ ), approximating $\sqrt{s}$ in $L^{\infty} (0, 1)$ by polynomials of degree at most $n$ , and approximating $s$ in $L^{\infty} (0, 1)$ by even polynomials of degree $2 n$ . We proceed in three steps:

Find the optimal approximation of $s \mapsto \sqrt{s}$ by polynomials of degree $n$ in $L^{\infty} (0, 1)$ , and find the $n + 2$ points $t_{0}, \dots, t_{n + 1}$ at which the error is maximal. Take the optimal points $s_{i} = \sqrt{t_{i}}$ for the approximation of $f (s) = s$ by even polynomials, which corresponds to the substitution $t = s^{2}$ .
Solve the linear system (B) to obtain the measure $μ = \sum_{i = 0}^{n + 1} μ_{i} δ_{s_{i}}$ . Compute the piecewise linear function $g$ by $g^{''} = μ$ in $(0, 1]$ , $g (0) = 1$ and $g^{'} (0) = μ_{0}$ .
Obtain $f$ from $g$ by numerically integrating (4.2).

In our implementation, the first step is solved by the Remez algorithm [Tre19]:

Initialize $s_{0}, \dots, s_{n + 1} \in [0, 1]$ , e.g. as equi-distant points such that $s_{0} = 0$ and $s_{n + 1} = 1$ .
Solve the system $\sum_{j = 0}^{n} α_{j} s_{i}^{j} = \sqrt{s_{i}} + (- 1)^{i} e$ , $0 \leq i \leq n + 1$ for the coefficients $α_{j}$ and the equi-oscillation parameter $e$ .
Update $s_{0}, \dots, s_{n + 1}$ such that $s_{0} = 0$ , $s_{n + 1} = 1$ and for $i = 1, \dots, n$ , $s_{i}$ is a point at which the unsigned error function $\sqrt{s} - \sum_{j = 0}^{n} α_{j} s^{j}$ has a local extremum.
Iterate (ii) and (iii) until after the final update we have approximately reached equi-oscillating points of largest error:

$\frac{{max}_{0 \leq i \leq n + 1} ∣ ∣ \sqrt{s_{i}} - \sum_{j = 0}^{n} α_{j} s_{i}^{j} ∣ ∣}{{min}_{0 \leq i \leq n + 1} ∣ ∣ \sqrt{s_{i}} - \sum_{j = 0}^{n} α_{j} s_{i}^{j} ∣ ∣} < 1.001.$

We solve the linear system in step 2 in the iteration scheme by LU factorization. The non-linear system is solved by two nested interval constructions:

Given $0 = s_{0} < \dots < s_{n + 1} = 1$ such that $\sqrt{s_{i}} - \sum_{j = 0}^{n} α_{j} s_{i}^{j} = (- 1)^{i} e$ , we conclude that for all $i = 0, \dots, n$ , there exists $t_{i} \in (s_{i}, s_{i + 1})$ such that

$\sqrt{t_{i}} - \sum_{j = 0}^{n} α_{j} t_{i}^{j} = 0, i = 0, \dots, n$

by the Intermediate Value Theorem. In particular, the $n + 1$ points $0 < t_{0} < \dots < t_{n} < 1$ are distinct and ordered. We approximate $t_{i}$ by the bisection method to accuracy $< 10^{- 12}$ . Note that $e \neq 0$ , since the approximating polynomial cannot match the objective function at $n + 2$ points by the same argument as in the proof of Theorem 3.1.
Given $t_{0}, \dots, t_{n}$ , we find that for $i = 1, \dots, n$ there exists $ξ_{i} \in (t_{i - 1}, t_{i})$ such that

$\frac{d}{d s} |_{s = ξ_{i}} (\sqrt{s} - \sum_{j = 0}^{n} α_{j} s^{j}) = 0, i = 1, \dots, n$

by Rolle’s Theorem. Again, we approximate $ξ_{i}$ by the bisection method to accuracy $< 10^{- 12}$ . By construction, all $ξ_{i}$ are distinct. We update

${s_{0}, s_{1}, \dots, s_{n}, s_{n + 1}} \mapsto {s_{0}, ξ_{1}, \dots, ξ_{n}, s_{n + 1}} .$

The nested interval construction is more numerically stable than Newton-Raphson iteration, as a Newton solver tends to find the same $ξ_{i}$ multiple times starting at different roots $s_{i}, s_{i^{'}}$ from the previous iteration.

The linear step (2) is solved by LU factorization. The integral in (3) is evaluated using a composite Simpson rule and 1,001 integration points. A sample implementation of the algorithm can be found in a google colab notebook at [Woja].
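The exchange iteration can be sketched compactly as follows. This is not the implementation from the notebook: for brevity, the two nested bisection searches are replaced by a search for the largest error on a fixed fine grid between consecutive sign changes of the error, which is a cruder but simpler way to locate the extremal points. The function name `remez_sqrt`, the grid size, the initial reference points and the stopping tolerance are our assumptions.

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def remez_sqrt(n, max_iter=50, grid_size=200001, tol=1e-3):
    """Approximate sqrt on [0, 1] by a polynomial of degree n.

    Returns the coefficients (increasing degree), the levelled error e and
    the n + 2 extremal points t_0 < ... < t_{n+1}.
    """
    grid = np.linspace(0.0, 1.0, grid_size) ** 2   # finer near 0, where sqrt is steep
    t = np.linspace(0.0, 1.0, n + 2) ** 2          # initial reference points (our choice)
    signs = (-1.0) ** np.arange(n + 2)
    for _ in range(max_iter):
        # Linear step: sum_j a_j t_i^j + (-1)^i e = sqrt(t_i).
        A = np.hstack([np.vander(t, n + 1, increasing=True), signs[:, None]])
        sol = np.linalg.solve(A, np.sqrt(t))
        coeffs, e = sol[:-1], sol[-1]
        err = np.sqrt(grid) - npoly.polyval(grid, coeffs)
        # Exchange step: cut the grid at sign changes of the error and
        # keep the point of largest absolute error in each piece.
        positive = err >= 0.0
        change = np.nonzero(positive[1:] != positive[:-1])[0] + 1
        pieces = np.split(np.arange(grid_size), change)
        if len(pieces) != n + 2:
            raise RuntimeError("error does not have n + 1 sign changes on the grid")
        idx = np.array([p[np.argmax(np.abs(err[p]))] for p in pieces])
        idx[0], idx[-1] = 0, grid_size - 1         # end points are kept fixed
        t = grid[idx]
        if np.max(np.abs(err[idx])) < (1.0 + tol) * np.min(np.abs(err[idx])):
            break
    return coeffs, e, t

coeffs, e, t = remez_sqrt(3)
s = np.sqrt(t)   # extremal points for approximating s by even polynomials (t = s^2)
print(s, abs(e))
```

Since a polynomial of degree $n$ in $t = s^{2}$ is an even polynomial of degree $2 n$ in $s$ , the extremal points of the two approximation problems are related by $s = \sqrt{t}$ . For $n = 1$ the routine reproduces the points $0, 1 / 2, 1$ of Example 4.4.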

Figure 1. We compute the piecewise linear function $g$ satisfying the moment conditions (B) for three different choices of $n + 2 = \frac{d + 3}{2}$ break points $s_{i}$ : Optimal points found by the Remez algorithm (left), equi-distant points $s_{i} = i / (n + 1)$ (middle) and the roots of Chebyshev polynomials of the second kind $s_{i} = 1 / 2 + cos (i π / (n + 1)) / 2$ (right).
For the optimal choice of $s_{i}$ , we empirically observe that the collection of points ${(s_{i}, g_{d} (s_{i})) : d \in 2 N + 1}$ concentrates on a line $ℓ_{i}$ parallel to the horizontal axis. For equi-distant nodes, the oscillations become larger as $d$ increases, whereas they become smaller for Chebyshev nodes.

We note that the linear system (B) can be solved for any choice of distinct points $0 \leq s_{0} < \dots < s_{n + 1} \leq 1$ . To explore the importance of using the optimal points found using the Remez algorithm, we compare $g$ and $f$ for the optimal choice of sample points and other, more classic and explicit choices $s_{i}$ in Figures 1 and 2 respectively. The Barron norm grows slowly and linearly for optimal break points and faster than linearly for other explicit choices of break points, as can be seen in the rightmost image in Figure 3. In the left and middle plot of Figure 3, we also display the known profiles of Barron functions due to [OWSS19] as well as profiles of Barron functions which are constant in a neighbourhood of the origin.
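For explicit break points, the weights $μ_{i}$ and the resulting semi-norm $\sum_{i} | μ_{i} |$ can be computed directly from the generalized Vandermonde system of Lemma 4.3. The sketch below does this for equi-distant and Chebyshev-type nodes; the function names are ours, and the printed numbers are meant as a qualitative illustration rather than a reproduction of the figures, since the system becomes ill-conditioned as $n$ grows.

```python
import numpy as np

def moment_weights(nodes):
    """Solve the moment system: first moment 1, even moments of order 0, 2, ..., 2n equal to 0."""
    nodes = np.asarray(nodes, dtype=float)
    n = len(nodes) - 2
    A = np.vstack([nodes] + [nodes ** (2 * k) for k in range(n + 1)])
    rhs = np.zeros(n + 2)
    rhs[0] = 1.0
    return np.linalg.solve(A, rhs)

def equidistant(n):
    """Break points s_i = i / (n + 1)."""
    return np.arange(n + 2) / (n + 1)

def chebyshev(n):
    """Break points s_i = 1/2 + cos(i pi / (n + 1)) / 2, sorted increasingly."""
    return np.sort(0.5 + 0.5 * np.cos(np.arange(n + 2) * np.pi / (n + 1)))

for n in (1, 2, 4, 6):
    tv_equi = np.abs(moment_weights(equidistant(n))).sum()
    tv_cheb = np.abs(moment_weights(chebyshev(n))).sum()
    print(2 * n + 1, tv_equi, tv_cheb)
```

For $n = 1$ both node sets coincide with the optimal points $0, 1 / 2, 1$ , and the weights reduce to those of Example 4.4.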

For any choice of break points which include zero, the function $f$ is a ‘wizard’s hat’ function: monotone decreasing and convex, flat away from the origin, and non-smooth at the origin. It is thus qualitatively different from previously known radial profiles due to [OWSS19].

Figure 2. We compare the functions $f$ computed by (4.2) for the piecewise linear functions $g$ in Figure 1 associated to three different choices of break points $s_{i}$ : Optimal (left), equi-distant (middle), Chebyshev (right). The break points and curves agree for $d = 3$ (blue curve) and are qualitatively similar for all $d \geq 3$ , in particular non-negative, monotone-decreasing and convex. The curves are steeper at the origin in higher dimensions, most noticeably for Chebyshev nodes. The curve with optimal break points appears to make the slowest transition from $f = 1$ to $f \approx 0$ .

Figure 3. Left: The radial profiles of the known Barron bump functions $f (| x |) = (1 - | x |^{2})^{\frac{d - 1}{2}}$ of [OWSS19] are smooth at the origin and non-convex in the radial direction. They are thus geometrically distinct from the profiles associated to piecewise linear functions $g$ as depicted in Figure 2. Middle: The functions $f$ associated to piecewise linear functions $g$ with break points at equidistant points $s_{i} = 0.1 + 0.9 \cdot i / (n + 1)$ in $[0.1, 1]$ are $C^{1}$ -smooth, monotone decreasing and non-negative, but not convex in the radial direction. Right: The Barron semi-norm $\sum_{i = 0}^{n + 1} | μ_{i} |$ of functions $f$ associated to $g$ with different break points $s_{i}$ grows slowest (and linearly) for the optimal choice of points and fastest for break points at Chebyshev nodes. All growth rates are ostensibly polynomial of empirical degree 1.1 (optimal points), 1.4 (equi-distant points) and 1.6 (Chebyshev nodes) as determined by least squares fitting. By comparison, the norm growth for the choice of equi-distant nodes in $[0.1, 1]$ in the middle figure is exponentially large in $d$ and is not pictured for better readability of the plots.

Additional empirical results relating to the optimal construction can be found in Appendix A.

7. Conclusion and Open Problems

We have provided an explicit construction for how neural networks optimally interpolate certain radially symmetric data with respect to a weight decay regularizer in the infinite parameter limit. While we do not prove that the optimal interpolant is radially symmetric, the radial average of all interpolants coincides with the solution constructed in this article. We show that its weight decay regularizer grows as $d$ and its Lipschitz constant grows at most as $\sqrt{d}$ . In contrast, we identify a slight modification which necessitates exponential growth. A number of important questions remain open, even for shallow neural networks and the simple case of rotational symmetry. Deeper networks appear to be out of reach for our methodology.

Is the radially symmetric minimizer the only one? A uniqueness statement would allow us to establish that regularized risk minimization does in fact lead to symmetry learning, at least in a toy example, and would allow us to prove stronger convergence results in the companion article [Wojb]. We give further heuristic consideration to this question in Appendix D.

What happens if we modify the constraints? For example, it is not clear from the proof of Theorem 3.2 whether the constraint $f \geq 1$ on $B_{ε} (0)$ induces the curse of dimensionality as the constraint $f \equiv 1$ on $B_{ε} (0)$ does. Similarly, it may be interesting to study the case where the boundary condition $f \equiv 0$ is imposed on a shell ${1 \leq | x | \leq R}$ rather than the entire exterior domain. We recover the problem studied in this article in the limit $R \to \infty$ , whereas the optimal solution in the case $R = 1$ would be $f (x) = 1 - | x |$ . Furthermore, a modified minimization problem is required to find optimal mollifiers:

It appears that subtle differences may make the difference between a solvable data fitting problem and one where we encounter the curse of dimensionality.

What more can we say about the optimal function $f_{d}^{*}$ ? For example, we do not provide a lower bound on the Lipschitz constant of $f_{d}^{*}$ , nor do we study the decay of $f_{d}^{*} (r)$ for fixed $r \in (0, 1)$ or $∥ f ∥_{L^{1} (R^{d})}$ rigorously. We conjecture that both decay at least exponentially in $d$ , and that both sequences are monotone in $d$ . Limited evidence is provided in Appendix D.

Finally, the fact that the extrema of $g = g_{n}$ lie on straight lines parallel to the horizontal axis as we vary $n$ appears too specific to be random. It is not clear to us how to interpret this observation.

Acknowledgements

The author would like to thank Jonathan Siegel and Rahul Parhi for inspiring conversations.

References

[Bac17] F. Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
[Bar93] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
[Ber12] S. Bernstein. Sur l’ordre de la meilleure approximation des fonctions continues par des polynômes de degré donné, volume 4. Hayez, imprimeur des académies royales, 1912.
[BZ13] Y. D. Burago and V. A. Zalgaller. Geometric inequalities, volume 285. Springer Science & Business Media, 2013.
[CB20] L. Chizat and F. Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. arXiv:2002.04486 [math.OC], 2020.
[CPV20] A. Caragea, P. Petersen, and F. Voigtlaender. Neural network approximation and estimation of classifiers with classification boundary in a barron class. arXiv:2011.09363 [math.FA], 2020.
[Dob10] M. Dobrowolski. Angewandte Funktionalanalysis: Funktionalanalysis, Sobolev-Räume und elliptische Differentialgleichungen. Springer-Verlag, 2010.
[EMW18] W. E, C. Ma, and L. Wu. A priori estimates of the population risk for two-layer neural networks. Comm. Math. Sci., 17(5):1407–1425, 2019. arXiv:1810.06397 [cs.LG], 2018.
[EMW19a] W. E, C. Ma, and Q. Wang. A priori estimates of the population risk for residual networks. arXiv:1903.02154 [cs.LG], 2019.
[EMW19b] W. E, C. Ma, and L. Wu. The barron space and the flow-induced function spaces for neural network models. arXiv:1906.08039 [cs.LG], 2019.
[EMW19c] W. E, C. Ma, and L. Wu. Machine learning from a continuous viewpoint. arXiv:1912.12777 [math.NA], 2019.
[EMWW20] W. E, C. Ma, S. Wojtowytsch, and L. Wu. Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t. CSIAM Trans. Appl. Math., 1(4):561–615, 2020.
[ES16] R. Eldan and O. Shamir. The power of depth for feedforward neural networks. In Conference on learning theory, pages 907–940, 2016.
[EW20a] W. E and S. Wojtowytsch. On the Banach spaces associated with multi-layer ReLU networks of infinite width. CSIAM Trans. Appl. Math., 1(3):387–440, 2020.
[EW20b] W. E and S. Wojtowytsch. Representation formulas and pointwise properties for Barron functions. arXiv:2006.05982 [stat.ML], accepted for publication Calc.Var.PDE, 2020.
[EW21] W. E and S. Wojtowytsch. Kolmogorov width decay and poor approximators in machine learning: Shallow neural networks, random feature models and neural tangent kernels. Res Math Sci, 8(5), 2021.
[EW22] W. E and S. Wojtowytsch. On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers. In Mathematical and Scientific Machine Learning, pages 270–290. PMLR, 2022.
[GKNV22] R. Gribonval, G. Kutyniok, M. Nielsen, and F. Voigtlaender. Approximation spaces of deep neural networks. Constructive approximation, 55(1):259–367, 2022.
[Han21] B. Hanin. Ridgeless interpolation with shallow relu networks in $1 d$ is nearest neighbor curvature extrapolation and provably generalizes on lipschitz functions. arXiv preprint arXiv:2109.12960, 2021.
[JEP $^{+}$ 21] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Žídek, A. Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
[KB18] J. M. Klusowski and A. R. Barron. Approximation by combinations of relu and squared relu ridge functions with $ℓ^{1}$ and $ℓ^{0}$ controls. IEEE Transactions on Information Theory, 64(12):7649–7656, 2018.
[KKC09] D. R. Kincaid and E. W. Cheney. Numerical analysis: mathematics of scientific computing, volume 2. American Mathematical Soc., 2009.
[KSH12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[LMW20] Z. Li, C. Ma, and L. Wu. Complexity measures for neural networks with general activation functions using path-based norms. arXiv preprint arXiv:2009.06132, 2020.
[LS06] B. Llanas and F. Sainz. Constructive approximate interpolation by neural networks. Journal of Computational and Applied Mathematics, 188(2):283–308, 2006.
[Lu21] Z. Lu. A note on the representation power of GHHs. arXiv preprint arXiv:2101.11286, 2021.
[Lun17] K. Lundengård. Generalized Vandermonde matrices and determinants in electromagnetic compatibility. PhD thesis, Mälardalen University, 2017.
[LWK17] C. Louizos, M. Welling, and D. P. Kingma. Learning sparse neural networks through $l_0$ regularization. arXiv preprint arXiv:1712.01312, 2017.
[OWSS19] G. Ongie, R. Willett, D. Soudry, and N. Srebro. A function space view of bounded norm infinite width relu nets: The multivariate case. arXiv preprint arXiv:1910.01635, 2019.
[PN21] R. Parhi and R. D. Nowak. Banach space representer theorems for neural networks and ridge splines. J. Mach. Learn. Res., 22:43–1, 2021.
[PN22] R. Parhi and R. D. Nowak. What kinds of functions do deep neural networks learn? insights from variational spline theory. SIAM Journal on Mathematics of Data Science, 4(2):464–489, 2022.
[SHM $^{+}$ 16] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
[SHS $^{+}$ 18] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
[SSS $^{+}$ 17] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
[SSVB17] S. Srinivas, A. Subramanya, and R. Venkatesh Babu. Training sparse neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 138–145, 2017.
[SX19] J. W. Siegel and J. Xu. On the approximation properties of neural networks. arXiv preprint arXiv:1904.02311, 2019.
[SX20] J. W. Siegel and J. Xu. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313–321, 2020.
[SX21a] J. W. Siegel and J. Xu. Characterization of the variation spaces corresponding to shallow neural networks. arXiv preprint arXiv:2106.15002, 2021.
[SX21b] J. W. Siegel and J. Xu. Optimal approximation rates and metric entropy of relu $^{k}$ and cosine networks. arXiv preprint arXiv:2101.12365, 2021.
[SX21c] J. W. Siegel and J. Xu. Sharp bounds on the approximation rates, metric entropy, and $n$ -widths of shallow neural networks. arXiv preprint arXiv:2101.12365, 2021.
[TAW $^{+}$ 21] K. Tunyasuvunakool, J. Adler, Z. Wu, T. Green, M. Zielinski, A. Žídek, A. Bridgland, A. Cowie, C. Meyer, A. Laydon, et al. Highly accurate protein structure prediction for the human proteome. Nature, 596(7873):590–596, 2021.
[Tel16] M. Telgarsky. Benefits of depth in neural networks. In Conference on learning theory, pages 1517–1539. PMLR, 2016.
[Tre19] L. N. Trefethen. Approximation Theory and Approximation Practice, Extended Edition. SIAM, 2019.
[VC85] R. S. Varga and A. J. Carpenter. On the bernstein conjecture in approximation theory. Constructive Approximation, 1(1):333–348, 1985.
[VSP $^{+}$ 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[WLGSB22] Y. Wang, C.-Y. Lai, J. Gómez-Serrano, and T. Buckmaster. Self-similar blow-up profile for the boussinesq equations via a physics-informed neural network. arXiv preprint arXiv:2201.06780, 2022.
[Woja] S. Wojtowytsch. https://colab.research.google.com/drive/1ofrzlafdq73ev1-mgmmnuua0lft0f5gg?usp=sharing.
[Wojb] S. Wojtowytsch. Learning symmetries: A study of shallow neural network optimization. in preparation.
[Wojc] S. Wojtowytsch. Radially symmetric data: An empirical study of optimization for shallow neural networks. in preparation.

Appendix A Further plots

We note the following: If $g$ is a piecewise linear function with $n$ break points which satisfies the $n + 2$ linear moment conditions (B), then it also satisfies the same linear moment conditions for any $m \leq n$ . In particular, the function

f_{m, n} : R^{2 m + 1} \to R, \quad f_{m, n} (x) = \frac{\int_{- 1}^{1} g (| x | s) (1 - s^{2})^{m - 1} d s}{\int_{- 1}^{1} (1 - s^{2})^{m - 1} d s}

is a Barron function such that $f_{m, n} (0) = 1$ and $f_{m, n} \equiv 0$ on $R^{k} ∖ B_{1} (0)$ for $k \leq n$ . We plot $f_{m, n}$ for $m = 1$ (i.e. $2 m + 1 = 3$ ) and various choices of $n$ and various choices of break points in Figure 4. In Figure 5 we fix $n = 10$ instead and consider the influence of varying $m$ .

The larger the discrepancy between $m$ and $n$ , the more oscillatory the function $f_{m, n}$ is. This is reminiscent of observations from Step 4 in the proof of Theorem 3.1.

In Figure 6, we numerically investigate the decay of $f_{d}^{*}$ as $d$ varies, both pointwise in $r$ and integrated. Since $f_{d}^{*}$ is $\sqrt{d}$ -Lipschitz and $f_{d}^{*} (0) = 1$ , we see that the one-dimensional integral $\int_{0}^{1} f_{d}^{*} (r) d r$ is bounded from below by $Ω (d^{- 1 / 2})$ . This indeed appears to be the dominant term, and for fixed $r > 0$ , we observe empirically that $f_{d}^{*} (r)$ decays to zero exponentially in $d$ .

Figure 4. We plot the radial profile of $f_{1, n} : R^{3} \to R$ as in (A) for various choices of $n = \frac{d - 1}{2}$ break points. The points are chosen optimally (for dimension $d = 2 n + 1$ ) on the left, equidistant in $[0, 1]$ in the middle plot and equidistant in $[0.1, 1]$ on the right. Notably, the functions are neither monotone nor non-negative if $d > 3$ , and the number of local extrema increases as $n$ grows.

Figure 5. We plot $f_{m, n}$ as in (A) corresponding to low dimension $d \in \{3, 5, 7\}$ , $m = \frac{d - 1}{2}$ and $n = 10$ for various choices of break points: optimal for $n = 10$ (left), equidistant in $[0, 1]$ (middle) and equidistant in $[0.1, 1]$ (right). The radial profiles of the Barron functions are neither monotone nor non-negative. The number of local extrema of the profiles is larger if the dimension $d$ is small compared to the number of break points. The oscillations are smallest for the optimal choice of break points and largest for break points which are bounded away from the origin.

Figure 6. Left: We plot $f_{d}^{*} (x)$ as a function of $d$ for fixed $x$ in a logarithmic scale together with $\exp (α (x) d + β (x))$ , where $α, β$ are chosen depending on $x$ as the least squares fit for the function $\log (f_{d}^{*} (x))$ . The graphs suggest that the decay is exponential in $d$ and thus comparable to the explicit solutions $\tilde{f}_{d} (x) = (1 - | x |^{2})^{\frac{d + 3}{2}}$ of [OWSS19]. Middle: We graphically compare the decay of the normalized $d$ -dimensional integral $⨍_{B_{1} (0)} f_{d} (x) d x$ for different choices of break points. Despite graphical differences around the origin, the values of the integrals are very similar and decay roughly as $\exp (- 4.7 - 0.75 d)$ . In dimension three, all three functions $f_{d}$ coincide, while their difference close to the origin becomes negligible in high dimension, where almost all measure concentrates by the boundary of the unit ball. Right: We graphically compare the decay of the $1$ -dimensional integral $\int_{0}^{1} f_{d} (r \cdot e_{1}) d r$ of the function $f_{d}$ associated to piecewise linear $g$ with $n = \frac{d - 1}{2}$ break points in $(0, 1)$ for different choices of break points. The integral empirically decays as $0.3 \cdot d^{- 0.49}$ for optimal points, as $0.29 \cdot d^{- 0.45}$ for equidistant points and as $0.64 \cdot d^{- 1.02}$ for Chebyshev points. The order of decay was established by a least squares regression.

Appendix B Postponed proofs

Recall the co-area formula, which allows us to integrate over a Riemannian manifold $M$ by ‘slicing’ the domain into the level sets of a function $ϕ : M \to N$ , where $N$ is another Riemannian manifold [BZ13, Theorem 13.4.2]. In the case of slicing the sphere into level sets of a coordinate projection $ϕ (x) = x_{1}$ , the formula reads as

\int_{S^{d - 1}} f (x) d H_{x}^{d - 1} = \int_{- 1}^{1} (\int_{S^{d - 1} \cap {x_{1} = s}} f (x) d H_{x}^{d - 2}) (1 - s^{2})^{- 1 / 2} d s

since $1 - x_{1}^{2} = | \nabla^{∥} ϕ |^{2}$ is the squared modulus of the tangential gradient of $ϕ$ , which measures volume distortions. This can be considered a curvilinear version of Fubini’s theorem. If $f$ only depends on $x_{1}$ , the formula further simplifies to

(B.1)

\int_{S^{d - 1}} f (x) d H_{x}^{d - 1} = (d - 1) ω_{d - 1} \int_{- 1}^{1} f (s, 0, \dots, 0) (1 - s^{2})^{(d - 3) / 2} d s

since $S^{d - 1} \cap {x_{1} = s}$ is a $d - 2$ -dimensional Euclidean sphere of radius $\sqrt{1 - s^{2}}$ . Here $ω_{d - 1}$ denotes the volume of the $d - 1$ -dimensional unit ball and $(d - 1) ω_{d - 1}$ the volume of the $d - 2$ -dimensional Euclidean unit sphere.
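As a numerical sanity check of (B.1), the following Python sketch computes the spherical average of $f (x) = x_{1}^{2}$ through the one-dimensional integral and compares it with the value $1 / d$ , which follows from symmetry since $x_{1}^{2} + \dots + x_{d}^{2} = 1$ on the sphere. The function names and the midpoint quadrature are illustrative choices and not part of the article.

```python
import math

def sphere_area(k):
    """Surface area |S^k| of the unit sphere in R^(k+1)."""
    return 2.0 * math.pi ** ((k + 1) / 2) / math.gamma((k + 1) / 2)

def midpoint(func, a, b, n=20000):
    """Composite midpoint rule for the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def sphere_average_via_coarea(profile, d):
    """Average of x -> profile(x_1) over S^(d-1), computed from (B.1)."""
    weight = lambda s: (1.0 - s * s) ** ((d - 3) / 2)
    integral = midpoint(lambda s: profile(s) * weight(s), -1.0, 1.0)
    # (d - 1) * omega_(d-1) equals |S^(d-2)|; divide by |S^(d-1)| to normalize.
    return sphere_area(d - 2) * integral / sphere_area(d - 1)

averages = {d: sphere_average_via_coarea(lambda s: s * s, d) for d in (3, 5, 8)}
for d, value in averages.items():
    print(d, value, 1.0 / d)
```

The two printed columns should agree up to quadrature error in each dimension.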

Proof of Lemma 4.1.

Step 1. Symmetrization. Let $f : R^{d} \to R$ be a radially symmetric Barron function. Then in particular $f (x)$ coincides with the average of $f (O x)$ over $O \in S O (d)$ , where the average is taken with respect to the Haar measure $H$ (which coincides with the $\frac{d (d - 1)}{2}$ -dimensional Hausdorff measure on $S O (d)$ with respect to the Frobenius norm), i.e.

	$f (x)$	$= ⨍_{S O (d)} f (O x) d H_{O}$
		$= ⨍_{S O (d)} \int_{R^{d + 2}} σ (w^{T} O x + b) d μ_{(w, b)} d H_{O}$
		$= \int_{R^{d + 2}} ⨍_{S O (d)} σ ((O^{T} w)^{T} x + b) d H_{O} d μ_{(w, b)}$
		$= \int_{R^{d + 2}} | w | ⨍_{S^{d - 1}} σ (ν^{T} x + \frac{b}{| w |}) d H_{ν}^{d - 1} d μ_{(w, b)}$

since for any $w \in S^{d - 1}$ , the map $S O (d) \to S^{d - 1}$ , $O \mapsto O w$ pushes the Haar measure forward to the uniform distribution on $S^{d - 1}$ . Thus $f$ can be written as a continuous linear combination of the elementary radially symmetric Barron functions

f_{b} (x) = ⨍_{S^{d - 1}} σ (ν^{T} x - b) d H_{ν}^{d - 1} \quad and \quad f_{\infty} (x) \equiv 1.

On the other hand, every function of this type is a radially symmetric Barron function. Finally, we note that

	$f_{b} (x) - f_{- b} (x)$	$= ⨍_{S^{d - 1}} σ (ν^{T} x - b) - σ (ν^{T} x + b) d H_{ν}^{d - 1}$
		$= ⨍_{S^{d - 1}} σ (- ν^{T} x - b) - σ (ν^{T} x + b) d H_{ν}^{d - 1}$
		$= - ⨍_{S^{d - 1}} (ν^{T} x + b) d H_{ν}^{d - 1}$
		$= - b,$

since the uniform distribution is invariant under the substitution $ν \mapsto - ν$ in the first term. In particular, every radially symmetric Barron function can be written as

f (x) = f (0) + \int_{[0, \infty)} f_{b} (x) d μ_{b}

for some measure $μ$ on $[0, \infty)$ and $[f]_{B} = ∥ μ ∥_{T V}$ , since $f_{b} (0) = 0$ for any $b > 0$ .

Step 2. Gradient bound. We note that $[f_{b}]_{B} = 1$ for any $b$ by definition and

\nabla f_{b} (x) = ⨍_{S^{d - 1}} σ^{'} (ν^{T} x - b) ν d H_{ν}^{d - 1} = ⨍_{S^{d - 1}} 1_{\{ν^{T} x > b\}} ν d H_{ν}^{d - 1} .

Due to radial symmetry, the gradient points in direction $x$ , i.e.

\nabla f_{b} (x) = \frac{\int_{S^{d - 1}} 1_{\{ν_{1} > b / | x |\}} ν_{1} d H_{ν}^{d - 1}}{\int_{S^{d - 1}} 1 d H_{ν}^{d - 1}} \frac{x}{| x |} .

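The gradient is largest as $| x | \to \infty$ , and

\lim_{| x | \to \infty} | \nabla f_{b} (x) | = \frac{\int_{S^{d - 1}} 1_{\{ν_{1} > 0\}} ν_{1} d H_{ν}^{d - 1}}{\int_{S^{d - 1}} 1 d H_{ν}^{d - 1}} = \frac{\int_{0}^{1} s (1 - s^{2})^{\frac{d - 3}{2}} d s}{2 \int_{0}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s}

independently of $b$ . In the last step, we used the coarea formula (B.1). It is now possible to evaluate the gradient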

\frac{\int_{0}^{1} s (1 - s^{2})^{\frac{d - 3}{2}} d s}{2 \int_{0}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s} = \frac{\frac{1}{d - 1}}{\sqrt{π} \frac{Γ ((d - 1) / 2)}{Γ (d / 2)}} = \frac{Γ (d / 2)}{\sqrt{π} (d - 1) Γ ((d - 1) / 2)} \sim \frac{1}{\sqrt{2 π d}}

in the sense that the ratio of the two sides converges to $1$ as $d \to \infty$ .

Consequently, for a general radially symmetric Barron function as in (B) and sufficiently large $d \in N$ , we find that

[f]_{\mathrm{Lip}} = \sup_{x \in R^{d}} | \nabla f (x) | \leq \int_{[0, \infty)} ∥ \nabla f_{b} ∥_{L^{\infty}} d | μ |_{b} \leq \frac{1 + ε}{\sqrt{2 π d}} ∥ μ ∥_{T V} = \frac{1 + ε}{\sqrt{2 π d}} [f]_{B} .
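The constant in the gradient bound can be checked numerically. The Python sketch below evaluates the closed form $\frac{Γ (d / 2)}{\sqrt{π} (d - 1) Γ ((d - 1) / 2)}$ , compares it with a direct quadrature of the two one-dimensional integrals, and prints the product with $\sqrt{2 π d}$ , which should approach $1$ as $d$ grows. The helper names and the midpoint rule are assumptions made for illustration only.

```python
import math

def gradient_constant(d):
    """Closed form Gamma(d/2) / (sqrt(pi) * (d - 1) * Gamma((d - 1)/2))."""
    log_ratio = math.lgamma(d / 2) - math.lgamma((d - 1) / 2)
    return math.exp(log_ratio) / (math.sqrt(math.pi) * (d - 1))

def gradient_constant_quadrature(d, n=20000):
    """Same constant from the two one-dimensional integrals (midpoint rule)."""
    h = 1.0 / n
    nodes = [(i + 0.5) * h for i in range(n)]
    numerator = h * sum(s * (1 - s * s) ** ((d - 3) / 2) for s in nodes)
    denominator = 2 * h * sum((1 - s * s) ** ((d - 3) / 2) for s in nodes)
    return numerator / denominator

for d in (3, 11, 101, 1001):
    exact = gradient_constant(d)
    # Second column: the constant; third column: constant * sqrt(2 pi d).
    print(d, exact, exact * math.sqrt(2 * math.pi * d))
```

In dimension $d = 3$ the constant equals $1 / 4$ exactly, which gives a quick check of the closed form.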

Step 3. Higher regularity. By Corollary 2.4, any radially symmetric Barron function is $C^{1}$ -smooth except at the origin. This establishes the claim in the case $d = 3$ .

Note that if $f : (0, \infty) \to R$ is $C^{k}$ -smooth, then the same is true for $F : R^{d} \to R$ given by $F (x) = f (| x |)$ by the chain rule and product rule. It thus suffices to analyze the radial profile $f$ of $F$ . In the following, we will denote both functions as $f$ by a slight abuse of notation. Consider the radial profile of the function

f_{b} (r) = c_{d} \int_{0}^{1} σ (s r - b) (1 - s^{2})^{\frac{d - 3}{2}} d s, c_{d} = \frac{1}{\int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s}

for $b > 0$ . We can compute the first two derivatives of $f_{b}$ by exchanging differentiation and integration

	$f_{b}^{'} (r)$	$= c_{d} \int_{0}^{1} σ^{'} (s r - b) s (1 - s^{2})^{\frac{d - 3}{2}} d s$
	$f_{b}^{''} (r)$	$= c_{d} \int_{0}^{1} σ^{''} (s r - b) s^{2} (1 - s^{2})^{\frac{d - 3}{2}} d s = \frac{c_{d}}{r} (b / r)^{2} \max \{1 - (b / r)^{2}, 0\}^{\frac{d - 3}{2}},$

where the second formula must be justified by approximation, as the second derivative $σ^{''} (s r - b) = \frac{1}{r} \cdot δ_{b / r}$ (considered as a ‘function’ of $s$ ) is not regular. For $b > 0$ , it is easy to see that $f_{b} \equiv 0$ on $[0, b)$ , and that $f_{b}^{''}$ , as a polynomial in $1 / r$ , is $C^{\infty}$ -smooth on $(b, \infty)$ . Clearly $f_{b}^{''}$ and all its derivatives vanish at infinity. If $d = 3$ , $f_{b}^{''}$ is continuous except at $r = b$ , where it has a jump discontinuity. If $d \geq 5$ , the function

f_{b}^{''} (r) = c_{d} r^{- d} b^{2} (r^{2} - b^{2})^{\frac{d - 3}{2}} = c_{d} b^{2} r^{- d} (r - b)^{\frac{d - 3}{2}} (r + b)^{\frac{d - 3}{2}}

vanishes as $(r - b)^{\frac{d - 3}{2}}$ at $r = b$ and thus has $\frac{d - 5}{2}$ additional derivatives which vanish at $r = b$ . We find $f_{b} \in C^{\frac{d - 1}{2}}$ for any odd dimension $d$ . The $\frac{d + 1}{2}$ -th derivative of $f_{b}$ is bounded and continuous except at $r = b$ , and the $\frac{d + 3}{2}$ -th derivative of $f_{b}$ is a finite measure associated to the regular part of the derivative in $(b, \infty)$ and the jump at $r = b$ .
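The closed form for $f_{b}^{''}$ can be compared with a finite difference approximation. The Python sketch below evaluates $f_{b} (r) = c_{d} \int_{b / r}^{1} (s r - b) (1 - s^{2})^{\frac{d - 3}{2}} d s$ for $r > b$ by a midpoint rule and differentiates numerically. The quadrature, the step size and all function names are illustrative assumptions.

```python
def midpoint(func, a, b, n=20000):
    """Composite midpoint rule for the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def c(d):
    """Normalizing constant c_d = 1 / int_{-1}^{1} (1 - s^2)^((d-3)/2) ds."""
    return 1.0 / midpoint(lambda s: (1 - s * s) ** ((d - 3) / 2), -1.0, 1.0)

def f_b(r, b, d):
    """Radial profile f_b(r); the ReLU integrand vanishes for s < b / r."""
    if r <= b:
        return 0.0
    integrand = lambda s: (s * r - b) * (1 - s * s) ** ((d - 3) / 2)
    return c(d) * midpoint(integrand, b / r, 1.0)

def second_derivative_formula(r, b, d):
    """Closed form (c_d / r) (b/r)^2 (1 - (b/r)^2)^((d-3)/2) for r > b."""
    if r <= b:
        return 0.0
    t = b / r
    return c(d) / r * t * t * (1 - t * t) ** ((d - 3) / 2)

def second_derivative_numeric(r, b, d, step=1e-2):
    """Central second difference of f_b at r (requires r - step > b)."""
    return (f_b(r + step, b, d) - 2 * f_b(r, b, d) + f_b(r - step, b, d)) / step ** 2

for d, r in ((5, 2.0), (7, 1.5)):
    print(d, r, second_derivative_numeric(r, 1.0, d), second_derivative_formula(r, 1.0, d))
```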

It remains to show that a general radial Barron function

f (r) = f (0) + \int_{[0, \infty)} f_{b} (r) d μ_{b}

has the same regularity as its components $f_{b}$ , at least away from the origin. To simplify the presentation, we focus on the case $d \geq 5$ . Let $ε > 0$ and observe that

f (r) = f (0) + μ ({0}) f_{0} (r) + \int_{(0, ε]} f_{b} (r) d μ_{b} + \int_{(ε, \infty)} f_{b} (r) d μ_{b} .

Clearly, the affine linear component $f (0) + μ ({0}) f_{0} (r)$ is $C^{\infty}$ -smooth except at the origin. Secondly, we note that for any $b > 0$ , the identity $f_{b} (r) = b f_{1} (r / b)$ holds. In particular,

\frac{d^{k}}{d r^{k}} \int_{(ε, \infty)} f_{b} (r) d μ_{b} = \int_{(ε, \infty)} b^{1 - k} f_{1}^{(k)} (\frac{r}{b}) d μ_{b}

where the integrals converge uniformly for $k \leq \frac{d - 1}{2}$ due to the $L^{\infty}$ -bound on the $k + 1$ -th derivative of $f_{b}$ . Similarly, the $\frac{d + 1}{2}$ -th derivative converges in $L^{p}$ for all $p < \infty$ due to the bound on the measure-valued $\frac{d + 3}{2}$ -th derivative. Finally, for $k = \frac{d + 3}{2}$ , the integral converges weakly in the sense of Radon measures, i.e. in the weak-* sense, when we consider the space of (Radon) measures as dual to the space of continuous functions.

For the first integral, we prove convergence assuming that $r \geq ε$ . Note that $f_{b}^{''} (r) = r^{- 1} P (b / r)$ for some polynomial $P$ and $r \geq ε \geq b$ . By induction we see that $f_{b}^{(k)} (r) = r^{1 - k} P_{k} (b / r)$ for all $k \geq 2$ , where $P_{k}$ is another polynomial. This is easily seen since

\frac{d}{d r} [r^{1 - k} P_{k} (b / r)] = (1 - k) r^{- k} P_{k} (b / r) + r^{1 - k} P_{k}^{'} (b / r) (- \frac{b}{r^{2}}) = r^{- k} ((1 - k) P_{k} (b / r) - \frac{b}{r} P_{k}^{'} (b / r)) .

Hence, as before,

\frac{d^{k}}{d r^{k}} \int_{(0, ε]} f_{b} (r) d μ_{b} = \int_{(0, ε]} r^{1 - k} P_{k} (b / r) d μ_{b} = r^{1 - k} \int_{(0, ε]} P_{k} (b / r) d μ_{b} .

The integral converges since $b / r \in [0, 1]$ and $P_{k}$ is a continuous function. ∎

Proof of Lemma 4.2.

First claim. Assume that $g : R \to R$ is a function with the properties outlined above and $f$ is defined by (4.2). Then $f$ is radially symmetric by definition and

f (0) = \frac{\int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} g (0) d s}{\int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s} = 1.

Furthermore, if $r = | x | \geq 1$ and $\tilde{c}_{d} := \int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s$ , then

	$\tilde{c}_{d} f (x)$	$= \int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} g (r s) d s$
		$= r^{2 - d} \int_{- 1}^{1} (r^{2} - (r s)^{2})^{\frac{d - 3}{2}} g (r s) r d s$
		$= r^{2 - d} \int_{- 1}^{1} (r^{2} - z^{2})^{\frac{d - 3}{2}} g (z) d z$

since $g \equiv 0$ outside of $(- 1, 1)$ . The integral vanishes by (4.2) if $d \geq 3$ is odd, since $(r^{2} - z^{2})^{\frac{d - 3}{2}}$ is an even polynomial of degree at most $d - 3$ for all $r$ .
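In dimension $d = 3$ the first claim can be illustrated by hand: the only moment condition is $\int_{- 1}^{1} g (s) d s = 0$ . The Python sketch below uses an even piecewise linear $g$ with a single break point $t \in (0, 1)$ , $g (0) = 1$ , $g (t) = - t$ and $g \equiv 0$ outside $(- 1, 1)$ , and evaluates $f (r) = \frac{1}{2} \int_{- 1}^{1} g (r s) d s$ . The break point $t = 0.5$ is a hypothetical choice for illustration and is not claimed to be the optimal one.

```python
def g(s, t=0.5):
    """Even piecewise linear profile with one break point t in (0, 1).

    g(0) = 1, g(t) = -t and g = 0 outside (-1, 1); the value -t at the
    break point makes the integral of g over (-1, 1) vanish.
    """
    s = abs(s)
    if s >= 1.0:
        return 0.0
    if s <= t:
        return 1.0 - (1.0 + t) * s / t
    return -t * (1.0 - s) / (1.0 - t)

def f(r, t=0.5, n=20000):
    """Radial profile f(r) = (1/2) * int_{-1}^{1} g(r s) ds for d = 3."""
    h = 2.0 / n
    return 0.5 * h * sum(g(r * (-1.0 + (i + 0.5) * h), t) for i in range(n))

for r in (0.0, 0.5, 1.0, 2.5):
    print(r, f(r))
```

The printed values should show $f (0) = 1$ and $f (r) = 0$ for $r \geq 1$ up to quadrature error, in line with the computation above.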

It remains to show that $f$ is a Barron function. Take any measure $μ$ such that

g (x) = \int_{- \infty}^{\infty} σ (x + b) d μ_{b}

as in Proposition 2.5 and compute that

(B.3)	$\int_{- \infty}^{\infty} ⨍_{S^{d - 1}} σ (ν^{T} x + b) d H_{ν}^{d - 1} d μ_{b}$	$= ⨍_{S^{d - 1}} \int_{- \infty}^{\infty} σ (ν^{T} x + b) d μ_{b} d H_{ν}^{d - 1}$
	$= ⨍_{S^{d - 1}} g (ν^{T} x) d H_{ν}^{d - 1}$
	$= \frac{| S^{d - 2} |}{| S^{d - 1} |} \int_{- 1}^{1} g (| x | s) (1 - s^{2})^{\frac{d - 3}{2}} d s = f (x)$

by the co-area formula (B.1). The fact that the normalizing constant is exactly

\frac{| S^{d - 2} |}{| S^{d - 1} |} = \frac{1}{\tilde{c}_{d}} = ( \int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s )^{- 1}

can be justified by the same co-area integration. Finally, we note that the left hand side of (B.3) is clearly a radially symmetric Barron function satisfying $[f]_{B} \leq ∥ μ ∥_{T V}$ . Taking the infimum over all $μ$ representing $g$ , we find that $[f]_{B (R^{d})} \leq [g]_{B (R)}$ .

Second claim. Assume on the other hand that $f$ is a radially symmetric Barron function. If we denote by $\bar{μ}$ the Haar measure on the special orthogonal group $S O (d)$ , then due to radial symmetry

	$f (x)$	$= \int_{S O (d)} f (O x) d \bar{μ}_{O}$
		$= f (0) + \int_{S O (d)} \int_{R^{d + 1}} σ (w^{T} O x + b) d μ_{(w, b)} d \bar{μ}_{O}$
		$= f (0) + \int_{R^{d + 1}} ⨍_{S O (d)} σ ((O^{T} w)^{T} x + b) d \bar{μ}_{O} d μ_{(w, b)}$
		$= f (0) + \int_{R^{d + 1}} | w | ⨍_{S^{d - 1}} σ (ν^{T} x + \frac{b}{| w |}) d H_{ν}^{d - 1} d μ_{(w, b)}$
		$= f (0) + \int_{- \infty}^{\infty} ⨍_{S^{d - 1}} σ (ν^{T} x + b^{'}) d H_{ν}^{d - 1} d \hat{μ}_{b^{'}}$

where $\hat{μ} = Φ_{♯} (| w | \cdot μ)$ for the map

Φ : R^{d + 1} \to R, Φ (w, b) = \frac{b}{| w |} .

It is now possible to reverse the calculations from Step 1 by setting

g : R \to R, \quad g (x) = \int_{R} σ (x + b) d \hat{μ}_{b} .

Taking the infimum over $μ$ representing $f$ , we find that $[g]_{B (R)} \leq [f]_{B (R^{d})}$ . Clearly, both $s \mapsto g (s)$ and $s \mapsto g (- s)$ induce the same function $f$ by (4.2) due to symmetry, and so does the even representative $s \mapsto (g (s) + g (- s)) / 2$ .

It remains to show that $g \equiv 0$ outside of $(- 1, 1)$ and that the moment conditions (B) hold. Assuming that $g \equiv 0$ outside $(- 1, 1)$ , we find that

0 = \int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} g (r s) d s = r^{2 - d} \int_{- 1}^{1} (r^{2} - (r s)^{2})^{\frac{d - 3}{2}} g (r s) r d s = r^{2 - d} \int_{- 1}^{1} (r^{2} - z^{2})^{\frac{d - 3}{2}} g (z) d z

for all $r \geq 1$ , as the integral over $(- r, - 1) \cup (1, r)$ vanishes. The moment conditions follow easily as

\frac{d^{k}}{d r^{k}} \int_{- 1}^{1} (r^{2} - z^{2})^{\frac{d - 3}{2}} g (z) d z = \frac{d^{k}}{d r^{k}} \sum_{j = 0}^{(d - 3) / 2} (- 1)^{j} \binom{(d - 3) / 2}{j} r^{d - 3 - 2 j} \int_{- 1}^{1} g (z) z^{2 j} d z \equiv 0

for $r \in [1, \infty)$ and $k \geq 1$ . Taking $k = \frac{d - 3}{2}$ derivatives, we find that $g$ is $L^{2}$ -orthogonal to $z^{0}$ . Lowering the order of the derivative inductively, we find that $g$ is $L^{2}$ -orthogonal to all even polynomials of degree at most $d - 3$ .

Thus we only need to show that $g \equiv 0$ outside $(- 1, 1)$ . First consider the case $d = 3$ , i.e. $n = 0$ and thus

f (r) = \int_{- 1}^{1} g (r s) d s .

Then for $r \geq 1$ , we have

0 = f (r) = \frac{1}{r} \int_{- 1}^{1} g (r s) r d s = \frac{1}{r} \int_{- 1}^{1} g (z) d z + \frac{1}{r} \int_{1}^{r} g (z) + g (- z) d z = \frac{1}{r} \int_{1}^{r} g (z) + g (- z) d z

since $\int_{- 1}^{1} g (s) d s = f (1) = 0$ . As $g$ is an even function, we conclude that $\int_{1}^{r} g (s) d s = 0$ for all $r \geq 1$ and thus $g (s) = 0$ for all $s > 1$ . We now proceed inductively: Assume that $n \geq 1$ is such that

r^{- (n + 1)} \int_{- r}^{r} g (z) (r^{2} - z^{2})^{n} d z = r^{- (n + 1)} \int_{- 1}^{1} g (r s) (r^{2} - (r s)^{2})^{n} r d s = \int_{- 1}^{1} g (r s) (1 - s^{2})^{n} d s = 0

for all $r \geq 1$ . Then also

0 \equiv \int_{- r}^{r} g (z) (r^{2} - z^{2})^{n} d z \Rightarrow 0 \equiv \frac{d}{d r} \int_{- r}^{r} g (z) (r^{2} - z^{2})^{n} d z = 2 n r \int_{- r}^{r} g (z) (r^{2} - z^{2})^{n - 1} d z

since the boundary term vanishes for $n \geq 1$ . In particular, we conclude that

\int_{- 1}^{1} g (r s) (1 - s^{2})^{n} d s \equiv 0 \Rightarrow \int_{- 1}^{1} g (r s) (1 - s^{2})^{n - 1} d s \equiv 0

for $r \geq 1$ and $n \geq 1$ . Since $d$ is odd, we can reduce the integer exponent $n = \frac{d - 3}{2}$ inductively until $n = 0$ . Then, by the same consideration as in the case $d = 3$ , the result is proved. ∎

We now prove the abstract statement about measures on the unit interval.

Proof of Lemma 4.3.

Lower bound. Let $μ$ be a finite signed measure satisfying the moment conditions

\int_{0}^{1} s d μ_{s} = 1, \quad \int_{0}^{1} s^{2 k} d μ_{s} = 0 \quad \forall 0 \leq k \leq n .

Then

\int_{0}^{1} s d μ_{s} = \int_{0}^{1} ( s - \sum_{k = 0}^{n} a_{k} s^{2 k} ) d μ_{s} \leq \| s - \sum_{k = 0}^{n} a_{k} s^{2 k} \|_{L^{\infty} (0, 1)} | μ | ([0, 1])

by definition. Taking the infimum over the parameters $a_{0}, \dots, a_{n}$ on the right, we find that

1 = \int_{0}^{1} s d μ_{s} \leq \mathrm{dist}_{L^{\infty} (0, 1)} (s \mapsto s, \mathrm{span} \{1, s^{2}, \dots, s^{2 n}\}) \cdot ∥ μ ∥

i.e.

∥ μ ∥ \geq \frac{1}{\mathrm{dist}_{L^{\infty} (0, 1)} (s \mapsto s, \mathrm{span} \{1, s^{2}, \dots, s^{2 n}\})} = \frac{1}{\mathrm{dist}_{L^{\infty} (- 1, 1)} (s \mapsto | s |, \mathrm{span} \{1, s, \dots, s^{2 n}\})} .

The asymptotics of

β_{n} := \mathrm{dist}_{L^{\infty} (- 1, 1)} (s \mapsto | s |, \mathrm{span} \{1, s, \dots, s^{2 n}\})

are known due to Bernstein [Ber12] and Varga and Carpenter [VC85] who proved that ${lim}_{n \to \infty} n β_{n} =: β \approx 0.28$ , so

\liminf_{n \to \infty} \frac{γ_{n}}{n} \geq \liminf_{n \to \infty} \frac{1}{n β_{n}} = \frac{1}{β} \approx 3.57.

Upper bound: Step 0. Note that due to compactness, there exist parameters $a_{0}, \dots, a_{n}$ such that

\| s - \sum_{k = 0}^{n} a_{k} s^{2 k} \|_{L^{\infty} (0, 1)} = \mathrm{dist}_{L^{\infty} (0, 1)} (s \mapsto s, \mathrm{span} \{1, s^{2}, \dots, s^{2 n}\}) .

We fix $a_{0}, \dots, a_{n}$ accordingly. Further note that equality is attained in (B) if the measure $μ$ is supported on the set of points

Θ := \{ s \in [0, 1] : | s - \sum_{k = 0}^{n} a_{k} s^{2 k} | = \max_{r \in [0, 1]} | r - \sum_{k = 0}^{n} a_{k} r^{2 k} | \}

and the measure

\tilde{μ} = ( s - \sum_{k = 0}^{n} a_{k} s^{2 k} ) \cdot μ

which has density $s - \sum_{k = 0}^{n} a_{k} s^{2 k}$ with respect to $μ$ is non-negative, i.e. $μ$ has “the right sign” at all points. If such a $μ$ exists, it therefore serves as a matching upper bound and the Lemma is proved. It is, however, not immediately clear whether there exists a signed measure $μ$ supported on $Θ$ which satisfies the moment conditions (B) and positivity condition (B). In the following, we will prove that $μ$ indeed does exist.

Step 1. Due to compactness, $Θ$ is a non-empty subset of $[0, 1]$ . Additionally

Θ \subseteq \{0, 1\} \cup \{ s \in R : 2 \sum_{k = 1}^{n} k a_{k} s^{2 k - 1} = 1 \}

since the function $s \mapsto s - \sum_{k = 0}^{n} a_{k} s^{2 k}$ is either maximal or minimal at $s \in Θ$ . By the fundamental theorem of algebra, $Θ = \{s_{1}, \dots, s_{N}\}$ is thus a finite subset of $[0, 1]$ . In this step, we prove that $0, 1 \in Θ$ and $| Θ \cap (0, 1) | = n$ .

Note that $\sum_{k = 0}^{n} a_{k} s^{2 k}$ is also an optimal polynomial approximation of the function $h (s) = | s |$ in $C^{0} [- 1, 1]$ in the space $P_{2 n + 1}$ of polynomials of degree at most $2 n + 1$ , since the optimal approximation is an even polynomial. By Chebyshev’s equi-oscillation Theorem [KKC09, Section 6.9], there exist $N \geq 2 n + 3$ distinct points $t_{1} < \dots < t_{N}$ such that the error

e (s) = | s | - \sum_{k = 0}^{n} a_{k} s^{2 k}

satisfies

| e (t_{i}) | = \max_{s \in [- 1, 1]} | e (s) | \quad \forall i = 1, \dots, N \quad and \quad e (t_{i}) e (t_{i + 1}) < 0 \quad \forall i = 1, \dots, N - 1,

i.e. there are $N \geq 2 n + 3$ distinct points where the deviation from the target function is largest, and the oscillation around the target function at consecutive points $t_{i}, t_{i + 1}$ goes in opposite directions.

Clearly, $| e |$ is maximal at $s \in [- 1, 0]$ if and only if it is maximal at $(- s) \in [0, 1]$ . Therefore, there exist at least $⌈ N / 2 ⌉ \geq ⌈ (2 n + 3) / 2 ⌉ = n + 2$ points in $Θ = [0, 1] \cap argmax | e |$ . Rounding up is required since $2 n + 3$ is odd, and the point $0$ counts fully towards $Θ \subset [0, 1]$ . Thus $| Θ | \geq n + 2$ .

It remains to show that $| Θ | \leq n + 2$ . We prove this only if $n \geq 1$ , as the case $n = 0$ of approximation by constant functions can be solved explicitly by direct inspection by the constant polynomial $a_{0} = 1 / 2$ .

Assume for a contradiction that $| Θ | \geq n + 3$ . Then there exist at least $n + 1$ distinct points in $Θ \cap (0, 1)$ . Since $e$ is either maximal or minimal at every $s \in Θ \cap (0, 1)$ , we conclude that $e^{'} (s) = 0$ for every $s \in Θ \cap (0, 1)$ . By Rolle’s Theorem, between any two points $s < s^{'}$ such that $e^{'} (s) = e^{'} (s^{'})$ , there exists $s^{*} \in (s, s^{'})$ such that $e^{''} (s^{*}) = 0$ . In particular, $e^{''}$ has at least $n$ distinct zeros in $(0, 1)$ . Since $e^{''}$ is even, it follows that $e^{''}$ has at least $2 n$ distinct zeros. But, since $e^{''}$ is a polynomial of degree at most $2 n - 2$ , it follows that $e^{''} \equiv 0$ and thus that $e$ is an affine function on $(0, 1)$ . On the other hand, we have seen that there exist at least $n + 1$ points in $Θ \cap (0, 1)$ , meaning that there are $n + 1 > 1$ points in $(0, 1)$ at which $e^{'}$ vanishes. We conclude that $e^{'} \equiv 0$ , i.e. $e$ is constant on $(0, 1)$ . It is easy to see that this is not optimal in terms of approximation.
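The case $n = 1$ gives a concrete instance of Step 1. The even polynomial $\frac{1}{8} + s^{2}$ is the classical best uniform approximation of $| s |$ on $[- 1, 1]$ by polynomials of degree at most $3$ , so $a_{0} = \frac{1}{8}$ and $a_{1} = 1$ . The Python sketch below locates the extremal set $Θ$ on a rational grid with exact arithmetic; the grid resolution and the variable names are illustrative assumptions.

```python
from fractions import Fraction

def error(s):
    """e(s) = s - (1/8 + s^2) on [0, 1], the n = 1 error for a_0 = 1/8, a_1 = 1."""
    return s - (Fraction(1, 8) + s * s)

# Exact rational grid on [0, 1] that contains the candidate points 0, 1/2 and 1.
grid = [Fraction(i, 1000) for i in range(1001)]
max_abs_error = max(abs(error(s)) for s in grid)
theta = [s for s in grid if abs(error(s)) == max_abs_error]
signs = [1 if error(s) > 0 else -1 for s in theta]
print(max_abs_error, theta, signs)
```

Since $e (s) = \frac{1}{8} - (s - \frac{1}{2})^{2}$ on $[0, 1]$ , the maximal deviation $\frac{1}{8}$ is attained exactly at the $n + 2 = 3$ points $0, \frac{1}{2}, 1$ with alternating signs, so $0, 1 \in Θ$ and $Θ \cap (0, 1)$ contains exactly $n = 1$ point.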

Step 2. We claim that the $(n + 2) \times (n + 2)$ -Vandermonde type matrix

V = \begin{pmatrix} s_{0} & s_{1} & \dots & s_{n + 1} \\ 1 & 1 & \dots & 1 \\ s_{0}^{2} & s_{1}^{2} & \dots & s_{n + 1}^{2} \\ s_{0}^{4} & s_{1}^{4} & \dots & s_{n + 1}^{4} \\ \vdots & \vdots & \ddots & \vdots \\ s_{0}^{2 n} & s_{1}^{2 n} & \dots & s_{n + 1}^{2 n} \end{pmatrix}

is invertible for any distinct $n + 2$ points $0 \leq s_{0} < \dots < s_{n + 1} \leq 1$ . This is true by classical results [Lun17] for the $(n + 1) \times (n + 1)$ Vandermonde submatrix

V = ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ \begin{matrix} 1 & \dots & 1 s_{0}^{2} & \dots & s_{n}^{2} ⋮ & ⋱ & ⋮ s_{0}^{2 n} & \dots & s_{n}^{2 n} \end{matrix} ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠

since the points $s_{0}^{2}, \dots, s_{n}^{2}$ are distinct. It remains to show that the first row is linearly independent from the others, i.e. there exist no coefficients $a_{0}, \dots, a_{n}$ such that $s = \sum_{k = 0}^{n} a_{k} s^{2 k}$ at $(n + 2)$ distinct points in $[0, 1]$. Assume the contrary. Then there are $n + 2$ distinct points $s_{0} < \dots < s_{n + 1}$ in $[0, 1]$ such that

$$0 = s - \sum_{k = 0}^{n} a_{k} s^{2 k}, \qquad s \in \{s_{0}, \dots, s_{n + 1}\} .$$

By Rolle’s theorem, between two such points $s_{i}, s_{i + 1}$ there exists $ξ_{i}$ such that

$$0 = \frac{d}{d s} \Big|_{s = ξ_{i}} \left(s - \sum_{k = 0}^{n} a_{k} s^{2 k}\right) .$$

The contradiction follows as in Step 1 of this proof.
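The invertibility claim of Step 2 can be checked on a concrete instance. The following is only a sanity check under stated assumptions, not part of the proof: the helper names `vandermonde_type` and `determinant`, the choice $n = 3$ and the equispaced nodes are illustrative and do not appear in the argument above. The determinant is evaluated in exact rational arithmetic, so a non-zero value is not a rounding artefact.

```python
from fractions import Fraction


def vandermonde_type(points, n):
    """Rows s, 1, s^2, s^4, ..., s^(2n) evaluated at the given nodes."""
    rows = [list(points), [Fraction(1)] * len(points)]
    rows += [[s ** (2 * k) for s in points] for k in range(1, n + 1)]
    return rows


def determinant(matrix):
    """Exact determinant by Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = Fraction(1)
    for col in range(size):
        # Find a row with a non-zero entry in the current column.
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


n = 3  # illustrative choice; the matrix is (n + 2) x (n + 2)
points = [Fraction(i, n + 1) for i in range(n + 2)]  # nodes 0, 1/4, ..., 1
V = vandermonde_type(points, n)
det_V = determinant(V)
print("det V =", det_V)
```

Repeating a node makes two columns coincide, in which case `determinant` returns zero; this matches the requirement that the points be distinct.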

Step 3. Combining the results of the first and second step of this proof, we can choose $\{s_{0}, \dots, s_{n + 1}\} = Θ$ and find a unique vector $ν \in R^{n + 2}$ such that

V ν = (1, 0, 0, \dots, 0)^{T}

The measure under consideration is now

$$μ = \sum_{i = 0}^{n + 1} μ_{i} δ_{s_{i}} \quad \text{with } μ_{i} := ν_{i}, \quad \text{which satisfies} \quad \int_{0}^{1} s^{2 k} \, d μ_{s} = (V ν)_{k + 2} = 0 \ \text{ for } 0 \leq k \leq n, \qquad \int_{0}^{1} s \, d μ_{s} = (V ν)_{1} = 1$$

by construction. Thus the moment conditions are met. It remains to show that $μ_{i} \cdot (s - \sum_{k = 0}^{n} a_{k} s^{2 k})$ does not change sign in order to ensure that equality is attained in Hölder’s inequality. Using Chebyshev’s equi-oscillation theorem again, it suffices to show that $μ_{i}$ and $μ_{i + 1}$ have opposite signs for all $i$ .

For any $i \in \{0, \dots, n\}$, consider the even polynomial $P$ of degree $2 n$ (unique up to scaling) such that $P (s_{j}) = 0$ for $0 \leq j \leq n + 1$ except $j \in \{i, i + 1\}$. Then, since $P$ is an even polynomial of degree $\leq 2 n$,

0 = \int_{0}^{1} P (s) d μ_{s} = μ_{i} P (s_{i}) + μ_{i + 1} P (s_{i + 1}),

but since $P$ has $2 n$ zeros at $\pm s_{j}$ for $j \notin \{i, i + 1\}$, we find that $P (s) \neq 0$ for any $s \in [s_{i}, s_{i + 1}]$. Thus $P (s_{i})$ and $P (s_{i + 1})$ have the same sign. In order to satisfy (B), we therefore find that $μ_{i}$ and $μ_{i + 1}$ must have different signs. ∎

Proof of the claim in the proof of Theorem 3.2.

To see this, we use the lower bound

$$∥ μ ∥ \geq \frac{1}{\operatorname{dist}_{L^{\infty} (ε, 1)} (s, \operatorname{span} \{1, s^{2}, \dots, s^{2 n}\})}$$

from (B). By replacing the variable $s$ by $s^{2}$ , we find that

$$\operatorname{dist}_{L^{\infty} (ε, 1)} (s, \operatorname{span} \{1, s^{2}, \dots, s^{2 n}\}) = \operatorname{dist}_{L^{\infty} (ε^{2}, 1)} (\sqrt{s}, \operatorname{span} \{1, s, \dots, s^{n}\}) .$$

Recall that the function $\sqrt{s}$ is an analytic function on the interval $[ε^{2}, 1]$ and

$$\sqrt{s} = 1 + \sum_{n = 1}^{\infty} (- 1)^{n + 1} \frac{\prod_{k = 1}^{n} (2 k - 1)}{2^{n} n!} (s - 1)^{n} = 1 - \frac{1}{\sqrt{π}} \sum_{n = 1}^{\infty} \frac{Γ (n + 1 / 2)}{Γ (n + 1)} (1 - s)^{n} .$$

The coefficients decay asymptotically as $n^{- 1 / 2}$ since

$$\lim_{n \to \infty} \left(\sqrt{n} \, \frac{Γ (n + 1 / 2)}{Γ (n + 1)}\right) = 1,$$

so for every $δ > 0$, there exists $N \in N$ which is independent of $ε$ such that, for all $n \geq N$, the $L^{\infty}$-distance of the function $s \mapsto \sqrt{s}$ from the space $P_{n}$ of polynomials of degree $\leq n$ is at most

$$\begin{aligned} \operatorname{dist}_{L^{\infty} (ε^{2}, 1)} (\sqrt{s}, P_{n}) &\leq \max_{s \in [ε^{2}, 1]} \left| \sqrt{s} - 1 + \frac{1}{\sqrt{π}} \sum_{k = 1}^{n} \frac{Γ (k + 1 / 2)}{Γ (k + 1)} (1 - s)^{k} \right| \\ &\leq \frac{1 + δ}{\sqrt{π n}} \sum_{k = n + 1}^{\infty} (1 - ε^{2})^{k} \\ &= (1 + δ) \frac{(1 - ε^{2})^{n + 1}}{\sqrt{π n} \, ε^{2}} . \end{aligned}$$

∎
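Two elementary ingredients of the estimate above can be checked numerically: the limit $\sqrt{n} \, Γ (n + 1 / 2) / Γ (n + 1) \to 1$ and the closed form of the geometric tail $\sum_{k \geq n + 1} (1 - ε^{2})^{k}$. The sketch below is an illustration only; the function names and the sample values of $ε$ and $n$ are assumptions made for the example and are not taken from the proof.

```python
import math
from fractions import Fraction


def gamma_ratio_scaled(n):
    """sqrt(n) * Gamma(n + 1/2) / Gamma(n + 1), evaluated through log-gamma."""
    return math.sqrt(n) * math.exp(math.lgamma(n + 0.5) - math.lgamma(n + 1))


def geometric_tail_closed_form(eps, n):
    """Closed form (1 - eps^2)^(n + 1) / eps^2 of the tail sum over k >= n + 1."""
    q = 1 - eps ** 2
    return q ** (n + 1) / eps ** 2


def geometric_partial_tail(eps, n, upper):
    """Finite part of the tail: sum of (1 - eps^2)^k for n + 1 <= k <= upper."""
    q = 1 - eps ** 2
    return sum(q ** k for k in range(n + 1, upper + 1))


# Illustrative values; Fractions keep the geometric identity exact.
eps = Fraction(1, 4)
n = 5
upper = 60
remainder = geometric_tail_closed_form(eps, n) - geometric_partial_tail(eps, n, upper)

print("scaled gamma ratio at n = 10^4:", gamma_ratio_scaled(10 ** 4))
print("remaining tail beyond k = 60:", float(remainder))
```

The remainder after truncating at `upper` is again a geometric tail, so it equals the closed form with $n$ replaced by `upper`; this is the identity used in the last line of the estimate.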

Appendix C Brief proofs of known results

In this appendix, we merely sketch the proofs of known results. For a more detailed introduction, we recommend e.g. [EW20b]. We begin by sketching a proof of Proposition 2.1, where we established general properties of Barron spaces.

Proof of Proposition 2.1.

First claim. We note that, assuming existence of the integrals and for fixed $x \in R^{d}$ , we have

$$\begin{aligned} f_{π} (0) &= \int_{R^{d + 2}} a \, σ (b) \, d π \\ | f_{π} (x) - f_{π} (0) | &= \left| \int_{R^{d + 2}} a \{σ (w^{T} x + b) - σ (b)\} \, d π \right| \leq \int_{R^{d + 2}} | a | \, | w^{T} x | \, d π \leq \frac{| x |}{2} \int_{R^{d + 2}} | a |^{2} + | w |^{2} \, d π . \end{aligned}$$

If the first integral exists, then also the integral defining $f_{π} (x)$ exists as the integrand is continuous and grows at most linearly. Then

f_{π} (x) = f_{π} (x) - f_{π} (0) + f_{π} (0) = \int_{R^{d + 2}} a {σ (w^{T} x + b) - σ (b)} d π + f_{π} (0) .

Measurability is not an issue for fixed $x$ due to the continuity of the integrand. For the sake of brevity, denote $h_{(a, w, b)} (x) = a {σ (w^{T} x + b) - σ (b)}$ . More generally, we note that the Bochner-integral

f = c + \int_{R^{d + 2}} (x \mapsto h_{(a, w, b)} (x)) d π_{(a, w, b)}

converges in $C^{0} (K)$ for compact sets $K$ and in $L^{p} (P)$ for $1 \leq p < \infty$ and probability distributions $P$ with finite $p$ -th moments in $x$ , i.e. the function $(a, w, b) \mapsto h_{(a, w, b)}$ is Bochner integrable with respect to $π$ when considered as a function with values in either $C^{0} (K)$ or $L^{p} (P)$ . To see this, consider step functions

$$\tilde{h}_{i} = \sum_{j} 1_{Q_{i j}} h_{(a_{i j}, w_{i j}, b_{i j})}, \qquad \tilde{f}_{i} = \int_{R^{d + 2}} \tilde{h}_{i} \, d π$$

where $Q_{i j}$ are $(d + 2)$-dimensional cubes of side length $2^{- i}$ whose union is $⋃_{j} Q_{i j} = [- 2^{i}, 2^{i}]^{d + 2}$ and $(a_{i j}, w_{i j}, b_{i j}) \in Q_{i j}$. If $(a, w, b) \in Q_{i j}$, then

$$\begin{aligned} | h_{(a, w, b)} (x) - h_{(a_{i j}, w_{i j}, b_{i j})} (x) | &\leq | a - a_{i j} | \, | σ (w^{T} x + b) | + | a_{i j} | \, \left| σ (w^{T} x + b) - σ (w_{i j}^{T} x + b_{i j}) \right| \\ &\leq | a - a_{i j} | \, | w^{T} x + b | + | a_{i j} | \left[ | w - w_{i j} | \, | x | + | b - b_{i j} | \right] \\ &\leq C \left( \frac{| a - a_{i j} |^{2} + | w - w_{i j} |^{2} + | b - b_{i j} |^{2}}{ε} + ε \left\{ | a |^{2} + | w |^{2} | x |^{2} + | b |^{2} \right\} \right) \end{aligned}$$

for any $ε > 0$. Fixing $ε$ to be the square root of the side-length of $Q_{i j}$, we find that $\tilde{f}_{i} (x) \to f_{π} (x)$ pointwise for all $x$. Furthermore, $\tilde{f}_{i}$ is Lipschitz continuous in $x$ uniformly in $i$, so $\tilde{f}_{i}$ converges to a limit in $C^{0} (K)$ by the compact embedding of Lipschitz functions in $C^{0}$, which coincides with the pointwise limit $f_{π}$. In other words, the Bochner integral exists in $C^{0}$. The argument follows in $L^{p} (P)$ by the dominated convergence theorem considering $| \tilde{f}_{i} | (x) \leq 2 (1 + [f]_{B} | x |)$ for all $x \in R^{d}$.

Second claim. In this step, we show that $B_{0}$ is a Banach space and illustrate that $B$ and $B_{0}$ are different spaces. The fact that $B_{0}$ is a Banach space follows as [SX21a, Lemma 1] from the previous claim, where we have shown the existence of $f \in B_{0}$ as a Bochner integral in $L^{2} (P)$, i.e. as a continuous convex combination not only pointwise, but in a function space.

To see that $B \neq B_{0}$ , observe that any $f \in B$ can be decomposed into a positively one-homogeneous and a bounded part due to [EW20b, Corollary 5.3]. On the other hand, in one dimension, the function $f (x) = log (1 + x^{2})$ satisfies $f (0) = f^{'} (0) = 0$ and has an integrable second derivative $f^{''} (x) = 2 \frac{1 - x^{2}}{(1 + x^{2})^{2}}$ . By Proposition 2.5, we find that $f \in B_{0}$ . Since $f$ is not bounded but grows sub-linearly, we conclude that $B \subseteq B_{0} ⊈ B$ . The first inclusion follows from the fact that $[f]_{B} \leq ∥ f ∥_{B}$ as shown next.

Third claim. The claim that $[f]_{B} \leq ∥ f ∥_{B}$ is self-evident by definition, as the full Barron norm also limits the magnitude of the bias.

Fourth claim. Finally, we note that $f \in B_{0}$ is Lipschitz-continuous, since

$$\begin{aligned} | f_{π} (x) - f_{π} (x^{'}) | &= \left| \int_{R^{d + 2}} a \left[ σ (w^{T} x + b) - σ (w^{T} x^{'} + b) \right] d π_{(a, w, b)} \right| \\ &\leq \int_{R^{d + 2}} | a | \, | w^{T} (x - x^{'}) | \, d π_{(a, w, b)} \leq | x - x^{'} | \int_{R^{d + 2}} | a | \, | w | \, d π_{(a, w, b)} . \end{aligned}$$

Taking the infimum over $π$ (and optionally noting that $2 | a | | w | \leq | a |^{2} + | w |^{2}$ ), we find that $| f (x) - f (x^{'}) | \leq [f]_{B} | x - x^{'} |$ . ∎

Proposition 2.3 is proved in [EW20b, Theorem 5.18] and Corollary 2.4 follows from it directly. Let us sketch how the structure of one-dimensional Barron functions can be understood.

Proof of Proposition 2.5.

Upper bound. Let $a \in R$ and $f \in C^{2} (R)$ be such that $f^{''} \in L^{1} (R)$ . Then for $x > a$ we have

	$f (x)$	$= f (a) + \int_{a}^{x} f^{'} (s) \cdot 1 d s = f (a) + f^{'} (a) (x - a) - \int_{a}^{x} f^{''} (s) (s - x) d s$
		$= f (a) + f^{'} (a) σ (x - a) + \int_{a}^{\infty} f^{''} (s) σ (x - s) d s$

and for $x < a$

	$f (x)$	$= f (a) - \int_{x}^{a} f^{'} (s) \cdot 1 d s = f (a) - f^{'} (a) (a - x) - \int_{a}^{x} f^{''} (s) (s - x) d s$
		$= f (a) - f^{'} (a) σ (a - x) + \int_{- \infty}^{a} f^{''} (s) σ (s - x) d s .$

Noting that the $σ$ terms in the first expression vanish when $x < a$ and vice versa, we find that

f (x) = f (a) + f^{'} (a) [σ (x - a) - σ (a - x)] + \int_{R} f^{''} (s) [σ (x - s) 1_{(a, \infty)} (s) + 1_{(- \infty, a)} (s) σ (s - x)] d s .

Consequently, $f = f_{μ}$ for a measure

μ = f^{'} (a) [δ_{(1, a)} - δ_{(- 1, a)}] + f^{''} (b) \cdot [H^{1} |_{{w = 1, b > a}} + H^{1} |_{{w = - 1, b < a}}]

where $δ$ denotes the atomic point measure of mass one and $H^{1}$ denotes the one-dimensional Hausdorff measure, restricted to the half-lines $\{w = 1, b > a\}$ and $\{w = - 1, b < a\}$. Hence

$$[f]_{B} = \inf_{f = f_{μ}} ∥ μ ∥_{T V} \leq 2 \inf_{a \in R} | f^{'} (a) | + \int_{R} | f^{''} (s) | \, d s .$$

By approximation, the same is true if $f \notin C^{2}$ and $f^{''}$ is merely a measure.
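The representation formula can be tested numerically on the function $f (x) = log (1 + x^{2})$ considered in the proof of Proposition 2.1. The sketch below is an illustration under stated assumptions: the midpoint quadrature, the step count and the function names are choices made for this example only. It evaluates the right-hand side of the formula for a given base point $a$ and compares it with $f (x)$. Since the ReLU kernels vanish outside the interval between $a$ and $x$, the integral over $R$ reduces to a bounded interval.

```python
import math


def relu(z):
    return max(z, 0.0)


def f(x):
    return math.log(1.0 + x * x)


def f_prime(x):
    return 2.0 * x / (1.0 + x * x)


def f_second(x):
    return 2.0 * (1.0 - x * x) / (1.0 + x * x) ** 2


def relu_representation(x, a=0.0, steps=20000):
    """Evaluate f(a) + f'(a)[relu(x-a) - relu(a-x)] + integral of f'' against ReLU kernels."""
    lo, hi = min(a, x), max(a, x)
    h = (hi - lo) / steps
    integral = 0.0
    for i in range(steps):
        s = lo + (i + 0.5) * h  # midpoint rule node
        # Kernel relu(x - s) for s > a and relu(s - x) for s < a.
        kernel = relu(x - s) if s > a else relu(s - x)
        integral += f_second(s) * kernel * h
    return f(a) + f_prime(a) * (relu(x - a) - relu(a - x)) + integral


for x in (-3.0, -1.0, 0.5, 2.0):
    print(x, f(x), relu_representation(x, a=0.5))
```

Choosing a base point with $f^{'} (a) \neq 0$, such as `a=0.5` here, also exercises the two atomic terms of the measure.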

Lower bound direction. The bound

$$[f]_{B} \geq [f]_{\mathrm{Lip}} = \sup_{a \in R} \max_{v \in \partial f (a)} | v | = \sup_{a \in R} | f^{'} (a) |$$

follows from Proposition 2.1 and the Rademacher Theorem on the differentiability of Lipschitz functions. For the second form of the lower bound, let $f \in B_{0}$ , i.e. there exists a measure $μ$ on $R^{2}$ such that

f (x)

The second expression can be written as

$$f (x) = c + f^{+} (x) + f^{-} (x) = c + \int_{R} σ (x + b) \, d μ_{b}^{+} + \int_{R} σ (- x + b) \, d μ_{b}^{-}$$

where

μ_{\pm} = ψ_{♯} (| w | \cdot 1_{{\pm w > 0}} \cdot μ), ψ (w, b) = b / | w |,

i.e. $μ_{\pm}$ is the push-forward of the measure which has density $| w |$ with respect to $μ$ onto the real line. If $ϕ \in C_{c}^{\infty} (R)$ is any function, then by exchanging the order of integration and integrating by parts, we find that

	$\int_{- \infty}^{\infty} f^{+} (x) ϕ^{''} (x) d x$	$= \int_{- \infty}^{\infty} ϕ^{''} (x) \int_{R} σ (x + b) d μ_{b}^{+} d x$
		$= \int_{R} \int_{- b}^{\infty} ϕ^{''} (x) (x + b) d x d μ_{b}^{+}$
		$= - \int_{R} \int_{- b}^{\infty} ϕ^{'} (x) d x d μ_{b}^{+}$
		$= \int_{R} ϕ (b) d μ_{b}^{+}$

Hence $(f^{+})^{''} = μ^{+}$ in the distributional sense, and thus $f^{''} = μ^{+} + μ^{-}$. In particular,

$$∥ f^{''} ∥_{T V} = ∥ μ^{+} + μ^{-} ∥_{T V} \leq \inf_{μ} \left( ∥ μ^{+} ∥_{T V} + ∥ μ^{-} ∥_{T V} \right) \leq \inf_{μ} \int_{R^{2}} | w | \, d | μ |_{(w, b)} = [f]_{B} .$$

∎

We sketch a proof of the direct approximation theorem for Barron spaces.

Proof of Proposition 2.6.

Step 1. Consider the Hilbert space $H = L^{2} (P)$ and observe that $h_{(a, w, b)} \in H$ defined by $h_{(a, w, b)} (x) = a \{σ (w^{T} x + b) - σ (b)\}$ has norm bounded by

∥ h_{(a, w, b)} ∥_{H}^{2} = \int_{R^{d}} a^{2} [σ (w^{T} x + b) - σ (b)]^{2} d P \leq a^{2} \int_{R^{d}} | w^{T} x |^{2} d P .

We use Proposition 2.1 to write $f \in B_{0}$ as

f (x) = f (0) + \int_{R^{d + 2}} h_{(a, w, b)} (x) d π_{(a, w, b)} .

Step 2. Using the homogeneity relation $σ (z) = λ^{- 1} σ (λ z)$ , the distribution $π$ can be normalized such that

| a |^{2} = | w |^{2} = \frac{1}{2} \int_{R^{d + 2}} | a^{'} |^{2} + | w^{'} |^{2} d π_{(a^{'}, w^{'}, b^{'})}

almost surely by considering the push-forward of $π$ along the map

$$T : R^{d + 2} \to R^{d + 2}, \qquad T (a, w, b) = \left(a \sqrt{\frac{| w |}{| a |}}, \ w \sqrt{\frac{| a |}{| w |}}, \ b \sqrt{\frac{| a |}{| w |}}\right)$$

if $a, w \neq 0$ and $T (a, w, b) = 0$ otherwise, which satisfies $f_{T_{♯} π} \equiv f_{π}$ . Thus for any $ε > 0$ , $f - f (0)$ is in the $H$ -closed convex hull of the family

G_{∥ f ∥_{B} + ε} = {h_{(a, w, b)} : | a | = | w | \leq ∥ f ∥_{B} + ε} .
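The normalization in Step 2 relies only on the positive homogeneity of the ReLU. The following sketch illustrates this for a single neuron, assuming that the bias is rescaled by the same factor as the weight vector so that the neuron is unchanged. The names `h` and `normalize` and the sample parameters are hypothetical and chosen for this illustration only.

```python
import math


def relu(z):
    return max(z, 0.0)


def h(a, w, b, x):
    """Single neuron a * (relu(w.x + b) - relu(b)) as in Step 1."""
    inner = sum(wi * xi for wi, xi in zip(w, x))
    return a * (relu(inner + b) - relu(b))


def normalize(a, w, b):
    """Rescale (a, w, b) so that |a| = |w| without changing the neuron."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    if a == 0 or norm_w == 0:
        # The neuron vanishes identically in this case.
        return 0.0, [0.0] * len(w), 0.0
    lam = math.sqrt(abs(a) / norm_w)
    # a / lam = a * sqrt(|w| / |a|); w and b are multiplied by lam.
    return a / lam, [lam * wi for wi in w], lam * b


# Illustrative parameters and evaluation point.
a, w, b = -3.0, [0.5, -2.0, 1.0], 0.7
x = [0.3, -1.2, 2.0]
a_new, w_new, b_new = normalize(a, w, b)
norm_w_new = math.sqrt(sum(wi * wi for wi in w_new))

print("before:", h(a, w, b, x), "after:", h(a_new, w_new, b_new, x))
print("|a| =", abs(a_new), "|w| =", norm_w_new)
```

After normalization both $| a |^{2}$ and $| w |^{2}$ equal the product of the original $| a |$ and $| w |$, which is why the quantity $| a |^{2} + | w |^{2}$ is minimized along the rescaling orbit.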

Step 3. By the Maurey-Barron-Jones Lemma [Bar93, Lemma 1], for every $m \in N$ and every $ε^{'} > 0$ , there exist $h_{(a_{i}, w_{i}, b_{i})} \in G_{∥ f ∥_{B} + ε}$ such that

$$\left\| f - f (0) - \frac{1}{m} \sum_{i = 1}^{m} h_{(a_{i}, w_{i}, b_{i})} \right\|_{H} \leq \frac{∥ f ∥_{B} + ε}{\sqrt{m}} + ε^{'} .$$

As the vectors $(a_{i}, w_{i}, b_{i})$ are constrained to a compact domain of $R^{d + 2}$ and the map $R^{d + 2} \to H$ , $(a, w, b) \mapsto h_{(a, w, b)}$ is continuous, we can set $ε, ε^{'} \to 0$ and obtain the result without constant by an appropriate subsequence.

Finally, we write $c = f (0) + \frac{1}{m} \sum_{i = 1}^{m} a_{i} σ (b_{i})$ for compatibility with the original notation. ∎

Appendix D Further results

d.1. On the decay of $f_{d}^{*} (x)$ for $x \neq 0$

Numerical experiments in Appendix A suggest that $f_{d}^{*} (x)$ decays to zero exponentially fast for $x \neq 0$ . While we cannot prove this in full generality, we show that

0 \leq f_{d}^{*} (x) \leq C d^{3 / 2} {(\frac{1 - | x |^{2}}{| x |})}^{\frac{d - 3}{2}}

for a constant $C > 0$ which is independent of $d$ . In particular, $f_{d}^{*} (x) \to 0$ exponentially fast in $d$ if $| x | > 0.62$ . To see this, observe that

	$f_{d}^{*} (r)$	$= c_{d} \int_{- 1}^{1} g (r s) (1 - s^{2})^{\frac{d - 3}{2}} d s = 2 c_{d} r^{\frac{1 - d}{2}} \int_{0}^{r} g (z) (r^{2} - z^{2})^{\frac{d - 3}{2}} d z$
		$= - 2 c_{d} r^{\frac{1 - d}{2}} \int_{r}^{1} g (z) (r^{2} - z^{2})^{\frac{d - 3}{2}} d z$

for $r < 1$ since $g$ is $L^{2} (0, 1)$ -orthogonal to the polynomial $(r^{2} - z^{2})^{\frac{d - 3}{2}}$ . Since $∥ g ∥_{L^{\infty} (0, 1)} \leq γ_{\frac{d - 1}{2}}$ , we may estimate

$$| f_{d}^{*} (r) | \leq 2 c_{d} \, r^{\frac{1 - d}{2}} γ_{\frac{d - 1}{2}} (1 - r^{2})^{\frac{d - 3}{2}} = \frac{2 c_{d} γ_{\frac{d - 1}{2}}}{r} \left(\frac{1 - r^{2}}{r}\right)^{\frac{d - 3}{2}} .$$

The pre-factor grows as $d^{3 / 2}$ since $γ_{d} \sim d$ and

$$c_{d} = \frac{1}{\int_{- 1}^{1} (1 - s^{2})^{\frac{d - 3}{2}} d s} = \frac{Γ (d / 2)}{\sqrt{π} \, Γ (\frac{d - 1}{2})} \sim \sqrt{\frac{d}{2 π}} .$$

Finally, we note that $(1 - r^{2}) / r < 1$ holds for positive $r$ if and only if $r > \frac{\sqrt{5} - 1}{2} \approx 0.618$ .
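Both the threshold $\frac{\sqrt{5} - 1}{2}$ and the asymptotic behavior of $c_{d}$ are easy to confirm numerically. The sketch below is an illustration only; the function names `c_d` and `decay_base` and the sample dimension are assumptions made for this example.

```python
import math


def c_d(d):
    """Normalizing constant Gamma(d/2) / (sqrt(pi) * Gamma((d-1)/2)) via log-gamma."""
    return math.exp(math.lgamma(d / 2) - math.lgamma((d - 1) / 2)) / math.sqrt(math.pi)


def decay_base(r):
    """Base (1 - r^2) / r of the exponential factor in the decay estimate."""
    return (1.0 - r * r) / r


threshold = (math.sqrt(5.0) - 1.0) / 2.0  # positive root of r^2 + r - 1 = 0

d = 1001  # illustrative large dimension
asymptotic_ratio = c_d(d) / math.sqrt(d / (2.0 * math.pi))

print("threshold:", threshold)
print("decay base at the threshold:", decay_base(threshold))
print("c_d / sqrt(d / (2 pi)) for d = 1001:", asymptotic_ratio)
```

For $d = 3$ the weight $(1 - s^{2})^{\frac{d - 3}{2}}$ is constant, so $c_{3} = 1 / 2$, which gives a quick consistency check of the Gamma function expression.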

d.2. Non-radial minimum norm solutions

In this note, we constructed

$$f_{d}^{*} \in \operatorname{argmin}_{f \in F} [f]_{B}, \qquad F = \{f \in B_{0} (R^{d}) : f (0) = 1 \text{ and } f \equiv 0 \text{ on } R^{d} ∖ B_{1} (0)\} .$$

Since both the Barron semi-norm and the class $F$ are convex and invariant under rotations of the data domain, we find that there exists at least one minimizer which is radially symmetric. By direct construction, we saw that this minimizer

$$f_{d}^{*} (x) = 1 + \sum_{i = 0}^{(d + 1) / 2} μ_{i} ⨍_{S^{d - 1}} σ (ν^{T} x - b_{i}) \, d H^{d - 1}_{ν}$$

is unique, at least if $d$ is odd. The biases $0 = b_{0} < \dots < b_{(d + 1) / 2} = 1$ and weights $μ_{i} \neq 0$ are given by the optimization process. Our proof does not exclude the existence of other minimizers, which are not radially symmetric. In fact, assume that $ϕ_{i} \in L^{\infty} (S^{d - 1})$ for $i = 0, \dots, \frac{d + 1}{2}$ are functions such that

$$f_{ϕ} (x) = \sum_{i = 0}^{(d + 1) / 2} ⨍_{S^{d - 1}} σ (ν^{T} x - b_{i}) \, ϕ_{i} (ν) \, d H^{d - 1}_{ν} = 0 \qquad \forall \ | x | \geq 1 .$$

Then trivially also $f_{ϕ} (0) = 0$ since $b_{i} \geq 0$ , and thus

$$(f_{d}^{*} + ε f_{ϕ}) (x) = 1 + \sum_{i = 0}^{(d + 1) / 2} ⨍_{S^{d - 1}} (μ_{i} + ε ϕ_{i} (ν)) \, σ (ν^{T} x - b_{i}) \, d H^{d - 1}_{ν} = \begin{cases} 1 & x = 0 \\ 0 & | x | \geq 1 . \end{cases}$$

Since $f_{d}^{*}$ is the unique radial solution, we can average in the radial direction and observe that $\int_{S^{d - 1}} ϕ_{i} (ν) = 0$ for all $i = 0, \dots, \frac{d + 1}{2}$ . The Barron norm of the combined solution is

$$\sum_{i = 0}^{(d + 1) / 2} \frac{∥ μ_{i} + ε ϕ_{i} ∥_{L^{1} (S^{d - 1})}}{H^{d - 1} (S^{d - 1})} = \sum_{i = 0}^{(d + 1) / 2} | μ_{i} |$$

if $ε$ is so small that $ε ∥ ϕ_{i} ∥_{L^{\infty}} \leq | μ_{i} |$ for all $i$, since the function $μ_{i} + ε ϕ_{i}$ does not change signs in this case, and the integral of $ϕ_{i}$ averages to zero. In particular, if $(ϕ_{0}, \dots, ϕ_{(d + 1) / 2})$ exist such that $f_{ϕ}$ is supported in $\overline{B_{1} (0)}$ and fails to be radial, then a non-radial minimizer exists.

By considering the behavior of $f_{ϕ}$ at infinity, we establish two conditions: $\sum_{i = 0}^{(d + 1) / 2} ϕ_{i} \equiv 0$ in order to have $f_{ϕ}$ bounded, and $\sum_{i = 0}^{(d + 1) / 2} b_{i} ϕ_{i} \equiv 0$ in order to have ${lim}_{x \to \infty} f_{ϕ} (x) = 0$ .

Lemma D.1.

Assume there exist $\frac{d + 3}{2}$ measures $\bar{μ}_{i}$ on $S^{d - 1}$ such that

$$f_{\bar{μ}} (x) := \sum_{i = 0}^{(d + 1) / 2} ⨍_{S^{d - 1}} σ (ν^{T} x - b_{i}) \, d \bar{μ}_{i} = 0$$

for all $| x | \geq 1$ and $f_{\bar{μ}} \not\equiv 0$. Then there exists a minimizer $\hat{f}_{d} \in F$ of the Barron semi-norm which is not radially symmetric. Without loss of generality, we may assume that $\hat{f}_{d}$ is radially symmetric with respect to $(x_{2}, \dots, x_{d})$.

Proof.

Step 1. Assume for now that $f_{\bar{μ}}$ is not identically zero. Let $ψ_{δ}$ be a $C^{\infty}$-probability density on the group of rotations $S O (d)$ which is supported in a $δ$-neighbourhood of the unit matrix, and let $H$ be the Haar measure on $S O (d)$. Define the radial mollification

$$\begin{aligned} f_{\bar{μ}, δ} (x) &= \int_{S O (d)} ψ_{δ} (O) \, f_{\bar{μ}} (O^{T} x) \, d H_{O} \\ &= \sum_{i = 0}^{(d + 1) / 2} ⨍_{S^{d - 1}} \left( \int_{S O (d)} ψ_{δ} (O) \, σ ((O ν)^{T} x - b_{i}) \, d H_{O} \right) d \bar{μ}_{i, ν} \\ &= \sum_{i = 0}^{(d + 1) / 2} \int_{S^{d - 1}} σ (ν^{T} x - b_{i}) \, d \tilde{μ}_{i, δ} \end{aligned}$$

where

$$\tilde{μ}_{i, δ} (B) = \int_{S O (d)} ψ_{δ} (O) \, \bar{μ}_{i} (O \cdot B) \, d H_{O} .$$

We make three observations.

$f_{\bar{μ}, δ} (x) = 0$ if $x = 0$ or $| x | \geq 1$.
$f_{\bar{μ}, δ} \to f_{\bar{μ}}$ as $δ \to 0$ (pointwise and locally uniformly), so $f_{\bar{μ}, δ}$ cannot be identically zero for sufficiently small $δ > 0$.
$\tilde{μ}_{i, δ}$ is absolutely continuous with respect to the uniform distribution on the sphere since

$$| \tilde{μ}_{i, δ} | (B) \leq ∥ ψ_{δ} ∥_{L^{\infty}} ∥ \bar{μ}_{i} ∥_{T V} .$$

Due to the uniform estimate, the Radon-Nikodym derivative $ϕ_{i, δ} := \frac{d \tilde{μ}_{i, δ}}{d H^{d - 1}}$ is an $L^{\infty} (S^{d - 1})$-function.

We now fix $ε, δ$ small enough, write $ϕ_{i} = ϕ_{i, δ}$ and note that $f_{d}^{*} + ε f_{ϕ}$ is also a solution to (D.2). In particular, $f_{ϕ}$ cannot be radially symmetric since $f_{d}^{*}$ is the unique radially symmetric minimizer.

Step 2. Take $f_{ϕ}$ to be non-trivial as implied by step 1. Then there exists at least one direction $\bar{ν}$ such that $f_{ϕ} (t \bar{ν}) \not\equiv 0$. Without loss of generality, we may take $\bar{ν} = e_{1}$. We can now average over all rotations which leave $e_{1}$ fixed. The resulting function $\hat{f}_{ϕ}$ is radially symmetric in all components orthogonal to $e_{1}$, i.e. in $(x_{2}, \dots, x_{d})$. Since we only average over rotations which leave the $e_{1}$-direction fixed, we have $\hat{f}_{ϕ} (t e_{1}) = f_{ϕ} (t e_{1}) \not\equiv 0$. In particular, we may assume that $f_{ϕ}$ has the desired symmetry. ∎

The question whether there exists $\bar{μ} = (\bar{μ}_{0}, \dots, \bar{μ}_{(d + 1) / 2})$ such that $f_{\bar{μ}} \equiv 0$ on $R^{d} ∖ B_{1} (0)$ but $f_{\bar{μ}} \not\equiv 0$ on $R^{d}$ can be rephrased in terms of functional analysis. Namely, if we understand $\bar{μ}$ as an element of the dual space $Z^{*}$ of $Z := C^{0} (S^{d - 1}; R^{(d + 3) / 2})$ and we associate to $x \in R^{d}$ the function $h_{x} \in Z$ given by $h_{x} (ν) = (σ (ν^{T} x - b_{0}), \dots, σ (ν^{T} x - b_{(d + 1) / 2}))$, then we can write $f_{\bar{μ}} (x) = ⟨ \bar{μ}, h_{x} ⟩_{Z^{*}, Z}$ as a duality product.

In particular, we consider two subspaces $V_{1}, V_{2} \subseteq Z$ :

$$V_{1} = \operatorname{span} \{h_{x} : x \in R^{d}\}, \qquad V_{2} = \operatorname{span} \{h_{x} : | x | \geq 1\} .$$

Obviously $V_{2} \subseteq V_{1}$. We note the following: If $\overline{V}_{2} \neq \overline{V}_{1}$ (closures in $Z$), then by the Hahn-Banach theorem there exists $μ \in Z^{*}$ such that $⟨ μ, v ⟩ = 0$ for all $v \in \overline{V}_{2}$ but not for all $v \in \overline{V}_{1}$. It is easy to see by contradiction that there exists in particular $h_{x}$ with $| x | < 1$ such that $f_{μ} (x) = ⟨ μ, h_{x} ⟩_{Z^{*}, Z} \neq 0$. Note that $x \neq 0$ since $f_{μ} (0) = 0$ for any $μ$ by design.

We have thus proved the following.

Corollary D.2.

Denote $Z := C^{0} (S^{d - 1}; R^{(d + 3) / 2})$ and $h_{x} \in Z$, $h_{x} (ν) = (σ (ν^{T} x - b_{0}), \dots, σ (ν^{T} x - b_{(d + 1) / 2}))$. Consider the subspaces $V_{1}, V_{2}$ of $Z$ as in (D.2). There exists a non-radial solution $f$ of the minimization problem (D.2) if and only if $\overline{V}_{1} \neq \overline{V}_{2}$.
