Neural Network Approximation of Lipschitz Functions in High Dimensions with Applications to Inverse Problems

Santhosh Karnik, Rongrong Wang, and Mark Iwen

Santhosh Karnik, Rongrong Wang, and Mark Iwen are with the Department of Computational Mathematics, Science, and Engineering at Michigan State University. Rongrong Wang and Mark Iwen are also with the Department of Mathematics at Michigan State University (e-mail: karniksa@msu.edu, wangron6@msu.edu, iwenmark@msu.edu). Mark Iwen was supported in part by NSF DMS 2106472.

Abstract

The remarkable successes of neural networks in a huge variety of inverse problems have fueled their adoption in disciplines ranging from medical imaging to seismic analysis over the past decade. However, the high dimensionality of such inverse problems has simultaneously left current theory, which predicts that networks should scale exponentially in the dimension of the problem, unable to explain why the seemingly small networks used in these settings work as well as they do in practice. To reduce this gap between theory and practice, a general method for bounding the complexity required for a neural network to approximate a Lipschitz function on a high-dimensional set with a low-complexity structure is provided herein. The approach is based on the observation that the existence of a linear Johnson-Lindenstrauss embedding $A \in R^{d \times D}$ of a given high-dimensional set $S \subset R^{D}$ into a low dimensional cube $[- M, M]^{d}$ implies that for any Lipschitz function $f : S \to R^{p}$ , there exists a Lipschitz function $g : [- M, M]^{d} \to R^{p}$ such that $g (A x) = f (x)$ for all $x \in S$ . Hence, if one has a neural network which approximates $g : [- M, M]^{d} \to R^{p}$ , then a layer can be added which implements the JL embedding $A$ to obtain a neural network which approximates $f : S \to R^{p}$ . By pairing JL embedding results along with results on approximation of Lipschitz functions by neural networks, one then obtains results which bound the complexity required for a neural network to approximate Lipschitz functions on high dimensional sets. The end result is a general theoretical framework which can then be used to better explain the observed empirical successes of smaller networks in a wider variety of inverse problems than current theory allows.

1 Introduction

At present, various network architectures (NN, CNN, ResNet) achieve state-of-the-art performance in a broad range of inverse problems, including matrix completion [31, 20, 10, 11], image deconvolution [28, 16], low-dose CT reconstruction [21], and electric and magnetic inverse problems [9] (seismic analysis, electromagnetic scattering). However, since these problems are very high dimensional, classical universal approximation theory for such networks provides very pessimistic estimates of the network sizes required to learn such inverse maps (i.e., as being much larger than what standard computers can store, much less train). As a result, a gap still exists between the widely observed successes of networks in practice and the network size bounds provided by current theory in many inverse problem applications. The purpose of this paper is to provide a refined bound on the size of networks in a wide range of such applications and to show that the network size is indeed affordable in many inverse problem settings. In particular, the bound developed herein depends on the model complexity of the domain of the forward map instead of the domain’s extrinsic input dimension, and therefore is much smaller in a wide variety of model settings.

To be more specific, recall that in most inverse problems one aims to recover some signal $x$ from its measurement $y = F (x)$ . Here $y$ and $x$ could both be high dimensional vectors, or even matrices and tensors, and $F$ , which is called the forward map/operator, could either be linear or nonlinear with various regularity conditions depending on the application. In all cases, however, recovering $x$ from $y$ amounts to inverting $F$ . In other words, one aims to find the operator $F^{- 1}$ that sends every measurement $y$ back to the original signal $x$ . Depending on the specific application of interest, there are various commonly considered forms of the forward map $F$ . For example, $F$ could be a linear map from high to low dimensions as in compressive sensing applications; $F$ could be a convolution operator that computes the shifted local blurring of an image as in the image deblurring setting; $F$ could be a mask that filters out the unobserved entries of the data as in the matrix completion application; or $F$ could be the source-to-solution map of a differential equation as in ODE/PDE based inverse problems.

In most of these applications, the inverse operator $F^{- 1}$ does not possess a closed-form expression. As a result, in order to approximate the inverse one commonly uses analytical approaches that involve solving, e.g., an optimization problem. Take sparse recovery as an example. With the prior knowledge that the true signal $x \in R^{n}$ is sparse, one can recover it from the under-determined measurements $R^{m} ∋ y = A x$ (with $m < n$ ) by solving the optimization problem

$$\hat{x} = \arg\min_{z} ∥ z ∥_{0} \quad \text{subject to} \quad A z = y .$$

The linear measurement map $F (x) = y = A x$ , when restricted to the low-complexity domain of sparse vectors, then has an inverse $F^{- 1} (y)$ given by the minimizer $\hat{x}$ above.
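To make the optimization view of the inverse map concrete, the following is a minimal sketch of sparsest-solution recovery by exhaustive search over supports. The function name `l0_recover`, the tolerance, and the problem sizes are illustrative assumptions, and the search is exponential in $n$, so it is only meant for toy instances.

```python
import itertools
import numpy as np

def l0_recover(A, y, max_sparsity, tol=1e-9):
    """Brute-force search for the sparsest z with A z = y (tiny problems only)."""
    m, n = A.shape
    if np.linalg.norm(y) <= tol:
        return np.zeros(n)
    # Try supports of increasing size; the first exact fit is a sparsest solution.
    for s in range(1, max_sparsity + 1):
        for support in itertools.combinations(range(n), s):
            cols = list(support)
            coef, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
            if np.linalg.norm(A[:, cols] @ coef - y) <= tol:
                z = np.zeros(n)
                z[cols] = coef
                return z
    return None  # no solution with at most max_sparsity nonzeros

# Toy instance: a 2-sparse signal in R^8 observed through 4 Gaussian measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 8))
x = np.zeros(8)
x[[1, 5]] = [2.0, -1.0]
x_hat = l0_recover(A, A @ x, max_sparsity=2)
```

The cost of this search is what motivates replacing it with a trained network that evaluates the inverse map in a single forward pass.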

Note that traditional optimization-based approaches can be extremely slow for large-scale problems (e.g., for $n$ large above). Alternatively, we can approximate the inverse operator by a neural network. By amortizing the initial cost of an expensive training stage, the network can later achieve unprecedented speed at the test stage, leading to better total efficiency over its lifetime. To realize this goal, however, we first need to find a neural network architecture $f_{θ}$ and train it to approximate $F^{- 1}$ , so that the approximation error ${max}_{y} ∥ f_{θ} (y) - F^{- 1} (y) ∥ = {max}_{x} ∥ f_{θ} (F (x)) - x ∥$ is small. The purpose of this paper is to provide a unified way to obtain a meaningful estimate of the network size that one can use to set up the network in situations where the domain of $F$ is low-complexity, as is the case in, e.g., compressive sensing, low-rank matrix completion, and deblurring with low-dimensional signal assumptions.

2 Related Work

The expressive power of neural networks is important in applications, both as a means of guiding network architecture design choices and for providing confidence that good network solutions exist in general situations. As a result, numerous results about the approximation power of networks have been established in recent years [32, 23, 30, 29, 18]. Most results concern the approximation of functions on $R^{D}$ , however, and yield network sizes that increase exponentially with the input dimension $D$ . As a result, the high dimensionality of many inverse problems leads to bounds from most of the existing literature which are too large to explain the observed empirical success of neural approaches in such applications.

A similar high-dimensional scaling issue arises in many image classification tasks as well. Motivated by this setting, [7] refined previous approximation results for ReLU networks and showed that input data that is close to a low-dimensional manifold leads to network sizes that only grow exponentially with respect to the intrinsic dimension of the manifold. However, this improved bound relies on the assumption that the data fits a manifold, which is quite strong in the inverse problems setting. For example, even the “simple” sparse recovery problem does not have a domain/range that forms a manifold (note that the intersections of the $s$ -dimensional subspaces prevent it from being a manifold). Therefore, a study of the expressive power of networks on inverse problems needs to remove such strict manifold assumptions. Another mild issue with such manifold results is that the number of neurons also depends on the curvature of the manifold in question, which can be difficult to estimate. Furthermore, such curvature dependence is unavoidable for manifold results and needs to be incorporated into any valid bounds.¹

¹ To see why, e.g., curvature dependence is unavoidable, consider any discrete training dataset in a compact ball. There always exists a 1-dimensional manifold, namely a curve, that goes through all the data points. Thus, the mere existence of the 1-dimensional manifold does not mean the data complexity is low. Curvature information and other manifold properties matter as well!

In this paper, we provide another way to estimate the size of the network, by directly using the Gaussian width of the data as a measure of its inherent complexity. Our result can therefore be considered a generalization of the manifold result discussed above in two ways. First, it applies to more arbitrary data sets with low complexities. Second, it also applies to other types of networks besides just feedforward ReLU networks. Both types of generalization are then shown to be useful and applicable to various inverse problems.

3 Main Results

We begin by stating a few definitions. We say that a neural network $ϵ$ -approximates a function $f$ if the function $\hat{f}$ implemented by the neural network satisfies $∥ \hat{f} (x) - f (x) ∥_{\infty} \leq ϵ$ for all $x$ in the domain of $f$ . We say that a neural network architecture $ϵ$ -approximates any function in a function class $F$ if for any function $f \in F$ , there exists a choice of edge weights such that the function $\hat{f}$ implemented by the neural network with that choice of edge weights satisfies $∥ \hat{f} (x) - f (x) ∥_{\infty} \leq ϵ$ for all $x$ in the domain of $f$ .

Also, for any positive integers $d < D$ , any set $S \subset R^{D}$ , and any constant $ρ \in (0, 1)$ , we say that a matrix $A \in R^{d \times D}$ is a $ρ$ -JL (Johnson-Lindenstrauss) embedding of $S$ if

(1 - ρ) ∥ x - x^{'} ∥_{2} \leq ∥ A x - A x^{'} ∥_{2} \leq (1 + ρ) ∥ x - x^{'} ∥_{2} for all x, x^{'} \in S .

If we furthermore have $A (S) := \{A x : x \in S\} \subset T$ , we say that $A$ is a $ρ$ -JL embedding of $S$ into $T$ . Intuitively, a $ρ$ -JL embedding of $S$ into $R^{d}$ transforms $S$ from a high-dimensional space to a low-dimensional space without significantly distorting distances between points.
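For a finite point set, the definition above can be checked directly. The sketch below computes the smallest $ρ$ for which the two-sided inequality holds on every pair of points; the helper name `jl_distortion` and the example data are assumptions made for illustration.

```python
import numpy as np

def jl_distortion(A, points):
    """Smallest rho such that A satisfies the rho-JL inequalities on `points`."""
    X = np.asarray(points, dtype=float)
    worst = 0.0
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            diff = X[i] - X[j]
            base = np.linalg.norm(diff)
            if base == 0.0:
                continue  # identical points impose no constraint
            ratio = np.linalg.norm(A @ diff) / base
            # (1 - rho) <= ratio <= (1 + rho)  is equivalent to  |ratio - 1| <= rho
            worst = max(worst, abs(ratio - 1.0))
    return worst

# Points in R^3 that lie in the xy-plane are embedded isometrically by
# dropping the third coordinate.
pts = [[0.0, 0.0, 0.0], [1.0, 2.0, 0.0], [-3.0, 0.5, 0.0]]
A_proj = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rho_plane = jl_distortion(A_proj, pts)
```

A returned value of 1 or more means that `A` is not a $ρ$ -JL embedding of the point set for any admissible $ρ \in (0, 1)$ .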

Contributions: Existing universal approximation theorems for various types of neural networks are mainly stated for functions defined on a $d$ -dimensional cube. Our main contribution is to generalize these results to functions defined on arbitrary JL-embeddable sets, which possibly reside in very high dimensions. We then demonstrate how our result can be applied to inverse problems to obtain a reasonable estimate of the network size.

Since our theory is to be applied to general inverse problems, for which we cannot assume anything more than Lipschitz continuity, we focus in this paper on the class of Lipschitz functions.

More explicitly, we show that if there exists a $ρ$ -JL embedding of a high-dimensional set $S \subset R^{D}$ into a low-dimensional cube $[- M, M]^{d}$ , then we can use any neural network architecture which can $ϵ$ -approximate $\frac{L}{1 - ρ}$ -Lipschitz functions on $[- M, M]^{d}$ to construct a neural network architecture which can $ϵ$ -approximate $L$ -Lipschitz functions on $S$ . To establish this, we show that if there exists a $ρ$ -JL embedding $A \in R^{d \times D}$ of $S \subset R^{D}$ into $d$ dimensions, then for any $L$ -Lipschitz function $f : S \to R^{p}$ , there exists a $\frac{L}{1 - ρ}$ -Lipschitz function $g : [- M, M]^{d} \to R^{p}$ (where $M = {sup}_{x \in S} ∥ A x ∥_{\infty}$ ) such that $g (A x) = f (x)$ for all $x \in S$ . Hence, if we have a neural network which can approximate $g : [- M, M]^{d} \to R^{p}$ , then we can compose it with a neural network which implements the JL embedding $A$ to obtain a neural network which approximates $f : S \to R^{p}$ . By pairing JL embedding existence results with results on approximation of Lipschitz functions by neural networks, we obtain results which bound the complexity required for a neural network to approximate Lipschitz functions on high dimensional sets.
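The Lipschitz constant $\frac{L}{1 - ρ}$ can be traced through a short calculation, sketched below. Steps 1 and 2 follow from the definition of a $ρ$ -JL embedding. Step 3 is an assumption of this sketch: it supposes that the extension from $A (S)$ to the whole cube is supplied by a standard Lipschitz extension result, which this section does not name.

```latex
% Step 1: A is injective on S, so g is well defined on A(S).
A x = A x' \;\Longrightarrow\; (1 - \rho)\,\|x - x'\|_2 \le \|A x - A x'\|_2 = 0
\;\Longrightarrow\; x = x'.

% Step 2: define g(Ax) := f(x) on A(S) and bound its Lipschitz constant.
\|g(Ax) - g(Ax')\| = \|f(x) - f(x')\| \le L\,\|x - x'\|_2
\le \frac{L}{1 - \rho}\,\|A x - A x'\|_2 .

% Step 3 (assumed): extend g from A(S) \subset [-M, M]^d to all of [-M, M]^d
% without increasing the Lipschitz constant.
```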

Figure 1: If there exists a $ρ$ -JL embedding of $S \subset R^{D}$ into $[- M, M]^{d}$ , then we can write the target function $f : S \to R^{p}$ as $f = g \circ J L$ where $g : [- M, M]^{d} \to R^{p}$ . So, we can then construct a neural network approximation of $f$ by using a neural network approximation of $g$ and adding a layer to implement the JL embedding.

We now state our main theorem.

Theorem 1.

Let $d < D$ be positive integers, and let $L, M > 0$ and $ρ \in (0, 1)$ be constants. Let $S \subset R^{D}$ be a bounded subset for which there exists a $ρ$ -JL embedding $A \in R^{d \times D}$ of $S$ into $[- M, M]^{d}$ .

a) Suppose that any $\frac{L}{1 - ρ}$ -Lipschitz function $g : [- M, M]^{d} \to R^{p}$ can be $ϵ$ -approximated by a feedforward neural network with at most $N$ nodes, $E$ edges, and $L$ layers. Then, any $L$ -Lipschitz function $f : S \to R^{p}$ can be $ϵ$ -approximated by a feedforward neural network with at most $N + D$ nodes, $E + D d$ edges, and $L + 1$ layers.

b) Furthermore, if there exists a single feedforward neural network architecture with at most $N$ nodes, $E$ edges, and $L$ layers that can $ϵ$ -approximate any $\frac{L}{1 - ρ}$ -Lipschitz function $g : [- M, M]^{d} \to R^{p}$ , then there also exists another feedforward neural network architecture with at most $N + D$ nodes, $E + D d$ edges, and $L + 1$ layers that can $ϵ$ -approximate any $L$ -Lipschitz function $f : S \to R^{p}$ .

c) Suppose that the $ρ$ -JL embedding is of the form $A = M D$ , where $M$ is a partial circulant matrix, and $D$ is a diagonal matrix with $\pm 1$ on its diagonal. Also, suppose that any $\frac{L}{1 - ρ}$ -Lipschitz function $g : [- M, M]^{d} \to R^{p}$ can be $ϵ$ -approximated by a convolutional neural network with at most $N$ nodes, $P$ parameters, and $L$ layers. Then, any $L$ -Lipschitz function $f : S \to R^{p}$ can be $ϵ$ -approximated by a feedforward neural network with at most $N + 3 D$ nodes, $P + 2 D + d$ parameters, and $L + 4$ layers.

d) Furthermore, if there exists a single convolutional neural network architecture with at most $N$ nodes, $P$ parameters, and $L$ layers that can $ϵ$ -approximate any $\frac{L}{1 - ρ}$ -Lipschitz function $g : [- M, M]^{d} \to R^{p}$ , then there also exists another convolutional neural network architecture with at most $N + 2 D$ nodes, $P + 2 D d$ parameters, and $L + 3$ layers that can $ϵ$ -approximate any $L$ -Lipschitz function $f : S \to R^{p}$ .
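The bookkeeping behind parts a) and b) can be sketched in a few lines: the network for $f$ is the network for $g$ with one extra linear layer that computes $A x$. The function names and the stand-in for a trained network below are hypothetical and only illustrate the construction.

```python
import numpy as np

def size_after_jl_layer(nodes, edges, layers, d, D):
    """Size bounds from Theorem 1 a)/b): prepend a linear layer computing A x."""
    # D input nodes feed d embedded coordinates through D * d weighted edges.
    return nodes + D, edges + D * d, layers + 1

def compose_with_embedding(g, A):
    """Return x -> g(A x), i.e. the network for f built from a network for g."""
    return lambda x: g(A @ x)

A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])   # toy 2 x 3 embedding matrix
g = lambda z: np.array([z.sum()])                  # hypothetical stand-in for a trained network
f_hat = compose_with_embedding(g, A)
new_size = size_after_jl_layer(nodes=100, edges=500, layers=4, d=2, D=3)
```

The added cost is linear in $D$ , so the exponential part of the size depends only on the network for $g$ .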

Remark 1.

The theorem ensures that the network size for approximating $f$ grows exponentially with the compressed dimension $d$ instead of growing exponentially with the input dimension $D$ . The task now reduces to making the compressed dimension $d$ as small as possible while still ensuring that a $ρ$ -JL embedding of $S$ into $[- M, M]^{d}$ exists.

Remark 2.

The theorem is quite general as parts a and b are not restricted to any particular type of network or activation function. In Section 3.3, we provide two corollaries of Theorem 1 that establish the expressive power of the feedforward and convolutional neural networks.

Remark 3.

If an inverse operator is Lipschitz continuous and there exists a $ρ$ -JL embedding of the set of possible observations $S$ into $d$ dimensions, then the theorem gives us a bound on the complexity of a neural network architecture required to approximate the inverse operator.

3.1 JL embeddings, covering numbers, and Gaussian width

As the existence of the JL map is a critical assumption of our theorem, in this section, we discuss the sufficient conditions for this assumption to hold. In addition, we also care about the structures of the JL maps, as they will end up being the first layer of the final neural network. For example, if the neural network is of convolution type, we need to make sure that a circulant JL matrix exists.

Existence of $ρ$ -JL maps: It is well-known that for finite sets $S$ , the existence of a $ρ$ -JL embedding can be guaranteed by the Johnson-Lindenstrauss lemma. For sets $S$ with infinite cardinality, the Johnson-Lindenstrauss lemma cannot be directly used. In the following proposition, we extend the Johnson-Lindenstrauss lemma from a finite set of $n$ points to a general set $S$ .

Proposition 1.

Let $ρ \in (0, 1)$ . For $S \subseteq R^{D}$ , define

$$U_{S} := \overline{\left\{ \frac{x - x^{'}}{∥ x - x^{'} ∥_{2}} : x, x^{'} \in S \text{ s.t. } x \neq x^{'} \right\}}$$

to be the closure of the set of unit secants of $S$ , and $N (U_{S}, ∥ \cdot ∥_{2}, δ)$ to be the covering number of $U_{S}$ with $δ$ -balls. Then, there exists a set $S_{1}$ with $| S_{1} | = 2 N (U_{S}, ∥ \cdot ∥_{2}, δ)$ points such that if a matrix $A \in R^{d \times D}$ is a $ρ$ -JL embedding of $S_{1}$ , then $A$ is also a $(ρ + 2 ∥ A ∥ δ)$ -JL embedding of $S$ .

The proposition guarantees that whenever we have a JL map for finite sets, we can extend it to a JL map for infinite sets with a similar level of complexity, measured in terms of covering numbers. There are many known JL maps for finite sets that we can extend from, including sub-Gaussian matrices [19], Gaussian circulant matrices with random sign flips [8], etc. We present some of the related results here.
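For a finite sample of $S$ , the covering number of its unit secants can be bounded numerically. The sketch below forms all unit secants and builds a $δ$ -net greedily; the size of the net is an upper bound on the covering number of the sampled secants. The function names are illustrative, and a finite sample only approximates the secant set of an infinite $S$ .

```python
import numpy as np

def unit_secants(points):
    """All normalized differences (x - x') / ||x - x'|| between distinct points."""
    X = np.asarray(points, dtype=float)
    secants = []
    for i in range(len(X)):
        for j in range(len(X)):
            diff = X[i] - X[j]
            norm = np.linalg.norm(diff)
            if i != j and norm > 0:
                secants.append(diff / norm)
    return np.array(secants)

def greedy_cover_size(vectors, delta):
    """Size of a greedily built delta-net; an upper bound on the covering number."""
    centers = []
    for v in vectors:
        # Keep v as a new center only if no existing center is within delta of it.
        if all(np.linalg.norm(v - c) > delta for c in centers):
            centers.append(v)
    return len(centers)

# Points on a line have only two unit secants (one per direction).
line_points = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [5.0, 5.0]]
line_cover = greedy_cover_size(unit_secants(line_points), delta=0.1)
```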

Proposition 2 ([19]).

Let $x_{1}, \dots, x_{n} \in R^{D}$ . Let $ρ \in (0, \frac{1}{2})$ and $β \in (0, 1)$ . Let $A \in R^{d \times D}$ be a random matrix whose entries are i.i.d. from a subgaussian distribution with mean $0$ and variance $1$ . Then, there exists a constant $C > 0$ depending only on the subgaussian distribution such that if $d \geq C ρ^{- 2} log \frac{n}{β}$ , then $\frac{1}{\sqrt{d}} A$ will be a $ρ$ -JL embedding of ${x_{1}, \dots, x_{n}}$ with probability at least $1 - β$ .
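As an empirical illustration of Proposition 2, the sketch below draws a Gaussian matrix (one example of a subgaussian distribution with mean $0$ and variance $1$ ), rescales it by $\frac{1}{\sqrt{d}}$ , and measures the worst pairwise distortion on a random point set. The sizes $n$ , $D$ , $d$ and the seed are arbitrary choices for this sketch and are not taken from [19].

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 20, 1000, 400          # illustrative sizes only
points = rng.standard_normal((n, D))
A = rng.standard_normal((d, D)) / np.sqrt(d)   # i.i.d. N(0, 1) entries, rescaled

# Worst relative change in any pairwise distance after applying A.
worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        diff = points[i] - points[j]
        ratio = np.linalg.norm(A @ diff) / np.linalg.norm(diff)
        worst = max(worst, abs(ratio - 1.0))
```

Increasing `d` should shrink `worst` roughly like $d^{- 1 / 2}$ , in line with the $ρ^{- 2}$ dependence in the proposition.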

Proposition 3 (Corollary 1.3 in [8]).

Let $x_{1}, \dots, x_{n} \in R^{D}$ . Let $ρ \in (0, \frac{1}{2})$ , and let $d = O (ρ^{- 2} {log}^{1 + α} n)$ for some $α > 0$ . Let $A = \frac{1}{\sqrt{d}} M D$ where $M \in R^{d \times D}$ is a random Gaussian circulant matrix and $D \in R^{D \times D}$ is a random Rademacher diagonal matrix. Then, with probability at least $\frac{2}{3} (1 - (D + d) e^{- {log}^{α} n})$ , $A$ is a $ρ$ -JL embedding of ${x_{1}, \dots, x_{n}}$ .

Note that the $α$ in the proposition can be set to be any positive number making the probability of failure less than 1.
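A structured embedding of the kind used in Proposition 3 can be applied with FFTs instead of a dense matrix product. The sketch below assumes that the partial circulant matrix consists of the first $d$ rows of the circulant matrix generated by a Gaussian vector `g`, with entry $(i, j)$ equal to `g[(i - j) % D]`; both the row selection and this indexing convention are assumptions of the sketch rather than details taken from [8].

```python
import numpy as np

def circulant_sign_embed(x, g, signs, d):
    """Apply (1/sqrt(d)) * (first d rows of circ(g)) * diag(signs) to x via FFT."""
    flipped = signs * x                                  # random sign flips
    full = np.fft.ifft(np.fft.fft(g) * np.fft.fft(flipped)).real  # circular convolution
    return full[:d] / np.sqrt(d)                         # keep d rows, rescale

def dense_version(g, signs, d):
    """The same map written as an explicit d x D matrix, for checking only."""
    D = len(g)
    circ = np.array([[g[(i - j) % D] for j in range(D)] for i in range(D)])
    return circ[:d] @ np.diag(signs) / np.sqrt(d)

rng = np.random.default_rng(1)
D, d = 16, 4
g = rng.standard_normal(D)
signs = rng.choice([-1.0, 1.0], size=D)
x = rng.standard_normal(D)
fast = circulant_sign_embed(x, g, signs, d)
slow = dense_version(g, signs, d) @ x
```

The FFT route costs $O (D log D)$ operations per input, and the matrix is described by only $2 D$ numbers, which is what makes this form compatible with a convolutional first layer.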

Combining the results of Propositions 2 and 3 with Proposition 1, we have the following existence result for the JL map of an arbitrary set $S$ :

Proposition 4.

Let $ρ \in (0, 1)$ be a constant. For $S \subseteq R^{D}$ , let $N (U_{S}, ∥ \cdot ∥_{2}, δ)$ be the covering number, with $δ$ -balls, of the set of unit secants $U_{S}$ of $S$ defined in Proposition 1. Then:

a) If $D \geq d ≳ ρ^{- 2} log N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}})$ , then there exists a matrix $A \in R^{d \times D}$ which is a $ρ$ -JL embedding of $S$ .

b) If $D \geq d ≳ ρ^{- 2} log (4 D + 4 d) log N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}})$ , then there exists a matrix $A \in R^{d \times D}$ of the form $A = M D$ which is a $ρ$ -JL embedding of $S$ , where $M$ is a partial circulant matrix and $D$ is a diagonal matrix with $\pm 1$ on its diagonal.

The above proposition characterizes the compressibility of a set $S$ by a JL mapping in terms of the covering number. Alternatively, one can also characterize it using the Gaussian width. For example, in [12] it is shown using methods from [26] that if the set of unit secants of $S$ has a low Gaussian width, then with high probability a subgaussian random matrix will provide a low-distortion linear embedding, and the dimension $d$ required scales quadratically with the Gaussian width of the set of unit secants of $S$ .

Proposition 5 (Corollary 2.1 in [12]).

Let $ρ, β \in (0, 1)$ be constants. Let $A \in R^{d \times D}$ be a matrix whose rows $a_{1}^{T}, \dots, a_{d}^{T}$ are independent, isotropic ( $E [a_{i} a_{i}^{T}] = I$ ), and subgaussian random vectors. Let $S \subset R^{D}$ , and let

$$ω (U_{S}) := E \sup_{u \in U_{S}} ⟨ u, z ⟩, \quad z \sim \mathrm{Normal} (0, I)$$

be the Gaussian width of $U_{S}$ . Then, there exists a constant $C > 0$ depending only on the distribution of the rows of $A$ such that if

$$d \geq \frac{C}{ρ^{2}} {(ω (U_{S}) + \sqrt{log \frac{2}{β}})}^{2},$$

then $\frac{1}{\sqrt{d}} A$ is a $ρ$ -JL embedding of $S$ with probability at least $1 - β$ .
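When the unit secants are available as a finite list of vectors, the Gaussian width can be estimated by Monte Carlo sampling and plugged into the bound of Proposition 5. In the sketch below, the function names, the sample count, and the value `C = 1.0` are placeholders; the true constant depends on the row distribution and is not specified in the text.

```python
import numpy as np

def gaussian_width(vectors, num_samples=2000, seed=0):
    """Monte Carlo estimate of E sup_{u in vectors} <u, z> with z ~ Normal(0, I)."""
    U = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((num_samples, U.shape[1]))
    # For each sample z, take the largest inner product over the set, then average.
    return float(np.mean(np.max(Z @ U.T, axis=1)))

def required_dimension(width, rho, beta, C=1.0):
    """Right-hand side of the bound in Proposition 5; C = 1.0 is a placeholder."""
    return C / rho**2 * (width + np.sqrt(np.log(2.0 / beta))) ** 2

basis = np.eye(5)
w_small = gaussian_width(basis[:2])   # two unit vectors in R^5
w_big = gaussian_width(basis)         # all five standard basis vectors
```

Because both calls use the same seed and ambient dimension, they see the same Gaussian samples, so enlarging the set can only increase the estimate.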

In practice, one can use either the Gaussian width (Proposition 5) or the covering number (Proposition 4) to compute the lower bound of $d$ , whichever is more convenient for a specific application.

3.2 Universal approximator neural networks for Lipschitz functions on $d$ -dimensional cubes

In Theorem 1, we showed that, with the help of JL embeddings, approximation rates of neural networks for functions defined on an arbitrary set $S$ can be derived from their approximation rates for functions defined on the cube $[- M, M]^{d}$ . In this section, we review known results for the latter, so that they can be used in combination with Theorem 1 to provide useful approximation results for network applications to various inverse problems. Specifically, we review two types of universal approximators for functions defined on the cube $[- M, M]^{d}$ . One is the feedforward ReLU network and the other is the ResNet-type convolutional neural network.

Feedforward ReLU network: The fully connected feedforward neural network with ReLU activation is known to be a universal approximator of any Lipschitz function on the box $[- M, M]^{d}$ . Moreover, for such networks, the non-asymptotic approximation error has also been established, allowing us to get an estimate of the network size. The proposition below is a variant of Proposition 1 in [29], and the proof uses an approximating function that uses the same ideas as in [29].

Proposition 6.

Given constants $L, M, ϵ > 0$ and positive integers $d$ and $p$ , there exists a ReLU NN architecture with at most

$$(p + C_{1}) {(2 ⌈ \frac{L M \sqrt{d}}{ϵ} ⌉ + 1)}^{d} \text{ edges}, \quad C_{2} {(2 ⌈ \frac{L M \sqrt{d}}{ϵ} ⌉ + 1)}^{d} + p \text{ nodes}, \quad \text{and} \quad ⌈ {log}_{2} (d + 1) ⌉ + 2 \text{ layers}$$

that can $ϵ$ -approximate any $L$ -Lipschitz function $g : [- M, M]^{d} \to R^{p}$ . Here, $C_{1}, C_{2} > 0$ are universal constants. Also, for each edge of the ReLU NN, the corresponding weight is either independent of $g$ , or is of the form $g_{i} (x)$ for some fixed $x \in [- M, M]^{d}$ and coordinate $i = 1, \dots, p$ .

Convolutional Neural Network: As many successful network applications to inverse problems result from the use of filters in CNN architectures [14], we are particularly interested in the expressive power of CNNs in approximating Lipschitz functions. Currently known non-asymptotic results for CNNs include [32, 23, 30], but they are established under stricter assumptions than mere Lipschitz continuity. On the other hand, the ResNet-based CNN with the following architecture has been shown to possess a good convergence rate.

$$C N N_{θ}^{σ} := F C_{W, b} \circ ({Conv}_{ω_{M}, b_{M}}^{σ} + id) \circ \dots \circ ({Conv}_{ω_{1}, b_{1}}^{σ} + id) \circ P \tag{1}$$

where $σ$ is the activation function, and each ${Conv}_{ω_{m}, b_{m}}^{σ}$ is a convolution layer with $L_{m}$ filters $ω_{m}^{1}, \dots, ω_{m}^{L_{m}}$ stored in $ω_{m}$ and $L_{m}$ biases $b_{m}^{1}, \dots, b_{m}^{L_{m}}$ stored in $b_{m}$ . The addition of the identity map, ${Conv}_{ω_{m}, b_{m}}^{σ} + id$ , makes each such layer a residual block. $F C_{W, b}$ represents a fully connected layer appended to the final layer of the network. We see that the ResNet-based CNN is essentially a normal CNN with skip connections.
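The structure of (1) can be sketched in a simplified single-channel, one-dimensional form. The code below is only an illustration under several assumptions: it uses ReLU for $σ$ , applies each filter as a same-length 1D convolution, and omits the operator $P$ , whose definition is not given in this section. It is not the construction used in [22].

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def conv_block(x, filters, biases):
    """Stack of 1D 'same' convolutions, each followed by ReLU (single channel)."""
    out = x
    for w, b in zip(filters, biases):
        out = relu(np.convolve(out, w, mode="same") + b)
    return out

def resnet_cnn(x, blocks, W, b):
    """FC_{W,b} o (Conv_M + id) o ... o (Conv_1 + id), applied to a 1D input."""
    h = x
    for filters, biases in blocks:
        h = conv_block(h, filters, biases) + h   # residual (skip) connection
    return W @ h + b                             # final fully connected layer

# With all-zero filters and biases each residual block reduces to the identity.
x_in = np.array([1.0, -2.0, 3.0])
zero_blocks = [([np.zeros(3)], [0.0]), ([np.zeros(3)], [0.0])]
out = resnet_cnn(x_in, zero_blocks, W=np.eye(3), b=np.zeros(3))
```

The identity behaviour with zero filters is the usual motivation for the skip connection: each block only has to learn a correction to its input.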

The following asymptotic result is proved in [22]. We note that the authors of [22] proved a more general result for $β$ -Hölder functions, but we state it for Lipschitz functions, i.e., $β = 1$ .

Proposition 7 (Corollary 4 from [22]).

Let $f : [- 1, 1]^{d} \to R$ be a Lipschitz function. Then, for any $K \in \{2, \dots, d\}$ , there exists a CNN $f^{(CNN)}$ with $O (N)$ residual blocks, each of which has depth $O (log N)$ and $O (1)$ channels, and whose filter size is at most $K$ , such that $∥ f - f^{(CNN)} ∥_{\infty} \leq \tilde{O} (N^{- 1 / d})$ .

3.3 Main results

We can now combine Propositions 4, 5, 6, and 7 with our Theorem 1 to obtain theorems bounding the required complexity of a feedforward/convolutional neural network that can $ϵ$ -approximate any $L$ -Lipschitz function on arbitrary sets $S \subset R^{D}$ for which a $ρ$ -JL embedding into $[- M, M]^{d}$ exists.

Theorem 2.

Let $d < D$ be positive integers, and let $L > 0$ and $ρ \in (0, 1)$ be constants. Let $S \subset R^{D}$ be a bounded set and $U_{S}$ be its set of unit secants. Suppose that

$$d ≳ \min \left\{ ρ^{- 2} log N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}}), \; ρ^{- 2} {(ω (U_{S}))}^{2} \right\},$$

where $N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}})$ is the covering number and $ω (U_{S})$ is the Gaussian width of $U_{S}$ . Then, there exists a ReLU neural network architecture with at most

$$(p + C_{1}) {(2 ⌈ \frac{L M \sqrt{d}}{(1 - ρ) ϵ} ⌉ + 1)}^{d} + D d \text{ edges},$$

$$C_{2} {(2 ⌈ \frac{L M \sqrt{d}}{(1 - ρ) ϵ} ⌉ + 1)}^{d} + p + D \text{ nodes},$$

$$\text{and} \quad ⌈ {log}_{2} (d + 1) ⌉ + 3 \text{ layers}$$

that can $ϵ$ -approximate any $L$ -Lipschitz function $f : S \to R^{p}$ , where $M = {sup}_{x \in S} ∥ A x ∥_{\infty}$ and $A \in R^{d \times D}$ is the corresponding $ρ$ -JL embedding of $S$ .
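To get a feel for how the bounds in Theorem 2 scale, they can be evaluated numerically. In the sketch below the function name is illustrative, and the universal constants $C_{1}$ and $C_{2}$ are set to `1.0` purely as placeholders, since their values are not given in the text.

```python
import math

def relu_network_size(L, M, d, D, p, rho, eps, C1=1.0, C2=1.0):
    """Evaluate the edge, node, and layer bounds of Theorem 2.

    C1 and C2 are universal constants whose values the text does not give;
    the defaults of 1.0 are placeholders only.
    """
    # Number of grid cells: (2 * ceil(L M sqrt(d) / ((1 - rho) eps)) + 1)^d
    cells = (2 * math.ceil(L * M * math.sqrt(d) / ((1 - rho) * eps)) + 1) ** d
    edges = (p + C1) * cells + D * d
    nodes = C2 * cells + p + D
    layers = math.ceil(math.log2(d + 1)) + 3
    return edges, nodes, layers

small_example = relu_network_size(L=1, M=1, d=2, D=10, p=1, rho=0.5, eps=1.0)
```

The ambient dimension $D$ enters only through the additive terms `D * d` and `D`, while the exponent of the dominant term is the embedding dimension $d$ .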

Our Theorem 3 is a variant on our Theorem 2, but for convolutional neural networks.

Theorem 3.

Let $d < D$ be positive integers, and let $L > 0$ and $ρ \in (0, 1)$ be constants. Let $S \subset R^{D}$ be a bounded set and $U_{S}$ be its set of unit secants. Suppose that

d ≳ ρ^{- 2} log (4 D + 4 d) log N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}}) .

Then, for any $L$ -Lipschitz function $f : S \to R^{p}$ , there exists a ResNet type CNN $f^{(CNN)}$ in the form of (1) with $O (N)$ residual blocks, each of which has a depth $O (log N)$ and $O (1)$ channels, and whose filter size is at most $K$ such that $∥ f - f^{(CNN)} ∥_{\infty} \leq \tilde{O} (N^{- 1 / d})$ .

4 Applications to Inverse Problems

Now we focus on inverse problems and demonstrate how the main theorems can be used to provide a reasonable estimate of the size of the neural networks needed to solve some classical inverse problems in signal processing. The problems we consider here are sparse recovery, blind deconvolution, and matrix completion.

In all of these inverse problems, we want to recover some signal $x \in S$ from its forward measurement $y = F (x)$ , where the forward map $F$ is assumed to be known. The minimal assumption we have to impose on $F$ is invertibility.

Assumption 1 (invertibility of the forward map): Let $S$ be the domain of the forward map $F$ , and $Y = F (S)$ be the range. Assume that the inverse operator $F^{- 1} : Y \to S$ exists and is Lipschitz continuous with constant $L$ ,

∥ F^{- 1} (y_{1}) - F^{- 1} (y_{2}) ∥ \leq L ∥ y_{1} - y_{2} ∥, for all y_{1}, y_{2} \in Y .

For any inverse problem satisfying Assumption 1, Theorems 1, 2, and 3 provide ways to estimate the size of the universal approximator networks for the inverse map. When applying the theorems to each problem, we need to estimate the covering number of $U_{Y}$ first.

Depending on the problem, one may estimate the covering number either numerically or theoretically. If the domain $Y$ of the inverse map is irregular and discrete, then it may be easier to compute the covering number numerically. If the domain has a nice mathematical structure, then we may be able to estimate it theoretically. Below are three examples of the theoretical estimation. From them, we see that it is quite common for inverse problems to have a small intrinsic complexity, with which Theorems 2 and 3 can significantly reduce the required size of the network from the previously known results.

We emphasize that the covering number that the theorems use is that of the set of unit secants of $Y$ , which can be much larger than the covering number of $Y$ itself.

Sparse recovery: Sparsity is now one of the most commonly used priors in inverse problems, as signals in many real applications possess a certain level of sparsity in some appropriate domain. For simplicity, we consider strictly sparse signals. Let $Σ_{s}^{N}$ be the set of $s$ -sparse vectors of length $N$ . Assume that a sparse vector is measured linearly as $y = A x \equiv F (x)$ ; the inverse problem then amounts to recovering $x$ from $y$ . Since we want to use a network to approximate the inverse map $F^{- 1} : Y \ni y \mapsto x \in Σ_{s}^{N}$ , where $Y \equiv A Σ_{s}^{N}$ , and to estimate the size of the network through the theorems, we need to estimate the covering number of the set of unit secants $U_{A Σ_{s}^{N}}$ .

Proposition 8.

Let $U_{A Σ_{s}^{N}}$ denote the set of unit secants of $A Σ_{s}^{N}$ . Then, we have

log N (U_{A Σ_{s}^{N}}, ∥ \cdot ∥, δ) ≲ s log \frac{N}{δ} .

Proof.

By definition, the set of unit secants of $Y = A Σ_{s}^{N}$ is

$$U_{Y} = \left\{ \frac{y_{1} - y_{2}}{∥ y_{1} - y_{2} ∥} : y_{1}, y_{2} \in A Σ_{s}^{N}, y_{1} \neq y_{2} \right\},$$

which consists of unit vectors that are linear combinations of at most $2 s$ columns of $A$ . For a fixed support set $T$ with $| T | = 2 s$ , the covering number of $span (A_{T}) \cap S^{m - 1}$ is at most $(\frac{3}{δ})^{2 s}$ . Since there are $\binom{N}{2 s} \leq N^{2 s}$ such support sets, the covering number of $U_{Y}$ is at most $N^{2 s} (\frac{3}{δ})^{2 s}$ , and taking logarithms gives $2 s log \frac{3 N}{δ} ≲ s log \frac{N}{δ}$ .

∎

If the inverse of $F$ exists, such as in the case when $A$ satisfies the Restricted Isometry Property, then by Theorems 2 and 3 there exist neural networks of fully connected type or of CNN type with $O (ϵ^{- s log N})$ weights that can perform sparse recovery up to an error of $ϵ$ .
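The counting argument for sparse recovery can be turned into numbers. The sketch below evaluates the logarithm of the covering bound, using the exact count $\binom{N}{2 s}$ of supports of size $2 s$ times a net of size $(3 / δ)^{2 s}$ per support, and then the resulting embedding dimension at $δ = \frac{ρ}{4 \sqrt{3 D}}$ . Here `D` stands for the ambient dimension of the measurements $y$ , and the function names and the constant `C = 1.0` are placeholders for illustration.

```python
import math

def log_cover_bound_sparse(N, s, delta):
    """log of C(N, 2s) * (3/delta)^(2s): number of supports times a net per subspace."""
    return math.log(math.comb(N, 2 * s)) + 2 * s * math.log(3.0 / delta)

def embedding_dim_sparse(N, s, D, rho, C=1.0):
    """C * rho^-2 * log N(U_Y, delta) at delta = rho / (4 sqrt(3 D)); C is a placeholder."""
    delta = rho / (4.0 * math.sqrt(3.0 * D))
    return C * log_cover_bound_sparse(N, s, delta) / rho**2

tiny = log_cover_bound_sparse(N=10, s=1, delta=1.0)          # log(45 * 9)
dim_estimate = embedding_dim_sparse(N=10_000, s=5, D=200, rho=0.5)
```

The estimate grows linearly in the sparsity $s$ and only logarithmically in $N$ and $D$ , which is the source of the savings over bounds that are exponential in the ambient dimension.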

Blind deconvolution: Blind deconvolution concerns the recovery of a signal $x$ from its blurry measurements

$$y = k \otimes x \tag{2}$$

when the kernel $k$ is also unknown. Here $\otimes$ denotes the convolution operation.

Blind deconvolution is an ill-posed problem due to the existence of a scaling ambiguity between $x$ and $k$ ; namely, if $(k, x)$ is a solution, then $(α k, \frac{1}{α} x)$ with $α \neq 0$ is also a solution. To resolve this issue, we focus on recovering the outer product $x k^{T}$ , where $x$ and $k$ here are both column vectors. The recovery of the outer product $x k^{T}$ from the convolution $y = k \otimes x$ can be well-posed in various settings [17, 1]. For example, [1] showed that if we assume $x = Φ u$ and $k = Ψ v$ , where $Φ \in R^{N \times n}$ ( $n < N$ ) is an i.i.d. Gaussian matrix and $Ψ \in R^{N \times m}$ ( $m < N$ ) is a matrix of small coherence, then for large enough $N$ , the outer product $x k^{T}$ can be stably recovered from $y$ in the following sense. For any two signal-kernel pairs $(x, k)$ , $(\tilde{x}, \tilde{k})$ and their corresponding convolutions $y$ , $\tilde{y}$ , we have

$$∥ x k^{T} - \tilde{x} \tilde{k}^{T} ∥ \leq L ∥ y - \tilde{y} ∥ \tag{3}$$

with some constant $L$ . When using a neural network to approximate the inverse map $F^{- 1} : y \to x k^{T}$ , we need to estimate the covering number of the set of unit secants of $Y = \{y = x \otimes k : x \in span (Φ), k \in span (Ψ)\}$ , which is done in the following proposition.

Proposition 9.

Suppose the inverse map $F^{- 1} : y \to x k^{T}$ is Lipschitz continuous with Lipschitz constant $L$ . Then, for $Y = \{y = x \otimes k : x \in span (Φ), k \in span (Ψ)\}$ , the logarithm of the covering number of the set of unit secants of $Y$ is bounded by

log N (U_{Y}, ∥ \cdot ∥_{2}, δ) ≲ max {m, n} log \frac{3 L}{δ} .

Combining this proposition with Theorems 2 and 3, we obtain that there exist neural networks of fully connected type or of CNN type having about $O (ϵ^{- max {m, n} log (L (n + m))})$ weights that can solve the blind deconvolution problem up to an error of $ϵ$ .
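The scaling ambiguity that motivates recovering $x k^{T}$ rather than the pair $(x, k)$ can be checked numerically. The sketch below assumes circular convolution computed with FFTs as the meaning of $\otimes$ ; the vector length, the scaling factor, and the helper name `circ_conv` are arbitrary choices for illustration.

```python
import numpy as np

def circ_conv(a, b):
    """Circular convolution of two equal-length real vectors via the FFT."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

rng = np.random.default_rng(0)
x = rng.standard_normal(6)
k = rng.standard_normal(6)
alpha = 2.5

# Rescaling (k, x) -> (alpha * k, x / alpha) leaves the measurement unchanged ...
y = circ_conv(k, x)
y_scaled = circ_conv(alpha * k, x / alpha)

# ... and also leaves the outer product x k^T unchanged.
outer = np.outer(x, k)
outer_scaled = np.outer(x / alpha, alpha * k)
```

Since the outer product is invariant under the rescaling, it is a well-defined target for the inverse map even though the individual factors are not.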

Matrix completion: Matrix completion is a central task in machine learning where we want to recover a matrix from its partially observed entries. It arises in a number of applications, including image super resolution [25, 6], image/video denoising [13], recommender systems [31, 20], and gene-expression prediction [15]. Recently, neural network models have achieved state-of-the-art performance [31, 20, 10, 11], but a general existence result in the non-asymptotic regime is still missing.

In this setting, the measurements $Y = P_{Ω} X$ consist of a set of observed entries of the unknown low-rank matrix $X$, where $Ω$ is the index set of the observed entries and $P_{Ω}$ is the mask that sets all entries outside $Ω$ to 0. Let $M_{r}^{n, m}$ denote the set of $n \times m$ matrices with rank at most $r$, and assume $X \in M_{r}^{n, m}$. If the mask is random, and the left and right singular vectors $U, V$ of $X$ are incoherent, in the sense that

	$\max_{1 \leq i \leq n} ∥ U^{T} e_{i} ∥ \leq \sqrt{\frac{μ_{0} r}{n}}, \quad \max_{1 \leq i \leq m} ∥ V^{T} e_{i} ∥ \leq \sqrt{\frac{μ_{0} r}{m}},$		(4)
	$\max_{1 \leq i \leq n, 1 \leq j \leq m} | (U V^{T})_{i, j} | \leq \sqrt{\frac{μ_{1} r}{n m}},$

then it is known (e.g., [4]) that, provided the number of observations satisfies

| Ω | ≳ μ_{0} r max {m, n} {log}^{2} max {m, n},

then with overwhelming probability, the inverse map $F^{- 1} : Y \to X$ exists and is Lipschitz continuous. Let us denote by $C$ the set of low-rank matrices satisfying (4). To estimate the complexity of the inverse map, we compute the covering number of $U_{Y}$ for $Y = {P_{Ω} X : X \in M_{r}^{n, m} \cap C}$.
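To make the objects in this setup concrete, the following sketch builds a random low-rank matrix, applies a random mask $P_{Ω}$, and evaluates the smallest $μ_{0}$ and $μ_{1}$ for which the incoherence bounds (4) hold. It is only an illustration under assumed values: the sizes `n`, `m`, `r`, the sampling probability 0.5, and the Gaussian factors are hypothetical choices, and the code does not check the sampling condition on $| Ω |$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 30, 20, 2

# Random rank-r matrix X = B C^T (hypothetical test data).
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

# Random mask Omega: each entry is observed independently with probability 0.5.
omega = rng.random((n, m)) < 0.5
Y = np.where(omega, X, 0.0)          # Y = P_Omega X

# Top-r left and right singular vectors of X.
U_full, s, Vt_full = np.linalg.svd(X, full_matrices=False)
U, V = U_full[:, :r], Vt_full[:r, :].T

# Smallest mu_0 and mu_1 for which the bounds in (4) hold.
# ||U^T e_i|| is the Euclidean norm of the i-th row of U.
mu0 = max((n / r) * np.max(np.sum(U**2, axis=1)),
          (m / r) * np.max(np.sum(V**2, axis=1)))
mu1 = (n * m / r) * np.max(np.abs(U @ V.T)) ** 2
```

Since the squared row norms of $U$ sum to $r$, the value `mu0` always lies between $1$ and $max {n, m} / r$; values near $1$ correspond to the incoherent regime in which the recovery guarantee applies.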

Proposition 10.

Suppose the mask is chosen so that the inverse map $F^{- 1} : Y = P_{Ω} X \to X$ is Lipschitz continuous with Lipschitz constant $L$. Then, for $Y = {P_{Ω} X : X \in M_{r}^{n, m} \cap C}$, the logarithm of the covering number of the set of unit secants of $Y$ is bounded by

Combining this proposition with Theorems 2 and 3, we obtain that there exist neural networks of fully connected type or of CNN type with about $O (ϵ^{- r (m + n) log (L n m)})$ weights that can solve the matrix completion problem up to an error of $ϵ$.

5 Conclusion and Discussion

The main message of this paper is that when neural networks are used to approximate Lipschitz continuous functions, the size of the network only needs to grow exponentially with respect to the intrinsic complexity of the input set measured using either the Gaussian width or the covering numbers. Therefore, it is more optimistic than the previous estimate that requires the size of the network to grow exponentially with respect to the input dimension.

Similar results were derived previously in [7] in a more restrictive setting: the input set is assumed to be close to a smooth manifold with small curvature, and the network type is restricted to feedforward ReLU networks. In this paper, by utilizing the JL map, we are able to state the result in a very general setting that does not impose any structural requirement on the input set other than that it has small complexity. In addition, our result holds for many different types of networks. Although we only stated it for feedforward neural networks and ResNet-type convolutional neural networks, the same idea naturally applies to other types of networks as long as an associated JL map exists.

The estimate we provided for the network size ultimately depends on the complexity of the input set, measured by either the covering number or the Gaussian width of its set of unit secants. The computation of these quantities varies case by case, and in some cases might be rather difficult. This is a possible limitation of the proposed method, because if the estimate of the input complexity is not tight enough, we may again get a pessimistic bound. Having said that, for most classical inverse problems, the covering number and the Gaussian width are not too difficult to calculate. As we demonstrated in Section 4, many of their known properties can be used to facilitate the calculation. Moreover, when a training dataset is given, one can even compute the covering number numerically with off-the-shelf algorithms.

Finally, although applications of neural networks to inverse problems have seen success, there are many more failed attempts whose causes are unclear. One common explanation is that the network in use is not large enough for the targeted application. Since inverse problem models usually have a much higher intrinsic dimensionality than, say, image classification models, the required network sizes might also be much larger. The classical universal approximation theorems only guarantee small errors when the network size approaches infinity, and are therefore not very helpful in the non-asymptotic regime where we have to choose the network size, which is now known to be critical to good performance. We hope the presented result can give more insight into this matter.

References

[1] A. Ahmed, B. Recht, and J. Romberg (2013) Blind deconvolution using convex programming. IEEE Transactions on Information Theory 60 (3), pp. 1711–1732. Cited by: §4.
[2] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee (2016) Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491. Cited by: Appendix D.
[3] D. Azagra, E. Le Gruyer, and C. Mudarra (2021) Kirszbraun’s theorem via an explicit formula. Canadian Mathematical Bulletin 64 (1), pp. 142–153. Cited by: Appendix A.
[4] E. J. Candes and Y. Plan (2010) Matrix completion with noise. Proceedings of the IEEE 98 (6), pp. 925–936. Cited by: §4.
[5] E. J. Candes and Y. Plan (2011) Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57 (4), pp. 2342–2359. Cited by: Appendix H.
[6] F. Cao, M. Cai, and Y. Tan (2014) Image interpolation via low-rank matrix completion and recovery. IEEE Transactions on Circuits and Systems for Video Technology 25 (8), pp. 1261–1270. Cited by: §4.
[7] M. Chen, H. Jiang, W. Liao, and T. Zhao (2019) Efficient approximation of deep relu networks for functions on low dimensional manifolds. Advances in neural information processing systems 32. Cited by: §2, §5.
[8] L. Cheng and H. Zhang (2014) New bounds for circulant johnson-lindenstrauss embeddings. Communications in Mathematical Sciences 12 (4), pp. 695–705. Cited by: §3.1, Proposition 3.
[9] E. Coccorese, R. Martone, and F. C. Morabito (1994) A neural network approach for the solution of electric and magnetic inverse problems. IEEE transactions on magnetics 30 (5), pp. 2829–2839. Cited by: §1.
[10] G. K. Dziugaite and D. M. Roy (2015) Neural network matrix factorization. arXiv preprint arXiv:1511.06443. Cited by: §1, §4.
[11] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua (2017) Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web, pp. 173–182. Cited by: §1, §4.
[12] M. A. Iwen, B. Schmidt, and A. Tavakoli (accepted; see arXiv:2110.04193) On fast johnson-lindenstrauss embeddings of compact submanifolds of $R^{N}$ with boundary. Discrete & Computational Geometry. Cited by: §3.1, Proposition 5.
[13] H. Ji, C. Liu, Z. Shen, and Y. Xu (2010) Robust video denoising using low rank matrix completion. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1791–1798. Cited by: §4.
[14] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser (2017) Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing 26 (9), pp. 4509–4522. Cited by: §3.2.
[15] A. Kapur, K. Marwah, and G. Alterovitz (2016) Gene expression prediction using low-rank matrix completion. BMC bioinformatics 17 (1), pp. 1–13. Cited by: §4.
[16] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas (2018) Deblurgan: blind motion deblurring using conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8183–8192. Cited by: §1.
[17] K. Lee, Y. Li, M. Junge, and Y. Bresler (2015) Stability in blind deconvolution of sparse signals and reconstruction by alternating minimization. In 2015 International Conference on Sampling Theory and Applications (SampTA), pp. 158–162. Cited by: §4.
[18] H. Lin and S. Jegelka (2018) Resnet with one-neuron hidden layers is a universal approximator. Advances in neural information processing systems 31. Cited by: §2.
[19] J. Matoušek (2008) On variants of the johnson–lindenstrauss lemma. Random Structures & Algorithms 33 (2), pp. 142–156. Cited by: §3.1, Proposition 2.
[20] F. Monti, M. Bronstein, and X. Bresson (2017) Geometric matrix completion with recurrent multi-graph neural networks. Advances in neural information processing systems 30. Cited by: §1, §4.
[21] S. Nah, T. Hyun Kim, and K. Mu Lee (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3883–3891. Cited by: §1.
[22] K. Oono and T. Suzuki (2019) Approximation and non-parametric estimation of resnet-type convolutional neural networks. In International Conference on Machine Learning, pp. 4922–4931. Cited by: §3.2, Proposition 7.
[23] P. Petersen and F. Voigtlaender (2020) Equivalence of approximation by convolutional neural networks and fully-connected networks. Proceedings of the American Mathematical Society 148 (4), pp. 1567–1581. Cited by: §2, §3.2.
[24] J. T. Schwartz (1969) Nonlinear functional analysis. Vol. 4, CRC Press. Cited by: Appendix A.
[25] F. Shi, J. Cheng, L. Wang, P. Yap, and D. Shen (2013) Low-rank total variation for image super-resolution. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 155–162. Cited by: §4.
[26] R. Vershynin (2018) High-dimensional probability: an introduction with applications in data science. Vol. 47, Cambridge university press. Cited by: §3.1.
[27] P. Wedin (1972) Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics 12 (1), pp. 99–111. Cited by: Appendix G.
[28] L. Xu, J. S. Ren, C. Liu, and J. Jia (2014) Deep convolutional neural network for image deconvolution. Advances in neural information processing systems 27. Cited by: §1.
[29] D. Yarotsky (2018) Optimal approximation of continuous functions by very deep relu networks. In Conference on learning theory, pp. 639–649. Cited by: Appendix D, §2, §3.2.
[30] D. Yarotsky (2022) Universal approximations of invariant maps by neural networks. Constructive Approximation 55 (1), pp. 407–474. Cited by: §2, §3.2.
[31] Y. Zheng, B. Tang, W. Ding, and H. Zhou (2016) A neural autoregressive approach to collaborative filtering. In International Conference on Machine Learning, pp. 764–773. Cited by: §1, §4.
[32] D. Zhou (2020) Universality of deep convolutional neural networks. Applied and computational harmonic analysis 48 (2), pp. 787–794. Cited by: §2, §3.2.

Appendix A Proof of Theorem 1

First, we prove the following lemma.

Lemma 1.

Let $d < D$ and $p$ be positive integers, and let $L, M > 0$ and $ρ \in (0, 1)$ be constants. Let $S \subset R^{D}$ be a bounded subset for which there exists a $ρ$ -JL embedding $A \in R^{d \times D}$ of $S$ into $[- M, M]^{d}$ . Then, for any $L$ -Lipschitz function $f : S \to R^{p}$ , there exists an $\frac{L}{1 - ρ}$ -Lipschitz function $g : [- M, M]^{d} \to R^{p}$ such that $g (A x) = f (x)$ for all $x \in S$ .

Proof.

For any $x, x^{'} \in S$ , if $A x = A x^{'}$ then since $A$ is a $ρ$ -JL embedding of $S$ , we have that $∥ x - x^{'} ∥_{2} \leq \frac{1}{1 - ρ} ∥ A x - A x^{'} ∥_{2} = \frac{1}{1 - ρ} ∥ 0 ∥_{2} = 0$ , and so, $∥ x - x^{'} ∥_{2} = 0$ , i.e., $x = x^{'}$ . Therefore, the map $x \mapsto A x$ from $S$ to $A (S) := {A x : x \in S}$ is invertible. We define $A^{- 1} : A (S) \to S$ to be the inverse of the map $x \mapsto A x$ .

Now, for any $L$ -Lipschitz function $f : S \to R^{p}$ , we define $˜ g : A (S) \to R^{p}$ by $˜ g = f \circ A^{- 1}$ . Then, for any $y, y^{'} \in A (S)$ , we have

$∥ ˜ g (y) - ˜ g (y^{'}) ∥_{2}$	$= ∥ f (A^{- 1} (y)) - f (A^{- 1} (y^{'})) ∥_{2}$	since $˜ g = f \circ A^{- 1}$
	$\leq L ∥ A^{- 1} (y) - A^{- 1} (y^{'}) ∥_{2}$	since $f$ is $L$-Lipschitz
	$\leq \frac{L}{1 - ρ} ∥ A A^{- 1} (y) - A A^{- 1} (y^{'}) ∥_{2}$	since $A$ is a $ρ$-JL embedding of $S$
	$= \frac{L}{1 - ρ} ∥ y - y^{'} ∥_{2} .$	since $A^{- 1}$ is the inverse of $x \mapsto A x$

Therefore, $˜ g : A (S) \to R^{p}$ is $\frac{L}{1 - ρ}$-Lipschitz. Then, since $A (S) \subset [- M, M]^{d}$, by the Kirszbraun theorem [24], there exists an $\frac{L}{1 - ρ}$-Lipschitz extension of $˜ g$ to $[- M, M]^{d}$, i.e., a function $g : [- M, M]^{d} \to R^{p}$ which is $\frac{L}{1 - ρ}$-Lipschitz on $[- M, M]^{d}$ and satisfies $g (y) = ˜ g (y)$ for all $y \in A (S)$. Finally, for any $x \in S$, we have $A x \in A (S)$, and so, $g (A x) = ˜ g (A x) = f (A^{- 1} (A x)) = f (x)$, as required. ∎

Remark: In [3], the authors give an explicit formula for the Lipschitz extension of a given Lipschitz function.

With Lemma 1, we can now prove each of the four parts of Theorem 1. As a reminder, we assume that $S \subset R^{D}$ is a bounded set for which there exists a $ρ$ -JL embedding $A \in R^{d \times D}$ of $S$ into $[- M, M]^{d}$ .

a) Let $f : S \to R^{p}$ be an $L$ -Lipschitz function. By Lemma 1, there exists a $\frac{L}{1 - ρ}$ -Lipschitz function $g : [- M, M]^{d} \to R^{p}$ such that $f (x) = g (A x)$ for all $x \in S$ . By assumption, $g$ can be $ϵ$ -approximated by a feedforward neural network with at most $N$ nodes, $E$ edges, and $L$ layers. In other words, there exists a function $ˆ g$ such that $∥ ˆ g (y) - g (y) ∥_{\infty} \leq ϵ$ for all $y \in [- M, M]^{d}$ , and $ˆ g$ can be implemented by a feedforward neural network with at most $N$ nodes, $E$ edges, and $L$ layers.

Define another function $ˆ f = ˆ g \circ A$ , i.e., $ˆ f (x) = ˆ g (A x)$ for all $x \in S$ . Since $A (S) \subset [- M, M]^{d}$ by assumption, we have that $A x \in [- M, M]^{d}$ for all $x \in S$ . Then, $∥ ˆ f (x) - f (x) ∥_{\infty} = ∥ ˆ g (A x) - g (A x) ∥_{\infty} \leq ϵ$ for all $x \in S$ , i.e., $ˆ f$ is an $ϵ$ -approximation of $f$ .

Furthermore, we can construct a feedforward neural network to implement $ˆ f = ˆ g \circ A$ by having a linear layer to implement the map $x \mapsto A x$ , and then feeding this into the neural network implementation of $ˆ g$ . The map $x \mapsto A x$ can be implemented with $D$ nodes for the input layer, and $D d$ edges between the input nodes and first hidden layer. By assumption, $ˆ g$ can be implemented by a feedforward neural network with at most $N$ nodes, $E$ edges, and $L$ layers. Hence, $ˆ f = ˆ g \circ A$ can be implemented by a feedforward neural network with at most $N + D$ nodes, $E + D d$ edges, and $L + 1$ layers, as desired.
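The construction in part a) amounts to prepending one linear layer to an existing network and updating the size counts. The sketch below illustrates this bookkeeping under stated assumptions: `prepend_linear_layer` and `composed_size` are hypothetical helper names, the toy `g_hat` stands in for an arbitrary network on $[- M, M]^{d}$, and the Gaussian matrix `A` is only a placeholder for a JL embedding.

```python
import numpy as np

def prepend_linear_layer(g_hat, A):
    """Return f_hat(x) = g_hat(A x), i.e. g_hat preceded by the linear map A."""
    return lambda x: g_hat(A @ x)

def composed_size(N, E, L, D, d):
    """(nodes, edges, layers) of the composed network, as counted in part a)."""
    return N + D, E + D * d, L + 1

rng = np.random.default_rng(0)
D, d = 100, 5
A = rng.standard_normal((d, D)) / np.sqrt(d)    # placeholder for a JL embedding
g_hat = lambda y: np.tanh(y).sum()              # toy stand-in for a network on R^d
f_hat = prepend_linear_layer(g_hat, A)

x = rng.standard_normal(D)
size = composed_size(N=50, E=200, L=3, D=D, d=d)   # (150, 700, 4)
```

The only cost of the reduction is the additive $D$ nodes and $D d$ edges of the embedding layer, which is linear in the ambient dimension $D$.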

b) If the same feedforward neural network architecture $ϵ$-approximates every $\frac{L}{1 - ρ}$-Lipschitz function $g : [- M, M]^{d} \to R^{p}$, then our construction of a feedforward neural network that implements $ˆ f = ˆ g \circ A$ has the same architecture for every $L$-Lipschitz function $f : S \to R^{p}$. Hence, the same bounds on the number of nodes, edges, and layers hold.

c) In a similar manner to part a), we form a CNN that approximates $f = g \circ A$ by first implementing the linear map $x \mapsto A x$ with a CNN and feeding its output into a CNN that approximates $g$.

The JL matrix $A = M D$ can be represented by a Resnet-CNN structure as follows. Let $x$ be the input of the network. Then $D x$, the random sign flip of the input, can be realized by setting the weight/kernel $w_{1}$ of the first two layers to be the delta function, and the bias vectors to take large values at the locations where $D$ has a $1$ and small values where $D$ has a $- 1$. With the help of the ReLU activation, this flips the signs. More explicitly, set $T = {sup}_{x \in S} ∥ x ∥_{\infty}$ so that $x \in [- T, T]^{D}$ for all $x \in S$. Let $b_{i}$ be the bias to be added to the $i$-th coordinate of the input. We design a 2-layer Resnet-CNN, $L (x)$, as follows:

L (x)_{i} = ReLU (2 x_{i} + b_{i}) - ReLU (b_{i}) - x_{i}, i = 1, . . ., D .

The biases $b_{i}$ are chosen to realize the sign flip as follows. If $D_{i i} = 1$, then we set $b_{i} = 2 T$, which makes $L (x)_{i} = x_{i}$. If $D_{i i} = - 1$, then we set $b_{i} = - 2 T$, which makes $L (x)_{i} = - x_{i}$, thus realizing the sign flip. This architecture can also be used to realize a mask (i.e., setting certain entries of $x$ to 0). The application of $M$ to $D x$ is simply a convolution with a mask, and can therefore again be achieved by 2 layers of Resnet-CNN.
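The sign-flip formula $L (x)_{i} = ReLU (2 x_{i} + b_{i}) - ReLU (b_{i}) - x_{i}$ can be verified directly. The following is a minimal numerical sketch, assuming NumPy; the function name `sign_flip_block`, the input range, and the random signs are illustrative choices, and the code evaluates the formula rather than building an actual Resnet-CNN.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def sign_flip_block(x, signs, T):
    """Evaluate L(x)_i = ReLU(2 x_i + b_i) - ReLU(b_i) - x_i with b_i = 2 T signs_i."""
    b = 2.0 * T * signs
    return relu(2.0 * x + b) - relu(b) - x

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=10)
T = np.max(np.abs(x))                      # so that x lies in [-T, T]^D
signs = rng.choice([-1.0, 1.0], size=10)   # diagonal of the sign matrix
out = sign_flip_block(x, signs, T)         # equals signs * x
```

When $b_{i} = 2 T$, the argument $2 x_{i} + 2 T$ is nonnegative, so both ReLUs act as the identity and the output is $x_{i}$. When $b_{i} = - 2 T$, both ReLU arguments are nonpositive, so only the skip term $- x_{i}$ survives.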

This Resnet-CNN that implements $x \mapsto A x$ requires $2 D$ nodes, $D$ parameters, and $2$ layers to apply $D$ to $x$, and an additional $D$ nodes, $D + d$ parameters, and $2$ layers to apply $M$ to $D x$. By adding this to the $N$ nodes, $P$ parameters, and $L$ layers needed for a CNN to approximate the $\frac{L}{1 - ρ}$-Lipschitz function $g : [- M, M]^{d} \to R^{p}$, we obtain that the $L$-Lipschitz function $f : S \to R^{p}$ can be approximated by a Resnet-CNN with $N + 3 D$ nodes, $P + 2 D + d$ parameters, and $L + 4$ layers.

d) If the same convolutional neural network architecture $ϵ$-approximates every $\frac{L}{1 - ρ}$-Lipschitz function $g : [- M, M]^{d} \to R^{p}$, then our construction of a convolutional neural network that implements $ˆ f = ˆ g \circ A$ has the same architecture for every $L$-Lipschitz function $f : S \to R^{p}$. Hence, the same bounds on the number of nodes, parameters, and layers hold.

Appendix B Proof of Proposition 1

Consider a covering of $U_{S}$ by $N (U_{S}, ∥ \cdot ∥_{2}, δ)$ balls of radius $δ$ . Each ball must intersect $U_{S}$ as otherwise we could remove that ball from the covering and obtain a covering of $U_{S}$ with only $N (U_{S}, ∥ \cdot ∥_{2}, δ) - 1$ balls of radius $δ$ , which contradicts the definition of $N (U_{S}, ∥ \cdot ∥_{2}, δ)$ . Enumerate these balls $i = 1, \dots, N (U_{S}, ∥ \cdot ∥_{2}, δ)$ . For each $i$ , pick a point $u_{i} \in U_{S}$ which is also in the $i$ -th ball, and then pick points $x_{i}, x_{i}^{'} \in S$ with $x_{i} \neq x_{i}^{'}$ such that $\frac{x_{i} - x_{i}^{'}}{∥ x_{i} - x_{i}^{'} ∥_{2}} = u_{i}$ . Then, set $S_{1} = {x_{i}}_{i} \cup {x_{i}^{'}}_{i}$ so $| S_{1} | \leq 2 N (U_{S}, ∥ \cdot ∥_{2}, δ)$ .

Suppose $A \in R^{d \times D}$ is a $ρ$ -JL embedding of $S_{1}$ . Then, by definition of a $ρ$ -JL embedding,

(1 - ρ) ∥ x_{i} - x_{i}^{'} ∥ \leq ∥ A x_{i} - A x_{i}^{'} ∥ \leq (1 + ρ) ∥ x_{i} - x_{i}^{'} ∥, for i = 1, \dots, N (U_{S}, ∥ \cdot ∥_{2}, δ)

Now, for any two points $y, y^{'} \in S$ with $y \neq y^{'}$ , there exists an index $i$ such that $\frac{y - y^{'}}{∥ y - y^{'} ∥_{2}} \in U_{S}$ lies in the $i$ -th ball of our covering of $U_{S}$ . Since $\frac{x_{i} - x_{i}^{'}}{∥ x_{i} - x_{i}^{'} ∥_{2}}$ is also in the $i$ -th ball, we have that

∥ ∥ ∥ \frac{x_{i} - x_{i}^{'}}{∥ x_{i} - x_{i}^{'} ∥} - \frac{y - y^{'}}{∥ y - y^{'} ∥} ∥ ∥ ∥ \leq 2 δ .

For simplicity of notation, we set $a = ∥ x_{i} - x_{i}^{'} ∥, b = ∥ y - y^{'} ∥$ . Then we immediately have

	$∥ A (y - y^{'}) ∥$	$= ∥ \frac{b}{a} A (x_{i} - x_{i}^{'}) + A (y - y^{'}) - \frac{b}{a} A (x_{i} - x_{i}^{'}) ∥$
		$\leq \frac{b}{a} ∥ A (x_{i} - x_{i}^{'}) ∥ + ∥ A (y - y^{'} - \frac{b}{a} (x_{i} - x_{i}^{'})) ∥$
		$\leq \frac{b}{a} (1 + ρ) ∥ x_{i} - x_{i}^{'} ∥ + ∥ A ∥ ∥ y - y^{'} - \frac{b}{a} (x_{i} - x_{i}^{'}) ∥$
		$\leq (1 + ρ) b + 2 ∥ A ∥ δ b = (1 + ρ + 2 ∥ A ∥ δ) ∥ y - y^{'} ∥,$

where the last two inequalities used the previous two formulae. The other side of the bi-Lipschitz formula can be proved similarly. Hence, $A$ is also a $(ρ + 2 ∥ A ∥ δ)$-JL embedding of $S$.
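For completeness, the lower bound can be sketched as follows, using the reverse triangle inequality in place of the triangle inequality and the same notation $a = ∥ x_{i} - x_{i}^{'} ∥$, $b = ∥ y - y^{'} ∥$ as above.

```latex
\begin{aligned}
\|A(y - y')\|
&\geq \tfrac{b}{a}\,\|A(x_i - x_i')\|
      - \bigl\|A\bigl(y - y' - \tfrac{b}{a}(x_i - x_i')\bigr)\bigr\| \\
&\geq \tfrac{b}{a}\,(1 - \rho)\,\|x_i - x_i'\|
      - \|A\|\,\bigl\|y - y' - \tfrac{b}{a}(x_i - x_i')\bigr\| \\
&\geq (1 - \rho)\,b - 2\,\|A\|\,\delta\,b
 = (1 - \rho - 2\,\|A\|\,\delta)\,\|y - y'\| .
\end{aligned}
```

The last step again uses that the two unit secants lie in the same ball of radius $δ$, so that $∥ y - y^{'} - \frac{b}{a} (x_{i} - x_{i}^{'}) ∥ \leq 2 δ b$.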

Appendix C Proof of Proposition 4

a) By Proposition 1, there exists a finite set $S_{1}$ with at most $| S_{1} | \leq 2 N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}})$ points such that any $\frac{ρ}{2}$ -JL embedding of $S_{1}$ is also a $(\frac{ρ}{2} + ∥ A ∥ \frac{ρ}{2 \sqrt{3 D}})$ -JL embedding of $S$ .

We now show that there exists a matrix $A \in R^{d \times D}$ with $∥ A ∥ \leq \sqrt{3 D}$ which is a $\frac{ρ}{2}$-JL embedding of $S_{1}$ by generating a random $A$ and showing that the probability that both $∥ A ∥ \leq \sqrt{3 D}$ and $A$ is a $\frac{ρ}{2}$-JL embedding of $S_{1}$ is greater than zero.

Let $A \in R^{d \times D}$ be a random matrix whose entries are i.i.d. from a subgaussian distribution with mean $0$ and variance $\frac{1}{d}$ . Since

E ∥ A ∥_{F}^{2} = \sum_{i = 1}^{d} \sum_{j = 1}^{D} E A_{i, j}^{2} = \sum_{i = 1}^{d} \sum_{j = 1}^{D} \frac{1}{d} = D,

we have that

P {∥ A ∥_{F}^{2} \geq 3 D} \leq \frac{E ∥ A ∥_{F}^{2}}{3 D} = \frac{1}{3} .

Furthermore, since

d ≳ ρ^{- 2} log N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}}) ≳ {(\frac{ρ}{2})}^{- 2} log (3 | S_{1} |),

by Proposition 2, $A$ is a $\frac{ρ}{2}$ -JL embedding of $S_{1}$ with probability at least $1 - \frac{1}{3} = \frac{2}{3}$ . Therefore, $A$ is both a $\frac{ρ}{2}$ -JL embedding of $S_{1}$ and satisfies $∥ A ∥ \leq \sqrt{3 D}$ with probability at least $\frac{2}{3} - \frac{1}{3} = \frac{1}{3} > 0$ .

Hence, there exists a matrix $A \in R^{d \times D}$ such that $A$ is a $\frac{ρ}{2}$ -JL embedding of $S_{1}$ and satisfies $∥ A ∥ \leq \sqrt{3 D}$ . Finally, by Proposition 1, since $A$ is a $\frac{ρ}{2}$ -JL embedding of $S_{1}$ , it is also a $(\frac{ρ}{2} + ∥ A ∥ \frac{ρ}{2 \sqrt{3 D}})$ -JL embedding of $S$ . Since $∥ A ∥ \leq \sqrt{3 D}$ , we have $\frac{ρ}{2} + ∥ A ∥ \frac{ρ}{2 \sqrt{3 D}} \leq ρ$ , and thus, $A$ is a $ρ$ -JL embedding of $S$ , as desired.
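The norm part of this argument can be illustrated numerically. The sketch below assumes Gaussian entries (one particular subgaussian choice) and hypothetical sizes `d`, `D`, and `trials`; it estimates $E ∥ A ∥_{F}^{2}$, the frequency of the event $∥ A ∥_{F}^{2} \geq 3 D$ bounded by Markov's inequality, and checks that the spectral norm is dominated by the Frobenius norm, which is what turns the Frobenius bound into $∥ A ∥ \leq \sqrt{3 D}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, trials = 10, 200, 500

# Entries i.i.d. Gaussian with mean 0 and variance 1/d (an assumed subgaussian choice).
A = rng.standard_normal((trials, d, D)) / np.sqrt(d)

fro_sq = np.sum(A**2, axis=(1, 2))                 # ||A||_F^2 for each trial
spec = np.linalg.svd(A, compute_uv=False)[:, 0]    # spectral norm for each trial

mean_fro_sq = fro_sq.mean()                        # should be close to D
frac_large = np.mean(fro_sq >= 3 * D)              # Markov: at most 1/3
```

In this experiment the empirical frequency is far below the Markov bound of $\frac{1}{3}$, which is expected since Markov's inequality is not tight for such concentrated quantities; the proof only needs a positive success probability.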

b) Again, by Proposition 1, there exists a finite set $S_{1}$ with at most $| S_{1} | \leq 2 N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}})$ points such that any $\frac{ρ}{2}$ -JL embedding of $S_{1}$ is also a $(\frac{ρ}{2} + ∥ A ∥ \frac{ρ}{2 \sqrt{3 D}})$ -JL embedding of $S$ .

Let $A \in R^{d \times D}$ be a random matrix of the form $M D$ where $D \in R^{D \times D}$ is a diagonal matrix whose entries are independent Rademacher random variables, and $M \in R^{d \times D}$ is a random circulant matrix whose entries are Gaussian random variables with mean $0$ and variance $\frac{1}{d}$ and entries in different diagonals are independent. Again, we can show that

E ∥ A ∥_{F}^{2} = \sum_{i = 1}^{d} \sum_{j = 1}^{D} E A_{i, j}^{2} = \sum_{i = 1}^{d} \sum_{j = 1}^{D} E M_{i, j}^{2} D_{j, j}^{2} = \sum_{i = 1}^{d} \sum_{j = 1}^{D} E M_{i, j}^{2} = \sum_{i = 1}^{d} \sum_{j = 1}^{D} \frac{1}{d} = D,

and so,

P {∥ A ∥_{F}^{2} \geq 3 D} \leq \frac{E ∥ A ∥_{F}^{2}}{3 D} = \frac{1}{3} .

Now, set $α = log (log (4 D + 4 d)) / log (log | S_{1} |)$ so that ${log}^{α} | S_{1} | = log (4 D + 4 d)$. Then, since

d ≳ ρ^{- 2} log (4 D + 4 d) log N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}}) ≳ {(\frac{ρ}{2})}^{- 2} {log}^{1 + α} | S_{1} |,

by Proposition 3, $A$ is a $\frac{ρ}{2}$ -JL embedding of $S_{1}$ with probability at least

\frac{2}{3} (1 - (D + d) e^{- {log}^{α} | S_{1} |}) = \frac{2}{3} (1 - (D + d) e^{- log (4 D + 4 d)}) = \frac{2}{3} (1 - \frac{1}{4}) = \frac{1}{2} .

Therefore, $A$ is both a $\frac{ρ}{2}$ -JL embedding of $S_{1}$ and satisfies $∥ A ∥ \leq \sqrt{3 D}$ with probability at least $\frac{1}{2} - \frac{1}{3} = \frac{1}{6} > 0$ .

Hence, there exists a matrix $A \in R^{d \times D}$ such that $A$ is a $\frac{ρ}{2}$ -JL embedding of $S_{1}$ and satisfies $∥ A ∥ \leq \sqrt{3 D}$ . Again, by Proposition 1, since $A$ is a $\frac{ρ}{2}$ -JL embedding of $S_{1}$ , it is also a $(\frac{ρ}{2} + ∥ A ∥ \frac{ρ}{2 \sqrt{3 D}})$ -JL embedding of $S$ . Since $∥ A ∥ \leq \sqrt{3 D}$ , we have $\frac{ρ}{2} + ∥ A ∥ \frac{ρ}{2 \sqrt{3 D}} \leq ρ$ , and thus, $A$ is a $ρ$ -JL embedding of $S$ , as desired.

Appendix D Proof of Proposition 6

We first construct a function $ˆ g$ that is an $ϵ$ -approximation of $g$ . To do this, we first define a compactly supported “spike” function $ϕ : R^{d} \to [0, 1]$ by

ϕ (z) = max {1 + min {z_{1}, \dots, z_{d}, 0} - max {z_{1}, \dots, z_{d}, 0}, 0} .

Then, for any positive integer $N$ , define an approximation $ˆ g : [- M, M]^{d} \to R^{p}$ to $g$ by

ˆ g (y) := \sum_{n \in {- N, \dots, N}^{d}} g (\frac{M n}{N}) ϕ (\frac{N y}{M} - n) .

Similarly to what was done in [29], it can be shown that the scaled and shifted spike functions ${ϕ (\frac{N y}{M} - n)}_{n \in {- N, \dots, N}^{d}}$ form a partition of unity, i.e.

\sum_{n \in {- N, \dots, N}^{d}} ϕ (\frac{N y}{M} - n) = 1 for all y \in [- M, M]^{d} .

Trivially, $ϕ (y) \geq 0$ for all $y \in R^{d}$ . Also, one can check that $supp (ϕ) \subseteq [- 1, 1]^{d}$ , and thus, $ϕ (\frac{N y}{M} - n) = 0$ for all $n$ such that $∥ \frac{N y}{M} - n ∥_{\infty} > 1$ . Furthermore, for any $n$ such that $∥ \frac{N y}{M} - n ∥_{\infty} \leq 1$ , we have

{∥ ∥ g (y) - g (\frac{M n}{N}) ∥ ∥}_{2} \leq L ∥ y - \frac{M n}{N} ∥_{2} \leq L \sqrt{d} ∥ y - \frac{M n}{N} ∥_{\infty} \leq \frac{L M \sqrt{d}}{N} .

Hence, we can bound the approximation error for any $y \in [- M, M]^{d}$ as follows:

	${∥ ˆ g (y) - g (y) ∥}_{2}$	$= {∥ \sum_{n \in {- N, \dots, N}^{d}} g (\frac{M n}{N}) ϕ (\frac{N y}{M} - n) - g (y) ∥}_{2}$
		$= {∥ \sum_{n \in {- N, \dots, N}^{d}} (g (\frac{M n}{N}) - g (y)) ϕ (\frac{N y}{M} - n) ∥}_{2}$
		$\leq \sum_{n \in {- N, \dots, N}^{d}} {∥ g (\frac{M n}{N}) - g (y) ∥}_{2} ϕ (\frac{N y}{M} - n)$
		$= \sum_{n : ∥ \frac{N y}{M} - n ∥_{\infty} \leq 1} {∥ g (\frac{M n}{N}) - g (y) ∥}_{2} ϕ (\frac{N y}{M} - n)$
		$\leq \sum_{n \in {- N, \dots, N}^{d}} \frac{L M \sqrt{d}}{N} ϕ (\frac{N y}{M} - n)$
		$= \frac{L M \sqrt{d}}{N} .$

So by choosing $N = ⌈ \frac{L M \sqrt{d}}{ϵ} ⌉$ , we can obtain $∥ ˆ g (y) - g (y) ∥_{\infty} \leq ∥ ˆ g (y) - g (y) ∥_{2} \leq ϵ$ for all $y \in [- M, M]^{d}$ , i.e., $ˆ g$ is an $ϵ$ -approximation of $g$ .
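The spike function, the partition-of-unity property, and the error bound $\frac{L M \sqrt{d}}{N}$ can all be checked numerically on a small example. The following is a minimal sketch under assumed values: $d = 2$, $M = 1$, $N = 8$, and the $1$-Lipschitz test function $g (y) = ∥ y ∥_{2}$ are illustrative choices, and `spike` and `interpolate` are hypothetical helper names. It evaluates the approximation directly rather than through a ReLU network.

```python
import itertools
import numpy as np

def spike(z):
    """phi(z) = max{1 + min{z_1,...,z_d,0} - max{z_1,...,z_d,0}, 0}, applied row-wise."""
    lo = np.minimum(z.min(axis=-1), 0.0)
    hi = np.maximum(z.max(axis=-1), 0.0)
    return np.maximum(1.0 + lo - hi, 0.0)

def interpolate(g, y, N, M):
    """Return (g_hat(y), sum of spike weights) over the grid {-N, ..., N}^d."""
    d = y.shape[0]
    n = np.array(list(itertools.product(range(-N, N + 1), repeat=d)), dtype=float)
    weights = spike(N * y / M - n)                      # phi(N y / M - n) for every n
    values = np.array([g(M * row / N) for row in n])    # g(M n / N) at every grid node
    return weights @ values, weights.sum()

d, M, N, L = 2, 1.0, 8, 1.0
g = np.linalg.norm                  # the Euclidean norm is 1-Lipschitz
rng = np.random.default_rng(0)
points = rng.uniform(-M, M, size=(200, d))
results = [interpolate(g, y, N, M) for y in points]

max_err = max(abs(val - g(y)) for (val, _), y in zip(results, points))
max_pou_gap = max(abs(total - 1.0) for _, total in results)
```

On these samples the weights sum to one up to rounding, and `max_err` stays below $\frac{L M \sqrt{d}}{N} \approx 0.177$, in line with the bound derived above. The grid has $(2 N + 1)^{d}$ nodes, which is the source of the exponential dependence on $d$ in the network size.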

We now focus on constructing a ReLU NN architecture which can implement the $ϵ$ -approximation $ˆ g$ for any $L$ -Lipschitz function $g$ . We do this by first constructing a ReLU NN that is independent of $g$ which implements the map $Φ : R^{d} \to R^{(2 N + 1)^{d}}$ defined by $(Φ (y))_{n} = ϕ (\frac{N y}{M} - n)$ . Then, we add a final layer which outputs the appropriate linear combination of the $ϕ (\frac{N y}{M} - n)$ ’s.

Lemma 2.

For any integers $N, d \geq 1$ , the maps $m_{d} : R^{d} \to R^{(2 N + 1)^{d}}$ and $M_{d} : R^{d} \to R^{(2 N + 1)^{d}}$ defined by

(m_{d} (y))_{n} := min {\frac{N y_{1}}{M} - n_{1}, \dots, \frac{N y_{d}}{M} - n_{d}, 0} for n \in {- N, \dots, N}^{d}

and

(M_{d} (y))_{n} := max {\frac{N y_{1}}{M} - n_{1}, \dots, \frac{N y_{d}}{M} - n_{d}, 0} for n \in {- N, \dots, N}^{d},

can both be implemented by a ReLU NN with $O ((2 N + 1)^{d})$ weights, $O ((2 N + 1)^{d})$ nodes, and $⌈ {log}_{2} (d + 1) ⌉$ layers.

Proof.

First, we note that we can write

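	$(m_{d} (y))_{n} =$	$min {min {\frac{N y_{1}}{M} - n_{1}, \dots, \frac{N y_{⌈ d / 2 ⌉}}{M} - n_{⌈ d / 2 ⌉}},$
		$min {\frac{N y_{⌈ d / 2 ⌉ + 1}}{M} - n_{⌈ d / 2 ⌉ + 1}, \dots, \frac{N y_{d}}{M} - n_{d}, 0}}$

and

	$(M_{d} (y))_{n} =$	$max {max {\frac{N y_{1}}{M} - n_{1}, \dots, \frac{N y_{⌈ d / 2 ⌉}}{M} - n_{⌈ d / 2 ⌉}},$
		$max {\frac{N y_{⌈ d / 2 ⌉ + 1}}{M} - n_{⌈ d / 2 ⌉ + 1}, \dots, \frac{N y_{d}}{M} - n_{d}, 0}} .$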

In [2], it is shown that for any positive integer $k$ , the maps $(z_{1}, \dots, z_{k}) \mapsto min {z_{1}, \dots, z_{k}}$ and $(z_{1}, \dots, z_{k}) \mapsto max {z_{1}, \dots, z_{k}}$ can be implemented by a ReLU NN with at most $c_{1} k$ edges, $c_{2} k$ nodes, and $⌈ {log}_{2} k ⌉$ layers, where $c_{1}, c_{2} > 0$ are universal constants. So to construct the map $m_{d}$ , we first implement the $(2 N + 1)^{⌈ d / 2 ⌉}$ maps

(y_{1}, \dots, y_{⌈ d / 2 ⌉}) \mapsto min {\frac{N y_{1}}{M} - n_{1}, \dots, \frac{N y_{⌈ d / 2 ⌉}}{M} - n_{⌈ d / 2 ⌉}}

(5)

for $(n_{1}, \dots, n_{⌈ d / 2 ⌉}) \in {- N, \dots, N}^{⌈ d / 2 ⌉}$ . Implementing each of these maps requires $c_{1} ⌈ \frac{d}{2} ⌉$ edges, $c_{2} ⌈ \frac{d}{2} ⌉$ nodes, and $⌈ {log}_{2} ⌈ \frac{d}{2} ⌉ ⌉$ layers. Next, we implement the $(2 N + 1)^{⌊ d / 2 ⌋}$ maps

(y_{⌈ d / 2 ⌉ + 1}, \dots, y_{d}) \mapsto min {\frac{N y_{⌈ d / 2 ⌉ + 1}}{M} - n_{⌈ d / 2 ⌉ + 1}, \dots, \frac{N y_{d}}{M} - n_{d}, 0}

(6)

for $(n_{⌈ d / 2 ⌉ + 1}, \dots, n_{d}) \in {- N, \dots, N}^{⌊ d / 2 ⌋}$ . Implementing each of these maps requires $c_{1} (⌊ \frac{d}{2} ⌋ + 1)$ edges, $c_{2} (⌊ \frac{d}{2} ⌋ + 1)$ nodes, and $⌈ {log}_{2} (⌊ \frac{d}{2} ⌋ + 1) ⌉$ layers. After placing these $(2 N + 1)^{⌈ d / 2 ⌉} + (2 N + 1)^{⌊ d / 2 ⌋}$ maps in parallel, we construct one final layer as follows. For each $n = (n_{1}, \dots, n_{d}) \in {- N, \dots, N}^{d}$ , we combine the output of the $(n_{1}, \dots, n_{⌈ d / 2 ⌉})$ -th map of the form in Equation 5 and the output of the $(n_{⌈ d / 2 ⌉ + 1}, \dots, n_{d})$ -th map of the form in Equation 6 by using them as inputs to a ReLU NN that implements the map $(a, b) \mapsto min {a, b}$ . Each of these requires at most $2 c_{1}$ edges and $2 c_{2}$ nodes.

The total number of edges used to implement $m_{d}$ is


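		$c_{1} ⌈ \frac{d}{2} ⌉ (2 N + 1)^{⌈ d / 2 ⌉} + c_{1} (⌊ \frac{d}{2} ⌋ + 1) (2 N + 1)^{⌊ d / 2 ⌋} + 2 c_{1} (2 N + 1)^{d}$
	$\leq$	$c_{1} (⌈ \frac{d}{2} ⌉ + ⌊ \frac{d}{2} ⌋ + 1) (2 N + 1)^{⌈ d / 2 ⌉} + 2 c_{1} (2 N + 1)^{d}$
	$=$	$c_{1} (d + 1) (2 N + 1)^{⌈ d / 2 ⌉} + 2 c_{1} (2 N + 1)^{d}$
	$=$	$c_{1} ((d + 1) (2 N + 1)^{- ⌊ d / 2 ⌋} + 2) (2 N + 1)^{d}$
	$\leq$	$c_{1} ((d + 1) \cdot 3^{- ⌊ d / 2 ⌋} + 2) (2 N + 1)^{d}$
	$\leq$	$4 c_{1} (2 N + 1)^{d},$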

where we have used the fact that $N \geq 1$ by definition, and the easily verifiable inequality $(d + 1) \cdot 3^{- ⌊ d / 2 ⌋} \leq 2$ for all positive integers $d$ .
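The inequality $(d + 1) \cdot 3^{- ⌊ d / 2 ⌋} \leq 2$ can be checked mechanically for small $d$. The snippet below is only a sanity check over an assumed range $d = 1, \dots, 40$; the helper name `lhs` is hypothetical.

```python
def lhs(d):
    """(d + 1) * 3^(-floor(d / 2)) for a positive integer d."""
    return (d + 1) * 3.0 ** (-(d // 2))

values = {d: lhs(d) for d in range(1, 41)}
worst_d = max(values, key=values.get)   # the d with the largest left-hand side
```

The maximum is attained at $d = 1$, where the left-hand side equals exactly $2$. For larger $d$ the claim follows by induction in steps of two, since passing from $d$ to $d + 2$ multiplies the left-hand side by $\frac{d + 3}{3 (d + 1)} \leq 1$.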

A nearly identical calculation shows that the total number of nodes used to implement $m_{d}$ is at most $4 c_{2} (2 N + 1)^{d}$. Finally, since the $(2 N + 1)^{⌈ d / 2 ⌉}$ maps of the form in Equation 5 and the $(2 N + 1)^{⌊ d / 2 ⌋}$ maps of the form in Equation 6 are in parallel, the total number of layers used to implement $m_{d}$ is

max {⌈ {log}_{2} ⌈ \frac{d}{2} ⌉ ⌉, ⌈ {log}_{2} (⌊ \frac{d}{2} ⌋ + 1) ⌉} + 1 = ⌈ {log}_{2} (⌊ \frac{d}{2} ⌋ + 1) ⌉ + 1 = ⌈ {log}_{2} (d + 1) ⌉ .

Hence, the map $m_{d}$ can be implemented by a ReLU NN with at most $C_{1} (2 N + 1)^{d}$ edges, $C_{2} (2 N + 1)^{d}$ nodes, and $⌈ {log}_{2} (d + 1) ⌉$ layers, as desired. The proof for $M_{d}$ is identical, except with $min$ replaced by $max$ . ∎

Next, we note that

(Φ (y))_{n} = ϕ (\frac{N y}{M} - n) = max {1 + (m_{d} (y))_{n} - (M_{d} (y))_{n}, 0} for all n \in {- N, \dots, N}^{d} .

So to construct a ReLU NN which implements $Φ$, we first place a ReLU NN that implements $m_{d}$ in parallel with a ReLU NN that implements $M_{d}$. Then, we add an extra layer which has $(2 N + 1)^{d}$ nodes, where the $n$-th node of this layer has two edges, one from the $n$-th node of $m_{d}$ and one from the $n$-th node of $M_{d}$. Since $m_{d}$ and $M_{d}$ are in parallel and each can be implemented by a ReLU NN with $O ((2 N + 1)^{d})$ edges, $O ((2 N + 1)^{d})$ nodes, and $⌈ {log}_{2} (d + 1) ⌉$ layers, and the last layer has $2 (2 N + 1)^{d}$ edges and $(2 N + 1)^{d}$ nodes, the ReLU NN which implements $Φ$ has $O ((2 N + 1)^{d})$ edges, $O ((2 N + 1)^{d})$ nodes, and $⌈ {log}_{2} (d + 1) ⌉ + 1$ layers.

Finally, we can construct a ReLU NN which implements

ˆ g (x) := \sum_{n \in {- N, \dots, N}^{d}} g (\frac{M n}{N}) ϕ (\frac{N x}{M} - n)

by using the ReLU NN which implements $Φ$ , followed by a linear layer which computes the weighted sum for $ˆ g$ . This last layer has $p$ nodes, and $p (2 N + 1)^{d}$ edges. So the ReLU NN that implements $ˆ g$ has $(p + C_{1}) (2 N + 1)^{d}$ edges, $C_{2} (2 N + 1)^{d} + p$ nodes, and $⌈ {log}_{2} (d + 1) ⌉ + 2$ layers, as desired.

Appendix E Proof of Theorem 2

By combining Proposition 4a and Proposition 5, we have that there exists a $ρ$ -JL embedding $A \in R^{d \times D}$ of $S$ with

d ≳ min {ρ^{- 2} log N (U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}}), ρ^{- 2} (ω (U_{S}))^{2}} .

Let $M = {sup}_{x \in S} ∥ A x ∥_{\infty}$ so that $A (S) \subset [- M, M]^{d}$ , and so, $A$ is a $ρ$ -JL embedding of $S$ into $[- M, M]^{d}$ . By Proposition 6, there exists a ReLU NN architecture with at most

E = (p + C_{1}) {(2 ⌈ \frac{L M \sqrt{d}}{(1 - ρ) ϵ} ⌉ + 1)}^{d} edges,

N = C_{2} {(2 ⌈ \frac{L M \sqrt{d}}{(1 - ρ) ϵ} ⌉ + 1)}^{d} + p nodes,

and L = ⌈ {log}_{2} (d + 1) ⌉ + 2 layers

which can $ϵ$-approximate any $\frac{L}{1 - ρ}$-Lipschitz function $g : [- M, M]^{d} \to R^{p}$. Finally, by applying Theorem 1b, we have that there exists a ReLU NN architecture with at most

E + D d = (p + C_{1}) {(2 ⌈ \frac{L M \sqrt{d}}{(1 - ρ) ϵ} ⌉ + 1)}^{d} + D d edges,

N + D = C_{2} {(2 ⌈ \frac{L M \sqrt{d}}{(1 - ρ) ϵ} ⌉ + 1)}^{d} + p + D nodes,

and L + 1 = ⌈ {log}_{2} (d + 1) ⌉ + 3 layers

which can $ϵ$ -approximate any $L$ -Lipschitz function $f : S \to R^{p}$ , as desired.
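To get a feel for how these bounds scale, the counts can be evaluated numerically. The Python sketch below is a direct transcription of the $E + D d$, $N + D$, and $L + 1$ expressions; the function name `theorem2_network_size` is hypothetical, and since the constants $C_{1}$ and $C_{2}$ are not given explicit values in this proof, any numbers passed for them are placeholders.

```python
import math

def theorem2_network_size(p, C1, C2, L, M, d, D, rho, eps):
    # Grid resolution: ceil(L M sqrt(d) / ((1 - rho) eps)).
    n_grid = math.ceil(L * M * math.sqrt(d) / ((1.0 - rho) * eps))
    cells = (2 * n_grid + 1) ** d
    edges = (p + C1) * cells + D * d      # E + D d
    nodes = C2 * cells + p + D            # N + D
    layers = math.ceil(math.log2(d + 1)) + 3  # L + 1
    return edges, nodes, layers
```

For example, `theorem2_network_size(p=1, C1=1, C2=1, L=1.0, M=1.0, d=2, D=10, rho=0.5, eps=0.5)` uses a grid resolution of $6$, hence $13^{2} = 169$ grid points, and shows the exponential dependence on the embedding dimension $d$ rather than on the ambient dimension $D$.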

Appendix F Proof of Theorem 3

Let $f : S \to R^{p}$ be the target function to approximate. By Proposition 4b, we have that there exists a matrix $A \in R^{d \times D}$ in the form $M D$ where $M$ is a partial circulant matrix and $D$ is a diagonal matrix with $\pm 1$ entries such that $A$ is a $ρ$ -JL embedding of $S$ with

$$d ≳ ρ^{- 2} \log (4 D + 4 d) \log N \left(U_{S}, ∥ \cdot ∥_{2}, \frac{ρ}{4 \sqrt{3 D}}\right) .$$

Let $M = \sup_{x \in S} ∥ A x ∥_{\infty}$ so that $A (S) \subset [- M, M]^{d}$, and so $A$ is a $ρ$-JL embedding of $S$ into $[- M, M]^{d}$.

By Lemma 1, there exists an $\frac{L}{1 - ρ}$-Lipschitz function $g : [- M, M]^{d} \to R^{p}$ such that $g (A x) = f (x)$ for all $x \in S$. Let $g_{i} : [- M, M]^{d} \to R$ be the $i$-th coordinate of $g$. Let $\tilde{g}_{i} : [- 1, 1]^{d} \to R$ be defined by $\tilde{g}_{i} (y) = g_{i} (M y)$ for all $y \in [- 1, 1]^{d}$. Note that each $g_{i}$ is $\frac{L}{1 - ρ}$-Lipschitz, and so each $\tilde{g}_{i}$ is $\frac{L M}{1 - ρ}$-Lipschitz.
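The Lipschitz constant of the rescaled function can be checked in one line. Assuming the Lipschitz property is measured in the Euclidean norm, and writing $y'$ for a second point of $[- 1, 1]^{d}$ (a symbol introduced only for this check), we have

$$| \tilde{g}_{i} (y) - \tilde{g}_{i} (y') | = | g_{i} (M y) - g_{i} (M y') | \leq \frac{L}{1 - ρ} ∥ M y - M y' ∥_{2} = \frac{L M}{1 - ρ} ∥ y - y' ∥_{2} .$$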

Then, by Proposition 7, for each $\tilde{g}_{i}$, there exists a CNN $\tilde{g}_{i}^{(C N N)}$ with $O (N)$ residual blocks, each of which has depth $O (\log N)$ and $O (1)$ channels, and whose filter size is at most $K$, such that $∥ \tilde{g}_{i} - \tilde{g}_{i}^{(C N N)} ∥_{\infty} \leq \tilde{O} (N^{- 1 / d})$.

Now, we construct a CNN to approximate $f$ as follows. First, we implement the map $x \mapsto \frac{1}{M} A x$ using the same $4$-layer ResNet CNN described in the proof of Theorem 1c. Then, we pass the output of that ResNet CNN into $p$ parallel CNNs which implement $\tilde{g}_{i}^{(C N N)}$ for $i = 1, \dots, p$. The output of the $i$-th of these parallel CNNs is $\tilde{g}_{i}^{(C N N)} (\frac{1}{M} A x)$, which is an $\tilde{O} (N^{- 1 / d})$-approximation of $\tilde{g}_{i} (\frac{1}{M} A x) = g_{i} (A x) = f_{i} (x)$. Hence, the constructed CNN is an $\tilde{O} (N^{- 1 / d})$-approximation of $f$.

The CNN which implements the map $x \mapsto \frac{1}{M} A x$ needs $O (1)$ residual blocks, each of which has depth $O (1)$ and $O (1)$ channels. Each of the $p$ parallel CNNs which implement the $\tilde{g}_{i}^{(C N N)}$'s has $O (N)$ residual blocks, each of which has depth $O (\log N)$ and $O (1)$ channels. So the overall network to approximate $f$ has $O (p N)$ residual blocks, each of which has depth $O (\log N)$ and $O (1)$ channels.

Appendix G Proof of Proposition 9

By the $\sin Θ$ theorem [27], we have

$$\left\| \frac{x_{1}}{∥ x_{1} ∥} - \frac{x_{2}}{∥ x_{2} ∥} \right\| \leq \frac{∥ x_{1} k_{1}^{T} - x_{2} k_{2}^{T} ∥}{∥ x_{1} ∥ ∥ k_{1} ∥} \tag{7}$$

and

$$\left\| \frac{k_{1}}{∥ k_{1} ∥} - \frac{k_{2}}{∥ k_{2} ∥} \right\| \leq \frac{∥ x_{1} k_{1}^{T} - x_{2} k_{2}^{T} ∥}{∥ x_{2} ∥ ∥ k_{2} ∥} . \tag{8}$$

Let us find a set whose covering number is easy to compute and which contains the unit secant set $U_{Y}$ as a subset:

$$\begin{aligned}
\left\{ \frac{y_{1} - y_{2}}{∥ y_{1} - y_{2} ∥}, \; y_{1}, y_{2} \in Y \right\}
&= \left\{ \frac{x_{1} \otimes k_{1} - x_{2} \otimes k_{2}}{∥ x_{1} \otimes k_{1} - x_{2} \otimes k_{2} ∥}, \; x_{i} = Φ u_{i}, \; k_{i} = Ψ v_{i}, \; i = 1, 2 \right\} \\
&= \left\{ \frac{x_{1} \otimes k_{1} - \left(\frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}\right) \otimes k_{1}}{∥ x_{1} \otimes k_{1} - x_{2} \otimes k_{2} ∥} + \frac{\left(\frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}\right) \otimes k_{1} - x_{2} \otimes k_{2}}{∥ x_{1} \otimes k_{1} - x_{2} \otimes k_{2} ∥}, \; x_{i} = Φ u_{i}, \; k_{i} = Ψ v_{i}, \; i = 1, 2 \right\} \\
&\subseteq \left\{ \frac{x_{1} \otimes k_{1} - \left(\frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}\right) \otimes k_{1}}{∥ x_{1} \otimes k_{1} - x_{2} \otimes k_{2} ∥}, \; x_{i} = Φ u_{i}, \; k_{i} = Ψ v_{i}, \; i = 1, 2 \right\} \\
&\quad + \left\{ \frac{\left(\frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}\right) \otimes k_{1} - x_{2} \otimes k_{2}}{∥ x_{1} \otimes k_{1} - x_{2} \otimes k_{2} ∥}, \; x_{i} = Φ u_{i}, \; k_{i} = Ψ v_{i}, \; i = 1, 2 \right\} .
\end{aligned}$$

For the first set in the sum, by using (3) and (7), we have

$$\begin{aligned}
&\left\{ \frac{x_{1} \otimes k_{1} - \left(\frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}\right) \otimes k_{1}}{∥ x_{1} \otimes k_{1} - x_{2} \otimes k_{2} ∥}, \; x_{i} = Φ u_{i}, \; k_{i} = Ψ v_{i}, \; i = 1, 2 \right\} \\
&\quad \subseteq \left\{ t \cdot \frac{x_{1} \otimes k_{1} - \left(\frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}\right) \otimes k_{1}}{∥ x_{1} k_{1}^{T} - x_{2} k_{2}^{T} ∥}, \; t \in [0, L], \; x_{i} = Φ u_{i}, \; k_{i} = Ψ v_{i}, \; i = 1, 2 \right\} \\
&\quad \subseteq \left\{ t \cdot \frac{x_{1} \otimes k_{1} - \left(\frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}\right) \otimes k_{1}}{∥ x_{1} - \frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2} ∥ ∥ k_{1} ∥}, \; t \in [0, L], \; x_{i} = Φ u_{i}, \; k_{i} = Ψ v_{i}, \; i = 1, 2 \right\} \\
&\quad \subseteq \left\{ \left( \sqrt{t} \cdot \frac{x_{1} - \frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}}{∥ x_{1} - \frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2} ∥} \right) \otimes \left( \sqrt{t} \cdot \frac{k_{1}}{∥ k_{1} ∥} \right), \; t \in [0, L], \; x_{i} = Φ u_{i}, \; k_{i} = Ψ v_{i}, \; i = 1, 2 \right\} .
\end{aligned}$$

The covering number with $ϵ$ balls of the set $\left\{ \sqrt{t} \cdot \frac{x_{1} - \frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2}}{∥ x_{1} - \frac{∥ x_{1} ∥}{∥ x_{2} ∥} x_{2} ∥}, \; t \in [0, L] \right\}$ is at most $\left(\frac{3 \sqrt{L}}{ϵ}\right)^{n}$, and that of the set $\left\{ \sqrt{t} \cdot \frac{k_{1}}{∥ k_{1} ∥}, \; t \in [0, L] \right\}$ is at most $\left(\frac{3 \sqrt{L}}{ϵ}\right)^{m}$. So the covering number with $ϵ$ balls of $S$ is at most

$$\left(\frac{6 L}{ϵ}\right)^{n} + \left(\frac{6 L}{ϵ}\right)^{m} .$$

The same argument holds for the second set in the sum.
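The factors of the form $(3 \sqrt{L} / ϵ)^{n}$ used above are consistent with the standard volumetric estimate for covering a Euclidean ball, recalled here as a sketch. Assume the vectors in question lie in an $n$-dimensional subspace and have norm at most $R$, and write $B_{R}^{n}$ for the corresponding ball (notation introduced only for this remark). Then

$$N (B_{R}^{n}, ∥ \cdot ∥_{2}, ϵ) \leq \left(1 + \frac{2 R}{ϵ}\right)^{n} \leq \left(\frac{3 R}{ϵ}\right)^{n} \quad \text{for } ϵ \leq R ,$$

and taking $R = \sqrt{L}$ gives the stated count, under the additional assumption $ϵ \leq \sqrt{L}$.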

Appendix H Proof of Proposition 10

By definition,

$$U_{Y} = \left\{ \frac{y_{1} - y_{2}}{∥ y_{1} - y_{2} ∥}, \; y_{1}, y_{2} \in Y \right\} = \left\{ \frac{P_{Ω} (X_{1} - X_{2})}{∥ P_{Ω} (X_{1} - X_{2}) ∥}, \; X_{1}, X_{2} \in Y \right\} .$$

Since $y_{1} - y_{2} = P_{Ω} (X_{1} - X_{2})$ and $X$

$$\left\{ \frac{P_{Ω} (X_{1} - X_{2})}{∥ P_{Ω} (X_{1} - X_{2}) ∥}, \; X_{1}, X_{2} \in Y \right\} \subseteq \left\{ t \cdot P_{Ω} \frac{(X_{1} - X_{2})}{∥ X_{1} - X_{2} ∥_{F}}, \; t \in [0, L], \; X_{1}, X_{2} \in Y \right\} .$$

Notice that $\frac{(X_{1} - X_{2})}{∥ X_{1} - X_{2} ∥_{F}}$ are matrices of unit Frobenius norm with rank at most $2 r$ . By Lemma 3.1 in [5], they form a set whose covering number is at most ${(\frac{9}{δ})}^{r (m + n + 1)}$ .
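The rank and norm claims for the normalized differences can be checked numerically on a random instance. The Python sketch below is only a sanity check under assumed sizes `m`, `n`, `r` (arbitrary values chosen for illustration), with the hypothetical helper `normalized_difference`; it is not part of the proof.

```python
import numpy as np

def normalized_difference(X1, X2):
    # (X1 - X2) / ||X1 - X2||_F, a matrix of unit Frobenius norm.
    diff = X1 - X2
    return diff / np.linalg.norm(diff, "fro")

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2  # illustrative sizes, not taken from the paper

# Two random rank-r matrices, built as products of thin factors.
X1 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
X2 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

Z = normalized_difference(X1, X2)
rank_Z = np.linalg.matrix_rank(Z, tol=1e-10)  # numerical rank, at most 2r
norm_Z = np.linalg.norm(Z, "fro")             # equals 1 up to rounding
```

Since the difference of two rank-$r$ matrices is a sum of two matrices of rank at most $r$, its rank cannot exceed $2 r$, which is what `rank_Z` reflects here.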
