Visualizing high-dimensional loss landscapes with Hessian directions^†^†thanks: Submitted to the editors on August 11, 2022.

Lucas Böttcher Centre for Human and Machine Intelligence, Frankfurt School of Finance and Management, Frankfurt am Main, 60322, Germany (). l.boettcher@fs.de Gregory Wheeler Centre for Human and Machine Intelligence, Frankfurt School of Finance and Management, Frankfurt am Main, 60322, Germany (). g.wheeler@fs.de

Abstract

Analyzing geometric properties of high-dimensional loss functions, such as local curvature and the existence of other optima around a certain point in loss space, can help provide a better understanding of the interplay between neural network structure, implementation attributes, and learning performance. In this work, we combine concepts from high-dimensional probability and differential geometry to study how curvature properties in lower-dimensional loss representations depend on those in the original loss space. We show that saddle points in the original space are rarely correctly identified as such in lower-dimensional representations if random projections are used. In such projections, the expected curvature in a lower-dimensional representation is proportional to the mean curvature in the original loss space. Hence, the mean curvature in the original loss space determines if saddle points appear, on average, as either minima, maxima, or almost flat regions. We use the connection between expected curvature and mean curvature (i.e., the normalized Hessian trace) to estimate the trace of Hessians without calculating the Hessian or Hessian-vector products as in Hutchinson’s method. Because random projections are not able to correctly identify saddle information, we propose to study projections along Hessian directions that are associated with the largest and smallest principal curvatures. We connect our findings to the ongoing debate on loss landscape flatness and generalizability. Finally, we illustrate our method in numerical experiments on different image classifiers with up to about $7 \times 10^{6}$ parameters.

\newsiamremark

remarkRemark \newsiamremarkhypothesisHypothesis \newsiamthmclaimClaim \headersVisualizing high-dimensional loss landscapesL. Böttcher and G. Wheeler

igh-dimensional functions, dimensionality reduction, loss visualization, principal curvature

{AMS}

68T07, 51M99, 53Z50

1 Introduction

Deep neural networks extend the workable range of functional forms available to fit data, thereby increasing our predictive-modeling capacities. Every neural network loss function $L (θ) : R^{N} \to R$ depends on network parameters $θ \in R^{N}$ , typically for large $N$ , yielding a loss landscape that is high-dimensional and non-convex [17]. Such loss landscapes and the optimization procedures which operate on them are influenced by several factors—principally the structural properties of a neural network, such as its architecture and shape [28, 27, 44], and implementation attributes, including the choice of activation function, initialization protocol, optimizer and regularizer [25, 47, 16]. But the exact nature of these influences and their interactions remains largely unknown.

For instance, within over-parameterized regimens, where the number $N$ of model parameters far exceeds the number of training examples, deep neural networks fit random labels as easily as they fit natural labeled data [17, Lemma 3.3], but currently the only discernible difference between a deep learning algorithm tasked with encoding randomly labeled data and that same algorithm tasked with learning regular features from naturally labeled data is the time it takes to do each task [51]. Further, learned minimizers for deep neural network image classifiers that yield state-of-the-art performance on held-out test sets are not robust to adversarial examples [45]. Even so, simple gradient methods are routinely used to successfully find minimizers that yield zero or near-zero training loss [35] and good generalizability, even in over-parameterized settings [14, 22].

Analyzing geometric properties of the loss landscape, such as local curvature and the existence of additional optima around a certain point in loss space, is sometimes proposed to help improve performance and generalizability of neural networks. For example, Keskar et al. [32] characterize flatness and sharpness of loss minima via the spectrum of the underlying Hessian at a given point in loss space, and Dinh et al. [21] show that reparameterizations may render flat minima sharp without affecting generalization properties. Another approach seeks to visualize high-dimensional loss functions $L (θ)$ by a lower-dimensional (and often random) projection of two or three dimensions [23, 38, 24, 48, 29]. Extending this approach further, Horoi et al. [29] dynamically sample points in projected low-loss regions surrounding local minima during training.

But one may ask, how do the curvature properties in low-dimensional visualizations of high-dimensional functions depend on the curvature properties of the original, high-dimensional loss space? Specifically, are random two and three-dimensional loss projections able to meaningfully represent convexity and concavity properties of high-dimensional loss functions? The first goal of this paper is to address both of these questions.

One of the main obstacles to answering these questions is the exponential proliferation of saddle points in high dimensions [19] and to account for the success of gradient methods [22] in spite of numerous saddles. Results derived by Bray and Dean [12] show that there is a monotonic relation between the index of the Hessian (i.e., the number of negative eigenvalues) at a critical point and the loss value: the larger the index, the larger the corresponding loss. Bray and Dean [12] also showed that the eigenvalues of the Hessian of a Gaussian random field at a critical point are distributed according to Wigner’s semicircle law [46] up to an additional shift. As errors increase, the semicircle distribution will be shifted to the left, leading to a larger proportion of negative eigenvalues. Applying results from random matrix theory such as Wigner’s semicircle law or the Marčenko–Pastur law [41] to neural networks is based on various simplifying assumptions, including Gaussian weight, error, and data distributions [40].

Baldi and Hornik [7] analyzed the loss landscape of a linear single-layer perceptron and found that all critical points except one are saddles. So, given the abundance of saddles in high-dimensional loss landscapes, one would intuitively expect that apparently convex and concave points in a lower-dimensional projection are in fact saddles in the original, high-dimensional space. Indeed, in Section 2 we show that random lower-dimensional projections generally fail to correctly identify saddle information in the original, high-dimensional space. Nevertheless, random projections can be useful to obtain mean curvature estimates, as we demonstrate by establishing a connection between expected curvature in random loss projections and Hessian trace estimates based on Hutchinson’s method [30]. This leads to the second goal of our paper. Instead of relying on random projections to visualize loss functions, we propose in Section 3 to analyze projections along Hessian directions that are associated with the largest magnitude positive and negative principal curvatures (i.e., those directions along which the magnitude of positive and negative changes of the loss function is largest). In accordance with related works unlocking Hessian information in non-linear and high-dimensional settings [50, 40], our proposal uses Hessian-vector products (HVPs), therefore bypassing the computational expense of computing a Hessian matrix. In short, the role of random projections in our approach is to estimate mean curvature and weighted averages of the original high-dimensional Hessian components. We then demonstrate the advantages of our Hessian directions approach by comparing loss projections along Hessian directions and random directions for two image classifiers. To summarize, the main contributions of our work are threefold:

First, we identify deficiencies using random projections for visualizing high-dimensional loss functions by showing that (i) different methods for averaging will yield different answers and (ii) irrespective of the method for averaging, saddle points are generally misrepresented;
Second, we introduce a method to extract Hessian trace information using quadratic function fits to one-dimensional random loss projections;
Lastly, complementing random projection–based loss visualizations, we propose to project loss functions along dominant negative and positive Hessian directions to provide a clearer visual understanding of saddle information in high-dimensional spaces.

Finally, we conclude our paper in Section 4 with a discussion of our results. Our source code is publicly available at [10].

2 Principal curvature in random projections

To describe the connection between the principal curvature of a loss function $L (θ)$ and that associated with a lower-dimensional, random projection, we provide in Section 2.1 an overview of concepts from differential geometry [36, 9, 33] that are useful to mathematically describe curvature in high-dimensional spaces. In Section 2.2, we then show the relationship between principal curvature and random projections. Finally, in Section 2.3, we discuss different examples to aid intuition on how (i) random projections average curvature information, (ii) curvature-based Hessian trace estimates relate to those obtained with Hutchinson’s method, and (iii) Hessian directions (i.e., the eigenbasis of $H_{θ}$ ) can be used for a guided visual analysis of saddle information in high-dimensional spaces.

2.1 Differential and information geometry concepts

In differential geometry, the principal curvatures are the eigenvalues of the shape operator (or Weingarten map¹¹1Some authors distinguish between the shape operator and the Weingarten map depending on if the change of the underlying tangent vector is described in the original manifold or in Euclidean space (see, e.g., chapter 3 in [18]).), a linear endomorphism defined on the tangent space $T_{p}$ of $L$ at a point $p$ . For the original high-dimensional space, we have $(θ, L (θ)) \subseteq R^{N + 1}$ and there are $N$ principal curvatures $κ_{1}^{θ} \geq κ_{2}^{θ} \geq \dots \geq κ_{N}^{θ}$ . At a non-degenerate critical point $θ^{*}$ where $\nabla L$ vanishes, the matrix of the shape parameter is given by the Hessian $H_{θ}$ with elements $(H_{θ})_{i j} = \partial^{2} L / (\partial θ_{i} \partial θ_{j})$ ( $1 \leq i, j \leq N$ ). Some works refer to the Hessian as the “curvature matrix” [42] or use it to characterize the curvature properties of $L (θ)$ [38]. In the vicinity of a non-degenerate critical point $θ^{*}$ , the eigenvalues of the Hessian $H_{θ}$ are the principal curvatures and describe the loss function in the eigenbasis of $H_{θ}$ according to

L (θ^{*} + Δ θ) = L (θ^{*}) + \frac{1}{2} N \sum i = 1 κ_{i}^{θ} Δ θ_{i}^{2} .

(1)

The Morse lemma states that, if a critical point $θ^{*}$ of $L (θ)$ is non-degenerate, then there is a local representation of $L (θ)$ around $θ^{*}$

L (θ) = - θ_{1}^{2} - \dots - θ_{i}^{2} + θ_{i + 1}^{2} + \dots + θ_{N}^{2} + L (θ^{*}),

(2)

where the loss function $L (θ)$ is decreasing along $i$ directions and increasing along the remaining $i + 1$ to $N$ directions. Further, the index $i$ of a critical point $θ^{*}$ is the number of linearly independent decreasing dimensions of $L$ near $θ^{*}$ (i.e., the number of negative eigenvalues of the Hessian $H_{θ}$ at that point).

In the standard basis, the Hessian is

Hθ\coloneqq∇θ∇θL(θ)=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝∂2L∂θ21⋯∂2L∂θ1∂θN⋮⋱⋮∂2L∂θN∂θ1⋯∂2L∂θ2N⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

(3)

Closely related to the Hessian is the Fisher information matrix (FIM), whose components are given by second derivatives of the likelihood function, $p (x ∣ θ)$ :

F_{i, j} = - E [\frac{\partial^{2}}{\partial θ_{i} \partial θ_{j}} log p (x ∣ θ)] .

(4)

Equation (4) may be rewritten as the product of partial derivatives of the log likelihood,

F_{i, j} = E [\frac{\partial}{\partial θ_{i}} log p (x ∣ θ) \frac{\partial}{\partial θ_{j}} log p (x ∣ θ)] .

(5)

For a neural network that uses a sample $x (t)$ ( $1 \leq t \leq T$ ) as an input and outputs $f_{θ} (t) \equiv f_{θ} (x (t))$ , the empirical Fisher information matrix is

F_{θ} = \frac{1}{T} T \sum t = 1 C \sum k = 1 \nabla_{θ} f_{θ, k} (t) {\nabla_{θ} f_{θ, k} (t)}^{⊤},

(6)

where $f_{θ, k}$ is the $k$ -th entry of the neural network output [31, 3]. The empirical FIM converges towards FIM, Eq. (5), as $T \to \infty$ .

In natural gradient descent [2], the metric associated with a stochastic feedforward neural network is the FIM. For a mean-squared loss function

L (θ) = \frac{1}{2 T} T \sum t = 1 ∥ y (t) - f_{θ} (t) ∥_{2}^{2},

(7)

the Hessian $H_{θ}$ of $L (θ)$ and the empirical FIM $F_{θ}$ are connected via

H_{θ} = F_{θ} - \frac{1}{T} T \sum t = 1 C \sum k = 1 (y_{k} (t) - f_{θ, k} (t)) \nabla_{θ} \nabla_{θ} f_{θ, k} (t) .

(8)

Here, $y (t)$ denotes the label associated with sample $x (t)$ . At the global optimum, where $(y_{k} (t) - f_{θ, k} (t))$ vanishes for all $k, t$ , the empirical FIM is equal to the Hessian $H_{θ}$ [3].

2.2 Random projections

Figure 1: Dimensionality-reduced loss $L (θ^{*} + α η + β δ)$ of Eq. (28) with $n = 900, ~ n = 1000$ for different directions $η, δ$ . (a–c) The directions $η, δ$ correspond to eigenvectors of the Hessian $H_{θ}$ of Eq. (28). If the eigenvalues associated with $η, δ$ have different signs, the corresponding loss landscape is a saddle as depicted in panel (a). If the eigenvalues associated with $η, δ$ have the same sign, the corresponding loss landscape is either a minimum (both signs are positive) as shown in panel (b) or a maximum (both signs are negative) as shown in panel (c). Because there is an excess of $~ n - n = 100$ positive eigenvalues in $H_{θ}$ , a projection onto a dimension-reduced space that is spanned by the random directions $η, δ$ is often associated with an apparently convex loss landscape. An example of such an apparent minimum is shown in panel (d).

To graphically explore an $N$ -dimensional loss function $L$ around a critical point $θ^{*}$ , one may wish to work in a lower-dimensional representation. For example, a two-dimensional projection of $L$ around $θ^{*}$ is provided by

L (θ^{*} + α η + β δ),

(9)

where the parameters $α, β \in R$ scale the directions $η, δ \in R^{N}$ . The corresponding graph representation is $(α, β, L (α, β)) \subseteq R^{3}$ .

In high-dimensional spaces, there exist vastly many more almost-orthogonal than orthogonal directions. In fact, if the dimension of our space is large enough, with high probability random vectors will be sufficiently close to orthogonal [26]. Following this result, many related works [38, 48, 29] use random Gaussian directions with vector components $η_{i}, δ_{i} \sim N (0, 1)$ ( $1 \leq i \leq N$ ). Even so, note that different elements of $η, δ$ may act on different parts of the composition $L$ .

We now turn to showing that $η, δ$ are almost orthogonal using a concentration inequality for chi-squared distributed random variables. To do so, we first note that the scalar product of random Gaussian vectors $η, δ$ is a sum of the difference between two chi-squared distributed random variables because

N \sum i = 1 η_{i} δ_{i} = N \sum i = 1 \frac{1}{4} (η_{i} + δ_{i})^{2} - \frac{1}{4} (η_{i} - δ_{i})^{2} = N \sum i = 1 \frac{1}{4} X_{i}^{2} - \frac{1}{4} Y_{i}^{2},

(10)

where $X_{i}, Y_{i} \sim N (0, 1)$ . Since $E [Z] = N$ for $Z = \sum_{i = 1}^{N} X_{i}^{2}$ , the difference between the sum over the difference between $X_{i}^{2}$ and $Y_{i}^{2}$ vanishes as $N \to \infty$ . That is, in the limit $N \to \infty$ , the scalar product (10) vanishes and the vectors $η, δ$ are orthogonal. For finite $N$ , we can bound Eq. (10) using the concentration inequalities

	$Pr [\frac{1}{N} N \sum i = 1 X_{i}^{2} - 1 \geq ϵ + \frac{1}{2} ϵ^{2}]$	$\leq e^{- \frac{N ϵ^{2}}{4}}$		(11)
	$Pr [1 - \frac{1}{N} N \sum i = 1 X_{i}^{2} \geq ϵ]$	$\leq e^{- \frac{N ϵ^{2}}{4}}$		(12)

for an $ϵ > 0$ [34]. The above inequalities can be cast in the following form

Pr [∣ ∣ ∣ ∣ \frac{1}{N} N \sum i = 1 X_{i}^{2} - 1 ∣ ∣ ∣ ∣ \geq ϵ]

\leq 2 e^{- \frac{N ϵ^{2}}{4}} .

(13)

Using $| N^{- 1} \sum_{i} η_{i} δ_{i} | = | (4 N)^{- 1} \sum_{i} X_{i}^{2} - 1 + 1 - Y_{i}^{2} |$ , we obtain

Pr [∣ ∣ ∣ ∣ \frac{1}{N} N \sum i = 1 η_{i} δ_{i} ∣ ∣ ∣ ∣ \geq ϵ] \leq \sqrt{2} e^{- 2 N ϵ^{2}}

(14)

in the limit of large $N$ . For further details on the derivation of Eq. (14), see Appendix A.

With the form of random Gaussian projections in hand, we now analyze the principal curvatures in both the original and lower-dimensional spaces. The Hessian associated with the two-dimensional loss projection (9) is

(15)

where we use Einstein notation in the last equality. Because the elements of $δ, η$ are distributed according to a standard normal distribution, the second derivatives associated with the composition $L$ have prefactors that are products of standard normal variables and, hence, can be expressed as sums of chi-squared distributed random variables as in Eq. (10). Elements of $H_{α, β}$ are sums of second derivatives in the original space weighted with chi-squared distributed prefactors.

The principal curvatures $κ_{\pm}^{α, β}$ (i.e., the eigenvalues of $H_{α, β}$ ) are

κ_{\pm}^{α, β} = \frac{1}{2} (A + C \pm \sqrt{4 B^{2} + (A - C)^{2}}),

(16)

where $A = (H_{θ})_{i j} η^{i} η^{j}$ , $B = (H_{θ})_{i j} η^{i} δ^{j}$ , and $C = (H_{θ})_{i j} δ^{i} δ^{j}$ . To the best of our knowledge, a closed, analytic expression for the distribution of the quantities $A, B, C$ is not yet known [20, 34, 8, 15].

The appeal of random projections is that pairwise distances between points in a high-dimensional space can be nearly preserved by a lower-dimensional linear embedding, affording a low-dimensional representation of expectation values and variance information with minimal distortion. The question is whether random projections also preserve curvature information, and if so, the nature of the relationship between random Gaussian directions and principal curvature. For instance, Li et al. [38] assert that “the principal curvatures of a dimensionality-reduced plot (with random Gaussian directions) are weighted averages of the principal curvatures of the full-dimensional surface”. However, our results show that the principal curvatures $κ_{\pm}^{α, β}$ in a two-dimensional loss projection are weighted averages of the Hessian components $(H_{θ})_{i j}$ in the original space and not weighted averages of the principal curvatures $κ_{i}^{θ}$ , which instead are solutions of an $N$ -th degree polynomial. Similar arguments apply to projections with dimension larger than 2.

In Section 2.3 we discuss examples that show lower-dimensional projections of $L (θ)$ can be misleading since high-dimensional saddle points may appear to be minima, maxima, or almost flat regions depending on the index of the underlying Hessian $H_{θ}$ in the original space (see Figure 1).

Returning to principal curvature, since $\sum_{i, j} a_{i j} η^{i} η^{j} = \sum_{i} a_{i i} η^{i} η^{i} + \sum_{i \neq j} a_{i j} η^{i} η^{j}$ , in accordance with Eq. (10) we find that $E [A] = E [C] = {(H_{θ})^{i}}_{i}$ and $E [B] = 0$ . Hence, the expected, dimension-reduced Hessian is

E [H_{α, β}] = (\begin{matrix} {(H_{θ})^{i}}_{i} & 0 0 & {(H_{θ})^{i}}_{i} \end{matrix}),

(17)

where ${(H_{θ})^{i}}_{i} \equiv t r (H_{θ}) = \sum_{i = 1}^{N} κ_{i}^{θ}$ . The corresponding eigenvalue (or principal curvature) in dimension-reduced space, ${¯ κ}^{α, β}$ , is therefore given by the sum over all principal curvatures in the original space (i.e., ${¯ κ}^{α, β} = \sum_{i = 1}^{N} κ_{i}^{θ}$ ). Invoking Eq. (16), we can establish an additional connection between ${¯ κ}^{α, β}$ , $t r (H_{θ})$ , and $κ_{\pm}^{α, β}$ . Because $κ_{+}^{α, β} + κ_{-}^{α, β} = A + C$ , we have

t r (H_{θ}) = \frac{1}{2} (E [κ_{-}^{α, β}] + E [κ_{+}^{α, β}])

(18)

and

(19)

The mean curvature $H$ in the original space is related to ${¯ κ}^{α, β}$ via

H = \frac{1}{N} {¯ κ}^{α, β} = \frac{1}{N} N \sum i = 1 κ_{i}^{θ} .

(20)

In analogy with Eq. (8), the dimension-reduced Hessian $H_{α, β}$ is connected to the corresponding FIM $F_{α, β}$ according to

H_{α, β} = F_{α, β} - \frac{1}{T} T \sum t = 1 C \sum k = 1 (y_{k} (t) - f_{(α, β), k} (t)) \nabla_{(α, β)} \nabla_{(α, β)} f_{(α, β), k} (t),

(21)

where $f_{(α, β), k} (t)$ is the $k$ -th entry of the output of a neural network with parameters $θ^{*} + α η + β δ$ . The operator $\nabla_{(α, β)}$ denotes the gradient w.r.t. parameters $α, β$ .

The trace (or unnormalized mean curvature) of the Hessian $H_{θ}$ has already found applications in characterizing loss landscapes of neural networks [49, 50]. A common way of estimating $t r (H_{θ})$ without explicitly computing all eigenvalues of $H_{θ}$ is based on Hutchinson’s method [30] and random numerical linear algebra [6, 5]. The basic idea is to use a random vector $z \in R^{N}$ with components $z_{i}$ that are distributed according to a distribution with zero mean and unit variance (e.g., a Rademacher distribution with $Pr (z_{i} = \pm 1) = 1 / 2$ ), and compute $z^{⊤} H_{θ} z$ , an unbiased estimator of $t r (H_{θ})$ . That is,

t r (H_{θ}) = E [z^{⊤} H_{θ} z] .

(22)

Equation (19) shows that the principal curvature of an averaged random loss projection, ${¯ κ}^{α, β}$ , is equal to $t r (H_{θ})$ . Instead of calculating the Hessian and estimating $t r (H_{θ})$ using Hutchinson’s method (22), an alternative unbiased estimate is provided by the mean of the expected values of $κ_{-}^{α, β}$ and $κ_{+}^{α, β}$ [see Eq. (18)].

2.3 Examples

In this section, we summarize the main concepts derived in Section 2.2 in terms of different examples. To do so, we focus on three specific aspects associated with the study of high-dimensional loss functions. First, we will study how random projections average curvature information. Second, we will compare curvature–based Hessian trace estimates with those obtained using Hutchinson’s method. Third, we will show how one can use Hessian directions (i.e., the eigenbasis of $H_{θ}$ ) to visually analyze saddle information in high-dimensional spaces.

2.3.1 Extracting curvature information

Figure 2: Convergence of the ensemble average (25) of (a,c) Hessian components $(H_{α, β})_{i j}$ and (b,d) principal curvatures, all as a function of the number of sampled random Gaussian vectors $S$ . In panels (a,b) and (c,d), the loss functions are given by Eqs. (23) and (28), respectively. In loss function (28), we set $n = 500$ and $~ n = 800$ . Dashed grey lines panels (c,d) represent ${¯ κ}^{α, β} = t r (H_{θ})$ .

We now study two examples that will help build intuition for the concepts discussed in the previous sections. In the first example, we study a critical point for which (i) all principal curvatures have the same magnitude and (ii) the number of positive curvature directions is equal to the number of negative curvature directions. The mean curvature of this saddle point is zero. We will show that the mean curvature can be captured by the normalized eigenvalue ${¯ κ}^{α, β}$ of the expected, dimension-reduced Hessian (17) and that the principal curvatures $κ_{\pm}^{α, β}$ are able to correctly identify the saddle point in the majority of simulated realizations. In the second example, we use a loss function associated with an unequal number of negative and positive curvature directions. Regardless of the averaging process used to compute principal curvatures in the dimension-reduced space, we find that random projections cannot identify the underlying saddle point.

The loss function of our first example is

L (θ) = \frac{1}{2} θ_{2 n + 1} (n \sum i = 1 θ_{i}^{2} - θ_{i + n}^{2}), n \in Z_{+},

(23)

where we set $N = 2 n + 1$ . The Hessian at the critical point $(θ_{1}, \dots, θ_{2 n}, θ_{2 n + 1}) = (0, \dots, 0, 1)$ is

H_{θ} = d i a g (1, \dots, 1      n t i m e s, - 1, \dots, - 1      n t i m e s, 0) .

(24)

Because $H_{θ}$ is neither positive nor negative definite, the critical point is a saddle with principal curvatures $κ_{i}^{θ} \in {- 1, 1}$ ( $1 \leq i \leq N$ ). In this example, the mean curvature, $H$ , as defined in Eq. (20), is equal to 0. According to Eq. (17), the principal curvature, ${¯ κ}^{α, β}$ , associated with the expected, dimension-reduced Hessian $H_{α, β}$ is also equal to 0, erroneously indicating an apparently flat loss landscape. We will now study the ensemble average

⟨ X ⟩ = \frac{1}{S} S \sum k = 1 X^{(k)}

(25)

of different quantities of interest, $X$ . Here, $X^{(k)}$ is the $k$ -th realization (or sample) of $X$ .

Figure 2(a) shows the convergence of the ensemble average $⟨ (H_{α, β})_{i j} ⟩$ towards the expected value $E [(H_{α, β})_{i j}]$ as a function of the number of sampled random Gaussian vectors $S$ . Note that $E [(H_{α, β})_{i j}] = 0$ for all $i, j$ . The solid black and red lines in Fig. 2(b), respectively, show the ensemble averages $⟨ κ_{\pm}^{α, β} ⟩$ and

⟨ {~ κ}_{\pm}^{α, β} ⟩ = \frac{1}{2} (⟨ A ⟩ + ⟨ C ⟩ \pm \sqrt{4 ⟨ B ⟩^{2} + (⟨ A ⟩ - ⟨ C ⟩)^{2}})

(26)

as a function of samples $S$ . Since $E [A] = E [C] = {(H_{θ})^{i}}_{i}$ and $E [B] = 0$ [see Eq. (17)], we have that $⟨ {~ κ}_{\pm}^{α, β} ⟩ = {¯ κ}^{α, β}$ in the limit $S \to \infty$ . In the current example, the ensemble average $⟨ {~ κ}_{\pm}^{α, β} ⟩$ thus approaches ${¯ κ}^{α, β} = 0$ for large numbers of samples $S$ , represented by the dashed red lines in Figure 2(b). The distribution of the principal curvatures $κ_{\pm}^{α, β}$ is shown in Figure 3(a). We observe that the distributions are plausibly Gaussian. Using a corresponding Gaussian approximation for both distributions, we calculate the probability

Pr (κ_{+}^{α, β} κ_{-}^{α, β} > 0) = Pr (κ_{+}^{α, β} > 0) Pr (κ_{-}^{α, β} > 0) + Pr (κ_{+}^{α, β} < 0) Pr (κ_{-}^{α, β} < 0)

(27)

that the critical point in the lower-dimensional, random projection does not appear as a saddle (i.e., $κ_{+}^{α, β} κ_{-}^{α, β} > 0$ ). For the example shown in Figure 3(a), we find that $Pr (κ_{+}^{α, β} κ_{-}^{α, β} > 0) \approx 0.25$ . That means that in about 25% of the simulated realizations, the lower-dimensional loss landscape wrongly indicates that it does not correspond to a saddle.

Our first example shows that the principal curvatures in the lower-dimensional representation of $L (θ)$ may capture the saddle behavior in the original space if one averages over $κ_{\pm}^{α, β}$ in the lower dimensional space. However, if one first calculates ensemble averages of the dimension-reduced Hessian $H_{α, β}$ to infer $⟨ {~ κ}_{\pm}^{α, β} ⟩$ , the loss landscape appears to be flat in this example. We thus conclude that different ways of averaging (either before or after calculating the principal curvatures) may lead to different results with respect to the flatness of a dimension-reduced loss landscape.

Figure 3: Distribution of principal curvatures $κ_{+}^{α, β}$ (black bars) and $κ_{-}^{α, β}$ (red bars). In panels (a) and (b), the loss functions are given by Eqs. (23) and (28), respectively. In loss function (28), we set $n = 500$ and $~ n = 800$ . Histograms are based on 20,000 realizations of $κ_{\pm}^{α, β}$ .

In the next example, we will show that random projections cannot identify certain saddle points regardless of the underlying averaging process. We use the loss function

L (θ) = \frac{1}{2} θ_{2 n + 1} (~ n \sum i = 1 θ_{i}^{2} - 2 n \sum i = ~ n + 1 θ_{i}^{2}), n \in Z_{+}, ~ n > n .

(28)

The Hessian at the critical point $(θ_{1}, \dots, θ_{2 n}, θ_{2 n + 1}) = (0, \dots, 0, 1)$ is

H_{θ} = d i a g (1, \dots, 1      ~ n t i m e s, - 1, \dots, - 1      n t i m e s, 0) .

(29)

The critical point is again a saddle, but the mean curvature is $H = N^{- 1} (~ n - n) > 0$ . In the following numerical experiments, we set $n = 500$ and $~ n = 800$ . Figure 2(c) shows that the ensemble averages $⟨ (H_{α, β})_{i j} ⟩$ converge towards the expected value $E [(H_{α, β})_{i j}]$ as the number of samples increases. Because of the dominance of positive principal curvatures $κ_{i}^{θ}$ in the original space, the corresponding ensemble averages of principal curvatures (i.e., $⟨ κ_{\pm}^{α, β} ⟩$ , $⟨ {~ κ}_{\pm}^{α, β} ⟩$ ) in the lower-dimensional representation approach positive values [Figure 2(d)]. The distribution of $κ_{\pm}^{α, β}$ indicates that the probability of observing a saddle in the lower-dimensional loss landscape is vanishingly small [Figure 3(b)]. In this second example both ways of averaging, before and after calculating the lower-dimensional principal curvatures, suggest that the saddle in the original space is a minimum in dimension-reduced space.

The two examples that we discussed above can be summarized as follows.

Saddle points in the original space $L (θ)$ are rarely identified as such in lower-dimensional representations $L (θ + α η + β δ)$ if random directions are used. Depending on (i) the method of averaging curvature information and (ii) the index of the underlying Hessian $H_{θ}$ in the original space, saddle points will often appear erroneously as either minima, maxima, or almost flat regions.
The mean curvature $H = {¯ κ}^{α, β} / N$ correctly captures convex and concave points (i.e., those associated with positive definite and negative definite Hessians $H_{θ}$ ). However, such points are scarce in high-dimensional loss spaces [19, 7].
Because of the confounding factors associated with the inability of random projections to correctly identify saddle information, it does not appear advisable to use “flatness” around a critical point in a lower-dimensional random loss projection as a measure of generalization error [38].

2.3.2 Hessian trace

Figure 4: Estimating the trace of the Hessian $H_{θ}$ . In panels (a) and (b), the loss functions are given by Eqs. (23) and (28), respectively. Dashed grey lines indicate the true trace values. Solid black and dash-dotted red lines represent Hutchinson [ $⟨ z^{⊤} H_{θ} z ⟩$ ; see Eq. (22)] and curvature-based ( $κ^{α}$ ) estimates of $t r (H_{θ})$ , respectively. In both methods, the same random vectors with components that are distributed according to a standard normal distribution are used. For the curvature-based estimation of $t r (H_{θ})$ , we perform least-square fits of $L (θ^{*} + α η)$ over an interval $α \in [- 0.05, 0.05]$ .

We now use the loss functions (23) and (28) to compare the convergence behavior between Hutchinson’s method (22) and the curvature-based trace estimation (18). Instead of two random directions, we use one random Gaussian direction $η$ and perform quadratic least-square fits over an interval $α \in [- 0.05, 0.05]$ to extract estimates of $t r (H_{θ})$ from $L (θ^{*} + α η)$ . We use the same random Gaussian directions in Hutchinson’s method.

Figure 4 shows how the Hutchinson and curvature-based trace estimates converge towards the true trace values, 0 for the loss function (23) and 600 for the loss function (28) with $n = 500$ and $~ n = 800$ . Given that we use the same random vectors in both methods, their convergence behavior towards the true trace value is similar. However, the curvature-based method does not require to calculate Hessian-vector products (HVPs) of high-dimensional functions. It may thus provide a computationally efficient alternative to Hutchinson’s method in estimating the trace of high-dimensional Hessians.

2.3.3 Hessian directions

Given the described shortcomings of random projection–based curvature information, we suggest to use Hessian directions (i.e., the eigenbasis of $H_{θ}$ ) as directions $η, δ$ in $L (θ^{*} + α η + β δ)$ . For Eq. (28) with $n = 900, ~ n = 1000$ , we show projections along different Hessian directions in Figure 1(a–c). We observe that different Hessian directions indicate different types of critical points in dimension-reduced space. If the eigenvalues associated with the Hessian directions $η, δ$ have different signs, the corresponding lower-dimensional loss landscape is a saddle [Figure 1(a)]. If both eigenvalues have the same sign, the loss landscape is either a minimum [Figure 1(b): both signs are positive] or it is a maximum [Figure 1(c): both signs are negative]. If one uses a random projection instead, the resulting lower-dimensional loss landscape often appears to be a minimum in this example [Figure 1(d)]. To quantify the proportion of random projections that correctly identify the saddle with $n = 900, ~ n = 1000$ , we generated 10,000 realizations of $κ_{\pm}^{α, β}$ . The signs of $κ_{\pm}^{α, β}$ were different in only about 0.5% of all simulated realizations.

In the next section, we will show that Hessian directions that are associated with the largest magnitude positive and negative eigenvalues of $H_{θ}$ are useful to appropriately identify saddle information and visually study the geometric properties of saddle points in high-dimensional loss spaces of image classifiers.

3 Applications to neural networks

Figure 5: Loss landscape projections for ResNet-56. (a,c) The projection directions $η, δ$ are given by the eigenvectors associated with the largest and smallest eigenvalues of the Hessian $H_{θ}$ , respectively. The zoomed inset in panel (c) shows the loss landscape for $(α, β) \in [- 0.01, 0.005] \times [- 0.05, 0.05]$ . We observe a decreasing loss along the negative $β$ -axis. (b,d) The projection directions $η, δ$ are given by random vectors. The domains of $(α, β)$ in panels (a,b) and (c,d) are $[- 0.05, 0.05] \times [- 0.05, 0.05]$ and $[- 1, 1] \times [- 1, 1]$ , respectively.

We now compare projections $L (θ^{*} + α η + β δ)$ using (i) dominant Hessian directions (i.e., eigenvectors that are associated with the largest magnitude positive and negative eigenvalues of $H_{θ}$ ) and (ii) random directions. Hessian directions were computed using HVPs without an explicit representation of $H_{θ}$ (see Algorithm 1). We first compute the largest magnitude eigenvalue and then use an annihilation method [13] to compute the second largest magnitude eigenvalue of opposite sign. More details on the annihilation algorithm are provided in Appendix B. Other deflation techniques can be used to find additional Hessian directions.

Instead of explicitly representing the Hessian $H_{θ}$ , Algorithm 1 evaluates products between $H_{θ}$ and a vector $v \in R^{N}$ using the identity

\nabla_{θ} [(\nabla_{θ} L)^{⊤} v] = (\nabla_{θ} \nabla_{θ} L) v + (\nabla_{θ} L)^{⊤} \nabla_{θ} v = H_{θ} v .

(30)

In the first step, the gradient $(\nabla_{θ} L)^{⊤}$ is computed using backpropagation to then compute the scalar product $[(\nabla_{θ} L)^{⊤} v]$ . The second step is based on again applying backpropagation to the computational graph associated with the scalar product $[(\nabla_{θ} L)^{⊤} v]$ . Because the vector $v$ does not depend on $θ$ (i.e., $\nabla_{θ} v = 0$ ), the result is $H_{θ} v$ .

We used the LinearOperator module that is available in the Python package scipy to represent HVPs. As an alternative to constructing differentiable computational graphs and using automatic differentiation methods for evaluating HVPs, the torch.autograd.functional module already provides a corresponding built-in function called hvp. Moreover, the Python package PyHessian [50] can be also used to evaluate Hessians of high-dimensional functions in a distributed manner. For the computation of dominant Hessian directions, we used the scipy function eigsh that is based on the implicitly restarted Lanczos method [37].

L_{1} = L i n e a r O p e r a t o r ((N, N), m a t v e c = h v p)

\hfill

▹

initialize linear operator for HVP calculation

2: eigval1, eigvec1 = solve_lm_evp(

L_{1}

)\hfill

▹

compute largest magnitude eigenvalue and corresponding eigenvector associated with operator

L_{1}

3: shifted_hvp(vec) = hvp(vec) - eigval1*vec\hfill

▹

define shifted HVP

L_{2} = L i n e a r O p e r a t o r ((N, N), m a t v e c = s h i f t e d_h v p)

\hfill

▹

initialize linear operator for shifted HVP calculation

5: eigval2, eigvec2 = solve_lm_evp(

L_{2}

)\hfill

▹

compute largest magnitude eigenvalue and corresponding eigenvector associated with operator

L_{2}

6: eigval2 += eigval1

7: if eigval1

>=

0 then

8: maxeigval, maxeigvec = eigval1, eigvec1

9: mineigval, mineigvec = eigval2, eigvec2

10: else

11: maxeigval, maxeigvec = eigval2, eigvec2

12: mineigval, mineigvec = eigval1, eigvec1

13: return maxeigval, maxeigvec, mineigval, mineigvec

Algorithm 1 Compute dominant Hessian directions

We first focus on a ResNet-56 architecture that has been trained on CIFAR-10 data with SGD. The number of parameters is 855,770. In accordance with Li et al. [38], we apply filter normalization to random directions. Hessian directions are already normalized, so we do not apply any additional normalization.

Figure 5 shows the two-dimensional projections of the loss function around a local critical point. The smallest and largest eigenvalues are $- 16.4$ and $5007.9$ , respectively. This means that the found critical point is a saddle with a negative curvature that is more than two orders of magnitude smaller than the positive curvature at that point. The saddle point is clearly visible in Figure 5(a,c). We observe in the zoomed inset in Figure 5(c) that the loss decreases along the negative $β$ -axis.

If random directions are used, the corresponding projections indicate that the optimizer converged to a local minimum and not to a saddle point [Figure 5(b,d)]. Overall, the ResNet-56 visualizations that we show in Figure 5 exhibit structural similarities to those that we generated using the simple loss model (28) [Figure 1(d)].

As a second neural-network structure, we consider DenseNet-121 that has also been trained on CIFAR-10 data using SGD. In this neural network, the number of parameters is 6,956,298. Figure 6 shows the projections $L (θ^{*} + α η + β δ)$ of the loss function around a local optimum. We again observe the typical saddle shape of $L$ if dominant Hessian directions are used and an apparent local minimum for projections along random directions. One should keep in mind that the principal curvatures in the random direction plot are given by $κ_{\pm}^{α, β}$ [see Eq. (16)]. At a critical point, the principal curvatures in dimension-reduced, random-projection space are, on average, equal to the sum of the principal curvatures in the original space. If only very few, small-magnitude negative principal curvatures are present in the original space, random projections are always associated with an apparent local minimum in dimension-reduced, random-projection space, as shown by numerical experiments with ResNet-56 and DenseNet-121.

Figure 6: Loss landscape projections for DenseNet-121. (a,c) The projection directions $η, δ$ are given by the eigenvectors associated with the largest and smallest eigenvalues of the Hessian $H_{θ}$ , respectively. (b,d) The projection directions $η, δ$ are given by random vectors. The domains of $(α, β)$ in panels (a,b) and (c,d) are $[- 0.4, 0.4] \times [- 0.4, 0.4]$ and $[- 1, 1] \times [- 1, 1]$ , respectively.

Analyzing the Hessian $H_{θ}$ of DenseNet-121 around its local optimum, we find that the minimum and maximum principal curvatures are -109.1 and 1937.5, respectively. In comparison to ResNet-56, the relative difference between these two curvatures is substantially smaller. In particular the largest negative principal curvature in DenseNet-121 is almost one order of magnitude larger than that in ResNet-56, indicating that more substantial loss improvements are possible by changing the neural-network parameters $θ$ along the negative curvature direction.

Figure 7: Loss heatmaps along random and dominant Hessian directions. (a–c) Loss heatmaps for DenseNet-121 [(a,c): training data, (c,d): test data]. Random directions are used in panels (a) and (b) while Hessian directions are used in panels (c) and (d). (e–h) Loss heatmaps for ResNet-56 [(e,g): training data, (f,h): test data]. Random directions are used in panels (e) and (f) while Hessian directions are used in panels (g) and (h). Green and red regions indicate small and large loss values, respectively.

In Figure 7, we compare changes in the projected loss along random and dominant Hessian directions for DenseNet-121 and ResNet-56. In addition to evaluating the loss on training data (50,000 images), we also provide a comparison to loss changes in the test dataset (10,000 images). Black crosses in Figure 7 indicate minimum loss points. For both neural networks, we observe that random projections are associated with loss values that increase along both directions $δ, η$ and for both training and test data [Figure 7(a,b,e,f)]. For these projections, the loss minimum is very close to the origin of the loss space. Different observations can be made if Hessian directions are employed in the loss projections. Figure 6(c) shows that the loss minimum is not located at the origin but at $(α, β) \approx (0, 0.2)$ . A value of $β \approx 0.2$ means that one has to move along the negative curvature direction to find a loss function value that is smaller than the one at the origin. We find that the value of the loss at that point is more than one order of magnitude smaller than the loss at the origin or the smallest loss found in a random projection plot. Similar observations can be made in the test dataset. However, the smallest loss value found in the Hessian direction plot is associated with a large loss in the test dataset [Figure 6(d)]. Using Hessian directions to improve training and test performance requires one to balance potential performance improvements in the test and training datasets. Because of the relatively small negative principal curvature of ResNet-56, one can only achieve small loss improvements along the corresponding Hessian direction and differences in minimum losses between random and Hessian projections are less pronounced [Figure 6(e–h)].

4 Discussion and conclusion

Given that a global understanding of high-dimensional loss landscapes remains still very limited [17], one and two-dimensional projections of such functions are a common tool to study the geometric properties of loss regions nearby certain optima and along optimizer trajectories [23, 38, 24, 48, 29]. For an informed way of projecting high-dimensional functions to lower-dimensional representations, it is important to mathematically interpret the geometric information contained in such visualizations. To put in the words of Ker-Chau Li [39]: “There are too many directions to project a high-dimensional data set and unguided plotting can be time-consuming and fruitless.” Therefore, it is important to understand how projection directions $η, δ$ affect the resulting loss visualizations. A standard choice in the literature is to use random Gaussian directions to obtain lower-dimensional loss representations $L (θ^{*} + α η + β δ)$ . Combining concepts from high-dimensional probability and differential geometry, we show that saddle points in the original space may not be identified as such in lower-dimensional representations if such random projections are used.

As formalized in Eq. (20), the expected principal curvature ${¯ κ}^{α, β} = t r (H_{θ})$ , obtained by averaging over Hessian components in a dimension-reduced space according to Eq. (26), is proportional to the mean curvature $H = t r (H_{θ}) / N$ in the original loss space. The connection between ${¯ κ}^{α, β}$ and $t r (H_{θ})$ shows that Hessian trace estimates can be computed directly from curvature information in lower-dimensional random projections without explicitly calculating Hessians or Hessian-vector products as in Hutchinson’s method [30].

Depending on the value of the mean curvature in the original loss space, saddle points appear, on average, as either minima ( $H > 0$ ), maxima ( $H < 0$ ), or almost flat regions ( $H \approx 0$ ). Instead of averaging over Hessian components, one may also average directly over principal curvatures. Our numerical experiments show that both ways of averaging are associated with different geometric interpretations and that random projections are generally not able to correctly identify saddle points. We therefore propose to study projections along dominant Hessian directions that are associated with the largest magnitude positive and negative principal curvatures in the original space. Similar methods have been developed and applied in related work on exploratory data analysis [39]. Our work may be viewed as part of a recent “Hessian turn” in machine learning characterized by computationally tractable methods for unlocking Hessian information in non-linear and high-dimensional settings [50, 40]. Projecting high-dimensional loss functions along Hessian directions provides a genuine way of illustrating their geometric properties around certain minima. Negative curvature directions may be used as a starting point for further improving the performance of a given neural network. Analyzing principal curvatures (or eigenvalues of $H_{θ}$ ) at critical points $θ^{*}$ is also useful to identify global minima. It has been shown by Cooper [17] that for any global minimum of $L$ , the corresponding Hessian $H_{θ}$ has $r n$ positive eigenvalues, $N - r n$ eigenvalues equal to 0, and no negative eigenvalues. Here, $n$ and $r$ are the number of input samples (i.e., data points) and the output dimension, respectively.

Our numerical experiments on two commonly used image classifiers, ResNet-56 and DenseNet-121, are in accordance with our reasoning and identify marked differences in the projections that are based on Hessian and random directions. Our results are also connected to the ongoing debate on the connection between flatness of a certain loss region and generalizability (i.e., the ability to effectively use trained neural networks on unseen data). Dinh et al. [21] provided a detailed discussion on different mechanisms (e.g., reparameterization of flat regions) that indicate that sharp minima can achieve good generalization properties. Their work also highlighted the importance of clear definitions of flatness. Our work shows that flatness in randomly projected loss landscapes might be just apparent and a consequence of (large) principal curvatures of opposite sign that cancel out each other.

There are many interesting and worthwhile directions for future research building on the work reported here. Our results suggest that negative curvature directions are useful to achieve substantial loss reductions, even after completed training. More research on learning algorithms that minimize loss functions along negative curvature directions [1] can contribute to the development of more effective optimizers. Such guided optimization algorithms may be of particular use in applications (e.g., neural network–based control [4, 11]) where one main goal is to approximate an unknown function as closely as possible (i.e., where one does not need to distinguish between training and test error). In applications where one wishes to reduce generalization error, the proposed visualization methods may be also useful to provide more insights into curvature regularization [43].

References

[1] G. Alain, N. L. Roux, and P.-A. Manzagol, Negative eigenvalues of the hessian in deep neural networks, arXiv preprint arXiv:1902.02366, (2019).
[2] S.-I. Amari, Natural gradient works efficiently in learning, Neural Computation, 10 (1998), pp. 251–276.
[3] S.-i. Amari, H. Park, and K. Fukumizu, Adaptive method of realizing natural gradient learning for multilayer perceptrons, Neural Computation, 12 (2000), pp. 1399–1409.
[4] T. Asikis, L. Böttcher, and N. Antulov-Fantulin, Neural ordinary differential equation control of dynamics on graphs, Physical Review Research, 4 (2022), p. 013221.
[5] H. Avron and S. Toledo, Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix, Journal of the ACM (JACM), 58 (2011), pp. 1–34.
[6] Z. Bai, G. Fahey, and G. Golub, Some large-scale matrix computation problems, Journal of Computational and Applied Mathematics, 74 (1996), pp. 71–89.
[7] P. Baldi and K. Hornik, Neural networks and principal component analysis: Learning from examples without local minima, Neural Networks, 2 (1989), pp. 53–58.
[8] J. Bausch, On the efficient calculation of a linear combination of chi-square random variables with an application in counting string vacua, Journal of Physics A: Mathematical and Theoretical, 46 (2013), p. 505202.
[9] M. Berger and B. Gostiaux, Differential Geometry: Manifolds, Curves, and Surfaces, vol. 115, Springer Science & Business Media, 2012.
[10] L. Böttcher, GitLab repository. https://gitlab.com/ComputationalScience/loss-visualization, 2022.
[11] L. Böttcher, N. Antulov-Fantulin, and T. Asikis, AI Pontryagin or how artificial neural networks learn to control dynamical systems, Nature Communications, 13 (2022), p. 333.
[12] A. J. Bray and D. S. Dean, Statistics of critical points of Gaussian fields on large-dimensional spaces, Physical Review Letters, 98 (2007), p. 150201.
[13] R. L. Burden, J. D. Faires, and A. M. Burden, Numerical Analysis, Cengage Learning, 2015.
[14] Y. Cao and Q. Gu, Generalization bounds of stochastic gradient generalization bounds of stochastic gradient descent for wide and deep neural networks, in Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019), no. 972, 2019, pp. 10836–10846.
[15] T. Chen and T. Lumley, Numerical evaluation of methods approximating the distribution of a large quadratic form in normal variables, Computational Statistics & Data Analysis, 139 (2019), pp. 75–81.
[16] D. Choi, C. J. Shallue, Z. Nado, J. Lee, C. J. Maddison, and G. E. Dahl, On empirical comparisons of optimizers for deep learning, International Conference on Learning Representations, (2020).
[17] Y. Cooper, Global minima of overparameterized neural networks, SIAM Journal on Mathematics of Data Science, 3 (2021), pp. 676–691.
[18] K. Crane, Discrete differential geometry: An applied introduction, Notices of the AMS, Communication, (2018), pp. 1153–1159.
[19] Y. N. Dauphin, R. Pascanu, Ç. Gülçehre, K. Cho, S. Ganguli, and Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds., 2014, pp. 2933–2941.
[20] R. B. Davies, Algorithm AS 155: The distribution of a linear combination of $χ^{2}$ random variables, Applied Statistics, (1980), pp. 323–333.
[21] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, Sharp Minima Can Generalize For Deep Nets, in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, D. Precup and Y. W. Teh, eds., vol. 70 of Proceedings of Machine Learning Research, 2017, pp. 1019–1028.
[22] S. Du, J. Lee, H. Li, L. Want, and X. Zhai, Gradient descent finds global minima of deep neural networks, in Proceedings of the 36th International Conference on Machine Learning, vol. 97 of Proceedings of Machine Learning Research, 2019.
[23] I. J. Goodfellow and O. Vinyals, Qualitatively Characterizing Neural Network Optimization Problems, in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, eds., 2015.
[24] Y. Hao, L. Dong, F. Wei, and K. Xu, Visualizing and Understanding the Effectiveness of BERT, in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan, eds., Association for Computational Linguistics, 2019, pp. 4141–4150.
[25] M. Hardt, B. Recht, and Y. Singer, Train faster, generalize better: Stability of stochastic gradient descent, in International Conference on Machine Learning, Proceedings of Machine Learning Research, 2016, pp. 1225–1234.
[26] R. Hecht-Nielsen, Context vectors: general purpose approximate meaning representations self-organized from raw data, Computational Intelligence: Imitating Life, IEEE Press, (1994), pp. 43–56.
[27] K. Hornik, Approximation capabilities of multilayer feedforward networks, Neural Networks, 4 (1991), pp. 251–257.
[28] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks, 2 (1989), pp. 359–366.
[29] S. Horoi, J. Huang, B. Rieck, G. Lajoie, G. Wolf, and S. Krishnaswamy, Exploring the Geometry and Topology of Neural Network Loss Landscapes, in Advances in Intelligent Data Analysis XX - 20th International Symposium on Intelligent Data Analysis, IDA 2022, Rennes, France, April 20-22, 2022, Proceedings, T. Bouadi, É. Fromont, and E. Hüllermeier, eds., vol. 13205 of Lecture Notes in Computer Science, Springer, 2022, pp. 171–184.
[30] M. F. Hutchinson, A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines, Communications in Statistics-Simulation and Computation, 18 (1989), pp. 1059–1076.
[31] R. Karakida, S. Akaho, and S.-i. Amari, Universal statistics of Fisher information in deep neural networks: mean field approach, Journal of Statistical Mechanics: Theory and Experiment, 2020 (2020), p. 124005.
[32] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017.
[33] W. Kühnel, Differential Geometry, vol. 77, American Mathematical Society, 2015.
[34] B. Laurent and P. Massart, Adaptive estimation of a quadratic functional by model selection, Annals of Statistics, (2000), pp. 1302–1338.
[35] J. D. Lee, M. Simchowitz, M. I. Jordan, and G. Recht, Gradient descent converges to minimizers, in Conference on Learning Theory (COLT 2016), vol. 49 of Proceedings of Machine Learning Research, 2016, pp. 1–12.
[36] J. M. Lee, Riemannian Manifolds: An Introduction to Curvature, vol. 176, Springer Science & Business Media, 2006.
[37] R. B. Lehoucq, D. C. Sorensen, and C. Yang, ARPACK users’ guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods, SIAM, 1998.
[38] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, Visualizing the Loss Landscape of Neural Nets, in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., 2018, pp. 6391–6401.
[39] K.-C. Li, On principal hessian directions for data visualization and dimension reduction: Another application of Stein’s lemma, Journal of the American Statistical Association, 87 (1992), pp. 1025–1039.
[40] Z. Liao and M. W. Mahoney, Hessian eigenspectra of more realistic nonlinear models, Advances in Neural Information Processing Systems, 34 (2021), pp. 20104–20117.
[41] V. A. Marčenko and L. A. Pastur, Distribution of eigenvalues for some sets of random matrices, Mathematics of the USSR-Sbornik, 1 (1967), p. 457.
[42] J. Martens, New insights and perspectives on the natural gradient method, Journal of Machine Learning Research, 21 (2020), pp. 146:1–146:76.
[43] S.-M. Moosavi-Dezfooli, A. Fawzi, J. Uesato, and P. Frossard, Robustness via curvature regularization, and vice versa, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9078–9086.
[44] S. Park, C. Yun, J. Lee, and J. Shin, Minimum Width for Universal Approximation, in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
[45] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, Intriguing properties of neural networks, in International Conference on Learning Representations (ICLR 2014), 2014.
[46] E. P. Wigner, On the statistical distribution of the widths and spacings of nuclear resonance levels, in Mathematical Proceedings of the Cambridge Philosophical Society, vol. 47, Cambridge University Press, 1951, pp. 790–798.
[47] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, The marginal value of adaptive gradient methods in machine learning, Advances in Neural Information Processing Systems, 30 (2017).
[48] D. Wu, S.-T. Xia, and Y. Wang, Adversarial weight perturbation helps robust generalization, Advances in Neural Information Processing Systems, 33 (2020), pp. 2958–2969.
[49] Z. Yao, A. Gholami, K. Keutzer, and M. K. Mahoney, Hessian-based analysis of large batch training and robustness to adversaries, in Advances in Neural Information Processing Systems (NIPS 2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., vol. 31, Curran Associates, Inc., 2018, pp. 4954–4964.
[50] Z. Yao, A. Gholami, K. Keutzer, and M. W. Mahoney, PyHessian: Neural networks through the lens of the Hessian, in 2020 IEEE International Conference on Big Data (Big data), IEEE, 2020, pp. 581–590.
[51] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, Understanding deep learning (still) requires rethinking generalization, Communications of the ACM, 64 (2021), pp. 107–115.

Appendix A Concentration inequality

For $η_{i}, δ_{i} \sim N (0, 1)$ and $X_{i} = η_{i} + δ_{i}, Y_{i} = η_{i} - δ_{i}$ , our goal is to construct a concentration inequality for $| N^{- 1} \sum_{i} η_{i} δ_{i} | = | (4 N)^{- 1} \sum_{i} X_{i}^{2} - 1 + 1 - Y_{i}^{2} | = | (~ X - 1 + 1 - ~ Y) / 4 |$ , where $~ X = N^{- 1} \sum_{i = 1}^{N} X_{i}^{2}$ and $~ Y = N^{- 1} \sum_{i = 1}^{N} Y_{i}^{2}$ . Note that $~ X$ and $~ Y$ are independent because the covariance associated with the corresponding bivariate normal distribution vanishes. Using that the cumulative distribution function of the sum of independent random variables is given by the convolution of the individual distributions, we obtain

\begin{matrix} Pr [\frac{1}{N} N \sum i = 1 η_{i} δ_{i} \geq ϵ] & = Pr [\frac{1}{4 N} N \sum i = 1 X_{i}^{2} - 1 + 1 - Y_{i}^{2} \geq ϵ] = Pr [~ X - 1 + 1 - ~ Y \geq 4 ϵ] = \int_{0}^{\infty} Pr [~ X - 1 \geq 4 ϵ + ~ Y - 1 | ~ Y = y] f_{~ Y} (y) d y \leq \int_{0}^{\infty} e^{- \frac{N (4 ϵ + y - 1)^{2}}{4}} f_{~ Y} (y) d y, \end{matrix}

(31)

where we used inequality (12) in the last step. In the limit of large $N$ , we approximate the chi-squared distribution function $f_{~ Y} (y)$ by a Gaussian with mean $1$ and variance $2 / N$ . With this approximation, we find

Pr [\frac{1}{N} N \sum i = 1 η_{i} δ_{i} \geq ϵ] \leq \frac{e^{- 2 N ϵ^{2}}}{\sqrt{2}} .

(32)

The same bound applies to $Pr [1 - ~ X + ~ Y - 1 \geq 4 ϵ]$ , so we have

Pr [∣ ∣ ∣ ∣ \frac{1}{N} N \sum i = 1 η_{i} δ_{i} ∣ ∣ ∣ ∣ \geq ϵ] \leq \sqrt{2} e^{- 2 N ϵ^{2}} .

(33)

For two independent random variables $X, Y$ , a similar bound can be derived using the inequality

\begin{matrix} Pr [X + Y \geq ϵ] & = \int_{R^{2}} 1_{x + y \geq ϵ} f_{X, Y} (x, y) d x d y = \int_{R^{2}} 1_{x + y \geq ϵ} 1_{x \geq ϵ / 2} f_{X, Y} (x, y) d x d y + \int_{R^{2}} 1_{x + y \geq ϵ} 1_{x < ϵ / 2} f_{X, Y} (x, y) d x d y \leq \int_{- \infty}^{\infty} 1_{x \geq ϵ / 2} f_{X} (x) d x + \int_{- \infty}^{\infty} 1_{y \geq ϵ / 2} f_{Y} (y) d y = Pr (X \geq ϵ / 2) + Pr (Y \geq ϵ / 2), \end{matrix}

(34)

where $f_{X, Y} (x, y) = f_{X} (x) f_{Y} (y)$ denotes the joint distribution associated with random variables $X, Y$ . Using inequality (34), we obtain

Pr [∣ ∣ ∣ ∣ \frac{1}{N} N \sum i = 1 η_{i} δ_{i} ∣ ∣ ∣ ∣ \geq ϵ] \leq 2 e^{- N ϵ^{2}} .

(35)

Appendix B Annihilation method

Let $A \in R^{N \times N}$ be a symmetric matrix with eigenvalues ${λ_{i}}_{i = 1, \dots, N}$ satisfying

| λ_{1} | > | λ_{2} | \geq \dots \geq | λ_{N} | and | λ_{k} - λ_{1} | > | λ_{j} - λ_{1} | for~{} j \neq k .

(36)

Here, $λ_{k}$ is the largest magnitude eigenvalue that belongs to the set of eigenvalues of opposite sign of $λ_{1}$ . An example of a matrix with eigenvalues that satisfy Eq. (36) is a Hessian $H_{θ}$ at a saddle point. Because the matrix $A$ is symmetric, one can construct an orthonormal eigenbasis which we denote by ${v_{j}}_{i = 1, \dots, N}$ . Furthermore, let $B = A - λ_{1} 1_{N \times N}$ . For some vector $z_{0} \in R^{N}$ , we compute

z_{1} \equiv B z_{0} = A z_{0} - λ_{1} z_{0} = N \sum j = 1 c_{j} A v_{j} - λ_{1} N \sum j = 1 c_{j} v_{j} = N \sum j = 2 c_{j} (λ_{j} - λ_{1}) v_{j} .

(37)

For iterations $k \geq 1$ with $z_{k} = B z_{k - 1}$ , we directly evaluate the Rayleigh quotient

λ^{(k)} \equiv \frac{z_{k}^{⊤} B z_{k}}{z_{k}^{⊤} z_{k}} = \frac{\sum_{j = 2}^{N} c_{j}^{2 k} (λ_{j} - λ_{1})^{2 k + 1}}{\sum_{j = 2}^{N} c_{j}^{2 k} (λ_{j} - λ_{1})^{2 k}},

(38)

where we used that $v_{i}^{⊤} v_{j} = 0$ if $i \neq j$ and 1 otherwise. Note that $λ^{(k)} \to λ_{k} - λ_{1}$ in the limit $k \to \infty$ . The described method is reminiscent of eigenvalue annihilation as commonly used in deflation techniques [13]. The main difference of the outlined deflation method w.r.t. standard annihilation approaches is that we are not interested in the second largest magnitude eigenvalue, but in the largest magnitude eigenvalue that belongs to the set of eigenvalues of opposite sign of $λ_{1}$ . The eigenvalues $λ_{1}, λ_{k}$ can be identified with the largest negative and positive principal curvatures and the corresponding eigenvectors are the dominant Hessian directions.

Visualizing high-dimensional loss landscapes with Hessian directions††thanks: Submitted to the editors on August 11, 2022.