The premise of approximate MCMC in Bayesian deep learning

Theodore Papamarkou (t.papamarkou@manchester.ac.uk), Department of Mathematics, The University of Manchester, Manchester, UK

Abstract

This paper identifies several characteristics of approximate MCMC in Bayesian deep learning. It proposes an approximate sampling algorithm for neural networks. By analogy to sampling data batches from big datasets, it is proposed to sample parameter subgroups from neural network parameter spaces of high dimensions. While the advantages of minibatch MCMC have been discussed in the literature, blocked Gibbs sampling has received less research attention in Bayesian deep learning.

Keywords: approximate MCMC, Bayesian inference, Bayesian neural networks, blocked Gibbs sampling, minibatch sampling, posterior predictive distribution

1 Motivation

This paper pushes the frontiers of approximate MCMC in Bayesian deep learning. It is a research-oriented sequel to the review of exact MCMC for neural networks by Papamarkou et al. (2022).

Why develop approximate MCMC sampling algorithms for deep learning? The answer stems from a general merit of MCMC, namely uncertainty quantification. This work demonstrates how approximate MCMC sampling of neural network parameters quantifies predictive uncertainty in classification problems.

Several impediments have inhibited the adoption of MCMC in deep learning; three notorious problems are low acceptance rates, high computational cost and lack of convergence. See Papamarkou et al. (2022) for a relevant review.

Empirical evidence herein suggests a less dismissive view of approximate MCMC in deep learning. Firstly, a sampling mechanism that takes into account the neural network structure and that partitions the parameter space into small parameter blocks retains high acceptance rate. Secondly, minibatch MCMC sampling of neural network parameters mitigates the computational bottleneck induced by big data. Bayesian marginalization, which is used for making predictions and for assessing predictive performance, is also computationally expensive. However, Bayesian marginalization is embarrassingly parallelizable across test points and along Markov chain length. Thirdly, MCMC convergence does not seem necessary in order to make predictions and to assess predictive uncertainty via neural networks. A non-convergent Markov chain acquires valuable predictive information.

The paper is structured as follows. Section 2 briefly reviews the approximate MCMC literature for deep learning. Section 3 recaps some basic knowledge, including the Bayesian multilayer perceptron (MLP) model and blocked Gibbs sampling. Section 4 introduces a finer node-blocked Gibbs (FNBG) algorithm to sample MLP parameters. Section 5 utilizes FNBG sampling to fit MLPs to three training datasets, making predictions on three associated test datasets. In Section 5, numerous observations are made about the scope of approximate MCMC in deep learning. Section 6 concludes the paper with a discussion about future research directions.

2 Literature review

Two research directions have been mainly taken to develop MCMC algorithms for neural networks. Initially, sequential Monte Carlo and reversible jump MCMC have been applied on MLPs and radial basis function networks (Andrieu, de Freitas and Doucet, 1999; de Freitas, 1999; Andrieu, de Freitas and Doucet, 2000; de Freitas et al., 2001). More recently, stochastic-gradient MCMC (SG-MCMC) algorithms have become a mainstream approach.

SG-MCMC belongs to the broader family of minibatch MCMC algorithms. In minibatch MCMC, a target density is evaluated on a subset (minibatch) of the data, thus avoiding the computational cost of MCMC iterations based on the entire data. Welling and Teh (2011) have employed the notion of minibatch to develop a stochastic gradient Langevin dynamics (SG-LD) Monte Carlo algorithm, which is the first instance of SG-MCMC. Chen, Fox and Guestrin (2014) have introduced stochastic gradient Hamiltonian Monte Carlo (SG-HMC) and applied it to infer the parameters of a Bayesian neural network fitted to the MNIST dataset (Lecun et al., 1998).

SG-LD and SG-HMC are two SG-MCMC algorithms that initiated approximate MCMC research in machine learning. Several variants of SG-MCMC have appeared ever since. Gong, Li and Hernández-Lobato (2019) have proposed an SG-MCMC scheme that generalizes Hamiltonian dynamics with state-dependent drift and diffusion, and have demonstrated the performance of this scheme on convolutional and on recurrent neural networks. Zhang et al. (2020) have proposed cyclical SG-MCMC, a tempered version of SG-LD with a cyclical stepsize schedule. Moreover, Zhang et al. (2020) have showcased the performance of cyclical SG-MCMC on a ResNet-18 (He et al., 2016) fitted to the CIFAR-10 and CIFAR-100 datasets (Krizhevsky and Hinton, 2009). Alexos, Boyd and Mandt (2022) have introduced structured SG-MCMC, a combination of SG-MCMC and structured variational inference (Saul and Jordan, 1995). Structured SG-MCMC employs SG-LD or SG-HMC to sample from a factorized variational parameter posterior density. Alexos, Boyd and Mandt (2022) have tested the performance of structured SG-MCMC on ResNet-20 (He et al., 2016) architectures fitted to the CIFAR-10, SVHN (Netzer et al., 2011) and fashion MNIST (Xiao, Rasul and Vollgraf, 2017) datasets.

Aside from research papers on MCMC for neural networks, there also exist reviews of the topic. For instance, see Wenzel et al. (2020), Izmailov et al. (2021) and Papamarkou et al. (2022). Such reviews provide insights into the utility, challenges and future of MCMC in deep learning.

3 Preliminaries

This section reviews two topics, the Bayesian MLP model for supervised classification (Subsection 3.1) and blocked Gibbs sampling (Subsection 3.2). For the Bayesian MLP model, the parameter posterior density and posterior predictive probability mass function (pmf) are stated. Blocked Gibbs sampling provides a starting point in developing the algorithm of Section 4 for sampling from the MLP parameter posterior density.

3.1 The Bayesian MLP model

An MLP is a feedforward neural network comprising an input layer, one or more hidden layers and an output layer (Rosenblatt, 1958; Minsky and Papert, 1988; Hastie, Tibshirani and Friedman, 2016). For a fixed natural number $ρ \geq 2$, an index $j \in \{0, 1, \dots, ρ\}$ indicates the layer. In particular, $j = 0$ refers to the input layer, $j \in \{1, 2, \dots, ρ - 1\}$ to one of the $ρ - 1$ hidden layers, and $j = ρ$ to the output layer. Let $κ_{j}$ be the number of nodes in layer $j$, and let $κ_{0 : ρ} = (κ_{0}, κ_{1}, \dots, κ_{ρ})$ be the sequence of node counts per layer. $MLP (κ_{0 : ρ})$ denotes an MLP with $ρ - 1$ hidden layers and $κ_{j}$ nodes at layer $j$.

An $MLP (κ_{0 : ρ})$ with $ρ - 1$ hidden layers and $κ_{j}$ nodes at layer $j$ is defined recursively as

\begin{align}
g_{j} (x_{i}, θ_{1 : j}) &= w_{j} h_{j - 1} (x_{i}, θ_{1 : j - 1}) + b_{j}, \tag{3.1} \\
h_{j} (x_{i}, θ_{1 : j}) &= ϕ_{j} (g_{j} (x_{i}, θ_{1 : j})), \tag{3.2}
\end{align}

for $j \in \{1, 2, \dots, ρ\}$. An input data point $x_{i} \in R^{κ_{0}}$ is passed to the input layer $h_{0} (x_{i}) = x_{i}$, yielding vector $g_{1} (x_{i}, θ_{1}) = w_{1} x_{i} + b_{1}$ in the first hidden layer. The parameters $θ_{j} = (w_{j}, b_{j})$ at layer $j$ consist of weights $w_{j}$ and biases $b_{j}$. The weight matrix $w_{j}$ has $κ_{j}$ rows and $κ_{j - 1}$ columns, while the vector $b_{j}$ of biases has length $κ_{j}$. All weights and biases up to layer $j$ are denoted by $θ_{1 : j} = (θ_{1}, θ_{2}, \dots, θ_{j})$. An activation function $ϕ_{j}$ is applied elementwise to pre-activation vector $g_{j} (x_{i}, θ_{1 : j})$, and returns post-activation vector $h_{j} (x_{i}, θ_{1 : j})$. Concatenating all $θ_{j}, j \in \{1, 2, \dots, ρ\}$, gives a parameter vector $θ = θ_{1 : ρ} \in R^{n}$ of length $n = \sum_{j = 1}^{ρ} κ_{j} (κ_{j - 1} + 1)$.
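The recursion of Equations (3.1) and (3.2) and the parameter count $n$ can be illustrated with a minimal sketch in plain Python. The function names (`mlp_forward`, `elementwise`, `num_parameters`) and the list-of-rows storage of each $w_{j}$ are assumptions made for this illustration only; they are not taken from the paper's implementation.

```python
import math


def sigmoid(z):
    """Logistic sigmoid for a scalar pre-activation."""
    return 1.0 / (1.0 + math.exp(-z))


def elementwise(f):
    """Lift a scalar activation to act on a whole pre-activation vector."""
    return lambda g: [f(g_k) for g_k in g]


def mlp_forward(x, weights, biases, activations):
    """Return h_rho(x, theta) by iterating Equations (3.1) and (3.2).

    weights[j-1] is the kappa_j x kappa_{j-1} matrix w_j (list of rows),
    biases[j-1] is the vector b_j, activations[j-1] is phi_j.
    """
    h = list(x)  # h_0(x) = x
    for w, b, phi in zip(weights, biases, activations):
        # Equation (3.1): g_j = w_j h_{j-1} + b_j, one entry per node of layer j
        g = [sum(w_kl * h_l for w_kl, h_l in zip(w_k, h)) + b_k
             for w_k, b_k in zip(w, b)]
        # Equation (3.2): h_j = phi_j(g_j)
        h = phi(g)
    return h


def num_parameters(kappa):
    """n = sum over j = 1..rho of kappa_j * (kappa_{j-1} + 1)."""
    return sum(kappa[j] * (kappa[j - 1] + 1) for j in range(1, len(kappa)))
```

For instance, `num_parameters((3, 2, 2, 2))` evaluates the sum $2 \cdot 4 + 2 \cdot 3 + 2 \cdot 3$ for the architecture with node counts $(3, 2, 2, 2)$.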

$w_{j, k, l}$ denotes the $(k, l)$ -th element of weight matrix $w_{j}$ . Analogously, $b_{j, k}, x_{i, k}, g_{j, k}$ and $h_{j, k}$ correspond to the $k$ -th coordinate of bias $b_{j}$ , of input $x_{i}$ , of pre-activation $g_{j}$ and of post-activation $h_{j}$ .

MLPs are typically visualized as graphs. For instance, Figure 1 displays a graph representation of $MLP (κ_{0} = 3, κ_{1} = 2, κ_{2} = 2, κ_{3} = 2)$ , which has an input layer with $κ_{0} = 3$ nodes (purple), two hidden layers with $κ_{1} = κ_{2} = 2$ nodes each (blue), and an output layer with $κ_{3} = 2$ nodes (gray). Purple nodes indicate observed variables (input data), whereas blue and gray nodes indicate latent variables (post-activations).

Figure 1: A graph visualization of $MLP (3, 2, 2, 2)$ . Purple, blue and gray nodes correspond to input data, to hidden layer post-activations and to output layer (softmax) post-activations used for making predictions.

Let $D_{1 : s} = \{(x_{i}, y_{i}) : i = 1, 2, \dots, s\}$ be a training dataset. Each training data point $(x_{i}, y_{i})$ includes an input $x_{i} \in R^{κ_{0}}$ and a discrete output (label) $y_{i} \in \{1, 2, \dots, κ_{ρ}\}, κ_{ρ} \geq 2$. Moreover, let $(x, y)$ be a test point consisting of an input $x \in R^{κ_{0}}$ and of a label $y \in \{1, 2, \dots, κ_{ρ}\}$. The supervised classification problem under consideration is to predict test label $y$ given test input $x$ and training dataset $D_{1 : s}$. An $MLP (κ_{0 : ρ})$, whose output layer has $κ_{ρ}$ nodes and applies the softmax activation function $ϕ_{ρ}$, is used to address this problem. The softmax activation function at the output layer is expressed as $ϕ_{ρ} (g_{ρ}) = \exp (g_{ρ}) / \sum_{k = 1}^{κ_{ρ}} \exp (g_{ρ, k})$.

It is assumed that the training labels $y_{1 : s} = (y_{1}, y_{2}, \dots, y_{s})$ are outcomes of $s$ independent draws from a categorical pmf with event probabilities given by $Pr (y_{i} = k | x_{i}, θ) = h_{ρ, k} (x_{i}, θ) = ϕ_{ρ} (g_{ρ, k} (x_{i}, θ))$ , where $θ$ is the set of $MLP (κ_{0 : ρ})$ parameters. It follows that the likelihood function for the $MLP (κ_{0 : ρ})$ model in supervised classification is

\begin{equation}
L (y_{1 : s} | x_{1 : s}, θ) = \prod_{i = 1}^{s} \prod_{k = 1}^{κ_{ρ}} (h_{ρ, k} (x_{i}, θ))^{1_{\{y_{i} = k\}}}, \tag{3.3}
\end{equation}

where $x_{1 : s} = (x_{1}, x_{2}, \dots, x_{s})$ are the training inputs and $1$ denotes the indicator function. Interest is in sampling from the parameter posterior density

\begin{equation}
p (θ | x_{1 : s}, y_{1 : s}) \propto L (y_{1 : s} | x_{1 : s}, θ) π (θ), \tag{3.4}
\end{equation}

given the likelihood function $L (y_{1 : s} | x_{1 : s}, θ)$ of Equation (3.3) and a parameter prior $π (θ)$ . For brevity, the parameter posterior density $p (θ | x_{1 : s}, y_{1 : s})$ is alternatively denoted by $p (θ | D_{1 : s})$ .
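The likelihood of Equation (3.3) and the unnormalized posterior of Equation (3.4) are usually evaluated on the log scale. The following is a minimal sketch, assuming that the output-layer pre-activations $g_{ρ} (x_{i}, θ)$ have already been computed; the function names are illustrative and not part of the paper.

```python
import math


def softmax(g):
    """Softmax of a pre-activation vector g_rho."""
    m = max(g)  # shifting by the maximum avoids overflow and leaves the result unchanged
    e = [math.exp(g_k - m) for g_k in g]
    total = sum(e)
    return [e_k / total for e_k in e]


def log_likelihood(output_preactivations, labels):
    """Logarithm of Equation (3.3); labels take values in {1, ..., kappa_rho}."""
    total = 0.0
    for g, y in zip(output_preactivations, labels):
        h = softmax(g)
        # The indicator exponent keeps only the factor with k = y_i.
        total += math.log(h[y - 1])
    return total


def log_posterior_unnormalized(output_preactivations, labels, log_prior_value):
    """Logarithm of the right-hand side of Equation (3.4)."""
    return log_likelihood(output_preactivations, labels) + log_prior_value
```

Working with logarithms turns the double product into a sum over training points, which avoids numerical underflow when $s$ is large.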

By integrating out parameters $θ$ , the posterior predictive pmf of test label $y$ given test input $x$ and training dataset $D_{1 : s}$ becomes

\begin{equation}
p (y | x, D_{1 : s}) = \int L (y | x, θ) p (θ | D_{1 : s}) d θ, \tag{3.5}
\end{equation}

where $L$ is the likelihood function of Equation (3.3) evaluated on $(x, y)$ , and $p (θ | D_{1 : s})$ is the parameter posterior density of Equation (3.4). The integral in Equation (3.5) can be approximated via Monte Carlo integration, yielding the approximate posterior predictive pmf

\begin{equation}
\hat{p} (y | x, D_{1 : s}) \simeq \frac{1}{v} \sum_{t = 1}^{v} p (y | x, ω_{t}), \tag{3.6}
\end{equation}

where $(ω_{1}, ω_{2}, \dots, ω_{v})$ is a Markov chain realization obtained from the parameter posterior density $p (θ | D_{1 : s})$. Maximizing the approximate posterior predictive pmf $\hat{p} (y | x, D_{1 : s})$ of Equation (3.6) yields the prediction

\begin{equation}
\hat{y} = \arg\max_{y} \{ \hat{p} (y | x, D_{1 : s}) \} \tag{3.7}
\end{equation}

for test label $y \in \{1, 2, \dots, κ_{ρ}\}$.
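The Monte Carlo approximation and the resulting prediction can be sketched as follows. The sketch assumes that the class probabilities $p (y = k | x, ω_{t})$ have already been computed for every chain state by a forward pass; it averages them over the chain, which does not affect the maximizer. The function names are hypothetical.

```python
def posterior_predictive(class_probs):
    """Average class probabilities over chain states.

    class_probs[t][k - 1] holds p(y = k | x, omega_t) for chain state omega_t.
    """
    v = len(class_probs)
    n_classes = len(class_probs[0])
    return [sum(p[k] for p in class_probs) / v for k in range(n_classes)]


def predict(class_probs):
    """Return (y_hat, p_hat), with y_hat in {1, ..., kappa_rho} as in Equation (3.7)."""
    p_hat = posterior_predictive(class_probs)
    y_hat = max(range(len(p_hat)), key=lambda k: p_hat[k]) + 1
    return y_hat, p_hat
```

Since each test point and each chain state contributes an independent forward pass, the rows of `class_probs` can be computed in parallel before the average is taken.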

The likelihood function for an MLP model with $κ_{ρ} \geq 2$ output layer nodes, as stated in Equation (3.3), is suited for multiclass classification with $κ_{ρ}$ classes. For binary classification, which involves two classes, Equation (3.3) is related to an MLP with $κ_{ρ} = 2$ output layer nodes. There is an alternative likelihood function based on an MLP model with a single output layer node, which can be used for binary classification; see Papamarkou et al. (2022) for details.

3.2 Blocked Gibbs sampling

A blocked Gibbs sampling algorithm samples groups (blocks) of two or more parameters conditioned on all other parameters, rather than sampling each parameter individually. The choice of parameter groups affects the rate of convergence (Roberts and Sahu, 1997). For instance, breaking down the parameter space into statistically independent groups of correlated parameters speeds up convergence.

To sample from the parameter posterior density $p (θ | D_{1 : s})$ of an $MLP (κ_{0 : ρ})$ model fitted to a training dataset $D_{1 : s}$, a blocked Gibbs sampling algorithm utilizes a partition $\{θ_{z (1)}, θ_{z (2)}, \dots, θ_{z (m)}\}$ of the MLP parameters $θ = (θ_{1}, θ_{2}, \dots, θ_{n})$. Due to partitioning $\{θ_{1}, θ_{2}, \dots, θ_{n}\}$, the parameter subsets $θ_{z (1)}, θ_{z (2)}, \dots, θ_{z (m)}$ are pairwise disjoint and satisfy $\cup_{q = 1}^{m} θ_{z (q)} = \{θ_{1}, θ_{2}, \dots, θ_{n}\}, m \leq n$. Without loss of generality, it is assumed that each subset $θ_{z (q)}$ of $θ$ is totally ordered. For any $(c, q)$ such that $1 \leq c \leq q \leq m$, the shorthand notation $θ_{z (c) : z (q)} = (θ_{z (c)}, θ_{z (c + 1)}, \dots, θ_{z (q)})$ is used hereafter. So, the vector $θ_{z (1) : z (m)}$ is a permutation of $θ$.

Under such a setup, Algorithm 1 summarizes blocked Gibbs sampling. At iteration $t$, for each $q \in \{1, 2, \dots, m\}$, a blocked Gibbs sampling algorithm draws a sample $θ_{z (q)}^{(t)}$ of parameter group $θ_{z (q)}$ from the corresponding conditional density $p (θ_{z (q)} | θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q + 1) : z (m)}^{(t - 1)}, D_{1 : s})$. To put it another way, at each iteration, a sample is drawn from the conditional density of each parameter group conditioned on the most recent values of the other parameter groups and on the training dataset.

1: Input: training dataset $D_{1 : s}$
2: Input: initial state $θ_{z (1) : z (m)}^{(0)}$
3: Input: number of Gibbs sampling iterations $v$
4: for $t = 1, \dots, v$ do
5:     Draw $θ_{z (1)}^{(t)} \sim p (θ_{z (1)} | θ_{z (2) : z (m)}^{(t - 1)}, D_{1 : s})$
6:     Draw $θ_{z (2)}^{(t)} \sim p (θ_{z (2)} | θ_{z (1)}^{(t)}, θ_{z (3) : z (m)}^{(t - 1)}, D_{1 : s})$
       ⋮
7:     Draw $θ_{z (q)}^{(t)} \sim p (θ_{z (q)} | θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q + 1) : z (m)}^{(t - 1)}, D_{1 : s})$
       ⋮
8:     Draw $θ_{z (m)}^{(t)} \sim p (θ_{z (m)} | θ_{z (1) : z (m - 1)}^{(t)}, D_{1 : s})$
9: end for

Algorithm 1 Blocked Gibbs sampling
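The sweep structure of Algorithm 1 can be sketched generically, assuming that a sampler for each conditional density is available as a Python callable. The toy target used in the usage lines is an assumption chosen because its conditionals are known in closed form: pairs $(a_{i}, b_{i})$, $i = 1, 2$, are independent standard bivariate normals with correlation `rho`, and the two blocks are $(a_{1}, a_{2})$ and $(b_{1}, b_{2})$. It is not an MLP posterior.

```python
import math
import random


def blocked_gibbs(initial_blocks, conditional_samplers, v, rng):
    """Run v sweeps of blocked Gibbs sampling.

    conditional_samplers[q](blocks, rng) must return a draw of block q
    given the current values of all other blocks.
    """
    blocks = [list(b) for b in initial_blocks]
    chain = []
    for _ in range(v):
        for q, sampler in enumerate(conditional_samplers):
            # Blocks before q already hold iteration-t values,
            # blocks after q still hold iteration-(t-1) values.
            blocks[q] = list(sampler(blocks, rng))
        chain.append([list(b) for b in blocks])
    return chain


# Toy usage: a | b ~ N(rho * b, (1 - rho^2) I) and b | a ~ N(rho * a, (1 - rho^2) I).
rho = 0.8
sd = math.sqrt(1.0 - rho ** 2)
samplers = [
    lambda blocks, rng: [rng.gauss(rho * b, sd) for b in blocks[1]],
    lambda blocks, rng: [rng.gauss(rho * a, sd) for a in blocks[0]],
]
toy_chain = blocked_gibbs([[0.0, 0.0], [0.0, 0.0]], samplers, 1000, random.Random(0))
```

For MLPs the conditional densities are not available in closed form, so each `sampler` call would have to be replaced by an approximate update.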

4 Methodology

This section introduces a blocked Gibbs sampling algorithm for MLPs in supervised classification. MLP parameter blocks are determined by linking parameters to MLP nodes, as elaborated in Subsections 4.1 and 4.2 and as exemplified in Subsections 4.3 and 4.4.

Minibatching and parameter blocking render the proposed Gibbs sampler possible. Blocked Gibbs sampling is typically motivated by increased rates of convergence attained via near-optimal or optimal parameter groupings. Although low speed of convergence is a problem with MCMC in deep learning, near-zero acceptance rates constitute a more immediate problem. In other words, no mixing is a more pressing issue than slow mixing. By updating a small block of parameters at a time instead of updating all parameters via a single step, each block-specific acceptance rate moves away from zero. So, minibatch blocked Gibbs sampling provides a workaround for vanishing acceptance rates in deep learning. Of course there is no free lunch; increased acceptance rates come at a computational price per Gibbs step, which consists of additional conditional density sampling sub-steps.

SG-LD and SG-HMC implementations sample all parameters at once or sample parameters layer-wise. In the latter case, one block of parameters is formed for each MLP layer, and the resulting algorithm is LD-within-Gibbs or HMC-within-Gibbs. A caveat to grouping parameters by MLP layer is that parameter block sizes depend on layer widths. Hence, a parameter block can be large, containing hundreds or thousands of parameters, in which case the problem of low acceptance rate is not resolved. The blocked Gibbs sampler of this paper groups parameters by MLP node and allows parameters to be further partitioned into smaller blocks within each node, thus controlling the number of parameters per block.

While structured SG-MCMC (Alexos, Boyd and Mandt, 2022) also splits the parameter space into blocks, it uses the parameter blocks to factorize a variational posterior density. Hence, structured SG-MCMC aims to solve the low acceptance and slow mixing problems by factorizing an approximate parameter posterior density. The blocked Gibbs sampler herein factorizes the exact parameter posterior density, relying on finer parameter grouping. Minibatching, which is the only type of approximation employed by the blocked Gibbs sampler of this paper, is an approximation related to the data, not to the MLP model.

The finer node-blocked Gibbs sampler for deep learning, as presently conceived, is a gradient-free minibatch MCMC algorithm, so it does not belong to the SG-MCMC family. Moreover, it does not use tempering or any optimized tuning or scheduling of proposal hyperparameters.

4.1 Blocked Gibbs sampling via cross-entropy

Algorithm 1 raises the question of how to sample each parameter block from its conditional density. Such conditional densities for MLPs are not available in closed form. Instead, a single Metropolis-Hastings step can be taken to draw a sample from a conditional density. In this case, Algorithm 1 becomes a Metropolis-within-blocked-Gibbs (MWBG) sampling algorithm.

At iteration $t$ of MWBG, a candidate state $θ_{z (q)}^{⋆}$ for parameter block $θ_{z (q)}$ can be sampled from an isotropic normal proposal density $N (θ_{z (q)}^{(t - 1)}, σ_{q}^{2} I_{q})$ centered at state $θ_{z (q)}^{(t - 1)}$ of iteration $t - 1$ , where $I_{q}$ is the $| θ_{z (q)} | \times | θ_{z (q)} |$ identity matrix, $| θ_{z (q)} |$ is the number of parameters in block $θ_{z (q)}$ , and $σ_{q}^{2} > 0$ is the proposal variance for block $θ_{z (q)}$ . Proposition 1 and Corollary 1 provide expressions for the acceptance probability of candidate state $θ_{z (q)}^{⋆}$ . The proofs of Proposition 1 and Corollary 1 are available in Appendix A: proofs.

Proposition 1.

Consider an $MLP (κ_{0 : ρ})$ with likelihood function $L (y_{1 : s} | x_{1 : s}, θ)$ specified by Equation (3.3), where $\{(x_{i}, y_{i}) : i = 1, 2, \dots, s\}$ is a training dataset related to a supervised classification problem and $θ$ are the MLP parameters. Let $π (θ) = \prod_{q = 1}^{m} π (θ_{z (q)})$ be a parameter prior density based on a partition $\{θ_{z (1)}, θ_{z (2)}, \dots, θ_{z (m)}\}$ of $θ$. An MWBG version of Algorithm 1 is used for sampling from the target density $p (θ | x_{1 : s}, y_{1 : s})$. At iteration $t$, a candidate state $θ_{z (q)}^{⋆}$ for parameter block $θ_{z (q)}$ is drawn from the isotropic normal proposal density $N (θ_{z (q)}^{(t - 1)}, σ_{q}^{2} I_{q})$. The acceptance probability $a (θ_{z (q)}^{⋆}, θ_{z (q)}^{(t - 1)})$ of $θ_{z (q)}^{⋆}$ is given by

\begin{equation}
a (θ_{z (q)}^{⋆}, θ_{z (q)}^{(t - 1)}) = \min \left\{ \frac{L (y_{1 : s} | x_{1 : s}, θ^{⋆}) π (θ_{z (q)}^{⋆})}{L (y_{1 : s} | x_{1 : s}, θ^{(t - 1)}) π (θ_{z (q)}^{(t - 1)})}, 1 \right\}, \tag{4.1}
\end{equation}

where $θ^{(t - 1)}$ and $θ^{⋆}$ denote the values of $θ$ obtained by inverting the permutations $(θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q) : z (m)}^{(t - 1)})$ and $(θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q)}^{⋆}, θ_{z (q + 1) : z (m)}^{(t - 1)})$ , respectively.

Corollary 1.

Consider an $MLP (κ_{0 : ρ})$ with cross-entropy loss function $E (θ, D_{1 : s})$, where $D_{1 : s} = \{(x_{i}, y_{i}) : i = 1, 2, \dots, s\}$ is a training dataset related to a supervised classification problem and $θ$ are the MLP parameters. It is assumed that $E$ is unnormalized, which means that it is not scaled by batch size. Under the sampling setup of Proposition 1, the acceptance probability of $θ_{z (q)}^{⋆}$, expressed in terms of cross-entropy loss function $E$, is given by

\begin{equation}
a (θ_{z (q)}^{⋆}, θ_{z (q)}^{(t - 1)}) = \min \left\{ \frac{π (θ_{z (q)}^{⋆})}{π (θ_{z (q)}^{(t - 1)})} \exp (E (θ^{(t - 1)}, D_{1 : s}) - E (θ^{⋆}, D_{1 : s})), 1 \right\} . \tag{4.2}
\end{equation}

Proposition 1 states the acceptance probability in statistical terms using the likelihood function, whereas Corollary 1 states it in deep learning terms using the cross-entropy loss function. Corollary 1 is practical in the sense that deep learning software frameworks, being geared towards optimization, provide implementations of cross-entropy loss. For example, the unnormalized cross-entropy loss $E$, as stated in Equation (6.4) of Appendix A: proofs, can be computed in PyTorch via the \texttt{CrossEntropyLoss} class initialized with \texttt{reduction='sum'}.
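The link between the two statements can be sketched in two lines, assuming that the unnormalized cross-entropy loss takes its standard form as the negative logarithm of the likelihood of Equation (3.3); the formal proof is the one given in Appendix A and is not reproduced here.

\begin{align*}
E (θ, D_{1 : s}) &= - \sum_{i = 1}^{s} \sum_{k = 1}^{κ_{ρ}} 1_{\{y_{i} = k\}} \log h_{ρ, k} (x_{i}, θ) = - \log L (y_{1 : s} | x_{1 : s}, θ), \\
\frac{L (y_{1 : s} | x_{1 : s}, θ^{⋆})}{L (y_{1 : s} | x_{1 : s}, θ^{(t - 1)})} &= \exp (E (θ^{(t - 1)}, D_{1 : s}) - E (θ^{⋆}, D_{1 : s})) .
\end{align*}

Substituting the second line into the likelihood ratio of Proposition 1 gives the acceptance probability of Corollary 1.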

Algorithm 2 summarizes exact MWBG sampling. To make Algorithm 2 amenable to big data, minibatching can be used by replacing all instances of $D_{1 : s}$ with $r (D_{1 : s})$ , where $r : D_{1 : s} \to P (D_{1 : s})$ is a function from $D_{1 : s}$ to its power set $P (D_{1 : s})$ . If $r$ is the identity function, then exact MWBG sampling is performed. If $r$ returns batches (strict subsets of $D_{1 : s}$ ), then the resulting approximate MCMC algorithm is termed ‘minibatch MWBG sampling’.

1: Input: training dataset $D_{1 : s}$
2: Input: initial state $θ_{z (1) : z (m)}^{(0)}$
3: Input: proposal variances $(σ_{1}^{2}, \dots, σ_{m}^{2})$ across blocks
4: Input: number of Gibbs sampling iterations $v$
5: for $t = 1, \dots, v$ do
6:     for $q = 1, \dots, m$ do
7:         Draw $θ_{z (q)}^{⋆} \sim N (θ_{z (q)}^{(t - 1)}, σ_{q}^{2} I_{q})$
8:         Compute $a (θ_{z (q)}^{⋆}, θ_{z (q)}^{(t - 1)}) = \min \left\{ \frac{π (θ_{z (q)}^{⋆})}{π (θ_{z (q)}^{(t - 1)})} \exp (E (θ^{(t - 1)}, D_{1 : s}) - E (θ^{⋆}, D_{1 : s})), 1 \right\}$
9:         Draw $u$ from uniform $U (0, 1)$
10:        if $u \leq a (θ_{z (q)}^{⋆}, θ_{z (q)}^{(t - 1)})$ then
11:            Set $θ_{z (q)}^{(t)} = θ_{z (q)}^{⋆}$
12:        else
13:            Set $θ_{z (q)}^{(t)} = θ_{z (q)}^{(t - 1)}$
14:        end if
15:    end for
16: end for

Algorithm 2 Metropolis-within-blocked-Gibbs (MWBG) sampling based on cross-entropy
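Algorithm 2 can be sketched in plain Python as follows. This is an illustration under several assumptions rather than the paper's implementation: the model in the usage lines is a hypothetical softmax classifier with one input and two output nodes, the toy data are made up, the batch function `r` is applied once per Gibbs iteration, and the acceptance probability is evaluated on the log scale for numerical stability.

```python
import math
import random


def mwbg_sample(theta0, blocks, sigmas, loss, log_prior, data, v, rng,
                r=lambda d: d):
    """Metropolis-within-blocked-Gibbs sketch based on Equation (4.2).

    blocks: list of index lists partitioning range(len(theta0)).
    loss(theta, batch): unnormalized cross-entropy E.
    log_prior(block_values): log prior density of one block, up to a constant.
    r: batch function; the identity gives exact MWBG sampling.
    """
    theta = list(theta0)
    chain, n_accepted = [], [0] * len(blocks)
    for _ in range(v):
        batch = r(data)  # assumption: one batch is drawn per Gibbs iteration
        for q, (idx, sigma) in enumerate(zip(blocks, sigmas)):
            # Propose new values for block q only; the other blocks keep
            # their most recent values.
            proposal = list(theta)
            for i in idx:
                proposal[i] = rng.gauss(theta[i], sigma)
            log_a = (log_prior([proposal[i] for i in idx])
                     - log_prior([theta[i] for i in idx])
                     + loss(theta, batch) - loss(proposal, batch))
            # Accept with probability min{exp(log_a), 1}.
            if rng.random() <= math.exp(min(log_a, 0.0)):
                theta = proposal
                n_accepted[q] += 1
        chain.append(list(theta))
    return chain, [n / v for n in n_accepted]


def cross_entropy(theta, batch):
    """Unnormalized cross-entropy of a toy softmax model (1 input, 2 output nodes)."""
    total = 0.0
    for x, y in batch:
        g = [theta[0] * x + theta[1], theta[2] * x + theta[3]]
        m = max(g)
        log_norm = m + math.log(sum(math.exp(g_k - m) for g_k in g))
        total -= g[y - 1] - log_norm
    return total


def log_normal_prior(block, variance=10.0):
    """log N(0, variance * I) up to a constant that cancels in the ratio."""
    return -sum(t * t for t in block) / (2.0 * variance)


# Toy usage with one block per output node (node-blocking).
toy_data = [(-1.0, 1), (-0.5, 1), (0.5, 2), (1.0, 2)]
toy_chain, toy_rates = mwbg_sample([0.0] * 4, [[0, 1], [2, 3]], [0.5, 0.5],
                                   cross_entropy, log_normal_prior,
                                   toy_data, 200, random.Random(0))
```

Passing a subsampling function as `r`, for example one that returns a random subset of `data`, turns the sketch into minibatch MWBG sampling; the block-specific acceptance rates returned alongside the chain are the quantity that parameter blocking is meant to keep away from zero.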

4.2 Finer node-blocked Gibbs sampling

Big data and big models challenge the adaptation of MCMC sampling methods in deep learning. Minibatching provides a way of applying MCMC to big data. It is less clear how to apply MCMC to big neural network models, containing thousands or millions of parameters. Minibatch MWBG sampling proposes a way forward by drawing an analogy between subsetting data and subsetting model parameters. As data batches reduce the dimensionality of data per Gibbs sampling iteration, parameter blocks reduce the dimensionality of parameters per Metropolis-within-Gibbs update.

In an $MLP (κ_{0 : ρ})$ with $n$ parameters, layer $j$ contains $κ_{j} (κ_{j - 1} + 1)$ parameters, of which $κ_{j} κ_{j - 1}$ are weights and $κ_{j}$ are biases. So, if parameters are grouped by layer, then the block of layer $j$ contains $κ_{j} (κ_{j - 1} + 1)$ parameters. The number of parameters in the block of layer $j$ grows linearly with the number $κ_{j}$ of nodes in layer $j$ as well as linearly with the number $κ_{j - 1}$ of nodes in layer $j - 1$ .

If parameters are grouped by node, then each node block in layer $j$ contains $κ_{j - 1} + 1$ parameters, of which $κ_{j - 1}$ are weights and one is a bias. The number of parameters in a node block in layer $j$ does not depend on the number $κ_{j}$ of nodes in layer $j$, but it grows linearly with the number $κ_{j - 1}$ of nodes in layer $j - 1$. MWBG sampling (Algorithm 2) based on parameter grouping by MLP node is termed ‘(Metropolis-within-)node-blocked-Gibbs (NBG) sampling’.

Finer parameter blocks of smaller size can be generated by splitting the $κ_{j - 1} + 1$ parameters of a node in layer $j$ into $β_{j}$ subgroups. In this case, each finer parameter block in each node in layer $j$ contains $(κ_{j - 1} + 1) / β_{j}$ parameters. If hyperparameter $β_{j}$ is chosen to be a linear function of $κ_{j - 1}$ , then the number of parameters per finer block per node in layer $j$ depends neither on the number $κ_{j}$ of nodes in layer $j$ nor on the number $κ_{j - 1}$ of nodes in layer $j - 1$ . MWBG sampling (Algorithm 2) based on finer parameter grouping per node is termed ‘(Metropolis-within-)finer-node-blocked-Gibbs (FNBG) sampling’.
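As a worked illustration of the last claim, suppose (as an assumption made here for concreteness) that $β_{j} = a κ_{j - 1} + b$ with constants $a > 0$ and $b \geq 0$. The number of parameters per finer block is then

\begin{equation*}
\frac{κ_{j - 1} + 1}{β_{j}} = \frac{κ_{j - 1} + 1}{a κ_{j - 1} + b},
\end{equation*}

which equals $1 / a$ exactly when $b = a$, and tends to $1 / a$ as $κ_{j - 1}$ grows otherwise. For example, $β_{j} = (κ_{j - 1} + 1) / c$ yields blocks of $c$ parameters each whenever $c$ divides $κ_{j - 1} + 1$.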

4.3 Toy example of finer node-blocking

The $MLP (3, 2, 2, 2)$ architecture shown in Figure 1 provides a toy example that showcases layer-based, node-based and finer node-based parameter grouping (more briefly termed ‘layer-blocking’, ‘node-blocking’ and ‘finer node-blocking’). Recall that finer node-based grouping refers to parameter grouping into smaller blocks within each node. Figure 2 shows the directed acyclic graph (DAG) representation of $MLP (3, 2, 2, 2)$, augmenting Figure 1 with parameter annotations and with a layer consisting of a single node that represents label $y_{i}$. Yellow shapes indicate parameters; yellow circles and boxes correspond to biases and weights. Yellow boxes adhere to expository visual conventions of plate models, with each box representing a set of weights. Purple nodes indicate observed variables (input and output data), whereas blue and gray nodes indicate latent variables (post-activations).

Figure 2: Visual demonstration of node-based parameter blocking for the $MLP (3, 2, 2, 2)$ architecture. The MLP is expressed as a DAG. Yellow nodes and yellow plates correspond to biases and weights. Each of the blue hidden layer nodes and of the gray output layer nodes is assigned a parameter block of yellow parent nodes in the DAG.

Layer-blocking partitions the set of $20$ parameters of $MLP (3, 2, 2, 2)$ into three blocks $θ_{z (1)}, θ_{z (2)}, θ_{z (3)}$, which contain $| θ_{z (1)} | = 8, | θ_{z (2)} | = 6, | θ_{z (3)} | = 6$ parameters. For instance, the first hidden layer induces block $θ_{z (1)} = (w_{1, 1, 1 : 3}, b_{1, 1}, w_{1, 2, 1 : 3}, b_{1, 2})$, where $w_{j, k, 1 : l} = (w_{j, k, 1}, w_{j, k, 2}, \dots, w_{j, k, l})$.

Node-blocking partitions the set of $20$ parameters of $MLP (3, 2, 2, 2)$ into six blocks, as many as the number of hidden and output layer nodes. Each blue or gray node in a hidden layer or in the output layer has its own distinct set of yellow weight and bias parents. Parameters are grouped according to shared parenthood. For instance, the parameters of block $θ_{z (1)} = (w_{1, 1, 1 : 3}, b_{1, 1})$ have node $h_{1, 1}$ as a common child.

Figure 2 facilitates a visual explanation of Proposition 1. Acceptance probabilities for parameter blocks require likelihood function evaluations. It is not possible to factorize conditional densities to achieve more computationally efficient block updates. For instance, as can be seen in Figure 2, changes in block $θ_{z (1)} = (w_{1, 1, 1 : 3}, b_{1, 1})$ induced by node $h_{1, 1}$ in layer $1$ propagate through subsequent layers due to the hierarchical MLP structure, thus prohibiting a factorization of conditional density $p (θ_{z (1)} | θ_{z (2) : z (6)}, D_{1 : s})$. More formally, each pair of node-based parameter blocks forms a v-structure, having label $y_{i}$ (purple node) as a descendant. Since training label $y_{i}$ is observed, such v-structures are activated, and therefore any two node-based parameter blocks are not conditionally independent given label $y_{i}$.

As a demonstration of finer node-blocking for $MLP (3, 2, 2, 2)$ , set $β_{1} = 2$ in layer $1$ . For $β_{1} = 2$ , blocks $θ_{z (1)} = w_{1, 1, 1 : 2}$ and $θ_{z (2)} = (w_{1, 1, 3}, b_{1, 1})$ are generated within node $h_{1, 1}$ . Similarly, blocks $θ_{z (3)} = w_{1, 2, 1 : 2}$ and $θ_{z (4)} = (w_{1, 2, 3}, b_{1, 2})$ are generated within node $h_{1, 2}$ .

To recap on this toy example, layer-based grouping produces a single block of eight parameters in layer $1$ , node-based grouping produces two blocks of four parameters each in layer $1$ , and a case of finer node-based grouping produces four blocks of two parameters each in layer $1$ . It is thus illustrated that finer blocks per node provide a way to reduce the number of parameters per Gibbs sampling block.
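The three groupings can also be generated programmatically. The sketch below returns one list of flat parameter indices per block; the flat ordering (layer by layer, node by node, the weights of a node followed by its bias) and the rule of giving the first sub-blocks one extra parameter when the split is uneven are assumptions made for this illustration.

```python
def node_blocks(kappa, beta=None):
    """Return index lists for node-blocking or finer node-blocking.

    kappa: node counts (kappa_0, ..., kappa_rho).
    beta: optional dict mapping layer j to the number beta_j of sub-blocks
          per node in that layer; layers not listed use one block per node.
    """
    beta = beta or {}
    blocks, offset = [], 0
    for j in range(1, len(kappa)):
        per_node = kappa[j - 1] + 1  # kappa_{j-1} weights plus one bias
        n_sub = beta.get(j, 1)
        size, remainder = divmod(per_node, n_sub)
        for _ in range(kappa[j]):
            start = offset
            for b in range(n_sub):
                # The first `remainder` sub-blocks get one extra parameter.
                stop = start + size + (1 if b < remainder else 0)
                blocks.append(list(range(start, stop)))
                start = stop
            offset += per_node
    return blocks


node_level = node_blocks((3, 2, 2, 2))               # six blocks
finer_level = node_blocks((3, 2, 2, 2), {1: 2})      # beta_1 = 2 in layer 1
```

Under this ordering, the first two entries of `finer_level` hold the indices of $w_{1, 1, 1 : 2}$ and of $(w_{1, 1, 3}, b_{1, 1})$, matching the blocks described for node $h_{1, 1}$.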

4.4 MNIST example of finer node-blocking

After having used $MLP (3, 2, 2, 2)$ as a toy example to describe the basics of finer node-blocking, the wider $MLP (784, 10, 10, 10, 10)$ architecture is utilized to elaborate on the practical relevance of smaller blocks per node. An $MLP (784, 10, 10, 10, 10)$ is fitted to the MNIST (and FMNIST) training dataset in Section 5. An $MLP (784, 10, 10, 10, 10)$ contains $8180$ parameters, of which $7850, 110, 110$ and $110$ have children nodes in the first, second, third hidden layer and output layer, respectively.

So, layer-blocking for $MLP (784, 10, 10, 10, 10)$ involves four parameter blocks $θ_{z (1)}, θ_{z (2)}, θ_{z (3)}, θ_{z (4)}$ of sizes $| θ_{z (1)} | = 7850, | θ_{z (2)} | = | θ_{z (3)} | = | θ_{z (4)} | = 110$ . Metropolis-within-Gibbs updates for block $θ_{z (1)}$ have zero or near-zero acceptance rate due to the large block size of $| θ_{z (1)} | = 7850$ . Although each of blocks $θ_{z (2)}, θ_{z (3)}, θ_{z (4)}$ has nearly two orders of magnitude smaller size than $θ_{z (1)}$ , a block size of $| θ_{z (2)} | = | θ_{z (3)} | = | θ_{z (4)} | = 110$ might be large enough to yield Metropolis-within-Gibbs updates with prohibitively low acceptance rate.

Node-blocking for $MLP (784, 10, 10, 10, 10)$ entails a block of $785$ parameters for each node in the first hidden layer, and a block of $11$ parameters for each node in the second and third hidden layer and in the output layer. Thus, node-blocking addresses the low acceptance rate problem related to large parameter blocks for block updates in all layers apart from the first hidden layer.

There is no practical need to carry out finer node-blocking in nodes belonging to the second or third hidden layer or to the output layer of $MLP (784, 10, 10, 10, 10)$ , since each block in these layers contains only $11$ parameters based on node-blocking. On the other hand, finer node-blocking is useful in nodes belonging to the first hidden layer, since each block related to such nodes contains a large number of $785$ parameters. By setting $β_{1} = 10$ , smaller blocks (each consisting of $78$ or $79$ parameters) are generated in the first hidden layer. So, finer node-blocking disentangles block sizes in the first hidden layer from input data dimensions, making it possible to decrease block sizes and to consequently increase acceptance rates.

5 Experiments

Minibatch FNBG sampling is put into practice to make empirical observations about several characteristics of approximate MCMC in deep learning. In the experiments of this section, parameters of MLPs are sampled. Three datasets are used, namely a simulated noisy version of exclusive-or (Papamarkou et al., 2022), MNIST (Lecun et al., 1998) and fashion MNIST (Xiao, Rasul and Vollgraf, 2017). For brevity, exclusive-or and fashion MNIST are abbreviated to XOR and FMNIST. Table 1 displays the correspondence between used datasets and fitted MLPs.

\begin{tabular}{lrrlr}
\hline
\multicolumn{3}{c}{Dataset} & \multicolumn{2}{c}{Neural network} \\
Name & Training size & Test size & Architecture & \# parameters \\
\hline
Noisy XOR & $5000$ & $1200$ & $MLP (2, 2, 1)$ & $9$ \\
Noisy XOR & $5000$ & $1200$ & $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ & $39$ \\
MNIST & $60000$ & $10000$ & $MLP (784, 10, 10, 10, 10)$ & $8180$ \\
FMNIST & $60000$ & $10000$ & $MLP (784, 10, 10, 10, 10)$ & $8180$ \\
\hline
\end{tabular}

Table 1: Datasets used in the experiments and MLPs fitted to these datasets. Training and test dataset sample sizes as well as MLP parameter dimensions are shown.

The noisy XOR training and test datasets are visualized in Figure 9 of Appendix B: noisy XOR dataset. Random perturbations of $(0, 0)$ and of $(1, 1)$ , corresponding to gray and yellow points, are mapped to $0$ (circles). Moreover, random perturbations of $(0, 1)$ and of $(1, 0)$ , corresponding to purple and blue points, are mapped to $1$ (triangles). More information about the simulation of noisy XOR can be found in Papamarkou et al. (2022).

Each MNIST and FMNIST image is firstly reshaped, by converting it from a $28 \times 28$ matrix to a vector of length $784 = 28 \times 28$ , and it is subsequently standardized. This image reshaping explains why the $MLP (784, 10, 10, 10, 10)$ model, which is fitted to MNIST and FMNIST, has an input layer width of $784$ .

5.1 Experimental configuration

Binary classification for noisy XOR is performed via the likelihood function based on binary cross-entropy, as described in Papamarkou et al. (2022). Multiclass classification for MNIST and FMNIST is performed via the likelihood function given by Equation (3.3), which is based on cross-entropy.

The sigmoid activation function is applied at each hidden layer of each MLP of Table 1. Furthermore, the sigmoid activation function is also applied at the output layer of $MLP (2, 2, 1)$ and of $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ , conforming to the employed likelihood function for binary classification. The softmax activation function is applied at the output layer of $MLP (784, 10, 10, 10, 10)$ , in accordance with likelihood function (3.3) for multiclass classification. The same $MLP (784, 10, 10, 10, 10)$ model is fitted to the MNIST and FMNIST datasets.

A normal prior $π (θ) \sim N (0, 10 I)$ is adopted for the parameters $θ \in R^{n}$ of each MLP model shown in Table 1. A relatively high variance, equal to $10$ , is assigned a priori to each coordinate of $θ$ via the isotropic covariance matrix $10 I$ .
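Written out, the log-density of this prior takes a simple form. The expression below is the standard log-density of an isotropic normal distribution with variance $10$ in dimension $n$ ; $θ_{j}$ denotes the $j$ -th coordinate of $θ$ (notation introduced here), and $\pi$ inside the logarithm is the mathematical constant rather than the prior density.

\begin{equation*}
\log π (θ) = - \frac{n}{2} \log (20 \pi) - \frac{1}{20} \sum_{j = 1}^{n} θ_{j}^{2} .
\end{equation*}

So, the prior contributes a quadratic penalty with weight $1 / 20$ per coordinate to the log-posterior.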

NBG sampling is run upon fitting $MLP (2, 2, 1)$ and $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ to the noisy XOR training set, while FNBG sampling is run upon fitting $MLP (784, 10, 10, 10, 10)$ to the MNIST and FMNIST training sets. So, parameters are grouped by node in $MLP (2, 2, 1)$ and $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ , whereas multiple parameter groups per node are formed in the first hidden layer of $MLP (784, 10, 10, 10, 10)$ as elaborated in Subsection 4.4. Parameters are grouped by node from the second hidden layer onwards in $MLP (784, 10, 10, 10, 10)$ . All three MLPs of Table 1 are relatively shallow neural networks. However, the input layer width of $MLP (784, 10, 10, 10, 10)$ is two orders of magnitude larger than that of $MLP (2, 2, 1)$ and $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ . So, the higher dimension of MNIST and FMNIST input data necessitates finer node-blocking in the first hidden layer of $MLP (784, 10, 10, 10, 10)$ . On the other hand, the smaller dimension of noisy XOR input data implies that finer blocks per node are not required in the first hidden layer of $MLP (2, 2, 1)$ or of $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ .

A normal proposal density is chosen for each parameter block. The variance of each proposal density is a hyperparameter, which makes it possible to tune the magnitude of proposal steps separately for each parameter block. Proposal variances are tuned empirically.
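To illustrate how block-specific proposal variances enter the sampler, the following is a minimal sketch of Metropolis-within-Gibbs sampling with one normal random-walk proposal per block. It is not the eeyore implementation: the function name blocked_metropolis, the toy target (the $N (0, 10 I)$ prior density alone, with no likelihood) and the two blocks are assumptions made for illustration, and the target is evaluated in full rather than on minibatches.

```python
import math
import random


def blocked_metropolis(log_target, theta0, blocks, proposal_vars, n_iter, seed=0):
    """Metropolis-within-Gibbs with a normal random-walk proposal per block.

    log_target    : maps a parameter list to an unnormalized log-density
    blocks        : list of index lists partitioning range(len(theta0))
    proposal_vars : one proposal variance per block (tunable hyperparameters)
    Returns the chain and the acceptance rate of each block.
    """
    rng = random.Random(seed)
    theta = list(theta0)
    current = log_target(theta)
    accepted = [0] * len(blocks)
    chain = []
    for _ in range(n_iter):
        for b, (indices, var) in enumerate(zip(blocks, proposal_vars)):
            proposal = list(theta)
            for j in indices:
                proposal[j] += rng.gauss(0.0, math.sqrt(var))
            candidate = log_target(proposal)
            # Symmetric proposal: accept with probability min(1, target ratio).
            if math.log(1.0 - rng.random()) < candidate - current:
                theta, current = proposal, candidate
                accepted[b] += 1
        chain.append(list(theta))
    return chain, [count / n_iter for count in accepted]


def log_target(theta):
    """Toy target: log-density of N(0, 10 I) up to an additive constant."""
    return -sum(t * t for t in theta) / (2.0 * 10.0)


chain, rates = blocked_metropolis(
    log_target,
    theta0=[0.0] * 4,
    blocks=[[0, 1], [2, 3]],
    proposal_vars=[0.04, 1.0],
    n_iter=2000,
)
```

In this toy setting, the block with the smaller proposal variance attains the higher acceptance rate, which is the trade-off that per-block tuning exploits.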

For noisy XOR, $m = 10$ Markov chains are realized, whereas $m = 1$ chain is realized for each of MNIST and FMNIST due to computational resource limitations. Each chain realization is run for $110000$ iterations, $10000$ of which are discarded as burn-in. Thereby, $v = 100000$ post-burn-in iterations are retained per chain realization. Acceptance rates are computed from all $100000$ post-burn-in iterations per chain.

Monte Carlo approximations of posterior predictive pmfs are computed according to Equation (3.6) for each data point of each test set. To reduce the computational cost, the last $v = 10000$ iterations of each realized chain are used in Equation (3.6).

Predictions for noisy XOR are made using the binary classification rule mentioned in Papamarkou et al. (2022). Predictions for MNIST and for FMNIST are made using the multiclass classification rule specified by Equation (3.7). Given a single chain realization based on a training set, predictions are made for every point in the corresponding test set; the predictive accuracy is then computed as the number of correct predictions over the total number of points in the test set. For the noisy XOR test set, the mean of predictive accuracies across the $m = 10$ realized chains is reported. For the MNIST and FMNIST test sets, the predictive accuracy based on the corresponding single chain realization ( $m = 1$ ) is reported.
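The prediction pipeline described above can be sketched in a few lines. The sketch assumes that Equation (3.6) is the usual Monte Carlo average of network output probabilities over retained chain draws and that Equation (3.7) selects the most probable class; the function names and the numerical outputs are hypothetical.

```python
def posterior_predictive(class_probs_per_draw):
    """Average class probabilities over retained chain draws.

    class_probs_per_draw[t][k] is the network output probability of class k
    at the t-th retained parameter draw, for a single test point.
    """
    n_draws = len(class_probs_per_draw)
    n_classes = len(class_probs_per_draw[0])
    return [sum(draw[k] for draw in class_probs_per_draw) / n_draws
            for k in range(n_classes)]


def predict(pmf):
    """Return the class with the highest posterior predictive probability."""
    return max(range(len(pmf)), key=pmf.__getitem__)


def accuracy(predictions, labels):
    """Fraction of correct predictions over the test set."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical network outputs: 2 test points, 3 retained draws, 3 classes.
outputs = [
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]],
    [[0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.3, 0.3, 0.4]],
]
pmfs = [posterior_predictive(draws) for draws in outputs]
predictions = [predict(pmf) for pmf in pmfs]
test_accuracy = accuracy(predictions, labels=[0, 1])
```

The second test point shows why averaging matters: the first draw alone favours class $1$, whereas the averaged pmf favours class $2$.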

5.2 Exact versus approximate MCMC

An illustrative comparison between approximate and exact NBG sampling is made in terms of acceptance rate, predictive accuracy and runtime. The comparison between approximate and exact NBG sampling is carried out in the context of noisy XOR only, since exact MCMC is not feasible for the MNIST and FMNIST examples due to vanishing acceptance rates and high computational requirements.

$MLP (2, 2, 2, 2, 2, 2, 2, 1)$ is fitted to the noisy XOR training set under four scenarios. For scenario $1$ , approximate NBG sampling is run with a batch size of $100$ to simulate $m = 10$ chains. For scenario $2$ , exact NBG is run on the whole training set of $5000$ data points to simulate $10$ chains. For scenario $3$ , exact NBG is run until $10$ chains are obtained, each having an acceptance rate $\geq 5 %$ . For scenario $4$ , exact NBG is run until $10$ chains are acquired, each with an acceptance rate $\geq 20 %$ . In total, $11$ and $23$ chains have been run under scenarios $3$ and $4$ , respectively, to get $10$ chains that satisfy the acceptance rate lower bounds in each scenario.

For approximate NBG sampling (scenario $1$ ), the proposal variance is set to $0.04$ . For the three exact NBG sampling scenarios, the proposal variance is lowered to $0.001$ in order to mitigate decreased acceptance rates in the presence of increased sample size ( $5000$ training data points) relative to the batch size of $100$ used in approximate sampling.

(a) Acceptance rate boxplots. The left and right boxplot in each pair correspond to approximate and exact NBG.

Figure 3(a) displays boxplots of node-specific acceptance rates for approximate and exact NBG sampling without lower bound conditions on acceptance rates (scenarios $1$ and $2$ ). A pair of boxplots is shown for each of the $13$ nodes in the six hidden layers and one output layer of $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ . The left and right boxplots per pair correspond to approximate and exact NBG sampling. Blue lines represent medians.

Three empirical observations are drawn from Figure 3(a). First of all, approximate NBG attains higher acceptance rates than exact NBG according to the (blue) medians, despite setting higher proposal variance in the former in comparison to the latter ( $0.04$ and $0.001$ , respectively). Secondly, approximate NBG attains less volatile acceptance rates than exact NBG as seen from the boxplot interquartile ranges. Acceptance rates for exact NBG range from near $0 %$ to about $50 %$ as neural network depth increases, exhibiting lack of stability due to entrapment in local modes in some chain realizations. Thirdly, acceptance rates decrease as depth increases. For instance, exact NBG yields median acceptance rates of $63.83 %$ and $20.72 %$ in nodes $1$ and $13$ , respectively. The attenuation of acceptance rate with depth is further discussed in Subsection 5.3.

Figure 3(b) shows boxplots of predictive accuracies for the four scenarios under consideration. Approximate NBG has a median predictive accuracy of $98.88 %$ , with interquartile range concentrated around the median and with a single outlier ( $87.25 %$ ) in $10$ chain realizations. Exact NBG without conditions on acceptance rate and exact NBG conditioned on acceptance rate $\geq 5 %$ have lower median predictive accuracies ( $86.92 %$ and $95.38 %$ ) and higher interquartile ranges than approximate NBG. Exact NBG conditioned on acceptance rate $\geq 20 %$ attains a median predictive accuracy of $100 %$ ; nine out of $10$ chain realizations yield $100 %$ accuracy, and one chain gives an outlier accuracy of $72.83 %$ . The overall conclusion is that approximate NBG retains a predictive advantage over exact NBG, since minibatch sampling ensures consistency in terms of high predictive accuracy and reduced predictive variability. Exact NBG conditioned on higher acceptance rates can yield near-perfect predictive accuracy in the low parameter and data dimensions of the toy noisy XOR example, but stability and computational issues arise, as many chains with near-zero acceptance rates are discarded before $10$ chains with the required level of acceptance rate ( $\geq 20 %$ ) are obtained.

Figure 3(c) shows a barplot of runtimes (in hours) for the four scenarios under consideration. Purple bars represent runtimes for the $10$ retained chains per scenario, whereas gray bars indicate runtimes for the chains that have been discarded due to unmet acceptance rate requirements. As seen from a comparison between purple bars, approximate NBG has shorter runtime (for retained chains of the same length) than exact NBG, which is explained by the fact that minibatching uses a subset of the training set at each approximate NBG iteration. A comparison between gray bars in scenarios $3$ and $4$ demonstrates that exact NBG runtimes for discarded chains increase with increasing acceptance rate lower bounds. Figures 3(b) and 3(c) jointly show that predictive accuracy improvements of exact NBG (arising from higher acceptance rate lower bounds) come at higher computational costs.

5.3 Effect of depth on acceptance rate

Figure 4 displays mean acceptance rates across $m = 10$ chains realized via minibatch NBG upon fitting $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ to noisy XOR. In particular, Figure 4(a) shows the mean acceptance rate for each node in the six hidden layers and one output layer of $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ , while Figure 4(b) shows the mean acceptance rate for each of these seven (six hidden and one output) layers. A batch size of $100$ is used for minibatch NBG. The same set of $10$ chains has been used in Figures 3 and 4.

Figures 3(a) and 4(a) provide alternative views of node-specific acceptance rates. The former figure represents such information via boxplots and medians, whereas the latter makes use of a barplot of associated means.

Figure 4 demonstrates that if the proposal variance is the same for all parameter blocks across layers, then the acceptance rate reduces with depth. For instance, it can be seen in Figure 4(b) that the acceptance rates for hidden layers $1$ , $2$ and $3$ are $56.31 %$ , $36.18 %$ and $26.56 %$ , respectively.

Using a common proposal variance for all parameter blocks across layers generates disparities in acceptance rates, with higher rates in shallower layers and lower rates in deeper layers. These disparities become more pronounced with big data or with high parameter dimensions. For example, sampling $MLP (784, 10, 10, 10, 10)$ parameters with the same proposal variance in all parameter blocks is not feasible in the case of MNIST or FMNIST; the acceptance rates are high in the first hidden layer and drop near zero in the output layer. FNBG sampling makes it possible to reduce the proposal variance for deeper layers, thus avoiding vanishing acceptance rates with increasing depth.

Tables 5 and 6 of Appendix C: tuning per layer exemplify empirically tuned proposal variances for minibatch FNBG sampling of $MLP (784, 10, 10, 10, 10)$ parameters in the respective cases of MNIST and FMNIST. Batch sizes of $600, 1800, 3000$ and $4200$ are employed, corresponding to $1 %, 3 %, 5 %$ and $7 %$ of the MNIST and FMNIST training sets. For each of these four batch sizes and for each training set, the proposal variance per layer is tuned empirically, and subsequently the acceptance rate per layer is computed from a chain realization. Tables 5 and 6 demonstrate that if proposal variances are reduced in deeper layers, then acceptance rates do not vanish with depth. At equal proposal variances, acceptance rates drop across all layers when the batch size increases from $3000$ to $4200$ , as expected when shifting from approximate towards exact MCMC.

As part of Table 5, a chain is simulated upon fitting $MLP (784, 10, 10, 10, 10)$ to the MNIST training set via minibatch FNBG sampling with a batch size of $3000$ . Figure 5, which comprises a grid of $4 \times 2 = 8$ traceplots, is produced from that chain. Each row of Figure 5 is related to one of the $8180$ parameters of $MLP (784, 10, 10, 10, 10)$ . More specifically, the first, second, third and fourth row correspond to parameter $θ_{1005}$ in hidden layer $1$ , parameter $θ_{7872}$ in hidden layer $2$ , parameter $θ_{8008}$ in hidden layer $3$ and parameter $θ_{8107}$ in the output layer. A pair of traceplots per parameter is shown in each row; the right traceplot is more zoomed out than the left one. All traceplots in the right column share a common range of $[- 8, 8]$ in their vertical axes.

Figure 5: Markov chain traceplots of four parameter coordinates of $MLP (784, 10, 10, 10, 10)$ , which is fitted to MNIST via minibatch FNBG sampling with a batch size of $3000$ . Each row displays two traceplots of the same chain for a single parameter; the traceplot on the right is more zoomed-out than the one on the left. The traceplots of the right column share a common range on the vertical axes. Vertical dotted lines indicate the end of burn-in.

It is observed that the zoomed-in traceplots (left column of Figure 5) do not exhibit entrapment in local modes irrespective of network depth, agreeing with the non-vanishing acceptance rates of Table 5. Furthermore, it is seen from the zoomed-out traceplots (right column of Figure 5) that chain scales decrease in deeper layers. For example, the right traceplot of parameter $θ_{8107}$ (output layer) has non-visible fluctuations under a y-axis range of $[- 8, 8]$ , whereas the right traceplot of parameter $θ_{1005}$ (first hidden layer) fluctuates more widely under the same y-axis range.

Figure 5 suggests that chains of parameters in shallower layers perform more exploration, while chains of parameters in deeper layers carry out more exploitation. This way, chain scales collapse towards point estimates for increasing network depth.

5.4 Effect of batch size on log-likelihood

For each batch size shown in Figure 6(a), the log-likelihood function corresponding to Equation (3.3) is evaluated on $10$ batch samples, which are drawn from the MNIST training set. A boxplot is then generated from the $10$ log-likelihood values and it is displayed in Figure 6(a). The log-likelihood function is normalized by batch size in order to obtain visually comparable boxplots across different batch sizes. In PyTorch, the normalized log-likelihood is computed via the CrossEntropyLoss class initialized with reduction='mean', whose output is the negative of the normalized log-likelihood. In each boxplot, the blue line and yellow point correspond to the median and mean of the $10$ associated log-likelihood values. The horizontal gray line represents the log-likelihood value based on the whole MNIST training set. Figure 6(b) is generated using the FMNIST training set, following an analogous setup.

(a) Log-likelihood value boxplots for MNIST.

Figures 6(a) and 6(b) demonstrate that log-likelihood values are increasingly volatile for decreasing batch size. Furthermore, the volatility of log-likelihood values vanishes as the batch size gets close to the training sample size. So, Figure 6 confirms visually that the approximate likelihood tends to the exact likelihood for increasing batch size. Thus, the batch size in FNBG sampling should be set as large as possible, provided that (finer) block acceptance rates do not become prohibitively low.
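The dependence of log-likelihood volatility on batch size can be reproduced on synthetic numbers. The sketch below uses only the Python standard library rather than PyTorch; the probabilities assigned to the true classes are randomly generated placeholders, not MNIST results, and the function names are hypothetical.

```python
import math
import random
import statistics


def normalized_log_likelihood(true_class_probs):
    """Mean log-probability assigned to the true class over a batch.

    This equals the log-likelihood divided by the batch size, i.e. the
    negative of a cross-entropy loss with mean reduction.
    """
    total = math.fsum(math.log(p) for p in true_class_probs)
    return total / len(true_class_probs)


rng = random.Random(0)
# Placeholder probabilities of the true class for 1000 training points.
train_probs = [rng.uniform(0.05, 1.0) for _ in range(1000)]
full_value = normalized_log_likelihood(train_probs)


def batch_values(batch_size, n_batches=10):
    """Normalized log-likelihood on random batches drawn without replacement."""
    return [normalized_log_likelihood(rng.sample(train_probs, batch_size))
            for _ in range(n_batches)]


# Standard deviation of the 10 batch values for each batch size.
spread = {size: statistics.pstdev(batch_values(size))
          for size in (10, 100, 1000)}
```

When the batch size equals the training sample size, every batch is a permutation of the whole set, so the spread collapses to zero and each batch value coincides with the full-data value.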

5.5 Effect of depth on prediction

Figure 7 explores how network depth affects predictive accuracy in approximate MCMC. Shallower $MLP (2, 2, 1)$ , consisting of one hidden layer, and deeper $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ , consisting of six hidden layers, are fitted to the noisy XOR training set using minibatch NBG with a batch size of $100$ and a proposal variance of $0.04$ ; $m = 10$ chains are realized for each of the two MLPs. Subsequently, the predictive accuracy per chain is evaluated on the noisy XOR test set. One boxplot is generated for each set of $10$ chains, as shown in Figure 7. Blue lines represent medians.

Figure 7: A comparison between a shallower and a deeper MLP architecture. Each of $MLP (2, 2, 1)$ and $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ is fitted to noisy XOR via minibatch NBG sampling with a batch size of $100$ . Predictive accuracy boxplots are generated from $10$ chains per MLP. Blue lines indicate medians.

The same $10$ chains are used to generate relevant plots in Figures 3(b), 4 and 7. In particular, the leftmost boxplot in Figure 3(b) and right boxplot in Figure 7 stem from the same $10$ chains and are thus identical. Figure 4 shows mean acceptance rates per node and per layer across the $10$ chains that also yield the right boxplot of predictive accuracies in Figure 7.

$MLP (2, 2, 1)$ and $MLP (2, 2, 2, 2, 2, 2, 2, 1)$ have respective predictive accuracy medians of $86.75 %$ and $98.88 %$ as blue lines indicate in Figure 7, so predictive accuracy increases with increasing depth. Moreover, the interquartile ranges of Figure 7 demonstrate that a deeper architecture yields less volatile, and in that sense more stable, predictive accuracy. As an overall empirical observation, increasing the network depth in approximate MCMC seems to produce higher and less volatile predictive accuracy.

5.6 Effect of batch size on prediction

This subsection assesses empirically the effect of batch size on predictive accuracy in approximate MCMC. To this end, $MLP (784, 10, 10, 10, 10)$ is fitted to the MNIST and FMNIST training sets using minibatch FNBG sampling with batch sizes of $600$ , $1800$ , $3000$ and $4200$ , which correspond to $1 %$ , $3 %$ , $5 %$ and $7 %$ of each training sample size. One chain is realized per combination of training set and batch size. Table 2 reports the predictive accuracy for each chain.

Dataset	Batch size
	1%	3%	5%	7%
	0.6K	1.8K	3K	4.2K
MNIST	85.99	89.01	90.75	90.43
FMNIST	71.50	80.07	80.89	79.17

Table 2: Predictive accuracies (%) obtained by fitting $MLP (784, 10, 10, 10, 10)$ to MNIST and to FMNIST via minibatch FNBG sampling with different batch sizes.

The same chains are used to compute predictive accuracies in Table 2 as well as acceptance rates in Tables 5 and 6 of Appendix C: tuning per layer. The chain that yields the predictive accuracy for MNIST and for a batch size of $3000$ (first row and third column of Table 2) is partly visualized by traceplots in Figure 5.

According to Table 2, the highest accuracies of $90.75 %$ for MNIST and of $80.89 %$ for FMNIST are attained by employing a batch size of $3000$ . Predictive accuracy increases as batch size increases from $600$ to $3000$ . However, predictive accuracy decreases when batch size increases from $3000$ to $4200$ ; this is explained by the fact that a batch size of $4200$ is too large, in the sense that it reduces acceptance rates (see Tables 5 and 6). So, as pointed out in Subsection 5.4, a tuning guideline is to increase the batch size up to the point that no substantial reduction in finer block acceptance rates occurs.

An attained predictive accuracy of $90.75 %$ on MNIST demonstrates that non-convergent chains (simulated via minibatch FNBG) learn from data, since data-agnostic guessing based on pure chance has a predictive accuracy of $10 %$ . While stochastic optimisation algorithms for deep learning achieve predictive accuracies higher than $90.75 %$ on MNIST, the goal of this work has not been to construct an approximate MCMC algorithm that outperforms stochastic optimisation on the predictive front. The main objective has been to demonstrate that approximate MCMC for neural networks learns from data and to uncover associated sampling characteristics, such as diminishing chain ranges (Figure 5) and diminishing acceptance rates (Tables 5 and 6) for increasing network depth. Similar predictive accuracies in the vicinity of $90 %$ using Hamiltonian Monte Carlo for deep learning have been reported in the literature (Wenzel et al., 2020; Izmailov et al., 2021). Nonetheless, this body of relevant work relies on chain lengths one or two orders of magnitude shorter. The present paper proposes to circumvent vanishing acceptance rates by grouping network parameters into smaller blocks, thus making it possible to generate lengthier chains.

5.7 Effect of chain length on prediction

Recall that $110000$ iterations are run per chain in the experiments herein, of which the first $10000$ are discarded as burn-in. The last $v = 10000$ (out of the remaining $100000$ ) iterations are used for making predictions via Bayesian marginalization based on Equation (3.6). Only $10000$ iterations are utilized in Equation (3.6) to cap the computational cost for predictions.

There exists a tractable solution to Bayesian marginalization, since the approximate posterior predictive pmf of Equation (3.6) can be computed in parallel both in terms of Monte Carlo iterations and of test points. The implementation of such a parallel solution is deferred to future work.
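As a rough indication of what such a parallel solution could look like, the sketch below distributes test points across workers. It is an assumption-laden illustration rather than the planned implementation: the function names and the numerical outputs are hypothetical, and a thread pool is used only to show the structure of the computation, since pure Python threads do not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def predictive_pmf(draw_outputs):
    """Average class probabilities over chain draws for one test point."""
    n_draws = len(draw_outputs)
    return [sum(column) / n_draws for column in zip(*draw_outputs)]


def marginalize_in_parallel(outputs_per_test_point, max_workers=4):
    """Compute one posterior predictive pmf per test point using a pool.

    Test points are independent given the chain, so each one can be handled
    by a separate worker; results are returned in the input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predictive_pmf, outputs_per_test_point))


# Hypothetical network outputs: 3 test points, 2 chain draws, 2 classes.
outputs = [
    [[0.9, 0.1], [0.7, 0.3]],
    [[0.4, 0.6], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
pmfs = marginalize_in_parallel(outputs)
```

The same decomposition applies along the chain: partial sums over disjoint chunks of iterations can be computed independently and then combined.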

In the meantime, it is examined here how chain length affects predictive accuracy. Along these lines, predictive accuracies are computed from the last $1000$ , $10000$ , $20000$ and $30000$ iterations of the chain realized via minibatch FNBG with a batch size of $3000$ for each of MNIST and FMNIST (see Table 3). The last $10000$ and all $100000$ post-burn-in iterations of the same chain generate predictive accuracies in Table 2 and acceptance rates in Tables 5 and 6, respectively.

Dataset	Chain length
	1K	10K	20K	30K
MNIST	88.31	90.75	91.12	91.20
FMNIST	78.93	80.89	81.36	81.53

Table 3: Predictive accuracies (%) obtained from different chain lengths. $MLP (784, 10, 10, 10, 10)$ is fitted to MNIST and to FMNIST via minibatch FNBG sampling with a batch size of $3000$ . One chain is realized per dataset. Subsequently, predictions are made via Bayesian marginalization using chunks of different length from the end of the realized chains.

Table 3 demonstrates that predictive accuracy increases (both for MNIST and FMNIST) as chain length increases. So, as a chain traverses the parameter space of a neural network, information of predictive importance accrues despite the lack of convergence. It can also be seen from Table 3 that the rate of improvement in predictive accuracy slows down for increasing chain length.

5.8 Effect of data augmentation on prediction

To assess the effect of data augmentation on predictive accuracy, three image transformations are performed on the MNIST and FMNIST training sets, namely rotations by angle, blurring, and colour inversions. Images are rotated by angles randomly selected between $- 30$ and $30$ degrees. Each image is blurred with probability $0.9$ . Blur is randomly generated from a Gaussian kernel of size $9 \times 9$ . The standard deviation of the kernel is randomly selected between $1$ and $1.5$ . Each image is colour-inverted with probability $0.5$ . Figure 10 in Appendix D: data augmentation displays examples of MNIST and FMNIST training images that have been rotated, blurred or colour-inverted according to the described transformations.

Each of the three transformations is applied to the whole MNIST and FMNIST training sets. Subsequently, $MLP (784, 10, 10, 10, 10)$ is fitted to each transformed training set via minibatch FNBG with a batch size of $3000$ and with proposal variances specified in Tables 5 and 6. One chain is simulated per transformed training set. Predictive accuracies are computed on the corresponding untransformed MNIST and FMNIST test sets and are reported in Table 4. Moreover, predictive accuracies based on the untransformed MNIST and FMNIST training sets are available in the first column of Table 4, as previously reported in Table 2.

Dataset	Transform
	None	Rotation	Blur	Inversion
MNIST	90.75	86.19	85.66	36.87
FMNIST	80.89	6.62	7.46	8.61

Table 4: Predictive accuracies (%) obtained from different data augmentation schemes. $MLP (784, 10, 10, 10, 10)$ is fitted to each of the augmented MNIST and FMNIST training sets via minibatch FNBG sampling with a batch size of $3000$ . Predictive accuracies are computed on the corresponding non-augmented test sets. The first column reports predictive accuracies based on the non-augmented MNIST and FMNIST training sets.

According to Table 4, if data augmentation is performed, then predictive accuracy deteriorates drastically. Notably, data augmentation has catastrophic predictive consequences for FMNIST. These empirical findings agree with the ‘dirty likelihood hypothesis’ of Wenzel et al. (2020), according to which data augmentation violates the likelihood principle.

5.9 Uncertainty quantification

Approximate MCMC enables predictive uncertainty quantification (UQ) via Bayesian marginalization. Such a principled approach to UQ constitutes an advantage of approximate MCMC over stochastic optimization in deep learning. This subsection showcases how predictive uncertainty is quantified for neural networks via minibatch FNBG sampling.

Recall that one chain has been simulated for each of MNIST and FMNIST to compute the predictive accuracies of column $3$ in Table 2 (see Subsection 5.6). Those chains are used to estimate posterior predictive probabilities for some images in the corresponding test sets, as shown in Figure 8. All test images in Figure 8 have been correctly classified via Bayesian marginalization.

(a) Predictive posterior probabilities for MNIST.

The first and second MNIST test images in Figure 8(a) show numbers $0$ and $7$ , with corresponding posterior predictive probabilities $0.98$ and $0.97$ that indicate near-certainty about the classification outcomes. The third MNIST test image in Figure 8(a) shows number $9$ . Attempting to classify this image by eye casts doubt as to whether the number in the image is $9$ or $4$ . While Bayesian marginalization correctly classifies the number as $9$ , the posterior predictive probability $\hat{p} (y = 9 | x, D_{1 : s}) = 0.35$ is relatively low, indicating uncertainty in the prediction. Moreover, the second highest posterior predictive probability $\hat{p} (y = 4 | x, D_{1 : s}) = 0.28$ identifies number $4$ as a probable alternative, in agreement with human perception. All in all, posterior predictive probabilities and human understanding are aligned in terms of perceived predictive uncertainties and in terms of plausible classification outcomes. Image $4$ is aligned with image $3$ of Figure 8(a) regarding UQ conclusions.

Figure 8(b), which entails FMNIST test images, is analogous to Figure 8(a) from a UQ point of view. In Figure 8(b), FMNIST test images $1$ and $2$ show trousers and a bag, with corresponding posterior predictive probabilities $0.99$ and $0.96$ that indicate near-certainty about the classification outcomes. The third FMNIST test image of Figure 8(b) shows a shirt. It is not visually clear whether this image depicts a shirt or a pullover. While Bayesian marginalization correctly identifies the object as a shirt, the posterior predictive probabilities $\hat{p} (y = \mathrm{shirt} | x, D_{1 : s}) = 0.33$ and $\hat{p} (y = \mathrm{pullover} | x, D_{1 : s}) = 0.32$ capture human uncertainty and identify the two most plausible classification outcomes. Image $4$ is analogous to image $3$ of Figure 8(b) in terms of UQ conclusions.
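Reading off the most plausible classes from a posterior predictive pmf is a matter of sorting. In the sketch below, the function name top_classes is hypothetical, and only the probabilities $0.35$ and $0.28$ (third MNIST image) and $0.98$ (first MNIST image) are taken from the text; the remaining probability mass is spread arbitrarily for illustration.

```python
def top_classes(pmf, k=2):
    """Return the k most probable (class, probability) pairs, most probable first."""
    ranked = sorted(enumerate(pmf), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Illustrative pmf over digits 0-9. Only the entries for classes 9 (0.35)
# and 4 (0.28) come from the text; the other entries are made up.
uncertain = [0.05, 0.04, 0.05, 0.05, 0.28, 0.04, 0.05, 0.05, 0.04, 0.35]

# Illustrative near-certain pmf with 0.98 on class 0; the rest is made up.
confident = [0.98] + [0.02 / 9] * 9

alternatives = top_classes(uncertain)
```

A small gap between the two leading probabilities, as in the uncertain example, flags a prediction that deserves scrutiny even when the top class is correct.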

6 Future work

Several future research directions emerge from this paper; two software engineering extensions are planned, two methodological developments are proposed, and one theoretical question is posed.

To start with planned software engineering work, Bayesian marginalization will be parallelized across test points and across FNBG iterations per test point. Additionally, an adaptive version of FNBG sampling will be implemented based on existing Gibbs sampling methods for proposal variance tuning, thus automating tuning and reducing its computational cost.

In terms of methodological developments, alternative ways of grouping parameters in FNBG sampling will be considered. For example, parameters may be grouped according to their covariance structure, as estimated from pilot FNBG runs. Moreover, FNBG sampling will be developed for neural network architectures other than MLPs. To this end, DAG representations of other neural network architectures will be devised and fine parameter blocks will be identified from the DAGs.

A theoretical question of interest is how to construct lower bounds of predictive accuracy for minibatch FNBG (and for minibatch MCMC more generally) as a function of the distance between the exact and approximate parameter posterior density. It has been observed empirically that minibatch FNBG has predictive capacity, yet theoretical guarantees for predictive accuracy have not been established.

Software and Data

The FNBG sampler for MLPs has been implemented under the eeyore package using Python and PyTorch. eeyore is available at https://github.com/papamarkou/eeyore. Source code for the examples of Section 5 can be found in dmcl_examples, forming a separate Python package based on eeyore. dmcl_examples can be downloaded from https://github.com/papamarkou/dmcl_examples.

Appendix A: proofs

Proof of Proposition 1.

Since the proposal density is symmetric, the Metropolis-Hastings acceptance probability is expressed as

\begin{equation}
a (θ_{z (q)}^{⋆}, θ_{z (q)}^{(t - 1)}) = \min \left\{ \frac{p (θ_{z (q)}^{⋆} | θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q + 1) : z (m)}^{(t - 1)}, D_{1 : s})}{p (θ_{z (q)}^{(t - 1)} | θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q + 1) : z (m)}^{(t - 1)}, D_{1 : s})}, 1 \right\} . \tag{6.1}
\end{equation}

According to the definition of conditional density, it holds that

\begin{equation}
p (θ_{z (q)} | θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q + 1) : z (m)}^{(t - 1)}, D_{1 : s}) = \frac{p (θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q)}, θ_{z (q + 1) : z (m)}^{(t - 1)} | D_{1 : s}) \, p (D_{1 : s})}{p (θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q + 1) : z (m)}^{(t - 1)}, D_{1 : s})} . \tag{6.2}
\end{equation}

By setting once $θ_{z (q)} = θ_{z (q)}^{⋆}$ and once $θ_{z (q)} = θ_{z (q)}^{(t - 1)}$ in Equation (6.2) and by plugging the resulting expressions into Equation (6.1), it follows that

\begin{equation}
a (θ_{z (q)}^{⋆}, θ_{z (q)}^{(t - 1)}) = \min \left\{ \frac{p (θ^{⋆} | D_{1 : s})}{p (θ^{(t - 1)} | D_{1 : s})}, 1 \right\} . \tag{6.3}
\end{equation}

The proof is completed by combining Equations (3.4), (6.3) and the assumed parameter prior density. ∎
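The cancellation behind Equation (6.3) can be made explicit. In the sketch below, $c$ is shorthand introduced here for the conditioning blocks $(θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q + 1) : z (m)}^{(t - 1)})$ , and Equation (6.2) is applied once to the numerator and once to the denominator of Equation (6.1). It is assumed, as the notation of Equation (6.3) suggests, that $θ^{⋆}$ and $θ^{(t - 1)}$ there denote the two full parameter vectors that differ only in block $z (q)$ .

\begin{equation*}
\frac{p (θ_{z (q)}^{⋆} | c, D_{1 : s})}{p (θ_{z (q)}^{(t - 1)} | c, D_{1 : s})} = \frac{p (θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q)}^{⋆}, θ_{z (q + 1) : z (m)}^{(t - 1)} | D_{1 : s}) \, p (D_{1 : s}) / p (c, D_{1 : s})}{p (θ_{z (1) : z (q - 1)}^{(t)}, θ_{z (q)}^{(t - 1)}, θ_{z (q + 1) : z (m)}^{(t - 1)} | D_{1 : s}) \, p (D_{1 : s}) / p (c, D_{1 : s})} .
\end{equation*}

The factor $p (D_{1 : s}) / p (c, D_{1 : s})$ does not depend on $θ_{z (q)}$ , so it cancels between numerator and denominator, leaving the ratio of joint posterior densities.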

Proof of Corollary 1.

Recall that the cross-entropy loss function is expressed as

\begin{equation}
E (θ, y_{1 : s}, x_{1 : s}) = - \sum_{i = 1}^{s} \sum_{k = 1}^{κ_{ρ}} 1_{\{y_{i} = k\}} \log (h_{ρ, k} (x_{i}, θ)) . \tag{6.4}
\end{equation}

For brevity, $E (θ, y_{1 : s}, x_{1 : s})$ is alternatively denoted by $E (θ, D_{1 : s})$ in the main text. It follows from Equations (6.4) and (3.3) that

\begin{equation}
L (y_{1 : s} | x_{1 : s}, θ) = \exp (- E (θ, y_{1 : s}, x_{1 : s})) . \tag{6.5}
\end{equation}

Replacing the likelihood function in Proposition 1 with Equation (6.5) completes the proof. ∎
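The identity of Equation (6.5) can be checked numerically on a toy example. The sketch assumes that Equation (3.3) is the categorical likelihood, that is, the product over data points of the output probability of the observed class; the function names and the softmax outputs below are hypothetical.

```python
import math


def cross_entropy(labels, class_probs):
    """Cross-entropy loss summed over data points, as in Equation (6.4).

    labels[i] is the class index of point i and class_probs[i][k] is the
    network output probability of class k at point i. The indicator in the
    inner sum keeps only the term of the observed class.
    """
    return -sum(math.log(class_probs[i][y]) for i, y in enumerate(labels))


def likelihood(labels, class_probs):
    """Categorical likelihood: product of observed-class probabilities."""
    return math.prod(class_probs[i][y] for i, y in enumerate(labels))


# Hypothetical softmax outputs for 3 data points and 3 classes.
labels = [0, 2, 1]
class_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.25, 0.5, 0.25]]
loss = cross_entropy(labels, class_probs)
lik = likelihood(labels, class_probs)
```

Here the likelihood is $0.7 \times 0.6 \times 0.5 = 0.21$ , which agrees with the exponential of the negated loss up to floating-point error.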

Appendix B: noisy XOR dataset

Figure 9 shows the noisy XOR training and test datasets used in Section 5. Information about how these noisy XOR datasets have been simulated is available in Papamarkou et al. (2022).

Figure 9: Noisy XOR training set (left) and test set (right) consisting of $5000$ and $1200$ data points, respectively.

Appendix C: tuning per layer

Tables 5 and 6 show that acceptance rates obtained from minibatch FNBG sampling can be retained at non-vanishing levels in deeper layers by reducing the proposal variances corresponding to these layers. $MLP (784, 10, 10, 10, 10)$ is fitted to MNIST and to FMNIST via minibatch FNBG sampling with different batch sizes. The acceptance rate per layer is computed from one chain for each batch size. Tables 5 and 6 report the obtained acceptance rates for MNIST and for FMNIST, respectively.

Layer		$σ$	Rate (%)
Batch size $= 600$ ( $1 %$ )
Hidden	$1^{\mathrm{st}}$	$5 \cdot 10^{- 2}$	45.56
	$2^{\mathrm{nd}}$	$5 \cdot 10^{- 4}$	26.43
	$3^{\mathrm{rd}}$	$5 \cdot 10^{- 4}$	26.28
Output		$5 \cdot 10^{- 5}$	29.18
Batch size $= 1800$ ( $3 %$ )
Hidden	$1^{\mathrm{st}}$	$2 \cdot 10^{- 2}$	41.41
	$2^{\mathrm{nd}}$	$2 \cdot 10^{- 4}$	30.68
	$3^{\mathrm{rd}}$	$2 \cdot 10^{- 4}$	31.92
Output		$2 \cdot 10^{- 5}$	35.66
Batch size $= 3000$ ( $5 %$ )
Hidden	$1^{\mathrm{st}}$	$10^{- 2}$	54.95
	$2^{\mathrm{nd}}$	$10^{- 4}$	45.73
	$3^{\mathrm{rd}}$	$10^{- 4}$	44.98
Output		$10^{- 5}$	51.54
Batch size $= 4200$ ( $7 %$ )
Hidden	$1^{\mathrm{st}}$	$10^{- 2}$	31.68
	$2^{\mathrm{nd}}$	$10^{- 4}$	20.17
	$3^{\mathrm{rd}}$	$10^{- 4}$	19.76
Output		$10^{- 5}$	22.22

Table 5: Acceptance rate per layer obtained by fitting $MLP (784, 10, 10, 10, 10)$ to MNIST via minibatch FNBG sampling with different batch sizes.

Layer		$σ$	Rate (%)
Batch size = 600 (1%)
Hidden	$1^{\text{st}}$	$5 \cdot 10^{-2}$	47.86
	$2^{\text{nd}}$	$5 \cdot 10^{-4}$	34.61
	$3^{\text{rd}}$	$5 \cdot 10^{-4}$	32.99
Output		$5 \cdot 10^{-5}$	37.73
Batch size = 1800 (3%)
Hidden	$1^{\text{st}}$	$2 \cdot 10^{-2}$	60.34
	$2^{\text{nd}}$	$2 \cdot 10^{-4}$	46.78
	$3^{\text{rd}}$	$2 \cdot 10^{-4}$	45.91
Output		$2 \cdot 10^{-5}$	52.07
Batch size = 3000 (5%)
Hidden	$1^{\text{st}}$	$10^{-2}$	66.94
	$2^{\text{nd}}$	$10^{-4}$	57.40
	$3^{\text{rd}}$	$10^{-4}$	58.48
Output		$10^{-5}$	64.64
Batch size = 4200 (7%)
Hidden	$1^{\text{st}}$	$10^{-2}$	55.28
	$2^{\text{nd}}$	$10^{-4}$	47.10
	$3^{\text{rd}}$	$10^{-4}$	47.19
Output		$10^{-5}$	52.75

Table 6: Acceptance rate per layer obtained by fitting $MLP (784, 10, 10, 10, 10)$ to FMNIST via minibatch FNBG sampling with different batch sizes.
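The effect of tuning the proposal scale per layer can be mimicked on a toy problem. The sketch below is not the FNBG sampler; it is a generic random-walk Metropolis-within-Gibbs scheme with one proposal standard deviation per block, applied to a hypothetical Gaussian target whose second block is far more concentrated than the first. All names, the target and the chosen scales are assumptions made for illustration.

```python
import math
import random


def blockwise_metropolis(log_target, theta, blocks, sigmas, n_iter, seed=0):
    """Random-walk Metropolis within Gibbs with one proposal scale per block.

    theta: initial state (list of floats); blocks: list of index lists;
    sigmas: one proposal standard deviation per block.
    Returns the final state and the acceptance rate (in %) of every block.
    """
    rng = random.Random(seed)
    theta = list(theta)
    accepted = [0] * len(blocks)
    current = log_target(theta)
    for _ in range(n_iter):
        for b, (idx, sigma) in enumerate(zip(blocks, sigmas)):
            proposal = list(theta)
            for j in idx:
                proposal[j] += rng.gauss(0.0, sigma)
            candidate = log_target(proposal)
            # Symmetric proposal: accept with probability min{exp(candidate - current), 1}.
            if math.log(1.0 - rng.random()) < candidate - current:
                theta, current = proposal, candidate
                accepted[b] += 1
    rates = [100.0 * a / n_iter for a in accepted]
    return theta, rates


# Hypothetical target: the second block is much more concentrated than the first.
def toy_log_target(theta, scales=(1.0, 1.0, 0.01, 0.01)):
    return -0.5 * sum((t / s) ** 2 for t, s in zip(theta, scales))


blocks = [[0, 1], [2, 3]]
_, rates_untuned = blockwise_metropolis(toy_log_target, [0.0] * 4, blocks, [0.5, 0.5], 2000)
_, rates_tuned = blockwise_metropolis(toy_log_target, [0.0] * 4, blocks, [0.5, 0.005], 2000)
```

With a common proposal scale, the acceptance rate of the concentrated block nearly vanishes, whereas reducing the scale of that block alone restores it to a usable level. This is the same qualitative pattern that motivates the smaller values of $σ$ for deeper layers in Tables 5 and 6.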

Appendix D: data augmentation

Figure 10 shows examples of images from the MNIST and FMNIST training sets transformed by rotation, blurring and colour inversion. These transformations, which are detailed in Subsection 5.8, are used there to assess the effect of data augmentation on predictive accuracy.

(a) Examples of transformed MNIST training images.
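For orientation, the three types of transformation can be sketched on a small greyscale image stored as a list of rows. The rotation angle (90 degrees clockwise), the 3x3 mean filter and the maximum pixel value of 255 are assumptions made for illustration; the settings actually used are those described in Subsection 5.8, and all function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def box_blur(image):
    """Blur with a 3x3 mean filter, averaging only over in-bounds neighbours."""
    height, width = len(image), len(image[0])
    blurred = []
    for r in range(height):
        row = []
        for c in range(width):
            values = [
                image[i][j]
                for i in range(max(r - 1, 0), min(r + 2, height))
                for j in range(max(c - 1, 0), min(c + 2, width))
            ]
            row.append(sum(values) / len(values))
        blurred.append(row)
    return blurred


def invert(image, max_value=255):
    """Invert pixel intensities, assuming values lie in [0, max_value]."""
    return [[max_value - pixel for pixel in row] for row in image]


# Toy 3x3 image (assumed values); real MNIST and FMNIST images are 28x28.
image = [
    [0, 255, 0],
    [0, 255, 0],
    [0, 255, 255],
]

rotated = rotate90(image)
blurred = box_blur(image)
inverted = invert(image)
```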

Acknowledgements

The author would like to acknowledge the assistance given by Research IT and the use of the Computational Shared Facility at The University of Manchester. This work used the Cirrus UK National Tier-2 HPC Service at EPCC (http://www.cirrus.ac.uk) funded by the University of Edinburgh and EPSRC (EP/P020267/1). The author would like to thank Google for the provision of free credit on Google Cloud Platform.

This work was presented at two seminars supported by a travel grant from the Dame Kathleen Ollerenshaw Trust, which is gratefully acknowledged.

The author would like to dedicate this paper to the memory of his mother, who died as this paper was being developed.

References

Alexos, A., Boyd, A. J. and Mandt, S. (2022). Structured stochastic gradient MCMC. In Proceedings of the 39th International Conference on Machine Learning 162, 414–434. PMLR.
Andrieu, C., de Freitas, J. F. G. and Doucet, A. (1999). Sequential Bayesian estimation and model selection applied to neural networks.
Andrieu, C., de Freitas, N. and Doucet, A. (2000). Reversible jump MCMC simulated annealing for neural networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 11–18.
Chen, T., Fox, E. and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning 32, 1683–1691. PMLR.
de Freitas, N. (1999). Bayesian methods for neural networks. PhD thesis, University of Cambridge.
de Freitas, N., Andrieu, C., Højen-Sørensen, P., Niranjan, M. and Gee, A. (2001). Sequential Monte Carlo methods for neural networks. In Sequential Monte Carlo Methods in Practice, 359–379. Springer New York.
Gong, W., Li, Y. and Hernández-Lobato, J. M. (2019). Meta-learning for stochastic gradient MCMC. In International Conference on Learning Representations.
Hastie, T., Tibshirani, R. and Friedman, J. (2016). The elements of statistical learning: data mining, inference and prediction, second ed. Springer.
He, K., Zhang, X., Ren, S. and Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Izmailov, P., Vikram, S., Hoffman, M. D. and Wilson, A. G. (2021). What are Bayesian neural network posteriors really like? In Proceedings of the 38th International Conference on Machine Learning 139, 4629–4640. PMLR.
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical Report, University of Toronto, Toronto, Ontario.
Lecun, Y., Bottou, L., Bengio, Y. and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324.
Minsky, M. L. and Papert, S. A. (1988). Perceptrons: expanded edition. MIT Press.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B. and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
Papamarkou, T., Hinkle, J., Young, M. T. and Womble, D. (2022). Challenges in Markov chain Monte Carlo for Bayesian neural networks. Statistical Science 37, 425–442.
Roberts, G. O. and Sahu, S. K. (1997). Updating schemes, correlation structure, blocking and parameterization for the Gibbs sampler. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59, 291–317.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review 65, 386.
Saul, L. and Jordan, M. (1995). Exploiting tractable substructures in intractable networks. In Advances in Neural Information Processing Systems 8. MIT Press.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, 681–688. Omnipress.
Wenzel, F., Roth, K., Veeling, B., Swiatkowski, J., Tran, L., Mandt, S., Snoek, J., Salimans, T., Jenatton, R. and Nowozin, S. (2020). How good is the Bayes posterior in deep neural networks really? In Proceedings of the 37th International Conference on Machine Learning 119, 10248–10259. PMLR.
Xiao, H., Rasul, K. and Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Zhang, R., Li, C., Zhang, J., Chen, C. and Wilson, A. G. (2020). Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations.