Bayesian Neural Network Language Modeling for Speech Recognition

Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu, Helen Meng

Boyang Xue, Shoukang Hu, Junhao Xu, Mengzhe Geng, Xunying Liu and Helen Meng are with the Department of System Engineering and Engineering Management, the Chinese University of Hong Kong, Hong Kong, China (email: byxue@se.cuhk.edu.hk; skhu@se.cuhk.edu.hk; jhxu@se.cuhk.edu.hk; mzgeng@se.cuhk.edu.hk; xyliu@se.cuhk.edu.hk; hmmeng@se.cuhk.edu.hk). Corresponding author: Xunying Liu. Our code has been released on https://github.com/AmourWaltz/BayesLMs.

Abstract

State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex. They are prone to overfitting and poor generalization when given limited training data. To this end, an overarching full Bayesian learning framework encompassing three methods is proposed in this paper to account for the underlying uncertainty in LSTM-RNN and Transformer LMs. The uncertainty over their model parameters, choice of neural activations and hidden output representations are modeled using Bayesian, Gaussian Process and variational LSTM-RNN or Transformer LMs respectively. Efficient inference approaches were used to automatically select the optimal network internal components to be Bayesian learned using neural architecture search. A minimal number of Monte Carlo parameter samples as low as one was also used. These allow the computational costs incurred in Bayesian NNLM training and evaluation to be minimized. Experiments are conducted on two tasks: AMI meeting transcription and Oxford-BBC LipReading Sentences 2 (LRS2) overlapped speech recognition using state-of-the-art LF-MMI trained factored TDNN systems featuring data augmentation, speaker adaptation and audio-visual multi-channel beamforming for overlapped speech. Consistent performance improvements over the baseline LSTM-RNN and Transformer LMs with point estimated model parameters and drop-out regularization were obtained across both tasks in terms of perplexity and word error rate (WER). In particular, on the LRS2 data, statistically significant WER reductions up to 1.3% and 1.2% absolute (12.1% and 11.3% relative) were obtained over the baseline LSTM-RNN and Transformer LMs respectively after model combination between Bayesian NNLMs and their respective baselines.

Index Terms: neural language models, Bayesian learning, model uncertainty, neural architecture search, speech recognition


I Introduction

Language models (LMs) are key components in many speech technology applications such as automatic speech recognition (ASR) systems. LMs are used to compute the probability of a word sequence $W = (w_{0}, w_{1}, \dots, w_{n})$ ,

P (W) = P (w_{0}, w_{1}, \dots, w_{n}) = \prod_{t = 1}^{n} P (w_{t} | w_{0}, \dots, w_{t - 1})

(1)

which can be further expressed as the product of individual word probabilities conditioned on their respective preceding contexts, where $w_{0}$ is typically a start symbol $<$s$>$ , and generally $P (w_{0}) = 1$ . The key research issue for statistical language modeling is to learn long-range contextual dependencies. Directly modeling long-span word histories using conventional back-off $n$ -gram models [1] generally leads to a severe data sparsity issue [2]. To this end, over the past few decades significant efforts have been made to develop artificial neural network based language modeling techniques [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]. Neural network language models (NNLMs) that represent longer span history contexts in a continuous and lower dimensional vector space can be used to improve generalization performance.
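As a minimal illustration of the chain rule in Equation (1), the sketch below scores a sentence by summing log conditional probabilities. The table `toy_cond_prob` and its values are hypothetical and are not taken from any model or experiment in this paper.

```python
import math

# Hypothetical conditional probabilities P(w_t | w_0, ..., w_{t-1}),
# keyed by (history tuple, next word). Values are illustrative only.
toy_cond_prob = {
    (("<s>",), "the"): 0.5,
    (("<s>", "the"), "cat"): 0.2,
    (("<s>", "the", "cat"), "sat"): 0.1,
}

def sentence_log_prob(words, cond_prob):
    """Chain rule of Equation (1): log P(W) = sum_t log P(w_t | w_0..w_{t-1}).

    words[0] is the start symbol <s>, for which P(w_0) = 1.
    """
    log_p = 0.0
    for t in range(1, len(words)):
        history = tuple(words[:t])
        log_p += math.log(cond_prob[(history, words[t])])
    return log_p

log_p = sentence_log_prob(["<s>", "the", "cat", "sat"], toy_cond_prob)
# Perplexity normalized by the 3 predicted words (the start symbol is excluded).
perplexity = math.exp(-log_p / 3)
```

Working in the log domain avoids numerical underflow for long sentences, and the same quantity underlies the perplexity figures commonly reported for LMs.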

With the rapid progress of deep neural network (DNN) based ASR technologies in recent decades, the underlying model architectures of NNLMs have evolved from feedforward structures [3, 4, 5, 6] to more advanced variants represented by long short-term memory recurrent neural networks (LSTM-RNNs) [7, 8, 9, 10], [18] and recently neural Transformers [11, 12, 13, 14], [19] that are designed to model longer range contexts. In particular, Transformer based NNLMs in recent years have defined state-of-the-art performance across a range of ASR tasks [11, 12, 13, 14], [20]. These models [11, 12, 13], [20] are often constructed using a deep stacking of multiple self-attention based neural building blocks [21, 22, 23], each of which also includes residual connections [24] and layer normalization modules [25]. Additional positional encoding layers [19, 26] are used to augment the self-attention modules with word sequence order information. Performance improvements over conventional LSTM-RNN LMs have been reported [11, 27].

In state-of-the-art ASR systems based on both the conventional hybrid DNN-HMM architecture [28, 29, 30, 31, 32, 33] and the recent end-to-end (E2E) modeling paradigm represented by listen, attend and spell (LAS) [34], connectionist temporal classification (CTC) [35], RNN transducers (RNN-T) [36] and Transformers [37], the use of separate or externally fused LSTM-RNN or Transformer based language models [9, 10, 11, 12, 13, 38, 39, 40, 41] that can benefit from the use of more text data in addition to the speech audio transcripts is essential.

However, the highly complex neural architecture designs of LSTM-RNN and Transformer LMs often lead to a large increase in the overall system complexity. When given limited training data of a target task domain, the use of point estimate based, deterministic model parameters and neural structures in LSTM-RNN and Transformer LMs makes handling model uncertainty, and mitigating the risk of over-fitting and poor generalization, a salient challenge. One popular approach to address this issue in neural language models, and many other deep learning systems, is based on the dropout method [42, 43, 44, 45], a simple and effective regularization approach. Similar approaches include weight noise [46, 47, 48], which injects random noise directly into neural network parameters, and the incorporation of an additional L1 [49] or L2 [50] penalty term or maximum a posteriori (MAP) estimation [51, 52] into the training cost function to improve generalization. However, these intuitive regularization techniques do not provide a full Bayesian framework to model the underlying uncertainty when estimating complex and over-parameterized DNN models on limited training data.

A more general solution considered in this paper to address the above model uncertainty issue for NNLMs is based on Bayesian neural network learning. In the machine learning community, Bayesian learning has been established as a well formulated framework to account for the underlying uncertainty over model parameters [53, 54, 55, 56, 57], or hidden layer output representations [58, 59] in artificial neural network systems. Previous studies have further shown that the commonly used dropout method can be formulated as a special instance of Bayesian neural networks [45]. Within the speech community, previous research works on Bayesian deep learning approaches were conducted mainly in the context of either acoustic model components of conventional hybrid DNN-HMM systems [60, 61, 62, 63], or recently E2E architectures [47]. They have also been successfully applied to speaker adaptation [64, 65] and speaker verification [66] tasks.

In contrast, limited previous research works conducted on Bayesian neural network language modeling approaches have been largely restricted to conventional RNN [51, 52, 67, 68], LSTM-RNN and gated recurrent unit (GRU) based NNLMs [69, 70]. There are several issues associated with these prior studies. First, Bayesian learning approaches have not been studied in the context of Transformer LMs for speech technology applications. Instead, the only previous works in this direction [71, 72] were conducted on machine translation and probabilistic programming tasks. Second, these earlier studies focused on modeling the uncertainty over neural network parameters while lacking a holistic comparison against techniques designed to model the additional uncertainty associated with the underlying neural architectures and hidden layer output representations. Finally, when the Monte Carlo parameter sampling approach [73, 74] commonly used for Bayesian neural network inference is applied to all individual layers of LSTM-RNN or Transformer LMs, the computational cost incurred in Bayesian model estimation grows exponentially with respect to the number of hidden layers in these systems. This leads to a major scalability issue that limits the practical application of Bayesian neural language modeling approaches.

In order to address these issues, this paper proposes a mathematically well-defined full Bayesian learning framework to account for model uncertainty in both LSTM-RNN and Transformer LMs. Three full Bayesian learning based approaches are proposed in this paper. The uncertainty over deterministic model parameters is modeled using Bayesian LSTM-RNN and Transformer LMs. The uncertainty over their underlying neural architecture design is further addressed using non-parametric Gaussian Process (GP) based neural activation functions. This leads to Gaussian Process LSTM-RNN and Transformer LMs. Variational LSTM-RNN and Transformer LMs are utilized to model the uncertainty over the hidden layer outputs in their respective conventional counterparts. A complete and side by side comparison between these three Bayesian NNLM approaches is drawn. All the latent variable distributions considered in these three Bayesian methods are estimated using efficient variational inference based approaches. Efficient inference approaches are used to automatically select the optimal network internal components to be Bayesian estimated using neural architecture search (NAS). A minimal number of Monte Carlo parameter samples, as low as one, is also used. These allow the computational costs incurred in Bayesian NNLM training and evaluation to be minimized for both LSTM-RNN and Transformer LMs.

Experiments are conducted on two different tasks: a) AMI meeting transcription [75]; b) a multi-channel overlapped speech recognition task on the Oxford-BBC LipReading Sentences 2 (LRS2) corpus [76] using state-of-the-art LF-MMI trained factored TDNN systems [31] featuring speed perturbation, i-Vector [77, 78, 79] and learning hidden unit contribution (LHUC) based speaker adaptation [80], in addition to the use of audio-visual multi-channel beamforming for overlapped speech recognition. Consistent performance improvements over the baseline LSTM-RNN and Transformer LMs using point estimated model parameters and hidden activation configurations as well as drop-out regularization were obtained across both tasks in terms of perplexity and word error rate (WER). In particular, on the LRS2 multi-channel overlapped speech recognition task, statistically significant average WER reductions of 1.3% and 1.2% absolute (12.1% and 11.3% relative) were obtained over the baseline LSTM-RNN and Transformer LMs respectively after model combination between Bayesian NNLMs and their respective baselines. A signal-to-noise ratio (SNR) computed over the variational Gaussian distribution mean introduced in [47, 57] was also used to measure the parameter uncertainty in the proposed Bayesian NNLMs.

The main contributions of this paper are summarized below:

1) To the best of our knowledge, this paper is the first work to systematically investigate a mathematically well grounded, full Bayesian framework to account for the underlying uncertainty over model parameters, neural activations and hidden layer output representations in both LSTM-RNN and Transformer based LMs. In contrast, prior research works on Bayesian neural network language modeling approaches were limited to conventional RNN [51, 52, 67, 68], LSTM-RNN and GRU based NNLMs [69, 70]. In this paper, a complete and side by side contrast between these three Bayesian neural language modeling approaches across multiple speech recognition tasks is drawn and serves to provide insights on how to design practical Bayesian LSTM-RNN and Transformer LMs.

2) This paper also presents the first investigation of Gaussian Process and variational Transformer LMs published to date to account for additional uncertainty over neural activations and hidden layer output representations. In contrast, the prior research work on Bayesian Transformer LMs [81] considered modeling parametric uncertainty only.

3) Efficient inference algorithms are proposed for various Bayesian learning based LSTM-RNN and Transformer LMs. In addition to the use of a minimal number of Monte Carlo parameter samples drawn as low as one, a novel neural architecture search based method is proposed to automatically locate the most important network internal components to be Bayesian estimated, for example, a small number of lower positioned Transformer layers exhibiting larger uncertainty than others due to higher variability in their respective data inputs. This allows the model training costs for various Bayesian LSTM-RNN and Transformer LMs to be comparable to those of conventional NNLMs with point estimate based parameters, and improves their scalability when deeper model architectures are used. In contrast, previous research works [70, 81] manually selected a subset of hidden layers for Bayesian estimation, while an exhaustive search over all neural network component combinations was computationally infeasible.

The rest of this paper is organized as follows. Section II reviews the conventional LSTM-RNN and Transformer based LMs. Section III presents three full Bayesian learning approaches for LSTM-RNN and Transformer LMs. The accompanying set of implementation issues to improve their efficiency are presented in section IV. Experiments and results are shown in section V. Finally, conclusions are drawn and future works discussed in section VI.

II Neural Network Language Models

This section reviews neural network language models based on the LSTM-RNN and Transformer architectures.

II-A LSTM-RNN Language Models

In conventional LSTM-RNN language models, the word probability is computed as:

P (w_{t} | w_{0 : t - 1}) \approx P (w_{t} | h_{t - 1}, w_{t - 1}) = P (w_{t} | h_{t})

(2)

where $w_{0 : t - 1}$ denotes the word sequence $(w_{0}, \dots, w_{t - 1})$ and $h_{t - 1} \in R^{D}$ is the hidden state that encodes the previous word sequence $(w_{0}, \dots, w_{t - 2})$ into a continuous vector representation, and $D$ is the number of hidden nodes. In NNLMs, the most recent history $w_{t - 1}$ is represented by a one-hot vector $\tilde{w}_{t - 1} \in R^{N}$ as the input, where $N$ is the vocabulary size. Generally, the LSTM-RNN LM contains three parts: the word embedding layer, the recurrent layer and the output layer. In order to address data sparsity, the word embedding layer projects the one-hot word $\tilde{w}_{t}$ as the input vector into a neural network. After being forward fed through non-linear transformations contained in multiple hidden layers, the one-hot word $\tilde{w}_{t}$ is then projected into a continuous space $x_{t} \in R^{M}$ , where $M$ is the embedding size and usually $M \ll N$ :

x_{t} = Θ_{U} \tilde{w}_{t}

(3)

where $Θ_{U} \in R^{M \times N}$ is a projection matrix learned during training, and the embedding $x_{t}$ is then fed into the hidden layers. After word embedding, the hidden state $h_{t}$ is obtained within the memory cell of the LSTM architecture [18], where the information of the previous hidden state $h_{t - 1}$ is combined with the word $w_{t - 1}$ . As shown in Fig.1 (a), at time step $t$ the respective outputs from the gates are obtained, including the forget gate $f_{t}$ , the input gate $i_{t}$ , the cell input $\tilde{c}_{t}$ and the output gate $o_{t}$ . These gating functions are used to control the information flow within the cells to store the historical contexts over longer time steps, and the additive connection in Equation (8) addresses the vanishing gradient issue found in conventional RNN LMs. The respective outputs from these gates are calculated by

$f_{t}$	$= σ (Θ_{f} {[{x_{t - 1}}^{⊤}, {h_{t - 1}}^{⊤}, 1]}^{⊤})$	(4)
$i_{t}$	$= σ (Θ_{i} {[{x_{t - 1}}^{⊤}, {h_{t - 1}}^{⊤}, 1]}^{⊤})$	(5)
$\tilde{c}_{t}$	$= \tanh (Θ_{c} {[{x_{t - 1}}^{⊤}, {h_{t - 1}}^{⊤}, 1]}^{⊤})$	(6)
$o_{t}$	$= σ (Θ_{o} {[{x_{t - 1}}^{⊤}, {h_{t - 1}}^{⊤}, 1]}^{⊤})$	(7)

where $σ$ denotes the Sigmoid activation function and $Θ_{f}, Θ_{i}, Θ_{c}, Θ_{o} \in R^{D \times (M + D + 1)}$ denote the weight parameter matrices associated with the above gates respectively. The final hidden state is normally computed recursively from left to right in uni-directional LSTM-RNN LMs [7, 8]. Given the gating outputs, the hidden state $h_{t}$ and cell memory $c_{t}$ are:

	$c_{t} = f_{t} \circ c_{t - 1} + i_{t} \circ \tilde{c}_{t}$		(8)
	$h_{t} = o_{t} \circ \tanh (c_{t})$		(9)

where $\circ$ is the Hadamard product. Finally, the output layer uses the hidden state vector $h_{t}$ to compute the word probabilities via a Softmax function:

P (w_{t} | h_{t}) = \frac{\exp (Θ_{v}^{(w_{t})} h_{t})}{\sum_{w \in V} \exp (Θ_{v}^{(w)} h_{t})}

(10)

where $Θ_{v}^{(w)} \in R^{1 \times D}$ is the weight vector for word $w$ in the output layer, and $V$ represents the vocabulary.
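The recurrence in Equations (4)-(9) and the Softmax output layer can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names (`affine`, `lstm_step`, `softmax_output`), the toy dimensions and the all-zero weights are assumptions made for the example and do not correspond to the released implementation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def affine(theta, x, h):
    """Compute theta [x^T, h^T, 1]^T; theta has D rows and M + D + 1 columns."""
    inp = list(x) + list(h) + [1.0]
    return [sum(w * v for w, v in zip(row, inp)) for row in theta]

def lstm_step(x_prev, h_prev, c_prev, theta_f, theta_i, theta_c, theta_o):
    """One LSTM step following Equations (4)-(9)."""
    f = [sigmoid(a) for a in affine(theta_f, x_prev, h_prev)]          # (4)
    i = [sigmoid(a) for a in affine(theta_i, x_prev, h_prev)]          # (5)
    c_tilde = [math.tanh(a) for a in affine(theta_c, x_prev, h_prev)]  # (6)
    o = [sigmoid(a) for a in affine(theta_o, x_prev, h_prev)]          # (7)
    # (8): additive cell update with element-wise (Hadamard) products
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_tilde)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]                   # (9)
    return h, c

def softmax_output(theta_v, h):
    """Softmax output layer: one row of theta_v per vocabulary word."""
    logits = [sum(w * v for w, v in zip(row, h)) for row in theta_v]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy sizes (hypothetical): M = 2 embedding dims, D = 2 hidden nodes, 3 words.
M, D = 2, 2
zeros = [[0.0] * (M + D + 1) for _ in range(D)]
h, c = lstm_step([1.0, -1.0], [0.0, 0.0], [1.0, 1.0], zeros, zeros, zeros, zeros)
probs = softmax_output([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], h)
```

With all-zero weights every gate outputs 0.5 and the cell input is 0, so the cell memory is simply halved, which makes the additive path of Equation (8) easy to verify by hand.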

Figure 1: Examples of (a) a standard LSTM-RNN LM with point estimated parameters and two hidden layers, where $Θ_{i}$ , $Θ_{f}$ , $Θ_{c}$ , $Θ_{o}$ denote the respective weight parameters for the input gate, forget gate, cell input and output gate of an LSTM unit shown in Equation (4)-(7); (b) a Bayesian LSTM-RNN LM with latent Gaussian posterior distributions modeling parametric uncertainty on the cell input; (c) a Gaussian Process LSTM-RNN LM with an interpolation over multiple basis activation functions, where latent Gaussian posterior distributions are used to model uncertainty over both the basis coefficients and activation function parameters on the cell input; and (d) a variational LSTM-RNN LM using latent Gaussian posterior distributions to model the uncertainty over hidden layer outputs.

Figure 2: Examples of (a) a standard Transformer LM with point estimated parameters and six layers, where $Θ_{w}$ , $Θ_{q}$ , $Θ_{k}$ , $Θ_{v}$ , $Θ_{h}$ , $Θ_{1}$ , $Θ_{2}$ denote the parameters of the word embedding layer, multi-head self-attention and feed forward network modules in Equation (11), (13) and (15); (b) a Bayesian Transformer LM with latent Gaussian posterior distributions modeling parametric uncertainty for a feed-forward sub-network; (c) a Gaussian Process Transformer LM with an interpolation over multiple basis activation functions, where latent Gaussian posterior distributions are used to model uncertainty over both the basis coefficients and activation function parameters for a feed-forward sub-network; and (d) a variational Transformer LM using latent Gaussian posterior distributions to model the uncertainty over hidden layer outputs.

II-B Transformer Language Models

The Transformer model architecture considered in this paper features a deep stacking of multiple Transformer decoder blocks. As shown in Fig.2 (a), each Transformer decoder block consists of a multi-head self-attention module [21, 22, 23] and a feed forward neural network module. Residual connections [24] and layer normalization [25] are also inserted between the two modules. Assume that $x_{t}^{l - 1} \in R^{M}$ represents the output of the $(l - 1)$ -th Transformer layer at time step $t$ , where $M$ is the dimensionality of word embedding. In the $l$ -th layer, the multi-head self-attention module transforms $x_{t}^{l - 1}$ to $z_{t}^{l}$ as follows:

	$q_{t}^{l}, k_{t}^{l}, v_{t}^{l} = Θ_{q}^{l} {[{x_{t}^{l - 1}}^{⊤}, 1]}^{⊤}, Θ_{k}^{l} {[{x_{t}^{l - 1}}^{⊤}, 1]}^{⊤}, Θ_{v}^{l} {[{x_{t}^{l - 1}}^{⊤}, 1]}^{⊤}$	(11)
	$y_{t}^{l} = \mathbf{Attn} (k_{1}^{l}, \dots, k_{t}^{l}, v_{1}^{l}, \dots, v_{t}^{l}, q_{t}^{l}) = \mathbf{Softmax} \left( \frac{{q_{t}^{l}}^{⊤} (k_{1}^{l}, \dots, k_{t}^{l})}{\sqrt{M}} \right) (v_{1}^{l}, \dots, v_{t}^{l})$	(12)
	$o_{t}^{l} = Θ_{h}^{l} {[{y_{t}^{l}}^{⊤}, 1]}^{⊤} + x_{t}^{l - 1}$	(13)
	$z_{t}^{l} = \mathbf{LayerNorm} (o_{t}^{l})$	(14)

where $Θ_{q}^{l}, Θ_{k}^{l}, Θ_{v}^{l} \in R^{M \times (M + 1)}$ are the projection matrices of the $l$ -th self-attention module that map the input $x_{t}^{l - 1}$ into the query $q_{t}^{l}$ , key $k_{t}^{l}$ and value $v_{t}^{l}$ respectively. $\mathbf{Attn} (\cdot)$ represents the scaled multi-head dot product self-attention mechanism [19]. Only a single attention head is used in Equation (12). The attention output $y_{t}^{l}$ is computed over the sequence of cached key-value vector pairs up to time step $t$ , which can be restricted to contain only the history context information, thus preventing the model from using any future context, and $\frac{1}{\sqrt{M}}$ denotes the scaling factor. $\mathbf{LayerNorm} (\cdot)$ is the layer normalization operation [25]. $Θ_{h}^{l} \in R^{M \times (M + 1)}$ is the projection matrix applied to the outputs of the $\mathbf{Attn} (\cdot)$ operation before the residual connection [24]. The normalized output $z_{t}^{l}$ is then fed into the feed forward network module:

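	$s_{t}^{l} = Θ_{2}^{l} {[{\mathbf{GELU} (Θ_{1}^{l} {[{z_{t}^{l}}^{⊤}, 1]}^{⊤})}^{⊤}, 1]}^{⊤} + z_{t}^{l}$		(15)
	$x_{t}^{l} = \mathbf{LayerNorm} (s_{t}^{l})$		(16)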

where $Θ_{1}^{l} \in R^{D \times (M + 1)}, Θ_{2}^{l} \in R^{M \times (D + 1)}$ denote the weight matrices of the feed forward networks and $D$ is the size of hidden nodes. The Gaussian error linear unit $\bf GELU(⋅)$ [82, 83] is adopted as the activation function in this work. The output of the last Transformer layer, $x_{t}^{l}$ , is fed into the output layer, where the word probabilities are computed as in Equation (10), assuming $h_{t} = x_{t}^{l}$ .
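The single-head attention of Equation (12) can be illustrated for one time step as follows. This is a minimal sketch under simplifying assumptions: the function name `attend` and the toy key, value and query vectors are hypothetical, and the projections of Equation (11), the residual connection and layer normalization are omitted.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one time step.

    keys and values hold only the cached vectors for steps 1..t, so no
    future context can be used.
    """
    scale = math.sqrt(len(query))  # sqrt(M) scaling factor
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Weighted sum of the cached value vectors
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Toy cache with t = 2 and M = 2 (values are illustrative only).
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[0.0, 2.0], [4.0, 6.0]]
y = attend([3.0, 1.0], keys, values)
```

Because both cached keys are identical here, the attention weights are uniform and the output is the average of the two value vectors.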

III Bayesian Learning based NNLMs

This section presents three Bayesian neural network language modeling approaches based on Bayesian neural networks, Gaussian process and variational neural networks. These are proposed in this paper to model the uncertainty over model parameters, hidden activation functions and their outputs respectively in LSTM-RNN and Transformer based NNLMs.

III-A Bayesian NNLMs

Conventional NNLMs using point estimate parameters fail to account for the model uncertainty associated with the word prediction. When given limited training data, these NNLMs are prone to over-fitting and poor generalization. To this end, Bayesian neural networks (BNNs) [53, 54, 55, 56, 57] can be used by modeling the uncertainty over model parameters $Θ$ using a probability distribution $p_{r} (Θ)$ . The model parameters $Θ$ include all the activation function parameters, for example, those of various LSTM unit gates $Θ_{i}$ , $Θ_{f}$ , $Θ_{c}$ , $Θ_{o}$ of Equation (4)-(7) in LSTM-RNN LMs, and the feed-forward layer parameters $Θ_{1}$ of Equation (15) in Transformer LMs. The log probability of a given $n$ word sentence $W = (w_{0}, w_{1}, \dots, w_{n})$ is obtained as:

	$\log P (W | D) = \log \int P (W | Θ) p (Θ | D) d Θ$
	$= \log \int \prod_{t = 1}^{n} P (w_{t} | w_{0}, \dots, w_{t - 1}, Θ) p (Θ | D) d Θ$		(17)

where $p (Θ | D)$ is the posterior distribution over parameters to be learned from the given $N$ word training data set $D = (w_{0}, w_{1}, . . ., w_{N})$ .

In classical Bayesian neural networks, including the Bayesian NNLMs considered in this paper, the latent variable distribution $p (Θ | D)$ must be estimated in order to perform inference on any unseen test data sentence $W$ using the predictive distribution $P (W | D)$ , for example, that given in Equation (17) for Bayesian NNLMs. This distribution needs to be learned by maximizing the model evidence, expressed as the training data marginal likelihood ([55], Chapter 5.7). Since directly computing the evidence integral is intractable for neural networks, a range of approximate inference schemes were proposed in previous works, including Laplace approximation [55, 54], Markov chain Monte Carlo [53, 84], and the more recently proposed variational inference [56, 58, 73, 74] that is also used in this paper. The following variational lower bound of the marginal log-likelihood is maximized instead [58]:

		$\log P (D) = \log \int P (D | Θ) p_{r} (Θ) d Θ$		(18)
		$\geq \underbrace{\int \log P (D | Θ) q (Θ) d Θ}_{L_{1}} - \underbrace{KL [q (Θ) \| p_{r} (Θ)]}_{L_{2}}$		(19)

where $q (Θ)$ is the variational approximation of the parameter posterior distribution $p (Θ | D)$ , $p_{r} (Θ)$ is the prior distribution of $Θ$ , and $KL [q (Θ) \| p_{r} (Θ)]$ represents the Kullback-Leibler (KL) divergence between $q (Θ)$ and $p_{r} (Θ)$ . The exact computation of this KL regularization term is non-trivial for general forms of parameter posterior and prior distributions. For efficiency, both $q (Θ)$ and $p_{r} (Θ)$ are assumed to be diagonal Gaussian distributions [56, 71, 72, 73, 74] in this work:

q (Θ) = N (Θ; μ, Σ_{diag}), p_{r} (Θ) = N (Θ; μ^{r}, Σ_{diag}^{r})

(20)

where $μ$ , $Σ_{diag}$ are the mean and diagonal covariance matrix for the variational distribution, while $μ^{r}$ , $Σ_{diag}^{r}$ are those of the prior distribution. The first term of the marginal log-likelihood lower bound $L_{1}$ is the expectation of log-likelihood of the word sequence $W$ over the approximated posterior distribution $q (Θ)$ . This can be further efficiently approximated by the Monte Carlo sampling method:

	$L_{1} \approx \frac{1}{K} \sum_{k = 1}^{K} \log P (D | Θ^{(k)})$
	$= \frac{1}{K} \sum_{k = 1}^{K} \sum_{t = 1}^{N} \log P (w_{t} | w_{0}, \dots, w_{t - 1}, Θ^{(k)})$		(21)

where $K$ is the number of samples and $Θ^{(k)}$ is the $k$ -th sample drawn from the distribution $q (Θ)$ . It has been found that directly sampling $Θ^{(k)}$ using the variational distribution mean $μ$ and covariance matrix $Σ_{diag}$ is prone to instability during inference. To address this issue, the following re-parameterization method [85] is adopted to sample the $i$ -th element ${Θ_{i}}^{(k)}$ in the $k$ -th sampled parameter $Θ^{(k)}$ as follows:

{Θ_{i}}^{(k)} = μ_{i} + σ_{i} \cdot {ϵ_{i}}^{(k)}, {ϵ_{i}}^{(k)} \sim N (0, 1)

(22)

where $μ_{i}$ is the $i$ -th component of the $M$ -dimensional mean vector $μ$ , $σ_{i}$ is the square root of $Σ_{diag, ii}$ , the $i$ -th diagonal element in the diagonal covariance matrix $Σ_{diag}$ of the variational distribution, i.e. ${σ_{i}}^{2} = Σ_{diag, ii}$ , and ${ϵ_{i}}^{(k)}$ is the $k$ -th sample in the total $K$ samples drawn for the $i$ -th dimension. Assuming independence between the model parameters of different NNLM internal components, for example, those of different gates of an LSTM cell unit, the above sampling needs to be independently performed for the latent variational distribution associated with their respective parameters.
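The re-parameterized sampling of Equation (22) and the Monte Carlo approximation of Equation (21) can be sketched as below. This is a minimal sketch under stated assumptions: `toy_log_likelihood` is a hypothetical stand-in for $\log P (D | Θ)$ (a real NNLM would run a forward pass over the training words), and the function names and numeric values are illustrative only.

```python
import math
import random

def sample_parameters(mu, sigma, rng):
    """Equation (22): theta_i = mu_i + sigma_i * eps_i with eps_i ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def monte_carlo_l1(log_likelihood, mu, sigma, num_samples, rng):
    """Equation (21): average log-likelihood over K sampled parameter vectors."""
    total = 0.0
    for _ in range(num_samples):
        total += log_likelihood(sample_parameters(mu, sigma, rng))
    return total / num_samples

# Hypothetical stand-in for log P(D | theta), used only for illustration.
def toy_log_likelihood(theta):
    return -sum((t - 1.0) ** 2 for t in theta)

rng = random.Random(0)
mu, sigma = [1.0, 1.0], [0.1, 0.2]
l1_single = monte_carlo_l1(toy_log_likelihood, mu, sigma, 1, rng)   # K = 1
l1_many = monte_carlo_l1(toy_log_likelihood, mu, sigma, 20000, rng)
```

Because the noise enters only through $ϵ$, the sampled parameters remain a deterministic function of $μ$ and $σ$ and gradients can flow to both. For this toy objective the exact expectation is $- (0.1^{2} + 0.2^{2}) = - 0.05$, which the large-sample estimate approaches, while a single sample gives a noisier but unbiased estimate.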

The second part $L_{2}$ of Equation (19) is the KL divergence between $q (Θ)$ and $p_{r} (Θ)$ . Under the Gaussian assumption, an analytical closed-form can be derived as:

KL [q (Θ) \| p_{r} (Θ)] = \sum_{i = 1}^{M} \left\{ \log \frac{σ_{i}^{r}}{σ_{i}} + \frac{{σ_{i}}^{2} + (μ_{i} - μ_{i}^{r})^{2}}{2 {σ_{i}^{r}}^{2}} - \frac{1}{2} \right\}

(23)

where $μ_{i}^{r}$ and $σ_{i}^{r}$ are the $i$ -th component of the mean $μ^{r}$ and the square root of $Σ_{diag, ii}^{r}$ , the $i$ -th diagonal variance element of the prior distribution, i.e. ${σ_{i}^{r}}^{2} = Σ_{diag, ii}^{r}$ .
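The closed-form KL term of Equation (23) is straightforward to compute. The sketch below is illustrative only; the function name `kl_diag_gaussian` and the numeric values are assumptions made for the example.

```python
import math

def kl_diag_gaussian(mu, sigma, mu_r, sigma_r):
    """Closed-form KL[q || p_r] of Equation (23) for diagonal Gaussians.

    sigma and sigma_r are standard deviations (square roots of the diagonal
    variances) and must be strictly positive.
    """
    kl = 0.0
    for m, s, m_r, s_r in zip(mu, sigma, mu_r, sigma_r):
        kl += math.log(s_r / s) + (s ** 2 + (m - m_r) ** 2) / (2.0 * s_r ** 2) - 0.5
    return kl

# Identical variational and prior distributions give zero divergence.
kl_same = kl_diag_gaussian([0.3, -1.0], [0.5, 2.0], [0.3, -1.0], [0.5, 2.0])
# Unit-variance Gaussians whose means differ by 1 give a KL of 0.5.
kl_shift = kl_diag_gaussian([1.0], [1.0], [0.0], [1.0])
```

The two checks follow directly from Equation (23): with matching means and variances every summand vanishes, and with unit variances the summand reduces to half the squared mean difference.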

The combined use of both the parameter sampling of Equation (21)-(22), and the closed-form KL regularization term of Equation (23) under the Gaussian assumption over the parameter posterior and prior distributions, allows the variational lower bound of Equation (19) to be differentiable with respect to the variational distribution hyper-parameters $μ$ and $Σ_{diag}$ , so that it can be fully integrated into the conventional back-propagation algorithm during Bayesian LSTM-RNN or Transformer LM training.

III-B Gaussian Process based NNLMs

Gaussian Processes (GPs) [86] are non-parametric distributions over continuous functions in probabilistic modeling for many machine learning applications including regression and classification tasks and beyond. A function $f (\cdot)$ modeled using Gaussian Process is represented as:

f (x) \sim \mathcal{GP} (m (x), k (x, x^{'}))

(24)

where $x, x^{'} \in R^{M}$ are an arbitrary pair of input data vectors, and $f (x) \in R^{D}$ is a Gaussian distribution specified by the mean function $m (x)$ and the kernel function $k (x, x^{'})$ . The above formulation is known as the kernel space view of GP models [86]. The associated computational complexity over the kernel covariance function during inference is determined by the size of the training data set, thus impractical when applied to large scale tasks, for example, language models trained on millions, or billions of words. An alternative and computationally more tractable form of GP uses basis function interpolation ([86], Chapter 2), also known as the following weight space view of GP,

f (x) = λ^{⊤} \cdot ϕ (x) = \sum_{j = 1}^{K} λ^{j} ϕ^{j} (x)

(25)

where $k (\cdot, \cdot) = ϕ (\cdot)^{⊤} ϕ (\cdot)$ , $λ$ denotes the vector including $K$ basis coefficients of different basis functions $ϕ^{j} (x)$ in $ϕ (\cdot)$ .

A series of prior research works were conducted in the machine learning community to build the connection between neural networks and Gaussian processes. Neal [87] proved that Bayesian neural networks [54] with a single hidden layer of infinite width are equivalent to Gaussian Processes. Hazan and Jaakkola [88] and later Lee [89] proposed the use of GP kernels to approximate infinitely wide deep neural networks. Such connection was further studied in deep Gaussian process (DGP) [90] models where deep belief neural network layers were replaced by Gaussian Processes.

The form of the traditional Bayesian neural networks introduced in Sec.III-A only considers the uncertainty associated with model parameters, but not the network structural configurations. For example, the choice over the activation functions that are widely used in Equation (4)-(7), (9) and (15) can be learned using a weighted interpolation in Equation (25) over commonly used basis activation functions, i.e. Sigmoid, tanh, ReLU and GELU, as considered in this paper. This interpolated form of hidden activation function, for example, the activation function in Equation (4), is given by

f_{t} = \sum_{j = 1}^{K} λ^{j} ϕ^{j} (Θ_{f} {[{x_{t - 1}}^{⊤}, {h_{t - 1}}^{⊤}, 1]}^{⊤})

(26)

where $Θ_{f}$ denotes the activation function parameters and $λ^{j}$ is the $j$ -th basis activation coefficient.
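The interpolated activation of Equation (26) can be sketched with the four basis functions named above. This is an illustrative sketch only: the names `BASIS` and `interpolated_activation` are hypothetical, the coefficients are fixed by hand, and the exact erf-based form of GELU is assumed rather than the tanh approximation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def relu(a):
    return max(0.0, a)

def gelu(a):
    # Exact GELU: a * Phi(a), with Phi the standard normal CDF.
    return 0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0)))

BASIS = [sigmoid, math.tanh, relu, gelu]  # K = 4 basis activations

def interpolated_activation(pre_activation, coefficients):
    """Weighted sum of basis activations, applied element-wise to the
    pre-activation vector Theta_f [x^T, h^T, 1]^T."""
    return [
        sum(lam * phi(a) for lam, phi in zip(coefficients, BASIS))
        for a in pre_activation
    ]

# Putting all weight on Sigmoid recovers the standard gate of Equation (4).
only_sigmoid = interpolated_activation([0.0, 2.0], [1.0, 0.0, 0.0, 0.0])
# Equal weights mix all four basis functions.
mixed = interpolated_activation([0.0], [0.25, 0.25, 0.25, 0.25])
```

Here the coefficients are fixed for clarity. Treating them as random variables with their own variational distribution, as described for GP NNLMs, would replace the fixed list with a sample drawn in the same re-parameterized manner as the weights.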

In addition to treating model parameters inside the activation functions as random variables as in Bayesian NNLMs, further considering the uncertainty over the basis activation coefficients $λ$ leads to a more general form of Gaussian Process neural networks that simultaneously account for the uncertainty over both the NNLM model parameters and the choice of hidden activation functions. In GP NNLMs, assuming that the activation function parameters $Θ$ (linear weight parameters that are applied to the inputs of a neuron, before any non-linearity) and basis activation coefficients $λ$ are statistically independent of each other, the sentence level probability of a given word sequence $W$ is then computed as follows:

	$\log P (W | D) = \log \int \int P (W | Θ, λ) p (Θ | D) p (λ | D) d Θ d λ$
	$= \log \int \int \prod_{t = 1}^{n} P (w_{t} | w_{0}, \dots, w_{t - 1}, Θ, λ) p (Θ | D) p (λ | D) d Θ d λ$		(27)

where $p (Θ | D)$ and $p (λ | D)$ denote the posterior probability distributions over the basis activation function parameters $Θ$ and the basis coefficients $λ$ respectively, both to be learned from the training data $D$ .

Similar to the variational inference approach considered in Sec.III-A for Bayesian NNLMs, the following variational lower bound of marginal log-likelihood is maximized:

	$\log P (D) \geq \underbrace{\int \int \log P (D | Θ, λ) q (Θ) q (λ) d Θ d λ}_{L_{1}}$
	$- \underbrace{KL [q (Θ) \| p_{r} (Θ)]}_{L_{2}} - \underbrace{KL [q (λ) \| p_{r} (λ)]}_{L_{3}}$		(28)

where $q (Θ)$ and $q (λ)$ are variational approximations of the posterior distributions $p (Θ | D)$ and $p (λ | D)$ that separately model the uncertainty over model parameters and the choice of hidden neural activation functions. Their respective KL divergences against the prior distributions in $L_{2}$ and $L_{3}$ serve as regularization terms during GP NNLM inference.

In common with Bayesian NNLMs, both the prior distributions, $p_{r} (Θ)$ and $p_{r} (λ)$ , and the variational distributions, $q (Θ)$ and $q (λ)$ , for model parameters and basis activation coefficients are assumed to be Gaussian distributions [71, 72, 73, 74] to produce closed-form, analytical solutions to the two KL terms.

The expectation of the log-likelihood in the first term $L_{1}$ of the variational lower bound in Equation (28) is calculated using a sampling method that requires activation parameter samples and basis activation coefficient samples to be drawn from their respective variational distributions, $q (Θ)$ and $q (λ)$ , with the re-parameterization method of Equation (22).
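The sampling step can be sketched as below. Equation (22) is not reproduced in this section, so the sketch assumes the standard Gaussian re-parameterization, in which a sample is formed as the mean plus the standard deviation scaled by unit Gaussian noise; the function name and all numerical values are hypothetical.

```python
import random


def reparameterized_sample(means, std_devs, rng):
    """Draw one sample mu + sigma * eps per element, with eps ~ N(0, 1)."""
    return [mu + sigma * rng.gauss(0.0, 1.0)
            for mu, sigma in zip(means, std_devs)]


rng = random.Random(0)  # fixed seed, for reproducibility only

# Hypothetical variational means and standard deviations for q(Theta).
theta_sample = reparameterized_sample([0.5, -1.0, 2.0], [0.1, 0.0, 0.3], rng)

# Hypothetical variational distribution q(lambda) over four basis coefficients.
lambda_sample = reparameterized_sample([0.7, 0.1, 0.1, 0.1], [0.05] * 4, rng)
```

Because the noise is drawn independently of the means and standard deviations, gradients of the sampled values with respect to those quantities are well defined, which is what allows the bound to be optimized by back-propagation.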

III-C Variational NNLMs

In contrast to the BNNs and GPNNs presented in Sec.III-A and III-B, variational neural networks (VNNs) [58, 59, 91, 70] can be used to model uncertainty associated with hidden representations. Instead of modeling the uncertainty over the weight parameters $Θ$ inside the activation functions in BNNs or assuming additional uncertainty over the activation basis coefficients $λ$ in GPNNs, VNNs introduce a latent variable $z_{t}$ to encode the uncertainty associated with, for example, the hidden node output vector $h_{t}$ of LSTM-RNNs in Equation (9), or the outputs of feed-forward networks $s_{t}^{l}$ of Transformer models in Equation (15). The general form of sentence level probability of a given word sequence $W = (w_{0}, w_{1}, \dots, w_{n})$ under variational NNLMs is computed as:

	$\log P (W | D) = \log \prod_{t = 1}^{n} P (w_{t} | w_{0}, \dots, w_{t - 1}, D)$
	$\approx \log \prod_{t = 1}^{n} \int P (w_{t} | w_{0}, \dots, w_{t - 1}, z_{t}) p (z_{t} | h_{t}, D) d z_{t}$		(29)

where $p (z_{t} | h_{t}, D)$ denotes the latent variable distribution over neural network hidden layer outputs $z_{t}$ to be learned from the training data $D = (w_{0}, w_{1}, \dots, w_{N})$ , and the hidden layer vector outputs over time are $(z_{0}, z_{1}, \dots, z_{t})$ .

During variational NNLM training, the following lower bound of the training data marginal log-likelihood is maximized:

	$\log P (D) \geq \underbrace{\sum_{t = 1}^{N} \int \log P (w_{t} | w_{0}, \dots, w_{t - 1}, z_{t}) q (z_{t}) d z_{t}}_{L_{1}}$
	$- \underbrace{KL [q (z_{t}) || p_{r} (z_{t})]}_{L_{2}}$		(30)

where $q (z_{t})$ is the variational approximation of $p (z_{t} | h_{t}, D)$ , the posterior distribution over the hidden output vector $z_{t}$ , and $p_{r} (z_{t})$ is its prior distribution. In common with both Bayesian and GP NNLMs, for efficiency during inference, both $q (z_{t})$ and $p_{r} (z_{t})$ are assumed to be Gaussian distributions:

	$q (z_{t}) = \mathcal{N} (z_{t}; μ_{t}, Σ_{d i a g, t}), \quad p_{r} (z_{t}) = \mathcal{N} (z_{t}; μ_{t}^{r}, Σ_{d i a g, t}^{r})$		(31)

to allow the KL divergence between $q (z_{t})$ and $p_{r} (z_{t})$ in Equation (30) to be computed in a tractable form:

	$KL [q (z_{t}) || p_{r} (z_{t})] = \sum_{i = 1}^{M} \left\{ \log \frac{σ_{t, i}^{r}}{σ_{t, i}} + \frac{{σ_{t, i}}^{2} + (μ_{t, i} - μ_{t, i}^{r})^{2}}{2 {σ_{t, i}^{r}}^{2}} - \frac{1}{2} \right\}$		(32)
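The closed form of Equation (32) is straightforward to evaluate per dimension. The following is a minimal sketch operating on plain Python lists of means and standard deviations; the function name is illustrative.

```python
import math


def diagonal_gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL[q || p_r] between two diagonal Gaussians, summed over dimensions.

    Each argument is a list with one entry per dimension; sigma_q and
    sigma_p hold standard deviations (square roots of the diagonal variances).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (math.log(sp / sq)
                  + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
                  - 0.5)
    return total
```

The divergence is zero when the variational and prior distributions coincide and grows as their means or variances move apart, which is the regularization effect of the $L_{2}$ term.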

The hyper-parameters $μ_{t}, Σ_{d i a g, t}$ , where ${σ_{t, i}}^{2} = Σ_{d i a g, t, i i}$ , are calculated using an inference network $Φ^{i n f e r} (w_{0}, \dots, w_{t - 1})$ , and $μ_{t}^{r}, Σ_{d i a g, t}^{r}$ , where ${σ_{t, i}^{r}}^{2} = Σ_{d i a g, t, i i}^{r}$ , using a prior network $Φ^{p r i o r} (w_{0}, \dots, w_{t - 1})$ . Both the inference network and the prior network are regression networks that take the history sequence $(w_{0}, \dots, w_{t - 1})$ as input. Each network produces two outputs, the mean and variance vectors of $q (z_{t})$ and $p_{r} (z_{t})$ respectively, as follows.

	$[μ_{t}, Σ_{d i a g, t}] = Φ^{i n f e r} (w_{0}, \dots, w_{t - 1})$		(33)
	$[μ_{t}^{r}, Σ_{d i a g, t}^{r}] = Φ^{p r i o r} (w_{0}, \dots, w_{t - 1})$		(34)

The input word sequence $(w_{0}, \dots, w_{t - 1})$ is approximated by the hidden state $h_{t}$ in LSTM-RNNs, and by the output of the Transformer feed-forward modules $s_{t}^{l}$ , as shown in Fig.1 (d) and Fig.2 (d) respectively. In this work, both the inference networks and the prior networks used for the variational NNLMs contain a single hidden layer. Both the inference and prior network modules are differentiable with respect to their internal parameters. Similar to the Bayesian and Gaussian Process NNLMs of Sec.III-A and III-B, sampling the hidden output vector $z_{t}$ from the variational distribution $q (z_{t})$ allows the marginal log-likelihood term $L_{1}$ of Equation (30) to be efficiently approximated. The maximization of the variational lower bound of Equation (30) can then be integrated into the back-propagation algorithm during variational LSTM-RNN or Transformer LM training.

IV System Implementation

In this section, a set of implementation details that affect the performance and efficiency of the Bayesian, Gaussian Process and Variational LSTM-RNN and Transformer LMs in Sec.III are discussed.

IV-A Choice of Prior Distributions

When training NNLMs with any of the three Bayesian learning approaches, suitable parameter prior distributions need to be set. In this paper, the Gaussian prior means for the various Bayesian learned LSTM-RNN and Transformer LMs are based on the parameter estimates of comparable deterministic, point estimated NNLMs that have converged in model training. Throughout this paper the prior Gaussian variances used for Bayesian learning based LSTM-RNN and Transformer LMs are empirically set as 1 and 1.0e-3 respectively. In addition, all the other standard, non-Bayesian estimated parameters in the Bayesian learning based LSTM-RNN and Transformer models are initialized using the parameters obtained from the comparable pre-trained standard models reaching half convergence. The combination of these two settings was found in practice to yield a good balance between convergence speed and performance.¹

¹ The methods considered to initialize the other standard, non-Bayesian estimated model parameters of Bayesian learning based NNLMs include: 1) random weights; 2) the weights of half-converged pre-trained standard models; and 3) the weights of fully converged pre-trained standard models. Experimental results suggested that initialization from the half-converged or fully converged pre-trained standard models produced similar performance, both marginally better than using random weights.

ID	LMs	Bayesian				PPL	WER(%) of dev				PPL	WER(%) of eval
ID	LMs	Layer	Position	#Sample	Seed	(dev)	ihm	sdm1	mdm8	Avg.	(eval)	ihm	sdm1	mdm8	Avg.
1	LSTM+3g	Not Applied				60.4	16.4	30.0	27.7	24.7	59.6	16.2	34.8	30.9	27.3
2	Bayesian LSTM +3g	1	Cell input	1	1	49.2	16.1 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	50.5	15.9 $^{†}$	34.3 $^{†}$	30.5 $^{†}$	26.9 $^{†}$
3		1		1	2	50.1	16.1 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	51.0	15.9 $^{†}$	34.3 $^{†}$	30.5 $^{†}$	26.9 $^{†}$
4		1		1	3	52.7	16.2	29.7	27.5	24.5	53.3	16.0	34.4	30.7	27.0
5		1		3	1	48.3	16.0 $^{†}$	29.6 $^{†}$	27.4 $^{†}$	24.3 $^{†}$	50.4	15.9 $^{†}$	34.2 $^{†}$	30.4 $^{†}$	26.8 $^{†}$
6		1		5	1	49.5	16.1 $^{†}$	29.6 $^{†}$	27.4 $^{†}$	24.4 $^{†}$	51.3	15.8 $^{†}$	34.3 $^{†}$	30.4 $^{†}$	26.8 $^{†}$
7	Transformer+3g	Not Applied				46.9	16.1	29.9	27.5	24.5	48.1	15.9	34.5	30.5	27.0
8	Bayesian Transformer +3g	1	FFN	1	1	46.1	16.0	29.7 $^{‡}$	27.4	24.4	47.7	15.8	34.3 $^{‡}$	30.4	26.8
9		1		1	2	46.2	16.0	29.7 $^{‡}$	27.4	24.4	47.5	15.8	34.4	30.4	26.8
10		1		1	3	46.2	15.9	29.7 $^{‡}$	27.4	24.4	47.4	15.8	34.3 $^{‡}$	30.5	26.8
11		1		3	1	47.0	16.0	29.7 $^{‡}$	27.4	24.4	48.2	15.8	34.4	30.4	26.9
12		1		5	1	47.8	16.0	29.8	27.4	24.4	49.3	15.9	34.4	30.5	26.9

Table I: Perplexity (PPL) and WER(%) of the baseline LSTM and Transformer LMs and their Bayesian variants on AMI dev and eval sets of ihm, sdm1 and mdm8 conditions using different numbers of samples (1, 3 and 5) and multiple seeds (1, 2 and 3) after interpolation with the 3-gram (3g) LM. "FFN" represents the feed-forward network module in a Transformer LM. "†" and "‡" denote statistically significant WER reductions obtained over the baseline LSTM-RNN (line 1) and Transformer (line 7) LMs respectively.

IV-B Modeling Local Uncertainty

When training various types of Bayesian estimated LSTM-RNN or Transformer LMs using variational inference, the evidence lower bounds in Equation (19), (28) and (30) require Monte Carlo parameter samples to be drawn. These samples are drawn independently at each Bayesian estimated NNLM hidden layer. Assuming the number of samples drawn at each layer is $K$ , the cost incurred in Bayesian inference grows exponentially with respect to the number of hidden layers $L$ as $K^{L}$ . Hence, directly performing Bayesian estimation using variational inference across too many NNLM hidden layers becomes computationally intractable.
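The exponential growth described above can be made concrete with a short calculation. The sketch below simply evaluates $K^{L}$ for a few settings; the function name is illustrative and the chosen values of $K$ and $L$ mirror the sample counts in Table I and the 2-layer LSTM-RNN and 6-layer Transformer configurations used in this paper.

```python
def joint_sample_paths(samples_per_layer, num_bayesian_layers):
    """Number of distinct sample combinations, K ** L, across stacked layers."""
    return samples_per_layer ** num_bayesian_layers


# K in {1, 3, 5} samples per layer, L in {1, 2, 6} Bayesian estimated layers.
growth = {(k, l): joint_sample_paths(k, l)
          for k in (1, 3, 5)
          for l in (1, 2, 6)}
```

With a single sample per layer the count stays at one regardless of depth, whereas five samples over six layers already give 15625 combinations, which motivates restricting both the number of Bayesian estimated layers and the number of samples.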

One approach considered in this paper to address the above scalability issue is to use neural architecture search (NAS) techniques [92, 93, 94, 95, 96]. They are used to automatically learn the most important network internal locations inside LSTM-RNN and Transformer LMs that require Bayesian uncertainty modeling. The form of NAS method considered in this paper is based on differentiable neural architecture search (DARTS) [95, 96, 97, 98, 99]. NAS using the DARTS method is performed over a super-network containing paths connecting either the point estimated, or the Bayesian estimated neural network structures to be considered. The search requires the estimation of the weights that are assigned to each point estimate based, or Bayesian estimated candidate neural architecture within such a super-network LM. The architecture weights and normal model parameters are jointly learned by maximizing the word sequence log-likelihood, i.e. minimizing the cross-entropy cost, when training such a super-network LM. The standard back-propagation algorithm is used in the training process. When the super-network LM containing both the point estimate based and the Bayesian estimated architectures and their respective model parameters is trained to convergence, the optimal 1-best and top N-best architectures can be obtained by pruning the lower weighted paths within the super-network that are considered less important, akin to the approach used in our previous research on applying NAS to TDNNs [98].

More specifically, the NNLM internal components to be examined during NAS include: a) the input, output, forget gates, cell inputs and hidden node activations (for Bayesian or GP LSTM-RNNs), as well as hidden output vectors (for variational LSTM-RNNs), as shown in Fig.1; and b) the multi-head self-attention and feedforward sub-layers within each Transformer LM block (for Bayesian or GP Transformers), as well as its hidden vector outputs (for variational Transformers), as shown in Fig.2.

The required DARTS super-network is constructed by building two parallel architecture paths for each of the above search locations within LSTM-RNN or Transformer LMs that respectively represent the use of either conventional point estimate, or Bayesian learning methods of Sec.III. For example, a fragment of such super-network designed for a cell input of an LSTM unit and its activation output is given below:

	$\tilde{c}_{t} = \frac{exp a_{p o i n t}^{l}}{exp a_{p o i n t}^{l} + exp a_{B a y e s}^{l}} t a n h (Θ_{c, (p o i n t)} {[{x_{t - 1}}^{⊤}, {h_{t - 1}}^{⊤}, 1]}^{⊤})$
	$+ \frac{exp a_{B a y e s}^{l}}{exp a_{p o i n t}^{l} + exp a_{B a y e s}^{l}} t a n h (Θ_{c, (B a y e s)} {[{x_{t - 1}}^{⊤}, {h_{t - 1}}^{⊤}, 1]}^{⊤})$		(35)

where $a_{p o i n t}^{l}$ is the architecture weight for using point estimated parameters at the $l$ -th cell input, and $a_{B a y e s}^{l}$ is that for using Bayesian estimated parameters.
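The softmax weighting of the two parallel paths in Equation (35), and the subsequent pruning decision, can be sketched for scalar pre-activations as follows. This is an illustrative sketch only: the function names are hypothetical, and `pre_point` and `pre_bayes` stand for the pre-activations computed with the point estimated and the Bayesian sampled parameters respectively.

```python
import math


def architecture_probabilities(a_point, a_bayes):
    """Softmax over the two architecture weights of one search location."""
    shift = max(a_point, a_bayes)  # subtract the maximum for numerical stability
    e_point = math.exp(a_point - shift)
    e_bayes = math.exp(a_bayes - shift)
    normalizer = e_point + e_bayes
    return e_point / normalizer, e_bayes / normalizer


def mixed_cell_input(pre_point, pre_bayes, a_point, a_bayes):
    """Softmax-weighted sum of the tanh outputs of the two parallel paths."""
    p_point, p_bayes = architecture_probabilities(a_point, a_bayes)
    return p_point * math.tanh(pre_point) + p_bayes * math.tanh(pre_bayes)


def select_bayesian(a_point, a_bayes):
    """After training, keep the Bayesian path only if its weight is larger."""
    return a_bayes > a_point
```

During the search both paths contribute to the output, so the architecture weights receive gradients alongside the normal model parameters; pruning is applied only once the super-network has converged.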

Detailed experimental contrasts between using manual and NAS automatic selection of NNLM components to perform Bayesian estimation for LSTM-RNN and Transformer LMs are shown in Table III (line 9-19 and 23 vs. line 20-22 and 24 for Bayesian LSTM-RNNs; and line 25-39 and 43 vs. line 40-42 and 44 for GP LSTM-RNNs) and Table IV (line 9-16 and 20 vs. line 17-19 and 21 for Bayesian Transformers; and line 22-27 and 31 vs. line 28-30 and 32 for GP Transformers) respectively. A figure plotting the values of the six pairs of Transformer LM feed-forward module specific architecture weights $a_{B a y e s}$ (in blue) and $a_{p o i n t}$ (in red) extracted from the DARTS super-network is shown in Fig.3.²

² The results suggested that the first and second layers in the Transformer LM were selected to be Bayesian estimated. This decision is based on the fact that the respective architecture weights associated with Bayesian estimation are larger than those of point estimation, $a_{B a y e s} > a_{p o i n t}$ , in the NAS super-network. This can be intuitively interpreted as the two bottom Transformer feedforward layers exhibiting larger uncertainty than those positioned at the top. This is further confirmed by examining the median parameter signal-to-noise ratio (SNR) statistics [47, 57] (further discussed in Sec.V-A) measured over the feedforward modules of a 6-layer Bayesian Transformer LM (line 16, Table IV). The SNRs in Transformer layers 1-6 are 3.3, 3.1, 3.9, 4.7, 6.2 and 7.0 respectively. As expected, the median parameter SNRs of the two bottom layers are much lower than those of the higher layers. This trend indicates a larger parameter uncertainty in these two bottom layers, which can benefit more from the use of Bayesian estimation.

Figure 3: Examples of Transformer LM feed-forward module specific point estimate (in red) and Bayesian (in blue) architecture weights $a_{p o i n t}$ , $a_{B a y e s}$ extracted from the DARTS super-network trained on the AMI data. The bottom layers 1 and 2, where $a_{B a y e s} > a_{p o i n t}$ , were selected to be Bayesian estimated (shown in line 17 and 21, Table IV with or without being interpolated with standard Transformer LMs).

IV-C Parameter Sampling

A second approach to further address the above scalability issue is to use a minimal number of parameter samples when approximating the marginal log-likelihood lower bounds in Equation (19), (28) and (30) for the three types of Bayesian NNLMs of Sec.III-A to III-C. Within each individual component of NNLMs, for example, a cell input of an LSTM-RNN LM where Bayesian learning is applied and requires such parameter sampling, the model training cost can be further reduced when using a smaller number of samples.

An ablation study on the performance of AMI meeting data trained Bayesian LSTM-RNN and Transformer LMs by varying the number of parameter samples drawn respectively at the first LSTM hidden layer’s cell inputs and the first Transformer module’s feed-forward sub-layer during model training across multiple random seeds is shown in Table I. The experimental results in Table I show that only a marginal difference in WER was observed by drawing more samples (three and five samples) in the error forwarding³ pass of Bayesian LSTM-RNN and Transformer LMs. Based on these findings, only one sample is drawn to approximate the marginal log-likelihood in Equation (19), (28) and (30) for all the Bayesian estimated NNLMs presented in this paper.

³ “Error forwarding” refers to the process of feed forwarding the training data through the neural network to compute the error loss for subsequent gradient backward propagation.

During evaluation, the inference of Bayesian, Gaussian Process and Variational NNLMs is efficiently approximated by computing the expectation of the model parameters or the latent variables under their respective posterior distributions. For example, during evaluation, the samples of activation parameters $Θ$ in Bayesian LSTM-RNN or Transformer LMs are approximated by taking the mean $μ$ in Equation (20) of their latent distribution:

	$\int P (W | Θ) p (Θ | D) d Θ \approx P (W | E [Θ | D]) = P (W | μ)$		(36)

when predicting the probability of a test data sentence $W$ .
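The approximation in Equation (36) can be contrasted with Monte Carlo averaging on a toy model. The sketch below assumes a hypothetical one-parameter logistic "LM" that maps a scalar feature to a word probability; it is not an NNLM and is only meant to show the two evaluation strategies side by side.

```python
import math
import random


def word_probability(theta, feature):
    """Toy one-parameter model: P(w | theta) = sigmoid(theta * feature)."""
    return 1.0 / (1.0 + math.exp(-theta * feature))


def monte_carlo_probability(mu, sigma, feature, num_samples, rng):
    """Average P(w | theta) over samples theta = mu + sigma * eps."""
    total = 0.0
    for _ in range(num_samples):
        theta = mu + sigma * rng.gauss(0.0, 1.0)
        total += word_probability(theta, feature)
    return total / num_samples


def mean_plugin_probability(mu, feature):
    """Equation (36): replace parameter sampling by the posterior mean."""
    return word_probability(mu, feature)


rng = random.Random(0)
sampled = monte_carlo_probability(1.2, 0.1, 0.5, 1000, rng)
plugged = mean_plugin_probability(1.2, 0.5)
```

The plug-in estimate needs a single forward pass, which is why the evaluation cost of the Bayesian model matches that of a point estimated model of the same size; the two estimates coincide exactly when the posterior variance is zero.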

ID	LM	Uncertainty			#Parameter			Speed Ratio
ID	LM	$λ$	$θ$	$z$	$λ$	$θ$	$z$	Train	Test
1	Standard	✗	✗	✗	0	ab	$0$	1.00	1.00
2	B-NNLMs	✗	✓	✗	0	2ab	$0$	1.02	1.00
3	GP NNLMs	✓	✓	✗	8b	2ab	$0$	1.10	1.20
4	V-NNLMs	✗	✗	✓	$0$	ab+cb	4ac	1.05	1.03

Table II: Description of Bayesian, Gaussian Process and Variational NNLMs in terms of their respective uncertainty modeling, number of samples, number of free parameters and the speed ratios relative to the standard NNLMs in training and evaluation. The hidden layer input vector size is expressed as $a$ , the number of hidden nodes as $b$ , and the latent hidden output vector $z$ is of $c$ dimensions.

IV-D System Description

Following the above implementation details, the description of a set of Bayesian, Gaussian Process and Variational NNLMs in terms of their respective forms of uncertainty modeling, number of free parameters and the speed ratios relative to the standard NNLMs in training and evaluation is presented in Table II. In addition, when using the above efficient sampling (as low as one sample drawn) for model training and evaluation, the Bayesian, Gaussian Process and Variational NNLMs only require a moderate increase in training time of approximately 5%-10% over the standard baseline NNLMs, while their computational complexity during evaluation is comparable to that of standard NNLMs.
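The free parameter expressions of Table II can be evaluated for a given layer size as sketched below. The function simply transcribes the table entries; the function name is illustrative and the layer sizes in the usage line, in particular the latent dimension $c$ , are hypothetical.

```python
def free_parameter_counts(a, b, c):
    """Per-layer free parameter counts transcribed from Table II.

    a: hidden layer input vector size, b: number of hidden nodes,
    c: dimensionality of the latent hidden output vector z.
    """
    return {
        "Standard": {"lambda": 0, "theta": a * b, "z": 0},
        "B-NNLMs": {"lambda": 0, "theta": 2 * a * b, "z": 0},
        "GP NNLMs": {"lambda": 8 * b, "theta": 2 * a * b, "z": 0},
        "V-NNLMs": {"lambda": 0, "theta": a * b + c * b, "z": 4 * a * c},
    }


# Hypothetical sizes: 1024-dimensional input, 1024 hidden nodes, 256-dim z.
counts = free_parameter_counts(1024, 1024, 256)
```

One plausible reading of the table is that the factor of two in the Bayesian entries reflects a mean and a variance stored per weight, and that the $8b$ entry corresponds to a mean and a variance for each of four basis coefficients per hidden node.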

ID	LMs	Bayesian			Uncertainty			PPL	WER(%) of dev				PPL	WER(%) of eval
ID	LMs	Design	Block	Position	$λ$	$θ$	$z$	(dev)	ihm	sdm1	mdm8	Avg.	(eval)	ihm	sdm1	mdm8	Avg.
1	3-gram	Not Applied			✗	✗	✗	81.9	18.1	31.8	29.5	26.5	93.6	18.4	36.4	32.5	29.1
2	LSTM +3g							60.4	16.4	30.0	27.7	24.7	59.6	16.2	34.7	30.9	27.3
3	LSTM(L1) +3g							54.5	16.3	29.8	27.6	24.6	55.6	16.1	34.5	30.7	27.1
4	+LSTM							53.2	16.2	29.7	27.5	24.5	54.2	16.0	34.4	30.6	27.0
5	LSTM(L2) +3g							53.4	16.3	29.8	27.5	24.5	54.7	16.0	34.6	30.7	27.1
6	+LSTM							52.3	16.2	29.7	27.4	24.4	53.1	15.9	34.4	30.5	26.9
7	LSTM(MAP) +3g							52.5	16.2	29.7	27.5	24.5	53.7	16.0	34.5	30.6	27.0
8	+LSTM							51.4	16.1	29.6	27.4	24.4	52.4	15.9	34.4	30.5	26.9
9	Bayesian LSTM +3g	Manual	1-2	All gates	✗	✓	✗	52.1	16.2 $^{†}$	29.6 $^{†}$	27.4 $^{†}$	24.4 $^{†}$	53.3	16.0 $^{†}$	34.5 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
10			1					48.2	16.0 $^{†}$	29.5 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	49.5	15.8 $^{†}$	34.3 $^{†}$	30.4 $^{†}$	26.8 $^{†}$
11			2					46.9	16.0 $^{†}$	29.5 $^{†}$	27.2 $^{†}$	24.2 $^{†}$	48.1	15.7 $^{†}$	34.2 $^{†}$	30.3 $^{†}$	26.7 $^{†}$
12			1	Input gate (IG)	✗	✓	✗	52.1	16.2 $^{†}$	29.6 $^{†}$	27.5 $^{†}$	24.4 $^{†}$	54.3	15.9 $^{†}$	34.4 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
13			1	Forget gate (FG)				52.2	16.2 $^{†}$	29.7 $^{†}$	27.4 $^{†}$	24.4 $^{†}$	53.9	16.0 $^{†}$	34.4 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
14			1	Cell input (CI)				49.2	16.1 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	50.5	15.9 $^{†}$	34.3 $^{†}$	30.5 $^{†}$	26.9 $^{†}$
15			1	Output gate (OG)				52.5	16.2 $^{†}$	29.7 $^{†}$	27.5 $^{†}$	24.5 $^{†}$	53.0	16.0 $^{†}$	34.4 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
16			2	Input gate (IG)	✗	✓	✗	51.4	16.2 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.4 $^{†}$	52.5	15.9 $^{†}$	34.5 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
17			2	Forget gate (FG)				51.4	16.2 $^{†}$	29.6 $^{†}$	27.4 $^{†}$	24.4 $^{†}$	52.4	15.9 $^{†}$	34.4 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
18			2	Cell input (CI)				47.6	16.0 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	48.0	15.7 $^{†}$	34.2 $^{†}$	30.3 $^{†}$	26.7 $^{†}$
19			2	Output gate (OG)				51.2	16.2 $^{†}$	29.6 $^{†}$	27.4 $^{†}$	24.4 $^{†}$	52.0	15.9 $^{†}$	34.4 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
20		NAS (Search Space: Layer 1,2 IG,FG,CI,OG)	1-2 (top1)	Cell input	✗	✓	✗	46.0	15.9 $^{†}$	29.5 $^{†}$	27.2 $^{†}$	24.2 $^{†}$	47.1	15.6 $^{†}$	34.2 $^{†}$	30.2 $^{†}$	26.7 $^{†}$
21			1 (top2)					49.2	16.1 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	50.5	15.9 $^{†}$	34.3 $^{†}$	30.5 $^{†}$	26.9 $^{†}$
22			2 (top3)					47.6	16.0 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	48.0	15.7 $^{†}$	34.2 $^{†}$	30.3 $^{†}$	26.7 $^{†}$
23	+LSTM	Manual	2	All gates	✗	✓	✗	47.4	15.9 $^{†}$	29.4 $^{†}$	27.2 $^{†}$	24.2 $^{†}$	48.3	15.6 $^{†}$	34.1 $^{†}$	30.2 $^{†}$	26.6 $^{†}$
24	+LSTM	NAS	1-2	Cell input	✗	✓	✗	47.1	15.9 $^{†}$	29.4 $^{†}$	27.1 $^{†}$	24.1 $^{†}$	47.3	15.6 $^{†}$	34.1 $^{†}$	30.2 $^{†}$	26.6 $^{†}$
25	GP LSTM +3g	Manual	1	Input gate (IG)	✓	✓	✗	58.3	16.4	30.0	27.7	24.7	57.6	16.3	34.8	30.9	27.3
26			1	Forget gate (FG)				54.3	16.1 $^{†}$	29.9	27.5 $^{†}$	24.5	55.2	16.1	34.6	30.8	27.2
27			1	Cell input (CI)				52.4	16.0 $^{†}$	29.7 $^{†}$	27.4 $^{†}$	24.4 $^{†}$	53.8	16.0	34.5 $^{†}$	30.4 $^{†}$	27.0 $^{†}$
28			1	Output gate (OG)				50.1	16.1 $^{†}$	29.7 $^{†}$	27.5 $^{†}$	24.4 $^{†}$	51.5	16.0	34.5 $^{†}$	30.5 $^{†}$	27.0 $^{†}$
29			1	c-Gate (cG)				54.4	16.2 $^{†}$	29.9	27.5	24.5	55.7	16.1	34.6	30.7	27.1
30			1	h-Gate (hG)				51.2	16.1 $^{†}$	29.6 $^{†}$	27.4 $^{†}$	24.4 $^{†}$	52.5	16.0	34.4 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
31			1	i-Gate (iG)				56.7	16.3	29.9	27.7	24.6	56.9	16.2	34.7	30.8	27.2
32			2	Input gate (IG)	✓	✓	✗	57.3	16.3	30.0	27.8	24.7	58.0	16.3	34.8	31.0	27.4
33			2	Forget gate (FG)				58.6	16.4	30.1	27.8	24.8	57.5	16.3	34.9	31.2	27.5
34			2	Cell input (CI)				50.4	16.0 $^{†}$	29.6 $^{†}$	27.4 $^{†}$	24.3 $^{†}$	51.2	15.9 $^{†}$	34.4 $^{†}$	30.4 $^{†}$	26.9 $^{†}$
35			2	Output gate (OG)				53.1	16.1 $^{†}$	29.8	27.5 $^{†}$	24.5 $^{†}$	54.5	16.0	34.5 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
36			2	c-Gate (cG)				65.5	16.6	30.2	27.9	24.9	64.0	16.5	34.8	31.2	27.5
37			2	h-Gate (hG)				53.3	16.2 $^{†}$	29.7 $^{†}$	27.5 $^{†}$	24.5 $^{†}$	54.6	16.0	34.7	30.7 $^{†}$	27.1
38			2	i-Gate (iG)				55.7	16.3	29.8	27.7	24.6	58.3	16.1	34.6	30.7	27.1
39			1-2	Cell input (CI)	✓	✓	✗	48.4	16.1 $^{†}$	29.6 $^{†}$	27.2 $^{†}$	24.3 $^{†}$	48.2	15.8 $^{†}$	34.3 $^{†}$	30.5 $^{†}$	26.9 $^{†}$
40		NAS (Search Space: Layer 1,2 CI,OG,hG)	1-2 (top1)	h-Gate	✓	✓	✗	48.1	16.0 $^{†}$	29.5 $^{†}$	27.4 $^{†}$	24.3 $^{†}$	48.9	15.8 $^{†}$	34.3 $^{†}$	30.5 $^{†}$	26.9 $^{†}$
41			1 (top2)					51.2	16.1 $^{†}$	29.6 $^{†}$	27.4 $^{†}$	24.4 $^{†}$	52.6	16.0	34.4 $^{†}$	30.6 $^{†}$	27.0 $^{†}$
42			2 (top3)					53.3	16.2 $^{†}$	29.7 $^{†}$	27.5 $^{†}$	24.5 $^{†}$	55.6	16.0	34.7	30.7 $^{†}$	27.1 $^{†}$
43	+LSTM	Manual	1-2	Cell input	✓	✓	✗	48.2	15.9 $^{†}$	29.5 $^{†}$	27.2 $^{†}$	24.2 $^{†}$	48.6	15.7 $^{†}$	34.2 $^{†}$	30.5 $^{†}$	26.8 $^{†}$
44	+LSTM	NAS	1-2	h-Gate	✓	✓	✗	47.6	15.8 $^{†}$	29.4 $^{†}$	27.2 $^{†}$	24.1 $^{†}$	49.2	15.6 $^{†}$	34.2 $^{†}$	30.3 $^{†}$	26.7 $^{†}$
45	Variational LSTM +3g	Manual	1	Hidden output	✗	✗	✓	53.6	16.2	29.7	27.6	24.5	54.7	16.0	34.5	30.7	27.1
46			2					53.6	16.2	29.8	27.6	24.5	55.0	16.0	34.5	30.7	27.1
47			1-2					52.7	16.2	29.7	27.5	24.4	53.8	15.9	34.4	30.6	27.0
48	+LSTM	Manual	1-2	Hidden output	✗	✗	✓	51.4	16.0 $^{†}$	29.5 $^{†}$	27.4 $^{†}$	24.3 $^{†}$	52.4	15.7 $^{†}$	34.3 $^{†}$	30.5 $^{†}$	26.8 $^{†}$

Table III: Perplexity and WER% of the baseline 3-gram (3g), LSTM-RNN LMs with standard point estimate, and L1, or L2 regularized or MAP estimated LSTM-RNN LMs, and Bayesian, GP and variational LSTM-RNN LMs with various forms of uncertainty modeling on the AMI dev and eval sets of ihm, sdm1 and mdm8 conditions. "Design" (also in Table IV and VI) denotes whether the network internal positions to be Bayesian estimated are selected manually (Manual), or automatically using NAS on the listed search space. "+3g" and "+LSTM" denote the results of further interpolation with the baseline 3-gram and LSTM-RNN LMs. "†" denotes statistically significant WER reductions were obtained over the LSTM-RNN baseline (line 2).

ID	LMs	Bayesian			Uncertainty			PPL	WER(%) of dev				PPL	WER(%) of eval
ID	LMs	Design	Block	Position	$λ$	$θ$	$z$	(dev)	ihm	sdm1	mdm8	Avg.	(eval)	ihm	sdm1	mdm8	Avg.
1	3-gram	Not Applied			✗	✗	✗	81.9	18.1	31.8	29.5	26.5	93.6	18.4	36.4	32.5	29.1
2	Transformer +3g							46.9	16.1	29.9	27.5	24.5	48.4	15.9	34.5	30.5	27.0
3	Transformer(L1) +3g							46.8	16.1	29.9	27.5	24.5	48.3	15.9	34.5	30.6	27.0
4	+Transformer							46.8	16.0	29.8	27.4	24.4	48.1	15.8	34.5	30.5	26.9
5	Transformer(L2) +3g							46.8	16.1	29.9	27.5	24.5	48.2	15.9	34.5	30.4	26.9
6	+Transformer							46.7	16.0	29.8	27.4	24.4	48.1	15.8	34.4	30.4	26.9
7	Transformer(MAP) +3g							47.1	16.1	29.9	27.5	24.5	48.3	15.9	34.5	30.5	27.0
8	+Transformer							46.8	16.0	29.8	27.4	24.4	48.1	15.8	34.4	30.5	26.9
9	Bayesian Transformer +3g	Manual	-	EMB	✗	✓	✗	48.1	16.1	29.8	27.5	24.5	49.4	15.9	34.5	30.5	27.0
10			1	MHA				47.0	16.0	29.8	27.4	24.4	48.2	15.8	34.4	30.5	26.9
11			1	FFN				46.6	16.0	29.7 $^{†}$	27.4	24.4	47.3	15.8	34.3 $^{†}$	30.4	26.8
12			1-2	FFN	✗	✓	✗	47.1	16.0	29.7	27.4	24.4	48.0	15.8	34.4	30.5	26.9
13			1-3					47.2	16.0	29.8	27.5	24.4	48.3	15.8	34.5	30.5	26.9
14			1-4					47.8	16.0	29.8	27.5	24.4	49.2	15.9	34.5	30.7	27.0
15			1-5					47.9	16.1	29.9	27.7	24.6	49.6	15.9	34.4	30.7	27.0
16			1-6					49.2	16.1	30.0	27.7	24.6	51.5	16.0	34.5	30.7	27.1
17		NAS (Search Space: Layer 1-6 FFN)	1-2 (top1)	FFN	✗	✓	✗	47.1	16.0	29.7	27.4	24.4	48.0	15.8	34.4	30.5	26.9
18			1 (top2)					46.6	16.0	29.7 $^{†}$	27.4	24.4	47.3	15.8	34.3 $^{†}$	30.4	26.8
19			2 (top3)					46.9	16.0	29.7	27.4	24.4	47.8	15.8	34.4	30.5	26.9
20	+Transformer	Manual	1	FFN	✗	✓	✗	46.6	15.9 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	47.2	15.7 $^{†}$	34.3 $^{†}$	30.4 $^{†}$	26.8 $^{†}$
21	+Transformer	NAS	1-2	FFN	✗	✓	✗	46.9	15.9 $^{†}$	29.7 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	47.8	15.7 $^{†}$	34.3 $^{†}$	30.4	26.8 $^{†}$
22	GP Transformer +3g	Manual	1	FFN	✓	✓	✗	47.0	15.9 $^{†}$	29.7 $^{†}$	27.4	24.3	48.1	15.8	34.4	30.4	26.9
23			1-2					47.4	16.0	29.8	27.4	24.4	48.5	15.9	34.4	30.5	26.9
24			1-3					47.6	16.0	29.8	27.5	24.4	48.8	15.9	34.5	30.5	27.0
25			1-4					47.8	16.1	29.8	27.5	24.5	49.0	15.9	34.4	30.5	26.9
26			1-5					48.3	16.2	29.9	27.5	24.5	49.4	15.9	34.5	30.6	27.0
27			1-6					48.1	16.1	29.9	27.6	24.5	49.5	16.0	34.7	30.6	27.1
28		NAS (Search Space: Layer 1-6 FFN)	1,2,6 (top1)	FFN	✓	✓	✗	46.3	16.0	29.7 $^{†}$	27.4	24.4	47.1	15.8	34.4	30.4	26.9
29			1,6 (top2)					46.5	16.0	29.8	27.4	24.4	47.2	15.8	34.4	30.5	26.9
30			1,2,5,6 (top3)					46.4	16.0	29.8	27.5	24.4	47.1	15.8	34.4	30.5	26.9
31	+Transformer	Manual	1	FFN	✓	✓	✗	46.4	15.9 $^{†}$	29.6 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	47.2	15.7 $^{†}$	34.3 $^{†}$	30.3 $^{†}$	26.8 $^{†}$
32	+Transformer	NAS	1,2,6	FFN	✓	✓	✗	46.6	15.9 $^{†}$	29.7 $^{†}$	27.3 $^{†}$	24.3 $^{†}$	47.2	15.7 $^{†}$	34.3 $^{†}$	30.3 $^{†}$	26.8 $^{†}$
33	Variational Transformer +3g	Manual	1	Hidden output	✗	✗	✓	48.3	16.1	29.8	27.5	24.5	49.5	15.8	34.5	30.5	26.9
34			2					48.5	16.1	29.8	27.6	24.5	49.6	15.8	34.6	30.6	26.9
35			1,2					48.4	16.1	29.8	27.5	24.5	49.4	15.8	34.5	30.6	27.0
36	+Transformer	Manual	1	Hidden output	✗	✗	✓	47.3	16.0	29.7	27.4	24.4	48.4	15.7 $^{†}$	34.4	30.5	26.9

Table IV: Perplexity and WER% of the baseline 3-gram (3g), Transformer LMs with standard point estimate, and L1, or L2 regularized or MAP estimated Transformer LMs, and Bayesian, GP and variational Transformer LMs with various forms of uncertainty modeling on the AMI dev and eval sets of ihm, sdm1 and mdm8 conditions. "FFN", "MHA" and "EMB" denote the feed-forward, multi-head self-attention and embedding layers respectively. "+3g" and "+Transformer" denote the results of further interpolation with the baseline 3-gram and Transformer LMs. "†" denotes statistically significant WER reductions were obtained over the Transformer baseline (line 2).

ID	LMs	PPL	WER(%) of dev				PPL	WER(%) of eval
ID	LMs	(dev)	ihm	sdm1	mdm8	Avg.	(eval)	ihm	sdm1	mdm8	Avg.
1	Oracle	-	10.3	20.9	19.1	16.8	-	9.9	25.1	21.8	18.9
2	Transformer +LSTM +3g (line 2, Table III + line 2, Table IV)	54.8	15.9	29.5	27.3	24.2	55.1	15.6	34.3	30.4	26.8
3	All NNLMs(L1) (line 4, Table III + line 4, Table IV)	50.2	15.8	29.4	27.3	24.2	52.0	15.6	34.2	30.3	26.7
4	All NNLMs(L2) (line 6, Table III + line 6, Table IV)	50.1	15.8	29.4	27.1	24.1	51.4	15.5	34.2	30.2	26.6
5	All NNLMs(MAP) (line 8, Table III + line 8, Table IV)	48.5	15.8	29.4	27.1	24.0	48.8	15.4	34.1	30.2	26.6
6	All Bayesian NNLMs (line 24, Table III + line 21, Table IV)	46.4	15.6 $^{†}$	29.2 $^{†}$	26.9 $^{†}$	23.9 $^{†}$	47.9	15.3 $^{†}$	33.8 $^{†}$	29.9 $^{†}$	26.3 $^{†}$
7	All GP NNLMs (line 44, Table III + line 32, Table IV)	46.5	15.7 $^{†}$	29.3	27.1 $^{†}$	24.0 $^{†}$	48.2	15.3 $^{†}$	34.0 $^{†}$	30.1 $^{†}$	26.5 $^{†}$
8	All Variational NNLMs (line 48, Table III + line 36, Table IV)	47.4	15.8	29.4	27.1 $^{†}$	24.1	48.4	15.4 $^{†}$	34.2	30.2 $^{†}$	26.6 $^{†}$

Table V: Perplexity and WER(%) of the baseline LSTM-RNN+Transformer interpolated LM and the 5-way interpolated model between the baseline 3-gram, LSTM-RNN, Transformer LMs and their respective best Bayesian, GP and variational counterparts on AMI dev and eval sets of ihm, sdm1 and mdm8 conditions. "†" denotes statistically significant results were obtained over the LSTM-RNN+Transformer+3g interpolation (line 2).

V Experiments

In this section the performance of various Bayesian learning based LSTM-RNN and Transformer LMs is evaluated on speech recognition systems using state-of-the-art LF-MMI sequence trained time delay neural networks (TDNNs) [31] featuring speed perturbation based data augmentation and i-Vector speaker adaptation [79]. Audio-visual multi-channel beamforming and recognition [100] was also used when processing overlapping speech. In Sec.V-A, the first set of experiments conducted on the AMI meeting room data [75] is presented. These experiments were designed to provide a detailed side by side analysis of the three Bayesian learning based NNLM approaches presented in Sec.III, following the implementation details and the best configurations of Bayesian NNLMs discussed in Sec.IV. A second set of experiments is conducted in Sec.V-B on an audio-visual multi-channel overlapped speech recognition task based on the Oxford-BBC Lip Reading Sentences 2 (LRS2) corpus [76] to further evaluate the performance of Bayesian NNLMs.

All the LSTM-RNN LMs investigated in this paper consist of 2 LSTM layers, while both the input word embedding and hidden layer sizes were set as 1024. All the Transformer LMs contain 6 Transformer layers. The dimensionality of all query, key and value embedding layers was set as 512, and the hidden vector dimensionality was set as 4096. Both LSTM-RNN and Transformer LMs were implemented using PyTorch [101]. All NNLMs were trained using a single NVIDIA Tesla V100 Volta GPU card. SGD parameter update in a mini-batch mode (32 sentences per batch) was used. Dropout regularization was also used and the dropout rate was set at 0.2 in all experiments. A set of baseline NNLMs using L1 [49] or L2 [50] regularization, or maximum a posteriori (MAP) estimation [51, 52] with the same priors used by various forms of Bayesian NNLMs in this work, are also presented as contrasts to the Bayesian estimated NNLMs. Initial learning rate settings of 5 and 0.1 were used for the baseline LSTM-RNN and Transformer LMs respectively. Smaller initial learning rates of 0.1 and 0.01 were used for the Bayesian estimated LSTM-RNN and Transformer LMs. For all models, the learning rate was halved during SGD update whenever no perplexity reduction was obtained on the validation set. These baseline NNLM settings are based on those provided by the Kaldi recipe⁴ before being further fine-tuned.

⁴ Kaldi: egs/swbd/s5c/local/pytorchnn/run_nnlm.sh
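The learning rate schedule described above can be sketched as follows. The text does not state whether a reduction is measured against the previous epoch or against the best perplexity so far; the sketch assumes the latter, and the function name and perplexity values are hypothetical.

```python
def halve_on_plateau(initial_lr, validation_perplexities):
    """Return the learning rate in force after each validation check.

    The rate is halved whenever the validation perplexity fails to improve
    on the best value seen so far (an assumption, see the lead-in).
    """
    learning_rate = initial_lr
    best_perplexity = float("inf")
    schedule = []
    for perplexity in validation_perplexities:
        if perplexity < best_perplexity:
            best_perplexity = perplexity
        else:
            learning_rate /= 2.0
        schedule.append(learning_rate)
    return schedule


# Baseline LSTM-RNN initial rate of 5 with made-up validation perplexities.
schedule = halve_on_plateau(5.0, [80.0, 70.0, 71.0, 65.0, 66.0])
```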

During performance evaluation, by default all the NNLMs, baseline or Bayesian estimated, were linearly interpolated with the $n$ -gram back-off LM. All the Bayesian estimated NNLMs of this paper are optionally further 3-way interpolated with the $n$ -gram and their respective baseline NNLMs using standard point estimate. For example, for the Bayesian LSTM-RNN LM produced word probability, $P_{b l s t m} (\cdot)$ , the 3-way interpolated LM probability with the baseline $n$ -gram $P_{n g} (\cdot)$ and point estimated LSTM-RNN $P_{l s t m} (\cdot)$ LMs is given by

	$P(w_{t} | w_{1}^{t-1}) = λ_{1} P_{ng}(w_{t} | w_{1}^{t-1}) + λ_{2} P_{lstm}(w_{t} | w_{1}^{t-1}) + (1 - λ_{1} - λ_{2}) P_{blstm}(w_{t} | w_{1}^{t-1})$		(37)

where $λ_{1}, λ_{2}$ denote the interpolation weights that are optimized by minimizing the perplexity on the held-out development data set using the expectation maximization (EM) algorithm. The above LM interpolation also allows a more powerful multi-way combination between $n$-gram, LSTM-RNN and Transformer LMs using point estimate or Bayesian learning to be performed. Statistical significance tests were conducted at level $α = 0.05$ based on the matched pairs sentence segment word error (MAPSSWE) test [102] for recognition performance analysis.
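As an illustration of how such interpolation weights can be estimated, the following is a minimal sketch of the standard EM update for linear mixture weights, under the assumption that each component LM's per-word probabilities on the held-out data have already been computed. The function names and the toy probability streams are hypothetical. The returned list corresponds to $λ_{1}$, $λ_{2}$ and $1 - λ_{1} - λ_{2}$ in Eq. (37).

```python
import math


def em_interpolation_weights(streams, iterations=50):
    """Estimate linear interpolation weights by EM.

    streams[k][t] is the probability that component LM k assigns to the
    t-th held-out word given its history (assumed precomputed).
    """
    num_lms = len(streams)
    num_words = len(streams[0])
    weights = [1.0 / num_lms] * num_lms          # uniform initialization
    for _ in range(iterations):
        counts = [0.0] * num_lms
        for t in range(num_words):
            mix = sum(w * s[t] for w, s in zip(weights, streams))
            for k in range(num_lms):
                # E-step: posterior responsibility of LM k for word t
                counts[k] += weights[k] * streams[k][t] / mix
        # M-step: new weight is the average responsibility
        weights = [c / num_words for c in counts]
    return weights


def perplexity(streams, weights):
    """Perplexity of the interpolated LM on the held-out words."""
    num_words = len(streams[0])
    log_sum = 0.0
    for t in range(num_words):
        log_sum += math.log(sum(w * s[t] for w, s in zip(weights, streams)))
    return math.exp(-log_sum / num_words)


# Hypothetical per-word probabilities from the n-gram, point estimated
# LSTM-RNN and Bayesian LSTM-RNN LMs on four held-out words.
p_ng = [0.10, 0.20, 0.05, 0.30]
p_lstm = [0.30, 0.10, 0.20, 0.20]
p_blstm = [0.40, 0.30, 0.10, 0.25]

lambdas = em_interpolation_weights([p_ng, p_lstm, p_blstm])
print(lambdas, perplexity([p_ng, p_lstm, p_blstm], lambdas))
```

Each EM iteration cannot decrease the held-out likelihood, so the perplexity obtained with the estimated weights is never worse than that of the uniform starting point. The same routine extends unchanged to the multi-way combinations mentioned above by passing more probability streams.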

V-A Experiments on AMI Meeting Room Data

System Description: The Augmented Multi-party Interaction (AMI) speech corpus consists of approximately 100 hours of audio data collected using both headset microphones and distant microphone arrays in a meeting environment. Following the Kaldi recipe (Kaldi: egs/swbd/s5c/local/chain/tuning/run_tdnn_7q.sh), three LF-MMI trained [31] acoustic models with speech perturbation based data augmentation and i-Vector based speaker adaptation [79] were then constructed. The AMI 8.9-hour dev and 8.7-hour eval sets recorded under close talking microphone (ihm), single distant microphone (sdm) and multiple distant microphones (mdm) conditions were used. A 47K word recognition lexicon was used. Various LSTM-RNN and Transformer LMs were trained on a combined data set of 15M words including both the AMI and Fisher transcripts, before being used to rescore the 3-gram LM produced N-best lists (N = 20) for WER performance evaluation.
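The N-best rescoring step can be sketched as follows. This is a generic illustration rather than the exact pipeline used here: it assumes each hypothesis carries an acoustic log-likelihood, that the new LM log-probability is combined with it through a scale factor, and that the hypothesis with the highest combined score is selected. The function names, the LM scale and the toy scores are all hypothetical.

```python
def rescore_nbest(nbest, lm_logprob, lm_scale=10.0):
    """Re-rank an N-best list with a new language model.

    nbest:      list of (words, acoustic_logprob) pairs
    lm_logprob: callable mapping a list of words to an LM log-probability
    lm_scale:   hypothetical LM scale factor (an assumption of this sketch)
    """
    scored = []
    for words, am_logprob in nbest:
        total = am_logprob + lm_scale * lm_logprob(words)
        scored.append((total, words))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [words for _, words in scored]


# Toy stand-in for an interpolated NNLM: sums per-word log-probabilities.
TOY_LOGPROBS = {"recognize": -2.0, "speech": -2.0,
                "wreck": -4.0, "a": -1.0, "nice": -3.0, "beach": -4.0}


def toy_lm(words):
    return sum(TOY_LOGPROBS.get(w, -5.0) for w in words)


nbest = [
    (["wreck", "a", "nice", "beach"], -99.0),   # slightly better acoustics
    (["recognize", "speech"], -100.0),
]
print(rescore_nbest(nbest, toy_lm)[0])  # ['recognize', 'speech']
```

In this toy case the second hypothesis wins after rescoring because its LM score outweighs its slightly worse acoustic score.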

Figure 4: Performance contrasts between baseline, Bayesian, GP and Variational counterparts for both LSTM-RNN and Transformer LMs trained on randomly selected subsets of training data (varying from 5% to 100% of the 15M word AMI+SWBD training set): perplexity and average WER% measured across ihm, sdm and mdm conditions on the AMI dev (a) and eval (b) data for LSTM-RNN LMs; perplexity and average WER% similarly measured on the AMI dev (c) and eval (d) data for Transformer LMs.

Experiments on LSTM-RNN LMs: Table III presents the perplexity and WER performance of various Bayesian, GP and variational LSTM-RNN LMs. Several trends can be found:

1) Irrespective of the precise form of Bayesian modeling being used, whether Bayesian (line 9 to 24), Gaussian Process (line 25 to 44) or variational (line 45 to 48), a general trend can be observed that Bayesian estimation produced consistent perplexity and WER reductions over the baseline LSTM-RNN LM (line 2) using point estimated parameters. In particular, the majority of the Bayesian and GP LSTM-RNN LMs (line 9 to 44), with Bayesian or GP modeling applied to various network internal locations that were either manually or automatically selected using NAS, produced statistically significant ($α = 0.05$) WER reductions over the baseline LSTM-RNN LM (line 2). Performance improvements were also obtained over the comparable L1 (line 3, 4) or L2 (line 5, 6) regularized or MAP estimated (line 7, 8) LSTM-RNN LMs.

2) A more detailed ablation study on the effect of applying Bayesian uncertainty modeling to different LSTM unit internal components (previously shown in Fig.1) is shown for Bayesian (line 9 to 22) and GP (line 25 to 42) LSTM-RNN LMs respectively. The results in Table III suggest that applying Bayesian estimation to only the LSTM cell activation parameters of both hidden layers (line 20 and 24), auto-configured as the 1-best location by the NAS approach of Sec.IV-B, produced the best performance in comparison against other locations. Similarly, applying GP modeling only to the hidden vector input gate ("h-Gate", shown in purple in the bottom left corner of Fig.1 (a)) of both hidden layers (line 40 and 44) produced the best performance for GP LSTM-RNN LMs.

3) Among the three Bayesian modeling approaches, the Bayesian and GP LSTM-RNN LMs (line 23, 24 and 43, 44) consistently outperform the comparable variational LSTM-RNN LMs (line 48), which consider the uncertainty associated with the hidden output vectors. Between the Bayesian and GP LSTM-RNN LMs, the results in Table III also suggest that, on this data, modeling the additional uncertainty over the activation function choices inside the classic, expert crafted neural structure of LSTM units using GP LSTM-RNN LMs (line 43, 44) brings no further improvements over Bayesian LSTM-RNNs (line 23, 24) that consider the uncertainty over model parameters only.

4) The best performance was obtained using a 3-way interpolation between the 3-gram, point estimate parameter LSTM-RNN and Bayesian LSTM-RNN LMs (line 24) with the precise position of applying Bayesian estimation auto-configured using NAS. Perplexity reduction of 13 points and WER reductions of 0.5%-0.7% absolute (1.7%-3.7% relative) were obtained over the baseline LSTM-RNN LM (line 2).

Experiments on Transformer LMs: A set of experiments that are comparable to those previously shown in Table III for LSTM-RNN LMs were then conducted for Transformer LMs. These are shown in Table IV with the following trends:

1) After interpolating the Bayesian (line 20, 21), Gaussian Process (line 31, 32) or variational (line 36) Transformer LM with the 3-gram and baseline Transformer LM, a general trend is that Bayesian estimation produced small, but consistent and statistically significant ($α = 0.05$) WER reductions of 0.2% to 0.3% absolute over the baseline Transformer LM (line 2) using point estimated parameters across all channel conditions. Performance improvements were also obtained over the comparable L1 (line 3, 4) or L2 (line 5, 6) regularized or MAP estimated (line 7, 8) Transformer LMs.

2) The above improvements from Bayesian learning are notably smaller than those previously found on the comparable LSTM-RNN LM experiments in Table III. This may be attributed to the larger modeling uncertainty that the LSTM-RNN LMs exhibit on this task when compared with the Transformer LM.

3) The best Bayesian Transformer LM performance was obtained using a 3-way interpolation between the baseline 3-gram, point estimate parameter Transformer and GP Transformer LMs (line 31, 32) with the precise location of applying GP estimation either manually set as the lowest positioned feedforward layer, or auto-configured using NAS.

ID	LM	Bayesian Design	Bayesian Layer	Bayesian Position	PPL (Test)	WER(%) clean	WER(%) TF masking	WER(%) Filter & Sum	WER(%) MVDR	WER(%) Avg.
1	Oracle	Not Applied			-	1.7	5.8	5.4	5.2	4.5
2	4-gram				77.0	8.5	17.7	16.6	16.1	14.7
3	LSTM +4g				65.7	5.8	13.3	12.1	11.8	10.7
4	Transformer +4g				66.3	5.7	12.7	12.1	11.9	10.6
5	Transformer +LSTM +4g				65.8	5.3	12.2	11.3	11.6	10.1
6	LSTM(L1) +4g				65.5	5.7	13.2	12.1	11.8	10.7
7	+LSTM				65.2	5.5	12.9	11.8	11.7	10.5
8	Transformer(L1) +4g				66.1	5.7	12.6	12.0	11.9	10.6
9	+Transformer				65.7	5.5	12.4	11.7	11.8	10.3
10	LSTM(L2) +4g				65.4	5.7	13.1	12.0	11.8	10.7
11	+LSTM				65.0	5.5	12.8	11.7	11.6	10.4
12	Transformer(L2) +4g				65.9	5.6	12.6	11.8	11.9	10.5
13	+Transformer				65.2	5.4	12.4	11.6	11.7	10.3
14	LSTM(MAP) +4g				65.3	5.6	13.1	12.0	11.8	10.6
15	+LSTM				64.9	5.5	12.8	11.7	11.6	10.4
16	Transformer(MAP) +4g				65.8	5.6	12.5	11.8	11.9	10.5
17	+Transformer				65.0	5.5	12.4	11.5	11.7	10.3
18	All NNLMs(L1) (line 7 + 9)				64.9	5.3	12.2	11.2	11.4	10.0
19	All NNLMs(L2) (line 11 + 13)				64.8	5.2	12.2	11.1	11.4	10.0
20	All NNLMs(MAP) (line 15 + 17)				64.6	5.2	12.1	11.1	11.3	9.9
21	Bayesian LSTM +4g	Manual	2	All gates	65.5	5.3 $^{†}$	12.4 $^{†}$	11.8	12.0	10.4
22	Bayesian LSTM +4g	NAS	1,2	Cell input	65.5	5.4 $^{†}$	12.5 $^{†}$	11.8	11.9	10.4 $^{†}$
23	GP LSTM +4g	Manual	1,2	Cell input	65.6	5.2 $^{†}$	12.5 $^{†}$	11.7 $^{†}$	12.2	10.4 $^{†}$
24	GP LSTM +4g	NAS	1,2	h-Gate	65.2	5.3 $^{†}$	12.4 $^{†}$	11.8	12.2	10.4
25	V-LSTM +4g	Manual	1,2	Hidden output	62.4	5.5	12.6 $^{†}$	11.7 $^{†}$	11.6	10.5
26	Bayesian Transformer +4g	Manual	1	FFN	66.7	5.5	12.5	11.8	12.3	10.5
27	Bayesian Transformer +4g	NAS	1,2	FFN	65.0	5.2 $^{‡}$	12.3 $^{‡}$	11.5 $^{‡}$	12.3	10.3 $^{‡}$
28	GP Transformer +4g	Manual	1	FFN	64.7	5.4 $^{‡}$	12.4	11.5 $^{‡}$	11.9	10.3
29	GP Transformer +4g	NAS	1,2,6	FFN	65.4	5.5	12.6	11.8	12.0	10.6
30	V-Transformer +4g	Manual	1	Hidden output	66.1	5.6	12.7	11.9	12.4	10.7
31	Bayesian LSTM +4g +LSTM	Manual	2	All gates	64.4	5.1 $^{†}$	12.1 $^{†}$	11.4 $^{†}$	11.3 $^{†}$	10.0 $^{†}$
32	Bayesian LSTM +4g +LSTM	NAS	1,2	Cell input	64.4	5.1 $^{†}$	12.1 $^{†}$	11.4 $^{†}$	11.3 $^{†}$	10.0 $^{†}$
33	GP LSTM +4g +LSTM	Manual	1,2	Cell input	64.5	5.0 $^{†}$	12.0 $^{†}$	11.3 $^{†}$	11.6	10.0 $^{†}$
34	GP LSTM +4g +LSTM	NAS	1,2	h-Gate	63.9	5.0 $^{†}$	12.0 $^{†}$	11.5 $^{†}$	11.5	10.0 $^{†}$
35	V-LSTM +4g +LSTM	Manual	1,2	Hidden output	62.3	5.3 $^{†}$	12.0 $^{†}$	11.3 $^{†}$	11.6	10.1 $^{†}$
36	Bayesian Transformer +4g +Transformer	Manual	1	FFN	65.9	5.2 $^{‡}$	12.0 $^{‡}$	11.3 $^{‡}$	11.8	10.1 $^{‡}$
37	Bayesian Transformer +4g +Transformer	NAS	1,2	FFN	64.0	5.0 $^{‡}$	11.8 $^{‡}$	11.2 $^{‡}$	11.6 $^{‡}$	9.9 $^{‡}$
38	GP Transformer +4g +Transformer	Manual	1	FFN	63.8	5.0 $^{‡}$	11.9 $^{‡}$	11.1 $^{‡}$	11.4 $^{‡}$	9.9 $^{‡}$
39	GP Transformer +4g +Transformer	NAS	1,2,6	FFN	64.7	5.2 $^{‡}$	12.0 $^{‡}$	11.4 $^{‡}$	11.6 $^{‡}$	10.1 $^{‡}$
40	V-Transformer +4g +Transformer	Manual	1	Hidden output	64.7	5.2 $^{‡}$	12.2 $^{‡}$	11.4 $^{‡}$	11.6 $^{‡}$	10.1 $^{‡}$
41	All Bayesian NNLMs (line 32 + 37)	-	-	-	64.0	4.8 $^{⋆}$	11.7 $^{⋆}$	10.8 $^{⋆}$	11.0 $^{⋆}$	9.6 $^{⋆}$
42	All GP NNLMs (line 34 + 39)	-	-	-	63.7	4.7 $^{⋆}$	11.4 $^{⋆}$	10.7 $^{⋆}$	10.8 $^{⋆}$	9.4 $^{⋆}$
43	All Variational NNLMs (line 35 + 40)	-	-	-	64.1	4.9 $^{⋆}$	11.8 $^{⋆}$	10.8 $^{⋆}$	11.2 $^{⋆}$	9.7 $^{⋆}$

Table VI: Perplexity and WER% of the baseline 4-gram (4g), LSTM-RNN and Transformer LMs with standard point estimate, and using L1 or L2 regularization or MAP estimation, and their Bayesian, GP and variational counterparts with various forms of uncertainty modeling on the LRS2 Test set of "clean", and overlapping conditions obtained using "TF masking", "Filter & Sum" or "MVDR" based audio-visual neural beamforming. "+4g", "+LSTM" and "+Transformer" denote the results of further interpolation with the baseline 4-gram, LSTM-RNN and Transformer LMs. "†", "‡" and "⋆" denote that statistically significant WER reductions were obtained over the baseline LSTM-RNN (line 3), Transformer (line 4) LMs and LSTM-RNN+Transformer interpolation (line 5) respectively.

ID	LM	Subset of Training Data / SNR
1	B-LSTM	5%/0.3	10%/0.4	25%/0.5	50%/0.7	100%/1.1
2	B-Transformer	5%/0.6	10%/0.8	25%/1.1	50%/2.0	100%/3.3
	LM	Hidden (FFN) Layer Dimensionality / SNR
3	B-LSTM	128/5.1	256/2.6	512/1.4	1024/1.1	2048/2.0
4	B-Transformer	512/8.2	1024/4.3	2048/3.0	4096/3.3	8192/3.7

Table VII: The median parameter signal-to-noise ratio (SNR) statistics [57, 47] measured over Bayesian LSTM-RNN (B-LSTM) and Transformer (B-Transformer) LMs trained on randomly selected subsets of training data varying from 5% to 100% of the 15M word AMI+SWBD training set (line 1, 2), and different model sizes varying from 128 to 2048 in terms of both the word embedding and hidden layer dimensionality for LSTM-RNNs and from 512 to 8192 in terms of feed forward module dimensionality for Transformers (line 3, 4).

Experiments of Uncertainty Analysis: The following set of ablation studies were further conducted to evaluate Bayesian NNLMs’ performance and provide modeling uncertainty analyses in Fig.4, Fig.5 and Table VII:

1) A first ablation study shown in Fig.4 is to illustrate that statistically significant WER reductions, and perplexity improvements were consistently obtained using Bayesian learned LSTM-RNN and Transformer LMs trained on randomly selected subsets of training data (varying from 5% to 100% of the 15M word AMI+SWBD training set) over their respective point estimated baseline LMs on each subset of training data. These two sets of experiments serve to analyze the improvements of Bayesian estimation using a series of fixed model complexity versus varying data quantity operating points.

2) Another set of experiments was then conducted using a different set of model complexity versus data quantity operating points by using all the AMI training data, but varying either the LSTM-RNN LM word embedding and hidden layer dimensionality, or the dimensionality of the Transformer LM feed-forward modules. These contrasts are shown in Fig.5.

3) For Bayesian estimated LSTM-RNN and Transformer LMs constructed using the above different model complexity versus data quantity trade-off points, the median signal-to-noise ratio (SNR) of the Gaussian approximated latent NNLM model parameter distributions [57, 47]

(38)

is analyzed and shown in Table VII as a measure of parameter uncertainty. The SNR specific to the $i$-th element $Θ_{i}$ of the parameter $Θ$ is defined as

	$\mathrm{SNR}_{Θ_{i}} = \frac{|μ_{i}|}{σ_{i}}$		(39)

A general trend can be found that either decreasing the training data quantity, or increasing the NNLM model complexity, produced a lower parameter SNR, which indicates an increase in model parameter uncertainty and risk of over-fitting.
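The median parameter SNR statistic can be computed as in the following minimal sketch, which takes the median of the per-parameter ratio of Eq. (39). It assumes the means and standard deviations of the Gaussian approximated parameter distributions are available as flat lists; the function name `median_snr` and the numerical values are hypothetical.

```python
import statistics


def median_snr(means, stds):
    """Median of |mu_i| / sigma_i over all Bayesian parameters."""
    return statistics.median(abs(m) / s for m, s in zip(means, stds))


# Hypothetical variational means and standard deviations of four parameters.
mu = [0.5, -1.2, 0.05, 2.0]
sigma = [0.5, 0.4, 0.5, 1.0]

print(median_snr(mu, sigma))  # 1.5
```

A parameter whose standard deviation is large relative to its mean magnitude contributes a low SNR, which is why lower median values in Table VII are read as higher parameter uncertainty.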

V-B Experiments on the LRS2

System Description: The Oxford-BBC Lip Reading Sentences 2 (LRS2) corpus is one of the largest publicly available corpora for audio-visual speech recognition, and consists of news and talk shows extracted from BBC broadcasts. Multi-channel cocktail party overlapped speech was simulated with an 85% average overlapping ratio based on the LRS2 corpus. An audio-visual multi-channel overlapped speech recognition (AVSR) system [103, 100] was then constructed, featuring tightly integrated separation front-end and recognition back-end components jointly fine-tuned using a multi-task interpolation of the scale-invariant signal to noise ratio (Si-SNR) and LF-MMI criteria. An AVSR system trained on non-overlapped, clean speech, as well as another three AVSR systems trained on overlapped speech using different audio-visual multi-channel beamforming approaches including TF Masking, Filter & Sum and Mask-based MVDR, were used for evaluation. The LRS2 corpus is already divided into three subsets: Pre-train, Train-val and Test. Various baseline LSTM-RNN and Transformer LMs were trained on the combined Pre-train and Train-val set containing 2.5M words using a 41K word recognition lexicon. The resulting NNLMs were used to rescore the 4-gram LM produced N-best lists (N = 20) for WER evaluation.

Experimental Results: A set of experiments that are comparable to those in Sec.V-A conducted on the AMI data in Table III and IV for LSTM-RNN and Transformer LMs are shown in Table VI. For all the Bayesian, GP and variational LSTM-RNN and Transformer LMs, their respective best network internal positions to apply Bayesian modeling were based on those previously selected either manually or automatically using NAS in Table III and IV. The following trends are found:

1) For both LSTM-RNN and Transformer LMs, using the Bayesian (line 21, 22, 26, 27, 31, 32, 36, 37), Gaussian Process (line 23, 24, 28, 29, 33, 34, 38, 39) or variational (line 25, 30, 35, 40) approaches, a general trend can be observed that Bayesian estimation produced consistent perplexity and WER reductions over the corresponding baseline NNLMs (line 3 and 4) using point estimated, deterministic parameters. Performance improvements were also obtained over the comparable L1 (line 6 to 9), or L2 (line 10 to 13) regularized or MAP estimated (line 14 to 17) LSTM-RNN and Transformer LMs.

2) Among the three Bayesian modeling approaches, the GP LSTM-RNN and Transformer LMs (line 23, 24, 28, 29, 33, 34, 38, 39), on average across both the clean and three overlapped speech conditions, outperform their Bayesian (line 21, 22, 26, 27, 31, 32, 36, 37) and variational counterparts (line 25, 30, 35, 40), which consider the uncertainty over either model parameters or hidden output vectors only. This is expected as on this LRS2 task the training data size is reduced to 2.5M words, 16.7% of the 15M word AMI training data, while the LSTM-RNN or Transformer LM model complexity remains the same. Hence, a larger risk of over-fitting arises and requires more powerful uncertainty modeling over both the hidden activation functions and their weight parameters using GP NNLMs.

3) No statistically significant ($α = 0.05$) WER difference was obtained among the best performing Bayesian/GP LSTM-RNN LMs (line 21, 22 and 24) and variational LSTM-RNN LM (line 25) on the "clean", "TF masking" and "Filter & Sum" test conditions. A statistically significant WER difference was only obtained by using the variational LSTM-RNN LM (line 25) over Bayesian/GP LSTM-RNN LMs (line 21, 23, 24) on the "MVDR" beamforming test condition. When comparing the overall WER averaged over the four test conditions, again there was no statistically significant WER difference between the Bayesian/GP LSTM-RNN LMs (line 21, 22 and 24) and variational LSTM-RNN LM (line 25).

4) The best performance was obtained using a 5-way interpolation between the baseline 4-gram, LSTM-RNN, Transformer LMs and their respective Gaussian Process counterparts (line 42, last column) with their internal positions of applying GP modeling auto-configured using NAS. Statistically significant average WER reductions of 1.3% and 1.2% absolute (12.1% and 11.3% relative) were obtained over the baseline LSTM-RNN (line 3, last column) and Transformer LMs (line 4, last column) respectively after model combination.
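The absolute and relative reductions quoted above follow directly from the average WERs in the last column of Table VI (line 3: 10.7, line 4: 10.6, line 42: 9.4). A short sketch of the arithmetic, with `wer_reduction` as a hypothetical helper name:

```python
def wer_reduction(baseline_wer, system_wer):
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - system_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative


# Average WERs from Table VI: baselines (line 3, 4) and best system (line 42).
best_wer = 9.4
for name, baseline_wer in [("LSTM-RNN", 10.7), ("Transformer", 10.6)]:
    absolute, relative = wer_reduction(baseline_wer, best_wer)
    print(f"{name}: {absolute:.1f}% absolute, {relative:.1f}% relative")

# LSTM-RNN: 1.3% absolute, 12.1% relative
# Transformer: 1.2% absolute, 11.3% relative
```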

VI Conclusion

This paper presents a full Bayesian learning framework including three methods to systematically account for the underlying uncertainty in state-of-the-art LSTM-RNN and Transformer LMs. The uncertainty over their model parameters, hidden activations and hidden output vectors is modeled using Bayesian, Gaussian Process and variational LSTM-RNN or Transformer LMs respectively. Novel inference approaches were used to speed up model inference and keep the computational cost of Bayesian NNLM training and evaluation comparable to that of the baseline LSTM-RNN and Transformer LMs with point estimate based parameters. Experimental results obtained on the AMI meeting transcription and Oxford-BBC LipReading Sentences 2 overlapped speech recognition tasks suggest that the three proposed Bayesian neural language modeling approaches can effectively mitigate the risk of over-fitting and poor generalization when LSTM-RNN and Transformer LMs with deterministic model parameters are trained on limited task domain data. Experimental results across multiple data sets and testing conditions suggest that the Bayesian or GP estimated LSTM-RNN or Transformer LMs with NAS auto-configured layer level uncertainty modeling and efficient inference using a minimal number of parameter samples can provide consistent performance improvements when being further linearly interpolated with the $n$-gram LM and respective baseline NNLMs using point estimation (e.g. line 24, 44 in Table III, line 21, 32 in Table IV, and line 32, 34, 37 in Table VI). Performance improvements from Bayesian learning can be retained when combining LSTM-RNN or Transformer LMs derived using standard or Bayesian estimation (e.g. line 42, Table VI). Future research will investigate Bayesian learning approaches for fast domain adaptation of large scale pre-trained neural network language models.

Acknowledgment

This research is supported by Hong Kong Research Grants Council GRF grant No. 14200218, 14200220, 14200021, Innovation & Technology Fund grant No. ITS/254/19 and InP/057/21.

References

[1] R. Kneser et al., “Improved backing-off for m-gram language modeling,” in ICASSP, 1995, pp. 181–184.
[2] S. F. Chen et al., “An empirical study of smoothing techniques for language modeling,” Computer Speech & Language, 1999.
[3] Y. Bengio et al., “A neural probabilistic language model,” JMLR, 2003.
[4] H. Schwenk, “Continuous space language models,” Computer Speech & Language, 2007.
[5] E. Arisoy et al., “Deep neural network language models,” in NAACL-HLT Workshop, 2012.
[6] H.-S. Le et al., “Structured output layer neural network language models for speech recognition,” TASLP, 2012.
[7] T. Mikolov et al., “Recurrent neural network based language model,” in Interspeech, 2010.
[8] M. Sundermeyer et al., “From feedforward to recurrent lstm neural networks for language modeling,” TASLP, 2015.
[9] X. Chen et al., “Efficient training and evaluation of recurrent neural network language models for automatic speech recognition,” TASLP, 2016.
[10] X. Chen et al., “Exploiting future word contexts in neural network language models for speech recognition,” TASLP, 2019.
[11] K. Irie et al., “Language modeling with deep transformers,” in Interspeech, 2019.
[12] K. Li et al., “An empirical study of transformer-based neural language model adaptation,” in ICASSP, 2020.
[13] E. Beck et al., “Lvcsr with transformer language models,” in Interspeech, 2020.
[14] P. Baquero-Arnal et al., “Improved hybrid streaming asr with transformer language models,” in Interspeech, 2020.
[15] G. Sun et al., “Transformer language models with lstm-based cross-utterance information representation,” in ICASSP, 2021.
[16] K. Irie, “Advancing neural language modeling in automatic speech recognition,” Ph.D. dissertation, RWTH Aachen University, 2020.
[17] I. Sheikh et al., “Transformer versus lstm language models trained on uncertain asr hypotheses in limited data scenarios,” Preprint, 2021.
[18] S. Hochreiter et al., “Long short-term memory,” Neural computation, 1997.
[19] A. Vaswani et al., “Attention is all you need,” in NIPS, 2017.
[20] J. Xu et al., “Mixed precision quantization of transformer language models for speech recognition,” in ICASSP, 2021.
[21] J. Cheng et al., “Long short-term memory-networks for machine reading,” in EMNLP, 2016.
[22] Z. Lin et al., “A structured self-attentive sentence embedding,” in ICLR, 2017.
[23] A. P. Parikh et al., “A decomposable attention model for natural language inference,” in EMNLP, 2016.
[24] K. He et al., “Deep residual learning for image recognition,” in CVPR, June 2016.
[25] J. L. Ba et al., “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[26] J. Gehring et al., “Convolutional sequence to sequence learning,” in ICML, 2017.
[27] A. Zeyer et al., “A comparison of transformer and lstm encoder decoder models for asr,” in ASRU Workshop, 2019.
[28] B. Kingsbury, “Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling,” in ICASSP, 2009.
[29] K. Veselý et al., “Sequence-discriminative training of deep neural networks,” in Interspeech, 2013.
[30] H. Su et al., “Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription,” in ICASSP, 2013.
[31] D. Povey et al., “Purely sequence-trained neural networks for asr based on lattice-free mmi,” in Interspeech, 2016.
[32] H. Sak et al., “Sequence discriminative distributed training of long short-term memory recurrent neural networks,” in Interspeech, 2015.
[33] W. Michel et al., “Frame-level mmi as a sequence discriminative training criterion for lvcsr,” in ICASSP, 2020.
[34] W. Chan et al., “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016.
[35] A. Graves et al., “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
[36] A. Graves, “Sequence transduction with recurrent neural networks,” arXiv preprint arXiv:1211.3711, 2012.
[37] P. Guo et al., “Recent developments on espnet toolkit boosted by conformer,” in ICASSP, 2021.
[38] Y. Wang et al., “Transformer-based acoustic modeling for hybrid speech recognition,” in ICASSP, 2020.
[39] S. Kim et al., “Improved neural language model fusion for streaming recurrent neural network transducer,” in ICASSP, 2021.
[40] Z. Meng et al., “Internal language model training for domain-adaptive end-to-end speech recognition,” in ICASSP, 2021.
[41] Z. Tüske et al., “On the limit of english conversational speech recognition,” in Interspeech, 2021.
[42] P. Baldi et al., “Understanding dropout,” in NIPS, 2013.
[43] N. Srivastava et al., “Dropout: a simple way to prevent neural networks from overfitting,” JMLR, 2014.
[44] Y. Gal et al., “A theoretically grounded application of dropout in recurrent neural networks,” in NIPS, 2016.
[45] Y. Gal et al., “Dropout as a bayesian approximation: Representing model uncertainty in deep learning,” in ICML, 2016.
[46] A. F. Murray et al., “Enhanced mlp performance and fault tolerance resulting from synaptic weight noise during training,” IEEE Transactions on neural networks, 1994.
[47] S. Braun et al., “Parameter uncertainty for end-to-end speech recognition,” in ICASSP, 2019.
[48] C. M. Bishop, “Training with noise is equivalent to tikhonov regularization,” Neural computation, 1995.
[49] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), 1996.
[50] A. N. Tikhonov, “On the solution of ill-posed problems and the method of regularization,” in Doklady Akademii Nauk. Russian Academy of Sciences, 1963.
[51] J.-T. Chien et al., “Bayesian recurrent neural network language model,” in SLT Workshop, 2014, pp. 206–211.
[52] J.-T. Chien, “Bayesian recurrent neural network for language modeling,” IEEE transactions on neural networks and learning systems, 2015.
[53] R. M. Neal, Bayesian learning for neural networks. Springer Science & Business Media, 2012, vol. 118.
[54] D. J. MacKay, “A practical bayesian framework for backpropagation networks,” Neural computation, 1992.
[55] C. M. Bishop, “Pattern recognition and machine learning, 5th edition,” Information science and statistics, Springer 2007, ISBN 9780387310732, 2007.
[56] A. Graves, “Practical variational inference for neural networks,” in ICML, 2011.
[57] C. Blundell et al., “Weight uncertainty in neural network,” in ICML, 2015.
[58] D. P. Kingma et al., “Auto-encoding variational bayes,” in ICLR, 2013.
[59] J. Chung et al., “A recurrent latent variable model for sequential data,” in NIPS, 2015.
[60] M. W. Lam et al., “Gaussian process neural networks for speech recognition,” in Interspeech, 2018.
[61] S. Hu et al., “Lf-mmi training of bayesian and gaussian process time delay neural networks for speech recognition,” in Interspeech, 2019.
[62] S. Hu et al., “Bayesian and gaussian process neural networks for large vocabulary continuous speech recognition,” in ICASSP, 2019.
[63] S. Hu et al., “Bayesian learning of lf-mmi trained time delay neural networks for speech recognition,” TASLP, 2021.
[64] X. Xie et al., “Blhuc: Bayesian learning of hidden unit contributions for deep neural network speaker adaptation,” in ICASSP, 2019.
[65] X. Xie et al., “Bayesian learning for deep neural network adaptation,” TASLP, 2021.
[66] X. Li et al., “Bayesian x-vector: Bayesian neural network based x-vector system for speaker verification,” in Interspeech, 2020.
[67] Z. Gan et al., “Scalable bayesian learning of recurrent neural networks for language modeling,” in ACL, 2017.
[68] M. Fortunato et al., “Bayesian recurrent neural networks,” Women in Machine Learning Workshop (WiML) NIPS, 2017.
[69] M. W. Y. Lam et al., “Gaussian process lstm recurrent neural network language models for speech recognition,” in ICASSP, 2019.
[70] J. Yu et al., “Comparative study of parametric and representation uncertainty modeling for recurrent neural network language models,” in Interspeech, 2019.
[71] D. Tran et al., “Bayesian layers: A module for neural network uncertainty,” in NIPS, 2018.
[72] C. Yuan et al., “Blt: Exact bayesian inference with distribution transformers,” 2019.
[73] D. Barber et al., “Ensemble learning in bayesian neural networks,” Nato ASI Series F Computer and Systems Sciences, 1998.
[74] D. P. Kingma et al., “Stochastic gradient vb and the variational auto-encoder,” in ICLR, 2014.
[75] T. Hain et al., “The ami meeting transcription system: Progress and performance,” in International Workshop on Machine Learning for Multimodal Interaction. Springer, 2006.
[76] J. S. Chung et al., “Lip reading sentences in the wild,” in CVPR, 2017.
[77] N. Dehak et al., “Front-end factor analysis for speaker verification,” TASLP, 2010.
[78] S. Madikeri et al., “Implementation of the standard i-vector system for the kaldi speech recognition toolkit,” Idiap, Tech. Rep., 2016.
[79] G. Saon et al., “Speaker adaptation of neural network acoustic models using i-vectors,” in ASRU Workshop, 2013.
[80] P. Swietojanski et al., “Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models,” in SLT Workshop, 2014.
[81] B. Xue et al., “Bayesian transformer language models for speech recognition,” in ICASSP, 2021.
[82] D. Hendrycks et al., “Bridging nonlinearities and stochastic regularizers with gaussian error linear units,” arXiv preprint arXiv:1606.08415, 2016.
[83] A. Radford et al., “Improving language understanding by generative pre-training,” OpenAI Preprint, 2018.
[84] R. Neal, “Bayesian learning for neural networks,” Ph.D. dissertation, Dept. of Computer Science, University of Toronto, 1994.
[85] D. P. Kingma et al., “Variational dropout and the local reparameterization trick,” in NIPS, 2015.
[86] C. K. Williams et al., Gaussian processes for machine learning. MIT press Cambridge, MA, 2006.
[87] R. M. Neal, “Priors for infinite networks,” Springer New York, 1996.
[88] T. Hazan et al., “Steps toward deep kernel methods from infinite neural networks,” arXiv preprint arXiv:1508.05133, 2015.
[89] J. Lee et al., “Deep neural networks as Gaussian processes,” arXiv preprint arXiv:1711.00165, 2017.
[90] A. Damianou et al., “Deep Gaussian processes,” in Artificial Intelligence and Statistics, 2013.
[91] J.-T. Kuo et al., “Variational recurrent neural networks for speech separation,” in Interspeech, 2017.
[92] K. O. Stanley et al., “Evolving neural networks through augmenting topologies,” Evolutionary computation, 2002.
[93] K. Kandasamy et al., “Neural architecture search with bayesian optimisation and optimal transport,” in NIPS, 2018.
[94] B. Zoph et al., “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
[95] H. Liu et al., “Darts: Differentiable architecture search,” in ICLR, 2018.
[96] S. Xie et al., “Snas: stochastic neural architecture search,” in ICLR, 2019.
[97] H. Cai et al., “Proxylessnas: Direct neural architecture search on target task and hardware,” in ICLR, 2018.
[98] S. Hu et al., “Neural architecture search for lf-mmi trained time delay neural networks,” in ICASSP, 2020.
[99] S. Hu et al., “Dsnas: Direct neural architecture search without parameter retraining,” in CVPR, 2020.
[100] J. Yu et al., “Audio-visual multi-channel integration and recognition of overlapped speech,” TASLP, 2021.
[101] A. Paszke et al., “Automatic differentiation in pytorch,” 2017.
[102] L. Gillick et al., “Some statistical issues in the comparison of speech recognition algorithms,” in ICASSP, 1989.
[103] J. Yu et al., “Audio-visual multi-channel recognition of overlapped speech,” in Interspeech, 2020.
