Super-model ecosystem:
A domain-adaptation perspective

Fengxiang He and Dacheng Tao

The authors are with JD Explore Academy, JD.com Inc., Beijing, 100176, China. Email: fengxiang.f.he@gmail.com and dacheng.tao@gmail.com.

Abstract

This paper attempts to establish the theoretical foundation for the emerging super-model paradigm via domain adaptation, where one first trains a very large-scale model, i.e., a super model (or foundation model in some other papers), on a large amount of data and then adapts it to various specific domains. The super-model paradigm helps reduce computational and data costs and carbon emissions, which is critical to the AI industry, especially to the large number of small and medium-sized enterprises. We model the super-model paradigm as a two-stage diffusion process: (1) in the pre-training stage, the model parameter diffuses from a random initialization and converges to a steady distribution; and (2) in the fine-tuning stage, the model parameter is transported to another steady distribution. Both training stages can be mathematically modeled by the Uhlenbeck-Ornstein process, which converges to two Maxwell-Boltzmann distributions, respectively, each of which characterizes the corresponding convergent model. An $O (1 / \sqrt{N})$ generalization bound is then established via the PAC-Bayesian framework. The theory finds that the generalization error of the fine-tuning stage is dominant in domain adaptation. In addition, our theory suggests that the generalization is determined by a new measure that characterizes the domain discrepancy between the source domain and the target domain, based on the covariance matrices and the shift of the converged local minimum.

Keywords: generalization, diffusion equation, Uhlenbeck-Ornstein process, Fokker-Planck equation, PAC-Bayesian framework.

1 Introduction

Large-scale pre-trained models (or foundation models) (Han et al., 2021; Chen et al., 2021), including GPT-3 (Brown et al., 2020) and BERT (Devlin et al., 2018), enable a new paradigm in machine learning: pre-training a large-scale model on very large-scale datasets and then transferring the learned model to an unseen domain. This paradigm was first introduced in natural language processing and, more recently, in computer vision. It points toward a higher level of automation and is becoming established as a paradigm with clear advantages. In this paradigm, a super model learns meta knowledge from large amounts of data, which reduces the learning cost in specific domains. This helps considerably reduce the computational and data cost of applying machine learning in many specific applications and is thus of significant value to the large number of small and medium-sized enterprises. Additionally, the super-model paradigm enables better management of the geographic location of machine learning workloads and of the datacenter infrastructure, which has been shown to significantly reduce carbon emissions (Patterson et al., 2021).

Technically, domain adaptation plays a vital role in transferring knowledge in the super-model paradigm. Usually, the amount of data in the target domain is much smaller than that in the source domain. In light of this, an appropriate understanding of the generalizability of the transferred model on the target domain is of high importance.

In this paper, we prove an upper bound for the generalization error (generalization bound) for domain adaptation algorithms. The generalization error is defined as the difference between the expected risk $R$ and the empirical risk $\hat{R}$. Intuitively, a larger generalization bound indicates that the generalization error is possibly larger and thus suggests worse generalizability.

We model the super-model paradigm as a two-stage diffusion process. In the first stage, a stochastic gradient-based optimizer, such as stochastic gradient descent (SGD) (Robbins and Monro, 1951), momentum (Nesterov, 1983; Tseng, 1998), or Adam (Kingma and Ba, 2014), learns a pre-trained model on the source-domain data via empirical risk minimization,

\min_{θ} \hat{R}_{S} (θ) = \min_{θ} \frac{1}{N} \sum_{i = 1}^{N} ℓ (h_{θ} (x_{i}), y_{i}),

where $\hat{R}_{S} (θ)$ is the empirical risk of the model parameterized by $θ$ on the training sample $S$, which is defined to be

S = \{(x_{1}, y_{1}), \dots, (x_{N}, y_{N}) | x_{i} \in \mathbb{R}^{d_{X}}, y_{i} \in \mathbb{R}^{d_{Y}}\},

(1)

where $N$ is the training sample size, and $ℓ$ is the loss function. The convergent parameter initializes the fine-tuning process on the target domain in the second stage. We model the parameter trajectory of SGD by a stochastic process, the Uhlenbeck-Ornstein process (Uhlenbeck and Ornstein, 1930), as follows,

Δ θ (t) = θ (t + 1) - θ (t) = - η \hat{g}_{S} (θ (t)) = - η g (θ) + \frac{η}{\sqrt{| S |}} B Δ W, \quad Δ W \sim N (0, I),

(2)

where $B$ is a positive definite matrix that characterizes the covariance of the gradient noise. This can also be smoothed to a diffusion equation, the Fokker-Planck equation. Correspondingly, the trajectories can be modeled by the dynamics of the Fokker-Planck equation. Further, the steady distributions of the Uhlenbeck-Ornstein processes characterize the distributions of the learned models.

Deep learning can be formulated as solving a non-convex optimization problem: the loss surfaces of neural networks are usually highly non-convex due to the complexity of neural network architectures. In general, solving a non-convex optimization problem is NP-hard. However, numerous experiments show that deep learning has excellent optimization performance. This mystery is partially addressed by empirical findings on the local convexity and smoothness of the loss surfaces of deep neural networks. Empirical results show that the loss surface around the convergent local minima is second-order smooth, as shown by Li et al. (2018).

This empirical finding inspires us to model the loss surface around the convergent local minimum as a quadratic function. This assumption determines the derivatives and boundary conditions of the Fokker-Planck equation. Moreover, the model parameter is usually initialized from a Gaussian distribution. Based on these, the Fokker-Planck equation has a steady distribution in the form of a Maxwell-Boltzmann distribution, which governs the distribution of the model learned by SGD. During the pre-training stage, SGD converges to a Maxwell-Boltzmann distribution around the local minimum, given below,

q_{P T} (θ) = M_{P T} \exp \{- \frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ\},

(3)

where $M_{P T}$ is the normalizer and $Σ_{P T}$ is the covariance.

This distribution is then used as the initial distribution in the fine-tuning stage. Subsequently, SGD in the fine-tuning stage learns the mapping from the initialization to a new Maxwell-Boltzmann distribution centered at the new local minimum on the loss surface in the target domain as follows,

q_{F T} (θ) = M_{F T} \exp \{- \frac{1}{2} (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T})\},

(4)

where $M_{F T}$ is the normalizer, $θ_{F T}$ is the distribution shift, and $Σ_{F T}$ is the covariance.

Based on the diffusion processes, we then establish PAC-Bayesian generalization bounds for the learned super model on the source domain and the transferred model on the target domain. The PAC-Bayesian framework (McAllester, 1999a, b) upper bounds the generalization error of a stochastic algorithm via the distance between the initial distribution and the distribution of the learned hypothesis, usually measured by an information-theoretic distance such as the KL divergence. Intuitively, the PAC-Bayesian theory suggests that training a very large model from a no-knowledge prior, such as a Gaussian or uniform distribution, needs a very large amount of data to secure generalizability; if the initialization is near the distribution of the learned hypothesis, the needed sample complexity can be much smaller. However, a high-quality prior is usually not accessible in practice. This significantly limits the model size, particularly in low-resource scenarios. This is the key motivation of the super-model paradigm: (1) training a super model on a very large-scale dataset, in order to learn a high-quality model from the no-knowledge prior; and (2) using the learned super model as a high-quality prior in the downstream application, in order to reduce the needed training data and support a larger model size.

In this paper, the generalization bound in pre-training is established based on the KL divergence between the Maxwell-Boltzmann distribution in pre-training and the prior $P$, as below,

R (Q_{P T}) \leq \hat{R} (Q_{P T}) + \sqrt{\frac{D (Q_{P T}, P) + 2 \log (\frac{1}{δ}) + 2 \log N_{P T} + 4}{4 N_{P T} - 2}},

(5)

where

D (Q_{P T}, P) = \log (\det (Σ_{P T})) + \mathrm{tr} (Σ_{P T} - I),

and $R (Q_{P T})$ is the expected risk, $\hat{R} (Q_{P T})$ is the empirical risk, $Σ_{P T}$ is the covariance of the distribution of the learned hypothesis, and $N_{P T}$ is the training sample size in pre-training.

Meanwhile, the generalization bound in fine-tuning is established based on the KL divergence between the two Maxwell-Boltzmann distributions in pre-training and fine-tuning, as follows,

R (Q_{F T}) \leq \hat{R} (Q_{F T}) + \sqrt{\frac{D (Q_{F T}, Q_{P T}) + 2 \log (\frac{1}{δ}) + 2 \log N_{F T} + 4}{4 N_{F T} - 2}},

(6)

where

D (Q_{F T}, Q_{P T}) = \log (\det (Σ_{P T}^{- 1} Σ_{F T})) + \mathrm{tr} (Σ_{P T}^{- 1} Σ_{F T} - I) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T},

and $R (Q_{F T})$ is the expected risk, $\hat{R} (Q_{F T})$ is the empirical risk, $Σ_{F T}$ is the covariance of the distribution of the learned hypothesis, $θ_{F T}$ is the shift of the distribution center, and $N_{F T}$ is the training sample size in fine-tuning.
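To make the role of the sample size concrete, the following Python sketch evaluates the square-root term shared by the bounds (5) and (6). The function name `complexity_term` and all numerical values (the two divergences, $δ$, $N_{P T}$, and $N_{F T}$) are illustrative assumptions, not values reported in this paper.

```python
import math


def complexity_term(divergence, n, delta=0.05):
    """Square-root term of bounds (5) and (6):
    sqrt((D + 2 log(1/delta) + 2 log n + 4) / (4 n - 2))."""
    if n < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need n >= 1 and 0 < delta < 1")
    numerator = divergence + 2.0 * math.log(1.0 / delta) + 2.0 * math.log(n) + 4.0
    return math.sqrt(numerator / (4.0 * n - 2.0))


# Hypothetical setting: a large divergence from the no-knowledge prior paired
# with a huge pre-training sample, and a smaller divergence from the
# pre-trained distribution paired with a modest fine-tuning sample.
pre_training_term = complexity_term(divergence=1.0e6, n=10**9)
fine_tuning_term = complexity_term(divergence=1.0e3, n=10**4)
print(pre_training_term, fine_tuning_term)
```

Under these assumed numbers the fine-tuning term is the larger of the two, because the pre-training sample size in the denominator outweighs the larger divergence in the numerator.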

We further define two new notions to measure the domain discrepancy as follows,

D (Q_{F T}, Q_{P T}) = \log (\det (Σ_{P T}^{- 1} Σ_{F T})) + \mathrm{tr} (Σ_{P T}^{- 1} Σ_{F T} - I) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T},

and

\tilde{D} (Q_{F T}, Q_{P T}) = \log (\mathrm{tr} (Σ_{P T}^{- 1} Σ_{F T})) + \mathrm{tr} (Σ_{P T}^{- 1} Σ_{F T}) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} + d \log d - d,

where $d$ is the dimension of the parameter. These two notions measure the magnitude of the domain shift based on the hypotheses learned on the source domain and the target domain.
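As a small numerical illustration, the sketch below evaluates the two discrepancy expressions with NumPy. It transcribes the two displays above term by term; the function names `discrepancy` and `discrepancy_tilde` and the example matrices are assumptions made for illustration only.

```python
import numpy as np


def _ratio_and_shift(sigma_pt, sigma_ft, theta_ft):
    """Return Sigma_PT^{-1} Sigma_FT and theta_FT^T Sigma_PT^{-1} theta_FT."""
    ratio = np.linalg.solve(sigma_pt, sigma_ft)
    shift = float(theta_ft @ np.linalg.solve(sigma_pt, theta_ft))
    return ratio, shift


def discrepancy(sigma_pt, sigma_ft, theta_ft):
    """D(Q_FT, Q_PT): log-det term + trace term + shift term, as written above."""
    ratio, shift = _ratio_and_shift(sigma_pt, sigma_ft, theta_ft)
    d = ratio.shape[0]
    _, log_det = np.linalg.slogdet(ratio)
    return float(log_det + np.trace(ratio - np.eye(d)) + shift)


def discrepancy_tilde(sigma_pt, sigma_ft, theta_ft):
    """D-tilde(Q_FT, Q_PT), with the determinant replaced by the trace, as written above."""
    ratio, shift = _ratio_and_shift(sigma_pt, sigma_ft, theta_ft)
    d = ratio.shape[0]
    trace = np.trace(ratio)
    return float(np.log(trace) + trace + shift + d * np.log(d) - d)


identity = np.eye(2)
no_shift = np.zeros(2)
shifted = np.array([1.0, 2.0])  # hypothetical shift of the local minimum

print(discrepancy(identity, identity, no_shift))
print(discrepancy(identity, identity, shifted))
print(discrepancy_tilde(identity, identity, shifted))
```

With identical covariances the first two terms of $D$ vanish, so $D$ reduces to the shift term $θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T}$, which grows with the distance between the two local minima.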

Our theory has the following implications:

(1) The very large-scale datasets employed in the pre-training stage help secure a high-quality model from a no-knowledge prior. The learned model carries the knowledge learned from the training data in the pre-training stage. The distribution of the pre-trained model serves as a high-quality initialization in the downstream fine-tuning stage.
(2) Comparing the generalization bounds in the pre-training stage and the fine-tuning stage, we show that the generalization error of the fine-tuning stage is dominant in the super-model paradigm. This is because of the much larger size of the training data in the pre-training stage. This finding supports the feasibility and efficiency of the super-model paradigm.
(3) Our generalization bound suggests that the generalization on the target domain is determined by the magnitude of the domain shift. Generally, a larger domain shift leads to worse generalization on the target domain. This supports the practical heuristic that the performance on the target domain is limited by the domain shift.

It is worth noting that the super-model paradigm also supports model compression in model deployment, including pruning, quantization, and model distillation. The influence of model compression methods can be directly plugged into our theory.

2 Background

This section reviews the related work, including super model, domain adaptation, generalization, and deep learning theory.

Domain adaptation. Domain adaptation algorithms transfer knowledge from one domain to another, which enables the super-model paradigm. Domain adaptation has three main streams:

(1) Discrepancy-based domain adaptation modifies the loss to narrow the discrepancy between the features from the source domain $D^{s}$ and those from the target domain $D^{t}$. Tzeng et al. (2014) introduce a fully-connected adaptation layer into a CNN for learning the representation of the kernel $ϕ (\cdot)$ in order to minimize the maximum mean discrepancy (MMD) between the features from different domains:

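MMD (X_{S}, X_{T}) = \| \frac{1}{| X_{S} |} \sum_{x_{s} \in X_{S}} ϕ (x_{s}) - \frac{1}{| X_{T} |} \sum_{x_{t} \in X_{T}} ϕ (x_{t}) \|.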

Long et al. (2015) employ multiple adaptation layers. Long et al. (2015, 2016) introduce residual blocks into the classifiers of the source domain. Long et al. (2017) further consider the discrepancy of the joint distribution $P (X, Y)$ rather than the marginal distribution $P (X)$;

(2) Adversarial-based domain adaptation maps the source domain $D^{s}$ and the target domain $D^{t}$ to a common space, inspired by generative adversarial networks (GANs): if a classifier can hardly separate examples from the source domain and those from the target domain, the feature extractor has narrowed the gap between the two domains. Ganin and Lempitsky (2015) propose gradient reversal layers that reverse the gradients generated by the domain classifier during backpropagation. Zhang et al. (2018) argue that the features of the bottom layers contain more domain information, while those of the top layers contain less domain information. They further employ collaborative learning to learn domain-informative features in the bottom layers, and adopt adversarial learning to learn domain-uninformative features in the top layers; and

(3) Reconstruction-based domain adaptation reconstructs the features extracted from the source domain $D^{s}$ to the target domain $D^{t}$. Ghifary et al. (2016) reconstruct examples from the target domain via features learned from the source-domain classification task. Bousmalis et al. (2016) reconstruct the inputs via both the private representation and the shared representation of the source and target domains.
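For intuition about the discrepancy-based stream, the sketch below computes an empirical MMD as the norm of the difference between the mean feature embeddings of two samples. It assumes an identity feature map by default; the function name `mmd`, the argument `phi`, and the toy data are hypothetical and are not taken from the cited works.

```python
import numpy as np


def mmd(source_features, target_features, phi=lambda x: x):
    """Empirical MMD: norm of the difference between the mean embeddings.

    `phi` maps an array of shape (n, d) to an array of shape (n, k);
    the identity map used by default is an assumption for illustration.
    """
    mean_source = phi(source_features).mean(axis=0)
    mean_target = phi(target_features).mean(axis=0)
    return float(np.linalg.norm(mean_source - mean_target))


# Toy samples: the source mean is (1, 0) and the target mean is (4, 4),
# so the difference of the means is (3, 4).
source = np.array([[0.0, 0.0], [2.0, 0.0]])
target = np.array([[4.0, 4.0], [4.0, 4.0]])
print(mmd(source, target))  # 5.0
```

A discrepancy-based method would add such a term to the training loss so that the learned features of the two domains have matching means in the embedding space.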

Generalization. Good generalization guarantees that an algorithm learns the underlying patterns in the training data rather than just memorizing the data. In this way, good generalization abilities provide confidence that models trained on existing data can be applied to similar but unseen scenarios. Three major approaches to analyzing generalizability are seen in the literature: (1) generalization bounds based on the hypothesis complexity, including VC dimension (Blumer et al., 1989; Vapnik, 2006), Rademacher complexity (Koltchinskii and Panchenko, 2000; Koltchinskii, 2001; Bartlett and Mendelson, 2002), and covering number (Dudley, 1967; Haussler, 1995). The results are usually obtained via concentration inequalities. They also suggest controlling the model size to secure generalizability, which is no longer valid in deep learning; (2) generalization bounds based on algorithmic stability (Rogers and Wagner, 1978; Bousquet and Elisseeff, 2002; Xu et al., 2011). The results in this stream follow the motivation that learning algorithms robust to small disturbances in the input data usually have good generalizability; and (3) generalization bounds in the PAC-Bayes framework (McAllester, 1999a, b). The results are obtained based on information-theoretic versions of concentration inequalities.

Deep learning theory. Deep learning has been deployed successfully in many real-world scenarios. However, the theoretical foundations of deep learning are still elusive. For example, there is no explanation for how deep learning algorithms work, why they can succeed, when they would fail, and whether they would hurt society. Such deficiency in explainability calls into question the transparency and accountability of deep learning, and further undermines our confidence in deploying deep learning in security-critical application domains, such as medical diagnosis (Kulikowski, 1980; Silver et al., 2016) and drug discovery (Chen et al., 2018a). Many works have emerged to establish the theoretical foundations of deep learning via VC dimension (Harvey et al., 2017), Rademacher complexity (Golowich et al., 2018; Bartlett et al., 2017), covering number (Bartlett et al., 2017), Fisher-Rao norm (Liang et al., 2019; Tu et al., 2020), PAC-Bayesian framework (Neyshabur et al., 2017), algorithmic stability (Hardt et al., 2016; Kuzborskij and Lampert, 2018; Verma and Zhang, 2019), and the dynamics of stochastic gradient descent or its variants (Mandt et al., 2017; Mou et al., 2018b; He et al., 2019). Please see more related works in surveys (E et al., 2020; He and Tao, 2020; Poggio et al., 2020). This work is committed to establishing theoretical foundations of privacy, generalization, and adversarial attack in deep learning, all of which have profound importance in enhancing the explainability, transparency, and accountability of deep models.

Generalization of SGD. Several generalization bounds for algorithms trained by SGD have been proposed. Mou et al. (2018a) analyze the generalization of stochastic gradient Langevin dynamics (SGLD), and prove an $O (1 / N)$ upper bound and an $O (1 / \sqrt{N})$ upper bound for the generalization error, via algorithmic stability and PAC-Bayesian theory, respectively. Pensia et al. (2018) analyze the generalizability of noisy and iterative machine learning algorithms. A generalization bound is then proved given the mutual information between the output hypothesis and the input data. They also prove generalization bounds for SGLD as examples. Chen et al. (2018b) prove that the convergence and stability of iterative machine learning algorithms have a trade-off under both the convex smooth assumption and the strongly convex smooth assumption. Under the same assumptions, Chen et al. (2018b) prove an $O (1 / N)$ generalization bound for SGD. Liu et al. (2017) prove an $O (1 / N)$ generalization bound for SGD when the loss function is Lipschitz continuous and smooth. London (2017) proves a generalization bound for SGD based on the KL divergence between the prior $P$ and the posterior $Q$ under the PAC-Bayes framework. He et al. (2019) present a PAC-Bayes generalization bound for SGD based on stochastic differential equations. In the work of He et al. (2019), the gradient noise is modeled by a Gaussian distribution. Meng et al. (2020) extend the gradient noise to be state-dependent. Cheng et al. (2020) extend the gradient noise to a Lévy process.

3 Notations and preliminaries

Suppose the training dataset is $S = \{(x_{1}, y_{1}), \dots, (x_{N}, y_{N}) | x_{i} \in \mathbb{R}^{d_{X}}, y_{i} \in \mathbb{R}^{d_{Y}}, i = 1, \dots, N\}$, where $d_{X}$ is the dimension of the feature $X$ and $d_{Y}$ is the dimension of the label $Y$. Suppose $x_{i}$ and $y_{i}$ are independent and identically distributed (i.i.d.) observations of the variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$, respectively. We also write $z_{i} = (x_{i}, y_{i})$, which is an i.i.d. observation of the random variable $Z = (X, Y) \in \mathcal{Z}$. Denote the generating distribution of $Z$ by $D$.

Formally, machine learning algorithms are designed to select the hypothesis function $F_{θ}$ with the lowest expected risk $R$ under the loss function $l$ from a hypothesis class $\{F_{θ} | θ \in Θ \subset \mathbb{R}^{d}\}$, where $θ$ is the parameter of the hypothesis and $d$ is the dimension of the parameter $θ$. For many stochastic algorithms, such as SGD, we usually use a distribution to express the output parameter. Suppose the parameter follows a distribution $Q$. The expected risks in terms of $θ$ and $Q$ are respectively defined as:

	$R (θ) = E_{(X, Y) \sim D} l (F_{θ} (X), Y),$		(7)
	$R (Q) = E_{θ \sim Q} E_{(X, Y) \sim D} l (F_{θ} (X), Y) .$		(8)

However, the expected risk $R$ is not available from the data, since we do not know the form of the latent data distribution $D$. In practice, we use the empirical risk $\hat{R}$ to estimate the expected risk $R$, which is defined as:

	$\hat{R} (θ) = \frac{1}{| T |} \sum_{i = 1}^{| T |} l (F_{θ} (X_{i}), Y_{i}),$		(9)
	$\hat{R} (Q) = E_{θ \sim Q} \left[ \frac{1}{| T |} \sum_{i = 1}^{| T |} l (F_{θ} (X_{i}), Y_{i}) \right],$		(10)

where all $(X_{i}, Y_{i})$ constitute the training sample $T$ .

Learning algorithms usually solve the following empirical risk minimization (ERM) problem to approach the optimal hypothesis,

\min_{θ} \hat{R}_{S} (θ) = \min_{θ} \frac{1}{N} \sum_{i = 1}^{N} ℓ (h_{θ} (x_{i}), y_{i}) .

We usually employ stochastic gradient-based optimizers for ERM in deep learning. Popular stochastic gradient-based optimizers include stochastic gradient descent (SGD) (Robbins and Monro, 1951), momentum (Nesterov, 1983; Tseng, 1998), and Adam (Kingma and Ba, 2014). For brevity, we analyze SGD in this paper. The analysis for other stochastic gradient-based optimizers is similar.

Suppose $B$ is a mini batch randomly drawn from the training sample set $S$ . Then, the stochastic gradient on $B$ is as follows,

\hat{g}^{E R M} (θ) = \frac{1}{| B |} \sum_{(x_{i}, y_{i}) \in B} \nabla_{θ} ℓ (h_{θ} (x_{i}), y_{i}) .

In the $t$-th iteration, the weight is updated as follows,

θ_{t + 1}^{E R M} = θ_{t}^{E R M} - η_{t} \hat{g}^{E R M} (θ_{t}^{E R M}),

where $θ_{t}^{E R M}$ is the weight vector in the $t$-th iteration and $η_{t}$ is the corresponding learning rate.
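The mini-batch update can be sketched in a few lines of Python. The example below is a minimal illustration, assuming a linear model $h_{θ} (x) = x^{⊤} θ$, a squared loss, a constant learning rate, and synthetic noiseless data; none of these choices come from this paper, and the function names are hypothetical.

```python
import numpy as np


def stochastic_gradient(theta, x_batch, y_batch):
    """Mini-batch gradient of the squared loss 0.5 * (x @ theta - y)**2."""
    residual = x_batch @ theta - y_batch
    return x_batch.T @ residual / len(y_batch)


def sgd(x, y, eta=0.1, batch_size=8, steps=2000, seed=0):
    """Run the update theta <- theta - eta * g_hat(theta) on random mini-batches."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(x.shape[1])
    for _ in range(steps):
        batch = rng.choice(len(y), size=batch_size, replace=False)
        theta = theta - eta * stochastic_gradient(theta, x[batch], y[batch])
    return theta


# Synthetic data (assumption): noiseless labels from a known parameter,
# so the empirical risk minimizer coincides with true_theta.
data_rng = np.random.default_rng(1)
x = data_rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = x @ true_theta

theta_hat = sgd(x, y)
print(theta_hat)
```

Because the labels are noiseless, the gradient noise vanishes at the minimizer and the iterates settle on `true_theta`; with noisy labels they would instead keep fluctuating around the minimizer.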

Meanwhile, adversarial training employs SGD to solve the following minimax problem,

\min_{θ} \hat{R}_{S}^{A} (θ) = \min_{θ} \frac{1}{N} \sum_{i = 1}^{N} \max_{\| x_{i}^{'} - x_{i} \| \leq ρ} ℓ (h_{θ} (x_{i}^{'}), y_{i}),

(11)

where $ρ$ is the radius of the ball centered at the example $(x_{i}, y_{i})$. Here, we call $\hat{R}_{S}^{A} (θ)$ the adversarial empirical risk. Correspondingly, the stochastic gradient on a mini batch $B$ and the weight update are calculated as below,

	$\hat{g}^{A} (θ) = \frac{1}{| B |} \sum_{(x_{i}, y_{i}) \in B} \nabla_{θ} \max_{\| x_{i}^{'} - x_{i} \| \leq ρ} ℓ (h_{θ} (x_{i}^{'}), y_{i}),$
	$θ_{t + 1}^{A} = θ_{t}^{A} - η_{t} \hat{g}^{A} (θ_{t}^{A}) .$		(12)
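For a concrete instance of the inner maximization, the sketch below assumes a linear classifier $h_{θ} (x) = θ^{⊤} x$, labels in $\{- 1, + 1\}$, the logistic loss, and an $ℓ_{2}$ ball. Under these assumptions (which are not made in this paper) the worst-case input has the closed form $x^{'} = x - ρ y θ / \| θ \|$, so the adversarial gradient can be evaluated at that point. All function names and data are hypothetical.

```python
import numpy as np


def worst_case_inputs(theta, x, y, rho):
    """Maximizers of the logistic loss over l2 balls of radius rho.

    Valid for a linear model with labels in {-1, +1}; requires theta != 0.
    """
    direction = theta / np.linalg.norm(theta)
    return x - rho * y[:, None] * direction


def logistic_loss(theta, x, y):
    """Average logistic loss log(1 + exp(-y * theta^T x))."""
    return float(np.mean(np.log1p(np.exp(-y * (x @ theta)))))


def adversarial_gradient(theta, x, y, rho):
    """Gradient of the adversarial loss, evaluated at the worst-case inputs."""
    x_adv = worst_case_inputs(theta, x, y, rho)
    margin = y * (x_adv @ theta)
    weights = -y / (1.0 + np.exp(margin))  # derivative of the loss w.r.t. theta^T x'
    return (weights[:, None] * x_adv).mean(axis=0)


# A toy mini-batch (assumption) and one adversarial SGD step.
theta = np.array([1.0, -1.0])
x = np.array([[2.0, 0.0], [0.0, 1.5], [1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0])
rho, eta = 0.5, 0.1

x_adv = worst_case_inputs(theta, x, y, rho)
clean_loss = logistic_loss(theta, x, y)
adversarial_loss = logistic_loss(theta, x_adv, y)
theta_next = theta - eta * adversarial_gradient(theta, x, y, rho)
print(clean_loss, adversarial_loss, theta_next)
```

For deep networks no such closed form is available, and the inner maximization is typically approximated by an iterative attack instead.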

Definition 1 (KL Divergence; cf. Kullback and Leibler (1951)).

Suppose two distributions $P$ and $Q$ are defined on the same support. Then the KL divergence between $P$ and $Q$ is defined as

D_{K L} (P ∥ Q) = E_{P} \left( \log \frac{d P}{d Q} \right) .
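For discrete distributions the definition reduces to $\sum_{i} p_{i} \log (p_{i} / q_{i})$. The sketch below evaluates this sum for two small probability vectors; the function name `kl_divergence` and the example vectors are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # the term 0 * log(0 / q_i) is taken to be 0
        if q_i == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


p = [0.5, 0.5]
q = [0.25, 0.75]
print(kl_divergence(p, q))  # equals 0.5 * log(4 / 3)
print(kl_divergence(q, p))  # a different value: the divergence is not symmetric
```

The asymmetry matters in the PAC-Bayesian setting, where the divergence is taken from the learned distribution to the prior and not the other way around.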

To avoid technicalities, the measurability/integrability issues are ignored throughout this paper. Moreover, Fubini’s theorem is assumed to be applicable for any integration with respect to multiple variables, so that the order of integration can be exchanged. Also, we assume that the stable (stationary) solutions of all stochastic differential equations involved exist and are unique.

4 Super-model paradigm

A prominent industrial paradigm has been emerging that consists of (1) pre-training a large-scale model on large amounts of multi-modality data, such as GPT-3 (Brown et al., 2020) and BERT (Devlin et al., 2018); and (2) fine-tuning the obtained model on a specific smaller domain where data is relatively difficult to obtain. In this paper, we name it the super-model paradigm. The super-model paradigm enables efficient and effective knowledge discovery in low-resource application scenarios, including few-shot learning (Snell et al., 2017; Sung et al., 2018) and zero-shot learning (Romera-Paredes and Torr, 2015). A key cornerstone technology therein is domain adaptation. This section describes this paradigm.

Large-scale pre-trained models. Recent advances are seen mainly in natural language processing (NLP), particularly after the appearance of the transformer (Vaswani et al., 2017). ELMo (Peters et al., 2018) finds that the word embedding in NLP is not invariant across application domains, but changes considerably with context. Based on this observation, ELMo pre-trains a large-scale bidirectional LSTM on a large text corpus to generate word vectors by fine-tuning. BERT (Devlin et al., 2018) employs the transformer encoder for detecting bidirectional information in the context. Meanwhile, Liu et al. (2018) employ the transformer decoder for word embedding with fine-tuning, in order to realize wider attention. GPT (Radford et al., 2018) also employs the transformer decoder but is fine-tuned on each specific task for better performance. Extended from GPT, GPT-2 (Radford et al., 2019) and GPT-3 (Brown et al., 2020) construct huge models in order to realize zero-shot learning. The comparison between these “super models” is presented in the following table.

SM	Architecture	Params
ELMo	BiLSTM	-
BERT	Transformer Encoder	110M $\sim$ 340M
GPT	Transformer Decoder	117M
GPT-2	Transformer Decoder	117M $\sim$ 1,542M
GPT-3	Transformer Decoder	175B

Table 1: List of Super Models (SMs)

Pre-training stage. The first step of the super-model paradigm is pre-training a super model on large-scale data, sometimes of multiple modalities. The model learned in this stage is of high quality, in that the approximation and generalization of the output hypothesis are usually excellent, which suggests that the learned model has stored rich general knowledge. This makes it possible to apply the learned model to smaller specific application domains.

Fine-tuning stage. The learned model is then fine-tuned on the target domain, usually a smaller specific domain. The stored general knowledge is thereby transferred to the target domain. In this way, the super-model paradigm considerably reduces the resources needed for knowledge discovery in the target domain.

Theoretical advantages. According to the PAC-Bayesian theory, the generalizability of the learned model is determined by the distance between the posterior and the prior. As we will show in the next two sections, the very large-scale training data in the pre-training stage secures learning a high-quality model from a no-knowledge prior. The learned knowledge is of high value but consumes enormous resources, which are not accessible to many potential machine learning users, particularly small and medium-sized enterprises. In the super-model paradigm, the high-quality model learned in the pre-training stage is employed as the initialization in the fine-tuning stage. In this way, we significantly reduce the needed sample complexity in the fine-tuning stage.

Industrial values. Machine learning has been thriving in a wide range of areas. However, the industrial applications are still limited. This is partially caused by the high cost of computing facilities and data annotations. The paradigm based on super models significantly reduces the cost of machine learning applications. This is particularly important for small and medium-sized enterprises.

Climate value. The super-model paradigm enables recycling discovered general knowledge across a large number of application domains. This would also help significantly reduce carbon emissions. Meanwhile, the super-model paradigm centralizes the model training process, which can help manage the geographic location and the datacenter infrastructure in order to reduce carbon emissions, as a recent work suggests (Patterson et al., 2021):

(1) The geographic location in which a machine learning workload is scheduled can cause the carbon emission to vary by around five to ten times, even when the country and the organization remain the same.
(2) Cloud data centers can be around 1.4-2X more energy-efficient. Meanwhile, machine learning-oriented accelerators can be 2-5X more effective.

The super-model paradigm can thus reduce the carbon footprint of machine learning applications and further contribute to slowing down the climate crisis.

5 Diffusion processes in super-model paradigm

We consider a diffusion process-based model that serves as an envelope for domain adaptation methods. Two diffusion processes are designed for modeling the pre-training and fine-tuning stages, respectively. The knowledge transition can then be modeled via the transition of diffusion processes.

5.1 Diffusion process in pre-training

In pre-training, SGD explores the loss surface for a decent local minimum. Compared with gradient descent, SGD introduces noise into the gradient and then into the weight. The noise acts as an implicit regularizer that controls the hypothesis complexity of the learned model. In this section, we employ a stochastic differential equation to characterize the trajectory of SGD.

We assume that the loss function in the local region around the minimum is convex and second-order differentiable, as shown in the following assumption.

Assumption 1.

Suppose that the empirical risk $R (θ)$ around the optimum takes the following form,

R (θ) = \frac{1}{2} θ^{⊤} A_{P T} θ,

(13)

where $A_{P T}$ is the Hessian matrix around the minimum and is a positive (semi-)definite matrix.

Remark 1.

This assumption implicitly assumes that the converged local minimum is at the zero point. This does not lose generality, up to a translation. Specifically, suppose the converged local minimum is at $θ^{'}$. We may translate the parameter space of the neural network to move the converged local minimum to zero.

Remark 2.

The Hessian matrix $A_{P T}$ of the loss surface characterizes the local geometry around the converged local minimum. Its determinant characterizes the flatness/sharpness of the loss function around the local minimum (Keskar et al., 2017; Goyal et al., 2017).

Remark 3.

The covariance matrix $C_{P T}$ characterizes the fluctuation introduced by the mini-batches into the gradient estimation. A recent intuition for the advantage of SGD is that it introduces noise into the gradient, so that it can jump out of bad local minima.

The loss $l_{n} (θ)$ and its gradient $\nabla_{θ} l_{n} (θ)$ calculated on an example, and the corresponding mini-batch averages $\hat{R} (θ)$ and $\hat{g}_{S} (θ)$, are unbiased estimators of the risk $R (θ)$ and the full gradient $\nabla_{θ} R (θ)$, as follows,

	$E [l_{n} (θ)] = E [\hat{R} (θ)] = R (θ),$		(14)
	$E [\nabla_{θ} l_{n} (θ)] = E [\hat{g}_{S} (θ)] = g (θ) = \nabla_{θ} R (θ),$		(15)

where the expectations are in terms of the corresponding examples $(X, Y)$ .

The fluctuations introduced by the mini-batches are modeled by Gaussian distributions centered at $g (θ) = \nabla_{θ} R (θ)$. Specifically, we assume that

\nabla_{θ} l_{n} (θ) \sim N (g (θ), C),

(16)

where $C$ is the covariance matrix and is a constant matrix for all $θ$. This Gaussian assumption is also employed by E (2017) and Mandt et al. (2017). Therefore, we further have the following estimation,

\hat{g}_{S} (θ) = \frac{1}{| S |} \sum_{n \in S} \nabla_{θ} l_{n} (θ) \sim N (g (θ), \frac{1}{| S |} C) .

(17)

SGD uses the stochastic gradient $\hat{g}_{S} (θ)$ to iteratively update the parameter $θ$ in order to minimize the function $R (θ)$:

Δ θ (t) = θ (t + 1) - θ (t) = - η \hat{g}_{S} (θ (t)) = - η g (θ) + \frac{η}{\sqrt{| S |}} B Δ W,

(18)

and

Δ W \sim N (0, I),

where $B$ is a positive definite matrix that characterizes the covariance of the gradient noise. We define

C = B^{⊤} B .

In this paper, we consider the case that the batch size $| S |$ and learning rate $η$ are constant.
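In one dimension with a quadratic risk, so that $g (θ) = a θ$, the update (18) becomes a linear recursion whose stationary variance $v$ can be read off directly from $v = (1 - η a)^{2} v + η^{2} b^{2} / | S |$. The Python sketch below simulates the recursion and compares the empirical variance with this value. The scalars `a` and `b` (playing the roles of the Hessian and of $B$) and all numerical settings are assumptions chosen for illustration.

```python
import math
import random


def simulate_sgd_1d(a, b, eta, batch_size, steps, seed=0):
    """Iterate eq. (18) in one dimension with gradient g(theta) = a * theta."""
    rng = random.Random(seed)
    noise_scale = eta * b / math.sqrt(batch_size)
    theta = 0.0
    trajectory = []
    for _ in range(steps):
        theta += -eta * a * theta + noise_scale * rng.gauss(0.0, 1.0)
        trajectory.append(theta)
    return trajectory


a, b, eta, batch_size = 1.0, 2.0, 0.05, 16  # hypothetical settings
trajectory = simulate_sgd_1d(a, b, eta, batch_size, steps=200_000)

# Discard the initial transient, then estimate the stationary variance.
samples = trajectory[10_000:]
mean = sum(samples) / len(samples)
empirical_variance = sum((s - mean) ** 2 for s in samples) / len(samples)

# Stationary variance of the linear recursion, solved from
# v = (1 - eta * a)**2 * v + (eta * b)**2 / batch_size.
exact_variance = (eta * b) ** 2 / batch_size / (1.0 - (1.0 - eta * a) ** 2)
print(empirical_variance, exact_variance)
```

The stationary variance shrinks when the batch size grows or the learning rate decreases, which is the qualitative behavior that the stationary distribution of the diffusion model is meant to capture.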

Combining eqs. (13) and (18), we have the following analytic form of the stationary distribution (Gardiner and others, 1985):

q_{P T} (θ) = M_{P T} \exp \{- \frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ\},

(19)

where $M_{P T}$ is the normalizer and

Σ_{P T} A_{P T} + A_{P T} Σ_{P T} = \frac{η_{P T}}{| S_{P T} |} C_{P T},

and $η_{P T}$, $| S_{P T} |$, and $C_{P T}$ are the learning rate, batch size, and the covariance matrix in the pre-training stage, respectively.
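The condition on $Σ_{P T}$ is a linear (Lyapunov-type) equation and can be solved numerically. The sketch below does so with NumPy by vectorizing the equation with Kronecker products; the function name `stationary_covariance` and the example Hessian, noise covariance, learning rate, and batch size are assumptions for illustration.

```python
import numpy as np


def stationary_covariance(hessian, noise_cov, eta, batch_size):
    """Solve Sigma A + A Sigma = (eta / |S|) C for Sigma.

    Assumes A (the Hessian) is symmetric positive definite, so the
    solution exists and is unique.
    """
    d = hessian.shape[0]
    identity = np.eye(d)
    # With row-major vectorization, vec(Sigma A) = (I kron A^T) vec(Sigma)
    # and vec(A Sigma) = (A kron I) vec(Sigma).
    operator = np.kron(identity, hessian.T) + np.kron(hessian, identity)
    rhs = (eta / batch_size) * noise_cov
    return np.linalg.solve(operator, rhs.reshape(-1)).reshape(d, d)


hessian = np.array([[2.0, 0.5], [0.5, 1.0]])    # hypothetical A_PT
noise_cov = np.array([[1.0, 0.2], [0.2, 0.5]])  # hypothetical C_PT
sigma = stationary_covariance(hessian, noise_cov, eta=0.1, batch_size=32)
print(sigma)
```

When the Hessian is a multiple of the identity, $A_{P T} = a I$, the solution reduces to $Σ_{P T} = \frac{η_{P T}}{2 a | S_{P T} |} C_{P T}$, so a larger ratio of learning rate to batch size gives a wider stationary distribution.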

Remark 4.

In this section, we show that the learned hypothesis is drawn from the steady distribution of a Fokker-Planck equation, which is a Gibbs-Boltzmann distribution centered around the zero point.

5.2 Knowledge transition in fine-tuning

SGD in the fine-tuning stage can also be characterized by the Uhlenbeck-Ornstein equation (eq. 18), while the initial condition is different. The fine-tuning stage is initialized by the steady distribution of the pre-training stage, whose covariance is $Σ_{P T}$. Similarly, SGD converges to another steady distribution, whose covariance is $Σ_{F T}$. In this way, we model the domain adaptation as a two-stage diffusion process. The second-stage diffusion process characterizes the knowledge transition between the two domains.

We assume that the loss function in the local region around the minimum is convex and second-order differentiable, as shown in the following assumption.

Assumption 2.

Suppose that the empirical risk $R (θ)$ around the optimum takes the following form,

R (θ) = \frac{1}{2} (θ - θ_{F T})^{⊤} A_{F T} (θ - θ_{F T}),

(20)

where $A_{F T}$ is the Hessian matrix around the minimum and is a positive (semi-)definite matrix.

Recall that we assumed that the converged local minimum in the pre-training stage is at the zero point. In the fine-tuning stage, the converged local minimum cannot in general be assumed to be at the same point. Thus, a shift term $θ_{F T}$ is introduced to characterize the shift of the converged local minimum.

Similarly, combining eqs. (20) and (18), we have the following analytic form of the stationary distribution:

q_{F T} (θ) = M_{F T} exp {- \frac{1}{2} (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T})},

(21)

where $M_{F T}$ is the normalizer and

Σ_{F T} A_{F T} + A_{F T} Σ_{F T} = \frac{η_{F T}}{| S_{F T} |} C_{F T},

$η_{F T}$ , $| S_{F T} |$ , and $C_{F T}$ are the learning rate, batch size, and the covariance matrix in the fine-tuning stage.

Recall that the converged local minimizer in the pre-training stage is drawn from a Gibbs-Boltzmann distribution centered at the zero point. This is inherited from the assumption that the local minimum is at the zero point. However, in the fine-tuning stage, the converged local minimum has a shift $θ_{F T}$ from the zero point. This leads to a shift $θ_{F T}$ of the distribution of the learned hypothesis.
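The two-stage picture can be illustrated without sampling, because the discrete update with Gaussian noise maps a Gaussian to a Gaussian. The sketch below propagates the mean and covariance through both stages. It assumes quadratic risks, a symmetric noise factor $B$ (so that $B B^{⊤} = C$), and hypothetical values for all matrices, the shift `theta_ft`, the learning rate, and the batch size.

```python
import numpy as np

def propagate_gaussian(mean, cov, A, C, center, eta, batch_size, steps):
    """Push N(mean, cov) through `steps` discrete updates
    theta <- theta - eta * A (theta - center) + (eta / sqrt(|S|)) B dW,
    where B is symmetric so that the noise covariance is (eta^2 / |S|) C."""
    M = np.eye(A.shape[0]) - eta * A
    noise = (eta ** 2 / batch_size) * C
    for _ in range(steps):
        mean = center + M @ (mean - center)
        cov = M @ cov @ M.T + noise
    return mean, cov

eta, batch_size = 0.01, 8
# Hypothetical pre-training stage: minimum at the origin.
A_pt, C_pt = np.diag([1.0, 2.0]), np.eye(2)
mean_pt, cov_pt = propagate_gaussian(np.ones(2), np.eye(2), A_pt, C_pt,
                                     np.zeros(2), eta, batch_size, steps=5000)
# Hypothetical fine-tuning stage: starts from the pre-trained distribution,
# with a different Hessian, noise covariance, and a shifted minimum theta_ft.
A_ft, C_ft = np.diag([2.0, 1.0]), 0.5 * np.eye(2)
theta_ft = np.array([0.3, -0.2])
mean_ft, cov_ft = propagate_gaussian(mean_pt, cov_pt, A_ft, C_ft,
                                     theta_ft, eta, batch_size, steps=5000)
print(mean_pt, mean_ft)
print(np.diag(cov_pt), np.diag(cov_ft))
```

In this sketch the mean moves from the origin to `theta_ft`, and the covariance moves from the pre-training value to the fine-tuning value. For a small learning rate the discrete covariances stay close to the solutions of the continuous-time equations for $Σ_{P T}$ and $Σ_{F T}$, with a gap that shrinks as $η$ decreases.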

6 Generalization analysis of super-model paradigm

The knowledge transition is characterized by the diffusion process in the fine-tuning stage. In this paper, we employ the PAC-Bayesian theory to analyze the generalizability of domain adaptation.

6.1 PAC-Bayesian framework

PAC-Bayesian theory incorporates the PAC theory and Bayesian statistics (McAllester, 1999a, b). It presents a generalization bound for a stochastic algorithm based on the distance, measured by the KL divergence, between the learned hypothesis and the prior. The PAC-Bayesian bound characterizes the trade-off between minimizing the empirical risk and exploring areas of the hypothesis space farther from the initial point.

Lemma 1 (see McAllester (1999a), Theorem 1).

For any positive real $δ \in (0, 1)$ , with probability at least $1 - δ$ over a sample of size $N$ , we have the following inequality for all distributions $Q$ :

R (Q) \leq \hat{R} (Q) + \sqrt{\frac{D (Q | | P) + log \frac{1}{δ} + log N + 2}{2 N - 1}},

(22)

where $D (Q | | P)$ is the KL divergence between the distributions $Q$ and $P$ and is defined as,

D (Q | | P) = E_{θ \sim Q} (log \frac{Q (θ)}{P (θ)}) .

(23)

This lemma characterizes how the generalization depends on the distance between the distribution $Q$ of the learned hypothesis and the prior $P$, measured by the KL divergence $D (Q | | P)$. The KL divergence serves as a hypothesis complexity measure. Specifically, a larger KL divergence corresponds to a larger hypothesis complexity and, in turn, worse generalizability.
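To make the dependence on the KL divergence concrete, the following sketch evaluates the right-hand side of eq. (22). The function name `pac_bayes_bound` and all numerical inputs are hypothetical and chosen only for illustration.

```python
import math

def pac_bayes_bound(empirical_risk, kl, delta, n):
    """Right-hand side of eq. (22): empirical risk plus the complexity term."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    complexity = (kl + math.log(1.0 / delta) + math.log(n) + 2.0) / (2.0 * n - 1.0)
    return empirical_risk + math.sqrt(complexity)

# Hypothetical numbers, for illustration only.
for kl in (1.0, 10.0, 100.0):
    print(kl, pac_bayes_bound(empirical_risk=0.05, kl=kl, delta=0.05, n=10_000))
```

For a fixed sample size and confidence level, the bound increases monotonically with the KL divergence, which is the sense in which the divergence acts as a complexity measure.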

6.2 Generalization bound

We then obtain a generalization bound for the pre-training stage as follows.

Theorem 1.

For any positive real $δ \in (0, 1)$ , with probability at least $1 - δ$ over a training sample set of size $N_{P T}$ , we have the following inequality for the distribution $Q$ of the output hypothesis function of SGD:

R (Q_{P T}) \leq \hat{R} (Q_{P T}) + \sqrt{\frac{D (Q_{P T}, P) + 2 log (\frac{1}{δ}) + 2 log N_{P T} + 4}{4 N_{P T} - 2}},

(24)

where

D (Q_{P T}, P) = log (det (Σ_{P T})) + tr (Σ_{P T} - I) .

The proof of this generalization bound has two parts: (1) utilize results on stochastic differential equations (SDEs) to find the stationary solution of the latent Ornstein-Uhlenbeck process (eq. 18), which expresses the iterative update of SGD; and (2) adapt the PAC-Bayesian framework to obtain the generalization bound based on the stationary distribution. A detailed proof is omitted here and is given in Section 8.1.

Similarly, we can obtain a generalization bound for the fine-tuning stage as follows.

Theorem 2.

For any positive real $δ \in (0, 1)$ , with probability at least $1 - δ$ over a training sample set of size $N_{F T}$ , we have the following inequality for the distribution $Q_{F T}$ of the output hypothesis function of SGD:

R (Q_{F T}) \leq \hat{R} (Q_{F T}) + \sqrt{\frac{D (Q_{F T}, Q_{P T}) + 2 log (\frac{1}{δ}) + 2 log N_{F T} + 4}{4 N_{F T} - 2}},

where

D (Q_{F T}, Q_{P T}) = log (det (Σ_{P T}^{- 1} Σ_{F T})) + tr (Σ_{P T}^{- 1} Σ_{F T} - I) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} .

Remark 5.

The generalization bounds in both the pre-training and fine-tuning stages are of order $O (1 / \sqrt{N})$, which suggests that the generalization error converges to zero as the training sample size goes to infinity.

6.3 Dominance of fine-tuning in generalization of domain adaptation

In the super-model paradigm, the model is usually pre-trained on large amounts of data in a wide source domain and then fine-tuned on specific domains with relatively little training data. The training sample size $N_{P T}$ in the source domain is significantly larger than the size $N_{F T}$ of the data in the target domain. For example, GPT-3 is trained on 45TB of data. Meanwhile, the training sample size in the target domain is relatively small.

Remark 6.

Combining Theorems 1 and 2, the comparison between the training sample sizes on the source domain and the target domain suggests that the generalization error of the fine-tuning stage is dominant in the super-model paradigm.
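A rough numerical illustration of this remark evaluates the square-root term shared by Theorems 1 and 2 for two sample sizes. The sample sizes, the confidence level, and the use of one common discrepancy value for both stages are assumptions made only for this sketch; in the theorems the two stages have different discrepancy terms.

```python
import math

def complexity_term(discrepancy, delta, n):
    """Square-root term shared by Theorems 1 and 2."""
    numerator = discrepancy + 2.0 * math.log(1.0 / delta) + 2.0 * math.log(n) + 4.0
    return math.sqrt(numerator / (4.0 * n - 2.0))

# Hypothetical sample sizes and a common discrepancy value, for illustration only.
n_pt, n_ft = 10**9, 10**4
term_pt = complexity_term(discrepancy=50.0, delta=0.05, n=n_pt)
term_ft = complexity_term(discrepancy=50.0, delta=0.05, n=n_ft)
print(term_pt, term_ft, term_ft / term_pt)
```

Under these assumed numbers the fine-tuning term is larger than the pre-training term by more than two orders of magnitude, so the sum of the two is governed by the fine-tuning stage.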

6.4 Impact of the domain shifts

Theorem 2 helps characterize how the domain shifts between the source domain and the target domain influence the generalization on the target domain. The domain shifts are measured by the following discrepancy.

Definition 2 (Domain discrepancy).

Suppose the distributions of the learned models in the pre-training and fine-tuning stages are $Q_{P T}$ and $Q_{F T}$, with densities as follows,

	$q_{P T} (θ) =$	$M_{P T} exp {- \frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ},$
	$q_{F T} (θ) =$	$M_{F T} exp {- \frac{1}{2} (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T})},$		(25)

where $M_{P T}$ and $M_{F T}$ are two normalizers, $Σ_{P T}$ and $Σ_{F T}$ are two covariance matrices, and $θ_{F T}$ is the center shift between the two learned hypotheses.

Then, the domain discrepancy between the two domains is defined as below,

D (Q_{F T}, Q_{P T}) = log (det (Σ_{P T}^{- 1} Σ_{F T})) + tr (Σ_{P T}^{- 1} Σ_{F T} - I) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} .

Remark 7.

In Definition 2, we assume the distribution of the pre-trained model is centered at the zero point. This assumption does not lose generality: if the distribution center is not at the zero point, one may move it to the zero point via reparameterization.

Remark 8.

The domain discrepancy $D (Q_{F T}, Q_{P T})$ consists of two parts: (1) $Σ_{P T}^{- 1} Σ_{F T}$ characterizes how well the two covariances match; and (2) $θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T}$ characterizes the size of the center shift, measured through the covariance in the source domain.

Remark 9.

Our generalization bound suggests that the generalization on the target domain is determined by the magnitude of the domain shifts. Generally, larger domain shifts lead to worse generalization on the target domain. This supports the heuristic in practice that the performance on the target domain is limited by the domain shifts.
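The two parts of the discrepancy in Definition 2 can be evaluated directly. The sketch below transcribes the expression term by term; the function name `domain_discrepancy` and the example covariances and shifts are hypothetical.

```python
import numpy as np

def domain_discrepancy(sigma_pt, sigma_ft, theta_ft):
    """Evaluate the expression in Definition 2 term by term."""
    d = sigma_pt.shape[0]
    ratio = np.linalg.solve(sigma_pt, sigma_ft)        # Sigma_PT^{-1} Sigma_FT
    sign, logdet = np.linalg.slogdet(ratio)
    if sign <= 0:
        raise ValueError("Sigma_PT^{-1} Sigma_FT must have a positive determinant")
    covariance_part = logdet + np.trace(ratio) - d     # log det(.) + tr(. - I)
    shift_part = theta_ft @ np.linalg.solve(sigma_pt, theta_ft)
    return covariance_part + shift_part

# Hypothetical source covariance with one wide and one narrow direction.
sigma_pt = np.diag([1.0, 0.01])
no_shift = domain_discrepancy(sigma_pt, sigma_pt, np.zeros(2))
wide = domain_discrepancy(sigma_pt, sigma_pt, np.array([0.1, 0.0]))
narrow = domain_discrepancy(sigma_pt, sigma_pt, np.array([0.0, 0.1]))
print(no_shift, wide, narrow)
```

With identical covariances and no shift the discrepancy vanishes. A shift of the same length contributes 100 times more along the narrow direction than along the wide one in this example, which is how the source covariance weights the center shift.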

Based on Definition 2, one can get the following lemma.

Lemma 2.

The domain discrepancy $D (Q_{F T}, Q_{P T})$ can be upper bounded as follows,

		$D (Q_{F T}, Q_{P T})$
	$\leq$	$log (tr (Σ_{P T}^{- 1} Σ_{F T})) + tr (Σ_{P T}^{- 1} Σ_{F T}) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} + d log d - d .$

Proof of Lemma 2.

We have that

		$D (Q_{F T}, Q_{P T})$
	$=$	$log (det (Σ_{P T}^{- 1} Σ_{F T})) + tr (Σ_{P T}^{- 1} Σ_{F T} - I) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T}$
	$\leq$	$log (d^{d} tr (Σ_{P T}^{- 1} Σ_{F T})) - d + tr (Σ_{P T}^{- 1} Σ_{F T}) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} .$

∎

Based on Lemma 2, we define a new notion for measuring the domain shifts as follows.

Definition 3 (Dimension-dependent domain discrepancy).

Suppose the distributions of the learned models in the pre-training and fine-tuning stages are $Q_{P T}$ and $Q_{F T}$, with densities as follows,

	$q_{P T} (θ) =$	$M_{P T} exp {- \frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ},$
	$q_{F T} (θ) =$	$M_{F T} exp {- \frac{1}{2} (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T})},$		(26)

where $M_{P T}$ and $M_{F T}$ are two normalizers, $Σ_{P T}$ and $Σ_{F T}$ are two covariance matrices, and $θ_{F T}$ is the center shift between the two learned hypotheses.

Then, the dimension-dependent domain discrepancy between the two domains is defined as below,

		$\tilde{D} (Q_{F T}, Q_{P T})$
	$=$	$log (tr (Σ_{P T}^{- 1} Σ_{F T})) + tr (Σ_{P T}^{- 1} Σ_{F T}) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} + d log d - d .$

From Theorem 2, one may obtain the following corollary.

Corollary 1.

R (Q_{F T}) \leq \hat{R} (Q_{F T}) + \sqrt{\frac{\tilde{D} (Q_{F T}, Q_{P T}) + 2 log (\frac{1}{δ}) + 2 log N_{F T} + 4}{4 N_{F T} - 2}},

(27)

where

		$\tilde{D} (Q_{F T}, Q_{P T})$
	$=$	$log (tr (Σ_{P T}^{- 1} Σ_{F T})) + tr (Σ_{P T}^{- 1} Σ_{F T}) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} + d log d - d .$

and $A$ is the Hessian matrix of the loss function around the local minimum.

7 Discussion and future work

Large-scale pre-trained models, such as GPT-3 and BERT, enable a new industrial paradigm: pre-training a super model on large amounts of multi-modality data (sometimes of low quality) and then fine-tuning the learned model on smaller, specific application domains. This super-model paradigm may significantly reduce the application cost of machine learning, which is critical for the enormous number of small and medium-sized enterprises.

A major technique in this paradigm is domain adaptation, which enables the knowledge transfer between the two domains. We model the super-model paradigm as a two-stage diffusion process: (1) in the pre-training stage, the trajectory of stochastic gradient descent (SGD) or its variants searches the loss surface, described discretely by an Uhlenbeck-Ornstein process or smoothly by the Fokker-Planck equation. The model weight starts from a no-knowledge prior and converges to a Maxwell-Boltzmann distribution; and (2) in the fine-tuning stage, the trajectory of SGD is driven by a similar SDE, which starts from the learned model distribution in the pre-training stage and converges to another Maxwell-Boltzmann distribution. Based on the diffusion processes, an $O (1 / \sqrt{N})$ generalization bound is obtained via the PAC-Bayesian framework.

The generalization bounds suggest that the fine-tuning stage dominates the generalization of the whole paradigm. The generalization is determined by the domain discrepancy between the pre-training and fine-tuning domains, which is characterized by a new measure based on the covariance and domain shifts.

In this work, we make several assumptions and abstractions. This section discusses the limitations they introduce and gives several potential extensions.

Model compression in the fine-tuning stage. In this paper, we ignore the model compression approaches in the fine-tuning stage, which are sometimes employed in practice. Popular model compression methods include model distillation, pruning, and quantization. The effects of model compression can be seen as operators on the loss surface and the learned model. A future direction is to mathematically characterize the influence of model compression. The results may serve as plug-and-play components for the theory presented in this paper.
Gradient noise in SGD. In this paper, we assume that the gradient noise is drawn from a Gaussian distribution. Recent works have also modeled the gradient noise as a Lévy process, Laplacian noise, etc. The exact distribution of the gradient noise is still an open problem. In addition, the gradient noise is assumed to be state-independent, an assumption that can be easily extended to the state-dependent case. A future direction is to study the distribution of the gradient noise in SGD. It is worth noting that relatively little effort is needed to change the gradient noise distribution assumptions in this paper.
Advanced techniques in modeling SGD. In this paper, we model the trajectory of SGD via the Fokker-Planck equation and the Uhlenbeck-Ornstein process. This modeling ignores the influence of several techniques, such as momentum and adaptive learning rates. Recent works find that these techniques may implicitly regularize the learned model but would not have a decisive impact. A future direction is to model SGD as a more sophisticated stochastic differential equation.
Distribution/data-dependent priors. Some works design priors that rely on the data-generating distribution but still do not directly rely on the training data. This is reasonable since we can assume the data distribution was fixed before the data was collected (Lever et al., 2013). Such distribution-dependent priors have been shown to considerably tighten generalization bounds. Negrea et al. (2019) further push the frontier by constructing priors that are not independent of the data. Suppose $S_{J}$ is a subset of $S$ with size $n < m$. One may design a prior exploiting $S_{J}$ to deliver a data-dependent forecast of the posterior $Q$. A future direction is to model SGD via distribution/data-dependent priors.

8 Proofs

This section presents the proofs for the given theory.

We model the iterative updates in SGD by a stochastic differential equation. This approach is also seen in the literature; see, e.g., E (2017); Mandt et al. (2017); Mou et al. (2018a); He et al. (2019); Meng et al. (2020); Cheng et al. (2020); Xie et al. (2020); Wang et al. (2021).

We first cast the updates in SGD as an Ornstein-Uhlenbeck process (Uhlenbeck and Ornstein, 1930) under some mild assumptions. The Ornstein-Uhlenbeck process has a steady distribution, which is then employed to characterize the distribution of the learned hypothesis. We further obtain a generalization bound via the PAC-Bayesian framework by exploiting the stationary distribution; the bound characterizes how the generalization depends on the distance between the output hypothesis distribution and its prior (McAllester, 1999a, b).

8.1 Proof of Theorem 1

The proof of Theorem 1 relies on the following lemma.

Lemma 3 (cf. Mandt et al. (2017), pp. 27-18, Appendix B).

Under the second-order differentiable assumption (eq. 13), the stationary distribution of the Ornstein-Uhlenbeck process (eq. 18),

q (θ) = M exp {- \frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ},

(28)

has the following property,

A Σ_{P T} + Σ_{P T} A = \frac{η}{| S |} C .

(29)

This lemma gives the analytic form of the steady distribution of the Ornstein-Uhlenbeck process. This lemma is from Mandt et al. (2017). Here, we recall the proof to make this paper complete.

Proof.

From a result on the Ornstein-Uhlenbeck process (Gardiner and others, 1985), we know that the parameter $θ$ has the following analytic solution,

θ (t) = θ (0) e^{- A t} + \sqrt{\frac{η}{| S |}} \int_{0}^{t} e^{- A (t - t^{'})} B d W (t^{'}),

(30)

where $W (t^{'})$ is a white noise and follows $N (0, I)$ . From eq. (28), we know that

Σ_{P T} = E_{θ \sim Q} [θ θ^{⊤}] .

(31)

Therefore, we have the following equation,

$A Σ_{P T} + Σ_{P T} A =$	$\frac{η}{| S |} \int_{- \infty}^{t} A e^{- A (t - t^{'})} C e^{- A (t - t^{'})} d t^{'}$
	$+ \frac{η}{| S |} \int_{- \infty}^{t} e^{- A (t - t^{'})} C e^{- A (t - t^{'})} d t^{'} A$
$=$	$\frac{η}{| S |} \int_{- \infty}^{t} \frac{d}{d t^{'}} (e^{- A (t - t^{'})} C e^{- A (t - t^{'})}) d t^{'}$
$=$	$\frac{η}{| S |} C .$	(32)

The proof is completed. ∎

Then, we can prove Theorem 1. This proof is inspired by He et al. (2019). Here, we recall the proof to make this paper complete.

Proof of Theorem 1.

In the PAC-Bayesian framework (Lemma 1), an essential part is the KL divergence between the distribution of the learned hypothesis and the prior on the hypothesis space. The prior distribution can be interpreted as the distribution of the initial parameters, which are usually set according to Gaussian distributions or uniform distributions.¹

¹ Usually, when there is no confident prior knowledge of the latent model parameters, the prior should be set to a distribution with no information, such as a Gaussian distribution or a uniform distribution. This setting comes from two considerations: (1) once the algorithms based on Bayesian statistics can converge, after a long enough time and with enough data, they always converge to the stationary distributions, which is guaranteed by the assumption that the stationary solution of the latent stochastic differential equation exists and is unique; and (2) the prior should be set very carefully, as we cannot assume any knowledge of the target hypothesis function before training the model.

Here, we use a standard Gaussian distribution $N (0, I)$ as the prior. Suppose the densities of the prior distribution $P$ and the stationary distribution $Q_{P T}$ are respectively $p (θ)$ and $q_{P T} (θ)$ in terms of the parameter $θ$, as in the following equations,

	$p (θ) = \frac{1}{\sqrt{2 π det (I)}} exp {- \frac{1}{2} θ^{⊤} I θ},$		(33)
	$q_{P T} (θ) = \frac{1}{\sqrt{2 π det (Σ_{P T})}} exp {- \frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ},$		(34)

where eq. (34) comes from eq. (28) by calculating the normalizer $M$.

Therefore,

	$log (\frac{q_{P T} (θ)}{p (θ)})$
$=$	$log (\frac{\sqrt{2 π det (I)}}{\sqrt{2 π det (Σ_{P T})}} exp {\frac{1}{2} θ^{⊤} I θ - \frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ})$
$=$	$\frac{1}{2} log (\frac{1}{det (Σ_{P T})}) + \frac{1}{2} (θ^{⊤} I θ - θ^{⊤} Σ_{P T}^{- 1} θ) .$	(35)

Applying eq. (35) to eq. (23), we can calculate the KL divergence between the distributions $Q_{P T}$ and $P$ (we assume $Θ = R^{d}$):

	$D (Q_{P T} | | P)$
$=$	$E_{θ \sim Q_{P T}} (log \frac{Q_{P T} (θ)}{P (θ)})$
$=$	$\int_{θ \in Θ} log (\frac{q_{P T} (θ)}{p (θ)}) q_{P T} (θ) d θ$
$=$	$\int_{θ \in Θ} [\frac{1}{2} log (\frac{1}{det (Σ_{P T})}) + \frac{1}{2} (θ^{⊤} I θ - θ^{⊤} Σ_{P T}^{- 1} θ)] q_{P T} (θ) d θ$
$=$	$\frac{1}{2} log (\frac{1}{det (Σ_{P T})}) + \frac{1}{2} \int_{θ \in Θ} θ^{⊤} I θ q_{P T} (θ) d θ - \frac{1}{2} \int_{θ \in Θ} θ^{⊤} Σ_{P T}^{- 1} θ q_{P T} (θ) d θ$
$=$	$\frac{1}{2} log (\frac{1}{det (Σ_{P T})}) + \frac{1}{2} E_{θ \sim N (0, Σ_{P T})} θ^{⊤} I θ - \frac{1}{2} E_{θ \sim N (0, Σ_{P T})} θ^{⊤} Σ_{P T}^{- 1} θ$
$=$	$\frac{1}{2} log (\frac{1}{det (Σ_{P T})}) + \frac{1}{2} tr (Σ_{P T} - I) .$	(36)
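The closed form in eq. (36) can be checked numerically. The sketch below compares it against a Monte Carlo estimate of the expectation in eq. (23), using the log-density ratio of eq. (35). The covariance `sigma`, the sample count, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def kl_to_standard_normal(sigma):
    """KL( N(0, sigma) || N(0, I) ) in the closed form of eq. (36)."""
    d = sigma.shape[0]
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("sigma must be positive definite")
    return 0.5 * (-logdet + np.trace(sigma) - d)

def kl_monte_carlo(sigma, num_samples=200_000, seed=0):
    """Sample-based estimate of E_{theta ~ Q}[log q(theta) - log p(theta)]."""
    rng = np.random.default_rng(seed)
    d = sigma.shape[0]
    theta = rng.multivariate_normal(np.zeros(d), sigma, size=num_samples)
    _, logdet = np.linalg.slogdet(sigma)
    quad_q = np.einsum("ni,ij,nj->n", theta, np.linalg.inv(sigma), theta)
    quad_p = np.einsum("ni,ni->n", theta, theta)
    return np.mean(-0.5 * logdet + 0.5 * (quad_p - quad_q))

sigma = np.array([[0.5, 0.1], [0.1, 0.2]])   # hypothetical stationary covariance
print(kl_to_standard_normal(sigma), kl_monte_carlo(sigma))
```

The two values should agree up to sampling error, and the closed form vanishes when `sigma` is the identity, that is, when the stationary distribution coincides with the prior.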

From eq. (29), we have that

A_{P T} Σ_{P T} + Σ_{P T} A_{P T} = \frac{η_{P T}}{| S_{P T} |} C .

(37)

Therefore,

A_{P T} Σ_{P T} A_{P T}^{- 1} + Σ_{P T} = \frac{η_{P T}}{| S_{P T} |} C A_{P T}^{- 1} .

(38)

After calculating the trace of the both sides, we have the following equation,

tr (A_{P T} Σ_{P T} A_{P T}^{- 1} + Σ_{P T}) = tr (\frac{η_{P T}}{| S_{P T} |} C A_{P T}^{- 1}) .

(39)

The left-hand side (LHS) is as follows,

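$LHS =$	$tr (A_{P T} Σ_{P T} A_{P T}^{- 1} + Σ_{P T})$
$=$	$tr (A_{P T} Σ_{P T} A_{P T}^{- 1}) + tr (Σ_{P T})$
$=$	$tr (Σ_{P T} A_{P T}^{- 1} A_{P T}) + tr (Σ_{P T})$
$=$	$tr (Σ_{P T}) + tr (Σ_{P T})$
$=$	$2 tr (Σ_{P T}) .$	(40)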

Therefore,

tr (Σ_{P T}) = \frac{1}{2} tr (\frac{η_{P T}}{| S_{P T} |} C A_{P T}^{- 1}) = \frac{1}{2} \frac{η_{P T}}{| S_{P T} |} tr (C A_{P T}^{- 1}) .

(41)
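As a sanity check on eq. (41), the sketch below solves eq. (37) numerically for small hypothetical matrices and compares the two sides of the trace identity. The Kronecker-product solver `solve_lyapunov` and all numerical values are assumptions made for illustration.

```python
import numpy as np

def solve_lyapunov(A, Q):
    """Solve A X + X A = Q for X (A symmetric positive definite) via Kronecker products."""
    d = A.shape[0]
    I = np.eye(d)
    K = np.kron(A, I) + np.kron(I, A.T)
    return np.linalg.solve(K, Q.reshape(-1)).reshape(d, d)

# Hypothetical pre-training quantities.
eta, batch_size = 0.1, 32
A = np.array([[2.0, 0.5], [0.5, 1.0]])
C = np.array([[1.0, 0.2], [0.2, 0.5]])
Sigma = solve_lyapunov(A, (eta / batch_size) * C)

lhs = np.trace(Sigma)
rhs = 0.5 * (eta / batch_size) * np.trace(C @ np.linalg.inv(A))
similarity = np.trace(A @ Sigma @ np.linalg.inv(A))   # cyclic-trace step of eq. (40)
print(lhs, rhs, similarity)
```

The identity shows that the trace of the stationary covariance scales linearly with the ratio of learning rate to batch size, without requiring $Σ_{P T}$ itself to be computed.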

At the same time, we can easily calculate that

tr (I) = d,

(42)

as $I \in R^{d \times d}$ , where $d$ is the dimension of the parameter $θ$ .

Inserting eqs. (41) and (42) into eq. (36), we get the following inequality,

D (Q_{P T} | | P) \leq \frac{1}{4} \frac{η_{P T}}{| S_{P T} |} tr (C A_{P T}^{- 1}) - \frac{1}{2} log (det (Σ_{P T})) - \frac{1}{2} d .

(43)

Eq. (43) gives an upper bound for the distance (measured by the KL divergence) between the stationary distribution of the weights output by SGD and the prior on the hypothesis space. Considering the monotonicity of the generalization bound in terms of the KL divergence, we can further obtain a PAC-Bayesian generalization bound for SGD by inserting the KL divergence bound (eq. 43) into the PAC-Bayesian framework (eq. (22) of Lemma 1).

The proof is completed. ∎

8.2 Proof of Theorem 2

This section proves Theorem 2. The proof is similar to the previous theorem.

Proof of Theorem 2.

Similarly, the densities of the prior distribution $Q_{P T}$ and the distribution $Q_{F T}$ of the learned hypothesis are respectively $q_{P T} (θ)$ and $q_{F T} (θ)$ in terms of the parameter $θ$, as in the following equations,

	$q_{P T} (θ) = \frac{1}{\sqrt{2 π det (Σ_{P T})}} exp {- \frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ},$		(44)
	$q_{F T} (θ) = \frac{1}{\sqrt{2 π det (Σ_{F T})}} exp {- \frac{1}{2} (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T})},$		(45)

where eq. (45) comes from eq. (21) by calculating the normalizer $M_{F T}$.

Therefore,

	$log (\frac{q_{F T} (θ)}{q_{P T} (θ)})$
$=$	$log (\frac{\sqrt{2 π det (Σ_{P T})}}{\sqrt{2 π det (Σ_{F T})}} exp {\frac{1}{2} θ^{⊤} Σ_{P T}^{- 1} θ - \frac{1}{2} (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T})})$
$=$	$\frac{1}{2} log (\frac{det (Σ_{P T})}{det (Σ_{F T})}) + \frac{1}{2} (θ^{⊤} Σ_{P T}^{- 1} θ - (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T})) .$	(46)

Then, the KL divergence between the distributions $Q_{F T}$ and $Q_{P T}$ is as follows (we assume $Θ = R^{d}$):

	$D (Q_{F T} | | Q_{P T})$
$=$	$E_{θ \sim Q_{F T}} (log \frac{Q_{F T} (θ)}{Q_{P T} (θ)})$
$=$	$\int_{θ \in Θ} log (\frac{q_{F T} (θ)}{q_{P T} (θ)}) q_{F T} (θ) d θ$
$=$	$\int_{θ \in Θ} [\frac{1}{2} log (\frac{det (Σ_{P T})}{det (Σ_{F T})}) + \frac{1}{2} (θ^{⊤} Σ_{P T}^{- 1} θ - (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T}))] q_{F T} (θ) d θ$
$=$	$\frac{1}{2} log (\frac{det (Σ_{P T})}{det (Σ_{F T})}) + \frac{1}{2} \int_{θ \in Θ} θ^{⊤} Σ_{P T}^{- 1} θ q_{F T} (θ) d θ$
	$- \frac{1}{2} \int_{θ \in Θ} (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T}) q_{F T} (θ) d θ$
$=$	$\frac{1}{2} log (\frac{det (Σ_{P T})}{det (Σ_{F T})}) + \frac{1}{2} E_{θ \sim N (θ_{F T}, Σ_{F T})} θ^{⊤} Σ_{P T}^{- 1} θ - \frac{1}{2} E_{θ \sim N (θ_{F T}, Σ_{F T})} (θ - θ_{F T})^{⊤} Σ_{F T}^{- 1} (θ - θ_{F T})$
$=$	$\frac{1}{2} log (det (Σ_{P T}^{- 1} Σ_{F T})) + \frac{1}{2} tr (Σ_{P T}^{- 1} Σ_{F T} - I) + \frac{1}{2} θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} .$	(47)

Therefore, we have

R (Q_{F T}) \leq \hat{R} (Q_{F T}) + \sqrt{\frac{D (Q_{F T}, Q_{P T}) + 2 log (\frac{1}{δ}) + 2 log N_{F T} + 4}{4 N_{F T} - 2}},

(48)

where

D (Q_{F T}, Q_{P T}) = log (det (Σ_{P T}^{- 1} Σ_{F T})) + tr (Σ_{P T}^{- 1} Σ_{F T} - I) + θ_{F T}^{⊤} Σ_{P T}^{- 1} θ_{F T} .

The proof is completed. ∎

Acknowledgments

The authors appreciate Shiye Lei for helpful discussions.

References

P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017) Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249. Cited by: §2.
P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: §2.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth (1989) Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36 (4), pp. 929–965. Cited by: §2.
K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan (2016) Domain separation networks. arXiv preprint arXiv:1608.06019. Cited by: §2.
O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of Machine Learning Research 2 (Mar), pp. 499–526. Cited by: §2.
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1, §4, §4.
H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao (2021) Pre-trained image processing transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299–12310. Cited by: §1.
H. Chen, O. Engkvist, Y. Wang, M. Olivecrona, and T. Blaschke (2018a) The rise of deep learning in drug discovery. Drug Discovery Today 23 (6), pp. 1241–1250. Cited by: §2.
Y. Chen, C. Jin, and B. Yu (2018b) Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619. Cited by: §2.
X. Cheng, D. Yin, P. Bartlett, and M. Jordan (2020) Stochastic gradient and langevin processes. In International Conference on Machine Learning, pp. 1810–1819. Cited by: §2, §8.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §4, §4.
R. M. Dudley (1967) The sizes of compact subsets of hilbert space and continuity of Gaussian processes. Journal of Functional Analysis 1 (3), pp. 290–330. Cited by: §2.
W. E, C. Ma, S. Wojtowytsch, and L. Wu (2020) Towards a mathematical understanding of neural network-based machine learning: what we know and what we don’t. arXiv preprint arXiv:2009.10713. Cited by: §2.
W. E (2017) A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics 5 (1), pp. 1–11. Cited by: §5.1, §8.
Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp. 1180–1189. Cited by: §2.
C. W. Gardiner et al. (1985) Handbook of stochastic methods. Vol. 3, springer Berlin. Cited by: §5.1, §8.1.
M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li (2016) Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pp. 597–613. Cited by: §2.
N. Golowich, A. Rakhlin, and O. Shamir (2018) Size-independent sample complexity of neural networks. In Annual Conference on Learning Theory, pp. 297–299. Cited by: §2.
P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: Remark 2.
X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, L. Zhang, W. Han, M. Huang, et al. (2021) Pre-trained models: past, present and future. AI Open. Cited by: §1.
M. Hardt, B. Recht, and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In International Conference on Machine learning, pp. 1225–1234. Cited by: §2.
N. Harvey, C. Liaw, and A. Mehrabian (2017) Nearly-tight VC-dimension bounds for piecewise linear neural networks. In Annual Conference on Learning Theory, pp. 1064–1068. Cited by: §2.
D. Haussler (1995) Sphere packing numbers for subsets of the boolean $n$ -cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A 69 (2), pp. 217–232. Cited by: §2.
F. He, T. Liu, and D. Tao (2019) Control batch size and learning rate to generalize well: theoretical and empirical evidence. In Advances in Neural Information Processing Systems, Cited by: §2, §2, §8.1, §8.
F. He and D. Tao (2020) Recent advances in deep learning theory. arXiv preprint arXiv:2012.10931. Cited by: §2.
N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. In International Conference on Leanring Representations, Cited by: Remark 2.
D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §1, §3.
V. Koltchinskii and D. Panchenko (2000) Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pp. 443–457. Cited by: §2.
V. Koltchinskii (2001) Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory 47 (5), pp. 1902–1914. Cited by: §2.
C. A. Kulikowski (1980) Artificial intelligence methods and systems for medical consultation. IEEE Transactions on Pattern Analysis and Machine Intelligence (5), pp. 464–476. Cited by: §2.
S. Kullback and R. A. Leibler (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: Definition 1.
I. Kuzborskij and C. Lampert (2018) Data-dependent stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 2815–2824. Cited by: §2.
G. Lever, F. Laviolette, and J. Shawe-Taylor (2013) Tighter pac-bayes bounds through distribution-dependent priors. Theoretical Computer Science 473, pp. 4–28. Cited by: 4th item.
H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein (2018) Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, Cited by: §1.
T. Liang, T. Poggio, A. Rakhlin, and J. Stokes (2019) Fisher-rao metric, geometry, and complexity of neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 888–896. Cited by: §2.
P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer (2018) Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198. Cited by: §4.
T. Liu, G. Lugosi, G. Neu, and D. Tao (2017) Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, pp. 2159–2167. Cited by: §2.
B. London (2017) A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, Cited by: §2.
M. Long, Y. Cao, J. Wang, and M. Jordan (2015) Learning transferable features with deep adaptation networks. In International conference on machine learning, pp. 97–105. Cited by: §2.
M. Long, H. Zhu, J. Wang, and M. I. Jordan (2016) Unsupervised domain adaptation with residual transfer networks. arXiv preprint arXiv:1602.04433. Cited by: §2.
M. Long, H. Zhu, J. Wang, and M. I. Jordan (2017) Deep transfer learning with joint adaptation networks. In International conference on machine learning, pp. 2208–2217. Cited by: §2.
S. Mandt, M. D. Hoffman, and D. M. Blei (2017) Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research 18 (1), pp. 4873–4907. Cited by: §2, §5.1, §8.1, §8, Lemma 3.
D. A. McAllester (1999a) PAC-Bayesian model averaging. In Annual Conference of Learning Theory, Vol. 99, pp. 164–170. Cited by: §1, §2, §6.1, §8, Lemma 1.
D. A. McAllester (1999b) Some PAC-Bayesian theorems. Machine Learning 37 (3), pp. 355–363. Cited by: §1, §2, §6.1, §8.
Q. Meng, S. Gong, W. Chen, Z. Ma, and T. Liu (2020) Dynamic of stochastic gradient descent with state-dependent noise. arXiv preprint arXiv:2006.13719. Cited by: §2, §8.
W. Mou, L. Wang, X. Zhai, and K. Zheng (2018a) Generalization bounds of sgld for non-convex learning: two theoretical viewpoints. In Annual Conference On Learning Theory, Cited by: §2, §8.
W. Mou, L. Wang, X. Zhai, and K. Zheng (2018b) Generalization bounds of sgld for non-convex learning: two theoretical viewpoints. In Annual Conference On Learning Theory, pp. 605–638. Cited by: §2.
J. Negrea, M. Haghifam, G. K. Dziugaite, A. Khisti, and D. M. Roy (2019) Information-theoretic generalization bounds for sgld via data-dependent estimates. In Advances in Neural Information Processing Systems, pp. 11015–11025. Cited by: 4th item.
Y. E. Nesterov (1983) A method for solving the convex programming problem with convergence rate o (1/k^ 2). In Dokl. Akad. Nauk Sssr, Vol. 269, pp. 543–547. Cited by: §1, §3.
B. Neyshabur, S. Bhojanapalli, and N. Srebro (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564. Cited by: §2.
D. Patterson, J. Gonzalez, Q. Le, C. Liang, L. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean (2021) Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350. Cited by: §1, §4.
A. Pensia, V. Jog, and P. Loh (2018) Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 546–550. Cited by: §2.
M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365. Cited by: §4.
T. Poggio, A. Banburski, and Q. Liao (2020) Theoretical issues in deep networks. Proceedings of the National Academy of Sciences. Cited by: §2.
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. Cited by: §4.
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9. Cited by: §4.
H. Robbins and S. Monro (1951) A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407. Cited by: §1, §3.
W. H. Rogers and T. J. Wagner (1978) A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pp. 506–514. Cited by: §2.
B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pp. 2152–2161. Cited by: §4.
D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484. Cited by: §2.
J. Snell, K. Swersky, and R. S. Zemel (2017) Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175. Cited by: §4.
F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M. Hospedales (2018) Learning to compare: relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1199–1208. Cited by: §4.
P. Tseng (1998) An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM Journal on Optimization 8 (2), pp. 506–531. Cited by: §1, §3.
Z. Tu, F. He, and D. Tao (2020) Understanding generalization in recurrent neural networks. In International Conference on Learning Representations. Cited by: §2.
E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint arXiv:1412.3474. Cited by: §2.
G. E. Uhlenbeck and L. S. Ornstein (1930) On the theory of the Brownian motion. Physical Review 36 (5), pp. 823. Cited by: §1, §8.
V. Vapnik (2006) Estimation of dependences based on empirical data. Springer Science & Business Media. Cited by: §2.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762. Cited by: §4.
S. Verma and Z. Zhang (2019) Stability and generalization of graph convolutional neural networks. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1539–1548. Cited by: §2.
H. Wang, Y. Huang, R. Gao, and F. P. Calmon (2021) Learning while dissipating information: understanding the generalization capability of SGLD. arXiv preprint arXiv:2102.02976. Cited by: §8.
Z. Xie, I. Sato, and M. Sugiyama (2020) A diffusion theory for deep learning dynamics: stochastic gradient descent exponentially favors flat minima. arXiv e-prints, pp. arXiv–2002. Cited by: §8.
H. Xu, C. Caramanis, and S. Mannor (2011) Sparse algorithms are not stable: a no-free-lunch theorem. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 187–193. Cited by: §2.
W. Zhang, W. Ouyang, W. Li, and D. Xu (2018) Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3801–3809. Cited by: §2.
