Deep Generative Modeling on Limited Data with Regularization by Nontransferable Pre-trained Models

Yong Zhong^{1,2}, Hongtao Liu^{1,2,*}, Xiaodong Liu^{1,2}, Fan Bao^{3}, Weiran Shen^{1,2}, Chongxuan Li^{1,2,†}

^{1} Gaoling School of AI, Renmin University of China, Beijing, China
^{2} Beijing Key Lab of Big Data Management and Analysis Methods, Beijing, China
^{3} Department of Computer Science Technology, Tsinghua University, Beijing, China

{yongzhong, ht6, xiaodong.liu}@ruc.edu.cn, bf19@mails.tsinghua.edu.cn, {shenweiran, chongxuanli}@ruc.edu.cn

^{*} Equal contribution. ^{†} Correspondence to Weiran Shen and Chongxuan Li.

Abstract

Deep generative models (DGMs) are data-eager because learning a complex model on limited data suffers from a large variance and easily overfits. Inspired by the bias-variance dilemma, we propose the regularized deep generative model (Reg-DGM), which leverages a nontransferable pre-trained model to reduce the variance of generative modeling with limited data. Formally, Reg-DGM optimizes a weighted sum of two terms: a certain divergence between the data distribution and the DGM, and the expectation, w.r.t. the DGM, of an energy function defined by the pre-trained model. Theoretically, we characterize the existence and uniqueness of the global minimum of Reg-DGM in the nonparametric setting and rigorously prove the statistical benefits of Reg-DGM w.r.t. the mean squared error and the expected risk in a simple yet representative Gaussian-fitting example. Empirically, Reg-DGM is flexible in the choice of the DGM and the pre-trained model. In particular, with a ResNet-18 classifier pre-trained on ImageNet and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs including StyleGAN2 and ADA on several benchmarks with limited data and achieves results competitive with the state-of-the-art methods.

1 Introduction

Deep generative models (DGMs) Kingma and Welling (2013); Goodfellow et al. (2014); Sohl-Dickstein et al. (2015); Van den Oord et al. (2016); Dinh et al. (2016); Hinton and Salakhutdinov (2006) employ neural networks to capture the underlying distribution of high-dimensional data and find applications in various learning tasks Kingma et al. (2014); Zhu et al. (2017); Hoogeboom et al. (2019); Razavi et al. (2019); Karras et al. (2020a); Ramesh et al. (2021, 2022); Ho et al. (2022). Such models are often data-eager Li et al. (2021); Wang et al. (2018a); Karras et al. (2020a) due to the complex function classes. Recent work Wang et al. (2018a); Karras et al. (2020a) found that the classical variants of generative adversarial networks (GAN) Goodfellow et al. (2014); Karras et al. (2019, 2020b) produce poor samples with limited data. In principle, the problem is shared by other DGMs (e.g., Kingma and Welling (2013); Van Den Oord et al. (2017); Razavi et al. (2019)), and improving the sample efficiency is a common challenge.

The root cause of the problem is that learning a model in a complex class on limited data suffers from a large variance and easily overfits the training data Karras et al. (2020a); Mohri et al. (2018); Murphy (2012). To relieve the problem, previous work either employed sophisticated data augmentation strategies Karras et al. (2020a); Jiang et al. (2021), or designed new losses for the discriminator in GAN Cui et al. (2021); Yang et al. (2021), or transferred a pre-trained DGM Wang et al. (2018a); Noguchi and Harada (2019); Mo et al. (2020). In this paper, inspired by the bias-variance dilemma, we propose a complementary framework, regularized deep generative model (Reg-DGM), which leverages a pre-trained model to reduce the variance of training a DGM with limited data. The paper is organized as follows.

In Sec. 2, we formulate the objective function of Reg-DGM as the sum of a certain divergence between the data distribution and model distribution, and a regularization term weighted by a hyperparameter. The regularization term can be understood as the negative expected log-likelihood of an energy-based model, whose energy function is defined by a pre-trained model, w.r.t. the DGM. Intuitively, with an appropriate value of the weighting hyperparameter, Reg-DGM balances between the data distribution and the pre-trained model to achieve a better bias-variance trade-off with limited data.

In Sec. 3, we characterize the optimization behavior and statistical benefits of Reg-DGM in theory. On one hand, under mild regularity conditions, we prove the existence and uniqueness of the global minimum of the regularized optimization problem with the KL and JS divergence in the nonparametric setting. Besides, we show that the global minimum is in the form of a reweighted data distribution whose weights are determined by the pre-trained model. On the other hand, we construct a simple yet prototypical Gaussian-fitting example to compare the maximum likelihood estimation (MLE), the pre-trained model, and our approach under two common measures: the mean squared error and the expected risk. Our approach is provably the best under both measures, which quantifies the motivation of Reg-DGM and provides insights into the excellent performance of Reg-DGM in practice.

In Sec. 4, we specify the two components in Reg-DGM. On one hand, we employ strong variants of GAN Karras et al. (2020b, a) as the base DGM for broader interests. On the other hand, we consider a nontransferable setting where the pre-trained model does not necessarily have the same architecture or the same model formulation as the base DGM, or does not even have to be a generative model. Indeed, we adopt a feature extractor trained for classification by default, and the corresponding energy function is defined as the expected mean squared error between the features of the generated samples and training data. Notably, such a model cannot be directly used in the fine-tuning approaches Wang et al. (2018b); Mo et al. (2020).

In Sec. 5, we present experiments on several benchmarks, including the FFHQ Karras et al. (2019), LSUN CAT Yu et al. (2015), and CIFAR-10 Krizhevsky et al. (2009) datasets. We compare Reg-DGM with a large family of methods, including the base DGMs Karras et al. (2020b, a), the transfer-based approaches Wang et al. (2018b); Mo et al. (2020), the augmentation-based methods Karras et al. (2020a); Jiang et al. (2021); Zhao et al. (2020a) and others Cui et al. (2021); Yang et al. (2021). With a classifier pre-trained on ImageNet, Reg-DGM consistently improves the generation performance of strong DGMs Karras et al. (2020b, a) on several benchmarks with limited data and achieves results competitive with the state-of-the-art methods. Our results demonstrate that Reg-DGM can achieve a good bias-variance trade-off in practice, which agrees with our intuition and theory.

In Sec. 6, we present more related work. In Sec. 7, we conclude the paper and discuss limitations as well as potential future work.

2 Method

The goal of (deep) generative modeling is to learn an (implicit or explicit) model distribution $p_{g}$ over the sample space $X$ from a training set $S = \{x_{i}\}_{i = 1}^{m}$ of size $m$. The elements in $S$ are assumed drawn i.i.d. according to an unknown data distribution $p_{d} \in P_{X}$, where $P_{X}$ is the set of all valid distributions over $X$. A general formulation for generative models is to minimize a certain divergence $D (\cdot \| \cdot)$ between the two distributions as follows:

$$\min_{p_{g} \in H} D (p_{d} \| p_{g}), \tag{1}$$

where $H \subset P_{X}$ is some function class such as neural networks in deep generative models Kingma and Welling (2013); Goodfellow et al. (2014); Sohl-Dickstein et al. (2015); Van den Oord et al. (2016); Dinh et al. (2016); Hinton and Salakhutdinov (2006). The objective in Eq. (1) is often estimated over the training set $S$ because $p_{d}$ is unknown.
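As one concrete instance of estimating Eq. (1) over $S$, the standard decomposition below covers the case where $D$ is the KL divergence. It is a textbook identity rather than a result of this paper, and it assumes that $p_d$ and $p_g$ admit densities.

```latex
\begin{align*}
D_{\mathrm{KL}}(p_d \,\|\, p_g)
  &= \mathbb{E}_{x \sim p_d}[\log p_d(x)] - \mathbb{E}_{x \sim p_d}[\log p_g(x)] \\
  &\approx C - \frac{1}{m} \sum_{i=1}^{m} \log p_g(x_i),
  \qquad C = \mathbb{E}_{x \sim p_d}[\log p_d(x)] \text{ does not depend on } p_g .
\end{align*}
```

Minimizing this estimate over $p_{g} \in H$ is therefore maximum likelihood estimation on $S$.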

We consider a challenging problem in this paper: learning a deep generative model (DGM) with limited data Karras et al. (2020a). Complementary to existing augmentation-based methods Karras et al. (2020a); Jiang et al. (2021), we propose to leverage an external model $f$ pre-trained on a related and potentially large dataset (e.g., ImageNet Deng et al. (2009)) to reduce the variance of the DGMs on limited data (e.g., CIFAR-10 Krizhevsky et al. (2009)). A classical way to leverage the pre-trained models is fine-tuning Wang et al. (2018b); Mo et al. (2020), which (partially) uses the pre-trained model $f$ to initialize the generative model $p_{g}$ and hence restricts the model families of both $f$ and $p_{g}$ .

In contrast, we focus on a nontransferable setting, where the pre-trained model does not necessarily have the same architecture or the same model formulation as $p_{g}$ or does not even have to be a generative model. In particular, we can employ a pre-trained model $f$ in an arbitrary form as long as we can define a real-valued function $R_{f} : X \to R$ satisfying mild regularity conditions (footnote 1: we assume that $\int_{X} \exp (- R_{f} (x)) d x < \infty$ and that Assumption 3.1 holds). Given the function, we simply optimize a weighted sum of a statistical divergence as in the original DGM and the expectation of the function under the model distribution as follows:

$$\min_{p_{g} \in H} D (p_{d} \| p_{g}) + λ E_{x \sim p_{g}} [R_{f} (x)], \tag{2}$$

where $λ > 0$ is the weighting hyperparameter balancing the two terms. Similarly to the base DGM, Eq. (2) can be estimated over the training set $S$ , as presented in Sec. 4. We refer to our approach as regularized deep generative model (Reg-DGM) when $H$ consists of neural networks.

Notably, we could employ a more general functional of $p_{g}$ and $f$ as the regularization term while we focus on the specific one in Eq. (2) for two reasons. On one hand, we can treat $R_{f}$ as an energy function LeCun et al. (2006), which defines a probability distribution $p_{f} (x) \propto exp (- R_{f} (x))$ . Therefore, the regularization term in Eq. (2) can be explained as the expected negative log-likelihood of $p_{f} (x)$ :

$$E_{x \sim p_{g}} [R_{f} (x)] \Leftrightarrow - E_{x \sim p_{g}} [\log p_{f} (x)], \tag{3}$$

where $\Leftrightarrow$ means equality up to an additive constant irrelevant to the optimization of $p_{g}$. Intuitively, the regularization term encourages $p_{g}$ to produce “good” samples in the eyes of the energy-based model defined by the pre-trained model. On the other hand, with such a regularization term, Reg-DGM enjoys nice convergence properties in the nonparametric setting and provable statistical benefits in a representative Gaussian-fitting example, as detailed in Sec. 3.
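The additive constant in Eq. (3) can be spelled out as the log-partition function of the energy-based model. The short derivation below uses only the definition $p_{f} (x) \propto \exp (- R_{f} (x))$ and the integrability assumption stated earlier; the symbol $Z_f$ is introduced here for illustration.

```latex
\begin{align*}
p_f(x) &= \frac{\exp(-R_f(x))}{Z_f},
  \qquad Z_f = \int_X \exp(-R_f(x))\, dx < \infty, \\
-\mathbb{E}_{x \sim p_g}[\log p_f(x)]
  &= \mathbb{E}_{x \sim p_g}[R_f(x)] + \log Z_f .
\end{align*}
```

Since $\log Z_f$ does not depend on $p_{g}$, both sides have the same minimizers.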

3 Analyses

Before going into the details of the implementation, we first investigate two natural and fundamental questions about the optimization and statistical benefits of our approach:

In the nonparametric setting, does a global optimum of problem (2) exist? If it exists, is it unique, and how does it balance the pre-trained model and the data?
Why and when is our approach provably better than the original divergence minimization as well as the pre-trained model w.r.t. common measures in statistics and learning theory?

For clarity, we only present the assumptions and main results in this section and refer the readers to Appendix A for the complete proofs of all results.

3.1 Optimization Analyses in the Nonparametric Setting

Similarly to the original GAN Goodfellow et al. (2014), we first analyze the optimization of Eq. (2) in the nonparametric setting, i.e. $H = P_{X}$ . Our results rely on the following regularity conditions.

Assumption 3.1.

1. $X$ is a nonempty compact set. 2. $R_{f} : X \to R$ is continuous and bounded.

Remark.

The regularity conditions in Assumption 3.1 are mild in the sense:

The sample space $X$ is often a subset of an $n$-dimensional Euclidean space. Then, $X$ is compact if and only if $X$ is bounded and closed, which holds for a wide range of data types including images, videos, audio, and text.
In this paper, $R_{f}$ is defined by composing a neural network with a continuous real-valued function. Then $R_{f}$ is continuous and bounded on $X$ if the neural network has bounded weights and uses continuous activation functions such as ReLU Nair and Hinton (2010), Tanh, and Sigmoid.

Building on the classical recipe of the calculus of variations and properties of the KL divergence in the topology of weak convergence, we establish our theory on the existence and uniqueness of the global minimum of problem (2) with the KL and JS divergences (footnote 2: we consider these two divergences because they are employed in two representative DGMs, VAE and GAN). The results are formally characterized in Theorem 3.2 and Theorem 3.3, respectively.

Theorem 3.2.

Under Assumption 3.1, for any $λ > 0$ , there exists a unique global minimum of the problem in Eq. (2) with the KL divergence. Further, the global minimum is in the form of $p_{g}^{*} (x) = \frac{p_{d} (x)}{α^{*} + λ R_{f} (x)}$ , where $α^{*} \in R$ .

Theorem 3.3.

Under Assumption 3.1, for any $λ > 0$ , there exists a unique global minimum of the problem in Eq. (2) with the JS divergence. Further, the global minimum is in the form of $p_{g}^{*} (x) = \frac{p_{d} (x)}{e^{α^{*} + λ R_{f} (x)} - 1}$ , where $α^{*} \in R$ .

As shown in Theorem 3.2 and Theorem 3.3, the global minimum is in the form of reweighted data distribution and the weights are negatively correlated to the energy function defined by the pre-trained model. Qualitatively, the global minimum assigns high density for a sample $x$ if it has high density under the data distribution (i.e. $p_{d} (x)$ ) and low value of the energy function (i.e., $R_{f} (x)$ ) (see Fig. 1 for an illustration). Naturally, as $λ \to 0$ , the denominator of $p_{g}^{*} (x)$ tends to a constant and $p_{g}^{*} (x) \to p_{d} (x)$ , which recovers the solution of pure divergence minimization in Eq. (1).
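To make the reweighting in Theorem 3.2 concrete, the following sketch computes the KL-case optimum on a finite sample space, where the normalizer $α^{*}$ can be found by bisection. This is a simplification for illustration: the function name `kl_reg_optimum` and the toy values are hypothetical, and the code assumes that `p_d` is strictly positive and sums to one.

```python
def kl_reg_optimum(p_d, r_f, lam, iters=200):
    """Return (alpha, p_g) with p_g[i] = p_d[i] / (alpha + lam * r_f[i]) and sum(p_g) = 1.

    Assumes a finite sample space with p_d[i] > 0 for all i (illustrative setting).
    """
    r_min = min(r_f)

    # Work with the offset t = alpha + lam * r_min > 0 so every denominator stays positive.
    def mass(t):
        return sum(p / (t + lam * (r - r_min)) for p, r in zip(p_d, r_f))

    # mass(t) is strictly decreasing in t, so the root of mass(t) = 1 is unique.
    lo, hi = 0.0, 1.0
    while mass(hi) > 1.0:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    t = hi
    alpha = t - lam * r_min
    p_g = [p / (t + lam * (r - r_min)) for p, r in zip(p_d, r_f)]
    return alpha, p_g


# Toy example: three outcomes with increasing energy.
p_d = [0.2, 0.5, 0.3]
r_f = [0.0, 1.0, 4.0]
alpha, p_g = kl_reg_optimum(p_d, r_f, lam=0.5)
print(alpha, p_g)
```

In this toy setting the ratio `p_g[i] / p_d[i]` decreases as `r_f[i]` increases, and `p_g` approaches `p_d` as `lam` tends to zero, in line with the discussion above.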

Notably, the weights in Theorem 3.2 and Theorem 3.3 are different because of the different divergences. In particular, the effect of the pretrained model is enlarged by the exponential term in Theorem 3.3 using JSD. Fig. 1 (c) and (d) show that with the same value of $λ$ , the weighting coefficients of JSD are distributed in a larger range than KLD.

We also analyze the convergence of Reg-DGM trained by (stochastic) gradient descent, building upon the convergence theory for deep learning Allen-Zhu et al. (2019). In particular, we show that under standard and verifiable smoothness assumptions, with high probability, Reg-DGM with a sufficiently wide ReLU CNN converges to a global optimum (of the empirical risk) when trained by GD and converges to a local minimum when trained by SGD. The assumptions and results are formally presented in Appendix A.3.

3.2 Statistical Benefits When the Sample Size Is Small

In the rest of the section, we analyze when and why our approach is statistically preferable given a small sample size. On one hand, the solution of pure divergence minimization typically has a low bias but can have a large variance, especially when the sample size $m$ is small. On the other hand, we can also treat the pre-trained model as a baseline estimator since it defines an energy-based model. The pre-trained model may have a large bias, but it has zero variance. In comparison, our approach combines the two with the additional flexibility of choosing $λ$ and thus can achieve a better bias-variance trade-off than the two baselines. To formalize the idea, we construct a simple yet prototypical Gaussian-fitting example as follows.

Example 3.4 (Gaussian-fitting example).

The data distribution is a (univariate) Gaussian $x \sim N (μ^{*}, σ^{2})$ with density function $p_{d} (x)$, where $σ^{2}$ is known and $μ^{*}$ is the parameter to be estimated. A training sample $S = \{x_{i}\}_{i = 1}^{m}$ is drawn i.i.d. according to $p_{d} (x)$. The hypothesis class for $p_{g}$ is $H = \{N (μ, σ^{2}) \mid μ \in R\}$. The regularization term in Eq. (2) is $R_{f} (x) := - \log N (\hat{μ}_{PRE}, σ^{2})$, where $\hat{μ}_{PRE} \neq μ^{*}$, i.e., the pre-trained model defines a distribution $p_{f} (x)$ where $x$ follows $N (\hat{μ}_{PRE}, σ^{2})$.

For simplicity, we consider the maximum likelihood estimation (MLE) (i.e., using the KL divergence in Eq. (1); footnote 3: the analyses can be extended to the JS divergence case) as the baseline, and its solution for Example 3.4 is given by the sample mean Bishop and Nasrabadi (2006):

$$\hat{μ}_{MLE} = \frac{1}{m} \sum_{i = 1}^{m} x_{i}, \quad \hat{μ}_{MLE} \sim N (μ^{*}, \frac{1}{m} σ^{2}). \tag{4}$$

The pre-trained model $\hat{μ}_{PRE}$ is another meaningful baseline for our approach. Clearly, it has a bias of $\hat{μ}_{PRE} - μ^{*}$ and zero variance. The following Lemma 3.5 characterizes the solution of our approach.

Lemma 3.5.

In the Gaussian-fitting example 3.4, the solution of our approach based on MLE is

$$\hat{μ}_{REG} = \frac{1}{1 + λ} \hat{μ}_{MLE} + \frac{λ}{1 + λ} \hat{μ}_{PRE}, \quad \hat{μ}_{REG} \sim N (\frac{1}{1 + λ} μ^{*} + \frac{λ}{1 + λ} \hat{μ}_{PRE}, \frac{σ^{2}}{m (1 + λ)^{2}}). \tag{5}$$
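One way to see where the closed form in Lemma 3.5 comes from is the following sketch, which writes the empirical objective of Eq. (2) for Example 3.4 as a function of $μ$ and sets its derivative to zero. The symbol $J$ is introduced here for illustration only, and the complete proof is the one in Appendix A.

```latex
\begin{align*}
J(\mu) &= \frac{1}{m} \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}
  + \lambda\, \mathbb{E}_{x \sim \mathcal{N}(\mu, \sigma^2)}
    \left[ \frac{(x - \hat{\mu}_{PRE})^2}{2\sigma^2} \right] + \text{const} \\
 &= \frac{1}{m} \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2}
  + \lambda\, \frac{(\mu - \hat{\mu}_{PRE})^2 + \sigma^2}{2\sigma^2} + \text{const}, \\
J'(\mu) = 0 &\iff (\mu - \hat{\mu}_{MLE}) + \lambda (\mu - \hat{\mu}_{PRE}) = 0
 \iff \mu = \frac{1}{1 + \lambda} \hat{\mu}_{MLE} + \frac{\lambda}{1 + \lambda} \hat{\mu}_{PRE} .
\end{align*}
```

Because the minimizer is an affine function of $\hat{μ}_{MLE}$, its mean and variance follow from the distribution of the sample mean.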

The solutions of all estimators in the Gaussian-fitting example quantify the intuition of why our approach is preferable: the MLE has a bias of zero while a potentially large variance; the pre-trained model has a variance of zero while a potentially large bias. In contrast, our approach balances the bias and variance by the hyperparameter $λ$ . Below, we show that our approach is provably better than the baselines if $λ$ is properly selected under common measures in statistics and learning theory.

We first compare all estimators in terms of the mean squared error (MSE). The MSE of an estimator $\hat{θ}$ w.r.t. an unknown parameter $θ$ is defined as $MSE [\hat{θ}] := E [(θ - \hat{θ})^{2}]$, which can be decomposed as the sum of the variance of the estimator and the squared bias of the estimator. Namely, $MSE [\hat{θ}] = Bias [\hat{θ}]^{2} + Var [\hat{θ}]$ Bishop and Nasrabadi (2006). As shown in Theorem 3.6, our approach provides a better bias-variance trade-off than both the MLE and the pre-trained model if $λ$ is within an appropriate range.

Theorem 3.6.

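Let $β = \frac{λ}{λ + 1}$ be the normalized weight of the regularization term. In the Gaussian-fitting example 3.4, if $\max \{\frac{σ^{2} - m (\hat{μ}_{PRE} - μ^{*})^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}}, 0\} < β < \min \{\frac{2 σ^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}}, 1\}$ (footnote 4: note that $\max \{\frac{σ^{2} - m (\hat{μ}_{PRE} - μ^{*})^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}}, 0\} < \min \{\frac{2 σ^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}}, 1\}$ always holds), then the following inequality holds:

$$MSE [\hat{μ}_{REG}] < \min \{MSE [\hat{μ}_{MLE}], MSE [\hat{μ}_{PRE}]\}. \tag{6}$$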

The optimal value of $β$ is $\frac{σ^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}}$. The corresponding $λ$ is $\frac{σ^{2}}{m (\hat{μ}_{PRE} - μ^{*})^{2}}$ and the corresponding MSE of our approach is $\frac{MSE [\hat{μ}_{MLE}] MSE [\hat{μ}_{PRE}]}{MSE [\hat{μ}_{MLE}] + MSE [\hat{μ}_{PRE}]}$.
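The quantities in Theorem 3.6 are all available in closed form, so they can be checked numerically. The sketch below evaluates the three MSEs, the admissible range of $β$, and the optimal $β$ for Example 3.4; the function names and the toy parameter values are hypothetical and chosen only for illustration.

```python
def gaussian_mses(sigma2, m, bias_pre, beta):
    """Closed-form MSEs of the three estimators in Example 3.4.

    bias_pre is mu_PRE - mu_star and beta = lam / (1 + lam).
    """
    mse_mle = sigma2 / m        # MLE: zero bias, variance sigma^2 / m
    mse_pre = bias_pre ** 2     # pre-trained model: zero variance, squared bias
    # From Lemma 3.5: bias is beta * bias_pre, variance is (1 - beta)^2 * sigma^2 / m.
    mse_reg = beta ** 2 * mse_pre + (1.0 - beta) ** 2 * mse_mle
    return mse_mle, mse_pre, mse_reg


def beta_range(sigma2, m, bias_pre):
    """Open interval of beta from the condition of Theorem 3.6."""
    denom = sigma2 + m * bias_pre ** 2
    lower = max((sigma2 - m * bias_pre ** 2) / denom, 0.0)
    upper = min(2.0 * sigma2 / denom, 1.0)
    return lower, upper


def beta_star(sigma2, m, bias_pre):
    """Optimal normalized weight stated after Theorem 3.6."""
    return sigma2 / (sigma2 + m * bias_pre ** 2)


sigma2, m, bias_pre = 1.0, 10, 0.5   # toy values for illustration
lower, upper = beta_range(sigma2, m, bias_pre)
best = beta_star(sigma2, m, bias_pre)
mse_mle, mse_pre, mse_reg = gaussian_mses(sigma2, m, bias_pre, best)
print(lower, best, upper)
print(mse_reg, mse_mle * mse_pre / (mse_mle + mse_pre))
```

The two values on the last printed line should agree up to floating-point error, matching the optimal MSE expression above.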

We plot the MSE in Fig. 2 for a clear and straightforward comparison of the estimators. We also evaluate all estimators in terms of the expected risk, which is a commonly used measure in statistical learning theory. In a density estimation task for $p_{d}$, the expected risk of a hypothesis $\hat{μ}$, which depends on the training sample $S$, is $R (\hat{μ}) := E_{x \sim p_{d}} [- \log p (x; \hat{μ})]$. We can show that, in the Gaussian-fitting example, the expectation of the expected risk w.r.t. $S$ equals the corresponding MSE up to a positive scaling factor and an additive constant, which directly yields the following Corollary 3.6.1 from Theorem 3.6.

Corollary 3.6.1.

In the Gaussian-fitting example 3.4, if $λ$ satisfies the same condition as in Theorem 3.6, then the following inequality holds:

$$E_{S} [R (\hat{μ}_{REG})] < \min \{E_{S} [R (\hat{μ}_{MLE})], E_{S} [R (\hat{μ}_{PRE})]\}. \tag{7}$$
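The link between the expected risk and the MSE used for Corollary 3.6.1 can be made explicit with a short calculation. The sketch below assumes, as in Example 3.4, that $p (x; \hat{μ})$ is the density of $N (\hat{μ}, σ^{2})$ and that $x$ is drawn independently of $S$.

```latex
\begin{align*}
R(\hat{\mu})
  &= \mathbb{E}_{x \sim \mathcal{N}(\mu^*, \sigma^2)}
     \left[ \tfrac{1}{2} \log(2\pi\sigma^2) + \frac{(x - \hat{\mu})^2}{2\sigma^2} \right]
   = \tfrac{1}{2} \log(2\pi\sigma^2) + \tfrac{1}{2} + \frac{(\mu^* - \hat{\mu})^2}{2\sigma^2}, \\
\mathbb{E}_S[R(\hat{\mu})]
  &= \tfrac{1}{2} \log(2\pi\sigma^2) + \tfrac{1}{2} + \frac{\mathrm{MSE}[\hat{\mu}]}{2\sigma^2} .
\end{align*}
```

Since the right-hand side is increasing in the MSE, any strict ordering of the MSEs carries over to the expected risks.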

Limitations and insights of the Gaussian example. In the Gaussian-fitting example, the hypothesis class is simple and the MSE and expected risk have analytic forms. However, the generalization analysis of deep learning is still largely open Belkin (2021); Bartlett et al. (2021), and the tools needed to generalize our Corollary 3.6.1 to the context of deep learning are currently lacking. Despite its simplicity, the analysis inspires the data-dependent energy function presented in Sec. 4 and provides insights into the excellent performance of Reg-DGM in Sec. 5. We believe that our attempt to address this problem from a bias-variance trade-off perspective is meaningful and can benefit this area, because existing methods are mainly empirical.

4 Implementation

In this section, we specify the formulation of the base model, the pre-trained model, and the energy function. Readers can refer to Appendix B.2 and the source code for more details.

4.1 Base Models

Although Reg-DGM applies to other models like variational auto-encoders (VAE) Kingma and Welling (2013); Van Den Oord et al. (2017); Razavi et al. (2019), we focus on generative adversarial networks (GAN) Goodfellow et al. (2014), which are the most representative and widely used models in scenarios with limited data.

GAN optimizes an estimate of the JS divergence via a minimax formulation as follows:

$$\min_{G} \max_{D} E_{x \sim p_{d} (x)} [\log D (x)] + E_{x \sim p_{g} (x)} [\log (1 - D (x))], \tag{8}$$

where $G$ is a generator that defines $p_{g} (x)$ and $D$ is a discriminator that estimates the JS divergence by discriminating samples. Both $G$ and $D$ are parameterized by neural networks and Eq. (8) is estimated by the Monte Carlo method over mini-batches sampled from the training set. For a broader interest, we adopt two strong variants, StyleGAN2 Karras et al. (2020b) and ADA Karras et al. (2020a), as the base DGMs.
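As a small illustration of the Monte Carlo estimate of Eq. (8), the sketch below evaluates the value function on one mini-batch, given discriminator outputs in $(0, 1)$. The function name `gan_value_estimate` and the toy inputs are hypothetical; this is not the loss code of StyleGAN2 or ADA, which may use different loss variants.

```python
import math


def gan_value_estimate(d_real, d_fake):
    """Mini-batch estimate of the value function in Eq. (8).

    d_real: discriminator outputs D(x) on training samples, each in (0, 1).
    d_fake: discriminator outputs D(x) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that cannot tell the two batches apart outputs 0.5 everywhere,
# in which case the estimate equals -log 4.
print(gan_value_estimate([0.5, 0.5], [0.5, 0.5]))
```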

4.2 Pre-trained models and Energy Functions

We consider a nontransferable setting where the pre-trained model does not necessarily have the same architecture or the same formulation as $p_{g}$ or does not even have to be a generative model. In particular, we investigate two prototypical cases and design the corresponding regularization term. Recall that we treat the regularization term as the energy function of an EBM in Eq. (3).

In the first case, the pre-trained model $f$ outputs the log density of a pre-trained DGM (e.g., VAE). A natural choice of the energy function is then

$$R_{f} (x) := - f (x), \tag{9}$$

which encourages $p_{g}$ to produce samples of high likelihood in the eyes of the pre-trained DGM.

In the second case, the pre-trained model $f$ is a feature extractor (of output dimension $d$ ) trained for other tasks (e.g., classification) instead of generation. The energy function is defined by the expected mean squared error between the features of a generated sample and a training sample as follows:

$$R_{f} (x) := E_{x' \sim p_{d}} [\frac{1}{d} \| f (x) - f (x') \|_{2}^{2}], \tag{10}$$

which forces $p_{g}$ to produce samples whose features are similar to the features of the training data in expectation. Note that both the convergence analysis in the nonparametric setting, as in Theorem 3.3, and that in the non-convex optimization setting, as in Appendix A.3, allow a data-dependent energy as presented in Eq. (10). By default, the expectation is estimated by the Monte Carlo method with a single sample for efficiency; increasing the number of samples does not affect the performance significantly (see results in Appendix C.4). Throughout the experiments, we focus on the second case by default. Note that such models are nontransferable in the fine-tuning manner Hinton et al. (2006). Nevertheless, Reg-DGM is still competitive with the transfer-based approaches Wang et al. (2018b); Mo et al. (2020), as shown in Tab. 1.
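The single-sample Monte Carlo estimate of the regularizer built from Eq. (10) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `toy_features` is a hypothetical stand-in for the pre-trained feature extractor $f$, samples are plain Python lists, and all function names are introduced here.

```python
import random


def feature_mse(f, x, x_prime):
    """(1/d) * ||f(x) - f(x')||_2^2 for one generated and one training sample."""
    fx, fxp = f(x), f(x_prime)
    return sum((a - b) ** 2 for a, b in zip(fx, fxp)) / len(fx)


def regularizer_estimate(f, generated_batch, training_set, rng):
    """Estimate E_{x ~ p_g}[R_f(x)] using one training sample per generated sample."""
    total = 0.0
    for x in generated_batch:
        x_prime = rng.choice(training_set)  # single-sample estimate of the inner expectation
        total += feature_mse(f, x, x_prime)
    return total / len(generated_batch)


def toy_features(x):
    # Hypothetical stand-in for a pre-trained feature extractor with d = 2.
    return [sum(x), max(x)]


rng = random.Random(0)
training_set = [[0.0, 0.0], [1.0, 2.0]]
generated_batch = [[1.0, 2.0], [0.5, 0.5]]
print(regularizer_estimate(toy_features, generated_batch, training_set, rng))
```

In training, such an estimate would be multiplied by $λ$ and added to the generator objective, in line with Eq. (2).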

Choice of the energy function. We mention that many alternative energy functions can be employed in Reg-DGM. Intuitively, as suggested in Theorem 3.6, Reg-DGM benefits if the EBM is close to the target distribution $p_{d}$. Therefore, if $f$ is pre-trained on a dataset of much richer semantics (e.g., ImageNet) than $p_{d}$ (e.g., images of faces), then it is crucial to involve training data in the energy, as in Eq. (10), to make the EBM closer to $p_{d}$. We provide additional results for the feature matching regularization Salimans et al. (2016) and entropy regularization Grandvalet and Bengio (2004), which are worse than Eq. (10) either qualitatively or quantitatively. See details in Appendix C.4.

Currently, both evaluating DGMs empirically and quantifying the generalization of a deep model theoretically are largely open problems. Therefore, principled tools for comparing energy functions and pre-trained models are lacking. Nevertheless, we evaluate Eq. (10) with strong baselines Karras et al. (2020a) on different datasets in Table 1. Further, we perform a comprehensive study of $f$ with different backbones, different pre-training datasets, different layers, and random weights without training in Appendix C.4. We observe a consistent improvement over the baseline using Eq. (10), and therefore it is relatively safe to say that our implementation provides a promising option in a new setting.

Method	FFHQ 1k	FFHQ 5k	LSUN CAT 1k	LSUN CAT 5k	CIFAR-10 50k
Transfer Wang et al. (2018a)	$21.42$	$12.34$	-	-	-
Freeze-D Mo et al. (2020)	$19.77$	$12.69$	-	-	-
DA $^{†}$ Zhao et al. (2020a)	$25.66$	$10.45$	$42.26$	$16.11$	${8.49}^{⋆}$
InsGen $^{†}$ Yang et al. (2021)	$19.58$	-	-	-	-
GenCo $^{†}$ Cui et al. (2021)	$65.31$	$27.96$	$140.08$	$40.79$	$8.83 \pm {0.04}^{⋆}$
DA + GenCo $^{†}$ Cui et al. (2021)	-	-	-	-	$6.57 \pm 0.01$
ADA + bCR $^{‡}$ Zhao et al. (2020b)	$22.61$	$10.58$	$38.82$	$16.80$	-
$R_{LC}$ $^{†}$ Tseng et al. (2021)	$63.16 \pm 0.11$	$23.83 \pm 0.11$	-	-	$8.31 \pm {0.05}^{⋆}$
ADA + $R_{LC}$ $^{†}$ Tseng et al. (2021)	$21.7 \pm 0.06$	-	-	-	$2.47 \pm 0.01$
APA $^{†}$ Jiang et al. (2021)	$45.19$	$13.25$	-	-	-
StyleGAN2 Karras et al. (2020b)	$103.66$	$52.71$	$186.55$	$115.16$	$7.16 \pm 0.12$
Reg-StyleGAN2 (ours)	$75.99$	$37.77$	$107.02$	$63.10$	$6.56 \pm 0.14$
ADA Karras et al. (2020a)	$22.26$	$12.64$	$41.81$	$16.76$	$3.07 \pm 0.08$
Reg-ADA (ours)	$20.05$	$11.95$	$36.17$	$15.91$	$2.95 \pm 0.05$
ADA + APA Jiang et al. (2021)	$19.71$	$8.84$	$24.09$	$11.79$	$2.64 \pm 0.08$
Reg-ADA-APA (ours)	$17.88$	$8.02$	$21.88$	$11.27$	$2.58 \pm 0.04$

Table 1: Following the direct competitor Karras et al. (2020a), we report the median FID ↓ on FFHQ and LSUN CAT and the mean FID on CIFAR-10 out of 3 runs for a fair comparison. $^{†}$ indicates that the results are taken from the corresponding references and $^{‡}$ indicates that the results are taken from Karras et al. (2020a). Otherwise, the results are reproduced by us upon the official implementation of Karras et al. (2020a) and Jiang et al. (2021). $^{⋆}$ indicates that the backbone used by the corresponding method is BigGAN Brock et al. (2018) instead of StyleGAN2 Karras et al. (2020b).

5 Experiments

We evaluate Reg-DGM on several benchmarks with limited data Karras et al. (2020a), including the FFHQ Karras et al. (2019), LSUN CAT Yu et al. (2015), and CIFAR-10 Krizhevsky et al. (2009) datasets. We present the main results and analyses in this section and refer the readers to Appendix B.2 for experimental details and Appendix C for additional results. We submit the source code in the supplementary material and we will release it after the blind review. Throughout the section, we refer to our approach by the name of the base DGM with the prefix “Reg-”. For instance, “Reg-ADA” denotes our approach with ADA Karras et al. (2020a) as the base DGM.

5.1 Benchmark Results with Limited Data

We employ StyleGAN2 Karras et al. (2020b) and ADA Karras et al. (2020a) as the base DGM and a ResNet-18 He et al. (2015) classifier trained on ImageNet Deng et al. (2009) as the pre-trained model. The associated energy function is defined in Eq. (10).

Quantitatively, we compare Reg-DGM with a large family of methods, including the base DGMs Karras et al. (2020b, a), the transfer-based approach Wang et al. (2018b); Mo et al. (2020), the augmentation-based methods Karras et al. (2020a); Jiang et al. (2021) and others Cui et al. (2021); Yang et al. (2021). Following the most direct and strong competitor Karras et al. (2020a), we report the median FID Heusel et al. (2017) on FFHQ and LSUN CAT, and the mean FID on CIFAR-10 out of 3 runs for a fair comparison in Tab. 1. For completeness, we also report the mean FID with the standard deviation on FFHQ and LSUN CAT in Appendix C.2.

As shown in Tab. 1, both Reg-StyleGAN2 and Reg-ADA consistently outperform the corresponding base DGM in five settings, demonstrating that Reg-DGM can achieve a good bias-variance trade-off in practice, which agrees with our intuition and theory. Besides, the superior performance of Reg-ADA over ADA shows that our contribution is orthogonal to the augmentation-based approaches and we believe the results can be further improved upon Jiang et al. (2021). Notably, the improvement of Reg-DGM over the base DGM is more substantial when the sample size $m$ gets smaller. This is expected because, as suggested in Theorem 3.6, the ratio between $MSE [\hat{μ}_{REG}]$ and $MSE [\hat{μ}_{MLE}]$ decreases as $MSE [\hat{μ}_{MLE}]$ increases given a fixed pre-trained model and the optimal $λ$ (see Fig. 2 (a)). A similar argument holds under the measure of the expected risk according to Corollary 3.6.1.

We mention that the methods based on fine-tuning Wang et al. (2018a); Mo et al. (2020) in Tab. 1 employ a GAN pre-trained on CelebA-HQ Karras et al. (2017), which is also a face dataset of the same resolution as FFHQ. In comparison, our approach is built upon a classifier pre-trained on ImageNet. As highlighted in Wang et al. (2018a), the density of the pre-trained dataset is more important than the diversity. Nevertheless, Reg-DGM is competitive with these strong baselines while enjoying the flexibility of choosing the pre-trained model and dataset. Further, we directly adopt the same pre-trained model across all datasets including CIFAR-10, where it takes additional effort to get a suitable pre-trained model to fine-tune. Based on the results, it is safe to emphasize the complementary role of Reg-DGM to the approaches based on fine-tuning.

Qualitatively, we show the samples generated from the base DGMs Karras et al. (2019, 2020a) and our Reg-DGM on FFHQ- $5$ k and Obama- $100$ in Fig. 3. It can be seen that with the regularization, our approach can produce faces of a normal shape. For completeness, we attempt to get samples from the EBM defined by the pre-trained model via Langevin dynamics but cannot get a reasonable result. See more details in Sec. 5.2. We present more sampling results in Appendix C.3.

We plot the learning curves of our Reg-DGM and the base DGM on all datasets in Appendix C.1 and perform an ablation study on the architecture of the classifier and the form of the energy function in Appendix C.4.

5.2 Sensitivity Analysis of the Weighting Hyperparameter

As suggested in Theorem 3.6 and Corollary 3.6.1, the value of the weighting hyperparameter $λ$ is crucial for the performance of Reg-DGM and there is an appropriate range of $λ$ such that Reg-DGM is preferable to the base model.

We empirically validate the argument on FFHQ-$5$k with StyleGAN2 as the base model in Tab. 2. It can be seen that $λ$ affects the performance significantly. Notably, although it is nearly impossible to get the optimal $λ$ via grid search, there is a range of $λ$ (e.g., from $0.01$ to $1$ in Tab. 2) such that Reg-DGM outperforms the base DGM, which agrees with our theoretical analyses. Besides, Reg-DGM is not too sensitive when the value of $λ$ is around the optimum. For instance, the gap between the results of $λ = 0.1$ and $λ = 1$ in Tab. 2 is much smaller than their gain compared to the baseline.

The performance of Reg-DGM deteriorates with a large $λ$ in Tab. 2, as expected. Theorem 3.6 shows that Reg-DGM is preferable if and only if the value of $λ$ is in a proper interval. Intuitively, a very large $λ$ means that we almost ignore the training data, which should lead to inferior performance.

Values of $λ$	$λ = 0$ (baseline)	$λ = 0.01$	$λ = 0.1$	$λ = 1$	$λ = 10$	$λ = 100$
FID $↓$	$52.71$	$47.49$	$41.51$	$37.77$	$53.09$	$178.53$

Table 2: Sensitivity analysis of $λ$ on FFHQ-$5$k with StyleGAN2 Karras et al. (2020b) as the base DGM (corresponds to $λ = 0$). There is a large range of $λ$ such that Reg-DGM outperforms the baseline.

6 Related Work

Fine-tuning approaches. A milestone of deep learning is that a deep generative model fine-tuned for classification outperforms the classical SVM on recognizing handwritten digits Hinton et al. (2006). Since then, the idea of fine-tuning has had a significant impact Devlin et al. (2018); He et al. (2020), including on generative models with limited data Wang et al. (2018b); Mo et al. (2020); Wang et al. (2020); Li et al. (2020); Ojha et al. (2021). However, an inherent restriction of fine-tuning is that the pre-trained model and the target model should partially share a common structure. Thus, it may take additional effort to find a suitable pre-trained model to fine-tune. In comparison, Reg-DGM provides an alternative that makes it possible to exploit a pre-trained classifier to help generative modeling. Notably, the latter is often thought of as much harder than the former.

Other generative adversarial networks with limited data. To relieve the overfitting problem of the discriminator, DA Zhao et al. (2020a), ADA Karras et al. (2020a) and APA Jiang et al. (2021) design sophisticated data augmentation strategies. GenCo Cui et al. (2021) designs a co-training framework that introduces multiple complementary discriminators. InsGen Yang et al. (2021) improves the data efficiency of GANs via an instance discrimination loss Wu et al. (2018). We believe that Reg-DGM is orthogonal to these methods based on our results in Tab. 1.

Regularization in probabilistic models. Extensive regularization approaches have been developed in traditional Bayesian inference Zhu et al. (2014) and probabilistic modeling Chang et al. (2007); Liang et al. (2009); Ganchev et al. (2010). Among them, posterior regularization (PR) Ganchev et al. (2010) is representative. PR encodes human knowledge about the task as linear constraints on the latent representations in generative models for better inference performance. Such methods have been extended to deep generative models Hu et al. (2018); Du et al. (2018); Shu et al. (2018); Xu et al. (2019) for a similar reason. Technically, PR-based methods regularize the latent space via handcrafted or jointly trained constraints. In comparison, our approach regularizes the data space via a pre-trained model from the perspective of the bias-variance dilemma. Besides, PR-based methods are suitable for structured prediction tasks rather than generative modeling with limited data, which is the main focus of this paper.

7 Conclusions and Discussions

In this paper, we propose regularized deep generative model (Reg-DGM), which leverages a pre-trained model for regularization to reduce the variance of deep generative modeling with limited data. The regularization is defined by the negative expected log-likelihood of an energy-based model, whose energy function is defined by the pre-trained model. Theoretically, we characterize the existence and uniqueness of the global minimum of Reg-DGM in the nonparametric setting and rigorously prove that Reg-DGM can achieve a better bias-variance trade-off in a simple yet representative Gaussian-fitting example. Empirically, with a classifier pre-trained on ImageNet and a data-dependent energy function, Reg-DGM consistently improves the generation performance of strong DGMs on several benchmarks and achieves competitive results to the state-of-the-art methods.

An interesting direction for future work is to analyze the generalization behaviour of Reg-DGM in general and to inspire new energy functions. Currently, the generalization analysis of deep learning is still largely open Zhang et al. (2021); Bartlett et al. (2021); Belkin (2021), and a tool to formalize our intuition on the bias-variance trade-off in general is still lacking.

Social Impact: This work presents a framework to train deep generative models on small data. By improving the data efficiency, it can potentially benefit real-world applications such as medical analysis and autonomous driving. However, as with existing GANs, this work can have negative consequences in the form of “DeepFakes”. It is worth noting that this work may exacerbate such issues by improving the data efficiency of GANs. How to detect “DeepFakes” is an active research area in machine learning, which aims to relieve the problem.

Acknowledgement

We thank Guoqiang Wu for helpful discussions about the generalization analysis. This work was supported by NSF of China (NO. 62076145); Beijing Outstanding Young Scientist Program (NO. BJJWZYJH012019100020098); Major Innovation & Planning Interdisciplinary Platform for the “Double-First Class” Initiative, Renmin University of China; the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China (22XNKJ13). Part of the computing resources supporting this work, totaling 720 A100 GPU hours, were provided by High-Flyer AI (Hangzhou High-Flyer AI Fundamental Research Co., Ltd.).

References

[1] G. Ajjanagadde, A. Makur, J. Klusowski, S. Xu, et al. (2017) Lecture notes on information theory. Lab. Inf. Decis. Syst., Massachusetts Inst. Technol., Cambridge, MA, USA, Tech. Rep. Cited by: §A.1.
[2] Z. Allen-Zhu, Y. Li, and Z. Song (2019) A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. Cited by: §A.3, Theorem A.1, §3.1, footnote 5.
[3] P. L. Bartlett, A. Montanari, and A. Rakhlin (2021) Deep learning: a statistical viewpoint. Acta numerica 30, pp. 87–201. Cited by: §3.2, §7.
[4] M. Belkin (2021) Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica 30, pp. 203–248. Cited by: §3.2, §7.
[5] P. Billingsley (2013) Convergence of probability measures. John Wiley & Sons. Cited by: §A.2.
[6] C. M. Bishop and N. M. Nasrabadi (2006) Pattern recognition and machine learning. Vol. 4, Springer. Cited by: §A.4, §3.2, §3.2.
[7] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: Table 1.
[8] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman (2018) Vggface2: a dataset for recognising faces across pose and age. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pp. 67–74. Cited by: §C.4.
[9] M. Chang, L. Ratinov, and D. Roth (2007) Guiding semi-supervision with constraint-driven learning. In Proceedings of the 45th annual meeting of the association of computational linguistics, pp. 280–287. Cited by: §6.
[10] K. Cui, J. Huang, Z. Luo, G. Zhang, F. Zhan, and S. Lu (2021) GenCo: generative co-training for generative adversarial networks with limited data. arXiv preprint arXiv:2110.01254. Cited by: §1, §1, Table 1, §5.1, §6.
[11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §2, §5.1.
[12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §6.
[13] L. Dinh, J. Sohl-Dickstein, and S. Bengio (2016) Density estimation using real nvp. arXiv preprint arXiv:1605.08803. Cited by: §1, §2.
[14] C. Du, K. Xu, C. Li, J. Zhu, and B. Zhang (2018) Learning implicit generative models by teaching explicit ones. arXiv preprint arXiv:1807.03870. Cited by: §6.
[15] K. Ganchev, J. Graça, J. Gillenwater, and B. Taskar (2010) Posterior regularization for structured latent variable models. The Journal of Machine Learning Research 11, pp. 2001–2049. Cited by: §6.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2, §3.1, §4.1.
[17] Y. Grandvalet and Y. Bengio (2004) Semi-supervised learning by entropy minimization. Advances in neural information processing systems 17. Cited by: §C.4, §4.2.
[18] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738. Cited by: §6.
[19] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: §5.1.
[20] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30. Cited by: §B.2, §5.1.
[21] G. E. Hinton, S. Osindero, and Y. Teh (2006) A fast learning algorithm for deep belief nets. Neural computation 18 (7), pp. 1527–1554. Cited by: §4.2, §6.
[22] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: §1, §2.
[23] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet (2022) Video diffusion models. arXiv preprint arXiv:2204.03458. Cited by: §1.
[24] E. Hoogeboom, J. Peters, R. Van Den Berg, and M. Welling (2019) Integer discrete flows and lossless compression. Advances in Neural Information Processing Systems 32. Cited by: §1.
[25] Z. Hu, Z. Yang, R. R. Salakhutdinov, L. Qin, X. Liang, H. Dong, and E. P. Xing (2018) Deep generative models with learnable knowledge constraints. Advances in Neural Information Processing Systems 31. Cited by: §6.
[26] L. Jiang, B. Dai, W. Wu, and C. C. Loy (2021) Deceive d: adaptive pseudo augmentation for gan training with limited data. Advances in Neural Information Processing Systems 34. Cited by: Table 3, §1, §1, §2, Table 1, §5.1, §5.1, §6.
[27] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §5.1.
[28] T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila (2020) Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems 33, pp. 12104–12114. Cited by: §B.2, §B.2, §B.2, §B.2, Table 3, §C.3, Table 4, §1, §1, §1, §1, §2, §4.1, §4.2, Table 1, Figure 3, §5.1, §5.1, §5.1, §5, §6.
[29] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4401–4410. Cited by: §B.2, §1, §1, §5.1, §5.
[30] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8110–8119. Cited by: §B.2, Table 3, Table 4, Table 5, §1, §1, §1, §4.1, Table 1, §5.1, §5.1, Table 2.
[31] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2, §4.1.
[32] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589. Cited by: §1.
[33] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §B.2, §1, §2, §5.
[34] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: §2.
[35] C. Li, K. Xu, J. Zhu, J. Liu, and B. Zhang (2021) Triple generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
[36] Y. Li, R. Zhang, J. Lu, and E. Shechtman (2020) Few-shot image generation with elastic weight consolidation. arXiv preprint arXiv:2012.02780. Cited by: §6.
[37] P. Liang, M. I. Jordan, and D. Klein (2009) Learning from measurements in exponential families. In Proceedings of the 26th annual international conference on machine learning, pp. 641–648. Cited by: §6.
[38] S. Mo, M. Cho, and J. Shin (2020) Freeze the discriminator: a simple baseline for fine-tuning gans. arXiv preprint arXiv:2002.10964. Cited by: §1, §1, §1, §2, §4.2, Table 1, §5.1, §5.1, §6.
[39] M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. MIT press. Cited by: §1.
[40] K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §1.
[41] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Icml, Cited by: item 2.
[42] A. Noguchi and T. Harada (2019) Image generation from small datasets via batch statistics adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2750–2758. Cited by: §1.
[43] U. Ojha, Y. Li, J. Lu, A. A. Efros, Y. J. Lee, E. Shechtman, and R. Zhang (2021) Few-shot image generation via cross-domain correspondence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10743–10752. Cited by: §6.
[44] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. Cited by: §C.4.
[45] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen (2022) Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125. Cited by: §1.
[46] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever (2021) Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. Cited by: §1.
[47] A. Razavi, A. Van den Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems 32. Cited by: §1, §4.1.
[48] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in neural information processing systems, pp. 2234–2242. Cited by: §C.4, §4.2.
[49] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §C.4.
[50] R. Shu, H. H. Bui, S. Zhao, M. J. Kochenderfer, and S. Ermon (2018) Amortized inference regularization. Advances in Neural Information Processing Systems 31. Cited by: §6.
[51] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli (2015) Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. Cited by: §1, §2.
[52] H. Tseng, L. Jiang, C. Liu, M. Yang, and W. Yang (2021) Regularizing generative adversarial networks under limited data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7921–7931. Cited by: Table 1.
[53] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, et al. (2016) Conditional image generation with pixelcnn decoders. Advances in neural information processing systems 29. Cited by: §1, §2.
[54] A. Van Den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. Advances in neural information processing systems 30. Cited by: §1, §4.1.
[55] O. Van Gaans (2003) Probability measures on metric spaces. Lecture notes. Cited by: §A.1.
[56] Y. Wang, A. Gonzalez-Garcia, D. Berga, L. Herranz, F. S. Khan, and J. v. d. Weijer (2020) Minegan: effective knowledge transfer from gans to target domains with few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9332–9341. Cited by: §6.
[57] Y. Wang, C. Wu, L. Herranz, J. van de Weijer, A. Gonzalez-Garcia, and B. Raducanu (2018) Transferring gans: generating images from limited data. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 218–234. Cited by: §1, §1, Table 1, §5.1.
[58] Y. Wang, C. Wu, L. Herranz, J. van de Weijer, A. Gonzalez-Garcia, and B. Raducanu (2018) Transferring gans: generating images from limited data. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 218–234. Cited by: §1, §1, §2, §4.2, §5.1, §6.
[59] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3733–3742. Cited by: §6.
[60] T. Xu, C. Li, J. Zhu, and B. Zhang (2019) Multi-objects generation with amortized structural regularization. Advances in Neural Information Processing Systems 32. Cited by: §6.
[61] C. Yang, Y. Shen, Y. Xu, and B. Zhou (2021) Data-efficient instance generation from instance discrimination. Advances in Neural Information Processing Systems 34. Cited by: §1, §1, Table 1, §5.1, §6.
[62] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §B.2, §1, §5.
[63] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2021) Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64 (3), pp. 107–115. Cited by: §7.
[64] S. Zhao, Z. Liu, J. Lin, J. Zhu, and S. Han (2020) Differentiable augmentation for data-efficient gan training. Advances in Neural Information Processing Systems 33, pp. 7559–7570. Cited by: §1, Table 1, §6.
[65] Z. Zhao, S. Singh, H. Lee, Z. Zhang, A. Odena, and H. Zhang (2020) Improved consistency regularization for gans. arXiv preprint arXiv:2002.04724. Cited by: Table 1.
[66] J. Zhu, N. Chen, and E. P. Xing (2014) Bayesian inference with posterior regularization and applications to infinite latent svms. The Journal of Machine Learning Research 15 (1), pp. 1799–1847. Cited by: §6.
[67] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1.

Appendix A Proofs

A.1 Proof of Theorem 3.2

Proof.

Ignoring some constant irrelevant to the optimization, we rewrite the optimization problem of our approach with the KL divergence as follows:

\begin{aligned} \min_{p_{g}} \quad & \int_{X} (- \ln (p_{g} (x)) p_{d} (x) + λ R_{f} (x) p_{g} (x)) d x, \\ \text{subject to} \quad & \int_{X} p_{g} (x) d x = 1, \\ & \forall x \in X, \; p_{g} (x) \geq 0. \end{aligned}

(11)

For clarity, we denote the functional in problem (11) to be optimized as

J (p_{g}) := \int_{X} (- \ln (p_{g} (x)) p_{d} (x) + λ R_{f} (x) p_{g} (x)) d x .

According to Assumption 3.1 that $X$ is a nonempty compact set, the feasible area characterized by the constraints (i.e., $P_{X}$) is compact in the topology of weak convergence by Prokhorov’s theorem (see Corollary 6.8 in [55]). By Theorem 3.6 in [1], the KL divergence is lower semi-continuous in the topology of weak convergence. According to Assumption 3.2 that $R_{f}$ is continuous and bounded on $X$, the regularization term is continuous in the topology of weak convergence. Therefore, by the extreme value theorem, the global minimum of $J (p_{g})$ exists in the feasible area.

Note that the optimization problem (11) is convex. To obtain a necessary condition for the global minima, we get the Lagrangian with the equality constraint:

L (p_{g}) := \int_{X} (- ln (p_{g} (x)) p_{d} (x) + λ R_{f} (x) p_{g} (x)) d x + α (\int_{X} p_{g} (x) d x - 1),

(12)

where $α \in R$ . Note that for simplicity, we do not include the inequality constraint, which will be verified shortly. It is easy to check the constraint qualifications for problem (11). By the calculus of variations, a necessary condition for a global minimum of problem (11) is

\frac{δ L}{δ p_{g}} = - \frac{p_{d} (x)}{p_{g} (x)} + λ R_{f} (x) + α = 0,

(13)

which implies that

p_{g}^{*} (x) = \frac{p_{d} (x)}{α + λ R_{f} (x)} .

(14)

We define $A = \{α \in R \mid \frac{p_{d} (x)}{α + λ R_{f} (x)} \geq 0, \forall x \in X\}$. According to Assumption 3.1 that $R_{f}$ is bounded on $X$, we have $A \neq \emptyset$. We define a function $ϕ : A \to R$ as

ϕ (α) = \int_{X} \frac{p_{d} (x)}{α + λ R_{f} (x)} d x .

(15)

It is easy to see $ϕ (α)$ is monotonically decreasing and there is at most one $α^{*} \in A$ such that $ϕ (α^{*}) = 1$ , which finishes the proof together with the existence of the global minimum. ∎
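The normalizing multiplier $α^{*}$ can be found numerically because $ϕ$ is monotonically decreasing. The following minimal sketch solves $ϕ (α^{*}) = 1$ by bisection for uniform data on $[0, 1]$ and the energy $R_{f} (x) = 0.7 x + 0.9$ of the toy example in Appendix B.1. The value $λ = 1$, the midpoint rule, and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
import math

LAMBDA = 1.0  # assumed regularization weight; the text does not fix a value here


def energy(x):
    # Energy function of the toy example in Appendix B.1.
    return 0.7 * x + 0.9


def phi(alpha, num_cells=2000):
    # Midpoint-rule approximation of Eq. (15) with p_d(x) = 1 on [0, 1].
    h = 1.0 / num_cells
    return sum(h / (alpha + LAMBDA * energy((k + 0.5) * h)) for k in range(num_cells))


def solve_alpha(tol=1e-10):
    # phi is decreasing on the feasible set alpha > -LAMBDA * min_x R_f(x), so bisection applies.
    lo = -LAMBDA * energy(0.0) + 1e-9  # just inside the feasible set, where phi > 1
    hi = 10.0                          # large enough that phi(hi) < 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


alpha_star = solve_alpha()

# Closed form for this linear energy, obtained by integrating Eq. (15) exactly.
e = math.exp(0.7 * LAMBDA)
alpha_closed = LAMBDA * (1.6 - 0.9 * e) / (e - 1.0)
print(alpha_star, alpha_closed)
```

For a general energy function only `energy` and the bracketing interval need to change; the closed-form comparison is available here only because the energy is linear.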

A.2 Proof of Theorem 3.3

Proof.

The proof follows the same spirit as that of Theorem 3.2. Ignoring some constant irrelevant to the optimization, we rewrite the optimization problem of our approach with the JS divergence as follows:

\begin{aligned} \min_{p_{g}} \quad & \int_{X} (- \ln (p_{g} (x) + p_{d} (x)) p_{d} (x) - \ln (p_{g} (x) + p_{d} (x)) p_{g} (x) + \ln (p_{g} (x)) p_{g} (x) + λ R_{f} (x) p_{g} (x)) d x, \\ \text{subject to} \quad & \int_{X} p_{g} (x) d x = 1, \\ & \forall x \in X, \; p_{g} (x) \geq 0. \end{aligned}

(16)

For clarity, we denote the functional in problem (16) to be optimized as

	$J (p_{g}) := \int_{X} (- ln (p_{g} (x) + p_{d} (x)) p_{d} (x) - ln (p_{g} (x) + p_{d} (x)) p_{g} (x) + ln (p_{g} (x)) p_{g} (x) +$		(17)
	$λ R_{f} (x) p_{g} (x)) d x .$		(18)

Similarly to the proof of Theorem 3.2, the global minimum of $J (p_{g})$ exists in the feasible area due to Assumption 3.1 and the fact that the JS divergence is lower semi-continuous in the topology of weak convergence.

Notably, the optimization problem (16) is convex due to the convexity of the JS divergence [5]. To obtain a necessary condition for the global minima, we get the Lagrangian with the equality constraint:

	$L (p_{g}) := \int_{X} (- ln (p_{g} (x) + p_{d} (x)) p_{d} (x) - ln (p_{g} (x) + p_{d} (x)) p_{g} (x) + ln (p_{g} (x)) p_{g} (x) +$
	$λ R_{f} (x) p_{g} (x)) d x + α (\int_{X} p_{g} (x) d x - 1),$		(19)

where $α \in R$. Similarly to the proof of Theorem 3.2, a necessary condition for a global minimum of problem (16) is

\frac{δ L}{δ p_{g}} = - ln (p_{g} (x) + p_{d} (x)) + ln (p_{g} (x)) + λ R_{f} (x) + α = 0,

(20)

which implies that

p_{g}^{*} (x) = \frac{p_{d} (x)}{e^{α + λ R_{f} (x)} - 1} .

(21)

Similarly to the proof of Theorem 3.2, there is at most one $α^{*} \in R$ such that $\int_{X} \frac{p_{d} (x)}{e^{α^{*} + λ R_{f} (x)} - 1} d x = 1$, which finishes the proof together with the existence of the global minimum. ∎
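For readers who wish to verify the stationarity condition in Eq. (20), the pointwise derivative of the integrand of the Lagrangian in Eq. (19) can be expanded as follows. This is a sketch in the same notation, with the argument $x$ suppressed.

```latex
\begin{aligned}
\frac{\delta L}{\delta p_g}
&= -\frac{p_d}{p_g + p_d} - \ln(p_g + p_d) - \frac{p_g}{p_g + p_d}
   + \ln(p_g) + 1 + \lambda R_f + \alpha \\
&= -\ln(p_g + p_d) + \ln(p_g) + \lambda R_f + \alpha .
\end{aligned}
% Setting this to zero gives \ln\frac{p_g + p_d}{p_g} = \alpha + \lambda R_f,
% i.e. 1 + \frac{p_d}{p_g} = e^{\alpha + \lambda R_f}, which rearranges to Eq. (21).
```

The two fractions in the first line sum to $-1$ and cancel the $+1$ coming from the derivative of $\ln (p_{g}) p_{g}$, which is why no constant appears in Eq. (20).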

A.3 Convergence with Neural Networks Trained by (Stochastic) Gradient Descent

We establish the convergence of Reg-DGM with over-parameterized neural networks trained by (stochastic) gradient descent, building upon the general framework of [2].

Theorem A.1.

(General convergence guarantee [2]) For an arbitrary Lipschitz-smooth loss function $L$, with probability at least $1 - e^{- Ω (\log m)}$, a ReLU convolutional neural network with width $m$ and depth $l$ trained by gradient descent with an appropriate learning rate satisfies the following (see [2] for additional assumptions on the regularity of the data).

1. If $L$ is non-convex and $σ$-gradient dominant, then GD finds an $ϵ$-error minimizer in $\tilde{O} (\mathrm{poly} (n, l, \log \frac{1}{ϵ}, \frac{1}{σ}))$ iterations as long as $m \geq \tilde{Ω} (\mathrm{poly} (n, l, \frac{1}{σ}))$.
2. If $L$ is non-convex, then SGD finds a point with $\| \nabla f \| \leq ϵ$ in $\tilde{O} (\mathrm{poly} (m, l, \log \frac{1}{ϵ}))$ iterations as long as $m \geq \tilde{Ω} (\mathrm{poly} (n, l, \frac{1}{ϵ}))$.

We assume the following standard smoothness conditions, which can be verified in practice with bounded data and weights.

Assumption A.2.

(Smoothness conditions)

1. $\exists b > 0$ such that $\forall θ \in Θ, \forall x \in X, p (θ; x) \geq b$.
2. $\exists L > 0$ such that $\forall x \in X, \forall θ \in Θ, \forall θ^{'} \in Θ, | p (θ; x) - p (θ^{'}; x) | \leq L \| θ - θ^{'} \|$.
3. $\exists K > 0$ such that $\forall θ \in Θ, \forall θ^{'} \in Θ, \int | p (θ; x) - p (θ^{'}; x) | d x \leq K \| θ - θ^{'} \|$.
4. $\sup_{x \in X} \sup_{y \in X} \| f (x) - f (y) \|^{2} \leq B$.

We consider the general density estimation setting where $L_{MLE} (θ; x_{i}) := - \log p_{θ} (x_{i})$. Note that items 1 and 2 in Assumption A.2 directly imply that $L_{MLE} (\cdot; x)$ is $\frac{L}{b}$-Lipschitz. Formally, given a set of samples $S = \{x_{i}\}_{i = 1}^{n}$, Reg-DGM optimizes

\hat{R} [θ] := \frac{1}{n} \sum_{i = 1}^{n} L_{MLE} (θ; x_{i}) + λ E_{x \sim p_{g}} [R_{f} (x)] .

If $R_{f}$ is independent from the training data $x$ , then the overall loss function is also $\frac{L}{b}$ -Lipschitz and Theorem A.1 directly applies. Otherwise, we can show that the data-dependent regularization used in our experiments is also Lipschitz-smooth. By the linearity of expectation, we have

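	$\hat{θ}_{REG} := \arg\min_{θ} \frac{1}{n} \sum_{i = 1}^{n} L_{MLE} (θ; x_{i}) + λ E_{y \sim p_{θ}} [\frac{1}{n} \sum_{i = 1}^{n} \frac{1}{d} \| f (y) - f (x_{i}) \|_{2}^{2}]$			(22)
	$= \arg\min_{θ} \frac{1}{n} \sum_{i = 1}^{n} [L_{MLE} (θ; x_{i}) + \frac{λ}{d} E_{y \sim p_{θ}} \| f (y) - f (x_{i}) \|_{2}^{2}] .$			(23)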

We define $L_{REG} (θ; x_{i}) := \frac{λ}{d} E_{y \sim p_{θ}} \| f (y) - f (x_{i}) \|_{2}^{2}$, which is Lipschitz-smooth:

	$| L_{REG} (θ; x) - L_{REG} (θ^{'}; x) | =$	$\frac{λ}{d} | E_{y \sim p_{θ}} \| f (y) - f (x) \|_{2}^{2} - E_{y \sim p_{θ^{'}}} \| f (y) - f (x) \|_{2}^{2} |$
	$\leq$	$\frac{λ}{d} \int | p_{θ} (y) - p_{θ^{'}} (y) | \, \| f (x) - f (y) \|^{2} d y$
	$\leq$	$\frac{λ}{d} (\sup_{y \in X} \| f (x) - f (y) \|^{2}) \int | p_{θ} (y) - p_{θ^{'}} (y) | d y$
	$\leq$	$\frac{λ B K}{d} \| θ - θ^{'} \| .$

Therefore, Theorem A.1 applies to Reg-DGM with the data-dependent energy defined in Sec. 4 of the main text.
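The data-dependent term $\frac{1}{n} \sum_{i} L_{REG} (θ; x_{i})$ can be estimated by Monte Carlo from features of generated and real samples. The following is a minimal NumPy sketch under stated assumptions: the function name `reg_dgm_penalty`, the random `tanh` feature map standing in for the frozen pre-trained classifier, and the toy arrays are all hypothetical and are not taken from the authors' code.

```python
import numpy as np


def reg_dgm_penalty(fake_feats, real_feats, lam):
    """Monte Carlo estimate of (1/n) sum_i L_REG(theta; x_i).

    fake_feats: array (k, d), features f(y) of k samples y ~ p_theta.
    real_feats: array (n, d), features f(x_i) of the n training points.
    """
    d = fake_feats.shape[1]
    # Pairwise squared distances ||f(y_j) - f(x_i)||^2, shape (k, n).
    diff = fake_feats[:, None, :] - real_feats[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return lam / d * sq_dist.mean()


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))  # hypothetical stand-in for a frozen feature extractor


def features(x):
    # Bounded features, so that item 4 of Assumption A.2 holds with B = 4 * d = 16.
    return np.tanh(x @ W)


real = rng.normal(size=(5, 8))    # toy "training data"
fake = rng.normal(size=(16, 8))   # toy "generated samples"
penalty = reg_dgm_penalty(features(fake), features(real), lam=0.1)
print(penalty)
```

Because the stand-in features lie in $[-1, 1]^{d}$, the estimate is bounded by $\frac{λ B}{d}$, which mirrors the role of the constant $B$ in the Lipschitz bound above.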

A.4 Proof of Lemma 3.5

Proof.

In the Gaussian fitting example, the optimization problem can be written as

$\hat{μ}_{REG}$	$= \arg\min_{μ} - \frac{1}{m} \sum_{i = 1}^{m} \log N (μ, σ^{2}) + λ E_{x \sim N (μ, σ^{2})} [- \log p_{f} (x)]$	(24)
	$= \arg\min_{μ} \frac{1}{m} \sum_{i = 1}^{m} \frac{(μ - x_{i})^{2}}{2 σ^{2}} + λ (\frac{(μ - \hat{μ}_{PRE})^{2}}{2 σ^{2}} + \frac{1}{2} \log (2 π σ^{2}) + \frac{1}{2})$	(25)
	$= \arg\min_{μ} \frac{1}{m} \sum_{i = 1}^{m} \frac{(μ - x_{i})^{2}}{2 σ^{2}} + λ \frac{(μ - \hat{μ}_{PRE})^{2}}{2 σ^{2}},$	(26)

where the first equality holds by the definition and properties of Gaussian [6] and the second equality holds by omitting a constant irrelevant to the optimization. It is easy to solve the above quadratic programming problem analytically:

\hat{μ}_{REG} = \frac{1}{m (1 + λ)} \sum_{i = 1}^{m} x_{i} + \frac{λ}{1 + λ} \hat{μ}_{PRE} = \frac{1}{1 + λ} \hat{μ}_{MLE} + \frac{λ}{1 + λ} \hat{μ}_{PRE},

(27)

where $\hat{μ}_{MLE} \sim N (μ^{*}, \frac{σ^{2}}{m})$. $\hat{μ}_{REG}$ is obtained by an affine transformation of a Gaussian random variable and thus is also Gaussian distributed as follows:

\hat{μ}_{REG} \sim N (\frac{1}{1 + λ} μ^{*} + \frac{λ}{1 + λ} \hat{μ}_{PRE}, \frac{σ^{2}}{m (1 + λ)^{2}}) .

(28)

∎
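The closed form in Eq. (27) can be sanity-checked against the objective in Eq. (26). The following minimal sketch draws a toy sample and confirms that small perturbations of the closed-form solution do not decrease the objective. The sample, the seed, and the values of $λ$ and $\hat{μ}_{PRE}$ are illustrative assumptions.

```python
import random


def mu_reg_closed_form(xs, mu_pre, lam):
    # Eq. (27): convex combination of the sample mean and the pre-trained mean.
    mu_mle = sum(xs) / len(xs)
    return mu_mle / (1.0 + lam) + lam / (1.0 + lam) * mu_pre


def objective(mu, xs, mu_pre, lam, sigma2=1.0):
    # Objective of Eq. (26).
    data_term = sum((mu - x) ** 2 for x in xs) / (2.0 * sigma2 * len(xs))
    return data_term + lam * (mu - mu_pre) ** 2 / (2.0 * sigma2)


random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(150)]  # toy sample with mu* = 0, sigma^2 = 1
mu_pre, lam = 0.1, 0.5                             # illustrative values
mu_hat = mu_reg_closed_form(xs, mu_pre, lam)

# The closed form should not be beaten by nearby candidates.
candidates = [mu_hat + delta for delta in (-1e-2, -1e-3, 1e-3, 1e-2)]
is_minimizer = all(
    objective(mu_hat, xs, mu_pre, lam) <= objective(c, xs, mu_pre, lam) for c in candidates
)
print(mu_hat, is_minimizer)
```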

A.5 Proof of Theorem 3.6

Proof.

In the Gaussian-fitting example, we have

\hat{μ}_{MLE} \sim N (μ^{*}, \frac{σ^{2}}{m}),

(29)

and

\hat{μ}_{REG} \sim N (\frac{1}{1 + λ} μ^{*} + \frac{λ}{1 + λ} \hat{μ}_{PRE}, \frac{σ^{2}}{m (1 + λ)^{2}}) .

(30)

According to the bias-variance decomposition of the MSE, we have

MSE [\hat{μ}_{MLE}] = \frac{σ^{2}}{m},

(31)

and

MSE [\hat{μ}_{REG}] = \frac{λ^{2}}{(1 + λ)^{2}} (\hat{μ}_{PRE} - μ^{*})^{2} + \frac{σ^{2}}{m (1 + λ)^{2}} .

(32)

Let $β = \frac{λ}{1 + λ} \in (0, 1)$ be the normalized weight of the regularization term. Then, we can rewrite $MSE [\hat{μ}_{REG}]$ as

MSE [\hat{μ}_{REG}] = β^{2} (\hat{μ}_{PRE} - μ^{*})^{2} + (1 - β)^{2} \frac{σ^{2}}{m} .

(33)

To satisfy $MSE [\hat{μ}_{REG}] < MSE [\hat{μ}_{MLE}]$, we have

β^{2} (\hat{μ}_{PRE} - μ^{*})^{2} + (1 - β)^{2} \frac{σ^{2}}{m} < \frac{σ^{2}}{m} \Rightarrow β < \frac{2 σ^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}} .

(34)

The pre-trained model $\hat{μ}_{PRE}$ is another meaningful baseline, which has a bias of $\hat{μ}_{PRE} - μ^{*}$ and zero variance. Its MSE is given by

MSE [\hat{μ}_{PRE}] = (\hat{μ}_{PRE} - μ^{*})^{2} .

(35)

To satisfy $MSE [\hat{μ}_{REG}] < MSE [\hat{μ}_{PRE}]$, we have

β^{2} (\hat{μ}_{PRE} - μ^{*})^{2} + (1 - β)^{2} \frac{σ^{2}}{m} < (\hat{μ}_{PRE} - μ^{*})^{2} \Rightarrow β > \frac{σ^{2} - m (\hat{μ}_{PRE} - μ^{*})^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}} .

(36)

We now compute the optimal $MSE [\hat{μ}_{REG}]$. It is easy to see that the quadratic programming problem in Eq. (33) achieves its minimum at $β^{*} = \frac{σ^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}}$ with the corresponding $λ^{*} = \frac{σ^{2}}{m (\hat{μ}_{PRE} - μ^{*})^{2}}$. The minimum value is $\frac{σ^{2} (\hat{μ}_{PRE} - μ^{*})^{2}}{σ^{2} + m (\hat{μ}_{PRE} - μ^{*})^{2}} = \frac{MSE [\hat{μ}_{MLE}] MSE [\hat{μ}_{PRE}]}{MSE [\hat{μ}_{MLE}] + MSE [\hat{μ}_{PRE}]}$, which completes the proof. ∎
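The thresholds and the optimum derived above can be evaluated numerically. The following minimal sketch plugs in the default values stated for the Gaussian-fitting experiments in Appendix B.1 ($σ^{2} = 1$, $m = 150$, $\hat{μ}_{PRE} - μ^{*} = 0.1$); the function and variable names are illustrative.

```python
def mse_reg(beta, sigma2, m, delta):
    # Eq. (33), with delta = mu_PRE - mu*.
    return beta ** 2 * delta ** 2 + (1.0 - beta) ** 2 * sigma2 / m


sigma2, m, delta = 1.0, 150, 0.1   # defaults of the Gaussian-fitting experiments

mse_mle = sigma2 / m               # Eq. (31)
mse_pre = delta ** 2               # Eq. (35)

beta_upper = 2.0 * sigma2 / (sigma2 + m * delta ** 2)               # bound in Eq. (34)
beta_lower = (sigma2 - m * delta ** 2) / (sigma2 + m * delta ** 2)  # bound in Eq. (36)
beta_star = sigma2 / (sigma2 + m * delta ** 2)                      # minimizer of Eq. (33)
lambda_star = beta_star / (1.0 - beta_star)   # inverts beta = lambda / (1 + lambda)
mse_star = mse_reg(beta_star, sigma2, m, delta)

print(beta_lower, beta_star, beta_upper, lambda_star, mse_star)
```

With these defaults the lower bound on $β$ is negative, so only the upper bound from Eq. (34) is active, and the optimal MSE is smaller than that of both the MLE and the pre-trained baseline.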

A.6 Proof of Corollary 3.6.1

Proof.

In the Gaussian-fitting example, the expected risk for a hypothesis $\hat{μ}$ is

$R (\hat{μ})$	$= E_{S \sim p_{d}^{m}} E_{x \sim N (μ^{*}, σ^{2})} [- \log N (\hat{μ}, σ^{2})]$	(37)
	$= E_{S \sim p_{d}^{m}} [\frac{(\hat{μ} - μ^{*})^{2}}{2 σ^{2}} + \frac{1}{2} \log (2 π σ^{2}) + \frac{1}{2}]$	(38)
	$= \frac{1}{2 σ^{2}} MSE [\hat{μ}] + \frac{1}{2} \log (2 π σ^{2}) + \frac{1}{2},$	(39)

which completes the proof together with Theorem 3.6. ∎

Appendix B Experimental Details

Our implementation is built upon some publicly available code. Below, we include the links and please refer to the licenses therein.

B.1 Toy data

In the toy example for optimization analyses, the data follows a uniform distribution over $[0, 1]$ . The energy function is defined as $R_{f} (x) = 0.7 x + 0.9$ . The optimal $β^{*}$ is estimated by numerical integration.

In the experiments for the Gaussian-fitting example, we set $σ^{2} = 1$ , $m = 150$ , and ${^μ}_{PRE} - μ^{*} = 0.1$ by default.

B.2 GAN with Limited Data

Datasets. In our experiments, we use FFHQ [29], which consists of $70,000$ human face images of resolution $256 \times 256$, LSUN CAT [62], which consists of $200,000$ cat face images of resolution $256 \times 256$, and CIFAR-10 [33], which consists of $50,000$ natural images of resolution $32 \times 32$. Specifically, we split training subsets of size $1$k and $5$k on FFHQ and LSUN CAT in the same way as [28]. In all experiments, we do not use x-flips to amplify the training data except when combining with APA.

Metrics. To quantitatively evaluate the experimental results, we choose the Fréchet inception distance (FID) [20] as our metric. We compute the FID between $50,000$ generated images and all real images instead of the training subsets [20]. Following [28], we report the median FID on FFHQ and LSUN CAT and the mean FID with standard deviation on CIFAR-$10$ out of 3 runs. We record the best FID during training in each run as in [28].

Base DGM. We use the lighter-weight StyleGAN2 as the backbone for FFHQ and LSUN CAT and the tuned StyleGAN2 as the backbone for CIFAR-10, following [28]. Compared to the official StyleGAN2, the lighter-weight StyleGAN2 achieves the same performance at a lower computing cost on FFHQ and LSUN CAT, and the tuned StyleGAN2 is more suitable for CIFAR-10 [28]. Our implementation is based on the official code of ADA (https://github.com/NVlabs/stylegan2-ada-pytorch).

Pre-trained model. We choose the ResNet-18 (https://pytorch.org/vision/stable/models.html) trained on the ImageNet dataset as the pre-trained model. We normalize both the real and fake images based on the mean and standard deviation of the training data and then feed them into the classifier. We extract the features of the last fully connected layer of the pre-trained model, which is frozen during the training of the generative models. On CIFAR-10, we interpolate (https://pytorch.org/docs/stable/generated/torch.nn.functional.interpolate.html) both the fake and real images to a resolution of $256 \times 256$ after normalization.

Hyperparameters. Some hyperparameters are shown in Tab. 3. The weighting hyperparameter $λ$ controls the strength of our regularization term. We choose $λ$ by performing a grid search over $[4, 1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001, 0.00005, 0.00001, 0.000005]$ according to FID, following prior work [30, 28]. Other hyperparameters follow the settings of [28].

Computing amount. All experiments are completed on 8 NVIDIA 2080Ti GPUs. A single run of our method with ADA takes $1$ day $6$ hours $19$ minutes on FFHQ or LSUN CAT and $2$ days $17$ hours $10$ minutes on CIFAR-10.

Parameter	FFHQ- $1$ k	FFHQ- $5$ k	LSUN CAT- $1$ k	LSUN CAT- $5$ k	CIFAR-10
Base DGM	StyleGAN2 [30]	StyleGAN2 [30]	StyleGAN2 [30]	StyleGAN2 [30]	StyleGAN2 [30]
$λ$	$4$	$1$	$4$	$1$	$1 \times 10^{- 5}$
Number of GPUs	$4$	$4$	$8$	$8$	$8$
Training length	$5$ M	$5$ M	$5$ M	$5$ M	$25$ M
Minibatch size	$64$	$64$	$64$	$64$	$64$
Base DGM	ADA [28]	ADA [28]	ADA [28]	ADA [28]	ADA [28]
$λ$	$0.01$	$0.001$	$0.01$	$0.0005$	$5 \times 10^{- 6}$
Number of GPUs	$8$	$8$	$8$	$8$	$8$
Training length	$16$ M	$16$ M	$16$ M	$16$ M	$60$ M
Minibatch size	$64$	$64$	$64$	$64$	$64$
Base DGM	ADA+APA [26]	ADA+APA [26]	ADA+APA [26]	ADA+APA [26]	ADA+APA [26]
$λ$	$0.01$	$0.01$	$0.005$	$0.001$	$0.1$
Number of GPUs	$8$	$8$	$8$	$8$	$8$
Training length	$25$ M	$25$ M	$25$ M	$25$ M	$100$ M
Minibatch size	$64$	$64$	$64$	$64$	$64$

Table 3: Hyperparameters in the experiments of GAN. Reg-DGM shares the same hyperparameters as the base DGM if not specified. All models converge with the corresponding training length.

Appendix C Additional Results

C.1 Learning Curves

We show the learning curves of GANs on FFHQ, LSUN CAT and CIFAR-10 in Fig. 4, Fig. 5 and Fig. 6, respectively. Reg-DGM consistently improves both baselines in all settings, and the improvements increase as the amount of training data decreases. Moreover, the curves of Reg-DGM generally fluctuate less, which is consistent with our theory that the regularization reduces the variance of the baselines. One exception is Fig. 6 (a), where Reg-DGM is more unstable than the baseline because of a bad random initialization. We also note that the instability of Reg-ADA in Fig. 4 (b) is inherited from ADA.

[Figure: learning curves. Panel (a): StyleGAN2 and Reg-StyleGAN2. Panel (b): ADA and Reg-ADA.]

C.2 Standard Deviation on FFHQ and LSUN CAT

As shown in Tab.4, we also provide the mean FID and standard deviation on FFHQ and LSUN CAT. Reg-DGM can reduce the mean FID significantly (compared to the measurement variance) and achieve a similar if not smaller standard deviation.

Method	FFHQ		LSUN CAT
Method	$1$ k	$5$ k	$1$ k	$5$ k
StyleGAN2 [30]	$102.62 \pm 5.67$	$53.37 \pm 1.92$	$189.57 \pm 8.13$	$110.83 \pm 6.85$
Reg-StyleGAN2 (ours)	$77.80 \pm 3.65$	$38.14 \pm 0.97$	$112.15 \pm 8.48$	$64.11 \pm 2.51$
ADA [28]	$22.10 \pm 0.50$	$12.72 \pm 0.13$	$41.59 \pm 1.71$	$16.77 \pm 0.74$
Reg-ADA (ours)	$20.16 \pm 0.22$	$11.88 \pm 0.13$	$36.85 \pm 1.09$	$15.85 \pm 0.10$

Table 4: The mean FID and standard deviation on FFHQ and LSUN CAT, supplementing the reported median FID.

C.3 More Samples of GAN

Fig.7 and Fig.8 respectively show the samples randomly generated by models with best FID trained on FFHQ-5k and CIFAR-10, using slight truncation as in [28]. Our regularization can help base DGMs achieve better or comparable image quality.

[Figure: generated samples. Panel (a): StyleGAN2 (FID $51.41$). Other panel: Reg-StyleGAN2 (FID …).]

C.4 Ablation of Classifier and Results under More Evaluation Metrics

In this section, to better understand the influence of classifiers on our method, we try to use pre-trained classifiers with different regularization terms, layers, and backbones. We retain the same experimental setting as in Tab.3.

Regularization form. We first investigate the feature matching objective [48] as an alternative regularization term. Formally, it computes the squared $l_{2}$-norm of the difference between the expected features of real and fake samples from one layer of a feature extractor $f$, which can be represented as follows:

$$\left\| E_{x' \sim p_{d}} [f(x')] - E_{x \sim p_{g}} [f(x)] \right\|_{2}^{2}. \qquad (40)$$

Note that the feature matching objective cannot be rewritten as an expectation over $p_{g}$ and thus cannot be understood as an energy function. As before, we adopt the last fully connected layer of ResNet-18, and the results of StyleGAN2 regularized by Eq. (40) are shown in Tab. 5. Feature matching greatly reduces the FID of StyleGAN2 but does not improve the visual quality of the samples, as shown in Fig. 9.
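
As a rough illustration of Eq. (40), the sketch below estimates both expectations by batch means and returns the squared distance between them. It works on plain Python lists of feature vectors. The function names and the toy features are hypothetical, and in the experiments the features would come from the frozen ResNet-18 rather than being written by hand.

```python
def mean_feature(features):
    """Coordinate-wise mean of a batch of feature vectors."""
    n = len(features)
    dim = len(features[0])
    return [sum(f[j] for f in features) / n for j in range(dim)]

def feature_matching_loss(real_features, fake_features):
    """Squared l2 distance between the mean real and mean fake features."""
    mu_real = mean_feature(real_features)
    mu_fake = mean_feature(fake_features)
    return sum((a - b) ** 2 for a, b in zip(mu_real, mu_fake))

# Toy features (hypothetical values).
real = [[1.0, 2.0], [3.0, 4.0]]      # batch mean is [2, 3]
fake = [[0.0, 0.0], [2.0, 2.0]]      # batch mean is [1, 1]
collapsed = [[2.0, 3.0], [2.0, 3.0]] # identical samples, batch mean is [2, 3]

loss = feature_matching_loss(real, fake)
loss_collapsed = feature_matching_loss(real, collapsed)
```

In this toy case the loss vanishes for the `collapsed` batch even though its samples are identical, because the objective only compares batch means.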

Then, we evaluate the entropy-minimization regularization [17] as follows:

$$- H(\mathrm{softmax}(f(x))), \qquad (41)$$

where $f(x)$ outputs the logits for the prediction distribution. As shown in Table 6, the entropy regularization does not improve over the baseline for any $λ$ in a small search space: the FID is close to the baseline level only at the smallest value $λ = 0.001$ ($53.37$) and degrades as $λ$ grows. This indicates the importance of the data dependency in the energy function.
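
For concreteness, the entropy $H(\mathrm{softmax}(f(x)))$ appearing in Eq. (41) can be computed from a logit vector as in the sketch below. The logits are hypothetical values. How the term enters the training loss, including its sign, is taken from Eq. (41) and is not reproduced here.

```python
import math

def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits: an uncertain prediction and a confident one.
uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
peaked_entropy = entropy(softmax([10.0, 0.0, 0.0, 0.0]))
```

The computation takes only the generated sample's logits as input and never touches the training data, which is the data independence that the comparison in Table 6 is about.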

Method	FFHQ		LSUN CAT
Method	$1$ k	$5$ k	$1$ k	$5$ k
StyleGAN2 [30]	$103.66$	$52.71$	$186.55$	$115.16$
Reg-StyleGAN2 (ours)	$59.96$	$32.65$	$66.46$	$47.56$

Table 5: The median FID on FFHQ and LSUN CAT with feature matching as the regularization. $λ$ is simply set to the same values as in Tab. 3.

$λ$	FID $↓$
$1$	$145.55$
$0.5$	$111.288$
$0.1$	$92.11$
$0.001$	$53.37$

Table 6: FID on FFHQ- $5$ k for the entropy-minimization regularization.

Figure 9: Samples generated for FFHQ- $5$ k using feature matching (FID $32.65$ ), untruncated.

Layers in $f$ . To explore the different layers of a pre-trained model, we retrain GANs separately using the first convolution layer of ResNet-18 and the last layers of four modules in ResNet-18. As shown in Tab.7, our method with features of diverse single layers can all improve the baseline StyleGAN2, and the last layer of ResNet-18 is most beneficial for our regularization strategy.

Layer index	(Baseline)	$1$	$17$	$33$	$49$	$65$	$- 1$ (by default)
FID $↓$	$52.71$	$46.39$	$47.65$	$51.61$	$45.24$	$46.01$	$37.77$

Table 7: Results with different layers on FFHQ- $5$ k. The “$-1$ layer” represents the last layer (i.e., our default setting). Note that all layers are indexed by the function named_modules in PyTorch. $λ$ is simply set to the same values as in Tab. 3.

Backbone of $f$ . We employ ResNet-50 and ResNet-101 as the feature extractor to explore the effect of different backbones on Reg-DGM. Tab.8 shows the results on FFHQ- $5$ k. Even using the default $λ$ without tuning, Reg-DGM with ResNet-50 and ResNet-101 can achieve a similar FID to that of our default setting and outperform the baseline. We believe that Reg-DGM with ResNet-50 and ResNet-101 can get better results if we finely search the hyperparameter $λ$ .

Backbone	(Baseline)	ResNet-18	ResNet-50	ResNet-101
FID $↓$	$52.71$	$37.77$	$40.95$	$42.63$

Table 8: Results with different backbones on FFHQ- $5$ k. $λ$ is simply set to the same values as in Tab. 3.

Monte Carlo estimate of $f$. We evaluate Reg-StyleGAN2 with 8, 16, 32, and 64 (the default batch size) samples to estimate the energy function in Eq. (10) of the main text, as shown in Table 9. We do not observe a significant improvement from increasing the number of samples. For instance, when $λ = 1$, the estimate with 64 samples achieves an FID of $38.10$, which is similar to the $37.77$ of the single-sample estimate. Intuitively, the features of faces are likely concentrated in a small region of the feature space of $f$, which discriminates them from other classes of natural images such as cars, making the variance of the estimate negligible for the training of the generative model.

MC	1 (by default)	8	16	32	64
$λ = 1$	$37.77$	$40.97$	$40.78$	$37.54$	$38.10$
$λ = 10$	$53.09$	$53.36$	$58.58$	$52.93$	$48.21$

Table 9: FID on FFHQ- $5$ k for different numbers of samples in MC.
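
The effect of the number of Monte Carlo samples can be illustrated with a toy sketch. It assumes that the energy of a generated sample averages the per-dimension squared feature distance $\frac{1}{d} \| f(y) - f(x) \|_{2}^{2}$ over training samples $x$, which is how $L_{REG}$ is written later in this appendix. The exact form of Eq. (10) should be taken from the main text. All features below are synthetic and all names are hypothetical.

```python
import random

def sq_dist_per_dim(u, v):
    """Squared l2 distance between two feature vectors, divided by d."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)

def mc_energy(fake_feature, real_features, n_samples, rng):
    """Estimate the energy with n_samples randomly drawn real features."""
    picks = [rng.choice(real_features) for _ in range(n_samples)]
    return sum(sq_dist_per_dim(fake_feature, r) for r in picks) / n_samples

rng = random.Random(0)
# Synthetic real features concentrated in a small region (assumption).
real_features = [[rng.gauss(0.0, 0.1) for _ in range(8)] for _ in range(64)]
fake_feature = [0.5] * 8

# Average over the full synthetic training set, used as the reference value.
exact = sum(sq_dist_per_dim(fake_feature, r) for r in real_features) / len(real_features)
single = mc_energy(fake_feature, real_features, 1, rng)
many = mc_energy(fake_feature, real_features, 64, rng)
print(f"exact {exact:.3f}, 1 sample {single:.3f}, 64 samples {many:.3f}")
```

When the real features are tightly clustered, as in this synthetic setup, drawing more samples changes the estimate only slightly.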

Changing the training data of $f$. We train Reg-DGM with the image encoder of the CLIP model [44] (an architecture very similar to ResNet-50), which is pre-trained on large-scale noisy text-image pairs instead of ImageNet, and with the face classifier Inception-ResNet-v1 (https://github.com/timesler/facenet-pytorch) of FaceNet [49], which is pre-trained on the large-scale face dataset VGGFace2 [8]. Their FID and KID results are shown in Tab. 10 and Tab. 11, respectively. Without heavy tuning of the hyperparameter, Reg-DGM shows consistent improvement over the two baselines under both the FID and KID metrics. The results with the CLIP model and the FaceNet model rule out the potential bias caused by the ImageNet pre-trained $f$.

Method	FID $↓$	KID $\times 10^{3} ↓$	$λ$
StyleGAN2	$52.71$	$39.52$
Reg-StyleGAN2	$40.98$	$27.56$	$50.0$
ADA	$12.64$	$5.17$
Reg-ADA	$11.09$	$3.91$	$1.0$

Table 10: Results on FFHQ- $5$ k using a pre-trained CLIP with the ResNet50 image encoder.

Method	FID $↓$	KID $\times 10^{3} ↓$	$λ$
StyleGAN2	$52.71$	$39.52$
Reg-StyleGAN2	$38.80$	$23.38$	$4.0$
ADA	$12.64$	$5.17$
Reg-ADA	$11.74$	$3.94$	$0.01$

Table 11: Results on FFHQ- $5$ k using a pre-trained FaceNet with the face classifier Inception-ResNet-v1.

More evaluation metrics. For a comprehensive comparison of Reg-DGM and the baselines, we also evaluate the models under KID in Table 12. The conclusion remains the same as under FID. Namely, Reg-DGM consistently improves the baselines in all settings.

Method	FFHQ		LSUN CAT		CIFAR- $10$
Method	$1$ k	$5$ k	$1$ k	$5$ k	$50$ k
StyleGAN2	$98.06$	$39.52$	$161.95$	$100.57$	$3.66 \pm 0.07$
Reg-StyleGAN2 (ours)	$47.91$	$23.06$	$83.71$	$42.68$	$2.89 \pm 0.11$
ADA	$9.77$	$5.17$	$23.30$	$8.13$	$0.90 \pm 0.12$
Reg-ADA (ours)	$9.38$	$4.30$	$20.52$	$6.54$	$0.83 \pm 0.06$

Table 12: KID $\times 10^{3}$ on FFHQ, LSUN-CAT, and CIFAR- $10$

Results with $100$ samples. We present experiments on FFHQ- $100$ to compare Reg-DGM with the two most direct competitors StyleGAN2 and ADA. We randomly select 100 images for training. The results are shown in Table 13. Again, Reg-DGM consistently improves StyleGAN2 and ADA. For instance, Reg-ADA achieves an FID of 63.53, which is better than the $73.70$ of ADA.

Method	FID $↓$	$λ$
StyleGAN2	$132.68$
Reg-StyleGAN2	$113.96$	$1.0$
ADA	$73.70$
Reg-ADA	$65.35$	$0.01$

Table 13: FID on FFHQ- $100$

$$
\begin{aligned}
\left| L_{REG}(θ; x) - L_{REG}(θ'; x) \right| &= \frac{λ}{d} \left| E_{y \sim p_{θ}} \| f(y) - f(x) \|_{2}^{2} - E_{y \sim p_{θ'}} \| f(y) - f(x) \|_{2}^{2} \right| \\
&\leq \frac{λ}{d} \int \left| p_{θ}(y) - p_{θ'}(y) \right| \, \| f(x) - f(y) \|_{2}^{2} \, d y \\
&\leq \frac{λ}{d} \left( \sup_{y \in X} \| f(x) - f(y) \|_{2}^{2} \right) \int \left| p_{θ}(y) - p_{θ'}(y) \right| d y \\
&\leq \frac{λ B K}{d} \| θ - θ' \| .
\end{aligned}
$$