Feature Alignment by Uncertainty and Self-Training
for Source-Free Unsupervised Domain Adaptation

JoonHo Lee, Gyemin Lee
Machine Learning Research Center, Samsung SDS Technology Research, Republic of Korea
Department of Electronic and IT Media Engineering, Seoul National University of Science and Technology, Republic of Korea

Abstract

Most unsupervised domain adaptation (UDA) methods assume that labeled source images are available during model adaptation. However, this assumption is often infeasible owing to confidentiality issues or memory constraints on mobile devices. To address these problems, we propose a simple yet effective source-free UDA method that uses only a pre-trained source model and unlabeled target images. Our method captures the aleatoric uncertainty by incorporating data augmentation and trains the feature generator with two consistency objectives. The feature generator is encouraged to learn consistent visual features away from the decision boundaries of the head classifier. Inspired by self-supervised learning, our method promotes inter-space alignment between the prediction space and the feature space while incorporating intra-space consistency within the feature space to reduce the domain gap between the source and target domains. We also consider epistemic uncertainty to boost the model adaptation performance. Extensive experiments on popular UDA benchmarks demonstrate that the performance of our approach is comparable or even superior to vanilla UDA methods without using source images or network modifications.

Keywords: unsupervised domain adaptation, source-free domain adaptation, uncertainty, self-training, image classification, multi-source domain adaptation


1 Introduction

Deep neural networks have enabled major breakthroughs on various visual recognition tasks. Much of their success is attributable to large well-annotated data. However, collecting large amounts of labeled data is time-consuming and costly. A solution is to transfer knowledge from a label-rich source domain to a label-scarce target domain. However, this process is hindered by the difference between the source data distribution and the target data distribution. Domain adaptation methods are proposed to tackle this domain shift problem.

Unsupervised domain adaptation (UDA) considers a form of domain adaptation where only unlabeled images are available in the target domain. Many existing UDA methods attempt to reduce the domain shift by minimizing the difference between the source and the target feature distributions Long et al. (2015); Ganin et al. (2016); Tzeng et al. (2017); Xu et al. (2019); Zhang et al. (2019); Lee et al. (2019). These methods assume that labeled images are available in the source domain and take advantage of the source images during model adaptation.

However, this assumption is infeasible in many cases. For example, transferring medical records containing private information is strictly prohibited. In many institutions, business-sensitive data remain confidential. Memory constraints pose another issue, especially considering the increased demand for mobile services integrating on-device DNN applications. The limited memory of a mobile device does not allow the storage of large-scale source data such as ImageNet data Deng et al. (2009).

Figure 1: Schematic diagram of the proposed method (FAUST). To achieve source-free UDA, we fix the head classifier $H$ trained on source images and train the feature generator $G$ . We consider aleatoric (data) and epistemic (model) uncertainties to push target visual features away from the decision boundary of $H$ . Shading around each sample represents aleatoric uncertainty and shading near the decision boundary represents epistemic uncertainty.

To address such limitations, we propose a novel source-free UDA method referred to as Feature Alignment by Uncertainty and Self-Training (FAUST). We assume that only a source model is available, whereas source images are unavailable. As illustrated in Fig. 1, the proposed method freezes the head classifier from the source model and adapts the feature generator to the target images. Our key idea is to consider aleatoric uncertainty while training the feature generator to align the features between both domains. Aleatoric uncertainty is characterized by noise inherent in images Kiureghian and Ditlevsen (2009). By considering the effects of noisy images, our feature generator learns to generate visual features away from the class decision boundaries. Thus, the decision boundaries effectively reside in the low-density regions in the feature space.

To capture the aleatoric uncertainty, we propose two consistency objectives by incorporating multiple perturbed views of the same image. Motivated by self-training methods Caron et al. (2018, 2020), we constrain different views of the same image to have similar embeddings by imposing intra-space consistency in the feature space. However, the domain gap can still cause incorrect predictions of these consistent images. Thus, we also enforce inter-space consistency between the feature space and the prediction space. For this purpose, we present a feature-based pseudo-labeling strategy that selectively uses source-similar target image features. As the head classifier from the source model encodes the source distribution, we leverage this information to identify source-similar target images. By pseudo-labeling a target image according to the source distribution and by comparing this pseudo-label against predictions from multiple views of the same image, our inter-space consistency encourages the target features to align with the source feature distribution.

We also consider epistemic uncertainty for source-free UDA. Whereas aleatoric uncertainty is more effective, epistemic uncertainty has proven to be useful for UDA Lee and Lee (2020); Lee et al. (2019). Along with the proposed two consistency losses capturing aleatoric uncertainty, FAUST incorporates the epistemic uncertainty loss based on Monte Carlo dropout sampling.

Our extensive experiments and analysis confirm that FAUST is comparable or superior to recent UDA methods, even without any use of source images.

2 Related Work

Unsupervised Domain Adaptation. Given labeled source images and unlabeled target images, the objective of UDA is to address the domain shift between the source domain and the target domain. Earlier works align the feature distributions from both domains either by matching their statistical moments Long et al. (2015); Zhang et al. (2019); Xu et al. (2019) or by deploying a domain discriminator in an adversarial manner Ganin et al. (2016); Tzeng et al. (2017); Zhang et al. (2019); Lee et al. (2019); Zou et al. (2019). Some recent works use stochastic predictions to make the model target-discriminative Saito et al. (2018b); Shu et al. (2018); Kim et al. (2019); Lu et al. (2020); Lee et al. (2019); Lee and Lee (2020) or leverage the optimal transport principle Damodaran et al. (2018); Xu et al. (2020). Semantic augmentation has also been used Li et al. (2021). These methods assume that source images are available during the knowledge transfer process. This assumption does not hold when the source data is private or confidential, which gives rise to the more difficult source-free setting.

Source-free UDA. Lately, several methods have investigated source-free UDA. 3C-GAN Li et al. (2020) produces target-style samples using a GAN, which collaborates with the source model during adaptation. SHOT Liang et al. (2020) freezes the source classifier and trains the feature encoder by means of pseudo-labeling and mutual information maximization. SDDA Kurmi et al. (2021) models a GAN combined with a gradient reversal layer to produce domain-invariant features. SoFA Yeh et al. (2021) induces a reference distribution using the source model to extract class semantic features. SFDA Kim et al. (2021) uses pseudo-labels based on low-entropy samples and the point-to-set distance.

Multi-Source DA. Multi-source DA (MSDA) problems have also been explored Peng et al. (2019); Zhou et al. (2021). Similarly to vanilla UDA, MSDA assumes one unlabeled target domain. However, MSDA exploits more than one labeled source domain to promote better target performance, outperforming the simple source-combine strategy. M $^{3}$ SDA Peng et al. (2019) aligns the moments of source feature distributions with each other and to the target domain. DAEL Zhou et al. (2021) leverages complementary information in multiple classifiers, where each classifier is trained on each domain.

Self-Supervised Learning. The goal of the proposed method is to learn effective visual features without human annotated labels. Typical self-supervised methods design pretext tasks to provide automated supervision, where both the inputs and labels are derived from an unlabeled dataset Bertinetto et al. (2018); Noroozi and Favaro (2016); Gidaris et al. (2018). Many recent contrastive methods build upon the contrastive loss and a set of image transforms Chen et al. (2020); He et al. (2020); Grill et al. (2020). They learn to map different views of the same image nearby and views from different images apart. Clustering-based methods Caron et al. (2018) iteratively learn features by clustering images and predicting their cluster assignments. SwAV Caron et al. (2020) incorporates clustering into contrastive learning by comparing the assignment from one view and predicting it from another view. Our pseudo-labeling strategy in the inter-space consistency loss is inspired in part by Caron et al. (2020). However, rather than using simple $k$ -means clustering, our approach uses confident features.

Uncertainty. The predictive uncertainties are generally grouped as epistemic or aleatoric Kendall and Gal (2017). Epistemic uncertainty, known as model uncertainty, can be estimated by means of Monte Carlo dropout sampling Gal and Ghahramani (2016). Among UDA methods, Lee and Lee (2020); Lee et al. (2019) considered epistemic uncertainty. Aleatoric uncertainty is described as the noise inherent in images and is known to be more effective in computer vision. It can be captured by a distribution over the model outputs. Alternatively, test-time data augmentation can be used to estimate the aleatoric uncertainty Ayhan and Berens (2018).

Figure 2: Framework of the proposed Feature Alignment by Uncertainty and Self-Training (FAUST). FAUST aims to learn consistent visual features by considering predictive uncertainties. When considering aleatoric (data) uncertainty, FAUST incorporates data augmentations ( $v$ =2) and enforces intra-space consistency in the feature space and inter-space consistency between the feature space and the prediction space. Generated target features are encouraged to be away from the decision boundaries of the frozen head classifier $H$ .

3 Preliminary and Notations

The proposed method addresses the UDA problem in which only a pre-trained source model is available, whereas access to the source data is prohibited. Conventional UDA methods assume that we are given a set of fully-labeled images $(X_{s}, Y_{s})$ from the source domain $D_{s}$ and a set of unlabeled images $X_{t}$ from the target domain $D_{t}$ . However, in more challenging source-free UDA problems, the source images $(X_{s}, Y_{s})$ are no longer available for model adaptation. Instead, we are given a source model $F_{s}$ trained on $(X_{s}, Y_{s})$ . In this paper, we use only $F_{s}$ and $X_{t}$ to build a model $F_{t}$ adapted to the target domain $D_{t}$ .

We consider $K$ -way classification where the source domain and the target domain share the same label space. In our formulation, a source classification model is divided into two parts: a feature generator $G$ and a task-specific head classifier $H$ . We fix the head classifier $H$ to focus on adapting the feature generator $G$ to the target domain. A decent source model holds a considerable amount of information about the source data. In particular, the feature distribution of source embeddings is encoded in the head classifier. Therefore, preserving the head classifier and making full use of it is a reasonable strategy in the source-free setting.

We let $F = H \circ G$ denote a model with a feature generator $G$ and head classifier $H$ . Given an image $x$ , its feature embedding is $z = G (x)$ . Its $K$ -dimensional prediction output is $p = p (x) = σ (H (G (x)))$ , where $σ$ is a softmax function. In pseudo-labeling, $\hat{σ}$ denotes a softmax function with the sharpening temperature $T$ . We use $H (s, p)$ as the cross-entropy between the two probability distributions $s$ and $p$ , whereas $H (p)$ is the entropy of the distribution $p$ .

4 Proposed Method

Our source-free UDA considers aleatoric uncertainty to learn consistent target visual features. To capture the aleatoric uncertainty, we incorporate data augmentation transforms and promote consistency within the feature space. Our feature generator is encouraged to learn features that are away from the decision boundaries of the fixed head classifier. By leveraging the head classifier that encodes the source feature distribution, the inter-space consistency across the feature space and the prediction space mitigates the domain gap. We also model epistemic uncertainty for an additional performance gain.

4.1 Aleatoric Uncertainty by Augmentation

We propose to consider aleatoric uncertainty for UDA tasks. Aleatoric uncertainty describes noise inherent in images. Our intuition is that by considering the effects of noisy target images, we can encourage the feature generator to generate features away from the decision boundaries of the head classifier. One can envision that images with higher uncertainty are likely to be located around the decision boundaries. Hence, even small perturbations can lead to different decision outputs. On the other hand, images with higher confidence (lower uncertainty) will be distant from the decision boundaries, and their outputs will remain consistent. Thus, we aim to train the feature generator $G$ to generate consistent features under perturbations.

To capture the aleatoric uncertainty, we incorporate data augmentation into our UDA method. Given an image $x$ , let $x_{a}^{(1)}, \dots, x_{a}^{(v)}$ be a set of $v$ views under a random augmentation transform $T$ . As described, these transformed views of the same image should have similar feature representations. This insight motivates our intra-space consistency objective. To place their feature embeddings closer to the feature embedding of $x$ , we minimize the average distance

L_{f} = \frac{1}{v} \sum_{i=1}^{v} D (z, z_{a}^{(i)})

(1)

where $D$ denotes the cosine dissimilarity between two feature embeddings $z = G (x)$ and $z_{a}^{(i)} = G (x_{a}^{(i)})$ .
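The intra-space consistency loss in equation (1) can be illustrated with a minimal sketch in plain Python. It assumes that the cosine dissimilarity takes the common form $D (z, z') = 1 - \cos (z, z')$ , which the text does not spell out; the function names `cosine_dissimilarity` and `intra_space_loss` are hypothetical and chosen only for illustration.

```python
import math

def cosine_dissimilarity(a, b):
    """Assumed form of D: one minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def intra_space_loss(z, z_views):
    """Average dissimilarity between z and its v augmented-view embeddings."""
    return sum(cosine_dissimilarity(z, z_a) for z_a in z_views) / len(z_views)

# Toy embeddings: one view is aligned with z, the other is orthogonal to it.
z = [1.0, 0.0]
z_views = [[2.0, 0.0], [0.0, 3.0]]
loss = intra_space_loss(z, z_views)
```

In this toy case the aligned view contributes a dissimilarity of 0 and the orthogonal view a dissimilarity of 1, so the loss is their average. Minimizing it pulls the embeddings of all views toward the embedding of the original image.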

4.2 Feature Alignment by Self-Training

We promote consistency in the feature space. However, intra-space consistency is not sufficient for UDA tasks because some “consistent” target images can be incorrectly predicted due to a domain shift. To reduce the domain gap between the source domain $D_{s}$ and the target domain $D_{t}$ , we also propose inter-space consistency across the feature space and the prediction space. Our inter-space consistency objective encourages the target features to align with the source feature distribution.

For our purpose, the frozen head classifier $H$ plays a key role. A head classifier trained on source images holds ample information about the source distribution in the feature space and about the decision boundaries for class prediction. We leverage this information to enforce consistency between the pseudo-labels generated in the feature space and the classification outputs in the prediction space.

We compute the pseudo-labels in the feature space by matching the feature embeddings to a set of prototype vectors in the feature space. If the source feature distribution is encoded in these prototypes, the target images will be labeled according to the source distribution. Ideally, these prototypes can be obtained from the source feature embeddings. However, this is not possible in our source-free UDA setting. Instead, we use the target images and the head classifier to estimate these prototypes. One can easily speculate that target images similar to the source images will be classified with high confidence. Hence, we propose to use source-similar target images to produce prototype vectors. The detailed procedure for computing prototype vectors and pseudo-labeling is presented below.

Once we obtain the pseudo-labels, we compare the pseudo-label of an image against the prediction outputs of multiple augmented versions of the same image. Let $s$ denote the pseudo-label of target image $x$ . The proposed inter-space consistency loss is formulated as follows:

L_{i} = \frac{1}{v} \sum_{i=1}^{v} H (s, p_{a}^{(i)}) .

(2)

Here, $p_{a}^{(i)}$ is the prediction output of the randomly augmented view $x_{a}^{(i)}$ . Therefore, our method enforces consistency between the feature space and the prediction space. This loss also ensures that different augmentations of the same image have similar feature embeddings and prediction outputs.
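As a rough illustration of equation (2), the sketch below averages the cross-entropy between a soft pseudo-label and the softmax predictions of several augmented views. The helper names and the small constant `eps` used for numerical stability are assumptions made for this example, not details taken from the method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(s, p, eps=1e-12):
    """H(s, p) between a target distribution s and a prediction p."""
    return -sum(s_k * math.log(p_k + eps) for s_k, p_k in zip(s, p))

def inter_space_loss(s, view_logits):
    """Average cross-entropy between pseudo-label s and each view's prediction."""
    return sum(cross_entropy(s, softmax(l)) for l in view_logits) / len(view_logits)

s = [1.0, 0.0]  # toy pseudo-label favouring class 0
uniform_loss = inter_space_loss(s, [[0.0, 0.0], [0.0, 0.0]])
agreeing_loss = inter_space_loss(s, [[5.0, 0.0], [4.0, 0.0]])
```

Views whose predictions agree with the pseudo-label yield a lower loss than views with uninformative (uniform) predictions, which is the behaviour the objective rewards.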

To summarize, FAUST aims to learn features away from the decision boundaries by leveraging aleatoric uncertainty. Given an image, its neighbors (noisy versions obtained by augmentation) must be mapped closely in the feature space ( $L_{f}$ ) and must have consistent prediction outputs ( $L_{i}$ ). As a result, the feature of the image will maintain its distance from the decision boundaries. Thus, modeling aleatoric uncertainty in this way will effectively locate the boundaries in the low density regions in the feature space. Fig. 2 illustrates our intra-space and inter-space consistency approach.

Pseudo-Labeling with Confidence-Weighted Prototypes. Here, we describe our pseudo-labeling strategy. We pseudo-label feature space vectors using a soft nearest prototype classifier. We compute each prototype $c_{k}$ by evaluating

c_{k} = \sum_{j=1}^{|B|} p_{j k} z_{j}

(3)

where $z_{j}$ is the feature embedding of the non-augmented target image $x_{j}$ in mini-batch $B$ , and $p_{j k}$ is the prediction probability of $x_{j}$ to class $k$ . We assign class conditional confidence $p_{j k}$ as weights to each feature vector. Because target images similar to source images will have higher confidence and dissimilar ones will have lower confidence, the prototype vectors generated using only target images will be similar to the prototypes from the source images.

To obtain the soft pseudo-label of a target image $x$ , we match its feature $z$ to the set of $K$ prototype vectors. Let $C$ denote the matrix whose columns are the prototypes $c_{1}, \dots, c_{K}$ . The matching is done by the cosine similarity

s = \hat{σ} (C^{T} z)

(4)

where the feature vector $z$ and each column vector (prototype) of $C$ are normalized by division by their L2-norms during the similarity computation.
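The prototype computation in equation (3) and the pseudo-labeling step in equation (4) can be sketched as follows. This is a plain-Python illustration under two assumptions: the sharpened softmax divides the similarities by the temperature $T$ before normalizing, and the default of 0.025 mirrors the value reported in the implementation details. All function names and the toy mini-batch are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with an assumed temperature scaling (logits / T)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def prototypes(features, probs):
    """c_k = sum_j p_jk * z_j over the mini-batch (confidence-weighted sum)."""
    num_classes, dim = len(probs[0]), len(features[0])
    return [
        [sum(p[k] * z[d] for z, p in zip(features, probs)) for d in range(dim)]
        for k in range(num_classes)
    ]

def pseudo_label(z, protos, temperature=0.025):
    """Sharpened softmax over cosine similarities between z and each prototype."""
    z_hat = l2_normalize(z)
    sims = [sum(a * b for a, b in zip(z_hat, l2_normalize(c))) for c in protos]
    return softmax(sims, temperature)

# Toy mini-batch: three 2-D features with their class probabilities.
features = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
protos = prototypes(features, probs)
s = pseudo_label([1.0, 0.0], protos)
```

Because confident samples dominate each weighted sum, the class-0 prototype points mostly along the first axis, and a feature on that axis receives a pseudo-label concentrated on class 0. The low temperature makes the soft label nearly one-hot.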

4.3 Epistemic Uncertainty and Entropy

Epistemic Uncertainty. Our intra-space and inter-space consistency losses are designed to capture aleatoric uncertainty. We can also account for epistemic uncertainty for UDA tasks. Epistemic uncertainty explains model uncertainty due to a lack of data and can be reduced by collecting more data. Though modeling aleatoric uncertainty is more effective in computer vision, epistemic uncertainty has proven to be useful in situations where the training set is small or the test-time (target) images are different from the train-time (source) images Lee and Lee (2020); Lee et al. (2019).

We estimate epistemic uncertainty using Monte Carlo (MC) dropout sampling Gal and Ghahramani (2016). After activating dropout in the model, we perform multiple stochastic forward passes. The prediction outputs $p_{MC} (x)$ of $x$ are called MC dropout samples. We use the L2-norm of their sample standard deviation to compute the epistemic uncertainty loss

L_{u} = E_{x} [\| Std (p_{MC} (x)) \|]

(5)

where $Std$ denotes the sample standard deviation.
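To make equation (5) concrete, the following sketch computes the loss from pre-collected MC dropout predictions. It assumes the standard deviation is taken per class across the stochastic forward passes and that the expectation is approximated by a mini-batch mean; the function names are hypothetical, and the dropout forward passes themselves are outside the scope of the example.

```python
import math
import statistics

def epistemic_uncertainty(mc_samples):
    """L2-norm of the per-class sample standard deviation over MC dropout samples."""
    per_class = zip(*mc_samples)  # regroup the probabilities class by class
    stds = [statistics.stdev(values) for values in per_class]
    return math.sqrt(sum(s * s for s in stds))

def epistemic_loss(batch_mc_samples):
    """Mini-batch mean of the per-image uncertainty."""
    per_image = [epistemic_uncertainty(m) for m in batch_mc_samples]
    return sum(per_image) / len(per_image)

# Two stochastic passes per image: one image is stable, the other flips class.
stable = [[0.9, 0.1], [0.9, 0.1]]
unstable = [[0.9, 0.1], [0.1, 0.9]]
loss = epistemic_loss([stable, unstable])
```

An image whose predictions do not change across dropout masks contributes zero, while an image whose predictions flip contributes a large value, so minimizing the loss favors features that are robust to model perturbations.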

Entropy Minimization. The conditional entropy is known to be very effective when used to capture cluster assumptions Grandvalet and Bengio (2005) and has been adopted by many UDA methods. Under the cluster assumption, decision boundaries away from high-density regions are preferred. Because this assumption conforms to our consistency losses, we minimize the conditional entropy

L_{e} = H (p) = - E_{x} [p (x)^{T} \log p (x)]

(6)

to train our feature generator. This loss also encourages our feature generator to learn confident (low entropy) features.
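A minimal sketch of the entropy term in equation (6) is given below, with the expectation approximated by a mini-batch mean. The constant `eps` is an assumption added to avoid taking the logarithm of zero, and the function names are illustrative only.

```python
import math

def entropy(p, eps=1e-12):
    """Shannon entropy of one prediction distribution."""
    return -sum(p_k * math.log(p_k + eps) for p_k in p)

def entropy_loss(batch_probs):
    """Mini-batch mean of the prediction entropy."""
    return sum(entropy(p) for p in batch_probs) / len(batch_probs)

confident = entropy([1.0, 0.0])
uncertain = entropy([0.5, 0.5])
batch = entropy_loss([[1.0, 0.0], [0.5, 0.5]])
```

A one-hot prediction has entropy close to zero and a uniform two-class prediction has entropy $\log 2$ , so the loss decreases as predictions become more confident.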

4.4 Overall Objective and Optimization

Given the source model $F_{s} = H_{s} \circ G_{s}$ and the unlabeled target images $X_{t}$ , we fix $H_{s}$ and train the target-domain feature generator $G_{t}$ with the following objective:

\min_{G} L_{i} + α L_{f} + β L_{e} + γ L_{u}

(7)

where $α, β \geq 0$ are trade-off weights. Considering the computational burden of MC dropout sampling, we can choose to turn $L_{u}$ on or off by setting $γ$ equal to 1 or 0. These settings are referred to as FAUST+U and FAUST, respectively, in Sec. 5.
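The way the four terms combine in equation (7) can be sketched as below. The default values of `alpha` and `beta` are placeholders chosen for this example, since their settings are not given in this section, and the `use_epistemic` flag is a hypothetical name for the on/off switch on $γ$ .

```python
def total_loss(l_i, l_f, l_e, l_u=0.0, alpha=1.0, beta=1.0, use_epistemic=False):
    """Weighted sum of the four losses; gamma is 1 for FAUST+U and 0 for FAUST."""
    gamma = 1.0 if use_epistemic else 0.0
    return l_i + alpha * l_f + beta * l_e + gamma * l_u

# Toy loss values for one mini-batch.
faust = total_loss(1.0, 0.5, 0.2, l_u=0.4)
faust_u = total_loss(1.0, 0.5, 0.2, l_u=0.4, use_epistemic=True)
```

With $γ = 0$ the epistemic term is ignored entirely, so the MC dropout forward passes can be skipped in that setting.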

During training, the prototype computation and pseudo-labeling steps are performed in each mini-batch. Contrary to global approaches Liang et al. (2020); Caron et al. (2018), our in-batch approach incurs a lower computational cost when the size of the target data is enormous or increasing. The advantages of our method include this simple end-to-end training strategy and the fact that it requires no network customization.

5 Experiment

We evaluate the proposed method on popular benchmark datasets with smaller images (Digit and Sign) and larger images (Office-Home and VisDA). We also extend our validation to more complex multi-source domain adaptation (MSDA) tasks using miniDomainNet.

5.1 Setup

Digit. We test our method using the three standard digit recognition datasets of SVHN Netzer et al. (2011), MNIST LeCun et al. (1998) and USPS Hull (1994). The goal is to classify an image into one of ten digits. We consider four UDA tasks: SVHN $\to$ MNIST (S $\to$ M), MNIST $\to$ SVHN (M $\to$ S), MNIST $\to$ USPS (M $\to$ U) and USPS $\to$ MNIST (U $\to$ M). We note that M $\to$ S is the most challenging task and is often ignored in the literature.

Sign. We also experiment on two traffic sign datasets: Synthetic Signs (SYNSIG) Moiseev et al. (2013) and the German Traffic Sign Recognition Benchmark (GTSRB) Stallkamp et al. (2011), sharing 43 classes. SYNSIG contains 100K synthetic traffic sign images and GTSRB has more than 50K actual traffic sign images. We evaluate the UDA task of SYNSIG $\to$ GTSRB (S $\to$ G).

Office-Home. Office-Home Venkateswara et al. (2017) contains 15,500 larger images of 65 categories from four distinct domains: Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Rw). We consider all 12 UDA tasks.

VisDA. VisDA-C Peng et al. (2016) is a large-scale benchmark including 152K synthetic 3D model images rendered from different angles and lighting conditions and 55K photo-real images sampled from MSCOCO Lin et al. (2014). This benchmark is one of the most challenging UDA tasks because it contains a large number of complex images and considers a more practical synthetic-to-real UDA problem.

MiniDomainNet. We extend our evaluation to MSDA tasks using miniDomainNet Zhou et al. (2021). MiniDomainNet is a subset of DomainNet Peng et al. (2019) containing 140K 96 $\times$ 96 images of 126 categories from four different domains: Clipart (Cl), Painting (Pa), Real (Re) and Sketch (Sk). Unlike single-source tasks, the goal of MSDA is to adapt multiple source domains to a target domain. We evaluate FAUST on four MSDA tasks: $R \to$ Cl, $R \to$ Pa, $R \to$ Re and $R \to$ Sk, where $R$ denotes the remaining three domains apart from the target domain. For example, Pa, Re and Sk are the source domains in $R \to$ Cl. Following the setup of Zhou et al. (2021), we hold out 630 test images per domain.

Baseline Methods. We compare our method with state-of-the-art works for vanilla UDA that require source data: DANN Ganin et al. (2016), ADDA Tzeng et al. (2017), MCD Saito et al. (2018b), SWD Lee et al. (2019), SAFN Xu et al. (2019), DTA Lee et al. (2019), IEDA Choi et al. (2020), STAR Lu et al. (2020), GVB-GD Cui et al. (2020b), MUDA Lee and Lee (2020), CDAN Long et al. (2018), BNM Cui et al. (2020a), MDD Zhang et al. (2019), RWOT Xu et al. (2020) and TSA Li et al. (2021). Our method is also compared with recent source-free UDA methods: SDDA Kurmi et al. (2021), SoFA Yeh et al. (2021), SFDA Kim et al. (2021), SHOT Liang et al. (2020) and 3C-GAN Li et al. (2020). For multi-source tasks, we run the comparison with DCTN Xu et al. (2018), M $^{3}$ SDA Peng et al. (2019), MME Saito et al. (2019) and DAEL Zhou et al. (2021). For the baseline methods, all results are directly cited from the relevant published papers.

5.2 Implementation Details

Network Architecture. In all UDA tasks, we use the same network architectures used in prior works Tzeng et al. (2017); Saito et al. (2018b) for a fair comparison. We use LeNet LeCun et al. (1998) variant networks with three convolution layers ( $G$ ) and two linear layers ( $H$ ) for Digit and Sign. The input image sizes are set to 28 $\times$ 28 for M $\leftrightarrow$ U, 32 $\times$ 32 for S $\leftrightarrow$ M, and 40 $\times$ 40 for S $\to$ G. For Office-Home and VisDA, the input images of 224 $\times$ 224 are standardized using ImageNet statistics before being fed into ResNet-50 and ResNet-101 He et al. (2016), respectively. For miniDomainNet, ResNet-18 is used, as in Zhou et al. (2021). For all ResNet models, we employ a pre-trained ResNet trunk ( $G$ ) followed by two linear layers with 1,000 neurons ( $H$ ).

Source Model Training. We specify the training details of our source models for reproducibility. To train the source models, we use random augmentation and the cosine learning rate decay. The initial learning rate is 10 $^{- 2}$ for the datasets containing smaller images (Digit and Sign) and 10 $^{- 3}$ for those containing larger images (Office-Home, VisDA and miniDomainNet). To validate the source model, we use a validation set if one is available. Otherwise, we randomly spare 10% of the train samples. For Office-Home and VisDA, we additionally apply label smoothing Müller et al. (2019) for a fair comparison with Liang et al. (2020). For a multi-source task, we combine all of the source domains (source-combine) to train a single source model.

	Methods	S $\to$ M	M $\to$ S	M $\to$ U	U $\to$ M	S $\to$ G
	Source-only	70.0	47.8	77.4	84.9	77.1
Source access (vanilla)	ADDA Tzeng et al. (2017)	76.0 $^{1.8}$	-	90.1 $^{0.8}$	89.4 $^{0.2}$	-
	DRCN Ghifary et al. (2016)	82.0 $^{0.1}$	40.1	91.8 $^{0.1}$	73.7 $^{0.0}$	-
	MCD Saito et al. (2018b)	96.2 $^{0.4}$	28.7	94.2 $^{0.7}$	94.1 $^{0.3}$	94.4 $^{0.3}$
	DIRT-T Shu et al. (2018)	99.4	76.5	-	-	99.5
	SWD Lee et al. (2019)	98.9 $^{0.1}$	-	98.1 $^{0.1}$	97.1 $^{0.1}$	98.6 $^{0.3}$
	IEDA Choi et al. (2020)	98.9	78.5	95.0	97.5	-
	MUDA Lee and Lee (2020)	99.1 $^{0.4}$	-	98.5 $^{0.1}$	96.7 $^{0.4}$	98.6 $^{0.5}$
	STAR Lu et al. (2020)	98.8 $^{0.1}$	-	97.8 $^{0.1}$	97.7 $^{0.1}$	95.8 $^{0.2}$
	RWOT Xu et al. (2020)	98.8 $^{0.1}$	-	98.5 $^{0.1}$	97.5 $^{0.2}$	-
	TSA Li et al. (2021)	98.7 $^{0.2}$	-	98.0 $^{0.1}$	98.3 $^{0.3}$	-
Source-free	SDDA Kurmi et al. (2021)	76.3	-	89.9	-	-
	SHOT Liang et al. (2020)	98.9 $^{0.0}$	-	98.0 $^{0.2}$	98.4 $^{0.6}$	-
	3C-GAN Li et al. (2020)	99.4 $^{0.1}$	-	97.3 $^{0.2}$	99.3 $^{0.1}$	99.6 $^{0.1}$
	FAUST (v=2)	99.6 $^{0.0}$	85.9 $^{0.2}$	98.3 $^{0.1}$	98.8 $^{0.0}$	99.5 $^{0.1}$
	FAUST+U (v=2)	99.6 $^{0.0}$	91.3 $^{0.1}$	98.8 $^{0.1}$	99.1 $^{0.1}$	99.7 $^{0.0}$
	Target Supervised	99.6 $^{0.0}$	92.5 $^{0.3}$	99.5 $^{0.1}$	99.5 $^{0.1}$	99.8 $^{0.1}$

Table 1: Classification accuracy (%) on Digit and Sign. +U indicates that the epistemic uncertainty loss $L_{u}$ is additionally used. Best in bold and second best in bold italic.

Optimization. We optimize with Adam for S $\to$ M and U $\to$ M and with SGD (momentum 0.9) for all of the other tasks. We apply a fixed learning rate of 2.0 $\times$ 10 $^{- 4}$ and a weight decay of 5.0 $\times$ 10 $^{- 4}$ for all experiments. For VisDA and Office-Home, we employ a cosine decay schedule with an initial learning rate of 5.0 $\times$ 10 $^{- 4}$ . The sharpening temperature $T$ for pseudo-labeling is set to 0.025.
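For readers unfamiliar with cosine decay, the sketch below shows one common form of the schedule, starting from the initial learning rate of 5.0 $\times$ 10 $^{- 4}$ mentioned above. The exact variant used in the experiments is not specified, so the decay to zero without warm-up or restarts, the `final_lr` parameter and the function name are all assumptions.

```python
import math

def cosine_decay_lr(step, total_steps, initial_lr=5.0e-4, final_lr=0.0):
    """Half-cosine decay from initial_lr to final_lr over total_steps (assumed form)."""
    progress = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (initial_lr - final_lr) * cosine

start_lr = cosine_decay_lr(0, 100)
mid_lr = cosine_decay_lr(50, 100)
end_lr = cosine_decay_lr(100, 100)
```

Under this form the learning rate starts at its initial value, reaches half of it at the midpoint of training, and decays smoothly to the final value at the last step.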

Methods	Ar $\to$ Cl	Ar $\to$ Pr	Ar $\to$ Re	Cl $\to$ Ar	Cl $\to$ Pr	Cl $\to$ Re	Pr $\to$ Ar	Pr $\to$ Cl	Pr $\to$ Re	Re $\to$ Ar	Re $\to$ Cl	Re $\to$ Pr	Avg.
Source-only	43.2	57.9	71.1	52.2	61.4	60.3	47.4	39.7	64.7	66.5	50.2	80.5	57.9
DANN Ganin et al. (2016)	45.6	59.3	70.1	47.0	58.5	60.9	46.1	43.7	68.5	63.2	51.8	76.8	57.6
CDAN Long et al. (2018)	50.7	70.6	76.0	57.6	70.0	70.0	57.4	50.9	77.3	70.9	56.7	61.6	64.1
SAFN Xu et al. (2019)	52.0	71.7	76.3	64.2	69.9	71.9	63.7	51.4	77.1	70.9	57.1	81.5	67.3
MDD Zhang et al. (2019)	54.9	73.7	77.8	60.0	71.4	71.8	61.2	53.6	78.1	72.5	60.2	82.3	68.1
BNM Cui et al. (2020a)	56.2	73.7	79.0	63.1	73.6	74.0	62.4	54.8	80.7	72.4	58.9	83.5	69.4
GVB-GD Cui et al. (2020b)	57.0	74.7	79.8	64.6	74.1	74.6	65.2	55.1	81.0	74.6	59.7	84.3	70.4
TSA Li et al. (2021)	57.6	75.8	80.7	64.3	76.3	75.1	66.7	55.7	81.2	75.7	61.9	83.8	71.2
SoFA Yeh et al. (2021)	-	74.1	77.6	-	71.8	75.1	-	-	-	-	-	-	-
SFDA Kim et al. (2021)	48.4	73.4	76.9	64.3	69.8	71.7	62.7	45.3	76.6	69.8	50.5	79.0	65.7
SHOT Liang et al. (2020)	57.1	78.1	81.5	68.0	78.2	78.1	67.4	54.9	82.2	73.3	58.8	84.3	71.8
FAUST (v=2)	59.4	78.5	79.4	62.7	77.6	75.0	64.5	61.0	78.3	72.7	64.8	85.9	71.6
FAUST+U (v=2)	61.4	79.2	79.6	63.3	76.9	75.2	65.3	59.4	79.0	74.7	64.2	86.1	72.0

Table 2: Classification accuracy (%) on Office-Home (ResNet-50).

Methods	plane	bcycl	bus	car	horse	knife	mcycl	person	plant	sktbrd	train	truck	Avg.
Source-only	70.0	60.8	57.9	59.6	82.9	42.5	80.3	34.4	48.8	38.3	84.2	19.9	56.6
MCD Saito et al. (2018b)	87.0	60.9	83.7	64.0	88.9	79.6	84.7	76.9	88.6	40.3	83.0	25.8	71.9
SAFN Xu et al. (2019)	93.6	61.3	84.1	70.6	94.1	79.0	91.8	79.6	89.9	55.6	89.0	24.4	76.1
SWD Lee et al. (2019)	90.8	82.5	81.7	70.5	91.7	69.5	86.3	77.5	87.4	63.6	85.6	29.2	76.4
MUDA Lee and Lee (2020)	92.2	79.5	80.8	70.2	91.9	78.5	90.8	81.9	93.0	62.5	88.7	31.9	78.5
DTA Lee et al. (2019)	93.7	82.2	85.6	83.8	93.0	81.0	90.7	82.1	95.1	78.1	86.4	32.1	81.5
STAR Lu et al. (2020)	95.0	84.0	84.6	73.0	91.6	91.8	85.9	78.4	94.4	84.7	87.0	42.2	82.7
RWOT Xu et al. (2020)	95.1	80.3	83.7	90.0	92.4	68.0	92.5	82.2	87.9	78.4	90.4	68.2	84.0
SoFA Yeh et al. (2021)	-	-	-	-	-	-	-	-	-	-	-	-	64.6
SFDA Kim et al. (2021)	86.9	81.7	84.6	63.9	93.1	91.4	86.6	71.9	84.5	58.2	74.5	42.7	76.7
3C-GAN Li et al. (2020)	94.8	73.4	68.8	74.8	93.1	95.4	88.6	84.7	89.1	84.7	83.5	48.1	81.6
SHOT Liang et al. (2020)	94.3	88.5	80.1	57.3	93.1	94.9	80.7	80.3	91.5	89.1	86.3	58.2	82.9
FAUST (v=1)	95.6	80.4	85.2	76.2	94.8	97.3	91.5	84.0	92.2	87.9	86.9	45.0	84.8
FAUST (v=2)	96.7	77.6	87.6	73.3	95.5	95.4	92.9	83.6	95.3	89.5	87.7	46.9	85.2
FAUST+U (v=1)	96.0	78.2	87.0	78.0	94.6	96.3	90.7	83.3	96.3	87.9	86.4	45.1	84.9

Table 3: Classification accuracy (%) on VisDA (ResNet-101).

Uncertainty Setting. FAUST uses random data augmentation transforms to capture the aleatoric uncertainty. We apply RandAugment Cubuk et al. (2020) and Cutout DeVries and Taylor (2017). The number of transformed views $v$ is set to 2. We reduce $v$ to 1 when we experiment with FAUST+U on VisDA because the computational cost is high.

When we explore epistemic uncertainty with FAUST+U, we employ MC dropout sampling Gal and Ghahramani (2016). For Digit and Sign, we incorporate in-between dropout layers into the LeNet-variant networks and set the dropout rate to 0.4 for $H$ and 0.1 for $G$ . For the ResNet models, the dropout rate of $H$ is set to 0.4, whereas the residual blocks are left intact. We draw ten MC dropout samples for Digit and Sign and two samples for the other tasks due to the computational burden. The additional dropout layers are disabled unless $γ$ =1 in equation (7).

Evaluation Protocol. We set the number of training epochs to 200, 200, 100, 2, and 100 and the mini-batch size to 256, 256, 128, 64, and 256 for Digit, Sign, Office-Home, VisDA and miniDomainNet, respectively. The results are fairly insensitive to the mini-batch size if it is large enough compared to the label set size. We report the mean accuracy (and the standard deviation if space is available) from three independent runs. We indicate the best in bold and the second best in bold italic.
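The reporting convention (mean accuracy with the standard deviation as a superscript) can be sketched as below. The three run accuracies are hypothetical values chosen for illustration, and the use of the sample standard deviation (n-1 denominator) is an assumption, since the text does not state which estimator is used.

```python
import statistics

def summarize(run_accuracies):
    """Return (mean, std) of per-run accuracies, rounded to one decimal."""
    mean = statistics.mean(run_accuracies)
    std = statistics.stdev(run_accuracies)  # sample std (n-1); an assumption
    return round(mean, 1), round(std, 1)

runs = [85.8, 85.9, 86.1]  # hypothetical accuracies from three independent runs
mean, std = summarize(runs)
# Format in the style used by the tables, e.g. "mean $^{std}$".
print(f"{mean} $^{{{std}}}$")
```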

5.3 Results

Digit and Sign. First, we evaluate the performance of the proposed method on the Digit and Sign tasks. As shown in Table 1, our method is consistently superior to previous works, including source-access methods. In particular, FAUST outscores other methods by large margins on M $\to$ S. The adaptation from black-and-white handwritten digits (MNIST) to colored street-view house numbers (SVHN) is the most challenging task. Because this task is often ignored in the literature, Table 1 shows only a few available results. On this task, the epistemic uncertainty loss in FAUST+U provides a significant improvement from 85.9% to 91.3%.

Furthermore, the performance of our method is close to the target supervised accuracies. This result is remarkable considering the use of no source data during adaptation. In U $\to$ M, Li et al. (2020) slightly outperforms our method. Whereas our input image size is 28 $\times$ 28 as in most prior works Tzeng et al. (2017); Saito et al. (2018b, a); Kim et al. (2019); Lu et al. (2020), Li et al. (2020) used a size of 32 $\times$ 32. This choice involves a network with more parameters, which can affect its performance.

Office-Home. FAUST shows accuracies comparable to those of the best baseline method, as can be seen in Table 2. FAUST+U improves FAUST on most Office-Home tasks and records the highest average accuracy of 72.0%. This result shows that the epistemic uncertainty loss $L_{u}$ in FAUST+U enhances UDA by complementing the aleatoric uncertainty losses.

Methods	$R \to$ Cl	$R \to$ Pa	$R \to$ Re	$R \to$ Sk	Avg.
Source-only	63.4	49.9	61.5	44.1	54.8
MCD Saito et al. (2018b)	62.9 $^{0.7}$	45.8 $^{0.5}$	57.6 $^{0.3}$	45.9 $^{0.7}$	53.0
DCTN Xu et al. (2018)	62.1 $^{0.6}$	45.8 $^{0.5}$	58.9 $^{0.6}$	48.3 $^{0.3}$	54.5
DANN Ganin et al. (2016)	65.6 $^{0.3}$	46.3 $^{0.7}$	58.7 $^{0.6}$	47.9 $^{0.5}$	54.6
M $^{3}$ SDA Peng et al. (2019)	64.2 $^{0.3}$	49.1 $^{0.2}$	57.7 $^{0.2}$	49.2 $^{0.3}$	55.0
MME Saito et al. (2019)	68.1 $^{0.2}$	47.1 $^{0.3}$	63.3 $^{0.2}$	43.5 $^{0.5}$	55.5
DAEL Zhou et al. (2021)	70.0 $^{0.5}$	55.1 $^{0.8}$	66.1 $^{0.1}$	55.7 $^{0.8}$	61.7
FAUST (v=2)	68.1 $^{0.5}$	52.2 $^{0.7}$	68.7 $^{0.2}$	59.1 $^{0.6}$	62.0
FAUST+U (v=2)	67.0 $^{0.1}$	51.9 $^{0.4}$	67.1 $^{0.4}$	57.5 $^{0.5}$	60.9
Target Supervised	72.6 $^{0.3}$	60.5 $^{0.7}$	80.5 $^{0.3}$	63.4 $^{0.2}$	69.3

Table 4: Classification accuracy (%) on miniDomainNet (ResNet-18). $R$ denotes the remaining source domains.

Methods	$R \to$ Cl	$R \to$ Pa	$R \to$ Re	$R \to$ Sk	Avg.
Source-only	68.9	57.5	66.4	56.4	62.3
FAUST (v=2)	74.9 $^{0.3}$	60.1 $^{0.6}$	75.0 $^{0.8}$	64.6 $^{1.1}$	68.6
FAUST+U (v=2)	75.0 $^{0.8}$	60.7 $^{0.7}$	75.4 $^{0.8}$	64.0 $^{1.0}$	68.8
Target Supervised	80.8 $^{0.5}$	68.3 $^{0.8}$	85.0 $^{0.4}$	71.0 $^{0.3}$	76.3

Table 5: MiniDomainNet results with ResNet-50.

VisDA. We report the performance of FAUST on the challenging VisDA dataset in Table 3. For each of the twelve categories, the accuracy is improved by a large margin over the source-only model, by more than 28% on average. Our approach outperforms all of the baselines and establishes a new state-of-the-art average accuracy of 85.2%. This result of source-free FAUST is better than that of the best vanilla UDA method that uses the source data.

Table 3 shows the effect of modeling the epistemic uncertainty. The number of transformed views $v$ for FAUST+U is reduced to 1 due to memory constraints. For a fair comparison, FAUST with $v$ =1 is presented. Adding the epistemic uncertainty loss $L_{u}$ to FAUST slightly raises the accuracy. Table 3 also suggests that increasing $v$ from 1 to 2 improves the performance. The accuracy is increased by 0.4% with $v$ =2. We analyze the effects of more views in Sec 5.4.

We note that this result on VisDA as well as the result on Sign (S $\to$ G) demonstrate that our approach can be effective during difficult adaptation tasks from synthetic images to real images, even in a source-free setting.

MiniDomainNet. In more complex multi-source tasks, FAUST shows promising results, as shown in Table 4. Despite the naive source-combine setup, FAUST achieves a higher average accuracy of 62.0% than the best multi-source method and significantly outperforms the other baselines. We note that FAUST does not use source images during adaptation, unlike the other methods.

Contrary to the previous single-source tasks, FAUST+U shows degraded performance. Simply combining multiple source domains makes it difficult for the model to classify as many as 126 object categories. Because the features from different domains are not well-aligned, the model uncertainty can be high near the decision boundary. In relation to this, we speculate that the capacity of ResNet-18 is insufficient to reduce $L_{u}$ . When we replace ResNet-18 with the larger ResNet-50, the performance degradation disappears (see Table 5).

5.4 Analysis and Discussion

Training Stability. Learning curves on various benchmark tasks are illustrated in Fig. 3. The horizontal axis of each task is adjusted to a different scale for better visualization. The steady change of the target accuracy and training loss over the number of iterations shows that learning is stable during adaptation and converges well.

Feature Visualization. We visualize the feature embeddings for the challenging M $\to$ S task. In Fig. 4a and Fig. 4b, the target (SVHN) features are separated from the source (MNIST) features, forming a large group in the center regardless of their classes. However, both domains are better aligned after adaptation, and the features of the same class are in closer proximity in Fig. 4c and Fig. 4d.

Figure 3: (Best viewed in color) The learning curves of (a) target accuracy and (b) training loss on Digit, Sign and VisDA. The $x$ -axis denotes the number of UDA iterations and is rescaled per task.

Figure 4: (Best viewed in color) Feature embeddings from the last pooling layer of $G$ are visualized using t-SNE v. d. Maaten and Hinton (2008). (a) and (c): Red and blue dots are the test samples from the source (MNIST) and the target (SVHN) domains, respectively. (b) and (d): Each color represents a different class.

Ablation Study. We investigate the effect of each component of our training objective in equation (7). Table 6 shows the ablation study on seven source-free UDA tasks. When using only $L_{e}$ or $L_{u}$ , the results are mostly better than those of the source-only model. However, the accuracy collapses on challenging tasks such as M $\to$ S and S $\to$ G. On the other hand, the aleatoric uncertainty loss ( $L_{i} + L_{f}$ ) consistently shows improved accuracy on various tasks. The improvement is more conspicuous on M $\to$ S, Office-Home (O-H) and VisDA. This observation confirms that modeling aleatoric uncertainty plays a significant role in improving the UDA performance. By constraining the intra- and inter-consistency, the extracted features fall far from the decision boundaries. Adding $L_{e}$ (FAUST) and $L_{u}$ (FAUST+U) further boosts the accuracy.
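To make the collapse cases concrete, the short sketch below lists, for each ablation variant, the tasks on which it falls below the source-only accuracy. The numbers are copied from the $v$ =1 rows of Table 6. The helper name `below_source_only` is hypothetical, and "below source-only" is used here only as a simple proxy for collapse.

```python
tasks = ["S->M", "M->S", "M->U", "U->M", "S->G", "O-H", "VisDA"]

# Accuracies (%) copied from Table 6 (v=1 rows).
table6 = {
    "Source-only":     [70.0, 47.8, 77.4, 84.9, 77.1, 57.9, 56.6],
    "Only L_e":        [99.1, 8.7, 97.7, 95.8, 98.7, 66.7, 78.9],
    "Only L_u":        [99.4, 42.9, 98.2, 99.0, 5.1, 64.1, 74.0],
    "L_i + L_f (v=1)": [99.5, 81.7, 98.2, 98.3, 99.4, 71.1, 84.8],
    "FAUST (v=1)":     [99.6, 85.9, 98.3, 98.9, 99.6, 71.1, 84.8],
    "FAUST+U (v=1)":   [99.6, 91.1, 98.8, 99.0, 99.7, 71.7, 84.9],
}

def below_source_only(rows, baseline="Source-only"):
    """Map each variant to the tasks where it scores below the baseline."""
    base = rows[baseline]
    return {name: [t for t, acc, ref in zip(tasks, accs, base) if acc < ref]
            for name, accs in rows.items() if name != baseline}

collapsed = below_source_only(table6)
for name, failed in collapsed.items():
    print(name, failed)
```

Only the two single-loss variants produce a non-empty list, while every variant that includes $L_{i} + L_{f}$ stays above the source-only model on all seven tasks.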

Methods	S $\to$ M	M $\to$ S	M $\to$ U	U $\to$ M	S $\to$ G	O-H	VisDA
Source-only	70.0	47.8	77.4	84.9	77.1	57.9	56.6
Only $L_{e}$	99.1	8.7	97.7	95.8	98.7	66.7	78.9
Only $L_{u}$	99.4	42.9	98.2	99.0	5.1	64.1	74.0
$L_{i}$ + $L_{f}$ ( $v$ =1)	99.5	81.7	98.2	98.3	99.4	71.1	84.8
FAUST ( $v$ =1)	99.6	85.9	98.3	98.9	99.6	71.1	84.8
FAUST+U ( $v$ =1)	99.6	91.1	98.8	99.0	99.7	71.7	84.9
$L_{i}$ + $L_{f}$ ( $v$ =2)	99.5	81.9	98.1	98.2	99.5	71.6	85.2
FAUST ( $v$ =2)	99.6	85.9	98.3	98.8	99.5	71.6	85.2
FAUST+U ( $v$ =2)	99.6	91.3	98.8	99.1	99.7	72.0	-

Table 6: Ablation study on various source-free UDA tasks.

Methods	S $\to$ M	M $\to$ S	M $\to$ U	U $\to$ M	S $\to$ G
FAUST ( $v$ =1)	99.6 $^{0.0}$	85.9 $^{0.1}$	98.3 $^{0.1}$	98.9 $^{0.1}$	99.6 $^{0.0}$
FAUST ( $v$ =2)	99.6 $^{0.0}$	85.9 $^{0.2}$	98.3 $^{0.1}$	98.8 $^{0.0}$	99.5 $^{0.1}$
FAUST ( $v$ =3)	99.5 $^{0.1}$	86.0 $^{0.2}$	98.2 $^{0.1}$	98.8 $^{0.1}$	99.4 $^{0.0}$
FAUST ( $v$ =4)	99.5 $^{0.1}$	85.9 $^{0.3}$	98.2 $^{0.0}$	98.8 $^{0.1}$	99.5 $^{0.1}$
FAUST ( $v$ =5)	99.5 $^{0.0}$	86.1 $^{0.2}$	98.2 $^{0.1}$	98.8 $^{0.1}$	99.4 $^{0.0}$

Table 7: Effects of the number of views $v$ on Digit and Sign.

Number of Views $v$ . Increasing $v$ from 1 to 2 improves the performance on Office-Home and VisDA as shown in Table 6. We investigate the effect of more views in Table 7. Due to memory constraints, only smaller datasets are analyzed. On Digit and Sign, we observe no clear difference in the accuracy with larger values of $v$ . Because the augmentation transforms are applied on the fly, FAUST sees different views of a given image each time. Even with a smaller $v$ , FAUST effectively considers many different views as the iterations progress. Thus, we suggest that $v$ =2 is sufficient in practice.
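The argument that on-the-fly augmentation exposes the model to many views even with a small $v$ can be illustrated by a simple count. The sketch assumes that each epoch visits every target image exactly once and that every view is sampled independently. The epoch counts are those listed in the evaluation protocol, and `views_seen` is a hypothetical helper name.

```python
# Training epochs per benchmark, as listed in the evaluation protocol.
epochs = {"Digit": 200, "Sign": 200, "Office-Home": 100,
          "VisDA": 2, "miniDomainNet": 100}

def views_seen(num_epochs, v):
    """Augmented views drawn per target image over the whole adaptation run."""
    return num_epochs * v

for task, num_epochs in epochs.items():
    print(task, views_seen(num_epochs, 1), views_seen(num_epochs, 2))
```

These are counts only and say nothing by themselves about accuracy.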

Augmentation Transforms. FAUST leverages random augmented images to model aleatoric uncertainty. In incorporating data augmentation, we can consider a wide range of augmentation methods, from simple translation and horizontal flips to strong augmentations that heavily distort the given images. We compare two augmentation schemes: a standard flip-and-shift (weak) transform and the RandAugment (strong) transform. RandAugment Cubuk et al. (2020) is a type of strong augmentation that applies a set of randomly selected transforms with random magnitudes. In Table 8, the results are comparable between the two augmentation methods on most UDA tasks. However, weak augmentation breaks down on the black-and-white to color task M $\to$ S. Weak augmentation also shows a noticeable difference on M $\to$ U, where the target data size is relatively small. These observations suggest that strong augmentation produces more stable and consistent adaptation outcomes.
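The difference between the two schemes can be sketched on a toy image. The weak transform is a random flip followed by a small shift. The strong transform applies a few randomly chosen operations, each with a random magnitude, in the spirit of RandAugment. The operation pool, the magnitude ranges, and all function names below are hypothetical simplifications and do not reproduce the actual RandAugment operation set.

```python
import random

def hflip(img, _mag):
    return [row[::-1] for row in img]

def shift_right(img, mag):
    """Shift columns right by `mag` pixels, padding with zeros."""
    k = min(mag, len(img[0]))
    return [[0.0] * k + row[:len(row) - k] for row in img]

def invert(img, _mag):
    return [[1.0 - p for p in row] for row in img]

def darken(img, mag):
    return [[p * (1.0 - 0.1 * mag) for p in row] for row in img]

def weak_augment(img, rng, max_shift=2):
    """Flip-and-shift: random horizontal flip, then a small random shift."""
    if rng.random() < 0.5:
        img = hflip(img, 0)
    return shift_right(img, rng.randint(0, max_shift))

STRONG_POOL = [hflip, shift_right, invert, darken]  # toy pool, not RandAugment's

def strong_augment(img, rng, n_ops=2, max_mag=5):
    """Apply `n_ops` randomly chosen operations, each with a random magnitude."""
    for op in rng.choices(STRONG_POOL, k=n_ops):
        img = op(img, rng.randint(1, max_mag))
    return img

rng = random.Random(0)
img = [[0.0, 0.25, 0.5, 1.0] for _ in range(4)]  # toy 4x4 image in [0, 1]
weak = weak_augment(img, rng)
strong = strong_augment(img, rng)
```

The weak transform only changes geometry, whereas the strong transform can also alter pixel intensities, which is the kind of distortion the comparison in Table 8 concerns.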

Methods	S $\to$ M	M $\to$ S	M $\to$ U	U $\to$ M	S $\to$ G	O-H	VisDA
Weak	99.5	12.9	98.1	99.1	99.7	71.5	84.8
Strong	99.6	91.1	98.8	99.0	99.7	71.7	84.9

Table 8: FAUST+U ( $v$ =1) results with weak/strong augmentation.

$γ$	$α, β$	S $\to$ M	M $\to$ S	M $\to$ U	U $\to$ M	S $\to$ G	O-H	VisDA	miniDN
0	$α$	0.8	0.2	0.2	0.2	0.5	1.0	1.0	1.0
0	$β$	0.2	0.8	0.8	0.8	0.5	0.0	0.0	0.0
1	$α$	0.8	0.5	0.5	0.2	0.5	1.0	1.0	1.0
1	$β$	0.2	0.5	0.5	0.8	0.5	0.0	0.0	0.0

Table 9: Complete list of hyperparameters ( $α, β, γ$ ).

Hyperparameters ( $α, β$ ). For hyperparameters ( $α$ , $β$ ), we tune the pair in a limited search space { $(1.0, 0.0)$ , $(0.8, 0.2)$ , $(0.5, 0.5)$ , $(0.2, 0.8)$ , $(0.0, 1.0)$ }. A few labeled target samples are used, similarly to Shu et al. (2018); Lee et al. (2019). A complete list of the selected hyperparameters is presented in Table 9.
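The selection over the five candidate pairs amounts to a small grid search, sketched below. The function `toy_evaluate` is a hypothetical stand-in for adapting the model with a given ( $α$ , $β$ ) and scoring it on the few labeled target samples. Its values are constructed for illustration and are not experimental results.

```python
# The limited search space stated in the text.
SEARCH_SPACE = [(1.0, 0.0), (0.8, 0.2), (0.5, 0.5), (0.2, 0.8), (0.0, 1.0)]

def select_alpha_beta(evaluate, search_space=SEARCH_SPACE):
    """Return the (alpha, beta) pair with the highest validation score."""
    scores = {pair: evaluate(*pair) for pair in search_space}
    best = max(scores, key=scores.get)
    return best, scores

def toy_evaluate(alpha, beta):
    # Hypothetical stand-in: a real run would adapt the model with
    # (alpha, beta) and measure accuracy on the labeled target samples.
    # This toy score peaks at alpha = 0.8 by construction.
    return 1.0 - abs(alpha - 0.8)

best, scores = select_alpha_beta(toy_evaluate)
print(best)
```

Because the space contains only five pairs, exhaustive evaluation is cheap compared with a continuous search over $α$ and $β$ .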

6 Conclusion

We proposed a novel UDA method that does not require access to source images. The proposed method leverages aleatoric uncertainty by employing multiple data augmentations and trains the feature generator by encouraging inter-space consistency and intra-space consistency. The feature generator is promoted to learn consistent feature representations that are away from the decision boundaries of the fixed head classifier, thus mitigating the divergence between the source and target domains. We also explored the effect of the epistemic uncertainty loss. Empirical results demonstrate that our approach outperforms current state-of-the-art methods on various challenging UDA benchmark tasks without using source images or requiring customization of the network architecture.

References

M. S. Ayhan and P. Berens (2018) Test-time data augmentation for estimation of heteroscedastic aleatoric uncertainty in deep neural networks. In International conference on Medical Imaging with Deep Learning, Cited by: §2.
L. Bertinetto, J. Henriques, P. Torr, and A. Vedaldi (2018) Meta-learning with differentiable closed-form solvers. In Proceedings of European Conference on Computer Vision (ECCV), Cited by: §2.
M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of European Conference on Computer Vision (ECCV), Cited by: §1, §2, §4.4.
M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing System (NeurIPS), Cited by: §1, §2.
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In Proceedings of International Conference on Machine Learning (ICML), Cited by: §2.
J. Choi, Y. Choi, J. Kim, J. Chang, I. Kwon, Y. Gwon, and S. Min (2020) Visual domain adaptation by consensus-based transfer to intermediate domain. In AAAI, Vol. 34, pp. 10655–10662. Cited by: §5.1, Table 1.
E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2020) Randaugment: practical automated data augmentation with a reduced search space. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3008–3017. Cited by: §5.2, §5.4.
S. Cui, S. Wang, J. Zhuo, L. Li, Q. Huang, and Q. Tian (2020a) Towards discriminability and diversity: batch nuclear-norm maximization under label insufficient situations. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1, Table 2.
S. Cui, S. Wang, J. Zhuo, C. Su, Q. Huang, and Q. Tian (2020b) Gradually vanishing bridge for adversarial domain adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1, Table 2.
B. B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and N. Courty (2018) DeepJDOT: deep joint distribution optimal transport for unsupervised domain adaptation. In Proceedings of European Conference on Computer Vision (ECCV), Cited by: §2.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
T. DeVries and G. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552. Cited by: §5.2.
Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of International Conference on Machine Learning (ICML), Cited by: §2, §4.3, §5.2.
Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain adversarial training of neural networks. Journal of Machine Learning Research 17(59), pp. 1–35. Cited by: §1, §2, §5.1, Table 2, Table 4.
M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li (2016) Deep reconstruction-classification networks for unsupervised domain adaptation. In Proceedings of European Conference on Computer Vision (ECCV), Cited by: Table 1.
S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §2.
Y. Grandvalet and Y. Bengio (2005) Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing System (NeurIPS), L. Saul, Y. Weiss, and L. Bottou (Eds.), Vol. 17. Cited by: §4.3.
J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko (2020) Bootstrap your own latent: a new approach to self-supervised learning. In Advances in Neural Information Processing System (NeurIPS), Cited by: §2.
K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.2.
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
J. J. Hull (1994) A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 550–554. Cited by: §5.1.
A. Kendall and Y. Gal (2017) What uncertainties do we need in Bayesian deep learning for computer vision? In Advances in Neural Information Processing System (NeurIPS), pp. 5580–5590. Cited by: §2.
M. Kim, P. Sahu, B. Gholami, and V. Pavlovic (2019) Unsupervised visual domain adaptation: a deep max-margin gaussian process approach. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.3.
Y. Kim, D. Cho, K. Han, P. Panda, and S. Hong (2021) Domain adaptation without source data. IEEE Transactions on Artificial Intelligence 2 (6), pp. 508–518. Cited by: §2, §5.1, Table 2, Table 3.
A. D. Kiureghian and O. Ditlevsen (2009) Aleatory or epistemic? Does it matter? Structural Safety 31 (2), pp. 105–112. Note: Risk Acceptance and Risk Communication. ISSN 0167-4730. Cited by: §1.
V. K. Kurmi, V. K. Subramanian, and V. P. Namboodiri (2021) Domain impression: a source data free domain adaptation method. In Proceedings of Winter Conference on Applications of Computer Vision (WACV), pp. 615–625. Cited by: §2, §5.1, Table 1.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient based learning applied to document recognition. In Proceedings of the IEEE, Vol. 86(11), pp. 2278–2324. Cited by: §5.1, §5.2.
C. Lee, T. Batra, M. H. Baig, and D. Ulbricht (2019) Sliced wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §5.1, Table 1, Table 3.
J. Lee and G. Lee (2020) Model uncertainty for unsupervised domain adaptation. In Proceedings of IEEE International Conference on Image Processing (ICIP), pp. 1841–1845. Cited by: §1, §2, §2, §4.3, §5.1, Table 1, Table 3.
S. Lee, D. Kim, N. Kim, and S. Jeong (2019) Drop to adapt: learning discriminative features for unsupervised domain adaptation. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1, §2, §2, §4.3, §5.1, §5.4, Table 3.
R. Li, Q. Jiao, W. Cao, H. Wong, and S. Wu (2020) Model adaptation: unsupervised domain adaptation without source data. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.1, §5.3, Table 1, Table 3.
S. Li, M. Xie, K. Gong, C. H. Liu, Y. Wang, and W. Li (2021) Transferable semantic augmentation for domain adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11516–11525. Cited by: §2, §5.1, Table 1, Table 2.
J. Liang, D. Hu, and J. Feng (2020) Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In Proceedings of International Conference on Machine Learning (ICML), pp. 6028–6039. Cited by: §2, §4.4, §5.1, §5.2, Table 1, Table 2, Table 3.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proceedings of European Conference on Computer Vision (ECCV), Cited by: §5.1.
M. Long, Y. Cao, J. Wang, and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In Proceedings of International Conference on Machine Learning (ICML), Cited by: §1, §2.
M. Long, Z. Cao, J. Wang, and M. I. Jordan (2018) Conditional adversarial domain adaptation. In Advances in Neural Information Processing System (NeurIPS), Cited by: §5.1, Table 2.
Z. Lu, Y. Yang, X. Zhu, C. Liu, Y. Song, and T. Xiang (2020) Stochastic classifiers for unsupervised domain adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.1, §5.3, Table 1, Table 3.
B. Moiseev, A. Konev, A. Chigorin, and A. Konushin (2013) Evaluation of traffic sign recognition methods trained on synthetically generated data. In International Conference on Advanced Concepts for Intelligent Vision Systems, Springer, pp. 576–583. Cited by: §5.1.
R. Müller, S. Kornblith, and G. E. Hinton (2019) When does label smoothing help? In Advances in Neural Information Processing System (NeurIPS), H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §5.2.
Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In Advances in Neural Information Processing System (NeurIPS) Workshop, Vol. 2011, pp. 5. Cited by: §5.1.
M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of European Conference on Computer Vision (ECCV), Cited by: §2.
X. Peng, B. Usman, N. Kaushik, J. Hoffman, D. Wang, and K. Saenko (2016) VisDA: the visual domain adaptation challenge. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1.
X. Peng, Q. Bai, X. Xia, Z. Huang, K. Saenko, and B. Wang (2019) Moment matching for multi-source domain adaptation. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1406–1415. Cited by: §2, §5.1, §5.1, Table 4.
K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2018a) Adversarial dropout regularization. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §5.3.
K. Saito, K. Watanabe, Y. Ushiku, and T. Harada (2018b) Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.1, §5.2, §5.3, Table 1, Table 3, Table 4.
K. Saito, D. Kim, S. Sclaroff, T. Darrell, and K. Saenko (2019) Semi-supervised domain adaptation via minimax entropy. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §5.1, Table 4.
R. Shu, H. H. Bui, H. Narui, and S. Ermon (2018) A DIRT-T approach to unsupervised domain adaptation. In Proceedings of International Conference on Learning Representations (ICLR), Cited by: §2, §5.4, Table 1.
J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel (2011) The German traffic sign recognition benchmark: a multi-class classification competition. In the 2011 International Joint Conference on Neural Networks (IJCNN), IEEE, pp. 1453–1460. Cited by: §5.1.
E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 4. Cited by: §1, §2, §5.1, §5.2, §5.3, Table 1.
L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: Figure 4.
H. Venkateswara, J. Eusebio, S. Chakraborty, and S. Panchanathan (2017) Deep hashing network for unsupervised domain adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5018–5027. Cited by: §5.1.
R. Xu, Z. Chen, W. Zuo, J. Yan, and L. Lin (2018) Deep cocktail network: multi-source unsupervised domain adaptation with category shift. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.1, Table 4.
R. Xu, P. Liu, L. Wang, C. Chen, and J. Wang (2020) Reliable weighted optimal transport for unsupervised domain adaptation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §5.1, Table 1, Table 3.
R. Xu, G. Li, J. Yang, and L. Lin (2019) Larger norm more transferable: an adaptive feature norm approach for unsupervised domain adaptation. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1, §2, §5.1, Table 2, Table 3.
H. Yeh, B. Yang, P. C. Yuen, and T. Harada (2021) SoFA: source-data-free feature alignment for unsupervised domain adaptation. In Proceedings of Winter Conference on Applications of Computer Vision (WACV), pp. 474–483. Cited by: §2, §5.1, Table 2, Table 3.
Y. Zhang, T. Liu, M. Long, and M. Jordan (2019) Bridging theory and algorithm for domain adaptation. In Proceedings of International Conference on Machine Learning (ICML), pp. 7404–7413. Cited by: §1, §2, §5.1, Table 2.
K. Zhou, Y. Yang, Y. Qiao, and T. Xiang (2021) Domain adaptive ensemble learning. IEEE Transactions on Image Processing 30, pp. 8008–8018. Cited by: §2, §5.1, §5.1, §5.2, Table 4.
H. Zou, Y. Zhou, J. Yang, H. Liu, H. P. Das, and C. J. Spanos (2019) Consensus adversarial domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §2.
