Uncertainty-Induced Transferability Representation for Source-Free Unsupervised Domain Adaptation

Jiangbo Pei^1, Zhuqing Jiang^1, Aidong Men, Liang Chen, Yang Liu, Qingchao Chen^✉

Jiangbo Pei and Aidong Men are with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China. Jiangbo Pei is also affiliated with the National Institute of Health Data Science, Peking University (e-mail: jiangbop@bupt.edu.cn; menad@bupt.edu.cn). Zhuqing Jiang is with Beijing Key Laboratory of Network System and Network Culture, and also with Beijing University of Posts and Telecommunications, Beijing 100876, China (e-mail: jiangzhuqing@bupt.edu.cn). Liang Chen is with School of Mathematical Sciences, Peking University, Beijing 100871, China (e-mail: clandzyy@pku.edu.cn). Yang Liu is with Wangxuan Institute of Computer Technology at Peking University, Beijing 100080, China (e-mail: yangliu@pku.edu.cn). Qingchao Chen is with the National Institute of Health Data Science, Peking University, Beijing 100191, China (e-mail: qingchao.chen@pku.edu.cn). This work is supported by Peking University Medicine Seed Fund for Interdisciplinary Research (BMU2022MX011), the Fundamental Research Funds for the Central Universities and PKU-OPPO Innovation Fund BO202103.

^1 Equally contributed first author. ^✉ Corresponding author.

Abstract

Source-free unsupervised domain adaptation (SFUDA) aims to learn a target domain model using unlabeled target data and the knowledge of a well-trained source domain model. Most previous SFUDA works focus on inferring semantics of target data based on the source knowledge. Without measuring the transferability of the source knowledge, these methods insufficiently exploit the source knowledge and fail to identify the reliability of the inferred target semantics. However, existing transferability measurements require either source data or target labels, which are unavailable in SFUDA. To this end, we first propose a novel Uncertainty-induced Transferability Representation (UTR), which leverages uncertainty as the tool to analyse the channel-wise transferability of the source encoder in the absence of the source data and target labels. The domain-level UTR unravels how transferable the encoder channels are to the target domain, and the instance-level UTR characterizes the reliability of the inferred target semantics. Secondly, based on the UTR, we propose a novel Calibrated Adaptation Framework (CAF) for SFUDA, including i) the source knowledge calibration module, which guides the target model to learn the transferable source knowledge and discard the non-transferable one, and ii) the target semantics calibration module, which calibrates the unreliable semantics. With the help of the calibrated source knowledge and the target semantics, the model adapts to the target domain safely and ultimately performs better. We verify the effectiveness of our method with experimental results and demonstrate that it achieves state-of-the-art performance on three SFUDA benchmarks. Code is available at https://github.com/SPIresearch/UTR.

I Introduction

Fig. 1: (a) Most existing SFUDA methods directly transfer all source knowledge to the target model at the start of training, infer the semantics (labels) of target data using the model, and update the model with the inferred semantics. Without identifying the source knowledge’s transferability, the target model receives less-transferable knowledge (for example, the feature “Horse hoof”, which is learned to distinguish humans from horses in the real-world source domain but may not be suitable for the target cartoon domain). The less-transferable knowledge hinders the model from inferring the semantics of the target data (e.g. the misclassified horse image). (b) The $U T R_{D}$ identifies how transferable each channel of the source encoder is to the target domain. (c) The $U T R_{I}$ characterizes the reliability of the inferred semantics of each target sample.

Deep neural networks have achieved state-of-the-art performance in a variety of image processing and computer vision applications when the testing data and training data are drawn from the same distribution (domain). When the model needs to be deployed in a new target domain (e.g. a new user uploads photos to a social media website), the recommendation or retrieval model often suffers from severe performance degradation due to the cross-user domain gap. Unsupervised domain adaptation (UDA) is an effective solution to the domain gap: it adapts a model to a target domain where labels are not available, with the help of a labeled source domain dataset. However, the vanilla UDA assumes the source data is accessible during adaptation, which is not always practical. On the one hand, data privacy protection is increasingly important because data often contain personal information. Sharing source domain data endangers personal privacy and is strictly prohibited in many applications, especially in social media, medicine and biometrics. On the other hand, transmitting source data such as videos or high-resolution images is costly.

Source-free unsupervised domain adaptation (SFUDA) has been proposed as a promising task to tackle these issues. SFUDA aims to learn a discriminative target domain model, given the unlabeled target domain data and a pre-trained source model but without any source data or labels. To address SFUDA, as shown in Fig. 1 (a), most existing works [yang2021exploiting, yang2021generalized, xia2021adaptive, yang2020unsupervised, liang2020we] directly transfer all source knowledge to the target model at the start of training, infer the semantics (labels) of target data using the model, and then update the model with the inferred semantics. However, these methods suffer from two limitations. Firstly, the utilization of the source knowledge is limited. On the one hand, transferring all source knowledge to the target model fails to discard the non-transferable knowledge. On the other hand, they transfer source knowledge only at the start of training, which wastes the valuable transferable knowledge learned from the well-annotated source domain. Secondly, the semantics of the target data inferred from the source knowledge are risky when non-transferable knowledge is used (take the “Horse hoof” feature in Fig. 1 (a) as an example). Updating the model using these risky semantics rarely yields a discriminative target model.

Therefore, the key common challenge, and the missing part of existing SFUDA methods, is to measure the transferability of the source knowledge to the target domain in the absence of source data and target labels. To the best of our knowledge, only Wang et al. [wang2022exploring] proposed to search for domain-invariant/transferable model parameters. They explore the transferability of source model parameters by calculating their variations after each adaptation procedure in the stochastic optimization. However, their measurement is susceptible to the quality of the adaptive procedures.

Beyond SFUDA, the transferability of the deep neural network has been studied intensively [nguyen2020leep, you2021logme, ben2010theory, gretton2006kernel].

Nevertheless, existing transferability measurements require either source data or target labels, which are not applicable to SFUDA.

To tackle the key challenge, we propose a novel Uncertainty-induced Transferability Representation (UTR), which provides a transferability measurement of the source knowledge in the absence of source data and target labels. Specifically, we develop uncertainty as a tool to measure the transferability of the source model, inspired by the theory of distributional uncertainty [gawlikowski2021survey, nandy2020towards, gao2020reducing], which measures how “unfamiliar” a trained model is with any input data, and of model uncertainty [malinin2018predictive, gawlikowski2021survey, nandy2020towards], which reflects the degree to which the model fits its training distribution. Intuitively, the two uncertainties reveal the probability that the input data are sampled from the training distribution of the model, that is, an implicit Uncertainty Distance (UD) between the input data and the training data. This measurement gives us solid theoretical support and, more importantly, lets us bridge the uncertainty and the source model transferability in the SFUDA setup. We conjecture that a lower UD between the target data and the source model means the target data is “closer” to the source domain distribution in the encoding space of the model. This indicates that the source knowledge (source model parameters) can more effectively eliminate the domain discrepancy between the two domains, and hence that the source knowledge is more transferable to the target domain. In addition, stemming from our finding that different channels of the source features have different transferability to the target domain, we propose to measure the transferability of the source encoder (the feature encoder of the source model) channel-wise. Intuitively, the transferability of different channels reflects the transferability of the “partial” source knowledge that encodes the features in these channels, which helps us explore which “partial” source knowledge is transferable or non-transferable.

Our UTR can be considered as a transferability spectrum consisting of an instance axis and a channel axis, where the instance axis denotes which target data is used to calculate the UD for measuring transferability, while the channel axis represents the transferability of different channels of the source encoder. To let the UTR address the previous two limitations in SFUDA, we design the following variants. For the first limitation, the domain-level UTR, namely $U T R_{D}$ , integrates the UTR over all of the target instances; it measures the transferability of different channels more accurately than the UTR of each target instance, and can thus efficiently guide the utilization of the knowledge of the source encoder. For the second limitation, the instance-level UTR, namely $U T R_{I}$ , integrates the UTR of a particular instance over all channels, which is shown to characterize the reliability of the inferred target semantics of each target instance. The usages of the $U T R_{D}$ and $U T R_{I}$ are illustrated in Fig. 1 (b) and (c).

Based on the introduced domain-level and instance-level UTR, we propose a novel Calibrated Adaptation Framework to address the two limitations of existing SFUDA works. Firstly, a source knowledge calibration module is designed, which uses $U T R_{D}$ to identify the transferability of different channels of the source encoder, and calibrates the source knowledge that is transferred to the target domain by distilling the knowledge in transferable channels and discarding the knowledge in less-transferable ones. Secondly, a target semantic calibration module is proposed based on our $U T R_{I}$ to detect unreliable target semantics and calibrate them with a semantic calibration loss. The semantic calibration loss encourages the model to “forget” the unreliable semantics and “discover” the true ones. With the calibrated source knowledge and target semantics, we safely adapt the model to the target domain, thereby obtaining a better-performing target model.

Our main contributions are summarized as follows. Firstly, we propose an Uncertainty-induced Transferability Representation (UTR) to explore the source model transferability in the absence of source data and target labels, which is beneficial to the SFUDA community. Secondly, we design a novel Calibrated Adaptation Framework (CAF) to calibrate the source knowledge and the inferred target semantics, allowing the target model to fully and safely exploit the source knowledge and target data, and hence to learn a better-performing target model. Finally, we verify the effectiveness of our method with extensive experimental results and demonstrate that it achieves state-of-the-art performance on three SFUDA benchmarks.

II Related Work

II-A Source-free unsupervised domain adaptation

Recent years have witnessed great achievements in the vanilla UDA [ben2010theory, gretton2006kernel, long2017conditional, ganin2016domain, chhabra2021glocal, chhabra2021iterative, han2022learning, moon2022multi, deng2022dynamic, deng2021joint, xu2021neutral]. However, these methods assume that the source data is accessible during the adaptation, which is not always practical. SFUDA aims to adapt a source-trained model to an unlabeled target domain without access to source data [liang2020we, ye2021source, yang2021model]. Without the labeled source data, some methods propose to generate labeled data with a generative adversarial network (GAN) [goodfellow2014generative]. Kurmi et al. [kurmi2021domain] generate source data using the source-trained classifier, so that the vanilla UDA methods can be applied. Li et al. [li2020model] leverage a conditional GAN to directly produce training samples in the target style. These methods use the source model as auxiliary supervision to control the labels of the generated data. Nevertheless, the source model is ineffective in guiding the data generation process, due to the training instability of GANs [arjovsky2017towards]. Most existing SFUDA methods directly transfer all the source knowledge to the target model at the start of training, infer the semantic information of target data using the target model, and update the target model with the inferred semantic information. SHOT [liang2020we] and ISFDA [li2021imbalanced] predict the target category using the pseudo-labeling strategy. CPGA [qiu2021source] and BAIT [yang2020unsupervised] propose to align the samples with category-wise prototypes in a contrastive learning framework. NRC [yang2021exploiting] and LSC-SDA [yang2021generalized] aim at propagating the categorical semantics from the neighborhood/cluster structure to the feature space. Xia et al. [xia2021adaptive] focus on the disagreements between target data and the source model. They select the target data that highly agree with the source model and apply the source model to these samples. However, without measuring the transferability of the source knowledge, these methods cannot control which knowledge is discarded as non-transferable and which is preserved as transferable. Additionally, they fail to identify the risks of applying the source model to infer target semantics. To the best of our knowledge, the work of Wang et al. [wang2022exploring] is the most similar to ours. They explored transferring only part of the source model parameters, based on calculating the parameter variations after each adaptation procedure in the stochastic optimization. However, their measurement is susceptible to the quality of the adaptive procedures. In contrast, our method measures the transferability using only the source model and unlabeled target data, independently of the adaptation procedure, and thus avoids the hazards of unreliable adaptation.

Fig. 2: (a) The model uncertainty measures the degree to which a model’s fitted region covers its training distribution. (b) The distributional uncertainty measures the probability that an input instance is sampled from a region that the model has not fitted or is unfamiliar with, which reveals how far the sample is from the fitted region of the model. (c) The distributional and model uncertainties reveal an implicit uncertainty distance (UD) from the target instance to the source data distribution, which reveals the ability of the source model to reduce the domain discrepancy between the target and source domains, and therefore suggests the transferability of the source model to the target domain. In SFUDA, UD can be approximated by the distributional uncertainty because the model uncertainty is small. (d) The UTR leverages the distributional uncertainty to estimate the transferability of different channels of the source encoder to the target domain. (e) The domain-level UTR integrates the UTR over all target instances to estimate the transferability of these channels more accurately. (f) The instance-level UTR integrates the UTR along the channel axis, which identifies the risk of using source knowledge to predict the semantics of the target instance.

II-B Uncertainty

Uncertainty is an important criterion to measure the robustness of a deep model [kwon2021repurposing, kendall2017uncertainties, gawlikowski2021survey, sensoy2018evidential]. Given an annotated sample $(x, y)$ and a model parameterized by $θ$ trained on domain $D$ , the predictive uncertainty can be decomposed as follows:

$$P (y | x, D) = \iint \underbrace{P (y | μ)}_{\text{Data}} \, \underbrace{P (μ | x, θ)}_{\text{Distributional}} \, \underbrace{P (θ | D)}_{\text{Model}} \, d θ \, d μ, \tag{1}$$

where $μ = θ (x)$ is the predicted label distribution and the three probability density functions represent the data uncertainty, distributional uncertainty, and model uncertainty respectively [gawlikowski2021survey, nandy2020towards, gao2020reducing]. The data uncertainty is almost irreducible; it arises from the natural complexity of the data, such as class overlap, label noise, and homoscedastic and heteroscedastic noise. The model uncertainty measures how well the model fits its training distribution [malinin2018predictive, gawlikowski2021survey, nandy2020towards]. The distributional uncertainty measures the probability that an input instance is sampled from a region that the model is “unfamiliar” with. This characteristic has prompted its usage in out-of-distribution detection [sedlmeier2019uncertainty, padhy2020revisiting, mcallister2019robustness] and also in vanilla UDA methods [gao2020reducing, liang2019exploring]. To the best of our knowledge, our work is the first to propose the use of uncertainty to explore transferability in SFUDA.

II-C Transferability

It is essential to assess and measure model transferability, and there are two mainstream approaches in the deep learning community. Firstly, the transferability of a model is measured by how much it can bridge the domain discrepancy between the source and the target domain [chen2019transferability]. It can be calculated by domain discrepancy measurements such as the Proxy $A$-distance [ben2010theory] and Maximum Mean Discrepancy (MMD) [gretton2006kernel]. In addition, Chen et al. [chen2019transferability] propose the Corresponding Angle to measure the transferability. However, these methods require access to the source data, which is not available in SFUDA. Secondly, some transfer learning methods investigate the transferability of pre-trained source representations to the target domain. Works in this line of research include NCE [tran2019transferability], LEEP [nguyen2020leep] and LogME [you2021logme]. Nevertheless, they need target data annotations, which are not available in SFUDA. In contrast, our proposed method can estimate the transferability in the absence of source data and target data labels, which fits the challenging SFUDA setup.

III Uncertainty-Induced Transferability Representation

It is essential to analyse the source knowledge transferability for SFUDA; however, existing transferability measurements are not applicable in SFUDA. To tackle this problem, in Section III-A, we develop the Uncertainty Distance (UD) as a tool to estimate the general transferability in the absence of source data and target annotations. In Section III-B, we introduce the channel-wise transferability analysis and propose the Uncertainty-induced Transferability Representation (UTR). In Section III-C, we derive the domain-level UTR and the instance-level UTR and explain how they serve SFUDA.

III-A Transferability measurement using Uncertainty Distance

Not all knowledge in the source model is transferable and discriminative to the target domain. Deploying it directly in the target domain without measuring and quantifying its transferability is therefore risky. However, previous transferability measurements require some matched information, either both the source and target data, or data-annotation pairs of the target domain. These requirements cannot be met in SFUDA, where only an unmatched source model and target data are provided. The unmatched information makes it extremely challenging to measure the transferability by acquiring “known and certain” information as the supervision signal.

To this end, our work instead exploits uncertainty as a fundamental tool, and proposes an Uncertainty Distance (UD) to address these challenges. The UD is an implicit distance between the target instance $x_{t}$ and the source domain $D_{s}$ . A low UD indicates that $x_{t}$ is “close” to $D_{s}$ given a source model parameterized by $θ_{s}$ , which reflects that $θ_{s}$ effectively reduces the domain discrepancy between the source and target domains. It therefore suggests that $θ_{s}$ is transferable to the target domain, and a high UD indicates the opposite.

Our reasoning is illustrated in Fig. 2 (a)-(c). Given the source model parameterized by $θ_{s}$ , the model uncertainty characterizes the degree to which the fitted region of $θ_{s}$ covers its training distribution (i.e. the source domain $D_{s}$ ). Given both $θ_{s}$ and the target instance $x_{t}$ , the distributional uncertainty reveals how far $x_{t}$ is from the fitted region of $θ_{s}$ . These observations suggest that, taken together, the two uncertainties reveal the distance between the target instance $x_{t}$ and the source domain $D_{s}$ . Such a distance implicitly reflects the contribution of the source model to reducing the domain discrepancy. It can also be used to probe and measure the transferability of the source model to the target instance for SFUDA.

By incorporating the distributional uncertainty and the model uncertainty, we first formulate the UD as:

$$UD (x_{t}, θ_{s}, D_{s}) = M \big( \underbrace{P (θ_{s} (x_{t}) | x_{t}, θ_{s})}_{\text{Distributional}} \, \underbrace{P (θ_{s} | D_{s})}_{\text{Model}} \big), \tag{2}$$

where $M (\cdot)$ is the uncertainty measurement function, such as Sensitivity Analysis [nagy2007distributional], Deep Ensembles [lakshminarayanan2016simple] or MC dropout [gal2016dropout].

Although Equation 2 requires $D_{s}$ to measure the model uncertainty, we argue that it is still feasible to estimate transferability using UD in SFUDA. The reason is that the source model has been well trained in the source domain, so that $θ_{s}$ fits $D_{s}$ well. As shown in Fig. 2 (c), in this case, the model uncertainty is small enough to be ignored, and the UD in SFUDA can be approximated by the distributional uncertainty calculated from the target instance $x_{t}$ and source model parameters $θ_{s}$ :

$$UD (x_{t}, θ_{s}) = M \big( P (θ_{s} (x_{t}) | x_{t}, θ_{s}) \big) . \tag{3}$$
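One way to make the step from Equation 2 to Equation 3 explicit is sketched below. It assumes that a well-trained source model corresponds to a posterior over parameters that is sharply peaked at the trained $θ_{s}$; the Dirac delta $δ$ is notation introduced only for this illustration and is not part of the original formulation.

```latex
% Assumption: the parameter posterior collapses onto the trained source parameters.
P(\theta \mid D_s) \approx \delta(\theta - \theta_s)
\;\Longrightarrow\;
P(y \mid x_t, D_s)
  = \iint P(y \mid \mu)\, P(\mu \mid x_t, \theta)\, P(\theta \mid D_s)\, d\theta\, d\mu
  \approx \int P(y \mid \mu)\, P(\mu \mid x_t, \theta_s)\, d\mu .
```

Under this assumption the model term integrates out of Equation 1, leaving only the distributional term $P (μ | x_{t}, θ_{s})$ with $μ = θ_{s} (x_{t})$, which is the quantity measured by $M (\cdot)$ in Equation 3.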

III-B Channel-wise Transferability Analysis

The proposed UD in Equation 3 essentially measures the transferability of the whole source knowledge (i.e., the whole $θ_{s}$ ) to the target instances. Nevertheless, as motivated in the introduction, only part of the knowledge is useful for the target domain. We therefore propose to analyse the transferability of the knowledge in a finer-grained manner, to determine which parts of the learned source parameters are transferable to the target domain. A straightforward method is to measure the transferability of a subset of the source parameters $θ$ using $U D (x_{t}, θ)$ , where $θ \subset θ_{s}$ . However, most deep neural networks are end-to-end “black-box” systems, in which the knowledge is highly abstract and entangled. Individual parameters generally carry no standalone meaning, which makes analysing their transferability impractical.

To tackle this challenge, we propose to estimate the transferability of different channels of the source encoder rather than different model parameters, as shown in Fig. 2 (d). In this way, the transferability of a particular channel naturally represents the transferability of the “partial” source knowledge (the relevant parameters) that encodes the feature of this channel. More specifically, we propose the Uncertainty-induced Transferability Representation (UTR), a transferability spectrum composed of the instance axis and the channel axis, which is formulated as:

$$UTR (x_{t}, h_{s}) = [UD (x_{t}, h_{s}^{1}), \ldots, UD (x_{t}, h_{s}^{d})], \tag{4}$$

where $U D (x_{t}, h_{s}^{i}) = M (P (z_{i} | x_{t}, h_{s}^{i}))$ , $x_{t}$ is the target instance, $z = h_{s} (x_{t}), z \in R^{d}$ denotes the $d$-channel target features produced by the source encoder $h_{s}$ , $z_{i} = (h_{s} (x_{t}))^{i} = h_{s}^{i} (x_{t})$ is the target feature of the $i^{th}$ channel, and $h_{s}^{i}$ denotes the underlying source parameters that encode $z_{i}$ . The instance axis of the UTR denotes which target data is used to calculate the UD for transferability estimation. The channel axis represents the transferability of different channels of the source encoder. The $i^{th}$ channel of the UTR (i.e., $U D (x_{t}, h_{s}^{i})$ ) indicates the transferability of $z_{i}$ to the target domain. A low value of $U D (x_{t}, h_{s}^{i})$ indicates that the target instance $x_{t}$ is close to the source domain in the space of $z_{i}$ and suggests that the source knowledge that encodes $z_{i}$ (i.e., the parameters $h_{s}^{i}$ ) is highly transferable across the two domains.

To calculate the UTR, we adopt the sensitivity analysis method [nagy2007distributional] as the uncertainty measurement $M (\cdot)$ in Equation 4. Specifically, the parameters of $h_{s}$ are randomly perturbed $T$ times: the perturbed encoders $\{h_{s; t} = (1 + r_{t}) * h_{s}\}_{t = 1}^{T}$ , collectively denoted $h_{s; T}$ , are first obtained by applying $T$ random perturbations $\{r_{t}\}_{t = 1}^{T}$ to the original parameters $θ_{h_{s}}$ . The uncertainty is then estimated as the variance of the $T$ outputs of the $i^{th}$ feature channel:

$$M (P (z_{i} | x_{t}, h_{s}^{i})) = \mathrm{Var}_{h_{s} \sim h_{s; T}} \big( (h_{s} (x_{t}))^{i} \big) . \tag{5}$$
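The computation in Equations 4 and 5 can be illustrated with a minimal sketch. It is not the released implementation: the toy linear encoder `encode`, the Gaussian perturbation with standard deviation `scale`, the choice of one independent perturbation per parameter, and the values of `h_s` and `x_t` are all assumptions made for illustration, since the text does not specify the distribution of $r_{t}$.

```python
import random
from statistics import pvariance


def encode(weights, x):
    """Toy linear encoder: channel i is the dot product of weight row i with x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def utr(weights, x, num_perturb=20, scale=0.1, seed=0):
    """Channel-wise variance of the encoder output under T multiplicative
    parameter perturbations (1 + r_t) * h_s, following Equations 4 and 5."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(num_perturb):
        # Assumption: each parameter gets its own Gaussian perturbation r_t.
        perturbed = [[(1.0 + rng.gauss(0.0, scale)) * w for w in row]
                     for row in weights]
        outputs.append(encode(perturbed, x))
    # zip(*outputs) regroups the T outputs by channel before taking the variance.
    return [pvariance(channel) for channel in zip(*outputs)]


h_s = [[0.5, -0.2, 0.1],   # channel 1 of a hypothetical 2-channel encoder
       [2.0, 1.5, -3.0]]   # channel 2
x_t = [1.0, 2.0, 3.0]      # one target instance
spectrum = utr(h_s, x_t)   # one UD value per channel
```

Each entry of `spectrum` plays the role of $U D (x_{t}, h_{s}^{i})$: a channel whose output changes little under parameter perturbation receives a low value and would be read as more transferable.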

Fig. 3: Overview of our Calibrated Adaptation Framework. (a) Source knowledge absorption calibration. The $U T R_{D}$ is calculated and used to estimate the transferability of the knowledge of the source encoder $h_{s}$ . Then the knowledge in $h_{s}$ is distilled into the target encoder $h_{t}$ with $L_{k d}$ , which controls the target encoder to absorb transferable source knowledge and neglect less-transferable knowledge according to $U T R_{D}$ . (b) Target semantics calibration. (b.1) Infer target semantics with the target model. (b.2) Select instances whose inferred semantics are risky (the red point) according to their $U T R_{I}$ and threshold $τ$ . (b.3) The forget objective $L_{f}$ of the semantic calibration loss minimizes the negative cross-entropy on risky instances, forcing the model to forget the current unreliable semantics. (b.4) The discover objective $L_{d}$ of the semantic calibration loss guides the model to discover their true semantics by minimizing the entropy of the prediction probability distribution of the target instances. (c) Adaptation. (c.1) Re-infer target semantics. (c.2) Refine the model with the adapt loss $L_{a}$ .

III-C The Domain-level and Instance-level UTR

Given a source model parameterized by $θ_{s}$ and a target instance $x_{t}$ , the UTR (in Equation (4)) is able to quantify the fine-grained transferability of the instance-level target features. In order to tackle the limitations in the SFUDA community: 1) measuring the transferability of source knowledge to target domain to sufficiently exploit it; 2) measuring the risk of inferring semantic information of target instances using the source knowledge, we design two variants of the UTR on two levels: the domain-level UTR namely the $U T R_{D}$ and the instance-level UTR namely the $U T R_{I}$ .

The $U T R_{D}$ describes the domain-level transferability estimation over the channel axis, which identifies how transferable each channel of the source encoder is to the target domain using the UD of the source model on all target instances. The $U T R_{I}$ characterizes the instance-level transferability for each target instance, which identifies the instance-level risk of inferring target semantic labels. Both measurements are used in the adaptation framework for the SFUDA problem introduced later.

Specifically, as shown in Fig. 2 (e), the $U T R_{D}$ is calculated by integrating the $U T R (x_{t}, h_{s})$ of all $n_{t}$ target instances over the target domain $D_{t}$ . Detailed formulation is as follows:

$$UTR_{D} (h_{s}) = E_{x_{t} \sim X_{t}} UTR (x_{t}, h_{s}) = \frac{1}{n_{t}} \Big[ \sum_{i = 1}^{n_{t}} UD (x_{t}^{i}, h_{s}^{1}), \ldots, \sum_{i = 1}^{n_{t}} UD (x_{t}^{i}, h_{s}^{d}) \Big] \tag{6}$$

As for the instance-level transferability spectrum, the $U T R_{I}$ is calculated by integrating the $U T R (x_{t}, h_{s})$ over all the $d$ -channels of the source encoder $h_{s}$ , as shown in Fig. 2 (f). The detailed formulation of the $U T R_{I}$ is as follows:

$$UTR_{I} (x_{t}) = E_{z \sim R^{d}} UTR (x_{t}, h_{s}) = \frac{1}{d} \Big[ \sum_{i = 1}^{d} UD (x_{t}, h_{s}^{i}) \Big] \tag{7}$$
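The two aggregations in Equations 6 and 7 amount to averaging an $n_{t} \times d$ table of UD values along its two axes. The sketch below illustrates this with hypothetical values; the function names `utr_domain` and `utr_instance` and the numbers in `utr_rows` are chosen for illustration only.

```python
def utr_domain(utr_rows):
    """UTR_D: average each channel's UD over all target instances (Equation 6)."""
    n_t = len(utr_rows)
    return [sum(column) / n_t for column in zip(*utr_rows)]


def utr_instance(utr_row):
    """UTR_I: average one instance's UD over all d channels (Equation 7)."""
    return sum(utr_row) / len(utr_row)


# Hypothetical UTR values: one row per target instance, one column per channel.
utr_rows = [
    [0.25, 1.0, 0.25],
    [0.75, 2.0, 0.25],
]
utr_d = utr_domain(utr_rows)                     # one value per channel
utr_i = [utr_instance(row) for row in utr_rows]  # one value per instance
```

In this toy table the second channel has the highest domain-level value and would be treated as the least transferable, while the second instance has the highest instance-level value and would be treated as the least reliable.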

IV Calibrated Adaptation Framework

IV-A Notation

In this paper, we focus on the $K$-way visual object classification task. SFUDA provides a well-trained source model parameterized by $θ_{s}$ to the target domain $D_{t}$ , where $θ_{s} = g_{s} \circ h_{s}$ , $h_{s}$ is the parameter of the source encoder, and $g_{s}$ is the parameter of the source classifier. The target domain $D_{t} = \{x_{t}^{i}\}_{i = 1}^{n_{t}}$ consists of $n_{t}$ unlabeled target instances. SFUDA aims to learn a discriminative target model parameterized by $θ_{t} = g_{t} \circ h_{t}$ using $θ_{s}$ and $D_{t}$ .

IV-B Overall

Most existing SFUDA methods directly transfer all source knowledge to the target model at the start of training, infer the semantic information (target labels) of target data using the model, and directly update the model using the inferred semantic information. However, they are limited as follows: 1) The utilization of the source knowledge is limited. Directly transferring all source knowledge to the target model fails to discard the less-transferable knowledge, and updating the model using the inferred target semantics fails to preserve the discriminative knowledge in the source model. 2) The target semantic information inferred by the source model is risky due to the less-transferable source knowledge. Refining the model using the risky semantic information is unreliable.

To this end, we introduce the Calibrated Adaptation Framework (CAF). To tackle the first limitation, we calibrate the source knowledge that is transferred to the target model using our $U T R_{D}$ . To tackle the second, we calibrate the inferred semantic information of target instances based on our $U T R_{I}$ . Finally, we adapt the model based on the calibrated source knowledge and target semantics. The overview of CAF is shown in Fig. 3. The pseudo-code of the whole algorithm is given in Algorithm 1.

Source knowledge calibration. To address limitation 1, instead of directly inheriting all source knowledge, we design a transferability-controlled knowledge distillation loss $L_{k d}$ , which uses $U T R_{D}$ to control the knowledge distillation by quantifying the transferability of each channel and assigning larger weights to more transferable channels. On the one hand, it prompts the target model to learn transferable source knowledge and discard the less-transferable knowledge. On the other hand, it constrains the updated target model by continually distilling the transferable source knowledge throughout training, rather than only at the beginning.

Target semantics calibration. The less-transferable knowledge is prone to lead to incorrect semantics (labels) inferred by the model. Since updating the target model based on the inferred semantics of target instances is fundamental in SFUDA, calibrating the incorrect semantics is essential to learning a discriminative target model. Specifically, after inferring target semantics (using the source model at the beginning of training, and the target model later), we use the $U T R_{I}$ to select instances whose inferred semantics are unreliable. Then, a semantic calibration loss $L_{s c}$ is designed to calibrate the model predictions on these instances. On the one hand, as the semantics inferred from the features of less-transferable channels tend to be wrong, we propose to “forget” the current semantics by minimizing a negative cross-entropy $L_{f}$ . It implicitly guides the model to re-initialize the parameters representing the less-transferable knowledge. On the other hand, we minimize the entropy of the prediction probability distribution of these instances to push their predictions toward a new and appropriate class. This procedure discovers the new and proper semantics of these instances.

Adaptation. With the above two steps, the target model “safely” integrates the source knowledge and target semantics. The adaptation step finally refines the target model using the semantics inferred from the transferable knowledge, thereby yielding a better-performing discriminative model.

IV-C Source Knowledge Calibration and Distillation

Not all source knowledge is transferable and discriminative for the target domain. Directly transferring all source knowledge to the target model without handling its less-transferable parts is detrimental to adaptation. To this end, we instruct the target model to selectively learn the features of the transferable channels of the source encoder and thereby inherit its transferable knowledge. Given the source encoder $h_{s}$ , $U T R_{D} (h_{s})$ is calculated following Equation (6) to estimate the transferability of each channel in $h_{s}$ , where a lower $U T R_{D}$ value suggests stronger transferability. The target model then learns the source knowledge according to the identified transferability. We propose a novel transferability-controlled knowledge distillation loss as the objective:

L_{k d} = E_{x_{t} \sim X_{t}} [∥ Q (U T R_{D} (h_{s})) ⊙ [h_{s} (x_{t}) - h_{t} (x_{t})] ∥_{2}],

(8)

where $Q (x) = \mathrm{sigmoid} (- x)$ is a monotonically decreasing function and $⊙$ is the Hadamard product. $Q (U T R_{D} (h_{s}))$ weights the error term $∥ h_{s} (x_{t}) - h_{t} (x_{t}) ∥_{2}$ channel by channel when distilling knowledge from $h_{s}$ to $h_{t}$ . It assigns large weights to features with low $U T R_{D}$ and small weights to those with high $U T R_{D}$ , guiding the target model to learn more transferable knowledge from the source model and discard the less-transferable knowledge in a well-controlled manner.
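To make the weighting in Equation (8) concrete, the following is a minimal pure-Python sketch. It assumes the encoder features $h_{s} (x_{t})$ and $h_{t} (x_{t})$ are available as lists of per-channel values and that the expectation is approximated by a batch mean; `q_weight` and `kd_loss` are hypothetical helper names, not functions from the released code.

```python
import math


def q_weight(utr_d):
    """Q(x) = sigmoid(-x): lower UTR_D (more transferable) gets a larger weight."""
    return [1.0 / (1.0 + math.exp(x)) for x in utr_d]


def kd_loss(source_feats, target_feats, utr_d):
    """Batch mean of || Q(UTR_D) * (h_s(x) - h_t(x)) ||_2 over instances."""
    w = q_weight(utr_d)
    total = 0.0
    for zs, zt in zip(source_feats, target_feats):
        sq = sum((wj * (a - b)) ** 2 for wj, a, b in zip(w, zs, zt))
        total += math.sqrt(sq)
    return total / len(source_feats)


# Toy example: one instance, two channels, both with UTR_D = 0 (weight 0.5).
utr_d = [0.0, 0.0]
source_feats = [[3.0, 4.0]]
target_feats = [[0.0, 0.0]]
loss = kd_loss(source_feats, target_feats, utr_d)  # 0.5 * ||(3, 4)|| = 2.5
```

Because the weight depends only on $U T R_{D} (h_{s})$ , it can be computed once before training and reused for every batch.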

IV-D Target Semantics Calibration

Refining a target model based on the inferred semantics (labels) of target instances is a fundamental and important step for the adaptation in SFUDA. Due to the less-transferable source knowledge, the predicted target semantics may be incorrect, which greatly hinders the adaptation to the target domain. To this end, we design the target semantics calibration module to calibrate the target semantics.

First, the inferred semantics of a target instance $x_{t}$ is $\hat{y} = \arg \max p (x_{t})$ with probability $p^{\hat{y}} (x_{t})$ , where $p (x_{t})$ is the predicted probability distribution, given by $σ (θ_{s} (x_{t}))$ for the source model or $σ (θ_{t} (x_{t}))$ for the target model, and $σ (\cdot)$ is the softmax function. Note that we use the source model to infer target semantics in the first epoch and switch to the target model afterwards, since the target model becomes more discriminative for the target domain after adaptation.

Second, we leverage $U T R_{I}$ to detect risky target instances, whose semantics are prone to be incorrectly inferred. The risky set is $X_{t; r i s k} = \{x_{t} : U T R_{I} (x_{t}) > τ\}$ , where $τ$ denotes the threshold. Consistent with the first step, the feature encoder used to calculate $U T R_{I} (x_{t})$ (Equation 7) changes from $h_{s}$ to $h_{t}$ after the first epoch.
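The first two steps can be sketched as follows. This is an illustrative pure-Python version that assumes the classifier logits and the $U T R_{I}$ values are already available; `infer_semantics` and `select_risky` are hypothetical names introduced only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def infer_semantics(logits):
    """Return (y_hat, p), where y_hat is the argmax of the softmax output p."""
    p = softmax(logits)
    y_hat = max(range(len(p)), key=lambda k: p[k])
    return y_hat, p


def select_risky(utr_i, tau):
    """Indices of instances with UTR_I(x_t) > tau, i.e. the set X_{t;risk}."""
    return [i for i, u in enumerate(utr_i) if u > tau]


logits = [0.1, 2.0, 0.3]  # made-up classifier outputs for one instance
y_hat, p = infer_semantics(logits)
risky = select_risky([2.5, 3.2, 3.0, 4.1], tau=3.0)  # [1, 3]
```

The comparison is strict, so an instance whose $U T R_{I}$ equals $τ$ exactly is not treated as risky, matching the definition of $X_{t; r i s k}$ .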

Third, based on the detected instances $X_{t; r i s k}$ , we propose a semantic calibration loss $L_{s c}$ to calibrate the semantics of these instances. Since their semantics are prone to be inaccurate, we train the target model to first forget these semantics by minimizing the negative cross-entropy loss. The forget objective $L_{f}$ is as follows:

L_{f} = E_{x_{t} \sim X_{t; r i s k}} [- C E (x_{t}, \hat{y})] .

(9)

As illustrated in Fig. 3 (b.3), optimizing this term decreases the prediction probability of the misclassified category $\hat{y}$ .

On the other hand, we guide the target model to discover the true semantics with the following discover objective $L_{d}$ :

L_{d} = - E_{x_{t} \sim X_{t}} \sum_{k = 1}^{K} p^{k} (x_{t}) \log p^{k} (x_{t}),

(10)

where $p (x_{t}) = σ (θ_{t} (x_{t}))$ is the probability distribution predicted by the target model and $p^{k} (x_{t})$ is its $k$ -th entry. $L_{d}$ minimizes the entropy of $p (x_{t})$ , thus guiding the model to assign the prediction to an appropriate class. Note that, instead of only minimizing the entropy on $X_{t; r i s k}$ , we calculate $L_{d}$ on all target instances $X_{t}$ . In this way, the semantic information of instances with low $U T R_{I}$ , on which the model tends to make the right predictions, also helps the semantic discovery of instances in $X_{t; r i s k}$ .

Such a “forget-discover” process implicitly guides the model to free itself from the shackles of less-transferable knowledge and facilitates the discovery of the true semantics of the target data. The semantic calibration loss is:

L_{s c} = γ L_{f} + L_{d},

(11)

where $γ$ is the scale coefficient of the $L_{f}$ .
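Equations (9) to (11) can be sketched in a few lines. The version below is a minimal illustration that assumes the predicted probabilities are given as nested lists and uses the standard definition $C E (x_{t}, \hat{y}) = - \log p^{\hat{y}} (x_{t})$ ; the function names and the small constant `eps` are assumptions made for this example, not part of the paper.

```python
import math


def forget_loss(probs, y_hat, risky):
    """L_f: mean of -CE(x, y_hat) = log p^{y_hat}(x) over the risky instances."""
    if not risky:
        return 0.0
    return sum(math.log(probs[i][y_hat[i]]) for i in risky) / len(risky)


def discover_loss(probs):
    """L_d: mean prediction entropy over all target instances."""
    eps = 1e-12  # guards log(0); a numerical convenience, not from the paper
    ent = [-sum(p * math.log(p + eps) for p in row) for row in probs]
    return sum(ent) / len(ent)


def semantic_calibration_loss(probs, y_hat, risky, gamma=0.9):
    """L_sc = gamma * L_f + L_d, with gamma = 0.9 as in the reported setup."""
    return gamma * forget_loss(probs, y_hat, risky) + discover_loss(probs)


# Toy example: two instances, two classes; only the first instance is risky.
probs = [[0.5, 0.5], [1.0, 0.0]]
y_hat = [0, 0]
risky = [0]
l_sc = semantic_calibration_loss(probs, y_hat, risky)
```

Minimizing `forget_loss` lowers the log-probability of the currently inferred class on risky instances, while `discover_loss` is computed on every instance, which mirrors the note that $L_{d}$ is evaluated on all of $X_{t}$ .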

Require: Source model parameterized by $θ_{s} = g_{s} \circ h_{s}$ , target model parameterized by $θ_{t} = g_{t} \circ h_{t}$ , unlabeled target instances $D_{t}$
Require: hyperparameters $τ$ , $λ$ , $γ$
Calculate $U T R_{D} (h_{s})$
while $i <$ max epoch do
  In the $i^{t h}$ epoch:
    Sample batch $T$ from $D_{t}$
    Calculate $L_{k d}$
    Infer target semantics
    Calculate $U T R_{I} (x_{t})$ , select $X_{t; r i s k}$ with $τ$
    Calculate $γ L_{f} + L_{d}$
    Train the target model by optimizing $λ L_{k d} + γ L_{f} + L_{d}$
  In the $(i + 1)^{t h}$ epoch:
    Sample batch $T$ from $D_{t}$
    Infer target semantics
    Calculate $L_{a}$
    Train the target model by optimizing $L_{a}$
  $i = i + 2$
end while

Algorithm 1 Calibrated Adaptation Framework

IV-E Adaptation

With the above two steps to calibrate the source knowledge and target semantics, the target model can then safely adapt to the target domain. In the adaptation step, we re-infer the target semantics with the model and use them to refine the target model, ultimately adapting the model to the target domain.

In this step, we adopt the pseudo-label strategy in [liang2020we] to re-infer the semantics $\hat{y}$ of the target instance $x_{t}$ , considering its simplicity and effectiveness. Given $x_{t}$ and $\hat{y}$ , we optimize the model with the cross-entropy loss, and the objective of the adaptation step can be formulated as:

L_{a} = E_{x_{t} \sim X_{t}} C E (x_{t}, \hat{y}) .

(12)

IV-F Training Steps

In this subsection, we summarize the training steps of the CAF framework. The two calibration steps are performed separately from the adaptation step. Specifically, in the $i^{t h}$ epoch, we perform the two calibration steps to calibrate the transferable source knowledge and the target semantics by:

\min_{θ_{t}} λ L_{k d} + γ L_{f} + L_{d},

(13)

where $λ$ and $γ$ are the scale coefficients.

In the $(i + 1)^{t h}$ epoch, we conduct the adaptation step to adapt the target model to the target domain:

\min_{θ_{t}} λ L_{a} .

(14)
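The alternation between Equations (13) and (14) can be sketched as a simple epoch schedule. The code below is only a skeleton under stated assumptions: epochs are counted from 0, the two callables that train the model for one epoch are placeholders supplied by the caller, and the guard for an odd number of epochs is our addition rather than part of Algorithm 1.

```python
def run_caf_schedule(max_epoch, calibration_epoch, adaptation_epoch):
    """Alternate calibration (Eq. 13) and adaptation (Eq. 14) epochs.

    `calibration_epoch` and `adaptation_epoch` are caller-supplied callables
    that each train the target model for one epoch; both are placeholders.
    """
    log = []
    i = 0
    while i < max_epoch:
        calibration_epoch(i)  # optimize lambda * L_kd + gamma * L_f + L_d
        log.append((i, "calibrate"))
        if i + 1 < max_epoch:  # guard for odd max_epoch (an assumption)
            adaptation_epoch(i + 1)  # optimize the adaptation objective L_a
            log.append((i + 1, "adapt"))
        i += 2
    return log


# Dry run with no-op epochs, just to show the ordering of the steps.
schedule = run_caf_schedule(4, lambda e: None, lambda e: None)
```

Keeping the two kinds of epochs behind separate callables makes it easy to compare this alternating schedule with a variant that optimizes all losses in the same epoch.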

V Results

V-A Datasets

We evaluate our SFUDA method on the following three benchmarks: Office-31 [saenko2010adapting], Office-Home [venkateswara2017deep] and VisDA [peng2017visda]. Office-31 [saenko2010adapting] contains 4,652 images in 31 categories from three domains: Amazon (A), Webcam (W) and DSLR (D). Office-Home [venkateswara2017deep] consists of four domains, i.e., Artistic images (Ar), Clip Art (Cl), Product images (Pr), and Real-World images (Rw), with 65 classes and a total of 15,500 images. VisDA [peng2017visda] is a more challenging dataset, whose source domain contains 152k synthetic images generated by rendering 3D models, while the target domain has 55k real object images sampled from Microsoft COCO [lin2014microsoft].

V-B Implementations

For fair comparisons with existing methods, we adopt ResNet-50 [he2016deep] as the backbone for Office-31 and Office-Home and ResNet-101 for VisDA. Following the setups in [liang2020we, yang2021exploiting], the backbone is followed by a fully-connected (fc) layer with $256$ output channels, which together form the encoder, and a fc layer with weight normalization serves as the classifier. The source model is trained following the same strategy as [liang2020we, yang2021exploiting]. The pre-trained source model is then adapted to the target domain without using any labeled source data. For optimization, we adopt SGD with momentum 0.9 and a batch size of 64 on all datasets. For the Office-31 and Office-Home datasets, the learning rates for the ResNet-50 backbone and the newly added layers are 1e-3 and 1e-2, respectively. The learning rate is 1e-4 for VisDA. We train for 40, 60 and 50 epochs on Office-31, Office-Home and VisDA, respectively. Note that the mixup[zhang2017mixup] data augmentation is used in the adaptation step. The threshold of $U T R_{I}$ , i.e. $τ$ , is set to 3. The weight $λ$ of the transferability-controlled knowledge distillation loss is set to 10 at the beginning. As training progresses, the model is gradually adapted to the target domain and requires less source knowledge. Therefore, after 10 epochs, we decrease $λ$ to zero. The weight $γ$ of the “forget” loss is set to 0.9. For the uncertainty measurement (Equation 5), $T$ is set to $2$ , and $r_{t}$ is randomly sampled from the uniform distribution $U (- 0.05, 0.05)$ .
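For reference, the reported settings can be collected in one place together with the schedule of $λ$ . This is a sketch only: the dictionary keys and the function name are hypothetical, epochs are assumed to be counted from 0, and the hard switch to zero is an assumption, since the text states that $λ$ is decreased to zero after 10 epochs without specifying the shape of the decay.

```python
# Values below are the ones reported in the implementation details.
CAF_HPARAMS = {
    "tau": 3.0,                    # threshold on UTR_I
    "gamma": 0.9,                  # weight of the forget loss L_f
    "lambda_init": 10.0,           # initial weight of L_kd
    "lambda_cutoff_epoch": 10,     # lambda is zero from this epoch on
    "T": 2,                        # setting of the uncertainty measurement (Eq. 5)
    "noise_range": (-0.05, 0.05),  # r_t is sampled from U(-0.05, 0.05)
    "momentum": 0.9,               # SGD momentum
    "batch_size": 64,
}


def kd_weight(epoch, hparams=CAF_HPARAMS):
    """Weight lambda of L_kd for a given epoch (hard switch, an assumption)."""
    if epoch < hparams["lambda_cutoff_epoch"]:
        return hparams["lambda_init"]
    return 0.0


weights = [kd_weight(e) for e in (0, 9, 10, 39)]  # [10.0, 10.0, 0.0, 0.0]
```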

Method	A $\to$ D	A $\to$ W	D $\to$ A	D $\to$ W	W $\to$ A	W $\to$ D	Avg.
Source-model	80.4	76.5	60.2	95.6	63.4	98.6	79.1
SoFA[yeh2021sofa]	73.9	71.7	53.7	96.7	54.6	98.2	74.8
SFDA[kim2021domain]	92.2	91.1	71.0	98.2	71.2	99.5	87.2
SHOT[liang2020we]	94.0	90.1	74.7	98.4	74.3	99.9	88.6
3C-GAN[li2020model]	92.7	93.7	75.3	98.5	77.8	99.8	89.6
BAIT[yang2020unsupervised]	92.0	94.6	74.6	98.1	75.2	100.0	89.1
NRC[yang2021exploiting]	96.0	90.8	75.3	99.0	75.0	100.0	89.4
HCL[huang2021model]	94.7	92.5	75.9	98.2	77.7	100.0	89.8
AAA[li2021divergence]	95.6	94.2	75.6	98.1	76.0	99.8	89.9
A2Net [xia2021adaptive]	94.5	94.0	76.7	99.2	76.1	100.0	90.1
DIPE[wang2022exploring]	96.6	93.1	75.5	98.4	77.2	99.6	90.1
Ours	95.0	93.5	76.3	99.1	78.4	100.0	90.3

TABLE I: Classification accuracies (%) on Office-31 dataset.

V-C Comparison with State-of-the-Art Methods

We report the results on Office-31, Office-Home, and VisDA, in Tables I, II, and III, respectively.

On the Office-31 tasks, in terms of the average accuracy over the 6 transfer tasks, our method outperforms the state-of-the-art works A2Net[xia2021adaptive] and DIPE[wang2022exploring] by 0.2%, improving from 90.1% to 90.3%. We also achieve state-of-the-art results on W $\to$ A and W $\to$ D. On the other transfer directions of Office-31, we achieve very competitive results. We hypothesize that this is because our method brings transferability risk quantification to SFUDA and integrates the “safe-to-transfer” source knowledge into the target domain for better adaptation. We also argue that our method is more useful and brings larger improvements on challenging adaptation tasks, where the cross-domain transfer risk is high. By contrast, the Office-31 transfer tasks are relatively easy and carry less risk (the average accuracy of the source-only model is already 79.1%), so our improvement there is modest.

As expected, on the more challenging Office-Home and VisDA tasks (where the mean accuracies of the source models are 60.0% and 48.0%, respectively), our method brings larger improvements. On the Office-Home tasks, we achieve the state-of-the-art performance on 7 of 12 tasks and improve the average accuracy over the prior work by 0.6%, from 72.6% (by A2Net[xia2021adaptive]) to 73.2%. In particular, we achieve clear improvements on two difficult tasks, Ar $\to$ Cl and Re $\to$ Cl, outperforming the second-best method by 1.4% and 0.7%, respectively. On the VisDA tasks, our method matches or outperforms the others on 10 out of 12 classes and surpasses the SOTA method NRC by 1.4% in per-class accuracy.

Compared with the most related work DIPE[wang2022exploring], our method also obtains performance improvement in all three benchmarks, including 0.2% in Office-31, 0.7% in Office-Home and 4.2% in VisDA. The reported results clearly demonstrate the efficacy of our method.

Method	AC	AP	AR	CA	CP	CR	PA	PC	PR	RA	RC	RP	Avg.
Source-model	44.8	67.4	75.1	52.3	63.4	63.7	53.6	39.5	72.7	64.1	45.2	77.6	60.0
SHOT[liang2020we]	57.1	78.1	81.5	68.0	78.2	78.1	67.4	54.9	82.2	73.3	58.8	84.3	71.8
SFDA[kim2021domain]	48.4	73.4	76.9	64.3	69.8	71.7	62.7	45.3	76.6	69.8	50.5	79.0	65.7
SoFA[yeh2021sofa]	-	74.1	77.6	-	71.8	75.1	-	-	-	-	-	-	-
BAIT[yang2020unsupervised]	57.4	77.5	82.4	68.0	77.2	75.1	67.1	55.5	81.9	73.9	59.5	84.2	71.6
PS [du2021generation]	57.8	77.3	81.2	68.4	76.9	78.1	67.8	57.3	82.1	75.2	59.1	83.4	72.1
AAA[li2021divergence]	56.7	78.3	82.1	66.4	78.5	79.4	67.6	53.5	81.6	74.5	58.4	84.1	71.8
NRC[yang2021exploiting]	57.7	80.3	82.0	68.1	79.8	78.6	65.3	56.4	83.0	71.0	58.6	85.6	72.2
DIPE [wang2022exploring]	56.5	79.2	80.7	70.1	79.8	78.8	67.9	55.1	83.5	74.1	59.3	84.8	72.5
A2Net [xia2021adaptive]	58.4	79.0	82.4	67.5	79.3	78.9	68.0	56.2	82.9	74.1	60.5	85.0	72.6
Ours	59.8	81.2	83.2	67.2	79.2	80.1	68.4	56.4	83.0	73.7	61.2	85.9	73.2

TABLE II: Classification accuracies (%) on the Office-Home dataset (ResNet-50). AC denotes the task Ar $\to$ Cl.

Method	plane	bcycl	bus	car	horse	knife	mcycl	person	plant	sktbrd	train	truck	$P e r_{c}$
Source-model	64.6	28.9	47.2	63.5	67.2	12.4	82.5	23.5	61.7	31.4	82.1	11.1	48.0
3C-GAN[li2020model]	94.8	73.4	68.8	74.8	93.1	95.4	88.6	84.7	89.1	84.7	83.5	48.1	81.6
SHOT[liang2020we]	94.3	88.5	80.1	57.3	93.1	94.9	80.7	80.3	91.5	89.1	86.3	58.2	82.9
SFDA[kim2021domain]	86.9	81.7	84.6	63.9	93.1	91.4	86.6	71.9	84.5	58.2	74.5	42.7	76.7
BAIT[yang2020unsupervised]	93.7	83.2	84.5	65.0	92.9	95.4	88.1	80.8	90.0	89.0	84.0	45.3	82.7
DIPE[wang2022exploring]	95.2	87.6	78.8	55.9	93.9	95.0	84.1	81.7	92.1	88.9	85.4	58.0	83.1
HCL[huang2021model]	93.3	85.4	80.7	68.5	91.0	88.1	86.0	78.6	86.6	88.8	80.0	74.7	83.5
PS[du2021generation]	95.3	86.2	82.3	61.6	93.3	95.7	86.7	80.4	91.6	90.9	86.0	59.5	84.1
AAA[li2021divergence]	94.4	85.9	74.9	60.2	96.0	93.5	87.8	80.8	90.2	92.0	86.6	68.3	84.2
A2Net [xia2021adaptive]	96.1	88.3	85.5	74.1	97.1	95.4	89.5	79.4	95.4	92.9	89.1	42.6	85.4
NRC[yang2021exploiting]	96.8	91.3	82.4	62.4	96.2	95.9	86.1	80.6	94.8	94.1	90.4	59.7	85.9
Ours	98.0	92.9	88.3	78.0	97.8	97.7	91.1	84.7	95.5	91.4	91.2	41.1	87.3

TABLE III: Classification accuracies (%) on the VisDA-C dataset (ResNet-101). $P e r_{c}$ denotes the per-class accuracy.

Fig. 4: Visualization of the prediction probability, prediction correctness and $U T R_{I}$ of different samples on VisDA for (a) the source model, (b) CAF without $L_{f}$ , and (c) CAF. Blue points represent samples that the model predicts correctly, and red points represent samples that it predicts wrongly. The vertical axis represents the prediction probability of the sample, and the horizontal axis represents $U T R_{I}$ .

Fig. 5: The Accuracy- $U T R_{I}$ curve of the source model on target samples for the Ar $\to$ Cl and VisDA tasks. The horizontal axis is the threshold $τ$ of $U T R_{I}$ , which selects the samples that satisfy $U T R_{I} (x_{t}) > τ$ . The vertical axis represents the accuracy of the semantics predicted by the model for these samples. For a better illustration, we only select samples for which the maximum prediction probability of the source model is larger than 0.5.

VI Analysis

VI-A Ablation Study

Ablation study on the Source Knowledge Calibration. We designed the transferability-controlled knowledge distillation loss $L_{k d}$ to distill transferable source knowledge. To verify its effectiveness, the ablation results of this module are reported in the first three rows of Table IV. It can be seen that v2 ( $L_{a} + L_{k d}$ ) outperforms v1 ( $L_{a}$ ) by 2.1% on Ar $\to$ Cl and 1.7% on Ar $\to$ Re. However, it has a negative effect on Ar $\to$ Pr. We hypothesize that the knowledge transferable to the target domain may not discriminate the target samples and may infer wrong semantics on Ar $\to$ Pr. In support of this hypothesis, comparing v4 and Ours in Table IV shows that, once the two semantics calibration losses $L_{f}$ and $L_{d}$ are added, $L_{k d}$ is more effective and brings significant improvements on all three transfer directions Ar $\to$ Cl, Ar $\to$ Re and Ar $\to$ Pr (see more details in the following two comparisons: [v2 $\leftrightarrow$ v3] and [v4 $\leftrightarrow$ Ours]). This suggests that calibrating the target semantics helps calibrate the transferable source knowledge.

Method	Module	Ar $\to$ Cl	Ar $\to$ Pr	Ar $\to$ Re
source	-	44.6	67.3	74.0
v1	$L_{a}$	50.4	73.3	75.6
v2	$L_{a} + L_{k d}$	52.5	72.4	77.3
v3	$L_{a} + L_{d}$	54.6	77.1	80.6
v4	$L_{a} + L_{f} + L_{d}$	57.1	77.2	81.1
v5	$L_{a} + L_{d} + L_{k d}$	58.5	79.2	82.0
v6	$L_{a} + L_{f} + L_{k d}$	46.6	70.1	74.3
Ours	$L_{a} + L_{f} + L_{d} + L_{k d}$	59.8	81.2	83.2
Ours-Merge	$L_{a} + L_{f} + L_{d} + L_{k d}$	57.7	79.3	82.1
Ours-Online	$L_{a} + L_{f} + L_{d} + L_{k d}$	59.2	81.2	83.1
Ours-Ensemble	$L_{a} + L_{f} + L_{d} + L_{k d}$	59.1	80.7	82.9
Ours-MC dropout	$L_{a} + L_{f} + L_{d} + L_{k d}$	59.3	80.5	82.7

TABLE IV: Ablation study on three Office-Home tasks.

HPR	Ar $\to$ Pr
$τ$	1.5m/76.2	2m/78.6	2.5m/81.2	3m/81.2	3.5m/80.6	4m/79.6
$λ$	0.5/78.56	2.5/77.91	5/79.08	7.5/80.1	10/81.2	12.5/80.4
$γ$	0.1/79.3	0.5/80.7	0.9/81.2	1.0/80.8	1.5/79.5	2/79.4

TABLE V: Hyperparameter(HPR) analysis. The results are shown in form of Value/Acc(%).

Ablation study on the Target Semantics Calibration. We designed a forget loss $L_{f}$ and a discover loss $L_{d}$ to calibrate the target semantics. By comparing the results of v1 and v3 in Table IV, we conclude that adding $L_{d}$ to v1 boosts the performance by 4.2%, 3.8% and 5.0% on the three tasks, respectively.

In addition, using $L_{f}$ alone without $L_{d}$ has a negative effect (see the comparison [v6 $\leftrightarrow$ Ours]). In contrast, the comparisons [v3 $\leftrightarrow$ v4] and [v5 $\leftrightarrow$ Ours] in Table IV show that $L_{f}$ improves accuracy when it is combined with $L_{d}$ . In other words, $L_{f}$ is effective only together with $L_{d}$ .

Moreover, we notice that the forget loss $L_{f}$ is more effective on challenging tasks, e.g. Ar $\to$ Cl. We hypothesize that it is because the predictions on challenging tasks tend to be wrong and therefore forgetting model prediction completely brings more improvements.

To further understand the forget loss $L_{f}$ and support the previous arguments, we visualize the correct/incorrect target predictions and the $U T R_{I}$ obtained by the source model, our CAF model without the forget loss, and our full CAF model in Fig. 4 (a), (b) and (c), respectively. It can be seen in (a) that the source model predicts many wrong semantics in the target domain (the red points) due to the less-transferable knowledge. The many red points in the upper right of Fig. 4 (b) suggest that, without $L_{f}$ , we cannot calibrate the wrong semantics of samples with high prediction probability, since the model is confident in its inferred semantics and tends to maintain them. Finally, Fig. 4 (c) shows that after adding the forget term $L_{f}$ , the model forgets the wrong semantics and finally calibrates them. The above experiments verify the effectiveness of the Target Semantics Calibration.

Method	Cl $\to$ Ar	Cl $\to$ Pr	Cl $\to$ Re
Ours	67.2	79.2	80.1
SHOT[liang2020we]	68.0	78.2	78.1
SHOT+Ours	68.5	79.3	80.3
NRC[yang2021exploiting]	68.1	79.8	78.6
NRC+Ours	68.9	80.1	80.2

TABLE VI: Improvement to existing SFUDA methods.

Ablation study on merging different steps. In our CAF framework, the two calibration steps first integrate the “transferable” source knowledge and the reliable target semantics, and the adaptation step then refines the model using the calibrated knowledge and target semantics. Therefore, it is better to perform the two calibration steps before the adaptation step. To verify this, we conduct extra experiments that perform the calibration and adaptation steps simultaneously. The results are denoted as “Ours-Merge” in Table IV. It can be observed that “Ours” outperforms “Ours-Merge” by 2.1%, 1.9% and 1.1% on Ar $\to$ Cl, Ar $\to$ Pr and Ar $\to$ Re, respectively.

VI-B Hyperparameter Analysis

We analyse the sensitivity of the following hyperparameters: the $U T R_{I}$ threshold $τ$ and the weights $λ$ and $γ$ of the losses $L_{k d}$ and $L_{f}$ , respectively. The results in Table V show that our method is stable over a wide range of hyperparameter choices.

VI-C Calibration on other existing methods

Without measuring the transferability, current SFUDA methods [yang2021exploiting, yang2021generalized, xia2021adaptive, yang2020unsupervised, liang2020we] directly perform adaptation and omit the calibration steps of our CAF framework. Our two calibration steps fill this gap and are flexible and “plug-and-play”. Therefore, we add our calibration modules to existing SFUDA works [yang2021exploiting, liang2020we] and report the experimental results in Table VI. It can be seen that adding our calibration modules to SHOT/NRC improves accuracy by 0.5/0.8%, 1.1/0.3%, and 2.2/1.6% on Cl $\to$ Ar, Cl $\to$ Pr and Cl $\to$ Re, respectively. This supports that our calibration modules are “plug-and-play” and effective for different SFUDA baselines.
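The per-task gains can be recomputed directly from the accuracies in Table VI. The short sketch below copies those accuracies into a dictionary (the layout and the helper name `gains` are ours) and subtracts each baseline from its calibrated counterpart.

```python
# Accuracies (%) copied from Table VI.
table_vi = {
    "SHOT":      {"Cl->Ar": 68.0, "Cl->Pr": 78.2, "Cl->Re": 78.1},
    "SHOT+Ours": {"Cl->Ar": 68.5, "Cl->Pr": 79.3, "Cl->Re": 80.3},
    "NRC":       {"Cl->Ar": 68.1, "Cl->Pr": 79.8, "Cl->Re": 78.6},
    "NRC+Ours":  {"Cl->Ar": 68.9, "Cl->Pr": 80.1, "Cl->Re": 80.2},
}


def gains(base, plus):
    """Accuracy gain (percentage points) of `plus` over `base`, per task."""
    return {task: round(table_vi[plus][task] - table_vi[base][task], 1)
            for task in table_vi[base]}


shot_gain = gains("SHOT", "SHOT+Ours")  # {'Cl->Ar': 0.5, 'Cl->Pr': 1.1, 'Cl->Re': 2.2}
nrc_gain = gains("NRC", "NRC+Ours")     # {'Cl->Ar': 0.8, 'Cl->Pr': 0.3, 'Cl->Re': 1.6}
```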

Measurement	Ar $\to$ Cl		Ar $\to$ Pr		Ar $\to$ Re		Cl $\to$ Ar		Cl $\to$ Pr		Cl $\to$ Re		Pr $\to$ Ar		Pr $\to$ Cl		Pr $\to$ Re		Re $\to$ Ar		Re $\to$ Cl		Re $\to$ Pr
Measurement	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$
MMD $↓$	0.38	0.41	0.11	0.11	0.50	0.56	0.71	0.74	0.20	0.21	0.17	0.18	0.48	0.54	0.50	0.57	0.32	0.30	0.30	0.31	0.58	0.54	0.19	0.24
A-Distance $↓$	1.47	1.51	1.23	1.35	0.80	0.86	1.40	1.43	1.20	1.25	1.40	1.44	1.33	1.45	1.47	1.52	0.84	0.87	1.00	1.07	1.47	1.45	0.85	0.88
Corresponding Angle $↑$	-0.14	-0.15	-0.05	-0.13	-0.72	-0.71	0.97	0.94	0.98	0.97	0.99	0.97	0.27	0.09	-0.09	-0.49	0.31	0.28	0.29	0.21	-0.12	0.40	0.23	0.08
LogME $↑$	0.83	0.82	0.93	0.92	0.89	0.87	0.85	0.81	0.84	0.82	0.84	0.81	0.83	0.82	0.81	0.80	0.86	0.84	0.84	0.83	0.84	0.82	0.92	0.90
LEEP $↑$	-3.66	-3.79	-3.30	-3.35	-3.25	-3.34	-2.81	-2.98	-2.39	-2.65	-2.38	-2.54	-3.51	-3.63	-3.75	-3.85	-3.15	-3.28	-3.20	-3.36	-3.51	-3.61	-3.12	-3.24
NCE $↑$	-2.05	-2.17	-1.21	-1.31	-1.12	-1.51	-1.71	-1.99	-1.43	-1.58	-1.44	-1.43	-1.82	-2.07	-2.30	-2.55	-1.21	-1.39	-2.43	-2.54	-2.09	-2.41	-0.93	-1.05
Accuracy(%) $↑$	49.5	47.1	60.3	58.2	62.9	61.2	48.6	47.2	59.5	57.9	61.7	60.4	48.4	45.3	38.7	33.9	68.8	67.5	61.8	60.2	43.5	39.6	75.4	73.1

TABLE VII: Comparison with different transferability measurements on the Office-Home tasks. $Z_{l o w}$ and $Z_{h i g h}$ are features with low $U T R_{D}$ and high $U T R_{D}$ , respectively. $↑$ / $↓$ indicates that the larger/smaller the value, the higher the transferability. In each task, current transferability measurements are calculated on $Z_{l o w}$ and $Z_{h i g h}$ , respectively. The more transferable one is bolded.

Measurement	A $\to$ D		A $\to$ W		D $\to$ A		D $\to$ W		W $\to$ A		W $\to$ D		Synthetic $\to$ Real
Measurement	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$	$Z_{l o w}$	$Z_{h i g h}$
MMD $↓$	0.95	0.96	0.23	0.26	0.50	0.55	0.50	0.53	0.16	0.17	0.30	0.27	0.55	0.61
A-Distance $↓$	1.72	1.81	1.70	1.78	1.64	1.77	0.93	1.35	1.42	1.45	0.93	1.2	0.16	0.17
Corresponding Angle $↑$	0.7	0.61	0.15	0.11	0.10	-0.05	0.57	0.12	0.27	-0.02	-0.2	-0.3	0.38	0.11
LogME $↑$	0.74	0.70	0.75	0.72	0.61	0.60	0.74	0.63	0.64	0.63	0.78	0.76	0.21	0.20
LEEP $↑$	-2.51	-2.65	-2.11	-2.41	-3.01	-3.55	-2.44	-2.51	-3.14	-3.00	-2.00	-2.01	-0.20	-0.22
NCE $↑$	-0.55	-0.79	-0.69	-0.84	-1.64	-1.55	-0.28	-0.38	-1.51	-1.65	-0.14	-0.15	-1.03	-0.32
Accuracy(%) $↑$	80.3	73.1	74.8	68.4	54.6	50.0	92.5	89.1	59.6	55.4	98.7	97.1	51.3	45.9

TABLE VIII: Comparison with different transferability measurements on the Office-31 and VisDA tasks. $Z_{l o w}$ and $Z_{h i g h}$ are features with low $U T R_{D}$ and high $U T R_{D}$ , respectively. $↑$ / $↓$ indicates that the larger/smaller the value, the higher the transferability. In each task, current transferability measurements are calculated on $Z_{l o w}$ and $Z_{h i g h}$ , respectively. The more transferable one is bolded.

VI-D Empirical Analysis of UTR

The effectiveness of $U T R_{D}$ . The $U T R_{D}$ describes the domain-level transferability over the channel axis, which identifies how transferable each channel of the source encoder is to the target domain. To evaluate the effectiveness of $U T R_{D}$ , we conduct the following experiments.

Implementation. The experiments are conducted on the Office-31, Office-Home, and VisDA tasks. For the Office-31 and Office-Home tasks, the ResNet-50 backbone together with a fc layer forms the source encoder, whose output has $d = 256$ channels. A fc layer with weight normalization is the classifier. For the VisDA tasks, we replace ResNet-50 with ResNet-101 and keep the other settings the same. We follow [liang2020we, yang2021exploiting] to train the source model. For each task, we feed all target data through the pre-trained source model and calculate $U T R_{D} (h_{s})$ according to Equation 6, where $T = 2$ and $r_{t}$ is randomly sampled from $U (- 0.05, 0.05)$ .

Comparison Protocols. We evaluate our $U T R_{D}$ by comparing it with existing transferability measurements, which are inapplicable in SFUDA, including: MMD [gretton2006kernel], A-Distance [ben2010theory], Corresponding Angle [chen2019transferability], LogME [you2021logme], LEEP [nguyen2020leep] and NCE[tran2019transferability]. In addition, the performance (prediction accuracy) is considered as an extra intuitive measurement. Since existing transferability measurements are not suitable for the feature of a single channel, we design the following comparison protocol. Specifically, we sort the 256 channels of the representation $z = h_{s} (x)$ by their $U T R_{D}^{i} (h_{s})$ and split them into two separate 128-channel vectors $Z_{l o w}$ and $Z_{h i g h}$ , containing the channels with the 128 smallest and the 128 largest $U T R_{D}^{i} (h_{s})$ , respectively. In other words, the conclusion of $U T R_{D} (h_{s})$ is that $Z_{l o w}$ is more transferable than $Z_{h i g h}$ . Then we calculate the existing transferability measurements on $Z_{l o w}$ and $Z_{h i g h}$ and report the consistency of their conclusions with ours. Note that the source data and target annotations are provided when computing these measurements. The results are reported in Tables VII and VIII.
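The channel split used in this protocol can be sketched as follows. The function assumes that $U T R_{D} (h_{s})$ is given as a list with one value per channel; the name `split_channels` is hypothetical.

```python
def split_channels(utr_d):
    """Return (low_idx, high_idx): channel indices of the lower and upper
    half of the channels when ranked by UTR_D."""
    order = sorted(range(len(utr_d)), key=lambda j: utr_d[j])
    half = len(order) // 2
    return sorted(order[:half]), sorted(order[half:])


# Toy example with 4 channels instead of 256.
low_idx, high_idx = split_channels([0.9, 0.1, 0.5, 0.3])  # ([1, 3], [0, 2])
```

With 256 channels, `low_idx` would hold the 128 channels that form $Z_{l o w}$ and `high_idx` the 128 channels that form $Z_{h i g h}$ .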

First, we quantitatively measure the transferability of $Z_{l o w}$ and $Z_{h i g h}$ with two measurements from vanilla UDA, the MMD and the A-Distance, which require the source data. These measurements quantify the domain discrepancy between the source and target domains: the lower the MMD/A-Distance, the more transferable the model. From Tables VII and VIII, it can be seen that in 15 out of 19 adaptation tasks, the MMD of $Z_{l o w}$ is lower than that of $Z_{h i g h}$ , and in 18 tasks, the A-Distance of $Z_{l o w}$ is lower than that of $Z_{h i g h}$ . The result indicates that $U T R_{D}$ is consistent with these domain discrepancy measurements in most cases, which suggests that features in channels with lower $U T R_{D}$ are more effective at bridging the domain discrepancy and are therefore more transferable.

Second, we compare with the Corresponding Angle, which was proposed by Chen et al. [chen2019transferability] based on their observation that the eigenvectors with the largest singular values dominate the feature transferability. We observe that in 17 of the 19 cases, the Corresponding Angle of $Z_{l o w}$ is larger than that of $Z_{h i g h}$ , demonstrating the consistency of $U T R_{D}$ with the Corresponding Angle.

Third, we compare the consistency of our method with the transferability measurements LogME, NCE, and LEEP, which estimate the potential of the source model parameters for learning a well-performing target model through refinement. In Tables VII and VIII, it can be seen that $U T R_{D}$ is consistent with LogME in all tasks, and also consistent with NCE and LEEP in 16 and 18 cases, respectively. These results indicate that the source model parameters that encode $Z_{l o w}$ are more transferable than those that encode $Z_{h i g h}$ .
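The consistency counts discussed above follow from a simple pairwise comparison. As a sketch, the helper below (a hypothetical name) counts the tasks in which a measurement ranks $Z_{l o w}$ as more transferable than $Z_{h i g h}$ , and is applied to the MMD and LogME rows of Table VIII only.

```python
def count_consistent(pairs, higher_is_better):
    """Count tasks where the measurement ranks Z_low as more transferable
    than Z_high. Each pair is (value on Z_low, value on Z_high)."""
    if higher_is_better:
        return sum(1 for low, high in pairs if low > high)
    return sum(1 for low, high in pairs if low < high)


# (Z_low, Z_high) values copied from Table VIII, one pair per task.
mmd_viii = [(0.95, 0.96), (0.23, 0.26), (0.50, 0.55), (0.50, 0.53),
            (0.16, 0.17), (0.30, 0.27), (0.55, 0.61)]
logme_viii = [(0.74, 0.70), (0.75, 0.72), (0.61, 0.60), (0.74, 0.63),
              (0.64, 0.63), (0.78, 0.76), (0.21, 0.20)]

mmd_agree = count_consistent(mmd_viii, higher_is_better=False)     # 6 of 7
logme_agree = count_consistent(logme_viii, higher_is_better=True)  # 7 of 7
```

Ties are counted as inconsistent here, which is a conservative choice of ours rather than something the text specifies.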

Finally, we also calculate the classification performance of $Z_{l o w}$ and $Z_{h i g h}$ on the target domain. For $Z_{l o w}$ , for example, we set the features of the channels that do not belong to $Z_{l o w}$ to zero, feed the modified feature into the classifier to obtain the prediction, and calculate the prediction accuracy. $Z_{h i g h}$ is evaluated in the same way. As shown in Tables VII and VIII, the prediction using $Z_{l o w}$ is more accurate than that using $Z_{h i g h}$ in the target domain on all adaptation tasks. For example, the accuracy of $Z_{l o w}$ exceeds that of $Z_{h i g h}$ by 3.1%, 4.8% and 1.3% on Pr $\to$ Ar, Pr $\to$ Cl and Pr $\to$ Re, respectively. Note that we did not further train the source model but only split its channels into the two parts $Z_{l o w}$ and $Z_{h i g h}$ according to our $U T R_{D}$ . The performance gap between the two parts indicates that $Z_{l o w}$ , with lower $U T R_{D}$ , is more transferable to the target domain than $Z_{h i g h}$ . The experimental observations from the series of studies above illustrate that 1) the proposed $U T R_{D}$ is strongly consistent with current transferability measurements and can estimate the transferability, and 2) our method can effectively analyse the internal transferability of the source model: the channels of the source encoder with lower $U T R_{D}$ are more transferable to the target domain, which allows us to leverage the source knowledge more efficiently and safely and supports our motivation.
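The masking evaluation described above can be sketched as follows. The example is illustrative only: `toy_classifier` is a made-up two-class rule standing in for the source classifier $g_{s}$ , and the feature values are invented.

```python
def mask_channels(features, keep_idx):
    """Zero every channel that is not listed in keep_idx."""
    keep = set(keep_idx)
    return [[v if j in keep else 0.0 for j, v in enumerate(row)]
            for row in features]


def accuracy(features, labels, classifier):
    """Fraction of instances whose predicted class matches the label."""
    correct = sum(1 for row, y in zip(features, labels) if classifier(row) == y)
    return correct / len(labels)


def toy_classifier(z):
    # Hypothetical 2-class linear rule, used only for this illustration.
    return 0 if z[0] + z[1] >= z[2] + z[3] else 1


features = [[2.0, 1.0, 0.5, 0.0], [0.1, 0.2, 3.0, 1.0]]
labels = [0, 1]
acc_low = accuracy(mask_channels(features, [0, 2]), labels, toy_classifier)
```

Evaluating $Z_{h i g h}$ only requires passing the complementary index list to `mask_channels`; the classifier itself is left unchanged, as in the protocol.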

Fig. 7: (a)-(c): The effectiveness of the $U T R_{D}$ for channels of different layers. (a): At the last layer. (b): At the penultimate layer. (c): At the antepenultimate layer. (d)-(f): The effectiveness of the $U T R_{D}$ for the target model during the adaptation process (trained with 5, 10, 15, 20, 25 and 30 epochs) on (d) $A r \to C l$ , (e) $A r \to P r$ , and (f) $A r \to R e$ . $Z_{l o w}$ and $Z_{h i g h}$ represent the channels with the 128 smallest $U T R_{D}^{i} (h_{s})$ and the 128 largest $U T R_{D}^{i} (h_{s})$ , respectively.

The effectiveness of $U T R_{I}$ . The $U T R_{I}$ identifies the instance-level reliability of inferring target semantics using the source model. To evaluate its effectiveness, we draw the Accuracy- $U T R_{I}$ curve in Fig. 5, which describes the relationship between the $U T R_{I}$ of target samples and the prediction accuracy of the source model on these samples. The source model is more accurate on samples with small $U T R_{I}$ , and the prediction accuracy tends to decrease as $U T R_{I}$ increases. This demonstrates that $U T R_{I}$ is effective in identifying the instance-level risk of inferring target semantic labels.
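An Accuracy- $U T R_{I}$ curve of this kind can be approximated as follows. The sketch assumes equal-frequency binning of the samples by $U T R_{I}$ , which may differ from how the curve in Fig. 5 was actually produced; the function name `accuracy_utr_curve` and the toy values are hypothetical.

```python
from typing import List, Sequence, Tuple


def accuracy_utr_curve(
    utr_i: Sequence[float],   # instance-level UTR of each target sample
    correct: Sequence[bool],  # whether the source model classified the sample correctly
    num_bins: int,
) -> List[Tuple[float, float]]:
    """Return (mean UTR_I, accuracy) for equal-frequency bins of increasing UTR_I."""
    order = sorted(range(len(utr_i)), key=lambda i: utr_i[i])
    n = len(order)
    curve = []
    for b in range(num_bins):
        members = order[b * n // num_bins:(b + 1) * n // num_bins]
        if not members:
            continue  # more bins than samples: skip empty bins
        mean_utr = sum(utr_i[i] for i in members) / len(members)
        accuracy = sum(bool(correct[i]) for i in members) / len(members)
        curve.append((mean_utr, accuracy))
    return curve


# Toy values (made up): low-UTR_I samples are all predicted correctly.
toy_utr_i = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
toy_correct = [True, False, True, False, True, True]
for mean_utr, acc in accuracy_utr_curve(toy_utr_i, toy_correct, num_bins=2):
    print(f"mean UTR_I = {mean_utr:.2f}, accuracy = {acc:.2f}")
```

A curve that falls from the first bin to the last corresponds to the trend reported above, namely that accuracy decreases as $U T R_{I}$ grows.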

Extension to other layers. In our previous experiments, UTR is calculated using the last layer output of the feature extractor (the FC layer). Here, we explore the feasibility of extending the UTR to channels of other layers. To this end, we use average pooling to extract channel features $z \in R^{2048}$ from the last, penultimate, and antepenultimate bottlenecks of the ResNet-50 backbone, respectively. We then calculate the UTR of these channels and evaluate the effectiveness of $U T R_{D}$ and $U T R_{I}$ . For the $U T R_{D}$ , the evaluation method is similar to the previous one: the feature $z$ is divided into two 1024-channel vectors $Z_{l o w}$ and $Z_{h i g h}$ according to $U T R_{D} (h_{s})$ , and their corresponding angles are compared to evaluate their transferability. The results are shown in Fig. 7. The corresponding angle of $Z_{l o w}$ , with lower $U T R_{D}$ , is larger than that of $Z_{h i g h}$ , with higher $U T R_{D}$ . Therefore, the $U T R_{D}$ is also effective at the last, penultimate, and antepenultimate bottlenecks of ResNet-50. The similar trends across the features of multiple layers suggest that our transferability index has the potential to extend to the feature representations of other layers. For the $U T R_{I}$ , the Accuracy- $U T R_{I}$ curve is shown in Fig. 6 (a). The $U T R_{I}$ behaves well at the last bottleneck of the ResNet-50 backbone. At the penultimate and antepenultimate bottlenecks, it may fail at some points (for example, when $\tau = 2.5$ ), but it is effective overall.
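The feature extraction and channel split used for intermediate layers can be illustrated with a small sketch. It assumes a feature map stored as nested lists (channels, height, width) and a precomputed per-channel $U T R_{D}$ vector; `global_average_pool` and `split_in_half` are hypothetical helper names, and the computation of the corresponding angle is not reproduced here.

```python
from typing import List, Sequence, Tuple

FeatureMap = Sequence[Sequence[Sequence[float]]]  # channels x height x width


def global_average_pool(feature_map: FeatureMap) -> List[float]:
    """Collapse a C x H x W feature map into one value per channel."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def split_in_half(utr_d: Sequence[float]) -> Tuple[List[int], List[int]]:
    """Indices of the lower-UTR_D half (Z_low) and the higher-UTR_D half (Z_high)."""
    order = sorted(range(len(utr_d)), key=lambda i: utr_d[i])
    half = len(order) // 2
    return order[:half], order[half:]


# Toy example with 2 channels of size 2 x 2 (a ResNet-50 bottleneck would give 2048 channels).
toy_map = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 4.0]]]
print(global_average_pool(toy_map))

# Toy UTR_D values for 4 channels; with 2048 channels each half would hold 1024 indices.
z_low_idx, z_high_idx = split_in_half([0.4, 0.1, 0.3, 0.2])
print(z_low_idx, z_high_idx)
```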

Fig. 8: The Grad-CAM [selvaraju2017grad] visualization of the features of channels with the smallest $U T R_{D}$ on the source and target domains.

Fig. 10: The t-SNE visualizations of features of $Z_{l o w}$ and $Z_{h i g h}$ on the target domain (Cl) of the Office-Home task Ar $\to$ Cl. For a better illustration, we choose features in the first 6 classes, and different colors denote different classes. Best viewed in color.

Fig. 11: The t-SNE visualizations of different methods on the target domain (Ar) of the Office-Home task Cl $\to$ Ar, including the source model, SHOT, NRC and Ours. For a better illustration, we choose features in the first 6 classes, and different colors denote different classes. Best viewed in color.

Measurement	MMD $↓$	A-Distance $↓$	CA $↑$	LogME $↑$	LEEP $↑$	NCE $↑$
VGG16 $Z_{l o w}$	0.60	1.36	-0.01	0.81	-3.61	-2.36
VGG16 $Z_{h i g h}$	0.64	1.44	-0.21	0.80	-3.63	-2.43
AlexNet $Z_{l o w}$	0.51	1.31	0.50	0.76	-3.80	-2.88
AlexNet $Z_{h i g h}$	0.57	1.28	0.25	0.75	-3.85	-2.97

TABLE IX: Comparison with existing transferability measurements with different model structures on the Office-Home task Ar $\to$ Cl. The $↑$ / $↓$ indicates that the larger/smaller the value, the higher the transferability.

Extension to other network architectures. In this section, we evaluate the effectiveness of UTR on different backbone models, including VGG16 [simonyan2014very] and AlexNet [krizhevsky2012imagenet]. The experiments are conducted on the Office-Home task Ar $\to$ Cl. The results of $U T R_{D}$ are reported in Table IX, and the results of $U T R_{I}$ are shown in Fig. 6 (b). With both backbone architectures, the $U T R_{D}$ is consistent with the most recent transferability measurements. In addition, the $U T R_{I}$ is able to reveal the target semantics risk, which demonstrates that our UTR method can be applied to different model architectures.

Extension to the target model. We have investigated the effectiveness of our UTR on the source model. In this subsection, we evaluate it on the target model during the adaptation process. The results of $U T R_{D}$ and $U T R_{I}$ are shown in Fig. 7 (d)-(f) and Fig. 6 (c), respectively. From Fig. 7 (d)-(f), we can observe that the $U T R_{D}$ is also effective for the target model in the first few steps of adaptation.

Specifically, in the first 5 steps, $Z_{l o w}$ has a larger corresponding angle between the source and the target domain than $Z_{h i g h}$ , which indicates that it is more transferable than $Z_{h i g h}$ . However, after training for a period, the $U T R_{D}$ is no longer adequate for analysing the target model. For example, at epoch 10/30 in Fig. 7 (d), the corresponding angle of $Z_{l o w}$ is lower than that of $Z_{h i g h}$ .

The same phenomenon can be observed for $U T R_{I}$ . From Fig. 6 (c), we can observe that the $U T R_{I}$ is effective at epoch 0 and epoch 5, but becomes invalid at epoch 15. The main reason may be that, after a period of training, the model gradually adapts to the target domain. Thus, it no longer needs, or even actively abandons, the source knowledge. It is worth noting that if the target model does not fit the source domain well, the model uncertainty in Equation 2 cannot be ignored. Therefore, the UD may not be computable without access to the source data.

Calculating the $U T R_{D}$ stochastically. In our implementation, the calculation of $U T R_{D}$ requires a forward pass over all target samples. As a statistical measurement, $U T R_{D}$ can also be adapted to an online version, where $U T R_{D}$ is updated using the moving average method widely used in Batch Normalization [ioffe2015batch]. We conducted additional experiments using the moving-average calculation of $U T R_{D}$ , with the results denoted as “Ours-Online”. We set the momentum to 0.1 and conduct the experiment on three Office-Home tasks: Ar $\to$ Cl, Ar $\to$ Pr, and Ar $\to$ Re. The results in Table IV show that the performance of the online version on the three tasks is 59.2%, 81.2% and 83.1%, respectively. These results are very similar to those of the original version, “Ours”.
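A moving-average update of this kind might look like the sketch below. It follows the Batch Normalization convention in which the momentum weights the new batch estimate, i.e. $\text{running} \leftarrow (1 - m) \cdot \text{running} + m \cdot \text{batch}$ with $m = 0.1$. The class name `OnlineUTRD`, the initialisation from the first batch, and the way a per-batch $U T R_{D}$ estimate is obtained are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Optional, Sequence


class OnlineUTRD:
    """Exponential moving average of per-channel UTR_D estimates (hypothetical helper)."""

    def __init__(self, momentum: float = 0.1) -> None:
        self.momentum = momentum
        self.running: Optional[List[float]] = None

    def update(self, batch_utr_d: Sequence[float]) -> List[float]:
        """Blend one mini-batch estimate into the running estimate and return it."""
        if self.running is None:
            # Assumption: initialise with the first batch instead of zeros.
            self.running = list(batch_utr_d)
        else:
            m = self.momentum
            self.running = [
                (1.0 - m) * r + m * b for r, b in zip(self.running, batch_utr_d)
            ]
        return self.running


# Toy usage: two mini-batch estimates for a 2-channel encoder (values made up).
tracker = OnlineUTRD(momentum=0.1)
tracker.update([1.0, 2.0])
print(tracker.update([2.0, 0.0]))
```

With a small momentum the running estimate changes slowly, so a single noisy mini-batch has limited influence on which channels are treated as transferable.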

Uncertainty Estimation. We evaluate the performance of using different uncertainty estimation methods to calculate $U T R_{D}$ , i.e., the $M (.)$ in Equation 5, including sensitivity analysis [nagy2007distributional], Deep Ensembles [lakshminarayanan2016simple], and Monte Carlo dropout [gal2016dropout]. The latter two are denoted as “Ours-Ensemble” and “Ours-MC dropout”, respectively. Table IV shows that our method is not sensitive to the choice of uncertainty estimation method.
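As a rough illustration of the sampling-based estimators, the sketch below computes a per-channel variance across several stochastic forward passes (Monte Carlo dropout); the same `channel_uncertainty` function could be applied to the outputs of ensemble members instead. Treating this variance as $M (.)$ is an assumption made here for illustration, since Equation 5 is not reproduced in this section, and the one-layer "encoder", the dropout rate and all helper names are hypothetical.

```python
import random
from statistics import pvariance
from typing import List, Sequence


def dropout(x: Sequence[float], p: float, rng: random.Random) -> List[float]:
    """Inverted dropout: zero each entry with probability p, rescale the rest."""
    return [0.0 if rng.random() < p else v / (1.0 - p) for v in x]


def linear(x: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Toy encoder layer: one output channel per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def channel_uncertainty(passes: Sequence[Sequence[float]]) -> List[float]:
    """Per-channel variance across stochastic passes or ensemble members."""
    return [pvariance(channel) for channel in zip(*passes)]


def mc_dropout_uncertainty(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    p: float = 0.5,
    num_passes: int = 20,
    seed: int = 0,
) -> List[float]:
    """Run several dropout-perturbed passes and return the per-channel variance."""
    rng = random.Random(seed)
    passes = [linear(dropout(x, p, rng), weights) for _ in range(num_passes)]
    return channel_uncertainty(passes)


# Toy usage: the second channel has zero weights, so its output never varies.
print(mc_dropout_uncertainty([1.0, 2.0], [[1.0, 0.0], [0.0, 0.0]]))
```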

Visualization. Fig. 8 and Fig. 9 illustrate the Grad-CAM [selvaraju2017grad] feature visualization of a source model on the source and target data. Fig. 8 visualizes the feature of the most transferable channel selected by our proposed $U T R_{D}$ (i.e., with the smallest $U T R_{D}$ ). Fig. 9 shows the feature of the least transferable one (i.e., with the largest $U T R_{D}$ ). The feature in Fig. 8 captures the semantic information “screen” on the source domain and retains the same semantic information on the target domain, which indicates that it is transferable to the target domain. In contrast, the feature in Fig. 9 seems to focus on “keyboard” in the source domain but fails to capture the same semantics on the target data, which suggests that it is non-transferable to the target domain.

Fig. 10 shows the t-SNE [van2008visualizing] visualizations of the features of $Z_{l o w}$ (the features of the 128 channels with low $U T R_{D}$ ) and $Z_{h i g h}$ (the features of the 128 channels with high $U T R_{D}$ ) on the task Cl $\to$ Ar. The semantic information extracted by $Z_{l o w}$ in the target domain is more discriminative than that extracted by $Z_{h i g h}$ , which qualitatively indicates that $Z_{l o w}$ is more transferable to the target domain. These observations demonstrate that $U T R_{D}$ is effective for estimating the transferability of the knowledge in the source encoder.

By estimating the transferability of different channels of the source encoder, our method can incorporate more valuable knowledge into the target domain to learn a more discriminative target model. To illustrate this, we provide the t-SNE visualizations of the features obtained by the original source model, SHOT, NRC and our method on the task Cl $\to$ Ar in Fig. 11. As expected, the features extracted by our method are more semantically discriminative.

VII Method Limitation

In this paper, we propose the Uncertainty-induced Transferability Representation (UTR) to explore the transferability of the source model in the absence of source data and target annotations. We empirically demonstrate the effectiveness and universality of the domain-level UTR and the instance-level UTR, which help the SFUDA community leverage the knowledge of the source model and target data fully and safely. However, the method has the following two limitations.

First, we use the distributional uncertainty to approximate the implicit uncertainty distance, which assumes that the model uncertainty is small enough to be ignored. As discussed in Section VI-D, “Extension to the target model”, the model uncertainty represents how well the pre-trained model covers the training distribution, so the assumption may be violated as the model gradually adapts to the target domain. Quantifying when this assumption is violated is left for future work.

Second, we demonstrate the consistency of $U T R_{D}$ with existing domain discrepancy measurements. However, at present it is only a way to analyse transferability, rather than a rigorous domain distribution divergence that can be explicitly optimized, such as MMD or A-Distance. We hope that future research will address this limitation.

VIII Conclusions

In this paper, we develop a novel measurement termed Uncertainty-induced Transferability Representation (UTR), which uses uncertainty distance as a tool to estimate transferability in the absence of source data and target annotations. The domain-level UTR describes how transferable each source feature dimension is to the target domain, and the instance-level UTR identifies the reliability of the inferred target semantics. Based on the UTR, we propose a novel Calibrated Adaption Framework (CAF) for SFUDA, including a source knowledge calibration module that guides the target model to learn transferable knowledge and discard non-transferable knowledge, and a target semantics calibration module that calibrates the target semantics. The calibrated source knowledge and target semantics help the target model fully and safely leverage the source knowledge and target data, ultimately enabling it to adapt better to the target domain. We verified the effectiveness of our method experimentally and demonstrated that it achieves state-of-the-art performance on three SFUDA benchmarks.
