Identifying Latent Causal Content for Multi-Source Domain Adaptation

Yuhang Liu¹, Zhen Zhang¹, Dong Gong², Mingming Gong³, Biwei Huang⁴,
Kun Zhang⁵, Javen Qinfeng Shi¹
¹ Australian Institute for Machine Learning, The University of Adelaide, Australia
² School of Computer Science and Engineering, The University of New South Wales, Australia
³ School of Mathematics and Statistics, The University of Melbourne, Australia
⁴ Halicioğlu Data Science Institute (HDSI), University of California San Diego, USA
⁵ Department of Philosophy, Carnegie Mellon University, USA
yuhang.liu01@adelaide.edu.au

Abstract

Multi-source domain adaptation (MSDA) learns to predict the labels in target domain data, under the setting where all data from multiple source domains are labeled and the data from the target domain are unlabeled. To handle this problem, most methods focus on learning invariant representations across domains. However, their success relies heavily on the assumption that the label distribution remains unchanged across domains. To mitigate this, we propose a new assumption, latent covariate shift (LCS), where the marginal distribution of the latent content variable changes across domains, and the conditional distribution of the label given the latent content remains invariant across domains. We introduce a latent style variable to complement the latent content variable, forming a latent causal graph as the data and label generating process. We show that although the latent style variable is unidentifiable due to the transitivity property in the latent space, the latent content variable can be identified up to simple scaling under some mild conditions. This motivates us to propose a novel method for MSDA, which learns the invariant label distribution conditional on the latent content variable, instead of learning invariant representations. Empirical evaluation on simulation and real data demonstrates the effectiveness of the proposed method, compared with many state-of-the-art methods based on invariant representation.

1 Introduction

Traditional machine learning requires the training and testing data to be independent and identically distributed (vapnik1999overview). This strict condition may not be fulfilled in many real-world applications. For example, in medical applications, it is common to train a model on patients from a few hospitals and generalize it to a new hospital (zech2018variable). In this case it is often reasonable to consider that the distributions of data from the training hospitals differ from that of the new hospital (koh2021wilds). Domain adaptation (DA) is a promising research area for handling such problems. DA is frequently considered in the multi-source setting (MSDA), where source domain data are collected from multiple domains. Formally, let $x$ denote the input, e.g., an image, $y$ denote the labels in source and target domains, and $D$ denote the domain index. We observe labeled data pairs $(x^{S}, y^{S})$ from the multiple joint distributions $p (x, y | D_{1}), p (x, y | D_{2}), \ldots, p (x, y | D_{M})$ in source domains, and unlabeled input data $x^{T}$ from the joint distribution $p (x, y | D_{T})$ in the target domain. The training phase of MSDA uses $(x^{S}, y^{S})$ and $x^{T}$ to train a predictor that provides a satisfactory estimate of $y^{T}$ in the target domain. The key for MSDA is to understand how the joint distribution changes across domains, resulting in $p (x, y | D_{1}) \neq p (x, y | D_{2}) \neq \ldots \neq p (x, y | D_{M}) \neq p (x, y | D_{T})$ .

Most early methods assume that the change of the joint distribution results from Covariate Shift (huang2006correcting; bickel2007discriminative; sugiyama2007direct; wen2014robust), i.e., $p (x, y | D_{i}) = p (y | x) p (x | D_{i})$ , as depicted by Figure 1(a). This setting assumes that $p (x)$ changes across domains, while the conditional distribution $p (y | x)$ is invariant across domains. Such an assumption is too strong for some real applications, especially for image data. For example, the assumption of invariant $p (y | x)$ implies that $p (y)$ changes as $p (x)$ changes. However, we can easily change style information in image data that is irrelevant to the label (e.g., background, view), so that the distribution $p (x)$ changes while the distribution $p (y)$ remains unchanged, which clearly violates the assumption.

Figure 1: The illustration of three different assumptions for MSDA: (a) Covariate Shift, (b) Conditional Shift, (c) Latent Covariate Shift.

In contrast to covariate shift above, most current works consider a more reasonable assumption, Conditional Shift, as depicted by Figure 1(b). It assumes that $p (x | y)$ changes while the distribution $p (y)$ is invariant across domains (zhang2013domain; zhang2015multi; scholkopf2012causal; stojanov2021domain; peng2019moment). This setting motivates a popular class of methods that learn invariant representations across different domains to approach the true $z$ in Figure 1(b) (ganin2016domain; zhao2018adversarial; saito2018maximum; mancini2018boosting; yang2020curriculum; wang2020learning; li2021t; stojanov2021domain). However, the label distribution $p (y)$ may change across domains in many real application scenarios (tachet2020domain; lipton2018detecting; zhang2013domain). In these scenarios, methods based on learning invariant representations may suffer from degraded performance. In fact, recent work has proven an upper bound on the performance of those methods when label distributions change across domains (zhao2019learning).

To relieve the problem above, this work proposes a new assumption, Latent Covariate Shift (LCS), as depicted by Figure 1(c). Unlike conditional shift, LCS assumes that there is a latent content variable $z_{c}$ , whose marginal distribution $p (z_{c})$ varies across domains, while the label distribution conditional on it, $p (y | z_{c})$ , is invariant. Because $p (z_{c})$ changes across domains while $p (y | z_{c})$ stays invariant, LCS allows the label distribution $p (y)$ to change across domains. To more deeply understand and handle LCS, we propose a latent causal graph to formulate the data and label generating process, using two latent variables, i.e., the latent content variable $z_{c}$ and the latent style variable $z_{s}$ , as depicted in Figure 2. In the proposed latent causal graph, the domain variable $D$ causes the independent exogenous variables $n_{c}$ and $n_{s}$ , which correspond to the latent content variable $z_{c}$ and the latent style variable $z_{s}$ , respectively. The observed input data is caused by both $z_{c}$ and $z_{s}$ , while the label is caused only by $z_{c}$ . We show that although it is often impossible to identify the latent style variable $z_{s}$ without strong assumptions, due to the transitivity property in latent space, the latent content variable $z_{c}$ can be identified up to simple scaling by using the identifiability result from nonlinear ICA (khemakhem2020variational) and the dependence between $n_{c}$ and $y$ . This motivates us to propose a novel method to learn the invariant conditional distribution $p (y | z_{c})$ for MSDA, instead of learning invariant representations conditional on the label. Since $z_{c}$ is identifiable, the proposed method provides a principled way to guarantee that the learned predictor $p (y | z_{c})$ can be generalized to the target domain. Empirical evaluation on synthetic and real data demonstrates the effectiveness of the proposed method, compared with many state-of-the-art methods.
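The LCS assumption can be illustrated with a small simulation. The following is a minimal sketch rather than the setup used in this work: the two domains, the Gaussian form of $p (z_{c} | D)$ , the logistic form of $p (y | z_{c})$ and the helper names `sample_domain` and `label_rate` are all assumptions made for illustration. It shows that the label marginal shifts across domains while the label rate within a fixed range of $z_{c}$ stays approximately the same.

```python
import math
import random

def sample_domain(mean_zc, n, rng):
    """Draw (z_c, y) pairs for one domain under latent covariate shift."""
    pairs = []
    for _ in range(n):
        z_c = rng.gauss(mean_zc, 1.0)           # p(z_c | D): domain specific (assumed Gaussian)
        p_y1 = 1.0 / (1.0 + math.exp(-z_c))      # p(y=1 | z_c): shared by all domains (assumed logistic)
        y = 1 if rng.random() < p_y1 else 0
        pairs.append((z_c, y))
    return pairs

def label_rate(pairs, lo=None, hi=None):
    """Fraction of y=1, optionally restricted to samples with lo <= z_c < hi."""
    ys = [y for z, y in pairs if (lo is None or (lo <= z < hi))]
    return sum(ys) / len(ys)

rng = random.Random(0)
dom_a = sample_domain(-1.0, 20000, rng)   # hypothetical domain with small z_c
dom_b = sample_domain(+1.0, 20000, rng)   # hypothetical domain with large z_c

# The label marginal p(y) differs between the two domains ...
print("p(y=1):", label_rate(dom_a), label_rate(dom_b))
# ... while the label rate inside a fixed z_c range is roughly the same.
print("p(y=1 | 0 <= z_c < 0.5):", label_rate(dom_a, 0.0, 0.5), label_rate(dom_b, 0.0, 0.5))
```

A predictor fitted on $z_{c}$ would therefore transfer between the two toy domains, whereas one that relies on a fixed label prior would not.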

2 Related work

Learning invariant representations. Due to the strong assumptions in covariate shift, most current works on domain adaptation consider conditional shift and learn invariant representations across domains (ganin2016domain; zhao2018adversarial; saito2018maximum; mancini2018boosting; yang2020curriculum; wang2020learning; li2021t). Such representations can be obtained by applying a suitable linear or nonlinear transformation to the input data. The key to these methods is how to enforce the invariance of the learned representations. For example, the invariance can be enforced by maximum classifier discrepancy (saito2018maximum), by a domain discriminator for adversarial training (ganin2016domain; zhao2018adversarial), by moment matching (peng2019moment), or by a relation alignment loss (wang2020learning). However, all these methods assume the label distribution to be invariant across domains. As a result, when the label distribution varies across domains, they may perform well only in the overlapping areas among the label distributions of different domains, and face challenges in the non-overlapping areas. To overcome this, some works propose to learn invariant representations conditional on the label across domains (gong2016domain; ghifary2016scatter; tachet2020domain). One of the challenges for these methods is that the labels in the target domain are unavailable. More importantly, these methods do not guarantee that the learned representations are consistent with the true relevant information for predicting the label in the target domain, so there is no principled way to guarantee that the learned predictor generalizes to the target domain.

Learning invariant conditional distribution $p (y | z_{c})$ . A few works explore the invariant conditional distribution $p (y | z_{c})$ for domain adaptation (kull2014patterns; bouvier2019hidden). Different from these two works, the proposed method provides the identifiability of $z_{c}$ , so that the learned $p (y | z_{c})$ in this work can be generalized to the target domain in a principled way. Besides, in the context of out-of-distribution generalization, some recent works explore learning the invariant conditional distribution $p (y | z_{c})$ (arjovsky2019invariant; sun2021recovering; liu2021learning; lu2021invariant). For example, arjovsky2019invariant learns the optimal invariant predictor across domains from the viewpoint of an intimate link between invariance and causation, while the proposed method directly explores conditional invariance given the proposed latent causal graph. sun2021recovering mainly focuses on a single domain, while the proposed method considers multiple domains. The proposed method also differs from the work in liu2021learning in that the former assumes that the latent content variable causes the style variable, while the latter depends on a confounder to model the causal relation between the latent content variable and the style variable. Unlike the work in lu2021invariant, where the label is treated as a variable causing the other latent variables, the proposed method assumes that the label has no child nodes.

3 The Proposed Latent Causal Graph for Latent Covariate Shift

Figure 2: The proposed latent causal model.

To more deeply understand and handle LCS, we introduce a latent causal graph as depicted by Figure 2. It introduces the observed domain variable $D$ to denote the specific domain in which data are collected. Let $n_{c}$ and $n_{s}$ denote the noise (exogenous) variables. According to the definition of a structural causal model (Pearl00; Spirtes00), $n_{c}$ and $n_{s}$ should be mutually independent conditional on the observed $D$ . The variables $n_{c}$ and $n_{s}$ correspond to the latent content variable $z_{c}$ and the latent style variable $z_{s}$ , respectively. Here $z_{c}$ and $z_{s}$ denote the latent causal content information and the latent style information, respectively. Generally speaking, $z_{c}$ and $z_{s}$ should be dependent given the domain variable $D$ . Here we consider that $z_{c}$ causes $z_{s}$ , to model the correlation between $z_{s}$ and $y$ . In the proposed latent causal graph, the distribution of $z_{c}$ changes across domains while $p (y | z_{c})$ is invariant across domains, which meets the basic assumption of latent covariate shift. In the following, we discuss two key causal relations, which also highlight the novelty of the proposed latent causal graph.

$z_{c}$ causes $y$ : Previous works consider the causal relation between $x$ and $y$ as $y \to x$ (gong2016domain; stojanov2019data; li2018domain), while we employ $z_{c} \to y$ . We argue that these two cases are not contradictory since the labels $y$ in these two cases represent two different physical meanings. To understand this point, let $\hat{y}$ replace $y$ in the first case (i.e., $\hat{y} \to x$ ) to distinguish it from $y$ in the second case. For the first case, consider the generative process of images. A label, e.g., $\hat{y}$ , is first sampled, then one determines the content information related to the label $\hat{y}$ , and finally an image is generated, which is a reasonable assumption in many real application scenarios. In the proposed latent causal graph, $n_{c}$ plays the role of $\hat{y}$ and causes the content variable $z_{c}$ . We then assume $z_{c} \to y$ , which formulates the process in which experts extract content information from given images and then provide reasonable labels according to their domain knowledge. This assumption has been made by some recent works (mahajan2021domain; liu2021learning; sun2021recovering). In particular, these two different labels, $\hat{y}$ and $y$ , have been considered in the work (mahajan2021domain). Here we provide a more detailed interpretation of the difference between $\hat{y}$ and $y$ .

$z_{c}$ causes $z_{s}$ : It is clear that there exists a spurious correlation between the label $y$ and the style variable $z_{s}$ in many real applications. We here employ $z_{c}$ as a confounding factor of both $y$ and $z_{s}$ to model the spurious correlation. The rationality of this assumption can be further verified by considering the converse. In particular, if we assume that $z_{s}$ causes $z_{c}$ , all high-level information in the input data $x$ , i.e., both $z_{s}$ and $z_{c}$ , would be causally related to the label $y$ , which cannot model the spurious correlation and is obviously unreasonable. Therefore, assuming $z_{c} \to z_{s}$ is more persuasive and consistent with previous works (gong2016domain; stojanov2019data; mahajan2021domain). One recent work (sun2021recovering) leverages an additional variable as a confounding factor that causes both the content variable $z_{c}$ and the style variable $z_{s}$ to model their relation. However, the identifiability in that work (sun2021recovering) does not depend on the confounding factor. As a result, the confounding factor can be incorporated into the domain index, which is equivalent to the case where $z_{c}$ and $z_{s}$ are independent for a single domain. By contrast, the proposed latent graph assumes a more general setting where $z_{c}$ and $z_{s}$ are dependent for a single domain, as depicted by Figure 2. Experimental results will further highlight the advantages of the proposed latent causal graph, compared with the work (sun2021recovering).

4 Identifiability Analysis of the Proposed Latent Causal Graph

In this section, we show that $z_{c}$ can be identified up to simple scaling under some mild conditions. To this end, we first review some identifiability results in nonlinear ICA (khemakhem2020variational), which show that the exogenous variables, $n_{c}$ and $n_{s}$ , can be identified up to simple permutation and scaling under some mild conditions. With this result, we then show that although the proposed latent causal graph as a whole is unidentifiable due to the transitivity property, the part of the graph concerning $z_{c}$ can still be identified up to simple scaling, by using the dependence between $n_{c}$ and $y$ .

4.1 Identifying $n_{c}$ and $n_{s}$ up to Permutation and Scaling by Nonlinear ICA

Nonlinear ICA aims to separate independent latent variables, e.g., $n_{c}$ and $n_{s}$ , from observed mixed data, e.g., $x$ , generated by a nonlinear function. It is known that nonlinear ICA is highly ill-posed, and one cannot recover the independent latent variables without some assumptions (hyvarinen1999nonlinear). Recent work (khemakhem2020variational) has shown that under relatively mild conditions the independent latent variables can be identified. Specifically, one can assume there is an auxiliary observed variable, similar to $D$ in Figure 2, which influences the distributions of all independent latent variables. This auxiliary variable could be time series or side information. Conditioning on the auxiliary variable, we can recover the independent latent variables up to simple permutation and scaling. The auxiliary variable can also be regarded as the domain index, which causes both $n_{c}$ and $n_{s}$ as depicted in Figure 2. In addition, the input data $x$ can be regarded as the observed mixed data. Since we can see all input data $x$ from source and target domains in the setting of domain adaptation, it is straightforward to extend the identifiability result of nonlinear ICA to the proposed causal graph, i.e., $n_{c}$ and $n_{s}$ in source and target domains can be recovered up to simple permutation and scaling under the mild conditions. For example, the independent latent variables $n_{s}$ and $n_{c}$ can be sampled from independent Gaussian distributions whose means and variances are modulated by the observed variable $D$ , as mentioned in khemakhem2020variational. Note that we here assume $n_{c}$ and $n_{s}$ to be vectors. Although the identifiability result of nonlinear ICA is proved for scalar variables, we assume all components in these two vectors to be mutually independent, which is reasonable because both vectors denote noise variables, so that the identifiability result of nonlinear ICA can also be used for $n_{s}$ and $n_{c}$ .

4.2 The Curse of Identifiability in Latent Space: Transitivity Property

Figure 3: Two equivalent graph structures.

Even with the identifiability result of $n_{c}$ and $n_{s}$ , it is still challenging to completely identify the proposed causal graph. In fact, we have the following result:

Proposition 4.1.

With the identifiability result of $n_{c}$ and $n_{s}$ , the proposed causal graph is unidentifiable without additional assumptions, due to the transitivity property in latent space.

Proof.

To prove non-identifiability, it is sufficient to show that several different graph structures lead to the same observed data. In particular, given the fact that $n_{c}$ and $n_{s}$ are identifiable up to permutation and scaling, let us consider the net effect of $n_{c}$ on $x$ . There are two different paths to 'explain' the net effect of $n_{c}$ on $x$ . One path is $n_{c} \to z_{c} \to x$ . In this case, since we place no limitation on the function class of edges, we can cut the path $z_{c} \to z_{s}$ off (e.g., the left sub-figure of Figure 3) and obtain the same observed data as generated by Figure 2. The other path is $n_{c} \to z_{c} \to z_{s} \to x$ . In this case, we can cut the 'path' $z_{c} \to x$ off (e.g., the right sub-figure of Figure 3) and generate the same observed data as generated by Figure 2. Therefore, the two sub-figures in Figure 3 are equivalent to the proposed latent causal graph in Figure 2. ∎

The non-identifiability result above arises because we cannot determine which path is the correct one for the net effect of $n_{c}$ on $x$ , i.e., $n_{c} \to z_{c} \to x$ or $n_{c} \to z_{c} \to z_{s} \to x$ . We term this the transitivity property in this work. It often appears in latent causal discovery and seriously hinders identifiability. For readers interested in this problem, we recommend recent work (adams2021identification).
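The argument of Proposition 4.1 can be checked numerically on a toy model. The sketch below is only an illustration under assumed functional forms: the mixing function `mix`, the coefficient 0.5 and the three generator functions are hypothetical and are not the model used in this work. Each generator uses a different edge set, yet all three produce identical observations from the same exogenous variables, because the unrestricted function class can absorb the removed edge.

```python
import random

def mix(a, b):
    """An arbitrary fixed nonlinear mixing of two scalars into an observation x."""
    return (a + b ** 3, a * b - b)

def original_graph(n_c, n_s):
    # Structure of Figure 2: n_c -> z_c, (z_c, n_s) -> z_s, (z_c, z_s) -> x.
    z_c = n_c
    z_s = 0.5 * z_c + n_s
    return mix(z_c, z_s)

def graph_without_zc_to_zs(n_c, n_s):
    # Left graph of Figure 3: the edge z_c -> z_s is cut, so z_s depends on n_s only.
    # The mixing from (z_c, z_s) to x is redefined to absorb the removed edge.
    z_c = n_c
    z_s = n_s
    return mix(z_c, 0.5 * z_c + z_s)

def graph_without_zc_to_x(n_c, n_s):
    # Right graph of Figure 3: the edge z_c -> x is cut, so x depends on z_s only.
    # Here z_s is taken to be two-dimensional so that it carries the effect of z_c.
    z_c = n_c
    z_s = (z_c, 0.5 * z_c + n_s)
    return mix(z_s[0], z_s[1])

rng = random.Random(0)
noise = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1000)]
same_left = all(original_graph(c, s) == graph_without_zc_to_zs(c, s) for c, s in noise)
same_right = all(original_graph(c, s) == graph_without_zc_to_x(c, s) for c, s in noise)
print(same_left, same_right)
```

Since the three structures cannot be told apart from $x$ alone, no procedure that only sees the observed data can decide between them.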

4.3 Identifying $z_{c}$ up to Scaling by the Dependence Between $n_{c}^{S}$ and $y^{S}$

Although the proposed causal graph as a whole is unidentifiable, for domain adaptation we are only interested in the identifiability of $z_{c}$ , rather than that of the latent variable $z_{s}$ , since the label $y$ is caused only by $z_{c}$ . Thanks to the observed $y^{S}$ from source domains, we have the following result:

Proposition 4.2.

With the assumptions of nonlinear ICA, the content variable $z_{c}$ in the proposed latent causal graph can be identifiable up to simple scaling by using the dependence between $n_{c}^{S}$ and $y^{S}$ from source domains.

Proof.

As mentioned above, there are permutation indeterminacy and scaling indeterminacy in identifying $n_{c}$ and $n_{s}$ . The permutation indeterminacy implies that if we obtain the two recovered variables, e.g., $\hat{n}_{c}$ and $\hat{n}_{s}$ , we are uncertain which recovered variable corresponds to the latent content noise $n_{c}$ , i.e., $n_{c} = \hat{n}_{c}$ or $n_{c} = \hat{n}_{s}$ . If we can solve this permutation problem, then since the parent node of $z_{c}$ includes $n_{c}$ only, $z_{c}$ can also be identified up to scaling, i.e., $z_{c} = f (n_{c})$ where $f$ can be any nonlinear function. Considering the relationships among $n_{c}$ , $n_{s}$ and $y$ in the proposed causal graph, it is clear that the label $y$ depends only on $n_{c}$ and is independent of $n_{s}$ , given the domain variable $D$ . As a result, we can measure the dependence (e.g., by mutual information) between $y^{S}$ and $\hat{n}_{c}^{S}$ and between $y^{S}$ and $\hat{n}_{s}^{S}$ , to determine which recovered variable is $n_{c}$ . Here the superscript $S$ denotes data from source domains. ∎

The proposition shows that the content variable $z_{c}$ can be identified up to simple scaling. The scaling indeterminacy of $z_{c}$ is of no significance and can be ignored in latent space, since this indeterminacy can be 'absorbed' by the nonlinear function class of edges and does not change the causal direction. For example, consider the recovered variable $\hat{z}_{c}$ and its scaling $\mathrm{scaling} (\hat{z}_{c})$ . When we try to learn an invariant predictor $g (\cdot)$ from $\hat{z}_{c}$ to $y$ , the scaling indeterminacy can be 'absorbed' by a composite predictor, e.g., $g (\mathrm{scaling} (\cdot))$ .
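The permutation step in the proof of Proposition 4.2 can be sketched as follows. This is a toy illustration under stated assumptions: the recovered components are simulated directly as permuted and rescaled copies of the true noises, the label is a thresholded noisy function of $n_{c}$ , and `mutual_information` is a simple histogram plug-in estimator introduced here for illustration (it is not the estimator used in the proposed method).

```python
import math
import random
from collections import Counter

def mutual_information(values, labels, bins=10):
    """Plug-in estimate (in nats) of I(value; label) using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    binned = [min(int((v - lo) / width), bins - 1) for v in values]
    n = len(values)
    joint = Counter(zip(binned, labels))
    p_bin = Counter(binned)
    p_lab = Counter(labels)
    mi = 0.0
    for (b, l), c in joint.items():
        mi += (c / n) * math.log(c * n / (p_bin[b] * p_lab[l]))
    return mi

rng = random.Random(0)
n = 5000
n_c = [rng.gauss(0.0, 1.0) for _ in range(n)]
n_s = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Toy source labels: y depends on n_c only (assumed form).
y = [1 if c + 0.3 * rng.gauss(0.0, 1.0) > 0 else 0 for c in n_c]

# Recovered components arrive in unknown order and with unknown scale.
recovered = [[-3.0 * s for s in n_s], [0.2 * c for c in n_c]]
scores = [mutual_information(r, y) for r in recovered]
content_index = max(range(len(recovered)), key=lambda i: scores[i])
print(scores, content_index)
```

The component with the larger estimated dependence on $y^{S}$ is taken as the content noise; its unknown scale does not matter for this decision.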

5 Learning Invariant $p (y | z_{c})$ for MSDA

The identifiable $z_{c}$ provides a principled way to guarantee that we can learn the conditional distribution $p (y | z_{c})$ to be generalized to the target domain. In this section we propose a novel method to show how to learn the invariant conditional distribution $p (y | z_{c})$ for MSDA.

5.1 The Proposed Method for Learning Invariant $p (y | z_{c})$

As mentioned in section 4.3, since the identifiability of $z_{c}$ is built on the identifiability result of nonlinear ICA, we need to identify $n_{c}$ and $n_{s}$ first. To meet the conditions for identifying $n_{c}$ and $n_{s}$ as mentioned in khemakhem2020variational, we employ the following Gaussian prior on $n_{c}$ and $n_{s}$ :

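p (n | D) = p (n_{c} | D) p (n_{s} | D) = N (μ_{n_{c}} (D), Σ_{n_{c}} (D)) N (μ_{n_{s}} (D), Σ_{n_{s}} (D)),

(1)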

where $μ$ and $Σ$ denote the mean and variance, respectively. Both depend on the domain variable $D$ and can be implemented by multi-layer perceptrons. Since $n_{c}$ and $n_{s}$ can be vectors and correspond to independent noise variables, $Σ$ here is a diagonal matrix. Some other exponential family distributions, e.g., the Laplace distribution, also meet the conditions for identifying $n_{c}$ and $n_{s}$ and are thus feasible (khemakhem2020variational). We here employ the Gaussian prior since it allows direct use of the reparameterization trick (kingma2013auto). The nature of the proposed Gaussian prior equation 1 gives rise to the following variational posterior:

q (n | D, x) = q (n_{c} | D, x) q (n_{s} | D, x) = N (μ'_{n_{c}} (D, x), Σ'_{n_{c}} (D, x)) N (μ'_{n_{s}} (D, x), Σ'_{n_{s}} (D, x)),

(2)

where $μ'$ and $Σ'$ denote the mean and variance of the posterior, respectively. Again, both depend on the domain variable $D$ and the observed $x$ , and can be implemented by multi-layer perceptrons. Combining this with the Gaussian prior equation 1, we arrive at the following evidence lower bound (ELBO):

E_{q (n | D, x)} (p (x | D)) - D_{K L} (q (n | D, x) | | p (n | D)),

(3)

where $D_{K L}$ denotes the Kullback–Leibler divergence.
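For diagonal Gaussians, both the sampling step and the KL term of the ELBO have simple closed forms. The following is a minimal sketch, not the implementation of this work: it uses plain Python lists in place of network outputs, and the function names `reparameterize` and `kl_diag_gaussians` as well as the example parameter values are assumptions made for illustration.

```python
import math
import random

def reparameterize(mu, var, rng):
    """Sample n = mu + sqrt(var) * eps with eps ~ N(0, I), one entry per dimension."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total

rng = random.Random(0)
# Hypothetical posterior parameters for one sample, and the prior of its domain.
post_mu, post_var = [0.4, -0.1], [0.5, 0.8]
prior_mu, prior_var = [0.0, 0.0], [1.0, 1.0]

n_sample = reparameterize(post_mu, post_var, rng)
kl_term = kl_diag_gaussians(post_mu, post_var, prior_mu, prior_var)
print(n_sample, kl_term)
```

Because the sample is a deterministic function of the parameters and an independent standard normal draw, gradients can flow through `mu` and `var` when the same computation is written in an automatic differentiation framework.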

Maximizing the ELBO equation 3, we can recover $n_{c}$ and $n_{s}$ up to simple scaling and permutation. To solve the permutation as mentioned in section 4.3, we can simultaneously maximize the dependence of $y^{S}$ and $n_{c}^{S}$ . Here we employ the mutual information to maximize the dependence. As a result, we arrive at:

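max λ (E_{q (n | D, x)} (p (x | D)) - D_{K L} (q (n | D, x) | | p (n | D))) + I (n_{c}^{S}, y^{S}),

(4)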

where $I (n_{c}^{S}, y^{S})$ denotes the mutual information between $n_{c}^{S}$ and $y^{S}$ in source domains, and $λ$ is a regularization hyper-parameter that balances the ELBO and the mutual information (MI). The proposed method, termed iLCC-MSDA (identifiable latent causal content for MSDA), consists of two parts: the ELBO and the mutual information. The ELBO ensures that $n_{c}$ and $n_{s}$ can be recovered up to scaling and permutation. The MI handles the permutation, and thus ensures that the recovered $n_{c}$ corresponds to the true latent content variable, instead of the latent style variable. In the implementation, we use the variational lower bound of mutual information proposed by alemi2016deep to approximate the mutual information in equation 4, which yields the following objective:

max λ (E_{q (n | D, x)} (p (x | D)) - D_{K L} (q (n | D, x) | | p (n | D))) + E_{q (n_{c} | D, x)} (p (y^{S} | n_{c}^{S})) .

(5)

A depiction of the proposed iLCC-MSDA is shown in Figure 4.

Figure 4: The proposed iLCC-MSDA to learn the invariant $p (y^{S} | z_{c}^{S})$ for multiple source domain adaptation. C denotes concatenation, S denotes sampling from a distribution.

5.2 Heuristic Constraints for the Proposed Method

Enhancing the independence of $n_{c}$ and $n_{s}$

As discussed, the performance of the proposed iLCC-MSDA depends on the assumptions for the identifiability of $n_{c}$ and $n_{s}$ . For example, one of the assumptions is that there exist many different domains that change the distributions of $n_{c}$ and $n_{s}$ . However, in real applications we may not have many source domains for domain adaptation. To remedy this, we employ a heuristic method to enhance the independence. Motivated by the progress in disentangled representation learning (higgins2016beta; kim2018disentangling; chen2018isolating), we propose using a hyperparameter $β$ to control the emphasis on learning statistically independent latent variables $n_{c}$ and $n_{s}$ given the domain variable $D$ .

Entropy regularization

In the loss function equation 5, we enforce the causal relation between $y^{S}$ and $n_{c}^{S}$ in source domains through the mutual information. To encourage such a causal relation in the target domain, we can also maximize the mutual information between $y^{T}$ and $n_{c}^{T}$ by minimizing the conditional entropy:

L_{e n t} = - E (p (y | z_{c}^{T}) log p (y | z_{c}^{T}))

(6)

This regularization has been empirically used in previous works (wang2020learning; li2021t), while we consider it from the viewpoint of causality. Overall, our loss function is:

max λ (E_{q (n | D, x)} (p (x | D)) - β D_{K L} (q (n | D, x) | | p (n | D))) + E_{q (n_{c} | D, x)} (p (y^{S} | n_{c}^{S})) - γ L_{e n t},

(7)

where $β, λ, γ$ are hyper-parameters that trade off the independence of $n_{c}$ and $n_{s}$ , the classifier and the entropy regularization loss terms.
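The way the individual terms combine can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the training code of this work: the reconstruction term, the KL term and the source label log-likelihood are assumed to be computed elsewhere and passed in as scalars, and the names `entropy` and `ilcc_objective` are hypothetical. Following the text, the conditional entropy on target predictions is to be minimized, so it enters the maximized objective with a negative sign.

```python
import math

def entropy(probs):
    """Shannon entropy -sum_k p_k log p_k of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ilcc_objective(recon_term, kl_term, label_loglik_src, target_probs,
                   lam=1.0, beta=1.0, gamma=1.0):
    """Scalar objective to be maximized (a sketch of the overall loss).

    recon_term       : reconstruction term of the ELBO (assumed given)
    kl_term          : KL(q(n | D, x) || p(n | D)) (assumed given)
    label_loglik_src : source classification term for p(y^S | n_c^S) (assumed given)
    target_probs     : list of predicted class distributions on target samples
    """
    mean_entropy = sum(entropy(p) for p in target_probs) / len(target_probs)
    elbo = recon_term - beta * kl_term
    return lam * elbo + label_loglik_src - gamma * mean_entropy

# Confident target predictions give a higher objective than uncertain ones.
confident = ilcc_objective(-10.0, 2.0, -0.5, [[0.99, 0.01]])
uncertain = ilcc_objective(-10.0, 2.0, -0.5, [[0.5, 0.5]])
print(confident, uncertain)
```

Setting `gamma=0.0` removes the entropy regularization and setting `beta=1.0` removes the extra emphasis on independence, which correspond to the two ablation settings reported in the experiments.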

6 Experiments

6.1 Experiments on Synthetic Data

Dataset

We conduct experiments on synthetic data, generated by the following process: we divide the latent variables into 5 segments, which correspond to 5 domains. Each segment includes 1000 examples. Within each segment, we first sample the mean and the variance of the latent exogenous variables $n_{c}$ and $n_{s}$ from the uniform distributions on $[1, 2]$ and $[0.3, 1]$ , respectively. Then for each segment, we generate $z_{c}$ , $z_{s}$ , $x$ and $y$ according to the following structural causal model:

(8)

where following (khemakhem2020variational) we mix the latent $z_{c}$ and $z_{s}$ using a multi-layer perceptron to generate $x$ .
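The generation procedure can be sketched as below. The segment sizes and the uniform ranges follow the description above, but the structural equations in this sketch are placeholders chosen only for illustration (they are NOT equation 8), and the one-hidden-layer mixing network, its width and the helper names `make_mlp` and `generate` are likewise assumptions.

```python
import math
import random
from collections import Counter

def make_mlp(in_dim, hidden, out_dim, rng):
    """Random one-hidden-layer MLP with leaky-ReLU activation, used as the mixing function."""
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(out_dim)]
    def forward(z):
        h = [sum(w * v for w, v in zip(row, z)) for row in w1]
        h = [a if a > 0 else 0.2 * a for a in h]
        return [sum(w * v for w, v in zip(row, h)) for row in w2]
    return forward

def generate(num_domains=5, per_domain=1000, seed=0):
    rng = random.Random(seed)
    mixer = make_mlp(2, 8, 2, rng)
    data = []
    for d in range(num_domains):
        # Per-segment Gaussian parameters: means in [1, 2], variances in [0.3, 1].
        mu_c, mu_s = rng.uniform(1.0, 2.0), rng.uniform(1.0, 2.0)
        var_c, var_s = rng.uniform(0.3, 1.0), rng.uniform(0.3, 1.0)
        for _ in range(per_domain):
            n_c = rng.gauss(mu_c, math.sqrt(var_c))
            n_s = rng.gauss(mu_s, math.sqrt(var_s))
            # Placeholder structural equations (assumed for illustration only).
            z_c = n_c
            z_s = z_c + n_s
            y = z_c + 0.1 * rng.gauss(0.0, 1.0)
            x = mixer([z_c, z_s])
            data.append({"domain": d, "x": x, "y": y, "n_c": n_c, "n_s": n_s})
    return data

data = generate()
source = [r for r in data if r["domain"] < 4]   # first 4 segments as source domains
target = [r for r in data if r["domain"] == 4]  # last segment as target domain
print(len(source), len(target))
```

Only the edge structure ( $n_{c} \to z_{c}$ , $z_{c} \to z_{s}$ , $z_{c} \to y$ , and $(z_{c}, z_{s}) \to x$ ) is taken from the proposed graph; any functional forms could be substituted for the placeholder lines.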

Results

In the implementation, we use the first 4 segments as source domains, and the last segment as the target domain. Figure 5(a) shows the true and recovered distributions of the exogenous variable $n_{c}$ . Supported by the identifiability of nonlinear ICA, the proposed iLCC-MSDA obtains a mean correlation coefficient (MCC) of 0.96 between the original $n_{c}$ and the recovered one. Due to the invariant conditional distribution $p (y | n_{c})$ , even though the distribution of the exogenous variable $n_{c}$ changes as shown in Figure 5(a), the learned $p (y | n_{c})$ can be generalized to the target segment in a principled way, as depicted by Figure 5(b). Due to limited space, Figure 5(b) only shows 200 samples of the true and predicted $y$ .
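The MCC used above can be computed as in the following sketch. It is one common variant (absolute Pearson correlation under the best one-to-one matching of components) and is an assumption about the exact protocol; the brute-force search over permutations is only practical for a handful of components, and the function names are illustrative.

```python
import math
from itertools import permutations

def pearson(a, b):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def mcc(true_cols, rec_cols):
    """Mean absolute correlation under the best one-to-one matching of components."""
    k = len(true_cols)
    corr = [[abs(pearson(t, r)) for r in rec_cols] for t in true_cols]
    best = max(sum(corr[i][perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return best / k

# Toy check: recovered components are permuted, rescaled and shifted copies.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 1.0, 4.0, 3.0, 6.0]
recovered = [[-2.0 * v for v in b], [3.0 * v + 1.0 for v in a]]
score = mcc([a, b], recovered)
print(score)
```

Taking absolute values and maximizing over matchings makes the score insensitive to exactly the permutation and scaling indeterminacies that the identifiability result allows.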

Figure 5: The result on synthetic data. (a) Recovered $n_{c}$ ; (b) true and predicted $y$ .

6.2 Experiments on Real Data

Dataset

We further evaluate the proposed iLCC-MSDA on two benchmark domain adaptation datasets, PACS (li2017deeper) and Terra Incognita (beery2018recognition). To obtain label distribution shift on the PACS dataset, we filter the original dataset by re-sampling it to generate three datasets, PACS ( $D_{K L} = 0.3$ ), PACS ( $D_{K L} = 0.5$ ) and PACS ( $D_{K L} = 0.7$ ), where $D_{K L} = 0.3 (0.5, 0.7)$ denotes that the KL divergence between the label distributions of any two different domains is approximately 0.3 (0.5, 0.7). See Figure 6 for detailed label distributions.

Figure 6: Label distributions of the filtered PACS data.
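The amount of label shift between two domains can be measured as in the sketch below. It only shows how the KL divergence between two empirical label distributions may be computed; the smoothing constant, the direction of the divergence and the toy label lists are assumptions, and the re-sampling procedure itself is not reproduced here.

```python
import math
from collections import Counter

def label_distribution(labels, classes, eps=1e-8):
    """Empirical class frequencies with a small additive smoothing term."""
    counts = Counter(labels)
    total = len(labels) + eps * len(classes)
    return [(counts[c] + eps) / total for c in classes]

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

classes = ["dog", "house"]                       # hypothetical class names
domain_a = ["dog"] * 50 + ["house"] * 50         # balanced toy domain
domain_b = ["dog"] * 90 + ["house"] * 10         # skewed toy domain

p = label_distribution(domain_a, classes)
q = label_distribution(domain_b, classes)
print(kl_divergence(p, q))
```

Note that the KL divergence is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ; a re-sampling scheme would need to fix one convention.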

Baselines

We compare the proposed method with state-of-the-art methods to verify its effectiveness. In particular, we compare the proposed method with empirical risk minimization (ERM), MCDA (saito2018maximum), M3SDA (peng2019moment), LtC-MSDA (wang2020learning), T-SVDNet (li2021t), IRM (arjovsky2019invariant), IWCDAN (tachet2020domain) and LaCIM (sun2021recovering). Among these methods, MCDA, M3SDA, LtC-MSDA and T-SVDNet learn an invariant representation, while IRM, IWCDAN and LaCIM learn invariant conditional distributions for MSDA, allowing the label distribution to shift. Details of the implementation, including network architectures and hyper-parameter settings, are in the Appendix. Results for all methods are averaged over 3 runs and reported with standard deviations.

| Methods | $\to$ Art | $\to$ Cartoon | $\to$ Photo | $\to$ Sketch | Average |
|---|---|---|---|---|---|
| PACS ( $D_{K L} = 0.3$ ) | | | | | |
| ERM | 82.3 $\pm$ 0.3 | 81.3 $\pm$ 0.9 | 94.9 $\pm$ 0.2 | 76.2 $\pm$ 0.7 | 83.6 |
| MCDA (saito2018maximum) | 76.6 $\pm$ 0.6 | 85.1 $\pm$ 0.3 | 96.6 $\pm$ 0.1 | 70.1 $\pm$ 1.3 | 82.1 |
| M3SDA (peng2019moment) | 79.6 $\pm$ 1.0 | 86.6 $\pm$ 0.5 | 97.1 $\pm$ 0.3 | 83.3 $\pm$ 1.0 | 86.6 |
| LtC-MSDA (wang2020learning) | 82.7 $\pm$ 1.3 | 84.9 $\pm$ 1.4 | 96.9 $\pm$ 0.2 | 75.3 $\pm$ 3.1 | 84.9 |
| T-SVDNet (li2021t) | 81.8 $\pm$ 0.3 | 86.5 $\pm$ 0.2 | 95.9 $\pm$ 0.2 | 80.7 $\pm$ 0.8 | 86.3 |
| IRM (arjovsky2019invariant) | 79.6 $\pm$ 0.7 | 77.0 $\pm$ 2.2 | 94.6 $\pm$ 0.2 | 71.7 $\pm$ 2.3 | 80.7 |
| IWCDAN (tachet2020domain) | 84.0 $\pm$ 0.5 | 78.1 $\pm$ 0.7 | 96.0 $\pm$ 0.1 | 75.5 $\pm$ 1.9 | 83.4 |
| LaCIM (sun2021recovering) | 63.1 $\pm$ 1.5 | 72.6 $\pm$ 1.0 | 82.7 $\pm$ 1.3 | 71.5 $\pm$ 0.9 | 72.5 |
| iLCC-MSDA (Ours) | 86.4 $\pm$ 0.8 | 81.1 $\pm$ 0.8 | 95.9 $\pm$ 0.1 | 86.0 $\pm$ 1.0 | 87.4 |
| PACS ( $D_{K L} = 0.5$ ) | | | | | |
| ERM | 85.4 $\pm$ 0.6 | 76.4 $\pm$ 0.5 | 94.4 $\pm$ 0.4 | 85.0 $\pm$ 0.6 | 85.3 |
| MCDA (saito2018maximum) | 81.6 $\pm$ 0.1 | 76.8 $\pm$ 0.1 | 93.6 $\pm$ 0.1 | 84.1 $\pm$ 0.6 | 84.0 |
| M3SDA (peng2019moment) | 81.2 $\pm$ 1.2 | 77.5 $\pm$ 1.3 | 94.5 $\pm$ 0.5 | 84.3 $\pm$ 0.5 | 84.4 |
| LtC-MSDA (wang2020learning) | 85.2 $\pm$ 1.5 | 75.2 $\pm$ 2.6 | 94.9 $\pm$ 0.6 | 85.1 $\pm$ 2.7 | 85.1 |
| T-SVDNet (li2021t) | 84.8 $\pm$ 0.3 | 77.6 $\pm$ 1.7 | 94.2 $\pm$ 0.2 | 86.4 $\pm$ 0.2 | 85.6 |
| IRM (arjovsky2019invariant) | 81.5 $\pm$ 0.3 | 71.1 $\pm$ 1.3 | 94.2 $\pm$ 0.1 | 78.7 $\pm$ 0.7 | 81.4 |
| IWCDAN (tachet2020domain) | 79.2 $\pm$ 1.6 | 72.6 $\pm$ 0.7 | 95.6 $\pm$ 0.1 | 82.1 $\pm$ 2.2 | 82.4 |
| LaCIM (sun2021recovering) | 67.4 $\pm$ 1.6 | 66.6 $\pm$ 0.6 | 81.0 $\pm$ 1.2 | 82.3 $\pm$ 0.6 | 74.3 |
| iLCC-MSDA (Ours) | 89.0 $\pm$ 0.7 | 77.6 $\pm$ 0.5 | 95.0 $\pm$ 0.3 | 87.4 $\pm$ 1.6 | 87.3 |
| PACS ( $D_{K L} = 0.7$ ) | | | | | |
| ERM | 86.1 $\pm$ 0.6 | 76.8 $\pm$ 0.3 | 94.6 $\pm$ 0.4 | 81.3 $\pm$ 2.0 | 84.7 |
| MCDA (saito2018maximum) | 80.8 $\pm$ 0.7 | 74.1 $\pm$ 1.2 | 94.4 $\pm$ 0.4 | 77.9 $\pm$ 0.4 | 81.8 |
| M3SDA (peng2019moment) | 82.7 $\pm$ 1.3 | 76.2 $\pm$ 1.0 | 94.5 $\pm$ 0.7 | 80.8 $\pm$ 1.2 | 83.6 |
| LtC-MSDA (wang2020learning) | 83.7 $\pm$ 1.6 | 74.6 $\pm$ 1.4 | 95.0 $\pm$ 0.7 | 80.8 $\pm$ 0.6 | 83.5 |
| T-SVDNet (li2021t) | 83.3 $\pm$ 0.8 | 74.7 $\pm$ 0.6 | 95.2 $\pm$ 0.3 | 74.5 $\pm$ 3.3 | 81.9 |
| IRM (arjovsky2019invariant) | 84.3 $\pm$ 0.8 | 73.3 $\pm$ 1.8 | 94.3 $\pm$ 0.1 | 69.4 $\pm$ 4.6 | 80.3 |
| IWCDAN (tachet2020domain) | 76.3 $\pm$ 0.8 | 73.9 $\pm$ 1.6 | 93.1 $\pm$ 0.5 | 77.6 $\pm$ 3.8 | 80.2 |
| LaCIM (sun2021recovering) | 63.6 $\pm$ 0.9 | 68.7 $\pm$ 1.4 | 77.5 $\pm$ 3.8 | 77.8 $\pm$ 2.2 | 71.9 |
| iLCC-MSDA (Ours) | 90.7 $\pm$ 0.3 | 74.2 $\pm$ 0.7 | 95.8 $\pm$ 0.3 | 83.0 $\pm$ 2.2 | 86.0 |
| iLCC-MSDA (Ours) with $β = 1$ | 90.2 $\pm$ 0.5 | 73.4 $\pm$ 0.8 | 95.7 $\pm$ 0.4 | 82.7 $\pm$ 0.7 | 85.5 |
| iLCC-MSDA (Ours) with $γ = 0$ | 81.1 $\pm$ 1.5 | 70.0 $\pm$ 1.6 | 92.0 $\pm$ 0.5 | 59.6 $\pm$ 0.7 | 75.7 |

Table 1: Classification results and ablation study on PACS data.

Methods Accuracy: $\to$ L28 $\to$ L43 $\to$ L46 $\to$ L7 Average
ERM 54.1 $\pm$ 2.8 62.3 $\pm$ 0.7 44.7 $\pm$ 0.9 74.5 $\pm$ 2.6 58.9
MCDA (saito2018maximum) 54.9 $\pm$ 4.1 61.2 $\pm$ 1.2 42.7 $\pm$ 0.3 64.8 $\pm$ 8.1 55.9
M3SDA (peng2019moment) 62.3 $\pm$ 1.4 62.7 $\pm$ 0.4 41.3 $\pm$ 0.3 57.4 $\pm$ 0.9 55.9
LtC-MSDA (wang2020learning) 51.9 $\pm$ 5.7 54.6 $\pm$ 1.3 45.7 $\pm$ 1.0 69.1 $\pm$ 0.3 55.3
T-SVDNet (li2021t) 58.2 $\pm$ 1.7 61.9 $\pm$ 0.3 45.6 $\pm$ 2.0 68.2 $\pm$ 1.1 58.5
IRM (arjovsky2019invariant) 57.5 $\pm$ 1.7 60.7 $\pm$ 0.3 42.4 $\pm$ 0.6 74.1 $\pm$ 1.6 58.7
IWCDAN (tachet2020domain) 58.1 $\pm$ 1.8 59.3 $\pm$ 1.9 43.8 $\pm$ 1.5 58.9 $\pm$ 3.8 55.0
LaCIM (sun2021recovering) 58.2 $\pm$ 3.3 59.8 $\pm$ 1.6 46.3 $\pm$ 1.1 70.8 $\pm$ 1.0 58.8
iLCC-MSDA (Ours) 64.3 $\pm$ 3.4 63.1 $\pm$ 1.6 44.7 $\pm$ 0.4 80.8 $\pm$ 0.4 63.2
iLCC-MSDA (Ours) with $β = 1$ 56.3 $\pm$ 4.3 61.5 $\pm$ 0.7 45.2 $\pm$ 0.3 80.1 $\pm$ 0.6 60.8
iLCC-MSDA (Ours) with $γ = 0$ 54.8 $\pm$ 1.4 58.9 $\pm$ 1.8 46.8 $\pm$ 1.4 73.1 $\pm$ 0.6 58.4

Table 2: Classification results and ablation study on Terra Incognita data.

Ablation studies

The bottom rows of Tables 1 and 2 present the results of the ablation studies. The entropy regularization in Equation 6 significantly improves the performance of the proposed method on both datasets: removing it ($γ = 0$) lowers the average accuracy by around $10\%$ on PACS with $D_{KL} = 0.7$ (from 86.0 to 75.7) and by around $5\%$ on Terra Incognita (from 63.2 to 58.4). This supports the importance of the causal relation between $y$ and $n_{c}$, which is consistent with our model assumption. The hyper-parameter $β$ also helps by enforcing the independence between the latent variables $n_{c}$ and $n_{s}$: setting $β = 1$ reduces the average accuracy to 85.5 and 60.8, respectively.

Results

The results of different methods on PACS are presented in Table 1. As the KL divergence of the label distributions increases, the performance of MCDA, M3SDA, LtC-MSDA and T-SVDNet, which are based on learning invariant representations, gradually degrades. When the KL divergence is about 0.7, these methods perform worse than standard ERM. Compared with IRM, IWCDAN and LaCIM, which allow the label distribution to change across domains, the proposed iLCC-MSDA obtains the best average performance, which we attribute to our theoretical insights. Table 2 reports the results of different methods on the challenging Terra Incognita data. The proposed iLCC-MSDA achieves a significant performance gain on the challenging task $\to$ L7. It is also the only method whose average accuracy exceeds that of ERM.
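
The KL divergence between the label distributions of two domains, used above to index the PACS variants, can be illustrated with a minimal sketch. The function names, the smoothing constant `eps`, the direction of the divergence, and the toy label lists are assumptions made for illustration; the paper does not specify how the reported values were computed.

```python
import math
from collections import Counter


def label_distribution(labels, num_classes, eps=1e-8):
    """Empirical class frequencies, lightly smoothed (eps is an assumption)."""
    counts = Counter(labels)
    total = len(labels) + eps * num_classes
    return [(counts.get(c, 0) + eps) / total for c in range(num_classes)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy labels for two hypothetical domains with three classes.
domain_a = [0] * 50 + [1] * 30 + [2] * 20
domain_b = [0] * 20 + [1] * 30 + [2] * 50

p = label_distribution(domain_a, num_classes=3)
q = label_distribution(domain_b, num_classes=3)

print("KL(a || b) =", kl_divergence(p, q))
print("KL(a || a) =", kl_divergence(p, p))
```

Note that KL divergence is asymmetric, so a reported value depends on which domain plays the role of `p`; a symmetrized or averaged variant would be an equally plausible reading of the table headers.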

7 Conclusion

The key to domain adaptation is understanding how the joint distribution of input data and labels changes across domains. Previous works assume covariate shift or conditional shift to explain this change, which may be overly restrictive in real applications. This work relaxes these assumptions and proposes a new assumption, latent covariate shift. To handle it, we propose a latent causal graph that more precisely formulates the generative process of input data and labels, using a latent content variable and a latent style variable. Building on the identifiability results of nonlinear ICA, we show that the latent content variable can be identified up to scaling. This motivates a new method that learns the invariant label distribution conditional on the latent content variable, with a principled way to guarantee generalization. Experiments show the advantages of our theoretical results and the performance of the proposed method, compared with state-of-the-art methods across various datasets.

References

Appendix A Appendix

Data Details

The original PACS (li2017deeper) is a multi-domain dataset containing 4 domains, Photo, Art Painting, Cartoon and Sketch, which share the same seven categories. The KL divergence between the label distributions of any two domains in the original PACS is very small, around 0.1. To obtain the proposed latent covariate shift, we filter the original dataset by re-sampling it, and obtain three new datasets with different KL divergences of label distributions, as depicted in Figure 6. For Terra Incognita (beery2018recognition), the label distribution is long-tailed in each domain, and each domain has a different label distribution, which naturally fits our setting. This work uses four domains from the original data, L28, L43, L46 and L7, which share the same seven categories: bird, bobcat, empty, opossum, rabbit, raccoon and skunk, as depicted in Figure 7.
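
One way such re-sampling could be done is sketched below. This is an illustration under stated assumptions, not the procedure actually used to build the datasets: the function name `resample_to_label_distribution`, the choice of subsampling without replacement, and the toy target distribution are all hypothetical.

```python
import random
from collections import defaultdict


def resample_to_label_distribution(labels, target_probs, seed=0):
    """Return indices of a subset whose class frequencies approximate target_probs.

    Assumption: examples are subsampled without replacement, and the subset is
    as large as the scarcest class (relative to its target share) allows.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    # Largest total size n with n * target_probs[c] <= available examples of c.
    n = min(len(by_class[c]) / p for c, p in target_probs.items() if p > 0)

    chosen = []
    for c, p in target_probs.items():
        k = int(n * p)  # number of examples to keep for class c
        chosen.extend(rng.sample(by_class[c], k))
    rng.shuffle(chosen)
    return chosen


# Toy domain with a uniform label distribution over three classes.
toy_labels = [0] * 100 + [1] * 100 + [2] * 100
subset = resample_to_label_distribution(toy_labels, {0: 0.6, 1: 0.3, 2: 0.1})
kept = [toy_labels[i] for i in subset]
print(len(kept), [kept.count(c) / len(kept) for c in range(3)])
```

Applying different target distributions to different domains changes the pairwise KL divergence of their label distributions while leaving the images themselves untouched.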

Figure 7: Label distributions of the Terra Incognita data used in this work.

Implementation Details

For the synthetic data, we use an encoder and a decoder, each implemented as a 3-layer fully connected network with 30 hidden nodes per layer. We use a 3-layer fully connected network with 30 hidden nodes for the prior model. Since this is an ideal environment in which to verify the proposed method, we set the hyper-parameters $β = 1$ and $γ = 0$ to remove the heuristic constraints, and we set $λ = 10^{-2}$. For the real data, all methods use the same network backbone, ResNet-18 pre-trained on ImageNet. Since it can be challenging to train a VAE on high-resolution images, we use the features extracted by ResNet-18 as our VAE input. We then use 2-layer fully connected networks as the VAE encoder and decoder, a 2-layer fully connected network for the prior model, and a 2-layer fully connected network to map $n_{c}$ to $z_{c}$. For hyper-parameters, we set $β = 4$, $γ = 0.1$ and $λ = 10^{-4}$ for the proposed method on all datasets.
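
The two settings described above can be summarized as a small configuration sketch. The class name `ExperimentConfig`, its field names, and the backbone string are hypothetical; values not stated in the text (such as the hidden width of the real-data networks) are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExperimentConfig:
    backbone: Optional[str]      # feature extractor; None if the VAE sees raw inputs
    num_layers: int              # depth of the encoder, decoder and prior networks
    hidden_units: Optional[int]  # hidden width; None where the text gives no value
    beta: float                  # weight enforcing independence of n_c and n_s
    gamma: float                 # weight of the entropy regularization
    lam: float                   # the hyper-parameter lambda from the text


# Synthetic data: heuristic constraints removed (beta = 1, gamma = 0).
SYNTHETIC = ExperimentConfig(
    backbone=None, num_layers=3, hidden_units=30, beta=1.0, gamma=0.0, lam=1e-2
)

# Real data (PACS, Terra Incognita): VAE on ResNet-18 features.
REAL = ExperimentConfig(
    backbone="resnet18-imagenet", num_layers=2, hidden_units=None,
    beta=4.0, gamma=0.1, lam=1e-4,
)


def uses_heuristic_constraints(cfg: ExperimentConfig) -> bool:
    """True if either heuristic constraint departs from its neutral value."""
    return cfg.beta != 1.0 or cfg.gamma != 0.0


print(uses_heuristic_constraints(SYNTHETIC), uses_heuristic_constraints(REAL))
```

Keeping both settings in one immutable structure makes the ablations in Tables 1 and 2 easy to express, since each corresponds to overriding a single field (`beta` or `gamma`) of the real-data configuration.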

The t-SNE visualizations of learned features.
