LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data

Jihye Park ¹, Soohyun Kim ¹¹, Sunwoo Kim ¹¹, Jaejun Yoo ²,
Youngjung Uh ³, Seungryong Kim ¹ (footnotes: equal contribution; corresponding author)

Abstract

Existing techniques for image-to-image translation have commonly suffered from two critical problems: heavy reliance on per-sample domain annotation and/or an inability to handle multiple attributes per image. Recent methods adopt clustering approaches to provide per-sample annotations in an unsupervised manner. However, they cannot account for the real-world setting in which one sample may have multiple attributes. In addition, the semantics of the clusters are not easily coupled to human understanding. To overcome these limitations, we present a LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate domain annotations given in texts for a dataset and jointly optimize them during training. The target style is specified by aggregating multi-domain style vectors according to the multi-hot domain assignments. As the initial candidate domain texts might be inaccurate, we set the candidate domain texts to be learnable and jointly fine-tune them during training. Furthermore, we introduce a slack domain to cover samples that are not covered by the candidate domains. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models. Our code is available at https://KU-CVLAB.github.io/LANIT/.

¹ Korea University, Seoul, Korea ² UNIST, Ulsan, Korea ³ Yonsei University, Seoul, Korea
{ghp1112, shkim1211, sw-kim, seungryong_kim}@korea.ac.kr, jaejun.yoo@unist.ac.kr, yj.uh@yonsei.ac.kr

Introduction

Unpaired image-to-image translation frameworks aim to translate an image from one domain to another. They are typically trained on many images with their respective domain labels, and they have been widely studied over the past decade in numerous computer vision applications, such as image manipulation gonzalez2018image; zhou2021cocosnet; zheng2021spatially, multi-modal translation huang2018multimodal; choi2018stargan; liu2019few; choi2020starganv2; baek2021rethinking, and instance-aware translation bhattacharjee2020dunit; jeong2021memory; kim2022instaformer.

Conventional methods huang2018multimodal; choi2018stargan; choi2020starganv2 require at least per-sample domain supervision based on the strict assumption that each image should be described by a one-hot domain label, which is frequently violated. For example, a face can have multiple attributes. In addition, annotating domain label for each image may become considerably labor intensive and error-prone, and subjectivity may lead to inconsistency across different annotators.

To mitigate such a heavy reliance on per-sample domain annotation, some recent methods formulate the problem in a few-shot setting liu2019few; saito2020coco or semi-supervised learning setting wang2020semi. More recently, TUNIT baek2021rethinking presents a framework to jointly learn domain clustering and translation within a unified model, thus enabling a fully-unsupervised learning of image-to-image translation. Kim et al. kim2022style also learn prototypes in self-supervised training. Although these methods ease the burden of per-sample domain annotation and show relatively competitive performance, they still inherit the limitation of using one-hot domain labels to train the image translation model. In addition, these unsupervised learning methods lack a way to understand the semantic meaning of each learned domain cluster, limiting their controllability in real-world applications.

Compared to the per-sample domain annotations, it could be much easier to list candidate text domain descriptions that would describe images in the dataset, namely dataset-level annotation as exemplified in Fig. 1. In numerous applications, it can be easily obtained, or may already be available. For example, if a user’s current interest is a dataset of faces, it is easy to obtain various attributes from other sources that could be used to describe a face in text form. In this paper, we propose for the first time to train an image-to-image translation model using such dataset-level supervision.

Thanks to the powerful representational capability of large-scale vision-language models, e.g., CLIP radford2021learning or ALIGN jia2021scaling, computer vision applications have recently been reformulated in a language-driven manner that not only improves performance but also provides controllability consistent with human intent. In particular, many language-driven image synthesis and manipulation methods have been presented patashnik2021styleclip; abdal2021clip2stylegan; liu2021fusedream. However, how such vision-language models can be effectively used for image-to-image translation tasks remains unexplored.

In this paper, we present a novel framework to guide an image-to-image translation model for unlabeled datasets using intuitive textual domain descriptions, which we refer to as dataset-level supervision. Our approach exploits a pretrained vision-language model like CLIP to assign a multi-hot domain label to each image instead of using a one-hot domain label as in conventional methods liu2019few; choi2020starganv2; touvron2021training. Specifically, we encode the style reference image into per-domain style vectors as in existing methods and aggregate them according to a multi-hot domain label. The domain label is estimated by calculating the similarity between the style image and the candidate domain text in a common vision-language embedding space. Then the generator translates the content image using the aggregated style vector. To encourage the networks to generate an output that faithfully and precisely follows the input style, we present a domain-consistency loss defined between domain labels of input and output. In addition, to account for the observation that the initial prompt may not be optimal to describe all images in the dataset, we present two additional techniques: a prompt learning scheme and a slack domain $\emptyset$ that enables adaptive thresholding in order to filter out mislabeled data. We train the domain prompt and translation models in an end-to-end manner.

We evaluate our method on several standard benchmarks bossard2014food; liu2019few; karras2019style. Experimental results show that the proposed model trained only with dataset-level supervision is comparable to or even outperforms the latest methods including StarGAN2 choi2020starganv2 trained with per-sample supervision. We also provide extensive ablation studies and user study results to validate and analyze our method.

Related Work

Image-to-Image Translation.

While early methods for image-to-image translation considered a paired setting isola2017image, most recent state-of-the-art methods focus on an unpaired setting hoffman2018cycada; park2020contrastive; wu2019relgan; zhou2021cocosnet; baek2021rethinking; zheng2021spatially; gabbay2021scaling; liu2021smoothing. This trend was initiated by CycleGAN CycleGAN2017, of which many variants were proposed, formulating uni-modal models liu2017unsupervised; yi2017dualgan; zhang2020cross and multi-modal models huang2018multimodal; lee2018diverse; choi2018stargan; choi2020starganv2. Although these methods overcome the inherent limitation of the paired setting, they still require at least per-sample domain annotation, which is often labor intensive or even impossible to obtain in some cases. In addition, this per-sample supervision does not consider that some images may require a multi-hot domain label, as with facial attributes, e.g., blond and young.

To mitigate the reliance on per-sample supervision, FUNIT liu2019few and CoCo-FUNIT saito2020coco proposed a few-shot setting, but they still require a few annotations. SEMIT wang2020semi utilizes a pseudo labeling technique. S$^{3}$GAN luvcic2019high, Self-conditioned GAN liu2020diverse, and bahng2020exploring adopt clustering methods or a classifier. Recently, TUNIT and Kim et al. kim2022style have proposed truly unsupervised frameworks with a clustering approach or prototype-based learning, which do not need any per-sample domain supervision. However, their performance is limited in that it is challenging to understand the semantic meaning of each pseudo domain, which in turn limits their applicability. Note that the aforementioned methods also inherit the problem of the one-hot domain labeling assumption.

Figure 2: Network configuration. Our model consists of content and style encoders ($E_{C}$, $E_{S}$), vision-language encoders ($E_{V}$, $E_{L}$), and generator ($G$). We extract content and style vectors from content $x$ and style $y$, respectively. By leveraging vision-language features, we measure the pseudo domain label $d^{y}$ of $y$. We generate $\hat{y}$ with content vector $c^{x}$ and aggregated style vector $a^{y}$ through the generator.

Vision-Language Model in Image Manipulation.

Many large-scale vision-language models radford2021learning; jia2021scaling have been popularly adopted in computer vision problems. Especially, CLIP has been employed in image synthesis and manipulation wei2021hairclip; patashnik2021styleclip; abdal2021clip2stylegan; liu2021fusedream. For instance, StyleCLIP and HairCLIP manipulate the latent space of pretrained StyleGAN karras2019style through the guidance of CLIP. CLIP2StyleGAN extends such approach in an unsupervised manner that eliminates the need of target style description. CLIPstyler kwon2021clipstyler trains a lightweight network for artistic style transfer with the supervision from CLIP. However, all the aforementioned methods only focus on image manipulation and usage for image-to-image translation task remains unexplored.

Prompt Learning.

Since the success of pretrained large-scale language models such as GPT-3 brown2020language in NLP, various methods for optimizing the prompt in a text format have been proposed petroni2019language; shin2020autoprompt; li2021prefix; liu2021pre. Inspired by these works, some works zhou2021learning; yao2021cpt; ge2022domain; zhou2022conditional; huang2022unsupervised using CLIP also attempted to optimize the input prompt. CoOp and CPT showed that optimizing a continuous prompt could surpass manually-designed discrete prompt on zero-shot image classification. However, these methods require the class supervision.

Methodology

Motivation and Overview

We leverage a list of dataset-level domain descriptions in texts and their similarity with the unlabeled images in CLIP embedding space instead of the images with per-sample domain annotations. Our framework not only alleviates difficulty in the labeling process but also enables multi-hot label setting in both labeling and image translation process.

Specifically, let us denote a pair of images as content $x$ and style $y$. Our model, consisting of content encoder $E_{C}$, style encoder $E_{S}$, mapping encoder $E_{M}$, and generator $G$, aims to learn an image-to-image translation with content $x$ and style $y$ (or between domains $X$ and $Y$) to generate a translated image $\hat{y}$ given a list of possible domain descriptions as prompt $t_{n}$ for the $n$-th domain from $N$ domains. To measure a pseudo domain label $d^{y}$ of $y$, we leverage a pretrained large-scale vision-language model, i.e., CLIP, which has vision and language encoders $E_{V}$ and $E_{L}$, respectively.

In the reference-guided phase, we first estimate pseudo multi-hot domain labels from the similarity between the style image and the candidate domain prompts. Then the style vectors $s_{n}^{y}$ for the $n$-th domain are aggregated according to the estimated domains to modulate the content representation $c^{x}$. In the latent-guided phase, we follow choi2020starganv2 except for the aggregation of the multiple latent vectors.

During training, we jointly optimize the domain prompt $t_{n}$ and the image translators consisting of $E_{C}$, $E_{S}$, $E_{M}$, and $G$, while vision-language encoders $E_{V}$ and $E_{L}$ are fixed. To train our networks, we first use an adversarial loss defined between translated image $\hat{y}$ and style image $y$ with multi-domain discriminators with multi-hot domain label. To improve the disentanglement ability for content and style, following choi2020starganv2, we also use cycle consistency loss, style reconstruction loss, and style diversification loss. In addition, to encourage the model to generate an output $\hat{y}$ following the input style $y$ more faithfully and precisely, we present a domain-consistency loss, which also allows for learning the domain prompt $t_{n}$ with domain diversification loss. Our overall architecture is shown in Fig. 2.

Language-Driven Domain Labeling

In this section, we explain how to obtain a pseudo multi-hot domain label for an image using a pretrained vision-language model. Unlike TUNIT baek2021rethinking, which clusters the images from the dataset $I \subset R^{H \times W \times 3}$ with height $H$ and width $W$ using the visual features only, we use the textual domain description as prompt $t_{n}$, and softly cluster the images by measuring the similarity between visual features and language features. In our framework, the prompts $\{t_{n}\}$ are initialized by the human-annotated domain descriptions or keywords, namely dataset-level supervision, and fine-tuned during training.

Specifically, we first extract vision features $v^{y} = E_{V}(y) \in R^{1 \times l}$ from the style image $y \in R^{H \times W \times 3}$ and language features $u^{t} = E_{L}(t) \in R^{N \times l}$ from all the prompts $t \in R^{N \times t}$ using the pretrained vision-language model, i.e., CLIP, where $l$ is the number of channels. We then measure a similarity $f^{y} = [f_{n}^{y,t}]_{n} \in R^{N \times 1}$, computed as:

$$f_{n}^{y,t} = \bar{v}^{y} \cdot \bar{u}_{n}^{t}, \tag{1}$$

where $\bar{v}^{y} = v^{y} / \| v^{y} \|$ and $\bar{u}^{t} = u^{t} / \| u^{t} \|$. Then, the pseudo multi-hot domain label $d^{y} = [d_{n}^{y}]_{n} \in R^{N \times 1}$ is estimated by setting the elements with the Top-$K$ similarities among $f_{n}^{y,t}$ over all $n$ to one, and the remaining elements to zero. In our framework, this domain label $d^{y}$ is used to extract an aggregated style vector and define multi-domain discriminators, which will be discussed in the following.
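As a minimal sketch of the labeling step above, the following computes cosine similarities between one image feature and $N$ prompt features and marks the Top-$K$ domains. The feature values are toy numbers, and the function names are hypothetical; in practice the features would come from the CLIP encoders $E_{V}$ and $E_{L}$.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def pseudo_multi_hot_label(image_feat, text_feats, k):
    """Return (similarities f, multi-hot label d) for a single image.

    image_feat: list of l floats (stand-in for E_V(y)).
    text_feats: N lists of l floats (stand-in for E_L(t)).
    """
    v = l2_normalize(image_feat)
    sims = [sum(a * b for a, b in zip(v, l2_normalize(u))) for u in text_feats]
    # Indices of the K largest similarities become ones, the rest zeros.
    top_k = sorted(range(len(sims)), key=lambda n: sims[n], reverse=True)[:k]
    label = [1 if n in top_k else 0 for n in range(len(sims))]
    return sims, label


# Toy example with N = 4 prompts and l = 2 channels.
sims, label = pseudo_multi_hot_label(
    [3.0, 4.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],
    k=2,
)
print(label)  # [0, 1, 1, 0]
```

With $K = 1$ the same routine reduces to the one-hot labeling used by conventional methods.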

Image Translation with Pseudo Domain Label

Our image-to-image translation model consists of content encoder $E_{C}$, style encoder $E_{S}$, mapping encoder $E_{M}$, and generator $G$, similar to StarGAN2 choi2020starganv2. We first extract content and style vectors from $x \in R^{H \times W \times 3}$ and $y \in R^{H \times W \times 3}$ such that $c^{x} = E_{C}(x) \in R^{h_{c} \times w_{c} \times c}$ with height $h_{c}$ and width $w_{c}$, and $s_{n}^{y} = E_{S,n}(y) \in R^{1 \times s}$ for the $n$-th domain (or $\tilde{s} = E_{M}(z)$ from a random latent $z$), respectively. While existing methods select a single vector from $\{s_{1}^{y}, \ldots, s_{N}^{y}\}$ according to the ground-truth one-hot domain label, which is inserted into the generator, we aggregate the style vectors $\{s_{1}^{y}, \ldots, s_{N}^{y}\}$ with the pseudo domain label $d^{y}$ such that

$$a^{y} = \frac{1}{M^{y}} \sum_{n=1}^{N} s_{n}^{y} d_{n}^{y}, \tag{2}$$

where $M^{y}$ is the number of ones in $d^{y}$. With this aggregated style vector $a^{y} \in R^{1 \times s}$ and content vector $c^{x}$, our network finally generates the output $\hat{y} = G(c^{x}, a^{y})$, where $G$ contains a style normalization layer. The detailed network architecture is described in the suppl. material.
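The aggregation in Eq. 2 amounts to averaging the style vectors of the active domains. A minimal sketch with toy values follows; the function name and the error handling for an all-zero label are assumptions made for illustration.

```python
def aggregate_style(style_vectors, label):
    """Average the per-domain style vectors selected by a multi-hot label.

    style_vectors: N lists of s floats (stand-ins for s_n^y).
    label: N zeros/ones (d^y); M^y is the number of ones.
    """
    m = sum(label)
    if m == 0:
        raise ValueError("label has no active domain")
    dim = len(style_vectors[0])
    return [
        sum(s[i] * d for s, d in zip(style_vectors, label)) / m
        for i in range(dim)
    ]


# Toy example with N = 3 domains and s = 2 channels.
style_vectors = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
a_y = aggregate_style(style_vectors, [1, 0, 1])
print(a_y)  # [2.5, 2.0]
```

When the label is one-hot, the result is simply the style vector of the selected domain, which recovers the single-vector selection of existing methods.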

Figure 3: Prompt learning with domain-consistency loss. With a learnable prompt initialized by a template and textual domain descriptions, we define a domain-consistency loss $L_{dc}$ between two domain labels $d^{y}$ and $d^{\hat{y}}$ from $y$ and $\hat{y}$, respectively, and minimize the loss function to learn not only the optimal prompt but also the translation modules in an end-to-end manner.

Setting Up Domain Prompt

So far, we discussed the method for language-driven domain labeling and image translation with the pseudo domain label. Since our framework highly depends on the prompt $t_{n}$, the performance may be limited if the given prompt is ambiguous, not representative, or even unable to describe all the images in the dataset. To overcome this, we present the prompt learning network shown in Fig. 3. More specifically, we set the prompt $t_{n}$ to be learnable, similarly to zhou2021learning; zhou2022conditional; huang2022unsupervised. Learnable continuous vectors shared across all domains are first defined as a template; we then define a learnable vector for each domain, which is initialized by the input domain descriptions and fine-tuned during training. We define the prompt $t_{n}$ as follows:

$$t_{n} = [p_{1}, p_{2}, \ldots, p_{L}, p_{n}^{\mathrm{domain}}], \tag{3}$$

where $p_{l}$ ($l \in \{1, \ldots, L\}$) is a vector for a word in the template with the same dimension as $p_{n}^{\mathrm{domain}}$, and $p_{n}^{\mathrm{domain}}$ is a vector for the $n$-th textual domain description. By allowing the prompt to be updated, a better prompt can be learned, which in turn boosts the translation performance.

In order to speed up convergence, all $p_{l}$ and $p_{n}^{\mathrm{domain}}$ can be initialized with a template, e.g., “a face with”, and the given textual domain descriptions, e.g., “black hair”, respectively, and fine-tuned during training. Note that the input domain descriptions should be given in our framework, but they can also be obtained by using a predefined dictionary radford2021learning; abdal2021clip2stylegan, in which case the texts most relevant to the dataset are selected based on the similarity between an image and candidate texts.
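The prompt structure of Eq. 3 can be sketched as a shared template followed by one domain-specific vector. The toy vectors and the function name below are hypothetical, and each domain description is represented by a single vector for simplicity; in a real implementation these would be learnable embeddings fed to the language encoder.

```python
def build_prompts(template_vectors, domain_vectors):
    """Concatenate the shared template with each per-domain vector.

    template_vectors: L vectors shared by all domains (p_1 ... p_L).
    domain_vectors: N vectors, one per domain description.
    Returns N prompts, each a list of L + 1 vectors.
    """
    return [template_vectors + [domain_vec] for domain_vec in domain_vectors]


# L = 3 template vectors (e.g. initialized from the words of "a face with").
template = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# N = 2 domain vectors (e.g. initialized from two domain descriptions).
domains = [[1.0, 0.0], [0.0, 1.0]]

prompts = build_prompts(template, domains)
print(len(prompts), len(prompts[0]))  # 2 4
```

Because the template vectors are the same objects in every prompt, an update to a template vector affects all domains, while an update to a domain vector affects only its own prompt.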

Moreover, our framework is designed under the assumption that the given domain descriptions faithfully represent every sample in the entire dataset; otherwise, the pseudo-labeling would be inaccurate. To address this issue, in addition to the given textual domain descriptions, we add a slack domain, the “$\emptyset$ domain”, to deal with unknown or uncertain domains, so that an image with an ambiguous label is assigned to this domain. We formulate the $\emptyset$ domain as a dataset-level description, e.g., “food” or “face”. This allows our framework to consider only confident data for translation, i.e., data that has higher similarity with the given textual domain descriptions than with the $\emptyset$ domain, thus making our network robust to the given domain descriptions. Examples assigned to this domain are shown in Fig. 4.
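One way to read the slack domain is as a per-image adaptive threshold: a candidate domain is kept only if its similarity exceeds the similarity to the $\emptyset$ prompt. The sketch below assumes this rule is applied on top of Top-$K$ selection; the exact combination and the function name are assumptions for illustration.

```python
def label_with_slack(sims, slack_sim, k):
    """Multi-hot label with a slack-domain threshold.

    sims: similarities to the N candidate domain prompts.
    slack_sim: similarity to the slack (dataset-level) prompt.
    Returns (label, assigned_to_slack).
    """
    ranked = sorted(range(len(sims)), key=lambda n: sims[n], reverse=True)[:k]
    # Keep only the Top-K domains that beat the slack domain.
    kept = [n for n in ranked if sims[n] > slack_sim]
    label = [1 if n in kept else 0 for n in range(len(sims))]
    return label, len(kept) == 0


# A confident image: two of the Top-3 domains beat the slack similarity.
print(label_with_slack([0.31, 0.22, 0.27, 0.18], slack_sim=0.25, k=3))
# ([1, 0, 1, 0], False)

# An ambiguous image: no domain beats the slack similarity.
print(label_with_slack([0.20, 0.22, 0.21, 0.18], slack_sim=0.25, k=3))
# ([0, 0, 0, 0], True)
```

Under this reading, images for which the second value is `True` would be excluded from translation training.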

Figure 4: Examples of images assigned to the $\emptyset$ domain. We set the candidate domain descriptions as “blonde hair”, “bangs”, “smiling”, and “eyeglasses”. The images shown have no attributes corresponding to these candidate domain descriptions, so they are assigned to the $\emptyset$ domain.

Loss Functions

Adversarial Loss.

We adopt an adversarial loss to guide the translated images $\hat{y}$ to arrive in the training distribution of the target domain. Following StarGAN2 and TUNIT, we also adopt the multi-domain discriminators, but we use the outputs of multi-domain discriminators weighted by a multi-hot domain label $d^{y}$:

(4)

where $D_{n}(\cdot)$ denotes the $n$-th discriminator output and $d_{n}^{y}$ denotes the $n$-th element of $d^{y}$. It should be noted that if we set $d^{y}$ as a one-hot domain label given by ground truth or a pseudo-label, this loss function becomes the same adversarial loss as in choi2020starganv2; baek2021rethinking.
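As a sketch of how such a label-weighted objective could be written, assuming the standard GAN log-likelihood form that StarGAN2 applies per domain (the exact form below is an assumption, built only from the symbols defined in this section):

```latex
L_{adv} = \mathbb{E}_{x,y}\left[\sum_{n=1}^{N} d_{n}^{y}
          \left(\log D_{n}(y) + \log\left(1 - D_{n}(\hat{y})\right)\right)\right]
```

With a one-hot $d^{y}$, only the discriminator branch of the labeled domain remains in the sum, which is consistent with the reduction to the single-domain adversarial loss noted above.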

Domain-Consistency Loss.

We further present a domain-consistency loss that encourages the networks to generate an image $\hat{y}$ following the multi-hot domain label of the input style image $y$ more precisely. Perhaps the simplest choice is to use the dot product between $f^{y}$ and $f^{\hat{y}}$ such that $L = f^{y} \cdot f^{\hat{y}}$. Minimizing this, however, can induce erroneous solutions, e.g., a constant prompt at all domains, as the prompt is also used as input for the translation module. To overcome this, inspired by SimSiam chen2021exploring, our domain-consistency loss $L_{dc}$ employs the pseudo multi-hot domain labels as targets because they are detached:

$$L_{dc} = H(d^{y}, f^{\hat{y}}) + H(d^{\hat{y}}, f^{y}), \tag{5}$$

where $H(\cdot, \cdot)$ denotes a cross-entropy that handles multi-hot labels. In the loss, the first term enables training the prompt only, and the second term enables training the remaining translation modules with the current prompt fixed, which helps stable convergence and alleviates the local-minima problem.
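The symmetric structure of Eq. 5 can be sketched as follows. The specific cross-entropy used here (softmax over similarities, with the multi-hot target normalized by its number of ones) is an assumption, since the text only states that $H$ handles multi-hot labels; the function names and the temperature parameter are hypothetical as well. The labels enter as plain constants, mirroring the detached targets.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def multi_hot_cross_entropy(label, sims, temperature=1.0):
    """Cross-entropy between a normalized multi-hot target and softmax(sims)."""
    m = sum(label)
    if m == 0:  # no active domain, e.g. an image assigned to the slack domain
        return 0.0
    probs = softmax(sims, temperature)
    return -sum(d * math.log(p) for d, p in zip(label, probs)) / m


def domain_consistency_loss(d_y, f_y, d_y_hat, f_y_hat):
    """H(d^y, f^y_hat) + H(d^y_hat, f^y); labels are constants (detached)."""
    return multi_hot_cross_entropy(d_y, f_y_hat) + multi_hot_cross_entropy(d_y_hat, f_y)


# Output similarities that agree with the style label give a lower loss.
d = [1, 1, 0, 0]
agree = domain_consistency_loss(d, [5.0, 5.0, 0.0, 0.0], d, [5.0, 5.0, 0.0, 0.0])
disagree = domain_consistency_loss(d, [5.0, 5.0, 0.0, 0.0], d, [0.0, 0.0, 5.0, 5.0])
print(agree < disagree)  # True
```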

Cycle-Consistency Loss.

To make the translated image similar to its original image, which also regularizes the ill-posed translation problem, we adopt a cycle-consistency loss such that:

$$L_{cyc} = \mathbb{E}_{x,y}\left[\| x - G(c^{\hat{y}}, a^{x}) \|_{1}\right], \tag{6}$$

where $c^{\hat{y}}$ denotes the content from $\hat{y}$, and $a^{x}$ denotes the style vector of input $x$. By encouraging the generator $G$ to reconstruct the input image $x$ with estimated style vector $a^{x}$, $G$ learns to preserve the original characteristics of $x$.

Style Reconstruction Loss.

While the generator is able to synthesize realistic images with the losses above, the synthetic results are not guaranteed to be style-consistent with $y$. In order to better learn the style representation, we compute an $\ell_{1}$ loss between the style vectors of the translated image and the style image such that

$$L_{sty} = \mathbb{E}_{x,y}\left[\| s^{y} - E_{S}(\hat{y}) \|_{1}\right]. \tag{7}$$

Overall Objective.

Our full loss functions are as follows:

$$L_{total} = \lambda_{adv} L_{adv} + \lambda_{dc} L_{dc} + \lambda_{cyc} L_{cyc} + \lambda_{sty} L_{sty}, \tag{8}$$

where $\lambda_{adv}$, $\lambda_{dc}$, $\lambda_{cyc}$, and $\lambda_{sty}$ are hyper-parameters.
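For concreteness, the weighted sum in Eq. 8 can be sketched as below. The loss values and the weights are placeholders chosen only for illustration; the actual hyper-parameter values are not given in this section.

```python
def total_loss(losses, weights):
    """Weighted sum of the individual loss terms (structure of Eq. 8)."""
    return sum(weights[name] * value for name, value in losses.items())


# Placeholder numbers only, not values from the experiments.
losses = {"adv": 0.70, "dc": 1.20, "cyc": 0.30, "sty": 0.50}
weights = {"adv": 1.0, "dc": 1.0, "cyc": 1.0, "sty": 1.0}

print(round(total_loss(losses, weights), 2))  # 2.7
```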

Experiments

Implementation Details

We adopted StarGAN2 choi2020starganv2 as the baseline architecture for content, style, mapping encoders, decoder, and discriminator. The number of template tokens, $L$ , is empirically set to four. The pre-trained weights and code will be made available.

Experimental Setup

Datasets.

We conduct the experiments on five standard datasets, including Animal Faces-10, Food-10, CelebA-HQ, FFHQ, and LHQ bossard2014food; liu2019few; liu2015deep; karras2019style; skorokhodov2021aligning. In particular, we consider 10 textual domain descriptions from each of Animal Faces-149 and Food-101, selected as in TUNIT, to define the initial prompt of our framework. For CelebA-HQ and FFHQ, we obtain 40 attributes from CelebA-HQ, sample 10 attributes three times to define the initial prompt, and report the averaged results.

Evaluation Metrics.

In our experiments, we adopt four quantitative evaluation metrics. First, the mean of class-wise Fréchet Inception Distance (mFID) heusel2017gans is used to evaluate how well the translated image reflects the input style. In addition, we use Density and Coverage (D&C) naeem2020reliable, which measure the fidelity and diversity of the translated images, respectively. The lower the mFID value, the better the quality of the image. As fidelity and diversity improve, the D&C scores become larger or closer to 1. We also measure Acc., namely the classification accuracy between the domain labels with the highest probability and the ground-truth domain labels.
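The Acc. metric described above can be sketched as a comparison between the highest-similarity domain of each image and its ground-truth domain. The similarity values below are toy numbers and the function name is hypothetical.

```python
def pseudo_label_accuracy(similarities, ground_truth):
    """Fraction of images whose highest-scoring domain matches the ground truth.

    similarities: one list of N domain scores per image.
    ground_truth: one ground-truth domain index per image.
    """
    correct = 0
    for sims, gt in zip(similarities, ground_truth):
        predicted = max(range(len(sims)), key=lambda n: sims[n])
        correct += int(predicted == gt)
    return correct / len(ground_truth)


# Four images, three candidate domains.
similarities = [
    [0.30, 0.10, 0.20],
    [0.10, 0.40, 0.20],
    [0.20, 0.10, 0.15],
    [0.05, 0.30, 0.10],
]
print(pseudo_label_accuracy(similarities, [0, 1, 2, 1]))  # 0.75
```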

Figure 5: Qualitative comparison for Tab. 1. Columns: Content, Style, StarGAN2, TUNIT, Kim et al., (I), (II), (III), (IV).

Experimental Results

Quantitative Results.

ID	Methods	CelebA-HQ mFID	CelebA-HQ D&C (Density / Coverage)
	StarGAN2 choi2020starganv2 (sup.)	32.16	1.221 / 0.446
	TUNIT baek2021rethinking (unsup.)	61.29	0.244 / 0.130
	Kim et al. kim2022style (unsup.)	41.33	0.602 / 0.241
(I)	LANIT (Top-1)	49.65	0.561 / 0.320
(II)	LANIT (Top-3)	41.68	0.681 / 0.338
(III)	(II) + $\emptyset$ domain	33.39	0.963 / 0.366
(IV)	(III) + prompt learning	32.71	1.181 / 0.425

Table 1: Quantitative comparison on CelebA-HQ.

We first report the quantitative comparison of our model with StarGAN2 (supervised), TUNIT, and Kim et al. kim2022style (unsupervised). Note that StarGAN2 is trained using 10 classes of ground-truth per-sample domain labels. Tab. 1 shows the quantitative comparison in terms of mFID and D&C metrics. In this section, we mainly compare (IV), which is our final model, with other methods. Detailed analysis of the comparison from (I) to (III) will be provided in the ablation section. LANIT (IV) achieves the best mFID and D&C scores among the methods trained without per-sample labels (mFID 32.71 vs. 41.33 for Kim et al. and 61.29 for TUNIT) and comes close to the supervised StarGAN2 (mFID 32.16), indicating that our model outperforms the unsupervised state-of-the-art methods in terms of image quality, fidelity and diversity. We also provide the quantitative comparison on Animal Faces-10 and Food-10 datasets in Tab. 2 in terms of mFID, D&C, and Acc. metrics. We can observe that LANIT consistently outperforms TUNIT on Animal Faces-10 and Food-10 in terms of Acc. and mFID. Although our LANIT only utilizes unlabeled datasets with dataset-level text supervision, it shows competitive results against StarGAN2, which requires ground-truth per-sample domain labels. The classification accuracy (0.998 and 0.996) demonstrates that our vision-language based clustering method labels images accurately. Note that we do not provide quantitative results on FFHQ since it does not provide the ground-truth labels.

Qualitative Results.

We show visual results for Tab. 1 in Fig. 5. We can observe that the quality differences between our LANIT and TUNIT or Kim et al. are significant. The images generated by the other methods tend to contain artifacts and show limited performance in transferring style from the style images, while our method yields stable performance in terms of style representation and semantic consistency. In particular, our model is capable of capturing multiple subtle attributes very well, such as “smiling” or “bang”, while TUNIT and Kim et al. often fail. Interestingly, our LANIT shows competitive results compared to StarGAN2, which is a supervised method, even with better style preservation. Moreover, in Fig. 6, we visualize the reference-guided image translation results of TUNIT, Kim et al., and our LANIT on FFHQ and CelebA-HQ. The results of LANIT consistently show higher quality than TUNIT and Kim et al. on both datasets. We also show domain-to-domain diverse image synthesis of our LANIT in Fig. 7 guided by target domain labels. Thanks to our proposed prompt learning technique, the mapping network and style encoder can faithfully produce the style vectors reflecting the multiple target domain styles.

Figure 6: Reference-guided image translation results by LANIT compared to other models on FFHQ and CelebA-HQ. Columns in each panel: Content, Style, TUNIT, Kim et al., LANIT. Note that more attributes (mustache, bang, smile, etc.) are reflected by LANIT than others.

Figure 7: Latent-guided diverse image synthesis results by our LANIT on CelebA-HQ and LHQ. Target domain text descriptions are shown above each image.

Methods	Animal Faces-10 mFID	Animal Faces-10 D&C	Animal Faces-10 Acc.	Food-10 mFID	Food-10 D&C	Food-10 Acc.
StarGAN2 (sup.)	33.67	1.54 / 0.91	1(GT)	65.03	1.09 / 0.76	1(GT)
TUNIT (unsup.)	47.7	1.04 / 0.81	0.841	52.2	1.08 / 0.88	0.848
Kim et al. (unsup.)	36.83	1.06 / 0.82	0.976	49.34	1.06 / 0.80	0.971
LANIT (Top-1)	32.67	1.57 / 0.86	0.998	49.13	1.24 / 0.73	0.996

Table 2: Quantitative comparison on Animal Faces-10 and Food-10. The configurations of StarGAN2 use ground-truth domain labels while TUNIT and Kim et al. use pseudo-labels generated from each image. Our LANIT uses only textual domain descriptions.

$N$	Methods	Animal Faces-10 mFID	Animal Faces-10 D&C	Food-10 mFID	Food-10 D&C	CelebA-HQ mFID	CelebA-HQ D&C
$4$	TUNIT	77.7	0.88 / 0.74	67.4	0.85 / 0.79	61.5	0.24 / 0.12
$4$	LANIT	68.7	1.48 / 0.46	61.2	1.10 / 0.88	53.7	0.51 / 0.28
$7$	TUNIT	62.7	1.02 / 0.73	52.7	1.02 / 0.85	54.7	0.33 / 0.16
$7$	LANIT	47.9	1.58 / 0.66	50.5	1.12 / 0.63	46.1	0.61 / 0.24
$10$	TUNIT	47.7	1.04 / 0.81	52.2	1.08 / 0.88	61.3	0.24 / 0.13
$10$	LANIT	32.7	1.57 / 0.89	49.1	1.24 / 0.73	32.7	1.18 / 0.43
$13$	TUNIT	56.8	0.99 / 0.72	54.8	0.97 / 0.85	98.9	0.08 / 0.03
$13$	LANIT	32.9	1.55 / 0.85	55.0	1.31 / 0.77	34.8	0.69 / 0.34
$16$	TUNIT	54.1	1.09 / 0.78	54.8	1.03 / 0.86	127.7	0.04 / 0.02
$16$	LANIT	34.3	1.59 / 0.82	53.9	1.34 / 0.79	32.1	0.89 / 0.35

Table 3: Quantitative comparison of LANIT with TUNIT by varying the number of domains.

Ablation Study and Analysis

Number of Domain Descriptions $N$ .

In Tab. 3, we validate the effects of the number of candidate domains $N$ compared to TUNIT. Our results show that varying $N$ changes the performance of our model less than that of TUNIT, which supports the robustness of our framework. In addition, to analyze the effectiveness of each strategy for different values of $N$, we measured Acc. in Fig. 8. In our setting, the use of both the $\emptyset$ domain and the prompt learning technique improves the labeling accuracy, which is most visible for $N = 4, 7, 13, 16$. On Animal Faces-10 and Food-10, LANIT shows the best performance when there are 10 candidate domains, whereas on CelebA-HQ, there is little difference in the FID score when $N$ is over 10. We observe a trade-off between our labeling accuracy and the number of domain prompts; the model with more domain labels has higher representation capacity with more diverse attributes, while it has high computational complexity.

$\emptyset$ Domain.

We study the effectiveness of the $\emptyset$ domain in Tab. 1 (III). We observe that the model with the $\emptyset$ domain substantially outperforms the baseline model (II), improving mFID from 41.68 to 33.39. This indicates that the unknown domain enables the model to mitigate the influence of ambiguous or noisy data. Our results show that LANIT with the $\emptyset$ domain yields a large improvement in classification accuracy and mFID even without prompt learning. Also in Fig. 8, we find that our $\emptyset$ domain brings a large improvement in performance for every benchmark, especially on CelebA-HQ. In the results, we observe that our assumption that an image may have more than a single attribute is well demonstrated on datasets that have more diverse attributes; for CelebA-HQ, domain labels can be described with diverse attributes such as hair style or facial attributes, while the domains of Animal Faces-10 and Food-10 are visually distinct, leading to limited domain descriptions.

Prompt Learning.

In order to evaluate the effectiveness of prompt learning, we compare the estimated pseudo-domain labels with and without the proposed prompt learning, where the latter can be regarded as offline clustering. The comparison shows that with prompt learning, the probability of each image/text prompt pair is estimated more accurately. Since prompt learning helps reduce label noise by refining the prompt, the result without prompt learning shows rather limited clustering performance, which is verified in Tab. 1 (IV).

In Fig. 8, prompt learning is more effective in Animal Faces-10 and Food-10, while $\emptyset$ domain brings large improvement in performance in CelebA-HQ. This highlights the importance of the use of both of our proposed strategies, helping to find optimal domain labels.

Number of Activated Attributes $K$ .

Additionally, we show the effect of selecting single or multiple activated Top-$K$ attributes in Tab. 1. We can see that the result of Top-3 (II) significantly surpasses Top-1 (I). This demonstrates that an image can be described more precisely when given multiple keywords or text descriptions, whereas most prior works, including TUNIT, Kim et al., and StarGAN2, assign a single label to each image. However, we choose the Top-1 domain label for Animal Faces-10 and Food-10, since animal species or kinds of food cannot be described with multiple attributes. This shows the flexibility of our framework.

Figure 8: Classification accuracy of pseudo-labeling. (a) CelebA-HQ.

User Study.

Finally, we conducted a user study to evaluate the image quality, content preservation, and style consistency of LANIT compared to StarGAN2, TUNIT, and Kim et al. A total of 204 users were involved in this study and were asked to answer 60 questions, each of which is generated from 20 randomly sampled images from CelebA-HQ. Examples of the questions are as follows: “Which results do you think have the highest quality/preserve content information such as pose/have similar style of reference image?” Fig. 9 summarizes the user study, where our LANIT achieves the top rank in all tasks.

Conclusion

In this paper, we proposed a LANguage-driven Image-to-image Translation framework for unlabeled data (LANIT) that leverages dataset-level supervision. By using textual domain descriptions as dataset-level supervision, our model not only mitigates the heavy dependency on per-sample supervision but also considers multi-hot domain labels for an image, which provides a richer representation. We have presented language-driven domain labeling based on a pretrained vision-language model. Considering that user-provided domain descriptions cannot account for the whole dataset, we introduced an additional $\emptyset$ domain to filter out data not covered by them, which dramatically improves the translation quality. By formulating the textual domain description as a learnable prompt, we also presented joint learning of the prompt and image translation modules with the proposed domain-consistency loss function. Our experiments have shown that LANIT performs comparably to or even better than existing image-to-image translation models that require per-sample supervision. In the future, we will apply LANIT to other tasks, such as domain adaptation and domain generalization.

References

Appendix A. More Implementation Details

Network architecture of LANIT.

We summarize the detailed network architecture of our LANIT in Tab. 4. We largely follow the content encoder, style encoder, mapping network, and generator architectures of StarGAN2 choi2020starganv2.

Content Encoder
Layer	Resample	Norm	Output shape $(C \times H \times W)$
Conv1 $\times$ 1	-	-	$(64, 256, 256)$
Resblock	AvgPool	InstanceNorm	$(128, 128, 128)$
Resblock	AvgPool	InstanceNorm	$(256, 64, 64)$
Resblock	AvgPool	InstanceNorm	$(512, 32, 32)$
Resblock	AvgPool	InstanceNorm	$(512, 16, 16)$

Mapping Network
Layer	Type	Activation	Output shape $(C)$
Latent	Shared	-	$16$
Linear	Shared	ReLU	$512$
Linear	Shared	ReLU	$512$
Linear	Shared	ReLU	$512$
Linear	Shared	ReLU	$512$
Linear	Unshared	ReLU	$512$
Linear	Unshared	ReLU	$512$
Linear	Unshared	ReLU	$512$
Linear	Unshared	-	$64$

Generator
Layer	Resample	Norm	Output shape $(C \times H \times W)$
Resblock	-	InstanceNorm	$(512, 16, 16)$
Resblock	-	InstanceNorm	$(512, 16, 16)$
Resblock	-	AdaptiveInstanceNorm	$(512, 16, 16)$
Resblock	-	AdaptiveInstanceNorm	$(512, 16, 16)$
Resblock	Upsample	AdaptiveInstanceNorm	$(512, 32, 32)$
Resblock	Upsample	AdaptiveInstanceNorm	$(256, 64, 64)$
Resblock	Upsample	AdaptiveInstanceNorm	$(128, 128, 128)$
Resblock	Upsample	AdaptiveInstanceNorm	$(64, 256, 256)$
Conv1 $\times$ 1	-	-	$(3, 256, 256)$

Style Encoder and Discriminator
Layer	Resample	Norm	Output shape $(C \times H \times W)$
Conv1 $\times$ 1	-	-	$(64, 256, 256)$
Resblock	AvgPool	InstanceNorm	$(128, 128, 128)$
Resblock	AvgPool	InstanceNorm	$(256, 64, 64)$
Resblock	AvgPool	InstanceNorm	$(512, 32, 32)$
Resblock	AvgPool	InstanceNorm	$(512, 16, 16)$
Resblock	AvgPool	-	$(512, 8, 8)$
Resblock	AvgPool	-	$(512, 4, 4)$
LReLU	-	-	$(512, 4, 4)$
Conv4 $\times$ 4	-	-	$(512, 1, 1)$
LReLU	-	-	$(512, 1, 1)$
Linear(Unshared)	-	-	$(64, 1, 1)$

Table 4: Network architecture of our LANIT.
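As a sanity check on the output shapes listed in Tab. 4, the following sketch traces a $(C, H, W)$ shape through a stack of downsampling blocks. It assumes only what the table states, namely that each AvgPool block halves the spatial resolution; the helper name `trace_downsampling` is hypothetical.

```python
def trace_downsampling(in_shape, channels):
    """Follow (C, H, W) through blocks that each halve H and W via AvgPool."""
    _, h, w = in_shape
    shapes = []
    for c in channels:
        h, w = h // 2, w // 2  # AvgPool with stride 2 halves the resolution
        shapes.append((c, h, w))
    return shapes


# After the initial 1x1 convolution the feature map is (64, 256, 256).
content_encoder = trace_downsampling((64, 256, 256), [128, 256, 512, 512])
style_encoder = trace_downsampling((64, 256, 256), [128, 256, 512, 512, 512, 512])
print(content_encoder[-1])  # final content feature shape
print(style_encoder[-1])    # shape fed to the 4x4 convolution
```

The traced shapes agree with the table: four blocks give $(512, 16, 16)$ for the content encoder and six blocks give $(512, 4, 4)$ before the $4 \times 4$ convolution of the style encoder and discriminator.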

Additional experimental setup.

We employ the Adam optimizer with $\beta_{1} = 0.0$ and $\beta_{2} = 0.99$ for 100,000 iterations, using a step-decay learning rate scheduler. We set the batch size to 8 and the initial learning rate to 1e-4 for the encoder, generator, and discriminator, and to 1e-6 for the prompt. All coefficients for the losses are set to 1. The training images are resized to 256 $\times$ 256. We conduct experiments using a single 24GB RTX 3090 GPU. The pre-trained weights and code will be made publicly available.
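For reference, the setup above can be collected into a small configuration with a generic step-decay rule. The dictionary keys, the function name `step_decay_lr`, and in particular the decay interval and factor (`step_size`, `gamma`) are assumptions made for illustration, since the text does not specify them.

```python
# Values stated in the text.
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "betas": (0.0, 0.99),
    "iterations": 100_000,
    "batch_size": 8,
    "lr_networks": 1e-4,  # encoder, generator, discriminator
    "lr_prompt": 1e-6,    # learnable prompt
    "image_size": 256,
    "loss_weights": 1.0,
}


def step_decay_lr(base_lr, iteration, step_size, gamma):
    """Generic step decay: multiply the rate by gamma every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


# step_size and gamma below are hypothetical; the paper does not report them.
for it in (0, 49_999, 50_000, 99_999):
    print(it, step_decay_lr(TRAIN_CONFIG["lr_networks"], it, step_size=50_000, gamma=0.5))
```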

Domain descriptions.

Tab. 5 provides additional details on the domain descriptions. We follow the 10 pre-defined domains from TUNIT baek2021rethinking for Animal Faces-10 and Food-10. The initial prompts in our framework are “A photo of animal face with [domain name]” and “A photo of food with [domain name]”, respectively. Likewise, for CelebA-HQ liu2015deep and FFHQ karras2019style, we obtain 40 pre-defined textual attributes from CelebA-HQ liu2015deep. We mainly show results using 10 domain descriptions, which are randomly selected three times, and report the averaged results. The table provides only one of the three randomly selected sets; the others are noted in the respective figures when necessary. Please note that we do not use per-sample domain labels in any case, and the domain labels serve as dataset-level candidates.

Datasets	Template	N	Domain Descriptions
CelebA-HQ liu2015deep FFHQ karras2019style	A face of	4	‘blond hair’, ‘black hair’, ‘smiling’, ‘eyeglasses’
		7	‘blond hair’, ‘wavy hair’, ‘black hair’, ‘smiling’, ‘eyeglasses’, ‘goatee’, ‘bangs’
		10	‘blond hair’, ‘bald’, ‘wavy hair’, ‘black hair’, ‘smiling’, ‘straight hair’, ‘eyeglasses’, ‘goatee’, ‘bangs’, ‘arched eyebrows’
Animal Faces-10 liu2019few	A photo of animal face with	4	‘beagle’, ‘golden retriever’, ‘tabby cat’, ‘tiger’
		7	‘beagle’, ‘dandie dinmont terrier’, ‘golden retriever’, ‘white fox’, ‘tabby cat’, ‘snow leopard’, ‘tiger’
		10	‘appenzeller sennenhund’, ‘beagle’, ‘dandie dinmont terrier’, ‘golden retriever’, ‘malinois’, ‘white fox’, ‘tabby cat’, ‘snow leopard’, ‘lion’, ‘tiger’
Food-10 bossard2014food	A photo of food with	4	‘baby back ribs’, ‘beignets’, ‘dumplings’, ‘edamame’
		7	‘baby back ribs’, ‘beef carpaccio’, ‘beignets’, ‘clam chowder’, ‘dumplings’, ‘edamame’, ‘strawberry shortcake’
		10	‘baby back ribs’, ‘beef carpaccio’, ‘beignets’, ‘bibimbap’, ‘caesar salad’, ‘clam chowder’, ‘dumplings’, ‘edamame’, ‘spaghetti bolognese’, ‘strawberry shortcake’
LHQ skorokhodov2021aligning	A photo of scene	10	‘with mountain’, ‘with field’, ‘with lake’, ‘with ocean’, ‘with waterfall’, ‘in summer’, ‘in winter’, ‘on sunny day’, ‘on cloudy day’, ‘at sunset’
MetFace karras2020training	A portrait with	10	‘oil painting’, ‘grayscale’, ‘black hair’, ‘wavy hair’, ‘male’, ‘mustache’, ‘smiling’, ‘gray hair’, ‘blonde hair’, ‘sculpture’
Anime chao2019_online	A photo of anime with	10	‘brown hair’, ‘red hair’, ‘black hair’, ‘purple hair’, ‘blond hair’, ‘blue hair’, ‘pink hair’, ‘silver hair’, ‘green hair’, ‘white hair’
LSUN-Car yu2015lsun	A car painted with	10	‘red color’, ‘orange color’, ‘gray color’, ‘blue color’, ‘yellow color’, ‘white color’, ‘black color’, ‘silver color’, ‘green color’, ‘pink color’
LSUN-Church yu2015lsun	A church	7	‘at night’, ‘with sunset’, ‘in winter’, ‘on cloudy day’, ‘on sunny day’, ‘with trees’, ‘with a river’

Table 5: Examples of Domain Descriptions for Each Dataset.
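The initial prompts are formed by combining each template in Tab. 5 with a domain description. The sketch below shows only this text construction; the helper name `build_prompts` is hypothetical, and the subsequent tokenization and prompt fine-tuning performed during training are not shown.

```python
def build_prompts(template, domain_names):
    """Join a dataset-level template with each candidate domain description."""
    return [f"{template} {name}" for name in domain_names]


# Template and N = 4 domain descriptions for Animal Faces-10, as listed in Tab. 5.
animal_prompts = build_prompts(
    "A photo of animal face with",
    ["beagle", "golden retriever", "tabby cat", "tiger"],
)
for prompt in animal_prompts:
    print(prompt)
```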

Appendix B. Additional Ablation Study

Number of activated attributes $K$ during inference.

We observe that our LANIT also works with a flexible value of $K$ at inference time, which demonstrates its controllability. In Fig. 10, for example, the results with $K = 1, 2$ also faithfully represent the target domain prompts, even though the model was trained with $K = 3$.

Figure 10: Qualitative results by varying the number of $K$ during inference. Note that the model is trained with Top-3 setting.
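To make the role of $K$ at inference concrete, the following sketch aggregates per-domain style vectors according to a multi-hot assignment. It assumes a simple average over the active domains, which may differ from the aggregation defined in the main paper; the function name `aggregate_style` and the two-dimensional toy style vectors are hypothetical.

```python
def aggregate_style(style_vectors, multi_hot):
    """Average the style vectors of the domains marked active in multi_hot."""
    active = [vec for vec, flag in zip(style_vectors, multi_hot) if flag]
    if not active:
        raise ValueError("at least one domain must be active")
    dim = len(active[0])
    # Assumed aggregation: element-wise mean over the active domains.
    return [sum(vec[d] for vec in active) / len(active) for d in range(dim)]


# Toy 2-D style vectors for three domains (real style codes are higher-dimensional).
styles = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(aggregate_style(styles, [1, 0, 0]))  # K = 1
print(aggregate_style(styles, [1, 1, 0]))  # K = 2
```

Because the aggregation only depends on which entries are active, the same routine accepts any number of active domains, regardless of the $K$ used during training.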

Number of domain descriptions $N$ .

In the main paper, we examined the impact of the number of candidate domains, the $\emptyset$ domain, and prompt learning. In this section, we additionally validate the effect of the number of domain descriptions on CelebA-HQ, as shown in Fig. 11. While our model consistently generates high-quality outputs, results with a larger number of domain descriptions tend to represent diverse attributes more faithfully.

We also show examples of domain labels estimated by LANIT when varying the number of domain descriptions in Fig. 12, where each experiment ( $N = 4, 7, 10$ ) is trained with the textual descriptions in Tab. 5. Interestingly, some estimated domain labels are more accurate than even the ground-truth labels, which suggests that our language-driven domain labeling is able to correct mislabeling. The figure also demonstrates the effect of our $\emptyset$ domain. For example, ‘strawberry shortcake’ is not included among the candidate domain labels when $N = 4$, which leads the input image to be assigned to the slack domain.

Figure 11: Qualitative results by varying the number of domain description $N$ .

Variations in $\emptyset$ domain according to the descriptions $N$ in detail.

We highlighted the effectiveness of the $\emptyset$ domain in our framework for preventing error propagation. In Fig. 12, we show in more detail how the proposed $\emptyset$ domain is assigned by visualizing examples that are assigned to it. Each model with a different $N$ is trained with the domain descriptions in Tab. 5. For example, in the first row, when $N = 7, 10$, the sample is assigned to “strawberry shortcake”, matching the ground-truth label. However, when $N = 4$, the sample is not assigned to any domain, which means that it has no well-matching prompt among the given candidate domain descriptions, as we expected. In other words, the candidate domain descriptions for $N = 4$ in Food-10 (see Tab. 5) do not include “strawberry shortcake”, so the sample should not be assigned to any domain. In the second row, we show how the $\emptyset$ domain works in the multi-hot labeling setting. In the third row, we show that the ground-truth labels may contain label errors. As seen, a human face has been labeled as “caesar salad” in the Food-10 dataset; in our framework, this sample is assigned to the $\emptyset$ domain for every value of $N$. Although noisy data might be included in the dataset, our $\emptyset$ domain can effectively filter such data.

Figure 12: Examples of LANIT-estimated labels when varying the number of domain descriptions. The three rows show samples from Food-10, CelebA-HQ, and Food-10, respectively. For each experiment, we adopt the domain descriptions in Tab. 5.
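The behavior in Fig. 12 can be mimicked with a minimal sketch of slack-domain assignment. It assumes a simple rule in which a sample falls into the $\emptyset$ domain when none of its Top-$K$ candidates reaches a similarity threshold; this rule, the function name `assign_domains`, the threshold, and the similarity values are all assumptions for illustration, and the exact criterion is the one defined in the main paper.

```python
def assign_domains(similarities, k, threshold):
    """Return a multi-hot label over N domains plus a final slack (empty-domain) flag."""
    ranked = sorted(range(len(similarities)), key=lambda i: -similarities[i])
    # Keep only Top-K candidates that are similar enough (assumed threshold rule).
    active = [i for i in ranked[:k] if similarities[i] >= threshold]
    labels = [0] * len(similarities)
    for i in active:
        labels[i] = 1
    slack = 0 if active else 1  # no matching candidate: assign to the slack domain
    return labels + [slack]


# N = 4 Food-10 candidates: baby back ribs, beignets, dumplings, edamame.
shortcake_image = [0.10, 0.18, 0.12, 0.05]  # hypothetical scores, no good match
beignets_image = [0.08, 0.41, 0.11, 0.03]   # hypothetical scores, clear match
print(assign_domains(shortcake_image, k=1, threshold=0.25))
print(assign_domains(beignets_image, k=1, threshold=0.25))
```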

Number of activated attributes $K$ on Animal Faces-10 and Food-10 datasets.

As reported in the main paper, we choose the Top-1 domain label as the default setting for Animal Faces-10 and Food-10, since the candidate domains are animal species for Animal Faces-10 and kinds of food for Food-10, which are not appropriate for the multi-hot label setting. In Tab. 6, we compare quantitative results between the Top-1 and Top-3 attribute settings on Animal Faces-10 and Food-10. As expected, LANIT records the best scores in the Top-1 setting. Nevertheless, even in the Top-3 setting, our model records higher accuracy than TUNIT thanks to the $\emptyset$ domain. Note that the accuracy is about 0.79 for Top-3 without the $\emptyset$ domain on both datasets.

	Animal Faces-10			Food-10
Methods	mFID	D&C	Acc.	mFID	D&C	Acc.
StarGAN2 (sup.)	33.67	1.54 / 0.91	1(GT)	65.03	1.09 / 0.76	1(GT)
TUNIT (unsup.)	47.7	1.04 / 0.81	0.841	52.2	1.08 / 0.88	0.848
Kim et al. (unsup.)	36.83	1.06 / 0.82	0.976	49.34	1.06 / 0.80	0.971
LANIT (Top-1)	32.67	1.57 / 0.86	0.998	49.13	1.24 / 0.73	0.996
LANIT (Top-3)	41.4	1.07 / 0.76	0.909	50.27	1.04 / 0.78	0.961

Table 6: Quantitative comparison on Animal Faces-10 and Food-10.

Appendix C. Additional Comparisons

Additional comparisons to other CLIP-based manipulation methods.

We also include additional qualitative comparisons to other CLIP-based image manipulation methods, namely DiffusionCLIP kim2022diffusionclip and StyleCLIP patashnik2021styleclip, on CelebA-HQ in Fig. 13. It should also be noted that StyleCLIP relies on a pre-trained StyleGAN2 viazovetskyi2020stylegan2 generator and DiffusionCLIP relies on a pre-trained diffusion model ho2020denoising. In the results, DiffusionCLIP and StyleCLIP show limited capacity to synthesize realistic results with multiple attributes, while our LANIT generates high-quality images that faithfully represent each domain description, regardless of the number of domain descriptions. For instance, in the first example with [“blond hair”, “bang”, “smiling”], the output of StyleCLIP fails to represent the style of “bang”, whereas its result with [“blond hair”, “bang”] successfully reflects the style of “bang”. The reason is that LANIT learns a style prior from each text during training, while the other methods are designed to fine-tune pre-trained generative models on each instance independently with a CLIP loss.

Figure 13: Additional qualitative results with (a) DiffusionCLIP kim2022diffusionclip, (b) StyleCLIP patashnik2021styleclip, and (c) Ours.

Additional comparisons in Animal Faces-10 and Food-10 datasets.

In Fig. 14, we show reference-guided image synthesis results of our LANIT in comparison to TUNIT baek2021rethinking and StarGAN2 choi2020starganv2. As shown in Fig. 14, our model generates competitive results compared to StarGAN2 choi2020starganv2 with much simpler annotations, and outstanding results compared to TUNIT baek2021rethinking.

Figure 14: Reference-guided image translation results by our LANIT compared to TUNIT baek2021rethinking and StarGAN2 choi2020starganv2 on Animal Faces-10 and Food-10. Given domain descriptions for Animal Faces-10: ‘appenzeller sennenhund’, ‘beagle’, ‘dandie dinmont terrier’, ‘golden retriever’, ‘malinois’, ‘white fox’, ‘tabby cat’, ‘snow leopard’, ‘lion’, ‘tiger’, and for Food-10: ‘baby back ribs’, ‘beef carpaccio’, ‘beignets’, ‘bibimbap’, ‘caesar salad’, ‘clam chowder’, ‘dumplings’, ‘edamame’, ‘spaghetti bolognese’, ‘strawberry shortcake’.

Additional comparisons in the CelebA-HQ dataset.

In this section, we show additional quantitative and visual comparisons with Smoothing liu2021smoothing on CelebA-HQ in Tab. 7 and Fig. 15. More quantitative comparisons with other methods are in the main paper.

		CelebA-HQ
	Methods	mFID	D&C
	Smoothing liu2021smoothing (supervised)	35.93	1.253 / 0.431
	StarGAN2 choi2020starganv2 (supervised)	32.16	1.221 / 0.446
	TUNIT baek2021rethinking (unsupervised)	61.29	0.244 / 0.13
	Kim et al. kim2022style (unsupervised)	41.33	0.602 / 0.241
(I)	LANIT (Top-1)	49.65	0.561 / 0.320
(II)	LANIT (Top-3)	41.68	0.681 / 0.338
(III)	(II) + $\emptyset$ domain	33.39	0.963 / 0.366
(IV)	(III) + prompt learning	32.71	1.181 / 0.425

Table 7: Quantitative comparison on CelebA-HQ liu2015deep.

Figure 15: Reference-guided image translation results by our LANIT compared to Smoothing liu2021smoothing on CelebA-HQ. Given domain descriptions are as follows: ‘blond hair’, ‘bald’, ‘wavy hair’, ‘black hair’, ‘smiling’, ‘straight hair’, ‘eyeglasses’, ‘goatee’, ‘bangs’, ‘arched eyebrows’.

Appendix D. Additional Results of LANIT

In this section, we present additional visual results of our method in Fig. 16, Fig. 17, Fig. 18, Fig. 20, Fig. 21, Fig. 22, Fig. 23 on the Animal Faces-10, Food-10, CelebA-HQ, MetFace, Anime, LSUN-car and LSUN-church datasets liu2019few; bossard2014food; liu2015deep; karras2020training; chao2019_online; yu2015lsun, including latent-guided diverse image synthesis and reference-guided image translation results. Thanks to our multi-hot labeling setting with the proposed prompt learning technique and $\emptyset$ domain, our mapping network and style encoder can produce style vectors that faithfully represent multiple target domain styles.

Figure 16: Image translation results on Animal Faces-10. Given domain descriptions are as follows: ‘appenzeller sennenhund’, ‘beagle’, ‘dandie dinmont terrier’, ‘golden retriever’, ‘malinois’, ‘white fox’, ‘tabby cat’, ‘snow leopard’, ‘lion’, ‘tiger’.

Figure 17: Image translation results on Food-10. Given domain descriptions are as follows: ‘baby back ribs’, ‘beef carpaccio’, ‘beignets’, ‘bibimbap’, ‘caesar salad’, ‘clam chowder’, ‘dumplings’, ‘edamame’, ‘spaghetti bolognese’, ‘strawberry shortcake’.

Figure 18: Latent-guided diverse image synthesis results by our LANIT on CelebA-HQ. Given domain descriptions are as follows: ‘bangs’, ‘blond hair’, ‘black hair’, ‘smiling’, ‘pale skin’, ‘heavy makeup’, ‘no beard’, ‘rosy cheeks’, ‘wearing lipstick’, ‘male’.

Figure 19: Latent-guided diverse image synthesis results by our LANIT on LHQ. Given domain descriptions are as follows: “with mountain”, “with field”, “with lake”, “with ocean”, “with waterfall”, “in summer”, “in winter”, “on sunny day”, “on cloudy day”, “at sunset”.

Figure 20: Reference-guided image translation results by our LANIT on MetFace karras2020training. Given domain descriptions are as follows: “oil painting”, “grayscale”, “black hair”, “wavy hair”, “male”, “mustache”, “smiling”, “gray hair”, “blonde hair”, “sculpture”.

Figure 21: Reference-guided image translation results by our LANIT on Anime chao2019_online. Given domain descriptions are as follows: “brown hair”, “red hair”, “black hair”, “purple hair”, “blond hair”, “blue hair”, “pink hair”, “silver hair”, “green hair”, “white hair”.

Figure 22: Reference-guided image translation results by our LANIT on LSUN-car. Given domain descriptions are as follows: “red color”, “orange color”, “gray color”, “blue color”, “yellow color”, “white color”, “black color”, “silver color”, “green color”, “pink color”.

Figure 23: Reference-guided image translation results by our LANIT on LSUN-church. Given domain descriptions are as follows: “at night”, “at sunset”, “in winter”, “on cloudy day”, “on sunny day”, “with trees”, “with a river”.

Limitations

Although our LANIT shows strong performance on various benchmarks, it inherits the class-preference problem of pretrained vision-language models. Specifically, as observed in recent literature such as UPL huang2022unsupervised, pretrained vision-language models such as CLIP often show a preference for certain classes given an image, which may result in an imbalanced distribution of pseudo-labeled data in our framework. To overcome this, we suggest adopting a domain normalization technique as future work. Unlike UPL huang2022unsupervised, which avoids this bias by simply selecting a few confident samples per class, the domain normalization technique could be applied as data pre-processing to obtain a more uniform domain preference. Specifically, we can normalize the CLIP similarity by the mean and variance of the confident samples of each domain.
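The suggested domain normalization can be sketched as follows. This is only one possible reading of the idea: it assumes that the “confident samples” of a domain are its highest-scoring samples and that normalization is a z-score using their mean and standard deviation (the square root of the variance). The function names, the choice of `top_n`, and the similarity values are hypothetical.

```python
from statistics import mean, pstdev


def confident_stats(scores_per_domain, top_n):
    """Per-domain (mean, std) computed over the top_n most confident samples."""
    stats = []
    for scores in scores_per_domain:
        top = sorted(scores, reverse=True)[:top_n]
        # Fall back to 1.0 to avoid division by zero when all scores are equal.
        stats.append((mean(top), pstdev(top) or 1.0))
    return stats


def normalize_per_domain(similarities, stats):
    """Z-score one image's similarities with the statistics of each domain."""
    return [(s - mu) / sigma for s, (mu, sigma) in zip(similarities, stats)]


# Hypothetical similarities: domain 0 is systematically preferred over domain 1.
all_scores = [[0.9, 0.8, 0.7, 0.2], [0.5, 0.4, 0.3, 0.1]]
stats = confident_stats(all_scores, top_n=3)
print(normalize_per_domain([0.8, 0.4], stats))
```

In this toy example the raw scores 0.8 and 0.4 would favor domain 0, while after normalization both domains receive approximately the same score, which is the uniform preference the normalization aims for.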

