AWADA: Attention-Weighted Adversarial Domain Adaptation for Object Detection

Maximilian Menke
Robert Bosch GmbH
maximilian.menke@de.bosch.com

Thomas Wenzel
Robert Bosch GmbH
thomas.wenzel2@de.bosch.com

Andreas Schwung
Fachhochschule Südwestfalen
schwung.andreas@fh-sef.de

Abstract

Object detection networks have reached an impressive performance level, yet a lack of suitable data in specific applications often limits them in practice. Typically, additional data sources are utilized to support the training task. However, domain gaps between these data sources pose a challenge in deep learning. GAN-based image-to-image style-transfer is commonly applied to shrink the domain gap, but is unstable and decoupled from the object detection task.
We propose AWADA, an Attention-Weighted Adversarial Domain Adaptation framework for creating a feedback loop between the style-transformation and the detection task. By constructing foreground object attention maps from object detector proposals, we focus the transformation on foreground object regions and stabilize style-transfer training.
In extensive experiments and ablation studies, we show that AWADA reaches state-of-the-art unsupervised domain adaptation object detection performance on commonly used benchmarks for tasks such as synthetic-to-real, adverse weather and cross-camera adaptation.


1 Introduction

In deep learning, extensive real-world data collection aims at creating large training datasets for downstream tasks such as object detection. Labeling such data can be very expensive because, typically, human annotators have to annotate each object instance by hand.
One direction of research focuses on unsupervised domain adaptation methods, which transfer knowledge learned from a labeled source domain to a related unlabeled target domain. Hence, no labels on the target domain are required. Object detectors trained on a source domain typically suffer on a target domain because of the gap between both data domains. In domain adaptation, current style-transfer methods utilize Generative Adversarial Networks (GANs) [6] to transform the image style of a source domain to a target domain style. The style-transfer network thereby aligns the data distributions on the image level.
Especially for object detection, foreground object regions are more relevant than the background. Current style-transfer networks based on e.g. Cycle-GAN [36] ignore this foreground-background distinction and use all pixels equally weighted to solve generators and discriminators’ adversarial min-max optimization problems. Therefore, the style-transfer network is disconnected from the final downstream task, i.e. object detection.
We propose AWADA, an Attention-Weighted Adversarial Domain Adaptation framework for object detection, which targets this problem. Using the foreground and background distinction based on object proposals (see Figure 1), we re-weight the style-transfer loss functions using our proposed Attention-Weighting Modules (AWM). Our method restricts the adversarial game between generators and discriminators to object foreground regions. With AWADA, we create a novel feedback loop between the downstream task and the style-transfer network to address the disconnected transformation and detection tasks we see in current SOTA style-transfer networks. Our method also stabilizes the GAN-based style-transfer training.
We evaluate AWADA on different automotive benchmarks such as synthetic-to-real, adverse weather, and cross-camera adaptation. We show that AWADA outperforms current unsupervised domain adaptation methods. Our contributions are the following:

Figure 1: Exemplary images and corresponding attention maps constructed from object detector proposals of sim10k [10] on the left and Cityscapes [3] on the right. Bright overlays mark attention regions of foreground object used to re-weight our GAN losses. Note that only car objects are used to construct attention maps for both images.

We propose AWADA, a novel style-transformation framework, which introduces a feedback loop between the object detection and the transformation task.
We weight foreground and background areas differently when training our style-transfer network. AWADA focuses the adversarial training on object regions using attention maps constructed from object detector proposals.
We show that AWADA outperforms current GAN-based SOTA methods in unsupervised domain adaptation on three standard automotive benchmarks.

2 Related Work

This section gives an overview of current state-of-the-art methods in unsupervised domain adaptation. We first give a broad overview of domain adaptation in object detection. Afterwards we discuss GAN-based style-transfer networks, which are primarily motivated by semantic segmentation. Finally, we outline current trends in utilizing foreground-background distinction in domain adaptation.

2.1 Domain Adaptation in Object Detection

Domain adaptation in deep learning object detection started in 2018 by introducing global and local gradient reversal layers (GRL [5]) into a Faster-RCNN object detector network [2, 19]. Recent work utilizes image and instance full alignment (iFAN) to align multi-level features between two related data domains [37]. The authors in [18] propose a similarity-based domain alignment by matching only features of objects belonging to the same cluster. Furthermore, normalizing features of two domains before applying GRL is discussed by Su et al. [23] and suppression of gradients is applied by Wang et al. [26]. In the direction of feature learning, there are several other publications [27, 1, 24].
Another research direction, e.g. by the authors of [9], [12] and [31], leverages iterative pseudo labeling of the target domain. Work by Soviany et al. uses curriculum learning [22] by first creating pseudo labels for easy instances and iteratively taking more difficult instances into account. In [4], the authors apply Unbiased Mean Teacher, which uses knowledge distillation to align two domains in a student-teacher setup.
With DA-DETR [33], transformers were applied in domain adaptation for the first time. Using hybrid self-attention modules, they improve object detection performance while also providing a strong transformer baseline across different automotive benchmarks. Another research direction is using Graph Neural Networks (GNN) to close the domain gap by structuring the domain distributions using graphs [28, 13].
Contrary to other approaches, we do not modify the Faster-RCNN object detector from Ren et al. [17]. Therefore, our work is orthogonal to others' work [2, 8, 37], as we keep the detector network untouched and apply style-transfer to align the data distributions.

2.2 GAN-based Domain Adaptation

Figure 2: In (a), we show our proposed AWADA framework with training of source to target generator $G_{S T}$ , target to source generator $G_{T S}$ and two discriminators $D_{S}$ and $D_{T}$ . With our proposed Attention-Weighting Modules (AWM) we utilize foreground object attention maps, drawn from pre-trained object detectors. Source domain attention maps are obtained from a source domain-trained detector, while the target domain detector is trained on the stylized source domain. We re-weight the per-pixel loss of each adversarial component of the style-transfer network using our attention maps. Afterwards in (b), we train a final object detector on AWADA stylized images while freezing the generator $G_{S T}$ .

In semantic segmentation, the first approach using style-transfer for domain adaptation was Cycle-GAN [36], which introduced cycle-consistency to enforce consistent generation while reducing visual artifacts. CyCADA [7] introduced semantic consistency into the Cycle-GAN framework. MADAN [34] extended CyCADA by utilizing multiple source domains in their generation framework. Other orthogonal style-transfer frameworks such as UNIT [15] and Pix2PixHD [25] have been released as well.
Style-transfer has also been applied to object detection by a wide range of other publications [9, 31, 22]. They mostly rely on a fixed Cycle-GAN style-transfer network to transform the image style of the source domain. However, the authors of [16] propose to reduce the patch size in training and adapt CyCADA from semantic segmentation to object detection. They have shown that including semantic consistency in style transfer also improves object detection performance.
We extend this approach by introducing object detection-guided attention into the transformation cycle to modify the style-transfer network specifically for object detection, which has not been considered in related work before.

2.3 Attention in Domain Adaptation

Leveraging foreground object attention mechanisms in domain adaptation is not new. Zheng et al. propose a method guiding the GRL-based domain transfer by attention maps drawn from an object detector [35]. Specifically for anchor-less object detection models, [8] utilizes center and objectness scores to align intermediate feature representations. Yang et al. weight multi-level GRL predictions with hard-attention maps constructed from object proposals [29]. By utilizing pseudo attention maps acquired from box-level semantic segmentation, the authors of [32] propose a coarse domain alignment framework. Using self-attention on multi-level feature maps, ILLUME [11] guides cross-domain feature alignment with self-attentive gradient reversal layers.
Previous methods try to perform selective feature alignment based on attention mechanisms. In contrast, we combine attention with style-transfer by creating a feedback loop from object detection to the style-transfer network. We note that previous research using style-transfer does not connect the downstream task to the style-transfer network. Therefore we propose to use attention maps drawn from object proposals to re-weight the loss functions of style-transfer networks.

3 Proposed Method

In this section, we describe our proposed method AWADA. First, we define the standard unsupervised domain adaptation setting for object detection. Then, we explain our AWADA framework and show how we combine it with object detection. Later, we describe how we construct attention maps from an object detector and use them to re-weight loss functions of style-transfer networks using our proposed Attention-Weighting Modules (AWM) to create our proposed AWADA framework.

3.1 Unsupervised Domain Adaptation in Object Detection

We assume a labeled source domain $D_{S}$ containing source images $X_{S} = \{x_{S}^{i}\}_{i = 1}^{N_{S}}$ and corresponding bounding box labels $Y_{S} = \{y_{S}^{i}\}_{i = 1}^{N_{S}}$ . We further assume an unlabeled target domain $D_{T}$ with target images $X_{T} = \{x_{T}^{i}\}_{i = 1}^{N_{T}}$ and no labels $Y_{T} = \{\}$ . The number of samples in each domain is represented by $N_{S}$ for the source domain and $N_{T}$ for the target domain respectively. Our setting is unsupervised in the target domain, as we do not use any labels from it except for evaluation purposes.
With unsupervised domain adaptation, our aim is to extract predictions $\hat{Y}_{T} = \{\hat{y}_{T}^{i}\}_{i = 1}^{N_{T}}$ from an object detector trained without target domain labels. The two domains should be related to each other regarding class occurrences.

Figure 3: An input image is fed through a pre-trained object detector to create foreground object proposals. For source images we use a detector trained on the source domain and for target images we use a detector trained on the stylized source domain. Each proposal has a four-dimensional box descriptor and a confidence score. We create attention maps of input image size by accumulating proposals per spatial position.

3.2 Framework Overview

Our proposed AWADA framework, shown in Figure 2, is based on CyCADA* [16]. We extend CyCADA* by adding our Attention-Weighting Modules (AWMs) to each generator and discriminator loss to focus the style-transfer network training on foreground objects. AWADA consists of a generator transforming source images to target-style images. Additionally, there is a generator transforming target images to source-like images. Our framework also consists of two discriminators that predict the root domain of each image.
Furthermore we define a novel training sequence for domain adaptive object detection. First, we train the CyCADA* style-transfer network without foreground-background distinction. Afterwards we train a Faster-RCNN [17] object detector on CyCADA* stylized source images $X_{S T}$ to construct attention maps for the target domain. We also train an object detector on the source images to construct attention maps for the source domain. Finally, we re-weight adversarial loss functions with the attention maps using our proposed AWMs and train AWADA to focus the transformation on foreground object regions. Afterwards we freeze the source-to-target generator of AWADA to transfer source images to the target style. Using these stylized source images, we train our final Faster-RCNN object detector and evaluate on the target domain.
By utilizing attention maps constructed from pre-trained object detector proposals, we are the first to connect the downstream task with the style-transfer network. Our proposed AWADA framework thus constructs a feedback loop between the object detection downstream task and the style-transfer network. In the following, we explain the individual components of our AWADA framework in more detail.

3.3 Style-Transfer Baseline CyCADA*

We follow the authors of [16] and adapt CyCADA [7] from semantic segmentation to object detection, and name their approach CyCADA*. In this section we describe some details of the latter, since our proposed AWADA framework is based on it.
Style-transfer networks based on Cycle-GAN [36] rely on two generators and two discriminators. The generators should generate high-quality fake samples, which the discriminators should classify as such. A generator $G_{S T}$ transforms source image patches into target-like ones. Conversely, a second generator $G_{T S}$ transforms target images into source-like ones. The discriminators $D_{T}$ and $D_{S}$ use the adversarial GAN losses $L_{G A N}^{S T}$ and $L_{G A N}^{T S}$ , respectively, to predict the root domain:

$$L_{GAN}^{ST}(G_{ST}, D_{T}, x_{S}, x_{T}) = \mathbb{E}_{x_{S} \sim X_{S}} \log\left(D_{T}(G_{ST}(x_{S}))\right) + \mathbb{E}_{x_{T} \sim X_{T}} \log\left(1 - D_{T}(x_{T})\right) \tag{1}$$

In addition, Cycle-GAN [36] introduced a cycle consistency loss $L_{C Y C}$ . Cycle consistency forces the style-transfer network to retain the image content without hallucinating new objects. Using an L1 reconstruction loss, cycle consistency is calculated as shown in Equation 2 for both transformation directions:

$$L_{CYC}(G_{ST}, G_{TS}, x_{S}, x_{T}) = \mathbb{E}_{x_{S} \sim X_{S}} \left\| G_{TS}(G_{ST}(x_{S})) - x_{S} \right\|_{1} + \mathbb{E}_{x_{T} \sim X_{T}} \left\| G_{ST}(G_{TS}(x_{T})) - x_{T} \right\|_{1} \tag{2}$$

Introduced by CyCADA [7], semantic consistency is used to retain semantic information through the style-transformation. Cycle-GAN often aligns data distributions by e.g. generating trees in sky areas, which leads to decreased performance of downstream tasks on stylized images. The authors of [16] observe that semantic consistency is generally a well-suited regularization mechanism for style-transfer tasks, independent of the downstream task. Using a semantic segmentation network $F$ pre-trained on the source domain, Equation 3 computes the Kullback-Leibler divergence between the predictions of $F$ on source and stylized source samples.

$$L_{sem}(G_{ST}, F, x_{S}) = \mathbb{E}_{x_{S} \sim X_{S}} \, KL\left(F(G_{ST}(x_{S})) \,\|\, F(x_{S})\right) \tag{3}$$

For training the full style-transfer model, all loss functions, including GAN losses of both generator and discriminator pairs, cycle-consistency loss, and semantic consistency loss, are weighted by factors $a_{1}$ to $a_{4}$ we derive from CyCADA [7]. We sum all weighted losses for the final loss as shown in Equation 4:

$$L_{total} = a_{1} L_{GAN}^{ST} + a_{2} L_{GAN}^{TS} + a_{3} L_{CYC} + a_{4} L_{sem} \tag{4}$$
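To make the interplay of Equations 1 to 4 more concrete, the following is a minimal NumPy sketch that evaluates each term on already computed network outputs. It is not the authors' implementation: the function names, the `EPS` guard inside the logarithms, the averaging of the L1 norm over pixels, and the placeholder weights in `total_loss` are assumptions made for illustration only.

```python
import numpy as np

EPS = 1e-8  # numerical guard for the logarithms (an assumption, not from the paper)


def gan_loss(d_on_fake, d_on_real):
    """Equation 1, evaluated on discriminator outputs assumed to lie in (0, 1)."""
    d_on_fake = np.asarray(d_on_fake, dtype=np.float64)
    d_on_real = np.asarray(d_on_real, dtype=np.float64)
    return np.mean(np.log(d_on_fake + EPS)) + np.mean(np.log(1.0 - d_on_real + EPS))


def cycle_loss(x_s, x_s_rec, x_t, x_t_rec):
    """Equation 2: L1 reconstruction error for both transformation directions.

    The L1 norm is averaged over pixels here; using a sum would only rescale a_3.
    """
    return np.mean(np.abs(x_s_rec - x_s)) + np.mean(np.abs(x_t_rec - x_t))


def semantic_loss(p_stylized, p_source):
    """Equation 3: KL(F(G_ST(x_S)) || F(x_S)) on per-pixel class probabilities.

    Both inputs have shape (H, W, C) and sum to one along the last axis.
    """
    log_ratio = np.log(p_stylized + EPS) - np.log(p_source + EPS)
    return np.mean(np.sum(p_stylized * log_ratio, axis=-1))


def total_loss(l_gan_st, l_gan_ts, l_cyc, l_sem, a=(1.0, 1.0, 1.0, 1.0)):
    """Equation 4: weighted sum of all terms; the default weights are placeholders."""
    a1, a2, a3, a4 = a
    return a1 * l_gan_st + a2 * l_gan_ts + a3 * l_cyc + a4 * l_sem
```

In an actual training loop these terms would be computed on framework tensors so that gradients can flow to the generators and discriminators; the sketch only illustrates how the individual terms are formed and combined.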

3.4 Attention Map Construction

In this section, we describe how we construct our attention maps for training AWADA. Object detector models typically create object proposals separating foreground and background regions. For source domain images, we sample proposals from a pre-trained object detector trained on the source domain. For target domain images, we sample proposals from an object detector trained on CyCADA* stylized source images, which yields more accurate attention maps for the target domain than a detector trained directly on the source domain. Compared to perfect predictions, proposals additionally introduce some context and hard negatives into training, which we found to be beneficial in preliminary experiments. From object proposals, we create attention maps, highlighting foreground regions and masking out background regions.

$$a_{f}(u, v) = \mathrm{hard}(S^{c}(u, v)) = \begin{cases} 1 & \text{if } |S^{c}(u, v)| > 0 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

Attention maps are created at input image resolution using the hard accumulation method $f$ . Our method receives a set $S^{c} (u, v)$ of proposal confidences $\geq c$ for proposals at position $(u, v)$ and outputs the attention map value based on Equation 5. For the source domain, we construct $N_{S}$ attention maps $A_{f, S} = \{a_{f, S}^{i}\}_{i = 1}^{N_{S}}$ . For the target domain, we construct $N_{T}$ attention maps $A_{f, T} = \{a_{f, T}^{i}\}_{i = 1}^{N_{T}}$ , each the same size as the original patch.
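The hard accumulation of Equation 5 can be sketched in a few lines of NumPy. The sketch below is illustrative only: the function name `hard_attention_map`, the `[x1, y1, x2, y2]` pixel box format and the outward rounding of box coordinates are assumptions, since the text does not specify them.

```python
import numpy as np


def hard_attention_map(boxes, scores, image_hw, min_conf=0.5):
    """Build a binary attention map from object proposals.

    boxes:    (N, 4) array of [x1, y1, x2, y2] pixel coordinates (assumed format).
    scores:   (N,) proposal confidences.
    image_hw: (height, width) of the input image.
    min_conf: confidence threshold c; proposals below it are ignored.
    """
    height, width = image_hw
    counts = np.zeros((height, width), dtype=np.int32)  # |S^c(u, v)| per pixel
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        if score < min_conf:
            continue
        # Clip to the image and round outwards to whole pixels.
        x1 = max(int(np.floor(x1)), 0)
        y1 = max(int(np.floor(y1)), 0)
        x2 = min(int(np.ceil(x2)), width)
        y2 = min(int(np.ceil(y2)), height)
        if x2 > x1 and y2 > y1:
            counts[y1:y2, x1:x2] += 1
    # Hard accumulation: 1 wherever at least one proposal covers the pixel.
    return (counts > 0).astype(np.float32)
```

Keeping the per-pixel proposal count in `counts` before thresholding makes it easy to swap the hard accumulation for another accumulation function over the same proposals.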

3.5 Attention-Weighting Module

As seen in Figure 2, we use our proposed Attention-Weighting Modules (AWM) to re-weight the style-transfer network loss functions in AWADA.
Typically, loss functions are averaged over all spatial positions of the per-pixel loss. In our AWMs, we perform this averaging after weighting the per-pixel loss $L_{p i x}$ with the corresponding attention map $a_{f, d}$ of domain $d$ .

$$L_{w}(L_{pix}, a_{f, d}) = \frac{1}{W H} \sum_{u = 1}^{W} \sum_{v = 1}^{H} L_{pix}(u, v) \cdot a_{f, d}(u, v) \tag{6}$$

From the attention map associated with the input image, we crop the same region we used to crop a random input image patch. We then resize the cropped attention map to be the same size as the pixel-wise loss map.
We multiply the per-pixel loss $L_{p i x}$ with the associated cropped and resized attention map $a_{f, d}$ . Thus, only regions highlighted in $a_{f, d}$ remain in the loss and the background is multiplied with zero. Afterwards we calculate the mean of the weighted loss map of size $W \times H$ to form the final weighted mean loss $L_{w}$ . This weighted loss is used in training to perform the model weight update.
Our generators should produce high quality foreground object regions, as we mask out background regions which are less relevant for the final detection task. We do not weight the cycle and semantic consistency losses, as they are responsible for regularizing the whole GAN training. Further analyses are performed on this in the ablation studies.
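The cropping, resizing and weighting steps of the AWM described in this section can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the nearest-neighbour interpolation in `resize_nearest`, the `(top, left, height, width)` crop convention and both function names are hypothetical choices that the text does not prescribe.

```python
import numpy as np


def resize_nearest(attention, out_hw):
    """Nearest-neighbour resize of a 2D map (the interpolation mode is an assumption)."""
    in_h, in_w = attention.shape
    out_h, out_w = out_hw
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return attention[rows[:, None], cols[None, :]]


def attention_weighted_loss(pixel_loss, attention, crop):
    """Equation 6 applied to one image patch.

    pixel_loss: (h, w) per-pixel loss map of the patch.
    attention:  full-resolution attention map of the input image.
    crop:       (top, left, height, width) of the patch taken from the image.
    """
    top, left, height, width = crop
    patch_attention = attention[top:top + height, left:left + width]
    patch_attention = resize_nearest(patch_attention, pixel_loss.shape)
    # Background positions are multiplied by zero, but the mean still runs
    # over all W*H positions, so the loss is not renormalized.
    return float(np.mean(pixel_loss * patch_attention))
```

Because the mean is taken over all positions rather than only over foreground positions, patches with little foreground contribute a smaller loss, which matches the unnormalized form of Equation 6.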

Method	Dataset	Person	Rider	Car	Truck	Bus	Train	Mcycle	Bicycle	mAP
Baseline	Foggy	32.9	38.7	40.7	15.7	25.4	2.5	20.1	40.3	27.0
iFAN [37]	Foggy	32.6	40.0	48.5	27.9	45.5	31.7	22.8	33.0	35.3
UDA [31]	Foggy	38.2	42.1	55.6	25.9	43.5	27.6	33.5	39.2	38.2
FFDA [29]	Foggy	33.8	48.3	50.7	26.6	49.2	39.4	35.8	36.8	40.1
UMT [4]	Foggy	33.0	46.7	48.6	34.1	56.5	46.8	30.4	37.3	41.7
CGD [12]	Foggy	38.0	47.4	53.1	34.2	47.5	41.1	38.3	38.9	42.3
ILLUME [11]	Foggy	35.8	45.1	54.3	34.5	49.7	50.3	38.7	42.0	43.8
CyCADA* [16]	Foggy	42.1	49.1	53.8	21.8	41.7	26.0	29.4	45.7	38.7 $\pm$ 0.96
AWADA (Ours)	Foggy	44.7	52.7	60.9	28.7	51.0	36.2	36.0	48.5	44.8 $\pm$ 0.65
Oracle	Foggy	50.2	54.8	66.4	33.3	57.0	41.0	36.2	51.4	48.8
Baseline	BDD100k	34.2	24.2	53.7	15.0	14.3	-	12.3	19.5	24.8
AFL [14]	BDD100k	32.4	32.6	50.4	20.6	23.4	-	18.9	25.0	29.0
PDA [9]	BDD100k	37.6	32.9	51.8	19.3	23.7	-	16.1	25.3	29.5
ILLUME [11]	BDD100k	33.2	20.5	47.8	20.8	33.8	-	24.4	26.7	29.6
CyCADA* [16]	BDD100k	40.4	33.1	55.8	15.0	18.1	-	17.4	29.2	29.8 $\pm$ 0.90
AWADA (Ours)	BDD100k	41.5	34.2	56.0	18.7	20.0	-	20.4	29.7	31.5 $\pm$ 0.36
Oracle	BDD100k	52.7	42.8	73.0	53.8	52.5	-	36.4	39.5	50.1

Table 1: Results of detectors trained on the Cityscapes Dataset and evaluated on Foggy-Cityscapes for adverse weather and on BDD100k for cross-camera adaptation. We report mean and standard deviation of nine experiments each.

4 Experiments

In this section, we evaluate AWADA on a variety of unsupervised domain adaptation benchmarks and compare to the state of the art. Finally, we conduct ablation studies to verify the benefits of each component of AWADA separately.

4.1 Experimental Setup

We evaluate AWADA on three automotive benchmarks for object detection: synthetic-to-real, adverse weather and cross-camera adaptation.

4.1.1 Synthetic-to-Real.

For synthetic-to-real adaptation, we choose sim10k [10], which was recorded from the video game GTA5, as our source dataset and Cityscapes [3] as our target dataset. As Cityscapes is made for semantic segmentation, we follow [2] and use the tightest rectangle around each instance mask to create bounding box labels. The sim10k dataset contains 10,000 annotated images. Following [16], we use only the 6,591 daytime images, because Cityscapes does not contain any night images and we find that night images decrease the style-transfer performance. In this setting we only consider the car class. We do not use labels of Cityscapes except for evaluation on the 500 validation images. This benchmark is particularly interesting because sim10k and Cityscapes have approximately the same image resolution but different object instance appearances, which makes it a challenging task for domain adaptation.
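Deriving a bounding box as the tightest rectangle around an instance mask can be sketched as below. The function name `mask_to_box` and the inclusive `[x1, y1, x2, y2]` output convention are assumptions for illustration; the exact convention used in [2] is not stated here.

```python
import numpy as np


def mask_to_box(instance_mask):
    """Return the tightest [x1, y1, x2, y2] box around a binary instance mask.

    Coordinates are inclusive pixel indices; returns None for an empty mask.
    """
    rows = np.flatnonzero(instance_mask.any(axis=1))  # rows containing the instance
    cols = np.flatnonzero(instance_mask.any(axis=0))  # columns containing the instance
    if rows.size == 0:
        return None
    return [int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])]
```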

4.1.2 Cross-Camera.

In the setting of cross-camera domain adaptation we choose Cityscapes as our source and BDD100k [30] as our target domain. We use the BDD100k daytime set, which contains 36,728 images with bounding box annotations, because Cityscapes has no night images. We evaluate the cross-camera adaptation on seven common classes of both datasets. This benchmark imposes challenges by its high diversity and inter-frame domain gap.

4.1.3 Adverse Weather.

For adverse weather adaptation we choose Cityscapes [3] as the source domain and Foggy-Cityscapes [20] as the target domain. Foggy-Cityscapes is created by a fog simulation run on original Cityscapes images. We choose only the highest fog level with a visibility range of about 150 meters, matching the setting in [18]. Cityscapes contains 2,975 labeled training images with eight different road user classes. The Foggy-Cityscapes validation set contains 500 images.

4.1.4 Implementation Details.

We choose Faster-RCNN as our object detector, as this is the default in most other work [2, 19, 9, 18, 37, 23]. We resize all images so that the shortest side is 600 px and use a batch size of one. With a VGG16 [21] backbone, we train for 30 epochs using the Adam optimizer and a learning rate of 1e-5. We report results using mean average precision (mAP) with an intersection over union (IoU) threshold of 0.5.
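For reference, the IoU criterion behind the 0.5 threshold can be sketched in plain Python. The function names and the `[x1, y1, x2, y2]` box format are assumptions, and treating an IoU of exactly 0.5 as a match is a convention chosen here rather than one stated in the text.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as [x1, y1, x2, y2] with x2 > x1 and y2 > y1."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, iou_threshold=0.5):
    """A detection counts as matching a ground-truth box if the IoU reaches the threshold."""
    return box_iou(pred_box, gt_box) >= iou_threshold
```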
We train AWADA for 200 epochs with a batch size of two. For all other hyperparameters, we follow the protocol of CyCADA* [16]. In attention map creation, we take proposals with a minimum confidence score of 0.5 into account. To save computation time, we pre-compute the attention maps for each image of the source and target domain. As GAN training is unstable, we report each result as the average of nine experiments with different random seeds: we train three GANs and evaluate each with three detectors.
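The aggregation over the three-by-three grid of GANs and detectors amounts to a mean and a standard deviation over nine scores, as sketched below. The function name is hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np


def summarize_runs(scores):
    """Mean and standard deviation over a (num_gans, num_detectors) grid of scores."""
    scores = np.asarray(scores, dtype=np.float64)
    # ddof=0 (population standard deviation) is an assumption, not taken from the paper.
    return float(scores.mean()), float(scores.std(ddof=0))
```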
We present results in comparison with a baseline trained directly on the source domain while evaluating on the target domain. As an upper bound, we additionally report results from training on the labeled target domain as an oracle.

4.2 Experimental Results

Our experiments first compare AWADA to the current state-of-the-art methods in unsupervised domain adaptation. Table 1 shows results of detectors trained on an AWADA stylized Cityscapes dataset and evaluated on the Foggy-Cityscapes dataset for adverse weather and on BDD100k dataset for cross-camera adaptation. AWADA outperforms all current SOTA methods on both benchmarks. Both benchmarks are highly complex and show that AWADA is capable of improving detection performance on different automotive relevant classes. As we do not modify the object detector network, AWADA can be combined with approaches such as ILLUME or CGD to likely further improve object detection performance.
In Table 2, we report results of AWADA on synthetic-to-real adaptation and show that AWADA is on par with other methods on this benchmark. The repetition of our experiments shows the increased stability of AWADA compared to CyCADA*. Unfortunately, we cannot conduct such a comparison with other methods, since the corresponding results have not been reported.
In total, AWADA requires 88 hours of training on a single V100 GPU. As we pre-compute the attention maps, the remaining AWADA training requires 40 GPU hours, which is only about 10 hours longer than CyCADA* training.

Method	car AP
Baseline	34.7
iFAN [37]	46.9
UDA [31]	52.3
SCAN [13]	52.6
ILLUME [11]	53.1
CyCADA* [16]	51.9 $\pm$ 0.69
AWADA (Ours)	53.2 $\pm$ 0.29
Oracle	71.1

Table 2: Results in the sim10k to Cityscapes synthetic-to-real adaptation benchmark. We report mean and standard deviation of nine experiments each.

4.3 Ablation Studies

In this section, we conduct ablation studies on our proposed AWADA framework. We first show experiments on different types of loss functions we apply our Attention-Weighting Modules (AWM) to. Afterwards we show the influence of our attention map creation strategy on the object detection performance. At the end we analyze the total runtime of AWADA.

4.3.1 Attention-Weighting of Losses.

Our first ablation study targets the choice of loss functions we attach our AWMs to. We conduct experiments in which we re-weight the various loss functions that appear in our AWADA framework. We define four groups representing the discriminator $L_{d i s c}$ , the generator $L_{g e n}$ , the cycle-consistency $L_{c y c}$ , and the semantic-consistency loss $L_{s e m}$ .

$L_{d i s c}$	$L_{g e n}$	$L_{c y c}$	$L_{s e m}$	car AP
-	-	-	-	51.9 $\pm$ 0.69
✓	-	-	-	50.5 $\pm$ 0.25
-	✓	-	-	52.4 $\pm$ 0.65
✓	✓	-	-	53.2 $\pm$ 0.29
✓	✓	✓	-	50.5 $\pm$ 1.07
✓	✓	✓	✓	51.4 $\pm$ 0.72

Table 3: Ablation study on the different losses that are weighted with object detection attention maps using our proposed AWM on the sim10k to Cityscapes benchmark. The check mark indicates the application of an AWM to the loss, otherwise the loss remains unweighted.

In Table 3, we see that weighting only the loss functions connected to the adversarial optimization problem, $L_{d i s c}$ and $L_{g e n}$ , performs best. In style-transfer, the generators and discriminators compete with each other. Re-weighting only the generator or only the discriminator loss performs worse than re-weighting both, which is likely caused by an imbalance of the adversarial optimization problem. This suggests re-weighting these losses equally. Cycle-consistency and semantic consistency are responsible for the supervision and regularization of the generation cycle. In order to balance regularization and focus on foreground object regions, we do not re-weight the regularization losses, which shows the best results.
We also tested normalizing the attention-weighted loss, but found results to significantly degrade. This shows that the natural AWADA weighting is most beneficial for training.

Figure 4: Qualitative results of the adverse weather Cityscapes [3] to Foggy-Cityscapes [20] benchmark. CyCADA* produces very strong and deep fog even for close image regions. In contrast, our AWADA framework produces more realistic fog because of the regularization from the foreground-focused transformation we apply.

4.3.2 Attention Map Construction Method.

Next we analyze the quality of our constructed attention maps and the influence on the detection performance. Our default AWADA implementation uses object detector proposals to construct attention maps.
First, we evaluate the influence of our fuzzy, imperfect attention maps constructed from proposals. Perfect attention maps can be obtained from ground truth semantic segmentation masks or bounding boxes. As shown in Table 4, perfect attention maps from bounding boxes or segmentation masks perform worse than our proposal-based attention maps. We also evaluated inflating GT boxes by 20%, which slightly improved results over plain GT boxes on two of the three benchmarks. We hypothesize that adding context around objects is beneficial for the downstream detection training.
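One possible reading of inflating a ground truth box by 20% is to scale its width and height by a factor of 1.2 about the box centre and clip the result to the image. The sketch below follows that reading; the function name, the centre-based scaling and the clipping are assumptions, as the exact inflation procedure is not described.

```python
def inflate_box(box, image_hw, factor=1.2):
    """Scale a [x1, y1, x2, y2] box about its centre and clip it to the image."""
    height, width = image_hw
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * factor / 2.0
    half_h = (y2 - y1) * factor / 2.0
    return [
        max(cx - half_w, 0.0),
        max(cy - half_h, 0.0),
        min(cx + half_w, float(width)),
        min(cy + half_h, float(height)),
    ]
```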

Method	GTA2City	City2Foggy	City2BDD
CyCADA*	51.9 $\pm$ 0.69	38.7 $\pm$ 0.96	29.8 $\pm$ 0.90
GT Masks	50.4 $\pm$ 0.46	38.8 $\pm$ 1.07	-
GT Boxes	51.9 $\pm$ 0.82	40.9 $\pm$ 1.36	29.9 $\pm$ 0.55
GT Inflate	52.6 $\pm$ 0.59	42.0 $\pm$ 0.98	29.6 $\pm$ 0.52
AWADA	53.2 $\pm$ 0.29	44.8 $\pm$ 0.65	31.5 $\pm$ 0.36

Table 4: Ablation study on different ground truth attention map creation methods on the sim10k to Cityscapes (GTA2City), Cityscapes to Foggy-Cityscapes (City2Foggy) and Cityscapes to BDD100k (City2BDD) benchmarks.

Table 5 additionally shows results of randomly marking 10-50% of the pixels as foreground. This is a simplistic ablation of our loss-masking framework, in which we mask out a fixed percentage of random pixels instead of using object boxes. This method works surprisingly well in some scenarios, yet its results are sensitive to the choice of the randomness hyperparameter and show increased instability. AWADA results, however, have been stable across all datasets using the same set of hyperparameters, which indicates further optimization potential of AWADA compared to the random results. We also tried combining AWADA with the random method, which gave results in between both individual methods but again increased instability. Therefore, despite its simplicity, we cannot generally recommend using the random method and prefer AWADA for more stable results.
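A random attention map of this kind can be sketched as follows. The sketch interprets the percentage as the share of pixels marked as foreground and draws pixels independently of the image content; the function name, this interpretation and the per-pixel (rather than per-block) sampling are assumptions, and the text does not state how often a new random map is drawn during training.

```python
import numpy as np


def random_attention_map(image_hw, foreground_fraction, rng=None):
    """Binary map with a fixed share of randomly chosen pixels set to 1."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image_hw
    num_pixels = height * width
    num_foreground = int(round(foreground_fraction * num_pixels))
    flat = np.zeros(num_pixels, dtype=np.float32)
    # Sample pixel indices without replacement so the share is exact.
    flat[rng.choice(num_pixels, size=num_foreground, replace=False)] = 1.0
    return flat.reshape(height, width)
```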
If we simply reduce the adversarial loss weights in AWADA without using AWMs, the training collapses; therefore, the improvement in AWADA does not come from a reduced loss alone.
We also tried different proposal accumulation functions $f$ such as mean, median or max, which did not improve the results. Therefore, we stick to our hard accumulation function. Further results are provided in the supplementary material.

Method	GTA2City	City2Foggy	City2BDD
CyCADA*	51.9 $\pm$ 0.69	38.7 $\pm$ 0.96	29.8 $\pm$ 0.90
Rnd. 10 %	52.9 $\pm$ 0.70	42.7 $\pm$ 0.62	26.8 $\pm$ 0.71
Rnd. 30 %	52.8 $\pm$ 0.68	45.7 $\pm$ 0.42	29.7 $\pm$ 0.24
Rnd. 50 %	51.5 $\pm$ 0.47	46.2 $\pm$ 0.61	30.3 $\pm$ 0.48
AWADA	53.2 $\pm$ 0.29	44.8 $\pm$ 0.65	31.5 $\pm$ 0.36

Table 5: Ablation study on different random attention map creation methods on the sim10k to Cityscapes (GTA2City), Cityscapes to Foggy-Cityscapes (City2Foggy) and Cityscapes to BDD100k (City2BDD) benchmarks.

5 Conclusion

We propose AWADA, an Attention-Weighted Adversarial Domain Adaptation framework. We direct our style-transfer training to focus the adversarial optimization problem of generators and discriminators on foreground object regions. By creating attention maps from foreground object proposals, we create, for the first time, a feedback loop between detection and transformation tasks in GAN-based style transfer. AWADA not only improves object detection performance, it also stabilizes the training of the GAN.
Experiments and ablation studies show that our AWADA framework outperforms current unsupervised domain adaptation methods on synthetic-to-real, adverse weather and cross-camera automotive object detection benchmarks and provides a more stable and regularized image-to-image style-transfer method.

References

[1] C. Chen, Z. Zheng, X. Ding, Y. Huang, and Q. Dou (2020) Harmonizing transferability and discriminability for adapting object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8869–8878.
[2] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool (2018) Domain adaptive faster r-cnn for object detection in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3339–3348.
[3] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223.
[4] J. Deng, W. Li, Y. Chen, and L. Duan (2021) Unbiased mean teacher for cross-domain object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4091–4101.
[5] Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp. 1180–1189.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27.
[7] J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell (2018) Cycada: cycle-consistent adversarial domain adaptation. In International conference on machine learning, pp. 1989–1998.
[8] C. Hsu, Y. Tsai, Y. Lin, and M. Yang (2020) Every pixel matters: center-aware feature alignment for domain adaptive object detector. In European Conference on Computer Vision, pp. 733–748.
[9] H. Hsu, C. Yao, Y. Tsai, W. Hung, H. Tseng, M. Singh, and M. Yang (2020) Progressive domain adaptation for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 749–757.
[10] M. Johnson-Roberson, C. Barto, R. Mehta, S. N. Sridhar, K. Rosaen, and R. Vasudevan (2017) Driving in the matrix: can virtual worlds replace human-generated annotations for real world tasks?. 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 746–753.
[11] V. Khindkar, C. Arora, V. N. Balasubramanian, A. Subramanian, R. Saluja, and C. Jawahar (2022) To miss-attend is to misalign! residual self-attentive feature alignment for adapting object detectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3632–3642.
[12] S. Li, J. Huang, X. Hua, and L. Zhang (2021) Category dictionary guided unsupervised domain adaptation for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 1949–1957. Cited by: §2.1, Table 1.
[13] W. Li, X. Liu, X. Yao, and Y. Yuan (2022) SCAN: cross domain object detection with semantic conditioned adaptation. In AAAI, Vol. 6, pp. 7. Cited by: §2.1, Table 2.
[14] X. Li, W. Chen, D. Xie, S. Yang, P. Yuan, S. Pu, and Y. Zhuang (2021) A free lunch for unsupervised domain adaptive object detection without source data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 8474–8481. Cited by: Table 1.
[15] M. Liu, T. Breuel, and J. Kautz (2017) Unsupervised image-to-image translation networks. Advances in neural information processing systems 30. Cited by: §2.2.
[16] M. Menke, T. Wenzel, and A. Schwung Improving gan-based domain adaptation for object detection. In IEEE International Conference on Intelligent Transportation Systems (ITSC) 2022, Cited by: §2.2, §3.2, §3.3, §3.3, Table 1, §4.1.1, §4.1.4, Table 2.
[17] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28, pp. 91–99. Cited by: §2.1, §3.2.
[18] F. Rezaeianaran, R. Shetty, R. Aljundi, D. O. Reino, S. Zhang, and B. Schiele (2021) Seeking similarities over differences: similarity-based domain alignment for adaptive object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9204–9213. Cited by: §2.1, §4.1.3, §4.1.4.
[19] K. Saito, Y. Ushiku, T. Harada, and K. Saenko (2019) Strong-weak distribution alignment for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6956–6965. Cited by: §2.1, §4.1.4.
[20] C. Sakaridis, D. Dai, and L. Van Gool (2018) Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision 126 (9), pp. 973–992. Cited by: Figure 4, §4.1.3.
[21] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.4.
[22] P. Soviany, R. T. Ionescu, P. Rota, and N. Sebe (2021) Curriculum self-paced learning for cross-domain object detection. Computer Vision and Image Understanding 204, pp. 103166. Cited by: §2.1, §2.2.
[23] P. Su, K. Wang, X. Zeng, S. Tang, D. Chen, D. Qiu, and X. Wang (2020) Adapting object detectors with conditional domain normalization. In European Conference on Computer Vision, pp. 403–419. Cited by: §2.1, §4.1.4.
[24] V. Vs, V. Gupta, P. Oza, V. A. Sindagi, and V. M. Patel (2021) Mega-cda: memory guided attention for category-aware unsupervised domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4516–4526. Cited by: §2.1.
[25] T. Wang, M. Liu, J. Zhu, A. Tao, J. Kautz, and B. Catanzaro (2018) High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8798–8807. Cited by: §2.2.
[26] Y. Wang, R. Zhang, S. Zhang, M. Li, Y. Xia, X. Zhang, and S. Liu (2021) Domain-specific suppression for adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9603–9612. Cited by: §2.1.
[27] C. Xu, X. Zhao, X. Jin, and X. Wei (2020) Exploring categorical regularization for domain adaptive object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11724–11733. Cited by: §2.1.
[28] M. Xu, H. Wang, B. Ni, Q. Tian, and W. Zhang (2020) Cross-domain detection via graph-induced prototype alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12355–12364. Cited by: §2.1.
[29] Y. Yang and N. Ray (2021) Foreground-focused domain adaption for object detection. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 6941–6948. Cited by: §2.3, Table 1.
[30] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell (2020-06) BDD100K: a diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.2.
[31] F. Yu, D. Wang, Y. Chen, N. Karianakis, T. Shen, P. Yu, D. Lymberopoulos, S. Lu, W. Shi, and X. Chen (2022) SC-uda: style and content gaps aware unsupervised domain adaptation for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 382–391. Cited by: §2.1, §2.2, Table 1, Table 2.
[32] H. Zhang, G. Luo, J. Li, and F. Wang (2021) C2FDA: coarse-to-fine domain adaptation for traffic object detection. IEEE Transactions on Intelligent Transportation Systems. Cited by: §2.3.
[33] J. Zhang, J. Huang, Z. Luo, G. Zhang, and S. Lu (2021) Da-detr: domain adaptive detection transformer by hybrid attention. arXiv preprint arXiv:2103.17084. Cited by: §2.1.
[34] S. Zhao, B. Li, P. Xu, X. Yue, G. Ding, and K. Keutzer (2021) MADAN: multi-source adversarial domain aggregation network for domain adaptation. International Journal of Computer Vision, pp. 1–26. Cited by: §2.2.
[35] Y. Zheng, D. Huang, S. Liu, and Y. Wang (2020) Cross-domain object detection through coarse-to-fine feature adaptation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 13766–13775. Cited by: §2.3.
[36] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §1, §2.2, §3.3, §3.3.
[37] C. Zhuang, X. Han, W. Huang, and M. Scott (2020) Ifan: image-instance full alignment networks for adaptive object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 13122–13129. Cited by: §2.1, Table 1, §4.1.4, Table 2.