Style-Agnostic Reinforcement Learning

Juyong Lee† (ORCID: 0000-0002-8155-3998), Seokjun Ahn† (ORCID: 0000-0002-3769-9965), and Jaesik Park (ORCID: 0000-0001-5541-409X)

Pohang University of Science and Technology (POSTECH), South Korea

† These authors contributed equally to this work.

email: {joy.lee, sdeveloper, jaesik.park}@postech.ac.kr

Abstract

We present a novel method of learning style-agnostic representation using both style transfer and adversarial learning in the reinforcement learning framework. The style, here, refers to task-irrelevant details such as the color of the background in the images, where generalizing the learned policy across environments with different styles is still a challenge. Focusing on learning style-agnostic representations, our method trains the actor with diverse image styles generated from an inherent adversarial style perturbation generator, which plays a min-max game between the actor and the generator, without demanding expert knowledge for data augmentation or additional class labels for adversarial training. We verify that our method achieves competitive or better performances than the state-of-the-art approaches on Procgen and Distracting Control Suite benchmarks, and further investigate the features extracted from our model, showing that the model better captures the invariants and is less distracted by the shifted style. The code is available at https://github.com/POSTECH-CVLab/style-agnostic-RL.

Keywords:

Reinforcement Learning, Domain Generalization, Neural Style Transfer, Adversarial Learning

1 Introduction

Learning visual representations within the reinforcement learning (RL) framework, combined with deep convolutional neural networks, has enabled remarkable performance in various control tasks, including video games [Video1, Video2], robot manipulation [Robot1, Robot2], and autonomous driving [Navigation]. However, the learned policies often fail to generalize to unseen environments, even with slight changes in the backgrounds [DGinRL1, DGinRL2, DGinRL3].

Several methods have been proposed to overcome this limitation of RL agents, such as having an encoder with generative models [DARLA, WorldModel, Dreamer, SLAC] or training with auxiliary tasks [CURL, CTRL, DARL]. Methods using generative models are designed to train the agents to understand the world environment, and auxiliary tasks enable the agent to extract better features that will lead to better performances. Due to its simplicity, the latter technique is gaining interest. For example, recent works have shown that representation learning with self-supervision objectives [CURL, PAD], data randomization with feature matching [RAND], and data augmentation with additional regularization [DrQ, SODA, SVEA] result in high success.

The central concept of these approaches is to diversify training data so that the RL agents can learn features that are invariant to the different styles of environments. Here, the style of the environment refers to overly detailed or irrelevant elements in the observation. In an autonomous driving situation, for instance, detecting the road or pedestrians is key to success, while the texture of the road, the colors of the other cars, or the weather condition can be regarded as different styles, which distract the agent from abstracting and understanding the situation. Data augmentation, thus, might lead to better generalization capacity by mimicking natural style changes of observations. However, the results are inefficient or unstable without a careful choice of augmentation type and timing [RAD, InDA/ExDA]. To tackle this issue, more principled training methods that add regularization terms can be applied [DrQ, DrAC, SODA, SVEA], but this makes the training objectives much more complex.

In this work, we focus on learning style-agnostic representations and propose SAR: Style-Agnostic RL, which adopts the concepts of both style transfer and adversarial learning. Style transfer has been applied in many computer vision tasks, including domain generalization in RL [StyleGAN, StyleDG2, MixStyle]. Here, we further examine how style transfer can be used to train the agents by generating images of new styles. The generator module in our model generates never-seen styles and helps the actor generalize its learned policy to unseen styles with various background images, including realistic images, without any heuristics or explicit environment class labels. Notably, the generator is trained with an adversarial loss to perform adaptive style perturbation on the encoded feature representation. To the best of our knowledge, this approach has not been presented before. An overview of our model is shown in Figure 1.

In summary, the contributions of this paper are as follows:

First, we introduce SAR, a novel method of learning style-agnostic representation for domain generalization in RL.
Second, we conduct extensive empirical evaluations showing that the model better captures invariants between different styles of environment.
Finally, we show that the SAR agents achieve competitive or better results on the Procgen [ProcGen] and Distracting Control Suite [DistCS] benchmarks than the previous state-of-the-art algorithms.

2 Related work

2.1 Domain Generalization in RL

The main target of the domain generalization in RL can be summarized as training an agent to learn a robust policy that can be generalized to unseen environments. This allows RL algorithms to be applied in more realistic situations because agents are often tested in different environments from the training stage. One example is deploying a policy learned from the simulation to the real world in the robot manipulation task.

Data randomization is a promising technique for such domain generalization in many cases [Robot2, Robot3]. However, it is difficult to build an accurate and practical simulator that enables using data randomization. Visual augmentation, on the other hand, is much easier to apply as it is based on simple image transformations. Laskin & Lee et al. [RAD], for example, demonstrated that simply using data augmentation, such as random cropping or gray scaling, is indeed helpful in improving the generalization capacity of RL agents. Also, Yarats & Kostrikov et al. [DrQ] suggested using regularization terms for stabilizing the model training when using data augmentation.

However, although data augmentation is potentially effective, it has several limitations. For example, a naïve choice of the augmentation type may degrade the generalization performance [RAD]. Applying cropping to an essential part of the image may confuse the agent, or training the model to produce the same action from a rotated image may be unreasonable. Here, we present a method for domain generalization by diversifying the training examples without requiring a complex strategy for data augmentation. The generator in the SAR model generates new feature examples having different styles and helps the agents with learning style-agnostic representations.

2.2 Adversarial Feature Learning

Adversarial feature learning has become popular for domain generalization in computer vision tasks [SONG, ADV, MMDAAE, featureAttack, SagNet]. Li et al. [MMDAAE] showed that adversarial objectives help a model learn universal feature representations across different domains. Furthermore, Nam & Lee et al. [SagNet] proposed a method of reducing the style gap for domain generalization in the image classification task. Inspired by this work, we investigate the adversarial feature learning for RL agents, but with a simpler training procedure, i.e., without dividing training phases or considering the environment style’s classes.

We note that adopting adversarial training for RL is not new [RARL, DARL]. To the best of our knowledge, however, applying adversarial learning to the latent features in the RL framework with a min-max game scheme has not been presented before. In particular, our method can be interpreted as domain randomization beyond pixel space. Mixing styles with linear interpolation for representation learning in the RL setting was proposed in earlier work [MixStyle]. However, unlike in the previous study, the style perturbation generator in SAR produces new synthetic styles that would not be seen with a simple interpolation. The adversarial examples help the actor extract style-agnostic embeddings without any style labels and, finally, learn a robust policy for unseen environments.

3 Backgrounds

3.1 Deep Reinforcement Learning

RL agents interact with and are trained in an environment modeled as a Markov decision process, which is defined as a tuple $(S, A, P, R, γ)$ of state space $S$, action space $A$, transition probability $P$, reward space $R$, and discount factor $γ$. At every timestep $t$, the agent observes a state $s_{t} \in S$ and takes an action $a_{t} \in A$ from its policy $π(a_{t} | s_{t})$ [MDP]. Then, the agent is rewarded with $r_{t} \in R$ and moves to the next state $s_{t+1}$ sampled from the transition probability $P(s_{t+1} | s_{t}, a_{t})$.

The policy of the agent is optimized to maximize the discounted sum of rewards $G_{t} = \sum_{k=t}^{\infty} γ^{k-t} r_{k}$. Given a state $s_{t}$, the value of the state $V(s_{t})$ is estimated as $E_{τ \sim π}[G_{t} | s_{t}]$, and the value of the state-action pair $Q(s_{t}, a_{t})$ is computed as $E_{τ \sim π}[G_{t} | s_{t}, a_{t}]$, with trajectory $τ$ sampled from the policy $π$.
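As a small illustration of the discounted return, the sketch below computes $G_{t}$ for every timestep of a finite episode. It assumes the discount is counted from the current timestep, so that $G_{t} = r_{t} + γ G_{t+1}$; the function name and the example rewards are illustrative only.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every t of a finite episode via G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Illustrative rewards: G_2 = 2.0, G_1 = 0.0 + 0.5 * 2.0, G_0 = 1.0 + 0.5 * 1.0
example_returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
```

The backward recursion avoids recomputing the sum for every timestep, which is how returns are typically accumulated over a rollout.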

With deep RL algorithms, the policy $π$ is parameterized by a set of learnable parameters $ψ$, and the value function $V$ or $Q$ is optimized with network parameters $ϕ$. Also, especially for visual-based RL, since single images only offer partial observations, Mnih et al. [stackObs] proposed defining the state $s_{t}$ as a stack of consecutive image frames $(o_{t-k}, o_{t-k+1}, \dots, o_{t})$, where $O$ is a high-dimensional image space and $o \in O$.

3.1.1 Proximal policy optimization

Proximal policy optimization (PPO) [PPO] is a state-of-the-art on-policy RL algorithm that, in our setting, is used for discrete control tasks. Here, on-policy refers to a situation in which the model is trained with trajectories collected from the current policy. With PPO, the actor is updated using policy gradients, where the gradients are computed by using (i) action-advantages $A_{t}$ to reduce the gradient variance and (ii) a clipped-ratio loss to constrain the update region. The critic estimates the state-value $V_{ϕ}$ and is trained with a mean-squared error loss toward a target state-value $V_{t}^{target}$ obtained using generalized advantage estimation [PPO]. The objectives for the actor and critic networks can thus be written as follows:

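$$A_{t} = Q_{ϕ}(s_{t}, a_{t}) - V_{ϕ}(s_{t}) \tag{1}$$

$$L_{actor}(ψ) = -E_{τ \sim π}[\min(ρ_{t}(ψ) A_{t}, \mathrm{clip}(ρ_{t}(ψ), 1 - ϵ, 1 + ϵ) A_{t})] \tag{2}$$

$$L_{critic}(ϕ) = E_{s_{t} \sim π}[(V_{ϕ}(s_{t}) - V_{t}^{target})^{2}], \tag{3}$$

where $ρ_{t}(ψ) = π_{ψ}(a_{t} | s_{t}) / π_{ψ_{old}}(a_{t} | s_{t})$ is the probability ratio between the current policy and the policy that collected the trajectories, and $ϵ$ is a coefficient for the clipping function $\mathrm{clip}(\cdot) \to [1 - ϵ, 1 + ϵ]$.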
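The clipped-ratio actor loss and the mean-squared critic loss can be sketched in NumPy as follows. This is a minimal sketch of the standard PPO clipped surrogate, assuming log-probabilities, advantages, and value targets have already been computed elsewhere; the function names and the default `eps=0.2` are illustrative and not taken from this paper.

```python
import numpy as np


def ppo_actor_loss(logp_new, logp_old, advantages, eps=0.2):
    """Negative clipped surrogate objective, averaged over the batch."""
    # Probability ratio between the current policy and the data-collecting policy.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the element-wise minimum keeps the update pessimistic.
    return -np.mean(np.minimum(unclipped, clipped))


def ppo_critic_loss(values, value_targets):
    """Mean-squared error between predicted values and their targets."""
    return np.mean((values - value_targets) ** 2)
```

With a ratio of 1.5 and a positive advantage, the clipped term caps the contribution at `1 + eps`, so a single update cannot push the policy arbitrarily far from the one that collected the data.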

3.1.2 Soft actor-critic

Soft actor-critic (SAC) [SAC] is an off-policy RL algorithm for continuous control tasks. Since off-policy algorithms can train the agent with trajectories collected from policies other than the current one, they are more flexible in the data they can learn from, but may train more slowly. With SAC, the actor learns a policy $π_{ψ}$, guided by a critic estimating the state-action value $Q_{ϕ}$, to maximize an objective given by the sum of the reward and the policy entropy, $E_{s_{t}, a_{t} \sim π}[\sum_{t} r_{t} + α H(π(\cdot | s_{t}))]$. Here, $α$ is an entropy coefficient determining the priority of exploration over exploitation.

The actor, then, is trained by maximizing the expected return of its sampled actions where the objective can be denoted as follows:

$$L_{actor}(ψ) = -E_{a_{t} \sim π}[Q_{ϕ}(s_{t}, a_{t}) - α \log π_{ψ}(a_{t} | s_{t})]. \tag{4}$$

The critic is updated to minimize the temporal difference. The objectives for the critic, with the estimated target value of the next state, are as follows:

$$V(s_{t+1}) = E_{a_{t+1} \sim π}[Q_{ϕ}(s_{t+1}, a_{t+1}) - α \log π_{ψ}(a_{t+1} | s_{t+1})] \tag{5}$$

$$L_{critic}(ϕ) = E_{(s_{t}, a_{t}, r_{t}, s_{t+1}) \sim D}[(Q_{ϕ}(s_{t}, a_{t}) - (r_{t} + γ V(s_{t+1})))^{2}], \tag{6}$$

where $D$ is the replay buffer.
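The temporal-difference target for the SAC critic can be illustrated with a short NumPy sketch. It assumes a single sampled next action per transition, omits terminal-state masking and target networks for brevity, and uses illustrative function names and default coefficients that are not taken from this paper.

```python
import numpy as np


def soft_td_target(reward, next_q, next_logp, gamma=0.99, alpha=0.1):
    """One-sample estimate of r_t + gamma * V(s_{t+1}) with an entropy bonus."""
    # Soft value of the next state: Q minus the scaled log-probability.
    soft_value = next_q - alpha * next_logp
    return reward + gamma * soft_value


def sac_critic_loss(q_pred, target):
    """Squared temporal-difference error, averaged over the mini-batch."""
    return np.mean((q_pred - target) ** 2)
```

For example, with reward 1.0, next-state value estimate 2.0, next-action log-probability -1.0, `gamma=0.5`, and `alpha=0.1`, the soft value is 2.1 and the target is 2.05.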

In this work, we show that our method can be attached to both on-policy and off-policy RL algorithms, namely PPO and SAC. Our method can also be applied to both continuous and discrete control tasks, as tested with the Procgen and Distracting Control Suite benchmarks.

3.1.3 Style transfer via instance normalization

For style transfer, many recent works adopt a method of using instance normalization (IN) [IN, AdaIN, CIN, StyleTransfer1, MixStyle]. The underlying idea is that the mean and standard deviations of feature maps, computed across the spatial dimension within each feature channel, reflect the images’ style. For example, the color or texture of an image can be captured with these statistics, which may be irrelevant features for classifying or detecting an object. By using IN, the effect of styles can be normalized with the formula:

$$\mathrm{IN}(z) = γ \cdot \frac{z - μ(z)}{σ(z)} + β, \tag{7}$$

where $z \in R^{C \times H \times W}$ is a feature map with $C$ channels, height $H$, and width $W$, and $β, γ \in R^{C}$ are the affine transformation parameters.

Note that $μ (z) \in R^{C}$ and $σ (z) \in R^{C}$ are denoted as:

$$μ(z)_{c} = \frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} z_{c,h,w}, \qquad σ(z)_{c} = \sqrt{\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W} (z_{c,h,w} - μ(z)_{c})^{2}}, \tag{8}$$

with $c \in \{1, \dots, C\}$.
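The per-channel statistics and the normalization can be sketched in NumPy as below. The small constant `eps` added under the square root is a common numerical safeguard and an assumption of this sketch, since it does not appear in the formulas above; the function names are illustrative.

```python
import numpy as np


def instance_stats(z, eps=1e-5):
    """Per-channel mean and standard deviation over the spatial dimensions."""
    mu = z.mean(axis=(-2, -1), keepdims=True)
    sigma = np.sqrt(z.var(axis=(-2, -1), keepdims=True) + eps)
    return mu, sigma


def instance_norm(z, gamma, beta, eps=1e-5):
    """Normalize each channel of z, then apply the affine parameters gamma and beta."""
    mu, sigma = instance_stats(z, eps)
    # gamma and beta have shape (C,), so add spatial axes for broadcasting.
    return gamma[:, None, None] * (z - mu) / sigma + beta[:, None, None]
```

With `gamma` set to ones and `beta` to zeros, every channel of the output has approximately zero mean and unit standard deviation, which is the sense in which the style statistics are removed.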

Moreover, Huang & Belongie [AdaIN] proposed the method of adaptive instance normalization (AdaIN), which can be understood as replacing the style statistics of a target content image with those of a source style image with the definition below:

$$\mathrm{AdaIN}(z, z^{'}) = σ(z^{'}) \cdot \frac{z - μ(z)}{σ(z)} + μ(z^{'}), \tag{9}$$

where $z^{'}$ is the feature map extracted from the source style image.

This idea can be used for mixing styles between images within a mini-batch. Especially in the domain adaptation for image classification, this has been proved to be successful [SagNet]. Zhou et al. [MixStyle] adopted style mixing for domain generalization in RL. However, the scope of mixing styles is restricted only to the training mini-batch as AdaIN is an interpolation. Here, our method enables the agents to observe unseen styles by generating new adversarial feature examples.
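Style mixing within a mini-batch can be sketched by applying AdaIN between each feature map and a randomly permuted partner from the same batch. This is a minimal NumPy sketch under the assumption that features have shape (batch, channels, height, width); `eps` and the function names are illustrative additions.

```python
import numpy as np


def spatial_stats(z, eps=1e-5):
    """Per-sample, per-channel mean and standard deviation over height and width."""
    mu = z.mean(axis=(-2, -1), keepdims=True)
    sigma = np.sqrt(z.var(axis=(-2, -1), keepdims=True) + eps)
    return mu, sigma


def adain(z, z_style, eps=1e-5):
    """Replace the style statistics of z with those of z_style."""
    mu, sigma = spatial_stats(z, eps)
    mu_style, sigma_style = spatial_stats(z_style, eps)
    return sigma_style * (z - mu) / sigma + mu_style


def mix_styles(batch, rng):
    """Pair every feature map with a randomly permuted one from the same batch."""
    perm = rng.permutation(batch.shape[0])
    return adain(batch, batch[perm]), perm
```

After mixing, each output keeps the spatial content of its own input but carries the channel statistics of its partner, so the set of available styles is limited to those already present in the mini-batch.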

Figure 1: Overview of the proposed Style-Agnostic Reinforcement learning (SAR) with the base model of PPO. The upper *Style Mixing* module makes the policy network focus on the critical content in the observations by mixing styles from randomly chosen states $s^{'}$ . We newly employ our *Style Perturbation* module, helping the agent with learning a robust policy by adversarially perturbing latent features.

4 Method

Overview. SAR is composed of an actor-critic module with RL objectives and a style perturbation generator helping the agents to observe more diverse styles of observations. While the generator is updated to produce more substantial perturbations for style transfer by maximizing the difference between the action predictions, the actor learns a more robust policy to the attack from the generator by minimizing the gap between predicted action distributions.

To perform this min-max game between actor and generator, we present a style perturbation layer, shown in Figure 1. Unlike the conventional approach using only style mixing within the mini-batch [MixStyle], the model in the training phase generates new styles and observes a broader range of feature examples. Note that this does not require explicit data augmentation that potentially degrades performance without a cautious choice of augmentation type.

4.1 Style Perturbation Layer

Our method is based on the concept of style transfer, which was proven to be successful in generating images with new styles [CIN, StyleGAN]. The style perturbation layer shifts the style of observations $z$ with the generated perturbation mean $β_{a d v} (z)$ and variance $γ_{a d v} (z)$ , to build style-perturbed feature map $z_{a d v}$ , or StylePerturb(z), with the following equation:

$$z_{adv} = γ_{adv}(z) \cdot \frac{z - μ(z)}{σ(z)} + β_{adv}(z). \tag{10}$$

Then, the SAR agent should take the same action from $z_{t}$ and $z_{adv,t}$ to be robust across different environments, as the perturbed feature represents an observation with a different style but the same semantics. In Procgen, for example, the player, enemies, and items stay the same, while the texture of the background image, the colors of projectiles, and the shapes of obstacles are shifted. The next section explains the objectives used to achieve this generalization.
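The style perturbation layer can be sketched as an instance normalization followed by a re-styling with generated statistics. In the sketch below, `beta_adv` and `gamma_adv` are passed in as arrays of shape (batch, channels); how the generator network produces them is not specified here, so random values stand in for its output in the usage lines. The constant `eps` and all names are assumptions of this sketch.

```python
import numpy as np


def style_perturb(z, beta_adv, gamma_adv, eps=1e-5):
    """Shift the style of z using generated per-channel statistics."""
    mu = z.mean(axis=(-2, -1), keepdims=True)
    sigma = np.sqrt(z.var(axis=(-2, -1), keepdims=True) + eps)
    normalized = (z - mu) / sigma
    # Broadcast the (batch, channels) statistics over the spatial dimensions.
    return gamma_adv[..., None, None] * normalized + beta_adv[..., None, None]


# Stand-ins for the generator output (hypothetical values, not a trained generator).
rng = np.random.default_rng(0)
z = rng.normal(size=(2, 3, 4, 4))
beta_adv = rng.normal(size=(2, 3))
gamma_adv = np.abs(rng.normal(size=(2, 3))) + 0.5
z_adv = style_perturb(z, beta_adv, gamma_adv)
```

The spatial layout of `z` is untouched by the operation; only the per-channel mean and scale change, which is why the perturbed feature is expected to carry the same semantics.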

4.2 SAR Objectives

Primarily, the policy network is updated via PPO or SAC objectives. Thus, the actor loss of SAR is adopted from Equation 2 with the PPO baseline or from Equation 4 when using SAC. We denote this loss by $L_{actor}^{\circ}$. Also, for the critic loss, as suggested in RAD [RAD], we adopt the critic objective of PPO or SAC accordingly, denoted as $L_{critic}^{\circ}$.

Another main goal of SAR is to be robust to different environments. Therefore, the agent should learn its policy by minimizing the difference between the distributions of actions from the style-perturbed features $z_{adv,t}$ and the original ones $z_{t}$. Using the KL divergence, we can calculate the objective as $L_{div} = \mathrm{KL}[π(\cdot | z_{t}) \,\|\, π(\cdot | z_{adv,t})]$. Integrating this with a weight coefficient $λ$, the objective for the SAR actor module can be written as:

$$L_{actor}(ψ) = L_{actor}^{\circ}(ψ) + λ \cdot L_{div}. \tag{11}$$

On the other hand, the generator participates in the min-max game in the opposite manner: it maximizes the difference between the action distributions. This module is trained with the same $L_{div}$ objective but with the opposite sign. Unlike previous works using class label information of the environment style [DARL] or additional heavy background images [SVEA], the objective for the robust policy (i.e., the adversarial loss) does not demand any additional labor. Hence, the overall objective for the generator can be formalized as:

$$L_{gen}(θ) = -λ^{'} \cdot L_{div}, \tag{12}$$

where $λ^{'}$ can be a different coefficient from that of the actor objective.
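For a discrete action space, the divergence term and the two opposing losses can be sketched as follows. This is a minimal NumPy sketch assuming the policy outputs logits for the original and the style-perturbed features; the function names and the default coefficients `lam` and `lam_gen` are placeholders, not the values used in the experiments.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_divergence(logits_clean, logits_adv):
    """Batch mean of KL[pi(.|z_t) || pi(.|z_adv,t)] for categorical policies."""
    logp = log_softmax(logits_clean)
    logq = log_softmax(logits_adv)
    return np.mean(np.sum(np.exp(logp) * (logp - logq), axis=-1))


def sar_actor_and_generator_losses(base_actor_loss, logits_clean, logits_adv,
                                   lam=1.0, lam_gen=1.0):
    """Actor adds the divergence to its RL loss; the generator negates it."""
    l_div = kl_divergence(logits_clean, logits_adv)
    actor_loss = base_actor_loss + lam * l_div
    generator_loss = -lam_gen * l_div
    return actor_loss, generator_loss, l_div
```

Because the two losses share the same divergence with opposite signs, a gradient step that lowers the generator loss raises the actor loss, which is the min-max structure described above. In an actual training loop, each loss would update only its own parameters.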

Finally, the critic is updated to guide the actor toward a policy that maximizes the value function. We observed that sharing a single critic network for predicting the values of both the style-perturbed features and the original ones does not bring a large difference in performance compared to decoupling the critic network, while requiring lighter training computation. Instead of decoupling, we add a regularization term $G_{critic} = (V_{ϕ}(z_{t}) - V_{ϕ}(z_{adv,t}))^{2}$ for the value function, which minimizes the difference between the values predicted from the original feature and from the adversarial example and helps stabilization. Thus, the critic's objective can be computed as follows:

$$L_{critic}(ϕ) = L_{critic}^{\circ}(ϕ) + κ \cdot G_{critic}, \tag{13}$$

with hyperparameter $κ$.¹

¹ The values used for the hyperparameters $λ$, $λ^{'}$, and $κ$ in the experiments are described in the supplementary material.
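The regularized critic objective can be sketched in a few lines. The sketch assumes the base critic loss has already been computed and that the critic has been evaluated on both the original and the style-perturbed features; the default `kappa` and the function name are placeholders.

```python
import numpy as np


def sar_critic_loss(base_critic_loss, v_clean, v_adv, kappa=1.0):
    """Add the squared gap between value predictions to the base critic loss."""
    # Penalize value predictions that change when only the style changes.
    g_critic = np.mean((v_clean - v_adv) ** 2)
    return base_critic_loss + kappa * g_critic
```

For instance, with value predictions `[1.0, 2.0]` and `[1.0, 4.0]`, the regularizer equals 2.0, so a base loss of 0.5 with `kappa=0.1` gives a total of 0.7.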

4.2.1 On convergence.

When the SAR agents learn the optimal policy $π^{*}$, the KL divergence term $L_{div}$ becomes zero. This is the situation where the actor infers the same actions from features with different styles. It can arise in one of two cases: (i) the generator produces the same style statistics for all images in the mini-batch, or, more plausibly, (ii) the actor focuses well on the invariant part of all observations.

Since the model should learn an additional generator module, the training procedure indeed demands more computations. However, the sample efficiency is not highly degraded even with limited training timesteps, e.g., the usual 25M timesteps in Procgen. Although the agent may not learn the optimal policy due to the limited number of epochs, we also empirically observed that the performances of the SAR agents converge as shown in Figure 3.

4.3 Pseudo-code

Here, we present the pseudo-code of the SAR algorithm. As depicted in Figure 1, to maximize the effect of style transfer, we design the $z_{t}$ to pass a Style Mixing module and a Style Perturbation module with two divided branches. In the Style Mixing module, the styles of observations in the mini-batch get interpolated with Equation 9. In Style Perturbation module, on the other hand, the styles of observations are shifted with new styles generated from the generator network with Equation 10.

With two different features $z_{t}$ and $z_{a d v, t}$ , the SAR agent predicts two different action distributions $π_{t}$ and $π_{a d v, t}$ . The difference between these predictions $L_{d i v}$ is computed, and it gets interpreted in two different ways: by the generator to produce more unfamiliar styles and by the actor to make its policy more robust.

1: Initialize rollout or replay buffer $D$
2: Initialize parameters for policy $ψ$, generator $θ$, and critic $ϕ$
3: for every epoch do
4:     for every environment step do
5:         Sample $(s_{t}, a_{t}, r_{t}, s_{t+1})$
6:         Update $D \leftarrow D \cup \{(s_{t}, a_{t}, r_{t}, s_{t+1})\}$
7:     end for
8:     for each mini-batch sampled from $D$ do
9:         $z_{t} \leftarrow \mathrm{Encoder}(s_{t})$ ▹ Encoder in the actor network
10:         Generate $β_{adv}(z_{t}), γ_{adv}(z_{t})$ ▹ From the generator network
11:         $z_{adv,t} \leftarrow \mathrm{StylePerturb}(z_{t})$ ▹ Use Equation 10
12:         $z_{t} \leftarrow \mathrm{AdaIN}(z_{t}, z_{t}^{'})$ ▹ $z_{t}^{'}$ is permuted from $z_{t}$ within mini-batch
13:         Compute $L_{div}$ from $z_{t}, z_{adv,t}$
14:         Compute $L_{actor}$, $L_{gen}$, and $L_{critic}$
15:         Update $ψ$, $θ$, and $ϕ$
16:     end for
17: end for

Algorithm 1 SAR algorithm

5 Results

5.1 Setup

In this section, we exhibit the experiment results for the generalization performance of our SAR model on Procgen [ProcGen] and Distracting Control Suite [DistCS] benchmarks. Recently, these benchmarks have become a standard for measuring the generalization performance of visual-based RL algorithms [RAD, DrQ, DrAC, SODA, SVEA]. These contain reasonably challenging and diverse tasks, which are highly relevant to real-world robot learning.

Figure 2: Examples of seen training environments from (a) starpilot and (b) jumper in Procgen, (c) walker:walk and (d) cartpole:balance task in Distracting Control Suite, with examples of unseen test environments from (e) starpilot and (f) jumper in Procgen, (g) walker:walk and (h) cartpole:balance task in Distracting Control Suite.

While the Procgen benchmark has a discrete action space, the Distracting Control Suite presents continuous control tasks. We used PPO as the base algorithm on Procgen and SAC as the base algorithm on the Distracting Control Suite, showing that the SAR algorithm can be applied to both on-policy and off-policy algorithms. Figure 2 visualizes some examples of training and test environments in the two benchmarks.

OpenAI Procgen. One key reason for choosing this benchmark is that it presents different styles between test and training environments. We train the agents on the first 200 levels in the Procgen environment. Then, we test the generalization performance of the agents on levels sampled from the full distribution of unseen levels, in the easy distribution mode. Among the 16 tasks, we selected four tasks with comparably larger differences (starpilot, climber, jumper, ninja) and four tasks with comparably smaller differences (coinrun, maze, bigfish, dodgeball) between the styles of the training and test environments.

Distracting Control Suite. DeepMind Control Suite [DMC] presents various continuous control tasks where RL agents can be tested. On top of the DMC, Stone et al. [DistCS] proposed Distracting Control Suite that distracts the agents by applying a color shift, changing the background images into videos, and rotating the camera angle. We test our model and other baselines with different noise coefficient values and show how these models generalize to unseen situations.

5.2 Generalization Performance

Procgen. First, Table 1 shows the result of the generalization test of SAR with six other baselines. The SAR agent achieved high and robust performances in the zero-shot generalization test: 3 top-1 scores and 7 top-3 scores out of 8 tasks.

The baselines are six visual-learning RL algorithms showing state-of-the-art results on Procgen. PPO [PPO] is the vanilla on-policy RL baseline, and RAD [RAD] uses data augmentation on top of PPO. We used random translation (denoted as ‘trans’) and random color cutout (denoted as ‘color’) for RAD, as they report the best performance. Among the three variants of DrAC, which builds on RAD, we compare our method with UCB DrAC [DrAC] and Meta DrAC [DrAC]; the former presents the best performance among the variants. MixStyle [MixStyle] exploits style mixing, and DARL [DARL] uses an adversarial objective for regularization with style labels.²

² We reproduced all the results of the baselines. The reproduced results are better than the reported performance in several tasks due to more training steps [RAD, DrAC, MixStyle].

| Task | PPO [PPO] | RAD [RAD] (trans) | RAD [RAD] (color) | UCB DrAC [DrAC] | Meta DrAC [DrAC] | MixStyle [MixStyle] | DARL [DARL] | SAR (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Starpilot | 30.37 ± 11.14 | 29.57 ± 7.52 | 27.03 ± 7.51 | 33.17 ± 6.37 | 29.40 ± 4.61 | 25.70 ± 8.13 | 21.97 ± 10.66 | **35.87 ± 9.13** |
| Climber | 6.73 ± 1.27 | 4.87 ± 1.31 | 7.23 ± 2.05 | **9.43 ± 1.35** | 7.77 ± 0.68 | 7.37 ± 2.71 | 7.03 ± 1.37 | 7.93 ± 1.10 |
| Jumper | 6.00 ± 2.65 | 4.67 ± 0.58 | 5.67 ± 1.53 | 5.67 ± 0.58 | 7.33 ± 2.52 | 6.00 ± 2.65 | **7.67 ± 1.53** | 6.33 ± 1.15 |
| Ninja | 6.00 ± 2.83 | 5.33 ± 2.52 | 5.33 ± 2.08 | 6.33 ± 1.53 | 7.33 ± 0.58 | **8.67 ± 1.53** | 7.33 ± 0.58 | 8.33 ± 1.15 |
| Coinrun | 8.67 ± 1.15 | 8.33 ± 1.15 | **9.33 ± 1.15** | 9.00 ± 1.00 | 8.33 ± 0.58 | **9.33 ± 0.58** | **9.33 ± 1.15** | 9.00 ± 1.00 |
| Maze | 4.67 ± 0.58 | 5.33 ± 1.53 | 5.33 ± 0.58 | **7.33 ± 1.53** | 4.67 ± 0.58 | 5.33 ± 0.58 | 3.67 ± 1.15 | 5.00 ± 1.00 |
| Bigfish | 10.37 ± 3.27 | 6.03 ± 2.06 | 10.13 ± 1.84 | 9.37 ± 3.16 | 12.03 ± 4.38 | 9.00 ± 2.94 | 9.07 ± 4.05 | **13.20 ± 6.16** |
| Dodgeball | 4.13 ± 1.75 | 4.93 ± 1.53 | 3.20 ± 1.56 | **8.13 ± 1.33** | 2.40 ± 2.46 | 3.60 ± 2.31 | 4.47 ± 2.73 | 3.60 ± 2.23 |
| Avg. Rank | 4.9 | 5.8 | 4.8 | 3.1 | 4.5 | 3.9 | 4.5 | 2.9 |

Table 1: The generalization scores of SAR and baseline methods on Procgen. The results are averaged over three runs with 100M training timesteps without smoothing. The ranking stands for the average rank among all tasks. The top-1 score is bold.

Distracting Control Suite. As Table 2 demonstrates, SAR again showed robust performance in the four selected tasks in the Distracting Control Suite compared to the baselines. This experiment implies that our method can also be applied to continuous control tasks and can be attached to off-policy RL algorithms.

In this experiment, we purposely tested different baselines from those used for Procgen to compare SAR with various algorithms. SAC [SAC] is the vanilla off-policy RL algorithm, and CURL [CURL] uses a contrastive objective for representation learning on top of SAC. DrQ [DrQ] was chosen as the representative baseline using data augmentation with additional regularization terms. PAD [PAD] adapts to a new test environment using self-supervision.³ ⁴

³ We reproduced all the results of the baselines and applied ‘trans’ to DrQ. The results with zero noise match the reported performances well in most cases [SAC, CURL, DrQ].

⁴ The performance of PAD differs from the reported value because of the simultaneous application of natural video backgrounds, color noise, and camera angle noise.

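| Task | Noise | SAC [SAC] | CURL [CURL] | DrQ [DrQ] | PAD [PAD] | SAR (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| walker:walk | zero noise | 373 ± 89 | 828 ± 99 | 930 ± 23 | 838 ± 47 | 325 ± 57 |
| walker:walk | moderate | 96 ± 10 | 88 ± 11 | 126 ± 33 | 125 ± 27 | 139 ± 19 |
| walker:walk | hard | 85 ± 8 | 57 ± 7 | 80 ± 11 | 71 ± 8 | 112 ± 15 |
| cartpole:balance | zero noise | 996 ± 1 | 995 ± 3 | 996 ± 3 | 992 ± 6 | 990 ± 5 |
| cartpole:balance | moderate | 262 ± 20 | 215 ± 57 | 246 ± 15 | 236 ± 17 | 266 ± 26 |
| cartpole:balance | hard | 251 ± 12 | 216 ± 62 | 240 ± 26 | 238 ± 22 | 261 ± 17 |
| reacher:easy | zero noise | 197 ± 7 | 960 ± 24 | 844 ± 63 | 671 ± 285 | 177 ± 51 |
| reacher:easy | moderate | 88 ± 11 | 79 ± 11 | 83 ± 10 | 75 ± 19 | 98 ± 13 |
| reacher:easy | hard | 72 ± 11 | 67 ± 12 | 78 ± 3 | 71 ± 8 | 93 ± 10 |
| cheetah:run | zero noise | 316 ± 159 | 280 ± 12 | 332 ± 21 | 285 ± 29 | 304 ± 80 |
| cheetah:run | moderate | 55 ± 10 | 46 ± 8 | 47 ± 8 | 49 ± 5 | 49 ± 11 |
| cheetah:run | hard | 53 ± 15 | 41 ± 8 | 33 ± 13 | 41 ± 10 | 46 ± 13 |
| Avg. Rank | | 2.125 | 4.625 | 3.125 | 3.625 | 1.25 |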

Table 2: The generalization results on Distracting Control Suite after training for 500k timesteps. The models are evaluated in two distraction settings: a moderate setup with noise coefficients $β_{cam} = β_{rgb} = 0.3$ and 60 background videos, and a hard setup with noise coefficients $β_{cam} = β_{rgb} = 0.5$ and 60 background videos, where $β_{cam}$ and $β_{rgb}$ denote the camera angle noise intensity and the color noise intensity, respectively. The results are averaged over 3 runs with different seeds, and the rank is calculated within the distracted environments.
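The Avg. Rank row of Table 2 can be recomputed from the mean scores of the eight distracted settings (moderate and hard for each of the four tasks). The sketch below assumes that tied scores share the best rank (competition ranking); this tie rule is an assumption of the sketch, and it reproduces the reported row.

```python
def competition_ranks(scores):
    """Rank scores in descending order; tied scores share the smallest rank."""
    return [1 + sum(other > s for other in scores) for s in scores]


methods = ["SAC", "CURL", "DrQ", "PAD", "SAR"]
# Mean scores from Table 2, distracted settings only (zero-noise rows excluded).
distracted_scores = [
    [96, 88, 126, 125, 139],    # walker:walk, moderate
    [85, 57, 80, 71, 112],      # walker:walk, hard
    [262, 215, 246, 236, 266],  # cartpole:balance, moderate
    [251, 216, 240, 238, 261],  # cartpole:balance, hard
    [88, 79, 83, 75, 98],       # reacher:easy, moderate
    [72, 67, 78, 71, 93],       # reacher:easy, hard
    [55, 46, 47, 49, 49],       # cheetah:run, moderate
    [53, 41, 33, 41, 46],       # cheetah:run, hard
]

rank_rows = [competition_ranks(row) for row in distracted_scores]
average_rank = {
    name: sum(row[i] for row in rank_rows) / len(rank_rows)
    for i, name in enumerate(methods)
}
# average_rank == {"SAC": 2.125, "CURL": 4.625, "DrQ": 3.125, "PAD": 3.625, "SAR": 1.25}
```

SAR ranks first in six of the eight distracted settings and second in the two cheetah:run settings, which gives (6 × 1 + 2 × 2) / 8 = 1.25.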

Model behavior. Figure 3 provides the learning curve of the SAR agents. They exhibit competitive sample efficiency compared to the baselines. A quantitative comparison of the models’ computational complexity is in Table 3. Although the SAR model requires more parameters, it does not sacrifice much training and test time in comparison with methods using data augmentation.

Figure 3: The learning curve of SAR with baselines. For better visualization, we selected three models and three tasks: PPO (blue), RAD (orange), and SAR (red). Here, we applied exponential moving average smoothing with a coefficient value of 0.98.
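Exponential moving average smoothing of a learning curve can be sketched as below. The initialization with the first value is an assumption of this sketch, since the exact smoothing variant is not specified; the function name is illustrative.

```python
def ema_smooth(values, coefficient=0.98):
    """Smooth a sequence with s_t = c * s_{t-1} + (1 - c) * x_t."""
    if not values:
        return []
    smoothed = []
    running = values[0]  # assumed initialization: start from the first value
    for value in values:
        running = coefficient * running + (1.0 - coefficient) * value
        smoothed.append(running)
    return smoothed
```

With a coefficient of 0.98, each new score contributes only 2% to the smoothed curve, so short-term fluctuations between evaluations are strongly damped.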

Parameters ($\times 10^{6}$): 0.626, 0.678, 1.151

| | PPO [PPO] | RAD [RAD] (color) | UCB DrAC [DrAC] | MixStyle [MixStyle] | DARL [DARL] | SAR (Ours) |
| --- | --- | --- | --- | --- | --- | --- |
| Training Time (s) | 6.507 | 11.605 | 12.841 | 6.735 | 6.542 | 13.377 |
| Test Time (s) | 2.983 | 2.656 | 3.154 | 2.349 | 2.521 | 3.969 |

Table 3: A comparison between the number of parameters, training time, and test time. The training time refers to the time consumed for 256 timesteps and an update, and the test time is for running ten episodes in Procgen.

With augmentation. Training the SAR agents can be integrated with other techniques. For example, Table 4 presents the result of the SAR agents with data augmentation. Both the use of random translation and color cutout improved the performance. This result implies that the SAR agents can potentially be improved using other auxiliary tasks or regularization terms.

On curriculum learning. Choice of timing for adopting the min-max game, i.e., curriculum learning, can improve final generalization performances for SAR. See supplementary for the results of the experiment.

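| Task | PPO [PPO] | MixStyle [MixStyle] | SAR ($λ = 0$) | SAR ($κ = 0$) | SAR | SAR (trans) | SAR (color) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| starpilot | 27.09 ± 0.83 | 26.81 ± 0.89 | 27.44 ± 2.59 | 29.28 ± 7.79 | 28.92 ± 4.60 | 30.76 ± 0.90 | 33.72 ± 1.16 |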

Table 4: Generalization performance of SAR with data augmentation and the ablation study in starpilot. SAR ($λ = 0$) refers to the setting without the adversarial loss, and SAR ($κ = 0$) refers to the setting without the regularization loss. We apply two different data augmentations: trans and color. The results are averaged over three runs.

5.3 Ablation Study

This ablation study answers two questions: (1) whether the generator module helps generalization performance, and (2) whether the regularization term $G_{critic}$ is important for stabilization. Comparing SAR to the PPO baseline, MixStyle using only style mixing, and SAR ($λ = 0$), Table 4 shows that the adversarial objective helps improve the mean of the performances. Comparing SAR to SAR ($κ = 0$), Table 4 shows that $G_{critic}$ helps reduce the variance of the performances.⁵

⁵ Note that the results in Table 4 are slightly different from Table 1, as we applied exponential moving average smoothing with a coefficient value of 0.98 before averaging.

5.4 Learned Feature Analysis

Furthermore, we qualitatively examine the features extracted from the encoder learned with the SAR objectives. The feature $z$ we analyze is taken from the encoded features before they enter the AdaIN layer, to exclude the effect of explicit style mixing.

We demonstrate three analyses on the embedding:

GradCAM [GradCAM] visualization for a high-level interpretation of the model's decisions.
Reconstruction of images from the feature maps.
t-SNE [t-SNE] for analyzing the latent representations.

5.4.1 Visualization of model decision

We use GradCAM [GradCAM] to visualize where the trained agents are focusing with respect to their decisions. GradCAM is computed by weighting each channel's activation map of the target convolutional layer by its spatially averaged gradient and combining the weighted maps across channels. Both the agent trained by the vanilla PPO and our agents focus on similar objects when predicting actions in the training environment; in the case of starpilot, they focus on the shooter and the projectiles from enemies. In the unseen test environment shown in Figure 4, however, the vanilla PPO agent gets more distracted by the changed backgrounds and focuses on irrelevant areas in the images.
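The GradCAM computation can be illustrated with a minimal sketch in plain Python. It assumes that the activations and the gradients of the chosen action score at the target layer are already available as nested lists; the function name `grad_cam` and the toy inputs are hypothetical and are not taken from the released code.

```python
def grad_cam(activations, gradients):
    """Toy Grad-CAM on nested lists.

    activations, gradients: C channel maps, each of size H x W, where
    gradients[c][i][j] is d(score) / d(activations[c][i][j]).
    """
    num_channels = len(activations)
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial average of the gradient for that channel.
    weights = [
        sum(sum(row) for row in gradients[c]) / (height * width)
        for c in range(num_channels)
    ]
    # Weighted sum of the activation maps over channels, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[c] * activations[c][i][j]
                         for c in range(num_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    return cam, weights


# Hypothetical 2-channel, 2x2 example.
acts = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
grads = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap, channel_weights = grad_cam(acts, grads)
```

In this toy case the second channel receives a negative weight, so the locations it activates are suppressed by the ReLU and only the regions supported by the first channel remain highlighted.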

Figure 4: GradCAM results of (a) PPO and (b) SAR, overlaid on (c) the original images from starpilot in Procgen. The highlighted regions represent where the agent is focusing. The SAR model better focuses on what is important with the style shifts. (d) Image reconstruction results from features extracted with the SAR agents, and (e) the original observations of starpilot in Procgen are displayed.

5.4.2 Image reconstruction from embedded features.

Reconstructing images from the feature maps gives a more direct visualization of the characteristics of the learned features. We trained a new decoder network that converts the feature maps into the original images from the training environments. In Figure 4, we show the reconstructed and original images. While the meaningful semantics, e.g., shooters or enemies, remain, the reconstructed background appears invariant to the different original styles.

5.4.3 t-SNE Analysis.

Li et al. [DARL] noted that the distance between embeddings in the latent space may reflect the dissimilarities between the features. Thus, by observing the t-SNE [t-SNE] of the embeddings from different environments, we can visualize how the feature maps are correlated with the style of the images. While the features extracted from the PPO encoder are patterned with respect to the level labels, i.e., the styles, the SAR encoder extracts invariant embeddings regardless of them. The visualization result can be found in the supplementary material.

6 Limitation

The main limitation of the SAR agents appears in the noise-free setting of Distracting Control Suite in Table 2, even though they adapt well to heavy noise. The additional terms in the learning objective may negatively affect the performance when there is zero noise in the test environment. An insufficient variety of styles in the training environments may also have affected the actors, as they could not observe enough styles of training features to compete well with the generator. In that case, the generator may take the wrong direction when generating new styles, and the min-max game may fail. Curriculum learning may help alleviate such concerns.

7 Conclusion

The SAR agents learn style-agnostic representations by observing features with a wide range of styles, obtained by (i) mixing styles through style randomization and (ii) producing them with an adversarial style perturbation generator. In experiments on both the Procgen and Distracting Control Suite benchmarks, the SAR agents show the best generalization performance in terms of rank. The qualitative analysis indicates that the model helps to learn style-agnostic representations. We hope that the progress made here provides a broader view and brings out more techniques for many other tasks as well.

Acknowledgement

This work was supported by an IITP grant funded by the Korea government (MSIT) (No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH), and No. 2022-0-00290, Visual Intelligence for Space-Time Understanding and Generation based on Multi-layered Visual Common Sense).

References

Supplementary Material

Appendix A Implementation details

We explain the implementation details for both the Procgen [ProcGen] and Distracting Control Suite [DistCS] benchmarks. We reproduce all the baseline results on top of the PPO implementation [PPOimpl] for Procgen and the SAC implementation [SACimpl] for Distracting Control Suite.

A.1 Hyperparameters

A.1.1 Procgen.

The baselines compared with our SAR model are PPO [PPO], RAD [RAD], UCB DrAC [DrAC], Meta DrAC [DrAC], MixStyle [MixStyle], and DARL [DARL]. We follow the settings of Cobbe et al. [ProcGenImpl] in Procgen; the encoder in the actor-network is based on the ResNet architecture [Impala], and the encoded features are shared by both the actor and critic networks. The encoder is composed of three layer-blocks, where one layer-block is built with five convolutional layers with two skip connections. The hyperparameters for the model and environments are listed in Table S1.

For PPO, the basic baseline model, we use generalized advantage estimation [PPO] but no stacked observations [stackObs].

For RAD, we apply random translation and color cutout, as they are reported to perform best [RAD]. Examples of both augmentations are shown in Figure S1.

For UCB DrAC and Meta DrAC, we follow the setting in [DrAC].

For MixStyle, the style mixing is done once, after the feature passes through two layer-blocks, which is the point where the feature flow is divided into two branches in SAR.
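As a rough illustration of what style mixing at the feature level does, the sketch below re-normalizes one feature map with channel-wise statistics interpolated between its own and those of another instance. It is a simplified stand-in written in plain Python, not the implementation used in the experiments; the function names, the flattened `C x N` layout, and the `Beta(0.1, 0.1)` sampling of the mixing weight are assumptions made for this example.

```python
import math
import random


def channel_stats(feature, eps=1e-6):
    """Per-channel mean and standard deviation of a C x N feature (N = H*W)."""
    means, stds = [], []
    for channel in feature:
        mu = sum(channel) / len(channel)
        var = sum((v - mu) ** 2 for v in channel) / len(channel)
        means.append(mu)
        stds.append(math.sqrt(var + eps))
    return means, stds


def mix_style(content, style, lam):
    """Re-normalize `content` with statistics mixed from `content` and `style`.

    lam = 1 keeps the original statistics; lam = 0 adopts those of `style`.
    """
    mu_c, sig_c = channel_stats(content)
    mu_s, sig_s = channel_stats(style)
    mixed = []
    for ch, channel in enumerate(content):
        mu_mix = lam * mu_c[ch] + (1.0 - lam) * mu_s[ch]
        sig_mix = lam * sig_c[ch] + (1.0 - lam) * sig_s[ch]
        mixed.append(
            [sig_mix * (v - mu_c[ch]) / sig_c[ch] + mu_mix for v in channel]
        )
    return mixed


# Hypothetical single-channel features of two instances.
x_a = [[0.0, 2.0, 4.0, 6.0]]
x_b = [[10.0, 10.5, 11.0, 11.5]]
lam = random.betavariate(0.1, 0.1)  # assumed mixing distribution
out = mix_style(x_a, x_b, lam)
```

The spatial pattern of `x_a` is preserved up to an affine transform, while its mean and scale, which play the role of the style, move toward those of `x_b`.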

For DARL, we not only follow the adaptive coefficient in the gradient reversal layer [DARL] but also control the effect of the domain adversarial loss with the coefficient $d$, whose value is reported in Table S1.

For SAR (Ours), we set the adversarial coefficients $λ, λ^{'}$ to be equal, although they can optionally differ. We perform a grid search to find the best hyperparameter pairs and report them as indices (1) to (3) for the adversarial coefficient and (4) and (5) for the value similarity coefficient, where each index refers to the following tasks:

(1) starpilot, jumper, coinrun
(2) climber, ninja, bigfish
(3) maze, dodgeball
(4) starpilot, ninja, coinrun
(5) climber, jumper, maze, bigfish, dodgeball
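The index scheme can be read as a per-task lookup. The following sketch encodes the task groups listed above together with the coefficient values given for each index in Table S1; the dictionary layout and the helper name `coefficients_for` are hypothetical and serve only to make the mapping explicit.

```python
# Adversarial coefficient (lambda), indices (1) to (3) in Table S1.
ADVERSARIAL_COEF = {
    0.1: ["starpilot", "jumper", "coinrun"],    # index (1)
    0.01: ["climber", "ninja", "bigfish"],      # index (2)
    0.001: ["maze", "dodgeball"],               # index (3)
}

# Value similarity coefficient (kappa), indices (4) and (5) in Table S1.
VALUE_SIMILARITY_COEF = {
    1.0: ["starpilot", "ninja", "coinrun"],                        # index (4)
    0.1: ["climber", "jumper", "maze", "bigfish", "dodgeball"],    # index (5)
}


def coefficients_for(task):
    """Return the (lambda, kappa) pair used for a Procgen task."""
    lam = next(v for v, tasks in ADVERSARIAL_COEF.items() if task in tasks)
    kappa = next(v for v, tasks in VALUE_SIMILARITY_COEF.items() if task in tasks)
    return {"lambda": lam, "kappa": kappa}


ninja_config = coefficients_for("ninja")
```

For example, ninja belongs to group (2) for the adversarial coefficient and group (4) for the value similarity coefficient.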

A.1.2 Distracting Control Suite

We compare our SAR model with SAC [SAC], CURL [CURL] and DrQ [DrQ] in Distracting Control Suite. The encoder network has 3 CNN layers with layer size 32 and kernel size 3. We set the stride of the first layer of the encoder to 2 and the stride of the remaining layers to 1. The encoder network is shared between actor and critic. In Table S2, we report the value of the hyperparameters.
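To make the encoder geometry concrete, the sketch below computes the spatial size of the feature map after the three convolutional layers for the 84 × 84 input listed in Table S2. It assumes that the convolutions use no padding and that "layer size 32" denotes 32 output channels; neither detail is stated explicitly, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a 1-D convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_output_shape(input_size=84, channels=32, kernel=3, strides=(2, 1, 1)):
    """Shape (C, H, W) after the stacked convolutions, assuming zero padding."""
    size = input_size
    for stride in strides:
        size = conv_out(size, kernel, stride)
    return channels, size, size


shape = encoder_output_shape()
flattened = shape[0] * shape[1] * shape[2]
```

Under these assumptions the spatial size shrinks from 84 to 41, 39, and finally 37, so the flattened encoder output has 32 × 37 × 37 = 43808 entries before any subsequent projection.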

For SAC and CURL, we reproduce the results based on the implementation in [SACimpl] and [CURL].

For DrQ, we apply random translation, as it is reported to perform best [DrQ]. We set the augmentation coefficients K=1 and M=1.

For SAR (Ours), we also set the adversarial coefficients $λ$, $λ^{'}$ to be equal. We search for the best hyperparameter pair using grid search and report them as indices (1) and (2) for the adversarial coefficient and (3) and (4) for the value similarity coefficient, where each index refers to the following tasks:

(1) walker:walk, cartpole:balance
(2) reacher:easy, cheetah:run
(3) walker:walk, reacher:easy, cheetah:run
(4) cartpole:balance

A.2 Visualization of images

A.2.1 Augmentation result in Procgen

In our generalization performance experiment in Procgen, especially for RAD, we use two data augmentation methods: random translation and random color cutout. We show the results in Figure S1.

Figure S1: (a) The original images, and augmentation results with (b) random translation and (c) random color cutout, of coinrun in Procgen. These methods are only applied to RAD in the generalization performance experiment. An additional experiment comparing SAR agents with and without these methods is conducted separately.

A.2.2 Distraction result in Distracting Control Suite.

In Figure S2, we visualize more diverse examples in Distracting Control Suite. The noises, referring to the distractions we apply, are shifts of color, distortions of the camera angle, and replacement of the background image with videos. In particular, the camera angle noise intensity in the main text, i.e., $β_{cam}$, refers to the camera angle distraction intensity.

Figure S2: Distracting environment examples of the cartpole:balance task in Distracting Control Suite. Row (a) shows the task with zero noise. Row (b) shows the task with color shift $β_{rgb} = 0.5$. Row (c) shows the task with camera angle distraction $β_{cam} = 0.5$. Row (d) shows the task with changed backgrounds. Row (e) shows the task with color shift $β_{rgb} = 0.5$, camera angle distraction $β_{cam} = 0.5$, and changed backgrounds.

| Hyperparameter | Value |
|---|---|
| Input image resolution | (64, 64) |
| Discount factor $γ$ | 0.999 |
| Generalized advantage estimates | 0.95 |
| # timesteps per rollout | 256 |
| # epochs per rollout | |
| # minibatches per epoch | |
| Entropy bonus | 0.01 |
| PPO gradient clip range $ϵ$ | 0.2 |
| Reward normalization | yes |
| Learning rate | 5e-4 |
| # workers | |
| # environments per worker | |
| # total timesteps | 100M |
| Optimizer | Adam |
| Recurrent neural network | |
| Frame stack $k$ | |
| Regularization coefficient $α_{r}$ | 0.1 (UCB DrAC, Meta DrAC) |
| Exploration coefficient $c$ | 0.1 (UCB DrAC) |
| Sliding window size $K$ | 10 (UCB DrAC) |
| Domain loss coefficient $d$ | 0.9 (DARL) |
| Meta gradient clip range | 100 (Meta DrAC) |
| Meta # train steps | 1 (Meta DrAC) |
| Meta # test steps | 1 (Meta DrAC) |
| Adversarial coefficient $λ$ | 0.1 (1); 0.01 (2); 0.001 (3) |
| Value similarity coefficient $κ$ | 1.0 (4); 0.1 (5) |

Table S1: Hyperparameters for SAR (Ours) and baselines in the Procgen experiment. The indices inside the parentheses for ‘adversarial coefficient’ and ‘value similarity coefficient’ indicate that the different values are used in different tasks.

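| Hyperparameter | Value |
|---|---|
| Input image resolution | (84, 84) |
| Discount factor $γ$ | 0.99 |
| Frame stack $k$ | |
| Random shift | Up to 4 pixels |
| Action repeat (cartpole) | |
| Action repeat (finger) | |
| Action repeat (else) | |
| Episode length | 1000 |
| Replay buffer size | 100000 |
| Optimizer | Adam |
| Learning rate (actor, critic, attacker) | $10^{-3}$ |
| Learning rate (alpha) | $10^{-4}$ |
| Encoder feature dimension | |
| Target smoothing coefficient $τ$ (actor) | 0.05 |
| Target smoothing coefficient $τ$ (critic) | 0.01 |
| Target smoothing coefficient $τ$ (alpha) | 0.5 |
| Target update interval | |
| Batch size | 128 |
| Latent dimension | 128 |
| Initial temperature | 0.1 |
| Initial steps | 1000 |
| Network update frequency (attacker, critic) | |
| Network update frequency (actor) | |
| Adversarial coefficient $λ$ | 0.01 (1); 0.1 (2) |
| Value similarity coefficient $κ$ | 0.1 (3); 1.0 (4) |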

Table S2: Hyperparameters for SAR (Ours) and baselines in the Distracting Control Suite experiment.

Appendix B Learning curves

We plot the learning curves of the SAR agents and baselines in both Procgen and Distracting Control Suite. For clear visualization, each graph is smoothed, following the settings of Cobbe et al. [ProcGenImpl].

B.0.1 Procgen.

Figure S3 shows the learning curves of models with the best performances from each algorithm in the Procgen environment. We apply an exponential moving average smoothing with the smoothing coefficient value of $0.95$ .
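The smoothing applied to the curves can be sketched as follows. The coefficient 0.95 is the value stated above; initializing the running average with the first value of the curve (rather than with zero plus a bias correction) is an assumption, and the function name `ema_smooth` is hypothetical.

```python
def ema_smooth(values, coef=0.95):
    """Exponential moving average of a sequence of scores.

    Each output is coef * previous_output + (1 - coef) * current_value.
    The running average starts at the first value (an assumption).
    """
    if not values:
        return []
    smoothed = []
    running = values[0]
    for v in values:
        running = coef * running + (1.0 - coef) * v
        smoothed.append(running)
    return smoothed


# Hypothetical raw episode returns.
raw_returns = [10.0, 30.0, 20.0, 40.0]
smoothed_returns = ema_smooth(raw_returns, coef=0.95)
```

A larger coefficient gives a smoother curve that reacts more slowly to changes in the raw scores.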

B.0.2 Distracting Control Suite.

Figure S4 and Figure S5 show the learning curves of the agents in Distracting Control Suite. Unlike Procgen, the training environments in Distracting Control Suite do not present various styles. Although the SAR agents are therefore trained without a wide enough range of training environments and do not show the best performance in the training phase, they adapt to the distracting environments and show the best performance in three out of four tasks. Here, we apply exponential moving average smoothing only to the training curves, with a smoothing coefficient of $0.99$.

Figure S3: The learning curves of training and test in Procgen.

Figure S4: The learning curves of training in Distracting Control Suite.

Figure S5: Evaluation performances with respect to various distraction scales in Distracting Control Suite. $β_{rgb}$ and $β_{cam}$ are set equal to the distraction values on the x-axis. We use 4, 8, and 60 background videos for distraction values of 0.1, 0.2, and $>$ 0.3, respectively. The SAR agents show better generalization performances with stronger noises. The observation images in distracting environments are depicted in Figure S2.

Appendix C Curriculum Learning

We conduct an experiment on curriculum learning in both the Procgen and Distracting Control Suite benchmarks. As addressed by Ko & Ok [InDA/ExDA], the timing of adopting data augmentation may affect the test performance.

In the Distracting Control Suite environment, while the SAR agents perform no better than their baseline SAC [SAC] in the zero-noise setting, they improve on several tasks when curriculum learning is adopted. The SAR agents also show better performance in several Procgen tasks with a warm-up stage.

C.0.1 Procgen.

The SAR agents are trained for 50M timesteps on Procgen with three different start times for applying the adversarial loss: from the beginning, after 10M timesteps of warm-up, and after 25M timesteps of warm-up. The results are averaged over three runs with different seeds.

| Task | From start (at final) | After 10M (at 10M) | After 10M (at final) | After 25M (at 25M) | After 25M (at final) |
|---|---|---|---|---|---|
| starpilot | 27.6±7.9 | 27.6±7.9 | 30.7±5.3 | 21.4±4.1 | 34.0±13.9 |
| climber | 8.4±0.5 | 5.6±2.7 | 7.1±0.1 | 6.3±2.8 | 6.5±2.6 |
| jumper | 6.0±1.0 | 6.3±0.6 | 6.0±1.0 | 4.0±2.7 | 6.3±2.1 |
| ninja | 6.0±1.0 | 5.0±2.7 | 6.7±0.6 | 6.3±0.6 | 8.3±1.2 |
| coinrun | 7.7±0.6 | 7.7±0.6 | 7.0±1.0 | 8.3±1.2 | 8.3±0.6 |
| maze | 6.3±2.1 | 6.3±0.6 | 5.0±2.0 | 7.0±1.0 | 7.3±3.1 |
| bigfish | 6.6±1.0 | 1.6±0.6 | 12.3±5.4 | 3.3±3.5 | 8.9±0.1 |
| dodgeball | 2.1±1.8 | 0.5±0.3 | 2.1±2.5 | 1.4±0.7 | 3.4±2.3 |

Table S3: The generalization results on the Procgen benchmark with different curriculum learning schedules. The best final result is in bold, and the second-best final result is underlined.
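The three start times compared in Table S3 can be expressed as a schedule on the adversarial coefficient. The sketch below assumes a hard switch, i.e., the coefficient is exactly zero during warm-up and takes its full value afterwards; whether the actual training uses such a hard switch is not stated, and the names are hypothetical.

```python
# Warm-up lengths (in timesteps) for the three settings in Table S3.
WARMUP_SETTINGS = {
    "from start": 0,
    "after 10M": 10_000_000,
    "after 25M": 25_000_000,
}


def adversarial_coefficient(timestep, lam, warmup_steps):
    """Return 0 during warm-up and `lam` afterwards (hard switch, assumed)."""
    return lam if timestep >= warmup_steps else 0.0


# Coefficient seen by each setting at 12M timesteps, with lam = 0.1.
coef_at_12m = {
    name: adversarial_coefficient(12_000_000, 0.1, warmup)
    for name, warmup in WARMUP_SETTINGS.items()
}
```

At 12M timesteps, only the "after 25M" setting is still in its warm-up stage and trains without the adversarial loss.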

C.0.2 Distracting Control Suite.

Table S4 reports the generalization results in Distracting Control Suite with curriculum learning. The SAR agents are trained for 500k timesteps, with the adversarial loss applied after 300k timesteps of warm-up. The results are averaged over three runs with different seeds.

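| Task | Noise | SAR from start | SAR after 300k |
|---|---|---|---|
| walker:walk | zero noise | 325±57 | 420±78 |
| walker:walk | moderate | 139±19 | 113±23 |
| walker:walk | hard | 112±15 | 88±20 |
| cartpole:balance | zero noise | 990±5 | 996±2 |
| cartpole:balance | moderate | 266±26 | 249±10 |
| cartpole:balance | hard | 261±17 | 241±16 |
| reacher:easy | zero noise | 177±51 | 211±33 |
| reacher:easy | moderate | 98±13 | 85±19 |
| reacher:easy | hard | 93±10 | 77±14 |
| cheetah:run | zero noise | 304±80 | 277±47 |
| cheetah:run | moderate | 49±11 | 57±16 |
| cheetah:run | hard | 46±13 | 53±13 |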

Table S4: The generalization results on Distracting Control Suite with curriculum learning. The best results are in bold.

Appendix D Learned Feature Analysis

D.0.1 Image reconstruction from embedded features.

Figure S6 displays three consecutive original frames and the corresponding images reconstructed from the embeddings of a trained SAR agent on two different levels of jumper in Procgen. This illustrates that the SAR agent extracts style-agnostic representation features while preserving the important elements of the images.

Figure S6: (a) Images from two episodes of a trained SAR agent and (b) reconstructed images with the representation features from the agent with jumper in Procgen.

D.0.2 t-SNE Analysis.

Figure S7 presents the t-SNE results of embeddings extracted from the trained SAR and PPO agents. While the representation features extracted from the PPO agent are well grouped according to the styles of the environments, those from the SAR agent are scattered without a clear pattern. The average distance between all PPO sample pairs, where a sample pair consists of two features five frames apart, is 1.21, while that of the SAR sample pairs is 3.41. For sample pairs consisting of features 15 frames apart, the average distance is 3.43 for PPO and 9.30 for SAR.

Figure S7: t-SNE results of (left) PPO and (right) SAR on starpilot in Procgen. Example image pairs within 15 frames of each other, taken from 3 different levels, are marked.
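The pairwise statistic reported above can be sketched as the mean distance between embeddings that lie a fixed number of frames apart within one episode. The sketch assumes Euclidean distance and does not settle whether the distance is measured in the original feature space or in the t-SNE space; the function name and the toy trajectory are hypothetical.

```python
import math


def mean_pair_distance(embeddings, gap):
    """Average Euclidean distance between embeddings `gap` frames apart.

    embeddings: sequence of equal-length vectors, ordered by frame index.
    """
    pairs = [
        (embeddings[t], embeddings[t + gap])
        for t in range(len(embeddings) - gap)
    ]
    if not pairs:
        raise ValueError("episode is shorter than the requested frame gap")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Hypothetical 2-D embeddings of three consecutive frames.
trajectory = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist_gap1 = mean_pair_distance(trajectory, gap=1)
dist_gap2 = mean_pair_distance(trajectory, gap=2)
```

Calling the function with `gap=5` and `gap=15` on the embeddings of an episode would correspond to the two frame gaps discussed in the text.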