Dynamic Memory-based Curiosity: A Bootstrap Approach for Exploration

Zijian Gao¹, Kele Xu¹, YiYing Li², Yuanzhao Zhai¹,
Dawei Feng¹, Bo Ding¹, XinJun Mao¹, Huaimin Wang¹

Corresponding author: Kele Xu

Abstract

The sparsity of extrinsic rewards poses a serious challenge for reinforcement learning (RL). Many recent efforts have focused on curiosity, which can provide a representative intrinsic reward for effective exploration. However, the challenge is still far from being solved. In this paper, we present a novel curiosity mechanism for RL, named DyMeCu, which stands for Dynamic Memory-based Curiosity. Inspired by human curiosity and information theory, DyMeCu consists of a dynamic memory and dual online learners. Curiosity is aroused when the memorized information cannot deal with the current state; the information gap between the dual learners is formulated as the intrinsic reward for the agent, and the state information is then consolidated into the dynamic memory. Compared with previous curiosity methods, DyMeCu can better mimic human curiosity with its dynamic memory, and the memory module grows dynamically based on a bootstrap paradigm with dual learners. Large-scale empirical experiments on multiple benchmarks, including DeepMind Control Suite and Atari Suite, demonstrate that DyMeCu outperforms competitive curiosity-based methods with or without extrinsic rewards. We will release the code to enhance reproducibility.

¹ National University of Defense Technology, Changsha, China
² Artificial Intelligence Research Center, DII, Beijing, China

kelele.xu@gmail.com

1 Introduction

Figure 1: DyMeCu employs a novel learning framework to build the intrinsic reward for RL, which consists of a dynamic memory and dual online learners. The information discrepancy between the current state and the information retrieved from the memory arouses curiosity. The curiosity-based intrinsic reward for agent learning is obtained by calculating the information gap between the dual online learners. The state information is then dynamically consolidated into the memory in the bootstrap paradigm, so that the curiosity fades.

Despite the success of reinforcement learning (RL) on sequential decision-making tasks Bellemare et al. (2013); Tesauro and others (1995); Mnih et al. (2015), many current methods struggle with sparse extrinsic rewards. To cope with the sparsity, curiosity provides a representative intrinsic reward that can encourage agents to explore new states. Designing algorithms that construct curiosity efficiently can be a key component of RL systems. Previous research has shown that intrinsic rewards can help alleviate the issues resulting from the lack of dense extrinsic rewards Liu and Abbeel (2021b); Tao et al. (2020); Yang et al. (2021).

For human learning, curiosity motivates people to seek and retain more information through exploration in the environment Burda et al. (2018); Ryan and Deci (2000); Smith and Gasser (2005). The process of arousing and satisfying curiosity can be summed up as one cycle: when a person encounters a problem, he/she will first try to solve it by retrieving information from memory. If retrieval from memory fails, he/she realizes that the currently memorized information is insufficient to solve the problem. A conscious awareness of this information discrepancy then sparks curiosity about the problem, and curiosity stimulates the search for new information. Once the information discrepancy is eliminated, people may have no further curiosity to learn more about the current problem until another problem is encountered Rotgans and Schmidt (2017); Silvia (2017). Human curiosity is constantly consolidated based on the dynamic memory, which consists of the stages of encoding, storing, and retrieving information Hayes et al. (2021). As the curiosity fades, additional information is consolidated into the memory. The consolidation results in the formation of new dynamic memories, which depends on the hippocampus O’Reilly and Rudy (2001).

Many attempts have been made to build curiosity in RL, which fall into two main categories: count-based and prediction-based. However, such curiosity is very different from human curiosity, and the problem is far from solved. Taking the Random Network Distillation (RND) Burda et al. (2019) method as an example, RND initializes a random fixed target network with state embeddings, and trains another prediction network to fit the output of the target network. A random fixed target network can be regarded as a random fixed memory, so that RND cannot retain contextual knowledge about the environment Yang et al. (2021). Without dynamically incorporating contextual information into memory, random features may not be sufficient to interpret dynamic environments. Therefore, this kind of curiosity is evaluated in a non-developmental way, which severely limits the performance of curiosity in RL.

Footnote 1: The term bootstrap is used in this text in its colloquial meaning rather than its statistical connotation.

In this work, to mimic human curiosity, we formalize and investigate a Dynamic Memory-based Curiosity mechanism, named DyMeCu. Inspired by the bootstrap paradigm Guo et al. (2020); Grill et al. (2020); Flennerhag et al. (2021), we construct dual online learners that learn the latent state in order to formulate a dynamic memory model (Figure 1). On the one hand, state information can be consolidated into the memory via the exponential moving average (EMA) Haynes et al. (2012); Klinker (2011); Grebenkov and Serror (2014) of the dual learners’ parameters. On the other hand, the bootstrap paradigm utilizes supervised signals from the memory to improve the dual learners’ encoding ability. Furthermore, curiosity is measured by the information gap between the dual learners, which is essentially an uncertainty estimate for a given state based on the dynamic memory Mai et al. (2022); Liu et al. (2020); Abdar et al. (2021).

In brief, our contributions in this paper are:

We first analyze the shortcomings of previous curiosity-based intrinsic reward methods and, based on information theory, propose to mimic human curiosity by leveraging a dynamic memory instead of a fixed one.
We propose a novel and practicable intrinsic reward method for RL agents, named DyMeCu (Dynamic Memory-based Curiosity), which consists of a dynamic memory and dual online learners, and can thus measure curiosity and consolidate information in a feasible way. Meanwhile, different strategies are explored to further improve the performance of DyMeCu.
On multiple benchmarks including DeepMind Control Suite (DMC) Tunyasuvunakool et al. (2020) and Atari Suite Bellemare et al. (2013), large-scale empirical experiments demonstrate that DyMeCu outperforms other competitive curiosity-based methods and pre-training strategies.

2 Related Work

2.1 Curiosity-Based Intrinsic Reward

In RL, exploration is a long-standing challenge. Previous attempts suggest that, if there is no additional reward, exploration can theoretically be regarded as a hunt for information, which can also be viewed as curiosity Berlyne (1950); Schmidhuber (1991); Kidd and Hayden (2015); de Abril and Kanai (2018); Jaegle et al. (2019); Friston et al. (2016); Peterson and Verstynen (2021). One intuitive formulation of curiosity is the family of count-based methods, in which a less visited state has more novelty for exploration. However, these methods cannot scale to large or continuous state spaces Kearns and Singh (2002); Charikar (2002). Inspired by count-based methods, RND calculates state novelty by distilling a random fixed network (target network) into another network (predictor network). The predictor network is trained to minimize the prediction error for each state, and the prediction error is taken as the intrinsic reward. Apart from count-based methods, prediction-based methods also show competitive or better performance by modeling the environment dynamics Pathak et al. (2017, 2019); Kim et al. (2020); Burda et al. (2018). Under the assumption that more frequently visited state-action pairs result in more accurate predictions, the intrinsic reward can be defined as the variance of the predictions of an ensemble or as the distance between predicted states and true states, as in the Disagreement method Pathak et al. (2019) and the ICM method Pathak et al. (2017). There have been few attempts to design a curiosity that contains a memory and effectively uses the information consolidated in it, which is the main goal of this paper.

2.2 Uncertainty Estimation

Our work is also related to uncertainty estimation, since uncertainty is crucial for allowing an RL agent to discern when to exploit and when to explore its environment Szepesvári (2009). Previous intrinsic rewards can also be interpreted from the perspective of uncertainty estimation: they evaluate curiosity by estimating the uncertainty (confidence) of a deep learning model. Taking Disagreement as an example, instead of comparing the prediction to the ground truth, it evaluates the uncertainty of multiple prediction models using a deep ensemble Dietterich (2000), at the cost of additional computation. RND also claims that the distillation error can be viewed as a quantification of uncertainty. Unlike RND, in our work we evaluate the uncertainty of given states through measuring the information gap between dual learners, which rely on a dynamic memory instead of a random fixed network.

Algorithm 1 Dynamic Memory-based Curiosity

Initialization: policy network $π_{ϕ}$; dual online learner networks $f_{θ_{1}}$, $f_{θ_{2}}$; memory network $M_{ω}$; coefficients of intrinsic and extrinsic reward $ζ$, $β$.
1: while Training do
2:   for $t = 1, \dots, T$ do
3:     Receive state $s_{t}$ from environment
4:     $a_{t} \leftarrow π_{ϕ}(a | s)$ based on policy network $π_{ϕ}$
5:     Take action $a_{t}$, receive state $s_{t+1}$ and extrinsic reward $r_{t}^{\mathrm{ext}}$ from environment
6:     Collect step data into replay buffer
7:     $s_{t} \leftarrow s_{t+1}$
8:   end for
9:   Sample batch data as $\{(s_{i}, a_{i}, r_{i}^{\mathrm{ext}}, s_{i+1})\}_{i=1}^{N}$ from replay buffer
10:   for each $i = 1, \dots, N$ do
11:     Generate latent state vectors $z_{i}^{θ_{1}} = f_{θ_{1}}(s_{i})$, $z_{i}^{θ_{2}} = f_{θ_{2}}(s_{i})$, $z_{i}^{ω} = M_{ω}(s_{i})$
12:     Calculate intrinsic reward $r_{i}^{\mathrm{int}} = \lVert z_{i}^{θ_{1}} - z_{i}^{θ_{2}} \rVert^{2}$
13:     Calculate total reward $r_{i}^{\mathrm{total}} = ζ r_{i}^{\mathrm{int}} + β r_{i}^{\mathrm{ext}}$
14:   end for
15:   Update $θ_{1}$ and $θ_{2}$ with sampled data by minimizing loss with equation (3)
16:   Update $ω$ with equation (7)
17:   Update $ϕ$ with sampled data by maximizing $r^{\mathrm{total}}$ using RL algorithm
18: end while

Figure 2: Performance of different methods in the fine-tuning phase on DeepMind Control Suite.

3 Methodology

In general, if an agent encounters a state with information value $E$ compared to its memory, then this state is worth exploring, and the information is worth consolidating into its memory dynamically Rotgans and Schmidt (2017); Silvia (2017). In detail, the concept of information value $E$ necessitates the formation of a dynamic memory $M$ and a way $g$ to consolidate information into the memory. For deep neural networks, the memory $M$ can be embedded in the latent space, and $g$ can be the function that maps a state $s$ into the memory Peterson and Verstynen (2021). This consolidation of information is denoted by:

$$g(s; M) \to M'. \tag{1}$$

With the memory $M$ which has been learned by $g$ over historical states, we can measure the information value $E$ of the next state $s_{t + 1}$ . According to the information theory Ishii et al. (2002); Reddy et al. (2016); Gray (2011) and the concepts proposed in Peterson and Verstynen (2021), the information value $E$ of a state should (1) only depend on the memory and what can be immediately learned (i.e., $M$ , $s$ and $g$ ); (2) be non-negative because $E$ is for exploring the environment; (3) decelerate in the finite time for the same state. Thus we define $E$ as:

$$E = \lVert g(s_{t+1}; M) - M \rVert. \tag{2}$$

From the definition of $E$, we obtain:

(1) If a state has been completely explored, or cannot be learned, then no more information gain can be added to the current memory, and $E = 0$. Such a state is no longer worth exploring.

(2) If $E > 0$, then the larger the value of $E$, the more information gain can be consolidated into the memory. In other words, the larger the value of $E$, the less aware the memory $M$ is of the current state, and the more the state is worth exploring. It is this information deficiency of the memory that sparks the curiosity of agents.

In this paper, we focus on how to obtain and leverage the information value $E$ for the agent’s exploration, and on the details of the information consolidation method $g$.
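As a minimal illustration of Eqs. (1) and (2), the following sketch uses a toy setting that is not taken from the paper: the memory is a plain vector, and the hypothetical consolidation function `consolidate` moves it a fixed fraction `rate` toward the state. Under these assumptions, $E$ is non-negative, it shrinks each time the same state is consolidated again, and it is zero once the memory already matches the state.

```python
import math

def consolidate(state, memory, rate=0.5):
    """Hypothetical g(s; M): move the memory a fraction `rate` toward the state."""
    return [m + rate * (s - m) for s, m in zip(state, memory)]

def information_value(state, memory, rate=0.5):
    """E = || g(s; M) - M ||: how much the state would change the memory."""
    updated = consolidate(state, memory, rate)
    return math.sqrt(sum((u - m) ** 2 for u, m in zip(updated, memory)))

memory = [0.0, 0.0]
state = [3.0, 4.0]
values = []
for _ in range(5):
    values.append(information_value(state, memory))
    memory = consolidate(state, memory)  # M <- g(s; M)

print(values)  # E for five consecutive visits to the same state
```

In this toy setting each visit halves $E$, which mirrors the curiosity-fading behavior described above; DyMeCu itself realizes $M$ and $g$ with neural networks and an EMA update rather than with this vector rule.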

3.1 Dynamic Memory-based Curiosity

In our framework, curiosity is aroused if the information in the current memory cannot handle the encountered state. We model the memory as a learnable neural network, but this creates a dilemma: we do not have a “benchmark” network in the parameter space that encodes the encountered state accurately, since not enough supervised signal is available. It is therefore difficult and unsound to define curiosity directly by comparing a randomly encoded state with the latent state output by a dynamic memory network. Instead, we introduce dual online learners for state encoding. The dual learners have the same network architecture as the memory network, but each has its own parameters. In their parameter space, the dual networks are supervised by the encodings of the memory. Our curiosity is then defined as the gap between the encodings of the same state output by the two learners’ networks. The intuition behind the dual learners is as follows: if the information of a state has been squeezed out by the memory, the memory fully knows and can resolve the state, and the dual learners, which can be seen as two imitators of the memory, can easily produce similar encodings of that state. Conversely, if a state is little known by the memory, the dual learners may produce quite different encodings of it, which represents a larger information value $E$ and thus stimulates the agent to explore this state. For uniformity, we refer to the encoded states and encodings as latent states, which reflect how the memory and learner networks perceive the states.

In our implementation, this latent gap $E$ sparks curiosity and serves as the intrinsic reward for the agent’s exploration. After the RL agent learns, the information is consolidated into the memory so that the memory keeps growing. The consolidation method $g$ is realized by updating the memory parameters via the exponential moving average (EMA) Haynes et al. (2012); Klinker (2011); Grebenkov and Serror (2014) of the dual learners’ parameters.

To summarize, the dual learners first learn from the memory network in order to measure the information value for exploration, and the memory network then consolidates the information gain from the dual learners in the EMA way. The memory is in effect dynamically seeking an appropriate position in the parameter space, so that its network can better characterize the memory of, and cognition about, the states seen in the environment. In short, our dynamic memory is updated in a bootstrap Grill et al. (2020) way. Figure 1 and Algorithm 1 present the whole framework and pseudo-code of DyMeCu.

Domain	Task	ICM	Disagreement	RND	APT	ProtoRL	SMM	DIAYN	APS	Ours
Walker	Flip	398 $\pm$ 18	407 $\pm$ 75	465 $\pm$ 62	477 $\pm$ 16	480 $\pm$ 23	505 $\pm$ 26	381 $\pm$ 17	461 $\pm$ 24	630 $\pm$ 37
Walker	Run	216 $\pm$ 35	291 $\pm$ 81	352 $\pm$ 29	344 $\pm$ 28	200 $\pm$ 15	430 $\pm$ 26	242 $\pm$ 11	257 $\pm$ 27	588 $\pm$ 25
Walker	Stand	928 $\pm$ 18	680 $\pm$ 107	901 $\pm$ 8	914 $\pm$ 8	870 $\pm$ 23	877 $\pm$ 34	860 $\pm$ 26	835 $\pm$ 64	965 $\pm$ 5
Walker	Walk	696 $\pm$ 162	595 $\pm$ 153	814 $\pm$ 116	759 $\pm$ 35	777 $\pm$ 33	821 $\pm$ 36	661 $\pm$ 26	711 $\pm$ 68	934 $\pm$ 16
Walker	Average Performance	560 $\pm$ 59	494 $\pm$ 104	633 $\pm$ 54	624 $\pm$ 22	582 $\pm$ 24	659 $\pm$ 31	536 $\pm$ 20	566 $\pm$ 46	780 $\pm$ 21
Quadruped	Jump	112 $\pm$ 4	383 $\pm$ 265	580 $\pm$ 72	462 $\pm$ 48	425 $\pm$ 63	298 $\pm$ 39	578 $\pm$ 46	529 $\pm$ 42	694 $\pm$ 15
Quadruped	Run	91 $\pm$ 29	389 $\pm$ 61	385 $\pm$ 47	339 $\pm$ 40	316 $\pm$ 36	220 $\pm$ 37	415 $\pm$ 28	465 $\pm$ 37	479 $\pm$ 6
Quadruped	Stand	184 $\pm$ 100	628 $\pm$ 114	800 $\pm$ 54	622 $\pm$ 57	560 $\pm$ 71	367 $\pm$ 42	706 $\pm$ 48	714 $\pm$ 50	921 $\pm$ 14
Quadruped	Walk	99 $\pm$ 46	384 $\pm$ 28	392 $\pm$ 39	434 $\pm$ 64	403 $\pm$ 91	184 $\pm$ 26	406 $\pm$ 64	602 $\pm$ 86	833 $\pm$ 44
Quadruped	Average Performance	122 $\pm$ 45	446 $\pm$ 117	540 $\pm$ 53	465 $\pm$ 52	426 $\pm$ 66	268 $\pm$ 36	527 $\pm$ 47	578 $\pm$ 43	732 $\pm$ 20
Jaco	Reach bottom left	102 $\pm$ 47	117 $\pm$ 17	103 $\pm$ 17	88 $\pm$ 12	121 $\pm$ 22	40 $\pm$ 9	17 $\pm$ 5	96 $\pm$ 13	155 $\pm$ 13
Jaco	Reach bottom right	75 $\pm$ 27	142 $\pm$ 3	101 $\pm$ 26	115 $\pm$ 12	113 $\pm$ 16	50 $\pm$ 9	31 $\pm$ 4	93 $\pm$ 9	146 $\pm$ 16
Jaco	Reach top left	105 $\pm$ 29	121 $\pm$ 17	146 $\pm$ 46	112 $\pm$ 11	124 $\pm$ 20	50 $\pm$ 7	11 $\pm$ 3	65 $\pm$ 10	166 $\pm$ 14
Jaco	Reach top right	93 $\pm$ 19	131 $\pm$ 10	99 $\pm$ 25	136 $\pm$ 5	135 $\pm$ 19	37 $\pm$ 8	19 $\pm$ 4	81 $\pm$ 11	152 $\pm$ 4
Jaco	Average Performance	94 $\pm$ 31	128 $\pm$ 12	113 $\pm$ 29	113 $\pm$ 10	124 $\pm$ 20	44 $\pm$ 9	20 $\pm$ 4	84 $\pm$ 11	155 $\pm$ 12

Table 1: Performance comparison with different pre-training methods on DeepMind Control Suite. The best results are in bold font in each task, and the second best results are underlined.

Learning of Dual Learners:

Dual online learner models $f_{θ_{1}}$ and $f_{θ_{2}}$ are defined by a set of weights $θ_{1}$ and $θ_{2}$ with the same architecture as the memory network $M_{ω}$ . The memory provides the regression targets for the learning of dual learners $f_{θ_{1}}$ and $f_{θ_{2}}$ . Given a current state $s_{t}$ , the learners transform it into the latent states $z_{t}^{θ_{1}} ≜ f_{θ_{1}} (s_{t})$ and $z_{t}^{θ_{2}} ≜ f_{θ_{2}} (s_{t})$ respectively, and the memory network outputs $z_{t}^{ω} ≜ M_{ω} (s_{t})$ . The mean squared error (MSE) between them is:

$$\begin{cases} L_{θ_{1}} ≜ \lVert z_{t}^{θ_{1}} - z_{t}^{ω} \rVert^{2}, \\ L_{θ_{2}} ≜ \lVert z_{t}^{θ_{2}} - z_{t}^{ω} \rVert^{2}. \end{cases} \tag{3}$$

Based on $L_{θ_{1}}$ and $L_{θ_{2}}$, the dual learners are updated as:

$$\begin{cases} θ_{1} \leftarrow \mathrm{optim}(θ_{1}, \nabla_{θ_{1}} L_{θ_{1}}, η), \\ θ_{2} \leftarrow \mathrm{optim}(θ_{2}, \nabla_{θ_{2}} L_{θ_{2}}, η), \end{cases} \tag{4}$$

where $optim$ and $η$ represent the optimizer and learning rate.
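The learner update of Eqs. (3) and (4) can be sketched as follows. This is only an illustration under simplifying assumptions that are not from the paper: each network is replaced by a small linear map (`encode`), the optimizer is plain gradient descent, and all weights and the state are made-up values. The memory latent is treated as a fixed regression target.

```python
def encode(weights, state):
    """Linear stand-in for a learner or memory network: z = W s."""
    return [sum(w * s for w, s in zip(row, state)) for row in weights]

def mse_to_memory(weights, memory_weights, state):
    """Loss of Eq. (3): squared L2 distance between learner and memory latents."""
    z = encode(weights, state)
    z_mem = encode(memory_weights, state)
    return sum((a - b) ** 2 for a, b in zip(z, z_mem))

def sgd_step(weights, memory_weights, state, lr):
    """One plain gradient-descent update in the spirit of Eq. (4)."""
    z = encode(weights, state)
    z_mem = encode(memory_weights, state)
    # Gradient of ||W s - z_mem||^2 with respect to W is 2 (W s - z_mem) s^T.
    return [
        [w - lr * 2.0 * (zi - ti) * s for w, s in zip(row, state)]
        for row, zi, ti in zip(weights, z, z_mem)
    ]

memory_w = [[0.5, -0.2], [0.1, 0.3]]   # illustrative memory parameters (omega)
theta_1 = [[1.0, 0.0], [0.0, 1.0]]     # illustrative learner 1 parameters
theta_2 = [[-1.0, 0.5], [0.2, -0.4]]   # illustrative learner 2 parameters
state = [1.0, 2.0]

loss_before = mse_to_memory(theta_1, memory_w, state)
theta_1 = sgd_step(theta_1, memory_w, state, lr=0.05)
theta_2 = sgd_step(theta_2, memory_w, state, lr=0.05)  # learner 2 uses its own loss
loss_after = mse_to_memory(theta_1, memory_w, state)
print(loss_before, loss_after)
```

Each learner is updated with the gradient of its own loss, so both are pulled toward the memory’s encoding of the state while starting from different parameters.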

Intrinsic Reward based on Curiosity:

The curiosity relies on the information value of current state. In our method, such information value can be measured by the information gap between dual learners. This information gap can also be considered following the $δ$ -Progress Achiam and Sastry (2017); Graves et al. (2017) to form the curiosity. We obtain the intrinsic reward to agents based on the curiosity from the information value:

$$r_{t}^{\mathrm{int}} = \lVert (z_{t}^{θ_{1}} - z_{t}^{ω}) - (z_{t}^{θ_{2}} - z_{t}^{ω}) \rVert^{2} = \lVert z_{t}^{θ_{1}} - z_{t}^{θ_{2}} \rVert^{2}. \tag{5}$$

From another point of view, the dual-learner mechanism can be regarded as a variant of ensembling Mai et al. (2022) for uncertainty estimation. Compared with previous attempts that require heavy ensembling (such as Disagreement), our lightweight solution can provide better performance while retaining computational efficiency.
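A small numerical check of Eq. (5) is sketched below, using made-up latent vectors (the values are illustrative only). It shows that the memory latent cancels, so the reward depends only on the two learners’ outputs.

```python
def sq_dist(a, b):
    """Squared L2 distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def intrinsic_reward(z1, z2):
    """Right-hand side of Eq. (5): squared gap between the dual learners."""
    return sq_dist(z1, z2)

# Illustrative latents for one state (not taken from the paper).
z1 = [0.2, 1.0, -0.5]     # learner 1
z2 = [0.0, 0.7, -0.1]     # learner 2
z_mem = [0.1, 0.9, -0.3]  # memory

# Left-hand form of Eq. (5): difference of the two learner-to-memory gaps.
gap_1 = [a - m for a, m in zip(z1, z_mem)]
gap_2 = [b - m for b, m in zip(z2, z_mem)]
via_memory = sq_dist(gap_1, gap_2)

direct = intrinsic_reward(z1, z2)
print(via_memory, direct)
```

By the triangle inequality, $\lVert z_{t}^{θ_{1}} - z_{t}^{θ_{2}} \rVert \le \lVert z_{t}^{θ_{1}} - z_{t}^{ω} \rVert + \lVert z_{t}^{θ_{2}} - z_{t}^{ω} \rVert$, so the reward necessarily becomes small once both learners fit the memory well on a state.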

Overall, we can get the optimization goal for the agent:

$$\max_{ϕ} \; \mathbb{E}_{π_{ϕ}(s_{t})} \Big[ \sum_{t} γ^{t} (ζ r_{t}^{\mathrm{int}} + β r_{t}^{\mathrm{ext}}) \Big], \tag{6}$$

where $γ$ is the discount factor and $ϕ$ represents parameters of policy $π$ ; $ζ$ and $β$ are the coefficients of the intrinsic reward and extrinsic reward respectively.
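The quantity inside the expectation of Eq. (6) can be computed for a single trajectory as in the sketch below. The reward sequences and the discount factor are made-up values for illustration, and the time index starts at 0 here; the default coefficients $ζ = 1$ and $β = 2$ are the ones reported for the Atari experiments.

```python
def total_reward(r_int, r_ext, zeta=1.0, beta=2.0):
    """Mixed reward zeta * r_int + beta * r_ext for one step."""
    return zeta * r_int + beta * r_ext

def discounted_return(rewards, gamma):
    """Sum over t of gamma^t * rewards[t], accumulated backwards."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

# Illustrative three-step trajectory (not from the paper).
r_int = [0.5, 0.2, 0.1]
r_ext = [0.0, 0.0, 1.0]

mixed = [total_reward(i, e) for i, e in zip(r_int, r_ext)]
ret = discounted_return(mixed, gamma=0.5)
print(mixed, ret)
```

Setting `beta=0.0` gives the intrinsic-only objective, and `zeta=0.0` gives the extrinsic-only one.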

Consolidating Information into Memory:

The memory model is updated in an EMA way for the sake of both its stability with respect to old state information and its plasticity with respect to new state information. In other words, the memory grows dynamically while taking the contextual environment information into account. Specifically, given a decay rate $α \in [0, 1]$, the memory $M_{ω}$ is updated after each training step as:

$$ω \leftarrow α ω + (1 - α) \frac{θ_{1} + θ_{2}}{2}. \tag{7}$$
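Eq. (7) amounts to an element-wise update of the memory parameters. The sketch below assumes, purely for illustration, that the parameters of each network are flattened into a list of floats; the numerical values are made up.

```python
def ema_update(omega, theta_1, theta_2, alpha=0.99):
    """Eq. (7): blend the memory with the average of the dual learners."""
    return [
        alpha * w + (1.0 - alpha) * 0.5 * (a + b)
        for w, a, b in zip(omega, theta_1, theta_2)
    ]

# Illustrative flattened parameters (not from the paper).
omega = [1.0, 0.0]
theta_1 = [0.0, 2.0]
theta_2 = [2.0, 4.0]

new_omega = ema_update(omega, theta_1, theta_2, alpha=0.9)
print(new_omega)
```

With `alpha=1.0` the memory never changes, which corresponds to a fixed memory; with `alpha=0.0` it is overwritten by the learners’ average at every step.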

Intuitions on DyMeCu’s behavior:

The dynamic memory-based curiosity is closer to the human curiosity mechanism. It is the cognitive difference with respect to the memory that stimulates our curiosity to explore the world, after which we consolidate the newly acquired information into the memory dynamically. In addition, from the knowledge distillation view, such a memory can also be regarded as the teacher model in the Mean Teacher-based approach Tarvainen and Valpola (2017). The memory is essentially a self-ensemble of the intermediate models of the learners. The paradigm we propose is one type of replay mechanism, which is thought to play an important role in memory formation, retrieval, and consolidation Hayes et al. (2021). Moreover, we consider that our way of forming the memory could also be used in continual learning to address the issue of catastrophic forgetting Arani et al. (2021).
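The self-ensemble view can be made explicit by unrolling the EMA update of the memory. The derivation below introduces notation that is not in the original text: $ω_{0}$ is the initial memory, and $\bar{θ}_{j}$ is the average of the two learners’ parameters after training step $j$.

```latex
\bar{\theta}_{j} = \frac{\theta_{1,j} + \theta_{2,j}}{2}, \qquad
\omega_{k} = \alpha\,\omega_{k-1} + (1-\alpha)\,\bar{\theta}_{k}
\;\Longrightarrow\;
\omega_{k} = \alpha^{k}\,\omega_{0} + (1-\alpha)\sum_{j=1}^{k} \alpha^{\,k-j}\,\bar{\theta}_{j},
\qquad
\alpha^{k} + (1-\alpha)\sum_{j=1}^{k}\alpha^{\,k-j} = 1 .
```

The coefficients are non-negative and sum to one, so after $k$ steps the memory is a weighted average of its initialization and all intermediate learner averages, with geometrically smaller weight on older ones.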

Figure 3: Performance comparison on Atari games subsets using both intrinsic rewards and extrinsic rewards.

4 Experiments and Analysis

4.1 Experimental Settings

We evaluate our method in both pre-training and traditional RL situations utilizing two widely used benchmarks: DeepMind Control Suite (DMC) Tunyasuvunakool et al. (2020) and Atari Suite Bellemare et al. (2013). We follow the RND Burda et al. (2019) experimental settings for Atari Suite and settings of URLB Laskin et al. (2021) which is the pre-training benchmark for DeepMind Control Suite. We apply PPO algorithm Schulman et al. (2017) to train the agent. The hyper-parameter $α$ was set as 0.99 in all experiments, and the implementation details and hyper-parameters can be found in the appendix.

4.2 DeepMind Control Suite

Many well-performing approaches like URLB Laskin et al. (2021) use the pre-training and fine-tuning paradigm to improve sample efficiency for RL, especially in the experiment benchmark like DMC containing various domains and complex tasks. We evaluate DyMeCu on all three domains of DMC, namely Walker, Quadruped, and Jaco Arm (from easiest to hardest), and each of them has four tasks. During the pre-training phase, the agents are trained for 2 million steps with only intrinsic rewards produced by the curiosity. During the fine-tuning phase, the agents are trained for 100k steps with only extrinsic rewards.

Table 1 reports the final scores and standard deviations of DyMeCu and other competitive methods. We compare DyMeCu with both intrinsic reward-based methods (ICM, Disagreement, RND, APT Liu and Abbeel (2021b)) and other pre-training strategies (ProtoRL Yarats et al. (2021), SMM Lee et al. (2019), DIAYN Eysenbach et al. (2018), APS Liu and Abbeel (2021a)). DyMeCu improves average performance by 18.3%, 26.6%, and 21.0% on these three domains respectively. From the quantitative results, we can see that DyMeCu achieves the new state-of-the-art across all 12 tasks, demonstrating its ability to improve model performance and robustness through the pre-training paradigm. Figure 2 plots 6 learning curves (fine-tuning phase) of DyMeCu and three competitive curiosity-based methods; all learning curves can be found in the appendix. DyMeCu converges faster than the other methods, and its converged performance also surpasses theirs significantly. DyMeCu’s speed increase may be mainly due to the contextual state information being consolidated into memory dynamically, rather than relying on a random fixed setting as in RND. Based on the dynamic memory, the exploration of agents can be much more efficient.

4.3 Atari Suite

For the Atari Suite, we first record the performance of agents trained with both intrinsic and extrinsic rewards. The experiments run for 100M steps (equivalent to 400M frames), and the intrinsic and extrinsic reward coefficients are set to $ζ = 1$ and $β = 2$ respectively for all methods, following the setup of previous curiosity-based methods. Table 2 lists the aggregate metrics and scores of three methods trained with both intrinsic and extrinsic rewards on the 26 Atari games. Human and random scores are adopted from Hessel et al. (2018). As done in previous works Liu and Abbeel (2021b); Yarats et al. (2020); Schwarzer et al. (2020), we normalize the episode reward as human-normalized scores (HNS) using expert human scores to account for the different score scales of the games. #SOTA denotes the number of games on which the current method exceeds the other methods, and mean HNS is calculated as the average of $(\text{agent score} - \text{random score}) / (\text{human score} - \text{random score})$ over all games. In Table 2, DyMeCu shows its superiority over Disagreement and ICM with the highest mean HNS and #SOTA.
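The human-normalized score can be computed as in the short sketch below. The function names are hypothetical; the example inputs are the Alien and Boxing rows of Table 2 (random, human, and DyMeCu scores).

```python
def human_normalized_score(agent_score, random_score, human_score):
    """HNS = (agent - random) / (human - random) for a single game."""
    return (agent_score - random_score) / (human_score - random_score)

def mean_hns(rows):
    """Average HNS over (agent, random, human) score triples."""
    scores = [human_normalized_score(a, r, h) for a, r, h in rows]
    return sum(scores) / len(scores)

# Scores taken from the Alien and Boxing rows of Table 2 (Ours column).
alien = human_normalized_score(agent_score=2589.2, random_score=227.8, human_score=7127.7)
boxing = human_normalized_score(agent_score=99.6, random_score=0.1, human_score=12.1)
two_game_mean = mean_hns([(2589.2, 227.8, 7127.7), (99.6, 0.1, 12.1)])
print(alien, boxing, two_game_mean)
```

An HNS of 0 corresponds to random play and 1 to the human reference, so games where the agent far exceeds the human score (such as Boxing here) can dominate the mean.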

Game	Random	Human	ICM	Disagreement	Ours
Alien	227.8	7127.7	1524.7	1304.7	2589.2
Amidar	5.8	1719.5	763.0	506.6	470.1
Assault	222.4	742.0	1365.5	1544.6	4539.3
Asterix	210.0	8503.3	2103.4	1616.2	4576
Bank Heist	14.2	753.1	1359.4	1343.4	1529.5
BattleZone	2360.0	37187.5	51459.1	65387.4	58220.0
Boxing	0.1	12.1	98.9	99.3	99.6
Breakout	1.7	30.5	247.6	177.8	119.7
ChopperCommand	811.0	7387.8	9456.5	10286.9	9521.0
Crazy Climber	10780.5	23829.4	135003.3	132614.0	106682.0
Demon Attack	107805.0	35829.4	4679.2	6606.0	8417.0
Freeway	0.0	29.6	33.8	33.9	30.7
Frostbite	65.2	4334.7	309.4	295.1	1750.0
Gopher	257.6	2412.5	14619.4	14202.4	9750.3
Hero	1027.0	30826.4	13482.2	13488.0	12728.5
Jamesbond	29.0	302.8	680.8	726.5	5052.5
Kangaroo	52.0	3035.0	12922.7	14621.8	10760.0
Krull	1598.0	2665.6	10027.1	11402.7	6447.0
Kung Fu Master	258.5	22736.3	40157.7	32607.2	44604.9
Ms Pacman	307.3	6951.6	2787.0	6287.8	2752.4
Pong	-20.7	14.6	20.1	20.6	15.3
Private Eye	24.9	69571.3	96.0	98.0	100.0
Qbert	163.9	13455.0	16388.9	22474.5	14770.2
Road Runner	11.5	7845.0	56273.7	55359.3	32271.0
Seaquest	68.4	42054.7	16178.1	2733.1	3910.9
Up N Down	533.4	11693.2	46152.9	18235.5	18067.6
Mean HNS	0.0	1.0	2.861	2.767	3.019
#SOTA	N/A	N/A	7	9	10

Table 2: Performance comparison of curiosity-based methods using both intrinsic and extrinsic rewards on the 26-game Atari subset. The bold font indicates the best value.

Figure 3 displays the learning curves using both intrinsic and extrinsic rewards. We compare DyMeCu with three widely used baselines, Disagreement, ICM, and RND, on 6 randomly chosen Atari games. DyMeCu shows evident advantages in both performance and learning speed in most games. For example, on Jamesbond, the converged reward of DyMeCu is more than three times that of the other methods. Moreover, we also compare the performance of agents trained with only intrinsic rewards. As shown in Figure 4, DyMeCu outperforms the Disagreement baseline in all 6 environments, and outperforms each of the ICM and RND baselines in 4 environments. Overall, the results on Atari Suite show that DyMeCu outperforms other curiosity-based methods, demonstrating its ability to generate more accurate intrinsic rewards and provide more useful information for better exploration.

4.4 Further Analysis on DyMeCu

Further analyses, including ablation studies on DyMeCu, are presented to give an intuition of its behavior and performance. We run the experiments across 3 random seeds, and all of the following experiments run for 50M steps (equivalent to 200M frames).

Figure 5: Performance comparison among two kinds of deployments and baselines with both intrinsic and extrinsic rewards.

Figure 4: Performance comparison on Atari games suite subsets using only intrinsic rewards.

Dual learners:

Here we explore designing the curiosity under a naive setting, that is, using a single encoding network to learn to encode the latent space, so that the curiosity-based intrinsic reward can be defined as the gap to the memory network:

$$r_{t}^{\mathrm{int}} = \lVert z_{t}^{θ} - z_{t}^{ω} \rVert^{2}. \tag{8}$$

The memory is updated with $ω \leftarrow α ω + (1 - α) θ$. As shown in Figure 5, the one-learner mechanism does not show significant advantages over other methods, whereas our dual-learner mechanism performs much better, with more accurate curiosity and corresponding intrinsic rewards.

Method \ Game	Alien	Kangaroo	MsPacman	Ave.
Disagreement	316.6	514.0	291.0	373.9
ICM	374.2	557.0	412.7	447.9
RND	206.1	412.0	607.2	408.4
DyMeCu (ours)	492.0	739.0	602.4	611.1
DyMeCu_update with one learner	521.7	782.0	500.6	601.4
DyMeCu_with additional module	390.6	645.2	644.4	560.1

Table 3: Performance comparison of baselines and DyMeCu under different settings with only intrinsic rewards. The results represent the average episode reward at the end of training. The Ave. in the last column shows the average result among the three tasks.

Update of memory network:

The memory network in DyMeCu is updated with the dual learners. We additionally evaluate the performance of DyMeCu when the memory is updated using the parameters of only one of the learners. The results in Table 3 indicate that either learner can consolidate state information into the memory well. Combined with Figure 5, this shows that it is useful and necessary to assign and train dual learners, after which the memory can be updated with either both learners or one of them, with the dual-learner update showing slightly better performance.

Structure of learners:

The bootstrap idea has been explored and used in previous research. The work most similar to ours is BYOL Grill et al. (2020), which uses the bootstrap method for self-supervised learning in computer vision. Grill et al. add a predictor module to the online network and compare the output of the predictor with that of the target network, which is the key to generating well-performing representations Chen and He (2021). Similarly, in this ablation study, we design a controlled trial in which two additional convolution layers are added to each of the dual learners. Table 3 shows that such a learnable additional module does not lead to a significant improvement. Our analysis is that, unlike previous work using the bootstrap method, we aim to generate the intrinsic reward by calculating the information value (i.e., the information gap between the dual learners) as accurately as possible, rather than to obtain better representations for downstream tasks.

Robustness to hyper-parameter $α$ :

Figure 6: Performance comparison with different values of $α$ with only intrinsic rewards.

One concern is the updating speed of the memory network in the EMA way, that is, how much and how fast new environment information is accepted and consolidated. Therefore, to further analyze the effect of the hyper-parameter $α$, we evaluate DyMeCu with different values of $α$ in a reasonable interval, and we assess the agents’ performance in three different Atari games: Alien, Kangaroo, and Krull. For a more direct and visual comparison, we normalize the episode reward of DyMeCu as baseline-normalized scores (BNS), calculated as the average of $(\text{DyMeCu score} - \text{random score}) / (\text{baseline score} - \text{random score})$, where the baseline score is the average score of the baselines. As illustrated in Figure 6, all values of the hyper-parameter $α$ between 0.99 and 0.9999 yield satisfactory performance, generally greater than twice that of the baseline average. DyMeCu shows acceptable robustness to the updating hyper-parameter.
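To give a rough sense of what this range of $α$ means for the memory’s updating speed, the sketch below computes an EMA half-life. It relies only on the form of the EMA update: each further update scales the weight of everything already in the memory by $α$, so after $k$ updates that weight is multiplied by $α^{k}$. The function name is hypothetical, and the intermediate value 0.999 is an assumed example rather than a setting reported in the text.

```python
import math

def ema_half_life(alpha):
    """Number of updates k such that alpha ** k == 0.5."""
    return math.log(0.5) / math.log(alpha)

# End points of the reported interval plus one assumed intermediate value.
half_lives = {alpha: ema_half_life(alpha) for alpha in (0.99, 0.999, 0.9999)}
for alpha, k in half_lives.items():
    print(alpha, round(k))
```

Across the evaluated interval, the number of updates needed to halve the influence of older information therefore spans roughly two orders of magnitude.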

5 Conclusion

To address the challenge of extrinsic rewards sparsity in RL, we propose DyMeCu to mimic human curiosity in this paper. Specifically, DyMeCu consists of a dynamic memory and dual online learners. The information gap between dual learners sparks the agent’s curiosity and then formulates the intrinsic reward, and the state information can then be consolidated into the dynamic memory. Large-scale empirical experiments are conducted on multiple benchmarks, and the experimental results show that DyMeCu outperforms competing curiosity-based methods under different settings.

References

M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya, et al. (2021) A review of uncertainty quantification in deep learning: techniques, applications and challenges. Information Fusion 76, pp. 243–297. Cited by: §1.
J. Achiam and S. Sastry (2017) Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732. Cited by: §3.1.
E. Arani, F. Sarfraz, and B. Zonooz (2021) Learning fast, learning slow: a general continual learning method based on complementary learning system. In International Conference on Learning Representations, Cited by: §3.1.
M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: 3rd item, §1, §4.1.
D. E. Berlyne (1950) Novelty and curiosity as determinants of exploratory behaviour. British journal of psychology 41 (1), pp. 68. Cited by: §2.1.
Y. Burda, H. Edwards, D. Pathak, A. Storkey, T. Darrell, and A. A. Efros (2018) Large-scale study of curiosity-driven learning. In International Conference on Learning Representations, Cited by: §1, §2.1.
Y. Burda, H. Edwards, A. Storkey, and O. Klimov (2019) Exploration by random network distillation. In International Conference on Learning Representations, Cited by: §1, §4.1.
M. S. Charikar (2002) Similarity estimation techniques from rounding algorithms. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pp. 380–388. Cited by: §2.1.
X. Chen and K. He (2021) Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758. Cited by: §4.4.
I. M. de Abril and R. Kanai (2018) Curiosity-driven reinforcement learning with homeostatic regulation. In 2018 international joint conference on neural networks (ijcnn), pp. 1–6. Cited by: §2.1.
T. G. Dietterich (2000) Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Cited by: §2.2.
B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine (2018) Diversity is all you need: learning skills without a reward function. arXiv preprint arXiv:1802.06070. Cited by: §4.2.
S. Flennerhag, Y. Schroecker, T. Zahavy, H. van Hasselt, D. Silver, and S. Singh (2021) Bootstrapped meta-learning. arXiv preprint arXiv:2109.04504. Cited by: §1.
K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, G. Pezzulo, et al. (2016) Active inference and learning. Neuroscience & Biobehavioral Reviews 68, pp. 862–879. Cited by: §2.1.
A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1311–1320. Cited by: §3.1.
R. M. Gray (2011) Entropy and information theory. Springer Science & Business Media. Cited by: §3.
D. S. Grebenkov and J. Serror (2014) Following a trend with an exponential moving average: analytical results for a gaussian model. Physica A: Statistical Mechanics and its Applications 394, pp. 288–303. Cited by: §1, §3.1.
J. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al. (2020) Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, pp. 21271–21284. Cited by: §1, §3.1, §4.4.
Z. D. Guo, B. A. Pires, B. Piot, J. Grill, F. Altché, R. Munos, and M. G. Azar (2020) Bootstrap latent-predictive representations for multitask reinforcement learning. In International Conference on Machine Learning, pp. 3875–3886. Cited by: §1.
T. L. Hayes, G. P. Krishnan, M. Bazhenov, H. T. Siegelmann, T. J. Sejnowski, and C. Kanan (2021) Replay in deep learning: current approaches and missing biological elements. Neural Computation 33 (11), pp. 2908–2950. Cited by: §1, §3.1.
D. Haynes, S. Corns, and G. K. Venayagamoorthy (2012) An exponential moving average algorithm. In 2012 IEEE Congress on Evolutionary Computation, pp. 1–8. Cited by: §1, §3.1.
M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Thirty-second AAAI conference on artificial intelligence, Cited by: §4.3.
S. Ishii, W. Yoshida, and J. Yoshimoto (2002) Control of exploitation–exploration meta-parameter in reinforcement learning. Neural networks 15 (4-6), pp. 665–687. Cited by: §3.
A. Jaegle, V. Mehrpour, and N. Rust (2019) Visual novelty, curiosity, and intrinsic reward in machine learning and the brain. Current opinion in neurobiology 58, pp. 167–174. Cited by: §2.1.
M. Kearns and S. Singh (2002) Near-optimal reinforcement learning in polynomial time. Machine learning 49 (2), pp. 209–232. Cited by: §2.1.
C. Kidd and B. Y. Hayden (2015) The psychology and neuroscience of curiosity. Neuron 88 (3), pp. 449–460. Cited by: §2.1.
K. Kim, M. Sano, J. De Freitas, N. Haber, and D. Yamins (2020) Active world model learning with progress curiosity. In International conference on machine learning, pp. 5306–5315. Cited by: §2.1.
F. Klinker (2011) Exponential moving average versus moving exponential average. Mathematische Semesterberichte 58 (1), pp. 97–107. Cited by: §1, §3.1.
M. Laskin, D. Yarats, H. Liu, K. Lee, A. Zhan, K. Lu, C. Cang, L. Pinto, and P. Abbeel (2021) URLB: unsupervised reinforcement learning benchmark. In Deep RL Workshop NeurIPS 2021, Cited by: §4.1, §4.2.
L. Lee, B. Eysenbach, E. Parisotto, E. Xing, S. Levine, and R. Salakhutdinov (2019) Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274. Cited by: §4.2.
H. Liu and P. Abbeel (2021a) Aps: active pretraining with successor features. In International Conference on Machine Learning, pp. 6736–6747. Cited by: §4.2.
H. Liu and P. Abbeel (2021b) Behavior from the void: unsupervised active pre-training. Advances in Neural Information Processing Systems 34. Cited by: §1, §4.2, §4.3.
J. Liu, Z. Lin, S. Padhy, D. Tran, T. Bedrax Weiss, and B. Lakshminarayanan (2020) Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. Advances in Neural Information Processing Systems 33, pp. 7498–7512. Cited by: §1.
V. Mai, K. Mani, and L. Paull (2022) Sample efficient deep reinforcement learning via uncertainty estimation. arXiv preprint arXiv:2201.01666. Cited by: §1, §3.1.
V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. nature 518 (7540), pp. 529–533. Cited by: §1.
R. C. O’Reilly and J. W. Rudy (2001) Conjunctive representations in learning and memory: principles of cortical and hippocampal function. Psychological review 108 (2), pp. 311. Cited by: §1.
D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pp. 2778–2787. Cited by: §2.1.
D. Pathak, D. Gandhi, and A. Gupta (2019) Self-supervised exploration via disagreement. In International Conference on Machine Learning, pp. 5062–5071. Cited by: §2.1.
E. J. Peterson and T. D. Verstynen (2021) Curiosity eliminates the exploration-exploitation dilemma. bioRxiv, pp. 671362. Cited by: §2.1, §3, §3.
G. Reddy, A. Celani, and M. Vergassola (2016) Infomax strategies for an optimal balance between exploration and exploitation. Journal of Statistical Physics 163 (6), pp. 1454–1476. Cited by: §3.
J. I. Rotgans and H. G. Schmidt (2017) The role of interest in learning: knowledge acquisition at the intersection of situational and individual interest. In The science of interest, pp. 69–93. Cited by: §1, §3.
R. M. Ryan and E. L. Deci (2000) Intrinsic and extrinsic motivations: classic definitions and new directions. Contemporary educational psychology 25 (1), pp. 54–67. Cited by: §1.
J. Schmidhuber (1991) A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pp. 222–227. Cited by: §2.1.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §4.1.
M. Schwarzer, A. Anand, R. Goel, R. D. Hjelm, A. Courville, and P. Bachman (2020) Data-efficient reinforcement learning with self-predictive representations. In International Conference on Learning Representations, Cited by: §4.3.
P. J. Silvia (2017) Curiosity. In The science of interest, pp. 97–107. Cited by: §1, §3.
L. Smith and M. Gasser (2005) The development of embodied cognition: six lessons from babies. Artificial life 11 (1-2), pp. 13–29. Cited by: §1.
C. Szepesvári (2009) Synthesis lectures on artificial intelligence and machine learning. Synthesis lectures on artificial intelligence and machine learning. Cited by: §2.2.
R. Y. Tao, V. François-Lavet, and J. Pineau (2020) Novelty search in representational space for sample efficient exploration. Advances in Neural Information Processing Systems 33, pp. 8114–8126. Cited by: §1.
A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30. Cited by: §3.1.
G. Tesauro et al. (1995) Temporal difference learning and td-gammon. Communications of the ACM 38 (3), pp. 58–68. Cited by: §1.
S. Tunyasuvunakool, A. Muldal, Y. Doron, S. Liu, S. Bohez, J. Merel, T. Erez, T. Lillicrap, N. Heess, and Y. Tassa (2020) Dm_control: software and tasks for continuous control. Software Impacts 6, pp. 100022. Cited by: 3rd item, §4.1.
T. Yang, H. Tang, C. Bai, J. Liu, J. Hao, Z. Meng, and P. Liu (2021) Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668. Cited by: §1, §1.
D. Yarats, R. Fergus, A. Lazaric, and L. Pinto (2021) Reinforcement learning with prototypical representations. In International Conference on Machine Learning, pp. 11920–11931. Cited by: §4.2.
D. Yarats, I. Kostrikov, and R. Fergus (2020) Image augmentation is all you need: regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations, Cited by: §4.3.