Dynamic Memory-based Curiosity: A Bootstrap Approach for Exploration
Abstract
The sparsity of extrinsic rewards poses a serious challenge for reinforcement learning (RL). Much recent work has studied curiosity, which can provide a representative intrinsic reward for effective exploration, but the challenge is still far from solved. In this paper, we present a novel curiosity mechanism for RL, named DyMeCu, which stands for Dynamic Memory-based Curiosity. Inspired by human curiosity and information theory, DyMeCu consists of a dynamic memory and dual online learners. Curiosity is aroused when the memorized information cannot deal with the current state; the information gap between the dual learners is formulated as the intrinsic reward for the agent, and the state information is then consolidated into the dynamic memory. Compared with previous curiosity methods, DyMeCu better mimics human curiosity through its dynamic memory, and the memory module grows dynamically based on a bootstrap paradigm with dual learners. Large-scale empirical experiments on multiple benchmarks, including DeepMind Control Suite and Atari Suite, demonstrate that DyMeCu outperforms competitive curiosity-based methods with or without extrinsic rewards. We will release the code to enhance reproducibility.
1 National University of Defense Technology, Changsha, China
2 Artificial Intelligence Research Center, DII, Beijing, China
kelele.xu@gmail.com
1 Introduction
Despite the success of reinforcement learning (RL) on sequential decision-making tasks Bellemare et al. (2013); Tesauro and others (1995); Mnih et al. (2015), many current methods struggle with sparse extrinsic rewards. To cope with the sparsity, curiosity provides a representative intrinsic reward that can encourage agents to explore new states. Designing algorithms that construct curiosity efficiently can be a key component of RL systems. Previous research has shown that intrinsic rewards can help alleviate the issues resulting from the lack of dense extrinsic rewards Liu and Abbeel (2021b); Tao et al. (2020); Yang et al. (2021).
For human learning, curiosity motivates people to seek and retain more information through exploration of the environment Burda et al. (2018); Ryan and Deci (2000); Smith and Gasser (2005). The process of arousing and satisfying curiosity can be summed up as one cycle: when a person encounters a problem, he/she first tries to solve it by retrieving information from memory. If retrieval from memory fails, he/she realizes that the currently memorized information is insufficient to solve the problem. A conscious awareness of this information discrepancy then sparks curiosity about the problem, and curiosity stimulates the search for new information. Once the information discrepancy is eliminated, people may have no further curiosity to learn more about the current problem until another problem is encountered Rotgans and Schmidt (2017); Silvia (2017). Human curiosity is constantly consolidated based on the dynamic memory, which involves the stages of encoding, storing, and retrieving information Hayes et al. (2021). As the curiosity fades, additional information is consolidated into the memory. The consolidation results in the formation of new dynamic memories, which depends on the hippocampus O’Reilly and Rudy (2001).
Many attempts have been made to build curiosity in RL, which fall into two main categories: count-based and prediction-based. However, such curiosity is very different from human curiosity, and the problem is far from solved. Taking the Random Network Distillation (RND) Burda et al. (2019) method as an example, RND initializes a random fixed target network with state embeddings, and trains another prediction network to fit the output of the target network. A random fixed target network can be regarded as a random fixed memory, so that RND cannot retain contextual knowledge about the environment Yang et al. (2021). Without dynamically incorporating contextual information into memory, random features may not be sufficient to interpret dynamic environments. Therefore, this kind of curiosity is evaluated in a non-developmental way, which severely limits the performance of curiosity in RL.
In this work, to mimic human curiosity, we formalize and investigate a Dynamic Memory-based Curiosity mechanism, named DyMeCu. Inspired by the bootstrap paradigm Guo et al. (2020); Grill et al. (2020); Flennerhag et al. (2021), we construct dual online learners that learn the latent state in order to formulate a dynamic memory model (Figure 1). On the one hand, state information can be consolidated into the memory via the exponential moving average (EMA) Haynes et al. (2012); Klinker (2011); Grebenkov and Serror (2014) of the dual learners’ parameters. On the other hand, the bootstrap paradigm uses supervised signals from the memory to improve the dual learners’ encoding ability. Furthermore, curiosity is measured by the information gap between the dual learners, which is essentially an uncertainty estimate of a given state based on the dynamic memory Mai et al. (2022); Liu et al. (2020); Abdar et al. (2021).
In brief, our contributions in this paper are:
- We first analyze the shortcomings of previous curiosity-based intrinsic reward methods and, based on information theory, propose to mimic human curiosity by leveraging a dynamic memory instead of a fixed one.
- We propose a novel and practicable intrinsic reward method for RL agents, named DyMeCu (Dynamic Memory-based Curiosity), which consists of a dynamic memory and dual online learners, and can thus measure the curiosity and consolidate the information in a feasible way. Meanwhile, different strategies are explored to further improve the performance of DyMeCu.
2 Related Work
2.1 Curiosity-Based Intrinsic Reward
In RL, exploration is a long-standing challenge. Previous work suggests that, when there is no additional reward, exploration can be regarded theoretically as a hunt for information, which can also be viewed as curiosity Berlyne (1950); Schmidhuber (1991); Kidd and Hayden (2015); de Abril and Kanai (2018); Jaegle et al. (2019); Friston et al. (2016); Peterson and Verstynen (2021). One intuitive formulation of curiosity is count-based, where less-visited states have higher novelty for exploration, but such methods cannot scale to large or continuous state spaces Kearns and Singh (2002); Charikar (2002). Inspired by count-based methods, RND calculates state novelty by distilling a random fixed network (the target network) into another network (the predictor network). The predictor network is trained to minimize the prediction error for each state, and the prediction error is taken as the intrinsic reward. Apart from count-based methods, prediction-based methods also show competitive or better performance by modeling the environment dynamics Pathak et al. (2017, 2019); Kim et al. (2020); Burda et al. (2018). Under the assumption that more frequently visited state-action pairs yield more accurate predictions, the intrinsic reward can be defined as the variance of the predictions of an ensemble or as the distance between predicted and true states, as in the Disagreement Pathak et al. (2019) and ICM Pathak et al. (2017) methods, respectively. There have been few attempts to design a curiosity that contains memory and effectively uses the information consolidated in it, which is the main goal of this paper.
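To make the RND description above concrete, here is a minimal sketch of the idea, not the RND implementation: a fixed random linear map stands in for the target network and a trainable linear map for the predictor. The class name `RNDSketch`, the linear networks, the plain gradient-descent update, and all sizes are assumptions made for illustration.

```python
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class RNDSketch:
    """Toy RND: a fixed random target and a trainable linear predictor (assumed)."""

    def __init__(self, state_dim, latent_dim, lr=0.5, seed=0):
        rng = random.Random(seed)
        self.target = random_matrix(latent_dim, state_dim, rng)     # never updated
        self.predictor = random_matrix(latent_dim, state_dim, rng)  # trained
        self.lr = lr

    def intrinsic_reward(self, state):
        # Prediction error of the predictor against the fixed target.
        return mse(matvec(self.predictor, state), matvec(self.target, state))

    def update(self, state):
        # One gradient-descent step on the MSE with respect to the predictor.
        pred = matvec(self.predictor, state)
        tgt = matvec(self.target, state)
        n = len(pred)
        for i in range(n):
            grad_out = 2.0 * (pred[i] - tgt[i]) / n
            for j in range(len(state)):
                self.predictor[i][j] -= self.lr * grad_out * state[j]


rnd = RNDSketch(state_dim=3, latent_dim=4)
seen, unseen = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
before = rnd.intrinsic_reward(seen)
unseen_before = rnd.intrinsic_reward(unseen)
for _ in range(100):
    rnd.update(seen)
after = rnd.intrinsic_reward(seen)
print(before, after, rnd.intrinsic_reward(unseen))
```

In this toy setting the reward for the repeatedly visited state shrinks toward zero, while the reward for the unvisited state is unchanged. The target never moves, which is the "random fixed memory" property discussed in the text.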
2.2 Uncertainty Estimation
Our work is also related to uncertainty estimation, since uncertainty is what allows an RL agent to discern when to exploit and when to explore its environment Szepesvári (2009). Previous intrinsic rewards can also be interpreted from the perspective of uncertainty estimation, in which curiosity is evaluated by estimating the uncertainty (confidence) of the deep learning model. Taking Disagreement as an example, instead of comparing the prediction to the ground truth, the authors evaluate the uncertainty of multiple prediction models using a deep ensemble Dietterich (2000), at the cost of additional computation. RND also claims that the distillation error can be viewed as a quantification of uncertainty. Unlike RND, our work evaluates the uncertainty of a given state through the information gap between dual learners that rely on a dynamic memory instead of a random fixed network.
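As a small illustration of the ensemble view of uncertainty described above, the sketch below scores a state by the variance of the predictions of several models. It is a simplified stand-in for the Disagreement method, not its implementation; the function name `ensemble_disagreement` and the hand-written prediction vectors are hypothetical.

```python
from statistics import pvariance


def ensemble_disagreement(predictions):
    """Mean per-dimension population variance across ensemble members.

    `predictions` holds one prediction vector per ensemble member.
    """
    per_dimension = [pvariance(values) for values in zip(*predictions)]
    return sum(per_dimension) / len(per_dimension)


# Members agree on a familiar state and disagree on a novel one.
familiar = [[0.5, -0.2], [0.5, -0.2], [0.5, -0.2]]
novel = [[0.0, 0.0], [2.0, 0.0]]
print(ensemble_disagreement(familiar))
print(ensemble_disagreement(novel))
```

With only two members, as in the `novel` example, the score reduces to a scaled squared distance between the two predictions, which is the sense in which a pair of learners can act as a lightweight ensemble.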
3 Methodology
In general, if an agent encounters a state that carries information value relative to its memory, then this state is worth exploring and the information is worth consolidating into its memory dynamically Rotgans and Schmidt (2017); Silvia (2017). The concept of information value thus requires both a dynamic memory and a way to consolidate information into it. For deep neural networks, the memory can be embedded in the latent space and learned by a function that maps states into memory Peterson and Verstynen (2021). This consolidation of information is denoted by:
(1) |
With the memory learned over historical states, we can measure the information value of the next state. According to information theory Ishii et al. (2002); Reddy et al. (2016); Gray (2011) and the concepts proposed in Peterson and Verstynen (2021), the information value of a state should (1) depend only on the memory and what can be immediately learned; (2) be non-negative, because it serves exploration of the environment; and (3) decelerate in finite time for the same state. Thus we define the information value as:
(2) |
Two consequences follow from this definition:
(1) If a state has been completely explored, or cannot be learned, then no more information gain can be added to the current memory, and the information value is zero. Such a state is no longer worth exploring.
(2) If the information value is positive, then the larger it is, the more information gain can be consolidated into the memory. In other words, the larger the information value, the less aware the memory is of the current state, and the more the state is worth exploring. It is this information deficiency of the memory that sparks the curiosity of agents.
In this paper, we focus on how to obtain and leverage the information value for agent exploration, and on the details of the information consolidation method.
3.1 Dynamic Memory-based Curiosity
In our framework, curiosity is aroused when the information in the current memory cannot handle the encountered state. We model the memory as a learnable neural network, but this raises a dilemma: there is no “benchmark” encoding network in the parameter space that encodes the encountered state accurately, since not enough supervised signal is available. It is therefore difficult, and not sound, to define curiosity directly by comparing a randomly encoded state with the latent state output by a dynamic memory network. We instead introduce dual online learners for state encoding. The dual learners have the same network architecture as the memory network, but each has its own parameters, and both are supervised by the memory’s encodings. Curiosity can then be defined as the gap between the encodings of the same state output by the two learners’ networks. The intuition is as follows: if the information of a state has already been squeezed out by the memory, then the memory fully knows and resolves the state, and the dual learners, which can be seen as two imitators of the memory, easily produce similar encodings of it. Conversely, if a state is little known to the memory, the dual learners may produce quite different encodings of it, which indicates a larger information value and thus stimulates the agent to explore this state. For uniformity, we refer to the encoded states and encodings as latent states, which reflect the cognition of states by the memory and learner networks.
In our implementation, this latent gap sparks curiosity and serves as the intrinsic reward for agent exploration. After the RL agent’s learning step, the information is consolidated into the memory so that the memory keeps growing. The consolidation is implemented by updating the memory parameters via the exponential moving average (EMA) Haynes et al. (2012); Klinker (2011); Grebenkov and Serror (2014) of the dual learners’ parameters.
From this analysis, the dual learners first learn from the memory network in order to measure the information value for exploration, and the memory network then consolidates the information gain from the dual learners via EMA. The memory is in effect dynamically seeking an appropriate position in the parameter space, so that its network better characterizes the memory and cognition of the states seen in the environment. In short, our dynamic memory is updated in a bootstrap Grill et al. (2020) manner. Figure 1 and Algorithm 1 present the whole framework and pseudo-code of DyMeCu.
Domain | Task | ICM | Disagreement | RND | APT | ProtoRL | SMM | DIAYN | APS | Ours |
---|---|---|---|---|---|---|---|---|---|---|
Walker | Flip | 398±18 | 407±75 | 465±62 | 477±16 | 480±23 | 505±26 | 381±17 | 461±24 | 630±37 |
Walker | Run | 216±35 | 291±81 | 352±29 | 344±28 | 200±15 | 430±26 | 242±11 | 257±27 | 588±25 |
Walker | Stand | 928±18 | 680±107 | 901±8 | 914±8 | 870±23 | 877±34 | 860±26 | 835±64 | 965±5 |
Walker | Walk | 696±162 | 595±153 | 814±116 | 759±35 | 777±33 | 821±36 | 661±26 | 711±68 | 934±16 |
Walker | Average Performance | 560±59 | 494±104 | 633±54 | 624±22 | 582±24 | 659±31 | 536±20 | 566±46 | 780±21 |
Quadruped | Jump | 112±4 | 383±265 | 580±72 | 462±48 | 425±63 | 298±39 | 578±46 | 529±42 | 694±15 |
Quadruped | Run | 91±29 | 389±61 | 385±47 | 339±40 | 316±36 | 220±37 | 415±28 | 465±37 | 479±6 |
Quadruped | Stand | 184±100 | 628±114 | 800±54 | 622±57 | 560±71 | 367±42 | 706±48 | 714±50 | 921±14 |
Quadruped | Walk | 99±46 | 384±28 | 392±39 | 434±64 | 403±91 | 184±26 | 406±64 | 602±86 | 833±44 |
Quadruped | Average Performance | 122±45 | 446±117 | 540±53 | 465±52 | 426±66 | 268±36 | 527±47 | 578±43 | 732±20 |
Jaco | Reach bottom left | 102±47 | 117±17 | 103±17 | 88±12 | 121±22 | 40±9 | 17±5 | 96±13 | 155±13 |
Jaco | Reach bottom right | 75±27 | 142±3 | 101±26 | 115±12 | 113±16 | 50±9 | 31±4 | 93±9 | 146±16 |
Jaco | Reach top left | 105±29 | 121±17 | 146±46 | 112±11 | 124±20 | 50±7 | 11±3 | 65±10 | 166±14 |
Jaco | Reach top right | 93±19 | 131±10 | 99±25 | 136±5 | 135±19 | 37±8 | 19±4 | 81±11 | 152±4 |
Jaco | Average Performance | 94±31 | 128±12 | 113±29 | 113±10 | 124±20 | 44±9 | 20±4 | 84±11 | 155±12 |
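- Learning of Dual Learners:
The dual online learner models $f_{\theta_1}$ and $f_{\theta_2}$ are defined by sets of weights $\theta_1$ and $\theta_2$ and have the same architecture as the memory network $f_{\theta_m}$. The memory provides the regression targets for training the dual learners. Given a current state $s_t$, the learners transform it into the latent states $f_{\theta_1}(s_t)$ and $f_{\theta_2}(s_t)$ respectively, and the memory network outputs $f_{\theta_m}(s_t)$. The mean squared error (MSE) between each learner’s latent state and the memory’s output is:
$$\mathcal{L}_k = \mathrm{MSE}\big(f_{\theta_k}(s_t),\, f_{\theta_m}(s_t)\big), \quad k \in \{1, 2\} \tag{3}$$
Based on $\mathcal{L}_1$ and $\mathcal{L}_2$, the dual learners are updated as:
$$\theta_k \leftarrow \mathrm{optimizer}\big(\theta_k,\, \nabla_{\theta_k}\mathcal{L}_k,\, \eta\big), \quad k \in \{1, 2\} \tag{4}$$
where $\mathrm{optimizer}$ and $\eta$ represent the optimizer and learning rate.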
- Intrinsic Reward based on Curiosity:
Curiosity relies on the information value of the current state. In our method, this information value is measured by the information gap between the dual learners. The information gap can also be viewed in terms of learning progress Achiam and Sastry (2017); Graves et al. (2017) to form the curiosity. We obtain the intrinsic reward for the agent from the curiosity induced by the information value:
(5) |
From another point of view, the dual-learner mechanism can be regarded as a variant of an ensemble Mai et al. (2022) for uncertainty estimation. Compared with previous attempts that require heavy ensembling (such as Disagreement), our lightweight solution can provide better performance while retaining computational efficiency.
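Overall, the optimization goal for the agent is:
$$\max_{\phi}\; \mathbb{E}_{\pi_\phi}\left[\sum_{t} \gamma^{t}\left(\alpha\, r_t^{i} + \beta\, r_t^{e}\right)\right] \tag{6}$$
where $\gamma$ is the discount factor and $\phi$ represents the parameters of the policy $\pi_\phi$; $r_t^{i}$ and $r_t^{e}$ are the intrinsic and extrinsic rewards at step $t$, and $\alpha$ and $\beta$ are their respective coefficients.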
- Consolidating Information into Memory:
The memory model is updated via EMA to balance its stability on old state information with its plasticity to new state information. In other words, the memory grows dynamically while taking the contextual environment information into account. Specifically, given a decay rate, the memory is updated after each training step as:
(7) |
- Intuitions on DyMeCu’s behavior:
The dynamic memory-based curiosity is closer to the human curiosity mechanism. It is the cognitive difference relative to the memory that stimulates our curiosity to explore the world, after which we consolidate the new cognitive information into the memory dynamically. In addition, from the knowledge distillation view, the memory can also be regarded as the teacher model in the Mean Teacher approach Tarvainen and Valpola (2017). The memory is essentially a self-ensemble of the intermediate models of the learners. The paradigm we propose is one type of replay mechanism, which is thought to play an important role in memory formation, retrieval, and consolidation Hayes et al. (2021). Moreover, we believe our way of forming the memory could also be used in continual learning to address the issue of catastrophic forgetting Arani et al. (2021).
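The three steps described in this section (regressing the dual learners onto the memory, deriving an intrinsic reward from the gap between the learners, and EMA consolidation) can be sketched in a toy setting as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the encoders are linear maps, the optimizer is plain gradient descent, the gap is measured as a mean squared difference, and the memory is assumed to move toward the average of the two learners' parameters. The class name `DyMeCuSketch` and all hyper-parameter values are hypothetical.

```python
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class DyMeCuSketch:
    """Toy dynamic memory with two linear online learners (all choices assumed)."""

    def __init__(self, state_dim, latent_dim, lr=0.1, decay=0.99, seed=0):
        rng = random.Random(seed)
        self.memory = random_matrix(latent_dim, state_dim, rng)
        self.learners = [random_matrix(latent_dim, state_dim, rng) for _ in range(2)]
        self.lr = lr
        self.decay = decay

    def intrinsic_reward(self, state):
        # Assumed gap measure: MSE between the two learners' latent states.
        z1, z2 = (matvec(w, state) for w in self.learners)
        return mse(z1, z2)

    def train_step(self, state):
        # 1) Each learner regresses onto the memory's latent state.
        target = matvec(self.memory, state)
        for w in self.learners:
            z = matvec(w, state)
            n = len(z)
            for i in range(n):
                grad_out = 2.0 * (z[i] - target[i]) / n
                for j in range(len(state)):
                    w[i][j] -= self.lr * grad_out * state[j]
        # 2) EMA consolidation; averaging the two learners is an assumption.
        w1, w2 = self.learners
        for i, row in enumerate(self.memory):
            for j in range(len(row)):
                mean_ij = 0.5 * (w1[i][j] + w2[i][j])
                row[j] = self.decay * row[j] + (1.0 - self.decay) * mean_ij


model = DyMeCuSketch(state_dim=3, latent_dim=4)
seen, unseen = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
before = model.intrinsic_reward(seen)
unseen_before = model.intrinsic_reward(unseen)
for _ in range(300):
    model.train_step(seen)
after = model.intrinsic_reward(seen)
print(before, after, model.intrinsic_reward(unseen))
```

In this sketch the gap on the repeatedly visited state decays geometrically because both learners chase the same memory target, while the gap on the unvisited state stays where it started. Unlike a fixed random target, the memory itself keeps moving toward what the learners have absorbed.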
4 Experiments and Analysis
4.1 Experimental Settings
We evaluate our method in both pre-training and traditional RL settings on two widely used benchmarks: DeepMind Control Suite (DMC) Tunyasuvunakool et al. (2020) and Atari Suite Bellemare et al. (2013). We follow the experimental settings of RND Burda et al. (2019) for Atari Suite and those of URLB Laskin et al. (2021), the pre-training benchmark for DeepMind Control Suite. We apply the PPO algorithm Schulman et al. (2017) to train the agent. The hyper-parameter was set as 0.99 in all experiments, and the implementation details and hyper-parameters can be found in the appendix.
4.2 DeepMind Control Suite
Many well-performing approaches, as in URLB Laskin et al. (2021), use the pre-training and fine-tuning paradigm to improve sample efficiency in RL, especially on benchmarks such as DMC that contain various domains and complex tasks. We evaluate DyMeCu on all three domains of DMC, namely Walker, Quadruped, and Jaco Arm (from easiest to hardest), each of which has four tasks. During the pre-training phase, the agents are trained for 2 million steps with only the intrinsic rewards produced by the curiosity. During the fine-tuning phase, the agents are trained for 100k steps with only extrinsic rewards.
Table 1 reports the final scores and standard deviations of DyMeCu and other competitive methods. We compare DyMeCu with both intrinsic reward-based methods (ICM, Disagreement, RND, APT Liu and Abbeel (2021b)) and other pre-training strategies (ProtoRL Yarats et al. (2021), SMM Lee et al. (2019), DIAYN Eysenbach et al. (2018), APS Liu and Abbeel (2021a)). DyMeCu improves average performance by 18.3%, 26.6%, and 21.0% on the three domains respectively. The quantitative results show that DyMeCu achieves a new state of the art on all 12 tasks, demonstrating its ability to improve model performance and robustness through the pre-training paradigm. Figure 2 plots 6 learning curves (fine-tuning phase) of DyMeCu and three competitive curiosity-based methods; all learning curves can be found in the appendix. DyMeCu converges faster than the other methods, and its converged performance also surpasses theirs significantly. DyMeCu’s speed-up may be mainly due to contextual state information being consolidated into memory dynamically, rather than relying on a random fixed network as in RND. With the dynamic memory, the agents’ exploration can be much more efficient.
4.3 Atari Suite
For the Atari Suite, we first record the performance of agents trained with both intrinsic and extrinsic rewards. The experiments run for 100M steps, equivalent to 400M frames, and the intrinsic and extrinsic reward coefficients are set identically for all methods, following the setup of previous curiosity-based methods. Table 2 lists the aggregate metrics and scores of three methods trained with both intrinsic and extrinsic rewards on the Atari 26 games. Human and random scores are adopted from Hessel et al. (2018). As done in previous works Liu and Abbeel (2021b); Yarats et al. (2020); Schwarzer et al. (2020), we convert the episode reward into human-normalized scores (HNS), normalizing by expert human scores to account for the different score scales of the games. #SOTA denotes the number of games on which a method exceeds the other methods, and mean HNS is calculated as the average over all games of (agent score − random score) / (human score − random score). In Table 2, DyMeCu shows its superiority over Disagreement and ICM with the highest mean HNS and #SOTA.
Game | Random | Human | ICM | Disagreement | Ours |
---|---|---|---|---|---|
Alien | 227.8 | 7127.7 | 1524.7 | 1304.7 | 2589.2 |
Amidar | 5.8 | 1719.5 | 763.0 | 506.6 | 470.1 |
Assault | 222.4 | 742.0 | 1365.5 | 1544.6 | 4539.3 |
Asterix | 210.0 | 8503.3 | 2103.4 | 1616.2 | 4576 |
Bank Heist | 14.2 | 753.1 | 1359.4 | 1343.4 | 1529.5 |
BattleZone | 2360.0 | 37187.5 | 51459.1 | 65387.4 | 58220.0 |
Boxing | 0.1 | 12.1 | 98.9 | 99.3 | 99.6 |
Breakout | 1.7 | 30.5 | 247.6 | 177.8 | 119.7 |
ChopperCommand | 811.0 | 7387.8 | 9456.5 | 10286.9 | 9521.0 |
Crazy Climber | 10780.5 | 23829.4 | 135003.3 | 132614.0 | 106682.0 |
Demon Attack | 107805.0 | 35829.4 | 4679.2 | 6606.0 | 8417.0 |
Freeway | 0.0 | 29.6 | 33.8 | 33.9 | 30.7 |
Frostbite | 65.2 | 4334.7 | 309.4 | 295.1 | 1750.0 |
Gopher | 257.6 | 2412.5 | 14619.4 | 14202.4 | 9750.3 |
Hero | 1027.0 | 30826.4 | 13482.2 | 13488.0 | 12728.5 |
Jamesbond | 29.0 | 302.8 | 680.8 | 726.5 | 5052.5 |
Kangaroo | 52.0 | 3035.0 | 12922.7 | 14621.8 | 10760.0 |
Krull | 1598.0 | 2665.6 | 10027.1 | 11402.7 | 6447.0 |
Kung Fu Master | 258.5 | 22736.3 | 40157.7 | 32607.2 | 44604.9 |
Ms Pacman | 307.3 | 6951.6 | 2787.0 | 6287.8 | 2752.4 |
Pong | -20.7 | 14.6 | 20.1 | 20.6 | 15.3 |
Private Eye | 24.9 | 69571.3 | 96.0 | 98.0 | 100.0 |
Qbert | 163.9 | 13455.0 | 16388.9 | 22474.5 | 14770.2 |
Road Runner | 11.5 | 7845.0 | 56273.7 | 55359.3 | 32271.0 |
Seaquest | 68.4 | 42054.7 | 16178.1 | 2733.1 | 3910.9 |
Up N Down | 533.4 | 11693.2 | 46152.9 | 18235.5 | 18067.6 |
Mean HNS | 0.0 | 1.0 | 2.861 | 2.767 | 3.019 |
#SOTA | N/A | N/A | 7 | 9 | 10 |
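As a worked example of the human-normalized score, the sketch below applies the standard definition, (agent − random) / (human − random), to two rows of Table 2 using the "Ours" column. The function name and the choice of rows are illustrative assumptions, and averaging two games is only a demonstration, not a reproduction of the reported mean HNS.

```python
def human_normalized_score(agent, random_score, human):
    """Standard HNS: 0 matches the random policy, 1 matches the human score."""
    return (agent - random_score) / (human - random_score)


# (random, human, ours) taken from the Alien and Boxing rows of Table 2.
rows = {
    "Alien": (227.8, 7127.7, 2589.2),
    "Boxing": (0.1, 12.1, 99.6),
}
hns = {
    game: human_normalized_score(ours, rand, human)
    for game, (rand, human, ours) in rows.items()
}
mean_hns = sum(hns.values()) / len(hns)
print(hns, mean_hns)
```

Alien comes out at roughly 0.34 (well below human level) and Boxing at roughly 8.3 (far above it), which shows why a few high-scoring games can dominate the mean HNS.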
Figure 3 displays the learning curves obtained with both intrinsic and extrinsic rewards. We compare DyMeCu with three widely used baselines, Disagreement, ICM, and RND, on 6 randomly chosen Atari games. DyMeCu shows evident advantages in performance and learning speed in most games. For example, on Jamesbond, the reward of DyMeCu at convergence is more than three times that of the other methods. We also compare the performance of agents trained with only intrinsic rewards. As shown in Figure 4, DyMeCu outperforms the Disagreement baseline in all 6 environments and outperforms both the ICM and RND baselines in 4 of them. Overall, the results on the Atari Suite show that DyMeCu outperforms other curiosity-based methods, demonstrating its ability to generate more accurate intrinsic rewards and provide more useful information for better exploration.
4.4 Further Analysis on DyMeCu
Further analyses, including ablation studies, are presented to give an intuition of DyMeCu’s behavior and performance. We run the experiments across 3 random seeds, and all of the following experiments run for 50M steps, equivalent to 200M frames.
- Dual learners:
Here we explore designing the curiosity under a naive setting, that is, using a single encoding network to learn the latent space, so that the curiosity-based intrinsic reward is defined as the gap between this learner and the memory network:
(8) |
The memory is updated with the EMA of the single learner’s parameters. As shown in Figure 5, the one-learner mechanism does not show significant advantages over other methods, whereas our dual-learner mechanism performs much better, with more accurate curiosity and corresponding intrinsic rewards.
Method \ Game | Alien | Kangaroo | MsPacman | Ave. |
---|---|---|---|---|
Disagreement | 316.6 | 514.0 | 291.0 | 373.9 |
ICM | 374.2 | 557.0 | 412.7 | 447.9 |
RND | 206.1 | 412.0 | 607.2 | 408.4 |
DyMeCu (ours) | 492.0 | 739.0 | 602.4 | 611.1 |
DyMeCu (update with one learner) | 521.7 | 782.0 | 500.6 | 601.4 |
DyMeCu (with additional module) | 390.6 | 645.2 | 644.4 | 560.1 |
- Update of memory network:
The memory network in DyMeCu is updated with both learners. We additionally evaluate the performance of DyMeCu when the memory is updated using only one learner’s parameters. The results in Table 3 indicate that either learner alone can consolidate state information into the memory well. Combined with Figure 5, this suggests that it is useful and necessary to train dual learners, after which the memory can be updated from both learners or from one, with the dual-learner update showing slightly better performance.
- Structure of learners:
The bootstrap idea has been explored in previous research. The most similar work to ours is BYOL Grill et al. (2020), which uses the bootstrap method for self-supervised learning in computer vision. Grill et al. add a predictor module to the online network and compare the predictor’s output with that of the target network, which is key to generating well-performing representations Chen and He (2021). Similarly, in this ablation study we design controlled trials in which two additional convolution layers are added to each of the dual learners. Table 3 shows that this additional learnable module does not lead to a significant improvement. In our analysis, unlike previous work using the bootstrap method, we aim to generate the intrinsic reward by calculating the information value (i.e., the information gap between the dual learners) as accurately as possible, rather than to learn better representations for downstream tasks.
- Robustness to the EMA decay hyper-parameter:
One concern is the updating speed of the memory network under EMA, that is, how much new environment information is accepted and how fast it is consolidated. To analyze the effect of this hyper-parameter, we evaluate DyMeCu with different values in a reasonable interval and assess the agents’ performance in three Atari games: Alien, Kangaroo, and Krull. For a more direct and visual comparison, we normalize the episode reward of DyMeCu into baseline-normalized scores (BNS), where the baseline score is the average score of the baselines. As illustrated in Figure 6, all values of the hyper-parameter between 0.99 and 0.9999 yield satisfactory performance, generally greater than twice the baseline average. DyMeCu thus shows acceptable robustness to the updating hyper-parameter.
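To give a feel for what the range 0.99 to 0.9999 means for "how much and how fast" information is consolidated, the sketch below computes two generic properties of any exponential moving average: its half-life and its approximate averaging window 1 / (1 − decay). These are standard EMA facts rather than quantities reported in the paper, and the function names are hypothetical.

```python
import math


def ema_half_life(decay):
    """Number of EMA updates after which an old contribution is halved."""
    return math.log(0.5) / math.log(decay)


def ema_time_constant(decay):
    """Approximate averaging window 1 / (1 - decay) of an EMA."""
    return 1.0 / (1.0 - decay)


for decay in (0.99, 0.999, 0.9999):
    print(decay, round(ema_half_life(decay)), round(ema_time_constant(decay)))
```

Across the tested interval the effective window spans roughly two orders of magnitude, from about a hundred updates to about ten thousand, so the reported robustness covers both fairly plastic and very stable memories.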
5 Conclusion
To address the challenge of sparse extrinsic rewards in RL, we propose DyMeCu, which mimics human curiosity. Specifically, DyMeCu consists of a dynamic memory and dual online learners. The information gap between the dual learners sparks the agent’s curiosity and forms the intrinsic reward, and the state information is then consolidated into the dynamic memory. Large-scale empirical experiments on multiple benchmarks show that DyMeCu outperforms competing curiosity-based methods under different settings.
References
- Abdar et al. (2021). A review of uncertainty quantification in deep learning: techniques, applications and challenges. Information Fusion 76, pp. 243–297.
- Achiam and Sastry (2017). Surprise-based intrinsic motivation for deep reinforcement learning. arXiv preprint arXiv:1703.01732.
- Arani et al. (2021). Learning fast, learning slow: a general continual learning method based on complementary learning system. In International Conference on Learning Representations.
- Bellemare et al. (2013). The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- Berlyne (1950). Novelty and curiosity as determinants of exploratory behaviour. British Journal of Psychology 41 (1), pp. 68.
- Burda et al. (2018). Large-scale study of curiosity-driven learning. In International Conference on Learning Representations.
- Burda et al. (2019). Exploration by random network distillation. In International Conference on Learning Representations.
- Charikar (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 380–388.
- Chen and He (2021). Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758.
- de Abril and Kanai (2018). Curiosity-driven reinforcement learning with homeostatic regulation. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–6.
- Dietterich (2000). Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15.
- Eysenbach et al. (2018). Diversity is all you need: learning skills without a reward function. arXiv preprint arXiv:1802.06070.
- Flennerhag et al. (2021). Bootstrapped meta-learning. arXiv preprint arXiv:2109.04504.
- Friston et al. (2016). Active inference and learning. Neuroscience & Biobehavioral Reviews 68, pp. 862–879.
- Graves et al. (2017). Automated curriculum learning for neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1311–1320.
- Gray (2011). Entropy and information theory. Springer Science & Business Media.
- Grebenkov and Serror (2014). Following a trend with an exponential moving average: analytical results for a gaussian model. Physica A: Statistical Mechanics and its Applications 394, pp. 288–303.
- Grill et al. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in Neural Information Processing Systems 33, pp. 21271–21284.
- Guo et al. (2020). Bootstrap latent-predictive representations for multitask reinforcement learning. In International Conference on Machine Learning, pp. 3875–3886.
- Hayes et al. (2021). Replay in deep learning: current approaches and missing biological elements. Neural Computation 33 (11), pp. 2908–2950.
- Haynes et al. (2012). An exponential moving average algorithm. In 2012 IEEE Congress on Evolutionary Computation, pp. 1–8.
- Hessel et al. (2018). Rainbow: combining improvements in deep reinforcement learning. In Thirty-second AAAI conference on artificial intelligence.
- Ishii et al. (2002). Control of exploitation–exploration meta-parameter in reinforcement learning. Neural Networks 15 (4-6), pp. 665–687.
- Jaegle et al. (2019). Visual novelty, curiosity, and intrinsic reward in machine learning and the brain. Current Opinion in Neurobiology 58, pp. 167–174.
- Kearns and Singh (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning 49 (2), pp. 209–232.
- Kidd and Hayden (2015). The psychology and neuroscience of curiosity. Neuron 88 (3), pp. 449–460.
- Kim et al. (2020). Active world model learning with progress curiosity. In International Conference on Machine Learning, pp. 5306–5315.
- Klinker (2011). Exponential moving average versus moving exponential average. Mathematische Semesterberichte 58 (1), pp. 97–107.
- Laskin et al. (2021). URLB: unsupervised reinforcement learning benchmark. In Deep RL Workshop NeurIPS 2021.
- Lee et al. (2019). Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274.
- Liu and Abbeel (2021a). APS: active pretraining with successor features. In International Conference on Machine Learning, pp. 6736–6747.
- Liu and Abbeel (2021b). Behavior from the void: unsupervised active pre-training. Advances in Neural Information Processing Systems 34.
- Liu et al. (2020). Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. Advances in Neural Information Processing Systems 33, pp. 7498–7512.
- Mai et al. (2022). Sample efficient deep reinforcement learning via uncertainty estimation. arXiv preprint arXiv:2201.01666.
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- O’Reilly and Rudy (2001). Conjunctive representations in learning and memory: principles of cortical and hippocampal function. Psychological Review 108 (2), pp. 311.
- Pathak et al. (2017). Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pp. 2778–2787.
- Pathak et al. (2019). Self-supervised exploration via disagreement. In International Conference on Machine Learning, pp. 5062–5071.
- Peterson and Verstynen (2021). Curiosity eliminates the exploration-exploitation dilemma. bioRxiv, pp. 671362.
- Reddy et al. (2016). Infomax strategies for an optimal balance between exploration and exploitation. Journal of Statistical Physics 163 (6), pp. 1454–1476.
- Rotgans and Schmidt (2017). The role of interest in learning: knowledge acquisition at the intersection of situational and individual interest. In The science of interest, pp. 69–93.
- Ryan and Deci (2000). Intrinsic and extrinsic motivations: classic definitions and new directions. Contemporary Educational Psychology 25 (1), pp. 54–67.
- Schmidhuber (1991). A possibility for implementing curiosity and boredom in model-building neural controllers. In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, pp. 222–227.
- Schulman et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Schwarzer et al. (2020). Data-efficient reinforcement learning with self-predictive representations. In International Conference on Learning Representations.
- Silvia (2017). Curiosity. In The science of interest, pp. 97–107.
- Smith and Gasser (2005). The development of embodied cognition: six lessons from babies. Artificial Life 11 (1-2), pp. 13–29.
- Szepesvári (2009). Synthesis lectures on artificial intelligence and machine learning. Synthesis lectures on artificial intelligence and machine learning.
- Tao et al. (2020). Novelty search in representational space for sample efficient exploration. Advances in Neural Information Processing Systems 33, pp. 8114–8126.
- Tarvainen and Valpola (2017). Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. Advances in Neural Information Processing Systems 30.
- Tesauro and others (1995). Temporal difference learning and td-gammon. Communications of the ACM 38 (3), pp. 58–68.
- Tunyasuvunakool et al. (2020). Dm_control: software and tasks for continuous control. Software Impacts 6, pp. 100022.
- Yang et al. (2021). Exploration in deep reinforcement learning: a comprehensive survey. arXiv preprint arXiv:2109.06668.
- Yarats et al. (2021). Reinforcement learning with prototypical representations. In International Conference on Machine Learning, pp. 11920–11931.
- Yarats et al. (2020). Image augmentation is all you need: regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations.