Normality-Guided Distributional Reinforcement Learning for Continuous Control

Ju-Seung Byun, Andrew Perrault

Abstract

Learning a predictive model of the mean return, or value function, plays a critical role in many reinforcement learning algorithms. Distributional reinforcement learning (DRL) methods instead model the value distribution, which has been shown to improve performance in many settings. In this paper, we model the value distribution as approximately normal using the Markov Chain central limit theorem. We analytically compute quantile bars to provide a new DRL target that is informed by the decrease in standard deviation that occurs over the course of an episode. In addition, we suggest an exploration strategy based on how closely the learned value distribution resembles the target normal distribution to make the value function more accurate for better policy improvement. The approach we outline is compatible with many DRL structures. We use proximal policy optimization as a testbed and show that both the normality-guided target and exploration bonus produce performance improvements. We demonstrate our method outperforms DRL baselines on a number of continuous control tasks.


Department of Computer Science & Engineering
The Ohio State University
Columbus, OH 43210, USA
{byun.83, perrault.17}@osu.edu

Introduction

In reinforcement learning, an agent receives reward by interacting with the environment and updates its policy to maximize the cumulative reward, or return. The return often has a high variance, which may make training unstable. To reduce the variance, actor-critic algorithms (Thomas, 2014; Schulman et al., 2015; Mnih et al., 2016; Gu et al., 2016) use a learned value function, a model of the return that serves as a baseline to reduce variance and speed up training.

The standard approach to learning a value function estimates a single scalar value for each state. Several recent studies learn a value distribution instead, attempting to capture the randomness in the interaction between the agent and environment (Bellemare et al., 2017a; Barth-Maron et al., 2018; Dabney et al., 2017; Singh et al., 2020; Yue et al., 2020). This approach, called distributional reinforcement learning (DRL), has been shown to increase stability and speed up learning. A common framework for DRL leverages quantile regression, modeling the value distribution using quantile bars and using the Wasserstein distance to maximize the similarity between the model and the observed returns (Bellemare et al., 2017a; Dabney et al., 2017).

Although DRL frameworks have yielded improved performance, most only make use of the mean of the distributional value network when updating the policy. In this paper, we observe that the true value distribution for many policies is often approximately normal. This observation is supported by central limit theorems for Markov chains as well as empirical inspection of value distributions. We additionally find that, in many continuous control tasks, the standard deviation of the return decreases roughly linearly in the timestep (Table 1). These observations motivate a new target for DRL: a normal distribution whose variance shrinks as the timestep increases.

Our key contributions are:

We provide a new target for the distributional value function, guiding it to output quantiles that are approximately normal and reduce in variance as timestep increases.
We propose an exploration strategy by measuring the distance between the distributional value function and the target distribution. This strategy encourages an agent to further explore states where the current value distribution does not appear normal in order to provide a better critic and, hence, better policy updates.
We provide an empirical validation that uses proximal policy optimization (PPO) (Schulman et al., 2017) on several continuous control tasks. We compare our normality-guided target and exploration strategy to standard PPO and distributional PPO and find that our method outperforms the baselines.

As far as we know, we are the first work to leverage approximate normality of returns in reinforcement learning training. We emphasize that we expect that approximate normality in returns will hold widely across many common domains, and the insights from the normality-guided approach we propose are compatible with many existing DRL methods at low computational cost.

Figure 1: Quantile bars of trained QR-DQN or DPPO (defined in the Experiments section) for an initial state and their cumulative distribution function (CDF) (orange) with the normal distribution CDF (green). The first-row graphs are for discrete environments with QR-DQN, and the second-row graphs are for continuous environments with DPPO. The mean and standard deviation are computed from the QR-DQN or DPPO quantile output. The number of quantile bars is 200 for QR-DQN and 100 for DPPO, but only 45 are shown for visualization. The shape of the CDF is close to the CDF of the normal distribution. (Note that the zig-zag parts of the CDF appear because the quantile bars are not well ordered.)

Related Work

We briefly review related work on distributional reinforcement learning (DRL) and exploration in reinforcement learning (RL). Bellemare et al. (2017a) studied the distribution of the value network for deep Q-learning and proposed C51, a DRL algorithm that uses a categorical distribution to model the value distribution. C51 produced state-of-the-art performance on the Arcade Learning Environment (Bellemare et al., 2013), but has the disadvantage of requiring the parameters of the modeling distribution to be fixed in advance. Dabney et al. (2017) proposed QR-DQN, which leverages quantile regression and allows for an adaptive distribution structure. In addition, they resolved a key theoretical issue: the KL divergence ( $D_{KL}$ ) could lead to non-convergence, so they replaced it with the Wasserstein metric and showed convergence of the distributional value network. These methods and subsequent DRL approaches use only the mean of the value distribution for updating the policy (with the exception of Mavrin et al. (2019); see below).

Several extensions of DRL to continuous tasks have been proposed, including distributed distributional deterministic policy gradients (D4PG) (Barth-Maron et al., 2018), sample-based distributional policy gradient (SDPG) (Singh et al., 2020) and implicit distributional actor-critic (IDAC) (Yue et al., 2020), none of which use properties of the learned value distribution beyond the mean. Our contribution is model agnostic and is compatible with any existing DRL algorithm that learns the value distribution with a neural network.

In RL, training an optimal agent requires visiting many states. A challenge for model-free RL agents is that the entropy of the policy decreases rapidly, so the policy falls into a local optimum without sufficient state exploration. Prior methods have addressed this problem in several ways. One solution for encouraging an agent to visit enough states is to use an entropy regularizer. Ahmed et al. (2019) and Liu et al. (2019) show that higher entropy can produce smoother training results. Optimistic exploration methods (Bellemare et al., 2016; Strehl and Littman, 2005; Tang et al., 2016; Fu et al., 2017; Burda et al., 2018) assume that an unfamiliar state has a positive potential and use some kind of structure to estimate how often a state has been visited, decreasing the positive potential with more visits. Thompson sampling (Osband et al., 2016) maintains a belief distribution over the Q-function and samples a policy from it. Methods using information gain (Houthooft et al., 2016) encourage the agent to visit states according to how much the agent can learn from them. We propose a different idea, leveraging the features of the learned distribution to estimate whether a state has been sufficiently visited. If the quantile bars do not approximate a normal distribution, we hypothesize that this corresponds to uncertainty in the state value, and we add an exploration bonus. The most similar work is Mavrin et al. (2019), who propose an optimistic exploration strategy using the learned upper quantiles.

$σ$	$t = 0$	$t = \frac{1}{5} T$	$t = \frac{2}{5} T$	$t = \frac{3}{5} T$	$t = \frac{4}{5} T$	$t = T$
Hopper-v3	632.23	495.73	379.43	258.06	124.48	0.02
Walker2d-v3	748.23	595.82	459.51	302.33	114.33	0.05
HalfCheetah-v3	94.28	86.46	75.70	65.77	44.36	0.26
Ant-v3	411.59	333.03	257.73	173.28	91.58	0.26

Table 1: Standard deviations of return from timestep $t$ decrease linearly as timestep increases ( $T$ stands for the last timestep). Each entry represents a standard deviation of returns from $t$ over 100 episodes. This result motivates our approach of reducing the standard deviation of a target for $V^{D, π}$ as timestep increases.

Preliminaries

We consider an infinite-horizon Markov decision process (MDP) (Puterman, 2014; Sutton and Barto, 2018) defined by the tuple $M = (S, A, P, R, γ, μ)$ . The agent interacts with an environment and takes an action $a_{t} \in A$ according to a policy $π_{θ} (a_{t} | s_{t})$ parameterized by $θ$ for each state $s_{t} \in S$ at time $t$ . Then, the environment changes the current state $s_{t}$ to the next state $s_{t + 1}$ based on the transition probability $P (s_{t + 1} | s_{t}, a_{t})$ . The reward function $R : S \times A \to \mathbb{R}$ provides the reward for ( $s_{t}, a_{t}$ ). $γ$ denotes the discount factor and $μ$ is the initial state distribution for $s_{0}$ . The goal of a reinforcement learning (RL) algorithm is to find a policy $π_{θ}$ that maximizes the expected cumulative reward, or return:

θ^{*} = \arg\max_{θ} E_{s_{0} \sim μ, \; a_{t} \sim π_{θ} (\cdot | s_{t}), \; s_{t + 1} \sim P (\cdot | s_{t}, a_{t})} [\sum_{t = 0}^{\infty} γ^{t} R (s_{t}, a_{t})] .

(1)

In many RL algorithms, a value function $V^{π} (s_{t})$ is trained to estimate the expected return under the current policy, $E [\sum_{t^{'} = t}^{\infty} γ^{t^{'}} R (s_{t^{'}}, a_{t^{'}})]$ , and is updated with the target:

V^{π} (s_{t}) = R (s_{t}, a_{t}) + γ V^{π} (s_{t + 1}), \quad \text{where } s_{t + 1} \sim P (\cdot | s_{t}, a_{t})

(2)

In distributional reinforcement learning (DRL), Bellemare et al. (2017b) proposed a state-action distributional value function $Q^{D, π} (s, a)$ , which models the distribution of the return obtained by taking action $a$ in state $s$ . We model the distribution by having $Q^{D, π}$ output $N$ quantile bars $\{q_{0}, q_{1}, . . ., q_{N - 1}\}$ . Similar to the Bellman optimality operator $T^{π}$ (Watkins and Dayan, 1992) (Equation 3) for the state-action value function $Q^{π}$ , $Q^{D, π}$ is updated by a distributional Bellman optimality operator (Equation 4):

T^{π} Q^{π} (s_{t}, a_{t}) = E [R (s_{t}, a_{t})] + γ E_{s_{t + 1} \sim P (\cdot | s_{t}, a_{t})} [\max_{a_{t + 1}} Q^{π} (s_{t + 1}, a_{t + 1})]

(3)

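T^{π} Q^{D, π} (s_{t}, a_{t}) = R (s_{t}, a_{t}) + γ Q^{D, π} (s_{t + 1}, a_{t + 1}), \quad \text{where } s_{t + 1} \sim P (\cdot | s_{t}, a_{t}), \quad a_{t + 1} = \arg\max_{a_{t + 1}^{'}} E [Q^{D, π} (s_{t + 1}, a_{t + 1}^{'})]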

(4)
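Applied to quantile bars, the update in Equation 4 shifts every bar of the next-state distribution by the reward and scales it by $γ$ . The following is a minimal sketch of that target in plain Python. The function names and the example numbers are illustrative assumptions and are not taken from the authors' implementation.

```python
def bellman_quantile_target(reward, next_quantiles, gamma):
    """Target bars R(s_t, a_t) + gamma * q_i built from the next-state bars q_i."""
    return [reward + gamma * q for q in next_quantiles]


def spread(quantiles):
    """Range covered by a set of quantile bars."""
    return max(quantiles) - min(quantiles)


# Hypothetical quantile bars at s_{t+1}; any ordered list works here.
next_q = [-2.0, -1.0, 1.0, 2.0]
target_q = bellman_quantile_target(reward=1.0, next_quantiles=next_q, gamma=0.9)

# The target's spread is gamma times the spread of the next-state bars.
shrink_factor = spread(target_q) / spread(next_q)
```

Since $γ < 1$ , `shrink_factor` is below 1: the target for timestep $t$ covers a narrower range than the bars at $t + 1$ it was built from.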

Figure 2: Illustration of how an exploration bonus is computed. When $N = 4$ , there are 4 quantile bars $\{q_{0}, q_{1}, q_{2}, q_{3}\}$ as an output of $V^{D, π}$ . We compute the mean $q_{avg}$ of the quantile bars and find $σ_{i}$ such that $P (Z \leq \frac{q_{i} - q_{avg}}{σ_{i}}) = 0.2 (i + 1) = \frac{i + 1}{N + 1}$ for $i \in \{0, 1, 2, 3\}$ . $N (q_{avg}, σ^{2} (t))$ is the target distribution, and $N (q_{avg}, σ_{avg}^{2})$ , where $σ_{avg} = \frac{1}{4} \sum_{i = 0}^{3} σ_{i}$ , is the distribution implied by the current quantile bars. The KL divergence measures how much $N (q_{avg}, σ_{avg}^{2})$ differs from $N (q_{avg}, σ^{2} (t))$ .

Under the distributional Bellman optimality operator, the standard deviation of the distributional value function increases with the timestep, because the discount factor shrinks the quantiles at each backup. Suppose $Q^{D, π} (s_{t + 1}, a_{t + 1}) = \{q_{0}, q_{1}, . . ., q_{N - 1}\}$ at timestep $t + 1$ ; then the target for $Q^{D, π} (s_{t}, a_{t})$ is $\{R (s_{t}, a_{t}) + γ q_{0}, R (s_{t}, a_{t}) + γ q_{1}, . . ., R (s_{t}, a_{t}) + γ q_{N - 1}\}$ . Since the discount factor $γ$ is less than 1, the range of the distribution represented by the target quantiles at $t$ is smaller than the range at $t + 1$ . However, these shrunken quantiles contradict the empirical behavior of the return (Table 1). We introduce the Markov Chain Central Limit Theorem (MC-CLT) (Jones, 2004) as a basis for approximating the value distribution as a normal distribution whose standard deviation decreases as the timestep increases.

Theorem 1 (Markov Chain Central Limit Theorem)
Let $X$ be a Harris ergodic Markov chain on $\mathcal{X}$ with stationary distribution $ρ$ , and let $f : \mathcal{X} \to \mathbb{R}$ be a Borel function. If $X$ is uniformly ergodic and $E_{ρ} [f^{2} (x)] < \infty$ , then for any initial distribution, as $n \to \infty$ ,

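\sqrt{n} (\bar{f}_{n} - E_{ρ} [f]) \xrightarrow{d} N (0, σ^{2}), \quad \text{where } \bar{f}_{n} = \frac{1}{n} \sum_{i = 1}^{n} f (X_{i}) \xrightarrow{a.s.} E_{ρ} [f], \quad σ^{2} = Var_{ρ} (f (X_{1})) + 2 \sum_{k = 1}^{\infty} Cov_{ρ} (f (X_{1}), f (X_{1 + k})) .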

(5)
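As a worked instance of the variance term in Equation 5, consider a two-state chain on $\{0, 1\}$ with $f (x) = x$ . This chain is a toy illustration chosen here, not a model used in the paper. With $a = P (0 \to 1)$ and $b = P (1 \to 0)$ , both in $(0, 1)$ , the lag-$k$ covariance under the stationary distribution is $Var_{ρ} (f (X_{1})) λ^{k}$ with $λ = 1 - a - b$ , so the infinite sum is a geometric series. The sketch below evaluates the sum both by truncation and in closed form.

```python
def two_state_asymptotic_variance(a, b, num_terms=10_000):
    """Evaluate sigma^2 from Equation 5 for a two-state chain with f(x) = x.

    a = P(0 -> 1) and b = P(1 -> 0); both are assumed to lie in (0, 1).
    """
    p1 = a / (a + b)           # stationary probability of state 1
    var = p1 * (1.0 - p1)      # Var_rho(f(X_1)) of a Bernoulli(p1) variable
    lam = 1.0 - a - b          # second eigenvalue of the transition matrix
    # For this chain, Cov_rho(f(X_1), f(X_{1+k})) = var * lam**k.
    cov_sum = sum(var * lam ** k for k in range(1, num_terms + 1))
    return var + 2.0 * cov_sum


def closed_form(a, b):
    """Geometric-series limit of the same sum: var * (1 + lam) / (1 - lam)."""
    p1 = a / (a + b)
    lam = 1.0 - a - b
    return p1 * (1.0 - p1) * (1.0 + lam) / (1.0 - lam)


sticky = two_state_asymptotic_variance(0.1, 0.2)   # positively correlated steps
iid = two_state_asymptotic_variance(0.5, 0.5)      # lam = 0: independent steps
```

When $λ = 0$ the covariance terms vanish and $σ^{2}$ reduces to the per-step variance. Positive correlation between consecutive steps makes $σ^{2}$ larger than the per-step variance.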

We set $f$ to be the reward function $R$ , and the (state, action) pair corresponds to $X$ . Based on the form of the variance and the convergence to a normal distribution in Equation 5, we make our distributional value function approximate a normal distribution whose variance decreases with the timestep. Figure 1 shows empirically that the value distribution learned by QR-DQN or DPPO at convergence is close to a normal distribution.

Approach

In this section, we present a method for obtaining the standard deviation required for time-dependent normal distributional reinforcement learning and then introduce a method for obtaining the target of $V^{D, π}$ using it. Then, we discuss how to measure the distance between the current $V^{D, π}$ and the target normal distribution and use it as an exploration strategy.

Standard Deviation of Approximated Normal Distribution

Intuitively, if we consider rewards as random variables, the variance of the value distribution of an initial state $s_{0}$ will be greater than the variance of the value distribution of the state $s_{T}$ at the last timestep $T$ . We assume that the Markov decision process and reward function satisfy the conditions of the MC-CLT. We additionally assume that the variances and covariances of the rewards are equal regardless of the timestep, because the reward in a stochastic environment is not tied to a fixed timestep, i.e., a reward received at $t$ can occur in another trajectory or a few timesteps later. We note that the tasks we consider have handcrafted reward functions that are designed to be dense; this property may not hold in tasks with a sparse reward function or a deterministic environment. Let the last timestep $T$ be sufficiently large; then we can model the return's distribution as:

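\bar{R}_{t} \approx N (E (R_{t}), (T - t + 1) σ_{t}^{2}), \quad \text{where } \bar{R}_{t} = \sum_{i = t}^{T} R (s_{i}, a_{i}) \to E_{π} [R_{t}], \quad σ_{t}^{2} \approx Var (R (s_{t}, a_{t})) + 2 \sum_{k = 1}^{T - t} Cov (R (s_{t}, a_{t}), R (s_{t + k}, a_{t + k})) .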

(6)
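The quantity reported in Table 1, the standard deviation across episodes of the return collected from timestep $t$ , can be estimated from logged reward sequences. The following is a minimal sketch under two assumptions made here: all episodes have the same length, and the return is the undiscounted sum in Equation 6. The helper names are hypothetical.

```python
from statistics import pstdev


def return_to_go(rewards, t):
    """Undiscounted sum of rewards from timestep t to the end of the episode."""
    return sum(rewards[t:])


def return_std_at(episodes, t):
    """Standard deviation across episodes of the return collected from timestep t."""
    return pstdev([return_to_go(rewards, t) for rewards in episodes])


def modeled_variance(per_step_variance, t, last_t):
    """Variance (T - t + 1) * sigma_t^2 of the normal model in Equation 6."""
    return (last_t - t + 1) * per_step_variance


# Made-up reward sequences, one list per episode (illustration only).
episodes = [[1.0, 2.0, 3.0], [1.0, 0.0, 1.0], [2.0, 2.0, 2.0]]
std_at_start = return_std_at(episodes, 0)
std_at_end = return_std_at(episodes, 2)
```

Under the model, fewer remaining steps mean a smaller variance, which is the pattern the table shows for the four MuJoCo tasks.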

We here define a distributional value function $V^{D, π} (s)$ that estimates the distribution of the return. Using quantile bars, $V^{D, π} (s)$ approximates this distribution as a normal distribution whose variance becomes smaller as the timestep increases, following the Markov Chain Central Limit Theorem (MC-CLT). In detail, while the value function $V^{π} (s_{t})$ estimates only the expected return from $s_{t}$ , we linearly decrease the variance of $V^{D, π}$ as the timestep increases, following the MC-CLT. We empirically verify this property for continuous control tasks (Table 1). If the number of output nodes of $V^{D, π} (s)$ is $N$ , then $V^{D, π} (s) = \{q_{0} (s), q_{1} (s), . . ., q_{N - 1} (s)\}$ . The mean of the approximated normal distribution, $q_{avg}$ , is the average of the quantile bars and is updated by the TD error or directly fitted to the returns as in general RL algorithms, whereas the variance is analytically computed.

V^{D, π} (s_{t}) = N (q_{avg} (s_{t}), σ^{2} (t)), \quad q_{avg} (s_{t}) = \frac{1}{N} \sum_{i = 0}^{N - 1} q_{i} (s_{t}), \quad σ^{2} : Z^{+} \to R^{+}

(7)

We use $σ_{min}^{2}$ as a minimum variance to prevent the target for $V^{D, π} (s)$ from having a negative or excessively small variance when an episode is longer than $l_{cur}$ . As the policy improves based on the current distributional value function, episodes become longer and the return increases, so the distributional value function must be updated for the improved return accordingly. Consequently, the variance of $V^{D, π}$ is reduced according to the recent episode length $l_{cur}$ and return $R_{cur}$ :

σ^{2} (t) = \max (\frac{σ_{min}^{2} - R_{cur}^{2}}{l_{cur}} t + R_{cur}^{2}, σ_{min}^{2}) .

(8)
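Equation 8 is a linear interpolation from $R_{cur}^{2}$ at $t = 0$ to $σ_{min}^{2}$ at $t = l_{cur}$ , clipped from below at $σ_{min}^{2}$ . A minimal Python sketch follows. The argument names mirror the symbols in the text, and the numbers in the usage lines are made up for illustration.

```python
def target_variance(t, l_cur, r_cur, sigma_min):
    """Equation 8: variance of the normal target at timestep t.

    l_cur is the recent episode length, r_cur the recent return, and
    sigma_min the smallest standard deviation the target may have.
    """
    slope = (sigma_min ** 2 - r_cur ** 2) / l_cur
    return max(slope * t + r_cur ** 2, sigma_min ** 2)


# Illustrative values only: a recent return of 10 over a 100-step episode.
start = target_variance(0, l_cur=100, r_cur=10.0, sigma_min=1.0)
middle = target_variance(50, l_cur=100, r_cur=10.0, sigma_min=1.0)
beyond = target_variance(150, l_cur=100, r_cur=10.0, sigma_min=1.0)
```

For timesteps past $l_{cur}$ the linear term would drop below $σ_{min}^{2}$ (and eventually below zero), so the `max` keeps the variance at the floor.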

Target for Distributional Value Function

We provide $V^{D, π}$ with the quantile bars of an approximated normal distribution built from the variance calculated above and the returns. To compute the target quantile bars, we first need the mean of the approximated normal distribution. The mean can be a TD target or the return of a trajectory. Here, we use the return $R_{t}$ for notational simplicity. To obtain a quantile target for $V^{D, π}$ , we first compute the $Z$ values of the standard normal distribution. For example, when $N = 4$ , $Z = \{- 0.841, - 0.253, 0.253, 0.841\}$ , i.e., $P (X \leq Z [i]) \approx 0.2 (i + 1) = \frac{i + 1}{N + 1}$ for $i \in \{0, 1, 2, 3\}$ , where $X$ is a standard normal random variable. Then we compute the target quantile bars $q_{t}^{'}$ of $V^{D, π} (s_{t})$ such that $q_{t}^{'} = \{R_{t} + σ (t) Z [i] \mid i \in \{0, 1, . . ., N - 1\}\}$ and fit $V^{D, π} (s_{t})$ to $q_{t}^{'}$ with the quantile Huber loss (Aravkin et al., 2014). The Huber loss (Huber, 1964) is as follows:

L_{κ} (u) = \begin{cases} \frac{1}{2} u^{2}, & \text{if } | u | < κ \\ κ (| u | - \frac{1}{2} κ), & \text{otherwise.} \end{cases}

(9)

The quantile Huber loss is an asymmetric variant of the Huber loss.

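\frac{1}{| D | T N} \sum_{τ \in D} \sum_{t = 0}^{T} \sum_{i = 0}^{N - 1} ρ_{τ_{i}}^{κ} (q_{i} (s_{t}) - q_{t, i}^{'}), \quad \text{where } ρ_{τ_{i}}^{κ} (u) = | τ_{i} - δ_{\{u < 0\}} | L_{κ} (u), \quad τ_{i} = \frac{i + 1}{N + 1} \text{ for } i \in \{0, 1, . . ., N - 1\}, \quad D = \{τ_{0}, τ_{1}, . . ., τ_{k - 1}\} \text{ is the trajectory set of policy } π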

(10)
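The target construction and the loss for a single state can be sketched in plain Python as follows. The sketch assumes quantile fractions $τ_{i} = \frac{i + 1}{N + 1}$ and the argument order of Equation 10 (prediction minus target); QR-DQN (Dabney et al., 2017) states its loss with the target minus the prediction. The function names are hypothetical, and the standard library's `statistics.NormalDist` supplies the inverse normal CDF.

```python
from statistics import NormalDist


def quantile_fractions(n):
    """tau_i = (i + 1) / (N + 1) for i = 0, ..., N - 1."""
    return [(i + 1) / (n + 1) for i in range(n)]


def normal_target_quantiles(mean, sigma, n):
    """Target bars q'_i = mean + sigma * Z[i], with Z[i] the tau_i-quantile of N(0, 1)."""
    std_normal = NormalDist()
    return [mean + sigma * std_normal.inv_cdf(tau) for tau in quantile_fractions(n)]


def huber(u, kappa=1.0):
    """Huber loss of Equation 9."""
    if abs(u) < kappa:
        return 0.5 * u * u
    return kappa * (abs(u) - 0.5 * kappa)


def quantile_huber_loss(pred, target, kappa=1.0):
    """Innermost sum of Equation 10 for one state, averaged over the N bars."""
    taus = quantile_fractions(len(pred))
    total = 0.0
    for tau, q, q_target in zip(taus, pred, target):
        u = q - q_target                     # argument order as written in Equation 10
        indicator = 1.0 if u < 0 else 0.0    # delta_{u < 0}
        total += abs(tau - indicator) * huber(u, kappa)
    return total / len(pred)


z_values = normal_target_quantiles(mean=0.0, sigma=1.0, n=4)   # the Z values for N = 4
```

With a return of `mean` and a standard deviation of `sigma` taken from Equation 8, `normal_target_quantiles` gives the bars $q_{t}^{'}$ ; the loss is zero exactly when every predicted bar equals its target.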

Figure 3: Training curves of MC-CLT PPO with bonus (blue), MC-CLT PPO without bonus (orange), DPPO (green), and PPO (purple) for 10 different random seeds on 8 continuous control benchmarks. Each solid line is the average returns and the shaded region represents the standard deviations.

Uncertainty Bonus

Algorithm 1 MC-CLT DRL with Normality Exploration Bonus

1: Input: policy $π$ , distributional value function $V^{D, π}$ , rollout buffer $B$
2: Initialize $π$ and $V^{D, π}$ randomly
3: Initialize $l_{cur}$ and $R_{cur}$ with a small number, and $t = 0$
4: for $i = 1$ to epoch_num
5:     for $j = 1$ to rollout_num
6:         $a_{t} \sim π (\cdot | s_{t})$
7:         $s_{t + 1} \sim P (\cdot | s_{t}, a_{t})$
8:         Compute $σ^{2} (t)$ and $σ_{avg}^{2}$ with $V^{D, π} (s_{t})$
9:         $R^{'} (s_{t}, a_{t}) = R (s_{t}, a_{t}) + α U (s_{t}, t)$
10:         Store ( $s_{t}, a_{t}, R^{'} (s_{t}, a_{t}), V^{D, π} (s_{t}), t$ ) in $B$
11:         $t = t + 1$
12:         if env is done
13:             Reset env and $t = 0$
14:     end for
15:     Update $l_{cur}$ and $R_{cur}$ based on the collected trajectories above
16:     Optimize $π$ with respect to PPO loss
17:     Compute target $q^{'}$ for $V^{D, π}$
18:     Optimize $V^{D, π}$ to fit $q^{'}$ with respect to the quantile Huber loss
19: end for

Since the target of $V^{D, π}$ is a distribution, we can use the properties of this distribution to calculate the exploration bonus. We use the KL divergence to measure the distance between the current prediction of $V^{D, π}$ and the normal distribution that it should approximate. Unlike optimistic exploration algorithms that pursue novelty, we evaluate how close the prediction of $V^{D, π}$ is to the target normal distribution in order to judge whether $V^{D, π}$ generalizes well to the visited states. To improve the policy, the value of the current policy must be estimated well. If the visited state is unfamiliar, or the generalization error incurred in updating $V^{D, π}$ is significant, the distance between $V^{D, π}$ and the target is large, and we encourage the agent to visit this state more often to reduce the uncertainty of $V^{D, π}$ .

Figure 2 illustrates this when the number of quantile bars is $N = 4$ . If the prediction of $V^{D, π}$ is valid, it should be symmetric about the mean of the quantile bars, and its variance should equal $σ^{2} (t)$ . We first find the mean $q_{avg} = \frac{1}{N} \sum_{i = 0}^{N - 1} q_{i}$ of the quantile bars and calculate the standard deviations $\{σ_{0}, σ_{1}, . . ., σ_{N - 1}\}$ , one per quantile bar, such that $P (Z \leq \frac{q_{i} - q_{avg}}{σ_{i}}) = \frac{i + 1}{N + 1}$ for $i \in \{0, 1, . . ., N - 1\}$ . The first approximated distribution is $N (q_{avg}, σ_{avg}^{2})$ , where $σ_{avg} = \frac{1}{N} \sum_{i = 0}^{N - 1} σ_{i}$ . The second is $N (q_{avg}, σ^{2} (t))$ , using the variance derived in the section above through the MC-CLT. The distance between these two distributions is measured with the KL divergence and added to the reward. However, a very large distance can dominate the reward; hence, we scale it with a constant $α$ . There are also cases, especially at the beginning of training, where the quantile bars are not ordered, i.e., $\exists i < j$ such that $q_{i} \geq q_{j}$ . In this case, we set the exploration bonus to $\frac{R_{t}}{l}$ , a reward per episode length.

U (s_{t}, t) = \begin{cases} \frac{R_{t}}{l}, & \text{if } \exists i < j \text{ such that } q_{i} \geq q_{j} \\ \min (D_{KL} (N (q_{avg}, σ^{2} (t)) \, \| \, N (q_{avg}, σ_{avg}^{2})), \frac{R_{t}}{l}), & \text{otherwise.} \end{cases}

(11)

R^{'} (s_{t}, a_{t}) = R (s_{t}, a_{t}) + α U (s_{t}, t)

(12)
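Equations 11 and 12 can be sketched in plain Python as below. The sketch uses the closed form of the KL divergence between two normal distributions that share a mean, $D_{KL} (N (m, v_{p}) \, \| \, N (m, v_{q})) = \frac{1}{2} (\ln \frac{v_{q}}{v_{p}} + \frac{v_{p}}{v_{q}} - 1)$ . Two choices are assumptions made here rather than details from the paper: a bar whose standard-normal quantile is zero (the median bar when $N$ is odd) is skipped when averaging $σ_{i}$ , and a non-positive $σ_{avg}$ falls back to the cap $\frac{R_{t}}{l}$ . All function names are hypothetical.

```python
import math
from statistics import NormalDist


def kl_same_mean_normals(var_p, var_q):
    """D_KL(N(m, var_p) || N(m, var_q)) for two normals that share their mean."""
    return 0.5 * (math.log(var_q / var_p) + var_p / var_q - 1.0)


def implied_sigma_avg(quantiles):
    """Average of sigma_i = (q_i - q_avg) / z_i, with z_i the tau_i-quantile of N(0, 1)."""
    n = len(quantiles)
    q_avg = sum(quantiles) / n
    std_normal = NormalDist()
    sigmas = []
    for i, q in enumerate(quantiles):
        z = std_normal.inv_cdf((i + 1) / (n + 1))
        if abs(z) > 1e-8:        # assumption: skip the median bar, where z_i = 0
            sigmas.append((q - q_avg) / z)
    return sum(sigmas) / len(sigmas)


def uncertainty_bonus(quantiles, target_var, reward_per_step):
    """Equation 11, with reward_per_step standing in for R_t / l."""
    ordered = all(a < b for a, b in zip(quantiles, quantiles[1:]))
    if not ordered:
        return reward_per_step
    sigma_avg = implied_sigma_avg(quantiles)
    if sigma_avg <= 0.0:         # assumption: fall back to the cap if no valid scale exists
        return reward_per_step
    return min(kl_same_mean_normals(target_var, sigma_avg ** 2), reward_per_step)


def shaped_reward(reward, bonus, alpha):
    """Equation 12: R'(s_t, a_t) = R(s_t, a_t) + alpha * U(s_t, t)."""
    return reward + alpha * bonus
```

When the predicted bars already match a normal distribution with variance $σ^{2} (t)$ , the KL term is zero and the reward is unchanged; unordered bars receive the capped bonus directly.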

We propose a normality-guided algorithm to improve policy by using $R^{'} (s, a)$ instead of $R (s, a)$ for PPO in Algorithm 1. This method can be applied to various policy update algorithms.

# of output nodes		$N = 200$	$N = 100$	$N = 50$	$N = 10$
Hopper-v3	Average return	$2429 \pm 490$	$2777 \pm 414$	$2786 \pm 642$	$2702 \pm 309$
Hopper-v3	Ordered ratio	$0.95 \pm 0.04$	$0.96 \pm 0.04$	$1.0 \pm 0.0$	$1.0 \pm 0.0$
Swimmer-v3	Average return	$120.9 \pm 4.77$	$126.3 \pm 3.29$	$122.0 \pm 2.13$	$128.7 \pm 3.98$
Swimmer-v3	Ordered ratio	$0.84 \pm 0.03$	$0.92 \pm 0.02$	$0.98 \pm 0.01$	$1.0 \pm 0.0$
BipedalWalker-v3	Average return	$205.9 \pm 7.75$	$184.2 \pm 23.8$	$187.9 \pm 21.9$	$209.4 \pm 5.78$
BipedalWalker-v3	Ordered ratio	$0.54 \pm 0.41$	$0.67 \pm 0.35$	$0.54 \pm 0.25$	$1.0 \pm 0.0$

Table 2: Average returns and ordered ratios of quantile outputs with standard deviations. For each entry, we use 3 different random seeds to train models and get the average returns and ordered ratios over 30 episodes. This result shows that a larger number of output nodes does not lead to better performance, whereas the ordered ratio tends to decrease. A small number of output nodes ( $N \leq 10$ ) almost always generates ordered quantile bars that conform to the quantile definition.

Experiments

We evaluate our method on eight continuous control tasks from OpenAI Gym Box2D (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012). In our experiments, we update the policy with Proximal Policy Optimization (PPO) (Schulman et al., 2017), and the advantage is computed by GAE (Schulman et al., 2015). We use PPO with a scalar value function and PPO with a distributional value function (DPPO) as baselines. The target for $V^{D, π} (s_{t})$ in DPPO is $R (s_{t}, a_{t}) + V^{D, π} (s_{t + 1})$ , similar to Equation 4; that is, the target quantile bars are the quantile bars of $V^{D, π} (s_{t + 1})$ shifted by $R (s_{t}, a_{t})$ . Additionally, we train our model without the exploration bonus (MC-CLT PPO without bonus) and compare it against the full method (MC-CLT PPO with bonus) to show the effectiveness of the normality exploration strategy.

Figure 3 shows the experimental results for the eight environments. Each plot shows the average return and standard error of models trained with 10 random seeds. PPO performs worst in all domains except BipedalWalkerHardcore-v3. MC-CLT PPO without bonus is comparable to or better than DPPO (except in BipedalWalkerHardcore-v3), which suggests that it is practical to provide the target of $V^{D, π}$ by analytically calculating the standard deviation. Finally, MC-CLT PPO with the normality exploration bonus performs best in all domains. Through normality exploration, the policy makes better improvements using a better critic.

For a fair comparison with PPO, hyperparameters such as the learning rate and batch size were fixed, except for the network size of $V^{D, π}$ . All policies use a two-layer tanh network with 64 x 32 units and the value function of PPO uses a two-layer ReLU network with 64 x 64 units, whereas all distributional value functions use a two-layer ReLU network with 512 x 512 units. All networks are updated with the Adam optimizer (Kingma and Ba, 2014). Our implementation is based on Spinning Up (Achiam, 2018).

We also observe that a large number of output nodes ( $N = 100 \sim 200$ ) for the quantile representation, as used in Stable Baselines3’s QR-DQN (Raffin et al., 2021), does not lead to a noticeable performance improvement. Instead, a large number of output nodes causes the quantile bars to become unordered, contradicting the definition of a quantile. For the above continuous tasks, we use a relatively small number of output nodes ( $N = 8 \sim 12$ ) for MC-CLT PPO and DPPO. Using fewer output nodes is advantageous in terms of computational cost and yields more stably ordered quantile bars. Table 2 shows the average returns and the ratios at which quantile outputs are ordered over multiple episodes according to the number of output nodes.
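One way to compute an ordered ratio is sketched below. It assumes that a quantile output counts as ordered when its bars are strictly increasing, which is the negation of the condition $\exists i < j$ such that $q_{i} \geq q_{j}$ in Equation 11, and that the ratio is the fraction of ordered outputs among the visited states. The exact bookkeeping behind Table 2 is not specified in the text, so the function names and the aggregation are assumptions.

```python
def is_ordered(quantiles):
    """True when the bars are strictly increasing, i.e. no i < j has q_i >= q_j."""
    return all(a < b for a, b in zip(quantiles, quantiles[1:]))


def ordered_ratio(outputs):
    """Fraction of quantile outputs (one list of bars per visited state) that are ordered."""
    return sum(is_ordered(q) for q in outputs) / len(outputs)


# Made-up outputs for four states: two ordered, one swapped pair, one tie.
example_outputs = [[1, 2, 3], [1, 3, 2], [0, 0, 1], [0, 1, 2]]
example_ratio = ordered_ratio(example_outputs)
```

Checking adjacent bars is sufficient, since a strictly increasing sequence rules out every pair $i < j$ with $q_{i} \geq q_{j}$ .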

Conclusion

We have presented a DRL method in which the output quantiles approximate the value distribution as a normal distribution whose standard deviation is computed analytically as a function of the timestep, under mild assumptions for the Markov Chain Central Limit Theorem (Jones, 2004). Existing DRL algorithms do not control the standard deviation and update the distributional value function with the distributional Bellman optimality operator, whose discount factor yields a spread that grows with the timestep. In contrast, based on empirical results, we assume that the standard deviation should gradually decrease with the timestep, and we propose an exploration strategy that uses the distance between the target distribution and the learned distributional value function. This exploration encourages an agent to visit states that have not been explored enough, making the value function more accurate and enabling better policy improvement. Our method consistently outperformed the baselines in our evaluations.

Although our method shows promising results, there is room for improvement. First, hyperparameters such as $σ_{min}$ and $α$ vary for each task, so it would be useful to tune them dynamically. For instance, $σ_{min}$ could be updated according to the set of collected trajectories. Second, the exploration bonus can be used to modify a sample’s gradient. If the bonus is greater than a threshold, we might consider the sample too uncertain and reduce its gradient. Finally, we calculate the standard deviation using the actual timestep from the environment, but since some tasks revisit the same states repeatedly, the timestep may not be appropriate for calculating a standard deviation. Instead, we could train a neural network that predicts the timestep for a given state and use this pseudo-timestep to compute the standard deviation.

Reproducibility Statement

We provide the hyperparameters used in our evaluations and all source code is available at https://github.com/shashacks/MC_CLT_Sumbission.

References

J. Achiam (2018) Spinning Up in Deep Reinforcement Learning.
Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans (2019) Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pp. 151–160.
A. Y. Aravkin, A. Kambadur, A. C. Lozano, and R. Luss (2014) Sparse quantile Huber regression for efficient and robust estimation. ArXiv abs/1402.4624.
G. Barth-Maron, M. W. Hoffman, D. Budden, W. Dabney, D. Horgan, D. TB, A. Muldal, N. Heess, and T. P. Lillicrap (2018) Distributed distributional deterministic policy gradients. CoRR abs/1804.08617.
M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling (2013) The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
M. G. Bellemare, W. Dabney, and R. Munos (2017a) A distributional perspective on reinforcement learning. CoRR abs/1707.06887.
M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, and R. Munos (2017b) The Cramer distance as a solution to biased Wasserstein gradients. CoRR abs/1705.10743.
M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos (2016) Unifying count-based exploration and intrinsic motivation. CoRR abs/1606.01868.
G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. CoRR abs/1606.01540.
Y. Burda, H. Edwards, A. J. Storkey, and O. Klimov (2018) Exploration by random network distillation. CoRR abs/1810.12894.
W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos (2017) Distributional reinforcement learning with quantile regression. CoRR abs/1710.10044.
J. Fu, J. D. Co-Reyes, and S. Levine (2017) EX2: exploration with exemplar models for deep reinforcement learning. CoRR abs/1703.01260.
S. Gu, T. Lillicrap, Z. Ghahramani, R. E. Turner, and S. Levine (2016) Q-prop: sample-efficient policy gradient with an off-policy critic. arXiv preprint arXiv:1611.02247.
R. Houthooft, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel (2016) Curiosity-driven exploration in deep reinforcement learning via Bayesian neural networks. CoRR abs/1605.09674.
P. J. Huber (1964) Robust estimation of a location parameter. The Annals of Mathematical Statistics 35 (1), pp. 73–101.
G. L. Jones (2004) On the Markov chain central limit theorem. Probability Surveys 1, pp. 299–320.
D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
Z. Liu, X. Li, B. Kang, and T. Darrell (2019) Regularization matters in policy optimization. arXiv preprint arXiv:1910.09191.
B. Mavrin, S. Zhang, H. Yao, L. Kong, K. Wu, and Y. Yu (2019) Distributional reinforcement learning for efficient exploration. CoRR abs/1905.06125.
V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
I. Osband, C. Blundell, A. Pritzel, and B. V. Roy (2016) Deep exploration via bootstrapped DQN. CoRR abs/1602.04621.
M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann (2021) Stable-baselines3: reliable reinforcement learning implementations. Journal of Machine Learning Research 22 (268), pp. 1–8.
J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel (2015) High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347.
R. Singh, K. Lee, and Y. Chen (2020) Sample-based distributional policy gradient. CoRR abs/2001.02652.
A. L. Strehl and M. L. Littman (2005) A theoretical analysis of model-based interval estimation. In ICML, pp. 856–863.
R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT Press.
H. Tang, R. Houthooft, D. Foote, A. Stooke, X. Chen, Y. Duan, J. Schulman, F. D. Turck, and P. Abbeel (2016) #Exploration: a study of count-based exploration for deep reinforcement learning. CoRR abs/1611.04717.
P. Thomas (2014) Bias in natural actor-critic algorithms. In International Conference on Machine Learning, pp. 441–448.
E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033.
C. J. Watkins and P. Dayan (1992) Q-learning. Machine Learning 8 (3), pp. 279–292.
Y. Yue, Z. Wang, and M. Zhou (2020) Implicit distributional reinforcement learning. CoRR abs/2007.06159.