An Analysis of Abstracted Model-Based Reinforcement Learning

Rolf A. N. Starre
Department of Intelligent Systems
Delft University of Technology
R.A.N.Starre@tudelft.nl
Marco Loog
Department of Intelligent Systems
Delft University of Technology
M.Loog@tudelft.nl
Frans A. Oliehoek
Department of Intelligent Systems
Delft University of Technology
F.A.Oliehoek@tudelft.nl

Abstract

Many methods for model-based reinforcement learning (MBRL) provide guarantees for both the accuracy of the Markov decision process (MDP) model they can deliver and their learning efficiency. At the same time, state abstraction techniques allow for a reduction of the size of an MDP while maintaining a bounded loss with respect to the original problem. It may come as a surprise, therefore, that no such guarantees are available when combining both techniques, i.e., where MBRL merely observes abstract states. Our theoretical analysis shows that abstraction can introduce a dependence between samples collected online (e.g., in the real world), which means that most results for MBRL cannot be directly extended to this setting. The new results in this work show that concentration inequalities for martingales can be used to overcome this problem and allow for extending the results of algorithms such as R-MAX to the setting with abstraction, thus producing the first performance guarantees for ‘Abstracted RL’: model-based reinforcement learning with an abstracted model.

1 Introduction

Tabular Model-based Reinforcement Learning (MBRL) methods provide guarantees that show they can learn efficiently in Markov decision processes (MDPs) Brafman and Tennenholtz (2002); Strehl and Littman (2008); Jaksch et al. (2010). They do this by finding solutions to a fundamental problem for Reinforcement Learning (RL), the exploration-exploitation dilemma: when to take actions to obtain more information, and when to take actions that maximize reward based on the current knowledge. However, MDPs can be very large, which can be problematic for tabular methods. One way to deal with this is by using abstractions, such as temporal abstractions Sutton et al. (1999) or state abstractions Li (2009); Abel et al. (2016). State abstractions can also be seen as a special case of function approximation, where every state maps to its abstract state Mahadevan (2010). Here we analyze the case where the abstraction function $ϕ$ is an approximate model similarity abstraction Abel et al. (2016) that maps states to abstract states. In our setting we assume that the environment returns states $s$ , but the agent only observes abstract states $ϕ (s)$ (see Figure 1). This setting, which was considered before Ortner et al. (2014a); Abel et al. (2018) (we refer to Section 4 for a comparison with the related work), is what we call Abstracted RL.

Abstracted RL corresponds to RL in a Partially Observable MDP (POMDP) Kaelbling et al. (1998), as previously described Bai et al. (2016). It is well known that policies for POMDPs that only base their action on the last observation $ϕ (s)$ could be arbitrarily bad Singh et al. (1994). However, when $ϕ$ is an exact model similarity abstraction Li (2009) (also known as stochastic bisimulation Givan et al. (2003)), an abstraction that maps states to the same abstract state only when their reward and transition functions in the abstract space are the same, the resulting problem can be considered an MDP and this worst case does not apply Li et al. (2006). Therefore, intuitively, one may expect that when $ϕ$ is an approximate model similarity abstraction this worst case also does not apply, since this abstraction maps states to the same abstract state only when their reward and transition functions in the abstract space are close. However, to prove efficient learning, MBRL methods typically (e.g., Strehl and Littman (2008); Jaksch et al. (2010)) use results that rely on the assumption of independent and identically distributed (i.i.d.) samples, such as Theorem 2.1 from Weissman et al. (2003). We analyze collecting samples in Abstracted RL, using a generic state abstraction function, and show that the abstraction can cause the samples to become dependent, which means that most guarantees of existing MBRL methods do not hold in the online Abstracted RL setting. (The reader might be puzzled by this statement, since certain guarantees on the combination of abstraction and RL are known. This can be explained by the generality of Abstracted RL: in this setting there is a non-stationarity caused by the clustering of states with different dynamics. There is a lot of related work in other abstraction settings (e.g., abstraction selection) where this complication does not occur due to the particularities of their setting Paduraru et al. (2008); Hallak et al. (2013); Maillard et al. (2013); Majeed and Hutter (2018); Ortner et al. (2019); Du et al. (2019). In Section 4 we give details to back up our claim for individual papers.)

The primary technical results in this work show that concentration inequalities that rely on independent samples can be replaced with a concentration inequality for martingales to guarantee that an accurate model can be learned in Abstracted RL. With an approximate model similarity abstraction, this allows for extending results of MBRL methods to Abstracted RL, showing efficient learning.

The outline is as follows. First, Sections 2.1 and 2.2 introduce MBRL and state abstraction, respectively. Section 3.1 analyzes sample collection in Abstracted RL and shows that in this setting samples cannot be assumed to be independent. Section 3.2 gives the main results: inequalities for martingales can be used to guarantee accurate model learning in Abstracted RL, and this allows extending the results of MBRL methods to Abstracted RL. Section 4 covers related work and, finally, Section 5 closes with conclusions.

2 Preliminaries

As is common for RL problems, we assume the environment the agent is acting in can be represented by an infinite horizon MDP $M \coloneqq ⟨ S, A, T, R ⟩$ Puterman (2014), where $S$ is a finite set of states $s \in S$ , $A$ a finite set of actions $a \in A$ , $T$ a transition function $T (s^{'} | s, a) = Pr (s^{'} | s, a)$ , and $R$ a reward function $R (s, a)$ which gives the reward received when the agent executes action $a$ in state $s$ .

In RL the goal of the agent is to find an optimal policy $π^{*} : S \to A$ which maximizes the expectation of the cumulative reward. $V^{π} (s)$ denotes the expected value of the cumulative reward under policy $π$ starting from state $s$ . Similarly, $Q^{π} (s, a)$ denotes the expected value of the cumulative reward when first taking action $a$ from state $s$ and then following policy $π$ .

2.1 Model-Based RL

MBRL methods learn a model from the experience that is gained by the agent acting in the MDP. For a fixed state-action pair $(s, a)$ let $τ_{1}, τ_{2}, \dots, τ_{N (s, a)}$ be the first $N (s, a)$ time steps at which action $a$ has been chosen in state $s$ , then the obtained experience for $(s, a)$ can be written as the sequence $Y_{s, a} \coloneqq (s^{' (τ_{1} + 1)}, s^{' (τ_{2} + 1)}, \dots, s^{' (τ_{N (s, a)} + 1)})$ . $Y_{s, a}$ stores the next-states $s^{'}$ reached after taking action $a$ from state $s$ , where $N (s, a)$ is the number of times we have taken action $a$ from state $s$ . We use $Y$ to refer to the collection of all $Y_{s, a}$ . The obtained experience can be used to construct the empirical model $T_{Y}$ , often used in MBRL Brafman and Tennenholtz (2002); Strehl and Littman (2008); Jaksch et al. (2010). The empirical model is constructed simply by counting the number of times a particular next-state has been observed and normalizing the obtained quantity by the total count:

\forall s^{'} \in S : T_{Y} (s^{'} | s, a) ≜ \frac{1}{N (s, a)} \sum_{i = 1}^{N (s, a)} 1 \{Y_{s, a}^{(τ_{i} + 1)} = s^{'}\} .

(1)

Here $1 \{\cdot\}$ denotes the indicator function of the specified event, i.e., it is $1$ if $Y_{s, a}^{(τ_{i} + 1)} = s^{'}$ and $0$ otherwise.
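As a small illustration of (1), the counting construction can be sketched as follows. The function name `empirical_model` and the five recorded next-states are hypothetical and serve only to make the normalization explicit.

```python
from collections import Counter


def empirical_model(next_states, states):
    """Empirical T_Y(. | s, a) from the next-states recorded for one (s, a) pair."""
    n = len(next_states)
    if n == 0:
        raise ValueError("no samples recorded for this state-action pair")
    counts = Counter(next_states)
    # Normalise each count by N(s, a); unseen next-states get probability 0.
    return {s_next: counts[s_next] / n for s_next in states}


# Hypothetical experience Y_{s,a}: next-states observed on five visits to (s, a).
Y_sa = [3, 4, 3, 3, 4]
T_hat = empirical_model(Y_sa, states=[1, 2, 3, 4])
```

Each entry of `T_hat` is the fraction of recorded visits that ended in the corresponding next-state, so the entries sum to one by construction.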

Knowledge about the quality of the empirical model is a crucial element in any performance guarantee, irrespective of whether this is in terms of, for instance, PAC-MDP Strehl and Littman (2008) or regret Jaksch et al. (2010). Concentration inequalities, such as Theorem 2.1 from Weissman et al. (2003), are often used to give finite-sample guarantees on the accuracy of the empirical model $T_{Y}$ . However, these inequalities typically make use of the fact that samples are i.i.d. In most MBRL settings this is not a problem under some assumptions, e.g. when the MDP is communicating, i.e., when for all $s_{1}, s_{2} \in S$ there exists a deterministic policy that eventually leads from $s_{1}$ to $s_{2}$ Puterman (2014). In this case, due to the Markov property, the obtained samples are i.i.d.

In general, of course, the hope is that with enough samples the empirical model $T_{Y}$ becomes accurate. By accurate we mean that the distance between $T_{Y} (\cdot | s, a)$ and $T (\cdot | s, a)$ is small. This distance can, for instance, be measured using the $L_{1}$ norm, defined as:

| | T_{Y} (\cdot | s, a) - T (\cdot | s, a) | |_{1} ≜ \sum_{s^{'} \in S} | T_{Y} (s^{'} | s, a) - T (s^{'} | s, a) | .

(2)

Part of Theorem 2.1 from Weissman et al. (2003) then gives a guarantee of accuracy for the empirical model:

Lemma 1 ( $L_{1}$ inequality Weissman et al. (2003)).

Let $Y_{s, a} = Y^{(τ_{1} + 1)}, Y^{(τ_{2} + 1)}, \dots, Y^{(τ_{N (s, a)} + 1)}$ be i.i.d. random variables distributed according to $T (\cdot | s, a)$ . Then, for all $ϵ > 0$ ,

Pr (| | T_{Y} (\cdot | s, a) - T (\cdot | s, a) | |_{1} \geq ϵ) \leq (2^{| S |} - 2) e^{- \frac{1}{2} N (s, a) ϵ^{2}} .

(3)

In this way, MBRL can upper bound the probability that, given enough samples, the empirical model $T_{Y} (\cdot | s, a)$ will be far away ( $\geq ϵ$ ) from the true model $T (\cdot | s, a)$ .
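To get a feeling for how quickly the right-hand side of (3) shrinks, it can be evaluated numerically. The sketch below uses illustrative values ($| S | = 4$ , $ϵ = 0.2$ ) that are not taken from the paper, and caps the result at 1 since it bounds a probability.

```python
import math


def l1_failure_bound(n_states, n_samples, eps):
    """Right-hand side of (3), capped at 1 because it bounds a probability."""
    bound = (2 ** n_states - 2) * math.exp(-0.5 * n_samples * eps ** 2)
    return min(1.0, bound)


# Illustrative values only: |S| = 4 and eps = 0.2.
for n in (100, 500, 1000):
    print(n, l1_failure_bound(n_states=4, n_samples=n, eps=0.2))
```

For few samples the bound is vacuous (it exceeds 1 before capping), and it then decays exponentially in $N (s, a)$ .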

2.2 State abstraction for Known Models

In the planning setting, where the model is known a priori, a state abstraction can be formulated as an aggregation or mapping from states to abstract states Li et al. (2006). This is done with an abstraction function $ϕ$ , a surjective function that maps from states, $s \in S$ , to abstract states $¯ s \in ¯ S$ : $ϕ : S \to ¯ S$ . We use the $¯$ notation to refer to the abstract space, and define $¯ S$ as $¯ S = \{ϕ (s) | s \in S\}$ . We slightly overload notation and also let $¯ s$ denote the set of states that map to $¯ s$ , i.e., $¯ s = \{s \in S | ϕ (s) = ¯ s\}$ . The use should be clear from the context.

This is a general form of state abstraction, that clusters together states with different dynamics into abstract states. Note that we do assume that the given state abstraction deterministically maps states to an abstract state, meaning that the abstract state space is at most the size of the original state space, $| ¯ S | \leq | S |$ . This is in contrast to some related work on problems with block structure Du et al. (2019), where a Markov state can lead to multiple observations (abstract states in our terminology) that need to be aggregated appropriately to result in a small MDP Azizzadenesheli et al. (2016); Du et al. (2019).

Many different methods of state abstraction exist Li (2009); Abel et al. (2016); here we focus on approximate model similarity abstraction Abel et al. (2016), also known as approximate stochastic bisimulation Dean et al. (1997); Givan et al. (2003). In this abstraction, two states can map to the same abstract state if their behavior is similar in the abstract space, i.e., when the reward function and the transitions to abstract states are close. The transition probability to an abstract state, $Pr ({¯ s}^{'} | s, a)$ , can be determined as:

Pr ({¯ s}^{'} | s, a) = \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) .

(4)

Then, approximate model similarity is defined as:

Definition 1.

An approximate model similarity abstraction, $ϕ_{model, η_{R}, η_{T}}$ , for fixed $η_{R}, η_{T}$ , satisfies

ϕ_{model, η_{R}, η_{T}} (s_{1}) = ϕ_{model, η_{R}, η_{T}} (s_{2}) ⟹ \forall a \in A : | R (s_{1}, a) - R (s_{2}, a) | \leq η_{R}, \quad \forall {¯ s}^{'} \in ¯ S, a \in A : | Pr ({¯ s}^{'} | s_{1}, a) - Pr ({¯ s}^{'} | s_{2}, a) | \leq η_{T} .

(5)

From now on we will just refer to $ϕ_{model, η_{R}, η_{T}}$ as $ϕ$ .
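Condition (5) can be checked mechanically when the model is known. The following is a minimal sketch under stated assumptions: the dictionary-based encoding of $T$ , $R$ and $ϕ$ , the function names, and the four-state, one-action MDP with its reward values are all hypothetical and chosen only for illustration.

```python
def abstract_transition(T, phi, s, a):
    """Pr(s_bar' | s, a) as in (4): sum T(s' | s, a) over the s' inside each abstract state."""
    probs = {}
    for s_next, p in T[(s, a)].items():
        probs[phi[s_next]] = probs.get(phi[s_next], 0.0) + p
    return probs


def satisfies_definition_1(T, R, phi, actions, eta_R, eta_T):
    """Check condition (5) for every pair of states that phi maps together."""
    states = list(phi)
    abstract_states = set(phi.values())
    for i, s1 in enumerate(states):
        for s2 in states[i + 1:]:
            if phi[s1] != phi[s2]:
                continue  # (5) only constrains states sharing an abstract state
            for a in actions:
                if abs(R[(s1, a)] - R[(s2, a)]) > eta_R:
                    return False
                p1 = abstract_transition(T, phi, s1, a)
                p2 = abstract_transition(T, phi, s2, a)
                if any(abs(p1.get(sb, 0.0) - p2.get(sb, 0.0)) > eta_T
                       for sb in abstract_states):
                    return False
    return True


# Hypothetical MDP: abstract states A = {1, 2}, B = {3}, C = {4}, one action (0).
phi = {1: "A", 2: "A", 3: "B", 4: "C"}
T = {(1, 0): {3: 0.6, 4: 0.4}, (2, 0): {3: 0.4, 4: 0.6},
     (3, 0): {1: 1.0}, (4, 0): {2: 1.0}}
R = {(1, 0): 0.0, (2, 0): 0.05, (3, 0): 1.0, (4, 0): 0.0}
```

In this example states 1 and 2 differ by 0.2 in their transition probabilities to B and C, so the check passes for $η_{T} = 0.25$ and fails for $η_{T} = 0.1$ .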

We note that the abstraction we consider is still quite generic: it can cluster together states that have different transition and reward functions. However, in the online Abstracted RL setting, the differences in dynamics can cause a dependence between the samples, as we will show in detail in Section 3. For example, for an abstract state-action pair $(¯ s, a)$ , the probability that we reach a next state $s^{'}$ depends both on the probability that we are in a particular state $s \in ¯ s$ and on the probability of reaching $s^{'}$ from $s$ .

In the planning setting, when the model is known, the abstraction function $ϕ$ can be used to construct an abstract MDP. This can be useful because the abstract MDP is smaller, making it easier to find a solution, and such a solution can work well in the original MDP Li et al. (2006); Abel et al. (2016). An abstract MDP ${¯ M}_{ω}$ is constructed from the model of an MDP $M$ , an abstraction function $ϕ$ , and an action-specific weighting function $ω$ , with $\forall s \in S, a \in A : 0 \leq ω (s, a) \leq 1$ , where the weights of the states associated with an abstract state $¯ s$ sum up to 1: $\sum_{s^{'} \in ϕ (s)} ω (s^{'}, a) = 1$ . (The action-specific weighting function is more general than the weighting function that is typically used, e.g. by Li et al. (2006), which is not action-specific, i.e., it only depends on the state $s$ ; more formally, that is the case where $\forall a, a^{'} \in A, s \in S : ω (s, a) = ω (s, a^{'})$ .) The weighting function can be used to create abstract transition and reward functions, which are a weighted average of the original transition and reward functions. In this way, from $M$ , $ϕ$ and any $ω$ we can construct an abstract MDP ${¯ M}_{ω}$ :

Definition 2.

Given an MDP $M$ , $ϕ$ , and $ω$ , ${¯ M}_{ω} = ⟨ ¯ S, A, {¯ T}_{ω}, {¯ R}_{ω} ⟩$ is constructed as: $¯ S = {ϕ (s) | s \in S}, A = A,$

\forall ¯ s \in ¯ S, a \in A : {¯ R}_{ω} (¯ s, a) = \sum_{s \in ¯ s} ω (s, a) R (s, a),

(6)

\forall ¯ s, {¯ s}^{'} \in ¯ S, a \in A : {¯ T}_{ω} ({¯ s}^{'} | ¯ s, a) = \sum_{s \in ¯ s} \sum_{s^{'} \in {¯ s}^{'}} ω (s, a) T (s^{'} | s, a) .

(7)
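The construction in (6) and (7) amounts to a weighted average over the ground states inside each abstract state. A minimal sketch, assuming a dictionary encoding of the model and a hypothetical four-state, one-action MDP with uniform weights inside abstract state A:

```python
def build_abstract_mdp(T, R, phi, omega, actions):
    """Abstract reward (6) and transition (7) functions for a given weighting omega."""
    T_bar, R_bar = {}, {}
    for s, s_bar in phi.items():
        for a in actions:
            w = omega[(s, a)]
            # (6): weighted average of the ground rewards inside s_bar.
            R_bar[(s_bar, a)] = R_bar.get((s_bar, a), 0.0) + w * R[(s, a)]
            # (7): weighted average of the ground transitions, grouped by phi(s').
            row = T_bar.setdefault((s_bar, a), {})
            for s_next, p in T[(s, a)].items():
                row[phi[s_next]] = row.get(phi[s_next], 0.0) + w * p
    return T_bar, R_bar


# Hypothetical MDP: abstract states A = {1, 2}, B = {3}, C = {4}, one action (0).
phi = {1: "A", 2: "A", 3: "B", 4: "C"}
T = {(1, 0): {3: 0.6, 4: 0.4}, (2, 0): {3: 0.4, 4: 0.6},
     (3, 0): {1: 1.0}, (4, 0): {2: 1.0}}
R = {(1, 0): 0.0, (2, 0): 0.05, (3, 0): 1.0, (4, 0): 0.0}
# Weights must sum to 1 within each abstract state, for every action.
omega = {(1, 0): 0.5, (2, 0): 0.5, (3, 0): 1.0, (4, 0): 1.0}

T_bar, R_bar = build_abstract_mdp(T, R, phi, omega, actions=[0])
```

A different choice of `omega` yields a different abstract MDP from the same $M$ and $ϕ$ , which is why the weighting appears as a subscript in ${¯ M}_{ω}$ .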

Note that the abstract MDP ${¯ M}_{ω}$ itself is an MDP, which means we can use any planning method we like to find an optimal policy ${¯ π}^{*}$ for ${¯ M}_{ω}$ . A desirable property of the approximate model similarity abstraction is that we can upper bound the loss in value due to using ${¯ π}^{*}$ , an optimal policy for ${¯ M}_{ω}$ , in $M$ instead of using the optimal solution for $M$ Dearden and Boutilier (1997); Abel et al. (2016); Taïga et al. (2018). For example, for the discounted infinite horizon we have:

Lemma 2 (Lemma 4 Taïga et al. (2018)).

An approximate model similarity abstraction (Definition 1), has sub-optimality bounded in $η$ : $\forall s \in S : V^{*} (s) - V^{{¯ π}^{*}} (s) \leq \frac{2 η + 2 γ (| ¯ S | - 1) η}{(1 - γ)^{2}}$ .

Here $η = η_{R} = η_{T}$ , and $γ$ is the discount factor Puterman (2014). Note that any policy on the abstract space $¯ π$ can be used in $M$ as follows: $¯ π (s) \coloneqq ¯ π (ϕ (s))$ .

3 Abstracted MBRL

As explained, we are interested in Abstracted RL, where we have an approximate model similarity abstraction function $ϕ$ and an MDP $M$ , but do not know the transition and reward functions of $M$ . To illustrate the impact of our results most clearly, we analyze the R-MAX algorithm Brafman and Tennenholtz (2002), a well-known and relatively straightforward method that guarantees sample efficient learning, in the Abstracted RL setting. The procedure is shown in Algorithm 1. We make the following assumptions, which stem from the original analysis: we assume that the MDP is ergodic Puterman (2014), i.e., that under every stationary policy every state is recurrent (asymptotically, every state will be visited infinitely often), that we know $S$ and $A$ , that the reward function is deterministic, and that we know the minimum and maximum reward. W.l.o.g. we assume the rewards are between $0$ and $1$ . We add the assumption that the agent has access to an approximate model similarity abstraction $ϕ$ .

As in the original, the input to the algorithm is $δ$ , the allowed failure probability, $ϵ$ , the error bound, and $T_{ϵ}$ , the $ϵ$ -return mixing time of an optimal policy Kearns and Singh (2002). The $ϵ$ -return mixing time for a policy $π$ is the minimum number of steps needed to guarantee that the expected return of the policy is at least $V^{π} - ϵ$ . New is that the algorithm also receives as input the abstraction function $ϕ$ (and thus knows $¯ S$ ). The agent acts in $M$ but only observes $ϕ (s)$ , as in Figure 1, and builds an empirical (abstracted) model from the observations it obtains. Because of the abstraction, we will not be able to guarantee the $ϵ$ error bound. However, we are able to guarantee an error bound that is a function of $ϵ$ and the error of the abstraction $η$ . This means that if the abstraction is a close approximation, the algorithm will still be able to obtain near-optimal performance in a number of steps polynomial in $| ¯ S |, | A |, T_{ϵ}, \frac{1}{ϵ}$ , and $\frac{1}{δ}$ .

The agent collects data for every abstract state-action pair $(¯ s, a)$ , which is stored as sequences ${¯ Y}_{¯ s, a}$ :

{¯ Y}_{¯ s, a} : {{¯ s}^{' (τ_{1} + 1)}, {¯ s}^{' (τ_{2} + 1)}, \dots, {¯ s}^{' (τ_{N (¯ s, a)} + 1)}} .

(8)

Similar to before in (1), we construct an empirical model ${¯ T}_{Y}$ , now looking at the abstract next-states that were reached:

{¯ T}_{Y} ({¯ s}^{'} | ¯ s, a) ≜ \frac{1}{N (¯ s, a)} \sum_{i = 1}^{N (¯ s, a)} 1 \{{¯ Y}_{¯ s, a}^{(τ_{i} + 1)} = {¯ s}^{'}\} .

(9)

If the empirical model ${¯ T}_{Y}$ were equal, or close, to the transition function ${¯ T}_{ω}$ of an abstract MDP ${¯ M}_{ω}$ , constructed from the true MDP with $ϕ$ and a valid $ω$ , we could upper bound the loss in performance due to applying the learned policy ${¯ π}^{*}$ to $M$ instead of the optimal policy $π^{*}$ Abel et al. (2016); Taïga et al. (2018).

Our main question is: do the finite-sample model learning guarantees of MBRL algorithms still hold in the Abstracted RL setting?

Input: $ϕ, δ, ϵ, T_{ϵ}$
for all $(¯ s, a) \in ¯ S \times A$ do
    ${¯ T}_{Y} (¯ s | ¯ s, a) = 1$
    ${¯ R}_{Y} (¯ s, a) = 1$
    ${¯ Y}_{¯ s, a} = []$
end for
${¯ M}_{Y} = ⟨ ¯ S, A, {¯ T}_{Y}, {¯ R}_{Y} ⟩$
Select $m$ , the number of samples required per state-action pair.
while ${min}_{(¯ s, a)} | {¯ Y}_{¯ s, a} | < m$ do
    Compute optimal $T_{ϵ}$ -step policy $¯ π$ in ${¯ M}_{Y}$ for the current abstract state.
    for $T$ timesteps do
        $¯ s = ϕ (s)$
        $a = ¯ π (¯ s)$
        $s^{'}, r =$ Step( $s, a$ )
        $s = s^{'}$
        if $| {¯ Y}_{¯ s, a} | < m$ then
            ${¯ Y}_{¯ s, a}$ .append( $ϕ (s^{'})$ )
            if $| {¯ Y}_{¯ s, a} | = 1$ then
                ${¯ R}_{Y} (¯ s, a) = r$
            end if
            if $| {¯ Y}_{¯ s, a} | = m$ then
                ${¯ T}_{Y} ({¯ s}^{'} | ¯ s, a) = \frac{1}{m} \sum_{i = 1}^{m} 1 \{{¯ Y}_{¯ s, a}^{(τ_{i} + 1)} = {¯ s}^{'}\}$
            end if
        end if
    end for
end while
Compute optimal policy ${¯ π}^{*}$ for $¯ M$ and run indefinitely.

Algorithm 1 Procedure: Abstracted R-MAX
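The data-collection loop of Algorithm 1 can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: it assumes a stationary greedy policy obtained from `horizon` steps of value iteration in place of the $T_{ϵ}$ -step policy, uses `horizon` for the length of the inner loop, picks $m$ arbitrarily, and runs on a hypothetical four-state, one-action MDP (states indexed from 0) with zero rewards. The simulator knows the ground state, but the agent only uses $ϕ (s)$ .

```python
import random


def plan(T_bar, R_bar, n_abs, n_act, horizon):
    """Greedy policy after `horizon` steps of value iteration on the abstract model."""
    V = [0.0] * n_abs
    Q = [[0.0] * n_act for _ in range(n_abs)]
    for _ in range(horizon):
        Q = [[R_bar[sb][a] + sum(T_bar[sb][a][nb] * V[nb] for nb in range(n_abs))
              for a in range(n_act)] for sb in range(n_abs)]
        V = [max(q) for q in Q]
    return [q.index(max(q)) for q in Q]


def abstracted_rmax_collect(T, R, phi, n_abs, m, horizon, seed=0, max_rounds=100_000):
    rng = random.Random(seed)
    n_states, n_act = len(T), len(T[0])
    # Optimistic initialisation: every abstract pair is a self-loop with reward 1.
    T_bar = [[[1.0 if nb == sb else 0.0 for nb in range(n_abs)]
              for _ in range(n_act)] for sb in range(n_abs)]
    R_bar = [[1.0] * n_act for _ in range(n_abs)]
    Y = [[[] for _ in range(n_act)] for _ in range(n_abs)]
    s = 0  # ground state, hidden from the agent
    for _ in range(max_rounds):
        if min(len(Y[sb][a]) for sb in range(n_abs) for a in range(n_act)) >= m:
            break
        policy = plan(T_bar, R_bar, n_abs, n_act, horizon)
        for _ in range(horizon):
            sb = phi[s]
            a = policy[sb]
            s_next = rng.choices(range(n_states), weights=T[s][a])[0]
            r = R[s][a]
            s = s_next
            samples = Y[sb][a]
            if len(samples) < m:
                samples.append(phi[s_next])
                if len(samples) == 1:
                    R_bar[sb][a] = r  # rewards are assumed deterministic
                if len(samples) == m:
                    T_bar[sb][a] = [samples.count(nb) / m for nb in range(n_abs)]
    return T_bar, R_bar, Y


# Hypothetical MDP: ground states 0..3, abstract states A=0 ({0, 1}), B=1 ({2}), C=2 ({3}).
T = [[[0.0, 0.0, 0.6, 0.4]], [[0.0, 0.0, 0.4, 0.6]],
     [[1.0, 0.0, 0.0, 0.0]], [[0.0, 1.0, 0.0, 0.0]]]
R = [[0.0], [0.0], [0.0], [0.0]]
phi = [0, 0, 1, 2]
T_bar, R_bar, Y = abstracted_rmax_collect(T, R, phi, n_abs=3, m=50, horizon=10)
```

Note that the samples stored for abstract state A come from two different ground states, which is exactly the situation analyzed in Section 3.1.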

3.1 Abstracted RL Can Lead to Dependent Samples

We want to be able to guarantee that the empirical model ${¯ T}_{Y}$ will be close to the transition model of an abstract MDP ${¯ M}_{ω}$ . To define this transition model, we first look at how the data is collected. In the online data collection, a sample in ${¯ Y}_{¯ s, a}$ is drawn when the agent takes action $a$ while it is in a state $s \in ¯ s$ . Specifically, the $i$ -th abstract next-state sample ${¯ Y}_{¯ s, a}^{(τ_{i} + 1)} = {¯ s}^{'}$ is drawn from state $X_{¯ s, a}^{(τ_{i})} = s \in ¯ s$ :

{¯ Y}_{¯ s, a}^{(τ_{i} + 1)} \sim Pr (\cdot | X_{¯ s, a}^{(τ_{i})} = s, a) .

(10)

Let $X_{¯ s, a} = (X_{¯ s, a}^{(τ_{i})})_{i = 1}^{N (¯ s, a)}$ denote the sequence of states $s \in ¯ s$ from which the agent took action $a$ . Each state $s$ gets a weight according to how often it was sampled from, which we formalize with the weighting function $ω_{X}$ : $\forall (¯ s, a), s \in ¯ s : ω_{X} (s, a) ≜ \frac{1}{N (¯ s, a)} \sum_{i = 1}^{N (¯ s, a)} 1 {X_{¯ s, a}^{(τ_{i})} = s}$ . We use $ω_{X}$ to define ${¯ T}_{ω_{X}}$ analogous to (7):

\forall (¯ s, a), {¯ s}^{'} : {¯ T}_{ω_{X}} ({¯ s}^{'} | ¯ s, a) ≜ \sum_{s \in ¯ s} ω_{X} (s, a) \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) .

(11)
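The weighting $ω_{X}$ and the resulting target in (11) depend on the realized visit sequence. A minimal sketch for a single abstract state-action pair, assuming a hypothetical visit sequence and the transitions of a hypothetical MDP in which ground states 1 and 2 share an abstract state:

```python
from collections import Counter


def visit_weights(X):
    """omega_X(s, a) for one abstract pair: fraction of visits that started in s."""
    n = len(X)
    return {s: c / n for s, c in Counter(X).items()}


def weighted_abstract_transition(X, T_row, phi):
    """T_bar_{omega_X}(. | s_bar, a) as in (11); T_row[s] is the dict T(. | s, a)."""
    out = {}
    for s, w in visit_weights(X).items():
        for s_next, p in T_row[s].items():
            out[phi[s_next]] = out.get(phi[s_next], 0.0) + w * p
    return out


# Hypothetical data: action a was taken four times from abstract state A = {1, 2}.
phi = {1: "A", 2: "A", 3: "B", 4: "C"}
T_row = {1: {3: 0.6, 4: 0.4}, 2: {3: 0.4, 4: 0.6}}
X = [1, 1, 2, 1]

omega_X = visit_weights(X)
target = weighted_abstract_transition(X, T_row, phi)
```

Because `X` is random, the target ${¯ T}_{ω_{X}}$ is itself a random quantity that changes with the visit sequence, in contrast to the fixed $T (\cdot | s, a)$ of Lemma 1.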

We want a concentration inequality that bounds the deviation of the empirical model ${¯ T}_{Y}$ from ${¯ T}_{ω_{X}}$ . We refer to this inequality as the abstract L1 inequality; it is similar in form to (3):

Pr (| | {¯ T}_{Y} (\cdot | ¯ s, a) - {¯ T}_{ω_{X}} (\cdot | ¯ s, a) | |_{1} \geq ϵ) \leq δ,

(12)

where ${¯ T}_{Y} (\cdot | ¯ s, a)$ is defined according to (9) and ${¯ T}_{ω_{X}}$ according to (11).

If we could directly obtain i.i.d. samples from ${¯ T}_{ω_{X}}$ and base our empirical model ${¯ T}_{Y}$ on the obtained samples, then we would be able to show that the abstract L1 inequality holds by applying Lemma 1, since in this case we would have $N (¯ s, a)$ i.i.d. samples per abstract state-action pair, distributed according to ${¯ T}_{ω_{X}} (\cdot | ¯ s, a)$ . However, the samples are not guaranteed to be i.i.d. when the agent follows Algorithm 1 to collect the samples. First, every sample ${¯ Y}_{¯ s, a}^{(τ_{i} + 1)}$ was obtained after taking action $a$ from state $X_{¯ s, a}^{(τ_{i})} = s \in ¯ s$ , as in (10), and two samples can have different distributions if $X_{¯ s, a}^{(τ_{i})} \neq X_{¯ s, a}^{(τ_{j})}$ . By itself this is not an issue, as the proof of Lemma 1 can be adapted to show that it also holds for random variables that are independent but not (necessarily) identically distributed, which we show in the Appendix. Second, the samples are not guaranteed to be independent, as we will now discuss.

Independence

We may be tempted to assume the samples are independent, i.e.,

\forall ({¯ s}_{1}^{'}, \dots, {¯ s}_{m}^{'}) \in (¯ S)^{m} : Pr ({¯ Y}_{¯ s, a}^{(τ_{1} + 1)} = {¯ s}_{1}^{'}, \dots, {¯ Y}_{¯ s, a}^{(τ_{m} + 1)} = {¯ s}_{m}^{'}) = Pr ({¯ Y}_{¯ s, a}^{(τ_{1} + 1)} = {¯ s}_{1}^{'}) \cdots Pr ({¯ Y}_{¯ s, a}^{(τ_{m} + 1)} = {¯ s}_{m}^{'})

(13)

however, this may not be the case:

Observation 1.

When collecting samples online while using an abstraction function, i.e., based on Algorithm 1, the samples cannot be assumed to be independent.

Observation 1 applies in the situation where 1) we collect samples online in the real environment, of which we do not know the transitions, and 2) the samples collected are abstract states $¯ s$ , i.e., the states $s$ are not observed. The following counterexample illustrates Observation 1.

Counterexample

Figure 2: Simple MDP, with only 1 action, and abstraction. The small circles are states (1,2,3,4). A, B and C are the abstract states. The arrows show the transition probabilities, e.g. $P (3 | 1) = 0.6$ .

To show that the samples may not be independent, we will give a counterexample. We use the example MDP and abstraction in Figure 2, where we have 4 states, 3 abstract states and only 1 action. We look at the transition probability from abstract state $A$ , ${¯ T}_{Y} (\cdot | A)$ , where we omit the action from the notation since the example MDP has only one action.

We will consider two samples, the first two entries in ${¯ Y}_{A}$ , and show that for at least one combination of ${¯ s}_{1}^{'}$ and ${¯ s}_{2}^{'}$ the samples are not independent. Consider ${¯ s}_{1}^{'} = {¯ s}_{2}^{'} = B$ . That is, the first two times that we experience a transition from the abstract state $A$ , we end up in $B$ . We denote the $i$ -th experienced transition to an abstract next state from abstract state $A$ (after we landed back there from either $B$ or $C$ ) as ${¯ Y}_{A}^{(τ_{i} + 1)}$ . Let state $1$ be the starting state. Then we have $Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = B) = Pr (B | 1) = 0.6$ for the first sample. The second sample is more complex; we have

Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B) = \sum_{¯ s \in ¯ S} Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B | {¯ Y}_{A}^{(τ_{1} + 1)} = ¯ s) Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = ¯ s)

(14)

= Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B | {¯ Y}_{A}^{(τ_{1} + 1)} = A) Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = A) + Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B | {¯ Y}_{A}^{(τ_{1} + 1)} = B) Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = B) + Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B | {¯ Y}_{A}^{(τ_{1} + 1)} = C) Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = C)

(15)

= Pr (Y_{A}^{(τ_{2} + 1)} = 3 | Y_{A}^{(τ_{1} + 1)} = 1) Pr (Y_{A}^{(τ_{1} + 1)} = 1) + Pr (Y_{A}^{(τ_{2} + 1)} = 3 | Y_{A}^{(τ_{1} + 1)} = 2) Pr (Y_{A}^{(τ_{1} + 1)} = 2) + Pr (Y_{A}^{(τ_{2} + 1)} = 3 | Y_{A}^{(τ_{1} + 1)} = 3) Pr (Y_{A}^{(τ_{1} + 1)} = 3) + Pr (Y_{A}^{(τ_{2} + 1)} = 3 | Y_{A}^{(τ_{1} + 1)} = 4) Pr (Y_{A}^{(τ_{1} + 1)} = 4)

(16)

= 0 + 0 + 0.6 \cdot 0.6 + 0.4 \cdot 0.4 = 0.52.

(17)

Here we use that $Pr ({¯ s}^{'} | ¯ s) Pr (¯ s) = \sum_{s \in ¯ s} Pr ({¯ s}^{'} | s) Pr (s)$ for the step from (15) to (16). For the final step, note that $Pr (Y_{A}^{(τ_{1} + 1)} = 1) = Pr (Y_{A}^{(τ_{1} + 1)} = 2) = 0$ , since there is no transition from a state in $A$ to a state in $A$ , so the first two terms of (16) vanish.

So we end up with: $Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = B) Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B) = 0.6 \cdot 0.52 = 0.312$ . And for the joint probability: $Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = B, {¯ Y}_{A}^{(τ_{2} + 1)} = B) = Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = B) Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B | {¯ Y}_{A}^{(τ_{1} + 1)} = B) = 0.6 \cdot 0.6 = 0.36.$ Here $Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B | {¯ Y}_{A}^{(τ_{1} + 1)} = B)$ is 0.6 because, given that the first transition from abstract state $A$ ended in abstract state $B$ and that $Pr (1 | B) = 1$ , the second transition from abstract state $A$ will start in state $1$ , and $Pr (B | 1) = 0.6$ . Thus we have that

Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = B, {¯ Y}_{A}^{(τ_{2} + 1)} = B) \neq Pr ({¯ Y}_{A}^{(τ_{1} + 1)} = B) Pr ({¯ Y}_{A}^{(τ_{2} + 1)} = B),

(18)

that is, the samples are not independent. This leads us to the second observation:

Observation 2.

As independence cannot be guaranteed, Lemmas 1 and 6 cannot be readily applied to show that the abstract L1 inequality holds.
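The numbers in the counterexample can be reproduced by exact enumeration. The sketch below assumes a transition structure consistent with the values used in (17): besides $Pr (3 | 1) = 0.6$ and $Pr (1 | 3) = 1$ , which are stated in the text, it assumes $Pr (4 | 1) = 0.4$ , $Pr (3 | 2) = 0.4$ , $Pr (4 | 2) = 0.6$ and $Pr (2 | 4) = 1$ , which are inferred rather than quoted from Figure 2.

```python
from fractions import Fraction as F

# Assumed one-action transitions, chosen to match the values in (17).
T = {1: {3: F(3, 5), 4: F(2, 5)}, 2: {3: F(2, 5), 4: F(3, 5)},
     3: {1: F(1)}, 4: {2: F(1)}}
phi = {1: "A", 2: "A", 3: "B", 4: "C"}


def joint_first_two_samples(start=1):
    """Joint distribution of the first two abstract next-states recorded for A."""
    joint = {}
    for s1, p1 in T[start].items():          # first sample: leave A from the start state
        for back, p_back in T[s1].items():   # return to a ground state inside A
            for s2, p2 in T[back].items():   # second sample: leave A again
                key = (phi[s1], phi[s2])
                joint[key] = joint.get(key, 0) + p1 * p_back * p2
    return joint


joint = joint_first_two_samples()
pr_first_B = sum(p for (y1, _), p in joint.items() if y1 == "B")
pr_second_B = sum(p for (_, y2), p in joint.items() if y2 == "B")

print("joint:", joint[("B", "B")])           # 9/25, i.e. 0.36
print("product:", pr_first_B * pr_second_B)  # 39/125, i.e. 0.312
```

Exact rational arithmetic avoids any rounding question: the joint probability and the product of the marginals differ, so the two samples are dependent under these assumed transitions.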

3.2 Guarantees for Abstract Model Learning Using Martingales

Here we also want to give a guarantee in the form of the abstract L1 inequality from (12). While in the previous section we found this was not possible with concentration inequalities such as Hoeffding’s inequality, because the samples are not guaranteed to be independent, here we look at a related bound for samples that are weakly dependent, i.e., Azuma-Hoeffding’s inequality. This inequality makes use of the properties of a martingale difference sequence, which are slightly weaker than independence:

Definition 3 (Martingale difference sequence Strehl and Littman (2008)).

The sequence $Z_{1}, Z_{2}, \dots$ is a martingale difference sequence if, $\forall i$ , it satisfies the following conditions:

E [Z_{i} | Z_{1}, Z_{2}, \dots, Z_{i - 1}] = 0,

| Z_{i} | < \infty .

The properties of the martingale difference sequence can be used to obtain the following concentration inequality:

Lemma 3 (Azuma’s Lemma Strehl and Littman (2008)).

If the random variables $Z_{1}, Z_{2}, \dots$ form a martingale difference sequence (Def. 3), with $| Z_{i} | \leq b$ , then

Pr (\sum_{i = 1}^{n} Z_{i} > ϵ) \leq e^{- \frac{ϵ^{2}}{2 b^{2} n}} .

(19)
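To see why the martingale difference property can hold even when the samples are dependent, consider centering each indicator by the probability under the ground state it was drawn from, $Z_{i} = 1 \{{¯ Y}^{(τ_{i} + 1)} = {¯ s}^{'}\} - Pr ({¯ s}^{'} | X^{(τ_{i})}, a)$ . This is one natural choice made here for illustration and is not necessarily the definition used in the proof in the Appendix. The sketch checks $E [Z_{2} | Z_{1}] = 0$ exactly on a four-state, one-action MDP whose transitions are assumed to match the counterexample.

```python
from fractions import Fraction as F

# Assumed one-action transitions, consistent with the counterexample's numbers.
T = {1: {3: F(3, 5), 4: F(2, 5)}, 2: {3: F(2, 5), 4: F(3, 5)},
     3: {1: F(1)}, 4: {2: F(1)}}
phi = {1: "A", 2: "A", 3: "B", 4: "C"}


def pr_abstract(s, s_bar):
    """Pr(s_bar | s): probability of landing in abstract state s_bar from s."""
    return sum(p for s_next, p in T[s].items() if phi[s_next] == s_bar)


def conditional_mean_z2(target="B", start=1):
    """E[Z_2 | Z_1 = z] for each attainable value z, by exact enumeration."""
    num, den = {}, {}
    for s1, p1 in T[start].items():
        z1 = (phi[s1] == target) - pr_abstract(start, target)
        for x2, p_back in T[s1].items():     # ground state of the second visit to A
            for s2, p2 in T[x2].items():
                z2 = (phi[s2] == target) - pr_abstract(x2, target)
                p = p1 * p_back * p2
                num[z1] = num.get(z1, 0) + p * z2
                den[z1] = den.get(z1, 0) + p
    return {z: num[z] / den[z] for z in num}


cond_means = conditional_mean_z2()
```

Both conditional means are zero: the dependence between samples enters through the ground state $X^{(τ_{2})}$ , and centering by the distribution of that state removes it in expectation.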

After defining a suitable $Z_{i}$ (done in the proof of Proposition 1) and using the definitions of ${¯ T}_{Y} (\cdot | ¯ s, a)$ (9) and ${¯ T}_{ω_{X}} (\cdot | ¯ s, a)$ (11), we can show that with high probability the empirical abstract transition function ${¯ T}_{Y}$ will be close to the abstract transition function ${¯ T}_{ω_{X}}$ :

Proposition 1 (Abstract L1 inequality).

For a fixed value of $N (¯ s, a)$ , we have that with probability $1 - δ$ the following holds:

| | {¯ T}_{Y} (\cdot | ¯ s, a) - {¯ T}_{ω_{X}} (\cdot | ¯ s, a) | |_{1} \leq ϵ,

(20)

where $δ = 2^{| ¯ S |} e^{- \frac{1}{8} N (¯ s, a) ϵ^{2}}$ .
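Solving the expression for $δ$ in Proposition 1 for the number of samples gives $N (¯ s, a) \geq \frac{8}{ϵ^{2}} (| ¯ S | \ln 2 + \ln \frac{1}{δ})$ . A minimal sketch of this inversion, with a hypothetical function name and illustrative parameter values:

```python
import math


def samples_for_abstract_l1(n_abstract_states, eps, delta):
    """Smallest integer N with 2^{|S_bar|} * exp(-N * eps^2 / 8) <= delta."""
    needed = 8.0 / eps ** 2 * (n_abstract_states * math.log(2) + math.log(1.0 / delta))
    return math.ceil(needed)


def failure_probability(n_abstract_states, eps, n_samples):
    """delta as a function of N(s_bar, a), as stated in Proposition 1."""
    return 2 ** n_abstract_states * math.exp(-n_samples * eps ** 2 / 8.0)


# Illustrative values only: three abstract states, eps = 0.1, delta = 0.05.
N = samples_for_abstract_l1(n_abstract_states=3, eps=0.1, delta=0.05)
```

The required number of samples grows linearly in $| ¯ S |$ and does not depend on $| S |$ , which is where the benefit of the abstraction shows up in the sample complexity.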

Thus we can show that the empirical abstract transition function ${¯ T}_{Y}$ gets close to the abstract transition function ${¯ T}_{ω_{X}}$ . This shows that in RL with state abstraction, where the samples are not independent, we can replace the use of Hoeffding’s inequality with Lemma 3 to show that we learn an accurate abstract transition function.

Finally, we show how Proposition 1 can be used to give guarantees for MBRL methods in Abstracted RL with an approximate model similarity abstraction. We illustrate this using the R-MAX algorithm Brafman and Tennenholtz (2002), thus providing the first finite-sample guarantees for Abstracted RL:

Theorem 1.

Given an MDP $M$ , an approximate model similarity abstraction $ϕ$ with $η_{R}$ and $η_{T}$ , and inputs $| ¯ S |, | A |, ϵ, δ, T_{ϵ}$ , with probability at least $1 - δ$ the R-MAX algorithm adapted to abstraction (Algorithm 1) will attain an expected return of $Opt (\prod_{M} (ϵ, T_{ϵ})) - 3 g (η_{T}, η_{R}) - 2 ϵ$ within a number of steps polynomial in $| ¯ S |, | A |, \frac{1}{ϵ}, \frac{1}{δ}, T_{ϵ}$ . Here $T_{ϵ}$ is the $ϵ$ -return mixing time of the optimal policy, $\prod_{M} (ϵ, T_{ϵ})$ denotes the set of policies for $M$ whose $ϵ$ -return mixing time is $T_{ϵ}$ , $Opt (\prod_{M} (ϵ, T_{ϵ}))$ denotes the optimal expected return achievable by such policies, and

g (η_{T}, η_{R}) = T_{ϵ} η_{R} + \frac{(T_{ϵ} - 1) T_{ϵ}}{2} η_{T} | ¯ S | .
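The penalty term grows linearly in $η_{R}$ and quadratically in $T_{ϵ}$ through the $η_{T}$ term. A short sketch that evaluates $g$ , using a hypothetical function name and illustrative values that are not taken from the paper:

```python
def abstraction_penalty(eta_T, eta_R, mixing_time, n_abstract_states):
    """g(eta_T, eta_R) = T_eps * eta_R + (T_eps - 1) * T_eps / 2 * eta_T * |S_bar|."""
    reward_term = mixing_time * eta_R
    transition_term = (mixing_time - 1) * mixing_time / 2 * eta_T * n_abstract_states
    return reward_term + transition_term


# Illustrative values only: T_eps = 10, eta_R = 0.01, eta_T = 0.001, |S_bar| = 3.
g = abstraction_penalty(eta_T=0.001, eta_R=0.01, mixing_time=10, n_abstract_states=3)
total_penalty = 3 * g  # the term subtracted from the return in Theorem 1
```

With an exact abstraction ( $η_{R} = η_{T} = 0$ ) the penalty vanishes and only the $2 ϵ$ term of the original R-MAX guarantee remains.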

The proof can be found in the Appendix and follows the line of the original R-MAX proof, using the assumptions mentioned at the start of Section 3. In the proof, we use the Abstract L1 inequality to show the empirical abstract model is accurate with high probability, and an upper bound that we establish on the difference in the expected (finite horizon) value between the original MDP and an abstract MDP under any abstract policy $¯ π$ . This upper bound is similar to the results of Lemma 2.

As is typical with abstraction there is a trade-off in the performance, but the required number of steps is reduced. Without abstraction the performance guarantee is $Opt (\prod_{M} (ϵ, T_{ϵ})) - 2 ϵ$ , while with the abstraction the additional penalty of $3 g (η_{T}, η_{R})$ appears because we use an approximate model similarity abstraction. The advantage is that the number of steps that is required is polynomial in the size of the abstract space $| ¯ S |$ rather than $| S |$ .

4 Related Work

Many studies have considered the combination of abstraction with either planning or RL. In most of these studies, the dependence of samples that arises in Abstracted RL is not an issue due to various reasons, such as the assumption that the collected samples are independent Paduraru et al. (2008); Ortner et al. (2014b); Jiang et al. (2015), looking at convergence in the limit Singh et al. (1995); Hutter (2016); Majeed and Hutter (2018), or because access to an MDP model is assumed Hallak et al. (2013); Maillard et al. (2013); Ortner et al. (2019). A more extensive discussion of the related work is available in the Appendix.

In the Abstracted RL setting a negative result has been provided, showing that R-MAX Brafman and Tennenholtz (2002) no longer maintains its guarantees when paired with any type of state abstraction function Abel et al. (2018). This is shown with an example that uses approximate Q-function abstractions Abel et al. (2016). Our counterexample is stronger: it indicates problems with the usual analysis even for approximate model similarity abstractions. Yet, our second result shows that for R-MAX-like algorithms it is still possible to give guarantees in Abstracted RL when an approximate model similarity abstraction is used and we take into account the $η_{R}$ and $η_{T}$ inaccuracies in the error.

Another study considered a setting related to abstraction, where the transition and reward functions may change over time, either abruptly or gradually Ortner et al. (2020). The reward and transition probabilities depend on the timestep $t$ , so $T (s^{'} | s, a, t)$ instead of $T (s^{'} | s, a)$ . To give results they bound the variation in the reward and transition functions over time. They adapt the confidence intervals for the state-action pairs to take the variation into account. In their setting the MDP is fixed given the timestep, but in the abstraction setting this is not the case: each time we run the MDP, the transition function at a timestep $t$ could be different.

Some of the studies in the abstraction selection setting do not assume that the set of abstraction functions contains a Markov model Lattimore et al. (2013); Ortner et al. (2014a). One of these assumes the agent has access to a set of environments, including the true environment, rather than a set of representations Lattimore et al. (2013). Because they have access to environments rather than an abstraction, they do not need to learn a transition model, which makes their setting different from ours. The other study Ortner et al. (2014a) uses Theorem 2.1 from Weissman et al. (2003), which requires i.i.d. samples; we have shown that independent samples cannot be guaranteed in Abstracted RL.

Other related work is in the area of MDPs with rich observations or block structure Azizzadenesheli et al. (2016); Du et al. (2019). However, in that setting each observation can be generated only from a single hidden state, which means that the issue of non-i.i.d. data due to abstraction does not arise. In contrast, each observation can be generated from multiple hidden states in Abstracted RL. The rich observation setting can be seen as an aggregation problem, where the observations can be aggregated to form a small (latent) MDP Azizzadenesheli et al. (2016). But in our case, we do not try to learn the MDP (as it is not small). Their setting is also related to exact model similarity (or bisimulation) Du et al. (2019), but we focus on approximate model similarity which is what introduces the problems as described here.

For planning in abstract MDPs, there are results for exact state abstractions Li et al. (2006) and for approximate state abstractions Abel et al. (2016). The results for approximate state abstractions allow for quantifying an upper bound on performance for the optimal policy of an abstract MDP, as in Section 2.2. This has been built on by giving a result for performing RL while interacting with an explicitly constructed abstract MDP Taïga et al. (2018), which is different from Abstracted RL since the abstract MDP is still an MDP.

Even without abstraction, in certain cases a dependence can arise for RL in MDPs. For instance, it has been shown that dependence can appear if the MDP is not communicating Strehl and Littman (2008). The non-communicating property can be realistic, as there could be problems where there are states to which we cannot return. They show that this specific case of dependence in non-communicating MDPs is not a problem because it is still possible to use a concentration inequality for independent samples, e.g., Hoeffding’s inequality, as an upper bound. However, their proof uses the fact that the transition and rewards are identically distributed, which is not guaranteed in Abstracted RL.

5 Conclusions

We analyzed Abstracted RL: the combination of MBRL and state abstraction when the model of the MDP is not available. Via a counterexample, we have shown that in Abstracted RL samples obtained online cannot be guaranteed to be independent. Many current guarantees from MBRL methods make use of concentration results that assume i.i.d. samples, e.g., Theorem 2.1 from Weissman et al. (2003), the empirical Bernstein inequality Audibert et al. (2007); Maurer and Pontil (2009) or the Chernoff bound. Because they use these concentration inequalities, their guarantees do not hold in the Abstracted RL setting. In fact, none of the existing analyses of MBRL is applicable to Abstracted RL. We showed that the samples in Abstracted RL are only weakly dependent and that concentration inequalities for (weakly) dependent variables, such as Lemma 3, are a viable alternative through which we can come to guarantees on the empirical model. We used this result to present the first sample efficient learning results for the Abstracted RL setting, thus showing it is possible to combine the benefits of small abstracted state spaces and performance guarantees.

The assumptions on the reward that we made here, i.e., that the reward is deterministic and that each state in an abstract state has the same reward function, can also be relaxed. To show that we can accurately learn an abstract reward function again a suitable martingale difference sequence has to be defined, then Lemma 3 can be used. The assumption that the agent receives the $ϵ$ -mixing time of the optimal policy can also be lifted Brafman and Tennenholtz (2002).

We considered a specific type of abstraction, the approximate model similarity abstraction. It may be possible to extend our results to other types of abstraction. For this, it is important that an upper bound can be established on the difference in value between the original MDP and an abstract MDP under a sub-optimal abstract policy.

Extending the results to more recent algorithms, e.g., MBIE Strehl and Littman (2008) and UCRL2 Jaksch et al. (2010), requires adapting to the different assumptions. For instance, R-MAX Brafman and Tennenholtz (2002) assumes the MDP is ergodic, while UCRL2 and MBIE make the slightly weaker assumptions that the MDP is communicating and non-communicating, respectively. For extending the results of algorithms that use the empirical Bernstein inequality, Bernstein-type inequalities for martingales Dzhaparidze and Van Zanten (2001) could be used.

Acknowledgments

We would like to thank Elena Congeduti for the help with deriving part of the theoretical results. This project received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 758824, INFLUENCE).

References

D. Abel, D. Arumugam, L. Lehnert, and M. Littman (2018) State abstractions for lifelong reinforcement learning. In International Conference on Machine Learning, pp. 10–19. Cited by: Appendix D, Figure 1, §1, §4.
D. Abel, D. Hershkowitz, and M. Littman (2016) Near optimal behavior via approximate state abstraction. In International Conference on Machine Learning, pp. 2915–2923. Cited by: Appendix D, Appendix D, Appendix D, §1, §2.2, §2.2, §2.2, §3, §4, §4.
J. Audibert, R. Munos, and C. Szepesvári (2007) Tuning bandit algorithms in stochastic environments. In International conference on algorithmic learning theory, pp. 150–165. Cited by: §5.
K. Azizzadenesheli, A. Lazaric, and A. Anandkumar (2016) Reinforcement learning in rich-observation mdps using spectral methods. arXiv preprint arXiv:1611.03907. Cited by: Appendix D, §2.2, §4.
A. Bai, S. Srivastava, and S. J. Russell (2016) Markovian state and action abstractions for mdps via hierarchical mcts.. In IJCAI, pp. 3029–3039. Cited by: §1.
G. Boole (1854) An investigation of the laws of thought: on which are founded the mathematical theories of logic and probabilities. Dover Publications. Cited by: Lemma 5.
R. I. Brafman and M. Tennenholtz (2002) R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3 (Oct), pp. 213–231. Cited by: §C.3, Appendix D, §1, §2.1, §3.2, §3, §4, §5, §5.
T. Dean, R. Givan, and S. Leach (1997) Model reduction techniques for computing approximately optimal solutions for markov decision processes. In Proceedings of the Thirteenth conference on Uncertainty in artificial intelligence, pp. 124–131. Cited by: §2.2.
R. Dearden and C. Boutilier (1997) Abstraction and approximate decision-theoretic planning. Artificial Intelligence 89 (1-2), pp. 219–283. Cited by: §2.2.
S. Du, A. Krishnamurthy, N. Jiang, A. Agarwal, M. Dudik, and J. Langford (2019) Provably efficient RL with rich observations via latent state decoding. In International Conference on Machine Learning, pp. 1665–1674. Cited by: Appendix D, §2.2, §4, footnote 3.
K. Dzhaparidze and J. Van Zanten (2001) On bernstein-type inequalities for martingales. Stochastic processes and their applications 93 (1), pp. 109–117. Cited by: §5.
R. Givan, T. Dean, and M. Greig (2003) Equivalence notions and model minimization in markov decision processes. Artificial Intelligence 147 (1-2), pp. 163–223. Cited by: §2.2, footnote 2.
A. Hallak, D. Di-Castro, and S. Mannor (2013) Model in markovian processes. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 374–382. Cited by: Appendix D, Appendix D, §4, footnote 3.
W. Hoeffding (1963) Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58 (301), pp. 13–30. Cited by: Lemma 4.
R. A. Howard (1960) Dynamic programming and markov processes.. The MIT Press, Cambridge, MA. Cited by: §C.2.
M. Hutter (2016) Extreme state aggregation beyond markov decision processes. Theoretical Computer Science 650, pp. 73–91. Cited by: Appendix D, Appendix D, §4.
T. Jaksch, R. Ortner, and P. Auer (2010) Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research 11 (Apr), pp. 1563–1600. Cited by: Appendix D, §1, §1, §2.1, §2.1, §5.
N. Jiang, A. Kulesza, and S. Singh (2015) Abstraction selection in model-based reinforcement learning. In International Conference on Machine Learning, pp. 179–188. Cited by: Appendix D, Appendix D, §4.
L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101 (1-2), pp. 99–134. Cited by: §1.
M. Kearns and S. Singh (2002) Near-optimal reinforcement learning in polynomial time. Machine learning 49 (2), pp. 209–232. Cited by: §3.
T. Lattimore, M. Hutter, P. Sunehag, et al. (2013) The sample-complexity of general reinforcement learning. In Proceedings of the 30th International Conference on Machine Learning, Cited by: Appendix D, §4.
D. A. Levin and Y. Peres (2017) Markov chains and mixing times. Vol. 107, American Mathematical Soc.. Cited by: Appendix B.
L. Li, T. J. Walsh, and M. L. Littman (2006) Towards a unified theory of state abstraction for mdps.. In ISAIM, Cited by: Appendix D, §1, §2.2, §2.2, §4, footnote 5.
L. Li (2009) A unifying framework for computational reinforcement learning theory. Ph.D. Thesis, Rutgers University-Graduate School-New Brunswick. Cited by: §1, §1, §2.2.
S. Mahadevan (2010) Representation discovery in sequential decision making. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1718–1721. Cited by: §1.
O. Maillard, P. Nguyen, R. Ortner, and D. Ryabko (2013) Optimal regret bounds for selecting the state representation in reinforcement learning. In International Conference on Machine Learning, pp. 543–551. Cited by: Appendix D, Appendix D, §4, footnote 3.
S. J. Majeed and M. Hutter (2018) On Q-learning convergence for non-markov decision processes.. In IJCAI, pp. 2546–2552. Cited by: Appendix D, Appendix D, §4, footnote 3.
A. Maurer and M. Pontil (2009) Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740. Cited by: §5.
R. Ortner, P. Gajane, and P. Auer (2020) Variational regret bounds for reinforcement learning. In Uncertainty in Artificial Intelligence, pp. 81–90. Cited by: §C.1, Appendix D, §4.
R. Ortner, O. Maillard, and D. Ryabko (2014a) Selecting near-optimal approximate state representations in reinforcement learning. In International Conference on Algorithmic Learning Theory, pp. 140–154. Cited by: Appendix D, §1, §4.
R. Ortner, M. Pirotta, A. Lazaric, R. Fruit, and O. Maillard (2019) Regret bounds for learning state representations in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 12738–12748. Cited by: Appendix D, Appendix D, §4, footnote 3.
R. Ortner, D. Ryabko, P. Auer, and R. Munos (2014b) Regret bounds for restless markov bandits. Theoretical Computer Science 558, pp. 62–76. Cited by: Appendix D, Appendix D, §4.
C. Paduraru, R. Kaplow, D. Precup, and J. Pineau (2008) Model-based reinforcement learning with state aggregation. In 8th European Workshop on Reinforcement Learning, Cited by: Appendix D, Appendix D, §4, footnote 3.
M. L. Puterman (2014) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons. Cited by: §2.2, §2, §3, footnote 4, footnote 6.
S. P. Singh, T. Jaakkola, and M. I. Jordan (1994) Learning without state-estimation in partially observable markovian decision processes. In Machine Learning Proceedings 1994, pp. 284–292. Cited by: §1.
S. P. Singh, T. Jaakkola, and M. I. Jordan (1995) Reinforcement learning with soft state aggregation. In Advances in neural information processing systems, pp. 361–368. Cited by: Appendix D, Appendix D, §4.
A. L. Strehl and M. L. Littman (2008) An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences 74 (8), pp. 1309–1331. Cited by: §C.1, §C.3, Appendix D, §1, §1, §2.1, §2.1, §4, §5, Definition 3, Lemma 3.
R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: §1.
A. A. Taïga, A. Courville, and M. G. Bellemare (2018) Approximate exploration through state abstraction. arXiv preprint arXiv:1808.09819. Cited by: Appendix D, §2.2, §3, §4, Lemma 2.
T. Weissman, E. Ordentlich, G. Seroussi, S. Verdu, and M. J. Weinberger (2003) Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep. Cited by: Appendix B, Appendix B, Appendix D, §1, §2.1, §2.1, §4, §5, Lemma 1.

Appendix A Well Known Results

We restate some well-known results that we use in the proofs in the other sections.

A.1 Hoeffding’s Inequality

Hoeffding’s inequality can tell us what the probability is that the average of $m$ random independent (but not necessarily identically distributed) samples deviates more than $ϵ$ from its expectation.

Let $Z^{(1)}, Z^{(2)}, \dots, Z^{(m)}$ be bounded independent random variables and let $¯ Z$ and $μ$ be defined as:

	$¯ Z$	$≜ \frac{Z^{(1)} + \dots + Z^{(m)}}{m},$		(21)
	$μ ≜ E [¯ Z]$	$= \frac{E [Z^{(1)} + \dots + Z^{(m)}]}{m} .$		(22)

Then Hoeffding’s inequality states:

Lemma 4 (Hoeffding’s inequality Hoeffding [1963]).

If $Z^{(1)}, Z^{(2)}, \dots, Z^{(m)}$ are independent and $0 \leq Z^{(i)} \leq 1$ for $i = 1, \dots, m$ , then for $0 < ϵ < 1 - μ$ we have the following inequalities

$\Pr \left( \bar{Z} - μ \geq ϵ \right)$	$\leq e^{- 2 m ϵ^{2}},$	(23)
$\Pr \left( \lvert \bar{Z} - μ \rvert \geq ϵ \right)$	$\leq 2 e^{- 2 m ϵ^{2}},$	(24)
$\Pr \left( \sum_{i = 1}^{m} (Z^{(i)} - μ) \geq ϵ \right)$	$\leq e^{- 2 \frac{ϵ^{2}}{m}},$	(25)
$\Pr \left( \left\lvert \sum_{i = 1}^{m} (Z^{(i)} - μ) \right\rvert \geq ϵ \right)$	$\leq 2 e^{- 2 \frac{ϵ^{2}}{m}} .$	(26)
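
As an informal sanity check of (23), the following sketch estimates the one-sided tail probability by simulation for independent Bernoulli variables with different means, and compares it with the bound $e^{- 2 m ϵ^{2}}$. The means, the value of $ϵ$, the number of trials, and the function names are illustrative assumptions, not values taken from the paper.

```python
import math
import random

def hoeffding_bound(m, eps):
    """Right-hand side of (23): exp(-2 m eps^2)."""
    return math.exp(-2.0 * m * eps ** 2)

def empirical_tail(means, eps, trials=20000, seed=0):
    """Monte Carlo estimate of Pr(Z_bar - mu >= eps) for independent
    Bernoulli(means[i]) variables (not identically distributed)."""
    rng = random.Random(seed)
    m = len(means)
    mu = sum(means) / m
    hits = 0
    for _ in range(trials):
        z_bar = sum(1 if rng.random() < p else 0 for p in means) / m
        # small tolerance so that boundary cases are not lost to rounding
        if z_bar - mu >= eps - 1e-12:
            hits += 1
    return hits / trials

# Hypothetical, non-identical means spread over [0.2, 0.8].
means = [0.2 + 0.6 * i / 49 for i in range(50)]
eps = 0.1
tail = empirical_tail(means, eps)
bound = hoeffding_bound(len(means), eps)
print(f"empirical tail = {tail:.4f}, Hoeffding bound = {bound:.4f}")
```

The estimate should fall below the bound, which illustrates that the inequality only needs independence and boundedness, not identical distributions.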

A.2 Union Bound

Given that we have a set of events, the union bound allows us to upper bound the probability that at least one of the events happens, even when these events are not independent.

Lemma 5 (Union Bound Boole [1854]).

For a countable set of events $A_{1}, A_{2}, A_{3}, \dots$ , we have

$\Pr \left( \bigcup_{i} A_{i} \right) \leq \sum_{i} \Pr (A_{i}) .$

(28)

I.e., the probability that at least one of the events happens is at most the sum of the probabilities of the individual events.

Appendix B L1 Inequality for Independent but not Identically Distributed Variables

We show that for independent but not identically distributed samples we can adapt the proof of Weissman et al. [2003] to obtain the following result:

Lemma 6.

Let $X_{¯ s, a} = s_{1}, \dots, s_{m}$ be a sequence of states $s \in ¯ s$ and let ${¯ Y}_{¯ s, a} = {¯ Y}^{(1)}, {¯ Y}^{(2)}, \dots, {¯ Y}^{(m)}$ be independent random variables distributed according to $Pr (\cdot | s_{1}, a), \dots, Pr (\cdot | s_{m}, a)$ . Then, $\forall ϵ > 0$ ,

Pr (| | {¯ T}_{Y} (\cdot | ¯ s, a) - {¯ T}_{ω_{X}} (\cdot | ¯ s, a) | |_{1} \geq ϵ) \leq (2^{| ¯ S |} - 2) e^{- \frac{1}{2} m ϵ^{2}} .

(29)

Proof.

The proof mostly follows the steps by Weissman et al. [2003].

To shorten notation we define $P_{Y} ≜ {¯ T}_{Y} (\cdot | ¯ s, a)$ and $P_{ω_{X}} ≜ {¯ T}_{ω_{X}} (\cdot | ¯ s, a)$ .

We will make use of the following result (Proposition 4.2 in Levin and Peres [2017]): for any distribution $Q$ on $\bar{S}$ ,

$\lVert Q - P_{ω_{X}} \rVert_{1} = 2 \max_{A \subseteq \bar{S}} \left( Q (A) - P_{ω_{X}} (A) \right),$

where $A$ ranges over the subsets of $\bar{S}$ and $P_{ω_{X}} (A) = \sum_{\bar{s}^{'} \in A} P_{ω_{X}} (\bar{s}^{'})$ . Thus we have that

$\lVert P_{Y} - P_{ω_{X}} \rVert_{1} = 2 \max_{A \subseteq \bar{S}} \left( P_{Y} (A) - P_{ω_{X}} (A) \right) .$

(30)

Using this we can write

$\Pr \left( \lVert P_{Y} - P_{ω_{X}} \rVert_{1} \geq ϵ \right)$	$= \Pr \left[ 2 \max_{A \subseteq \bar{S}} \left[ P_{Y} (A) - P_{ω_{X}} (A) \right] \geq ϵ \right]$	(31)
	$= \Pr \left[ \max_{A \subseteq \bar{S}} \left[ P_{Y} (A) - P_{ω_{X}} (A) \right] \geq \frac{ϵ}{2} \right]$	(32)
	$= \Pr \left[ \bigcup_{A \subseteq \bar{S}} \left\{ P_{Y} (A) - P_{ω_{X}} (A) \geq \frac{ϵ}{2} \right\} \right]$	(33)
	$\leq \sum_{A \subseteq \bar{S}} \Pr \left( P_{Y} (A) - P_{ω_{X}} (A) \geq \frac{ϵ}{2} \right),$	(34)

where the last step follows from the union bound (Lemma 5).

Assuming $ϵ > 0$ , we have for $A = \bar{S}$ and for $A = \emptyset$ that $\Pr (P_{Y} (A) - P_{ω_{X}} (A) \geq \frac{ϵ}{2}) = 0$ , since in both cases $P_{Y} (A) = P_{ω_{X}} (A)$ .

For every other subset $A$ , we can define a binary random variable that is $1$ when $\bar{Y}^{(i)} \in A$ and $0$ otherwise. We have that $P_{ω_{X}} (A)$ acts as $μ$ (22) from Lemma 4 and $P_{Y} (A)$ as $\bar{Z}$ (21). Thus applying Lemma 4 to this random variable we have:

$\Pr \left( P_{Y} (A) - P_{ω_{X}} (A) \geq \frac{ϵ}{2} \right) \leq e^{- 2 m \left( \frac{ϵ}{2} \right)^{2}} = e^{- \frac{1}{2} m ϵ^{2}} .$

(35)

Then it follows that

$\Pr \left( \lVert P_{Y} - P_{ω_{X}} \rVert_{1} \geq ϵ \right)$	$\leq \sum_{A \subseteq \bar{S}} \Pr \left( P_{Y} (A) - P_{ω_{X}} (A) \geq \frac{ϵ}{2} \right)$	(36)
	$\leq \sum_{A \subset \bar{S} : A \neq \bar{S}, \emptyset} \Pr \left( P_{Y} (A) - P_{ω_{X}} (A) \geq \frac{ϵ}{2} \right)$	(37)
	$\leq (2^{\lvert \bar{S} \rvert} - 2) e^{- \frac{1}{2} m ϵ^{2}},$	(38)

where $A \subset \bar{S} : A \neq \bar{S}, \emptyset$ denotes that the empty set $\emptyset$ and the full set $\bar{S}$ are excluded. ∎
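
The identity behind (30), that the L1 distance between two distributions equals twice the largest gap in probability over all subsets, can be checked by brute force on a small example. The sketch below is only an illustration; the two distributions `q` and `p` and the helper names are hypothetical.

```python
from itertools import combinations

def l1_distance(q, p):
    """Sum of absolute differences between two distributions."""
    return sum(abs(qi - pi) for qi, pi in zip(q, p))

def max_subset_gap(q, p):
    """max over subsets A of Q(A) - P(A), by brute-force enumeration."""
    n = len(q)
    best = 0.0  # the empty set gives a gap of 0
    for k in range(1, n + 1):
        for subset in combinations(range(n), k):
            gap = sum(q[i] - p[i] for i in subset)
            best = max(best, gap)
    return best

# Hypothetical distributions over four abstract states.
q = [0.5, 0.25, 0.125, 0.125]
p = [0.25, 0.25, 0.25, 0.25]

print(l1_distance(q, p), 2 * max_subset_gap(q, p))
# Number of subsets that contribute to the union bound in (38).
print(2 ** len(q) - 2)
```

The maximizing subset collects exactly the outcomes where `q` exceeds `p`, which is why only the non-trivial subsets matter in the union bound.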

Appendix C Proofs

Before getting to the proof of Theorem 1, in Appendix C.3, we first give additional Lemmas that will be used in the proof. First, we show in Appendix C.1 how in Abstracted RL we can use a concentration inequality for martingales to learn an accurate transition model, with high probability. Then, in Appendix C.2 we give upper bounds on the difference in value between the real MDP and an abstract MDP, under various policies. Finally, in Appendix C.3 we use the first two results to show that we can provide efficient learning guarantees for R-MAX in Abstracted RL.

C.1 Concentration Inequality on the L1 Norm for Martingales in Abstracted RL

The following results show that, with a high probability, the empirical abstract transition function ${¯ T}_{Y}$ will be close to the abstract transition function ${¯ T}_{ω_{X}}$ . In the proof, we define a suitable martingale difference sequence for the transition function, and use this to obtain the following result for learning a transition function in Abstracted RL:

Proposition 1 (Abstract L1 inequality). For a fixed value of $N (¯ s, a)$ , we have that with probability $1 - δ$ the following holds:

| | {¯ T}_{Y} (\cdot | ¯ s, a) - {¯ T}_{ω_{X}} (\cdot | ¯ s, a) | |_{1} \leq ϵ,

(39)

where $δ = 2^{| ¯ S |} e^{- \frac{1}{8} N (¯ s, a) ϵ^{2}}$ .

Proof of Proposition 1.

The proof follows the general approach of Ortner et al. [2020]. We find it useful to first define an abstract transition function based on $X_{¯ s, a}$ as

$\forall (\bar{s}, a), \bar{s}^{'} : \bar{T}_{ω_{X}} (\bar{s}^{'} | \bar{s}, a) ≜ \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a),$

(40)

where $T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a) ≜ \sum_{s^{'} \in \bar{s}^{'}} T (s^{'} | X_{\bar{s}, a}^{(τ_{i})}, a)$ . We write $\bar{T}_{ω_{X}}$ because this definition is equivalent to using a weighting function as in (Eq. 11):

$\forall (\bar{s}, a), \bar{s}^{'} : \bar{T}_{ω_{X}} (\bar{s}^{'} | \bar{s}, a)$	$≜ \sum_{s \in \bar{s}} ω_{X} (s, a) \sum_{s^{'} \in \bar{s}^{'}} T (s^{'} | s, a)$	(Eq. 11)	(41)
	$= \sum_{s \in \bar{s}} \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} \mathbb{1} \{ X_{\bar{s}, a}^{(τ_{i})} = s \} \sum_{s^{'} \in \bar{s}^{'}} T (s^{'} | s, a)$		(42)
	$= \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} \sum_{s \in \bar{s}} \mathbb{1} \{ X_{\bar{s}, a}^{(τ_{i})} = s \} \sum_{s^{'} \in \bar{s}^{'}} T (s^{'} | s, a)$		(43)
	$= \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} \sum_{s^{'} \in \bar{s}^{'}} T (s^{'} | X_{\bar{s}, a}^{(τ_{i})}, a)$		(44)
	$= \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a)$	(Eq. 40)	(45)

Then we have

$\lVert \bar{T}_{Y} (\cdot | \bar{s}, a) - \bar{T}_{ω_{X}} (\cdot | \bar{s}, a) \rVert_{1}$	$= \sum_{\bar{s}^{'}} \lvert \bar{T}_{Y} (\bar{s}^{'} | \bar{s}, a) - \bar{T}_{ω_{X}} (\bar{s}^{'} | \bar{s}, a) \rvert$	(46)
	$= \max_{z \in \{- 1, 1\}^{\bar{S}}} \sum_{\bar{s}^{'}} \left( \bar{T}_{Y} (\bar{s}^{'} | \bar{s}, a) - \bar{T}_{ω_{X}} (\bar{s}^{'} | \bar{s}, a) \right) z (\bar{s}^{'})$	(47)
	$= \max_{z \in \{- 1, 1\}^{\bar{S}}} \sum_{\bar{s}^{'}} \left( \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} \mathbb{1} \{ \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} = \bar{s}^{'} \} - \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a) \right) z (\bar{s}^{'})$	(48)
	$= \max_{z \in \{- 1, 1\}^{\bar{S}}} \left[ \sum_{\bar{s}^{'}} \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} \mathbb{1} \{ \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} = \bar{s}^{'} \} z (\bar{s}^{'}) - \sum_{\bar{s}^{'}} \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a) z (\bar{s}^{'}) \right]$	(49)
	$= \max_{z \in \{- 1, 1\}^{\bar{S}}} \left[ \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} z (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) - \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} \sum_{\bar{s}^{'}} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a) z (\bar{s}^{'}) \right]$	(50)
	$= \max_{z \in \{- 1, 1\}^{\bar{S}}} \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} \left( z (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) - \sum_{\bar{s}^{'}} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a) z (\bar{s}^{'}) \right)$	(51)
	$= \max_{z \in \{- 1, 1\}^{\bar{S}}} \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}),$	(52)

where we set

$Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) \coloneqq z (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) - \sum_{\bar{s}^{'}} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) z (\bar{s}^{'}),$

with $z$ a vector of size $| ¯ S |$ with entries $\pm 1$ , we write $z (¯ s)$ for the entry in $z$ with index $¯ s$ . To show that $\sum_{i}^{N (¯ s, a)} Z_{τ_{i}} (z, X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}, {¯ Y}_{¯ s, a}^{(τ_{i} + 1)})$ is a martingale difference sequence, we should follow Definition 3 and show that $\forall i : E [Z_{τ_{i}} (z, X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}, {¯ Y}_{¯ s, a}^{(τ_{i} + 1)}) | Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}] = 0$ and $| Z_{i} | < \infty$ . For the second part, we have that $\forall i : | Z_{τ_{i}} (z, X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}, {¯ Y}_{¯ s, a}^{(τ_{i} + 1)}) | \leq 2$ , since $| z ({¯ Y}_{¯ s, a}^{(τ_{i} + 1)}) | \leq 1$ and $| \sum_{{¯ s}^{'}} T ({¯ s}^{'} | X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}) z ({¯ s}^{'}) | \leq 1$ . For the first part, we use the following Lemma, the proof of which follows after the current proof:

Lemma 7.

Let $π$ be a policy, and suppose the sequence $s_{1}, a_{1} \dots, s_{t - 1}, a_{t - 1}, s_{t}$ is to be generated by $π$ . If $1 \leq τ_{1} < τ_{2} < \dots < τ_{i - 1} < τ_{i} \leq t$ , then $E [Z_{τ_{i}} (z, X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}, {¯ Y}_{¯ s, a}^{(τ_{i} + 1)}) | Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}] = 0$ .

Lemma 7 shows that $\sum_{i}^{N (¯ s, a)} Z_{τ_{i}} (z, X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}, {¯ Y}_{¯ s, a}^{(τ_{i} + 1)})$ is a martingale difference sequence with $\forall i : | Z_{τ_{i}} (z, X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}, {¯ Y}_{¯ s, a}^{(τ_{i} + 1)}) | \leq 2$ for any fixed $z$ and fixed $N (¯ s, a) = n$ so that by Azuma-Hoeffding (Lemma 3):

$\Pr \left( \sum_{i = 1}^{N (\bar{s}, a)} Z_{τ_{i}} > ϵ \right) \leq e^{- \frac{ϵ^{2}}{8 N (\bar{s}, a)}} .$

(53)

Similarly, $\sum_{i}^{N (¯ s, a)} \frac{1}{N (¯ s, a)} Z_{τ_{i}} (z, X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}, {¯ Y}_{¯ s, a}^{(τ_{i} + 1)})$ is a martingale difference sequence with $\forall i : | \frac{1}{N (¯ s, a)} Z_{τ_{i}} (z, X_{¯ s, a}^{(τ_{i})}, a_{τ_{i}}, {¯ Y}_{¯ s, a}^{(τ_{i} + 1)}) | \leq \frac{2}{N (¯ s, a)}$ for any fixed $z$ and fixed $N (¯ s, a) = n$ so that by Azuma-Hoeffding (Lemma 3):

$\Pr \left( \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} Z_{τ_{i}} > ϵ \right)$	$\leq e^{- \frac{ϵ^{2}}{2 \frac{4}{N (\bar{s}, a)^{2}} N (\bar{s}, a)}}$	(54)
	$= e^{- \frac{ϵ^{2}}{\frac{8}{N (\bar{s}, a)}}}$	(55)
	$= e^{- \frac{1}{8} N (\bar{s}, a) ϵ^{2}} .$	(56)

From (46) and (52) we then obtain

$\Pr \left( \lVert \bar{T}_{Y} (\cdot | \bar{s}, a) - \bar{T}_{ω_{X}} (\cdot | \bar{s}, a) \rVert_{1} > ϵ \right) = \Pr \left( \max_{z \in \{- 1, 1\}^{\bar{S}}} \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} Z_{τ_{i}} > ϵ \right) .$

(57)

A union bound (Lemma 5) over all $2^{| \bar{S} |}$ vectors $z$ for a fixed value of $N (\bar{s}, a)$ shows

$\Pr \left( \max_{z \in \{- 1, 1\}^{\bar{S}}} \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} Z_{τ_{i}} > ϵ \right) \leq \sum_{z \in \{- 1, 1\}^{\bar{S}}} \Pr \left( \frac{1}{N (\bar{s}, a)} \sum_{i = 1}^{N (\bar{s}, a)} Z_{τ_{i}} > ϵ \right),$

(58)

so that, using (56), with probability at least $1 - 2^{| \bar{S} |} e^{- \frac{1}{8} N (\bar{s}, a) ϵ^{2}}$

$\lVert \bar{T}_{Y} (\cdot | \bar{s}, a) - \bar{T}_{ω_{X}} (\cdot | \bar{s}, a) \rVert_{1} \leq ϵ .$

(59)

∎
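
To make the quantities in Proposition 1 concrete, the following sketch simulates one abstract state-action pair whose two ground states have different transition distributions over three abstract next states. It computes the empirical abstract transition function, the weighted function from (40), and the value of $ϵ$ obtained by solving $δ = 2^{| \bar{S} |} e^{- \frac{1}{8} N ϵ^{2}}$ . Everything specific here is a hypothetical assumption for illustration: the transition table `T`, the rule that picks the next visited ground state from the previous outcome (a simple way to make the samples history dependent), and all function names.

```python
import math
import random
from itertools import product

# Hypothetical ground-state dynamics: two ground states inside one abstract
# state, each with its own distribution over three abstract next states.
T = {
    "s1": [0.7, 0.2, 0.1],
    "s2": [0.6, 0.3, 0.1],
}

def simulate(n, seed=0):
    rng = random.Random(seed)
    visited = []        # ground states X in which the pair (s_bar, a) was tried
    counts = [0, 0, 0]  # how often each abstract next state was observed
    state = "s1"
    for _ in range(n):
        visited.append(state)
        nxt = rng.choices(range(3), weights=T[state])[0]
        counts[nxt] += 1
        # Assumption for illustration: the next visited ground state depends
        # on the previous outcome, so the samples are not independent.
        state = "s1" if nxt == 0 else "s2"
    t_emp = [c / n for c in counts]                                  # T_bar_Y
    t_omega = [sum(T[s][j] for s in visited) / n for j in range(3)]  # (40)
    return t_emp, t_omega

def l1_via_sign_vectors(u, v):
    """L1 distance written as a maximum over sign vectors z, as in (47)."""
    return max(sum((ui - vi) * zi for ui, vi, zi in zip(u, v, z))
               for z in product((-1, 1), repeat=len(u)))

def epsilon_for(n, num_abstract_states, delta):
    """Solve delta = 2^{|S_bar|} exp(-n eps^2 / 8) for eps."""
    return math.sqrt(
        8.0 * (num_abstract_states * math.log(2) - math.log(delta)) / n)

n = 2000
t_emp, t_omega = simulate(n)
direct_l1 = sum(abs(a - b) for a, b in zip(t_emp, t_omega))
eps = epsilon_for(n, 3, 0.05)
print(direct_l1, l1_via_sign_vectors(t_emp, t_omega), eps)
```

In a run like this the observed L1 distance is typically far smaller than the computed $ϵ$ , which reflects that the bound is a worst-case guarantee rather than a tight estimate.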

Now the proof of Lemma 7:

Proof of Lemma 7.

We follow the general structure of the proof of Lemma 8 in Strehl and Littman [2008]. We have

$E [Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)})]$	$= \sum_{c_{τ_{i} + 1}} \Pr (c_{τ_{i} + 1}) Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)})$	(60)
	$= \sum_{c_{τ_{i}}} \Pr (c_{τ_{i}}) \sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | c_{τ_{i}}, a_{τ_{i}}) Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)})$	(61)
	$= \sum_{c_{τ_{i}}} \Pr (c_{τ_{i}}) \sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) .$	(62)

The sum $\sum_{c_{τ_{i} + 1}}$ is over all possible sequences $c_{τ_{i} + 1}$ that end in a state ${¯ s}_{τ_{i} + 1}$ , resulting from $τ_{i}$ actions chosen by an agent following policy $π$ . Conditioning on the sequence of random variables $Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}$ can make some sequences $c_{τ_{i}}$ more likely and others less likely, that is

	$E [Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) | Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}]$		(63)
	$= \sum_{c_{τ_{i}}} \Pr (c_{τ_{i}} | Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}) \sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) .$		(64)

Importantly, since $P (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, Z_{τ_{1}}, \dots, Z_{τ_{i - 1}}) = P (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}})$ by the Markov property of the underlying MDP, fixed values of $Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}$ do not influence the innermost sum of (64). For this innermost sum we have

	$\sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)})$		(65)
	$= \sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) \left[ z (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) - \sum_{\bar{s}^{'}} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) z (\bar{s}^{'}) \right]$		(66)
	$= \sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) z (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) - \sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) \sum_{\bar{s}^{'}} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) z (\bar{s}^{'})$		(67)
	$= \sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) z (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) - \sum_{\bar{s}^{'}} T (\bar{s}^{'} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) z (\bar{s}^{'})$		(68)
	$= 0.$		(69)

So we conclude

	$E [Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}) | Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}]$		(70)
	$= \sum_{c_{τ_{i}}} \Pr (c_{τ_{i}} | Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}) \sum_{\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)}} \Pr (\bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)} | X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}) Z_{τ_{i}} (z, X_{\bar{s}, a}^{(τ_{i})}, a_{τ_{i}}, \bar{Y}_{\bar{s}, a}^{(τ_{i} + 1)})$		(71)
	$= \sum_{c_{τ_{i}}} \Pr (c_{τ_{i}} | Z_{τ_{1}}, Z_{τ_{2}}, \dots, Z_{τ_{i - 1}}) \times 0$		(72)
	$= 0.$		(73)

∎

Finally, we can use Proposition 1 to determine the number of samples required to guarantee that with probability $1 - δ$ the distance $| | {¯ T}_{Y} (\cdot | ¯ s, a) - {¯ T}_{ω_{x}} (\cdot | ¯ s, a) | |_{1}$ will be smaller than $ϵ$ :

Lemma 8.

For inputs $κ$ and $ϵ$ ( $0 < κ < 1, 0 < ϵ < 2$ ), we have that for a number of samples $m \geq \frac{8 [\ln (2^{| \bar{S} |}) - \ln (κ)]}{ϵ^{2}}$ the following holds:

$\Pr \left( \lVert \bar{T}_{Y} (\cdot | \bar{s}, a) - \bar{T}_{ω_{x}} (\cdot | \bar{s}, a) \rVert_{1} \geq ϵ \right) \leq κ .$

(74)

Proof.

To shorten notation we again use the definitions $P_{Y} ≜ {¯ T}_{Y} (\cdot | ¯ s, a)$ and $P_{ω_{x}} ≜ {¯ T}_{ω_{x}} (\cdot | ¯ s, a)$ . It follows from Proposition 1 that

Pr (| | P_{Y} - P_{ω_{x}} | |_{1} \geq ϵ) \leq 2^{| ¯ S |} e^{- \frac{1}{8} m ϵ^{2}} .

(75)

We need to select $m$ such that $κ \geq 2^{| ¯ S |} e^{- \frac{1}{8} m ϵ^{2}}$ :

$κ$	$\geq 2^{\| ¯ S \|} e^{- \frac{1}{8} m ϵ^{2}}$	(76)
$\frac{κ}{2^{\| ¯ S \|}}$	$\geq e^{- \frac{1}{8} m ϵ^{2}}$	(77)
$ln (κ) - ln (2^{\| ¯ S \|})$	$\geq - \frac{m ϵ^{2}}{8}$	(78)
$\frac{m ϵ^{2}}{8}$	$\geq ln (2^{\| ¯ S \|}) - ln (κ)$	(79)
$m$	$\geq \frac{8 [ln (2^{\| ¯ S \|}) - ln (κ)]}{ϵ^{2}} .$	(80)

Thus if $m \geq \frac{8 [ln (2^{| ¯ S |}) - ln (κ)]}{ϵ^{2}}$ we have

$\Pr \left( \lVert P_{Y} - P_{ω_{x}} \rVert_{1} \geq ϵ \right) \leq κ .$

(81)

∎
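
The sample size in (80) is easy to evaluate numerically. The following sketch computes the smallest integer $m$ that satisfies (80) and checks it against the failure probability from (75). The function names and the example values (10 abstract states, $ϵ = 0.1$ , $κ = 0.05$ ) are illustrative assumptions.

```python
import math

def samples_needed(num_abstract_states, epsilon, kappa):
    """Smallest integer m with m >= 8 [ln(2^{|S_bar|}) - ln(kappa)] / eps^2."""
    if not (0.0 < kappa < 1.0 and 0.0 < epsilon < 2.0):
        raise ValueError("requires 0 < kappa < 1 and 0 < epsilon < 2")
    log_term = num_abstract_states * math.log(2.0) - math.log(kappa)
    return math.ceil(8.0 * log_term / epsilon ** 2)

def failure_bound(num_abstract_states, epsilon, m):
    """Right-hand side of (75): 2^{|S_bar|} exp(-m eps^2 / 8)."""
    return 2.0 ** num_abstract_states * math.exp(-m * epsilon ** 2 / 8.0)

m = samples_needed(10, 0.1, 0.05)
print(m, failure_bound(10, 0.1, m))
```

Because $\ln (2^{| \bar{S} |}) = | \bar{S} | \ln 2$ , the required number of samples grows linearly in the number of abstract states and quadratically in $1 / ϵ$ .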

C.2 Upper Bounds on Value Differences Under Different Policies

Let $L^{π}$ be the Bellman operator for the policy $π$ Howard [1960]. We define the $n$ -step value $\forall s \in S$ :

	$V^{π, n} (s)$	$= R (s, π (s)) + \sum_{s^{'} \in S} T (s^{'} | s, π (s)) V^{π, n - 1} (s^{'}),$		(82)
	$V^{π, 1} (s)$	$= R (s, π (s)) .$		(83)

Before going into the differences with abstract spaces, we first give a simulation lemma for two MDPs on the same state-action space:

Lemma 9.

Let $M$ and $M^{'}$ be two MDPs on the same state-action space, with

	$\forall s, a \in S \times A : \| R_{M} (s, a) - R_{M^{'}} (s, a) \|$	$\leq ϵ_{R},$		(84)
	$\forall s, a, s^{'} \in S \times A \times S : \| T_{M} (s^{'} \| s, a) - T_{M^{'}} (s^{'} \| s, a) \|$	$\leq ϵ_{T} .$		(85)

Then, for every policy $π$ and for every state $s \in S$ we have:

| V_{M}^{π, n} (s) - V_{M^{'}}^{π, n} (s) | \leq n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | S | .

(86)

Proof.

By induction we will show that for $n \geq 1$

\forall s \in S : | V_{M}^{π, n} (s) - V_{M^{'}}^{π, n} (s) | \leq n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | S | .

(87)

For $n = 1$ we have

$\lvert V_{M}^{π, 1} (s) - V_{M^{'}}^{π, 1} (s) \rvert = \lvert R_{M} (s, π (s)) - R_{M^{'}} (s, π (s)) \rvert \leq ϵ_{R} .$

(88)

Now assume that the induction hypothesis, (87), holds for $n - 1$ , then

	$\lvert V_{M}^{π, n} (s) - V_{M^{'}}^{π, n} (s) \rvert = \lvert R_{M} (s, π (s)) - R_{M^{'}} (s, π (s)) + \sum_{s^{'} \in S} T_{M} (s^{'} | s, π (s)) V_{M}^{π, n - 1} (s^{'}) - \sum_{s^{'} \in S} T_{M^{'}} (s^{'} | s, π (s)) V_{M^{'}}^{π, n - 1} (s^{'}) \rvert$		(89)
	$\leq \lvert R_{M} (s, π (s)) - R_{M^{'}} (s, π (s)) \rvert + \lvert \sum_{s^{'} \in S} T_{M} (s^{'} | s, π (s)) V_{M}^{π, n - 1} (s^{'}) - \sum_{s^{'} \in S} T_{M^{'}} (s^{'} | s, π (s)) V_{M^{'}}^{π, n - 1} (s^{'}) \rvert$		(90)
	$\leq ϵ_{R} + \lvert \sum_{s^{'} \in S} T_{M} (s^{'} | s, π (s)) V_{M}^{π, n - 1} (s^{'}) - \sum_{s^{'} \in S} T_{M} (s^{'} | s, π (s)) V_{M^{'}}^{π, n - 1} (s^{'}) \rvert$
	$+ \lvert \sum_{s^{'} \in S} T_{M} (s^{'} | s, π (s)) V_{M^{'}}^{π, n - 1} (s^{'}) - \sum_{s^{'} \in S} T_{M^{'}} (s^{'} | s, π (s)) V_{M^{'}}^{π, n - 1} (s^{'}) \rvert$		(91)
	$\leq ϵ_{R} + \sum_{s^{'} \in S} T_{M} (s^{'} | s, π (s)) \lvert V_{M}^{π, n - 1} (s^{'}) - V_{M^{'}}^{π, n - 1} (s^{'}) \rvert + \sum_{s^{'} \in S} \lvert T_{M} (s^{'} | s, π (s)) - T_{M^{'}} (s^{'} | s, π (s)) \rvert V_{M^{'}}^{π, n - 1} (s^{'})$		(92)
	$\leq ϵ_{R} + (n - 1) ϵ_{R} + \frac{(n - 1 - 1) (n - 1)}{2} ϵ_{T} \lvert S \rvert + ϵ_{T} (n - 1) \lvert S \rvert$		(93)
	$= n ϵ_{R} + \frac{(n - 2) (n - 1)}{2} ϵ_{T} \lvert S \rvert + ϵ_{T} (n - 1) \lvert S \rvert$		(94)
	$= n ϵ_{R} + \left( n - 1 + \frac{(n - 2) (n - 1)}{2} \right) ϵ_{T} \lvert S \rvert$		(95)
	$= n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} \lvert S \rvert .$		(96)

For the step from (90) to (91) we add and subtract $\sum_{s^{'} \in S} T_{M} (s^{'} | s, π (s)) V_{M^{'}}^{π, n - 1} (s^{'})$ and apply the triangle inequality. From (92) to (93) we use the inductive hypothesis and the fact that we can upper bound $V_{M^{'}}^{π, n - 1} (s^{'})$ by $(n - 1)$ , since the maximum reward per timestep is $1$ . ∎
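
The bound of Lemma 9 can be checked numerically on small random instances. The sketch below fixes a policy, so that each MDP reduces to a reward vector and a transition matrix, perturbs one model slightly, and compares the largest difference in $n$ -step value with $n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | S |$ . The random construction, the perturbation scheme, and all names are hypothetical choices made for illustration; rewards are kept in $[0, 1]$ as the proof assumes.

```python
import random

def random_mdp(n_states, rng):
    """Rewards in [0, 1] and a row-stochastic transition matrix under a fixed policy."""
    rewards = [rng.random() for _ in range(n_states)]
    rows = []
    for _ in range(n_states):
        w = [rng.random() for _ in range(n_states)]
        total = sum(w)
        rows.append([x / total for x in w])
    return rewards, rows

def perturb(rewards, rows, scale, rng):
    """Shift rewards slightly and mix every transition row with a random row."""
    new_r = [min(1.0, max(0.0, r + scale * (rng.random() - 0.5))) for r in rewards]
    new_rows = []
    for row in rows:
        w = [rng.random() for _ in row]
        total = sum(w)
        other = [x / total for x in w]
        new_rows.append([(1 - scale) * a + scale * b for a, b in zip(row, other)])
    return new_r, new_rows

def n_step_values(rewards, rows, n):
    """Undiscounted n-step values, following (82) and (83)."""
    v = list(rewards)  # V^{pi,1}(s) = R(s, pi(s))
    for _ in range(n - 1):
        v = [rewards[s] + sum(rows[s][t] * v[t] for t in range(len(v)))
             for s in range(len(v))]
    return v

rng = random.Random(0)
r1, t1 = random_mdp(5, rng)
r2, t2 = perturb(r1, t1, 0.05, rng)
n = 10
eps_r = max(abs(a - b) for a, b in zip(r1, r2))
eps_t = max(abs(a - b) for ra, rb in zip(t1, t2) for a, b in zip(ra, rb))
gap = max(abs(a - b)
          for a, b in zip(n_step_values(r1, t1, n), n_step_values(r2, t2, n)))
bound = n * eps_r + (n - 1) * n / 2 * eps_t * len(r1)
print(gap, bound)
```

The observed gap is usually well below the bound, because the proof bounds every next-state value by its maximum $(n - 1)$ and every transition difference by $ϵ_{T}$ .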

This shows that for similar MDPs the values under any policy are also similar. In the same sense, with an approximate model-similarity abstraction, the abstract MDP $¯ M$ is close to the MDP $M$ when

	$\forall ¯ s, a \in ¯ S \times A, s \in ¯ s : | R (s, a) - ¯ R (¯ s, a) | \leq η_{R},$		(97)
	$\forall ¯ s, a, {¯ s}^{'} \in ¯ S \times A \times ¯ S, s \in ¯ s : | \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) - ¯ T ({¯ s}^{'} | ¯ s, a) | \leq η_{T} .$		(98)

By $V^{¯ π, n}$ we denote the value in $M$ under policy $¯ π$ and by ${¯ V}^{¯ π, n}$ the value in $¯ M$ under policy $¯ π$ .

The following Lemma shows that for any abstract policy $¯ π$ we can upper bound the difference in value between $V^{¯ π, n}$ and ${¯ V}^{¯ π, n}$ . This shows that the value obtained in the abstract MDP $¯ M$ will be close to the value obtained in the real MDP $M$ .

Lemma 10.

For every abstract policy $¯ π$ and for every state $s \in ¯ s$

| V^{¯ π, n} (s) - {¯ V}^{¯ π, n} (s) | \leq n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | ¯ S | .

(99)

Proof.

By induction we will show that for $n \geq 1$

\forall ¯ s \in ¯ S, s \in ¯ s : | V^{¯ π, n} (s) - {¯ V}^{¯ π, n} (s) | \leq n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | ¯ S | .

(100)

For $n = 1$ we have

| V^{¯ π, 1} (s) - {¯ V}^{¯ π, 1} (s) | = | R (s, ¯ π (¯ s)) - ¯ R (¯ s, ¯ π (¯ s)) | \leq ϵ_{R} .

(101)

Now assume that the induction hypothesis, (100), holds for $n - 1$ , then

	$| V^{¯ π, n} (s) - {¯ V}^{¯ π, n} (s) | = | R (s, ¯ π (¯ s)) - ¯ R (¯ s, ¯ π (¯ s)) + \sum_{s^{'} \in S} T (s^{'} | s, ¯ π (¯ s)) V^{¯ π, n - 1} (s^{'}) - \sum_{{¯ s}^{'} \in ¯ S} ¯ T ({¯ s}^{'} | ¯ s, ¯ π (¯ s)) {¯ V}^{¯ π, n - 1} ({¯ s}^{'}) |$		(102)
	$\leq | R (s, ¯ π (¯ s)) - ¯ R (¯ s, ¯ π (¯ s)) | + | \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, ¯ π (¯ s)) V^{¯ π, n - 1} (s^{'}) - \sum_{{¯ s}^{'} \in ¯ S} ¯ T ({¯ s}^{'} | ¯ s, ¯ π (¯ s)) {¯ V}^{¯ π, n - 1} ({¯ s}^{'}) |$		(103)
	$\leq ϵ_{R} + | \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, ¯ π (¯ s)) V^{¯ π, n - 1} (s^{'}) - \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, ¯ π (¯ s)) {¯ V}^{¯ π, n - 1} ({¯ s}^{'}) |$		(104)
	$+ | \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, ¯ π (¯ s)) {¯ V}^{¯ π, n - 1} ({¯ s}^{'}) - \sum_{{¯ s}^{'} \in ¯ S} ¯ T ({¯ s}^{'} | ¯ s, ¯ π (¯ s)) {¯ V}^{¯ π, n - 1} ({¯ s}^{'}) |$		(105)
	$\leq ϵ_{R} + (n - 1) ϵ_{R} + \frac{(n - 1 - 1) (n - 1)}{2} ϵ_{T} | ¯ S | + ϵ_{T} (n - 1) | ¯ S |$		(106)
	$= n ϵ_{R} + \frac{(n - 2) (n - 1)}{2} ϵ_{T} | ¯ S | + ϵ_{T} (n - 1) | ¯ S |$		(107)
	$= n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | ¯ S | .$		(108)

For the step from (103) to (104) we add and subtract $\sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, ¯ π (¯ s)) {¯ V}^{¯ π, n - 1} ({¯ s}^{'})$, and from (105) to (106) we use the inductive hypothesis and the fact that we can upper bound ${¯ V}^{¯ π, n - 1} ({¯ s}^{'})$ by $(n - 1)$, since the maximum reward per timestep is $1$. ∎
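A similar numerical check is possible for Lemma 10. The sketch below builds an abstract MDP by aggregating a random ground MDP, measures $η_{R}$ and $η_{T}$ as in (97) and (98), and lets them play the role of $ϵ_{R}$ and $ϵ_{T}$ in the bound. The uniform weighting inside each abstract state, the modulo-based abstraction function, and all function names are assumptions made for illustration only.

```python
import numpy as np


def n_step_value(T, R, policy, n):
    """Undiscounted n-step value of a deterministic policy."""
    states = np.arange(R.shape[0])
    r_pi, T_pi = R[states, policy], T[states, policy]
    v = np.zeros(R.shape[0])
    for _ in range(n):
        v = r_pi + T_pi @ v
    return v


def build_abstract_mdp(T, R, phi, num_abstract):
    """Aggregate a ground MDP using uniform weights (an illustrative choice)."""
    num_states = R.shape[0]
    member = np.zeros((num_states, num_abstract))
    member[np.arange(num_states), phi] = 1.0
    T_agg = T @ member                       # (S, A, S_bar): sum over s' in each abstract state
    weights = member / member.sum(axis=0)    # uniform weight of s inside phi(s)
    T_bar = np.einsum("sk,sat->kat", weights, T_agg)
    R_bar = np.einsum("sk,sa->ka", weights, R)
    eta_T = np.abs(T_agg - T_bar[phi]).max()   # measured as in (98)
    eta_R = np.abs(R - R_bar[phi]).max()       # measured as in (97)
    return T_bar, R_bar, eta_T, eta_R


def check_lemma10(seed, num_states=6, num_abstract=3, num_actions=2, n=8):
    """Return (largest value gap, bound) for a random MDP and abstract policy."""
    rng = np.random.default_rng(seed)
    T = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
    R = rng.uniform(0.0, 1.0, size=(num_states, num_actions))
    phi = np.arange(num_states) % num_abstract     # every abstract state is non-empty
    T_bar, R_bar, eta_T, eta_R = build_abstract_mdp(T, R, phi, num_abstract)
    pi_bar = rng.integers(0, num_actions, size=num_abstract)
    v_ground = n_step_value(T, R, pi_bar[phi], n)  # abstract policy executed in M
    v_abstract = n_step_value(T_bar, R_bar, pi_bar, n)
    gap = np.abs(v_ground - v_abstract[phi]).max()
    bound = n * eta_R + (n - 1) * n / 2 * eta_T * num_abstract
    return gap, bound
```

With a random abstraction the measured $η_{T}$ is large and the bound is loose; grouping states with similar dynamics would tighten it.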

Similarly, the following Lemma shows that we can upper bound the difference in value between the optimal value in $M$ and the optimal value in $¯ M$ .

Lemma 11.

With an approximate model similarity abstraction $ϕ$, we have

\forall ¯ s \in ¯ S, s \in ¯ s : | V^{*, n} (s) - {¯ V}^{*, n} (¯ s) | \leq n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | ¯ S | .

(109)

Proof.

First we define

	$\forall ¯ s \in ¯ S, s \in ¯ s : V^{*, n} (s)$	$= \max_{a \in A} [R (s, a) + \sum_{s^{'} \in S} T (s^{'} | s, a) V^{*, n - 1} (s^{'})],$		(110)
	${¯ V}^{*, n} (¯ s)$	$= \max_{a \in A} [¯ R (¯ s, a) + \sum_{{¯ s}^{'} \in ¯ S} ¯ T ({¯ s}^{'} | ¯ s, a) {¯ V}^{*, n - 1} ({¯ s}^{'})] .$		(111)

By induction we will show that for $n \geq 1$

\forall ¯ s \in ¯ S, s \in ¯ s : | V^{*, n} (s) - {¯ V}^{*, n} (¯ s) | \leq n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | ¯ S | .

(112)

Making use of the fact that $| \max f - \max g | \leq \max | f - g |$, we have for $n = 1$

| V^{*, 1} (s) - {¯ V}^{*, 1} (¯ s) | = | \max_{a \in A} R (s, a) - \max_{a \in A} ¯ R (¯ s, a) | \leq \max_{a \in A} | R (s, a) - ¯ R (¯ s, a) | \leq ϵ_{R} .

(113)

Now assume that the induction hypothesis, (112), holds for $n - 1$ , then

	$| V^{*, n} (s) - {¯ V}^{*, n} (¯ s) | \leq \max_{a \in A} | R (s, a) - ¯ R (¯ s, a) + \sum_{s^{'} \in S} T (s^{'} | s, a) V^{*, n - 1} (s^{'}) - \sum_{{¯ s}^{'} \in ¯ S} ¯ T ({¯ s}^{'} | ¯ s, a) {¯ V}^{*, n - 1} ({¯ s}^{'}) |$		(114)
	$\leq \max_{a \in A} | R (s, a) - ¯ R (¯ s, a) | + \max_{a \in A} | \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) V^{*, n - 1} (s^{'}) - \sum_{{¯ s}^{'} \in ¯ S} ¯ T ({¯ s}^{'} | ¯ s, a) {¯ V}^{*, n - 1} ({¯ s}^{'}) |$		(115)
	$\leq ϵ_{R} + \max_{a \in A} | \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) V^{*, n - 1} (s^{'}) - \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) {¯ V}^{*, n - 1} ({¯ s}^{'}) |$
	$+ \max_{a \in A} | \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) {¯ V}^{*, n - 1} ({¯ s}^{'}) - \sum_{{¯ s}^{'} \in ¯ S} ¯ T ({¯ s}^{'} | ¯ s, a) {¯ V}^{*, n - 1} ({¯ s}^{'}) |$		(116)
	$\leq ϵ_{R} + \max_{a \in A} | \sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) [V^{*, n - 1} (s^{'}) - {¯ V}^{*, n - 1} ({¯ s}^{'})] | + \max_{a \in A} | \sum_{{¯ s}^{'} \in ¯ S} [¯ T ({¯ s}^{'} | ¯ s, a) - \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a)] {¯ V}^{*, n - 1} ({¯ s}^{'}) |$		(117)
	$\leq ϵ_{R} + (n - 1) ϵ_{R} + \frac{(n - 1 - 1) (n - 1)}{2} ϵ_{T} | ¯ S | + ϵ_{T} (n - 1) | ¯ S |$		(118)
	$= n ϵ_{R} + \frac{(n - 1) n}{2} ϵ_{T} | ¯ S | .$		(119)

For the step from (115) to (116) we add and subtract $\sum_{{¯ s}^{'} \in ¯ S} \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) {¯ V}^{*, n - 1} ({¯ s}^{'})$, and from (117) to (118) we use the inductive hypothesis and again the fact that we can upper bound ${¯ V}^{*, n - 1} ({¯ s}^{'})$ by $(n - 1)$, since the maximum reward per timestep is $1$. ∎
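The inequality on maxima used in this proof is standard; for completeness, a short derivation is sketched here, where $a_{f}$ is notation introduced only for this sketch and denotes a maximizer of $f$ over the finite action set.

```latex
% Let a_f \in \arg\max_a f(a). Then
\max_a f(a) - \max_a g(a)
  = f(a_f) - \max_a g(a)
  \le f(a_f) - g(a_f)
  \le \max_a | f(a) - g(a) | .
% Swapping the roles of f and g bounds \max_a g(a) - \max_a f(a) in the same way, hence
| \max_a f(a) - \max_a g(a) | \le \max_a | f(a) - g(a) | .
```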

Finally, we want to give results for an empirical abstract model $^¯ M$ , whose transition probabilities and rewards are within $ϵ_{T}$ and $ϵ_{R}$ , respectively, from those of an abstract MDP $¯ M$ . The following Lemma shows that we can upper bound the loss in value when using the n-step optimal policy ${^¯ π}^{*}$ for $^¯ M$ and apply it to $M$ :

Lemma 12.

Let $M$ be an MDP, $¯ M$ an abstract MDP constructed using an approximate-model-similarity abstraction $ϕ$ with $η_{R}$ and $η_{T}$, and $^¯ M$ an MDP in the abstract space from $ϕ$ with

| ¯ T ({¯ s}^{'} | ¯ s, a) -^¯ T ({¯ s}^{'} | ¯ s, a) | \leq ϵ_{T}, | ¯ R (¯ s, a) -^¯ R (¯ s, a) | \leq ϵ_{R} .

(120)

Then

V^{*, n} (s) - V^{{^¯ π}^{*}, n} (s) \leq 2 n (η_{R} + ϵ_{R}) + (n - 1) n (η_{T} + ϵ_{T}) | ¯ S | .

(121)

Proof.

Note that we have

	$\forall ¯ s, a \in ¯ S \times A, s \in ¯ s : | R (s, a) -^¯ R (¯ s, a) | \leq η_{R} + ϵ_{R},$		(122)
	$\forall ¯ s, a, {¯ s}^{'} \in ¯ S \times A \times ¯ S, s \in ¯ s : | \sum_{s^{'} \in {¯ s}^{'}} T (s^{'} | s, a) -^¯ T ({¯ s}^{'} | ¯ s, a) | \leq η_{T} + ϵ_{T} .$		(123)

By the triangle inequality we have, $\forall s \in ¯ s, ¯ s \in ¯ S$:

	$V^{*, n} (s) - V^{{^¯ π}^{*}, n} (s) \leq | V^{*, n} (s) - {^¯ V}^{*, n} (¯ s) | + | {^¯ V}^{{^¯ π}^{*}, n} (¯ s) - V^{{^¯ π}^{*}, n} (s) |$		(124)
	$\leq n (η_{R} + ϵ_{R}) + \frac{(n - 1) n}{2} (η_{T} + ϵ_{T}) | ¯ S | + n (η_{R} + ϵ_{R}) + \frac{(n - 1) n}{2} (η_{T} + ϵ_{T}) | ¯ S |$		(125)
	$= 2 n (η_{R} + ϵ_{R}) + (n - 1) n (η_{T} + ϵ_{T}) | ¯ S | .$		(126)

Here the first term is bounded using Lemma 11 and the second using Lemma 10, both applied with the bounds (122) and (123). ∎

c.3 R-MAX Adapted to Abstracted RL

Let $L$ be the set of unknown abstract state-action pairs and let $M$ be an MDP. $M_{L}$ is an MDP that is the same as $M$ on the known state-action pairs, i.e., the pairs $(s, a)$ with $(ϕ (s), a) \notin L$, and that differs from $M$ on the state-action pairs whose abstract pair is in $L$. In $M_{L}$, every state-action pair with $(ϕ (s), a) \in L$ results in a self-loop and gives the maximum reward ($1$), i.e., $\forall (ϕ (s), a) \in L : T_{M_{L}} (s | s, a) = 1, R_{M_{L}} (s, a) = 1$.

First, we restate an Implicit Explore or Exploit Lemma. For two MDPs that have different dynamics only in the unknown state-action pairs, the difference in the $n$-step value between the two MDPs is small if the probability $Pr (A_{M})$ that we encounter an unknown state-action pair in an $n$-step trial is small:

Lemma 13 (Implicit Explore or Exploit).

Let $M$ be an MDP, and $L$ and $M_{L}$ as above. Let $s_{1} \in {¯ s}_{1}$ be some state, and $A_{M}$ the event that an abstract state-action pair in $L$ (i.e., $(ϕ (s), a) \in L$) is encountered in a trial generated by starting from state $s_{1}$ and following $π$ for $n$ steps in $M$. Then,

V_{M}^{π, n} (s_{1}) \geq V_{M_{L}}^{π, n} (s_{1}) - n Pr (A_{M}) .

(127)

Proof.

This proof follows the steps of Lemma 3 from Strehl and Littman [2008].

For a fixed path $p_{t} = s_{1}, a_{1}, r_{1}, \dots, s_{t}, a_{t}, r_{t}$ , we define ${Pr}_{M}^{t} (p_{t})$ as the probability that $p_{t}$ occurs when running policy $π$ in $M$ starting from state $s_{1}$ . We let $L_{t}$ be the set of paths $p_{t}$ such that there is at least one unknown state $s_{i}$ in $p_{t}$ ( $ϕ (s_{i}) \in L$ ). We further let $r_{M} (t)$ be the reward received at time $t$ and $r_{M} (p_{t}, t)$ the reward at time $t$ in the path $p_{t}$ . We have the following:

$E [r_{M_{L}} (t)] - E [r_{M} (t)]$	$= \sum_{p_{t} \notin L_{t}} ({Pr}_{M_{L}}^{t} (p_{t}) r_{M_{L}} (p_{t}, t) - {Pr}_{M}^{t} (p_{t}) r_{M} (p_{t}, t))$	(128)
	$+ \sum_{p_{t} \in L_{t}} ({Pr}_{M_{L}}^{t} (p_{t}) r_{M_{L}} (p_{t}, t) - {Pr}_{M}^{t} (p_{t}) r_{M} (p_{t}, t))$	(129)
	$= \sum_{p_{t} \in L_{t}} ({Pr}_{M_{L}}^{t} (p_{t}) r_{M_{L}} (p_{t}, t) - {Pr}_{M}^{t} (p_{t}) r_{M} (p_{t}, t))$	(130)
	$\leq \sum_{p_{t} \in L_{t}} {Pr}_{M_{L}}^{t} (p_{t}) r_{M_{L}} (p_{t}, t) \leq Pr (A_{M}) .$	(131)

Here $\sum_{p_{t} \notin L_{t}} ({Pr}_{M_{L}}^{t} (p_{t}) r_{M_{L}} (p_{t}, t) - {Pr}_{M}^{t} (p_{t}) r_{M} (p_{t}, t)) = 0$ because by definition $M$ and $M_{L}$ behave identically on the known state-action pairs, and $\sum_{p_{t} \in L_{t}} {Pr}_{M_{L}}^{t} (p_{t}) r_{M_{L}} (p_{t}, t) \leq Pr (A_{M})$ is true because $r_{M_{L}} (p_{t}, t)$ is at most $1$. Finally we can write

	$V_{M_{L}}^{π, n} (s_{1}) - V_{M}^{π, n} (s_{1})$	$= \sum_{t = 1}^{n} (E [r_{M_{L}} (t)] - E [r_{M} (t)])$		(132)
		$\leq n Pr (A_{M}) .$		(133)

Thus $V_{M}^{π, n} (s_{1}) \geq V_{M_{L}}^{π, n} (s_{1}) - n Pr (A_{M})$ . ∎

Now we are ready to prove the main theorem.

Theorem 1. Given an MDP $M$, an approximate model similarity abstraction $ϕ$ with $η_{R}$ and $η_{T}$, and inputs $| ¯ S |, | A |, ϵ, δ, T_{ϵ}$, with probability at least $1 - δ$ the R-MAX algorithm adapted to abstraction (Algorithm 1) will attain an expected return of $Opt (\prod_{M} (ϵ, T_{ϵ})) - 3 \frac{g (η_{T}, η_{R})}{T_{ϵ}} - 2 ϵ$ within a number of steps polynomial in $| ¯ S |, | A |, \frac{1}{ϵ}, \frac{1}{δ}, T_{ϵ}$. Here $T_{ϵ}$ is the $ϵ$-return mixing time of the optimal policy, $\prod_{M} (ϵ, T_{ϵ})$ denotes the policies for $M$ whose $ϵ$-return mixing time is $T_{ϵ}$, $Opt (\prod_{M} (ϵ, T_{ϵ}))$ denotes the optimal expected return achievable by such policies, and

g (η_{T}, η_{R}) = T_{ϵ} η_{R} + \frac{(T_{ϵ} - 1) T_{ϵ}}{2} η_{T} | ¯ S | .

Proof of Theorem 1.

The proof uses elements of the Theorem from Brafman and Tennenholtz [2002]. It proceeds in the following steps:

We show that the expected average reward of the algorithm is at least as stated, if the algorithm does not fail.
We show that the probability of failure is at most $δ$. This probability can be decomposed into three elements:
1. The probability that the transition function estimates are not within the desired bounds.
2. The probability that we do not attain the number of required visits in polynomial time.
3. The probability that the actual return is lower than the expected return.

We first assume that the algorithm does not fail. We define ${¯ M}_{ω_{X}}$ as an abstract MDP constructed from $ϕ$ with $η_{T}$ and $η_{R}$. Similar to $M_{L}$, we define ${¯ M}_{ω_{X}, L}$ to be the same as ${¯ M}_{ω_{X}}$ on the known abstract state-action pairs, and with a self-loop and the maximum reward on the unknown abstract state-action pairs. We also define an empirical abstract MDP ${¯ M}_{Y}$, of which the transition probabilities ${¯ T}_{Y} ({¯ s}^{'} | ¯ s, a)$ are within some $ϵ_{2}$ (defined later) of those in ${¯ M}_{ω_{X}}$ and with ${¯ R}_{ω_{X}} (¯ s, a) = {¯ R}_{Y} (¯ s, a)$, because of the assumption that the rewards are deterministic. Then, ${¯ M}_{Y, L}$ is the abstract MDP that is the same as ${¯ M}_{Y}$ on the known abstract state-action pairs and the same as ${¯ M}_{ω_{X}, L}$ on the unknown abstract state-action pairs. We denote the R-MAX policy by $¯ π$.

Let $A_{M}$ be the event that following $¯ π$ we encounter a state-action pair $(ϕ (s), a) \in L$ in $T_{ϵ}$ steps. From Lemma 13 we have that for all $s$

V_{M}^{¯ π, T_{ϵ}} (s) \geq V_{M_{L}}^{¯ π, T_{ϵ}} (s) - T_{ϵ} Pr (A_{M}) .

(134)

Now suppose that $Pr (A_{M}) < ϵ_{1}$ , for some $ϵ_{1}$ (defined later), then we have

$V_{M}^{¯ π, T_{ϵ}} (s)$	$\geq V_{M_{L}}^{¯ π, T_{ϵ}} (s) - T_{ϵ} Pr (A_{M})$	(135)
	$\geq V_{M_{L}}^{¯ π, T_{ϵ}} (s) - T_{ϵ} ϵ_{1}$	(136)
	$\geq V_{{¯ M}_{ω_{X}, L}}^{¯ π, T_{ϵ}} (s) - T_{ϵ} ϵ_{1} - g (η_{T}, η_{R})$	(137)
	$\geq V_{{¯ M}_{Y, L}}^{¯ π, T_{ϵ}} (s) - T_{ϵ} ϵ_{1} - g (ϵ_{2}) - g (η_{T}, η_{R})$	(138)
	$\geq V_{{¯ M}_{Y}}^{*, T_{ϵ}} (s) - T_{ϵ} ϵ_{1} - g (ϵ_{2}) - g (η_{T}, η_{R})$	(139)
	$\geq V_{M}^{*, T_{ϵ}} (s) - T_{ϵ} ϵ_{1} - g (ϵ_{2}) - g (η_{T}, η_{R}) - 2 g (η_{T} + ϵ_{2}, η_{R}) .$	(140)

Here the step from (135) to (136) follows from the assumption that $Pr (A_{M}) < ϵ_{1}$ . The step from (136) to (137) follows from Lemma 10, where $g (η_{T}, η_{R}) = T_{ϵ} η_{R} + \frac{(T_{ϵ} - 1) T_{ϵ}}{2} η_{T} | ¯ S |$ . The step from (137) to (138) follows from Lemma 9, where $g (ϵ_{2}) = \frac{(T_{ϵ} - 1) T_{ϵ}}{2} ϵ_{2} | ¯ S |$ . The step from (138) to (139) follows because the R-MAX policy $¯ π$ is the optimal policy for ${¯ M}_{Y, L}$ and ${¯ M}_{Y, L}$ is the same as ${¯ M}_{Y}$ on the known state-action pairs and overestimates the value that can be obtained on the unknown state-action pairs (to the maximum value). Finally, the step from (139) to (140) follows from Lemma 12.

To obtain the result for the average reward we divide (140) by $T_{ϵ}$ and get

	$Opt (\prod_{M} (ϵ, T_{ϵ})) - T_{ϵ} ϵ_{1} / T_{ϵ} - g (ϵ_{2}) / T_{ϵ} - g (η_{T}, η_{R}) / T_{ϵ} - 2 g (η_{T} + ϵ_{2}, η_{R}) / T_{ϵ}$		(141)
	$= Opt (\prod_{M} (ϵ, T_{ϵ})) - ϵ_{1} - \frac{(T_{ϵ} - 1) T_{ϵ}}{2} ϵ_{2} | ¯ S | / T_{ϵ} - (T_{ϵ} η_{R} + \frac{(T_{ϵ} - 1) T_{ϵ}}{2} η_{T} | ¯ S |) / T_{ϵ} - 2 (T_{ϵ} η_{R} + \frac{(T_{ϵ} - 1) T_{ϵ}}{2} (η_{T} + ϵ_{2}) | ¯ S |) / T_{ϵ}$		(142)
	$= Opt (\prod_{M} (ϵ, T_{ϵ})) - ϵ_{1} - \frac{(T_{ϵ} - 1)}{2} ϵ_{2} | ¯ S | - η_{R} - \frac{(T_{ϵ} - 1)}{2} η_{T} | ¯ S | - 2 η_{R} - (T_{ϵ} - 1) (η_{T} + ϵ_{2}) | ¯ S |$		(143)
	$= Opt (\prod_{M} (ϵ, T_{ϵ})) - ϵ_{1} - \frac{(T_{ϵ} - 1)}{2} ϵ_{2} | ¯ S | - 3 η_{R} - \frac{(T_{ϵ} - 1)}{2} η_{T} | ¯ S | - (T_{ϵ} - 1) ϵ_{2} | ¯ S | - (T_{ϵ} - 1) η_{T} | ¯ S |$		(144)
	$= Opt (\prod_{M} (ϵ, T_{ϵ})) - ϵ_{1} - 3 \frac{(T_{ϵ} - 1)}{2} ϵ_{2} | ¯ S | - 3 η_{R} - 3 \frac{(T_{ϵ} - 1)}{2} η_{T} | ¯ S |$		(145)
	$= Opt (\prod_{M} (ϵ, T_{ϵ})) - ϵ_{1} - 3 \frac{(T_{ϵ} - 1)}{2} ϵ_{2} | ¯ S | - 3 η_{R} - 3 \frac{(T_{ϵ} - 1)}{2} η_{T} | ¯ S |$		(146)
	$= Opt (\prod_{M} (ϵ, T_{ϵ})) - \frac{3}{2} ϵ - 3 \frac{g (η_{T}, η_{R})}{T_{ϵ}} .$		(147)

In the last step we chose $ϵ_{1} = \frac{3}{8} ϵ$ , and $ϵ_{2} = \frac{3 ϵ}{4 | ¯ S | (T_{ϵ} - 1)}$ .
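The arithmetic behind this choice can be verified exactly. The following sketch uses Python's `fractions` module to confirm that the estimation-dependent terms $ϵ_{1} + 3 \frac{(T_{ϵ} - 1)}{2} ϵ_{2} | ¯ S |$ collapse to $\frac{3}{2} ϵ$; the function name and the sample inputs are illustrative, and $T_{ϵ} > 1$ is assumed so that $ϵ_{2}$ is well defined.

```python
from fractions import Fraction


def estimation_loss(eps, T_eps, num_abstract):
    """Exact value of eps_1 + 3 * (T_eps - 1) / 2 * eps_2 * |S_bar|.

    eps is a Fraction, T_eps > 1 and num_abstract >= 1 are integers.
    """
    eps_1 = Fraction(3, 8) * eps
    eps_2 = 3 * eps / (4 * num_abstract * (T_eps - 1))
    return eps_1 + 3 * Fraction(T_eps - 1, 2) * eps_2 * num_abstract


# The result does not depend on T_eps or |S_bar|: it is always (3/2) * eps.
loss = estimation_loss(Fraction(1, 10), T_eps=20, num_abstract=7)
```

The factors $| ¯ S |$ and $(T_{ϵ} - 1)$ cancel against the denominator of $ϵ_{2}$, which is why the final loss depends on $ϵ$ only.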

The above assumed that the algorithm did not fail, but this cannot be guaranteed with probability 1 within a number of steps that is polynomial in the input. We will now show that the probability of failure can be bounded by $δ$. There are three reasons why the algorithm could fail.

First, we need to show that the transition probabilities of ${¯ M}_{Y}$ are within $ϵ_{2}$ of those of ${¯ M}_{ω_{X}}$. This is to ensure that, once all the abstract state-action pairs are known, the loss of value because of an inaccurate transition model, $V_{{¯ M}_{Y}}^{*, T_{ϵ}} - V_{M}^{*, T_{ϵ}}$, is within $2 g (η_{T} + ϵ_{2}, η_{R}) = 2 T_{ϵ} η_{R} + (T_{ϵ} - 1) T_{ϵ} (η_{T} + ϵ_{2}) | ¯ S |$ by Lemma 12. We use the martingale concentration inequality to choose $K_{1}$ such that, if we sample each $(¯ s, a)$ $K_{1}$ times, the probability that our transition estimate is outside of the desired bound is less than $\frac{δ}{3 | ¯ S | | A |}$ for every abstract state-action pair. Applying the Union Bound, the total probability is then less than $δ / 3$. By Lemma 8, we can guarantee this by using $K_{1} \geq \frac{2 [ln (2^{| ¯ S |} - 2) - ln (δ / (3 | ¯ S | | A |))]}{(\frac{3 ϵ}{4 | ¯ S | (T_{ϵ} - 1)})^{2}} = \frac{32 | ¯ S |^{2} (T_{ϵ} - 1)^{2} [ln (2^{| ¯ S |} - 2) - ln (δ / (3 | ¯ S | | A |))]}{9 ϵ^{2}}$.
Earlier we assumed that $Pr (A_{M})$, the probability of encountering an unknown abstract state-action pair in a $T_{ϵ}$-step trial, was smaller than $\frac{3 ϵ}{8}$. Here we can use Hoeffding’s Inequality to show that after $K_{2}$ $T_{ϵ}$-step trials where $Pr (A_{M}) \geq \frac{3 ϵ}{8}$ all the abstract state-action pairs become known with probability at least $1 - δ / 3$, i.e., that every abstract state-action pair is visited at least $K_{1}$ times. Let $X_{i}$ be the indicator variable that is $1$ if we visit an unknown abstract state-action pair in the $i$-th trial, and $0$ otherwise. For the trials where $Pr (X_{i} = 1) \geq \frac{3 ϵ}{8}$ we can use Hoeffding’s Inequality as an upper bound, so that we have

$Pr (| \sum_{i = 1}^{K_{2}} X_{i} - \frac{3 ϵ}{8} K_{2} | \geq K_{2}^{2 / 3}) \leq 2 e^{- \frac{2 (K_{2}^{2 / 3})^{2}}{K_{2}}} = 2 e^{- 2 K_{2}^{1 / 3}} .$ (148)

We can now choose $K_{2}$ , s.t. $K_{2}^{\frac{2}{3}} + \frac{3 ϵ}{8} K_{2} > K_{1} | ¯ S | | A |$ and $2 e^{- 2 K_{2}^{1 / 3}} < \frac{δ}{3 | ¯ S | | A |}$ , to guarantee that the probability that we will fail to explore enough is at most $δ / 3$ .
Finally, the actual return may be lower than the expected return when we perform a $T_{ϵ}$ -step trial where we do not explore. We can again use Hoeffding’s Inequality to determine the number of steps $K_{3}$ needed to ensure that the actual average return is within $ϵ / 2$ of $Opt (\prod_{M} (ϵ, T_{ϵ})) - \frac{3}{2} ϵ - 3 \frac{g (η_{T}, η_{R})}{T_{ϵ}}$ , so that the probability that the actual return obtained is not at least the desired $Opt (\prod_{M} (ϵ, T_{ϵ})) - 2 ϵ - 3 \frac{g (η_{T}, η_{R})}{T_{ϵ}}$ within $K_{3} = Z | ¯ S | T_{ϵ}$ exploitation steps, with some number $Z > 0$ , is at most $δ / 3$ . Let $X_{i}$ denote the average return in the $i$ -th exploitation step, and $μ$ the average expected return in an exploitation step, so that $μ$ is at least $Opt (\prod_{M} (ϵ, T_{ϵ})) - \frac{3}{2} ϵ - 3 \frac{g (η_{T}, η_{R})}{T_{ϵ}}$ . Then

$Pr (\sum_{i = 1}^{K_{3}} (μ - X_{i}) \geq K_{3}^{2 / 3}) \leq e^{- 2 \frac{(K_{3}^{2 / 3})^{2}}{K_{3}}} = e^{- 2 K_{3}^{1 / 3}} .$ (149)

This means that the average return for $K_{3}$ exploitation steps is $K_{3}^{\frac{2}{3}} / K_{3} = \frac{1}{K_{3}^{\frac{1}{3}}}$, or more, lower than $μ$ with probability of at most $e^{- 2 K_{3}^{1 / 3}}$. We can now choose $Z$, so that $ϵ / 2 > \frac{1}{K_{3}^{\frac{1}{3}}}$ and $e^{- 2 K_{3}^{1 / 3}} < δ / 3$, to get the desired result: with probability at most $δ / 3$ the obtained value will be more than $ϵ / 2$ lower than the expected value.
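To give a feel for the magnitudes involved, the sample sizes $K_{1}$ and $K_{3}$ can be computed directly from the conditions stated above. The following is a minimal sketch under the assumptions $| ¯ S | \geq 2$ (so that $ln (2^{| ¯ S |} - 2)$ is defined) and $T_{ϵ} > 1$; the function names and the rounding to integers are illustrative choices.

```python
import math


def k1_samples(num_abstract, num_actions, eps, delta, T_eps):
    """Smallest integer K_1 satisfying the bound quoted from Lemma 8.

    Requires num_abstract >= 2 (so that 2^{|S_bar|} - 2 > 0) and T_eps > 1.
    """
    eps_2 = 3 * eps / (4 * num_abstract * (T_eps - 1))
    log_term = (math.log(2 ** num_abstract - 2)
                - math.log(delta / (3 * num_abstract * num_actions)))
    return math.ceil(2 * log_term / eps_2 ** 2)


def k3_steps(num_abstract, eps, delta, T_eps):
    """Return (Z, K_3) with K_3 = Z * |S_bar| * T_eps meeting both conditions."""
    # eps/2 > K_3^{-1/3}           <=>  K_3 > (2/eps)^3
    # exp(-2 K_3^{1/3}) < delta/3  <=>  K_3 > (ln(3/delta)/2)^3
    lower = max((2 / eps) ** 3, (math.log(3 / delta) / 2) ** 3)
    block = num_abstract * T_eps
    Z = math.floor(lower / block) + 1    # +1 keeps the inequalities strict
    return Z, Z * block
```

For typical accuracy levels the condition $ϵ / 2 > K_{3}^{- 1 / 3}$ dominates the choice of $K_{3}$, since $(2 / ϵ)^{3}$ grows much faster in $1 / ϵ$ than $(ln (3 / δ) / 2)^{3}$ grows in $1 / δ$.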

The probability of failure is thus at most $3 * δ / 3 = δ$ , and an average return that is at most $2 ϵ + 3 \frac{g (η_{T}, η_{R})}{T_{ϵ}}$ lower than $Opt (\prod_{M} (ϵ, T_{ϵ}))$ will be obtained with probability at least $1 - δ$ . ∎

Appendix D Related work - Extensive

Many studies have considered the combination of abstraction with either planning or RL. In most of these studies, the dependence of samples that arises in Abstracted RL is not an issue due to various reasons, such as the assumption that the collected samples are independent Paduraru et al. [2008], Ortner et al. [2014b], Jiang et al. [2015], looking at convergence in the limit Singh et al. [1995], Hutter [2016], Majeed and Hutter [2018], or because access to an MDP model is assumed Hallak et al. [2013], Maillard et al. [2013], Ortner et al. [2019].

In the Abstracted RL setting a negative result has been provided, showing that R-MAX Brafman and Tennenholtz [2002] no longer maintains its guarantees when paired with any type of state abstraction function Abel et al. [2018]. This is shown with an example that uses approximate Q-function abstractions Abel et al. [2016]. Our counterexample is more powerful, indicating problems with the standard analysis even for approximate model similarity abstractions. Yet, our second result shows that for R-MAX-like algorithms it is still possible to give guarantees in Abstracted RL when an approximate model similarity abstraction is used and we take into account the $η_{R}$ and $η_{T}$ inaccuracies in the error.

Another study considered a setting related to abstraction, where the transition and reward functions may change over time, either abruptly or gradually Ortner et al. [2020]. The reward and transition probabilities depend on the timestep $t$, so $T (s^{'} | s, a, t)$ instead of $T (s^{'} | s, a)$. To give results they bound the variation in the reward and transition functions over time. They adapt the confidence intervals for the state-action pairs to take the variation into account. In their setting the MDP is fixed given the timestep, but in the abstraction setting this is not the case: each time we run the MDP the transition function at a timestep $t$ could be different.

Some of the studies in the abstraction selection setting do not assume that the set of abstraction functions contains a Markov model Lattimore et al. [2013], Ortner et al. [2014a]. One of these assumes the agent has access to a set of environments, including the true environment, rather than a set of representations Lattimore et al. [2013]. Because they have access to environments rather than an abstraction, they do not need to learn a transition model, making it different from our setting. The other study uses Theorem 2.1 from Weissman et al. [2003], which requires i.i.d. samples Ortner et al. [2014a]; we have shown that independent samples cannot be guaranteed in Abstracted RL.

Other related work is in the area of MDPs with rich observations or block structure Azizzadenesheli et al. [2016], Du et al. [2019]. However, in that setting each observation can be generated only from a single hidden state, which means that the issue of non-i.i.d. data due to abstraction does not arise. In contrast, each observation can be generated from multiple hidden states in Abstracted RL. The rich observation setting can be seen as an aggregation problem, where the observations can be aggregated to form a small (latent) MDP Azizzadenesheli et al. [2016]. But in our case, we do not try to learn the MDP (as it is not small). Their setting is also related to exact model similarity (or bisimulation) Du et al. [2019], but we focus on approximate model similarity which is what introduces the problems as described here.

One way to avoid the issue of dependent samples is by making the assumption that samples are obtained independently Paduraru et al. [2008], Ortner et al. [2014b], Jiang et al. [2015]. One study considers the setting with a continuous domain where we are given a dataset with i.i.d. samples Paduraru et al. [2008]. Then discretization is used to aggregate states into abstract states. They give the probability that the model will be $ϵ$-accurate given a fixed dataset. While they assume that the data has been gathered i.i.d., our results show that martingale concentration inequalities could be used to extend their results to online data collection in the Abstracted RL setting. Another study operates in the abstraction selection setting, where the agent is provided with a set of abstraction functions (state representations) Jiang et al. [2015]. They do not assume that any of the abstraction functions results in a Markov model, but they do assume a given dataset, with data that was collected i.i.d. They give a bound directly on how accurate the Q-values based on the (implicitly) learned model will be, rather than on the accuracy of the model itself. As we showed, the assumption that the data is i.i.d. is not a trivial assumption, since it means the data cannot just have been collected online. Another study’s main focus is on bandits, but it also gives results for MDPs with a coloring function Ortner et al. [2014b]. State aggregation can be seen as a special case of this coloring, and they extend the results from UCRL2 Jaksch et al. [2010] to the setting with a coloring function. However, they assume that the samples are independent. They use the Azuma-Hoeffding inequality for the transition function, which also holds for weakly dependent samples. But, since they assume the samples are independent, they do not show the martingale difference sequence property for the (actually dependent) samples.

Quite a few studies in the abstraction selection setting make the assumption that the given set of state representations contains a Markov model Hallak et al. [2013], Maillard et al. [2013], Ortner et al. [2019]. One study gives asymptotic guarantees for selecting the correct model and for building an exact MDP model Hallak et al. [2013]. The assumption that there is an MDP model in the given set of representations is crucial in their analysis since for this ‘true model’ the samples are i.i.d. Similarly, other studies also assume that the given set of state representations contains a Markov model Maillard et al. [2013], Ortner et al. [2019]. They create an algorithm for which they obtain regret bounds; their analysis also makes use of the Markov representation.

Another way to deal with the issue of dependence is by looking at convergence in the limit Singh et al. [1995], Hutter [2016], Majeed and Hutter [2018]. One study gives an asymptotic result for the convergence of Q-learning and TD(0) in MDPs with soft state aggregation Singh et al. [1995]. Soft state aggregation means that a state $s$ belongs to a cluster $x$ with some probability $P (x | s)$, so a state $s$ can belong to several clusters. Their result relies on having a stationary policy that assigns a non-zero probability to every action in every state and the assumption that the MDP is ergodic. Together these imply there is a limiting state distribution, and using this they show convergence asymptotically. Another study gives a variety of results focusing on both approximate and exact abstractions in environments without MDP assumptions Hutter [2016]. Several of these are in the planning setting, similar to other results Abel et al. [2016]. Most relevant for us is their Theorem 12, which for online RL shows convergence in the limit of the empirical transition function under weak conditions, e.g., if the abstract process itself is an MDP. Under this condition, however, the problem reduces to RL in an (abstract) MDP, rather than Abstracted RL. Follow-up work builds on some of these results and focuses on the combination of model-free RL and exact abstraction Majeed and Hutter [2018]. They show that, under the condition of state uniformity, Q-learning can be shown to converge in the limit to the optimal solution. State uniformity means that histories that are aggregated together have the same optimal Q-values. In contrast to our setting, they look at an exact abstraction; extending it to approximate aggregation was left as an open question.

For planning in abstract MDPs, there are results for exact state abstractions Li et al. [2006] and for approximate state abstractions Abel et al. [2016]. The results for approximate state abstractions allow for quantifying an upper bound on performance for the optimal policy of an abstract MDP, as in Section 2.2. This has been built on by giving a result for performing RL interacting with an explicitly constructed abstract MDP Taïga et al. [2018], which is different from Abstracted RL since the abstract MDP is still an MDP.

Even without abstraction, in certain cases a dependence can arise for RL in MDPs. For instance, it has been shown that dependence can appear if the MDP is not communicating Strehl and Littman [2008]. The non-communicating property can be realistic, as there could be problems where there are states to which we cannot return. They show that this specific case of dependence in non-communicating MDPs is not a problem because it is still possible to use a concentration inequality for independent samples, e.g., Hoeffding’s inequality, as an upper bound. However, their proof uses the fact that the transition and rewards are identically distributed, which is not guaranteed in Abstracted RL.
