This paper investigates conditions under which modi cations to the reward function of a Markov decision process preserve the optimal policy. It is shown that, besides the positive linear transformation familiar from utility theory, one can add a reward for transitions between states that is expressible as the di erence in value of an arbitrary potential function applied to those states. Furthermore, this is shown to be a necessary condition for invariance, in the sense that any other transformation may yield suboptimal policies unless further assumptions are made about the underlying MDP. These results shed light on the practice of reward shaping, a method used in reinforcement learning whereby additional training rewards are used to guide the learning agent. In particular, some well-known \bugs" in reward shaping procedures are shown to arise from non-potential-based rewards, and methods are given for constructing shaping potentials corresponding to distance-based and subgoalbased heuristics. We show that such potentials can lead to substantial reductions in learning time.
translated by 谷歌翻译