Deep reinforcement learning (RL) has achieved several high-profile successes in difficult decision-making problems. However, these algorithms typically require a huge amount of data before they reach reasonable performance. In fact, their performance during learning can be extremely poor. This may be acceptable for a simulator, but it severely limits the applicability of deep RL to many real-world tasks, where the agent must learn in the real environment. In this paper we study a setting where the agent may access data from previous control of the system. We present an algorithm, Deep Q-learning from Demonstrations (DQfD), that leverages even relatively small amounts of demonstration data to massively accelerate learning, and that automatically assesses the necessary ratio of demonstration data while learning, thanks to a prioritized replay mechanism. DQfD works by combining temporal difference updates with supervised classification of the demonstrator's actions. We show that DQfD has better initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN): it starts with better scores on the first million steps on 41 of 42 games, and on average it takes PDD DQN 83 million steps to catch up to DQfD's performance. DQfD learns to out-perform the best demonstration given in 14 of 42 games. In addition, DQfD leverages human demonstrations to achieve state-of-the-art results for 11 games. Finally, we show that DQfD performs better than three related algorithms for incorporating demonstration data into DQN.
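The abstract only states that DQfD combines temporal difference updates with supervised classification of the demonstrator's actions. The minimal sketch below illustrates one way such a combination can look: a double-Q TD loss plus a large-margin classification term applied only to demonstration transitions. The function name, margin value, and loss weight (`lambda_supervised`) are illustrative assumptions, not the paper's exact formulation, which also includes further terms.

```python
import torch
import torch.nn.functional as F

def dqfd_loss(q_net, target_net, states, actions, rewards, next_states, dones,
              is_demo, gamma=0.99, margin=0.8, lambda_supervised=1.0):
    """Sketch of a DQfD-style combined loss (assumed form, not the paper's exact one):
    1-step double-Q TD error plus a large-margin supervised term on demo transitions.
    `is_demo` and `dones` are float tensors of shape (batch,)."""
    q_values = q_net(states)                               # (batch, num_actions)
    q_taken = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Double DQN target: action chosen by the online net, evaluated by the target net.
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)
        td_target = rewards + gamma * (1.0 - dones) * next_q
    td_loss = F.smooth_l1_loss(q_taken, td_target)

    # Large-margin supervised term: require the demonstrated action's Q-value to
    # exceed every other action's Q-value by at least `margin`, demo samples only.
    margins = torch.full_like(q_values, margin)
    margins.scatter_(1, actions.unsqueeze(1), 0.0)         # zero margin at the demo action
    supervised = (q_values + margins).max(dim=1).values - q_taken
    supervised_loss = (supervised * is_demo).mean()

    return td_loss + lambda_supervised * supervised_loss
```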