This paper introduces new optimality-preserving operators on Q-functions. We first describe an operator for tabular representations, the consistent Bellman operator, which incorporates a notion of local policy consistency. We show that this local consistency leads to an increase in the action gap at each state; increasing this gap, we argue, mitigates the undesirable effects of approximation and estimation errors on the induced greedy policies. This operator can also be applied to discretized continuous space and time problems, and we provide empirical results evidencing superior performance in this context. Extending the idea of a locally consistent operator, we then derive sufficient conditions for an operator to preserve optimality, leading to a family of operators which includes our consistent Bellman operator. As corollaries we provide a proof of optimality for Baird's advantage learning algorithm and derive other gap-increasing operators with interesting properties. We conclude with an empirical study on 60 Atari 2600 games illustrating the strong potential of these new operators.
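For concreteness, below is a minimal tabular sketch of the operators the abstract refers to. The backup rules follow our reading of the paper's definitions of the consistent Bellman operator and of Baird's advantage learning operator; the function names, the toy MDP, and the choice alpha = 0.5 are hypothetical illustration choices, not the paper's reference implementation.

```python
import numpy as np

def bellman_backup(Q, r, P, gamma):
    """Standard Bellman optimality backup:
    (T Q)(x,a) = r(x,a) + gamma * sum_x' P(x'|x,a) * max_b Q(x',b)."""
    V = Q.max(axis=1)                                # V(x) = max_b Q(x,b)
    return r + gamma * np.einsum('xay,y->xa', P, V)

def consistent_bellman_backup(Q, r, P, gamma):
    """Consistent Bellman backup: on a self-transition (x' = x) the target
    bootstraps from Q(x,a) rather than max_b Q(x,b), lowering the value of
    suboptimal actions and so widening the action gap."""
    V = Q.max(axis=1)
    p_self = np.einsum('xax->xa', P)                 # P(x' = x | x, a)
    return bellman_backup(Q, r, P, gamma) - gamma * p_self * (V[:, None] - Q)

def advantage_learning_backup(Q, r, P, gamma, alpha=0.5):
    """Advantage learning operator (alpha in [0, 1)):
    (T_AL Q)(x,a) = (T Q)(x,a) - alpha * (V(x) - Q(x,a))."""
    V = Q.max(axis=1)
    return bellman_backup(Q, r, P, gamma) - alpha * (V[:, None] - Q)

def action_gap(Q):
    # Gap between the best and second-best action value at each state.
    top2 = np.sort(Q, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

# Toy MDP with hypothetical numbers, for illustration only.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))           # P[x,a,:] sums to 1
r = rng.random((S, A))

Q_std, Q_con = np.zeros((S, A)), np.zeros((S, A))
for _ in range(500):                                 # iterate to (near) fixed points
    Q_std = bellman_backup(Q_std, r, P, gamma)
    Q_con = consistent_bellman_backup(Q_con, r, P, gamma)
print(action_gap(Q_std), action_gap(Q_con))          # consistent gaps >= standard
```

Note the common design of both gap-increasing backups: the correction term V(x) - Q(x,a) vanishes at the greedy action, so the value of the induced optimal policy is untouched, while suboptimal action values are pushed down.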