Current approaches in video forecasting attempt to generate videos directly in pixel space using Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). However, since these approaches try to model all the structure and scene dynamics at once, in unconstrained settings they often generate uninterpretable results. Our insight is to model the forecasting problem at a higher level of abstraction. Specifically, we exploit human pose detectors as a free source of supervision and break the video forecasting problem into two discrete steps. First, we explicitly model the high-level structure of active objects in the scene---humans---and use a VAE to model the possible future movements of humans in the pose space. We then use the generated future poses as conditional information to a GAN to predict the future frames of the video in pixel space. By using the structured space of pose as an intermediate representation, we sidestep the problems that GANs have in generating video pixels directly. We show through quantitative and qualitative evaluation that our method outperforms state-of-the-art methods for video prediction.
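To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a Pose-VAE that samples future poses given the observed pose, and a generator that renders a frame conditioned on the last observed frame plus a pose map. The joint count, latent size, layer widths, and heatmap-channel conditioning are all illustrative assumptions.

```python
import torch
import torch.nn as nn

POSE_DIM = 2 * 18    # assumption: 18 joints as (x, y) coordinates
LATENT_DIM = 32      # assumption: VAE latent size

class PoseVAE(nn.Module):
    """Stage 1: model a distribution over future poses in pose space."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(POSE_DIM, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, LATENT_DIM)
        self.to_logvar = nn.Linear(128, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM + POSE_DIM, 128), nn.ReLU(),
            nn.Linear(128, POSE_DIM))

    def forward(self, past_pose, future_pose):
        h = self.encoder(future_pose)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z, then decode conditioned on the past pose.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        pred = self.decoder(torch.cat([z, past_pose], dim=-1))
        return pred, mu, logvar

class PoseConditionedGenerator(nn.Module):
    """Stage 2: GAN generator rendering pixels conditioned on a pose map."""
    def __init__(self):
        super().__init__()
        # Input: RGB last frame (3 channels) + single-channel pose heatmap.
        self.net = nn.Sequential(
            nn.Conv2d(3 + 1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, last_frame, pose_heatmap):
        return self.net(torch.cat([last_frame, pose_heatmap], dim=1))

# Smoke test with random tensors standing in for real poses and frames.
vae = PoseVAE()
gen = PoseConditionedGenerator()
past = torch.randn(4, POSE_DIM)
future = torch.randn(4, POSE_DIM)
pred_pose, mu, logvar = vae(past, future)
frame = gen(torch.randn(4, 3, 64, 64), torch.randn(4, 1, 64, 64))
print(pred_pose.shape, frame.shape)  # torch.Size([4, 36]) torch.Size([4, 3, 64, 64])
```

At inference time the encoder is dropped: z is drawn from the prior, the decoder proposes future poses, and those poses (rasterized into maps) condition the generator, which is how the pose space serves as the intermediate representation described above. A full system would use recurrent pose encoders, a multi-frame video generator, and an adversarial discriminator omitted here for brevity.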