From just a glance, humans can make rich predictions about the future state of a wide range of physical systems. On the other hand, modern approaches from engineering, robotics, and graphics are often restricted to narrow domains and require direct measurements of the underlying states. We introduce the Visual Interaction Network, a general-purpose model for learning the dynamics of a physical system from raw visual observations. Our model consists of a perceptual front-end based on convolutional neural networks and a dynamics predictor based on interaction networks. Through joint training, the perceptual front-end learns to parse a dynamic visual scene into a set of factored latent object representations. The dynamics predictor learns to roll these states forward in time by computing their interactions and dynamics, producing a predicted physical trajectory of arbitrary length. We found that from just six input video frames the Visual Interaction Network can generate accurate future trajectories of hundreds of time steps on a wide range of physical systems. Our model can also be applied to scenes with invisible objects, inferring their future states from their effects on the visible objects, and can implicitly infer the unknown mass of objects. Our results demonstrate that the perceptual module and the object-based dynamics predictor module can induce factored latent representations that support accurate dynamical predictions. This work opens new opportunities for model-based decision-making and planning from raw sensory observations in complex physical environments.
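To make the two-stage design concrete, the following is a minimal sketch of the architecture the abstract describes: a CNN front-end that maps observed frames to a set of factored per-object latent states, and an interaction-network core that rolls those states forward by summing learned pairwise effects. The layer sizes, the fixed object count N_OBJECTS, the channel-stacking of the six input frames, and the names VisualEncoder, InteractionCore, and rollout are all illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

N_OBJECTS, STATE_DIM = 3, 32  # assumed object count and latent state size

class VisualEncoder(nn.Module):
    """CNN front-end: maps six observed RGB frames (stacked along the
    channel axis, an assumption for this sketch) to N factored
    per-object latent state codes."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(18, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.to_states = nn.Linear(64, N_OBJECTS * STATE_DIM)

    def forward(self, frames):                       # (B, 18, H, W)
        h = self.conv(frames)
        return self.to_states(h).view(-1, N_OBJECTS, STATE_DIM)

class InteractionCore(nn.Module):
    """Interaction-network core: computes a learned effect for every
    ordered pair of objects, aggregates the effects on each receiver,
    and predicts each object's next state."""
    def __init__(self):
        super().__init__()
        self.relation = nn.Sequential(nn.Linear(2 * STATE_DIM, 64), nn.ReLU(),
                                      nn.Linear(64, STATE_DIM))
        self.update = nn.Sequential(nn.Linear(2 * STATE_DIM, 64), nn.ReLU(),
                                    nn.Linear(64, STATE_DIM))

    def forward(self, states):                       # (B, N, D)
        B, N, D = states.shape
        senders = states.unsqueeze(2).expand(B, N, N, D)
        receivers = states.unsqueeze(1).expand(B, N, N, D)
        effects = self.relation(torch.cat([senders, receivers], dim=-1))
        mask = 1.0 - torch.eye(N, device=states.device).view(1, N, N, 1)
        agg = (effects * mask).sum(dim=2)            # total effect on each object
        return states + self.update(torch.cat([states, agg], dim=-1))

def rollout(encoder, core, frames, steps):
    """Encode the observed frames once, then roll the latent object
    states forward for an arbitrary number of predicted time steps."""
    states = encoder(frames)
    trajectory = []
    for _ in range(steps):
        states = core(states)
        trajectory.append(states)
    return torch.stack(trajectory, dim=1)            # (B, steps, N, D)
```

Because prediction happens entirely in the factored latent space, the rollout length is decoupled from the number of observed frames, which is what lets a six-frame input support trajectories hundreds of steps long.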