We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.
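To make the training setup concrete, the following is a minimal sketch of policy-gradient learning in a contextual bandit setting with potential-based reward shaping. The linear policy, the toy task, and the shaping potentials are illustrative assumptions for this sketch, not the paper's actual architecture or reward functions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 4, 8
W = np.zeros((n_actions, dim))  # policy parameters (illustrative linear policy)

def policy(context):
    # Softmax policy over actions given a context vector.
    # In the paper's setting, the context would be the joint
    # visual/linguistic input; here it is a random feature vector.
    logits = W @ context
    e = np.exp(logits - logits.max())
    return e / e.sum()

def shaped_reward(base_reward, phi_next, phi_curr, gamma=1.0):
    # Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).
    return base_reward + gamma * phi_next - phi_curr

lr = 0.1
for step in range(500):
    context = rng.normal(size=dim)
    probs = policy(context)
    a = rng.choice(n_actions, p=probs)
    # Toy task (assumption): action 0 is correct when the first
    # feature is positive, action 1 otherwise.
    correct = 0 if context[0] > 0 else 1
    r = 1.0 if a == correct else 0.0
    # Illustrative shaping potentials, standing in for the paper's
    # supervision-derived shaping terms.
    r = shaped_reward(r, phi_next=0.5 * r, phi_curr=0.0)
    # One-step REINFORCE gradient of log pi(a|context) w.r.t. W:
    # (onehot(a) - probs) outer context.
    grad = -probs[:, None] * context[None, :]
    grad[a] += context
    W += lr * r * grad
```

Because each episode is a single decision (contextual bandit), the update needs no value bootstrapping or multi-step credit assignment; the shaped reward directly scales the one-step gradient.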