We know very little about how neural language models (LMs) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (the most recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on the recent success of cache-based models.
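The word-order ablation described above can be illustrated with a minimal sketch: shuffle only the distant portion of the context while preserving the order of the most recent tokens, then compare the LM's perplexity on the original versus perturbed context. The function below is a hypothetical helper (its name, the `keep_recent` parameter, and the fixed seed are our assumptions, not the paper's code).

```python
import random

def perturb_context(tokens, keep_recent=50, seed=0):
    """Shuffle the distant context while keeping the most recent
    `keep_recent` tokens in their original order.

    This mirrors the paper's shuffle ablation in spirit only; the
    actual experimental code may differ.
    """
    if len(tokens) <= keep_recent:
        return list(tokens)
    distant, recent = tokens[:-keep_recent], tokens[-keep_recent:]
    shuffled = list(distant)
    random.Random(seed).shuffle(shuffled)
    return shuffled + recent
```

One would then feed both the original and the perturbed context to the trained LM and measure the change in per-token loss on the continuation; the paper's finding predicts only a small perplexity increase when the shuffle is restricted to tokens beyond the recent window.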