We introduce a new test of how well language models capture meaning in children's books. Unlike standard language modelling benchmarks, it distinguishes the task of predicting syntactic function words from that of predicting lower-frequency words, which carry greater semantic content. We compare a range of state-of-the-art models, each with a different way of encoding what has been previously read. We show that models which store explicit representations of long-term contexts outperform state-of-the-art neural language models at predicting semantic content words, although this advantage is not observed for syntactic function words. Interestingly, we find that the amount of text encoded in a single memory representation is highly influential to the performance: there is a sweet-spot, not too big and not too small, between single words and full sentences that allows the most meaningful information in a text to be effectively retained and recalled. Further, the attention over such window-based memories can be trained effectively through self-supervision. We then assess the generality of this principle by applying it to the CNN QA benchmark, which involves identifying named entities in paraphrased summaries of news articles, and achieve state-of-the-art performance.
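To make the idea of attending over window-based memories concrete, the following is a minimal sketch, not the paper's exact model: each memory encodes a fixed-size window of words centred on a candidate answer, a softmax over query-memory similarities produces attention weights, and attention mass is pooled per candidate to score answers. All function and variable names here are illustrative assumptions, not identifiers from the work described above.

    # Minimal sketch of attention over fixed-size window memories (illustrative only).
    import numpy as np

    def softmax(x):
        x = x - x.max()
        e = np.exp(x)
        return e / e.sum()

    def score_candidates(query_vec, window_vecs, candidate_ids, n_candidates):
        """Attend over window memories and pool attention per candidate answer.

        query_vec     : (d,) embedding of the query/context.
        window_vecs   : (m, d) one embedding per memory window.
        candidate_ids : (m,) index of the candidate word each window centres on.
        n_candidates  : number of distinct candidate answers.
        """
        attention = softmax(window_vecs @ query_vec)   # similarity -> attention weights
        scores = np.zeros(n_candidates)
        for att, cand in zip(attention, candidate_ids):
            scores[cand] += att                        # pool attention by candidate
        return scores

    # Toy usage: 3 candidates, 5 window memories of dimension 4.
    rng = np.random.default_rng(0)
    q = rng.normal(size=4)
    mem = rng.normal(size=(5, 4))
    cands = np.array([0, 1, 1, 2, 0])
    print(score_candidates(q, mem, cands, 3).argmax())  # index of highest-scoring candidate

Under this reading, the "sweet-spot" finding corresponds to the choice of window size used to build window_vecs: windows of a few words around each candidate retain more usable information than single words or whole sentences.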