Research on the problem of contextualized word representation (the development of reusable neural network components for sentence understanding) has recently seen a surge of progress, centered on the unsupervised pretraining task of language modeling with methods like ELMo. This paper provides the first large-scale systematic study comparing different pretraining tasks in this context, both as complements to language modeling and as potential alternatives. The primary results of the study support the use of language modeling as a pretraining task, and set a new state of the art among comparable models using multitask learning with language models. However, a closer look at these results reveals worryingly strong baselines and strikingly varied results across target tasks, suggesting that the widely used paradigm of pretraining and freezing sentence encoders may not be an ideal platform for further work.
We consider the problem of aligning continuous word representations, learned in multiple languages, to a common space. It was recently shown that, in the case of two languages, such a mapping can be learned without supervision. This paper extends that line of work to the problem of aligning multiple languages to a common space. One solution is to map every language independently to a pivot language. Unfortunately, this degrades the quality of indirect word translation. We therefore propose a new formulation that ensures composable mappings, leading to better alignments. We evaluate our method by jointly aligning word vectors in eleven languages, showing consistent gains on indirect mappings while maintaining competitive performance on direct word translation.
Continuous word representations learned separately on different languages can be aligned so that their words become comparable in a common space. Existing work typically solves a least-squares regression problem to learn a rotation that aligns a small bilingual lexicon, and uses a retrieval criterion at inference time. In this paper, we propose a unified formulation that directly optimizes a retrieval criterion in an end-to-end fashion. Our experiments on standard benchmarks show that our approach outperforms the state of the art in word translation, with the largest improvements on distant language pairs such as English-Chinese.
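For context, the least-squares rotation step that existing work relies on has a closed-form solution, the orthogonal Procrustes problem. The sketch below illustrates only that baseline step, not the paper's end-to-end retrieval objective; the array shapes and lexicon size are assumptions.

```python
import numpy as np

def procrustes_alignment(X, Y):
    """Learn an orthogonal rotation W mapping source vectors X to target vectors Y.

    X, Y: (n, d) arrays of source/target embeddings for a small bilingual lexicon,
    paired row by row. Returns W such that X @ W approximates Y (solution of the
    orthogonal Procrustes problem via SVD).
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Hypothetical usage: align 300-d source embeddings to the target space.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 300))   # source-language vectors for lexicon entries
Y = rng.normal(size=(5000, 300))   # corresponding target-language vectors
W = procrustes_alignment(X, Y)
aligned = X @ W                    # source vectors mapped into the target space
```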
It is often the case that the best performing language model is an ensemble of a neural language model with n-grams. In this work, we propose a method to improve how these two models are combined. Using a small network that predicts the mixture weight between the two models, we adjust their relative importance at each time step. Because the gating network is small, it can be trained quickly on a small amount of data, and it adds negligible overhead at scoring time. Our experiments on the One Billion Word benchmark show a significant improvement over the state-of-the-art ensemble, without retraining the base models.
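A minimal sketch of the kind of per-timestep gating the abstract describes, assuming the gate conditions on some context feature vector and that both base models are fixed, pre-trained scorers; the layer sizes and feature choice are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedMixture(nn.Module):
    """Small gating network that predicts a per-step weight for mixing a
    neural LM and an n-gram LM (both treated as frozen probability sources)."""

    def __init__(self, feature_dim, hidden_dim=32):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),          # mixture weight in (0, 1)
        )

    def forward(self, features, p_neural, p_ngram):
        # features: (batch, feature_dim) context features at the current step
        # p_neural, p_ngram: (batch, vocab) next-word probabilities from each model
        lam = self.gate(features)              # (batch, 1)
        return lam * p_neural + (1 - lam) * p_ngram
```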
Recurrent neural networks (RNNs) have achieved impressive results in a variety of linguistic processing tasks, suggesting that they can induce non-trivial properties of language. We investigate here to what extent RNNs learn to track abstract hierarchical syntactic structure. We test whether RNNs trained with a generic language modeling objective in four languages (Italian, English, Hebrew, Russian) can predict long-distance number agreement in various constructions. We include in our evaluation nonsensical sentences where RNNs cannot rely on semantic or lexical cues ("The colorless green ideas I ate with the chair sleep furiously"), and, for Italian, we compare model performance to human intuitions. Our language-model-trained RNNs make reliable predictions about long-distance agreement, and do not lag much behind human performance. We thus bring support to the hypothesis that RNNs are not just shallow-pattern extractors, but they also acquire deeper grammatical competence.
Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the common crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.
We introduce Parseval networks, a form of deep neural networks in which the Lipschitz constant of linear, convolutional and aggregation layers is constrained to be smaller than 1. Parseval networks are empirically and theoretically motivated by an analysis of the robustness of deep network predictions when the input is subject to adversarial perturbations. The most important feature of Parseval networks is to maintain the weight matrices of linear and convolutional layers as (approximately) Parseval tight frames, which extend orthogonal matrices to non-square matrices. We describe how these constraints can be maintained efficiently during SGD. We show that Parseval networks match the state of the art in accuracy on CIFAR-10/100 and Street View House Numbers (SVHN), while being more robust to adversarial examples than their vanilla counterparts. Incidentally, Parseval networks also tend to train faster and make better use of the full capacity of the network.
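One common way to keep weight matrices close to Parseval tight frames during training is a cheap retraction applied after each gradient step. The sketch below shows that idea under assumed values of the step size beta and the matrix shape, and is not claimed to reproduce the paper's exact procedure.

```python
import numpy as np

def parseval_retraction(W, beta=0.0003, iterations=1):
    """Approximate projection of a weight matrix onto the set of Parseval
    tight frames (W.T @ W close to the identity), run as a secondary update
    after each gradient step. beta and iterations are illustrative choices."""
    for _ in range(iterations):
        W = (1 + beta) * W - beta * (W @ W.T @ W)
    return W

# Hypothetical usage after an SGD step on an (out_dim, in_dim) weight matrix
# with out_dim >= in_dim, so a Parseval frame condition can hold.
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 256)) / np.sqrt(256)
W = parseval_retraction(W)
```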
We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory-augmented networks, which stores past hidden activations as memory and accesses them through a dot product with the current hidden activation. This mechanism is very efficient and scales to very large memory sizes. We also draw a link between the use of external memory in neural networks and cache models used with count-based language models. We demonstrate on several language model datasets that our approach performs significantly better than recent memory-augmented networks.
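A minimal sketch of such a cache distribution, assuming the memory is a matrix of stored hidden states paired with the tokens that followed them; the single-function form, the scaling parameter theta, and the interpolation weight are illustrative choices rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cache_probability(h_t, past_hiddens, past_tokens, vocab_size, theta=0.3):
    """Cache distribution over the vocabulary, favoring recently seen words
    whose stored hidden states match the current one.

    h_t:          (d,) current hidden state
    past_hiddens: (T, d) hidden states stored for the previous T steps
    past_tokens:  (T,) long tensor of token ids that followed each stored state
    """
    scores = past_hiddens @ h_t * theta          # dot-product match with the cache
    weights = F.softmax(scores, dim=0)           # (T,) attention over memory slots
    p_cache = torch.zeros(vocab_size)
    p_cache.index_add_(0, past_tokens, weights)  # accumulate mass per token id
    return p_cache

# Interpolation with the base model's distribution (lambda is illustrative):
# p = (1 - lam) * p_model + lam * cache_probability(h_t, mem_h, mem_w, V)
```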
We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly minimize the expectation of computation time. Our approach further reduces the computational time by exploiting the specificities of modern architectures and matrix-matrix vector operations, making it particularly suited for graphical processing units. Our experiments carried out on standard benchmarks, such as EuroParl and One Billion Word, show that our approach brings a large gain in efficiency over standard approximations while achieving an accuracy close to that of the full softmax. The code of our method is available at https://github.com/facebookresearch/adaptive-softmax.
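PyTorch ships an implementation of this clustering idea as `nn.AdaptiveLogSoftmaxWithLoss`; a minimal usage sketch follows, with the hidden size, vocabulary size, and cluster cutoffs chosen purely for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 512-d hidden states, 100k-word vocabulary. Cutoffs place
# frequent words in the full-capacity head and rarer words in smaller tail clusters.
hidden_dim, vocab_size = 512, 100_000
adaptive = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_dim,
    n_classes=vocab_size,
    cutoffs=[2_000, 20_000],   # head / tail-cluster boundaries (assumed values)
)

hidden = torch.randn(32, hidden_dim)            # a batch of LM hidden states
targets = torch.randint(0, vocab_size, (32,))   # next-word ids
output = adaptive(hidden, targets)
loss = output.loss                              # negative log-likelihood to backprop
```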
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram; words are represented as the sum of these representations. Our method is fast, allowing models to be trained quickly on large corpora, and lets us compute representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, on both word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.
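A minimal sketch of the bag-of-character-n-grams representation, assuming the commonly used n-gram range of 3 to 6 and boundary markers around each word; hashing of n-grams, the inclusion of the word itself as an extra unit, and training details are omitted.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, with boundary markers (n range is an assumption)."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

def word_vector(word, ngram_vectors, dim=300):
    """Sum the vectors of a word's character n-grams; unseen n-grams contribute
    nothing, so even out-of-vocabulary words receive a representation."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g in ngram_vectors:
            vec += ngram_vectors[g]
    return vec
```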