In natural language processing (NLP), it is often important to detect the relationship between two sequences, or to generate a sequence of tokens given another observed sequence. We call this type of problem, which models pairs of sequences, the sequence-to-sequence (seq2seq) mapping problem. A great deal of research has been devoted to finding ways of tackling these problems, with traditional approaches relying on a combination of hand-crafted features, alignment models, segmentation heuristics, and external linguistic resources. Although great progress has been made, these traditional approaches suffer from various drawbacks, such as complicated pipelines, laborious feature engineering, and difficulty in adapting to new domains. Recently, neural networks have emerged as a solution to many problems in NLP, speech recognition, and computer vision. Neural models are powerful because they can be trained end to end, generalise well to unseen examples, and the same framework can be easily adapted to a new domain. The aim of this thesis is to advance the state of the art in seq2seq mapping problems with neural networks. We explore solutions from three major aspects: investigating neural models for representing sequences, modelling interactions between sequences, and using unpaired data to improve the performance of neural models. For each aspect, we propose novel models and evaluate their efficacy on various tasks of seq2seq mapping.
Sequence-to-Sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits its remarkable accuracy in estimating local, next-word distributions. In this work, we introduce a model and beam-search training scheme, based on the work of Daumé III and Marcu (2005), that extends seq2seq to learn global sequence scores. This structured approach avoids classical biases associated with local training and unifies the training loss with the test-time usage, while preserving the proven model architecture of seq2seq and its efficient training approach. We show that our system outperforms a highly-optimized attention-based seq2seq system and other baselines on three different sequence-to-sequence tasks: word ordering, parsing, and machine translation.
Over the past decades, numerous loss functions have been proposed for a variety of supervised learning tasks, including regression, classification, ranking, and, more generally, structured prediction. Understanding the core principles and theoretical properties underpinning these losses is key to choosing the right loss for the right problem, as well as to creating new losses that combine their strengths. In this paper, we introduce Fenchel-Young losses, a generic way to construct a convex loss function for a regularized prediction function. We provide an in-depth study of their properties in a very broad setting, covering all the aforementioned supervised learning tasks, and revealing new connections between sparsity, generalized entropies, and separation margins. We show that Fenchel-Young losses unify many well-known loss functions and allow the easy creation of useful new ones. Finally, we derive efficient prediction and training algorithms, making Fenchel-Young losses appealing in both theory and practice.
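A minimal sketch of the construction (our illustration, not code from the paper): a Fenchel-Young loss pairs a regularizer Ω with its convex conjugate Ω*, via L_Ω(θ; y) = Ω*(θ) + Ω(y) − ⟨θ, y⟩. Choosing Ω to be the negative Shannon entropy restricted to the simplex makes Ω* the log-sum-exp, and the loss reduces to the familiar cross-entropy:

```python
import numpy as np

def logsumexp(theta):
    # numerically stable log(sum(exp(theta)))
    m = theta.max()
    return m + np.log(np.exp(theta - m).sum())

def neg_entropy(p):
    # Omega(p) = sum_i p_i log p_i  (negative Shannon entropy); 0 for one-hot p
    p = p[p > 0]
    return float(np.sum(p * np.log(p)))

def fenchel_young_loss(theta, y):
    # L_Omega(theta; y) = Omega*(theta) + Omega(y) - <theta, y>.
    # With Omega = negative entropy on the simplex, Omega* = logsumexp,
    # so for a one-hot target y this is exactly the cross-entropy loss.
    return logsumexp(theta) + neg_entropy(y) - float(theta @ y)
```

Other choices of Ω in the same template yield, e.g., the sparsemax and perceptron losses, which is the sense in which the construction unifies them.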
Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, DP algorithms are usually non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows us to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on two structured prediction tasks and on structured and sparse attention for neural machine translation.
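A toy illustration of the central trick (our sketch; the function names are ours): regularizing the max with negative entropy yields the log-sum-exp, whose gradient is a softmax, so replacing every max in a DP recursion with this smoothed version makes the whole recursion differentiable:

```python
import numpy as np

def smoothed_max(x, gamma=1.0):
    # Entropy-regularized max: gamma * logsumexp(x / gamma).
    # Upper-bounds the hard max, converges to it as gamma -> 0,
    # and is differentiable everywhere.
    m = x.max()
    return m + gamma * np.log(np.exp((x - m) / gamma).sum())

def smoothed_max_grad(x, gamma=1.0):
    # Gradient of smoothed_max w.r.t. x: softmax(x / gamma),
    # i.e. a relaxed (probabilistic) argmax over the candidates.
    z = np.exp((x - x.max()) / gamma)
    return z / z.sum()
```

In a smoothed Viterbi recursion, each cell would take `smoothed_max` over its predecessors plus transition scores; backpropagation through the gradients then recovers (relaxed) expected alignments.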
Structured prediction requires searching over a combinatorial number of structures. To tackle this problem, we introduce SparseMAP: a new method for sparse structured inference, along with its natural loss function. SparseMAP automatically selects only a few global structures: it sits between MAP inference, which picks a single structure, and marginal inference, which assigns probability mass to all structures, including implausible ones. Importantly, SparseMAP can be computed using only calls to a MAP oracle, making it applicable to problems with intractable marginal inference, such as linear alignment. Sparsity makes gradient backpropagation efficient regardless of the structure, enabling us to augment deep neural networks with generic and sparse structured hidden layers. Experiments in dependency parsing and natural language inference reveal competitive accuracy, improved interpretability, and the ability to capture natural language ambiguity, which is attractive for pipeline systems.
Recurrent neural network models with an attention mechanism have proven to be extremely effective on a wide variety of sequence-to-sequence problems. However, the fact that soft attention mechanisms perform a pass over the entire input sequence when producing each element in the output sequence precludes their use in online settings and results in a quadratic time complexity. Based on the insight that the alignment between input and output sequence elements is monotonic in many problems of interest, we propose an end-to-end differentiable method for learning monotonic alignments which, at test time, enables computing attention online and in linear time. We validate our approach on sentence summarization, machine translation, and online speech recognition problems and achieve results competitive with existing sequence-to-sequence models.
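A simplified sketch of the test-time behaviour (our illustration; the actual mechanism uses learned attention energies and, during training, a differentiable expectation over this discrete process rather than the hard scan shown here):

```python
def monotonic_attend(energies, start):
    # Test-time hard monotonic attention, simplified: scan the memory
    # left-to-right from the previously attended position and stop at
    # the first entry whose "choose" probability sigmoid(e) exceeds 0.5,
    # i.e. whose energy is positive. Attention never moves backwards,
    # so decoding is online and linear-time overall.
    for j in range(start, len(energies)):
        if energies[j] > 0.0:
            return j
    return len(energies) - 1  # nothing chosen: attend to the final entry
```

Because each output step resumes the scan from `start` (the last attended index), the total work over the whole output sequence is linear in the input length rather than quadratic.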
Converting an n-dimensional vector into a probability distribution over n objects is a commonly used component in many machine learning tasks, such as multiclass classification, multilabel classification, and attention mechanisms. For this purpose, several probability mapping functions have been proposed and employed in the literature, such as softmax, sum-normalization, spherical softmax, and sparsemax, but there is little understanding of how they relate to one another. Moreover, none of the above formulations offers explicit control over the degree of sparsity. To address this, we develop a unified framework that encompasses all of these formulations as special cases. The framework ensures simple closed-form solutions and the existence of sub-gradients suitable for learning via backpropagation. Within this framework, we propose two novel sparse formulations, sparsegen-lin and sparsehourglass, which seek to provide control over the desired degree of sparsity. We further develop novel convex loss functions that help induce the behaviour of the aforementioned formulations in the multilabel classification setting, showing improved performance. We also demonstrate empirically that the proposed formulations, when used to compute attention weights, achieve better or comparable performance on standard seq2seq tasks such as neural machine translation and abstractive summarization.
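For concreteness, one well-known member of the family covered by such frameworks is sparsemax, the Euclidean projection onto the probability simplex, which has a simple closed form (our sketch, not code from the paper):

```python
import numpy as np

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex.
    # Unlike softmax, it can assign exact zeros to low-scoring entries,
    # yielding sparse probability distributions.
    z_sorted = np.sort(z)[::-1]                 # scores in decreasing order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum         # coordinates kept in the support
    k_star = k[support][-1]                     # size of the support
    tau = (cumsum[k_star - 1] - 1.0) / k_star   # threshold subtracted from z
    return np.maximum(z - tau, 0.0)
```

A sufficiently large score gap produces a one-hot output, while near-ties are preserved, which is exactly the controllable middle ground between argmax and softmax that the unified framework parameterizes.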
Out-of-vocabulary (OOV) words can pose serious challenges for machine translation (MT) tasks, and in particular for low-resource languages (LRLs). This paper adapts variants of seq2seq models to perform transduction of such words from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs built from a bilingual dictionary of Hindi and Bhojpuri words. We demonstrate that our models can be effectively used for languages that have only a limited amount of parallel corpora, by working at the character level to capture phonetic and orthographic similarities across multiple types of word adaptation, whether synchronic or diachronic, loanwords or cognates. We provide a comprehensive overview of the training aspects of character-level NMT systems suited to this task, combined with a detailed analysis of their respective error cases. Using our approach, we achieve an improvement of more than 6 BLEU on the Hindi-to-Bhojpuri translation task. Furthermore, we show that such transductions generalize well to other languages by applying the method successfully to Hindi-Bangla cognate pairs. Our work can be seen as an important step in the process of (i) resolving the OOV-word problem arising in MT tasks, (ii) creating effective parallel corpora for resource-constrained languages, and (iii) carrying the enhanced semantic knowledge captured by word-level embeddings over to character-level tasks.
There has been much recent work on training neural attention models using reinforcement learning-style methods or by optimizing the beam. In this paper, we survey a range of classical objective functions that have been widely used to train linear models for structured prediction, and apply them to neural sequence-to-sequence models. Our experiments show that these losses can perform surprisingly well, slightly outperforming beam search optimization in a like-for-like setup. We also report new state-of-the-art results on both IWSLT'14 German-English translation and Gigaword abstractive summarization. On the larger WMT'14 English-French translation task, sequence-level training achieves 41.5 BLEU, which is on par with the state of the art.
Beam search is a desirable choice of test-time decoding algorithm for neural sequence models because it potentially avoids search errors made by simpler greedy methods. However, typical cross entropy training procedures for these models do not directly consider the behaviour of the final decoding method. As a result, for cross-entropy trained models, beam decoding can sometimes yield reduced test performance when compared with greedy decoding. In order to train models that can more effectively make use of beam search, we propose a new training procedure that focuses on the final loss metric (e.g. Hamming loss) evaluated on the output of beam search. While well-defined, this "direct loss" objective is itself discontinuous and thus difficult to optimize. Hence, in our approach, we form a sub-differentiable surrogate objective by introducing a novel continuous approximation of the beam search decoding procedure. In experiments, we show that optimizing this new training objective yields substantially better results on two sequence tasks (Named Entity Recognition and CCG Supertagging) when compared with both cross entropy trained greedy decoding and cross entropy trained beam decoding baselines.
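To make the procedure being relaxed concrete, here is a generic beam-search decoder over a toy next-token model (both the helper names and the toy model are our own illustration; the paper's contribution is a continuous approximation of exactly this discrete top-k recursion):

```python
def beam_search(next_logprobs, start, eos, beam_size, max_len):
    # next_logprobs(prefix) -> {token: log-prob} for one decoding step.
    # Keeps the beam_size best-scoring prefixes at every step; a beam
    # of size 1 reduces to greedy decoding.
    beam = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for tok, lp in next_logprobs(tuple(prefix)).items():
                hyp = (prefix + [tok], score + lp)
                (finished if tok == eos else candidates).append(hyp)
        if not candidates:
            break
        beam = sorted(candidates, key=lambda h: -h[1])[:beam_size]
    finished.extend(beam)                     # include unfinished hypotheses
    return max(finished, key=lambda h: h[1])  # highest-scoring sequence
```

On models where a locally second-best token leads to a globally better continuation, a beam of size 2 recovers the higher-scoring sequence that greedy decoding misses; the hard `sorted(...)[:beam_size]` truncation is the discontinuity the surrogate objective smooths.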
In this paper, we explore the application of deep neural networks to natural language generation. Specifically, we implement two sequence-to-sequence neural variational models: the variational autoencoder (VAE) and the variational encoder-decoder (VED). VAEs for text generation are difficult to train due to issues associated with the Kullback-Leibler (KL) divergence term of the loss function vanishing to zero. We successfully train VAEs by implementing optimization heuristics such as KL weight annealing and word dropout. We also demonstrate the effectiveness of this continuous latent space through random sampling, linear interpolation, and sampling from the neighborhood of the input. We argue that, if not designed appropriately, a VAE can develop bypassing connections that cause the latent space to be ignored during training. We show experimentally, with the example of decoder hidden-state initialization, that such bypassing connections degrade the VAE into a deterministic model, thereby reducing the diversity of the generated sentences. We find that the traditional attention mechanism used in sequence-to-sequence VED models acts as a bypassing connection, thus deteriorating the latent space of the model. To avoid this problem, we propose the variational attention mechanism, in which the attention context vector is modeled as a random variable that can be sampled from a distribution. We show empirically, using the automatic evaluation metrics of entropy and distinct measures, that our variational attention model generates more diverse output sentences than the deterministic attention model. A qualitative analysis with human evaluation studies demonstrates that the sentences our model generates are of high quality and as fluent as those produced by the deterministic attention counterpart.
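The KL weight annealing heuristic mentioned above can be stated in a few lines (an illustrative sketch; the linear schedule and the warm-up constant are our assumptions, not values from the paper):

```python
def kl_weight(step, warmup_steps):
    # Linear KL annealing: ramp the weight on the KL term from 0 to 1
    # over the first warmup_steps updates, so the decoder learns to
    # reconstruct before the posterior is pulled towards the prior,
    # mitigating KL-term collapse to zero.
    return min(1.0, step / float(warmup_steps))

def vae_loss(reconstruction_nll, kl_divergence, step, warmup_steps=10000):
    # Total objective: reconstruction term plus the annealed KL term.
    return reconstruction_nll + kl_weight(step, warmup_steps) * kl_divergence
```

Word dropout, the other heuristic, plays a complementary role: randomly masking decoder inputs weakens the autoregressive path so the model is forced to rely on the latent code.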
Current state-of-the-art machine translation systems are based on encoder-decoder architectures that first encode the input sequence and then generate an output sequence based on the input encoding. Both components are interfaced with an attention mechanism that recombines a fixed encoding of the source tokens based on the decoder state. We propose an alternative approach which instead relies on a single 2D convolutional neural network across both sequences. Each layer of our network re-codes the source tokens on the basis of the output sequence produced so far; attention-like properties are therefore pervasive throughout the network. Our model yields excellent results, outperforming state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.
We present an approach to training neural networks to generate sequences using actor-critic methods from reinforcement learning (RL). Current log-likelihood training methods are limited by the discrepancy between their training and testing modes, as models must generate tokens conditioned on their previous guesses rather than the ground-truth tokens. We address this problem by introducing a critic network that is trained to predict the value of an output token, given the policy of an actor network. This results in a training procedure that is much closer to the test phase, and allows us to directly optimize for a task-specific score such as BLEU. Crucially, since we leverage these techniques in the supervised learning setting rather than the traditional RL setting, we condition the critic network on the ground-truth output. We show that our method leads to improved performance on both a synthetic task, and for German-English machine translation. Our analysis paves the way for such methods to be applied in natural language generation tasks, such as machine translation, caption generation, and dialogue modelling.
Neural attention has become central to many state-of-the-art models in natural language processing and related domains. Attention networks are an easy-to-train and effective method for softly simulating alignment; however, the approach does not marginalize over latent alignments in a probabilistic sense. This property makes it difficult to compare attention to other alignment approaches, to compose it with probabilistic models, and to perform posterior inference conditioned on observed data. A related latent approach, hard attention, fixes these issues, but is generally harder to train and less accurate. This work considers variational attention networks, alternatives to soft and hard attention for learning latent variable alignment models, with tractable approximation bounds based on amortized variational inference. We further propose methods for reducing the variance of the gradients to make these approaches computationally feasible. Experiments show that, for machine translation and visual question answering, inefficient exact latent variable models outperform standard neural attention, but these gains go away when hard attention-based training is used. On the other hand, variational attention retains most of the performance gain while training at a speed comparable to neural attention.
In the past few years, neural abstractive text summarization with sequence-to-sequence (seq2seq) models has gained a lot of popularity. Many interesting techniques have been proposed to improve seq2seq models, making them capable of handling different challenges, such as saliency, fluency and human readability, and of generating high-quality summaries. Generally speaking, most of these techniques differ in one of three categories: network structure, parameter inference, and decoding/generation. There are also other concerns, such as efficiency and parallelism when training a model. In this paper, we provide a comprehensive literature and technique survey on different seq2seq models for abstractive text summarization, from the perspectives of network structure, training strategy, and summary generation algorithm. Many models were first proposed for language modeling and generation tasks, such as machine translation, and were later applied to abstractive text summarization; hence, we also provide a brief review of those models. As part of this survey, we also develop an open-source library, namely the Neural Abstractive Text Summarizer (NATS) toolkit, for abstractive text summarization. An extensive set of experiments has been conducted on the widely used CNN/Daily Mail dataset to examine the effectiveness of several different neural network components. Finally, we benchmark two models implemented in NATS on two recently released datasets, namely Newsroom and Bytecup.
Attention networks have proven to be an effective approach for embedding categorical inference within a deep neural network. However, for many tasks we may want to model richer structural dependencies without abandoning end-to-end training. In this work, we experiment with incorporating richer structural distributions, encoded using graphical models, within deep networks. We show that these structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees. We experiment with two different classes of structured attention networks: a linear-chain conditional random field and a graph-based parsing model, and describe how these models can be practically implemented as neural network layers. Experiments show that this approach is effective for incorporating structural biases, and structured attention networks outperform baseline attention models on a variety of synthetic and real tasks: tree transduction, neural machine translation, question answering, and natural language inference. We further find that models trained in this way learn interesting unsupervised hidden representations that generalize simple attention.
We formulate sequence to sequence transduction as a noisy channel decoding problem and use recurrent neural networks to parameterise the source and channel models. Unlike direct models which can suffer from explaining-away effects during training, noisy channel models must produce outputs that explain their inputs, and their component models can be trained with not only paired training samples but also unpaired samples from the marginal output distribution. Using a latent variable to control how much of the conditioning sequence the channel model needs to read in order to generate a subsequent symbol, we obtain a tractable and effective beam search decoder. Experimental results on abstractive sentence summarisation, morphological inflection, and machine translation show that noisy channel models outperform direct models, and that they significantly benefit from increased amounts of unpaired output data that direct models cannot easily use.
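The decoding rule amounts to reranking by Bayes' rule (an illustrative sketch; the interpolation weight `lam` and the helper names are our assumptions, and the paper's actual decoder interleaves this scoring with latent-variable beam search rather than scoring complete candidates):

```python
def noisy_channel_score(x, y, channel_logprob, lm_logprob, lam=1.0):
    # Score a candidate output y for input x by
    #   log p(x | y) + lam * log p(y).
    # The channel model p(x|y) forces y to explain the input, while the
    # language model p(y) can be trained on unpaired output-side data.
    return channel_logprob(x, y) + lam * lm_logprob(y)

def rerank(x, candidates, channel_logprob, lm_logprob, lam=1.0):
    # Pick the candidate with the best noisy-channel score.
    return max(candidates,
               key=lambda y: noisy_channel_score(x, y, channel_logprob,
                                                 lm_logprob, lam))
```

A fluent but non-explanatory candidate scores well under p(y) alone yet poorly under p(x|y), which is how the noisy channel avoids the explaining-away failures of direct models.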
Word Sense Disambiguation models exist in many flavors. Even though supervised ones tend to perform best in terms of accuracy, they often lose ground to more flexible knowledge-based solutions, which do not require training by a word expert for every disambiguation target. To bridge this gap we adopt a different perspective and rely on sequence learning to frame the disambiguation problem: we propose and study in depth a series of end-to-end neural architectures directly tailored to the task, from bidirectional Long Short-Term Memory to encoder-decoder models. Our extensive evaluation over standard benchmarks and in multiple languages shows that sequence learning enables more versatile all-words models that consistently lead to state-of-the-art results, even against word experts with engineered features.
Nearly all previous work on neural machine translation (NMT) has used quite restricted vocabularies, perhaps with a subsequent method to patch in unknown words. This paper presents a novel word-character solution to achieving open vocabulary NMT. We build hybrid systems that translate mostly at the word level and consult the character components for rare words. Our character-level recurrent neural networks compute source word representations and recover unknown target words when needed. The twofold advantage of such a hybrid approach is that it is much faster and easier to train than character-based ones; at the same time, it never produces unknown words as in the case of word-based models. On the WMT'15 English to Czech translation task, this hybrid approach offers an additional boost of +2.1 to +11.4 BLEU points over models that already handle unknown words. Our best system achieves a new state-of-the-art result with a 20.7 BLEU score. We demonstrate that our character models can successfully learn not only to generate well-formed words for Czech, a highly-inflected language with a very complex vocabulary, but also to build correct representations for English source words.
Most existing machine translation systems operate at the level of words, relying on explicit segmentation to extract tokens. We introduce a neural machine translation (NMT) model that maps a source character sequence to a target character sequence without any segmentation. We employ a character-level convolutional network with max-pooling at the encoder to reduce the length of the source representation, allowing the model to be trained at a speed comparable to subword-level models while capturing local regularities. Our character-to-character model outperforms a recently proposed baseline with a subword-level encoder on WMT'15 DE-EN and CS-EN, and gives comparable performance on FI-EN and RU-EN. We then demonstrate that it is possible to share a single character-level encoder across multiple languages by training a model on a many-to-one translation task. In this multilingual setting, the character-level encoder significantly outperforms the subword-level encoder on all the language pairs. We observe that on CS-EN, FI-EN and RU-EN, the quality of the multilingual character-level translation even surpasses the models specifically trained on that language pair alone, both in terms of BLEU score and human judgment.