In this paper, we analyze neural network-based dialogue systems trained in an end-to-end manner using an updated version of the recent Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. [1] This dataset is interesting because of its size, long context lengths, and technical nature; thus, it can be used to train large models directly from data with minimal feature engineering. We provide baselines in two different environments: one where models are trained to select the correct next response from a list of candidate responses, and one where models are trained to maximize the log-likelihood of a generated utterance conditioned on the context of the conversation. These are both evaluated on a recall task that we call next utterance classification (NUC), and using vector-based metrics that capture the topicality of the responses. We observe that current end-to-end models are unable to completely solve these tasks; thus, we provide a qualitative error analysis to determine the primary causes of error for end-to-end models evaluated on NUC, and examine sample utterances from the generative models. As a result of this analysis, we suggest some promising directions for future research on the Ubuntu Dialogue Corpus, which can also be applied to end-to-end dialogue systems in general.

[1] This work is an extension of a paper appearing in SIGDIAL (Lowe et al., 2015). This paper further includes results on generative dialogue models, more extensive evaluation of the retrieval models using vector-based generative metrics, and a qualitative examination of responses from the generative models and of classification errors made by the Dual Encoder model. Experiments are performed on a new version of the corpus, the Ubuntu Dialogue Corpus v2, which is publicly available at https://github.com/rkadlec/ubuntu-ranking-dataset-creator. The earlier dataset has been updated to add features and fix bugs, which are detailed in Section 3.
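For concreteness, the following is a minimal sketch of the Recall@k computation underlying NUC, assuming the usual setup in which each context is paired with one ground-truth response and a fixed set of distractor candidates; the function name and data are illustrative, not the authors' evaluation code.

```python
def recall_at_k(ranked_candidate_lists, k):
    """Fraction of examples whose true response is ranked in the top k.
    Each ranking lists candidate indices sorted by model score (descending);
    by convention here, index 0 denotes the ground-truth response."""
    hits = sum(1 for ranking in ranked_candidate_lists if 0 in ranking[:k])
    return hits / len(ranked_candidate_lists)

# Toy 1-in-10 example, the candidate-set size commonly used on this corpus.
rankings = [
    [0, 3, 7, 1, 2, 9, 4, 5, 8, 6],  # true response ranked first
    [4, 0, 2, 8, 1, 3, 9, 6, 5, 7],  # true response ranked second
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],  # true response ranked last
]
print(recall_at_k(rankings, 1))  # 0.333...
print(recall_at_k(rankings, 5))  # 0.666...
```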
Sequence-to-sequence (Seq2Seq) models have achieved notable success in generating natural conversational exchanges. Although the responses produced by these neural network models are syntactically well-formed, they tend to be acontextual, short, and generic. In this work, we introduce a Topical Hierarchical Recurrent Encoder Decoder (THRED), a novel, fully data-driven, multi-turn response generation system intended to produce contextual and topic-aware responses. Our model builds on the basic Seq2Seq model, augmenting it with a hierarchical joint attention mechanism that incorporates topical concepts and previous interactions into the response generation. To train our model, we provide a clean, high-quality conversational dataset mined from Reddit comments. We evaluate THRED on two new automated metrics, dubbed Semantic Similarity and Response Echo Index, as well as with human evaluation. Our experiments demonstrate that, compared to strong baselines, the proposed model is able to generate more diverse and contextually relevant responses.
Current conversational systems can follow simple commands and answer basic questions, but they have difficulty maintaining coherent, open-ended conversations about specific topics. Competitions such as the Conversational Intelligence (ConvAI) Challenge are being organized to push research toward this goal. This paper presents in detail the RLLChatbot that participated in the 2017 ConvAI Challenge. The goal of this research is to better understand how current deep learning and reinforcement learning tools can be used to build a robust yet flexible open-domain conversational agent. We provide a thorough description of how an ensemble model is built and trained on mostly public-domain datasets to form a dialogue system. The first contribution of this work is a detailed description and analysis of different text generation models, in addition to novel message ranking and selection methods. Moreover, a new open-source conversational dataset is presented. Training on this data significantly improves the Recall@k score of the ranking and selection mechanisms compared to our baseline model, which is responsible for selecting the message returned at each interaction.
This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter. We also describe two neural learning architectures suitable for analyzing this dataset, and provide benchmark performance on the task of selecting the best next response.
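As a rough illustration of the response selection setting this benchmark targets, here is a hedged PyTorch sketch of a dual-encoder-style ranker: two RNNs encode the context and a candidate response, and a learned bilinear map scores their compatibility. Layer sizes, and the choice of separate rather than tied encoders, are assumptions made for clarity rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Scores a candidate response against a context as sigmoid(c^T M r)."""

    def __init__(self, vocab_size=5000, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.context_rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.response_rnn = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.M = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

    def forward(self, context_ids, response_ids):
        _, (h_c, _) = self.context_rnn(self.embed(context_ids))
        _, (h_r, _) = self.response_rnn(self.embed(response_ids))
        # Bilinear match between the final context and response states.
        score = (h_c[-1] @ self.M * h_r[-1]).sum(dim=1)
        return torch.sigmoid(score)  # probability the response is the next one

model = DualEncoder()
ctx = torch.randint(0, 5000, (2, 40))   # batch of 2 contexts, 40 tokens each
resp = torch.randint(0, 5000, (2, 15))  # 2 candidate responses
print(model(ctx, resp))                 # 2 match probabilities
```

Ranking then amounts to scoring each candidate in the list and selecting the highest-probability one.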
In this paper, we explore the application of deep neural networks to natural language generation. Specifically, we implement two sequence-to-sequence neural variational models: the variational autoencoder (VAE) and the variational encoder-decoder (VED). VAEs for text generation are difficult to train due to issues associated with the Kullback-Leibler (KL) divergence term of the loss function vanishing to zero. We successfully train VAEs by implementing optimization heuristics such as KL weight annealing and word dropout. We also demonstrate the effectiveness of the resulting continuous latent space through random sampling, linear interpolation, and sampling from the neighborhood of the input. We argue that an unsuitable VAE design may give rise to bypassing connections that cause the latent space to be ignored during training. We demonstrate experimentally, using decoder hidden-state initialization as an example, that such bypassing connections degrade the VAE into a deterministic model, thereby reducing the diversity of the generated sentences. We find that the traditional attention mechanism used in sequence-to-sequence VED models acts as a bypassing connection, thereby deteriorating the model's latent space. To avoid this problem, we propose a variational attention mechanism in which the attention context vector is modeled as a random variable that can be sampled from a distribution. We show empirically, using automated evaluation metrics (namely entropy and distinct measures), that our variational attention model produces more diverse output sentences than the deterministic attention model. A qualitative analysis via a human evaluation study demonstrates that the sentences our model generates are simultaneously of high quality and as fluent as those produced by the deterministic attention counterpart.
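The KL-vanishing remedies mentioned here, KL weight annealing and word dropout, are compact enough to sketch. The schedule, dropout rate, and function names below are illustrative assumptions, not the paper's exact settings.

```python
import torch

def kl_weight(step, anneal_steps=10000):
    """Linear KL annealing: ramp the KL term's weight from 0 to 1 so the
    decoder cannot drive the KL to zero before the latent space is used."""
    return min(1.0, step / anneal_steps)

def word_dropout(token_ids, unk_id, p=0.25):
    """Randomly replace decoder input tokens with <unk>, forcing the decoder
    to rely on the latent code rather than autoregressive context alone."""
    mask = torch.rand(token_ids.shape) < p
    return token_ids.masked_fill(mask, unk_id)

def vae_loss(reconstruction_nll, mu, logvar, step):
    # Analytic KL between the diagonal Gaussian posterior and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    return reconstruction_nll + kl_weight(step) * kl
```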
In this paper, we propose a novel end-to-end neural architecture for ranking candidate answers that adapts a hierarchical recurrent neural network and a latent topic clustering module. With our proposed model, a text is encoded into a vector representation from a word level to a chunk level to effectively capture its entire meaning. In particular, by adapting the hierarchical structure, our model shows very small performance degradation on longer text comprehension, while other state-of-the-art recurrent neural network models suffer from it. Additionally, the latent topic clustering module extracts semantic information from target samples. This clustering module is useful for any text-related task, as it allows each data sample to find its nearest topic cluster, thus helping the neural network model analyze the entire data. We evaluate our models on the Ubuntu Dialogue Corpus and a consumer electronics domain question answering dataset related to Samsung products. The proposed model shows state-of-the-art results for ranking question-answer pairs.
Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores for input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.
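For intuition, the scoring form proposed in the ADEM paper is a learned bilinear match of the model response against both the context and the reference response, shifted and scaled into the human rating range. The sketch below assumes that form; in the actual model the vectors come from pretrained RNN encoders and M, N, alpha, and beta are learned from the human score data, whereas the values here are placeholders.

```python
import numpy as np

def adem_style_score(context_vec, ref_vec, model_response_vec, M, N, alpha, beta):
    """(c^T M r_hat + r^T N r_hat - alpha) / beta, mapped to the rating scale."""
    return (context_vec @ M @ model_response_vec
            + ref_vec @ N @ model_response_vec - alpha) / beta

d = 4  # toy encoding size; real encodings come from a pretrained encoder
rng = np.random.default_rng(0)
c, r, r_hat = rng.normal(size=(3, d))
M, N = np.eye(d), np.eye(d)
print(adem_style_score(c, r, r_hat, M, N, alpha=0.0, beta=1.0))
```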
Building systems that can communicate with humans is a core problem in artificial intelligence. This work presents a novel neural network architecture for response selection in an end-to-end multi-turn conversational dialogue setting. The architecture applies context-response attention and incorporates additional external knowledge provided by domain-specific word descriptions. It uses bidirectional Gated Recurrent Units (GRUs) to encode the context and the response, and learns to attend over context words given the latent response representation, and vice versa. Furthermore, it incorporates external domain-specific information by using another GRU to encode the domain keyword descriptions. This allows a better representation of domain-specific keywords in responses, which improves overall performance. Experimental results show that our model outperforms all other state-of-the-art response selection methods on multi-turn dialogues.
We propose simple and flexible training and decoding methods for influencing output style and topic in neural encoder-decoder based language generation. This capability is desirable in a variety of applications, including conversational systems, where successful agents need to produce language in a specific style and generate responses steered by a human puppeteer or external knowledge. We decompose the neural generation process into empirically easier sub-problems: a faithfulness model and a decoding method based on selective-sampling. We also describe training and sampling algorithms that bias the generation process with a specific language style restriction, or a topic restriction. Human evaluation results show that our proposed methods are able to restrict style and topic without degrading output quality in conversational tasks.
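The selective-sampling idea reduces to a few lines: draw several candidates from the base (faithfulness) model, then keep the one that a separate scorer judges closest to the desired style or topic. The stand-in generator and scorer below are toy assumptions purely to show the control flow.

```python
import random

def selective_sample(generate, score_style, n_candidates=20):
    """Sample n candidates from the base model; return the best-scoring one."""
    candidates = [generate() for _ in range(n_candidates)]
    return max(candidates, key=score_style)

# Toy stand-ins: a "generator" sampling canned replies, and a "style scorer"
# that simply prefers longer, more specific responses.
replies = ["ok", "sure thing", "I can look into the driver issue for you"]
print(selective_sample(lambda: random.choice(replies), len))
```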
Modeling dialog systems is currently one of the most active problems in Natural Language Processing. Recent advances in Deep Learning have sparked an interest in the use of neural networks for modeling language, particularly for personalized conversational agents that can retain contextual information during dialog exchanges. This work carefully explores and compares several of the recently proposed neural conversation models, and carries out a detailed evaluation of the multiple factors that can significantly affect predictive performance, such as pretraining, embedding training, data cleaning, diversity-based reranking, evaluation setting, etc. Based on the tradeoffs of the different models, we propose a new neural generative dialog model conditioned on speakers as well as context history that outperforms previous models on both retrieval and generative metrics. Our findings indicate that pretraining speaker embeddings on larger datasets, as well as bootstrapping word and speaker embeddings, can significantly improve performance (up to 3 points in perplexity), and that promoting diversity with Mutual Information based techniques has a very strong effect on ranking metrics.
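One concrete Mutual Information based technique from this line of work is MMI-antiLM reranking (Li et al., 2016): subtract a weighted language-model score from the Seq2Seq score so that generic responses, which are likely under any context, are pushed down the ranking. A hedged sketch with toy log-probabilities standing in for trained models:

```python
def mmi_rerank(candidates, log_p_resp_given_ctx, log_p_resp, lam=0.5):
    """Rank candidates by log p(r|c) - lam * log p(r), highest first."""
    def score(r):
        return log_p_resp_given_ctx(r) - lam * log_p_resp(r)
    return sorted(candidates, key=score, reverse=True)

# "i don't know" is likely under both models (generic); the specific reply is
# likely only given the context, so MMI ranks it first.
s2s = {"i don't know": -2.0, "try reinstalling the package": -2.5}
lm = {"i don't know": -1.0, "try reinstalling the package": -6.0}
print(mmi_rerank(list(s2s), s2s.__getitem__, lm.__getitem__))
```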
Building an open-domain multi-turn dialogue system is one of the most interesting and challenging tasks in artificial intelligence. Many researchers have been devoted to building such dialogue systems, yet few have modeled the conversation flow within an ongoing dialogue. Moreover, it is common for people to talk about highly related aspects during a conversation: topics are coherent and drift naturally, which demonstrates the necessity of modeling conversation flow. To this end, we propose a multi-turn cue-word driven conversational system with a reinforcement learning method (RLCw), which strives to select adaptive cue words with the greatest future credit so as to improve the quality of generated responses. We introduce a new metric to measure the quality of cue words in terms of effectiveness and relevance. To further optimize the model for long-term conversations, this paper adopts a reinforcement learning approach. Experiments on a real-world dataset show that our model consistently outperforms a set of competitive baselines in terms of simulated turns, diversity, and human evaluation.
The past decade has witnessed the boom of human-machine interactions, particularly via dialog systems. In this paper, we study the task of response generation in open-domain multi-turn dialog systems. Many research efforts have been dedicated to building intelligent dialog systems, yet few shed light on deepening or widening the chatting topics in a conversational session, which would attract users to talk more. To this end, this paper presents a novel deep scheme consisting of three channels, namely global, wide, and deep ones. The global channel encodes the complete historical information within the given context, the wide one employs an attention-based recurrent neural network model to predict the keywords that may not appear in the historical context, and the deep one trains a Multi-layer Perceptron model to select some keywords for an in-depth discussion. Thereafter, our scheme integrates the outputs of these three channels to generate desired responses. To justify our model, we conducted extensive experiments to compare our model with several state-of-the-art baselines on two datasets: one is constructed by ourselves and the other is a public benchmark dataset. Experimental results demonstrate that our model yields promising performance by widening or deepening the topics of interest.
We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent work in response generation has adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses of the existing metrics, and provide recommendations for the future development of better automatic evaluation metrics for dialogue systems.
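The analysis described here can be reproduced in miniature: score each generated response against its single reference with a word-overlap metric such as BLEU, then correlate those scores with human ratings. The sketch below uses NLTK and SciPy on made-up data purely to show the mechanics; it is not the paper's evaluation pipeline.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

references = [["try", "rebooting", "the", "machine"],
              ["use", "apt-get", "install", "vim"],
              ["check", "the", "system", "logs"]]
responses = [["try", "restarting", "it"],
             ["use", "apt-get", "install", "vim"],
             ["what", "is", "your", "favourite", "colour"]]
human_scores = [4.0, 5.0, 1.0]  # hypothetical 1-5 quality ratings

smooth = SmoothingFunction().method1  # avoids zero scores on short sentences
bleu = [sentence_bleu([ref], hyp, smoothing_function=smooth)
        for ref, hyp in zip(references, responses)]
rho, _ = spearmanr(bleu, human_scores)
print(rho)  # on real dialogue data, the paper finds such correlations weak
```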
While recent neural encoder-decoder models have shown great promise in modeling open-domain conversations, they often generate dull and generic responses. Unlike past work that has focused on diversifying the output of the decoder at the word level to alleviate this problem, we present a novel framework based on conditional variational autoencoders that captures discourse-level diversity in the encoder. Our model uses latent variables to learn a distribution over potential conversational intents and generates diverse responses using only greedy decoders. We have further developed a novel variant that is integrated with linguistic prior knowledge for better performance. Finally, the training procedure is improved by introducing a bag-of-word loss. Our proposed models have been validated to generate significantly more diverse responses than baseline approaches and exhibit competence in discourse-level decision-making.
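The bag-of-word loss mentioned at the end is an auxiliary objective that requires the latent variable to predict every word of the response regardless of order, so the decoder cannot simply ignore it. A minimal PyTorch sketch with illustrative sizes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BowLoss(nn.Module):
    """Negative log-likelihood of the response's words given only z."""

    def __init__(self, latent_dim=64, vocab_size=5000):
        super().__init__()
        self.proj = nn.Linear(latent_dim, vocab_size)

    def forward(self, z, response_ids):
        logits = F.log_softmax(self.proj(z), dim=-1)  # (batch, vocab)
        token_logp = logits.gather(1, response_ids)   # (batch, seq_len)
        return -token_logp.sum(dim=1).mean()

bow = BowLoss()
z = torch.randn(2, 64)                  # latent samples from q(z | c, r)
resp = torch.randint(0, 5000, (2, 12))  # token ids of the two responses
print(bow(z, resp))                     # added to the ELBO during training
```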
Deep learning methods employ multiple processing layers to learn hierarchical representations of data, and have produced state-of-the-art results in many domains. Recently, a variety of model designs and methods have blossomed in the context of natural language processing (NLP). In this paper, we review significant deep learning related models and methods that have been employed for numerous NLP tasks and provide a walk-through of their evolution. We also summarize, compare, and contrast the various models, and put forward a detailed understanding of the past, present, and future of deep learning in NLP.
The use of connectionist approaches in conversational agents has been progressing rapidly due to the availability of large corpora. However, existing dialogue models often lack coherence and are content poor. This work proposes an architecture that incorporates unstructured knowledge sources to enhance next-utterance prediction in chit-chat type generative dialogue models. We focus on Sequence-to-Sequence (Seq2Seq) conversational agents trained on a Reddit news dataset, and consider incorporating external knowledge from Wikipedia summaries as well as from the NELL knowledge base. Our experiments show faster training time and improved perplexity when leveraging external knowledge.
We present three enhancements to existing encoder-decoder models for open-domain conversational agents, aimed at effectively modeling coherence and promoting output diversity: (1) we introduce a coherence measure, defined as the embedding similarity between the dialogue context and the generated response; (2) we filter our training corpora based on this coherence measure to obtain topically coherent and lexically diverse context-response pairs; (3) we then train a response generator using a conditional variational autoencoder model that incorporates the coherence measure as a latent variable and uses a context gate to guarantee topical consistency with the context and promote lexical diversity. Experiments on the OpenSubtitles corpus show substantial improvements over competitive neural models in terms of BLEU score as well as metrics of coherence and diversity.
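Enhancement (1) is simple to sketch: represent the context and the response as, for instance, averaged word embeddings and take their cosine similarity; enhancement (2) then keeps only training pairs whose score clears a threshold. The composition function and toy vectors below are illustrative assumptions.

```python
import numpy as np

def coherence(context_tokens, response_tokens, embeddings):
    """Cosine similarity between averaged embeddings of context and response."""
    c = np.mean([embeddings[t] for t in context_tokens], axis=0)
    r = np.mean([embeddings[t] for t in response_tokens], axis=0)
    return float(c @ r / (np.linalg.norm(c) * np.linalg.norm(r)))

# Toy vectors; real usage would load pretrained embeddings such as word2vec.
emb = {"kernel": np.array([1.0, 0.0]), "panic": np.array([0.8, 0.2]),
       "reboot": np.array([0.9, 0.1]), "cake": np.array([0.0, 1.0])}
print(coherence(["kernel", "panic"], ["reboot"], emb))  # high: on-topic
print(coherence(["kernel", "panic"], ["cake"], emb))    # low: off-topic
```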
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNNs). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
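A minimal PyTorch sketch may make the setup concrete. It uses off-the-shelf GRU cells in place of the paper's own gated hidden unit, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """One RNN compresses the source into a fixed-length vector; a second RNN
    is initialized with that vector and emits the target sequence."""

    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, summary = self.encoder(self.embed(src_ids))  # fixed-length vector
        dec_out, _ = self.decoder(self.embed(tgt_ids), summary)
        return self.out(dec_out)  # per-step logits over the target vocabulary

model = EncoderDecoder()
src = torch.randint(0, 1000, (2, 10))
tgt = torch.randint(0, 1000, (2, 8))
print(model(src, tgt).shape)  # torch.Size([2, 8, 1000]); trained with
                              # cross-entropy to maximize p(target | source)
```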
Inspired by previous attempts to answer crossword questions using neural networks (Hill, Cho, Korhonen, & Bengio, 2015), this thesis implements extensions to improve the performance of an existing definition model on answering crossword questions. A discussion and evaluation of the original implementation finds several ways in which the recurrent neural model could be extended. Insights from the related fields of neural language modeling and neural machine translation provide the justification and means for these extensions. Two extensions are applied to the LSTM encoder: first taking the average of the LSTM states across the sequence, and second using a bidirectional LSTM; both implementations improve model performance on the definitions and crossword test sets. To improve performance on crossword questions, the training data is augmented to include crossword questions and answers, which helps improve results on definitions as well as on crossword questions. Final experiments are conducted using sub-word unit segmentation, first on the source side, followed by preliminary experiments to facilitate character-level output. Initially, an exact replication of the baseline results proved unsuccessful. Despite this, the extensions improve performance, allowing the definition model to surpass the recurrent neural network variants from previous work (Hill et al., 2015).
We introduce the multiresolution recurrent neural network, which extends the sequence-to-sequence framework to model natural language generation as two parallel discrete stochastic processes: a sequence of high-level coarse tokens, and a sequence of natural language tokens. There are many ways to estimate or learn the high-level coarse tokens, but we argue that a simple extraction procedure is sufficient to capture a wealth of high-level discourse semantics. Such a procedure allows training the multiresolution recurrent neural network by maximizing the exact joint log-likelihood over both sequences. In contrast to the standard log-likelihood objective w.r.t. natural language tokens (word perplexity), optimizing the joint log-likelihood biases the model towards modeling high-level abstractions. We apply the proposed model to the task of dialogue response generation in two challenging domains: the Ubuntu technical support domain, and Twitter conversations. On Ubuntu, the model outperforms competing approaches by a substantial margin, achieving state-of-the-art results according to both automatic evaluation metrics and a human evaluation study. On Twitter, the model appears to generate more relevant and on-topic responses according to automatic evaluation metrics. Finally, our experiments demonstrate that the proposed model is more adept at overcoming the sparsity of natural language and is better able to capture long-term structure.
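Because the joint log-likelihood factorizes as log p(coarse) + log p(words | coarse), training reduces to summing two standard cross-entropy terms, one per sequence. The hedged sketch below shows only this objective with toy shapes; the logits would come from the two decoder streams of the actual model.

```python
import torch
import torch.nn.functional as F

def joint_nll(coarse_logits, coarse_targets, word_logits, word_targets):
    """-log p(coarse) - log p(words | coarse), each via cross-entropy."""
    nll_coarse = F.cross_entropy(coarse_logits.flatten(0, 1),
                                 coarse_targets.flatten())
    nll_words = F.cross_entropy(word_logits.flatten(0, 1),
                                word_targets.flatten())
    return nll_coarse + nll_words

# Toy shapes: batch 2; coarse length 5 over a 50-token coarse vocabulary;
# natural language length 12 over a 1000-word vocabulary.
loss = joint_nll(torch.randn(2, 5, 50), torch.randint(0, 50, (2, 5)),
                 torch.randn(2, 12, 1000), torch.randint(0, 1000, (2, 12)))
print(loss)
```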