In this paper we explore the task of modeling (semi) structured object sequences; in particular we focus our attention on the problem of developing a structure-aware input representation for such sequences. In such sequences, we assume that each structured object is represented by a set of key-value pairs which encode the attributes of the structured object. Given a universe of keys, a sequence of structured objects can then be viewed as an evolution of the values for each key, over time. We encode and construct a sequential representation using the values for a particular key (Temporal Value Modeling - TVM) and then self-attend over the set of key-conditioned value sequences to a create a representation of the structured object sequence (Key Aggregation - KA). We pre-train and fine-tune the two components independently and present an innovative training schedule that interleaves the training of both modules with shared attention heads. We find that this iterative two part-training results in better performance than a unified network with hierarchical encoding as well as over, other methods that use a {\em record-view} representation of the sequence \cite{de2021transformers4rec} or a simple {\em flattened} representation of the sequence. We conduct experiments using real-world data to demonstrate the advantage of interleaving TVM-KA on multiple tasks and detailed ablation studies motivating our modeling choices. We find that our approach performs better than flattening sequence objects and also allows us to operate on significantly larger sequences than existing methods.
translated by 谷歌翻译
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a;Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be finetuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial taskspecific architecture modifications.BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
translated by 谷歌翻译
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
translated by 谷歌翻译
最近的趋势表明,一般的模型,例如BERT,GPT-3,剪辑,在规模上广泛的数据训练,已经显示出具有单一学习架构的各种功能。在这项工作中,我们通过在大尺度上培训通用用户编码器来探讨通用用户表示学习的可能性。我们展示了扩展法在用户建模区域中持有,其中训练错误将作为幂律规模的幂级,具有计算量。我们的对比学习用户编码器(CLUE),优​​化任务 - 不可知目标,并且所产生的用户嵌入式延伸我们对各种下游任务中的可能做些什么。 Clue还向其他域和系统展示了巨大的可转移性,因为在线实验上的性能显示在线点击率(CTR)的显着改进。此外,我们还调查了如何根据扩展因子,即模型容量,序列长度和批量尺寸来改变性能如何变化。最后,我们讨论了线索的更广泛影响。
translated by 谷歌翻译
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understand (NLU) and natural langauge generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transform layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa 1 .
translated by 谷歌翻译
在培训数据中拟合复杂的模式,例如推理和争议,是语言预训练的关键挑战。根据最近的研究和我们的经验观察,一种可能的原因是训练数据中的一些易于适应的模式,例如经常共同发生的单词组合,主导和伤害预训练,使模型很难适合更复杂的信息。我们争辩说,错误预测可以帮助找到危害语言理解的这种主导模式。当发生错误预测时,应该经常与导致MIS预测的模型拟合的MIS预测字相同的模式。如果我们可以添加正规化以培训模型,当MIS预测发生并更多地对待更微妙的模式时,可以在更多信息上缩小到这种主导模式时,可以在预训练中有效地安装更多信息。在此动机之后,我们提出了一种新的语言预培训方法,错误预测作为伤害警报(MPA)。在MPA中,当在预训练期间发生错误预测时,我们使用其共同发生信息来指导自我关注模块的多个头部。变压器模块中的一些自我关注头经过优化,以将更低的注意重量分配给频繁地在误报中的输入句子中的单词,同时将更高权重分配给另一个单词。通过这样做,变压器模型训练,以依赖于主导的频繁共同发生模式,而在误报中,当发生错误预测时,在剩余更复杂的信息上更加关注更多。我们的实验表明,MPA加快了伯特和电器的预训练,并提高了他们对下游任务的表现。
translated by 谷歌翻译
日志异常检测是IT操作(AIOPs)的人工智能领域的关键组成部分。考虑到变量域的日志数据,Retring为未知域的整个网络效率低于实际工业场景,特别是对于低资源域。但是,之前的深层模型仅仅集中在同一域中提取日志序列的语义,导致多域日志的概括。因此,我们提出了一种统一的基于变换器的日志异常检测框架(\ OurMethod {}),其包括预先曝光和基于适配器的调谐阶段。我们的模型首先在源域上留下来验证以获取日志数据的共享语义知识。然后,我们通过基于适配器的调谐将预磨模的模型传送到目标域。所提出的方法在包括一个源域和两个目标域的三个公共数据集上进行评估。实验结果表明,我们的简单且有效的方法,具有较少的可训练参数和较低的目标领域的培训成本,在三个基准上实现了最先进的性能。
translated by 谷歌翻译
在法律文本中预先培训的基于变压器的预训练语言模型(PLM)的出现,法律领域中的自然语言处理受益匪浅。有经过欧洲和美国法律文本的PLM,最著名的是Legalbert。但是,随着印度法律文件的NLP申请量的迅速增加以及印度法律文本的区别特征,也有必要在印度法律文本上预先培训LMS。在这项工作中,我们在大量的印度法律文件中介绍了基于变压器的PLM。我们还将这些PLM应用于印度法律文件的几个基准法律NLP任务,即从事实,法院判决的语义细分和法院判决预测中的法律法规识别。我们的实验证明了这项工作中开发的印度特定PLM的实用性。
translated by 谷歌翻译
与伯特(Bert)等语言模型相比,已证明知识增强语言表示的预培训模型在知识基础构建任务(即〜关系提取)中更有效。这些知识增强的语言模型将知识纳入预训练中,以生成实体或关系的表示。但是,现有方法通常用单独的嵌入表示每个实体。结果,这些方法难以代表播出的实体和大量参数,在其基础代币模型之上(即〜变压器),必须使用,并且可以处理的实体数量为由于内存限制,实践限制。此外,现有模型仍然难以同时代表实体和关系。为了解决这些问题,我们提出了一个新的预培训模型,该模型分别从图书中学习实体和关系的表示形式,并分别在文本中跨越跨度。通过使用SPAN模块有效地编码跨度,我们的模型可以代表实体及其关系,但所需的参数比现有模型更少。我们通过从Wikipedia中提取的知识图对我们的模型进行了预训练,并在广泛的监督和无监督的信息提取任务上进行了测试。结果表明,我们的模型比基线学习对实体和关系的表现更好,而在监督的设置中,微调我们的模型始终优于罗伯塔,并在信息提取任务上取得了竞争成果。
translated by 谷歌翻译
在本文中,我们提出了一条新型的管道,该管道利用语言基础模型进行时间顺序模式挖掘,例如人类的移动性预测任务。例如,在预测利益(POI)客户流量的任务中,通常从历史日志中提取访问次数,并且仅使用数值数据来预测访客流。在这项研究中,我们直接对包含各种信息的自然语言输入执行预测任务,例如数值和上下文的语义信息。引入特定的提示以将数值时间序列转换为句子,以便可以直接应用现有的语言模型。我们设计了一个Auxmoblcast管道,用于预测每个POI中的访问者数量,将辅助POI类别分类任务与编码器架构结构集成在一起。这项研究提供了所提出的Auxmoblcast管道有效性以发现移动性预测任务中的顺序模式的经验证据。在三个现实世界数据集上评估的结果表明,预训练的语言基础模型在预测时间序列中也具有良好的性能。这项研究可以提供有远见的见解,并为预测人类流动性提供新的研究方向。
translated by 谷歌翻译
Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BIGBIRD, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BIGBIRD drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
translated by 谷歌翻译
用户建模对于理解用户行为至关重要,对于改善用户体验和个性化建议至关重要。当用户与软件交互时,通过记录和分析系统生成大量命令序列。这些命令序列包含用户目标和意图的线索。但是,这些数据模式是高度非结构化和未标记的,因此标准预测系统很难学习。我们提出了SimCurl,这是一个简单而有效的对比度自我监督的深度学习框架,从未标记的命令序列中学习用户表示。我们的方法介绍了用户会议网络体系结构,以及会话辍学作为一种新颖的数据增强方式。我们在超过十亿命令的现实世界命令序列数据集上训练和评估我们的方法。当将学习的表示形式转移到经验和专业知识分类等下游任务时,我们的方法对现有方法显示了显着改善。
translated by 谷歌翻译
预训练的模型(PTM)已成为自然语言处理和计算机视觉下游任务的基本骨干。尽管通过在BAIDU地图上将通用PTM应用于与地理相关的任务中获得的最初收益,但随着时间的流逝,表现平稳。造成该平稳的主要原因之一是缺乏通用PTM中的可用地理知识。为了解决这个问题,在本文中,我们介绍了Ernie-Geol,这是一个地理和语言预培训模型,设计和开发了用于改善Baidu Maps的地理相关任务。 Ernie-Geol经过精心设计,旨在通过预先培训从包含丰富地理知识的异质图生成的大规模数据来学习地理语言的普遍表示。大规模现实数据集进行的广泛定量和定性实验证明了Ernie-Geol的优势和有效性。自2021年4月以来,Ernie-Geol已经在百度地图上部署在生产中,这显着受益于各种下游任务的性能。这表明Ernie-Geol可以作为各种与地理有关的任务的基本骨干。
translated by 谷歌翻译
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
translated by 谷歌翻译
Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameterreduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT. * Work done as an intern at Google Research, driving data processing and downstream task evaluations.
translated by 谷歌翻译
目前,用于训练语言模型的最广泛的神经网络架构是所谓的BERT,导致各种自然语言处理(NLP)任务的改进。通常,BERT模型中的参数的数量越大,这些NLP任务中获得的结果越好。不幸的是,内存消耗和训练持续时间随着这些模型的大小而大大增加。在本文中,我们调查了较小的BERT模型的各种训练技术:我们将不同的方法与Albert,Roberta和相对位置编码等其他BERT变体相结合。此外,我们提出了两个新的微调修改,导致更好的性能:类开始终端标记和修改形式的线性链条条件随机字段。此外,我们介绍了整个词的注意力,从而降低了伯特存储器的使用,并导致性能的小幅增加,与古典的多重关注相比。我们评估了这些技术的五个公共德国命名实体识别(NER)任务,其中两条由这篇文章引入了两项任务。
translated by 谷歌翻译
大型的语言模型(PRELMS)正在彻底改变所有基准的自然语言处理。但是,它们的巨大尺寸对于小型实验室或移动设备上的部署而言是过分的。修剪和蒸馏等方法可减少模型尺寸,但通常保留相同的模型体系结构。相反,我们探索了蒸馏预告片中的更有效的架构,单词的持续乘法(CMOW),该构造将每个单词嵌入为矩阵,并使用矩阵乘法来编码序列。我们扩展了CMOW体系结构及其CMOW/CBOW-HYBRID变体,具有双向组件,以提供更具表现力的功能,在预绘制期间进行一般(任务无义的)蒸馏的单次表示,并提供了两种序列编码方案,可促进下游任务。句子对,例如句子相似性和自然语言推断。我们的基于矩阵的双向CMOW/CBOW-HYBRID模型在问题相似性和识别文本范围内的Distilbert具有竞争力,但仅使用参数数量的一半,并且在推理速度方面快三倍。除了情感分析任务SST-2和语言可接受性任务COLA外,我们匹配或超过ELMO的ELMO分数。但是,与以前的跨架结构蒸馏方法相比,我们证明了检测语言可接受性的分数增加了一倍。这表明基于基质的嵌入可用于将大型预赛提炼成竞争模型,并激励朝这个方向进行进一步的研究。
translated by 谷歌翻译
文本到SQL解析是一项必不可少且具有挑战性的任务。文本到SQL解析的目的是根据关系数据库提供的证据将自然语言(NL)问题转换为其相应的结构性查询语言(SQL)。来自数据库社区的早期文本到SQL解析系统取得了显着的进展,重度人类工程和用户与系统的互动的成本。近年来,深层神经网络通过神经生成模型显着提出了这项任务,该模型会自动学习从输入NL问题到输出SQL查询的映射功能。随后,大型的预训练的语言模型将文本到SQL解析任务的最新作品带到了一个新级别。在这项调查中,我们对文本到SQL解析的深度学习方法进行了全面的评论。首先,我们介绍了文本到SQL解析语料库,可以归类为单转和多转。其次,我们提供了预先训练的语言模型和现有文本解析方法的系统概述。第三,我们向读者展示了文本到SQL解析所面临的挑战,并探索了该领域的一些潜在未来方向。
translated by 谷歌翻译
Language model pre-training has proven to be useful in learning universal language representations. As a state-of-the-art language model pre-training model, BERT (Bidirectional Encoder Representations from Transformers) has achieved amazing results in many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on text classification task and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely-studied text classification datasets. 1
translated by 谷歌翻译
Neural language representation models such as BERT pre-trained on large-scale corpora can well capture rich semantic patterns from plain text, and be fine-tuned to consistently improve the performance of various NLP tasks. However, the existing pre-trained language models rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better language understanding. We argue that informative entities in KGs can enhance language representation with external knowledge. In this paper, we utilize both large-scale textual corpora and KGs to train an enhanced language representation model (ERNIE), which can take full advantage of lexical, syntactic, and knowledge information simultaneously. The experimental results have demonstrated that ERNIE achieves significant improvements on various knowledge-driven tasks, and meanwhile is comparable with the state-of-the-art model BERT on other common NLP tasks. The source code and experiment details of this paper can be obtained from https:// github.com/thunlp/ERNIE.
translated by 谷歌翻译