Our goal is to build general representations (embeddings) for every user and every product item across Alibaba's businesses, including Taobao and Tmall, one of the largest e-commerce websites in the world. Representations of users and items play a key role in a wide range of downstream applications, including recommendation systems, search, marketing, demand forecasting, and more. Inspired by the BERT model from the natural language processing (NLP) domain, we propose GUIM (General User and Item embedding with Mixture of representation) to achieve this goal on massive, structured, multi-modal data covering interactions between hundreds of millions of users and items. We use a mixture of representation (MoR) as a novel representation form to model the diverse interests of each user. In addition, we adopt InfoNCE from contrastive learning to avoid the intractable computational cost caused by the enormous size of the item (token) vocabulary. Finally, we propose a set of representative downstream tasks as a standard benchmark for evaluating the quality of the learned user and/or item embeddings, analogous to the GLUE benchmark in NLP. Our experimental results on these downstream tasks clearly demonstrate the comparative value of the embeddings learned by the GUIM model.
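To make the InfoNCE objective above concrete, here is a minimal sketch of an InfoNCE loss with sampled negative items, which sidesteps a full softmax over the whole item vocabulary. It assumes a PyTorch setup; the tensor shapes, temperature, and function names are illustrative and not the GUIM implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(user_vec, pos_item_vec, neg_item_vecs, temperature=0.07):
    """Contrastive InfoNCE loss with sampled negatives.

    user_vec:      (B, D)    user (or context) embeddings
    pos_item_vec:  (B, D)    embedding of the positive item per user
    neg_item_vecs: (B, K, D) K sampled negative item embeddings per user
    """
    user_vec = F.normalize(user_vec, dim=-1)
    pos_item_vec = F.normalize(pos_item_vec, dim=-1)
    neg_item_vecs = F.normalize(neg_item_vecs, dim=-1)

    # Positive logits: (B, 1); negative logits: (B, K)
    pos_logit = (user_vec * pos_item_vec).sum(-1, keepdim=True)
    neg_logits = torch.einsum("bd,bkd->bk", user_vec, neg_item_vecs)

    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    # The positive sits at index 0 of every row, so the target label is always zero.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```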
To develop effective sequential recommenders, a series of sequence representation learning (SRL) methods have been proposed to model historical user behaviors. Most existing SRL methods rely on explicit item IDs when developing sequence models to better capture user preferences. Although effective to some extent, these methods are hard to transfer to new recommendation scenarios because of the limitation of explicitly modeling item IDs. To address this issue, we propose a novel universal sequence representation learning approach named UniSRec. The proposed approach utilizes item text to learn transferable representations across different recommendation scenarios. To learn universal item representations, we design a lightweight item encoding architecture based on parametric whitening and a mixture-of-experts enhanced adaptor. To learn universal sequence representations, we introduce two contrastive pre-training tasks that sample multi-domain negatives. With the pre-trained universal sequence representation model, our approach can be effectively transferred to new recommendation domains or platforms in a parameter-efficient way, under either inductive or transductive settings. Extensive experiments on real-world datasets demonstrate the effectiveness of the proposed approach. In particular, our approach also brings performance improvements in a cross-platform setting, showing the strong transferability of the proposed universal SRL method. The code and pre-trained model are available at: https://github.com/rucaibox/unisrec.
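As a rough illustration of the item encoding described above, the sketch below combines a parametric-whitening layer with a mixture-of-experts enhanced adaptor over item text embeddings. The layer shapes, gating form, and class names are assumptions for exposition; the actual design is in the released UniSRec code.

```python
import torch
import torch.nn as nn

class PWLayer(nn.Module):
    """Parametric whitening: a learnable shift followed by a linear projection."""
    def __init__(self, text_dim, out_dim):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(text_dim))
        self.proj = nn.Linear(text_dim, out_dim, bias=False)

    def forward(self, x):
        return self.proj(x - self.bias)

class MoEAdaptor(nn.Module):
    """Mixture-of-experts adaptor: several whitening experts mixed by a softmax gate."""
    def __init__(self, text_dim, out_dim, n_experts=8):
        super().__init__()
        self.experts = nn.ModuleList([PWLayer(text_dim, out_dim) for _ in range(n_experts)])
        self.gate = nn.Linear(text_dim, n_experts)

    def forward(self, text_emb):                                            # (B, text_dim)
        weights = torch.softmax(self.gate(text_emb), dim=-1)                # (B, E)
        expert_outs = torch.stack([e(text_emb) for e in self.experts], 1)   # (B, E, out_dim)
        return (weights.unsqueeze(-1) * expert_outs).sum(dim=1)             # (B, out_dim)
```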
Recent trends show that general-purpose models such as BERT, GPT-3, and CLIP, trained on broad data at scale, exhibit a wide range of capabilities with a single learning architecture. In this work, we explore the possibility of general-purpose user representation learning by training a universal user encoder at large scale. We show that the scaling law holds in the user modeling area, where the training error scales as a power law with the amount of compute. Our Contrastive Learning User Encoder (CLUE) optimizes task-agnostic objectives, and the resulting user embeddings stretch our expectations of what can be done in various downstream tasks. CLUE also shows strong transferability to other domains and systems: performance in online experiments shows significant improvement in online click-through rate (CTR). Furthermore, we investigate how performance changes with the scaling factors, i.e., model capacity, sequence length, and batch size. Finally, we discuss the broader impact of CLUE.
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
The core of self-supervised learning for pre-training language models includes pre-training task design as well as appropriate data augmentation. Most data augmentations used for language models are context-independent. A pioneering contextualized augmentation was recently proposed in ELECTRA, which achieves state-of-the-art performance by introducing an auxiliary generation network (generator) to produce contextualized data augmentation for training the main discrimination network (discriminator). However, this design introduces the extra computational cost of the generator and requires adjusting the relative capability between the generator and the discriminator. In this paper, we propose a self-augmentation strategy (SAS) in which a single network is used both for regular pre-training and for contextualized data augmentation in later training epochs. Essentially, this strategy eliminates the separate generator and uses a single network to jointly perform the two pre-training tasks with MLM (masked language modeling) and RTD (replaced token detection) heads. It avoids the challenge of finding an appropriately sized generator, which is critical to performance as demonstrated in ELECTRA and its subsequent variants. Moreover, SAS is a general strategy that can be seamlessly combined with many recent or future techniques, such as the disentangled attention mechanism of DeBERTa. Our experiments show that SAS outperforms ELECTRA and other state-of-the-art models on the GLUE tasks with similar or less computational cost.
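A toy sketch of the single-network idea: one encoder carries both an MLM head and an RTD head, and the model's own MLM predictions at masked slots can provide the contextualized replacements for later RTD training. Module sizes, the sampling scheme, and the omission of positional encodings are simplifications for illustration, not the SAS implementation.

```python
import torch
import torch.nn as nn

class SharedEncoderMLMRTD(nn.Module):
    """One encoder with two heads: MLM predicts masked tokens, RTD flags replaced tokens."""
    def __init__(self, vocab_size, hidden=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)   # masked language modeling head
        self.rtd_head = nn.Linear(hidden, 1)            # replaced token detection head

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))
        return self.mlm_head(h), self.rtd_head(h).squeeze(-1)

def self_augment(model, masked_ids, mask_positions):
    """Sample the model's own MLM predictions at masked positions to build an RTD input."""
    with torch.no_grad():
        mlm_logits, _ = model(masked_ids)
        sampled = torch.distributions.Categorical(logits=mlm_logits).sample()
    corrupted = masked_ids.clone()
    corrupted[mask_positions] = sampled[mask_positions]
    return corrupted
```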
Learning embeddings of user sequential behavior is highly complex and challenging because of complicated feature interactions over time and the high dimensionality of user features. Recent emerging foundation models, such as BERT and its variants, have encouraged many researchers to investigate this field. However, unlike natural language processing (NLP) tasks, the parameters of user behavior models come mostly from the user embedding layer, which makes most existing work fail at training a large-scale universal user embedding. Furthermore, user representations are learned from multiple downstream tasks, and past research does not address the seesaw phenomenon. In this paper, we propose SuperMoE, a generic framework designed to obtain high-quality user representations from multiple tasks. Specifically, user behavior sequences are encoded by an MoE Transformer, so the model capacity can be increased to billions or even trillions of parameters. To handle the seesaw phenomenon when learning across multiple tasks, we design a new loss function with task indicators. We conduct extensive offline experiments on public datasets and online experiments in private real-world business scenarios. Our approach achieves the best performance over state-of-the-art models, and the results demonstrate the effectiveness of our framework.
User representation is essential for providing high-quality commercial services in industry. Universal user representation has attracted much interest recently, since it frees us from the cumbersome work of training a dedicated model for each downstream application. In this paper, we attempt to improve universal user representation from two perspectives. First, a contrastive self-supervised learning paradigm is presented to guide the training of the representation model. It provides a unified framework that allows long-term or short-term interest representation learning in a data-driven manner. In addition, a novel multi-interest extraction module is proposed. The module introduces an interest dictionary to capture the principal interests of a given user, and then generates his/her interest-oriented representations via behavior aggregation. Experimental results demonstrate the effectiveness and applicability of the learned user representations.
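A minimal sketch of how an interest dictionary could aggregate a behavior sequence into several interest-oriented vectors via attention; the attention form, dimensions, and class name are assumptions for illustration rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class MultiInterestExtractor(nn.Module):
    """Aggregate a behavior sequence into K interest-oriented representations by letting
    each learnable 'interest dictionary' entry attend over the sequence."""
    def __init__(self, dim, num_interests=4):
        super().__init__()
        self.interest_dict = nn.Parameter(torch.randn(num_interests, dim) * 0.02)

    def forward(self, behavior_seq, pad_mask=None):
        # behavior_seq: (B, L, D); pad_mask: (B, L) with True at padded positions
        scores = torch.einsum("kd,bld->bkl", self.interest_dict, behavior_seq)
        if pad_mask is not None:
            scores = scores.masked_fill(pad_mask.unsqueeze(1), float("-inf"))
        attn = torch.softmax(scores, dim=-1)                       # (B, K, L)
        return torch.einsum("bkl,bld->bkd", attn, behavior_seq)    # (B, K, D)
```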
A large number of empirical studies on applying self-attention models in the domain of recommender systems are based on offline evaluation and metrics computed on standardized datasets, without insights on how these models perform in real life scenarios. Moreover, many of them do not consider information such as item and customer metadata, although deep-learning recommenders live up to their full potential only when numerous features of heterogeneous types are included. Also, typically recommendation models are designed to serve well only a single use case, which increases modeling complexity and maintenance costs, and may lead to inconsistent customer experience. In this work, we present a reusable Attention-based Fashion Recommendation Algorithm (AFRA), that utilizes various interaction types with different fashion entities such as items (e.g., shirt), outfits and influencers, and their heterogeneous features. Moreover, we leverage temporal and contextual information to address both short and long-term customer preferences. We show its effectiveness on outfit recommendation use cases, in particular: 1) personalized ranked feed; 2) outfit recommendations by style; 3) similar item recommendation and 4) in-session recommendations inspired by most recent customer actions. We present both offline and online experimental results demonstrating substantial improvements in customer retention and engagement.
User embeddings (vectorized representations of users) are essential for recommendation systems. Many approaches have been proposed to construct representations for users in order to find similar items for retrieval tasks, and they have also proven effective in industrial recommendation systems. Recently, people have discovered the power of representing a user with multiple embeddings, hoping that each embedding captures the user's interest in a certain topic. With multi-interest representation, it is important to model the user's preference over different topics and how that preference changes over time. However, existing methods either fail to estimate the user's affinity to each interest, or unreasonably assume that every interest of every user fades over time, thereby hurting the recall of candidate retrieval. In this paper, we propose the Multi-Interest Preference (MIP) model, an approach that not only produces multiple interests for a user by using the user's sequential engagements more effectively, but also allows candidates to be retrieved from each interest proportionally. Extensive experiments have been conducted on various industrial-scale datasets to demonstrate the effectiveness of our approach.
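One way to read "retrieved from each interest proportionally" is to give each interest a retrieval quota proportional to its preference weight. The sketch below illustrates that idea with a brute-force dot-product search; the names and the quota rule are assumptions, not the MIP implementation.

```python
import torch

def proportional_retrieval(interest_vecs, interest_weights, item_bank, total_k=100):
    """Retrieve nearest items per interest, with each interest's quota
    proportional to the user's affinity for that interest.

    interest_vecs:    (K, D) one vector per interest
    interest_weights: (K,)   non-negative preference weights
    item_bank:        (N, D) candidate item embeddings
    """
    quotas = (interest_weights / interest_weights.sum() * total_k).round().long()
    candidates = []
    for vec, k in zip(interest_vecs, quotas):
        if k == 0:
            continue
        scores = item_bank @ vec                     # (N,) dot-product relevance
        candidates.append(scores.topk(int(k)).indices)
    return torch.unique(torch.cat(candidates))       # de-duplicated candidate set
```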
Cross-domain recommendation can help alleviate the data sparsity problem in conventional sequential recommender systems. In this paper, we propose the RecGURU algorithmic framework to generate a generalized user representation (GUR) that incorporates user information across domains in sequential recommendation, even when there are minimal or no common users between the two domains. We propose a self-attentive autoencoder to derive latent user representations, and a domain discriminator that aims to predict the origin domain of a generated latent representation. We propose a novel adversarial learning method to train the two modules so that user embeddings generated from different domains are unified into a single global GUR for each user. The learned GUR captures a user's overall preferences and characteristics, and can therefore be used to augment behavior data and improve recommendations in any single domain in which the user is involved. Extensive experiments are conducted on two public cross-domain recommendation datasets as well as a large dataset collected from real-world applications. The results show that RecGURU improves performance and outperforms various state-of-the-art sequential recommendation and cross-domain recommendation methods. The collected data will be released to facilitate future research.
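The adversarial part of such a setup can be sketched as a two-step update: the domain discriminator learns to tell which domain a latent user representation came from, while the encoder is updated to confuse it. The autoencoder's reconstruction terms are omitted, and the optimizers and label-flipping trick here are illustrative assumptions rather than the RecGURU training code.

```python
import torch
import torch.nn.functional as F

def adversarial_step(encoder, discriminator, enc_opt, disc_opt, seq_a, seq_b):
    """One simplified adversarial update over a batch from domain A and a batch from domain B."""
    rep_a, rep_b = encoder(seq_a), encoder(seq_b)           # latent user representations
    reps = torch.cat([rep_a, rep_b], dim=0)
    domains = torch.cat([torch.zeros(len(rep_a)), torch.ones(len(rep_b))]).long()

    # 1) Discriminator update: classify which domain each representation came from.
    disc_loss = F.cross_entropy(discriminator(reps.detach()), domains)
    disc_opt.zero_grad(); disc_loss.backward(); disc_opt.step()

    # 2) Encoder update: maximize the discriminator's confusion via flipped labels.
    adv_loss = F.cross_entropy(discriminator(reps), 1 - domains)
    enc_opt.zero_grad(); adv_loss.backward(); enc_opt.step()
    return disc_loss.item(), adv_loss.item()
```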
Currently, the most widespread neural network architecture for training language models is the so-called BERT, which has led to improvements on various natural language processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results obtained on these NLP tasks. Unfortunately, memory consumption and training duration increase drastically with the size of these models. In this paper, we investigate various training techniques for smaller BERT models: we combine different methods from other BERT variants such as ALBERT, RoBERTa, and relative positional encoding. In addition, we propose two new fine-tuning modifications that lead to better performance: Class-Start-End tagging and a modified form of linear-chain conditional random fields. Furthermore, we introduce whole-word attention, which reduces BERT's memory usage and leads to a small increase in performance compared to classical multi-head attention. We evaluate these techniques on five public German named entity recognition (NER) tasks, two of which are introduced by this paper.
Over the past years, fashion-related challenges have gained a lot of attention in the research community. Outfit generation and recommendation, i.e., the composition of a set of items of different types (e.g., tops, bottom, shoes, accessories) that go well together, are among the most challenging ones. That is because items have to be both compatible amongst each other and also personalized to match the taste of the customer. Recently there has been a plethora of work targeted at tackling these problems by adopting various techniques and algorithms from the machine learning literature. However, to date, there is no extensive comparison of the performance of the different algorithms for outfit generation and recommendation. In this paper, we close this gap by providing a broad evaluation and comparison of various algorithms, including both personalized and non-personalized approaches, using online, real-world user data from one of Europe's largest fashion stores. We present the adaptations we made to some of those models to make them suitable for personalized outfit generation. Moreover, we provide insights for models that have not yet been evaluated on this task, specifically, GPT, BERT and Seq-to-Seq LSTM.
In recent years, BERT has shown clear advantages and great potential on natural language processing tasks. However, training and applying BERT require computing contextualized language representations, which is time- and resource-intensive and hinders its universality and applicability. To overcome this bottleneck, we propose a deep bidirectional language model with a window masking mechanism at the attention layer. This work computes contextualized language representations without the random masking used in BERT, while keeping a deep bidirectional architecture similar to BERT's. To compute the same sentence representation, our method shows O(n) complexity, lower than that of other Transformer-based models with O(n^2). To further demonstrate its superiority, contextualized language representations are computed in a CPU environment; using the embeddings from the proposed method for short-message classification, logistic regression shows higher accuracy. Moreover, the proposed method also achieves significantly better performance on semantic similarity tasks.
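The O(n) claim follows from restricting attention to a fixed-size window: each position attends to at most 2w + 1 neighbors, so cost grows linearly in sequence length for a fixed window w. Below is a minimal sketch of such a window mask; the function name and window size are illustrative assumptions, not the paper's code.

```python
import torch

def window_attention_mask(seq_len, window):
    """Boolean mask where True marks positions a token may attend to:
    position i sees only positions within `window` steps on either side."""
    idx = torch.arange(seq_len)
    dist = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs()   # (L, L) pairwise distances
    return dist <= window

mask = window_attention_mask(seq_len=8, window=2)
# With a fixed window w, each row has at most 2*w + 1 True entries,
# so attention cost grows as O(L * w) instead of O(L^2).
print(mask.int())
```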
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Fitting complex patterns in the training data, such as reasoning and argumentation, is a key challenge for language pre-training. According to recent studies and our empirical observations, one possible reason is that some easy-to-fit patterns in the training data, such as frequently co-occurring word combinations, dominate and harm pre-training, making it hard for the model to fit more complex information. We argue that mis-predictions can help locate such dominating patterns that harm language understanding. When a mis-prediction occurs, there should be patterns fitted by the model that frequently co-occur with the mis-predicted word and lead to the mis-prediction. If we add regularization so that, when a mis-prediction occurs, the model relies less on such dominating patterns and pays more attention to the more subtle ones, more information can be fitted efficiently during pre-training. Following this motivation, we propose a new language pre-training method, Mis-Predictions as Harm Alerts (MPA). In MPA, when a mis-prediction occurs during pre-training, we use its co-occurrence information to guide several heads of the self-attention modules. Some self-attention heads in the Transformer modules are optimized to assign lower attention weights to the words in the input sentence that frequently co-occur with the mis-prediction, while assigning higher weights to the other words. By doing so, the Transformer model is trained to rely less on the dominating frequently co-occurring patterns and to pay more attention to the remaining, more complex information when mis-predictions occur. Our experiments show that MPA speeds up the pre-training of BERT and ELECTRA and improves their performance on downstream tasks.
Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa
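Below is a single-head, simplified sketch of a disentangled attention score that sums content-to-content, content-to-position, and position-to-content terms with a shared relative-position embedding. It glosses over DeBERTa's exact relative-distance indexing and multi-head handling, so treat it as an illustration of the idea rather than the released implementation.

```python
import torch
import torch.nn as nn

class DisentangledScores(nn.Module):
    """Simplified single-head disentangled attention scores:
    content-to-content + content-to-position + position-to-content."""
    def __init__(self, dim, max_rel_dist=16):
        super().__init__()
        self.q_c = nn.Linear(dim, dim)
        self.k_c = nn.Linear(dim, dim)
        self.q_r = nn.Linear(dim, dim)   # relative-position vectors used as queries
        self.k_r = nn.Linear(dim, dim)   # relative-position vectors used as keys
        # One embedding per clipped relative distance in [-max_rel_dist, max_rel_dist].
        self.rel_emb = nn.Embedding(2 * max_rel_dist + 1, dim)
        self.max_rel_dist = max_rel_dist

    def forward(self, h):                                    # h: (B, L, D) content states
        L = h.size(1)
        idx = torch.arange(L, device=h.device)
        rel = (idx.unsqueeze(0) - idx.unsqueeze(1)).clamp(-self.max_rel_dist, self.max_rel_dist)
        r = self.rel_emb(rel + self.max_rel_dist)            # (L, L, D) relative-position vectors

        qc, kc = self.q_c(h), self.k_c(h)
        c2c = torch.einsum("bid,bjd->bij", qc, kc)           # content queries vs. content keys
        c2p = torch.einsum("bid,ijd->bij", qc, self.k_r(r))  # content queries vs. position keys
        p2c = torch.einsum("bjd,ijd->bij", kc, self.q_r(r))  # position queries vs. content keys
        return (c2c + c2p + p2c) / (3 * h.size(-1)) ** 0.5   # scaled by sqrt(3d) as in the paper
```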
Large-scale vision-and-language (V+L) pre-training has proven effective in boosting downstream V+L tasks. However, when it comes to the fashion domain, existing V+L methods are inadequate because they overlook the unique characteristics of fashion V+L data and downstream tasks. In this work, we propose a novel fashion-focused V+L representation learning framework, dubbed FashionViL. It contains two novel fashion-specific pre-training tasks designed to exploit two intrinsic properties of fashion V+L data. First, in contrast to other domains where a datum contains only a single image-text pair, the fashion domain may have multiple images. We therefore propose a multi-view contrastive learning task to pull the visual representation of one image close to the compositional multimodal representation of another image plus text. Second, fashion text (e.g., product descriptions) often contains rich fine-grained concepts (attributes/noun phrases). To exploit this, a pseudo-attribute classification task is introduced to encourage the learned unimodal (visual/textual) representations of the same concept to be close together. Furthermore, fashion V+L tasks uniquely include ones that do not fit the common one-stream or two-stream architectures (e.g., text-guided image retrieval). We therefore propose a flexible, versatile V+L model architecture built on a modality-agnostic Transformer so that it can be flexibly adapted to any downstream task. Extensive experiments show that our FashionViL achieves a new state of the art on five downstream tasks. Code is available at https://github.com/brandonhanx/mmf.
Increasing research interest focuses on sequential recommender systems, aiming to model dynamic sequence representations precisely. However, the most commonly used loss functions in state-of-the-art sequential recommendation models have essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers from the vanishing gradient problem caused by numerous negative samples and prediction biases; Binary Cross-Entropy (BCE) loss is sensitive to the number of negative samples, and is therefore likely to ignore valuable negative examples and reduce training efficiency; Cross-Entropy (CE) loss only focuses on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representations. To avoid these limitations, in this paper we propose to calculate a Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, and enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models, GRU4Rec, SASRec, and S3-Rec, yields average improvements of 125.63%, 69.90%, and 33.24% in full-ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data rises rapidly with wall-clock time and is superior to that of other loss functions throughout almost the entire training process.
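A minimal sketch of a cumulative cross-entropy computed over every timestamp of a training sequence rather than only the last one, assuming full-vocabulary logits at each position; the function name and padding convention are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def cumulative_cross_entropy(logits, targets, pad_id=0):
    """Cross-entropy summed over every timestamp of the sequence, not just the last.

    logits:  (B, L, V) scores over the item vocabulary at each position
    targets: (B, L)    the next item at each position (pad_id where undefined)
    """
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (B*L, V)
        targets.reshape(-1),                   # (B*L,)
        ignore_index=pad_id,
        reduction="sum",
    )
    n_valid = (targets != pad_id).sum().clamp(min=1)
    return loss / n_valid                      # mean over valid timestamps
```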
Knowledge-enhanced pre-trained models for language representation have been shown to be more effective than language models such as BERT in knowledge base construction tasks (i.e., relation extraction). These knowledge-enhanced language models incorporate knowledge into pre-training to generate representations of entities or relations. However, existing methods typically represent each entity with a separate embedding. As a result, these methods struggle to represent out-of-vocabulary entities, a large number of parameters must be used on top of their underlying token models (i.e., the Transformer), and the number of entities that can be handled is limited in practice due to memory constraints. Moreover, existing models still struggle to represent entities and relations at the same time. To address these problems, we propose a new pre-trained model that learns representations of entities and relations from token spans and span pairs in the text, respectively. By encoding spans efficiently with span modules, our model can represent both entities and their relations while requiring fewer parameters than existing models. We pre-trained our model with a knowledge graph extracted from Wikipedia and tested it on a broad range of supervised and unsupervised information extraction tasks. The results show that our model learns better representations of both entities and relations than the baselines, and in supervised settings, fine-tuning our model consistently outperforms RoBERTa and achieves competitive results on information extraction tasks.
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
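A toy sketch of the combined objective: the generator is trained with MLM on the masked positions, its samples corrupt the input, and the discriminator is trained to flag every token as original or replaced. The generator/discriminator call signatures and the loss weight are assumptions chosen to mirror the description above, not the ELECTRA code.

```python
import torch
import torch.nn.functional as F

def electra_style_loss(generator, discriminator, original_ids, masked_ids, mask_positions,
                       rtd_weight=50.0):
    """Replaced token detection: the generator fills masked slots, the discriminator
    labels every token of the corrupted sequence as original (0) or replaced (1)."""
    gen_logits = generator(masked_ids)                          # (B, L, V)
    mlm_loss = F.cross_entropy(gen_logits[mask_positions],      # MLM loss only at masked slots
                               original_ids[mask_positions])

    with torch.no_grad():                                       # sample replacements; no gradient through sampling
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask_positions, sampled, original_ids)
    is_replaced = (corrupted != original_ids).float()           # a lucky correct sample counts as "original"

    disc_logits = discriminator(corrupted)                      # (B, L): one logit per token
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
    return mlm_loss + rtd_weight * disc_loss                    # weighting of the RTD term is illustrative
```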