Large pre-trained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches such as pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling pre-trained language models into a more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, with one-shot representations for general (task-agnostic) distillation during pre-training, and with a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language inference. Our matrix-based bidirectional CMOW/CBOW-Hybrid model is competitive with DistilBERT on question similarity and recognizing textual entailment, while using only half the number of parameters and being three times faster at inference. We match or exceed the scores of ELMo, except on the sentiment analysis task SST-2 and the linguistic acceptability task CoLA. Compared to previous cross-architecture distillation approaches, however, we demonstrate a doubling of the score on detecting linguistic acceptability. This shows that matrix-based embeddings can be used to distill large pre-trained language models into competitive models and motivates further research in this direction.
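As a rough illustration of the CMOW idea (not the authors' implementation), the sketch below embeds each word as a small matrix and encodes a sequence by multiplying those matrices in order; the bidirectional extension would also multiply them in reverse order, and the hybrid variant concatenates a CBOW-style vector sum. Dimensions and initialization are assumptions.

```python
import torch

# Hypothetical dimensions: vocabulary of 30k words, 20x20 word matrices.
VOCAB_SIZE, DIM = 30_000, 20

# Each word is embedded as a DIM x DIM matrix (stored flattened).
word_matrices = torch.nn.Embedding(VOCAB_SIZE, DIM * DIM)
# Initialize close to the identity so early matrix products stay well-conditioned.
with torch.no_grad():
    word_matrices.weight.copy_(
        torch.eye(DIM).flatten() + 0.01 * torch.randn(VOCAB_SIZE, DIM * DIM)
    )

def cmow_encode(token_ids: torch.Tensor) -> torch.Tensor:
    """Encode a sequence by continually multiplying its word matrices."""
    mats = word_matrices(token_ids).view(-1, DIM, DIM)   # (seq_len, DIM, DIM)
    out = torch.eye(DIM)
    for m in mats:                                        # order-sensitive product
        out = out @ m
    return out.flatten()                                  # fixed-size sequence embedding

# A bidirectional variant would additionally multiply the matrices in reverse
# order and concatenate the two flattened products.
emb = cmow_encode(torch.tensor([12, 345, 6789]))
print(emb.shape)  # torch.Size([400])
```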
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to efficiently execute them on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the abundant knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT-4 with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT-4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ∼28% of their parameters and ∼31% of their inference time. Moreover, TinyBERT-6 with 6 layers performs on par with its teacher BERT-Base.
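As a hedged sketch (not the TinyBERT implementation), this kind of Transformer distillation can be written as mean-squared-error terms between the student's and the teacher's hidden states and attention matrices, with a learned projection bridging the differing hidden sizes; the shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def transformer_distillation_loss(student_hidden, teacher_hidden,
                                  student_attn, teacher_attn, proj):
    """Layer-to-layer distillation: match hidden states (via a projection)
    and attention distributions with mean-squared error."""
    # student_hidden: (batch, seq, d_s), teacher_hidden: (batch, seq, d_t)
    hidden_loss = F.mse_loss(proj(student_hidden), teacher_hidden)
    # student_attn / teacher_attn: (batch, heads, seq, seq)
    attn_loss = F.mse_loss(student_attn, teacher_attn)
    return hidden_loss + attn_loss

# Illustrative shapes: a student with d_s=312 distilled from a d_t=768 teacher layer.
proj = torch.nn.Linear(312, 768)
s_h, t_h = torch.randn(8, 128, 312), torch.randn(8, 128, 768)
s_a, t_a = torch.rand(8, 12, 128, 128), torch.rand(8, 12, 128, 128)
loss = transformer_distillation_loss(s_h, t_h, s_a, t_a, proj)
```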
Language models pre-trained on biomedical corpora, such as BioBERT, have recently shown promising results on downstream biomedical tasks. On the other hand, many existing pre-trained models are resource-intensive and computationally heavy owing to factors such as the embedding size, hidden dimension, and number of layers. The natural language processing (NLP) community has developed many strategies to compress these models, exploiting techniques such as pruning, quantization, and knowledge distillation, resulting in models that are faster, smaller, and subsequently easier to use. Along the same lines, in this paper we introduce six lightweight models, namely BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT, and CompactBioBERT, obtained through knowledge distillation and masked language modeling (MLM) training on the PubMed dataset. We evaluate all of our models on three biomedical tasks and compare them with BioBERT-v1.1 to create efficient lightweight models that perform on par with their larger counterparts. All the models will be publicly available on our HuggingFace profile at https://huggingface.co/nlpie, and the code used to run the experiments will be available at https://github.com/nlpie-research/compact-biomedical-transformers.
As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models on the edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performance on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pre-training, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
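A minimal sketch of a triple loss of the kind described, assuming temperature-scaled soft targets and equal weighting; this is an illustration, not the released DistilBERT training code.

```python
import torch
import torch.nn.functional as F

def distil_triple_loss(student_logits, teacher_logits,
                       student_hidden, teacher_hidden,
                       labels, temperature=2.0):
    """Combine masked-LM, soft-target distillation, and cosine alignment losses."""
    # Supervised masked-language-modeling loss on the true tokens.
    mlm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                          labels.view(-1), ignore_index=-100)
    # KL divergence between temperature-softened teacher and student distributions.
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  F.softmax(teacher_logits / temperature, dim=-1),
                  reduction="batchmean") * temperature ** 2
    # Cosine loss aligning the directions of student and teacher hidden states
    # (assumes both share the same hidden size).
    target = torch.ones(student_hidden.size(0) * student_hidden.size(1),
                        device=student_hidden.device)
    cos = F.cosine_embedding_loss(student_hidden.flatten(0, 1),
                                  teacher_hidden.flatten(0, 1), target)
    return mlm + kd + cos
```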
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
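To make the "one additional output layer" point concrete, here is a hedged sketch of a fine-tuning setup using the Hugging Face transformers API; the task (three-way natural language inference) and the example sentence pair are illustrative, not from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Pre-trained encoder plus a single task-specific classification layer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, 3)  # e.g. NLI: 3 labels

inputs = tokenizer("A soccer game with multiple males playing.",
                   "Some men are playing a sport.",
                   return_tensors="pt")
with torch.no_grad():
    cls_vec = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
logits = classifier(cls_vec)  # the only layer added on top for fine-tuning
```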
We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiments involve 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language. We also compare various aspects of distillation, including the use of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, using BiLSTM and CNN student models provides the best trade-off between performance and computational resources (CPU, RAM, and storage) compared to pruned BERT models. We further propose some quick wins for producing small NLP models through an efficient KD training mechanism involving simple choices of loss functions, word embeddings, and unlabeled data preparation.
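As a hedged illustration of the kind of student and distillation objective benchmarked here (not the authors' code), the sketch below pairs a small BiLSTM classifier with a soft-label KD loss that also covers unlabeled, augmented sentences; the dimensions and mixing weight are assumptions.

```python
import torch
import torch.nn.functional as F

class BiLSTMStudent(torch.nn.Module):
    """Small student: word embeddings -> BiLSTM -> linear classifier."""
    def __init__(self, vocab_size, num_labels, emb_dim=300, hidden=256):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = torch.nn.LSTM(emb_dim, hidden, batch_first=True,
                                  bidirectional=True)
        self.out = torch.nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):
        h, _ = self.lstm(self.emb(token_ids))
        return self.out(h.mean(dim=1))          # mean-pool over time steps

def kd_loss(student_logits, teacher_probs, labels=None, alpha=0.5):
    """Soft-label distillation; hard-label CE is added only when labels exist
    (unlabeled augmented sentences rely on the teacher's predictions alone)."""
    soft = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                    reduction="batchmean")
    if labels is None:
        return soft
    return alpha * soft + (1 - alpha) * F.cross_entropy(student_logits, labels)
```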
Despite achieving state-of-the-art performance on many NLP tasks, the high energy cost and long inference delay prevent Transformer-based pretrained language models (PLMs) from seeing broader adoption including for edge and mobile computing. Efficient NLP research aims to comprehensively consider computation, time and carbon emission for the entire life-cycle of NLP, including data preparation, model training and inference. In this survey, we focus on the inference stage and review the current state of model compression and acceleration for pretrained language models, including benchmarks, metrics and methodology.
Transformer-based language models are applied to a wide range of applications in natural language processing. However, they are inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to increase the implementation efficiency of large Transformer-based models on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method on three known architectures to create sparse pre-trained BERT-Base, BERT-Large, and DistilBERT. We show how the compressed sparse pre-trained models transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8 bits, we achieve a compression ratio of 40X for the encoder with less than 1% accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
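A minimal sketch of the magnitude-pruning step at the heart of such a recipe, assuming a fixed sparsity pattern that is re-applied during downstream fine-tuning; quantization-aware training to 8 bits would be layered on top and is not shown. This is an illustration, not the authors' training pipeline.

```python
import torch

def magnitude_prune_(linear: torch.nn.Linear, sparsity: float = 0.9) -> torch.Tensor:
    """Zero out the smallest-magnitude weights in place and return the mask,
    so the sparsity pattern can be kept fixed while transferring to downstream tasks."""
    w = linear.weight.data
    k = int(w.numel() * sparsity)
    threshold = w.abs().flatten().kthvalue(k).values
    mask = (w.abs() > threshold).float()
    w.mul_(mask)
    return mask

layer = torch.nn.Linear(768, 768)
mask = magnitude_prune_(layer, sparsity=0.9)
# During subsequent fine-tuning, re-apply the mask after each optimizer step:
#   layer.weight.data.mul_(mask)
print(f"remaining weights: {int(mask.sum())} / {mask.numel()}")
```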
At the core of self-supervised learning for pre-training language models lies the design of pre-training tasks together with appropriate data augmentation. Most data augmentation in language model pre-training is context-independent. A seminal contextualized augmentation was recently proposed in ELECTRA, which achieves state-of-the-art performance by introducing an auxiliary generation network (generator) to produce contextualized data augmentation for training the main discrimination network (discriminator). This design, however, introduces the extra computational cost of the generator and requires adjusting the relative capability of the generator and the discriminator. In this paper we propose a self-augmentation strategy (SAS), in which a single network is used both for regular pre-training and for producing the contextualized data augmentation used in later epochs. Essentially, this strategy eliminates the separate generator and uses a single network to jointly perform two pre-training tasks with MLM (masked language modeling) and RTD (replaced token detection) heads. It avoids the challenge of finding an appropriately sized generator, which proves critical to the performance demonstrated in ELECTRA and its subsequent variants. In addition, SAS is a general strategy that can be seamlessly combined with many new techniques emerging recently or in the future, such as the disentangled attention mechanism of DeBERTa. Our experiments show that SAS is able to outperform ELECTRA and other state-of-the-art models on GLUE tasks with similar or less computational cost.
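A hedged sketch of the single-network, two-head setup described above: one shared encoder feeds both an MLM head and an RTD head (in SAS, the replacement tokens that the RTD head must detect come from the network's own earlier-epoch MLM predictions rather than from a separate generator). The encoder size, loss weight, and vocabulary are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

class SharedEncoderWithTwoHeads(torch.nn.Module):
    """One encoder, two pre-training heads: masked-token prediction (MLM)
    and replaced-token detection (RTD)."""
    def __init__(self, vocab_size=30_522, hidden=256, layers=4, heads=4):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, hidden)
        block = torch.nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(block, layers)
        self.mlm_head = torch.nn.Linear(hidden, vocab_size)  # token distribution
        self.rtd_head = torch.nn.Linear(hidden, 1)           # replaced / original

    def forward(self, token_ids):
        h = self.encoder(self.emb(token_ids))
        return self.mlm_head(h), self.rtd_head(h).squeeze(-1)

def sas_style_loss(mlm_logits, rtd_logits, mlm_labels, rtd_labels, rtd_weight=50.0):
    # MLM loss at masked positions plus a weighted binary RTD loss over all positions.
    mlm = F.cross_entropy(mlm_logits.view(-1, mlm_logits.size(-1)),
                          mlm_labels.view(-1), ignore_index=-100)
    rtd = F.binary_cross_entropy_with_logits(rtd_logits, rtd_labels.float())
    return mlm + rtd_weight * rtd
```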
Distilling state-of-the-art Transformer models into lightweight student models is an effective way to reduce computational cost at inference time. The student models are typically compact Transformers with fewer parameters, while expensive operations such as self-attention persist. Therefore, the improved inference speed may still be unsatisfactory for real-time or high-volume use cases. In this paper, we aim to further push the limit of inference speed by distilling teacher models into bigger, sparser student models - bigger in that they scale up to billions of parameters; sparser in that most of the model parameters are n-gram embeddings. Our experiments on six single-sentence text classification tasks show that these student models retain, on average, 97% of the performance of the RoBERTa-Large teacher, while the inference speedup reaches up to 600x on both GPU and CPU. Further investigation shows that our pipeline is also helpful for sentence-pair classification tasks and in domain generalization settings.
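A hedged sketch of a student in this spirit: nearly all parameters live in a (hashed) n-gram embedding table, and inference reduces to a few embedding look-ups and a linear layer. The bucket count and hashing scheme are simplifying assumptions (the actual students scale the table to billions of parameters).

```python
import torch

NUM_BUCKETS = 1_000_000  # illustrative; real students use far larger tables

class NgramBagStudent(torch.nn.Module):
    """Sparse student: hashed n-gram embedding bag + linear classifier.
    Inference cost is a handful of look-ups and sums, with no self-attention."""
    def __init__(self, num_buckets=NUM_BUCKETS, dim=64, num_labels=2):
        super().__init__()
        self.bag = torch.nn.EmbeddingBag(num_buckets, dim, mode="sum")
        self.out = torch.nn.Linear(dim, num_labels)

    @staticmethod
    def ngram_ids(tokens, n=2, num_buckets=NUM_BUCKETS):
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        return torch.tensor([hash(g) % num_buckets for g in grams + tokens])

    def forward(self, ids):
        offsets = torch.tensor([0])                 # single example
        return self.out(self.bag(ids, offsets))

model = NgramBagStudent()
ids = model.ngram_ids("a quietly moving film".split())
print(model(ids).shape)  # torch.Size([1, 2])
```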
In the past few years, Transformer-based pre-trained language models have achieved astounding success in both industry and academia. However, the large model size and high run-time latency are serious impediments to applying them in practice, especially on mobile phones and Internet-of-Things (IoT) devices. To compress such models, a large body of literature has recently grown up around the topic of knowledge distillation (KD). Nevertheless, how KD works in Transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through this framework, systematic and extensive experiments spending more than 23,000 GPU hours provide a comprehensive analysis from the perspectives of knowledge types, matching strategies, width-depth trade-offs, initialization, model size, and so on, and yield a relatively significant improvement over the previous state of the art (SOTA) in distillation for pre-trained language models. Finally, we provide a best-practice guideline for KD on Transformer-based models.
Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.
Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by recent work on pruning and distilling pre-trained models, we explore strategies for dropping layers in pre-trained models and observe the effect of pruning on downstream GLUE tasks. We are able to prune BERT, RoBERTa, and XLNet models by up to 40% while maintaining up to 98% of their original performance. Additionally, we show that our pruned models are on par with models built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations, such as: (i) the lower layers are most critical to maintaining downstream task performance, (ii) certain tasks such as paraphrase detection and sentence similarity are more robust to layer dropping, and (iii) models trained with different objective functions exhibit different learning patterns and layer-dropping behavior.
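A hedged sketch of the simplest strategy studied, dropping the top encoder layers of a pre-trained model before fine-tuning, using the Hugging Face transformers API; the attribute names follow BERT-style models and the number of kept layers is illustrative.

```python
from transformers import AutoModel

def drop_top_layers(model_name: str = "bert-base-uncased", keep: int = 8):
    """Load a pre-trained encoder and keep only its bottom `keep` layers
    before task-specific fine-tuning."""
    model = AutoModel.from_pretrained(model_name)
    model.encoder.layer = model.encoder.layer[:keep]   # drop the top layers
    model.config.num_hidden_layers = keep
    return model

small = drop_top_layers(keep=8)   # 12 -> 8 encoder layers
```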
Large pre-trained language models are successfully being used in a variety of tasks, across many languages. With this ever-increasing usage, the risk of harmful side effects also rises, for example by reproducing and reinforcing stereotypes. However, detecting and mitigating these harms is typically hard to do when multiple languages are involved or when different biases are considered, and it becomes computationally expensive. To address this, we present FairDistillation: a cross-lingual method based on knowledge distillation to construct smaller language models while controlling for specific biases. We find that our distillation method does not negatively affect downstream performance on most tasks and successfully mitigates stereotypical and representational harms. We demonstrate that FairDistillation can create fairer language models at a considerably lower cost than alternative approaches.
Multilingual language models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pre-training to a large number of languages. Given their success in zero-shot transfer learning, a large body of work has emerged on (i) building larger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero-shot cross-lingual and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions for future research.
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers by training on ImageNet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
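A hedged sketch of the distillation-token idea (not the DeiT code): one extra learned token is appended alongside the class token, its output gets its own head, and that head is supervised by the teacher's predictions; the tiny encoder and the hard-distillation loss below are illustrative simplifications.

```python
import torch
import torch.nn.functional as F

class TinyViTWithDistToken(torch.nn.Module):
    def __init__(self, num_patches=196, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        self.cls_token = torch.nn.Parameter(torch.zeros(1, 1, dim))
        self.dist_token = torch.nn.Parameter(torch.zeros(1, 1, dim))  # extra token
        self.pos = torch.nn.Parameter(torch.zeros(1, num_patches + 2, dim))
        block = torch.nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(block, depth)
        self.head_cls = torch.nn.Linear(dim, num_classes)    # supervised by labels
        self.head_dist = torch.nn.Linear(dim, num_classes)   # supervised by teacher

    def forward(self, patch_embeddings):                     # (B, num_patches, dim)
        b = patch_embeddings.size(0)
        tokens = torch.cat([self.cls_token.expand(b, -1, -1),
                            self.dist_token.expand(b, -1, -1),
                            patch_embeddings], dim=1) + self.pos
        h = self.encoder(tokens)
        return self.head_cls(h[:, 0]), self.head_dist(h[:, 1])

def deit_style_loss(cls_logits, dist_logits, labels, teacher_logits):
    # Hard distillation: the distillation head imitates the teacher's predicted label.
    return (F.cross_entropy(cls_logits, labels)
            + F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))) / 2
```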
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
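A hedged sketch of the replaced-token-detection objective itself: a small generator fills in the masked positions, and the discriminator classifies every token of the corrupted sequence as original or replaced. The generator/discriminator modules and the mask-token id are placeholders, not ELECTRA's actual architecture.

```python
import torch
import torch.nn.functional as F

MASK_ID = 103  # [MASK] token id in the standard BERT vocabulary (illustrative)

def replaced_token_detection_step(token_ids, mask, generator, discriminator):
    """token_ids: (B, T) original ids; mask: (B, T) bool, True at masked positions.
    `generator` returns per-token vocabulary logits, `discriminator` returns a
    per-token replacement logit; both are placeholder callables here."""
    # 1. The small generator proposes plausible tokens at the masked positions.
    gen_logits = generator(token_ids.masked_fill(mask, MASK_ID))
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()  # no grad through sampling
    corrupted = torch.where(mask, sampled, token_ids)

    # 2. The discriminator predicts, for EVERY position, whether it was replaced.
    is_replaced = (corrupted != token_ids).float()
    disc_logits = discriminator(corrupted)                    # (B, T)

    gen_loss = F.cross_entropy(gen_logits[mask], token_ids[mask])
    disc_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
    return gen_loss + 50.0 * disc_loss   # discriminator loss weight, as in the ELECTRA paper
```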
Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher through interchange intervention training (IIT). IIT pushes the student model to become a causal abstraction of the teacher model - a simpler model with the same causal structure. IIT is fully differentiable, easy to implement, and combines flexibly with other objectives. Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
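A hedged, much-simplified sketch of an interchange intervention for distillation: encode a base input and a source input with the student, swap a chosen slice of hidden dimensions, and push the student's output under this intervention toward the teacher's output under the aligned intervention. The student interface (encode/head) and the intervened dimensions are assumptions, not the paper's alignment scheme.

```python
import torch
import torch.nn.functional as F

def interchange_intervention_loss(student, teacher_intervened_logits,
                                  base_ids, source_ids, dims=slice(0, 64)):
    """Swap the selected hidden dimensions of the student's encoding of `base_ids`
    with those from `source_ids`, then match the student's intervened output to the
    teacher's output under the corresponding intervention."""
    h_base = student.encode(base_ids)            # (B, T, d)
    h_source = student.encode(source_ids)
    h_swapped = h_base.clone()
    h_swapped[..., dims] = h_source[..., dims]   # the interchange intervention
    student_logits = student.head(h_swapped)
    return F.kl_div(F.log_softmax(student_logits, dim=-1),
                    F.softmax(teacher_intervened_logits, dim=-1),
                    reduction="batchmean")
```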
Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. These high-performing vision transformers are pre-trained with hundreds of millions of images using a large infrastructure, thereby limiting their adoption. In this work, we produce competitive convolution-free transformers trained on ImageNet only, using a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop) on ImageNet with no external data. We also introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention, typically from a convnet teacher. The learned transformers are competitive (85.2% top-1 acc.) with the state of the art on ImageNet, and similarly when transferred to other tasks. We will share our code and models.
In this paper, we explore the possibility of building a unified foundation model that can be adapted to both vision-only and text-only tasks. Starting from BERT and ViT, we design a unified Transformer consisting of modality-specific tokenizers, a shared Transformer encoder, and task-specific output heads. To efficiently pre-train the proposed model jointly on unpaired images and text, we propose two novel techniques: (i) we employ the separately trained BERT and ViT models as teachers and apply knowledge distillation to provide additional, accurate supervision signals for the joint training; (ii) we propose a novel gradient masking strategy to balance the parameter updates from the image and text pre-training losses. We evaluate the jointly pre-trained Transformer by fine-tuning it on image classification tasks and on natural language understanding tasks, respectively. The experiments show that the resulting unified foundation Transformer works surprisingly well on both the vision-only and text-only tasks, and that the proposed knowledge distillation and gradient masking strategies can effectively lift its performance to approach the level of separately trained models.
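A hedged sketch of one way such a gradient-masking schedule could look: compute both pre-training losses, take their gradients separately, and on alternating steps keep only one modality's gradients for the shared parameters. The simple alternation rule is an illustrative stand-in for the paper's strategy.

```python
import torch

def masked_joint_step(model, optimizer, image_loss, text_loss, step):
    """Backpropagate both pre-training losses, but on alternating steps zero the
    gradients coming from one modality so the shared encoder's updates stay balanced."""
    optimizer.zero_grad()
    image_grads = torch.autograd.grad(image_loss, list(model.parameters()),
                                      retain_graph=True, allow_unused=True)
    text_grads = torch.autograd.grad(text_loss, list(model.parameters()),
                                     allow_unused=True)
    use_image = (step % 2 == 0)        # simple alternating mask
    for p, gi, gt in zip(model.parameters(), image_grads, text_grads):
        g = gi if use_image else gt
        if g is not None:
            p.grad = g.clone()
    optimizer.step()
```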