Smart Paper Notes

Analyzing and Mitigating Interference in Neural Architecture Search

Jin Xu , Xu Tan , Kaitao Song , Renqian Luo , Yichong Leng , Tao Qin , Tie-Yan Liu , Jian Li

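Categories: Natural Language Processing | Machine Learning

2021-08-29

Weight sharing is a popular approach to reducing the cost of neural architecture search (NAS) by reusing the weights of shared operators from previously trained child models. However, the rank correlation between the estimated accuracy and the ground-truth accuracy of those child models is low because of the interference among different child models caused by weight sharing. In this paper, we investigate the interference issue by sampling different child models and calculating the gradient similarity of shared operators, and observe that: 1) the interference on a shared operator between two child models is positively correlated with the number of operators that differ between them; 2) the interference is smaller when the inputs and outputs of the shared operator are more similar. Inspired by these two observations, we propose two approaches to mitigate the interference: 1) MAGIC-T: rather than randomly sampling child models for optimization, we propose a gradual modification scheme that modifies one operator between adjacent optimization steps to minimize the interference on the shared operators; 2) MAGIC-A: forcing the inputs and outputs of the operator to be similar across all child models to reduce the interference. Experiments on a BERT search space verify that mitigating interference via each of our proposed methods improves the rank correlation of the super-net, and that combining both methods achieves better results. Our discovered architecture outperforms RoBERTa$_{\rm base}$ by 1.1 and 0.6 points and ELECTRA$_{\rm base}$ by 1.6 and 1.1 points on the dev and test sets of the GLUE benchmark. Extensive results on BERT compression, reading comprehension, and ImageNet tasks demonstrate the effectiveness and generality of our proposed methods.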
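The two ingredients named in this abstract, gradient similarity on a shared operator and sampling child models that differ by one operator per step, can be illustrated with a minimal sketch. This is not the authors' implementation: the operator names, the six-layer architecture, and the helper names `cosine_similarity` and `gradual_modification` are assumptions made for illustration only.

```python
import math
import random

def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)

def gradual_modification(arch, candidate_ops, rng):
    """Return a new child model that differs from `arch` in exactly one layer."""
    layer = rng.randrange(len(arch))
    alternatives = [op for op in candidate_ops if op != arch[layer]]
    child = list(arch)
    child[layer] = rng.choice(alternatives)
    return child

rng = random.Random(0)
candidate_ops = ["attention", "ffn", "conv3", "conv5"]  # hypothetical operator set
arch = [rng.choice(candidate_ops) for _ in range(6)]    # hypothetical 6-layer child model
trajectory = [arch]
for _ in range(5):
    # Each optimization step trains a child one operator away from the previous one.
    trajectory.append(gradual_modification(trajectory[-1], candidate_ops, rng))
```

In a real super-net the two vectors passed to `cosine_similarity` would be the gradients that two different child models produce on the same shared operator; a value near 1 would indicate little interference.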

Translated by Google Translate

Learning to Rank Ace Neural Architectures via Normalized Discounted Cumulative Gain

Yuge Zhang , Quanlu Zhang , Li Lyna Zhang , Yaming Yang , Chenqian Yan , Xiaotian Gao , Yuqing Yang

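Categories: Computer Vision | Artificial Intelligence

2021-08-06

One of the key challenges in neural architecture search (NAS) is to efficiently rank the performance of architectures. The mainstream assessment of performance rankers uses ranking correlations (e.g., Kendall's tau), which pay equal attention to the whole space. However, the optimization goal of NAS is to identify the top architectures while paying less attention to the other architectures in the search space. In this paper, we show both empirically and theoretically that Normalized Discounted Cumulative Gain (NDCG) is a better metric for rankers. Subsequently, we propose a new algorithm, AceNAS, which directly optimizes NDCG with LambdaRank. It also leverages weak labels produced by weight-sharing NAS to pre-train the ranker, so as to further reduce the search cost. Extensive experiments on 12 NAS benchmarks and a large-scale search space demonstrate that our approach consistently outperforms SOTA NAS methods, with up to a 3.67% accuracy improvement and an 8x reduction in search cost.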
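The contrast between Kendall's tau and NDCG can be made concrete with a small sketch. The accuracies and ranker outputs below are made-up numbers, and the linear gain and `log2` discount are one common NDCG convention, assumed here rather than taken from the paper.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevances listed in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(true_scores, predicted_scores, k=None):
    """NDCG@k of the ranking induced by predicted_scores, graded by true_scores."""
    order = sorted(range(len(true_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_scores[i] for i in order][:k]
    ideal = sorted(true_scores, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    n = len(a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            score += (prod > 0) - (prod < 0)
    return score / (n * (n - 1) / 2)

true_acc = [0.95, 0.90, 0.70, 0.60, 0.50]   # hypothetical ground-truth accuracies
ranker_a = [0.80, 0.90, 0.70, 0.60, 0.50]   # swaps the two best architectures
ranker_b = [0.95, 0.90, 0.70, 0.50, 0.60]   # swaps the two worst architectures
```

Both rankers make exactly one pairwise mistake, so `kendall_tau` scores them identically, while `ndcg(..., k=1)` penalizes only `ranker_a`, the one that misplaces the best architecture.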


Design Automation for Fast, Lightweight, and Effective Deep Learning Models: A Survey

Dalin Zhang , Kaixuan Chen , Yan Zhao , Bin Yang , Lina Yao , Christian S. Jensen

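Categories: Machine Learning | Artificial Intelligence

2022-08-22

Deep learning technologies have demonstrated remarkable effectiveness in a wide range of tasks, and deep learning holds the potential to advance a multitude of applications, including in edge computing, where deep models are deployed on edge devices to enable instant data processing and response. A key challenge is that while the application of deep models often incurs substantial memory and computational costs, edge devices typically offer only very limited storage and computational capabilities, which may vary substantially across devices. These characteristics make it difficult to build deep learning solutions that unleash the potential of edge devices while complying with their constraints. A promising approach to addressing this challenge is to automate the design of effective deep learning models that are lightweight, require only a small amount of storage, and incur only low computational overhead. This survey offers comprehensive coverage of design automation techniques for deep learning models targeting edge computing. It provides an overview and comparison of the key metrics commonly used to quantify the proficiency of models in terms of effectiveness, lightness, and computational cost. The survey then covers three categories of state-of-the-art deep model design automation techniques: automated neural architecture search, automated model compression, and joint automated design and compression. Finally, the survey covers open issues and directions for future research.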


ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan , Mingda Chen , Sebastian Goodman , Kevin Gimpel , Piyush Sharma , Radu Soricut

Categories:

2019-09-26

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large. The code and the pretrained models are available at https://github.com/google-research/ALBERT.


KNAS: Green Neural Architecture Search

Jingjing Xu , Liang Zhao , Junyang Lin , Rundong Gao , Xu Sun , Hongxia Yang

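Categories: Machine Learning

2021-11-26

Many existing neural architecture search (NAS) solutions rely on downstream training for architecture evaluation, which requires enormous computation. Considering that this computation brings a large carbon footprint, this paper aims to explore a green (i.e., environmentally friendly) NAS solution that evaluates architectures without training. Intuitively, the gradients induced by the architecture itself directly decide the convergence and generalization results. This motivates us to propose the gradient kernel hypothesis: gradients can be used as a coarse-grained proxy for downstream training to evaluate randomly initialized networks. To support the hypothesis, we conduct a theoretical analysis and find a practical gradient kernel that correlates well with training loss and validation performance. Based on this hypothesis, we propose a new kernel-based architecture search approach, KNAS. Experiments show that KNAS achieves competitive results orders of magnitude faster than "train-then-test" paradigms on image classification tasks. Furthermore, its extremely low search cost enables wide application. The searched network also outperforms the strong baseline RoBERTa-large on two text classification tasks. Code is available at https://github.com/jingjing-nlp/knas.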


Mutually-aware Sub-Graphs Differentiable Architecture Search

Haoxian Tan , Sheng Guo , Yujie Zhong , Matthew R. Scott , Weilin Huang

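Categories: Computer Vision

2021-07-09

Differentiable architecture search is prevalent in the field of NAS because of its simplicity and efficiency, and two paradigms dominate: multi-path algorithms and single-path methods. The multi-path framework (e.g., DARTS) is intuitive but suffers from high memory usage and training collapse. Single-path methods (e.g., GDAS and ProxylessNAS) mitigate the memory issue and shrink the gap between search and evaluation, but sacrifice performance. In this paper, we propose a conceptually simple yet efficient method to bridge these two paradigms, referred to as Mutually-aware Sub-Graphs Differentiable Architecture Search (MSG-DAS). The core of our framework is a differentiable Gumbel-TopK sampler that produces multiple mutually exclusive single-path sub-graphs. To alleviate the more severe skip-connect issue brought by the multiple sub-graphs setting, we propose a Dropblock-Identity module to stabilize the optimization. To make the best use of the available models (super-net and sub-graphs), we introduce a memory-efficient super-net guidance distillation to improve training. The proposed framework strikes a balance between flexible memory usage and search quality. We demonstrate the effectiveness of our method on ImageNet and CIFAR10, where the searched models show performance comparable to the most recent approaches.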
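The idea behind Gumbel-TopK sampling can be sketched in a few lines: perturb each candidate's logit with Gumbel noise and keep the k largest, which draws k distinct candidates without replacement. This sketch shows only the discrete sampling step, not the differentiable relaxation the abstract refers to, and the function name `gumbel_topk` and the example logits are assumptions for illustration.

```python
import math
import random

def gumbel_topk(logits, k, rng):
    """Sample k distinct indices by adding Gumbel noise to logits and taking the top k."""
    noisy = []
    for index, logit in enumerate(logits):
        u = max(rng.random(), 1e-12)          # guard against log(0)
        gumbel = -math.log(-math.log(u))      # standard Gumbel(0, 1) noise
        noisy.append((logit + gumbel, index))
    noisy.sort(reverse=True)
    return [index for _, index in noisy[:k]]

rng = random.Random(0)
edge_logits = [1.2, 0.3, -0.5, 0.8]           # hypothetical scores of 4 candidate ops on one edge
picked = gumbel_topk(edge_logits, k=2, rng=rng)
# Assign one sampled operator to each of two sub-graphs; because the indices
# are distinct, the sub-graphs never share an operator on this edge.
subgraph_ops = {f"subgraph_{n}": op for n, op in enumerate(picked)}
```

Because the top-k indices are distinct by construction, the resulting single-path sub-graphs are mutually exclusive on the sampled edge.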


Exploring Complicated Search Spaces with Interleaving-Free Sampling

Yunjie Tian , Lingxi Xie , Jiemin Fang , Jianbin Jiao , Qixiang Ye , Qi Tian

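Categories: Machine Learning | Computer Vision

2021-12-05

Existing neural architecture search algorithms mostly work on search spaces with short-distance connections. We argue that such designs, though safe and stable, prevent search algorithms from exploring more complicated scenarios. In this paper, we build the search algorithm upon a complicated search space with long-distance connections, and show that existing weight-sharing search algorithms mostly fail due to the existence of interleaved connections. Based on this observation, we present a simple yet effective algorithm named IF-NAS, in which we perform a periodic sampling strategy to construct different sub-networks during the search procedure, preventing interleaved connections from emerging in any of them. In the proposed search space, IF-NAS outperforms both random sampling and previous weight-sharing search algorithms by a significant margin. IF-NAS also generalizes to micro cell-based spaces, which are much easier. Our research emphasizes the importance of macro structure, and we look forward to further efforts along this direction.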


FlowNAS: Neural Architecture Search for Optical Flow Estimation

Zhiwei Lin , Tingting Liang , Taihong Xiao , Yongtao Wang , Zhi Tang , Ming-Hsuan Yang

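Categories: Computer Vision

2022-07-04

Existing optical flow estimators usually employ network architectures originally designed for image classification as the encoder to extract per-pixel features. However, due to the natural difference between the tasks, architectures designed for image classification may be sub-optimal for flow estimation. To address this issue, we propose a neural architecture search method named FlowNAS to automatically find a better encoder architecture for the flow estimation task. We first design a suitable search space that includes various convolutional operators and construct a weight-sharing super-network for efficiently evaluating the candidate architectures. Then, to better train the super-network, we propose Feature Alignment Distillation, which uses a well-trained flow estimator to guide the training of the super-network. Finally, a resource-constrained evolutionary algorithm is used to find an optimal architecture (i.e., sub-network). Experimental results show that the discovered architecture, with weights inherited from the super-network, achieves a 4.67% F1-all error on KITTI, an 8.4% reduction relative to the RAFT baseline, surpassing the state-of-the-art handcrafted models GMA and AGFlow while reducing model complexity and latency. The source code and trained models will be released at https://github.com/vdigpku/flownas.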


TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao , Yichun Yin , Lifeng Shang , Xin Jiang , Xiao Chen , Linlin Li , Fang Wang , Qun Liu

Categories:

2019-09-23

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT$_4$ with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT$_{\rm BASE}$ on the GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT$_4$ is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ∼28% of their parameters and ∼31% of their inference time. Moreover, TinyBERT$_6$ with 6 layers performs on par with its teacher BERT$_{\rm BASE}$.


ShiftAddNAS: Hardware-Inspired Search for More Accurate and Efficient Neural Networks

Haoran You , Baopu Li , Huihong Shi , Yonggan Fu , Yingyan Lin

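Categories: Machine Learning | Artificial Intelligence

2022-05-17

Neural networks (NNs) with intensive multiplications (e.g., convolutions and transformers) are capable yet power hungry, which impedes their wider deployment on resource-constrained devices. As such, multiplication-free networks, which follow a common practice in energy-efficient hardware implementation and parameterize NNs with more efficient operators (e.g., bitwise shifts and additions), have gained growing attention. However, multiplication-free networks usually underperform in terms of achieved accuracy. To this end, this work advocates hybrid NNs that combine powerful yet costly multiplications with efficient yet less powerful operators to marry the best of both worlds, and proposes ShiftAddNAS, which can automatically search for more accurate and more efficient NNs. ShiftAddNAS highlights two enablers. Specifically, it integrates (1) the first hybrid search space that incorporates both multiplication-based and multiplication-free operators, facilitating the development of hybrid NNs that are both accurate and efficient; and (2) a novel weight-sharing strategy that enables effective weight sharing among different operators that follow heterogeneous distributions (e.g., Gaussian for convolutions vs. Laplacian for add operators), and that simultaneously leads to a largely reduced supernet size and better searched networks. Extensive experiments and ablation studies on various models, datasets, and tasks consistently validate the efficacy of ShiftAddNAS, e.g., achieving up to +4.7% higher accuracy or a +4.9 better BLEU score compared to state-of-the-art NNs, while providing up to 93% or 69% energy and latency savings, respectively. Code and pretrained models are available at https://github.com/rice-eic/shiftaddnas.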


Language Model Pre-training on True Negatives

Zhuosheng Zhang , Hai Zhao , Masao Utiyama , Eiichiro Sumita

Categories: Natural Language Processing

2022-12-01

Discriminative pre-trained language models (PLMs) learn to predict original texts from intentionally corrupted ones. Taking the former text as positive and the latter as negative samples, the PLM can be trained effectively for contextualized representation. However, the training of such a type of PLMs highly relies on the quality of the automatically constructed samples. Existing PLMs simply treat all corrupted texts as equal negative without any examination, which actually lets the resulting model inevitably suffer from the false negative issue where training is carried out on pseudo-negative data and leads to less efficiency and less robustness in the resulting PLMs. In this work, on the basis of defining the false negative issue in discriminative PLMs that has been ignored for a long time, we design enhanced pre-training methods to counteract false negative predictions and encourage pre-training language models on true negatives by correcting the harmful gradient updates subject to false negative predictions. Experimental results on GLUE and SQuAD benchmarks show that our counter-false-negative pre-training methods indeed bring about better performance together with stronger robustness.


General Cross-Architecture Distillation of Pretrained Language Models into Matrix Embeddings

Lukas Galke , Isabelle Cuber , Christoph Meyer , Henrik Ferdinand Nölscher , Angelina Sonderecker , Ansgar Scherp

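Categories: Natural Language Processing | Machine Learning

2021-09-17

Large pretrained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their sheer size is prohibitive for small laboratories or for deployment on mobile devices. Approaches such as pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a different, more efficient architecture, Continual Multiplication of Words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, per-token representations for a general (task-agnostic) distillation during pretraining, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language inference. Our matrix-based bidirectional CMOW/CBOW-Hybrid model is competitive with DistilBERT on question similarity and recognizing textual entailment, but uses only half the number of parameters and is three times faster in inference. We match or exceed the scores of ELMo on all tasks except the sentiment analysis task SST-2 and the linguistic acceptability task CoLA. However, compared to previous cross-architecture distillation approaches, we demonstrate a doubling of the scores on detecting linguistic acceptability. This shows that matrix-based embeddings can be used to distill large PreLMs into competitive models and motivates further research in this direction.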
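The core CMOW idea, one matrix per word and a sequence encoded as the ordered product of those matrices, can be shown with a toy sketch. The 2x2 embeddings and the helper names below are made up for illustration; real CMOW embeddings are learned and much larger.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]

def cmow_encode(tokens, embeddings):
    """Encode a token sequence as the ordered product of its word matrices."""
    dim = len(next(iter(embeddings.values())))
    # Start from the identity so an empty sequence maps to a neutral encoding.
    result = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for token in tokens:
        result = matmul(result, embeddings[token])
    return result

# Hypothetical 2x2 word matrices, chosen by hand for the example.
embeddings = {
    "dog": [[1.0, 1.0], [0.0, 1.0]],
    "bites": [[1.0, 0.0], [1.0, 1.0]],
}
forward = cmow_encode(["dog", "bites"], embeddings)
backward = cmow_encode(["bites", "dog"], embeddings)
```

Since matrix multiplication is not commutative, `forward` and `backward` differ, so the encoding is sensitive to word order, unlike a bag-of-words sum of vectors.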


Progressive Automatic Design of Search Space for One-Shot Neural Architecture Search

Xin Xia , Xuefeng Xiao , Xing Wang , Min Zheng

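Categories: Computer Vision

2020-05-15

Neural architecture search (NAS) has attracted growing interest. To reduce the search cost, recent work has explored weight sharing across models and made major progress in one-shot NAS. However, it has been observed that a model with higher one-shot accuracy does not necessarily perform better when trained stand-alone. To address this issue, this paper proposes Progressive Automatic Design of the search space, named PAD-NAS. Unlike previous approaches, in which all layers in the supernet share the same operation search space, we formulate a progressive search strategy based on operation pruning and build a layer-wise operation search space. In this way, PAD-NAS can automatically design the operations for each layer and achieve a trade-off between search space quality and model diversity. During the search, we also take hardware platform constraints into consideration for efficient neural network model deployment. Extensive experiments on ImageNet show that our method can achieve state-of-the-art performance.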


Prior-Guided One-shot Neural Architecture Search

Peijie Dong , Xin Niu , Lujun Li , Linzhen Xie , Wenbin Zou , Tian Ye , Zimian Wei , Hengyue Pan

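Categories: Computer Vision

2022-06-27

Neural architecture search methods seek optimal candidates through efficient weight-sharing supernet training. However, recent studies indicate poor ranking consistency between the performance of stand-alone architectures and that of shared-weight networks. In this paper, we present Prior-Guided One-shot NAS (PGONAS) to strengthen the ranking correlation of supernets. Specifically, we first explore the effect of activation functions and propose a balanced sampling strategy based on the Sandwich Rule to alleviate weight coupling in the supernet. Then, FLOPs and Zen-Score are adopted to guide the training of the supernet with a ranking correlation loss. Our PGONAS ranked third in the supernet track of the CVPR 2022 Second Lightweight NAS Challenge. Code is available at https://github.com/pprp/cvpr2022-nas?competition-track1-3th-solution.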


RSBNet: One-Shot Neural Architecture Search for A Backbone Network in Remote Sensing Image Recognition

Cheng Peng , Yangyang Li , Ronghua Shang , Licheng Jiao

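Categories: Artificial Intelligence | Computer Vision

2021-12-07

Recently, a large number of deep learning based approaches have been successfully applied to various remote sensing image (RSI) recognition tasks. However, most existing advances of deep learning methods in the RSI field rely heavily on features extracted by manually designed backbone networks, which severely limits the potential of deep learning models given the complexity of RSI and the limitations of prior knowledge. In this paper, we investigate a new design paradigm for the backbone architecture in RSI recognition tasks, including scene classification, land-cover classification, and object detection. We propose a one-shot architecture search framework based on a weight-sharing strategy and an evolutionary algorithm, called RSBNet, which consists of three stages. First, the supernet constructed in the layer-wise search space is pretrained on a self-assembled large-scale RSI dataset using an ensemble single-path training strategy. Next, the pretrained supernet is equipped with different recognition heads through a switchable recognition module and fine-tuned separately on each target dataset to obtain task-specific supernets. Finally, we search for the optimal backbone architecture for each recognition task with an evolutionary algorithm, without any network training. Extensive experiments on five benchmark datasets for different recognition tasks show the effectiveness of the proposed search paradigm and demonstrate that the searched backbone can flexibly adapt to different RSI recognition tasks and achieve impressive performance.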


HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation Perturbation

Hongyi Yuan , Zheng Yuan , Chuanqi Tan , Fei Huang , Songfang Huang

Categories: Natural Language Processing

2022-12-17

Language models with the Transformer structure have shown great performance in natural language processing. However, problems such as over-fitting and representation collapse still arise when fine-tuning pre-trained language models on downstream tasks. In this work, we propose HyPe, a simple yet effective fine-tuning technique that alleviates such problems by perturbing the hidden representations of Transformer layers. Unlike previous works that only add noise to inputs or parameters, we argue that the hidden representations of Transformer layers convey more diverse and meaningful language information. Therefore, making the Transformer layers more robust to hidden representation perturbations can further benefit the fine-tuning of PLMs as a whole. We conduct extensive experiments and analyses on GLUE and other natural language inference datasets. Results demonstrate that HyPe outperforms vanilla fine-tuning and enhances the generalization of hidden representations from different layers. In addition, HyPe incurs negligible computational overhead, and is better than and compatible with previous state-of-the-art fine-tuning techniques.


Differentiable Architecture Search with Random Features

Xuanyang Zhang , Yonggang Li , Xiangyu Zhang , Yongtao Wang , Jian Sun

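Categories: Computer Vision

2022-08-18

Differentiable architecture search (DARTS) has significantly promoted the development of NAS techniques because of its high search efficiency, but it suffers from performance collapse. In this paper, we make efforts to alleviate the performance collapse problem of DARTS from two aspects. First, we investigate the expressive power of the supernet in DARTS and derive a new setup of the DARTS paradigm in which only BatchNorm is trained. Second, we show theoretically that random features dilute the auxiliary connection role of skip-connections in supernet optimization and enable the search algorithm to focus on fairer operation selection, thereby solving the performance collapse problem. We instantiate DARTS and PC-DARTS with random features to build an improved version of each, named RF-DARTS and RF-PCDARTS respectively. Experimental results show that RF-DARTS obtains 94.36% test accuracy on CIFAR-10 (the nearest to the optimal result in NAS-Bench-201) and achieves a new state-of-the-art top-1 test error of 24.0% on ImageNet when transferring from CIFAR-10. Moreover, RF-DARTS performs robustly across three datasets (CIFAR-10, CIFAR-100, and SVHN) and four search spaces (S1-S4). In addition, RF-PCDARTS achieves even better results on ImageNet, namely 23.9% top-1 and 7.1% top-5 test error, surpassing representative methods such as the single-path, training-free, and partial-channel paradigms searched directly on ImageNet.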


Searching a High-Performance Feature Extractor for Text Recognition Network

Hui Zhang , Quanming Yao , James T. Kwok , Xiang Bai

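Categories: Computer Vision | Artificial Intelligence

2022-09-27

The feature extractor plays a critical role in text recognition (TR), but customizing its architecture has been relatively little explored because manual tuning is expensive. In this work, inspired by the success of neural architecture search (NAS), we propose to search for suitable feature extractors. We design a domain-specific search space by exploring the principles behind good feature extractors. The space includes a 3D-structured space for the spatial model and a transformer-based space for the sequential model. As the space is huge and complexly structured, no existing NAS algorithm can be applied. We propose a two-stage algorithm to search the space effectively. In the first stage, we cut the space into several blocks and progressively train each block with the help of an auxiliary head. In the second stage, we introduce a latency constraint and search for a sub-network from the trained supernet via natural gradient descent. In experiments, a series of ablation studies are performed to better understand the designed space, the search algorithm, and the searched architectures. We also compare the proposed method with various state-of-the-art ones on both handwritten and scene TR tasks. Extensive results show that our approach achieves better recognition performance with lower latency.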


ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Kevin Clark , Minh-Thang Luong , Quoc V. Le , Christopher D. Manning

Categories:

2020-03-23

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30x more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.
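How replaced token detection builds its training targets can be sketched as follows. This is a simplified illustration, not the ELECTRA code: the uniform random choice stands in for the small generator network, and the vocabulary, replacement probability, and function name `corrupt_and_label` are assumptions.

```python
import random

def corrupt_and_label(tokens, vocab, replace_prob, rng):
    """Replace some tokens and return (corrupted tokens, per-token labels).

    A label is 1 only when the token actually differs from the original;
    a sampled replacement that happens to equal the original counts as 0.
    """
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < replace_prob:
            candidate = rng.choice(vocab)   # stand-in for a generator sample
        else:
            candidate = token
        corrupted.append(candidate)
        labels.append(int(candidate != token))
    return corrupted, labels

rng = random.Random(0)
vocab = ["the", "chef", "cooked", "ate", "meal", "a"]   # hypothetical vocabulary
tokens = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = corrupt_and_label(tokens, vocab, replace_prob=0.3, rng=rng)
```

Every position receives a label, which reflects the abstract's point that the task is defined over all input tokens rather than only the masked subset.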


DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Pengcheng He , Xiaodong Liu , Jianfeng Gao , Weizhu Chen

Categories:

2020-06-05

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa.
