Smart Paper Notes

Block-Skim: Efficient Question Answering for Transformer

Yue Guan , Zhengyi Li , Jingwen Leng , Zhouhan Lin , Minyi Guo , Yuhao Zhu

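Category: Natural Language Processing

2021-12-16

Transformer models have achieved promising results on natural language processing (NLP) tasks, including extractive question answering (QA). Common Transformer encoders used in NLP tasks process the hidden states of all input tokens in the context paragraph throughout all layers. However, unlike other tasks such as sequence classification, answering the raised question does not necessarily need all the tokens in the context paragraph. Following this motivation, we propose Block-Skim, which learns to skim unnecessary context in higher hidden layers to improve and accelerate Transformer performance. The key idea of Block-Skim is to identify the context that must be further processed and the context that can be safely discarded early during inference. Critically, we find that such information can be sufficiently derived from the self-attention weights inside the Transformer model. We further prune the hidden states corresponding to the unnecessary positions early, in lower layers, achieving significant inference-time speedup. To our surprise, we observe that models pruned in this way outperform their full-size counterparts. Block-Skim improves QA models' accuracy on different datasets and achieves a 3x speedup on the BERT-Base model.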

translated by Google Translate
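
The pruning step described in the Block-Skim abstract above can be illustrated with a small sketch. The abstract only says that the relevance signal can be derived from self-attention weights; the scoring rule used here (mean attention received per block), the function names, and the `keep_ratio` parameter are all assumptions made for illustration, not the authors' method.

```python
from typing import List

Matrix = List[List[float]]


def block_scores(attn: Matrix, block_size: int) -> List[float]:
    """Score each block by the average attention its tokens receive.

    attn[i][j] is the attention weight from query token i to key token j.
    NOTE: this heuristic is an assumption, not the paper's scoring rule.
    """
    n = len(attn)
    # Average attention received by each key position j over all queries.
    received = [sum(attn[i][j] for i in range(n)) / n for j in range(n)]
    scores = []
    for start in range(0, n, block_size):
        block = received[start:start + block_size]
        scores.append(sum(block) / len(block))
    return scores


def keep_positions(attn: Matrix, block_size: int, keep_ratio: float) -> List[int]:
    """Return token positions of the highest-scoring blocks, in original order."""
    scores = block_scores(attn, block_size)
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    kept_blocks = sorted(ranked[:n_keep])
    n = len(attn)
    return [
        pos
        for b in kept_blocks
        for pos in range(b * block_size, min((b + 1) * block_size, n))
    ]


if __name__ == "__main__":
    # Four tokens; every query attends mostly to tokens 2 and 3.
    attn = [[0.1, 0.1, 0.4, 0.4] for _ in range(4)]
    print(keep_positions(attn, block_size=2, keep_ratio=0.5))
```

Hidden states at the positions that are not returned would simply be dropped before the next layer, which is where the inference-time saving would come from in a scheme like this.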

Neural ranking models for document retrieval

Mohamed Trabelsi , Zhiyu Chen , Brian D. Davison , Jeff Heflin

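Category: Machine Learning

2021-02-23

Ranking models are the main components of information retrieval systems. Several approaches to ranking are based on traditional machine learning algorithms using a set of hand-crafted features. Recently, researchers have leveraged deep learning models in information retrieval. These models are trained end-to-end to extract features from the raw data for ranking tasks, so they overcome the limitations of hand-crafted features. A variety of deep learning models have been proposed, each presenting a set of neural network components to extract features that are used for ranking. In this paper, we compare the models proposed in the literature along different dimensions in order to understand the major contributions and limitations of each model. In our discussion of the literature, we analyze the promising neural components and propose future research directions. We also show the analogy between document retrieval and other retrieval tasks in which the items to be ranked are structured documents, answers, images, and videos.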


A Primer in BERTology: What we know about how BERT works

Anna Rogers , Olga Kovaleva , Anna Rumshisky

Category:

2020-02-27

Transformer-based models have pushed state of the art in many areas of NLP, but our understanding of what is behind their success is still limited. This paper is the first survey of over 150 studies of the popular BERT model. We review the current state of knowledge about how BERT works, what kind of information it learns and how it is represented, common modifications to its training objectives and architecture, the overparameterization issue and approaches to compression. We then outline directions for future research.


Pre-trained Transformer-Based Approach for Arabic Question Answering : A Comparative Study

Kholoud Alsubhi , Amani Jamal , Areej Alhothali

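Category: Natural Language Processing

2021-11-10

Question answering (QA) is one of the most challenging problems in natural language processing (NLP). QA systems try to produce answers for given questions. These answers can be generated from unstructured or structured text. Hence, QA is considered an important research area that can be used to evaluate text understanding systems. A large volume of QA research has been devoted to the English language, investigating the most advanced techniques and achieving state-of-the-art results. However, research on Arabic question answering has progressed at a considerably slower pace, owing to the scarcity of research efforts in Arabic QA and the lack of large benchmark datasets. Recently, many pre-trained language models have delivered high performance on many Arabic NLP problems. In this work, we evaluate state-of-the-art pre-trained Transformer models for Arabic QA using four reading comprehension datasets: Arabic-SQuAD, ARCD, AQAD, and TyDiQA-GoldP. We fine-tune and compare the performance of the AraBERTv2-base model, the AraBERTv0.2-large model, and the AraELECTRA model. Finally, we provide an analysis to understand and interpret the low performance obtained by some models.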


Learning from Mistakes: Using Mis-predictions as Harm Alerts in Language Pre-Training

Chen Xing , Wenhao Liu , Caiming Xiong

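Category: Natural Language Processing

2020-12-16

Fitting complex patterns in the training data, such as reasoning and commonsense, is a key challenge for language pre-training. According to recent studies and our empirical observations, one possible reason is that some easy-to-fit patterns in the training data, such as frequently co-occurring word combinations, dominate and harm pre-training, making it hard for the model to fit more complex information. We argue that mis-predictions can help locate such dominating patterns that harm language understanding. When a mis-prediction occurs, there should be patterns that frequently co-occur with the mis-predicted word, fitted by the model, that lead to the mis-prediction. If we can add regularization that trains the model to rely less on such dominating patterns when a mis-prediction occurs and to focus more on the remaining, subtler patterns, more information can be fitted efficiently during pre-training. Following this motivation, we propose a new language pre-training method, Mis-Predictions as Harm Alerts (MPA). In MPA, when a mis-prediction occurs during pre-training, we use its co-occurrence information to guide several heads of the self-attention modules. Some self-attention heads in the Transformer modules are optimized to assign lower attention weights to the words in the input sentence that frequently co-occur with the mis-prediction, while assigning higher weights to the other words. By doing so, the Transformer model is trained to rely less on the dominating patterns that frequently co-occur with mis-predictions and to focus more on the remaining, more complex information when mis-predictions occur. Our experiments show that MPA speeds up the pre-training of BERT and ELECTRA and improves their performance on downstream tasks.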


A Survey on Model Compression and Acceleration for Pretrained Language Models

Canwen Xu , Julian McAuley

Category: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-02-15

Despite achieving state-of-the-art performance on many NLP tasks, the high energy cost and long inference delay prevent Transformer-based pretrained language models (PLMs) from seeing broader adoption including for edge and mobile computing. Efficient NLP research aims to comprehensively consider computation, time and carbon emission for the entire life-cycle of NLP, including data preparation, model training and inference. In this survey, we focus on the inference stage and review the current state of model compression and acceleration for pretrained language models, including benchmarks, metrics and methodology.


TiltedBERT: Resource Adjustable Version of BERT

Sajjad Kachuee , Mohammad Sharifkhani

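Category: Natural Language Processing

2022-01-10

In this paper, we propose a novel adjustable fine-tuning method that improves the training and inference time of the BERT model on downstream tasks. In the proposed method, we first detect the more important word vectors in each layer using our proposed redundancy metric, and then eliminate the less important word vectors with our proposed strategy. In our method, the word vector elimination rate in each layer is controlled by the Tilt-Rate hyperparameter, and the model learns to work with a considerably lower number of floating point operations (FLOPs) than the original BERT-base model. Our proposed method does not need any extra training steps, and it can also be generalized to other Transformer-based models. We perform extensive experiments showing that the word vectors in higher layers have an impressive amount of redundancy that can be eliminated, decreasing training and inference time. Experimental results on extensive sentiment analysis, classification, and regression datasets, as well as benchmarks such as IMDB and GLUE, show that our proposed method is effective across various datasets. By applying our method to the BERT-base model, we reduce inference time by a factor of 5.3, at the cost of a drop in average accuracy. After the fine-tuning stage, the inference time of our model can be adjusted through the offline-tuning property of our method for a wide range of Tilt-Rate values. In addition, we propose a mathematical speedup analysis that can accurately estimate the speedup of our method. With the help of this analysis, the Tilt-Rate hyperparameter can be selected before the fine-tuning or offline-tuning stages.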


Fast and Accurate FSA System Using ELBERT: An Efficient and Lightweight BERT

Siyuan Lu , Chenchen Zhou , Keli Xie , Jun Lin , Zhongfeng Wang

Category: Natural Language Processing

2022-11-16

With the development of deep learning and Transformer-based pre-trained models like BERT, the accuracy of many NLP tasks has been dramatically improved. However, the large number of parameters and computations also pose challenges for their deployment. For instance, using BERT can improve the predictions in the financial sentiment analysis (FSA) task but slow it down, where speed and accuracy are equally important in terms of profits. To address these issues, we first propose an efficient and lightweight BERT (ELBERT) along with a novel confidence-window-based (CWB) early exit mechanism. Based on ELBERT, an innovative method to accelerate text processing on the GPU platform is developed, solving the difficult problem of making the early exit mechanism work more effectively with a large input batch size. Afterward, a fast and high-accuracy FSA system is built. Experimental results show that the proposed CWB early exit mechanism achieves significantly higher accuracy than existing early exit methods on BERT under the same computation cost. By using this acceleration method, our FSA system can boost the processing speed by nearly 40 times to over 1000 texts per second with sufficient accuracy, which is nearly twice as fast as FastBERT, thus providing a more powerful text processing capability for modern trading systems.


Exploring and Exploiting Multi-Granularity Representations for Machine Reading Comprehension

Nuo Chen , Chenyu You

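Category: Natural Language Processing | Artificial Intelligence

2022-08-18

Recently, attention-enhanced multi-layer encoders, such as the Transformer, have been extensively studied in machine reading comprehension (MRC). To predict the answer, it is common practice to employ a predictor that draws information only from the final encoder layer, which generates coarse-grained representations of the source sequences, i.e., the passage and the question. Analysis shows that the representation of the source sequence becomes more coarse-grained as the number of encoding layers increases. It is generally believed that, with the growing number of layers in deep neural networks, the encoding process gathers more and more relevant information for each position, resulting in more coarse-grained representations, which increases the likelihood of similarity to other positions (referred to as homogeneity). This phenomenon misleads the model into making wrong judgments and degrades performance. In this paper, we argue that it would be better if the predictor could exploit representations of different granularity from the encoder, providing different views of the source sequences, so that the expressive power of the model can be fully utilized. To this end, we propose a novel approach called Adaptive Bidirectional Attention-Capsule Network (ABA-Net), which adaptively feeds source representations of different levels to the predictor. Furthermore, since better representations are at the core of boosting MRC performance, the capsule network and the self-attention module are carefully designed as the building blocks of our encoders, providing the capability to explore local and global representations, respectively. Experimental results on three benchmark datasets, i.e., SQuAD 1.0, SQuAD 2.0, and CoQA, demonstrate the effectiveness of our approach. In particular, we set a new state-of-the-art performance on the SQuAD 1.0 dataset.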


Big Bird: Transformers for Longer Sequences

Manzil Zaheer , Guru Guruganesh , Avinava Dubey , Joshua Ainslie , Chris Alberti , Santiago Ontanon , Philip Pham , Anirudh Ravula , Qifan Wang , Li Yang

Category:

2020-07-28

Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BIGBIRD, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BIGBIRD is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having O(1) global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BIGBIRD drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.
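
To make the "quadratic to linear" claim in the BigBird abstract concrete, here is a minimal sketch that builds a sparse attention mask and counts the allowed query-key pairs. The abstract itself only names the global tokens; the sliding-window and random components, as well as the function and parameter names (`sparse_mask`, `window`, `n_global`, `n_random`), are assumptions chosen for illustration and are not taken from the text above.

```python
import random
from typing import List


def sparse_mask(n: int, window: int, n_global: int, n_random: int,
                seed: int = 0) -> List[List[bool]]:
    """Build an n x n boolean mask; mask[i][j] is True if query i may attend to key j."""
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Local component: each token sees `window` neighbours on either side.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            mask[i][j] = True
        # Global component: the first n_global tokens attend to every token
        # and are attended to by every token.
        for g in range(min(n_global, n)):
            mask[i][g] = True
            mask[g][i] = True
        # Random component: a few randomly chosen keys per query.
        for j in rng.sample(range(n), min(n_random, n)):
            mask[i][j] = True
    return mask


def count_allowed(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for n in (64, 128, 256):
        m = sparse_mask(n, window=2, n_global=2, n_random=3)
        print(n, count_allowed(m), n * n)
```

With fixed `window`, `n_global`, and `n_random`, the number of allowed pairs is bounded by `n * (2 * window + 1 + 2 * n_global + n_random)`, so it grows linearly in the sequence length, whereas full attention computes `n * n` pairs.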


SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance

Li Lyna Zhang , Youkow Homma , Yujing Wang , Min Wu , Mao Yang , Ruofei Zhang , Ting Cao , Wei Shen

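Category: Artificial Intelligence | Natural Language Processing

2022-08-30

Ad relevance modeling plays a critical role in online advertising systems, including Microsoft Bing. To leverage powerful Transformers in this low-latency setting, many existing approaches perform ad-side computations offline. While efficient, these approaches are unable to serve cold start ads, resulting in poor relevance predictions for such ads. This work aims to design a new, low-latency BERT via structured pruning to enable real-time online inference for cold start ad relevance on a CPU platform. Our challenge is that previous methods typically prune all layers of the Transformer to a high, uniform sparsity, thereby producing models that cannot achieve satisfactory inference speed with acceptable accuracy. In this paper, we propose SwiftPruner, an efficient framework that leverages evolution-based search to automatically find the best-performing layer-wise sparse BERT model under the desired latency constraint. Unlike existing evolutionary algorithms that perform random mutations, we propose a reinforced mutator with a latency-aware multi-objective reward to perform better mutations, so as to search the large space of layer-wise sparse models efficiently. Extensive experiments demonstrate that our method consistently achieves higher ROC AUC and lower latency than the uniform sparse baseline and state-of-the-art search methods. Notably, under our latency requirement of 1900 µs, SwiftPruner achieves a 0.86% higher AUC than the state-of-the-art uniform sparse baseline for BERT-Mini on a large-scale real-world dataset. Online A/B testing shows that our model also reduces the ratio of defective cold start ads while achieving satisfactory real-time serving latency.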


Luna: Linear Unified Nested Attention

Xuezhe Ma , Xiang Kong , Sinong Wang , Chunting Zhou , Jonathan May , Hao Ma , Luke Zettlemoyer

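Category: Machine Learning | Natural Language Processing

2021-06-03

The quadratic computational and memory complexities of the Transformer's attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. Compared with a more traditional attention mechanism, Luna introduces an additional sequence of fixed length as input and an additional corresponding output, which allows Luna to perform the attention operation linearly while still storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modeling, neural machine translation, and masked language modeling for large-scale pre-training. Competitive or even better experimental results demonstrate both the effectiveness and the efficiency of Luna compared with a variety ...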
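
The pack-then-unpack structure described in the Luna abstract can be sketched as two calls to ordinary scaled dot-product attention. This is a simplified illustration under stated assumptions: it omits the learned projections, multiple heads, and normalization that a real model would use, and the names `attend`, `luna_like_attention`, `x`, and `p` are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention: one output vector per query."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def luna_like_attention(x: Matrix, p: Matrix) -> Tuple[Matrix, Matrix]:
    """Pack n input vectors into len(p) vectors, then unpack back to n vectors.

    x: input sequence of length n.
    p: the additional fixed-length sequence of length l (an assumption here:
       it is simply passed in; how it is obtained is not covered).
    """
    packed = attend(p, x, x)              # l queries over n keys: cost ~ l * n
    unpacked = attend(x, packed, packed)  # n queries over l keys: cost ~ n * l
    return unpacked, packed


if __name__ == "__main__":
    x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
    p = [[1.0, 0.0], [0.0, 1.0]]
    unpacked, packed = luna_like_attention(x, p)
    print(len(unpacked), len(packed))
```

Because `l` is fixed, both steps cost on the order of `n * l` score computations, which is linear in the input length `n`; the `packed` result plays the role of the additional output mentioned in the abstract.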


BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin , Ming-Wei Chang , Kenton Lee , Kristina Toutanova

Category:

2018-10-11

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7 percentage point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).


TinyBERT: Distilling BERT for Natural Language Understanding

Xiaoqi Jiao , Yichun Yin , Lifeng Shang , Xin Jiang , Xiao Chen , Linlin Li , Fang Wang , Qun Liu

Category:

2019-09-23

Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the rich knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT4 with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT-Base on the GLUE benchmark, while being 7.5x smaller and 9.4x faster on inference. TinyBERT4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ∼28% of their parameters and ∼31% of their inference time. Moreover, TinyBERT6 with 6 layers performs on par with its teacher BERT-Base.


Embedding Recycling for Language Models

Jon Saad-Falcon , Amanpreet Singh , Luca Soldaini , Mike D'Arcy , Arman Cohan , Doug Downey

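Category: Natural Language Processing

2022-07-11

Training and inference with large neural models are expensive. However, in many application domains, while new tasks and models arise frequently, the underlying documents being modeled remain mostly unchanged. We study how to decrease computational cost in such settings through embedding recycling (ER): re-using activations from previous model runs when performing training or inference. In contrast to prior work, which focuses on freezing the model and fine-tuning small classification heads and often leads to notable drops in performance, we propose caching the output of an intermediate layer from a pre-trained model and fine-tuning the remaining layers for new tasks. We show that our method provides a 100% speedup during training and a 55-86% speedup for inference, and has a negligible impact on accuracy for text classification and entity recognition tasks in the scientific domain. For general-domain question answering tasks, ER offers a similar speedup and lowers accuracy by a small amount. Finally, we identify several open challenges and future directions for ER.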
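
The caching idea in the embedding recycling abstract can be illustrated with a toy encoder whose lower layers run once per document and whose upper layers run on every call. This is only a sketch: the class name `RecyclingEncoder`, the plain-function "layers", and the in-memory dictionary cache are hypothetical stand-ins, and nothing here reflects the authors' implementation.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


class RecyclingEncoder:
    """Toy encoder that caches the output of its lower (frozen) layers per document."""

    def __init__(self, lower_layers: List[Layer], upper_layers: List[Layer]) -> None:
        self.lower = lower_layers      # shared, frozen part: computed once per document
        self.upper = upper_layers      # task-specific part: recomputed on every call
        self.cache: Dict[str, Vector] = {}
        self.lower_calls = 0           # counts how often the expensive part runs

    def _run_lower(self, x: Vector) -> Vector:
        self.lower_calls += 1
        for layer in self.lower:
            x = layer(x)
        return x

    def encode(self, doc_id: str, x: Vector) -> Vector:
        # Recycle the cached intermediate activation if this document was seen before.
        if doc_id not in self.cache:
            self.cache[doc_id] = self._run_lower(x)
        h = self.cache[doc_id]
        for layer in self.upper:
            h = layer(h)
        return h


if __name__ == "__main__":
    enc = RecyclingEncoder(
        lower_layers=[lambda v: [2 * a for a in v]],
        upper_layers=[lambda v: [a + 1 for a in v]],
    )
    print(enc.encode("doc-1", [1.0, 2.0]))
    print(enc.encode("doc-1", [1.0, 2.0]))
    print(enc.lower_calls)
```

In a real setting the cache would hold intermediate-layer activations for a fixed corpus, so a new task would only pay for the remaining layers at training and inference time.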


DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Pengcheng He , Xiaodong Liu , Jianfeng Gao , Weizhu Chen

Category:

2020-06-05

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions, respectively. Second, an enhanced mask decoder is used to incorporate absolute positions in the decoding layer to predict the masked tokens in model pre-training. In addition, a new virtual adversarial training method is used for fine-tuning to improve models' generalization. We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). Notably, we scale up DeBERTa by training a larger version that consists of 48 Transformer layers with 1.5 billion parameters. The significant performance boost makes the single DeBERTa model surpass the human performance on the SuperGLUE benchmark (Wang et al., 2019a) for the first time in terms of macro-average score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline by a decent margin (90.3 versus 89.8). The pre-trained DeBERTa models and the source code were released at: https://github.com/microsoft/DeBERTa.


Multi-Instance Training for Question Answering Across Table and Linked Text

Vishwajeet Kumar , Saneem Chemmengath , Yash Gupta , Jaydeep Sen , Samarth Bharadwaj , Soumen Chakrabarti

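Category: Natural Language Processing | Artificial Intelligence

2021-12-14

Answering natural language questions using information from tables (TableQA) is of considerable recent interest. In many applications, tables occur not in isolation but embedded in unstructured text. Often, a question is best answered by matching its parts to either table cell contents or unstructured text spans, and extracting answers from either source. This leads to a new space of TextTableQA problems, introduced by the HybridQA dataset. Existing adaptations of table representations to Transformer-based reading comprehension (RC) architectures fail to handle the different modalities of the two representations through a single system. Training such systems is further challenged by the need for distant supervision. To reduce cognitive burden, training instances usually include only the question and the answer, with the latter matching multiple table rows and text passages. This leads to a noisy multi-instance training regime involving not only rows of the table but also spans of linked text. We respond to these challenges by proposing MITQA, a new TextTableQA system that explicitly models the different but closely related probability spaces of table row selection and text span selection. Our experiments indicate the superiority of our approach compared with recent baselines. The proposed method is currently at the top of the HybridQA leaderboard on a held-out test set, achieving a 21% absolute improvement on both EM and F1 scores over previously published results.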


Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models

Made Nindyatama Nityasya , Haryo Akbarianto Wibowo , Rendi Chevi , Radityo Eko Prasojo , Alham Fikri Aji

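Category: Natural Language Processing

2022-01-03

We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiments involve 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language. We also compare various aspects of distillation, including the use of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources (CPU, RAM, and storage) compared with pruned BERT models. We further propose some quick wins for producing small NLP models through efficient KD training mechanisms involving simple choices of loss functions, word embeddings, and unlabeled data preparation.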


A Comprehensive Survey on Multi-hop Machine Reading Comprehension Approaches

Azade Mohammadi , Reza Ramezani , Ahmad Baraani

Category: Natural Language Processing

2022-12-08

Machine reading comprehension (MRC) is a long-standing topic in natural language processing (NLP). The MRC task aims to answer a question based on the given context. Recent studies focus on multi-hop MRC, a more challenging extension of MRC in which answering a question requires several disjoint pieces of information spread across the context. Due to the complexity and importance of multi-hop MRC, a large number of studies have focused on this topic in recent years; it is therefore necessary and worthwhile to review the related literature. This study aims to investigate recent advances in multi-hop MRC approaches based on 31 studies from 2018 to 2022. In this regard, first, the multi-hop MRC problem definition will be introduced, then 31 models will be reviewed in detail with a strong focus on their multi-hop aspects. They will also be categorized based on their main techniques. Finally, a fine-grained comprehensive comparison of the models and techniques will be presented.


SAS: Self-Augmentation Strategy for Language Model Pre-training

Yifei Xu , Jingqiao Zhang , Ru He , Liangzhu Ge , Chao Yang , Cheng Yang , Ying Nian Wu

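Category: Natural Language Processing | Artificial Intelligence

2021-06-14

The core of self-supervised learning for pre-training language models includes pre-training task design as well as appropriate data augmentation. Most data augmentations in language model pre-training are context-independent. A seminal contextualized augmentation was recently proposed in ELECTRA and achieved state-of-the-art performance by introducing an auxiliary generation network (generator) to produce contextualized data augmentation for training a main discrimination network (discriminator). This design, however, introduces the extra computation cost of the generator and the need to adjust the relative capability between the generator and the discriminator. In this paper, we propose a self-augmentation strategy (SAS) in which a single network is used both for regular pre-training and for contextualized data augmentation for training in later epochs. Essentially, this strategy eliminates the separate generator and uses the single network to jointly perform two pre-training tasks with MLM (Masked Language Modeling) and RTD (Replaced Token Detection) heads. It avoids the challenge of searching for an appropriately sized generator, which is critical to performance, as evidenced in ELECTRA and its subsequent variant models. In addition, SAS is a general strategy that can be seamlessly combined with many new techniques emerging recently or in the future, such as the disentangled attention mechanism from DeBERTa. Our experiments show that SAS is able to outperform ELECTRA and other state-of-the-art models on the GLUE tasks with similar or lower computation cost.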
