Free-text rationales (FTRs) follow how humans communicate by explaining reasoning processes via natural language. A number of recent works have studied how to improve language model (LM) generalization by using FTRs to teach LMs the correct reasoning processes behind correct task outputs. These prior works aim to learn from FTRs by appending them to the LM input or target output, but this may introduce an input distribution shift or conflict with the task objective, respectively. We propose KNIFE, which distills FTR knowledge from an FTR-augmented teacher LM (takes both task input and FTR) to a student LM (takes only task input), which is used for inference. Crucially, the teacher LM's forward computation has a bottleneck stage in which all of its FTR states are masked out, which pushes knowledge from the FTR states into the task input/output states. Then, FTR knowledge is distilled to the student LM by training its task input/output states to align with the teacher LM's. On two question answering datasets, we show that KNIFE significantly outperforms existing FTR learning methods, in both fully-supervised and low-resource settings.
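To make the hidden-state alignment concrete, here is a minimal PyTorch sketch of distilling a rationale-augmented teacher into a rationale-free student by matching hidden states at task-token positions. The HuggingFace-style interfaces, the assumption that the FTR is appended after the task input (so prefix positions line up), and the plain MSE objective are illustrative assumptions, not KNIFE's exact formulation.

```python
# Illustrative hidden-state distillation: the teacher reads input + FTR, the
# student reads only the task input, and their states at task-token positions
# are aligned with MSE. Interfaces and the shared-prefix assumption are ours.
import torch
import torch.nn.functional as F

def ftr_distill_loss(teacher, student,
                     input_ids, input_mask,
                     input_with_ftr_ids, input_with_ftr_mask,
                     task_positions):
    with torch.no_grad():
        t_out = teacher(input_ids=input_with_ftr_ids,
                        attention_mask=input_with_ftr_mask,
                        output_hidden_states=True)
    s_out = student(input_ids=input_ids,
                    attention_mask=input_mask,
                    output_hidden_states=True)
    t_hidden = t_out.hidden_states[-1]  # [batch, teacher_seq_len, dim]
    s_hidden = s_out.hidden_states[-1]  # [batch, student_seq_len, dim]
    # Only task-input positions are matched; FTR positions carry no target,
    # mirroring the bottleneck that masks out the teacher's FTR states.
    return F.mse_loss(s_hidden[:, task_positions], t_hidden[:, task_positions])
```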
Neural language models (LMs) have achieved impressive results on various language-based reasoning tasks by utilizing latent knowledge encoded in their own pretrained parameters. To make this reasoning process more explicit, recent works retrieve a rationalizing LM's internal knowledge by training or prompting it to generate free-text rationales, which can be used to guide task predictions made by either the same LM or a separate reasoning LM. However, rationalizing LMs require expensive rationale annotation and/or computation, without any assurance that their generated rationales improve LM task performance or faithfully reflect LM decision-making. In this paper, we propose PINTO, an LM pipeline that rationalizes via prompt-based learning, and learns to faithfully reason over rationales via counterfactual regularization. First, PINTO maps out a suitable reasoning process for the task input by prompting a frozen rationalizing LM to generate a free-text rationale. Second, PINTO's reasoning LM is fine-tuned to solve the task using the generated rationale as context, while regularized to output less confident predictions when the rationale is perturbed. Across four datasets, we show that PINTO significantly improves the generalization ability of the reasoning LM, yielding higher performance on both in-distribution and out-of-distribution test sets. Also, we find that PINTO's rationales are more faithful to its task predictions than those generated by competitive baselines.
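As a rough illustration of the counterfactual regularization described here (an assumed loss form, not necessarily PINTO's exact objective), the reasoning LM can be trained with cross-entropy on the rationale-conditioned prediction while being pushed toward low confidence when the rationale is perturbed:

```python
# Sketch: task loss with the generated rationale as context, plus a penalty on
# confident predictions when the rationale is perturbed (entropy maximization).
import torch.nn.functional as F

def counterfactual_regularized_loss(logits_with_rationale,
                                    logits_with_perturbed_rationale,
                                    labels, lam=1.0):
    task_loss = F.cross_entropy(logits_with_rationale, labels)
    probs = F.softmax(logits_with_perturbed_rationale, dim=-1)
    neg_entropy = (probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    # Minimizing negative entropy makes the model *less* confident on the
    # counterfactual (perturbed-rationale) input.
    return task_loss + lam * neg_entropy
```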
Extractive rationales explain a language model's (LM's) prediction on a given task instance by highlighting the text inputs that most influenced its output. Ideally, rationale extraction should be faithful (reflective of the LM's behavior), plausible (convincing to humans), data-efficient, and fast, without sacrificing the LM's task performance. Prior rationale extraction works address various subsets of these desiderata with specialized methods, but never all five. Narrowly focusing on certain desiderata typically comes at the expense of the neglected ones, so existing rationale extractors are often impractical in real-world applications. To tackle this challenge, we propose UNIREX, a unified and highly flexible learning framework for rationale extraction that allows users to easily account for all five factors. UNIREX enables end-to-end customization of the rationale extractor training process, supporting arbitrary: (1) heuristic/learned rationale extractors, (2) combinations of faithfulness and/or plausibility objectives, and (3) amounts of gold rationale supervision. Across three text classification datasets, our best UNIREX configurations achieve a superior balance of the five desiderata compared to strong baselines. Moreover, UNIREX-trained rationale extractors can even generalize to unseen datasets and tasks.
Given the success with in-context learning of large pre-trained language models, we introduce in-context learning distillation to transfer in-context few-shot learning ability from large models to smaller models. We propose to combine in-context learning objectives with language modeling objectives to distill both the ability to read in-context examples and task knowledge to the smaller models. We perform in-context learning distillation under two different few-shot learning paradigms: Meta In-context Tuning (Meta-ICT) and Multitask In-context Tuning (Multitask-ICT). Multitask-ICT performs better on multitask few-shot learning but also requires more computation than Meta-ICT. Our method shows consistent improvements for both Meta-ICT and Multitask-ICT on two benchmarks: LAMA and CrossFit. Our extensive experiments and analysis reveal that in-context learning objectives and language modeling objectives are complementary under the Multitask-ICT paradigm. In-context learning objectives achieve the best performance when combined with language modeling objectives.
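A hedged sketch of how such a combined objective might look in PyTorch, assuming a KL-based distillation term on the teacher's token distribution plus a standard language-modeling term (the weighting and temperature are illustrative choices, not the paper's exact recipe):

```python
# Combine in-context learning distillation (imitating the teacher's next-token
# distribution on the in-context-formatted input) with language modeling.
import torch.nn.functional as F

def icl_distill_loss(student_logits, teacher_logits, target_ids,
                     alpha=0.5, tau=2.0):
    # Soft targets from the teacher on the same in-context examples.
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    # Language-modeling objective on the gold continuation.
    lm = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         target_ids.view(-1), ignore_index=-100)
    return alpha * kd + (1 - alpha) * lm
```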
Free-text rationales aim to explain neural language model (LM) behavior more flexibly and intuitively via natural language. To ensure rationale quality, it is important to have metrics that measure a rationale's faithfulness (how well it reflects the LM's actual behavior) and plausibility (how convincing it is to humans). All existing free-text rationale metrics are based on simulatability (the association between the rationale and the LM's predicted label), yet there is no protocol for assessing the reliability of such metrics. To investigate this, we propose FRAME, a framework for evaluating simulatability metrics for free-text rationales. FRAME is based on three axioms: (1) a good metric should yield the highest scores for reference rationales, which maximize rationale-label association by construction; (2) a good metric should be appropriately sensitive to semantic perturbations of the rationale; and (3) a good metric should be robust to variation in the LM's task performance. Across three text classification datasets, we show that existing simulatability metrics fail to satisfy all three FRAME axioms, because they are implemented via model pretraining, which corrupts the metric's signal. We introduce a non-pretraining simulatability variant that improves performance on (1) and (3) by an average of 41.7% and 42.9%, respectively, while performing competitively on (2).
Language models (LMs) have demonstrated remarkable performance on downstream tasks, using in-context exemplars or human instructions. Recent works have shown that chain-of-thought (CoT) prompting can elicit models to solve complex reasoning tasks, step-by-step. However, the efficacy of prompt-based CoT methods is restricted to very large LMs such as GPT-3 (175B), thus limiting deployability. In this paper, we revisit the fine-tuning approach to enable complex reasoning in smaller LMs, optimized to efficiently perform a specific task. We propose Fine-tune-CoT, a method that leverages the capabilities of very large LMs to generate reasoning samples and teach smaller models via fine-tuning. We evaluate our method on publicly available LMs across a wide range of complex tasks and model sizes. We find that Fine-tune-CoT enables substantial reasoning capability in small models, whereas previous prompt-based baselines exhibit near-random performance. Student models can even outperform the teacher in some tasks while reducing model size requirements by several orders of magnitude. We conduct extensive ablations and sample studies to understand the reasoning capabilities of student models. We also identify several important nuances that have been overlooked in concurrent fine-tuning works on CoT and address them in our analysis.
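The recipe can be sketched as a simple data-generation step under assumed interfaces: a large teacher produces a step-by-step rationale for each training question, and the (question, rationale + answer) pair becomes a fine-tuning example for the student. The model name, prompt template, and helper function below are hypothetical stand-ins, not the paper's released code.

```python
# Generate chain-of-thought fine-tuning data with a (stand-in) teacher LM.
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2-xl")  # placeholder for a very large LM

def make_cot_sample(question: str, answer: str) -> dict:
    prompt = f"Q: {question}\nA: Let's think step by step."
    generated = teacher(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
    rationale = generated[len(prompt):].strip()
    # Student fine-tuning target: reasoning chain followed by the final answer.
    return {"input": question,
            "target": f"{rationale}\nTherefore, the answer is {answer}."}
```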
Recently, non-autoregressive (NAT) models, which predict outputs in parallel, have achieved substantial improvements in generation speed compared to autoregressive (AT) models. While performing worse when trained on raw data, most NAT models are trained as student models on data distilled by a teacher model, known as sequence-level knowledge distillation. An effective training strategy for improving model performance is Self-Distillation Mixup (SDM) training, which pre-trains a model on raw data, generates distilled data with the pre-trained model itself, and finally re-trains the model on the combination of raw data and distilled data. In this work, we aim to apply SDM to NAT models, but find that directly adopting SDM for NAT models yields no improvement in translation quality. Through careful analysis, we observe that the failure correlates with the modeling and confirmation biases between the teacher model and the NAT student model. Based on these findings, we propose an enhanced strategy named SDMRT that adds two stages to classic SDM: one is pre-reranking on self-distilled data, and the other is fine-tuning on filtered teacher-distilled data. Our results outperform baselines by 0.6 to 1.2 BLEU on multiple NAT models. As an additional bonus, for iterative-refinement NAT models, our method can surpass the baselines within half the number of iterations, which implies a 2x speedup.
Contrastive learning has been shown to be well suited to learning sentence embeddings and can significantly improve performance on semantic textual similarity (STS) tasks. Recently, large contrastive learning models, such as Sentence-T5, tend to learn even more powerful sentence embeddings. Although effective, such large models are difficult to serve online due to computational resource or latency constraints. To address this, knowledge distillation (KD) is usually adopted, which can compress a large "teacher" model into a small "student" model but generally incurs some performance loss. Here, we propose an enhanced KD framework termed Distill-Contrast (DisCo). The proposed DisCo framework first uses KD to transfer the capability of a large sentence embedding model to a small student model on large unlabeled data, and then fine-tunes the student model with contrastive learning on labeled training data. For DisCo's KD process, we further propose Contrastive Knowledge Distillation (CKD) to enhance the consistency among teacher model training, KD, and student model fine-tuning, which can likely improve performance in a manner similar to prompt learning. Extensive experiments on 7 STS benchmarks show that student models trained with the proposed DisCo and CKD suffer little or even no performance loss and consistently outperform their counterparts of the same parameter size. Remarkably, our 110M student model can even outperform the latest state-of-the-art (SOTA) model, Sentence-T5 (11B), with only 1% of its parameters.
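A minimal sketch of the two-stage training described above, with assumed loss forms rather than the authors' implementation: embedding-level KD on unlabeled data, followed by in-batch contrastive (InfoNCE) fine-tuning on labeled pairs.

```python
# Stage 1: embedding KD on unlabeled text; Stage 2: contrastive fine-tuning.
import torch
import torch.nn.functional as F

def kd_stage_loss(student_emb, teacher_emb):
    # Pull student sentence embeddings toward the teacher's.
    return F.mse_loss(student_emb, teacher_emb)

def contrastive_stage_loss(emb_a, emb_b, temperature=0.05):
    # In-batch InfoNCE, where (emb_a[i], emb_b[i]) are positive pairs.
    sim = F.cosine_similarity(emb_a.unsqueeze(1), emb_b.unsqueeze(0), dim=-1)
    labels = torch.arange(emb_a.size(0), device=emb_a.device)
    return F.cross_entropy(sim / temperature, labels)
```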
Language model pre-training, such as BERT, has significantly improved the performance of many natural language processing tasks. However, pre-trained language models are usually computationally expensive, so it is difficult to execute them efficiently on resource-restricted devices. To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of Transformer-based models. By leveraging this new KD method, the abundant knowledge encoded in a large "teacher" BERT can be effectively transferred to a small "student" TinyBERT. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pre-training and task-specific learning stages. This framework ensures that TinyBERT can capture the general-domain as well as the task-specific knowledge in BERT. TinyBERT_4 with 4 layers is empirically effective and achieves more than 96.8% of the performance of its teacher BERT_BASE on the GLUE benchmark, while being 7.5x smaller and 9.4x faster at inference. TinyBERT_4 is also significantly better than 4-layer state-of-the-art baselines on BERT distillation, with only ~28% of their parameters and ~31% of their inference time. Moreover, TinyBERT_6 with 6 layers performs on par with its teacher BERT_BASE.
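A small sketch of layer-to-layer Transformer distillation in this spirit; the specific layer mapping and the linear adapter for the width mismatch are assumptions of the illustration, not TinyBERT's exact configuration.

```python
# Match a mapped pair of student/teacher layers on hidden states and attention.
import torch.nn as nn
import torch.nn.functional as F

class LayerDistiller(nn.Module):
    def __init__(self, student_dim=312, teacher_dim=768):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)  # width-mismatch adapter

    def forward(self, s_hidden, t_hidden, s_attn, t_attn):
        hidden_loss = F.mse_loss(self.proj(s_hidden), t_hidden)
        attn_loss = F.mse_loss(s_attn, t_attn)
        return hidden_loss + attn_loss
```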
Chain-of-thought prompting successfully improves the reasoning capabilities of large language models, achieving state-of-the-art results on a range of datasets. However, these reasoning capabilities only appear to emerge in models with over 100 billion parameters. In this paper, we explore the transfer of such reasoning capabilities to models with fewer than 100 billion parameters via knowledge distillation. Specifically, we finetune a student model on the chain-of-thought outputs generated by a larger teacher model. Our experiments show that the proposed method improves task performance across arithmetic, commonsense, and symbolic reasoning datasets. For example, the accuracy of T5 XXL on GSM8K improves from 8.11% to 21.99% when finetuned on PaLM-540B-generated chains of thought.
We perform a knowledge distillation (KD) benchmark from task-specific BERT-base teacher models to a variety of student models: BiLSTM, CNN, BERT-Tiny, BERT-Mini, and BERT-Small. Our experiments involve 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language. We also compare various aspects of distillation, including the use of word embeddings and unlabeled data augmentation. Our experiments show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources (CPU, RAM, and storage) compared to pruned BERT models. We further propose some quick wins for producing small NLP models through efficient KD training mechanisms involving simple choices of loss function, word embeddings, and unlabeled data preparation.
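For reference, a typical task-specific distillation objective of the kind benchmarked here can be sketched as logit matching plus a hard-label term; the MSE/cross-entropy mix and weighting below are an assumed, common formulation rather than the paper's exact setup.

```python
# Task-specific KD: the student matches the teacher's logits and the gold labels.
import torch.nn.functional as F

def logit_matching_kd(student_logits, teacher_logits, labels, alpha=0.5):
    kd = F.mse_loss(student_logits, teacher_logits)   # soft supervision
    ce = F.cross_entropy(student_logits, labels)      # hard labels
    return alpha * kd + (1 - alpha) * ce
```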
Large pretrained models (e.g., GPT-3) achieve remarkable performance by being exposed to vast amounts of data during training. Analogously, distilling such large models into compact models for efficient deployment also requires a large amount of (labeled or unlabeled) training data. In this paper, we propose a teacher-guided training (TGT) framework for training high-quality compact models that leverages the knowledge acquired by pretrained generative models while avoiding the need for large amounts of data. TGT exploits the fact that the teacher has acquired a good representation of the underlying data domain, which typically corresponds to a much lower-dimensional manifold than the input space. Furthermore, we can use the teacher to explore the input space more efficiently via sampling or gradient-based methods, which makes TGT especially attractive for limited-data or long-tail settings. We formally capture the benefit of the proposed data-domain exploration in our generalization bounds. We find that TGT improves accuracy on several image classification benchmarks as well as a range of text classification and retrieval tasks.
Step-by-step reasoning approaches like chain-of-thought (CoT) have proved to be a very effective technique to induce reasoning capabilities in large language models. However, the success of the CoT approach depends primarily on model size, and often billion parameter-scale models are needed to get CoT to work. In this paper, we propose a knowledge distillation approach, that leverages the step-by-step CoT reasoning capabilities of larger models and distils these reasoning abilities into smaller models. Our approach Decompositional Distillation learns a semantic decomposition of the original problem into a sequence of subproblems and uses it to train two models: a) a problem decomposer that learns to decompose the complex reasoning problem into a sequence of simpler sub-problems and b) a problem solver that uses the intermediate subproblems to solve the overall problem. On a multi-step math word problem dataset (GSM8K), we boost the performance of GPT-2 variants up to 35% when distilled with our approach compared to CoT. We show that using our approach, it is possible to train a GPT-2-large model (775M) that can outperform a 10X larger GPT-3 (6B) model trained using CoT reasoning. Finally, we also demonstrate that our approach of problem decomposition can also be used as an alternative to CoT prompting, which boosts the GPT-3 performance by 40% compared to CoT prompts.
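The two-model pipeline can be sketched as follows, with the problem decomposer and problem solver treated as opaque callables; the interfaces and the way intermediate answers are threaded through are illustrative assumptions, not the paper's exact procedure.

```python
# Decompose a complex question into sub-questions, then solve them in sequence.
from typing import Callable, List

def solve_by_decomposition(question: str,
                           decomposer: Callable[[str], List[str]],
                           solver: Callable[[str, List[str]], str]) -> str:
    subquestions = decomposer(question)        # e.g. ["How many ... ?", ...]
    steps: List[str] = []
    for sub in subquestions:
        answer = solver(sub, steps)            # solver conditions on prior steps
        steps.append(f"{sub} -> {answer}")
    # The answer to the last sub-question is taken as the final answer.
    return steps[-1].split("->")[-1].strip() if steps else ""
```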
In the past few years, Transformer-based pretrained language models have achieved astounding success in both industry and academia. However, their large model size and high run-time latency are serious impediments to applying them in practice, especially on mobile phones and Internet-of-Things (IoT) devices. To compress such models, a considerable body of literature has recently grown up around the theme of knowledge distillation (KD). Nevertheless, how KD works in Transformer-based models is still unclear. We tease apart the components of KD and propose a unified KD framework. Through this framework, systematic and extensive experiments spanning more than 23,000 GPU hours analyze KD from the perspectives of knowledge types, matching strategies, width-depth trade-offs, initialization, model size, and so on, and yield relatively significant improvements over the previous state of the art (SOTA) for pretrained language models. Finally, we provide a best-practice guideline for KD in Transformer-based models.
Since many fine-tuned pretrained language models (PLMs) with promising performance are generously released, investigating better ways to reuse these models is vital, as it can greatly reduce retraining computational costs and potential environmental side effects. In this paper, we explore a novel model-reuse paradigm, Knowledge Amalgamation (KA), for PLMs. Without human annotations, KA aims to merge the knowledge from different teachers, each specializing in a different classification problem, into a versatile student model. To achieve this, we design a Model Uncertainty-aware Knowledge Amalgamation (MUKA) framework, which identifies the potentially adequate teacher using Monte-Carlo dropout to estimate the gold supervision that guides the student. Experimental results demonstrate that MUKA achieves substantial improvements over baselines on benchmark datasets. Further analysis shows that MUKA generalizes well to complicated settings with multiple teacher models, heterogeneous teachers, and even cross-dataset teachers.
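A rough sketch of per-example teacher selection via Monte-Carlo dropout, as described above; the uncertainty estimate (predictive entropy of the averaged dropout samples) and the selection rule below are illustrative assumptions rather than MUKA's exact design.

```python
# Estimate each teacher's uncertainty with MC dropout, then pick the most
# confident teacher per example to supply (pseudo-)gold supervision.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mc_dropout_uncertainty(model, inputs, n_samples=8):
    model.train()  # keep dropout active at inference time
    probs = torch.stack([F.softmax(model(**inputs).logits, dim=-1)
                         for _ in range(n_samples)])
    mean_p = probs.mean(dim=0)
    entropy = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=-1)  # per example
    return mean_p, entropy

def pick_teacher(teachers, inputs):
    # Lower predictive entropy ~ the teacher that "knows" this example best.
    stats = [mc_dropout_uncertainty(t, inputs) for t in teachers]
    entropies = torch.stack([e for _, e in stats])   # [n_teachers, batch]
    best = entropies.argmin(dim=0)                   # per-example teacher index
    return best, [p for p, _ in stats]
```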
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (a student) generalization by transferring the knowledge from a larger model (a teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems limiting their performance. It is shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates the training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with the smoothed version of this objective and making it more complex as the training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).
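One way to realize such a continuation schedule, sketched under assumptions (temperature-based smoothing annealed over training, with the sharp task term phased in gradually), is shown below; this is not necessarily the paper's exact smoothing scheme.

```python
# Start from a heavily smoothed KD objective and harden it as training proceeds.
import torch.nn.functional as F

def continuation_kd_loss(student_logits, teacher_logits, labels,
                         step, total_steps):
    progress = min(step / max(total_steps, 1), 1.0)
    tau = 4.0 - 3.0 * progress          # temperature anneals 4.0 -> 1.0
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau * tau
    ce = F.cross_entropy(student_logits, labels)
    # The sharp task term is phased in as the smoothed objective hardens.
    return kd + progress * ce
```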
This paper studies the potential of distilling knowledge from pretrained models, especially masked autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature maps of the teacher model and those of the student model. This design leads to a computationally efficient knowledge distillation framework, given that 1) only a small subset of visible patches is used, and 2) the (cumbersome) teacher model only needs to be partially executed, i.e., forward-propagating inputs through the first few layers, to obtain intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pretrained models substantially improves downstream performance. For example, by distilling knowledge from an MAE pretrained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%. More intriguingly, our method can robustly distill knowledge from the teacher model even with extremely high masking ratios: e.g., with only ten visible patches during distillation (a masking ratio of ~95%), our ViT-B attains a competitive 83.6% top-1 ImageNet accuracy; surprisingly, it can still secure 82.4% top-1 ImageNet accuracy by aggressively training with only four visible patches (a 98% masking ratio). Code and models are publicly available at https://github.com/ucsc-vlaa/dmae.
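A minimal sketch of the combined objective described above, with assumed interfaces: the usual MAE pixel-reconstruction loss plus a distance between (projected) student and teacher intermediate features, both computed only on the small set of visible patches.

```python
# MAE reconstruction loss + feature-map distillation from a partially-run teacher.
import torch
import torch.nn.functional as F

def dmae_style_loss(student_feats, teacher_feats, recon, target_pixels,
                    proj, beta=1.0):
    # `proj` maps student feature width to the teacher's; `recon`/`target_pixels`
    # are the standard MAE reconstruction pair on masked patches.
    pixel_loss = F.mse_loss(recon, target_pixels)
    feat_loss = F.smooth_l1_loss(proj(student_feats), teacher_feats)
    return pixel_loss + beta * feat_loss
```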
Self-rationalization models that predict task labels and generate free-text elaborations for their predictions could enable more intuitive interaction with NLP systems. However, such models are currently trained with a large amount of human-written free-text explanations for each task, which hinders wider adoption. We propose to study a more realistic setting of self-rationalization using few training examples. We present FEB, a standardized collection of four existing English-language datasets and associated metrics. We identify the right prompting approach by extensively exploring natural language prompts on FEB. Then, using this prompt and scaling the model size, we demonstrate progress on few-shot self-rationalization. We show that there is still ample room for improvement on this task: the average plausibility of generated explanations, as assessed by human annotators, is at most 51%, while the plausibility of human explanations is 76%. We hope that FEB, together with our proposed approach, will spur the community to take on the few-shot self-rationalization challenge.
This technical report briefly describes our JDExplore d-team's Vega v2 submission on the SuperGLUE leaderboard. SuperGLUE is more challenging than the widely used general language understanding evaluation (GLUE) benchmark, containing eight difficult language understanding tasks, including question answering, natural language inference, word sense disambiguation, coreference resolution, and reasoning. [Method] Instead of arbitrarily increasing the size of a pretrained language model (PLM), our aim is to 1) fully extract knowledge from the input pretraining data given a certain parameter budget, e.g., 6B, and 2) effectively transfer this knowledge to downstream tasks. To achieve goal 1), we propose self-evolution learning for PLMs to wisely predict the informative tokens that should be masked, and supervise the masked language modeling (MLM) process with rectified smooth labels. For goal 2), we leverage the prompt transfer technique to improve the low-resource tasks by transferring the knowledge from the foundation model and related downstream tasks to the target task. [Results] According to our submission record (Oct. 2022), with our optimized pretraining and fine-tuning strategies, our 6B Vega method achieved new state-of-the-art performance on 4/8 tasks, sitting atop the SuperGLUE leaderboard on Oct. 8, 2022, with an average score of 91.3.
Data Augmentation (DA) is frequently used to automatically provide additional training data without extra human annotation. However, data augmentation may introduce noisy data that impairs training. To guarantee the quality of augmented data, existing methods either assume no noise exists in the augmented data and adopt consistency training or use simple heuristics such as training loss and diversity constraints to filter out "noisy" data. However, those filtered examples may still contain useful information, and dropping them completely causes loss of supervision signals. In this paper, based on the assumption that the original dataset is cleaner than the augmented data, we propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data. A simple self-regularization module is applied to force the model prediction to be consistent across two distinct dropouts to further prevent overfitting on noisy labels. Our method can be applied to augmentation techniques in general and can consistently improve the performance on both text classification and question-answering tasks.
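A hedged sketch of this training objective, under assumed loss forms: soft labels from the organic teacher supervise the augmented examples, and a symmetric-KL consistency term across two dropout passes provides the self-regularization.

```python
# Soft-label supervision from the "organic" teacher plus an R-Drop-style
# consistency term across two stochastic (dropout) forward passes.
import torch.nn.functional as F

def denoised_augmentation_loss(student, teacher, aug_inputs, lam=1.0, tau=1.0):
    soft = F.softmax(teacher(**aug_inputs).logits.detach() / tau, dim=-1)
    logits1 = student(**aug_inputs).logits   # two forward passes; the student
    logits2 = student(**aug_inputs).logits   # must be in train mode for dropout
    kd = F.kl_div(F.log_softmax(logits1 / tau, dim=-1), soft,
                  reduction="batchmean")
    # Symmetric KL between the two dropout views.
    p1, p2 = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    consistency = 0.5 * (F.kl_div(p1, p2.exp(), reduction="batchmean") +
                         F.kl_div(p2, p1.exp(), reduction="batchmean"))
    return kd + lam * consistency
```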