In recent years, knowledge distillation has improved significantly, producing compact student models with better efficiency while retaining the performance of the teacher model. Previous studies have found that, due to capacity mismatch, a more accurate teacher does not necessarily make a better teacher. In this paper, we aim to analyze this phenomenon through the lens of model calibration. We find that larger teacher models can be overly confident, so the student model cannot imitate them effectively. However, after simple calibration of the teacher model, the size of the teacher correlates positively with the performance of the student.
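A minimal sketch of this idea: calibrate the teacher first (here via temperature scaling, one common choice; the abstract does not fix the method), then distill from the calibrated outputs. All function and parameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def fit_teacher_temperature(teacher_logits, labels, lr=0.01, steps=200):
    """Learn a single scalar T that minimizes NLL on a held-out set."""
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t) > 0
    opt = torch.optim.LBFGS([log_t], lr=lr, max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(teacher_logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()

def kd_loss(student_logits, teacher_logits, labels, T_calib, T_kd=4.0, alpha=0.9):
    """Standard KD loss, except the teacher logits are first rescaled by the
    calibration temperature so the student imitates less over-confident targets."""
    teacher_probs = F.softmax(teacher_logits / (T_calib * T_kd), dim=1)
    log_student = F.log_softmax(student_logits / T_kd, dim=1)
    soft = F.kl_div(log_student, teacher_probs, reduction="batchmean") * T_kd ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```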
Knowledge distillation in machine learning is the process of transferring knowledge from a large model, called the teacher, to a smaller model, called the student. It is one of the techniques for compressing a large network (the teacher) into a smaller network (the student) that can be deployed on small devices such as mobile phones. When the size gap between the teacher and student networks increases, the performance of the student network drops. To address this issue, an intermediate model, called the teacher assistant, is introduced between the teacher and student models, which in turn bridges the gap between teacher and student. In this study, we show that the student model (the smaller model) can be further improved by using multiple teacher assistant models. We combine these multiple teacher assistants through weighted ensemble learning, using a differential evolution optimization algorithm to generate the weight values.
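A rough sketch (not the authors' code) of the weighting step: the teacher assistants' logits are blended by a weighted ensemble whose weights are searched with differential evolution on a validation set. `ta_val_logits` and `val_labels` are assumed to be pre-computed arrays.

```python
import numpy as np
from scipy.optimize import differential_evolution

def ensemble_logits(weights, ta_logits_list):
    w = np.asarray(weights) / (np.sum(weights) + 1e-12)     # normalize to sum to 1
    return sum(wi * logits for wi, logits in zip(w, ta_logits_list))

def neg_val_accuracy(weights, ta_logits_list, labels):
    preds = ensemble_logits(weights, ta_logits_list).argmax(axis=1)
    return -np.mean(preds == labels)                         # minimize -> maximize accuracy

def search_weights(ta_val_logits, val_labels):
    bounds = [(0.0, 1.0)] * len(ta_val_logits)               # one weight per assistant
    result = differential_evolution(
        neg_val_accuracy, bounds, args=(ta_val_logits, val_labels), seed=0)
    return result.x / result.x.sum()

# The normalized weights are then used to blend the assistants' soft targets
# when distilling into the final student.
```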
Knowledge distillation (KD) has gained a lot of attention in the field of model compression for edge devices thanks to its effectiveness in compressing large powerful networks into smaller lower-capacity models. Online distillation, in which both the teacher and the student are learning collaboratively, has also gained much interest due to its ability to improve on the performance of the networks involved. The Kullback-Leibler (KL) divergence ensures the proper knowledge transfer between the teacher and student. However, most online KD techniques present some bottlenecks under the network capacity gap. By cooperatively and simultaneously training the models, the KL divergence becomes incapable of properly minimizing the distance between the teacher's and the student's distributions. Alongside accuracy, critical edge device applications are in need of well-calibrated compact networks. Confidence calibration provides a sensible way of getting trustworthy predictions. We propose BD-KD: Balancing of Divergences for online Knowledge Distillation. We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network without limiting the teacher network's learning process. We demonstrate that, by performing this balancing design at the level of the student distillation loss, we improve upon both performance accuracy and calibration of the compact student network. We conducted extensive experiments using a variety of network architectures and show improvements on multiple datasets including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. We illustrate the effectiveness of our approach through comprehensive comparisons and ablations with current state-of-the-art online and offline KD techniques.
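A simplified sketch of a balanced-divergence student loss: the student is trained on a weighted sum of the forward KL(teacher || student) and the reverse KL(student || teacher). The per-sample weighting rule below (more reverse-KL weight on samples the student already classifies correctly) is only an illustrative stand-in for the paper's adaptive balancing.

```python
import torch
import torch.nn.functional as F

def balanced_divergence_loss(student_logits, teacher_logits, labels, T=4.0):
    log_s = F.log_softmax(student_logits / T, dim=1)
    log_t = F.log_softmax(teacher_logits / T, dim=1)
    p_s, p_t = log_s.exp(), log_t.exp()

    # Per-sample forward and reverse KL divergences.
    fwd = (p_t * (log_t - log_s)).sum(dim=1)   # KL(teacher || student)
    rev = (p_s * (log_s - log_t)).sum(dim=1)   # KL(student || teacher)

    # Illustrative adaptive balance: lean on the reverse term when the
    # student's prediction already agrees with the label.
    correct = (student_logits.argmax(dim=1) == labels).float()
    beta = 0.5 * correct + 0.1 * (1 - correct)

    soft = ((1 - beta) * fwd + beta * rev).mean() * T ** 2
    hard = F.cross_entropy(student_logits, labels)
    return soft + hard
```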
The performance of a distillation-based compressed network is governed by the quality of distillation. The sub-optimal distillation from a large network (the teacher) to a smaller network (the student) is mainly attributed to the gap in learning capacity between the given teacher-student pair. Although it is hard to distill all of the teacher's knowledge, the quality of distillation can be controlled to a large extent to achieve better performance. Our experiments show that distillation quality is mainly constrained by the quality of the teacher's responses, which in turn is affected by the presence of similarity information in those responses. A well-trained high-capacity teacher loses similarity information between classes in the process of learning fine-grained discriminative features. The absence of similarity information reduces the distillation process from one-example-to-many-classes learning to one-example-to-one-class learning, thereby throttling the flow of the teacher's diverse knowledge. Since, under the implicit assumption, only what has been inculcated can be distilled, instead of focusing only on the knowledge distillation process, we scrutinize the knowledge inculcation process. We argue that, for a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher. We discuss the steps for finding this sweet spot for better distillation. We also propose distillation hypotheses to differentiate the behavior of the distillation process between knowledge distillation and regularization effects. We conduct all our experiments on three different datasets.
Figure 1. An illustration of standard knowledge distillation. Despite widespread use, an understanding of when the student can learn from the teacher is missing.
In the context of multimodal knowledge distillation research, existing methods focus mainly on the problem of learning only the teacher's final output. Consequently, a deep gap remains between the teacher network and the student network. It is necessary to force the student network to learn the modality-relation information of the teacher network. To effectively exploit the knowledge transferred from teacher to student, we adopt a new modality-relation distillation paradigm that models the relational information between different modalities, i.e., learning the teacher's modality-level Gram matrix.
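A minimal sketch, not the paper's exact formulation: match the teacher's and student's inter-modality Gram matrices, where each modality contributes one (batch, dim) feature tensor and the feature dimensions are assumed to agree.

```python
import torch
import torch.nn.functional as F

def modality_gram(features):
    """features: list of (batch, dim) tensors, one per modality.
    Returns per-sample (M, M) modality Gram matrices stacked as (B, M, M)."""
    stacked = torch.stack([F.normalize(f, dim=1) for f in features], dim=1)  # (B, M, D)
    gram = torch.bmm(stacked, stacked.transpose(1, 2))                        # (B, M, M)
    return gram

def relation_distillation_loss(student_feats, teacher_feats):
    # Penalize the discrepancy between the two sets of modality relations.
    return F.mse_loss(modality_gram(student_feats), modality_gram(teacher_feats))
```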
In this work, we propose a novel approach for learning representations from a trained neural network. In particular, we form a Bregman divergence based on the layer's transfer function and construct an extension of the original Bregman PCA formulation by incorporating a mean vector and normalizing the principal directions with respect to the geometry of the local convex function around the mean. This generalization allows exporting the learned representation as a fixed layer with a non-linearity. As an application to knowledge distillation, we cast the learning problem for the student network as predicting the compression coefficients of the teacher's representations, which are passed as input to the imported layer. Our empirical findings indicate that our approach is substantially more effective for transferring information between networks than typical teacher-student training using the teacher's penultimate-layer representations and soft labels.
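For reference, the basic object the construction above builds on is the Bregman divergence generated by a differentiable convex function F; the specific choice of F tied to a layer's transfer function is the paper's contribution and is not reproduced here.

```latex
% Bregman divergence generated by a differentiable convex function F:
\[
  D_F(x, y) \;=\; F(x) - F(y) - \langle \nabla F(y),\, x - y \rangle .
\]
```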
Knowledge distillation was initially introduced to utilize additional supervision from a single teacher model for training a student model. To boost student performance, some recent variants attempt to exploit diverse knowledge sources from multiple teachers. However, existing studies mainly integrate knowledge from diverse sources by averaging over multiple teacher predictions or combining them with other label-free strategies, which may mislead the student in the presence of low-quality teacher predictions. To address this problem, we propose Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD), which adaptively assigns sample-wise reliability to each teacher prediction with the help of ground-truth labels, assigning large weights to teacher predictions close to the one-hot labels. In addition, CA-MKD incorporates intermediate layers to further improve student performance. Extensive experiments show that our CA-MKD consistently outperforms all compared state-of-the-art methods across various teacher-student architectures.
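A hedged sketch of confidence-aware teacher weighting: each teacher's per-sample weight grows as its prediction gets closer to the one-hot label, implemented here as a softmax over negative cross-entropies (the exact weighting in CA-MKD may differ, and the intermediate-layer term is omitted).

```python
import torch
import torch.nn.functional as F

def ca_mkd_soft_loss(student_logits, teacher_logits_list, labels, T=4.0):
    # Per-teacher, per-sample cross-entropy with the ground truth.
    ces = torch.stack([
        F.cross_entropy(t_logits, labels, reduction="none")
        for t_logits in teacher_logits_list
    ], dim=0)                                   # (num_teachers, batch)

    weights = F.softmax(-ces, dim=0)            # reliable teachers get large weight

    log_s = F.log_softmax(student_logits / T, dim=1)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        p_t = F.softmax(t_logits / T, dim=1)
        kl = (p_t * (p_t.clamp_min(1e-12).log() - log_s)).sum(dim=1)  # per-sample KL
        loss = loss + (w * kl).mean()
    return loss * T ** 2
```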
Although deep models have shown promising performance in medical image segmentation, they rely heavily on large amounts of annotated data, which are difficult to access, especially in clinical practice. Moreover, highly accurate deep models usually have large model sizes, which limits their use in real-world scenarios. In this work, we propose a novel asymmetric co-teacher framework, ACT-Net, to alleviate the burden of expensive annotation and computational costs through semi-supervised knowledge distillation. We advance teacher-student learning with a co-teacher network to facilitate asymmetric knowledge distillation from a large model to a small one by alternating student and teacher roles, obtaining a tiny but accurate model for clinical deployment. To verify the effectiveness of our ACT-Net, we employ the ACDC dataset for cardiac substructure segmentation in our experiments. Extensive experimental results show that ACT-Net outperforms other knowledge distillation methods and achieves lossless segmentation performance with 250x fewer parameters.
With the improvement of AI chips (e.g., GPU, TPU, and NPU) and the rapid development of the Internet of Things (IoT), some powerful deep neural networks (DNNs) are composed of millions or even hundreds of millions of parameters, and may not be suitable for direct deployment on low-compute, low-capacity units such as edge devices. Recently, knowledge distillation (KD) has been regarded as one of the effective methods of model compression for reducing model parameters. The main concept of KD is to extract useful information from the feature maps of a large model (i.e., the teacher model) as a reference for successfully training a small model (i.e., the student model) whose size is much smaller than the teacher's. Although many KD-based methods have been proposed to exploit the information in the feature maps of the teacher's intermediate layers, most of them do not consider the similarity between the feature maps of the teacher model and the student model, which may lead the student model to learn useless information. Inspired by the attention mechanism, we propose a novel KD method called Representative Teacher Keys (RTK), which not only considers the similarity of feature maps but also filters out useless information to improve the performance of the target student model. In experiments, we validate our proposed method with multiple backbone networks (e.g., ResNet and WideResNet) and datasets (e.g., CIFAR10, CIFAR100, SVHN, and CINIC10). The results show that our proposed RTK can effectively improve the classification accuracy of attention-based KD methods.
Knowledge distillation has become an important approach to obtaining compact yet effective models. To achieve this goal, a small student model is trained to exploit the knowledge of a large well-trained teacher model. However, due to the capacity gap between the teacher and the student, the student's performance is hard to raise to the teacher's level. Regarding this issue, existing methods propose to reduce the difficulty of the teacher's knowledge via a proxy. We argue that these proxy-based methods overlook the knowledge loss of the teacher, which may cause the student to encounter capacity bottlenecks. In this paper, we alleviate the capacity gap problem from a new perspective, with the aim of avoiding knowledge loss. Instead of sacrificing the teacher's knowledge, we propose to build a more powerful student via adversarial collaborative learning. To this end, we further propose an Adversarial Collaborative Knowledge Distillation (ACKD) method that effectively improves the performance of knowledge distillation. Specifically, we construct the student model with multiple auxiliary learners. Meanwhile, we devise an adversarial collaborative module (ACM) that introduces attention mechanisms and adversarial learning to enhance the capacity of the student. Extensive experiments on four classification tasks demonstrate the superiority of the proposed ACKD.
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (a student) generalization by transferring the knowledge from a larger model (a teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems limiting their performance. It is shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates the training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with the smoothed version of this objective and making it more complex as the training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).
Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on CIFAR-10,100 and ImageNet datasets and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
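A schematic sketch of multi-step distillation through teacher assistants: knowledge is passed down a chain of progressively smaller networks. `distill(teacher, student, loader)` is assumed to run one standard KD training loop and return the trained student.

```python
def multi_step_distillation(models, loader, distill):
    """models: list ordered from the largest (teacher) to the smallest (student),
    with intermediate-sized teacher assistants in between."""
    current_teacher = models[0]                # pre-trained large teacher
    for next_model in models[1:]:
        next_model = distill(current_teacher, next_model, loader)
        current_teacher = next_model           # the assistant teaches the next step
    return current_teacher                     # the final, smallest student
```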
Thanks to their event-driven computation, spiking neural networks (SNNs) have emerged as an energy-efficient alternative to conventional artificial neural networks (ANNs). Considering the future deployment of SNN models on resource-constrained neuromorphic devices, many studies have applied techniques originally used for ANN model compression, such as network quantization, pruning, and knowledge distillation, to SNNs. Among them, existing works on knowledge distillation report improved accuracy of the student SNN model. However, an analysis of energy efficiency, which is also an important characteristic of SNNs, has been missing. In this paper, we thoroughly analyze the performance of distilled SNN models in terms of both accuracy and energy efficiency. In the process, we observe a substantial increase in the number of spikes when conventional knowledge distillation methods are used, leading to energy inefficiency. Based on this analysis, to achieve energy efficiency, we propose a novel knowledge distillation method with heterogeneous temperature parameters. We evaluate our method on two different datasets and show that the resulting SNN student achieves both accuracy and a reduced number of spikes. On the MNIST dataset, our proposed student SNN achieves up to 0.09% higher accuracy and 65% fewer spikes compared to the student SNN trained with conventional knowledge distillation. We also compare the results with other SNN compression techniques and training methods.
One of the most efficient methods for model compression is hint distillation, where the student model is injected with information (hints) from several different layers of the teacher model. Although the selection of hint points can drastically alter the compression performance, conventional distillation approaches overlook this fact and use the same hint points as in the early studies. Therefore, we propose a clustering based hint selection methodology, where the layers of the teacher model are clustered with respect to several metrics and the cluster centers are used as the hint points. Our method is applicable to any student network once it is applied on a chosen teacher network. The proposed approach is validated on CIFAR-100 and ImageNet datasets, using various teacher-student pairs and numerous hint distillation methods. Our results show that hint points selected by our algorithm result in superior compression performance compared to state-of-the-art knowledge distillation algorithms on the same student models and datasets.
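An illustrative sketch of clustering-based hint selection: each teacher layer is described by a small vector of layer statistics, the layers are clustered, and the layer nearest to each cluster centre becomes a hint point. The choice of statistics is a placeholder for the metrics used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_hint_layers(layer_stats, num_hints):
    """layer_stats: (num_layers, num_metrics) array, one row per teacher layer."""
    stats = np.asarray(layer_stats, dtype=float)
    km = KMeans(n_clusters=num_hints, n_init=10, random_state=0).fit(stats)

    hint_layers = []
    for center in km.cluster_centers_:
        dists = np.linalg.norm(stats - center, axis=1)
        hint_layers.append(int(dists.argmin()))   # index of the layer closest to the centre
    return sorted(set(hint_layers))
```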
In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into the student model in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions to assign appropriate probabilities to multi-level teacher models. SKD has two advantages: 1) it can preserve the diversity of multi-level teacher models via stochastically sampling a single teacher model in each iteration, and 2) it can also improve the efficacy of knowledge distillation via multi-level teacher models when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.
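A minimal sketch of stochastic teacher sampling: at every iteration one teacher is drawn from a pre-defined ensemble according to a sampling distribution and used for a standard one-to-one KD step. The uniform and capacity-weighted distributions in the trailing comments are illustrative placeholders, not the paper's exact heuristics.

```python
import random
import torch
import torch.nn.functional as F

def skd_step(student, teachers, probs, batch, optimizer, T=4.0, alpha=0.5):
    x, y = batch
    teacher = random.choices(teachers, weights=probs, k=1)[0]  # sample one teacher

    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)

    soft = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * T ** 2
    hard = F.cross_entropy(s_logits, y)
    loss = alpha * soft + (1 - alpha) * hard

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. probs = [1 / len(teachers)] * len(teachers)          # uniform
#      probs = [c / sum(capacities) for c in capacities]    # capacity-weighted
```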
Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief with the following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly-trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationships between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is not fully due to the similarity information between categories from teachers, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or from a manually designed regularization distribution. Tf-KD achieves performance comparable to normal KD from a superior teacher, which is useful when a stronger teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.
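A small sketch of the "manually designed regularization distribution" variant: a virtual teacher puts probability `a` on the correct class and spreads the rest uniformly, and the student distills from it as it would from a real teacher. The hyper-parameter values and the way the temperature is applied to the virtual teacher are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def virtual_teacher_targets(labels, num_classes, a=0.99):
    off = (1.0 - a) / (num_classes - 1)
    targets = torch.full((labels.size(0), num_classes), off, device=labels.device)
    targets.scatter_(1, labels.unsqueeze(1), a)   # probability a on the true class
    return targets

def tf_kd_loss(student_logits, labels, a=0.99, T=20.0, alpha=0.1):
    virtual = virtual_teacher_targets(labels, student_logits.size(1), a)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(torch.log(virtual) / T, dim=1),   # re-temper the virtual teacher
                    reduction="batchmean") * T ** 2
    hard = F.cross_entropy(student_logits, labels)
    return (1 - alpha) * hard + alpha * soft
```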
Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it often does not work as it is commonly understood: there frequently remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to match the teacher perfectly. We identify difficulties in optimization as a key reason why the student is unable to match the teacher. We also show how the details of the dataset used for distillation play a role in how closely the student matches the teacher, and that matching the teacher more closely, paradoxically, does not always lead to better student generalization.
Online knowledge distillation conducts knowledge transfer among all student models to alleviate the reliance on pre-trained models. However, existing online methods rely heavily on the prediction distributions and neglect further exploration of representational knowledge. In this paper, we propose a novel Multi-scale Feature Extraction and Fusion method (MFEF) for online knowledge distillation, which comprises three key components: multi-scale feature extraction, dual-attention, and feature fusion, to generate more informative feature maps for distillation. The multi-scale feature extraction, which exploits divide-and-concatenate in the channel dimension, is proposed to improve the multi-scale representation ability of feature maps. To obtain more accurate information, we design a dual-attention module to adaptively attend to the important channel and spatial regions. Moreover, we aggregate and fuse the previously processed feature maps via feature fusion to assist the training of the student models. Extensive experiments on CIFAR-10, CIFAR-100, and CINIC-10 show that MFEF transfers more beneficial representational knowledge for distillation and outperforms alternative methods across various network architectures.
Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method. Our code is available at https://github.com/zhengli97/CTKD.
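A condensed sketch of a learnable, adversarially trained temperature: a gradient-reversal layer lets the temperature ascend the distillation loss (making the task harder) while the student descends it, and a curriculum factor `lam` scales the reversed gradient from easy to hard over training. This is a simplified reading of the adversarial curriculum described above, not the reference implementation.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reverse (and scale) the gradient

class CurriculumTemperature(torch.nn.Module):
    def __init__(self, init_T=1.0):
        super().__init__()
        self.T = torch.nn.Parameter(torch.tensor(init_T))

    def forward(self, lam):
        # Keep the effective temperature positive.
        return F.softplus(GradReverse.apply(self.T, lam)) + 1e-4

def ctkd_soft_loss(student_logits, teacher_logits, temp_module, lam):
    T = temp_module(lam)
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T.detach() ** 2

# `lam` is typically increased from 0 to 1 following an easy-to-hard schedule.
```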