Knowledge Distillation (KD) transfers the knowledge from a high-capacity teacher model to promote a smaller student model. Existing efforts guide the distillation by matching their prediction logits, feature embeddings, etc., while how to efficiently utilize them in conjunction remains less explored. In this paper, we propose Hint-dynamic Knowledge Distillation, dubbed HKD, which excavates the knowledge from the teacher's hints in a dynamic scheme. The guidance effect of the knowledge hints usually varies across instances and learning stages, which motivates us to customize a specific hint-learning manner for each instance adaptively. Specifically, a meta-weight network is introduced to generate the instance-wise weight coefficients of the knowledge hints in perception of the dynamic learning progress of the student model. We further present a weight ensembling strategy to eliminate the potential bias of coefficient estimation by exploiting the historical statistics. Experiments on the standard benchmarks CIFAR-100 and Tiny-ImageNet manifest that the proposed HKD well boosts the effect of knowledge distillation tasks.
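To make the instance-wise weighting concrete, here is a minimal sketch assuming the hints are per-layer feature-matching (MSE) terms and the meta-weight network is a small MLP over the student's logits; MetaWeightNet, the shapes, and the softmax normalization are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of instance-wise weighted hint losses (illustrative, not HKD's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaWeightNet(nn.Module):
    """Hypothetical meta-weight network: student logits -> one weight per hint."""
    def __init__(self, num_classes: int, num_hints: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_classes, 64), nn.ReLU(),
            nn.Linear(64, num_hints),
        )

    def forward(self, student_logits: torch.Tensor) -> torch.Tensor:
        # Softmax over hints so each instance's weights sum to 1.
        return F.softmax(self.mlp(student_logits.detach()), dim=1)

def weighted_hint_loss(student_feats, teacher_feats, student_logits, meta_net):
    """student_feats / teacher_feats: lists of [B, D] projected features, one per hint."""
    weights = meta_net(student_logits)                                   # [B, num_hints]
    # The paper's weight ensembling over historical statistics could, e.g.,
    # average `weights` with an exponential moving average; omitted here.
    per_hint = torch.stack(
        [F.mse_loss(s, t.detach(), reduction="none").mean(dim=1)
         for s, t in zip(student_feats, teacher_feats)], dim=1)          # [B, num_hints]
    return (weights * per_hint).sum(dim=1).mean()

# Toy usage
B, C, D, H = 4, 100, 128, 3
meta_net = MetaWeightNet(C, H)
loss = weighted_hint_loss([torch.randn(B, D) for _ in range(H)],
                          [torch.randn(B, D) for _ in range(H)],
                          torch.randn(B, C), meta_net)
print(loss.item())
```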
Knowledge Distillation (KD) transfers the knowledge from a high-capacity teacher network to strengthen a smaller student. Existing methods focus on excavating the knowledge hints and transferring the whole knowledge to the student. However, knowledge redundancy arises since the knowledge shows different values to the student at different learning stages. In this paper, we propose Knowledge Condensation Distillation (KCD). Specifically, the knowledge value on each sample is dynamically estimated, and based on an iterative condensation under an Expectation-Maximization (EM) framework, a compact knowledge set is delineated from the teacher to guide the student's learning. Our approach is easily built on top of off-the-shelf KD methods, with no extra training parameters and negligible computation overhead. It thus presents a new perspective for KD, in which a student that actively identifies the teacher's knowledge can learn more effectively and efficiently. Experiments on standard benchmarks show that the proposed KCD well boosts the performance of the student model with even higher distillation efficiency. Code is available at https://github.com/dzy3/kcd.
Knowledge distillation was initially introduced to utilize additional supervision from a single teacher model for the student model training. To boost the student performance, some recent variants attempt to exploit diverse knowledge sources from multiple teachers. However, existing studies mainly integrate knowledge from diverse sources by averaging over multiple teacher predictions or combining them with other label-free strategies, which may mislead the student in the presence of low-quality teacher predictions. To address this problem, we propose Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD), which adaptively assigns sample-wise reliability to each teacher prediction with the help of ground-truth labels, assigning large weights to those teacher predictions close to the one-hot labels. Besides, CA-MKD incorporates intermediate layers to further improve student performance. Extensive experiments show that our CA-MKD consistently outperforms all compared state-of-the-art methods across various teacher-student architectures.
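As a rough reading of the sample-wise reliability idea, the sketch below weights each teacher by the softmax of its negative per-sample cross-entropy against the ground-truth label, so predictions close to the one-hot label get large weights; the exact weighting and the intermediate-layer term of CA-MKD are not reproduced.

```python
# Minimal sketch of confidence-aware weighting over multiple teachers (an assumption,
# not CA-MKD's exact formulation).
import torch
import torch.nn.functional as F

def ca_mkd_loss(student_logits, teacher_logits_list, targets, T: float = 4.0):
    # Per-sample cross-entropy of each teacher against the ground-truth labels.
    ce = torch.stack(
        [F.cross_entropy(t_logits, targets, reduction="none")
         for t_logits in teacher_logits_list], dim=1)                    # [B, K]
    weights = F.softmax(-ce, dim=1)                                      # low CE -> high weight

    log_p_s = F.log_softmax(student_logits / T, dim=1)                   # [B, C]
    kd = torch.stack(
        [F.kl_div(log_p_s, F.softmax(t_logits / T, dim=1), reduction="none").sum(dim=1)
         for t_logits in teacher_logits_list], dim=1)                    # [B, K]
    return (weights * kd).sum(dim=1).mean() * (T * T)

# Toy usage
B, C, K = 8, 100, 3
student = torch.randn(B, C)
teachers = [torch.randn(B, C) for _ in range(K)]
labels = torch.randint(0, C, (B,))
print(ca_mkd_loss(student, teachers, labels).item())
```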
Knowledge distillation has become an important approach to obtaining a compact yet effective model. To achieve this goal, a small student model is trained to exploit the knowledge of a large well-trained teacher model. However, due to the capacity gap between the teacher and the student, the student's performance can hardly reach the level of the teacher. Regarding this issue, existing methods propose to reduce the difficulty of the teacher's knowledge via a proxy. We argue that these proxy-based methods overlook the knowledge loss of the teacher, which may cause the student to encounter capacity bottlenecks. In this paper, we alleviate the capacity gap problem from a new perspective, with the purpose of avoiding knowledge loss. Instead of sacrificing part of the teacher's knowledge, we propose to build a more powerful student via adversarial collaborative learning. To this end, we further propose an Adversarial Collaborative Knowledge Distillation (ACKD) method that effectively improves the performance of knowledge distillation. Specifically, we construct the student model with multiple auxiliary learners. Meanwhile, we devise an adversarial collaborative module (ACM) that introduces an attention mechanism and adversarial learning to enhance the capacity of the student. Extensive experiments on four classification tasks show the superiority of the proposed ACKD.
One of the most efficient methods for model compression is hint distillation, where the student model is injected with information (hints) from several different layers of the teacher model. Although the selection of hint points can drastically alter the compression performance, conventional distillation approaches overlook this fact and use the same hint points as in the early studies. Therefore, we propose a clustering-based hint selection methodology, where the layers of the teacher model are clustered with respect to several metrics and the cluster centers are used as the hint points. Our method is applicable to any student network once it is applied to a chosen teacher network. The proposed approach is validated on the CIFAR-100 and ImageNet datasets, using various teacher-student pairs and numerous hint distillation methods. Our results show that hint points selected by our algorithm result in superior compression performance compared to state-of-the-art knowledge distillation algorithms on the same student models and datasets.
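A minimal sketch of the clustering step, assuming each teacher layer is summarized by a small vector of hand-picked statistics and the layer nearest to each k-means centroid becomes a hint point; the metric choice here is purely illustrative.

```python
# Minimal sketch of clustering-based hint-point selection (illustrative assumptions).
import numpy as np
from sklearn.cluster import KMeans

def select_hint_layers(layer_metrics: np.ndarray, num_hints: int, seed: int = 0):
    """layer_metrics: [num_layers, num_metrics] array describing each teacher layer."""
    km = KMeans(n_clusters=num_hints, random_state=seed, n_init=10).fit(layer_metrics)
    hint_layers = []
    for center in km.cluster_centers_:
        # Pick the actual layer closest to each cluster center.
        hint_layers.append(int(np.argmin(np.linalg.norm(layer_metrics - center, axis=1))))
    return sorted(set(hint_layers))

# Toy usage: 16 layers described by, e.g., depth ratio and mean activation norm.
rng = np.random.default_rng(0)
metrics = np.column_stack([np.linspace(0, 1, 16), rng.random(16)])
print(select_hint_layers(metrics, num_hints=4))
```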
In addition to standard supervised learning with hard labels, auxiliary losses are commonly used in many supervised learning settings to improve model generalization. For example, knowledge distillation adds a second teacher-mimicking loss to the model training, where the teacher may be a pretrained model that outputs a richer distribution over labels. Similarly, in settings where labeled data is limited, weak labeling information is used in the form of labeling functions. Here, auxiliary losses are introduced against the labeling functions, which may be noisy rule-based approximations of the true labels. We address the problem of learning to combine these losses in a principled manner. We introduce AMAL, which uses meta-learning over a validation metric to learn instance-specific weights for an optimal mixing of the losses. Experiments in a number of knowledge distillation and rule-denoising domains demonstrate that AMAL provides noticeable gains over competitive baselines in these domains. We empirically analyze our method and share insights into the mechanisms by which it provides performance gains.
Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method. Our code is available at https://github.com/zhengli97/CTKD.
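The sketch below illustrates the learnable, adversarially updated temperature with a gradient-reversal layer and a curriculum factor lam; the scalar parameterization, the clamp, and the schedule are simplifying assumptions, not CTKD's exact configuration.

```python
# Minimal sketch of a learnable temperature trained to *increase* the KD loss,
# via gradient reversal with a curriculum factor lam (illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class CurriculumTemperature(nn.Module):
    def __init__(self, init_t: float = 4.0):
        super().__init__()
        self.t = nn.Parameter(torch.tensor(init_t))

    def forward(self, lam: float) -> torch.Tensor:
        # Reversed gradient: the optimizer pushes t toward a harder distillation task.
        return GradReverse.apply(self.t, lam).clamp(min=1.0)

def kd_loss(student_logits, teacher_logits, temperature):
    log_p_s = F.log_softmax(student_logits / temperature, dim=1)
    p_t = F.softmax(teacher_logits / temperature, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

# Toy usage: lam would follow an easy-to-hard schedule, e.g. lam = min(1.0, epoch / 20).
ctemp = CurriculumTemperature()
loss = kd_loss(torch.randn(8, 100), torch.randn(8, 100), ctemp(lam=0.5))
loss.backward()
print(float(ctemp.t.grad))
```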
Knowledge distillation (KD) has gained a lot of attention in the field of model compression for edge devices thanks to its effectiveness in compressing large powerful networks into smaller lower-capacity models. Online distillation, in which both the teacher and the student are learning collaboratively, has also gained much interest due to its ability to improve on the performance of the networks involved. The Kullback-Leibler (KL) divergence ensures the proper knowledge transfer between the teacher and student. However, most online KD techniques present some bottlenecks under the network capacity gap. By cooperatively and simultaneously training the models, the KL distance becomes incapable of properly minimizing the discrepancy between the teacher's and student's distributions. Alongside accuracy, critical edge device applications are in need of well-calibrated compact networks. Confidence calibration provides a sensible way of getting trustworthy predictions. We propose BD-KD: Balancing of Divergences for online Knowledge Distillation. We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network without limiting the teacher network's learning process. We demonstrate that, by performing this balancing design at the level of the student distillation loss, we improve upon both performance accuracy and calibration of the compact student network. We conducted extensive experiments using a variety of network architectures and show improvements on multiple datasets including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. We illustrate the effectiveness of our approach through comprehensive comparisons and ablations with current state-of-the-art online and offline KD techniques.
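To illustrate the divergence-balancing idea at the level of the student loss, the sketch below mixes forward and reverse KL terms; the adaptive rule used here (leaning on the reverse term when the student is overconfident relative to the teacher) is a placeholder assumption, not BD-KD's criterion.

```python
# Minimal sketch of a student loss mixing forward and reverse KL (placeholder balancing).
import torch
import torch.nn.functional as F

def balanced_kd_loss(student_logits, teacher_logits, T: float = 4.0):
    p_s = F.softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    log_p_t = F.log_softmax(teacher_logits / T, dim=1)

    forward_kl = (p_t * (log_p_t - log_p_s)).sum(dim=1).mean()   # KL(teacher || student)
    reverse_kl = (p_s * (log_p_s - log_p_t)).sum(dim=1).mean()   # KL(student || teacher)

    # Placeholder balancing: lean on the reverse term when the student is
    # overconfident (low entropy) relative to the teacher.
    with torch.no_grad():
        ent_s = -(p_s * log_p_s).sum(dim=1).mean()
        ent_t = -(p_t * log_p_t).sum(dim=1).mean()
        beta = torch.sigmoid(ent_t - ent_s)
    return (T * T) * ((1.0 - beta) * forward_kl + beta * reverse_kl)

print(balanced_kd_loss(torch.randn(8, 100), torch.randn(8, 100)).item())
```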
Data-Free Knowledge Distillation (DFKD) has recently attracted increasing attention from the research community, attributed to its capability to compress a model using only synthetic data. Despite the encouraging results, state-of-the-art DFKD methods still suffer from the inefficiency of data synthesis, making the data-free training process extremely time-consuming and thus inapplicable to large-scale tasks. In this work, we introduce an efficient scheme, termed FastDFKD, that allows us to accelerate DFKD by orders of magnitude. At the heart of our approach is a novel strategy to reuse the shared common features in training data so as to synthesize different data instances. Unlike prior methods that optimize a set of data independently, we propose to learn a meta synthesizer that seeks common features as the initialization for fast data synthesis. As a result, FastDFKD achieves data synthesis within only a few steps, significantly boosting the efficiency of data-free training. Experiments on CIFAR, NYUv2, and ImageNet demonstrate that the proposed FastDFKD achieves 10$\times$ and even 100$\times$ acceleration while preserving performance on par with the state of the art.
Knowledge distillation (KD) is an effective tool for compressing deep classification models for edge devices. However, the performance of KD is affected by the large capacity gap between the teacher and student networks. Recent methods have resorted to a multiple teacher assistant (TA) setting for KD, which sequentially decreases the size of the teacher model to relatively bridge the size gap between these models. This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation to efficiently enhance the learning of a compact student under the capacity gap problem. This technique is built upon the hypothesis that a student network should be guided gradually using a stratified teaching curriculum, as it learns easy (hard) data samples better from a lower (higher) capacity teacher network. Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty of classifying the image. In this work, we empirically verify our hypothesis and rigorously experiment with the CIFAR-10, CIFAR-100, CINIC-10, and ImageNet datasets, showing improved accuracy on VGG-like models, ResNets, and WideResNets architectures.
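A minimal sketch of per-image expert selection, assuming difficulty is proxied by the largest teacher's per-sample cross-entropy and samples are binned by difficulty quantile onto TAs ordered from low to high capacity; the paper's actual curriculum schedule is not modeled.

```python
# Minimal sketch of per-image teacher-assistant (TA) selection (illustrative assumptions).
import torch
import torch.nn.functional as F

def select_ta_indices(teacher_logits, targets, num_tas: int):
    difficulty = F.cross_entropy(teacher_logits, targets, reduction="none")   # [B]
    # Rank within the batch and map quantiles to TA indices 0..num_tas-1.
    ranks = difficulty.argsort().argsort().float() / max(len(difficulty) - 1, 1)
    return (ranks * (num_tas - 1)).round().long()                             # easy->0, hard->K-1

def curriculum_kd_loss(student_logits, ta_logits_list, teacher_logits, targets, T=4.0):
    ta_idx = select_ta_indices(teacher_logits, targets, num_tas=len(ta_logits_list))
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    loss = 0.0
    for k, ta_logits in enumerate(ta_logits_list):
        mask = (ta_idx == k)
        if mask.any():
            p_t = F.softmax(ta_logits[mask] / T, dim=1)
            log_p_t = F.log_softmax(ta_logits[mask] / T, dim=1)
            loss = loss + (p_t * (log_p_t - log_p_s[mask])).sum(dim=1).sum()
    return (T * T) * loss / student_logits.size(0)

# Toy usage
B, C = 16, 10
print(curriculum_kd_loss(torch.randn(B, C),
                         [torch.randn(B, C) for _ in range(3)],
                         torch.randn(B, C),
                         torch.randint(0, C, (B,))).item())
```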
Online knowledge distillation conducts knowledge transfer among all student models to alleviate the reliance on pre-trained models. However, existing online methods rely heavily on the prediction distributions and neglect further exploration of representational knowledge. In this paper, we propose a novel Multi-scale Feature Extraction and Fusion method (MFEF) for online knowledge distillation, which comprises three key components, Multi-scale Feature Extraction, Dual-attention, and Feature Fusion, to generate more informative feature maps for distillation. Multi-scale feature extraction exploiting split-and-concatenate in the channel dimension is proposed to improve the multi-scale representation ability of feature maps. To obtain more accurate information, we design a dual-attention to adaptively strengthen the important channel and spatial regions. Moreover, we aggregate and fuse the previously processed feature maps via feature fusion to assist the training of the student models. Extensive experiments on CIFAR-10, CIFAR-100, and CINIC-10 show that MFEF transfers more beneficial representational knowledge for distillation and outperforms alternative methods among various network architectures.
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (a student) generalization by transferring the knowledge from a larger model (a teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems limiting their performance. It is shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates the training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with the smoothed version of this objective and making it more complex as the training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).
Knowledge distillation has achieved remarkable achievements in model compression. However, most existing methods require the original training data, which is usually unavailable in practice due to privacy, security, and transmission limitations. To address this problem, we propose a conditional generative data-free knowledge distillation (CGDD) framework for training efficient portable networks without any real data. In this framework, besides using the knowledge extracted from the teacher model, we introduce preset labels as additional auxiliary information to train the generator. Then, the trained generator can produce meaningful training samples of specified categories as required. To facilitate the distillation process, in addition to using the conventional distillation loss, we treat the preset labels as ground-truth labels so that the student network is directly supervised by the categories of the synthetic training samples. Moreover, we force the student network to mimic the attention maps of the teacher model, further improving its performance. To verify the superiority of our method, we design a new evaluation metric called relative accuracy, which can directly compare the effectiveness of different distillation methods. The portable networks trained with the proposed data-free distillation method achieve 99.63%, 99.07%, and 99.84% relative accuracy on CIFAR10, CIFAR100, and Caltech101, respectively. The experimental results demonstrate the superiority of the proposed method.
Knowledge distillation is an effective and stable method for model compression via knowledge transfer. Conventional knowledge distillation (KD) transfers knowledge from a large and well-trained teacher network to a small student network, which is a one-way process. Recently, deep mutual learning (DML) has been proposed to help student networks learn collaboratively and simultaneously. However, to the best of our knowledge, KD and DML have never been jointly explored in a unified framework to solve the knowledge distillation problem. In this paper, we observe that the teacher model provides more trustworthy supervision signals in KD, while the student captures more similar behaviors to the teacher in DML. Based on these observations, we first propose to combine KD with DML in a unified framework. Furthermore, we propose a Semi-Online Knowledge Distillation (SOKD) method that effectively improves the performance of both the student and the teacher. In this method, we introduce the peer-teaching training fashion in DML to alleviate the student's imitation difficulty, and also leverage the supervision signals provided by the well-trained teacher in KD. Besides, we show that our framework can be easily extended to feature-based distillation methods. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate that the proposed method achieves state-of-the-art performance.
Binary neural networks are the extreme case of network quantization, which has long been thought of as a potential edge machine learning solution. However, the significant accuracy gap to the full-precision counterparts restricts their creative potential for mobile applications. In this work, we revisit the potential of binary neural networks and focus on a compelling but unanswered problem: how can a binary neural network achieve the crucial accuracy level (e.g., 80%) on ILSVRC-2012 ImageNet? We achieve this goal by enhancing the optimization process from three complementary perspectives: (1) We design a novel binary architecture BNext based on a comprehensive study of binary architectures and their optimization process. (2) We propose a novel knowledge-distillation technique to alleviate the counter-intuitive overfitting problem observed when attempting to train extremely accurate binary models. (3) We analyze the data augmentation pipeline for binary networks and modernize it with up-to-date techniques from full-precision models. The evaluation results on ImageNet show that BNext, for the first time, pushes the binary model accuracy boundary to 80.57% and significantly outperforms all the existing binary networks. Code and trained models are available at: https://github.com/hpi-xnor/BNext.git.
Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief by the following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly-trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationships between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is not fully due to the similarity information between categories from teachers, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or a manually designed regularization distribution. Tf-KD achieves comparable performance with normal KD from a superior teacher, and is thus well suited when a stronger teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.
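The "manually designed regularization distribution" can be pictured as a virtual teacher that puts high probability on the correct class and spreads the remainder uniformly; the sketch below uses illustrative constants and a simplified softening, not the paper's exact settings.

```python
# Minimal sketch of a hand-designed virtual teacher for teacher-free KD (illustrative).
import torch
import torch.nn.functional as F

def virtual_teacher(targets, num_classes: int, a: float = 0.9):
    # Probability a on the correct class, the rest spread uniformly over other classes.
    p = torch.full((targets.size(0), num_classes), (1.0 - a) / (num_classes - 1))
    p.scatter_(1, targets.unsqueeze(1), a)
    return p

def tf_kd_loss(student_logits, targets, a: float = 0.9, T: float = 4.0, alpha: float = 0.1):
    ce = F.cross_entropy(student_logits, targets)
    p_t = virtual_teacher(targets, student_logits.size(1), a)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    kd = (p_t * (torch.log(p_t) - log_p_s)).sum(dim=1).mean()
    return (1 - alpha) * ce + alpha * kd

print(tf_kd_loss(torch.randn(8, 100), torch.randint(0, 100, (8,))).item())
```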
Online Knowledge Distillation (OKD) improves the involved models by reciprocally exploiting the difference between the teacher and the student. Several crucial bottlenecks concerning the gap between them have received limited formal study, for example: why and when does a large gap harm the performance, especially for the student, and how can the gap between the teacher and the student be quantified? In this paper, we propose Switchable Online Knowledge Distillation (SwitOKD) to answer these questions. Instead of focusing on the accuracy gap at the test phase, the core idea of SwitOKD is to adaptively calibrate the gap at the training phase, namely the distillation gap, via a switching strategy between two modes: expert mode (pause the teacher while keeping the student learning) and learning mode (restart the teacher). To possess an appropriate distillation gap, we further devise an adaptive switching threshold, which provides a formal criterion on when to switch to learning mode or expert mode, and thus improves the student's performance. Meanwhile, the teacher benefits from our adaptive switching threshold and keeps basically on par with other online arts. We further extend SwitOKD to multiple networks with two basis topologies. Finally, extensive experiments and analysis validate the merits of SwitOKD for classification over the state-of-the-arts. Our code is available at https://github.com/hfutqian/switokd.
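A minimal sketch of the two-mode switching idea: here the distillation gap is measured as the L1 distance between teacher and student softmax outputs, and the threshold is a fixed placeholder rather than the paper's adaptive criterion.

```python
# Minimal sketch of mode switching based on a distillation gap (placeholder threshold).
import torch
import torch.nn.functional as F

def choose_mode(student_logits, teacher_logits, threshold: float = 0.3) -> str:
    with torch.no_grad():
        gap = (F.softmax(teacher_logits, dim=1)
               - F.softmax(student_logits, dim=1)).abs().sum(dim=1).mean()
    # Large gap: pause the teacher and let the student catch up (expert mode);
    # small gap: restart the teacher so both keep learning (learning mode).
    return "expert" if gap.item() > threshold else "learning"

# Toy usage
print(choose_mode(torch.randn(8, 10), torch.randn(8, 10)))
```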
With the improvement of AI chips (e.g., GPU, TPU, and NPU) and the rapid development of the Internet of Things (IoT), some powerful deep neural networks (DNNs) usually consist of millions or even hundreds of millions of parameters, which may not be suitable for direct deployment on low-computation and low-capacity units (e.g., edge devices). Recently, knowledge distillation (KD) has been recognized as one of the effective methods of model compression to reduce the model parameters. The main concept of KD is to extract useful information from the feature maps of a large model (i.e., the teacher model) as a reference to successfully train a small model (i.e., the student model), whose model size is much smaller than that of the teacher. Although many KD-based methods have been proposed to utilize the information in the feature maps of intermediate layers of the teacher model, most of them do not consider the similarity of the feature maps between the teacher model and the student model, which may let the student model learn useless information. Inspired by the attention mechanism, we propose a novel KD method called Representative Teacher Key (RTK), which not only considers the similarity of the feature maps but also filters out useless information to improve the performance of the target student model. In the experiments, we validate our proposed method with multiple backbone networks (e.g., ResNet and WideResNet) and datasets (e.g., CIFAR10, CIFAR100, SVHN, and CINIC10). The results show that our proposed RTK can effectively improve the classification accuracy of state-of-the-art attention-based KD methods.
Although deep models have shown promising performance in medical image segmentation, they heavily rely on a large amount of annotated data, which is difficult to access, especially in clinical practice. On the other hand, highly accurate deep models usually come in large model sizes, limiting their use in practical scenarios. In this work, we propose a novel asymmetric co-teacher framework, ACT-NET, to alleviate the burden of expensive annotations and computational costs via semi-supervised knowledge distillation. We advance teacher-student learning with a co-teacher network to facilitate asymmetric knowledge distillation from large models to small ones by alternating student and teacher roles, obtaining tiny but accurate models for clinical deployment. To verify the effectiveness of our ACT-NET, we employ the ACDC dataset for cardiac substructure segmentation in our experiments. Extensive experimental results show that ACT-NET outperforms other knowledge distillation methods and achieves lossless segmentation performance with 250x fewer parameters.
Knowledge distillation in machine learning is the process of transferring knowledge from a large model, called the teacher, to a smaller model, called the student. Knowledge distillation is one of the techniques to compress a large network (teacher) into a smaller network (student) that can be deployed on small devices such as mobile phones. When the network size gap between the teacher and student increases, the performance of the student network decreases. To solve this problem, an intermediate model known as the teacher assistant model is employed between the teacher model and the student model, which in turn bridges the gap between the teacher and the student. In this research, we show that using multiple teacher assistant models, the student model (the smaller model) can be further improved. We combine these multiple teacher assistant models using weighted ensemble learning, where we use a differential evolution optimization algorithm to generate the weight values.
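A minimal sketch of the weighted ensemble of teacher assistants, using SciPy's differential evolution to search the weights; the validation objective (negative log-likelihood of the mixed predictions) is an assumption for illustration.

```python
# Minimal sketch of weighting TA models with differential evolution (illustrative objective).
import numpy as np
from scipy.optimize import differential_evolution

def ensemble_nll(weights, ta_probs, labels):
    """ta_probs: [K, N, C] softmax outputs of K TAs on a held-out set; labels: [N]."""
    w = np.abs(weights) / (np.abs(weights).sum() + 1e-12)          # normalize to a simplex
    mix = np.tensordot(w, ta_probs, axes=1)                        # [N, C] mixed predictions
    return -np.mean(np.log(mix[np.arange(len(labels)), labels] + 1e-12))

def fit_ensemble_weights(ta_probs, labels, seed: int = 0):
    k = ta_probs.shape[0]
    result = differential_evolution(ensemble_nll, bounds=[(0.0, 1.0)] * k,
                                    args=(ta_probs, labels), seed=seed)
    return np.abs(result.x) / (np.abs(result.x).sum() + 1e-12)

# Toy usage: 3 TAs, 64 held-out samples, 10 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 64, 10))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
labels = rng.integers(0, 10, size=64)
print(fit_ensemble_weights(probs, labels))
```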