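Ensembles of deep neural networks have demonstrated superior performance, but their heavy computational cost hinders their application in resource-limited environments. This motivates distilling knowledge from the ensemble teacher into a smaller student network, and there are two important design choices for such ensemble distillation: 1) how to construct the student network, and 2) what data should be shown during training. In this paper, we propose an averaging technique in which a student with multiple subnetworks is trained to absorb the functional diversity of the ensemble teachers, and those subnetworks are then properly averaged for inference, giving a single student network with no additional inference cost. We also propose a perturbation strategy that seeks inputs from which the teachers' diversity can be better transferred to the student. Combining these two, our method significantly improves upon previous methods on various image classification tasks.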
translated by Google Translate
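Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases when the student has the capacity to perfectly match the teacher. We identify difficulties in optimization as a key reason why the student is unable to match the teacher. We also show how the details of the dataset used for distillation play a role in how closely the student matches the teacher, and that matching the teacher more closely paradoxically does not always lead to better student generalization.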
Figure 1. An illustration of standard knowledge distillation. Despite widespread use, an understanding of when the student can learn from the teacher is missing.
Knowledge distillation (KD) has gained a lot of attention in the field of model compression for edge devices thanks to its effectiveness in compressing large, powerful networks into smaller, lower-capacity models. Online distillation, in which both the teacher and the student learn collaboratively, has also gained much interest due to its ability to improve the performance of the networks involved. The Kullback-Leibler (KL) divergence ensures proper knowledge transfer between the teacher and student. However, most online KD techniques present some bottlenecks under the network capacity gap. When the models are trained cooperatively and simultaneously, the KL divergence becomes incapable of properly minimizing the discrepancy between the teacher's and student's distributions. Alongside accuracy, critical edge device applications need well-calibrated compact networks. Confidence calibration provides a sensible way of getting trustworthy predictions. We propose BD-KD: Balancing of Divergences for online Knowledge Distillation. We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network without limiting the teacher network's learning process. We demonstrate that, by performing this balancing design at the level of the student distillation loss, we improve both the accuracy and the calibration of the compact student network. We conducted extensive experiments using a variety of network architectures and show improvements on multiple datasets including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. We illustrate the effectiveness of our approach through comprehensive comparisons and ablations with current state-of-the-art online and offline KD techniques.
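As a rough illustration of what "forward" and "reverse" divergences mean here, the sketch below computes both directions of the KL divergence between a teacher and a student distribution and mixes them with a fixed weight. The naming convention (forward = KL(teacher || student)), the function `balanced_divergence`, and the constant `forward_weight` are assumptions made for illustration; the abstract does not specify how the adaptive balancing is computed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def balanced_divergence(teacher_logits, student_logits, forward_weight=0.5):
    """Mix forward and reverse KL with a fixed weight (hypothetical stand-in
    for an adaptive scheme)."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    forward = kl(p_teacher, p_student)  # assumed convention: KL(teacher || student)
    reverse = kl(p_student, p_teacher)  # assumed convention: KL(student || teacher)
    mixed = forward_weight * forward + (1.0 - forward_weight) * reverse
    return mixed, forward, reverse

mixed, forward, reverse = balanced_divergence([4.0, 1.0, 0.0], [1.0, 1.0, 1.0])
print(f"forward={forward:.4f} reverse={reverse:.4f} mixed={mixed:.4f}")
```

The two directions generally give different values for the same pair of distributions, which is why the choice of weighting between them can change what the student is pushed toward.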
Although deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large pre-trained (teacher) network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher; in other words, a teacher can effectively transfer its knowledge to students down to a certain size, but not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
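Online knowledge distillation (OKD) improves the involved models by reciprocally exploiting the difference between teacher and student. Several crucial bottlenecks concerning the gap between them (e.g., why and when does a large gap harm performance, especially for the student, and how can the gap between teacher and student be quantified?) have received limited formal study. In this paper, we propose Switchable Online Knowledge Distillation (SwitOKD) to answer these questions. Instead of focusing on the accuracy gap at the test phase, the core idea of SwitOKD is to adaptively calibrate the gap at the training phase, namely the distillation gap, via a switching strategy between two modes: expert mode (pause the teacher while the student keeps learning) and learning mode (restart the teacher). To maintain an appropriate distillation gap, we further devise an adaptive switching threshold, which provides a formal criterion for when to switch to learning mode or expert mode, and thus improves the student's performance. Meanwhile, the teacher benefits from our adaptive switching threshold and basically keeps abreast of other online methods. We further extend SwitOKD to multiple networks with two basis topologies. Finally, extensive experiments and analysis validate the merits of SwitOKD for classification over state-of-the-art methods. Our code is available at https://github.com/hfutqian/switokd.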
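Recent studies of knowledge distillation have found that ensembling the "dark knowledge" from multiple teachers or students helps create better soft targets for training, but at the cost of significantly more computation and/or parameters. In this work, we present BAtch Knowledge Ensembling (BAKE), which produces refined soft targets for anchor images by propagating and ensembling the knowledge of the other samples in the same mini-batch. Specifically, for each sample of interest, the propagation of knowledge is weighted according to the inter-sample affinities, which are estimated on the fly with the current network. The propagated knowledge can then be ensembled to form a better soft target for distillation. In this way, our BAKE framework achieves online knowledge ensembling across multiple samples with only a single network. It requires minimal computational and memory overhead compared to existing knowledge ensembling methods. Extensive experiments demonstrate that the lightweight yet effective BAKE consistently boosts the classification performance of various architectures on multiple datasets, e.g., a significant +0.7% gain for Swin-T on ImageNet with only +1.5% computational overhead and zero additional parameters. BAKE not only improves the vanilla baselines but also surpasses the single-network state of the art on all benchmarks.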
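Although deep neural networks have enjoyed remarkable success across a wide variety of tasks, their ever-increasing size also imposes significant overhead on deployment. To compress these models, knowledge distillation was proposed to transfer knowledge from a cumbersome (teacher) network into a lightweight (student) network. However, guidance from a teacher does not always improve the generalization of students, especially when the gap between student and teacher is large. Previous works argued that this is due to the high certainty of the teacher, which results in harder labels that are difficult to fit. To soften these labels, we present a pruning method termed Prediction Uncertainty Enlargement (PrUE) to simplify the teacher. Specifically, our method aims to decrease the teacher's certainty about the data, thereby generating soft predictions for students. We empirically investigate the effectiveness of the proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet. The results indicate that student networks trained with sparse teachers achieve better performance. In addition, our method allows researchers to distill knowledge from deeper networks to improve students further. Our code is publicly available at https://github.com/wangshaopu/prue.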
Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation.
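For reference, the standard objective mentioned above (KL divergence between teacher and student output distributions) can be sketched as follows. The temperature parameter and the T^2 scaling follow the commonly used formulation of knowledge distillation and are assumptions here, since this abstract does not state them; the function names and example logits are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on softened outputs, scaled by T^2 (assumed form)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return temperature ** 2 * kl(p_teacher, p_student)

teacher_logits = [5.0, 2.0, 0.5]
student_logits = [3.0, 2.5, 0.1]
loss = kd_loss(teacher_logits, student_logits)
loss_when_matched = kd_loss(teacher_logits, teacher_logits)
print(f"kd loss: {loss:.4f}, loss when student equals teacher: {loss_when_matched:.4f}")
```

Note that this loss only compares per-example output distributions, which is the limitation the contrastive objective in the abstract is meant to address.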
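State-of-the-art distillation methods are mainly based on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. To provide a novel viewpoint for studying logit distillation, we reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). We empirically investigate and prove the effects of the two parts: TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of NCKD and (2) limits the flexibility to balance these two parts. To address these issues, we present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly. Compared with complex feature-based methods, our DKD achieves comparable or even better results and has better training efficiency on the CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object detection tasks. This paper demonstrates the great potential of logit distillation, and we hope it will be helpful for future research. The code is available at https://github.com/megvii-research/mdistiller.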
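The decomposition can be checked numerically. The sketch below splits KL(teacher || student) into a binary target-versus-rest term (TCKD) and a term over the renormalized non-target classes (NCKD), and verifies the algebraic identity KD = TCKD + (1 - p_target) * NCKD, where p_target is the teacher's probability for the target class. The abstract does not write this identity out, so treat the exact form as an assumption of this sketch; the decoupled weights `alpha` and `beta`, their default values, and the example logits are hypothetical, and temperature is omitted for brevity.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def decompose_kd(teacher_logits, student_logits, target):
    """Return (kd, tckd, nckd, teacher probability of the target class)."""
    p_t = softmax(teacher_logits)
    p_s = softmax(student_logits)
    kd = kl(p_t, p_s)
    # TCKD: binary distributions over {target class, all other classes}.
    tckd = kl([p_t[target], 1.0 - p_t[target]], [p_s[target], 1.0 - p_s[target]])
    # NCKD: distributions over non-target classes, renormalized to sum to 1.
    rest_t = [p / (1.0 - p_t[target]) for i, p in enumerate(p_t) if i != target]
    rest_s = [p / (1.0 - p_s[target]) for i, p in enumerate(p_s) if i != target]
    nckd = kl(rest_t, rest_s)
    return kd, tckd, nckd, p_t[target]

def decoupled_loss(tckd, nckd, alpha=1.0, beta=8.0):
    """Weight the two parts independently (alpha and beta are hypothetical)."""
    return alpha * tckd + beta * nckd

kd, tckd, nckd, p_target = decompose_kd([6.0, 2.0, 1.0, 0.5], [3.0, 2.5, 0.5, 0.2], target=0)
coupled = tckd + (1.0 - p_target) * nckd
print(f"kd={kd:.6f} coupled={coupled:.6f} decoupled={decoupled_loss(tckd, nckd):.6f}")
```

In the coupled form, a confident teacher (p_target close to 1) shrinks the NCKD contribution toward zero, which is one way to read the claim that the classical loss suppresses NCKD.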
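Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), and is often used in the context of model compression. When both models have the same architecture, this procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this work, we systematically study self-distillation in a number of settings. We first show that even with a highly accurate teacher, self-distillation allows a student to surpass the teacher in all cases. Secondly, we revisit existing theoretical explanations of (self-)distillation and identify contradicting examples, revealing possible drawbacks of these explanations. Finally, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss landscape geometry. We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.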
Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method. Our code is available at https://github.com/zhengli97/CTKD.
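Knowledge distillation is an effective and stable method for model compression via knowledge transfer. Conventional knowledge distillation (KD) transfers knowledge from a large, well pre-trained teacher network to a small student network, which is a one-way process. Recently, deep mutual learning (DML) has been proposed to help student networks learn collaboratively and simultaneously. However, to the best of our knowledge, KD and DML have never been jointly explored in a unified framework to solve the knowledge distillation problem. In this paper, we find that the teacher model provides more trustworthy supervision signals in KD, while the student captures behaviors more similar to the teacher's in DML. Based on these observations, we first propose to combine KD with DML in a unified framework. Furthermore, we propose a Semi-Online Knowledge Distillation (SOKD) method that effectively improves the performance of both the student and the teacher. In this method, we introduce the peer-teaching training fashion of DML to alleviate the student's imitation difficulty, and also leverage the supervision signals provided by the well-trained teacher in KD. In addition, we show that our framework can be easily extended to feature-based distillation methods. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate that the proposed method achieves state-of-the-art performance.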
Unlike existing knowledge distillation methods, which focus on baseline settings where the teacher models and training strategies are not as strong and competitive as state-of-the-art approaches, this paper presents a method dubbed DIST to distill better from a stronger teacher. We empirically find that the discrepancy between the predictions of the student and a stronger teacher tends to be considerably more severe. As a result, the exact match of predictions in KL divergence would disturb the training and make existing methods perform poorly. In this paper, we show that simply preserving the relations between the predictions of teacher and student would suffice, and propose a correlation-based loss to capture the intrinsic inter-class relations from the teacher explicitly. Besides, considering that different instances have different semantic similarities to each class, we also extend this relational match to the intra-class level. Our method is simple yet practical, and extensive experiments demonstrate that it adapts well to various architectures, model sizes and training strategies, and can achieve state-of-the-art performance consistently on image classification, object detection, and semantic segmentation tasks. Code is available at: https://github.com/hunto/DIST_KD .
While depth tends to improve network performance, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks. In this paper, we extend this idea to allow the training of a student that is deeper and thinner than the teacher, using not only the outputs but also the intermediate representations learned by the teacher as hints to improve the training process and final performance of the student. Because the student intermediate hidden layer will generally be smaller than the teacher's intermediate hidden layer, additional parameters are introduced to map the student hidden layer to the prediction of the teacher hidden layer. This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times fewer parameters outperforms a larger, state-of-the-art teacher network.
Knowledge Distillation (KD) consists of transferring "knowledge" from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student's compactness without sacrificing too much performance. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes.
Knowledge distillation is a widely applicable technique for training a student neural network under the guidance of a trained teacher network. For example, in neural network compression, a high-capacity teacher is distilled to train a compact student; in privileged learning, a teacher trained with privileged data is distilled to train a student without access to that data. The distillation loss determines how a teacher's knowledge is captured and transferred to the student. In this paper, we propose a new form of knowledge distillation loss that is inspired by the observation that semantically similar inputs tend to elicit similar activation patterns in a trained network. Similarity-preserving knowledge distillation guides the training of a student network such that input pairs that produce similar (dissimilar) activations in the teacher network produce similar (dissimilar) activations in the student network. In contrast to previous distillation methods, the student is not required to mimic the representation space of the teacher, but rather to preserve the pairwise similarities in its own representation space. Experiments on three public datasets demonstrate the potential of our approach.
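One way to make the pairwise-similarity idea concrete is sketched below: build a batch-by-batch similarity matrix from each network's flattened activations, normalize each row, and penalize the squared difference between the two matrices. The row-wise L2 normalization, the division by the squared batch size, and all names and example values are assumptions of this sketch rather than details given in the abstract.

```python
import math

def similarity_matrix(activations):
    """Row-normalized Gram matrix of a batch of flattened activation vectors."""
    gram = [[sum(a * b for a, b in zip(x, y)) for y in activations] for x in activations]
    normalized = []
    for row in gram:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard against all-zero rows
        normalized.append([v / norm for v in row])
    return normalized

def similarity_preserving_loss(teacher_acts, student_acts):
    """Mean squared difference between the two batch similarity matrices."""
    g_teacher = similarity_matrix(teacher_acts)
    g_student = similarity_matrix(student_acts)
    batch = len(teacher_acts)
    total = sum(
        (t - s) ** 2
        for row_t, row_s in zip(g_teacher, g_student)
        for t, s in zip(row_t, row_s)
    )
    return total / (batch * batch)

# Three inputs; the teacher uses 3-dimensional activations.
teacher = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
# A student whose activations are a rescaled copy keeps the same pairwise structure.
student_same_structure = [[2.0 * v for v in row] for row in teacher]
# A student with 2-dimensional activations and a different pairwise structure.
student_other = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

loss_same = similarity_preserving_loss(teacher, student_same_structure)
loss_other = similarity_preserving_loss(teacher, student_other)
print(f"same structure: {loss_same:.6f}, different structure: {loss_other:.6f}")
```

Because only batch-by-batch matrices are compared, the teacher and student activations may have different dimensionality, which matches the point that the student keeps its own representation space.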
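Knowledge distillation was initially introduced to utilize additional supervision from a single teacher model for training the student model. To boost student performance, some recent variants attempt to exploit diverse knowledge sources from multiple teachers. However, existing studies mainly integrate knowledge from diverse sources by averaging over multiple teacher predictions or combining them using other label-free strategies, which may mislead the student in the presence of low-quality teacher predictions. To tackle this problem, we propose Confidence-Aware Multi-teacher Knowledge Distillation (CA-MKD), which adaptively assigns sample-wise reliability to each teacher prediction with the help of ground-truth labels, so that teacher predictions close to the one-hot labels are assigned large weights. In addition, CA-MKD incorporates intermediate layers to further improve student performance. Extensive experiments show that our CA-MKD consistently outperforms all compared state-of-the-art methods across various teacher-student architectures.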
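To illustrate label-aware weighting of several teachers, the sketch below scores each teacher's prediction for one sample by its cross-entropy against the ground-truth label and turns those scores into weights, so that predictions closer to the one-hot label count more. The specific rule used (a softmax over negative cross-entropy) is an assumption chosen for simplicity and is not claimed to be the weighting used by CA-MKD; all names and numbers are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy of one predicted distribution against a one-hot label."""
    return -math.log(probs[label])

def confidence_weights(teacher_probs, label):
    """Assumed rule: lower cross-entropy against the label gives a larger weight."""
    return softmax([-cross_entropy(p, label) for p in teacher_probs])

def aggregate_targets(teacher_probs, label):
    """Weighted average of teacher predictions for a single sample."""
    weights = confidence_weights(teacher_probs, label)
    num_classes = len(teacher_probs[0])
    target = [
        sum(w * p[c] for w, p in zip(weights, teacher_probs))
        for c in range(num_classes)
    ]
    return weights, target

# Three teachers predicting on one sample whose true class is 0.
teachers = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25], [0.10, 0.80, 0.10]]
weights, target = aggregate_targets(teachers, label=0)
uniform_target_0 = sum(p[0] for p in teachers) / len(teachers)
print("weights:", [round(w, 3) for w in weights])
print("weighted target:", [round(t, 3) for t in target])
```

Compared with plain averaging, the weighted target places more mass on the true class here because the third teacher, which is confidently wrong on this sample, is down-weighted.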
Model distillation is an effective and widely used technique to transfer knowledge from a teacher to a student network. The typical application is to transfer from a powerful large network or ensemble to a small network that is better suited to low-memory or fast execution requirements. In this paper, we present a deep mutual learning (DML) strategy where, rather than one-way transfer between a static pre-defined teacher and a student, an ensemble of students learn collaboratively and teach each other throughout the training process. Our experiments show that a variety of network architectures benefit from mutual learning and achieve compelling results on CIFAR-100 recognition and Market-1501 person re-identification benchmarks. Surprisingly, it is revealed that no prior powerful teacher network is necessary: mutual learning of a collection of simple student networks works, and moreover outperforms distillation from a more powerful yet static teacher.
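A minimal sketch of a mutual-learning objective for two peer networks on a single sample is given below, assuming each network's loss is its own cross-entropy plus a KL term that pulls it toward its peer's prediction. The abstract does not state the loss, so this form, the function `mutual_learning_losses`, and the example logits are assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mutual_learning_losses(logits_a, logits_b, label):
    """Per-network loss: supervised cross-entropy plus a mimicry term toward the peer."""
    p_a = softmax(logits_a)
    p_b = softmax(logits_b)
    ce_a = -math.log(p_a[label])
    ce_b = -math.log(p_b[label])
    loss_a = ce_a + kl(p_b, p_a)  # network A is pulled toward B's prediction
    loss_b = ce_b + kl(p_a, p_b)  # network B is pulled toward A's prediction
    return loss_a, loss_b, ce_a, ce_b

loss_a, loss_b, ce_a, ce_b = mutual_learning_losses([2.0, 0.5, 0.1], [1.0, 1.5, 0.2], label=0)
print(f"A: ce={ce_a:.4f} total={loss_a:.4f} | B: ce={ce_b:.4f} total={loss_b:.4f}")
```

In a training loop, each network would be updated on its own loss while treating the peer's prediction as a fixed target for that step; there is no pre-trained teacher anywhere in this setup.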
Knowledge distillation aims at transferring knowledge acquired in one model (a teacher) to another model (a student) that is typically smaller. Previous approaches can be expressed as a form of training the student to mimic output activations of individual data examples represented by the teacher. We introduce a novel approach, dubbed relational knowledge distillation (RKD), that transfers mutual relations of data examples instead. For concrete realizations of RKD, we propose distance-wise and angle-wise distillation losses that penalize structural differences in relations. Experiments conducted on different tasks show that the proposed method improves educated student models by a significant margin. In particular for metric learning, it allows students to outperform their teachers, achieving state-of-the-art results on standard benchmark datasets.
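The distance-wise idea can be sketched as follows: compute all pairwise Euclidean distances within a batch for teacher and student embeddings, normalize each set by its mean so that overall scale does not matter, and penalize the differences with a Huber loss. The mean normalization, the Huber penalty with `delta=1.0`, the averaging over pairs, and the example embeddings are assumptions of this sketch; the angle-wise loss is not shown.

```python
import math

def normalized_pairwise_distances(embeddings):
    """Euclidean distances over all pairs (i < j), divided by their mean.

    Assumes the batch contains at least two distinct points, so the mean is nonzero.
    """
    n = len(embeddings)
    dists = [
        math.dist(embeddings[i], embeddings[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]
    mean = sum(dists) / len(dists)
    return [d / mean for d in dists]

def huber(x, y, delta=1.0):
    """Huber penalty between two scalars."""
    diff = abs(x - y)
    return 0.5 * diff * diff if diff <= delta else delta * (diff - 0.5 * delta)

def distance_wise_loss(teacher_emb, student_emb):
    """Average Huber penalty between normalized teacher and student distances."""
    d_teacher = normalized_pairwise_distances(teacher_emb)
    d_student = normalized_pairwise_distances(student_emb)
    return sum(huber(t, s) for t, s in zip(d_teacher, d_student)) / len(d_teacher)

teacher_emb = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Same geometry at a different scale: relations between examples are preserved.
student_scaled = [[3.0 * v for v in row] for row in teacher_emb]
# A 2-dimensional student embedding with different relative distances.
student_other = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]

loss_scaled = distance_wise_loss(teacher_emb, student_scaled)
loss_other = distance_wise_loss(teacher_emb, student_other)
print(f"scaled copy: {loss_scaled:.6f}, different geometry: {loss_other:.6f}")
```

The loss depends only on relations between examples, so the student embedding can live in a space of different dimension or scale than the teacher's.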