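This work investigates the compatibility between label smoothing (LS) and knowledge distillation (KD). Contemporary findings addressing this thesis statement take dichotomous standpoints: Muller et al. (2019) and Shen et al. (2021b). Critically, there has been no effort to understand and resolve these contradictory findings, leaving the primal question (to smooth or not to smooth a teacher network?) unanswered. The main contributions of our work are the discovery, analysis and validation of systematic diffusion as the missing concept that is instrumental in understanding and resolving these contradictory findings. This systematic diffusion essentially curtails the benefits of distilling from an LS-trained teacher, thereby rendering KD at increased temperatures ineffective. Our discovery is comprehensively supported by large-scale experiments, analyses and case studies including image classification, neural machine translation and compact student distillation tasks spanning multiple datasets and teacher-student architectures. Based on our analysis, we suggest that practitioners use an LS-trained teacher with a low-temperature transfer to achieve high-performance students. Code and models are available at https://keshik6.github.io/revisiting-ls-kd-compatibility/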
The generalization and learning speed of a multi-class neural network can often be significantly improved by using soft targets that are a weighted average of the hard targets and the uniform distribution over labels. Smoothing the labels in this way prevents the network from becoming over-confident and label smoothing has been used in many state-of-the-art models, including image classification, language translation and speech recognition. Despite its widespread use, label smoothing is still poorly understood. Here we show empirically that in addition to improving generalization, label smoothing improves model calibration which can significantly improve beam-search. However, we also observe that if a teacher network is trained with label smoothing, knowledge distillation into a student network is much less effective. To explain these observations, we visualize how label smoothing changes the representations learned by the penultimate layer of the network. We show that label smoothing encourages the representations of training examples from the same class to group in tight clusters. This results in loss of information in the logits about resemblances between instances of different classes, which is necessary for distillation, but does not hurt generalization or calibration of the model's predictions.
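As a minimal sketch of the soft targets described above (a weighted average of the hard target and the uniform distribution), the following plain-Python helper may be useful. The function name `smooth_labels` and the smoothing weight `alpha=0.1` are illustrative assumptions, not values taken from the abstract.

```python
def smooth_labels(target_index, num_classes, alpha=0.1):
    """Return (1 - alpha) * one_hot + alpha * uniform as a list of floats.

    alpha is an assumed smoothing weight; alpha = 0 recovers the hard target.
    """
    uniform = 1.0 / num_classes
    return [
        (1.0 - alpha) * (1.0 if k == target_index else 0.0) + alpha * uniform
        for k in range(num_classes)
    ]


# Hypothetical 4-class example with class 2 as the true label.
soft = smooth_labels(target_index=2, num_classes=4, alpha=0.1)
```

The smoothed vector still sums to one, so it can be used directly as the target of a cross-entropy loss.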
Mixup is a popular data augmentation technique that creates new samples by linear interpolation between two given data samples, improving both the generalization and the robustness of the trained model. Knowledge distillation (KD), on the other hand, is widely used for model compression and transfer learning, and involves using a larger network's implicit knowledge to guide the learning of a smaller network. At first glance these two techniques seem very different; however, we found that "smoothness" is the connecting link between the two and is also a crucial attribute in understanding KD's interplay with mixup. Although many mixup variants and distillation methods have been proposed, much remains to be understood regarding the role of mixup in knowledge distillation. In this paper, we present a detailed empirical study of various important dimensions of compatibility between mixup and knowledge distillation. We also scrutinize the behavior of networks trained with mixup in the light of knowledge distillation through extensive analysis, visualizations, and comprehensive experiments on image classification. Finally, based on our findings, we suggest improved strategies to guide the student network and enhance its effectiveness. Additionally, the findings of this study provide insightful suggestions to researchers and practitioners who commonly use techniques from KD. Our code is available at https://github.com/hchoi71/MIX-KD.
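The linear interpolation behind mixup can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper name `mixup`, the flat-list representation of samples, and the Beta parameter 0.4 are assumptions, and interpolating the label vectors with the same coefficient follows the standard mixup formulation rather than anything stated in the abstract.

```python
import random


def mixup(x1, y1, x2, y2, lam):
    """Linearly interpolate two samples (flat feature lists) and their label vectors."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


# The coefficient is commonly drawn from Beta(alpha, alpha); alpha = 0.4 is an assumed value.
lam_random = random.betavariate(0.4, 0.4)

# Deterministic example with a fixed coefficient.
mixed_x, mixed_y = mixup([0.0, 4.0], [1.0, 0.0], [4.0, 0.0], [0.0, 1.0], lam=0.25)
```

With `lam=0.25` the mixed sample sits three quarters of the way toward the second input, and the mixed label carries the same proportions.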
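The performance of a distillation-based compressed network is governed by the quality of distillation. Suboptimal distillation of a large network (teacher) into a smaller network (student) is largely attributed to the gap in learning capacity within the given teacher-student pair. While it is hard to distill all of a teacher's knowledge, the quality of distillation can be controlled to a large extent to achieve better performance. Our experiments show that the quality of distillation is largely governed by the quality of the teacher's response, which in turn is heavily affected by the presence of similarity information in that response. A well-trained, large-capacity teacher loses similarity information between classes in the process of learning fine-grained discriminative properties for classification. The absence of similarity information reduces the distillation process from one-example-many-class learning to one-example-one-class learning, thereby throttling the flow of diverse knowledge from the teacher. Under the implicit assumption that only instilled knowledge can be distilled, we scrutinize the knowledge inculcation process rather than focusing only on the knowledge distillation process. We argue that, for a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher. We discuss the steps for finding this sweet spot for better distillation. We also propose the distillation hypothesis to differentiate the behavior of the distillation process between knowledge distillation and a regularization effect. We conduct all our experiments on three different datasets.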
Figure 1. An illustration of standard knowledge distillation. Despite widespread use, an understanding of when the student can learn from the teacher is missing.
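Although deep neural networks have enjoyed remarkable success across a wide variety of tasks, their ever-increasing size also imposes significant overhead on deployment. To compress these models, knowledge distillation was proposed to transfer knowledge from a cumbersome (teacher) network into a lightweight (student) network. However, guidance from a teacher does not always improve the generalization of students, especially when the gap between student and teacher is large. Previous works argued that this is due to the high certainty of the teacher, which results in harder labels that are difficult to fit. To soften these labels, we present a pruning method termed Prediction Uncertainty Enlargement (PrUE) to simplify the teacher. Specifically, our method aims to decrease the teacher's certainty about the data, thereby generating soft predictions for students. We empirically investigate the effectiveness of the proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet. Results indicate that student networks trained with sparse teachers achieve better performance. In addition, our method allows researchers to distill knowledge from deeper networks to further improve students. Our code is publicly available at https://github.com/wangshaopu/prue.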
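Knowledge distillation (KD) has developed extensively and boosted various tasks. The classical KD method adds the KD loss to the original cross-entropy (CE) loss. We try to decompose the KD loss to explore its relation to the CE loss. Surprisingly, we find that it can be regarded as a combination of the CE loss and an extra loss that has the same form as the CE loss. However, we notice that the extra loss forces the student's relative probability to learn the teacher's absolute probability. Moreover, the sums of the two probabilities differ, making the loss hard to optimize. To address this issue, we revise the formulation and propose a distributed loss. In addition, we use the teacher's target output as the soft target, proposing the soft loss. Combining the soft loss and the distributed loss, we propose a new KD loss (NKD). Furthermore, we smooth the student's target output to treat it as the soft target for training without teachers, and propose a teacher-free new KD loss (TF-NKD). Our method achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example, with ResNet-34 as the teacher, we boost the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96%. In training without teachers, MobileNet, ResNet-18 and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48%, which are 0.83%, 0.86%, and 0.30% higher than the baseline, respectively. The code is available at https://github.com/yzd-v/cls_kd.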
Knowledge Distillation (KD) aims to distill the knowledge of a cumbersome teacher model into a lightweight student model. Its success is generally attributed to the privileged information on similarities among categories provided by the teacher model, and in this sense, only strong teacher models are deployed to teach weaker students in practice. In this work, we challenge this common belief with the following experimental observations: 1) beyond the acknowledgment that the teacher can improve the student, the student can also enhance the teacher significantly by reversing the KD procedure; 2) a poorly-trained teacher with much lower accuracy than the student can still improve the latter significantly. To explain these observations, we provide a theoretical analysis of the relationships between KD and label smoothing regularization. We prove that 1) KD is a type of learned label smoothing regularization and 2) label smoothing regularization provides a virtual teacher model for KD. From these results, we argue that the success of KD is not fully due to the similarity information between categories from teachers, but also to the regularization of soft targets, which is equally or even more important. Based on these analyses, we further propose a novel Teacher-free Knowledge Distillation (Tf-KD) framework, where a student model learns from itself or from a manually designed regularization distribution. Tf-KD achieves performance comparable to normal KD from a superior teacher, which makes it well suited to cases where a stronger teacher model is unavailable. Meanwhile, Tf-KD is generic and can be directly deployed for training deep neural networks. Without any extra computation cost, Tf-KD achieves up to 0.65% improvement on ImageNet over well-established baseline models, which is superior to label smoothing regularization.
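Knowledge distillation is a popular technique for training a small student network to emulate a larger teacher model, such as an ensemble of networks. We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood: there often remains a surprisingly large discrepancy between the predictive distributions of the teacher and the student, even in cases where the student has the capacity to match the teacher perfectly. We identify difficulties in optimization as a key reason why the student is unable to match the teacher. We also show how the details of the dataset used for distillation play a role in how closely the student matches the teacher, and that matching the teacher more closely paradoxically does not always lead to better student generalization.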
Despite the fact that deep neural networks are powerful models and achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large (teacher) pre-trained network is used to train a smaller (student) network. However, in this paper, we show that the student network performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher, or in other words, a teacher can effectively transfer its knowledge to students up to a certain size, not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on CIFAR-10,100 and ImageNet datasets and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
One of the most efficient methods for model compression is hint distillation, where the student model is injected with information (hints) from several different layers of the teacher model. Although the selection of hint points can drastically alter the compression performance, conventional distillation approaches overlook this fact and use the same hint points as in the early studies. Therefore, we propose a clustering-based hint selection methodology, where the layers of the teacher model are clustered with respect to several metrics and the cluster centers are used as the hint points. Once applied to a chosen teacher network, our method is applicable to any student network. The proposed approach is validated on the CIFAR-100 and ImageNet datasets, using various teacher-student pairs and numerous hint distillation methods. Our results show that hint points selected by our algorithm result in superior compression performance compared to state-of-the-art knowledge distillation algorithms on the same student models and datasets.
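Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student), and is often used in the context of model compression. When both models have the same architecture, this procedure is called self-distillation. Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data. In this work, we systematically study self-distillation in a number of settings. We first show that, even with a highly accurate teacher, self-distillation allows a student to surpass the teacher in all cases. Secondly, we revisit existing theoretical explanations of (self-)distillation and identify contradicting examples, revealing possible drawbacks of these explanations. Finally, we provide an alternative explanation for the dynamics of self-distillation through the lens of loss landscape geometry. We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.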
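Recent studies of knowledge distillation have found that ensembling the "dark knowledge" from multiple teachers or students helps create better soft targets for training, but at the cost of significantly more computation and/or parameters. In this work, we present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images by propagating and ensembling the knowledge of the other samples in the same mini-batch. Specifically, for each sample of interest, the propagation of knowledge is weighted according to the inter-sample affinities, which are estimated on the fly with the current network. The propagated knowledge can then be ensembled to form a better soft target for distillation. In this way, our BAKE framework achieves online knowledge ensembling across multiple samples with only a single network. It requires minimal computational and memory overhead compared to existing knowledge ensembling methods. Extensive experiments demonstrate that the lightweight yet effective BAKE consistently boosts the classification performance of various architectures on multiple datasets, e.g., a significant +0.7% gain for Swin-T on ImageNet with only +1.5% computational overhead and zero additional parameters. BAKE not only improves the vanilla baselines, but also surpasses the single-network state of the art on all benchmarks.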
Often we wish to transfer representational knowledge from one neural network to another. Examples include distilling a large network into a smaller one, transferring knowledge from one sensory modality to a second, or ensembling a collection of models into a single estimator. Knowledge distillation, the standard approach to these problems, minimizes the KL divergence between the probabilistic outputs of a teacher and student network. We demonstrate that this objective ignores important structural knowledge of the teacher network. This motivates an alternative objective by which we train a student to capture significantly more information in the teacher's representation of the data. We formulate this objective as contrastive learning. Experiments demonstrate that our resulting new objective outperforms knowledge distillation and other cutting-edge distillers on a variety of knowledge transfer tasks, including single model compression, ensemble distillation, and cross-modal transfer. Our method sets a new state-of-the-art in many transfer tasks, and sometimes even outperforms the teacher network when combined with knowledge distillation.
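The standard objective mentioned above, the KL divergence between teacher and student output distributions, can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the helper names (`softmax`, `kl_divergence`, `kd_loss`) and the example logits are hypothetical, and the contrastive objective proposed in the abstract is not implemented here.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KL divergence from the teacher's to the student's softened outputs."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


# Hypothetical logits for a single 3-class example.
loss_same = kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
loss_diff = kd_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
```

The loss vanishes when the student reproduces the teacher's distribution and is positive otherwise; it depends only on the output probabilities, not on intermediate representations.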
Most existing distillation methods ignore the flexible role of the temperature in the loss function and fix it as a hyper-parameter that can be decided by an inefficient grid search. In general, the temperature controls the discrepancy between two distributions and can faithfully determine the difficulty level of the distillation task. Keeping a constant temperature, i.e., a fixed level of task difficulty, is usually sub-optimal for a growing student during its progressive learning stages. In this paper, we propose a simple curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD), which controls the task difficulty level during the student's learning career through a dynamic and learnable temperature. Specifically, following an easy-to-hard curriculum, we gradually increase the distillation loss w.r.t. the temperature, leading to increased distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks and brings general improvements at a negligible additional computation cost. Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the effectiveness of our method. Our code is available at https://github.com/zhengli97/CTKD.
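To make the role of a fixed temperature concrete, the sketch below shows how dividing logits by a temperature flattens the resulting distribution. It illustrates only the conventional fixed-temperature softmax, not the learnable curriculum temperature of CTKD; the helper names and the example logits are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


# Hypothetical logits; higher temperatures give softer (higher-entropy) targets.
logits = [3.0, 1.0, 0.0]
entropies = [entropy(softmax(logits, t)) for t in (1.0, 2.0, 4.0)]
```

For non-constant logits the entropy grows with the temperature and approaches the uniform-distribution limit `log(num_classes)`.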
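What can neural networks learn about the visual world from a single image? While a single image obviously cannot contain the multitude of possible objects, scenes and lighting conditions that exist within the space of all 256^(3x224x224) possible 224-sized square images, it might still provide a strong prior for natural images. To analyze this hypothesis, we develop a framework for training neural networks from scratch using a single image by means of knowledge distillation from a supervised pretrained teacher. With this, we find that the answer to the above question is: "surprisingly, a lot". In quantitative terms, we find top-1 accuracies of 94%/74% on CIFAR-10/100, report results on ImageNet, and, by extending this method to audio, reach 84% on SpeechCommands. In extensive analyses we disentangle the effects of augmentations, the choice of source image, and network architectures, and we also discover "panda neurons" in networks that have never seen a panda. This work shows that one image can be used to extrapolate to thousands of object classes, and it motivates a renewed research agenda on the fundamental interplay of augmentations and images.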
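Knowledge distillation (KD) has shown very promising capabilities in transferring learned representations from large models (teachers) to small models (students). However, as the capacity gap between students and teachers becomes larger, existing KD methods fail to achieve better results. Our work shows that "prior knowledge" is vital to KD, especially when applying large teachers. In particular, we propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation. This means that our method also takes the teacher's features as "input", not just as "target". In addition, we dynamically adjust the ratio of prior knowledge during the training phase according to the feature gap, thus guiding the student at an appropriate level of difficulty. To evaluate the proposed method, we conduct extensive experiments on two image classification benchmarks (i.e., CIFAR100 and ImageNet) and an object detection benchmark (i.e., MS COCO). The results demonstrate the superiority of our method in performance under varying settings. More importantly, our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers. Our code will be made publicly available for reproducibility.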
Knowledge distillation (KD) has been actively studied for image classification tasks in deep learning, aiming to improve the performance of a student based on the knowledge from a teacher. However, applying KD in image regression with a scalar response variable has been rarely studied, and there exists no KD method applicable to both classification and regression tasks yet. Moreover, existing KD methods often require a practitioner to carefully select or adjust the teacher and student architectures, making these methods less flexible in practice. To address the above problems in a unified way, we propose a comprehensive KD framework based on cGANs, termed cGAN-KD. Fundamentally different from existing KD methods, cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples. This novel mechanism makes cGAN-KD suitable for both classification and regression tasks, compatible with other KD methods, and insensitive to the teacher and student architectures. An error bound for a student model trained in the cGAN-KD framework is derived in this work, providing a theory for why cGAN-KD is effective as well as guiding the practical implementation of cGAN-KD. Extensive experiments on CIFAR-100 and ImageNet-100 show that we can combine state of the art KD methods with the cGAN-KD framework to yield a new state of the art. Moreover, experiments on Steering Angle and UTKFace demonstrate the effectiveness of cGAN-KD in image regression tasks, where existing KD methods are inapplicable.
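Self-distillation exploits non-uniform soft supervision from the network itself during training and improves performance without any runtime cost. However, the overhead during training is often overlooked, even though reducing time and memory overhead during training is increasingly important in the era of giant models. This paper proposes an efficient self-distillation method named Zipf's Label Smoothing (Zipf's LS), which uses the on-the-fly predictions of a network to generate soft supervision that conforms to the Zipf distribution, without using any contrastive samples or auxiliary parameters. Our idea comes from an empirical observation: when a network is duly trained, the output values of its final softmax layer, after being sorted by magnitude and averaged across samples, should follow a distribution reminiscent of Zipf's law in the word frequency statistics of natural languages. By enforcing this property at the sample level and throughout the whole training period, we find that predictive accuracy can be greatly improved. Using ResNet50 on the INAT21 fine-grained classification dataset, our technique achieves a +3.61% accuracy gain over the vanilla baseline, and a further 0.88% gain over previous label smoothing or self-distillation strategies. The implementation is publicly available at https://github.com/megvii-research/zipfls.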
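State-of-the-art distillation methods are mainly based on distilling deep features from intermediate layers, while the significance of logit distillation is greatly overlooked. To provide a novel viewpoint for studying logit distillation, we reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD). We empirically investigate and prove the effects of the two parts: TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works. More importantly, we reveal that the classical KD loss is a coupled formulation, which (1) suppresses the effectiveness of NCKD and (2) limits the flexibility to balance these two parts. To address these issues, we present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly. Compared with complex feature-based methods, our DKD achieves comparable or even better results and has better training efficiency on the CIFAR-100, ImageNet, and MS-COCO datasets for image classification and object detection tasks. This paper proves the great potential of logit distillation, and we hope it will be helpful for future research. The code is available at https://github.com/megvii-research/mdistiller.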
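The split of the classical KD loss into a target-class part and a non-target-class part can be checked numerically. The sketch below is an illustration under assumptions, not the authors' implementation: it uses the algebraic identity KL(p_T || p_S) = TCKD + (1 - p_T[target]) * NCKD, where TCKD is the KL divergence between the binary (target vs. rest) distributions and NCKD is the KL divergence between the renormalized non-target distributions. The function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def decompose_kd(teacher_logits, student_logits, target):
    """Return (tckd, nckd, weight) with KD = tckd + weight * nckd."""
    p_t, p_s = softmax(teacher_logits), softmax(student_logits)
    # Binary distributions: probability of the target class vs. all other classes.
    b_t = [p_t[target], 1.0 - p_t[target]]
    b_s = [p_s[target], 1.0 - p_s[target]]
    tckd = kl(b_t, b_s)
    # Non-target probabilities renormalized to sum to one.
    hat_t = [p / b_t[1] for i, p in enumerate(p_t) if i != target]
    hat_s = [p / b_s[1] for i, p in enumerate(p_s) if i != target]
    nckd = kl(hat_t, hat_s)
    return tckd, nckd, b_t[1]


# Hypothetical 4-class logits with class 0 as the ground-truth target.
teacher, student = [4.0, 1.0, 0.5, -1.0], [2.0, 1.5, 0.0, 0.5]
tckd, nckd, weight = decompose_kd(teacher, student, target=0)
full_kd = kl(softmax(teacher), softmax(student))
```

The weight on the non-target term is `1 - p_T[target]`, so a teacher that is very confident on the target class shrinks that term, which is one way to read the abstract's remark that the coupled formulation suppresses NCKD.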