压缩预训练的深度神经网络的任务吸引了研究社区的广泛兴趣,因为它在使从业人员摆脱数据访问要求方面的巨大好处。在该域中,低级别的近似是一种有前途的方法,但是现有的解决方案被认为是限制的设计选择,并且未能有效地探索设计空间,从而导致严重的准确性降解和有限的压缩比达到了有限。为了解决上述局限性,这项工作提出了SVD-NAS框架,该框架将低级近似和神经体系结构搜索的域结合在一起。 SVD-NAS通用并扩展了以前作品的设计选择,通过引入低级别的建筑空间LR空间,这是一个更细粒度的低级别近似设计空间。之后,这项工作提出了基于梯度的搜索,以有效地穿越LR空间。对可能的设计选择的更精细,更彻底的探索导致了CNN模型的参数,失败和潜伏期的提高精度以及降低。结果表明,在数据限制问题设置下,SVD-NAS的成像网上的精度比最新方法高2.06-12.85pp。 SVD-NAS在https://github.com/yu-zhewen/svd-nas上开源。
translated by 谷歌翻译
深度学习技术在各种任务中都表现出了出色的有效性,并且深度学习具有推进多种应用程序(包括在边缘计算中)的潜力,其中将深层模型部署在边缘设备上,以实现即时的数据处理和响应。一个关键的挑战是,虽然深层模型的应用通常会产生大量的内存和计算成本,但Edge设备通常只提供非常有限的存储和计算功能,这些功能可能会在各个设备之间差异很大。这些特征使得难以构建深度学习解决方案,以释放边缘设备的潜力,同时遵守其约束。应对这一挑战的一种有希望的方法是自动化有效的深度学习模型的设计,这些模型轻巧,仅需少量存储,并且仅产生低计算开销。该调查提供了针对边缘计算的深度学习模型设计自动化技术的全面覆盖。它提供了关键指标的概述和比较,这些指标通常用于量化模型在有效性,轻度和计算成本方面的水平。然后,该调查涵盖了深层设计自动化技术的三类最新技术:自动化神经体系结构搜索,自动化模型压缩以及联合自动化设计和压缩。最后,调查涵盖了未来研究的开放问题和方向。
translated by 谷歌翻译
Low-rankness plays an important role in traditional machine learning, but is not so popular in deep learning. Most previous low-rank network compression methods compress the networks by approximating pre-trained models and re-training. However, the optimal solution in the Euclidean space may be quite different from the one in the low-rank manifold. A well-pre-trained model is not a good initialization for the model with low-rank constraints. Thus, the performance of a low-rank compressed network degrades significantly. Compared to other network compression methods such as pruning, low-rank methods attracts less attention in recent years. In this paper, we devise a new training method, low-rank projection with energy transfer (LRPET), that trains low-rank compressed networks from scratch and achieves competitive performance. First, we propose to alternately perform stochastic gradient descent training and projection onto the low-rank manifold. Compared to re-training on the compact model, this enables full utilization of model capacity since solution space is relaxed back to Euclidean space after projection. Second, the matrix energy (the sum of squares of singular values) reduction caused by projection is compensated by energy transfer. We uniformly transfer the energy of the pruned singular values to the remaining ones. We theoretically show that energy transfer eases the trend of gradient vanishing caused by projection. Third, we propose batch normalization (BN) rectification to cut off its effect on the optimal low-rank approximation of the weight matrix, which further improves the performance. Comprehensive experiments on CIFAR-10 and ImageNet have justified that our method is superior to other low-rank compression methods and also outperforms recent state-of-the-art pruning methods. Our code is available at https://github.com/BZQLin/LRPET.
translated by 谷歌翻译
混合精确的深神经网络达到了硬件部署所需的能源效率和吞吐量,尤其是在资源有限的情况下,而无需牺牲准确性。但是,不容易找到保留精度的最佳每层钻头精度,尤其是在创建巨大搜索空间的大量模型,数据集和量化技术中。为了解决这一困难,最近出现了一系列文献,并且已经提出了一些实现有希望的准确性结果的框架。在本文中,我们首先总结了文献中通常使用的量化技术。然后,我们对混合精液框架进行了彻底的调查,该调查是根据其优化技术进行分类的,例如增强学习和量化技术,例如确定性舍入。此外,讨论了每个框架的优势和缺点,我们在其中呈现并列。我们最终为未来的混合精液框架提供了指南。
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
我们为深神经网络提出了一种新的全球压缩框架,它自动分析每个层以识别最佳的每个层压缩比,同时实现所需的整体压缩。我们的算法通过将其通道切入多个组并通过低秩分解来分解每个组来铰接压缩每个卷积(或完全连接)层的想法。在我们的算法的核心处于从Eckart Young MiRSKY定理中推导了层面错误界限的推导。然后,我们利用这些界限将压缩问题框架作为优化问题,我们希望最小化层次的最大压缩误差并提出朝向解决方案的有效算法。我们的实验表明,我们的方法优于各种网络和数据集的现有低级压缩方法。我们认为,我们的结果为未来的全球性能大小的研究开辟了新的途径,即现代神经网络的全球性能大小。我们的代码可在https://github.com/lucaslie/torchprune获得。
translated by 谷歌翻译
We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy, but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the redundancy present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2×, while keeping the accuracy within 1% of the original model.
translated by 谷歌翻译
Model compression is a critical technique to efficiently deploy neural network models on mobile devices which have limited computation resources and tight power budgets. Conventional model compression techniques rely on hand-crafted heuristics and rule-based policies that require domain experts to explore the large design space trading off among model size, speed, and accuracy, which is usually sub-optimal and time-consuming. In this paper, we propose AutoML for Model Compression (AMC) which leverage reinforcement learning to provide the model compression policy. This learning-based compression policy outperforms conventional rule-based compression policy by having higher compression ratio, better preserving the accuracy and freeing human labor. Under 4× FLOPs reduction, we achieved 2.7% better accuracy than the handcrafted model compression policy for VGG-16 on ImageNet. We applied this automated, push-the-button compression pipeline to MobileNet and achieved 1.81× speedup of measured inference latency on an Android phone and 1.43× speedup on the Titan XP GPU, with only 0.1% loss of ImageNet Top-1 accuracy.
translated by 谷歌翻译
重量修剪是一种有效的模型压缩技术,可以解决在移动设备上实现实时深神经网络(DNN)推断的挑战。然而,由于精度劣化,难以利用硬件加速度,以及某些类型的DNN层的限制,难以降低的应用方案具有有限的应用方案。在本文中,我们提出了一般的细粒度的结构化修剪方案和相应的编译器优化,适用于任何类型的DNN层,同时实现高精度和硬件推理性能。随着使用我们的编译器优化所支持的不同层的灵活性,我们进一步探讨了确定最佳修剪方案的新问题,了解各种修剪方案的不同加速度和精度性能。两个修剪方案映射方法,一个是基于搜索,另一个是基于规则的,建议自动推导出任何给定DNN的每层的最佳修剪规则和块大小。实验结果表明,我们的修剪方案映射方法,以及一般细粒化结构修剪方案,优于最先进的DNN优化框架,最高可达2.48 $ \ times $和1.73 $ \ times $ DNN推理加速在CiFar-10和Imagenet DataSet上没有准确性损失。
translated by 谷歌翻译
我们介绍了延迟感知网络加速度(LANA) - 一种在神经结构上建立的方法,用于加速神经网络的神经结构搜索技术和教师学生蒸馏。 Lana由两个阶段组成:在第一阶段,它会使用层面特征映射蒸馏来列举每层教师网络的许多替代操作。在第二阶段,它解决了使用新颖的整数线性优化(ILP)方法的有效操作的组合选择。 ILP带来独特的属性,因为它(i)在几秒钟内执行NAS,(ii)轻松满足预算约束,(iii)在图层粒度上工作,(iv)支持巨大的搜索空间$ o(10 ^ { 100})$,超越先前的搜索方法,效率和效率。在广泛的实验中,我们表明Lana产生了由目标潜伏期预算限制的有效和准确的模型,同时比其他技术明显快。我们分析了三个流行的网络架构:高效的网络,高效网络和reses,并在压缩较大模型的较小模型的延迟级别时,实现所有型号(高达3.0 \%$)的准确性改进。 Lana通过GPU和CPU实现显着的加速(高达5美元\倍),以没有准确性下降。代码将很快分享。
translated by 谷歌翻译
在过去几年中,已经制作了神经结构搜索领域的显着改进。然而,由于存在搜索的约束和实际推断时间之间的间隙,搜索有效网络仍然具有挑战性。为了搜索具有低推理时间的高性能网络,若干以前的作品为搜索算法设置了计算复杂性约束。然而,许多因素影响推理的速度(例如,拖鞋,MAC)。单个指示符与延迟之间的相关性并不强。目前,提出了一些重新参数化(REP)技术将多分支转换为对单路径架构进行推断友好的。然而,多分支架构仍然是人为定义和效率低下。在这项工作中,我们提出了一种适用于结构重新参数化技术的新搜索空间。 repnas是一种单级NAS方法,以便在分支号约束下有效地搜索每个层的最佳分支块(ODBB)。我们的实验结果表明,搜索的ODBB可以轻松超越手动各种分支块(DBB),高效培训。代码和型号将越早提供。
translated by 谷歌翻译
语义细分是许多视觉系统的骨干,从自动驾驶汽车和机器人导航到增强现实和电信。在有限的资源信封内经常在严格的延迟约束下运行,对有效执行的优化变得很重要。同时,目标平台的异质功能以及不同应用程序的不同限制需要设计和培训多个针对特定目标的细分模型,从而导致过度维护成本。为此,我们提出了一个框架,用于将最新的分割CNN转换为多EXIT语义细分(MESS)网络:经过特殊训练的模型,这些模型沿其深度沿其深度进行参数化的早期出口到i)在推断过程中动态保存计算更容易的样本和ii)通过提供可定制的速度准确性权衡来节省培训和维护成本。设计和培训此类网络天真地损害了性能。因此,我们为多EXIT网络提出了新颖的两期培训方案。此外,Mess的参数化可以使附件分割头的数字,位置和体系结构以及退出策略通过详尽的搜索在<1GPUH中进行部署。这使得混乱能够快速适应每个目标用例的设备功能和应用要求,并提供火车一路上的部署解决方案。与原始的骨干网络相比,Mess变体具有相同精度的潜伏期增长率高达2.83倍,而相同的计算预算的潜伏期提高到同一计算预算的准确性高5.33 pp。最后,与最先进的技术相比,MESS提供了更快的架构选择订单。
translated by 谷歌翻译
我们提出了三种新型的修剪技术,以提高推理意识到的可区分神经结构搜索(DNAS)的成本和结果。首先,我们介绍了DNA的随机双路构建块,它可以通过内存和计算复杂性在内部隐藏尺寸上进行搜索。其次,我们在搜索过程中提出了一种在超级网的随机层中修剪块的算法。第三,我们描述了一种在搜索过程中修剪不必要的随机层的新技术。由搜索产生的优化模型称为Prunet,并在Imagenet Top-1图像分类精度的推理潜伏期中为NVIDIA V100建立了新的最先进的Pareto边界。将Prunet作为骨架还优于COCO对象检测任务的GPUNET和EFIDENENET,相对于平均平均精度(MAP)。
translated by 谷歌翻译
低秩张量压缩已被提议作为一个有前途的方法,以减少他们的边缘设备部署神经网络的存储和计算需求。张量压缩减少的通过假设网络的权重来表示神经网络权重所需的参数的数目具有一个粗糙的高级结构。此粗结构假设已经被应用到压缩大神经网络如VGG和RESNET。计算机视觉任务然而现代国家的最先进的神经网络(即MobileNet,EfficientNet)已经通过在深度方向上可分离卷积假定粗因式分解结构,使得纯张量分解较少有吸引力的方法。我们建议低张量分解稀疏修剪,以充分利用粗粒和细粒结构的压缩相结合。我们在压缩SOTA架构的权重(MobileNetv3,EfficientNet,视觉变压器),并比较这种方法来疏剪枝,独自张量分解。
translated by 谷歌翻译
嵌入式和个人物联网设备由微控制器单元(MCU)供电,其极端资源稀缺是依赖于设备的深度学习推断的应用的主要障碍。与通常需要执行神经网络的内容相比,存储器,内存和计算能力较少的秩序,对网络架构上的严格结构约束并呼叫专业模型压缩方法。在这项工作中,我们为卷积神经网络提出了可分散的结构化网络修剪方法,它集成了模型的MCU特定的资源使用和参数重要性反馈,以获得高度压缩但准确的分类模型。我们的方法(a)提高了高达80倍的模型的关键资源使用; (b)在培训型号的同时迭代地修剪,导致没有开销甚至改善培训时间; (c)与现有MCU的特定方法相比,在比较多的时间内生产具有匹配或改进的资源使用的压缩模型。压缩模型可供下载。
translated by 谷歌翻译
在本文中,我们提出了用于卷积神经网络的可分散的信道稀疏性搜索(DCS)。与需要用户手动设置每个卷积层的紫星比的传统信道修剪算法不同,DCSS自动搜索稀疏的最佳组合。灵感来自可怜的架构搜索(飞镖),我们从连续放松中汲取课程,并利用梯度信息来平衡计算成本和指标。由于直接应用飞镖方案引起形状不匹配和过度的记忆消耗,因此在过滤器内引入一种名为重量共享的新技术。这种技术优雅地消除了具有可忽略额外资源的形状不匹配的问题。我们不仅开展全面的实验,不仅是图像分类,还可以找到包括语义分割和图像超分辨率的粒度任务,以验证DCSS的有效性。与以前的网络修剪方法相比,DCSS实现了图像分类的最先进结果。语义分割和图像超分辨率的实验结果表明,特定于任务特定搜索的性能比转移超薄模型实现了更好的性能,展示了广泛的适用性和高效率的DCSS。
translated by 谷歌翻译
少量样本压缩旨在将大冗余模型压缩成一个小型紧凑型,只有少量样品。如果我们的微调模型直接具有这些限制的样本,模型将容易受到过度装备,并且几乎没有学习。因此,先前的方法优化压缩模型逐层,并尝试使每个层具有与教师模型中的相应层相同的输出,这是麻烦的。在本文中,我们提出了一个名为mimicking的新框架,然后替换(mir),以实现几个样本压缩,这首先促使修剪模型输出与教师在倒数第二层中的相同功能,然后在倒数第二个之前替换教师的图层调整良好的紧凑型。与以前的层面重建方法不同,我们的MIR完全优化整个网络,这不仅简单而有效,而且还无人驾驶和一般。MIR优于以前的余量。代码即将推出。
translated by 谷歌翻译
We propose an efficient and unified framework, namely ThiNet, to simultaneously accelerate and compress CNN models in both training and inference stages. We focus on the filter level pruning, i.e., the whole filter would be discarded if it is less important. Our method does not change the original network structure, thus it can be perfectly supported by any off-the-shelf deep learning libraries. We formally establish filter pruning as an optimization problem, and reveal that we need to prune filters based on statistics information computed from its next layer, not the current layer, which differentiates ThiNet from existing methods. Experimental results demonstrate the effectiveness of this strategy, which has advanced the state-of-the-art. We also show the performance of ThiNet on ILSVRC-12 benchmark. ThiNet achieves 3.31× FLOPs reduction and 16.63× compression on VGG-16, with only 0.52% top-5 accuracy drop. Similar experiments with ResNet-50 reveal that even for a compact network, ThiNet can also reduce more than half of the parameters and FLOPs, at the cost of roughly 1% top-5 accuracy drop. Moreover, the original VGG-16 model can be further pruned into a very small model with only 5.05MB model size, preserving AlexNet level accuracy but showing much stronger generalization ability.
translated by 谷歌翻译
本文探讨了从视觉变压器查找最佳子模型的可行性,并引入了纯Vision变压器减肥(VIT-SLIM)框架,可以在跨多个维度从原始模型的端到端搜索这样的子结构,包括输入令牌,MHSA和MLP模块,具有最先进的性能。我们的方法基于学习和统一的L1稀疏限制,具有预定的因素,以反映不同维度的连续搜索空间中的全局重要性。通过单次训练方案,搜索过程非常有效。例如,在DeIT-S中,VIT-SLIM仅需要〜43 GPU小时进行搜索过程,并且搜索结构具有灵活的不同模块中的多维尺寸。然后,根据运行设备上的精度折叠折衷的要求采用预算阈值,并执行重新训练过程以获得最终模型。广泛的实验表明,我们的耐比可以压缩高达40%的参数和40%的视觉变压器上的40%拖鞋,同时在Imagenet上提高了〜0.6%的精度。我们还展示了我们搜索模型在几个下游数据集中的优势。我们的源代码将公开提供。
translated by 谷歌翻译
Neural network pruning offers a promising prospect to facilitate deploying deep neural networks on resourcelimited devices. However, existing methods are still challenged by the training inefficiency and labor cost in pruning designs, due to missing theoretical guidance of non-salient network components. In this paper, we propose a novel filter pruning method by exploring the High Rank of feature maps (HRank). Our HRank is inspired by the discovery that the average rank of multiple feature maps generated by a single filter is always the same, regardless of the number of image batches CNNs receive. Based on HRank, we develop a method that is mathematically formulated to prune filters with low-rank feature maps. The principle behind our pruning is that low-rank feature maps contain less information, and thus pruned results can be easily reproduced. Besides, we experimentally show that weights with high-rank feature maps contain more important information, such that even when a portion is not updated, very little damage would be done to the model performance. Without introducing any additional constraints, HRank leads to significant improvements over the state-of-the-arts in terms of FLOPs and parameters reduction, with similar accuracies. For example, with ResNet-110, we achieve a 58.2%-FLOPs reduction by removing 59.2% of the parameters, with only a small loss of 0.14% in top-1 accuracy on CIFAR-10. With Res-50, we achieve a 43.8%-FLOPs reduction by removing 36.7% of the parameters, with only a loss of 1.17% in the top-1 accuracy on ImageNet. The codes can be available at https://github.com/lmbxmu/HRank.
translated by 谷歌翻译