是否可以在深网络中重组非线性激活函数以创建硬件有效的模型?为了解决这个问题,我们提出了一个称为重组激活网络(RANS)的新范式,该范式操纵模型中的非线性数量以提高其硬件意识和效率。首先,我们提出了RAN-STHICER(RAN-E) - 一个新的硬件感知搜索空间和半自动搜索算法 - 用硬件感知的块替换效率低下的块。接下来,我们提出了一种称为RAN-IMPLICIC(RAN-I)的无训练模型缩放方法,从理论上讲,我们在非线性单元的数量方面证明了网络拓扑与其表现性之间的联系。我们证明,我们的网络在不同尺度和几种类型的硬件上实现最新的成像网结果。例如,与有效网络-lite-B0相比,RAN-E在ARM Micro-NPU上每秒(FPS)提高了1.5倍,同时提高了类似的精度。另一方面,ran-i以相似或更好的精度表现出#macs的#macs降低2倍。我们还表明,在基于ARM的数据中心CPU上,RAN-I的FPS比Convnext高40%。最后,与基于Convnext的模型相比,基于RAN-I的对象检测网络在数据中心CPU上获得了类似或更高的映射,并且在数据中心CPU上的fps高达33%。
translated by 谷歌翻译
深度学习技术在各种任务中都表现出了出色的有效性,并且深度学习具有推进多种应用程序(包括在边缘计算中)的潜力,其中将深层模型部署在边缘设备上,以实现即时的数据处理和响应。一个关键的挑战是,虽然深层模型的应用通常会产生大量的内存和计算成本,但Edge设备通常只提供非常有限的存储和计算功能,这些功能可能会在各个设备之间差异很大。这些特征使得难以构建深度学习解决方案,以释放边缘设备的潜力,同时遵守其约束。应对这一挑战的一种有希望的方法是自动化有效的深度学习模型的设计,这些模型轻巧,仅需少量存储,并且仅产生低计算开销。该调查提供了针对边缘计算的深度学习模型设计自动化技术的全面覆盖。它提供了关键指标的概述和比较,这些指标通常用于量化模型在有效性,轻度和计算成本方面的水平。然后,该调查涵盖了深层设计自动化技术的三类最新技术:自动化神经体系结构搜索,自动化模型压缩以及联合自动化设计和压缩。最后,调查涵盖了未来研究的开放问题和方向。
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardwareaware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2% more accurate on ImageNet classification while reducing latency by 20% compared to MobileNetV2. MobileNetV3-Small is 6.6% more accurate compared to a MobileNetV2 model with comparable latency. MobileNetV3-Large detection is over 25% faster at roughly the same accuracy as Mo-bileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 34% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.
translated by 谷歌翻译
在过去几年中,已经制作了神经结构搜索领域的显着改进。然而,由于存在搜索的约束和实际推断时间之间的间隙,搜索有效网络仍然具有挑战性。为了搜索具有低推理时间的高性能网络,若干以前的作品为搜索算法设置了计算复杂性约束。然而,许多因素影响推理的速度(例如,拖鞋,MAC)。单个指示符与延迟之间的相关性并不强。目前,提出了一些重新参数化(REP)技术将多分支转换为对单路径架构进行推断友好的。然而,多分支架构仍然是人为定义和效率低下。在这项工作中,我们提出了一种适用于结构重新参数化技术的新搜索空间。 repnas是一种单级NAS方法,以便在分支号约束下有效地搜索每个层的最佳分支块(ODBB)。我们的实验结果表明,搜索的ODBB可以轻松超越手动各种分支块(DBB),高效培训。代码和型号将越早提供。
translated by 谷歌翻译
我们提出了三种新型的修剪技术,以提高推理意识到的可区分神经结构搜索(DNAS)的成本和结果。首先,我们介绍了DNA的随机双路构建块,它可以通过内存和计算复杂性在内部隐藏尺寸上进行搜索。其次,我们在搜索过程中提出了一种在超级网的随机层中修剪块的算法。第三,我们描述了一种在搜索过程中修剪不必要的随机层的新技术。由搜索产生的优化模型称为Prunet,并在Imagenet Top-1图像分类精度的推理潜伏期中为NVIDIA V100建立了新的最先进的Pareto边界。将Prunet作为骨架还优于COCO对象检测任务的GPUNET和EFIDENENET,相对于平均平均精度(MAP)。
translated by 谷歌翻译
在这项工作中,我们提出了一种方法,以准确评估和比较有效的神经网络构建块的性能,以硬件感知方式进行计算机视觉。我们的比较使用了基于设计空间的随机采样网络的帕累托前沿来捕获潜在的准确性/复杂性权衡。我们表明,我们的方法允许通过以前的比较范例获得的信息匹配,但对硬件成本和准确性之间的关系提供了更多见解。我们使用我们的方法来分析不同的构件并评估其在一系列嵌入式硬件平台上的性能。这突出了基准构建块作为神经网络设计过程中的预选步骤的重要性。我们表明,选择合适的构件可以在特定硬件ML加速器上加快推理的速度2倍。
translated by 谷歌翻译
有效的深层神经网络(DNN)模型配备了紧凑的操作员(例如,深度卷积)在降低DNN的理论复杂性(例如,权重/操作总数)的同时,在保持体面的模型准确性的同时,显示出很大的潜力。但是,由于其通常采用的紧凑型操作员的低硬件利用率,现有的有效DNN仍然受到履行其提高现实硬件效率的承诺的限制。在这项工作中,我们为开发真实硬件有效的DNN开辟了新的压缩范式,从而提高了硬件效率,同时保持模型的准确性。有趣的是,我们观察到,尽管某些DNN层的激活功能有助于DNNS的训练优化和可实现的准确性,但在训练后可以正确删除它们,而不会损害模型的准确性。受到这一观察的启发,我们提出了一个称为DepthShrinker的框架,该框架通过缩小现有有效DNN的基本构建块来开发硬件友好的紧凑型网络,这些构件具有不规则的计算模式,并具有大量改进的硬件利用率,从而将硬件的计算模式缩小到密集的情况下。令人兴奋的是,我们的DepthShrinker框架提供了硬件友好的紧凑网络,既优于最先进的有效DNN和压缩技术方法元元素。我们的代码可在以下网址找到:https://github.com/facebookresearch/depthshrinker。
translated by 谷歌翻译
深神经网络(DNNS)在各种机器学习(ML)应用程序中取得了巨大成功,在计算机视觉,自然语言处理和虚拟现实等中提供了高质量的推理解决方案。但是,基于DNN的ML应用程序也带来计算和存储要求的增加了很多,对于具有有限的计算/存储资源,紧张的功率预算和较小形式的嵌入式系统而言,这尤其具有挑战性。挑战还来自各种特定应用的要求,包括实时响应,高通量性能和可靠的推理准确性。为了应对这些挑战,我们介绍了一系列有效的设计方法,包括有效的ML模型设计,定制的硬件加速器设计以及硬件/软件共同设计策略,以启用嵌入式系统上有效的ML应用程序。
translated by 谷歌翻译
由于存储器和计算资源有限,部署在移动设备上的卷积神经网络(CNNS)是困难的。我们的目标是通过利用特征图中的冗余来设计包括CPU和GPU的异构设备的高效神经网络,这很少在神经结构设计中进行了研究。对于类似CPU的设备,我们提出了一种新颖的CPU高效的Ghost(C-Ghost)模块,以生成从廉价操作的更多特征映射。基于一组内在的特征映射,我们使用廉价的成本应用一系列线性变换,以生成许多幽灵特征图,可以完全揭示内在特征的信息。所提出的C-Ghost模块可以作为即插即用组件,以升级现有的卷积神经网络。 C-Ghost瓶颈旨在堆叠C-Ghost模块,然后可以轻松建立轻量级的C-Ghostnet。我们进一步考虑GPU设备的有效网络。在建筑阶段的情况下,不涉及太多的GPU效率(例如,深度明智的卷积),我们建议利用阶段明智的特征冗余来制定GPU高效的幽灵(G-GHOST)阶段结构。舞台中的特征被分成两个部分,其中使用具有较少输出通道的原始块处理第一部分,用于生成内在特征,另一个通过利用阶段明智的冗余来生成廉价的操作。在基准测试上进行的实验证明了所提出的C-Ghost模块和G-Ghost阶段的有效性。 C-Ghostnet和G-Ghostnet分别可以分别实现CPU和GPU的准确性和延迟的最佳权衡。代码可在https://github.com/huawei-noah/cv-backbones获得。
translated by 谷歌翻译
Structured channel pruning has been shown to significantly accelerate inference time for convolution neural networks (CNNs) on modern hardware, with a relatively minor loss of network accuracy. Recent works permanently zero these channels during training, which we observe to significantly hamper final accuracy, particularly as the fraction of the network being pruned increases. We propose Soft Masking for cost-constrained Channel Pruning (SMCP) to allow pruned channels to adaptively return to the network while simultaneously pruning towards a target cost constraint. By adding a soft mask re-parameterization of the weights and channel pruning from the perspective of removing input channels, we allow gradient updates to previously pruned channels and the opportunity for the channels to later return to the network. We then formulate input channel pruning as a global resource allocation problem. Our method outperforms prior works on both the ImageNet classification and PASCAL VOC detection datasets.
translated by 谷歌翻译
In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3.is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design.Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on ImageNet [1] classification, COCO object detection [2], VOC image segmentation [3]. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as actual latency, and the number of parameters.
translated by 谷歌翻译
大多数现有的深神经网络都是静态的,这意味着它们只能以固定的复杂性推断。但资源预算可以大幅度不同。即使在一个设备上,实惠预算也可以用不同的场景改变,并且对每个所需预算的反复培训网络是非常昂贵的。因此,在这项工作中,我们提出了一种称为Mutualnet的一般方法,以训练可以以各种资源约束运行的单个网络。我们的方法列举了具有各种网络宽度和输入分辨率的模型配置队列。这种相互学习方案不仅允许模型以不同的宽度分辨率配置运行,而且还可以在这些配置之间传输独特的知识,帮助模型来学习更强大的表示。 Mutualnet是一般的培训方法,可以应用于各种网络结构(例如,2D网络:MobileNets,Reset,3D网络:速度,X3D)和各种任务(例如,图像分类,对象检测,分段和动作识别),并证明了实现各种数据集的一致性改进。由于我们只培训了这一模型,它对独立培训多种型号而言,它也大大降低了培训成本。令人惊讶的是,如果动态资源约束不是一个问题,则可以使用Mutualnet来显着提高单个网络的性能。总之,Mutualnet是静态和自适应,2D和3D网络的统一方法。代码和预先训练的模型可用于\ url {https://github.com/tayang1122/mutualnet}。
translated by 谷歌翻译
混合精确的深神经网络达到了硬件部署所需的能源效率和吞吐量,尤其是在资源有限的情况下,而无需牺牲准确性。但是,不容易找到保留精度的最佳每层钻头精度,尤其是在创建巨大搜索空间的大量模型,数据集和量化技术中。为了解决这一困难,最近出现了一系列文献,并且已经提出了一些实现有希望的准确性结果的框架。在本文中,我们首先总结了文献中通常使用的量化技术。然后,我们对混合精液框架进行了彻底的调查,该调查是根据其优化技术进行分类的,例如增强学习和量化技术,例如确定性舍入。此外,讨论了每个框架的优势和缺点,我们在其中呈现并列。我们最终为未来的混合精液框架提供了指南。
translated by 谷歌翻译
模型量化已成为加速深度学习推理的不可或缺的技术。虽然研究人员继续推动量化算法的前沿,但是现有量化工作通常是不可否认的和不可推销的。这是因为研究人员不选择一致的训练管道并忽略硬件部署的要求。在这项工作中,我们提出了模型量化基准(MQBench),首次尝试评估,分析和基准模型量化算法的再现性和部署性。我们为实际部署选择多个不同的平台,包括CPU,GPU,ASIC,DSP,并在统一培训管道下评估广泛的最新量化算法。 MQBENCK就像一个连接算法和硬件的桥梁。我们进行全面的分析,并找到相当大的直观或反向直观的见解。通过对齐训练设置,我们发现现有的算法在传统的学术轨道上具有大致相同的性能。虽然用于硬件可部署量化,但有一个巨大的精度差距,仍然不稳定。令人惊讶的是,没有现有的算法在MQBench中赢得每一项挑战,我们希望这项工作能够激发未来的研究方向。
translated by 谷歌翻译
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
translated by 谷歌翻译
更好的准确性和效率权衡在对象检测中是一个具有挑战性的问题。在这项工作中,我们致力于研究对象检测的关键优化和神经网络架构选择,以提高准确性和效率。我们调查了无锚策略对轻质对象检测模型的适用性。我们增强了骨干结构并设计了颈部的轻质结构,从而提高了网络的特征提取能力。我们改善标签分配策略和损失功能,使培训更稳定和高效。通过这些优化,我们创建了一个名为PP-Picodet的新的实时对象探测器系列,这在移动设备的对象检测上实现了卓越的性能。与其他流行型号相比,我们的模型在准确性和延迟之间实现了更好的权衡。 Picodet-s只有0.99m的参数达到30.6%的地图,它是地图的绝对4.8%,同时与yolox-nano相比将移动CPU推理延迟减少55%,并且与Nanodet相比,MAP的绝对改善了7.1%。当输入大小为320时,它在移动臂CPU上达到123个FPS(使用桨Lite)。Picodet-L只有3.3M参数,达到40.9%的地图,这是地图的绝对3.7%,比yolov5s更快44% 。如图1所示,我们的模型远远优于轻量级对象检测的最先进的结果。代码和预先训练的型号可在https://github.com/paddlepaddle/paddledentions提供。
translated by 谷歌翻译
Designing accurate and efficient ConvNets for mobile devices is challenging because the design space is combinatorially large. Due to this, previous neural architecture search (NAS) methods are computationally expensive. ConvNet architecture optimality depends on factors such as input resolution and target devices. However, existing approaches are too resource demanding for case-by-case redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP count does not always reflect actual latency. To address these, we propose a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize Con-vNet architectures, avoiding enumerating and training individual architectures separately as in previous methods. FBNets (Facebook-Berkeley-Nets), a family of models discovered by DNAS surpass state-of-the-art models both designed manually and generated automatically. FBNet-B achieves 74.1% top-1 accuracy on ImageNet with 295M FLOPs and 23.1 ms latency on a Samsung S8 phone, 2.4x smaller and 1.5x faster than MobileNetV2-1.3[17] with similar accuracy. Despite higher accuracy and lower latency than MnasNet[20], we estimate FBNet-B's search cost is 420x smaller than MnasNet's, at only 216 GPUhours. Searched for different resolutions and channel sizes, FBNets achieve 1.5% to 6.4% higher accuracy than Mo-bileNetV2. The smallest FBNet achieves 50.2% accuracy and 2.9 ms latency (345 frames per second) on a Samsung S8. Over a Samsung-optimized FBNet, the iPhone-Xoptimized model achieves a 1.4x speedup on an iPhone X. FBNet models are open-sourced at https://github. com/facebookresearch/mobile-vision. * Work done while interning at Facebook.… Figure 1. Differentiable neural architecture search (DNAS) for ConvNet design. DNAS explores a layer-wise space that each layer of a ConvNet can choose a different block. The search space is represented by a stochastic super net. The search process trains the stochastic super net using SGD to optimize the architecture distribution. Optimal architectures are sampled from the trained distribution. The latency of each operator is measured on target devices and used to compute the loss for the super net.
translated by 谷歌翻译
大多数对象检测框架都使用最初设计用于图像分类的主链体系结构,通常在Imagenet上具有预训练的参数。但是,图像分类和对象检测本质上是不同的任务,并且不能保证分类的最佳主链也适用于对象检测。最近的神经体系结构搜索(NAS)研究表明,自动设计专门用于对象检测的骨干有助于提高整体准确性。在本文中,我们引入了一种神经体系结构适应方法,该方法可以优化给定的主链以进行检测目的,同时仍允许使用预训练的参数。我们建议除了每个块的输出通道尺寸外,还通过搜索特定操作和层数来调整微体系结构。重要的是要找到最佳的通道深度,因为它极大地影响了特征表示功能和计算成本。我们使用搜索的主链进行对象检测进行实验,并证明我们的主链在可可数据集上的手动设计和搜索的最新骨干均优于手动设计和搜索的骨干。
translated by 谷歌翻译
Designing convolutional neural networks (CNN) for mobile devices is challenging because mobile models need to be small and fast, yet still accurate. Although significant efforts have been dedicated to design and improve mobile CNNs on all dimensions, it is very difficult to manually balance these trade-offs when there are so many architectural possibilities to consider. In this paper, we propose an automated mobile neural architecture search (MNAS) approach, which explicitly incorporate model latency into the main objective so that the search can identify a model that achieves a good trade-off between accuracy and latency. Unlike previous work, where latency is considered via another, often inaccurate proxy (e.g., FLOPS), our approach directly measures real-world inference latency by executing the model on mobile phones. To further strike the right balance between flexibility and search space size, we propose a novel factorized hierarchical search space that encourages layer diversity throughout the network. Experimental results show that our approach consistently outperforms state-of-the-art mobile CNN models across multiple vision tasks. On the ImageNet classification task, our MnasNet achieves 75.2% top-1 accuracy with 78ms latency on a Pixel phone, which is 1.8× faster than MobileNetV2 [29] with 0.5% higher accuracy and 2.3× faster than NASNet [36] with 1.2% higher accuracy. Our MnasNet also achieves better mAP quality than MobileNets for COCO object detection. Code is at https://github.com/tensorflow/tpu/ tree/master/models/official/mnasnet.
translated by 谷歌翻译