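Quantization is widely used in both cloud and edge systems to reduce the memory footprint, latency, and energy consumption of deep neural networks. In particular, mixed-precision quantization, i.e., the use of different bit-widths for different portions of the network, has been shown to provide excellent efficiency gains, especially when the bit-width assignment is optimized by automated neural architecture search (NAS) tools. State-of-the-art mixed-precision quantization works layer-wise, i.e., it uses different bit-widths for the weight and activation tensors of each network layer. In this work, we widen the search space, proposing a novel NAS that selects the bit-width of each weight tensor channel independently. This gives the tool the additional flexibility of assigning a higher precision only to the weights associated with the most informative features. Testing on the MLPerf Tiny benchmark suite, we obtain a rich collection of Pareto-optimal models in the accuracy vs. model size and accuracy vs. energy spaces. When deployed on the MPIC RISC-V edge processor, our networks reduce memory and energy by 63% and 27%, respectively, compared to a layer-wise approach at the same accuracy.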
translated by 谷歌翻译
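The channel-wise idea in the first abstract above can be illustrated with a minimal sketch. It is not the NAS from the paper: it assumes a plain symmetric uniform quantizer, the bit-widths are picked by hand rather than searched, and all function names are hypothetical.

```python
def quantize_channel(weights, bits):
    """Symmetric uniform quantization of one channel to `bits` bits (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 1 for 2 bits
    peak = max(abs(w) for w in weights)
    if peak == 0.0:
        return list(weights)                # an all-zero channel needs no rounding
    scale = peak / qmax
    return [round(w / scale) * scale for w in weights]


def quantize_per_channel(tensor, bits_per_channel):
    """Quantize each output channel with its own bit-width."""
    return [quantize_channel(ch, b) for ch, b in zip(tensor, bits_per_channel)]


def weight_memory_bits(tensor, bits_per_channel):
    """Total weight storage in bits for a given per-channel assignment."""
    return sum(len(ch) * b for ch, b in zip(tensor, bits_per_channel))


# Three channels of four weights each (made-up values).
tensor = [
    [0.4, -1.0, 0.2, 0.9],
    [0.1, 0.3, -0.2, 0.05],
    [0.4, -1.0, 0.2, 0.9],
]
channel_wise = [8, 4, 2]                    # hand-picked, hypothetical assignment
layer_wise = [8, 8, 8]                      # one bit-width for the whole layer

quantized = quantize_per_channel(tensor, channel_wise)
print(weight_memory_bits(tensor, channel_wise), "bits vs",
      weight_memory_bits(tensor, layer_wise), "bits")
```

With a per-channel assignment, storage cost becomes a sum over channels, which is what lets a search tool spend bits only where they matter.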
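Human activity recognition (HAR) based on inertial data is an increasingly widespread task on embedded devices, from smartphones to ultra-low-power sensors. Due to the high computational complexity of deep learning models, most embedded HAR systems are based on simple and less accurate classical machine learning algorithms. This work bridges the gap between on-device HAR and deep learning, proposing a set of efficient one-dimensional convolutional neural networks (CNNs) deployable on general-purpose microcontrollers (MCUs). Our CNNs are obtained by combining hyper-parameter optimization with sub-byte and mixed-precision quantization, to find good trade-offs between classification results and memory occupation. Moreover, we leverage adaptive inference as an orthogonal optimization to tune the inference complexity at runtime based on the processed input, thus producing a more flexible HAR system. With experiments on four datasets, and targeting an ultra-low-power RISC-V MCU, we show that (i) we are able to obtain a rich set of Pareto-optimal CNNs for HAR, spanning more than one order of magnitude in memory, latency, and energy consumption; (ii) thanks to adaptive inference, we can derive >20 runtime operating modes starting from a single CNN, differing by up to 10% in classification score and by more than 3x in inference complexity, with a limited memory overhead; (iii) on three of the four benchmarks, we outperform all previous deep learning methods while reducing the memory occupation by more than 100x. The few methods that obtain better performance (both shallow and deep) are not compatible with MCU deployment. (iv) All our CNNs are compatible with real-time on-device HAR, with an inference latency <16 ms. Their memory occupation varies between 0.05 and 23.17 kB, and their energy consumption between 0.005 and 61.59 uJ, allowing years of continuous operation on a small battery supply.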
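Mixed-precision deep neural networks achieve the energy efficiency and throughput needed for hardware deployment, particularly when resources are limited, without sacrificing accuracy. However, the optimal per-layer bit precision that preserves accuracy is not easy to find, especially given the abundance of models, datasets, and quantization techniques, which creates an enormous search space. To tackle this difficulty, a body of literature has recently emerged, and several frameworks that achieve promising accuracy results have been proposed. In this paper, we start by summarizing the quantization techniques commonly used in the literature. Then, we present a thorough survey of mixed-precision frameworks, categorized according to their optimization techniques, such as reinforcement learning, and their quantization techniques, such as deterministic rounding. Furthermore, we discuss the advantages and shortcomings of each framework and present a side-by-side comparison. Finally, we give guidelines for future mixed-precision frameworks.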
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
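Deep neural networks (DNNs) have achieved great success in a variety of machine learning (ML) applications, delivering high-quality inference solutions in computer vision, natural language processing, virtual reality, and more. However, DNN-based ML applications also bring much higher computational and storage requirements, which are particularly challenging for embedded systems with limited compute and storage resources, tight power budgets, and small form factors. Challenges also come from the diverse application-specific requirements, including real-time response, high-throughput performance, and reliable inference accuracy. To address these challenges, we introduce a series of effective design methodologies, including efficient ML model design, customized hardware accelerator design, and hardware/software co-design strategies, to enable efficient ML applications on embedded systems.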
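Deep learning is pervasive in our daily lives, including self-driving cars, virtual assistants, social network services, healthcare services, and face recognition. However, deep neural networks demand substantial compute resources during training and inference. The machine learning community has mainly focused on model-level optimizations, such as architectural compression of deep learning models, while the systems community has focused on implementation-level optimization. In between, various arithmetic-level optimization techniques have been proposed in the arithmetic community. This article provides a survey of resource-efficient deep learning techniques at the model, arithmetic, and implementation levels, and identifies the research gaps across these three levels. Based on our definition of a resource-efficiency metric, our survey clarifies the influence of the lower-level techniques and discusses future trends in resource-efficient deep learning research.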
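Deep learning technologies have demonstrated remarkable effectiveness in a wide range of tasks, and deep learning has the potential to advance a multitude of applications, including edge computing, where deep models are deployed on edge devices to enable instant data processing and response. A key challenge is that, while the application of deep models often incurs significant memory and computational costs, edge devices typically offer only very limited storage and computational capabilities, which may vary substantially across devices. These characteristics make it difficult to build deep learning solutions that unleash the potential of edge devices while complying with their constraints. A promising approach to addressing this challenge is to automate the design of effective deep learning models that are lightweight, require only a small amount of storage, and incur only low computational overhead. This survey offers comprehensive coverage of design automation techniques for deep learning models targeting edge computing. It provides an overview and comparison of the key metrics commonly used to quantify the proficiency of models in terms of effectiveness, lightness, and computational cost. The survey then covers three categories of state-of-the-art deep model design automation techniques: automated neural architecture search, automated model compression, and joint automated design and compression. Finally, the survey covers open issues and directions for future research.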
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
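The large computing and memory cost of deep neural networks (DNNs) often precludes their use in resource-constrained devices. Quantizing the parameters and operations to lower bit precision offers substantial memory and energy savings for neural network inference, facilitating the use of DNNs on edge computing platforms. Recent efforts at quantizing DNNs have employed a range of techniques, including progressive quantization, step-size adaptation, and gradient scaling. This paper proposes a new quantization approach for mixed-precision convolutional neural networks (CNNs) targeting edge computing. Our method establishes a new Pareto frontier in model accuracy and memory footprint, demonstrating a range of quantized models that require less than 4.3 MB of weights (wgts.) and activations (acts.). Our main contributions are: (i) hardware-aware heterogeneous differentiable quantization with tensor-sliced learned precision, (ii) targeted gradient modification for wgts. and acts. to mitigate quantization errors, and (iii) a multi-phase learning schedule to address the learning instability that arises from updates to the learned quantizer and model parameters. We demonstrate the effectiveness of our techniques on the ImageNet dataset across a range of models, including EfficientNet-Lite0 (e.g., 4.14 MB of wgts. and acts. at 67.66% accuracy) and MobileNetV2 (e.g., 3.51 MB of wgts. … % accuracy).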
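The record-breaking performance of deep neural networks (DNNs) comes with heavy parameterization, which requires external dynamic random-access memory (DRAM) for storage. The prohibitive energy cost of DRAM accesses makes it non-trivial to deploy DNNs on resource-constrained devices, calling for minimizing weight and data movement to improve energy efficiency. We present SmartDeal (SD), an algorithmic framework that trades higher-cost memory storage/access for lower-cost computation, in order to aggressively boost storage and energy efficiency in both inference and training. The core of SD is a novel weight decomposition with structural constraints, carefully crafted to unleash the hardware efficiency potential. Specifically, we decompose each weight tensor as the product of a small basis matrix and a large, structurally sparse coefficient matrix whose non-zeros are quantized to powers of 2. The resulting sparse and quantized DNNs enjoy greatly reduced energy for data movement and weight storage, while incurring minimal overhead to recover the original weights thanks to the sparse bit operations and cost-favorable computations. Beyond inference, we take another leap to embrace energy-efficient training, introducing innovative techniques to address the unique roadblocks that arise in training while preserving the SD structure. We also design a dedicated hardware accelerator that fully exploits the SD structure to improve real energy efficiency and latency. We conduct experiments on multiple tasks, models, and datasets in different settings. Results show that: 1) applied to inference, SD achieves up to 2.44x energy efficiency, as evaluated via real hardware implementations; 2) applied to training, SD reduces storage and training energy by 10.56x and 4.48x, respectively, with negligible accuracy loss compared to state-of-the-art training baselines. Our source code is available online.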
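The decomposition described above (a small basis matrix times a sparse coefficient matrix with power-of-2 non-zeros) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, rounding is done to the nearest exponent in the log domain, and the matrices are made up; the abstract does not specify how SmartDeal fits or rounds the coefficients.

```python
import math


def to_power_of_two(x):
    """Round a non-zero value to the nearest signed power of 2 (log domain); keep zeros."""
    if x == 0.0:
        return 0.0
    exponent = round(math.log2(abs(x)))
    return math.copysign(2.0 ** exponent, x)


def quantize_coefficients(coeff):
    """Apply power-of-2 rounding to every entry of a (sparse) coefficient matrix."""
    return [[to_power_of_two(c) for c in row] for row in coeff]


def reconstruct(coeff, basis):
    """Rebuild weights as coeff @ basis, skipping zero coefficients.

    With power-of-2 coefficients each product is a bit shift in fixed-point hardware.
    """
    cols = len(basis[0])
    return [
        [sum(c * basis[k][j] for k, c in enumerate(row) if c != 0.0)
         for j in range(cols)]
        for row in coeff
    ]


# Made-up example: a 2x2 basis and a 3x2 sparse coefficient matrix.
basis = [[1.0, 2.0], [3.0, 4.0]]
coeff = [[0.3, 0.0], [0.0, -3.0], [1.0, 0.0]]

coeff_q = quantize_coefficients(coeff)
weights = reconstruct(coeff_q, basis)
print(coeff_q)
print(weights)
```

Only the small basis and the sparse exponents need to be stored, which is where the storage and data-movement savings claimed in the abstract would come from.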
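Model quantization is frequently used to deploy deep models in a computationally efficient manner. In addition, as new hardware supports mixed-bitwidth arithmetic operations, recent research on mixed-precision quantization (MPQ) has begun to fully exploit representational capacity by searching for optimized bitwidths for different layers and modules in a network. However, previous studies mainly search for the MPQ strategy with costly schemes based on reinforcement learning, neural architecture search, etc., or simply use partial prior knowledge for bitwidth assignment, which may be biased and sub-optimal. In this work, we present a novel Stochastic Differentiable Quantization (SDQ) method that automatically learns the MPQ strategy in a more flexible and globally optimized space with a smoother gradient approximation. In particular, Differentiable Bitwidth Parameters (DBPs) are employed as the probability factors in stochastic quantization between adjacent bitwidth choices. After the optimal MPQ strategy is acquired, we further train the network with entropy-aware bin regularization and knowledge distillation. We extensively evaluate our method on several networks across different hardware (GPUs and FPGA) and datasets. SDQ outperforms all state-of-the-art mixed- or single-precision quantization methods, even at lower bitwidths, and is even better than the full-precision counterparts across various ResNet and MobileNet families, demonstrating the effectiveness and superiority of our method.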
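The role of a bitwidth parameter as a probability between two adjacent bitwidths can be sketched as below. This is a forward-pass illustration only, under assumptions not stated in the abstract: the quantizer is a plain symmetric uniform one, `p_high` is taken as the probability of the higher bitwidth, and all names are hypothetical. The gradient approximation that makes the parameter learnable is not shown.

```python
import random


def uniform_quantize(w, bits, w_max=1.0):
    """Symmetric uniform quantizer with 2**(bits-1)-1 positive levels (bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1
    scale = w_max / qmax
    clipped = max(-w_max, min(w_max, w))
    return round(clipped / scale) * scale


def stochastic_quantize(w, high_bits, p_high, rng):
    """Quantize at `high_bits` with probability p_high, else at `high_bits - 1`.

    Requires high_bits >= 3 so that the lower choice is still at least 2 bits.
    """
    bits = high_bits if rng.random() < p_high else high_bits - 1
    return uniform_quantize(w, bits), bits


def expected_bits(high_bits, p_high):
    """Expected bitwidth under the two-point distribution."""
    return p_high * high_bits + (1.0 - p_high) * (high_bits - 1)


rng = random.Random(0)                       # fixed seed for repeatability
samples = [stochastic_quantize(0.3, 4, 0.25, rng) for _ in range(5)]
print(samples)
print(expected_bits(4, 0.25))
```

As `p_high` moves toward 0 or 1 the choice becomes deterministic, which is how a learned probability can settle on a single bitwidth per layer.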
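As neural networks have become more powerful, there has been a rising desire to deploy them in the real world; however, the power and accuracy of neural networks are largely due to their depth and complexity, which makes them difficult to deploy, especially on resource-constrained devices. Neural network quantization has recently emerged to meet this demand by reducing the size and complexity of neural networks through lowering the precision of the network. With smaller and simpler networks, it becomes possible to run neural networks within the constraints of the target hardware. This paper surveys the many neural network quantization techniques developed over the last decade. Based on this survey and a comparison of neural network quantization techniques, we propose future directions of research in the area.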
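We present recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classification benchmark tasks. The resulting hardware implementations are configurable spatial dataflow architectures tailored for speed and efficiency, and they introduce new generic optimizations and common workflows developed as part of this work. The full workflow is presented, from quantization-aware training to FPGA implementation. The solutions are deployed on system-on-chip (PYNQ-Z2) and pure FPGA (Arty A7-100T) platforms. The resulting submissions achieve latencies as low as 20 $\mu$s and energy consumption as low as 30 $\mu$J per inference. We demonstrate how emerging ML benchmarks on heterogeneous hardware platforms can catalyze collaboration and the development of new techniques and more accessible tools.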
Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.
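The cascade idea, where each lower-precision model is derived from its adjacent higher-precision one, can be sketched with integer weights. The mapping used here (dropping the least significant bit with an arithmetic right shift) is an assumption made for illustration; the abstract does not state the exact downsampling rule, and the function names are hypothetical.

```python
def downsample_one_bit(q_weights):
    """Map signed b-bit integer weights to (b-1)-bit ones by dropping the LSB.

    Python's >> on ints is an arithmetic shift, i.e. floor division by 2.
    """
    return [q >> 1 for q in q_weights]


def cascade(q_weights, from_bits, to_bits):
    """Derive every bitwidth from `from_bits` down to `to_bits` from one source."""
    models = {from_bits: list(q_weights)}
    current = list(q_weights)
    for bits in range(from_bits - 1, to_bits - 1, -1):
        current = downsample_one_bit(current)   # each level comes from the one above
        models[bits] = current
    return models


# Made-up 8-bit signed weights, cascaded down to 6 bits.
models = cascade([127, -128, 5, -3], from_bits=8, to_bits=6)
for bits in sorted(models, reverse=True):
    print(bits, models[bits])
```

Because every lower-precision model is a deterministic function of the shared source weights, only one set of weights has to be stored and trained.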
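Advances in machine learning have opened new opportunities to bring intelligence to low-end Internet-of-Things nodes such as microcontrollers. Conventional machine learning deployments have a high memory and compute footprint, which hinders their direct deployment on ultra-resource-constrained microcontrollers. This paper highlights the unique requirements of enabling onboard machine learning for microcontroller-class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure that the compute and latency budgets stay within the device limits while still maintaining the desired performance. We characterize a widely applicable closed-loop workflow for machine learning model development on microcontroller-class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into the different stages of model development by showcasing several use cases. Finally, we identify open research challenges and unsolved questions that demand careful consideration going forward.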
Uniform-precision neural network quantization has gained popularity since it simplifies densely packed arithmetic units for high computing capability. However, it ignores heterogeneous sensitivity to the impact of quantization errors across the layers, resulting in sub-optimal inference accuracy. This work proposes a novel neural architecture search called neural channel expansion that adjusts the network structure to alleviate accuracy degradation from ultra-low uniform-precision quantization. The proposed method selectively expands channels for the quantization-sensitive layers while satisfying hardware constraints (e.g., FLOPs, PARAMs). Based on in-depth analysis and experiments, we demonstrate that the proposed method can adapt the channels of several popular networks to achieve superior 2-bit quantization accuracy on CIFAR10 and ImageNet. In particular, we achieve the best-to-date Top-1/Top-5 accuracy for 2-bit ResNet50 with smaller FLOPs and parameter size.
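There is a constant need for high-performing and computationally efficient neural network models for image super-resolution (SR) that can be used on low-capacity devices. One way to obtain such models is to compress existing architectures, e.g., by quantization. Another option is neural architecture search (NAS), which discovers new efficient solutions. We propose a novel quantization-aware NAS procedure for a specifically designed SR search space. Our approach performs NAS to find quantization-friendly SR models. The search relies on adding quantization noise to parameters and activations instead of quantizing parameters directly. Our QuantNAS finds architectures with a better PSNR/BitOps trade-off than uniform or mixed-precision quantization of fixed architectures. Additionally, our search with the noise procedure is 30% faster than directly quantizing weights.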
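Model quantization has become an indispensable technique for accelerating deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable. This is because researchers do not choose consistent training pipelines and ignore the requirements of hardware deployment. In this work, we propose the Model Quantization Benchmark (MQBench), a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We choose multiple different platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate a wide range of state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge connecting algorithms and hardware. We conduct a comprehensive analysis and find a considerable number of intuitive and counter-intuitive insights. By aligning the training settings, we find that existing algorithms have about the same performance on the conventional academic track. For hardware-deployable quantization, however, there is a huge accuracy gap that remains unsettled. Surprisingly, no existing algorithm wins every challenge in MQBench, and we hope this work can inspire future research directions.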
In recent years, image and video delivery systems have begun integrating deep learning super-resolution (SR) approaches, leveraging their unprecedented visual enhancement capabilities while reducing reliance on networking conditions. Nevertheless, deploying these solutions on mobile devices still remains an active challenge as SR models are excessively demanding with respect to workload and memory footprint. Despite recent progress on on-device SR frameworks, existing systems either penalize visual quality, lead to excessive energy consumption or make inefficient use of the available resources. This work presents NAWQ-SR, a novel framework for the efficient on-device execution of SR models. Through a novel hybrid-precision quantization technique and a runtime neural image codec, NAWQ-SR exploits the multi-precision capabilities of modern mobile NPUs in order to minimize latency, while meeting user-specified quality constraints. Moreover, NAWQ-SR selectively adapts the arithmetic precision at run time to equip the SR DNN's layers with wider representational power, improving visual quality beyond what was previously possible on NPUs. Altogether, NAWQ-SR achieves an average speedup of 7.9x, 3x and 1.91x over the state-of-the-art on-device SR systems that use heterogeneous processors (MobiSR), CPU (SplitSR) and NPU (XLSR), respectively. Furthermore, NAWQ-SR delivers an average of 3.2x speedup and 0.39 dB higher PSNR over status-quo INT8 NPU designs, but most importantly mitigates the negative effects of quantization on visual quality, setting a new state-of-the-art in the attainable quality of NPU-based SR.
When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has significantly reduced. As a result, the proportion of the execution time of other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce the runtime overhead in the batch normalization process. Backed up by the thorough analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques that are i) low bit-precision, ii) range batch normalization, and iii) block floating point. All these approximate techniques are carefully utilized not only to maintain the statistics of intermediate feature maps, but also to minimize the off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during the DNN training without hurting the training accuracy. This makes the proposed hardware a great candidate for the on-device training.