不断需要在低容量设备上使用的图像超分辨率(SR)的高性能和计算有效的神经网络模型。获取此类模型的一种方法是压缩现有体系结构,例如量化。另一个选择是发现新的有效解决方案的神经体系结构搜索(NAS)。我们为专门设计的SR搜索空间提出了一种新颖的量化NAS程序。我们的方法执行NAS以找到量化友好的SR模型。搜索依赖于将量化噪声添加到参数和激活中,而不是直接量化参数。我们的Quontnas比固定体系结构的均匀或混合精度量化找到了具有更好的PSNR/BITOP权衡的体系结构。此外,我们对噪声过程的搜索比直接量化权重的速度快30%。
translated by 谷歌翻译
We revisit the one-shot Neural Architecture Search (NAS) paradigm and analyze its advantages over existing NAS approaches. Existing one-shot method, however, is hard to train and not yet effective on large scale datasets like ImageNet. This work propose a Single Path One-Shot model to address the challenge in the training. Our central idea is to construct a simplified supernet, where all architectures are single paths so that weight co-adaption problem is alleviated. Training is performed by uniform path sampling. All architectures (and their weights) are trained fully and equally. Comprehensive experiments verify that our approach is flexible and effective. It is easy to train and fast to search. It effortlessly supports complex search spaces (e.g., building blocks, channel, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency). It is thus convenient to use for various needs. It achieves start-of-the-art performance on the large dataset ImageNet.Equal contribution. This work is done when Haoyuan Mu and Zechun Liu are interns at MEGVII Technology.
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
混合精确的深神经网络达到了硬件部署所需的能源效率和吞吐量,尤其是在资源有限的情况下,而无需牺牲准确性。但是,不容易找到保留精度的最佳每层钻头精度,尤其是在创建巨大搜索空间的大量模型,数据集和量化技术中。为了解决这一困难,最近出现了一系列文献,并且已经提出了一些实现有希望的准确性结果的框架。在本文中,我们首先总结了文献中通常使用的量化技术。然后,我们对混合精液框架进行了彻底的调查,该调查是根据其优化技术进行分类的,例如增强学习和量化技术,例如确定性舍入。此外,讨论了每个框架的优势和缺点,我们在其中呈现并列。我们最终为未来的混合精液框架提供了指南。
translated by 谷歌翻译
我们提出了三种新型的修剪技术,以提高推理意识到的可区分神经结构搜索(DNAS)的成本和结果。首先,我们介绍了DNA的随机双路构建块,它可以通过内存和计算复杂性在内部隐藏尺寸上进行搜索。其次,我们在搜索过程中提出了一种在超级网的随机层中修剪块的算法。第三,我们描述了一种在搜索过程中修剪不必要的随机层的新技术。由搜索产生的优化模型称为Prunet,并在Imagenet Top-1图像分类精度的推理潜伏期中为NVIDIA V100建立了新的最先进的Pareto边界。将Prunet作为骨架还优于COCO对象检测任务的GPUNET和EFIDENENET,相对于平均平均精度(MAP)。
translated by 谷歌翻译
深度学习技术在各种任务中都表现出了出色的有效性,并且深度学习具有推进多种应用程序(包括在边缘计算中)的潜力,其中将深层模型部署在边缘设备上,以实现即时的数据处理和响应。一个关键的挑战是,虽然深层模型的应用通常会产生大量的内存和计算成本,但Edge设备通常只提供非常有限的存储和计算功能,这些功能可能会在各个设备之间差异很大。这些特征使得难以构建深度学习解决方案,以释放边缘设备的潜力,同时遵守其约束。应对这一挑战的一种有希望的方法是自动化有效的深度学习模型的设计,这些模型轻巧,仅需少量存储,并且仅产生低计算开销。该调查提供了针对边缘计算的深度学习模型设计自动化技术的全面覆盖。它提供了关键指标的概述和比较,这些指标通常用于量化模型在有效性,轻度和计算成本方面的水平。然后,该调查涵盖了深层设计自动化技术的三类最新技术:自动化神经体系结构搜索,自动化模型压缩以及联合自动化设计和压缩。最后,调查涵盖了未来研究的开放问题和方向。
translated by 谷歌翻译
In recent years, image and video delivery systems have begun integrating deep learning super-resolution (SR) approaches, leveraging their unprecedented visual enhancement capabilities while reducing reliance on networking conditions. Nevertheless, deploying these solutions on mobile devices still remains an active challenge as SR models are excessively demanding with respect to workload and memory footprint. Despite recent progress on on-device SR frameworks, existing systems either penalize visual quality, lead to excessive energy consumption or make inefficient use of the available resources. This work presents NAWQ-SR, a novel framework for the efficient on-device execution of SR models. Through a novel hybrid-precision quantization technique and a runtime neural image codec, NAWQ-SR exploits the multi-precision capabilities of modern mobile NPUs in order to minimize latency, while meeting user-specified quality constraints. Moreover, NAWQ-SR selectively adapts the arithmetic precision at run time to equip the SR DNN's layers with wider representational power, improving visual quality beyond what was previously possible on NPUs. Altogether, NAWQ-SR achieves an average speedup of 7.9x, 3x and 1.91x over the state-of-the-art on-device SR systems that use heterogeneous processors (MobiSR), CPU (SplitSR) and NPU (XLSR), respectively. Furthermore, NAWQ-SR delivers an average of 3.2x speedup and 0.39 dB higher PSNR over status-quo INT8 NPU designs, but most importantly mitigates the negative effects of quantization on visual quality, setting a new state-of-the-art in the attainable quality of NPU-based SR.
translated by 谷歌翻译
Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.
translated by 谷歌翻译
由于神经网络变得更加强大,因此在现实世界中部署它们的愿望是一个上升的愿望;然而,神经网络的功率和准确性主要是由于它们的深度和复杂性,使得它们难以部署,尤其是在资源受限的设备中。最近出现了神经网络量化,以满足这种需求通过降低网络的精度来降低神经网络的大小和复杂性。具有较小和更简单的网络,可以在目标硬件的约束中运行神经网络。本文调查了在过去十年中开发的许多神经网络量化技术。基于该调查和神经网络量化技术的比较,我们提出了该地区的未来研究方向。
translated by 谷歌翻译
量化图像超分辨率的深卷积神经网络大大降低了它们的计算成本。然而,现有的作品既不患有4个或低位宽度的超低精度的严重性能下降,或者需要沉重的微调过程以恢复性能。据我们所知,这种对低精度的漏洞依赖于特征映射值的两个统计观察。首先,特征贴图值的分布每个通道和每个输入图像都变化显着变化。其次,特征映射具有可以主导量化错误的异常值。基于这些观察,我们提出了一种新颖的分布感知量化方案(DAQ),其促进了超低精度的准确训练量化。 DAQ的简单功能确定了具有低计算负担的特征图和权重的动态范围。此外,我们的方法通过计算每个通道的相对灵敏度来实现混合精度量化,而无需涉及任何培训过程。尽管如此,量化感知培训也适用于辅助性能增益。我们的新方法优于最近的培训甚至基于培训的量化方法,以超低精度为最先进的图像超分辨率网络。
translated by 谷歌翻译
Uniform-precision neural network quantization has gained popularity since it simplifies densely packed arithmetic unit for high computing capability. However, it ignores heterogeneous sensitivity to the impact of quantization errors across the layers, resulting in sub-optimal inference accuracy. This work proposes a novel neural architecture search called neural channel expansion that adjusts the network structure to alleviate accuracy degradation from ultra-low uniform-precision quantization. The proposed method selectively expands channels for the quantization sensitive layers while satisfying hardware constraints (e.g., FLOPs, PARAMs). Based on in-depth analysis and experiments, we demonstrate that the proposed method can adapt several popular networks channels to achieve superior 2-bit quantization accuracy on CIFAR10 and ImageNet. In particular, we achieve the best-to-date Top-1/Top-5 accuracy for 2-bit ResNet50 with smaller FLOPs and the parameter size.
translated by 谷歌翻译
我们为移动设备提出了一个轻巧的单图超分辨率网络,名为XCAT。XCAT引入了具有交叉串联(HXBLOCK)的异质群卷积块。输入通道向组卷积块的异质拆分减少了操作数量,交叉串联允许在级联HXBlocks的中间输入张量之间进行信息流。HXBlocks内部的交叉串联也可以避免使用更昂贵的操作,例如1x1卷积。为了进一步预见昂贵的张量副本操作,XCAT利用不可训练的卷积内核来应用采样操作。XCAT考虑了整数量化的设计,还利用了几种技术,例如基于强度的数据增强。Integer的XCAT量化XCAT可在320ms的Mali-G71 MP2 GPU上实时运行,以及适用于实时应用的30ms(NCHW)和8.8ms(NHWC)的Synaptics Dolphin NPU。
translated by 谷歌翻译
模型量化已成为加速深度学习推理的不可或缺的技术。虽然研究人员继续推动量化算法的前沿,但是现有量化工作通常是不可否认的和不可推销的。这是因为研究人员不选择一致的训练管道并忽略硬件部署的要求。在这项工作中,我们提出了模型量化基准(MQBench),首次尝试评估,分析和基准模型量化算法的再现性和部署性。我们为实际部署选择多个不同的平台,包括CPU,GPU,ASIC,DSP,并在统一培训管道下评估广泛的最新量化算法。 MQBENCK就像一个连接算法和硬件的桥梁。我们进行全面的分析,并找到相当大的直观或反向直观的见解。通过对齐训练设置,我们发现现有的算法在传统的学术轨道上具有大致相同的性能。虽然用于硬件可部署量化,但有一个巨大的精度差距,仍然不稳定。令人惊讶的是,没有现有的算法在MQBench中赢得每一项挑战,我们希望这项工作能够激发未来的研究方向。
translated by 谷歌翻译
量化广泛用于云和边缘系统,以减少深层神经网络的记忆占用,潜伏期和能耗。特别是,混合精液量化,即,在网络的不同部分中使用不同的位宽度,已被证明可以提供出色的效率提高,尤其是通过自动化神经体系结构确定的优化的位宽度分配,尤其是通过自动化的位宽度分配(NAS)工具。最先进的混合精液在层面上,即,它对每个网络层的权重和激活张量使用不同的位宽度。在这项工作中,我们扩大了搜索空间,提出了一种新颖的NA,该NAS独立选择每个重量张量通道的位宽度。这为工具提供了额外的灵活性,即仅针对与最有用的功能相关的权重分配更高的精度。在MLPERF微小的基准套件上进行测试,我们获得了精确度大小与精度与能量空间的帕累托最佳模型的丰富集合。当部署在MPIC RISC-V边缘处理器上时,我们的网络将记忆和能量分别减少了63%和27%,而与层的方法相比,以相同的精度为单位。
translated by 谷歌翻译
在这项工作中,我们提出了一种方法,以准确评估和比较有效的神经网络构建块的性能,以硬件感知方式进行计算机视觉。我们的比较使用了基于设计空间的随机采样网络的帕累托前沿来捕获潜在的准确性/复杂性权衡。我们表明,我们的方法允许通过以前的比较范例获得的信息匹配,但对硬件成本和准确性之间的关系提供了更多见解。我们使用我们的方法来分析不同的构件并评估其在一系列嵌入式硬件平台上的性能。这突出了基准构建块作为神经网络设计过程中的预选步骤的重要性。我们表明,选择合适的构件可以在特定硬件ML加速器上加快推理的速度2倍。
translated by 谷歌翻译
Single Image Super-Resolution (SISR) tasks have achieved significant performance with deep neural networks. However, the large number of parameters in CNN-based met-hods for SISR tasks require heavy computations. Although several efficient SISR models have been recently proposed, most are handcrafted and thus lack flexibility. In this work, we propose a novel differentiable Neural Architecture Search (NAS) approach on both the cell-level and network-level to search for lightweight SISR models. Specifically, the cell-level search space is designed based on an information distillation mechanism, focusing on the combinations of lightweight operations and aiming to build a more lightweight and accurate SR structure. The network-level search space is designed to consider the feature connections among the cells and aims to find which information flow benefits the cell most to boost the performance. Unlike the existing Reinforcement Learning (RL) or Evolutionary Algorithm (EA) based NAS methods for SISR tasks, our search pipeline is fully differentiable, and the lightweight SISR models can be efficiently searched on both the cell-level and network-level jointly on a single GPU. Experiments show that our methods can achieve state-of-the-art performance on the benchmark datasets in terms of PSNR, SSIM, and model complexity with merely 68G Multi-Adds for $\times 2$ and 18G Multi-Adds for $\times 4$ SR tasks.
translated by 谷歌翻译
尽管具有卷积神经网络(CNN)的图像超分辨率(SR)的突破性进步,但由于SR网络的计算复杂性很高,SR尚未享受无处不在的应用。量化是解决此问题的有前途方法之一。但是,现有的方法无法量化低于8位的位宽度的SR模型,由于固定的位宽度量化量的严重精度损失。在这项工作中,为了实现高平均比重减少,准确性损失较低,我们建议针对SR网络的新颖的内容感知动态量化(CADYQ)方法,该方法将最佳位置分配给本地区域和层,并根据输入的本地内容适应。图片。为此,引入了一个可训练的位选择器模块,以确定每一层和给定的本地图像补丁的适当位宽度和量化水平。该模块受量化灵敏度的控制,该量化通过使用贴片的图像梯度的平均幅度和层的输入特征的标准偏差来估计。拟议的量化管道已在各种SR网络上进行了测试,并对几个标准基准进行了广泛评估。计算复杂性和升高恢复精度的显着降低清楚地表明了SR提出的CADYQ框架的有效性。代码可从https://github.com/cheeun/cadyq获得。
translated by 谷歌翻译
随着深度学习(DL)的出现,超分辨率(SR)也已成为一个蓬勃发展的研究领域。然而,尽管结果有希望,但该领域仍然面临需要进一步研究的挑战,例如,允许灵活地采样,更有效的损失功能和更好的评估指标。我们根据最近的进步来回顾SR的域,并检查最新模型,例如扩散(DDPM)和基于变压器的SR模型。我们对SR中使用的当代策略进行了批判性讨论,并确定了有前途但未开发的研究方向。我们通过纳入该领域的最新发展,例如不确定性驱动的损失,小波网络,神经体系结构搜索,新颖的归一化方法和最新评估技术来补充先前的调查。我们还为整章中的模型和方法提供了几种可视化,以促进对该领域趋势的全球理解。最终,这篇综述旨在帮助研究人员推动DL应用于SR的界限。
translated by 谷歌翻译
我们介绍了延迟感知网络加速度(LANA) - 一种在神经结构上建立的方法,用于加速神经网络的神经结构搜索技术和教师学生蒸馏。 Lana由两个阶段组成:在第一阶段,它会使用层面特征映射蒸馏来列举每层教师网络的许多替代操作。在第二阶段,它解决了使用新颖的整数线性优化(ILP)方法的有效操作的组合选择。 ILP带来独特的属性,因为它(i)在几秒钟内执行NAS,(ii)轻松满足预算约束,(iii)在图层粒度上工作,(iv)支持巨大的搜索空间$ o(10 ^ { 100})$,超越先前的搜索方法,效率和效率。在广泛的实验中,我们表明Lana产生了由目标潜伏期预算限制的有效和准确的模型,同时比其他技术明显快。我们分析了三个流行的网络架构:高效的网络,高效网络和reses,并在压缩较大模型的较小模型的延迟级别时,实现所有型号(高达3.0 \%$)的准确性改进。 Lana通过GPU和CPU实现显着的加速(高达5美元\倍),以没有准确性下降。代码将很快分享。
translated by 谷歌翻译
深神经网络(DNNS)在各种机器学习(ML)应用程序中取得了巨大成功,在计算机视觉,自然语言处理和虚拟现实等中提供了高质量的推理解决方案。但是,基于DNN的ML应用程序也带来计算和存储要求的增加了很多,对于具有有限的计算/存储资源,紧张的功率预算和较小形式的嵌入式系统而言,这尤其具有挑战性。挑战还来自各种特定应用的要求,包括实时响应,高通量性能和可靠的推理准确性。为了应对这些挑战,我们介绍了一系列有效的设计方法,包括有效的ML模型设计,定制的硬件加速器设计以及硬件/软件共同设计策略,以启用嵌入式系统上有效的ML应用程序。
translated by 谷歌翻译