当今的大多数计算机视觉管道都是围绕深神经网络构建的,卷积操作需要大部分一般的计算工作。与标准算法相比,Winograd卷积算法以更少的MAC计算卷积,当使用具有2x2尺寸瓷砖$ F_2 $的版本时,3x3卷积的操作计数为2.25倍。即使收益很大,Winograd算法具有较大的瓷砖尺寸,即$ f_4 $,在提高吞吐量和能源效率方面具有更大的潜力,因为它将所需的MAC降低了4倍。不幸的是,具有较大瓷砖尺寸的Winograd算法引入了数值问题,这些问题阻止了其在整数域特异性加速器上的使用和更高的计算开销,以在空间和Winograd域之间转换输入和输出数据。为了解锁Winograd $ F_4 $的全部潜力,我们提出了一种新颖的Tap-Wise量化方法,该方法克服了使用较大瓷砖的数值问题,从而实现了仅整数的推断。此外,我们介绍了以功率和区域效率的方式处理Winograd转换的自定义硬件单元,并展示了如何将此类自定义模块集成到工业级,可编程的DSA中。对大量最先进的计算机视觉基准进行了广泛的实验评估表明,Tap-Wise量化算法使量化的Winograd $ F_4 $网络几乎与FP32基线一样准确。 Winograd增强的DSA可实现高达1.85倍的能源效率,最高可用于最先进的细分和检测网络的端到端速度高达1.83倍。
translated by 谷歌翻译
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
translated by 谷歌翻译
在小型电池约束的物流设备上部署现代TinyML任务需要高计算能效。使用非易失性存储器(NVM)的模拟内存计算(IMC)承诺在深神经网络(DNN)推理中的主要效率提高,并用作DNN权重的片上存储器存储器。然而,在系统级别尚未完全理解IMC的功能灵活性限制及其对性能,能量和面积效率的影响。为了目标实际的端到端的IOT应用程序,IMC阵列必须括在异构可编程系统中,引入我们旨在解决这项工作的新系统级挑战。我们介绍了一个非均相紧密的聚类架构,整合了8个RISC-V核心,内存计算加速器(IMA)和数字加速器。我们在高度异构的工作负载上基准测试,例如来自MobileNetv2的瓶颈层,显示出11.5倍的性能和9.5倍的能效改进,而在核心上高度优化并行执行相比。此外,我们通过将我们的异构架构缩放到多阵列加速器,探讨了在IMC阵列资源方面对全移动级DNN(MobileNetv2)的端到端推断的要求。我们的结果表明,我们的解决方案在MobileNetv2的端到端推断上,在执行延迟方面比现有的可编程架构更好,比最先进的异构解决方案更好的数量级集成内存计算模拟核心。
translated by 谷歌翻译
基于von-neumann架构的传统计算系统,数据密集型工作负载和应用程序(如机器学习)和应用程序都是基本上限制的。随着数据移动操作和能量消耗成为计算系统设计中的关键瓶颈,对近数据处理(NDP),机器学习和特别是神经网络(NN)的加速器等非传统方法的兴趣显着增加。诸如Reram和3D堆叠的新兴内存技术,这是有效地架构基于NN的基于NN的加速器,因为它们的工作能力是:高密度/低能量存储和近记忆计算/搜索引擎。在本文中,我们提出了一种为NN设计NDP架构的技术调查。通过基于所采用的内存技术对技术进行分类,我们强调了它们的相似之处和差异。最后,我们讨论了需要探索的开放挑战和未来的观点,以便改进和扩展未来计算平台的NDP架构。本文对计算机学习领域的计算机架构师,芯片设计师和研究人员来说是有价值的。
translated by 谷歌翻译
深度神经网络(DNN)的记录断裂性能具有沉重的参数化,导致外部动态随机存取存储器(DRAM)进行存储。 DRAM访问的禁用能量使得在资源受限的设备上部署DNN是不普遍的,呼叫最小化重量和数据移动以提高能量效率。我们呈现SmartDeal(SD),算法框架,以进行更高成本的存储器存储/访问的较低成本计算,以便在推理和培训中积极提高存储和能量效率。 SD的核心是一种具有结构约束的新型重量分解,精心制作以释放硬件效率潜力。具体地,我们将每个重量张量分解为小基矩阵的乘积以及大的结构稀疏系数矩阵,其非零被量化为-2的功率。由此产生的稀疏和量化的DNN致力于为数据移动和重量存储而大大降低的能量,因为由于稀疏的比特 - 操作和成本良好的计算,恢复原始权重的最小开销。除了推理之外,我们采取了另一次飞跃来拥抱节能培训,引入创新技术,以解决培训时出现的独特障碍,同时保留SD结构。我们还设计专用硬件加速器,充分利用SD结构来提高实际能源效率和延迟。我们在不同的设置中对多个任务,模型和数据集进行实验。结果表明:1)应用于推理,SD可实现高达2.44倍的能效,通过实际硬件实现评估; 2)应用于培训,储存能量降低10.56倍,减少了10.56倍和4.48倍,与最先进的训练基线相比,可忽略的准确性损失。我们的源代码在线提供。
translated by 谷歌翻译
State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power.Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120× energy saving; Exploiting sparsity saves 10×; Weight sharing gives 8×; Skipping zero activations from ReLU saves another 3×. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102 GOPS/s working directly on a compressed network, corresponding to 3 TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88×10 4 frames/sec with a power dissipation of only 600mW. It is 24,000× and 3,400× more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9×, 19× and 3× better throughput, energy efficiency and area efficiency.
translated by 谷歌翻译
胶囊网络(CAPSNET)是图像处理的新兴趋势。与卷积神经网络相反,CAPSNET不容易受到对象变形的影响,因为对象的相对空间信息在整个网络中保存。但是,它们的复杂性主要与胶囊结构和动态路由机制有关,这使得以其原始形式部署封闭式以由小型微控制器(MCU)供电的设备几乎是不合理的。在一个智力从云到边缘迅速转移的时代,这种高复杂性对在边缘的采用capsnets的采用构成了严重的挑战。为了解决此问题,我们提出了一个API,用于执行ARM Cortex-M和RISC-V MCUS中的量化capsnet。我们的软件内核扩展了ARM CMSIS-NN和RISC-V PULP-NN,以用8位整数作为操作数支持胶囊操作。随之而来的是,我们提出了一个框架,以执行CAPSNET的训练后量化。结果显示,记忆足迹的减少近75%,准确性损失范围从0.07%到0.18%。在吞吐量方面,我们的ARM Cortex-M API可以分别在仅119.94和90.60毫秒(MS)的中型胶囊和胶囊层执行(STM32H7555ZIT6U,Cortex-M7 @ 480 MHz)。对于GAP-8 SOC(RISC-V RV32IMCXPULP @ 170 MHz),延迟分别降至7.02和38.03 ms。
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
稀疏卷积神经网络(CNNS)在过去几年中获得了显着的牵引力,因为与其致密的对应物相比,稀疏的CNNS可以大大降低模型尺寸和计算。稀疏的CNN经常引入层形状和尺寸的变化,这可以防止密集的加速器在稀疏的CNN模型上执行良好。最近提出的稀疏加速器,如SCNN,Eyeriss V2和Sparten,积极利用双面或全稀稀物质,即重量和激活的稀疏性,用于性能收益。然而,这些加速器具有低效的微架构,其限制了它们的性能,而不对非单位步幅卷积和完全连接(Fc)层的支持,或者遭受系统负荷不平衡的大规模遭受。为了规避这些问题并支持稀疏和密集的模型,我们提出了幻影,多线程,动态和灵活的神经计算核心。 Phantom使用稀疏二进制掩码表示,以主动寻求稀疏计算,并动态调度其计算线程以最大化线程利用率和吞吐量。我们还生成了幻象神经计算核心的二维(2D)网格体系结构,我们将其称为Phantom-2D加速器,并提出了一种支持CNN的所有层的新型数据流,包括单位和非单位步幅卷积,和fc层。此外,Phantom-2D使用双级负载平衡策略来最小化计算空闲,从而进一步提高硬件利用率。为了向不同类型的图层显示支持,我们评估VGG16和MobileNet上的幻影架构的性能。我们的模拟表明,Phantom-2D加速器分别达到了12倍,4.1 X,1.98x和2.36倍,超密架构,SCNN,Sparten和Eyeriss V2的性能增益。
translated by 谷歌翻译
我们日常生活中的深度学习是普遍存在的,包括自驾车,虚拟助理,社交网络服务,医疗服务,面部识别等,但是深度神经网络在训练和推理期间需要大量计算资源。该机器学习界主要集中在模型级优化(如深度学习模型的架构压缩),而系统社区则专注于实施级别优化。在其间,在算术界中提出了各种算术级优化技术。本文在模型,算术和实施级技术方面提供了关于资源有效的深度学习技术的调查,并确定了三种不同级别技术的资源有效的深度学习技术的研究差距。我们的调查基于我们的资源效率度量定义,阐明了较低级别技术的影响,并探讨了资源有效的深度学习研究的未来趋势。
translated by 谷歌翻译
The last few years have seen a lot of work to address the challenge of low-latency and high-throughput convolutional neural network inference. Integrated photonics has the potential to dramatically accelerate neural networks because of its low-latency nature. Combined with the concept of Joint Transform Correlator (JTC), the computationally expensive convolution functions can be computed instantaneously (time of flight of light) with almost no cost. This 'free' convolution computation provides the theoretical basis of the proposed PhotoFourier JTC-based CNN accelerator. PhotoFourier addresses a myriad of challenges posed by on-chip photonic computing in the Fourier domain including 1D lenses and high-cost optoelectronic conversions. The proposed PhotoFourier accelerator achieves more than 28X better energy-delay product compared to state-of-art photonic neural network accelerators.
translated by 谷歌翻译
编译器框架对于广泛使用基于FPGA的深度学习加速器来说是至关重要的。它们允许研究人员和开发人员不熟悉硬件工程,以利用域特定逻辑所获得的性能。存在传统人工神经网络的各种框架。然而,没有多大的研究努力已经进入创建针对尖刺神经网络(SNNS)进行优化的框架。这种新一代的神经网络对于在边缘设备上部署AI的越来越有趣,其具有紧密的功率和资源约束。我们的端到端框架E3NE为FPGA自动生成高效的SNN推理逻辑。基于Pytorch模型和用户参数,它应用各种优化,并评估基于峰值的加速器固有的权衡。多个水平的并行性和新出现的神经编码方案的使用导致优于先前的SNN硬件实现的效率。对于类似的型号,E3NE使用的硬件资源的少于50%,功率较低20%,同时通过幅度降低延迟。此外,可扩展性和通用性允许部署大规模的SNN模型AlexNet和VGG。
translated by 谷歌翻译
利用稀疏性是加速在移动设备上的量化卷积神经网络(CNN)推断的关键技术。现有稀疏的CNN加速器主要利用无结构性稀疏性并实现显着的加速。然而,由于无界,很大程度上不可预测的稀疏模式,利用非结构化稀疏性需要复杂的硬件设计,具有显着的能量和面积开销,这对能量和区域效率至关重要的移动/ IOT推理场景特别有害。我们建议利用结构化的稀疏性,更具体地,更密集地绑定块(DBB)稀疏性,用于重量和激活。 DBB块张于每个块的最大非零数。因此,DBB暴露静态可预测的稀疏模式,使瘦稀疏性利用硬件能够。我们提出了新的硬件基元,以分别为(静态)权重和(动态)激活的DBB稀疏性,具有非常低的开销。建立在基元的顶部,我们描述了一种基于收缩阵列的CNN加速器的S2TA,可利用联合重量和激活DBB稀疏性和传统的收缩系统阵列上不可用的数据重用的新维度。与具有零值时钟门控的完全阵列的强基线相比,16NM中的S2TA达到超过2倍的加速和能量减少,超过五个流行的CNN基准。与近期的非收缩稀疏加速器相比,Eyeriss V2(65nm)和Sparten(45nm),S2TA在65nm中使用约2.2倍和3.1倍的每次推断的能量较少。
translated by 谷歌翻译
作为其核心计算,一种自我发挥的机制可以在整个输入序列上分配成对相关性。尽管表现良好,但计算成对相关性的成本高昂。尽管最近的工作表明了注意力分数低的元素的运行时间修剪的好处,但自我发挥机制的二次复杂性及其芯片内存能力的需求被忽略了。这项工作通过构建一个称为Sprint的加速器来解决这些约束,该加速器利用RERAM横杆阵列的固有并行性以近似方式计算注意力分数。我们的设计使用RERAM内的轻质模拟阈值电路来降低注意力评分,从而使Sprint只能获取一小部分相关数据到芯片内存。为了减轻模型准确性的潜在负面影响,Sprint重新计算数字中少数获取数据的注意力评分。相关注意分数的组合内修剪和片上重新计算可以将Sprint转化为仅线性的二次复杂性。此外,我们即使修剪后,我们也可以识别并利用相邻的注意操作之间的动态空间位置,从而消除了昂贵但冗余的数据获取。我们在各种最新的变压器模型上评估了我们提出的技术。平均而言,当使用总16KB芯片内存时,Sprint会产生7.5倍的速度和19.6倍的能量,而实际上与基线模型的等值级相当(平均为0.36%的降级)。
translated by 谷歌翻译
多年来,通过广泛研究了与量化的神经网络。遗憾的是,在GPU上的有限精度支持(例如,INT1和INT4)上通常限制具有多样化的精度(例如,1位重量和2位激活)的事先努力。为了打破这种限制,我们介绍了第一个任意精密神经网络框架(APNN-TC),以充分利用对AMPERE GPU张量核心的量化优势。具体地,APNN-TC首先结合了一种新的仿真算法来支持与INT1计算基元和XOR /和BOOLEAN操作的任意短比特宽度计算。其次,APNN-TC集成了任意精密层设计,以有效地将仿真算法映射到带有新型批处理策略和专业内存组织的张量核心。第三,APNN-TC体现了一种新型任意精密NN设计,可最大限度地减少层次的内存访问,并进一步提高性能。广泛的评估表明,APNN-TC可以通过Cutlass内核和各种NN模型实现显着加速,例如Reset和VGG。
translated by 谷歌翻译
Recently, automated co-design of machine learning (ML) models and accelerator architectures has attracted significant attention from both the industry and academia. However, most co-design frameworks either explore a limited search space or employ suboptimal exploration techniques for simultaneous design decision investigations of the ML model and the accelerator. Furthermore, training the ML model and simulating the accelerator performance is computationally expensive. To address these limitations, this work proposes a novel neural architecture and hardware accelerator co-design framework, called CODEBench. It is composed of two new benchmarking sub-frameworks, CNNBench and AccelBench, which explore expanded design spaces of convolutional neural networks (CNNs) and CNN accelerators. CNNBench leverages an advanced search technique, BOSHNAS, to efficiently train a neural heteroscedastic surrogate model to converge to an optimal CNN architecture by employing second-order gradients. AccelBench performs cycle-accurate simulations for a diverse set of accelerator architectures in a vast design space. With the proposed co-design method, called BOSHCODE, our best CNN-accelerator pair achieves 1.4% higher accuracy on the CIFAR-10 dataset compared to the state-of-the-art pair, while enabling 59.1% lower latency and 60.8% lower energy consumption. On the ImageNet dataset, it achieves 3.7% higher Top1 accuracy at 43.8% lower latency and 11.2% lower energy consumption. CODEBench outperforms the state-of-the-art framework, i.e., Auto-NBA, by achieving 1.5% higher accuracy and 34.7x higher throughput, while enabling 11.0x lower energy-delay product (EDP) and 4.0x lower chip area on CIFAR-10.
translated by 谷歌翻译
原则上,稀疏的神经网络应该比传统的密集网络更有效。大脑中的神经元表现出两种类型的稀疏性;它们稀疏地相互连接和稀疏活跃。当组合时,这两种类型的稀疏性,称为重量稀疏性和激活稀疏性,提出了通过两个数量级来降低神经网络的计算成本。尽管存在这种潜力,但今天的神经网络只使用重量稀疏提供适度的性能益处,因为传统的计算硬件无法有效地处理稀疏网络。在本文中,我们引入了互补稀疏性,这是一种显着提高现有硬件对双稀疏网络性能的新技术。我们证明我们可以实现高性能运行的重量稀疏网络,我们可以通过结合激活稀疏性来乘以这些加速。采用互补稀疏性,我们显示出对FPGA的推断的吞吐量和能效提高了100倍。我们分析了典型的商业卷积网络等各种内核的可扩展性和资源权衡,例如Resnet-50和MobileNetv2。我们的互补稀疏性的结果表明,重量加激活稀疏性可以是有效的缩放未来AI模型的有效组合。
translated by 谷歌翻译
近年来,卷积神经网络(CNN)证明了它们在许多领域解决问题的能力,并且以前无法进行准确性。但是,这带有广泛的计算要求,这使得普通CPU无法提供所需的实时性能。同时,FPGA对加速CNN推断的兴趣激增。这是由于他们有能力创建具有不同级别的并行性的自定义设计。此外,与GPU相比,FPGA提供每瓦的性能更好。基于FPGA的CNN加速器的当前趋势是实现多个卷积层处理器(CLP),每个处理器都针对一层层量身定制。但是,CNN体系结构的日益增长的复杂性使得优化目标FPGA设备上可用的资源,以使最佳性能更具挑战性。在本文中,我们提出了CNN加速器和随附的自动设计方法,该方法采用元启发式学来分区可用的FPGA资源来设计多CLP加速器。具体而言,提出的设计工具采用模拟退火(SA)和禁忌搜索(TS)算法来查找所需的CLP数量及其各自的配置,以在给定的目标FPGA设备上实现最佳性能。在这里,重点是关键规格和硬件资源,包括数字信号处理器,阻止RAM和芯片内存储器带宽。提出了使用四个众所周知的基准CNN的实验结果和比较,表明所提出的加速框架既令人鼓舞又有前途。基于SA-/TS的多CLP比在加速Alexnet,Squeezenet 1.1,VGGNET和Googlenet架构上的最新单个/多CLP方法高1.31x-2.37倍高2.37倍。和VC709 FPGA板。
translated by 谷歌翻译
神经网络(NNS)的重要性和复杂性正在增长。神经网络的性能(和能源效率)可以通过计算或内存资源约束。在内存阵列附近或内部放置计算的内存处理(PIM)范式是加速内存绑定的NNS的可行解决方案。但是,PIM体系结构的形式各不相同,其中不同的PIM方法导致不同的权衡。我们的目标是分析基于NN的性能和能源效率的基于DRAM的PIM架构。为此,我们分析了三个最先进的PIM架构:(1)UPMEM,将处理器和DRAM阵列集成到一个2D芯片中; (2)Mensa,是针对边缘设备量身定制的基于3D堆栈的PIM架构; (3)Simdram,它使用DRAM的模拟原理来执行位序列操作。我们的分析表明,PIM极大地受益于内存的NNS:(1)UPMEM在GPU需要内存过度按要求的通用矩阵 - 矢量乘数内核时提供23x高端GPU的性能; (2)Mensa在Google Edge TPU上提高了3.0倍和3.1倍的能源效率和吞吐量,用于24个Google Edge NN型号; (3)SIMDRAM在三个二进制NNS中以16.7倍/1.4倍的速度优于CPU/GPU。我们得出的结论是,由于固有的建筑设计选择,NN模型的理想PIM体系结构取决于模型的独特属性。
translated by 谷歌翻译
When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has significantly reduced. As a result, the proportion of the execution time of other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce the runtime overhead in the batch normalization process. Backed up by the thorough analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques that are i) low bit-precision, ii) range batch normalization, and iii) block floating point. All these approximate techniques are carefully utilized not only to maintain the statistics of intermediate feature maps, but also to minimize the off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during the DNN training without hurting the training accuracy. This makes the proposed hardware a great candidate for the on-device training.
translated by 谷歌翻译