Due to recent advances in digital technologies and the availability of credible data, the fields of artificial intelligence and deep learning have emerged and demonstrated their ability and effectiveness in solving complex learning problems. In particular, convolutional neural networks (CNNs) have proven effective in image detection and recognition applications. However, they require intensive CPU operations and memory bandwidth, which prevents general-purpose CPUs from achieving the desired performance levels. Consequently, hardware accelerators that use application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and graphics processing units (GPUs) have been employed to improve the throughput of CNNs. More specifically, FPGAs have recently been adopted for accelerating the implementation of deep learning networks because of their ability to maximize parallelism and their energy efficiency. In this paper, we review existing techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques to improve acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNN acceleration. The techniques investigated in this paper represent the recent trends in FPGA-based accelerators of deep learning networks. Thus, this review is expected to guide future advances in efficient hardware accelerators and to be useful for deep learning researchers.
OpenCL FPGA has recently gained great popularity with emerging needs for workload acceleration such as Convolutional Neural Network (CNN), which is the most popular deep learning architecture in the domain of computer vision. While OpenCL enhances the code portability and programmability of FPGA, it comes at the expense of performance. The key challenge is to optimize the OpenCL kernels to efficiently utilize the flexible hardware resources in FPGA. Simply optimizing the OpenCL kernel code through various compiler options turns out insufficient to achieve desirable performance for both compute-intensive and data-intensive workloads such as convolutional neural networks. In this paper, we first propose an analytical performance model and apply it to perform an in-depth analysis on the resource requirement of CNN classifier kernels and available resources on modern FPGAs. We identify that the key performance bottleneck is the on-chip memory bandwidth. We propose a new kernel design to effectively address such bandwidth limitation and to provide an optimal balance between computation, on-chip, and off-chip memory access. As a case study, we further apply these techniques to design a CNN accelerator based on the VGG model. Finally, we evaluate the performance of our CNN accelerator using an Altera Arria 10 GX1150 board. We achieve 866 Gop/s floating point performance at 370 MHz working frequency and 1.79 Top/s 16-bit fixed-point performance at 385 MHz. To the best of our knowledge, our implementation achieves the best power efficiency and performance density compared to existing work.
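The bandwidth argument above can be made concrete with a roofline-style estimate: a kernel is memory-bound whenever its arithmetic intensity times the available bandwidth falls below the device's peak compute rate. The sketch below is a minimal illustration of that reasoning, not the authors' analytical model; the peak-throughput and bandwidth figures are hypothetical placeholders.

```python
def conv_layer_stats(h, w, cin, cout, k):
    """Operation count and naive (no-reuse) data volume of one conv layer."""
    ops = 2 * h * w * cin * cout * k * k                      # multiplies + adds
    data_bytes = 4 * (h * w * cin + cin * cout * k * k + h * w * cout)
    return ops, data_bytes

def attainable_gops(ops, data_bytes, peak_gops, bandwidth_gbs):
    """Roofline bound: performance is capped by compute or by data movement."""
    intensity = ops / data_bytes                              # ops per byte moved
    return min(peak_gops, intensity * bandwidth_gbs)

ops, data = conv_layer_stats(h=56, w=56, cin=256, cout=256, k=3)
perf = attainable_gops(ops, data, peak_gops=1500.0, bandwidth_gbs=25.0)
print(f"arithmetic intensity: {ops / data:.1f} op/byte, "
      f"attainable throughput: {perf:.0f} GOP/s")
```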
Convolutional neural nets (CNNs) have become a practical means to perform vision tasks, particularly in the area of image classification. FPGAs are well known to be able to perform convolutions efficiently; however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous approaches on FPGAs have often been memory bound due to the limited external memory bandwidth on the FPGA device. We show a novel architecture written in OpenCL™, which we refer to as a Deep Learning Accelerator (DLA), that maximizes data reuse and minimizes external memory bandwidth. Furthermore, we show how we can use the Winograd transform to significantly boost the performance of the FPGA. As a result, when running our DLA on Intel's Arria 10 device we can achieve a performance of 1020 img/s, or 23 img/s/W when running the AlexNet CNN benchmark. This comes to 1382 GFLOPs and is 10x faster with 8.4x more GFLOPS and 5.8x better efficiency than the state-of-the-art on FPGAs. Additionally, 23 img/s/W is competitive against the best publicly known implementation of AlexNet on nVidia's TitanX GPU.
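The Winograd transform mentioned above reduces the number of multiplications per output at the cost of a few extra additions. The following is a minimal software sketch of the smallest case, F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six; the DLA applies the 2D analogue in hardware, so this is only an illustration of the arithmetic.

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd F(2,3): 2 outputs of a 3-tap filter from a 4-sample tile,
    using 4 multiplications instead of 6."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 0.25])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```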
Research has shown that convolutional neural networks contain significant redundancy, and that high classification accuracy can be achieved even when weights and activations are reduced from floating-point to binary values. In this paper, we present FINN, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. With a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset at 95.8% accuracy, and 21906 image classifications per second with 288 µs latency on the CIFAR-10 and SVHN datasets at 80.1% and 94.9% accuracy, respectively. To the best of our knowledge, these are the fastest classification rates reported to date on these benchmarks.
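The efficiency of binarized networks comes from replacing multiply-accumulate with bitwise logic: with {-1,+1} values packed as bits, a dot product becomes an XNOR followed by a popcount. Below is a minimal software sketch of this standard trick, for illustration only; FINN's generated hardware is considerably more involved.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1,+1} vectors packed as integers
    (bit value 1 encodes +1, bit value 0 encodes -1)."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask        # XNOR: positions where signs match
    return 2 * bin(agree).count("1") - n     # matches minus mismatches

# Two example 4-element vectors; their true dot product is 0.
a = 0b1011   # +1, +1, -1, +1  (reading from the least-significant bit)
b = 0b1101   # +1, -1, +1, +1
print(binary_dot(a, b, n=4))   # 0
```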
Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, aka "real-time AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-purpose architectures, has fueled an explosion of specialized Neural Processing Units (NPUs). NPUs for interactive services should satisfy two requirements: (1) execution of DNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models (e.g., RNNs, CNNs, MLPs) without costly silicon updates. This paper describes the NPU architecture for Project Brainwave, a production-scale system for real-time AI. The Brainwave NPU achieves more than an order of magnitude improvement in latency and throughput over state-of-the-art GPUs on large RNNs at a batch size of 1. The NPU attains this performance using a single-threaded SIMD ISA paired with a distributed microarchitecture capable of dispatching over 7M operations from a single instruction. The spatially distributed microarchitecture, scaled up to 96,000 multiply-accumulate units, is supported by hierarchical instruction decoders and schedulers coupled with thousands of independently addressable high-bandwidth on-chip memories, and can transparently exploit many levels of fine-grain SIMD parallelism. When targeting an FPGA, microarchitectural parameters such as native datapaths and numerical precision can be "synthesis specialized" to models at compile time, enabling high FPGA performance competitive with hardened NPUs. When running on an Intel Stratix 10 280 FPGA, the Brainwave NPU achieves performance ranging from ten to over thirty-five teraflops, with no batching, on large, memory-intensive RNNs.
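The headline workload here is a batch-1 matrix-vector product spread across many parallel lanes. The toy sketch below only illustrates that dataflow idea, broadcasting one vector to lanes that each own a block of matrix rows; it is a behavioral Python model, not the Brainwave ISA or microarchitecture, and the lane count is arbitrary.

```python
import numpy as np

def mv_by_lanes(W, x, lanes=4):
    """Batch-1 matrix-vector product computed as independent row blocks,
    mimicking how a spatial architecture broadcasts x to parallel lanes."""
    rows_per_lane = -(-W.shape[0] // lanes)          # ceiling division
    partials = []
    for lane in range(lanes):
        block = W[lane * rows_per_lane:(lane + 1) * rows_per_lane]
        partials.append(block @ x)                   # each lane works independently
    return np.concatenate(partials)

W = np.random.randn(16, 8).astype(np.float32)
x = np.random.randn(8).astype(np.float32)
assert np.allclose(mv_by_lanes(W, x), W @ x, atol=1e-5)
```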
Convolutional neural networks (CNN) are the current state-of-the-art for many computer vision tasks. CNNs outperform older methods in accuracy, but require vast amounts of computation and memory. As a result, existing CNN applications are typically run on clusters of CPUs or GPUs. Research on FPGA acceleration of CNN workloads has achieved reductions in power and energy consumption. However, large GPUs outperform modern FPGAs in throughput, and the existence of compatible deep learning frameworks gives GPUs a significant advantage in programmability. Recent work in machine learning demonstrates the potential of very low precision CNNs, i.e., CNNs with binarized weights and activations. Such binarized neural networks (BNNs) appear well suited for FPGA implementation, as their dominant computations are bitwise logic operations and their memory requirements are greatly reduced. A combination of low-precision networks and high-level design methodology may help address the performance and productivity gap between FPGAs and GPUs. In this paper, we present the design of a BNN accelerator that is synthesized from C++ to FPGA-targeted Verilog. The accelerator outperforms existing FPGA-based CNN accelerators in GOPS as well as energy and resource efficiency.
Deep neural networks (DNNs) are currently widely used in many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, this comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs, improving energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost, are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive guide and survey of recent advances towards the goal of efficient processing of DNNs. Specifically, it provides an overview of DNNs, discusses the various hardware platforms and architectures that support DNNs, and highlights key trends in reducing the computation cost of DNNs either solely through hardware design changes or through joint hardware design and DNN algorithm changes. It also summarizes various development resources that enable researchers and practitioners to quickly get started in this field, and highlights important benchmarking metrics and design considerations for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, proposed in academia and industry. The reader will take away the following concepts from this article: an understanding of the key design considerations for DNNs; the ability to evaluate different DNN hardware implementations with benchmarks and comparison metrics; an understanding of the trade-offs between the various hardware architectures and platforms; the ability to evaluate the utility of various DNN design techniques for efficient processing; and an understanding of recent implementation trends and opportunities.
Convolutional neural network (CNN) applications range from recognition and reasoning (such as handwriting recognition, facial expression recognition and video surveillance) to intelligent text applications such as semantic text analysis and natural language processing applications. Two key observations drive the design of a new architecture for CNN. First, CNN workloads exhibit a widely varying mix of three types of parallelism: parallelism within a convolution operation, intra-output parallelism where multiple input sources (features) are combined to create a single output, and inter-output parallelism where multiple, independent outputs (features) are computed simultaneously. Workloads differ significantly across different CNN applications, and across different layers of a CNN. Second, the number of processing elements in an architecture continues to scale (as per Moore's law) much faster than the off-chip memory bandwidth (or pin count) of chips. Based on these two observations, we show that for a given number of processing elements and off-chip memory bandwidth, a new CNN hardware architecture that dynamically configures the hardware on-the-fly to match the specific mix of parallelism in a given workload gives the best throughput performance. Our CNN compiler automatically translates a high-abstraction network specification into a parallel microprogram (a sequence of low-level VLIW instructions) that is mapped, scheduled and executed by the coprocessor. Compared to a 2.3 GHz quad-core, dual-socket Intel Xeon, a 1.35 GHz C870 GPU, and a 200 MHz FPGA implementation, our 120 MHz dynamically configurable architecture is 4x to 8x faster. This is the first CNN architecture to achieve real-time video stream processing (25 to 30 frames per second) on a wide range of object detection and recognition tasks.
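The three kinds of parallelism identified above map directly onto loop dimensions of a convolutional layer. A naive reference loop nest (for exposition only, not the coprocessor's VLIW microprogram) makes that mapping explicit:

```python
import numpy as np

def conv_layer(inp, weights):
    """inp: (Cin, H, W); weights: (Cout, Cin, K, K); 'valid' convolution.
    The loop dimensions correspond to the three kinds of parallelism:
      co     -> inter-output parallelism (independent output features)
      ci     -> intra-output parallelism (inputs combined into one output)
      kh, kw -> parallelism within a single convolution operation"""
    cin, h, w = inp.shape
    cout, _, k, _ = weights.shape
    oh, ow = h - k + 1, w - k + 1
    out = np.zeros((cout, oh, ow))
    for co in range(cout):                          # inter-output
        for ci in range(cin):                       # intra-output
            for kh in range(k):                     # within one convolution
                for kw in range(k):
                    out[co] += weights[co, ci, kh, kw] * inp[ci, kh:kh + oh, kw:kw + ow]
    return out

out = conv_layer(np.random.randn(3, 8, 8), np.random.randn(4, 3, 3, 3))
print(out.shape)   # (4, 6, 6)
```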
Convolutional Neural Networks (CNNs) have emerged as a fundamental technology for machine learning. High performance and extreme energy efficiency are critical for deployments of CNNs in a wide range of situations, especially mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. This paper introduces the Sparse CNN (SCNN) accelerator architecture, which improves performance and energy efficiency by exploiting the zero-valued weights that stem from network pruning during training and the zero-valued activations that arise from the common ReLU operator applied during inference. Specifically, SCNN employs a novel dataflow that enables maintaining the sparse weights and activations in a compressed encoding, which eliminates unnecessary data transfers and reduces storage requirements. Furthermore, the SCNN dataflow facilitates efficient delivery of those weights and activations to the multiplier array, where they are extensively reused. In addition, the accumulation of multiplication products is performed in a novel accumulator array. Our results show that on contemporary neural networks, SCNN can improve both performance and energy by a factor of 2.7× and 2.3×, respectively, over a comparably provisioned dense CNN accelerator.
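The benefit of keeping weights and activations compressed is that only nonzero pairs ever reach a multiplier, and each product is scattered to the output coordinate it belongs to. A minimal 1D sketch of that idea follows; SCNN's actual dataflow is a Cartesian product across a 2D multiplier array with a dedicated accumulator array, so this is only a behavioral illustration.

```python
def sparse_conv1d(activations, weights, out_len):
    """1D convolution computed only over nonzero (activation, weight) pairs,
    scattering each product to its output coordinate (the compressed
    Cartesian-product idea in miniature)."""
    nz_a = [(i, a) for i, a in enumerate(activations) if a != 0]
    nz_w = [(j, w) for j, w in enumerate(weights) if w != 0]
    out = [0.0] * out_len
    for i, a in nz_a:               # every nonzero activation
        for j, w in nz_w:           # meets every nonzero weight
            pos = i - j             # output coordinate of this product
            if 0 <= pos < out_len:
                out[pos] += a * w
    return out

acts = [0, 1.0, 0, 0, 2.0, 0]
wts = [0.5, 0, -1.0]
print(sparse_conv1d(acts, wts, out_len=4))   # [0.0, 0.5, -2.0, 0.0]
```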
Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains. FPGAs are an attractive choice for DNNs since they offer a programmable substrate for acceleration and are becoming available across different market segments. However, obtaining both performance and energy efficiency with FPGAs is a laborious task even for expert hardware designers. Furthermore, the large memory footprint of DNNs, coupled with the FPGAs' limited on-chip storage makes DNN acceleration using FPGAs more challenging. This work tackles these challenges by devising DNNWEAVER, a framework that automatically generates a synthesizable accelerator for a given (DNN, FPGA) pair from a high-level specification in Caffe [1]. To achieve large benefits while preserving automation, DNNWEAVER generates accelerators using hand-optimized design templates. First, DNNWEAVER translates a given high-level DNN specification to its novel ISA that represents a macro dataflow graph of the DNN. The DNNWEAVER compiler is equipped with our optimization algorithm that tiles, schedules, and batches DNN operations to maximize data reuse and best utilize target FPGA's memory and other resources. The final result is a custom synthesizable accelerator that best matches the needs of the DNN while providing high performance and efficiency gains for the target FPGA. We use DNNWEAVER to generate accelerators for a set of eight different DNN models and three different FPGAs, Xilinx Zynq, Altera Stratix V, and Altera Arria 10. We use hardware measurements to compare the generated accelerators to both multicore CPUs (ARM Cortex A15 and Xeon E3) and many-core GPUs (Tegra K1, GTX 650Ti, and Tesla K40). In comparison, the generated accelerators deliver superior performance and efficiency without requiring the programmers to participate in the arduous task of hardware design.
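One of the compiler passes described above, tiling operations to maximize data reuse within limited on-chip storage, can be illustrated with an output-stationary tiled matrix multiply. This is a generic sketch of the technique under simple assumptions, not DNNWEAVER's ISA or design templates, and the tile size is arbitrary.

```python
import numpy as np

def tiled_matmul(A, B, tile):
    """Matrix multiply with output-stationary tiling: each small block of C
    stays in fast (on-chip) storage while stripes of A and B stream past it,
    which is the reuse pattern a tiling compiler tries to maximize."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            acc = np.zeros((min(tile, n - i0), min(tile, m - j0)))
            for k0 in range(0, k, tile):
                acc += A[i0:i0 + tile, k0:k0 + tile] @ B[k0:k0 + tile, j0:j0 + tile]
            C[i0:i0 + tile, j0:j0 + tile] = acc
    return C

A, B = np.random.randn(10, 12), np.random.randn(12, 9)
assert np.allclose(tiled_matmul(A, B, tile=4), A @ B)
```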
Using FPGAs to accelerate ConvNets has attracted significant attention in recent years. However, FPGA accelerator designs have not leveraged the latest advances in ConvNets. As a result, key application characteristics such as frames per second (FPS) are ignored in favor of simply counting GOPs, and accuracy results, which are critical to application success, are often not reported. In this work, we adopt an algorithm-hardware co-design approach to develop a ConvNet accelerator called Synetgy and a novel ConvNet model called DiracNet. Both the accelerator and the ConvNet are tailored to FPGA requirements. DiracNet, as the name suggests, is a ConvNet with only 1x1 convolutions, while spatial convolutions are replaced by more efficient shift operations. DiracNet achieves competitive accuracy on ImageNet (89.0% top-5), but with 48x fewer parameters and 65x fewer OPs than VGG16. We further quantize DiracNet's weights to 1 bit and its activations to 4 bits, with less than 1% accuracy loss. These quantizations exploit the nature of FPGA hardware well. In short, DiracNet's small model size, low computational OP count, ultra-low precision, and simplified operators allow us to co-design a highly customized computing unit for the FPGA. We implement DiracNet's computing units on an Ultra96 SoC system through high-level synthesis. The implementation took only two people one month to complete. Our accelerator's final top-5 accuracy of 88.3% on ImageNet is higher than all previously reported embedded FPGA accelerators. In addition, the accelerator reaches an inference speed of 72.8 FPS on the ImageNet classification task, surpassing prior works with similar accuracy by at least 12.8x.
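The replacement of spatial convolutions by shift operations plus 1x1 convolutions can be shown in a few lines of NumPy. This is a behavioral sketch only; the particular shift offsets and layer structure below are illustrative assumptions, not the DiracNet definition.

```python
import numpy as np

def shift_op(x, offsets):
    """Shift each channel of x (C, H, W) by its (dy, dx) offset with zero padding:
    spatial information mixing with no multiplications at all."""
    _, h, w = x.shape
    out = np.zeros_like(x)
    for ch, (dy, dx) in enumerate(offsets):
        src = x[ch, max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
        out[ch, max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)] = src
    return out

def conv1x1(x, w):
    """Pointwise (1x1) convolution: a per-pixel channel-mixing matrix multiply."""
    return np.einsum('oc,chw->ohw', w, x)

x = np.random.randn(4, 6, 6)
offsets = [(0, 0), (1, 0), (0, -1), (-1, 1)]      # one example offset per channel
y = conv1x1(shift_op(x, offsets), np.random.randn(8, 4))
print(y.shape)   # (8, 6, 6)
```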
Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have been recently proposed which can offer high computational capacity/area-cost ratio, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the CNNs and DNNs memory footprint, while large, is not beyond the capability of the on-chip storage of a multi-chip system. This property, combined with the CNN/DNN algorithmic characteristics, can lead to high internal bandwidth and low external communications, which can in turn enable high-degree parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to the place and route at 28nm, containing a combination of custom storage and computational units, with industry-grade interconnects.
Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach on evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x.
Machine-Learning tasks are becoming pervasive in a broad range of domains, and in a broad range of systems (from embedded systems to data centers). At the same time, a small set of machine-learning algorithms (especially Convolutional and Deep Neural Networks, i.e., CNNs and DNNs) are proving to be state-of-the-art across many applications. As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, a machine-learning accelerator can achieve the rare combination of efficiency (due to the small number of target algorithms) and broad application scope. Until now, most machine-learning accelerator designs have focused on efficiently implementing the computational part of the algorithms. However, recent state-of-the-art CNNs and DNNs are characterized by their large size. In this study, we design an accelerator for large-scale CNNs and DNNs, with a special emphasis on the impact of memory on accelerator design, performance and energy. We show that it is possible to design an accelerator with a high throughput, capable of performing 452 GOP/s (key NN operations such as synaptic weight multiplications and neurons outputs additions) in a small footprint of 3.02 mm² and 485 mW; compared to a 128-bit 2GHz SIMD processor, the accelerator is 117.87x faster, and it can reduce the total energy by 21.08x. The accelerator characteristics are obtained after layout at 65nm. Such a high throughput in a small footprint can open up the usage of state-of-the-art machine-learning algorithms in a broad set of systems and for a broad set of applications.
We present a novel mechanism to accelerate state-of-the-art Convolutional Neural Networks (CNNs) on a CPU-FPGA platform with coherent shared memory. First, we exploit Fast Fourier Transform (FFT) and Overlap-and-Add (OaA) to reduce the computational requirements of the convolutional layer. We map the frequency-domain algorithms onto a highly parallel OaA-based 2D convolver design on the FPGA. Then, we propose a novel data layout in shared memory for efficient data communication between the CPU and the FPGA. To reduce the memory access latency and sustain peak performance of the FPGA, our design employs double buffering. To reduce the inter-layer data remapping latency, we exploit concurrent processing on the CPU and the FPGA. Our approach can be applied to any kernel size less than the chosen FFT size with appropriate zero-padding, leading to acceleration of a wide range of CNN models. We exploit the data parallelism of the OaA-based 2D convolver and task parallelism to scale the overall system performance. By using OaA, the number of floating point operations is reduced by 39.14% ∼ 54.10% for the state-of-the-art CNNs. We implement VGG16, AlexNet and GoogLeNet on the Intel QuickAssist QPI FPGA Platform. These designs sustain 123.48 GFLOPs/sec, 83.00 GFLOPs/sec and 96.60 GFLOPs/sec, respectively. Compared with the state-of-the-art AlexNet implementation, our design achieves 1.35x GFLOPs/sec improvement using 3.33x fewer multipliers and 1.1x less memory. Compared with the state-of-the-art VGG16 implementation, our design has 0.66x GFLOPs/sec using 3.48x fewer multipliers without impacting the classification accuracy. For the GoogLeNet implementation, our design achieves 5.56x improvement in performance compared with 16 threads running on a 10-core Intel Xeon processor at 2.8 GHz.
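Overlap-and-Add splits the input into blocks, convolves each block in the frequency domain, and adds the overlapping tails back into the output. A minimal 1D NumPy sketch of the idea follows; the paper maps the 2D analogue onto an FPGA convolver, so this only illustrates the algorithmic reasoning.

```python
import numpy as np

def overlap_add_conv(x, h, block):
    """Linear convolution of x with kernel h via Overlap-and-Add:
    FFT-convolve each block of x, then add the overlapping tails."""
    n_fft = block + len(h) - 1                    # room for each block's tail
    H = np.fft.rfft(h, n_fft)
    y = np.zeros(len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = np.fft.rfft(x[start:start + block], n_fft)
        chunk = np.fft.irfft(seg * H, n_fft)      # block convolved with kernel
        end = min(start + n_fft, len(y))
        y[start:end] += chunk[:end - start]
    return y

x = np.random.randn(50)
h = np.random.randn(5)
assert np.allclose(overlap_add_conv(x, h, block=16), np.convolve(x, h))
```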
State-of-the-art accelerators for convolutional neural networks (CNNs) typically focus on accelerating only the convolutional layers, and do not prioritize the fully-connected layers. As a result, they lack a synergistic optimization of the hardware architecture and the diverse dataflows for the complete CNN design, which could offer higher performance/energy-efficiency potential. To this end, we propose a novel Massively-Parallel Neural Array (MPNA) accelerator that integrates two heterogeneous systolic arrays and respective highly-optimized dataflow patterns to jointly accelerate the convolutional (CONV) and fully-connected (FC) layers. Besides fully exploiting the available off-chip memory bandwidth, these optimized dataflows also enable high data reuse for all data types (i.e., weights, input and output activations), which allows our MPNA to achieve high energy savings. We synthesized our MPNA architecture using an ASIC design flow for a 28nm technology node and performed functional and timing verification using multiple real-world complex CNNs. MPNA achieves 149.7 GOPS/W at 280 MHz with a power consumption of 239 mW. Experimental results show that our MPNA architecture provides a 1.7x overall performance improvement compared to state-of-the-art accelerators, and 51% energy savings compared to a baseline architecture.
Many recent visual recognition systems can be seen as being composed of multiple layers of convolutional filter banks, interspersed with various types of non-linearities. This includes Convolutional Networks, HMAX-type architectures, as well as systems based on dense SIFT features or Histogram of Gradients. This paper describes a highly compact and low-power embedded system that can run such vision systems at very high speed. A custom board built around a Xilinx Virtex-4 FPGA was built and tested. It measures 70 × 80 mm, and the complete system (FPGA, camera, memory chips, flash) consumes 15 watts at peak and is capable of more than 4 × 10⁹ multiply-accumulate operations per second in a real vision application. This enables real-time implementations of object detection, object recognition, and vision-based navigation algorithms in small-size robots, micro-UAVs, and hand-held devices. Real-time face detection is demonstrated, with speeds of 10 frames per second at VGA resolution.
The training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited-precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe that the rounding scheme plays a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only a 16-bit wide fixed-point number representation when using stochastic rounding, with little to no degradation in classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
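Stochastic rounding rounds a value up with probability equal to its fractional remainder, so the quantization error is zero in expectation and does not accumulate systematically over many training updates. Below is a small illustrative sketch of 16-bit fixed-point conversion with stochastic rounding; it is not the paper's hardware unit, and the bit-width split is an example choice.

```python
import numpy as np

def to_fixed_stochastic(x, frac_bits=8, word_bits=16, seed=0):
    """Quantize to signed fixed-point with 'frac_bits' fractional bits, rounding
    up with probability equal to the fractional remainder (unbiased rounding)."""
    rng = np.random.default_rng(seed)
    scaled = np.asarray(x, dtype=np.float64) * 2.0 ** frac_bits
    floor = np.floor(scaled)
    rounded = floor + (rng.random(scaled.shape) < (scaled - floor))
    top = 2 ** (word_bits - 1)
    return np.clip(rounded, -top, top - 1) / 2.0 ** frac_bits

x = np.full(100000, 0.1234)
print(to_fixed_stochastic(x).mean())   # close to 0.1234 despite only 8 fractional bits
```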
Neural-network-based image processing methods are widely used in practical applications. Modern neural networks are computationally expensive and require specialized hardware, such as graphics processing units. Since such hardware is not always available in real life, there is a pressing need for neural network designs suitable for mobile devices. Mobile neural networks typically have a reduced number of parameters and require a relatively small number of arithmetic operations. However, they are usually executed at the software level and use floating-point computations. Using mobile networks without further optimization may not provide sufficient performance when high processing speed is required, for example, in real-time video processing (30 frames per second). In this study, we propose optimizations that speed up computations so that already trained neural networks can be used efficiently on mobile devices. Specifically, we propose an approach for accelerating neural networks by moving computation from software to hardware and by using fixed-point instead of floating-point computations. We propose a number of neural network architecture design methods to improve performance with fixed-point computations. We also show an example of how an existing dataset can be modified and adapted to the recognition task at hand. Finally, we present the design and implementation of an FPGA-based device that solves the practical problem of real-time handwritten digit classification from a mobile camera video input.
In recent years, neural networks have surpassed classical algorithms in areas such as object recognition, for example in the well-known ImageNet challenge. As a result, considerable effort is being put into developing fast and efficient accelerators, especially for convolutional neural networks (CNNs). In this work, we present ConvAix, a fully C-programmable processor that, in contrast to many existing architectures, does not rely on a hard-wired array of multiply-and-accumulate (MAC) units. Instead, it maps computations onto independent vector lanes using a carefully designed vector instruction set. The proposed processor targets latency-sensitive applications and is capable of executing up to 192 MAC operations per cycle. ConvAix operates at a target clock frequency of 400 MHz in 28nm CMOS, thereby offering state-of-the-art performance within its target domain while providing adequate flexibility. Simulation results for several 2D convolutional layers from well-known CNNs (AlexNet, VGG-16) show an average ALU utilization of 72.5% using vector instructions with 16-bit fixed-point arithmetic. Compared to other well-known designs with less flexibility, ConvAix offers competitive energy efficiency of up to 497 GOP/s/W while even surpassing them in terms of area efficiency and processing speed.