Translated by Google Translate
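The determination of charged particle trajectories in collisions at the CERN Large Hadron Collider (LHC) is an important but challenging problem, especially under the high interaction density conditions expected during the future high-luminosity phase of the LHC (HL-LHC). Graph neural networks (GNNs) are a type of geometric deep learning algorithm that has been successfully applied to this task by embedding tracker data as a graph, in which nodes represent hits and edges represent possible track segments, and classifying the edges as true or fake track segments. However, their study in hardware- or software-based trigger applications has been limited by their large computational cost. In this paper, we introduce an automated translation workflow, integrated into a broader tool called $\texttt{hls4ml}$, for converting GNNs into firmware for field-programmable gate arrays (FPGAs). We use this translation tool to implement GNNs for charged particle tracking, trained using the TrackML challenge dataset, on FPGAs, with designs targeting different graph sizes, task complexities, and latency/throughput requirements. This work could enable the inclusion of charged particle tracking GNNs at the trigger level for HL-LHC experiments.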
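Efficient quantum control is necessary for practical quantum computing implementations with current technologies. Conventional algorithms for determining optimal control parameters are computationally expensive, which largely excludes them from use outside of simulation. Existing hardware solutions structured as lookup tables are imprecise and costly. A more efficient method can be produced by designing a machine learning model that approximates the results of traditional tools. Such a model can then be synthesized into a hardware accelerator for use in quantum systems. In this study, we demonstrate a machine learning algorithm for predicting optimal pulse parameters. The algorithm is lightweight enough to fit on a low-resource FPGA and performs inference with a latency of 175 ns and a pipeline interval of 5 ns, with $> 0.99$ gate fidelity. In the long term, such an accelerator could be used close to quantum computing hardware where traditional computers cannot operate, enabling quantum control at low latency and reasonable cost without incurring large data bandwidths outside the cryogenic environment.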
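We present recent results for the MLPerf Tiny Inference Benchmark on field-programmable gate array (FPGA) platforms. We use the open-source hls4ml and FINN workflows, which aim to democratize AI-hardware codesign of optimized neural networks on FPGAs. We present the design and implementation process for the keyword spotting, anomaly detection, and image classification benchmark tasks. The resulting hardware implementations are quantized, configurable, spatial dataflow architectures tailored for speed and efficiency, and they introduce new generic optimizations and common workflows developed as part of this work. The full workflow is presented, from quantization-aware training to FPGA implementation. The solutions are deployed on system-on-chip (Pynq-Z2) and pure FPGA (Arty A7-100T) platforms. The resulting submissions achieve latencies as low as 20 $\mu$s and energy consumption as low as 30 $\mu$J per inference. We demonstrate how emerging ML benchmarks on heterogeneous hardware platforms can catalyze collaboration and the development of new techniques and more accessible tools.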
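Neural networks have demonstrated outstanding performance in a wide range of tasks. Specifically, recurrent architectures based on long short-term memory (LSTM) cells have shown an excellent capability to model time dependencies in real-world data. However, standard recurrent architectures cannot estimate their uncertainty, which is essential for safety-critical applications such as medicine. In contrast, Bayesian recurrent neural networks (RNNs) are able to provide uncertainty estimation with improved accuracy. Nonetheless, Bayesian RNNs are computationally and memory demanding, which limits their practicality despite their advantages. To address this issue, we propose an FPGA-based hardware design to accelerate Bayesian LSTM-based RNNs. To further improve the overall algorithmic-hardware performance, a co-design framework is proposed to explore the most suitable algorithmic-hardware configurations for Bayesian RNNs. We conduct extensive experiments on healthcare applications to demonstrate the improvement of our design and the effectiveness of our framework. Compared with a GPU implementation, our FPGA-based design can achieve up to 10 times speedup with nearly 106 times higher energy efficiency. To the best of our knowledge, this is the first work targeting the acceleration of Bayesian RNNs on FPGAs.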
In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to the FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance in Q-factor is presented for the cases of bidirectional long-short-term memory coupled with convolutional NN (biLSTM + CNN) equalizer, CNN equalizer, and standard 1-StpS digital back-propagation (DBP) for the simulation and experiment propagation of a single channel dual-polarization (SC-DP) 16QAM at 34 GBd along 17x70km of LEAF. The biLSTM+CNN equalizer provides a similar result to DBP and a 1.7 dB Q-factor gain compared with the chromatic dispersion compensation baseline in the experimental dataset. After that, we assess the Q-factor and the impact of hardware utilization when approximating the activation functions of NN using Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of hardware implementation to achieve 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.
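The three families of activation-function approximation named in the abstract above (Taylor series, piecewise linear, and look-up table) can be illustrated with a minimal Python sketch for tanh. The expansion order, the hard-tanh segments, the table size of 256, and the input range of [-4, 4] are all assumptions chosen for illustration, not values taken from the paper.

```python
import math

def tanh_taylor(x):
    """Fifth-order Taylor expansion of tanh around 0 (accurate only for small |x|)."""
    return x - x**3 / 3 + 2 * x**5 / 15

def tanh_pwl(x):
    """Crudest piecewise-linear form (hard tanh): identity clipped to [-1, 1]."""
    return max(-1.0, min(1.0, x))

def make_tanh_lut(n_entries=256, x_max=4.0):
    """Build a nearest-entry look-up table for tanh on [-x_max, x_max]."""
    step = 2 * x_max / (n_entries - 1)
    table = [math.tanh(-x_max + i * step) for i in range(n_entries)]

    def lookup(x):
        # Saturate the input, then pick the nearest table entry.
        x = max(-x_max, min(x_max, x))
        idx = min(n_entries - 1, round((x + x_max) / step))
        return table[idx]

    return lookup

if __name__ == "__main__":
    lut = make_tanh_lut()
    for x in (0.1, 0.5, 1.5):
        print(x, math.tanh(x), tanh_taylor(x), tanh_pwl(x), lut(x))
```

The look-up table is piecewise constant, so its derivative is zero almost everywhere; this is one plausible reading of the gradient problems the abstract mentions for the LUT approximation.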
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
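As a rough illustration of two of the compression categories named above, the following Python sketch applies symmetric uniform quantization and magnitude-based pruning to a flat list of weights. Both are common textbook variants chosen here as assumptions; the article surveys many schemes and this is not claimed to be the one used in its experiments. The function names are hypothetical.

```python
def quantize_uniform(weights, bits=8):
    """Symmetric uniform quantization: floats -> signed integers -> dequantized floats."""
    q_max = (1 << (bits - 1)) - 1                 # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    ints = [round(w / scale) for w in weights]    # integer codes in [-q_max, q_max]
    dequantized = [q * scale for q in ints]
    return dequantized, ints, scale

def prune_by_magnitude(weights, sparsity=0.5):
    """Set the fraction `sparsity` of weights with the smallest magnitude to zero."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

if __name__ == "__main__":
    w = [0.5, -1.0, 0.25, 0.05]
    print(quantize_uniform(w, bits=4))
    print(prune_by_magnitude(w, sparsity=0.5))
```

Fewer bits shrink the memory footprint but enlarge the rounding error (bounded by half the scale), which is one concrete form of the trade-off between resource efficiency and predictive performance.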
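This work proposes a novel reconfigurable architecture for low-latency graph neural network (GNN) designs, specifically for particle detectors. Accelerating GNNs for particle detectors is challenging because sub-microsecond latency is required to deploy the networks for online event selection in the Level-1 triggers of the CERN Large Hadron Collider experiments. This paper proposes a custom code transformation with strength reduction for the matrix multiplication operations in interaction-network-based GNNs with fully connected graphs, which avoids costly multiplications. It exploits sparsity patterns as well as binary adjacency matrices and avoids irregular memory access, leading to reduced latency and improved hardware efficiency. In addition, we introduce an outer-product-based matrix multiplication approach, which is enhanced by the strength reduction for low-latency design. A fusion step is also introduced to further reduce the design latency. Furthermore, a GNN-specific algorithm-hardware co-design approach is presented, which not only finds a design with better latency but also finds a high-accuracy design under a given latency constraint. Finally, a customizable template for this low-latency GNN hardware architecture has been designed and open-sourced, which enables the generation of low-latency FPGA designs with efficient resource utilization using a high-level synthesis tool. Evaluation results show that our FPGA implementation is 24 times faster and consumes 45 times less power than a GPU implementation. Compared to our previous FPGA implementations, this work achieves 6.51 to 16.7 times lower latency. Moreover, the latency of our FPGA design is sufficiently low to enable the deployment of GNNs in a sub-microsecond, real-time collider trigger system, enabling it to benefit from improved accuracy.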
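The general idea behind strength reduction for a binary adjacency matrix can be sketched in a few lines of Python. The sketch assumes the usual interaction-network setup, in which the receiver matrix has exactly one 1 per column (one per edge), so multiplying node features by it reduces to copying columns. All function names are hypothetical, and this illustrates the principle only; it is not the paper's actual code transformation or hardware template.

```python
def matmul(a, b):
    """Dense reference product of a (m x k) and b (k x n), both lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def fully_connected_edges(n_nodes):
    """All directed (sender, receiver) pairs of a fully connected graph without self-loops."""
    return [(s, r) for r in range(n_nodes) for s in range(n_nodes) if s != r]

def one_hot_columns(index, n_rows):
    """Binary matrix whose column j has a single 1 in row index[j]."""
    return [[1 if index[j] == i else 0 for j in range(len(index))]
            for i in range(n_rows)]

def gather_columns(features, index):
    """Strength-reduced product: copy the selected feature columns, no multiplications."""
    return [[row[k] for k in index] for row in features]

if __name__ == "__main__":
    features = [[1, 2, 3], [4, 5, 6]]            # 2 features x 3 nodes
    receivers = [r for _, r in fully_connected_edges(3)]
    dense = matmul(features, one_hot_columns(receivers, 3))
    assert dense == gather_columns(features, receivers)
    print(dense)
```

Because the edge list of a fully connected graph is known at design time, the index pattern is fixed, which is consistent with the abstract's point about avoiding irregular memory access.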
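In this paper, we provide a systematic approach for assessing and comparing the computational complexity of neural network layers in digital signal processing. We provide and link four software-to-hardware complexity measures, defining how the different complexity metrics relate to the layers' hyperparameters. The paper explains how to compute these four metrics for feed-forward and recurrent layers, and defines in which cases a particular metric should be used, depending on whether we characterize a more software-oriented or hardware-oriented application. One of the four metrics, called "the number of additions and bit shifts (NABS)", is newly introduced for heterogeneous quantization. NABS characterizes the impact not only of the bit width used in the operations but also of the type of quantization used in the arithmetic operations. We intend this work to serve as a baseline for the different levels (purposes) of complexity estimation related to the application of neural networks in real-time digital signal processing, aiming to unify computational complexity estimation.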
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
Recurrent neural networks (RNNs) are the backbone of many text and speech applications. These architectures are typically made up of several computationally complex components such as non-linear activation functions, normalization, bi-directional dependence and attention. In order to maintain good accuracy, these components are frequently run using full-precision floating-point computation, making them slow, inefficient and difficult to deploy on edge devices. In addition, the complex nature of these operations makes them challenging to quantize using standard quantization methods without a significant performance drop. We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN). Our approach supports layer normalization, attention, and an adaptive piecewise linear (PWL) approximation of activation functions, to serve a wide range of state-of-the-art RNNs. The proposed method enables RNN-based language models to run on edge devices with a $2\times$ improvement in runtime and a $4\times$ reduction in model size while maintaining accuracy similar to that of their full-precision counterparts.
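The massive use of artificial neural networks (ANNs), which are increasingly popular in many areas of scientific computing, is rapidly increasing the energy consumption of modern high-performance computing systems. Novel neuromorphic paradigms, which implement ANNs directly in hardware, offer an appealing alternative. However, little is known about the actual benefits of running ANNs on neuromorphic hardware for use cases in scientific computing. Here we present a methodology for measuring the energy cost and compute time of inference tasks with ANNs on conventional hardware. In addition, we have designed an architecture for these tasks and estimate the same metrics based on a state-of-the-art analog in-memory computing (AIMC) platform, one of the key paradigms in neuromorphic computing. The two approaches are compared for a use case in quantum many-body physics in two-dimensional condensed matter systems and for anomaly detection at a 40 MHz rate at the Large Hadron Collider in particle physics. We find that, compared with conventional hardware, AIMC can achieve computation times that are up to one order of magnitude shorter, at an energy cost that is up to three orders of magnitude smaller. This suggests the potential of neuromorphic hardware for faster and more sustainable scientific computing.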
In this article, we use artificial intelligence algorithms to show how to enhance the resolution of the elementary particle track fitting in inhomogeneous dense detectors, such as plastic scintillators. We use deep learning to replace more traditional Bayesian filtering methods, drastically improving the reconstruction of the interacting particle kinematics. We show that a specific form of neural network, inherited from the field of natural language processing, is very close to the concept of a Bayesian filter that adopts a hyper-informative prior. Such a paradigm change can influence the design of future particle physics experiments and their data exploitation.
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
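Stochastic rounding, the scheme the abstract above identifies as crucial, can be sketched as follows. The sketch rounds a value up or down to the neighbouring fixed-point grid points with probability proportional to proximity, so the result is unbiased in expectation. The 16-bit word length comes from the abstract; the split into 8 fractional bits and the saturation behaviour are assumptions made for illustration.

```python
import math
import random

def stochastic_round_fixed(x, frac_bits=8, word_bits=16, rng=random):
    """Quantize x to signed fixed point, rounding up with probability equal to the remainder."""
    scale = 1 << frac_bits
    scaled = x * scale
    lower = math.floor(scaled)
    # scaled - lower lies in [0, 1); it is the probability of rounding up.
    if rng.random() < (scaled - lower):
        lower += 1
    # Saturate to the representable signed range of the word.
    q_min = -(1 << (word_bits - 1))
    q_max = (1 << (word_bits - 1)) - 1
    q = max(q_min, min(q_max, lower))
    return q / scale

if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round_fixed(0.3, frac_bits=2, rng=rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; the average should be close to 0.3.
    print(sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time on this coarse grid, so small updates would be lost systematically; with stochastic rounding they survive on average.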