本文介绍了专门针对软件定义无线电(SDR)的新域特异性嵌入式语言(DSEL)。从一组精心设计的组件中,它可以构建有效的软件数字通信系统,能够以简单明了的方式利用现代处理器体系结构的并行性。特别是,提出的DSEL使管道和序列重复技术的组合能够从数字通信系统中提取时间和空间并行性。我们利用了真实用例上的DSEL功能:用于完全在软件中设计的广泛使用的DVB-S2标准的完全数字收发器。通过评估,我们展示了建议的软件DVB-S2收发器如何从现代高端多核CPU目标中获得最大的收益。
translated by 谷歌翻译
最近,使用卷积神经网络(CNNS)存在移动和嵌入式应用的爆炸性增长。为了减轻其过度的计算需求,开发人员传统上揭示了云卸载,突出了高基础设施成本以及对网络条件的强烈依赖。另一方面,强大的SOC的出现逐渐启用设备执行。尽管如此,低端和中层平台仍然努力充分运行最先进的CNN。在本文中,我们展示了Dyno,一种分布式推断框架,将两全其人的最佳框架结合起来解决了几个挑战,例如设备异质性,不同的带宽和多目标要求。启用这是其新的CNN特定数据包装方法,其在onloading计算时利用CNN的不同部分的精度需求的可变性以及其新颖的调度器,该调度器共同调谐分区点并在运行时传输数据精度适应其执行环境的推理。定量评估表明,Dyno优于当前最先进的,通过竞争对手的CNN卸载系统,在竞争对手的CNN卸载系统上提高吞吐量超过一个数量级,最高可达60倍的数据。
translated by 谷歌翻译
在小型电池约束的物流设备上部署现代TinyML任务需要高计算能效。使用非易失性存储器(NVM)的模拟内存计算(IMC)承诺在深神经网络(DNN)推理中的主要效率提高,并用作DNN权重的片上存储器存储器。然而,在系统级别尚未完全理解IMC的功能灵活性限制及其对性能,能量和面积效率的影响。为了目标实际的端到端的IOT应用程序,IMC阵列必须括在异构可编程系统中,引入我们旨在解决这项工作的新系统级挑战。我们介绍了一个非均相紧密的聚类架构,整合了8个RISC-V核心,内存计算加速器(IMA)和数字加速器。我们在高度异构的工作负载上基准测试,例如来自MobileNetv2的瓶颈层,显示出11.5倍的性能和9.5倍的能效改进,而在核心上高度优化并行执行相比。此外,我们通过将我们的异构架构缩放到多阵列加速器,探讨了在IMC阵列资源方面对全移动级DNN(MobileNetv2)的端到端推断的要求。我们的结果表明,我们的解决方案在MobileNetv2的端到端推断上,在执行延迟方面比现有的可编程架构更好,比最先进的异构解决方案更好的数量级集成内存计算模拟核心。
translated by 谷歌翻译
Video, as a key driver in the global explosion of digital information, can create tremendous benefits for human society. Governments and enterprises are deploying innumerable cameras for a variety of applications, e.g., law enforcement, emergency management, traffic control, and security surveillance, all facilitated by video analytics (VA). This trend is spurred by the rapid advancement of deep learning (DL), which enables more precise models for object classification, detection, and tracking. Meanwhile, with the proliferation of Internet-connected devices, massive amounts of data are generated daily, overwhelming the cloud. Edge computing, an emerging paradigm that moves workloads and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new intersection, edge video analytics (EVA), begins to attract widespread attention. Nevertheless, only a few loosely-related surveys exist on this topic. A dedicated venue for collecting and summarizing the latest advances of EVA is highly desired by the community. Besides, the basic concepts of EVA (e.g., definition, architectures, etc.) are ambiguous and neglected by these surveys due to the rapid development of this domain. A thorough clarification is needed to facilitate a consensus on these concepts. To fill in these gaps, we conduct a comprehensive survey of the recent efforts on EVA. In this paper, we first review the fundamentals of edge computing, followed by an overview of VA. The EVA system and its enabling techniques are discussed next. In addition, we introduce prevalent frameworks and datasets to aid future researchers in the development of EVA systems. Finally, we discuss existing challenges and foresee future research directions. We believe this survey will help readers comprehend the relationship between VA and edge computing, and spark new ideas on EVA.
translated by 谷歌翻译
基于von-neumann架构的传统计算系统,数据密集型工作负载和应用程序(如机器学习)和应用程序都是基本上限制的。随着数据移动操作和能量消耗成为计算系统设计中的关键瓶颈,对近数据处理(NDP),机器学习和特别是神经网络(NN)的加速器等非传统方法的兴趣显着增加。诸如Reram和3D堆叠的新兴内存技术,这是有效地架构基于NN的基于NN的加速器,因为它们的工作能力是:高密度/低能量存储和近记忆计算/搜索引擎。在本文中,我们提出了一种为NN设计NDP架构的技术调查。通过基于所采用的内存技术对技术进行分类,我们强调了它们的相似之处和差异。最后,我们讨论了需要探索的开放挑战和未来的观点,以便改进和扩展未来计算平台的NDP架构。本文对计算机学习领域的计算机架构师,芯片设计师和研究人员来说是有价值的。
translated by 谷歌翻译
In this work, we demonstrate the offline FPGA realization of both recurrent and feedforward neural network (NN)-based equalizers for nonlinearity compensation in coherent optical transmission systems. First, we present a realization pipeline showing the conversion of the models from Python libraries to the FPGA chip synthesis and implementation. Then, we review the main alternatives for the hardware implementation of nonlinear activation functions. The main results are divided into three parts: a performance comparison, an analysis of how activation functions are implemented, and a report on the complexity of the hardware. The performance in Q-factor is presented for the cases of bidirectional long-short-term memory coupled with convolutional NN (biLSTM + CNN) equalizer, CNN equalizer, and standard 1-StpS digital back-propagation (DBP) for the simulation and experiment propagation of a single channel dual-polarization (SC-DP) 16QAM at 34 GBd along 17x70km of LEAF. The biLSTM+CNN equalizer provides a similar result to DBP and a 1.7 dB Q-factor gain compared with the chromatic dispersion compensation baseline in the experimental dataset. After that, we assess the Q-factor and the impact of hardware utilization when approximating the activation functions of NN using Taylor series, piecewise linear, and look-up table (LUT) approximations. We also show how to mitigate the approximation errors with extra training and provide some insights into possible gradient problems in the LUT approximation. Finally, to evaluate the complexity of hardware implementation to achieve 400G throughput, fixed-point NN-based equalizers with approximated activation functions are developed and implemented in an FPGA.
translated by 谷歌翻译
基于代理的建模(ABM),仿真(ABS)和分布式计算(ABC)是建立的方法。互联网和基于Web的技术是合适的运营商。本文是一份技术报告,其中具有JavaScript Agent Machine(JAM)平台的某些教程,以及使用AgentJS编程的代理程序,该代理是广泛使用的JavaScript编程语言的子集,用于编程基于移动状态的反应性代理。除了解释特定设计选择的动机以及在JavaScript中介绍架构和代理编程的核心概念外,简短示例还说明了JAM平台的功能及其组件,用于部署大型多机构系统在强大的强大中诸如互联网之类的异质环境。果酱适合在强大的异质和移动环境中部署。最后,果酱可用于ABC以及在统一方法中用于ABS,最终使移动人群感测和模拟(ABS)。
translated by 谷歌翻译
这项工作侧重于特定于域的加速器的有效敏捷设计方法。我们采用垂直开发堆栈的功能逐个功能增强,并将其应用于TVM / VTA推理加速器。我们已经增强了VTA设计空间,并启用了用于额外工作负载的端到端支持。这是通过增强VTA微架构和指令集架构(ISA)来实现的,以及通过增强TVM编译堆栈来支持各种VTA配置。 VTA TSIM实现(基于凿子)已通过ALU / GEMM执行单元的完全流水线版本增强。在TSIM中,内存宽度现在可以在8-64字节之间。对于支持较大的刮板,已经使场宽度更加灵活。已添加新的说明:元素 - WISE 8位乘法,支持深度卷积,并使用焊盘值的选择加载以支持最大池。还添加了对更多层和更好的双缓冲。完全管制的ALU / GEMM有助于显着帮助:4.9倍的循环较少,最小区域更改为在默认配置下运行RESET-18。可以实例化特征在于11.5倍的循环计数的配置,以12倍的循环计数更大的区域。显示了区域性能帕累托曲线上的许多点,展示了执行单元尺寸,内存接口宽度和刻痕尺寸的余额。最后,VTA现在能够运行MobileNet 1.0和所有层进行Resnet,包括先前禁用的池和完全连接的图层。 TVM / VTA架构始终在几分钟内以RTL呈现端到端工作量评估。通过我们的修改,它现在提供了更大的可行配置,具有广泛的成本与性能。所有提到的所有功能都可以在OpenSource叉中提供,而这些功能的子集已经上游。
translated by 谷歌翻译
TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. Tensor-Flow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, generalpurpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous "parameter server" designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with a focus on training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model and demonstrate the compelling performance that Tensor-Flow achieves for several real-world applications.
translated by 谷歌翻译
本文介绍了正向和反向模式的校正和高效自动分化的校正和高效自动化。自动差异是获得数值节目梯度的方法,这对于优化,不确定量化和机器学习至关重要。计算渐变的计算成本是实践中的常见瓶颈。对于使用OpenMP为多核CPU或GPU并行化的应用程序,还希望并行计算渐变。我们提出了一个框架,原因是生成的衍生代码的正确性,我们向差异化模型证明了我们的OpenMP扩展。我们在自动差异化工具磁带上实施此模型,并在我们的扩展差异化过程之后差异化的目前的测试用例。生成的衍生程序的性能在前进和反向模式优于顺序,尽管我们的反向模式通常比输入程序更差。
translated by 谷歌翻译
自主机时代的一个主要技术挑战是自动驾驶机器的编程,它要求跨多个领域的协同作用,包括基本的计算机科学,计算机架构和机器人技术,并且需要学术界和行业的专业知识。本文讨论了与生产现实生活自动驾驶机器相关的编程理论和实践,并在特定功能要求,性能期望和自主机的实施约束的背景下涵盖了从高级概念到低级代码生成的各个方面。
translated by 谷歌翻译
为低功耗设备上的高性能计算选择适当的编程范例可以很有用来加快计算。许多Android设备都有一个集成的GPU,虽然没有正式支持 - OpenCL框架可以在Android设备上用于寻址这些GPU。 OpenCL支持线程和数据并行性。使用GPU的应用程序必须考虑到用户可以在任何时刻暂停用户或Android操作系统。我们已创建一个包装器库,允许在Android设备上使用OpenCL。已经写入OpenCL程序可以用几乎没有修改来执行。我们使用此库将DBSCAN和kmeans算法的性能与同一设备上的其他单个和多线程实现的ARM-V7平板电脑的集成GPU进行比较。我们调查了哪些编程范式和语言允许执行速度和能耗之间的最佳权衡。在Android设备上使用GPU进行HPC,可以帮助在遥控区域下进行计算密集型机器学习或数据挖掘任务,在恶劣的环境条件下以及能源供应是一个问题的领域。
translated by 谷歌翻译
变形金刚是一种深入学习语言模型,用于数据中心中的自然语言处理(NLP)服务。在变压器模型中,生成的预训练的变压器(GPT)在文本生成或自然语言生成(NLG)中取得了显着的性能,它需要在摘要阶段处理大型输入上下文,然后是产生一个生成阶段的一次单词。常规平台(例如GPU)专门用于在摘要阶段平行处理大型输入,但是由于其顺序特征,它们的性能在生成阶段显着降低。因此,需要一个有效的硬件平台来解决由文本生成的顺序特征引起的高潜伏期。在本文中,我们提出了DFX,这是一种多FPGA加速器,该设备在摘要和发电阶段中执行GPT-2模型端到端,并具有低延迟和高吞吐量。 DFX使用模型并行性和优化的数据流,这是模型和硬件感知的设备之间快速同时执行执行。其计算核心根据自定义说明运行,并提供GPT-2操作端到端。我们在四个Xilinx Alveo U280 FPGAS上实现了建议的硬件体系结构,并利用了高带宽内存(HBM)的所有频道,以及用于高硬件效率的最大计算资源数量。 DFX在现代GPT-2模型上实现了四个NVIDIA V100 GPU的5.58倍加速度和3.99倍的能效。 DFX的成本效益比GPU设备更具成本效益,这表明它是云数据中心中文本生成工作负载的有前途解决方案。
translated by 谷歌翻译
最新的努力改善了满足当今应用程序要求的神经网络(NN)加速器的性能,这引起了基于逻辑NN推理的新趋势,该趋势依赖于固定功能组合逻辑。将如此大的布尔函数与许多输入变量和产品项绘制到现场可编程门阵列(FPGA)上的数字信号处理器(DSP)需要一个新颖的框架,考虑到此过程中DSP块的结构和可重构性。本文中提出的方法将固定功能组合逻辑块映射到一组布尔功能,其中与每个功能相对应的布尔操作映射到DSP设备,而不是FPGA上的查找表(LUTS),以利用高性能,DSP块的低潜伏期和并行性。 %本文还提出了一种用于NNS编译和映射的创新设计和优化方法,并利用固定功能组合逻辑与DSP进行了使用高级合成流的FPGA上的DSP。 %我们在几个\ revone {DataSets}上进行的实验评估和选定的NNS与使用DSP的基于ART FPGA的NN加速器相比,根据推理潜伏期和输出准确性,证明了我们框架的可比性。
translated by 谷歌翻译
Distributed deep learning (DDL) systems strongly depend on network performance. Current electronic packet switched (EPS) network architectures and technologies suffer from variable diameter topologies, low-bisection bandwidth and over-subscription affecting completion time of communication and collective operations. We introduce a near-exascale, full-bisection bandwidth, all-to-all, single-hop, all-optical network architecture with nanosecond reconfiguration called RAMP, which supports large-scale distributed and parallel computing systems (12.8~Tbps per node for up to 65,536 nodes). For the first time, a custom RAMP-x MPI strategy and a network transcoder is proposed to run MPI collective operations across the optical circuit switched (OCS) network in a schedule-less and contention-less manner. RAMP achieves 7.6-171$\times$ speed-up in completion time across all MPI operations compared to realistic EPS and OCS counterparts. It can also deliver a 1.3-16$\times$ and 7.8-58$\times$ reduction in Megatron and DLRM training time respectively} while offering 42-53$\times$ and 3.3-12.4$\times$ improvement in energy consumption and cost respectively.
translated by 谷歌翻译
In this paper, we present PARTIME, a software library written in Python and based on PyTorch, designed specifically to speed up neural networks whenever data is continuously streamed over time, for both learning and inference. Existing libraries are designed to exploit data-level parallelism, assuming that samples are batched, a condition that is not naturally met in applications that are based on streamed data. Differently, PARTIME starts processing each data sample at the time in which it becomes available from the stream. PARTIME wraps the code that implements a feed-forward multi-layer network and it distributes the layer-wise processing among multiple devices, such as Graphics Processing Units (GPUs). Thanks to its pipeline-based computational scheme, PARTIME allows the devices to perform computations in parallel. At inference time this results in scaling capabilities that are theoretically linear with respect to the number of devices. During the learning stage, PARTIME can leverage the non-i.i.d. nature of the streamed data with samples that are smoothly evolving over time for efficient gradient computations. Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning, distributing operations on up to 8 NVIDIA GPUs, showing significant speedups that are almost linear in the number of devices, mitigating the impact of the data transfer overhead.
translated by 谷歌翻译
当今的大多数计算机视觉管道都是围绕深神经网络构建的,卷积操作需要大部分一般的计算工作。与标准算法相比,Winograd卷积算法以更少的MAC计算卷积,当使用具有2x2尺寸瓷砖$ F_2 $的版本时,3x3卷积的操作计数为2.25倍。即使收益很大,Winograd算法具有较大的瓷砖尺寸,即$ f_4 $,在提高吞吐量和能源效率方面具有更大的潜力,因为它将所需的MAC降低了4倍。不幸的是,具有较大瓷砖尺寸的Winograd算法引入了数值问题,这些问题阻止了其在整数域特异性加速器上的使用和更高的计算开销,以在空间和Winograd域之间转换输入和输出数据。为了解锁Winograd $ F_4 $的全部潜力,我们提出了一种新颖的Tap-Wise量化方法,该方法克服了使用较大瓷砖的数值问题,从而实现了仅整数的推断。此外,我们介绍了以功率和区域效率的方式处理Winograd转换的自定义硬件单元,并展示了如何将此类自定义模块集成到工业级,可编程的DSA中。对大量最先进的计算机视觉基准进行了广泛的实验评估表明,Tap-Wise量化算法使量化的Winograd $ F_4 $网络几乎与FP32基线一样准确。 Winograd增强的DSA可实现高达1.85倍的能源效率,最高可用于最先进的细分和检测网络的端到端速度高达1.83倍。
translated by 谷歌翻译
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-ofthe-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator.The system is open sourced and in production use inside several major companies.
translated by 谷歌翻译
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
translated by 谷歌翻译
Communication and computation are often viewed as separate tasks. This approach is very effective from the perspective of engineering as isolated optimizations can be performed. On the other hand, there are many cases where the main interest is a function of the local information at the devices instead of the local information itself. For such scenarios, information theoretical results show that harnessing the interference in a multiple-access channel for computation, i.e., over-the-air computation (OAC), can provide a significantly higher achievable computation rate than the one with the separation of communication and computation tasks. Besides, the gap between OAC and separation in terms of computation rate increases with more participating nodes. Given this motivation, in this study, we provide a comprehensive survey on practical OAC methods. After outlining fundamentals related to OAC, we discuss the available OAC schemes with their pros and cons. We then provide an overview of the enabling mechanisms and relevant metrics to achieve reliable computation in the wireless channel. Finally, we summarize the potential applications of OAC and point out some future directions.
translated by 谷歌翻译