Fine-tuning models on edge devices such as mobile phones would enable privacy-preserving personalization over sensitive data. Historically, however, edge training has been limited to relatively small models with simple architectures because training is both memory- and energy-intensive. We present POET, an algorithm that enables training large neural networks on memory-scarce edge devices. POET jointly optimizes the integrated search space of rematerialization and paging, two techniques that reduce the memory consumption of backpropagation. Given a memory budget and a run-time constraint, we formulate a mixed-integer linear program (MILP) for optimal training. Our approach enables training significantly larger models on embedded devices while reducing energy consumption and without altering the mathematical correctness of backpropagation. We demonstrate that ResNet-18 and BERT can be fine-tuned within the memory constraints of Cortex-class embedded devices while outperforming current edge training methods in energy efficiency. POET is an open-source project available at https://github.com/shishirpatil/poet
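As a rough illustration of the kind of optimization described above, here is a minimal MILP sketch; the variables R, P, S, the cost terms, and the constraint set are assumptions chosen for exposition, not POET's actual formulation:

```latex
% R_{t,k}: node k is (re)computed at step t;  P_{t,k}: its output is paged out;
% S_{t,k}: its output is resident in RAM at step t (all binary).
\begin{aligned}
\min_{R,P,S}\quad & \sum_{t,k}\Big(E^{\mathrm{compute}}_{k}\,R_{t,k} + E^{\mathrm{page}}_{k}\,P_{t,k}\Big)
      && \text{(total energy)}\\
\text{s.t.}\quad & \sum_{k} m_{k}\,S_{t,k} \le M \quad \forall t
      && \text{(memory budget at every step)}\\
 & \sum_{t,k} c_{k}\,R_{t,k} \le T
      && \text{(run-time constraint)}\\
 & R_{t,k},\,P_{t,k},\,S_{t,k} \in \{0,1\}.
\end{aligned}
```

Dependency constraints (a tensor must be resident or recomputed before each of its consumers runs) would complete the model.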
Memory efficiency is crucial in training deep learning networks on resource-restricted devices. During backpropagation, forward tensors are used to calculate gradients. Despite the option of keeping those dependencies in memory until they are reused in backpropagation, some forward tensors can be discarded and recomputed later from saved tensors, so-called checkpoints. This allows, in particular, for resource-constrained heterogeneous environments to make use of all available compute devices. Unfortunately, the definition of these checkpoints is a non-trivial problem and poses a challenge to the programmer - improper or excessive recomputations negate the benefit of checkpointing. In this article, we present XEngine, an approach that schedules network operators to heterogeneous devices in low memory environments by determining checkpoints and recomputations of tensors. Our approach selects suitable resources per timestep and operator and optimizes the end-to-end time for neural networks taking the memory limitation of each device into account. For this, we formulate a mixed-integer quadratic program (MIQP) to schedule operators of deep learning networks on heterogeneous systems. We compare our MIQP solver XEngine against Checkmate, a mixed-integer linear programming (MILP) approach that solves recomputation on a single device. Our solver finds solutions that are up to 22.5 % faster than the fastest Checkmate schedule in which the network is computed exclusively on a single device. We also find valid schedules for networks making use of both central processing units and graphics processing units if memory limitations do not allow scheduling exclusively to the graphics processing unit.
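For readers unfamiliar with the recomputation idea that both Checkmate and XEngine optimize over, the following is a minimal PyTorch sketch of manual checkpointing; it is a generic illustration, not XEngine's MIQP-driven schedule:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Two blocks; the activations inside `block1` are NOT kept in memory.
# They are recomputed from the saved input (the "checkpoint") during backward.
block1 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
block2 = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())

x = torch.randn(32, 1024, requires_grad=True)

h = checkpoint(block1, x, use_reentrant=False)  # forward without storing internals
loss = block2(h).sum()
loss.backward()  # block1's forward runs again here to rebuild the freed tensors
```

The schedulers discussed above decide automatically which tensors to treat this way and, in XEngine's case, on which device each operator should run.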
The recent breakthroughs in machine learning (ML) and deep learning (DL) have enabled many new capabilities across a wide range of application domains. While most existing machine learning models require large memory and computing power, efforts have been made to deploy some models on resource-constrained devices as well. Several systems perform inference on the device, while direct training on the device still remains a challenge. On-device training, however, is attracting more and more interest because: (1) it enables training models on local data without needing to share data over the cloud, thus enabling privacy-preserving computation by design; (2) models can be refined on devices to provide personalized services and cope with model drift in order to adapt to changes in the real-world environment; and (3) it enables the deployment of models in remote, hardly accessible locations or places without stable internet connectivity. We summarize and analyze state-of-the-art systems research to provide the first survey of on-device training from a systems perspective.
Recently, there has been explosive growth in mobile and embedded applications that use convolutional neural networks (CNNs). To alleviate their excessive computational demands, developers have traditionally resorted to cloud offloading, incurring high infrastructure costs and a strong dependence on network conditions. On the other hand, the emergence of powerful SoCs is progressively enabling on-device execution. Nonetheless, low- and mid-tier platforms still struggle to run state-of-the-art CNNs adequately. In this paper, we present DynO, a distributed inference framework that combines the best of both worlds to address several challenges, such as device heterogeneity, varying bandwidth, and multi-objective requirements. Key to achieving this are its novel CNN-specific data packing method, which exploits the variability in precision needs of different parts of the CNN when onloading computation, and its novel scheduler, which jointly tunes the partition point and the transferred data precision at run time to adapt inference to its execution environment. Quantitative evaluation shows that DynO outperforms the current state-of-the-art, improving throughput over competing CNN offloading systems by more than an order of magnitude, with up to 60x less data transferred.
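A toy sketch of the split-and-quantize idea described above; the layer functions and bit-width are invented for illustration and do not reflect DynO's actual packing format:

```python
import numpy as np

def quantize(x, bits=8):
    """Uniform quantization to `bits` bits so the transferred activation is smaller."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2 ** bits - 1) or 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
w_head, w_tail = rng.standard_normal((64, 128)), rng.standard_normal((128, 10))
head = lambda x: np.maximum(x @ w_head, 0.0)   # part of the network that stays on-device
tail = lambda h: h @ w_tail                    # part whose computation is onloaded/offloaded

x = rng.standard_normal((1, 64)).astype(np.float32)
h = head(x)                          # run up to the chosen partition point
q, lo, scale = quantize(h, bits=8)   # pack the activation at reduced precision
y = tail(dequantize(q, lo, scale))   # the other side resumes from the split point
```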
Deep neural networks (DNNs) have achieved great success in a variety of machine learning (ML) applications, delivering high-quality inference in domains such as computer vision, natural language processing, and virtual reality. However, DNN-based ML applications also bring greatly increased computation and storage requirements, which are particularly challenging for embedded systems with limited compute/storage resources, tight power budgets, and small form factors. Challenges also arise from diverse application-specific requirements, including real-time responsiveness, high-throughput performance, and reliable inference accuracy. To address these challenges, we introduce a series of effective design methods, including efficient ML model design, customized hardware accelerator design, and hardware/software co-design strategies, to enable efficient ML applications on embedded systems.
With the active development of artificial intelligence (AI), intelligent applications based on deep neural networks (DNNs) are changing people's lifestyles and production efficiency. However, the massive amounts of computation and data generated at the network edge have become a major bottleneck, and the traditional cloud-based computing paradigm cannot meet the real-time processing requirements of such tasks. To address these issues, Edge Intelligence (EI), which embeds AI model training and inference capabilities into the network edge, has emerged as a cutting-edge direction in the AI field. Furthermore, collaborative DNN inference among the cloud, the edge, and end devices provides a promising way to enhance EI. Nevertheless, EI-oriented collaborative DNN inference is still at an early stage, lacking a systematic classification and discussion of existing research efforts. We have therefore conducted a comprehensive survey of recent research on EI-oriented collaborative DNN inference. In this paper, we first review the background and motivation of EI. We then categorize four typical collaborative DNN inference paradigms for EI and analyze their characteristics and key technologies. Finally, we summarize the current challenges of collaborative DNN inference, discuss future development trends, and provide future research directions.
Advances in machine learning have created new opportunities to bring intelligence to low-end Internet nodes such as microcontrollers. The high memory and compute footprints of conventional machine learning deployments hinder their direct deployment on ultra-resource-constrained microcontrollers. This article highlights the unique requirements of enabling onboard machine learning for microcontroller-class devices. Researchers use specialized model development workflows for resource-limited applications to ensure that the compute and latency budgets fit within device constraints while still maintaining the desired performance. We characterize a broadly applicable closed-loop workflow for machine learning model development for microcontroller-class devices and show that several classes of applications adopt specific instances of it. We present qualitative and numerical insights into the different stages of model development by showcasing multiple use cases. Finally, we identify open research challenges and unresolved questions that demand careful consideration going forward.
Embedded and personal IoT devices are powered by microcontroller units (MCUs), whose extreme resource scarcity is a major obstacle for applications that rely on on-device deep learning inference. Having orders of magnitude less storage, memory, and computational capacity than what is typically required to execute neural networks imposes strict structural constraints on the network architecture and calls for specialized model compression methods. In this work, we present a differentiable structured network pruning method for convolutional neural networks that integrates the model's MCU-specific resource usage and parameter importance feedback to obtain highly compressed yet accurate classification models. Our method (a) improves key resource usage of models by up to 80x; (b) prunes iteratively while training the model, resulting in little to no overhead or even improved training time; and (c) compared with existing MCU-specific methods, produces compressed models with matching or improved resource usage in a comparable amount of time. The compressed models are available for download.
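As a much simpler stand-in for the differentiable, resource-aware pruning described above, the sketch below shows plain magnitude-based structured (channel) pruning in PyTorch; it illustrates the general idea only and is not the paper's method:

```python
import torch
import torch.nn as nn

def prune_channels(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    """Keep the output channels of `conv` whose filters have the largest L1 norms."""
    n_keep = max(1, int(conv.out_channels * keep_ratio))
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # one norm per filter
    keep = torch.topk(norms, n_keep).indices
    pruned = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            pruned.bias.copy_(conv.bias[keep])
    return pruned

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
smaller = prune_channels(conv, keep_ratio=0.5)   # 32 -> 16 output channels
# Downstream layers must be adjusted to the reduced channel count.
```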
With the ubiquitous deployment of smart devices and the Internet of Things, the data sources for machine learning inference have increasingly shifted to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not account for the more complex and hierarchical computing infrastructure comprising edge devices, local hubs, edge data centers, and cloud data centers. Meanwhile, recent AutoML work provides viable solutions for model compression, pruning, and quantization in heterogeneous environments; for a given machine learning model, it is now easy to find or even generate a series of models with different trade-offs between accuracy and efficiency. We design and implement JellyBean, a system for serving and optimizing machine learning inference workflows. Given service-level objectives (e.g., throughput, accuracy), JellyBean selects the most cost-effective models that meet the accuracy target and decides how to deploy them across the different tiers of the infrastructure. Evaluations show that, compared with state-of-the-art model selection and worker assignment solutions, JellyBean reduces the total serving cost of visual question answering by up to 58% and of vehicle tracking from the NVIDIA AI City Challenge by up to 36%. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) by up to 5x in serving cost.
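A toy sketch of the cost-versus-accuracy selection step described above; the model catalogue and cost numbers are invented for illustration, and this is not JellyBean's actual planner:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float   # offline-profiled accuracy
    cost: float       # e.g., cost per hour to serve at the required throughput

def pick_model(candidates, accuracy_target):
    """Return the cheapest candidate that still meets the accuracy objective."""
    feasible = [c for c in candidates if c.accuracy >= accuracy_target]
    if not feasible:
        raise ValueError("no model satisfies the accuracy objective")
    return min(feasible, key=lambda c: c.cost)

catalogue = [
    Candidate("resnet18-int8", 0.70, 1.0),
    Candidate("resnet50-fp16", 0.76, 3.5),
    Candidate("vit-base-fp32", 0.81, 9.0),
]
print(pick_model(catalogue, accuracy_target=0.75).name)  # -> resnet50-fp16
```

The full worker-assignment problem additionally decides on which infrastructure tier each selected model runs.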
On-device training allows a model to adapt to new data collected from sensors by fine-tuning a pre-trained model. However, the training memory consumption is prohibitive for IoT devices with limited memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to mixed bit-precision and the lack of normalization; and (2) the limited hardware resources (memory and computation) do not allow full backward computation. To cope with the optimization difficulty, we propose quantization-aware scaling to calibrate the gradient scales and stabilize quantized training. To reduce the memory footprint, we propose sparse update to skip the gradient computation of less important layers and sub-tensors. These algorithmic innovations are implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads runtime auto-differentiation to compile time. Our framework is the first practical solution for on-device transfer learning of visual recognition on tiny IoT devices (e.g., a microcontroller with only 256KB of SRAM), using less than 1/100 of the memory of existing frameworks while matching the accuracy of cloud training + edge deployment for the tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for lifelong learning.
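To make the sparse-update idea concrete, the following is a minimal PyTorch sketch that freezes most of a pre-trained model and computes gradients only for a chosen subset of layers; the layer choice here is an assumption made for illustration (the paper selects layers and sub-tensors by importance), and this is not Tiny Training Engine itself:

```python
import torch
import torchvision

model = torchvision.models.mobilenet_v2(weights="DEFAULT")

# Freeze everything, then re-enable gradients only for the last two feature blocks
# and the classifier. Frozen layers need no gradient buffers, shrinking training memory.
for p in model.parameters():
    p.requires_grad = False
for p in model.features[-2:].parameters():
    p.requires_grad = True
for p in model.classifier.parameters():
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.SGD(trainable, lr=1e-3)

x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()   # backward stops early: no gradients are computed for the frozen, earlier layers
opt.step()
```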
Mixed-precision deep neural networks achieve the energy efficiency and throughput needed for hardware deployment, particularly when resources are limited, without sacrificing accuracy. However, the optimal per-layer bit precision that preserves accuracy is not easy to find, especially given the abundance of models, datasets, and quantization techniques, which creates an enormous search space. To tackle this difficulty, a body of literature has emerged recently, and several frameworks achieving promising accuracy results have been proposed. In this paper, we first summarize the quantization techniques commonly used in the literature. We then present a thorough survey of mixed-precision frameworks, categorized by their optimization techniques, such as reinforcement learning, and quantization techniques, such as deterministic rounding. Furthermore, the advantages and shortcomings of each framework are discussed and juxtaposed. We finally give guidelines for future mixed-precision frameworks.
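For reference, the sketch below shows the kind of per-layer uniform quantization with deterministic (round-to-nearest) rounding that such frameworks search over; the bit-widths are picked by hand purely for illustration:

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric uniform quantization with deterministic round-to-nearest."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale          # dequantized ("fake-quantized") weights

rng = np.random.default_rng(0)
layers = {"conv1": rng.standard_normal((64, 3, 3, 3)),
          "fc":    rng.standard_normal((10, 256))}
bit_plan = {"conv1": 8, "fc": 4}   # a hypothetical per-layer assignment

for name, w in layers.items():
    wq = quantize_uniform(w, bit_plan[name])
    print(f"{name}: {bit_plan[name]}-bit, mean abs error {np.abs(w - wq).mean():.4f}")
```

A mixed-precision framework automates the search over `bit_plan` so that accuracy is preserved while the cheapest feasible bit-widths are used.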
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms, such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs), requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.
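As one concrete example of the graph-level rewrites such compilers apply, the sketch below folds a batch-normalization layer into the preceding convolution's weights so the two operators become a single one at inference time; this is a conceptual NumPy illustration and does not use TVM's APIs:

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BatchNorm parameters into conv weights/bias: conv(x, w2) + b2 is then
    numerically equivalent to batchnorm(conv(x, w) + b)."""
    std = np.sqrt(var + eps)
    w_folded = w * (gamma / std)[:, None, None, None]   # rescale each output channel
    b_folded = (b - mean) * gamma / std + beta
    return w_folded, b_folded

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 3, 3, 3))        # 8 output channels, 3x3 kernels
b = np.zeros(8)
gamma, beta = np.ones(8), np.zeros(8)
mean, var = rng.standard_normal(8), np.abs(rng.standard_normal(8)) + 0.1
w2, b2 = fold_bn_into_conv(w, b, gamma, beta, mean, var)
```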
Embedded and IoT devices, largely powered by microcontroller units (MCUs), could be made more intelligent by leveraging on-device deep learning. One of the main challenges of neural network inference on an MCU is the extremely limited amount of read-write on-chip memory (SRAM, < 512 kB). SRAM is consumed by the neural network layer (operator) input and output buffers, which, traditionally, must be in memory (materialised) for an operator to execute. We discuss a novel execution paradigm for microcontroller deep learning, which modifies the execution of neural networks to avoid materialising full buffers in memory, drastically reducing SRAM usage with no computation overhead. This is achieved by exploiting the properties of operators, which can consume/produce a fraction of their input/output at a time. We describe a partial execution compiler, Pex, which produces memory-efficient execution schedules automatically by identifying subgraphs of operators whose execution can be split along the feature ("channel") dimension. Memory usage is reduced further by targeting memory bottlenecks with structured pruning, leading to the co-design of the network architecture and its execution schedule. Our evaluation of image and audio classification models: (a) establishes state-of-the-art performance in low SRAM usage regimes for considered tasks with up to +2.9% accuracy increase; (b) finds that a 4x memory reduction is possible by applying partial execution alone, or up to 10.5x when using the compiler-pruning co-design, while maintaining the classification accuracy compared to prior work; (c) uses the recovered SRAM to process higher resolution inputs instead, increasing accuracy by up to +3.9% on Visual Wake Words.
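A toy NumPy sketch of executing along the channel dimension in slices so that the full intermediate buffer is never materialised; this illustrates the general partial-execution idea, not Pex's compiler output:

```python
import numpy as np

H, W, C_in, C_out, chunk = 32, 32, 16, 64, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((C_in, H, W)).astype(np.float32)
w = rng.standard_normal((C_out, C_in)).astype(np.float32)   # 1x1 ("pointwise") conv

# Naive: materialise the full (C_out, H, W) activation, then average-pool it.
full = np.einsum("oc,chw->ohw", w, x)
pooled_ref = full.mean(axis=(1, 2))

# Partial execution: produce a few output channels at a time and feed each slice
# straight into the pooling, so peak memory holds `chunk` channels, not all C_out.
pooled = np.empty(C_out, dtype=np.float32)
for o in range(0, C_out, chunk):
    part = np.einsum("oc,chw->ohw", w[o:o + chunk], x)   # (chunk, H, W) slice
    pooled[o:o + chunk] = part.mean(axis=(1, 2))

assert np.allclose(pooled, pooled_ref, atol=1e-4)
```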
Running machine learning inference on tiny devices, known as TinyML, is an emerging research area. This task requires generating inference code that uses memory frugally, a task that standard ML frameworks are ill-suited for. A deployment framework for TinyML must be a) parametric in the number representation to take advantage of the emerging representations like posits, b) carefully assign high-precision to a few tensors so that most tensors can be kept in low-precision while still maintaining model accuracy, and c) avoid memory fragmentation. We describe MinUn, the first TinyML framework that holistically addresses these issues to generate efficient code for ARM microcontrollers (e.g., Arduino Uno, Due and STM32H747) that outperforms the prior TinyML frameworks.
Today, DNNs are ubiquitous on edge devices. With their growing importance and widening use cases, it is unlikely that all DNNs can be packed into device memory so that every inference is warm. Consequently, cold inference, the process of reading, initializing, and executing a DNN model, is becoming commonplace, and optimizing its performance is urgently needed. To this end, we present NNV12, the first on-device inference engine optimized for cold inference. NNV12 is built on three novel optimization knobs: selecting a proper kernel (implementation) for each DNN operator, bypassing the weight transformation process by caching the post-transformed weights on disk, and pipelining the execution of many kernels on asymmetric processors. To tackle the huge search space, NNV12 employs a heuristic-based scheme to obtain a near-optimal kernel scheduling plan. We fully implemented a prototype of NNV12 and evaluated its performance in extensive experiments. The results show that NNV12 achieves speedups of up to 15.2x and 401.5x compared with state-of-the-art DNN engines on edge CPUs and GPUs, respectively.
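A minimal sketch of the weight-caching knob described above: transform the weights once, persist the result on disk, and load (or memory-map) it on later cold starts. The file name and the transform are invented for illustration:

```python
import os
import numpy as np

CACHE = "conv1_transformed_weights.npy"   # hypothetical on-disk cache

def transform_weights(w):
    """Stand-in for an expensive, kernel-specific weight transformation."""
    return np.ascontiguousarray(w.transpose(0, 2, 3, 1))   # e.g., NCHW -> NHWC layout

def load_transformed(w):
    if os.path.exists(CACHE):                 # later cold starts: skip the transform
        return np.load(CACHE, mmap_mode="r")
    wt = transform_weights(w)                 # first cold start: transform once...
    np.save(CACHE, wt)                        # ...and cache the result on disk
    return wt

w = np.random.randn(64, 3, 7, 7).astype(np.float32)
wt = load_transformed(w)
```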
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
In recent years, image and video delivery systems have begun integrating deep learning super-resolution (SR) approaches, leveraging their unprecedented visual enhancement capabilities while reducing reliance on networking conditions. Nevertheless, deploying these solutions on mobile devices still remains an active challenge as SR models are excessively demanding with respect to workload and memory footprint. Despite recent progress on on-device SR frameworks, existing systems either penalize visual quality, lead to excessive energy consumption or make inefficient use of the available resources. This work presents NAWQ-SR, a novel framework for the efficient on-device execution of SR models. Through a novel hybrid-precision quantization technique and a runtime neural image codec, NAWQ-SR exploits the multi-precision capabilities of modern mobile NPUs in order to minimize latency, while meeting user-specified quality constraints. Moreover, NAWQ-SR selectively adapts the arithmetic precision at run time to equip the SR DNN's layers with wider representational power, improving visual quality beyond what was previously possible on NPUs. Altogether, NAWQ-SR achieves an average speedup of 7.9x, 3x and 1.91x over the state-of-the-art on-device SR systems that use heterogeneous processors (MobiSR), CPU (SplitSR) and NPU (XLSR), respectively. Furthermore, NAWQ-SR delivers an average of 3.2x speedup and 0.39 dB higher PSNR over status-quo INT8 NPU designs, but most importantly mitigates the negative effects of quantization on visual quality, setting a new state-of-the-art in the attainable quality of NPU-based SR.
Video, as a key driver in the global explosion of digital information, can create tremendous benefits for human society. Governments and enterprises are deploying innumerable cameras for a variety of applications, e.g., law enforcement, emergency management, traffic control, and security surveillance, all facilitated by video analytics (VA). This trend is spurred by the rapid advancement of deep learning (DL), which enables more precise models for object classification, detection, and tracking. Meanwhile, with the proliferation of Internet-connected devices, massive amounts of data are generated daily, overwhelming the cloud. Edge computing, an emerging paradigm that moves workloads and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new intersection, edge video analytics (EVA), begins to attract widespread attention. Nevertheless, only a few loosely-related surveys exist on this topic. A dedicated venue for collecting and summarizing the latest advances of EVA is highly desired by the community. Besides, the basic concepts of EVA (e.g., definition, architectures, etc.) are ambiguous and neglected by these surveys due to the rapid development of this domain. A thorough clarification is needed to facilitate a consensus on these concepts. To fill in these gaps, we conduct a comprehensive survey of the recent efforts on EVA. In this paper, we first review the fundamentals of edge computing, followed by an overview of VA. The EVA system and its enabling techniques are discussed next. In addition, we introduce prevalent frameworks and datasets to aid future researchers in the development of EVA systems. Finally, we discuss existing challenges and foresee future research directions. We believe this survey will help readers comprehend the relationship between VA and edge computing, and spark new ideas on EVA.
The end of Dennard scaling and the slowdown of Moore's Law have put datacenter energy use on an unsustainable path. Datacenters already account for a substantial share of worldwide electricity use, and application demand is scaling rapidly. We argue that a significant reduction in the carbon intensity of datacenter computing is achievable with a software-centric approach: by making energy and carbon visible to application developers, by modifying system APIs so that informed trade-offs between performance and carbon emissions become possible, and by raising the level of application programming so that more energy-efficient means of compute and storage can be used flexibly. We also lay out a research agenda for systems software to reduce the carbon footprint of datacenter computing.
Over the past decade, deep neural networks (DNNs) have grown exponentially in size, leaving only those with massive datacenter-based resources able to develop and train such models. One of the main challenges for the long tail of researchers, who may have only limited resources (e.g., a single multi-GPU server), is GPU memory capacity relative to model size. The problem is so severe that the memory required to train a large-scale DNN model can often exceed the aggregate capacity of all available GPUs on a single server, and it only gets worse with the trend toward ever larger models. Current solutions that rely on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive swapping overhead. In this paper, we present a new training framework, Harmony, and advocate rethinking how DNN frameworks schedule computation and move data in order to push the boundaries of training large-scale models efficiently on a single commodity server. Across a variety of large DNN models, Harmony reduces the swap load by up to two orders of magnitude and achieves training throughput speedups of up to 7.6x over highly optimized baselines with virtualized memory.
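For context, the following is a minimal PyTorch sketch of the swap-based GPU memory virtualization that such systems build on (and whose overhead Harmony aims to minimize): every activation saved for backward is offloaded to CPU memory after the forward pass and copied back when backward needs it. It assumes a CUDA device and is a generic illustration, not Harmony's scheduler:

```python
import torch

def pack_to_cpu(t):
    # Offload the saved activation to CPU memory right after the forward pass.
    return t.to("cpu", non_blocking=True)

def unpack_to_gpu(t):
    # Copy it back to the GPU just before backward needs it.
    return t.to("cuda", non_blocking=True)

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).cuda()
x = torch.randn(64, 4096, device="cuda")

# Every tensor autograd saves for backward is swapped out, then swapped back in.
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_gpu):
    loss = model(x).sum()
loss.backward()   # swapped-out tensors stream back in during backpropagation
```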