The recent breakthroughs in machine learning (ML) and deep learning (DL) have enabled new capabilities across many application domains. While most existing ML models require large amounts of memory and computing power, efforts have been made to deploy some models on resource-constrained devices as well. Several systems perform inference on the device, but training directly on the device remains a challenge. On-device training is nevertheless attracting growing interest because: (1) it enables training models on local data without sharing that data over the cloud, thus enabling privacy-preserving computation by design; (2) models can be refined on devices to provide personalized services and to cope with model drift, adapting to changes in the real-world environment; and (3) it enables the deployment of models in remote, hard-to-access locations or places without stable internet connectivity. We summarize and analyze the state-of-the-art systems research to provide the first survey of on-device training from a systems perspective.
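On-device training enables a model to adapt to new data collected from sensors by fine-tuning a pre-trained model. However, the memory consumption of training is prohibitive for IoT devices with tiny memory resources. We propose an algorithm-system co-design framework that makes on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize because of mixed bit-precision and the lack of normalization; (2) the limited hardware resources (memory and computation) do not allow full backward computation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithmic innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offloads runtime auto-differentiation to compile time. Our framework is the first practical solution for on-device transfer learning of visual recognition on tiny IoT devices (e.g., a microcontroller with only 256KB of SRAM), using less than 1/100 of the memory of existing frameworks while matching the accuracy of cloud training plus edge deployment for the tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for lifelong learning.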
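The sparse-update idea in the abstract above (skipping gradient computation for less important layers) can be illustrated with a minimal sketch. This is not the paper's Tiny Training Engine: the scalar two-layer model, the function names, and the `trainable` set are all hypothetical and chosen only to show that frozen parameters need no backward work.

```python
# Minimal sketch of a sparse update on a scalar two-layer linear model.
# All names are hypothetical; this is not the paper's implementation.

def forward(x, w1, w2):
    """Scalar two-layer linear model: h = w1 * x, y = w2 * h."""
    h = w1 * x
    return h, w2 * h


def sparse_sgd_step(x, target, params, trainable, lr=0.1):
    """One SGD step on the loss (y - target)^2.

    Gradients are computed only for parameters named in `trainable`;
    everything else is left untouched and costs no backward computation.
    """
    h, y = forward(x, params["w1"], params["w2"])
    dy = 2.0 * (y - target)  # dL/dy
    updated = dict(params)
    if "w2" in trainable:
        updated["w2"] = params["w2"] - lr * dy * h
    if "w1" in trainable:
        # Reached only when the earlier layer is trained. When it is
        # frozen, backpropagation stops at w2 and the input x does not
        # need to be kept for the backward pass.
        updated["w1"] = params["w1"] - lr * dy * params["w2"] * x
    return updated


params = {"w1": 0.5, "w2": 1.0}
# Update only the last layer; w1 stays frozen.
new_params = sparse_sgd_step(x=2.0, target=3.0, params=params, trainable={"w2"})
```

In this toy setting, freezing `w1` removes one gradient computation and one stored activation; in a real network the same choice is made per layer or per sub-tensor.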
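With the vigorous development of artificial intelligence (AI), intelligent applications based on deep neural networks (DNNs) are changing people's lifestyles and production efficiency. However, the huge amount of computation and data generated at the network edge has become the major bottleneck, and the traditional cloud-based computing mode cannot meet the requirements of real-time processing tasks. To solve these problems, edge intelligence (EI), which embeds AI model training and inference capabilities into the network edge, has become a cutting-edge direction in the field of AI. Furthermore, collaborative DNN inference among the cloud, edge, and end devices provides a promising way to boost EI. Nevertheless, EI-oriented collaborative DNN inference is still at an early stage, and a systematic classification and discussion of existing research efforts is lacking. We have therefore conducted a comprehensive survey of recent studies on EI-oriented collaborative DNN inference. In this paper, we first review the background and motivation of EI. Then, we classify four typical collaborative DNN inference paradigms for EI and analyze their characteristics and key technologies. Finally, we summarize the current challenges of collaborative DNN inference, discuss future development trends, and provide future research directions.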
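Fine-tuning models on edge devices such as mobile phones would enable privacy-preserving personalization over sensitive data. However, edge training has historically been limited to relatively small models with simple architectures, because training is both memory and energy intensive. We present POET, an algorithm that enables training large neural networks on memory-scarce edge devices. POET jointly optimizes the integrated search spaces of rematerialization and paging, two algorithms that reduce the memory consumption of backpropagation. Given a memory budget and a run-time constraint, we formulate a mixed-integer linear program (MILP) for optimal training. Our approach enables training significantly larger models on embedded devices while reducing energy consumption and without modifying the mathematical correctness of backpropagation. We demonstrate that it is possible to fine-tune both ResNet-18 and BERT within the memory constraints of a Cortex-class embedded device while outperforming current edge training methods in energy efficiency. POET is an open-source project available at https://github.com/shishirpatil/poet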
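Advances in machine learning have opened a new opportunity to bring intelligence to low-end Internet-of-Things nodes such as microcontrollers. Conventional machine learning deployments have high memory and compute footprints, which hinders their direct deployment on ultra-resource-constrained microcontrollers. This paper highlights the unique requirements of enabling onboard machine learning for microcontroller-class devices. Researchers use a specialized model development workflow for resource-limited applications to ensure that the compute and latency budgets stay within device limits while still maintaining the desired performance. We characterize a widely applicable closed-loop workflow of machine learning model development for microcontroller-class devices and show that several classes of applications adopt a specific instance of it. We present both qualitative and numerical insights into the different stages of model development by showcasing several use cases. Finally, we identify open research challenges and unsolved questions that demand careful consideration going forward.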
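Deep learning is pervasive in our daily lives, including self-driving cars, virtual assistants, social network services, healthcare services, and face recognition. However, deep neural networks demand substantial compute resources during training and inference. The machine learning community has mainly focused on model-level optimizations, such as architectural compression of deep learning models, while the systems community has focused on implementation-level optimization. In between, various arithmetic-level optimization techniques have been proposed by the arithmetic community. This article surveys resource-efficient deep learning techniques at the model, arithmetic, and implementation levels, and identifies the research gaps across these three levels. Based on our definition of a resource-efficiency metric, our survey clarifies the influence of lower-level techniques and explores future trends in resource-efficient deep learning research.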
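Recently, there has been explosive growth in mobile and embedded applications that use convolutional neural networks (CNNs). To alleviate their excessive computational demands, developers have traditionally resorted to cloud offloading, which incurs high infrastructure costs and a strong dependence on network conditions. On the other hand, the emergence of powerful SoCs is gradually enabling on-device execution. Nonetheless, low- and mid-tier platforms still struggle to run state-of-the-art CNNs adequately. In this paper, we present DynO, a distributed inference framework that combines the best of both worlds to address several challenges, such as device heterogeneity, varying bandwidth, and multi-objective requirements. Key to this are its novel CNN-specific data packing method, which exploits the variability of precision needs in different parts of the CNN when onloading computation, and its novel scheduler, which jointly tunes the partition point and the transferred data precision at run time to adapt inference to its execution environment. Quantitative evaluation shows that DynO outperforms the current state of the art, improving throughput by more than an order of magnitude over competing CNN offloading systems and transferring up to 60x less data.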
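The scheduler described above tunes the partition point between device and server. A minimal sketch of that one decision follows, under strong assumptions: per-layer latencies and output sizes are known in advance, bandwidth is constant, and transferred-precision tuning is left out entirely. The function name and all numbers are hypothetical and are not taken from the paper.

```python
# Sketch: choose the layer after which inference is handed to a server.
# Latencies, sizes, and bandwidth are made-up illustrative values.

def best_partition(device_ms, server_ms, out_bytes, input_bytes, bytes_per_ms):
    """Return (split, latency_ms) with the lowest end-to-end latency.

    split = k means layers [0, k) run on the device and layers [k, n)
    run on the server; split = n means fully on-device execution.
    """
    n = len(device_ms)
    best = None
    for k in range(n + 1):
        on_device = sum(device_ms[:k])
        on_server = sum(server_ms[k:])
        if k == n:
            transfer = 0.0  # nothing is sent when everything runs locally
        else:
            # Send the raw input (k == 0) or the output of layer k - 1.
            payload = input_bytes if k == 0 else out_bytes[k - 1]
            transfer = payload / bytes_per_ms
        total = on_device + transfer + on_server
        if best is None or total < best[1]:
            best = (k, total)
    return best


split, latency = best_partition(
    device_ms=[10, 20, 40],      # per-layer latency on the device
    server_ms=[1, 2, 4],         # per-layer latency on the server
    out_bytes=[8000, 1000, 100], # size of each layer's output
    input_bytes=6000,
    bytes_per_ms=100,
)
```

With these numbers the search prefers running the first two layers locally, because the second layer's output is much smaller to transmit than either the raw input or the first layer's output. Re-running the search as bandwidth changes is one simple way to adapt at run time.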
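Deep neural networks (DNNs) have achieved great success in a variety of machine learning (ML) applications, delivering high-quality inference solutions in computer vision, natural language processing, virtual reality, and more. However, DNN-based ML applications also bring greatly increased computational and storage requirements, which are particularly challenging for embedded systems with limited compute and storage resources, tight power budgets, and small form factors. Challenges also come from diverse application-specific requirements, including real-time response, high-throughput performance, and reliable inference accuracy. To address these challenges, we introduce a series of effective design methods, including efficient ML model design, customized hardware accelerator design, and hardware/software co-design strategies, to enable efficient ML applications on embedded systems.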
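There is a growing demand for shifting the delivery of AI capability from data centers in the cloud to edge or end devices, exemplified by the fast-emerging real-time AI-based apps running on smartphones, AR/VR devices, autonomous vehicles, and various IoT devices. This shift, however, has been seriously hampered by the large and growing gap between DNN computing demands and the computing power of edge or end devices. This article presents the design of XGen, an optimizing framework for DNNs designed to bridge the gap. XGen takes cross-cutting co-design as its first-order consideration. Its full-stack AI-oriented optimizations consist of a number of innovative optimizations at every layer of the DNN software stack, all designed in a cooperative manner. This unique technology enables XGen to optimize a variety of DNNs, including those of extreme depth (e.g., BERT, GPT, and other transformers), and to generate code that runs several times faster than code from existing DNN frameworks while delivering the same level of accuracy.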
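Influenced by the great success of deep learning via cloud computing and the rapid development of edge chips, research in artificial intelligence (AI) has shifted to both computing paradigms, i.e., cloud computing and edge computing. In recent years, we have witnessed significant progress in developing more advanced AI models on cloud servers that surpass traditional deep learning models, owing to model innovations (e.g., Transformers and pretrained model families), the explosion of training data, and soaring computing capabilities. However, edge computing, and especially edge-cloud collaborative computing, is still in its infancy, because resource-constrained IoT scenarios have allowed only very limited algorithms to be deployed. In this survey, we conduct a systematic review of both cloud and edge AI. Specifically, we are the first to set up a collaborative learning mechanism for cloud and edge modeling, with a thorough review of the architectures that enable such a mechanism. We also discuss the potential of, and practical experience with, some ongoing advanced edge AI topics, including pretrained models, graph neural networks, and reinforcement learning. Finally, we discuss promising directions and challenges in this field.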
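This paper presents a state-of-the-art overview of how to architect, design, and optimize deep neural networks (DNNs) so that performance is improved and accuracy is preserved. The paper covers a set of optimizations that span the entire machine learning processing pipeline. We introduce two types of optimizations. The first alters the DNN model and requires retraining, while the second does not. We focus on GPU optimizations, but we believe the presented techniques can be used with other AI inference platforms. To demonstrate the DNN model optimizations, we improve one of the most advanced deep network architectures for optical flow, RAFT arXiv:2003.12039, on a popular edge AI inference platform (Nvidia Jetson AGX Xavier).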
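Modern consumer electronic devices have adopted deep-learning-based intelligence services for their key features. Vendors have recently started to execute intelligence services on devices in order to keep personal data on the device and to reduce network and cloud costs. We see in this trend an opportunity to update neural networks with user data without exposing the data outside the device: on-device training. For example, we may add a new class ("my dog, Alpha") to a robotic vacuum cleaner, adapt speech recognition to the user's accent, or let text-to-speech speak as the user would. However, the resource limitations of target devices cause significant difficulties. We propose NNTrainer, a lightweight on-device training framework. We describe the neural network optimization techniques implemented by NNTrainer, which are evaluated alongside conventional frameworks. The evaluations show that NNTrainer can reduce memory consumption to as little as 1/28 without degrading accuracy or training time, and that it effectively personalizes applications on devices. NNTrainer is cross-platform, practical, open-source software that is being deployed to millions of devices at the authors' affiliation.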
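Deep learning technologies have demonstrated remarkable effectiveness in a wide range of tasks, and deep learning has the potential to advance a multitude of applications, including edge computing, where deep models are deployed on edge devices to enable instant data processing and response. A key challenge is that while the application of deep models often incurs significant memory and computational costs, edge devices typically offer only very limited storage and computational capabilities, which may vary substantially across devices. These characteristics make it difficult to build deep learning solutions that unleash the potential of edge devices while complying with their constraints. A promising approach to this challenge is to automate the design of effective deep learning models that are lightweight, require only a small amount of storage, and incur only low computational overhead. This survey offers comprehensive coverage of design automation techniques for deep learning models targeting edge computing. It provides an overview and comparison of the key metrics commonly used to quantify the proficiency of models in terms of effectiveness, lightness, and computational cost. The survey then covers three categories of state-of-the-art deep model design automation techniques: automated neural architecture search, automated model compression, and joint automated design and compression. Finally, the survey covers open issues and directions for future research.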
While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as their potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs, and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
In recent years, deep learning (DL) models have demonstrated remarkable achievements on non-trivial tasks such as speech recognition and natural language understanding. One of the significant contributors to their success is the proliferation of end devices, which acted as a catalyst by providing data for data-hungry DL models. However, performing DL training and inference remains the main challenge. Usually, central cloud servers are used for the computation, but this opens up other significant challenges, such as high latency, increased communication costs, and privacy concerns. To mitigate these drawbacks, considerable efforts have been made to push the processing of DL models to edge servers. Moreover, the confluence of DL and edge computing has given rise to edge intelligence (EI). This survey paper focuses primarily on the fifth level of EI, called the all in-edge level, where DL training and inference (deployment) are performed solely by edge servers. All in-edge is suitable when the end devices have low computing resources, e.g., Internet-of-Things devices, and when other requirements such as latency and communication cost are important in mission-critical applications, e.g., health care. Firstly, this paper presents all in-edge computing architectures, including centralized, decentralized, and distributed. Secondly, this paper presents enabling technologies, such as model parallelism and split learning, which facilitate DL training and deployment at edge servers. Thirdly, model adaptation techniques based on model compression and conditional computation are described, because the standard cloud-based DL deployment cannot be directly applied to all in-edge due to its limited computational resources. Fourthly, this paper discusses eleven key performance metrics to evaluate the performance of DL at all in-edge efficiently. Finally, several open research challenges in the area of all in-edge are presented.
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
Video, as a key driver in the global explosion of digital information, can create tremendous benefits for human society. Governments and enterprises are deploying innumerable cameras for a variety of applications, e.g., law enforcement, emergency management, traffic control, and security surveillance, all facilitated by video analytics (VA). This trend is spurred by the rapid advancement of deep learning (DL), which enables more precise models for object classification, detection, and tracking. Meanwhile, with the proliferation of Internet-connected devices, massive amounts of data are generated daily, overwhelming the cloud. Edge computing, an emerging paradigm that moves workloads and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new intersection, edge video analytics (EVA), begins to attract widespread attention. Nevertheless, only a few loosely-related surveys exist on this topic. A dedicated venue for collecting and summarizing the latest advances of EVA is highly desired by the community. Besides, the basic concepts of EVA (e.g., definition, architectures, etc.) are ambiguous and neglected by these surveys due to the rapid development of this domain. A thorough clarification is needed to facilitate a consensus on these concepts. To fill in these gaps, we conduct a comprehensive survey of the recent efforts on EVA. In this paper, we first review the fundamentals of edge computing, followed by an overview of VA. The EVA system and its enabling techniques are discussed next. In addition, we introduce prevalent frameworks and datasets to aid future researchers in the development of EVA systems. Finally, we discuss existing challenges and foresee future research directions. We believe this survey will help readers comprehend the relationship between VA and edge computing, and spark new ideas on EVA.
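Deep neural networks (DNNs) have become ubiquitous in mobile and embedded systems for image and object recognition and classification. The trend of executing multiple DNNs simultaneously exacerbates the existing difficulty of meeting stringent latency and accuracy requirements on resource-constrained mobile devices. Prior work sheds light on the accuracy-resource trade-off by dynamically scaling model sizes according to available resources. However, such model scaling approaches face pressing challenges: (i) exploring a large space of model sizes, and (ii) prohibitively long training times for different model combinations. In this paper, we present LegoDNN, a lightweight, block-grained scaling solution for running multi-DNN workloads in mobile vision systems. LegoDNN guarantees short model training times by extracting and training only a small number of common blocks in a DNN (e.g., 5 in VGG and 8 in ResNet). At run time, LegoDNN optimally combines the descendant models of these blocks to maximize accuracy under specific resource and latency constraints, while reducing switching overhead through smart block-level scaling of the DNN. We implement LegoDNN in TensorFlow Lite and extensively evaluate it against state-of-the-art techniques (FLOP scaling, knowledge distillation, and model compression) using a set of popular DNN models. Evaluation results show that LegoDNN provides 1,296x to 279,936x more options in model sizes without increasing training time, achieving up to a 31.74% improvement in inference accuracy and a 71.07% reduction in scaling energy consumption.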
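The convergence of 5G architecture and deep learning has attracted considerable research interest in both wireless communication and artificial intelligence. This is because deep learning technologies have been identified as a potential driver of the 5G technologies that make up the 5G architecture. Consequently, there have been extensive surveys on the convergence of 5G architecture and deep learning. However, most existing survey papers focus mainly on how deep learning can converge with a specific 5G technology and therefore do not cover the full spectrum of the 5G architecture. Although one recent survey paper appears to be robust, a review of that paper shows that it is not well structured to specifically cover the convergence of deep learning and the 5G technologies. Hence, this paper provides an overview of the convergence of the key 5G technologies and deep learning. The challenges faced by such convergence are discussed. In addition, a brief overview of the future 6G architecture, and of how it can converge with deep learning, is also given.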
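Mixed-precision deep neural networks achieve the energy efficiency and throughput needed for hardware deployment, particularly when resources are limited, without sacrificing accuracy. However, the optimal per-layer bit precision that preserves accuracy is not easy to find, especially given the abundance of models, datasets, and quantization techniques, which creates an enormous search space. To tackle this difficulty, a body of literature has emerged recently, and several frameworks that achieve promising accuracy results have been proposed. In this paper, we first summarize the quantization techniques commonly used in the literature. We then present a thorough survey of mixed-precision frameworks, categorized according to their optimization techniques, such as reinforcement learning, and their quantization techniques, such as deterministic rounding. Furthermore, the advantages and shortcomings of each framework are discussed and presented side by side. Finally, we give guidelines for future mixed-precision frameworks.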
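To make per-layer bit precision and deterministic rounding concrete, the following is a minimal sketch of uniform quantization with round-to-nearest, applied at two different bit widths. It is a generic textbook scheme, not the method of any framework covered by the survey; the function names, the fixed range [-1, 1], and the sample weights are assumptions made for illustration.

```python
# Sketch: uniform quantization with deterministic (round-to-nearest)
# rounding. Names, range, and weights are illustrative assumptions.

def quantize(values, bits, lo=-1.0, hi=1.0):
    """Map floats onto 2**bits evenly spaced levels in [lo, hi]."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)
        # Adding 0.5 before truncation rounds to the nearest level;
        # the same input always yields the same output.
        q = int((clipped - lo) / step + 0.5)
        out.append(lo + q * step)
    return out


def max_error(values, bits):
    """Largest absolute difference between values and their quantized form."""
    return max(abs(v - q) for v, q in zip(values, quantize(values, bits)))


weights = [-1.0, -0.3, 0.0, 0.42, 1.0]
# A mixed-precision assignment gives each layer its own bit width.
error_8bit = max_error(weights, bits=8)
error_2bit = max_error(weights, bits=2)
```

Fewer bits mean coarser levels and a larger worst-case error, which is why the per-layer choice matters: layers that tolerate coarse weights can use low precision, while sensitive layers keep more bits.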