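Decentralized distributed learning is the key to enabling large-scale machine learning (training) on edge devices using private, user-generated local data, without relying on the cloud. However, the practical realization of such on-device training is limited by the communication bottleneck, the computational complexity of training deep models, and significant data distribution skew across devices. Many feedback-based compression techniques have been proposed in the literature to reduce the communication cost, and a few works propose algorithmic changes that aid performance in the presence of skewed data distributions by improving the convergence rate. To the best of our knowledge, no work in the literature applies and demonstrates compute-efficient training techniques such as quantization and pruning for peer-to-peer decentralized learning setups. In this paper, we analyze and show the convergence of low-precision decentralized training, which aims to reduce the computational complexity of both training and inference. Further, we study the effect of the degree of skew and of communication compression on low-precision decentralized training over various computer vision and natural language processing (NLP) tasks. Our experiments indicate that 8-bit decentralized training incurs minimal accuracy loss compared to its full-precision counterpart, even with heterogeneous data. However, when low-precision training is accompanied by communication compression through sparsification, we observe a 1-2% drop in accuracy. The proposed low-precision decentralized training reduces computational complexity, memory usage, and communication cost while trading off less than 1% accuracy for both IID and non-IID data. In particular, with higher skew values, we observe an increase in accuracy (~0.5%) with low-precision training, indicating a regularization effect of the quantization.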
translated by 谷歌翻译
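The 8-bit training mentioned in the abstract above relies on mapping floating-point tensors to a small integer range. The abstract does not say which quantizer is used, so the following is only a minimal sketch of one common choice, symmetric per-tensor uniform quantization; the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization of a float array to int8 (illustrative only)."""
    max_abs = float(np.max(np.abs(x)))
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Map int8 codes back to approximate float values."""
    return q.astype(np.float32) * scale

weights = np.array([0.5, -1.27, 0.003, 1.0], dtype=np.float32)
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
```

Each restored value differs from the original by at most about half a quantization step (`scale / 2`), which is the rounding error a low-precision scheme of this kind has to tolerate.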
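Decentralized learning algorithms enable the training of deep learning models over large distributed datasets generated at different devices and locations, without the need for a central server. In practical scenarios, the distributed datasets can have significantly different data distributions across the agents. Current state-of-the-art decentralized algorithms mostly assume the data distributions to be independent and identically distributed (IID). This paper focuses on improving decentralized learning over non-IID data distributions with minimal compute and memory overheads. We propose Neighborhood Gradient Clustering (NGC), a novel decentralized learning algorithm that modifies the local gradients of each agent using self- and cross-gradient information. In particular, the proposed method replaces the local gradients of the model with the weighted mean of the self-gradients, the model-variant cross-gradients (derivatives of the received neighbors' model parameters with respect to the local dataset), and the data-variant cross-gradients (derivatives of the local model with respect to its neighbors' datasets). Further, we present CompNGC, a compressed version of NGC that reduces the communication overhead by $32\times$ by compressing the cross-gradients. We demonstrate the empirical convergence and efficiency of the proposed technique over sampled non-IID data distributions on various model architectures and graph topologies. Our experiments demonstrate that NGC and CompNGC outperform the existing state-of-the-art (SoTA) decentralized learning algorithm over non-IID data by $1$-$5\%$, with significantly lower compute and memory requirements. Further, we show that the proposed NGC method yields an improvement of $5$-$40\%$ with no additional communication.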
Many large-scale machine learning (ML) applications need to perform decentralized learning over datasets generated at different devices and locations. Such datasets pose a significant challenge to decentralized learning because their different contexts result in significant data distribution skew across devices/locations. In this paper, we take a step toward better understanding this challenge by presenting a detailed experimental study of decentralized DNN training on a common type of data skew: skewed distribution of data labels across devices/locations. Our study shows that: (i) skewed data labels are a fundamental and pervasive problem for decentralized learning, causing significant accuracy loss across many ML applications, DNN models, training datasets, and decentralized learning algorithms; (ii) the problem is particularly challenging for DNN models with batch normalization; and (iii) the degree of data skew is a key determinant of the difficulty of the problem. Based on these findings, we present SkewScout, a system-level approach that adapts the communication frequency of decentralized learning algorithms to the (skew-induced) accuracy loss between data partitions. We also show that group normalization can recover much of the accuracy loss of batch normalization.
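Recent years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms: parameter servers and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower communication via system relaxations: quantization, decentralization, and communication delay. However, most, if not all, existing systems rely only on standard synchronous and asynchronous stochastic gradient (SG) based optimization, and therefore cannot take advantage of all the optimizations that the machine learning community has recently developed. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, an MPI-style communication library that provides a collection of primitives that is both flexible and modular, in order to support state-of-the-art system relaxation techniques for distributed training. Powered by this design, BAGUA is well suited to implementing and extending various state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), BAGUA can outperform PyTorch-DDP, Horovod, and BytePS in end-to-end training time by a significant margin (up to 2 times) across a diverse range of tasks. Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance under different network conditions.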
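The convergence speed of machine learning models trained with federated learning is significantly affected by heterogeneous data partitions, even more so in a fully decentralized setting without a central server. In this paper, we show that the impact of label distribution skew, an important type of data heterogeneity, can be significantly reduced by carefully designing the underlying communication topology. We present D-Cliques, a novel topology that reduces gradient bias by grouping nodes in sparsely interconnected cliques such that the label distribution in a clique is representative of the global label distribution. We also show how to adapt the updates of decentralized SGD to obtain unbiased gradients and implement an effective momentum with D-Cliques. Our extensive empirical evaluation on MNIST and CIFAR10 demonstrates that our approach provides a convergence speed similar to that of a fully connected topology, which provides the best convergence in a data-heterogeneous setting, with a significant reduction in the number of edges and messages. In a 1000-node topology, D-Cliques require 98% fewer edges and 96% fewer total messages, with further gains possible using a small-world topology across cliques.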
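Federated learning (FL) is a burgeoning distributed machine learning framework in which a central parameter server (PS) coordinates many local users to train a globally consistent model. Conventional federated learning inevitably relies on a centralized topology with a PS, and is therefore paralyzed once the PS fails. To alleviate this single point of failure, some existing work has provided decentralized FL (DFL) implementations such as CDSGD and D-PSGD to facilitate FL in a decentralized topology. However, these methods still have shortcomings, e.g., differences among the users' final models in CDSGD and the need for network-wide model averaging in D-PSGD. To address these deficiencies, this paper devises a new DFL implementation, DACFL, in which each user trains its model using its own training data and exchanges the intermediate models with its neighbors through a symmetric and doubly stochastic matrix. DACFL treats the progress of each user's local training as a discrete-time process and employs a first-order dynamic average consensus (FODAC) method to track the \textit{average model} in the absence of the PS. We also provide a theoretical convergence analysis of DACFL under the premise of i.i.d. data to strengthen its rationality. Experimental results on MNIST, Fashion-MNIST, and CIFAR-10 validate the feasibility of our solution in both time-invariant and time-varying network topologies, and show that DACFL outperforms D-PSGD and CDSGD in most cases.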
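To make the role of a symmetric, doubly stochastic mixing matrix concrete, here is a minimal sketch that builds one with Metropolis-Hastings weights and runs plain consensus averaging on a ring of five users. This is an illustration under stated assumptions, not the DACFL algorithm itself: the weight rule, the ring graph, and the name `metropolis_weights` are choices made for this example, and the FODAC tracking step is not shown.

```python
import numpy as np

def metropolis_weights(adjacency):
    """Build a symmetric, doubly stochastic mixing matrix from an undirected graph."""
    adjacency = np.asarray(adjacency, dtype=bool)
    n = adjacency.shape[0]
    degree = adjacency.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                W[i, j] = 1.0 / (1 + max(degree[i], degree[j]))
        # The self-weight absorbs the remaining mass so that each row sums to 1.
        W[i, i] = 1.0 - W[i].sum()
    return W

# Ring of 5 users, each holding a scalar stand-in for its model.
n = 5
ring = np.zeros((n, n), dtype=bool)
for i in range(n):
    ring[i, (i + 1) % n] = ring[(i + 1) % n, i] = True

W = metropolis_weights(ring)
models = np.arange(n, dtype=float).reshape(n, 1)
for _ in range(200):
    models = W @ models  # each user averages with its neighbors only
```

Because the rows and columns of `W` all sum to one, repeated neighbor-only mixing preserves the network-wide average and drives every user toward it, with no parameter server involved.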
In recent years, deep learning (DL) models have demonstrated remarkable achievements on non-trivial tasks such as speech recognition and natural language understanding. One of the significant contributors to its success is the proliferation of end devices that acted as a catalyst to provide data for data-hungry DL models. However, computing DL training and inference is the main challenge. Usually, central cloud servers are used for the computation, but it opens up other significant challenges, such as high latency, increased communication costs, and privacy concerns. To mitigate these drawbacks, considerable efforts have been made to push the processing of DL models to edge servers. Moreover, the confluence point of DL and edge has given rise to edge intelligence (EI). This survey paper focuses primarily on the fifth level of EI, called all in-edge level, where DL training and inference (deployment) are performed solely by edge servers. All in-edge is suitable when the end devices have low computing resources, e.g., Internet-of-Things, and other requirements such as latency and communication cost are important in mission-critical applications, e.g., health care. Firstly, this paper presents all in-edge computing architectures, including centralized, decentralized, and distributed. Secondly, this paper presents enabling technologies, such as model parallelism and split learning, which facilitate DL training and deployment at edge servers. Thirdly, model adaptation techniques based on model compression and conditional computation are described because the standard cloud-based DL deployment cannot be directly applied to all in-edge due to its limited computational resources. Fourthly, this paper discusses eleven key performance metrics to evaluate the performance of DL at all in-edge efficiently. Finally, several open research challenges in the area of all in-edge are presented.
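Deep learning has achieved promising results across a wide spectrum of AI applications. Larger datasets and models consistently yield better performance. However, they generally require longer training time and more computation and communication. In this survey, we aim to provide a clear sketch of optimizations for large-scale deep learning with regard to model accuracy and model efficiency. We investigate the algorithms most commonly used for optimization, elaborate on the debated topic of the generalization gap that arises in large-batch training, and review SOTA strategies for addressing the communication overhead and reducing the memory footprint.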
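Federated learning (FL) enables training a global model without sharing the decentralized raw data stored on multiple devices, thereby protecting data privacy. Because device capacities are diverse, FL frameworks struggle with the problems of straggler effects and outdated models. In addition, data heterogeneity causes severe accuracy degradation of the global model during FL training. To address these issues, we propose a hierarchical synchronous FL framework, FedHiSyn. FedHiSyn first clusters all available devices into a small number of categories based on their computing capacity. After a certain interval of local training, the models trained in different categories are simultaneously uploaded to a central server. Within a single category, the devices communicate their locally updated model weights to each other based on a ring topology. Since training in a ring topology is most efficient for devices with homogeneous resources, classifying devices by computing capacity mitigates the impact of straggler effects. Moreover, combining synchronous updates across multiple categories with device communication within a single category helps address the data heterogeneity issue while achieving high accuracy. We evaluate the proposed framework on the MNIST, EMNIST, CIFAR10, and CIFAR100 datasets and under diverse heterogeneous device settings. Experimental results show that FedHiSyn outperforms six baseline methods, e.g., FedAvg, SCAFFOLD, and FedAT, in terms of training accuracy and efficiency.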
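A decentralized algorithm is a form of computation that achieves a global goal through local dynamics relying on low-cost communication between directly connected agents. On large-scale optimization tasks involving distributed datasets, decentralized algorithms have shown strong, sometimes superior, performance over distributed algorithms with a central node. Recently, developing decentralized algorithms for deep learning has attracted great attention. They are considered low-communication-overhead alternatives to approaches that use a parameter server or the Ring-Allreduce protocol. However, the lack of an easy-to-use and efficient software package has kept most decentralized algorithms merely on paper. To fill this gap, we introduce BlueFog, a Python library for straightforward, high-performance implementations of diverse decentralized algorithms. Based on a unified abstraction of various communication operations, BlueFog offers intuitive interfaces to implement a spectrum of decentralized algorithms, from those using a static, undirected graph for synchronous operations to those using dynamic and directed graphs for asynchronous operations. BlueFog also adopts several system-level acceleration techniques to further optimize performance on deep learning tasks. On mainstream DNN training tasks, BlueFog reaches a much higher throughput and achieves an overall $1.2\times \sim 1.8\times$ speedup over a state-of-the-art distributed deep learning package based on Ring-Allreduce. BlueFog is open source at https://github.com/bluefog-lib/bluefog.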
Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies in the high communication cost on the central node. Motivated by this, we ask: can decentralized algorithms be faster than their centralized counterparts? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming an application scenario in which only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has a total computational complexity comparable to that of C-PSGD but requires much less communication on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms with up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.
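The D-PSGD update is usually written as: each node averages its neighbors' parameters through a doubly stochastic mixing matrix and then applies its own local gradient. The sketch below illustrates that form on a toy problem; the ring topology, the quadratic local objectives, the exact (non-stochastic) gradients, and the name `dpsgd_step` are all assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def dpsgd_step(params, W, grads, lr):
    """One decentralized step: neighbor averaging followed by a local gradient update.

    params, grads: arrays of shape (n_nodes, dim); W: (n_nodes, n_nodes) mixing matrix.
    """
    return W @ params - lr * grads

# Ring of 4 nodes; each node weights itself and its two neighbors equally.
n = 4
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i, i + 1):
        W[i, j % n] = 1.0 / 3.0

# Node i minimizes 0.5 * (x - targets[i])**2, so the global optimum is the mean target.
targets = np.array([[1.0], [2.0], [3.0], [6.0]])
params = np.zeros((n, 1))
for _ in range(500):
    grads = params - targets  # exact local gradients (no sampling noise in this toy)
    params = dpsgd_step(params, W, grads, lr=0.05)

network_average = params.mean()
```

No node ever communicates with more than two peers per step, which is the property behind the abstract's point about reduced communication on the busiest node. With a constant step size the individual nodes keep a small disagreement, while their average settles at the global optimum in this toy setting.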
Federated Learning allows multiple parties to jointly train a deep learning model on their combined data, without any of the participants having to reveal their local data to a centralized server. This form of privacy-preserving collaborative learning, however, comes at the cost of significant communication overhead during training. To address this problem, several compression methods have been proposed in the distributed training literature that can reduce the amount of required communication by up to three orders of magnitude. These existing methods, however, are of only limited utility in the Federated Learning setting, as they either compress only the upstream communication from the clients to the server (leaving the downstream communication uncompressed) or perform well only under idealized conditions such as an iid distribution of the client data, which typically cannot be found in Federated Learning. In this work, we propose Sparse Ternary Compression (STC), a new compression framework that is specifically designed to meet the requirements of the Federated Learning environment. STC extends the existing compression technique of top-k gradient sparsification with a novel mechanism to enable downstream compression as well as ternarization and optimal Golomb encoding of the weight updates. Our experiments on four different learning tasks demonstrate that STC distinctly outperforms Federated Averaging in common Federated Learning scenarios where clients either a) hold non-iid data, b) use small batch sizes during training, or c) are large in number with a low participation rate in every communication round. We furthermore show that even if the clients hold iid data and use medium-sized batches for training, STC still behaves Pareto-superior to Federated Averaging in the sense that it achieves fixed target accuracies on our benchmarks within both fewer training iterations and a smaller communication budget. These results advocate for a paradigm shift in Federated optimization towards high-frequency low-bitwidth communication, in particular in bandwidth-constrained learning environments.
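The combination of top-k sparsification and ternarization named in the abstract can be sketched in a few lines. This is a hedged illustration only: using the mean magnitude of the kept entries as the shared value is an assumption of this sketch, the function name `sparse_ternary` is hypothetical, and the Golomb encoding and downstream compression mechanism are not shown.

```python
import numpy as np

def sparse_ternary(update, k):
    """Keep the k largest-magnitude entries and replace them by +mu or -mu."""
    flat = update.ravel()
    k = max(1, min(k, flat.size))
    # Indices of the k entries with the largest absolute value.
    top_idx = np.argpartition(np.abs(flat), -k)[-k:]
    # Assumed choice: one shared magnitude, the mean of the kept magnitudes.
    mu = np.abs(flat[top_idx]).mean()
    out = np.zeros_like(flat)
    out[top_idx] = mu * np.sign(flat[top_idx])
    return out.reshape(update.shape)

update = np.array([0.1, -2.0, 0.05, 3.0, -0.2])
compressed = sparse_ternary(update, k=2)
```

The compressed update takes only the values `-mu`, `0`, and `+mu`, so a sender needs to transmit just the positions and signs of the kept entries plus a single float.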
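Distributed learning has become an integral tool for scaling up machine learning and addressing the growing need for data privacy. Although more robust to the network topology, decentralized learning schemes have not gained the same level of popularity as their centralized counterparts because their performance is less competitive. In this work, we attribute this issue to the lack of synchronization among decentralized learning workers, showing both empirically and theoretically that the convergence speed is tied to the synchronization level among the workers. Motivated by this, we present a novel decentralized learning framework based on nonlinear gossiping (NGO), which enjoys an appealing finite-time consensus property to achieve better synchronization. We provide a careful analysis of its convergence and discuss its merits for modern distributed optimization applications, such as deep neural networks. Our analysis of how communication delay and randomized chats affect learning further enables the derivation of practical variants that accommodate asynchronous and randomized communications. To validate the effectiveness of our proposal, we benchmark NGO against competing solutions through an extensive set of tests and report encouraging results.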
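Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated high-performance computing clusters, whose construction and maintenance are both costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining under realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of successful collaborative language model pretraining with 40 participants.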
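Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple, yet theoretically and practically effective compression technique: natural compression (NC). Our technique is applied individually to all entries of the update vector to be compressed and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa. We show that, compared to no compression, NC increases the second moment of the compressed vector by no more than the tiny factor $\frac{9}{8}$, which means that the effect of NC on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by NC are substantial, leading to a $3$-$4\times$ improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize NC to natural dithering, which we prove is much better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer a new state of the art in both theory and practice.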
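Randomized rounding to a power of two can be illustrated with a short sketch. It assumes the usual unbiased rounding rule: a magnitude lying between two adjacent powers of two is rounded up with probability proportional to its distance from the lower one. The function name `natural_compress` and the use of NumPy's `frexp` to read off the exponent are choices made for this example, not taken from the paper.

```python
import numpy as np

def natural_compress(x, rng):
    """Randomly round each entry to an adjacent power of two, keeping its sign."""
    x = np.asarray(x, dtype=np.float64)
    # |x| = mantissa * 2**exponent with mantissa in [0.5, 1), or 0 for x == 0.
    mantissa, exponent = np.frexp(np.abs(x))
    low = np.ldexp(1.0, exponent - 1)   # largest power of two <= |x|
    p_up = 2.0 * mantissa - 1.0         # equals (|x| - low) / low
    round_up = rng.random(x.shape) < p_up
    magnitude = np.where(round_up, 2.0 * low, low)
    return np.sign(x) * magnitude       # zeros stay zero because sign(0) == 0

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, -4.0, 0.75, -3.0, 5.5])
compressed = natural_compress(x, rng)
```

Entries that are already powers of two pass through unchanged, and every other entry lands on one of its two neighboring powers of two. The result is correct in expectation, and only a sign and an exponent need to be transmitted per entry.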
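The ability to scale out training workloads has been one of the key performance enablers of deep learning. The main scaling approach is data-parallel GPU-based training, which has been boosted by hardware and software support for highly efficient inter-GPU communication, in particular via bandwidth overprovisioning. This support comes at a price: there is an order-of-magnitude cost difference between "cloud-grade" servers with such support and their "consumer-grade" counterparts, although server-grade and consumer-grade GPUs can have similar computational envelopes. In this paper, we investigate whether the expensive hardware overprovisioning approach can be supplanted via algorithmic and system design, and propose a framework called CGX, which provides efficient software support for communication compression. We show that this framework is able to remove communication bottlenecks from consumer-grade multi-GPU systems in the absence of hardware support: when training modern models and tasks to full accuracy, our framework enables self-speedups of 2-3x on a commodity system using 8 consumer-grade NVIDIA RTX 3090 GPUs, and enables it to surpass the throughput of an NVIDIA DGX-1 server, which has similar peak FLOPS but benefits from bandwidth overprovisioning.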
Recently, local peer topology has been shown to influence the overall convergence of decentralized learning (DL) graphs in the presence of data heterogeneity. In this paper, we demonstrate the advantages of constructing a proxy-based locally heterogeneous DL topology to enhance convergence and maintain data privacy. In particular, we propose a novel peer clumping strategy to efficiently cluster peers before arranging them in a final training graph. By showing how locally heterogeneous graphs outperform locally homogeneous graphs of similar size and from the same global data distribution, we present a strong case for topological pre-processing. Moreover, we demonstrate the scalability of our approach by showing how the proposed topological pre-processing overhead remains small in large graphs while the performance gains get even more pronounced. Furthermore, we show the robustness of our approach in the presence of network partitions.
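Devices participating in federated learning (FL) typically have heterogeneous communication, computation, and memory resources. However, in synchronous FL, all devices need to finish training by the same deadline dictated by the server. Our results show that training a smaller subset of the neural network (NN) on constrained devices, i.e., dropping neurons/filters as proposed by the state of the art, is inefficient and prevents these devices from making an effective contribution to the model. This causes unfairness with respect to the achievable accuracies of constrained devices, especially when the distribution of class labels is skewed across devices. We present a novel FL technique, CoCoFL, which maintains the full NN structure on all devices. To adapt to the devices' heterogeneous resources, CoCoFL freezes and quantizes selected layers, reducing communication, computation, and memory requirements, while the other layers are still trained in full precision, making it possible to reach a high accuracy. CoCoFL thereby efficiently utilizes the available resources on devices and allows constrained devices to make a significant contribution to the FL system, increasing fairness among participants (accuracy parity) and significantly improving the final accuracy of the model.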
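Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can be carried out via specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce, an iterative averaging protocol that converges exponentially to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show a 1.3x speedup for ImageNet training compared to competitive gossip-based strategies and a 1.5x speedup when training from scratch using preemptible compute nodes.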
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.