In distributed training of deep neural networks, parallel minibatch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradient in parallel, aggregates all gradients in a single server to obtain the average, and update each worker's local model using a SGD update with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010) (McDonald, Hall, andMann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, tremendous experimental works have verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study on why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
translated by 谷歌翻译
Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis.We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced up to a factor of T 1/2 -where T denotes the number of total steps-compared to mini-batch SGD. This also holds for asynchronous implementations.Local SGD can also be used for large scale training of deep learning models. The results shown here aim serving as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.
translated by 谷歌翻译
我们研究了在$ n $工人上的分布式培训的异步随机梯度下降算法,随着时间的推移,计算和通信频率变化。在此算法中,工人按照自己的步调并行计算随机梯度,并在没有任何同步的情况下将其返回服务器。该算法的现有收敛速率对于非凸平的光滑目标取决于最大梯度延迟$ \ tau _ {\ max} $,并表明$ \ epsilon $ stationary点在$ \ mathcal {o} \!\左后达到(\ sigma^2 \ epsilon^{ - 2}+ \ tau _ {\ max} \ epsilon^{ - 1} \ right)$ iterations,其中$ \ sigma $表示随机梯度的方差。在这项工作(i)中,我们获得了$ \ Mathcal {o} \!\ left(\ sigma^2 \ epsilon^{ - 2}+ sqrt {\ tau _ {\ max} \ max} \ tau_ {avg} {avg} } \ epsilon^{ - 1} \ right)$,没有任何更改的算法,其中$ \ tau_ {avg} $是平均延迟,可以大大小于$ \ tau _ {\ max} $。我们还提供(ii)一个简单的延迟自适应学习率方案,在该方案下,异步SGD的收敛速率为$ \ Mathcal {o} \!\ left(\ sigma^2 \ epsilon^{ - 2} { - 2}+ \ tau_ {-2 avg} \ epsilon^{ - 1} \ right)$,并且不需要任何额外的高参数调整或额外的通信。我们的结果首次显示异步SGD总是比迷你批次SGD快。此外,(iii)我们考虑了由联邦学习应用激发的异质功能的情况,并通过证明与先前的作品相比对最大延迟的依赖性较弱,并提高收敛率。特别是,我们表明,收敛率的异质性项仅受每个工人内平均延迟的影响。
translated by 谷歌翻译
Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized. This framework faces several systemsoriented challenges which include (i) communication bottleneck since a large number of devices upload their local updates to a parameter server, and (ii) scalability as the federated network consists of millions of devices. Due to these systems challenges as well as issues related to statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet it remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation where only a fraction of devices participate in each round of the training; and (3) quantized messagepassing where the edge nodes quantize their updates before uploading to the parameter server. These features address the communications and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions and empirically demonstrate the communication-computation tradeoff provided by our method.
translated by 谷歌翻译
亚当是训练深神经网络的最具影响力的自适应随机算法之一,即使在简单的凸面设置中,它也被指出是不同的。许多尝试,例如降低自适应学习率,采用较大的批量大小,结合了时间去相关技术,寻求类似的替代物,\ textit {etc。},以促进Adam-type算法融合。与现有方法相反,我们引入了另一种易于检查的替代条件,这仅取决于基础学习率的参数和历史二阶时刻的组合,以确保通用ADAM的全球融合以解决大型融合。缩放非凸随机优化。这种观察结果以及这种足够的条件,对亚当的差异产生了更深刻的解释。另一方面,在实践中,无需任何理论保证,广泛使用了迷你ADAM和分布式ADAM。我们进一步分析了分布式系统中的批次大小或节点的数量如何影响亚当的收敛性,从理论上讲,这表明迷你批次和分布式亚当可以通过使用较大的迷你批量或较大的大小来线性地加速节点的数量。最后,我们应用了通用的Adam和Mini Batch Adam,具有足够条件来求解反例并在各种真实世界数据集上训练多个神经网络。实验结果完全符合我们的理论分析。
translated by 谷歌翻译
分布式优化和学习的最新进展表明,沟通压缩是减少交流的最有效手段之一。尽管在通信压缩下的收敛速率有很多结果,但理论下限仍然缺失。通过通信压缩的算法的分析将收敛归因于两个抽象属性:无偏见的属性或承包属性。它们可以通过单向压缩(仅从工人到服务器的消息被压缩)或双向压缩来应用它们。在本文中,我们考虑了分布式随机算法,以最大程度地减少通信压缩下的平滑和非凸目标函数。我们为算法建立了收敛的下限,无论是在单向或双向中使用无偏压缩机还是使用承包压缩机。为了缩小下限和现有上限之间的差距,我们进一步提出了一种新石器时代的算法,该算法在轻度条件下几乎达到了我们的下限(达到对数因素)。我们的结果还表明,使用承包双向压缩可以产生迭代方法,该方法的收敛速度与使用无偏见的单向压缩的方法一样快。实验结果验证了我们的发现。
translated by 谷歌翻译
Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies on high communication cost on the central node. Motivated by this, we ask, can decentralized algorithms be faster than its centralized counterpart?Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has comparable total computational complexities to C-PSGD but requires much less communication cost on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.
translated by 谷歌翻译
我们考虑了分布式随机优化问题,其中$ n $代理想要最大程度地减少代理本地函数总和给出的全局函数,并专注于当代理的局部函数在非i.i.i.d上定义时,专注于异质设置。数据集。我们研究本地SGD方法,在该方法中,代理执行许多局部随机梯度步骤,并偶尔与中央节点进行通信以改善其本地优化任务。我们分析了本地步骤对局部SGD的收敛速率和通信复杂性的影响。特别是,我们允许在$ i $ th的通信回合($ h_i $)期间允许在所有通信回合中进行固定数量的本地步骤。我们的主要贡献是将本地SGD的收敛速率表征为$ \ {h_i \} _ {i = 1}^r $在强烈凸,convex和nonconvex local函数下的函数,其中$ r $是沟通总数。基于此特征,我们在序列$ \ {h_i \} _ {i = 1}^r $上提供足够的条件,使得本地SGD可以相对于工人数量实现线性加速。此外,我们提出了一种新的沟通策略,将本地步骤提高,优于现有的沟通策略,以突出局部功能。另一方面,对于凸和非凸局局功能,我们认为固定的本地步骤是本地SGD的最佳通信策略,并恢复了最新的收敛速率结果。最后,我们通过广泛的数值实验证明我们的理论结果是合理的。
translated by 谷歌翻译
使用多个计算节点通常可以加速在大型数据集上的深度神经网络。这种方法称为分布式训练,可以通过专门的消息传递协议,例如环形全部减少。但是,以比例运行这些协议需要可靠的高速网络,其仅在专用集群中可用。相比之下,许多现实世界应用程序,例如联合学习和基于云的分布式训练,在具有不稳定的网络带宽的不可靠的设备上运行。因此,这些应用程序仅限于使用参数服务器或基于Gossip的平均协议。在这项工作中,我们通过提出MOSHPIT全部减少的迭代平均协议来提升该限制,该协议指数地收敛于全局平均值。我们展示了我们对具有强烈理论保证的分布式优化方案的效率。该实验显示了与使用抢占从头开始训练的竞争性八卦的策略和1.5倍的加速,显示了1.3倍的Imagenet培训的加速。
translated by 谷歌翻译
现代深度学习模型通常在分布式机器集合中并行培训,以减少训练时间。在这种情况下,机器之间模型更新的通信变成了一个重要的性能瓶颈,并且已经提出了各种有损的压缩技术来减轻此问题。在这项工作中,我们介绍了一种新的,简单但理论上和实践上有效的压缩技术:自然压缩(NC)。我们的技术分别应用于要进行压缩的更新向量的所有条目,并通过随机舍入到两个的(负或正)两种功能,可以通过忽略Mantissa来以“自然”方式计算。我们表明,与没有压缩相比,NC将压缩向量的第二刻增加不超过微小因子$ \ frac {9} {8} $,这意味着NC对流行训练算法的收敛速度的影响,例如分布式SGD,可以忽略不计。但是,NC启用的通信节省是可观的,导致$ 3 $ - $ 4 \ times $ $改善整体理论运行时间。对于需要更具侵略性压缩的应用,我们将NC推广到自然抖动,我们证明这比常见的随机抖动技术要好得多。我们的压缩操作员可以自行使用,也可以与现有操作员结合使用,从而产生更具侵略性的结合效果,并在理论和实践中提供新的最先进。
translated by 谷歌翻译
Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. SIGNSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative 1 / 2 geometry of gradients, noise and curvature informs whether SIGNSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of SIGNSGD is able to match the accuracy and convergence speed of ADAM on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss (1823) we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD.
translated by 谷歌翻译
由于培训数据集的大小爆炸,分布式学习近年来受到了日益增长的兴趣。其中一个主要瓶颈是中央服务器和本地工人之间的沟通成本。虽然已经证明错误反馈压缩以通过随机梯度下降(SGD)降低通信成本,但在培训大规模机器学习方面广泛用于培训的通信有效的适应性梯度方法楷模。在本文中,我们提出了一种新的通信 - 压缩AMSGRAD,用于分布式非透明的优化问题,可提供有效的效率。我们所提出的分布式学习框架具有有效的渐变压缩策略和工人侧模型更新设计。我们证明所提出的通信有效的分布式自适应梯度方法会聚到具有与随机非凸化优化设置中的未压缩的vanilla amsgrad相同的迭代复杂度的一阶静止点。关于各种基准备份我们理论的实验。
translated by 谷歌翻译
SOTA decentralized SGD algorithms can overcome the bandwidth bottleneck at the parameter server by using communication collectives like Ring All-Reduce for synchronization. While the parameter updates in distributed SGD may happen asynchronously there is still a synchronization barrier to make sure that the local training epoch at every learner is complete before the learners can advance to the next epoch. The delays in waiting for the slowest learners(stragglers) remain to be a problem in the synchronization steps of these state-of-the-art decentralized frameworks. In this paper, we propose the (de)centralized Non-blocking SGD (Non-blocking SGD) which can address the straggler problem in a heterogeneous environment. The main idea of Non-blocking SGD is to split the original batch into mini-batches, then accumulate the gradients and update the model based on finished mini-batches. The Non-blocking idea can be implemented using decentralized algorithms including Ring All-reduce, D-PSGD, and MATCHA to solve the straggler problem. Moreover, using gradient accumulation to update the model also guarantees convergence and avoids gradient staleness. Run-time analysis with random straggler delays and computational efficiency/throughput of devices is also presented to show the advantage of Non-blocking SGD. Experiments on a suite of datasets and deep learning networks validate the theoretical analyses and demonstrate that Non-blocking SGD speeds up the training and fastens the convergence. Compared with the state-of-the-art decentralized asynchronous algorithms like D-PSGD and MACHA, Non-blocking SGD takes up to 2x fewer time to reach the same training loss in a heterogeneous environment.
translated by 谷歌翻译
由于其在数据隐私保护,有效的沟通和并行数据处理方面的好处,联邦学习(FL)近年来引起了人们的兴趣。同样,采用适当的算法设计,可以实现fl中收敛效应的理想线性加速。但是,FL上的大多数现有作品仅限于I.I.D.的系统。数据和集中参数服务器以及与异质数据集分散的FL上的结果仍然有限。此外,在完全分散的FL下,与数据异质性在完全分散的FL下,可以实现收敛的线性加速仍然是一个悬而未决的问题。在本文中,我们通过提出一种称为Net-Fleet的新算法,以解决具有数据异质性的完全分散的FL系统,以解决这些挑战。我们算法的关键思想是通过合并递归梯度校正技术来处理异质数据集,以增强FL(最初旨在用于通信效率)的本地更新方案。我们表明,在适当的参数设置下,所提出的净型算法实现了收敛的线性加速。我们进一步进行了广泛的数值实验,以评估所提出的净化算法的性能并验证我们的理论发现。
translated by 谷歌翻译
Federated learning has attracted increasing attention with the emergence of distributed data. While extensive federated learning algorithms have been proposed for the non-convex distributed problem, the federated learning in practice still faces numerous challenges, such as the large training iterations to converge since the sizes of models and datasets keep increasing, and the lack of adaptivity by SGD-based model updates. Meanwhile, the study of adaptive methods in federated learning is scarce and existing works either lack a complete theoretical convergence guarantee or have slow sample complexity. In this paper, we propose an efficient adaptive algorithm (i.e., FAFED) based on the momentum-based variance reduced technique in cross-silo FL. We first explore how to design the adaptive algorithm in the FL setting. By providing a counter-example, we prove that a simple combination of FL and adaptive methods could lead to divergence. More importantly, we provide a convergence analysis for our method and prove that our algorithm is the first adaptive FL algorithm to reach the best-known samples $O(\epsilon^{-3})$ and $O(\epsilon^{-2})$ communication rounds to find an $\epsilon$-stationary point without large batches. The experimental results on the language modeling task and image classification task with heterogeneous data demonstrate the efficiency of our algorithms.
translated by 谷歌翻译
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compresion heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always converge. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes with convergence guarantees and good practical performance. QSGD allows the user to smoothly trade off communication bandwidth and convergence time: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant. time to the same target accuracy is 2.7×. Further, even computationally-heavy architectures such as Inception and ResNet can benefit from the reduction in communication: on 16GPUs, QSGD reduces the end-to-end convergence time of ResNet152 by approximately 2×. Networks trained with QSGD can converge to virtually the same accuracy as full-precision variants, and that gradient quantization may even slightly improve accuracy in some settings. Related Work. One line of related research studies the communication complexity of convex optimization. In particular, [40] studied two-processor convex minimization in the same model, provided a lower bound of Ω(n(log n + log(1/ ))) bits on the communication cost of n-dimensional convex problems, and proposed a non-stochastic algorithm for strongly convex problems, whose communication cost is within a log factor of the lower bound. By contrast, our focus is on stochastic gradient methods. Recent work [5] focused on round complexity lower bounds on the number of communication rounds necessary for convex learning.Buckwild! [10] was the first to consider the convergence guarantees of low-precision SGD. It gave upper bounds on the error probability of SGD, assuming unbiased stochastic quantization, convexity, and gradient sparsity, and showed significant speedup when solving convex problems on CPUs. QSGD refines these results by focusing on the trade-off between communication and convergence. We view quantization as an independent source of variance for SGD, which allows us to employ standard convergence results [7]. The main differences from Buckw
translated by 谷歌翻译
许多深度学习领域都受益于使用越来越大的神经网络接受公共数据训练的培训,就像预先训练的NLP和计算机视觉模型一样。培训此类模型需要大量的计算资源(例如,HPC群集),而小型研究小组和独立研究人员则无法使用。解决问题的一种方法是,几个较小的小组将其计算资源汇总在一起并训练一种使所有参与者受益的模型。不幸的是,在这种情况下,任何参与者都可以通过故意或错误地发送错误的更新来危害整个培训。在此类同龄人的情况下进行培训需要具有拜占庭公差的专门分布式培训算法。这些算法通常通过引入冗余通信或通过受信任的服务器传递所有更新来牺牲效率,从而使它们无法应用于大规模深度学习,在该大规模深度学习中,模型可以具有数十亿个参数。在这项工作中,我们提出了一种新的协议,用于强调沟通效率的安全(容忍)分散培训。
translated by 谷歌翻译
在联合学习(FL)的新兴范式中,大量客户端(例如移动设备)用于在各自的数据上训练可能的高维模型。由于移动设备的带宽低,分散的优化方法需要将计算负担从那些客户端转移到计算服务器,同时保留隐私和合理的通信成本。在本文中,我们专注于深度,如多层神经网络的培训,在FL设置下。我们提供了一种基于本地模型的层状和维度更新的新型联合学习方法,减轻了非凸起和手头优化任务的多层性质的新型联合学习方法。我们为Fed-Lamb提供了一种彻底的有限时间收敛性分析,表征其渐变减少的速度有多速度。我们在IID和非IID设置下提供实验结果,不仅可以证实我们的理论,而且与最先进的方法相比,我们的方法的速度更快。
translated by 谷歌翻译
当今部署在边缘网络上的联合学习(FL)系统由大量在数据和/或计算能力中具有高度异质性的工人组成,这些工人要求在时间,努力,数据异质性等方面参加灵活的工作者参与为了满足灵活的工人参与的需求,我们考虑了一种新的FL范式,称为“无政府状态联邦学习”(AFL)(AFL)。与常规FL模型形成鲜明对比的是,AFL中的每个工人都可以自由选择i)何时参加FL,ii)根据当前情况(例如,电池,通信,电池级别,通信渠道,隐私问题)。但是,AFL中这种混乱的工人行为在算法设计中引发了许多新的开放问题。特别是,尚不清楚是否可以开发收敛的AFL训练算法,如果是的,则在什么条件下以及可实现的收敛速度的速度下。为此,我们提出了两种无政府状态的联合平均(AFA)算法,分别命名为AFA-CD和AFA-CS的跨设备和跨核心设置的双向学习率。令人惊讶的是,我们表明,在轻度的无政府状态假设下,这两种AFL算法都达到了最著名的收敛速率,作为常规FL的最新算法。此外,它们保留了新的AFL范式中的工人数量和本地步骤,保留了高度可取的{\ em线性加速效应}。我们通过对现实世界数据集进行广泛的实验来验证提出的算法。
translated by 谷歌翻译
当任何延迟较大时,异步随机梯度下降(SGD)的现有分析显着降低,给人的印象是性能主要取决于延迟。相反,无论梯度中的延迟如何,我们都证明,我们可以更好地保证相同的异步SGD算法,而不是仅取决于用于实现算法的平行设备的数量。我们的保证严格比现有分析要好,我们还认为,异步SGD在我们考虑的设置中优于同步Minibatch SGD。为了进行分析,我们介绍了基于“虚拟迭代”和延迟自适应步骤的新颖递归,这使我们能够为凸面和非凸面目标得出最先进的保证。
translated by 谷歌翻译