在大规模分布的深度学习中,沟通瓶颈一直是一个关键问题。在这项工作中,我们研究了以随机块的稀疏为梯度压缩机的分布式SGD,该梯度压缩机是兼容的环形压缩机,并且高度计算效率,但导致性能较低。为了解决这一重要问题,我们从新颖方面改善了沟通效率的分布式SGD,即梯度的差异和第二刻之间的权衡。通过这种动机,我们提出了一种新的分离错误反馈(DEF)算法,该算法比非凸问题的错误反馈显示更好的收敛绑定。我们还建议DEFA在训练的早期阶段加速DEF的概括,该训练的概括范围比DEF更好。此外,我们首次与迭代平均(SGD-IA)建立了通信有效的分布式SGD和SGD之间的联系。广泛的深度学习实验表明,在各种环境下所提出的方法的显着经验改进。
translated by 谷歌翻译
现代深度学习模型通常在分布式机器集合中并行培训,以减少训练时间。在这种情况下,机器之间模型更新的通信变成了一个重要的性能瓶颈,并且已经提出了各种有损的压缩技术来减轻此问题。在这项工作中,我们介绍了一种新的,简单但理论上和实践上有效的压缩技术:自然压缩(NC)。我们的技术分别应用于要进行压缩的更新向量的所有条目,并通过随机舍入到两个的(负或正)两种功能,可以通过忽略Mantissa来以“自然”方式计算。我们表明,与没有压缩相比,NC将压缩向量的第二刻增加不超过微小因子$ \ frac {9} {8} $,这意味着NC对流行训练算法的收敛速度的影响,例如分布式SGD,可以忽略不计。但是,NC启用的通信节省是可观的,导致$ 3 $ - $ 4 \ times $ $改善整体理论运行时间。对于需要更具侵略性压缩的应用,我们将NC推广到自然抖动,我们证明这比常见的随机抖动技术要好得多。我们的压缩操作员可以自行使用,也可以与现有操作员结合使用,从而产生更具侵略性的结合效果,并在理论和实践中提供新的最先进。
translated by 谷歌翻译
由于培训数据集的大小爆炸,分布式学习近年来受到了日益增长的兴趣。其中一个主要瓶颈是中央服务器和本地工人之间的沟通成本。虽然已经证明错误反馈压缩以通过随机梯度下降(SGD)降低通信成本,但在培训大规模机器学习方面广泛用于培训的通信有效的适应性梯度方法楷模。在本文中,我们提出了一种新的通信 - 压缩AMSGRAD,用于分布式非透明的优化问题,可提供有效的效率。我们所提出的分布式学习框架具有有效的渐变压缩策略和工人侧模型更新设计。我们证明所提出的通信有效的分布式自适应梯度方法会聚到具有与随机非凸化优化设置中的未压缩的vanilla amsgrad相同的迭代复杂度的一阶静止点。关于各种基准备份我们理论的实验。
translated by 谷歌翻译
Federated Learning是一种机器学习培训范式,它使客户能够共同培训模型而无需共享自己的本地化数据。但是,实践中联合学习的实施仍然面临许多挑战,例如由于重复的服务器 - 客户同步以及基于SGD的模型更新缺乏适应性,大型通信开销。尽管已经提出了各种方法来通过梯度压缩或量化来降低通信成本,并且提出了联合版本的自适应优化器(例如FedAdam)来增加适应性,目前的联合学习框架仍然无法立即解决上述挑战。在本文中,我们提出了一种具有理论融合保证的新型沟通自适应联合学习方法(FedCAMS)。我们表明,在非convex随机优化设置中,我们提出的fedcams的收敛率与$ o(\ frac {1} {\ sqrt {tkm}})$与其非压缩的对应物相同。各种基准的广泛实验验证了我们的理论分析。
translated by 谷歌翻译
In distributed training of deep neural networks, parallel minibatch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradient in parallel, aggregates all gradients in a single server to obtain the average, and update each worker's local model using a SGD update with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010) (McDonald, Hall, andMann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, tremendous experimental works have verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study on why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
translated by 谷歌翻译
Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck recent works propose to reduce the communication frequency. An algorithm of this type is local SGD that runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but eluded thorough theoretical analysis.We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced up to a factor of T 1/2 -where T denotes the number of total steps-compared to mini-batch SGD. This also holds for asynchronous implementations.Local SGD can also be used for large scale training of deep learning models. The results shown here aim serving as a guideline to further explore the theoretical and practical aspects of local SGD in these applications.
translated by 谷歌翻译
This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is O(m/N +1/m+λ 2 )-stable in expectation in the non-convex non-smooth setting, where N is the total sample size of the whole system, m is the worker number, and 1−λ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an2 ) in-average generalization bound, which is nonvacuous even when λ is closed to 1, in contrast to vacuous as suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD has a positive correlation with the spectral gap, and can explain why consensus control in initial training phase can ensure better generalization. Experiments of VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To our best knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/ Generalization-of-DSGD.
translated by 谷歌翻译
Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. SIGNSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative 1 / 2 geometry of gradients, noise and curvature informs whether SIGNSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of SIGNSGD is able to match the accuracy and convergence speed of ADAM on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss (1823) we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD.
translated by 谷歌翻译
分布式优化和学习的最新进展表明,沟通压缩是减少交流的最有效手段之一。尽管在通信压缩下的收敛速率有很多结果,但理论下限仍然缺失。通过通信压缩的算法的分析将收敛归因于两个抽象属性:无偏见的属性或承包属性。它们可以通过单向压缩(仅从工人到服务器的消息被压缩)或双向压缩来应用它们。在本文中,我们考虑了分布式随机算法,以最大程度地减少通信压缩下的平滑和非凸目标函数。我们为算法建立了收敛的下限,无论是在单向或双向中使用无偏压缩机还是使用承包压缩机。为了缩小下限和现有上限之间的差距,我们进一步提出了一种新石器时代的算法,该算法在轻度条件下几乎达到了我们的下限(达到对数因素)。我们的结果还表明,使用承包双向压缩可以产生迭代方法,该方法的收敛速度与使用无偏见的单向压缩的方法一样快。实验结果验证了我们的发现。
translated by 谷歌翻译
亚当是训练深神经网络的最具影响力的自适应随机算法之一,即使在简单的凸面设置中,它也被指出是不同的。许多尝试,例如降低自适应学习率,采用较大的批量大小,结合了时间去相关技术,寻求类似的替代物,\ textit {etc。},以促进Adam-type算法融合。与现有方法相反,我们引入了另一种易于检查的替代条件,这仅取决于基础学习率的参数和历史二阶时刻的组合,以确保通用ADAM的全球融合以解决大型融合。缩放非凸随机优化。这种观察结果以及这种足够的条件,对亚当的差异产生了更深刻的解释。另一方面,在实践中,无需任何理论保证,广泛使用了迷你ADAM和分布式ADAM。我们进一步分析了分布式系统中的批次大小或节点的数量如何影响亚当的收敛性,从理论上讲,这表明迷你批次和分布式亚当可以通过使用较大的迷你批量或较大的大小来线性地加速节点的数量。最后,我们应用了通用的Adam和Mini Batch Adam,具有足够条件来求解反例并在各种真实世界数据集上训练多个神经网络。实验结果完全符合我们的理论分析。
translated by 谷歌翻译
Sign-based algorithms (e.g. SIGNSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator.We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm (EF-SGD) with arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions. Thus EF-SGD achieves gradient compression for free. Our experiments thoroughly substantiate the theory and show that error-feedback improves both convergence and generalization. Code can be found at https://github.com/epfml/error-feedback-SGD.
translated by 谷歌翻译
许多深度学习领域都受益于使用越来越大的神经网络接受公共数据训练的培训,就像预先训练的NLP和计算机视觉模型一样。培训此类模型需要大量的计算资源(例如,HPC群集),而小型研究小组和独立研究人员则无法使用。解决问题的一种方法是,几个较小的小组将其计算资源汇总在一起并训练一种使所有参与者受益的模型。不幸的是,在这种情况下,任何参与者都可以通过故意或错误地发送错误的更新来危害整个培训。在此类同龄人的情况下进行培训需要具有拜占庭公差的专门分布式培训算法。这些算法通常通过引入冗余通信或通过受信任的服务器传递所有更新来牺牲效率,从而使它们无法应用于大规模深度学习,在该大规模深度学习中,模型可以具有数十亿个参数。在这项工作中,我们提出了一种新的协议,用于强调沟通效率的安全(容忍)分散培训。
translated by 谷歌翻译
当今部署在边缘网络上的联合学习(FL)系统由大量在数据和/或计算能力中具有高度异质性的工人组成,这些工人要求在时间,努力,数据异质性等方面参加灵活的工作者参与为了满足灵活的工人参与的需求,我们考虑了一种新的FL范式,称为“无政府状态联邦学习”(AFL)(AFL)。与常规FL模型形成鲜明对比的是,AFL中的每个工人都可以自由选择i)何时参加FL,ii)根据当前情况(例如,电池,通信,电池级别,通信渠道,隐私问题)。但是,AFL中这种混乱的工人行为在算法设计中引发了许多新的开放问题。特别是,尚不清楚是否可以开发收敛的AFL训练算法,如果是的,则在什么条件下以及可实现的收敛速度的速度下。为此,我们提出了两种无政府状态的联合平均(AFA)算法,分别命名为AFA-CD和AFA-CS的跨设备和跨核心设置的双向学习率。令人惊讶的是,我们表明,在轻度的无政府状态假设下,这两种AFL算法都达到了最著名的收敛速率,作为常规FL的最新算法。此外,它们保留了新的AFL范式中的工人数量和本地步骤,保留了高度可取的{\ em线性加速效应}。我们通过对现实世界数据集进行广泛的实验来验证提出的算法。
translated by 谷歌翻译
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compresion heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always converge. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes with convergence guarantees and good practical performance. QSGD allows the user to smoothly trade off communication bandwidth and convergence time: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant. time to the same target accuracy is 2.7×. Further, even computationally-heavy architectures such as Inception and ResNet can benefit from the reduction in communication: on 16GPUs, QSGD reduces the end-to-end convergence time of ResNet152 by approximately 2×. Networks trained with QSGD can converge to virtually the same accuracy as full-precision variants, and that gradient quantization may even slightly improve accuracy in some settings. Related Work. One line of related research studies the communication complexity of convex optimization. In particular, [40] studied two-processor convex minimization in the same model, provided a lower bound of Ω(n(log n + log(1/ ))) bits on the communication cost of n-dimensional convex problems, and proposed a non-stochastic algorithm for strongly convex problems, whose communication cost is within a log factor of the lower bound. By contrast, our focus is on stochastic gradient methods. Recent work [5] focused on round complexity lower bounds on the number of communication rounds necessary for convex learning.Buckwild! [10] was the first to consider the convergence guarantees of low-precision SGD. It gave upper bounds on the error probability of SGD, assuming unbiased stochastic quantization, convexity, and gradient sparsity, and showed significant speedup when solving convex problems on CPUs. QSGD refines these results by focusing on the trade-off between communication and convergence. We view quantization as an independent source of variance for SGD, which allows us to employ standard convergence results [7]. The main differences from Buckw
translated by 谷歌翻译
为了提高分布式学习的训练速度,近年来见证了人们对开发同步和异步分布式随机方差减少优化方法的极大兴趣。但是,所有现有的同步和异步分布式训练算法都遭受了收敛速度或实施复杂性的各种局限性。这激发了我们提出一种称为\ algname(\ ul {s} emi-as \ ul {yn}的算法} ent \ ul {s} earch),它利用方差减少框架的特殊结构来克服同步和异步分布式学习算法的局限性,同时保留其显着特征。我们考虑分布式和共享内存体系结构下的\ algname的两个实现。我们表明我们的\ algname算法具有\(o(\ sqrt {n} \ epsilon^{ - 2}( - 2}(\ delta+1)+n)\)\)和\(o(\ sqrt {n} {n} 2}(\ delta+1)d+n)\)用于实现\(\ epsilon \)的计算复杂性 - 分布式和共享内存体系结构分别在非convex学习中的固定点,其中\(n \)表示培训样本的总数和\(\ delta \)表示工人的最大延迟。此外,我们通过建立二次强烈凸和非convex优化的算法稳定性界限来研究\ algname的概括性能。我们进一步进行广泛的数值实验来验证我们的理论发现
translated by 谷歌翻译
我们研究了在$ n $工人上的分布式培训的异步随机梯度下降算法,随着时间的推移,计算和通信频率变化。在此算法中,工人按照自己的步调并行计算随机梯度,并在没有任何同步的情况下将其返回服务器。该算法的现有收敛速率对于非凸平的光滑目标取决于最大梯度延迟$ \ tau _ {\ max} $,并表明$ \ epsilon $ stationary点在$ \ mathcal {o} \!\左后达到(\ sigma^2 \ epsilon^{ - 2}+ \ tau _ {\ max} \ epsilon^{ - 1} \ right)$ iterations,其中$ \ sigma $表示随机梯度的方差。在这项工作(i)中,我们获得了$ \ Mathcal {o} \!\ left(\ sigma^2 \ epsilon^{ - 2}+ sqrt {\ tau _ {\ max} \ max} \ tau_ {avg} {avg} } \ epsilon^{ - 1} \ right)$,没有任何更改的算法,其中$ \ tau_ {avg} $是平均延迟,可以大大小于$ \ tau _ {\ max} $。我们还提供(ii)一个简单的延迟自适应学习率方案,在该方案下,异步SGD的收敛速率为$ \ Mathcal {o} \!\ left(\ sigma^2 \ epsilon^{ - 2} { - 2}+ \ tau_ {-2 avg} \ epsilon^{ - 1} \ right)$,并且不需要任何额外的高参数调整或额外的通信。我们的结果首次显示异步SGD总是比迷你批次SGD快。此外,(iii)我们考虑了由联邦学习应用激发的异质功能的情况,并通过证明与先前的作品相比对最大延迟的依赖性较弱,并提高收敛率。特别是,我们表明,收敛率的异质性项仅受每个工人内平均延迟的影响。
translated by 谷歌翻译
使用多个计算节点通常可以加速在大型数据集上的深度神经网络。这种方法称为分布式训练,可以通过专门的消息传递协议,例如环形全部减少。但是,以比例运行这些协议需要可靠的高速网络,其仅在专用集群中可用。相比之下,许多现实世界应用程序,例如联合学习和基于云的分布式训练,在具有不稳定的网络带宽的不可靠的设备上运行。因此,这些应用程序仅限于使用参数服务器或基于Gossip的平均协议。在这项工作中,我们通过提出MOSHPIT全部减少的迭代平均协议来提升该限制,该协议指数地收敛于全局平均值。我们展示了我们对具有强烈理论保证的分布式优化方案的效率。该实验显示了与使用抢占从头开始训练的竞争性八卦的策略和1.5倍的加速,显示了1.3倍的Imagenet培训的加速。
translated by 谷歌翻译
Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized. This framework faces several systemsoriented challenges which include (i) communication bottleneck since a large number of devices upload their local updates to a parameter server, and (ii) scalability as the federated network consists of millions of devices. Due to these systems challenges as well as issues related to statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet it remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation where only a fraction of devices participate in each round of the training; and (3) quantized messagepassing where the edge nodes quantize their updates before uploading to the parameter server. These features address the communications and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions and empirically demonstrate the communication-computation tradeoff provided by our method.
translated by 谷歌翻译
具有周期性模型的本地随机梯度下降(SGD)平均(FEDAVG)是联合学习中的基础算法。该算法在多个工人上独立运行SGD,并定期平均所有工人的模型。然而,当本地SGD与许多工人一起运行时,周期性平均导致跨越工人的重大模型差异,使全局损失缓慢收敛。虽然最近的高级优化方法解决了专注于非IID设置的问题,但由于底层定期模型平均而仍存在模型差异问题。我们提出了一个部分模型平均框架,这些框架减轻了联合学习中的模型差异问题。部分平均鼓励本地模型在参数空间上保持彼此接近,并且它可以更有效地最小化全局损失。鉴于固定数量的迭代和大量工人(128),验证精度高达2.2%的验证精度高于周期性的完整平均值。
translated by 谷歌翻译
Most distributed machine learning systems nowadays, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies on high communication cost on the central node. Motivated by this, we ask, can decentralized algorithms be faster than its centralized counterpart?Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming the application scenario where only the decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has comparable total computational complexities to C-PSGD but requires much less communication cost on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.
translated by 谷歌翻译