Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the good scalability for distributed applications.
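The scheme described above (top-k sparsification plus a memory of accumulated errors) can be sketched in a few lines. This is a minimal single-worker illustration, not the authors' implementation: the toy quadratic objective, the step size, and the names `top_k` and `sparsified_sgd_with_memory` are all assumptions made for the example.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sparsified_sgd_with_memory(grad_fn, x0, lr, k, steps, rng):
    """SGD that applies only k coordinates per step and remembers the rest."""
    x = x0.astype(float).copy()
    memory = np.zeros_like(x)            # accumulated, not-yet-applied updates
    for _ in range(steps):
        candidate = memory + lr * grad_fn(x, rng)
        update = top_k(candidate, k)     # the only part that would be communicated
        memory = candidate - update      # error compensation: keep what was dropped
        x -= update
    return x

# Hypothetical toy problem: f(x) = 0.5 * ||x - target||^2 with noisy gradients.
target = np.arange(10, dtype=float)

def noisy_grad(x, rng):
    return (x - target) + 0.1 * rng.standard_normal(x.shape)

x_final = sparsified_sgd_with_memory(noisy_grad, np.zeros(10), lr=0.02, k=2,
                                     steps=3000, rng=np.random.default_rng(0))
```

Only `k` of the 10 coordinates change per step, yet nothing is discarded: every dropped entry stays in `memory` until it becomes large enough to be selected.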
translated by 谷歌翻译
Mini-batch stochastic gradient descent (SGD) is state of the art in large scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is local SGD, which runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but has eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size. The number of communication rounds can be reduced by up to a factor of $T^{1/2}$, where $T$ denotes the total number of steps, compared to mini-batch SGD. This also holds for asynchronous implementations. Local SGD can also be used for large scale training of deep learning models. The results shown here aim to serve as a guideline for further exploring the theoretical and practical aspects of local SGD in these applications.
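As a rough illustration of the idea, the sketch below simulates local SGD in a single process: each worker takes independent SGD steps and the iterates are averaged only every `sync_every` steps. The function name, the toy objective, and all parameter values are assumptions for the example, not taken from the paper.

```python
import numpy as np

def local_sgd(grad_fn, x0, lr, num_workers, total_steps, sync_every, rng):
    """Simulate local SGD; returns the averaged iterate and the number of sync rounds."""
    xs = np.tile(x0.astype(float), (num_workers, 1))   # one row per worker
    rounds = 0
    for t in range(1, total_steps + 1):
        for w in range(num_workers):
            xs[w] -= lr * grad_fn(xs[w], rng)           # independent local step
        if t % sync_every == 0:
            xs[:] = xs.mean(axis=0)                     # communication: average the models
            rounds += 1
    return xs.mean(axis=0), rounds

# Hypothetical toy problem: f(x) = 0.5 * ||x - target||^2 with noisy gradients.
target = np.array([1.0, -2.0, 3.0, 0.5, -1.0])

def noisy_grad(x, rng):
    return (x - target) + 0.1 * rng.standard_normal(x.shape)

x_avg, rounds = local_sgd(noisy_grad, np.zeros(5), lr=0.05, num_workers=4,
                          total_steps=400, sync_every=10,
                          rng=np.random.default_rng(0))
```

With `sync_every=10`, the 400 steps need 40 communication rounds, where mini-batch SGD with the same number of gradient evaluations would communicate at every step.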
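We develop a new approach to tackle communication constraints in a distributed learning problem with a central server. We propose and analyze a new algorithm that performs bidirectional compression and achieves the same convergence rate as algorithms using only uplink (from the local workers to the central server) compression. To obtain this improvement, we design MCM, an algorithm in which the downlink compression only impacts the local models, while the global model is preserved. As a result, and contrary to previous works, the gradients on the local servers are computed on perturbed models. Consequently, the convergence proofs are more challenging and require precise control of this perturbation. To ensure it, MCM additionally combines model compression with a memory mechanism. This analysis opens new doors, e.g. incorporating worker-dependent randomized models and partial participation.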
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to their excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always converge. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes with convergence guarantees and good practical performance. QSGD allows the user to smoothly trade off communication bandwidth and convergence time: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant. [...] time to the same target accuracy is 2.7×. Further, even computationally heavy architectures such as Inception and ResNet can benefit from the reduction in communication: on 16 GPUs, QSGD reduces the end-to-end convergence time of ResNet152 by approximately 2×. Networks trained with QSGD can converge to virtually the same accuracy as full-precision variants, and gradient quantization may even slightly improve accuracy in some settings. Related Work. One line of related research studies the communication complexity of convex optimization.
In particular, [40] studied two-processor convex minimization in the same model, provided a lower bound of $\Omega(n(\log n + \log(1/\epsilon)))$ bits on the communication cost of n-dimensional convex problems, and proposed a non-stochastic algorithm for strongly convex problems, whose communication cost is within a log factor of the lower bound. By contrast, our focus is on stochastic gradient methods. Recent work [5] focused on lower bounds on the number of communication rounds necessary for convex learning. Buckwild! [10] was the first to consider the convergence guarantees of low-precision SGD. It gave upper bounds on the error probability of SGD, assuming unbiased stochastic quantization, convexity, and gradient sparsity, and showed significant speedup when solving convex problems on CPUs. QSGD refines these results by focusing on the trade-off between communication and convergence. We view quantization as an independent source of variance for SGD, which allows us to employ standard convergence results [7]. The main differences from Buckw
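To make "unbiased stochastic quantization" concrete, here is a minimal sketch of a QSGD-style quantizer with `s` uniformly spaced levels per coordinate. The function name and the choice of the Euclidean norm as the scaling factor are assumptions for illustration; the bit-level encoding that the paper pairs with quantization is not shown.

```python
import numpy as np

def qsgd_quantize(v, s, rng):
    """Randomly round each |v_i| / ||v|| to a multiple of 1/s so that E[Q(v)] = v."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    scaled = s * np.abs(v) / norm            # position of each entry in [0, s]
    lower = np.floor(scaled)                 # nearest level below
    prob_up = scaled - lower                 # round up with this probability
    levels = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * levels / s

rng = np.random.default_rng(0)
g = np.array([0.3, -1.2, 0.0, 2.0])
q = qsgd_quantize(g, s=4, rng=rng)           # entries are multiples of ||g|| / 4
```

A smaller `s` means fewer levels (fewer bits per entry) but higher variance, which is the bandwidth versus convergence trade-off the abstract refers to.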
Sign-based algorithms (e.g. signSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm (EF-SGD) with arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions. Thus EF-SGD achieves gradient compression for free. Our experiments thoroughly substantiate the theory and show that error-feedback improves both convergence and generalization. Code can be found at https://github.com/epfml/error-feedback-SGD.
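The error-feedback step can be sketched as follows, assuming a scaled sign compressor (one shared magnitude plus one sign per coordinate). The names `scaled_sign` and `ef_sgd`, the toy objective, and the hyperparameters are illustrative assumptions; the repository linked above holds the authors' actual code.

```python
import numpy as np

def scaled_sign(p):
    """Biased compressor: the mean of |p_i| as a single scale, times sign(p)."""
    return (np.abs(p).sum() / p.size) * np.sign(p)

def ef_sgd(grad_fn, x0, lr, steps, rng, compress=scaled_sign):
    """SGD with error feedback around an arbitrary compression operator."""
    x = x0.astype(float).copy()
    error = np.zeros_like(x)
    for _ in range(steps):
        p = lr * grad_fn(x, rng) + error   # add back what compression lost earlier
        delta = compress(p)                # the compressed step that is applied
        x -= delta
        error = p - delta                  # residual carried into the next step
    return x

# Hypothetical toy problem: f(x) = 0.5 * ||x - target||^2 with noisy gradients.
target = np.arange(10, dtype=float)

def noisy_grad(x, rng):
    return (x - target) + 0.05 * rng.standard_normal(x.shape)

x_final = ef_sgd(noisy_grad, np.zeros(10), lr=0.02, steps=3000,
                 rng=np.random.default_rng(0))
```

For a vector `p` with no zero entries, the compression error of `scaled_sign` is exactly `||p||_2^2 - ||p||_1^2 / d`, so the compressor always retains part of the signal and the residual is corrected in later steps.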
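In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, although biased compressors often show superior performance in practice compared to the much more studied and better understood unbiased compressors, very little is known about them. In this work, we study three classes of biased compression operators, two of which are new, and their performance when applied to (stochastic) gradient descent and distributed (stochastic) gradient descent. We show for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings. We prove that the distributed compressed SGD method with an error-feedback mechanism enjoys the ergodic rate $\mathcal{O}\left(\delta L \exp\left[-\frac{\mu K}{\delta L}\right] + \frac{C + \delta D}{K \mu}\right)$, where $\delta \ge 1$ is a compression parameter which grows when more compression is applied, $L$ and $\mu$ are the smoothness and strong convexity constants, $C$ captures the stochastic gradient noise ($C = 0$ if full gradients are computed on each node), and $D$ captures the variance of the gradients at the optimum ($D = 0$ for over-parameterized models). Further, via a theoretical study of several synthetic and empirical distributions of communicated gradients, we shed light on why and by how much biased compressors outperform their unbiased variants. Finally, we propose several new biased compressors with promising theoretical guarantees and practical performance.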
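Communication is one of the key bottlenecks in the distributed training of large-scale machine learning models, and lossy compression of exchanged information, such as stochastic gradients or models, is one of the most effective instruments to alleviate this issue. Among the most studied compression techniques is the class of unbiased compression operators whose variance is bounded by a multiple of the squared norm of the vector we wish to compress. By design, this variance may remain high, and only diminishes if the input vector approaches zero. However, unless the model being trained is over-parameterized, there is no a priori reason for the vectors we wish to compress to approach zero during the iterations of classical methods such as distributed compressed SGD, which has adverse effects on the convergence speed. Due to this issue, several more elaborate and seemingly very different algorithms have been proposed recently with the goal of circumventing it. These methods are based on the idea of compressing the difference between the vector we would normally wish to compress and some auxiliary vector which changes throughout the iterative process. In this work, we take a step back and develop a unified framework for studying such methods, both conceptually and theoretically. Our framework incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors. Furthermore, our general framework can lead to improvements of several existing algorithms, and can produce new algorithms. Finally, we perform several numerical experiments which illustrate and support our theoretical findings.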
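The benefit of compressing a difference rather than the vector itself can be seen in a small numerical sketch. It assumes the unbiased rand-k compressor and a fixed auxiliary vector `shift` that happens to be close to the input. How the auxiliary vector is constructed and updated is the subject of the framework and is not modeled here; the function names are hypothetical.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased rand-k: keep k random entries, rescaled by d / k."""
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (d / k)
    return out

def shifted_compress(v, shift, k, rng):
    """Compress v - shift and add the shift back; still unbiased for v."""
    return shift + rand_k(v - shift, k, rng)

rng = np.random.default_rng(0)
v = 10.0 + 0.1 * rng.standard_normal(10)   # a vector far from zero
shift = np.full(10, 10.0)                  # auxiliary vector near v (assumed given)

plain_err = np.mean([np.sum((rand_k(v, 2, rng) - v) ** 2) for _ in range(2000)])
shifted_err = np.mean([np.sum((shifted_compress(v, shift, 2, rng) - v) ** 2)
                       for _ in range(2000)])
```

The variance of rand-k scales with the squared norm of its input, so `plain_err` scales with `||v||^2` while `shifted_err` scales with `||v - shift||^2`, which is far smaller when the shift tracks `v` well.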
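Modern deep learning models are often trained in parallel over a collection of distributed machines to reduce training time. In such settings, communication of model updates among machines becomes a significant performance bottleneck, and various lossy compression techniques have been proposed to alleviate this problem. In this work, we introduce a new, simple, yet theoretically and practically effective compression technique: natural compression (NC). Our technique is applied individually to all entries of the update vector to be compressed and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa. We show that, compared to no compression, NC increases the second moment of the compressed vector by not more than the tiny factor $\frac{9}{8}$, which means that the effect of NC on the convergence speed of popular training algorithms, such as distributed SGD, is negligible. However, the communication savings enabled by NC are substantial, leading to a $3$-$4\times$ improvement in overall theoretical running time. For applications requiring more aggressive compression, we generalize NC to natural dithering, which we prove is much better than the common random dithering technique. Our compression operators can be used on their own or in combination with existing operators for a more aggressive combined effect, and offer a new state of the art in both theory and practice.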
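The rounding rule can be sketched directly on floating-point values. The version below computes the two neighboring powers of two with `log2` rather than by manipulating the mantissa bits, which is an implementation shortcut assumed here for readability; the function name is also hypothetical.

```python
import numpy as np

def natural_compress(v, rng):
    """Randomly round each nonzero entry to a neighboring power of two, unbiasedly."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    nz = v != 0                                  # zeros stay zero
    mag = np.abs(v[nz])
    low = 2.0 ** np.floor(np.log2(mag))          # largest power of two <= |v_i|
    prob_up = (mag - low) / low                  # chosen so the expectation equals |v_i|
    up = rng.random(mag.shape) < prob_up
    out[nz] = np.sign(v[nz]) * np.where(up, 2.0 * low, low)
    return out

rng = np.random.default_rng(0)
compressed = natural_compress(np.array([0.3, -1.5, 6.0, 0.0]), rng)
```

For an entry $|t| = 2^a(1 + p)$ with $p \in [0, 1)$, the second moment of the output is $2^{2a}(1 + 3p)$, so the ratio to $t^2$ is $(1 + 3p)/(1 + p)^2$, which peaks at $9/8$ for $p = 1/3$.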
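We introduce a framework, Artemis, to tackle the problem of learning in a distributed or federated setting with communication constraints and partial device participation. Several workers (randomly sampled) perform the optimization process using a central server to aggregate their computations. To alleviate the communication cost, Artemis allows compressing the information sent in both directions (from the workers to the server and conversely), combined with a memory mechanism. It improves on existing algorithms that only consider unidirectional compression (to the server), or use very strong assumptions on the compression operator, and often do not take partial device participation into account. We provide fast rates of convergence (linear up to a threshold) under weak assumptions on the stochastic gradients (noise variance bounded only at the optimal point) in the non-i.i.d. setting, highlight the impact of memory for unidirectional and bidirectional compression, and analyze Polyak-Ruppert averaging. We use convergence in distribution to obtain a lower bound on the asymptotic variance that highlights the practical limits of compression. We propose two approaches to tackle the challenging case of partial device participation and provide experimental results to demonstrate the validity of our analysis.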
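Adaptive gradient methods have shown excellent performance for solving many machine learning problems. Although multiple adaptive methods have recently been studied, they mainly focus on either empirical or theoretical aspects, and work only for specific problems by using some specific adaptive learning rates. It is desirable to design a universal framework of adaptive gradient algorithms with theoretical guarantees for solving general problems. To fill this gap, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrate momentum and variance reduction techniques. In particular, our novel framework provides convergence analysis support for adaptive gradient methods in the nonconvex setting. In the theoretical analysis, we prove that our SUPER-ADAM algorithm can achieve the best known complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of nonconvex optimization, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms the existing adaptive algorithms. Code is available at https://github.com/lijunyi95/superadam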
Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods (where each node sorts gradients by magnitude and only communicates a subset of the components, accumulating the rest locally) are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.
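In distributed or federated optimization and learning, communication between the different computing units is often the bottleneck, and gradient compression is widely used to reduce the number of bits sent within each communication round of iterative methods. There are two classes of compression operators and separate algorithms making use of them. In the case of unbiased random compressors with bounded variance (e.g., rand-k), the DIANA algorithm of Mishchenko et al. (2019), which implements a variance reduction technique for handling the variance introduced by compression, is the current state of the art. In the case of biased and contractive compressors (e.g., top-k), the EF21 algorithm of Richtárik et al. (2021), which instead implements an error-feedback mechanism, is the current state of the art. These two classes of compression schemes and algorithms are distinct, with different analyses and proof techniques. In this paper, we unify them into a single framework and propose a new algorithm, recovering DIANA and EF21 as particular cases. Our general approach works with a new, larger class of compressors, which has two parameters, the bias and the variance, and includes unbiased and biased compressors as particular cases. This allows us to inherit the best of the two worlds: like EF21 and unlike DIANA, biased compressors such as top-k, whose good performance in practice is recognized, can be used. And like DIANA and unlike EF21, independent randomness at the compressors mitigates the effects of compression, with the convergence rate improving when the number of parallel workers is large. This is the first time that an algorithm with all these features is proposed. We prove its linear convergence under certain conditions. Our approach takes a step towards a better understanding of two so-far distinct worlds of communication-efficient distributed learning.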
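Gradient compression is a popular technique for improving the communication complexity of stochastic first-order methods in distributed training of machine learning models. However, existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well known in practice, and was recently confirmed in theory, that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (RR) method, perform better than ones that sample the gradients with replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a distributed variant of random reshuffling with gradient compression (Q-RR), and show how to reduce the variance coming from gradient quantization through the use of control iterates. Next, to better fit federated learning applications, we incorporate local computation and propose a variant of Q-RR called Q-NASTYA. Q-NASTYA uses local gradient steps and different local and global step sizes. We then show how to reduce the compression variance in this setting as well. Finally, we prove convergence results for the proposed methods and outline several settings in which they improve upon existing algorithms.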
In large-scale machine learning, recent works have studied the effects of compressing gradients in stochastic optimization in order to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in large-scale, multi-agent reinforcement learning, almost nothing is known about the analogous question: Are common reinforcement learning (RL) algorithms also robust to similar perturbations? In this paper, we investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our main technical contribution is to show that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. We then extend our results significantly to nonlinear stochastic approximation algorithms and multi-agent settings. In particular, we prove that for multi-agent TD learning, one can achieve linear convergence speedups in the number of agents while communicating just $\tilde{O}(1)$ bits per agent at each time step. Our work is the first to provide finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our analysis hinges on studying the drift of a novel Lyapunov function that captures the dynamics of a memory variable introduced by error feedback.
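Training deep neural networks on large datasets can often be accelerated by using multiple compute nodes. This approach, known as distributed training, can rely on specialized message-passing protocols such as Ring All-Reduce. However, running these protocols at scale requires reliable high-speed networking that is only available in dedicated clusters. In contrast, many real-world applications, such as federated learning and cloud-based distributed training, operate on unreliable devices with unstable network bandwidth. As a result, these applications are restricted to using parameter servers or gossip-based averaging protocols. In this work, we lift that restriction by proposing Moshpit All-Reduce, an iterative averaging protocol that converges exponentially to the global average. We demonstrate the efficiency of our protocol for distributed optimization with strong theoretical guarantees. The experiments show a 1.3x speedup for ImageNet training compared to competitive gossip-based strategies and a 1.5x speedup when training from scratch using preemptible compute nodes.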
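To give some intuition for iterative group averaging, the sketch below uses an idealized arrangement that is an assumption of this example rather than a description of the protocol: peers sit on a full multi-dimensional grid, nobody drops out, and in each round every peer averages with the peers that differ from it only along one grid axis. The function name is hypothetical, and failures and dynamic group formation are not modeled.

```python
import numpy as np

def grid_average(values, grid_shape):
    """Average peer vectors along one grid axis per round; returns the final vectors."""
    values = np.asarray(values, dtype=float)
    grid = values.reshape(tuple(grid_shape) + (-1,))    # peers laid out on the grid
    for axis in range(len(grid_shape)):                 # one communication round per axis
        group_mean = grid.mean(axis=axis, keepdims=True)
        grid = np.broadcast_to(group_mean, grid.shape).copy()
    return grid.reshape(values.shape[0], -1)

rng = np.random.default_rng(0)
peer_values = rng.standard_normal((27, 4))              # 27 peers, 4 parameters each
after = grid_average(peer_values, (3, 3, 3))            # 3 rounds in groups of 3
```

In this idealized case every peer holds the exact global mean after three rounds, although each peer only ever communicates within a group of three.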
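We study distributed optimization methods based on the local training (LT) paradigm: achieving communication efficiency by performing richer local gradient-based training on the clients before parameter averaging. Looking back at the progress of the field, we identify 5 generations of LT methods: 1) heuristic, 2) homogeneous, 3) sublinear, 4) linear, and 5) accelerated. The 5th generation was initiated by the ProxSkip method of Mishchenko, Malinovsky, Stich and Richtárik (2022), which showed that LT is a communication acceleration mechanism. Inspired by this recent progress, we contribute to the 5th generation of LT methods by showing that it is possible to enhance them further using variance reduction. While all previous theoretical results for LT methods ignore the cost of local work altogether and are framed purely in terms of the number of communication rounds, we show that our methods can be substantially faster in terms of the total training cost than the state-of-the-art method ProxSkip, in theory and practice, in the regime where local computation is sufficiently expensive. We characterize this threshold theoretically and confirm our theoretical predictions with empirical results.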
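We consider a decentralized optimization problem in which a number of agents minimize the average of their local functions by exchanging information over an underlying communication graph. Specifically, we place ourselves in an asynchronous model where only a random portion of the nodes perform computation at each iteration, while the information exchange can take place between all the nodes and in an asymmetric fashion. For this setting, we propose an algorithm that combines gradient tracking and variance reduction over the entire network. This enables each node to track the average of the gradients of the objective functions. Our theoretical analysis shows that the algorithm converges when the local objective functions are strongly convex, under mild connectivity conditions on the expected mixing matrices. In particular, our result does not require the mixing matrices to be doubly stochastic. In the experiments, we investigate a broadcast mechanism that transmits information from computing nodes to their neighbors, and confirm the linear convergence of our method on both synthetic and real-world datasets.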
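Due to the explosion in the size of training datasets, distributed learning has received growing interest in recent years. One of the major bottlenecks is the communication cost between the central server and the local workers. While error-feedback compression has been proven to reduce communication costs for stochastic gradient descent (SGD), much less is established for communication-efficient adaptive gradient methods, which are widely used for training large-scale machine learning models. In this paper, we propose a new communication-compressed AMSGrad for distributed nonconvex optimization problems, which is provably efficient. Our proposed distributed learning framework features an effective gradient compression strategy and a worker-side model update design. We prove that the proposed communication-efficient distributed adaptive gradient method converges to a first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting. Experiments on various benchmarks back up our theory.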
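In this paper, we propose Nesterov Accelerated Shuffling Gradient (NASG), a new algorithm for convex finite-sum minimization problems. Our method integrates the traditional Nesterov acceleration momentum with different shuffling sampling schemes. We show that our algorithm has an improved rate of $\mathcal{O}(1/T)$ using unified shuffling schemes, where $T$ is the number of epochs. This rate is better than that of any other shuffling gradient method in the convex regime. Our convergence analysis does not require an assumption of a bounded domain or a bounded gradient condition. For randomized shuffling schemes, we improve the convergence bound further. When employing a certain initial condition, we show that our method converges faster in a small neighborhood of the solution. Numerical simulations demonstrate the efficiency of our algorithm.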
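We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers whose computation and communication frequencies vary over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return them to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay $\tau_{\max}$ and show that an $\epsilon$-stationary point is reached after $\mathcal{O}\!\left(\sigma^2 \epsilon^{-2} + \tau_{\max} \epsilon^{-1}\right)$ iterations, where $\sigma$ denotes the variance of the stochastic gradients. In this work, (i) we obtain a rate of $\mathcal{O}\!\left(\sigma^2 \epsilon^{-2} + \sqrt{\tau_{\max} \tau_{avg}}\, \epsilon^{-1}\right)$ without any change to the algorithm, where $\tau_{avg}$ is the average delay, which can be significantly smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of $\mathcal{O}\!\left(\sigma^2 \epsilon^{-2} + \tau_{avg} \epsilon^{-1}\right)$ and which requires neither extra hyperparameter tuning nor extra communication. Our result shows for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions, motivated by federated learning applications, and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is only affected by the average delay within each worker.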