Most distributed machine learning systems today, including TensorFlow and CNTK, are built in a centralized fashion. One bottleneck of centralized algorithms lies in the high communication cost on the central node. Motivated by this, we ask: can decentralized algorithms be faster than their centralized counterparts? Although decentralized PSGD (D-PSGD) algorithms have been studied by the control community, existing analysis and theory do not show any advantage over centralized PSGD (C-PSGD) algorithms, simply assuming an application scenario where only a decentralized network is available. In this paper, we study a D-PSGD algorithm and provide the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent. This is because D-PSGD has total computational complexity comparable to that of C-PSGD but requires much less communication on the busiest node. We further conduct an empirical study to validate our theoretical analysis across multiple frameworks (CNTK and Torch), different network configurations, and computation platforms with up to 112 GPUs. On network configurations with low bandwidth or high latency, D-PSGD can be up to one order of magnitude faster than its well-optimized centralized counterparts.
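The decentralized update described above can be sketched as follows. This is only an illustration under stated assumptions: a ring topology, uniform 1/3 mixing weights, and scalar quadratic losses are choices made here, not taken from the abstract, and the function names are hypothetical. Each worker averages with its two ring neighbours and then applies its own gradient, so no node talks to more than two peers.

```python
import random


def ring_mixing_matrix(n):
    """Doubly stochastic matrix for a ring: each worker averages itself
    and its two neighbours with weight 1/3 (an assumed, simple choice)."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W


def dpsgd(targets, steps=200, lr=0.1, noise=0.0, seed=0):
    """Decentralized parallel SGD sketch on f_i(x) = 0.5 * (x - targets[i])**2.

    Worker i holds its own model x[i]. Every step it mixes the neighbours'
    models and subtracts its local (optionally noisy) gradient.
    """
    rng = random.Random(seed)
    n = len(targets)
    W = ring_mixing_matrix(n)
    x = [0.0] * n
    for _ in range(steps):
        # Local stochastic gradients, evaluated at each worker's own model.
        grads = [x[i] - targets[i] + noise * rng.gauss(0.0, 1.0) for i in range(n)]
        # Neighbour averaging: the only communication in the step.
        mixed = [sum(W[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [mixed[i] - lr * grads[i] for i in range(n)]
    return x


if __name__ == "__main__":
    models = dpsgd([1.0, 2.0, 3.0, 4.0, 6.0])
    print("average model:", sum(models) / len(models))
```

With noise set to zero, the average of the workers' models follows plain gradient descent on the average loss, which is why the average approaches the mean of the targets even though individual workers keep a small disagreement.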
translated by 谷歌翻译
State-of-the-art decentralized SGD algorithms can overcome the bandwidth bottleneck at the parameter server by using communication collectives such as Ring All-Reduce for synchronization. While the parameter updates in distributed SGD may happen asynchronously, there is still a synchronization barrier to make sure that the local training epoch at every learner is complete before the learners can advance to the next epoch. The delay incurred while waiting for the slowest learners (stragglers) remains a problem in the synchronization steps of these state-of-the-art decentralized frameworks. In this paper, we propose the (de)centralized Non-blocking SGD (Non-blocking SGD), which can address the straggler problem in a heterogeneous environment. The main idea of Non-blocking SGD is to split the original batch into mini-batches, then accumulate the gradients and update the model based on the mini-batches that have finished. The non-blocking idea can be implemented on top of decentralized algorithms including Ring All-Reduce, D-PSGD, and MATCHA to solve the straggler problem. Moreover, using gradient accumulation to update the model also guarantees convergence and avoids gradient staleness. A run-time analysis with random straggler delays and the computational efficiency/throughput of devices is also presented to show the advantage of Non-blocking SGD. Experiments on a suite of datasets and deep learning networks validate the theoretical analyses and demonstrate that Non-blocking SGD speeds up training and accelerates convergence. Compared with state-of-the-art decentralized asynchronous algorithms such as D-PSGD and MATCHA, Non-blocking SGD takes up to 2x less time to reach the same training loss in a heterogeneous environment.
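Communication in parallel systems imposes significant overhead, which often turns out to be a bottleneck in parallel machine learning. To relieve some of this overhead, in this paper we present EventGraD, an algorithm with event-triggered communication for stochastic gradient descent in parallel machine learning. The main idea of this algorithm is to relax the requirement, found in standard implementations of stochastic gradient descent in parallel machine learning, of communicating at every iteration, and to communicate only when necessary at certain iterations. We provide a theoretical analysis of the convergence of our proposed algorithm. We also implement the proposed algorithm for data-parallel training of a popular residual neural network on the CIFAR-10 dataset and show that EventGraD can reduce the communication load by up to 60% while retaining the same level of accuracy. In addition, EventGraD can be combined with other approaches such as Top-K sparsification to decrease communication further while maintaining accuracy.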
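To make the idea of event-triggered communication concrete, here is a small sketch in which a worker transmits a value only when it has drifted far enough from the last value it sent. The fixed threshold and the function name are assumptions made for illustration; the abstract does not specify the triggering rule the authors use.

```python
def event_triggered_sends(values, threshold):
    """Return the iteration indices at which a value would be transmitted.

    A value is sent on the first iteration, and afterwards only when it
    differs from the last transmitted value by at least `threshold`
    (an assumed, fixed trigger chosen for this sketch).
    """
    sent_at = []
    last_sent = None
    for k, v in enumerate(values):
        if last_sent is None or abs(v - last_sent) >= threshold:
            sent_at.append(k)
            last_sent = v  # Receivers keep using this value until the next event.
    return sent_at


if __name__ == "__main__":
    trajectory = [0.0, 0.05, 0.2, 0.25, 0.5, 0.52]
    print(event_triggered_sends(trajectory, threshold=0.1))  # prints [0, 2, 4]
```

In this toy trajectory only three of six iterations trigger a message; between events, neighbours would simply reuse the stale value.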
In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients at a single server to obtain the average, and updates each worker's local model using an SGD update with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, such linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained on parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010) and (McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, a large body of experimental work has verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
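Periodic model averaging, as described above, can be sketched in a few lines. The example below is an assumption-laden toy: it uses scalar quadratic losses and exact gradients in place of stochastic ones, and the function name and defaults are hypothetical. Workers run `local_steps` updates independently and exchange models only once per round.

```python
def periodic_model_averaging(targets, rounds=50, local_steps=5, lr=0.1):
    """Model averaging sketch on f_i(x) = 0.5 * (x - targets[i])**2.

    Each round, every worker starts from the shared model, takes
    `local_steps` gradient steps on its own loss, and the resulting
    models are averaged. Communication happens once per round instead
    of once per gradient step.
    """
    x = 0.0
    for _ in range(rounds):
        local_models = []
        for a in targets:
            xi = x
            for _ in range(local_steps):
                xi -= lr * (xi - a)  # Exact gradient used here for simplicity.
            local_models.append(xi)
        x = sum(local_models) / len(local_models)  # The averaging step.
    return x


if __name__ == "__main__":
    print(periodic_model_averaging([1.0, 2.0, 6.0]))
```

With the defaults, 250 gradient steps per worker require only 50 communication rounds, a fivefold reduction relative to averaging after every step.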
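One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the decentralized stochastic gradient descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called \emph{neighborhood heterogeneity}, in the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts in decentralized learning. We then argue that neighborhood heterogeneity provides a natural criterion for learning data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication cost of D-SGD under data heterogeneity.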
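We consider the distributed stochastic optimization problem in which $n$ agents want to minimize a global function given by the sum of the agents' local functions, and we focus on the heterogeneous setting in which the agents' local functions are defined over non-i.i.d. data sets. We study the Local SGD method, in which agents perform a number of local stochastic gradient steps and occasionally communicate with a central node to improve their local optimization tasks. We analyze the effect of local steps on the convergence rate and the communication complexity of Local SGD. In particular, instead of assuming a fixed number of local steps across all communication rounds, we allow the number of local steps during the $i$-th communication round, $H_i$, to vary. Our main contribution is to characterize the convergence rate of Local SGD as a function of $\{H_i\}_{i=1}^R$ for strongly convex, convex, and nonconvex local functions, where $R$ is the total number of communication rounds. Based on this characterization, we provide sufficient conditions on the sequence $\{H_i\}_{i=1}^R$ such that Local SGD can achieve linear speed-up with respect to the number of workers. Furthermore, we propose a new communication strategy with increasing local steps that is superior to existing communication strategies for strongly convex local functions. On the other hand, for convex and nonconvex local functions, we argue that a fixed number of local steps is the best communication strategy for Local SGD, and we recover state-of-the-art convergence rate results. Finally, we justify our theoretical results through extensive numerical experiments.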
This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}(m/N + 1/m + \lambda^2)$-stable in expectation in the non-convex non-smooth setting, where $N$ is the total sample size of the whole system, $m$ is the number of workers, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then yield an in-average generalization bound, which is non-vacuous even when $\lambda$ is close to 1, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at Generalization-of-DSGD.
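We study stochastic decentralized optimization for the problem of training machine learning models with large-scale distributed data. We extend the widely used EXTRA and DIGing methods with variance reduction (VR), and propose two methods: VR-EXTRA and VR-DIGing. The proposed VR-EXTRA requires $O((\kappa_s+n)\log\frac{1}{\epsilon})$ stochastic gradient evaluations and $O((\kappa_b+\kappa_c)\log\frac{1}{\epsilon})$ communication rounds to reach precision $\epsilon$, which are the best complexities among non-accelerated gradient-type methods, where $\kappa_s$ and $\kappa_b$ are the stochastic condition number and the batch condition number for strongly convex and smooth problems, respectively, $\kappa_c$ is the condition number of the communication network, and $n$ is the sample size on each distributed node. The proposed VR-DIGing has a slightly higher communication cost of $O((\kappa_b+\kappa_c^2)\log\frac{1}{\epsilon})$. Our stochastic gradient computation complexities are the same as those of single-machine VR methods such as SAG, SAGA, and SVRG, and our communication complexities are the same as those of EXTRA and DIGing, respectively. To further speed up convergence, we also propose accelerated VR-EXTRA and VR-DIGing, with both the optimal $O((\sqrt{n\kappa_s}+n)\log\frac{1}{\epsilon})$ stochastic gradient computation complexity and $O(\sqrt{\kappa_b\kappa_c}\log\frac{1}{\epsilon})$ communication complexity. Our stochastic gradient computation complexity is also the same as that of single-machine accelerated VR methods such as Katyusha, and our communication complexity is the same as that of accelerated full-batch decentralized methods such as MSDA.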
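Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, yet it has been pointed out to be divergent even in the simple convex setting. Many attempts, such as decreasing the adaptive learning rate, adopting a large batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, \textit{etc.}, have been made to promote the convergence of Adam-type algorithms. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which depends only on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization. This observation, coupled with this sufficient condition, gives a much deeper interpretation of the divergence of Adam. On the other hand, in practice, mini-batch Adam and distributed Adam are widely used without any theoretical guarantee. We further analyze how the batch size or the number of nodes in the distributed system affects the convergence of Adam, and show theoretically that mini-batch and distributed Adam can be linearly accelerated by using a larger mini-batch size or a larger number of nodes. Finally, we apply generic Adam and mini-batch Adam with the sufficient condition to solve the counterexample and to train several neural networks on various real-world datasets. The experimental results are fully in accord with our theoretical analysis.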
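In this paper, we consider distributed optimization problems in which $n$ agents, each possessing a local cost function, collaboratively minimize the average of the local cost functions over a connected network. To solve the problem, we propose a distributed random reshuffling (D-RR) algorithm that combines the classical distributed gradient descent (DGD) method with random reshuffling (RR). We show that D-RR inherits the superiority of RR for both smooth strongly convex and smooth nonconvex objective functions. In particular, for smooth strongly convex objective functions, D-RR achieves an $\mathcal{O}(1/T^2)$ rate of convergence (here, $T$ counts the total number of iterations) in terms of the squared distance between the iterate and the unique minimizer. When the objective function is assumed to be smooth nonconvex with Lipschitz continuous component functions, we show that D-RR drives the squared norm of the gradient to $0$ at a rate of $\mathcal{O}(1/T^{2/3})$. These convergence results match those of centralized RR (up to constant factors).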
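As a rough illustration of how random reshuffling might be combined with neighbour averaging, consider the sketch below. It is not taken from the paper: the order of the local step and the mixing step, the ring topology with 1/3 weights, the constant step size, and the scalar quadratic components are all assumptions made here, and `d_rr` is a hypothetical name.

```python
import random


def d_rr(components, epochs=300, lr=0.01, seed=0):
    """Distributed random reshuffling sketch.

    components[i][l] is the target a of agent i's l-th component loss
    f_{i,l}(x) = 0.5 * (x - a)**2. Every epoch each agent draws its own
    permutation and visits each component exactly once (sampling without
    replacement); after every local step the agents average with their
    two ring neighbours.
    """
    rng = random.Random(seed)
    n = len(components)
    m = len(components[0])
    x = [0.0] * n
    for _ in range(epochs):
        # Fresh, independent permutation per agent and per epoch.
        perms = [rng.sample(range(m), m) for _ in range(n)]
        for step in range(m):
            local = [
                x[i] - lr * (x[i] - components[i][perms[i][step]])
                for i in range(n)
            ]
            # Ring mixing with uniform weights (an assumed choice).
            x = [
                (local[(i - 1) % n] + local[i] + local[(i + 1) % n]) / 3.0
                for i in range(n)
            ]
    return x


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [2, 3, 4, 5], [0, 1, 2, 3], [3, 4, 5, 6]]
    models = d_rr(data)
    print("average model:", sum(models) / len(models))
```

With a constant step size the average model only reaches a small neighbourhood of the global minimizer (3.0 for the data above); the rates quoted in the abstract concern the authors' step-size choices, which this sketch does not reproduce.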
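We develop a general framework that unifies several gradient-based stochastic optimization methods for empirical risk minimization problems in both centralized and distributed scenarios. The framework hinges on the introduction of an augmented graph consisting of nodes that model the samples and edges that model both inter-device communication and intra-device stochastic gradient computation. By properly designing the topology of the augmented graph, we are able to recover as special cases the well-known Local-SGD and DSGD algorithms, and to provide a unified perspective on variance-reduction (VR) and gradient-tracking (GT) methods such as SAGA, Local-SVRG and GT-SAGA. We also provide a unified convergence analysis for smooth and (strongly) convex objectives that relies on a properly structured Lyapunov function, and the obtained rate recovers the best known results for many existing algorithms. The rate results further reveal that VR and GT methods can effectively eliminate data heterogeneity within and across devices, respectively, enabling exact convergence of the algorithm to the optimal solution. Numerical experiments confirm the findings of this paper.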
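In decentralized learning, a network of nodes cooperates to minimize an overall objective function that is usually the finite sum of their local objectives and that incorporates a non-smooth regularization term for better generalization. The decentralized stochastic proximal gradient (DSPG) method is commonly used to train this type of learning model, but its convergence rate is slowed by the variance of the stochastic gradients. In this paper, we propose a novel algorithm, namely DPSVRG, to accelerate decentralized training by leveraging the variance reduction technique. The basic idea is to introduce an estimator at each node, which periodically tracks the local full gradient, to correct the stochastic gradient at each iteration. By transforming our decentralized algorithm into a centralized inexact proximal gradient algorithm with variance reduction, and by controlling the bounds of the error sequences, we prove that DPSVRG converges at a rate of $O(1/T)$ for general convex objectives plus a non-smooth term, with $T$ the number of iterations, whereas DSPG converges at a rate of $O(\frac{1}{\sqrt{T}})$. Our experiments on different applications, network topologies and learning models demonstrate that DPSVRG converges much faster than DSPG, and that the loss function of DPSVRG decreases smoothly over the training epochs.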
Federated learning is a distributed framework in which a model is trained over a set of devices while keeping data localized. This framework faces several systems-oriented challenges, which include (i) a communication bottleneck, since a large number of devices upload their local updates to a parameter server, and (ii) scalability, as the federated network consists of millions of devices. Due to these systems challenges, as well as issues related to the statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging, where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation, where only a fraction of devices participate in each round of training; and (3) quantized message-passing, where the edge nodes quantize their updates before uploading them to the parameter server. These features address the communication and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions, and we empirically demonstrate the communication-computation tradeoff provided by our method.
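The quantized message-passing step can be illustrated with a simple unbiased stochastic quantizer. The abstract does not say which quantizer FedPAQ uses, so the QSGD-style scheme below, the number of levels, and the function name are assumptions chosen only to show the idea: each coordinate is rounded randomly to one of a few grid points so that the expected value equals the original update.

```python
import math
import random


def stochastic_quantize(update, levels=4, rng=None):
    """Unbiased stochastic quantization of a list of floats.

    Each coordinate is mapped to sign * norm * k / levels, where k is an
    integer chosen at random between the two nearest grid points so that
    the expectation equals the original coordinate.
    """
    rng = rng or random.Random()
    norm = math.sqrt(sum(v * v for v in update))
    if norm == 0.0:
        return [0.0] * len(update)
    quantized = []
    for v in update:
        ratio = abs(v) / norm * levels      # Position on the grid, in [0, levels].
        lower = math.floor(ratio)
        # Round up with probability equal to the fractional part.
        k = lower + (1 if rng.random() < ratio - lower else 0)
        quantized.append(math.copysign(norm * k / levels, v))
    return quantized


if __name__ == "__main__":
    rng = random.Random(0)
    print(stochastic_quantize([0.3, -0.4, 1.2], levels=4, rng=rng))
```

A device would then only need to upload the norm, the signs, and the small integers `k`, rather than full-precision coordinates; averaging many such quantized updates at the server recovers the unquantized average in expectation.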
Mini-batch stochastic gradient descent (SGD) is the state of the art in large-scale distributed training. The scheme can reach a linear speedup with respect to the number of workers, but this is rarely seen in practice, as the scheme often suffers from large network delays and bandwidth limits. To overcome this communication bottleneck, recent works propose to reduce the communication frequency. An algorithm of this type is local SGD, which runs SGD independently in parallel on different workers and averages the sequences only once in a while. This scheme shows promising results in practice, but has eluded thorough theoretical analysis. We prove concise convergence rates for local SGD on convex problems and show that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size. The number of communication rounds can be reduced by up to a factor of $T^{1/2}$, where $T$ denotes the total number of steps, compared to mini-batch SGD. This also holds for asynchronous implementations. Local SGD can also be used for large-scale training of deep learning models. The results shown here aim to serve as a guideline for further exploring the theoretical and practical aspects of local SGD in these applications.
Decentralized bilevel optimization has received increasing attention recently due to its foundational role in many emerging multi-agent learning paradigms (e.g., multi-agent meta-learning and multi-agent reinforcement learning) over peer-to-peer edge networks. However, to work with the limited computation and communication capabilities of edge networks, a major challenge in developing decentralized bilevel optimization techniques is to lower sample and communication complexities. This motivates us to develop a new decentralized bilevel optimization algorithm called DIAMOND (decentralized single-timescale stochastic approximation with momentum and gradient-tracking). The contributions of this paper are as follows: i) our DIAMOND algorithm adopts a single-loop structure rather than following the natural double-loop structure of bilevel optimization, which offers low computation and implementation complexity; ii) compared to existing approaches, the DIAMOND algorithm does not require any full gradient evaluations, which further reduces both sample and computational complexities; iii) through a careful integration of momentum information and gradient tracking techniques, we show that the DIAMOND algorithm enjoys $\mathcal{O}(\epsilon^{-3/2})$ sample and communication complexities for achieving an $\epsilon$-stationary solution, both of which are independent of the dataset sizes and significantly improve on existing works. Extensive experiments also verify our theoretical findings.
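Emerging applications in multi-agent environments, such as the internet of things, networked sensing, autonomous systems and federated learning, call for decentralized algorithms for finite-sum optimization that are resource-efficient in terms of both computation and communication. In this paper, we consider the prototypical setting in which the agents work collaboratively to minimize the sum of local loss functions by communicating only with their neighbors over a predetermined network topology. We develop a new algorithm, called DEcentralized STochastic REcurSive gradient methodS (DESTRESS), for nonconvex finite-sum optimization, which matches the optimal incremental first-order oracle (IFO) complexity of centralized algorithms for finding first-order stationary points while maintaining communication efficiency. Detailed theoretical and numerical comparisons corroborate that the resource efficiency of DESTRESS improves upon prior decentralized algorithms over a wide range of parameter regimes. DESTRESS leverages several key algorithm design ideas, including randomly activated stochastic recursive gradient updates with mini-batches for local computation, gradient tracking with extra mixing (i.e., multiple gossiping rounds) for per-iteration communication, together with careful choices of hyper-parameters and new analysis frameworks, to provably achieve a desirable computation-communication trade-off.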