To increase the training speed of distributed learning, recent years have witnessed a significant amount of interest in developing both synchronous and asynchronous distributed stochastic variance-reduced optimization methods. However, all existing synchronous and asynchronous distributed training algorithms suffer from various limitations in either convergence speed or implementation complexity. This motivates us to propose an algorithm called \algname (\ul{s}emi-as\ul{yn}chronous pa\ul{th}-int\ul{e}grated \ul{s}tochastic grad\ul{i}ent \ul{s}earch), which leverages the special structure of the variance-reduction framework to overcome the limitations of both synchronous and asynchronous distributed learning algorithms, while retaining their salient features. We consider two implementations of \algname under distributed and shared memory architectures. We show that our \algname algorithms have \(O(\sqrt{n}\epsilon^{-2}(\delta+1)+n)\) and \(O(\sqrt{n}\epsilon^{-2}(\delta+1)d+n)\) computational complexities for achieving an \(\epsilon\)-stationary point in non-convex learning under distributed and shared memory architectures, respectively, where \(n\) denotes the total number of training samples and \(\delta\) represents the maximum delay of the workers. Moreover, we investigate the generalization performance of \algname by establishing algorithmic stability bounds for quadratic strongly convex and non-convex optimization. We further conduct extensive numerical experiments to verify our theoretical findings.
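The abstract does not spell out the estimator, but the path-integrated variance-reduction structure it refers to belongs to the SVRG/SPIDER family. As a rough, minimal sketch of that family (our own toy single-machine version, not the paper's semi-asynchronous method; `grad_fn` and the epoch layout are our assumptions):

```python
import numpy as np

def svrg_epoch(w, grad_fn, data, lr=0.1, rng=None):
    """One SVRG-style epoch: take a full-gradient snapshot, then run
    stochastic steps whose gradients are corrected by the snapshot so
    their variance shrinks as w approaches the snapshot point."""
    rng = rng or np.random.default_rng(0)
    w_snap = w.copy()
    mu = np.mean([grad_fn(w_snap, x) for x in data], axis=0)  # snapshot gradient
    for _ in range(len(data)):
        x = data[rng.integers(len(data))]
        g = grad_fn(w, x) - grad_fn(w_snap, x) + mu  # unbiased, variance-reduced
        w = w - lr * g
    return w
```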
Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized. This framework faces several systems-oriented challenges which include (i) communication bottleneck since a large number of devices upload their local updates to a parameter server, and (ii) scalability as the federated network consists of millions of devices. Due to these systems challenges as well as issues related to statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet it remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation where only a fraction of devices participate in each round of the training; and (3) quantized message-passing where the edge nodes quantize their updates before uploading to the parameter server. These features address the communications and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions and empirically demonstrate the communication-computation tradeoff provided by our method.
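As a rough sketch of how the three features compose in a single round (our own simplification, not the authors' reference implementation; `local_sgd` stands in for the device-side local update routine, and `quantize` is a simple unbiased stochastic quantizer used as a stand-in):

```python
import numpy as np

def quantize(v, levels=8, rng=None):
    """Unbiased stochastic quantizer: randomly round each coordinate to one
    of `levels` uniform levels of ||v||, so that E[quantize(v)] = v."""
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(v)
    if norm == 0:
        return v
    scaled = np.abs(v) / norm * levels
    lower = np.floor(scaled)
    q = lower + (rng.random(v.shape) < (scaled - lower))
    return np.sign(v) * q * norm / levels

def fedpaq_round(w_server, local_sgd, devices, frac=0.1, rng=None):
    """One FedPAQ-style round: sample a fraction of devices (partial
    participation), run several local updates on each (periodic averaging),
    and upload only quantized model differences (quantized message-passing)."""
    rng = rng or np.random.default_rng(0)
    k = max(1, int(frac * len(devices)))
    sampled = rng.choice(len(devices), size=k, replace=False)
    updates = [quantize(local_sgd(w_server.copy(), devices[i]) - w_server, rng=rng)
               for i in sampled]
    return w_server + np.mean(updates, axis=0)
```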
Federated learning (FL) has received increasing interest in recent years due to its benefits in data privacy protection, efficient communication, and parallel data processing. Moreover, with appropriate algorithmic design, one could achieve the desirable linear speedup for convergence in FL. However, most existing works on FL are limited to systems with i.i.d. data and centralized parameter servers, and results on decentralized FL with heterogeneous datasets remain limited. Moreover, whether or not the linear speedup for convergence is achievable under fully decentralized FL with data heterogeneity remains an open question. In this paper, we address these challenges by proposing a new algorithm, called NET-FLEET, for fully decentralized FL systems with data heterogeneity. The key idea of our algorithm is to enhance the local update scheme in FL (originally intended for communication efficiency) by incorporating a recursive gradient correction technique to handle heterogeneous datasets. We show that, under appropriate parameter settings, the proposed NET-FLEET algorithm achieves a linear speedup for convergence. We further conduct extensive numerical experiments to evaluate the performance of the proposed NET-FLEET algorithm and verify our theoretical findings.
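The recursive gradient correction mentioned above is closely related to gradient tracking; here is a minimal sketch of one decentralized tracked step (our simplified reading, not necessarily NET-FLEET's exact recursion; `W` is a doubly stochastic mixing matrix over the network):

```python
def gt_step(X, Y, grads_new, grads_old, W, lr=0.05):
    """One decentralized gradient-tracking step.
    X: (m, d) worker models; Y: (m, d) gradient trackers;
    W: (m, m) doubly stochastic mixing matrix of the network.
    Y recursively accumulates gradient corrections so that every worker
    tracks the global average gradient despite heterogeneous local data."""
    X_next = W @ X - lr * Y
    Y_next = W @ Y + (grads_new - grads_old)  # recursive correction term
    return X_next, Y_next
```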
In recent years, decentralized bilevel optimization problems have attracted increasing attention in the networking and machine learning communities thanks to their versatility in modeling decentralized learning problems over peer-to-peer networks (e.g., multi-agent meta-learning, multi-agent reinforcement learning, personalized training, and Byzantine-resilient learning). However, for decentralized bilevel optimization over peer-to-peer networks with limited computation and communication capabilities, how to achieve low sample and communication complexities are two fundamental challenges that remain under-explored so far. In this paper, we make the first attempt to investigate the class of decentralized bilevel optimization problems with a nonconvex and strongly-convex structure corresponding to the outer and inner subproblems, respectively. Our main contributions in this paper are two-fold: i) we first propose a deterministic algorithm called INTERACT (inner-gradient-descent-outer-tracked-gradient) that requires a sample complexity of $\mathcal{O}(n\epsilon^{-1})$ and a communication complexity of $\mathcal{O}(\epsilon^{-1})$ to solve the bilevel optimization problem, where $n$ and $\epsilon > 0$ are the number of samples at each agent and the desired stationarity gap, respectively. ii) To relax the need for full gradient evaluations in each iteration, we propose a stochastic variance-reduced version of INTERACT (SVR-INTERACT), which improves the sample complexity to $\mathcal{O}(\sqrt{n}\epsilon^{-1})$ while achieving the same communication complexity as the deterministic algorithm. To our knowledge, this work is the first that achieves both low sample and communication complexities for solving decentralized bilevel optimization problems over networks. Our numerical experiments also corroborate our theoretical findings.
We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit.
We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers whose computation and communication frequencies vary over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return them to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay $\tau_{\max}$ and show that an $\epsilon$-stationary point is reached after $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{\max}\epsilon^{-1}\right)$ iterations, where $\sigma$ denotes the variance of stochastic gradients. In this work (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \sqrt{\tau_{\max}\tau_{avg}}\,\epsilon^{-1}\right)$ without any change to the algorithm, where $\tau_{avg}$ is the average delay, which can be significantly smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1}\right)$, and which does not require any extra hyperparameter tuning or extra communications. Our results show for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions motivated by federated learning applications and improve the convergence rate by proving a weaker dependence on the maximum delay compared to prior works. In particular, we show that the heterogeneity term in the convergence rate is only affected by the average delay within each worker.
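A minimal sketch of the delay-adaptive idea in (ii) (our illustration only; `base_lr / (1 + tau)` is one way to shrink the step with staleness and may differ from the paper's exact schedule):

```python
def async_sgd_apply(w, grad, t_now, t_computed, base_lr=0.1):
    """Server-side update for asynchronous SGD with a delay-adaptive step:
    a gradient computed at iteration t_computed and applied at t_now has
    delay tau = t_now - t_computed, and receives a smaller step the staler
    it is, with no extra tuning or communication required."""
    tau = t_now - t_computed
    lr = base_lr / (1.0 + tau)  # illustrative delay-adaptive schedule
    return w - lr * grad
```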
Bilevel optimization, the problem of minimizing a value function which involves the arg-minimum of another function, appears in many areas of machine learning. In a large-scale empirical risk minimization setting where the number of samples is huge, it is crucial to develop stochastic methods, which only use a few samples at a time to make progress. However, computing the gradient of the value function involves solving a linear system, which makes it difficult to derive unbiased stochastic estimates. To overcome this problem we introduce a novel framework in which the solution of the inner problem, the solution of the linear system, and the main variable evolve at the same time. These directions are written as a sum, making it straightforward to derive unbiased estimates. The simplicity of our approach allows us to develop global variance-reduction algorithms, where the dynamics of all variables is subject to variance reduction. We prove that SABA, an adaptation of the celebrated SAGA algorithm in our framework, has an $O(\frac{1}{T})$ convergence rate and achieves linear convergence under the Polyak-Łojasiewicz assumption. This is the first stochastic algorithm for bilevel optimization that verifies either of these properties. Numerical experiments validate the usefulness of our method.
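The framework's key move is to evolve the inner variable, the linear-system variable, and the outer variable together, each along a direction written as a sum over samples (so SAGA-style unbiased estimators apply directly). A minimal full-batch sketch of the three coupled directions (our notation; the stochastic SAGA bookkeeping is omitted, and the `grads` interface below is a hypothetical stand-in for the problem's derivative oracles):

```python
def bilevel_joint_step(x, y, v, grads, lr_in=0.05, lr_out=0.01):
    """One joint step of a single-loop bilevel scheme.
    grads must supply: g_y(x, y)        inner gradient
                       g_yy_v(x, y, v)  inner Hessian-vector product
                       f_y(x, y)        outer gradient in y
                       g_xy_v(x, y, v)  cross-derivative-vector product
                       f_x(x, y)        outer gradient in x
    y tracks argmin_y g(x, y); v tracks the solution of the linear system
    g_yy(x, y) v = f_y(x, y); x follows the resulting hypergradient."""
    y_next = y - lr_in * grads.g_y(x, y)
    v_next = v - lr_in * (grads.g_yy_v(x, y, v) - grads.f_y(x, y))
    x_next = x - lr_out * (grads.f_x(x, y) - grads.g_xy_v(x, y, v))
    return x_next, y_next, v_next
```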
Adam is one of the most influential adaptive stochastic algorithms for training deep neural networks, yet it has been pointed out to be divergent even in the simple convex setting. Many attempts, such as decreasing the adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, \textit{etc.}, have been tried to promote Adam-type algorithms to converge. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam for solving large-scale non-convex stochastic optimization. This observation, coupled with the sufficient condition, gives much deeper interpretations of the divergence of Adam. On the other hand, in practice, mini-batch Adam and distributed Adam are widely used without any theoretical guarantee. We further analyze how the batch size, or the number of nodes in a distributed system, affects the convergence of Adam, which theoretically shows that mini-batch and distributed Adam can be linearly accelerated by using a larger mini-batch size or a larger number of nodes. Finally, we apply generic Adam and mini-batch Adam satisfying the sufficient condition to solve the counterexample and to train several neural networks on various real-world datasets. The experimental results are exactly in accord with our theoretical analysis.
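For reference, a standard generic Adam step (textbook recursion, our code, not the paper's analysis): the sufficient condition discussed above constrains the base learning rate jointly with the accumulated second-order moments `v`, and mini-batch/distributed Adam simply feed an averaged gradient `g` into the same recursion.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One generic Adam step. In mini-batch or distributed Adam, g is the
    average of per-sample or per-node stochastic gradients, which lowers
    its variance and underlies the linear-speedup argument."""
    m = b1 * m + (1 - b1) * g        # first moment
    v = b2 * v + (1 - b2) * g * g    # second moment (the history entering
                                     # the sufficient condition)
    m_hat = m / (1 - b1 ** t)        # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```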
We develop a new approach to tackle communication constraints in a distributed learning problem with a central server. We propose and analyze a new algorithm that performs bidirectional compression and achieves the same convergence rate as algorithms using only uplink (from the local workers to the central server) compression. To obtain this improvement, we design MCM, an algorithm in which the downlink compression only impacts local models, while the global model is preserved. As a result, and contrary to previous works, the gradients on local servers are computed on perturbed models. Consequently, the convergence proof is more challenging and requires a precise control of this perturbation. To ensure it, MCM additionally combines model compression with a memory mechanism. This analysis opens new doors, e.g., incorporating worker-dependent randomized models and partial participation.
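A minimal sketch of the preserved-model idea as we read it (our simplification, not the authors' code: the server keeps the exact model, the downlink carries only a compressed difference against a running memory, so only local models are perturbed; `topk` and the 0.5 memory rate are illustrative stand-ins):

```python
import numpy as np

def mcm_downlink(w_server, memory, compress):
    """Server side of an MCM-style downlink: the exact model w_server is
    preserved at the server; only its difference to the running memory is
    compressed and sent, and workers form a perturbed local model from it."""
    delta = compress(w_server - memory)   # compressed downlink message
    w_local = memory + delta              # worker's perturbed local model
    memory = memory + 0.5 * delta         # memory update (illustrative rate)
    return w_local, memory

def topk(v, k=10):
    """A crude stand-in compressor: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out
```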
The existing analysis of asynchronous stochastic gradient descent (SGD) degrades dramatically when any delay is large, giving the impression that performance depends primarily on the delay. On the contrary, we prove much better guarantees for the same asynchronous SGD algorithm regardless of the delays in the gradients, depending instead just on the number of parallel devices used to implement the algorithm. Our guarantees are strictly better than the existing analyses, and we also argue that asynchronous SGD outperforms synchronous mini-batch SGD in the settings we consider. For our analysis, we introduce a novel recursion based on "virtual iterates" and delay-adaptive stepsizes, which allows us to derive state-of-the-art guarantees for both convex and non-convex objectives.
Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by recent progress on cyclic block coordinate methods. In deterministic settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient Lipschitz constant being defined w.r.t.~the Mahalanobis norm. In stochastic settings, we use recursive variance reduction to decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient methods, with a unified analysis for both finite-sum and infinite-sum cases. We further prove the faster, linear convergence of our methods when a Polyak-{\L}ojasiewicz (P{\L}) condition holds for the objective function. To the best of our knowledge, our work is the first to provide variance-reduced convergence guarantees for a cyclic block coordinate method. Our experimental results demonstrate the efficacy of the proposed variance-reduced cyclic scheme in training deep neural nets.
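A minimal deterministic sketch of the cyclic pattern analyzed here (our toy version; the stochastic variant in the paper replaces the exact block gradient below with a recursively variance-reduced estimate):

```python
def cyclic_bcd(w, block_grad, blocks, lr=0.1, epochs=100):
    """Cyclic block coordinate descent: sweep the blocks in a fixed order,
    updating one block at a time with its partial gradient while all other
    blocks are held at their latest values.
    blocks: list of index arrays, one per coordinate block of w."""
    for _ in range(epochs):
        for b in blocks:
            w[b] = w[b] - lr * block_grad(w, b)  # partial gradient for block b
    return w
```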
We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
Present-day federated learning (FL) systems deployed over edge networks consist of a large number of workers with high degrees of heterogeneity in data and/or computing capabilities, which call for flexible worker participation in terms of timing, effort, data heterogeneity, etc. To satisfy the need for flexible worker participation, we consider a new FL paradigm called "Anarchic Federated Learning" (AFL). In stark contrast to conventional FL models, each worker in AFL has the freedom to choose i) when to participate in FL, and ii) the number of local steps to perform in each round based on its current situation (e.g., battery level, communication channels, privacy concerns). However, such chaotic worker behaviors in AFL raise many new open questions in algorithm design. In particular, it remains unclear whether one could develop convergent AFL training algorithms, and if yes, under what conditions and how fast the achievable convergence speed is. Toward this end, we propose two Anarchic Federated Averaging (AFA) algorithms with two-sided learning rates for the cross-device and cross-silo settings, named AFA-CD and AFA-CS, respectively. Somewhat surprisingly, we show that, under mild anarchy assumptions, both AFL algorithms achieve the best known convergence rates of the state-of-the-art algorithms for conventional FL. Moreover, they retain the highly desirable {\em linear speedup effect} with respect to both the number of workers and the number of local steps in the new AFL paradigm. We validate the proposed algorithms with extensive experiments on real-world datasets.
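A minimal sketch of the two-sided learning-rate idea in the cross-device setting (our simplification of the AFA-CD flavor described above; all function and parameter names are ours):

```python
import numpy as np

def afa_cd_round(w_server, returned_updates, eta_server=1.0):
    """One anarchic round: whichever workers happen to report this round
    contribute their (possibly stale, variable-effort) local updates; the
    server-side rate eta_server is tuned jointly with the workers' local
    rates to absorb the anarchy in timing and effort."""
    if not returned_updates:            # nobody showed up: a legal AFL round
        return w_server
    return w_server + eta_server * np.mean(returned_updates, axis=0)

def afa_local(w_pulled, grad_fn, data, steps, eta_local=0.05):
    """Worker side: pull whatever model is current, run as many local steps
    as the device can afford right now, and return the model delta."""
    w = w_pulled.copy()
    for x in data[:steps]:
        w = w - eta_local * grad_fn(w, x)
    return w - w_pulled
```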
This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is O(m/N + 1/m + λ²)-stable in expectation in the non-convex non-smooth setting, where N is the total sample size of the whole system, m is the worker number, and 1−λ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an O(m/N + 1/m + λ²) in-average generalization bound, which is non-vacuous even when λ is close to 1, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at https://github.com/Raiden-Zhu/Generalization-of-DSGD.
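For concreteness, the vanilla D-SGD step being analyzed, with the topology entering through the mixing matrix W (our minimal sketch; the λ above is the second-largest eigenvalue magnitude of W):

```python
def dsgd_step(X, stoch_grads, W, lr=0.05):
    """One vanilla D-SGD step: every worker averages its model with its
    neighbors (one multiplication by the mixing matrix W), then takes a
    local stochastic gradient step. The spectral gap 1 - lambda of W
    controls how fast the workers' models reach consensus.
    X, stoch_grads: (m, d) arrays; W: (m, m) doubly stochastic matrix."""
    return W @ X - lr * stoch_grads
```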
Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. SIGNSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. The relative ℓ1/ℓ2 geometry of gradients, noise and curvature informs whether SIGNSGD or SGD is theoretically better suited to a particular problem. On the practical side we find that the momentum counterpart of SIGNSGD is able to match the accuracy and convergence speed of ADAM on deep Imagenet models. We extend our theory to the distributed setting, where the parameter server uses majority vote to aggregate gradient signs from each worker enabling 1-bit compression of worker-server communication in both directions. Using a theorem by Gauss (1823) we prove that majority vote can achieve the same reduction in variance as full precision distributed SGD. Thus, there is great promise for sign-based optimisation schemes to achieve fast communication and fast convergence. Code to reproduce experiments is to be found at https://github.com/jxbz/signSGD.
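A minimal sketch of the majority-vote scheme described above (ours, following the abstract: each worker uploads one bit per coordinate and the server broadcasts the voted sign back):

```python
import numpy as np

def signsgd_majority_vote(w, worker_grads, lr=1e-3):
    """signSGD with majority vote: each worker uploads only the sign of its
    minibatch gradient; the server takes a coordinate-wise majority vote and
    broadcasts the resulting sign vector, so both uplink and downlink carry
    1 bit per coordinate. worker_grads: (m, d) array of worker gradients."""
    votes = np.sign(np.sum(np.sign(worker_grads), axis=0))  # ties vote 0
    return w - lr * votes
```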
In this paper, we consider the distributed optimization problem where $N$ agents, each possessing a local cost function, collaboratively minimize the average of the local cost functions over a connected network. To solve the problem, we propose a distributed random reshuffling (D-RR) algorithm that combines the classical distributed gradient descent (DGD) method and Random Reshuffling (RR). We show that D-RR inherits the superiority of RR for both smooth strongly convex and smooth nonconvex objective functions. In particular, for smooth strongly convex objective functions, D-RR achieves an $\mathcal{O}(1/T^2)$ rate of convergence (here, $T$ counts the total number of iterations) in terms of the squared distance between the iterate and the unique minimizer. When the objective function is assumed to be smooth nonconvex with Lipschitz continuous component functions, we show that D-RR drives the squared norm of the gradient to 0 at a rate of $\mathcal{O}(1/T^{2/3})$. These convergence results match those of centralized RR (up to constant factors).
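A minimal sketch of one D-RR epoch (our simplification; each agent reshuffles its own local samples once per epoch, and the network interleaves neighbor averaging with incremental gradient steps):

```python
import numpy as np

def drr_epoch(X, local_data, grad_fn, W, lr=0.05, rng=None):
    """One epoch of distributed random reshuffling: every agent i reshuffles
    its own n local samples, then the network performs n rounds of 'mix with
    neighbors, step on the next reshuffled sample'.
    X: (m, d) agent models; local_data: m lists of n samples;
    W: (m, m) doubly stochastic mixing matrix."""
    rng = rng or np.random.default_rng(0)
    m, n = len(local_data), len(local_data[0])
    perms = [rng.permutation(n) for _ in range(m)]  # fresh shuffle per agent
    for j in range(n):
        G = np.stack([grad_fn(X[i], local_data[i][perms[i][j]])
                      for i in range(m)])
        X = W @ X - lr * G
    return X
```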
Bilevel optimization has been widely applied in many important machine learning applications, such as hyperparameter optimization and meta-learning. Recently, several momentum-based algorithms have been proposed to solve bilevel optimization problems. However, those momentum-based algorithms do not achieve provably better computational complexity than the $\widetilde{\mathcal{O}}(\epsilon^{-2})$ of SGD-based algorithms. In this paper, we propose two new algorithms for bilevel optimization, where the first algorithm adopts momentum-based recursive iterations, and the second algorithm adopts recursive gradient estimations in nested loops to decrease the variance. We show that both algorithms achieve a complexity of $\widetilde{\mathcal{O}}(\epsilon^{-1.5})$, which improves upon the order of all existing algorithms. Our experiments validate our theoretical results and demonstrate the superior empirical performance of our algorithms in hyperparameter applications.
We consider nonconvex-concave minimax problems, $\min_{\mathbf{x}} \max_{\mathbf{y} \in \mathcal{Y}} f(\mathbf{x}, \mathbf{y})$, where $f$ is nonconvex in $\mathbf{x}$ but concave in $\mathbf{y}$, and $\mathcal{Y}$ is a convex and bounded set. One of the most popular algorithms for solving this problem is the celebrated gradient descent ascent (GDA) algorithm, which has been widely used in machine learning, control theory and economics. Despite the extensive convergence results for the convex-concave setting, GDA with equal stepsizes can converge to limit cycles or even diverge in the general setting. In this paper, we present complexity results on two-time-scale GDA for solving nonconvex-concave minimax problems, showing that the algorithm can efficiently find a stationary point of the function $\Phi(\cdot) := \max_{\mathbf{y} \in \mathcal{Y}} f(\cdot, \mathbf{y})$. To the best of our knowledge, this is the first nonasymptotic analysis for two-time-scale GDA in this setting, shedding light on its superior practical performance in training generative adversarial networks (GANs) and in other real applications.
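A minimal sketch of a two-time-scale GDA step with projection onto the bounded set $\mathcal{Y}$ (ours; `proj_Y` is a stand-in projection, and the essential point is that the ascent step is much larger than the descent step):

```python
def two_timescale_gda(x, y, grad_x, grad_y, proj_Y, lr_x=1e-3, lr_y=1e-1):
    """One two-time-scale GDA step for min_x max_{y in Y} f(x, y):
    the ascent player moves with a much larger step than the descent player
    (lr_y >> lr_x), so y approximately tracks a best response and x
    effectively descends Phi(x) = max_{y in Y} f(x, y)."""
    x_next = x - lr_x * grad_x(x, y)            # slow descent in x
    y_next = proj_Y(y + lr_y * grad_y(x, y))    # fast projected ascent in y
    return x_next, y_next
```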
Decentralized bilevel optimization has received increasing attention recently due to its foundational role in many emerging multi-agent learning paradigms (e.g., multi-agent meta-learning and multi-agent reinforcement learning) over peer-to-peer edge networks. However, to work with the limited computation and communication capabilities of edge networks, a major challenge in developing decentralized bilevel optimization techniques is to lower sample and communication complexities. This motivates us to develop a new decentralized bilevel optimization algorithm called DIAMOND (decentralized single-timescale stochastic approximation with momentum and gradient-tracking). The contributions of this paper are as follows: i) our DIAMOND algorithm adopts a single-loop structure rather than following the natural double-loop structure of bilevel optimization, which offers low computation and implementation complexity; ii) compared to existing approaches, the DIAMOND algorithm does not require any full gradient evaluations, which further reduces both sample and computational complexities; iii) through a careful integration of momentum information and gradient tracking techniques, we show that the DIAMOND algorithm enjoys $\mathcal{O}(\epsilon^{-3/2})$ in sample and communication complexities for achieving an $\epsilon$-stationary solution, both of which are independent of the dataset sizes and significantly outperform existing works. Extensive experiments also verify our theoretical findings.
This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables that are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates of the TTSA algorithm under various settings: when the outer problem is strongly convex (resp.~weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp.~$\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to the global optimal policy.
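A minimal sketch of the TTSA recursion (ours; `hypergrad_est` stands in for the paper's stochastic estimate of the outer gradient, and the inner step size beta is kept larger than the outer step size alpha):

```python
def ttsa_step(x, y, inner_grad, hypergrad_est, proj_X, alpha=1e-3, beta=1e-1):
    """One two-timescale stochastic approximation step for bilevel problems:
    the inner iterate y follows a stochastic gradient of the strongly convex
    inner problem with the larger step beta, while the outer iterate x takes
    a projected stochastic step with the smaller step alpha, using y as a
    running proxy for the inner solution y*(x)."""
    y_next = y - beta * inner_grad(x, y)                    # fast timescale
    x_next = proj_X(x - alpha * hypergrad_est(x, y_next))   # slow timescale
    return x_next, y_next
```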