Federated learning has attracted increasing attention with the emergence of distributed data. While extensive federated learning algorithms have been proposed for the non-convex distributed problem, federated learning in practice still faces numerous challenges, such as the large number of training iterations required to converge as the sizes of models and datasets keep increasing, and the lack of adaptivity of SGD-based model updates. Meanwhile, the study of adaptive methods in federated learning is scarce, and existing works either lack a complete theoretical convergence guarantee or have high sample complexity. In this paper, we propose an efficient adaptive algorithm (i.e., FAFED) based on the momentum-based variance reduction technique in cross-silo FL. We first explore how to design an adaptive algorithm in the FL setting. By providing a counter-example, we prove that a naive combination of FL and adaptive methods can lead to divergence. More importantly, we provide a convergence analysis for our method and prove that our algorithm is the first adaptive FL algorithm to reach the best-known sample complexity of $O(\epsilon^{-3})$ and $O(\epsilon^{-2})$ communication rounds for finding an $\epsilon$-stationary point without large batches. Experimental results on a language modeling task and an image classification task with heterogeneous data demonstrate the efficiency of our algorithm.
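To make the two ingredients concrete, here is a minimal sketch (not the authors' implementation) in which each client maintains a momentum-based variance-reduced (STORM-style) gradient estimator and the server averages the estimators and takes an Adam-style coordinate-wise adaptive step. The toy quadratic objectives, the constants, and the choice to communicate every iteration (FAFED uses local updates between communication rounds) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): each client keeps a STORM-style
# momentum-based variance-reduced gradient estimator; the server averages the
# estimators and takes an Adam-style coordinate-wise adaptive step.
import numpy as np

rng = np.random.default_rng(0)
num_clients, d, T = 8, 10, 300
eta, alpha, beta2, eps = 0.05, 0.1, 0.99, 1e-8
targets = rng.standard_normal((num_clients, d))       # heterogeneous client data

def stoch_grad(x, c, noise):
    return (x - c) + 0.05 * noise                      # noisy gradient of f_i(x) = 0.5*||x - c_i||^2

x = rng.standard_normal(d)
m = np.stack([stoch_grad(x, c, rng.standard_normal(d)) for c in targets])  # per-client estimators
v = m.mean(axis=0) ** 2                                # warm-start the second-moment accumulator

for t in range(T):
    g_bar = m.mean(axis=0)                             # server aggregates the estimators
    v = beta2 * v + (1 - beta2) * g_bar ** 2
    x_new = x - eta * g_bar / (np.sqrt(v) + eps)       # adaptive (per-coordinate) server step
    for i, c in enumerate(targets):                    # clients refresh their estimators
        noise = rng.standard_normal(d)                 # same sample at both points (STORM correction)
        m[i] = stoch_grad(x_new, c, noise) + (1 - alpha) * (m[i] - stoch_grad(x, c, noise))
    x = x_new

print("full-gradient norm at the last iterate:", np.linalg.norm(x - targets.mean(axis=0)))
```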
In the emerging paradigm of federated learning (FL), a large number of clients, such as mobile devices, are used to train possibly high-dimensional models on their respective data. Because of the low bandwidth of mobile devices, decentralized optimization methods need to shift the computational burden from those clients to the computation server while preserving privacy and a reasonable communication cost. In this paper, we focus on the training of deep, as in multi-layered, neural networks under the FL setting. We present Fed-LAMB, a novel federated learning method based on layer-wise and dimension-wise updates of the local models, alleviating the non-convexity and the multi-layered nature of the optimization task at hand. We provide a thorough finite-time convergence analysis for Fed-LAMB, characterizing how fast its gradient decreases. We provide experimental results under IID and non-IID settings, which not only corroborate our theory but also show that our method converges faster than the state-of-the-art.
Federated learning (FL) has received a surge of interest in recent years thanks to its benefits in data privacy protection, efficient communication, and parallel data processing. With appropriate algorithmic designs, FL can also achieve the desirable linear speedup of convergence. However, most existing works on FL are limited to systems with i.i.d. data and a centralized parameter server, and results on decentralized FL with heterogeneous datasets remain limited. Moreover, whether the linear speedup of convergence is achievable under fully decentralized FL with data heterogeneity is still an open question. In this paper, we address these challenges by proposing a new algorithm, called NET-FLEET, for fully decentralized FL systems with data heterogeneity. The key idea of our algorithm is to enhance the local update scheme in FL (originally intended for communication efficiency) by incorporating a recursive gradient correction technique to handle heterogeneous datasets. We show that, under appropriate parameter settings, the proposed NET-FLEET algorithm achieves a linear speedup of convergence. We further conduct extensive numerical experiments to evaluate the performance of the proposed NET-FLEET algorithm and verify our theoretical findings.
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FEDAVG) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including ADAGRAD, ADAM, and YOGI, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.
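The shared recipe behind these federated adaptive optimizers is to treat the averaged client update as a pseudo-gradient and feed it to a server-side adaptive rule. The sketch below shows a FedAdam-style instance under simplifying assumptions (full participation, plain local SGD, a synthetic quadratic task); it is not the paper's reference implementation, and the Yogi/AdaGrad variants differ only in how the second-moment accumulator is updated.

```python
# Sketch of server-side adaptive federated optimization: the averaged client
# delta is treated as a pseudo-gradient and fed into an Adam-style server rule.
import numpy as np

rng = np.random.default_rng(0)
num_clients, d, rounds, local_steps = 8, 20, 200, 5
eta_l, eta, beta1, beta2, tau = 0.05, 0.1, 0.9, 0.99, 1e-3
targets = rng.standard_normal((num_clients, d))        # heterogeneous objectives f_i(x) = 0.5*||x - c_i||^2

def local_train(x_global, c):
    x = x_global.copy()
    for _ in range(local_steps):
        x -= eta_l * (x - c + 0.1 * rng.standard_normal(d))   # noisy local SGD
    return x - x_global                                        # client delta

x, m, v = np.zeros(d), np.zeros(d), np.zeros(d)
for r in range(rounds):
    delta = np.mean([local_train(x, c) for c in targets], axis=0)  # pseudo-gradient
    m = beta1 * m + (1 - beta1) * delta
    v = beta2 * v + (1 - beta2) * delta ** 2     # Adam-style; the Yogi/AdaGrad variants change this line
    x = x + eta * m / (np.sqrt(v) + tau)         # server-side adaptive step

print("distance to the average optimum:", np.linalg.norm(x - targets.mean(axis=0)))
```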
Federated learning is a machine learning training paradigm that enables clients to jointly train models without sharing their own localized data. However, the implementation of federated learning in practice still faces numerous challenges, such as the large communication overhead due to repeated server-client synchronization and the lack of adaptivity of SGD-based model updates. Although various methods have been proposed to reduce the communication cost through gradient compression or quantization, and federated versions of adaptive optimizers (e.g., FedAdam) have been proposed to add adaptivity, current federated learning frameworks still cannot address all of the above challenges at once. In this paper, we propose a novel communication-efficient adaptive federated learning method (FedCAMS) with theoretical convergence guarantees. We show that in the nonconvex stochastic optimization setting, our proposed FedCAMS achieves the same convergence rate of $O(\frac{1}{\sqrt{TKm}})$ as its non-compressed counterparts. Extensive experiments on various benchmarks verify our theoretical analysis.
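A hedged sketch of how compressed client updates with error feedback can be combined with a max-stabilized (AMSGrad-style) adaptive server step, in the spirit of what the abstract describes. The top-k compressor, the error-feedback bookkeeping, and the toy objective are illustrative assumptions rather than FedCAMS's exact design.

```python
# Sketch: top-k compressed client deltas with error feedback, aggregated by an
# AMSGrad-style (max-stabilized) adaptive server step.
import numpy as np

rng = np.random.default_rng(1)
num_clients, d, rounds, local_steps, k = 10, 50, 300, 3, 5
eta_l, eta, beta1, beta2, eps = 0.05, 0.05, 0.9, 0.99, 1e-3
targets = rng.standard_normal((num_clients, d))           # heterogeneous optima

def topk(u, k):
    out = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]                      # keep the k largest-magnitude entries
    out[idx] = u[idx]
    return out

x = np.zeros(d)
m, v, v_hat = np.zeros(d), np.zeros(d), np.zeros(d)
err = np.zeros((num_clients, d))                          # per-client error-feedback memory

for r in range(rounds):
    msgs = []
    for i, c in enumerate(targets):
        xi = x.copy()
        for _ in range(local_steps):
            xi -= eta_l * (xi - c + 0.1 * rng.standard_normal(d))
        delta = xi - x
        msg = topk(delta + err[i], k)                     # compress the delta plus the residual
        err[i] = delta + err[i] - msg                     # remember what was dropped
        msgs.append(msg)
    delta_bar = np.mean(msgs, axis=0)
    m = beta1 * m + (1 - beta1) * delta_bar
    v = beta2 * v + (1 - beta2) * delta_bar ** 2
    v_hat = np.maximum(v_hat, v)                          # AMSGrad-style max stabilization
    x = x + eta * m / (np.sqrt(v_hat) + eps)

print("distance to the average optimum:", np.linalg.norm(x - targets.mean(axis=0)))
```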
Federated learning (FL) systems deployed over edge networks today consist of a large number of workers with high degrees of heterogeneity in data and/or computing capabilities, which calls for flexible worker participation in terms of timing, effort, data heterogeneity, and so on. To satisfy the need for flexible worker participation, we consider a new FL paradigm called Anarchic Federated Learning (AFL). In stark contrast to conventional FL models, each worker in AFL is free to choose i) when to participate in FL and ii) how much local work to perform in each round, based on its current situation (e.g., battery level, communication channels, privacy concerns). However, such chaotic worker behavior in AFL raises many new open questions in algorithm design. In particular, it remains unclear whether convergent AFL training algorithms can be developed and, if so, under what conditions and how fast the achievable convergence is. Toward this end, we propose two Anarchic Federated Averaging (AFA) algorithms with two-sided learning rates for the cross-device and cross-silo settings, named AFA-CD and AFA-CS, respectively. Somewhat surprisingly, we show that, under mild anarchic assumptions, both AFL algorithms attain the best known convergence rates of state-of-the-art algorithms for conventional FL. Moreover, they retain the highly desirable linear speedup effect in both the number of workers and the number of local steps in the new AFL paradigm. We validate the proposed algorithms with extensive experiments on real-world datasets.
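One way to picture anarchic participation with two-sided learning rates: workers independently decide whether to show up and how many local steps to run, and the server aggregates whatever arrives with its own learning rate. The participation probabilities, the per-step normalization, and the toy objective below are illustrative assumptions and do not reproduce the staleness handling of AFA-CD/AFA-CS.

```python
# Sketch of anarchic participation with two-sided learning rates: each worker
# decides on its own whether to participate and how many local steps to run;
# the server aggregates whatever arrives with a separate server learning rate.
import numpy as np

rng = np.random.default_rng(0)
num_workers, d, rounds = 20, 8, 300
eta_l, eta_s = 0.05, 0.5
targets = rng.standard_normal((num_workers, d))      # heterogeneous local objectives

x = np.zeros(d)
for r in range(rounds):
    updates = []
    for i in range(num_workers):
        if rng.random() < 0.5:                        # worker skips this round (its own decision)
            continue
        k = int(rng.integers(1, 8))                   # worker-chosen number of local steps
        xi = x.copy()
        for _ in range(k):
            xi -= eta_l * (xi - targets[i] + 0.1 * rng.standard_normal(d))
        updates.append((xi - x) / k)                  # normalize by local steps (a design assumption)
    if updates:
        x = x + eta_s * np.mean(updates, axis=0)      # server-side learning rate

print("distance to the average optimum:", np.linalg.norm(x - targets.mean(axis=0)))
```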
Local stochastic gradient descent (SGD) with periodic model averaging (FedAvg) is a foundational algorithm in federated learning. The algorithm runs SGD independently on multiple workers and periodically averages the model across all workers. However, when local SGD runs with many workers, the periodic averaging causes a significant model discrepancy across the workers, making the global loss converge slowly. While recent advanced optimization methods tackle this issue with a focus on non-IID settings, the model discrepancy problem persists because of the underlying periodic model averaging. We propose a partial model averaging framework that mitigates the model discrepancy issue in federated learning. The partial averaging encourages the local models to stay close to each other in parameter space, and it enables the global loss to be minimized more effectively. Given a fixed number of iterations and a large number of workers (128), the partial averaging achieves up to 2.2% higher validation accuracy than periodic full averaging.
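A rough illustration of the partial-averaging idea, under the assumption that only one rotating block of parameters is synchronized at each averaging point, so every parameter is still averaged periodically while workers never drift far apart on any block. The block schedule and the toy objective are illustrative and not necessarily the paper's exact scheme.

```python
# Sketch: workers run local SGD, and at each synchronization point only one
# rotating block of parameters is averaged, so every parameter is still
# synchronized periodically while per-round discrepancy growth is reduced.
import numpy as np

rng = np.random.default_rng(0)
num_workers, d, blocks, sync_every, eta, iters = 8, 12, 4, 2, 0.1, 400
targets = rng.standard_normal((num_workers, d))           # heterogeneous data -> different optima
x = np.tile(rng.standard_normal(d), (num_workers, 1))     # all workers start from the same model
block_ids = np.array_split(np.arange(d), blocks)

for it in range(iters):
    for w in range(num_workers):                          # one local SGD step per worker
        x[w] -= eta * (x[w] - targets[w] + 0.05 * rng.standard_normal(d))
    if it % sync_every == 0:
        blk = block_ids[(it // sync_every) % blocks]      # rotate which block gets averaged
        x[:, blk] = x[:, blk].mean(axis=0)                # partial averaging of that block only

print("cross-worker discrepancy:", np.linalg.norm(x - x.mean(axis=0)))
print("global-model error:", np.linalg.norm(x.mean(axis=0) - targets.mean(axis=0)))
```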
Data-heterogeneous federated learning (FL) systems suffer from two important sources of convergence error: 1) client drift error, caused by performing multiple local optimization steps at the clients, and 2) partial client participation error, caused by the fact that only a small subset of the edge clients participate in each training round. We find that, of the two, only the former has received significant attention in the literature. To address this, we propose FedVARP, a novel variance reduction algorithm applied at the server that eliminates the error due to partial client participation. To do so, the server simply maintains in memory the most recent update of each client and uses it as a surrogate update for the non-participating clients in every round. Further, to alleviate the memory requirement at the server, we propose a novel clustering-based variance reduction algorithm, ClusterFedVARP. Unlike previously proposed methods, neither FedVARP nor ClusterFedVARP requires additional computation at the clients or communication of additional optimization parameters. Through extensive experiments, we show that FedVARP outperforms state-of-the-art methods, and ClusterFedVARP achieves performance comparable to FedVARP with much lower memory requirements.
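A minimal sketch of the server-side mechanism the abstract describes: the server keeps the most recent update of every client and aggregates fresh updates from the sampled clients together with the stored surrogates of everyone else. The step sizes, the exact aggregation rule, and the toy objective are simplifying assumptions rather than the paper's precise estimator.

```python
# Sketch: the server stores the most recent update of every client and, in each
# round, combines fresh updates from the sampled clients with the stored
# surrogate updates of the non-participating clients.
import numpy as np

rng = np.random.default_rng(0)
num_clients, d, rounds, sampled, local_steps = 20, 10, 300, 4, 3
eta_l, eta = 0.1, 1.0
targets = rng.standard_normal((num_clients, d))        # heterogeneous client optima

x = np.zeros(d)
memory = np.zeros((num_clients, d))                    # latest known update per client

for r in range(rounds):
    part = rng.choice(num_clients, size=sampled, replace=False)   # partial participation
    for i in part:
        xi = x.copy()
        for _ in range(local_steps):
            xi -= eta_l * (xi - targets[i] + 0.05 * rng.standard_normal(d))
        memory[i] = xi - x                             # refresh this client's stored update
    x = x + eta * memory.mean(axis=0)                  # non-participants enter via their stale updates

print("distance to the average optimum:", np.linalg.norm(x - targets.mean(axis=0)))
```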
The federated learning (FL) framework enables edge clients to collaboratively learn a shared inference model while preserving the privacy of the clients' training data. Recently, many heuristic efforts have been made to generalize centralized adaptive optimization methods, such as SGDM, Adam, and AdaGrad, to the federated setting to improve convergence and accuracy. However, there is still a lack of theoretical principles on where and how to design and use adaptive optimization methods in the federated setting. This work aims to develop novel adaptive optimization methods for FL from the perspective of the dynamics of ordinary differential equations (ODEs). First, an analytic framework is established to connect federated optimization methods with decompositions of the ODEs of the corresponding centralized optimizers. Second, based on this framework, a momentum-decoupled adaptive optimization method, FedDA, is developed to make full use of the global momentum in every local iteration and accelerate training convergence. Last but not least, full-batch gradients are used at the end of the training process to mimic centralized optimization, ensuring convergence and overcoming the possible inconsistency caused by adaptive optimization methods.
Adaptive gradient methods have shown excellent performance in solving many machine learning problems. Although multiple adaptive methods have been studied recently, they mainly focus on either empirical or theoretical aspects and only solve specific problems by using particular adaptive learning rates. It is desirable to design a universal framework of adaptive gradient algorithms for solving general problems with theoretical guarantees. To fill this gap, we propose a faster and universal adaptive gradient framework (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrate momentum and variance-reduction techniques. In particular, our novel framework provides convergence analysis support for adaptive gradient methods in the nonconvex setting. In the theoretical analysis, we prove that our SUPER-ADAM algorithm can achieve the best-known complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of nonconvex optimization, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms existing adaptive algorithms. Code is available at https://github.com/lijunyi95/superadam
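A stripped-down sketch of the framework's shape: a variance-reduced gradient estimator $m_t$ paired with a pluggable adaptive matrix $H_t$ and a generalized step $x_{t+1} = x_t - \gamma H_t^{-1} m_t$. The two diagonal matrix choices and the toy problem are illustrative assumptions; the authors' implementation is available at the linked repository.

```python
# Sketch of the framework's shape: a variance-reduced gradient estimator m_t
# combined with a pluggable adaptive matrix H_t and the step x <- x - gamma * H^{-1} m.
import numpy as np

rng = np.random.default_rng(0)
d, T, gamma, alpha, beta, lam = 10, 400, 0.05, 0.1, 0.999, 1e-3
A = rng.standard_normal((d, d)) / np.sqrt(d)

def grad(x, noise):
    return A.T @ (A @ x) + 0.05 * noise               # stochastic gradient of 0.5*||Ax||^2

def adaptive_matrix(v, kind):
    # two diagonal members of the "universal adaptive matrix" family
    if kind == "adam":
        return np.sqrt(v) + lam                       # Adam-style coordinate-wise scaling
    return np.full_like(v, np.sqrt(v.mean()) + lam)   # a norm-based (AdaGrad-Norm-like) scaling

def super_adam_sketch(kind):
    x = rng.standard_normal(d)
    m = grad(x, rng.standard_normal(d))
    v = m ** 2
    for t in range(T):
        H = adaptive_matrix(v, kind)
        x_new = x - gamma * m / H                     # generalized adaptive step
        noise = rng.standard_normal(d)                # same sample at both points (variance reduction)
        m = grad(x_new, noise) + (1 - alpha) * (m - grad(x, noise))
        v = beta * v + (1 - beta) * m ** 2
        x = x_new
    return np.linalg.norm(A.T @ (A @ x))

for kind in ("adam", "norm"):
    print(kind, "- final gradient norm:", super_adam_sketch(kind))
```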
To lower the communication complexity of federated min-max learning, a natural approach is to utilize the idea of infrequent communications (through multiple local updates), as in conventional federated learning. However, due to the more complicated inner-outer problem structure of federated min-max learning, theoretical understanding of the communication complexity of federated min-max learning with infrequent communications remains very limited in the literature. This is particularly true for settings with non-i.i.d. datasets and partial client participation. To address this challenge, in this paper, we propose a new algorithmic framework called stochastic sampling averaging gradient descent ascent (SAGDA), which i) assembles stochastic gradient estimators from randomly sampled clients as control variates and ii) leverages two learning rates on both server and client sides. We show that SAGDA achieves a linear speedup in terms of both the number of clients and local update steps, which yields an $\mathcal{O}(\epsilon^{-2})$ communication complexity that is orders of magnitude lower than the state of the art. Interestingly, by noting that the standard federated stochastic gradient descent ascent (FSGDA) is in fact a control-variate-free special version of SAGDA, we immediately arrive at an $\mathcal{O}(\epsilon^{-2})$ communication complexity result for FSGDA. Therefore, through the lens of SAGDA, we also advance the current understanding of the communication complexity of the standard FSGDA method for federated min-max learning.
In this work, we propose FedSSO, a server-side second-order optimization method for federated learning (FL). In contrast to previous work in this direction, we adopt a server-side approximation in the quasi-Newton method without requiring any training data from the clients. In this way, we not only shift the computational burden from the clients to the server, but also eliminate the additional communication of second-order updates between the clients and the server. We provide a theoretical guarantee of convergence for our new method, and empirically demonstrate its fast convergence and communication savings in both convex and non-convex settings.
Standard federated optimization methods successfully apply to stochastic problems with a single-level structure. However, many contemporary ML problems -- including adversarial robustness, hyperparameter tuning, and actor-critic -- fall under nested bilevel programming, which subsumes minimax and compositional optimization. In this work, we propose FedNest: a federated alternating stochastic gradient method to address general nested problems. We establish provable convergence rates for FedNest in the presence of heterogeneous data and introduce variants for bilevel, minimax, and compositional optimization. FedNest introduces multiple innovations, including federated hypergradient computation and variance reduction, to address inner-level heterogeneity. We complement our theory with experiments on hyperparameter & hyper-representation learning and minimax optimization that demonstrate the benefits of our method in practice. Code is available at https://github.com/ucr-optml/fednest.
Data heterogeneity across clients is a key challenge in federated learning. Prior works address this by either aligning client and server models or using control variates to correct client model drift. Although these methods achieve fast convergence on convex or simple non-convex problems, their performance on over-parameterized models such as deep neural networks is lacking. In this paper, we first revisit the widely used FedAvg algorithm on a deep neural network to understand how data heterogeneity influences the gradient updates across the neural network layers. We observe that while the feature extraction layers are learned efficiently by FedAvg, the substantial diversity of the final classification layers across clients impedes the performance. Motivated by this, we propose to correct model drift by variance reduction only on the final layers. We demonstrate that this significantly outperforms existing benchmarks at a similar or lower communication cost. We furthermore provide a proof of the convergence rate of our algorithm.
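A hedged sketch of the final-layer correction idea: SCAFFOLD-style control variates are applied only to the classification-head parameters, while the feature-extractor part follows plain FedAvg. The head/body split on a toy quadratic model and the specific control-variate update rule are assumptions for illustration, not necessarily the paper's exact algorithm.

```python
# Sketch: SCAFFOLD-style control variates applied only to the classifier
# ("head") parameters, while the feature-extractor ("body") part follows plain
# FedAvg. The toy model is a split quadratic with a shared body optimum and
# highly heterogeneous head optima.
import numpy as np

rng = np.random.default_rng(0)
num_clients, d_body, d_head, rounds, local_steps, eta = 10, 8, 4, 200, 5, 0.05
opt_body = rng.standard_normal(d_body)                     # shared "feature" optimum
opt_heads = rng.standard_normal((num_clients, d_head))     # diverse per-client classifier optima

x = np.zeros(d_body + d_head)
c_global = np.zeros(d_head)                                # server control variate (head only)
c_local = np.zeros((num_clients, d_head))                  # per-client control variates (head only)

for r in range(rounds):
    deltas, c_deltas = [], []
    for i in range(num_clients):
        xi = x.copy()
        for _ in range(local_steps):
            g_body = xi[:d_body] - opt_body + 0.05 * rng.standard_normal(d_body)
            g_head = xi[d_body:] - opt_heads[i] + 0.05 * rng.standard_normal(d_head)
            xi[:d_body] -= eta * g_body                             # plain FedAvg on the body
            xi[d_body:] -= eta * (g_head - c_local[i] + c_global)   # drift-corrected head
        new_c = c_local[i] - c_global + (x[d_body:] - xi[d_body:]) / (local_steps * eta)
        c_deltas.append(new_c - c_local[i])
        c_local[i] = new_c
        deltas.append(xi - x)
    x = x + np.mean(deltas, axis=0)                         # server step (full participation)
    c_global = c_global + np.mean(c_deltas, axis=0)

print("body error:", np.linalg.norm(x[:d_body] - opt_body))
print("head error vs. mean head optimum:", np.linalg.norm(x[d_body:] - opt_heads.mean(axis=0)))
```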
Federated learning (FL) is a distributed machine learning framework that can alleviate data silos, in which decentralized clients collaboratively learn a global model without sharing their private data. However, the clients' non-independent and identically distributed (non-IID) data negatively affect the trained model, and clients with different numbers of local updates may cause large gaps among the local gradients in each communication round. In this paper, we propose a federated vector averaging (FedVeca) method to address the above non-IID data problem. Specifically, we set a novel objective for the global model that is related to the local gradients. The local gradient is defined as a bi-directional vector with a step size and a direction, where the step size is the number of local updates and the direction is divided into positive and negative according to our definition. In FedVeca, the direction is influenced by the step size, so we average the bi-directional vectors to reduce the effect of different step sizes. We then theoretically analyze the relationship between the step sizes and the global objective, and obtain an upper bound on the step sizes per communication round. Based on this upper bound, we design an algorithm for the server and the clients to adaptively adjust the step sizes so that the objective approaches the optimum. Finally, we conduct experiments on different datasets, models, and scenarios by building a prototype system, and the experimental results demonstrate the effectiveness and efficiency of the FedVeca method.
Federated learning is a distributed framework according to which a model is trained over a set of devices, while keeping data localized. This framework faces several systems-oriented challenges, which include (i) a communication bottleneck, since a large number of devices upload their local updates to a parameter server, and (ii) scalability, as the federated network consists of millions of devices. Due to these systems challenges as well as issues related to statistical heterogeneity of data and privacy concerns, designing a provably efficient federated learning method is of significant importance yet remains challenging. In this paper, we present FedPAQ, a communication-efficient Federated Learning method with Periodic Averaging and Quantization. FedPAQ relies on three key features: (1) periodic averaging, where models are updated locally at devices and only periodically averaged at the server; (2) partial device participation, where only a fraction of devices participate in each round of the training; and (3) quantized message-passing, where the edge nodes quantize their updates before uploading to the parameter server. These features address the communication and scalability challenges in federated learning. We also show that FedPAQ achieves near-optimal theoretical guarantees for strongly convex and non-convex loss functions, and empirically demonstrate the communication-computation tradeoff provided by our method.
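A compact sketch of the three ingredients named in the abstract: periodic averaging via local SGD, partial device participation, and quantized message-passing. The QSGD-style stochastic uniform quantizer and the toy objective are illustrative choices; FedPAQ's analysis allows a general unbiased quantizer.

```python
# Sketch of the three ingredients: local SGD with periodic averaging, partial
# device participation, and quantized message-passing (QSGD-style unbiased
# stochastic uniform quantizer).
import numpy as np

rng = np.random.default_rng(0)
num_devices, d, rounds, local_steps, sampled, eta, levels = 50, 10, 200, 5, 10, 0.05, 4
targets = rng.standard_normal((num_devices, d))            # heterogeneous local optima

def quantize(u, s=levels):
    norm = np.linalg.norm(u)
    if norm == 0.0:
        return u
    scaled = np.abs(u) / norm * s
    lower = np.floor(scaled)
    q = lower + (rng.random(u.shape) < (scaled - lower))   # randomized rounding -> unbiased
    return np.sign(u) * q * norm / s

x = np.zeros(d)
for r in range(rounds):
    part = rng.choice(num_devices, size=sampled, replace=False)   # partial participation
    msgs = []
    for i in part:
        xi = x.copy()
        for _ in range(local_steps):                              # local updates between averagings
            xi -= eta * (xi - targets[i] + 0.1 * rng.standard_normal(d))
        msgs.append(quantize(xi - x))                             # quantized upload
    x = x + np.mean(msgs, axis=0)                                 # periodic averaging at the server

print("distance to the average optimum:", np.linalg.norm(x - targets.mean(axis=0)))
```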
A key assumption in most existing works on FL algorithms' convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., ``heavier-tailed'' with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap in this paper by proposing an algorithmic framework called FAT-Clipping (federated averaging with two-sided learning rates and clipping), which contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). Specifically, for the largest $\alpha \in (1,2]$ such that the fat-tailed noise in FL still has a bounded $\alpha$-moment, we show that both variants achieve $\mathcal{O}((mT)^{\frac{2-\alpha}{\alpha}})$ and $\mathcal{O}((mT)^{\frac{1-\alpha}{3\alpha-2}})$ convergence rates in the strongly-convex and general non-convex settings, respectively, where $m$ and $T$ are the numbers of clients and communication rounds. Moreover, at the expense of more clipping operations compared to FAT-Clipping-PR, FAT-Clipping-PI further enjoys a linear speedup effect with respect to the number of local updates at each client and is lower-bound-matching (i.e., order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.
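The two variant names suggest where the clipping operator acts: once per round on an aggregated update versus on every local stochastic gradient. The sketch below contrasts these two placements on a toy problem with symmetric Pareto noise; the step sizes, clipping threshold, noise model, and the exact point at which each variant clips (client vs. server side) are assumptions of this illustration rather than the paper's specification.

```python
# Sketch contrasting the two clipping placements on a toy problem with
# symmetric Pareto noise (finite mean, infinite variance): clip once per round
# at the server (PR-style) versus clip every local stochastic gradient (PI-style).
import numpy as np

rng = np.random.default_rng(0)
num_clients, d, rounds, local_steps = 10, 5, 200, 5
eta_l, eta, lam = 0.05, 1.0, 1.0
targets = rng.standard_normal((num_clients, d))

def clip(u, lam):
    n = np.linalg.norm(u)
    return u if n <= lam else u * (lam / n)

def heavy_tailed_noise(size):
    # symmetric Pareto noise: bounded alpha-moment only for alpha < 1.5
    return rng.pareto(1.5, size) * rng.choice([-1.0, 1.0], size=size)

def run(per_iteration):
    x = np.zeros(d)
    for r in range(rounds):
        deltas = []
        for c in targets:
            xi = x.copy()
            for _ in range(local_steps):
                g = xi - c + 0.1 * heavy_tailed_noise(d)
                xi -= eta_l * (clip(g, lam) if per_iteration else g)      # per-iteration clipping
            deltas.append(xi - x)
        update = np.mean(deltas, axis=0)
        x = x + eta * (update if per_iteration else clip(update, lam))    # per-round clipping
    return np.linalg.norm(x - targets.mean(axis=0))

print("per-round clipping, final error:", run(per_iteration=False))
print("per-iteration clipping, final error:", run(per_iteration=True))
```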
As a novel distributed learning paradigm, federated learning (FL) faces serious challenges in dealing with massive clients with heterogeneous data distributions and computation and communication resources. Various client-variance-reduction schemes and client sampling strategies have been respectively introduced to improve the robustness of FL. Among others, primal-dual algorithms such as the alternating direction method of multipliers (ADMM) have been found to be resilient to data distribution and to outperform most of the primal-only FL algorithms. However, the reason behind this remains a mystery. In this paper, we first reveal the fact that federated ADMM is essentially a client-variance-reduced algorithm. While this explains the inherent robustness of federated ADMM, the vanilla version of it lacks the ability to adapt to the degree of client heterogeneity. Besides, the global model at the server under client sampling is biased, which slows down the practical convergence. To go beyond ADMM, we propose a novel primal-dual FL algorithm, termed FedVRA, that allows one to adaptively control the variance-reduction level and the bias of the global model. In addition, FedVRA unifies several representative FL algorithms in the sense that they are either special instances of FedVRA or are close to it. Extensions of FedVRA to semi/un-supervised learning are also presented. Experiments based on (semi-)supervised image classification tasks demonstrate the superiority of FedVRA over the existing schemes in learning scenarios with massive heterogeneous clients and client sampling.
Federated learning (FL) is an emerging learning paradigm that preserves privacy by keeping client data local on edge devices. Optimizing FL in practice is challenging due to the diversity and heterogeneity of the learning system. Despite recent research efforts on improving optimization over heterogeneous data, the impact of time-varying heterogeneous data in real-world scenarios, such as clients' data changing or clients joining or leaving intermittently during training, has not been well studied. In this work, we propose Continual Federated Learning (CFL), a flexible framework for capturing the time-varying heterogeneity of FL. CFL covers complex and realistic scenarios -- which are difficult to evaluate under previous FL formulations -- by extracting information from past local datasets and approximating the local objective functions. We theoretically demonstrate that CFL methods achieve a faster convergence rate than FedAvg in time-varying scenarios, with the benefit depending on the approximation quality. In a series of experiments, we show that the numerical findings match the convergence analysis and that CFL methods significantly outperform other SOTA FL baselines.
In recent federated learning studies, using a large batch size improves the convergence rate, but it requires additional computational overhead compared with a small batch size. To overcome this limitation, we propose a unified framework that splits the participants into anchor and miner groups according to time-varying probabilities. Each client in the anchor group computes its gradient using a large batch, which is regarded as its bullseye. Clients in the miner group perform multiple local updates using serial mini-batches, and each local update is also indirectly regulated by the global target derived from the average of the clients' bullseyes. As a result, the miner group follows near-optimal updates toward the global minimizer that are well suited for updating the global model. Measured by $\epsilon$-approximation, FedAmd achieves a convergence rate of $O(1/\epsilon)$ under non-convex objectives by sampling anchors with a constant probability. The theoretical result substantially surpasses the $O(1/\epsilon^{3/2})$ of the state-of-the-art algorithm BVR-L-SGD, while FedAmd reduces the communication overhead by at least $O(1/\epsilon)$. Empirical studies on real-world datasets validate the effectiveness of FedAmd and demonstrate the superiority of our proposed algorithm.