To lower the communication complexity of federated min-max learning, a natural approach is to utilize the idea of infrequent communications (through multiple local updates) same as in conventional federated learning. However, due to the more complicated inter-outer problem structure in federated min-max learning, theoretical understandings of communication complexity for federated min-max learning with infrequent communications remain very limited in the literature. This is particularly true for settings with non-i.i.d. datasets and partial client participation. To address this challenge, in this paper, we propose a new algorithmic framework called stochastic sampling averaging gradient descent ascent (SAGDA), which i) assembles stochastic gradient estimators from randomly sampled clients as control variates and ii) leverages two learning rates on both server and client sides. We show that SAGDA achieves a linear speedup in terms of both the number of clients and local update steps, which yields an $\mathcal{O}(\epsilon^{-2})$ communication complexity that is orders of magnitude lower than the state of the art. Interestingly, by noting that the standard federated stochastic gradient descent ascent (FSGDA) is in fact a control-variate-free special version of SAGDA, we immediately arrive at an $\mathcal{O}(\epsilon^{-2})$ communication complexity result for FSGDA. Therefore, through the lens of SAGDA, we also advance the current understanding on communication complexity of the standard FSGDA method for federated min-max learning.
translated by 谷歌翻译
A key assumption in most existing works on FL algorithms' convergence analysis is that the noise in stochastic first-order information has a finite variance. Although this assumption covers all light-tailed (i.e., sub-exponential) and some heavy-tailed noise distributions (e.g., log-normal, Weibull, and some Pareto distributions), it fails for many fat-tailed noise distributions (i.e., ``heavier-tailed'' with potentially infinite variance) that have been empirically observed in the FL literature. To date, it remains unclear whether one can design convergent algorithms for FL systems that experience fat-tailed noise. This motivates us to fill this gap in this paper by proposing an algorithmic framework called FAT-Clipping (\ul{f}ederated \ul{a}veraging with \ul{t}wo-sided learning rates and \ul{clipping}), which contains two variants: FAT-Clipping per-round (FAT-Clipping-PR) and FAT-Clipping per-iteration (FAT-Clipping-PI). Specifically, for the largest $\alpha \in (1,2]$ such that the fat-tailed noise in FL still has a bounded $\alpha$-moment, we show that both variants achieve $\mathcal{O}((mT)^{\frac{2-\alpha}{\alpha}})$ and $\mathcal{O}((mT)^{\frac{1-\alpha}{3\alpha-2}})$ convergence rates in the strongly-convex and general non-convex settings, respectively, where $m$ and $T$ are the numbers of clients and communication rounds. Moreover, at the expense of more clipping operations compared to FAT-Clipping-PR, FAT-Clipping-PI further enjoys a linear speedup effect with respect to the number of local updates at each client and being lower-bound-matching (i.e., order-optimal). Collectively, our results advance the understanding of designing efficient algorithms for FL systems that exhibit fat-tailed first-order oracle information.
translated by 谷歌翻译
当今部署在边缘网络上的联合学习(FL)系统由大量在数据和/或计算能力中具有高度异质性的工人组成,这些工人要求在时间,努力,数据异质性等方面参加灵活的工作者参与为了满足灵活的工人参与的需求,我们考虑了一种新的FL范式,称为“无政府状态联邦学习”(AFL)(AFL)。与常规FL模型形成鲜明对比的是,AFL中的每个工人都可以自由选择i)何时参加FL,ii)根据当前情况(例如,电池,通信,电池级别,通信渠道,隐私问题)。但是,AFL中这种混乱的工人行为在算法设计中引发了许多新的开放问题。特别是,尚不清楚是否可以开发收敛的AFL训练算法,如果是的,则在什么条件下以及可实现的收敛速度的速度下。为此,我们提出了两种无政府状态的联合平均(AFA)算法,分别命名为AFA-CD和AFA-CS的跨设备和跨核心设置的双向学习率。令人惊讶的是,我们表明,在轻度的无政府状态假设下,这两种AFL算法都达到了最著名的收敛速率,作为常规FL的最新算法。此外,它们保留了新的AFL范式中的工人数量和本地步骤,保留了高度可取的{\ em线性加速效应}。我们通过对现实世界数据集进行广泛的实验来验证提出的算法。
translated by 谷歌翻译
在许多机器学习应用中,在许多移动或物联网设备上生成大规模和隐私敏感数据,在集中位置收集数据可能是禁止的。因此,在保持数据本地化的同时估计移动或物联网设备上的参数越来越吸引人。这种学习设置被称为交叉设备联合学习。在本文中,我们提出了第一理论上保证的跨装置联合学习设置中的一般Minimax问题的算法。我们的算法仅在每轮训练中只需要一小部分设备,这克服了设备的低可用性引入​​的困难。通过在与服务器通信之前对客户端执行多个本地更新步骤,并利用全局梯度估计来进一步减少通信开销,并利用全局梯度估计来校正由数据异质性引入的本地更新方向上的偏置。通过基于新型潜在功能的开发分析,我们为我们的算法建立了理论融合保障。 AUC最大化,强大的对抗网络培训和GAN培训任务的实验结果展示了我们算法的效率。
translated by 谷歌翻译
在最近的联邦学习研究中,使用大批量提高了收敛率,但是与使用小批量相比,它需要额外的计算开销。为了克服这一限制,我们提出了一个统一的框架,该框架基于时间变化的概率将参与者分为锚和矿工组。锚点组中的每个客户都使用大批量计算梯度,该梯度被视为其靶心。矿工组中的客户使用串行迷你批次执行多个本地更新,并且每个本地更新也受到客户平均值Bullseyes的平均值的全局目标的间接调节。结果,矿工组遵循了对全球最小化器的近乎最佳更新,该更新适合更新全局模型。通过$ \ epsilon $ - Approximation衡量,FedAmd通过以恒定概率对锚点进行采样锚点,在非convex目标下达到了$ o(1/\ epsilon)$的收敛速率。理论上的结果大大超过了最先进的算法BVR-l-SGD $ O(1/\ Epsilon^{3/2})$,而FedAmd至少减少了$ O(1/\ Epsilon)$沟通开销。关于现实世界数据集的实证研究验证了FedAmd的有效性,并证明了我们提出的算法的优势。
translated by 谷歌翻译
由于其在数据隐私保护,有效的沟通和并行数据处理方面的好处,联邦学习(FL)近年来引起了人们的兴趣。同样,采用适当的算法设计,可以实现fl中收敛效应的理想线性加速。但是,FL上的大多数现有作品仅限于I.I.D.的系统。数据和集中参数服务器以及与异质数据集分散的FL上的结果仍然有限。此外,在完全分散的FL下,与数据异质性在完全分散的FL下,可以实现收敛的线性加速仍然是一个悬而未决的问题。在本文中,我们通过提出一种称为Net-Fleet的新算法,以解决具有数据异质性的完全分散的FL系统,以解决这些挑战。我们算法的关键思想是通过合并递归梯度校正技术来处理异质数据集,以增强FL(最初旨在用于通信效率)的本地更新方案。我们表明,在适当的参数设置下,所提出的净型算法实现了收敛的线性加速。我们进一步进行了广泛的数值实验,以评估所提出的净化算法的性能并验证我们的理论发现。
translated by 谷歌翻译
数据异构联合学习(FL)系统遭受了两个重要的收敛误差来源:1)客户漂移错误是由于在客户端执行多个局部优化步骤而引起的,以及2)部分客户参与错误,这是一个事实,仅一小部分子集边缘客户参加每轮培训。我们发现其中,只有前者在文献中受到了极大的关注。为了解决这个问题,我们提出了FedVarp,这是在服务器上应用的一种新颖的差异算法,它消除了由于部分客户参与而导致的错误。为此,服务器只是将每个客户端的最新更新保持在内存中,并将其用作每回合中非参与客户的替代更新。此外,为了减轻服务器上的内存需求,我们提出了一种新颖的基于聚类的方差降低算法clusterfedvarp。与以前提出的方法不同,FedVarp和ClusterFedVarp均不需要在客户端上进行其他计算或其他优化参数的通信。通过广泛的实验,我们表明FedVarp优于最先进的方法,而ClusterFedVarp实现了与FedVarp相当的性能,并且记忆要求较少。
translated by 谷歌翻译
Federated Learning是一种机器学习培训范式,它使客户能够共同培训模型而无需共享自己的本地化数据。但是,实践中联合学习的实施仍然面临许多挑战,例如由于重复的服务器 - 客户同步以及基于SGD的模型更新缺乏适应性,大型通信开销。尽管已经提出了各种方法来通过梯度压缩或量化来降低通信成本,并且提出了联合版本的自适应优化器(例如FedAdam)来增加适应性,目前的联合学习框架仍然无法立即解决上述挑战。在本文中,我们提出了一种具有理论融合保证的新型沟通自适应联合学习方法(FedCAMS)。我们表明,在非convex随机优化设置中,我们提出的fedcams的收敛率与$ o(\ frac {1} {\ sqrt {tkm}})$与其非压缩的对应物相同。各种基准的广泛实验验证了我们的理论分析。
translated by 谷歌翻译
Federated learning (FL) is a decentralized and privacy-preserving machine learning technique in which a group of clients collaborate with a server to learn a global model without sharing clients' data. One challenge associated with FL is statistical diversity among clients, which restricts the global model from delivering good performance on each client's task. To address this, we propose an algorithm for personalized FL (pFedMe) using Moreau envelopes as clients' regularized loss functions, which help decouple personalized model optimization from the global model learning in a bi-level problem stylized for personalized FL. Theoretically, we show that pFedMe's convergence rate is state-of-the-art: achieving quadratic speedup for strongly convex and sublinear speedup of order 2/3 for smooth nonconvex objectives. Experimentally, we verify that pFedMe excels at empirical performance compared with the vanilla FedAvg and Per-FedAvg, a meta-learning based personalized FL algorithm.
translated by 谷歌翻译
Federated Averaging (FEDAVG) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FEDAVG and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence.As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the 'client-drift' in its local updates. We prove that SCAFFOLD requires significantly fewer communication rounds and is not affected by data heterogeneity or client sampling. Further, we show that (for quadratics) SCAFFOLD can take advantage of similarity in the client's data yielding even faster convergence. The latter is the first result to quantify the usefulness of local-steps in distributed optimization.
translated by 谷歌翻译
Data heterogeneity across clients is a key challenge in federated learning. Prior works address this by either aligning client and server models or using control variates to correct client model drift. Although these methods achieve fast convergence in convex or simple non-convex problems, the performance in over-parameterized models such as deep neural networks is lacking. In this paper, we first revisit the widely used FedAvg algorithm in a deep neural network to understand how data heterogeneity influences the gradient updates across the neural network layers. We observe that while the feature extraction layers are learned efficiently by FedAvg, the substantial diversity of the final classification layers across clients impedes the performance. Motivated by this, we propose to correct model drift by variance reduction only on the final layers. We demonstrate that this significantly outperforms existing benchmarks at a similar or lower communication cost. We furthermore provide proof for the convergence rate of our algorithm.
translated by 谷歌翻译
In federated optimization, heterogeneity in the clients' local datasets and computation speeds results in large variations in the number of local updates performed by each client in each communication round. Naive weighted aggregation of such models causes objective inconsistency, that is, the global model converges to a stationary point of a mismatched objective function which can be arbitrarily different from the true objective. This paper provides a general framework to analyze the convergence of federated heterogeneous optimization algorithms. It subsumes previously proposed methods such as FedAvg and FedProx and provides the first principled understanding of the solution bias and the convergence slowdown due to objective inconsistency. Using insights from this analysis, we propose Fed-Nova, a normalized averaging method that eliminates objective inconsistency while preserving fast error convergence.
translated by 谷歌翻译
我们研究了在$ n $工人上的分布式培训的异步随机梯度下降算法,随着时间的推移,计算和通信频率变化。在此算法中,工人按照自己的步调并行计算随机梯度,并在没有任何同步的情况下将其返回服务器。该算法的现有收敛速率对于非凸平的光滑目标取决于最大梯度延迟$ \ tau _ {\ max} $,并表明$ \ epsilon $ stationary点在$ \ mathcal {o} \!\左后达到(\ sigma^2 \ epsilon^{ - 2}+ \ tau _ {\ max} \ epsilon^{ - 1} \ right)$ iterations,其中$ \ sigma $表示随机梯度的方差。在这项工作(i)中,我们获得了$ \ Mathcal {o} \!\ left(\ sigma^2 \ epsilon^{ - 2}+ sqrt {\ tau _ {\ max} \ max} \ tau_ {avg} {avg} } \ epsilon^{ - 1} \ right)$,没有任何更改的算法,其中$ \ tau_ {avg} $是平均延迟,可以大大小于$ \ tau _ {\ max} $。我们还提供(ii)一个简单的延迟自适应学习率方案,在该方案下,异步SGD的收敛速率为$ \ Mathcal {o} \!\ left(\ sigma^2 \ epsilon^{ - 2} { - 2}+ \ tau_ {-2 avg} \ epsilon^{ - 1} \ right)$,并且不需要任何额外的高参数调整或额外的通信。我们的结果首次显示异步SGD总是比迷你批次SGD快。此外,(iii)我们考虑了由联邦学习应用激发的异质功能的情况,并通过证明与先前的作品相比对最大延迟的依赖性较弱,并提高收敛率。特别是,我们表明,收敛率的异质性项仅受每个工人内平均延迟的影响。
translated by 谷歌翻译
标准联合优化方法成功地适用于单层结构的随机问题。然而,许多当代的ML问题 - 包括对抗性鲁棒性,超参数调整和参与者 - 批判性 - 属于嵌套的双层编程,这些编程包含微型型和组成优化。在这项工作中,我们提出了\ fedblo:一种联合交替的随机梯度方法来解决一般的嵌套问题。我们在存在异质数据的情况下为\ fedblo建立了可证明的收敛速率,并引入了二聚体,最小值和组成优化的变化。\ fedblo引入了多种创新,包括联邦高级计算和降低方差,以解决内部级别的异质性。我们通过有关超参数\&超代理学习和最小值优化的实验来补充我们的理论,以证明我们方法在实践中的好处。代码可在https://github.com/ucr-optml/fednest上找到。
translated by 谷歌翻译
Federated Learning(FL)是一个出色的分布式机器学习框架,因为它在数据隐私和沟通效率方面的好处。由于由于资源的限制,在许多情况下的全面客户参与是不可行的,因此已经研究了部分参与算法,该算法主动选择/采样了一部分客户的子集,旨在实现接近全面参与案例的学习绩效。本文研究了一种被动的部分客户参与方案,该场景知之甚少,其中部分参与是外部事件的结果,即客户辍学,而不是FL算法的决定。我们将fl与客户辍学者一起作为特殊情况,即较大的FL问题,客户可以在其中提交替代(可能不准确)本地模型更新。基于我们的收敛分析,我们开发了一种新的算法FL-FDM,该算法会发现客户的朋友(即数据分布相似的客户),并使用朋友的本地更新作为辍学客户的替代品,从而减少替换错误并改善收敛性能。复杂性降低机制也被纳入FL-FDMS,使其在理论上是合理的,而且实际上有用。关于MNIST和CIFAR-10的实验证实了FL-FDM在处理FL中辍学时的出色性能。
translated by 谷歌翻译
Federated learning has attracted increasing attention with the emergence of distributed data. While extensive federated learning algorithms have been proposed for the non-convex distributed problem, the federated learning in practice still faces numerous challenges, such as the large training iterations to converge since the sizes of models and datasets keep increasing, and the lack of adaptivity by SGD-based model updates. Meanwhile, the study of adaptive methods in federated learning is scarce and existing works either lack a complete theoretical convergence guarantee or have slow sample complexity. In this paper, we propose an efficient adaptive algorithm (i.e., FAFED) based on the momentum-based variance reduced technique in cross-silo FL. We first explore how to design the adaptive algorithm in the FL setting. By providing a counter-example, we prove that a simple combination of FL and adaptive methods could lead to divergence. More importantly, we provide a convergence analysis for our method and prove that our algorithm is the first adaptive FL algorithm to reach the best-known samples $O(\epsilon^{-3})$ and $O(\epsilon^{-2})$ communication rounds to find an $\epsilon$-stationary point without large batches. The experimental results on the language modeling task and image classification task with heterogeneous data demonstrate the efficiency of our algorithms.
translated by 谷歌翻译
客户端之间的非独立和相同分布(非IID)数据分布被视为降低联合学习(FL)性能的关键因素。处理非IID数据(如个性化FL和联邦多任务学习(FMTL)的几种方法对研究社区有很大兴趣。在这项工作中,首先,我们使用Laplacian正规化制定FMTL问题,明确地利用客户模型之间的关系进行多任务学习。然后,我们介绍了FMTL问题的新视图,首次表明配制的FMTL问题可用于传统的FL和个性化FL。我们还提出了两种算法FEDU和DFEDU,分别解决了通信集中和分散方案中的配制FMTL问题。从理论上讲,我们证明了两种算法的收敛速率实现了用于非凸起目标的强大凸起和载位加速的线性加速。实验,我们表明我们的算法优于FL设置的传统算法FedVG,在FMTL设置中的Mocha,以及个性化流程中的PFEDME和PER-FEDAVG。
translated by 谷歌翻译
从经验上证明,在跨客户聚集之前应用多个本地更新的实践是克服联合学习(FL)中的通信瓶颈的成功方法。在这项工作中,我们提出了一种通用食谱,即FedShuffle,可以更好地利用FL中的本地更新,尤其是在异质性方面。与许多先前的作品不同,FedShuffle在每个设备的更新数量上没有任何统一性。我们的FedShuffle食谱包括四种简单的功能成分:1)数据的本地改组,2)调整本地学习率,3)更新加权,4)减少动量方差(Cutkosky and Orabona,2019年)。我们对FedShuffle进行了全面的理论分析,并表明从理论和经验上讲,我们的方法都不遭受FL方法中存在的目标功能不匹配的障碍,这些方法假设在异质FL设置中,例如FedAvg(McMahan等人,McMahan等, 2017)。此外,通过将上面的成分结合起来,FedShuffle在Fednova上改善(Wang等,2020),以前提议解决此不匹配。我们还表明,在Hessian相似性假设下,通过降低动量方差的FedShuffle可以改善非本地方法。最后,通过对合成和现实世界数据集的实验,我们说明了FedShuffle中使用的四种成分中的每种如何有助于改善FL中局部更新的使用。
translated by 谷歌翻译
Federated learning is a distributed machine learning paradigm in which a large number of clients coordinate with a central server to learn a model without sharing their own training data. Standard federated optimization methods such as Federated Averaging (FEDAVG) are often difficult to tune and exhibit unfavorable convergence behavior. In non-federated settings, adaptive optimization methods have had notable success in combating such issues. In this work, we propose federated versions of adaptive optimizers, including ADAGRAD, ADAM, and YOGI, and analyze their convergence in the presence of heterogeneous data for general nonconvex settings. Our results highlight the interplay between client heterogeneity and communication efficiency. We also perform extensive experiments on these methods and show that the use of adaptive optimizers can significantly improve the performance of federated learning.
translated by 谷歌翻译
在这项工作中,我们提出了FedSSO,这是一种用于联合学习的服务器端二阶优化方法(FL)。与以前朝这个方向的工作相反,我们在准牛顿方法中采用了服务器端近似,而无需客户的任何培训数据。通过这种方式,我们不仅将计算负担从客户端转移到服务器,而且还消除了客户和服务器之间二阶更新的附加通信。我们为我们的新方法的收敛提供了理论保证,并从经验上证明了我们在凸面和非凸面设置中的快速收敛和沟通节省。
translated by 谷歌翻译