Data heterogeneity across clients is a key challenge in federated learning. Prior works address this by either aligning client and server models or using control variates to correct client model drift. Although these methods achieve fast convergence in convex or simple non-convex problems, the performance in over-parameterized models such as deep neural networks is lacking. In this paper, we first revisit the widely used FedAvg algorithm in a deep neural network to understand how data heterogeneity influences the gradient updates across the neural network layers. We observe that while the feature extraction layers are learned efficiently by FedAvg, the substantial diversity of the final classification layers across clients impedes the performance. Motivated by this, we propose to correct model drift by variance reduction only on the final layers. We demonstrate that this significantly outperforms existing benchmarks at a similar or lower communication cost. We furthermore provide proof for the convergence rate of our algorithm.
Translated by Google Translate
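We study the asynchronous stochastic gradient descent algorithm for distributed training over $n$ workers whose computation and communication frequencies vary over time. In this algorithm, workers compute stochastic gradients in parallel at their own pace and return them to the server without any synchronization. Existing convergence rates of this algorithm for non-convex smooth objectives depend on the maximum gradient delay $\tau_{\max}$ and show that an $\epsilon$-stationary point is reached after $\mathcal{O}\!\left(\sigma^2 \epsilon^{-2} + \tau_{\max} \epsilon^{-1}\right)$ iterations, where $\sigma$ denotes the variance of the stochastic gradients. In this work, (i) we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2 \epsilon^{-2} + \sqrt{\tau_{\max} \tau_{avg}}\, \epsilon^{-1}\right)$ without any change to the algorithm, where $\tau_{avg}$ is the average delay, which can be significantly smaller than $\tau_{\max}$. We also provide (ii) a simple delay-adaptive learning rate scheme, under which asynchronous SGD achieves a convergence rate of $\mathcal{O}\!\left(\sigma^2 \epsilon^{-2} + \tau_{avg} \epsilon^{-1}\right)$ and which requires neither extra hyperparameter tuning nor extra communication. Our result shows for the first time that asynchronous SGD is always faster than mini-batch SGD. In addition, (iii) we consider the case of heterogeneous functions, motivated by federated learning applications, and improve the convergence rate by proving a weaker dependence on the maximum delay than in prior works. In particular, we show that the heterogeneity term in the convergence rate is affected only by the average delay within each worker.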
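To make the three iteration bounds quoted above easier to compare, the following sketch evaluates them numerically. It drops all constants hidden by the $\mathcal{O}(\cdot)$ notation, so the numbers only indicate relative scaling; the function name and the dictionary keys are illustrative assumptions.

```python
import math


def iteration_bounds(sigma2, eps, tau_max, tau_avg):
    """Evaluate the three bounds from the abstract, ignoring hidden constants.

    sigma2  -- the sigma^2 term (stochastic gradient noise)
    eps     -- target accuracy epsilon
    tau_max -- maximum gradient delay
    tau_avg -- average gradient delay (at most tau_max)
    """
    noise_term = sigma2 / eps**2
    return {
        # Prior analyses: delay term scales with tau_max.
        "prior": noise_term + tau_max / eps,
        # Result (i): delay term scales with sqrt(tau_max * tau_avg).
        "tighter": noise_term + math.sqrt(tau_max * tau_avg) / eps,
        # Result (ii), delay-adaptive step sizes: delay term scales with tau_avg.
        "adaptive": noise_term + tau_avg / eps,
    }


if __name__ == "__main__":
    for name, value in iteration_bounds(1.0, 0.01, 1000, 10).items():
        print(f"{name:>8}: {value:.0f}")
```

With a few very slow workers ($\tau_{\max} = 1000$, $\tau_{avg} = 10$ in this made-up setting), the delay term dominates the prior bound but becomes a minor contribution under the delay-adaptive bound.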
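Uncertainty estimation (UE) techniques, such as Gaussian processes (GP), Bayesian neural networks (BNN), and Monte Carlo dropout (MCDropout), aim to improve the interpretability of machine learning models by assigning an estimated uncertainty value to each of their prediction outputs. However, since overly optimistic uncertainty estimates can have fatal consequences in practice, this paper analyzes the above techniques. First, we show that GP methods always yield high uncertainty estimates on out-of-distribution (OOD) data. Second, we show on a 2D toy example that both BNNs and MCDropout fail to give high uncertainty estimates on OOD samples. Finally, we show empirically that this pitfall of BNNs and MCDropout also holds on real-world datasets. Our insights (i) raise awareness that currently popular UE methods in deep learning should be used more cautiously, (ii) encourage the development of UE methods that approximate GP-based methods rather than BNNs and MCDropout, and (iii) provide empirical setups that can be used to verify the OOD performance of any other UE method. The source code is available at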
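Personalization in federated learning can improve the accuracy of a user's model by trading off the model's bias (introduced by using data from other users, who are potentially different) against its variance (due to the limited amount of data on any single user). To develop training algorithms that optimally balance this trade-off, it is necessary to extend our theoretical foundations. In this work, we formalize the personalized collaborative learning problem as stochastic optimization of a user's objective $f_0(x)$ while given access to $N$ related but different objectives of other users $\{f_1(x), \dots, f_N(x)\}$. We provide convergence guarantees for two algorithms in this setting: a popular personalization method known as \emph{weighted gradient averaging}, and a novel \emph{bias correction} method. We also explore the conditions under which we can optimally trade off their bias for a reduction in variance and achieve linear speedup w.r.t.\ the number of users $N$. Further, we empirically study their performance, confirming our theoretical insights.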
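The abstract names weighted gradient averaging without spelling out the update. A minimal sketch of one natural form is given below, assuming the user's stochastic gradient of $f_0$ is mixed with the mean of the other users' gradients through a weight `alpha` in $[0, 1]$. The function name, the parameter `alpha`, and this exact mixing rule are assumptions made for illustration, not taken from the paper.

```python
import numpy as np


def weighted_gradient_average(g_user, g_others, alpha):
    """Mix the user's own gradient with the average of the others' gradients.

    g_user   -- gradient of the user's objective f_0, shape (dim,)
    g_others -- gradients of f_1, ..., f_N, shape (N, dim)
    alpha    -- assumed mixing weight: 0 ignores the other users (no bias,
                high variance); 1 relies on them entirely (low variance,
                but biased if their objectives differ from f_0)
    """
    g_user = np.asarray(g_user, dtype=float)
    g_others = np.asarray(g_others, dtype=float)
    return (1.0 - alpha) * g_user + alpha * g_others.mean(axis=0)


if __name__ == "__main__":
    own = [1.0, 0.0]
    others = [[0.8, 0.2], [1.2, -0.2], [1.0, 0.3]]
    print(weighted_gradient_average(own, others, alpha=0.5))
```

Intermediate values of `alpha` correspond to the bias-variance trade-off the abstract describes.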
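Federated learning is a powerful distributed learning scheme that allows numerous edge devices to collaboratively train a model without sharing their data. However, training is resource-intensive for edge devices, and limited network bandwidth is often the main bottleneck. Prior work often overcomes these constraints by condensing the models or messages into compact formats, e.g., by gradient compression or distillation. In contrast, we propose ProgFed, the first progressive training framework for efficient and effective federated learning. It inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models. We theoretically prove that ProgFed converges at the same asymptotic rate as standard training on full models. Extensive results on a broad range of architectures, including CNNs (VGG, ResNet, ConvNets) and U-nets, and on diverse tasks from simple classification to medical image segmentation show that our highly effective training approach saves up to $20\%$ of the computation and up to $63\%$ of the communication costs for converged models. As our approach is also complementary to prior work on compression, we can achieve a wide range of trade-offs by combining these techniques, showing communication reduced by up to $50\times$ at a loss of only $0.1\%$ in utility. Code is available at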
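State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD). Recently, many variations have been explored: perturbing parameters for better accuracy (such as in Extragradient), limiting SGD updates to a subset of parameters for increased efficiency (such as meProp), or a combination of both (such as Dropout). However, the convergence of these methods is often not studied in theory. We propose a unified theoretical framework to study such SGD variants, encompassing the aforementioned algorithms and, additionally, a broad variety of methods used for communication-efficient training or model compression. Our insights can be used as a guide to improve the efficiency of such methods and to facilitate generalization to new applications. As an example, we tackle the task of jointly training networks, a version of which (limited to sub-networks) is used to create Slimmable Networks. By training a low-rank Transformer jointly with a standard one, we obtain better performance than when it is trained separately.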
Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.
Federated Averaging (FEDAVG) has emerged as the algorithm of choice for federated learning due to its simplicity and low communication cost. However, in spite of recent research efforts, its performance is not fully understood. We obtain tight convergence rates for FEDAVG and prove that it suffers from 'client-drift' when the data is heterogeneous (non-iid), resulting in unstable and slow convergence. As a solution, we propose a new algorithm (SCAFFOLD) which uses control variates (variance reduction) to correct for the 'client-drift' in its local updates. We prove that SCAFFOLD requires significantly fewer communication rounds and is not affected by data heterogeneity or client sampling. Further, we show that (for quadratics) SCAFFOLD can take advantage of similarity in the client's data yielding even faster convergence. The latter is the first result to quantify the usefulness of local-steps in distributed optimization.
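The control-variate correction described above can be illustrated on a toy problem. The sketch below assumes full client participation, exact (noise-free) gradients, a global step size of 1, and quadratic client objectives $f_i(x) = \tfrac{1}{2} a_i \|x - b_i\|^2$. The function name, the objectives, and all hyperparameters are illustrative assumptions rather than the paper's experimental setup.

```python
import numpy as np


def scaffold_quadratics(a, b, rounds=300, local_steps=5, lr=0.02, use_control=True):
    """Local-step training on f_i(x) = 0.5 * a[i] * ||x - b[i]||^2.

    With use_control=True, each local step is corrected by (c - c_i), the
    difference between the server and client control variates. With
    use_control=False, this reduces to plain local-step averaging.
    """
    n, dim = b.shape
    x = np.zeros(dim)
    c_i = np.zeros((n, dim))  # one control variate per client
    c = np.zeros(dim)         # server control variate (mean of the c_i)
    for _ in range(rounds):
        new_y = np.empty((n, dim))
        new_c_i = np.empty((n, dim))
        for i in range(n):
            y = x.copy()
            for _ in range(local_steps):
                grad = a[i] * (y - b[i])  # exact gradient of f_i at y
                correction = (c - c_i[i]) if use_control else 0.0
                y = y - lr * (grad + correction)
            new_y[i] = y
            # Refresh the client control variate from the observed drift.
            new_c_i[i] = c_i[i] - c + (x - y) / (local_steps * lr)
        x = new_y.mean(axis=0)  # server averages the local models
        if use_control:
            c_i = new_c_i
            c = c_i.mean(axis=0)
    return x


if __name__ == "__main__":
    a = np.array([1.0, 2.0, 4.0])  # heterogeneous curvatures
    b = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
    x_star = (a[:, None] * b).sum(axis=0) / a.sum()  # minimizer of the average
    for flag in (True, False):
        err = np.linalg.norm(scaffold_quadratics(a, b, use_control=flag) - x_star)
        print(f"use_control={flag}: distance to optimum = {err:.2e}")
```

On this toy problem the uncorrected variant settles at a point away from the minimizer of the average objective because the clients' curvatures differ, whereas the corrected variant reaches the minimizer.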
Sign-based algorithms (e.g. SIGNSGD) have been proposed as a biased gradient compression technique to alleviate the communication bottleneck in training large neural networks across multiple workers. We show simple convex counter-examples where signSGD does not converge to the optimum. Further, even when it does converge, signSGD may generalize poorly when compared with SGD. These issues arise because of the biased nature of the sign compression operator. We then show that using error-feedback, i.e. incorporating the error made by the compression operator into the next step, overcomes these issues. We prove that our algorithm (EF-SGD) with arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions. Thus EF-SGD achieves gradient compression for free. Our experiments thoroughly substantiate the theory and show that error-feedback improves both convergence and generalization. Code can be found at
Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the good scalability for distributed applications.
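The error-compensation idea can be sketched in a few lines: whatever top-k discards from an update is kept in a memory vector and added back before the next compression. The sketch below is a single-worker illustration with a deterministic gradient on a toy quadratic. The function names, step size, and test problem are assumptions chosen for illustration, not the paper's setup.

```python
import numpy as np


def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out


def sgd_topk_memory(grad_fn, x0, k, lr, steps):
    """Gradient descent with top-k sparsified updates and error memory."""
    x = np.array(x0, dtype=float)
    memory = np.zeros_like(x)  # accumulated updates not yet applied
    for _ in range(steps):
        proposed = memory + lr * grad_fn(x)  # add back what was dropped earlier
        update = top_k(proposed, k)          # only k entries would be sent
        memory = proposed - update           # remember the suppressed part
        x = x - update
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=10)
    # Toy objective 0.5 * ||x - target||^2 with gradient x - target.
    x = sgd_topk_memory(lambda z: z - target, np.zeros(10), k=3, lr=0.05, steps=1500)
    print("distance to optimum:", np.linalg.norm(x - target))
```

Only `k` of the 10 coordinates change per step, yet no gradient information is permanently lost, because the suppressed entries accumulate in `memory` until they are large enough to be selected.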