BOBA: Byzantine-Robust Federated Learning with Label Skewness

Wenxuan Bao, Jingrui He

Abstract

In federated learning, most existing techniques for robust aggregation against Byzantine attacks are designed for the IID setting, i.e., the data distributions for clients are independent and identically distributed. In this paper, we address label skewness, a more realistic and challenging non-IID setting, where each client only has access to a few classes of data. In this setting, state-of-the-art techniques suffer from selection bias, leading to significant performance drop for particular classes; they are also more vulnerable to Byzantine attacks due to the increased deviation among gradients of honest clients. To address these limitations, we propose an efficient two-stage method named BOBA. Theoretically, we prove the convergence of BOBA with an error of optimal order. Empirically, we verify the superior unbiasedness and robustness of BOBA across a wide range of models and data sets against various baselines.

\affiliations

University of Illinois at Urbana-Champaign
{wbao4,jingrui}@illinois.edu

1 Introduction

Federated learning (FL) McMahan et al. (2017) has attracted much research attention in the past decade. It refers to a machine learning system where multiple clients collaboratively train a shared machine learning model under the orchestration of a central server, thus avoiding the leakage of their own private and sensitive data. During training, each client downloads a copy of the shared model from the server, trains the model with local data, and submits a model update to the server for aggregation. In FL, each client’s data is stored locally and invisible to both the server and other clients. FL has wide applications in sales, finance, healthcare Yang et al. (2019), etc.

FL systems can be vulnerable to attacks and failures Kairouz et al. (2021); Lyu et al. (2020). In particular, Byzantine attacks can lead to the convergence to an unsatisfactory model or even divergence Blanchard et al. (2017). A common defending mechanism against Byzantine attacks is to replace the gradients averaging with robust estimation of the center Chen et al. (2017); Yin et al. (2018). These methods have proven Byzantine-robustness when data from different clients are independent and identically distributed (IID). However, the IID assumption is usually violated in real applications. For example, if each client corresponds to a specific user, the data distribution usually varies from user to user. It is unclear how existing robust aggregators are affected by the violation of the IID assumption.

Our work mainly focuses on label skewness, a typical non-IID setting where each client only has access to a few classes of data Li and Zhan (2021); Shen et al. (2022). In this setting, each client has the same conditional data distribution given labels, but their label distributions can vary a lot. Label skewness degrades the model in two ways. First, it introduces a selection bias of clients. Aggregators tend to select some clients over others across iterations and thus bias the model. Second, it increases the deviation among honest gradients, making existing aggregators more vulnerable to Byzantine attacks. It requires advanced techniques to tackle these challenges in FL with label skewness.

Focusing on label skewness, we find that the gradients of honest clients are distributed near a $(c - 1)$ -simplex, where $c$ is the number of classes. Based on this, we propose BOBA (Byzantine-rObust and unBiased Aggregator), a two-stage algorithm to estimate this simplex. In the first stage, we robustly estimate the low dimensional affine subspace that the simplex is in and project all gradients to the subspace. In the second stage, we use a few data samples on the server to estimate the $(c - 1)$ -simplex and further filter out possible Byzantines. With our algorithm, all honest gradients will be preserved with small perturbation, while Byzantine gradients will either be discarded or weakened. We evaluate BOBA theoretically and empirically, proving that it can alleviate selection bias and achieve robustness. We summarize our contributions below:

We make a systematic analysis of the robustness of FL with label skewness. Specifically, we summarize two novel challenges and propose optimal order robustness to measure the robustness of an aggregator. (Section 3)
Based on our finding that honest gradients distribute near a $(c - 1)$ -simplex, we propose BOBA, including an objective addressing both label skewness and robustness, and an efficient optimization algorithm. (Section 4)
We theoretically derive the gradient estimation error of BOBA and make a convergence analysis. Theoretical results show that BOBA achieves optimal order robustness while baseline aggregators do not. (Section 5)
We empirically evaluate the unbiasedness and robustness of BOBA across a wide range of models, data sets, and attacks. We show that BOBA outperforms various baseline aggregators, even when they are allowed to leverage additional server data. (Section 6)

2 Related Works

Robustness in FL

There are extensive works of robust aggregators with IID data Blanchard et al. (2017); Chen et al. (2017); Yin et al. (2018); Pillutla et al. (2022), most of which replace the averaging step on the server with a robust mean estimator. These methods are theoretically proved not to fail arbitrarily in IID settings. However, as shown in our analysis, they suffer from selection bias and increased vulnerability in label skewness settings. A few works have studied robustness with non-IID data by making the clients “more IID”. Karimireddy et al. (2022) applies bucketing to make the input of the aggregator more homogeneous, while sacrificing some robustness. Ghosh et al. (2019) divides clients into IID groups and learns global models in each group. A topic related to selection bias is performance fairness, where each client should have similar accuracy. Hu et al. (2020) introduces a multi-task learning framework to learn a robust and fair global model. However, it is not robust to Byzantine attacks and can only guarantee Pareto optimal. Ditto Li et al. (2021) learns personalized models to achieve fairness and robustness, but still requires training a robust global model.

Non-IIDness in FL

Besides robustness, non-IIDness also raises optimization challenges in FL. When clients take multiple local steps, non-IIDness makes local updates diverge and thus degrades the model. A common method to handle non-IIDness is to share a limited amount of data as augmentation Zhao et al. (2018), which can be collected in many real applications. To further protect privacy, some works replace the raw samples with aggregated samples Yoon et al. (2021), or synthetic samples Zhang et al. (2021). Compared to them, our work assumes very limited server data.

Label Skewness and Mixture Distribution

Plenty of works focus on label skewness, a particular sub-class of non-IIDness. FedAwS Yu et al. (2020) studies an extreme case where each client has only access to one class, while FedRS Li and Zhan (2021) focuses on a general label skewness setting. A related non-IID setting is mixture distribution Marfoq et al. (2021), where each client’s data distribution is a mixture of several shared distributions with its own mixture weights. BOBA mainly focuses on label skewness and can be easily extended to mixture distribution.

3 FL with Label Skewness

Figure 1: Comparison of aggregation result. Orange set is the range of aggregation results when Byzantines are different. The left three aggregators can have very large aggregation error, while BOBA is much more accurate. See details in Appendix B.2.

Setup

We study the FedSGD McMahan et al. (2017) system consisting of one central server and $n$ clients. Each client is either honest (in honest set $H$ ) or Byzantine (in Byzantine set $B$ ). In each communication round, the server broadcasts the parameter $w_{G} \in R^{d}$ to all clients. Each honest client computes the gradient with its own data ${ξ_{i j}}_{j = 1}^{m}$ sampled from $P_{i}$ and sends back the honest gradient $g_{i} = \nabla_{w_{G}} L_{i} (w_{G})$ , where $L_{i} (w_{G}) = \frac{1}{m} \sum_{j = 1}^{m} L (w_{G}; ξ_{i j})$ . Each Byzantine client can send arbitrary Byzantine gradient to the server. Finally, the server aggregates all $n$ gradients $^μ=Agg({gi}ni=1)$ and updates the parameter $w_{G} \leftarrow w_{G} - η^μ$ , where $Agg (\cdot)$ is the aggregation function, and $η$ is the learning rate. In training phase, the system minimizes the empirical risk, $\frac{1}{| H |} \sum_{i \in H} L_{i} (w_{G})$ . Let $| H |, | B |$ be the real number of honest and Byzantine clients respectively. As the server does not know the exact number of Byzantines, let $f$ be the declared number of Byzantines, i.e. the max number of Byzantines that the aggregator is robust to.

For each honest client $i \in H$ , let $E g_{i}$ be its expected gradient, where the expectation is taken on data sampling from $P_{i}$ . Let $μ = \frac{1}{| H |} \sum_{i \in H} g_{i}$ denote the gradient of empirical risk and $E μ = \frac{1}{| H |} \sum_{i \in H} E g_{i}$ denote its expectation, also the gradient of population risk. Aggregators aim to estimate $E μ$ robustly under arbitrary behavior of Byzantine clients.

Distribution of Honest Gradients

This part analyzes the distribution of honest gradients with label skewness. We start with some definitions.

Definition 1 (Inner, outer and total deviations).

For an honest client $i \in H$ , its inner deviation is $g_{i} - E g_{i}$ ; its outer deviation is $E g_{i} - E μ$ , and its total deviation is $g_{i} - E μ$ .

Inner deviation measures the randomness of sampling data from clients’ data distribution $P_{i}$ , while outer deviation measures the difference from a client’s data distribution $P_{i}$ to the global distribution $P = \frac{1}{| H |} \sum_{i \in H} P_{i}$ , without randomness.

In the IID setting, the outer deviation is zero, implying that honest gradients ${g_{i}}_{i \in H}$ are distributed around the same center $E μ$ . When the local batch size $B \to + \infty$ , each $g_{i} p \to E μ$ . Such implication does not hold with label skewness, since outer deviations are non-zero. We formally define label skewness and analyze how honest gradients distribute.

Definition 2 ( $c$ -label skew distribution).

Honest clients’ data distributions ${P_{i}}_{i \in H}$ are $c$ -label skew distribution if

P_{i} (ξ) = c \sum z = 1 p_{i z} Q_{z} (ξ), \forall i \in H

where $P_{i} (ξ)$ is the data distribution of client $i$ . The label $z$ can take $c$ finite values. The mixture weight $p_{i z}$ is the label distribution of client $i$ satisfying $\sum_{z = 1}^{c} p_{i z} = 1$ and the mixture component $Q_{z} (ξ) = P_{i} (ξ | z)$ is the conditional distribution given $z$ . Different clients share the same ${Q_{z} (ξ)}_{z = 1}^{c}$ but different label distribution $p_{i} = [p_{i 1}, \dots, p_{i z}]^{⊤}$ .

$c$ -label skew distribution assumes the heterogeneity among honest clients can be characterized by their divergence in label distribution. With this condition, we can analyze the distribution of honest gradients.

Proposition 1 (Distribution of expected honest gradients).

With $c$ -label skew distribution, for any honest client $i \in H$ ,

	$E g_{i}$	$= \sum ξ P_{i} (ξ) \nabla_{w} L (w; ξ) = \sum ξ c \sum z = 1 p_{i z} Q_{z} (ξ) L (w; ξ)$
		$= c \sum z = 1 p_{i z} \sum ξ Q_{z} (ξ) L (w; ξ) = c \sum z = 1 p_{i z} E γ_{z}$

where $E γ_{z} = \sum_{ξ} Q_{z} (ξ) L (w; ξ)$ is the expected gradient computed with data from class $z$ .

Proposition 1 shows that each expected honest gradient is a convex combination of ${E γ_{z}}_{z = 1}^{c}$ , whose range forms a $(c - 1)$ -simplex. We define the honest simplex to be ${\sum_{z = 1}^{c} p_{z} E γ_{z} : \sum_{z = 1}^{c} p_{z} = 1, p_{z} \geq 0}$ , and the honest subspace to be ${\sum_{z = 1}^{c} p_{z} E γ_{z} : \sum_{z = 1}^{c} p_{z} = 1}$ .

As honest gradients are perturbations of their expectations, they distribute near the honest simplex, approximately forming a $(c - 1)$ -dimensional affine subspace. Thus, if we conduct principle component analysis on honest gradients, the variance should concentrate in the first $(c - 1)$ principle components. Figure 2 verifies our finding on MNIST. Appendix B.2 gives details of this preliminary experiment.

Figure 2: PCA of honest gradients on MNIST ( $c = 10$ )

Notice that the conclusion can be extended to a general mixture distribution Marfoq et al. (2021) when $z$ is a general latent variable other than the class label, which gives more flexibility. Our paper focuses on label skewness for clarity.

Challenges of Label Skewness

In the IID setting, every honest gradient is an unbiased estimation of $E μ$ . It is easy to design a robust aggregator since it only needs to find one honest gradient (or a close Byzantine gradient). However, in label skewness settings, each honest gradient can deviate a lot from $E μ$ . It introduces two challenges to aggregators designed for IID settings: selection bias and increased vulnerability.

Selection bias

Many robust aggregators, including Krum Blanchard et al. (2017), select a subset of gradients and aggregate. With label skewness, such aggregators select some clients much more often than others. Since each honest gradient is only a biased estimation of $E μ$ , the aggregation result is also biased, even when there is no Byzantines. In Figure 1 where honest gradients form two clusters, three baseline aggregators all choose the cluster with more clients, resulting in a highly biased aggregation (the red square). Considering each cluster belongs to one class, the model will be trained with one class of data only, resulting in a model predicting every sample as this class.

Increased vulnerability

We observe that the aggregation result can deviate more from the true center $E μ$ when the total deviations become larger. Figure 1 shows that the range of aggregation results is huge for IID aggregators, even along the direction where honest gradients do not diverge.

For the two reasons above, although IID aggregators are theoretically robust in IID settings, they have large gradient estimation error even with the absence of Byzantines, and also suffer more with their existence. It is necessary to reconsider the theory of robustness in FL with label skewness.

Optimal Order Robustness

In IID settings, robustness is usually explained as bounded gradient estimation error Chen et al. (2017); Blanchard et al. (2017); Yin et al. (2018). However, such bounds confuse inner and outer deviations, ignore the bias introduced by robust aggregators, and thus can be vacuous (e.g., in Figure 1). Different from them, a good aggregator for label skewness should not bias the aggregation when there are no Byzantines. We define this property as unbiasedness.

Definition 3 (Unbiasedness).

An aggregator $Agg (\cdot)$ is unbiased if $Agg ({g_{i}}_{i \in H}) \to E μ$ when $g_{i} \to E g_{i}, \forall i \in H$ ; i.e., the aggregator should not introduce additional bias when there are no Byzantines.

However, when there are Byzantine clients, bias is inevitable in label skewness settings. For example, Byzantine clients can mimic the behavior of honest clients Karimireddy et al. (2022). In this case, any unbiased aggregator should keep them since they are the same as honest ones. With label skewness, such Byzantines can bias the average label distribution by $∥ Δ ¯ p ∥_{1} \leq 2 | B | / n$ , introducing an inevitable gradient estimation error of $O (| B | / n)$ . Based on this fact, we propose optimal order robustness, an adaptive upper bound of gradient estimation error.

Definition 4 (Optimal order robustness).

An aggregator $Agg (\cdot)$ has optimal order robustness if $∥ Agg ({g_{i}}_{i = 1}^{n}) - E μ ∥_{2} \to O (β)$ when $g_{i} \to E g_{i}, \forall i \in H$ . $β = | B | / n$ is the fraction of Byzantines.

Unbiasedness can be seen as a special case of optimal order robustness when $β = 0$ . Most robust aggregators designed for IID settings do not have optimal order robustness. Their bias is huge even when $β = 0$ . New aggregators need to be proposed for label skewness settings.

Finally, we extend the necessary condition on the number of Byzantines to achieve robustness from IID to label skewness. IID settings require $n - 2 f > 0$ , which means that the honest component should “outweigh” the Byzantine component, i.e., $n - f > f$ . However, for $c$ -label skew distribution, each component must outweigh the Byzantine component simutaneously. Otherwise, Byzantines can form a cluster to replace a honest component, resulting in unbounded error. Thus, we have the following extended condition.

Condition 1.

Any $n - 2 f$ expected gradients ${E g_{i}}$ distribute on a $(c - 1)$ -simplex, but not on any $(c - 2)$ -simplex.

This condition indicates that any $n - 2 f$ expectations of honest gradients can uniquely reconstruct the honest simplex. It includes the IID setting when $c = 1$ . This condition reveals that when $c$ is larger, achieving optimal order robustness is harder since aggregators can tolerate fewer Byzantines. The supremum of $f$ depends on the exact value of label distributions.

4 BOBA Algorithm

G = [g_{1}, \dots, g_{n}]

Γ = [γ_{1}, \dots, γ_{c}]

n, f, c, p_{min}

0: Aggregation result

^μ

1: Initialize subspace

^P

m, U, Σ, V = {TrSVD}_{c - 1} (Γ)

2: while not converge do

3: Update

r

G_{[n - f]} = {n - f

gradients in

G

with smallest

∥ g_{i} - Π_{^P} (g_{i}) ∥_{2}}

where

Π_{^P} (g_{i}) = U U^{⊤} (g_{i} - m) + m

4: Update

^P

m, U, Σ, V = {TrSVD}_{c - 1} (G_{[n - f]})

5: end while

6: Encode:

_{i} = U^{⊤} (g_{i} - m), \forall i

;

~ Γ = U^{⊤} (Γ - m 1^{⊤})

7: Estimate:

^p_{i} = {[\begin{matrix} ~ Γ 1^{⊤} \end{matrix}]}^{- 1} [\begin{matrix} {~ g}_{i} 1 \end{matrix}], \forall i

8: Filter:

a=A({^pi}ni=1)

9: Aggregate:

~ μ = \sum_{i = 1}^{n} a_{i} ~ g_{i} / \sum_{i = 1}^{n} a_{i}

10: Decode:

^μ = U ~ g_{G} + m

Algorithm 1 BOBA(stage 1: line 1-5, stage 2: line 6-10)

In this section, we propose BOBA and explain its two stages in detail. In stage 1, we robustly find the honest subspace, and project all gradients to it. In stage 2, we estimate the vertices of the honest simplex, reconstruct the label distribution for each client, and drop clients with abnormal label distribution (i.e., with strongly negative entries). Ideally, all honest gradients will be kept with small perturbation, while all Byzantine gradients will either be weakened (projected to the honest simplex in stage 1) or discarded (in stage 2). Therefore, the impact of Byzantine gradients is limited.

Stage 1: Fitting the Honest Subspace

The goal of stage 1 is to find a $(c - 1)$ -dimensional affine subspace close to all honest gradients under the influence of Byzantine gradients. When there are no Byzantines, a standard way to find the subspace is TrSVD, i.e., truncated singular value decomposition on centralized gradients,

m, U, Σ, V = {TrSVD}_{c - 1} (G), s.t. U Σ V^{⊤} \approx G - m 1^{⊤}

where $G = [g_{1}, \dots, g_{n}] \in R^{d \times n}$ is the client gradient matrix, $m = \frac{1}{n} G 1 \in R^{d}$ is their average, $U \in R^{d \times (c - 1)}, V \in R^{n \times (c - 1)}$ are column-orthogonal and $Σ \in R^{(c - 1) \times (c - 1)}$ is diagonal. TrSVD fits a $(c - 1)$ -dimensional affine subspace $P$ minimizing the reconstruction loss

ℓ (P) = n \sum i = 1 {∥ g_{i} - Π_{P} (g_{i}) ∥}_{i}^{2}

where $Π_{P} (g_{i}) = U U^{⊤} (g_{i} - m) + m$ is a projection function that projects vectors to $P = {U λ + m : λ \in R^{c - 1}}$ .

Although vanilla TrSVD is robust to small perturbations Stewart (1990), it is not Byzantine-robust. When there are Byzantine gradients deviating from the honest subspace, the fitted subspace will be dragged to these Byzantine gradients at the cost of underfitting honest ones. For example in $R^{2}$ , when $n$ honest gradients are uniformly distributed on a segment of ${(x, y) : x \in [- 1, 1], y = 0}$ . TrSVD will fit a subspace of ${y = 0}$ . However, one Byzantine gradient of $(100 n, 100 n)$ can alter the fitted subspace to about ${y = x}$ .

Objective

Since reconstruction loss is vulnerable, we design trimmed reconstruction loss to robustify TrSVD.

ℓ_{t} (P) = min \frac{r \in {0, 1}^{n},}{\sum_{i = 1}^{n} r_{i} = n - f} n \sum i = 1 r_{i} {∥ g_{i} - Π_{P} (g_{i}) ∥}_{i}^{2}

BOBA stage 1 fits an affine subspace $^P$ by minimizing the trimmed reconstruction loss above, which selects the $n - f$ nearest neighbors ( $r_{i} = 1$ ) and ignores $f$ gradients furthest from $^P$ ( $r_{i} = 0$ ). Intuitively, if Byzantines are far from the $^P$ , they will be ignored so $^P$ is not affected; if Byzantine are close to $^P$ , the $n - f$ nearest neighbors of $^P$ still includes at least $n - 2 f$ honest gradients, which are enough to reconstruct the honest subspace (by Condition 1). We show in Appendix A.5 that stage 1 is theoretically guaranteed to estimate the honest subspace robustly.

Strongest colluding Byzantines may focus on another dimension different from the $c - 1$ honest dimensions. But BOBA stage 1 will not identifies the Byzantine dimension as honest. If it makes such mistake, the $n - f$ nearest neighbors will form a $c$ -dimensional affine subspace, including one Byzantine dimension and $c - 1$ honest dimensions (since there are at least $n - 2 f$ honest gradients in the $n - f$ nearest neighbors). Conducting TrSVD on them results in non-negligible loss $\propto δ^{2}$ , which can be further optimized. Meanwhile, correctly identifying all honest dimensions results in a loss unrelated to outer deviations, which is much smaller. In our experiments, we also show that BOBA can resist such colluding Byzantines, e.g. Signflip Xie et al. (2019) and Little Baruch et al. (2019).

Optimization

To minimize trimmed reconstruction loss, we solve a joint minimization problem

^P,^r=argminP,r∈{0,1}n,∑ni=1ri=n−fℓt(P,r)=n∑i=1ri∥gi−ΠP(gi)∥22

Fixing $P$ , the optimal $r$ selects the $n - f$ nearest neighbors of $P$ ; while fixing $r$ , the optimal $P$ can be fitted by conducting TrSVD on the selected $n - f$ gradients. A naive way to minimize trimmed reconstruction loss is exhaustive searching, which iterates every possible value of $r$ , conducts TrSVD to fit $P$ , and chooses the $P$ with the smallest trimmed reconstruction loss. It can guarantee the global minimum but have an exponentially high computational complexity.

Instead, we use alternating optimization, with details in Line 2 - 5 in Algorithm 1. It alternatively updates $P$ and $r$ until convergence. Although the global minimum may not be guaranteed, alternating optimization can converge to a high-quality local minimum with just a few steps, which is more efficient and practical for large-scale FL.

After minimization, we project every gradient to the fitted subspace $^P$ . The projection can weaken Byzantine gradients by eliminating its components orthogonal to $^P$ ; meanwhile, it only introduces small bounded perturbation to honest gradients. However, only applying stage 1 does not guarantee robustness: a Byzantine may still have large components along $^P$ that bias the aggregation. We design stage 2 to further rule out such Byzantine gradients.

Stage 2: Finding the Honest Simplex

Stage 2 aims to estimate the honest simplex and rule out gradients outside it. Since we have no information on the distribution of clients’ label distributions, it is impossible to estimate the honest simplex with only clients’ gradients. We use a small amount of server data to estimate $c$ vertices of honest simplex, and reconstruct the label distribution of each client. Clients with abnormal label distribution will be discarded.

Proposition 1 shows that each vertex of the honest simplexis the expected gradient computed with one class of data. Thus, we initialize $c$ virtual clients on the server, each with one class of data, and compute server gradients ${γ_{z}}_{z = 1}^{c}$ with the same process of honest clients. To estimate the label distribution of a client $i$ , we solve

c \sum z = 1 {^p}_{i z} Π_{^P} (γ_{z}) = Π_{^P} (g_{i}), c \sum z = 1 {^p}_{i z} = 1

Solving this linear system in the gradient space is inefficient. Therefore, we split the projection into two steps: encoding ( ${~ g}_{i} = U^{⊤} (g_{i} - m)$ ) and decoding ( $Π_{^P} (g_{i}) = U {~ g}_{i} + m$ ), and solve the linear system in the latent space. The explicit solution is shown in line 6-7 in Algorithm 1.

If our estimation is perfect (e.g. when $g_{i} \to E g_{i}, γ_{z} \to E γ_{z}$ ), $^p_{i}$ will lie in the probability simplex, i.e. ${p : 1^{⊤} p = 1, p \geq 0}$ if client $i$ is honest, while it can be arbitrary if client $i$ is Byzantine. So we can discard clients with negative entries in $^p_{i}$ , since they must be Byzantines. However in practice, our estimation has bounded error (Appendix A.6). Thus, if an honest client does not have data from a class, which is very common, it can also have a slightly negative entry. Therefore, we design an acceptance criteria

a=A({^pi}ni=1),  where ai=1 if.f. minzpiz≥pmin

Here $p_{min} \leq 0$ is a hyper-parameter deciding the threshold of rejecting Byzantines. In our implementation, we will accept $n - f$ clients with largest ${min}_{z = 1}^{c} p_{i z}$ if $\sum_{i = 1}^{n} a_{i} \leq n - f$ , i.e., our acceptance criteria drops too many clients, since there should be at least $n - f$ honest clients. After dropping Byzantines ( $a_{i} = 0$ ), we average the remained projected gradients as the aggregation result of BOBA.

Computational complexity

The computational complexity of BOBA is $O (k c n d)$ if it conducts TrSVD for $k$ times. The complexity of TrSVD is $O (c n d)$ Halko et al. (2011), where $c$ is the number of classes, $n$ is the number of clients and $d$ is the dimension of gradient. When $k, c$ are small constants, BOBA has the same complexity as vanilla averaging, which is very efficient. Practically we also observe that $k$ is very small. In three settings with MNIST, CIFAR-10 and AG-News (see Section 6)), $k = 3.29, 3.20, 4.77$ on average.

5 Theoretical Analysis

This section provides theoretical results of the convergence of BOBA and shows that BOBA has optimal order robustness. For space limits, all proofs are deferred to Appendix A. Similar to previous works Blanchard et al. (2017); Chen et al. (2017); Yin et al. (2018), we start with a classical convergence analysis connecting convergence to the gradient estimation error.

Proposition 2.

With $L$ -smooth and $μ$ -strongly convex population risk, conducting SGD with gradient $^μ$ with estimation error $∥^μ - E μ ∥_{2} \leq Δ$ , and step size $\frac{μ}{L^{2}}$ , after $t$ steps,

∥ w^{(t)} - w^{*} ∥_{2} \leq {(1 - \frac{μ^{2}}{L^{2}})}^{\frac{t}{2}} ∥ w^{(0)} - w^{*} ∥_{2} + \frac{μ Δ}{L^{2} - L \sqrt{L^{2} - μ^{2}}}

where $w^{*}$ is the optimal model parameter.

Proposition 2 shows that the parameters exponentially converge to the optimal parameters with an error rate of $O (Δ)$ . In other words, the gradient estimation error is a good indicator of the error of the model. The following part derives the gradient estimation error of BOBA and test whether it has optimal order robustness. We start with assumptions.

Assumption 1 (Bounded deviations).

Honest client inner deviation: with large probability $1 - p$ where $p = O (\frac{1}{ϵ^{2}})$ , $\frac{1}{| H |} \sum_{i \in H} ∥ g_{i} - E g_{i} ∥_{2}^{2} \leq ϵ^{2}$ .
Honest client outer deviation: $∥ E g_{i} - E μ ∥_{2}^{2} \leq δ^{2}, \forall i \in H$ .
Server inner deviation: with large probability $1 - p_{s}$ where $p_{s} = O (\frac{1}{ϵ_{s}^{2}})$ , $\frac{1}{c} \sum_{z = 1}^{c} ∥ γ_{z} - E γ_{z} ∥_{2}^{2} \leq ϵ_{s}^{2}$ .
Server outer deviation: $∥ E γ_{z} - E μ ∥_{2}^{2} \leq δ_{s}^{2}, \forall z = 1, \dots, c$ .

Assumption 2 (Bounded client singular value).

Conducting centralized SVD on any $n - 2 f$ expectations of honest gradients, the $(c - 1)$ -th singular value $σ_{c - 1} \geq σ > 0$ .

Assumption 3 (Satisfactory optimizer).

There exists a projection function $P^{'}$ fitted by TrSVD on $n - f$ honest gradients with $ℓ_{t} (P^{'}) \geq ℓ_{t} (^P)$ .

Assumption 1 is a deterministic version of bounded inner and outer variation Wu et al. (2020); Peng et al. (2020), and can be derived from it with Markov inequality. Assumption 2 reformulates Condition 1 with SVD, since the number of non-zero singular values indicates the dimension of a subspace. Assumption 3 says that the optimizer can find a solution that is not too bad (not necessarily the global minimum). This assumption is satisfied in practice since the initialized subspace is fitted by server gradients, which are computed similarly as honest gradients and thus near the honest subspace $P^{*}$ .

Theorem 3.

With probability at least $1 - p - p_{s}$ ,

	$∥^μ - E μ ∥_{2} \leq [1 + 2 \sqrt{\frac{1}{n - 2 f} + \frac{δ^{2}}{σ^{2}}} \sqrt{\| H \|}] ϵ$
	$+ β 2 (1 + c \| p_{min} \|) (\sqrt{c} ϵ_{s} + δ_{s})$

where $β = | B | / n$ is the real fraction of Byzantine clients.

When the outer deviation increases $t$ times, both $δ$ and $σ$ increase $t$ times. When all clients are duplicated, $δ^{2}$ does not change but $σ^{2}$ doubled. Thus generally we have $\frac{δ^{2}}{σ^{2}} \propto \frac{1}{n}$ . When $ϵ_{s} = O (ϵ), δ_{s} = O (δ)$ , $c = O (1), \frac{1}{n - 2 f} = O (\frac{1}{n})$ , $| H | = O (n)$ , and $| p_{min} | = O (1)$ , we have $∥^μ - E μ ∥_{2} = O (ϵ + β δ)$ . Let $ϵ \to 0$ , we conclude that BOBA has optimal order robustness, which baseline aggregators do not achieve (see Appendix A.9).

6 Experiments

In this section, we conduct experiments to evaluate the unbiasedness and robustness of BOBA. Specifically, we aim to answer the following research questions.

RQ1: Is BOBA unbiased when there are no Byzantines?
RQ2: Is BOBA robust enough to Byzantine attacks?
RQ3: How do different components contribute to the effectiveness, robustness and efficiency of BOBA?

Method	MNIST		CIFAR-10		AG-News
Method	Acc $↑$	MRD $↓$	Acc $↑$	MRD $↓$	Acc $↑$	MRD $↓$
Average	90.7	-	70.2	-	88.3	-
Server	82.4	17.9	23.4	63.0	81.2	12.8
CooMed	64.7	75.4	19.7	81.4	82.9	10.3
TrMean	76.7	72.3	29.8	79.2	87.0	5.4
Krum	42.4	93.9	32.6	80.8	65.6	89.9
MKrum	89.3	20.1	69.7	9.4	88.0	5.3
GeoMed	90.2	6.1	69.9	4.1	88.3	0.5
B-Krum	70.0	91.5	55.1	80.0	87.6	4.3
B-MKrum	90.3	5.2	69.8	4.2	88.0	3.5
SelfRej	88.9	24.8	69.4	7.3	85.2	20.3
AvgRej	88.6	26.9	69.5	14.7	84.7	22.2
BOBA	90.5	4.5	69.6	2.6	88.3	0.4

Table 1: Evaluation of unbiasedness (mean % over 10 runs)

Method	MNIST				CIFAR-10				AG-News
Method	Gaussian	Signflip	Little	Worst	Gaussian	Signflip	Little	Worst	Gaussian	Signflip	Little	Worst
Average	9.8	9.8	90.9	9.8	10.0	10.0	65.4	10.0	27.5	25.0	87.6	25.0
CooMed	62.1	42.4	87.3	42.4	20.7	7.7	23.6	7.7	86.0	55.2	81.5	55.2
TrMean	88.6	57.1	83.6	57.1	54.6	13.1	29.7	13.1	88.1	62.4	85.2	62.4
Krum	42.4	42.4	89.7	42.4	36.1	34.7	39.1	34.7	65.8	65.7	80.4	65.7
MKrum	90.8	89.2	90.1	89.2	70.0	54.8	63.7	54.8	88.3	77.0	86.7	77.0
GeoMed	90.2	78.0	89.8	78.0	69.9	46.6	42.0	42.0	88.4	76.2	83.6	76.2
B-Krum	73.6	77.3	89.4	73.6	55.9	56.1	36.5	36.5	88.3	73.3	87.6	73.3
B-MKrum	90.6	88.8	90.1	88.8	70.3	24.3	63.4	24.3	88.3	9.3	85.9	9.3
SelfRej	90.7	68.2	90.2	68.2	70.2	23.3	63.3	23.3	88.3	24.8	86.5	24.8
AvgRej	9.8	90.7	89.6	9.8	10.0	68.3	64.1	10.0	40.8	87.2	84.9	40.8
BOBA	90.7	90.5	90.7	90.5	69.9	68.8	68.2	68.2	88.3	87.9	88.4	87.9

Table 2: Evaluation of robustness (mean Acc % over 10 runs)

We experiment on a wide range of models and datasets: a 3-layer MLP for MNIST Lecun et al. (1998), a 5-layer CNN for CIFAR-10 Krizhevsky and Hinton (2009), and a GRU network for AG-News Zhang et al. (2015). We partition training sets to $| H | = 100, 100, 160$ honest clients with the method in McMahan et al. (2017), respectively. For page limits, we give detailed experimental settings and all error bars in Appendix B.1 and B.2.

Byzantines

We consider three types of Byzantines:

Gaussian Blanchard et al. (2017), a non-colluding attack uploading large-scale vectors from Gaussian distribution.
Signflip Xie et al. (2019), a strong colluding attack uploading $g_{k} = - γ \cdot \frac{1}{| H |} \sum_{i \in H} g_{i}$ , with large $γ$ to make Average perform gradient ascent. We set $γ = 10$ for AG-News and $γ = 20$ for MNIST and CIFAR-10 (since there are less Byzantines).
Little Baruch et al. (2019), a moderate colluding attack uploading vectors within the scope of honest ones to bias the model while avoiding being detected. We set the hyper-parameter according to the original paper, i.e., $z = ϕ^{- 1} ((n - ⌊ n / 2 + 1 ⌋) / (n - | B |))$ .

Aggregators

We compare a wide range of baselines:

Average aggregator McMahan et al. (2017) simply averages all gradients. It is unbiased but vulnerable to attacks.
Server aggregator only uses server data to fit a model. We use it to verify that one cannot train a good model with server data only.
IID aggregators include coordinate median (CooMed) / trimmed mean (TrMean) Yin et al. (2018), Krum / multi-Krum (MKrum) Blanchard et al. (2017), and geometric median (GeoMed) Chen et al. (2017).
Bucketing aggregators Karimireddy et al. (2022) feed buckets (means) of gradients to IID aggregators. We consider bucketing + Krum (B-Krum) and bucketing + MKrum (B-MKrum) with $s = 2$ .
Loss-based rejection aggregators Fang et al. (2020) detect possible Byzantine clients based on the performance on server data. SelfRej selects $n - f$ clients whose local models $w_{i} = w_{G} - η g_{i}$ have smallest loss, while AvgRej selects $n - f$ clients whose gradients can lower the loss of averaged model the most.

All aggregators are set to be robust to $f = 16$ Byzantines on MNIST/CIFAR-10 and $f = 60$ on AG-News. BOBA uses $p_{min} = - 0.5$ . We assume limited server data: 20 per class for MNIST/CIFAR-10 and 30 per class for AG-News, much fewer than the samples on each client.

Unbiasedness and Robustness (RQ1, 2)

Evaluation of unbiasedness

We evaluate the unbiasedness with $| B | = 0$ . Besides accuracy, we introduce max-recall-drop (MRD) as a compliment. It computes how the recall scores of each class differ from the model trained with Average (with $| B | = 0$ ) and picks the largest absolute drop. Smaller MRD indicates a less biased aggregator. As selection bias may dramatically decrease some classes’ recalls while increasing the others, MRD can reflect selection bias better than using accuracy only. Results are shown in Table 1. We observe that BOBA has accuracy very close to Average, and the smallest MRD among all robust aggregators.

Evaluation of robustness

We evaluate the robustness with $| B | = 15, 15, 54$ on three data sets respectively. Results are shown in Table 2. Notice that Byzantines can choose their attack methods to reduce the model accuracy as much as possible. Therefore, we summarize the worst-case accuracy for a clear comparison in the “worst” column. We observe that BOBA significantly improves the worst-case accuracy by 1.3%, 13.4%, 10.7% on three data sets, respectively.

Discussion of aggregators

Models trained with CooMed and Krum suffer a lot from selection bias, totally fail on some classes. As their natural interpolation with Average, TrMean and MKrum select more clients and thus perform better. MKrum usually has good overall accuracy but still suffers from selection bias and has non-negligible recall drops. GeoMed can be expressed as a positive weighted averaging of all gradients, avoiding dropping useful clients. This property alleviates the selection bias, but also makes GeoMed more vulnerable to Byzantine attacks compared to MKrum. We also observe that bucketing aggregators can alleviate the selection bias of a robust aggregator at the cost of robustness. SelfRej is vulnerable to colluding attacks, while AvgRej is vulnerable to non-colluding Gaussian attacks. Among all aggregators, BOBA is the only method that achieves superior unbiasedness and robustness. It gives uniformly good performance under all settings.

Leveraging server data

Noticing that additional server data may introduce advantages to BOBA, we also study whether server data can enhance baseline aggregators for a fair comparison. We enhance the baselines with $^μ=(1−λ)Agg({gi}ni=1)+λ(1c∑cz=1γz)$ and test different $λ$ . Results are given in Appendix B.4, showing that additional server data cannot enhance these aggregators.

Ablation Study (RQ3)

Method	$\| B \| = 0$	$\| B \| = 3$
Method	$\| B \| = 0$	Gaussian	Signflip	Little
Average	93.6	83.3	60.6	93.5
Server	84.2	84.2	84.2	84.2
BOBA	93.4	93.6	93.0	93.4
BOBA-ES	93.3	93.6	93.3	93.5
BOBA w.o. stage 1	84.6	85.6	85.5	85.1
BOBA w.o. stage 2	93.4	57.5	39.4	93.4

Table 3: Ablation study (mean Acc % over 10 runs)

In this part, we test how each components of BOBA contribute to the aggregation. Since we need to test exhaustive searching, we train a 2-layer MLP on Spambase Dua and Graff (2017) with $| H | = 12$ honest clients, most of which have only one class of data. We let $f = 3, p_{min} = - 0.5$ and assume 30 samples per class. Results are shown in Table 3.

Two Stages

BOBA w.o. stage 1 skips the subspace optimization and uses the subspace initialized with server gradients. It fails to utilize clients’ data, and thus has a worse performance similar to training with server data only. It shows that stage 1 can effectively use clients’ data and thus is necessary. BOBA w.o. stage 2 does not discard any gradients in stage 2. Instead, it averages all projected gradients. It is not robust to large scale Byzantine gradients.

Optimizer

BOBA-ES uses exhaustive searching instead of alternating optimization to fit the honest subspace, which globally minimizes the trimmed reconstruction loss. We observe that BOBA has performance comparable to BOBA-ES. However, it only conducts TrSVD for $< 3$ times in average in each rounds, compared to $(\frac{n}{f})$ for BOBA-ES.

We can conclude that (1) both stages in BOBA are necessary to guarantee performance and robustness, and (2) alternating optimization significantly improves the efficiency while introducing only marginal performance loss.

7 Conclusion

This paper focuses on Byzantine-robustness in FL with label skewness. We show that existing aggregators suffer from selection bias and increased vulnerability, and propose BOBA to alleviate these problems. We believe robustness has been overlooked in FL, and hope to inspire future works on analyzing robustness with various flexible FL frameworks.

References

G. Baruch, M. Baruch, and Y. Goldberg (2019) A little is enough: circumventing defenses for distributed learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 8632–8642. External Links: Link Cited by: §4, 3rd item.
P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer (2017) Machine learning with adversaries: byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §1, §2, §3, §3, §5, 1st item, 3rd item.
Y. Chen, L. Su, and J. Xu (2017) Distributed statistical machine learning in adversarial settings: byzantine gradient descent. Proc. ACM Meas. Anal. Comput. Syst. 1 (2). External Links: Link, Document Cited by: §1, §2, §3, §5, 3rd item.
D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §6.
M. Fang, X. Cao, J. Jia, and N. Z. Gong (2020) Local model poisoning attacks to byzantine-robust federated learning. In 29th USENIX Security Symposium, USENIX Security 2020, August 12-14, 2020, S. Capkun and F. Roesner (Eds.), pp. 1605–1622. External Links: Link Cited by: 5th item.
A. Ghosh, J. Hong, D. Yin, and K. Ramchandran (2019) Robust federated learning in a heterogeneous environment. CoRR abs/1906.06629. External Links: Link, 1906.06629 Cited by: §2.
N. Halko, P. G. Martinsson, and J. A. Tropp (2011) Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review 53 (2), pp. 217–288. External Links: Document, Link, https://doi.org/10.1137/090771806 Cited by: §4.
Z. Hu, K. Shaloudegi, G. Zhang, and Y. Yu (2020) FedMGDA+: federated learning meets multi-objective optimization. CoRR abs/2006.11489. External Links: Link, 2006.11489 Cited by: §2.
P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, R. G. L. D’Oliveira, H. Eichner, S. E. Rouayheb, D. Evans, J. Gardner, Z. Garrett, A. Gascón, B. Ghazi, P. B. Gibbons, M. Gruteser, Z. Harchaoui, C. He, L. He, Z. Huo, B. Hutchinson, J. Hsu, M. Jaggi, T. Javidi, G. Joshi, M. Khodak, J. Konecný, A. Korolova, F. Koushanfar, S. Koyejo, T. Lepoint, Y. Liu, P. Mittal, M. Mohri, R. Nock, A. Özgür, R. Pagh, H. Qi, D. Ramage, R. Raskar, M. Raykova, D. Song, W. Song, S. U. Stich, Z. Sun, A. T. Suresh, F. Tramèr, P. Vepakomma, J. Wang, L. Xiong, Z. Xu, Q. Yang, F. X. Yu, H. Yu, and S. Zhao (2021) Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14 (1–2), pp. 1–210. External Links: Link, Document, ISSN 1935-8237 Cited by: §1.
S. P. Karimireddy, L. He, and M. Jaggi (2022) Byzantine-robust learning on heterogeneous datasets via bucketing. In International Conference on Learning Representations, External Links: Link Cited by: §2, §3, 4th item.
A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto. Cited by: §6.
Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86. Cited by: §6.
T. Li, S. Hu, A. Beirami, and V. Smith (2021) Ditto: fair and robust federated learning through personalization. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 6357–6368. External Links: Link Cited by: §2.
X. Li and D. Zhan (2021) FedRS: federated learning with restricted softmax for label distribution non-iid data. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, Singapore, August 14-18, 2021, F. Zhu, B. C. Ooi, and C. Miao (Eds.), pp. 995–1005. External Links: Link, Document Cited by: §1, §2.
L. Lyu, H. Yu, X. Ma, L. Sun, J. Zhao, Q. Yang, and P. S. Yu (2020) Privacy and robustness in federated learning: attacks and defenses. CoRR abs/2012.06337. External Links: Link, 2012.06337 Cited by: §1.
O. Marfoq, G. Neglia, A. Bellet, L. Kameni, and R. Vidal (2021) Federated multi-task learning under a mixture of distributions. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan (Eds.), pp. 15434–15447. External Links: Link Cited by: §2, §3.
B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, 20-22 April 2017, Fort Lauderdale, FL, USA, A. Singh and X. (. Zhu (Eds.), Proceedings of Machine Learning Research, Vol. 54, pp. 1273–1282. External Links: Link Cited by: §1, §3, 1st item, §6.
J. Peng, Z. Wu, and Q. Ling (2020) Byzantine-robust variance-reduced federated learning over distributed non-i.i.d. data. CoRR abs/2009.08161. External Links: Link, 2009.08161 Cited by: §5.
K. Pillutla, S. M. Kakade, and Z. Harchaoui (2022) Robust aggregation for federated learning. IEEE Transactions on Signal Processing 70 (), pp. 1142–1154. External Links: Document Cited by: §2.
Z. Shen, J. Cervino, H. Hassani, and A. Ribeiro (2022) An agnostic approach to federated learning with class imbalance. In International Conference on Learning Representations, External Links: Link Cited by: §1.
G. W. Stewart (1990) Perturbation theory for the singular value decomposition. Technical report University of Maryland at College Park, USA. Cited by: §4.
Z. Wu, Q. Ling, T. Chen, and G. B. Giannakis (2020) Federated variance-reduced stochastic gradient descent with robustness to byzantine attacks. IEEE Transactions on Signal Processing 68 (), pp. 4583–4596. External Links: Document Cited by: §5.
C. Xie, O. Koyejo, and I. Gupta (2019) Fall of empires: breaking byzantine-tolerant SGD by inner product manipulation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, A. Globerson and R. Silva (Eds.), Proceedings of Machine Learning Research, Vol. 115, pp. 261–270. External Links: Link Cited by: §4, 2nd item.
Q. Yang, Y. Liu, T. Chen, and Y. Tong (2019) Federated machine learning: concept and applications. ACM Trans. Intell. Syst. Technol. 10 (2). External Links: ISSN 2157-6904, Link, Document Cited by: §1.
D. Yin, Y. Chen, R. Kannan, and P. Bartlett (2018) Byzantine-robust distributed learning: towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 5650–5659. External Links: Link Cited by: §1, §2, §3, §5, 3rd item.
T. Yoon, S. Shin, S. J. Hwang, and E. Yang (2021) FedMix: approximation of mixup under mean augmented federated learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: §2.
F. X. Yu, A. S. Rawat, A. K. Menon, and S. Kumar (2020) Federated learning with only positive labels. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 10946–10956. External Links: Link Cited by: §2.
L. Zhang, B. Shen, A. Barnawi, S. Xi, N. Kumar, and Y. Wu (2021) FedDPGAN: federated differentially private generative adversarial networks framework for the detection of COVID-19 pneumonia. Inf. Syst. Frontiers 23 (6), pp. 1403–1415. External Links: Link, Document Cited by: §2.
X. Zhang, J. J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, Cited by: §6.
Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra (2018) Federated learning with non-iid data. CoRR abs/1806.00582. External Links: Link, 1806.00582 Cited by: §2.