On Differential Privacy for Federated Learning in Wireless Systems with Multiple Base Stations

Nima Tavangaran, Mingzhe Chen, Zhaohui Yang, José Mairton B. Da Silva Jr., and H. Vincent Poor

N. Tavangaran, J. M. B. da Silva Jr., and H. V. Poor are with the Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ, USA (e-mails: nimat@princeton.edu and poor@princeton.edu). M. Chen is with the Department of Electrical and Computer Engineering and Institute for Data Science and Computing, University of Miami, Coral Gables, FL, 33146 USA (e-mail: mingzhe.chen@miami.edu). Z. Yang is with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: yang_zhaohui@zju.edu.cn). J. M. B. da Silva Jr. is also with the School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden (e-mail: jmbdsj@kth.se). The work of N. Tavangaran was partly supported by the German Research Foundation (DFG) under Grant TA 1431/1-1. The work of H. V. Poor was supported by the U.S. National Science Foundation under Grants CCF-1908308 and CNS-2128448. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Abstract

In this work, we consider a federated learning model in a wireless system with multiple base stations and inter-cell interference. We apply a differentially private scheme to transmit information from users to their corresponding base station during the learning phase. We characterize the convergence behavior of the learning process by deriving an upper bound on its optimality gap. Furthermore, we define an optimization problem to reduce this upper bound and the total privacy leakage. To find locally optimal solutions of this problem, we first propose an algorithm that schedules the resource blocks and users. We then extend this scheme to reduce the total privacy leakage by optimizing the differential privacy artificial noise. We apply the solutions of these two procedures as parameters of a federated learning system. In this setting, we assume that each user is equipped with a classifier. Moreover, most communication cells are assumed to have fewer resource blocks than users. The simulation results show that our proposed scheduler improves the average accuracy of the predictions compared with a random scheduler. Furthermore, its extended version with a noise optimizer significantly reduces the amount of privacy leakage.

Differential privacy, federated learning, neural networks, wireless channel, multiple base stations.

I Introduction

Machine Learning (ML) systems are expected to play an important role in future mobile communication standards [1]. With increasing applications of ML schemes in wireless systems, new technologies are emerging to enhance the performance of such systems. On the other hand, wireless technology itself can also be deployed to enhance ML procedures [2]. Among possible candidates, Federated Learning (FL) has been shown to have considerable promise [3, 4, 5] and has the potential to benefit from wireless communication.

FL solves several issues of centralized ML systems by distributing the learning task among several edge devices. One advantage of using an FL system, which makes it a good fit in a wireless setting, is that edge devices do not need to transmit their local datasets to the server. This reduces the amount of wireless resources required to accomplish the given ML task. In addition, the privacy of each edge device is not completely compromised, since the server does not have direct access to the data [6].

FL schemes operating over wireless networks have been extensively researched in recent years; see for example [7, 8, 9, 10, 11, 12, 13]. In [8], the authors studied the effects of wireless parameters on the FL process. They derived an upper bound on the optimality gap of the convergence terms and proposed an optimization problem to minimize the upper bound by considering wireless parameters like resource allocation, user scheduling, and packet error rate. Some other works that studied the communication aspects of FL are [14, 15, 16]. Moreover, FL with several layers of aggregation or with hierarchy has been studied in [17, 18, 19, 20, 21, 22].

Although the training data of each device in FL is not transmitted to the server, a function of the local model (query) is still sent to the server. It has been shown that this local model might leak some information about the training data [23]. To mitigate this drawback, FL has been extensively studied together with a privacy preserving scheme called Differential Privacy (DP) [24].

DP-based schemes follow the principle that an individual should not be adversely affected much by having their data used in an analysis [25]. This powerful notion is well established and is applied in industry. To realize a DP-based FL system, each edge device adds artificial noise to the information it transmits. This noise provides a certain amount of privacy depending on the noise power and the sensitivity of the query function.

DP-based FL and its convergence behavior have been extensively studied; see for example [26, 27, 28, 29, 30, 31, 32]. In this regard, the work in [26] addresses the privacy implementation challenges through a combination of zero-concentrated differential privacy, local gradient perturbation, and secure aggregation. The work in [30] considers resource allocation and user scheduling to minimize the FL training delay under performance and DP constraints. Finally, [32] presents a closed-form global loss and privacy leakage of a DP-based FL system and then minimizes the loss and privacy leakage.

However, none of these works consider a joint learning and resource allocation scheme for DP-based FL that also accounts for the effects of inter-cell interference. In this paper, we adapt the framework in [8] and consider a wireless FL system in a multiple base station scenario. Additionally, we consider DP noise added to the gradients [26] and combine this approach with resource scheduling.

The goal of the FL system here is to train a global model for a given predictor. We introduce an iterative DP-based FL scheme with two levels of aggregation (Algorithm 1) and then derive an upper bound on the optimality gap of its convergence terms (Theorem 1).

We then propose an optimization problem whose goal is to improve the convergence of the upper bound on the optimality gap of Algorithm 1 and simultaneously reduce the total privacy leakage. In this regard, the optimization problem is with respect to certain variables like user and resource scheduling, uplink transmit powers, and the amount of DP noise that is applied by each user to its transmitting information.

Since the proposed optimization problem is a non-linear multi-variable mixed-integer program, we divide it into two simpler schemes.

First, we present a suboptimal approach to minimize the objective function only with respect to the resource scheduling variables in a sequential manner, i.e., cell by cell. This reduces the original problem to a linear integer programming task and substantially simplifies the implementation. We call this scheme the optimal scheduler (OptSched). Since this approach proceeds sequentially from one cell to the next, the optimal transmit powers must be adjusted carefully due to the effects of inter-cell interference. To tackle this problem, we introduce a procedure to determine the users’ optimal transmit powers by solving a simple optimization problem.

Next, we enhance the OptSched scheme by further minimizing the objective function of the proposed optimization problem with respect to the DP noise. This leads us to a convex optimization problem with respect to the DP noise standard deviations. We call this extended scheme the optimal scheduler with DP optimizer (OptSched+DP).

We present all the numerical optimizations and benchmarking results. In this regard, we apply Python optimization packages like CVXPY, CVXOPT, GLPK, and ECOS [33, 34, 35, 36, 37]. The numerical results show that our proposed schemes (OptSched and OptSched+DP) reduce the objective function of the optimization task substantially compared with the case in which we randomly allocate the resources and apply the DP noise.

Next, we apply these (sub-)optimal parameters to our iterative learning scheme (Algorithm 1). In this regard, each user is equipped with a fully connected multi-layer neural network as a classifier. Furthermore, we assume that communication cells have mainly more users than available resource blocks. This is a legitimate assumption due to the bandwidth limitation. We then perform simulations to measure the accuracy, loss, and the amount of the privacy leakage in such a system for the proposed algorithms. To realize the simulations, we apply the TensorFlow, NumPy, and Matplotlib packages [38, 39, 40].

The simulations show that the OptSched scheme predominantly improves the classification accuracy by scheduling the users who have larger data chunks and better uplink channels. The OptSched+DP scheme, on the other hand, achieves a significant reduction in privacy leakage of individual users by systematically adjusting the DP noise power and moderately sacrificing accuracy.

Notation: We denote vectors by lowercase bold letters, e.g., $w$. Matrices are represented by uppercase bold letters like $X$, or the identity matrix $I_{d}$ with $d$ rows and $d$ columns. Sets are denoted by Calligraphic fonts like $X$. Random mechanisms as a special kind of functions are represented by Fraktur fonts, e.g., $M$. The transpose of a vector $x$ is denoted by $x^{⊺}$. Logarithms are assumed to be to base 2. The set of real numbers is represented by $R$. $[R]$ denotes the set $\{1, 2, \dots, R\}$.

II System Model

We begin this section by reviewing some preliminary notions on DP that are required in this work. The complete list of definitions can be found in [24, 25, 41].

II-A Differential Privacy Model

Let a data universe $X$ and the distribution $P_{X}$ on it be given. Assume that a database is denoted by a matrix $X \in X^{K \times m}$ and contains $K$ rows of independent and identically distributed (i.i.d.) $m$-dimensional samples (row vectors). Two databases $X, \tilde{X} \in X^{K \times m}$ are called adjacent if they differ only in one row.

A query (mechanism) $q : X^{K \times m} \to R^{d}$ is a function which takes a database $X \in X^{K \times m}$ as input and gives a $d$ -dimensional output. If the output of the query contains randomness then it is called a randomized mechanism.

In the following, we introduce the notion of privacy for randomized mechanisms, which are defined on a given set of databases $X^{K \times m}$ .

Definition 1.

A randomized mechanism $M : X^{K \times m} \to R^{d}$ is said to be $(ϵ, δ)$-Differentially Private, or for short $(ϵ, δ)$-DP, if for every pair of adjacent databases $X, \tilde{X} \in X^{K \times m}$, we have that

$\Pr (M (X) \in W) \leq e^{ϵ} \Pr (M (\tilde{X}) \in W) + δ$

(1)

holds for any $W \subset R^{d}$ .

In this work, we apply a relaxed version of the $(ϵ, δ)$ -DP that is more suitable for Gaussian mechanisms.

Definition 2.

A randomized mechanism $M : X^{K \times m} \to R^{d}$ is said to be $ρ$-zero-Concentrated Differentially Private (CDP), or for short $ρ$-zCDP, if

$D_{α} (M (X) \,\|\, M (\tilde{X})) \leq ρ α$

(2)

holds for every pair of adjacent databases $X, \tilde{X} \in X^{K \times m}$ and all $α \in (1, \infty)$, where $D_{α}$ is the $α$-Rényi divergence [41].

II-B Federated Learning Model

Based on the notions from the previous section, we introduce our privacy preserving FL model for a system with multiple base stations. Let a collection of base stations denoted by the set $S$ be given such that they can communicate with each other through a main server. Assume that each base station $s \in S$ serves a set of edge devices (users) denoted by $U_{s}$, where the users in $U_{s}$ have some arbitrary order. Let $|U_{s}|$ denote the size of this set.

We assume that each user $i \in U_{s}$ assigned to the base station $s$ has access to a database

$X_{s, i} \coloneqq (x_{s, i}^{(1)}, x_{s, i}^{(2)}, \dots, x_{s, i}^{(K_{s, i})})^{⊺} \in R^{K_{s, i} \times m},$

where $K_{s, i}$ is the number of samples (row vectors) in the database $X_{s, i}$ . Each row of the above matrix, say $x_{s, i}^{(k)}$ , is an $m$ -dimensional data sample given by

$x_{s, i}^{(k)} \coloneqq (x_{s, i}^{(k)} (1), x_{s, i}^{(k)} (2), \dots, x_{s, i}^{(k)} (m - 1), y_{s, i}^{(k)}),$

where the first $m - 1$ elements are the inputs and the last entry $y_{s, i}^{(k)}$ is the output of the training data.

In the first step of the FL scheme at round $t = 1$, the main server broadcasts a weight vector $w^{(t)} \in R^{d}$ to all base stations. This vector is called the global model and can be initialized randomly. Then, each base station $s$ transmits this model to all of its edge devices. Suppose that only a subset of the users in each cell is active and participates in the learning process. We denote the active users in cell $s$ by:

$a_{s} \coloneqq (a_{s, i})_{i \in U_{s}} \in \{0, 1\}^{|U_{s}|},$

(3)

where $a_{s, i} = 1$ indicates that user $i \in U_{s}$ is scheduled [8] to participate in the learning process and $a_{s, i} = 0$ , otherwise.

Fig. 1: FL model with multiple base stations.

Fig. 1 shows an example of a model with multiple base stations. In this example, all users of cell $s$ (depicted on the bottom of the figure), are scheduled to participate in the FL process and receive the vector $w^{(t)}$ .

Each scheduled user computes a local loss function depending on the ML algorithm that is applied in the system. We denote the loss function of a user $i \in U_{s}$ by $l (w^{(t)}, x_{s, i}^{(k)})$ , which is a function of the global model and its training sample. Next, this user computes the gradient [42, 4] of its loss function over all given samples as a query function

$q_{s, i}^{(t)} (X_{s, i}) \coloneqq \frac{1}{K_{s, i}} \sum_{k = 1}^{K_{s, i}} \nabla l (w^{(t)}, x_{s, i}^{(k)}),$

(4)

where the gradients are with respect to $w^{(t)}$ .

Similarly to [26], the user then adds Gaussian noise $n_{s, i}^{(t)} \sim N (0, σ_{s, i}^{2} I_{d})$ to the outcome of the query to implement the randomized mechanism $M_{s, i}^{(t)}$ as follows

$M_{s, i}^{(t)} (X_{s, i}) \coloneqq q_{s, i}^{(t)} (X_{s, i}) + n_{s, i}^{(t)}.$

(5)

In this context, $n_{s, i}^{(t)}$ is assumed to be independent of all other random variables in our model, including the DP noise that is applied in previous iterations. The main reason for applying Gaussian noise is that it gives tight bounds when applied with zCDP [41]. The following vector denotes the noise standard deviations of the users in the cell $s$:

$σ_{s} \coloneqq (σ_{s, i})_{i \in U_{s}}.$

(6)

Next, the edge device $i \in U_{s}$ updates its local model by

$w_{s, i}^{(t + 1)} \coloneqq w^{(t)} - λ M_{s, i}^{(t)} (X_{s, i}),$

(7)

where $λ > 0$ is the learning step size. It then transmits its updated model $w_{s, i}^{(t + 1)}$ to its corresponding base station $s$. For the sake of simplicity, we consider only a single local update at each edge device in each round and use batch gradient descent.
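The local step in (4), (5), and (7) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the squared-error loss, the function names `avg_gradient` and `dp_local_update`, and the toy data are all introduced here. Each row of `X` follows the layout above, with $m - 1$ inputs followed by the label. Gradient clipping, which Assumption 1 would require in practice, is left out for brevity.

```python
import random

def avg_gradient(w, X):
    """Query (4): average per-sample gradient for an assumed loss
    l(w, x) = 0.5 * (w . inputs - y)^2, with y the last entry of each row."""
    d = len(w)
    grad = [0.0] * d
    for row in X:
        inputs, y = row[:-1], row[-1]
        err = sum(wj * xj for wj, xj in zip(w, inputs)) - y
        for j in range(d):
            grad[j] += err * inputs[j]
    return [g / len(X) for g in grad]

def dp_local_update(w, X, sigma, lam, rng):
    """Mechanism (5) followed by update (7):
    w - lam * (q(X) + n) with n ~ N(0, sigma^2 I)."""
    q = avg_gradient(w, X)
    noisy = [qj + rng.gauss(0.0, sigma) for qj in q]
    return [wj - lam * nj for wj, nj in zip(w, noisy)]

rng = random.Random(0)
X = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]  # two samples, m = 3 (two inputs + label)
w = [0.0, 0.0]
w_clean = dp_local_update(w, X, sigma=0.0, lam=0.1, rng=rng)  # no DP noise
w_noisy = dp_local_update(w, X, sigma=0.5, lam=0.1, rng=rng)  # with DP noise
```

With `sigma=0.0` the update reduces to plain gradient descent, which makes it easy to see how much the DP noise perturbs the transmitted model.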

In the next step, base station $s$ aggregates all received updated models $w_{s, i}^{(t + 1)}$ as given below:

$w_{s}^{(t + 1)} \coloneqq \frac{1}{\sum_{i \in U_{s}} K_{s, i} a_{s, i}} \sum_{i \in U_{s}} K_{s, i} a_{s, i} w_{s, i}^{(t + 1)},$

(8)

where $a_{s, i}$ is the scheduling parameter and was defined as an element of the vector $a_{s}$ in Eq. (3).

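Subsequently, all base stations send their aggregated models to the main server. There, the global model at round $t + 1$ is computed as follows

$w^{(t + 1)} \coloneqq \frac{1}{K_{a}} \sum_{s \in S} \Big(\sum_{i \in U_{s}} K_{s, i} a_{s, i}\Big) w_{s}^{(t + 1)},$

(9)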

where

$K_{a} \coloneqq \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} a_{s, i}$

(10)

is the total number of training samples of all scheduled users.
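The two aggregation levels can be sketched as follows. The sketch assumes that the main server weights each cell's model by that cell's number of scheduled samples, so that the weights sum to $K_{a}$ from (10); under that assumption the result coincides with a single sample-weighted average over all scheduled users. The function names and toy numbers are illustrative only.

```python
def cell_aggregate(models, K, a):
    """Eq. (8): sample-weighted mean of the scheduled users' models in one cell.
    Assumes at least one user in the cell is scheduled."""
    weight = sum(Ki * ai for Ki, ai in zip(K, a))
    d = len(models[0])
    agg = [sum(Ki * ai * m[j] for m, Ki, ai in zip(models, K, a)) / weight
           for j in range(d)]
    return agg, weight

def global_aggregate(cell_models, cell_weights):
    """Server step: weight each cell model by its scheduled sample count."""
    K_a = sum(cell_weights)  # Eq. (10)
    d = len(cell_models[0])
    return [sum(ws * m[j] for m, ws in zip(cell_models, cell_weights)) / K_a
            for j in range(d)]

# Two cells; the third user of cell 0 is not scheduled (a = 0).
models = {0: [[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], 1: [[2.0, 4.0]]}
K = {0: [10, 30, 50], 1: [20]}
a = {0: [1, 1, 0], 1: [1]}

cells = [cell_aggregate(models[s], K[s], a[s]) for s in (0, 1)]
w_global = global_aggregate([c[0] for c in cells], [c[1] for c in cells])
```

In this toy example the unscheduled user's model has no influence on `w_global`, and both coordinates equal the direct weighted average $140/60$ over the three scheduled users.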

Algorithm 1 Privacy preserving FL with multiple stations
1: The main server broadcasts $(a_{s}, σ_{s})_{s \in S}$, which are given by (3) and (6), to all base stations and their users.
2: The main server initializes the global model $w^{(0)}$.
3: for $t = 0 : T$ do
4:   The main server broadcasts $w^{(t)}$ to all base stations.
5:   for base stations $s \in S$ in parallel do
6:     Base station $s$ broadcasts $w^{(t)}$ to all its users.
7:     for users $i \in U_{s}$ in parallel do
8:       if $a_{s, i} = 1$ then
9:         The user $i \in U_{s}$ updates its model as in (7).
10:         The user $i \in U_{s}$ then sends $w_{s, i}^{(t + 1)}$ back to the base station $s$.
11:       end if
12:     end for
13:     The base station $s$ aggregates the received models as in (8).
14:     The base station $s$ then sends $w_{s}^{(t + 1)}$ back to the main server.
15:   end for
16:   The main server aggregates all models as in (9).
17: end for

Next, the main server broadcasts the new global model $w^{(t + 1)}$ to the base stations where it is then forwarded further to their corresponding users. This process continues for a given number of $T$ iterations. Algorithm 1 summarizes these steps, where $(a_{s}, σ_{s})_{s \in S}$ are assumed to be shared with all participants at the beginning of the learning process.

One difference between Algorithm 1 and other approaches, such as the FL schemes in [26, 8], is that here the aggregation is done in two steps. Additionally, the noise standard deviations $σ_{s, i}$ at the users are not necessarily identical here, and a joint optimal user scheduling and DP noise adjustment is possible.

To characterize the DP noise, we need to make the following assumption which can be achieved in practice by weight clipping [43, 26].

Assumption 1.

The gradients of the local loss functions are always upper bounded:

$\| \nabla l (w, x) \|_{2} \leq L.$

In [26], it was shown that if Assumption 1 holds, then after $T$ iterations, a mechanism like $M_{s, i}^{(t)} (X_{s, i})$ is $ρ$ -zCDP where

$ρ = 2 T \Big(\frac{L}{K_{s, i} σ_{s, i}}\Big)^{2}$

(11)

is the privacy leakage.
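For intuition, (11) can be recovered from two standard zCDP facts [41]: a Gaussian mechanism with $\ell_2$-sensitivity $Δ$ and noise standard deviation $σ$ is $(Δ^{2} / (2 σ^{2}))$-zCDP, and zCDP parameters add up under composition. The sketch below assumes the worst-case sensitivity $Δ = 2 L / K_{s, i}$, which follows from Assumption 1 because adjacent databases differ in a single row of the average in (4). The symbols $Δ$ and $ρ_{\text{round}}$ are introduced here only for illustration.

```latex
\begin{align*}
\Delta &= \max_{X, \tilde{X}\ \text{adjacent}}
          \big\| q_{s,i}^{(t)}(X) - q_{s,i}^{(t)}(\tilde{X}) \big\|_2
        \le \frac{2L}{K_{s,i}}, \\
\rho_{\text{round}} &= \frac{1}{2\sigma_{s,i}^2} \left( \frac{2L}{K_{s,i}} \right)^2
        = \frac{2L^2}{K_{s,i}^2 \sigma_{s,i}^2}, \\
\rho &= T \, \rho_{\text{round}}
        = 2T \left( \frac{L}{K_{s,i}\, \sigma_{s,i}} \right)^2 .
\end{align*}
```

The leakage therefore grows linearly with the number of rounds $T$ and shrinks quadratically with both the local dataset size $K_{s, i}$ and the noise standard deviation $σ_{s, i}$.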

Note that the DP noise affects the convergence of the learning process as well. In Section III, we study the convergence behavior of Algorithm 1 with respect to the vector parameters $a_{s}$ and $σ_{s}$ .

III Convergence Analysis

In the following, we define the global loss as a function of local losses. We then derive an upper bound on the optimality gap that appears in each round of Algorithm 1.

The global loss function is computed over all base stations and is given by

$f (w^{(t)}) \coloneqq \frac{1}{K} \sum_{s \in S} \sum_{i \in U_{s}} \sum_{k = 1}^{K_{s, i}} l (w^{(t)}, x_{s, i}^{(k)}),$

(12)

where $K = \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i}$ is the total number of samples of all users, both scheduled and non-scheduled.

The following assumptions are necessary to analyze the global loss function $f$ and have been used before in the literature [44].

Assumption 2.

The loss function $f : R^{d} \to R$ has a minimum value, i.e., there exists an input vector $w^{*} = \arg \min_{w \in R^{d}} f (w)$.

Assumption 3.

The gradient $\nabla f (w)$ is uniformly $L$ -Lipschitz continuous with respect to the model $w$ , i.e.,

$\| \nabla f (w) - \nabla f (w^{'}) \|_{2} \leq L \| w - w^{'} \|_{2}$ for all $w, w^{'} \in R^{d}$.

Assumption 4.

The loss function $f : R^{d} \to R$ is $μ$ -strongly convex, i.e.,

$f (w) \geq f (w^{'}) + (w - w^{'})^{⊺} \nabla f (w^{'}) + \frac{1}{2} μ \| w - w^{'} \|_{2}^{2}$

holds for all $w, w^{'} \in R^{d}$ .

Assumption 5.

The loss function $f : R^{d} \to R$ is twice continuously differentiable. Then, Assumptions 3 and 4 are equivalent to the following:

$μ I_{d} ⪯ \nabla^{2} f (w) ⪯ L I_{d}$ for all $w \in R^{d}$.

Assumption 6.

There exist constants $ξ_{1} \geq 0$ and $ξ_{2} \geq 1$ such that for any training sample $x$ and model $w \in R^{d}$, the following inequality holds

$\| \nabla l (w, x) \|_{2}^{2} \leq ξ_{1} + ξ_{2} \| \nabla f (w) \|_{2}^{2}.$

Now, we are ready to derive an upper bound on the optimality gap of Algorithm 1.

Theorem 1.

Let Assumptions 2–6 hold. Then, the following upper bound on the optimality gap for Algorithm 1 holds:

$E [f (w^{(t + 1)}) - f (w^{*})] \leq C_{1} E [f (w^{(t)}) - f (w^{*})] + C_{2} + C_{3},$

where the expectation is taken over the DP noise and

	$C_{1}$
	$C_{2}$	$= \frac{2 ξ_{1}}{L K^{2}} \Big(\sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} (1 - a_{s, i})\Big)^{2},$
	$C_{3}$

Proof.

The proof is provided in the Appendix. ∎

Theorem 1 shows that the expected difference between the global loss and the optimal value $f (w^{*})$ per iteration is upper bounded by expressions that depend on $C_{1}$, $C_{2}$, and $C_{3}$. Hence, by lowering the values of $C_{1}$, $C_{2}$, and $C_{3}$, the convergence of Algorithm 1 should be improved. In addition, Theorem 1 shows that the upper bound on the optimality gap is influenced by the scheduling parameters $a_{s, i}$ and the DP noise standard deviations $σ_{s, i}$. We also observe that the upper bound converges only if $C_{1} < 1$. In Section IV, we design an optimal scheduler and a DP optimizer based on these variables and their effect on this upper bound.
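To see why $C_{1} < 1$ is needed, one can unroll the one-step bound of Theorem 1. The sketch below writes $e_{t}$ for $E [f (w^{(t)}) - f (w^{*})]$, a shorthand used only here, and assumes that $0 \leq C_{1} < 1$ and that $C_{1}$, $C_{2}$, and $C_{3}$ do not change across rounds, which is the case when the scheduling and noise parameters are fixed before training. Repeated substitution then gives a geometric series:

```latex
\begin{align*}
e_{t} &\le C_1 \, e_{t-1} + (C_2 + C_3) \\
      &\le C_1^{t} \, e_{0} + (C_2 + C_3) \sum_{j=0}^{t-1} C_1^{j}
       = C_1^{t} \, e_{0} + (C_2 + C_3) \, \frac{1 - C_1^{t}}{1 - C_1} .
\end{align*}
```

As $t$ grows, the first term vanishes and the bound approaches $(C_{2} + C_{3}) / (1 - C_{1})$, so smaller values of all three constants tighten the residual gap.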

IV Learning over Wireless Channels with Inter-cell Interference

In this section, we consider other wireless parameters of the communication system and connect them to the notion of learning. These wireless parameters include resource allocation, transmit power consumption, fading channels, inter-cell interference, and communication rate.

We assume that the users apply an Orthogonal Frequency-Division Multiple Access (OFDMA) technique in the uplink channel to transmit data to their corresponding base station. In this case, each edge device $i \in U_{s}$ is assigned a resource block indexed by $n \in [R]$ where $R$ is the total number of available uplink transmission resource blocks in each cell.

We define the uplink resource allocation matrix $R_{s}$ in a given cell $s$ as

$R_{s} \coloneqq (r_{s, i}^{(1)}, r_{s, i}^{(2)}, \dots, r_{s, i}^{(R)})_{i \in U_{s}},$

(13)

where $r_{s, i}^{(n)} \in \{0, 1\}$. Each row of this matrix represents the resource allocation for a user $i \in U_{s}$. In this case, $r_{s, i}^{(n)} = 1$ indicates that the edge device $i \in U_{s}$ uses resource block $n$ in the uplink transmission and $r_{s, i}^{(n)} = 0$, otherwise. Moreover, we assume that each active user ($a_{s, i} = 1$) is assigned only one resource block and inactive users ($a_{s, i} = 0$) are not assigned any resource block at all, i.e.,

$\sum_{n = 1}^{R} r_{s, i}^{(n)} = a_{s, i}, \quad \forall i \in U_{s}, s \in S.$

(14)

In addition, edge devices in a given cell $s$ do not interfere with each other, i.e.,

$\sum_{i \in U_{s}} r_{s, i}^{(n)} \leq 1, \quad \forall n \in [R], s \in S.$

(15)

To formulate the communication rate, we first need to define the transmit powers of the users. Let the uplink transmit power vector of all edge devices at a given cell $s \in S$ be denoted by

$p_{s} \coloneqq (p_{s, i})_{i \in U_{s}},$

where $p_{s, i}$ denotes the transmit power of the user $i \in U_{s}$ . Moreover, the maximum transmit power of each user in any cell is denoted by $P_{max}$ .

Fig. 2: Uplink stage of the wireless FL model with multiple base stations and inter-cell interference.

Another wireless parameter, which is of great importance in the considered system with multiple base stations, is the inter-cell interference. Let $I_{s}^{(n)} (\tilde{s})$ denote the interference signal power [45] from the cell $\tilde{s} \in S \setminus \{s\}$ that affects the uplink signal received by the base station $s$ on the resource block $n$. In this case, inequality (15) implies that $I_{s}^{(n)} (\tilde{s})$ is determined by the transmit power of at most one user in the cell $\tilde{s}$, namely the user that transmits on the resource block $n$. In other words, the received interference signal power can be formulated as

$I_{s}^{(n)} (\tilde{s}) = \sum_{i \in U_{\tilde{s}}} h_{s, i} r_{\tilde{s}, i}^{(n)} p_{\tilde{s}, i}.$

(16)

The term $h_{s, i}$ in (16) is the channel gain between the user $i \in U_{\tilde{s}}$ and the base station $s$ and can be computed by determining the pathloss [46]. The channel gain is given by

$h_{s, i} = l^{2} \Big(\frac{c}{4 π f}\Big)^{2} \Big(\frac{1}{d_{s, i}}\Big)^{3},$

(17)

where $f$ is the uplink center frequency, $d_{s, i}$ is the distance between the user $i \in U_{\tilde{s}}$ and base station $s$, $l$ is a realization of a Rayleigh-distributed random variable with unit scale parameter, and $c$ is the speed of light.

Fig. 2 illustrates an example of a wireless communication system with three base stations $s, s^{'},$ and $s^{''}$ in the uplink stage. In this example, the received signals on the resource block $n$ at the base station $s^{'}$ are affected by the interference signal power $I_{s^{'}}^{(n)} (s)$ from cell $s$ . Furthermore, base station $s^{''}$ is affected by the interference signal power $I_{s^{''}}^{(n)} (s^{'})$ from cell $s^{'}$ .

Let the uplink fading channel between each user $i \in U_{s}$ and its corresponding base station $s$ be fixed and equal to $h_{s, i}$ . Also assume that the uplink bandwidth is denoted by $B$ . Furthermore, all participants are assumed to have perfect channel knowledge. It is known (see e.g. [45, 8]) that the maximum uplink communication rate between each user $i \in U_{s}$ and its corresponding base station $s$ can be formulated as

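$c_{s, i}^{U} \coloneqq \sum_{n = 1}^{R} r_{s, i}^{(n)} B \log \Big(1 + \frac{p_{s, i} h_{s, i}}{\sum_{\tilde{s} \in S \setminus \{s\}} I_{s}^{(n)} (\tilde{s}) + B N_{0}}\Big),$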

(18)

where $N_{0}$ is the thermal noise power spectral density. We assume that the minimum required uplink communication rate between each user and its base station is denoted by a constant $R_{min}$ .
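The channel gain in (17) and the rate in (18) translate directly into code. The following sketch evaluates both for a single scheduled user on the one resource block it occupies. The function names and the numerical values in the usage lines are assumptions chosen for illustration, and the fading realization is passed in as a plain number rather than sampled.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in m/s

def channel_gain(fading, freq_hz, dist_m):
    """Eq. (17): h = l^2 * (c / (4 pi f))^2 * (1 / d)^3."""
    return fading ** 2 * (C_LIGHT / (4 * math.pi * freq_hz)) ** 2 * (1.0 / dist_m) ** 3

def uplink_rate(p, h, interference, bandwidth_hz, n0):
    """Eq. (18) on the user's resource block; `interference` is the summed
    inter-cell interference power on that block, logarithm to base 2."""
    sinr = p * h / (interference + bandwidth_hz * n0)
    return bandwidth_hz * math.log2(1.0 + sinr)

# Illustrative values only: 2 GHz carrier, 1 MHz bandwidth, 10 mW transmit power.
h_near = channel_gain(1.0, 2e9, 100.0)
h_far = channel_gain(1.0, 2e9, 200.0)
rate_clean = uplink_rate(0.01, h_near, 0.0, 1e6, 4e-21)
rate_interf = uplink_rate(0.01, h_near, 1e-13, 1e6, 4e-21)
```

Doubling the distance reduces the gain by a factor of eight because of the cubic distance term, and any positive interference power lowers the achievable rate.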

Before applying these wireless parameters in the learning process, we first need to introduce another measure based on the DP standard deviations $(σ_{s})_{s \in S}$ . In this regard, we define the total privacy leakage as follows:

$2 T L^{2} \sum_{s \in S} \sum_{i \in U_{s}} \Big(\frac{1}{K_{s, i} σ_{s, i}}\Big)^{2} a_{s, i}.$

(19)

In this definition, the summands in (19) are computed by multiplying the privacy leakage given by (11) at each user with the scheduling variable $a_{s, i}$. Minimizing this measure reduces the individual privacy leakage as well. This improvement is due to the systematic adjustment of the DP noise power at each user. In this case, edge devices that have a larger number of samples $K_{s, i}$ are assigned less DP noise power.
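A small numerical sketch of (11) and (19) shows how the dataset size enters the leakage. The dictionary layout keyed by `(s, i)` and the numbers are assumptions made for this example.

```python
def user_leakage(T, L, K, sigma):
    """Eq. (11): rho = 2 T (L / (K sigma))^2 for one user."""
    return 2.0 * T * (L / (K * sigma)) ** 2

def total_leakage(T, L, K, sigma, a):
    """Eq. (19): sum of (11) over scheduled users.
    K, sigma, and a are dicts keyed by (cell, user)."""
    return sum(a[u] * user_leakage(T, L, K[u], sigma[u]) for u in K)

T, L = 10, 1.0
K = {(0, 0): 100, (0, 1): 50}
sigma = {(0, 0): 0.1, (0, 1): 0.1}

both = total_leakage(T, L, K, sigma, {(0, 0): 1, (0, 1): 1})
only_first = total_leakage(T, L, K, sigma, {(0, 0): 1, (0, 1): 0})
```

At equal noise levels, the user with 100 samples contributes a leakage of 0.2 while the user with 50 samples contributes 0.8, so a user with more samples can tolerate less noise for the same leakage.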

The parameters $R_{s}$ and $σ_{s}$ play a critical role in improving the convergence rate of Algorithm 1 (cf. Theorem 1) and reducing the total privacy leakage. Furthermore, the parameter $p_{s}$ is critical in establishing reliable communication. By minimizing $C_{1}$ and $C_{2}$ with respect to these parameters, the upper bound on the optimality gap in Theorem 1 is reduced and thus the convergence rate of the FL procedure should improve. To this end, it is sufficient to minimize only the expressions inside the squared term in $C_{1}$.

Therefore, we propose an optimization problem over the variables $(R_{s}, p_{s}, σ_{s})_{s \in S}$ and minimize the values of $C_{1}$ and $C_{2}$ from Theorem 1 and the total privacy leakage given by (19). In this combined formulation, we assume that other FL parameters, such as $L, μ, ξ_{1}, ξ_{2}, d,$ and $T$ , are constant. The main server can then solve this optimization problem and then broadcast the results to all base stations before Algorithm 1 starts.

Since it is hard to directly solve a multi-objective optimization problem for both scheduling and total privacy leakage, we formulate the problem as a single-objective optimization task as follows:

$\underset{(R_{s}, p_{s}, σ_{s})_{s \in S}}{\text{minimize}} \quad \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} \Big[1 - \sum_{n = 1}^{R} r_{s, i}^{(n)}\Big]$
$\qquad + γ \sum_{s \in S} \sum_{i \in U_{s}} \Big(\frac{1}{K_{s, i} σ_{s, i}}\Big)^{2} \sum_{n = 1}^{R} r_{s, i}^{(n)}$		(20)
subject to
$\sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} σ_{s, i}^{2} \sum_{n = 1}^{R} r_{s, i}^{(n)} \leq V_{max} \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} \sum_{n = 1}^{R} r_{s, i}^{(n)},$		(21)
$K_{s, i} σ_{s, i} \geq N_{min} \sum_{n = 1}^{R} r_{s, i}^{(n)},$	$\forall s \in S, i \in U_{s},$	(22)
$\sum_{i \in U_{s}} r_{s, i}^{(n)} \leq 1,$	$\forall s \in S, n \in [R],$	(23)
$\sum_{n = 1}^{R} r_{s, i}^{(n)} \leq 1$ and $r_{s, i}^{(n)} \in \{0, 1\},$	$\forall s \in S, i \in U_{s},$	(24)
$0 \leq p_{s, i} \leq P_{max},$	$\forall s \in S, i \in U_{s},$	(25)
$c_{s, i}^{U} \geq R_{min} \sum_{n = 1}^{R} r_{s, i}^{(n)},$	$\forall s \in S, i \in U_{s},$	(26)

where $γ > 0$ is a constant and is used to balance the optimization of the scheduling and the total privacy leakage. Typically, the value of the constant $γ$ can be obtained by hyperparameter tuning and simulations.

Minimizing the first term in the objective function in (20) improves the convergence of Algorithm 1 and is computed by applying (14) to the summation term in $C_{1}$ of Theorem 1. Minimizing the second term, on the other hand, reduces the total privacy leakage at all users and is given by (19).

Constraint (21) guarantees that the DP noise error, which is characterized by the term $C_{3}$ of Theorem 1, is less than a given constant $V_{max}$ . To derive condition (21), we first consider the following upper bound on the squared term in $C_{3}$

$\Big(\frac{K_{s, i} a_{s, i}}{K_{a}} σ_{s, i}\Big)^{2} \leq \frac{K_{s, i} a_{s, i}}{K_{a}} σ_{s, i}^{2},$

(27)

which follows by (10) and $K_{s, i} a_{s, i} \leq K_{a}$ . Constraint (21) then follows by applying (10) and (14) to this upper bound and setting it to be smaller than $V_{max}$ . We can then control the amount of DP noise variance and its error by adjusting the constant $V_{max}$ .
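The step from (27) to (21) can be written out as follows. This sketch assumes that the squared terms bounded in (27) enter $C_{3}$ as a sum over all users and cells, which matches the way (21) is stated; the constant factors of $C_{3}$ are left out, and $K_{a} > 0$ is assumed.

```latex
\begin{align*}
\sum_{s \in S} \sum_{i \in U_s} \Big( \frac{K_{s,i}\, a_{s,i}}{K_a} \sigma_{s,i} \Big)^2
  &\le \frac{1}{K_a} \sum_{s \in S} \sum_{i \in U_s} K_{s,i}\, a_{s,i}\, \sigma_{s,i}^2
  && \text{by (27)}, \\
\frac{1}{K_a} \sum_{s \in S} \sum_{i \in U_s} K_{s,i}\, a_{s,i}\, \sigma_{s,i}^2 \le V_{\max}
  &\iff \sum_{s \in S} \sum_{i \in U_s} K_{s,i}\, \sigma_{s,i}^2\, a_{s,i}
        \le V_{\max} \sum_{s \in S} \sum_{i \in U_s} K_{s,i}\, a_{s,i}
  && \text{by (10)}.
\end{align*}
```

Substituting $a_{s, i} = \sum_{n = 1}^{R} r_{s, i}^{(n)}$ from (14) into the last inequality yields constraint (21), which is linear in the scheduling variables for fixed $σ_{s, i}$.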

Conditions (23)-(24) provide the resource allocation constraints, whereas (25)-(26) restrict the transmit power to a maximum amount $P_{max}$ and ensure a minimum communication rate $R_{min}$ for each user in each cell, respectively. Finally, constraint (22) guarantees an upper bound on the privacy leakage of the users individually due to (11). In this case, the constant $N_{min}$ controls the minimum amount of DP noise at each user.

We notice that the variables $a_{s}$ and $R_{s}$, which are given by (3) and (13), are related due to (14). Therefore, $a_{s}$ does not appear as a minimization variable.

The optimization problem in (20) is not easy to solve. However, we can subdivide it into simpler problems and search for (sub-)optimal solutions. The main server can then compute and broadcast these (sub-)optimal $(R_{s}^{*}, p_{s}^{*}, σ_{s}^{*})_{s \in S}$ to all base stations where they can be forwarded to the users. These computations and initialization should be done prior to the beginning of Algorithm 1.

V Algorithm Design

In this section, we propose two suboptimal sequential algorithms to solve the optimization problem in (20). First, for fixed DP noise the objective function in (20) is minimized with respect to users’ transmit powers and resource block allocation in a cell-by-cell manner. In the second part, with given transmit power and resource block allocation, the optimization problem in (20) becomes convex with respect to the DP noise standard deviations.

V-A Optimal Scheduler

Let the DP noise standard deviations be given such that condition (22) is always satisfied. In the following, we consider the joint transmit power and resource block allocation problem, which is a simplified version of (20).

	$\underset{(R_{s}, p_{s})_{s \in S}}{\text{minimize}} \quad \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} \Big[1 - \sum_{n = 1}^{R} r_{s, i}^{(n)}\Big]$
	$\qquad + γ \sum_{s \in S} \sum_{i \in U_{s}} \Big(\frac{1}{K_{s, i} σ_{s, i}}\Big)^{2} \sum_{n = 1}^{R} r_{s, i}^{(n)}$		(28)
	subject to (21) and (23)-(26).

The optimization problem in (28) is non-linear with respect to $(R_{s})_{s \in S}$ due to (26) and (18). To further simplify it, we first compute the optimal transmit powers while guaranteeing the minimum communication rate constraint. In this case, setting (26) to equality, combining it with (18), and using the fact that $r_{s, i}^{(n)} \in \{0, 1\}$, the optimal transmit powers can be obtained as

$p_{s, i}^{*} = \sum_{n = 1}^{R} r_{s, i}^{(n)} (2^{\frac{R_{min}}{B}} - 1) \frac{\sum_{\tilde{s} \in S \setminus \{s\}} I_{s}^{(n)} (\tilde{s}) + B N_{0}}{h_{s, i}}.$

(29)
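As a quick numerical check of (29), the sketch below computes the minimum power for one scheduled user and verifies that plugging it back into (18) returns exactly the target rate. All numerical values, including the assumed power limit of 0.2 W, are illustrative and not taken from the paper's simulation setup.

```python
import math

def min_power(r_min, bandwidth_hz, interference, n0, h):
    """Eq. (29) for a scheduled user: smallest power whose rate equals r_min."""
    return (2.0 ** (r_min / bandwidth_hz) - 1.0) * (interference + bandwidth_hz * n0) / h

def uplink_rate(p, h, interference, bandwidth_hz, n0):
    """Eq. (18) on the user's resource block (logarithm to base 2)."""
    return bandwidth_hz * math.log2(1.0 + p * h / (interference + bandwidth_hz * n0))

B, N0, R_MIN = 1e6, 4e-21, 2e6   # assumed toy values (Hz, W/Hz, bit/s)
h, interference = 1e-10, 1e-14   # assumed channel gain and interference power

p_star = min_power(R_MIN, B, interference, N0, h)
rate = uplink_rate(p_star, h, interference, B, N0)
feasible = 0.0 <= p_star <= 0.2  # constraint (25) with an assumed P_max of 0.2 W
```

Because the required power scales linearly with the interference term, a scheduling decision in a neighboring cell can push `p_star` above the power limit, which is why the powers have to be recomputed jointly.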

We then consider only one cell at each optimization step in an alternating strategy. Based on this approach and applying (29) to (25), the optimization task in (28) reduces to the following linear integer programming problem for a single cell $s$ :

	$\underset{R_{s}}{\text{minimize}} \; \sum_{i \in U_{s}} K_{s, i} \Big[1 - \sum_{n = 1}^{R} r_{s, i}^{(n)}\Big] + γ \sum_{i \in U_{s}} \Big(\frac{1}{K_{s, i} σ_{s, i}}\Big)^{2} \sum_{n = 1}^{R} r_{s, i}^{(n)}$		(30)
	subject to
	$\sum_{i \in U_{s}} K_{s, i} σ_{s, i}^{2} \sum_{n = 1}^{R} r_{s, i}^{(n)} + \sum_{\tilde{s} \in S \setminus \{s\}} \sum_{i \in U_{\tilde{s}}} K_{\tilde{s}, i} σ_{\tilde{s}, i}^{2} \sum_{n = 1}^{R} r_{\tilde{s}, i}^{(n)}$
	$\leq V_{max} \sum_{i \in U_{s}} K_{s, i} \sum_{n = 1}^{R} r_{s, i}^{(n)} + V_{max} \sum_{\tilde{s} \in S \setminus \{s\}} \sum_{i \in U_{\tilde{s}}} K_{\tilde{s}, i} \sum_{n = 1}^{R} r_{\tilde{s}, i}^{(n)},$		(31)
	$\sum_{i \in U_{s}} r_{s, i}^{(n)} \leq 1, \forall n \in [R],$		(32)
	$\sum_{n = 1}^{R} r_{s, i}^{(n)} \leq 1 \text{ and } r_{s, i}^{(n)} \in \{0, 1\}, \forall i \in U_{s},$		(33)
	$0 \leq \sum_{n = 1}^{R} r_{s, i}^{(n)} \big(2^{\frac{R_{min}}{B}} - 1\big) \frac{\sum_{\tilde{s} \in S \setminus \{s\}} I_{s}^{(n)} (\tilde{s}) + B N_{0}}{h_{s, i}} \leq P_{max}, \forall i \in U_{s} .$		(34)

1: Initialize the values of $(R_{s}, p_{s}, σ_{s})_{s \in S}$ randomly such that they satisfy (21)-(25).
2: Compute $(p_{s}^{*})_{s \in S}$ by solving (35) and unschedule those users whose communication rates do not meet (26).
3: Output the resulting parameters as a (sub-)optimal solution $(R_{s}, p_{s}^{*}, σ_{s})_{s \in S}$.

Algorithm 2 Random scheduler with random DP noise (RndSched)

To solve (30), we assume that $(R_{\tilde{s}}, p_{\tilde{s}})_{\tilde{s} \in S \setminus \{s\}}$ are known and satisfy conditions (23)-(25). We then solve this problem with respect to $R_{s}$ while taking $R_{\tilde{s}}$ with $\tilde{s} \in S \setminus \{s\}$ as constants. By solving this optimization problem for each cell, we obtain a (sub-)optimal scheduling solution $(R_{s}^{*})_{s \in S}$ for the whole system.
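To make the structure of the single-cell problem (30)-(34) concrete, the following sketch solves a tiny instance by exhaustive search. It is only a stand-in for an integer programming solver and is not the implementation used in the simulations. The helper name `solve_cell`, the matrix `p_req` (the power a user would need on each resource block according to (29)), and all numerical values are hypothetical; the contribution of the other cells to (31) is passed in as two constants.

```python
from itertools import product

def solve_cell(K, sigma, p_req, p_max, gamma, v_max,
               other_noise=0.0, other_samples=0.0):
    """Exhaustively solve (30)-(34) for one cell (tiny instances only).

    K[i], sigma[i]: samples and DP noise std of user i.
    p_req[i][n]: power user i needs on resource block n, as in (29).
    Returns (best_value, choice); choice[i] is an RB index or None.
    """
    n_users, n_rbs = len(K), len(p_req[0])
    best_value, best_choice = float("inf"), None
    # Each user picks at most one RB, so (33) holds by construction.
    for choice in product([None] + list(range(n_rbs)), repeat=n_users):
        used = [n for n in choice if n is not None]
        if len(used) != len(set(used)):          # (32): one user per RB
            continue
        if any(n is not None and p_req[i][n] > p_max
               for i, n in enumerate(choice)):   # (34): power bound
            continue
        sched = [i for i, n in enumerate(choice) if n is not None]
        noise = sum(K[i] * sigma[i] ** 2 for i in sched) + other_noise
        samples = sum(K[i] for i in sched) + other_samples
        if noise > v_max * samples:              # (31): DP noise budget
            continue
        value = (sum(K[i] for i in range(n_users) if i not in sched)
                 + gamma * sum(1.0 / (K[i] * sigma[i]) ** 2 for i in sched))
        if value < best_value:
            best_value, best_choice = value, choice
    return best_value, best_choice

# Hypothetical cell with four users and two resource blocks.
K = [600, 300, 900, 150]
sigma = [0.3, 0.5, 0.2, 1.0]
p_req = [[0.002, 0.004], [0.001, 0.001], [0.02, 0.005], [0.003, 0.003]]
best_value, best_choice = solve_cell(K, sigma, p_req, p_max=0.01,
                                     gamma=1e6, v_max=12.0)
```

In this instance the two users with the most samples are scheduled, and the user with the largest dataset is placed on the only resource block on which it satisfies the power bound.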

In the next step, the optimal transmit powers $(p_{s})_{s \in S}$ should be computed accordingly by using (29). Yet, the term $\sum_{\tilde{s} \in S \setminus \{s\}} I_{s}^{(n)} (\tilde{s})$ in (29) is itself a linear function of the transmit powers of other users due to (16). In fact, (29) can be written as a linear equation system $A p = b$ with unknown variables $p$. Here, $p$ is a vector consisting of all transmit powers $p_{s, i}$, and $A$ and $b$ are the coefficients of the linear equation system given by (29). To compute the optimal transmit powers, we solve the following simple optimization:

	$\underset{p}{\text{minimize}} \; ∥ A p - b ∥_{1}$		(35)
	$\text{subject to } 0 \leq p_{s, i} \leq P_{max}, \forall s \in S, i \in U_{s} .$

After finding the optimal powers from (35), we can compute the uplink communication rates by using (18). We then unschedule those users whose rates do not satisfy (26) and set their transmit power to zero.
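The coupling between the powers can be illustrated with two co-channel users in two different cells. The sketch below is a simplification and not the $\ell_{1}$ program (35) itself: it solves $A p = b$ directly, which gives the minimizer of (35) whenever the exact solution already lies in $[0, P_{max}]$, since the residual is then zero. The cross gains `g21` and `g12`, the form of the interference term, and all channel values are assumptions made for the example.

```python
import math

def solve_two_user_powers(h11, h22, g21, g12, bandwidth, noise_psd, r_min):
    """Solve A p = b from (29) for two co-channel users in different cells.

    h11, h22: gains to the serving base stations (hypothetical values).
    g21: gain from user 2 to base station 1; g12: from user 1 to BS 2.
    Assumes det > 0; otherwise no positive power pair meets both targets.
    """
    c = 2.0 ** (r_min / bandwidth) - 1.0
    a12 = -c * g21 / h11            # row 1: p1 + a12 * p2 = b1
    a21 = -c * g12 / h22            # row 2: a21 * p1 + p2 = b2
    b1 = c * bandwidth * noise_psd / h11
    b2 = c * bandwidth * noise_psd / h22
    det = 1.0 - a12 * a21
    return (b1 - a12 * b2) / det, (b2 - a21 * b1) / det

def rate(p_own, h_own, p_other, g_other, bandwidth, noise_psd):
    """Shannon rate with one interferer; assumed form of (18)."""
    sinr = p_own * h_own / (p_other * g_other + bandwidth * noise_psd)
    return bandwidth * math.log2(1.0 + sinr)

B, R_MIN = 180e3, 100e3
N0 = 10 ** (-174 / 10) * 1e-3       # W/Hz
P_MAX = 10 ** (10 / 10) * 1e-3      # W
h11, h22, g21, g12 = 1e-10, 2e-10, 1e-12, 5e-13

p1, p2 = solve_two_user_powers(h11, h22, g21, g12, B, N0, R_MIN)
# Project onto the box of (35), then apply the rate check against (26).
p1, p2 = (min(max(p, 0.0), P_MAX) for p in (p1, p2))
r1 = rate(p1, h11, p2, g21, B, N0)
r2 = rate(p2, h22, p1, g12, B, N0)
scheduled = [r >= R_MIN * (1.0 - 1e-9) for r in (r1, r2)]
powers = [p if ok else 0.0 for p, ok in zip((p1, p2), scheduled)]
```

If the projection onto the box changes a power, the corresponding rate can fall below $R_{min}$, and the last two lines then unschedule that user by setting its power to zero.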

1: Initialize the values of $(R_{s}, p_{s}, σ_{s})_{s \in S}$ randomly such that they satisfy (21)-(25).
2: for $s \in S$ do
3:   For fixed $(R_{\tilde{s}}, p_{\tilde{s}})_{\tilde{s} \in S \setminus \{s\}}$ and $(σ_{s})_{s \in S}$, obtain a (sub-)optimal resource block allocation matrix $R_{s}^{*}$ by solving the optimization problem in (30).
4: end for
5: Compute $(p_{s}^{*})_{s \in S}$ by solving (35) and unschedule those users whose communication rates do not meet (26).
6: Output the resulting parameters as a (sub-)optimal solution $(R_{s}^{*}, p_{s}^{*}, σ_{s})_{s \in S}$.

Algorithm 3 Optimal scheduler with random DP noise (OptSched)

Based on these solutions, we propose two procedures for user scheduling and DP noise adjustment. Algorithm 2 presents a random scheduler (RndSched). Algorithm 3 provides an optimal scheduler (OptSched) based on (30). Both algorithms benefit from the power allocation procedure based on (35) and both apply random DP noise to achieve privacy.

We note that one advantage of the optimal scheduler is that it is linear and therefore efficient from a practical point of view compared with (28). Nevertheless, the drawback of this approach is that it is performed sequentially and cell by cell. As a result, there is no guarantee that this approach always provides us with an optimal solution. However, as we will see in Subsection VI-A, it delivers very good results compared with the randomized scheduler. In the next subsection, we extend this algorithm to include a DP optimizer.

V-B DP Optimizer

Let the transmit powers and resource block allocation values from the optimal scheduler $(R_{s}^{*}, p_{s}^{*})_{s \in S}$ be given. The DP noise optimization problem is then given by

	$\underset{(σ_{s})_{s \in S}}{\text{minimize}} \; \sum_{s \in S} \sum_{i \in U_{s}} \frac{\sum_{n = 1}^{R} r_{s, i}^{(n)}}{K_{s, i}^{2} σ_{s, i}^{2}}$		(36)
	subject to (???), (???).

Since the objective function and all constraints in (36) are convex, the globally optimal solution can be obtained by solving the Karush-Kuhn-Tucker (KKT) conditions [47]. The Lagrange function can be formulated as:

	$L ((σ_{s})_{s \in S}) = \sum_{s \in S} \sum_{i \in U_{s}} \frac{\sum_{n = 1}^{R} r_{s, i}^{(n)}}{K_{s, i}^{2} σ_{s, i}^{2}} + κ \Big(\sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} σ_{s, i}^{2} \sum_{n = 1}^{R} r_{s, i}^{(n)} - V_{max} \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} \sum_{n = 1}^{R} r_{s, i}^{(n)}\Big),$

where $κ \geq 0$ is a Lagrange multiplier.

Setting the derivative of $L$ with respect to $σ_{s, i}$ to zero yields

	$- \frac{2 \sum_{n = 1}^{R} r_{s, i}^{(n)}}{K_{s, i}^{2} σ_{s, i}^{3}} + 2 κ K_{s, i} σ_{s, i} \sum_{n = 1}^{R} r_{s, i}^{(n)} = 0.$		(37)

Let $\sum_{n = 1}^{R} r_{s, i}^{(n)} = 1$ hold. It then follows by combining (37) with (22) that

	$σ_{s, i} = (K_{s, i}^{3} κ)^{- \frac{1}{4}} \Big|_{\frac{N_{min}}{K_{s, i}}},$		(38)

where $a |_{b} = \max \{a, b\}$, $s \in S$, and $i \in U_{s}$. If $\sum_{n = 1}^{R} r_{s, i}^{(n)} = 0$ holds, then we have $σ_{s, i} = 0$.
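As a brief check of the unconstrained part of (38), setting $\sum_{n = 1}^{R} r_{s, i}^{(n)} = 1$ in (37) and assuming $κ > 0$ gives

	$κ K_{s, i} σ_{s, i} = \frac{1}{K_{s, i}^{2} σ_{s, i}^{3}} \;\Longrightarrow\; σ_{s, i}^{4} = \frac{1}{κ K_{s, i}^{3}} \;\Longrightarrow\; σ_{s, i} = (K_{s, i}^{3} κ)^{- \frac{1}{4}},$

and the maximum with $N_{min} / K_{s, i}$ in (38) then keeps this stationary point from falling below that threshold.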

Fig. 3: User and channel initialization with 100 users.

For the optimal solution, constraint (21) always holds with equality since the objective function monotonically decreases with increasing $σ_{s, i}$ . Consequently, substituting (38) into (21) with equality implies that

	$\sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} \sum_{n = 1}^{R} r_{s, i}^{(n)} \, (K_{s, i}^{3} κ)^{- \frac{1}{2}} \Big|_{\frac{N_{min}^{2}}{K_{s, i}^{2}}} = V_{max} \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} \sum_{n = 1}^{R} r_{s, i}^{(n)} .$		(39)

After the value of $κ$ is found from (39), the optimal $σ_{s, i}^{*}$ can be computed from (38). We then combine this scheme with the procedure in Subsection V-A. A summary of this scheme is provided in Algorithm 4 (OptSched+DP).
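One possible way to obtain $κ$ numerically is a bisection on $\log_{10} κ$, since the left-hand side of (39) is non-increasing in $κ$. The sketch below is an assumption about how this step could be carried out, not a description of the solver used in the simulations. It only considers scheduled users, the sample counts are hypothetical, and $N_{min}$ and $V_{max}$ are taken from Table I.

```python
def sigma_from_kappa(kappa, K, n_min):
    """Noise standard deviations of scheduled users according to (38)."""
    return [max((k ** 3 * kappa) ** -0.25, n_min / k) for k in K]

def noise_budget(kappa, K, n_min):
    """Left-hand side of (39): sum of K_i * sigma_i^2 over scheduled users."""
    return sum(k * s ** 2
               for k, s in zip(K, sigma_from_kappa(kappa, K, n_min)))

def solve_kappa(K, n_min, v_max, iters=200):
    """Bisection on log10(kappa) so that (39) holds with equality."""
    target = v_max * sum(K)
    floor = sum(k * (n_min / k) ** 2 for k in K)  # budget if all are clamped
    if floor > target:
        raise ValueError("the lower bounds on sigma already exceed V_max")
    lo, hi = -30.0, 30.0   # search range for log10(kappa), assumed wide enough
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if noise_budget(10.0 ** mid, K, n_min) > target:
            lo = mid       # too much noise: increase kappa
        else:
            hi = mid
    return 10.0 ** hi

K = [600, 300, 900, 150]   # hypothetical sample counts of scheduled users
N_MIN, V_MAX = 100.0, 12.0
kappa = solve_kappa(K, N_MIN, V_MAX)
sigma_opt = sigma_from_kappa(kappa, K, N_MIN)
budget = noise_budget(kappa, K, N_MIN)
```

In this example no user is clamped at $N_{min} / K_{s, i}$, so users with fewer samples receive a larger noise standard deviation while the weighted noise budget is met with equality.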

VI Simulations and Numerical Solutions

VI-A Optimization Problems

In this subsection, we present the numerical solutions of the algorithms that were presented in Section V. In this regard, we apply the Python optimization packages CVXPY, CVXOPT, GLPK, and ECOS [33, 34, 35, 36, 37].

Since Algorithms 3 and 4 are heuristic, their solutions depend on the initial values of the optimizing variables $(R_{s}, p_{s}, σ_{s})_{s \in S}$ as well as the wireless channels and the number of training samples at each user. As a result, we repeat the computations for several random initial values, channels and data distributions among the users and then compute the average.

System parameter	Value
Number of cells or base stations $(S)$	7
Total number of users	100
Cell radius	500 m
Uplink center frequency	2450 MHz
Channels' Rayleigh distribution scale parameter	1
Uplink resource block bandwidth $(B)$	180 kHz
Thermal noise power spectral density $(N_{0})$	$-174$ dBm/Hz
Maximum transmit power $(P_{max})$	10 dBm
Minimum communication rate $(R_{min})$	100 kb/s
DP noise error upper bound $(V_{max})$	12
Minimum total DP noise at each user $(N_{min})$	100

TABLE I: System parameters.

1: Perform Algorithm 3.
2: For the given $(R_{s}^{*}, p_{s}^{*})_{s \in S}$ from Algorithm 3, obtain the optimal $(σ_{s}^{*})_{s \in S}$ by solving (36).
3: Output the resulting parameters as a (sub-)optimal solution $(R_{s}^{*}, p_{s}^{*}, σ_{s}^{*})_{s \in S}$.

Algorithm 4 Optimal scheduler with DP noise optimizer (OptSched+DP)

To this end, the variables $(R_{s})_{s \in S}$ are first initialized based on a shuffled round-robin scheme, and $(p_{s}, σ_{s})_{s \in S}$ are set uniformly at random such that $0 \leq p_{s, i} \leq P_{max}$ and $N_{min} / K_{s, i} \leq σ_{s, i} \leq 6 N_{min} / K_{s, i}$ hold. Second, the users are positioned in a square area containing seven hexagonal cells according to a uniform distribution. The edge devices are then assigned to their nearest base stations according to their random positions. Based on their distances to the base stations, their fading channels are then computed by applying (17).

An example of channel initialization, generated by our Python simulator, is shown in Fig. 3. In this case, the channels between one of the users and the base stations are depicted as dashed lines. We notice that cells 1-6 in this setting can also cover users outside their area, while the central cell only covers devices inside the central hexagon. As a result, the effects of boundary and central cells are both taken into account in our simulations.

After the users are assigned to their corresponding base stations, the training data is randomly distributed among all users. Inspired by [48], the number of samples $K_{s, i}$ at each user is drawn from a lognormal distribution.
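The initialization described above can be sketched as follows. The geometry (a central base station with six neighbours at inter-site distance $\sqrt{3}$ times the cell radius, and a square of half-width 2.5 cell radii), the index 0 for the central cell, and the lognormal parameters `mu` and `s` are all assumptions made for illustration; the paper does not state them here.

```python
import math
import random

def hex_base_stations(radius):
    """Central base station plus six neighbours (assumed hexagonal layout)."""
    d = math.sqrt(3.0) * radius   # inter-site distance for hexagons of this radius
    ring = [(d * math.cos(math.pi / 6 + k * math.pi / 3),
             d * math.sin(math.pi / 6 + k * math.pi / 3)) for k in range(6)]
    return [(0.0, 0.0)] + ring

def init_users(n_users, radius, mu, s, rng):
    """Drop users uniformly in a square and attach each to its nearest BS."""
    bs = hex_base_stations(radius)
    half = 2.5 * radius           # half-width of the square area (assumed)
    users = []
    for _ in range(n_users):
        x, y = rng.uniform(-half, half), rng.uniform(-half, half)
        cell = min(range(len(bs)),
                   key=lambda j: math.hypot(x - bs[j][0], y - bs[j][1]))
        # Lognormal number of samples, at least one per user.
        samples = max(1, round(rng.lognormvariate(mu, s)))
        users.append({"pos": (x, y), "cell": cell, "K": samples})
    return bs, users

rng = random.Random(0)
bs, users = init_users(n_users=100, radius=500.0, mu=6.0, s=0.5, rng=rng)
users_per_cell = [sum(1 for u in users if u["cell"] == j)
                  for j in range(len(bs))]
```

Because the square is larger than the union of the hexagons, the outer base stations also serve users outside their hexagon, which mirrors the boundary effect described for Fig. 3.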

Algorithms 3 and 4 should then provide us with $(R_{s}^{*}, p_{s}^{*})_{s \in S}$ and $(a_{s}^{*})_{s \in S}$ which determine (sub-)optimal allocated resource blocks, uplink transmit powers, and the scheduled users.

The system parameters used in the computations are listed in Table I. Fig. 4 shows the results of all algorithms as an empirical cumulative distribution function (CDF) of the normalized objective value in (20). The normalization divides the value of the objective function by the total number of samples (scheduled or unscheduled). The CDF is computed for two values of the number of available resource blocks $R$ and of the optimization constant $γ$ from (20). The results are averaged over $10^{3}$ random channels and initial values. As seen in Fig. 4, the OptSched (Algorithm 3) outperforms the RndSched (Algorithm 2) in terms of minimizing the objective value in (20). Moreover, the OptSched+DP (Algorithm 4) further improves the results of the OptSched by reducing the total privacy leakage. Furthermore, the OptSched+DP achieves lower values for $γ = 10^{7}$ than for $γ = 10^{6}$. This is because a larger $γ$ in (20) gives more weight to the DP noise optimization.
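The normalization and the empirical CDF can be sketched in a few lines. The objective is written as it appears in (28), on the assumption that (20) shares this form; the function names and the sample values are hypothetical.

```python
def normalized_objective(K, sigma, scheduled, gamma):
    """Objective as written in (28), divided by the total number of samples."""
    unscheduled = sum(k for k, a in zip(K, scheduled) if not a)
    leakage = sum(1.0 / (k * s) ** 2
                  for k, s, a in zip(K, sigma, scheduled) if a)
    return (unscheduled + gamma * leakage) / sum(K)

def empirical_cdf(values):
    """Return sorted (value, fraction of runs <= value) pairs."""
    xs = sorted(values)
    n = len(xs)
    return [(x, (j + 1) / n) for j, x in enumerate(xs)]

# One hypothetical run: users 0 and 2 are scheduled.
value = normalized_objective(K=[600, 300, 900, 150],
                             sigma=[0.3, 0.5, 0.2, 1.0],
                             scheduled=[1, 0, 1, 0],
                             gamma=1e6)
cdf = empirical_cdf([3.0, 1.0, 2.0])
```

In a full experiment, `normalized_objective` would be evaluated once per random channel and initialization, and `empirical_cdf` applied to the collected values.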

We also notice that by increasing $R$ from 5 to 8, the normalized objective values of the RndSched get slightly closer to the outcome of the OptSched algorithm. This is because increasing $R$ while keeping the total number of users constant raises the chance that all users are successfully scheduled by the RndSched. For large values of $R$, the RndSched might therefore eventually achieve the same performance as the OptSched. However, selecting a large number of resource blocks for a small number of users is not desirable due to the limited amount of available bandwidth.

VI-B Federated Learning Simulations

In this subsection, we apply the random parameters $a_{s}$ and $σ_{s}$ as well as (sub-)optimal $a_{s}^{*}$ and $σ_{s}^{*}$ from Subsection VI-A to an FL system as described in Algorithm 1. In this case, we assume that the main server and all users each maintain a fully connected neural network in the form of a multi-label classifier. The networks consist of two hidden layers, each with 256 nodes. To implement the simulations, we apply the TensorFlow, NumPy, and Matplotlib packages [38, 39, 40]. Furthermore, we use the MNIST image dataset [49] to train and test the multi-label classifier.

In this setting, we train the local models over $T = 200$ communication rounds between users and the main server. To follow our mathematical model in Section II, we perform no local iterations and use the batch gradient descent scheme. We do not apply any decay and use a fixed learning rate $λ = 0.05$ .

Furthermore, to guarantee that Assumption 1 holds, the gradients of all weights are clipped so that their global norm is at most $L = 10$. This directly affects the amount of privacy leakage as given by (11).
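Clipping by global norm rescales all gradient tensors by a common factor so that their joint Euclidean norm does not exceed the threshold. The following plain-Python sketch illustrates the operation on flattened tensors; it is not the code used in the simulations, where a library routine such as TensorFlow's `tf.clip_by_global_norm` could serve the same purpose.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Scale a list of flattened gradient tensors to global norm <= clip_norm.

    Returns the clipped tensors and the global norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, clip_norm / global_norm) if global_norm > 0 else 1.0
    return [[g * scale for g in tensor] for tensor in grads], global_norm

# Two hypothetical weight tensors with joint norm 13, clipped to L = 10.
clipped, norm_before = clip_by_global_norm([[3.0, 4.0], [12.0]], 10.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
```

Gradients whose global norm is already below the threshold are returned unchanged, so the clipping only takes effect for large updates.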

We perform the simulations over 100 channels and initial values and then average the resulting accuracy and loss. Furthermore, we generate the empirical CDF of the privacy leakage of all users. Fig. 5 shows the accuracy, loss, and privacy leakage CDF of this learning system for different values of available resource blocks $R$ and optimization constant $γ$ .

As seen in Fig. 5, the OptSched outperforms the RndSched algorithm in terms of accuracy and loss for both $R = 5$ and $R = 8$ . In this case, the OptSched systematically selects the users with large chunks of data that have a better channel and suffer less from the inter-cell interference. The RndSched algorithm, however, fails in this scenario since it applies a random scheduling scheme.

The OptSched+DP, on the other hand, slightly degrades the performance of the optimal scheduler by increasing and optimizing the DP noise. Yet the OptSched+DP provides a similar or even better performance compared with the RndSched scheme for small values of $R$ (see Figs. 4(a) and 4(c)). The degradation is the price that is paid to improve the privacy. Figs. 4(e) and 4(f) show the empirical CDF of the privacy leakage ($ρ$). In this case, the value of $ρ$ at each user is computed by using (11), and the results over all simulation iterations are collected to compute the CDF. The simulations show that the OptSched+DP scheme substantially reduces the amount of privacy leakage at each user. In particular, it achieves a maximum privacy leakage of around $ρ = 0.5$ thanks to the DP optimizer scheme. This is a significant improvement compared with the RndSched scheme, which has a maximum leakage of around $ρ = 4$.

Adjusting the optimization constant $γ$ is also crucial. In this regard, by choosing $γ = 10^{7}$ the OptSched achieves lower privacy leakage compared with $γ = 10^{6}$ (see Figs. 4(e) and 4(f)). This is because larger $γ$ gives less weight to the optimal scheduler in (20) and users with higher DP noise power are preferred in scheduling.

VII Conclusion

In this work, a privacy preserving FL procedure in a multiple base station scenario with inter-cell interference has been considered. An upper bound on the optimality gap of the convergence term of this learning scheme has been derived and an optimization problem to reduce this upper bound has been provided. We have proposed two sequential algorithms to obtain (sub-)optimal solutions for this optimization task; namely an optimal scheduler (OptSched) in Algorithm 3 and its extended version with DP optimizer (OptSched+DP) in Algorithm 4. In designing these schemes we avoid non-linearity in the integer programming problems. The outputs of these algorithms are then applied to an FL system.

Simulation results have shown that the OptSched increases the accuracy of the classification FL system and reduces the loss compared with the RndSched when the number of available resource blocks $R$ is small. In this case, when the total number of users is $K = 100$ and $R = 5$, the OptSched shows an accuracy improvement of over $6\%$. Simulations have further shown that the OptSched not only improves the accuracy but can also reduce the privacy leakage compared with the RndSched if the parameter $γ$ is set properly.

The OptSched+DP, on the other hand, further optimizes the DP noise and substantially reduces the privacy leakage compared with both RndSched and OptSched. In this case, simulations have shown that the OptSched+DP reduces the maximum privacy leakage for both $R = 5$ and $R = 8$ by a factor of 8 (from $ρ = 4$ to $ρ = 0.5$ ). It is worth mentioning that when $R$ is small (e.g. $R = 5$ ), this improvement is achieved while OptSched+DP shows a similar or even better performance in terms of accuracy and loss compared with RndSched.

Appendix

Proof of Theorem 1

Proof.

It follows by using Assumption 5 and applying the Taylor expansion to the global loss function $f$ that

	$f (w^{(t + 1)}) \leq f (w^{(t)}) + (w^{(t + 1)} - w^{(t)})^{⊺} \nabla f (w^{(t)}) + \frac{L}{2} ∥ w^{(t + 1)} - w^{(t)} ∥_{2}^{2} .$		(40)

Next, we compute the local updates at each user $i \in U_{s}$ by combining (4), (5), and (7) as follows

	$w_{s, i}^{(t + 1)} = w^{(t)} - \frac{λ}{K_{s, i}} \sum_{k = 1}^{K_{s, i}} \nabla l (w^{(t)}, x_{s, i}^{(k)}) - λ n_{s, i}^{(t)} .$		(41)

Furthermore, combining (8) and (9) implies that

	$w^{(t + 1)} = \frac{1}{K_{a}} \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} a_{s, i} w_{s, i}^{(t + 1)} .$		(42)

We then obtain the global update at the main server by inserting the value of $w_{s, i}^{(t + 1)}$ from (41) into (42) as follows

	$w^{(t + 1)} - w^{(t)} = - \frac{λ}{K_{a}} \sum_{s \in S} \sum_{i \in U_{s}} a_{s, i} \sum_{k = 1}^{K_{s, i}} \nabla l (w^{(t)}, x_{s, i}^{(k)}) - \frac{λ}{K_{a}} \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} a_{s, i} n_{s, i}^{(t)} .$		(43)
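The step from (41) and (42) to (43) can be checked numerically on a toy problem. The sketch below uses a scalar model and the hypothetical per-sample loss $\frac{1}{2}(w - x)^{2}$, and it assumes that $K_{a}$ is the number of samples held by the scheduled users, which is the reading under which (42) is a weighted average. The data, the scheduling vector, and the noise levels are made up for the example.

```python
import random

def grad(w, x):
    """Gradient of the hypothetical per-sample loss 0.5 * (w - x)**2."""
    return w - x

def fl_round(w, data, a, noise, lam):
    """One round: local updates (41), aggregation (42), closed form (43)."""
    K = [len(d) for d in data]
    K_a = sum(k * ai for k, ai in zip(K, a))   # samples of scheduled users
    local = [w - lam / k * sum(grad(w, x) for x in d) - lam * n
             for d, k, n in zip(data, K, noise)]                   # (41)
    w_next = sum(k * ai * wl for k, ai, wl in zip(K, a, local)) / K_a  # (42)
    step = (-lam / K_a * sum(ai * sum(grad(w, x) for x in d)
                             for d, ai in zip(data, a))
            - lam / K_a * sum(k * ai * n
                              for k, ai, n in zip(K, a, noise)))   # (43)
    return w_next, step

rng = random.Random(1)
data = [[rng.gauss(2.0, 1.0) for _ in range(k)] for k in (5, 8, 3)]
a = [1, 0, 1]                     # the second user is not scheduled
sigma = [0.4, 0.2, 0.6]           # hypothetical DP noise levels
noise = [rng.gauss(0.0, s) for s in sigma]
w0 = 0.5
w1, step = fl_round(w0, data, a, noise, lam=0.05)
```

The difference `w1 - w0` computed through the local models agrees with the closed-form `step` up to floating-point rounding, which is the content of (43).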

To simplify the rest of the calculations, we define a new random variable to reflect the difference between the global update and the global gradient as below:

	$Δ^{(t)} = \nabla f (w^{(t)}) + \frac{1}{λ} (w^{(t + 1)} - w^{(t)}) .$		(44)

Now by inserting the term $w^{(t + 1)} - w^{(t)}$ (the global update) from (44) into (40), we have that

	$f (w^{(t + 1)}) \leq f (w^{(t)}) + λ (Δ^{(t)} - \nabla f (w^{(t)}))^{⊺} \nabla f (w^{(t)}) + \frac{λ^{2} L}{2} ∥ Δ^{(t)} - \nabla f (w^{(t)}) ∥_{2}^{2} .$		(45)

Furthermore, the following identity always holds:

	$∥ u - v ∥_{2}^{2} = ∥ u ∥_{2}^{2} + ∥ v ∥_{2}^{2} - 2 u^{⊺} v .$		(46)

Considering the learning step size to be $λ = 1 / L$ and applying the identity (46) to (45), it follows that

	$f (w^{(t + 1)}) - f (w^{*}) \leq f (w^{(t)}) - f (w^{*}) + \frac{1}{2 L} \big[∥ Δ^{(t)} ∥_{2}^{2} - ∥ \nabla f (w^{(t)}) ∥_{2}^{2}\big],$		(47)

where $f (w^{*})$ is the optimal value of the loss function (Assumption 2).

Inspired by [8], we first obtain an upper bound on the expectation of the term $∥ Δ^{(t)} ∥_{2}^{2}$ on the right-hand side of (47). It follows by combining (43) and (44) that

	$E [∥ Δ^{(t)} ∥_{2}^{2}]$
	$= E \Big[\Big∥ \nabla f (w^{(t)}) - \frac{1}{K_{a}} \sum_{s \in S} \sum_{i \in U_{s}} a_{s, i} \sum_{k = 1}^{K_{s, i}} \nabla l (w^{(t)}, x_{s, i}^{(k)}) - \frac{1}{K_{a}} \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} a_{s, i} n_{s, i}^{(t)} \Big∥_{2}^{2}\Big]$
	$= E \Big[\Big∥ - \frac{K - K_{a}}{K K_{a}} \sum_{(s, i) : a_{s, i} = 1} \sum_{k = 1}^{K_{s, i}} \nabla l (w^{(t)}, x_{s, i}^{(k)}) + \frac{1}{K} \sum_{(s, i) : a_{s, i} = 0} \sum_{k = 1}^{K_{s, i}} \nabla l (w^{(t)}, x_{s, i}^{(k)}) \Big∥_{2}^{2}\Big]$
	$+ E \Big[\Big∥ \frac{1}{K_{a}} \sum_{s \in S} \sum_{i \in U_{s}} K_{s, i} a_{s, i} n_{s, i}^{(t)} \Big∥_{2}^{2}\Big]$		(48)
	$\leq E \Big[\Big(\frac{K - K_{a}}{K K_{a}} \sum_{(s, i) : a_{s, i} = 1} \sum_{k = 1}^{K_{s, i}} ∥ \nabla l (w^{(t)}, x_{s, i}^{(k)}) ∥_{2} + \frac{1}{K} \sum_{(s, i) : a_{s, i} = 0} \sum_{k = 1}^{K_{s, i}} ∥ \nabla l (w^{(t)}, x_{s, i}^{(k)}) ∥_{2}\Big)^{2}\Big]$
	$+ \sum_{e = 1}^{d} E \Big[\Big(\sum_{s \in S} \sum_{i \in U_{s}} \frac{K_{s, i} a_{s, i}}{K_{a}} n_{s, i, e}^{(t)}\Big)^{2}\Big],$		(49)

where (48) follows by applying (12) and (46) and the fact that the DP noise is independent of other random variables and $E [n_{s, i}^{(t)}] = 0$ . Inequality (49) is due to the triangle inequality and the fact that the vectors $n_{s, i} = (n_{s, i, e}^{(t)})_{e \in [d]}$ are $d$ -dimensional.

Next, by applying Assumption 6 to (49) we have for some $ξ_{1} \geq 0$ and $ξ_{2} \geq 1$ that

	$E [∥ Δ^{(t)} ∥_{2}^{2}] \leq \Big(\sum_{s \in S} \sum_{i \in U_{s}} \frac{2 K_{s, i}}{K} (1 - a_{s, i})\Big)^{2} E \big[ξ_{1} + ξ_{2} ∥ \nabla f (w^{(t)}) ∥_{2}^{2}\big] + d \sum_{s \in S} \sum_{i \in U_{s}} \Big(\frac{K_{s, i} a_{s, i}}{K_{a}} σ_{s, i}\Big)^{2},$		(50)

where the last term in (50) is obtained due to the fact that the random variables $n_{s, i, e}^{(t)}$ in (49) are independent of each other.

On the other hand, since $f$ is $μ$ -strongly convex (Assumption 4) we have that

	$∥ \nabla f (w^{(t)}) ∥_{2}^{2} \geq 2 μ [f (w^{(t)}) - f (w^{*})] .$		(51)

By inserting (50) in (47) and using (51), the proof follows. ∎
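As a hedged sketch of this last step, write $A = \sum_{s \in S} \sum_{i \in U_{s}} \frac{2 K_{s, i}}{K} (1 - a_{s, i})$ and $C_{σ} = d \sum_{s \in S} \sum_{i \in U_{s}} \big(\frac{K_{s, i} a_{s, i}}{K_{a}} σ_{s, i}\big)^{2}$, both shorthand introduced here only for brevity. Taking expectations in (47) and inserting (50) gives

	$E [f (w^{(t + 1)}) - f (w^{*})] \leq E [f (w^{(t)}) - f (w^{*})] + \frac{1}{2 L} (ξ_{1} A^{2} + C_{σ}) - \frac{1 - ξ_{2} A^{2}}{2 L} E [∥ \nabla f (w^{(t)}) ∥_{2}^{2}] .$

Assuming additionally that $ξ_{2} A^{2} \leq 1$, so that the last coefficient is non-negative, (51) yields the one-step recursion

	$E [f (w^{(t + 1)}) - f (w^{*})] \leq \Big(1 - \frac{μ}{L} (1 - ξ_{2} A^{2})\Big) E [f (w^{(t)}) - f (w^{*})] + \frac{1}{2 L} (ξ_{1} A^{2} + C_{σ}) .$

The exact form in which Theorem 1 states this bound may differ.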

References

[1] Y. C. Eldar, A. Goldsmith, D. Gündüz, H. V. Poor et al., Machine Learning and Wireless Communications. Cambridge University Press, 2022.
[2] H. Hellström, J. M. B. da Silva Jr, M. M. Amiri et al., “Wireless for machine learning: A survey,” Foundations and Trends® in Signal Processing, vol. 15, no. 4, pp. 290–399, 2022.
[3] K. Bonawitz, H. Eichner, W. Grieskamp et al., “Towards federated learning at scale: System design,” in Proc. Syst. Mach. Learn. Conf., 2019, pp. 1–15.
[4] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for on-device intelligence,” CoRR, vol. abs/1610.02527, 2016. [Online]. Available: http://arxiv.org/abs/1610.02527
[5] B. McMahan, E. Moore, D. Ramage et al., “Communication-efficient learning of deep networks from decentralized data,” in Proc. International Conference on Artificial Intelligence and Statistics. PMLR, 2017, pp. 1273–1282.
[6] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, “Federated learning: Challenges, methods, and future directions,” IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50–60, 2020.
[7] S. Samarakoon, M. Bennis, W. Saad, and M. Debbah, “Distributed federated learning for ultra-reliable low-latency vehicular communications,” IEEE Trans. Commun., vol. 68, no. 2, pp. 1146–1159, 2019.
[8] M. Chen, Z. Yang, W. Saad et al., “A joint learning and communications framework for federated learning over wireless networks,” IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, 2020.
[9] M. M. Amiri and D. Gündüz, “Federated learning over wireless fading channels,” IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, 2020.
[10] Z. Yang, M. Chen, W. Saad et al., “Energy efficient federated learning over wireless communication networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1935–1949, March 2021.
[11] R. Hamdi, M. Chen, A. B. Said et al., “Federated learning over energy harvesting wireless networks,” IEEE Internet Things J., 2021.
[12] Z. Wang, Y. Zhou, Y. Shi, and W. Zhuang, “Interference management for over-the-air federated learning in multi-cell wireless networks,” IEEE J. Sel. Areas Commun., 2022.
[13] H. Chen, S. Huang, D. Zhang et al., “Federated learning over wireless IoT networks with optimized communication and resources,” IEEE Internet Things J., 2022.
[14] Q. Zeng, Y. Du, K. Huang, and K. K. Leung, “Energy-efficient radio resource allocation for federated edge learning,” in 2020 IEEE International Conference on Communications Workshops, 2020, pp. 1–6.
[15] S. Wang, T. Tuor, T. Salonidis et al., “Adaptive federated learning in resource constrained edge computing systems,” IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
[16] N. H. Tran, W. Bao, A. Zomaya et al., “Federated learning over wireless networks: Optimization model design and analysis,” in IEEE INFOCOM - IEEE Conference on Computer Communications, 2019, pp. 1387–1395.
[17] S. Hosseinalipour, S. S. Azam, C. G. Brinton et al., “Multi-stage hybrid federated learning over large-scale d2d-enabled fog networks,” IEEE/ACM Trans. Netw., pp. 1–16, 2022.
[18] L. U. Khan, W. Saad, Z. Han, and C. S. Hong, “Dispersed federated learning: Vision, taxonomy, and future directions,” IEEE Wireless Commun., vol. 28, no. 5, pp. 192–198, 2021.
[19] L. U. Khan, Y. K. Tun, M. Alsenwi et al., “A dispersed federated learning framework for 6G-enabled autonomous driving cars,” arXiv preprint arXiv:2105.09641, 2021.
[20] Z. Zhang, Z. Gao, Y. Guo, and Y. Gong, “Scalable and low-latency federated learning with cooperative mobile edge networking,” arXiv preprint arXiv:2205.13054, 2022.
[21] S. R. Pandey, M. N. H. Nguyen, T. N. Dang et al., “Edge-assisted democratized learning toward federated analytics,” IEEE Internet Things J., vol. 9, no. 1, pp. 572–588, 2022.
[22] M. Asad, A. Moustafa, F. A. Rabhi, and M. Aslam, “THF: 3-way hierarchical framework for efficient client selection and resource management in federated learning,” IEEE Internet of Things Journal, vol. 9, no. 13, pp. 11 085–11 097, 2022.
[23] M. Al-Rubaie and J. M. Chang, “Reconstruction attacks against mobile-based continuous authentication systems in the cloud,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 12, pp. 2648–2663, 2016.
[24] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Proc. Theory of Cryptography Conference. Springer, 2006, pp. 265–284.
[25] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy.” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
[26] R. Hu, Y. Guo, and Y. Gong, “Concentrated differentially private and utility preserving federated learning,” arXiv preprint arXiv:2003.13761, 2020.
[27] M. Wu, D. Ye, J. Ding et al., “Incentivizing differentially private federated learning: A multi-dimensional contract approach,” IEEE Internet Things J., 2021.
[28] P. Sun, H. Che, Z. Wang et al., “Pain-FL: Personalized privacy-preserving incentive for federated learning,” IEEE J. Sel. Areas Commun., vol. 39, no. 12, pp. 3805–3820, 2021.
[29] K. Wei, J. Li, M. Ding et al., “Federated learning with differential privacy: Algorithms and performance analysis,” IEEE Trans. Inf. Forensics Security, vol. 15, pp. 3454–3469, 2020.
[30] K. Wei, J. Li, C. Ma et al., “Low-latency federated learning over wireless channels with differential privacy,” IEEE J. Sel. Areas Commun., vol. 40, no. 1, pp. 290–307, 2022.
[31] M. S. E. Mohamed, W.-T. Chang, and R. Tandon, “Privacy amplification for federated learning via user sampling and wireless aggregation,” IEEE J. Sel. Areas Commun., vol. 39, no. 12, pp. 3821–3835, 2021.
[32] T. Liu, B. Di, and L. Song, “Privacy-preserving federated edge learning: Modelling and optimization,” IEEE Commun. Lett., 2022.
[33] S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,” Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016.
[34] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd, “A rewriting system for convex optimization problems,” Journal of Control and Decision, vol. 5, no. 1, pp. 42–60, 2018.
[35] M. Andersen, L. Vandenberghe, and J. Dahl, “CVXOPT: A python package for convex optimization,” URL https://cvxopt.org.
[36] A. Makhorin, “GNU linear programming kit (GLPK),” URL http://www.gnu.org/software/glpk/glpk.html.
[37] A. Domahidi, E. Chu, and S. Boyd, “ECOS: An SOCP solver for embedded systems,” in European Control Conference (ECC), 2013, pp. 3071–3076.
[38] M. Abadi, A. Agarwal, P. Barham et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015. [Online]. Available: https://www.tensorflow.org/
[39] C. R. Harris, K. J. Millman, S. J. van der Walt et al., “Array programming with NumPy,” Nature, vol. 585, no. 7825, pp. 357–362, Sep. 2020.
[40] J. D. Hunter, “Matplotlib: A 2D graphics environment,” IEEE Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
[41] M. Bun and T. Steinke, “Concentrated differential privacy: Simplifications, extensions, and lower bounds,” in Proc. Theory of Cryptography Conference. Springer, 2016, pp. 635–658.
[42] Y. Nesterov, Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2003, vol. 87.
[43] M. Abadi, A. Chu, I. Goodfellow et al., “Deep learning with differential privacy,” in Proc. of the 2016 ACM SIGSAC Conference on Computer and Communications Security. Association for Computing Machinery, 2016, p. 308–318.
[44] M. P. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” SIAM J. Sci. Comput., vol. 34, no. 3, pp. A1380–A1405, 2012.
[45] M. Moretti and A. Todini, “A resource allocator for the uplink of multi-cell OFDMA systems,” IEEE Trans. Wireless Commun., vol. 6, no. 8, pp. 2807–2812, 2007.
[46] J. Andersen, T. Rappaport, and S. Yoshida, “Propagation measurements and models for wireless communications channels,” IEEE Commun. Mag., vol. 33, no. 1, pp. 42–49, 1995.
[47] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear programming: theory and algorithms. John Wiley & Sons, 2013.
[48] T. Li, A. K. Sahu, M. Zaheer et al., “Federated optimization in heterogeneous networks,” in Proc. of Machine Learning and Systems, vol. 2, 2020, pp. 429–450.
[49] Y. LeCun, “The MNIST database of handwritten digits,” URL http://yann.lecun.com/exdb/mnist/, 1998.