Exact Decomposition of Quantum Channels for Non-IID Quantum Federated Learning

Haimeng Zhao (赵海萌) haimengzhao@icloud.com Zhili College, Tsinghua University, Beijing 100084, China

September 5, 2022

Abstract

Federated learning refers to the task of performing machine learning with decentralized data from multiple clients while protecting data security and privacy. Works have been done to incorporate quantum advantage in such scenarios. However, when the clients’ data are not independent and identically distributed (IID), the performance of conventional federated algorithms deteriorates. In this work, we explore this phenomenon in the quantum regime with both theoretical and numerical analysis. We further prove that a global quantum channel can be exactly decomposed into channels trained by each client with the help of local density estimators. It leads to a general framework for quantum federated learning on non-IID data with one-shot communication complexity. We demonstrate it on classification tasks with numerical simulations.

^†^†preprint: APS/123-QED

I Introduction

Recent advances in artificial intelligence [goodfellow2016deep] and quantum computing [nielsen2010quantum] have given birth to an emerging field of quantum machine learning [biamonte2017quantum, das2019machine]. By leveraging quantum advantage in machine learning tasks, quantum machine learning has demonstrated unprecedented power in solving a wide range of problems. Notable examples include solving linear equations [harrow2009quantum], quantum supervised learning [lloyd2014quantum, schuld2019quantum, havlivcek2019supervised, rebentrost2014quantum] and quantum generative learning [gao2018quantum, lloyd2018quantum, liu2018differentiable]. This line of research mostly focuses on developing quantum algorithms that can showcase quantum speed-up in machine learning problems.

One of the most important ingredients in machine learning is the data. In real-world applications, data are often distributed among multiple clients and cannot be gathered into a single joint dataset for various reasons. For example, medical data from patients across multiple hospitals [rieke2020future] or cyber-physical attack data from Internet of Things devices [khraisat2021critical] are sensitive and private, and cannot be directly shared without anonymization. Obstacles in data transmission may also forbid us from constructing a joint dataset. For example, the data size might be too large and therefore the transmission is too expensive, or we need quantum data that suffer from decoherence and are hard to preserve or transmit.

Machine learning algorithms with such decentralized data have been developed under the name of federated learning [konevcny2016federated, mcmahan2017comm]. The conventional solution is the federated averaging algorithm [mcmahan2017comm], or FedAvg, in which multiple clients jointly train a global model by sharing only the model parameters/updates while keeping their data private. Its quantum extensions, dubbed qFedAvg, have been proposed to incorporate quantum features such as quantum speed-up and blind computing [li2021quantum, xia2021quantumfed, chen2021federated, chehimi2022quantum, yun2022slimmable].

However, the classical FedAvg algorithm has three shortcomings already identified in the literature. Firstly, it suffers from the non-IID quagmire [zhao2018federated, hsieh2020non], i.e. its performance is proved to deteriorate when the local data of different clients are not independent and identically distributed (IID). But real data are often heterogeneous or even multi-modal for different clients. Secondly, the joint training involves gradient sharing, so the data privacy is threatened by attacks based on gradient inversion [zhu2019deep, geiping2020inverting]. Thirdly, it requires many rounds of communication, which is often the bottleneck of real-world applications. To reduce communication burdens, multiple one-shot alternatives have been proposed [guha2019one, salehkaleybar2021one, zhou2020distilled, kasturi2020fusion].

In this work, we discuss whether the non-IID quagmire exists in quantum federated learning. The answer is yes, and we support it with both theoretical analysis and numerical experiments. Then we move on to propose a solution of this problem. We prove that a global quantum channel can be exactly decomposed into channels trained by each client with the help of local density estimators. It provides a general framework, dubbed qFedInf, for quantum federated learning on non-IID data. Meanwhile, qFedInf is one-shot in terms of communication complexity, and free from gradient sharing. We conduct numerical experiments in highly non-IID settings to demonstrate it.

Ii Main Results

ii.1 Decentralized Quantum Data

We begin by setting up a typical decentralized quantum dataset. Classical datasets can be regarded as a special case of it, where the data samples are orthogonal to each other. We omit the labels for simplicity here.

In federated learning tasks, we are given a set of clients ${C_{i}}_{i = 1}^{n}$ , and we use $| C_{i} ⟩ \in H_{C_{i}}$ to denote the $i$ th client. Each of the clients has access to its own dataset containing $N_{C_{i}}$ samples $D_{i} = {| ψ_{j}^{C_{i}} ⟩}_{j = 1}^{N_{C_{i}}}$ from the data Hilbert space $H_{x}$ . The dataset can be statistically represented by the density matrix $ρ_{x}^{C_{i}} = N_{C_{i}}^{- 1} \sum_{j} | ψ_{j}^{C_{i}} ⟩ ⟨ ψ_{j}^{C_{i}} |$ . Note that the density matrices of different clients may vary dramatically (i.e. non-IID). For example, in a multi-class classification problem, each client may have only seen samples from two or three classes.

If we assume that there is no entanglement among clients, then the joint density matrix of both the data and the clients is represented by $ρ = \sum_{i = 1}^{n} ρ_{x}^{C_{i}} \otimes p_{C_{i}} | C_{i} ⟩ ⟨ C_{i} |$ , where $p_{C_{i}} = N_{C_{i}} / N$ is the statistical weight of client $C_{i}$ , and $N = \sum_{i} N_{C_{i}}$ is the total number of samples. In a classical dataset, this corresponds to the joint probability distribution $p (x, C_{i})$ .

The averaged data density matrix is obtained by tracing out the clients: $ρ_{x} = {Tr}_{C} (ρ) = \sum_{i} p_{C_{i}} ρ_{x}^{C_{i}} .$ Let $P_{C_{i}} = | C_{i} ⟩ ⟨ C_{i} |$ be the projector onto the subspace of client $C_{i}$ . Then we can introduce the conditional density matrix [swart2020introduction] $ρ_{x | C_{i}} = P_{C_{i}} ρ P_{C_{i}} / Tr (ρ P_{C_{i}}) = ρ_{x}^{C_{i}} \otimes | C_{i} ⟩ ⟨ C_{i} |$ , which characterizes the data of a given client $C_{i}$ . Tracing out the clients recovers the local dataset: ${Tr}_{C} (ρ_{x | C_{i}}) = ρ_{x}^{C_{i}}$ . We summarize these concepts in Figure 1.

Figure 1: A schematic view of decentralized quantum data.

ii.2 Quantum Federated Averaging

Here we briefly review the quantum version of FedAvg [mcmahan2017comm]. In a general supervised quantum machine learning problem, we aim to find a quantum channel $M$ that takes an input state $| ψ ⟩$ and transform it into the desired result (e.g. which class the image belongs to). We achieve this by tuning the variational parameters $w$ of the channel $M_{w}$ to minimize the average of some loss function $L$ on a given training dataset $D$ :

min w \frac{1}{N} \sum (| ψ ⟩, y) \in D L (w, | ψ ⟩, y),

(1)

where $| ψ ⟩$ is the input and $y$ is the corresponding label. We use gradient descent with learning rate $η$ to iteratively solve this optimization problem: at time step $t$ ,

w_{t} = w_{t - 1} - η \nabla_{w} \frac{1}{N} \sum (| ψ ⟩, y) \in D L (w, | ψ ⟩, y) .

(2)

For decentralized data, the total dataset is divided into local datasets from multiple clients: $D = \cup_{i} D_{i}$ . Then we can decompose the update rule of Equation (2) into three steps: (1) local updates at time step $t$ for each client $C_{i}$ ,

w_{t}^{i} = w_{t - 1}^{i} - η \nabla_{w} \frac{1}{N_{C_{i}}} \sum (| ψ ⟩, y) \in D_{i} L (w, | ψ ⟩, y),

(3)

(2) global average for every $T$ steps, e.g. at time step $m T, m \in N$ ,

w_{m T} = n \sum i = 1 p_{C_{i}} w_{m T}^{i},

(4)

and (3) broadcast the averaged weights to all clients as the initial parameters for the next iteration. This is the basic protocol of qFedAvg, the common ground of all the existing quantum federated learning algorithms [li2021quantum, xia2021quantumfed, chen2021federated, chehimi2022quantum, yun2022slimmable]. It recovers the centralized update rule when $T = 1$ . However in practice, this cannot be achieved since we use mini batch training strategies.

ii.3 The Non-IID Quagmire of Quantum FedAvg

As pointed out in [zhao2018federated, hsieh2020non], FedAvg faces the problem of non-IID quagmire, in which its performance deteriorate when the data of different clients are non-IID. Does this phenomenon also exist in the quantum regime?

We follow the steps of [zhao2018federated] and quantify the performance difference of FedAvg and the centralized case by the weight divergence $Δ = ∥ w^{(f)} - w^{(c)} ∥$ , where $w^{(f)}$ and $w^{(c)}$ are the weights given by qFedAvg and the centralized case respectively. On the other hand, the level of non-IID can be quantified by the earth mover’s distance (EMD) [rubner2000earth] between the label distribution of client $C_{i}$ $p^{(i)} (y = k) = \sum_{y \in D_{i}} δ_{y, k} / N_{C_{i}}$ and the centralized distribution $p (y = k) = \sum_{y \in D} δ_{y, k} / N_{C}$ : ${EMD}_{i} = \sum_{k} ∥ p^{(i)} (y = k) - p (y = k) ∥$ . Then, as a direct quantum extension to the Proposition 3.1 in [zhao2018federated], we have the following proposition:

Proposition 1

For a loss function of the form $L (w, | ψ ⟩, y) = \sum_{k = 1}^{n_{c}} δ_{y, k} f_{k} (w, | ψ ⟩)$ ¹¹1In a typical classification problem, $k$ denotes the different classes and the standard cross entropy loss is given by $f_{k} (w, | ψ ⟩) = log r_{k} (w, | ψ ⟩)$ , where $r_{k}$ is the predicted probability of $| ψ ⟩$ belonging to class $k$ ., if the gradient $\nabla_{w} E_{| ψ ⟩ | y = k} f_{k} (w, | ψ ⟩)$ defined as $\nabla_{w} (\sum_{| ψ ⟩, y} δ_{y, k} f_{k} (w, | ψ ⟩) / \sum_{| ψ ⟩, y} δ_{y, k})$ is $λ_{k}$ -Lipschitz for all possible value $k$ of label $y$ , then the following inequality holds for the weight divergence of qFedAvg:

\begin{matrix} Δ_{m T} \leq & n \sum i = 1 p_{C_{i}} (a_{i})^{T} Δ_{(m - 1) T} + η n \sum i = 1 p_{C_{i}} {EMD}_{i} T - 1 \sum j = 1 (a_{i})^{j} g (w_{m T - 1 - j}^{(c)}), \end{matrix}

(5)

where $g (w) = {max}_{k} ∥ \nabla_{w} \sum_{| ψ ⟩, y} δ_{y, k} f_{k} (w, | ψ ⟩) ∥$ and $a_{i} = 1 + η \sum_{k} p^{(i)} (y = k) λ_{k}$ .

Therefore, EMD is indeed a source of the weight divergence when $T > 1$ . This provides theoretical explanation for the existence of non-IID quagmire in quantum federated learning. The detailed proof is provided in Appendix A for completeness. Numerical experiments are conducted in Section III.3 to empirically illustrate this phenomenon in quantum federated learning.

ii.4 Federated Decomposition of Quantum Channels

Now that we have identified the non-IID quagmire in qFedAvg, we aim to find a different approach to tackle Equation (1) when the data are decentralized. We consider the case where the loss function is a function of the output of the quantum channel and the label: $L (w, | ψ ⟩, y) = L (M_{w} (σ_{x}^{ψ}), y)$ , where $σ_{x}^{ψ} = | ψ ⟩ ⟨ ψ |$ .

We begin by noting that when the data are decentralized, each client $C_{i}$ can still train a local channel $M_{i}$ on its own data $D_{i}$ :

M_{i} = {argmin}_{M} \frac{1}{N_{C_{i}}} \sum (| ψ ⟩, y) \in D_{i} L (M (σ_{x}^{ψ}), y) .

(6)

Apparently, the original problem, Equation (1), can be decomposed as

\begin{matrix} min M \frac{1}{N} \sum (| ψ ⟩, y) \in D L (M (σ_{x}^{ψ}), y) = min M \sum i p_{C_{i}} ⎡ ⎣ \frac{1}{N_{C_{i}}} \sum (| ψ ⟩, y) \in D_{i} L (M (σ_{x}^{ψ}), y) ⎤ ⎦ \end{matrix} .

(7)

We can saturate the whole minimization problem if we saturate each individual sub problems, which is exactly what $M_{i}$ does. Therefore, if we can construct a global channel $M$ , such that it coincides with all the local channels $M_{i}$ on their own data, i.e.

M (σ_{x}^{ψ}) = M_{i} (σ_{x}^{ψ}), \forall | ψ ⟩ from ρ_{x}^{C_{i}},

(8)

then the goal is achieved. ²²2The fact that identical samples have identical labels guarantees that the global channel $M$ is well defined. For example, suppose we have two identical samples $| ψ^{C_{i}} ⟩ = | ψ^{C_{j}} ⟩ = | ψ ⟩$ with the same label $y$ , but are from different clients $C_{i} \neq C_{j}$ . Then, in order to fulfill the local minimization problems, Equation (6), we must have $M_{i} (| ψ ⟩ ⟨ ψ |) = M_{j} (| ψ ⟩ ⟨ ψ |)$ . This can be rephrased backwards, as demanding that the local channels $M_{i}$ coincide with the global channel $M$ on condition that the input data $σ_{x}^{ψ}$ comes from its own dataset $ρ_{x}^{C_{i}}$ . That is,

M_{i} (σ_{x}^{ψ}) = M (\frac{P_{x}^{ψ} ρ_{x}^{C_{i}} P_{x}^{ψ}}{Tr (ρ_{x}^{C_{i}} P_{x}^{ψ})}) = M ({% Tr}_{C} (\frac{P_{x}^{ψ} ρ_{x | C_{i}} P_{x}^{ψ}}{Tr (ρ_{x | C_{i}} P_{x}^{ψ})})),

(9)

where $P_{x}^{ψ} = | ψ ⟩ ⟨ ψ |$ denotes the projector.

In order to combine these local channels into a global one, we also need to capture the information of the local datasets $ρ_{x}^{C_{i}}$ . To achieve this, we introduce the local density estimator $D_{i}$ , which is trained to output the probability density of the input state within the local dataset:

D_{i} (σ_{x}) = Tr (ρ_{x}^{C_{i}} P_{x}^{ψ}) = Tr (ρ_{x | C_{i}} P_{x}^{ψ}) .

(10)

In fact, the local channels $M_{i}$ and density estimators $D_{i}$ are enough to give an explicit construction of the global channel $M$ . This is provided by the following theorem (proved in Appendix B):

Theorem 1

(Federated Decomposition of Quantum Channels)
For each client $C_{i}, i = 1, \dots, n$ , which only has access to its own data $ρ_{x}^{C_{i}}$ with $N_{C_{i}}$ samples, a local channel $M_{i}$ and a local density estimator $D_{i}$ can be trained. Assuming that there is no entanglement among clients, then the global channel $M$ can be decomposed into

M (σ_{x}) = n \sum i = 1 M_{i} (σ_{x}) q_{i} = n \sum i = 1 M_{i} (σ_{x}) \frac{D_{i} (σ_{x}) p_{C_{i}}}{\sum_{j} D_{j} (σ_{x}) p_{C_{j}}},

(11)

where $q_{i} = D_{i} (σ_{x}) p_{C_{i}} / \sum_{j} D_{j} (σ_{x}) p_{C_{j}}$ , $σ_{x}$ is any pure input state and $p_{C_{i}} = N_{C_{i}} / \sum_{j = 1}^{n} N_{C_{j}}$ . Extension to mixed input states follows from direct linear superposition.

We note that the classical special case of this theorem was first introduced in [liu2022federated]. As a result, for any input state $σ_{x}$ , if we randomly apply a local channel $M_{i}$ with probability $q_{i}$ , the result will be statistically the same as the global channel $M$ . This fact leads to the following framework for quantum federated learning.

ii.5 A One-shot Quantum Federated Learning Framework for Non-IID Data

Theorem 1 provides a framework for quantum federated learning. The specific protocol goes as follows. Firstly, each client $C_{i}$ trains a local channel $M_{i}$ and a local density estimator $D_{i}$ with its own data $ρ_{x}^{C_{i}}$ . This step is completely distributed, and concludes the whole training phase. Secondly, the trained channels $M_{i}$ and density estimators $D_{i}$ are sent to the server for inference according to Equation (11). That is, when a new input $σ_{x}^{ψ}$ comes, the server computes $q_{i}$ via the density estimators. Then it randomly loads the parameters from $M_{i}$ with probability $q_{i}$ and gathers the outcomes. We call this framework quantum federated inference, or qFedInf for short. The detailed algorithms are summarized in Algorithm 1 and 2.

Input: The local data set

D_{i}

; number of training epochs

n_{e}

; batch size

n_{b}

; learning rate

η

; the optimizer Opt.

Output: The trained channel

M_{i}

and density estimator

D_{i}

to be sent to the server.

1 Initialize the channel

M_{i}

with parameters

w

2 for $e p o c h \leftarrow 1$ to $n_{e}$ do

3 for $b a t c h (| ψ_{j}^{C_{i}} ⟩, y_{j})_{j = 1}^{n_{b}} \in D_{i}$ do

4 Calculate the gradient

g \leftarrow \nabla_{w} \sum_{j} L (M_{i, w} (| ψ_{j}^{C_{i}} ⟩ ⟨ ψ_{j}^{C_{i}} |), y_{j})

5 Update the channel parameters

w \leftarrow w - η \cdot O p t (g)

7Train the density estimator

D_{i}

return $M_{i}, D_{i}$ .

Algorithm 1 Quantum Federated Inference: Training Phase on Client

C_{i}

Input: A new sample

| ψ ⟩

to perform inference on; channels

{M_{i}}_{i = 1}^{n}

and density estimators

{D_{i}}_{i = 1}^{n}

gathered from the clients.

Output: The inference outcome

M (| ψ ⟩ ⟨ ψ |)

1 for $i \leftarrow 1$ to $n$ do

2 Calculate the density

{~ q}_{i} \leftarrow D_{i} (| ψ ⟩ ⟨ ψ |)

4Calculate the weights

q_{i} \leftarrow {~ q}_{i} / \sum_{m} q_{m}, i = 1, \dots, n

5 Choose a random channel

M_{i}

with probability

q_{i}

M (| ψ ⟩ ⟨ ψ |) \leftarrow M_{i} (| ψ ⟩ ⟨ ψ |)

return $M (| ψ ⟩ ⟨ ψ |)$ .

Algorithm 2 Quantum Federated Inference: Inference Phase

In practice, there is a wide range of choices for the channels $M_{i}$ and the density estimators $D_{i}$ . For channels, we can use classical, quantum or hybrid algorithms at our will [bishop2006pattern, goodfellow2016deep, li2021recent]. For density estimators, we can use classical ones like Gaussian mixture models [bishop2006pattern] and normalizing flows [rezende2015variational], or quantum ones such as classical shadow tomography [huang2020predicting] and quantum state diagonalization algorithms [larose2019variational, xin2021experimental]. Each kind of density estimator comes with a different training strategy. We can also classify the possible scenarios based on the classical/quantum nature of the data and the channels:

Classical Data & Classical Channel: This is a purely classical problem already discussed in [liu2022federated].

Classical Data & Quantum Channel: To perform quantum channels on classical data, one need to choose an appropriate encoding scheme, e.g. amplitude encoding or gate encoding, to encode the data into quantum states. This gives rise to the problem of whether to use a classical density estimator on the original data or a quantum one on the encoded states. Note that before encoding, different samples are orthogonal to each other. However, the encoded quantum samples are in general overlapped, meaning that there’s a possibility of mistaking one sample from another. Nevertheless, quantum density estimators offer a potential exponential speed-up. So there’s a trade-off between accuracy and efficiency.

Quantum Data & Quantum Channel: For quantum data, we need a quantum density estimator to estimate $Tr (ρ_{x}^{C_{i}} P_{x}^{ψ})$ . This task can be solved by the classical shadow tomography, which is proved to saturate the information-theoretic bound [huang2020predicting].

Compared to qFedAvg, the proposed framework qFedInf shares some merits with its classical counterpart [liu2022federated]. It’s one-shot in the sense that only one communication between the server and each client is required. Meanwhile, there’s no need for a global public dataset or data synthesis/distillation, which are the common ground of existing one-shot algorithms [zhou2020distilled, kasturi2020fusion]. Moreover, since no gradient information is transmitted, it’s automatically immune to the attacks based on gradient inversion [zhu2019deep, geiping2020inverting]. Finally, the density estimators can capture possible data heterogeneity, providing a new way to perform federated learning with non-IID data.

We also note that Theorem 1 holds for generic quantum channels, which is the most general form of quantum information processing. So the proposed framework qFedInf may be applied directly to machine learning tasks beyond classification.

Iii Numerical Experiments

Figure 2: The quantum circuit used as the quantum classifier.

Top-1 Accuracy/%	Centralized	qFedInf( $⋆$ )	qFedAvg( $⋆$ )	qFedInf( $\circ$ )	qFedAvg( $\circ$ )
MNIST	91.2 $\pm$ 0.6	92.4 $\pm$ 0.3	86.2 $\pm$ 1.0	92.7 $\pm$ 0.2	88.4 $\pm$ 0.8
Fashion-MNIST	77.2 $\pm$ 0.5	74.0 $\pm$ 0.3	61.4 $\pm$ 1.4	75.4 $\pm$ 0.3	66.7 $\pm$ 1.3

Table 1: Performance of centralized, qFedInf and qFedAvg classifiers on MNIST and Fashion-MNIST datasets under different federated settings (

⋆

\circ

for star/cycle-2 structure). The averages and standard deviations are obtained from 10 runs.

Figure 3: The test accuracy of qFedAvg on MNIST (left) and Fashion-MNIST (right) with different levels of non-IID in the cycle-m structure. The dashed line represents the test accuracy of centralized classifier (red) and qFedInf trained with 2 classes per client (black). The error bars and shaded regions mark the standard deviation over 10 runs.

iii.1 Constructing Non-IID Datasets

We observe that the existing quantum federated learning algorithms can already achieve high accuracies on binary classification tasks with synthetic and common classical/quantum datasets [chehimi2022quantum, li2021quantum]. Therefore, to better illustrate the performance differences of qFedInf and qFedAvg, we devise a highly heterogeneous federated dataset based on 8 classes (“0” through “7”) from the MNIST hand written digits dataset. To produce heterogeneity, we adopt the star structure and cycle-m structure settings from [liu2022federated] as follows.

Star structure: Each client only has access to the data of two classes, with one of the classes fixed. That is, client $C_{i}, i = 1, \dots, 7$ only has access to the data of digit “0” and “ $i$ ”.

Cycle-m structure: Each client only has access to the data of $m$ classes in a cyclic way. Client $C_{i}, i = 1, \dots, 7$ has access to the data of digit “ $i$ ”, …, “ $i + m - 1$ module 8”.

Datasets of the same structures are also prepared for the Fashion-MNIST dataset, which is composed of images of daily objects and regarded as a harder version of MNIST.

iii.2 Details of the Quantum Classifiers

We parameterize the quantum classifier as a quantum circuit of $l$ layers. Each layer contains a set of controlled-NOT gates on adjacent qubits, followed by a parameterized rotation on each qubit. The detailed circuit is shown in Figure 2.

To load the data into an 8-qubit quantum circuit, we interpolate and resize each image into a size of 16 $\times$ 16 and use amplitude encoding to transform it into a quantum state:

{x_{j}}_{j = 1}^{256} \to | ψ ⟩ = 256 \sum j = 1 \frac{x_{j}}{\sqrt{\sum_{u = 1}^{256} ∥ x_{u} ∥^{2}}} | j ⟩,

(12)

where ${x_{j}}$ is the pixel values of the image and ${| j ⟩}$ denotes the computational basis.

To perform inference, we measure the expectation values on each qubit, amplify the outcomes by a factor of 10 and fed them into the softmax function to predict the probability of each class. We use the Adam optimizer [kingma2017adam] of learning rate $10^{- 2}$ and a batch-size of $128$ to minimize the standard cross entropy loss function. The gradients used in the optimization can be computed via the parameter shift rule [mitarai2018quantum, li2021quantum, li2017hybrid].

All the numerical simulations are conducted with JAX [jax2018github] and TensorCircuit [zhang2022tensorcircuit] on one NVIDIA Tesla V100 GPU. The source code is available at https://github.com/JasonZHM/quantum-fed-infer.

iii.3 Performance of qFedAvg and qFedInf on Non-IID Data

Figure 4: The test loss and accuracy of centralized classifier and qFedAvg on star structure MNIST (left) and Fashion-MNIST (right). The dashed line represents the test accuracy of qFedInf. The error bars and shaded regions mark the standard deviation over 10 runs.

We begin by demonstrating the non-IID quagmire of qFedAvg discussed in Section II.3 with numerical simulations. Specifically, we train and test qFedAvg on datasets of the cycle-m structure. The parameter $m$ serves as a good controller over the level of non-IID: as $m$ increases, each client will have access to more classes, and the level of non-IID will decrease.

We train the channel with $l = 48$ and $m = 2, 3, 4, 5, 6$ using qFedAvg for 5 epochs. The global synchronization frequency of qFedAvg is set to be one time per batch step. The resulting test accuracies on a test set of size 1024 are plotted in Figure 3. In line with the theoretical analysis, we find that the top-1 accuracy increases as the level of non-IID drops. When the data are highly heterogeneous ( $m = 2$ ), qFedAvg suffers from a loss of $\sim 4 %$ ( $10 %$ ) in accuracy on MNIST (Fashion-MNIST) compared to the benchmark trained on the centralized data with the same circuit structure. Nevertheless, when the data heterogeneity is mild ( $m \geq 5$ ), qFedAvg can achieve comparable performance compared with the centralized classifier. We also plot the test loss and accuracy curves on star structure in Figure 4. As expected, qFedAvg converges much slower to a significantly lower accuracy.

To test the performance of qFedInf on non-IID data, we train and test qFedInf on the most heterogeneous settings, namely the star structure and cycle-2 structure, and compare it with qFedAvg and the centralized benchmark. In such settings, each client’s local classifier only need to perform a binary classification, and thus in practice, we find that circuits with only a few layers suffice to achieve good performance. For a fair comparison with the $l = 48$ qFedAvg and the centralized benchmark, we choose $l = 6$ for the local classifiers, so that the total number of variational parameters of the $8$ clients stays the same. We train the local classifiers for 5 epochs and plot the loss and accuracy curves in Figure 5. We adopt Gaussian mixture models with 5 modes as the local density estimators. The combined global model achieves a top-1 accuracy similar to the centralized benchmark and significantly higher than that of qFedAvg on both settings of both datasets. The detailed accuracies are listed in Table 1.

We note that in the training process, qFedAvg requires a total number of 500 communication rounds, while qFedInf only needs one. Moreover, the local classifiers used in qFedInf are much shallower compared to those of qFedAvg and the centralized benchmark. If we change the $l$ of the centralized classifier to 6, its test accuracy will drops to only $\sim 75 %$ . This suggests that, putting the federated settings aside, qFedInf can utilize the collective knowledge of many small models to achieve the capability of a large model. It’s a quantum analog of mixture of experts [masoudnia2014mixture] and ensemble learning [sagi2018ensemble].

Figure 5: The training loss and accuracy of the local classifiers $M_{i}$ in qFedInf on start structure MNIST (left) and Fashion-MNIST (right). The average curves are highlighted.

Iv Conclusions

In this work, we tackle the problem of non-IID data in quantum federated learning. We give a theoretical analysis of the non-IID quagmire in qFedAvg, and support it with numerical experiments. We prove that a global quantum channel can be exactly decomposed into channels trained by each client with the help of local density estimators. It leads to a general framework qFedInf for quantum federated learning on non-IID data. It’s one-shot in terms of communication complexity and immune to gradient-inversion based attacks.

We conduct numerical experiments on multi-class classification tasks to demonstrate the proposed framework. We devise a highly heterogeneous federated dataset based on MNIST and Fashion-MNIST. Experiments show that qFedInf achieves a comparable performance compared with the centralized benchmark, and outperforms the qFedAvg with significantly fewer communication rounds.

Acknowledgements.

We thank Weikang Li, Jingyi Zhang, Yuxuan Yan, Rebing Wu and Yuchen Guo for insightful discussions. We also acknowledge the Tsinghua Astrophysics High-Performance Computing platform for providing computational and data storage resources. This work is financially supported by Zhili College, Tsinghua University.

Appendix A Proof of Proposition 1

Proposition 1 is a direct generalization of Proposition 3.1 in [zhao2018federated]. The ideas are basically the same. Nevertheless, we provide a proof here for completeness. Based on the definition of $Δ$ and the update rules, Equations (2), (3) and (4), we have

\begin{matrix} Δ_{m T} = ∥ w_{m T}^{(f)} - w_{m T}^{(c)} ∥ = ∥ \sum i p_{C_{i}} (w_{m T - 1}^{i} - η \nabla_{w} \frac{1}{N_{C_{i}}} \sum (| ψ ⟩, y) \in D_{i} n_{c} \sum k = 1 δ_{y, k} f_{k} (w_{m T - 1}^{i}, | ψ ⟩)) - w_{m T - 1}^{(c)} + η \nabla_{w} \frac{1}{N} \sum (| ψ ⟩, y) \in D n_{c} \sum k = 1 δ_{y, k} f_{k} (w_{m T - 1}^{(c)}, | ψ ⟩) ∥ = ∥ \sum i p_{C_{i}} (w_{m T - 1}^{i} - η n_{c} \sum k = 1 p^{(i)} (y = k) \nabla_{w} E_{| ψ ⟩ | y = k} f_{k} (w_{m T - 1}^{i}, | ψ ⟩)) - w_{m T - 1}^{(c)} + η n_{c} \sum k = 1 p (y = k) \nabla_{w} E_{| ψ ⟩ | y = k} f_{k} (w_{m T - 1}^{(c)}, | ψ ⟩) ∥ . \end{matrix}

(13)

Now we apply the triangle inequality and the Lipschitz conditions. Together with the definition $a_{i} = 1 + η \sum_{k} p^{(i)} (y = k) λ_{k}$ , we have

\begin{matrix} Δ_{m T} & \leq ∥ \sum i p_{C_{i}} w_{m T - 1}^{i} - w_{m T - 1}^{(c)} ∥ + η ∥ \sum i p_{C_{i}} \sum k p^{(i)} (y = k) (\nabla_{w} E_{| ψ ⟩ | y = k} f_{k} (w_{m T - 1}^{i}, | ψ ⟩)) - \nabla_{w} E_{| ψ ⟩ | y = k} f_{k} (w_{m T - 1}^{(c)}, | ψ ⟩)) ∥ \leq \sum i p_{C_{i}} (1 + η \sum k p^{(i)} (y = k) λ_{k}) ∥ w_{m T - 1}^{i} - w_{m T - 1}^{(c)} ∥ = \sum i p_{C_{i}} a_{i} ∥ w_{m T - 1}^{i} - w_{m T - 1}^{(c)} ∥ . \end{matrix}

(14)

Then we continue going backwards in the time steps. With the triangle inequality and the definitions of $g (w)$ and ${E M D}_{i}$ , we have

(15)

By induction and the broadcast rule $w_{(m - 1) T}^{i} = w_{(m - 1) T}^{(f)}$ , we arrive at

\begin{matrix} ∥ w_{m T - 1}^{i} - w_{m T - 1}^{(c)} ∥ & \leq a_{i}^{T - 1} ∥ w_{(m - 1) T}^{i} - w_{(m - 1) T}^{(c)} ∥ + η {E M D}_{i} T - 2 \sum j = 0 a_{i}^{j} g (w_{m T - 2 - j}^{(c)}) = a_{i}^{T - 1} Δ_{(m - 1) T} + η {E M D}_{i} T - 2 \sum j = 0 a_{i}^{j} g (w_{m T - 2 - j}^{(c)}) \end{matrix}

(16)

Plug it into Equation (14), and we finally reach the desired result:

\begin{matrix} Δ_{m T} & \leq \sum i p_{C_{i}} a_{i}^{T} Δ_{(m - 1) T} + η \sum i p_{C_{i}} {E M D}_{i} T - 1 \sum j = 1 a_{i}^{j} g (w_{m T - 1 - j}^{(c)}) . \end{matrix}

(17)

Appendix B Proof of Theorem 1

With the definitions in Sections II.1 and II.4, for any pure input state $σ_{x}^{ψ} = | ψ ⟩ ⟨ ψ |$ , the global channel $M$ can be decomposed into

\begin{matrix} M (σ_{x}^{ψ}) & = M (\frac{P_{x}^{ψ} ρ_{x} P_{x}^{ψ}}{Tr (ρ_{x} P_{x}^{ψ})}) = M (\frac{P_{x}^{ψ} {Tr}_{C} (ρ) P_{x}^{ψ}}{% Tr (ρ_{x} P_{x}^{ψ})}) = \frac{1}{Tr (ρ_{x} P_{x}^{ψ})} \sum i M (P_{x}^{ψ} {Tr}_{C} (P_{C_{i}} ρ P_{C_{i}}) P_{x}^{ψ}) = \frac{1}{Tr (ρ_{x} P_{x}^{ψ})} \sum i M ({Tr}_{C} (P_{x}^{ψ} ρ_{x | C_{i}} P_{x}^{ψ})) Tr (ρ P_{C_{i}}) = \frac{1}{Tr (ρ_{x} P_{x}^{ψ})} \sum i M_{i} (σ_{x}) Tr (ρ_{x | C_{i}} P_{x}^{ψ}) p_{C_{i}} = \sum i M_{i} (σ_{x}) \frac{D_{i} (σ_{x}) p_{C_{i}}}{\sum_{j} D_{j} (σ_{x}) p_{C_{j}}}, \end{matrix}

(18)

where the second line utilizes the fact that $ρ$ is diagonal in $| C_{i} ⟩$ : $ρ = \sum_{i} P_{C_{i}} ρ P_{C_{i}}$ , and the last equality follows from

\begin{matrix} Tr (ρ_{x} P_{x}^{ψ}) = Tr (ρ P_{x}^{ψ}) = \sum j Tr (P_{C_{j}} ρ P_{C_{j}} P_{x}^{ψ}) = \sum j Tr (ρ_{x | C_{j}} P_{x}^{ψ}) Tr (ρ P_{C_{j}}) = \sum j D_{j} (σ_{x}) p_{C_{j}} \end{matrix}

(19)

As for mixed states, we note that they can always be decomposed into a linear combination of pure states. Thus following from the linearity of quantum channels, the formula for $M$ acting on mixed states follows from direct linear superposition. This completes the proof of Theorem 1.