Dynamic Global Sensitivity for Differentially Private Contextual Bandits

Huazheng Wang huazheng.wang@oregonstate.edu Oregon State UniversityCorvallisOregonUSA , David Zhao dz6hu@virginia.edu University of VirginiaCharlottesvilleVirginiaUSA and Hongning Wang hw5x@virginia.edu University of VirginiaCharlottesvilleVirginiaUSA

2022

Abstract.

We propose a differentially private linear contextual bandit algorithm, via a tree-based mechanism to add Laplace or Gaussian noise to model parameters. Our key insight is that as the model converges during online update, the global sensitivity of its parameters shrinks over time (thus named dynamic global sensitivity). Compared with existing solutions, our dynamic global sensitivity analysis allows us to inject less noise to obtain $(ϵ, δ)$ -differential privacy with added regret caused by noise injection in $~ O (log T \sqrt{T} / ϵ)$ . We provide a rigorous theoretical analysis over the amount of noise added via dynamic global sensitivity and the corresponding upper regret bound of our proposed algorithm. Experimental results on both synthetic and real-world datasets confirmed the algorithm’s advantage against existing solutions.

Abstract.

Bandit algorithms have become a reference solution for interactive recommendation. However, as such algorithms directly interact with users for improved recommendations, serious privacy concerns have been raised regarding its practical use. In this work, we propose a differentially private linear contextual bandit algorithm, via a tree-based mechanism to add Laplace or Gaussian noise to model parameters. Our key insight is that as the model converges during online update, the global sensitivity of its parameters shrinks over time (thus named dynamic global sensitivity). Compared with existing solutions, our dynamic global sensitivity analysis allows us to inject less noise to obtain $(ϵ, δ)$ -differential privacy with added regret caused by noise injection in $~ O (log T \sqrt{T} / ϵ)$ . We provide a rigorous theoretical analysis over the amount of noise added via dynamic global sensitivity and the corresponding upper regret bound of our proposed algorithm. Experimental results on both synthetic and real-world datasets confirmed the algorithm’s advantage against existing solutions.

Differential privacy, contextual bandits

^†^†journalyear: 2022^†^†copyright: acmlicensed^†^†conference: Sixteenth ACM Conference on Recommender Systems; September 18–23, 2022; Seattle, WA, USA^†^†booktitle: Sixteenth ACM Conference on Recommender Systems (RecSys ’22), September 18–23, 2022, Seattle, WA, USA^†^†price: 15.00^†^†doi: 10.1145/3523227.3546781^†^†isbn: 978-1-4503-9278-5/22/09^†^†ccs: Security and privacy Privacy protections^†^†ccs: Theory of computation Sequential decision making^†^†ccs: Theory of computation Online learning algorithms

1. Introduction

Multi-armed bandit algorithms have become a reference solution for sequential decision-making; and they have been successfully applied to a wide variety of real-world applications such as recommendation (LinUCB), display advertisement (li2010exploitation), and clinical trials (durand2018contextual). In each iteration of such problems, a learner selects among a set of recommendation candidates, often referred as arms, and receives the corresponding reward after each selection. The learner’s goal is to maximize the cumulative reward over time or equivalently to minimize regret. This requires the learner to both exploit the currently estimated best arm and explore among the arms to improve its estimation. Contextual bandits extend this setting to that the learner is given a context vector that encodes side information for reward estimation in each iteration.

As bandit algorithms oftentimes directly work with user feedback, e.g., treating user clicks as reward, serious privacy concerns have been raised (thakurta2013nearly; agarwal2017price; tossou2017achieving; shariff2018differentially). We use recommender systems equipped with bandit algorithms as an example to illustrate its risk of privacy breach. In such a system, the algorithm chooses an item for a user, and the user decides whether to click the recommendation based on his/her true preference. Such click feedback is used to update the bandit model for improving its subsequent recommendations. As a result, any change in a user’s preference promptly leads to changes in the algorithm’s output, e.g., different sequences of recommended items. User’s private information, e.g., his/her preference over items in this example, is thus revealed even if the click feedback is kept private. The goal of privacy protection is to prevent the algorithm’s output from revealing a user’s private information, such as item preferences. Real-world privacy breaches have been reported in Amazon’s recommendation system (calandrino2011you) and Facebook’s advertisement system (korolova2010privacy), where an adversary can learn considerable side information about a user solely based on the system’s recommendation sequence.

Differential privacy (dwork2006calibrating; dwork2014algorithmic) provides a rigorous guarantee that an algorithm’s output has little dependency on the change of input at a single data point, which limits the amount of sensitive information an adversary can infer from the algorithm’s output. It has been widely adopted in both industry and academia. The basic idea is to add a controlled level of noise to an algorithm’s output, such that the subsequent change in its output caused by input change is indistinguishable from that caused by the added noise. The key is to determine the scale of the noise, which depends on the sensitivity of the output given the change of input. However, the utility of a private algorithm decreases due to the injected noise. Under the same privacy budget, a good private algorithm adds less noise to its output, preserving more utility while achieving the same level of privacy.

Differential privacy has been studied in stochastic multi-armed bandits (mishra2015nearly; tossou2016algorithms), adversarial bandits (tossou2017achieving), and linear contextual bandits (shariff2018differentially; neel2018mitigating). Because differential privacy is immune to post-processing (dwork2014algorithmic), the output of a bandit algorithm (i.e., the sequence of selected arms) is differentially private once a private mechanism protects sensitive statistics at any internal stage of the algorithm, e.g., collected rewards or estimated bandit parameters. All existing differentially private bandit algorithms add noise to reward or context vectors (shariff2018differentially; neel2018mitigating); but they fail to realize the role of bandit model’s convergence in achieving privacy. Because the model parameter is converging during online update, the same reward change from the user (i.e., the input to the algorithm) will impose smaller impact to the model parameter in the later stage, and thus leads to less change in the algorithm’s output. This property suggests that the model is less sensitive to the change of input over time and thus less noise is needed for privacy protection in the later stage.

Based on this important observation, we first study the dynamic nature of sensitivity of the bandit model parameters over time, which we refer to as dynamic global sensitivity analysis. We apply the tree-based mechanism (dwork2010differential; chan2011private) with dynamic global sensitivity to add Laplace noise to the model parameter of a linear UCB algorithm (LinUCB) to achieve the same level of differential privacy, but with much less added noise compared with previous approaches. We rigorously analyze the level of noise added by the tree-based mechanism using dynamic global sensitivity and prove the upper regret bound of our proposed algorithm. The main contributions of this paper can be summarized as follows:

We quantify the dynamic global sensitivity of a linear bandit model based on its convergence property. As the model converges, it becomes less sensitive to the reward change over time, and thus less noise is needed to obtain the same level of privacy.
We propose a tree-based mechanism with dynamic global sensitivity analysis. We show that this mechanism adds $O (\frac{1}{ϵ} log T \sqrt{log T} log \frac{1}{ζ} / T)$ noise to the bandit model parameter at time $T$ , and with probability $1 - ζ$ , it achieves $(ϵ, δ)$ -differential privacy. As a result, the added regret caused by noise injection is $~ O (log T \sqrt{T} / ϵ)$ ¹¹1 $~ O (\cdot)$ omits the logarithmic terms of $d, 1 / λ, 1 / ζ$ , 1/ $δ$ .
We also empirically evaluate our algorithm on both synthetic and real-world datasets and validate the algorithm’s advantage against existing solutions.

2. Related Work

Multi-armed bandit problem was first studied in (thompson1933likelihood) and (robbins1952some). UCB1 proposed the Upper Confidence Bound (UCB) method to solve the stochastic multi-armed bandit problem. We refer to (bubeck2012regret; lattimore2018bandit) for a comprehensive survey regarding stochastic multi-armed bandits and its variants. In this paper, we focus on the linear contextual bandit problem, where each arm is associated with a context feature vector that encodes side information. The reward is governed by a linear function of the context vector, characterized by the corresponding model parameters. UCB-style solution for linear contextual bandit has been popularly studied in (Auer02; LinUCB; Improved_Algorithm).

Differential privacy (dwork2006calibrating) provides a formal notion to quantify the amount of information regarding to an algorithm’s input that an adversary could obtain by observing the algorithm’s output. A common technique is to add Laplace or Gaussian noise to the algorithm’s output. The scale of noise depends on both privacy level $(ϵ, δ)$ and sensitivity, which is the change of algorithm’s output caused by the change of its input. Originally studied for static databases, differential privacy was first extended to online setting for stream data in (dwork2010differential; chan2011private). Differentially private online learning methods have been studied for online convex optimization (thakurta2013nearly; agarwal2017price; abernethy2017online) and bandit problems (mishra2015nearly; tossou2017achieving; shariff2018differentially; neel2018mitigating; basu2019differential; wang2020global). The key component of these solutions is the tree-based mechanism, which was first proposed in (dwork2010differential; chan2011private) for privately releasing sum statistics in stream data. There are some recent works studied local differential privacy for bandits (ren2020multi; zheng2020locally; han2021generalized) and DP for federated/distributed bandits (dubey2020differentially; dubey2020private). We follow the notion of global differential privacy, and differential privacy refers to the global notion in the rest of the paper.

In the bandit setting, an adversary can observe the selected arms (not the reward). The goal of a private bandit algorithm is to keep the sequence of reward private. In other words, a change in the reward sequence should not lead to any sensible change in the selected arm sequence. Regarding differentially private contextual bandit, mishra2015nearly proposed private versions of contextual UCB and Thompson Sampling algorithms, but they did not provide any theoretical analysis of the resulting algorithms. neel2018mitigating provided regret analysis for a private LinUCB algorithm and proved the added regret introduced by their privacy mechanism is $~ O ({log}^{2.5} T \sqrt{T} / ϵ)$ to achieve $ϵ$ -differential privacy. shariff2018differentially studied the setting of making both context and reward private, and adopted a different privacy notion of joint differential privacy (kearns2014mechanism). However, these solutions do not directly add noise to the bandit model itself but to the input of the model that has constant sensitivity. They thus cannot leverage the convergence property of a bandit model. We study the dynamic nature of sensitivity of bandit model during online update, and propose a tree-based mechanism with dynamic sensitivity to introduce less noise to the model parameter over time. Our proposed method adds additional regret to non-private LinUCB in a scale of $~ O (log T \sqrt{T} / ϵ)$ to achieve $(ϵ, δ)$ -differential privacy. Pichapati et al. (pichapati2019adaclip) proposed an adaptive differentially private SGD algorithm that has similar motivation of our work, i.e., analyzing the dynamics of sensitivity to preserve more utility. But it focuses on a very different perspective: it dynamically (adaptively) computes sensitivity on different dimensions giving the historical gradient, while we compute the dynamic sensitivity based on model convergence.

3. Method

We first provide a brief overview of contextual bandit algorithms and differential privacy. We then provide the dynamic global sensitivity analysis of bandit algorithms with a tree based mechanism. Based on it, we present our differentially private linear contextual bandit algorithm and prove its upper regret bound.

3.1. Preliminaries

Contextual Bandits. In a contextual bandit problem, an algorithm sequentially selects an arm $a_{t}$ from a candidate pool $A_{t} = {a_{1, t}, a_{2, t}, . . ., a_{k, t}}$ , and receives the corresponding reward $r_{t}$ afterwards. Each arm $a$ is associated with a context feature vector $x_{a}$ and its reward is governed by a function of the context feature vector and an unknown bandit model parameter $θ^{*}$ . We made the following assumption on context generation similar to (gentile2014online).

Assumption 1 ().

Context vectors $C_{t} = {x_{a} | a \in A_{t}}$ are assumed to be generated i.i.d. conditioned on the past actions ${a_{1}, . . a_{t - 1}}$ and contexts ${C_{1}, . . C_{t - 1}}$ from a random process $X$ such that $E [X X^{T}]$ is full rank, with minimal eigenvalue $λ_{0}$ ²²2We discuss the impact of this assumption in next section in detail..

The objective of a bandit algorithm is to maximize its cumulative reward (or equivalently minimize the regret) over a finite time horizon $T$ . To simplify our discussion, we assume the observed reward is a linear function of feature vector $x_{a}$ and bandit model parameter $θ^{*}$ with noise, i.e., $r_{t} = x_{a_{t}}^{T} θ^{*} + γ_{t}$ , where $γ_{t}$ is a sub-Gaussian feedback noise. We evaluate a bandit algorithm by its pseudo-regret, which is defined as

(1)

Regret (T) = T \sum t = 1 (x_{a_{t}^{*}}^{T} θ^{*} - x_{a_{t}}^{T} θ^{*}),

where $a_{t}^{*}$ is the best arm according to $θ^{*}$ . The LinUCB algorithm (LinUCB; Improved_Algorithm) is a well-studied solution for linear bandit. It solves the bandit model parameter $^θ$ using ridge regression, i.e., $^θ = A_{t}^{- 1} b_{t}$ , where $A_{t} = \sum_{i = 1}^{t - 1} x_{a_{i}} x_{a_{i}}^{T} + λ I, b_{t} = \sum_{i = 1}^{t - 1} x_{a_{i}} r_{i}$ , $λ$ is the L2-regularization coefficient; and Upper Confidence Bound is used to select an arm.

Differential Privacy. For a contextual bandit problem, denote $S = {r_{1 : T}}$ as the reward sequence. $S^{'}$ is considered as an adjacent neighboring sequence of $S$ , if it only differs from $S$ at one point of reward $r_{i}$ . The output $O$ of a bandit algorithm is the sequence of its selected arms, i.e., ${a_{1 : T}}$ .

Definition 0 (Differential Privacy (dwork2006calibrating)).

A randomized algorithm $M$ is $(ϵ, δ)$ -differentially private if for any adjacent neighboring sequences $S, S^{'}$ and output $O$ ,

P (M (S) \in O) \leq e^{ϵ} P (M (S^{'}) \in O) + δ

When $δ = 0$ , we say algorithm $M$ is $ϵ$ -differentially private.

Differential privacy ensures the adversary observes almost the same output of a private algorithm in a probabilistic sense, if one input data point is changed, where the similarity between outputs is evaluated by $ϵ$ and $δ$ . Laplace and Gaussian noise are commonly used as additive noise to protect the output, where the noise scale is related to the privacy requirement $(ϵ, δ)$ and the global sensitivity of algorithm’s output caused by the change of input. We formally define global sensitivity below.

Definition 0 (Global Sensitivity (dwork2006calibrating)).

For any adjacent neighboring sequences $S, S^{'}$ , global sensitivity of a function $f$ is defined as,

Δ_{f} = max S, S^{'} | f (S) - f (S^{'}) |

Since differential privacy is immune to post-processing (dwork2014algorithmic), previous solutions of private linear bandits (mishra2015nearly; neel2018mitigating) add noise to $\sum_{t} x_{a_{t}} r_{t}$ to obtain privacy, which consequently makes the model parameter $θ_{t}$ and the selected arms ${a_{t}}_{t = 1}^{T}$ private. It is obvious that global sensitivity of $b_{t} = \sum_{t} x_{a_{t}} r_{t}$ is a constant $L$ for all round $t$ , if we assume $∥ x ∥_{2} \leq L$ and $| r | \leq 1$ , without loss of generality. In other words, constant amount of noise has to be added in each round to achieve differential privacy.

3.2. Dynamic Global Sensitivity Analysis for Tree-based Mechanism

1: Inputs:

t, T, ϵ, δ, L, λ_{0}

2: Initialize:

ϵ^{'} = ϵ / {log}_{2} T

δ^{'} = δ / {log}_{2} T

η_{t} = 0

3: Let

b

be the

⌈ {log}_{2} T ⌉ + 1

-bit binary representation of

t

4: for i = 0 to

⌈ {log}_{2} T ⌉ + 1

5: if

b_{i} = 1

then

η_{t} = η_{t} + Lap (\frac{Δ_{t - 2^{i} + 1}}{ϵ^{'}})

7: end if

8: end for

9: Return noise

η_{t}

Algorithm 1 Tree-based mechanism with Dynamic Global Sensitivity

Different from previous work, we directly add noise to the estimated bandit model parameter ${^θ}_{t}$ after each round of model update. As ${^θ}_{t}$ converges over time, the sensitivity of it decreases consequently, which we name as dynamic global sensitivity. We first quantify such sensitivity of ${^θ}_{t}$ and then discuss how to combine it with the tree-based mechanism.

Lemma 0 (Sensitivity of estimated bandit model parameter ${^θ}_{t}$ ).

Let ${^θ}_{t}$ and ${^θ}_{t}^{'}$ be parameter estimations of adjacent neighboring reward sequences $S$ and $S^{'}$ , assuming context vectors ${x_{1, t}, . . x_{K, t}}$ are generated according to Assumption 1. With probability at least $1 - δ$ , the dynamic global sensitivity of bandit model parameter ${^θ}_{t}$ at time $t$ is bounded as,

(2)

Δ_{t} = max S, S^{'} ∥ {^θ}_{t + 1} - {^θ}_{t + 1}^{'} ∥_{2} \leq \frac{2 L}{λ^{'}}

where $λ^{'} = λ_{0} t / 4 - 8 log ((t + 3) / δ) - 2 \sqrt{t log ((t + 3) / δ)}$ is the lower bound of minimum eigenvalue of matrix $A_{t} = \sum_{t} x_{a_{t}} x_{a_{t}}^{T} + λ I$ . ³³3Note that $λ$ is the coefficient for of $L_{2}$ regularization, and it not related with $λ_{0}$ . We use a simplified bound $λ^{'} = λ_{0} t / 16$ when $t > 32 log (1 / δ) / λ_{0}$ , which gives us $Δ_{t} \leq 32 L / λ_{0}$ .

Proof Sketch. Since ${^θ}_{t} = A_{t}^{- 1} \sum_{t} x_{a_{t}} r_{t}$ , the key idea is to quantify the minimum eigenvalue of matrix $A_{t}$ . Here we adopt the i.i.d. assumption of context vectors from Theorem 1 of (gentile2014online) to analyze the corresponding eigen system and the key idea is to use a concentration inequality to bound the eigenvalue. Note that we need to avoid negative sensitivity when $t$ is small. It can be derived from the formula of $λ^{'}$ that by choosing probability $δ > 2 log (3) / (\sqrt{1 + 2 λ_{0}} + 1)$ , the sensitivity is guaranteed to be positive for small $t$ . For large $t$ , we have shown in the lemma that the sensitivity is guaranteed to be positive when $t > 32 log (1 / δ) / λ_{0}$ .

Lemma 3 provides the dynamic global sensitivity analysis of estimated bandit model parameter ${^θ}_{t}$ . The most important property of it is that it is monotonically shrinking over time. This indicates less noise is needed in later stage for privacy protection. We should emphasize that our analysis is for global sensitivity. A similar but quite different notion is local sensitivity (and smoothed sensitivity) proposed by (nissim2007smooth), which studies the dynamic nature of sensitivity given different input $S$ , i.e., ${max}_{S^{'}} | f (S) - f (S^{'}) |$ . Our dynamic sensitivity analysis is not conditioned on any reward sequence, but based on the convergence property of bandit model parameter ${^θ}_{t}$ . And thus it holds for any reward sequence as input.

Remark Our Assumption 1 on the context vectors used in Lemma 3 follows (gentile2014online) and is used to bound the rate of shrinkage of the bandit model’s sensitivity. Similar assumptions are also discussed in (kannan2018smoothed) from the perspective of perturbation on context vectors via a smoothed analysis. And in the appendix, we show that a similar sensitivity analysis can easily lead to a differentially private version of algorithm proposed in (kannan2018smoothed). In the meanwhile, we also note that this environment assumption generally holds in practice, especially in a system with time-varying arm sets, as extensively discussed and studied in (kannan2018smoothed). However, even if the context vectors are generated by an adversary instead of a random process, we can still achieve the same guarantee on the minimal eigenvalue $λ_{0}$ and the same result in Lemma 3 by perturbing the context vector $x_{a}$ with a Gaussian noise sampled from $N (0, σ^{2})$ , such that $λ_{0} \geq Ω (σ^{2})$ , based on Lemma 3.7 in (kannan2018smoothed). In the following, we assume Assumption 1 holds if not specified. And in Corollary 8, we discuss the impact on regret if the algorithm needs to perturb the context vector when the assumption does not hold.

The analyzed sensitivity provides us the level of noise needed for differential privacy. We present a tree-based mechanism with dynamic global sensitivity in Algorithm 1, which takes advantage of the sublinear property of exponential function to further reduce the amount of added noise. Basically, the idea of tree-based mechanism is to view the sum statistics as $⌈ {log}_{2} T ⌉ + 1$ partial sums. Laplace noise is added to each partial sum to achieve $(ϵ^{'}, δ^{'})$ -differential privacy where $ϵ^{'} = ϵ / {log}_{2} T$ , $δ^{'} = δ / {log}_{2} T$ . Here we follow the notion of $(ϵ^{'}, δ^{'})$ -differential privacy because our sensitivity analysis in Lemma 3 is a high probability bound and holds with probability $1 - δ^{'}$ . Based on composition theorem of differential privacy (dwork2010boosting; kairouz2017composition), the final output, the sum of $(ϵ^{'}, δ^{'})$ -differentially private partial sums, is $(ϵ, δ)$ -differentially private. The partial sums are segmented by the tree representation. We refer to (dwork2010differential; chan2011private) for detailed discussion of tree-based mechanism.

However, previous differentially private bandit algorithms with tree-based mechanism only protect variables that have constant sensitivities, such as sum statistics of $\sum_{t} x_{a_{t}} r_{t}$ for contextual bandits (mishra2015nearly; neel2018mitigating; shariff2018differentially) or sum of rewards for non-contextual bandits (tossou2016algorithms; tossou2017achieving), instead of our dynamic global sensitivity. In Algorithm 1, we scale the noise of partial sum $[t - 2^{i + 1}, t - 2^{i} + 1]$ with its dynamic sensitivity $Δ_{t - 2^{i} + 1}$ as shown in line 6.

We first show the privacy guarantee of Algorithm 1 and then study its utility, i.e., the total amount of noise added by Algorithm 1.

Theorem 4 (Privacy).

Algorithm 1 with dynamic global sensitivity $Δ_{i}$ defined in Lemma 3 is $(ϵ, δ)$ -differentially private.

Theorem 5 (Utility).

For a finite time horizon $T$ , at time $t$ , Algorithm 1 with dynamic global sensitivity of bandit model parameter $^θ$ adds $O (\frac{L}{ϵ} log T \sqrt{log t} log \frac{1}{ζ} / t)$ noise with probability $1 - ζ$ to achieve $(ϵ, δ)$ -differential privacy.

Proof Sketch. We follow the definition of differential privacy and use a similar proof as Theorem 3 of (thakurta2013nearly) to get the privacy guarantee. The utility analysis is based on Lemma 3 and the property of sum of samples from Laplace distributions.

All existing tree-based mechanism for private bandit solutions add noise at a scale of $O (\frac{L}{ϵ} log T \sqrt{log t} log \frac{1}{ζ})$ based on constant sensitivity (chan2011private) . By leveraging the dynamic global sensitivity analysis in Lemma 3, our solution adds less noise while achieving the same level of privacy.

Moreover, our finite-time analysis assumes the knowledge of time horizon $T$ beforehand; but it can also be directly extended to the infinite/unknown time horizon by replacing tree-based mechanism with the hybrid mechanism (chan2011private). Although in this paper our focus is the contextual bandit problem, our dynamic global sensitivity analysis can also be generalized to other settings such as online convex optimization, where we can leverage the algorithm’s convergence property to add decreasing noise to the model parameter for better utility.

3.3. Differentially Private LinUCB

1: Inputs:

ϵ, δ, λ, T, L, λ_{0}

2: Initialize:

A_{1} \leftarrow λ I

b_{1} \leftarrow 0

3: for

t = 1

T

4: Observe context vectors

C_{t} = {x_{a} | a \in A_{t}}

5: Take action

a_{t} = {a r g m a x}_{a \in A_{t}} x_{a}^{T} {^θ}_{t}^{p} + α_{t} ∥ x_{a} ∥_{A_{t}^{- 1}}

6: Observe reward

r_{a_{t}}

A_{t + 1} \leftarrow A_{t} + x_{a_{t}} x_{a_{t}}^{T}

b_{t + 1} \leftarrow b_{t} + x_{a_{t}} r_{a_{t}}

9: Sample noise

η_{t} \sim TreeMechanism (t, ϵ, δ)

10:

{^θ}_{t + 1}^{p} \leftarrow A_{t + 1}^{- 1} b_{t + 1} + η_{t}

11: end for

Algorithm 2 Private LinUCB with Dynamic Global Sensitivity

We now provide a differentially private linear contextual algorithm built on top of the tree-based mechanism with dynamic global sensitivity. The details of this algorithm is described in Algorithm 2. At round $t$ , the algorithm receives a set of arms, with each arm associated with a context vector $x_{a}$ . Different from previous solutions (mishra2015nearly; neel2018mitigating) that add noise to $b_{t} = \sum_{t} x_{a_{t}} r_{t}$ , our algorithm directly adds noise to the bandit parameter for privacy (i.e., line 10), and uses the private model parameter ${^θ}_{t}^{p}$ for arm selection (i.e., line 5). The selected sequence of arms are proved to be $(ϵ, δ)$ -differentially private to the reward sequence.

Claim 1 ().

The sequence of selected arms ${a_{t} : t \in [1.. T]}$ by Algorithm 2 is $(ϵ, δ)$ -differentially private.

Proof.

In line 9-10 we use Algorithm 1 to add noise and keep ${^θ}_{t + 1}^{p}$ $(ϵ, δ)$ -differentially private. Because differential privacy is post-processing invariant (dwork2014algorithmic), the sequence of selected arms ${a_{t} : t \in [1.. T]}$ produced by ${^θ}_{t}^{p}$ is thus also $(ϵ, δ)$ -differentially private. ∎

We now specify the confidence bound of reward estimation with ${^θ}_{t}^{p}$ , which we use for arm selection in line 5 of Algorithm 2.

Lemma 0 (Confidence Bound).

Following Assumption 1 and assuming $∥ θ^{*} ∥_{2} \leq S$ , with probability at least $1 - 2 ζ$ , confidence bound of the estimated reward $x^{T} {^θ}_{t}^{p}$ is

(3)

{CB}_{t} (x) = ∥ x^{T} {^θ}_{t}^{p} - x^{T} θ^{*} ∥ \leq α_{t} ∥ x ∥_{A_{t}^{- 1}}

where we define $α_{t}$ as the upper bound of $∥ {^θ}_{t}^{p} - θ^{*} ∥_{A_{t}}$ . We have

(4)

∥ {^θ}_{t}^{p} - θ^{*} ∥_{A_{t}} \leq \frac{L}{ϵ} log T \sqrt{log t} log (\frac{1}{ζ}) / \sqrt{t} + \sqrt{d log \frac{1 + t L^{2} λ}{ζ}} + \sqrt{λ} S

Proof.

Eq. (3) is obtained by the Cauchy–Schwarz inequality. To bound $α_{t}$ , we apply the triangle inequality: $∥ {^θ}_{t}^{p} - θ^{*} ∥_{A_{t}} \leq ∥ {^θ}_{t}^{p} - {^θ}_{t} ∥_{A_{t}} + ∥ {^θ}_{t} - θ^{*} ∥_{A_{t}}$ , where ${^θ}_{t}$ is the non-private estimate of $θ^{*}$ . The first term is bounded according to Theorem 5 under $A_{t}$ norm with probability at least $1 - ζ$ . The second term is the confidence ellipsoid of non-private LinUCB and it can be bounded by Theorem 2 of (Improved_Algorithm) with probability at least $1 - ζ$ . By taking a union bound, the inequality holds with probability at least $1 - 2 ζ$ ∎

Lemma 6 gives a tight construction of uncertainty regarding reward estimation $x^{T} {^θ}_{t}^{p}$ and model parameter estimation $∥ {^θ}_{t}^{p} - θ^{*} ∥_{A_{t}}$ . Note that comparing to the non-private LinUCB, the confidence bound of model estimation $α_{t}$ in our algorithm is relaxed by an additional term $\frac{L}{ϵ} log T \sqrt{log t} log (\frac{1}{ζ}) / \sqrt{t}$ , which captures the upper bound of uncertainty caused by the privacy-preserving noise $η_{t}$ introduced by Algorithm 1.

Now we provide a gap-independent regret bound of Algorithm 2 in following theorem.

Theorem 7 (Regret Bound).

Following Assumption 1, the pseudo-regret of Algorithm 2 up to time $T$ can be bounded by,

	$R e g r e t (T) \leq$	$2 (\sqrt{d log \frac{1 + T L^{2} λ}{ζ}} + \sqrt{λ} S) \sqrt{d T log (λ + \frac{L T}{d})}$
(5)			$+ (2 \frac{3^{1.5} L}{e^{1.5} λ_{0} ϵ} log (\frac{1}{} ζ) \sqrt{d T log (λ + \frac{L T}{d})} + 32 log (1 / δ) / λ_{0})$

with probability at least $1 - 2 ζ$ .

Proof Sketch. We rewrite the cumulative regret of LinUCB defined in Eq. (1) by $\sum_{t = 1}^{T} 2 {CB}_{t} (x_{a_{t}})$ , using the definition of confidence bound and UCB arm selection strategy in line 5 of Algorithm 2. We apply the Cauchy-Schwarz inequality to bound $\sum_{t} α_{t}$ and $\sum_{t} ∥ x_{a_{t}} ∥_{A_{t}^{- 1}}$ . According to Eq. (4) in Lemma 6, we can separate the bound of $α_{t}$ into two terms: one is the confidence bound of the original LinUCB and the other is the bound of injected noise. We use Lemma 11 of (Improved_Algorithm) to bound $\sum_{t} ∥ x_{a_{t}} ∥_{A_{t}^{- 1}}$ .

The first term of regret in Theorem 7 is the same as the regret of non-private LinUCB, which is caused by its parameter estimation uncertainty and exploration in arm selection. The second term is the added regret introduced for privacy based on dynamic global sensitivity. The private LinUCB algorithm in (neel2018mitigating) adds noise to $b_{t}$ and incurs additional regret in the order of $~ O ({log}^{2.5} T \sqrt{T} / ϵ)$ to achieve $ϵ$ -differential privacy, while our algorithm introduces additional regret in the order of $~ O (log T \sqrt{T} / ϵ)$ to achieve $(ϵ, δ)$ -differential privacy ignoring logarithmic terms. The dependency of additional regret on $δ$ is $O (log (1 / δ))$ . We note that our method leverages the convergence of parameter estimation to inject less noise; and since it is a high probability analysis, the relaxed $(ϵ, δ)$ -differential privacy notion is required. Later, we also validate our theoretical analysis of the regret reduction in our empirical evaluations. We currently cannot prove if our regret is optimal with respect to T, since the lower bound of the differentially private linear bandit problem is still an open problem. We note the lower bound discussed in (shariff2018differentially) is not applicable, because their privacy definition includes context and it is different from ours that focuses on the privacy of reward. We consider deriving the lower bound of this problem as an important future direction.

In general, the regret bound of contextual bandits could be categorized into gap-independent bound and gap-dependent bound. Our provided analysis is for a gap-independent bound and it is dominated by the term $O (\sqrt{T})$ after ignoring the logarithmic terms. The gap-dependent regret bound of a bandit algorithm is usually in the order of $O (log T / μ_{1, k})$ , where $μ_{1, k}$ is the minimal reward difference between the best arm and any sub-optimal arm $k$ . We leave the analysis of gap-dependent bound of our proposed algorithm as future work.

Note that Algorithm 2 takes $L$ and $λ_{0}$ as input. From a theoretical perspective, the knowledge of the maximum L2-norm of context features $L$ is a commonly made assumption in linear bandits, which is required to derive important model parameters such as the size of confidence ellipsoid $α_{t}$ in LinUCB (Improved_Algorithm). From a practical view, one can normalize the context feature vectors or observe them ahead of the time to empirically estimate $L$ and $λ_{0}$ .

When the context vectors are not generated from the required random process with minimal eigenvalue $λ_{0}$ , we can add an additional Gaussian perturbation on the context to achieve this condition similar to the idea in kannan2018smoothed. This leads to the following auxiliary result.

Corollary 0 ().

Following the assumptions in Lemma 1 and 2 about the context vectors, by perturbing every context vector $x_{a}$ with $z \sim N (0, σ^{2} I)$ , Algorithm 2 is $(ϵ, δ)$ -differentially private. The pseudo-regret of Algorithm 2 up to time $T$ can be bounded by

	$R e g r e t (T) \leq$	$2 \frac{\sqrt{d log \frac{1 + T L^{2} λ}{ζ}} + \sqrt{λ} S}{σ^{2}} \sqrt{d T log (λ + \frac{L T}{d})}$
(6)			$+ (2 \frac{3^{1.5} L}{e^{1.5} σ^{2} ϵ} log (\frac{1}{ζ}) \sqrt{d T log (λ + \frac{L T}{d})} + 32 log (1 / δ) / σ^{2})$

with probability at least $1 - 2 ζ$ .

Here we notice that if we do not have the assumption on the generation of context vectors, while instead perturbing the context vectors to guarantee Lemma 3, the regret would be larger than the one in Theorem 7, especially when variance $σ^{2}$ is small. However note that the order of the regret in $T$ is still the same. The proof of this corollary can be found in the appendix.

Figure 1. Regret and parameter estimation quality on simulation and real-world dataset.

4. Experiment

We performed empirical evaluations of our proposed private LinUCB with dynamic global sensitivity (denoted as Private LinUCB-DGS) against two baseline algorithms:

LinUCB (LinUCB): it selects an arm based on its upper confidence bound of the estimated reward with given context vectors. As a non-private algorithm, LinUCB does not inject any noise to its parameter estimation.
Private LinUCB (neel2018mitigating): it adds noise to $b_{t}$ using the tree-based mechanism with a constant sensitivity parameter, and incurs additional regret at the order of $~ O ({log}^{2.5} T \sqrt{T} / ϵ)$ .

$∙$ Experiment setup. In our simulation-based study, we generate a size- $K$ ( $K = 1000$ ) arm pool $A$ , in which each arm $a$ is associated with a $d$ -dimensional ( $d = 10$ ) feature vector $x_{a} \in R^{d}$ with $∥ x_{a} ∥_{2} \leq 1$ . Similarly, we create a set of ground-truth bandit parameters $θ^{*} \in R^{d}$ with $∥ θ^{*} ∥_{2} \leq 1$ , which are not disclosed to the learners. Each dimension of both $x_{a}$ and $θ^{*}$ is drawn from a uniform distribution $U (0, 1)$ . At each round $t$ , a randomly-sampled decision set with 10 arms from $A$ are disclosed to the learners for selection. The ground-truth reward $r_{a}$ is corrupted by Gaussian noise $γ \sim N (0, σ^{2})$ before giving back to the learners and the standard deviation $σ$ is set to 0.5.

We also experimented on a real-world dataset extracted from the music streaming service website Last.fm ⁴⁴4http://www.last.fm. The LastFM datasets is published by the HetRec 2011 workshop ⁵⁵5http://grouplens.org/datasets/hetrec-2011. This dataset contains 1,892 users and 17,632 items (artists). We used the information of “listened artists” of each user to create payoffs of recommendation candidates: if a user listened to an artist at least once, the payoff is 1, otherwise 0. Following the settings in (Gang; wu2016contextual; wang2016learning), we extracted the context features of artists and pre-processed the dataset in order to fit them into a contextual bandit setting. Specifically, we create a TF-IDF feature vector using associated tags (6,036 tags in total), which uniquely represents the context of that artist. PCA is used to reduce the dimensionality of the feature vectors to $d = 25$ . For a particular user $i$ , we generate the candidate arm pool with size $| A_{t} | = 25$ by first sampling one item from those non-zero reward candidate in user $i$ ’s past history, and then randomly fulfill the other $24$ from those zero-reward candidates from this user.

Notice that both synthetic data and real-world data met Assumption 1. In synthetic data, the context features are sampled from a uniform distribution. For the LastFM dataset, the context vectors are generated by retaining the first 25 principal components of the original 6,036-dimension TD-IDF representation of items, which means the minimum eigenvalue of $E [x x^{T}]$ is lower bounded.

Figure 2. Detailed empirical analysis of privacy-preserving mechanism for LinUCB.

$∙$ Regret comparison. We first show the regret on simulation and real-world dataset in Figure 1 (a) and (b). As in the real-world dataset we do not have an oracle policy, we instead report each learning algorithm’s cumulative reward for comparison. We set the privacy level $ϵ = 2$ and $δ = 0.1$ in the following experiments as default. On the LastFM dataset, we report the reward ratio normalized by the reward collected from a random policy; and the resulting performance curve is thus the higher the better. We observe that the non-private LinUCB performs better than its two private variants, since no noise is injected to LinUCB’s model estimation. Our private LinUCB-DGS performs much better than private LinUCB on both datasets, as less noise is added . These results support our theoretical analysis that with dynamic global sensitivity, our algorithm adds additional regret in the order of $~ O (log T \sqrt{T} / ϵ)$ . Note that because the privacy notion are different in private LinUCB and our private LinUCB-DGS, here we focus on comparing the utility of these algorithms with the same $ϵ$ .

$∙$ Parameter estimation quality. Figure 1 (c) shows the parameter estimation quality of the three bandit algorithms, i.e., the L2 difference between the estimated bandit parameter ${^θ}_{t}^{p}$ and the ground-truth parameter $θ^{*}$ . Compared with private LinUCB, our model parameter ${^θ}^{p}$ converges faster, which explains its improved performance in regret. We also notice that the convergence of private LinUCB oscillates more seriously. This is because private LinUCB injects noise with a larger variance (as we always introduce zero-mean Laplace noise) to model estimation. As a result, its quality of model parameters estimation varies significantly during online update, which directly leads to its worse regret.

$∙$ Effect of privacy level $ϵ$ . In Figure 2 (a), we show the regret of our algorithm with different privacy parameter $ϵ$ . We vary $ϵ$ from $0.5$ to $5$ . From the results, we notice a clear trade-off between the required privacy level $ϵ$ and the resulting regret. Stronger privacy requirement (i.e., a smaller $ϵ$ ) requires the privacy mechanism to inject more noise, which results in larger regret. This result also supports our theoretical analysis that in our solution is in the order of $O (\frac{1}{ϵ})$ .

$∙$ Sensitivity in arm selection. We now evaluate the algorithm’s sensitivity to reward changes, which is exactly what differential privacy intends to protect in a bandit algorithm. Specifically, we fix the sequence of context and arms and only change reward feedback on one particular arm. Note that the feedback noise in simulation is also fixed for all arms ahead of time in this experiment. We compare how many arms are selected differently after the reward change by a bandit algorithm. In our simulation based study, we run each algorithm 500 iterations and change the reward at iteration ${0, 100, 200}$ by increasing the original reward by 0.5. For experiment on LastFM, we run 5,000 iterations, and flip reward on the selected item at iteration ${0, 1000, 2000}$ . We run private LinUCB-DGS and private LinUCB-DGS for 5 times and report the mean and standard deviation. We run LinUCB just one time as it is a deterministic algorithm.

From Figure 2 (b) and (c) we can observe that the number of changed arms in our solution is clearly smaller than its non-private counterpart, which suggests that private LinUCB-DGS is less sensitive to reward change comparing to original LinUCB. Private LinUCB has a similar number of changed arms than ours in average but with a larger variance. This suggests less stable utility provided by the private LinUCB solution. We also notice that when the reward change happens later, the change in all algorithms’ arm selection gets smaller, which supports our motivation of dynamic sensitivity analysis that the algorithm’s output is less sensitive to its input in the later stage.

5. Conclusions & Future Work

In this paper, we first studied the dynamic nature of sensitivity of linear bandit models and presented a dynamic global sensitivity analysis of such an algorithm under the tree based mechanism, which leads to reduced noise addition for differential privacy. We then developed and analyzed a differentially private linear bandit algorithm based on the concept of dynamic global sensitivity. Our private linear bandit algorithm injects additional regret caused by privacy-preserving mechanism in $~ O (log T \sqrt{T} / ϵ)$ while guarantees $(ϵ, δ)$ -differential privacy. Experimental results on both synthetic and real-world datasets demonstrated the advantage of our algorithm and supported our theoretical analysis.

One important property of bandit algorithms is their intrinsic noise/randomness introduced by exploration, such as the stochastic sampling of arms in Thompson Sampling and EXP3. Less explicit noise should be needed for differential privacy, because of such intrinsic randomness of an algorithm’s output, e.g., free privacy. We plan to take advantage of it for better balancing privacy and utility in bandit algorithms. In addition, as we mentioned before our proposed dynamic global sensitivity analysis is not only limited to linear bandit algorithms. It is meaningful to explore its application in a broader context, such as logistic bandits and online convex optimization. We also notice that the lower bound of the differentially private linear bandit problem that protects privacy of reward is currently still an open problem, and it is important to investigate this lower bound to show the optimality of a private linear bandit algorithm.

Acknowledgements.

We thank the anonymous reviewers for their insightful comments. This work was supported by National Science Foundation Award IIS-2128019, IIS-2007492 and IIS-1553568, Google Faculty Research Award and Bloomberg Data Science Ph.D. Fellowship.

Appendix

Missing Proofs

In this section, we provide the detailed proofs of the lemma and theorems discussed in our paper.

Proof of Lemma 3.

For any neighbouring reward sequences ${S, S^{'}}$ that only differ at the data point $j$ , i.e., $r_{i} = r_{i}^{'}$ when $i \neq j$ , denote $A_{t} = λ I + \sum_{i = 1}^{t} x_{a_{i}} x_{a_{i}}^{T}$ , we have,

		$∥ {^θ}_{t + 1} - {^θ}_{t + 1}^{'} ∥_{2}$
	$=$	$∥ (λ I + t \sum i = 1 x_{a_{i}} x_{a_{i}}^{T})^{- 1} t \sum i = 1 x_{a_{i}} r_{i} - (λ I + t \sum i = 1 x_{a_{i}} x_{a_{i}}^{T})^{- 1} t \sum i = 1 x_{a_{i}} r_{i}^{'} ∥_{2}$
	$=$	$∥ A_{t}^{- 1} x_{a_{i}} (r_{j} - r_{j}^{'}) ∥_{2}$
	$\leq$	$2 L λ_{m a x} (A_{t}^{- 1})$
	$\leq$	$\frac{2 L}{λ_{m i n} (A_{t})}$

$λ_{m i n} (\cdot)$ denotes the minimal eigenvalue of the input matrix. The third step is because of the assumption $∥ x ∥_{2} \leq L$ and $∥ r ∥ \leq 1$ . To analyze the eigenvalue of $A_{t}$ , we adopt the assumption from Theorem 1 of Gentile et al. (gentile2014online) that context vectors ${x_{1, t}, . . x_{K, t}}$ are i.i.d. conditioned on the algorithm’s past actions and observed context. Based on Theorem 1 of (gentile2014online), $λ^{'} = λ_{0} t / 4 - 8 log ((t + 3) / δ) - 2 \sqrt{t log ((t + 3) / δ)}$ and $λ^{'} = λ_{0} t / 16$ when $t > 32 log (1 / δ) / λ_{0}$ . Thus we have $λ_{m i n} (A_{t}) \geq λ_{0} t / 16$ when $t > 32 log (1 / δ) / λ_{0}$ . Substitute it back to the inequality and we finish the proof of Lemma 3. ∎

Proof of Theorem 4.

We can rewrite the estimated model parameter ${^θ}_{t}$ as sum statistics using the Sherman–Morrison formula by

	${^θ}_{t + 1}$	$= (A_{t} + x_{a_{t}} x_{a_{t}}^{T})^{- 1} (b_{t} + x_{a_{t}} r_{a_{t}})$

		$= {^θ}_{t} + Δ_{A_{t}} b_{t} + A_{t}^{- 1} x_{a_{t}} r_{a_{t}} + Δ_{A_{t}} x_{a_{t}} r_{a_{t}}$
		$:= {^θ}_{t} + Δ ({^θ}_{t})$

where $Δ_{A_{t}} = \frac{A_{t}^{- 1} x_{a_{t}} x_{a_{t}}^{T} A_{t}^{- 1}}{1 + x_{a_{t}}^{T} A_{t}^{- 1} x_{a_{t}}}$ . Note that we do not actually need to store nor calculate the partial sum in the tree; but we only use the tree mechanism to sample noise for each partial sum, which makes our implementation efficient. We notice that not saving partial sums on the tree makes our algorithm fail to satisfy the definition of pan privacy (dwork2010differential). However if needed, one can always explicitly maintain a tree for $Δ ({^θ}_{t})$ to achieve pan privacy, with a price of added computational cost.

For partial sum $\sum_{t = i}^{j} Δ ({^θ}_{t})$ , i.e., ${^θ}_{j} - {^θ}_{i}$ , its sensitivity is bounded by ${^θ}_{j} - {^θ}_{j}^{'}$ and further bounded by $Δ_{j}$ as discussed in Lemma 3. According to the definition of differential privacy, the partial sum is $(ϵ^{'}, δ^{'})$ -differentially private if we add $Lap (\frac{Δ_{j}}{ϵ^{'}})$ noise with our sensitivity bound that holds with probability at least $1 - δ^{'}$ .

We note that the incremental part $Δ ({^θ}_{t})$ depends on not only incoming $(x_{a_{t}}, r_{a_{t}})$ , but also the historical statistics summarized in $A_{t}$ . A similar problem also occurs in differentially private online convex optimization problem, where the current gradient depends on the choice of past gradients. We refer to Theorem 3 of (thakurta2013nearly), which uses the advanced composition theorem, to prove that the composition of $(ϵ^{'} = ϵ / (log T), δ^{'} = δ / (log T),)$ -differentially private partial sums achieves $(ϵ, δ)$ -differential privacy. ∎

We clarify the analogy of problem structure between privacy of Online Convex Optimization (OCO) discussed in (thakurta2013nearly) and privacy of contextual bandits as follows: our $Δ θ$ could be viewed similarly as gradient in OCO, which is protected by private sum statistics releasing. In OCO, gradient depends on current loss and historical evaluated parameters ${w_{t}}_{t = 1}^{T}$ , while in contextual bandits $Δ θ$ depends on current reward and historical pulled arms ${x_{t}}_{t = 1}^{T}$ . An arm pulled in contextual bandits is analogous to parameter evaluated in OCO, and thus similar arguments and analysis apply.

Proof of Theorem 5.

To bound the total amount of noise $η_{t}$ added by our tree mechanism, we first state the property of sum of independent Lapalace distributions (also stated in Lemma 2.8 of (chan2011private)): ∎

Lemma 0 ().

With probability $1 - ζ$ , sum of independent Lapalace distributions $Lap (a_{i})$ is bounded by $O (\sqrt{\sum_{i} a_{i}^{2}} log \frac{1}{ζ})$

Let $b$ be the $⌈ {log}_{2} T ⌉ + 1$ -bit binary representation of $t$ . The noise introduced by tree-based mechanism with dynamic global sensitivity can be bounded by

	$η_{t}$	$= \sum b_{i} = 1 Lap (\frac{Δ_{t - 2^{i} + 1}}{ϵ^{'}})$
		$= \sum b_{i} = 1 Lap (\frac{Δ_{t - 2^{i} + 1} log T}{ϵ})$
		$\leq log \frac{1}{ζ} \sqrt{\frac{{log}^{2} T}{ϵ^{2}} (Δ_{\frac{1}{2} t^{'} + 1}^{2} + Δ_{\frac{3}{4} t^{'} + 1}^{2} + Δ_{\frac{7}{8} t^{'} + 1}^{2} + . . . + Δ_{t^{'}}^{2}))}$
		$\leq \frac{32 L}{λ_{0}} log \frac{1}{ζ} \frac{log T}{t ϵ} \sqrt{((\frac{2}{1})^{2} + (\frac{4}{3})^{2} + . . .)}$
		$= \frac{32 L}{λ_{0}} log \frac{1}{ζ} \frac{log T}{t ϵ} \sqrt{((1 + \frac{1}{1})^{2} + (1 + \frac{1}{3})^{2} + . . .)}$
		$\leq \frac{32 L}{λ_{0}} log \frac{1}{ζ} \frac{log T}{t ϵ} \sqrt{2 log t + 2}$

where $t^{'} > t$ is the smallest number that is the power of 2 after $t$ . The last inequality holds because there are at most $log t$ terms in the summation, and $(\frac{1}{1})^{2} + (\frac{1}{3})^{2} + . .$ is upper bounded by 2. As a result, we conclude that the added noise $η_{t}$ is in the order of $O (\frac{L}{ϵ} log T \sqrt{log t} log \frac{1}{ζ} / t)$ with probability $1 - ζ$ .

Proof of Theorem 7.

We first bound the one-step regret at time $t$ as,

	$regret (t)$	$= x_{a_{t}^{}}^{T} θ^{} - x_{a_{t}}^{T} θ^{*} / / Definition of regret$
		$\leq x_{a_{t}^{}}^{T} {^θ}_{t}^{p} + {CB}_{t} (x_{a_{t}^{}}) - x_{a_{t}}^{T} θ^{*} / / Definition of confidence bound$
		$\leq x_{a_{t}}^{T} {^θ}_{t}^{p} + {CB}_{t} (x_{a_{t}}) - x_{a_{t}}^{T} θ^{*} / / Arm selection strategy$
		$\leq 2 {CB}_{t} (x_{a_{t}})$

The pseudo-regret up to time $T$ is thus bounded by,

	$Regret (T)$	$= T \sum t = 1 r e g r e t (t)$
		$\leq T \sum t = 1 2 {CB}_{t} (x_{a_{t}})$

		$\leq 2 T \sum t = 1 (∥ {^θ}_{t}^{p} - {^θ}_{t} ∥_{A_{t}} + ∥ {^θ}_{t} - θ^{*} ∥_{A_{t}}) ∥ x_{a_{t}} ∥_{A_{t}^{- 1}}$
		$\leq 2 \sqrt{T \sum t = 1 ∥ {^θ}_{t} - θ^{*} ∥_{A_{t}}^{2} T \sum t = 1 ∥ x_{a_{t}} ∥_{A_{t}^{- 1}}^{2}}$
		$+ 2 \sqrt{T \sum t = 1 ∥ η_{t} ∥_{A_{t}}^{2} T \sum t = 1 ∥ x_{a_{t}} ∥_{A_{t}^{- 1}}^{2}}$
		$\leq 2 \sqrt{T ∥ {^θ}_{T} - θ^{*} ∥_{A_{T}}^{2} T \sum t = 1 ∥ x_{a_{t}} ∥_{A_{t}^{- 1}}^{2}}$
		$+ 2 \sqrt{T max t^{'} ∥ η_{t^{'}} ∥_{A_{t^{'}}}^{2} T \sum t = 1 ∥ x_{a_{t}} ∥_{A_{t}^{- 1}}^{2}}$

In the fourth step, we separate the upper bound of $α_{t}$ into two terms, among which one is the confidence bound of the original LinUCB and the other is the bound of injected noise. The fifth step follows the Cauchy–Schwarz inequality. Using Lemma 11 of (Improved_Algorithm), we have $\sum_{t} ∥ x_{a_{t}} ∥_{A_{t}^{- 1}}^{2} \leq \sqrt{d log (λ + \frac{L T}{d})}$ . The bound of $∥ {^θ}_{T} - θ^{*} ∥_{A_{T}}^{2} \leq \sqrt{d log \frac{1 + T L^{2} λ}{ζ}} + \sqrt{λ} S$ follows self-normalized bound for martingales of Lemma 9 in (Improved_Algorithm). We bound ${max}_{t^{'}} ∥ η_{t^{'}} ∥_{A_{t^{'}}}^{2}$ when $λ \geq λ_{0} t / 16$ , i.e., $t > 32 log (1 / δ) / λ_{0}$ , by taking the derivative of $\frac{{log}^{1.5} t}{\sqrt{t}}$ and find its maximum value at $t = e^{3}$ , since it is a concave function. Substitute it back and we get its bound of $\frac{3^{1.5}}{e^{1.5}}$ . The regret for $t \leq 32 log (1 / δ) / λ_{0}$ is at most $32 L log (1 / δ) / λ_{0}$ . Combining all these terms together, we prove the regret bound of our developed private LinUCB based on global dynamic sensitivity analysis. ∎

Proof of Corollary 8.

When every context vector $x$ is perturbed by Gaussian noise $z \sim N (0, σ^{2})$ , $E [(x + z) (x + z)^{T}]$ has minimal eigenvalue $λ_{0}$ where $λ_{0} \geq Ω (σ^{2})$ . Thus Lemma 1 and 2 hold, and the privacy guarantee directly follows Theorem 4. Following Lemma 3.3 in (kannan2018smoothed), we have $∥ {^θ}_{T} - θ^{*} ∥_{A_{T}}^{2} \leq \frac{\sqrt{d log \frac{1 + T L^{2} λ}{ζ}} + \sqrt{λ} S}{σ^{2}}$ . Plug the result into the analysis of Theorem 7 and we have the regret bound. ∎


(a) Cumulative regret on simulation data.	(b) Cumulative reward on LastFM.	(c) Parameter estimation error on simulation data.


(a) Effect of privacy level $ϵ$ .	(b) Change of selected arms on simulation with single reward difference.	(c) Change of selected arms on LastFM with single reward difference.

Dynamic Global Sensitivity for Differentially Private Contextual Bandits

Abstract.

Abstract.

1. Introduction

2. Related Work

3. Method

3.1. Preliminaries

Assumption 1 ().

Definition 0 (Differential Privacy (dwork2006calibrating)).

Definition 0 (Global Sensitivity (dwork2006calibrating)).

3.2. Dynamic Global Sensitivity Analysis for Tree-based Mechanism

Lemma 0 (Sensitivity of estimated bandit model parameter ^θt).

Theorem 4 (Privacy).

Theorem 5 (Utility).

3.3. Differentially Private LinUCB

Claim 1 ().

Proof.

Lemma 0 (Confidence Bound).

Proof.

Theorem 7 (Regret Bound).

Corollary 0 ().

4. Experiment

5. Conclusions & Future Work

Acknowledgements.

Appendix

Missing Proofs

Proof of Lemma 3.

Proof of Theorem 4.

Proof of Theorem 5.

Lemma 0 ().

Proof of Theorem 7.

Proof of Corollary 8.

References

Lemma 0 (Sensitivity of estimated bandit model parameter ${^θ}_{t}$ ).