Future Gradient Descent for Adapting the Temporal Shifting Data Distribution in Online Recommendation Systems

Mao Ye The University of Texas at Austin. Ruichen Jiang The University of Texas at Austin. Haoxiang Wang The University of Illinois at Urbana-Champaign. Dhruv Choudhary Meta. Xiaocong Du Meta. Bhargav Bhushanam Meta. Aryan Mokhtari The University of Texas at Austin. Arun Kejariwal Meta. Qiang Liu The University of Texas at Austin.

Abstract

One of the key challenges of learning an online recommendation model is the temporal domain shift, which causes a mismatch between the training and testing data distributions and hence domain generalization error. To overcome this, we propose to learn a meta future gradient generator that forecasts the gradient information of the future data distribution for training so that the recommendation model can be trained as if we were able to look ahead at the future of its deployment. Compared with Batch Update, a widely used paradigm, our theory suggests that the proposed algorithm achieves smaller temporal domain generalization error measured by a gradient variation term in a local regret. We demonstrate the empirical advantage by comparing with various representative baselines.

1 Introduction

The web-scale recommendation system is one of the most important modern machine learning applications, providing personalized content to billions of users from inventories of billions of items. These recommendation models have been rapidly growing in both computation and memory in the past few years due to wider and deeper networks and the use of sparse embedding layers. Prior works [ye2020adaptive, he2020lightgcn, zhang2020retrain, peng2021learning] have demonstrated the importance of updating the recommendation model periodically (e.g., on a daily/weekly basis) as new data arrives to avoid the model becoming stale in a domain-shifting environment.

Designing such a periodic updating pipeline is non-trivial: the algorithm needs to achieve a good balance between consolidating the long-term memory (ensuring that useful past knowledge is preserved) and capturing short-term tendencies, which are valuable for near-future prediction [zhang2020retrain, peng2021learning, deng2021deeplight]. Algorithms can be categorized into two groups: 1. The sample-based approaches maintain a reservoir to reuse observed historical examples to preserve long-term memory [diaz2012real]. Several heuristics have been developed to select past examples by balancing recency and forgetting [chen2013terec, wang2018streaming, qiu2020gag, zhao2021stratified]; 2. The model-based approaches maintain the long-term memory by transferring knowledge between the past and the current model checkpoints via knowledge distillation [wang2020practical, xu2020graphsail, mi2020ader] and model fusion [zhang2020retrain, peng2021learning].

In this paper, we provide a novel perspective by framing the problem of learning under shifting domains as a temporal domain generalization problem. We observe that the crux lies in the mismatch between the distribution of the training examples and the distribution of the testing examples on which the model is deployed for recommendations. From this perspective, existing approaches mitigate the distribution mismatch in an indirect way: they train a robust model that is less vulnerable to the domain shift in the near future by making it capture both short-term and long-term signals. In contrast, we propose a more direct solution to the temporal domain generalization problem based on forecasting the future information for training. Consider the ideal case in which we can access the data distribution in the near future when the model is deployed; simply training the model by gradient descent using examples drawn from the future data distribution would then be desirable (see ‘Ideal Update’ in Fig 2). In the real world, where such future information is unavailable, we propose to train a meta future gradient generator to forecast the gradient of the future examples so that the recommendation model is trained as if we were able to look ahead at the future (i.e., ‘FGD Update’ in Fig 2). Complementing the sample-based and model-based approaches, our method is optimizer-based in that the trainer of the recommendation model is improved.

In theory, we frame the problem as an online learning problem in which the temporal domain generalization error is captured by the gradient variation term [chiang2012online, rakhlin2013online] in a local regret [hazan2017efficient]. We provide a theoretical understanding of why the proposed algorithm improves over batch update, a widely used training pipeline [wang2020practical], and show that our method is able to achieve a regret similar to that of a fixed meta future gradient generator oracle. Empirically, we compare our approach against several representative sample/model-based approaches and observe considerable performance improvement.

Notation. We denote the integer set $\{1, 2, \dots, b\}$ by $[b]$ . Moreover, $∥ \cdot ∥$ denotes the $ℓ_{2}$ vector norm, $∥ \cdot ∥_{1}$ denotes the $ℓ_{1}$ vector norm, and $S_{b} = \{a \in R^{b} : a_{i} \geq 0, ∥ a ∥_{1} = 1\}$ is the probability simplex.

2 Problem and Background

Temporal Domain Generalization.

Consider an online classification problem with the feature space $X$ and the label space $Y$ . Our goal is to learn an accurate prediction model $f_{θ} : X \to Y$ parameterized by $θ \in Θ$ from a stream of datasets $D_{1}, \dots, D_{T}$ in $T$ consecutive rounds. Specifically, at round $t$ , we choose a model parameter $θ_{t} \in Θ$ and deploy our prediction model $f_{θ_{t}}$ . Then we observe the dataset $D_{t}$ with $n_{t}$ labeled examples $D_{t} = \{(x_{t}^{(i)}, y_{t}^{(i)})\}_{i = 1}^{n_{t}}$ drawn from a certain data distribution $P_{t}$ , where $x_{t}^{(i)}$ are the input features and $y_{t}^{(i)}$ is the associated label. Thus, for a given loss function $ℓ : X \times Y \to R_{+}$ , the empirical loss of our prediction model at time $t$ is given by $r_{t} (θ) = E_{(x, y) \sim D_{t}} ℓ (f_{θ} (x), y)$ . Moreover, we consider the situation where the data distribution $P_{t}$ (i.e., the domain) is gradually changing over time. A natural performance metric for our learning algorithm is the temporal average of the test loss suffered by the prediction model:

\frac{1}{T} \sum_{t = 1}^{T} r_{t} (θ_{t}) .

(1)

We remark that $T$ , which denotes the total number of rounds in the online process, is typically large in practice.

The key challenge is the temporal domain generalization. Indeed, at time $t$ we train our prediction model $f_{θ_{t}}$ using the observed examples $\cup_{i \in \{0, 1, \dots, t - 1\}} D_{i}$ and, due to the temporal shift of the domain, the distribution of the test set does not match the distribution of its training set. Such a mismatch between the training and testing domains results in domain generalization error. See Fig 1 for an illustration.

Our formulation is motivated by the online recommendation systems that aim to advertise items to users given user features. The domain is gradually changing because of the flux in the content that gets continuously added/removed from the system [he2014practical, ye2020adaptive]. As the recommendation model needs to be deployed for serving, it is hard to update its parameter in real time [cervantes2018evaluating, wang2020practical, peng2021learning]. The training process is thus discretized in which the model parameter is updated periodically with the hope that it generalizes well in its test domain.

Algorithm 1 Batch and Incremental Update

Input: The learning rate $η$ for updating the parameter $θ$
for $t \in [T]$
    Deploy the prediction model $f_{θ_{t}}$ with parameter $θ_{t}$
    Collect the new dataset $D_{t}$
    Initialize $θ_{t + 1}$
    while $∥ \frac{1}{b} \sum_{i = 0}^{b - 1} \nabla r_{t - i} (θ_{t + 1}) ∥ \geq δ$
        $θ_{t + 1} \leftarrow θ_{t + 1} - η \frac{1}{b} \sum_{i = 0}^{b - 1} \nabla r_{t - i} (θ_{t + 1})$
    end while
end for

Figure 2: Comparing (1) the ideal update where the future information $\nabla r_{t}$ can be accessed at training time; (2) the batch update; and (3) our proposed approach.

Algorithm 2 Future Gradient Descent

Input: The learning rates $η$ and $η_{ϕ}$ for updating the model parameters $θ$ and $ϕ$ . The initial trajectory buffer $B$
for $t \in [T]$
    Deploy the prediction model $f_{θ_{t}}$ with parameter $θ_{t}$ . Then collect the new dataset $D_{t}$
    Initialize the parameter of MFGG $ϕ_{t + 1}$    ▹ Initialization of $ϕ_{t + 1}$ is user-specific.
    for inner loop iteration $k \in [K]$    ▹ Update the meta network.
        $ϕ_{t + 1} \leftarrow ϕ_{t + 1} - η_{ϕ} \sum_{θ \in B} \nabla_{ϕ} ∥ m (θ; ϕ_{t + 1}, t) - \nabla r_{t} (θ) ∥^{2}$    ▹ May replace with the mini-batch version.
    end for
    Initialize the trajectory buffer $B = \emptyset$ and model parameter $θ_{t + 1}$    ▹ Initialization scheme of $θ_{t + 1}$ is specified by user.
    while $∥ m (θ_{t + 1}; ϕ_{t + 1}, t + 1) ∥ \geq δ$    ▹ Alternatively, we may run gradient descent with a fixed number of iterations.
        $θ_{t + 1} \leftarrow θ_{t + 1} - η m (θ_{t + 1}; ϕ_{t + 1}, t + 1)$    ▹ May replace with the mini-batch version.
        $B \leftarrow B \cup \{θ_{t + 1}\}$    ▹ Alternatively, we may update the trajectory buffer $B$ every few iterations.
    end while
end for

Batch and Incremental Update.

Batch Update (BU) [hazan2017efficient, wang2020practical] is a widely used updating pipeline for training the recommendation model in temporally shifting domains. At each time $t$ , the model parameters are updated using the gradient of the averaged losses $r_{t}, \dots, r_{t - b + 1}$ , where $b$ is a time window size indicating how many observed datasets are used. BU with $b = 1$ is also called Incremental Updating (IU). We summarize the pipeline in Algorithm 1, where $\nabla r_{s}$ with $s \leq 0$ is defined as $0$ . See also the illustration of BU with $b = 2$ in the second plot of Fig. 2. It is noteworthy that the initialization scheme of $θ_{t + 1}^{'}$ for the update at each time is problem-dependent and user-specified. For example, we can set $θ_{t + 1}^{'} = θ_{t}$ or $θ_{t + 1}^{'} = θ_{t - b + 1}$ if we consider the one-pass training setting [zheng2020shadowsync, ye2020adaptive, du2021alternate].
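As a minimal sketch of Algorithm 1, the following runs BU on toy quadratic losses $r_{t} (θ) = \frac{1}{2} ∥ θ - c_{t} ∥^{2}$ whose centers $c_{t}$ drift over time. The quadratic losses, the helper names `grad_r`, `norm` and `batch_update`, the initialization $θ_{t + 1}^{'} = θ_{t}$ and all constants are illustrative assumptions, not part of the paper's setup.

```python
def grad_r(theta, center):
    """Gradient of the toy loss r(theta) = 0.5 * ||theta - center||^2."""
    return [th - c for th, c in zip(theta, center)]


def norm(v):
    """Euclidean norm of a list of floats."""
    return sum(x * x for x in v) ** 0.5


def batch_update(centers, b=2, eta=0.5, delta=1e-6, max_steps=10_000):
    """Run BU over rounds t = 1..T; centers[t-1] defines the toy loss r_t.

    Returns the deployed parameters theta_1, ..., theta_T.
    """
    dim = len(centers[0])
    theta = [0.0] * dim               # theta_1: arbitrary initial model
    deployed = []
    for t in range(1, len(centers) + 1):
        deployed.append(list(theta))  # deploy f_{theta_t}, then observe D_t
        # Losses r_t, ..., r_{t-b+1}; grad r_s is defined as 0 for s <= 0.
        window = [centers[s - 1] for s in range(t - b + 1, t + 1) if s >= 1]
        new_theta = list(theta)       # assumed initialization: theta_t
        for _ in range(max_steps):
            g = [0.0] * dim
            for c in window:
                for j, gj in enumerate(grad_r(new_theta, c)):
                    g[j] += gj / b    # average over the window of size b
            if norm(g) < delta:       # delta-approximate stationary point
                break
            new_theta = [th - eta * gj for th, gj in zip(new_theta, g)]
        theta = new_theta
    return deployed
```

With `b=1` the sketch behaves as IU: each deployed parameter sits at the minimizer of the previous round's loss, so on a steadily drifting sequence it always lags one step behind.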

3 Method

Recall that our goal is to learn $f_{θ_{t}}$ that gives accurate predictions on $D_{t}$ (i.e., achieves small $r_{t}$ ). In an ideal world where we could access the near-future data $D_{t}$ while training $f_{θ_{t}}$ , a simple yet promising approach would be to apply gradient descent using $\nabla r_{t}$ (see the first plot in Fig. 2). In the real case, where the future information $\nabla r_{t}$ is not available, we propose to learn a meta future gradient generator (MFGG) that forecasts $\nabla r_{t}$ given the observed data $\cup_{i = 1}^{t - 1} D_{i}$ ; see the third plot in Fig 2.

Architecture of MFGG.

MFGG models $\nabla r_{t} (θ)$ as a non-linear functional auto-regressive time series model [bosq2000linear]. It approximates $\nabla r_{t} (θ)$ by aggregating the gradients of the latest $b$ losses, $\sum_{i = 0}^{b - 1} a_{i} (D_{t - b}, \dots, D_{t - 1}) \nabla r_{t - 1 - i} (θ)$ , where the coefficient of the linear combination $a_{i} (D_{t - b}, \dots, D_{t - 1})$ is a neural network given by the following computation graph.

	$e_{i, j}$	$= Embd (x_{j}^{(i)}) \in R^{d_{1}}$
	$e_{j}$	$= \sum_{i \in [n_{j}]} e_{i, j} \in R^{d_{1}}$
	$z$	$= Self Attention (e_{t - b}, \dots, e_{t - 1}) \in R^{d_{2} \times b}$
	$a$	$= Softmax \circ MLP (z_{t - b}, \dots, z_{t - 1}) \in R^{b}$

Here Embd denotes the embedding layer that maps the categorical features into a continuous embedding space (the continuous features remain the same in this layer); Self Attention denotes the self-attention layer [vaswani2017attention]; MLP denotes the multi-layer perceptron. MFGG first extracts the domain features $e_{j}$ of the last $b$ domains (i.e., over $j \in \{t - b, \dots, t - 1\}$ ), and the self-attention layer then encodes the interaction between the domain features, the outcomes of which are fed into the subsequent layers to calculate the coefficient $a$ . The softmax layer is optional; it constrains $a$ to lie in the probability simplex $S_{b}$ and hence ensures that the magnitude of the generated gradient is within a proper range. Letting $ϕ$ denote the collection of all the parameters, we write MFGG as $m (θ; ϕ, t)$ . In practice, we can simply replace $D_{j}$ with its mini-batch samples $\hat{D}_{j}$ , which gives a stochastic gradient version for updating.
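To make the last stage of the computation graph concrete, the sketch below shows only the softmax normalization and the weighted aggregation of past gradients. The embedding, self-attention and MLP layers are not modeled: the `scores` argument is a hypothetical stand-in for their output, and the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the output lies in the probability simplex."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mfgg_output(scores, past_grads):
    """Combine the b most recent gradients with softmax-normalized weights.

    scores[i]     : hypothetical pre-softmax output for lag i (stand-in for
                    the Embd -> Self Attention -> MLP pipeline, not modeled)
    past_grads[i] : gradient of r_{t-1-i} evaluated at the current theta
    """
    a = softmax(scores)
    dim = len(past_grads[0])
    return [sum(a[i] * past_grads[i][j] for i in range(len(a)))
            for j in range(dim)]
```

Because the weights form a convex combination, the norm of the generated gradient never exceeds the largest norm among the past gradients, which is the sense in which the softmax keeps the magnitude in a proper range.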

Optimization of MFGG.

We use the squared $ℓ_{2}$ loss $∥ m (θ; ϕ, t) - \nabla r_{t} (θ) ∥^{2}$ for measuring the prediction error of $m (θ; ϕ, t)$ at time $t$ . Such error depends on both $ϕ$ , the parameter of MFGG, and $θ$ , the parameter of the recommendation model used for calculating the gradient. We are more interested in making MFGG accurate on a small subset of the model parameter space $Θ$ in which $θ$ gives a recommendation model with good performance. We thus only apply the $ℓ_{2}$ loss on the (sub-sampled) optimization trajectory of $θ$ , which we denote as $B$ . That is, we learn $m (θ; ϕ, t)$ by applying gradient descent on

\sum_{θ \in B} ∥ m (θ; ϕ, t) - \nabla r_{t} (θ) ∥^{2} .

Note that when calculating the gradient with respect to $ϕ$ , $θ$ is viewed as a constant, and hence no differentiation through $θ$ is applied. Algorithm 2 summarizes the detailed procedure. Again, a mini-batch version of $m (θ; ϕ, t)$ and $\nabla r_{t} (θ)$ can be used during the training of MFGG. In practice, at $t \leq b$ we do not have enough historical data to compute MFGG, so we can simply use IU for training (alternatively, data for offline training can be used instead). Since our approach uses the MFGG to predict the gradient of the loss on the unobserved future data, we name it Future Gradient Descent (FGD).
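The inner loop of Algorithm 2 can be sketched for the simplest possible generator: a linear combination of past gradients with unconstrained coefficients (no softmax, which the text describes as optional). This is an illustrative assumption rather than the neural MFGG; the names `mfgg_loss_and_grad` and `fit_mfgg` are hypothetical. The gradients at the buffered parameters are passed in as plain numbers, which mirrors treating $θ$ as a constant.

```python
def mfgg_loss_and_grad(a, buffer_grads):
    """Squared l2 loss of a linear gradient generator over the buffer B.

    buffer_grads: list of (past, target) pairs, one per theta in B, where
      past[i] is grad r_{t-1-i}(theta) and target is grad r_t(theta).
    Returns the loss and its gradient with respect to the coefficients a.
    """
    loss = 0.0
    grad = [0.0] * len(a)
    for past, target in buffer_grads:
        dim = len(target)
        pred = [sum(a[i] * past[i][j] for i in range(len(a)))
                for j in range(dim)]
        resid = [p - y for p, y in zip(pred, target)]
        loss += sum(r * r for r in resid)
        for i in range(len(a)):
            grad[i] += 2.0 * sum(resid[j] * past[i][j] for j in range(dim))
    return loss, grad


def fit_mfgg(a, buffer_grads, eta_phi=0.05, steps=200):
    """Plain gradient descent on the coefficients over the trajectory buffer."""
    for _ in range(steps):
        _, g = mfgg_loss_and_grad(a, buffer_grads)
        a = [ai - eta_phi * gi for ai, gi in zip(a, g)]
    return a
```

For example, if every target in the buffer equals 0.7 times the first past gradient plus 0.3 times the second, `fit_mfgg([0.5, 0.5], buffer_grads)` recovers coefficients close to `[0.7, 0.3]` provided the step size is small enough for the data at hand.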

Extension to a smoothed loss.

In practice, one might be interested in a smoothed version of the performance metric, as it has been observed to be a potentially more robust evaluation metric [he2014practical]. More precisely, consider the loss function

\frac{1}{T} \sum_{t = 1}^{T} [\frac{1}{w} \sum_{i = 0}^{w - 1} r_{t - i} (θ_{t})],

(2)

where $r_{s}$ is identically zero for $s \leq 0$ . This smoothed loss in (2) uses a sliding window with width $w$ over the previous datasets $\cup_{i = 0}^{w - 1} D_{t - i}$ when evaluating. We are mainly interested in the standard metric (1) but when (2) is considered, we can simply generalize FGD by replacing $m (θ; ϕ, t)$ by

\bar{m} (θ; ϕ, t) = \frac{1}{w} (m (θ; ϕ, t) + \sum_{i = 1}^{w - 1} \nabla r_{t - i} (θ)),

when training $θ$ . Here $\nabla r_{s}$ with $s \leq 0$ is defined as 0. We refer readers to Algorithm 5 in Appendix C for the details. In the rest of the paper, we focus on the smoothed version of the loss as it is more general.

Before moving forward, we emphasize the difference between the two window sizes $b$ and $w$ that appear in BU/FGD and in the definition of (2), respectively. In some sense, $b$ corresponds to the number of recently observed datasets used for training the model, whereas $w$ represents the number of datasets used for testing the model.

4 Theory

In this section, we study the advantage of the proposed FGD over BU and IU theoretically using recent advances in non-convex online learning. Specifically, we show that FGD is able to perform better than BU and IU in terms of the so-called local regret [hazan2017efficient, hallak2021regret], which measures the algorithm’s performance by comparing it with the best one can achieve in hindsight.

4.1 Local Regret

To upper bound the average loss in (1) in a changing environment, one standard approach is to study the average dynamic regret [zinkevich2003online]:

\frac{1}{T} \sum_{t = 1}^{T} [r_{t} (θ_{t}) - \min_{θ \in Θ} r_{t} (θ)],

(3)

which uses the global minimum of $r_{t}$ as a benchmark when evaluating the performance at time $t$ . However, in modern recommendation systems the prediction model $f_{θ}$ is given by a deep neural network, and thus the resulting loss function $r_{t} (θ)$ is highly non-convex. This means finding an approximate global minimum of $r_{t}$ is computationally intractable, making it hopeless to derive any meaningful bound on the average dynamic regret in (3). To remedy this issue, we adopt the notion of local regret proposed in [hazan2017efficient]. Specifically, given $\{θ_{t}\}_{t = 1}^{T}$ generated by an online learning algorithm, the average local regret is defined as

R (T) := \frac{1}{T} \sum_{t = 1}^{T} ∥ \nabla r_{t} (θ_{t}) ∥^{2} .

(4)

Compared with (3), in (4) we evaluate the model parameters in terms of the first-order stationarity, and thus it can be viewed as the non-convex counterpart of the dynamic regret in (3). In particular, a small value of $R (T)$ implies a small gradient on average, suggesting that the algorithm achieves near-optimal performance locally in the long run.

More generally, when the smoothed loss (2) is considered, one can use the average $w$ -local regret accordingly as in [hazan2017efficient]:

R_{w} (T) := \frac{1}{T} \sum_{t = 1}^{T} ∥ \nabla u_{w, t} (θ_{t}) ∥^{2},

where we evaluate $θ_{t}$ using the smoothed loss function $u_{w, t} (θ) := \frac{1}{w} \sum_{i = 0}^{w - 1} r_{t - i} (θ)$ . In the following, we will focus our analysis on $R_{w} (T)$ , as choosing $w = 1$ also covers the standard local regret in (4).
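The definition of $R_{w} (T)$ translates directly into code. The sketch below assumes the gradients are available as callables returning lists of floats; the function name `w_local_regret` and this interface are illustrative choices.

```python
def w_local_regret(grad_fns, thetas, w=1):
    """Average w-local regret: (1/T) * sum_t ||grad u_{w,t}(theta_t)||^2.

    grad_fns[t-1](theta) returns grad r_t(theta) as a list of floats;
    thetas[t-1] is the deployed parameter theta_t.
    grad r_s is taken to be 0 for s <= 0, as in the text.
    """
    T = len(thetas)
    total = 0.0
    for t in range(1, T + 1):
        theta = thetas[t - 1]
        g = [0.0] * len(theta)
        for i in range(w):
            s = t - i
            if s >= 1:
                for j, gj in enumerate(grad_fns[s - 1](theta)):
                    g[j] += gj / w   # gradient of the smoothed loss u_{w,t}
        total += sum(x * x for x in g)
    return total / T
```

Setting `w=1` gives the standard local regret in (4).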

4.2 Regret of Batch Update

In [hazan2017efficient], the authors analyzed the average $w$ -local regret $R_{w} (T)$ for BU. We recall their result below but offer a different interpretation from the domain generalization perspective.

Proposition 1 ([hazan2017efficient, hallak2021regret]).

With the choice of the window size $b = w$ , the $w$ -local regret incurred by BU in Algorithm 1 satisfies

	$R_{w} (T)$	$\leq \underbrace{\frac{2}{T} \sum_{t = 1}^{T} ∥ \nabla u_{w, t - 1} (θ_{t}) ∥^{2}}_{\text{optimization error}} + \underbrace{\frac{2}{w^{2}} V_{w} (T)}_{\text{domain generalization error}}$
		$\leq 2 δ^{2} + \frac{2}{w^{2}} V_{w} (T),$
	where	$V_{w} (T) = \frac{1}{T} \sum_{t = 1}^{T} \sup_{θ} ∥ \nabla r_{t} (θ) - \nabla r_{t - w} (θ) ∥^{2} .$

Furthermore, if $∥ \nabla r_{t} (θ) ∥ \leq M < \infty$ for all $θ \in Θ$ and $t \geq 0$ , choosing $δ = O (1 / w)$ gives $R_{w} (T) = O (1 / w^{2})$ , which is minimax optimal.

The previous works [hazan2017efficient, hallak2021regret] are interested in the worst-case guarantee of the BU algorithm, and the result in Proposition 1 only serves as an intermediate result. However, we observe that this regret bound also offers interesting insights from the perspective of domain generalization. To be specific, we can decompose it into two terms:

The optimization error: this is due to the fact that we only seek a $δ$ -approximate stationary point of the smoothed training loss function $u_{w, t - 1} (θ)$ at round $t$ . It is controllable in the sense that $δ$ can be made arbitrarily small by running more iterations of gradient descent. Indeed, under standard smoothness assumption on $r_{i}$ , we can achieve $∥ \nabla u_{w, t - 1} (θ_{t}) ∥ \leq δ$ within $O (δ^{- 1})$ iterations. The optimization error term thus corresponds to how well we train the recommendation model in each round.

The domain generalization error: this is due to the fact that the test set $\cup_{i = 0}^{w - 1} D_{t - i}$ for evaluating $θ_{t}$ is different from the training set $\cup_{i = 1}^{w} D_{t - i}$ . It is typically the dominant term in the regret bound and will not vanish even when $δ = 0$ . In some sense, it captures the level of variability in the data distributions, similar to the gradient variation term in [chiang2012online, rakhlin2013online]. We also note that the domain generalization error decreases w.r.t. $w$ . This is because when $w$ increases, the overlap between the training set and the test set becomes larger (i.e., the training set and the test set deviate less). (Footnote 1: Such an overlapping mechanism is the key to defending against adversaries in non-convex games; we refer to Section 2.3 in [hazan2017efficient] for more details.)

In summary, the optimization error term characterizes how well our model performs on the training set, while the domain generalization error term characterizes how much the test set deviates from the training set.
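The two terms can be traced to a single step: the smoothed test gradient and the smoothed training gradient differ only in their first and last summands. The following is a sketch of that step under the setting of Proposition 1 ( $b = w$ ), using the convention $\nabla r_{s} = 0$ for $s \leq 0$ and the elementary inequality $∥ x + y ∥^{2} \leq 2 ∥ x ∥^{2} + 2 ∥ y ∥^{2}$ ; it is not the full argument of the cited works.

```latex
\begin{aligned}
\nabla u_{w,t}(\theta) - \nabla u_{w,t-1}(\theta)
  &= \frac{1}{w}\sum_{i=0}^{w-1}\nabla r_{t-i}(\theta)
   - \frac{1}{w}\sum_{i=0}^{w-1}\nabla r_{t-1-i}(\theta)
   = \frac{1}{w}\bigl(\nabla r_{t}(\theta) - \nabla r_{t-w}(\theta)\bigr), \\
\|\nabla u_{w,t}(\theta_t)\|^{2}
  &\le 2\,\|\nabla u_{w,t-1}(\theta_t)\|^{2}
   + \frac{2}{w^{2}}\,\|\nabla r_{t}(\theta_t) - \nabla r_{t-w}(\theta_t)\|^{2}.
\end{aligned}
```

Averaging over $t$ and bounding the last term by its supremum over $θ$ yields the first inequality; the stopping rule of Algorithm 1 gives $∥ \nabla u_{w, t - 1} (θ_{t}) ∥ \leq δ$ , which yields the second.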

Comparison with other measures of domain divergence.

In Proposition 1, the domain discrepancy is characterized in terms of the gradient variation (i.e., how much the gradients of the loss functions differ). Some other domain discrepancy measures have also been proposed. Examples include the $H$ -divergence [kifer2004detecting] between $D$ and $D^{'}$ defined as $d_{H} (D, D^{'}) = \sup_{θ} ∥ E_{D} ℓ (f_{θ} (x), y) - E_{D^{'}} ℓ (f_{θ} (x), y) ∥$ and the $H Δ H$ divergence [ben2010theory] defined as $d_{H Δ H} (D, D^{'}) = \sup_{θ, θ^{'}} ∥ E_{D} ℓ (f_{θ}, f_{θ^{'}}) - E_{D^{'}} ℓ (f_{θ}, f_{θ^{'}}) ∥$ . Overall, the commonly used divergence measures share the general form $\sup_{θ} ∥ E_{D} g_{θ} - E_{D^{'}} g_{θ} ∥$ , where $g_{θ}$ is a test function parameterized by $θ$ . The $H$ -divergence uses $g_{θ} = ℓ (f_{θ} (x), y)$ , and the $H Δ H$ divergence first extends the parameter space $Θ$ to the product space $Θ \otimes Θ$ and lets $g_{(θ, θ^{'})} = ℓ (f_{θ} (x), f_{θ^{'}} (x))$ for any $(θ, θ^{'}) \in Θ \otimes Θ$ . The gradient variation uses $g_{θ} = \nabla_{θ} ℓ (f_{θ} (x), y)$ . As we consider the local regret for non-convex problems where the goal is to find a first-order stationary point, using the gradient as the test function is a natural fit.

4.3 The Headroom of Batch Update

In the last section, we see that BU achieves the minimax regret, so at first sight it seems there is no room for further improvement. However, we note that this only implies that BU is optimal in the worst-case sense, i.e., when the future data distribution is completely uncorrelated with the previous ones. This is hardly the case in reality: the drift in the data distribution normally happens in a gradual manner, and the data distribution in the past should be informative of the future. Hence, the natural question is: can we do better than BU in a gradually changing environment?

The discussion after Proposition 1 suggests that the only hope for improvement lies in reducing the domain generalization error $V_{w} (T)$ . To illustrate the headroom, we start with Meta Gradient Descent (MGD), a ‘helper algorithm’ that extends BU and serves as an intermediate step towards the proposed FGD. Assume that we are given a sequence of gradient generators $\{m (\cdot; t)\}_{t = 1}^{T}$ . Then MGD uses a smoothed gradient generator given by

\bar{m} (θ; t) = \frac{1}{w} (m (θ; t) + \sum_{i = 1}^{w - 1} \nabla r_{t - i} (θ)),

for updating, yielding Algorithm 3.

Algorithm 3 Meta Gradient Descent: a helper algorithm

Input: The learning rate $η$ for updating the parameter $θ$
for $t \in [T]$
    Deploy the prediction model $f_{θ_{t}}$ with parameter $θ_{t}$
    Collect the new dataset $D_{t}$
    Construct the smoothed gradient generator $\bar{m} (\cdot; t + 1)$
    Initialize $θ_{t + 1}$
    while $∥ \bar{m} (θ_{t + 1}; t + 1) ∥ \geq δ$
        $θ_{t + 1} \leftarrow θ_{t + 1} - η \bar{m} (θ_{t + 1}; t + 1)$
    end while
end for

By substituting $\nabla r_{t - w} (\cdot)$ for $m (\cdot; t)$ , MGD reduces to BU with $b = w$ . Comparing $\bar{m} (\cdot; t)$ with $\nabla u_{w, t}$ , the true gradient on the test set, we see that

\bar{m} (θ; t) - \nabla u_{w, t} (θ) = \frac{1}{w} (m (θ; t) - \nabla r_{t} (θ)),

(5)

suggesting that MGD introduces a general gradient generator $m (θ; t)$ as a proxy for $\nabla r_{t} (θ)$ , similar to FGD. On the other hand, we note that the gradient generator in MGD is pre-specified, while FGD parametrizes the gradient generator $m$ with $ϕ$ and optimizes it on the fly.
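For completeness, identity (5) follows by expanding the definitions of the smoothed generator and of $u_{w, t}$ ; the terms with $i = 1, \dots, w - 1$ are shared and cancel:

```latex
\begin{aligned}
\bar{m}(\theta; t) - \nabla u_{w,t}(\theta)
  &= \frac{1}{w}\Bigl(m(\theta; t) + \sum_{i=1}^{w-1}\nabla r_{t-i}(\theta)\Bigr)
   - \frac{1}{w}\sum_{i=0}^{w-1}\nabla r_{t-i}(\theta) \\
  &= \frac{1}{w}\bigl(m(\theta; t) - \nabla r_{t}(\theta)\bigr).
\end{aligned}
```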

From this perspective, BU in Algorithm 1 in fact implicitly uses $m (\cdot; t) = \nabla r_{t - w}$ to approximate $\nabla r_{t}$ , which explains why $V_{w} (T)$ depends on the difference between these two terms. While such a design makes sense in the very limited case where the sequence of domains is known to have a period of $w$ , it might not be a savvy choice in general. To be specific, one can construct $m$ from the observed datasets $D_{t - 1}, \dots, D_{t - b}$ based on some mapping parameterized by $ϕ \in Φ$ . For instance, such a mapping can be given by a deep neural network as described in Section 3. In this way, MGD enables a mechanism that utilizes the past domains more flexibly to predict the future gradient information $\nabla r_{t}$ when it can be forecasted with a more general form.
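A toy run of Algorithm 3 with $w = 1$ illustrates the headroom. The sketch assumes scalar quadratic losses $r_{t} (θ) = \frac{1}{2} (θ - c_{t})^{2}$ with linearly drifting centers and compares the generator implicit in BU with a hypothetical linear-extrapolation generator $2 \nabla r_{t - 1} - \nabla r_{t - 2}$ . Both the losses and the extrapolating generator are illustrative assumptions, not constructions from the paper.

```python
def mgd(centers, generator, eta=0.5, delta=1e-8, max_steps=10_000):
    """Meta Gradient Descent with w = 1 on toy losses r_t = 0.5*(theta - c_t)^2.

    generator(theta, t, seen) returns the proxy m(theta; t) for grad r_t(theta),
    using only the already observed centers c_1, ..., c_{t-1} in `seen`.
    Returns the deployed scalars theta_1, ..., theta_T.
    """
    theta, deployed = 0.0, []
    for t in range(1, len(centers) + 1):
        deployed.append(theta)         # deploy theta_t, then observe r_t
        seen = centers[:t]             # c_1, ..., c_t are now available
        for _ in range(max_steps):     # compute theta_{t+1}
            m = generator(theta, t + 1, seen)
            if abs(m) < delta:
                break
            theta -= eta * m
    return deployed


def bu_generator(theta, t, seen):
    """BU with b = w = 1: use grad r_{t-1} as the proxy for grad r_t."""
    return theta - seen[-1]


def extrapolating_generator(theta, t, seen):
    """Hypothetical proxy 2*grad r_{t-1} - grad r_{t-2} (linear extrapolation)."""
    if len(seen) < 2:
        return bu_generator(theta, t, seen)
    return 2.0 * (theta - seen[-1]) - (theta - seen[-2])


def local_regret(centers, deployed):
    """Average squared gradient norm of r_t at the deployed theta_t."""
    return sum((th - c) ** 2 for th, c in zip(deployed, centers)) / len(centers)
```

On a linear drift such as `centers = [1.0, 2.0, ..., 10.0]`, the BU generator always lags one step behind, while the extrapolating generator predicts $\nabla r_{t}$ exactly once two rounds have been observed, so its regret is incurred only in the first two rounds.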

Theorem 1.

The $w$ -local regret incurred by Algorithm 3 satisfies

R_{w} (T) \leq 2 δ^{2} + \frac{2}{w^{2}} Q (T; m),

where $Q (T; m) := \frac{1}{T} \sum_{t = 1}^{T} \sup_{θ} ∥ \nabla r_{t} (θ) - m (θ; t) ∥^{2}$ . Furthermore, if both $∥ \nabla r_{t} ∥$ and $∥ m (\cdot; t) ∥$ are upper bounded by $M < \infty$ for all $θ \in Θ$ and $t \geq 0$ , we recover the minimax regret $R_{w} (T) = O (1 / w^{2})$ when $δ = O (1 / w)$ .

Theorem 1 shows that we can greatly improve the regret of BU by reducing the domain generalization error $Q (T; m)$ if $m$ is properly chosen. Specifically, suppose that $M$ —the hypothesis class of $m$ —is rich enough to model the dynamic of the data distribution, in the sense that there exists $m^{*} \in M$ satisfying

Q (T; m^{*}) := \frac{1}{T} \sum_{t = 1}^{T} \sup_{θ} ∥ \nabla r_{t} (θ) - m^{*} (θ; t) ∥^{2} = O (\frac{1}{T}) .

Then the domain generalization error of MGD equipped with $m^{*}$ tends to zero at the rate of $1 / T$ , in contrast to being a non-vanishing dominant term in BU. On the other hand, we can still maintain essentially the same regret bound as BU in the worst case, and thus the improvement almost comes for free.

In the following section, we show that it is indeed possible for FGD to achieve a local regret bound comparable to the one given by MGD with the optimal gradient generator $m^{*}$ in $M$ .

4.4 Regret Bound of FGD

To simplify the analysis, we consider the case where the gradient generator at round $t$ is given by a linear model:

m (θ; ϕ, t) = \sum_{i = 1}^{b} a_{i} \nabla r_{t - i} (θ),

(6)

where $ϕ = [a_{1}, \dots, a_{b}] \in S_{b}$ is the parameter. The hypothesis class $M$ is thus $M = \{\{m (\cdot; ϕ, t)\}_{t = 1}^{T} : m (\cdot; ϕ, t) = \sum_{i = 1}^{b} a_{i} \nabla r_{t - i} (\cdot), ϕ \in S_{b}\}$ . This family of FGD algorithms covers the BU algorithm, which corresponds to setting $a_{b} = 1$ and $a_{i} = 0$ otherwise. For this toy example, we use the classic exponentiated gradient descent method [kivinen97exponentiated] to update $ϕ$ , which ensures that $ϕ \in S_{b}$ . The detailed algorithm is summarized in Algorithm 4 in Appendix B.
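Algorithm 4 is not reproduced in this section, so the following is only a generic sketch of an exponentiated gradient update for the coefficients of the linear model (6), not the paper's exact procedure. The function names, the uniform initialization and the use of a single evaluation point are assumptions made for illustration.

```python
import math


def exponentiated_gradient_step(a, grad, eta_phi):
    """One multiplicative update; the result stays in the probability simplex."""
    unnorm = [ai * math.exp(-eta_phi * gi) for ai, gi in zip(a, grad)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def fit_simplex_weights(past, target, eta_phi=0.5, steps=500):
    """Fit a in S_b so that sum_i a_i * past[i] approximates target.

    past[i] plays the role of grad r_{t-i}(theta) and target that of
    grad r_t(theta), both evaluated at one fixed theta (an assumption).
    """
    b, dim = len(past), len(target)
    a = [1.0 / b] * b                  # assumed uniform initialization
    for _ in range(steps):
        pred = [sum(a[i] * past[i][j] for i in range(b)) for j in range(dim)]
        resid = [p - y for p, y in zip(pred, target)]
        # Gradient of ||sum_i a_i * past[i] - target||^2 with respect to a_i.
        grad = [2.0 * sum(resid[j] * past[i][j] for j in range(dim))
                for i in range(b)]
        a = exponentiated_gradient_step(a, grad, eta_phi)
    return a
```

Because each update multiplies by a positive factor and renormalizes, the coefficients remain non-negative and sum to one without any explicit projection step.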

Theorem 2.

Assume that for any $t$ , $∥ \nabla r_{t} ∥$ is bounded by $M < \infty$ . Let $M$ be the hypothesis class of $m$ given in (6). For any given constant $c > 0$ , if we set the learning rate for updating $m$ as $η_{ϕ} = c \sqrt{(\log b) / (T M^{4})}$ , the $w$ -local regret incurred by Algorithm 4 in Appendix B satisfies

	$R_{w} (T)$	$\leq 2 δ^{2} + \frac{2}{w^{2}} (Q (T; m^{*}) + O (M^{2} \sqrt{\log b / T})),$
	where	$Q (T; m^{*}) = \min_{m \in M} \frac{1}{T} \sum_{t = 1}^{T} \sup_{θ} ∥ \nabla r_{t} (θ) - m (θ; ϕ, t) ∥^{2} .$

Theorem 2 suggests that FGD with an optimized MFGG is able to achieve the regret of Algorithm 3 using $m^{*}$ up to an $O (1 / \sqrt{T})$ excess error. As $T$ is usually large, this excess error is small.

5 Related Work

Domain Generalization.

Our problem can be viewed as an extension of the classic domain generalization problem. In short, the classic domain generalization problem that is extensively studied in vision and NLP is one-shot in the sense that it aims to generalize a model to one unseen target domain by training over multiple source domains. In contrast, our problem is $T$ -shot, since we have a stream of $T$ pairs of target/source domains. The difference between one-shot and $T$ -shot can be significant. In the one-shot setting, we are unable to receive feedback on how the model generalizes to the unseen domain, and thus existing algorithms focus on improving the worst-case generalization by learning domain-invariant representations based on methods such as domain feature alignment [li2018domain, guo-etal-2019-towards], causal learning [arjovsky2019invariant, wang2022provable], multi-task learning [carlucci2019domain], meta-learning [balaji2018metareg, li2018learning] and data augmentation [yan2020improve, ilse2021selecting]. In comparison, our algorithm mainly focuses on how to use the feedback in the $T$ -shot setting to learn to predict the gradient information of the future unseen domain. While adopting the techniques from one-shot domain generalization is of interest, the design of those algorithms utilizes a lot of domain knowledge from CV or NLP, making it non-trivial to apply them to recommendation systems. We thus leave it for future work.

Continual Learning.

Continual learning is a similar scenario where the goal is to learn an accurate model given a stream of different tasks/domains. Compared with multi-task learning [sener2018multi, crawshaw2020multi, ye2021pareto, wang2021bridging], the key challenge of continual learning is catastrophic forgetting [kirkpatrick2017overcoming]: the model forgets how to solve past tasks after it is exposed to new tasks. Various types of solutions have been proposed, including rehearsal-based methods [lopez2017gradient, aljundi2019gradient, chaudhry2020using], knowledge distillation [rebuffi2017icarl], regularization [kirkpatrick2017overcoming, buzzega2020dark] and architecture adjustment [rusu2016progressive, serra2018overcoming]. Although the learning scenario is similar, a direct application of continual learning methods to our setting might not give a desirable outcome. The reason is that the final goals of the two problems are quite different: continual learning aims to learn the current task without sacrificing the performance on the previously learned tasks, while we only focus on performing well on the unobserved future task.

Gradual Domain Adaptation.

Gradual domain adaptation (GDA) aims at adapting a model to an unlabeled target domain after being trained on a labeled source domain and a sequence of unlabeled intermediate domains. Despite being similar to the setting of temporal domain generalization, GDA is still different from the latter since there are no labels provided in the intermediate domains for GDA. A modern and common approach for GDA is gradual self-training [kumar2020understanding, wang2022understanding, zhou2022online, dong2022algorithms], which fits a model to the source domain and then adapts the model along the sequence of intermediate domains consecutively with self-training [nigam2000analyzing].

Meta-Learning.

Meta-learning, or learning-to-learn, aims to optimize the training process such that the outcome is improved. Examples of meta-learning include learning a better initialization [finn2017model, lee2018gradient], optimizer [andrychowicz2016learning, flennerhag2019meta], hyper-parameter [franceschi2018bilevel, chen2019lambdaopt] and network architecture [liu2018darts, wang2022global]. The proposed FGD can be viewed as learning a better optimizer for temporal domain generalization problems. Meta-learning is also widely deployed in recommendation systems. Examples include solving the cold-start issue [bharadhwaj2019meta, lee2019melu] through learning an initialization, and knowledge transfer through model fusion [zhang2020retrain, peng2021learning].

6 Experiment

Method	FM				DeepFM
Method	AUC-8 $↑$	Logloss-8 $↓$	AUC-16 $↑$	Logloss-16 $↓$	AUC-8 $↑$	Logloss-8 $↓$	AUC-16 $↑$	Logloss-16 $↓$
IU	$60.35 \pm 0.54$	$16.74 \pm 0.25$	$60.56 \pm 0.61$	$16.78 \pm 0.16$	$60.48 \pm 0.47$	$15.63 \pm 0.19$	$60.62 \pm 0.60$	$15.69 \pm 0.12$
BU-2	$62.69 \pm 0.50$	$16.04 \pm 0.19$	$62.31 \pm 0.73$	$16.15 \pm 0.20$	$62.65 \pm 0.37$	$15.21 \pm 0.15$	$62.40 \pm 0.48$	$15.22 \pm 0.13$
SPMF-2	$61.56 \pm 0.43$	$18.30 \pm 0.31$	$61.41 \pm 0.75$	$18.48 \pm 0.20$	$61.12 \pm 0.57$	$15.74 \pm 0.21$	$60.64 \pm 0.90$	$15.65 \pm 0.13$
ASMG-2	$63.82 \pm 0.42$	$16.51 \pm 0.28$	$63.80 \pm 0.49$	$16.54 \pm 0.19$	$63.95 \pm 0.42$	$15.00 \pm 0.19$	$63.85 \pm 0.54$	$14.96 \pm 0.13$
Meta-2	$65.23 \pm 0.46$ *	$15.81 \pm 0.27$ *	$64.84 \pm 0.61$ *	$15.89 \pm 0.20$ *	$65.04 \pm 0.42$ *	$14.93 \pm 0.16$ *	$64.60 \pm 0.57$ *	$14.96 \pm 0.13$
BU-3	$63.55 \pm 0.46$	$15.28 \pm 0.15$	$63.40 \pm 0.64$	$15.30 \pm 0.13$	$63.65 \pm 0.40$	$14.93 \pm 0.14$	$63.41 \pm 0.51$	$14.90 \pm 0.11$
SPMF-3	$60.73 \pm 0.55$	$18.18 \pm 0.35$	$61.00 \pm 0.87$	$18.32 \pm 0.23$	$61.83 \pm 0.54$	$14.99 \pm 0.16$	$61.32 \pm 0.62$	$14.74 \pm 0.12$
ASMG-3	$63.21 \pm 0.49$	$18.51 \pm 0.41$	$63.35 \pm 0.69$	$19.61 \pm 0.27$	$65.02 \pm 0.41$	$14.82 \pm 0.17$	$64.77 \pm 0.53$	$14.80 \pm 0.11$
Meta-3	$67.20 \pm 0.25$ *	$15.09 \pm 0.18$ *	$67.05 \pm 0.38$ *	$15.10 \pm 0.14$ *	$66.92 \pm 0.26$ *	$14.65 \pm 0.15$ *	$66.78 \pm 0.37$ *	$14.62 \pm 0.11$
BU-5	$66.19 \pm 0.24$	$14.76 \pm 0.18$	$66.24 \pm 0.30$	$14.71 \pm 0.13$	$66.15 \pm 0.23$	$14.54 \pm 0.15$	$66.23 \pm 0.29$	$14.49 \pm 0.11$
SPMF-5	$61.96 \pm 0.44$	$14.69 \pm 0.13$	$62.21 \pm 0.53$	$14.74 \pm 0.10$	$63.79 \pm 0.41$	$14.83 \pm 0.18$	$62.79 \pm 0.48$	$14.53 \pm 0.13$
ASMG-5	$65.82 \pm 0.32$	$14.79 \pm 0.14$	$65.99 \pm 0.40$	$14.79 \pm 0.11$	$66.49 \pm 0.26$	$14.50 \pm 0.14$	$66.47 \pm 0.35$	$14.50 \pm 0.10$
Meta-5	$69.00 \pm 0.21$ *	$14.62 \pm 0.13$	$69.37 \pm 0.19$ *	$14.61 \pm 0.11$	$68.85 \pm 0.33$ *	$14.39 \pm 0.23$	$69.15 \pm 0.28$ *	$14.38 \pm 0.22$

Table 1: Summarized results for CriteoTB. AUC/Logloss-x denotes the result based on the examples from the last x days. The average performance over three random seeds is reported together with its standard deviation. We mainly compare the algorithms when the same $b$ is used, and the best approach is bolded. The * denotes that the best result is statistically significant compared with the second best, with p-value less than 0.05 using a matched-pair t-test.

We demonstrate the effectiveness of the proposed FGD.

Dataset.

We consider two datasets, CriteoTB and Avazu. CriteoTB is the 24-day advertising dataset published by Criteo. It has 13 integer feature fields and 26 categorical feature fields, with around 800 million categorical tokens in total. Training on the original CriteoTB dataset is computationally expensive, so to reduce the overhead and improve reproducibility we use a subsampled CriteoTB in which 10% of the examples are sampled for evaluation. Avazu contains 11 days of click/non-click data from Avazu, and all of its 22 feature fields are categorical. We preprocess both datasets following guo2017deepfm, liu2020learnable.

Training Protocol.

In real-world recommendation systems, passing over the examples multiple times during training may cause severe over-fitting [zheng2020shadowsync, ye2020adaptive, du2021alternate]. Following zheng2020shadowsync, ye2020adaptive, we perform a single pass over the training data, in the sense that each training example is visited only once throughout training. Thus, we set $θ_{t}^{0} = θ_{t - b}$ when training the model at time $t$ , because the examples from the domains $D_{s}$ , $s \leq t - b$ , have already been visited when learning $θ_{t - b}$ . In Algorithm 2, the default scheme trains the recommendation model until the norm of the gradient is smaller than a threshold. In the experiments, we instead use the alternative strategy of training the model for a fixed number of iterations such that every example is passed exactly once.

Evaluation Protocol.

As we consider an online learning environment, there is no need to split the dataset into training and testing subsets. Instead, once $θ_{t}$ has been trained, the data of the next day $D_{t + 1}$ is used to evaluate the performance of $f_{θ_{t}}$ , so that the domain generalization error is taken into account. This evaluation protocol matches real recommendation systems [ye2020adaptive]. We adopt AUC (Area Under the ROC Curve) and Logloss to measure the performance. For CriteoTB, we evaluate the performance on the last 8 or 16 days, and the first 16 or 8 days, respectively, are used for offline training as a warm start. For Avazu, the first 3 days are used for offline training and hence only the last 8 days are used for evaluation. The metrics are averaged over all the days used for evaluation. For all experimental settings, we run every compared approach 3 times with different random seeds and report the averaged result.
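The next-day evaluation described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the evaluation code used in the experiments: `models`, `days` and `eval_start` are hypothetical names, and the pairwise AUC is a simple quadratic-time stand-in for a library routine.

```python
import numpy as np

def logloss(y, p, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped for numerical stability."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def auc(y, p):
    """Pairwise AUC: P(score_pos > score_neg), ties count one half. O(n_pos * n_neg)."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def evaluate_next_day(models, days, eval_start):
    """models[t] maps features to click probabilities; days[t] = (X_t, y_t).

    The model trained at time t is scored on day t + 1, for every day
    t + 1 >= eval_start, and the daily metrics are averaged.
    """
    aucs, losses = [], []
    for t in range(eval_start - 1, len(days) - 1):
        X_next, y_next = days[t + 1]
        p = models[t](X_next)
        aucs.append(auc(y_next, p))
        losses.append(logloss(y_next, p))
    return float(np.mean(aucs)), float(np.mean(losses))
```

With 24 days of data and `eval_start = 16`, for instance, the average would run over the last 8 days (0-indexed days 16 to 23).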

Models and Optimizers.

We consider two representative architectures for recommendation models, FM [rendle2010factorization] and DeepFM [guo2017deepfm]. Following guo2017deepfm, liu2020learnable, we use Adam as the optimizer and tune the learning rate of each compared method over $\{0.01, 0.001, 0.0001, 0.00001\}$ based on the performance of the offline training. The batch size is set to 1024. For FGD, we add the current model on the training trajectory to the trajectory buffer every 150/50 iterations for CriteoTB/Avazu. The meta network is trained using SGD with learning rate 0.01 and batch size 20.

Baselines.

For comparison, we consider the following optimization algorithms: Incremental Update (IU) [wang2020practical], which updates the model incrementally using only the newly observed data $D_{t}$ ; Batch Update (BU- $b$ ) [wang2020practical], which updates the model using the most recent $b$ domains $\{D_{t}, \dots, D_{t + 1 - b}\}$ ; Stream-centered Probabilistic Matrix Factorization (SPMF- $b$ ) [wang2018streaming], in which a reservoir of historical examples is maintained and mixed with the new data when updating the current model, where SPMF- $b$ denotes the setting in which the example buffer has the same size as the number of examples in $b$ days; Adaptive Sequential Model Generation (ASMG- $b$ ) [peng2021learning], which generates a better serving model from a sequence of the $b$ most recent historical serving models via a meta generator; and Future Gradient Descent (FGD- $b$ , listed as Meta- $b$ in the tables), our approach with the most recent $b$ domains used for training the recommendation models.
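To make the role of $b$ concrete, the helper below lists which domains enter the model update at time $t$ for IU, BU- $b$ and FGD- $b$ . It is only an illustrative sketch: the function name `training_domains` and the convention that days are indexed from 1 are assumptions, and the reservoir of SPMF and the checkpoint sequence of ASMG are not modeled.

```python
def training_domains(method: str, t: int, b: int = 1) -> list[int]:
    """Return the indices s of the domains D_s used to update the model at time t."""
    if method == "IU":
        # Incremental Update: only the newly observed domain D_t.
        return [t]
    if method in ("BU", "FGD"):
        # The b most recent domains D_t, ..., D_{t+1-b}, truncated at the first day.
        return [s for s in range(t, t - b, -1) if s >= 1]
    raise ValueError(f"unknown method: {method}")

print(training_domains("BU", t=10, b=3))  # [10, 9, 8]
```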

Result.

Tables 1 and 2 summarize the results for CriteoTB and Avazu, respectively. The proposed FGD outperforms the baselines in most cases. We also observe that increasing $b$ improves the performance of most algorithms, as more information can be utilized. The performance boost of FGD when increasing $b$ is larger than that of the other approaches. The advantage of FGD is less pronounced on Avazu than on CriteoTB. We conjecture that this is because the domains of different days differ less in Avazu than in CriteoTB.

Method	FM		DeepFM
Method	Auc $↑$	Logloss $↓$	Auc $↑$	Logloss $↓$
IU	$73.82 \pm 0.18$	$39.92 \pm 0.86$	$73.99 \pm 0.22$	$39.80 \pm 0.81$
BU-2	$74.16 \pm 0.25$	$39.71 \pm 0.88$	$74.31 \pm 0.21$	$39.59 \pm 0.86$
SPMF-2	$69.31 \pm 0.31$	$45.51 \pm 0.99$	$71.11 \pm 0.53$	$42.09 \pm 0.59$
ASMG-2	$74.22 \pm 0.20$	$39.66 \pm 0.89$	$74.34 \pm 0.19$	$39.58 \pm 0.85$
Meta-2	$74.22 \pm 0.28$	$39.77 \pm 0.90$	$74.34 \pm 0.21$	$39.54 \pm 0.87$
BU-3	$74.17 \pm 0.31$	$39.68 \pm 0.89$	$74.50 \pm 0.30$	$39.48 \pm 0.90$
SPMF-3	$68.95 \pm 0.56$	$47.17 \pm 1.27$	$71.93 \pm 0.24$	$41.83 \pm 0.64$
ASMG-3	$73.64 \pm 0.08$	$39.93 \pm 0.83$	$73.95 \pm 0.17$	$39.82 \pm 0.83$
Meta-3	$74.20 \pm 0.27$ *	$39.68 \pm 0.89$	$74.55 \pm 0.28$ *	$39.45 \pm 0.90$

Table 2: Summarized result for Avazu. The setting of the table is the same as that of Table 1.

Temporal Domain Shift and Forecast Error of MFGG.

Figure 3: Left: evolution of $∥ \nabla r_{t} (θ_{t, i}) ∥^{2}$ . Right: the normalized forecast error of MFGG in different time and iterations.

To visualize the effect of the temporal domain shift, we plot the gradient norm during the whole training process. We consider FGD-3 on CriteoTB with DeepFM as the recommendation model. In this example, at each time $t$ , the recommendation model is trained for $R = 20$ K iterations. Let $θ_{t, i}$ denote the parameter at the $i$ -th iteration of the training performed at time $t - 1$ (note that after the training, $θ_{t}$ is used to predict examples in $D_{t}$ ). We visualize the evolution of the gradient norm on the future domain, $g_{t, i} = ∥ \nabla r_{t} (θ_{t, i}) ∥^{2}$ , in chronological order (i.e., $\dots, g_{t, 1}, \dots, g_{t, R}, g_{t + 1, 1}, \dots, g_{t + 1, R}, \dots$ ) in the left subfigure of Fig 3. Overall, $g$ is decreasing, which suggests improving performance, but significant fluctuation of $g$ is also observed: when we shift from $t$ to $t + 1$ , $g$ suddenly increases, demonstrating a considerable deviation between adjacent domains. We also visualize the (normalized) forecast error $e_{t, i}$ of MFGG in the right subfigure of Fig 3:

e_{t, i} = \frac{∥ m (θ_{t + 1, i}; ϕ_{t}, t) - \nabla r_{t + 1} (θ_{t + 1, i}) ∥^{2}}{∥ \nabla r_{t + 1} (θ_{t + 1, i}) ∥^{2}} .

Here, we normalize the error by the gradient norm $∥ \nabla r_{t + 1} (θ_{t + 1, i}) ∥^{2}$ to rule out the effect of the decreasing gradient norm. We observe a decrease of the forecast error, which demonstrates that the gradient of the future domain can be predicted from the past domains. Besides, the error remains stationary, which provides evidence that modeling the MFGG as a functional time-series model is reasonable.
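As a small illustration of the normalized forecast error $e_{t, i}$ defined above, the following sketch computes it for two gradient vectors. The function and argument names are hypothetical, and the gradients are assumed to be flattened into one-dimensional arrays.

```python
import numpy as np

def normalized_forecast_error(forecast_grad, true_grad):
    """||forecast - true||^2 / ||true||^2 for flattened gradient vectors."""
    forecast_grad = np.asarray(forecast_grad, dtype=float)
    true_grad = np.asarray(true_grad, dtype=float)
    numerator = np.sum((forecast_grad - true_grad) ** 2)
    denominator = np.sum(true_grad ** 2)
    return float(numerator / denominator)
```

A perfect forecast gives 0, while the trivial forecast of an all-zero gradient gives exactly 1, so values below 1 indicate that the forecast carries information about the future gradient.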

Optimizing MFGG with Random Model.

When optimizing MFGG, the loss is calculated based on a model $f_{θ}$ sampled from its training trajectory, so that MFGG focuses on predicting the gradient well for models $f_{θ}$ that have reasonable performance. To show the importance of this design, we also run FGD with MFGG optimized using $f_{θ}$ with randomly initialized $θ$ . We consider the setting of FGD-3 on CriteoTB, use both FM and DeepFM as the recommendation model, and summarize the results in Table 3. The results show that training the MFGG with a random recommendation model degrades the performance.

Model	Buffer	Auc-8 $↑$	Logloss-8 $↓$	Auc-16 $↑$	Logloss-16 $↓$
FM	Rand	$67.08 \pm 0.28$	$15.17 \pm 0.21$	$67.08 \pm 0.41$	$15.28 \pm 0.16$
FM	Traj	$67.20 \pm 0.25$	$15.09 \pm 0.18$	$67.05 \pm 0.38$	$15.10 \pm 0.14$
DeepFM	Rand	$66.83 \pm 0.27$	$14.68 \pm 0.16$	$66.68 \pm 0.41$	$14.66 \pm 0.12$
DeepFM	Traj	$66.92 \pm 0.26$	$14.65 \pm 0.15$	$66.78 \pm 0.37$	$14.62 \pm 0.11$

Table 3: Comparing the performance when MFGG is trained with model sampled from optimization trajectory (Traj) and randomly initialized model (Rand). The setting of the table is the same as that of Table 1.

Computation Overhead.

We compare the wall clock training time of BU and FGD. We consider the DeepFM model on CriteoTB and report in Table 4 the averaged training time at each time $t$ for different values of $b$ . The results show that the proposed FGD introduces only about 15% overhead on average.

Method	BU-2	Meta-2	BU-3	Meta-3	BU-5	Meta-5
Time/min	20.4	24.3	29.7	33.8	47.6	52.2

Table 4: Comparing the wall clock training time of BU and FGD at each round ( $t$ ).
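The relative overhead can be recomputed from the six timings in Table 4. The sketch below assumes that the values pair up as (BU, FGD) for each window size in column order, which is how we read the table. The variable names are illustrative.

```python
# Wall-clock minutes per round, copied from Table 4 in column order.
bu_minutes = [20.4, 29.7, 47.6]
fgd_minutes = [24.3, 33.8, 52.2]

def relative_overhead(bu, fgd):
    """Extra time of FGD relative to BU, one entry per window size."""
    return [(f - u) / u for u, f in zip(bu, fgd)]

overheads = relative_overhead(bu_minutes, fgd_minutes)
mean_overhead = sum(overheads) / len(overheads)
print([f"{100 * o:.1f}%" for o in overheads], f"mean {100 * mean_overhead:.1f}%")
# ['19.1%', '13.8%', '9.7%'] mean 14.2%
```

Under this pairing, the overhead shrinks as the window grows, and its mean is close to the figure of about 15% quoted in the text.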

7 Conclusion

In this paper, we propose future gradient descent (FGD), which forecasts the gradient information of the future domain for training, to address the issue of temporal domain shift in online recommendation systems. We show that, in theory, FGD gives a smaller temporal domain generalization error than a widely adopted algorithm, Batch Update. Empirical evidence is provided to show that FGD outperforms various representative algorithms.

Acknowledgements.

This work is supported by a grant from Meta Inc.

References

Appendix: Future Gradient Descent for Adapting the Temporal Shifting Data Distribution in Online Recommendation Systems

Extra Notation

We introduce several new notations for the appendix. We use $⟨ \cdot, \cdot ⟩$ to denote the inner product between two vectors and use $\circ$ to denote the entrywise product.

Appendix A Proof of Theorem 1

Proof.

We start with a simple decomposition using the triangle inequality:

∥ u_{w, t} (θ_{t}) ∥ \leq ∥ u_{w, t} (θ_{t}) - \bar{m} (θ_{t}; t) ∥ + ∥ \bar{m} (θ_{t}; t) ∥ .

By the termination condition of Algorithm 5, we have $∥ \bar{m} (θ_{t}; t) ∥ \leq δ$ . Furthermore, it follows from (5) that

∥ u_{w, t} (θ_{t}) - \bar{m} (θ_{t}; t) ∥ = \frac{1}{w} ∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; t) ∥ .

Hence, we obtain

∥ u_{w, t} (θ_{t}) ∥^{2} \leq (δ + \frac{1}{w} ∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; t) ∥)^{2} \leq 2 δ^{2} + \frac{2}{w^{2}} ∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; t) ∥^{2} .	(7)

This further implies that

R_{w} (T) = \frac{1}{T} \sum_{t = 1}^{T} ∥ u_{w, t} (θ_{t}) ∥^{2} \leq \frac{2}{w^{2} T} \sum_{t = 1}^{T} ∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; t) ∥^{2} + 2 δ^{2},	(8)

and the main result follows from the fact that $∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; t) ∥^{2} \leq \sup_{θ} ∥ \nabla r_{t} (θ) - m (θ; t) ∥^{2}$ for all $t \in [T]$ . Furthermore, under the boundedness assumption, we have for all $t \in [T]$

∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; t) ∥^{2} \leq (∥ \nabla r_{t} (θ_{t}) ∥ + ∥ m (θ_{t}; t) ∥)^{2} \leq 4 M^{2} .	(9)

Hence, (8) also implies $R_{w} (T) \leq 8 M^{2} / w^{2} + 2 δ^{2}$ , which leads to $R_{w} (T) = O (1 / w^{2})$ when $δ = 1 / w$ . ∎
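For readers who want the elementary steps spelled out, the following worked derivation expands the squaring step in (7) and the final constant. The shorthand $a$ and $c$ is introduced only for this illustration and does not appear in the original proof.

```latex
% Shorthand for this sketch only:
%   a = \delta,  c = \frac{1}{w}\,\| \nabla r_t(\theta_t) - m(\theta_t; t) \|.
\begin{align*}
(a + c)^2 &= a^2 + 2ac + c^2 \le 2a^2 + 2c^2
  && \text{since } 2ac \le a^2 + c^2, \\
R_w(T) &\le \frac{2}{w^2 T} \sum_{t=1}^{T} 4M^2 + 2\delta^2
  = \frac{8M^2}{w^2} + 2\delta^2
  && \text{by (8) and (9)}, \\
R_w(T) &\le \frac{8M^2 + 2}{w^2} = O(1/w^2)
  && \text{for } \delta = 1/w .
\end{align*}
```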

Appendix B Details of the Result in Section 4.4

Algorithm.

Given $θ_{t}$ , define $h_{t} (ϕ) = ∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; ϕ, t) ∥^{2}$ as a function of $ϕ$ , where we view $θ_{t}$ as a constant. Thus, it follows from (8) that

R_{w} (T) \leq \frac{2}{w^{2} T} \sum_{t = 1}^{T} h_{t} (ϕ_{t}) + 2 δ^{2} .	(10)

Thus, our goal is to minimize $\sum_{t = 1}^{T} h_{t} (ϕ_{t})$ in an online manner, since we can only access $h_{t} (ϕ_{t})$ after $ϕ_{t}$ is chosen. To achieve this, we use the classic exponentiated gradient method to update $ϕ_{t}$ . Specifically, for any $ϕ = [a_{1}, \dots, a_{b}] \in S_{b}$ , define the negative potential function $ψ (ϕ) = \sum_{i = 1}^{b} a_{i} \log a_{i}$ and its Bregman divergence

B_{ψ} (ϕ; ϕ') = ψ (ϕ) - ψ (ϕ') - ⟨ \nabla ψ (ϕ'), ϕ - ϕ' ⟩ .

Then $ϕ_{t + 1}$ is given by

ϕ_{t + 1} = \operatorname{argmin}_{ϕ \in S_{b}} (⟨ \nabla h_{t} (ϕ_{t}), ϕ ⟩ + \frac{1}{η_{ϕ}} B_{ψ} (ϕ; ϕ_{t})) = \frac{ϕ_{t} \circ \exp (- η_{ϕ} \nabla h_{t} (ϕ_{t}))}{∥ ϕ_{t} \circ \exp (- η_{ϕ} \nabla h_{t} (ϕ_{t})) ∥_{1}},

where $η_{ϕ}$ is the learning rate. See Section 6.6 in orabona2019modern for the derivation of the last equality. Intuitively, $\frac{1}{η_{ϕ}} B_{ψ} (ϕ; ϕ_{t})$ stabilizes the algorithm by ensuring that $ϕ_{t + 1}$ remains close to $ϕ_{t}$ .
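The closed-form update above is straightforward to implement. The NumPy sketch below is a minimal illustration, with `phi`, `grad` and `lr` standing in for $ϕ_{t}$ , $\nabla h_{t} (ϕ_{t})$ and $η_{ϕ}$ . Subtracting the smallest gradient entry before exponentiating is an implementation choice of this sketch: it avoids overflow and cancels in the normalization.

```python
import numpy as np

def exponentiated_gradient_step(phi, grad, lr):
    """One step phi_{t+1} = phi_t * exp(-lr * grad) / ||phi_t * exp(-lr * grad)||_1."""
    phi = np.asarray(phi, dtype=float)
    grad = np.asarray(grad, dtype=float)
    # Shifting grad by a constant rescales every entry by the same factor,
    # which the L1 normalization removes again.
    weights = phi * np.exp(-lr * (grad - grad.min()))
    return weights / weights.sum()
```

Because the update is multiplicative, the iterate stays on the simplex $S_{b}$ without an explicit projection, and a coordinate with a larger partial derivative loses weight relative to the others.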

This simplified version of FGD is summarized in Algorithm 4. Note that when updating $ϕ$ , we only use the last recommendation model $θ_{t}$ .

Input: The learning rates $η$ and $η_{ϕ}$ for updating the model parameter $θ$ and $ϕ$
Initialize $ϕ_{1} = [1 / b, \dots, 1 / b]$
for $t \in [T]$
    Deploy the prediction model $f_{θ_{t}}$ with the parameter $θ_{t}$ and collect the new dataset $D_{t}$
    Construct the function $h_{t} (ϕ) = ∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; ϕ, t) ∥^{2}$
    $ϕ_{t + 1} = \frac{ϕ_{t} \circ \exp (- η_{ϕ} \nabla h_{t} (ϕ_{t}))}{∥ ϕ_{t} \circ \exp (- η_{ϕ} \nabla h_{t} (ϕ_{t})) ∥_{1}}$ ▹ One step of exponentiated gradient descent from $ϕ_{t}$
    Initialize the model parameter $θ_{t + 1}$
    while $∥ \bar{m} (θ_{t + 1}; ϕ_{t + 1}, t + 1) ∥ \geq δ$
        $θ_{t + 1} = θ_{t + 1} - η \bar{m} (θ_{t + 1}; ϕ_{t + 1}, t + 1)$
    end while
end for

Algorithm 4 Generalized Future Gradient Descent for Smoothed Regret (simplified version for the theoretical study)
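A toy run of the simplified scheme may help to see how the two updates interact. The sketch below is an illustration under several assumptions that are not part of the paper: the losses are the hypothetical quadratics $r_{s} (θ) = \frac{1}{2} ∥ θ - c_{s} ∥^{2}$ , the smoothing window is $w = 1$ so that the smoothed generator is taken to coincide with $m$ , time is 0-indexed, and $θ_{t + 1}$ is warm-started from $θ_{t}$ .

```python
import numpy as np

def simplified_fgd(centers, b=2, eta=0.5, eta_phi=0.1, delta=1e-6, max_iters=10_000):
    """Toy run with r_s(theta) = 0.5 * ||theta - centers[s]||^2 (hypothetical losses)."""
    def grad_r(s, theta):
        # Gradient of the toy loss r_s.
        return theta - centers[s]

    def m(theta, phi, t):
        # m(theta; phi, t) = sum_i a_i * grad r_{t-i}(theta), as in the proof of Lemma 1.
        return sum(a * grad_r(t - i, theta) for i, a in enumerate(phi, start=1))

    T = len(centers)
    phi = np.full(b, 1.0 / b)                # phi_1 = [1/b, ..., 1/b]
    theta = centers[:b].mean(axis=0)         # arbitrary choice of the first deployed model
    history = []
    for t in range(b, T):                    # start once b past domains are available
        # Domain t is observed: one exponentiated-gradient step on h_t.
        residual = grad_r(t, theta) - m(theta, phi, t)
        grad_h = np.array([-2.0 * grad_r(t - i, theta) @ residual for i in range(1, b + 1)])
        phi = phi * np.exp(-eta_phi * (grad_h - grad_h.min()))
        phi /= phi.sum()
        # Train theta_{t+1} with the forecast gradient until its norm drops below delta.
        for _ in range(max_iters):
            g = m(theta, phi, t + 1)
            if np.linalg.norm(g) < delta:
                break
            theta = theta - eta * g
        history.append(theta.copy())
    return phi, np.array(history)
```

When the centers drift linearly, e.g. `centers = np.array([[float(s), 0.0] for s in range(10)])`, the most recent domain is the best available predictor of the next one, and the learned weights shift toward the first coordinate of $ϕ$ .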

Lemma 1.

Suppose that we have $∥ \nabla r_{t} (θ) ∥ \leq M$ for all $θ \in Θ$ and $t$ . Then $∥ \nabla h_{t} (ϕ) ∥_{\infty} \leq 8 M^{2}$ for all $ϕ \in S_{b}$ .

Proof.

By definition, we have

h_{t} (ϕ) = ∥ \nabla r_{t} (θ_{t}) - \sum_{i = 1}^{b} a_{i} \nabla r_{t - i} (θ_{t}) ∥^{2} = ∥ \sum_{i = 1}^{b} a_{i} (\nabla r_{t} (θ_{t}) - \nabla r_{t - i} (θ_{t})) ∥^{2},

where we used the fact that $\sum_{i = 1}^{b} a_{i} = 1$ . Direct computation shows that

$| \frac{\partial h_{t}}{\partial a_{i}} (ϕ) |$	$= 2 | ⟨ \nabla r_{t} (θ_{t}) - \nabla r_{t - i} (θ_{t}), \sum_{j = 1}^{b} a_{j} (\nabla r_{t} (θ_{t}) - \nabla r_{t - j} (θ_{t})) ⟩ |$	(11)
	$\leq 2 ∥ \nabla r_{t} (θ_{t}) - \nabla r_{t - i} (θ_{t}) ∥ ∥ \sum_{j = 1}^{b} a_{j} (\nabla r_{t} (θ_{t}) - \nabla r_{t - j} (θ_{t})) ∥$	(12)
	$\leq 2 (∥ \nabla r_{t} (θ_{t}) ∥ + ∥ \nabla r_{t - i} (θ_{t}) ∥) (\sum_{j = 1}^{b} a_{j} (∥ \nabla r_{t} (θ_{t}) ∥ + ∥ \nabla r_{t - j} (θ_{t}) ∥))$	(13)
	$\leq 8 M^{2},$	(14)

where we used Cauchy-Schwarz inequality in (12), the triangle inequality in (13) and the boundedness of the gradients in (14). Hence, we conclude that $∥ \nabla h_{t} (ϕ) ∥_{\infty} \leq 8 M^{2}$ . ∎
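The bound of Lemma 1 can be sanity-checked numerically. The sketch below evaluates $h_{t}$ in its second form and the partial derivatives from (11), without the absolute value, for given gradient vectors. The array layout (row $i - 1$ of `past_grads` holds $\nabla r_{t - i} (θ_{t})$ ) and the function names are assumptions made for illustration.

```python
import numpy as np

def h_value(g_t, past_grads, phi):
    """h_t(phi) = || sum_i a_i (g_t - g_{t-i}) ||^2."""
    v = phi @ (g_t[None, :] - past_grads)
    return float(v @ v)

def h_partials(g_t, past_grads, phi):
    """Partial derivatives 2 <g_t - g_{t-i}, sum_j a_j (g_t - g_{t-j})> for i = 1..b."""
    diffs = g_t[None, :] - past_grads    # row i-1 holds g_t - g_{t-i}
    v = phi @ diffs                      # the weighted sum inside the norm
    return 2.0 * diffs @ v

def max_partial_vs_bound(g_t, past_grads, phi, M):
    """Return (max_i |dh/da_i|, 8 M^2); the first never exceeds the second
    when every gradient has norm at most M and phi lies on the simplex."""
    return float(np.max(np.abs(h_partials(g_t, past_grads, phi)))), 8.0 * M ** 2
```

For example, drawing gradients with norm at most `M` and a random point `phi` on the simplex, the first returned value should never exceed the second.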

Proof of Theorem 2.

Now we proceed to the proof of Theorem 2. This is a standard result in the online learning literature (see, e.g., orabona2019modern). For completeness, we present the proof below.

Proof.

As $ψ$ is $λ$ -strongly convex with respect to $∥ \cdot ∥_{1}$ on $S_{b}$ with $λ = 1$ , we have

B_{ψ} (ϕ; ϕ') \geq \frac{1}{2} ∥ ϕ - ϕ' ∥_{1}^{2} .	(15)

Throughout the proof, we slightly abuse the notation by writing $η_{ϕ} = η$ and $\nabla h_{t} = \nabla h_{t} (ϕ_{t})$ for simplicity. Notice that by our update rule $ϕ_{t + 1}$ is given by

ϕ_{t + 1} = \operatorname{argmin}_{ϕ \in S_{b}} (η ⟨ \nabla h_{t}, ϕ ⟩ + B_{ψ} (ϕ; ϕ_{t})) .

From the first-order optimality condition, we get for any $ϕ \in S_{b}$ ,

		$⟨ η \nabla h_{t} + \nabla ψ (ϕ_{t + 1}) - \nabla ψ (ϕ_{t}), ϕ_{t + 1} - ϕ ⟩ \leq 0$
	$\Leftrightarrow$	$η ⟨ \nabla h_{t}, ϕ_{t} - ϕ ⟩ \leq η ⟨ \nabla h_{t}, ϕ_{t} - ϕ_{t + 1} ⟩ + ⟨ \nabla ψ (ϕ_{t + 1}) - \nabla ψ (ϕ_{t}), ϕ - ϕ_{t + 1} ⟩$
	$\Leftrightarrow$	$η ⟨ \nabla h_{t}, ϕ_{t} - ϕ ⟩ \leq η ⟨ \nabla h_{t}, ϕ_{t} - ϕ_{t + 1} ⟩ - B_{ψ} (ϕ; ϕ_{t + 1}) + B_{ψ} (ϕ; ϕ_{t}) - B_{ψ} (ϕ_{t + 1}; ϕ_{t}),$

where we used the three-point equality [chen1993convergence] in the last step. Furthermore, by (15),

	$η ⟨ \nabla h_{t}, ϕ_{t} - ϕ_{t + 1} ⟩ - B_{ψ} (ϕ_{t + 1}; ϕ_{t})$	$\leq η ∥ \nabla h_{t} ∥_{\infty} ∥ ϕ_{t} - ϕ_{t + 1} ∥_{1} - \frac{1}{2} ∥ ϕ_{t} - ϕ_{t + 1} ∥_{1}^{2}$
		$\leq \frac{η^{2}}{2} ∥ \nabla h_{t} ∥_{\infty}^{2} + \frac{1}{2} ∥ ϕ_{t} - ϕ_{t + 1} ∥_{1}^{2} - \frac{1}{2} ∥ ϕ_{t} - ϕ_{t + 1} ∥_{1}^{2}$
		$= \frac{η^{2}}{2} ∥ \nabla h_{t} ∥_{\infty}^{2} .$

Combining these two bounds, we have

η ⟨ \nabla h_{t}, ϕ_{t} - ϕ ⟩ \leq B_{ψ} (ϕ; ϕ_{t}) - B_{ψ} (ϕ; ϕ_{t + 1}) + \frac{η^{2}}{2} ∥ \nabla h_{t} ∥_{\infty}^{2} .

Since $h_{t} (ϕ)$ is convex in $ϕ$ , we have $h_{t} (ϕ_{t}) - h_{t} (ϕ) \leq ⟨ \nabla h_{t}, ϕ_{t} - ϕ ⟩$ for any $ϕ \in S_{b}$ . By telescoping, we obtain

	$\sum_{t = 1}^{T} (h_{t} (ϕ_{t}) - h_{t} (ϕ))$	$\leq \sum_{t = 1}^{T} ⟨ \nabla h_{t}, ϕ_{t} - ϕ ⟩$
		$\leq \frac{1}{η} \sum_{t = 1}^{T} [B_{ψ} (ϕ; ϕ_{t}) - B_{ψ} (ϕ; ϕ_{t + 1}) + \frac{η^{2}}{2} ∥ \nabla h_{t} ∥_{\infty}^{2}]$
		$= \frac{1}{η} (B_{ψ} (ϕ; ϕ_{1}) - B_{ψ} (ϕ; ϕ_{T + 1})) + \frac{η}{2} \sum_{t = 1}^{T} ∥ \nabla h_{t} ∥_{\infty}^{2}$
		$\leq \frac{1}{η} \log b + 32 η M^{4} T ,$

where we used Lemma 1, $B_{ψ} (ϕ; ϕ_{T + 1}) \geq 0$ and $B_{ψ} (ϕ; ϕ_{1}) = ψ (ϕ) + \log b \leq \log b$ in the last inequality. Choosing $η = c \sqrt{(\log b) / (T M^{4})}$ with some constant $c > 0$ leads to

\sum_{t = 1}^{T} [h_{t} (ϕ_{t}) - h_{t} (ϕ)] \leq O (M^{2} \sqrt{T \log b}) .	(16)
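As a worked example of the choice of $η$ , one can minimize the bound $\frac{1}{η} \log b + 32 η M^{4} T$ exactly. The derivation below is a short illustration. The function name $g$ is introduced only here, and the resulting constant corresponds to $c = 1 / \sqrt{32}$ .

```latex
% g(\eta) = \frac{\log b}{\eta} + 32\,\eta\,M^4 T, minimized over \eta > 0.
\begin{align*}
g'(\eta) = -\frac{\log b}{\eta^2} + 32 M^4 T = 0
  \;&\Longrightarrow\; \eta^\star = \sqrt{\frac{\log b}{32\, T M^4}}, \\
g(\eta^\star) &= 2 \sqrt{32\, M^4\, T \log b} = 8\sqrt{2}\, M^2 \sqrt{T \log b} .
\end{align*}
```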

Note that (16) holds for any $ϕ \in S_{b}$ . In particular, we can set $ϕ = ϕ^{*}$ defined by $ϕ^{*} = \operatorname{argmin}_{ϕ \in S_{b}} \sum_{t = 1}^{T} h_{t} (ϕ)$ . Therefore,

	$\sum_{t = 1}^{T} h_{t} (ϕ_{t})$	$\leq \sum_{t = 1}^{T} h_{t} (ϕ^{*}) + O (M^{2} \sqrt{T \log b})$
		$= \min_{ϕ \in S_{b}} \sum_{t = 1}^{T} ∥ \nabla r_{t} (θ_{t}) - m (θ_{t}; ϕ, t) ∥^{2} + O (M^{2} \sqrt{T \log b})$
		$\leq \min_{ϕ \in S_{b}} \sum_{t = 1}^{T} \sup_{θ} ∥ \nabla r_{t} (θ) - m (θ; ϕ, t) ∥^{2} + O (M^{2} \sqrt{T \log b}) = \min_{m \in \mathcal{M}} Q [T; m] + O (M^{2} \sqrt{T \log b}) .$

We thus conclude from (10) that

R_{w} (T) \leq \frac{2}{w^{2} T} (\min_{m \in \mathcal{M}} Q [T; m] + O (M^{2} \sqrt{T \log b})) + 2 δ^{2} .

∎

Appendix C A Practical Generalized FGD algorithm.

Input: The learning rates $η$ and $η_{ϕ}$ for updating the model parameter $θ$ and $ϕ$ . The initial trajectory buffer $B$
for $t \in [T]$
    Deploy the prediction model $f_{θ_{t}}$ with parameter $θ_{t}$ . Then collect the new dataset $D_{t}$
    Initialize the parameter of MFGG $ϕ_{t + 1}$ ▹ Initialization of $ϕ_{t + 1}$ is user-specific.
    for inner loop iteration $k \in [K]$ ▹ Update the meta network.
        $ϕ_{t + 1} \leftarrow ϕ_{t + 1} - η_{ϕ} \sum_{θ \in B} \nabla_{ϕ} ∥ m (θ; ϕ_{t + 1}, t) - \nabla r_{t} (θ) ∥^{2}$ ▹ May replace with the mini-batch version.
    end for
    Initialize the trajectory buffer $B = \emptyset$ and model parameter $θ_{t + 1}$ ▹ Initialization scheme of $θ_{t + 1}$ is specified by user.
    while $∥ \bar{m} (θ_{t + 1}; ϕ_{t + 1}, t + 1) ∥ \geq δ$ ▹ Alternatively, we may run gradient descent with a fixed number of iterations.
        $θ_{t + 1} \leftarrow θ_{t + 1} - η m (θ_{t + 1}; ϕ_{t + 1}, t + 1)$ ▹ May replace with the mini-batch version.
        $B \leftarrow B \cup \{θ_{t + 1}\}$ ▹ Alternatively, we may update the trajectory buffer $B$ every few iterations.
    end while
end for

Algorithm 5 Generalized Future Gradient Descent for Smoothed Loss

Compared with FGD in Algorithm 2, we use a smoothed version of MFGG $\bar{m}$ for training, which is due to the consideration of minimizing a smoothed loss in (2). For completeness, we also summarize the practical algorithm of the generalized version of FGD in Algorithm 5.