Future Gradient Descent for Adapting the Temporal Shifting Data Distribution in Online Recommendation Systems

Mao Ye The University of Texas at Austin. Ruichen Jiang The University of Texas at Austin. Haoxiang Wang The University of Illinois at Urbana-Champaign. Dhruv Choudhary Meta. Xiaocong Du Meta. Bhargav Bhushanam Meta. Aryan Mokhtari The University of Texas at Austin. Arun Kejariwal Meta. Qiang Liu The University of Texas at Austin.
Abstract

One of the key challenges of learning an online recommendation model is the temporal domain shift, which causes the mismatch between the training and testing data distribution and hence domain generalization error. To overcome, we propose to learn a meta future gradient generator that forecasts the gradient information of the future data distribution for training so that the recommendation model can be trained as if we were able to look ahead at the future of its deployment. Compared with Batch Update, a widely used paradigm, our theory suggests that the proposed algorithm achieves smaller temporal domain generalization error measured by a gradient variation term in a local regret. We demonstrate the empirical advantage by comparing with various representative baselines.

1 Introduction

The web-scale recommendation system is one of the most important modern machine learning applications that provides personalized content to billions of users from inventories of billions of items. These recommendation models have been rapidly growing in both computation and memory in the past few years due to wider-deeper networks and the use of sparse embedding layers. [ye2020adaptive, he2020lightgcn, zhang2020retrain, peng2021learning] have demonstrated the importance of updating the recommendation periodically (e.g., in a daily/weekly basis) as new data arrives to avoid the model being stale in a domain shifting environment.

Designing such a periodical updating pipeline is non-trivial: the algorithm needs to achieve a good balance of consolidating the long-term memory (ensuring the useful past knowledge is preserved) and capturing short-term tendency, which is valuable for near future prediction [zhang2020retrain, peng2021learning, deng2021deeplight]. Algorithms can be categorized into two groups: 1. The sample-based approaches maintain a reservoir to reuse observed historical examples to preserve long-term memory [diaz2012real]. Several heuristics are developed to select past examples via balancing the prioritizing of recency and forgetting [chen2013terec, wang2018streaming, qiu2020gag, zhao2021stratified]; 2. The model-based approaches maintain the long-term memory by transferring knowledge between the past and the current model checkpoints via knowledge distillation [wang2020practical, xu2020graphsail, mi2020ader] and model fusion [zhang2020retrain, peng2021learning].

In this paper, we provide a novel perspective by framing the problem of learning under shifting domains as a temporal domain generalization problem. We observe that the crux lies in the mismatch between the distribution of the training examples and the distribution of the testing example on which the model is deployed for recommendations. From this perspective, existing approaches mitigate such the crux by the distribution mismatch in an indirect way by training a robust model that is less vulnerable to the shift of domain in the near future by making it a master at both short-term and long-term signals. More precisely, we propose a more direct solution towards the temporal domain generalization problem based on forecasting the future information for training. Consider the ideal case that we are able to access the data distribution in the near future when the model is deployed, simply training the model by gradient descent using examples drawn from the future data distribution should be desirable (see ‘Ideal Update’ in Fig 2). In the real world when such future information is unavailable, we propose to train a meta future gradient generator to forecast the gradient of the future examples so that the recommendation model is trained as if we were able to look ahead at the future (i.e., ‘FGD Update’ in Fig 2). In addition to the sample-based and model-based approach, our method is optimizer-based in that the trainer of the recommendation model is improved.

In theory, we frame the problem as an online learning problem in which the temporal domain generalization error is captured by the gradient variation term [chiang2012online, rakhlin2013online] in a local regret [hazan2017efficient]. We provide a theoretical understanding of why the proposed algorithm improves over batch-update, a widely used training pipeline [wang2020practical] and show that our method is able to achieve a similar regret as that of a fixed meta future gradient generator oracle. Empirically, we compare our approach against several representative sample/model-based approaches and observe considerable performance improvement.

Notation. We denote the integer set by . Moreover, denotes the vector norm, denotes the vector norm, and is the probability simplex set.

Illustration of the temporal domain generalization problem where the distribution of the training set of
Figure 1: Illustration of the temporal domain generalization problem where the distribution of the training set of mismatches the distribution of its test set at time .

2 Problem and Background

Temporal Domain Generalization.

Consider an online classification problem with the feature space and the label space . Our goal is to learn an accurate prediction model parameterized by from a stream of datasets in consecutive rounds. Specifically, at round , we choose a model parameter and deploy our prediction model . Then we observe the dataset with labeled examples drawn from certain data distribution , where are the input features and is the associated label. Thus, for a given loss function , the empirical loss of our prediction model at time is given by . Moreover, we consider the situation where the data distribution (i.e., domain) is gradually changing over time. A natural performance metric for our learning algorithm is the temporal average of the test loss suffered by the prediction model:

(1)

We remark that , which denotes the total number of rounds in the online process, is typically large in practice.

The key challenge is the temporal domain generalization. Indeed, at time we train our prediction model using the observed examples and due to temporal shift of domain, the distribution of the test set does not match the distribution of its training set. Such mismatch of the training and testing domains results in domain generalization error. See Fig 1 for an illustration.

Our formulation is motivated by the online recommendation systems that aim to advertise items to users given user features. The domain is gradually changing because of the flux in the content that gets continuously added/removed from the system [he2014practical, ye2020adaptive]. As the recommendation model needs to be deployed for serving, it is hard to update its parameter in real time [cervantes2018evaluating, wang2020practical, peng2021learning]. The training process is thus discretized in which the model parameter is updated periodically with the hope that it generalizes well in its test domain.

Input: The learning rate for updating the parameter .
for  do
     Deploy the prediction model with parameter .
     Collect the new dataset .
     Initialize .
     while  do
         
     end while
end for
Algorithm 1 Batch and Incremental Update
Comparing (1) the ideal update where the future information
Figure 2: Comparing (1) the ideal update where the future information can be accessed at training time; (2) the batch update; and (3) our proposed approach.
Input: The learning rate , for updating the model parameter and . The initial trajectory buffer .
for  do
     Deploy the prediction model with parameter . Then collect the new dataset .
     Initialize the parameter of MFGG . Initialization of is user-specific.
     for Inner loop iteration  do Update the meta network.
         . May replace with the mini-batch version.
     end for
     Initialize the trajectory buffer and model parameter . Initialization scheme of is specified by user.
     while  do Alternatively, we may run gradient descent with a fixed number of iterations.
         . May replace with the mini-batch version.
          Alternatively, we may update the trajectory buffer every a few iterations.
     end while
end for
Algorithm 2 Future Gradient Descent

Batch and Incremental Update.

Batch Update (BU) [hazan2017efficient, wang2020practical] is a widely used updating pipeline for training the recommendation model in temporally shifting domains. At each time , the model parameters are updated using the gradient of the averaged losses , where is a time window size indicating how many observed data are used. BU with is also named as Incremental Updating (IU). We summarize the pipeline in Algorithm 1, where with is defined as . Also see an illustration of BU with in the second plot of Fig. 2. It is noteworthy that the initialization scheme of for the updating at each time is problem-dependent and user-specified. For example, we can set of if we consider one-pass training setting [zheng2020shadowsync, ye2020adaptive, du2021alternate].

3 Method

Recall that our goal is to learn that gives accurate prediction on (i.e. achieves small ). Think of the ideal world where we were able to access the data at the near future during the training time of , a simple while promising approach is to apply gradient descent using (see the first plot in Fig. 2). In the real case where the future information is no more available, we propose to learn a meta future gradient generator (MFGG) that forecasts given the observed data ; see the third plot in Fig 2.

Architecture of MFGG.

MFGG models as an non-linear functional auto-regressive time series model [bosq2000linear]. It approximates by aggregating the gradient based on the latest losses where the coefficient of the linear combination is a neural network given by the following computation graph.

Here Embd denotes the embedding layer that maps the categorical feature into a continuous embedding space (the continuous feature remains the same in this layer); Self Attention denotes the self attention layer [vaswani2017attention]; MLP denotes the multi-layer perception. MFGG first extracts the domain features over of the last domains) and the self attention then encodes the interaction between the domain features, of which the outcomes are fed into the subsequent layers to calculate the coefficient . The softmax layer is option and regularizes to be in a probability simplex and hence ensures the magnitude of the generated gradient is within a proper range. Suppose unions all the parameters, we denote MFGG as . In practice, we can simply replace with its mini-batch samples , which gives a stochastic gradient version for updating.

Optimization of MFGG.

We use the squared loss for measuring the prediction error of at time . Such error depends on both , the parameter of MFGG and , the parameter of recommendation model used for calculating the gradient. We are more interested in make MFGG accurate at a small subset of the model parameter space in which gives a recommendation model with good performance. We thus only apply the loss on the (sub-sampled) optimization trajectory of , which we denoted as . That is, we learn by apply gradient descent on

Note that here when calculating the gradient of , is viewed as a constant and hence the differentiation of at does not applied. Algorithm 2 summarizes the detailed procedure. Again, a mini-batch version of and can be used during the training of MFGG. In practice at , we don not have enough historical data to compute MFGG, we can simply use IU for training (alternatively, data for offline training can be used instead). Since our approach uses the MFGG to predict the gradient of the loss on the unobserved future data, we name it Future Gradient Descent (FGD).

Extension to a smoothed loss.

In practice, one might be interested in a smoothed version of performance metric as it is observed to be a potentially more robust evaluation metric in practice [he2014practical]. More precisely, consider the loss function

(2)

where is identically zero for . This smoothed loss in (2) uses a sliding window with width over the previous datasets when evaluating. We are mainly interested in the standard metric (1) but when (2) is considered, we can simply generalize FGD by replacing by

when training . Here , is defined 0. We refer readers to Algorithm 5 in Appendix C for the details. In the rest of the paper, we focus on the smoothed version of loss as it is more general.

Before moving forward, we emphasize the difference between the two window sizes and that appear in the BU/FGD and in the definition of (2), respectively. In some sense, corresponds to the number of recently observed datasets used for training the model. While, represents the number of datasets used for testing the model.

4 Theory

In this section, we study the advantage of the proposed FGD over BU and IU theoretically using recent advances in non-convex online learning. Specifically, we show that FGD is able to perform better than BU and IU in terms of the so-called local regret [hazan2017efficient, hallak2021regret], which measures the algorithm’s performance by comparing it with the best one can achieve in hindsight.

4.1 Local Regret

To upper bound the average loss in (1) in a changing environment, one standard approach is to study the average dynamic regret [zinkevich2003online]:

(3)

which uses the global minimum of as a benchmark when evaluating the performance at time . However, in modern recommendation systems the prediction model is given by a deep neural network, and thus the resulting loss function is highly non-convex. This means finding an approximate global minimum of is computationally intractable, making it hopeless to derive any meaningful bound on the average dynamic regret in (3). To remedy this issue, we adopt the notion of local regret proposed by hazan2017efficient. Specifically, given generated by an online learning algorithm, the average local regret is defined as

(4)

Compared with (3), in (4) we evaluate the model parameters in terms of the first-order stationarity, and thus it can be viewed as the non-convex counterpart of the dynamic regret in (3). In particular, a small value of implies a small gradient on average, suggesting that the algorithm achieves near-optimal performance locally in the long run.

More generally, when the smoothed loss (2) is considered, one can use the average -local regret accordingly as in [hazan2017efficient]:

where we evaluate using the smoothed loss function . In the following, we will focus our analysis on , as choosing also covers the standard local regret in (4).

4.2 Regret of Batch Update

In [hazan2017efficient], the authors analyzed the average -local regret for BU. We recall their result below but offer a different interpretation from the domain generalization perspective.

Proposition 1 ([hazan2017efficient, hallak2021regret]).

With the choice of the window size , the -local regret incurred by BU in Algorithm 1 satisfies

where

Furthermore, if for all and , choosing gives , which is minimax optimal.

The previous works [hazan2017efficient, hallak2021regret] are interested in the worst-case guarantee of the BU algorithm, and the result in Proposition 1 only serves as an intermediate result. However, we observe that this regret bound also offers interesting insights from the perspective of domain generalization. To be specific, we can decompose it into two terms:

The optimization error: this is due to the fact that we only seek a -approximate stationary point of the smoothed training loss function at round . It is controllable in the sense that can be made arbitrarily small by running more iterations of gradient descent. Indeed, under standard smoothness assumption on , we can achieve within iterations. The optimization error term thus corresponds to how well we train the recommendation model in each round.

The domain generalization error: this is due to the fact that the the test set for evaluating is different from the training set . It is typically the dominant term in the regret bound and will not vanish even when . In some sense, it captures the level of variability in the data distributions, similar to the gradient variation term in [chiang2012online, rakhlin2013online]. We also note that the domain generalization error decreases w.r.t. . This is because when increases, the overlap between the training set and the test set becomes larger (i.e., the training set and the test set deviate less)111Such overlapping mechanism is the key to defending adversaries in non-convex games and we refer to Section 2.3 in [hazan2017efficient] for more details..

In summary, the optimization error term characterizes how well our model performs on the training set, while the domain generalization error term characterizes how much the test set deviates from the training set.

Comparison with other measure of domain divergence.

In Proposition 1, the domain discrepancy is characterized in terms of the gradient variation (i.e, how much the gradient of the loss functions differs). Some other domain discrepancy measures have also been proposed. Examples include the -divergence [kifer2004detecting] between and defined as and the divergence [ben2010theory] defined as . Overall, the commonly used divergence measures share the general form of , where is a test function parameterized by . The -divergence uses and the divergence first extends the parameter space to the product space and let for any . The gradient variation uses . As we consider the local regret for non-convex problems where the goal is to find a first-order stationary point, using the gradient as the test function is a natural fit.

4.3 The Headroom of Batch Update

In the last section, we see that BU achieves the minimax regret, so at first sight it seems there is no room for further improvement. However, we note that this only implies that BU is optimal in the worst-case sense, i.e., when the future data distribution is completely uncorrelated with the previous ones. This is hardly the case in reality: the drift in the data distribution normally happens in a gradual manner, and the data distribution in the past should be informative of the future. Hence, the natural question is: can we do better than BU in a gradually changing environment?

The discussion after Proposition 1 suggests that the only hope for improvement lies in reducing the domain generalization error . To illustrate the headroom, we start with Meta Gradient Descent (MGD), a ‘helper algorithm’ that extends BU and serves as an intermediate step towards the proposed FGD. Assume that we are given a sequence of gradient generators . Then FGD uses a smoothed gradient generator given by

for updating, yielding Algorithm 3.

Input: The learning rate for updating the parameter .
for  do
     Deploy the prediction model with parameter .
     Collect the new dataset .
     Construct the smoothed gradient generator .
     Initialize .
     while  do
         
     end while
end for
Algorithm 3 Meta Gradient Descent: a helper algorithm

By substituting for , MGD reduces to BU with . Comparing with , the true gradient on the test set, we see that

(5)

suggesting that MGD introduces a general gradient generator as a proxy for , similar to FGD. On the other hand, we note that the gradient generator in MGD is pre-specified, while FGD parametrizes the gradient generator with and optimizes it on the fly.

From this perspective, BU in Algorithm 1 in fact implicitly uses to approximate , which explains why depends on the difference between these two terms. While such design makes sense in the very limited case where the sequence of domains is known to have a period of , it might not be a savvy choice in general. To be specific, one can construct from the observed datasets based on some mapping parameterized by . For instance, such mapping can be given by a deep neural network as described in Section 3. In this way, MGD enables a mechanism that utilizes the past domains more flexibly to predict the future gradient information when it can be forecasted with a more general form.

Theorem 1.

The -local regret incurred by Algorithm 3 satisfies

where . Furthermore, if both and are upper bounded by for all and , we recover the minimax regret when .

Theorem 1 shows that we can greatly improve the regret of BU by reducing the domain generalization error if is properly chosen. Specifically, suppose that —the hypothesis class of —is rich enough to model the dynamic of the data distribution, in the sense that there exists satisfying

Then the domain generalization error of MGD equipped with tends to zero at the rate of , in contrast to being a non-vanishing dominant term in BU. On the other hand, we can still maintain essentially the same regret bound as BU in the worst case, and thus the improvement almost comes for free.

In the following section, we show that it is indeed possible for FGD to achieve a comparable local regret bound as the one given by MGD with the optimal gradient generator in .

4.4 Regret Bound of FGD

To simplify the analysis, we consider the case where the gradient generator at round is given by a linear model:

(6)

where is the parameter. The hypothesis class is thus . This family of FGD algorithm covers the BU algorithm, which corresponds to setting and otherwise. For this toy example, we use the classic exponentiated gradient descent method [kivinen97exponentiated] to update , which ensures that . The detailed algorithm is summarized in Algorithm 4 in Appendix B.

Theorem 2.

Assume that for any , is bounded by . Let be the hypothesis class of given in (6). For any given constant , if we set the learning rate for updating as , the -local regret incurred by Algorithm 4 in Appendix B satisfies

where

Theorem 2 suggests that FGD with optimized MFGG is able to achieve the regret of Algorithm 3 using with excessive error. As is usually large, we can see that the excessive error is small.

5 Related Work

Domain Generalization.

Our problem can be viewed as an extension of the classic domain generalization problem. In short, the classic domain generalization problem that is extensively studied in vision or NLP is one-shot in the sense that it aims to generalize a model to one unseen target domain by training over multiple source domains. In contrast, our problem is -shot, since we have a stream of pairs of target/source domains. The difference between one-shot and -shot can be significant. In the one-shot setting, we are unable to receive feedback on how the model generalizes on the unseen domain and thus the existing algorithms are hence focus on improving the worst-case generalization by learning domain-invariant representation based on methods such as domain feature alignment [li2018domain, guo-etal-2019-towards], causal learning [arjovsky2019invariant, wang2022provable], multi-task learning [carlucci2019domain], meta-learning [balaji2018metareg, li2018learning] and data augmentation [yan2020improve, ilse2021selecting]. In comparison, our algorithm mainly focuses on how to use the feedback in the -shot setting to learn to predict the gradient information of the future unseen domain. While adopting the techniques from the one-shot domain generalization is of interest, the design of those algorithms utilizes a lot of domain knowledge from CV or NLP, making it non-trivial to apply to recommendation systems. We thus leave it for future work.

Continual Learning.

Continual learning is a similar scenario where the goal is to learn an accurate model given a stream of different tasks/domains. Compared with multi-task learning [sener2018multi, crawshaw2020multi, ye2021pareto, wang2021bridging], the key challenge of continual learning is catastrophic forgetting [kirkpatrick2017overcoming]: the model forgets how to solve past tasks after it is exposed to new tasks. Various of types of solutions are proposed, including rehearsal-based methods [lopez2017gradient, aljundi2019gradient, chaudhry2020using], knowledge distillation [rebuffi2017icarl], regularization [kirkpatrick2017overcoming, buzzega2020dark] and architecture adjustment [rusu2016progressive, serra2018overcoming]. Although the learning scenario is similar, a direct application of continual learning methods to our setting might not give a desirable outcome. The reason is that the final goals of the two problems are quite different: continual learning aims to learn the current task without sacrificing the performance of the past learned tasks, while we only focus on performing well in the unobserved future task.

Gradual Domain Adaptation

Gradual domain adaptation (GDA) aims at adapting a model to an unlabeled target domain after being trained on a labeled source domain and a sequence of unlabeled intermediate domains. Despite being similar to the setting of temporal domain generalization, GDA is still different from the latter since there are no labels provided in the intermediate domains for GDA. A modern and common approach for GDA is gradual self-training [kumar2020understanding, wang2022understanding, zhou2022online, dong2022algorithms], which fits a model to the source domain and then adapts the model along the sequence of intermediate domains consecutively with self-training [nigam2000analyzing].

Meta-Learning.

Meta-learning, or learning-to-learn, aims to optimize the training process such that the outcome is improved. Examples of meta-learning includes learning a better initialization [finn2017model, lee2018gradient], optimizer [andrychowicz2016learning, flennerhag2019meta], hyper-parameter [franceschi2018bilevel, chen2019lambdaopt] and network architecture [liu2018darts, wang2022global]. The proposed FGD can be viewed as learning a better optimizer for the temporal domain generalization problems. Meta-learning is also widely deployed in recommendation systems. Examples include solving cold start issue [bharadhwaj2019meta, lee2019melu] through learning initialization and knowledge transferring through model fusion [zhang2020retrain, peng2021learning].

6 Experiment

Method FM DeepFM
Auc-8 Logloss-8 Auc-16 Logloss-16 Auc-8 Logloss-8 Auc-16 Logloss-16
IU
BU-2
SPMF-2
ASMG-2
Meta-2 * * * * * * *
BU-3
SPMF-3
ASMG-3
Meta-3 * * * * * * *
BU-5
SPMF-5
ASMG-5
Meta-5 * * * *
Table 1: Summarized result for CriteoTB. AUC/Logloss-x denotes the resulted based on the last x days examples. The averaged performance over three random seeds with its standard deviation are reported. We mainly compare the algorithm when the same is used and the best approach as bolded. The * denotes that the best result are statistically significant compared with the second best with p value less than 0.95 using matched-pair t-test.

We demonstrate the effectiveness of the proposed FGD.

Dataset.

We consider two datasets CriteoTB and Avazu. CriteoTB has 13 integer feature fields and 26 categorical feature fields with around 800 million categorical tokens in total. It is the 24-day advertising data published by criteo. Training with the original CriteoTB dataset takes huge computational cost and to reduce computational overhead and increase reproducibility, we use a subsampled CriteoTB with 10% of examples are sampled for evaluation. Avazu contains 11 days of clicks/not clicks data from Avazu and all its 22 feature fields are categorical. We preprocess both datasets following guo2017deepfm, liu2020learnable.

Training Protocol.

In real world recommendation systems, passing the examples multiple times for training might cause severe over-fitting issue [zheng2020shadowsync, ye2020adaptive, du2021alternate]. Following zheng2020shadowsync, ye2020adaptive we perform a single pass on the training data in the sense that each training example is only visited once throughout the training. Thus, we set during the model training at time because examples from domain , has been visited for learning . In Algorithm 2, the default scheme trains the recommendation models until the norm of the gradient is smaller than a threshold while in the experiment, we use the alternative strategy in which we train the model with a fixed number of iterations such that all the examples are passed exactly once.

Evaluation Protocol.

As we consider an online learning environment, there is no need to split the dataset to training and testing subset. Instead, at the training time of , the data at the next day is used to evaluate the performance of and hence the domain generalization error is considered. Such evaluation protocol matches the real recommendation systems [ye2020adaptive]. We adopt AUC (Area Under the ROC Curve) and Logloss to measure the performance. For Criteo1TB we evaluate the performance using the last 8 or 16 days and the first 16 or 8 days are considered to be offline training for warm up start. For Avazu, the first 3 days are treated to be offline training and hence only the last 8 days are used for evaluation. The metrics are averaged over all the days that are used for evaluation. For all the experimental settings, we run all the compared approaches 3 times with different random seeds and report the averaged result.

Models and Optimizers.

We consider two representative architectures for recommendation models, FM [rendle2010factorization] and DeepFM [guo2017deepfm]. Following guo2017deepfm, liu2020learnable, we use Adam as our optimizer and tune the learning rate for each compared methods from using the performance of the offline training and the batch size is set to be 1024. For FGD, we add the model at the training trajectory into trajectory buffer every 150/50 iterations for CriteoTB/Avazu. The meta network is trained using SGD with learning rate 0.01 and batch size 20.

Baselines.

For comparison, we consider the following optimization algorithms: Incremental Update (IU) [wang2020practical] that updates the model incrementally only using the newly observed data ; Batch Update (BU-) [wang2020practical] that updates the model using the most recent domains ; Stream-centered Probabilistic Matrix Factorization (SPMF-) [wang2018streaming] in which a reservoir of historical examples are maintained to mix with the new data for current model updating. SPMF- denotes the setting that the example buffers has the same size as the number of examples in days; Adaptive Sequential Model Generation (ASMG-) [peng2021learning] that generates a better serving model from a sequence of most recent historical serving models via a meta generator; Future Gradient Descent (FGD-) is our approach with the recent domains used for training the recommendation models.

Result.

Table 1 and 2 summarized the results for CriteoTB and Avazu, respectively. The proposed FGD out-performs the baselines in most cases. We also observe that increasing improves the performance for most algorithms as more information can be utilized. The performance boost of FGD when increasing is more significant than other approaches. Compared with CriteoTB, FGD is less significantly better in Avazu dataset. We think the reason might be that the domains of different days in Avazu are less different compared with that in CriteoTB.

Method FM DeepFM
Auc Logloss Auc Logloss
IU
BU-2
SPMF-2
ASMG-2
Meta-2
BU-3
SPMF-3
ASMG-3
Meta-3 * *
Table 2: Summarized result for Avazu. The setting of the table is the same as that of Table 1.

Temporal Domain Shift and Forecast Error of MFGG.

Left: evolution of Left: evolution of
Figure 3: Left: evolution of . Right: the normalized forecast error of MFGG in different time and iterations.

To visualize the effect of the temporal domain shift, we plot the gradient norm during the whole training process. We consider FGD-3 in CriteoTB with DeepFM as the recommendation models. In this examples, at each time , the recommendation model is trained with iterations. At time , denote as the parameter at the -th iteration of the training (note that after the training is used to predict examples in ). We visualize the evolution of the gradient norm of the future domain in a chronological order (i.e., ) in the left subfigure of Fig 3. Overall, is decreasing suggesting the improving performance but significant fluctuation of is also observed: when we shift from to , will suddenly increase demonstrating a considerable deviation between the adjacent domains. We also visualize the (normalized) forecast error of MFGG in the right subfigure of Fig 3

Here, we normalize the error by the gradient norm to rule out the effect of the decrease of gradient norm. We observe a decrease of the forecast error demonstrating that the gradient of future domain can be predicted using the past domains. Besides, the error remains stationary which provides evidence that the modeling the MFGG as a functional time-series model is reasonable.

Optimizing MFGG with Random Model. When optimizing MFGG, the loss is calculated based on a model sampled from its training trajectory so that we make MFGG focus on giving good prediction on the gradient of that has reasonable performance. To show the importance of such design, we also run FGD in which MFGG is optimized using with randomly initialized. We consider the setting of FGD-3 in CriteoTB and use both FM and DeepFM as recommendation model and summarize the result in Table 3. It can be shown that train the MFGG with random recommendation model degenrates the performance.

Buffer Method Auc-8 Logloss-8 Auc-16 Logloss-16
FM Rand
Traj
DeepFM Rand
Traj
Table 3: Comparing the performance when MFGG is trained with model sampled from optimization trajectory (Traj) and randomly initialized model (Rand). The setting of the table is the same as that of Table 1.

Computation Overhead.

We compare the wall clock training time of BU and FGD. We consider the DeepFM model in CriteoTB and report the averaged training time with different at each time in Table 4. It can be shown that the proposed FGD introduces only about 15% overhead.

Time/min BU-2 Meta-3 BU-3 Meta-3 BU-3 Meta-23
20.4 24.3 29.7 33.8 47.6 52.2
Table 4: Comparing the wall clock training time of BU and FGD at each round ().

7 Conclusion

In this paper, we propose future gradient descent (FGD) that forecasts the gradient information of the future domain for training to address the issue of temporal domain shift in online recommendation systems. We show that FGD gives smaller temporal domain generalization in theory compared with a widely adopted algorithm, Batch Update. Empirical evidence is provided to show that FGD outperforms various representatives algorithms.

Acknowledgements.
This work is supported by grant from Meta Inc.

References

Appendix: Future Gradient Descent for Adapting the Temporal Shifting Data Distribution in Online Recommendation Systems

Extra Notation

We introduce several new notations for the appendix. We use to denote the inner product between two vectors and use to denote the entrywise product.

Appendix A Proof of Theorem 1

Proof.

We start with a simple decomposition using the triangle inequality:

By the termination condition of Algorithm 5, we have . Furthermore, it follows from (5) that

Hence, we obtain

(7)

This further implies that

(8)

and the main result follows from the fact that for all . Furthermore, under the boundedness assumption, we have for all

(9)

Hence, (8) also implies , which leads to when . ∎

Appendix B Details of the Result in Section 4.4

Algorithm.

Given , define as a function of , where we view as a constant. Thus, if follows from that (8) that

(10)

Thus, our goal is to minimize in an online manner, since we can only access after is chosen. To achieve this, we use the classic exponentiated gradient method to update . Specifically, for any , define the negative potential function and its Bregman divergence

Then is given by

where is the learning rate. See Section 6.6 in orabona2019modern for the derivation of the last equality. Intuitively, stabilizes the algorithm by ensuring that remains close to .

This simplified version of FGD is summarized in Algorithm 4. Note that when updating , we only use the last recommendation model .

Input: The learning rate , for updating the model parameter and .
Initialize .
for  do
     Deploy the prediction model with the parameter and collect the new dataset .
     Construct the function
     . One step of Exponentiated gradient descent from
     Initialize the model parameter .
     while  do
         .
     end while
end for
Algorithm 4 Generalized Future Gradient Descent for Smoothed Regret (simplified version for the theoretical study)
Lemma 1.

Suppose that we have for all and . Then for all .

Proof.

By definition, we have

where we used the fact that . Direct computation shows that

(11)
(12)
(13)
(14)

where we used Cauchy-Schwarz inequality in (12), the triangle inequality in (13) and the boundedness of the gradients in (14). Hence, we conclude that . ∎

Proof of Theorem 2.

Now we proceed to the proof of Theorem 2. This is a standard result in the online learning literature (see, e.g., orabona2019modern). For completeness, we present the proof below.

Proof.

As is -strongly convex with , we have

(15)

Throughout the proof, we slightly abuse the notation by writing and for simplicity. Notice that by our update rule is given by

From the first-order optimality condition, we get for any ,

where we used the three-point equality [chen1993convergence] in the last inequality. Furthermore,

Combining these two bounds, we have

Since is convex in , we have for any . By telescoping, we obtain

where we used Lemma 14, and in the last inequality. Choosing with some constant leads to

(16)

Note that (16) holds for any . In particular, we can set defined by . Therefore,

We thus conclude from (10) that

Appendix C A Practical Generalized FGD algorithm.

Input: The learning rate , for updating the model parameter and . The initial trajectory buffer .
for  do
     Deploy the prediction model with parameter . Then collect the new dataset .
     Initialize the parameter of MFGG . Initialization of is user-specific.
     for Inner loop iteration  do Update the meta network.
         . May replace with the mini-batch version.
     end for
     Initialize the trajectory buffer and model parameter . Initialization scheme of is specified by user.
     while  do Alternatively, we may run gradient descent with a fixed number of iterations.
         . May replace with the mini-batch version.
          Alternatively, we may update the trajectory buffer every a few iterations.
     end while
end for
Algorithm 5 Generalized Future Gradient Descent for Smoothed Loss

Compared with FGD in Algorithm 2, we use a smoothed version of MFGG for training, which is due to the consideration of minimizing a smoothed loss in (2). For completeness, we also summarize the practical algorithm of the generalized version of FGD in Algorithm 5.