Future Gradient Descent for Adapting the Temporal Shifting Data Distribution in Online Recommendation Systems
Abstract
One of the key challenges of learning an online recommendation model is the temporal domain shift, which causes the mismatch between the training and testing data distribution and hence domain generalization error. To overcome, we propose to learn a meta future gradient generator that forecasts the gradient information of the future data distribution for training so that the recommendation model can be trained as if we were able to look ahead at the future of its deployment. Compared with Batch Update, a widely used paradigm, our theory suggests that the proposed algorithm achieves smaller temporal domain generalization error measured by a gradient variation term in a local regret. We demonstrate the empirical advantage by comparing with various representative baselines.
1 Introduction
The web-scale recommendation system is one of the most important modern machine learning applications that provides personalized content to billions of users from inventories of billions of items. These recommendation models have been rapidly growing in both computation and memory in the past few years due to wider-deeper networks and the use of sparse embedding layers. [ye2020adaptive, he2020lightgcn, zhang2020retrain, peng2021learning] have demonstrated the importance of updating the recommendation periodically (e.g., in a daily/weekly basis) as new data arrives to avoid the model being stale in a domain shifting environment.
Designing such a periodical updating pipeline is non-trivial: the algorithm needs to achieve a good balance of consolidating the long-term memory (ensuring the useful past knowledge is preserved) and capturing short-term tendency, which is valuable for near future prediction [zhang2020retrain, peng2021learning, deng2021deeplight]. Algorithms can be categorized into two groups: 1. The sample-based approaches maintain a reservoir to reuse observed historical examples to preserve long-term memory [diaz2012real]. Several heuristics are developed to select past examples via balancing the prioritizing of recency and forgetting [chen2013terec, wang2018streaming, qiu2020gag, zhao2021stratified]; 2. The model-based approaches maintain the long-term memory by transferring knowledge between the past and the current model checkpoints via knowledge distillation [wang2020practical, xu2020graphsail, mi2020ader] and model fusion [zhang2020retrain, peng2021learning].
In this paper, we provide a novel perspective by framing the problem of learning under shifting domains as a temporal domain generalization problem. We observe that the crux lies in the mismatch between the distribution of the training examples and the distribution of the testing example on which the model is deployed for recommendations. From this perspective, existing approaches mitigate such the crux by the distribution mismatch in an indirect way by training a robust model that is less vulnerable to the shift of domain in the near future by making it a master at both short-term and long-term signals. More precisely, we propose a more direct solution towards the temporal domain generalization problem based on forecasting the future information for training. Consider the ideal case that we are able to access the data distribution in the near future when the model is deployed, simply training the model by gradient descent using examples drawn from the future data distribution should be desirable (see ‘Ideal Update’ in Fig 2). In the real world when such future information is unavailable, we propose to train a meta future gradient generator to forecast the gradient of the future examples so that the recommendation model is trained as if we were able to look ahead at the future (i.e., ‘FGD Update’ in Fig 2). In addition to the sample-based and model-based approach, our method is optimizer-based in that the trainer of the recommendation model is improved.
In theory, we frame the problem as an online learning problem in which the temporal domain generalization error is captured by the gradient variation term [chiang2012online, rakhlin2013online] in a local regret [hazan2017efficient]. We provide a theoretical understanding of why the proposed algorithm improves over batch-update, a widely used training pipeline [wang2020practical] and show that our method is able to achieve a similar regret as that of a fixed meta future gradient generator oracle. Empirically, we compare our approach against several representative sample/model-based approaches and observe considerable performance improvement.
Notation. We denote the integer set by . Moreover, denotes the vector norm, denotes the vector norm, and is the probability simplex set.
2 Problem and Background
Temporal Domain Generalization.
Consider an online classification problem with the feature space and the label space . Our goal is to learn an accurate prediction model parameterized by from a stream of datasets in consecutive rounds. Specifically, at round , we choose a model parameter and deploy our prediction model . Then we observe the dataset with labeled examples drawn from certain data distribution , where are the input features and is the associated label. Thus, for a given loss function , the empirical loss of our prediction model at time is given by . Moreover, we consider the situation where the data distribution (i.e., domain) is gradually changing over time. A natural performance metric for our learning algorithm is the temporal average of the test loss suffered by the prediction model:
(1) |
We remark that , which denotes the total number of rounds in the online process, is typically large in practice.
The key challenge is the temporal domain generalization. Indeed, at time we train our prediction model using the observed examples and due to temporal shift of domain, the distribution of the test set does not match the distribution of its training set. Such mismatch of the training and testing domains results in domain generalization error. See Fig 1 for an illustration.
Our formulation is motivated by the online recommendation systems that aim to advertise items to users given user features. The domain is gradually changing because of the flux in the content that gets continuously added/removed from the system [he2014practical, ye2020adaptive]. As the recommendation model needs to be deployed for serving, it is hard to update its parameter in real time [cervantes2018evaluating, wang2020practical, peng2021learning]. The training process is thus discretized in which the model parameter is updated periodically with the hope that it generalizes well in its test domain.
Batch and Incremental Update.
Batch Update (BU) [hazan2017efficient, wang2020practical] is a widely used updating pipeline for training the recommendation model in temporally shifting domains. At each time , the model parameters are updated using the gradient of the averaged losses , where is a time window size indicating how many observed data are used. BU with is also named as Incremental Updating (IU). We summarize the pipeline in Algorithm 1, where with is defined as . Also see an illustration of BU with in the second plot of Fig. 2. It is noteworthy that the initialization scheme of for the updating at each time is problem-dependent and user-specified. For example, we can set of if we consider one-pass training setting [zheng2020shadowsync, ye2020adaptive, du2021alternate].
3 Method
Recall that our goal is to learn that gives accurate prediction on (i.e. achieves small ). Think of the ideal world where we were able to access the data at the near future during the training time of , a simple while promising approach is to apply gradient descent using (see the first plot in Fig. 2). In the real case where the future information is no more available, we propose to learn a meta future gradient generator (MFGG) that forecasts given the observed data ; see the third plot in Fig 2.
Architecture of MFGG.
MFGG models as an non-linear functional auto-regressive time series model [bosq2000linear]. It approximates by aggregating the gradient based on the latest losses where the coefficient of the linear combination is a neural network given by the following computation graph.
Here Embd denotes the embedding layer that maps the categorical feature into a continuous embedding space (the continuous feature remains the same in this layer); Self Attention denotes the self attention layer [vaswani2017attention]; MLP denotes the multi-layer perception. MFGG first extracts the domain features over of the last domains) and the self attention then encodes the interaction between the domain features, of which the outcomes are fed into the subsequent layers to calculate the coefficient . The softmax layer is option and regularizes to be in a probability simplex and hence ensures the magnitude of the generated gradient is within a proper range. Suppose unions all the parameters, we denote MFGG as . In practice, we can simply replace with its mini-batch samples , which gives a stochastic gradient version for updating.
Optimization of MFGG.
We use the squared loss for measuring the prediction error of at time . Such error depends on both , the parameter of MFGG and , the parameter of recommendation model used for calculating the gradient. We are more interested in make MFGG accurate at a small subset of the model parameter space in which gives a recommendation model with good performance. We thus only apply the loss on the (sub-sampled) optimization trajectory of , which we denoted as . That is, we learn by apply gradient descent on
Note that here when calculating the gradient of , is viewed as a constant and hence the differentiation of at does not applied. Algorithm 2 summarizes the detailed procedure. Again, a mini-batch version of and can be used during the training of MFGG. In practice at , we don not have enough historical data to compute MFGG, we can simply use IU for training (alternatively, data for offline training can be used instead). Since our approach uses the MFGG to predict the gradient of the loss on the unobserved future data, we name it Future Gradient Descent (FGD).
Extension to a smoothed loss.
In practice, one might be interested in a smoothed version of performance metric as it is observed to be a potentially more robust evaluation metric in practice [he2014practical]. More precisely, consider the loss function
(2) |
where is identically zero for . This smoothed loss in (2) uses a sliding window with width over the previous datasets when evaluating. We are mainly interested in the standard metric (1) but when (2) is considered, we can simply generalize FGD by replacing by
when training . Here , is defined 0. We refer readers to Algorithm 5 in Appendix C for the details. In the rest of the paper, we focus on the smoothed version of loss as it is more general.
Before moving forward, we emphasize the difference between the two window sizes and that appear in the BU/FGD and in the definition of (2), respectively. In some sense, corresponds to the number of recently observed datasets used for training the model. While, represents the number of datasets used for testing the model.
4 Theory
In this section, we study the advantage of the proposed FGD over BU and IU theoretically using recent advances in non-convex online learning. Specifically, we show that FGD is able to perform better than BU and IU in terms of the so-called local regret [hazan2017efficient, hallak2021regret], which measures the algorithm’s performance by comparing it with the best one can achieve in hindsight.
4.1 Local Regret
To upper bound the average loss in (1) in a changing environment, one standard approach is to study the average dynamic regret [zinkevich2003online]:
(3) |
which uses the global minimum of as a benchmark when evaluating the performance at time . However, in modern recommendation systems the prediction model is given by a deep neural network, and thus the resulting loss function is highly non-convex. This means finding an approximate global minimum of is computationally intractable, making it hopeless to derive any meaningful bound on the average dynamic regret in (3). To remedy this issue, we adopt the notion of local regret proposed by hazan2017efficient. Specifically, given generated by an online learning algorithm, the average local regret is defined as
(4) |
Compared with (3), in (4) we evaluate the model parameters in terms of the first-order stationarity, and thus it can be viewed as the non-convex counterpart of the dynamic regret in (3). In particular, a small value of implies a small gradient on average, suggesting that the algorithm achieves near-optimal performance locally in the long run.
4.2 Regret of Batch Update
In [hazan2017efficient], the authors analyzed the average -local regret for BU. We recall their result below but offer a different interpretation from the domain generalization perspective.
Proposition 1 ([hazan2017efficient, hallak2021regret]).
With the choice of the window size , the -local regret incurred by BU in Algorithm 1 satisfies
where |
Furthermore, if for all and , choosing gives , which is minimax optimal.
The previous works [hazan2017efficient, hallak2021regret] are interested in the worst-case guarantee of the BU algorithm, and the result in Proposition 1 only serves as an intermediate result. However, we observe that this regret bound also offers interesting insights from the perspective of domain generalization. To be specific, we can decompose it into two terms:
The optimization error: this is due to the fact that we only seek a -approximate stationary point of the smoothed training loss function at round . It is controllable in the sense that can be made arbitrarily small by running more iterations of gradient descent. Indeed, under standard smoothness assumption on , we can achieve within iterations. The optimization error term thus corresponds to how well we train the recommendation model in each round.
The domain generalization error: this is due to the fact that the the test set for evaluating is different from the training set . It is typically the dominant term in the regret bound and will not vanish even when . In some sense, it captures the level of variability in the data distributions, similar to the gradient variation term in [chiang2012online, rakhlin2013online]. We also note that the domain generalization error decreases w.r.t. . This is because when increases, the overlap between the training set and the test set becomes larger (i.e., the training set and the test set deviate less)111Such overlapping mechanism is the key to defending adversaries in non-convex games and we refer to Section 2.3 in [hazan2017efficient] for more details..
In summary, the optimization error term characterizes how well our model performs on the training set, while the domain generalization error term characterizes how much the test set deviates from the training set.
Comparison with other measure of domain divergence.
In Proposition 1, the domain discrepancy is characterized in terms of the gradient variation (i.e, how much the gradient of the loss functions differs). Some other domain discrepancy measures have also been proposed. Examples include the -divergence [kifer2004detecting] between and defined as and the divergence [ben2010theory] defined as . Overall, the commonly used divergence measures share the general form of , where is a test function parameterized by . The -divergence uses and the divergence first extends the parameter space to the product space and let for any . The gradient variation uses . As we consider the local regret for non-convex problems where the goal is to find a first-order stationary point, using the gradient as the test function is a natural fit.
4.3 The Headroom of Batch Update
In the last section, we see that BU achieves the minimax regret, so at first sight it seems there is no room for further improvement. However, we note that this only implies that BU is optimal in the worst-case sense, i.e., when the future data distribution is completely uncorrelated with the previous ones. This is hardly the case in reality: the drift in the data distribution normally happens in a gradual manner, and the data distribution in the past should be informative of the future. Hence, the natural question is: can we do better than BU in a gradually changing environment?
The discussion after Proposition 1 suggests that the only hope for improvement lies in reducing the domain generalization error . To illustrate the headroom, we start with Meta Gradient Descent (MGD), a ‘helper algorithm’ that extends BU and serves as an intermediate step towards the proposed FGD. Assume that we are given a sequence of gradient generators . Then FGD uses a smoothed gradient generator given by
for updating, yielding Algorithm 3.
By substituting for , MGD reduces to BU with . Comparing with , the true gradient on the test set, we see that
(5) |
suggesting that MGD introduces a general gradient generator as a proxy for , similar to FGD. On the other hand, we note that the gradient generator in MGD is pre-specified, while FGD parametrizes the gradient generator with and optimizes it on the fly.
From this perspective, BU in Algorithm 1 in fact implicitly uses to approximate , which explains why depends on the difference between these two terms. While such design makes sense in the very limited case where the sequence of domains is known to have a period of , it might not be a savvy choice in general. To be specific, one can construct from the observed datasets based on some mapping parameterized by . For instance, such mapping can be given by a deep neural network as described in Section 3. In this way, MGD enables a mechanism that utilizes the past domains more flexibly to predict the future gradient information when it can be forecasted with a more general form.
Theorem 1.
The -local regret incurred by Algorithm 3 satisfies
where . Furthermore, if both and are upper bounded by for all and , we recover the minimax regret when .
Theorem 1 shows that we can greatly improve the regret of BU by reducing the domain generalization error if is properly chosen. Specifically, suppose that —the hypothesis class of —is rich enough to model the dynamic of the data distribution, in the sense that there exists satisfying
Then the domain generalization error of MGD equipped with tends to zero at the rate of , in contrast to being a non-vanishing dominant term in BU. On the other hand, we can still maintain essentially the same regret bound as BU in the worst case, and thus the improvement almost comes for free.
In the following section, we show that it is indeed possible for FGD to achieve a comparable local regret bound as the one given by MGD with the optimal gradient generator in .
4.4 Regret Bound of FGD
To simplify the analysis, we consider the case where the gradient generator at round is given by a linear model:
(6) |
where is the parameter. The hypothesis class is thus . This family of FGD algorithm covers the BU algorithm, which corresponds to setting and otherwise. For this toy example, we use the classic exponentiated gradient descent method [kivinen97exponentiated] to update , which ensures that . The detailed algorithm is summarized in Algorithm 4 in Appendix B.
Theorem 2.
5 Related Work
Domain Generalization.
Our problem can be viewed as an extension of the classic domain generalization problem. In short, the classic domain generalization problem that is extensively studied in vision or NLP is one-shot in the sense that it aims to generalize a model to one unseen target domain by training over multiple source domains. In contrast, our problem is -shot, since we have a stream of pairs of target/source domains. The difference between one-shot and -shot can be significant. In the one-shot setting, we are unable to receive feedback on how the model generalizes on the unseen domain and thus the existing algorithms are hence focus on improving the worst-case generalization by learning domain-invariant representation based on methods such as domain feature alignment [li2018domain, guo-etal-2019-towards], causal learning [arjovsky2019invariant, wang2022provable], multi-task learning [carlucci2019domain], meta-learning [balaji2018metareg, li2018learning] and data augmentation [yan2020improve, ilse2021selecting]. In comparison, our algorithm mainly focuses on how to use the feedback in the -shot setting to learn to predict the gradient information of the future unseen domain. While adopting the techniques from the one-shot domain generalization is of interest, the design of those algorithms utilizes a lot of domain knowledge from CV or NLP, making it non-trivial to apply to recommendation systems. We thus leave it for future work.
Continual Learning.
Continual learning is a similar scenario where the goal is to learn an accurate model given a stream of different tasks/domains. Compared with multi-task learning [sener2018multi, crawshaw2020multi, ye2021pareto, wang2021bridging], the key challenge of continual learning is catastrophic forgetting [kirkpatrick2017overcoming]: the model forgets how to solve past tasks after it is exposed to new tasks. Various of types of solutions are proposed, including rehearsal-based methods [lopez2017gradient, aljundi2019gradient, chaudhry2020using], knowledge distillation [rebuffi2017icarl], regularization [kirkpatrick2017overcoming, buzzega2020dark] and architecture adjustment [rusu2016progressive, serra2018overcoming]. Although the learning scenario is similar, a direct application of continual learning methods to our setting might not give a desirable outcome. The reason is that the final goals of the two problems are quite different: continual learning aims to learn the current task without sacrificing the performance of the past learned tasks, while we only focus on performing well in the unobserved future task.
Gradual Domain Adaptation
Gradual domain adaptation (GDA) aims at adapting a model to an unlabeled target domain after being trained on a labeled source domain and a sequence of unlabeled intermediate domains. Despite being similar to the setting of temporal domain generalization, GDA is still different from the latter since there are no labels provided in the intermediate domains for GDA. A modern and common approach for GDA is gradual self-training [kumar2020understanding, wang2022understanding, zhou2022online, dong2022algorithms], which fits a model to the source domain and then adapts the model along the sequence of intermediate domains consecutively with self-training [nigam2000analyzing].
Meta-Learning.
Meta-learning, or learning-to-learn, aims to optimize the training process such that the outcome is improved. Examples of meta-learning includes learning a better initialization [finn2017model, lee2018gradient], optimizer [andrychowicz2016learning, flennerhag2019meta], hyper-parameter [franceschi2018bilevel, chen2019lambdaopt] and network architecture [liu2018darts, wang2022global]. The proposed FGD can be viewed as learning a better optimizer for the temporal domain generalization problems. Meta-learning is also widely deployed in recommendation systems. Examples include solving cold start issue [bharadhwaj2019meta, lee2019melu] through learning initialization and knowledge transferring through model fusion [zhang2020retrain, peng2021learning].
6 Experiment
Method | FM | DeepFM | ||||||
---|---|---|---|---|---|---|---|---|
Auc-8 | Logloss-8 | Auc-16 | Logloss-16 | Auc-8 | Logloss-8 | Auc-16 | Logloss-16 | |
IU | ||||||||
BU-2 | ||||||||
SPMF-2 | ||||||||
ASMG-2 | ||||||||
Meta-2 | * | * | * | * | * | * | * | |
BU-3 | ||||||||
SPMF-3 | ||||||||
ASMG-3 | ||||||||
Meta-3 | * | * | * | * | * | * | * | |
BU-5 | ||||||||
SPMF-5 | ||||||||
ASMG-5 | ||||||||
Meta-5 | * | * | * | * |
We demonstrate the effectiveness of the proposed FGD.
Dataset.
We consider two datasets CriteoTB and Avazu. CriteoTB has 13 integer feature fields and 26 categorical feature fields with around 800 million categorical tokens in total. It is the 24-day advertising data published by criteo. Training with the original CriteoTB dataset takes huge computational cost and to reduce computational overhead and increase reproducibility, we use a subsampled CriteoTB with 10% of examples are sampled for evaluation. Avazu contains 11 days of clicks/not clicks data from Avazu and all its 22 feature fields are categorical. We preprocess both datasets following guo2017deepfm, liu2020learnable.
Training Protocol.
In real world recommendation systems, passing the examples multiple times for training might cause severe over-fitting issue [zheng2020shadowsync, ye2020adaptive, du2021alternate]. Following zheng2020shadowsync, ye2020adaptive we perform a single pass on the training data in the sense that each training example is only visited once throughout the training. Thus, we set during the model training at time because examples from domain , has been visited for learning . In Algorithm 2, the default scheme trains the recommendation models until the norm of the gradient is smaller than a threshold while in the experiment, we use the alternative strategy in which we train the model with a fixed number of iterations such that all the examples are passed exactly once.
Evaluation Protocol.
As we consider an online learning environment, there is no need to split the dataset to training and testing subset. Instead, at the training time of , the data at the next day is used to evaluate the performance of and hence the domain generalization error is considered. Such evaluation protocol matches the real recommendation systems [ye2020adaptive]. We adopt AUC (Area Under the ROC Curve) and Logloss to measure the performance. For Criteo1TB we evaluate the performance using the last 8 or 16 days and the first 16 or 8 days are considered to be offline training for warm up start. For Avazu, the first 3 days are treated to be offline training and hence only the last 8 days are used for evaluation. The metrics are averaged over all the days that are used for evaluation. For all the experimental settings, we run all the compared approaches 3 times with different random seeds and report the averaged result.
Models and Optimizers.
We consider two representative architectures for recommendation models, FM [rendle2010factorization] and DeepFM [guo2017deepfm]. Following guo2017deepfm, liu2020learnable, we use Adam as our optimizer and tune the learning rate for each compared methods from using the performance of the offline training and the batch size is set to be 1024. For FGD, we add the model at the training trajectory into trajectory buffer every 150/50 iterations for CriteoTB/Avazu. The meta network is trained using SGD with learning rate 0.01 and batch size 20.
Baselines.
For comparison, we consider the following optimization algorithms: Incremental Update (IU) [wang2020practical] that updates the model incrementally only using the newly observed data ; Batch Update (BU-) [wang2020practical] that updates the model using the most recent domains ; Stream-centered Probabilistic Matrix Factorization (SPMF-) [wang2018streaming] in which a reservoir of historical examples are maintained to mix with the new data for current model updating. SPMF- denotes the setting that the example buffers has the same size as the number of examples in days; Adaptive Sequential Model Generation (ASMG-) [peng2021learning] that generates a better serving model from a sequence of most recent historical serving models via a meta generator; Future Gradient Descent (FGD-) is our approach with the recent domains used for training the recommendation models.
Result.
Table 1 and 2 summarized the results for CriteoTB and Avazu, respectively. The proposed FGD out-performs the baselines in most cases. We also observe that increasing improves the performance for most algorithms as more information can be utilized. The performance boost of FGD when increasing is more significant than other approaches. Compared with CriteoTB, FGD is less significantly better in Avazu dataset. We think the reason might be that the domains of different days in Avazu are less different compared with that in CriteoTB.
Method | FM | DeepFM | ||
---|---|---|---|---|
Auc | Logloss | Auc | Logloss | |
IU | ||||
BU-2 | ||||
SPMF-2 | ||||
ASMG-2 | ||||
Meta-2 | ||||
BU-3 | ||||
SPMF-3 | ||||
ASMG-3 | ||||
Meta-3 | * | * |
Temporal Domain Shift and Forecast Error of MFGG.
To visualize the effect of the temporal domain shift, we plot the gradient norm during the whole training process. We consider FGD-3 in CriteoTB with DeepFM as the recommendation models. In this examples, at each time , the recommendation model is trained with iterations. At time , denote as the parameter at the -th iteration of the training (note that after the training is used to predict examples in ). We visualize the evolution of the gradient norm of the future domain in a chronological order (i.e., ) in the left subfigure of Fig 3. Overall, is decreasing suggesting the improving performance but significant fluctuation of is also observed: when we shift from to , will suddenly increase demonstrating a considerable deviation between the adjacent domains. We also visualize the (normalized) forecast error of MFGG in the right subfigure of Fig 3
Here, we normalize the error by the gradient norm to rule out the effect of the decrease of gradient norm. We observe a decrease of the forecast error demonstrating that the gradient of future domain can be predicted using the past domains. Besides, the error remains stationary which provides evidence that the modeling the MFGG as a functional time-series model is reasonable.
Optimizing MFGG with Random Model. When optimizing MFGG, the loss is calculated based on a model sampled from its training trajectory so that we make MFGG focus on giving good prediction on the gradient of that has reasonable performance. To show the importance of such design, we also run FGD in which MFGG is optimized using with randomly initialized. We consider the setting of FGD-3 in CriteoTB and use both FM and DeepFM as recommendation model and summarize the result in Table 3. It can be shown that train the MFGG with random recommendation model degenrates the performance.
Buffer | Method | Auc-8 | Logloss-8 | Auc-16 | Logloss-16 |
---|---|---|---|---|---|
FM | Rand | ||||
Traj | |||||
DeepFM | Rand | ||||
Traj |
Computation Overhead.
We compare the wall clock training time of BU and FGD. We consider the DeepFM model in CriteoTB and report the averaged training time with different at each time in Table 4. It can be shown that the proposed FGD introduces only about 15% overhead.
Time/min | BU-2 | Meta-3 | BU-3 | Meta-3 | BU-3 | Meta-23 |
---|---|---|---|---|---|---|
20.4 | 24.3 | 29.7 | 33.8 | 47.6 | 52.2 |
7 Conclusion
In this paper, we propose future gradient descent (FGD) that forecasts the gradient information of the future domain for training to address the issue of temporal domain shift in online recommendation systems. We show that FGD gives smaller temporal domain generalization in theory compared with a widely adopted algorithm, Batch Update. Empirical evidence is provided to show that FGD outperforms various representatives algorithms.
Acknowledgements.
This work is supported by grant from Meta Inc.References
Appendix: Future Gradient Descent for Adapting the Temporal Shifting Data Distribution in Online Recommendation Systems
Extra Notation
We introduce several new notations for the appendix. We use to denote the inner product between two vectors and use to denote the entrywise product.
Appendix A Proof of Theorem 1
Proof.
We start with a simple decomposition using the triangle inequality:
By the termination condition of Algorithm 5, we have . Furthermore, it follows from (5) that
Hence, we obtain
(7) |
This further implies that
(8) |
and the main result follows from the fact that for all . Furthermore, under the boundedness assumption, we have for all
(9) |
Hence, (8) also implies , which leads to when . ∎
Appendix B Details of the Result in Section 4.4
Algorithm.
Given , define as a function of , where we view as a constant. Thus, if follows from that (8) that
(10) |
Thus, our goal is to minimize in an online manner, since we can only access after is chosen. To achieve this, we use the classic exponentiated gradient method to update . Specifically, for any , define the negative potential function and its Bregman divergence
Then is given by
where is the learning rate. See Section 6.6 in orabona2019modern for the derivation of the last equality. Intuitively, stabilizes the algorithm by ensuring that remains close to .
This simplified version of FGD is summarized in Algorithm 4. Note that when updating , we only use the last recommendation model .
Lemma 1.
Suppose that we have for all and . Then for all .
Proof of Theorem 2.
Now we proceed to the proof of Theorem 2. This is a standard result in the online learning literature (see, e.g., orabona2019modern). For completeness, we present the proof below.
Proof.
As is -strongly convex with , we have
(15) |
Throughout the proof, we slightly abuse the notation by writing and for simplicity. Notice that by our update rule is given by
From the first-order optimality condition, we get for any ,
where we used the three-point equality [chen1993convergence] in the last inequality. Furthermore,
Combining these two bounds, we have
Since is convex in , we have for any . By telescoping, we obtain
where we used Lemma 14, and in the last inequality. Choosing with some constant leads to
(16) |
Note that (16) holds for any . In particular, we can set defined by . Therefore,