Inverse Propensity Score based offline estimator for deterministic ranking lists using position bias

Nick Wood Just Eat Takeaway.com Sumit Sidana Just Eat Takeaway.com

September 1, 2022

1 Introduction

The mission of Just Eat Takeaway is to empower users’ every food moment. A big part of fulfilling that mission is ensuring that we show the right restaurants to the right users. To do this, we design large-scale machine learning recommendation systems that can learn which types of users tend to enjoy which types of restaurants.

A crucial part of this process is being able to compare the quality of different recommendation systems, in order to decide which is most effective. The gold standard approach to this evaluation problem is A/B testing - allow the different systems to recommend items to randomly selected groups of users, and compare their relative performance.

However, A/B testing is both slow, taking time to reach sufficient statistical power, and expensive [5, 1] if we serve poor recommendations to a subset of users.

Therefore, evaluating recommender systems in offline fashion without running A/B tests is of significant importance. These offline approaches avoid serving experimental recommenders to users, and instead evaluates them using old online data, where users were served recommendations from a different system. Metrics such as Root Mean Squared Error [2] Mean Average Precision, Normalized Discounted Cumulative Gain, Mean Reciprocal Rank and Hit Rate have been used repeatedly for offline evaluation.

All of these evaluation approaches are prone to selection, presentation and position biases [11, 6] and therefore the results from offline evaluation are often inaccurate, meaning the evaluated algorithms perform differently when served online than the offline evaluation predicted.

For this reason, “offline” approaches to evaluation have been subject to intense research [7] with the goal of coming up with offline performance indicators, which match the online performance. Counterfactual off-policy evaluation ( $O P E$ , henceforth) based approaches often make use of Inverse Propensity Score, also known as importance sampling [10]. However, IPS requires us to use a stochastic logging policy. In practice, this means that our recommender systems output a probability distribution over all the possible items that could be recommended, and we choose our recommendations by drawing them from this distribution. Our recommender will assign high probabilities to items it thinks are good recommendations, and low probabilities to poor items, but we will occasionally serve very different recommendations to those the recommender thinks is best.

If we want to evaluate a different recommender using the data acquired, this randomness is crucial, as it means we will sometimes have recommendations in our logged data that are close to what our new recommender would have selected.

However, introducing this randomness comes at a clear monetary cost, and it can be hard to justify paying this concrete cost for the less well defined benefit of faster recommender iteration. Ideally, we would like some way to evaluate non-random (deterministic) recommender systems offline. In the next section, we describe our approach to the counterfactual offline estimator we used for measuring the performance of deterministic ranking lists by making use of position-bias model [14].

2 Offline estimation using a Position-Bias Model

The propensity used in IPS is the probability that an item was selected to be displayed to a user. Position-bias modelling, in particular the examination model, gives us an alternative definition of item propensity. The examination model imagines that when a user sees a list of recommendations, they randomly choose a set of items to examine.

Each list position, k, has an associated examination probability, $P_{k} (E)$ , which usually is at its highest at the top of the list, and then decays to 0 for items in lower positions. From the list of examined items, the user chooses items which are relevant to them and clicks those which are. This process is modelled by assigning each item-user pair a probability of relevance $P_{i, u} (r)$ .

We can use expectation maximisation [4] on logged clicked data to measure $P_{k} (E)$ , the probability that an item is examined in each position [14]. Despite its simplicity, the examination model has been shown to predict user behaviour well [3], and we have used it with success on our ranking data in other applications.

$P_{k} (E)$ gives us the probability distribution of restaurants being examined from top to bottom. We replace the propensity of an item being shown to a user with the propensity that they examine it. We argue that even if the ranked list is deterministic in nature, the user actually examines/sees the restaurant as a probability and under this assumption, the traditional IPS estimator still applies. So, we are doing IPS, but we’re doing it with estimated propensities, which are coming from an examination model [14].

For the new policy, we get scores from new algorithm, rank restaurants according to those scores and find what position $p$ the restaurant holds in this ranking according to the new policy and then find the position-bias for each restaurant in this new ranking. For the logging policy, we find the position $k$ of each restaurant in which it was originally shown and then find its position-bias. Hence, our estimator becomes as follows:

\begin{matrix} V_{i p s} (π_{e}; D_{0}) = \frac{1}{n} n \sum i = 1 K \sum k = 1 \frac{π_{e} (a_{i} | x_{i})}{π_{0} (a_{i} | x_{i})} α_{k} r_{i, k} where, n = number of all restaurants spread over all sessions K = maximum position on which click happened in that session% π_{e} (a_{i} | x_{i}) = propensity of treatment policy for % restaurant a_{i} for given user x_{i} π_{0} (a_{i} | x_{i}) = propensity of control policy for % restaurant a_{i} for given user x_{i} α_{k} r_{i, k} = the reward of item i at position k (dcg, % number of clicks, precision) \end{matrix}

(1)

We also assume that the users come from the top and actions below the last order/click can be ignored. We only consider those restaurants above the clicked restaurant as true negative samples, as they were likely to have been seen. This assumption is similar to that of the RIPS estimator [9]. Hence our estimator further becomes:

\begin{matrix} V_{i p s} (π_{e}; D_{0}) = \frac{1}{n} n \sum i = 1 K \sum k = 1 \frac{θ_{p}}{θ_{k}} α_{k} r_{i, k} where, θ_{p} = position-bias-at-position-p θ_{k} = position-bias-at-position-k \end{matrix}

(2)

$K = m a x (k)$ is the maximum position at which the click/order happened in each session and $n$ is the total number of sessions, in which at least one restaurant was shown in the ranked list.

There is also a problem associated with equation 2 that most of the restaurants at lower ranks have a probability close to 0 of being examined. [12] and [13] address this sort of problem. This issue will increase the variance of our estimator, but as our position bias decays only slowly to 0, we have not found significant variance problems.

3 Experiments

In this section, we describe the experiments done using two different A/B tests. We first compute our OPE score given by equation 2 for the novel recommender in both the experiments and compare it with the actual CTR obtained for the variant recommender.

3.1 Multi-Objective Optimization Ranking Experiments

In this section, we evaluate a deterministic multi-objective ranking recommender for ranking restaurants. We ran an A/B test comparing a control 2-objective control recommender which combined an objective function for new restaurants as well as a rule based objective focussing on distance and restaurant quality, which resulted in a blended ranking between these two objectives. The variant recommender combines 3 objectives: new restaurants , rule-based, and an additional personalization objective.

To test the quality of our OPE, we take the ranking sessions and clicks from the control experimental data and compute the counterfactual ranking that would have been served by the variant policy in this session. We can then use our estimator and the logged feedback to evaluate the offline quality of the new policy. By comparing this measurement to the real variant policy performance in the A/B test, we can test the quality of our OPE framework.

Figure 1: ope vs ctr for 11 days of A/B test of multi-objective optimization based ranking

3.1.1 Independent Assumption

In this experiment, we use the independence assumption. This also helps in making the action space tractable. Every restaurant click is taken as independent from other restaurant clicks. This means that one single appearance of a restaurant in the ranked list is one data point in the experiment. The number of all such appearances is denoted by n in equation 2 Then, CTR definition is modified to:

\begin{matrix} C T R = \frac{1}{n} n \sum i = 1 c where, c = number of clicks in that session \end{matrix}

(3)

Day by day results shown in Figure 1 show a very strong correlation between the $O P E$ estimated from control logs and $C T R$ value computed from treatment logs. There is a constant difference between the $C T R$ and $O P E$ value, which we are still trying to explain.

3.2 Text Search Ranking Experiments

Figure 2: ope vs ctr for 11 days of A/B test of text search ranking

In text search, users come in and input a text query in the text box and the ranked list of restaurants are shown. The $K = m a x (k)$ in equation 2 is the maximum position at which the click/order happened in each session and $n$ is the total number of sessions in which search was done and there was at least one result returned. We first run the position bias examination model on 1 month of logs in order to obtain $P_{k} (E)$ .

Then we take logs from control and treatment and join these sessions on the text query and restaurant to compute the position of each restaurant which was clicked in control. For control, we already have the position, but, for treatment, we take the median of all the possible positions, which we end up as a result of the join. Then, we compare it with actual click-through-rate of each day of the treatment. For computing the OPE, we use equation 2

Actual Click-Through-Rate is given by:

\begin{matrix} C T R = n \sum i = 1 K \sum k = 1 \frac{c}{v} where, c = # clicks in that session v = # views in that session \end{matrix}

(4)

$K$ as before is the maximum position at which the click happened in that session. By using the maximum position, we make the clicks in ranked list a bit more dependent on each other. This kind of assumption is also similar to [8].

The results are shown in Figure 2, which shows day by day comparison of $C T R$ and $O P E$ . As evident from the Figure, $O P E$ is strongly correlated with the actual click-through-rate obtained in treatment. There is a constant difference between the $O P E$ and actual $C T R$ , which means $O P E$ is a bit underestimated compared to the actual $C T R$ . Variance of OPE is also less compared to typical IPS estimator. This could have been due to the position-bias values not converging properly. This could also have been due to the fact that control and treatment policies don’t differ a lot from each other. At this time, we are still investigating this further.

4 Conclusion

In this work, we presented a novel way of computing IPS using a position-bias model for deterministic logging policies. This technique significantly widens the policies on which OPE can be used. We validated this technique using two different experiments on industry-scale data. The OPE results are clearly strongly correlated with the online results, with some constant bias. The estimator requires the examination model to be a reasonably accurate approximation of real user behaviour.

5 Acknowledgments

We would like to thank Lauren Rodney, Erdem Biyik, Hakan Simsek, Liza Chukreeva and Rishi Kumar for helping us run these experiments. We would also like to thank Sabrican Özan, Max Knobbout and Boris Slesar for helping us review this article. Finally, we would like to thank Ozlem Ozer for sponsoring this project.

References

[1] A. Agarwal, S. Basu, T. Schnabel, and T. Joachims (2017-08) Effective evaluation using logged bandit feedback from multiple loggers. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, External Links: Document, Link Cited by: §1.
[2] J. Bennett and S. Lanning (2007-08) The netflix prize. In Proceedings of the KDD Cup Workshop 2007, New York, pp. 3–6. External Links: Link Cited by: §1.
[3] A. Chuklin, I. Markov, and M. de Rijke (2015) Click models for web search. Synthesis Lectures on Information Concepts, Retrieval, and Services, Morgan & Claypool Publishers. External Links: Link, Document Cited by: §2.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39 (1), pp. 1–38. External Links: ISSN 00359246, Link Cited by: §2.
[5] A. Gilotte, C. Calauzènes, T. Nedelec, A. Abraham, and S. Dollé (2018-02) Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, External Links: Document, Link Cited by: §1.
[6] O. Jeunen (2019-09) Revisiting offline evaluation for implicit-feedback recommender systems. pp. . External Links: ISBN 978-1-4503-6243-6, Document Cited by: §1.
[7] O. Jeunen (2021-09) Offline approaches to recommendation with online success. Ph.D. Thesis. Cited by: §1.
[8] S. Li, Y. Abbasi-Yadkori, B. Kveton, S. Muthukrishnan, V. Vinay, and Z. Wen (2018) Offline evaluation of ranking policies with click models. arXiv. External Links: Document, Link Cited by: §3.2.
[9] J. McInerney, B. Brost, P. Chandar, R. Mehrotra, and B. Carterette (2020-08) Counterfactual evaluation of slate recommendations with sequential reward interactions. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, External Links: Document, Link Cited by: §2.
[10] A. B. Owen (2013) Monte carlo theory, methods and examples. Cited by: §1.
[11] M. Rossetti, F. Stella, and M. Zanker (2016) Contrasting offline and online results when evaluating recommendation algorithms. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, New York, NY, USA, pp. 31–34. External Links: ISBN 9781450340359, Link, Document Cited by: §1.
[12] N. Sachdeva, Y. Su, and T. Joachims (2020-08) Off-policy bandits with deficient support. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, External Links: Document, Link Cited by: §2.
[13] A. Swaminathan, A. Krishnamurthy, A. Agarwal, M. Dudík, J. Langford, D. Jose, and I. Zitouni (2016) Off-policy evaluation for slate recommendation. CoRR abs/1605.04812. External Links: Link, 1605.04812 Cited by: §2.
[14] X. Wang, N. Golbandi, M. Bendersky, D. Metzler, and M. Najork (2018) Position bias estimation for unbiased learning to rank in personal search. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, New York, NY, USA, pp. 610–618. External Links: ISBN 9781450355810, Link, Document Cited by: §1, §2, §2.