Comparing Apples to Oranges:
Learning Similarity Functions for Data Produced by Different Distributions

Leonidas Tsepenekas¹,², Ivan Brugere¹

Abstract

Similarity functions measure how comparable pairs of elements are, and play a key role in a wide variety of applications, e.g., Clustering problems and considerations of Individual Fairness. However, access to an accurate similarity function should not always be considered guaranteed. Specifically, when the elements to be compared are produced by different distributions, or in other words belong to different “demographic” groups, knowledge of their true similarity might be very difficult to obtain. In this work, we present a sampling framework that learns these across-groups similarity functions, using only a limited amount of experts’ feedback. We show analytical results with rigorous bounds, and empirically validate our algorithms via a large suite of experiments.


¹ JPMorgan Chase & Co, ² University of Maryland, College Park
ltsepene@umd.edu, ivan.brugere@jpmchase.com

1 Introduction

Given a feature space $I$ , a similarity function $σ : I^{2} \mapsto R_{\geq 0}$ measures how comparable any pair of elements $x, x^{'} \in I$ are. The function $σ$ can also be interpreted as a distance function, where the smaller $σ (x, x^{'})$ is, the more similar $x$ and $x^{'}$ are. Such functions are crucially used in a variety of AI/ML problems, and in each such case $σ$ is assumed to be known. Two of the most prominent applications where similarity functions have a central role are clustering and considerations of individual fairness. In clustering, the similarity function serves as the metric space in which we need to create the appropriate clusters. As for the well-known notion of individual fairness introduced in the seminal work of Dwork et al. (2012), its goal is treating similar individuals similarly. Hence, any algorithm that needs to abide by this concept of fairness, should be able to access the similarity value for every pair of individuals that are of interest.

Nonetheless, it is not realistic to assume that a reliable and accurate similarity function is always given. This issue was even raised in the work of Dwork et al. (2012), where it was acknowledged that the computation of $σ$ is not trivial, and thus should be deferred to third parties. The starting point of our work here is the observation that there exist scenarios where computing similarity can be assumed as easy (in other words given), while in other cases this task would be significantly more challenging. Specifically, we are interested in scenarios where there are multiple distributions that produce elements of $I$ . We loosely call each such distribution a “demographic” group, interpreting it as the stochastic way in which members of this group are produced. In this setting, computing the similarity value of two elements that are produced according to the same distribution, seems intuitively much easier compared to computing similarity values for elements belonging to different groups. We next present a few motivating examples that clarify this statement.

Individual Fairness: Consider a college admissions committee that needs access to an accurate similarity function for students, so that it provides a similar likelihood of acceptance to similar applicants. Let us focus on the following two demographic groups. The first being students from affluent families living in privileged communities and having access to the best quality schools and private tutoring. The other group would consist of students from low-income families, coming from a far less privileged background. Given this setting, directly comparing two students based on their features (e.g., SAT scores, strength of school curriculum, number of recommendation letters, etc), can be a fair way to elicit their similarity only if the students belong to the same group. It is clear that this trivial approach might hurt less privileged students when they are compared to students of the first group. This is because such a simplistic way of measuring similarity does not reflect potential that is undeveloped due to unequal access to resources. Hence, accurate across-groups comparisons that take into account such delicate issues, appear considerably more intricate.

Clustering: Suppose that a marketing company has a collection of user data that it wants to cluster, with its end goal being a downstream market segmentation analysis. However, as is usually the case, the data might come from different sources, e.g., data from private vendors and data from government bureaus. In this scenario, each data source might have its own way of representing user information, e.g., each source might use a unique subset of features. Therefore, eliciting the distance metric required for the clustering task should be straightforward for data coming from the same source, while across-sources distances would certainly require extra care, e.g., how can one extract the distance of two vectors containing different sets of features?

For more applications we refer the reader to Appendix A.

As suggested by earlier work on computing similarity functions (Ilvento, 2019), when obtaining similarity values is an overwhelming task, we can employ the advice of domain experts. Such experts can be given any pair of elements, and in return produce their true similarity value. However, such expert advice can be very costly, and hence it should be used sparingly. For example, in the case of comparing students from different economic backgrounds, the admissions committee can reach out to regulatory bodies or civil rights organizations. Nonetheless, resorting to these experts for every student comparison that might arise is not a sustainable solution. Therefore, our goal in this paper is to learn the across-groups similarity functions, using as few queries to experts as possible.

1.1 Formal Problem Definition

Let $I$ denote the feature space of elements. We assume that elements come from $γ$ known “demographic” groups, where $γ \in N$ , and each group $ℓ \in [γ]$ is governed by an unknown distribution $D_{ℓ}$ over $I$ . We use $x \sim D_{ℓ}$ to denote a randomly drawn $x$ from $D_{ℓ}$ . Further, we use $x \in D_{ℓ}$ to denote that $x$ is an element in the support of $D_{ℓ}$ , and thus $x$ is a member of group $ℓ$ . Observe now that for a specific $x \in I$ , we might have $x \in D_{ℓ}$ and $x \in D_{ℓ^{'}}$ , for $ℓ \neq ℓ^{'}$ . Hence, in our model group membership is important, and every time we are considering an element $x \in I$ , we know which distribution produced $x$ , i.e., which group $x$ belongs to.

For every group $ℓ \in [γ]$ there is an intra-group similarity function $d_{ℓ} : I^{2} \mapsto R_{\geq 0}$ , such that for all $x, y \in D_{ℓ}$ we have $d_{ℓ} (x, y)$ representing the true similarity between $x, y$ . In addition, the smaller $d_{ℓ} (x, y)$ is, the more similar $x, y$ . Note here that the function $d_{ℓ}$ is only used to compare members of group $ℓ$ . Further, a common assumption for functions measuring similarity is that they are metric (Yona and Rothblum, 2018; Kim et al., 2018; Ilvento, 2019; Mukherjee et al., 2020; Wang et al., 2019); a function $d$ is metric if a) $d (x, y) = 0$ implies $x = y$ , b) $d (x, y) = d (y, x)$ and c) $d (x, y) \leq d (x, z) + d (z, y)$ for all $x, y, z$ (triangle inequality). We also adopt the metric assumption for the function $d_{ℓ}$ . Finally, based on the earlier discussion regarding computing similarity between elements of the same group, we assume that $d_{ℓ}$ is known and given as part of the instance.

Moreover, for any two groups $ℓ$ and $ℓ^{'}$ there exists an unknown across-groups similarity function $σ_{ℓ, ℓ^{'}} : I^{2} \mapsto R_{\geq 0}$ , such that for all $x \in D_{ℓ}$ and $y \in D_{ℓ^{'}}$ , $σ_{ℓ, ℓ^{'}} (x, y)$ represents the true similarity between $x, y$ . Again, the smaller $σ_{ℓ, ℓ^{'}} (x, y)$ is, the more similar the two elements, and for a meaningful use of $σ_{ℓ, ℓ^{'}}$ we must make sure that $x$ is a member of group $ℓ$ and $y$ a member of group $ℓ^{'}$ . Finally, in order to imitate the metric nature of a similarity function, we impose the following properties on $σ_{ℓ, ℓ^{'}}$ , which can be viewed as across-groups triangle inequalities:

Property $M_{1}$ : $σ_{ℓ, ℓ^{'}} (x, y) \leq d_{ℓ} (x, z) + σ_{ℓ, ℓ^{'}} (z, y)$ for every $x, z \in D_{ℓ}$ and $y \in D_{ℓ^{'}}$ .
Property $M_{2}$ : $σ_{ℓ, ℓ^{'}} (x, y) \leq σ_{ℓ, ℓ^{'}} (x, z) + d_{ℓ^{'}} (z, y)$ for every $x \in D_{ℓ}$ and $y, z \in D_{ℓ^{'}}$ .
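
To make the two properties concrete, the following is a minimal sketch of a brute-force check of $M_{1}$ and $M_{2}$ on finite sets of group members. The function name `violates_m1_m2`, the tolerance `tol`, and the toy similarity functions are illustrative assumptions, not part of the model itself.

```python
from itertools import product

def violates_m1_m2(group_a, group_b, d_a, d_b, sigma, tol=1e-12):
    """Return a message naming the first violated property, or None if M1 and M2 hold."""
    # M1: sigma(x, y) <= d_a(x, z) + sigma(z, y) for x, z in group a and y in group b.
    for x, z, y in product(group_a, group_a, group_b):
        if sigma(x, y) > d_a(x, z) + sigma(z, y) + tol:
            return f"M1 fails for x={x}, z={z}, y={y}"
    # M2: sigma(x, y) <= sigma(x, z) + d_b(z, y) for x in group a and y, z in group b.
    for x, y, z in product(group_a, group_b, group_b):
        if sigma(x, y) > sigma(x, z) + d_b(z, y) + tol:
            return f"M2 fails for x={x}, y={y}, z={z}"
    return None

# Toy instance (hypothetical): both groups live on the real line.
group_a, group_b = [0.0, 1.0, 2.5], [0.5, 3.0]
d = lambda u, v: abs(u - v)
# A constant offset across groups keeps both properties intact.
sigma_ok = lambda x, y: abs(x - y) + 1.0
# One inflated value breaks M1.
sigma_bad = lambda x, y: 10.0 if (x, y) == (0.0, 0.5) else 0.0

print(violates_m1_m2(group_a, group_b, d, d, sigma_ok))
print(violates_m1_m2(group_a, group_b, d, d, sigma_bad))
```

Note that `sigma_ok` never returns $0$, even for identical feature vectors, which is allowed here because the across-groups values are not required to form a metric.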

Nonetheless, the collection of similarity values here does not axiomatically yield a valid metric space. This is due to the following. 1) If $σ_{ℓ, ℓ^{'}} (x, y) = 0$ for $x \in D_{ℓ}$ and $y \in D_{ℓ^{'}}$ , then we do not necessarily have $x = y$ . 2) It is not always the case that $d_{ℓ} (x, y) \leq σ_{ℓ, ℓ^{'}} (x, z) + σ_{ℓ, ℓ^{'}} (y, z)$ for $x, y \in D_{ℓ}$ , $z \in D_{ℓ^{'}}$ . 3) It is not always the case that $σ_{ℓ, ℓ^{'}} (x, y) \leq σ_{ℓ, ℓ^{''}} (x, z) + σ_{ℓ^{''}, ℓ^{'}} (z, y)$ for $x \in D_{ℓ}$ , $y \in D_{ℓ^{'}}$ , $z \in D_{ℓ^{''}}$ .

However, not having the collection of similarity values necessarily produce a metric space is not a weakness of our model. On the contrary, we view this as one of its strongest aspects. For one thing, imposing a complete metric constraint on the case of intricate across-groups comparisons sounds unrealistic and very restrictive. Further, even though existing literature treats similarity functions as metric ones, the seminal work of Dwork et al. (2012) mentions that this should not always be the case. Hence, our model is more general than the current literature.

Goal of Our Problem: We want for any two groups $ℓ, ℓ^{'}$ to compute a function $f_{ℓ, ℓ^{'}} : I^{2} \mapsto R_{\geq 0}$ , such that $f_{ℓ, ℓ^{'}} (x, y)$ is our estimate of similarity for any $x \in D_{ℓ}$ and $y \in D_{ℓ^{'}}$ . Specifically, we seek a PAC guarantee, where for any given accuracy and confidence parameters $ϵ, δ \in (0, 1)$ we have:

${Pr}_{x \sim D_{ℓ}, y \sim D_{ℓ^{'}}} [| f_{ℓ, ℓ^{'}} (x, y) - σ_{ℓ, ℓ^{'}} (x, y) | = ω (ϵ)] = O (δ)$

The subscript in the above probability corresponds to two independent random choices, one $x \sim D_{ℓ}$ and one $y \sim D_{ℓ^{'}}$ .

As for tools to learn $f_{ℓ, ℓ^{'}}$ , we only require two things. First, for each group $ℓ$ we want a set $S_{ℓ}$ of i.i.d. samples from $D_{ℓ}$ . The total number of samples used should be polynomial in the input parameters, i.e., polynomial in $γ, \frac{1}{ϵ}$ and $\frac{1}{δ}$ . Second, we require access to an expert oracle, which given any $x \in S_{ℓ}$ and $y \in S_{ℓ^{'}}$ for any $ℓ$ and $ℓ^{'}$ , returns the true similarity value $σ_{ℓ, ℓ^{'}} (x, y)$ . We refer to a single invocation of the oracle as a query. Since there is a cost to collecting expert feedback, an additional objective in our problem is minimizing the number of oracle queries.

1.2 Discussion of Our Results

In Section 2 we present our theoretical results. We begin with a simple learning algorithm achieving the following.

Theorem 1.

For any given parameters $ϵ, δ \in (0, 1)$ , we produce similarity functions $f_{ℓ, ℓ^{'}}$ for every $ℓ$ and $ℓ^{'}$ , such that:

	$Pr [{Error}_{(ℓ, ℓ^{'})}]$	$\coloneqq {Pr}_{x \sim D_{ℓ}, y \sim D_{ℓ^{'}}, A} [| f_{ℓ, ℓ^{'}} (x, y) - σ_{ℓ, ℓ^{'}} (x, y) | = ω (ϵ)]$
		$= O (δ + p_{ℓ} (ϵ, δ) + p_{ℓ^{'}} (ϵ, δ))$

The randomness here comes from three independent sources: the internal randomness $A$ of the algorithm, a choice $x \sim D_{ℓ}$ , and a choice $y \sim D_{ℓ^{'}}$ . The algorithm requires $\frac{1}{δ} log \frac{1}{δ^{2}}$ samples from each group, and $\frac{γ (γ - 1)}{δ^{2}} {log}^{2} \frac{1}{δ^{2}}$ oracle queries.

For each group $ℓ$ , we use $p_{ℓ} (ϵ, δ)$ to denote the probability of sampling an $(ϵ, δ)$ -rare element of $D_{ℓ}$ . We define as $(ϵ, δ)$ -rare for $D_{ℓ}$ , an element $x \in D_{ℓ}$ for which there is a less than $δ$ chance of sampling $x^{'} \sim D_{ℓ}$ with $d_{ℓ} (x, x^{'}) \leq ϵ$ . Formally, $x \in D_{ℓ}$ is $(ϵ, δ)$ -rare iff ${Pr}_{x^{'} \sim D_{ℓ}} [d_{ℓ} (x, x^{'}) \leq ϵ] < δ$ , and $p_{ℓ} (ϵ, δ) = {Pr}_{x \sim D_{ℓ}} [x \text{ is } (ϵ, δ)\text{-rare for } D_{ℓ}]$ . Intuitively, a rare element should be interpreted as an “isolated” member of the group, in the sense that it is less than $δ$ -likely to encounter another element that is $ϵ$ -similar to it.
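
As an illustration of the definition, the sketch below computes a plug-in estimate of $p_{ℓ} (ϵ, δ)$ from samples: an element is flagged as rare when fewer than a $δ$ fraction of a reference sample lies within distance $ϵ$ of it. The helper names, the synthetic one-dimensional group, and the use of a separate reference sample are assumptions made for the example; the true $p_{ℓ} (ϵ, δ)$ is defined with respect to $D_{ℓ}$ itself.

```python
import random

def is_rare(x, reference, dist, eps, delta):
    """Empirical test: fewer than a delta fraction of reference points lie within eps of x."""
    close = sum(1 for r in reference if dist(x, r) <= eps)
    return close / len(reference) < delta

def estimate_rare_mass(sample, reference, dist, eps, delta):
    """Fraction of `sample` flagged as rare: a plug-in estimate of p(eps, delta)."""
    return sum(is_rare(x, reference, dist, eps, delta) for x in sample) / len(sample)

rng = random.Random(0)

def draw():
    # Hypothetical group: 95% of the mass concentrates near 0,
    # the remaining 5% is spread thinly over a wide range ("isolated" members).
    return rng.gauss(0.0, 0.1) if rng.random() < 0.95 else rng.uniform(50.0, 1000.0)

reference = [draw() for _ in range(2000)]
sample = [draw() for _ in range(500)]
abs_dist = lambda a, b: abs(a - b)
p_hat = estimate_rare_mass(sample, reference, abs_dist, eps=0.5, delta=0.05)
print(f"estimated rare mass: {p_hat:.3f}")
```

In this synthetic group the estimate should sit close to the mass of the thinly spread component, which matches the intuition that concentration around archetypal elements keeps $p_{ℓ} (ϵ, δ)$ small.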

Clearly, the error probability for $ℓ$ and $ℓ^{'}$ is $O (δ)$ only when $p_{ℓ} (ϵ, δ), p_{ℓ^{'}} (ϵ, δ) = O (δ)$ . We hypothesize that in realistic distributions each $p_{ℓ} (ϵ, δ)$ should indeed be fairly small, and this hypothesis is actually validated by our experiments. The reason we believe this hypothesis to be true, is that very frequently real data demonstrate high concentration around certain archetypal elements. Hence, this sort of distributional density does not leave room for isolated elements in the rest of the space. Nonetheless, we also provide a strong no free lunch result involving the values $p_{ℓ} (ϵ, δ)$ , which shows that any practical algorithm necessarily depends on them. This result further implies that our algorithm’s error probabilities are almost optimal.

Theorem 2.

For every $ϵ, δ \in (0, 1)$ , any algorithm using finitely many samples, will yield $f_{ℓ, ℓ^{'}}$ with $Pr [| f_{ℓ, ℓ^{'}} (x, y) - σ_{ℓ, ℓ^{'}} (x, y) | = ω (ϵ)] = Ω (\max \{p_{ℓ} (ϵ, δ), p_{ℓ^{'}} (ϵ, δ)\} - ϵ)$ ; randomness is over the independent choices $x \sim D_{ℓ}$ , $y \sim D_{ℓ^{'}}$ and the internal randomness of the algorithm.

Moving on, we focus on minimizing the oracle queries. By carefully modifying the earlier simple algorithm, we obtain a new algorithm with the following guarantees:

Theorem 3.

For any given parameters $ϵ, δ \in (0, 1)$ , we produce similarity functions $f_{ℓ, ℓ^{'}}$ for every $ℓ$ and $ℓ^{'}$ , such that:

$Pr [{Error}_{(ℓ, ℓ^{'})}] = O (δ + p_{ℓ} (ϵ, δ) + p_{ℓ^{'}} (ϵ, δ))$

$Pr [{Error}_{(ℓ, ℓ^{'})}]$ is as defined in Theorem 1. Let $N = \frac{1}{δ} log \frac{1}{δ^{2}}$ . The algorithm requires $N$ samples from each group, and the number of oracle queries is at most

$\sum_{ℓ \in [γ]} \Big( Q_{ℓ} \sum_{ℓ^{'} \in [γ] : ℓ^{'} \neq ℓ} Q_{ℓ^{'}} \Big)$

where $Q_{ℓ} \leq N$ and $E [Q_{ℓ}] \leq \frac{1}{δ} + p_{ℓ} (ϵ, δ) N$ for each $ℓ$ .

Based on Theorem 3, the queries of the improved algorithm are at most $γ (γ - 1) N^{2}$ (the value of the above sum when every $Q_{ℓ} = N$ ), which is exactly the number of queries in our earlier simple algorithm. However, the smaller the values $p_{ℓ} (ϵ, δ)$ are, the fewer queries in expectation. Our experimental results actually confirm that this algorithm always leads to a significant decrease in the used queries.
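
The query bound of Theorem 3 can be evaluated directly once the values $Q_{ℓ}$ are known. The short sketch below does this for a list of per-group counts and compares the result with the value obtained when every $Q_{ℓ}$ equals $N$ ; the function names and the example counts are hypothetical.

```python
def improved_query_count(rep_counts):
    """Evaluate sum_l Q_l * sum_{l' != l} Q_l' for the given per-group counts Q_l."""
    total = sum(rep_counts)
    # For each group l, the inner sum over l' != l equals (total - Q_l).
    return sum(q * (total - q) for q in rep_counts)

def worst_case_query_count(num_groups, n):
    """Value of the same sum when every Q_l equals n."""
    return num_groups * (num_groups - 1) * n * n

# Hypothetical example: three groups, N = 10 samples each,
# and only a few representatives per group.
print(improved_query_count([3, 5, 2]))      # 3*7 + 5*5 + 2*8 = 62
print(worst_case_query_count(3, 10))        # 3*2*10*10 = 600
```

As in the theorem, the sum ranges over ordered pairs of distinct groups, so each unordered pair of groups is counted twice.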

Our final theoretical result involves a lower bound on the number of queries required for learning.

Theorem 4.

For all $ϵ, δ \in (0, 1)$ , any learning algorithm producing functions $f_{ℓ, ℓ^{'}}$ with ${Pr}_{x \sim D_{ℓ}, y \sim D_{ℓ^{'}}} [| f_{ℓ, ℓ^{'}} (x, y) - σ_{ℓ, ℓ^{'}} (x, y) | = ω (ϵ)] = O (δ)$ , needs $Ω (\frac{γ^{2}}{δ^{2}})$ queries.

Combining Theorems 3 and 4 implies that when all $p_{ℓ} (ϵ, δ)$ are negligible, i.e., $p_{ℓ} (ϵ, δ) \to 0$ , the expected number of queries of the Theorem 3 algorithm is asymptotically optimal.

Finally, Section 3 contains our experimental evaluation, where through a large suite of simulations on both real and synthetic data we validate our theoretical findings.

1.3 Related Work

Metric learning is a very well-studied area (Bellet et al., 2013; Kulis, 2013; Moutafis et al., 2017; Suárez-Díaz et al., 2018). There is also an extensive amount of work on using human feedback for learning metrics in specific tasks, e.g., image similarity, low-dimensional embeddings (Frome et al., 2007; Jamieson and Nowak, 2011; Tamuz et al., 2011; van der Maaten and Weinberger, 2012; Wilber et al., 2014). However, since these works are either tied to specific applications or specific metrics, they are only distantly related to ours.

Our model is more closely related to the literature on trying to learn the similarity function from the fairness definition of Dwork et al. (2012). This concept of fairness requires treating similar individuals similarly. Thus, it needs access to a function that returns a non-negative value for any pair of individuals, and this value corresponds to how similar the individuals are. Specifically, the smaller the value the more similar the elements that are compared.

Even though the fairness definition of Dwork et al. (2012) is very elegant and intuitive, the main obstacle for adopting it in practice is the inability to easily compute or access the crucial similarity function. To our knowledge, the only papers that attempt to learn this similarity function using expert oracles like us, are Ilvento (2019); Mukherjee et al. (2020) and Wang et al. (2019). Ilvento (2019) addresses the scenario of learning a general metric function, and gives theoretical PAC guarantees. Mukherjee et al. (2020) give theoretical guarantees for learning similarity functions that are only of a specific Mahalanobis form. Wang et al. (2019) simply provide empirical results. The first difference between our model and these papers is that unlike us, they do not consider elements coming from multiple distributions. However, the most important difference is that these works only learn metric functions. In our case the collection of similarity values (from all $d_{ℓ}$ and $σ_{ℓ, ℓ^{'}}$ ) does not necessarily yield a complete metric space; see the discussion in Section 1.1. Hence, our problem learns more general functions.

Regarding the difficulty in computing similarity between members of different groups, we are only aware of a brief result by Dwork et al. (2012). In particular, given a metric $d$ over the whole feature space, they mention that $d$ can only be trusted for comparisons between elements of the same group, and not for across-groups comparisons. In order to achieve the latter for groups $ℓ$ and $ℓ^{'}$ , they find a new similarity function $d^{'}$ that approximates $d$ , while minimizing the Earthmover distance between the distributions $D_{ℓ}, D_{ℓ^{'}}$ . This is completely different from our work, since here we assume the existence of across-groups similarity values, which we eventually want to learn. On the other hand, the approach of Dwork et al. (2012) can be seen as an optimization problem, where the across-groups similarity values need to be computed in a way that minimizes some objective. Also, unlike our model, this optimization approach has a serious limitation, and that is requiring $D_{ℓ}, D_{ℓ^{'}}$ to be explicitly known.

Finally, since similarity as distance can be quite difficult to compute in practice, there has been a line of research that defines similarity using simpler, yet less expressive structures. Examples include similarity lists (Chakrabarti et al., 2022), similarity graphs (Lahoti et al., 2019) and ordinal relationships (Jung et al., 2019).

2 Theoretical Results

We begin the section by presenting a simple algorithm with PAC guarantees, whose error probability is shown to be almost optimal. Later on, we focus on optimizing the oracle queries, and show an improved algorithm for this objective.

2.1 A Simple Learning Algorithm

Given any confidence and accuracy parameters $δ, ϵ \in (0, 1)$ respectively, our approach is summarized in the following. First, for every group $ℓ$ we need a set $S_{ℓ}$ of samples that are chosen i.i.d. according to $D_{ℓ}$ , such that $| S_{ℓ} | = \frac{1}{δ} log \frac{1}{δ^{2}}$ . Then, for every distinct $ℓ$ and $ℓ^{'}$ , and for all $x \in S_{ℓ}$ and $y \in S_{ℓ^{'}}$ , we ask the expert oracle for the true similarity value $σ_{ℓ, ℓ^{'}} (x, y)$ . The next observation follows trivially.

Observation 5.

The algorithm uses $\frac{γ}{δ} log \frac{1}{δ^{2}}$ samples, and $\frac{γ (γ - 1)}{δ^{2}} {log}^{2} \frac{1}{δ^{2}}$ queries to the oracle.
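
For a sense of scale, the following sketch evaluates the sample and query budget of Observation 5 for concrete parameters. It assumes that $log$ is the natural logarithm (consistent with the $e^{- δ | S_{ℓ} |}$ step used in the analysis) and rounds the per-group sample size up to an integer; both the rounding and the function names are assumptions of the example.

```python
import math

def sample_size(delta):
    """Per-group sample size N = (1/delta) * ln(1/delta^2), rounded up."""
    return math.ceil((1.0 / delta) * math.log(1.0 / delta ** 2))

def simple_algorithm_budget(num_groups, delta):
    """Return (N, total samples, total oracle queries) for the simple algorithm."""
    n = sample_size(delta)
    total_samples = num_groups * n
    # One query per ordered pair of distinct groups and per pair of samples.
    total_queries = num_groups * (num_groups - 1) * n * n
    return n, total_samples, total_queries

print(simple_algorithm_budget(num_groups=3, delta=0.1))
```

Since the query count grows quadratically in the per-group sample size, even moderate values of $δ$ lead to a large number of expert queries, which motivates the query-efficient variant of Section 2.2.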

Suppose now that we need to compare any $x \in D_{ℓ}$ and $y \in D_{ℓ^{'}}$ . Our high level idea is that the properties $M_{1}$ and $M_{2}$ of $σ_{ℓ, ℓ^{'}}$ (see Section 1.1), will actually allow us to use the closest element to $x$ in $S_{ℓ}$ and the closest element to $y$ in $S_{ℓ^{'}}$ as proxies. Thus, let $π (x) = \arg\min_{x^{'} \in S_{ℓ}} d_{ℓ} (x, x^{'})$ and $π (y) = \arg\min_{y^{'} \in S_{ℓ^{'}}} d_{ℓ^{'}} (y, y^{'})$ . The algorithm then sets

$f_{ℓ, ℓ^{'}} (x, y) \coloneqq σ_{ℓ, ℓ^{'}} (π (x), π (y))$

where $σ_{ℓ, ℓ^{'}} (π (x), π (y))$ is known from the earlier queries.
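
The simple algorithm can be sketched in a few lines for a single pair of groups. The code below is only an illustration under stated assumptions: the function names are hypothetical, the oracle is a toy function on the real line that satisfies $M_{1}$ and $M_{2}$ , and ties in the nearest-sample search are broken by index.

```python
import random

def train_simple(samples_a, samples_b, oracle):
    """Query the oracle on every cross-group pair of samples and cache the answers."""
    return {(i, j): oracle(x, y)
            for i, x in enumerate(samples_a)
            for j, y in enumerate(samples_b)}

def nearest_index(point, samples, dist):
    """Index of pi(point): the sample closest to `point` under the intra-group metric."""
    return min(range(len(samples)), key=lambda i: dist(point, samples[i]))

def predict_simple(x, y, samples_a, samples_b, d_a, d_b, table):
    """f(x, y) := sigma(pi(x), pi(y)), read from the cached oracle answers."""
    return table[(nearest_index(x, samples_a, d_a), nearest_index(y, samples_b, d_b))]

# Toy setup (hypothetical): both groups on [0, 1], absolute difference as d,
# and an "expert" that adds a constant offset across groups.
rng = random.Random(0)
dist = lambda u, v: abs(u - v)
oracle = lambda x, y: abs(x - y) + 1.0
samples_a = [rng.uniform(0.0, 1.0) for _ in range(50)]
samples_b = [rng.uniform(0.0, 1.0) for _ in range(50)]
table = train_simple(samples_a, samples_b, oracle)

x, y = 0.37, 0.81
estimate = predict_simple(x, y, samples_a, samples_b, dist, dist, table)
print(estimate, oracle(x, y))
```

The estimate is only as good as the proxies: its error is controlled by how close $x$ and $y$ are to their nearest samples, which is exactly what the analysis that follows quantifies.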

Before we proceed with the analysis of the algorithm, we need to recall some notation which was first introduced in Section 1.2. Consider any group $ℓ$ . An element $x \in D_{ℓ}$ with ${Pr}_{x^{'} \sim D_{ℓ}} [d_{ℓ} (x, x^{'}) \leq ϵ] < δ$ is called an $(ϵ, δ)$ -rare element of $D_{ℓ}$ , and also $p_{ℓ} (ϵ, δ) \coloneqq {Pr}_{x \sim D_{ℓ}} [x \text{ is } (ϵ, δ)\text{-rare for } D_{ℓ}]$ .

Theorem 6.

For any given parameters $ϵ, δ \in (0, 1)$ , the algorithm produces $f_{ℓ, ℓ^{'}}$ for every $ℓ$ and $ℓ^{'}$ , such that

$Pr [{Error}_{(ℓ, ℓ^{'})}] = O (δ + p_{ℓ} (ϵ, δ) + p_{ℓ^{'}} (ϵ, δ))$

where $Pr [{Error}_{(ℓ, ℓ^{'})}]$ is as in Theorem 1.

Proof.

For two distinct groups $ℓ$ and $ℓ^{'}$ , consider what will happen when we are asked to compare some $x \in D_{ℓ}$ and $y \in D_{ℓ^{'}}$ . Properties $M_{1}$ and $M_{2}$ of $σ_{ℓ, ℓ^{'}}$ imply

	$Q \leq σ_{ℓ, ℓ^{'}} (x, y) \leq P$ , where
	$P \coloneqq d_{ℓ} (x, π (x)) + σ_{ℓ, ℓ^{'}} (π (x), π (y)) + d_{ℓ^{'}} (y, π (y))$
	$Q \coloneqq σ_{ℓ, ℓ^{'}} (π (x), π (y)) - d_{ℓ} (x, π (x)) - d_{ℓ^{'}} (y, π (y))$
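
For completeness, one way to obtain both bounds is to apply each property twice, as in the following worked derivation. The second chain, after rearranging and using the symmetry of $d_{ℓ}$ , gives $Q \leq σ_{ℓ, ℓ^{'}} (x, y)$ .

```latex
\begin{align*}
\sigma_{\ell,\ell'}(x,y)
  &\le d_{\ell}(x,\pi(x)) + \sigma_{\ell,\ell'}(\pi(x),y)
     && \text{($M_1$ with $z=\pi(x)$)}\\
  &\le d_{\ell}(x,\pi(x)) + \sigma_{\ell,\ell'}(\pi(x),\pi(y)) + d_{\ell'}(\pi(y),y) = P
     && \text{($M_2$ with $z=\pi(y)$)}\\[4pt]
\sigma_{\ell,\ell'}(\pi(x),\pi(y))
  &\le d_{\ell}(\pi(x),x) + \sigma_{\ell,\ell'}(x,\pi(y))
     && \text{($M_1$ with $z=x$)}\\
  &\le d_{\ell}(\pi(x),x) + \sigma_{\ell,\ell'}(x,y) + d_{\ell'}(y,\pi(y))
     && \text{($M_2$ with $z=y$)}
\end{align*}
```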

Note that when $d_{ℓ} (x, π (x)) \leq 3 ϵ$ and $d_{ℓ^{'}} (y, π (y)) \leq 3 ϵ$ , the above inequalities and the definition of $f_{ℓ, ℓ^{'}} (x, y)$ yield $| f_{ℓ, ℓ^{'}} (x, y) - σ_{ℓ, ℓ^{'}} (x, y) | \leq 6 ϵ$ . Thus, we just need upper bounds for $A \coloneqq {Pr}_{S_{ℓ}, x \sim D_{ℓ}} [\forall x^{'} \in S_{ℓ} : d_{ℓ} (x, x^{'}) > 3 ϵ]$ and $B \coloneqq {Pr}_{S_{ℓ^{'}}, y \sim D_{ℓ^{'}}} [\forall y^{'} \in S_{ℓ^{'}} : d_{ℓ^{'}} (y, y^{'}) > 3 ϵ]$ , since the previous analysis and a union bound give $Pr [{Error}_{(ℓ, ℓ^{'})}] \leq A + B$ . In what follows we present an upper bound for $A$ . The same analysis gives an identical bound for $B$ .

Before we proceed to the rest of the proof, we have to provide an existential construction. For the sake of simplicity we will be using the term dense for elements of $D_{ℓ}$ that are not $(ϵ, δ)$ -rare. For every $x \in D_{ℓ}$ that is dense, we define $B_{x} \coloneqq \{x^{'} \in D_{ℓ} : d_{ℓ} (x, x^{'}) \leq ϵ\}$ . Observe that the definition of dense elements implies ${Pr}_{x^{'} \sim D_{ℓ}} [x^{'} \in B_{x}] \geq δ$ for every dense $x$ . Next, consider the following process. We start with an empty set $R = \emptyset$ , and we assume that all dense elements are unmarked. Then, we choose an arbitrary unmarked dense element $x$ , and we place it in the set $R$ . Further, for every dense $x^{'} \in D_{ℓ}$ that is unmarked and has $B_{x} \cap B_{x^{'}} \neq \emptyset$ , we mark $x^{'}$ and set $ψ (x^{'}) = x$ . Here the function $ψ$ maps dense elements to elements of $R$ . We continue this picking process until all dense elements have been marked. Since $B_{z} \cap B_{z^{'}} = \emptyset$ for any two distinct $z, z^{'} \in R$ and ${Pr}_{x^{'} \sim D_{ℓ}} [x^{'} \in B_{z}] \geq δ$ for $z \in R$ , we have $| R | \leq 1 / δ$ . Also, for every dense $x$ we have $d_{ℓ} (x, ψ (x)) \leq 2 ϵ$ due to $B_{x} \cap B_{ψ (x)} \neq \emptyset$ .

Now we are ready to upper bound $A$ .

	$C$	$\coloneqq {Pr}_{S_{ℓ}, x \sim D_{ℓ}} [\forall x^{'} \in S_{ℓ} : d_{ℓ} (x, x^{'}) > 3 ϵ \wedge x \text{ is } (ϵ, δ)\text{-rare}]$
		$\leq {Pr}_{x \sim D_{ℓ}} [x \text{ is } (ϵ, δ)\text{-rare}] = p_{ℓ} (ϵ, δ)$
	$D$	$\coloneqq {Pr}_{S_{ℓ}, x \sim D_{ℓ}} [\forall x^{'} \in S_{ℓ} : d_{ℓ} (x, x^{'}) > 3 ϵ \wedge x \text{ is dense}]$
		$\leq {Pr}_{S_{ℓ}} [\exists r \in R : B_{r} \cap S_{ℓ} = \emptyset]$
		$\leq \sum_{r \in R} {Pr}_{S_{ℓ}} [B_{r} \cap S_{ℓ} = \emptyset]$
		$\leq | R | (1 - δ)^{| S_{ℓ} |} \leq | R | e^{- δ | S_{ℓ} |} \leq δ$

The upper bound for $C$ is trivial. We next explain the computations for $D$ . For the transition between the first and the second line we use a proof by contradiction. Hence, suppose that $S_{ℓ} \cap B_{r} \neq \emptyset$ for every $r \in R$ , and let $i_{r}$ denote an arbitrary element of $S_{ℓ} \cap B_{r}$ . Then, for any dense element $x \in D_{ℓ}$ we have $d_{ℓ} (x, π (x)) \leq d_{ℓ} (x, i_{ψ (x)}) \leq d_{ℓ} (x, ψ (x)) + d_{ℓ} (ψ (x), i_{ψ (x)}) \leq 2 ϵ + ϵ = 3 ϵ$ . Back to the computations for $D$ , to get the third line we simply used a union bound. To get from the third to the fourth line, we used the definition of $r \in R$ as a dense element, which implies that the probability of sampling any element of $B_{r}$ in one try is at least $δ$ . The final bound is a result of numerical calculations using $| R | \leq \frac{1}{δ}$ and $| S_{ℓ} | = \frac{1}{δ} log \frac{1}{δ^{2}}$ .
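
The numerical calculations behind the final bound can be spelled out as follows, assuming that $log$ denotes the natural logarithm (which is what the step through $e^{- δ | S_{ℓ} |}$ requires).

```latex
\begin{align*}
|R|\,(1-\delta)^{|S_\ell|}
  &\le |R|\, e^{-\delta |S_\ell|}
     && \text{since } 1-\delta \le e^{-\delta} \\
  &\le \frac{1}{\delta}\,\exp\!\Big(-\delta \cdot \frac{1}{\delta}\ln\frac{1}{\delta^{2}}\Big)
     && \text{using } |R| \le \tfrac{1}{\delta},\ |S_\ell| = \tfrac{1}{\delta}\ln\tfrac{1}{\delta^{2}} \\
  &= \frac{1}{\delta}\cdot\delta^{2} = \delta .
\end{align*}
```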

To conclude the proof, observe that $A = C + D$ , and using a similar reasoning we also get $B \leq p_{ℓ^{'}} (ϵ, δ) + δ$ . ∎

Observation 5 and Theorem 6 directly yield Theorem 1.

A potential criticism of the algorithm presented here, is that its error probabilities depend on $p_{ℓ} (ϵ, δ)$ . However, Theorem 2 shows that such a dependence is unavoidable.

Proof of Theorem 2.

Given any $ϵ, δ$ , consider the following instance of the problem. We have two groups represented by the distributions $D_{1}$ and $D_{2}$ . For the first group we have only one element belonging to it, and let that element be $x$ . In other words, every time we draw an element from $D_{1}$ that element turns out to be $x$ , i.e., ${Pr}_{x^{'} \sim D_{1}} [x^{'} = x] = 1$ . For the second group we have that every $y \in D_{2}$ appears with probability $\frac{1}{| D_{2} |}$ , and $| D_{2} | \to \infty$ . In other words, $D_{2}$ is a uniform distribution over a countably infinite set.

Now we define all similarity values. At first, the similarity function for $D_{1}$ will trivially be $d_{1} (x, x) = 0$ . For the second group, for every distinct $y, y^{'} \in D_{2}$ we define $d_{2} (y, y^{'}) = 1$ . Obviously, for every $y \in D_{2}$ we set $d_{2} (y, y) = 0$ . Observe that $d_{1}$ and $d_{2}$ are metric functions for their respective groups. As for the across-groups similarities, each $σ (x, y)$ for $y \in D_{2}$ is chosen independently, and it is drawn uniformly at random from $[0, 1]$ . Note that this choice of $σ$ satisfies the necessary metric-like properties $M_{1}$ and $M_{2}$ that were introduced in Section 1.1.

Further, since $ϵ, δ \in (0, 1)$ , any $y \in D_{2}$ is $(ϵ, δ)$ -rare:

${Pr}_{y^{'} \sim D_{2}} [d_{2} (y, y^{'}) \leq ϵ] = {Pr}_{y^{'} \sim D_{2}} [y^{'} = y] = \frac{1}{| D_{2} |} < δ$

The first equality is because the only element within distance $ϵ$ from $y$ is $y$ itself. The last inequality is because we can always choose $| D_{2} |$ in the construction of the input such that $\frac{1}{| D_{2} |} < δ$ ; we control $| D_{2} |$ while $δ$ is given. Therefore, since all elements are $(ϵ, δ)$ -rare, we have $p_{2} (ϵ, δ) = 1$ .

Consider now any learning algorithm that produces an estimate function $f$ . For any $y \in D_{2}$ , let us try to analyze the probability of having $| f (x, y) - σ (x, y) | = ω (ϵ)$ . For one thing, the probability of $y$ being in the sample set of the algorithm is $1 - (1 - 1 / | D_{2} |)^{N}$ , where $N$ is the number of used samples. Because the sampled set is finite, we have that the previous probability is practically $0$ :

$\lim_{| D_{2} | \to \infty} (1 - (1 - 1 / | D_{2} |)^{N}) = 0$

Thus, since with absolute certainty $y$ will not be among the samples, the algorithm needs to learn $σ (x, y)$ via some other value $σ (x, y^{'})$ , for $y^{'}$ being a sampled element of the second group. However, due to the construction of $σ$ the values $σ (x, y)$ and $σ (x, y^{'})$ are independent. This means that knowledge of any $σ (x, y^{'})$ (with $y \neq y^{'}$ ) provides no information at all on $σ (x, y)$ . Hence, the best any algorithm can do is guess $f (x, y)$ uniformly at random from $[0, 1]$ . This yields $Pr [| f (x, y) - σ (x, y) | = ω (ϵ) | y] = 1 - Pr [| f (x, y) - σ (x, y) | = O (ϵ) | y] = 1 - O (ϵ) = p_{2} (ϵ, δ) - O (ϵ)$ . The latter trivially gives the desired bound through a mere integration over all $y \in D_{2}$ . ∎

2.2 Optimizing the Number of Expert Queries

Here we modify the earlier algorithm in a way that improves the number of queries used. The high-level idea behind this improvement is the following. Given the sets of samples $S_{ℓ}$ , instead of asking the oracle for all possible similarity values $σ_{ℓ, ℓ^{'}} (x, y)$ for every $ℓ, ℓ^{'}$ and every $x \in S_{ℓ}$ and $y \in S_{ℓ^{'}}$ , we would rather choose a set $R_{ℓ} \subseteq S_{ℓ}$ of representative elements for each group $ℓ$ . Then, we would ask the oracle for the values $σ_{ℓ, ℓ^{'}} (x, y)$ for every $ℓ, ℓ^{'}$ , but this time only for every $x \in R_{ℓ}$ and $y \in R_{ℓ^{'}}$ . The choice of the representatives is inspired by the $k$ -center algorithm of Hochbaum and Shmoys (1985). Intuitively, the representatives $R_{ℓ}$ of group $ℓ$ will serve as similarity proxies for the elements of $S_{ℓ}$ , such that each $x \in S_{ℓ}$ is assigned to a nearby $r_{ℓ} (x) \in R_{ℓ}$ via a mapping function $r_{ℓ} : S_{ℓ} \mapsto R_{ℓ}$ . Hence, if $d_{ℓ} (x, r_{ℓ} (x))$ is small enough, $x$ and $r_{ℓ} (x)$ are highly similar, and thus $r_{ℓ} (x)$ acts as a good approximation of $x$ . The full details for the construction of $R_{ℓ}$ and $r_{ℓ}$ are presented in Algorithm 1.

Suppose now that we need to compare some $x \in D_{ℓ}$ and $y \in D_{ℓ^{'}}$ . Our approach will be almost identical to that of Section 2.1. Once again, let $π (x) = \arg\min_{x^{'} \in S_{ℓ}} d_{ℓ} (x, x^{'})$ and $π (y) = \arg\min_{y^{'} \in S_{ℓ^{'}}} d_{ℓ^{'}} (y, y^{'})$ . However, unlike the simple algorithm of Section 2.1 that directly uses $π (x)$ and $π (y)$ , the more intricate algorithm here will rather use their proxies $r_{ℓ} (π (x))$ and $r_{ℓ^{'}} (π (y))$ . Our prediction will then be

$f_{ℓ, ℓ^{'}} (x, y) \coloneqq σ_{ℓ, ℓ^{'}} (r_{ℓ} (π (x)), r_{ℓ^{'}} (π (y)))$

where $σ_{ℓ, ℓ^{'}} (r_{ℓ} (π (x)), r_{ℓ^{'}} (π (y)))$ is known from the queries.

Algorithm 1 Training Phase

Input: Accuracy and confidence parameters $ϵ, δ$ . For every group $ℓ \in [γ]$ , a set $S_{ℓ}$ of i.i.d. samples chosen according to $D_{ℓ}$ , such that $| S_{ℓ} | = \frac{1}{δ} log \frac{1}{δ^{2}}$ .

1: for each $ℓ \in [γ]$ do
2:     $H_{x}^{ℓ} \leftarrow \{x^{'} \in S_{ℓ} : d_{ℓ} (x, x^{'}) \leq 8 ϵ\}$ for each $x \in S_{ℓ}$
3:     $U \leftarrow S_{ℓ}$ and $R_{ℓ} \leftarrow \emptyset$
4:     $r_{ℓ} (x) \leftarrow x$ for each $x \in S_{ℓ}$
5:     while $U \neq \emptyset$ do
6:         Choose an arbitrary $x \in U$
7:         $R_{ℓ} \leftarrow R_{ℓ} \cup \{x\}$
8:         $W_{x} \leftarrow \{x^{'} \in U : H_{x}^{ℓ} \cap H_{x^{'}}^{ℓ} \neq \emptyset\}$
9:         $r_{ℓ} (x^{'}) \leftarrow x$ for every $x^{'} \in W_{x}$
10:         $U \leftarrow U \setminus W_{x}$
11:     end while
12: end for
13: For every $ℓ$ and $ℓ^{'}$ , and for every $x \in R_{ℓ}$ and $y \in R_{ℓ^{'}}$ , ask the oracle for the value $σ_{ℓ, ℓ^{'}} (x, y)$ and store it.
14: For every $ℓ$ , return the set $R_{ℓ}$ and the function $r_{ℓ}$
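
The training phase and the corresponding prediction rule can be sketched as follows for one pair of groups. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, samples are addressed by index, the "arbitrary" choice in the loop is made deterministic by taking the smallest uncovered index, and the oracle is a toy function satisfying $M_{1}$ and $M_{2}$ .

```python
import random

def build_representatives(samples, dist, eps):
    """Training phase for one group. Returns (reps, rep_of), both using sample indices."""
    n = len(samples)
    # H[i]: indices of samples within distance 8*eps of sample i (always contains i).
    H = [{j for j in range(n) if dist(samples[i], samples[j]) <= 8 * eps}
         for i in range(n)]
    uncovered = set(range(n))   # the set U
    reps = []                   # the set R
    rep_of = list(range(n))     # r(x) <- x initially
    while uncovered:
        x = min(uncovered)      # arbitrary choice, fixed here for reproducibility
        reps.append(x)
        # W_x: uncovered samples whose neighbourhood intersects that of x.
        W = {j for j in uncovered if H[x] & H[j]}
        for j in W:
            rep_of[j] = x
        uncovered -= W
    return reps, rep_of

def query_pairs(samples_a, reps_a, samples_b, reps_b, oracle):
    """Ask the oracle only for pairs of representatives and cache the answers."""
    return {(i, j): oracle(samples_a[i], samples_b[j]) for i in reps_a for j in reps_b}

def predict(x, y, group_a, group_b, table):
    """f(x, y) := sigma(r(pi(x)), r(pi(y))); each group is (samples, rep_of, dist)."""
    (sa, ra, da), (sb, rb, db) = group_a, group_b
    i = min(range(len(sa)), key=lambda k: da(x, sa[k]))   # index of pi(x)
    j = min(range(len(sb)), key=lambda k: db(y, sb[k]))   # index of pi(y)
    return table[(ra[i], rb[j])]

# Toy usage (hypothetical data): two groups on [0, 1].
rng = random.Random(1)
dist = lambda u, v: abs(u - v)
oracle = lambda x, y: abs(x - y) + 1.0
eps = 0.01
S_a = [rng.uniform(0.0, 1.0) for _ in range(60)]
S_b = [rng.uniform(0.0, 1.0) for _ in range(60)]
reps_a, rep_of_a = build_representatives(S_a, dist, eps)
reps_b, rep_of_b = build_representatives(S_b, dist, eps)
table = query_pairs(S_a, reps_a, S_b, reps_b, oracle)
estimate = predict(0.3, 0.8, (S_a, rep_of_a, dist), (S_b, rep_of_b, dist), table)
print(len(reps_a), len(reps_b), len(table), estimate)
```

Only pairs of representatives are sent to the oracle, so the number of queries for the pair of groups is $| R_{ℓ} | \cdot | R_{ℓ^{'}} |$ instead of $| S_{ℓ} | \cdot | S_{ℓ^{'}} |$ .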

Theorem 7.

For any given parameters $ϵ, δ \in (0, 1)$ , the new algorithm produces $f_{ℓ, ℓ^{'}}$ for every $ℓ$ and $ℓ^{'}$ , such that

$Pr [{Error}_{(ℓ, ℓ^{'})}] = O (δ + p_{ℓ} (ϵ, δ) + p_{ℓ^{'}} (ϵ, δ))$

where $Pr [{Error}_{(ℓ, ℓ^{'})}]$ is as in Theorem 1.

Proof.

For two distinct groups $ℓ$ and $ℓ^{'}$ , consider comparing some $x \in D_{ℓ}$ and $y \in D_{ℓ^{'}}$ . To begin with, let us assume that $d_{ℓ} (x, π (x)) \leq 3 ϵ$ and $d_{ℓ^{'}} (y, π (y)) \leq 3 ϵ$ . Furthermore, the execution of the algorithm implies $H_{π (x)}^{ℓ} \cap H_{r_{ℓ} (π (x))}^{ℓ} \neq \emptyset$ , and thus the triangle inequality and the definitions of the sets $H_{π (x)}^{ℓ}, H_{r_{ℓ} (π (x))}^{ℓ}$ give $d_{ℓ} (π (x), r_{ℓ} (π (x))) \leq 16 ϵ$ . Similarly we get $d_{ℓ^{'}} (π (y), r_{ℓ^{'}} (π (y))) \leq 16 ϵ$ . Eventually:

	$d_{ℓ} (x, r_{ℓ} (π (x)))$	$\leq d_{ℓ} (x, π (x)) + d_{ℓ} (π (x), r_{ℓ} (π (x)))$
		$\leq 19 ϵ$
	$d_{ℓ^{'}} (y, r_{ℓ^{'}} (π (y)))$	$\leq d_{ℓ^{'}} (y, π (y)) + d_{ℓ^{'}} (π (y), r_{ℓ^{'}} (π (y)))$
		$\leq 19 ϵ$

For notational convenience, let $A \coloneqq d_{ℓ} (x, r_{ℓ} (π (x)))$ and $B \coloneqq d_{ℓ^{'}} (y, r_{ℓ^{'}} (π (y)))$ . Then, the metric properties $M_{1}$ and $M_{2}$ of $σ_{ℓ, ℓ^{'}}$ and the definition of $f_{ℓ, ℓ^{'}} (x, y)$ yield

$| σ_{ℓ, ℓ^{'}} (x, y) - f_{ℓ, ℓ^{'}} (x, y) | \leq A + B \leq 38 ϵ$

Overall, we proved that when $d_{ℓ} (x, π (x)) \leq 3 ϵ$ and $d_{ℓ^{'}} (y, π (y)) \leq 3 ϵ$ , we have $| f_{ℓ, ℓ^{'}} (x, y) - σ_{ℓ, ℓ^{'}} (x, y) | \leq 38 ϵ$ . Finally, as shown in the proof of Theorem 6, the probability of not having $d_{ℓ} (x, π (x)) \leq 3 ϵ$ and $d_{ℓ^{'}} (y, π (y)) \leq 3 ϵ$ , i.e., the error probability, is at most $2 δ + p_{ℓ} (ϵ, δ) + p_{ℓ^{'}} (ϵ, δ)$ . ∎

Since the number of samples used by the algorithm is easily seen to be $\frac{γ}{δ} log \frac{1}{δ^{2}}$ , the only thing left in order to prove Theorem 3 is analyzing the number of oracle queries. To that end, for every group $ℓ \in [γ]$ with its sampled set $S_{ℓ}$ , we define the following Set Cover problem.

Definition 8.

Let $H_{x}^{ℓ} \coloneqq \{x^{'} \in S_{ℓ} : d_{ℓ} (x, x^{'}) \leq 4 ϵ\}$ for all $x \in S_{ℓ}$ . Find $C \subseteq S_{ℓ}$ minimizing $| C |$ , with $⋃_{c \in C} H_{c}^{ℓ} = S_{ℓ}$ . We use $O P T_{ℓ}$ to denote the optimal value of this problem. Using standard terminology, we say $x \in S_{ℓ}$ is covered by $C$ if $x \in ⋃_{c \in C} H_{c}^{ℓ}$ , and $C$ is feasible if it covers all $x \in S_{ℓ}$ .

Lemma 9.

For every $ℓ \in [γ]$ we have $| R_{ℓ} | \leq O P T_{ℓ}$ .

Proof.

Consider a group $ℓ$ , and let $C^{*}$ be its optimal solution for the problem of Definition 8. We first claim that each $H_{c}^{ℓ}$ with $c \in C^{*}$ contains at most one element of $R_{ℓ}$ . This is due to the following. For any $c \in C^{*}$ , we have $d_{ℓ} (z, z^{'}) \leq d_{ℓ} (z, c) + d_{ℓ} (c, z^{'}) \leq 8 ϵ$ for all $z, z^{'} \in H_{c}^{ℓ}$ . In addition, the construction of $R_{ℓ}$ implies $d_{ℓ} (x, x^{'}) > 8 ϵ$ for all distinct $x, x^{'} \in R_{ℓ}$ . Thus, no two elements of $R_{ℓ}$ can be in the same $H_{c}^{ℓ}$ with $c \in C^{*}$ . Finally, since $R_{ℓ} \subseteq S_{ℓ}$ and Definition 8 requires every element of $S_{ℓ}$ to be covered, we have $| R_{ℓ} | \leq | C^{*} | = O P T_{ℓ}$ . ∎

Lemma 10.

Let $N = \frac{1}{δ} log \frac{1}{δ^{2}}$ . For each group $ℓ \in [γ]$ we have $O P T_{ℓ} \leq N$ with probability $1$ , and $E [O P T_{ℓ}] \leq \frac{1}{δ} + p_{ℓ} (ϵ, δ) N$ . The randomness here is over the samples $S_{ℓ}$ .

Proof.

Consider a group $ℓ$ . First, Definition 8 makes it clear that $O P T_{ℓ} \leq | S_{ℓ} | = N$ , since $C = S_{ℓ}$ is always feasible. For the second statement of the lemma we need to analyze $O P T_{ℓ}$ more carefully.

Recall the classification of elements $x \in D_{ℓ}$ that was first introduced in the proof of Theorem 6. According to this, an element can either be $(ϵ, δ)$ -rare, or dense. Now we will construct a solution $C_{ℓ}$ to the problem of Definition 8 as follows.

First, let $S_{ℓ, r}$ be the set of $(ϵ, δ)$ -rare elements of $S_{ℓ}$ . We include all of $S_{ℓ, r}$ in $C_{ℓ}$ , so that all $(ϵ, δ)$ -rare elements of $S_{ℓ}$ are covered by $C_{ℓ}$ . Further:

E [| S_{ℓ, r} |] = p_{ℓ} (ϵ, δ) N

(1)

Moving on, recall the construction shown in the proof of Theorem 6. According to that, there exists a set $R$ of at most $\frac{1}{δ}$ dense elements from $D_{ℓ}$ , and a function $ψ$ that maps every dense element $x \in D_{ℓ}$ to an element $ψ (x) \in R$ , such that $d_{ℓ} (x, ψ (x)) \leq 2 ϵ$ . Let us now define for each $x \in R$ a set $G_{x} \coloneqq \{x^{'} \in D_{ℓ} : x^{'} \text{ is dense and } ψ (x^{'}) = x\}$ , and note that $d_{ℓ} (z, z^{'}) \leq d_{ℓ} (z, x) + d_{ℓ} (z^{'}, x) \leq 4 ϵ$ for all $z, z^{'} \in G_{x}$ . Thus, for each $x \in R$ with $G_{x} \cap S_{ℓ} \neq \emptyset$ , we place in $C_{ℓ}$ an arbitrary $y \in G_{x} \cap S_{ℓ}$ , and that $y$ gets all of $G_{x} \cap S_{ℓ}$ covered. Finally, since the sets $G_{x}$ induce a partition of the dense elements of $D_{ℓ}$ , $C_{ℓ}$ covers all dense elements of $S_{ℓ}$ .

Equation (1) and $| R | \leq \frac{1}{δ}$ yield $E [| C_{ℓ} |] \leq \frac{1}{δ} + p_{ℓ} (ϵ, δ) N$ . Also, since $C_{ℓ}$ is shown to be a feasible solution for the problem of Definition 8, we get $O P T_{ℓ} \leq | C_{ℓ} | ⟹ E [O P T_{ℓ}] \leq \frac{1}{δ} + p_{ℓ} (ϵ, δ) N$ . ∎

The proof of Theorem 3 is concluded as follows. All pairwise queries for the elements in the sets $R_{ℓ}$ are

\sum_{ℓ} (| R_{ℓ} | \sum_{ℓ^{'} \neq ℓ} | R_{ℓ^{'}} |) \leq \sum_{ℓ} (O P T_{ℓ} \sum_{ℓ^{'} \neq ℓ} O P T_{ℓ^{'}})

(2)

where the inequality follows from Lemma 9. Finally, combining equation (2) and Lemma 10 gives the desired bound.
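The left-hand side of (2) is easy to evaluate for concrete representative-set sizes. The helper below is a small illustrative sketch; the name `pairwise_queries` is hypothetical and the input is simply the list of sizes $| R_{ℓ} |$ , one per group. As in (2), each pair of groups is counted in both orders.

```python
def pairwise_queries(sizes):
    """Evaluate sum over groups l of |R_l| * (sum over l' != l of |R_l'|)."""
    total = sum(sizes)
    # For each group, multiply its size by the combined size of all other groups.
    return sum(s * (total - s) for s in sizes)
```

For example, with $γ = 2$ and $| R_{1} | = | R_{2} | = 1 / δ = 10$ the count is $200 = γ (γ - 1) / δ^{2}$ .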

Remark 11.

The factor $8$ in the definition of $H_{x}^{ℓ}$ at line 2 of Algorithm 1 is arbitrary. Actually, any factor $ρ = O (1)$ would yield the same guarantees, with any changes in accuracy and queries being only of an $O (1)$ order of magnitude.

Finally, we are interested in lower bounds on the queries required for learning. To that end, we present Theorem 4, which shows that any algorithm with accuracy $O (ϵ)$ and confidence $O (δ)$ needs $Ω (γ^{2} / δ^{2})$ queries.

Proof of Theorem 4.

We are given accuracy and confidence parameters $ϵ, δ \in (0, 1)$ respectively. For the sake of simplifying the exposition in the proof, let us assume that $\frac{1}{δ}$ is an integer; all later arguments can be generalized in order to handle the case of $1 / δ \notin N$ .

We construct the following problem instance. We have two groups represented by the distributions $D_{1}$ and $D_{2}$ . In addition, for both of these groups we assume that the support of the corresponding distribution contains $\frac{1}{δ}$ elements, and ${Pr}_{x^{'} \sim D_{1}} [x = x^{'}] = δ$ for every $x \in D_{1}$ as well as ${Pr}_{y^{'} \sim D_{2}} [y = y^{'}] = δ$ for every $y \in D_{2}$ .

For every pair of distinct $x, x^{'} \in D_{1}$ let $d_{1} (x, x^{'}) = 1$ , and $d_{1} (x, x) = 0$ for every $x \in D_{1}$ . Similarly, for every pair of distinct $y, y^{'} \in D_{2}$ we set $d_{2} (y, y^{'}) = 1$ , and for every $y \in D_{2}$ we set $d_{2} (y, y) = 0$ . The functions $d_{1}$ and $d_{2}$ are clearly metrics. As for the across-groups similarity values, each $σ (x, y)$ for $x \in D_{1}$ and $y \in D_{2}$ is chosen independently, and it is drawn uniformly at random from $[0, 1]$ . Note that this choice of $σ$ satisfies the necessary properties $M_{1}, M_{2}$ introduced in Section 1.1.

In this proof we are also focusing on a more special learning model. In particular, we assume that the distributions $D_{1}$ and $D_{2}$ are known. Hence, there is no need for sampling. The only randomness here is over the random arrivals $x \sim D_{1}$ and $y \sim D_{2}$ , where $x$ and $y$ are the elements that need to be compared. Obviously, the similarity function $σ$ would still remain unknown to any learner. Finally, the number of queries required for learning in this model cannot be more than the queries required in the original model, and this is because this model is easier (special case of the original).

Consider now an algorithm with the error guarantees mentioned in the Theorem statement, and focus on a fixed pair $(x, y)$ with $x \in D_{1}$ and $y \in D_{2}$ . If the algorithm has queried the oracle for $(x, y)$ , it knows $σ (x, y)$ with absolute certainty. Let us study what happens when the algorithm has not queried the oracle for $(x, y)$ . In this case, because the values $σ (x^{'}, y^{'})$ with $x^{'} \in D_{1}$ and $y^{'} \in D_{2}$ are independent, no query the algorithm has performed can provide any information for $σ (x, y)$ . Thus, the best the algorithm can do is uniformly at random guess a value in $[0, 1]$ , and return that as the estimate for $σ (x, y)$ . Letting $\bar{Q}_{x, y}$ denote the event where no query is performed for $(x, y)$ , we have

	$P$	$\coloneqq Pr [| f_{ℓ, ℓ^{'}} (x, y) - σ_{ℓ, ℓ^{'}} (x, y) | = ω (ϵ) \mid \bar{Q}_{x, y}]$
		$= 1 - Pr [| f_{ℓ, ℓ^{'}} (x, y) - σ_{ℓ, ℓ^{'}} (x, y) | = O (ϵ) \mid \bar{Q}_{x, y}]$
		$= 1 - O (ϵ) = Ω (1)$

where the randomness comes only from the algorithm.

For the sake of contradiction, suppose the algorithm uses $q = o (1 / δ^{2})$ queries. Since each $(x, y)$ is equally likely to appear for a comparison, the overall error probability is

(\frac{1 / δ^{2} - q}{1 / δ^{2}}) \cdot P = Ω (1) \cdot Ω (1) = Ω (1)

Contradiction; the error probability must be $O (δ)$ . ∎
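The step $P = 1 - O (ϵ)$ used in the proof above can be verified directly. As a sketch, let $c > 0$ be any constant defining the target accuracy $c ϵ$ , and let $g \in [0, 1]$ be the value the algorithm returns for an unqueried pair; both symbols are introduced here only for illustration. Since $σ (x, y)$ is uniform on $[0, 1]$ and independent of everything the algorithm has seen, for every fixed $g$ we have:

```latex
\begin{align*}
\Pr\big[\, |g - \sigma(x,y)| \le c\epsilon \;\big|\; \bar{Q}_{x,y} \big]
  &= \big|\, [g - c\epsilon,\; g + c\epsilon] \cap [0,1] \,\big|
   \le 2c\epsilon, \\
\text{hence}\quad P &\ge 1 - 2c\epsilon .
\end{align*}
```

The bound holds for every value of $g$ , so it also holds when $g$ is chosen at random by the algorithm.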

Corollary 12.

When for every $ℓ \in [γ]$ the value $p_{ℓ} (ϵ, δ)$ is arbitrarily close to $0$ , the algorithm presented in this section achieves an expected number of oracle queries that is asymptotically optimal.

Proof.

When every $p_{ℓ} (ϵ, δ)$ is very close to $0$ , Lemma 10 gives $E [O P T_{ℓ}] \leq \frac{1}{δ}$ . Thus, by inequality (2) the expected number of queries is at most $\frac{γ (γ - 1)}{δ^{2}}$ , which matches the $Ω (γ^{2} / δ^{2})$ lower bound of Theorem 4 up to constant factors. ∎

3 Experimental Evaluation

We implemented all algorithms in Python 3.10.6 and ran our experiments on a personal laptop with Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 2.90 GHz and 16.0 GB memory.

Algorithms: We implemented the simple algorithm from Section 2.1, and the more intricate algorithm of Section 2.2. We refer to the former as Simple-Alg, and to the latter as QueryOpt-Alg. For the training phase of QueryOpt-Alg, we set the dilation factor at line 2 of Algorithm 1 to $12$ instead of $8$ . The reason is that a brief experimental investigation showed that this choice strikes a good balance between accuracy guarantees and oracle queries.

Number of demographic groups: All our experiments are for two groups, i.e., $γ = 2$ . The following reasons justify this decision. First, this case captures the essence of our algorithmic results; the $γ > 2$ case can be viewed as running the algorithm for $γ = 2$ multiple times, once for each pair of groups. Second, the achieved confidence and accuracy of our algorithms are completely independent of $γ$ .

Similarity functions: In all our experiments the feature space is $R^{d}$ , where $d \in N$ is case-specific. To define similarity functions, we adopt a paradigm of feature importance (Niño-Adan et al., 2021). Specifically, we assume that all similarity functions $d_{1}, d_{2}$ and $σ$ are weighted Euclideans:

	$d_{ℓ} (x, x^{'})$	$= \sqrt{\sum_{i \in [d]} α_{i, ℓ} (x_{i} - x_{i}^{'})^{2}}$
	$σ (x, y)$	$= \sqrt{\sum_{i \in [d]} β_{i} (x_{i} - y_{i})^{2}}$

The weight $α_{i, ℓ} \geq 0$ measures how significant feature $i$ is for comparisons within group $ℓ \in \{1, 2\}$ . Similarly, the weight $β_{i} \geq 0$ corresponds to how important feature $i$ is when we are comparing elements across groups. The larger a weight is, the more significant that feature is for the comparison at hand. Finally, we chose Euclidean metrics due to their widespread use as accurate similarity/distance functions.

For the across-groups weights we also assume that $β_{i} \leq min {α_{i, 1}, α_{i, 2}}$ for every $i \in [d]$ , i.e., the significance of a feature for across-groups comparisons cannot be more than its significance for within group comparisons. To justify this assumption think of the following scenario. Suppose that for some feature $i$ we have $α_{i, 1} = 0$ . In other words, this feature can be seen as totally irrelevant for the first group, i.e., it provides no information at all for the nature of the elements of this group. Hence, it is reasonable to also disregard $i$ when we compare an element of the first group with an element of the second group, i.e., $β_{i}$ should be $0$ as well.

The functions $d_{1}$ and $d_{2}$ are metric due to being weighted Euclideans. We also need to show that $σ$ satisfies the across-groups metric properties imposed in Section 1.1. Since $σ$ is a metric, for any $x, y, z \in R^{d}$ we have:

σ (x, y) \leq σ (x, z) + σ (z, y)

(3)

Using the assumption that $β_{i} \leq min {α_{i, 1}, α_{i, 2}}$ we get:

	$σ (x, z)$	$= \sqrt{\sum_{i \in [d]} β_{i} (x_{i} - z_{i})^{2}} \leq d_{1} (x, z)$		(4)
	$σ (z, y)$	$= \sqrt{\sum_{i \in [d]} β_{i} (z_{i} - y_{i})^{2}} \leq d_{2} (z, y)$		(5)

Combining (3) and (4) proves property $M_{1}$ for $σ$ . Similarly, combining (3) and (5) proves property $M_{2}$ .

In each experiment we choose the importance weights as follows. First, each $α_{i, ℓ}$ for $i \in [d]$ and $ℓ \in \{1, 2\}$ is chosen independently, and it is drawn uniformly at random from $[0, 1]$ . Afterwards, we choose all $β_{i}$ independently, with each $β_{i}$ drawn uniformly at random from $[0, \min \{α_{i, 1}, α_{i, 2}\}]$ . Obviously, the algorithms do not have access to the vector $β$ , which is only used to simulate the oracle and compare our predictions with the corresponding true values.
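The weight construction and the weighted Euclidean functions described above can be sketched with NumPy as follows. This is an illustrative sketch rather than the code used in the experiments; the function names `sample_weights` and `weighted_euclidean` are hypothetical.

```python
import numpy as np

def sample_weights(d, rng):
    """Draw alpha (one row per group) and the across-groups weights beta."""
    # alpha[l, i] ~ Uniform[0, 1], independently for each group and feature.
    alpha = rng.uniform(0.0, 1.0, size=(2, d))
    # beta[i] ~ Uniform[0, min(alpha[0, i], alpha[1, i])].
    beta = rng.uniform(0.0, alpha.min(axis=0))
    return alpha, beta

def weighted_euclidean(w, x, y):
    """sqrt(sum_i w_i * (x_i - y_i)^2), used for d_1, d_2 and sigma."""
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))
```

With these weights, `weighted_euclidean(beta, x, z)` never exceeds `weighted_euclidean(alpha[0], x, z)` or `weighted_euclidean(alpha[1], x, z)`, which is exactly what inequalities (4) and (5) state.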

3.1 Experiments on Synthetic Data

We believe that in our applications of interest, the data would exhibit high concentration around certain archetypal feature vectors. For instance, recall the scenario from the introduction, where we had to compare students for college admissions. Given some demographic group, it is natural to assume that there are certain prototypical student profiles, and each student falls close to one of them. For example, consider archetypal vectors each corresponding to one of the following profiles: exceptional student, very good student, good student, average student, below average student, dropout.

Data Generation for Each Group $ℓ \in \{1, 2\}$ : The best way to capture the aforementioned data behaviour is having the group distribution $D_{ℓ}$ be a mixture of multivariate Gaussians (Carrasco, 2019). Specifically, we consider $16$ multivariate Gaussians $N (μ_{i}^{ℓ}, Σ_{i}^{ℓ})$ in $R^{20}$ ; $i \in [16]$ (we believe that the number of archetypal elements, each corresponding to a Gaussian, is usually small). As for the mixing weights, we set $π_{i}^{ℓ} = \frac{1}{2^{i}}$ for $i \in [15]$ and $π_{16}^{ℓ} = \frac{1}{2^{15}}$ , so that the weights sum to $1$ . This choice tries to capture the fact that the prototypical feature vectors are not equally likely, and some are far more common than others, e.g., average students are more common than exceptional ones. As for the mean $μ_{i}^{ℓ}$ of each Gaussian $i$ , it is chosen uniformly at random from $[0, 10]^{20}$ . Further, to define the covariance matrices $Σ_{i}^{ℓ} \in R^{20 \times 20}$ we do the following. First, we assume that the features are independent, and thus each $Σ_{i}^{ℓ}$ can only have non-zero values on the diagonal. Then, given a value $U_{v a r}$ that serves as an upper bound for the variances, we choose each diagonal value of $Σ_{i}^{ℓ}$ uniformly at random from $[0, U_{v a r}]$ . Intuitively, the smaller $U_{v a r}$ is, the more concentrated the Gaussians. The previous process is performed independently for the construction of each $D_{ℓ}$ .
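The data generation just described can be sketched with NumPy as follows. This is a minimal sketch under the parameters stated above ( $16$ components, dimension $20$ , diagonal covariances); the function names `make_group` and `sample_group` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def make_group(rng, u_var, k=16, dim=20):
    """Build one group distribution: mixing weights, means, diagonal variances."""
    # pi_i = 2^{-i} for i = 1..k-1 and pi_k = 2^{-(k-1)}, so the weights sum to 1.
    weights = np.array([2.0 ** -(i + 1) for i in range(k - 1)] + [2.0 ** -(k - 1)])
    means = rng.uniform(0.0, 10.0, size=(k, dim))        # mu_i uniform in [0, 10]^dim
    variances = rng.uniform(0.0, u_var, size=(k, dim))   # diagonal of Sigma_i
    return weights, means, variances

def sample_group(rng, group, n):
    """Draw n points: pick a component per point, then add independent Gaussian noise."""
    weights, means, variances = group
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + np.sqrt(variances[comp]) * noise
```

Calling `make_group` twice with the same generator yields two independently constructed distributions, one per group.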

Range of $U_{v a r}$ : We run all experiments for data produced by every value of $U_{v a r}$ in ${0.5, 1, 2, 4}$ . Our algorithms are expected to perform better for smaller values of $U_{v a r}$ , since those yield more concentrated distributions, which naturally imply smaller $p_{ℓ} (ϵ, δ)$ . This behaviour was indeed observed in our simulations. Due to space constraints, here we only provide results for $U_{v a r} = 2$ . For the rest see Appendix B.

Choosing the accuracy parameter $ϵ$ : To use a meaningful value for $ϵ$ , we need to know the order of magnitude of $σ (x, y)$ . We calculated the value of $σ (x, y)$ over $1000$ trials, where the randomness came from multiple sources, i.e., the randomized constructions of $D_{1}$ and $D_{2}$ , the random choices for the feature weights, and of course the random sampling of $x$ and $y$ . In the end, we saw that $σ (x, y)$ is always greater than $3$ and highly concentrated in $[7, 8)$ ; see Appendix B. Thus, choosing $ϵ = 0.1$ is a reasonable option.

Confidence $δ$ and number of samples: We run all simulations for every value of $δ$ in ${0.1, 0.01, 0.001, 0.0001}$ , and in each case we use $\frac{1}{δ} log \frac{1}{δ^{2}}$ samples from each group.
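For concreteness, the sample count $\frac{1}{δ} log \frac{1}{δ^{2}}$ can be computed as below. The text does not state the base of the logarithm or how fractional values are rounded, so the natural logarithm and rounding up are assumptions of this sketch, and `num_samples` is a hypothetical helper name.

```python
import math

def num_samples(delta):
    """Samples per group: (1/delta) * log(1/delta^2).

    Assumptions: natural logarithm, result rounded up to an integer.
    """
    return math.ceil((1.0 / delta) * math.log(1.0 / delta ** 2))
```

Under these assumptions the count grows slightly faster than $1 / δ$ as $δ$ shrinks.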

Testing: We test our algorithms over $1000$ trials, where each trial consists of independently sampling $x \sim D_{1}$ and $y \sim D_{2}$ , and then inputting $x, y$ to our predictors. We are interested in two metrics. The first is the relative error percentage; if $p$ is our prediction for elements $x, y$ and $t$ is their true similarity value, the relative error percentage is $100 \cdot | p - t | / t$ . Figure (1(a)) shows the empirical distribution of this error for Simple-Alg, for every value of $δ$ . Figure (1(b)) shows the same statistics for QueryOpt-Alg. The second metric we consider is the absolute error divided by $ϵ$ ; if $p$ is our prediction for two elements and $t$ is their true similarity value, this metric is $| p - t | / ϵ$ . We are interested in this because our theoretical guarantees are of the form $| f (x, y) - σ (x, y) | = O (ϵ)$ . Figure (2(a)) shows the empirical distribution of the (absolute error)/ $ϵ$ for Simple-Alg, for every value of $δ$ . Figure (2(b)) shows the same statistics for QueryOpt-Alg.
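The two evaluation metrics defined above are straightforward to compute over a batch of trials. The following sketch uses hypothetical function names and assumes predictions and true values are given as equal-length arrays with strictly positive true values.

```python
import numpy as np

def relative_error_pct(pred, true):
    """100 * |p - t| / t for each trial (true values assumed > 0)."""
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return 100.0 * np.abs(pred - true) / true

def abs_error_over_eps(pred, true, eps):
    """|p - t| / eps for each trial."""
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    return np.abs(pred - true) / eps
```

For instance, a prediction of $7.7$ against a true value of $7.0$ with $ϵ = 0.1$ has a relative error of $10\%$ and an (absolute error)/ $ϵ$ of $7$ .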

Conclusions: First of all, we see that the smaller $δ$ is, the better the algorithms perform. Further, Simple-Alg performs marginally better than QueryOpt-Alg. Nonetheless, this small edge does not make Simple-Alg the better choice, because QueryOpt-Alg uses substantially fewer queries. Specifically, compared to the queries used by Simple-Alg, we observed the following for QueryOpt-Alg:

For $δ = 0.1$ we saw a $2.13\%$ decrease in used queries.
For $δ = 0.01$ we saw a $28.44\%$ decrease in queries.
For $δ = 0.001$ we saw a $71.41\%$ decrease in the queries.
For $δ = 0.0001$ we saw a $93.65\%$ decrease in the queries.

This improvement as $δ$ decreases is natural. The smaller $δ$ is, the more samples we have, and thus the more flexibility there is for the clustering decisions in Algorithm 1.

Overall, for both types of errors the algorithms perform well. In particular, when $δ = 10^{- 3}$ or $δ = 10^{- 4}$ the guarantees are remarkably strong.

3.2 Experiments on Real Data

We used 2 datasets from the UCI ML Repository (Dua and Graff, 2017), namely Adult with 48,842 points (Kohavi, 1996) and Creditcard with 30,000 points (Yeh and Lien, 2009). For both datasets, we kept numerical features as they were and transformed each categorical feature as follows. Each category is assigned an integer in $\{1, \ldots, \# \text{categories}\}$ , and this integer is used in place of the category in the feature vector (Ding et al., 2021). Also, we standardized all features so that they have approximately the same order of magnitude.
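The preprocessing described above can be sketched as follows. The text does not specify how categories are ordered or which standardization is used, so this sketch assumes categories are numbered in sorted order and that standardization means subtracting the mean and dividing by the standard deviation per feature. The function names are hypothetical.

```python
import numpy as np

def encode_categorical(column):
    """Map each category to an integer in {1, ..., #categories}.

    Assumption: categories are numbered in sorted order.
    """
    categories = sorted(set(column))
    code = {c: i + 1 for i, c in enumerate(categories)}
    return np.array([code[v] for v in column], dtype=float)

def standardize(X):
    """Per-feature z-score (assumed form of standardization)."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled to avoid dividing by zero
    return (X - X.mean(axis=0)) / std
```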

Choosing the Two Groups: For both datasets, we defined groups based on marital status. Specifically, the first group corresponds to points that are married individuals, and the second corresponds to points that are not married (singles, divorced and widowed are merged together). Afterwards, for each group we chose a random permutation of its points, and every time we needed to sample from either group, we simply took the next element in the permutation.
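Sampling by walking through a fixed random permutation amounts to sampling without replacement. A minimal sketch of such a sampler is given below; the class name `PermutationSampler` and its interface are hypothetical.

```python
import numpy as np

class PermutationSampler:
    """Serve the points of one group in a fixed random order, never repeating."""

    def __init__(self, points, rng):
        self.points = np.asarray(points)
        self.order = rng.permutation(len(self.points))  # random permutation of indices
        self.cursor = 0                                 # position of the next unused point

    def draw(self, n=1):
        if self.cursor + n > len(self.order):
            raise ValueError("not enough unused points left in this group")
        idx = self.order[self.cursor:self.cursor + n]
        self.cursor += n
        return self.points[idx]
```

Because no point is served twice, the points used for training and those used for testing are disjoint, which is why the group sizes limit the usable values of $δ$ .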

Choosing the accuracy parameter $ϵ$ : We calculated the value of $σ (x, y)$ over $5000$ trials, where the randomness came from multiple sources, i.e., the random choices for the feature weights, and the sampling of $x$ and $y$ from their respective random permutations. In the end, we saw that for both datasets $σ (x, y)$ is always large enough to justify setting $ϵ = 0.1$ ; see Appendix B for figures on $σ (x, y)$ .

Confidence $δ$ and number of samples: For Adult we run all simulations for every value of $δ$ in $\{0.1, 0.01, 0.001\}$ , and we used $\frac{1}{δ} log \frac{1}{δ^{2}}$ samples from each group. For Creditcard we had to be a bit more careful, because it contains fewer points. Specifically, for Creditcard we run all simulations for every value of $δ$ in $\{0.1, 0.01, Δ\}$ , using $\frac{1}{δ} log \frac{1}{δ^{2}}$ samples from each group. Here $Δ \approx 0.001$ and $\frac{1}{Δ} log \frac{1}{Δ^{2}} + 1000$ is less than the size of the smaller of the two groups, guaranteeing that there are enough points for both sampling and the $1000$ test trials.

Testing: We test our algorithms over $1000$ trials, where each trial consists of independently sampling two elements $x, y$ , one from each group, and then inputting those to our predictors. We are again interested in the same two metrics as in Section 3.1. Figures (3) and (4) show the performance of the algorithms on Adult. Due to space constraints our results on Creditcard are moved to Appendix B.

Conclusions: We reach the same conclusions as earlier, mainly observing that our algorithms for $δ = 0.01$ perform very well. As for the queries of QueryOpt-Alg compared to the queries of Simple-Alg, we observed the following:

For $δ = 0.1$ we saw a $12.36\%$ decrease in used queries.
For $δ = 0.01$ we saw a $52.34\%$ decrease in queries.
For $δ = 0.001$ we saw a $87.64\%$ decrease in the queries.

Acknowledgements

Disclaimer: This paper was prepared for informational purposes in part by the Artificial Intelligence Research group of JPMorgan Chase & Co. and its affiliates (“JP Morgan”), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

References

A. Bellet, A. Habrard, and M. Sebban (2013) A Survey on Metric Learning for Feature Vectors and Structured Data. Research Report, Laboratoire Hubert Curien UMR 5516.
O. C. Carrasco (2019) Gaussian mixture models explained. https://towardsdatascience.com/.
D. Chakrabarti, J. P. Dickerson, S. A. Esmaeili, A. Srinivasan, and L. Tsepenekas (2022) A new notion of individually fair clustering: $α$-equitable $k$-center. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, G. Camps-Valls, F. J. R. Ruiz, and I. Valera (Eds.), Proceedings of Machine Learning Research, Vol. 151, pp. 6387–6408.
F. Ding, M. Hardt, J. Miller, and L. Schmidt (2021) Retiring adult: new datasets for fair machine learning. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. W. Vaughan (Eds.), Vol. 34, pp. 6478–6490.
D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12.
A. Frome, Y. Singer, F. Sha, and J. Malik (2007) Learning globally-consistent local distance functions for shape-based image retrieval and classification. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8.
D. S. Hochbaum and D. B. Shmoys (1985) A best possible heuristic for the k-center problem. Mathematics of Operations Research 10 (2), pp. 180–184. ISSN 0364765X, 15265471.
C. Ilvento (2019) Metric learning for individual fairness. arXiv:1906.00250.
K. G. Jamieson and R. D. Nowak (2011) Low-dimensional embedding using adaptively selected ordinal data. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1077–1084.
C. Jung, M. Kearns, S. Neel, A. Roth, L. Stapleton, and Z. S. Wu (2019) An algorithmic framework for fairness elicitation. arXiv:1905.10660.
M. P. Kim, O. Reingold, and G. N. Rothblum (2018) Fairness through computationally-bounded awareness. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, pp. 4847–4857.
R. Kohavi (1996) Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pp. 202–207.
B. Kulis (2013) Metric learning: a survey.
P. Lahoti, K. P. Gummadi, and G. Weikum (2019) Operationalizing individual fairness with pairwise fair representations. Proc. VLDB Endow. 13 (4), pp. 506–518. ISSN 2150-8097.
P. Moutafis, M. Leng, and I. A. Kakadiaris (2017) An overview and empirical comparison of distance metric learning methods. IEEE Transactions on Cybernetics 47 (3), pp. 612–625.
D. Mukherjee, M. Yurochkin, M. Banerjee, and Y. Sun (2020) Two simple ways to learn individual fairness metrics from data. In Proceedings of the 37th International Conference on Machine Learning, ICML’20.
I. Niño-Adan, D. Manjarres, I. Landa-Torres, and E. Portillo (2021) Feature weighting methods: a review. Expert Systems with Applications 184, pp. 115424. ISSN 0957-4174.
J. L. Suárez-Díaz, S. García, and F. Herrera (2018) A tutorial on distance metric learning: mathematical foundations, algorithms, experimental analysis, prospects and challenges (with appendices on mathematical background and detailed algorithms explanation). arXiv:1812.05944.
O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. T. Kalai (2011) Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, pp. 673–680. ISBN 9781450306195.
L. van der Maaten and K. Weinberger (2012) Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6.
H. Wang, N. Grgic-Hlaca, P. Lahoti, K. P. Gummadi, and A. Weller (2019) An empirical study on learning fairness metrics for compas data with human supervision. arXiv:1910.10255.
M. Wilber, I. Kwak, and S. Belongie (2014) Cost-effective hits for relative similarity comparisons. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 2 (1), pp. 227–233.
I. Yeh and C. Lien (2009) The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36, pp. 2473–2480.
G. Yona and G. Rothblum (2018) Probably approximately metric-fair learning. In International Conference on Machine Learning, pp. 5680–5688.

Appendix A Additional Motivating Applications

Individual Fairness: Suppose that some college committee wants to provide scholarships to student athletes, in a way that similar candidates will have the same chance of receiving the scholarship. For this case, computing the similarity of two students that are athletes of the same sport appears relatively easy, since their respective feature vectors will most certainly contain the necessary information for the comparison at hand. However, computing the similarity of athletes across different sports can be far more challenging. For instance, how can one compare a basketball player to a swimmer? In such a situation, it might also be the case that the feature vectors for the two students do not even contain the same subset of features.

Clustering: Consider an investor who has a set of potential investments, and wants to partition them into clusters of high intra-group similarity. Investments that are similar are thought of as having the same profit potential, and clustering the given set of investments might be useful for downstream analysis. In this context, the investor needs access to a distance metric in order to perform any clustering procedure. However, computing this metric might be tricky. For one thing, comparing investments that involve the same market should be relatively straightforward, e.g., comparing two investments in the housing market. On the other hand, comparing investments in different markets appears to be a much more cumbersome task, e.g., trying to find how similar a housing investment and a cryptocurrency one are.

Appendix B Additional Experimental Results

Figures (5)-(10) demonstrate the empirical distribution of $σ (x, y)$ in every case of interest, which is needed in order to justify our choice of setting $ϵ = 0.1$ in all of our experiments. Figures (11)-(12) show the performance of our algorithms for synthetic data produced with $U_{v a r} = 1$ . Figures (13)-(14) show the performance of our algorithms for synthetic data produced with $U_{v a r} = 4$ . Finally, figures (15)-(16) show the performance of our algorithms on the Creditcard dataset. As for the queries of QueryOpt-Alg compared to the queries of Simple-Alg, we observed the following:

Synthetic data with $U_{v a r} = 0.5$ : For $δ = 0.1$ we saw a $92.94\%$ decrease in used queries. For $δ = 0.01$ we saw a $99.81\%$ decrease. For $δ = 0.001$ we saw a $99.9898\%$ decrease, and for $δ = 0.0001$ the decrease was $99.99\%$ .
Synthetic data with $U_{v a r} = 1$ : For $δ = 0.1$ we saw a $51.11\%$ decrease in used queries. For $δ = 0.01$ we saw a $94.07\%$ decrease. For $δ = 0.001$ we saw a $99.38\%$ decrease, and for $δ = 0.0001$ the decrease was $99.95\%$ .
Synthetic data with $U_{v a r} = 4$ : For $δ = 0.1$ we saw a $4.26\%$ decrease in used queries. For $δ = 0.01$ we saw a $16.18\%$ decrease. For $δ = 0.001$ we saw a $51.34\%$ decrease, and for $δ = 0.0001$ the decrease was $82.04\%$ .
Creditcard dataset: For $δ = 0.1$ we saw a $6.29\%$ decrease in used queries. For $δ = 0.01$ we saw a $48.45\%$ decrease, and for $δ = 0.001$ the decrease was $79.83\%$ .