Comparing Apples to Oranges:
Learning Similarity Functions for Data Produced by Different Distributions

Leonidas Tsepenekas1,2, Ivan Brugere1
Abstract

Similarity functions measure how comparable pairs of elements are, and play a key role in a wide variety of applications, e.g., Clustering problems and considerations of Individual Fairness. However, access to an accurate similarity function should not always be considered guaranteed. Specifically, when the elements to be compared are produced by different distributions, or in other words belong to different “demographic” groups, knowledge of their true similarity might be very difficult to obtain. In this work, we present a sampling framework that learns these across-groups similarity functions, using only a limited amount of experts’ feedback. We show analytical results with rigorous bounds, and empirically validate our algorithms via a large suite of experiments.

\affiliations

1 JPMorgan Chase & Co, 2 University of Maryland, College Park
ltsepene@umd.edu, ivan.brugere@jpmchase.com

1 Introduction

Given a feature space , a similarity function measures how comparable any pair of elements are. The function can also be interpreted as a distance function, where the smaller is, the more similar and are. Such functions are crucially used in a variety of AI/ML problems, and in each such case is assumed to be known. Two of the most prominent applications where similarity functions have a central role are clustering and considerations of individual fairness. In clustering, the similarity function serves as the metric space in which we need to create the appropriate clusters. As for the well-known notion of individual fairness introduced in the seminal work of Dwork et al. (2012), its goal is treating similar individuals similarly. Hence, any algorithm that needs to abide by this concept of fairness, should be able to access the similarity value for every pair of individuals that are of interest.

Nonetheless, it is not realistic to assume that a reliable and accurate similarity function is always given. This issue was even raised in the work of Dwork et al. (2012), where it was acknowledged that the computation of is not trivial, and thus should be deferred to third parties. The starting point of our work here is the observation that there exist scenarios where computing similarity can be assumed as easy (in other words given), while in other cases this task would be significantly more challenging. Specifically, we are interested in scenarios where there are multiple distributions that produce elements of . We loosely call each such distribution a “demographic” group, interpreting it as the stochastic way in which members of this group are produced. In this setting, computing the similarity value of two elements that are produced according to the same distribution, seems intuitively much easier compared to computing similarity values for elements belonging to different groups. We next present a few motivating examples that clarify this statement.

Individual Fairness: Consider a college admissions committee that needs access to an accurate similarity function for students, so that it provides a similar likelihood of acceptance to similar applicants. Let us focus on the following two demographic groups. The first being students from affluent families living in privileged communities and having access to the best quality schools and private tutoring. The other group would consist of students from low-income families, coming from a far less privileged background. Given this setting, directly comparing two students based on their features (e.g., SAT scores, strength of school curriculum, number of recommendation letters, etc), can be a fair way to elicit their similarity only if the students belong to the same group. It is clear that this trivial approach might hurt less privileged students when they are compared to students of the first group. This is because such a simplistic way of measuring similarity does not reflect potential that is undeveloped due to unequal access to resources. Hence, accurate across-groups comparisons that take into account such delicate issues, appear considerably more intricate.

Clustering: Suppose that a marketing company has a collection of user data that wants to cluster, with its end goal being a downstream market segmentation analysis. However, as it is usually the case, the data might come from different sources, e.g., data from private vendors and data from government bureaus. In this scenario, each data source might have its own way of representing user information, e.g., each source might use a unique subset of features. Therefore, eliciting the distance metric required for the clustering task should be straightforward for data coming from the same source, while across-sources distances would certainly require extra care, e.g., how can one extract the distance of two vectors containing different sets of features?

For more applications we refer the reader to Appendix A.

As suggested by earlier work on computing similarity functions (Ilvento, 2019), when obtaining similarity values is an overwhelming task, we can employ the advice of domain experts. Such experts can be given any pair of elements, and in return produce their true similarity value. However, utilizing this experts’ advice can be thought of as very costly, and hence it should be used sparingly. For example, in the case of comparing students from different economic backgrounds, the admissions committee can reach out to regulatory bodies or civil rights organizations. Nonetheless, resorting to these experts for every student comparison that might arise, is not a sustainable solution. Therefore, our goal in this paper is to learn the across-groups similarity functions, using as few queries to experts as possible.

1.1 Formal Problem Definition

Let denote the feature space of elements. We assume that elements come from known “demographic” groups, where , and each group is governed by an unknown distribution over . We use to denote a randomly drawn from . Further, we use to denote that is an element in the support of , and thus is a member of group . Observe now that for a specific , we might have and , for . Hence, in our model group membership is important, and every time we are considering an element , we know which distribution produced , i.e., which group belongs to.

For every group there is an intra-group similarity function , such that for all we have representing the true similarity between . In addition, the smaller is, the more similar . Note here that the function is only used to compare members of group . Further, a common assumption for functions measuring similarity is that they are metric (Yona and Rothblum, 2018; Kim et al., 2018; Ilvento, 2019; Mukherjee et al., 2020; Wang et al., 2019); a function is metric if a) implies , b) and c) for all (triangle inequality). We also adopt the metric assumption for the function . Finally, based on the earlier discussion regarding computing similarity between elements of the same group, we assume that is known and given as part of the instance.

Moreover, for any two groups and there exists an unknown across-groups similarity function , such that for all and , represents the true similarity between . Again, the smaller is, the more similar the two elements, and for a meaningful use of we must make sure that is a member of group and a member of group . Finally, in order to imitate the metric nature of a similarity function, we impose the following properties on , which can be viewed as across-groups triangle inequalities:

  1. Property : for every and .

  2. Property : for every and .

Nonetheless, the collection of similarity values here does not axiomatically yield a valid metric space. This is due to the following. 1) If for and , then we do not necessarily have . 2) It is not always the case that for , . 3) It is not always the case that for , , .

However, not having the collection of similarity values necessarily produce a metric space is not a weakness of our model. On the contrary, we view this as one of its strongest aspects. For one thing, imposing a complete metric constraint on the case of intricate across-groups comparisons sounds unrealistic and very restrictive. Further, even though existing literature treats similarity functions as metric ones, the seminal work of Dwork et al. (2012) mentions that this should not always be the case. Hence, our model is more general than the current literature.

Goal of Our Problem: We want for any two groups to compute a function , such that is our estimate of similarity for any and . Specifically, we seek a PAC guarantee, where for any given accuracy and confidence parameters we have:

The subscript in the above probability corresponds to two independent random choices, one and one .

As for tools to learn , we only require two things. At first, for each group we want a set of i.i.d. samples from . Obviously, the total number of used samples should be polynomial in the input parameters, i.e., polynomial in and . Secondly, we require access to an expert oracle, which given any and for any and , returns the true similarity value . We refer to a single invocation of the oracle as a query. Since there is a cost to collecting expert feedback, an additional objective in our problem is minimizing the number of oracle queries.

1.2 Discussion of Our Results

In Section 2 we present our theoretical results. We begin with a simple learning algorithm achieving the following.

Theorem 1.

For any given parameters , we produce similarity functions for every and , such that:

The randomness here is of three independent sources. The randomness of the algorithm, a choice , and a choice . The algorithm requires samples from each group, and oracle queries.

For each group , we use to denote the probability of sampling an -rare element of . We define as -rare for , an element for which there is a less than chance of sampling with . Formally, is -rare iff , and . Intuitively, a rare element should be interpreted as an “isolated” member of the group, in the sense that it is at most -likely to encounter another element that is -similar to it.

Clearly, the error probability for and is only when . We hypothesize that in realistic distributions each should indeed be fairly small, and this hypothesis is actually validated by our experiments. The reason we believe this hypothesis to be true, is that very frequently real data demonstrate high concentration around certain archetypal elements. Hence, this sort of distributional density does not leave room for isolated elements in the rest of the space. Nonetheless, we also provide a strong no free lunch result involving the values , which shows that any practical algorithm necessarily depends on them. This result further implies that our algorithm’s error probabilities are almost optimal.

Theorem 2.

For every , any algorithm using finitely many samples, will yield with , ; randomness is over the independent choices , and the internal randomness of the algorithm.

Moving on, we focus on minimizing the oracle queries. By carefully modifying the earlier simple algorithm, we obtain a new algorithm with the following guarantees:

Theorem 3.

For any given parameters , we produce similarity functions for every and , such that:

is as defined in Theorem 1. Let . The algorithm requires samples from each group, and the number of oracle queries is at most

where and for each .

Based on Theorem 3, the queries of the improved algorithm are at most , which is exactly the number of queries in our earlier simple algorithm. However, the smaller the values are, the fewer queries in expectation. Our experimental results actually confirm that this algorithm always leads to a significant decrease in the used queries.

Our final theoretical result involves a lower bound on the number of queries required for learning.

Theorem 4.

For all , any learning algorithm producing functions with , needs queries.

Combining Theorems 3, 4 implies that when all are negligible, i.e., , the expected queries of the Theorem 3 algorithm are asymptotically optimal.

Finally, Section 3 contains our experimental evaluation, where through a large suite of simulations on both real and synthetic data we validate our theoretical findings.

1.3 Related Work

Metric learning is a very well-studied area (Bellet et al., 2013; Kulis, 2013; Moutafis et al., 2017; Suárez-Díaz et al., 2018). There is also an extensive amount of work on using human feedback for learning metrics in specific tasks, e.g., image similarity, low-dimensional embeddings (Frome et al., 2007; Jamieson and Nowak, 2011; Tamuz et al., 2011; van der Maaten and Weinberger, 2012; Wilber et al., 2014). However, since these works are either tied to specific applications or specific metrics, they are only distantly related to ours.

Our model is more closely related to the literature on trying to learn the similarity function from the fairness definition of Dwork et al. (2012). This concept of fairness requires treating similar individuals similarly. Thus, it needs access to a function that returns a non-negative value for any pair of individuals, and this value corresponds to how similar the individuals are. Specifically, the smaller the value the more similar the elements that are compared.

Even though the fairness definition of Dwork et al. (2012) is very elegant and intuitive, the main obstacle for adopting it in practice is the inability to easily compute or access the crucial similarity function. To our knowledge, the only papers that attempt to learn this similarity function using expert oracles like us, are Ilvento (2019); Mukherjee et al. (2020) and Wang et al. (2019). Ilvento (2019) addresses the scenario of learning a general metric function, and gives theoretical PAC guarantees. Mukherjee et al. (2020) give theoretical guarantees for learning similarity functions that are only of a specific Mahalanobis form. Wang et al. (2019) simply provide empirical results. The first difference between our model and these papers is that unlike us, they do not consider elements coming from multiple distributions. However, the most important difference is that these works only learn metric functions. In our case the collection of similarity values (from all and ) does not necessarily yield a complete metric space; see the discussion in Section 1.1. Hence, our problem learns more general functions.

Regarding the difficulty in computing similarity between members of different groups, we are only aware of a brief result by Dwork et al. (2012). In particular, given a metric over the whole feature space, they mention that can only be trusted for comparisons between elements of the same group, and not for across-groups comparisons. In order to achieve the latter for groups and , they find a new similarity function that approximates , while minimizing the Earthmover distance between the distributions . This is completely different from our work, since here we assume the existence of across-groups similarity values, which we eventually want to learn. On the other hand, the approach of Dwork et al. (2012) can be seen as an optimization problem, where the across-groups similarity values need to be computed in a way that minimizes some objective. Also, unlike our model, this optimization approach has a serious limitation, and that is requiring to be explicitly known.

Finally, since similarity as distance can be quite difficult to compute in practice, there has been a line of research that defines similarity using simpler, yet less expressive structures. Examples include similarity lists Chakrabarti et al. (2022), similarity graphs Lahoti et al. (2019) and ordinal relationships Jung et al. (2019).

2 Theoretical Results

We begin the section by presenting a simple algorithm with PAC guarantees, whose error probability is shown to be almost optimal. Later on, we focus on optimizing the oracle queries, and show an improved algorithm for this objective.

2.1 A Simple Learning Algorithm

Given any confidence and accuracy parameters respectively, our approach is summarized in the following. At first, for every group we need a set of samples that are chosen i.i.d. according to , such that . Then, for every distinct and , and for all and , we ask the expert oracle for the true similarity value . The next observation follows trivially.

Observation 5.

The algorithm uses samples, and queries to the oracle.

Suppose now that we need to compare any and . Our high level idea is that the properties and of (see Section 1.1), will actually allow us to use the closest element to in and the closest element to in as proxies. Thus, let and . The algorithm then sets

where is known from the earlier queries.

Before we proceed with the analysis of the algorithm, we need to recall some notation which was first introduced in Section 1.2. Consider any group . An element with is called an -rare element of , and also .

Theorem 6.

For any given parameters , the algorithm produces for every and , such that

where is as in Theorem 1.

Proof.

For two distinct groups and , consider what will happen when we are asked to compare some and . Properties and of imply

Note that when and , the above inequalities and the definition of yield . Thus, we just need upper bounds for and , since the previous analysis and a union bound give . In what follows we present an upper bound for . The same analysis gives an identical bound for .

Before we proceed to the rest of the proof, we have to provide an existential construction. For the sake of simplicity we will be using the term dense for elements of that are not -rare. For every that is dense, we define . Observe that the definition of dense elements implies for every dense . Next, consider the following process. We start with an empty set , and we assume that all dense elements are unmarked. Then, we choose an arbitrary unmarked dense element , and we place it in the set . Further, for every dense that is unmarked and has , we mark and set . Here the function maps dense elements to elements of . We continue this picking process until all dense elements have been marked. Since for any two and for , we have . Also, for every dense we have due to .

Now we are ready to upper bound .

The upper bound for is trivial. We next explain the computations for . For the transition between the first and the second line we use a proof by contradiction. Hence, suppose that for every , and let denote an arbitrary element of . Then, for any dense element we have . Back to the computations for , to get the third line we simply used a union bound. To get from the third to the fourth line, we used the definition of as a dense element, which implies that the probability of sampling any element of in one try is at least . The final bound is a result of numerical calculations using and .

To conclude the proof, observe that , and using a similar reasoning we also get . ∎

Observation 5 and Theorem 6 directly yield Theorem 1.

A potential criticism of the algorithm presented here, is that its error probabilities depend on . However, Theorem 2 shows that such a dependence is unavoidable.

Proof of Theorem 2.

Given any , consider the following instance of the problem. We have two groups represented by the distributions and . For the first group we have only one element belonging to it, and let that element be . In other words, every time we draw an element from that element turns out to be , i.e., . For the second group we have that every appears with probability , and . In other words, is a uniform distribution over a countably infinite set.

Now we define all similarity values. At first, the similarity function for will trivially be . For the second group, for every distinct we define . Obviously, for every we set . Observe that and are metric functions for their respective groups. As for the across-groups similarities, each for is chosen independently, and it is drawn uniformly at random from . Note that this choice of satisfies the necessary metric-like properties and that were introduced in Section 1.1.

Further, since , any is -rare:

The first equality is because the only element within distance from is itself. The last inequality is because we can always choose in the construction of the input such that ; we control while is given. Therefore, since all elements are -rare, we have .

Consider now any learning algorithm that produces an estimate function . For any , let us try to analyze the probability of having . For one thing, the probability of being in the sample set of the algorithm is , where is the number of used samples. Because the sampled set is finite, we have that the previous probability is practically :

Thus, since with absolute certainty will not be among the samples, the algorithm needs to learn via some other value , for being a sampled element of the second group. However, due to the construction of the values and are independent. This means that knowledge of any (with ) provides no information at all on . Hence, the best any algorithm can do is guess uniformly at random from . This yields . The latter trivially gives the desired bound through a mere integration over all . ∎

2.2 Optimizing the Number of Expert Queries

Here we modify the earlier algorithm in a way that improves the number of queries used. The high-level idea behind this improvement is the following. Given the sets of samples , instead of asking the oracle for all possible similarity values for every and every and , we would rather choose a set of representative elements for each group . Then, we would ask the oracle for the values for every , but this time only for every and . The choice of the representatives is inspired by the -center algorithm of Hochbaum and Shmoys (1985). Intuitively, the representatives of group will serve as similarity proxies for the elements of , such that each is assigned to a nearby via a mapping function . Hence, if is small enough, and are highly similar, and thus acts as a good approximation of . The full details for the construction of and are presented in Algorithm 1.

Suppose now that we need to compare some and . Our approach will be almost identical to that of Section 2.1. Once again, let and . However, unlike the simple algorithm of Section 2.1 that directly uses and , the more intricate algorithm here will rather use their proxies and . Our prediction will then be

where is known from the queries.

Input: Accuracy and confidence parameters . For every group , a set of i.i.d. samples chosen according to , such that .

1:  for each  do
2:      for each .
3:      and .
4:      for each .
5:     while  do
6:        Choose an arbitrary .
7:        .
8:        .
9:         for every .
10:        .
11:     end while
12:  end for
13:  For every and , and for every and , ask the oracle for the value and store it.
14:  For every , return the set and the function .
Algorithm 1 Training Phase
Theorem 7.

For any given parameters , the new algorithm produces for every and , such that

where is as in Theorem 1.

Proof.

For two distinct groups and , consider comparing some and . To begin with, let us assume that and . Furthermore, the execution of the algorithm implies , and thus the triangle inequality and the definitions of the sets give . Similarly we get . Eventually:

For notational convenience, let and . Then, the metric properties and of and the definition of yield

Overall, we proved that when and , we have . Finally, as shown in the proof of Theorem 6, the probability of not having and , i.e., the error probability, is at most . ∎

Since the number of samples used by the algorithm is easily seen to be , the only thing left in order to prove Theorem 3 is analyzing the number of oracle queries. To that end, for every group with its sampled set , we define the following Set Cover problem.

Definition 8.

Let for all . Find minimizing , with . We use to denote the optimal value of this problem. Using standard terminology, we say is covered by if , and is feasible if it covers all .

Lemma 9.

For every we have .

Proof.

Consider a group , and let be its optimal solution for the problem of Definition 8. We first claim that each with contains at most one element of . This is due to the following. For any , we have for all . In addition, the construction of trivially implies for all . Thus, no two elements of can be in the same with . Finally, since Definition 8 requires all to be covered, we have . ∎

Lemma 10.

Let . For each group we have with probability , and . The randomness here is over the samples .

Proof.

Consider a group . Initially, through Definition 8 it is clear that . For the second statement of the lemma we need to analyze in a more clever way.

Recall the classification of elements that was first introduced in the proof of Theorem 6. According to this, an element can either be -rare, or dense. Now we will construct a solution to the problem of Definition 8 as follows.

At first, let be the set of -rare elements of . We will include all of to , so that all -rare elements of are covered by . Further:

(1)

Moving on, recall the construction shown in the proof of Theorem 6. According to that, there exists a set of at most dense elements from , and a function that maps every dense element to an element , such that . Let us now define for each a set , and note that for all . Thus, for each with , we place in an arbitrary , and that gets all of covered. Finally, since the sets induce a partition of the dense elements of , covers all dense elements of .

Equation (1) and yield . Also, since is shown to be a feasible solution for problem of Definition 8, we get . ∎

The proof of Theorem 3 is concluded as follows. All pairwise queries for the elements in the sets are

(2)

where the inequality follows from Lemma 9. Finally, combining equation (2) and Lemma 10 gives the desired bound.

Remark 11.

The factor in the definition of at line 2 of Algorithm 1 is arbitrary. Actually, any factor would yield the same guarantees, with any changes in accuracy and queries being only of an order of magnitude.

Finally, we are interested in lower bounds on the queries required for learning. To that end, we present Theorem 4, which shows that any algorithm with accuracy and confidence needs queries.

Proof of Theorem 4.

We are given accuracy and confidence parameters respectively. For the sake of simplifying the exposition in the proof, let us assume that is an integer; all later arguments can be generalized in order to handle the case of .

We construct the following problem instance. We have two groups represented by the distributions and . In addition, for both of these groups we assume that the support of the corresponding distribution contains elements, and for every as well as for every .

For every let , and for every . Similarly, for every we set , and for every we set . The functions and are clearly metrics. As for the across-groups similarity values, each for and is chosen independently, and it is drawn uniformly at random from . Note that this choice of satisfies the necessary properties introduced in Section 1.1.

In this proof we are also focusing on a more special learning model. In particular, we assume that the distributions and are known. Hence, there is no need for sampling. The only randomness here is over the random arrivals and , where and are the elements that need to be compared. Obviously, the similarity function would still remain unknown to any learner. Finally, the number of queries required for learning in this model cannot be more than the queries required in the original model, and this is because this model is easier (special case of the original).

Consider now an algorithm with the error guarantees mentioned in the Theorem statement, and focus on a fixed pair with and . If the algorithm has queried the oracle for , it knows with absolute certainty. Let us study what happens when the algorithm has not queried the oracle for . In this case, because the values with and are independent, no query the algorithm has performed can provide any information for . Thus, the best the algorithm can do is uniformly at random guess a value in , and return that as the estimate for . Letting denote the event where no query is performed for , we have

where the randomness comes only from the algorithm.

For the sake of contradiction, suppose the algorithm uses queries. Since each is equally likely to appear for a comparison, the overall error probability is

Contradiction; the error probability must be . ∎

Corollary 12.

When for every the value is arbitrarily close to , the algorithm presented in this section achieves an expected number of oracle queries that is asymptotically optimal.

Proof.

When every is very close to , Lemma 10 gives . Thus, by inequality (2) the expected queries are . Theorem 4 concludes the proof. ∎

3 Experimental Evaluation

We implemented all algorithms in Python 3.10.6 and ran our experiments on a personal laptop with Intel(R) Core(TM) i7-7500U CPU @ 2.70GHz 2.90 GHz and 16.0 GB memory.

Algorithms: We implemented the simple algorithm from Section 2.1, and the more intricate algorithm of Section 2.2. We refer to the former as Simple-Alg, and to the latter as QueryOpt-Alg. For the training phase of QueryOpt-Alg, we set the dilation factor at line 2 of Algorithm 1 to instead of . The reason for this, is that a minimal experimental investigation revealed that this choice leads to a good balance between accuracy guarantees and oracle queries.

Number of demographic groups: All our experiments are for two groups, i.e., . The following reasons justify this decision. At first, this case captures the essence of our algorithmic results; the case can be viewed as running the algorithm for multiple times, one for each pair of groups. Secondly, the achieved confidence and accuracy of our algorithms are completely independent of .

Similarity functions: In all our experiments the feature space is , where is case-specific. To define similarity functions, we adopt a paradigm of feature importance (Niño-Adan et al., 2021). Specifically, we assume that all similarity functions and are weighted Euclideans:

The weight measures how significant feature is for comparisons within group . Similarly, the weight corresponds to how important feature is, when we are comparing elements across groups. The larger a weight is, the more significant that feature for the comparison at hand. Finally, we chose Euclidean metrics due to their widespread use as accurate similarity/distance functions.

For the across-groups weights we also assume that for every , i.e., the significance of a feature for across-groups comparisons cannot be more than its significance for within group comparisons. To justify this assumption think of the following scenario. Suppose that for some feature we have . In other words, this feature can be seen as totally irrelevant for the first group, i.e., it provides no information at all for the nature of the elements of this group. Hence, it is reasonable to also disregard when we compare an element of the first group with an element of the second group, i.e., should be as well.

The functions and are metric due to being weighted Euclideans. We also need to show that satisfies the across-groups metric properties imposed in Section 1.1. Since is a metric, for any we have:

(3)

Using the assumption that we get:

(4)
(5)

Combining (3) and (4) proves property for . Similarly, combining (3) and (5) proves property .

In each experiment we choose the importance weights as follows. At first, each for and is chosen independently, and it is drawn uniformly at random from . Afterwards, we choose all independently, with each drawn uniformly at random from . Obviously, the algorithms do not have access to the vector , which is only used to simulate the oracle and compare our predictions with the corresponding true values.

3.1 Experiments on Synthetic Data

(a) Simple-Alg
(b) QueryOpt-Alg
Figure 1: Distribution of the Relative Error Percentage
(a) Simple-Alg
(b) QueryOpt-Alg
Figure 2: Distribution of (Absolute Error)/

We believe that in our applications of interest, the data would exhibit high concentration around certain archetypal feature vectors. For instance, recall the scenario from the introduction, where we had to compare students for college admissions. Given some demographic group, it is natural to assume that there are certain prototypical student profiles, and each student falls close to one of them. For example, consider archetypal vectors each corresponding to one of the next profiles: exceptional student, very good student, good student, average student, below average student, dropout.

Data Generation for Each Group : The best way to capture the aforementioned data behaviour, is having the group distribution be a mixture of multivariate Gaussians (Carrasco, 2019). Specifically, we consider multivariate Gaussians in ; (we believe that the number of archetypal elements, each corresponding to a Gaussian, is usually small). As for the mixing weights, we set for and . This choice tries to capture the fact that the prototypical feature vectors are not equally likely, and some are far more common than others, e.g., average students are more than exceptional ones. As for the mean of each Gaussian , it is chosen uniformly at random from . Further, to define the covariance matrices we do the following. At first, we assume that the features are independent, and thus each can only have non-zero values in the diagonal. Then, given a value that serves as an upper bound for the variances, we choose each diagonal value of uniformly at random from . Intuitively, the smaller is, the more concentrated the Gaussians. The previous process is performed independently for the construction of each .

Range of : We run all experiments for data produced by every value of in . Our algorithms are expected to perform better for smaller values of , since those yield more concentrated distributions, which naturally imply smaller . This behaviour was indeed observed in our simulations. Due to space constraints, here we only provide results for . For the rest see Appendix B.

Choosing the accuracy parameter : To use a meaningful value for , we need to know the order of magnitude of . We calculated the value of over trials, where the randomness was of multiple factors, i.e., the randomized constructions of and , the random choices for the feature weights, and of course the random sampling of and . In the end, we saw that is always greater than and highly concentrated in ; see Appendix B. Thus, choosing is a reasonable option.

Confidence and number of samples: We run all simulations for every value of in , and in each case we use samples from each group.

Testing: We test our algorithms over trials, where each trial consists of independently sampling and , and then inputting to our predictors. We are interested in two metrics. The first is the relative error percentage; if is our prediction for elements and is their true similarity value, the relative error percentage is . Figure (0(a)) shows the empirical distribution of this error for Simple-Alg, for every value of . Figure (0(b)) shows the same statistics for QueryOpt-Alg. The second metric we consider is the absolute error divided by ; if is our prediction for two elements and is their true similarity value, this metric is . We are interested in this because our theoretical guarantees are of the form . Figure (1(a)) shows the empirical distribution of the (absolute error)/ for Simple-Alg, for every value of . Figure (1(b)) shows the same statistics for QueryOpt-Alg.

(a) Simple-Alg
(b) QueryOpt-Alg
Figure 3: Distribution of the Relative Error Percentage on Adult
(a) Simple-Alg
(b) QueryOpt-Alg
Figure 4: Distribution of (Absolute Error)/ on Adult

Conclusions: First of all, we see that the smaller is, the better the algorithms perform. Further, Simple-Alg performs marginally better than QueryOpt-Alg. Nonetheless, this tiny edge is not enough to justify a potential superiority of Simple-Alg, and the reason for this is the decrease in queries. Specifically, compared to the queries used by Simple-Alg, we observed the next for QueryOpt-Alg:

  1. For we saw a decrease in used queries.

  2. For we saw a decrease in queries.

  3. For we saw a decrease in the queries.

  4. For we saw a decrease in the queries.

This improvement as decreases is natural. The smaller is the more samples we have, and thus the more flexibility for the clustering decisions in Algorithm 1.

Overall, for both types of errors the algorithms perform significantly well. In particular, when or the guarantees are remarkably strong.

3.2 Experiments on Real Data

We used 2 datasets from the UCI ML Repository Dua and Graff (2017), namely Adult-48,842 points Kohavi (1996) and Creditcard-30,000 points Yeh and Lien (2009). For both datasets, we kept numerical features as they were and transformed each categorical feature as follows. Each category is assigned to an integer in , and this integer is used in place of the category in the feature vector (Ding et al., 2021). Also, we standardized all features so that they have approximately the same order of magnitude.

Choosing the Two Groups: For both datasets, we defined groups based on marital status. Specifically, the first group corresponds to points that are married individuals, and the second corresponds to points that are not married (singles, divorced and widowed are merged together). Afterwards, for each group we chose a random permutation of its points, and every time we needed to sample from either group, we were simply taking the next element in the permutation.

Choosing the accuracy parameter : We calculated the value of over trials, where the randomness was of multiple factors, i.e., the random choices for the feature weights, and the sampling of and from their respective random permutations. In the end, we saw that for both datasets is always large enough to justify setting ; see Appendix B for figures on .

Confidence and number of samples: For Adult we run all simulations for every value of in , and we used samples from each group. For Creditcard we had to be a bit more careful, because it contains fewer points. Specifically, for Creditcard we run all simulations for every value of in , using samples from each group. Here and is less than the size of the smallest of the two groups, guaranteeing that there are enough points for sampling and testing.

Testing: We test our algorithms over trials, where each trial consists of independently sampling two elements , one for each group, and then inputting those to our predictors. We are again interested in the same two metrics as in Section 3.1. Figures (3) and (4) show the performance of the algorithms on Adult. Due to space constraints our results on Creditcard are moved to Appendix B.

Conclusions: We reach the same conclusions as earlier, mainly observing that our algorithms for perform very well. As for the queries of QueryOpt-Alg compared to the queries of Simple-Alg, we observed the following:

  1. For we saw a decrease in used queries.

  2. For we saw a decrease in queries.

  3. For we saw a decrease in the queries.

Acknowledgements

Disclaimer: This paper was prepared for informational purposes in part by the Artificial Intelligence Research group of JPMorgan Chase & Co. and its affiliates (“JP Morgan”), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

References

  • A. Bellet, A. Habrard, and M. Sebban (2013) A Survey on Metric Learning for Feature Vectors and Structured Data. Research Report Laboratoire Hubert Curien UMR 5516. External Links: Link Cited by: §1.3.
  • O. C. Carrasco (2019) Gaussian mixture models explained. https://towardsdatascience.com/. External Links: Link Cited by: §3.1.
  • D. Chakrabarti, J. P. Dickerson, S. A. Esmaeili, A. Srinivasan, and L. Tsepenekas (2022) A new notion of individually fair clustering: -equitable -center. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, G. Camps-Valls, F. J. R. Ruiz, and I. Valera (Eds.), Proceedings of Machine Learning Research, Vol. 151, pp. 6387–6408. Cited by: §1.3.
  • F. Ding, M. Hardt, J. Miller, and L. Schmidt (2021) Retiring adult: new datasets for fair machine learning. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. W. Vaughan (Eds.), Vol. 34, pp. 6478–6490. External Links: Link Cited by: §3.2.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §3.2.
  • C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel (2012) Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, ITCS ’12. Cited by: §1.1, §1.3, §1.3, §1.3, §1, §1.
  • A. Frome, Y. Singer, F. Sha, and J. Malik (2007) Learning globally-consistent local distance functions for shape-based image retrieval and classification. In 2007 IEEE 11th International Conference on Computer Vision, Vol. , pp. 1–8. External Links: Document Cited by: §1.3.
  • D. S. Hochbaum and D. B. Shmoys (1985) A best possible heuristic for the k-center problem. Mathematics of Operations Research 10 (2), pp. 180–184. External Links: ISSN 0364765X, 15265471, Link Cited by: §2.2.
  • C. Ilvento (2019) Metric learning for individual fairness. arXiv. External Links: Document, Link, 1906.00250 Cited by: §1.1, §1.3, §1.
  • K. G. Jamieson and R. D. Nowak (2011) Low-dimensional embedding using adaptively selected ordinal data. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Vol. , pp. 1077–1084. External Links: Document Cited by: §1.3.
  • C. Jung, M. Kearns, S. Neel, A. Roth, L. Stapleton, and Z. S. Wu (2019) An algorithmic framework for fairness elicitation. arXiv. External Links: Document, Link, 1905.10660 Cited by: §1.3.
  • M. P. Kim, O. Reingold, and G. N. Rothblum (2018) Fairness through computationally-bounded awareness. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, pp. 4847–4857. Cited by: §1.1.
  • R. Kohavi (1996) Scaling up the accuracy of naive-bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pp. 202–207. Cited by: §3.2.
  • B. Kulis (2013) Metric learning: a survey. (), pp. . External Links: Document Cited by: §1.3.
  • P. Lahoti, K. P. Gummadi, and G. Weikum (2019) Operationalizing individual fairness with pairwise fair representations. Proc. VLDB Endow. 13 (4), pp. 506–518. External Links: ISSN 2150-8097, Link, Document Cited by: §1.3.
  • P. Moutafis, M. Leng, and I. A. Kakadiaris (2017) An overview and empirical comparison of distance metric learning methods. IEEE Transactions on Cybernetics 47 (3), pp. 612–625. External Links: Document Cited by: §1.3.
  • D. Mukherjee, M. Yurochkin, M. Banerjee, and Y. Sun (2020) Two simple ways to learn individual fairness metrics from data. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. Cited by: §1.1, §1.3.
  • I. Niño-Adan, D. Manjarres, I. Landa-Torres, and E. Portillo (2021) Feature weighting methods: a review. Expert Systems with Applications 184, pp. 115424. External Links: ISSN 0957-4174, Document, Link Cited by: §3.
  • J. L. Suárez-Díaz, S. García, and F. Herrera (2018) A tutorial on distance metric learning: mathematical foundations, algorithms, experimental analysis, prospects and challenges (with appendices on mathematical background and detailed algorithms explanation). arXiv. External Links: Document, Link, 1812.05944 Cited by: §1.3.
  • O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. T. Kalai (2011) Adaptively learning the crowd kernel. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 673–680. External Links: ISBN 9781450306195 Cited by: §1.3.
  • L. van der Maaten and K. Weinberger (2012) Stochastic triplet embedding. In 2012 IEEE International Workshop on Machine Learning for Signal Processing, Vol. , pp. 1–6. External Links: Document Cited by: §1.3.
  • H. Wang, N. Grgic-Hlaca, P. Lahoti, K. P. Gummadi, and A. Weller (2019) An empirical study on learning fairness metrics for compas data with human supervision. arXiv. External Links: Document, Link, 1910.10255 Cited by: §1.1, §1.3.
  • M. Wilber, I. Kwak, and S. Belongie (2014) Cost-effective hits for relative similarity comparisons. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 2 (1), pp. 227–233. External Links: Link Cited by: §1.3.
  • I. Yeh and C. Lien (2009) The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36, pp. 2473–2480. External Links: Document Cited by: §3.2.
  • G. Yona and G. Rothblum (2018) Probably approximately metric-fair learning. In International Conference on Machine Learning, pp. 5680–5688. Cited by: §1.1.

Appendix A Additional Motivating Applications

Individual Fairness: Suppose that some college committee wants to provide scholarships to student athletes, in a way that similar candidates will have the same chance of receiving the scholarship. For this case, computing the similarity of two students that are athletes of the same sport appears relatively easy, since their respective feature vectors will most certainly contain the necessary information for the comparison at hand. However, computing the similarity of athletes across different sports must be far more challenging. For instance, how can one compare a basketball player to a swimmer? In such a situation, it might also be the case that the feature vectors for the two students do not even contain the same subset of features.

Clustering: Consider an investor who has a set of potential investments, and wants to partition them into clusters of high intra-group similarity. Investments that are similar are thought of as having the same profit potential, and clustering the given set of investments might be useful for post-processing downstream analysis. In this context, the investor needs access to a distance metric in order to perform any clustering procedure. However, computing this metric might be tricky. For one thing, comparing investments that involve the same market should be relatively straightforward, e.g., comparing two investments in the housing market. On the other hand, comparing investments to different markets appears to be a much more cumbersome task, e.g., trying to find how similar a housing investment and a cryptocurrency one are.

Appendix B Additional Experimental Results

Figures (5)-(10) demonstrate the empirical distribution of in every case of interest, which is needed in order to justify our choice of setting in all of our experiments. Figures (11)-(12) show the performance of our algorithms for synthetic data produced with . Figures (13)-(14) show the performance of our algorithms for synthetic data produced with . Finally, figures (15)-(16) show the performance of our algorithms on the Creditcard dataset. As for the queries of QueryOpt-Alg compared to the queries of Simple-Alg, we observed the following:

  1. Synthetic data with : For we saw a decrease in used queries. For we saw a decrease. For we saw a decrease, and for the decrease was .

  2. Synthetic data with : For we saw a decrease in used queries. For we saw a decrease. For we saw a decrease, and for the decrease was .

  3. Synthetic data with : For we saw a decrease in used queries. For we saw a decrease. For we saw a decrease, and for the decrease was .

  4. Creditcard dataset: For we saw a decrease in used queries. For we saw a decrease, and for the decrease was .

Figure 5: Distribution of for synthetic data with
Figure 6: Distribution of for synthetic data with
Figure 7: Distribution of for synthetic data with
Figure 8: Distribution of for synthetic data with
Figure 9: Distribution of for Adult
Figure 10: Distribution of for Creditcard
(a) Simple-Alg
(b) QueryOpt-Alg
Figure 11: Distribution of the Relative Error Percentage for Synthetic Data with
(a) Simple-Alg
(b) QueryOpt-Alg
Figure 12: Distribution of (Absolute Error)/ for Synthetic Data with
(a) Simple-Alg
(b) QueryOpt-Alg
Figure 13: Distribution of the Relative Error Percentage for Synthetic Data with
(a) Simple-Alg
(b) QueryOpt-Alg
Figure 14: Distribution of (Absolute Error)/ for Synthetic Data with
(a) Simple-Alg
(b) QueryOpt-Alg
Figure 15: Distribution of the Relative Error Percentage on Creditcard
(a) Simple-Alg
(b) QueryOpt-Alg
Figure 16: Distribution of (Absolute Error)/ on Creditcard