Fair learning with Wasserstein barycenters for non-decomposable performance measures

Solenne Gaucher
Université Paris-Saclay, CNRS
Laboratoire de mathématiques d’Orsay
&Nicolas Schreuder
MaLGa, DIBRIS
Università di Genova &Evgenii Chzhen
Université Paris-Saclay, CNRS
Laboratoire de mathématiques d’Orsay
Abstract

This work provides several fundamental characterizations of the optimal classification function under the demographic parity constraint. In the awareness framework, akin to the classical unconstrained classification case, we show that maximizing accuracy under this fairness constraint is equivalent to solving a corresponding regression problem followed by thresholding at level . We extend this result to linear-fractional classification measures (e.g., -score, AM measure, balanced accuracy, etc.), highlighting the fundamental role played by the regression problem in this framework. Our results leverage recently developed connection between the demographic parity constraint and the multi-marginal optimal transport formulation. Informally, our result shows that the transition between the unconstrained problems and the fair one is achieved by replacing the conditional expectation of the label by the solution of the fair regression problem. Finally, leveraging our analysis, we demonstrate an equivalence between the awareness and the unawareness setups in the case of two sensitive groups.

1 Introduction

Our\blfootnote Equal contribution experience of life is increasingly and insidiously being influenced by algorithmic predictions. It is now well accepted that such predictions might replicate or even amplify societal biases and discrimination because of machine learning algorithms’ training process (barocas-hardt-narayanan). A key difficulty in overcoming the effect of those biases is the lack of a precise understanding of how statistical algorithms make predictions: these algorithms are often designed to minimize a user-specified data-dependent loss and yield a highly complex prediction rule, leaving practitioners—and theoreticians—unable to understand and explain the issued predictions. Our goal is to provide a sound and simple mathematical characterization of the prediction process in the presence of fairness constraints.

In this paper we study the demographic parity fairness constraint (calders2009building; barocas-hardt-narayanan) in the awareness framework—allowing the prediction rules to explicitly take the sensitive attribute as an input. Even though this constraint is relatively well understood from algorithmic perspective in both classification (Agarwal_Beygelzimer_Dubik_Langford_Wallach18; menon2018cost; zeng2022bayes; schreuder2021classification; yang2020fairness; jiang2019wasserstein; silvia2020general; feldman2015certifying; gordaliza2019obtaining) and regression (chzhen2020fair; chzhen2020fairTV; gouic2020price; jiang2019wasserstein; agarwal2019fair; chiappa2021fairness), the connection between the two setups remains opaque. The main goal of the current paper is to unveil it.

In contrast, in the traditional unconstrained learning setup, the relation between classification and its regression counterpart is well understood and can be found in all standard books on the subject (see, e.g., hastie2009elements; devroye2013probabilistic; james2013introduction; mohri2018foundations). For instance, the most standard result illustrating this connection states that if minimizes the squared risk, the classifier minimizes the misclassification error. Such results form the first building block of many theoretical and practical studies (see, e.g., Audibert_Tsybakov07; Yang99; massart2006risk; biau2008consistency). More recently, the connection between regression and classification was pushed even further. For instance, replacing the misclassification error by the -score (van1974foundation; Chinchor92), Zhao_Edakunni_Pocock_Brown13 showed that the solution of the associated regression problem still plays a crucial role as an -score maximizer can be obtained by properly thresholding the solution of the regression problem. Moreover, a recent thread of results establish this fundamental relation for a large variety of performance measures including AM measure, the Jaccard similarity coefficient, and G-mean to name a few (menon2013statistical; koyejo2014consistent; Koyejo_Natarajan_Ravikumar_Dhillon15; Yan_Koyejo_Zhong_Ravikumar18). Again, akin to the standard minimization of misclassification error problem, all these developments led to many theoretical and practical advances (see, e.g., jasinska2016extreme; chzhen2020optimal; narasimhan2015optimizing; kotlowski2016surrogate; bascol2019cost; boughorbel2017optimal). Interestingly, some works that consider group fairness constraints actually report -score as a performance measure in their empirical studies without actually tailoring an algorithm to optimize it directly (see, e.g., BiswasR20; BiswasR21; abs-2207-03277; wang2021analyzing; dablain2022towards; wick2019unlocking). A possible cause of this is the absence of characterization of fair (-score) optimal classifiers in the fairness literature. In this paper we fill this gap for the demographic parity constraint and a large class of performance measures.

Literature that treats group fairness notions is typically distinguished by two features: exact notion of fairness and access to the sensitive attribute at prediction time. While this work considers only demographic parity, we discuss both awareness and unawareness setups—allowing or not the access to the sensitive attribute at prediction time respectively. Unlike the case of awareness, in which a significant understanding has been achieved from theoretical perspective, the case of unawareness remains opaque with contributions mainly focusing on algorithmic constructions (see e.g., Agarwal_Beygelzimer_Dubik_Langford_Wallach18; agarwal2019fair; oneto2019general; Donini17; narasimhan2018learning). A notable work of lipton2018does puts forward several empirical evidences highlighting critical issues connected of the unawareness framework. Our work makes a step towards a more explicit and transparent description of the optimal classifier under the demographic parity constraint with unawareness by introducing a simple theoretical reduction scheme to the awareness setup for binary protected attribute. Consequently, our results support theoretically the empirical claims made by lipton2018does.

Contributions.

The goal of this work is to establish a link between the regression and classification problems under the demographic parity constraint. We make the following contributions to the study of algorithmic fairness:

  1. We show that, under mild assumptions, if minimizes the squared risk under the demographic parity constraint, then minimizes the probability of misclassification under the same constraint.

  2. We extend the above result to a large family of performance measures introduced in Koyejo_Natarajan_Ravikumar_Dhillon15 for unconstrained classification.

  3. In the case of a binary sensitive attribute, we provide a simple reduction scheme that transforms, in a optimal way, the unawareness setup into the awareness one.

The first two contributions show the fundamental role played by regression in the context of demographic parity constraint and are built using basic tools from univariate optimal transport theory. As an interesting consequence of our analysis, we show that the notion of strong demographic parity introduced by jiang2019wasserstein is equivalent to the usual demographic parity when a performance measure is minimized. The latter indicates that the post-hoc or the downstream threshold will never harm the demographic parity constraint. The last contribution constitutes a step towards the theoretical treatment of the unawareness setup—a problem that still remains open. Importantly, even though our results are stated in the fair learning setting, they imply new results in the general learning setting. In particular, our results allow to obtain the characterization of the optimal unconstrained classifier for a large class of classification performance measures.

2 Problem setup

Consider a triplet , following some joint distribution , consisting of the nominally non-sensitive and sensitive features, and the label, respectively. Classifiers are functions of the form and score functions take the form . The set of all classifiers is denoted by and the set of all score functions is denoted by . Before proceeding let us introduce additional notation that is related to the unknown distribution . We set and recall that minimizes the squared risk without any constraint. For each , we define . The central object of this work is the optimal fair score function, defined as: {highlighted}

(1)

An explicit expression for under standard assumptions was derived in (chzhen2020fair; gouic2020price) using the univariate optimal transport theory and the reduction of the problem in Eq. (1) to a multi-marginal optimal transport formulation. In particular, they showed that, under mild assumptions, there is a one-to-one correspondence between the problem in Eq. (1) and the problem

where is the Wasserstein-2 distance (villani2009optimal, Definition 6.1) and denotes the space of univariate probability measures with finite second moment. Denoting by the solution of the above problem, it was shown that

where is the optimal transport map from to . Up until now, unlike in the regression setting, it was not clear if a direct link between optimal transport and the fair binary classification problem existed–or even made sense. Our work shows that such a connection exists and that it is fundamental.

Notation.

Given a real-valued function , we denote by the univariate measure defined for all as . For any univariate measure , we denote by its cumulative distribution, and by its quantile function, given by . For any we set . For any probability measure on and a function , we denote by , the image measure of .

3 The misclassification risk: a warm-up

In this section, we begin by tackling the classical minimization of the misclassification risk problem and highlight the main novelties and advances with respect to previous works. To this end, we consider the following optimal (in terms of the misclassification risk) fair classifier {highlighted}

(2)

We work under the following assumption. {assumption} For every , assume that is continuous and supported on an interval. A slightly modified version of the above was used in the context of fairness in (chzhen2020fair; chzhen2020fairTV; gouic2020price; jiang2019wasserstein) and also also in the classical unconstrained classification with generalized performance measures (Yan_Koyejo_Zhong_Ravikumar18). In Section A, we relax the above assumption and provide a proof that unifies the awareness case considered just below with the unawareness case presented in Section 5, Theorem 5.

The first warm-up result is reminiscent of those recently obtained by (zeng2022bayes; schreuder2021classification). The proof based on the duality and is very similar to the classical Neyman-Pearson lemma. While it does not allow to immediately reach our goals, it gives several fundamental insights that were already invoked in previous works on the demographic parity constraint (lipton2018does; hardt2016equality). {theorem} Let Assumption 3 be satisfied. Then, an optimal fair classifier defined in Eq. (2) can be expressed for all as

where is a solution of

The main takeaway message from the above theorem is: under the stated assumption, the optimal fair classifier can be derived as a group-wise thresholding of the regression function , with thresholds eventually depending on the sensitive groups. For a similar statement without the continuity assumption, we refer the reader to zeng2022bayes who derived optimal randomized classifiers using the Neyman-Pearson lemma. Let us now provide a novel characterization of an optimal fair classifier. {mytheo}Wasserstein based fair optimal classifierequivalence Let Assumption 3 be satisfied. Then, an optimal fair classifier defined in Eq. (2) can be expressed for all as

Discussion.

The above result is instructive on its own—one can solve binary classification under the demographic parity constraint by solving the corresponding regression problem. We recall that (chzhen2020fair; gouic2020price) built a statistically consistent algorithm for the estimation of the latter. Furthermore, they showed that under the imposed assumptions,

feldman2015certifying proposed to transport the group-wise distribution of towards their common barycenter as a disparity removal strategy. Yet, a theoretical justification was missing and this approach remained a heuristic until the work of gordaliza2019obtaining who provided an upper bound on the excess risk in terms of the Wasserstein barycenter objective. Later, jiang2019wasserstein relied on the barycenter formulation involving the Earth Mover distance (rachev1998mass) and showed that a transport-based prediction results in a minimal perturbation post-processing. However, the use of the Earth Mover distance might result in non-uniqueness issues. Our Theorem LABEL:thm:equivalence gives a complete theoretical justification of the transport based fair classification algorithms. Theorem LABEL:thm:fair_optimal_LF in Section 4 further extends this connection to non-decomposable measures.

Besides, jiang2019wasserstein introduced a notion of strong demographic parity, which amounts to taking classifiers for which there exists a score function such that and . This notion was later used in (silvia2020general; chiappa2021fairness). Theorem LABEL:thm:equivalence implies that the optimal classifier under the demographic parity constraint satisfies, an a priori more restrictive fairness notion—the strong demographic parity. Indeed, any classifier that satisfies strong demographic parity is demographic parity fair. Hence, we have deduced the equivalence between the two definitions at the optimum. The notion of strong demographic parity introduced by jiang2019wasserstein can be seen in a downstream or post-hoc settings. That is, the learner first tries to fit a score function and only after a particular threshold is selected in a potentially non-stationary way. Strong demographic parity implies that any threshold selection made by the learner will yield a fair classifier. In that sense, our results show that building a score function via an optimal fair regression function is optimal for misclassification risk and, as we see in Section 4, for many other classification measures. Below we provide a simple proof of Theorem LABEL:thm:equivalence.

Illustration of Bayes and fair optimal classifiers. Illustration of Bayes and fair optimal classifiers. Illustration of Bayes and fair optimal classifiers.
Figure 1: Illustration of Bayes and fair optimal classifiers. Left: group-wise cumulative distributions of —the threshold is ; Middle: Illustration of Theorem LABEL:thm:equivalence—black solid line corresponds to ; Right: illustration of group-wise thresholds that correspond to the fair optimal classifier.
Proof of Theorem LABEL:thm:equivalence.

Theorem 3 implies that under Assumption 3 the optimal classifier is of the form for some . It follows from (van2000asymptotic, Lemma 21.1(iv)) and Assumption 3 that for almost all w.r.t. . Thus, it is sufficient to look at the classifiers of the form

or, equivalently, at  (van2000asymptotic, Lemma 21.1(i)). Now, the inverse transform theorem states that under Assumption 3, has the same distribution as conditionally on , for uniformly distributed on . Then,

where we have used that for all  (van2000asymptotic, Lemma 21.1(ii)). Thus, verifies the DP constraint if and only if does not depend on . Denoting by this constant, we find that the optimal fair classifier must be of the form . The risk of any such classifier is given by

(3)

Using again inverse transform theorem, Eq. (3) can be further simplified to the following expression:

(4)

Under Assumption 3, for all . Thus, Eq. (4) reduces to

This function is minimized at which satisfies

(5)

and the optimal classifier under the demographic parity constraints is given by . Taking into account the condition satisfied by , we conclude. ∎

The proof itself is rather instructive and gives rise to the following interpretation. {myremark}Ranking interpretation of the fair optimal classification strategySolennes_final The proof of Theorem LABEL:thm:equivalence reveals that the optimal fair classifier can be written as , where is given by (5). Recall that is the quantile function of the Wasserstein-2 barycenter of measures , weighted by  (see, e.g., agueh2011barycenters, Section 6.1). Thus, denoting this barycenter by , can be alternatively expressed as

The last display shows that while the thresholds of differ across groups (as per Theorem 3), this threshold sensitive-group independent if viewed from the perspective of group-wise ranking. Putting it simply, if , then the best individuals from each group get classified positively. This property reflects the notion of rational ordering (lipton2018does) that follows from order preservation property of  (see chzhen2020minimax, Section 4). Figure 1 provides a graphical illustration of the above observations.

We note that as in other works explaining a given fairness constraint, we do not argue for or against the policy itself.

Expression
Accuracy
-score
Jaccard
AM-measure
Recall
Table 1: Some examples of measure that can be represented by Eq. (6). For more example see choi2010survey. We set for this table and .

4 Non-decomposable performance measures

In this part we extend the analysis of the previous section to a broader class of performance measures, which includes the -score, the AM-mean, and the misclassification risk among others. We follow the framework put forward by koyejo2014consistent, who introduced the so-called linear fractional performance measures. Formally, given coefficients and , the performance of a classifier is measured by its utility

(6)

We denote by the set of all classifiers for which the denominator of is non-zero. It is important to emphasize that both and are allowed to depend on the unknown distribution of the data but not on the classifier . For instance, the -score (van1974foundation) corresponds to the choice and . We refer to (choi2010survey) for additional examples of different choices of corresponding to different classification performance measures. Recently, yang2020fairness studied linear performance measures in the context of fairness, which essentially corresponds to the special case of the above linear fractional formulation with —which, for instance, does not encompass the -score. In another direction, celis2019classification considered linear fractional formulation of fairness constraints while optimizing the misclassification risk. However, given the structure of the constraints, this problem can essentially be re-formulated as misclassification risk minimization under linear fairness constraints.
As it is common in the literature on generalized performance measures, we view as a utility to be maximized, contrary to the minimization of the risk viewpoint from the previous section. Thus, our goal is to study{highlighted}

(7)

A remarkable property of linear fractional measures is that the unconstrained maximizer can still be obtained by thresholding the regression function . Yet, the threshold in this case might depend on the unknown distribution and ought to be estimated. Let us provide couple of standard examples. {example} Consider the problem of maximizing the accuracy:

Setting and , we see that the above formulation falls within the considered framework. {example} Consider the problem of maximizing the -score:

Zhao_Edakunni_Pocock_Brown13 showed that the solution of the above optimization problem can be written as

where is the unique solution of . koyejo2014consistent pushed further these results demonstrating that the “thresholding principle” remains true for the whole family of linear fractional measures. In what follows, we will show that their result is still valid if one replaces by —the solution of the fair regression problem. This validity is established in a strong sense, meaning that even the equation (as in Example 4) determining the threshold is preserved.

{mytheo}

Wasserstein based fair optimal classifier for non-decomposable measuresfair_optimal_LF Let Assumption 3 be satisfied. Assume that is such that

Assume that the coefficients satisfy one of the following mutually exclusive conditions:

()
()

Then, defined in Eq. (7) can be expressed for all as

(8)

where is either the unique solution of

(9)

if or otherwise.

A few comments are in order. First of all, Theorem LABEL:thm:fair_optimal_LF states that the pre-cited “thresholding principle” still holds for optimizing linear-fractional performance measures under the demographic parity constraint: optimal fair classifiers can be obtained by thresholding the optimal fair regression function at the right threshold level . Moreover, in the case an explicit expression is provided, while if one needs to solve a fixed-point equation to find the optimal threshold. Given that the function defining the fixed-point equation is univariate, monotone and continuous, the bisection method (or any other univariate root-finding method) can be used to obtain an approximation of the optimal threshold up to arbitrary precision. Finally, since the conditions on the coefficients might seem opaque at first sight, let us argue why they are harmless and meaningful. Intuitively, these conditions specify only two requirements:

  • The maximization of makes sense—the more the classifier align with the better. In particular, these conditions exclude , whose maximization does not make sense.

  • The denominator of is non-negative.

One can verify that all the measures presented in Table 1 do indeed satisfy these conditions as well as many other linear fractional performance measures from choi2010survey. We would also like to point out that while the conditions of Theorem LABEL:thm:fair_optimal_LF are cumbersome, they are easy to check in practice, unlike those given in (koyejo2014consistent), who relied on . Indeed, to check the latter, one needs to know or estimate the optimal value of beforehand, which is not always feasible in practice. In contrast, conditions () and () only involve the known coefficients . Finally, let us remark that = and both conditions () and () are invariant under the transformation. Yet, to fix only one of them, we additionally require , which forces the user to fix the signs of properly. Let us emphasize that, if , then —the denominator does not zero-out—which is a consequence of Lemma B.

Proof.

Let us first show that exists and unique. Indeed, the mapping

is continuous and monotone increasing on under the specified conditions. On the one hand, for we have  (see chzhen2020minimax, Section 4, item 4 on average stability) the above mapping evaluates to . On the other hand, for , it evaluates to . The existence follows from the intermediate value theorem and the uniqueness from monotonicity. The rest of the proof follows from the two lemmas presented below. ∎

The first lemma is similar to the main result of (koyejo2014consistent), while the second one gives an explicit expression for the excess-score of any fair classifier. The actual proof technique shares some similarities with the analysis of -score in (chzhen2020optimal) who provided an alternative proof to the result of Zhao_Edakunni_Pocock_Brown13 recalled in Example 4. {mylemma}Fixed point propertyfixed_at_optim_LF Let Assumption 3 be satisfied. Let be defined in Theorem LABEL:thm:fair_optimal_LF and assume that defined in Eq. (9) exists. Then,

Proof.

For compactness we drop the subscripts in this proof. Using Lemma  B, we find that

Case 1: . Combining this result with (9), we obtain the following expression for :

Factorizing the numerator and denominator by and respectively, the above can be written as

concluding the proof for the first case.
Case 2: . In this case, notice that we have

and, following the same computations, . Plugging the above equalities in the definition of yields

The proof is concluded. ∎

The next result provides an explicit expression for the excess score of any fair classifier .

{mylemma}

Excess score for fair non-decomposable measuresexcess_score Let Assumption 3 be satisfied. Let be defined as in Theorem LABEL:thm:fair_optimal_LF and assume that defined in Eq. (9) exists. Let be the Wasserstein barycenter of measures weighted by , respectively. Define as . Let . Then, for any classifier such that , excess score equals to

Furthermore, under the conditions on specified in Theorem LABEL:thm:fair_optimal_LF; we have for all classifiers . {remark} Lemma LABEL:lem:excess_score, together with Lemma B, stated in appendix, implies that

Hence, the inequality for all is implied from

The first of the above conditions is ensured if (Lemma B) assumed in Theorem LABEL:thm:fair_optimal_LF and the second one is ensured by () or (), as proved in Lemma B.

Proof of Lemma LABEL:lem:excess_score.

Let be the Wasserstein barycenter of measures , weighted by respectively. Assumption 3 and the form of ensures that the fair optimal classifier in Eq. (8) can be expressed as

where . Fix an arbitrary classifier which satisfies the demographic parity constraint.
Our goal is to develop , which we express as a sum of two terms , with

and

One verifies that indeed . Thanks to the alternative definition of introduced in the beginning of this proof, for any we have

where the last equality is due to the fact that satisfies the demographic parity constraint. Thus, setting and recalling that we can express and as

Case 1: . Lemma LABEL:lem:fixed_at_optim_LF implies that

Combining the above two expressions for and we obtain

Simplifying the above and using Lemma B, we obtain

As in Lemma LABEL:lem:fixed_at_optim_LF (using the expression for the numerator), we deduce that

and using the definition of , we can write

(10)

Combining the last three displays, we arrive at the claimed equality

Case 2: . We have shown in the proof of Lemma LABEL:lem:fixed_at_optim_LF that in this particular case, . Hence and reduce to

Consequently, the difference of utilities is expressed as

Again invoking the result of Lemma B, we deduce

The above two displays combined with Eq. (10) and the condition yield

The proof is concluded. ∎

Let us remark that the content of this section can be seen as a strict improvement over koyejo2014consistent who only derived Lemma LABEL:lem:fixed_at_optim_LF in the absence of the fairness constraint. Indeed, assuming that , ensures that any classifier is demographic parity fair and that . In the absence of the demographic parity constraint, Assumption 3 is not necessary and exactly the same proof technique allows to obtain the characterization of the optimal unconstrained classifier.

Examples: accuracy and -score

In this part, we give specific examples of the parameters and and instantiate Theorem LABEL:thm:fair_optimal_LF and Lemma LABEL:lem:excess_score. The first examples concerns the accuracy as a performance metric. It highlights the generality of the derived results. {example}[Accuracy under fairness constraint] Recalling the coefficients specified in Example 4, we see that in this case . Furthermore, one checks that condition () is satisfied. Hence under Assumption 3, Theorem LABEL:thm:fair_optimal_LF states that

with maximizes under the demographic parity constraint. Thus, it coincides with the result of Theorem LABEL:thm:equivalence. Furthermore, Lemma LABEL:lem:excess_score states that for any classifier such that , it holds that

We invite the reader to compare the above expression with its unconstrained version (devroye2013probabilistic, Theorem 2.2).

The second example concerns the -score that has been used in several empirical works on fairness as a performance measure (wang2021analyzing; dablain2022towards; wick2019unlocking). {example}[-score under fairness constraint] Recall that the -score is defined as

Using the coefficients specified in Example 4, we see that for this case and condition () is satisfied. Hence, under Assumption 3, Theorem LABEL:thm:fair_optimal_LF states that

with being a unique solution of

maximizes the -score under the demographic parity constraint. Furthermore, Lemma LABEL:lem:excess_score states that for any classifier such that , it holds that

We invite the reader to compare the above expression with its unconstrained version (chzhen2020optimal, Lemma 2).

5 The unawareness case

All the previous parts were concerned with the awareness setup—we allowed ourselves to use the sensitive attribute explicitly. However, it can happen in practice that for legal or ethical reasons, the sensitive attribute cannot be used as an input at prediction time (barocas2016big). Throughout this section we look at classifiers of the form . By abuse of notation, and as long as confusion cannot occur, we use the same notation to denote the set of all classifiers in the unawareness setup. We also need to introduce the conditional distribution of the sensitive attribute , given the nominally non-sensitive features . For all , we set . With one more abuse of notation, we set . In this section we look for {highlighted}

(11)

Note that the only difference with the previous setup is the absence of the sensitive input in the input of . lipton2018does investigated this framework empirically and provided evidence against its use in practice. In particular, they empirically showed that while not permitting using the sensitive attribute , many algorithms still learn the link between and implicitly. Our first result gives a theoretical justification to this phenomenon.

As in the awareness case, we work under a continuity assumption, adapted to this scenario. Recall that Assumption 3 imposed continuity of the regression function distribution for each sensitive group . Here we need a different assumption to account for the fact that is not accessible anymore, namely the continuity of any linear combination of the regression functions distributions and .

{assumption}

For every and for every vector such that , the distribution is continuous.

Akin to Theorem 3, we derive the explicit form of an optimal fair classifier in the unawareness setting. {theorem} Let Assumption 5 be satisfied. Then a solution defined in Eq. (11) can be expressed for all as

where is a solution of

(12)

We make two observations. First of all, the optimal fair classifier is no longer given by the group-wise threshold. Yet, one can think of the term as the -dependent threshold. The optimal classifier tries to guess the value of the sensitive attribute from the features to properly set the threshold. Note that as in the awareness case, here we have . Thus, in average, the “threshold” remains being equal to as in the standard classification setup. Secondly, we see that if is measurable w.r.t. , we fall back to the awareness case. Otherwise each variable is weighted by the conditional distribution of given .

Importantly, it is remains an open problem to give a connection of the above problem with the corresponding regression setup. The main reason for it is the current lack of an explicit solution to the optimal fair regression problem in the unawareness case. Some attempts were made in (chzhen2020example), yet they are unsatisfactory and do not give a complete picture. Intuitively, the difficulty of extending the optimal transport based approach to the unawareness setup lies in our inability to establish the source of a given . In other words, given , we have no idea which of it was sampled from. Hence, we cannot build a transport map from to their common barycenter since it requires the knowledge of . Naively, one might think to use —the best prediction of given —instead of . While intuitive, it is easy to see that simply replacing by in Theorem LABEL:thm:equivalence does not even satisfy the demographic parity constraint in general. As we show in the next paragraph, the connection between the fair classification and fair regression can be made explicit in the unawareness case if we consider the case of . The existence of such a connection is explained by the Hahn decomposition theorem for signed measure, whose generalization (even its formulation) to many measures is unclear.

Binary sensitive attribute: the reduction.

In this section we describe a reduction of the fair unaware binary classification problem to the awareness case for . First of all, let us recall that the minimization of over under any constraints is equivalent to the minimization of under the same constraints. Furthermore, the same applies to the awareness case where we only need to replace by .

For our reduction, given a distribution on , we build another distribution on and a function with the following property: there is a one-to-one correspondence between

and

In other words, if is an optimal fair classifier for distribution under awareness, then can be transformed into an optimal fair classifier for under unawareness. In what follows, we present the reduction and, given the distribution , explain the procedure to build .

Let . Note that if , then and any unaware classifier satisfies the demographic parity constraint. Hence, we assume that . We define in three steps. {highlighted}

  1. The distribution of given under is defined as

    where and is the Hahn decomposition of the signed measure  (see, e.g., billingsley2008probability, Theorem 32.1);

  2. the distribution of under is defined as

  3. the new pseudo-regression function is defined as

We note that under , the sensitive attribute is measurable w.r.t. since the supports of and do not intersect. We refer as to the pseudo-regression function since it is not guaranteed that it takes values in and, hence, is not necessary a valid regression function of under for . {myproposition}Unawareness to awareness reduction Let be any distribution on . Let and be defined using the three steps procedure described above and

Then, defined point-wise as

is a solution of .

Proof.

For any , define as

Note that the above correspondence of and is invertible since the supports of and do not intersect by construction. Observe that for any it holds that

Thus, given any classifier satisfying the demographic parity constraint under , we can transform it to a classifier that satisfies the constraints under . Furthermore, since

taking any classifier we can write

where in the first equality, we added the input to sue to the fact that is measurable under . Note that the second term is minimized point-wise by the Bayes classifier, while the first term is minimized by thanks to the equivalence established for the demographic parity constraint. ∎

The above result provide a theoretical justification to the empirical observations made by lipton2018does. Indeed, they have empirically shown that in the unawareness setting, many classification algorithms tailored for the demographic parity constraint, are forced to “guess” the sensitive attribute . Theoretically, this is reflected by the construction of the distribution under . Furthermore, since the reduction is performed to the awareness setup, the results of previous sections on the connection between fair regression and fair classification still applies. Yet, we emphasize that the above argument is only valid for and its extension to remains an open problem. The main difficulty comes from the absence of a version of the Hahn decomposition for more than two measures.

6 Fair learning: from infinite to finite sample

All the previous sections were concerned with the “infinite sample” regime—the case of known distribution . While not being the main focus of the paper, given the established connection with the problem of fair regression, one can easily pass from the infinite to the finite-sample regime. Indeed, there are many algorithms that allow to consistently estimate the regression function . For instance, agarwal2019fair give an in-processing algorithm with provable finite sample generalization bounds; gouic2020price propose a consistent estimator of chzhen2020fair provide an algorithm with finite sample fairness and risk guarantees; chzhen2020minimax exhibit a modification of the two aforementioned estimators that enjoys stronger fairness and risk guarantees.

Once an estimator of is constructed, one only needs to estimate the threshold specified in Theorem LABEL:thm:fair_optimal_LF. Recall that there are two cases considered in Theorem LABEL:thm:fair_optimal_LF, the first one requires finding a root of a specific function and the second one gives an explicit expression for . For the first case one can use the unsupervised approach recycling and only estimating and the, potentially distribution dependent coefficients, . For the second case one only needs to estimate or substitute the values of . Such an approach was analyzed in chzhen2020optimal in the context of binary classification with the -score without fairness considerations. Alternatively, for the threshold estimation, one can deploy the grid-search technique proposed by koyejo2014consistent by again recycling the base estimator of . In either case one ends up with a flexible and rather direct approach for building data-driven algorithms. We note however that the second approach requires additional labeled data, while the first one is only based on the unlabeled data. The final classification algorithm eventually takes the form of .

7 Conclusion

We have derived an explicit connection between the regression and classification under the demographic parity constraint problems. Leveraging the optimal transport interpretation of the optimal fair regressor, we have shown that the regression-classification link is akin to the classical unconstrained setup. This connection is extended to non-decomposable performance measures and, remarkably, amounts to replacing the standard regression function by its fair counterpart. Finally, we have provided a reduction scheme to pass from the unawareness setup to the awareness setup in the case of the binary sensitive attribute, hence giving the first explicit solution of the fair optimal unaware classifier. Our results are instructive and, relying on the previous studies, lead to wide spectrum of algorithms that can be used with non-decomposable measures. Future works will be focused on further clarification of other notions of fairness constraint by providing clean and interpretable theoretical studies.

Appendix A A unified proof for deriving optimal fair classifiers

In this section we state and prove a general result which implies both Theorem 3 and Theorem 5. On top of the problem setup presented in Section 2, let be a random variable taking its values in some abstract space . Moreover, define the regression functions . The random variable should be thought as for the awareness setting and for the unawareness setting. Our goal is to find a solution {highlighted}

(13)

The general result will be stated under the following continuity assumption. It requires continuity of the distribution of any linear combination of the regression functions evaluated at . {assumption} For every and for every vector such that , the distribution is continuous. Akin to Assumptions 3 and 5, Assumption A is not necessary to prove our result but it greatly simplifies its presentation and interpretation. Let us now state the general result which encompasses the two special cases presented in the main body of the paper. {mytheo}Fair optimal classifier (unified version)optimal_DP_unified Let Assumption A be satisfied. Then a solution defined in Eq. (13) can be expressed for all as

where is a solution of

(14)
{remark}

[Relating the above result to the main body] It is straightforward to derive Theorem 3 and Theorem 5 from Theorem LABEL:thm:optimal_DP_unified. Indeed, to prove Theorem 3, set and notice that . In particular, Assumption A is weaker than Assumption 5 and one can check that the optimal fair classifiers coincide. Similarly, Theorem 5 can be derived from Theorem LABEL:thm:optimal_DP_unified by setting .

Proof of Theorem LABEL:thm:optimal_DP_unified.

One can verify that the minimization of over is equivalent to the minimization of . Furthermore, the demographic parity constraint can be equivalently expressed as

Thus, we are interested in the solution of the optimization problem

Recall that we defined the random variable . The Lagrangian for the above problem can be expressed as

where . Weak duality implies that

(15)

Our approach to derive the optimal fair classifier can be decomposed in two classical steps: find optimal solutions to the dual problem ; show that strong duality holds so that the optimal solutions to the dual problem are also optimal for the primal problem.

Solving the dual problem.

In what follows we focus our attention on the dual problem, which can be solved analytically. We first solve for any the inner minimization problem of the formulation

(16)

Since can be any function from to , the above problem can be solved point-wise. In particular, one can check that the solution is given by

Plugging the optimal solution back in the dual problem, we obtain as solution of the outer maximization problem

(17)

The objective of the above optimization problem is non-negative, continuous convex as a function of . Lemma A ensures that exists.

The objective function of problem in Eq. (17) is not smooth everywhere due to the presence of the positive part function. However, thanks to Assumption A, the set of points at which the objective function is not differentiable has zero Lebesgue measure and can thus be ignored (see, e.g., bertsekas1973stochastic, Proposition 3). The First-Order Optimality Condition (FOOC) on the optimal Lagrange multiplier then reads as

The LHS of the above inequality can be simplified into

showing that the FOOC on is equivalent to satisfying DP.

Strong duality.

The above reasoning showed that defined with the optimal Lagrange multiplier is feasible for the primal problem. Combining this property with Eq. (15) implies that is also a solution of the primal problem.

A more convenient expression.

Using the fact that and , we can express the optimal Lagrange multiplier as

Moreover, introducing , we observe that for any and it holds that , where . Hence, since we are interested in any solution of the above optimization problem, we can define as

{lemma}

Let Assumption A be satisfied, then the mapping

(18)

attains its minimum.

Proof.

In the end of the proof of Theorem LABEL:thm:optimal_DP_unified we have show that minimization of (18) is equivalent to the minimization of

on the hyperplane . Thus, it is sufficient to show that

is attained.

It is clear that the mapping in question is convex on . Hence, it is sufficient to show that it is coercive (see e.g. bauschke2017convex, Proposition 11.15). It holds that

(19)

where we introduced the vector , , and . Thus, in view of (19), by Markov’s inequality, for any it holds that

(20)

where denotes the Euclidean norm. Note that if we are able to show that for some , the right hand side of the above inequality is bounded away from zero, the proof of coercivity is concluded since . To this end, let us introduce

for all and being defined as

By Assumption A, for any , the mapping is continuous on with and . Furthermore, for any such that and for any , we have thanks to triangle’s inequality and monotonicity of

where the convergence follows from the assumed continuity of . Thus, is continuous. Since is compact, we have that

is continuous on . Hence, the intermediate value theorem guarantees that there exists such that

In view of Eq. (20), we conclude. ∎

Appendix B Auxiliary results

The first lemma ensures that under certain conditions, the denominator of the linear fractional performance measure is always positive. {lemma} Assume that , then for any classifier

Furthermore, if , then the above inequality is strict.

Proof.

Observe that

The second claim follows the same lines. ∎

The second result gives a sufficient condition for positivity of the leading coefficient in Remark 4. {lemma} Assume that and either Eq. () or Eq. () is satisfied, then for any classifier

Proof.

Observe that in both cases, by Lemma B, we have

(21)

Case 1: . In that case condition () implies that and . In view of (21) we conclude.
Case 2: . The proof is immediate from (21) and the first part of condition (). ∎

The next lemma establishes an extended average stability property from (chzhen2020minimax). {lemma} Let Assumption 3 be satisfied, then

for all .

Proof.

Fix some . Introducing , we recall that

Furthermore, since both and are distributed uniformly on under Assumption 3, we can write

Finally, the last result relates the excess risk obtained in Lemma LABEL:lem:excess_score with the expression presented in Remark 4. {lemma} Under the conditions of Lemma LABEL:lem:fixed_at_optim_LF, we have

Proof.

We drop the subscript for compactness.
Case 1: . Using the corresponding case of Lemma LABEL:lem:fixed_at_optim_LF and solving it for , we deduce that

Hence, from the above we deduce that

Case 1: . Again using the corresponding case of Lemma LABEL:lem:fixed_at_optim_LF and solving it for , we deduce that

Hence, from the above we deduce that

The proof is concluded. ∎

References