Fair learning with Wasserstein barycenters for non-decomposable performance measures

Solenne Gaucher$^{*}$
Université Paris-Saclay, CNRS
Laboratoire de mathématiques d’Orsay

Nicolas Schreuder$^{*}$
MaLGa, DIBRIS
Università di Genova

Evgenii Chzhen
Université Paris-Saclay, CNRS
Laboratoire de mathématiques d’Orsay

Abstract

This work provides several fundamental characterizations of the optimal classification function under the demographic parity constraint. In the awareness framework, akin to the classical unconstrained classification case, we show that maximizing accuracy under this fairness constraint is equivalent to solving a corresponding regression problem followed by thresholding at level $1 / 2$. We extend this result to linear-fractional classification measures (e.g., $F$-score, AM measure, balanced accuracy, etc.), highlighting the fundamental role played by the regression problem in this framework. Our results leverage a recently developed connection between the demographic parity constraint and the multi-marginal optimal transport formulation. Informally, our result shows that the transition between the unconstrained problem and the fair one is achieved by replacing the conditional expectation of the label by the solution of the fair regression problem. Finally, leveraging our analysis, we demonstrate an equivalence between the awareness and the unawareness setups in the case of two sensitive groups.

1 Introduction

Our experience of life is increasingly and insidiously being influenced by algorithmic predictions. It is now well accepted that such predictions might replicate or even amplify societal biases and discrimination because of machine learning algorithms’ training process (barocas-hardt-narayanan). A key difficulty in overcoming the effect of those biases is the lack of a precise understanding of how statistical algorithms make predictions: these algorithms are often designed to minimize a user-specified data-dependent loss and yield a highly complex prediction rule, leaving practitioners and theoreticians alike unable to understand and explain the issued predictions. Our goal is to provide a sound and simple mathematical characterization of the prediction process in the presence of fairness constraints.

$^{*}$ Equal contribution.

In this paper we study the demographic parity fairness constraint (calders2009building; barocas-hardt-narayanan) in the awareness framework, which allows the prediction rules to explicitly take the sensitive attribute as an input. Even though this constraint is relatively well understood from an algorithmic perspective in both classification (Agarwal_Beygelzimer_Dubik_Langford_Wallach18; menon2018cost; zeng2022bayes; schreuder2021classification; yang2020fairness; jiang2019wasserstein; silvia2020general; feldman2015certifying; gordaliza2019obtaining) and regression (chzhen2020fair; chzhen2020fairTV; gouic2020price; jiang2019wasserstein; agarwal2019fair; chiappa2021fairness), the connection between the two setups remains opaque. The main goal of the current paper is to unveil it.

In contrast, in the traditional unconstrained learning setup, the relation between classification and its regression counterpart is well understood and can be found in all standard books on the subject (see, e.g., hastie2009elements; devroye2013probabilistic; james2013introduction; mohri2018foundations). For instance, the most standard result illustrating this connection states that if $η$ minimizes the squared risk, the classifier $g^{*}(\cdot) = \mathbb{1}\{η(\cdot) \geq 1/2\}$ minimizes the misclassification error. Such results form the first building block of many theoretical and practical studies (see, e.g., Audibert_Tsybakov07; Yang99; massart2006risk; biau2008consistency). More recently, the connection between regression and classification was pushed even further. For instance, replacing the misclassification error by the $F$-score (van1974foundation; Chinchor92), Zhao_Edakunni_Pocock_Brown13 showed that the solution of the associated regression problem still plays a crucial role, as an $F$-score maximizer can be obtained by properly thresholding the solution of the regression problem. Moreover, a recent thread of results establishes this fundamental relation for a large variety of performance measures including the AM measure, the Jaccard similarity coefficient, and the G-mean, to name a few (menon2013statistical; koyejo2014consistent; Koyejo_Natarajan_Ravikumar_Dhillon15; Yan_Koyejo_Zhong_Ravikumar18). Again, akin to the standard minimization of the misclassification error, all these developments led to many theoretical and practical advances (see, e.g., jasinska2016extreme; chzhen2020optimal; narasimhan2015optimizing; kotlowski2016surrogate; bascol2019cost; boughorbel2017optimal).
Interestingly, some works that consider group fairness constraints report the $F$-score as a performance measure in their empirical studies without tailoring an algorithm to optimize it directly (see, e.g., BiswasR20; BiswasR21; abs-2207-03277; wang2021analyzing; dablain2022towards; wick2019unlocking). A possible cause of this is the absence of a characterization of fair ($F$-score) optimal classifiers in the fairness literature. In this paper we fill this gap for the demographic parity constraint and a large class of performance measures.

Literature that treats group fairness notions is typically distinguished by two features: the exact notion of fairness and access to the sensitive attribute at prediction time. While this work considers only demographic parity, we discuss both the awareness and the unawareness setups, in which access to the sensitive attribute at prediction time is respectively allowed or not. Unlike the case of awareness, in which a significant understanding has been achieved from a theoretical perspective, the case of unawareness remains opaque, with contributions mainly focusing on algorithmic constructions (see e.g., Agarwal_Beygelzimer_Dubik_Langford_Wallach18; agarwal2019fair; oneto2019general; Donini17; narasimhan2018learning). A notable work of lipton2018does puts forward several pieces of empirical evidence highlighting critical issues with the unawareness framework. Our work makes a step towards a more explicit and transparent description of the optimal classifier under the demographic parity constraint with unawareness by introducing a simple theoretical reduction scheme to the awareness setup for a binary protected attribute. Consequently, our results support theoretically the empirical claims made by lipton2018does.

Contributions.

The goal of this work is to establish a link between the regression and classification problems under the demographic parity constraint. We make the following contributions to the study of algorithmic fairness:

We show that, under mild assumptions, if $f^{*}$ minimizes the squared risk under the demographic parity constraint, then $\mathbb{1}\{f^{*} \geq 1/2\}$ minimizes the probability of misclassification under the same constraint.
We extend the above result to a large family of performance measures introduced in Koyejo_Natarajan_Ravikumar_Dhillon15 for unconstrained classification.
In the case of a binary sensitive attribute, we provide a simple reduction scheme that transforms, in an optimal way, the unawareness setup into the awareness one.

The first two contributions show the fundamental role played by regression in the context of the demographic parity constraint and are built using basic tools from univariate optimal transport theory. As an interesting consequence of our analysis, we show that the notion of strong demographic parity introduced by jiang2019wasserstein is equivalent to the usual demographic parity when a performance measure is optimized. This indicates that a post-hoc or downstream choice of threshold will never harm the demographic parity constraint. The last contribution constitutes a step towards the theoretical treatment of the unawareness setup, a problem that remains open. Importantly, even though our results are stated in the fair learning setting, they imply new results in the general learning setting. In particular, our results yield a characterization of the optimal unconstrained classifier for a large class of classification performance measures.

2 Problem setup

Consider a triplet $(\mathbf{X}, S, Y) \in \mathcal{X} \times [K] \times \{0, 1\}$, following some joint distribution $\mathbb{P}$, consisting of the nominally non-sensitive features, the sensitive feature, and the label, respectively. Classifiers are functions of the form $g : \mathcal{X} \times [K] \to \{0, 1\}$ and score functions take the form $f : \mathcal{X} \times [K] \to [0, 1]$. The set of all classifiers is denoted by $\mathcal{G}$ and the set of all score functions is denoted by $\mathcal{F}$. Before proceeding let us introduce additional notation that is related to the unknown distribution $\mathbb{P}$. We set $η(\mathbf{X}, S) \triangleq \mathbb{E}[Y \mid \mathbf{X}, S]$ and recall that $η$ minimizes the squared risk without any constraint. For each $s \in [K]$, we define $p_{s} \triangleq \mathbb{P}(S = s)$. The central object of this work is the optimal fair score function, defined as

$$f^{*} \in \arg\min_{f : \mathcal{X} \times [K] \to [0, 1]} \left\{ \mathbb{E}\left[(Y - f(\mathbf{X}, S))^{2}\right] \,:\, f(\mathbf{X}, S) \perp\!\!\!\perp S \right\}. \tag{1}$$

An explicit expression for $f^{*}$ under standard assumptions was derived in (chzhen2020fair; gouic2020price) using univariate optimal transport theory and the reduction of the problem in Eq. (1) to a multi-marginal optimal transport formulation. In particular, they showed that, under mild assumptions, there is a one-to-one correspondence between the problem in Eq. (1) and the problem

$$\min_{ν \in \mathcal{P}_{2}(\mathbb{R})} \sum_{s = 1}^{K} p_{s} W_{2}^{2}\big(\mathrm{Law}(η(\mathbf{X}, S) \mid S = s), ν\big),$$

where $W_{2}$ is the Wasserstein-2 distance (villani2009optimal, Definition 6.1) and $\mathcal{P}_{2}(\mathbb{R})$ denotes the space of univariate probability measures with finite second moment. Denoting by $ν^{⋆}$ the solution of the above problem, it was shown that

$$f^{*}(\mathbf{x}, s) = T_{\mathrm{Law}(η(\mathbf{X}, S) \mid S = s) \to ν^{⋆}}\big(η(\mathbf{x}, s)\big),$$

where $T_{\mathrm{Law}(η(\mathbf{X}, S) \mid S = s) \to ν^{⋆}}$ is the optimal transport map from $\mathrm{Law}(η(\mathbf{X}, S) \mid S = s)$ to $ν^{⋆}$. Up until now, unlike in the regression setting, it was not clear if a direct link between optimal transport and the fair binary classification problem existed, or even made sense. Our work shows that such a connection exists and that it is fundamental.

Notation.

Given a real-valued function $f : \mathcal{X} \times [K] \to \mathbb{R}$, we denote by $μ_{s}(f)$ the univariate measure defined for all $A \subset \mathbb{R}$ as $μ_{s}(f)(A) \triangleq \mathbb{P}(f(\mathbf{X}, S) \in A \mid S = s)$. For any univariate measure $μ$, we denote by $F_{μ}$ its cumulative distribution function, and by $F_{μ}^{-1}$ its quantile function, given by $F_{μ}^{-1}(p) \triangleq \min\{x : μ((-\infty, x]) \geq p\}$. For any $x \in \mathbb{R}$ we set $(x)_{+} \triangleq \max\{x, 0\}$. For any probability measure $μ$ on $\mathcal{X}$ and a function $f : \mathcal{X} \to \mathbb{R}$, we denote by $f \sharp μ$ the image measure of $μ$ by $f$.

3 The misclassification risk: a warm-up

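In this section, we begin by tackling the classical problem of minimizing the misclassification risk and highlight the main novelties and advances with respect to previous works. To this end, we consider the following optimal (in terms of the misclassification risk) fair classifier

$$g^{*} \in \arg\min_{g : \mathcal{X} \times [K] \to \{0, 1\}} \left\{ \mathbb{P}\big(g(\mathbf{X}, S) \neq Y\big) \,:\, g(\mathbf{X}, S) \perp\!\!\!\perp S \right\}. \tag{2}$$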

We work under the following assumption.

Assumption 3. For every $s \in [K]$, assume that $\mathrm{Law}(η(\mathbf{X}, S) \mid S = s)$ is continuous and supported on an interval.

A slightly modified version of the above was used in the context of fairness in (chzhen2020fair; chzhen2020fairTV; gouic2020price; jiang2019wasserstein) and also in classical unconstrained classification with generalized performance measures (Yan_Koyejo_Zhong_Ravikumar18). In Section A, we relax the above assumption and provide a proof that unifies the awareness case considered just below with the unawareness case presented in Section 5, Theorem 5.

The first warm-up result is reminiscent of those recently obtained by (zeng2022bayes; schreuder2021classification). The proof is based on min-max duality and is very similar to the classical Neyman-Pearson lemma. While it does not immediately allow us to reach our goals, it gives several fundamental insights that were already invoked in previous works on the demographic parity constraint (lipton2018does; hardt2016equality).

Theorem 3. Let Assumption 3 be satisfied. Then, an optimal fair classifier $g^{*} : \mathcal{X} \times [K] \to \{0, 1\}$ defined in Eq. (2) can be expressed for all $(\mathbf{x}, s) \in \mathcal{X} \times [K]$ as

$$g^{*}(\mathbf{x}, s) = \mathbb{1}\left\{2η(\mathbf{x}, s) - 1 \geq \frac{λ^{*}_{s}}{p_{s}}\right\},$$

where $\boldsymbol{λ}^{*} = (λ^{*}_{1}, \ldots, λ^{*}_{K})^{\top} \in \mathbb{R}^{K}$ is a solution of

$$\min_{\boldsymbol{λ} \in \mathbb{R}^{K}} \left\{ \mathbb{E}\left[\left|2η(\mathbf{X}, S) - 1 - \frac{λ_{S}}{p_{S}}\right|\right] \,:\, \mathbb{E}\left[\frac{λ_{S}}{p_{S}}\right] = 0 \right\}.$$

The main takeaway message from the above theorem is: under the stated assumption, the optimal fair classifier can be derived as a group-wise thresholding of the regression function $η$, with thresholds possibly depending on the sensitive groups. For a similar statement without the continuity assumption, we refer the reader to zeng2022bayes who derived optimal randomized classifiers using the Neyman-Pearson lemma. Let us now provide a novel characterization of an optimal fair classifier.

Theorem [thm:equivalence] (Wasserstein based fair optimal classifier). Let Assumption 3 be satisfied. Then, an optimal fair classifier $g^{*} : \mathcal{X} \times [K] \to \{0, 1\}$ defined in Eq. (2) can be expressed for all $(\mathbf{x}, s) \in \mathcal{X} \times [K]$ as

$$g^{*}(\mathbf{x}, s) = \mathbb{1}\{f^{*}(\mathbf{x}, s) \geq 1/2\}, \quad \text{with } f^{*} \text{ being defined in Eq. (1)}.$$

Discussion.

The above result is instructive on its own—one can solve binary classification under the demographic parity constraint by solving the corresponding regression problem. We recall that (chzhen2020fair; gouic2020price) built a statistically consistent algorithm for the estimation of the latter. Furthermore, they showed that under the imposed assumptions,

$$f^{*}(\mathbf{x}, s) = \underbrace{\left(\sum_{σ = 1}^{K} p_{σ} F_{μ_{σ}(η)}^{-1}\right) \circ F_{μ_{s}(η)}}_{\text{transport to the barycenter}} \circ\, η(\mathbf{x}, s).$$
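The closed form above can be mimicked on a finite sample by replacing each $F_{μ_{s}(η)}$ with an empirical CDF and each $F_{μ_{σ}(η)}^{-1}$ with an empirical quantile function. The following is a minimal sketch under that assumption, not the estimator of (chzhen2020fair; gouic2020price); the function name `fair_scores`, the mid-rank convention, and the use of `numpy.quantile` with its default interpolation are illustrative choices.

```python
import numpy as np

def fair_scores(eta, s):
    """Empirical analogue of f* = (sum_sigma p_sigma F^{-1}_sigma) o F_s o eta.

    eta : array of regression scores eta(x, s) in [0, 1]
    s   : array of group labels, same length as eta
    """
    eta = np.asarray(eta, dtype=float)
    s = np.asarray(s)
    groups, counts = np.unique(s, return_counts=True)
    weights = counts / counts.sum()              # empirical p_s
    out = np.empty_like(eta)
    for g in groups:
        mask = s == g
        vals = eta[mask]
        # Empirical CDF value of each score within its own group (mid-ranks in (0, 1)).
        ranks = (np.argsort(np.argsort(vals)) + 0.5) / vals.size
        # Barycenter quantile function: weighted average of the group quantile functions.
        out[mask] = sum(w * np.quantile(eta[s == h], ranks)
                        for h, w in zip(groups, weights))
    return out
```

With continuous scores (no ties), the transformed scores keep the within-group ordering of $η$ while their group-wise distributions approximately coincide.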

feldman2015certifying proposed to transport the group-wise distribution of $η(\bX,S)$ towards their common barycenter as a disparity removal strategy. Yet, a theoretical justification was missing and this approach remained a heuristic until the work of gordaliza2019obtaining who provided an upper bound on the excess risk in terms of the Wasserstein barycenter objective. Later, jiang2019wasserstein relied on the barycenter formulation involving the Earth Mover distance (rachev1998mass) and showed that a transport-based prediction results in a minimal perturbation post-processing. However, the use of the Earth Mover distance might result in non-uniqueness issues. Our Theorem LABEL:thm:equivalence gives a complete theoretical justification of the transport based fair classification algorithms. Theorem LABEL:thm:fair_optimal_LF in Section 4 further extends this connection to non-decomposable measures.

Besides, jiang2019wasserstein introduced a notion of strong demographic parity, which amounts to taking classifiers $g : \mathcal{X} \times [K] \to \{0, 1\}$ for which there exists a score function $f : \mathcal{X} \times [K] \to [0, 1]$ such that $f(\mathbf{X}, S) \perp\!\!\!\perp S$ and $g(\mathbf{x}, s) = \mathbb{1}\{f(\mathbf{x}, s) \geq 1/2\}$. This notion was later used in (silvia2020general; chiappa2021fairness). Theorem LABEL:thm:equivalence implies that the optimal classifier under the demographic parity constraint satisfies an a priori more restrictive fairness notion, namely strong demographic parity. Moreover, any classifier that satisfies strong demographic parity is demographic parity fair. Hence, we have deduced the equivalence between the two definitions at the optimum. The notion of strong demographic parity introduced by jiang2019wasserstein can be seen in a downstream or post-hoc setting. That is, the learner first fits a score function, and only afterwards is a particular threshold selected, in a potentially non-stationary way. Strong demographic parity implies that any threshold selection made by the learner will yield a fair classifier. In that sense, our results show that building a score function via an optimal fair regression function is optimal for the misclassification risk and, as we see in Section 4, for many other classification measures. Below we provide a simple proof of Theorem LABEL:thm:equivalence.
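The claim that any threshold applied to a score independent of $S$ yields a demographic-parity-fair classifier can be checked numerically. The sketch below is only an illustration on synthetic data: the helper names `dp_gap` and `groupwise_cdf` are hypothetical, and the group-wise empirical CDF is used as a convenient example of a score whose distribution does not depend on the group.

```python
import numpy as np

def dp_gap(scores, s, threshold):
    """Largest difference between group-wise acceptance rates P(score >= t | S = s)."""
    rates = [np.mean(scores[s == g] >= threshold) for g in np.unique(s)]
    return max(rates) - min(rates)

def groupwise_cdf(eta, s):
    """Replace each score by its empirical CDF value within its own group."""
    out = np.empty(len(eta))
    for g in np.unique(s):
        mask = s == g
        # Rank within the group, normalised to (0, 1]; assumes no ties.
        out[mask] = (np.argsort(np.argsort(eta[mask])) + 1) / mask.sum()
    return out
```

Thresholding the raw scores at a fixed level generally produces different acceptance rates across groups, whereas thresholding the group-wise CDF values at any level produces rates that differ by at most the rank granularity (exactly zero when the groups have equal sizes).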

Figure 1: Illustration of Bayes and fair optimal classifiers. Left: group-wise cumulative distributions of $η(\bX,S)$ —the threshold is $.5$ ; Middle: Illustration of Theorem LABEL:thm:equivalence—black solid line corresponds to $F_{μ (f^{*})}$ ; Right: illustration of group-wise thresholds that correspond to the fair optimal classifier.

Proof of Theorem LABEL:thm:equivalence.

Theorem 3 implies that under Assumption 3 the optimal classifier is of the form $g^{*}(\mathbf{x}, s) = \mathbb{1}\{η(\mathbf{x}, s) \geq β^{*}_{s}\}$ for some $\boldsymbol{β}^{*} = (β^{*}_{s})_{s \in [K]} \in \mathbb{R}^{K}$. It follows from (van2000asymptotic, Lemma 21.1(iv)) and Assumption 3 that $η(\mathbf{x}, s) = F_{μ_{s}(η)}^{-1} \circ F_{μ_{s}(η)}(η(\mathbf{x}, s))$ for almost all $\mathbf{x} \in \mathbb{R}^{d}$ w.r.t. $\mathbb{P}_{\mathbf{X} \mid S = s}$. Thus, it is sufficient to look at the classifiers of the form

$$g(\mathbf{x}, s) = \mathbb{1}\left\{F_{μ_{s}(η)}^{-1} \circ F_{μ_{s}(η)}(η(\mathbf{x}, s)) \geq β_{s}\right\},$$

or, equivalently, at $g(\mathbf{x}, s) = \mathbb{1}\{F_{μ_{s}(η)}(η(\mathbf{x}, s)) \geq F_{μ_{s}(η)}(β_{s})\}$ (van2000asymptotic, Lemma 21.1(i)). Now, the inverse transform theorem states that under Assumption 3, $F_{μ_{s}(η)}^{-1}(U)$ has the same distribution as $η(\mathbf{X}, S)$ conditionally on $S = s$, for $U$ uniformly distributed on $(0, 1)$. Then,

$$\mathbb{P}(g(\mathbf{X}, S) = 1 \mid S = s) = \mathbb{P}\left(F_{μ_{s}(η)} \circ F_{μ_{s}(η)}^{-1}(U) \geq F_{μ_{s}(η)}(β_{s})\right) = 1 - F_{μ_{s}(η)}(β_{s}),$$

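where we have used that $F_{μ_{s}(η)} \circ F_{μ_{s}(η)}^{-1}(u) = u$ for all $u \in (0, 1)$ (van2000asymptotic, Lemma 21.1(ii)). Thus, $g$ verifies the DP constraint if and only if $F_{μ_{s}(η)}(β_{s})$ does not depend on $s$. Denoting by $γ$ this constant, we find that the optimal fair classifier must be of the form $g(\mathbf{x}, s) = \mathbb{1}\{F_{μ_{s}(η)}(η(\mathbf{x}, s)) \geq γ\}$. The risk $\mathcal{R}(g) \triangleq \mathbb{P}(g(\mathbf{X}, S) \neq Y)$ of any such classifier is given by

$$\mathcal{R}(g) = \mathbb{E}[Y] + \sum_{s \in [K]} p_{s}\, \mathbb{E}\left[\mathbb{1}\{F_{μ_{s}(η)}(η(\mathbf{X}, s)) \geq γ\}\,(1 - 2η(\mathbf{X}, s)) \mid S = s\right]. \tag{3}$$

Using again the inverse transform theorem, Eq. (3) can be further simplified to the following expression:

$$\mathcal{R}(g) = \mathbb{E}[Y] + \sum_{s \in [K]} p_{s} \int_{0}^{1} \mathbb{1}\{F_{μ_{s}(η)} \circ F_{μ_{s}(η)}^{-1}(u) \geq γ\}\,\left(1 - 2F_{μ_{s}(η)}^{-1}(u)\right) \mathrm{d}u. \tag{4}$$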

Under Assumption 3, $F_{μ_{s}(η)} \circ F_{μ_{s}(η)}^{-1}(u) = u$ for all $u \in (0, 1)$. Thus, Eq. (4) reduces to

$$\mathcal{R}(g) = \mathbb{E}[Y] + \int_{γ}^{1} \sum_{s \in [K]} p_{s} \left(1 - 2F_{μ_{s}(η)}^{-1}(u)\right) \mathrm{d}u.$$

This function is minimized at $γ^{*}$ which satisfies

$$\left(\sum_{s \in [K]} p_{s} F_{μ_{s}(η)}^{-1}\right)(γ^{*}) = 1/2, \tag{5}$$

and the optimal classifier under the demographic parity constraint is given by $g^{*}(\mathbf{x}, s) = \mathbb{1}\{F_{μ_{s}(η)}(η(\mathbf{x}, s)) \geq γ^{*}\}$. Taking into account the condition satisfied by $γ^{*}$, we conclude. ∎

The proof itself is rather instructive and gives rise to the following interpretation.

Remark (Ranking interpretation of the fair optimal classification strategy). The proof of Theorem LABEL:thm:equivalence reveals that the optimal fair classifier can be written as $g^{*}(\mathbf{x}, s) = \mathbb{1}\{F_{μ_{s}(η)}(η(\mathbf{x}, s)) \geq γ^{*}\}$, where $γ^{*}$ is given by (5). Recall that $q \mapsto \sum_{s = 1}^{K} p_{s} F_{μ_{s}(η)}^{-1}(q)$ is the quantile function of the Wasserstein-2 barycenter of the measures $(μ_{s}(η))_{s \in [K]}$, weighted by $(p_{s})_{s \in [K]}$ (see, e.g., agueh2011barycenters, Section 6.1). Thus, denoting this barycenter by $\bar{μ}(η)$, $g^{*}$ can be alternatively expressed as

$$g^{*}(\mathbf{x}, s) = \mathbb{1}\left\{F_{μ_{s}(η)}(η(\mathbf{x}, s)) \geq F_{\bar{μ}(η)}(1/2)\right\}.$$

The last display shows that while the thresholds on $η$ differ across groups (as per Theorem 3), the threshold is independent of the sensitive group when viewed from the perspective of group-wise ranking. Put simply, if $F_{\bar{μ}(η)}(1/2) = p \in (0, 1)$, then the $(1 - p) \times 100\%$ best individuals from each group get classified positively. This property reflects the notion of rational ordering (lipton2018does) that follows from the order preservation property of $f^{*}$ (see chzhen2020minimax, Section 4). Figure 1 provides a graphical illustration of the above observations.
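The ranking interpretation suggests a simple two-step empirical procedure: solve Eq. (5) for the common rank level $γ^{*}$, then accept the top $(1 - γ^{*})$ fraction of each group. The sketch below assumes empirical quantile functions in place of $F_{μ_{s}(η)}^{-1}$ and that the level $1/2$ lies in the range of the barycenter quantile function; the names `rank_threshold` and `fair_classifier` are hypothetical.

```python
import numpy as np

def rank_threshold(eta, s, level=0.5, tol=1e-8):
    """Bisection for gamma such that sum_s p_s F^{-1}_{mu_s}(gamma) = level (Eq. (5))."""
    groups, counts = np.unique(s, return_counts=True)
    weights = counts / counts.sum()

    def bary_quantile(q):
        # Quantile function of the empirical Wasserstein-2 barycenter.
        return sum(w * np.quantile(eta[s == g], q) for g, w in zip(groups, weights))

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bary_quantile(mid) < level:   # the quantile function is non-decreasing
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def fair_classifier(eta, s, gamma):
    """Classify positively the scores above the gamma-quantile of their own group."""
    pred = np.zeros(len(eta), dtype=int)
    for g in np.unique(s):
        mask = s == g
        cutoff = np.quantile(eta[mask], gamma)   # group-specific threshold on eta
        pred[mask] = eta[mask] >= cutoff
    return pred
```

The group-specific cutoffs on $η$ differ, yet every group ends up with approximately the same acceptance rate $1 - γ^{*}$.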

We note that as in other works explaining a given fairness constraint, we do not argue for or against the policy itself.

Measure	Expression	$(\mathsf{n}_{0}, \mathsf{n}_{1}, \mathsf{n}_{2})$	$(\mathsf{d}_{0}, \mathsf{d}_{1}, \mathsf{d}_{2})$
Accuracy	$\mathbb{P}(Y = g(\mathbf{X}, S))$	$(1 - p^{y = 1}, 2, -1)$	$(1, 0, 0)$
$F_{b}$-score	$\frac{(1 + b^{2})\,\mathbb{P}(Y = 1, g(\mathbf{X}, S) = 1)}{b^{2}\,\mathbb{P}(Y = 1) + \mathbb{P}(g(\mathbf{X}, S) = 1)}$	$(0, 1 + b^{2}, 0)$	$(b^{2} p^{y = 1}, 0, 1)$
Jaccard	$\frac{\mathbb{P}(Y = 1, g(\mathbf{X}, S) = 1)}{\mathbb{P}(Y = 1, g(\mathbf{X}, S) = 0) + \mathbb{P}(g(\mathbf{X}, S) = 1)}$	$(0, 1, 0)$	$(p^{y = 1}, -1, 1)$
AM-measure	$\frac{1}{2}\{\mathbb{P}(g(\mathbf{X}, S) = 0 \mid Y = 0) + \mathbb{P}(g(\mathbf{X}, S) = 1 \mid Y = 1)\}$	$(\frac{1}{2}, \frac{1}{2 p^{y = 1}} + \frac{1}{2 p^{y = 0}}, -\frac{1}{2 p^{y = 0}})$	$(1, 0, 0)$
Recall	$\mathbb{P}(g(\mathbf{X}, S) = 1 \mid Y = 1)$	$(0, 1, 0)$	$(p^{y = 1}, 0, 0)$

Table 1: Some examples of measures that can be represented by Eq. (6). For more examples see choi2010survey. In this table we set $p^{y = 1} \triangleq \mathbb{P}(Y = 1)$ and $p^{y = 0} \triangleq \mathbb{P}(Y = 0)$.

4 Non-decomposable performance measures

In this part we extend the analysis of the previous section to a broader class of performance measures, which includes the $F$-score, the AM measure, and the misclassification risk among others. We follow the framework put forward by koyejo2014consistent, who introduced the so-called linear fractional performance measures. Formally, given coefficients $(\mathsf{n}_{0}, \mathsf{n}_{1}, \mathsf{n}_{2}) \in \mathbb{R}^{3}$ and $(\mathsf{d}_{0}, \mathsf{d}_{1}, \mathsf{d}_{2}) \in \mathbb{R}^{3}$, the performance of a classifier $g : \mathcal{X} \times [K] \to \{0, 1\}$ is measured by its utility

$$U_{(\mathsf{n}, \mathsf{d})}(g) \coloneqq \frac{\mathsf{n}_{0} + \mathsf{n}_{1}\,\mathbb{P}(g(\mathbf{X}, S) = 1, Y = 1) + \mathsf{n}_{2}\,\mathbb{P}(g(\mathbf{X}, S) = 1)}{\mathsf{d}_{0} + \mathsf{d}_{1}\,\mathbb{P}(g(\mathbf{X}, S) = 1, Y = 1) + \mathsf{d}_{2}\,\mathbb{P}(g(\mathbf{X}, S) = 1)}. \tag{6}$$
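To make Eq. (6) concrete, the following sketch evaluates the empirical version of the utility from a vector of labels and predictions, with probabilities replaced by sample frequencies. The function name `linear_fractional_utility` is an illustrative choice; the coefficient triples used in the usage note are those listed in Table 1.

```python
import numpy as np

def linear_fractional_utility(n, d, y, g):
    """Empirical version of Eq. (6) for coefficient triples n = (n0, n1, n2), d = (d0, d1, d2)."""
    y = np.asarray(y)
    g = np.asarray(g)
    tp = np.mean((g == 1) & (y == 1))    # estimate of P(g = 1, Y = 1)
    pos = np.mean(g == 1)                # estimate of P(g = 1)
    num = n[0] + n[1] * tp + n[2] * pos
    den = d[0] + d[1] * tp + d[2] * pos
    return num / den
```

For instance, with `p = np.mean(y)`, the choices `n = (0, 2, 0)`, `d = (p, 0, 1)` recover the usual $F_{1}$-score, and `n = (1 - p, 2, -1)`, `d = (1, 0, 0)` recover the accuracy.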

We denote by $\mathrm{dom}(U_{(\mathsf{n}, \mathsf{d})})$ the set of all classifiers $g : \mathcal{X} \times [K] \to \{0, 1\}$ for which the denominator of $U_{(\mathsf{n}, \mathsf{d})}$ is non-zero. It is important to emphasize that both $\mathsf{n}$ and $\mathsf{d}$ are allowed to depend on the unknown distribution of the data $\mathbb{P}$ but not on the classifier $g$. For instance, the $F_{1}$-score (van1974foundation) corresponds to the choice $(\mathsf{n}_{0}, \mathsf{n}_{1}, \mathsf{n}_{2}) = (0, 2, 0)$ and $(\mathsf{d}_{0}, \mathsf{d}_{1}, \mathsf{d}_{2}) = (\mathbb{P}(Y = 1), 0, 1)$. We refer to (choi2010survey) for additional examples of different choices of $(\mathsf{n}, \mathsf{d})$ corresponding to different classification performance measures. Recently, yang2020fairness studied linear performance measures in the context of fairness, which essentially corresponds to the special case of the above linear fractional formulation with $(\mathsf{d}_{0}, \mathsf{d}_{1}, \mathsf{d}_{2}) = (1, 0, 0)$; this special case does not encompass, for instance, the $F_{1}$-score. In another direction, celis2019classification considered a linear fractional formulation of fairness constraints while optimizing the misclassification risk. However, given the structure of the constraints, this problem can essentially be re-formulated as misclassification risk minimization under linear fairness constraints.
As is common in the literature on generalized performance measures, we view $U_{(\mathsf{n}, \mathsf{d})}$ as a utility to be maximized, contrary to the risk minimization viewpoint of the previous section. Thus, our goal is to study

$$g^{*}_{(\mathsf{n}, \mathsf{d})} \in \arg\max_{g \in \mathrm{dom}(U_{(\mathsf{n}, \mathsf{d})})} \left\{ U_{(\mathsf{n}, \mathsf{d})}(g) \,:\, g(\mathbf{X}, S) \perp\!\!\!\perp S \right\}. \tag{7}$$

A remarkable property of linear fractional measures is that the unconstrained maximizer can still be obtained by thresholding the regression function $η$. Yet, the threshold in this case might depend on the unknown distribution $\mathbb{P}$ and ought to be estimated. Let us provide a couple of standard examples.

Example. Consider the problem of maximizing the accuracy:

$$\max_{g : \mathcal{X} \times [K] \to \{0, 1\}} \mathbb{P}(Y = g(\mathbf{X}, S)).$$

Setting $(\mathsf{n}_{0}, \mathsf{n}_{1}, \mathsf{n}_{2}) = (1 - \mathbb{P}(Y = 1), 2, -1)$ and $(\mathsf{d}_{0}, \mathsf{d}_{1}, \mathsf{d}_{2}) = (1, 0, 0)$, we see that the above formulation falls within the considered framework.

Example. Consider the problem of maximizing the $F_{1}$-score:

$$\max_{g : \mathcal{X} \times [K] \to \{0, 1\}} \frac{2\,\mathbb{P}(Y = 1, g(\mathbf{X}, S) = 1)}{\mathbb{P}(Y = 1) + \mathbb{P}(g(\mathbf{X}, S) = 1)}.$$

Zhao_Edakunni_Pocock_Brown13 showed that the solution $g^{*}$ of the above optimization problem can be written as

$$g^{*}(\mathbf{x}, s) = \mathbb{1}\{η(\mathbf{x}, s) \geq θ^{*}\},$$

where $θ^{*} \in [0, 1]$ is the unique solution of $θ\,\mathbb{P}(Y = 1) = \mathbb{E}[(η(\mathbf{X}, S) - θ)_{+}]$. koyejo2014consistent pushed these results further, demonstrating that the “thresholding principle” remains true for the whole family of linear fractional measures. In what follows, we will show that their result is still valid if one replaces $η$ by $f^{*}$, the solution of the fair regression problem. This validity is established in a strong sense, meaning that even the equation determining the threshold (as in Example 4) is preserved.
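The fixed-point equation $θ\,\mathbb{P}(Y = 1) = \mathbb{E}[(η(\mathbf{X}, S) - θ)_{+}]$ can be solved numerically once $η$ is known on a sample. The sketch below is an illustration only: it assumes access to exact values of $η$, uses $\mathbb{E}[η(\mathbf{X}, S)]$ as a stand-in for $\mathbb{P}(Y = 1)$, and the names `f1_threshold` and `population_f1` are hypothetical.

```python
import numpy as np

def f1_threshold(eta, tol=1e-10):
    """Solve theta * P(Y = 1) = E[(eta - theta)_+] by bisection on [0, 1]."""
    eta = np.asarray(eta, dtype=float)
    p = eta.mean()                       # E[eta] = P(Y = 1)
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # theta * p - E[(eta - theta)_+] is increasing in theta.
        if mid * p - np.maximum(eta - mid, 0.0).mean() < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def population_f1(eta, theta):
    """F1-score of the classifier 1{eta >= theta}, computed from eta alone."""
    g = eta >= theta
    return 2.0 * np.mean(eta * g) / (eta.mean() + g.mean())
```

When $η(\mathbf{X}, S)$ is uniform on $[0, 1]$, the equation reads $θ/2 = (1 - θ)^{2}/2$, whose root in $[0, 1]$ is $θ^{*} = (3 - \sqrt{5})/2 \approx 0.382$, and the resulting $F_{1}$-score equals $2θ^{*}$.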

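Theorem [thm:fair_optimal_LF] (Wasserstein based fair optimal classifier for non-decomposable measures). Let Assumption 3 be satisfied. Assume that $\mathsf{d} \in \mathbb{R}^{3}$ is such that

$$\mathsf{d}_{0} + \min\{\min\{\mathsf{d}_{1}, 0\} + \mathsf{d}_{2}, 0\} \geq 0.$$

Assume that the coefficients $(\mathsf{n}, \mathsf{d}) \in \mathbb{R}^{3} \times \mathbb{R}^{3}$ satisfy one of the following mutually exclusive conditions:

($C 1$)

($C 2$)

Then, $g^{*}_{(\mathsf{n}, \mathsf{d})}$ defined in Eq. (7) can be expressed for all $(\mathbf{x}, s) \in \mathcal{X} \times [K]$ as

$$g^{*}_{(\mathsf{n}, \mathsf{d})}(\mathbf{x}, s) = \mathbb{1}\{f^{*}(\mathbf{x}, s) \geq θ^{*}_{(\mathsf{n}, \mathsf{d})}\}, \tag{8}$$

where $θ^{*}_{(\mathsf{n}, \mathsf{d})}$ is either the unique solution of

$$\mathbb{E}[(f^{*}(\mathbf{X}, S) - θ)_{+}] = θ \cdot \left\{\frac{\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{d}_{0}\mathsf{n}_{1}}{\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}}\right\} + \left\{\frac{\mathsf{n}_{0}\mathsf{d}_{2} - \mathsf{d}_{0}\mathsf{n}_{2}}{\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}}\right\}, \tag{9}$$

if $\mathsf{n}_{2}\mathsf{d}_{1} \neq \mathsf{d}_{2}\mathsf{n}_{1}$, or $θ^{*}_{(\mathsf{n}, \mathsf{d})} = \frac{\mathsf{d}_{0}\mathsf{n}_{2} - \mathsf{n}_{0}\mathsf{d}_{2}}{\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{d}_{0}\mathsf{n}_{1}}$ otherwise.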

A few comments are in order. First of all, Theorem LABEL:thm:fair_optimal_LF states that the aforementioned “thresholding principle” still holds for optimizing linear-fractional performance measures under the demographic parity constraint: optimal fair classifiers can be obtained by thresholding the optimal fair regression function $f^{*}$ at the right threshold level $θ^{*}_{(\mathsf{n}, \mathsf{d})}$. Moreover, in the case $\mathsf{n}_{2}\mathsf{d}_{1} = \mathsf{d}_{2}\mathsf{n}_{1}$ an explicit expression is provided, while if $\mathsf{n}_{2}\mathsf{d}_{1} \neq \mathsf{d}_{2}\mathsf{n}_{1}$ one needs to solve a fixed-point equation to find the optimal threshold. Given that the function defining the fixed-point equation is univariate, monotone and continuous, the bisection method (or any other univariate root-finding method) can be used to obtain an approximation of the optimal threshold up to arbitrary precision. Finally, since the conditions on the coefficients might seem opaque at first sight, let us argue why they are harmless and meaningful. Intuitively, these conditions specify only two requirements:

The maximization of $U_{(\mathsf{n}, \mathsf{d})}(g)$ makes sense: the more the classifier aligns with $Y$, the better. In particular, these conditions exclude $\mathbb{P}(Y \neq g(\mathbf{X}, S))$, whose maximization does not make sense.
The denominator of $U_{(\mathsf{n}, \mathsf{d})}$ is non-negative.

One can verify that all the measures presented in Table 1 do indeed satisfy these conditions, as do many other linear fractional performance measures from choi2010survey. We would also like to point out that while the conditions of Theorem LABEL:thm:fair_optimal_LF are cumbersome, they are easy to check in practice, unlike those given in (koyejo2014consistent), who relied on $\mathrm{sign}(\mathsf{n}_{1} - U_{(\mathsf{n}, \mathsf{d})}(g^{*}_{(\mathsf{n}, \mathsf{d})})\,\mathsf{d}_{1})$. Indeed, to check the latter, one needs to know or estimate the optimal value of $U_{(\mathsf{n}, \mathsf{d})}$ beforehand, which is not always feasible in practice. In contrast, conditions ($C 1$) and ($C 2$) only involve the known coefficients $(\mathsf{n}, \mathsf{d})$. Finally, let us remark that $U_{(\mathsf{n}, \mathsf{d})} = U_{(-\mathsf{n}, -\mathsf{d})}$ and both conditions ($C 1$) and ($C 2$) are invariant under the transformation $(\mathsf{n}, \mathsf{d}) \mapsto (-\mathsf{n}, -\mathsf{d})$. Yet, to fix only one of the two representations, we additionally require $\mathsf{d}_{0} + \min\{\min\{\mathsf{d}_{1}, 0\} + \mathsf{d}_{2}, 0\} \geq 0$, which forces the user to fix the signs of $\mathsf{d}$ properly. Let us emphasize that, if $\mathsf{d}_{0} + \min\{\min\{\mathsf{d}_{1}, 0\} + \mathsf{d}_{2}, 0\} > 0$, then $\mathrm{dom}(U_{(\mathsf{n}, \mathsf{d})})$ contains every classifier, that is, the denominator never vanishes, which is a consequence of Lemma B.
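The two branches of the threshold rule in Theorem LABEL:thm:fair_optimal_LF can be sketched as a small routine: the explicit formula when $\mathsf{n}_{2}\mathsf{d}_{1} = \mathsf{d}_{2}\mathsf{n}_{1}$, and bisection on Eq. (9) otherwise. This is a sketch under stated assumptions rather than a definitive implementation: `scores` stands for a sample of values of $f^{*}(\mathbf{X}, S)$, the mapping in Eq. (9) is assumed increasing with a sign change on $[0, 1]$ (as in the existence argument of the proof), the coefficients are assumed exactly representable so that the equality test is meaningful, and the name `optimal_threshold` is hypothetical.

```python
import numpy as np

def optimal_threshold(n, d, scores, tol=1e-10):
    """Threshold theta* for coefficient triples n, d and a sample of fair scores."""
    n0, n1, n2 = n
    d0, d1, d2 = d
    det = n2 * d1 - d2 * n1
    if det == 0:
        # Explicit expression of the degenerate case n2 * d1 = d2 * n1.
        return (d0 * n2 - n0 * d2) / (n0 * d1 - d0 * n1)
    slope = (n0 * d1 - d0 * n1) / det
    intercept = (n0 * d2 - d0 * n2) / det
    scores = np.asarray(scores, dtype=float)

    def h(theta):
        # Right-hand side minus left-hand side of Eq. (9).
        return slope * theta + intercept - np.maximum(scores - theta, 0.0).mean()

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With the accuracy coefficients of Table 1 the routine returns $1/2$ without any root finding, while the $F_{1}$ and Jaccard coefficients lead to the same fixed-point equation and hence to the same threshold, in line with the fact that the Jaccard coefficient is an increasing function of the $F_{1}$-score.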

Proof.

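Let us first show that $θ^{*}_{(\mathsf{n}, \mathsf{d})}$ exists and is unique. Indeed, the mapping

$$θ \mapsto θ \cdot \left\{\frac{\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{d}_{0}\mathsf{n}_{1}}{\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}}\right\} + \left\{\frac{\mathsf{n}_{0}\mathsf{d}_{2} - \mathsf{d}_{0}\mathsf{n}_{2}}{\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}}\right\} - \mathbb{E}[(f^{*}(\mathbf{X}, S) - θ)_{+}]$$

is continuous and monotone increasing on $[0, 1]$ under the specified conditions. On the one hand, since $\mathbb{E}[f^{*}(\mathbf{X}, S)] = \mathbb{P}(Y = 1)$ (see chzhen2020minimax, Section 4, item 4 on average stability), for $θ = 0$ the above mapping evaluates to $\left\{\frac{\mathsf{n}_{0}\mathsf{d}_{2} - \mathsf{d}_{0}\mathsf{n}_{2}}{\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}}\right\} - \mathbb{P}(Y = 1) \leq 0$. On the other hand, for $θ = 1$, it evaluates to $\left\{\frac{\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{d}_{0}\mathsf{n}_{1}}{\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}}\right\} + \left\{\frac{\mathsf{n}_{0}\mathsf{d}_{2} - \mathsf{d}_{0}\mathsf{n}_{2}}{\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}}\right\} \geq 0$. The existence follows from the intermediate value theorem and the uniqueness from monotonicity. The rest of the proof follows from the two lemmas presented below. ∎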

The first lemma is similar to the main result of (koyejo2014consistent), while the second one gives an explicit expression for the excess score of any fair classifier. The actual proof technique shares some similarities with the analysis of the $F_{1}$-score in (chzhen2020optimal), who provided an alternative proof of the result of Zhao_Edakunni_Pocock_Brown13 recalled in Example 4.

Lemma [lem:fixed_at_optim_LF] (Fixed point property). Let Assumption 3 be satisfied. Let $g^{*}_{(\mathsf{n}, \mathsf{d})}$ be defined in Theorem LABEL:thm:fair_optimal_LF and assume that $θ^{*}_{(\mathsf{n}, \mathsf{d})}$ defined in Eq. (9) exists. Then,

Proof.

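For compactness we drop the subscripts $(\mathsf{n}, \mathsf{d})$ in this proof. Using Lemma B, we find that

$$\begin{aligned}
\mathbb{P}(g^{*}(\mathbf{X}, S) = 1, Y = 1) &= \mathbb{E}[f^{*}(\mathbf{X}, S)\, g^{*}(\mathbf{X}, S)] \\
&= \mathbb{E}[(f^{*}(\mathbf{X}, S) - θ^{*})_{+}] + θ^{*}\, \mathbb{E}[g^{*}(\mathbf{X}, S)].
\end{aligned}$$

Case 1: $\mathsf{n}_{2}\mathsf{d}_{1} \neq \mathsf{d}_{2}\mathsf{n}_{1}$. Combining this result with (9), we obtain the following expression for $U(g^{*})$:

$$\frac{\mathsf{n}_{0}(\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}) + \mathsf{n}_{1}\big(θ^{*}(\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{d}_{0}\mathsf{n}_{1}) + (\mathsf{n}_{0}\mathsf{d}_{2} - \mathsf{d}_{0}\mathsf{n}_{2})\big) + (\mathsf{n}_{2} + θ^{*}\mathsf{n}_{1})(\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1})\,\mathbb{E}[g^{*}(\mathbf{X}, S)]}{\mathsf{d}_{0}(\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1}) + \mathsf{d}_{1}\big(θ^{*}(\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{d}_{0}\mathsf{n}_{1}) + (\mathsf{n}_{0}\mathsf{d}_{2} - \mathsf{d}_{0}\mathsf{n}_{2})\big) + (\mathsf{d}_{2} + θ^{*}\mathsf{d}_{1})(\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1})\,\mathbb{E}[g^{*}(\mathbf{X}, S)]}.$$

Factorizing the numerator and denominator by $(\mathsf{n}_{2} + θ^{*}\mathsf{n}_{1})$ and $(\mathsf{d}_{2} + θ^{*}\mathsf{d}_{1})$ respectively, the above can be written as

$$U(g^{*}) = \frac{\mathsf{n}_{2} + θ^{*}\mathsf{n}_{1}}{\mathsf{d}_{2} + θ^{*}\mathsf{d}_{1}} \cdot \frac{(\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{d}_{0}\mathsf{n}_{1}) + (\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1})\,\mathbb{E}[g^{*}(\mathbf{X}, S)]}{(\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{d}_{0}\mathsf{n}_{1}) + (\mathsf{n}_{2}\mathsf{d}_{1} - \mathsf{d}_{2}\mathsf{n}_{1})\,\mathbb{E}[g^{*}(\mathbf{X}, S)]} = \frac{\mathsf{n}_{2} + θ^{*}\mathsf{n}_{1}}{\mathsf{d}_{2} + θ^{*}\mathsf{d}_{1}},$$

concluding the proof for the first case.
Case 2: $\mathsf{n}_{2}\mathsf{d}_{1} = \mathsf{d}_{2}\mathsf{n}_{1}$. In this case, using $\mathsf{n}_{1}\mathsf{d}_{2} = \mathsf{n}_{2}\mathsf{d}_{1}$, notice that we have

$$\mathsf{n}_{1}θ^{*} = \frac{\mathsf{n}_{1}\mathsf{n}_{2}\mathsf{d}_{0} - \mathsf{n}_{0}\mathsf{n}_{1}\mathsf{d}_{2}}{\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{n}_{1}\mathsf{d}_{0}} = \mathsf{n}_{2}\,\frac{\mathsf{n}_{1}\mathsf{d}_{0} - \mathsf{n}_{0}\mathsf{d}_{1}}{\mathsf{n}_{0}\mathsf{d}_{1} - \mathsf{n}_{1}\mathsf{d}_{0}} = -\mathsf{n}_{2},$$

and, following the same computations, $\mathsf{d}_{1}θ^{*} = -\mathsf{d}_{2}$. Plugging the above equalities in the definition of $U(g^{*})$ yields

$$U(g^{*}) = \frac{\mathsf{n}_{0} + \mathsf{n}_{1}\,\mathbb{E}[(f^{*}(\mathbf{X}, S) - θ^{*})_{+}]}{\mathsf{d}_{0} + \mathsf{d}_{1}\,\mathbb{E}[(f^{*}(\mathbf{X}, S) - θ^{*})_{+}]}.$$

The proof is concluded. ∎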

The next result provides an explicit expression for the excess score of any fair classifier $g$ .

Lemma [lem:excess_score] (Excess score for fair non-decomposable measures). Let Assumption 3 be satisfied. Let $g^{*}_{(\mathsf{n}, \mathsf{d})}$ be defined as in Theorem LABEL:thm:fair_optimal_LF and assume that $θ^{*} \triangleq θ^{*}_{(\mathsf{n}, \mathsf{d})}$ defined in Eq. (9) exists. Let $\bar{μ}(η)$ be the Wasserstein barycenter of the measures $μ_{1}(η), \dots, μ_{K}(η)$ weighted by $p_{1}, \dots, p_{K}$, respectively. Define $β^{*}$ as $β^{*} = F_{\bar{μ}(η)}(θ^{*})$. Let . Then, for any classifier $g \in \mathrm{dom}(U_{(\mathsf{n}, \mathsf{d})})$ such that $g(\mathbf{X}, S) \perp\!\!\!\perp S$, the excess score equals

Furthermore, under the conditions on $(\mathsf{n}, \mathsf{d})$ specified in Theorem LABEL:thm:fair_optimal_LF, we have $U_{(\mathsf{n}, \mathsf{d})}(g^{*}_{(\mathsf{n}, \mathsf{d})}) \geq U_{(\mathsf{n}, \mathsf{d})}(g)$ for all classifiers $g : \mathcal{X} \times [K] \to \{0, 1\}$.

Remark. Lemma LABEL:lem:excess_score, together with Lemma B, stated in the appendix, implies that

Hence, the inequality for all $g$ is implied from

The first of the above conditions is ensured if $\mathsf{d}_{0} + \min\{\min\{\mathsf{d}_{1}, 0\} + \mathsf{d}_{2}, 0\} \geq 0$ (Lemma B), as assumed in Theorem LABEL:thm:fair_optimal_LF, and the second one is ensured by ($C 1$) or ($C 2$), as proved in Lemma B.

Proof of Lemma LABEL:lem:excess_score.

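Let $\bar{μ}(η)$ be the Wasserstein barycenter of the measures $μ_{1}(η), \dots, μ_{K}(η)$, weighted by $p_{1}, \dots, p_{K}$ respectively. Assumption 3 and the form of $f^{*}$ ensure that the fair optimal classifier in Eq. (8) can be expressed as

$$g^{*}(\mathbf{x}, s) = \mathbb{1}\left\{η(\mathbf{x}, s) \geq F_{μ_{s}(η)}^{-1} \circ F_{\bar{μ}(η)}(θ^{*})\right\} = \mathbb{1}\left\{η(\mathbf{x}, s) \geq F_{μ_{s}(η)}^{-1}(β^{*})\right\},$$

where $β^{*} = F_{\bar{μ}(η)}(θ^{*})$. Fix an arbitrary classifier $g$ which satisfies the demographic parity constraint.
Our goal is to develop $U(g^{*}) - U(g)$, which we express as a sum of two terms $I + II$, with

$$I \triangleq \frac{\mathsf{n}_{1}\,\mathbb{E}[η(\mathbf{X}, S)(g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S))] + \mathsf{n}_{2}\,\mathbb{E}[g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S)]}{\mathsf{d}_{0} + \mathsf{d}_{1}\,\mathbb{E}[η(\mathbf{X}, S)\, g^{*}(\mathbf{X}, S)] + \mathsf{d}_{2}\,\mathbb{E}[g^{*}(\mathbf{X}, S)]},$$

and

$$II \triangleq -U(g)\,\frac{\mathsf{d}_{1}\,\mathbb{E}[η(\mathbf{X}, S)(g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S))] + \mathsf{d}_{2}\,\mathbb{E}[g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S)]}{\mathsf{d}_{0} + \mathsf{d}_{1}\,\mathbb{E}[η(\mathbf{X}, S)\, g^{*}(\mathbf{X}, S)] + \mathsf{d}_{2}\,\mathbb{E}[g^{*}(\mathbf{X}, S)]}.$$

One verifies that indeed $U(g^{*}) - U(g) = I + II$. Thanks to the alternative definition of $g^{*}$ introduced in the beginning of this proof, for any $a, b \in \mathbb{R}$ we have

$$\begin{aligned}
a\,\mathbb{E}[η(\mathbf{X}, S)(g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S))] &+ b\,\mathbb{E}[g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S)] \\
&= a\,\mathbb{E}\left[\left|η(\mathbf{X}, S) - F_{μ_{S}(η)}^{-1}(β^{*})\right| \mathbb{1}\{g^{*}(\mathbf{X}, S) \neq g(\mathbf{X}, S)\}\right] \\
&\quad + \mathbb{E}\left[\left(b + a F_{μ_{S}(η)}^{-1}(β^{*})\right)(g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S))\right] \\
&= a\,\mathbb{E}\left[\left|η(\mathbf{X}, S) - F_{μ_{S}(η)}^{-1}(β^{*})\right| \mathbb{1}\{g^{*}(\mathbf{X}, S) \neq g(\mathbf{X}, S)\}\right] \\
&\quad + \left(b + a F_{\bar{μ}(η)}^{-1}(β^{*})\right) \mathbb{E}[g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S)],
\end{aligned}$$

where the last equality is due to the fact that both $g^{*}$ and $g$ satisfy the demographic parity constraint. Thus, setting $Δ(g^{*}, g) \triangleq \mathbb{E}\left[\left|η(\mathbf{X}, S) - F_{μ_{S}(η)}^{-1}(β^{*})\right| \mathbb{1}\{g^{*}(\mathbf{X}, S) \neq g(\mathbf{X}, S)\}\right]$ and recalling that $θ^{*} = F_{\bar{μ}(η)}^{-1}(β^{*})$, we can express $I$ and $II$ as

$$\begin{aligned}
I &= \frac{\mathsf{n}_{1} Δ(g^{*}, g) + (\mathsf{n}_{2} + \mathsf{n}_{1}θ^{*})\,\mathbb{E}[g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S)]}{\mathsf{d}_{0} + \mathsf{d}_{1}\,\mathbb{E}[η(\mathbf{X}, S)\, g^{*}(\mathbf{X}, S)] + \mathsf{d}_{2}\,\mathbb{E}[g^{*}(\mathbf{X}, S)]}, \\
II &= -U(g)\,\frac{\mathsf{d}_{1} Δ(g^{*}, g) + (\mathsf{d}_{2} + \mathsf{d}_{1}θ^{*})\,\mathbb{E}[g^{*}(\mathbf{X}, S) - g(\mathbf{X}, S)]}{\mathsf{d}_{0} + \mathsf{d}_{1}\,\mathbb{E}[η(\mathbf{X}, S)\, g^{*}(\mathbf{X}, S)] + \mathsf{d}_{2}\,\mathbb{E}[g^{*}(\mathbf{X}, S)]}.
\end{aligned}$$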

Case 1: $\sfn_2\sfd_1 \neq \sfd_2\sfn_1$. Lemma LABEL:lem:fixed_at_optim_LF implies that

$$I = \frac{\sfn_1\Delta(g^{*},g) + U(g^{*})(\sfd_2 + \sfd_1\theta^{*})\Exp[g^{*}(\bX,S) - g(\bX,S)]}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g^{*}(\bX,S)] + \sfd_2\Exp[g^{*}(\bX,S)]}.$$

Combining the above two expressions for $I$ and $II$ we obtain

$$\begin{aligned}
U(g^{*}) - U(g) &= \frac{(U(g^{*}) - U(g))(\sfd_2 + \sfd_1\theta^{*})\Exp[g^{*}(\bX,S) - g(\bX,S)]}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g^{*}(\bX,S)] + \sfd_2\Exp[g^{*}(\bX,S)]} \\
&\quad + \frac{(\sfn_1 - U(g)\sfd_1)\Delta(g^{*},g)}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g^{*}(\bX,S)] + \sfd_2\Exp[g^{*}(\bX,S)]}.
\end{aligned}$$

Simplifying the above and using Lemma B, we obtain

$$U(g^{*}) - U(g) = \frac{(\sfn_1 - U(g)\sfd_1)\Delta(g^{*},g)}{\sfd_0 + \sfd_1\Exp[(f^{*}(\bX,S) - \theta^{*})_+] + (\sfd_2 + \theta^{*}\sfd_1)\Exp[g(\bX,S)]}.$$

As in Lemma LABEL:lem:fixed_at_optim_LF (using the expression for the numerator), we deduce that

$$\begin{aligned}
\sfd_0 + \sfd_1\Exp[(f^{*}(\bX,S) - \theta^{*})_+] &+ (\sfd_2 + \theta^{*}\sfd_1)\Exp[g(\bX,S)] \\
&= (\sfd_2 + \theta^{*}\sfd_1)\,\frac{(\sfn_0\sfd_1 - \sfd_0\sfn_1) + (\sfn_2\sfd_1 - \sfd_2\sfn_1)\Exp[g(\bX,S)]}{\sfn_2\sfd_1 - \sfd_2\sfn_1},
\end{aligned}$$

and using the definition of $U(g)$, we can write

$$\sfn_1 - U(g)\sfd_1 = \frac{(\sfn_1\sfd_0 - \sfd_1\sfn_0) + (\sfn_1\sfd_2 - \sfd_1\sfn_2)\Exp[g(\bX,S)]}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g(\bX,S)] + \sfd_2\Exp[g(\bX,S)]}.$$

(10)

Combining the last three displays, we arrive at the claimed equality

$$U(g^{*}) - U(g) = \frac{\sfd_2\sfn_1 - \sfn_2\sfd_1}{\sfd_2 + \theta^{*}\sfd_1} \cdot \frac{\Exp[\abs{\eta(\bX,S) - F^{-1}_{\mu_S(\eta)}(\beta^{*})}\ind\{g^{*}(\bX,S) \neq g(\bX,S)\}]}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g(\bX,S)] + \sfd_2\Exp[g(\bX,S)]}.$$

Case 2: $\sfn_2\sfd_1 = \sfd_2\sfn_1$. We have shown in the proof of Lemma LABEL:lem:fixed_at_optim_LF that in this particular case, $\sfn_1\theta^{*} + \sfn_2 = \sfd_1\theta^{*} + \sfd_2 = 0$. Hence $I$ and $II$ reduce to

$$\begin{aligned}
I &= \frac{\sfn_1\Delta(g^{*},g)}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g^{*}(\bX,S)] + \sfd_2\Exp[g^{*}(\bX,S)]}, \\
II &= -U(g)\,\frac{\sfd_1\Delta(g^{*},g)}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g^{*}(\bX,S)] + \sfd_2\Exp[g^{*}(\bX,S)]}.
\end{aligned}$$

Consequently, the difference of utilities is expressed as

$$U(g^{*}) - U(g) = \frac{(\sfn_1 - U(g)\sfd_1)\Delta(g^{*},g)}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g^{*}(\bX,S)] + \sfd_2\Exp[g^{*}(\bX,S)]}.$$

Again invoking the result of Lemma B, we deduce

$$\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g^{*}(\bX,S)] + \sfd_2\Exp[g^{*}(\bX,S)] = \sfd_0 + \sfd_1\Exp[(f^{*}(\bX,S) - \theta^{*})_+].$$

The above two displays combined with Eq. (10) and the condition $\sfn_2\sfd_1 = \sfd_2\sfn_1$ yield

$$U(g^{*}) - U(g) = \frac{\sfn_1\sfd_0 - \sfd_1\sfn_0}{\sfd_0 + \sfd_1\Exp[(f^{*}(\bX,S) - \theta^{*})_+]} \cdot \frac{\Delta(g^{*},g)}{\sfd_0 + \sfd_1\Exp[\eta(\bX,S)g(\bX,S)] + \sfd_2\Exp[g(\bX,S)]}.$$

The proof is concluded. ∎

Let us remark that the content of this section can be seen as a strict improvement over koyejo2014consistent, who only derived Lemma LABEL:lem:fixed_at_optim_LF in the absence of the fairness constraint. Indeed, assuming that $S \independent \bX$ ensures that any classifier $g$ is demographic parity fair and that $f^{*} \equiv \eta$. In the absence of the demographic parity constraint, Assumption 3 is not necessary and exactly the same proof technique yields the characterization of the optimal unconstrained classifier.

Examples: accuracy and $F_{1}$ -score

In this part, we give specific examples of the parameters $(\sfn_0,\sfn_1,\sfn_2)$ and $(\sfd_0,\sfd_1,\sfd_2)$ and instantiate Theorem LABEL:thm:fair_optimal_LF and Lemma LABEL:lem:excess_score. The first example concerns accuracy as a performance metric. It highlights the generality of the derived results. {example}[Accuracy under fairness constraint] Recalling the coefficients specified in Example 4, we see that in this case $\sfn_2\sfd_1 - \sfd_2\sfn_1 = (-1)\cdot 0 - 0\cdot 2 = 0$. Furthermore, one checks that condition ( $C 2$ ) is satisfied. Hence, under Assumption 3, Theorem LABEL:thm:fair_optimal_LF states that

$$g^{*}_{(n,d)}(\bx,s) = \ind\{f^{*}(\bx,s) \geq \theta^{*}_{(n,d)}\},$$

with $\theta^{*}_{(n,d)} = \frac{\sfd_0\sfn_2 - \sfn_0\sfd_2}{\sfn_0\sfd_1 - \sfd_0\sfn_1} = \frac{1\cdot(-1) - (1-\Prob(Y=1))\cdot 0}{(1-\Prob(Y=1))\cdot 0 - 1\cdot 2} = \frac{1}{2}$, maximizes the accuracy $\Prob(Y = g(\bX,S))$ under the demographic parity constraint. Thus, it coincides with the result of Theorem LABEL:thm:equivalence. Furthermore, Lemma LABEL:lem:excess_score states that for any classifier $g$ such that $g(\bX,S) \independent S$, it holds that

$$\Prob(Y = g^{*}_{(n,d)}(\bX,S)) - \Prob(Y = g(\bX,S)) = 2\,\Exp[\abs{\eta(\bX,S) - F^{-1}_{\mu_S(\eta)} \circ F_{\bar\mu(\eta)}(.5)}\ind\{g^{*}(\bX,S) \neq g(\bX,S)\}].$$

We invite the reader to compare the above expression with its unconstrained version (devroye2013probabilistic, Theorem 2.2).
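To make the accuracy instance concrete, the following sketch checks on a toy sample that accuracy is indeed a linear-fractional measure. The coefficients $\sfn = (1 - \Prob(Y=1), 2, -1)$ and $\sfd = (1, 0, 0)$ are read off from the numerical evaluation of $\theta^{*}$ above; the sample, the helper name `linear_fractional_score` and the use of empirical means of $Y g$ in place of $\Exp[\eta g]$ are assumptions made for illustration.

```python
def linear_fractional_score(n, d, y, g):
    """Empirical (n0 + n1 mean(y*g) + n2 mean(g)) / (d0 + d1 mean(y*g) + d2 mean(g))."""
    m = len(y)
    e_yg = sum(a * b for a, b in zip(y, g)) / m   # estimates E[Y g] = E[eta g]
    e_g = sum(g) / m                              # estimates E[g]
    return (n[0] + n[1] * e_yg + n[2] * e_g) / (d[0] + d[1] * e_yg + d[2] * e_g)


# Toy labels and predictions (hypothetical data).
y = [1, 0, 1, 1, 0, 0, 1, 0]
g = [1, 0, 0, 1, 1, 0, 1, 0]
p_y1 = sum(y) / len(y)                            # empirical P(Y = 1)

n_acc, d_acc = (1 - p_y1, 2, -1), (1, 0, 0)
accuracy_lf = linear_fractional_score(n_acc, d_acc, y, g)
accuracy_direct = sum(a == b for a, b in zip(y, g)) / len(y)

# In Case 2, n1 * theta + n2 = 0 pins down the threshold.
theta_star = -n_acc[2] / n_acc[1]
```

On this sample both ways of computing accuracy agree, and the threshold recovered from $\sfn_1\theta^{*} + \sfn_2 = 0$ is $1/2$.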

The second example concerns the $F_{1}$ -score that has been used in several empirical works on fairness as a performance measure (wang2021analyzing; dablain2022towards; wick2019unlocking). {example}[ $F_{1}$ -score under fairness constraint] Recall that the $F_{1}$ -score is defined as

$$F_1(g) = \frac{2\Prob(g(\bX,S) = 1, Y = 1)}{\Prob(Y = 1) + \Prob(g(\bX,S) = 1)}.$$

Using the coefficients specified in Example 4, we see that for this case $\sfn_2\sfd_1 \neq \sfd_2\sfn_1$ and condition ( $C 1$ ) is satisfied. Hence, under Assumption 3, Theorem LABEL:thm:fair_optimal_LF states that

$$g^{*}_{(n,d)}(\bx,s) = \ind\{f^{*}(\bx,s) \geq \theta^{*}_{(n,d)}\},$$

with $\theta^{*}_{(n,d)}$ being the unique solution of

$$\Prob(Y = 1)\,\theta = \Exp(f^{*}(\bX,S) - \theta)_+,$$

maximizes the $F_{1}$ -score under the demographic parity constraint. Furthermore, Lemma LABEL:lem:excess_score states that for any classifier $g$ such that $g(\bX,S) \independent S$, it holds that

$$F_1(g^{*}_{(n,d)}) - F_1(g) = \frac{2\,\Exp[\abs{\eta(\bX,S) - F^{-1}_{\mu_S(\eta)} \circ F_{\bar\mu(\eta)}(\theta^{*}_{(n,d)})}\ind\{g^{*}_{(n,d)}(\bX,S) \neq g(\bX,S)\}]}{\Prob(Y = 1) + \Prob(g(\bX,S) = 1)}.$$

We invite the reader to compare the above expression with its unconstrained version (chzhen2020optimal, Lemma 2).
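The fixed-point equation for the $F_{1}$ threshold is easy to solve numerically, since $\theta \mapsto \Prob(Y=1)\theta - \Exp(f^{*}(\bX,S) - \theta)_+$ is continuous and increasing. The sketch below uses bisection on a hypothetical finite sample of fair scores; the list `scores`, the function name `f1_threshold` and the choice of estimating $\Prob(Y=1)$ by the mean score are assumptions for illustration, not the authors' procedure.

```python
def f1_threshold(scores, p_y1, tol=1e-12):
    """Bisection for the root of h(theta) = p_y1 * theta - mean((f - theta)_+) on [0, 1]."""
    def h(theta):
        return p_y1 * theta - sum(max(f - theta, 0.0) for f in scores) / len(scores)

    lo, hi = 0.0, 1.0            # h(0) <= 0 <= h(1) when scores lie in [0, 1]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


scores = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]    # hypothetical values of f*(X, S)
p_y1 = sum(scores) / len(scores)           # mean score used as a stand-in for P(Y = 1)
theta = f1_threshold(scores, p_y1)

# F1 of the thresholded classifier, with the scores playing the role of eta.
g = [int(f >= theta) for f in scores]
f1_value = 2 * sum(f * gi for f, gi in zip(scores, g)) / len(scores) / (p_y1 + sum(g) / len(g))
residual = p_y1 * theta - sum(max(f - theta, 0.0) for f in scores) / len(scores)
```

At the root, the score of the thresholded rule equals $2\theta$ on this sample, which matches the relation $\sfn_2 + \sfn_1\theta^{*} = U(g^{*})(\sfd_2 + \sfd_1\theta^{*})$ used in Case 1 when the $F_{1}$ coefficients $\sfn = (0, 2, 0)$, $\sfd = (\Prob(Y=1), 0, 1)$ are plugged in.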

5 The unawareness case

All the previous parts were concerned with the awareness setup, in which we allowed ourselves to use the sensitive attribute explicitly. However, it can happen in practice that, for legal or ethical reasons, the sensitive attribute cannot be used as an input at prediction time (barocas2016big). Throughout this section we look at classifiers $g$ that take only the features $\bX$ as input and return a label in $\{0, 1\}$. By abuse of notation, and as long as confusion cannot occur, we use the same notation to denote the set of all classifiers in the unawareness setup. We also need to introduce the conditional distribution of the sensitive attribute $S$ given the nominally non-sensitive features $\bX$. For all $s \in [K]$, we set $\tau_s(\bX) = \Prob(S = s \mid \bX)$. With one more abuse of notation, we set $\eta(\bX) \triangleq \Exp[Y \mid \bX]$. In this section we look for

(11)

Note that the only difference with the previous setup is the absence of the sensitive input $S$ in the input of $g$ . lipton2018does investigated this framework empirically and provided evidence against its use in practice. In particular, they empirically showed that while not permitting using the sensitive attribute $S$ , many algorithms still learn the link between $S$ and $\bX$ implicitly. Our first result gives a theoretical justification to this phenomenon.

As in the awareness case, we work under a continuity assumption adapted to this scenario. Recall that Assumption 3 imposed continuity of the distribution $\Law(\eta(\bX,s))$ of the regression function for each sensitive group $s \in [K]$. Here we need a different assumption to account for the fact that $S$ is no longer accessible, namely the continuity of the distribution of any linear combination of the regression functions $\eta(\bX)$ and $(\tau_s(\bX))_{s \in [K]}$.

{assumption}

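For every $s \in [K]$ and for every vector $\bc = (c_1, \dots, c_K)^{\top} \in \bbR^K$ such that $c_{1} + \dots + c_{K} = 0$, the distribution $\Law\big(\eta(\bX) + \sum_{\sigma=1}^{K} \frac{c_\sigma}{p_\sigma}\tau_\sigma(\bX) \mid S = s\big)$ is continuous.

Akin to Theorem 3, we derive the explicit form of an optimal fair classifier in the unawareness setting. {theorem} Let Assumption 5 be satisfied. Then a solution $g^{*}$ defined in Eq. (11) can be expressed for all $\bx$ as

$$g^{*}(\bx) = \ind\Big\{2\eta(\bx) - 1 \geq \sum_{\sigma=1}^{K} \lambda^{*}_\sigma \frac{\tau_\sigma(\bx)}{p_\sigma}\Big\},$$

where $\blambda^{*} = (\lambda^{*}_1, \dots, \lambda^{*}_K) \in \bbR^K$ is a solution of

$$\min_{\blambda \in \bbR^K}\Big\{\Exp\Big[\abs{2\eta(\bX) - 1 - \sum_{\sigma=1}^{K} \lambda_\sigma \frac{\tau_\sigma(\bX)}{p_\sigma}}\Big] \,:\, \Exp\Big[\frac{\lambda_S}{p_S}\Big] = 0\Big\}.$$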

(12)

We make two observations. First of all, the optimal fair classifier is no longer given by a group-wise threshold. Yet, one can think of the term $\theta(\bx) \triangleq \sum_{\sigma=1}^{K} \lambda^{*}_\sigma \frac{\tau_\sigma(\bx)}{p_\sigma}$ as an $\bx$ -dependent threshold. The optimal classifier $g^{*}$ tries to guess the value of the sensitive attribute from the features in order to set the threshold properly. Note that, as in the awareness case, here we have $\Exp[\theta(\bX)] = 0$. Thus, on average, the “threshold” remains equal to $1 / 2$ as in the standard classification setup. Secondly, we see that if $S$ is measurable w.r.t. $\bX$, we fall back to the awareness case. Otherwise each variable $\lambda_{s}^{*}$ is weighted by the conditional distribution of $S$ given $\bX$.
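A small finite example may help to visualize the $\bx$ -dependent threshold. The sketch below assumes a feature space with three points and two groups; all probabilities, the multipliers `lam` (chosen only to sum to zero, not optimized) and the function names are hypothetical. It evaluates the decision rule of the theorem and confirms that the threshold averages to zero.

```python
def unaware_threshold(lam, tau_x, p):
    """theta(x) = sum over sigma of lam[sigma] * tau_x[sigma] / p[sigma]."""
    return sum(l * t / q for l, t, q in zip(lam, tau_x, p))


def unaware_classifier(eta_x, lam, tau_x, p):
    """Decision 1{2 eta(x) - 1 >= theta(x)}."""
    return int(2.0 * eta_x - 1.0 >= unaware_threshold(lam, tau_x, p))


# Hypothetical toy distribution on three feature values.
px = [0.5, 0.3, 0.2]                          # P(X = x)
tau = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]    # tau[x][s] = P(S = s | X = x)
eta = [0.6, 0.5, 0.3]                         # eta[x] = E[Y | X = x]
p = [sum(px[x] * tau[x][s] for x in range(3)) for s in range(2)]  # p_s = P(S = s)

lam = [0.3, -0.3]                             # illustrative multipliers, sum to zero
decisions = [unaware_classifier(eta[x], lam, tau[x], p) for x in range(3)]
mean_threshold = sum(px[x] * unaware_threshold(lam, tau[x], p) for x in range(3))
```

With these numbers the rule returns `[0, 1, 1]`, whereas plain thresholding of $\eta$ at $1/2$ would return `[1, 1, 0]`: the threshold is raised where the first group is likely and lowered where the second group is likely.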

Importantly, it remains an open problem to connect the above problem with the corresponding regression setup. The main reason is the current lack of an explicit solution to the optimal fair regression problem in the unawareness case. Some attempts were made in (chzhen2020example), yet they are unsatisfactory and do not give a complete picture. Intuitively, the difficulty of extending the optimal transport based approach to the unawareness setup lies in our inability to establish the source of a given $\bx$. In other words, given $\bx$, we do not know which of $\Prob_{\bX \mid S=1}, \dots, \Prob_{\bX \mid S=K}$ it was sampled from. Hence, we cannot build a transport map from $\Law(\eta(\bX,S) \mid S = s)$ to their common barycenter since it requires the knowledge of $S$. Naively, one might think of using $\hat{S}(\bX)$, the best prediction of $S$ given $\bX$, instead of $S$. While intuitive, it is easy to see that simply replacing $S$ by $\hat{S}(\bX)$ in Theorem LABEL:thm:equivalence does not even satisfy the demographic parity constraint in general. As we show in the next paragraph, the connection between fair classification and fair regression can be made explicit in the unawareness case if we consider the case of $K = 2$. The existence of such a connection is explained by the Hahn decomposition theorem for signed measures, whose generalization (even its formulation) to many measures is unclear.

Binary sensitive attribute: the $(\Prob→\QProb)$ reduction.

In this section we describe a reduction of the fair unaware binary classification problem to the awareness case for $K = 2$. First of all, let us recall that the minimization of $\Prob(Y \neq g(\bX,S))$ over $g$ under any constraints is equivalent to the minimization of $\Exp[g(\bX,S)(1 - 2\eta(\bX,S))]$ under the same constraints. Furthermore, the same applies to the unawareness case, where we only need to replace $\eta(\bX,S)$ by $\eta(\bX)$.
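For completeness, the equivalence recalled above follows from a short computation valid for any binary label $Y$ and any binary classifier $g$ (a standard identity, written here in the awareness notation):

$$\Prob(Y \neq g(\bX,S)) = \Exp[Y(1 - g(\bX,S)) + (1 - Y)g(\bX,S)] = \Prob(Y = 1) + \Exp[g(\bX,S)(1 - 2Y)] = \Prob(Y = 1) + \Exp[g(\bX,S)(1 - 2\eta(\bX,S))],$$

where the last step conditions on $(\bX,S)$. Since $\Prob(Y = 1)$ does not depend on $g$, both objectives have the same minimizers under any constraint on $g$.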

For our reduction, given a distribution $\Prob$ of $(\bX, S, Y)$ with $S \in \{1, 2\}$ and $Y \in \{0, 1\}$, we build another distribution $\QProb$ of $(\bX, S)$ with $S \in \{1, 2\}$ and a function $\tilde\eta$ of $(\bx, s)$ with values in $[0, +\infty)$ with the following property: there is a one-to-one correspondence between

$$g^{*}_{\Prob} \in \argmin_{g}\big\{\Exp_{\Prob}[g(\bX)(1 - 2\eta(\bX))] \,:\, g(\bX) \independent_{\Prob} S\big\},$$

and

$$g^{*}_{\QProb} \in \argmin_{g}\big\{\Exp_{\QProb}[g(\bX,S)(1 - 2\tilde\eta(\bX,S))] \,:\, g(\bX,S) \independent_{\QProb} S\big\},$$

where the first minimization is over unaware classifiers and the second one over aware classifiers. In other words, if $g^{*}_{\QProb}$ is an optimal fair classifier for distribution $\QProb$ under awareness, then $g^{*}_{\QProb}$ can be transformed into an optimal fair classifier $g^{*}_{\Prob}$ for $\Prob$ under unawareness. In what follows, we present the reduction and, given the distribution $\Prob$, explain the procedure to build $\QProb$.

Let $\TV \triangleq \frac{1}{2}\int \abs{\d\Prob_{\bX \mid S=1} - \d\Prob_{\bX \mid S=2}}$. Note that if $\TV = 0$, then $\bX \independent S$ and any unaware classifier satisfies the demographic parity constraint. Hence, we assume that $\TV \in (0, 1]$. We define $\QProb$ in three steps.

(i) The distribution of $\bX$ given $S$ under $\QProb$ is defined as

$$\QProb_{\bX \mid S=1} = \frac{(\Prob_{\bX \mid S=1} - \Prob_{\bX \mid S=2})_+}{\TV} \quad\text{and}\quad \QProb_{\bX \mid S=2} = \frac{(\Prob_{\bX \mid S=2} - \Prob_{\bX \mid S=1})_+}{\TV},$$

where $(\Prob_{\bX \mid S=1} - \Prob_{\bX \mid S=2})_+$ and $(\Prob_{\bX \mid S=2} - \Prob_{\bX \mid S=1})_+$ are given by the Hahn decomposition of the signed measure $\Prob_{\bX \mid S=2} - \Prob_{\bX \mid S=1}$ (see, e.g., billingsley2008probability, Theorem 32.1);
(ii) the distribution of $S$ under $\QProb$ is defined as

$$\QProb(S = 1) = \QProb(S = 2) = \frac{1}{2};$$
(iii) the new pseudo-regression function $\tilde\eta$ is defined as

$$\tilde\eta(\bx,s) = \frac{1}{2} + \frac{\TV}{2} \cdot \frac{2\eta(\bx) - 1}{\abs{\frac{\tau_1(\bx)}{p_1} - \frac{\tau_2(\bx)}{p_2}}} \quad\text{for } \bx \in \supp(\QProb_{\bX \mid S=1}) \cup \supp(\QProb_{\bX \mid S=2}).$$
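On a finite feature space the three steps reduce to elementary arithmetic, because the positive and negative parts of a signed measure are taken point by point. The sketch below is a discrete analogue under that assumption; the probability vectors, the function name `build_q` and the choice of defining $\tilde\eta$ on the points charged by either conditional of $\QProb$ are ours.

```python
def build_q(p1, p2, p_s, eta):
    """Finite-space analogue of the three-step construction.

    p1[x], p2[x]: P(X = x | S = 1) and P(X = x | S = 2); p_s: (P(S=1), P(S=2));
    eta[x]: E[Y | X = x]. Returns TV, the two conditionals of Q and eta_tilde.
    """
    tv = 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))
    if tv == 0.0:
        raise ValueError("X is independent of S, no reduction is needed")
    # Step (i): normalized positive parts of p1 - p2 and of p2 - p1.
    q1 = [max(a - b, 0.0) / tv for a, b in zip(p1, p2)]
    q2 = [max(b - a, 0.0) / tv for a, b in zip(p1, p2)]
    # Step (ii) is the convention Q(S = 1) = Q(S = 2) = 1/2 and needs no computation.
    # Step (iii): pseudo-regression function where Q puts mass.
    px = [p_s[0] * a + p_s[1] * b for a, b in zip(p1, p2)]
    eta_tilde = {}
    for x in range(len(p1)):
        if p1[x] != p2[x]:
            ratio = abs(p1[x] - p2[x]) / px[x]   # |tau_1(x)/p_1 - tau_2(x)/p_2|
            eta_tilde[x] = 0.5 + 0.5 * tv * (2.0 * eta[x] - 1.0) / ratio
    return tv, q1, q2, eta_tilde


# Hypothetical toy distribution on three feature values.
tv, q1, q2, eta_tilde = build_q(
    p1=[0.5, 0.3, 0.2], p2=[0.2, 0.3, 0.5], p_s=(0.5, 0.5), eta=[0.9, 0.5, 0.4]
)
```

In this toy case $\QProb_{\bX \mid S=1}$ concentrates on the first point and $\QProb_{\bX \mid S=2}$ on the last one, while the middle point, where both conditionals of $\Prob$ agree, receives no mass under $\QProb$.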

We note that under $\QProb$, the sensitive attribute $S$ is measurable w.r.t. $\bX$ since the supports of $\QProb_{\bX \mid S=1}$ and $\QProb_{\bX \mid S=2}$ do not intersect. We refer to $\tilde\eta$ as the pseudo-regression function since it is not guaranteed to take values in $[0, 1]$ and, hence, is not necessarily a valid regression function of $Y \mid \bX$ under $\QProb$ for $Y \in \{0, 1\}$. {myproposition}[Unawareness to awareness reduction] Let $\Prob$ be any distribution of $(\bX, S, Y)$ with $S \in \{1, 2\}$ and $Y \in \{0, 1\}$. Let $\QProb$ and $\tilde\eta$ be defined using the three-step procedure described above and

$$g^{*}_{\QProb} \in \argmin_{g}\big\{\Exp_{\QProb}[g(\bX,S)(1 - 2\tilde\eta(\bX,S))] \,:\, g(\bX,S) \independent_{\QProb} S\big\}.$$

Then, the unaware classifier $g^{*}_{\Prob}$ defined point-wise as

$$g^{*}_{\Prob}(\bx) = \begin{cases} g^{*}_{\QProb}(\bx, 1) & \bx \in \supp(\QProb_{\bX \mid S=1}), \\ g^{*}_{\QProb}(\bx, 2) & \bx \in \supp(\QProb_{\bX \mid S=2}), \\ \ind\{\eta(\bx) \geq 1/2\} & \bx \in \supp(\Prob_{\bX}) \setminus \big(\supp(\QProb_{\bX \mid S=1}) \cup \supp(\QProb_{\bX \mid S=2})\big), \end{cases}$$

is a solution of $\min_{g}\{\Exp_{\Prob}[g(\bX)(1 - 2\eta(\bX))] : g(\bX) \independent_{\Prob} S\}$, where the minimum is over unaware classifiers.

Proof.

For any aware classifier $g$, define the unaware classifier $\tilde{g}$ as

$$\tilde{g}(\bx) = \begin{cases} g(\bx, 1) & \bx \in \supp(\QProb_{\bX \mid S=1}), \\ g(\bx, 2) & \bx \in \supp(\QProb_{\bX \mid S=2}), \\ \ind\{\eta(\bx) \geq 1/2\} & \bx \in \supp(\Prob_{\bX}) \setminus \big(\supp(\QProb_{\bX \mid S=1}) \cup \supp(\QProb_{\bX \mid S=2})\big). \end{cases}$$

Note that the above correspondence of $g$ and $\tilde{g}$ is invertible since the supports of $\QProb_{\bX \mid S=1}$ and $\QProb_{\bX \mid S=2}$ do not intersect by construction. Observe that for any aware classifier $g$ it holds that

$$g(\bX,S) \independent_{\QProb} S \iff g(\cdot, 1)\sharp\QProb_{\bX \mid S=1} = g(\cdot, 2)\sharp\QProb_{\bX \mid S=2} \iff \tilde{g}\sharp\Prob_{\bX \mid S=1} = \tilde{g}\sharp\Prob_{\bX \mid S=2}.$$

Thus, given any classifier $g$ satisfying the demographic parity constraint under $\QProb$, we can transform it into a classifier that satisfies the constraint under $\Prob$. Furthermore, since

$$\Exp_{\QProb}[g(\bX,S)(1 - 2\tilde\eta(\bX,S))] = \Exp_{\Prob}\big[\tilde{g}(\bX)(1 - 2Y)\ind\{\bX \in \supp(\QProb_{\bX \mid S=1}) \cup \supp(\QProb_{\bX \mid S=2})\}\big],$$

taking any unaware classifier $\bar{g}$ we can write

$$\begin{aligned}
\Exp_{\Prob}[\bar{g}(\bX)(1 - 2Y)] &= \Exp_{\QProb}\big[\bar{g}(\bX,S)(1 - 2\tilde\eta(\bX,S))\ind\{\bX \in \supp(\QProb_{\bX \mid S=1}) \cup \supp(\QProb_{\bX \mid S=2})\}\big] \\
&\quad + \Exp_{\Prob}\big[\bar{g}(\bX)(1 - 2Y)\ind\{\bX \notin \supp(\QProb_{\bX \mid S=1}) \cup \supp(\QProb_{\bX \mid S=2})\}\big],
\end{aligned}$$

where in the first equality, we added the input $S$ to $\bar{g}$ due to the fact that $S$ is $\bX$ -measurable under $\QProb$. Note that the second term is minimized point-wise by the Bayes classifier, while the first term is minimized by $g^{*}_{\QProb}$ thanks to the equivalence established for the demographic parity constraint. ∎

The above result provides a theoretical justification for the empirical observations made by lipton2018does. Indeed, they have empirically shown that in the unawareness setting, many classification algorithms tailored for the demographic parity constraint are forced to “guess” the sensitive attribute $S$. Theoretically, this is reflected by the construction of the distribution of $\bX \mid S$ under $\QProb$. Furthermore, since the reduction is performed to the awareness setup, the results of previous sections on the connection between fair regression and fair classification still apply. Yet, we emphasize that the above argument is only valid for $K = 2$ and its extension to $K > 2$ remains an open problem. The main difficulty comes from the absence of a version of the Hahn decomposition for more than two measures.

6 Fair learning: from infinite to finite sample

All the previous sections were concerned with the “infinite sample” regime, that is, the case of a known distribution $\Prob$. While not being the main focus of the paper, given the established connection with the problem of fair regression, one can easily pass from the infinite to the finite-sample regime. Indeed, there are many algorithms that consistently estimate the regression function $f^{*}$. For instance, agarwal2019fair give an in-processing algorithm with provable finite sample generalization bounds; gouic2020price propose a consistent estimator of $f^{*}$; chzhen2020fair provide an algorithm with finite sample fairness and risk guarantees; chzhen2020minimax exhibit a modification of the two aforementioned estimators that enjoys stronger fairness and risk guarantees.

Once an estimator $\hat{f}$ of $f^{*}$ is constructed, one only needs to estimate the threshold $\theta^{*}$ specified in Theorem LABEL:thm:fair_optimal_LF. Recall that there are two cases considered in Theorem LABEL:thm:fair_optimal_LF: the first one requires finding a root of a specific function and the second one gives an explicit expression for $\theta^{*}$. For the first case one can use the unsupervised approach, recycling $\hat{f}$ and only estimating $\Exp_{\bX \mid S}[\cdot]$ and the potentially distribution-dependent coefficients $(\sfn_0,\sfn_1,\sfn_2), (\sfd_0,\sfd_1,\sfd_2)$. For the second case one only needs to estimate or substitute the values of $(\sfn_0,\sfn_1,\sfn_2), (\sfd_0,\sfd_1,\sfd_2)$. Such an approach was analyzed in chzhen2020optimal in the context of binary classification with the $F_{1}$ -score without fairness considerations. Alternatively, for the threshold estimation, one can deploy the grid-search technique proposed by koyejo2014consistent, again recycling the base estimator $\hat{f}$ of $f^{*}$. In either case one ends up with a flexible and rather direct approach for building data-driven algorithms. We note however that the grid-search approach requires additional labeled data, while the root-finding approach is only based on unlabeled data. The final classification algorithm eventually takes the form $\ind\{\hat{f}(\bx,s) \geq \hat\theta\}$.
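The final plug-in step can be sketched in a few lines. The code below is a minimal illustration in the spirit of the grid-search option mentioned above, not the algorithm of koyejo2014consistent: it assumes that values of a fair score estimator are already available on a small labeled sample (the lists `scores` and `labels` are hypothetical), selects the threshold maximizing the empirical $F_{1}$ -score over a grid, and wraps the result into a classifier.

```python
def empirical_f1(y, g):
    """Empirical F1-score 2 * #(g = 1, y = 1) / (#(y = 1) + #(g = 1))."""
    true_pos = sum(a * b for a, b in zip(y, g))
    denom = sum(y) + sum(g)
    return 2.0 * true_pos / denom if denom > 0 else 0.0


def grid_search_threshold(scores, y, grid):
    """Return the first grid point maximizing the empirical F1 of 1{score >= theta}."""
    best_theta, best_value = grid[0], -1.0
    for theta in grid:
        value = empirical_f1(y, [int(f >= theta) for f in scores])
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta, best_value


def plug_in_classifier(f_hat, theta_hat):
    """Build the rule (x, s) -> 1{f_hat(x, s) >= theta_hat}."""
    return lambda x, s: int(f_hat(x, s) >= theta_hat)


scores = [0.12, 0.31, 0.37, 0.61, 0.83, 0.92]   # hypothetical values of f_hat on labeled data
labels = [0, 0, 1, 0, 1, 1]
grid = [k / 20 for k in range(21)]
theta_hat, f1_hat = grid_search_threshold(scores, labels, grid)

# Toy score function that ignores s and returns x itself (an assumption for the demo).
classifier = plug_in_classifier(lambda x, s: x, theta_hat)
```

Thresholding a fair score at a single level shared by all groups is what keeps the resulting rule aligned with the structure of the optimal classifier in Theorem LABEL:thm:fair_optimal_LF; the quality of the outcome depends on how well the score estimator approximates $f^{*}$.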

7 Conclusion

We have derived an explicit connection between regression and classification under the demographic parity constraint. Leveraging the optimal transport interpretation of the optimal fair regressor, we have shown that the regression-classification link is akin to the classical unconstrained setup. This connection is extended to non-decomposable performance measures and, remarkably, amounts to replacing the standard regression function by its fair counterpart. Finally, we have provided a reduction scheme to pass from the unawareness setup to the awareness setup in the case of a binary sensitive attribute, hence giving the first explicit solution of the fair optimal unaware classifier. Our results are instructive and, relying on previous studies, lead to a wide spectrum of algorithms that can be used with non-decomposable measures. Future work will focus on further clarifying other notions of fairness constraints by providing clean and interpretable theoretical studies.

Appendix A A unified proof for deriving optimal fair classifiers

In this section we state and prove a general result which implies both Theorem 3 and Theorem 5. On top of the problem setup presented in Section 2, let $\bW$ be a random variable taking its values in some abstract space $W$. Moreover, define the regression functions $\tau_s(\bw) \coloneqq \Prob(S = s \mid \bW = \bw)$, $s \in [K]$. The random variable $\bW$ should be thought of as $(\bX,S)$ for the awareness setting and $\bX$ for the unawareness setting. Our goal is to find a solution

(13)

The general result will be stated under the following continuity assumption. It requires continuity of the distribution of any linear combination of the regression functions evaluated at $\bW$. {assumption} For every $s \in [K]$ and for every vector $\bc = (c_1, \dots, c_K)^{\top} \in \bbR^K$ such that $c_{1} + \dots + c_{K} = 0$, the distribution $\Law\big(\eta(\bW) + \sum_{\sigma=1}^{K} \frac{c_\sigma}{p_\sigma}\tau_\sigma(\bW) \mid S = s\big)$ is continuous. Akin to Assumptions 3 and 5, Assumption A is not necessary to prove our result but it greatly simplifies its presentation and interpretation. Let us now state the general result which encompasses the two special cases presented in the main body of the paper. {mytheo}[Fair optimal classifier (unified version)] Let Assumption A be satisfied. Then a solution $g^{*}$ defined in Eq. (13) can be expressed for all $\bw$ as

$$g^{*}(\bw) = \ind\Big\{2\eta(\bw) - 1 \geq \sum_{\sigma=1}^{K} \lambda^{*}_\sigma \frac{\tau_\sigma(\bw)}{p_\sigma}\Big\},$$

where $\blambda^{*} = (\lambda^{*}_1, \dots, \lambda^{*}_K) \in \bbR^K$ is a solution of

$$\min_{\blambda \in \bbR^K}\Big\{\Exp\Big[\abs{2\eta(\bW) - 1 - \sum_{\sigma=1}^{K} \lambda_\sigma \frac{\tau_\sigma(\bW)}{p_\sigma}}\Big] \,:\, \Exp\Big[\frac{\lambda_S}{p_S}\Big] = 0\Big\}.$$

(14)

{remark}

[Relating the above result to the main body] It is straightforward to derive Theorem 3 and Theorem 5 from Theorem LABEL:thm:optimal_DP_unified. Indeed, to prove Theorem 3, set $\bW = (\bX,S)$, $\bw = (\bx,s)$ and notice that $\tau_\sigma(\bw) = \Prob(S = \sigma \mid \bX = \bx, S = s) = \delta_s(\sigma)$. In particular, Assumption A is weaker than Assumption 5 and one can check that the optimal fair classifiers coincide. Similarly, Theorem 5 can be derived from Theorem LABEL:thm:optimal_DP_unified by setting $\bW = \bX$, $\bw = \bx$.

Proof of Theorem LABEL:thm:optimal_DP_unified.

One can verify that the minimization of $\Prob(g(\bW) \neq Y)$ over $g$ is equivalent to the minimization of $\Exp[g(\bW)(1 - 2\eta(\bW))]$. Furthermore, the demographic parity constraint can be equivalently expressed as

$$\Exp[g(\bW) \mid S = s] = \Exp[g(\bW)], \quad s \in [K].$$

Thus, we are interested in the solution of the optimization problem

Recall that we defined the random variable $\tau_s(\bW) = \Prob(S = s \mid \bW)$, $s \in [K]$. The Lagrangian for the above problem can be expressed as

$$\cL(g, \blambda) = \Exp\Big[g(\bW)\Big((1 - 2\eta(\bW)) - \sum_{\sigma=1}^{K} \lambda_\sigma\big(1 - p_\sigma^{-1}\tau_\sigma(\bW)\big)\Big)\Big],$$

where $\blambda \in \bbR^K$. Weak duality implies that

$$\min_{g}\max_{\blambda}\cL(g, \blambda) \geq \max_{\blambda}\min_{g}\cL(g, \blambda).$$

(15)

Our approach to derive the optimal fair classifier can be decomposed in two classical steps: find optimal solutions to the dual problem $max\blambdaming\cL(g,\blambda)$ ; show that strong duality holds so that the optimal solutions to the dual problem are also optimal for the primal problem.

Solving the dual problem.

In what follows we focus our attention on the dual $\max\min$ problem, which can be solved analytically. We first solve for any $\blambda$ the inner minimization problem of the $\max\min$ formulation

$$\min_{g}\cL(g, \blambda).$$

(16)

Since $g$ can be any function from $W$ to $\{0, 1\}$, the above problem can be solved point-wise. In particular, one can check that the solution is given by

$$g^{*}(\bw) = \ind\Big\{2\eta(\bw) - 1 \geq \sum_{\sigma=1}^{K} \lambda_\sigma\big(p_\sigma^{-1}\tau_\sigma(\bw) - 1\big)\Big\}.$$

Plugging the optimal solution $g^{*}$ back in the dual problem, we obtain as solution of the outer maximization problem

$$\blambda^{*} \in \argmin_{\blambda \in \bbR^K}\Exp\Big[\Big(2\eta(\bW) - 1 + \sum_{\sigma=1}^{K} \lambda_\sigma\big(1 - p_\sigma^{-1}\tau_\sigma(\bW)\big)\Big)_+\Big].$$

(17)

The objective of the above optimization problem is non-negative, continuous and convex as a function of $\blambda$. Lemma A ensures that $\blambda^{*}$ exists.

The objective function of the problem in Eq. (17) is not smooth everywhere due to the presence of the positive part function. However, thanks to Assumption A, the set of points at which the objective function is not differentiable has zero Lebesgue measure and can thus be ignored (see, e.g., bertsekas1973stochastic, Proposition 3). The First-Order Optimality Condition (FOOC) on the optimal Lagrange multiplier $\blambda^{*}$ then reads as

$$\Exp[p_s^{-1}\tau_s(\bW)\ind\{g^{*}(\bW) = 1\}] = \Prob(g^{*}(\bW) = 1), \quad \forall s \in [K].$$

The LHS of the above equality can be simplified, using $\tau_s(\bW) = \Prob(S = s \mid \bW)$, into

$$\Exp[p_s^{-1}\tau_s(\bW)\ind\{g^{*}(\bW) = 1\}] = p_s^{-1}\Prob(S = s, g^{*}(\bW) = 1) = \Prob(g^{*}(\bW) = 1 \mid S = s),$$

showing that the FOOC on $\blambda^{*}$ is equivalent to $g^{*}$ satisfying DP.

Strong duality.

The above reasoning showed that $g^{*}$ defined with the optimal Lagrange multiplier $\blambda∗$ is feasible for the primal problem. Combining this property with Eq. (15) implies that $g^{*}$ is also a solution of the primal problem.

A more convenient expression.

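Using the fact that $2(a)_{+} = a + |a|$ and $\Exp\tau_s(\bW) = p_s$, we can express the optimal Lagrange multiplier $\blambda^{*}$ as

$$\blambda^{*} \in \argmin_{\blambda \in \bbR^K}\Exp\Big[\abs{2\eta(\bW) - 1 + \sum_{\sigma=1}^{K} \lambda_\sigma\big(1 - p_\sigma^{-1}\tau_\sigma(\bW)\big)}\Big].$$

Moreover, introducing $G(\blambda) = \Exp\big[\abs{2\eta(\bW) - 1 + \sum_{\sigma=1}^{K} \lambda_\sigma(1 - p_\sigma^{-1}\tau_\sigma(\bW))}\big]$, we observe that for any $c \in \bbR$ and $\blambda \in \bbR^K$ it holds that $G(\blambda) = G(\blambda + c\bp)$, where $\bp = (p_1, \dots, p_K)^{\top} \in \bbR^K$. Hence, since we are interested in any solution of the above optimization problem, we can define $(g^{*}, \blambda^{*})$ as

$$\begin{aligned}
g^{*}(\bw) &= \ind\Big\{2\eta(\bw) - 1 \geq \sum_{\sigma=1}^{K} \lambda^{*}_\sigma p_\sigma^{-1}\tau_\sigma(\bw)\Big\}, \\
\blambda^{*} &\in \argmin_{\blambda \in \bbR^K}\Big\{\Exp\Big[\abs{2\eta(\bW) - 1 - \sum_{\sigma=1}^{K} \lambda_\sigma p_\sigma^{-1}\tau_\sigma(\bW)}\Big] \,:\, \bar{\blambda} = 0\Big\}.
\end{aligned}$$ ∎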
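For two groups in the awareness setting, where $\tau_\sigma(\bw) = \ind\{s = \sigma\}$, the problem on the hyperplane $\lambda_1 + \lambda_2 = 0$ is one-dimensional and can be solved exactly on a sample. The sketch below is a toy illustration under several assumptions of ours: synthetic values of $\eta$, empirical group proportions in place of $p_s$, and an exhaustive search over the breakpoints of the piecewise-linear objective. It checks that the resulting classifier has nearly equal acceptance rates in both groups, as the first-order condition predicts.

```python
import random


def solve_dual_two_groups(eta, s):
    """Minimize t -> mean |2 eta_i - 1 - t c_i| with lambda = (t, -t), c_i = +1/p_1 or -1/p_2."""
    n = len(eta)
    p = [s.count(1) / n, s.count(2) / n]          # empirical group proportions
    c = [1.0 / p[0] if si == 1 else -1.0 / p[1] for si in s]
    r = [2.0 * e - 1.0 for e in eta]

    def objective(t):
        return sum(abs(ri - t * ci) for ri, ci in zip(r, c)) / n

    # The objective is convex, piecewise linear and coercive: a minimizer is a breakpoint.
    t_star = min((ri / ci for ri, ci in zip(r, c)), key=objective)
    return t_star, r, c


rng = random.Random(0)
n = 400
s = [1 + (i % 2) for i in range(n)]               # two groups of equal size
eta = [rng.random() if si == 1 else 0.3 + 0.6 * rng.random() for si in s]  # synthetic scores

t_star, r, c = solve_dual_two_groups(eta, s)
g = [int(ri >= t_star * ci) for ri, ci in zip(r, c)]
rates = [sum(gi for gi, si in zip(g, s) if si == k) / s.count(k) for k in (1, 2)]
dp_gap = abs(rates[0] - rates[1])
```

On a finite sample the acceptance rates can differ by a quantity of order one over the group size, because the empirical objective has kinks; under Assumption A this discrepancy disappears at the population level.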

{lemma}

Let Assumption A be satisfied, then the mapping

$$\blambda \mapsto \Exp\Big[\Big(2\eta(\bW) - 1 + \sum_{\sigma=1}^{K} \lambda_\sigma\big(1 - p_\sigma^{-1}\tau_\sigma(\bW)\big)\Big)_+\Big]$$

(18)

attains its minimum.

Proof.

At the end of the proof of Theorem LABEL:thm:optimal_DP_unified we have shown that the minimization of (18) is equivalent to the minimization of

$$\blambda \mapsto \Exp\Big[\abs{2\eta(\bW) - 1 - \sum_{\sigma=1}^{K} \lambda_\sigma p_\sigma^{-1}\tau_\sigma(\bW)}\Big]$$

on the hyperplane $\{\blambda \in \bbR^K : \bar{\blambda} = 0\}$. Thus, it is sufficient to show that

$$\min_{\blambda \in \bbR^K}\Big\{\Exp\Big[\abs{2\eta(\bW) - 1 - \sum_{\sigma=1}^{K} \lambda_\sigma p_\sigma^{-1}\tau_\sigma(\bW)}\Big] \,:\, \bar{\blambda} = 0\Big\}$$

is attained.

It is clear that the mapping in question is convex on $\bbRK$ . Hence, it is sufficient to show that it is coercive (see e.g. bauschke2017convex, Proposition 11.15). It holds that

$$\Exp\abs{2\eta(\bW) - 1 - \sum_{\sigma=1}^{K} \lambda_\sigma p_\sigma^{-1}\tau_\sigma(\bW)} = \Exp\abs{\langle(\blambda/\bp, 1), (\bV, H)\rangle},$$

(19)

where we introduced the vector $\bV \triangleq (\tau_1(\bW), \dots, \tau_K(\bW))$, $H \triangleq 1 - 2\eta(\bW)$, and $(\blambda/\bp, 1) \triangleq (\lambda_1/p_1, \dots, \lambda_K/p_K, 1) \in \bbR^{K+1}$. Thus, in view of (19), by Markov’s inequality, for any $\kappa > 0$ it holds that

$$\Exp\abs{2\eta(\bW) - 1 - \sum_{\sigma=1}^{K} \frac{\lambda_\sigma}{p_\sigma}\tau_\sigma(\bW)} \geq \kappa\|(\blambda/\bp, 1)\|\,\Prob\big(\abs{\langle(\blambda/\bp, 1), (\bV, H)\rangle} > \kappa\|(\blambda/\bp, 1)\|\big),$$

(20)

where $\|\cdot\|$ denotes the Euclidean norm. Note that if we are able to show that for some $\kappa_{0} > 0$, the probability on the right hand side of the above inequality is bounded away from zero, the proof of coercivity is concluded since $\|(\blambda/\bp, 1)\| \geq \min_{s \in [K]}\{p_s^{-1}\}\|\blambda\|$. To this end, let us introduce

$$F(\bu, t) = \Prob\big(\abs{\langle\bu, (\bV, H)\rangle} \leq t\big),$$

for all $t \geq 0$ and being defined as

By Assumption A, for any , the mapping $t↦F(\bu,t)$ is continuous on $(0, + \infty)$ with $F(\bu,0)=0$ and $F(\bu,+∞)=1$ . Furthermore, for any such that and for any $δ > 0, t > 0$ , we have thanks to triangle’s inequality and monotonicity of $F(\bu,⋅)$

$$F(\bu + \bh, t + \delta) \in \big[F(\bu, t + \delta - 2\|\bh\|),\, F(\bu, t + \delta + 2\|\bh\|)\big] \xrightarrow[\|\bh\| \to 0]{\delta \to 0} F(\bu, t),$$

where the convergence follows from the assumed continuity of $F(\bu,⋅)$ . Thus, $(\bu,t)↦F(\bu,t)$ is continuous. Since is compact, we have that

is continuous on $[0, + \infty)$ . Hence, the intermediate value theorem guarantees that there exists $κ_{0} > 0$ such that

$$G(\kappa_0) = 1 - \inf_{\lambda_1 + \dots + \lambda_K = 0}\Prob\big(\abs{\langle(\blambda/\bp, 1), (\bV, H)\rangle} > \kappa_0\|(\blambda/\bp, 1)\|\big) = \frac{1}{2}.$$

In view of Eq. (20), we conclude. ∎

Appendix B Auxiliary results

The first lemma ensures that under certain conditions, the denominator of the linear fractional performance measure is always positive. {lemma} Assume that $\sfd_0 + \min\{\min\{\sfd_1, 0\} + \sfd_2, 0\} \geq 0$, then for any classifier $g$

$$\sfd_0 + \sfd_1\Prob(Y = 1, g(\bX,S) = 1) + \sfd_2\Prob(g(\bX,S) = 1) \geq 0.$$

Furthermore, if $\sfd_0 + \min\{\min\{\sfd_1, 0\} + \sfd_2, 0\} > 0$, then the above inequality is strict.

Proof.

Observe that

$$\begin{aligned}
\sfd_0 + \sfd_1\Prob(Y = 1, g(\bX,S) = 1) + \sfd_2\Prob(g(\bX,S) = 1) &= \sfd_0 + \Exp[(\sfd_1 Y + \sfd_2)g(\bX,S)] \\
&\geq \sfd_0 + \Exp[(\min\{\sfd_1, 0\} + \sfd_2)g(\bX,S)] \\
&\geq \sfd_0 + \min\{\min\{\sfd_1, 0\} + \sfd_2, 0\} \\
&\geq 0.
\end{aligned}$$

The second claim follows the same lines. ∎
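The chain of inequalities can be checked by brute force on a tiny sample. The sketch below enumerates every binary classifier on five hypothetical observations and compares the smallest empirical denominator with the lower bound $\sfd_0 + \min\{\min\{\sfd_1, 0\} + \sfd_2, 0\}$; the labels, the coefficients and the function names are assumptions chosen for illustration.

```python
from itertools import product


def denominator_margin(d):
    """Lower bound d0 + min{min{d1, 0} + d2, 0} from the lemma."""
    d0, d1, d2 = d
    return d0 + min(min(d1, 0.0) + d2, 0.0)


def denominator(d, y, g):
    """Empirical d0 + d1 P(Y = 1, g = 1) + d2 P(g = 1)."""
    m = len(y)
    joint = sum(a * b for a, b in zip(y, g)) / m
    return d[0] + d[1] * joint + d[2] * sum(g) / m


y = [1, 0, 1, 0, 0]                 # hypothetical labels
d = (0.9, -0.5, -0.3)               # hypothetical coefficients with negative d1 and d2
margin = denominator_margin(d)      # approximately 0.1

# Smallest denominator over all 2^5 classifiers on this sample.
smallest = min(denominator(d, y, g) for g in product((0, 1), repeat=len(y)))
```

Here the worst case is attained by the classifier predicting $1$ everywhere, and it still stays above the bound, which is not tight for this sample because not all labels equal $1$.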

The second result gives a sufficient condition for positivity of the leading coefficient in Remark 4. {lemma} Assume that $\sfd_0 + \min\{\min\{\sfd_1, 0\} + \sfd_2, 0\} \geq 0$ and either Eq. ( $C 1$ ) or Eq. ( $C 2$ ) is satisfied, then for any classifier $g \in \dom(U_{(\sfn,\sfd)})$

$$\sfn_1 - \sfd_1 U_{(\sfn,\sfd)}(g) \geq 0.$$

Proof.

Observe that in both cases, by Lemma B, we have

$$\sign\big(\sfn_1 - \sfd_1 U_{(\sfn,\sfd)}(g)\big) = \sign\big((\sfn_1\sfd_0 - \sfd_1\sfn_0) + (\sfn_1\sfd_2 - \sfd_1\sfn_2)\Exp[g(\bX,S)]\big).$$

(21)

Case 1: $\sfn_2\sfd_1 \neq \sfd_2\sfn_1$. In that case condition ( $C 1$ ) implies that $\sfn_1\sfd_2 - \sfd_1\sfn_2 > 0$ and $\frac{\sfn_1\sfd_0 - \sfd_1\sfn_0}{\sfn_1\sfd_2 - \sfd_1\sfn_2} \geq 0$. In view of (21) we conclude.
Case 2: $\sfn_2\sfd_1 = \sfd_2\sfn_1$. The proof is immediate from (21) and the first part of condition ( $C 2$ ). ∎

The next lemma establishes an extended average stability property from (chzhen2020minimax). {lemma} Let Assumption 3 be satisfied, then

$$\Exp[(f^{*}(\bX,S) - \eta(\bX,S))\ind\{f^{*}(\bX,S) \geq \theta\}] = 0,$$

for all $\theta \in [0, 1]$.

Proof.

Fix some $\theta \in [0, 1]$. Introducing $T^{*}(\cdot) \coloneqq \sum_{\sigma=1}^{K} p_\sigma F^{-1}_{\mu_\sigma(\eta)}(\cdot)$, we recall that

$$f^{*}(\bx,s) = T^{*} \circ F_{\mu_s(\eta)}(\eta(\bx,s)).$$

Furthermore, since both $U \triangleq F_{\mu_S(\eta)}(\eta(\bX,S))$ and $(F_{\mu_S(\eta)}(\eta(\bX,S)) \mid S = s)$ are distributed uniformly on $(0, 1)$ under Assumption 3, we can write

$$\begin{aligned}
\Exp[(f^{*}(\bX,S) &- \eta(\bX,S))\ind\{f^{*}(\bX,S) \geq \theta\}] \\
&= \Exp[T^{*}(U)\ind\{T^{*}(U) \geq \theta\}] - \sum_{s=1}^{K} p_s\Exp[F^{-1}_{\mu_s(\eta)}(U)\ind\{T^{*}(U) \geq \theta\} \mid S = s] = 0.
\end{aligned}$$ ∎
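The stability property has an exact finite-sample counterpart when the groups have equal size, which makes the role of $T^{*}$ easy to see. In the sketch below (an illustration under that equal-size assumption, with hypothetical scores and function names of our choosing), the fair score of the $j$-th smallest observation of each group is the average of the $j$-th smallest scores across groups, which is the empirical version of $T^{*} \circ F_{\mu_s(\eta)}$ with weights $p_s = 1/K$.

```python
def fair_scores(eta_by_group):
    """Rank-matched averaging: empirical barycenter map for equally sized groups."""
    k = len(eta_by_group)
    m = len(eta_by_group[0])
    if any(len(group) != m for group in eta_by_group):
        raise ValueError("this sketch assumes equally sized groups")
    # order[s][j] is the index of the j-th smallest score in group s.
    order = [sorted(range(m), key=group.__getitem__) for group in eta_by_group]
    fair = [[0.0] * m for _ in range(k)]
    for rank in range(m):
        level = sum(eta_by_group[s][order[s][rank]] for s in range(k)) / k
        for s in range(k):
            fair[s][order[s][rank]] = level
    return fair


def stability_gap(eta_by_group, fair, theta):
    """Empirical E[(f* - eta) 1{f* >= theta}] over the pooled sample."""
    total, count = 0.0, 0
    for group, fair_group in zip(eta_by_group, fair):
        for e, f in zip(group, fair_group):
            total += (f - e) if f >= theta else 0.0
            count += 1
    return total / count


eta_by_group = [[0.2, 0.9, 0.5, 0.4], [0.6, 0.1, 0.8, 0.3]]   # hypothetical values of eta
fair = fair_scores(eta_by_group)
gaps = [stability_gap(eta_by_group, fair, t / 10) for t in range(11)]
```

Every entry of `gaps` vanishes up to rounding, and both groups receive the same multiset of fair scores, so thresholding the fair score at any level yields equal acceptance rates across groups.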

Finally, the last result relates the excess risk obtained in Lemma LABEL:lem:excess_score with the expression presented in Remark 4. {lemma} Under the conditions of Lemma LABEL:lem:fixed_at_optim_LF, we have

Proof.

We drop the subscript $(\sfn,\sfd)$ for compactness.
Case 1: $\sfd_2\sfn_1 \neq \sfn_2\sfd_1$. Using the corresponding case of Lemma LABEL:lem:fixed_at_optim_LF and solving it for $\theta^{*}$, we deduce that

$$\theta^{*} = \frac{\sfn_2 - \sfd_2 U(g^{*})}{\sfd_1 U(g^{*}) - \sfn_1}.$$

Hence, from the above we deduce that

$$\sfd_2 + \theta^{*}\sfd_1 = \frac{\sfd_1\sfn_2 - \sfd_2\sfn_1}{\sfd_1 U(g^{*}) - \sfn_1} \implies \frac{\sfd_2\sfn_1 - \sfn_2\sfd_1}{\sfd_2 + \theta^{*}\sfd_1} = \sfn_1 - \sfd_1 U(g^{*}).$$

Case 2: $\sfd_2\sfn_1 = \sfn_2\sfd_1$. Again using the corresponding case of Lemma LABEL:lem:fixed_at_optim_LF and solving it for $\Exp(f^{*}(\bX,S) - \theta^{*})_+$, we deduce that

$$\Exp(f^{*}(\bX,S) - \theta^{*}_{(\sfn,\sfd)})_+ = \frac{\sfd_0 U(g^{*}) - \sfn_0}{\sfn_1 - \sfd_1 U(g^{*})}.$$

Hence, from the above we deduce that

$$\sfd_0 + \sfd_1\Exp(f^{*}(\bX,S) - \theta^{*})_+ = \frac{\sfd_0\sfn_1 - \sfd_1\sfn_0}{\sfn_1 - \sfd_1 U(g^{*})} \implies \frac{\sfn_1\sfd_0 - \sfd_1\sfn_0}{\sfd_0 + \sfd_1\Exp(f^{*}(\bX,S) - \theta^{*})_+} = \sfn_1 - \sfd_1 U(g^{*}).$$

The proof is concluded. ∎
