MIME: Minority Inclusion for Majority Group Enhancement of AI Performance

Pradyumna Chari (1), Yunhao Ba (1), Shreeram Athreya (1), Achuta Kadambi (1, 2)
(1) Department of Electrical and Computer Engineering, UCLA
(2) Department of Computer Science, UCLA
{pradyumnac,yhba,shreeram}@ucla.edu, achuta@ee.ucla.edu
Abstract

Several papers have rightly included minority groups in artificial intelligence (AI) training data to improve test inference for minority groups and/or society-at-large. A society-at-large consists of both minority and majority stakeholders. A common misconception is that minority inclusion does not increase performance for majority groups alone. In this paper, we make the surprising finding that including minority samples can improve test error for the majority group. In other words, minority group inclusion leads to majority group enhancements (MIME) in performance. A theoretical existence proof of the MIME effect is presented and found to be consistent with experimental results on six different datasets. Project webpage: https://visual.ee.ucla.edu/mime.htm/.

Keywords: Fairness, bias, data diversity.

1 Introduction

Inclusion of minorities in a dataset impacts the performance of artificial intelligence (AI). Recent research has presented the value of inclusive datasets to improve AI performance on minorities and also for society-at-large [gebru2018datasheets, buolamwini2018gender, larrazabal2020gender, ryu2017inclusivefacenet, li2019repair, mehrabi2021survey, jo2020lessons, gong2019diversity, kadambi2021achieving]. A society-at-large consists of both majority and minority stakeholders. However, an objection (often silently posed) to minority inclusion efforts is that the inclusion of minorities can diminish performance for the majority. This is based on a “rule of thumb” that AI performance is maximized when one trains and tests on the same distribution. A devil’s advocate position against minority inclusion might be presented as: “In a fictitious society where we are absolutely certain that only blue-skinned humans will exist in the test set, why include out-of-distribution orange-skinned humans in the training set?”.

In this paper, we make the surprising finding that inclusion of minority samples improves AI performance not just for minorities, not just for society-at-large, but even for majorities. We refer to this effect as Minority Inclusion, Majority Enhancement (MIME), illustrated in Figure 2. Specifically, we note that including some minority samples in the train set improves majority group test performance. However, continued addition of minority samples leads to a performance drop. The effect holds under statistical conditions that are represented in traditional computer vision datasets including FairFace [karkkainen2021fairface], UTKFace [zhang2017age], pets [golle2008machine], medical imaging datasets [rajpurkar2017chexnet] and even non-vision data [blake1998uci]. Although deep learning is used for these problems, the flattening layer of a network can be empirically approximated by elementary distributions such as Gaussian Mixture Models (GMMs). A GMM facilitates closed-form analysis to prove the existence of the MIME effect. Additionally, we show existence of MIME on general distributions. Classification experiments on neural networks validate the use of Gaussian mixtures: complex neural networks exhibit feature embeddings in flat layers, distributed with approximately Gaussian density, across six datasets, in and beyond computer vision, and across many realizations and configurations.

Figure 1: This paper proves* that including minorities improves majority performance. *When do the provable guarantees hold? The guarantees are certifiable for fixed backbone binary classification (e.g. one uses a head network with pretrained weights and fine-tunes a downstream layer for classification). The fixed backbone ML is far from a toy scenario (it is considered SoTA by some authors [kang2019decoupling]) and also enables provable certification - ordinarily it is hard to prove things for neural network settings.

Fairness in machine learning is an exceedingly popular area, and our results benefit from several key papers published in recent years. Sample reweighting approaches recognize the need to preferentially weight difficult examples [dong2017class, ren2018learning, cui2019class]. Active and online learning benefit from insights into sample “informativeness” (i.e. given a budget on the number of training samples, which would be the best sample to include [choi2021active, dasgupta2011two]). Domain randomization literature indicates that surprising perturbations to the training set can improve generalization performance [tremblay2018training, yue2019domain, huang2021fsdr]. We extend some of these theoretical insights to the sphere of analyzing benefits of minority inclusion on majority performance.

1.1 Contributions

Figure 2: Inclusion of minorities can improve performance for majorities. We theoretically describe an effect called Minority Inclusion, Majority Enhancement (MIME). The figure depicts test classification of blue mimes, and an initial training stack, also of blue mimes. If allowed to add one more training sample, it can be better to push an orange mime onto the training stack rather than a blue mime. Test accuracy can increase by pushing orange, even though the test set consists of blue mimes alone.

While some works [gwilliam2021rethinking, larrazabal2020gender] have observed related phenomena for isolated tasks, to the best of our knowledge, characterizing benefits to majority groups by including minority data is largely unexplored theoretically. Our contributions are as follows:

  • We introduce the Minority Inclusion Majority Enhancement (MIME) effect in a theoretical and empirical setting.

  • Theoretically: we derive, in closed form, the existence of the MIME effect both with and without domain gap (Key Results 1 and 2) and for general sample distributions (Key Result 3).

  • Empirically: we test the MIME effect on six datasets, ranging from animals to medical images, and observe the existence of MIME consistent with theory.

1.2 Outline of Theoretical Scope

Figure 1 describes the theoretical scope. Through three key results (Theorem 1, Theorem 2 and Theorem 3), this paper offers an existence proof of the MIME effect. An existence proof can leverage a tractable setting. As in Figure 2, training data is a stack of majority samples. Test data is all majority samples. We can push one additional training sample onto the stack, increasing its size by one. We are allowed the choice of having this additional sample drawn from the minority or majority group. Theorem 1 proves that, under the assumptions in Section 3, pushing a minority sample is superior for majority group performance improvements. Theorem 2 generalizes this result to a more realistic scenario, with domain gap. Theorem 3 extends the existence proof to general sample distributions. Empirical results on real-world AI tasks offer validation for theoretical assumptions.

2 Related Work

Debiasing and fairness: It has been widely reported that biases in training data lead to biased algorithmic performance [bolukbasi2016man, hendricks2018women, buolamwini2018gender]. Work has been carried out in identifying and quantifying biases [balakrishnan2021towards, bellamy2019ai, wang2019balanced] and a range of methods exist to address them [gong2019diversity, mehrabi2021survey]. Early approaches suggest oversampling strategies [elkan2001foundations, bickel2009discriminative]. Other methods propose resampling based on individual performance [li2019repair]. Some works utilize information bottlenecks to disentangle biased attributes [tartaglione2021end]. Still other methods propose bias mitigation solutions based on adversarial learning [zhang2018mitigating] or include considerations like protected class-specific classifiers [wang2020towards]. Generative models have also found use in creating synthetic datasets with debiased attributes [ramaswamy2021fair]. Xu et al. [xu2021robust] identify inherent bias amplification as a result of adversarial training and propose a framework to mitigate these biases. Our goals are different – while these aim to reduce test time performance bias across groups, we analyze the influence of minority samples on majority group performance.

Learning from multiple domains: Domain adaptation literature explores learning from multiple sources [redko2020survey]. It could therefore be one potential way to analyze our problem of training on combinations of majority and minority data. In our setting, data arising from distinct domains is seen as being drawn from different distributions with a domain gap [ben2007analysis]. Between these domains, [ben2010theory] establishes error bounds for learning from combinations of domains. However, these error estimates and bounds do not take into account the notion of majority and minority groups; therefore, describing the MIME effect is outside their scope.

Dataset diversity: An important push towards fairness is through analysis of dataset composition. Several works indicate the importance of diverse datasets [gebru2018datasheets, jo2020lessons]. Ryu et al. [ryu2017inclusivefacenet] note that class imbalance in the training set leads to performance reduction. Wang et al. [wang2019balanced] highlight that perfectly balanced datasets may still not lead to balanced performance. For designing medical devices, [kadambi2021achieving] emphasizes the importance of diverse datasets. Through experiments on X-ray datasets, [larrazabal2020gender] observe that imbalanced training sets adversely affect performance on the disadvantaged group. They also observe that an unbiased training set shows the best overall accuracy. However, their inferences are related empirical observations on a few medical tasks and datasets. From an application perspective, the task of remote photoplethysmography enables analysis of the bias problem. Prior work notes that camera-based heart rate estimation exhibits skin tone bias [nowara2020meta], and [ba2021overcoming, Wang_2022_CVPR] propose synthetic augmentations to mitigate this. Additionally, [chari2020diverse, vilesov2022blending] establish that camera-based heart rate estimation is fundamentally biased against dark skin tone subjects, establishing a notion of task complexity. While all these works recognize that data composition affects bias, none, to our knowledge, describes the effect of varying minority group proportions on majority group accuracy.

3 Statistical Origins of the MIME Effect

For more concise exposition, we make assumptions in the main paper derivation and defer extended generality to the supplement. Assumptions include:

  • Assumption 1: one-dimensional data samples and binary labels, , . This is relevant to modern classification problems since the final classification decision is based on a one dimensional projection of the feature representation of the sample with respect to the learnt hyperplane (discussed in Figure 1, Section 4). Additionally, existence proof of MIME holds for more general vectorized notation, as discussed in the supplement.

  • Assumption 2: the binary classifier used is a perceptron: this assumption relates to real neural networks since the last layer is perceptron-like [mohri2013perceptron].

We now introduce some key definitions that follow from these assumptions.
Definition 1: (Task complexity): For binary classification we define task complexity for a group of data as a continuous variable in , such that,

(1)

where is the classification error for hypothesis (the classifier), is the space of feasible hypotheses. It is noted later that this is empirically equivalent to distributional overlap. This definition is not new. Hard-sample mining [dong2017class] establishes the use of performance measures as an indicator of difficulty.
Definition 2: (Majority Group): Group class (i.e. group label ) on which the task performs better. Quantified by training a network only with majority group data and evaluating test performance: .
Definition 3: (Minority Group): Group class (i.e. group label ) on which the task performs worse. Quantified by training a network only with minority class data and evaluating test performance: .
Definition 4: (Minority Training Ratio ()): Ratio of minority to majority samples in the data under consideration (training set, in the context of this paper).
Definition 5: (MIME Domain Gap): Measure of how classification differs for minorities and majorities. Quantified as a difference between ideal hyperplanes. Note that this definition for domain gap could be different from other definitions. In this work, domain gap should be taken to mean MIME domain gap.
Empirical observations on cutting-edge machine learning tasks demonstrate the real-world applicability of the assumptions above. We now discuss three key results. For ease of understanding, we make two simplifying assumptions for Key Results 1 and 2: (i) simplified distributions that follow a symmetric Gaussian Mixture Model, and (ii) equally likely class labels, i.e. . These assumptions are relaxed in Key Result 3.
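Definition 1 can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes labels coded as 0/1 and restricts the hypothesis space to one-dimensional thresholds that predict label 1 to the right of the threshold. The function names `threshold_error` and `task_complexity` are hypothetical.

```python
def threshold_error(samples, labels, b):
    """Error rate of the 1-D classifier that predicts label 1 when x > b, else 0."""
    wrong = sum((x > b) != bool(y) for x, y in zip(samples, labels))
    return wrong / len(samples)


def task_complexity(samples, labels):
    """Smallest error over the assumed hypothesis space of thresholds.

    Candidate thresholds are the midpoints between consecutive sorted samples,
    plus one threshold below and one above all samples.
    """
    xs = sorted(samples)
    candidates = [xs[0] - 1.0]
    candidates += [(lo + hi) / 2 for lo, hi in zip(xs, xs[1:])]
    candidates += [xs[-1] + 1.0]
    return min(threshold_error(samples, labels, b) for b in candidates)


# Hypothetical toy groups: one separable, one with overlapping classes.
easy_x, easy_y = [-2.0, -1.0, 0.5, 1.0, 2.0], [0, 0, 0, 1, 1]
hard_x, hard_y = [-2.0, -1.0, 1.0, 2.0, 0.2, -0.2], [0, 0, 1, 1, 0, 1]
```

Under these assumptions, the group with overlapping classes has the larger task complexity, which is the sense in which it would be labelled the minority group by Definitions 2 and 3.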

Key Result 1: A minority sample can be more valuable for majority classifiers than another majority sample

Our first key result shows that it can benefit performance on the majority group more if one adds minority data (instead of majority data). Consider a binary classification setting with data samples and labels . Samples from the two classes are drawn from distributions with distinct means:

(2)
Figure 3: Visualization of Gaussian Mixture Model parameters. We plot GMMs with different task complexities. The domain gap is visualized as the difference in the ideal threshold locations. The overlap/task complexity metric can be visually seen.

Maximum likelihood (ML) can be used to estimate the label as

(3)

An ideal hyperplane for ML is a set of data samples such that:

(4)

We consider the hyperplane’s geometry to be linear in this one dimensional setting. Therefore the hyperplane can be represented as a normal vector: . The normalized hyperplane is represented by a two dimensional vector, . Here, is the offset/bias. In general, a hyperplane may not be ideal. The accuracy of a hyperplane is based on a performance measure , where the operator takes as input the hyperplane and outputs the closeness to the ideal hyperplane . A goal of a learning based classifier is to obtain:

(5)

where is the best learnt estimate of the ideal hyperplane. The ideal hyperplane is the global minimizer of this objective. Now, assume we are provided a finite training set of labelled data . Let the estimated hyperplane be , denoting that samples have been used to learn the hyperplane. If one additional data sample is made available, then the learnt hyperplane would be . From Equation 2, the -th sample is drawn from one of two distributions:

(6)

We now introduce the notion of majority and minority sampling.

Introducing Majority/Minority Distributions: Suppose that the -th data sample could be drawn for the same classification task from a minority or majority group. Let denote the group label (for the group class). Equation 2 can now be conditioned on the group label, such that there are four possible distributions from which the -th sample can be drawn:

(7)

Overlap: Let the ideal decision hyperplane be located at . Then, given equal likelihood of the two labels for , the overlap for the majority group is defined as the probability of erroneous sample classification:

(8)

The same definition holds true for the minority class as well. Therefore, by definition, . The task complexities and are empirical estimates of the respective overlaps. Hereafter, we assume that all four marginal distributions are Gaussian and symmetric (this is relaxed later for Key Result 3). Figure 3 visually highlights relevant parameters. occurs through the interplay of component means and variances.
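As a rough illustration of the overlap definition, the sketch below evaluates the misclassification probability of a two-component Gaussian mixture with equally likely labels. It assumes a shared standard deviation for both components and, by default, a threshold at the midpoint of the two means. The function name `overlap` and all parameter values are hypothetical.

```python
from statistics import NormalDist


def overlap(mu_neg, mu_pos, sigma, threshold=None):
    """Probability of misclassifying a sample under equally likely labels.

    The negative class is N(mu_neg, sigma^2) and the positive class is
    N(mu_pos, sigma^2). By default the threshold is the midpoint of the means,
    which is the ideal threshold in this symmetric setting.
    """
    if threshold is None:
        threshold = 0.5 * (mu_neg + mu_pos)
    neg = NormalDist(mu_neg, sigma)
    pos = NormalDist(mu_pos, sigma)
    false_positive = 1.0 - neg.cdf(threshold)  # negatives above the threshold
    false_negative = pos.cdf(threshold)        # positives below the threshold
    return 0.5 * (false_positive + false_negative)


# Hypothetical groups: same means, larger spread for the second group.
tight_group = overlap(-1.0, 1.0, sigma=1.0)
spread_group = overlap(-1.0, 1.0, sigma=2.0)
```

In this toy setting the overlap grows either when the component means move closer together or when the spread increases, which is the interplay of means and variances referred to above.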

The expectation over the class label yields majority and minority sampling:

(9)

where we have defined or as having the -th sample come from the majority or minority distributions.

Armed with an expression for the -th sample, we can consider a scope similar to active/online learning [ertekin2007learning, huang2010active, settles2009active, kremer2014active, dasgupta2007general, beygelzimer2010agnostic, dasgupta2011two, balcan2007margin]. Suppose a dataset of samples has been collected on majority samples, such that there exists a dataset stack . A hyperplane is learnt on this dataset and can be improved by expanding the dataset size. Consider pushing sample index , denoted as onto the stack. Now we have a choice of pushing or , to create one of two datasets:

(10)

where represents the interesting case where we choose to push a minority sample onto a dataset with all majority samples (e.g. adding a dark skinned sample to a light skinned dataset). Denote and as hyperplanes learnt on and . We now arrive at the following result.

Theorem 1: Let be the performance of a hyperplane on the majority group. Let . Assume that the minority group distribution has an overlap while the majority group has an overlap . Both have the same ideal hyperplane . Under the definitions of and as above, assuming is sufficiently small and the group class distribution variances are not very large,

(11)

stating that, perhaps surprisingly, expected performance for majorities improves more by pushing a minority sample on the stack, rather than a majority sample.
Proof (Sketch): A sketch is provided; please see the supplement for the full proof. The general idea is to show that samples closer to are more beneficial, and minority distributions may sample these with higher likelihood. Without loss of generality, we assume that is located, non-ideally, closer to the task class (arbitrarily called the positive class) than . For our perceptron update rule, the improvement in the estimated hyperplane due to is proportional to the difference between the false negative rate (FNR) and the false positive rate (FPR) for , with respect to the distribution of . For sufficiently small , can be approximated in terms of the likelihood that is on the ideal hyperplane. The likelihood is directly proportional to . Under the assumptions of the theorem, a direct relation is established between the overlap and for each of the group classes. Then, it is shown that an additional minority sample, with overlap leads to greater expected gains as compared to an additional majority sample, concluding the proof.
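The intuition in the proof sketch can be checked numerically in a toy setting. The code below is a simplified sketch rather than the proof in the supplement. It assumes symmetric Gaussian classes centred at -mu and +mu with a shared standard deviation, an ideal threshold at 0, a current threshold `b` slightly displaced towards the positive class, and a perceptron-style update that moves the threshold by a fixed step on each misclassified sample (the step size is omitted as a common factor). All names and parameter values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def expected_correction(b, mu, sigma):
    """Expected shift of the threshold towards the ideal location 0.

    A false negative pulls the threshold down and a false positive pushes it
    up, so with equally likely labels the expected movement towards 0 is
    0.5 * (FNR - FPR), up to the step size.
    """
    fnr = STD_NORMAL.cdf((b - mu) / sigma)        # positives falling below b
    fpr = 1.0 - STD_NORMAL.cdf((b + mu) / sigma)  # negatives falling above b
    return 0.5 * (fnr - fpr)


def overlap(mu, sigma):
    """Misclassification probability at the ideal threshold 0."""
    return STD_NORMAL.cdf(-mu / sigma)


b = 0.05                              # small displacement from the ideal threshold
majority = dict(mu=1.0, sigma=0.5)    # lower overlap
minority = dict(mu=1.0, sigma=0.7)    # higher overlap, same ideal threshold

gain_major = expected_correction(b, **majority)
gain_minor = expected_correction(b, **minority)
```

With these values the sample drawn from the higher-overlap group yields the larger expected correction. In the same toy model the ordering reverses once the minority spread becomes much larger (for example `sigma=3.0`), which is in line with the theorem's requirement that the group variances are not very large.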

Key Result 2: MIME holds under domain gap

In the previous key result we described the MIME effect in a restrictive setting where a minority and majority group have the same target hyperplane. However, it is rarely the case that minorities and majorities have the same decision boundary. We now consider the case with non-zero domain gap, to show that MIME holds in a more realistic setting. Domain gap can be quantified in terms of ideal decision hyperplanes. If and denote ideal hyperplanes for the majority and minority groups respectively, then domain gap .

A visual illustration of domain gap is provided in Figure 3. Next, we define relative hyperplane locations in terms of halfspaces (since all hyperplanes in the one dimensional setting are parallel). We say two hyperplanes and lie in the same halfspace of a reference hyperplane if their respective offsets/biases satisfy the condition . For occupancy in different halfspaces, the condition is . We now enter into the second key result.

Theorem 2: Let be the domain gap between the majority and minority groups. Assume that the minority group distribution has an ideal hyperplane ; while the majority group has an ideal hyperplane . Then, if , is small enough, and the group class distribution variances are not very large, it can be shown that if either of the following two cases:

  1. and lie in different halfspaces of ,
    or

  2. and lie in the same halfspace of , and if

    (12)

are true, then:

(13)

where is a non-negative constant that depends on the majority and minority means and standard deviations for all the individual GMM components.
Proof (Sketch): A sketch is provided; please see the supplement for the full proof. We prove each case independently.

  1. When and lie in different halfspaces of , it can be shown that the expected improvement in the hyperplane is higher for the minority group as compared to the majority group, using a similar argument as in Theorem 1. This proves the theorem for Case 1.

  2. When and lie in the same halfspace of , and assuming that is located closer to the positive class, we approximate the value as function of , and the likelihood as defined for Theorem 1. Then, through algebraic manipulation, constraints can be established in terms of the two likelihoods and . Under the assumptions of the theorem, a relation can be established between the ratios and . This proves the theorem for Case 2, and concludes the proof.
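The halfspace terminology used in Theorem 2 can be made concrete in the one-dimensional setting, where each hyperplane reduces to a scalar offset. The sketch below is an assumption about the natural form of the condition, not a transcription of the paper's expressions: two offsets are taken to lie in the same halfspace of a reference offset when they fall on the same side of it, and the domain gap is taken to be the absolute difference of the two ideal offsets. The function names are hypothetical.

```python
def same_halfspace(b1, b2, b_ref):
    """True when offsets b1 and b2 lie strictly on the same side of b_ref."""
    return (b1 - b_ref) * (b2 - b_ref) > 0


def domain_gap(b_major, b_minor):
    """Assumed 1-D domain gap: distance between the two ideal offsets."""
    return abs(b_major - b_minor)


# Hypothetical offsets: reference at 0.0, two candidates on either side.
print(same_halfspace(0.3, 0.1, 0.0))   # both to the right of the reference
print(same_halfspace(0.3, -0.1, 0.0))  # on opposite sides of the reference
print(domain_gap(0.0, -0.1))
```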

Key Result 3: MIME holds for general distributions

We now relax the symmetric Gaussian and equally likely labels requirements to arrive at a general condition for MIME existence. Let and be general distributions describing the majority group and classes. Additionally, . Minority group distributions are described similarly. We define the signed tail weight for the majority group as follows:

(14)

where for the majority group. is similarly defined. This leads us to our third key result.

Theorem 3: Consider majority and minority groups, with general sample distributions and unequal prior label distributions. If,

(15)

then .
Proof (Sketch): A sketch is provided; please see the supplement for the full proof. The perceptron algorithm update rule is proportional to (if is located closer to the positive class) or the (if is located closer to the negative class). The MIME effect exists in the scenario where the worst case update for the minority group is better than the best case update for the majority group (described in Equation 15). This proves the theorem.

Generalizations of Theorem 3 to include domain gap are discussed in the supplement, for brevity. Theorems 1 and 2 are special cases of the general Theorem 3, describing MIME existence for specific group distributions.

4 Verifying MIME Theory on Real Tasks

Figure 4: The use of Gaussian mixtures to represent minority and majority distributions is consistent with behaviors in modern neural networks, on real-world datasets. (top row) The last layer of common neural architectures is a linear classifier on features. Histograms of the penultimate layer projections are generated for models with . (middle row) Minority histograms: note the greater difficulty due to less separation of data. (bottom row) Majority histograms: note smaller overlap and easier classification. Figure can be parsed on a per-dataset basis. Within each column, the reader can compare the domain gap and overlap in the two histograms.

In the previous section, we provide existence conditions for the MIME phenomenon for general sample distributions. However, experimental validation of the phenomenon requires quantification in terms of measurable quantities such as overlap. Theorem 2 provides us these resources. Here, we verify that the assumptions in Theorem 2 are validated by experiments on real tasks.

4.1 Verifying Assumptions

Dataset (Task) | DS-1 [karkkainen2021fairface] (Gender) | DS-2 [golle2008machine] (Species) | DS-4 [rajpurkar2017chexnet] (Diagnosis) | DS-5 [blake1998uci] (Income) | DS-6 [zhang2017age, yao2020estimation] (Gender)
Major. overlap | 0.186 | 0.163 | 0.294 | 0.132 | 0.09
Minor. overlap | 0.224 | 0.198 | 0.369 | 0.208 | 0.19
Domain gap | 0.276 | 0.518 | 0.494 | 0.170 | 1.62
Table 1: Experimental measures of overlap and domain gap are consistent with the theory in Section 3. Note that the majority group consistently has lower overlap. Domain gaps are found to be small, except for DS-6. DS-1 is FairFace, DS-2 is Pet Images, DS-4 is Chest-Xray14 and DS-5 is Adult. DS-6 is the high domain gap gender classification experiment. DS-3 is excluded here since it deals with a 9-class classification problem.

Verifying Gaussianity: Theorem 2 assumes that data is drawn from a Gaussian Mixture Model. At first glance, this quantification may appear to be unrelated to complex neural networks. However, as illustrated at the top of Figure 4, a ConvNet is essentially a feature extractor that feeds a flattened layer into a simple perceptron or linear classifier. The flattened layer can be orthogonally projected onto the decision boundary to generate, in analogy, an used for linear classification (Figure 1, fixed-backbone configuration). We use this as a first approximation to the end-to-end configuration used in our experiments.

Plotting empirical histograms of these flattened layers (Figure 4) shows a Gaussian-like distribution. This is consistent with the Central Limit Theorem: a linear combination of many random variables follows an approximately Gaussian distribution. Hence, Theorem 2 is approximately applicable in this setting. Details about implementation and comparison to Gaussians are deferred to the supplement.
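The projection step described above can be sketched as follows. This is an illustration on synthetic data, not the code from the supplement: the features are drawn from a deliberately non-Gaussian (uniform) distribution, the linear-classifier weights are random, and the names `signed_projection` and `skewness` are hypothetical. The sample skewness is used only as a crude check that the projected values are roughly symmetric, as a Gaussian would be.

```python
import math
import random


def signed_projection(feature, w, b):
    """Signed distance of a feature vector from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * fi for wi, fi in zip(w, feature)) + b) / norm


def skewness(values):
    """Sample skewness (third standardized moment); 0 for a symmetric law."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    return third / var ** 1.5


rng = random.Random(0)
dim = 64  # hypothetical width of the flattened layer
w = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
features = [[rng.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(2000)]

# Each projection is a weighted sum of many independent feature entries.
projections = [signed_projection(f, w, 0.0) for f in features]
print(round(skewness(projections), 3))
```

A histogram of `projections`, split by class and group, would be the one-dimensional analogue of the plots in Figure 4.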

Verifying minority/majority definitions: The MIME proof linked minority and majority definitions to distributional overlap and domain gap. Given the histogram embeddings from above, it is seen that minority groups on all four vision tasks have greater overlap. There also exists a domain gap between majority and minority but this is small compared to distribution spread (except for the high domain gap experiment). This establishes applicability of small domain gap requirements. Quantification is provided in Table 1. Code is in the supplement.

4.2 MIME Effect Across Six Real Datasets

Figure 5: When domain gap is small, the MIME effect holds. On four vision datasets, majority performance is maximized with some inclusion of minorities. All experiments are run for several trials and realizations (described in Section 4.2).

Implementation: Six multi-attribute datasets are used to assess the MIME effect (five are in computer vision). For a particular experiment, we identify a task category to evaluate accuracy over (e.g. gender), and a group category (e.g. race). The best test accuracy on the majority group across all epochs is recorded as our accuracy measure. Each experiment is run for a fixed number of minority training ratios (). For each minority training ratio, the total number of training samples remains constant. That is, the minority samples replace the majority samples, instead of being appended to the training set. Each experiment is also run for a finite number of trials. Different trials have different random train and test sets (except for the FairFace dataset [karkkainen2021fairface] where we use the provided test split). Averaging is done across trials. Note that minority samples to be added are randomly chosen – the MIME effect is not specific to particular samples. For the vision datasets, we use a ResNet-34 architecture [he2016deep], with the output layer appropriately modified. For the non-visual dataset, a fully connected network is used. Average accuracy and trend error across trials are used to evaluate performance. Specific implementation details are in the supplement.
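The replacement scheme described above (fixed training set size, with minority samples substituted for majority samples) can be sketched as below. This is an assumption-laden illustration rather than the authors' pipeline: the argument `beta` is treated as the fraction of the fixed-size training set drawn from the minority group, which is one reading of the minority training ratio, and the function name `mix_training_set` and the index pools are hypothetical.

```python
import random


def mix_training_set(majority, minority, total, beta, seed=0):
    """Draw a fixed-size training set with a given minority fraction.

    `majority` and `minority` are lists of sample identifiers. Minority
    samples replace majority samples, so the result always has `total` items.
    """
    rng = random.Random(seed)
    n_minor = round(beta * total)
    n_major = total - n_minor
    if n_minor > len(minority) or n_major > len(majority):
        raise ValueError("not enough samples in one of the groups")
    # Random selection: no particular minority samples are preferred.
    return rng.sample(majority, n_major) + rng.sample(minority, n_minor)


# Hypothetical pools of sample indices for the two groups.
majority_pool = list(range(1000))
minority_pool = list(range(1000, 1200))

train_ids = mix_training_set(majority_pool, minority_pool, total=100, beta=0.1)
n_minority_in_train = sum(i >= 1000 for i in train_ids)
```

Changing `seed` per trial would give the different random training sets mentioned above while keeping the set size constant across ratios.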

MIME effect on gender classification: The FairFace dataset [karkkainen2021fairface] is used to perform gender classification ( is male, is female). The majority and minority groups are light and dark skin, respectively. Results are averaged over five trials. Figure 5 describes qualitative accuracy. The accuracy trends indicate that adding 10% of minority samples to the training set leads to approximately a 1.5% gain in majority group (light skin) test accuracy.

MIME effect on animal species identification: We manually annotate light and dark cats and dogs from the Pets dataset [golle2008machine]. We classify between cats () and dogs (). The majority and minority groups are light and dark fur color respectively. Figure 5 shows qualitative results. Over five trials, we see a majority group accuracy gain of about 2%, with a peak at 10%.

MIME effect on age classification: We use a second human faces dataset, the UTKFace dataset [zhang2017age], for the age classification task (9 classes of age-intervals). We pre-process the UTKFace age labels into class bins to match the FairFace dataset format. The majority and minority groups are male and female respectively. The proportion of task class labels is kept the same across group classes. Results are averaged over five trials. Figure 5 shows trends. We observe a smaller average improvement for the 10% minority training ratio. However, since these are average trends, this indicates consistent gain. Results on this dataset also empirically highlight the existence of the MIME effect beyond two class settings.

MIME effect on X-ray diagnosis classification: We use the NIH Chest-Xray14 dataset [rajpurkar2017chexnet] to analyze trends on a medical imaging task. We perform binary classification of scans belonging to ‘Atelectasis’ () and ‘Pneumothorax’ () categories. The male and female genders are the majority and minority groups respectively. Results are averaged over seven trials (due to noisier trends). From Figure 5, we observe noisy trends: specifically, we see a performance drop for , prior to an overall gain for . The error bounds also have considerably more noise. However, confidence in the peak and the MIME effect, as seen from the average trends and the error bounds, remains high.

Figure 6: MIME effect is observed in non-vision datasets, and is absent in the case of large domain gap. (a) The Adult Dataset [blake1998uci] uses Census data to predict an income label. (b) On dataset six, gender classification is rescoped to occur in a high domain gap setting. Majority group is chickens [yao2020estimation] and minority group is humans [zhang2017age].

MIME effect on income classification: For validation in a non-vision setting, we use the Adult (Census Income) dataset [blake1998uci]. The data consists of census information with annual income labels (income less than or equal to $50,000 is , income greater than $50,000 is ). The majority and minority groups are female and male genders respectively. Results are averaged over five trials. Figure 6(a) highlights a prominent accuracy gain for .

MIME effect and domain gap: Theorem 2 (Section 3) suggests that large domain gap settings will not show the MIME effect. We set up an experiment to verify this (Figure 6(b)). Gender classification among chickens (majority group) and humans (minority group) has a high domain gap due to minimal common context (validated by the domain gap estimates, Table 1). With increasing , the majority accuracy decreases. This result, together with Figure 4 and Table 1 (which show low domain gap for the other datasets), validates Theorem 2. Note that while this result may not be unexpected, it further validates our proposed theory.

5 Discussion

Secondary validation and analysis: Table 2 supplies additional metrics to analyze MIME. Across datasets, almost all trials show existence, with every dataset showing average MIME performance gain. Some readers may view the error bars in Figures 5 and 6 as large; however, they are comparable to other empirical ML works [liu2021iterative, d2021interplay], and they may appear larger due to scaling. Reasons for error bars include variations in train-test data and train set size (Tables B and C, supplement). Further analysis, including interplay with debiasing methods (e.g. hard-sample mining [dong2017class]) and reconciliation with work on equal representation datasets [gebru2018datasheets, buolamwini2018gender, larrazabal2020gender, ryu2017inclusivefacenet, li2019repair, mehrabi2021survey, jo2020lessons, gong2019diversity, kadambi2021achieving] is deferred to the supplement.

Dataset | DS-1 [karkkainen2021fairface] | DS-2 [golle2008machine] | DS-3 [zhang2017age] | DS-4 [rajpurkar2017chexnet] | DS-5 [blake1998uci]
#MIME trials/Total trials | 4/5 | 4/5 | 5/5 | 6/7 | 4/5
Avg. MIME perf. gain | 0.72% | 1.84% | 0.70% | 1.89% | 0.98%
Table 2: Additional evaluation metrics provide further evidence of MIME existence across all datasets. The table highlights: (i) number of trials with MIME performance gain (i.e. majority accuracy at some nonzero minority training ratio is greater than majority accuracy with no minority training samples), and (ii) the mean MIME performance gain across trials (in % points).
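The two metrics in Table 2 can be computed from per-trial accuracy curves along the following lines. This is a sketch under stated assumptions: each trial is represented as a mapping from minority training ratio to majority test accuracy (in percent) that includes the ratio 0.0, and the per-trial gain is taken as the best accuracy at any nonzero ratio minus the accuracy at ratio 0.0. The exact aggregation used for the table is not specified here, and the function name `mime_metrics` and the accuracy values are hypothetical.

```python
def mime_metrics(trials):
    """Return (number of trials with a MIME gain, mean gain in % points).

    `trials` is a list of dicts mapping minority training ratio to majority
    test accuracy in percent; each dict must contain the ratio 0.0.
    """
    gains = []
    for accuracy in trials:
        baseline = accuracy[0.0]
        best_with_minority = max(a for r, a in accuracy.items() if r > 0.0)
        gains.append(best_with_minority - baseline)
    n_mime_trials = sum(g > 0 for g in gains)
    return n_mime_trials, sum(gains) / len(gains)


# Hypothetical accuracy curves for two trials (not values from the paper).
example_trials = [
    {0.0: 90.0, 0.1: 91.0, 0.2: 90.5},
    {0.0: 90.0, 0.1: 89.5, 0.2: 89.0},
]
n_mime, mean_gain = mime_metrics(example_trials)
```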

Optimality of inclusion ratios: Our experiments show that there can exist an optimal amount of minority inclusion to benefit the majority group the most. This appears true across all experiments in Figures 5 and 6. However, beyond a certain amount, accuracy decreases consistently, with lowest accuracy on majority samples observed when no majorities are used in training. This optimal depends on individual task complexities, among other factors. Since identifying it is outside our scope (Sections 1.1 and 1.2), our experiments use sampling resolution for . Peaks at for some datasets are due to this lower resolution; optimal peak need not lie there for all datasets (e.g. X-ray [rajpurkar2017chexnet] & Adult [blake1998uci]). Future work can identify optimal ratios through finer analysis over .

Limitations: The theoretical scope is certifiable within fixed-backbone binary classification, which is narrower than all of machine learning (Figure 1). Should this theory be accepted by the community, follow-up work can generalize theoretical claims. Another limitation is the definition-compatibility of majority and minority groups. Our theory is applicable to task-advantage definitions; some scholars in the community instead define majorities and minorities by proportion. Our theory is applicable to these authors as well, albeit with a slight redefinition of terminology. Additional considerations are included in the supplement.

Conclusion: Majority performance benefits from the inclusion of a non-zero fraction of minority data, given a sufficiently small domain gap.

Acknowledgements

We thank members of the Visual Machines Group for their feedback and support. A.K. was partially supported by an NSF CAREER award IIS-2046737 and Army Young Investigator Award. P.C. was partially supported by a Cisco PhD Fellowship.

References