This paper introduces a novel dataset to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of ages, genders, apparent skin tones, and ambient lighting conditions. Our dataset is composed of 3,011 subjects and contains over 45,000 videos, with an average of 15 videos per person. The videos were recorded in multiple U.S. states with a diverse set of adults across various age, gender, and apparent skin tone groups. A key feature is that each subject consented to participate and to the use of their likeness. In addition, our age and gender annotations are provided by the subjects themselves. A group of trained annotators labeled the subjects' apparent skin tone using the Fitzpatrick skin type scale. Annotations for videos recorded in low ambient lighting are also provided. As an application for measuring the robustness of predictions across certain attributes, we provide a comprehensive study of the top five winners of the DeepFake Detection Challenge (DFDC). Experimental evaluation shows that the winning models perform worse on some specific groups of people, such as subjects with darker skin tones, and thus may not generalize to all people. In addition, we also evaluate state-of-the-art apparent age and gender classification methods. Our experiments provide a thorough analysis of these models in terms of fair treatment of people from various backgrounds.
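A minimal sketch of the kind of per-subgroup robustness measurement the dataset is designed to support, assuming model predictions have already been joined with the dataset's labels; the file name and column names below are illustrative assumptions, not part of the released data:

```python
import pandas as pd

# Hypothetical per-video predictions joined with Casual Conversations labels.
# Assumed columns: age_group, gender, skin_type (Fitzpatrick), low_lighting,
# y_true, y_pred.
df = pd.read_csv("dfdc_predictions_with_labels.csv")
df["error"] = (df["y_true"] != df["y_pred"]).astype(float)

# Marginal error rate per annotated attribute; large gaps between groups
# indicate the kind of performance disparity reported above.
for attr in ["age_group", "gender", "skin_type", "low_lighting"]:
    print(df.groupby(attr)["error"].mean().rename(f"error_rate_by_{attr}"), "\n")
```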
Recent studies demonstrate that machine learning algorithms can discriminate based on classes like race and gender. In this work, we present an approach to evaluate bias present in automated facial analysis algorithms and datasets with respect to phenotypic subgroups. Using the dermatologist approved Fitzpatrick Skin Type classification system, we characterize the gender and skin type distribution of two facial analysis benchmarks, IJB-A and Adience. We find that these datasets are overwhelmingly composed of lighter-skinned subjects (79.6% for IJB-A and 86.2% for Adience) and introduce a new facial analysis dataset which is balanced by gender and skin type. We evaluate 3 commercial gender classification systems using our dataset and show that darker-skinned females are the most misclassified group (with error rates of up to 34.7%). The maximum error rate for lighter-skinned males is 0.8%. The substantial disparities in the accuracy of classifying darker females, lighter females, darker males, and lighter males in gender classification systems require urgent attention if commercial companies are to build genuinely fair, transparent and accountable facial analysis algorithms.
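The intersectional breakdown reported above (darker and lighter females and males) can be reproduced in spirit with a simple cross-tabulation; the audit file, column names, and the binarization of Fitzpatrick types are assumptions for illustration, not the authors' pipeline:

```python
import pandas as pd

# Assumed columns: fitzpatrick in 1..6, gender in {"female", "male"},
# y_true / y_pred are the binary gender labels returned by the system under audit.
df = pd.read_csv("gender_classifier_audit.csv")
df["skin_group"] = df["fitzpatrick"].map(
    lambda t: "lighter (I-III)" if t <= 3 else "darker (IV-VI)")
df["error"] = (df["y_true"] != df["y_pred"]).astype(float)

# Error rate per intersectional subgroup: rows = skin group, columns = gender.
print(pd.pivot_table(df, values="error", index="skin_group",
                     columns="gender", aggfunc="mean"))
```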
Facial forgery by deepfakes has raised severe societal concerns. The vision community has proposed several solutions to effectively combat misinformation on the internet via automated deepfake detection systems. Recent studies have shown that deep-learning-based facial analysis models can discriminate on the basis of protected attributes. For the commercial adoption and large-scale roll-out of deepfake detection technology, it is vital to evaluate and understand the fairness of deepfake detectors (the absence of any prejudice or favoritism) across demographic variations such as gender and race, because a performance differential between demographic subgroups would affect millions of people in the disadvantaged subgroup. This paper aims to evaluate the fairness of deepfake detectors across males and females. However, existing deepfake datasets are not annotated with demographic labels to facilitate fairness analysis. To this end, we manually annotated existing popular deepfake datasets with gender labels and evaluated the performance differential of current deepfake detectors across gender. Our analysis of the gender-labeled versions of the datasets suggests that (a) current deepfake datasets have a skewed distribution across gender, and (b) commonly adopted deepfake detectors obtain unequal performance across gender, with males mostly outperforming females. Finally, we contribute a gender-balanced and annotated deepfake dataset, GBDF, to mitigate the performance differential and to promote research and development towards fairness-aware deepfake detectors. The GBDF dataset is publicly available at: https://github.com/aakash4305/gbdf
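A minimal sketch of the per-gender comparison described above, assuming detector scores have been joined with the gender annotations; the file and column names are illustrative, and per-gender AUC is used here as one reasonable choice of metric:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Assumed columns: gender (from the manual annotation effort),
# label in {0: real, 1: fake}, score = detector's predicted probability of "fake".
df = pd.read_csv("detector_scores_gender_labeled.csv")
for gender, group in df.groupby("gender"):
    print(gender, round(roc_auc_score(group["label"], group["score"]), 4))
```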
In recent years, image and video manipulation with deepfakes has become a severe concern for security and society. Many detection models and databases have therefore been proposed to detect deepfake data reliably. However, there is a growing concern that these models and training databases might be biased and thus cause deepfake detectors to fail. In this work, we address this issue by (a) providing large-scale demographic and non-demographic annotations of 41 different attributes for five popular deepfake datasets, and (b) comprehensively analyzing the AI bias of multiple state-of-the-art deepfake detection models on these databases. The investigation analyzes the influence of a large variety of distinctive attributes (from over 65 million labels) on detection performance, including demographic (age, gender, ethnicity) and non-demographic (hair, skin, accessories, etc.) information. The results indicate that the investigated databases lack diversity and, more importantly, show that the deepfake detection models used are strongly biased towards many of the investigated attributes. Moreover, the results show that the models' decisions may rest on several questionable (biased) assumptions, such as whether a person is smiling or wearing a hat. Depending on the application of such deepfake detection methods, these biases can lead to generalizability, fairness, and security issues. We hope the findings of this study and the annotated databases will help to evaluate and mitigate bias in future deepfake detection techniques. Our annotated datasets are publicly available.
Facial analysis systems have been deployed by large companies and critiqued by scholars and activists for the past decade. Many existing algorithmic audits examine the performance of these systems on later stage elements of facial analysis systems like facial recognition and age, emotion, or perceived gender prediction; however, a core component to these systems has been vastly understudied from a fairness perspective: face detection, sometimes called face localization. Since face detection is a pre-requisite step in facial analysis systems, the bias we observe in face detection will flow downstream to the other components like facial recognition and emotion prediction. Additionally, no prior work has focused on the robustness of these systems under various perturbations and corruptions, which leaves open the question of how various people are impacted by these phenomena. We present the first of its kind detailed benchmark of face detection systems, specifically examining the robustness to noise of commercial and academic models. We use both standard and recently released academic facial datasets to quantitatively analyze trends in face detection robustness. Across all the datasets and systems, we generally find that photos of individuals who are $\textit{masculine presenting}$, $\textit{older}$, of $\textit{darker skin type}$, or have $\textit{dim lighting}$ are more susceptible to errors than their counterparts in other identities.
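A hedged sketch of this style of perturbation audit, using OpenCV's stock Haar cascade as a stand-in detector (the benchmark itself audits commercial and academic models); the metadata file, its columns, and the noise levels are assumptions for illustration:

```python
import cv2
import numpy as np
import pandas as pd

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_found(img_bgr: np.ndarray, noise_sigma: float) -> bool:
    """Add Gaussian pixel noise, then report whether any face is detected."""
    noisy = np.clip(img_bgr.astype(np.float32)
                    + np.random.normal(0, noise_sigma, img_bgr.shape), 0, 255)
    gray = cv2.cvtColor(noisy.astype(np.uint8), cv2.COLOR_BGR2GRAY)
    return len(detector.detectMultiScale(gray, 1.1, 5)) > 0

# Assumed metadata: one row per image with its path and demographic annotations.
meta = pd.read_csv("metadata.csv")
for sigma in (0, 10, 25, 50):
    meta[f"hit_{sigma}"] = [face_found(cv2.imread(p), sigma) for p in meta["path"]]
    # Per-subgroup detection rate at this noise level, e.g. by skin type.
    print(sigma, meta.groupby("skin_type")[f"hit_{sigma}"].mean().to_dict())
```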
Face detection is a long-standing challenge in the field of computer vision, with the ultimate goal of accurately localizing human faces in unconstrained environments. There are significant technical hurdles in building these systems due to confounding factors related to pose, image resolution, illumination, occlusion, and viewpoint. That said, with recent developments in machine learning, face detection systems have achieved extraordinary accuracy, largely built on data-driven deep learning models [70]. Though encouraging, a critical aspect that limits face detection performance and the social responsibility of deployed systems is the inherent diversity of human appearance. Every human appearance reflects something unique about a person, including their heritage, identity, experiences, and visible manifestations of self-expression. However, questions remain about how well face detection systems perform when faced with diverse face sizes and shapes, skin colors, body modifications, and body ornamentation. Towards this goal, we collected the Distinctive Human Appearance dataset, an image set representing appearances that occur with low frequency and tend to be undersampled in face datasets. We then evaluated current state-of-the-art face detection models on their ability to detect faces in these images. The evaluation results show that face detection algorithms do not generalize well to these diverse appearances. Evaluating and characterizing the state of current face detection models will accelerate research and development towards creating fairer and more accurate face detection systems.
Developing robust and fair AI systems requires datasets with a comprehensive set of labels that can help ensure the validity and legitimacy of relevant measurements. Recent efforts, therefore, focus on collecting person-related datasets that have carefully selected labels, including sensitive characteristics, and consent forms in place to use those attributes for model testing and development. Responsible data collection involves several stages, including but not limited to determining use-case scenarios, selecting categories (annotations) such that the data are fit for the purpose of measuring algorithmic bias for subgroups, and, most importantly, ensuring that the selected categories/subcategories are robust to regional diversities and inclusive of as many subgroups as possible. Meta, in a continuation of our efforts to measure AI algorithmic bias and robustness (https://ai.facebook.com/blog/shedding-light-on-fairness-in-ai-with-a-new-data-set), is working on collecting a large consent-driven dataset with a comprehensive list of categories. This paper describes our proposed design of such categories and subcategories for Casual Conversations v2.
Computer vision (CV) has achieved remarkable results, outperforming humans in several tasks. Nonetheless, it may lead to significant discrimination if not handled properly, as CV systems depend heavily on the data they are fed and can learn and amplify the biases contained in such data. The problem of understanding and discovering bias is therefore of utmost importance. However, there is no comprehensive survey on bias in visual datasets. Hence, this work aims to: i) describe the biases that can manifest in visual datasets; ii) review the literature on methods for bias discovery and quantification in visual datasets; and iii) discuss existing attempts to collect bias-aware visual datasets. A key conclusion of our study is that the problem of bias discovery and quantification in visual datasets is still open, and there is room for improvement both in the methods and in the range of biases that can be addressed. Moreover, there is no such thing as a bias-free dataset, so scientists and practitioners must become aware of the biases in their datasets and make them explicit. To this end, we propose a checklist for spotting different types of bias during the visual dataset collection process.
Recent advancements in AI, especially deep learning, have led to a significant increase in the creation of new, realistic synthetic media (video, image, and audio) and in the manipulation of existing media, which has given rise to the new term 'deepfake'. Based on research literature and resources in both English and Chinese, this paper gives a comprehensive overview of deepfake, covering several important aspects of this emerging concept, including 1) different definitions, 2) commonly used performance metrics and standards, and 3) deepfake-related datasets, challenges, competitions, and benchmarks. In addition, the paper reports a meta-review of 12 deepfake-related survey papers published in 2020 and 2021, focusing not only on the aspects above but also on the analysis of key challenges and recommendations. We believe that, in terms of the aspects covered, this paper is the most comprehensive review of deepfake to date and the first to cover both the English and Chinese literature and resources.
The increasing deployment of facial processing systems in India is raising concerns about privacy, transparency, accountability, and missing procedural safeguards. At the same time, we know very little about how these technologies perform on the diverse features, characteristics, and skin tones of India's 1.34 billion people. In this paper, we test the face detection and facial analysis functions of four commercial facial processing tools on a dataset of Indian faces. The tools show differing error rates in their face detection and gender and age classification functions. The gender classification error rate for Indian female faces is consistently higher than for males, with the highest female error rate being 14.68%. In some cases, this error rate is far higher than what prior research has shown for women of other nationalities. Age classification errors are also high: even allowing an acceptable error margin of plus or minus 10 years from a person's actual age, age prediction failure rates range from 14.3% to 42.2%. These findings point to the limited accuracy of facial processing tools, particularly for certain demographic groups, and to the need for more critical thinking before such systems are adopted.
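For context, the ±10-year tolerance mentioned above translates into a failure-rate computation along these lines; the file and column names are illustrative assumptions:

```python
import pandas as pd

# Assumed columns: true_age (self-reported) and pred_age (tool output).
df = pd.read_csv("age_predictions.csv")
within_tolerance = (df["pred_age"] - df["true_age"]).abs() <= 10
failure_rate = 1.0 - within_tolerance.mean()
print(f"Age prediction failure rate at +/- 10 years: {failure_rate:.1%}")
```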
Over the past decades, the machine and deep learning community has celebrated great achievements on challenging tasks such as image classification. The deep architecture of artificial neural networks, together with the breadth of available data, makes it possible to describe highly complex relations. Yet it is still impossible to fully capture what a deep learning model has learned and to verify that it operates fairly and without creating bias, especially in critical tasks such as those arising in the medical domain. One example of such a task is the detection of distinct facial expressions, called action units, in facial images. Considering this specific task, our research aims to provide transparency regarding bias, specifically with respect to gender and skin color. We train a neural network for action unit classification and analyze its performance quantitatively, based on its accuracy, and qualitatively, based on heatmaps. A structured review of our results indicates that we are able to detect bias. Even though we cannot conclude from our results that the lower classification performance stems solely from gender and skin color bias, these biases must be addressed, which is why we close with suggestions on how the detected bias can be avoided.
As facial recognition systems are deployed more widely, scholars and activists have studied their biases and harms. Audits are commonly used to accomplish this and compare the algorithmic facial recognition systems' performance against datasets with various metadata labels about the subjects of the images. Seminal works have found discrepancies in performance by gender expression, age, perceived race, skin type, etc. These studies and audits often examine algorithms which fall into two categories: academic models or commercial models. We present a detailed comparison between academic and commercial face detection systems, specifically examining robustness to noise. We find that state-of-the-art academic face detection models exhibit demographic disparities in their noise robustness, specifically by having statistically significant decreased performance on older individuals and those who present their gender in a masculine manner. When we compare the size of these disparities to that of commercial models, we conclude that commercial models - in contrast to their relatively larger development budget and industry-level fairness commitments - are always as biased as, or more biased than, an academic model.
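One way to check whether such a performance gap is statistically significant is a two-proportion z-test on per-subgroup detection counts; this is a generic sketch with illustrative numbers, not the statistical procedure used in the study:

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical counts at a fixed noise level:
# faces detected out of faces attempted, for the two subgroups being compared.
detected = [4210, 4690]    # e.g. older vs. younger subjects (illustrative numbers)
attempted = [5000, 5000]

z_stat, p_value = proportions_ztest(count=detected, nobs=attempted)
print(f"z = {z_stat:.2f}, p = {p_value:.4g}")  # small p => gap unlikely to be chance
```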
Media reports have accused face recognition of being 'biased', 'sexist', and 'racist'. There is consensus in the research literature that face recognition accuracy is lower for females, who often have both a higher false match rate and a higher false non-match rate. However, there is little published research aimed at identifying the causes of lower accuracy for females. For example, the 2019 Face Recognition Vendor Test, which documents lower female accuracy across a broad range of algorithms and datasets, also lists 'Analyze cause and effect' under the heading 'What we did not do'. We present the first experimental analysis to identify major causes of lower face recognition accuracy for females on datasets where this result has been observed in previous research. Controlling for an equal amount of visible face area in the test images mitigates the apparently higher false non-match rate for females. Additional analysis shows that makeup-balanced datasets further improve the results for females, achieving lower false non-match rates. Finally, a clustering experiment suggests that images of two different females are inherently more similar than images of two different males, potentially accounting for the difference in false match rates.
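The clustering observation, that images of two different females tend to be more similar than images of two different males, can be probed by comparing average cross-identity embedding similarities; the sketch below assumes precomputed face embeddings and identity labels and is not the authors' code:

```python
import numpy as np

def mean_cross_identity_similarity(embeddings: np.ndarray,
                                   identities: np.ndarray) -> float:
    """Average cosine similarity over pairs of embeddings from *different* identities."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    different_identity = identities[:, None] != identities[None, :]
    return float(sims[different_identity].mean())

# Assumed inputs: (N, d) embedding matrices and identity labels per gender subset.
# A higher value for the female subset would be consistent with a higher false match rate.
# print(mean_cross_identity_similarity(emb_female, ids_female))
# print(mean_cross_identity_similarity(emb_male, ids_male))
```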
The presence of bias in deep models leads to unfair outcomes for certain demographic subgroups. Research in bias focuses primarily on facial recognition and attribute prediction with scarce emphasis on face detection. Existing studies consider face detection as binary classification into 'face' and 'non-face' classes. In this work, we investigate possible bias in the domain of face detection through facial region localization which is currently unexplored. Since facial region localization is an essential task for all face recognition pipelines, it is imperative to analyze the presence of such bias in popular deep models. Most existing face detection datasets lack suitable annotation for such analysis. Therefore, we web-curate the Fair Face Localization with Attributes (F2LA) dataset and manually annotate more than 10 attributes per face, including facial localization information. Utilizing the extensive annotations from F2LA, an experimental setup is designed to study the performance of four pre-trained face detectors. We observe (i) a high disparity in detection accuracies across gender and skin-tone, and (ii) interplay of confounding factors beyond demography. The F2LA data and associated annotations can be accessed at http://iab-rubric.org/index.php/F2LA.
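Detection accuracy in a localization audit of this kind is typically scored by intersection-over-union against the ground-truth boxes; below is a small self-contained sketch (the box format and threshold are assumptions, not the F2LA protocol):

```python
def iou(box_a, box_b) -> float:
    """Intersection-over-union for boxes given as (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def is_localized(gt_box, predicted_boxes, threshold=0.5) -> bool:
    """A ground-truth face counts as localized if some prediction overlaps it enough."""
    return any(iou(gt_box, p) >= threshold for p in predicted_boxes)
```

Aggregating `is_localized` per gender or skin-tone annotation then yields the kind of per-group disparity figures described above.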
Existing facial analysis systems have been shown to yield biased results against certain demographic subgroups. Because of their impact on society, it has become imperative to ensure that these systems do not discriminate based on an individual's gender, identity, or skin tone. This has led to research on identifying and mitigating bias in AI systems. In this paper, we encapsulate bias detection/estimation and mitigation algorithms for facial analysis. Our main contributions include a systematic review of algorithms proposed for understanding bias, along with a taxonomy and extensive overview of existing bias mitigation algorithms. We also discuss open challenges in the field of biased facial analysis.
Deepfake refers to tailored and synthetically generated videos that are now prevalent and spread at scale, threatening the trustworthiness of information available online. While existing datasets contain different kinds of deepfakes that vary in their generation technique, they do not consider the progression of deepfakes in a 'phylogenetic' manner. An existing deepfake face may itself be swapped with another face. This face-swapping process can be performed multiple times, and the resulting deepfake can evolve so as to confuse deepfake detection algorithms. Furthermore, many databases do not provide the generative model employed as a target label. Model attribution helps enhance the explainability of detection results by providing information about the generative model used. To enable the research community to address these questions, this paper proposes DeePhy, a novel deepfake phylogeny dataset consisting of 5040 deepfake videos generated using three different generation techniques. There are 840 videos of once-swapped deepfakes, 2520 videos of twice-swapped deepfakes, and 1680 videos of thrice-swapped deepfakes. With a size of over 30 GB, the database was prepared over more than 1100 hours using 18 GPUs with 1,352 GB of cumulative memory. We also present a benchmark on the DeePhy dataset using six deepfake detection algorithms. The results highlight the need to advance research on model attribution for deepfakes and to generalize the process over a variety of deepfake generation techniques. The database is available at: http://iab-rubric.org/deephy-database
In 2019, the UK's Upper Tribunal (Immigration and Asylum Chamber) dismissed an asylum appeal, basing its decision on the output of a biometric system alongside other discrepancies. The asylum seeker's fingerprints had been found in a biometric database, which contradicted the appellant's account. The Tribunal took this evidence as unequivocal and denied the asylum claim. Today, the proliferation of biometric systems is shaping public debate around their political, social, and ethical implications. Yet while concerns about the racialized use of this technology for migration control have been rising, investment and innovation in the biometrics industry are increasing considerably. Moreover, fairness has recently been adopted by the biometrics community to mitigate bias and discrimination in biometric systems. However, algorithmic fairness cannot distribute justice in settings that are broken or whose intended purpose is to discriminate, such as biometrics deployed at the border. In this paper, we offer a critical reading of recent debates about fairness in biometrics and show their limitations, drawing on research on fairness in machine learning and on critical border studies. Building on previous demonstrations of fairness limits, we show that biometric fairness criteria are mathematically mutually exclusive. The paper then goes on to illustrate this point empirically by reproducing experiments from previous works. Finally, we discuss the politics of fairness in biometrics by situating the debate at the border. We claim that bias and error rates have a different impact on citizens and on asylum seekers, and that fairness has overshadowed the deeper problems of biometrics by focusing on the demographic biases and ethical discourse of algorithms rather than examining how these systems reproduce historical and political injustice.
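The mutual-exclusivity claim echoes a well-known impossibility result in the algorithmic fairness literature (Chouldechova's relation between prevalence and error rates), stated here as background rather than as this paper's own derivation:

```latex
% For a binary classifier evaluated on a group with prevalence p,
% positive predictive value PPV, false positive rate FPR and false negative rate FNR:
\mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
% Hence, if two groups have different prevalences p_1 \neq p_2, no classifier can
% equalize PPV, FPR and FNR across both groups simultaneously.
```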
The emergence of COVID-19 has had a global and profound impact, not only on society as a whole, but also on the lives of individuals. Various prevention measures were introduced around the world to limit the transmission of the disease, including face masks, mandates for social distancing and regular disinfection in public spaces, and the use of screening applications. These developments also triggered the need for novel and improved computer vision techniques capable of (i) providing support to the prevention measures through an automated analysis of visual data, on the one hand, and (ii) facilitating normal operation of existing vision-based services, such as biometric authentication schemes, on the other. Especially important here, are computer vision techniques that focus on the analysis of people and faces in visual data and have been affected the most by the partial occlusions introduced by the mandates for facial masks. Such computer vision based human analysis techniques include face and face-mask detection approaches, face recognition techniques, crowd counting solutions, age and expression estimation procedures, models for detecting face-hand interactions and many others, and have seen considerable attention over recent years. The goal of this survey is to provide an introduction to the problems induced by COVID-19 into such research and to present a comprehensive review of the work done in the computer vision based human analysis field. Particular attention is paid to the impact of facial masks on the performance of various methods and recent solutions to mitigate this problem. Additionally, a detailed review of existing datasets useful for the development and evaluation of methods for COVID-19 related applications is also provided. Finally, to help advance the field further, a discussion on the main open challenges and future research direction is given.
Trained machine learning models are increasingly used to perform high-impact tasks in areas such as law enforcement, medicine, education, and employment. In order to clarify the intended use cases of machine learning models and minimize their usage in contexts for which they are not well suited, we recommend that released models be accompanied by documentation detailing their performance characteristics. In this paper, we propose a framework that we call model cards, to encourage such transparent model reporting. Model cards are short documents accompanying trained machine learning models that provide benchmarked evaluation in a variety of conditions, such as across different cultural, demographic, or phenotypic groups (e.g., race, geographic location, sex, Fitzpatrick skin type [15]) and intersectional groups (e.g., age and race, or sex and Fitzpatrick skin type) that are relevant to the intended application domains. Model cards also disclose the context in which models are intended to be used, details of the performance evaluation procedures, and other relevant information. While we focus primarily on human-centered machine learning models in the application fields of computer vision and natural language processing, this framework can be used to document any trained machine learning model. To solidify the concept, we provide cards for two supervised models: One trained to detect smiling faces in images, and one trained to detect toxic comments in text. We propose model cards as a step towards the responsible democratization of machine learning and related artificial intelligence technology, increasing transparency into how well artificial intelligence technology works. We hope this work encourages those releasing trained machine learning models to accompany model releases with similar detailed evaluation numbers and other relevant documentation.
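As a rough illustration of the structure being proposed, the sections of a model card can be captured in a simple record type; the field names below paraphrase the paper's suggested sections and are not an official schema:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    """Skeleton of the reporting sections proposed by the model cards framework."""
    model_details: str                 # developer, version, model type, training approach
    intended_use: str                  # primary use cases and out-of-scope uses
    factors: list = field(default_factory=list)   # demographic/phenotypic and environmental factors
    metrics: list = field(default_factory=list)   # evaluation measures and decision thresholds
    evaluation_data: str = ""
    training_data: str = ""
    quantitative_analyses: dict = field(default_factory=dict)  # unitary and intersectional results
    ethical_considerations: str = ""
    caveats_and_recommendations: str = ""

card = ModelCard(
    model_details="Smiling-face detector, v0.1 (illustrative)",
    intended_use="Research benchmarking only",
    factors=["Fitzpatrick skin type", "age group", "sex"],
    metrics=["false positive rate", "false negative rate"],
)
```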
Across a variety of domains, there is a performance gap between machine learning models' accuracy on dataset benchmarks and on real-world production data. Despite the careful design of static dataset benchmarks to be representative of the real world, models often err when the data is out of distribution relative to the data on which they were trained. We can directly measure and adjust for some aspects of distribution shift, but we cannot address sample selection bias, adversarial perturbations, or non-stationarity without knowing the data-generating process. In this paper, we outline two methods for identifying the changes in context that lead to distribution shifts and model prediction errors: leveraging human intuition and expert knowledge to identify first-order contexts, and developing dynamic benchmarks based on desiderata for the data-generating process. Furthermore, we present two case studies that highlight the implicit assumptions underlying applied machine learning models which tend to cause errors when attempting to generalize beyond test benchmark datasets. By paying close attention to the role of context in each prediction task, researchers can reduce context-shift errors and improve generalization performance.