Comprehensive Dataset of Face Manipulations for Development and Evaluation of Forensic Tools

Brian DeCann (brian.decann@str.us), Kirill Trapeznikov (kirill.trapeznikov@str.us)
STR, Woburn, Massachusetts, United States

1 Introduction

Digital media (e.g., photographs, video) can be easily created, edited, and shared. Modern editing tools can alter media while maintaining a high degree of photo-realism; in other words, edits can be made imperceptible to the human eye. Many edits are benign. For example, color balancing or contrast enhancement of a photograph improves visual clarity and is aesthetically pleasing. However, edits can also be applied for malicious purposes. State-of-the-art face editing tools can, for example, make a person appear to be smiling at an inopportune time, or depict an authority figure as frail and tired in order to discredit that individual. At present, these editing models are generally based on StyleGAN [8][10][14][12], although image diffusion approaches [1][7][9][11][6] also perform very well. NeRF-based approaches [2] have also been developed. These approaches all perform well while differing substantially from one another, which illustrates the variety of methods available to a user. Examples of edited facial photographs are shown in Figure 1. The example edits cover a variety of semantic changes to a face (e.g., neutral pose to smile, or appearing older) in both controlled, portrait-style frontal face images and in-the-wild, full-scene images.

Given the increasing ease of editing digital media and the potential risks from misuse, a substantial amount of effort has gone into media forensics. To this end, we created a challenge dataset of edited facial images to assist the research community in developing novel approaches to assess and classify the authenticity of digital media. Our dataset includes edits applied to controlled, portrait-style frontal face images and to full-scene in-the-wild images that may include more than one face per image. The goal of our dataset is to address the following challenge questions:

(1) Can we determine the authenticity of a given image (edit detection)?
(2) If an image has been edited, can we localize the edit region?
(3) If an image has been edited, can we deduce (classify) what edit type was performed?

Most research in image forensics attempts to answer item (1), detection. To the best of our knowledge, there are no formal datasets specifically curated to evaluate items (2) and (3), localization and classification, respectively. Our hope is that our prepared evaluation protocol will assist researchers in improving the state of the art in image forensics as it pertains to these challenges.

Figure 1: Examples of face manipulations in media. State-of-the-art models are capable of applying photorealistic manipulations in portrait-style (left), video (center), and in-the-wild (right) media. Here, manipulations are generated using the approaches from Roich et al. [8] and Tzaban et al. [10].

2 Face Manipulations in Portrait Images

A portrait image is a type of face image where a significant majority of the image foreground denotes a human face. Example portrait-style images are illustrated in Figure 2.

Figure 2: Examples of portrait-style images from the CelebA-HQ dataset [4].

2.1 Portrait Image Dataset

2.1.1 Dataset

We compiled a dataset of edited portrait-style images. The image data was sourced from a subset of the CelebA-HQ dataset [4], which consists of 30,000 high-quality versions of images from the Large-Scale CelebFaces Attributes (CelebA) dataset [5]. The images are square-cropped faces from photographs captured in the wild and are saved at a resolution of 1024x1024 pixels (versions at 128x128 and 256x256 pixels also exist). In our subset, we only consider identities that appear at least twice (i.e., there are at least two images of a given identity) in the image data.

We created two partitions of image data for training and testing purposes. The training partition contains a total of 6,846 images. Each sampled CelebA-HQ image in the training partition is accompanied by five manipulated versions in addition to the original (unedited) image. Each sampled CelebA-HQ image is also paired with a separate (unedited) image of the same identity as a reference. The five manipulations are “smile” (smile added or enhanced), “not smile” (smile removed or reduced), “young” (face is modified to appear younger), “old” (face is modified to appear older), and “surprised” (face is modified to depict a surprised expression). We applied the Pivotal Tuning approach by Roich et al. [8] to create each manipulated image. The testing partition contains a total of 7,644 images. It includes the same five manipulation types as the training partition plus seven additional manipulated images, for a total of twelve manipulated images per subject (plus the original and a reference). Four of the additional images are reduced-magnitude versions of “smile”, “not smile”, “young”, and “old”. The remaining three are novel manipulations not present in the training partition: “purple_hair” (hair is modified to have a purple color), “angry” (face is modified to depict an angry expression), and “Taylor_Swift” (face shape and features modified to appear similar to Taylor Swift). Each manipulation type is summarized in Table 1. An example of a subject with the full set of applied manipulations is illustrated in Figure 3. In this example, the fourteen images together constitute one subject in the testing partition. Labels that are italicized would not appear for a given subject in the training partition.

Edit Type	Remark
“smile”	Smile added or enhanced
“not smile”	Smile removed or reduced
“young”	Face is modified to appear younger
“old”	Face is modified to appear older
“surprised”	Face is modified to depict a surprised expression
“purple_hair”	Hair is modified to have a purple color
“angry”	Face is modified to depict an angry expression
“Taylor_Swift”	Face shape and features modified to appear similar to Taylor Swift

Table 1: Types of image edits appearing in our portrait-style face manipulation dataset

Figure 3: Examples of a set of images for a given subject in our manipulation dataset. Here, labels that are italicized denote manipulations that exist strictly in the testing partition and are not present in the training partition.

We remark that some identities may appear more than once in a given partition (training or testing); however, an identity appearing in the training set will not appear in the testing set (and vice versa). Both partitions are available in .png and .jpg format.
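The identity-disjoint property described above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical list of (partition, identity) records; the record layout, the partition names, and the identity labels are illustrative and are not the dataset's actual metadata format.

```python
from collections import defaultdict


def identities_by_partition(records):
    """Group identity labels by partition name.

    `records` is an iterable of (partition, identity_id) pairs. This layout
    is hypothetical; adapt it to however identities are actually recorded.
    """
    groups = defaultdict(set)
    for partition, identity_id in records:
        groups[partition].add(identity_id)
    return groups


def partitions_are_disjoint(records, a="train", b="test"):
    """Return True if no identity appears in both partitions."""
    groups = identities_by_partition(records)
    return groups[a].isdisjoint(groups[b])


# An identity may repeat within a partition, but not across partitions.
records = [
    ("train", "id_001"),
    ("train", "id_001"),
    ("train", "id_002"),
    ("test", "id_003"),
]
print(partitions_are_disjoint(records))
```

A check of this kind is useful when building custom validation splits from the training partition, since splitting by image rather than by identity would leak identities across splits.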

	Training Partition	Test Partition
Image Count	6,846	7,644
Unique Edit Types	5	8

Table 2: Portrait-style face manipulation dataset summary
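As a quick consistency check on the image counts, the sketch below divides each partition total by the number of files per sampled image described in the text (five edits plus original and reference for training; twelve edits plus original and reference for testing). It assumes every sampled image contributes exactly that set of files, so the resulting subject counts are an inference and not figures reported by the authors.

```python
TRAIN_IMAGES = 6846
TEST_IMAGES = 7644

# Files per sampled image, as described in the text.
TRAIN_FILES_PER_SUBJECT = 5 + 1 + 1   # five edits + original + reference
TEST_FILES_PER_SUBJECT = 12 + 1 + 1   # twelve edits + original + reference


def implied_subject_count(total_images, files_per_subject):
    """Number of sampled images implied by a partition total.

    Raises ValueError if the total is not an exact multiple, which would
    indicate that the per-subject assumption does not hold.
    """
    count, remainder = divmod(total_images, files_per_subject)
    if remainder:
        raise ValueError("total is not a multiple of the per-subject file count")
    return count


train_subjects = implied_subject_count(TRAIN_IMAGES, TRAIN_FILES_PER_SUBJECT)
test_subjects = implied_subject_count(TEST_IMAGES, TEST_FILES_PER_SUBJECT)
print(train_subjects, test_subjects)  # 978 546
```

Both totals divide evenly, which is consistent with the per-subject description above.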

2.2 Evaluation Protocol

For our portrait-style face manipulation dataset, we supply two challenges: detection and classification. Both challenges and their associated outputs are described in the following sections.

2.2.1 Detection

The objective of the detection experiment is to identify whether a given image has been manipulated. For each image in the testing partition, return:

(string) “<filename>” : Image filename
(bool) {0, 1} : 0 if not edited, 1 if edited

We measure balanced detection accuracy as the proportion of images that are correctly recognized as either edited or not edited.
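The detection score can be sketched as follows. The text defines the metric as the proportion of correctly recognized images, while "balanced accuracy" commonly refers to the mean of per-class recall; which of the two the official scoring uses is not stated, so this sketch computes both. The dictionaries mapping filenames to labels are a hypothetical representation of the ground truth and of a submission.

```python
def detection_accuracy(truth, predicted):
    """Proportion of images whose 0/1 label is predicted correctly.

    `truth` and `predicted` map filename -> 0 (not edited) or 1 (edited).
    A filename missing from `predicted` counts as incorrect.
    """
    if not truth:
        raise ValueError("ground truth is empty")
    correct = sum(1 for name, label in truth.items() if predicted.get(name) == label)
    return correct / len(truth)


def balanced_accuracy(truth, predicted):
    """Mean of per-class recall over the classes present in `truth`."""
    recalls = []
    for cls in (0, 1):
        names = [name for name, label in truth.items() if label == cls]
        if names:
            hits = sum(1 for name in names if predicted.get(name) == cls)
            recalls.append(hits / len(names))
    if not recalls:
        raise ValueError("ground truth is empty")
    return sum(recalls) / len(recalls)


# Hypothetical filenames and labels, for illustration only.
truth = {"a.png": 0, "b.png": 1, "c.png": 1, "d.png": 1}
predicted = {"a.png": 0, "b.png": 1, "c.png": 1, "d.png": 0}
print(detection_accuracy(truth, predicted))
print(balanced_accuracy(truth, predicted))
```

The two numbers differ whenever the classes are imbalanced, as they are in the portrait testing partition, where each subject has twelve edited images and two unedited ones.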

2.2.2 Classification

The objective of the classification experiment is to classify the type of edit in a manipulated image. For each image in the testing partition, return:

(string) “<filename>” : Image filename
(string) “pristine” if not edited; “<edit_type>” if edited

We measure classification accuracy as the proportion of edited images that are correctly recognized as being a given edit type.
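A minimal sketch of this metric is shown below, assuming ground truth and predictions are available as dictionaries from filename to label (a hypothetical representation). Following the definition above, only edited images enter the denominator; how the official scoring treats pristine images that are predicted as edited is not specified in the text.

```python
def classification_accuracy(truth, predicted, pristine_label="pristine"):
    """Proportion of edited images assigned the correct edit type.

    `truth` and `predicted` map filename -> label. Images whose true label
    is `pristine_label` are excluded; a missing prediction counts as wrong.
    """
    edited = {name: label for name, label in truth.items() if label != pristine_label}
    if not edited:
        raise ValueError("no edited images in ground truth")
    correct = sum(1 for name, label in edited.items() if predicted.get(name) == label)
    return correct / len(edited)


# Hypothetical filenames, for illustration only.
truth = {"a.png": "pristine", "b.png": "smile", "c.png": "old", "d.png": "young"}
predicted = {"a.png": "smile", "b.png": "smile", "c.png": "young", "d.png": "young"}
print(classification_accuracy(truth, predicted))
```

In this example two of the three edited images are labeled correctly, and the misclassified pristine image does not affect the score.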

3 Face Manipulation in the Wild

An in-the-wild image can be described as an image where the principal foreground components (e.g., objects, people, animals) do not occupy the majority of the spatial image area and a large background is present. In-the-wild images are sometimes unconstrained in terms of lighting or camera angles. Unprocessed, or raw in-the-wild images can also vary greatly in terms of spatial resolution.

3.1 In-the-Wild Image Dataset

3.1.1 Dataset

We compiled a dataset of edited in-the-wild images. The image data was sourced from a subset of the Flickr-Faces-HQ (FFHQ) dataset [3]. The FFHQ dataset consists of 70,000 high-quality in-the-wild images. The authors of the FFHQ dataset posit that FFHQ varies much more in terms of age, ethnicity, background, and presence of facial covariates (e.g., eyeglasses, headwear) than CelebA-HQ. A version of the dataset further consists of 70,000 detected, aligned, and cropped faces, which are saved at a resolution of 1024x1024 (a version at 128x128 also exists), but we only consider the raw, full-scene images. Example in-the-wild images from the FFHQ dataset are illustrated in Figure 4.

Figure 4: Examples of in-the-wild images from the FFHQ dataset [3].

Our edited in-the-wild dataset consists of a randomly sampled subset of the 70,000 raw in-the-wild FFHQ images. In our subset, we allow for the possibility that an image contains more than one person (face). This potentially adds an additional challenge in detecting and localizing edited faces. We created two partitions of image data for training and testing (validation) purposes. The training partition and test partition contain totals of 1,508 and 1,403 images, respectively. Within each partition, approximately 50% of the images are edited, while the remaining images are “pristine” (i.e., not edited). In the training partition, 759 images are edited and 750 are pristine. For the testing partition, 652 images are edited and 750 are pristine. All images are saved in .jpg format with a quality factor chosen at random from the set $Q_{f} \in \{75, 80, 85, 90\}$. Summary information for the in-the-wild dataset is listed in Table 3.

	Training Partition	Test Partition
Image Count	1,508	1,403
Pristine Images	750	750
Edited Images	759	652
Resolution	Variable	Variable
Image Format	.jpg	.jpg

Table 3: In-the-wild face manipulation dataset summary

Figure 5: Face counts in our image set, separated by training and test partition. Note that the distribution of face counts is relatively well balanced across the two partitions and that a significant majority of images have one or two faces.

Unlike the portrait-style images, each edited image is only subject to a single edit type. In other words, there are not multiple copies of the same underlying image with different edits applied. Images that are edited are subject to one of six possible manipulations: “smile”, “not_smile”, “young”, “old”, “male”, and “female”. We adopt the approach from Tzaban et al. [10] to apply edits to in-the-wild images.

Edit Type	Remark
“smile”	Smile added or enhanced
“not_smile”	Smile removed or reduced
“young”	Face is modified to appear younger
“old”	Face is modified to appear older
“male”	Face is modified to appear more masculine
“female”	Face is modified to appear more feminine

Table 4: Types of image edits appearing in our in-the-wild face manipulation dataset

Figure 6: Examples of “pristine” (left) and “edited” (right) images in our in-the-wild face manipulation dataset.

For our in-the-wild face manipulation dataset, edits are localized to a region of the full-scene image. This is in contrast to the portrait-style face manipulation dataset, where images are fully synthesized from face-based GANs. For the images in the in-the-wild face manipulation dataset that are edited, we also provide a binary mask that captures the spatial image region where the edit was performed and transplanted back into the image. The edit region is identified using a modified BiSeNet for faces [13]. Figure 7 illustrates examples of binary edit masks associated with edited in-the-wild images with one and multiple persons. Additionally, Figure 8 reports the proportion of the edit region for edited images. Typically, the size of the edit region is between 8% and 20% of the spatial image area.

Figure 7: Examples of edit masks associated with edited in-the-wild images.

Figure 8: Proportion of edit region in our in-the-wild images. Note that the typical edit region size is approximately 10% of the image.
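The edit-region proportion reported in Figure 8 can be computed from a binary mask as the fraction of pixels that are set. The sketch below uses plain nested lists so that it has no dependencies; the mask representation is an assumption, and in practice the provided mask images would first be loaded and binarized with an imaging library.

```python
def edit_region_fraction(mask):
    """Fraction of pixels marked as edited in a binary mask.

    `mask` is a list of rows, each a list of 0/1 (or False/True) values.
    """
    total = sum(len(row) for row in mask)
    if total == 0:
        raise ValueError("mask is empty")
    edited = sum(1 for row in mask for value in row if value)
    return edited / total


# A toy 4x5 mask with two edited pixels, i.e. 10% of the image area.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
print(edit_region_fraction(mask))  # 0.1
```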

3.2 Evaluation Protocol

For our in-the-wild face manipulation dataset, we supply three challenges: detection, localization, and classification. Each challenge and its associated outputs are described in the following sections.

3.2.1 Detection

The objective of the detection experiment is to identify whether a given image has been manipulated. For each image in the testing partition, return:

(string) “<filename>” : Image filename
(bool) {0, 1} : 0 if not edited, 1 if edited

3.2.2 Localization

The objective of the localization experiment is to identify the specific image region where an edit exists. For a given image in the testing partition, users must generate a binary mask, $\hat{M}$, denoting the estimated edit region. The estimated mask is compared against the ground truth, $M$, using the Matthews Correlation Coefficient (MCC). The MCC (also known as the phi coefficient or mean-square contingency coefficient) is a measure of association for binary variables. MCC is computed from the confusion matrix of the pixel-based binary estimations. This is mathematically described in Equation (1), where the counts are taken over pixels $k$: TP denotes True Positive ($M_{k} = \hat{M}_{k} = 1$), TN denotes True Negative ($M_{k} = \hat{M}_{k} = 0$), FP denotes False Positive ($M_{k} = 0$, $\hat{M}_{k} = 1$), and FN denotes False Negative ($M_{k} = 1$, $\hat{M}_{k} = 0$).

$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{1}$$
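Equation (1) can be sketched in code as follows. Masks are represented as nested lists of 0/1 values, which is an assumption made to keep the example dependency-free. The equation is undefined when the denominator is zero (for example, when either mask is entirely 0 or entirely 1); returning 0 in that case is a common convention adopted here, not something the text specifies.

```python
import math


def confusion_counts(truth_mask, estimated_mask):
    """Count TP, TN, FP, FN over all pixels of two equally sized binary masks."""
    if len(truth_mask) != len(estimated_mask):
        raise ValueError("masks have different heights")
    tp = tn = fp = fn = 0
    for truth_row, est_row in zip(truth_mask, estimated_mask):
        if len(truth_row) != len(est_row):
            raise ValueError("masks have different widths")
        for m, m_hat in zip(truth_row, est_row):
            if m and m_hat:
                tp += 1
            elif not m and not m_hat:
                tn += 1
            elif not m and m_hat:
                fp += 1
            else:
                fn += 1
    return tp, tn, fp, fn


def mcc(truth_mask, estimated_mask):
    """Matthews Correlation Coefficient between two binary masks."""
    tp, tn, fp, fn = confusion_counts(truth_mask, estimated_mask)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention for the degenerate case
    return (tp * tn - fp * fn) / denominator


# Toy masks: TP = 1, FN = 1, FP = 1, TN = 5, so MCC = (5 - 1) / 12 = 1/3.
truth_mask = [[1, 1, 0, 0], [0, 0, 0, 0]]
estimated_mask = [[1, 0, 0, 0], [0, 0, 0, 1]]
print(mcc(truth_mask, estimated_mask))
```

MCC ranges from -1 to 1, with 1 for a perfect mask and values near 0 for an estimate that is uncorrelated with the ground truth. Because it uses all four confusion-matrix counts, it is less sensitive than pixel accuracy to the small size of the edit region relative to the full image.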

3.2.3 Classification

The objective of the classification experiment is to classify the type of edit in a manipulated image. For each image in the testing partition, return:

(string) “<filename>” : Image filename
(string) “pristine” if not edited; “<edit_type>” if edited

In our in-the-wild face manipulation dataset, the types of edits that are present in the training partition are also represented in the testing partition. Similarly, the types of edits that are in the testing partition are also represented in the training partition. Thus, the classification problem for this dataset is closed-set. This is in contrast to the portrait-style data, where novel edit types exist in the testing partition. We encourage users of this data and challenge problem to consider open-set solutions, as the set of potential edit types is nearly unlimited.

Acknowledgement. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0129. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

[1] T. Dockhorn, A. Vahdat, and K. Kreis (2021) Score-based generative modeling with critically-damped Langevin diffusion. arXiv preprint arXiv:2112.07068.
[2] J. Gu, L. Liu, P. Wang, and C. Theobalt (2021) StyleNeRF: a style-based 3D-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985.
[3] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4401–4410.
[4] C. Lee, Z. Liu, L. Wu, and P. Luo (2020) MaskGAN: towards diverse and interactive facial image manipulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
[6] K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn (2022) Diffusion autoencoders: toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10619–10629.
[7] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen (2022) Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
[8] D. Roich, R. Mokady, A. H. Bermano, and D. Cohen-Or (2021) Pivotal tuning for latent-based editing of real images. arXiv preprint arXiv:2106.05744.
[9] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695.
[10] R. Tzaban, R. Mokady, R. Gal, A. H. Bermano, and D. Cohen-Or (2022) Stitch it in time: GAN-based facial editing of real videos. arXiv preprint arXiv:2201.08361.
[11] Z. Xiao, K. Kreis, and A. Vahdat (2021) Tackling the generative learning trilemma with denoising diffusion GANs. arXiv preprint arXiv:2112.07804.
[12] F. Yin, Y. Zhang, X. Cun, M. Cao, Y. Fan, X. Wang, Q. Bai, B. Wu, J. Wang, and Y. Yang (2022) StyleHEAT: one-shot high-resolution editable talking face generation via pretrained StyleGAN. arXiv preprint arXiv:2203.04036.
[13] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang (2018) BiSeNet: bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 325–341.
[14] Y. Zhang, H. Ling, J. Gao, K. Yin, J. Lafleche, A. Barriuso, A. Torralba, and S. Fidler (2021) DatasetGAN: efficient labeled data factory with minimal human effort. In CVPR.

Appendix A Additional Example Images

In this section, we provide additional examples of manipulated face images from the portrait and in-the-wild image sets.

A.1 Portrait Images

Additional examples of edited portrait-style images are illustrated in Figures 9-11.

Figure 9: Manipulated portrait-style images with an added smile (middle left) and to appear younger (middle right). The right-most image denotes a reference (i.e., different image) for the same person.

A.2 In-the-Wild Images

Additional examples of edited in-the-wild images are illustrated in Figures 12 and 13.

Figure 12: Manipulated in-the-wild image edited to remove smile.

Figure 13: Manipulated in-the-wild image edited to add glasses.