Distilling Facial Knowledge With Teacher-Tasks:
Semantic-Segmentation-Features for Pose-Invariant Face-Recognition

Abstract

This paper demonstrates a novel approach to improving face-recognition pose-invariance using semantic-segmentation features. The proposed Seg-Distilled-ID network jointly learns identification and semantic-segmentation tasks, after which the segmentation task is removed, “distilling” its knowledge into the MobileNet encoder. Performance is benchmarked against three state-of-the-art encoders on a publicly available data-set emphasizing head-pose variations. Experimental evaluations show that the Seg-Distilled-ID network offers notable robustness benefits, achieving 99.9% test-accuracy compared with 81.6% for ResNet-101, 96.1% for VGG-19 and 96.3% for InceptionV3. This is achieved using approximately one-tenth of the top encoder’s inference parameters. These results demonstrate that distilling semantic-segmentation features can efficiently address face-recognition pose-invariance.

Ali Hassani, Zaid El Shair, Rafi Ud Duala Refat, Hafiz Malik
Department of Electrical and Computer Engineering, University of Michigan - Dearborn, Dearborn, USA

Thanks to Ford Motor Company for Alliance Grant Biometric Forensics. © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Keywords: Face-Recognition, Head-Pose, Multi-Task-Learning, Knowledge-Distillation

1 Introduction

Face-recognition (FR) is becoming the go-to authentication technology for access control and verification applications. Its popularity started with the evolution of smart phones, where over 100 million devices offer it as a seamless-unlock method [16]. This has led other industries to follow suit, where commercial real-estate [13], aviation [18] and banking [6] now use FR as a means to differentiate customer experience. This is made possible by advances in deep-learning [21]; state-of-the-art models can now discern one cooperative face from over 50,000 [4]. Having robust tolerance to pose-variations, however, is still a challenge [28].

Pose-variations are facial rotations over yaw and pitch. These change the relative-position of key-points (e.g., nose, eyes) and introduce variance within identity classes. As such, FR algorithms can struggle to distinguish one person rotating their head from different people [28]. In particular, when applying stringent industry false-acceptance-rate thresholds [4], variations in pose often result in false-rejections [28].

Current state-of-the-art methods rely on alignment techniques and/or sophisticated loss-functions to address pose-variability. Through alignment pre-processing, algorithms can project a cooperative-face for identification [10]. Alternatively, contrastive loss-functions (such as triplet) implicitly address pose-variations through relative class-distance [21]. While these are notable advances, even best-in-class algorithms struggle to achieve 100% accuracy on competition data-sets [8].

Figure 1: Seg-Distilled-ID Network for Pose-Invariance.

This paper presents the Seg-Distilled-ID network. This is a new approach to knowledge-distillation, using a teacher-task in lieu of a teacher-network. The Seg-Distilled-ID network is first jointly trained on both identification and (teacher) semantic-segmentation tasks, where the teacher-task is then removed. This “distills” the semantic-structures as context for precise identification (see Fig. 1). Recognition accuracy is benchmarked against three state-of-the-art encoders on the Mut1ny commercial face-segmentation data-set [15] (11,830 images selected from 67 subjects, varied over pose and lighting). The proposed Seg-Distilled-ID network achieves best-in-class accuracy using 2.4M inference parameters. These results demonstrate distilling semantic-segmentation features can efficiently address face-recognition pose-invariance.

In summary, this paper makes the following contributions:

Novel knowledge-distillation method via teacher-tasks.
Best-in-class ID accuracy with efficient parameter space (99.9%, Mut1ny faces [15]).

Figure 2: Improving Face-Identification Context with Semantic-Segmentation Teacher.

2 Related Works

Face-recognition (FR) research starts in the 1970s as a template-matching problem. This is pioneered by the discovery that statistical-distributions (e.g., Eigenfaces [25]) are generally robust. This is expanded upon by using hand-crafted features to describe distinguishable features [14]. These features, however, are insufficient at large-scales. This ultimately transitions to deep-learning (DL) solutions (starting with DeepFace [24]). Today, state-of-the-art DL networks emphasize a combination of pose-alignment pre-processing [10] (which may include 3-D projection [7, 29]) and contrastive loss-functions [21].

Despite all these advances, pose-variations are still a challenge [28]. To address this, a new trend is to apply multi-task-learning (MTL) for sharing context. This is first exemplified by Ranjan et al., who combine landmarks and pose tasks on a face-detection network (HyperFace) to improve reliability [17]. Others have recently applied this approach to identification. Yin and Liu connect a pose estimation task to the identification features [27]. Wang et al. alternatively use semantic descriptors of sub-structures: size of eyes, nose and cheeks [26]. Both approaches show consistent (small) improvements on FR competition data-sets.

This research differs from these works by using a knowledge-distillation approach with a precise-descriptor: semantic-segmentation. The aforementioned approaches share common features with the identification task. In this case, the network is jointly trained on facial-structure with identification, then “distills” the teacher-task. The “distilled” semantic-features enable the encoder to generate precise features, enabling efficient pose-invariant recognition.

3 Methodology

This research proposes a novel application of Multi-Task-Learning (MTL) to improve face-recognition pose-robustness. Facial semantic-segmentation is “distilled” by using a “teacher-task.” By encoding relative facial-structure, the loss function can better discern inter versus intra class variations.

3.1 Seg-Distilled-ID Network

Fig. 2 shows how the Seg-Distilled-ID network starts with identification and segmentation tasks. The segmentation-task functions as a teacher, helping the ID-task better converge towards optimal weights. Once training is complete the teacher-task is removed (note the dashed lines).

The network assumes a U-Net architecture [19]. U-Net is selected both for its applications to biomedical semantic-segmentation [19] and for its option of an efficient MobileNetV2 encoder [20]. A MobileNetV2 backbone [20] encodes features for parallel identification and semantic-segmentation tasks. The identification-task is constructed by applying a global-average pooling layer, followed by a dense, 128-neuron, feature layer (ReLU activation [1]) and a dense, 67-neuron, classification layer (soft-max activation [12]). The segmentation-task is constructed using the Pix2Pix decoding layers [9] (e.g., final segmentation output of 128 by 128).
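As a rough illustration of the identification-task described above, the following sketch runs a forward pass through a global-average pooling layer, a 128-neuron ReLU feature layer and a 67-neuron soft-max classification layer. It is not the authors' implementation: the tiny 4 by 4 by 32 feature map, the random weights and all function names are hypothetical stand-ins for the real encoder output and trained parameters.

```python
import math
import random

def global_average_pool(feature_map):
    """Average an H x W x C feature map over its spatial axes, giving C values."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [
        sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(c)
    ]

def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_out, n_in, rng):
    """Hypothetical random initialisation, only so the sketch can run."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def id_head(feature_map, feat_layer, cls_layer):
    """Pool -> 128-neuron ReLU feature layer -> 67-neuron soft-max classifier."""
    pooled = global_average_pool(feature_map)
    features = relu(dense(pooled, *feat_layer))
    return softmax(dense(features, *cls_layer))

rng = random.Random(0)
CHANNELS = 32  # assumption: kept small for illustration, not the real encoder depth
feature_map = [[[rng.random() for _ in range(CHANNELS)] for _ in range(4)] for _ in range(4)]
feat_layer = init_layer(128, CHANNELS, rng)
cls_layer = init_layer(67, 128, rng)

probs = id_head(feature_map, feat_layer, cls_layer)
predicted_subject = max(range(len(probs)), key=probs.__getitem__)
print(len(probs), predicted_subject)
```

The output is one probability per subject class; the segmentation decoder would branch from the same encoder features during training only.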

Both tasks use a categorical-cross-entropy loss, as shown in (1). This better separates out the (log) distance between classes by incorporating probability of the observation, $o$, belonging to the label-class, $c$. This probability can be defined as $p(o, c)$ [12]. A binary label, $\hat{y}$, indicates whether the prediction matches the correct class. This is done per class $c$ of $M$ [12] in an expected-value fashion.

$$CE = -\sum_{c=1}^{M} \hat{y}_{o,c} \log p(o, c) \tag{1}$$
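A small worked example may help make (1) concrete. The sketch below evaluates the categorical cross-entropy for a single observation with a one-hot label; the three-class probabilities are made-up values, and the `eps` guard against `log(0)` is an added assumption rather than part of the equation.

```python
import math

def categorical_cross_entropy(one_hot_label, probabilities, eps=1e-12):
    """CE = -sum over classes c of label[c] * log p(o, c), for one observation o."""
    return -sum(
        y * math.log(max(p, eps))  # eps (assumption) avoids log(0)
        for y, p in zip(one_hot_label, probabilities)
    )

label = [0, 1, 0]                # the observation belongs to class 1
confident = [0.05, 0.90, 0.05]   # hypothetical confident, correct prediction
unsure = [0.40, 0.30, 0.30]      # hypothetical uncertain prediction

ce_confident = categorical_cross_entropy(label, confident)
ce_unsure = categorical_cross_entropy(label, unsure)
print(ce_confident, ce_unsure)
```

Because the label is one-hot, only the log-probability of the correct class contributes, so the uncertain prediction is penalised more heavily than the confident one.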

Equation (2) shows the joint MTL-loss as a linear combination. Both identification and semantic-segmentation are multi-class-tasks, employing (categorical) cross-entropy loss. The losses are weighted in a 10 to 1 ratio; this is because the segmentation-task is both inherently harder and functions as the “teacher” for “distillation.” This is described in (2), where $CE$ is the cross-entropy loss function, $Y$ and $\hat{Y}$ are the respective task inference and label vectors, and $\lambda$ is the loss-weight (i.e., 1 and 0.1 respectively).

$$Loss = \lambda_{Seg} \cdot CE(Y_{Seg}, \hat{Y}_{Seg}) + \lambda_{ID} \cdot CE(Y_{ID}, \hat{Y}_{ID}) \tag{2}$$
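The weighting in (2) can be sketched with a few lines of arithmetic. The loss-weights of 1 (segmentation) and 0.1 (identification) come from the text; the per-task cross-entropy values are hypothetical placeholders chosen only to show how the terms combine.

```python
LAMBDA_SEG = 1.0   # segmentation (teacher-task) loss-weight, from the text
LAMBDA_ID = 0.1    # identification loss-weight, from the text

def joint_loss(ce_seg, ce_id, lambda_seg=LAMBDA_SEG, lambda_id=LAMBDA_ID):
    """Linear combination of the two task losses, as in equation (2)."""
    return lambda_seg * ce_seg + lambda_id * ce_id

# Hypothetical cross-entropy values for one training batch.
ce_seg, ce_id = 0.8, 2.0
total = joint_loss(ce_seg, ce_id)
id_share = (LAMBDA_ID * ce_id) / total  # fraction of the total due to the ID-task
print(total, id_share)
```

With these placeholder values the identification term contributes only a fifth of the total even though its raw cross-entropy is larger, which illustrates how the 10 to 1 ratio lets the teacher-task dominate the training signal.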

Once training is complete, the teacher segmentation-task-layers are removed. This significantly reduces the network parameters, from 6.5M to 2.4M, for inference. Fig. 1 shows the final inference structure (see: first-page), where the encoder color change represents the segmentation knowledge-distillation. The purpose of this architecture is both to retain efficiency and to demonstrate that the dark-knowledge of the segmentation task is sufficient to improve identification.
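The parameter figures above can be sanity-checked with simple arithmetic. The sketch below counts the weights and biases in the two dense identification layers and computes the share of parameters removed with the teacher-task. The 1280-channel encoder output is an assumption (the final feature depth of a standard width-1.0 MobileNetV2); the paper does not state which encoder layer feeds the pooling.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

ENCODER_CHANNELS = 1280  # assumption: standard MobileNetV2 (width 1.0) feature depth
feature_layer = dense_params(ENCODER_CHANNELS, 128)  # 128-neuron feature layer
classifier_layer = dense_params(128, 67)             # 67-neuron ID classifier
id_head_total = feature_layer + classifier_layer

TRAIN_PARAMS = 6.5e6   # joint training, from the text
INFER_PARAMS = 2.4e6   # after removing the teacher-task, from the text
removed = TRAIN_PARAMS - INFER_PARAMS
removed_fraction = removed / TRAIN_PARAMS

print(id_head_total, removed, round(removed_fraction, 3))
```

Under this assumption the identification head is a small part of the 2.4M inference parameters, so most of them belong to the encoder, while roughly 63% of the training-time parameters sit in the removed segmentation layers.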

4 Performance Evaluation

This experiment evaluates the identification accuracy when introducing significant pose-variations. The purpose is to demonstrate the utility of distilling face-segmentation as contextual features. The Seg-Distilled-ID network is validated against identification networks using MobileNetV2 [20] without segmentation-context and three state-of-the-art network encoders.

Figure 3: Mut1ny data-set challenging image samples.

4.1 Experiment: Pose-Invariant Identification

This experiment evaluates identification performance under high pose-variation. The Mut1ny Face/Head Segmentation (commercial edition) data-set [15] is used, employing 67 synthetic users with 150-250 unique perspectives (pose and background) each (11,830 total). Each face is annotated with 14 structure classes: lips, left-eye, right-eye, nose, skin, hair, left-eyebrow, right-eyebrow, left-ear, right-ear, teeth, facial-hair, spectacles and background. The face images are cropped using the Dlib face detection tool [11]. Model verification-accuracy is measured following Labeled Faces in the Wild procedures [8]. Each person has 90% of their face-perspectives allocated to training (8,320) and validation (2,080); test accuracy is evaluated on the remaining 10% (1,430).
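One way to realise a per-person split like the one described above is sketched below. It is only an illustration under stated assumptions: the file names and per-subject image counts are synthetic, and the 80/20 division of the non-test images into training and validation is inferred from the reported 8,320 and 2,080 counts rather than stated in the text.

```python
import random

def split_subject(image_ids, test_frac=0.10, val_frac_of_rest=0.20, seed=0):
    """Shuffle one subject's images, then cut them into train/validation/test."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_test = round(len(ids) * test_frac)
    n_val = round((len(ids) - n_test) * val_frac_of_rest)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

# Hypothetical data-set index: 67 subjects with 150-216 images each.
dataset = {
    f"subject_{s:02d}": [f"subject_{s:02d}/img_{i:03d}.png" for i in range(150 + s)]
    for s in range(67)
}
splits = {name: split_subject(ids, seed=idx) for idx, (name, ids) in enumerate(dataset.items())}

# Reported image counts from the text, checked for internal consistency.
REPORTED = {"train": 8320, "val": 2080, "test": 1430}
print(sum(REPORTED.values()))
```

Splitting within each subject keeps every identity present in all three partitions, which a closed-set classifier with one output neuron per subject requires.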

Fig. 3 shows some sample images (with segmentation-masks) from the evaluation data-set. While there are only 67 people, it is a very challenging face-recognition data-set. There are differences in pose, accessories, facial-hair and illumination. These significantly increase intra-class variability.

4.2 Benchmark Algorithms

The Seg-Distilled-ID network is benchmarked against three state-of-the-art encoders and MobileNetV2 without teacher-task [20]. Note that this is a comparison of encoder knowledge where inputs and ID-loss-function are kept identical. Furthermore, a comparison of pose-estimation versus semantic-segmentation context is viewed as relevant due to the work of Yin and Liu [27]. However, given the Mut1ny data-set does not contain the same pose-annotations, it is not pragmatic to do so. Evaluating input transformation, loss-function and task-sharing approaches is viewed as a key next step.

Each benchmark network follows the same ID task-structure. That is to say, an encoder generates the features, which are global-average-pooled, then classified using a 128-neuron dense feature-layer (ReLU activation) [1] and 67-neuron dense ID-classification-layer (soft-max activation) [12]. The following network feature-encoders are used:

MobileNetV2 [20]
ResNet-101 [5]
VGG-19 [22]
InceptionV3 [23]

Each network is referred to as the encoder “-ID”. E.g., validation-network 1 is designated “MobileNetV2-ID.” All feature-encoders come pre-trained on ImageNet [3]. Networks are compiled and trained in the same fashion, up to 125 epochs with a validation-loss patience of 20. Due to space constraints, training and validation curves are not shown.
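The stopping rule described above (up to 125 epochs with a validation-loss patience of 20) can be sketched as plain control flow. This is a generic patience-based early-stopping loop, not the authors' training code; the validation-loss curves are synthetic and the function name is hypothetical.

```python
def run_with_early_stopping(val_losses, max_epochs=125, patience=20):
    """Return (last_epoch_run, best_epoch) for a sequence of validation losses.

    Training halts once `patience` consecutive epochs pass without the
    validation loss improving on its best value so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return min(len(val_losses), max_epochs), best_epoch

# Synthetic curve: improves for 30 epochs, then drifts upward (over-fitting).
overfit_curve = [1.0 / e for e in range(1, 31)] + [0.05 + 0.001 * k for k in range(1, 96)]
stopped_at, best_at = run_with_early_stopping(overfit_curve)

# Synthetic curve that keeps improving, so the epoch cap applies instead.
steady_curve = [1.0 / e for e in range(1, 201)]
capped_at, capped_best = run_with_early_stopping(steady_curve)
print(stopped_at, best_at, capped_at, capped_best)
```

A network whose validation loss stalls early therefore stops well before the 125-epoch cap, which is the behavior reported for MobileNetV2-ID in the results.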

5 Evaluation Results

Table 1 shows the performance evaluation results. As generally expected, having a stronger encoder correlates with better ID classification. All networks but MobileNetV2-ID train to a validation accuracy of at least 95% (training data not shown for space). This is understandable given the encoder architectures. For example, the InceptionV3-ID network has a relatively-high parameter-count with factorized-convolutions [23] and trains robustly. Conversely, the MobileNetV2-ID stops early and clearly over-fits due to its efficient design.

Network	Parameters	Test Accuracy
MobileNetV2-ID	2.4M	21.9%
ResNet-101-ID	43M	81.6%
VGG-19-ID	20M	96.1%
InceptionV3-ID	22M	96.3%
Seg-Distilled-ID	2.4M (6.5M +Seg)	99.9%

Table 1: Network evaluation on Mut1ny data-set.

This performance disparity exemplifies the benefits of distilling semantic-segmentation features. Despite MobileNetV2-ID over-fitting, the Seg-Distilled-ID has the highest accuracy score evaluated. This is achieved while retaining the MobileNet architecture’s efficiency (approximately one-tenth of the VGG and Inception network parameters). The parenthetical entry in Table 1 indicates that 2.4M parameters are used for inference and 6.5M are used for jointly training with the teacher-task.
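The efficiency comparison can be reproduced directly from the Table 1 figures. The sketch below only restates the reported parameter counts (in millions) and test accuracies and derives the ratios; no new measurements are implied.

```python
# Reported values from Table 1: (inference parameters in millions, test accuracy in %).
TABLE_1 = {
    "MobileNetV2-ID": (2.4, 21.9),
    "ResNet-101-ID": (43.0, 81.6),
    "VGG-19-ID": (20.0, 96.1),
    "InceptionV3-ID": (22.0, 96.3),
    "Seg-Distilled-ID": (2.4, 99.9),
}

seg_params, seg_acc = TABLE_1["Seg-Distilled-ID"]

# Fraction of each benchmark's parameters used by Seg-Distilled-ID at inference.
param_ratio = {
    name: seg_params / params
    for name, (params, _) in TABLE_1.items()
    if name != "Seg-Distilled-ID"
}

# Accuracy gain (percentage points) over the same encoder without the teacher-task.
gain_over_baseline = seg_acc - TABLE_1["MobileNetV2-ID"][1]

for name, ratio in param_ratio.items():
    print(f"{name}: {ratio:.3f}")
print(f"gain over MobileNetV2-ID: {gain_over_baseline:.1f} points")
```

The ratios against VGG-19-ID and InceptionV3-ID come out at roughly 0.12 and 0.11, which is the basis for the "approximately one-tenth" statement.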

The parameter efficiency can be explained by the semantic-segmentation knowledge guiding the encoder toward optimal features. Top-tier encoders use large parameter-spaces to implicitly infer context, enabling them to perceive information the base MobileNetV2-ID cannot. This methodology instead explicitly provides context through the facial-structure teacher-task. Fig. 3 shows the semantic-segmentation masks; one can infer how features that encode the variation of facial-structure with pose enable precise identification across poses. This feature robustness enables the Seg-Distilled-ID to efficiently achieve best-in-class performance.

Note that generalized pose robustness is novel. Others demonstrate that re-aligning the face in 3-D space can improve identification robustness (such as LDF-Net [7] and GridFace [29]). These methods are effective but degrade as yaw and pitch increase. It is hypothesized that, because the 3-D alignment algorithms synthetically infer the obscured facial features, bias from the projector cascades into the identification. The Seg-Distilled-ID avoids this bias by learning facial-structures in a one-shot approach.

6 Conclusions

This paper presents the Seg-Distilled-ID network to address pose-invariance face-recognition. This is a novel application of knowledge-distillation, where an ID-task is jointly-trained with a “distilled” teacher semantic-segmentation-task. Benchmarking with state-of-the-art encoders ResNet-101 [5], VGG-19 [22] and InceptionV3 [23] shows the proposed Seg-Distilled-ID network achieves best-in-class performance using minimal parameters (MobileNetV2 [20] encoder).

Next steps include larger-scale evaluation with varied context-encoding methods. The Mut1ny data-set [15] has only 67 subjects in the synthetic-face repository at this time; hence, the planned next step is to attempt transfer-learning these features onto Labeled Faces in the Wild [8]. To accommodate the increase in ID classes, a comparison of U-Net [19] and DeepLabV3 [2] designs will be done with various task-architectures. Benchmarking will also include face-alignment pre-processing and contrastive loss-functions.

7 Acknowledgments

The authors would like to give special thanks to Ford Motor Company for funding this research via a University Alliance Grant. This in particular includes Ford engineers Justin Miller and Jon Diedrich for their continued support.

References

[1] A. F. Agarap (2018) Deep learning using rectified linear units (relu). arXiv preprint arXiv:1803.08375.
[2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848.
[3] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
[4] Google (2021) Biometrics — android open source project. Google.
[5] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
[6] D. Heun (2021-10) Facial recognition tech is catching on with banks. American Banker.
[7] L. Hu, M. Kan, S. Shan, X. Song, and X. Chen (2017) LDF-net: learning a displacement field network for face recognition across pose. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pp. 9–16.
[8] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller (2008) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. In Workshop on faces in 'Real-Life' Images: detection, alignment, and recognition.
[9] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134.
[10] X. Jin and X. Tan (2017) Face alignment in-the-wild: a survey. Computer Vision and Image Understanding 162, pp. 1–22.
[11] D. E. King (2009) Dlib-ml: a machine learning toolkit. The Journal of Machine Learning Research 10, pp. 1755–1758.
[12] W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICML, Vol. 2, pp. 7.
[13] H. H. Lwin, A. S. Khaing, and H. M. Tun (2015) Automatic door access system using face recognition. International Journal of Scientific & Technology Research 4 (6), pp. 294–299.
[14] B. Moghaddam, W. Wahid, and A. Pentland (1998) Beyond eigenfaces: probabilistic matching for face recognition. In Proceedings third IEEE international conference on automatic face and gesture recognition, pp. 30–35.
[15] Mut1ny (2021-07) Face/head segmentation dataset commercial purpose edition.
[16] L. Pascu (2020-01) Biometric facial recognition hardware present in 90% of smartphones by 2024: biometric update. BiometricUpdate.com.
[17] R. Ranjan, V. M. Patel, and R. Chellappa (2017) Hyperface: a deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE transactions on pattern analysis and machine intelligence 41 (1), pp. 121–135.
[18] Reservations.com (2020-01) Facial recognition statistics in airports: survey shows 43% approve, 33% disapprove. Reservations.com.
[19] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241.
[20] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520.
[21] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823.
[22] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.
[24] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708.
[25] M. Turk and A. Pentland (1991) Eigenfaces for recognition. Journal of cognitive neuroscience 3 (1), pp. 71–86.
[26] Z. Wang, K. He, Y. Fu, R. Feng, Y. Jiang, and X. Xue (2017) Multi-task deep neural network for joint face recognition and facial attribute prediction. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 365–374.
[27] X. Yin and X. Liu (2017) Multi-task convolutional neural network for pose-invariant face recognition. IEEE Transactions on Image Processing 27 (2), pp. 964–975.
[28] X. Zhang and Y. Gao (2009) Face recognition across pose: a review. Pattern recognition 42 (11), pp. 2876–2896.
[29] E. Zhou, Z. Cao, and J. Sun (2018) Gridface: face rectification via learning local homography transformations. In Proceedings of the European conference on computer vision (ECCV), pp. 3–19.
