[

\fnmHongwei \surXu 2007xuhongwei@163.com \fnmSuncheng \surXiang xiangsuncheng17@sjtu.edu.cn * dahong.qian@sjtu.edu.cn [

Abstract

Softmax-based loss functions and their variants (e.g., CosFace, SphereFace, and ArcFace) significantly improve face recognition performance in unconstrained, in-the-wild scenes. A common practice of these algorithms is to optimize the product of the embedding features and the linear transformation matrix. However, in most cases the dimension of the embedding features is chosen from traditional design experience, and little work has studied how to improve performance using the feature itself once its size is fixed. To address this challenge, this paper presents a softmax approximation method called SubFace, which employs subspace features to improve face recognition performance. Specifically, we dynamically select non-overlapping subspace features in each batch during training and use them to approximate the full feature in softmax-based losses, so that the discriminability of the deep model is enhanced for face recognition. Comprehensive experiments conducted on benchmark datasets demonstrate that our method can significantly improve the performance of a vanilla CNN baseline, which supports the effectiveness of the subspace strategy with margin-based losses\footnote{We will release the code publicly on GitHub after publication.}.

Face recognition, Embedding feature, Softmax approximation, Discriminability

\jyear

2021

SubFace: Learning with Softmax Approximation for Face Recognition]SubFace: Learning with Softmax Approximation for Face Recognition

\equalcont

These authors contributed equally to this work.

\equalcont

These authors contributed equally to this work.

\fnm

Dahong \surQian

]\orgdivSchool of Biomedical Engineering, \orgnameShanghai Jiao Tong University, \orgaddress\cityShanghai, \postcode200240, \countryChina

1 Introduction

The introduction of Convolutional Neural Networks (CNNs) has greatly improved performance on face vision tasks over the past 20 years, including data cleaning (45), model-parallel acceleration (46), and face recognition (10; 11; 15; 48). Designing effective loss functions for optimizing CNNs is pivotal for such improvements and has attracted great interest from both academia and industry.

In the field of face recognition, many methods have been proposed to find a discriminative embedding space for more robust learning. For example, the work in (45) introduces k sub-centers for each class into the ArcFace loss to clean noisy data, judging whether a sample is a noisy face according to its distance to the dominant sub-class and to the non-dominant sub-classes. The work in (46) proposes a Partial FC strategy in which the negative samples in the loss function are replaced by a subset of the negative samples for face recognition training. In addition, the work in (48) proposes the soft-triple loss with multiple centers for each category, which can effectively capture the hidden distribution of the data. More recently, margin-based methods, e.g., ArcFace (11) and CosFace (10), achieve better performance by introducing a margin penalty into the loss function, which marks a new milestone of metric learning in the face recognition community.

In essence, these methods emphasize new feature constraints or a better classification interface to achieve competitive performance. More specifically, they are combined with the softmax loss function and applied to face recognition, which improves recognition performance to some degree. However, by analyzing the training process of these approaches, we notice that existing training strategies treat the face feature as a whole, which fails to fully optimize the softmax-based loss over the whole face dataset. Face features can be regarded as high-dimensional vectors, which can help make the features of the same identity more aggregated. Unfortunately, in some cases we observe that local features are less aggregated than global features, as shown in Fig. 1. Intuitively, these variations may be caused by differences in body posture, age, image quality, etc. We believe, however, that the more important reason is that existing training strategies ignore the local discriminative ability of face features during training, which may lead to a non-uniform distribution of feature discrimination.

Figure 1: Cosine distance between the embedding features of positive image pairs, with an embedding dimension of 512. Left of the solid line is the cosine distance of the full 512-d feature; right of the solid line is the minimum cosine distance over 128-d subfeatures randomly sampled from the 512-d feature.
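The comparison in Fig. 1 can be illustrated with a minimal sketch on synthetic data. The embeddings here are random toy vectors rather than real face features, and the helper names (`cosine`, `min_subfeature_cosine`), the noise level, and the number of trials are assumptions made only for this example.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def min_subfeature_cosine(u, v, sub_dim=128, trials=50, seed=0):
    """Smallest cosine similarity over randomly sampled sub_dim-d subfeatures."""
    rng = random.Random(seed)
    worst = 1.0
    for _ in range(trials):
        idx = rng.sample(range(len(u)), sub_dim)  # random coordinates, no repeats
        worst = min(worst, cosine([u[k] for k in idx], [v[k] for k in idx]))
    return worst

# Toy positive pair: two noisy views of the same synthetic 512-d embedding.
rng = random.Random(1)
base = [rng.gauss(0.0, 1.0) for _ in range(512)]
f_a = [b + rng.gauss(0.0, 0.5) for b in base]
f_b = [b + rng.gauss(0.0, 0.5) for b in base]

full_cos = cosine(f_a, f_b)
min_sub_cos = min_subfeature_cosine(f_a, f_b)
print(f"full 512-d cosine: {full_cos:.3f}, min 128-d cosine: {min_sub_cos:.3f}")
```

Even for this idealized pair, the worst 128-d subfeature is less similar than the full feature, which is the kind of gap the figure reports for real embeddings.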

To address this challenge, we propose a softmax approximation method named SubFace for more robust feature learning. First, we determine the dimension of the subspace, and then dynamically select a subspace of that dimension in the embedding feature together with the corresponding part of the linear transformation matrix. The subspace is randomly re-selected at every batch to improve the generalization ability of the subfeature representation. Extensive experiments demonstrate that our training strategy combined with a margin-based loss is effective and achieves results comparable with the state of the art.

To this end, the major contributions of our work can be summarized as follows:

We propose a softmax approximation training method named SubFace for more robust feature learning, which can be integrated seamlessly with margin-based losses for face recognition.
On the basis of the feature approximation strategy, a subfeature normalization mechanism is introduced to further enhance the discriminability of the learned features with respect to intra-class and inter-class relations.
Comprehensive experiments conducted on benchmarks demonstrate that our method achieves competitive performance compared with the state of the art on face recognition tasks.

The remainder of this paper is structured as follows. In Section 2, we review related work on metric-based losses and angular-margin-based losses, and then briefly position our method. In Section 3, the details of the feature mining and approximation strategy are presented. Extensive evaluations against state-of-the-art methods and comprehensive analyses of the proposed approach are reported in Section 4. Finally, the conclusion of this paper and a discussion of future work are presented in Section 5.

2 Related work

Loss functions play a central role in training face recognition models and can greatly improve recognition performance. Existing methods can be categorized into metric-based loss methods (15; 18; 19; 20; 21; 40) and margin-based loss methods (10; 11; 12; 13; 42).

2.1 Metric-based methods

Early works usually use metric-based methods. The common goal of these methods is to learn a distance metric that separates intra-class pairs from inter-class pairs. Common examples include the contrastive loss and the triplet loss. Contrastive losses (18; 19; 20) use paired samples (positive or negative) to train a network to predict whether the two samples belong to the same class. Triplet losses (14; 39; 40; xiang2022learning) require the distance between the query and the negative sample to exceed the distance between the query and the positive sample by a given margin. Since contrastive and triplet losses often lead to slow convergence, the N-pair loss (32) was designed; it improves training convergence by jointly considering the distances between the query sample and multiple negative samples at each update. However, these sample-to-sample comparisons often suffer from an explosion of the sample space. Thus, sample-to-class schemes, e.g., center loss (15) and its variants (21), were proposed to solve the problem of explosive growth in computation: they learn a center for each category and require the features of the same category to be close to their center. Nevertheless, these methods fail to address the open-set problem in face recognition.
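As a small illustration of the triplet constraint described above, the following sketch computes a hinge-style triplet loss on 2-d toy points. The margin value of 0.2, the use of unsquared Euclidean distance, and the function names are assumptions for the example only; individual papers differ in these details.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(query, positive, negative, margin=0.2):
    """Hinge on d(query, positive) - d(query, negative) + margin."""
    return max(0.0, euclidean(query, positive) - euclidean(query, negative) + margin)

query = [0.0, 0.0]
positive = [0.1, 0.0]        # same identity, close to the query
easy_negative = [1.0, 0.0]   # different identity, already far away
hard_negative = [0.2, 0.0]   # different identity, inside the margin

easy_loss = triplet_loss(query, positive, easy_negative)  # constraint satisfied
hard_loss = triplet_loss(query, positive, hard_negative)  # constraint violated
print(easy_loss, hard_loss)
```

Only the hard negative contributes a non-zero loss, which is why such losses depend heavily on which sample pairs are compared.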

2.2 Margin-based methods

Compared with deep metric learning methods, angular-margin-based losses (13; 11; 42) remodel the last fully connected layer of the classification network into a different form. They explicitly add discriminative constraints to the target logit and make the feature space of the same category more compact, which is more efficient and stable. L-softmax (42) constructed a large-margin softmax loss and employed a piecewise function to guarantee the monotonicity of the cosine function. A-softmax (13) normalized the weight matrix $w$ and learned face features on a hypersphere manifold. However, L-softmax and A-softmax are difficult to train due to the complex design of the angular margin. To relieve this dilemma, some researchers (10; 11) simplify the problem: they set the bias term to zero and normalize the weight and the embedding feature, so that the last fully connected layer reduces to a vector inner product. CosFace (10) introduced an additive cosine margin, and ArcFace (11) introduced an additive angular margin. More recently, AdaptiveFace (37) and Fair Loss (38) introduce adaptive margin strategies that adjust the strength of supervision during training to address the problem of unbalanced data.

Compared with existing methods, our proposed SubFace strategy differs in the following aspects: (1) SubFace improves the overall performance of features by mining the discriminative power of local features, whereas previous work (such as ArcFace and CosFace) mainly emphasizes the global feature to learn a discriminative model. (2) The proposed framework has much fewer parameters and can be embedded seamlessly into existing face recognition models, which makes it more flexible and adaptable in real-world scenarios. (3) To the best of our knowledge, we are among the first to introduce a feature approximation mechanism to the face recognition task.

3 Proposed Approach

3.1 Preliminary

A training system usually consists of three parts: training data, a feature extraction network, and a loss function. A face training system can be expressed as $\{\{x_{i}, y_{i}\}, f, F_{loss}\}$, where the training data contain $N$ identities, each training sample $x_{i}$ corresponds to a label $y_{i}$, $f(\cdot)$ is the feature extractor, and $f(x_{i}) \in R^{d}$ is the embedded feature of the $i$-th face sample. $F_{loss}$ denotes the loss function used during training, which can be represented as:

$$F_{loss}(x_{i}, x_{j}) = \mathrm{dis}(f(x_{i}), f(x_{j})) \tag{1}$$

The loss function provides a distance measurement under which similar targets are close and dissimilar targets are far apart. Intuitively, human faces have obvious nonlinear characteristics, and adopting new loss functions in deep neural networks is an effective approach to make features more discriminative. However, in addition to the nonlinear discriminative embedding, there is clearly a locally distinguishable embedding for face recognition. For example, pedestrians can be distinguished and recognized by discriminative parts with an attention mechanism (xiang2020multi). Inspired by this, we explore the identification ability of local features during the training process, which can boost performance on the face recognition task.

Figure 2: The framework of the face recognition training system. The main contribution of the paper is highlighted by the red dashed box, and the step of the approximation strategy is shown in the red solid box. In the loss function part, $f$ is the backbone embedding feature and $w$ is the linear transformation matrix in the FC layer. The training path of most existing methods follows the blue arrow. In this paper, we propose an approximation training strategy along the yellow arrow, where $f'$ and $w'$ replace $f$ and $w$ respectively.

3.2 Our Proposed Method

In this section, we present the approximation strategy to the face recognition training task. Firstly, we illustrate the feature approximation mechanism in Section 3.2.1, and then in Section 3.2.2, we give more detailed information about the subfeature normalization strategy.

3.2.1 Feature approximation

To obtain the sampled features needed for softmax approximation, we propose a feature approximation strategy, which can realize acceleration or parallel processing by sampling positive and negative samples. The overall framework is shown in Fig. 2.

Firstly, we present the details of our approximation method using the softmax loss, which can be formulated as follows:

$$L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{w_{y_{i}}^{T} f_{i} + b_{y_{i}}}}{\sum_{j=1}^{n} e^{w_{j}^{T} f_{i} + b_{j}}} \tag{2}$$

where the inner product $w_{j}^{T} f_{i}$ is regarded as the distance between the feature and the weight. Since $w_{j}$ has the same dimension as $f_{i}$ and can itself be taken as a face representation, and since faces are locally distinguishable, we replace the inner product in Eq. 2 with the inner product of subspaces.

On the basis of the softmax loss, we introduce a dimension selection factor $r \in [0, 1]$. The subfeature can be expressed as $R_{t} * f_{i}$, where $R_{t} \in \{0, 1\}^{d}$ is a random binary vector with entries $R_{(t, j)} \sim \mathrm{Bernoulli}(r)$ and $*$ denotes the element-wise product. Likewise, $R_{t} * w_{y_{i}}$ constitutes a subset of $w_{y_{i}}$. Consequently, we get the approximate loss:

$$\hat{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{(R_{t} * w_{y_{i}})^{T} f_{i}}}{\sum_{j=1}^{n} e^{(R_{t} * w_{j})^{T} f_{i}}} \tag{3}$$
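The approximate loss can be sketched in a few lines of plain Python. This is a minimal illustration on random toy data, not the authors' implementation: the function names, the toy dimensions, and the choice of one mask shared by the whole batch are assumptions, and the bias term is omitted as in Eq. 3.

```python
import math
import random

def bernoulli_mask(d, r, rng):
    """R_t in {0, 1}^d with each entry drawn from Bernoulli(r)."""
    return [1 if rng.random() < r else 0 for _ in range(d)]

def softmax_loss(features, labels, weights, mask=None):
    """Mean cross-entropy; logits are (masked) inner products without bias."""
    d = len(weights[0])
    mask = mask if mask is not None else [1] * d
    total = 0.0
    for f, y in zip(features, labels):
        logits = [sum(m * w_k * f_k for m, w_k, f_k in zip(mask, w, f))
                  for w in weights]
        top = max(logits)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_norm - logits[y]
    return total / len(features)

rng = random.Random(0)
d, n_classes, batch = 16, 4, 8
weights = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_classes)]
labels = [rng.randrange(n_classes) for _ in range(batch)]
features = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(batch)]

mask = bernoulli_mask(d, r=0.7, rng=rng)  # one mask shared by the whole batch
full_loss = softmax_loss(features, labels, weights)          # all dimensions
approx_loss = softmax_loss(features, labels, weights, mask)  # masked dimensions
print(f"kept {sum(mask)}/{d} dims, L = {full_loss:.3f}, L_hat = {approx_loss:.3f}")
```

With an all-ones mask the function reduces to the ordinary softmax loss, and with an all-zeros mask every logit is zero, so the loss equals the logarithm of the number of classes.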

When performing evaluation on downstream tasks, we still use the full feature as the face representation. Analyzing the approximate loss in Eq. 3, we note that $(R_{t} * w_{y_{i}})^{T} f_{i} = w_{y_{i}}^{T} (R_{t} * f_{i})$, so it can be expanded as:

$$\hat{L} = L + T_{\mathrm{Bernoulli}}(r) \tag{4}$$

where $T_{\mathrm{Bernoulli}}(r)$ is a term that depends on the sample ratio $r$. This shows that our approximation strategy has an obvious regularization characteristic.
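The role of the sample ratio can be checked numerically. Because each mask entry has expectation $r$, the expected masked inner product is $r \cdot w^{T} f$. The sketch below verifies this by Monte Carlo on toy vectors; it does not attempt to derive the form of the regularization term, and the dimension, ratio, and trial count are arbitrary assumptions.

```python
import random

def masked_dot(mask, w, f):
    """(R_t * w)^T f, which equals w^T (R_t * f) since the mask is element-wise."""
    return sum(m * w_k * f_k for m, w_k, f_k in zip(mask, w, f))

rng = random.Random(0)
d, r, trials = 64, 0.7, 5000
w = [rng.gauss(0, 1) for _ in range(d)]
f = [rng.gauss(0, 1) for _ in range(d)]
full = sum(w_k * f_k for w_k, f_k in zip(w, f))

# Monte Carlo estimate of E[(R_t * w)^T f] over random Bernoulli(r) masks.
acc = 0.0
for _ in range(trials):
    mask = [1 if rng.random() < r else 0 for _ in range(d)]
    acc += masked_dot(mask, w, f)
mean_masked = acc / trials

print(f"w^T f = {full:.3f}, r * w^T f = {r * full:.3f}, sampled mean = {mean_masked:.3f}")
```

Individual masks scatter around this mean, and that batch-to-batch variation is one way to picture the stochastic perturbation that the sampling introduces.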

Compared with the original softmax loss function, our approximation strategy dynamically replaces the inner product of the feature and the weight with that of their subspace counterparts. When training with this approximation mechanism, the optimization goal for positive samples is to maximize the inner product between the subfeature and the corresponding subweight. Because the feature sampling is random, training pushes the aggregation of local features to be distributed evenly over the whole feature, an effect that the original training method does not have. For negative samples, there is no difference between global features and subspace features in enlarging the inter-class distance.

3.2.2 Subfeature Normalization

In recent methods, normalizing the weight and the feature has become standard. The normalization step makes the CNN focus on optimizing the angle, and the resulting deep face features are more separated. In practice, the bias term $b$ in Eq. 2 is set to 0, and each weight $w_{j}$ and feature $f_{i}$ is normalized, with the feature norm re-scaled to $s$. The softmax loss function can then be formulated as:

$$l_{softmax} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos \theta_{y_{i}}}}{e^{s \cos \theta_{y_{i}}} + \sum_{j=1, j \neq y_{i}}^{n} e^{s \cos \theta_{j}}} \tag{5}$$

$\theta_{y_{i}}$ is the angle between the embedding feature $f_{i}$ and the center $w_{y_{i}}$. The normalization makes the predictions of Eq. 5 depend only on the angle between the feature and the weight. Given the angular classification boundary $\theta$, the convergence constraint of the training process should satisfy:

$$\begin{cases} \arccos(w_{j}^{T} f_{i}) < \theta & \text{if } i = j \\ \arccos(w_{j}^{T} f_{i}) > \theta & \text{if } i \neq j \end{cases} \tag{6}$$

For the same purpose, we want the optimization goal of the feature approximation training method to focus on the angle difference as well. We therefore normalize the subfeature $R_{t} * f_{i}$ and the subweight $R_{t} * w_{j}$, re-scale $R_{t} * f_{i}$ to $s$, and replace the angle in Eq. 5 with the angle between the subfeature and the subweight. The angle between $R_{t} * f_{i}$ and $R_{t} * w_{j}$ should also satisfy:

$$\begin{cases} \arccos(\omega_{j}^{T} x_{i}) < \theta & \text{if } i = j \\ \arccos(\omega_{j}^{T} x_{i}) > \theta & \text{if } i \neq j \end{cases} \tag{7}$$

where $x_{i} = \frac{R_{t} * f_{i}}{\| R_{t} * f_{i} \|}$ and $\omega_{j} = \frac{R_{t} * w_{j}}{\| R_{t} * w_{j} \|}$.
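To make the normalized subfeature concrete, the following sketch compares the angle between a toy feature and a toy class center in the full space with the angle in a randomly selected subspace. The vectors are synthetic and the helper names are hypothetical; dropping the unselected coordinates gives the same angle as multiplying by the binary mask, since zeroed coordinates contribute nothing to the inner product or the norms.

```python
import math
import random

def subvector(v, idx):
    """Keep only the coordinates listed in idx."""
    return [v[k] for k in idx]

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def angle(u, v):
    """Angle in radians between two vectors, via the normalized inner product."""
    cos = sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding error

rng = random.Random(0)
d, r = 512, 0.7
f_i = [rng.gauss(0, 1) for _ in range(d)]
w_j = [a + rng.gauss(0, 0.5) for a in f_i]       # toy class center near f_i
idx = [k for k in range(d) if rng.random() < r]  # coordinates where R_t is 1

theta_full = angle(f_i, w_j)
theta_sub = angle(subvector(f_i, idx), subvector(w_j, idx))
print(f"full angle: {theta_full:.3f} rad, subspace angle: {theta_sub:.3f} rad")
```

The constraint in Eq. 7 is applied to the subspace angle, while evaluation still uses the full feature.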

1: Input: scale s, margin parameters $m_{1}, m_{2}, m_{3}$, class ids label, sample ratio r, feature f, class center w
2: Output: class-wise affinity score logits
3: ind = randperm(r)
4: x = indexselect(f, ind)
5: $\omega$ = indexselect(w, ind)
6: cosine = x * $\omega$
7: cosine = norm(x) * norm($\omega$)
8: onehot = scatter(label, depth, on = 1.0, off = 0)
9: phi = $\cos(\arccos(m_{1} \cdot \mathrm{cosine}) + m_{2}) - m_{3}$
10: logits = s * (onehot * phi + (1 - onehot) * cosine)

Algorithm 1 The proposed SubFace training method
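The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative reading under stated assumptions, not the released implementation: the subspace size is taken as `int(r * d)`, one index set is shared by the whole batch, the cosine is computed after L2 normalization, and the default margins correspond to an ArcFace-style setting ($m_{1} = 1$, $m_{2} = 0.5$, $m_{3} = 0$). The function name `subface_logits` and the toy data are hypothetical.

```python
import math
import random

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def subface_logits(features, centers, labels, r=0.7, s=64.0,
                   m1=1.0, m2=0.5, m3=0.0, rng=random):
    """Margin logits computed on one random subspace shared by the batch."""
    d = len(features[0])
    k = max(1, int(r * d))               # subspace size (assumed rounding)
    ind = rng.sample(range(d), k)        # random coordinates, no repeats
    sub_centers = [l2_normalize([w[i] for i in ind]) for w in centers]
    logits = []
    for f, y in zip(features, labels):
        x = l2_normalize([f[i] for i in ind])
        row = []
        for j, w in enumerate(sub_centers):
            cosine = max(-1.0, min(1.0, sum(a * b for a, b in zip(x, w))))
            if j == y:                   # target class: apply the margins
                inner = max(-1.0, min(1.0, m1 * cosine))
                row.append(s * (math.cos(math.acos(inner) + m2) - m3))
            else:                        # other classes: plain scaled cosine
                row.append(s * cosine)
        logits.append(row)
    return logits

# Toy batch: 4 samples, 10 classes, 512-d embeddings.
data_rng = random.Random(0)
features = [[data_rng.gauss(0, 1) for _ in range(512)] for _ in range(4)]
centers = [[data_rng.gauss(0, 1) for _ in range(512)] for _ in range(10)]
labels = [0, 3, 7, 9]
logits = subface_logits(features, centers, labels, rng=random.Random(1))
print(len(logits), len(logits[0]))
```

The returned logits would then be passed to a standard cross-entropy loss; calling the function again draws a new subspace, which mirrors the per-batch re-sampling described in the text.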

In this paper, we propose a new softmax approximation strategy that exploits the local separability of facial features. The training procedure of our SubFace method is illustrated in Alg. 1. When this approximation strategy is applied to a margin-based loss, our method inherits the advantages of the margin-based loss while enhancing the discriminative ability of any subspace of the embedded features. Consequently, the co-adaptation between feature dimensions is reduced. To the best of our knowledge, this is the first attempt to explore softmax approximation to improve face recognition performance. Comprehensive experiments demonstrate that our method achieves competitive performance compared with existing methods. It is worth mentioning that our method requires only a few lines of code, which makes it flexible and adaptable in real-world scenarios.

4 Experiments

4.1 Datasets

We employ three widely used datasets, CASIA (24), MS1M-RetinaFace (54), and MS1MV2 (11), as our training datasets. MS1MV2 contains 5.8M images of 85K identities. MS1M-RetinaFace contains 5.1M images of 93K identities and is a refined version of MS1M (53). CASIA contains 0.5M images of 10,577 identities. All face images are resized to 112 $\times$ 112.

4.2 Implementation Details

We employ the widely used CNN architectures MobilefaceNet (13), ResNet-50, and ResNet-100 (1; 2) as backbones and set the embedding feature dimension to 512. For a detailed comparison, two margin-based loss functions (ArcFace and CosFace) are used. We set the scale s to 64, the angular margin m to 0.5 for ArcFace, and the cosine margin m to 0.4 for CosFace. The mini-batch size is set to 512. The SGD momentum is set to 0.9 and the weight decay is empirically set to 5e-4. For CASIA, the learning rate starts from 0.1 and is divided by 10 at 36k and 52k iterations, and training finishes at 65k iterations. For MS1MV2, the learning rate is divided at 100k and 160k iterations, and training finishes at 180k iterations. For MS1M-RetinaFace, the learning rate is divided at 100k, 160k, and 210k iterations, and training finishes at 250k iterations. All experiments are conducted with the PyTorch (paszke2019pytorch) framework on a server equipped with two Nvidia A100 GPUs.
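The step schedules described above can be written as a small helper. This is a sketch for illustration: the function name and constants are hypothetical, and whether the decay takes effect exactly at the milestone iteration or just after it is an assumption.

```python
def step_lr(iteration, base_lr=0.1, milestones=(36_000, 52_000), gamma=0.1):
    """Learning rate after dividing by 10 at every milestone already reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed

# Milestones taken from the text; total iterations are noted for reference.
CASIA = (36_000, 52_000)                       # training stops at 65k iterations
MS1MV2 = (100_000, 160_000)                    # training stops at 180k iterations
MS1M_RETINAFACE = (100_000, 160_000, 210_000)  # training stops at 250k iterations

for it in (0, 40_000, 60_000):
    print(it, step_lr(it, milestones=CASIA))
```

The same function covers all three datasets by passing the corresponding milestone tuple.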

4.3 Important Parameter

We employ ResNet-50 as the embedding network and CASIA as the training dataset to train ArcFace with our strategy, and compare the verification results under different sampling ratios in the range [0, 1], as shown in Fig. 3. The experiment shows that performance changes little across sampling ratios, except when the sampling ratio is low.

Figure 3: Verification results on the validation datasets using different sampling ratios in the range [0, 1]. (a) Verification result on LFW. (b) Verification result on CFP-FP. (c) Verification result on AgeDB-30.

For example, a sampling ratio of 0.1 leads to a decrease in recognition accuracy, mainly because the representation dimension of the feature is clearly insufficient: in this experiment, only about 51 of the 512 feature dimensions are used to represent the face during training. When the sampling ratio is greater than 0.4, our approach has little negative impact on model performance, and the results on the three validation sets merely fluctuate. For a fair comparison, we empirically set the sampling ratio to 0.7 in the following experiments.

4.4 Ablation Study

To further validate the effectiveness of our proposed method, we perform several ablation studies on the individual components of SubFace.

Effects of feature approximation: In this experiment, we test our approximation strategy on the validation datasets LFW, CALFW, and CPLFW with a ResNet-50 backbone and ArcFace as the loss function. Fig. 4 shows the feature distance distributions of face pairs at feature sampling ratios of 0.5 and 1.0\footnote{The feature sampling ratio of 1.0 represents the original training strategy.}. At a sampling ratio of 0.5, the Euclidean distances between positive pairs are skewed toward smaller values compared with a sampling ratio of 1.0, while the distance distribution of negative pairs shows no difference between the two cases. Our random strategy makes the embedding features more compact, as the discriminative power of the features is more balanced across the entire dimension.

Figure 4: The distance distribution of face pairs on the LFW, CALFW, and CPLFW datasets respectively.

Effects of subfeature normalization: In this experiment, we verify the convergence of our approximation strategy by observing the volatility of the vector angle. We denote by $\theta_{j}$ the angle between the embedding feature $f_{i}$ and the center $w_{j}$, and by $\theta_{j}'$ the angle between the subfeature $f_{(i, r)}$ and the subcenter $w_{(j, r)}$. We adopt the average cosine distance defined in (46) and judge convergence by the variation of the cosine.

Figure 5: The average cosine distance when the sample ratio is 0.1, 0.5, 0.9 respectively.

As illustrated in Fig. 5, changing the sampling ratio does not affect the final convergence, and an arbitrary $r \cdot d$-dimensional subspace can approximate the performance of the full-dimensional feature. Meanwhile, we observe that the sampling ratio affects the volatility of the average cosine distance. More specifically, when the sampling ratio is smaller, the deviation is larger at the initial stage of training and the volatility of the differences is greater at the later stage.

Effects of lightweight model: Lightweight models, which benefit from smaller network structures and fewer parameters while retaining strong performance, are now widely used in embedded devices and other computer vision applications. In this section, we choose MobilefaceNet as the backbone network since it offers a good balance between computation, parameters, and model performance. MobilefaceNet has 0.99M parameters and a computational cost of 439.8M FLOPs, and it performs well on relevant data. We train ArcFace on CASIA and MS1M-RetinaFace respectively. According to the results in Table 1, our approximation strategy performs well on lightweight models. For example, when trained on MS1M-RetinaFace it achieves 99.58% on LFW, the best result in Table 1, and 96.22% on AgeDB-30. When trained on CASIA, our model is also competitive, matching the previous work (7) on LFW and surpassing it by 0.08% on AgeDB-30.

Method	Params (M)	LFW	AgeDB-30	Training Data
MobileNetV1 (49)	3.2	0.9863	0.8895	0.5M
ShuffleNet (49)	0.8	0.9870	0.8927	0.5M
MobilefaceNet (7)	0.99	0.9928	0.9305	0.5M
MobilefaceNet (CASIA, r=0.7)	0.99	0.9928	0.9313	0.5M
ShuffleFaceNet (0.5X) (47)	0.5	0.9907	0.9245	5.1M
MobilefaceNet (7)	0.99	0.9955	0.9607	3.8M
MobilefaceNetV1 (49)	3.4	0.9940	0.9640	5.1M
ProxylessFaceNAS (49)	3.2	0.9920	0.9440	5.1M
MobilefaceNet (MS1M-RetinaFace, r=0.7)	0.99	0.9958	0.9622	5.1M

Table 1: Verification results on LFW, AgeDB-30 of different lightweight models

Effects of small-scale & large-scale trainset: To validate the effect of training set size, we adopt two training sets, CASIA and MS1MV2, to test the performance of our sampling method. We treat CASIA as the small-scale training set and MS1MV2 as the large-scale training set. To make a fair comparison and reduce the volatility of the test results themselves, we choose ResNet-100 as the backbone and adopt the ArcFace and CosFace losses for optimization.

Method	LFW	CFP-FP	AgeDB-30
CASIA, ResNet-50, ArcFace(0.5) (11)	0.9953	0.9556	0.9515
CASIA, ResNet-50, CosFace(0.35) (11)	0.9951	0.9544	0.9456
MS1MV2, ResNet-100, ArcFace(0.5) (46)	0.9983	0.9845	0.9820
MS1MV2, ResNet-100, CosFace(0.4) (46)	0.9983	0.9851	0.9803
CASIA, ResNet-100, ArcFace(0.5), SubFace (r=0.7)	0.9952	0.9584	0.9540
CASIA, ResNet-100, CosFace(0.4), SubFace (r=0.7)	0.9950	0.9534	0.9485
MS1MV2, ResNet-100, ArcFace(0.5), SubFace (r=0.7)	0.9983	0.9850	0.9823
MS1MV2, ResNet-100, CosFace(0.4), SubFace (r=0.7)	0.9982	0.9837	0.9820

Table 2: Verification accuracy of ResNet-50 and ResNet-100 models on the LFW, CFP-FP, and AgeDB-30 datasets respectively (values are fractions, e.g., 0.9953 = 99.53%).

According to the experimental results in Table 2, the results of the larger model (ResNet-100) are generally better than those of the smaller model (ResNet-50), and the results on the large-scale dataset are better than those on the small-scale dataset with the same model. ArcFace also generally outperforms CosFace under the same conditions when using our strategy. For example, with ResNet-100 trained on MS1MV2, ArcFace with our approximation strategy achieves the best verification accuracy on AgeDB-30 (98.23%), surpassing ArcFace(0.5) (46) by 0.03%. However, the performance difference between our SubFace strategy and ArcFace (46) is very slight on LFW and CFP-FP.

4.5 Comparison with the State-of-the-art Methods

In this section, we evaluate model performance with different settings on face verification datasets such as LFW (26), CFP-FP, and AgeDB-30 (28). We also report the performance of our method on the large-pose and large-age datasets CPLFW (29) and CALFW (30), and on the large-scale image datasets MegaFace (31), IJB-B (23), and IJB-C (35).

Evaluation on LFW, CALFW and CPLFW datasets: LFW is one of the most widely used benchmarks for unconstrained face verification on images and videos; it contains 6,000 comparison pairs, with 3,000 positive pairs and 3,000 negative pairs. CPLFW and CALFW are more recent datasets that show higher pose and age variations with the same identities as LFW. According to Table 3, ArcFace with our strategy achieves 99.85%, 96.30%, and 93.48% on LFW, CALFW, and CPLFW respectively, with the LFW and CALFW results being the best in the table. Compared with the original ArcFace (46), performance improves on all three sets. Even compared with the results obtained on the larger training set Glint360k (46) (row 5 in Table 3), our method is higher by 0.02% and 0.09% on LFW and CALFW respectively.

Method	LFW	CALFW	CPLFW
VGGFace2 (50)	0.9943	0.9057	0.8400
GroupFace (51)	0.9985	0.9620	0.9317
CurricularFace (52)	0.9980	0.9620	0.9313
MS1MV2, ResNet-100, ArcFace (46)	0.9982	0.9545	0.9208
Glint360k, ResNet-100, CosFace (46)	0.9983	0.9621	0.9478
MS1M-RetinaFace, ResNet-100, ArcFace, SubFace (r = 0.7)	0.9985	0.9630	0.9348

Table 3: Verification accuracy of different face recognition models on LFW, CALFW and CPLFW respectively (values are fractions, e.g., 0.9985 = 99.85%).

Evaluation on IJB-B and IJB-C datasets: The IJB-B dataset contains 1,845 subjects with 21.8k still images and 55k frames from 7,011 videos. The IJB-C dataset contains 3,531 subjects with 31.3k still images and 117.5k video frames. The IJB-B verification protocol provides 10,270 genuine matches and 8M impostor matches, and the IJB-C verification protocol provides a total of 19,557 genuine matches and 15,639K impostor matches.

Method	IJB-B	IJB-C
MS1MV2, ResNet-100, ArcFace (11)	0.942	0.956
MS1MV2, ArcFace (46)	0.948	0.962
GroupFace (51)	0.949	0.963
CurricularFace (52)	0.948	0.961
MS1MV2, ArcFace, ResNet-100 (SubFace, r=0.7)	0.9501	0.9638
MS1M-RetinaFace, ArcFace, ResNet-100 (SubFace, r=0.7)	0.9547	0.9685

Table 4: The 1:1 verification accuracy (TAR@FAR=1e-4) on the IJB-B and IJB-C datasets respectively.

On the IJB-B and IJB-C datasets, we employ MS1M-RetinaFace and MS1MV2 as the training data and ResNet-100 as the embedding network; the sampling ratio is set to 0.7 for a fair comparison with the most recent methods. As shown in Table 4, our method further improves the TAR (@FAR=1e-4) to 96.85% and 95.47% on IJB-C and IJB-B respectively. With the same training dataset MS1MV2, our method exceeds ArcFace (11) by 0.81% (95.01% vs 94.2%) and 0.78% (96.38% vs 95.6%) on IJB-B and IJB-C respectively, and it also surpasses CurricularFace (52) (94.8% and 96.1%).
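For readers unfamiliar with the TAR@FAR metric used in Table 4, the following sketch shows one common way to compute it from similarity scores. It is a simplified illustration on made-up scores, not the official IJB evaluation code; the function name and the tie-handling rule (accept only scores strictly above the threshold) are assumptions.

```python
import math

def tar_at_far(genuine_scores, impostor_scores, far=1e-4):
    """True accept rate at the threshold whose false accept rate is at most far."""
    impostors = sorted(impostor_scores, reverse=True)
    # Number of impostor pairs allowed above the threshold; the small epsilon
    # guards against floating-point rounding in far * len(impostors).
    allowed = math.floor(far * len(impostors) + 1e-9)
    threshold = impostors[allowed] if allowed < len(impostors) else float("-inf")
    accepted = sum(1 for s in genuine_scores if s > threshold)
    return accepted / len(genuine_scores)

# Made-up similarity scores: 1,000 impostor pairs and 4 genuine pairs.
impostor_scores = [i / 1000 for i in range(1000)]
genuine_scores = [0.995, 0.99, 0.5, 0.2]
tar = tar_at_far(genuine_scores, impostor_scores, far=0.01)
print(f"TAR@FAR=1e-2: {tar:.2f}")
```

With millions of impostor pairs, as in the IJB protocols, a FAR of 1e-4 still leaves hundreds of impostor scores above the threshold, so the metric is sensitive to the tail of the impostor distribution.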

Evaluation on MegaFace dataset: Finally, we evaluate performance on the MegaFace Challenge. The evaluation protocol of MegaFace includes gallery and probe sets. The gallery set contains 1M images of 690k different individuals, and the probe set contains 100k photos of 530 unique individuals from FaceScrub (36). As there is some noise in the original MegaFace, we adopt the refined MegaFace dataset (11) to make a fair comparison. Table 5 shows the results obtained by ResNet-100 and previous works for both identification and verification tasks on this dataset. The results show that ResNet-100 with our strategy achieves state-of-the-art results with respect to other recent models in the verification scenario.

Method	id	ver
ArcFace(0.5) (11)	0.9835	0.9848
CosFace(0.35) (11)	0.9791	0.9791
CASIA, ResNet-50, ArcFace (11)	0.9175	0.9369
MS1MV2, CosFace (46)	0.9836	0.9858
MS1MV2, ArcFace (46)	0.9831	0.9859
CurricularFace (52)	0.9871	0.9864
MS1M-RetinaFace, ResNet-100, ArcFace (SubFace, r=0.7)	0.9839	0.9871

Table 5: Face identification and verification evaluation of different methods. These experiments are tested on MegaFace Challenge 1 using FaceScrub as the probe set. "id" refers to the rank-1 face identification accuracy with 1e6 distractors, and "ver" denotes the face verification TAR at 1e-6 FAR.

4.6 Discussion

In this work, we propose a face training method based on local feature approximation, which achieves competitive results on several benchmark datasets. On the one hand, compared with existing training methods, we mainly use local features for training, and we verify the convergence and feature compactness of this method in Section 4.4. Unfortunately, the sampled dimensions of the subfeature are the same for all data in a batch, which may not be the optimal choice in some cases. For example, the distinctive local areas of a face may differ between profile and frontal views, or between well-lit and poorly lit conditions. Therefore, there is still room for further optimization in our random sampling method.

On the other hand, we perform an in-depth empirical analysis of our method and observe that training does not reach the best result on every performance indicator, which leads to slight performance degradation on some benchmark datasets. For example, as shown in Table 2, the verification accuracies of our method on LFW and CFP-FP are 99.83% and 98.50% respectively, while the best results of previous methods on LFW and CFP-FP are 99.83% and 98.51%. The reasons for this phenomenon may include the following:

First, the sampling factor in our experiments is fixed at 0.7, whereas the optimal value may differ across experiments.

Second, according to Table 5, the performance gap between our method and existing methods on some datasets (e.g., MegaFace) is relatively small, and such a difference can be compensated through implementation details, training strategy, data augmentation, etc.

Third, the main purpose of our optimization method is to promote the compactness of positive sample features so as to improve face classification performance. In the comparative study in Table 2, the performance difference may mainly result from differences in how some hard samples are handled. These challenges warrant further research and consideration when deploying face recognition models in real scenarios.

5 Conclusion

In this paper, we propose an approximate training strategy named SubFace to enhance the distinguishing ability of features. It uses the subspace of the class center combined with the subfeature to achieve intra-class compactness. Comprehensive experiments conducted on benchmarks demonstrate that SubFace can significantly improve the performance of a vanilla CNN baseline with margin-based loss on face recognition, showing its competitiveness with the state of the art. In future work, we will combine SubFace with other optimization strategies, such as distributed computing and neural architecture search, to further improve the performance of the proposed method.

\bmhead

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Project (Grant No. 81974276). The authors would like to thank the anonymous reviewers for their valuable suggestions and constructive criticisms.

Declarations

Funding
This work was supported by the National Natural Science Foundation of China under Project (Grant No. 81974276).
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not Applicable. The datasets and the work do not contain personal or sensitive information, no ethical issue is concerned.
Consent to participate
The authors are fine that the work is submitted and published by Machine Learning Journal. There is no human study in this work, so this aspect is not applicable.
Consent for publication
The authors are fine that the work (including all content, data and images) is published by Machine Learning Journal.
Availability of data and material
The data used for the experiments in this paper are available online, see Section 4.1 for more details.
Code availability
The code will be publicly available once the work is published upon agreement of different sides.
Authors’ contributions
Hongwei Xu and Suncheng Xiang contributed conception and design of the study, as well as the experimental process and interpreted model results. Dahong Qian obtained funding for the project and provided clinical guidance. Hongwei Xu and Suncheng Xiang drafted the manuscript. All authors contributed to manuscript revision, read and approved the submitted version.

[