MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition

Xiaodong Chen cxd1230@mail.ustc.edu.cn University of Science and Technology of China , Wu Liu liuwu1@jd.com JD Explore Academy , Xinchen Liu liuxinchen1@jd.com JD Explore Academy , Yongdong Zhang zyd73@ustc.edu.cn University of Science and Technology of China , Jungong Han jungong.han@aber.ac.uk Aberystwyth University and Tao Mei tmei@jd.com JD Explore Academy

2022

Abstract.

Recognizing human actions from point cloud videos has attracted tremendous attention from both academia and industry due to its wide applications such as autonomous driving and robotics. However, current methods for point cloud action recognition usually require a huge amount of manually annotated data and a complex backbone network with high computation cost, which makes them impractical for real-world applications. Therefore, this paper considers the task of semi-supervised point cloud action recognition. We propose a Masked Pseudo-Labeling autoEncoder (MAPLE) framework to learn effective representations with far fewer annotations for point cloud action recognition. In particular, we design a novel and efficient Decoupled spatial-temporal TransFormer (DestFormer) as the backbone of MAPLE. In DestFormer, the spatial and temporal dimensions of the 4D point cloud videos are decoupled to achieve efficient self-attention for learning both long-term and short-term features. Moreover, to learn discriminative features from fewer annotations, we design a masked pseudo-labeling autoencoder structure to guide the DestFormer to reconstruct features of masked frames from the available frames. More importantly, for unlabeled data, we exploit the pseudo-labels from the classification head as the supervision signal for the reconstruction of features from the masked frames. Finally, comprehensive experiments demonstrate that MAPLE achieves superior results on three public benchmarks and outperforms the state-of-the-art method by 8.08% accuracy on the MSR-Action3D dataset. (Project page: www.xiaodongchen.cn/MAPLE/.)

Keywords: Point Cloud Action Recognition, Semi-supervised Learning, Auto-encoder, Vision Transformer

Published in: Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal. ISBN: 978-1-4503-9203-7/22/10. DOI: 10.1145/3503161.3547892. CCS Concepts: Computing methodologies → Computer vision; Computing methodologies → Activity recognition and understanding.

1. Introduction

Point cloud videos, compared with 2D RGB videos, contain richer visual and geometric information for action recognition. Researchers from academia and industry have recently focused on point cloud action recognition due to its wide potential applications in autonomous driving, industrial manufacturing, robotics, and so on (Li et al., 2010; Martin-Martin et al., 2021; Liu et al., 2018; Sun et al., 2022; Liu et al., 2022; Zheng et al., 2022). With the development of deep learning techniques such as deep neural networks and the transformer (Vaswani et al., 2017), significant progress has been made in this task (Liu et al., 2019; Qi et al., 2017a). However, the high computation cost and the requirement of large-scale annotated data hinder the practical application of point cloud action recognition.

Figure 1. Comparisons of different semi-supervised methods, i.e., VAT (Miyato et al., 2019), EntMin (Grandvalet and Bengio, 2004), and Pseudo Label (Lee and others, 2013) on the MSR-Action3D dataset in terms of classification accuracy.

To recognize human actions in point cloud videos, the mainstream methods fall into three categories. The first (Wang et al., 2020) converts the point cloud video into a series of ordered voxels and then applies traditional grid-based convolutions to these voxels. The second (Liu et al., 2019; Qi et al., 2017b) models and tracks local points with PointNet-based (Qi et al., 2017a) models such as MeteorNet (Liu et al., 2019). However, these two types of methods suffer from low computational efficiency and point-tracking errors (Liu et al., 2019), respectively. To address these problems, Fan et al. (Fan et al., 2021) proposed a third type of method that extracts short-term local features by 4D convolutions and models long-term global information with the transformer. Nevertheless, transformer-based methods usually require large-scale labeled data for training, as the transformer has a larger model capacity (Zhao et al., 2021). Although a large number of point cloud videos can be easily obtained, manually annotating point clouds is often much more costly than annotating 2D RGB videos, which hinders the application of these methods. Therefore, this paper focuses on the task of semi-supervised point cloud action recognition, which aims to reduce the reliance on manual labels in point cloud action recognition using a more efficient model.

Although the accuracy of current methods has been greatly improved, designing an annotation- and computation-efficient framework for point cloud action recognition still faces several challenges. First of all, due to the noise and ambiguity of point clouds, learning discriminative features and modeling the spatial-temporal patterns from the point clouds is a great challenge. In image classification, researchers have studied combining techniques of CNNs with the transformer to improve the capability while reducing the computation complexity of the transformer models. For example, Swin-Transformer (Liu et al., 2021) greatly enhances the capacity of the transformer while improving its efficiency through shifted windows, which demonstrates the potential of the transformer in the modality of RGB images. However, such models are limited on the point cloud action recognition task due to the irregularity of the point clouds.

The other challenge is how to reduce the dependence on manual annotations while preserving the capability of the learned feature representations through an appropriate learning paradigm. A common and effective learning framework is semi-supervised learning, which has rich applications in the fields of image recognition and video understanding. Besides, Self-Supervised Learning (SSL) is also a powerful learning framework that learns generalizable representations from unlabeled data. In particular, some recent research in the field of SSL (He et al., 2021, 2020; Chen et al., 2020b) has shown excellent results, yet self-supervision alone is still insufficient due to its limited practical applicability. To solve this dilemma, Zhai et al. (Beyer et al., 2019) proposed a learning framework that combines self-supervised learning and semi-supervised learning, which has become a new paradigm in the semi-supervised field. However, limited by the invariance and unordered properties (Qi et al., 2017a) of the point clouds, such methods cannot be directly applied to point cloud action recognition.

To overcome these challenges, we propose a novel learning framework named Masked Pseudo-Labeling autoEncoder (MAPLE) for point cloud action recognition. It introduces an autoencoder into the semi-supervised point cloud action recognition task. We also design an efficient transformer-based model named Decoupled spatial-temporal TransFormer (DestFormer) for this new learning framework. Based on this DestFormer backbone, we design an encoder-decoder structure for MAPLE. It consists of a spatial extractor for learning short-term global features of actions, a temporal encoder for learning long-term action information, and a temporal decoder for feature reconstruction. To learn action information from the unlabeled action sequences, we reconstruct the masked action sequence with a high masking ratio (e.g., 75%) during the training process. However, directly reconstructing the video action sequence tends to result in non-convergence of the model training and a decrease in classification performance. Inspired by knowledge distillation (Hinton et al., 2015), we implicitly reconstruct the masked input sequence with the pseudo-label generated by the classification head to avoid this situation. Besides, to exploit the potential of MAPLE, we further combine MAPLE with classical semi-supervised learning methods to improve the performance of semi-supervised point cloud action recognition.

We conduct extensive experiments with our MAPLE framework for semi-supervised point cloud action recognition on three widely-used datasets: MSR-Action3D (MSR3D) (Li et al., 2010), NTU RGB+D 60 (NTU60) (Shahroudy et al., 2016), and NTU RGB+D 120 (NTU120) (Liu et al., 2020a). As shown in Fig. 1, we make remarkable progress on the mainstream datasets compared to previous methods, e.g., VAT (Miyato et al., 2019), EntMin (Grandvalet and Bengio, 2004), and Pseudo Label (Lee and others, 2013). The MAPLE framework achieves new state-of-the-art performance in semi-supervised point cloud action recognition.

In summary, the contributions of this paper are three-fold:

We present one of the first attempts toward semi-supervised point cloud action recognition which aims to learn efficient action representations from massive point cloud videos with fewer manual annotations.
We design a Decoupled spatial-temporal TransFormer, named DestFormer, as the backbone of our semi-supervised learning framework, which decouples the spatial and temporal dimensions of the 4D point cloud videos for achieving a more efficient and effective self-attention.
We propose a Masked Pseudo-Labeling autoEncoder (MAPLE) framework for learning a generalizable and discriminative classifier through reconstructing motion features of masked frames from the available action frames.

Figure 2. The detail of DestFormer. (a) Data Preparation: we construct some local areas (e.g. “a”) on adjacent frames (e.g. “t1”, “t2”) from the input $x_{i}$ , as P4Conv (Fan et al., 2021) does. (b) Spatial Extractor: we adopt P4Conv for modeling short-term local information and feed the output $s_{i}$ frame by frame into a spatial transformer for extracting the merged local feature $m_{i}$ . (c) Temporal Aggregator: we generate the short-term global feature $g_{i}$ through the pooling layer and aggregate the long-term global information with the temporal encoder. (d) Prediction Head: we project the global feature $v_{i}$ into label space via the classification head.

2. Related Work

Point cloud action recognition is a popular topic of video understanding in computer vision, which aims to help machines understand the 3D world. Based on the characteristics of previous methods, three main categories can be distinguished. The first one is mainly based on voxels obtained from point clouds. 3D dynamic voxel (3DV) (Wang et al., 2020) brought the voxelization of point clouds into point cloud action recognition via temporal rank pooling. It learned both action information through voxels and appearance through point clouds to encode the temporal information. The second type of method operates directly on the original point cloud with PointNet-based (Qi et al., 2017a) models. For example, PointNet++ (Qi et al., 2017b) borrowed the idea of local receptive fields to extract the spatial information of point clouds. MeteorNet (Liu et al., 2019) further constructed the concept of spatial-temporal neighborhoods based on PointNet++ and determined the neighborhoods with direct grouping or chained-flow grouping. The last category adopts the data-hungry transformer-based model in point cloud action recognition. P4Transformer (Fan et al., 2021) directly modeled the action and appearance information of the whole video while avoiding both the point tracking used in MeteorNet and the complex calculations of voxelization. Similarly, PST$^{2}$ (Wei et al., 2022) captured the spatial-temporal context information with a spatial-temporal self-attention module. Our DestFormer belongs to the last category, but requires fewer floating-point operations (FLOPs), offers strong model capability, and depends less on annotations.

Semi-Supervised Learning is an important research topic in the field of pattern recognition and machine learning, which learns from datasets that contain a large set of unlabeled data and a much smaller set of labeled data. The theory and algorithms of semi-supervised learning were first summarized by Chapelle (Chapelle et al., 2006) in 2006 and Zhu (X., 2008) in 2008. Semi-supervised learning methods can be divided into two categories: the inductive methods and the transductive methods. The inductive methods (Liu et al., 2010; Triguero et al., 2015; Sheikhpour et al., 2017) usually construct a classifier for predicting the labels of the whole dataset, including both the labeled and unlabeled data. By way of illustration, Grandvalet and Bengio (Grandvalet and Bengio, 2004) optimized the pseudo labels generated for unlabeled data with conditional entropy minimization. Miyato et al. (Miyato et al., 2019) added small perturbations to the original input and constrained the output of unlabeled data with regularization. The transductive methods (Jebara et al., 2009; Liu et al., 2012; Subramanya and Talukdar, 2014) are typically built on graph-based models. Different from the inductive methods, the transductive methods do not produce a classifier for prediction. They usually define a graph over all input data and encode the relationships between pairs of data points.

Self-Supervised Learning completely abandons the reliance on manual labels by adopting the input itself as supervision, and has made great progress in representation learning in the last few years. Following the survey of Liu et al. (Liu et al., 2020b), its main methods can be divided into three categories: generative SSL, contrastive SSL, and generative-contrastive SSL. Generative SSL trains a generator consisting of an encoder and a decoder to reconstruct the input data. Its representative works in natural language processing are GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), which predict the discarded content from the partially discarded input sequence. In the field of computer vision, especially in the areas of image classification and image generation, PixelCNN (van den Oord et al., 2016), VQ-VAE-2 (Razavi et al., 2019), and MAE (He et al., 2021) successfully used the whole input image as the self-supervised target. Compared to generative SSL, the motivation of contrastive SSL is to measure the similarity of different inputs (e.g., mutual information maximization and instance discrimination). Influential works include MoCo (He et al., 2020; Chen et al., 2020b), BYOL (Grill et al., 2020), and SimCLR (Chen et al., 2020a). As for generative-contrastive SSL, most works focus on learning knowledge from unlabeled data with generative adversarial networks.

This paper links the inductive semi-supervised algorithm (e.g. Pseudo Label (Lee and others, 2013)) with the generative self-supervision (e.g. MAE (He et al., 2021)). Our MAPLE uses the short-term pseudo label instead of the short-term features as the reconstruction target. By this means, it can achieve the stable reconstruction of masked frames and learn generalizable features from unlabeled point cloud videos.

3. The Proposed MAPLE Framework

Figure 3. The detail of our MAPLE. (a) Adopting our DestFormer backbone and the cross-entropy loss for the supervised training. (b) The complete training process of MAPLE: (1) The spatial extractor encodes the input video as the short-term global feature $g_{i}$ . (2) After randomly discarding the short-term global feature $g_{i}$ , the temporal encoder projects the visible subset of $g_{i}$ as the latent representation $z_{i}$ . (3) The temporal decoder is responsible for reconstructing $r_{i}$ from the latent representation $z_{i}$ and the mask tokens $M$ . (4) The classification head generates the pseudo-label $P_{i}$ and $\hat{P}_{i}$ as our reconstruction target. Note that the modules here with the same colors share weights.

In this section, we describe the detailed framework of our MAPLE. Our MAPLE is built on a Decoupled spatial-temporal TransFormer (DestFormer) backbone, as shown in Fig. 2. The DestFormer takes the point cloud videos $x_{i}$ as input and applies the spatial extractor, temporal aggregator, and prediction head in sequence to extract the global feature $v_{i}$ and predict the final class label $y_{i}$ . On this basis, our MAPLE builds a masked autoencoder learning framework for semi-supervised action recognition, as shown in Fig. 3. It consists of an encoder-decoder structure and implicitly reconstructs the masked input feature with the pseudo-label generated by the classification head. Before describing our MAPLE in detail, we first introduce the necessary notations and definitions of the semi-supervised point cloud action recognition task.

3.1. Preliminary

The task of point cloud action recognition is, given a point cloud video of humans, to predict the human behavior and actions in the video. Semi-supervised point cloud action recognition is consistent with the point cloud action recognition task in the inference phase, but usually adopts a different paradigm in the training phase as follows.

1) We have a dataset $D$ , which contains a labeled subset $D_{l} = \{(x_{i}, y_{i})\}$ and an unlabeled subset $D_{u} = \{x_{j}\}$ . Both $D_{l}$ and $D_{u}$ are sampled i.i.d. from the same distribution $p (x)$ and in general the size of $D_{l}$ is smaller than the size of $D_{u}$ . Let $x_{i} = \{X_{t} \in \mathbb{R}^{(3 + C) \times N}\}_{t = 1}^{T}$ denote the matrix sequence of a point cloud video, where $N$ indicates the number of points in each frame, $T$ indicates the number of frames in this video, and $\mathbb{R}^{3}$ and $\mathbb{R}^{C}$ indicate the spatial coordinates and the feature dimension of one point, respectively. It is worth noting that there are no point cloud features (i.e., $C = 0$ ) in the given datasets (NTU RGB+D 60 (Shahroudy et al., 2016), NTU RGB+D 120 (Liu et al., 2020a) and MSR-Action3D (Li et al., 2010)).

2) The training of model $f_{θ} (\cdot)$ has an optimization function of the following form:

$$\min_{θ} \; L_{l} (D_{l}, θ) + α L_{u} (D_{u}, θ), \tag{1}$$

where $L_{l}$ is the loss function (e.g., cross-entropy loss, mean squared error loss, Hinge loss, etc.) for classification on the labeled dataset $D_{l}$ , $L_{u}$ is the optimization objective designed for the unlabeled dataset $D_{u}$ (the design of this function varies from paper to paper, and we discuss our $L_{u}$ in Section 3.3), $α$ is the positive scalar weight for $L_{u}$ , and $θ$ denotes the learnable parameters of $f_{θ} (\cdot)$ .
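As a minimal illustration of Eq. (1), the following Python sketch combines per-sample supervised and unsupervised loss values into one scalar objective. The function name `semi_supervised_objective` and the example numbers are hypothetical; they only show how $α$ balances the two terms.

```python
def semi_supervised_objective(labeled_losses, unlabeled_losses, alpha):
    """Return mean(L_l) + alpha * mean(L_u) for one mini-batch.

    Both arguments are lists of per-sample loss values (hypothetical
    inputs; how they are computed depends on the chosen method).
    """
    if alpha <= 0:
        raise ValueError("alpha is assumed to be a positive scalar weight")
    supervised = sum(labeled_losses) / len(labeled_losses)
    unsupervised = sum(unlabeled_losses) / len(unlabeled_losses)
    return supervised + alpha * unsupervised


# Made-up loss values: mean(L_l) = 1.0, mean(L_u) = 0.5, alpha = 0.5.
objective = semi_supervised_objective([0.5, 1.5], [0.25, 0.75], alpha=0.5)
```

With these toy values the objective evaluates to 1.0 + 0.5 × 0.5 = 1.25.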

3.2. Decoupled Spatial-temporal TransFormer

This subsection presents our backbone for point cloud action recognition named Decoupled spatial-temporal TransFormer (DestFormer). The design of DestFormer is based on P4Transformer (Fan et al., 2021), and its main purpose is to serve as a basic backbone for the semi-supervised learning framework MAPLE and learn discriminative motion representations with fewer annotations. As shown in Fig. 2, the DestFormer consists of four parts: data preparation, spatial extractor, temporal aggregator, and prediction head. Note that embedding features in figures with different colors (e.g. “t1” and “t3”) correspond to the different keyframes of the input action sequence.

Data Preparation. For the input point cloud video $x_{i}$ , we construct some local areas (e.g. “a”,“b”,“c”) on adjacent frames (e.g. “t1”,“t2”), as Point 4D Convolution (P4Conv) (Fan et al., 2021) does. The calculation of the local areas is based on the Farthest Point Sampling (FPS) algorithm (Qi et al., 2017b), and the full calculation process is described in (Fan et al., 2021).

Spatial Extractor (SE). SE is designed to extract the short-term global feature $g_{i} = \{G_{ζ \cdot (t - 1) + 1} \in \mathbb{R}^{D}\}_{t = 1}^{T / ζ}$ from the local areas. We first extract the short-term local feature $s_{i} = \{S_{ζ \cdot (t - 1) + 1} \in \mathbb{R}^{D \times (N / κ)}\}_{t = 1}^{T / ζ}$ through P4Conv (Fan et al., 2021), where $D$ is the dimension of the short-term local feature, $κ \geq 1.0$ denotes the spatial scaling rate, and $ζ \geq 1.0$ denotes the temporal scaling rate. P4Conv aggregates the local information across $ζ$ adjacent frames. After that, we feed the short-term local feature $s_{i}$ frame by frame into the Spatial Transformer (ST) modules to extract the merged short-term local feature $m_{i} = \{M_{ζ \cdot (t - 1) + 1} \in \mathbb{R}^{D \times (N / κ)}\}_{t = 1}^{T / ζ}$ , which aggregates the information of different spatial parts.

Temporal Aggregator (TA). TA consists of a pooling layer and a transformer-based (Vaswani et al., 2017) Temporal Encoder (TE). We first obtain the short-term global feature $g_{i}$ from the merged short-term local feature $m_{i}$ through the pooling layer (e.g. maximum pooling). Then we aggregate the long-term global feature $z_{i}$ from the short-term global feature $g_{i}$ with our TE module.

Prediction Head. Following the TA module is the pooling layer (e.g. maximum pooling) and classification head, which consists of Layer Normalization layers (LayerNorm), linear layers, and Gaussian Error Linear Units (GELUs). Its role is to project the global feature $v_{i}$ into the label space and generate the corresponding pseudo labels for classification.
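To give a rough sense of why decoupling the spatial and temporal dimensions helps, the sketch below counts pairwise attention scores, a simplified proxy for self-attention cost. It ignores the feature dimension, the number of heads and blocks, P4Conv, and the feed-forward layers, so it does not reproduce any measured FLOPs. The token counts (12 frame positions, 64 local areas per frame) are hypothetical and are not the settings used in this paper.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key scores in one self-attention over num_tokens."""
    return num_tokens * num_tokens


def joint_attention_pairs(frames: int, points: int) -> int:
    """One attention over all frames * points spatial-temporal tokens."""
    return attention_pairs(frames * points)


def decoupled_attention_pairs(frames: int, points: int) -> int:
    """Per-frame spatial attention, then one temporal attention over frames."""
    spatial = frames * attention_pairs(points)  # applied frame by frame
    temporal = attention_pairs(frames)          # one pooled token per frame
    return spatial + temporal


# Hypothetical sizes: T / zeta = 12 frame positions, N / kappa = 64 areas.
joint = joint_attention_pairs(12, 64)
decoupled = decoupled_attention_pairs(12, 64)
```

Under these assumed sizes the joint attention evaluates 589,824 scores, while the decoupled variant evaluates 49,296.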

3.3. Masked Pseudo-labeling Autoencoder

This subsection elaborates on our Masked Pseudo-Labeling autoEncoder (MAPLE) framework for semi-supervised point cloud action recognition. As shown in the left part of Fig. 3, we adopt our DestFormer backbone and the cross-entropy loss for the supervised training with labeled point cloud videos. The right part of Fig. 3 shows the complete training process of our MAPLE with unlabeled point cloud videos. Similar to the reconstruction process of autoencoders, the spatial extractor encodes the point cloud videos as the short-term global embedding feature $g_{i}$ . The temporal encoder of our MAPLE projects the visible subset of the embedding feature $g_{i}$ into the latent space $Z$ , and the temporal decoder is responsible for reconstructing the feature from the latent representation $z_{i}$ and the learnable mask tokens $M$ . However, different from classical autoencoders, our MAPLE implicitly reconstructs the original signal through the pseudo-label generated from the classification head, rather than reconstructing the original signal itself. We introduce the training process in detail as follows:

Masking the short-term global feature $g_{i}$ that is extracted from the Spatial Extractor (SE) modules is the first step of our framework. Similar to MAE (He et al., 2021), we directly discard a subset (e.g., 50%) of the original short-term global feature $g_{i}$ with random sampling. The motivation of masking is to help the model efficiently understand the order of actions by reconstructing the complete action sequence from the incomplete one.
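The random discarding step can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `random_frame_mask`, the use of Python's `random.Random`, and the rounding of the number of masked positions are choices made here and are not taken from the paper.

```python
import random


def random_frame_mask(num_frames, mask_ratio, rng):
    """Split temporal positions into visible and masked index lists.

    num_frames: number of short-term global features in the sequence
    mask_ratio: fraction of positions to discard (e.g. 0.5 or 0.75)
    rng:        a random.Random instance, passed in for reproducibility
    """
    num_masked = int(round(num_frames * mask_ratio))  # rounding is an assumption
    order = list(range(num_frames))
    rng.shuffle(order)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Hypothetical sequence of 12 frame-level features with a 75% masking ratio.
visible_idx, masked_idx = random_frame_mask(12, 0.75, random.Random(0))
```

Only the features at `visible_idx` would be passed on to the temporal encoder; the positions in `masked_idx` are the ones the decoder later has to reconstruct.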

Temporal Encoder (TE) is a lightweight transformer that only contains several self-attention blocks in the second step of our MAPLE. We directly feed the masked short-term global features $g_{i}$ into the TE module without adding positional embedding, since its temporal positional embedding is already added when fed into the SE module.

Temporal Decoder (TD) is also a lightweight transformer that is used to reconstruct the removed embedding feature $g_{i}$ in the third step of our MAPLE. Before feeding the latent representation $z_{i}$ into the TD modules, we first insert the shared and learnable mask token $M$ at the position of the original abandoned features and then add the new temporal positional embedding to the full set of sequences. Note that the shared mask without new temporal positional embedding cannot reconstruct the action information at different times.
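A minimal sketch of how the decoder input could be assembled: the latent vectors are placed back at their original temporal positions, a shared mask token fills the discarded positions, and a temporal positional embedding is added to every position. The function `build_decoder_input` and the toy values are hypothetical; in the model the mask token is a learnable vector rather than a fixed list.

```python
def build_decoder_input(latent, visible_idx, num_frames, mask_token, pos_embed):
    """Return the full-length token sequence fed to the temporal decoder.

    latent:      encoder outputs for the visible positions, in order
    visible_idx: temporal index of each latent vector
    mask_token:  one shared vector used at every discarded position
    pos_embed:   one positional vector per temporal position
    """
    # Start with the shared mask token everywhere, then restore visible ones.
    full = [list(mask_token) for _ in range(num_frames)]
    for vector, index in zip(latent, visible_idx):
        full[index] = list(vector)
    # Add the temporal positional embedding so masked slots become distinct.
    return [[a + b for a, b in zip(token, pos)]
            for token, pos in zip(full, pos_embed)]


# Toy example: 4 positions, positions 0 and 3 visible, feature size 2.
decoder_input = build_decoder_input(
    latent=[[1.0, 1.0], [2.0, 2.0]],
    visible_idx=[0, 3],
    num_frames=4,
    mask_token=[0.5, 0.5],
    pos_embed=[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
)
```

In the toy example the two masked positions receive different vectors only because of the positional embedding, which is why a shared mask token alone could not distinguish different time steps.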

Reconstruction Target. As shown in the final step of our MAPLE in Fig. 3 (b), the target of reconstruction is calculated with the pseudo-label $P_{i}$ instead of the original feature $g_{i}$ . We feed both the original feature $g_{i}$ and the reconstructed feature $r_{i}$ into the classification head to obtain their corresponding pseudo-labels $P_{i}$ and $\hat{P}_{i}$ . Note that the original pseudo-label $P_{i}$ is generated without backpropagation to maintain the stability of the training process. Following this target, our unsupervised loss can be defined with the Kullback-Leibler divergence:

$$L_{u} = L_{maple} = \frac{1}{| D_{u} |} \sum_{x_{i} \in D_{u}} K L \big( f (P_{i} \mid x_{i}) \,\|\, f (\hat{P}_{i} \mid x_{i}) \big), \tag{2}$$

where $K L$ is the Kullback-Leibler divergence, $| D_{u} |$ is the size of the unlabeled dataset, $f (\cdot)$ is the model, $P_{i}$ is the pseudo-label generated from the original feature $g_{i}$ , and $\hat{P}_{i}$ is the reconstructed pseudo-label generated from the reconstructed feature $r_{i}$ . Algorithm 1 and Algorithm 2 present the training and inference process of our MAPLE, respectively.

Compared to reconstructing the original feature, reconstructing the pseudo-label not only improves the performance of classification but also makes the training stage more stable. We compare these two strategies in detail in Section 4.5.
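The loss in Eq. (2) can be illustrated with a small, framework-free sketch. It assumes that the classification head outputs logits that are turned into probabilities with a softmax, which is not stated explicitly in the text, and the names `softmax`, `kl_divergence`, and `maple_loss` are hypothetical. Because plain Python has no automatic differentiation, the stop-gradient on the original pseudo-label $P_{i}$ is only indicated in a comment.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def maple_loss(original_logits, reconstructed_logits):
    """Mean KL(P_i || P_hat_i) over a batch of unlabeled samples.

    original_logits come from the unmasked features and act as the target;
    in a real framework they would be detached from the gradient graph.
    """
    total = 0.0
    for target, recon in zip(original_logits, reconstructed_logits):
        total += kl_divergence(softmax(target), softmax(recon))
    return total / len(original_logits)


# Toy logits for one sample with three classes (made-up values).
loss_same = maple_loss([[2.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]])
loss_diff = maple_loss([[2.0, 0.0, 0.0]], [[0.0, 2.0, 0.0]])
```

The loss is zero when the reconstructed pseudo-label matches the original one and grows as the two predicted distributions diverge.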

Stage 1: Pre-training with labeled dataset $D_{l}$ (corresponding to the left part of Fig. 3).
Initialization: the network parameters of DestFormer $θ$ ; basic learning rate $η$ ; the labeled batch size $b_{l}$ ; the supervised cross-entropy loss $L_{l}$ .
repeat
    for t = 1 … max iteration num:
        fetch mini-batch $d_{l}$ from $D_{l}$ ;
        compute loss $L_{l}$ on $d_{l}$ ;
        update $θ^{t} = θ^{t - 1} - η \nabla L_{l}$ ;
until stable accuracy and loss in the validation set.

Stage 2: Training of MAPLE with unlabeled dataset $D_{u}$ (corresponding to the right part of Fig. 3).
Initialization: positive scalar weight $α$ for unsupervised loss $L_{u}$ ; unlabeled batch size $b_{u}$ , where $b_{u} \geq b_{l}$ .
repeat
    for t = 1 … max iteration num:
        fetch mini-batch $d_{l}$ from $D_{l}$ and $d_{u}$ from $D_{u}$ ;
        compute loss $L = L_{l} + α L_{u}$ on $d_{l}$ and $d_{u}$ ;
        update $θ^{t} = θ^{t - 1} - η \nabla L$ ;
until stable accuracy and loss in the validation set.

Algorithm 1 The training process of our MAPLE.

Initialization: the DestFormer model $f (\cdot)$ without the temporal decoder; the best-trained network parameters $θ$ .
repeat
    for t = 1 … final test batch:
        fetch mini-batch $d_{t}$ from test dataset $D_{t}$ ;
        calculate the accuracy on $d_{t}$ ;
finished.
Calculate the accuracy on the whole test dataset $D_{t}$ .

Algorithm 2 The inference process of our MAPLE.

4. Experiments

To show the effectiveness of our DestFormer and MAPLE, we first evaluate the supervised-only performance and computational efficiency of our DestFormer. Then we compare our MAPLE with the semi-supervised baseline algorithms and further combine these leading algorithms with our MAPLE to obtain superior classification performance. At last, we investigate the choices of masking rate, the depth of temporal decoder, and the irreplaceability of pseudo-label.

4.1. Dataset

Our experiments are performed on three main human action recognition datasets: MSR-Action3D (Li et al., 2010), NTU RGB+D 60 (Shahroudy et al., 2016), and NTU RGB+D 120 (Liu et al., 2020a).

The MSR-Action3D (Li et al., 2010) dataset was captured with a Kinect v1 depth camera and contains 567 videos and 23k frames in total (270 videos for training and 297 videos for testing). This dataset contains twenty actions, such as high arm wave and horizontal arm wave. For our semi-supervised point cloud action recognition, 7.5%, 15.0%, 22.5%, 30.0%, and 37.5% of the training videos of each action are selected for the labeled dataset $D_{l}$ and the rest for the unlabeled dataset $D_{u}$ . More detailed information is available in the supplementary material.

NTU RGB+D 60 (Shahroudy et al., 2016) is a large dataset that was captured with Kinect v2 depth camera. It consists of 56K videos and 4M frames captured from 80 views and with 40 performers. Sixty action categories and two types of evaluation (i.e. cross-subject and cross-view) are defined in this dataset. In this paper, we evaluate our model with a cross-subject setting. For our semi-supervised point cloud action recognition task, 5%, 10%, 20%, 30%, and 40% of training videos of each action are selected for the labeled dataset $D_{l}$ .

NTU RGB+D 120 (Liu et al., 2020a) is an extension of NTU RGB+D 60 and the largest dataset for human action recognition. It consists of 114K videos and 8M frames captured from 155 views and with 106 performers. The dataset captured by Kinect v2 depth camera has the modalities of RGB, Depth, 3DJoints, and IR. One hundred and twenty action categories and two types of evaluation (i.e. cross-subject and cross-setup) are defined on this dataset. To harmonize with the above dataset, we still evaluate our model with a cross-subject setting and select the same percentage of the labeled dataset $D_{l}$ as NTU RGB+D 60.
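The per-action selection of labeled videos used for all three datasets can be sketched as a class-wise split. The exact sampling procedure is given in the supplementary material and is not reproduced here, so the helper `split_labeled_unlabeled`, the random shuffling, the rounding rule, and the minimum of one labeled video per action are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_labeled_unlabeled(labels, ratio, rng):
    """Pick about `ratio` of the videos of each action as labeled data.

    labels: list where labels[i] is the action class of training video i
    ratio:  labeled fraction per action, e.g. 0.075 or 0.30
    rng:    a random.Random instance for reproducible sampling
    """
    by_class = defaultdict(list)
    for index, action in enumerate(labels):
        by_class[action].append(index)

    labeled, unlabeled = [], []
    for action in sorted(by_class):
        indices = by_class[action][:]
        rng.shuffle(indices)
        # Assumption: round to the nearest count, keep at least one video.
        keep = max(1, round(len(indices) * ratio))
        labeled.extend(indices[:keep])
        unlabeled.extend(indices[keep:])
    return sorted(labeled), sorted(unlabeled)


# Hypothetical toy set: 20 actions with 10 training videos each.
toy_labels = [action for action in range(20) for _ in range(10)]
d_l, d_u = split_labeled_unlabeled(toy_labels, 0.30, random.Random(0))
```

On the toy set this selects 3 labeled videos per action, leaving the remaining 7 per action in the unlabeled subset.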

4.2. Implementation Details and Approaches

This subsection presents the training hyperparameters and implementation details of our DestFormer and MAPLE.

Network Structure. The DestFormer $f (\cdot)$ is introduced in Section 3.2. By default, the spatial scaling rate $κ$ of P4Conv is set to 2 and the temporal scaling rate $ζ$ is set to 32. The spatial transformer is designed with 4 self-attention blocks and the temporal encoder is designed with only 3 self-attention blocks. Each block of the spatial transformer and the temporal encoder contains 8 heads. As for the Temporal Decoder used in MAPLE, it consists of 8 self-attention blocks to strengthen its ability for reconstruction.

MAPLE Training. Throughout training, the basic learning rate $η$ is set to 0.01. A warm-up strategy is used for the first 10 epochs with the initial $η = 10^{- 6}$ , and the learning rate $η$ is decreased to 0.001 for the final 5 epochs. The mini-batch size for both $D_{l}$ and $D_{u}$ is set to 14 for the MSR-Action3D dataset, and 32 for NTU RGB+D 60 and NTU RGB+D 120. The masking ratio is set to 75% for all datasets. In Stage 1 (Pre-training), the DestFormer is trained on the labeled dataset of MSR-Action3D, NTU RGB+D 60, and NTU RGB+D 120 for 40, 20, and 20 epochs, respectively. In Stage 2 (Training of MAPLE), we set the positive scalar weight $α$ to 0.5 for MSR-Action3D, and 0.2 for NTU RGB+D 60 and NTU RGB+D 120.
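The learning-rate schedule described above can be sketched as a simple function of the epoch index. The shape of the warm-up is not specified in the text, so the linear ramp below is an assumption, as are the function name `learning_rate` and the choice to hold the basic rate constant between the warm-up and the final epochs.

```python
def learning_rate(epoch, total_epochs, base_lr=0.01, warmup_epochs=10,
                  warmup_start=1e-6, final_epochs=5, final_lr=0.001):
    """Return the learning rate for a zero-based epoch index.

    Assumed shape: linear warm-up from warmup_start to base_lr, a constant
    phase at base_lr, and final_lr for the last final_epochs epochs.
    """
    if epoch >= total_epochs - final_epochs:
        return final_lr
    if epoch < warmup_epochs:
        fraction = epoch / warmup_epochs
        return warmup_start + fraction * (base_lr - warmup_start)
    return base_lr


# Schedule for a 40-epoch run (the MSR-Action3D pre-training length).
schedule = [learning_rate(epoch, 40) for epoch in range(40)]
```

For a 40-epoch run this yields 10 warm-up epochs, 25 epochs at 0.01, and 5 final epochs at 0.001.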

Dataset	Backbone	7.5% labeled	15.0% labeled	22.5% labeled	30.0% labeled	37.5% labeled
MSR3D (Li et al., 2010)	P4Transformer	61.95	77.10	80.47	83.16	85.85
MSR3D (Li et al., 2010)	DestFormer	62.96	77.44	81.14	83.84	86.53

Dataset	Backbone	5% labeled	10% labeled	20% labeled	30% labeled	40% labeled
NTU60 (Shahroudy et al., 2016)	P4Transformer	45.21	57.20	68.41	73.98	77.26
NTU60 (Shahroudy et al., 2016)	DestFormer	46.80	59.63	70.03	74.98	78.16
NTU120 (Liu et al., 2020a)	P4Transformer	30.38	40.34	48.66	53.28	56.94
NTU120 (Liu et al., 2020a)	DestFormer	36.09	47.75	58.05	62.56	65.31

Table 1. Comparison of the supervised-only action recognition accuracy (%) between P4Transformer (Fan et al., 2021) and our DestFormer on three benchmark datasets for different ratios of labeled data.

Backbone	Depth	GFLOPs	Inference Time (ms)
P4Transformer (Fan et al., 2021)	5	85.6	865
DestFormer	4+3	60.5	665

Table 2. Comparison of the computational cost (GFLOPs) and average inference time (ms) between P4Transformer (Fan et al., 2021) and our DestFormer.
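As a small worked example, the relative savings implied by Table 2 can be computed directly from its entries. The helper name `relative_reduction` is hypothetical; the four input numbers are the ones listed in the table.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline


# Values from Table 2: P4Transformer vs. DestFormer.
gflops_saving = relative_reduction(85.6, 60.5)
time_saving = relative_reduction(865.0, 665.0)

print(f"GFLOPs reduction: {gflops_saving * 100:.1f}%")
print(f"Inference time reduction: {time_saving * 100:.1f}%")
```

This gives a reduction of about 29.3% in GFLOPs and about 23.1% in average inference time.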

Compared Approaches. In the following section 4.4, we use leading semi-supervised learning algorithms that have proven to be generally effective as our compared approaches:

1) Supervised-only. We train the model only with the labeled dataset $D_{l}$ . The performance of the best-trained model is used as the lower bound for semi-supervised learning.

2) Pseudo Labels (Lee and others, 2013). The main idea is to further train the model with the pseudo hard labels of unlabeled data. This algorithm can be summarized in two steps as follows. First, we get the pre-training model with the supervised-only method, then we predict the pseudo hard labels of unlabeled data. Finally, the model can be retrained with these hard labels.

3) Virtual Adversarial Training (VAT) (Miyato et al., 2019). It is inspired by adversarial learning and its regularization only needs unlabeled data. In the training process, it first adds small adversarial perturbation $ϵ_{v a t}$ to the unlabeled data for changing the final prediction, and then forces the model $f_{θ} (\cdot)$ against this type of perturbation with the following consistency loss:

$$L_{vat} = \frac{1}{|D_{u}|} \sum_{x_{i} \in D_{u}} KL\big(f_{θ}(x_{i}) \,\|\, f_{θ}(x_{i} + \Delta x_{i})\big), \tag{3}$$

$$\text{where } \Delta x_{i} = \arg\max_{δ \;\text{s.t.}\; \|δ\|_{2} = ϵ_{vat}} KL\big(f_{θ}(x_{i}) \,\|\, f_{θ}(x_{i} + δ)\big). \tag{4}$$

4) Conditional Entropy Minimization (EntMin) (Grandvalet and Bengio, 2004). This approach encourages the model to output confident pseudo-labels $y$ for unlabeled input data. In other words, predictions $y$ close to a one-hot vector are encouraged. The conditional entropy minimization loss can be defined as:

$$L_{entmin} = \frac{1}{|D_{u}|} \sum_{x_{i} \in D_{u}} \sum_{y \in Y} - f_{θ}(y | x_{i}) \log f_{θ}(y | x_{i}). \tag{5}$$

Note that EntMin is rarely used alone for semi-supervised learning because the model can simply increase the weights of the classification head to generate a confident prediction. It is usually combined with the VAT loss, i.e. $L_{u} = α_{v a t} L_{v a t} + α_{e n t m i n} L_{e n t m i n}$ , where $α_{v a t}$ and $α_{e n t m i n}$ are the positive scalar weights for the VAT and EntMin losses, respectively.
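As a rough illustration of how the VAT and EntMin terms combine, the sketch below evaluates Eqs. (3) to (5) on a toy problem. Everything concrete in it is an assumption made for illustration: `model` is a tiny linear softmax classifier standing in for $f_{θ}$, the inner maximization of Eq. (4) is approximated by random search over the sphere $\|δ\|_{2} = ϵ_{vat}$ (practical VAT implementations typically use a gradient-based power iteration instead), and the weights `a_vat` and `a_entmin` are placeholders.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def model(x, weights):
    """Toy stand-in for f_theta: a linear layer followed by softmax."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy of one prediction, the inner sum of Eq. (5)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def worst_case_kl(x, weights, eps, trials=200, seed=0):
    """Approximate Eq. (4) by random search on the sphere ||delta||_2 = eps."""
    rng = random.Random(seed)
    p = model(x, weights)
    best = 0.0
    for _ in range(trials):
        d = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(v * v for v in d))
        shifted = [a + eps * b / norm for a, b in zip(x, d)]
        best = max(best, kl(p, model(shifted, weights)))
    return best


def unsupervised_loss(batch, weights, eps=0.1, a_vat=1.0, a_entmin=1.0):
    """Return (L_u, L_vat, L_entmin) averaged over an unlabeled batch."""
    l_vat = sum(worst_case_kl(x, weights, eps) for x in batch) / len(batch)
    l_entmin = sum(entropy(model(x, weights)) for x in batch) / len(batch)
    return a_vat * l_vat + a_entmin * l_entmin, l_vat, l_entmin


weights = [[1.0, -1.0], [-0.5, 2.0], [0.3, 0.3]]  # hypothetical 3-class model
batch = [[0.2, 0.1], [1.0, -0.4]]                 # hypothetical unlabeled inputs
l_u, l_vat, l_entmin = unsupervised_loss(batch, weights)
```

The entropy term is bounded above by $\log |Y|$ (the uniform prediction) and reaches zero only for one-hot predictions, which is why it rewards confident outputs.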

4.3. Supervised-only Performance

To demonstrate the validity of our spatial-temporal backbone, this subsection compares the action recognition performance and computational efficiency of our DestFormer and the P4Transformer model (Fan et al., 2021) in the supervised-only setting on the MSR-Action3D, NTU RGB+D 60, and NTU RGB+D 120 datasets. The action recognition accuracy (%) of each backbone on the three benchmark datasets is listed in Table 1. The time complexity (GFLOPs) and average inference time (ms) per point cloud video are listed in Table 2.

In Table 1, we observe that our DestFormer depends less on annotations and achieves better classification performance in the supervised-only setting. In particular, on the NTU RGB+D 120 dataset, DestFormer improves action recognition accuracy by more than 5.7 percentage points at every labeling ratio.

In Table 2, we notice that our DestFormer is more computationally efficient: it reduces the time complexity (GFLOPs) by about 30% and the inference time (ms) by about 23%.
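The two percentages follow directly from Table 2. A short check of the arithmetic (the helper name `relative_reduction` is ours):

```python
def relative_reduction(baseline, ours):
    """Percentage decrease of `ours` relative to `baseline`."""
    return 100.0 * (baseline - ours) / baseline


gflops_drop = relative_reduction(85.6, 60.5)  # GFLOPs, P4Transformer vs. DestFormer
time_drop = relative_reduction(865, 665)      # inference time in ms

print(f"GFLOPs: -{gflops_drop:.1f}%, inference time: -{time_drop:.1f}%")
# prints "GFLOPs: -29.3%, inference time: -23.1%"
```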

4.4. Evaluation of Semi-supervised Methods

In this section, we first evaluate our MAPLE method by comparing it with leading semi-supervised methods (e.g. Pseudo Label, VAT, and EntMin) for semi-supervised point cloud action recognition on three mainstream datasets. We then combine our MAPLE method with those methods (VAT+EntMin+MAPLE) and obtain better performance for semi-supervised point cloud action recognition. Specifically, we use $L_{v a t}$ and $L_{e n t m i n}$ as the unsupervised loss function $L_{u}$ in the early training stage until the model has almost stabilized, and then adopt $L_{m a p l e}$ as the unsupervised loss function. Please refer to the supplementary materials for the detailed training process of “VAT+EntMin+MAPLE”.

Method	Ratio of Labeled Data
Method	7.5%	15.0%	22.5%	30.0%	37.5%
supervised-only	62.96	77.44	81.14	83.84	86.53
Pseudo Label (Lee and others, 2013)	68.01	80.64	81.65	85.19	88.05
VAT (Miyato et al., 2019)	66.92	80.47	81.14	85.19	86.53
VAT + EntMin (Grandvalet and Bengio, 2004)	67.24	81.48	83.84	85.94	87.29
MAPLE (Ours)	72.04	82.15	84.85	87.04	89.40
VAT+EntMin+MAPLE (Ours)	76.09	84.85	86.20	87.21	89.56

Table 3. Comparison of the results on MSR-Action3D.

Method	Ratio of Labeled Data
Method	5%	10%	20%	30%	40%
supervised-only	46.80	59.63	70.03	74.98	78.16
Pseudo Label (Lee and others, 2013)	47.24	61.96	72.14	76.74	79.15
VAT (Miyato et al., 2019)	46.80	59.95	70.92	75.77	78.47
VAT + EntMin (Grandvalet and Bengio, 2004)	47.07	62.20	72.59	77.25	79.33
MAPLE (Ours)	48.78	60.61	71.05	75.72	78.61
VAT+EntMin+MAPLE (Ours)	50.63	62.98	73.01	77.57	79.96

Table 4. Comparison of the results on NTU RGB+D 60.

Figure 4. The accuracy of classification on NTU RGB+D 60 5% labeled dataset with different masking ratios. The 75% masking ratio of reconstruction achieves peak accuracy.

Figure 5. The t-SNE visualization of different approaches on the MSR-Action3D dataset. The squares with the black border indicate the labeled data, and other dots indicate the unlabeled ones. Note that different colors denote different classes.

The results on the MSR-Action3D and NTU RGB+D 60 datasets are shown in Tables 3 and 4, respectively. The results on NTU RGB+D 120 are shown in the supplementary material. From these results, we find that our MAPLE method is effective for semi-supervised learning and slightly outperforms the previous methods under most settings. We also observe that MAPLE can be combined with other leading semi-supervised methods to obtain further performance gains under each setting. In particular, on the 7.5% labeled MSR-Action3D setting, the combination brings a significant improvement in action recognition accuracy (+8.08%, i.e. 76.09% versus 68.01% for Pseudo Label, the best previous method).

4.5. Ablation Study and Visualization

We investigate the effectiveness of our proposed MAPLE method on benchmark datasets in this section. We first analyze the influence of the masking ratio and the depth of the temporal decoder, then illustrate the importance of reconstruction with pseudo-label during the training process. At last, we show the feature distributions of each method to prove the effectiveness of our MAPLE method.

Masking ratio. Fig. 4 shows the classification accuracy on the 5% labeled NTU RGB+D 60 dataset with different masking ratios. A high masking ratio (75%) achieves the peak classification performance, which matches the masking ratio used by Image MAE (He et al., 2021). This contrasts with natural language processing, where the best masking ratio is typically 15%. This behavior supports our hypothesis that most frames of an action sequence are redundant and that the complete action sequence can be reconstructed from a small remaining part of the sequence.

Depth of Temporal Decoder. We investigate the effect of the temporal decoder depth on the final classification performance using the 5% labeled NTU RGB+D 60 dataset. For more detail, please refer to the supplementary materials.
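A minimal sketch of frame-level masking at a 75% ratio is given below. It assumes that masked frames are chosen uniformly at random without replacement; the function name `sample_frame_mask` and the 24-frame clip length are hypothetical and not taken from the paper.

```python
import random


def sample_frame_mask(num_frames, mask_ratio=0.75, rng=None):
    """Split frame indices into (visible, masked) for one clip."""
    rng = rng or random.Random()
    num_masked = int(round(num_frames * mask_ratio))
    # Uniform sampling without replacement (assumption).
    masked = sorted(rng.sample(range(num_frames), num_masked))
    masked_set = set(masked)
    visible = [i for i in range(num_frames) if i not in masked_set]
    return visible, masked


# Example: a hypothetical 24-frame clip keeps 6 frames and masks 18.
visible, masked = sample_frame_mask(24, rng=random.Random(0))
```

Only the features of the `visible` frames would be passed to the encoder; the decoder is then asked to reconstruct the features at the `masked` positions.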

Reconstruction with Pseudo-Label. To demonstrate the importance of implicit reconstruction through pseudo-label, we show the results with and without pseudo-label on the MSR-Action3D dataset in the supplementary materials.

The t-SNE visualization. To further explore the mechanism of MAPLE, we visualize the feature distributions of the labeled and unlabeled video sequences of the MSR-Action3D training dataset with t-SNE in Fig. 5. From the t-SNE visualization, we find that the model trained with only labeled action sequences often struggles to form clear decision boundaries for the unlabeled action sequences. Although the benchmark methods (Pseudo Label, VAT, and EntMin) yield better distributions, some outliers remain that make the decision boundaries ambiguous. Compared to previous approaches, our MAPLE and “VAT+EntMin+MAPLE” form tighter data clusters and clearer decision boundaries, which benefit from semi-supervised learning with our masked autoencoder. To summarize, the visualization shows that combining semi-supervised learning with the masked pseudo-labeling autoencoder makes it possible to learn numerous action concepts from unlabeled point cloud videos and improves the performance of action recognition.

5. Conclusion

In this paper, we present a Masked Pseudo-Labeling autoEncoder (MAPLE) framework with an effective Decoupled spatial-temporal TransFormer (DestFormer) backbone to learn discriminative representations with much fewer annotations for the semi-supervised point cloud action recognition task. The MAPLE framework exploits the reconstruction of masked features from the available frames to learn numerous action concepts from unlabeled action sequences. Moreover, we combine our MAPLE with classical semi-supervised methods to learn more generalizable features and establish state-of-the-art performance on the semi-supervised point cloud action recognition task. We hope that our MAPLE framework can inspire future research on autoencoders for point cloud sequences.

Acknowledgements.

This research was supported by the National Key R&D Program of China under Grant No. 2020AAA0103800.

References

L. Beyer, X. Zhai, A. Oliver, and A. Kolesnikov (2019) S4L: self-supervised semi-supervised learning. In ICCV, pp. 1476–1485. Cited by: §1.
O. Chapelle, B. Schölkopf, and A. Zien (Eds.) (2006) Semi-supervised learning. Cited by: §2.
T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton (2020a) A simple framework for contrastive learning of visual representations. In ICML, pp. 1597–1607. Cited by: §2.
X. Chen, H. Fan, R. B. Girshick, and K. He (2020b) Improved baselines with momentum contrastive learning. CoRR. Cited by: §1, §2.
J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pp. 4171–4186. Cited by: §2.
H. Fan, Y. Yang, and M. S. Kankanhalli (2021) Point 4d transformer networks for spatio-temporal modeling in point cloud videos. In CVPR, pp. 14204–14213. Cited by: Figure 2, §1, §2, §3.2, §3.2, §3.2, §4.3, Table 1, Table 2.
Y. Grandvalet and Y. Bengio (2004) Semi-supervised learning by entropy minimization. In NeurIPS, pp. 529–536. Cited by: Figure 1, §1, §2, §4.2, Table 3, Table 4.
J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Á. Pires, Z. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko (2020) Bootstrap your own latent - A new approach to self-supervised learning. In NeurIPS, pp. 21271–21284. Cited by: §2.
K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. B. Girshick (2021) Masked autoencoders are scalable vision learners. CoRR. Cited by: §1, §2, §2, §3.3, §4.5.
K. He, H. Fan, Y. Wu, S. Xie, and R. B. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, pp. 9726–9735. Cited by: §1, §2.
G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR. Cited by: §1.
T. Jebara, J. Wang, and S. Chang (2009) Graph construction and b-matching for semi-supervised learning. In ICML, pp. 441–448. Cited by: §2.
D. Lee et al. (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, pp. 896. Cited by: Figure 1, §1, §2, §4.2, Table 3, Table 4.
W. Li, Z. Zhang, and Z. Liu (2010) Action recognition based on a bag of 3d points. In CVPR Workshops, pp. 9–14. Cited by: §1, §1, §3.1, §4.1, §4.1, Table 1.
J. Liu, A. Shahroudy, M. Perez, G. Wang, L. Duan, and A. C. Kot (2020a) NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell., pp. 2684–2701. Cited by: §1, §3.1, §4.1, §4.1, Table 1.
K. Liu, W. Liu, C. Gan, M. Tan, and H. Ma (2018) T-C3D: temporal convolutional 3d network for real-time action recognition. In AAAI, pp. 7138–7145. Cited by: §1.
W. Liu, J. He, and S. Chang (2010) Large graph construction for scalable semi-supervised learning. In ICML, pp. 679–686. Cited by: §2.
W. Liu, J. Wang, and S. Chang (2012) Robust and scalable graph-based semisupervised learning. Proc. IEEE, pp. 2624–2638. Cited by: §2.
W. Liu, Q. Bao, Y. Sun, and M. Tao (2022) Recent advances of monocular 2d and 3d human pose estimation: a deep learning perspective. ACM Computing Surveys (CSUR). Cited by: §1.
X. Liu, F. Zhang, Z. Hou, Z. Wang, L. Mian, J. Zhang, and J. Tang (2020b) Self-supervised learning: generative or contrastive. CoRR. Cited by: §2.
X. Liu, M. Yan, and J. Bohg (2019) MeteorNet: deep learning on dynamic 3d point cloud sequences. In ICCV, pp. 9245–9254. Cited by: §1, §1, §2.
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In ICCV, pp. 9992–10002. Cited by: §1.
R. Martin-Martin, M. Patel, H. Rezatofighi, A. Shenoi, J. Gwak, E. Frankel, A. Sadeghian, and S. Savarese (2021) JRDB: a dataset and benchmark of egocentric robot visual perception of humans in built environments. TPAMI. Cited by: §1.
T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2019) Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., pp. 1979–1993. Cited by: Figure 1, §1, §2, §4.2, Table 3, Table 4.
C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017a) PointNet: deep learning on point sets for 3d classification and segmentation. In CVPR, pp. 77–85. Cited by: §1, §1, §1, §2.
C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017b) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, pp. 5099–5108. Cited by: §1, §2, §3.2.
A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. CoRR. Cited by: §2.
A. Razavi, A. van den Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with VQ-VAE-2. In NeurIPS, pp. 14837–14847. Cited by: §2.
A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) NTU RGB+D: A large scale dataset for 3d human activity analysis. In CVPR, pp. 1010–1019. Cited by: §1, §3.1, §4.1, §4.1, Table 1.
R. Sheikhpour, M. A. Sarram, S. Gharaghani, and M. A. Z. Chahooki (2017) A survey on semi-supervised feature selection methods. Pattern Recognit., pp. 141–158. Cited by: §2.
A. Subramanya and P. P. Talukdar (2014) Graph-based semi-supervised learning. Morgan & Claypool Publishers. Cited by: §2.
Y. Sun, W. Liu, Q. Bao, Y. Fu, T. Mei, and M. J. Black (2022) Putting people in their place: monocular regression of 3d people in depth. pp. 13243–13252. Cited by: §1.
I. Triguero, S. García, and F. Herrera (2015) Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst., pp. 245–284. Cited by: §2.
A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu (2016) Pixel recurrent neural networks. In ICML, pp. 1747–1756. Cited by: §2.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, pp. 5998–6008. Cited by: §1, §3.2.
Y. Wang, Y. Xiao, F. Xiong, W. Jiang, Z. Cao, J. T. Zhou, and J. Yuan (2020) 3DV: 3d dynamic voxel for action recognition in depth video. In CVPR, pp. 508–517. Cited by: §1, §2.
Y. Wei, H. Liu, T. Xie, Q. Ke, and Y. Guo (2022) Spatial-temporal transformer for 3d point cloud sequences. In WACV, pp. 657–666. Cited by: §2.
Z. X. (Ed.) (2008) Semi-supervised learning literature survey: department of computer sciences. Cited by: §2.
Y. Zhao, G. Wang, C. Tang, C. Luo, W. Zeng, and Z. Zha (2021) A battle of network structures: an empirical study of cnn, transformer, and MLP. CoRR. Cited by: §1.
J. Zheng, X. Liu, W. Liu, L. He, C. Yan, and T. Mei (2022) Gait recognition in the wild with dense 3d representations and a benchmark. In CVPR, pp. 20228–20237. Cited by: §1.

Appendix A Additional Experimental Results

In this supplementary material, we show more details about the semi-supervised datasets and our MAPLE algorithm.

Dataset	Division	Ratio of Labeled Data
Dataset	Division	7.5%	15.0%	22.5%	30.0%	37.5%
MSR3D	Labeled	20	40	60	80	100
MSR3D	Unlabeled	250	230	210	190	170
Dataset	Division	Ratio of Labeled Data
Dataset	Division	5.0%	10.0%	20.0%	30.0%	40.0%
NTU60	Labeled	1980	3960	7920	11880	15840
NTU60	Unlabeled	38340	36360	32400	28440	24480
NTU120	Labeled	3120	6358	12416	17846	22810
NTU120	Unlabeled	60240	57002	50944	45514	40550

Table 5. The division of the semi-supervised datasets.

Method	Ratio of Labeled Data
Method	5%	10%	20%	30%	40%
Supervised-only	36.09	47.75	58.05	62.56	65.31
Pseudo Label	36.18	48.25	58.33	62.85	65.55
VAT	35.90	47.74	58.29	62.56	65.75
VAT + EntMin	36.02	48.20	58.42	62.70	66.88
MAPLE (Ours)	37.15	48.56	58.59	63.18	65.84
VAT+EntMin+MAPLE (Ours)	36.91	48.80	59.25	64.02	67.08

Table 6. Comparison of the results on the NTU120 dataset.

Method	Ratio of Labeled Data
Method	7.5%	15.0%	22.5%	30.0%	37.5%
MAPLE w/o pseudo-label	69.19	81.32	84.51	86.87	86.53
MAPLE with pseudo-label	72.04	82.15	84.85	87.04	89.40

Table 7. The results of MAPLE with and w/o pseudo-label on the MSR-Action3D dataset.

Stage 1: Pre-training with labeled dataset $D_{l}$

Initialization: the network parameters of DestFormer $θ$; basic learning rate $η$; the labeled batch size $b_{l}$; the supervised cross-entropy loss $L_{l}$

repeat

t = 1 … max iteration num:

fetch mini-batch $d_{l}$ from $D_{l}$;

compute loss $L_{l}$ on $d_{l}$;

update $θ^{t} = θ^{t - 1} - η \nabla L_{l}$

until stable accuracy and loss in the validation set.

Stage 2: Training of VAT+EntMin+MAPLE with unlabeled dataset $D_{u}$

Initialization: positive scalar weights $α_{vat}$, $α_{entmin}$, and $α_{maple}$ for the unsupervised losses $L_{vat}$, $L_{entmin}$, and $L_{maple}$, respectively; unlabeled batch size $b_{u}$, where $b_{u} \geq b_{l}$

first repeat

t = 1 … max iteration num:

fetch mini-batch $d_{l}$ from $D_{l}$ and $d_{u}$ from $D_{u}$;

compute loss $L = L_{l} + α_{vat} L_{vat} + α_{entmin} L_{entmin}$ on $d_{l}$ and $d_{u}$;

update $θ^{t} = θ^{t - 1} - η \nabla L$

until stable accuracy and loss in the validation set.

second repeat

t = 1 … max iteration num:

fetch mini-batch $d_{l}$ from $D_{l}$ and $d_{u}$ from $D_{u}$;

compute loss $L = L_{l} + α_{maple} L_{maple}$ on $d_{l}$ and $d_{u}$;

update $θ^{t} = θ^{t - 1} - η \nabla L$

until stable accuracy and loss in the validation set.

Algorithm 3 The training process of our VAT+EntMin+MAPLE.

a.1. Division of Semi-supervised Datasets

Our experiments are conducted on three benchmark datasets: MSR-Action3D (MSR3D), NTU RGB+D 60 (NTU60), and NTU RGB+D 120 (NTU120). As shown in Table 5, we divide each training dataset into labeled training dataset $D_{l}$ and unlabeled training dataset $D_{u}$ . As an illustration, we select 33 videos for each class (1980 videos in total) from NTU RGB+D 60 as the 5% labeled training dataset.
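The per-class selection described above can be sketched as follows. This is an illustration only: the sample list is synthetic (60 classes with 672 videos each, chosen so the totals match the NTU60 row of Table 5), the random per-class draw is an assumption about how labeled videos are picked, and `split_labeled_unlabeled` is a hypothetical helper.

```python
import random
from collections import defaultdict


def split_labeled_unlabeled(samples, per_class, rng=None):
    """Pick `per_class` labeled videos per class; the rest become unlabeled.

    `samples` is a list of (video_id, class_id) pairs.
    """
    rng = rng or random.Random(0)
    by_class = defaultdict(list)
    for video_id, class_id in samples:
        by_class[class_id].append(video_id)

    labeled, unlabeled = [], []
    for class_id, ids in sorted(by_class.items()):
        ids = ids[:]          # copy so the caller's data is not shuffled
        rng.shuffle(ids)
        labeled += [(v, class_id) for v in ids[:per_class]]
        unlabeled += ids[per_class:]  # labels are discarded for D_u
    return labeled, unlabeled


# Synthetic stand-in for the NTU RGB+D 60 training set (assumption).
samples = [(f"c{c:02d}_v{i:03d}", c) for c in range(60) for i in range(672)]
labeled, unlabeled = split_labeled_unlabeled(samples, per_class=33)
print(len(labeled), len(unlabeled))  # prints "1980 38340"
```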

a.2. Training Process of VAT+EntMin+MAPLE

In this subsection, we describe the detailed training process of “VAT+EntMin+MAPLE”. As shown in Algorithm 3, we first pre-train our model with the labeled dataset $D_{l}$ to obtain better initialization parameters. Then we adopt $α_{v a t} L_{v a t} + α_{e n t m i n} L_{e n t m i n}$ as the unsupervised loss on the unlabeled dataset $D_{u}$ until the model achieves stable accuracy (approximately 10 to 15 epochs for this step). Finally, we use $L_{m a p l e}$ as our unsupervised loss function and continue training until the model has almost converged.
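The three phases differ only in which unsupervised terms enter the objective. A minimal sketch of that switch is shown below; the stage names and the function `total_loss` are hypothetical, the default `a_maple=0.5` is the MSR-Action3D value of $α$ from the training details, and the defaults for `a_vat` and `a_entmin` are placeholders because the paper does not state them here.

```python
def total_loss(stage, l_sup, l_vat=0.0, l_entmin=0.0, l_maple=0.0,
               a_vat=1.0, a_entmin=1.0, a_maple=0.5):
    """Combine per-batch loss values according to the training stage."""
    if stage == "pretrain":
        # Stage 1: supervised cross-entropy on D_l only.
        return l_sup
    if stage == "vat_entmin":
        # Stage 2, first loop: L = L_l + a_vat * L_vat + a_entmin * L_entmin.
        return l_sup + a_vat * l_vat + a_entmin * l_entmin
    if stage == "maple":
        # Stage 2, second loop: L = L_l + a_maple * L_maple.
        return l_sup + a_maple * l_maple
    raise ValueError(f"unknown stage: {stage}")


# Example with made-up loss values for one mini-batch.
example = total_loss("maple", l_sup=1.0, l_maple=4.0)
```

Switching stages is driven by validation accuracy and loss stabilizing, as stated in Algorithm 3, so a real training loop would need a stopping criterion that this sketch leaves out.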

a.3. Results on NTU RGB+D 120 Dataset

In this subsection, we additionally evaluate our MAPLE by comparing it with leading semi-supervised methods on the NTU RGB+D 120 dataset and show the results in Table 6. From the table, we observe that previous semi-supervised methods (e.g. Pseudo Label, VAT, and EntMin) are only slightly effective for semi-supervised learning on this dataset, the largest for human action recognition. Our method outperforms the previous methods by up to about 1.0% classification accuracy under most settings.

a.4. Reconstruction with Pseudo-Label

To demonstrate the importance of implicit reconstruction through pseudo-labels, we show the $L 2$ Norm of the reconstructed feature $r_{i}$ with and without pseudo-labels on the 5% labeled MSR-Action3D dataset in Fig. 6 and compare the results of MAPLE under each setting in Table 7. The loss function without pseudo-labels can be defined with the Mean-Squared Error (MSE) loss as follows:

$$L_{u} = L_{maple} = \frac{1}{|D_{u}|} \sum_{x_{i} \in D_{u}} MSE\big(f(g_{i} | x_{i}) \,\|\, f(r_{i} | x_{i})\big), \tag{6}$$

where $M S E$ is the MSE loss function, $| D_{u} |$ is the size of the unlabeled dataset, $g_{i}$ is the original short-term global feature, and $r_{i}$ is the reconstructed short-term global feature. Note that the exploding and vanishing problem discussed here is not the same as gradient exploding and gradient vanishing; it refers to the change in the magnitude of the features under each training strategy.

Action Accuracy (%)	high arm wave	horizontal arm wave	hammer	hand catch	forward punch
supervised-only	6.67	86.67	0.00	33.33	91.67
VAT+EntMin+MAPLE (Ours)	93.33	86.67	0.00	26.67	21.43
Action Accuracy (%)	high throw	draw x	draw tick	draw circle	hand clap
supervised-only	21.43	76.92	100.00	6.67	84.62
VAT+EntMin+MAPLE (Ours)	28.57	42.86	100.00	73.33	100.00
Action Accuracy (%)	two hand wave	side-boxing	bend	forward kick	side kick
supervised-only	100.00	35.71	20.00	100.00	100.00
VAT+EntMin+MAPLE (Ours)	100.00	100.00	80.00	100.00	100.00
Action Accuracy (%)	jogging	tennis swing	tennis serve	golf swing	pick up throw
supervised-only	100.00	86.67	93.33	46.67	66.67
VAT+EntMin+MAPLE (Ours)	100.00	93.33	100.00	73.33	93.33

Table 8. More details about the improvement of per class accuracy on the 7.5% labeled MSR-Action3D dataset.

Figure 6. The $L 2$ Norm of the reconstructed feature $r_{i}$ with and w/o pseudo-label in two common situations: (a) Exploding. (b) Vanishing. Note that the blue baseline denotes the average $L 2$ Norm of the original feature $g_{i}$ during the supervised-only training process.


From the figure and table, we observe that training without pseudo-labels leads to explosion or vanishing of the $L 2$ Norm of the reconstructed features, which not only causes a significant decrease in the final classification performance but also makes it hard for training to converge stably. The main reason can be seen from the loss function in Eq. (6) and the encoder-decoder structure of our MAPLE. The explosion situation (Fig. 6 (a)) happens when the detached $g_{i}$ (without backprop) and $r_{i}$ are both learnable features generated by our MAPLE and share the same spatial extractor. When $r_{i}$ grows to move closer to $g_{i}$ , the weights of the spatial extractor often increase, which in turn increases $g_{i}$ . After hundreds or thousands of iterations, the $L 2$ Norms of $g_{i}$ and $r_{i}$ become larger and larger, even tending to infinity. The vanishing situation (Fig. 6 (b)) happens when the original $g_{i}$ (with backprop) and $r_{i}$ are both learnable features generated by our MAPLE. The model tends to simply reduce the $L 2$ Norms of $g_{i}$ and $r_{i}$ in order to reduce the loss in Eq. (6). After thousands of iterations, the $L 2$ Norms of $g_{i}$ and $r_{i}$ become smaller and smaller, even tending to zero.
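The vanishing case can be made concrete with a small numerical illustration, which is ours rather than the authors': if gradients flow through both features, scaling $g_{i}$ and $r_{i}$ by a common factor $c$ scales the MSE by $c^{2}$, so the loss can be driven toward zero without the reconstruction becoming any more accurate in relative terms. The vectors below are made up.

```python
def mse(a, b):
    """Mean-squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


g = [1.0, -2.0, 0.5]   # hypothetical original feature g_i
r = [0.5, -1.0, 1.5]   # hypothetical reconstructed feature r_i

# Shrink both features by the same factor c and recompute the loss.
losses = {
    c: mse([c * v for v in g], [c * v for v in r])
    for c in (1.0, 0.1, 0.01)
}
for c, value in losses.items():
    print(f"scale {c}: MSE = {value:.6f}")
```

The loss falls by a factor of 100 for every tenfold shrink even though the relative error between the two vectors is unchanged. A target that cannot be rescaled, such as the pseudo-label used by MAPLE, removes this shortcut.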

a.5. Depth of Temporal Decoder

We investigate the effect of decoder depth on the final classification performance; the result on the 5% labeled NTU RGB+D 60 dataset is shown in Fig. 7. The figure shows that the depth of the temporal decoder does not have a large impact on the final classification performance. The main reason is that the decoder is only used to reconstruct the complete action sequence from the latent space and has little relevance to classification. Therefore, a shallow decoder is sufficient to reconstruct the complete sequence.

a.6. Details about per class accuracy

To find out which types of actions are easily classified and which are difficult, we show the per-class accuracy improvement after training with our MAPLE on the 7.5% labeled MSR-Action3D dataset in Table 8. Overall, it seems that simple repetitive actions, such as high arm wave (+86.67% accuracy), benefit greatly from our MAPLE, while some complex actions, such as draw x (-34.06% accuracy), do not benefit and can even degrade.