CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation

Yunyao Mao¹ (ORCID 0000-0002-9427-9086), Wengang Zhou¹˒²* (ORCID 0000-0003-1690-9836), Zhenbo Lu² (ORCID 0000-0002-0918-7524), Jiajun Deng¹ (ORCID 0000-0001-9624-7451), Houqiang Li¹˒²* (ORCID 0000-0003-2188-3028)

¹ CAS Key Laboratory of Technology in GIPAS, EEIS Department, University of Science and Technology of China

² Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

Emails: myy2016@mail.ustc.edu.cn, zhwg@ustc.edu.cn, luzhenbo@iai.ustc.edu.cn, dengjj@ustc.edu.cn, lihq@ustc.edu.cn

Abstract

In 3D action recognition, there exists rich complementary information between skeleton modalities. Nevertheless, how to model and utilize this information remains a challenging problem for self-supervised 3D action representation learning. In this work, we formulate the cross-modal interaction as a bidirectional knowledge distillation problem. Different from classic distillation solutions that transfer the knowledge of a fixed and pre-trained teacher to the student, in this work, the knowledge is continuously updated and bidirectionally distilled between modalities. To this end, we propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs. On the one hand, the neighboring similarity distribution is introduced to model the knowledge learned in each modality, where the relational information is naturally suitable for the contrastive frameworks. On the other hand, asymmetrical configurations are used for teacher and student to stabilize the distillation process and to transfer high-confidence information between modalities. By derivation, we find that the cross-modal positive mining in previous works can be regarded as a degenerated version of our CMD. We perform extensive experiments on NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD II datasets. Our approach outperforms existing self-supervised methods and sets a series of new records. The code is available at: https://github.com/maoyunyao/CMD

Keywords:

Self-supervised 3D action recognition, contrastive learning

* Corresponding authors: Wengang Zhou and Houqiang Li

1 Introduction

Human action recognition, one of the fundamental problems in computer vision, has a wide range of applications in many downstream tasks, such as behavior analysis, human-machine interaction, virtual reality, etc. Recently, with the advancement of human pose estimation algorithms [openpose, fang2017rmpe, xu2020deep], skeleton-based 3D human action recognition has attracted increasing attention for its light-weight and background-robust characteristics. However, fully-supervised 3D action recognition [Chen_2021_ICCV, du2015hierarchical, ke2017new, li2019actional, Li_2021_ICCV, liu2020disentangling, shi2019skeleton, Shi_2021_ICCV, si2019attention, zhang2019view, zhang2020semantics, zhang2020context] requires large amounts of well-annotated skeleton data for training, which is rather labor-intensive to acquire. In this paper, we focus on the self-supervised settings, aiming to avoid the laborious workload of manual annotation for 3D action representation learning.

To learn robust and discriminative representation, many celebrated pretexts like motion prediction, jigsaw puzzle recognition, and masked reconstruction have been extensively studied in early works [lin_ms2l_mm, misra2016shuffle, nie2020unsupervised, noroozi2016unsupervised, su2020predict, zheng2018unsupervised]. Recently, contrastive learning frameworks [chen2020simple, he2020momentum, van2018representation] have been introduced to the self-supervised 3D action recognition community [lin_ms2l_mm, rao2021augmented]. They achieve great success thanks to their capability of learning discriminative high-level semantic features. However, unsolved problems remain when applying contrastive learning to skeletons. On the one hand, the success of contrastive learning heavily relies on performing data augmentation [chen2020simple], but the skeletons from different videos are uniformly considered as negative samples. Given the limited action categories, it would be unreasonable to just ignore potential similar instances, since they may belong to the same category as the positive one. On the other hand, cross-modal interactive learning is largely overlooked in early contrastive learning-based attempts [lin_ms2l_mm, rao2021augmented], yet integrating multimodal information [cheng2020skeleton, SSVOD, transvgpp, liang2019three, 2sagcn2019cvpr, 9234715] is the key to improving the performance of 3D action recognition.

To tackle these problems, CrosSCLR [li20213d] turns to cross-modal positive mining (see Figure 1 (a)) and sample reweighting. Though effective, it suffers from the following limitations. Firstly, the positive sample mining requires reliable preliminary knowledge, so the representation in each modality needs to be optimized independently in advance, leading to a sophisticated two-stage training process. Secondly, the contrastive context, defined as the similarity between the positive query and negative embeddings, is treated as individual weights of samples in complementary modalities to participate in the optimization process. Such implicit knowledge exchange lacks a holistic grasp of the rich contextual information. Besides, the cross-modal consistency is also not explicitly guaranteed.

In this work, we go beyond heuristic positive sample mining and reformulate cross-modal interaction as a general bidirectional knowledge distillation [hinton2015distilling] problem. As shown in Figure 1 (b), in the proposed Cross-modal Mutual Distillation (CMD) framework, the neighboring similarity distribution is first extracted in each modality. It describes the relationship of the sample embedding with respect to its nearest neighbors in the customized feature space. Compared with individual features [romero2014fitnets] or logits [hinton2015distilling], such relational information is naturally suitable for modeling the knowledge learned with contrastive frameworks. Based on the relational information, bidirectional knowledge distillation between each two modalities is performed via explicit cross-modal consistency constraints. Since the representation in each skeleton modality is trained from scratch and there is no intuitive teacher-student relationship between modalities, embeddings from the momentum updated key encoder along with a smaller temperature are used for knowledge modeling on the teacher side, so as to stabilize the distillation process and highlight the high-confidence information in each modality.

Compared to previous works, the advantages of our approach are three-fold: i) Instead of heuristically reweighting training samples, the contextual information in contrastive learning is treated as a whole to model the modality-specific knowledge, explicitly ensuring the cross-modal consistency during distillation. ii) Unlike cross-modal positive sample mining, our approach does not heavily rely on the initial representation, thus is free of the sophisticated two-stage training. This largely benefits from the probabilistic knowledge modeling strategy. Moreover, the positive mining is also mathematically proved to be a special case of the proposed cross-modal distillation mechanism under extreme settings. iii) The proposed CMD is carefully designed to be well integrated into the existing contrastive framework with almost no extra computational overhead introduced.

We perform extensive experiments on three prevalent benchmark datasets: NTU RGB+D 60 [shahroudy2016ntu], NTU RGB+D 120 [liu2020ntu], and PKU-MMD II [liu2017pku]. Our approach achieves state-of-the-art results on all of them under all evaluation protocols. It’s worth noting that the proposed cross-modal mutual distillation is easily implemented in a few lines of code. We hope this simple yet effective approach will serve as a strong baseline for future research.

2 Related Work

Self-supervised Representation Learning: Self-supervised learning methods can be roughly divided into two categories: generative and contrastive [liu2021self]. Generative methods [ballard1987modular, he2022masked, van2017neural] try to reconstruct the original input to learn meaningful latent representation. Contrastive learning [chen2020simple, he2020momentum, van2018representation] aims to learn feature representation via instance discrimination. It pulls positive pairs closer and pushes negative pairs away. Since no labels are available during self-supervised contrastive learning, two different augmented versions of the same sample are treated as a positive pair, and samples from different instances are considered to be negative. In MoCo [he2020momentum] and MoCo v2 [chen2020improved], the negative samples are taken from previous batches and stored in a queue-based memory bank. In contrast, SimCLR [chen2020simple] and MoCo v3 [chen2021empirical] rely on a larger batch size to provide sufficient negative samples. Similar to the contrastive context in [li20213d], the neighboring similarity in this paper is defined as the normalized product between positive embedding and its neighboring anchors. Our goal is to transfer such modality-specific information between skeleton modalities to facilitate better contrastive 3D action representation learning.

Self-supervised 3D Action Recognition: Many previous works have been proposed to perform self-supervised 3D action representation learning. In LongT GAN [zheng2018unsupervised], an autoencoder-based model with an additional adversarial training strategy is proposed. Following the generative paradigm, it learns latent representation via sequential reconstruction. Similarly, P&C [su2020predict] trains an encoder-decoder network to both predict and cluster skeleton sequences. To learn features that are more robust and separable, the authors also propose strategies to weaken the decoder, laying more burdens on the encoder. Different from previously mentioned methods that merely adopt a single reconstruction task, MS $^{2}$ L [lin_ms2l_mm] integrates multiple pretext tasks to learn better representation. In recent attempts [li20213d, rao2021augmented, thoker2021skeleton, guo2022aimclr], momentum encoder-based contrastive learning is introduced and better performance is achieved. Among them, CrosSCLR [li20213d] is the first to perform cross-modal knowledge mining. It finds potential positives and re-weights training samples with the contrastive contexts from different skeleton modalities. However, the positive mining performed in CrosSCLR requires reliable initial representation, so two-stage training is indispensable. In contrast, this paper introduces a more general knowledge distillation mechanism to perform cross-modal information interaction. Besides, the positive mining performed in CrosSCLR can be regarded as a special case of our approach.

Similarity-based Knowledge Distillation: Pairwise similarity has been shown to be useful information in relational knowledge distillation [park2019relational, peng2019correlation, tung2019similarity]. In PKT [passalis2018learning], CompRess [abbasi2020compress], and SEED [fang2021seed], similarities of each sample with respect to a set of anchors are converted into a probability distribution, which models the structural information of the data. After that, knowledge distillation is performed by training the student to mimic the probability distribution of the teacher. Recently, contextual similarity information has also shown great potential in image retrieval [ouyang2021contextual, wu2022contextual] and representation learning [ALBEF, tejankar2021isd]. Our approach is partially inspired by these works. Differently, the cross-modal mutual distillation in our approach is designed to answer the question of how to transfer the biased knowledge between complementary modalities during 3D action pre-training.

3 Method

Figure 2: The overall pipeline of the proposed framework. It contains two modules, Single-modal Contrastive Learning (SCL) and Cross-modal Mutual Distillation (CMD). Given multiple skeleton modalities (*e.g.* joint and motion) as input, the SCL module performs self-supervised contrastive learning in each modality and the CMD module simultaneously transfers the learned knowledge between modalities. SCL and CMD work collaboratively so that each modality learns more comprehensive representation.

3.1 Framework Overview

Building on the idea of leveraging complementary information from cross-modal inputs to improve 3D action representation learning, we design the Cross-modal Mutual Distillation (CMD) framework. As shown in Figure 2, the proposed framework consists of two key components: Single-modal Contrastive Learning (SCL) and Cross-modal Mutual Distillation (CMD). Given multiple skeleton modalities (e.g. joint, motion, and bone) as input, SCL is applied to each of them to learn customized 3D action representation. Meanwhile, in CMD, the knowledge learned by SCL is modeled by the neighboring similarity distributions, which describe the relationship between the sample embedding and its nearest neighbors. Cross-modal knowledge distillation is then performed by bidirectionally minimizing the KL divergence between the distributions corresponding to each modality. SCL and CMD run synchronously and cooperatively so that each modality learns more comprehensive representation.

3.2 Single-modal Contrastive Learning

In this section, we revisit the single-modal contrastive learning as the preliminary of our approach, which has been widely adopted in many tasks like image/video recognition [imagenet2009A, kuehne2011hmdb, soomro2012ucf101] and correspondence learning [wang2021contrastive]. In self-supervised 3D action recognition, previous works like AS-CAL [rao2021augmented], CrosSCLR [li20213d], ISC [thoker2021skeleton], and AimCLR [guo2022aimclr] also take the contrastive method MoCo v2 [chen2020improved] as their baseline.

Given a single-modal skeleton sequence $x$ , we first perform data augmentation to obtain two different views $x_{q}$ and $x_{k}$ (query and key). Then, two encoders are adopted to map the positive pair $x_{q}$ and $x_{k}$ into feature embeddings $z_{q} = E_{q} (x_{q}, θ_{q})$ and $z_{k} = E_{k} (x_{k}, θ_{k})$ , where $E_{q}$ and $E_{k}$ denote query encoder and key encoder, respectively. $θ_{q}$ and $θ_{k}$ are the learnable parameters of the two encoders. Note that in MoCo v2, the key encoder is not trained by gradient descent; instead, it is a momentum-updated version of the query encoder: $θ_{k} \leftarrow α θ_{k} + (1 - α) θ_{q}$ , where $α$ is a momentum coefficient that controls the updating speed. During self-supervised pre-training, the noise contrastive estimation loss InfoNCE [van2018representation] is used to perform instance discrimination, which is computed as follows:

L_{SCL} = - log \frac{exp (z_{q}^{⊤} z_{k} / τ_{c})}{exp (z_{q}^{⊤} z_{k} / τ_{c}) + \sum_{i = 1}^{N} exp (z_{q}^{⊤} m_{i} / τ_{c})},

(1)

where $τ_{c}$ is a temperature hyper-parameter [hinton2015distilling] that scales the distribution of instances and $m_{i}$ is the key embedding of a negative sample. $N$ is the size of a queue-based memory bank $M$ where all the negative key embeddings are stored. After the training of the current mini-batch, $z_{k}$ is enqueued as a new negative key embedding and the oldest embeddings in the memory bank are dequeued.
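The single-sample computation described above can be sketched in plain Python as follows. This is only an illustrative sketch, not the authors' PyTorch implementation: the helper names (`dot`, `info_nce`, `momentum_update`), the toy 2-D embeddings, and the momentum value `alpha=0.9` are assumptions chosen for readability, while `tau_c=0.07` follows the setting reported in the implementation details.

```python
import math
from collections import deque


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(z_q, z_k, memory, tau_c=0.07):
    """Eq. 1: negative log-probability of the positive key among all keys."""
    logits = [dot(z_q, z_k) / tau_c] + [dot(z_q, m) / tau_c for m in memory]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


def momentum_update(theta_k, theta_q, alpha):
    """theta_k <- alpha * theta_k + (1 - alpha) * theta_q, element-wise."""
    return [alpha * k + (1.0 - alpha) * q for k, q in zip(theta_k, theta_q)]


# Toy queue-based memory bank with N = 3 slots (hypothetical values).
memory = deque([[0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]], maxlen=3)
z_q, z_k = [1.0, 0.0], [0.8, 0.6]

loss = info_nce(z_q, z_k, memory)
memory.append(z_k)  # enqueue the current key; the oldest key is dropped
theta_k = momentum_update([1.0, 1.0], [0.0, 2.0], alpha=0.9)
```

Using a bounded `deque` makes the enqueue and dequeue steps a single `append`, which mirrors the first-in-first-out behavior of the memory bank.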

Under the supervision of the InfoNCE loss, the encoder is forced to learn representation that is invariant to data augmentations, thereby focusing on semantic information shared between positive pairs. Nevertheless, the learned representation is often biased toward its own modality, making it difficult to account for all data characteristics. Although this bias can be alleviated by test-time ensembling, doing so increases the running overhead several-fold. Moreover, the inherent limitations of the learned representation in each modality still exist. Therefore, during self-supervised pre-training, cross-modal interaction is essential.

3.3 Cross-modal Mutual Distillation

While SCL is performed within each skeleton modality, the proposed CMD models the learned knowledge and transfers it between modalities. This enables each modality to receive knowledge from other perspectives, thereby alleviating the modal bias of the learned representation. Based on MoCo v2, CMD can be easily implemented in a few lines of code, as shown in Alg. 1.

Knowledge Modeling: To perform knowledge distillation between modalities, we first need to model the knowledge learned in each modality in a proper way. It needs to take advantage of the existing contrastive learning framework to avoid introducing excessive computational overhead. Moreover, since the distillation is performed cross-modally for self-supervised learned knowledge, conventional methods that rely on individual features/logits are no longer applicable.

Inspired by recent relational knowledge distillation works [park2019relational, peng2019correlation, tung2019similarity], we utilize the pairwise relationship between samples for modality-specific knowledge modeling. Given an embedding $z$ and a set of anchors ${n_{i}}_{i = 1, 2, \dots, K}$ , we compute the similarities between them as $sim (z, n_{i}) = z^{⊤} n_{i}, i = 1, 2, \dots, K .$

In the MoCo v2 [chen2020improved] framework, a large number of negative embeddings are stored in the memory bank, so we can obtain the required anchors without additional model inference. Note that if all the negative embeddings are used as anchors, the set ${z^{⊤} m_{i}}_{i = 1, 2, \dots, N}$ is exactly the contrastive context defined in [li20213d]. In our approach, we select the top $K$ nearest neighbors of $z$ as the anchors. The resulting pairwise similarities are further converted into probability distributions with a temperature hyper-parameter $τ$ :

p_{i} (z, τ) = \frac{exp (z^{⊤} n_{i} / τ)}{\sum_{j = 1}^{K} exp (z^{⊤} n_{j} / τ)}, i = 1, 2, \dots, K .

(2)

The obtained $p (z, τ) = {p_{i} (z, τ)}_{i = 1, 2, \dots, K}$ describes the distribution characteristic around the embedding $z$ in the customized feature space of each modality.
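As a rough illustration of Eq. 2, the sketch below selects the top-$K$ anchors from a toy memory bank and converts the similarities into a distribution. The function name `neighbor_distribution` and the 2-D example vectors are hypothetical; the two temperatures 0.1 and 0.05 match the student and teacher values reported in the implementation details and are used here only to show that a smaller temperature concentrates the probability mass on the nearest anchor.

```python
import math


def neighbor_distribution(z, bank, k, tau):
    """Eq. 2: softmax over similarities to the top-k nearest anchors of z."""
    sims = [sum(a * b for a, b in zip(z, m)) for m in bank]
    # Indices of the k most similar bank entries, most similar first.
    top_idx = sorted(range(len(bank)), key=lambda i: sims[i], reverse=True)[:k]
    scaled = [sims[i] / tau for i in top_idx]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return top_idx, [w / total for w in weights]


# Toy memory bank of N = 4 unit vectors and one embedding (hypothetical values).
bank = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
z = [0.8, 0.6]

idx, p_soft = neighbor_distribution(z, bank, k=3, tau=0.1)
_, p_sharp = neighbor_distribution(z, bank, k=3, tau=0.05)
```

In this toy case the least similar bank entry is excluded by the top-$K$ selection, and `p_sharp` assigns more mass to the nearest anchor than `p_soft`.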

Knowledge Distillation: Based on the aforementioned probability distributions, an intuitive way to perform knowledge distillation would be to directly establish consistency constraints between skeleton modalities. Different from previous knowledge distillation approaches that transfer the knowledge of a fixed and well-trained teacher model to the student, in our approach, the knowledge is continuously updated during self-supervised pre-training and each modality acts as both student and teacher.

To this end, based on the contrastive framework, we make two customized designs in the proposed approach: i) Different embeddings are used for teacher and student. As shown in Figure 2, in MoCo v2 [chen2020improved], two augmented views of the same sample are encoded into query $z_{q}$ and key $z_{k}$ , respectively. In our approach, the key distribution obtained in one modality is used to guide the learning of query distribution in other modalities, so that knowledge is transferred accordingly. Specifically, for the key embedding $z_{k}^{a}$ from modality A and the query embedding $z_{q}^{b}$ from modality B, we select the top $K$ nearest neighbors of $z_{k}^{a}$ as anchors and compute the similarity distributions as $p (z_{q}^{b}, τ)$ and $p (z_{k}^{a}, τ)$ according to Eq. 2. Knowledge distillation from modality A to modality B is performed by minimizing the following KL divergence:

KL (p (z_{k}^{a}, τ) \| p (z_{q}^{b}, τ)) = \sum_{i = 1}^{K} p_{i} (z_{k}^{a}, τ) \cdot log \frac{p_{i} (z_{k}^{a}, τ)}{p_{i} (z_{q}^{b}, τ)} .

(3)

Since the key encoder is not trained with gradient, the teacher is not affected during unidirectional knowledge distillation. Moreover, the momentum updated key encoder provides more stable knowledge for the student to learn. ii) Asymmetric temperatures $τ_{t}$ and $τ_{s}$ are employed for teacher and student, respectively. Considering that there is no intuitive teacher-student relationship between modalities, a smaller temperature is applied for the teacher in CMD to emphasize the high-confidence information, as discussed in [Tejankar_2021_ICCV].

Since the knowledge distillation works bidirectionally, given two modalities A and B, the loss function for CMD is formulated as follows:

L_{CMD} = KL (p (z_{k}^{a}, τ_{t}) \| p (z_{q}^{b}, τ_{s})) + KL (p (z_{k}^{b}, τ_{t}) \| p (z_{q}^{a}, τ_{s})) .

(4)

Note that Eq. 4 can be easily extended if more modalities are involved. The final loss function in our approach is the combination of $L_{SCL}$ and $L_{CMD}$ :

L = L_{SCL}^{a} + L_{SCL}^{b} + L_{CMD},

(5)

where the superscripts $a$ and $b$ denote modality A and B, respectively.
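To make Eq. 3 and Eq. 4 concrete for a single sample, the following plain-Python sketch computes both distillation directions. It is a simplified illustration under stated assumptions, not the released implementation: the names `distill` and `cmd_loss` and the toy 2-D queues are hypothetical, and the two queues are assumed to store the same samples at the same indices so that anchor indices selected in one modality can be reused in the other.

```python
import math


def log_softmax(scores):
    peak = max(scores)
    lse = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - lse for s in scores]


def sims(z, queue):
    return [sum(a * b for a, b in zip(z, m)) for m in queue]


def distill(z_k_teacher, queue_teacher, z_q_student, queue_student, k, tau_t, tau_s):
    """One direction of Eq. 4: KL(p(z_k^teacher, tau_t) || p(z_q^student, tau_s))."""
    t_sims = sims(z_k_teacher, queue_teacher)
    s_sims = sims(z_q_student, queue_student)
    # Anchors are the teacher key's top-k neighbours; the student reuses the indices.
    idx = sorted(range(len(t_sims)), key=lambda i: t_sims[i], reverse=True)[:k]
    log_p_t = log_softmax([t_sims[i] / tau_t for i in idx])
    log_p_s = log_softmax([s_sims[i] / tau_s for i in idx])
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))


def cmd_loss(z_q_a, z_k_a, queue_a, z_q_b, z_k_b, queue_b, k, tau_t=0.05, tau_s=0.1):
    """Eq. 4: distillation from A to B plus distillation from B to A."""
    a_to_b = distill(z_k_a, queue_a, z_q_b, queue_b, k, tau_t, tau_s)
    b_to_a = distill(z_k_b, queue_b, z_q_a, queue_a, k, tau_t, tau_s)
    return a_to_b + b_to_a


# Toy index-aligned queues for modalities A and B (hypothetical values).
queue_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
queue_b = [[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]]
z_q_a, z_k_a = [0.6, 0.8], [0.8, 0.6]
z_q_b, z_k_b = [0.8, 0.6], [0.6, 0.8]

loss_cmd = cmd_loss(z_q_a, z_k_a, queue_a, z_q_b, z_k_b, queue_b, k=2)
same = distill(z_k_a, queue_a, z_k_a, queue_a, k=2, tau_t=0.1, tau_s=0.1)
```

The total objective of Eq. 5 would add the two single-modal InfoNCE terms to `loss_cmd`. The last line checks the sanity case in which teacher and student coincide, where the divergence vanishes.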

import torch
import torch.nn.functional as F

# z_q_a, z_q_b, z_k_a, z_k_b: query/key embeddings in modality A/B (BxC)
# queue_a, queue_b: queue of N keys in modality A/B (CxN)
# tau_s, tau_t: temperatures for student/teacher (scalars)
# K: number of nearest neighbors used as anchors

def loss_kld(inputs, targets):
    # student logits -> log-probabilities, teacher logits -> probabilities
    inputs, targets = F.log_softmax(inputs, dim=1), F.softmax(targets, dim=1)
    return F.kl_div(inputs, targets, reduction='batchmean')

l_a, lk_a = torch.mm(z_q_a, queue_a), torch.mm(z_k_a, queue_a)  # compute similarities
l_b, lk_b = torch.mm(z_q_b, queue_b), torch.mm(z_k_b, queue_b)

lk_a_topk, idx_a = torch.topk(lk_a, K, dim=-1)  # select top K nearest neighbors
lk_b_topk, idx_b = torch.topk(lk_b, K, dim=-1)

loss_cmd = (loss_kld(torch.gather(l_b, -1, idx_a) / tau_s, lk_a_topk / tau_t)  # A to B
            + loss_kld(torch.gather(l_a, -1, idx_b) / tau_s, lk_b_topk / tau_t))  # B to A

Algorithm 1 Pseudocode of the CMD module in a PyTorch-like style.

3.4 Relationship with Positive Mining

Cross-modal Positive Mining: Cross-modal positive mining is the most important component in CrosSCLR [li20213d], where the most similar negative sample is selected to boost the positive sets for contrastive learning in complementary modalities. The contrastive loss for modality B is reformulated as:

$L_{CPM}^{b} = - log \frac{exp ({z_{q}^{b}}^{⊤} z_{k}^{b} / τ_{c}) + exp ({z_{q}^{b}}^{⊤} m_{u}^{b} / τ_{c})}{exp ({z_{q}^{b}}^{⊤} z_{k}^{b} / τ_{c}) + \sum_{i = 1}^{N} exp ({z_{q}^{b}}^{⊤} m_{i}^{b} / τ_{c})}$

$\quad = L_{SCL}^{b} - log \frac{exp ({z_{q}^{b}}^{⊤} m_{u}^{b} / τ_{c})}{exp ({z_{q}^{b}}^{⊤} z_{k}^{b} / τ_{c}) + \sum_{i = 1}^{N} exp ({z_{q}^{b}}^{⊤} m_{i}^{b} / τ_{c})},$

(6)

where $u$ is the index of most similar negative sample in modality A.

CMD with $τ_{t} = 0$ and $K = N$ : Setting the teacher temperature $τ_{t} = 0$ (understood as the limit $τ_{t} \to 0$) and $K = N$ , the key distribution $p (z_{k}^{a}, τ_{t})$ in Eq. 4 becomes a one-hot vector with its only $1$ at index $u$ , and thus the loss acting on modality B becomes:

$L^{b} = L_{SCL}^{b} + L_{CMD}^{b}$

$\quad = L_{SCL}^{b} + \sum_{i = 1}^{N} p_{i} (z_{k}^{a}, 0) \cdot log \frac{p_{i} (z_{k}^{a}, 0)}{p_{i} (z_{q}^{b}, τ_{s})}$

$\quad = L_{SCL}^{b} + 1 \cdot log \frac{1}{p_{u} (z_{q}^{b}, τ_{s})}$

$\quad = L_{SCL}^{b} - log \frac{exp ({z_{q}^{b}}^{⊤} m_{u}^{b} / τ_{s})}{\sum_{j = 1}^{N} exp ({z_{q}^{b}}^{⊤} m_{j}^{b} / τ_{s})} .$

(7)

We can find that the loss $L_{CMD}^{b}$ is essentially doing contrastive learning in modality B with the positive sample mined by modality A. Compared with Eq. 6, the only difference is that when the mined $m_{u}^{b}$ is taken as the positive sample, the key embedding $z_{k}^{b}$ is excluded from the denominator. The same result holds for modality A. Thus we draw a conclusion that the cross-modal positive mining performed in CrosSCLR [li20213d] can be regarded as a special case of our approach with the temperature of teacher $τ_{t} = 0$ and the number of neighbors $K = N$ .
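The limiting behavior derived above can be checked numerically. The sketch below, written under illustrative assumptions (toy similarity values and hypothetical helper names), compares the distillation term with $K = N$ against $- log\, p_{u} (z_{q}^{b}, τ_{s})$ for progressively smaller teacher temperatures; the gap should shrink as $τ_{t}$ approaches zero.

```python
import math


def log_softmax(scores):
    peak = max(scores)
    lse = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - lse for s in scores]


def kl_teacher_student(teacher_sims, student_sims, tau_t, tau_s):
    """KL(p(teacher, tau_t) || p(student, tau_s)) over all N anchors (K = N)."""
    log_p_t = log_softmax([s / tau_t for s in teacher_sims])
    log_p_s = log_softmax([s / tau_s for s in student_sims])
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))


# Toy similarities of z_k^a and z_q^b to N = 4 memory entries (hypothetical values).
teacher_sims = [0.2, 0.9, -0.1, 0.5]
student_sims = [0.4, 0.3, 0.1, -0.2]
tau_s = 0.1

# Index u of the most similar negative sample in modality A.
u = max(range(len(teacher_sims)), key=lambda i: teacher_sims[i])
mined_positive_loss = -log_softmax([s / tau_s for s in student_sims])[u]

gaps = []
for tau_t in (0.05, 0.01, 0.001):
    kl = kl_teacher_student(teacher_sims, student_sims, tau_t, tau_s)
    gaps.append(abs(kl - mined_positive_loss))
    print(f"tau_t={tau_t}: |KL - (-log p_u)| = {gaps[-1]:.3e}")
```

For any strictly positive $τ_{t}$ the teacher distribution keeps some mass on the other anchors, which is why the two quantities coincide only in the limit.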

4 Experiments

4.1 Implementation Details

Network Architecture: In our approach, we adopt a 3-layer Bidirectional GRU (BiGRU) as the base-encoder, which has a hidden dimension of 1024. Before the encoder, we additionally add a Batch Normalization [ioffe2015batch] layer to stabilize the training process. Each skeleton sequence is represented in a two-actor manner, where the second actor is set to zeros if only one actor exists. The sequences are further resized to a temporal length of 64 frames.

Self-supervised Pre-training: During pre-training, we adopt MoCo v2 [chen2020improved] to perform single-modal contrastive learning. The temperature hyper-parameter in the InfoNCE [van2018representation] loss is 0.07. In cross-modal mutual distillation, the temperatures for teacher and student are set to 0.05 and 0.1, respectively. The number of neighbors $K$ is set to 8192. The SGD optimizer is employed with a momentum of 0.9 and a weight decay of 0.0001. The batch size is set to 64 and the initial learning rate is 0.01. For NTU RGB+D 60 [shahroudy2016ntu] and NTU RGB+D 120 [liu2020ntu] datasets, the model is trained for 450 epochs, the learning rate is reduced to 0.001 after 350 epochs, and the size of the memory bank $N$ is 16384. For PKU-MMD II [liu2017pku] dataset, the total epochs are increased to 1000, and the learning rate drops at epoch 800. We adopt the same skeleton augmentations as ISC [thoker2021skeleton].
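For reference, the pre-training settings listed above can be collected into a small configuration object, sketched below. The class name `PretrainConfig` and its field names are hypothetical, epochs are assumed to be 0-indexed, and two values are assumptions rather than statements from the text: the reduced learning rate for PKU-MMD II is taken to be 0.001 as on NTU, and the memory bank size for PKU-MMD II is left unset because it is not reported here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PretrainConfig:
    total_epochs: int
    lr_drop_epoch: int
    queue_size: Optional[int]      # memory bank size N (None if not reported)
    base_lr: float = 0.01
    dropped_lr: float = 0.001      # assumed identical for all datasets
    batch_size: int = 64
    sgd_momentum: float = 0.9
    weight_decay: float = 1e-4
    tau_c: float = 0.07            # InfoNCE temperature
    tau_t: float = 0.05            # teacher temperature
    tau_s: float = 0.1             # student temperature
    num_neighbors: int = 8192      # K

    def lr_at(self, epoch: int) -> float:
        """Learning rate for a 0-indexed epoch under a single step decay."""
        return self.base_lr if epoch < self.lr_drop_epoch else self.dropped_lr


NTU = PretrainConfig(total_epochs=450, lr_drop_epoch=350, queue_size=16384)
PKU_II = PretrainConfig(total_epochs=1000, lr_drop_epoch=800, queue_size=None)
```

Note that the teacher temperature is smaller than the student temperature, in line with the asymmetric design of the distillation module.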

4.2 Datasets and Metrics

NTU RGB+D 60 [shahroudy2016ntu]: NTU RGB+D 60 (NTU-60) is a large-scale multi-modality action recognition dataset captured by three Kinect v2 cameras. It contains 60 action categories and 56,880 sequences. The actions are performed by 40 different subjects (actors). In this paper, we adopt its 3D skeleton data for experiments. Specifically, each human skeleton contains 25 body joints, and each joint is represented as 3D coordinates. Two evaluation protocols are recommended by the authors: cross-subject (x-sub) and cross-view (x-view). For x-sub, action sequences performed by half of the 40 subjects are used as training samples and the rest as test samples. For x-view, the training samples are captured by cameras 2 and 3 and the test samples are from camera 1.

NTU RGB+D 120 [liu2020ntu]: Compared with NTU-60, NTU RGB+D 120 (NTU-120) extends the action categories from 60 to 120, with 114,480 skeleton sequences in total. The number of subjects is also increased from 40 to 106. Moreover, a new evaluation protocol named cross-setup (x-set) is proposed as a substitute for x-view. Specifically, the sequences are divided into 32 different setups according to the camera distances and background, with half of the 32 setups (even-numbered) used for training and the rest for testing.
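The split rules of the three protocols can be summarized in a few lines of Python. This is a hedged sketch: the function name `is_training_sample` and its arguments are hypothetical, and the list of training subjects for x-sub is passed in by the caller because it is not enumerated in this section.

```python
def is_training_sample(protocol, camera=None, setup=None, subject=None,
                       train_subjects=frozenset()):
    """Return True if a sequence belongs to the training split of a protocol."""
    if protocol == "x-view":
        # Cameras 2 and 3 are used for training, camera 1 for testing.
        return camera in (2, 3)
    if protocol == "x-set":
        # Even-numbered setups are used for training, the rest for testing.
        return setup % 2 == 0
    if protocol == "x-sub":
        # The caller supplies the official list of training subject IDs.
        return subject in train_subjects
    raise ValueError(f"unknown protocol: {protocol}")


# Example with a made-up training subject list (not the official one).
example_train_subjects = frozenset({1, 2, 4})
in_train = is_training_sample("x-sub", subject=2,
                              train_subjects=example_train_subjects)
```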

PKU-MMD [liu2017pku]: PKU-MMD is a new benchmark for multi-modality 3D human action detection. It can also be used for action recognition tasks [lin_ms2l_mm]. PKU-MMD has two phases, where Phase II is extremely challenging since more noise is introduced by large view variation. In this work, we evaluate the proposed method on Phase II (PKU-II) under the widely used cross-subject evaluation protocol, with 5,332 skeleton sequences for training and 1,613 for testing.

Evaluation Metrics: We report the top-1 accuracy for all datasets.

4.3 Comparison with State-of-the-art Methods

In this section, the learned representation is utilized for 3D action classification under a variety of evaluation protocols. We compare the results with previous state-of-the-art methods. Note that during evaluation, we only take a single skeleton modality (joint) as input by default, which is consistent with prior work [lin_ms2l_mm, su2020predict, thoker2021skeleton]. Integrating multiple skeleton modalities for evaluation can significantly improve the performance, but it will also incur more time overhead.

Method	Modality	NTU-60 x-sub	NTU-60 x-view	NTU-120 x-sub	NTU-120 x-set	PKU-II x-sub
LongT GAN [zheng2018unsupervised]	Joint only	39.1	48.1	-	-	26.0
MS $^{2}$ L [lin_ms2l_mm]	Joint only	52.6	-	-	-	27.6
P&C [su2020predict]	Joint only	50.7	76.3	42.7	41.7	25.5
AS-CAL [rao2021augmented]	Joint only	58.5	64.8	48.6	49.2	-
SeBiReNet [nie2020unsupervised]	Joint only	-	79.7	-	-	-
AimCLR [guo2022aimclr]	Joint only	74.3	79.7	-	-	-
ISC [thoker2021skeleton]	Joint only	76.3	85.2	67.1	67.9	36.0
CrosSCLR-B	Joint only	77.3	85.1	67.1	68.6	41.9
CMD (Ours)	Joint only	79.8	86.9	70.3	71.5	43.0
3s-CrosSCLR [li20213d]	Joint+Motion+Bone	77.8	83.4	67.9	66.7	21.2
3s-AimCLR [guo2022aimclr]	Joint+Motion+Bone	78.9	83.8	68.2	68.8	39.5
3s-CrosSCLR-B	Joint+Motion+Bone	82.1	89.2	71.6	73.4	51.0
3s-CMD (Ours)	Joint+Motion+Bone	84.1	90.9	74.7	76.1	52.6

Table 1: Performance comparison on NTU-60, NTU-120, and PKU-II in terms of the linear evaluation protocol. Our approach achieves state-of-the-art performance on all of them, both when taking single skeleton modality as input and when ensembling multiple modalities during evaluation. The prefix “3s-” denotes multi-modal ensembling.

Linear Evaluation Protocol: For linear evaluation protocol, we freeze the pre-trained encoder and add a learnable linear classifier after it. The classifier is trained on the corresponding training set for 80 epochs with a learning rate of 0.1 (reduced to 0.01 and 0.001 at epoch 50 and 70, respectively). We evaluate the proposed method on the NTU-60, NTU-120, and PKU-II datasets. As shown in Table 1, we include the recently proposed CrosSCLR [li20213d], ISC [thoker2021skeleton], and AimCLR [guo2022aimclr] for comparison. Our approach outperforms previous state-of-the-art methods by a considerable margin on all three benchmarks. Note that ISC and the proposed CMD share the same BiGRU encoder, which is different from the ST-GCN [yan2018spatial] encoder in CrosSCLR. For a fair comparison, we additionally train a variation of CrosSCLR with BiGRU as its base-encoder (denoted as CrosSCLR-B). Our method still outperforms it on all three datasets, which shows the superiority of the proposed cross-modal mutual distillation.

KNN Evaluation Protocol: An alternative way to use the pre-trained encoder for action classification is to directly apply a K-Nearest Neighbor (KNN) classifier to the learned features of the training samples. Following [su2020predict], we assign each test sample the class of its nearest neighbor among the training samples (i.e. KNN with k=1). As shown in Table 2, we perform experiments on the NTU-60 and NTU-120 benchmarks and compare the results with previous works. For both datasets, our approach exhibits the best performance, surpassing CrosSCLR-B [li20213d] by 4.5 to 6.0 percentage points in the more challenging cross-subject and cross-setup protocols.
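A nearest-neighbor evaluation of this kind can be sketched as follows. The sketch assumes cosine similarity as the matching score, which is an assumption and not something this section specifies; the function names and the toy 2-D features and labels are likewise hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def predict_1nn(test_feature, train_features, train_labels):
    """Return the label of the most similar training feature (k = 1)."""
    query = normalize(test_feature)
    scores = [sum(a * b for a, b in zip(query, normalize(f)))
              for f in train_features]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return train_labels[best]


def top1_accuracy(test_features, test_labels, train_features, train_labels):
    hits = sum(predict_1nn(f, train_features, train_labels) == y
               for f, y in zip(test_features, test_labels))
    return hits / len(test_labels)


# Toy frozen-encoder features (hypothetical values and class names).
train_features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
train_labels = ["wave", "wave", "kick"]
test_features = [[2.0, 0.1], [0.1, 3.0]]
test_labels = ["wave", "kick"]

accuracy = top1_accuracy(test_features, test_labels, train_features, train_labels)
```

Because no classifier is trained, this protocol measures the quality of the frozen features directly.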

Method	NTU-60 x-sub	NTU-60 x-view	NTU-120 x-sub	NTU-120 x-set
LongT GAN [zheng2018unsupervised]	39.1	48.1	31.5	35.5
P&C [su2020predict]	50.7	76.3	39.5	41.8
ISC [thoker2021skeleton]	62.5	82.6	50.6	52.3
CrosSCLR-B	66.1	81.3	52.5	54.9
CMD (Ours)	70.6	85.4	58.3	60.9

Table 2: Performance comparison on NTU-60 and NTU-120 in terms of the KNN evaluation protocol. The learned representation exhibits the best performance on both datasets, surpassing previous state-of-the-art methods by a considerable margin.

Method	NTU-60 to PKU-II	NTU-120 to PKU-II
LongT GAN [zheng2018unsupervised]	44.8	-
MS $^{2}$ L [lin_ms2l_mm]	45.8	-
ISC [thoker2021skeleton]	51.1	52.3
CrosSCLR-B	54.0	52.8
CMD (Ours)	56.0	57.0

Table 3: Performance comparison on PKU-II in terms of the transfer learning evaluation protocol. The source datasets are NTU-60 and NTU-120. The representation learned by our approach shows the best transferability.

Transfer Learning Evaluation Protocol: In the transfer learning evaluation protocol, we examine the transferability of the learned representation. Specifically, we first use the proposed framework to pre-train the encoder on the source dataset. The pre-trained encoder and a linear classifier are then fine-tuned together on the target dataset for 80 epochs with a learning rate of 0.01 (reduced to 0.001 at epoch 50). We select NTU-60 and NTU-120 as source datasets, and PKU-II as the target dataset. We compare the proposed approach with the previous methods LongT GAN [zheng2018unsupervised], MS$^{2}$L [lin_ms2l_mm], and ISC [thoker2021skeleton] under the cross-subject protocol. As shown in Table 3, our approach exhibits superior performance on the PKU-II dataset after large-scale pre-training, outperforming previous methods by a considerable margin. This indicates that the representation learned by our approach is more transferable.

Method	NTU-60
	x-view				x-sub
	(1%)	(5%)	(10%)	(20%)	(1%)	(5%)	(10%)	(20%)
LongT GAN [zheng2018unsupervised]	-	-	-	-	35.2	-	62.0	-
MS$^{2}$L [lin_ms2l_mm]	-	-	-	-	33.1	-	65.1	-
ASSL [si2020adversarial]	-	63.6	69.8	74.7	-	57.3	64.3	68.0
ISC [thoker2021skeleton]	38.1	65.7	72.5	78.2	35.7	59.6	65.9	70.8
CrosSCLR-B [li20213d]	49.8	70.6	77.0	81.9	48.6	67.7	72.4	76.1
CMD (Ours)	53.0	75.3	80.2	84.3	50.6	71.0	75.4	78.7
3s-CrosSCLR [li20213d]	50.0	-	77.8	-	51.1	-	74.4	-
3s-Colorization [yang2021skeleton]	52.5	-	78.9	-	48.3	-	71.7	-
3s-AimCLR [guo2022aimclr]	54.3	-	81.6	-	54.8	-	78.2	-
3s-CMD (Ours)	55.5	77.2	82.4	86.6	55.6	74.3	79.0	81.8

Table 4: Performance comparison on NTU-60 in terms of the semi-supervised evaluation protocol. We randomly select a portion of the labeled data to fine-tune the pre-trained encoder, and the average of five runs is reported as the final performance. Our approach achieves state-of-the-art results compared with previous methods.

Semi-supervised Evaluation Protocol: In semi-supervised classification, both labeled and unlabeled data are used during training. The goal is to train a classifier that performs better than one trained with only the labeled samples. For a fair comparison, we adopt the same strategy as ISC [thoker2021skeleton]: the pre-trained encoder is fine-tuned together with the post-attached linear classifier on a portion of the corresponding training set. We conduct experiments on the NTU-60 dataset. As shown in Table 4, we report the evaluation results when the proportion of supervised data is set to 1%, 5%, 10%, and 20%, respectively. Compared with the previous methods LongT GAN [zheng2018unsupervised], MS$^{2}$L [lin_ms2l_mm], ASSL [si2020adversarial], and ISC [thoker2021skeleton], our algorithm exhibits superior performance. For example, with the same baseline, the proposed approach outperforms ISC and CrosSCLR-B by a large margin. We also include 3s-CrosSCLR [li20213d], 3s-Colorization [yang2021skeleton], and the recently proposed 3s-AimCLR [guo2022aimclr] in the comparison. These methods perform test-time multimodal ensembling and report results for 1% and 10% labeled data. Our 3s-CMD still outperforms all of them after ensembling multiple skeleton modalities.
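The data-selection side of this protocol can be sketched as follows. The source states only that a portion of the labeled data is randomly selected and that five runs are averaged (Table 4); uniform sampling without class stratification, the seed values, and the helper names are assumptions made for this sketch. The `dummy_run` function is a stand-in for fine-tuning and evaluation and does not train anything.

```python
import random


def sample_labeled_subset(num_samples, fraction, seed):
    """Uniformly pick round(fraction * num_samples) training indices without replacement."""
    rng = random.Random(seed)
    count = max(1, round(num_samples * fraction))
    return sorted(rng.sample(range(num_samples), count))


def mean_over_runs(run_fn, num_samples, fraction, seeds=(0, 1, 2, 3, 4)):
    """Average the score returned by `run_fn` over one random subset per seed."""
    scores = [run_fn(sample_labeled_subset(num_samples, fraction, s)) for s in seeds]
    return sum(scores) / len(scores)


def dummy_run(labeled_indices):
    # Placeholder for "fine-tune on these samples, then evaluate";
    # it returns the labeled fraction of a 1000-sample toy set as a fake score.
    return len(labeled_indices) / 1000


for fraction in (0.01, 0.05, 0.10, 0.20):
    subset = sample_labeled_subset(1000, fraction, seed=0)
    print(fraction, len(subset), mean_over_runs(dummy_run, 1000, fraction))
```

Fixing one seed per run keeps each subset reproducible while still averaging over different random selections.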

4.4 Ablation study

Figure 3: Ablative study of the number of neighbors $K$ in the cross-modal mutual distillation module. The performance is evaluated on the cross-subject protocol of the NTU-60 dataset.

To justify the effectiveness of the proposed cross-modal mutual distillation framework, we conduct several ablative experiments on the NTU-60 dataset according to the cross-subject protocol. More results can be found in the supplementary material.

Number of neighbors: The number of nearest neighbors controls the amount of contextual information used in the proposed cross-modal mutual distillation module. We test the performance of the learned representation with respect to different numbers of nearest neighbors $K$ under the linear evaluation protocol. As shown in Figure 3, on the downstream classification task, the performance of the pre-trained encoder improves as $K$ increases. When $K$ is large enough ($K \geq 8192$), increasing it further hardly improves the performance. This is because the newly added neighbors are far away and provide little useful information for describing the distribution around the query sample. In addition, when $K$ varies from 64 to 16384, the performance of our approach is consistently higher than that of CrosSCLR-B [li20213d] and our baseline. This demonstrates the superiority and robustness of the proposed approach.
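The saturation at large $K$ can be illustrated with a toy computation. The sketch below assumes that the neighboring similarity distribution is a temperature-scaled softmax over the similarities between a query and its top-$K$ neighbors, and that embeddings are unit-normalized so that a dot product is a cosine similarity. The temperature `tau=0.1`, the 2-D toy memory bank, and the function name are illustrative assumptions, not values from the paper.

```python
import math


def topk_neighbor_distribution(query, bank, k, tau):
    """Softmax over the k largest query-to-bank similarities, most similar first."""
    sims = [sum(q * b for q, b in zip(query, item)) for item in bank]
    top = sorted(range(len(bank)), key=lambda i: sims[i], reverse=True)[:k]
    peak = sims[top[0]]  # subtract the maximum for numerical stability
    weights = [math.exp((sims[i] - peak) / tau) for i in top]
    total = sum(weights)
    return top, [w / total for w in weights]


# Toy memory bank: unit vectors spaced 10 degrees apart (illustrative data only).
bank = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in range(0, 180, 10)]
query = (math.cos(math.radians(3)), math.sin(math.radians(3)))

# For each K, print the index and probability mass of the farthest included neighbor.
for k in (2, 4, 8, 16):
    idx, probs = topk_neighbor_distribution(query, bank, k, tau=0.1)
    print(k, idx[-1], f"{probs[-1]:.6f}")
```

In this toy setting, the neighbors added at larger $K$ receive very little probability mass, which is consistent with the observation that they contribute little to describing the distribution around the query.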

Modality & Direction	Linear Evaluation				KNN Evaluation
	Bone	Motion	Joint	$Δ$	Bone	Motion	Joint	$Δ$
Baseline	74.4	73.1	76.1		62.0	56.8	63.4
J $\leftarrow$ B	74.4	-	76.5		62.0	-	64.3
J $⇄$ B	76.6	-	77.7	$↑$ 1.2	65.9	-	66.5	$↑$ 2.2
J $\leftarrow$ M	-	73.1	78.9		-	56.8	64.8
J $⇄$ M	-	77.5	79.8	$↑$ 0.9	-	67.0	68.7	$↑$ 3.9
J $\leftarrow$ M, J $\leftarrow$ B	74.4	73.1	78.8		62.0	56.8	66.5
J $⇄$ M, J $⇄$ B, M $⇄$ B	77.8	77.1	79.4	$↑$ 0.6	69.5	68.7	70.6	$↑$ 4.1

Table 5: Ablative experiments of modality selection and bidirectional distillation. The performance is evaluated on the NTU-60 dataset according to the cross-subject protocol. J, M, and B denote joint, motion, and bone modality respectively. The horizontal arrows indicate the direction of distillation.

Modality Selection: Following [li20213d], we consider three skeleton modalities for self-supervised pre-training: joint, motion, and bone. Our approach is capable of performing knowledge distillation between any two of these modalities. As shown in Table 5, we report the performance of the representation obtained by pre-training with different combinations of skeleton modalities. Note that the joint modality is always preserved since it is used for evaluation. We make the following observations: i) Cross-modal knowledge distillation improves the performance of the representation in student modalities. ii) Under the linear evaluation protocol, knowledge distillation between joint and motion achieves the best performance, exceeding the baseline by 3.7%. iii) Under the KNN evaluation protocol, the learned representation shows the best results when all three modalities are involved in knowledge distillation, outperforming the baseline by an absolute 7.2%.
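This section does not define how the three modalities are computed. The sketch below follows the convention common in skeleton-based action recognition, where motion is the frame-to-frame displacement of each joint and bone is the offset of each joint from the joint it is attached to. The 3-joint skeleton, the `parents` list, and the function names are hypothetical; the preprocessing in the released code (for example padding or bone direction) may differ.

```python
def motion_from_joints(joints):
    """joints[t][v] is an (x, y, z) tuple; returns displacements between consecutive frames."""
    return [
        [tuple(b - a for a, b in zip(prev[v], cur[v])) for v in range(len(cur))]
        for prev, cur in zip(joints, joints[1:])
    ]


def bone_from_joints(joints, parents):
    """parents[v] is the joint that v attaches to; the root points to itself (zero bone)."""
    return [
        [tuple(c - p for c, p in zip(frame[v], frame[parents[v]])) for v in range(len(frame))]
        for frame in joints
    ]


# Toy sequence: 2 frames, 3 joints arranged in a chain 0 -> 1 -> 2.
joints = [
    [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, 2.5, 0.0)],
]
parents = [0, 0, 1]

print(motion_from_joints(joints))
print(bone_from_joints(joints, parents))
```

Both derived modalities are deterministic functions of the joint coordinates, so each one presents the same sequence to its encoder from a different view.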

Bidirectional Distillation: In addition to modality selection, we also verify the effectiveness of bidirectional distillation. It enables the modalities involved in the distillation to interact with each other and progress together, forming a virtuous circle. In Table 5, the last column of each evaluation protocol reports the performance gain of bidirectional mutual distillation over unidirectional distillation in the joint modality. Results show that regardless of which skeleton modalities are used during pre-training, bidirectional mutual distillation further boosts the performance, especially under the KNN evaluation protocol.
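The $Δ$ column of Table 5 can be recomputed directly from the joint-modality accuracies. The short check below copies the unidirectional and bidirectional values from the table; the dictionary keys and the helper name are illustrative only.

```python
# Joint-modality accuracy from Table 5 as (unidirectional, bidirectional) pairs.
table5_joint = {
    "joint+bone": {"linear": (76.5, 77.7), "knn": (64.3, 66.5)},
    "joint+motion": {"linear": (78.9, 79.8), "knn": (64.8, 68.7)},
    "joint+motion+bone": {"linear": (78.8, 79.4), "knn": (66.5, 70.6)},
}


def bidirectional_gain(unidirectional, bidirectional):
    """Gain of bidirectional over unidirectional distillation, in percentage points."""
    return round(bidirectional - unidirectional, 1)


for setting, protocols in table5_joint.items():
    gains = {name: bidirectional_gain(*pair) for name, pair in protocols.items()}
    print(setting, gains)
```

In every setting the gain under the KNN protocol (2.2, 3.9, and 4.1) is larger than under the linear protocol (1.2, 0.9, and 0.6).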

Qualitative Results: We visualize the learned representation of the proposed approach and compare it with that of the baseline. The t-SNE [van2008visualizing] algorithm is adopted to reduce the dimensionality of the representation. To obtain clearer results, we select only 1/4 of the categories in the NTU-60 dataset for visualization. The final results are illustrated in Figure 4. For both joint and motion modalities, the representation learned by our approach is more compactly clustered in the feature space than that learned by the baseline. This gives the representation a stronger discrimination capability, which helps explain the strong performance of our approach in Table 3.

5 Conclusion

In this work, we presented a novel approach for self-supervised 3D action representation learning. It reformulates cross-modal reinforcement as a bidirectional knowledge distillation problem, where the pairwise similarities between embeddings are utilized to model the modality-specific knowledge. The carefully designed cross-modal mutual distillation module can be well integrated into the existing contrastive learning framework, thus avoiding additional computational overhead. We evaluate the learned representation on three 3D action recognition benchmarks with four widely adopted evaluation protocols. The proposed approach sets a series of new state-of-the-art records on all of them, demonstrating the effectiveness of the cross-modal mutual distillation.

Acknowledgement: This work was supported by the National Natural Science Foundation of China under Contracts U20A20183, 61836011, and 62021001. It was also supported by the GPU cluster built by MCC Lab of Information Science and Technology Institution, USTC.