GaitFi: Robust Device-Free Human Identification via WiFi and Vision Multimodal Learning

Lang Deng^{*}, Jianfei Yang^{*}, Shenghai Yuan, Han Zou, Chris Xiaoxuan Lu, and Lihua Xie

L. Deng, J. Yang, S. Yuan and L. Xie are with the School of Electrical and Electronics Engineering, Nanyang Technological University, Singapore (e-mail: ldeng002@e.ntu.edu.sg; yang0478@e.ntu.edu.sg; syuan003@e.ntu.edu.sg; elhxie@ntu.edu.sg). H. Zou is with the Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA (e-mail: enthalpyzou@gmail.com). C. X. Lu is with the School of Informatics at the University of Edinburgh, United Kingdom (e-mail: xiaoxuan.lu@ed.ac.uk).

^{*} These authors contributed equally to this work. J. Yang is the corresponding author (e-mail: yang0478@e.ntu.edu.sg). This work is supported by NTU Presidential Postdoctoral Fellowship, “Adaptive Multimodal Learning for Robust Sensing and Recognition in Smart Cities” project fund, in Nanyang Technological University, Singapore.

Abstract

As an important biomarker for human identification, human gait can be collected at a distance by passive sensors without subject cooperation, which plays an essential role in crime prevention, security detection and other human identification applications. At present, most research works rely on cameras and computer vision techniques to perform gait recognition. However, vision-based methods are unreliable under poor illumination, which degrades their performance. In this paper, we propose a novel multimodal gait recognition method, namely GaitFi, which leverages WiFi signals and videos for human identification. In GaitFi, Channel State Information (CSI) that reflects the multi-path propagation of WiFi is collected to capture human gaits, while videos are captured by cameras. To learn robust gait information, we propose a Lightweight Residual Convolution Network (LRCN) as the backbone network, and further propose the two-stream GaitFi by integrating WiFi and vision features for the gait retrieval task. GaitFi is trained with the triplet loss and the classification loss on different levels of features. Extensive real-world experiments demonstrate that GaitFi outperforms state-of-the-art gait recognition methods based on WiFi or a camera alone, achieving 94.2% accuracy for human identification among 12 subjects.

Human identification, gait recognition, multimodal learning, WiFi, computer vision

I Introduction

Nowadays, numerous intelligent monitoring systems have been deployed in the public domain to extract biomarker information related to human behavior and identities. With the development of Internet of Things (IoT) sensors and pattern recognition, various human identification technologies come into existence, such as fingerprint recognition [ali2016overview], iris recognition [de2016iris] and face recognition [8614364]. Though these technologies achieve remarkable performances, they still have their own limitations, such as the sensing range of fingerprint and iris recognition and the degrading face recognition due to the mask during the COVID-19 period [wang2021mask]. Different from the prevailing human identification methods, gait is a unique biomarker that can be identified at a distance without human cooperation. The advantages of gait recognition in remote monitoring [bouchrika2018survey] make it essential in crime prevention, forensic identification and social security.

Human gait is defined as the coordinated and cyclic combination of various movements in the walking action that is a unique biomarker for a person [isaac2019trait]. Specifically, the gait information includes the static information of the individual’s appearance and the dynamic information of the person’s walking. Therefore, gait recognition can be achieved by extracting these interrelated salient features. The advantage of gait-based human identification is that gait can not only be captured at a longer distance but is also difficult to imitate, which has attracted many researchers to employ various sensors for gait recognition [yang2019review, zou2018identification].

Existing gait recognition methods mainly rely on cameras [kumar2021gait], wearable devices [yang2019review], and radar [vandersmissen2018indoor]. However, they have distinct limitations due to the characteristics of each sensor modality. For camera-based solutions, the video can be easily affected by environmental conditions such as illumination and occlusion. Some scenarios even forbid cameras for privacy reasons. Wearable devices can be leveraged for human activity [ding2018energy] and gait analytics [marsico2019survey], which allows for higher resolution measurements using multiple sensors, but wearable devices require the cooperation of subjects, which rules out application scenarios such as crime prevention. Many radar-based methods have also been utilized for gait recognition [chen2021attention], which can extract gait information by utilizing the Doppler feature of Frequency-Modulated Continuous Wave (FMCW) signals. However, the disadvantage of radar is the sparsity of the data with low SNR. In contrast, lidar can obtain higher-resolution data than radar, but it is very expensive [benedek2016lidar]. Recently, WiFi has been used to sense human gaits by extracting channel state information (CSI) [wang2016gait, yang2022efficientfi], which has proved to be cost-effective and privacy-preserving. In WiFi sensing, human gaits are reflected by detailed amplitude and phase information of different subcarriers after the WiFi signals are modulated by Orthogonal Frequency Division Multiplexing (OFDM) [xie2015precise]. Since body motions interfere with the propagation paths of the WiFi signals and the gait motions of different subjects differ, the CSI exhibits subject-specific patterns. Since each sensor modality has its pros and cons, is it possible to fuse a few complementary modalities for robust gait recognition?

Here we consider the most common modality, the video camera, which captures a large amount of information, together with WiFi. The reason for choosing WiFi as the second modality is that WiFi-enabled IoT devices are more ubiquitous than lidar and radar. The CSI data extracted from WiFi is robust to illumination, so it complements the vision modality well. As WiFi sensing leverages electromagnetic waves rather than visible light, the WiFi system can still work when the camera is slightly occluded (e.g., by plastic or paper materials), which is another merit.

In this paper, we propose a multimodal device-free human identification system based on gait recognition, namely GaitFi, which can recognize human identities using commercial off-the-shelf (COTS) WiFi-enabled IoT devices and cameras. GaitFi consists of a two-stream network for WiFi and video, and a multimodal fusion module to recognize human gaits. For the WiFi sensing module, we propose a Lightweight Residual Convolution Network (LRCN) that consists of convolution layers and residual blocks to extract spatial and temporal features. For the vision sensing module, we first use the LRCN to obtain frame-level features and then a Long Short-Term Memory (LSTM) network [yu2019review] to extract temporal dynamics. In the modality fusion module, we concatenate the feature vectors from the two modalities to generate a robust gait representation. This concatenated feature vector is mapped to the prediction probability. We apply the cross-entropy loss to the final prediction and the triplet loss to the concatenated feature layer, so that the two losses contribute to robust classification and a metric learning feature space without interfering with each other. By using a lightweight backbone and a simple multimodal learning framework, GaitFi achieves good human identification performance with relatively low computational complexity and relatively short inference time. To demonstrate its effectiveness, we conduct real-world experiments by implementing the system using a pair of WiFi routers, a camera and a mini-PC. The proposed GaitFi achieves a recognition accuracy of 94.2% using two modalities, which significantly outperforms the state-of-the-art methods based on either WiFi or a camera. In the field of gait recognition, GaitFi introduces a feature-level fusion of the visual and WiFi modalities to compensate for the shortcomings of each sensing modality and achieve better performance.

The contributions of this paper are summarized as follows:

We study how vision and WiFi signals contribute to the human gait recognition task, and propose a multimodal human identification system, namely GaitFi. To the best of our knowledge, it is the first work for the WiFi-vision gait recognition method based on multimodal learning.
In GaitFi, we propose a LRCN for WiFi and a boosted LRCN with LSTM for cameras, and then fuse them for deep metric learning. The fusion mechanism enables our system to leverage the complementarity of two modalities for better robustness.
Real-world experiments demonstrate that our GaitFi outperforms the state-of-the-art gait recognition methods based on single WiFi or cameras.

The rest of the paper is organized as follows: Sec. II reviews WiFi and vision-based gait analytics. Sec. III provides the detailed illustration of GaitFi. Sec. IV shows experiment procedure, results, comparison with existing works and ablation study. Sec. V concludes the paper and provides recommendations for future research topics.

Fig. 1: Structure and fusion mechanism of GaitFi system.

II Related Work

II-A WiFi-Based Sensing and Gait Recognition

The WiFi-based gait recognition method uses RF signals from WiFi-enabled devices to determine human identity. The transmitter emits WiFi signals, which are reflected by different body parts of the walking subject and then recorded as CSI data at the receiver [yang2022benchmark]. CSI-based sensing has empowered many applications including occupancy detection [zou2017freedetector], crowd counting [zou2018device, zou2017freecount], human activity recognition [zou2018deepsense, zou2017multiple, zou2019wifi, yang2018carefi, wang2021multimodal], person identification [zou2018identification, wang2022caution], vital sign detection [hu2022resfi], pose estimation [yang2022metafi] and gesture recognition [zou2018robust, yang2019learning, zou2018gesture]. To use WiFi sensing in the real world, current research aims at efficient communication [yang2022efficientfi], model security [yang2022robustsense] and data-efficient training [yang2022autofi].

Recently, research on human identification using WiFi-enabled devices has begun to emerge because of ubiquitous WiFi-enabled IoT devices. This paper focuses on the research of WiFi in the field of gait recognition. A WiFi-based gait feature extraction system named WiFiU [wang2016gait] is proposed by Wang et al. to classify humans with different identities. Zhang et al. [zhang2016wifi] propose WiFi-ID, a WiFi-based gait recognition method that can be used in small offices or smart homes. Zeng et al. [zeng2016wiwho] utilize gait to recognize human identity by measuring the time domain information of WiFi signals. Lv et al. [lv2017wii] propose Wii, which improved gait recognition accuracy by performing autocorrelation on the torso reflection to remove imperfection in spectrograms. Cao et al. [cao2021lightweight] propose a lightweight deep learning algorithm named LW-WiID, which can achieve a relatively high recognition accuracy by extracting the spatial information of subcarriers. CAUTION [wang2022caution] proposes to employ few-shot learning for data-efficient human identification. From the above work, research on WiFi-based single-modal gait recognition is getting more appealing for IoT-enabled human identification.

II-B Vision-Based Gait Recognition

Vision-based solution plays an essential role in gait recognition methods. Johansson et al. [johansson1973visual] use moving light displays and reflectors on different joints of the human body and observe that gait patterns are unique, so that gait can be a biomarker feature that is recognized by vision. With the development of computer vision, vision-based gait recognition methods are gradually gaining widespread attention. Nowadays, vision-based gait recognition methods can be divided into two main categories, template-based and video sequence-based methods.

For the template-based gait recognition method, the gait silhouette contour sequence needs to be obtained using background subtraction [wang2003silhouette]. Then, the resulting gait profile is aligned by cropping and then pixel-level operations are performed to generate a gait template, such as Gait Energy Image (GEI) [han2005individual]. The obtained gait templates can be used to obtain feature representations by machine learning methods [xing2016complete]. After obtaining the gait representation, the similarity between the representation pairs can be measured by metric learning methods [takemura2017input]. Recently, an increasing number of deep learning methods have been applied to template-based gait recognition tasks [wu2016comprehensive, he2018multi].

The video sequence-based gait recognition directly uses the silhouette sequence generated by background subtraction as the input to the deep learning neural network. This method can collect more temporal information, so specialized neural network structures need to be designed to extract such temporal information. Liao et al. [liao2017pose] use a LSTM-based approach to extract temporal information from gait sequences. Chao et al. [chao2019gaitset] propose GaitSet, a network that can blur time information of gait sequence. Lin et al. [lin2021gait] propose to aggregate local temporal and local spatial information for gait recognition.

II-C Multimodal Machine Learning

Multimodal machine learning aims to build models that can process and correlate information from multiple modalities [ngiam2011multimodal]. The motivation for multimodal machine learning comes from the fact that every single modality has its own drawbacks that make them perform sub-optimally. In addition, humans perceive the world in a multimodal way, such as vision, sounds and text, encouraging the existence of multimodal learning. Therefore, when a research question or dataset contains multiple modalities, it is characterized as a multimodal task. Multimodal machine learning involves many research directions including representation, translation, alignment, fusion and co-learning. Fusion is responsible for combining the information of multiple modalities to perform target prediction (i.e., classification or regression). It is one of the earliest research directions of multimodal machine learning and is currently the most widely used one. According to the level of fusion, multimodal fusion can be divided into input-level [li2017pixel], feature-level [ross2005feature, haghighat2016discriminant] and decision-level fusion [chatzis1999multimodal]. For our system GaitFi, the multimodal machine learning method is feature-level fusion. Co-learning is another popular research topic in the multimodal machine learning domain, which can model resource-poor modalities by leveraging knowledge from other resource-rich modalities [rahate2022multimodal]. It achieves this capability by using transfer learning and domain adaptation methods [zou2019consensus]. There is also a type of work in co-learning called co-training [ning2021review], which is responsible for studying how to expand a small number of annotations in multimodal data to obtain more annotation information.

III Method

III-A WiFi-Vision Multimodal Gait Recognition Method

Different from the existing WiFi-based gait recognition methods that simply formulate the problem as a standard classification problem, we formulate it as a gait retrieval task that is more practical in reality. The gait retrieval task is similar to the visual pedestrian ReID task [lin2019improving] and visual gait recognition [chao2019gaitset]. Given gallery samples and probe samples (i.e. test samples), the objective is to find those samples in the probe that have the same identity as the gallery samples. Therefore, the process of gait recognition is to match the test gait sample with existing gallery gait data, which allows users to enlarge the categories easily in practice. Our system uses two modalities, WiFi and vision, to get richer gait information from different levels. As shown in Fig. 2, the two modalities can reflect the gaits of different people and indicate the occupancy condition. Then we introduce the two modalities of gait data.

III-A1 WiFi CSI Modality

Fig. 3: Comparison of raw and denoising data on the 50th subcarrier in the CSI stream. (The left picture is the waveform of the subcarrier before denoising, and the right picture is the waveform of the subcarrier after denoising)

Fig. 4: The CSI data of different subjects and vacant situation.

WiFi signals propagate through multiple paths between the transmitter (TX) and the receiver (RX) of WiFi-enabled IoT devices, and these signals can be scattered and reflected by human motion between TX and RX [yang2018fine]. In wireless communication, the reflection, diffraction and scattering of WiFi signals caused by the physical environment can be described by channel state information (CSI) [yang2013rssi]. Modern WiFi devices use Orthogonal Frequency Division Multiplexing (OFDM) at the physical layer following the IEEE 802.11n/ac standard, which allows multiple transmit and receive antennas for Multiple-Input Multiple-Output (MIMO) communications. The CSI reveals fine-grained characterization of delay, amplitude decay, and multi-path phase-shift effects on each communication subcarrier [yang2018device]. We model the multipath WiFi channel by its channel impulse response $h (τ)$

h (τ) = \sum_{m = 1}^{M} α_{m} e^{j ϕ_{m}} δ (τ - τ_{m}),

(1)

where $M$ denotes the total number of multipath components, $α_{m}$ and $ϕ_{m}$ represent the amplitude and phase of the $m$-th multipath component, respectively, $δ (τ)$ denotes the Dirac delta function, and $τ_{m}$ denotes the time delay of the $m$-th component. However, due to the limited WiFi bandwidth, only clusters of multipath components are distinguishable. In the frequency domain, a sampled version of the signal spectrum on each subcarrier can be obtained from RX, and the CSI measurement on the $i$-th subcarrier can be summarized as a complex number $H_{i}$

H_{i} = | H_{i} | e^{j ∠ H_{i}},

(2)

where $| H_{i} |$ denotes the amplitude attenuation, and $∠ H_{i}$ denotes the phase shift at the $i$-th subcarrier. Due to hardware and environmental variations, the carrier frequency drifts, so the phase information is relatively unreliable [gjengset2014phaser]. Thus, we only use the amplitude information in our system. We employ two TP-Link N750 routers with a modified OpenWrt firmware to collect CSI data [yang2018device]. The modified firmware is equipped with the Atheros CSI tool [xie2015precise] that enables routers to record the packets transmitted over the wireless channel and extract CSI measurements from those packets. The routers are set to run in a 40 MHz channel when operating at 5 GHz, which allows us to extract 114 subcarriers of CSI for each TX-RX pair. At each measurement, the number of CSI streams we can obtain is $N_{total}$

N_{total} = N_{TX} N_{RX} N_{subcarriers},

(3)

where $N_{TX}$ and $N_{RX}$ denote the number of antennas of the transmitting router and the receiving router, respectively, and $N_{subcarriers}$ denotes the number of subcarriers, which is 114 in a 40 MHz channel. For the gait recognition task, the CSI data frames are generated when subjects walk through the Line-of-Sight (LoS) path of WiFi signal propagation (between TX and RX). The gait is unique for a subject as illustrated in Fig. 2 with 3 examples, where we use heatmaps to visualize the CSI data in different situations. Fig. 1(a) shows the heatmap of the CSI data when no subject passes the experimental site. Fig. 1(b) and Fig. 1(c) show the CSI data frames when subjects A and B pass by, respectively. It can be seen intuitively from Fig. 2 that the CSI pattern is unique for different subjects. Therefore, the CSI data extracted from off-the-shelf WiFi routers can be used for gait recognition.
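As a minimal sketch of Eqs. 1 to 3 in plain Python: the frequency response of a multipath channel is the Fourier transform of Eq. 1, $H (f) = \sum_{m} α_{m} e^{j ϕ_{m}} e^{- j 2 π f τ_{m}}$, and its magnitude is the amplitude that GaitFi keeps. The function names and the numeric path parameters are illustrative assumptions; the 1 TX / 3 RX antenna configuration is the one reported in Sec. IV.

```python
import cmath
import math

def channel_response(paths, freq_hz):
    """Frequency response of a multipath channel at one subcarrier frequency.

    `paths` is a list of (amplitude, phase_rad, delay_s) tuples, i.e. the
    (alpha_m, phi_m, tau_m) of Eq. 1. The sum below is the Fourier
    transform of that impulse response.
    """
    return sum(
        a * cmath.exp(1j * (phi - 2 * math.pi * freq_hz * tau))
        for a, phi, tau in paths
    )

def amplitude(h):
    """|H_i| of Eq. 2; the phase, cmath.phase(h), is not used by GaitFi."""
    return abs(h)

def num_csi_streams(n_tx, n_rx, n_subcarriers):
    """N_total of Eq. 3."""
    return n_tx * n_rx * n_subcarriers

# Hypothetical two-path channel: a direct path and a weaker, delayed reflection.
paths = [(1.0, 0.0, 10e-9), (0.4, 1.0, 35e-9)]
h = channel_response(paths, freq_hz=5.0e9)
print(amplitude(h))
print(num_csi_streams(n_tx=1, n_rx=3, n_subcarriers=114))  # 1 * 3 * 114 = 342
```

A single path always yields an amplitude equal to its $α_{m}$, whereas several paths add constructively or destructively depending on their delays, which is why body motion changes the measured amplitude.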

In order to analyze the effect of gait on WiFi CSI, we visualize one of the CSI subcarriers (the 50th out of 114) for analysis. For better resolution, we use the moving average method [isufi2016autoregressive] for denoising, as shown in Fig. 3, where the y-axis is the amplitude attenuation represented by $| H_{i} |$ in Eq. 2 and the x-axis is the packet number which can also be expressed in terms of the length of the received packets. In order to illustrate the correlation between gait and WiFi CSI data, we select two WiFi CSI samples of two subjects and visualize the 50th subcarrier in Fig. 4. It is observed that the presence of subjects leads to obvious CSI variations, and the CSI patterns of the same subject are similar, which illustrates that the CSI can reflect the unique human gait biomarker. This phenomenon provides a factual basis for using WiFi modality to recognize human gait.
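The denoising step can be sketched as a centered moving average over the amplitude series of one subcarrier. The window length is an assumption, since the text does not state the value used, and the window is simply truncated at the edges of the series.

```python
def moving_average(series, window=5):
    """Centered moving average; use an odd window.

    Near the edges the window is truncated to the available samples.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    half = window // 2
    smoothed = []
    for k in range(len(series)):
        lo = max(0, k - half)
        hi = min(len(series), k + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical amplitude readings of one subcarrier with a single noisy spike.
raw = [10.0, 10.0, 16.0, 10.0, 10.0]
print(moving_average(raw, window=3))  # [10.0, 12.0, 12.0, 12.0, 10.0]
```

A larger window removes more noise but also flattens the short gait-induced fluctuations, so the window has to stay well below the duration of one step in packets.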

III-A2 Vision Modality

Computer vision has been applied to many tasks, such as object detection [redmon2016you] and human activity recognition [ahmad2021human]. Recently, vision-based gait recognition has been extensively studied and has achieved remarkable accuracy due to the wide utilization of cameras and the development of computer vision [alzubaidi2021review]. In our system, we use a camera to obtain data on vision modality. The camera is set close to the WiFi receiver to capture video of subjects’ gaits simultaneously. Each video sample consists of a series of frames that contain the continuous temporal data of human gaits. We synchronize the CSI and video data in our system for better multimodal fusion.

III-B Multimodal Learning for WiFi and Vision

Having data from the two modalities, the multimodal learning module accounts for representing and recognizing human gaits. Given a set of $N$ subjects with $M$ samples per subject, we denote the dataset as $D^{T}$

D^{T} = {(x_{w}^{i j}, x_{v}^{i j}), y^{i j}}, i \in [1, N], j \in [1, M],

(4)

where $x_{w}^{i j}$ denotes the $j$ -th sample of the $i$ -th subject in the WiFi modality, $x_{v}^{i j}$ denotes the corresponding sample of vision modality and $y^{i j}$ denotes the ground truth of its human ID. The objective of our GaitFi is to map a test sample to its subject ID, denoted as

y^{i j} = Φ (x_{w}^{i j}, x_{v}^{i j}),

(5)

where $Φ (\cdot)$ is our gait recognition model.

III-B1 WiFi Gait Recognition Module

Block name	LRCN	WiFi-LRCN (only WiFi)
Conv_1	3x3 conv, stride 2, $Ch$ 8	7x21 conv, stride 5, $Ch$ 64
Res-layer_1	$Res \left( \begin{matrix} 3 \times 3, Ch = 8 \\ 3 \times 3, Ch = 8 \end{matrix} \right) \times 2$	$Res \left( \begin{matrix} 3 \times 3, Ch = 64 \\ 3 \times 3, Ch = 64 \end{matrix} \right) \times 2$
Conv_2	3x3 conv, stride 2, $Ch$ 16	3x7 conv, stride 1, $Ch$ 64
Res-layer_2	$Res \left( \begin{matrix} 3 \times 3, Ch = 16 \\ 3 \times 3, Ch = 16 \end{matrix} \right) \times 2$	$Res \left( \begin{matrix} 3 \times 3, Ch = 128 \\ 3 \times 3, Ch = 128 \end{matrix} \right) \times 2$
Pool_1	\	MaxPool2d, kernel=1x2, stride=1x2
Conv_3	3x3 conv, stride 2, $Ch$ 32	3x7 conv, stride 1, $Ch$ 256
Pool_2	\	MaxPool2d, kernel=1x2, stride=1x2
Res-layer_3	$Res \left( \begin{matrix} 3 \times 3, Ch = 32 \\ 3 \times 3, Ch = 32 \end{matrix} \right) \times 2$	$Res \left( \begin{matrix} 3 \times 3, Ch = 512 \\ 3 \times 3, Ch = 512 \end{matrix} \right) \times 2$
FC_1	Linear, 64	Linear, 512
FC_2	\	Linear, 12

TABLE I: Structure of LRCN and WiFi-LRCN. LRCN: A network for extracting features when two modalities are fused. WiFi-LRCN: Optimized LRCN specifically for WiFi modality. ($Res$ denotes the residual block proposed by He et al. [He_2016_CVPR], and $Ch$ represents the channel)

In order to perform gait recognition based on WiFi data $x_{w}^{i j}$, we need to extract spatial features across subcarriers and temporal features across time [sheng2020deep]. To this end, instead of using the prevailing models in the computer vision field, we propose a Lightweight Residual Convolution Network (LRCN), whose main blocks and related parameters are shown in Tab. I. For efficiency, we decrease the model complexity while preserving the capacity. The LRCN includes convolution blocks, residual blocks, batch normalization layers, ReLU layers, a flatten layer and a fully connected layer. The CSI frame input to the network first passes through a convolution block with a kernel of $3 \times 3$, a stride of $2 \times 2$ and 8 output channels. Then a batch normalization layer and a ReLU layer are used for better convergence. To extract features with deeper layers, we design 3 convolution blocks with 8, 16 and 32 channels, each followed by a residual layer (Res-layer in Tab. I). Compared to the classic ResNet-18, our design has fewer parameters and floating-point operations (FLOPs).
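To make the downsampling in the LRCN column of Tab. I concrete, the sketch below tracks the feature-map size through the three stride-2 convolutions. The padding of 1 and the treatment of a CSI frame as a 3-channel $114 \times 1600$ input (see Sec. IV) are assumptions: the text gives only kernel sizes, strides and channel counts, and the residual layers are assumed to preserve the spatial size.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def lrcn_feature_shapes(height, width, channels=(8, 16, 32)):
    """(channels, height, width) after Conv_1, Conv_2 and Conv_3 of the LRCN."""
    shapes = []
    for ch in channels:
        height, width = conv_out(height), conv_out(width)
        shapes.append((ch, height, width))
    return shapes

# 114 subcarriers x 1600 packets per antenna pair.
names = ("Conv_1", "Conv_2", "Conv_3")
for name, shape in zip(names, lrcn_feature_shapes(114, 1600)):
    print(name, shape)
# Conv_1 (8, 57, 800)
# Conv_2 (16, 29, 400)
# Conv_3 (32, 15, 200)
```

Under these assumptions each convolution block roughly halves both axes while doubling the channel count.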

In the LRCN, the convolution layers and residual layers extract part of the gait features, where each residual layer (Res-layer in Tab. I) stacks two residual blocks proposed by He et al. [He_2016_CVPR]. The residual block can be expressed as

F (x) = C (x) + x,

(6)

where $x$ denotes the input feature, $F (\cdot)$ is the residual block and $C (\cdot)$ is the convolution layers. The residual design mitigates the degradation problem that arises as deep neural networks grow deeper. Batch normalization layers further alleviate the vanishing gradient problem and help attain better performance [ioffe2015batch]. The ReLU activation layers add non-linearity for better model capacity. As the CSI patterns are complicated and non-linear, we use all these layers in the LRCN. After feature extraction, we flatten the features and map them into a 64-dimensional feature space with the fully connected layer. We denote this 64-dimensional feature as $z_{w}^{i j}$

z_{w}^{i j} = F_{w}^{θ_{w}} (x_{w}^{i j}),

(7)

where $F_{w} (\cdot)$ denotes the forward function of the LRCN model, parameterized by $θ_{w}$. In Tab. I, we also present a WiFi-LRCN network with more parameters, which is used only for comparison in the experiments. The WiFi-LRCN achieves better performance in the single-modality setting, but its added complexity does not further improve the multimodal performance.
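The skip connection of Eq. 6 can be illustrated with a toy example on flat feature vectors. Here `conv_layers` is a hypothetical stand-in for the convolution branch $C (\cdot)$, not the actual layers of the LRCN.

```python
def residual_block(x, conv_layers):
    """F(x) = C(x) + x, applied element-wise to a flat feature vector."""
    cx = conv_layers(x)
    if len(cx) != len(x):
        raise ValueError("C(x) must keep the shape of x for the skip connection")
    return [c + xi for c, xi in zip(cx, x)]

x = [0.5, -1.0, 2.0]

# If the convolution branch outputs zeros, the block reduces to the identity,
# so stacking more blocks cannot make the representation worse by itself.
print(residual_block(x, lambda v: [0.0] * len(v)))  # [0.5, -1.0, 2.0]

# Otherwise the branch only has to model the residual on top of x.
print(residual_block([1.0, 2.0], lambda v: [2 * vi for vi in v]))  # [3.0, 6.0]
```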

Step 1: Training Phase
Module: the LRCN for WiFi modality $F_{w}^{θ_{w}}$, the LRCN for vision modality $F_{v}^{θ_{v}}$, the LSTM $G_{l}^{θ_{l}}$, the fusing function $Ψ (\cdot, \cdot)$, the mapping function $L$
Input: labeled samples $\{(x_{w}^{i j}, x_{v}^{i j}), y^{i j}\}_{i = 1, j = 1}^{N, M}$
BEGIN:
while epoch $<$ total epoch do
    Obtain the fusing feature via $z_{u}^{i j} = Ψ (F_{w}^{θ_{w}} (x_{w}^{i j}), G_{l}^{θ_{l}} (F_{v}^{θ_{v}} (x_{v}^{i j})))$
    Map to another feature space via $z_{r}^{i j} = L (z_{u}^{i j})$
    Calculate $L_{triplet}$ from $z_{u}^{i j}$
    Calculate $L_{ce}$ from $z_{r}^{i j}$
    Update $θ_{w}, θ_{v}, θ_{l}$ by minimizing $L_{ce} + α L_{triplet}$
end while
Output: the model parameters $θ_{w}, θ_{v}, θ_{l}$
END.

Step 2: Testing Phase
Input: an unlabeled sample in the probe $(x_{w}, x_{v})$, labeled samples in the gallery $\{(x_{w}^{i j}, x_{v}^{i j}), y^{i j}\}_{i = 1, j = 1}^{N, M}$
BEGIN:
Obtain the fusing feature vector of the probe sample: $z_{u} = Ψ (F_{w}^{θ_{w}} (x_{w}), G_{l}^{θ_{l}} (F_{v}^{θ_{v}} (x_{v})))$
for $i \in [1, N]$, $j \in [1, M]$ do
    $z_{u}^{i j} = Ψ (F_{w}^{θ_{w}} (x_{w}^{i j}), G_{l}^{θ_{l}} (F_{v}^{θ_{v}} (x_{v}^{i j})))$
end for
for $i \in [1, N]$ do
    $d^{i} = \sum_{j = 1}^{M} \| z_{u} - z_{u}^{i j} \|^{2}$
end for
$y \leftarrow y^{i^{*}}$, where $i^{*} = \arg \min_{i} d^{i}$
Output: the label $y$ of the testing sample
END.

Algorithm 1 The algorithm of GaitFi system.
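The testing phase of Algorithm 1 can be sketched in plain Python as follows. The fused features are assumed to be precomputed, and the gallery layout (a dictionary from subject ID to a list of feature vectors), the function names and the toy 2-dimensional features are illustrative assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def identify(probe_feature, gallery):
    """Return the subject whose gallery features are closest to the probe.

    For each subject i the score is d^i, the sum of squared distances to
    all of its gallery features; the prediction is the subject with the
    smallest score.
    """
    scores = {
        subject: sum(squared_distance(probe_feature, z) for z in features)
        for subject, features in gallery.items()
    }
    return min(scores, key=scores.get)

# Toy gallery with two subjects and two fused features each.
gallery = {
    "A": [[0.0, 0.0], [0.0, 1.0]],
    "B": [[5.0, 5.0], [6.0, 5.0]],
}
print(identify([0.2, 0.4], gallery))  # A
print(identify([5.5, 5.0], gallery))  # B
```

Because $d^{i}$ is a sum rather than a mean, the comparison is only fair when every subject has the same number $M$ of gallery samples, as in Algorithm 1. New subjects can be enrolled by adding entries to the gallery without retraining the classifier head.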

III-B2 Visual Gait Recognition Module

The video data is composed of a sequence of consecutive gait image frames, denoted as $x_{v}^{i j}$ in Eq. 4. As the video sequence contains rich temporal information, we further utilize a Long Short-Term Memory (LSTM) network after the LRCN. Specifically, we first use the LRCN to extract frame-level features, obtaining a 64-dimensional feature for each frame. Since the LSTM captures the dependencies between consecutive frames, we feed the frame-level features in chronological order into an LSTM with 64 hidden units, generating the video features that are denoted as $z_{v}^{i j} \in R^{64}$

z_{v}^{i j} = G_{l}^{θ_{l}} (F_{v}^{θ_{v}} (x_{v}^{i j})),

(8)

where $F_{v} (\cdot)$ denotes the LRCN model acting on vision modality, parameterized by $θ_{v}$ , and $G_{l} (\cdot)$ denotes the LSTM model, parameterized by $θ_{l}$ .

III-B3 Modality Fusion and Learning Objectives

After extracting the WiFi CSI feature vector $z_{w}^{i j}$ and image sequence feature vector $z_{v}^{i j}$ through the WiFi module and vision module, we propose a modality fusion mechanism to fuse two modalities. Fig. 1 shows the whole process from feature extraction to modal fusion. We concatenate the two feature vectors into a multimodal feature vector. In this way, we associate the feature information of the two modalities and obtain a higher-dimensional and more discriminative feature space. After this, we use a fully connected layer and softmax function to map the multimodal feature to a $K$ -dimensional feature, where $K$ denotes the number of subjects. The multimodal feature vector and the $K$ -dimensional feature vector are denoted as $z_{u}^{i j}$ and $z_{r}^{i j}$ , respectively. The whole process of modality fusion can be formulated as

\left\{ \begin{matrix} z_{u}^{i j} & = Ψ (z_{w}^{i j}, z_{v}^{i j}), \\ z_{r}^{i j} & = L (z_{u}^{i j}), \end{matrix} \right.

(9)

where $Ψ (\cdot, \cdot)$ denotes the operation of feature concatenation, and $L (\cdot)$ denotes the linear mapping operation for a fully connected layer. After obtaining the multimodal feature $z_{u}^{i j}$ and the $z_{r}^{i j}$ through modality fusion, we design the objectives to train our model in an end-to-end manner. For $z_{u}^{i j}$ , we aim to obtain a metric feature space where the samples from the same subject can cluster, so we use the triplet loss to implement similarity calculation between samples [hermans2017defense]. The triplet loss can pull the samples of the same category close while pushing those of different categories away, which is formulated as $L_{t r i p l e t}$

L_{triplet} = \max (\| z_{u}^{i j} - z_{u}^{i m} \|^{2} - \| z_{u}^{i j} - z_{u}^{p n} \|^{2} + η, 0),

(10)

where $i \neq p$ , $j \neq m \neq n$ , $max (\cdot, \cdot)$ denotes the function of taking the maximum value, $z_{u}^{i j}, z_{u}^{i m}, z_{u}^{p n}$ are the multimodal features of three samples, and $η$ denotes the manually set margin, which is set to 0.2 empirically. The second objective is the normal cross-entropy loss for a $K$ -way classification. To this end, we obtain the softmax outputs of $z_{r}^{i j}$ in each dimension $k$ , denoted as $s_{(k)}^{i j}$

s_{(k)}^{i j} = \frac{exp (z_{r (k)}^{i j})}{Σ_{c = 1}^{K} exp (z_{r (c)}^{i j})}, k \in [1, K],

(11)

where $exp (\cdot)$ denotes the exponential function, $z_{r (k)}^{i j}$ is the value of the $k$ -th dimension of the $K$ -dimensional feature for the $j$ -th sample of the $i$ -th class. Then we calculate cross-entropy loss [shore1981properties]:

L_{c e} = - Σ_{k = 1}^{K} y_{o (k)}^{i j} log s_{(k)}^{i j},

(12)

where $y_{o (k)}^{i j}$ is the value of the $k$-th dimension of the one-hot label $y_{o}^{i j}$. The final objective $L_{total}$ is written as

L_{t o t a l} = L_{c e} + α L_{t r i p l e t},

(13)

where $α$ controls the weight of the metric learning loss in the total loss. The loss is minimized via backpropagation by updating $θ_{w}$ , $θ_{v}$ and $θ_{l}$ . At test time, identification is based on metric measurement: the most similar subject cluster is found via Euclidean distance. The training and testing of our system are summarized in Algorithm 1.
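As a rough illustration of Eqs. 10 to 13, the objective can be sketched in plain Python on small feature vectors. This is not the authors' PyTorch implementation: the function names are hypothetical, features are plain lists, and the softmax subtracts the maximum logit purely for numerical stability (which leaves Eq. 11 unchanged). The defaults use the margin $η = 0.2$ stated above and the $α = 0.001$ reported in the hyper-parameter study.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2 between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Eq. 10: anchor and positive share a subject, negative does not."""
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(gap + margin, 0.0)


def softmax(logits):
    """Eq. 11, shifted by the max logit for numerical stability."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Eq. 12 for a one-hot label whose hot index is `label`."""
    return -math.log(softmax(logits)[label])


def total_loss(logits, label, anchor, positive, negative,
               alpha=0.001, margin=0.2):
    """Eq. 13: cross-entropy plus alpha-weighted triplet loss."""
    return cross_entropy(logits, label) + alpha * triplet_loss(
        anchor, positive, negative, margin)
```

With such a small $α$, the cross-entropy term dominates the total, and the triplet term acts as a light regularizer on the fused feature space.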

Fig. 4(a): The routers (TP-Link N750) and the camera (Intel RealSense).

Fig. 6: Vision data frame preprocessing: first use GMM to get silhouettes, then align and crop.

IV Experiment

IV-A Setup and Data Collection

System setup. To evaluate the GaitFi system, we use two commercial TP-LINK N750 routers, one as the WiFi transmitter and the other as the WiFi receiver, to acquire CSI data, and an Intel RealSense camera to acquire vision data for human gaits. The routers and the camera are shown in Fig. 4(a). The testbed is set up in an indoor environment, as shown in Fig. 4(b), where the camera and the receiver router are on one side of the photo, while the transmitter router is placed on the other side. The receiver and the RealSense camera are connected to the same mini-PC for synchronization and data annotation. The WiFi routers run at 5GHz with 40MHz bandwidth, and their firmware is upgraded as described in Sec. III-A1 to collect 114 subcarriers of CSI data for each TX-RX pair. The receiver is equipped with 3 antennas while the transmitter is equipped with 1 antenna. The distance between the transmitter and the receiver is $2.1 m$ . Fig. 4(c) is a top view of the testbed layout, where the sensing devices and the facilities are illustrated.

Data collection. We collect a dataset on the above platform to evaluate the GaitFi system. We invite 12 volunteers with heights between $1.55 m$ and $1.85 m$ as subjects; their genders and heights are shown in Tab. II. This dataset is used in the comparative experiments between GaitFi and other gait recognition methods. The dashed box in Fig. 4(c) marks the area where the subjects walk, and the walking direction of each subject is perpendicular to the line of sight (LoS) of the two routers. Walking from one side to the other side is recorded as one sample, and 30 samples (i.e., 15 back and forth) are collected for each subject. The WiFi sensor and the camera simultaneously record $2 s$ of gait information for each walk. In this manner, we obtain gait samples of 12 different groups (i.e., subjects), and each group contains 30 WiFi CSI frames and the corresponding gait videos. For each subject, 20 samples serve as the training set, i.e., the gallery set, while the remaining 10 samples are used as the probe set. Frames are extracted from the gait videos at an interval of $0.035 s$ to form the original visual gait frame sequence, and the size of each frame is $640 \times 480$ pixels. For the CSI data obtained by the WiFi sensor, the sampling rate of the receiver is 800 packets/s and the sensing time is $2 s$ , so a WiFi CSI data frame has 1600 packets. Because the transmitter router has 1 antenna and the receiver router has 3 antennas, by Eq. 3 the size of each WiFi CSI data frame is $3 \times 114 \times 1600$ .
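The gallery/probe split implies a retrieval-style evaluation: each probe feature is matched against the gallery features by Euclidean distance. The sketch below is a minimal, hypothetical version of that step using a plain nearest-neighbor rule; the function names and the toy two-dimensional features are illustrative assumptions, not part of the GaitFi code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(probe_feature, gallery):
    """Return the subject id of the gallery feature closest to the probe.

    `gallery` is a list of (subject_id, feature) pairs.
    """
    subject_id, _ = min(gallery, key=lambda item: euclidean(probe_feature, item[1]))
    return subject_id


def retrieval_accuracy(probes, gallery):
    """Fraction of (subject_id, feature) probes matched to the right subject."""
    hits = sum(identify(feature, gallery) == subject_id
               for subject_id, feature in probes)
    return hits / len(probes)


# Toy example with made-up 2-D features for two subjects.
toy_gallery = [(1, [0.0, 0.0]), (1, [0.1, 0.0]), (2, [1.0, 1.0])]
toy_probes = [(1, [0.05, 0.1]), (2, [0.9, 1.2])]
toy_accuracy = retrieval_accuracy(toy_probes, toy_gallery)
```

In the setting described above, the gallery would hold 12 × 20 fused features and the probe set 12 × 10.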

Identity label	Gender	Height ( $m$ )
#1	Male	1.80
#2	Male	1.74
#3	Male	1.78
#4	Male	1.77
#5	Female	1.58
#6	Female	1.68
#7	Female	1.70
#8	Female	1.63
#9	Male	1.70
#10	Male	1.78
#11	Male	1.75
#12	Female	1.69

TABLE II: Statistics of subjects in our dataset.

Data pre-processing and implementation details. Before being fed into the end-to-end training model shown in Fig. 1, the data collected in Sec. IV-A is preprocessed. For each raw WiFi CSI data frame, we first remove all NaN (Not a Number) values, which are caused by packet loss in the CSI data. The CSI data is then normalized and sampled to a size of $3 \times 114 \times 500$ , which is the input of the WiFi sensor module. For the vision data, the frame images extracted from the camera contain too much redundant information, which hinders the extraction of gait features. As shown in Fig. 6, we first use a Gaussian mixture model (GMM) for background subtraction to get a binarized gait silhouette [wang2003silhouette]. Then we crop and align the silhouettes, which is the standard pipeline for vision-based gait recognition methods [takemura2018multi]. In this process, we discard the silhouettes that do not contain any person. If the video sequence is shorter than 32 frames, we repeat the last frame of the sequence to pad it to 32 frames. The model structure is given in Tab. I. The learning scheme of GaitFi is implemented in PyTorch, and the model is trained on one NVIDIA GTX 1660Ti. The Adam optimizer is used for better convergence. The batch size is set to 32 with a learning rate of $10^{- 3}$ , and the model is trained for a total of 30 epochs.
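The pre-processing steps can be sketched as follows. The text does not specify how NaN values are removed, which normalization is used, or how 1600 packets are reduced to 500, so this sketch assumes (hypothetically) that packets containing NaN are dropped, that min-max normalization is applied over the frame, and that 500 time steps are picked at evenly spaced indices. Each packet is represented as a flat list of CSI amplitudes, and all function names are illustrative.

```python
import math


def drop_nan_packets(packets):
    """Assumption: discard every packet that contains at least one NaN."""
    return [p for p in packets if not any(math.isnan(v) for v in p)]


def min_max_normalize(packets):
    """Assumption: scale all values of the frame into [0, 1]."""
    values = [v for p in packets for v in p]
    low, high = min(values), max(values)
    scale = (high - low) or 1.0  # avoid division by zero for constant frames
    return [[(v - low) / scale for v in p] for p in packets]


def resample(packets, target_len=500):
    """Assumption: keep `target_len` packets at evenly spaced time indices."""
    if not packets:
        raise ValueError("cannot resample an empty frame")
    n = len(packets)
    return [packets[(i * n) // target_len] for i in range(target_len)]


def pad_frames(frames, target_len=32):
    """Repeat the last silhouette until the sequence has `target_len` frames.

    Sequences that are already long enough are returned unchanged, since
    the text only describes the short case.
    """
    if not frames:
        raise ValueError("cannot pad an empty sequence")
    missing = max(target_len - len(frames), 0)
    return list(frames) + [frames[-1]] * missing
```

For example, a 1600-packet frame with a few lost packets would pass through `drop_nan_packets`, `min_max_normalize` and `resample` in turn to yield 500 time steps, matching the last dimension of the $3 \times 114 \times 500$ input.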

Method	Modality	Accuracy (%)
BeAware [jia2020beaware]	WiFi	73.3
CSAR [wang2018channel]	WiFi	81.7
DeepSense [zou2018deepsense]	WiFi	85.0
CNN-LB [wu2016comprehensive]	Vision	68.3
PTSN [liao2017pose]	Vision	88.3
GaitSet [chao2019gaitset]	Vision	92.5
WiFi-LRCN	WiFi	90.8
LRCN	Vision	69.2
LRCN+LSTM	Vision	90.8
Ours (GaitFi)	WiFi+Vision	94.2

TABLE III: Comparisons on real-world experiments.

IV-B Overall Evaluation

To evaluate the performance of the GaitFi system on the human identification task, we apply existing WiFi-based and vision-based gait recognition methods to our dataset. For the WiFi modality, we compare our method with recent WiFi-based human sensing methods, including BeAware [jia2020beaware], CSAR [wang2018channel] and DeepSense [zou2018deepsense]. For the vision modality, we compare with the vision-based gait recognition methods CNN-LB [wu2016comprehensive], PTSN [liao2017pose] and GaitSet [chao2019gaitset]. As shown in Tab. III, our two-modality method achieves the highest accuracy of 94.2%. In comparison, the recognition accuracy of BeAware is 73.3%, since it only uses WiFi and a simple CNN module. CSAR [wang2018channel], which consists of 4 LSTM modules, reaches a recognition accuracy of 81.7% on the WiFi modality. DeepSense [zou2018deepsense] combines CNN and LSTM to process WiFi CSI data, and its recognition accuracy reaches 85.0%. For the vision modality, we first evaluate CNN-LB [wu2016comprehensive], which contains CNN feature extractors with an MLP (multi-layer perceptron) classifier, and its recognition accuracy is 68.3%. PTSN [liao2017pose], proposed by Liao et al., is a representative sequence-based method that uses LSTM for video gait recognition, and its recognition accuracy is 88.3%. Then we compare our method with the state-of-the-art vision-based solution, GaitSet [chao2019gaitset], which has outstanding recognition accuracy on the public gait dataset CASIA-B. GaitSet achieves an accuracy of 92.5% when applied to the vision modality alone on our dataset. Since the illumination condition in the lab is not ideal, the vision-based methods may be affected, which degrades their performance.

We also investigate our backbone network using different combinations of modalities and network structures. When we use the optimized lightweight residual convolution network WiFi-LRCN shown in Tab. I for the WiFi modality, the accuracy reaches 90.8%. For the vision modality, if we only use LRCN to extract the features of each frame and perform element-wise addition to get the gait features, the recognition accuracy is only 69.2%. The reason is that element-wise addition at the frame level ignores the correlation between consecutive frames in a sequence, which is important for gait. To extract sequence-level features, we apply an LSTM to the output features of LRCN, achieving 90.8% accuracy. Although the lightweight backbone network does not match the complete vision-based solution GaitSet (92.5%), it saves computing resources and is more efficient in identity inference. By fusing the WiFi and vision modalities, our GaitFi system achieves an accuracy of 94.2%, which demonstrates the advantage of multimodal sensing: GaitFi learns gait features from the vision and WiFi modalities at the same time and achieves higher recognition accuracy than either single modality.

Method	Modality	Accuracy (%)
BeAware [jia2020beaware]	WiFi	71.1
CSAR [wang2018channel]	WiFi	74.4
DeepSense [zou2018deepsense]	WiFi	80.0
CNN-LB [wu2016comprehensive]	Vision	57.8
PTSN [liao2017pose]	Vision	62.2
GaitSet [chao2019gaitset]	Vision	76.7
WiFi-LRCN	WiFi	83.3
LRCN	Vision	58.9
LRCN+LSTM	Vision	68.9
Ours (GaitFi)	WiFi+Vision	85.6

TABLE IV: Experimental results under poor illumination.

IV-C Illumination Robustness

To study the robustness of the GaitFi system, we select 6 subjects to conduct experiments in a scene with poor illumination conditions. 40 gait samples are collected from each subject, 25 of which are used as the training set and the gallery set, and the other 15 as the probe set. The results are shown in Tab. IV. Methods based on the vision modality perform poorly: CNN-LB achieves 57.8%, and PTSN only attains 62.2%. Even the state-of-the-art vision solution, GaitSet, only achieves 76.7%. In contrast, the WiFi modality is more robust against poor illumination: BeAware, CSAR and DeepSense achieve 71.1%, 74.4%, and 80.0%, respectively. By utilizing both the WiFi and vision modalities, our GaitFi system achieves the best accuracy of 85.6%. The results illustrate that the CSI data extracted from WiFi complements the vision modality well and enhances the robustness of our system against poor illumination.

IV-D Ablation Study

IV-D1 Modality Comparison

We study the importance of the WiFi and vision modalities when the GaitFi system conducts gait recognition. As shown in Tab. V, when we only use the WiFi sensor module branch in Fig. 1, the accuracy is only 75.0%, whereas the recognition accuracy is 90.8% when only the vision sensing module branch is used for inference. Both single-modality accuracies are lower than the 94.2% achieved by the whole GaitFi system. These results validate that fusing the features of the WiFi and vision modalities achieves higher recognition accuracy. The confusion matrices in Fig. 7 further demonstrate the superiority of our system: the single-modality methods suffer from confusion caused by similar gender and height. For the vision modality, wrong predictions are more likely to occur among same-gender subjects, as shown in Fig. 7(c). For instance, samples of subject #1 (male), subject #3 (male), subject #6 (female), and subject #10 (male) are wrongly classified as subject #9 (male), subject #4 (male), subject #5 (female), and subject #11 (male), respectively. Moreover, similar heights or statures of subjects may also affect the accuracy of the WiFi modality in inferring human identity. In Fig. 7(b), when using only the WiFi modality, subjects #1, #2 and #3, who have similar statures, are prone to confuse the model. Although the vision modality produces fewer misclassified samples than the WiFi modality, some of the misclassifications that occur in the vision modality, such as those between subjects #5 and #6, do not occur with the WiFi modality. Therefore, the two modalities complement each other for more robust gait recognition. The multimodal result in Fig. 7(a) further demonstrates the better robustness of our system as well as the soundness of using two modalities, WiFi and vision, to sense human gait.

WiFi module	Vision module	Accuracy (%)	Inference time (ms)
✓	✓	94.2	86.3
✓		75.0	43.1
	✓	90.8	67.6

TABLE V: Ablation study of different modality.

IV-D2 Inference Time Analysis

To investigate the impact of multimodal learning on the inference time of the GaitFi system, we measure the inference time for one sample using the whole GaitFi system, the WiFi sensing module, and the vision sensing module, respectively. The experiment runs on a single NVIDIA GTX 1660Ti GPU, and the results are shown in Tab. V: the inference time is $86.3$ ms for the whole GaitFi system, $43.1$ ms for the WiFi sensing module alone and $67.6$ ms for the vision sensing module alone. The whole system is thus $18.7$ ms slower than the vision module alone and faster than the $110.7$ ms sum of the two single-module times. These results show that the multimodal learning of GaitFi adds only a modest time cost, which is acceptable in real-world applications.

IV-D3 Fusion Mechanism Analysis

Fusion mechanism	Highest accuracy (%)
Concatenation	94.2
Element-wise addition	90.0

TABLE VI: The effect of fusion mechanism.

We conduct a comparative experiment on the impact of the fusion mechanism on recognition accuracy. In addition to feature concatenation in Fig. 1, another method is to directly add the two feature vectors in each dimension numerically. In this experiment, the feature vectors extracted by the two modules are 64-dimensional, and the feature vector after the element-wise addition is still 64-dimensional, which is used to calculate the triplet loss and map to the $K$ -dimensional feature space to calculate the cross-entropy loss. The vector of features obtained by element-wise addition can be represented by $z_{a}^{i j}$

z_{a}^{i j} = z_{w}^{i j} + z_{v}^{i j} .

(14)

Tab. VI shows the impact of the two fusion mechanisms on the recognition accuracy, where the mean recognition accuracy of element-wise addition across three runs is only 90.0%, lower than that of concatenation. These results illustrate that feature-level concatenation yields better recognition performance than element-wise addition. The reason might be that the higher-dimensional feature space has better discriminability for metric learning using the triplet loss.
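The two fusion mechanisms compared here can be written in a few lines. The sketch below uses plain Python lists and hypothetical function names; it also shows one simple way in which addition can discard information that concatenation keeps, which illustrates (but does not prove) the explanation offered above.

```python
def fuse_concat(z_w, z_v):
    """Eq. 9: concatenate WiFi and vision features (64 + 64 -> 128 dims)."""
    return list(z_w) + list(z_v)


def fuse_add(z_w, z_v):
    """Eq. 14: element-wise addition (64 + 64 -> 64 dims)."""
    if len(z_w) != len(z_v):
        raise ValueError("element-wise addition needs equal-length features")
    return [w + v for w, v in zip(z_w, z_v)]


# Two different (WiFi, vision) feature pairs, using toy 2-D vectors.
pair_a = ([1.0, 0.0], [0.0, 1.0])
pair_b = ([0.0, 1.0], [1.0, 0.0])

# Addition maps both pairs to the same fused vector ...
added_collide = fuse_add(*pair_a) == fuse_add(*pair_b)
# ... while concatenation keeps them distinct.
concat_distinct = fuse_concat(*pair_a) != fuse_concat(*pair_b)
```

With 64-dimensional inputs, `fuse_concat` returns a 128-dimensional vector and `fuse_add` a 64-dimensional one, matching the dimensions described above.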

IV-D4 Loss Function Analysis

Cross-entropy loss ( $L_{c e}$ )	Triplet loss ( $L_{t r i p l e t}$ )	Accuracy (%)
✓	✓	94.2
✓		89.2
	✓	8.3

TABLE VII: Ablation experiment of two losses.

Fig. 8: The impact of the hyper-parameter $α$ on recognition accuracy.

The GaitFi system uses two loss functions, one for classification and one for metric learning. To test the effectiveness of the two losses, we conduct ablation experiments on the loss functions, and the results are shown in Tab. VII. When only the cross-entropy loss is used, the recognition accuracy is 89.2%, while the accuracy with only the triplet loss is 8.3%; we observe that training with the triplet loss alone does not converge. In contrast, GaitFi achieves an accuracy of 94.2% by using both loss functions. The cross-entropy loss helps the model construct a discriminative feature space, and the triplet loss further refines the features so that samples of the same subject cluster together. This shows that the two loss functions together drive the model to learn a robust feature space, leading to better performance.
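A quick sanity check puts the numbers in Tab. VII in context. Assuming the accuracy is computed over the 12 × 10 = 120 probe samples described in Sec. IV-A (an assumption, since the text does not state the denominator), the short calculation below converts each reported accuracy into a count of correct probes and compares it with the chance level for 12 subjects. The helper names are hypothetical.

```python
def chance_level(n_classes):
    """Expected accuracy (%) of uniform random guessing over n_classes."""
    return 100.0 / n_classes


def correct_count(accuracy_percent, n_probe):
    """Number of correct probes implied by a reported accuracy."""
    return round(accuracy_percent / 100.0 * n_probe)


N_SUBJECTS = 12
PROBES_PER_SUBJECT = 10
N_PROBE = N_SUBJECTS * PROBES_PER_SUBJECT  # assumed evaluation set size

reported = {"both losses": 94.2, "cross-entropy only": 89.2, "triplet only": 8.3}
implied = {name: correct_count(acc, N_PROBE) for name, acc in reported.items()}

for name, count in implied.items():
    print(f"{name}: {count}/{N_PROBE} correct")
print(f"chance level: {chance_level(N_SUBJECTS):.1f}%")
```

Under this assumption, the triplet-only result corresponds to chance-level accuracy for 12 subjects, which is consistent with the observation that this model does not converge.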

IV-D5 Hyper-parameter Sensitivity

In Eq. 13, we introduce the hyper-parameter $α$ , which adjusts the ratio between the cross-entropy loss and the triplet loss. To investigate the effect of $α$ on the accuracy, we evaluate different values of $α$ and plot the results in Fig. 8. A larger $α$ decreases performance, as the triplet loss may hinder the convergence of the cross-entropy loss, whereas too small an $α$ makes the triplet loss ineffective for feature learning. The best accuracy is achieved at $α = 0.001$ , which is the value used in our experiments.

V Conclusion

In this paper, we propose a robust human identification system based on gait recognition, which fuses WiFi signals from commercial IoT devices with videos captured by a camera through a novel deep learning method. We first develop a multimodal sensing platform that can simultaneously acquire WiFi CSI data from WiFi-enabled commercial off-the-shelf IoT devices and videos from cameras. Based on residual connections, we propose LRCN, a lightweight residual convolution network, to extract representative features from WiFi CSI data frames. For the vision modality, a combination of LRCN and LSTM networks is used to extract representative features from visual image sequences. The extracted features of the two modalities are concatenated, and the system then performs metric learning by optimizing the triplet loss and the cross-entropy loss. The system makes predictions by finding the nearest neighbor of the test sample in the feature space. The experiments are conducted in the real world. According to the experimental results, the GaitFi system achieves 94.2% recognition accuracy, outperforming the compared single-modal gait recognition methods based on WiFi or a camera.
