FusePose: IMU-Vision Sensor Fusion in Kinematic Space for Parametric Human Pose Estimation

Yiming Bao, Xu Zhao*, Dahong Qian*

X. Zhao and D. Qian (marked *) are co-corresponding authors. Y. Bao and D. Qian are with the School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China (e-mail: yiming.bao@sjtu.edu.cn; dahong.qian@sjtu.edu.cn). X. Zhao is with the Department of Automation, Shanghai Jiao Tong University, Shanghai, China (e-mail: zhaoxu@sjtu.edu.cn). This work has been supported by the NSFC grants 62176156 and Deepwise Healthcare Joint Research Lab, Shanghai Jiao Tong University.

Abstract

3D human pose estimation still faces challenging problems, such as poor performance caused by occlusion and self-occlusion. Recently, IMU-vision sensor fusion has been regarded as valuable for solving these problems. However, previous research on the fusion of heterogeneous IMU and vision data fails to adequately utilize either IMU raw data or reliable high-level vision features. To facilitate more efficient sensor fusion, in this work we propose a framework called FusePose under a parametric human kinematic model. Specifically, we aggregate different information from IMU and vision data and introduce three distinctive sensor fusion approaches: NaiveFuse, KineFuse and AdaDeepFuse. NaiveFuse serves as a basic approach that only fuses simplified IMU data and the estimated 3D pose in Euclidean space. In kinematic space, KineFuse integrates the calibrated and aligned IMU raw data with converted 3D pose parameters. AdaDeepFuse further develops this kinematic fusion process into an adaptive and end-to-end trainable form. Comprehensive experiments with ablation studies demonstrate the rationality and superiority of the proposed framework. The performance of 3D human pose estimation is improved compared to the baseline result. On the Total Capture dataset, KineFuse surpasses the previous state-of-the-art method that uses IMUs only for testing by 8.6%. AdaDeepFuse surpasses the state-of-the-art method that uses IMUs for both training and testing by 8.5%. Moreover, we validate the generalization capability of our framework through experiments on the Human3.6M dataset.

3D pose estimation, Human kinematic model, Sensor fusion, IMUs

I Introduction

Fig. 1: Self-occlusions introduce challenge in estimating 3D human pose from only vision data. IMUs can provide extra occlusion-free information. Here ground truth poses are shown for clarity.

3D human pose estimation is one of the most important problems in computer vision and is closely related to human motion analysis, action recognition, human-computer interaction and so on [34, 51, 9, 1, 27]. The common solution is to predict the 3D keypoint coordinates of a predefined human skeleton from single-view [32] or multi-view RGB images [18, 39]. It is widely known that occlusion and self-occlusion in images introduce stochastic estimation errors when extracting image features. As shown in Figure 1, occlusion is inevitable in vision data. It may significantly reduce performance and result in unreasonable human poses [54, 22].

To cope with this problem, recently an increasing number of approaches attempt to integrate extra data from other sensors, such as IMUs [53, 44]. IMUs that are attached to human body can provide occlusion-free local information, which is valuable to improve the performance of 3D human pose estimation. Despite this, how to efficiently combine the data from vision and IMUs is intricate and challenging to explore. The main reason is that the information collected from cameras and IMU sensors is heterogeneous and hard to collaborate with each other in a fusion framework.

Some previous methods partially alleviate the difficulty of IMU-vision fusion from two different aspects, but each has its own shortcomings. The first category of approaches [55, 16] converts the heterogeneous sensor data into a homogeneous representation. The IMU raw data is transformed to bone vectors [55] or volume features [16] so that it can be fused more conveniently with high-level deep image features under learning-based frameworks. Since the outputs of these frameworks only contain global 3D coordinates of human joints, the transformed IMU data fails to contribute kinematic information or help generate anatomically more reasonable skeletons. The second category of approaches [10, 49] utilizes the local rotation and acceleration data from IMUs to build energy terms and optimize generative models for pose estimation. Although these fusion methods employ IMU raw data, the generative models are usually less efficient than learning-based frameworks. Also, only low-level vision features such as silhouettes or estimated 2D poses are embedded into the optimization process.

To address the above-mentioned challenges, in this work we propose an IMU-vision sensor fusion framework for parametric 3D human pose estimation, named FusePose. We first apply an algorithm called algebraic triangulation (AlgTri) [18] to predict a sub-optimal 3D human pose from multi-view images (MVIs). This pose serves both as a baseline result and as a reliable high-level feature for later sensor fusion. Then, as in the previous work [55], we naively calculate the bone vectors using IMU information and utilize them to improve the baseline result in hard cases with self-occlusions. A threshold screening algorithm is proposed in this NaiveFuse method for filtering hard cases. NaiveFuse, however, neglects the value of IMU raw data. Thus, we propose the second method, i.e. Kinematic Fuse (KineFuse), to fuse IMU raw data with the transformed baseline result in kinematic space. To be specific, we first build a parametric human skeleton model in kinematic space. Then we propose a canonical inverse kinematic (IK) layer in a deep neural network to convert the baseline 3D pose to pose parameters of the parametric human skeleton model. IMU raw data is calibrated and carefully aligned into the same kind of pose parameters, which are finally fused with the pose parameters converted from the baseline result. NaiveFuse and KineFuse are both test-time post-processing algorithms based on threshold screening. To further embed IMU information into the network training process and increase the robustness of the whole framework, we propose the last approach, i.e. Adaptive Deep Fuse (AdaDeepFuse), which is trained end-to-end using IMU and vision data. This approach adaptively performs sensor fusion not only on hard cases but also on other cases. As in KineFuse, AdaDeepFuse outputs refined pose parameters, which are then fed into a forward kinematic (FK) layer to produce the final fused 3D pose.

To sum up, IMU-vision fusion in kinematic space is central to both KineFuse and AdaDeepFuse, where the data from heterogeneous modalities is aligned as homogeneous pose parameters, making the most of complementary raw information for accurate and plausible parametric human pose results. In addition, the proposed framework has two more merits. The first is that the consistency of bone length can be ensured when fusing in kinematic space. This resolves an important issue in heatmap-based 3D pose estimation [19, 26], i.e. bone stretch due to varying illumination in images or quantization error in keypoint position generation. The second merit is that sensor fusion in kinematic space can aggregate the twist rotation of bones in the 3D human skeleton, which is not possible for sensor fusion in Euclidean space.

Experiments demonstrate that our proposed methods achieve superior performance on the Total Capture dataset, which contains IMUs and multi-view images. Thanks to the well-designed algebraic triangulation algorithm, the baseline result already surpasses state-of-the-art methods that only use vision information by a large margin. Compared to the baseline result, NaiveFuse, KineFuse and AdaDeepFuse further improve the pose estimation performance by $5.4\%$, $10.0\%$ and $13.1\%$, respectively. Also, we demonstrate the generalization ability of the proposed framework on a new benchmark via experiments on the Human3.6M dataset. In a nutshell, our contributions are summarized as follows:

$∙$ We build a parametric human skeleton model which can perform interconversion of 3D pose and pose parameters via canonical inverse kinematic layer and forward kinematic layer in deep neural network, enabling fusion in kinematic space and end-to-end network training.

$∙$ We propose three approaches, i.e. NaiveFuse, KineFuse and AdaDeepFuse, which effectively aggregate IMU raw data and reliable intermediate vision results by leveraging the advantages of image clues in Euclidean space and IMU rotations in kinematic space.

$∙$ Through experiments, we demonstrate that: (1) the fusion in kinematic space simultaneously ensures global skeleton consistency and local rotation reasonability; (2) the proposed framework significantly improves 3D pose estimation performance under self-occlusion.

II Related Work

II-A IMU-Vision Sensor Fusion

Generally, an IMU measures accelerations and angular velocities, from which orientations can be solved using filter algorithms [2, 8, 6, 40, 46]. The commercial solution of [50] aggregates 17 IMUs to estimate human pose in a Kalman filter. The seminal work of [47] uses a custom-made suit to target human motion in everyday surroundings. Some works also combine depth vision data with IMUs [13, 56]. The approach of [31] tries to integrate IMUs with 2D poses detected in one or two views. Reconstructing human poses from only IMUs is an under-constrained problem, as the information IMUs provide is insufficient for accurate human pose estimation. Thus, combining IMUs with vision data such as videos and images is worthwhile. IMU-vision sensor fusion can take advantage of the complementary strengths of the two data sources, i.e. the global drift-free position from vision data and the local occlusion-free limb orientation, even under fast motion, from IMU data. The early work on the combination of video and inertial sensors is implemented by Pons-Moll [38] via optimizing a local hybrid tracker. Marcard et al. [49] further develop this strategy under a setting of 8 viewpoint videos and 13 IMUs. Malleson et al. [30] integrate constraints from IMUs, multi-view images and a human pose prior model to optimize a generated parametric human pose. These methods exploit the raw data of IMUs by minimizing rotation or acceleration energy terms. However, they only utilize rough vision cues such as image silhouettes or estimated 2D human poses.

As deep networks ensure the high quality of extracted image features, deep feature-based methods for 3D human pose estimation have developed well in recent years. The 2D heatmaps can serve as appropriate intermediate cues for either multi-view feature aggregation or IMU-vision fusion. Gilbert et al. [10] back-project multiple viewpoint features to a probabilistic visual hull (PVH). The IMU orientations are processed to 3D joint locations by a forward kinematic solver and then embedded into 3D pose estimation via tensor concatenation. Huang et al. [16] propose to transform IMU-bone orientations to volume features, which are then concatenated with vision volume heatmaps. Zhang et al. [55] process the IMU orientations to bone vectors for either promoting the 2D joint heatmap refinement in terms of multi-view geometry or optimizing the 3D occupancy of joints via a pictorial structure model (PSM). Although the above-mentioned deep learning-based methods achieve high accuracy, they do not leverage the raw orientation and acceleration data from IMUs. Instead, they utilize transformed IMU information such as bone vectors for the convenience of fusing with deep vision features.

In this work, we fuse the IMU raw data and reliable intermediate features of multi-view vision data under a parametric human pose representation, ensuring both the efficiency of sensor fusion and the kinematic consistency of estimated 3D human skeletons.

II-B Parametric Human Pose Estimation

3D human pose estimation is often formulated as learning from sensor data such as monocular or multi-view images to estimate the 3D keypoint coordinates of a human skeleton [24, 18, 32]. This location-based strategy treats each human joint as a point from a heatmap and each bone as a line from a predefined human kinematic tree. However, some problems remain unsolved under this point-line human pose representation. First, directly regressing human joints from heatmaps generated by a convolutional neural network may introduce bone stretching [15, 14] or left-right asymmetry [5]. Second, the twist rotation of the bone is ignored and cannot be integrated into the framework, so the estimated 3D human pose may be inadequate to represent the true human motion [25]. Thus, a parametric human skeleton is paramount for more plausible human poses.

Parametric human pose [35, 11, 28, 48] represents the human body as a skeleton with a certain number of keypoints and the parameters attached to them. The pose parameters are often based on a human model with local coordinates for joints or bones. These model-based methods are valuable beyond human skeleton estimation, for example in robotic control [4], motion retargeting [45], and human mesh recovery [20, 3, 7, 33]. The works on model-based parametric human pose estimation can be divided into two categories: optimization-based methods and direct regression methods.

Optimization-based methods mainly focus on searching for the optimal solution of human pose parameters [52]. The approach of [23] fits SMPL [29] to 2D detections by solving an optimization problem. Energy functions are built to minimize the difference between the generated human model and image features. The solution procedure often needs good initialization and a relatively long time for iterative optimization, which does not meet the requirements of real-scene applications.

With the development of deep networks, an increasing number of studies focus on directly regressing either location-based pose [18, 32, 42] or parametric pose [45, 41, 36, 20] in an end-to-end manner. Kanazawa et al. [20] propose Human Mesh Recovery (HMR) to reconstruct a mesh representation that is parameterized by shape and 3D joint angles. Shi et al. [41] propose an approach that learns to predict joint rotations directly from training data. Despite the great contribution made by [20] and [41], these methods rely heavily on an extra adversarial loss to learn real-or-fake prior information and ensure that the regressed results lie on the manifold of natural pose parameters.

III Method

III-A Formulation

Parametric pose representation. A 3D human skeleton consists of the 3D locations of $J$ keypoints $P = \{p_{j}\}_{j=1}^{J}$ and the fixed connections between them. The parametric pose representation formulates a rigid human body as a rest T-pose skeleton $P^{T}$ together with the root transition parameter $T_{1}$ and the rotation parameters $R = \{R_{j}\}_{j=2}^{J}$ of the local systems $S = \{s_{j}\}_{j=2}^{J}$ attached to the bone vectors.

From the rest skeleton $P^{T}$, we can initialize all the local systems $S$ by defining Cartesian coordinates. A vector $b^{g}$ in the global system can be transferred to $s_{j}$ by $b^{j} = R_{j}^{g} b^{g}$, where $R_{j}^{g}$ denotes the rotation from the global system $s_{g}$ to the local system $s_{j}$. Also, the relative rotation from the parent $pa(j)$ of $j$ to $j$ can be calculated by:

$$R_{j}^{pa(j)} = R_{j}^{g} \left(R_{pa(j)}^{g}\right)^{-1}. \tag{1}$$

Fig. 2: (a) In the process of FK, $b_{j}$ is first rotated to $b_{j}'$ under the rotation $R_{j}$ of joint $j$, then is further rotated to $b_{j}''$ under the rotation $R_{pa(j)}$ of joint $pa(j)$. (b) Illustration of the canonical solution of IK. The rotation axis is set as the cross product of the reference and rotated bone vectors. The rotation angle is the angle between the two bone vectors.

FK and IK. Given the parameters $T_{1}$ and $R$ , 3D human pose can be rebuilt by forward kinematics (FK). This process ensures the bone length consistency and simultaneously is able to integrate the twist rotation of the bone vector.

The purpose of FK is to reconstruct 3D joint locations from pose parameters along the kinematic chain. For example, if all the joints on the path from joint $j$ to the root joint are denoted as $\{pa^{i}(j)\}_{i=1}^{n_{j}}$, then the bone vector in the global system $b_{j}^{g}$ can be calculated by:

$$b_{j}^{g} = R_{pa^{n}(j)}^{g} R_{pa^{n}(j)} R_{pa^{n-1}(j)}^{pa^{n}(j)} \cdots R_{pa(j)} R_{j}^{pa(j)} R_{j} b_{j}^{j}, \tag{2}$$

where the bone vector in the local system $b_{j}^{j}$ and all the relative rotations are calculated from the rest T-pose, as described above. $n_{j}$ is abbreviated as $n$ here and later. The 3D location of $j$ can therefore be calculated by:

$$p_{j} = p_{pa(j)} + b_{j}^{g}. \tag{3}$$

As illustrated in Figure 2 (a), the bone vector $b_{j}$ from $pa(j)$ to $j$ is affected not only by the rotation $R_{j}$ of its related joint $j$, but also by the rotations of all the upstream joints $\{pa^{i}(j)\}_{i=1}^{n}$, matching the kinematics of rigid bodies and the real anatomic motion of humans.
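The recursion in Eqs. (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the FK layer of this work: it assumes that every local system is axis-aligned with the global system in the rest pose, so the rest-pose relative rotations reduce to the identity, and the names `forward_kinematics` and `rot_z` are hypothetical.

```python
import numpy as np

def rot_z(angle_rad):
    """Rotation matrix about the z-axis (helper for toy examples)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(rest_pose, parents, rotations, root_position):
    """Rebuild joint positions from per-joint rotations.

    rest_pose:     (J, 3) T-pose joint locations
    parents:       parents[j] is the parent index of joint j (-1 for the
                   root); parents must appear before their children
    rotations:     (J, 3, 3) rotation R_j of each joint (entry 0 is unused)
    root_position: (3,) root location T_1
    Assumption: rest-pose relative rotations are the identity.
    """
    rest_pose = np.asarray(rest_pose, dtype=float)
    num_joints = rest_pose.shape[0]
    positions = np.zeros((num_joints, 3))
    global_rot = np.zeros((num_joints, 3, 3))
    positions[0] = root_position
    global_rot[0] = np.eye(3)
    for j in range(1, num_joints):
        p = parents[j]
        # Accumulate the rotations of all upstream joints, then R_j itself.
        global_rot[j] = global_rot[p] @ rotations[j]
        rest_bone = rest_pose[j] - rest_pose[p]  # bone vector in the T-pose
        positions[j] = positions[p] + global_rot[j] @ rest_bone  # Eq. (3)
    return positions
```

Because each joint is placed by rotating the fixed rest-pose bone vector, the bone lengths of the output always equal those of the T-pose, whatever rotations are supplied.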

While the forward kinematics is well-defined and easy to solve, the inverse kinematics (IK) is an ill-posed problem because the 3D locations of joints fail to model the rotation information. In this work, we propose a canonical solution of IK based on axis-angle representation.

Fig. 3: The proposed IMU-vision fusion framework for parametric 3D human pose estimation. The 3D pose $P^{vis}$ estimated from multi-view images is parameterized to local rotations $Q^{vis}$ via the IK layer. The IMU raw data is also processed to local rotations $\bar{Q}^{imu}$ or bone vectors $B^{imu}$ and is integrated with vision data ($P^{vis}$ or $Q^{vis}$) by three fusion approaches: *NaiveFuse*, *KineFuse* and *AdaDeepFuse*. The final fused 3D poses of *KineFuse* and *AdaDeepFuse* are generated via the FK layer. Only *AdaDeepFuse* needs to be trained end-to-end.

As in FK, inverse kinematics (IK) is performed along the kinematic chain. When calculating the rotation parameter $R_{j}$ of joint $j$, the rotation parameters of all its ancestors $\{pa^{i}(j)\}_{i=1}^{n}$ are assumed to be already solved. As the bone vector $b_{j}^{g}$ can be derived from the current 3D pose, the only unknown variable in Eq. (2) is $R_{j}$. Figure 2 (b) shows how we solve a canonical axis-angle representation of the local rotation from two 3D bone vectors. The axis $K$ and angle $\theta$ are then converted to the rotation matrix by the Rodrigues formula:

$$R = I + \sin(\theta) K + (1 - \cos(\theta)) K^{2}. \tag{4}$$

Although this solution cannot represent the true rotation of the local system related to the bone vector, it provides a unique mapping from 3D joint locations to pose parameters. IK is described in more detail in section III-C and Alg 1.
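The canonical solution of Figure 2 (b) and Eq. (4) can be illustrated as follows. This is a hedged sketch: the function name `canonical_rotation`, the tolerance `eps` and the handling of parallel and antiparallel vectors are assumptions, since the text does not specify how degenerate cases are treated.

```python
import numpy as np

def canonical_rotation(ref_bone, cur_bone, eps=1e-8):
    """Rotation matrix taking the direction of ref_bone to cur_bone.

    Axis: normalized cross product of the two bone vectors.
    Angle: angle between them. Converted with the Rodrigues formula.
    """
    a = np.asarray(ref_bone, dtype=float)
    b = np.asarray(cur_bone, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    axis = np.cross(a, b)
    sin_theta = np.linalg.norm(axis)
    cos_theta = float(np.clip(np.dot(a, b), -1.0, 1.0))
    if sin_theta < eps:
        if cos_theta > 0.0:
            return np.eye(3)  # same direction: no rotation needed
        # Antiparallel (assumed handling): rotate by pi about any axis
        # perpendicular to a.
        helper = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        axis = np.cross(a, helper)
        axis = axis / np.linalg.norm(axis)
        sin_theta = 0.0
    else:
        axis = axis / sin_theta
    # Skew-symmetric matrix K of the unit axis.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + sin_theta * K + (1.0 - cos_theta) * (K @ K)
```

The result contains no rotation component about the bone itself, which is why this mapping is unique but cannot recover twist.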

Since the IK and the FK are both differentiable, we implement them as two parameter-free layers in the deep neural network: IK Layer and FK Layer. Thus, they are both lightweight enough to be inserted into any 3D pose estimation framework as a useful module.

The whole framework of our approach is illustrated in Figure 3. We first convert the information from both multi-view images (MVIs) and IMUs to pose parameters based on the kinematic human model. Then, three methods are proposed to fuse those pose parameters to improve the performance of 3D human pose estimation. In the rest of this section, we elaborate on pose parameterization and the three fusion approaches of the framework, respectively.

III-B Pose Parameterization

Multi-view images. We adopt an effective architecture to predict a sub-optimal 3D human pose $P^{vis}$ from MVIs. Following [18], we use ResNet as our 2D encoder to generate 2D joint heatmaps, and a differentiable algebraic triangulation (AlgTri in Figure 3) with confidence weights for the different views to obtain a linear approximate solution of the 3D joint locations via Singular Value Decomposition (SVD).
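For intuition, a confidence-weighted linear triangulation of a single joint can be sketched as below. This follows the standard direct linear transform formulation and is only an assumption about how the weights enter the system; the function name `triangulate_point` and its argument layout are hypothetical.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d, confidences):
    """Triangulate one 3D point from several calibrated views.

    proj_mats:   list of (3, 4) camera projection matrices
    points_2d:   list of (u, v) detections, one per view
    confidences: list of scalar weights, one per view
    Each view contributes two weighted rows to a homogeneous system
    A x = 0, solved by taking the right singular vector with the
    smallest singular value.
    """
    rows = []
    for P, (u, v), w in zip(proj_mats, points_2d, confidences):
        P = np.asarray(P, dtype=float)
        rows.append(w * (u * P[2] - P[0]))
        rows.append(w * (v * P[2] - P[1]))
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    x_h = vt[-1]              # homogeneous solution
    return x_h[:3] / x_h[3]   # back to Euclidean coordinates
```

A view with a low confidence scales its two rows down, so an occluded camera contributes less to the least-squares solution.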

After $P^{vis}$ is obtained, pose parameters are generated by an IK layer. The root global transition $T_{1}$ is determined as the root location in $P^{vis}$. For better computational consistency and stability, following [37], we convert the rotation matrix of the local system $R_{j}^{vis}$ to a quaternion $q_{j}^{vis}$ as the pose representation parameter.

The pose parameters $Q^{vis} = \{q_{j}^{vis}\}_{j=2}^{J}$ have certain distributions based on the articulation range of human joints with known degrees of freedom (DoFs). The main reason we use pose parameters in this work instead of 3D joint locations is to aggregate the pose information from MVIs and IMUs more conveniently and efficiently. We elaborate on the details of the aggregation algorithms in section III-C.

IMUs. IMUs are rigid sensors attached to human bones, measuring the local rotation by integrating the recorded gyroscope data from the reference frame to the current frame. The information from IMUs is often represented as a quaternion. In order to obtain the global rotation $q_{k}^{imu}$ of the human bone attached with IMU $k$, the raw data $q_{k}$ from the IMU needs to be calibrated using the IMU-bone offset $q_{kb}$ and the IMU reference frame-global offset $q_{kg}$ by:

$$q_{k}^{imu} = (q_{kb})^{-1} \otimes q_{kg} \otimes q_{k}, \tag{5}$$

where $\otimes$ denotes quaternion multiplication.

The calculated bone global rotation $Q^{imu} = \{q_{k}^{imu}\}_{k=1}^{K}$ is still difficult to combine with the pose parameters generated from MVIs. Hence, it is also indispensable to align the IMU and vision information into the same representation for robust and efficient fusion. There are two available operations for IMU-vision alignment. The first one is directly applying the global rotations from IMUs to the bone vectors in the T-pose, generating the current-frame bone vectors $B^{imu} = \{b_{k}^{imu}\}_{k=1}^{K}$, which share the same representation as the bone vectors derived from 3D joint locations. $K$ is the number of IMUs attached to human bones. In this work, $K$ is set as 8 following [55], employing IMUs on the thighs, calves, upper arms and forearms. The bone vector rotation is performed by $b_{k}^{imu} = \bar{q}_{k}^{imu} \triangleright b_{j}^{j}$, where $\triangleright$ denotes applying the rotation of a quaternion to a 3D vector and $j$ is the endpoint of the bone $k$ attached with the IMU. The second operation for IMU-vision alignment is converting the global rotation $Q^{imu}$ to the local rotation $\bar{Q}^{imu} = \{\bar{q}_{k}^{imu}\}_{k=1}^{K}$ by:

$$\bar{q}_{k}^{imu} = q_{k}^{imu} \otimes (q_{j}^{g})^{-1}, \tag{6}$$

where $q_{j}^{g}$ denotes the transformation from the global system $s_{g}$ to the local system $s_{j}$. Note that the local rotations are also part of the unknown variables in the IK process. Thus, $\bar{Q}^{imu}$ can be efficiently embedded into the IK layer and can be combined well with the pose parameters $Q^{vis}$ from vision data.
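Eqs. (5) and (6) and the vector rotation operator can be written out with plain quaternion arithmetic. The sketch below assumes unit quaternions stored as `(w, x, y, z)` tuples and the Hamilton product convention; both are assumptions, as are all function names.

```python
import math

def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def q_inv(q):
    """Inverse of a unit quaternion (its conjugate)."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_rotate(q, v):
    """Apply the rotation of unit quaternion q to the 3D vector v."""
    _, x, y, z = q_mul(q_mul(q, (0.0, v[0], v[1], v[2])), q_inv(q))
    return (x, y, z)

def calibrate_imu(q_raw, q_kb, q_kg):
    """Eq. (5): bone rotation in the global system from the raw IMU reading."""
    return q_mul(q_mul(q_inv(q_kb), q_kg), q_raw)

def to_local(q_imu_global, q_jg):
    """Eq. (6): convert the global bone rotation to a local rotation."""
    return q_mul(q_imu_global, q_inv(q_jg))
```

With identity offsets, `calibrate_imu` returns the raw reading unchanged, which is a convenient sanity check when wiring up real calibration data.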

III-C IMU-Vision Fusion Framework

As illustrated in the middle of Figure 3, we propose three approaches to fusing information from vision and IMUs. Next, we will dive into the elaboration of these three approaches.

NaiveFuse. The first approach, NaiveFuse, aggregates the coarse 3D pose $P^{vis}$ estimated from multi-view images and the bone vectors $B^{imu}$ rotated using IMU raw data. It is widely accepted that IMU data suffers from drift error due to the integration performed during data collection. Directly replacing the bone vectors in $P^{vis}$ with $B^{imu}$ may therefore increase the pose error for easy-to-estimate frames. Thus, we propose a threshold screening method based on the angle between vectors to decide whether a bone vector should be replaced, as follows:

$$b_{j}^{vis} = \begin{cases} b_{k}^{imu}, & \text{if } \theta_{k} > \theta_{t} \\ b_{j}^{vis}, & \text{otherwise,} \end{cases} \tag{7}$$

where $θ_{k}$ is the angle between $b_{k}^{i m u}$ and $b_{j}^{v i s}$ and $θ_{t}$ is the threshold.
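The screening rule of Eq. (7) amounts to a single comparison per IMU-equipped bone. A minimal sketch follows; the function names are hypothetical, and the threshold is left as a parameter because its value is not stated in this section.

```python
import math

def angle_between(u, v):
    """Angle in radians between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_theta)

def naive_fuse_bone(b_vis, b_imu, theta_t):
    """Eq. (7): keep the vision bone unless it disagrees strongly with the IMU bone."""
    if angle_between(b_imu, b_vis) > theta_t:
        return b_imu
    return b_vis
```

For example, with an illustrative threshold of 30 degrees, a vision bone that is perpendicular to the IMU bone would be replaced, while one that deviates by only a few degrees would be kept, so drift in the IMU data cannot degrade frames that vision already handles well.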

Clearly, this approach can only improve the results of joints related to IMUs. If there are large estimation errors on the upstream joints, NaiveFuse can hardly improve the results. Another drawback of NaiveFuse is that a transformed 3D bone vector alone is inadequate to reconstruct the real motion parameters of a human bone, which is a rigid body. To better leverage the rigid bone rotation information from IMUs, we propose another fusion method that operates in kinematic space.

Algorithm 1 Kinematic Fuse
Input: $P^{vis}$, $\bar{Q}^{imu}$, $P^{T}$
Output: $Q^{KF}$, $P^{KF}$
$T_{1} \leftarrow p_{1}^{vis}$ ▷ from $P^{vis}$
$p_{1}^{KF} \leftarrow T_{1}$
for $j$ along the kinematic tree do
  $b_{j}^{vis} \leftarrow (p_{j}^{vis} - p_{pa(j)}^{vis})$
  $q_{j}^{total} \leftarrow$ CanonicalSolve($b_{j}^{vis}, b_{j}^{j}$)
  if $b_{j}$ is attached with IMU $k$ then
    $b_{k}^{imu} \leftarrow \bar{q}_{k}^{imu} \triangleright b_{j}^{j}$
    $\theta_{k} \leftarrow \langle b_{k}^{imu}, b_{j}^{j} \rangle$
    if $\theta_{k} > \theta_{t}$ then
      $q_{j}^{total} \leftarrow \bar{q}_{k}^{imu}$
    end if
  end if
  $q_{j} \leftarrow (q_{pa^{n}(j)}^{g})^{-1} \otimes q_{j}^{total}$
  for $pa^{i}(j) : i \in [n, 1]$ do
    $q_{j} \leftarrow (q_{pa^{i}(j)})^{-1} \otimes q_{j}$
    $q_{j} \leftarrow (q_{pa^{i-1}(j)}^{pa^{i}(j)})^{-1} \otimes q_{j}$
  end for
end for
$Q^{KF} \leftarrow \{q_{j}\}_{j=2}^{J}$
$P^{KF} \leftarrow FK(Q^{KF}, P^{T})$

KineFuse. Kinematic Fuse (KineFuse) aggregates the pose parameters $Q^{vis}$ calculated from $P^{vis}$ via the IK layer and the aligned IMU rotation information $\bar{Q}^{imu}$. The main advantage of utilizing local rotations in KineFuse instead of bone vectors in NaiveFuse is that the twist rotations of bones can be compensated in the IK process. As described in section III-A, IK is ill-posed, and the proposed canonical solution cannot recover the twist rotation from two 3D bone vectors. Although the twist rotation of a bone has no impact on the 3D location of the joints attached to that bone, the situation is different for the downstream joints. For instance, the twist rotation of a thigh cannot affect the location of the knee but does affect that of the ankle.
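The thigh example can be checked numerically. The toy leg below is purely illustrative: the bone lengths, the 90-degree knee bend and the 90-degree twist are arbitrary assumptions, and `rotation_about` is a hypothetical helper built on the Rodrigues formula.

```python
import numpy as np

def rotation_about(axis, angle_rad):
    """Rotation matrix for a rotation of angle_rad about the given axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle_rad) * K + (1.0 - np.cos(angle_rad)) * (K @ K)

# Toy leg: hip at the origin, thigh pointing down, knee bent by 90 degrees.
hip = np.zeros(3)
thigh = np.array([0.0, -0.45, 0.0])  # hip -> knee bone vector
shin = np.array([0.0, 0.0, 0.40])    # knee -> ankle bone vector

knee = hip + thigh
ankle = knee + shin

# Twist the thigh about its own axis; the shin is downstream and follows.
twist = rotation_about(thigh, np.pi / 2)
knee_twisted = hip + twist @ thigh
ankle_twisted = knee_twisted + twist @ shin

knee_shift = np.linalg.norm(knee_twisted - knee)     # stays at zero
ankle_shift = np.linalg.norm(ankle_twisted - ankle)  # clearly non-zero
```

The knee does not move because the thigh vector lies on the rotation axis, whereas the ankle swings around that axis. Two bone vectors alone therefore cannot reveal the twist, while an IMU on the thigh measures it directly.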

The whole process of KineFuse is summarized in Alg 1. Given the predicted 3D pose $P^{vis}$ from vision, the calibrated and aligned IMU rotation information $\bar{Q}^{imu}$ and the rest T-pose $P^{T}$, KineFuse is performed based on the IK layer and outputs the fused pose parameters $T_{1}$ and $Q^{KF}$. To obtain the final fused 3D pose $P^{KF}$, the fused pose parameters are fed into the FK layer as described in section III-A.

The most important improvement in KineFuse is that it can simultaneously ensure the consistency of bone length and the accuracy of limb joint estimation. To be specific, the results of joints that are not related to IMUs can also be improved because the bone stretching errors are eliminated by the IK and FK process. Furthermore, the estimation errors of joints that are related to IMUs can be explicitly reduced on the basis of more accurate ancestor joints, together with IMU information that includes the twist rotation.

However, KineFuse has an important limitation: like NaiveFuse, it serves only as a test-time post-processing approach. IMU information should also be utilized in the network training stage to help increase the robustness of the whole framework. Thus, in the last approach, AdaDeepFuse, we explore efficient IMU-vision fusion in a supervised, deep learning-based way.

AdaDeepFuse. Adaptive Deep Fuse (AdaDeepFuse) receives pose parameters from MVIs $Q^{vis}$ and IMUs $\bar{Q}^{imu}$, which are then concatenated and fed into an AdaFuse module. The AdaFuse module is implemented as a multi-layer perceptron (MLP) that outputs the fused pose parameters $Q^{ADF}$. Then, as in KineFuse, $Q^{ADF}$ is fed into an FK layer to produce the final fused 3D pose $P^{ADF}$.
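The data flow through such a module might look like the sketch below. Everything about the architecture here is an assumption made for illustration: the single hidden layer, its width, the ReLU activation, the random initialization and the final quaternion normalization are not specified in the text, and `init_adafuse` and `adafuse_forward` are hypothetical names.

```python
import numpy as np

def init_adafuse(n_vis, n_imu, hidden, rng):
    """Randomly initialize a two-layer MLP (hypothetical architecture)."""
    in_dim = 4 * (n_vis + n_imu)  # concatenated quaternions
    return {
        "W1": rng.normal(0.0, 0.1, (hidden, in_dim)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (4 * n_vis, hidden)),
        "b2": np.zeros(4 * n_vis),
    }

def adafuse_forward(q_vis, q_imu, params):
    """Fuse vision and IMU quaternions into one set of pose parameters.

    q_vis: (n_vis, 4) local rotations from the IK layer
    q_imu: (n_imu, 4) aligned local rotations from the IMUs
    Returns an (n_vis, 4) array of unit quaternions.
    """
    x = np.concatenate([q_vis.ravel(), q_imu.ravel()])
    h = np.maximum(0.0, params["W1"] @ x + params["b1"])  # ReLU
    out = (params["W2"] @ h + params["b2"]).reshape(q_vis.shape)
    # Assumed post-processing: renormalize so each row is a valid rotation.
    return out / (np.linalg.norm(out, axis=-1, keepdims=True) + 1e-8)
```

Since the network sees both inputs for every joint, it can learn how much weight each source deserves instead of relying on a fixed threshold.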

Different from NaiveFuse and KineFuse, AdaDeepFuse is trained in an end-to-end manner with supervision on both the 3D human pose and the pose parameters. We argue that the errors from IMUs and from the 3D pose estimated from MVIs can be adaptively compensated by training a neural network. The AdaFuse module can determine to what extent $Q^{vis}$ or $\bar{Q}^{imu}$ contributes to more accurate results. We fuse pose parameters instead of bone vectors since the quaternion space is continuous for interpolation and robust for network training. The total loss function $L$ for training AdaDeepFuse consists of two terms, a loss for the 3D human pose $L_{pose}$ and a loss for the pose parameters $L_{param}$:

$$L = L_{pose} + \alpha L_{param}, \tag{8}$$

where $\alpha$ is the weight of the pose parameter loss and is set to $1 \times 10^{-2}$ in this work. Both loss terms are computed as the Smooth L1 loss between the results and the ground truths. The ground truth 3D pose $P^{GT}$ is collected from MoCap equipment. The ground truth pose parameters $Q^{GT}$ are derived from $P^{GT}$ via the IK layer.
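Eq. (8) can be sketched as follows. The transition point `beta = 1.0` and the mean reduction are assumptions borrowed from common deep learning defaults, not values stated in the text, and the function names are hypothetical.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic below beta, linear above (mean reduction)."""
    diff = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())

def total_loss(pose_pred, pose_gt, q_pred, q_gt, alpha=1e-2):
    """Eq. (8): pose loss plus the weighted pose-parameter loss."""
    return smooth_l1(pose_pred, pose_gt) + alpha * smooth_l1(q_pred, q_gt)
```

The small weight keeps the position error as the dominant training signal, with the parameter term acting as a mild regularizer on the rotations.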

IV Experiments

In this section, we conduct the experiments to evaluate the proposed IMU-vision fusion framework. First, the two datasets used in the experiments are introduced. Then, the comparisons between our methods with other methods are reported. Finally, the ablation studies are conducted.

IV-A Datasets and Experiment Settings

Total Capture [44]. This dataset is a large-scale benchmark containing data from 13 IMUs, multi-view videos and 3D human pose ground truth. Following [55, 44], we use four (1, 3, 5, 7) of the eight views in this work for efficiency. For the number of chosen IMUs, we design two optional settings: 8 IMUs on the limbs, or 4 of them on either the upper limbs or the lower limbs. We partition the training and testing sets with respect to subjects and performance sequences. The training set consists of performances ROM1, 2, 3; Walking1, 3; Freestyle1, 2; Acting1, 2 and Running1 on subjects 1, 2 and 3. The testing set contains the performances Freestyle3, Acting3 and Walking2 on all subjects. We use all frames in the training set and every eighth frame in the testing set. We evaluate all three proposed approaches on the Total Capture dataset. To be specific, we supervise NaiveFuse and KineFuse only on the predicted 3D pose $P^{vis}$ from MVIs, and supervise AdaDeepFuse on the fused 3D pose $P^{ADF}$.

Human3.6M [17]. We use this dataset for generalization evaluation. It consists of 3.6M frames from 4 synchronized cameras with 3D pose ground truth. There are 11 subjects in Human3.6M. Following previous work [42], we split it into the training set (S1, S5, S6, S7, S8) and the testing set (S9, S11). Since it has no IMU sensors, we calculate bone vectors from the 3D pose annotations as simulated IMU information. Unlike real IMU sensor data, this kind of data is perfectly accurate and free from drift or other errors, which makes it unsuitable for training a model. Thus, we only verify NaiveFuse and KineFuse on the Human3.6M dataset.

Implementation details. We use ResNet-152 [12] as 2D encoder backbone, initialized with ImageNet pre-trained weights. In Total Capture we train the 2D encoder and the AdaFuse Module end-to-end from scratch for 15 epochs using Adam[21] optimizer, while in Human3.6M we utilize the pre-trained AlgTri model weights in [18]. The input multi-view images are resized to $320 \times 320$ in Total Capture and $384 \times 384$ in Human3.6M. We calculate the Mean Per Joint Position Error (MPJPE) as the metric to evaluate the performance of 3D pose estimation. It measures the mean distance between estimated 3D joint locations and ground truths over all subjects and frames of the testing set.
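For reference, MPJPE as described above reduces to a mean of per-joint Euclidean distances. A minimal sketch, with a hypothetical function name and array layout:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error.

    pred, gt: arrays of shape (..., J, 3), e.g. (frames, joints, 3),
    in the same unit (millimetres in Table I).
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```

The result is in the unit of the input coordinates.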

Method	train w/ IMUs	test w/ IMUs	SeenSubject (S1,2,3)			UnseenSubject (S4,5)			Average
			W2	A3	FS3	W2	A3	FS3
Tri-CPM [51]			79.0	112.0	106.0	79.0	149.0	73.0	99.0
PVH [44]			48.3	94.3	122.3	84.3	154.5	168.5	107.3
LSTM-AE [43]			13.0	23.0	47.0	21.8	40.9	68.5	34.1
IMUPVH [10]	✓	✓	19.2	42.3	48.8	24.7	58.8	61.8	42.6
Fusion-RPSM [39]			19.0	21.0	28.0	32.0	33.0	54.0	29.0
DeepFuse-Vision[16]			-	-	-	-	-	-	32.7
DeepFuse-IMU [16]	✓		-	-	-	-	-	-	28.9
GeoFuse-SN-ORPSM [55]		✓	-	-	-	-	-	-	25.5
GeoFuse-ORN-ORPSM [55]	✓	✓	14.3	17.5	25.9	23.9	27.8	49.3	24.6
AlgTri (baseline)			9.6	15.3	30.9	24.1	30.6	61.3	25.9
NaiveFuse (ours)		✓	9.6	14.8	27.9	23.9	30.4	55.4	24.5
KineFuse (ours)		✓	9.5	14.3	26.8	22.4	28.7	51.8	23.3
AdaDeepFuse (ours)	✓	✓	10.2	13.7	26.3	21.7	26.8	49.2	22.5

TABLE I: Comparison of the 3D pose estimation errors (MPJPE, mm) of different methods on the Total Capture dataset. Our methods outperform previous state-of-the-art methods.

IV-B Comparison to Other Methods

We first compare the 3D human pose estimation performance of our framework with the state of the art on the Total Capture dataset. The results are listed in Table I. The last four rows show the methods implemented in this work: the baseline AlgTri algorithm and the three proposed IMU-vision sensor fusion approaches. Among previous methods, LSTM-AE [43] and IMUPVH [10] achieve decent performance by exploiting temporal information. Among the methods that use only vision data and a single frame as input, the previous state of the art, Fusion-RPSM [39], has an error of 29.0 mm, which is larger than the 25.9 mm achieved by our trained AlgTri baseline. This is because the information from multiple viewpoints is aggregated well by the view-confidence-weighted triangulation algorithm. The state-of-the-art performance using only vision data is boosted by 14.3%.

Among the methods that perform IMU-vision sensor fusion, our approaches NaiveFuse (24.5 mm), KineFuse (23.3 mm) and AdaDeepFuse (22.5 mm) achieve state-of-the-art performance. Three main points should be noted. First, NaiveFuse and KineFuse use IMU information only at test time and serve as post-processing approaches, so they can be conveniently applied to any existing 3D human pose estimation framework. Both outperform GeoFuse-SN-ORPSM [55] (25.5 mm), which also uses IMUs only for testing. Second, AdaDeepFuse uses IMUs for both training and testing. It outperforms GeoFuse-ORN-ORPSM [55] (24.6 mm), reducing the error by 8.5%. Third, the Recursive Pictorial Structure Model (RPSM) in [55] performs 3D IMU-vision information integration with an enumeration algorithm, which is time-consuming and relies heavily on computing and memory resources. In contrast, our approaches are computationally friendly, since the proposed IK and FK layers are both lightweight and parameter-free and the AdaFuse module is a shallow network. Thus, our framework is applicable to a wider range of real-world scenarios.
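
The relative improvements quoted for the fusion methods follow from the average MPJPE values in Table I. The short calculation below reproduces them, assuming the improvement is defined as the error reduction divided by the error of the reference method; the function name is illustrative.

```python
def relative_improvement(reference_mm, ours_mm):
    """Relative MPJPE reduction with respect to a reference method, in percent."""
    return 100.0 * (reference_mm - ours_mm) / reference_mm

# Average MPJPE values (mm) taken from Table I.
kinefuse_gain = relative_improvement(25.5, 23.3)     # vs. the test-only GeoFuse variant [55]
adadeepfuse_gain = relative_improvement(24.6, 22.5)  # vs. the train-and-test GeoFuse variant [55]
print(round(kinefuse_gain, 1), round(adadeepfuse_gain, 1))  # 8.6 8.5
```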

IV-C Ablation Study

Threshold screening. One of the key operations in NaiveFuse and KineFuse is the threshold screening for hard-to-estimate frames. To verify the necessity and robustness of this operation, we calculate the angles between the bone vectors derived from vision, IMUs and ground truth. Note that the vision information here is the 3D pose $P^{vis}$ estimated by AlgTri. Figure 4 illustrates the angle curves of the right forearm bone vector in one of the motion sequences. The curves show that the difference between IMU and vision information is very close to the difference between vision and ground truth. Hence, we can use a threshold on the angle between the bone vectors derived from IMU and vision to screen out the hard cases. In addition, the difference between IMU and ground truth is tiny, which indicates that IMU data is reliable at test time.

Fig. 4: The vectorial angles between bone vectors which are derived from vision information, IMUs and ground truth. The curves show the angles of right forearm along a motion sequence.

As shown in Figure 4, the angle curve between vision and IMUs fluctuates due to pose estimation errors in $P^{vis}$. An appropriate value of the threshold $θ^{t}$ is crucial for identifying suitable hard cases and then integrating IMU information to improve the performance. We randomly select 30% of the frames from the Total Capture training set to perform an ablation study that determines the value of $θ^{t}$. Table II lists several candidate values of $θ^{t}$ and their respective estimation errors. We find that $θ^{t} = 0.25$ achieves the best performance in our experimental setting. Note that NaiveFuse and KineFuse are employed in this experiment.
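
The screening step can be sketched as a simple angle test between the vision-derived and IMU-derived bone vectors. This is a hedged illustration: the function names are hypothetical, and it assumes that $θ^{t}$ is expressed in radians and that a bone counts as a hard case when the angle exceeds the threshold.

```python
import numpy as np

def bone_angle(u, v):
    """Angle in radians between two 3D bone vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def is_hard_case(bone_vis, bone_imu, theta_t=0.25):
    """Flag a bone when vision and IMU directions disagree by more than theta_t."""
    return bone_angle(bone_vis, bone_imu) > theta_t

imu = np.array([0.0, -1.0, 0.0])
vis_ok = np.array([0.1, -1.0, 0.0])   # roughly 0.10 rad from the IMU direction
vis_bad = np.array([0.5, -1.0, 0.0])  # roughly 0.46 rad from the IMU direction
print(is_hard_case(vis_ok, imu), is_hard_case(vis_bad, imu))
```

Only the flagged bones would then be passed to the fusion step, so frames where vision already agrees with the IMUs are left untouched.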

$θ^{t}$	0.15	0.20	0.25	0.30	0.35
NaiveFuse	24.83	24.61	24.51	24.64	24.79
KineFuse	23.58	23.38	23.31	23.37	23.51

TABLE II: Different values of $θ^{t}$ for methods that apply threshold screening and their respective 3D pose estimation errors (mm) on the randomly selected validation set.

Fusion Methods. In this work we propose three IMU-vision fusion approaches. We perform ablation studies of all three on the Total Capture dataset to evaluate their effectiveness. We take the joint errors of the 3D pose $P^{vis}$ estimated by AlgTri as the baseline. Then, we conduct experiments on NaiveFuse (NF), KineFuse (KF) and AdaDeepFuse (ADF) and calculate the joint errors of the fused 3D poses $P^{NF}$, $P^{KF}$ and $P^{ADF}$, respectively. Table III lists the improvements that the three methods bring to the considered joints. Moreover, we subtract the errors of the fused poses from those of the baseline to obtain the joint error improvement of each approach, which is shown in Figure 5 for a clearer illustration.

Joint Error (mm)	Baseline	Naive Fuse	Kinematic Fuse	Adaptive Deep Fuse
Belly	9.9	9.9 (-)	9.6 ( $↓$ 0.3)	9.6 ( $↓$ 0.3)
Neck	26.2	26.2 (-)	26.4 ( $↑$ 0.2)	26.4 ( $↑$ 0.2)
Nose	28.5	28.5 (-)	28.2 ( $↓$ 0.3)	28.2 ( $↓$ 0.3)
Right Hip	11.2	11.2 (-)	10.2 ( $↓$ 1.0)	10.2 ( $↓$ 1.0)
Left Hip	10.2	10.2 (-)	8.5 ( $↓$ 1.7)	8.5 ( $↓$ 1.7)
Right Shoulder	36.9	36.9 (-)	31.2 ( $↓$ 5.7)	31.2 ( $↓$ 5.7)
Left Shoulder	38.4	38.4 (-)	34.0 ( $↓$ 4.4)	34.0 ( $↓$ 4.4)
Right Knee	25.3	23.1 ( $↓$ 2.2)	19.1 ( $↓$ 6.2)	16.7 ( $↓$ 8.6)
Right Ankle	27.7	25.2 ( $↓$ 2.5)	23.2 ( $↓$ 4.5)	20.5 ( $↓$ 7.2)
Left Knee	26.3	22.1 ( $↓$ 4.2)	20.0 ( $↓$ 6.3)	16.8 ( $↓$ 9.5)
Left Ankle	26.9	23.2 ( $↓$ 3.7)	23.2 ( $↓$ 3.7)	19.8 ( $↓$ 7.1)
Right Elbow	39.9	38.1 ( $↓$ 1.7)	35.9 ( $↓$ 4.0)	35.0 ( $↓$ 4.9)
Right Wrist	45.8	42.1 ( $↓$ 3.7)	40.6 ( $↓$ 5.2)	38.9 ( $↓$ 6.9)
Left Elbow	43.8	39.5 ( $↓$ 4.3)	39.5 ( $↓$ 4.3)	38.3 ( $↓$ 5.5)
Left Wrist	48.4	41.7 ( $↓$ 6.7)	41.4 ( $↓$ 7.0)	40.7 ( $↓$ 7.7)
Average	25.9	24.5 ( $↓$ 1.4)	23.3 ( $↓$ 2.6)	22.5 ( $↓$ 3.4)

TABLE III: 3D pose estimation errors (mm) of the ablation study on the Total Capture dataset.

Fig. 5: The respective joint error improvement of three proposed approaches. ’Others’ represents joints that are not related to used IMUs.

Fig. 6: Illustration of different 3D pose results on the Total Capture dataset. Self-occlusions exist in most of the image views used for the selected cases, but they are resolved via kinematic constraints and efficient IMU-vision fusion.

We can draw several conclusions from these results. First, NF reduces the error only for the joints related to the used IMUs and has no impact on other joints (0.0% for the blue bar in ’Others’). This is the limitation of directly replacing local bone vectors while ignoring the holistic human skeleton. In contrast, KF and ADF both improve the performance on these IMU-unrelated joints (8.3% for the orange and green bars in ’Others’). This is mainly because IK and FK ensure bone length consistency and avoid the bone stretch issue in the fused 3D pose. The same conclusion can be reached from Figure 6, which shows the output 3D poses of the different methods. The bone lengths in the 3D poses of AlgTri and NF disagree with those of the ground truth, while this does not occur in the results of KF and ADF. Building on the more accurate ancestor joints, the errors of the IMU-related limb joints are further reduced in KF compared to NF. The overall average joint error improvement also increases from 5.4% to 10.0%. It is further boosted to 13.1% in ADF, indicating the effectiveness of the adaptive deep learning-based fusion strategy.
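
The bone length argument can be illustrated with a toy forward kinematics pass. The sketch below assumes a three-joint serial chain, rotations about a single axis, and hypothetical helper names (`rot_z`, `forward_kinematics`); it is not the FK layer of this work, only an illustration of why joint positions generated from fixed rest-pose offsets and rotations keep their bone lengths.

```python
import numpy as np

def rot_z(angle):
    """Rotation matrix about the z-axis (angle in radians)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def forward_kinematics(root, offsets, local_rots):
    """Joint positions of a serial chain from rest-pose offsets and local rotations."""
    joints = [np.asarray(root, dtype=float)]
    global_rot = np.eye(3)
    for offset, rot in zip(offsets, local_rots):
        global_rot = global_rot @ rot  # accumulate rotations down the chain
        joints.append(joints[-1] + global_rot @ offset)
    return np.stack(joints)

# Toy chain: shoulder -> elbow -> wrist with fixed bone lengths (mm).
offsets = [np.array([300.0, 0.0, 0.0]), np.array([250.0, 0.0, 0.0])]
pose = forward_kinematics([0.0, 0.0, 0.0], offsets, [rot_z(0.4), rot_z(-1.1)])
lengths = np.linalg.norm(np.diff(pose, axis=0), axis=1)
print(lengths)  # bone lengths stay at 300 and 250 for any choice of rotations
```

Since rotations preserve vector norms, any fusion performed on the rotation parameters cannot stretch a bone, whereas replacing joint positions directly offers no such guarantee.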

To determine which IMUs play a more important role in the fusion process, we conduct experiments on the usage of IMUs based on KineFuse. The results are listed in Table IV. The IMUs on the upper limbs (25.2 mm) perform better than the IMUs on the lower limbs (25.7 mm). This is because reducing the error of the upstream joints further corrects their downstream children. In this work, using all eight limb IMUs serves as the final setting for the most accurate results.

IMUs	upper limbs (4)	lower limbs (4)	limbs (8)
MPJPE	25.2	25.7	23.3

TABLE IV: Ablation study on the usage of IMUs. Upper represents the upper arms and thighs, while lower represents the forearms and calves. 4 and 8 denote the number of IMUs attached to body bones.

We also conduct experiments on the Human3.6M dataset for generalization evaluation. Because the simulated IMU information is perfectly accurate, we only validate NaiveFuse and KineFuse on the testing set. The quantitative and qualitative results are shown in Table V and Figure 7, respectively. The conclusions are similar to those from the experiments on Total Capture. The IMU-related joint errors are reduced from 30.1 mm to 23.5 mm by NaiveFuse and to 19.2 mm by KineFuse. In addition, the errors of the other joints are reduced by KineFuse from 20.3 mm to 18.4 mm, while they remain unchanged in NaiveFuse. Also, in Figure 7, bone length consistency is ensured in KineFuse but not in AlgTri and NaiveFuse.

Methods	IMU-related	Others	Average
Baseline	30.1	20.3	24.9
NaiveFuse	23.5 ( $↓$ 6.6)	20.3 (-)	21.8 ( $↓$ 3.1)
KineFuse	19.2 ( $↓$ 10.9)	18.4 ( $↓$ 1.9)	18.8 ( $↓$ 6.1)

TABLE V: 3D pose estimation errors (mm) of the ablation study on the Human3.6M dataset. IMU-related joints comprise the knees, ankles, elbows and wrists.

Fig. 7: Illustration of different 3D pose results on the Human3.6M dataset. Since there is no real IMU information, AdaDeepFuse is not explored here.

Fig. 8: Illustration of some failure cases on the Total Capture dataset (left and middle) and the Human3.6M dataset (right). They are mainly extremely unusual poses. It is hard to estimate the upstream joints such as the belly, hips and shoulders in these cases.

V Analysis and Discussion

As shown in Figures 6 and 7, the AlgTri baseline does not perform well on unusual actions or on poses with occlusions. IMU information is occlusion-free, so the aligned bone vectors help improve the limb results in NaiveFuse. However, due to the limited resolution of the estimated heatmaps, the reconstructed joint locations vary, and the human skeleton suffers from the bone stretch issue. In this work, we address this issue by applying IK and FK layers, thus ensuring bone length consistency. Based on this improvement, the local rotation information is further utilized for better IMU-vision fusion. In KineFuse and AdaDeepFuse, we explore two effective approaches and achieve superior performance.

We show some failure cases on the two datasets in Figure 8. The first two columns are samples from Total Capture using AdaDeepFuse, while the last column is from Human3.6M using KineFuse. The main reason for failure is that the predictions of the upstream joints are unsatisfactory, which leads to poorer predictions of the downstream joints when IMU information is used.

Although our approaches achieve state-of-the-art performance, much work remains to be explored in the future. First, in addition to representing local rotation, another important advantage of IMU information lies in its temporal continuity and stability in fast motion. In this work, we convert estimated joint locations to pose parameters. The pose parameters, which can represent local rotations, are also well suited to interpolation for compensating pulse errors caused by occlusions during motion. IMU-vision fusion based on temporal information is therefore valuable to explore in our future work. Another limitation of the proposed approaches is that they rely on known bone lengths to build the rest T-pose, which are hard to obtain in real applications. Thus, we will try to integrate the prediction of bone lengths into the whole framework.

VI Conclusion

We propose a framework for IMU-vision fusion based on a parametric human representation. The information from IMU and vision data is aligned as bone vectors or local rotations and is aggregated via three effective approaches. The integrated IMU information not only corrects the offset errors of the estimated joints caused by occlusions and heatmap quantization, but also helps yield a plausible and realistic human pose. Extensive experiments with ablation analysis show that the proposed framework achieves superior performance on the Total Capture dataset. Validation experiments on the Human3.6M dataset further show that our approaches generalize to other benchmarks. The value of temporal continuity in IMU-vision fusion will be explored in the future.

References

[1] A. Arnab, C. Doersch, and A. Zisserman (2019) Exploiting temporal context for 3d human pose estimation in the wild. In CVPR, pp. 3395–3404.
[2] E. R. Bachmann, R. B. McGhee, X. Yun, and M. J. Zyda (2001) Inertial and magnetic posture tracking for inserting humans into networked virtual environments. In Proceedings of the ACM symposium on Virtual reality software and technology, pp. 9–16.
[3] F. Bogo, A. Kanazawa, C. Lassner, P. Gehler, J. Romero, and M. J. Black (2016) Keep it smpl: automatic estimation of 3d human pose and shape from a single image. In ECCV, pp. 561–578.
[4] A. Csiszar, J. Eilers, and A. Verl (2017) On solving the inverse kinematics problem using neural networks. In 2017 24th International Conference on Mechatronics and Machine Vision in Practice (M2VIP), pp. 1–6.
[5] R. Dabral, A. Mundhada, U. Kusupati, S. Afaque, A. Sharma, and A. Jain (2018) Learning 3d human pose from structure and motion. In ECCV, pp. 668–683.
[6] M. B. Del Rosario, H. Khamis, P. Ngo, N. H. Lovell, and S. J. Redmond (2018) Computationally efficient adaptive error-state kalman filter for attitude estimation. IEEE Sensors Journal 18 (22), pp. 9332–9342.
[7] T. Fan, K. V. Alwala, D. Xiang, W. Xu, T. Murphey, and M. Mukadam (2021) Revitalizing optimization for 3d human pose and shape estimation: a sparse constrained formulation. In ICCV, pp. 11457–11466.
[8] E. Foxlin (1996) Inertial head-tracker sensor fusion by a complementary separate-bias kalman filter. In Proceedings of the IEEE 1996 Virtual Reality Annual International Symposium, pp. 185–194.
[9] Q. Gao, J. Liu, Z. Ju, and X. Zhang (2019) Dual-hand detection for human–robot interaction by a parallel network based on hand detection and body pose estimation. IEEE Transactions on Industrial Electronics 66 (12), pp. 9663–9672.
[10] A. Gilbert, M. Trumble, C. Malleson, A. Hilton, and J. Collomosse (2019) Fusing visual and inertial sensors with semantics for 3d human pose estimation. International Journal of Computer Vision 127 (4), pp. 381–397.
[11] K. Grochow, S. L. Martin, A. Hertzmann, and Z. Popović (2004) Style-based inverse kinematics. In ACM SIGGRAPH, pp. 522–531.
[12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
[13] T. Helten, A. Baak, G. Bharaj, M. Müller, H. Seidel, and C. Theobalt (2013) Personalization and evaluation of a real-time depth-based full body tracker. In 3DV, pp. 279–286.
[14] D. Holden, T. Komura, and J. Saito (2017) Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–13.
[15] D. Holden, J. Saito, and T. Komura (2016) A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35 (4), pp. 1–11.
[16] F. Huang, A. Zeng, M. Liu, Q. Lai, and Q. Xu (2020) Deepfuse: an imu-aware network for real-time 3d human pose estimation from multi-view image. In WACV, pp. 429–438.
[17] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2013) Human3.6M: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1325–1339.
[18] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov (2019) Learnable triangulation of human pose. In ICCV, pp. 7718–7727.
[19] A. Kamel, B. Sheng, P. Li, J. Kim, and D. D. Feng (2020) Hybrid refinement-correction heatmaps for human pose estimation. IEEE Transactions on Multimedia 23, pp. 1330–1342.
[20] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In CVPR, pp. 7122–7131.
[21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[22] M. Kocabas, C. P. Huang, O. Hilliges, and M. J. Black (2021) Pare: part attention regressor for 3d human body estimation. In ICCV, pp. 11127–11137.
[23] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler (2017) Unite the people: closing the loop between 3d and 2d human representations. In CVPR, pp. 6050–6059.
[24] J. Li, S. Bian, A. Zeng, C. Wang, B. Pang, W. Liu, and C. Lu (2021) Human pose regression with residual log-likelihood estimation. In ICCV, pp. 11025–11034.
[25] J. Li, C. Xu, Z. Chen, S. Bian, L. Yang, and C. Lu (2021) HybrIK: a hybrid analytical-neural inverse kinematics solution for 3d human pose and shape estimation. In CVPR, pp. 3383–3393.
[26] Y. Li, S. Zhang, Z. Wang, S. Yang, W. Yang, S. Xia, and E. Zhou (2021) Tokenpose: learning keypoint tokens for human pose estimation. In ICCV, pp. 11313–11322.
[27] J. Liu, H. Ding, A. Shahroudy, L. Duan, X. Jiang, G. Wang, and A. C. Kot (2020) Feature boosting network for 3d pose estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 494–501.
[28] Y. Liu, J. Gall, C. Stoll, Q. Dai, H. Seidel, and C. Theobalt (2013) Markerless motion capture of multiple characters using multiview image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (11), pp. 2720–2735.
[29] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (TOG) 34 (6), pp. 1–16.
[30] C. Malleson, J. Collomosse, and A. Hilton (2019) Real-time multi-person motion capture from multi-view video and imus. International Journal of Computer Vision, pp. 1–18.
[31] C. Malleson, A. Gilbert, M. Trumble, J. Collomosse, A. Hilton, and M. Volino (2017) Real-time full-body motion capture from video and imus. In 3DV, pp. 449–457.
[32] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In ICCV, pp. 2640–2649.
[33] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas, and C. Theobalt (2017) Vnect: real-time 3d human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–14.
[34] R. Navaratnam, A. Thayananthan, P. H. Torr, and R. Cipolla (2005) Hierarchical part-based human body pose estimation. In BMVC.
[35] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black (2019) Expressive body capture: 3d hands, face, and body from a single image. In CVPR, pp. 10975–10985.
[36] D. Pavllo, C. Feichtenhofer, M. Auli, and D. Grangier (2019) Modeling human motion with quaternion-based neural networks. International Journal of Computer Vision, pp. 1–18.
[37] D. Pavllo, C. Feichtenhofer, M. Auli, and D. Grangier (2020) Modeling human motion with quaternion-based neural networks. International Journal of Computer Vision 128 (4), pp. 855–872.
[38] G. Pons-Moll, A. Baak, T. Helten, M. Müller, H. Seidel, and B. Rosenhahn (2010) Multisensor-fusion for 3d full-body human motion capture. In CVPR, pp. 663–670.
[39] H. Qiu, C. Wang, J. Wang, N. Wang, and W. Zeng (2019) Cross view fusion for 3d human pose estimation. In ICCV, pp. 4342–4351.
[40] D. Roetenberg, H. J. Luinge, C. T. Baten, and P. H. Veltink (2005) Compensation of magnetic disturbances improves inertial and magnetic sensing of human body segment orientation. IEEE Transactions on Neural Systems and Rehabilitation Engineering 13 (3), pp. 395–405.
[41] M. Shi, K. Aberman, A. Aristidou, T. Komura, D. Lischinski, D. Cohen-Or, and B. Chen (2020) Motionet: 3d human motion reconstruction from monocular video with skeleton consistency. ACM Transactions on Graphics (TOG) 40 (1), pp. 1–15.
[42] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei (2018) Integral human pose regression. In ECCV, pp. 529–545.
[43] M. Trumble, A. Gilbert, A. Hilton, and J. Collomosse (2018) Deep autoencoder for combined human pose estimation and body model upscaling. In ECCV, pp. 784–800.
[44] M. Trumble, A. Gilbert, C. Malleson, A. Hilton, and J. P. Collomosse (2017) Total capture: 3d human pose estimation fusing video and inertial sensors. In BMVC, Vol. 2, pp. 1–13.
[45] R. Villegas, J. Yang, D. Ceylan, and H. Lee (2018) Neural kinematic networks for unsupervised motion retargetting. In ICCV, pp. 8639–8648.
[46] R. V. Vitali, R. S. McGinnis, and N. C. Perkins (2020) Robust error-state kalman filter for estimating imu orientation. IEEE Sensors Journal 21 (3), pp. 3561–3569.
[47] D. Vlasic, R. Adelsberger, G. Vannucci, J. Barnwell, M. Gross, W. Matusik, and J. Popović (2007) Practical motion capture in everyday surroundings. ACM Transactions on Graphics (TOG) 26 (3), pp. 35–es.
[48] D. Vlasic, I. Baran, W. Matusik, and J. Popović (2008) Articulated mesh animation from multi-view silhouettes. In ACM SIGGRAPH, pp. 1–9.
[49] T. Von Marcard, G. Pons-Moll, and B. Rosenhahn (2016) Human pose estimation from video and imus. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (8), pp. 1533–1547.
[50] M. Xsens (2009) Full 6dof human motion tracking using miniature inertial sensors. Daniel RoetenbergLuingeHenk.
[51] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang (2017) Learning feature pyramids for human pose estimation. In ICCV, pp. 1281–1290.
[52] M. Ye and R. Yang (2014) Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In CVPR, pp. 2345–2352.
[53] X. Yi, Y. Zhou, and F. Xu (2021) TransPose: real-time 3d human translation and pose estimation with six inertial sensors. ACM Transactions on Graphics (TOG) 40 (4), pp. 1–13.
[54] T. Zhang, B. Huang, and Y. Wang (2020) Object-occluded human shape and pose estimation from a single color image. In CVPR, pp. 7376–7385.
[55] Z. Zhang, C. Wang, W. Qin, and W. Zeng (2020) Fusing wearable imus with multi-view images for human pose estimation: a geometric approach. In CVPR, pp. 2200–2209.
[56] Z. Zheng, T. Yu, H. Li, K. Guo, Q. Dai, L. Fang, and Y. Liu (2018) Hybridfusion: real-time performance capture using a single depth sensor and sparse imus. In ECCV, pp. 384–400.