REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer

Quanwei Yang (yangquanwei@mail.ustc.edu.cn), University of Science and Technology of China; Xinchen Liu (liuxinchen1@jd.com) and Wu Liu (liuwu1@jd.com), JD Explore Academy; Hongtao Xie (htxie@ustc.edu.cn), University of Science and Technology of China; Xiaoyan Gu (guxiaoyan@iie.ac.cn), Institute of Information Engineering, Chinese Academy of Sciences; Lingyun Yu (yuly@mail.ustc.edu.cn), University of Science and Technology of China and Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Yongdong Zhang (zhyd73@ustc.edu.cn), University of Science and Technology of China

Abstract.

Human Video Motion Transfer (HVMT) aims to generate, from an image of a source person, a video of that person imitating the motion of a driving person. Existing methods for HVMT mainly exploit Generative Adversarial Networks (GANs) to perform a warping operation based on the flow estimated from the source person image and each driving video frame. However, these methods always generate obvious artifacts due to the dramatic differences in poses, scales, and shifts between the source person and the driving person. To overcome these challenges, this paper presents a novel REgion-to-whole human MOtion Transfer (REMOT) framework based on GANs. To generate realistic motions, REMOT adopts a progressive generation paradigm: it first generates each body part in the driving pose without flow-based warping, then composites all parts into a complete person performing the driving motion. Moreover, to preserve the natural global appearance, we design a Global Alignment Module that aligns the scale and position of the source person with those of the driving person based on their layouts. Furthermore, we propose a Texture Alignment Module that keeps each part of the person aligned according to texture similarity. Extensive quantitative and qualitative experiments show that REMOT achieves state-of-the-art results on two public benchmarks.

Human Motion Transfer; Video Generation; Generative Adversarial Network
Published in: Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal. DOI: 10.1145/3503161.3547896. ISBN: 978-1-4503-9203-7/22/10.
CCS concepts: Computing methodologies: Computer vision; Applied computing: Media arts.
This work was done while Quanwei Yang was an intern at JD Explore Academy. Xinchen Liu is the corresponding author.
Figure 1. Comparison of our method with direct transfer methods and warping-based transfer methods. The images with the orange box are the source image and human parsing. The images with the blue box represent the driving video frame, pose map and human parsing. The images with the red box are the generated frames. Obvious artifacts in the generated images are marked with red dotted lines.

1. Introduction

The task of Human Video Motion Transfer (HVMT) is to synthesize a realistic video in which a source person performs a desired driving motion (Ma et al., 2017; Wei et al., 2021). Specifically, given a source image and a driving video, the person in the source image must imitate the action of the person in the driving video. In recent years, this task has received intense research attention because of its wide range of applications, such as digital humans (Shen et al., 2021) and person re-identification (Qian et al., 2018; Zheng et al., 2022). However, the quality of generated motion videos is still far from satisfactory and needs further improvement.

With the development of deep learning, remarkable achievements have been made in many fields (Goodfellow et al., 2014; Vaswani et al., 2017; Zhang et al., 2021; Liu et al., 2019a; Sun et al., 2022; He et al., 2017; Redmon et al., 2016; Long et al., 2015; Liu et al., 2017; Cao et al., 2015; Liu et al., 2021b; He et al., 2016; Liu et al., 2018). Among them, the rise of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) has made synthesized images and videos increasingly realistic (Wang et al., 2018a, b; Yu et al., 2022, 2021; Karras et al., 2019). Existing HVMT methods also adopt GAN-based frameworks to achieve realistic results (Chan et al., 2019). According to their generalization ability, current HVMT methods can be divided into personalized HVMT and general-purpose HVMT. Personalized methods (Chan et al., 2019; Wang et al., 2021; Yang et al., 2020) implicitly embed the global appearance of a specific person into the generator, then synthesize images of this person conditioned on the driving poses. Although such methods can synthesize realistic video frames, they require a large amount of video data of a specific person for training. Moreover, they do not generalize to different persons, so an exclusive model has to be trained for each subject.

To overcome the huge per-subject data requirement and the poor generalization of personalized methods, recent works mainly focus on general-purpose methods (Liu et al., 2019b; Wei et al., 2021; Siarohin et al., 2019; Jeon et al., 2020; Wang et al., 2019; Huang et al., 2021; Gafni et al., 2021). This type of method aims to learn a model that can be adapted to generate unseen persons. Even if the generated images are not satisfactory, the results can be improved by simply fine-tuning the existing model instead of re-training it. Therefore, this paper also concentrates on general-purpose human motion transfer, which has a wider range of applications.

Although general-purpose methods have many advantages, they still face several challenges. First of all, because different parts of one person, such as the face, the body, and the pants, usually have different textures, it is difficult to model these complex textures with a single uniform GAN. For example, some direct transfer methods (Wang et al., 2019) use a GAN to generate the whole person by fusing the appearance and pose features, which hardly preserves local details, as shown in Figure 1 (a). Second, the drastic variations of poses and viewpoints between the source person and the driving person bring great challenges to human motion transfer. Some recent methods adopt warping flows, such as the optical flow (Wei et al., 2021) or the transform flow (Liu et al., 2019b), to achieve pose transfer, as shown in Figure 1 (b). Nevertheless, the generated frames still have significant artifacts caused by errors in flow estimation. Last but not least, the misalignment of scales and positions between the source person and the driving person can also degrade the quality of generated videos.

To overcome the above challenges, this paper presents a novel REgion-to-whole framework for realistic human MOtion Transfer, named REMOT. Figure 1 (c) demonstrates the conceptual architecture of the proposed progressive framework. The REMOT first predicts the layout (semantic parsing masks) of the human body conditioned on the driving pose. Then a region generator is designed to generate each body part individually. Moreover, we propose a Global Alignment Module (GAM) to keep the scales and positions of the source person and the driving person aligned for accurate feature fusion. Finally, a GAN-based whole compositor is utilized to integrate the parts and refine the coarse synthesized person image. To preserve the local details of the source person, we design a Texture Alignment Module (TAM) to align the features of the source person image and the generated frames. In summary, our key contributions are as follows:

  • We propose a region-to-whole framework, named REMOT, for the general-purpose HVMT task. REMOT discards the flow-based warping operation and takes a progressive generation paradigm, which can better deal with dramatic variations of poses by gradually generating person frames from parts to the whole.

  • To overcome the differences in scales and positions between the source person and the driving person, we propose a Global Alignment Module to adapt the scale and position of the source person to those of the driving person.

  • A Texture Alignment Module is proposed to align the features of the source image and the initially generated image to preserve more details like textures of clothes and edges of the bodies.

We conduct extensive experiments on the iPER Dataset (Liu et al., 2019b) and the SoloDance Dataset (Wei et al., 2021). Experimental results show that our model achieves state-of-the-art results both quantitatively and qualitatively.

2. Related Work

2.1. Image-to-Image Translation

Our method is related to the image-to-image translation task, which converts a source image to a target image conditioned on a specific style, resolution, content, etc. (Isola et al., 2017). Existing image-to-image translation methods usually adopt Conditional Generative Adversarial Networks (Mirza and Osindero, 2014; Li et al., 2019), which take a specific image (e.g., a semantic segmentation map, sketch, or edge map) as input for image generation. Such models can generate the desired image by editing the properties of the input image. Therefore, compared with unconstrained GANs, image-to-image translation has a wider range of applications. For example, pix2pix (Isola et al., 2017) used a conditional discriminator to make the generated image match the input image as closely as possible, so it can generate realistic corresponding images. Based on pix2pix, pix2pixHD (Wang et al., 2018a) generated high-resolution images by using multi-scale generators and discriminators. Vid2vid (Wang et al., 2018b) addressed the temporal inconsistency of generated videos: by adding optical flow constraints to the generator and discriminator, vid2vid can generate temporally coherent videos.

The personalized motion transfer is to translate the pose map to the corresponding RGB image, where the person’s appearance is implicitly embedded into the generator. Therefore, many image-to-image translation works can be used for personalized motion transfer. However, such methods cannot be directly applied to HVMT for the generation of varied poses and precise details of persons.

Figure 2. The pipeline of our proposed REMOT. It consists of two networks: (a) Region Generation Network (RGN) and (b) Whole Composition Network (WCN). The RGN takes the source image and driving frame as input to obtain each region of the source person in the driving pose. (The blue arrows represent the data preprocessing: pose detection, layout detection.) The WCN takes the output of the RGN, the source foreground image after global alignment, and the human layout concatenated with the pose map as input, then generates the final target frame.

2.2. Human Video Motion Transfer

Personalized HVMT. On the basis of image-to-image translation methods, some works make specific improvements based on the characteristics of this task. To conform to the size of the source person, EDN (Chan et al., 2019) matched the poses between the driving person and the source person by designing a global pose normalization. In addition, EDN designed a face GAN to generate a more precise face region. DIW (Wang et al., 2021) encoded non-uniformly sampled pose maps of the past ten frames to capture details of appearance change and refine the pose features of a single frame. As mentioned earlier, personalized methods cannot generalize to unseen persons and need massive computing resources.

General-purpose HVMT. To overcome the limitations of personalized HVMT, recent studies focus on general-purpose HVMT. In order to better preserve appearance details, some works adopt a warp operation. Methods focusing on the animation of general objects, such as FOMM (Siarohin et al., 2019) and MRAA (Siarohin et al., 2021), utilized predicted keypoints or regions to compute the flow field between the source image and the driving image, and then warped the features of the source image to obtain the target image. However, these methods do not take temporal consistency into consideration, which results in temporally inconsistent synthesized videos. LWGAN (Liu et al., 2019b) warped the source information in both image and feature spaces based on the SMPL model. However, the SMPL model (Loper et al., 2015) is only suitable for smooth human bodies and cannot represent a human body with complex clothes. Instead of warping the features of the source image, C2F (Wei et al., 2021) estimated the optical flow of the clothing regions and directly warped the clothing regions according to this optical flow. Unfortunately, when the driving pose differs greatly from the source pose, such methods usually fail due to inaccurate flow estimation. Different from the above methods, our proposed method does not use a warp operation. This design avoids complex transformation calculations and copes better with arbitrary poses.

In addition, some methods consider the diversity of different regions of the human body and treat each region separately. SSF (Gafni et al., 2021) encoded each region of the body separately, then integrated the features of different regions to directly generate the final target image. SHUP (Balakrishnan et al., 2018) first transformed each part of the source person to the spatial position of the driving pose, then directly generated the whole person image. Moreover, PGTM (Huang et al., 2021) predicted the UV map (Güler et al., 2018) of the source person conditioned on the driving pose, then filled it with the texture of each region of the source person image. However, accurately predicting dense UVs in different poses is challenging. StylePeople (Grigorev et al., 2021) used the parameterized human body model SMPL-X (Pavlakos et al., 2019) to model the pose and shape of subjects, then used neural rendering to render the appearance details of each region. Our method also encodes each region of the source image separately. However, unlike these methods, we directly generate the corresponding target region images, then use the Whole Composition Network to integrate the generated regions into the final target image. Compared with directly generating the whole person image, generating the region images first better preserves appearance details.

3. The Proposed Method

Given the source image $I_s$ and a driving video $\{I_d^t\}_{t=1}^{N}$, where $I_d^t$ represents the driving frame at time $t$ of the driving video, our goal is to generate a realistic video in which the person in the source image imitates the motion in the driving video. We can formulate this task as

$$\{\hat{I}^t\}_{t=1}^{N} = G\big(I_s, \{I_d^t\}_{t=1}^{N}\big), \tag{1}$$

where $G$ refers to the generative model, $N$ represents the number of video frames, and $\hat{I}^t$ represents the generated image at time $t$.

Figure 2 gives an overview of our pipeline. Our proposed method consists of two networks: the Region Generation Network and the Whole Composition Network. In the Region Generation Network, given the pose $P_d^t$ of the driving image and the human body layout $L_s$ of the source image $I_s$, we first generate the layout $\hat{L}^t$ of the source person conditioned on the driving pose through the Layout GAN. Then the Region GAN takes the generated mask of each region and the corresponding region of the source person after global alignment as input, and obtains each source person region image in the driving pose. Finally, in the Whole Composition Network, we refine the coarse person image by aligning it with the features of the source person through the Texture Alignment Module. In addition, we add the background image to the generated person image to obtain the final target image $\hat{I}^t$. In the following, we introduce the details of these two networks.

3.1. Region Generation Network

As mentioned before, we first generate each source person region in the driving pose because different regions of the human body have significantly different texture patterns, which makes it difficult to directly generate the whole human body. To achieve this, we first need to generate the layout of the source person in the driving pose through the Layout GAN, as shown in Figure 2 (a).

Layout GAN. In order to represent the human pose, we first use a human keypoint detection method (Cao et al., 2021; Liu et al., 2021a, 2022b, 2022a) to predict the 2-D keypoints of the human body. Then, according to a predefined connection strategy, we get the pose connection map $P_d^t$ of the driving person, whose spatial size $H \times W$ is the resolution of the image. In order to represent the human body layout, we use human parsing methods (Li et al., 2020; Liu et al., 2019c; Zeng et al., 2021) to obtain the 18-channel semantic segmentation maps of the person image. Further, according to the similarity of textures in various regions, we merge the semantic segmentation maps into six classes, i.e., head, top, bottom, shoes, limbs, and background. In this way we get the body layout $L_s$ of the source person image and $L_d^t$ of the driving person image.
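The merging of fine parsing labels into six classes can be illustrated with a small lookup table. The sketch below is only an assumption about how such a merge might look: the 18 fine label names and their grouping are hypothetical (the text does not list them), and plain nested lists stand in for image arrays.

```python
# Hypothetical fine-grained parsing labels (index = label id). The paper does
# not list its 18 classes; these names are illustrative assumptions.
FINE_LABELS = [
    "background", "hat", "hair", "sunglasses", "upper_clothes", "skirt",
    "pants", "dress", "belt", "left_shoe", "right_shoe", "face",
    "left_leg", "right_leg", "left_arm", "right_arm", "bag", "scarf",
]

# The six merged classes named in the text.
COARSE_LABELS = ["background", "head", "top", "bottom", "shoes", "limbs"]

# Assumed grouping by texture similarity (hypothetical).
FINE_TO_COARSE = {
    "background": "background", "bag": "background",
    "hat": "head", "hair": "head", "sunglasses": "head", "face": "head",
    "upper_clothes": "top", "dress": "top", "scarf": "top",
    "skirt": "bottom", "pants": "bottom", "belt": "bottom",
    "left_shoe": "shoes", "right_shoe": "shoes",
    "left_leg": "limbs", "right_leg": "limbs",
    "left_arm": "limbs", "right_arm": "limbs",
}


def merge_layout(label_map):
    """Map a 2-D grid of fine label ids to coarse label ids."""
    lookup = [COARSE_LABELS.index(FINE_TO_COARSE[name]) for name in FINE_LABELS]
    return [[lookup[label] for label in row] for row in label_map]


def to_one_hot(coarse_map, num_classes=len(COARSE_LABELS)):
    """Turn a coarse label grid into one binary mask per class."""
    return [
        [[1 if label == c else 0 for label in row] for row in coarse_map]
        for c in range(num_classes)
    ]
```

The one-hot masks produced by `to_one_hot` correspond to the per-region masks that the later stages consume, one channel per merged class.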

For the network structure, we also use vid2vid (Wang et al., 2018b), as in (Wei et al., 2021). Specifically, we use three encoders to encode the source body layout, the concatenated pose connection maps of the driving person, and the concatenated previously generated results, obtaining three feature maps. A first decoder decodes the sum of the three feature maps to get a raw result. Similarly, a second decoder decodes the sum of two of the feature maps to obtain the optical flow and its weight, from which the final result is obtained. The entire Layout GAN can be formulated as

(2a)
(2b)
(2c)

The equations use element-wise addition and element-wise multiplication, and inputs described as concatenated are joined along the channel dimension. The warp operation transforms the previously generated result according to the predicted optical flow. It should be emphasized that this warp operation only integrates the result generated at the previous time step to keep the generated video frames temporally coherent; it is not applied between the driving image and the source image.
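The temporal fusion described above can be sketched in a few lines. This is an illustrative sketch under assumptions, not the authors' implementation: the flow is taken as a per-pixel `(dx, dy)` offset sampled with nearest-neighbor lookup, and the convention that the weight multiplies the raw output (with one minus the weight on the warped previous result) is assumed.

```python
def warp(prev, flow):
    """Backward-warp `prev` with a per-pixel (dx, dy) flow, nearest neighbor."""
    h, w = len(prev), len(prev[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            # Clamp the sampling location to the image border.
            sx = min(max(x + int(round(dx)), 0), w - 1)
            sy = min(max(y + int(round(dy)), 0), h - 1)
            out[y][x] = prev[sy][sx]
    return out


def blend(raw, prev, flow, weight):
    """Mix the raw output with the warped previous result, pixel by pixel."""
    warped = warp(prev, flow)
    return [
        [weight[y][x] * raw[y][x] + (1.0 - weight[y][x]) * warped[y][x]
         for x in range(len(raw[0]))]
        for y in range(len(raw))
    ]
```

With a weight of 1 everywhere the output is the raw prediction alone, and with a weight of 0 it is the warped previous result alone, which is what makes consecutive frames coherent.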

Region GAN. After obtaining the human body layout of the source person conditioned on driving pose from Layout GAN, Region GAN is proposed to generate five person regions conditioned on the driving pose.

For the network structure, we still adopt the vid2vid structure. Since the Region GAN only generates raw region images of the source person, we use a single generator to generate all five regions of the human body. This not only saves computational resources but also prevents the model from over-fitting. Specifically, we use one encoder to encode the $i$-th region mask of the human body to obtain a mask feature map. In addition, we use another encoder to encode the $i$-th region of the source person image together with the previous generations to obtain an appearance feature map. Here we directly concatenate the source region and the previous generations along the channel dimension, so that the previous generations enhance the temporal consistency of appearance details. The original Region GAN (before global alignment) can be expressed as

(3)

Global Alignment Module. The scale and spatial position of the person in a video may vary greatly, so the scales and positions of the source region image and the generated region mask may not match, and the features extracted from them are then also misaligned. Therefore, simply adding the two feature maps does not fuse the features well. To address this, we propose the Global Alignment Module (GAM), a simple but effective affine transformation applied to the source image before it is fed to the encoder, so that its scale and location are consistent with those of the generated layout.

See the green rectangle in Figure 2 (a). First, we obtain the human masks of the source person and of the generated person from their respective layouts. Using the source human mask, we obtain the foreground image of the source image. We express the process of aligning the source foreground image with the following formulation:

(4)

Here the output is the globally aligned foreground image of the source image. The scaling factor is obtained from the heights of the two human masks, and the position offset is obtained from the center position coordinates of the two masks. With global alignment, the Region GAN can be expressed as

(5)
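The global alignment step can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's exact formulation: the scale is taken as the ratio of target to source mask height, the offset is taken from the bounding-box centers of the two masks, and nearest-neighbor sampling on nested lists replaces a real image warp. The function names `bounding_box` and `global_align` are hypothetical.

```python
import math


def bounding_box(mask):
    """Return (top, bottom, left, right) of the non-zero area of a binary mask."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for row in mask for x, v in enumerate(row) if v]
    return min(rows), max(rows), min(cols), max(cols)


def global_align(src_fg, src_mask, tgt_mask):
    """Scale and shift the source foreground so it matches the target mask."""
    top_s, bot_s, left_s, right_s = bounding_box(src_mask)
    top_t, bot_t, left_t, right_t = bounding_box(tgt_mask)
    # Assumed scaling factor: ratio of the two person heights.
    scale = (bot_t - top_t + 1) / (bot_s - top_s + 1)
    # Assumed offset: derived from the two bounding-box centers.
    cy_s, cx_s = (top_s + bot_s) / 2, (left_s + right_s) / 2
    cy_t, cx_t = (top_t + bot_t) / 2, (left_t + right_t) / 2
    h, w = len(tgt_mask), len(tgt_mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: find the source pixel for each output pixel.
            sy = math.floor(cy_s + (y - cy_t) / scale + 0.5)
            sx = math.floor(cx_s + (x - cx_t) / scale + 0.5)
            inside = 0 <= sy < len(src_fg) and 0 <= sx < len(src_fg[0])
            if inside and src_mask[sy][sx]:
                out[y][x] = src_fg[sy][sx]
    return out, scale
```

Using the inverse mapping (output pixel to source pixel) avoids holes in the aligned image when the source person is enlarged.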

The loss of the Region GAN is the sum of the losses of the five synthesized region images, and each loss consists of three terms, i.e., the reconstruction loss, the perceptual loss (Johnson et al., 2016), and the adversarial loss.

Reconstruction loss. It directly constrains the generated region image $\hat{I}_i^t$ to be close to the ground truth $I_i^t$ in pixel space. Compared with the L2 loss, the L1 loss focuses more on subtle differences between images. Its formulation is as follows:

$$\mathcal{L}_{rec} = \big\| \hat{I}_i^t - I_i^t \big\|_1 \tag{6}$$

Perceptual loss. It regularizes the generated image and ground truth to be closer in multi-dimensional feature space. The perceptual loss includes the feature content loss and feature style loss, which can be expressed as

(7)

Here the features are taken from the layers of the pre-trained VGG-19 (Simonyan and Zisserman, 2015) model, and the style term compares the Gram matrices of these features.
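The reconstruction term and the content and style parts of the perceptual term can be sketched as follows. This is a hedged illustration: the per-layer features would come from a pre-trained VGG-19 in practice, whereas here they are plain C x N lists; the use of a mean absolute difference for both terms and the normalization of the Gram matrix by C * N are assumptions.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally sized 2-D grids."""
    total = sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def gram_matrix(features):
    """Gram matrix of a C x N feature matrix, normalized by C * N."""
    c, n = len(features), len(features[0])
    return [
        [sum(fi[k] * fj[k] for k in range(n)) / (c * n) for fj in features]
        for fi in features
    ]


def perceptual_loss(feats_fake, feats_real):
    """Sum content and style terms over a list of per-layer C x N features."""
    loss = 0.0
    for fake, real in zip(feats_fake, feats_real):
        loss += l1_loss(fake, real)                            # content term
        loss += l1_loss(gram_matrix(fake), gram_matrix(real))  # style term
    return loss
```

The Gram matrix discards spatial layout and keeps only channel co-activation statistics, which is why it serves as a texture (style) descriptor.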

Adversarial loss. It aims to force the synthesized images to follow a distribution similar to that of real images. In order to make the network focus on multi-scale image details, we also use the multi-scale conditional discriminator proposed in pix2pixHD (Wang et al., 2018a). It takes a synthesized image and the corresponding human region mask as input. Its expression is

(8)

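Therefore, the entire loss function of the generator is

$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{per}\,\mathcal{L}_{per}, \tag{9}$$

where $\lambda_{rec}$ and $\lambda_{per}$ are the weights for the reconstruction loss and perceptual loss, respectively.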
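As a small worked example of the weighted combination, the sketch below sums the three terms over the body regions. The function name and the tuple layout are hypothetical; the default weights of 10 follow the value reported in the implementation details.

```python
def generator_loss(region_terms, lambda_rec=10.0, lambda_per=10.0):
    """Sum the weighted generator loss over all body regions.

    `region_terms` holds one (adversarial, reconstruction, perceptual)
    tuple of scalar loss values per region.
    """
    return sum(
        adv + lambda_rec * rec + lambda_per * per
        for adv, rec, per in region_terms
    )
```

For instance, five regions that each contribute an adversarial loss of 1.0, a reconstruction loss of 0.1, and a perceptual loss of 0.2 give 5 × (1.0 + 10 × 0.1 + 10 × 0.2) = 20.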

For the discriminator, its loss function is

(10)

3.2. Whole Composition Network

After generating each region image $\hat{I}_i^t$, we obtain the foreground image of the target image by summing the generated regions:

$$\hat{I}_{fg}^t = \sum_{i=1}^{5} \hat{I}_i^t \tag{11}$$

However, this is only the sum of the individual body regions generated by the Region GAN, so the connections between regions and some details are still insufficient. In addition, we still need to add the background to the generated person image. Therefore, the Whole Composition Network is proposed to further enhance the details of the person, correct unreasonable areas, and add the background image to the person image.
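The region summation in Eq. (11) can be sketched as below. Multiplying each region image by its mask before summing is an assumption made here so that stray pixels outside a region do not leak into the result; the text itself only states that the regions are summed.

```python
def compose_foreground(region_images, region_masks):
    """Sum masked region images into one coarse foreground image."""
    h, w = len(region_images[0]), len(region_images[0][0])
    foreground = [[0.0] * w for _ in range(h)]
    for image, mask in zip(region_images, region_masks):
        for y in range(h):
            for x in range(w):
                # Each region only contributes inside its own mask.
                foreground[y][x] += image[y][x] * mask[y][x]
    return foreground
```

Because the region masks come from one layout and do not overlap, every foreground pixel receives exactly one contribution.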

As shown in Figure 2 (b), the Whole Composition Network uses three encoders to extract information. a) One encoder encodes the concatenated coarse foreground images of three adjacent frames to obtain an appearance feature map; taking three adjacent frames as input implicitly improves the temporal consistency of the generated frames. b) A second encoder encodes the concatenated human layout and pose map to get a pose feature map. c) In order to exploit the true foreground information of the source person image again, a third encoder encodes the globally aligned source foreground image to get a source texture feature map. Because the coarse foreground, the layout, and the pose map are all in the driving pose, the feature maps from a) and b) can be directly added for feature fusion.

Texture Alignment Module. However, the aligned source foreground and the coarse generated foreground show the same person with the same textures but in different poses. So the Texture Alignment Module (TAM) is proposed to better fuse their feature maps for refining local details. TAM calculates the texture similarity between each position of one feature map and each position of the other, then adds the aligned source features to the generated-foreground features.

As shown in the blue rectangle of Figure 2 (b), we first reshape each of the two feature maps so that every spatial position becomes one feature vector. Then we calculate the similarity of each position in one map to each position in the other to get the affinity matrix. Here we use the cosine similarity to represent the texture similarity. The calculation formula is as follows:

(12)

Here the product is a matrix multiplication, and the two operands are the features of the two maps at the $i$-th and $j$-th positions, respectively. Next, we multiply the source features by the affinity matrix to get the aligned features, add them element-wise to the generated-foreground features, and finally reshape the result to the original dimensions to get the fused feature map, which can be formulated as

(13)
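The affinity computation and fusion can be sketched on small lists of per-position feature vectors. This is an illustration under assumptions, not the authors' code: the source features are the ones being aligned, the affinity is used without any further normalization such as a softmax, and the names `cosine` and `texture_align` are hypothetical.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + eps)


def texture_align(gen_feats, src_feats):
    """Align source features to generated features and fuse them.

    Both inputs are lists of per-position feature vectors (N x C), i.e.
    feature maps already reshaped so each spatial position is one row.
    """
    # affinity[i][j]: similarity of generated position i to source position j.
    affinity = [[cosine(g, s) for s in src_feats] for g in gen_feats]
    fused = []
    for g, weights in zip(gen_feats, affinity):
        # Affinity-weighted sum of source features for this position.
        aligned = [
            sum(w * s[c] for w, s in zip(weights, src_feats))
            for c in range(len(g))
        ]
        fused.append([a + b for a, b in zip(g, aligned)])
    return affinity, fused
```

Each generated position thus gathers texture from the source positions it resembles most, regardless of where they lie spatially, which is what lets the module bridge the pose difference.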

After getting the fused texture feature, we add it to the pose feature and send it to a decoder to get the final foreground image. The entire foreground image generation process can be expressed as

(14)

Although a realistic person image is synthesized, we still need to add a reasonable background to the foreground person image. Therefore, we send the summed feature maps into another decoder to get a soft mask, then multiply the mask with the background image to get the corresponding background. Finally, we add the background to the foreground image to get the final image. Its calculation formula can be expressed as

(15)
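The final composition step can be sketched as a per-pixel operation. The sketch follows the wording above (background weighted by a soft mask, then added to the foreground); where the background image comes from is not specified here, so the `background` argument is simply an assumed input.

```python
def composite(foreground, background, soft_mask):
    """Add the mask-weighted background to the foreground, pixel by pixel."""
    return [
        [f + m * b for f, b, m in zip(frow, brow, mrow)]
        for frow, brow, mrow in zip(foreground, background, soft_mask)
    ]
```

A soft mask value of 0 keeps the person pixel untouched, while a value of 1 lets the full background through at locations where the foreground is empty.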

It should be noted that we also use the same strategy as Layout GAN, using the predicted optical flow of adjacent frames to further refine the generated image. In addition to using the same loss function as Region GAN, we also design another discriminator specifically for the facial region to generate realistic faces.

3.3. Training and Inference

Training. In the training phase, in order to get the paired training data as supervision information, we select a forward frame in a video as the source image and this video as the driving video. Before training our model, we first use pre-trained models to extract the poses and the human parsing maps of frames. We train our REMOT step by step. First, we train Layout GAN and Region GAN separately for 10 epochs. Then we train the Whole Composition Network for 10 epochs using the output of Region GAN.

Inference. In the inference stage, the choice of driving video is not limited, as long as it is a clear video of any solo person. For convenience, our model can perform end-to-end inference.

4. Experiments

This section consists of four subsections. We first introduce the implementation details and datasets (Sec. 4.1). We then compare our method with representative existing methods quantitatively and qualitatively (Sec. 4.2 and Sec. 4.3). Finally, we show the effectiveness of each module through the ablation study (Sec. 4.4).

Methods                      SSIM↑    PSNR↑    LPIPS↓   FID↓     TCM↑
EDN (Chan et al., 2019)      0.840    23.39    0.076    56.29    0.361
FSV2V (Wang et al., 2019)    0.780    20.44    0.110    110.99   0.184
LWGAN (Liu et al., 2019b)    0.825    21.43    0.091    77.99    0.197
C2F (Wei et al., 2021)       0.849    24.27    0.072    55.07    0.687
Ours                         0.856    25.33    0.065    53.04    0.793
Table 1. Quantitative comparison to state-of-the-art methods on the iPER dataset. ↑ indicates that higher is better; ↓ indicates that lower is better.

4.1. Implementation Details and Datasets

Implementation Details. Like (Wei et al., 2021), all frames are resized and cropped to 256×192 before training the models. We use OpenPose (Cao et al., 2021) to get the 2-D keypoints of the human body, SCHP (Li et al., 2020) to get the human parsing, and the optical flow obtained by FlowNet2 (Ilg et al., 2017) as ground truth. We set the weights of the reconstruction loss and the perceptual loss to 10 to balance the losses. We use Adam optimizers (Kingma and Ba, 2015) with a learning rate of 0.0002, and train on Nvidia Tesla P40.

Dataset. In order to comprehensively evaluate the performance of our proposed model, we choose two public datasets for training and testing.

iPER Dataset. The iPER dataset was proposed by (Liu et al., 2019b) and was collected in a laboratory environment. The dataset includes 30 actors with a total of 206 videos. The actions of each actor include an A-pose and random actions. The scale of iPER is relatively large and the actions are relatively simple. Following the original protocol of iPER, 164 videos are used for training and the remaining 42 videos for testing.

SoloDance Dataset. Different from the iPER dataset, SoloDance dataset contains 179 solo dance videos in real scenes collected online (Wei et al., 2021), and each video has 300 frames. The dataset has 143 subjects, and its action types are mainly dances such as hip-hop and modern dance. Compared with the iPER dataset, the actions in this dataset are more complex and the types of clothes are more diverse. However, this dataset is not as large as the iPER dataset. Following the setting of (Wei et al., 2021), we also use 153 videos for training and 26 videos for testing.

Compared Approaches. We select several representative state-of-the-art HVMT methods for comparison with our proposed approach: the personalized method EDN (Chan et al., 2019), the direct generation method FSV2V (Wang et al., 2019), the feature warping method LWGAN (Liu et al., 2019b), and the image warping method C2F (Wei et al., 2021). We use 3,000 video frames of each subject to train an exclusive EDN model, and the training strategies of the other methods are the same as for our method.

Methods                      SSIM↑            PSNR↑            LPIPS↓   FID↓     TCM↑
EDN (Chan et al., 2019)      0.811            23.22            0.051    53.17    0.347
FSV2V (Wang et al., 2019)    0.721            20.84            0.132    112.99   0.106
LWGAN (Liu et al., 2019b)    0.786 (0.935)    20.87 (22.48)    0.106    86.53    0.176
C2F (Wei et al., 2021)       0.879 (0.935)    26.65 (24.94)    0.049    46.49    0.641
Ours                         0.850 (0.953)    24.83 (27.89)    0.045    53.29    0.788
Table 2. Quantitative comparison to state-of-the-art methods on the SoloDance dataset. Values in parentheses are Mask-SSIM and Mask-PSNR.

4.2. Quantitative Results

Metrics. In order to comprehensively compare the quality of generated images, we use four image-level evaluation metrics, namely SSIM (Wang et al., 2004), PSNR, Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al., 2018), and Fréchet Inception Distance (FID) (Heusel et al., 2017), and one metric for evaluating the temporal consistency of generated videos, i.e., the Temporal Consistency Metric (TCM) (Yao et al., 2017). SSIM measures the structural similarity between the generated image and the real image in pixel space. PSNR evaluates the generation quality based on pixel-wise errors. LPIPS assesses the perceptual distance based on feature vectors. FID represents the Inception (Szegedy et al., 2016) feature distance between two image sets. TCM evaluates the temporal consistency of generated videos by calculating the distance between the current video frame and the neighboring frame warped with optical flow.

Figure 3. Qualitative results on the iPER dataset. Visible artifacts are marked with colored dotted boxes.
Figure 4. Qualitative results on the SoloDance dataset. Visible artifacts are marked with colored dotted boxes.

A quantitative comparison of our method with other methods on the iPER dataset is shown in Table 1. It shows that our proposed method outperforms the other methods in terms of image quality and temporal consistency. EDN trains an exclusive model for each specific person, so it achieves satisfactory results on all metrics. Because FSV2V directly generates the whole person image, which results in unreasonable person images, it performs the worst among all methods. LWGAN has no constraint on temporal consistency, resulting in jitter in the generated human video. Therefore, LWGAN does not perform well on the TCM metric. Since half of the videos in the iPER dataset contain rotating actions, it is difficult for C2F to handle such situations with drastic pose differences. Thanks to the proposed region-to-whole strategy, our method can be applied to more situations, thus achieving the best results on all metrics.

Table 2 compares our method with other methods on the SoloDance dataset. We achieve the best results on Mask-SSIM, Mask-PSNR, LPIPS, and TCM, where Mask means masking the background of the image according to the human parsing map. This shows that our method has advantages in the quality of generated person images and the temporal consistency of generated videos. The reason why we do not achieve the best results on the other metrics may be that, compared with the iPER dataset, the SoloDance dataset is not large enough to train a model composed entirely of GANs. In addition, C2F directly retains the clothes of the source image based on predicted optical flow, which allows it to achieve the best result on the FID metric. However, this also leads to unreasonable person images when the poses of the source person and the driving person differ dramatically, as can be seen in the areas marked by the yellow dotted boxes in Figures 3 and 4. Overall, our method achieves state-of-the-art results on both datasets.
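The masked metrics can be sketched as follows. This is a minimal illustration, assuming that the background label in the parsing map is 0, that a single parsing map (for example, that of the real frame) is applied to both images, and that the error is averaged over the full image after masking; none of these details is specified in the text, and the function names are hypothetical.

```python
import numpy as np


def mask_background(image: np.ndarray, parsing_map: np.ndarray,
                    background_label: int = 0) -> np.ndarray:
    """Set every background pixel to zero, keeping only the person region."""
    foreground = parsing_map != background_label
    if image.ndim == 3:
        # Broadcast the (H, W) mask over the channel axis of an (H, W, C) image.
        foreground = foreground[..., None]
    return np.where(foreground, image, 0)


def mask_psnr(generated: np.ndarray, real: np.ndarray, parsing_map: np.ndarray,
              max_value: float = 255.0) -> float:
    """PSNR in dB after masking the background of both images."""
    g = mask_background(generated, parsing_map).astype(np.float64)
    r = mask_background(real, parsing_map).astype(np.float64)
    mse = float(np.mean((g - r) ** 2))
    if mse == 0.0:
        return float("inf")  # the two foregrounds are identical
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

With this masking, differences that fall only in the background do not affect the score, so the metric reflects the quality of the generated person alone.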

4.3. Qualitative Results

As shown in Figures 3 and 4, we randomly select three video frames with different poses from the videos synthesized on the iPER and SoloDance datasets. Although EDN can generate realistic images in some cases, unseen poses often lead to unreasonable person images (marked with green boxes in Figures 3 and 4). EDN is therefore limited by the pose diversity of the particular person in the training set. Due to the inaccuracy of the SMPL predicted by HMR (Kanazawa et al., 2018), LWGAN generates uneven textures and jitter (marked with red boxes in Figures 3 and 4). For C2F, when the poses of the driving image and the source image differ dramatically, the optical flow prediction becomes inaccurate, so the synthesized clothes cannot be warped precisely, which results in unrealistic person images (marked with yellow boxes in Figures 3 and 4). This may seriously affect human visual perception. Compared with the other methods, our method better handles drastic pose changes while preserving details. In addition, the person images generated by our model generally have clearer faces. This is due to our progressive model, in which the initial face image provides an important template for the final clear face.

4.4. Ablation Study

In order to verify the roles of the Global Alignment Module (GAM) and Texture Alignment Module (TAM) on the generated results and the effectiveness of the Whole Composition Network (WCN), we perform the ablation experiments on the iPER and SoloDance datasets. The variants of the framework are as follows:

  • RGN w/o GAM. It means that the foreground image is generated without using the GAM.

  • REMOT w/o WCN. It means that the foreground image is generated through the RGN using the GAM.

  • REMOT w/o TAM. It means that the REMOT model directly adds the features of the source image to the features of the generated raw foreground image point-to-point, instead of using the TAM.

  • REMOT. It refers to our proposed complete REMOT model.
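The contrast tested by the "REMOT w/o TAM" variant can be illustrated with a small NumPy sketch. The point-to-point addition follows the description in the list above. The similarity-based fusion is only a generic attention-style stand-in: the cosine similarity, the softmax with a `temperature` parameter, and both function names are assumptions made for illustration and should not be read as the actual TAM design.

```python
import numpy as np


def fuse_by_addition(gen_feat: np.ndarray, src_feat: np.ndarray) -> np.ndarray:
    """Point-to-point fusion: add features at the same spatial location."""
    return gen_feat + src_feat


def fuse_by_similarity(gen_feat: np.ndarray, src_feat: np.ndarray,
                       temperature: float = 0.1) -> np.ndarray:
    """Fuse after re-sampling source features by feature similarity.

    Both inputs have shape (C, H, W). Each location of the generated feature
    gathers a softmax-weighted mix of source locations (an assumed,
    attention-style alignment, not the paper's module).
    """
    c, h, w = gen_feat.shape
    q = gen_feat.reshape(c, h * w).T  # (N, C) queries from the generated image
    k = src_feat.reshape(c, h * w).T  # (N, C) keys/values from the source image
    q_unit = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    k_unit = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
    sim = q_unit @ k_unit.T / temperature        # (N, N) scaled cosine similarity
    sim = sim - sim.max(axis=1, keepdims=True)   # stabilize the softmax
    weights = np.exp(sim)
    weights = weights / weights.sum(axis=1, keepdims=True)
    aligned = weights @ k                        # (N, C) source features, re-aligned
    return gen_feat + aligned.T.reshape(c, h, w)
```

When the source features are spatially shifted relative to the generated features, point-to-point addition mixes unrelated locations, whereas the similarity-based variant first pulls each matching source feature to the location where it is needed.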

Figure 5. Results of ablation experiments.
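| Method | SSIM* | PSNR* | LPIPS | FID | TCM |
| --- | --- | --- | --- | --- | --- |
| RGN w/o GAM | 0.892 | 20.94 | - | - | - |
| REMOT w/o WCN | 0.895 | 21.32 | - | - | - |
| REMOT w/o TAM | 0.928 | 26.62 | 0.074 | 55.23 | 0.715 |
| REMOT | 0.952 | 28.87 | 0.065 | 53.04 | 0.793 |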
Table 3. Results of ablation experiments of our method on the iPER dataset. * indicates Mask-SSIM and Mask-PSNR.
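| Method | SSIM* | PSNR* | LPIPS | FID | TCM |
| --- | --- | --- | --- | --- | --- |
| RGN w/o GAM | 0.921 | 24.56 | - | - | - |
| REMOT w/o WCN | 0.937 | 25.54 | - | - | - |
| REMOT w/o TAM | 0.941 | 26.25 | 0.053 | 64.37 | 0.723 |
| REMOT | 0.953 | 27.89 | 0.045 | 53.29 | 0.788 |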
Table 4. Results of ablation experiments of our method on the SoloDance dataset.

The experimental results are shown in Tables 3 and 4. Since the RGN only generates person foreground images, we only calculate Mask-SSIM and Mask-PSNR for the variants without the WCN. The results in Table 3 show that the GAM does not significantly improve the quality of the generated images on the iPER dataset. This is because in iPER videos the motion of the actors is relatively simple and their scale and position remain unchanged, which means that the source image and the driving image are already weakly aligned. On the SoloDance dataset, however, the GAM plays a greater role because the subjects' actions cover a large range. On both datasets, the WCN significantly improves the quality of the generated images. Compared with direct feature addition for fusion, the TAM better fuses the features of the generated raw foreground image with the features of the source image.
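The size of each component's contribution can be read off the tables by differencing consecutive rows. The short script below does this arithmetic for the Mask-SSIM and Mask-PSNR columns; the numbers are copied from Tables 3 and 4, while the dictionary layout and the function name are only illustrative.

```python
# (Mask-SSIM, Mask-PSNR) per variant, copied from Tables 3 and 4.
ABLATION = {
    "iPER": {
        "RGN w/o GAM": (0.892, 20.94),
        "REMOT w/o WCN": (0.895, 21.32),
        "REMOT w/o TAM": (0.928, 26.62),
        "REMOT": (0.952, 28.87),
    },
    "SoloDance": {
        "RGN w/o GAM": (0.921, 24.56),
        "REMOT w/o WCN": (0.937, 25.54),
        "REMOT w/o TAM": (0.941, 26.25),
        "REMOT": (0.953, 27.89),
    },
}


def stepwise_gains(rows):
    """Return the metric gain between each pair of consecutive variants."""
    names = list(rows)
    gains = {}
    for before, after in zip(names, names[1:]):
        d_ssim = round(rows[after][0] - rows[before][0], 3)
        d_psnr = round(rows[after][1] - rows[before][1], 2)
        gains[f"{before} -> {after}"] = (d_ssim, d_psnr)
    return gains


for dataset, rows in ABLATION.items():
    for step, (d_ssim, d_psnr) in stepwise_gains(rows).items():
        print(f"{dataset:9s} {step:32s} +{d_ssim:.3f} SSIM*  +{d_psnr:.2f} dB PSNR*")
```

The first step in each dataset isolates the GAM: it adds 0.003 Mask-SSIM and 0.38 dB Mask-PSNR on iPER, against 0.016 and 0.98 dB on SoloDance, which matches the observation that the GAM matters more when the subjects' scale and position vary.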

As can be seen from Figure 5, the WCN plays an important role in enhancing facial details and clothing textures. Compared with the direct addition of features, the TAM is more beneficial for generating realistic person images.

5. Conclusion

In this paper, we propose a progressive human motion transfer framework, namely REMOT, which gradually generates person images from region to whole. REMOT abandons the warping operation, which avoids inaccurate flow estimation due to drastic variations of poses. Compared with directly generating the whole person image, the region-to-whole strategy makes the HVMT task easier and the generated person images more realistic. Moreover, we propose a Global Alignment Module to match the size and position of the source person with those of the driving person. Furthermore, we propose a Texture Alignment Module to align the features of the source image with those of the generated image to preserve more details. Experiments on the iPER and SoloDance datasets show that our proposed approach achieves state-of-the-art results. In future work, we will explore 3D information of the human body for HVMT, which can provide powerful prior knowledge such as 3D shapes and poses of the human body.

Acknowledgements.
This work is supported by the National Key R&D Program of China under Grant No. 2020AAA0103800, the National Natural Science Foundation of China (62121002, 62022076, U1936210, 62102127), the Fundamental Research Funds for the Central Universities under Grant WK3480000011, the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Y2021122), and the Hefei Postdoctoral Research Activities Foundation (BSH202101).

References

  • Balakrishnan et al. (2018) Guha Balakrishnan, Amy Zhao, Adrian V. Dalca, Frédo Durand, and John V. Guttag. 2018. Synthesizing Images of Humans in Unseen Poses. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 8340–8348.
  • Cao et al. (2015) Xiaochun Cao, Hua Zhang, Xiaojie Guo, Si Liu, and Dan Meng. 2015. SLED: Semantic Label Embedding Dictionary Representation for Multilabel Image Annotation. IEEE Trans. Image Process. 24, 9 (2015), 2746–2759.
  • Cao et al. (2021) Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2021. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE Trans. Pattern Anal. Mach. Intell. 43, 1 (2021), 172–186.
  • Chan et al. (2019) Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. 2019. Everybody Dance Now. In IEEE/CVF International Conference on Computer Vision (ICCV). 5932–5941.
  • Gafni et al. (2021) Oran Gafni, Oron Ashual, and Lior Wolf. 2021. Single-Shot Freestyle Dance Reenactment. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 882–891.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems (NIPS). 2672–2680.
  • Grigorev et al. (2021) Artur Grigorev, Karim Iskakov, Anastasia Ianina, Renat Bashirov, Ilya Zakharkin, Alexander Vakhitov, and Victor Lempitsky. 2021. StylePeople: A Generative Model of Fullbody Human Avatars. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5151–5160.
  • Güler et al. (2018) Riza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. 2018. DensePose: Dense Human Pose Estimation in the Wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7297–7306.
  • He et al. (2017) Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017. Mask R-CNN. In IEEE/CVF International Conference on Computer Vision (ICCV). 2980–2988.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778.
  • Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems (NIPS). 6626–6637.
  • Huang et al. (2021) Zhichao Huang, Xintong Han, Jia Xu, and Tong Zhang. 2021. Few-Shot Human Motion Transfer by Personalized Geometry and Texture Modeling. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2297–2306.
  • Ilg et al. (2017) Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. 2017. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1647–1655.
  • Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5967–5976.
  • Jeon et al. (2020) Subin Jeon, Seonghyeon Nam, Seoung Wug Oh, and Seon Joo Kim. 2020. Cross-Identity Motion Transfer for Arbitrary Objects Through Pose-Attentive Video Reassembling. In European Conference on Computer Vision (ECCV). 292–308.
  • Johnson et al. (2016) Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In European Conference on Computer Vision (ECCV). 694–711.
  • Kanazawa et al. (2018) Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. 2018. End-to-End Recovery of Human Shape and Pose. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 7122–7131.
  • Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4401–4410.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR).
  • Li et al. (2020) Peike Li, Yunqiu Xu, Yunchao Wei, and Yi Yang. 2020. Self-Correction for Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. (2020), 3260–3271.
  • Li et al. (2019) Yuhang Li, Xuejin Chen, Feng Wu, and Zheng-Jun Zha. 2019. LinesToFacePhoto: Face Photo Generation From Lines With Conditional Self-Attention Generative Adversarial Networks. In ACM International Conference on Multimedia (MM). 2323–2331.
  • Liu et al. (2017) Anan Liu, Yuting Su, Weizhi Nie, and Mohan S. Kankanhalli. 2017. Hierarchical Clustering Multi-Task Learning for Joint Human Action Grouping and Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39, 1 (2017), 102–114.
  • Liu et al. (2018) Kun Liu, Wu Liu, Chuang Gan, Mingkui Tan, and Huadong Ma. 2018. T-C3D: Temporal Convolutional 3D Network for Real-Time Action Recognition. In AAAI Conference on Artificial Intelligence (AAAI). 7138–7145.
  • Liu et al. (2022a) Wu Liu, Qian Bao, Yu Sun, and Tao Mei. 2022a. Recent advances in monocular 2d and 3d human pose estimation: A deep learning perspective. ACM Computing Surveys (2022).
  • Liu et al. (2019b) Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. 2019b. Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis. In IEEE/CVF International Conference on Computer Vision (ICCV). 5903–5912.
  • Liu et al. (2019a) Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, Chenggang Yan, and Tao Mei. 2019a. Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3566–3574.
  • Liu et al. (2019c) Xinchen Liu, Meng Zhang, Wu Liu, Jingkuan Song, and Tao Mei. 2019c. BraidNet: Braiding Semantics and Details for Accurate Human Parsing. In ACM International Conference on Multimedia (MM). 338–346.
  • Liu et al. (2021a) Zhenguang Liu, Haoming Chen, Runyang Feng, Shuang Wu, Shouling Ji, Bailin Yang, and Xun Wang. 2021a. Deep Dual Consecutive Network for Human Pose Estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 525–534.
  • Liu et al. (2022b) Zhenguang Liu, Runyang Feng, Haoming Chen, Shuang Wu, Yixing Gao, Yunjun Gao, and Xiang Wang. 2022b. Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 11006–11016.
  • Liu et al. (2021b) Zhenguang Liu, Shuang Wu, Shuyuan Jin, Shouling Ji, Qi Liu, Shijian Lu, and Li Cheng. 2021b. Investigating Pose Representations and Motion Contexts Modeling for 3D Motion Prediction. IEEE Trans. Pattern Anal. Mach. Intell. (2021), 1–16.
  • Long et al. (2015) Jonathan Long, Evan Shelhamer, and Trevor Darrell. 2015. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 3431–3440.
  • Loper et al. (2015) Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. 2015. SMPL: a skinned multi-person linear model. ACM Trans. Graph. 34, 6 (2015), 248:1–248:16.
  • Ma et al. (2017) Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. 2017. Pose Guided Person Image Generation. In Advances in Neural Information Processing Systems (NIPS). 406–416.
  • Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. 2014. Conditional Generative Adversarial Nets. CoRR abs/1411.1784 (2014).
  • Pavlakos et al. (2019) Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. 2019. Expressive Body Capture: 3D Hands, Face, and Body From a Single Image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 10975–10985.
  • Qian et al. (2018) Xuelin Qian, Yanwei Fu, Tao Xiang, Wenxuan Wang, Jie Qiu, Yang Wu, Yu-Gang Jiang, and Xiangyang Xue. 2018. Pose-Normalized Image Generation for Person Re-identification. In European Conference on Computer Vision (ECCV). 661–678.
  • Redmon et al. (2016) Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, and Ali Farhadi. 2016. You Only Look Once: Unified, Real-Time Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 779–788.
  • Shen et al. (2021) Tong Shen, Jiawei Zuo, Fan Shi, Jin Zhang, Liqin Jiang, Meng Chen, Zhengchen Zhang, Wei Zhang, Xiaodong He, and Tao Mei. 2021. ViDA-MAN: Visual Dialog with Digital Humans. In ACM International Conference on Multimedia (MM). 2789–2791.
  • Siarohin et al. (2019) Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, Elisa Ricci, and Nicu Sebe. 2019. First Order Motion Model for Image Animation. In Advances in Neural Information Processing Systems (NIPS). 7135–7145.
  • Siarohin et al. (2021) Aliaksandr Siarohin, Oliver J. Woodford, Jian Ren, Menglei Chai, and Sergey Tulyakov. 2021. Motion Representations for Articulated Animation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 13653–13662.
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations (ICLR).
  • Sun et al. (2022) Yu Sun, Wu Liu, Qian Bao, Yili Fu, Tao Mei, and Michael J. Black. 2022. Putting People in their Place: Monocular Regression of 3D People in Depth. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 13243–13252.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2818–2826.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems (NIPS). 5998–6008.
  • Wang et al. (2019) Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Bryan Catanzaro, and Jan Kautz. 2019. Few-shot Video-to-Video Synthesis. In Advances in Neural Information Processing Systems (NIPS). 5014–5025.
  • Wang et al. (2018a) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018a. High-Resolution Image Synthesis and Semantic Manipulation With Conditional GANs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 8798–8807.
  • Wang et al. (2018b) Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Nikolai Yakovenko, Andrew Tao, Jan Kautz, and Bryan Catanzaro. 2018b. Video-to-Video Synthesis. In Advances in Neural Information Processing Systems (NIPS). 1152–1164.
  • Wang et al. (2021) Tuanfeng Y. Wang, Duygu Ceylan, Krishna Kumar Singh, and Niloy J. Mitra. 2021. Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis. In International Conference on 3D Vision (3DV). IEEE, 268–277.
  • Wang et al. (2004) Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process. 13, 4 (2004), 600–612.
  • Wei et al. (2021) Dongxu Wei, Xiaowei Xu, Haibin Shen, and Kejie Huang. 2021. C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer. In AAAI Conference on Artificial Intelligence (AAAI). 2852–2860.
  • Yang et al. (2020) Zhuoqian Yang, Wentao Zhu, Wayne Wu, Chen Qian, Qiang Zhou, Bolei Zhou, and Chen Change Loy. 2020. TransMoMo: Invariance-Driven Unsupervised Video Motion Retargeting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5305–5314.
  • Yao et al. (2017) Chun-Han Yao, Chia-Yang Chang, and Shao-Yi Chien. 2017. Occlusion-aware Video Temporal Consistency. In ACM International Conference on Multimedia (MM). 777–785.
  • Yu et al. (2022) Lingyun Yu, Hongtao Xie, and Yongdong Zhang. 2022. Multimodal Learning for Temporally Coherent Talking Face Generation With Articulator Synergy. IEEE Trans. Multim. 24 (2022), 2950–2962.
  • Yu et al. (2021) Lingyun Yu, Jun Yu, Mengyan Li, and Qiang Ling. 2021. Multimodal Inputs Driven Talking Face Generation With Spatial-Temporal Dependency. IEEE Trans. Circuits Syst. Video Technol. 31, 1 (2021), 203–216.
  • Zeng et al. (2021) Dan Zeng, Yuhang Huang, Qian Bao, Junjie Zhang, Chi Su, and Wu Liu. 2021. Neural Architecture Search for Joint Human Parsing and Pose Estimation. In IEEE/CVF International Conference on Computer Vision (ICCV). 11365–11374.
  • Zhang et al. (2021) Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, Limin Wang, and Zhenan Sun. 2021. PyMAF: 3D Human Pose and Shape Regression with Pyramidal Mesh Alignment Feedback Loop. In IEEE/CVF International Conference on Computer Vision (ICCV). 11426–11436.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. 2018. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 586–595.
  • Zheng et al. (2022) Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, and Tao Mei. 2022. Gait Recognition in the Wild With Dense 3D Representations and a Benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 20228–20237.