Visual signals in a video can be divided into content and motion. While content specifies which objects are in the video, motion describes their dynamics. Based on this prior, we propose the Motion and Content decomposed Generative Adversarial Network (MoCoGAN) framework for video generation. The proposed framework generates a video by mapping a sequence of random vectors to a sequence of video frames. Each random vector consists of a content part and a motion part. While the content part is kept fixed, the motion part is realized as a stochastic process. To learn motion and content decomposition in an unsupervised manner, we introduce a novel adversarial learning scheme utilizing both image and video discriminators. Extensive experimental results on several challenging datasets, with qualitative and quantitative comparisons to state-of-the-art approaches, verify the effectiveness of the proposed framework. In addition, we show that MoCoGAN allows one to generate videos with the same content but different motion, as well as videos with different content and the same motion. Our code is available at https://github.com/sergeytulyakov/mocogan.
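As a concrete illustration of the content/motion split described above, the following sketch samples a latent sequence in which one content code is shared by every frame while a GRU driven by fresh noise produces a per-frame motion code; the class name, dimensionalities, and the GRU choice are illustrative assumptions, not the released MoCoGAN implementation. Each z[:, t] would then be decoded into a frame by an image generator, with an image discriminator scoring single frames and a video discriminator scoring whole clips.

import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    def __init__(self, dim_content=50, dim_motion=10, dim_noise=10):
        super().__init__()
        self.dim_content = dim_content
        self.dim_noise = dim_noise
        # the motion part is realized as a stochastic process via a recurrent cell
        self.rnn = nn.GRUCell(dim_noise, dim_motion)

    def forward(self, batch, n_frames):
        z_content = torch.randn(batch, self.dim_content)          # kept fixed for the whole video
        h = torch.zeros(batch, self.rnn.hidden_size)
        z_seq = []
        for _ in range(n_frames):
            h = self.rnn(torch.randn(batch, self.dim_noise), h)   # motion code for frame t
            z_seq.append(torch.cat([z_content, h], dim=1))        # [content | motion]
        return torch.stack(z_seq, dim=1)                          # (batch, T, dim_content + dim_motion)

z = LatentSampler()(batch=4, n_frames=16)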
We consider the problem of image-to-video translation, where an input image is translated into an output video containing the motion of a single object. Recent approaches to this problem typically train a transformation network to generate future frames conditioned on a structure sequence. Parallel work has shown that short, high-quality motions can be produced by spatiotemporal generative networks that exploit temporal knowledge from the training data. We combine the benefits of both approaches and propose a two-stage generation framework in which videos are first generated from structure and then refined with temporal signals. To model motion more efficiently, we train the network to learn the residual motion between the current and future frames, which avoids learning motion-irrelevant details. We conduct extensive experiments on two image-to-video translation tasks: facial expression retargeting and human pose forecasting. Superior results over state-of-the-art methods on both tasks demonstrate the effectiveness of our approach.
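A minimal sketch of the residual-motion idea, under assumed shapes and with a toy stand-in network (not the paper's two-stage architecture): the network predicts only the change relative to the current frame, so motion-irrelevant details are carried over unchanged.

import torch
import torch.nn as nn

class ResidualPredictor(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # toy refinement net; input = current frame plus a one-channel structure map
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, frame, structure):
        residual = self.net(torch.cat([frame, structure], dim=1))
        return frame + residual   # future frame = current frame + learned residual motion

pred = ResidualPredictor()(torch.rand(2, 3, 64, 64), torch.rand(2, 1, 64, 64))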
Video synthesis has seen extraordinary breakthroughs thanks to the emergence of Generative Adversarial Networks. However, existing methods lack a proper representation for explicitly controlling the dynamics in a video. Human pose, on the other hand, can represent motion patterns intrinsically and interpretably, and imposes geometric constraints regardless of appearance. In this paper, we propose a pose-guided method that synthesizes human videos in a disentangled way: plausible motion prediction and coherent appearance generation. In the first stage, a Pose Sequence Generative Adversarial Network (PSGAN) learns adversarially to generate pose sequences conditioned on class labels. In the second stage, a Semantic Consistent Generative Adversarial Network (SCGAN) generates video frames from the poses while preserving the coherent appearance of the input image. By enforcing semantic consistency between generated and ground-truth poses at a high feature level, our SCGAN is robust to noisy or abnormal poses. Extensive experiments on human action and human face datasets show that the proposed method outperforms other approaches.
Videos express highly structured spatio-temporal patterns of visual data. A video can be thought of as being governed by two factors: (i) temporally invariant (e.g., person identity) or slowly varying (e.g., activity) attribute-induced appearance, encoding the persistent content of each frame, and (ii) inter-frame motion or scene dynamics (e.g., encoding the evolution of the person executing the action). Based on this intuition, we propose a generative framework for video generation and future prediction. The proposed framework generates a video (short clip) by decoding samples sequentially drawn from a latent space distribution into full video frames. Variational Autoencoders (VAEs) are used to encode/decode frames into/from the latent space, and an RNN is used to model the dynamics in the latent space. We improve video generation consistency through temporally-conditional sampling, and quality by structuring the latent space with attribute controls, ensuring that attributes can be both inferred and conditioned on during learning/generation. As a result, given attributes and/or the first frame, our model is able to generate diverse but highly consistent sets of video sequences, accounting for the inherent uncertainty in the prediction task. Experimental results on the Chair CAD [1], Weizmann Human Action [2], and MIT Flickr [3] datasets, along with detailed comparison to the state of the art, verify the effectiveness of the framework.
We propose a deep neural network for the prediction of future frames in natural video sequences. To effectively handle complex evolution of pixels in videos, we propose to decompose the motion and content, two key components generating dynamics in videos. Our model is built upon the Encoder-Decoder Convolutional Neural Network and Convolutional LSTM for pixel-level prediction, which independently capture the spatial layout of an image and the corresponding temporal dynamics. By independently modeling motion and content, predicting the next frame reduces to converting the extracted content features into the next frame content by the identified motion features, which simplifies the task of prediction. Our model is end-to-end trainable over multiple time steps, and naturally learns to decompose motion and content without separate training. We evaluate the proposed network architecture on human activity videos using the KTH, Weizmann action, and UCF-101 datasets. We show state-of-the-art performance in comparison to recent approaches. To the best of our knowledge, this is the first end-to-end trainable network architecture with motion and content separation to model the spatiotemporal dynamics for pixel-level future prediction in natural videos.
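The sketch below illustrates the motion/content decomposition in a heavily simplified form (illustrative module names and sizes; not the paper's Encoder-Decoder/Convolutional LSTM architecture): content is encoded from the last observed frame, motion from the sequence of frame differences, and the two are fused to decode the next frame.

import torch
import torch.nn as nn

class MotionContentPredictor(nn.Module):
    def __init__(self, c=3, feat=64):
        super().__init__()
        self.content_enc = nn.Sequential(nn.Conv2d(c, feat, 3, padding=1), nn.ReLU())
        self.motion_conv = nn.Sequential(nn.Conv2d(c, feat, 3, padding=1), nn.ReLU())
        self.motion_rnn = nn.GRU(input_size=feat, hidden_size=feat, batch_first=True)
        self.decoder = nn.Conv2d(2 * feat, c, 3, padding=1)

    def forward(self, frames):                       # frames: (B, T, C, H, W)
        b, t, c, h, w = frames.shape
        content = self.content_enc(frames[:, -1])    # spatial layout of the last frame
        diffs = frames[:, 1:] - frames[:, :-1]       # temporal dynamics as frame differences
        m = self.motion_conv(diffs.reshape(-1, c, h, w)).mean(dim=(2, 3))
        m, _ = self.motion_rnn(m.reshape(b, t - 1, -1))
        motion = m[:, -1].unsqueeze(-1).unsqueeze(-1).expand(-1, -1, h, w)
        return self.decoder(torch.cat([content, motion], dim=1))   # predicted next frame

next_frame = MotionContentPredictor()(torch.rand(2, 5, 3, 32, 32))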
Taking a photo outside, can we predict the immediate future, e.g., how the clouds will move in the sky? We address this problem by presenting a generative adversarial network (GAN) based two-stage approach to generating realistic time-lapse videos of high resolution. Given the first frame, our model learns to generate long-term future frames. The first stage generates videos with realistic content for each frame. The second stage refines the video generated by the first stage by pushing it closer to real videos with regard to motion dynamics. To further encourage vivid motion in the final generated video, a Gram matrix is employed to model the motion more precisely. We build a large-scale time-lapse dataset and test our approach on this new dataset. Using our model, we are able to generate realistic videos of up to 128 × 128 resolution for 32 frames. Quantitative and qualitative experimental results demonstrate the superiority of our model over state-of-the-art models.
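To make the Gram-matrix idea concrete, here is a hedged sketch of a motion statistic computed from clip features: channel correlations over space-time are compared between generated and real videos. It illustrates the general mechanism only, not the paper's exact loss or feature extractor.

import torch

def gram_matrix(feat):                      # feat: (B, C, T, H, W) features of a clip
    b, c, t, h, w = feat.shape
    f = feat.reshape(b, c, -1)              # flatten the space-time dimensions
    g = torch.bmm(f, f.transpose(1, 2))     # (B, C, C) channel correlations
    return g / (c * t * h * w)

def motion_gram_loss(fake_feat, real_feat):
    return torch.mean(torch.abs(gram_matrix(fake_feat) - gram_matrix(real_feat)))

loss = motion_gram_loss(torch.rand(2, 8, 4, 16, 16), torch.rand(2, 8, 4, 16, 16))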
Extending image generation to video generation is a very hard task, since the temporal dimension of videos introduces an extra challenge during generation. Moreover, due to limitations in memory and training stability, generation becomes increasingly challenging as the resolution/duration of the video grows. In this work, we develop the idea of progressively growing generative adversarial networks (GANs) for higher-resolution video generation. In particular, we begin by producing video samples of low resolution and short duration, and then progressively (or jointly) grow the resolution and duration by adding new spatiotemporal convolutional layers to the current networks. By starting to learn the video distribution from very coarse-level spatial appearance and temporal motion, the proposed progressive approach learns spatiotemporal information incrementally to generate higher-resolution videos. Furthermore, we introduce a sliced version of the Wasserstein GAN (SWGAN) loss to improve distribution learning on video data, which is high-dimensional and mixes spatial and temporal statistics. The SWGAN loss replaces the distance between joint distributions with one-dimensional marginal distributions, making the loss easier to compute. We evaluate the proposed model on a collected dataset of 10,900 videos to generate photorealistic face videos at 256x256x32 resolution. In addition, our model also reaches a record Inception Score of 14.57 on the unsupervised action recognition dataset UCF-101.
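The sliced Wasserstein idea can be sketched as follows (an illustration of the mechanism, not the exact SWGAN training loss): project samples from the two distributions onto random 1D directions, sort the projections, and average the per-direction 1D transport costs, so only one-dimensional marginals ever need to be compared.

import torch

def sliced_wasserstein(x, y, n_projections=64):
    # x, y: (N, D) flattened samples from the two distributions being compared
    d = x.shape[1]
    theta = torch.randn(d, n_projections)
    theta = theta / theta.norm(dim=0, keepdim=True)   # random unit directions
    px, _ = torch.sort(x @ theta, dim=0)              # sorted 1D marginals of x
    py, _ = torch.sort(y @ theta, dim=0)              # sorted 1D marginals of y
    return ((px - py) ** 2).mean()                    # average squared 1D transport cost

dist = sliced_wasserstein(torch.randn(128, 256), torch.randn(128, 256) + 1.0)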
Learning to represent and generate videos from unlabeled data is a very challenging problem. To generate realistic videos, it is important not only to ensure that the appearance of each frame is real, but also to ensure the plausibility of the video motion and the consistency of the video appearance over time. The process of video generation should be divided according to these intrinsic difficulties. In this study, we focus on motion and appearance information as two important orthogonal components of a video, and propose Flow-and-Texture Generative Adversarial Networks (FTGAN), consisting of FlowGAN and TextureGAN. In order to avoid a huge annotation cost, we have to explore a way to learn from unlabeled data. Thus, we employ optical flow as motion information to generate videos. FlowGAN generates optical flow, which contains only the edges and motion of the videos to be generated. TextureGAN, on the other hand, specializes in adding texture to the optical flow generated by FlowGAN. This hierarchical approach yields more realistic videos with plausible motion and appearance consistency. Our experiments show that our model generates more plausible motion videos and also achieves significantly improved performance for unsupervised action classification in comparison to previous GAN works. In addition, because our model generates videos from two independent sources of information, it can generate new combinations of motion and attributes that are not seen in the training data, such as a video in which a person is doing sit-ups on a baseball ground.
This paper introduces a novel deep learning framework for image animation. Given an input image containing a target object and a driving video sequence depicting a moving object, our framework generates a video in which the target object is animated according to the driving sequence. This is achieved through a deep architecture that decouples appearance and motion information. Our framework consists of three main modules: (i) a keypoint detector, trained without supervision to extract object keypoints; (ii) a dense motion prediction network that generates dense heatmaps from the sparse keypoints in order to better encode motion information; and (iii) a motion transfer network that synthesizes the output frames using the motion heatmaps and the appearance information extracted from the input image. We demonstrate the effectiveness of our method on several benchmark datasets spanning a wide variety of object appearances, and show that our approach outperforms state-of-the-art image animation and video generation methods.
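As a small illustration of the intermediate motion representation described in module (ii), the sketch below turns sparse keypoints into dense Gaussian heatmaps; the coordinate convention, sigma, and shapes are assumptions, and the dense-motion and transfer networks themselves are not shown.

import torch

def keypoints_to_heatmaps(kp, height, width, sigma=0.05):
    # kp: (B, K, 2) keypoint coordinates normalized to [-1, 1]
    ys = torch.linspace(-1, 1, height)
    xs = torch.linspace(-1, 1, width)
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")       # (H, W)
    grid = torch.stack([grid_x, grid_y], dim=-1)                 # (H, W, 2)
    diff = grid[None, None] - kp[:, :, None, None, :]            # (B, K, H, W, 2)
    return torch.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))    # (B, K, H, W)

heatmaps = keypoints_to_heatmaps(torch.rand(2, 10, 2) * 2 - 1, 64, 64)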
We propose a simple yet highly effective approach to address the mode-collapse problem in conditional generative adversarial networks (cGANs). Although the conditional distribution is multi-modal (i.e., has many modes) in practice, most cGAN approaches tend to learn an overly simplified distribution in which an input is always mapped to a single output regardless of variations in the latent code. To address this issue, we propose to explicitly regularize the generator to produce diverse outputs depending on the latent code. The proposed regularization is simple, general, and can easily be integrated into most conditional GAN objectives. In addition, explicitly regularizing the generator gives us a way to control the balance between visual quality and diversity. We demonstrate the effectiveness of our method on three conditional generation tasks: image-to-image translation, image inpainting, and future video prediction. We show that simply adding our regularization to existing models yields surprisingly diverse generations, substantially outperforming previous multi-modal conditional generation methods specifically designed for each individual task.
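A hedged sketch of a diversity regularization term of the kind described above: for two latent codes and the same conditional input, the generator is encouraged to keep the ratio of output distance to latent distance large. The L1 distances, the clamping bound, and the weighting are illustrative choices and may differ from the paper's exact formulation.

import torch

def diversity_regularizer(g_out1, g_out2, z1, z2, tau=10.0, eps=1e-8):
    num = torch.mean(torch.abs(g_out1 - g_out2), dim=(1, 2, 3))  # distance between outputs
    den = torch.mean(torch.abs(z1 - z2), dim=1) + eps            # distance between latent codes
    ratio = num / den
    return -torch.mean(torch.clamp(ratio, max=tau))              # maximize the ratio up to a bound

# usage: generator_loss = adversarial_loss + weight * diversity_regularizer(y1, y2, z1, z2)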
We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e.g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video. While its image counterpart, the image-to-image synthesis problem, is a popular topic, the video-to-video synthesis problem is much less explored in the literature. Without understanding temporal dynamics, directly applying existing image synthesis approaches to an input video often results in temporally incoherent videos of low visual quality. In this paper, we propose a novel video-to-video synthesis approach under the generative adversarial learning framework. Through carefully designed generator and discriminator architectures, coupled with a spatio-temporal adversarial objective, we achieve high-resolution, photorealistic, temporally coherent video results on a diverse set of input formats including segmentation masks, sketches, and poses. Experiments on multiple benchmarks show the advantage of our method compared to strong baselines. In particular, our model is capable of synthesizing 2K-resolution videos of street scenes up to 30 seconds long, which significantly advances the state of the art in video synthesis. Finally, we apply our method to future video prediction, outperforming several state-of-the-art competing systems.
We present the first deep learning solution to video frame inpainting, an instance of the general video inpainting problem with applications in video editing, manipulation, and forensics. Our task is less ambiguous than frame interpolation and video prediction because we have access both to the temporal context and to a partial glimpse of the future, which allows us to objectively evaluate the quality of a model's predictions. We design a pipeline composed of two modules: a bidirectional video prediction module and a temporally-aware frame interpolation module. The prediction module makes two intermediate predictions of the missing frames using a convolutional LSTM-based encoder-decoder, one conditioned on the preceding frames and the other conditioned on the following frames. The interpolation module blends the intermediate predictions to form the final result; specifically, it exploits temporal information and hidden activations from the video prediction module to resolve disagreements between the predictions. Our experiments demonstrate that our approach produces more accurate and qualitatively more satisfying results than state-of-the-art video prediction methods and inpainting baselines applied to this task.
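The blending step can be illustrated with a simple time-weighted mix of the two intermediate predictions (this toy rule stands in for the temporally-aware interpolation module, which additionally uses hidden activations to resolve disagreements): frames near the preceding context trust the forward prediction more, and frames near the following context trust the backward one.

import torch

def blend_predictions(forward_pred, backward_pred):
    # both tensors: (B, T, C, H, W) predictions of the same T missing frames
    t = forward_pred.shape[1]
    w = torch.linspace(1.0, 0.0, t).view(1, t, 1, 1, 1)   # weight on the forward prediction
    return w * forward_pred + (1.0 - w) * backward_pred

blended = blend_predictions(torch.rand(1, 5, 3, 32, 32), torch.rand(1, 5, 3, 32, 32))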
In this paper, we propose a new approach for automatically generating a video sequence from a single image and user-provided motion strokes. Generating a video sequence from a single input image has many applications in visual content creation, but it is tedious and time-consuming even for experienced artists. Automatic methods have been proposed to address this problem, but most existing video prediction approaches require multiple input frames. In addition, the generated sequences exhibit limited variation because the output is mostly determined by the input frames, without allowing the user to provide additional constraints on the result. In our technique, the user can control the generated animation using sketch strokes on a single input image. We train our system so that the trajectory of the animated object follows the strokes, which makes it more flexible and controllable. From a single image, users can generate a variety of video sequences corresponding to different inputs. Our method is the first system that, given a single frame and motion strokes, can generate an animation by recurrently generating the video frame by frame. An important benefit of the recurrent nature of our architecture is that it facilitates synthesizing an arbitrary number of generated frames. Our architecture uses an autoencoder and a generative adversarial network (GAN) to generate sharp texture images, and we use another GAN to guarantee that the transitions between frames are realistic and smooth. We demonstrate the effectiveness of our approach on the MNIST, KTH, and Human 3.6M datasets.
We propose FutureGAN, a new autoencoder GAN model that predicts the future frames of a video sequence given a sequence of past frames. We build on the progressively growing GAN (PGGAN) architecture recently introduced by Karras et al. [18]. During training, the resolution of the input and output frames is gradually increased by progressively adding layers to both the discriminator and the generator networks. To learn representations that effectively capture the spatial and temporal components of a frame sequence, we use spatio-temporal 3D convolutions. We have obtained promising results at a frame resolution of 128 x 128 px on various datasets ranging from synthetic to natural frame sequences, while the approach is not theoretically restricted to a particular frame resolution. FutureGAN learns to generate plausible futures, and the learned representations appear to effectively capture the spatial and temporal transformations of the input frames. Compared to most other video prediction models, a major advantage of our architecture is its simplicity: the model receives only raw pixel values as input and effectively generates output frames without relying on additional constraints, conditions, or complex pixel-based error loss metrics.
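A minimal sketch of the progressive fade-in with spatio-temporal 3D convolutions (illustrative layer sizes, not the FutureGAN code): while a new, higher-resolution stage is being trained, its output is blended with the upsampled output of the previous stage and the blending weight alpha is ramped from 0 to 1.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FadeInStage(nn.Module):
    def __init__(self, c_in=64, c_out=3):
        super().__init__()
        self.new_block = nn.Conv3d(c_in, c_in, kernel_size=3, padding=1)
        self.to_rgb_new = nn.Conv3d(c_in, c_out, kernel_size=1)
        self.to_rgb_old = nn.Conv3d(c_in, c_out, kernel_size=1)

    def forward(self, feat, alpha):
        # feat: (B, C, T, H, W) features from the previously trained stage
        up = F.interpolate(feat, scale_factor=(1, 2, 2), mode="nearest")  # grow spatial resolution
        old = self.to_rgb_old(up)                                  # shortcut through the old head
        new = self.to_rgb_new(F.relu(self.new_block(up)))          # path through the new layers
        return alpha * new + (1.0 - alpha) * old                   # fade the new stage in

out = FadeInStage()(torch.rand(1, 64, 4, 8, 8), alpha=0.3)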
The field of automatic video generation has received a boost thanks to generative adversarial networks (GANs). However, most existing methods cannot control the content of the generated video with a text caption, which largely limits their usefulness. This particularly affects human videos, given their wide variety of actions and appearances. This paper presents Conditional Flow and Texture GAN (CFT-GAN), a GAN-based video generation method conditioned on action-appearance captions. We propose a new way of generating videos by encoding a caption (e.g., 'a man in blue jeans is playing golf') in a two-stage generation pipeline. Our CFT-GAN uses such a caption to produce an optical flow (action) and a texture (appearance) for each frame, so that the output video reflects the content specified in the caption in a plausible way. Furthermore, to train our method, we constructed a new dataset for human video generation with captions. We evaluate the proposed method qualitatively and quantitatively through an ablation study and a user study. The results show that CFT-GAN is able to successfully generate videos containing the action and appearance indicated in the captions.
Long-term human motion can be represented as a series of motion modes, motion sequences that capture short-term temporal dynamics, and the transitions between them. We exploit this structure and propose a novel Motion Transformation Variational Autoencoder (MT-VAE) for learning motion sequence generation. Our model jointly learns a feature embedding of motion modes (from which motion sequences can be reconstructed) and a feature transformation that represents the transition from one motion mode to the next. Our model can generate multiple diverse and plausible future motion sequences from the same input. We apply our approach to both facial and full-body motion, and demonstrate applications such as analogy-based motion transfer and video synthesis.
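One way to picture the learned transformation is the additive sketch below (assumed shapes and a stand-in network, not the MT-VAE implementation): the future mode embedding is the current one plus a transform predicted from the current embedding and a sampled latent code, so different latent codes yield different plausible futures.

import torch
import torch.nn as nn

class ModeTransition(nn.Module):
    def __init__(self, dim_embed=128, dim_latent=32):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Linear(dim_embed + dim_latent, 256), nn.ReLU(),
            nn.Linear(256, dim_embed),
        )

    def forward(self, e_past, z):
        # e_past: (B, dim_embed) embedding of the current motion mode
        # z:      (B, dim_latent) sampled latent; varying z gives diverse futures
        return e_past + self.transform(torch.cat([e_past, z], dim=1))

e_future = ModeTransition()(torch.randn(4, 128), torch.randn(4, 32))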
Current deep learning results on video generation are limited: there are only a few first results on video prediction and no relevant significant results on video completion. This is due to the severe ill-posedness inherent in these three problems. In this paper, we focus on human action videos and propose a general, two-stage deep framework to generate human action videos with no constraints or an arbitrary number of constraints, which uniformly addresses the three problems: video generation given no input frames, video prediction given the first few frames, and video completion given the first and last frames. To make the problem tractable, in the first stage we train a deep generative model that generates a human pose sequence from random noise. In the second stage, a skeleton-to-image network is trained, which is used to generate a human action video given the complete human pose sequence generated in the first stage. By introducing the two-stage strategy, we sidestep the original ill-posed problems while producing, for the first time, high-quality video generation/prediction/completion results of much longer duration. We present quantitative and qualitative evaluations to show that our two-stage approach outperforms state-of-the-art methods in video generation, prediction, and completion. Our video result demonstration can be viewed at https://iamacewhite.github.io/supp/index.html.
Each smile is unique: one person surely smiles in different ways (e.g. closing/opening the eyes or mouth). Given one input image of a neutral face, can we generate multiple smile videos with distinctive characteristics? To tackle this one-to-many video generation problem, we propose a novel deep learning architecture named Conditional Multi-Mode Network (CMM-Net). To better encode the dynamics of facial expressions, CMM-Net explicitly exploits facial landmarks for generating smile sequences. Specifically, a variational auto-encoder is used to learn a facial landmark embedding. This single embedding is then exploited by a conditional recurrent network which generates a landmark embedding sequence conditioned on a specific expression (e.g. spontaneous smile). Next, the generated landmark embeddings are fed into a multi-mode recurrent landmark generator, producing a set of landmark sequences still associated to the given smile class but clearly distinct from each other. Finally, these landmark sequences are translated into face videos. Our experimental results demonstrate the effectiveness of our CMM-Net in generating realistic videos of multiple smile expressions.
Future frame prediction in videos is a promising avenue for unsupervised video representation learning. Video frames are naturally generated by the inherent pixel flows from preceding frames based on the appearance and motion dynamics in the video. However, existing methods focus on directly hallucinating pixel values, resulting in blurry predictions. In this paper, we develop a dual motion Generative Adversarial Net (GAN) architecture, which learns to explicitly enforce future-frame predictions to be consistent with the pixel-wise flows in the video through a dual-learning mechanism. The primal future-frame prediction and dual future-flow prediction form a closed loop, generating informative feedback signals to each other for better video prediction. To make both synthesized future frames and flows indistinguishable from reality, a dual adversarial training method is proposed to ensure that the future-flow prediction is able to help infer realistic future frames, while the future-frame prediction in turn leads to realistic optical flows. Our dual motion GAN also handles natural motion uncertainty in different pixel locations with a new probabilistic motion encoder, which is based on variational autoencoders. Extensive experiments demonstrate that the proposed dual motion GAN significantly outperforms state-of-the-art approaches on synthesizing new video frames and predicting future flows. Our model generalizes well across diverse visual scenes and shows superiority in unsupervised video representation learning.
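The frame/flow consistency at the heart of the dual formulation can be sketched generically (a plain flow-warping check under assumed conventions, not the paper's full dual adversarial training): warp the current frame with the predicted flow and compare the result against the predicted future frame.

import torch
import torch.nn.functional as F

def warp_with_flow(frame, flow):
    # frame: (B, C, H, W); flow: (B, 2, H, W) pixel displacements (x, y)
    b, _, h, w = frame.shape
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    base = torch.stack([gx, gy], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)   # (B, H, W, 2)
    # convert pixel displacements to the normalized coordinates expected by grid_sample
    norm = torch.stack([flow[:, 0] * 2 / max(w - 1, 1),
                        flow[:, 1] * 2 / max(h - 1, 1)], dim=-1)
    return F.grid_sample(frame, base + norm, align_corners=True)

def frame_flow_consistency(pred_frame, cur_frame, pred_flow):
    return torch.mean(torch.abs(pred_frame - warp_with_flow(cur_frame, pred_flow)))

loss = frame_flow_consistency(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32), torch.zeros(2, 2, 32, 32))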
Current approaches in video forecasting attempt to generate videos directly in pixel space using Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs). However, since these approaches try to model all the structure and scene dynamics at once, in unconstrained settings they often generate uninterpretable results. Our insight is to model the forecasting problem at a higher level of abstraction. Specifically, we exploit human pose detectors as a free source of supervision and break the video forecasting problem into two discrete steps. First we explicitly model the high level structure of active objects in the scene---humans---and use a VAE to model the possible future movements of humans in the pose space. We then use the future poses generated as conditional information to a GAN to predict the future frames of the video in pixel space. By using the structured space of pose as an intermediate representation, we sidestep the problems that GANs have in generating video pixels directly. We show through quantitative and qualitative evaluation that our method outperforms state-of-the-art methods for video prediction.