Given two consecutive frames, video interpolation aims at generating intermediate frame(s) to form both spatially and temporally coherent video sequences. While most existing methods focus on single-frame interpolation, we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, where the motion interpretation and occlusion reasoning are jointly modeled. We start by computing bi-directional optical flow between the input images using a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bi-directional optical flows. These approximate flows, however, only work well in locally smooth regions and produce artifacts around motion boundaries. To address this shortcoming, we employ another U-Net to refine the approximated flow and also predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated intermediate frame to avoid artifacts. Since none of our learned network parameters are time-dependent, our approach is able to produce as many intermediate frames as needed. To train our network, we use 1,132 240-fps video clips, containing 300K individual video frames. Experimental results on several datasets, predicting different numbers of interpolated frames, demonstrate that our approach performs consistently better than existing methods.
translated by 谷歌翻译
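As a rough per-pixel illustration of the first abstract's pipeline, the sketch below linearly combines the two input flows into approximate intermediate flows and then fuses two warped pixel values using soft visibility weights. The quadratic-in-t coefficients and the normalised fusion formula are assumptions of this sketch, since the abstract only says the flows are "linearly combined" and the warped images "linearly fused". The function names are hypothetical.

```python
def approx_intermediate_flows(f01, f10, t):
    """Approximate the flows from time t to frame 0 and to frame 1 at one pixel.

    f01, f10: (dx, dy) flow vectors for frame 0 -> 1 and frame 1 -> 0.
    The weights below are an assumption of this sketch; each output is a
    linear combination of the two input flows.
    """
    ft0 = tuple(-(1 - t) * t * a + t * t * b for a, b in zip(f01, f10))
    ft1 = tuple((1 - t) ** 2 * a - t * (1 - t) * b for a, b in zip(f01, f10))
    return ft0, ft1


def fuse(w0, w1, v0, v1, t, eps=1e-8):
    """Visibility-weighted linear fusion of two warped pixel values.

    w0, w1: pixel values warped from frame 0 and frame 1.
    v0, v1: soft visibility in [0, 1]; 0 means occluded in that frame.
    """
    a0 = (1 - t) * v0
    a1 = t * v1
    return (a0 * w0 + a1 * w1) / (a0 + a1 + eps)


if __name__ == "__main__":
    # A pixel moving 4 px to the right between the two frames, sampled at t = 0.5.
    print(approx_intermediate_flows((4.0, 0.0), (-4.0, 0.0), 0.5))
    # The pixel is occluded in frame 0, so only the frame 1 value contributes.
    print(fuse(100.0, 50.0, 0.0, 1.0, 0.5))
```

Because t enters only as a number and not as a learned parameter, the same functions can be evaluated at any t in (0, 1), which mirrors the abstract's point about producing as many intermediate frames as needed.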
We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-of-the-art.
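The self-supervised setup described above (drop an existing frame, then learn to predict it) can be sketched as a simple sampling routine. This is a minimal illustration for the interpolation case only; the function name and the `gap` parameter are hypothetical and are not taken from the paper.

```python
def make_training_triplets(frames, gap=1):
    """Build ((before, after), target) samples by dropping existing frames.

    frames: a sequence of video frames in temporal order (any objects).
    gap:    hypothetical spacing between the dropped frame and each neighbour.
    """
    samples = []
    for i in range(len(frames) - 2 * gap):
        before = frames[i]
        target = frames[i + gap]      # the dropped frame the network must predict
        after = frames[i + 2 * gap]
        samples.append(((before, after), target))
    return samples


if __name__ == "__main__":
    for inputs, target in make_training_triplets(["f0", "f1", "f2", "f3"]):
        print(inputs, "->", target)
```

No labels are needed because the target is always a real frame from the same video, which is why any video can serve as training data.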
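We propose a real-time intermediate flow estimation algorithm for video frame interpolation (VFI). Many recent flow-based VFI methods first estimate bi-directional optical flows, then scale and reverse them to approximate the intermediate flows, which leads to artifacts on motion boundaries. RIFE uses a neural network named IFNET that directly estimates the intermediate flows from coarse to fine, with much better speed. We design a privileged distillation scheme for training the intermediate flow model, which leads to a large performance improvement. RIFE does not rely on pre-trained optical flow models and can support frame interpolation at arbitrary time steps. Experiments demonstrate that RIFE achieves state-of-the-art performance on several public benchmarks. https://github.com/hzwer/arxiv2020-rife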
Standard video frame interpolation methods first estimate optical flow between input frames and then synthesize an intermediate frame guided by motion. Recent approaches merge these two steps into a single convolution process by convolving input frames with spatially adaptive kernels that account for motion and re-sampling simultaneously. These methods require large kernels to handle large motion, which limits the number of pixels whose kernels can be estimated at once due to the large memory demand. To address this problem, this paper formulates frame interpolation as local separable convolution over input frames using pairs of 1D kernels. Compared to regular 2D kernels, the 1D kernels require significantly fewer parameters to be estimated. Our method develops a deep fully convolutional neural network that takes two input frames and estimates pairs of 1D kernels for all pixels simultaneously. Since our method is able to estimate kernels and synthesize the whole video frame at once, it allows for the incorporation of perceptual loss to train the neural network to produce visually pleasing frames. This deep neural network is trained end-to-end using widely available video data without any human annotation. Both qualitative and quantitative experiments show that our method provides a practical solution to high-quality video frame interpolation.
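To make the parameter saving concrete, the sketch below applies a pair of 1D kernels to a single patch and checks that the result matches the equivalent 2D kernel formed as their outer product. It covers one output pixel and one input frame only; all function names are hypothetical, and the kernel values are arbitrary illustrative numbers rather than network outputs.

```python
def outer(kv, kh):
    """2D kernel equivalent to a vertical 1D kernel kv and a horizontal 1D kernel kh."""
    return [[v * h for h in kh] for v in kv]


def apply_2d(patch, kernel):
    """Weighted sum of a patch with a full 2D kernel (n * n weights)."""
    return sum(kernel[i][j] * patch[i][j]
               for i in range(len(kernel))
               for j in range(len(kernel[0])))


def apply_separable(patch, kv, kh):
    """Same result using only the two 1D kernels (2 * n weights)."""
    rows = [sum(h * p for h, p in zip(kh, row)) for row in patch]  # horizontal pass
    return sum(v * r for v, r in zip(kv, rows))                    # vertical pass


def weights_per_pixel(n):
    """Number of weights to estimate per pixel: (full 2D kernel, 1D kernel pair)."""
    return n * n, 2 * n


if __name__ == "__main__":
    patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    kv, kh = [0.25, 0.5, 0.25], [0.5, 0.0, 0.5]
    print(apply_2d(patch, outer(kv, kh)), apply_separable(patch, kv, kh))
    print(weights_per_pixel(41))
```

The saving grows with kernel size: an n-by-n kernel needs n * n weights per pixel, while the separable pair needs 2 * n, which is what makes it feasible to estimate kernels for every pixel of a frame at once.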
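We propose a novel DNN-based framework, called the Enhanced Correlation Matching based Video Frame Interpolation Network, to support high resolutions such as 4K, which involve large-scale motion and occlusion. Considering the scalability of the network model with respect to resolution, the proposed scheme employs a recurrent pyramid architecture that shares parameters across pyramid layers for optical flow estimation. In the proposed flow estimation, the optical flows are recursively refined by tracing the location with maximum correlation. Forward-warping-based correlation matching can improve the accuracy of the flow update by excluding incorrectly warped features around occluded regions. Based on the final bi-directional flows, the intermediate frame at an arbitrary temporal position is synthesized using a warping and blending network, and is further improved by a refinement network. Experimental results demonstrate that the proposed scheme outperforms previous works on both 4K video data and low-resolution benchmark datasets in terms of objective and subjective quality, with the smallest number of model parameters.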
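Video frame interpolation is a challenging task because real-world scenes are ever-changing. Previous methods usually compute bi-directional optical flows and then predict the intermediate optical flows under a linear motion assumption, which leads to isotropic intermediate flow generation. Follow-up studies obtain anisotropic adjustment through estimated higher-order motion information and extra frames. Because they rely on motion assumptions, these methods struggle to model complex motion in real scenes. In this paper, we propose an end-to-end training method, A^2OF, for video frame interpolation with event-driven anisotropic adjustment of optical flows. Specifically, we use events to generate optical flow distribution masks for the intermediate optical flow, which can model the complex motion between two frames. Our proposed method outperforms previous methods in video frame interpolation, taking event-based video interpolation to a higher level.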
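Existing video frame interpolation methods can only interpolate a frame at a given intermediate time step, e.g. 1/2. In this paper, we aim to explore a more generalized kind of video frame interpolation: interpolation at an arbitrary time step. To this end, we consider handling different time steps in a unified way with the help of meta-learning. Specifically, we develop a dual meta-learned frame interpolation framework that synthesizes intermediate frames under the guidance of context information and optical flow, taking the time step as side information. First, a content-aware meta-learned flow refinement module is built to improve the accuracy of the optical flow estimated from down-sampled versions of the input frames. Second, with the refined optical flow and the time step as input, a motion-aware meta-learned frame interpolation module generates a convolutional kernel for every pixel; these kernels are applied to the feature maps of the coarsely warped input frames to generate the predicted frame. Extensive qualitative and quantitative evaluations, as well as ablation studies, demonstrate that by introducing meta-learning into our framework in such a well-designed way, our method not only achieves performance superior to state-of-the-art frame interpolation approaches but also gains the extended capacity to support interpolation at an arbitrary time step.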
Motion blur from camera shake is a major problem in videos captured by hand-held devices. Unlike single-image deblurring, video-based approaches can take advantage of the abundant information that exists across neighboring frames. As a result, the best performing methods rely on the alignment of nearby frames. However, aligning images is a computationally expensive and fragile procedure, and methods that aggregate information must therefore be able to identify which regions have been accurately aligned and which have not, a task that requires high-level scene understanding. In this work, we introduce a deep learning solution to video deblurring, where a CNN is trained end-to-end to learn how to accumulate information across frames. To train this network, we collected a dataset of real videos recorded with a high frame rate camera, which we use to generate synthetic motion blur for supervision. We show that the features learned from this dataset extend to deblurring motion blur that arises due to camera shake in a wide range of videos, and compare the quality of results to a number of other baselines.
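One common way to generate synthetic motion blur from high-frame-rate footage is to average a short window of consecutive sharp frames and keep one of them as the ground truth. The sketch below assumes that approach; the abstract does not spell out the exact procedure, so the averaging, the choice of the middle frame as target, and the function name are all assumptions.

```python
def synthesize_blur_pair(frames):
    """Return (blurred, sharp) from a window of consecutive sharp frames.

    frames: list of grayscale frames, each a list of rows of floats.
    Assumption: blur is approximated by the per-pixel temporal mean, and the
    middle frame of the window is used as the sharp supervision target.
    """
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    blurred = [[sum(f[y][x] for f in frames) / n for x in range(width)]
               for y in range(height)]
    sharp = frames[n // 2]
    return blurred, sharp


if __name__ == "__main__":
    # A single bright pixel moving one column per frame in a 1 x 3 image.
    window = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]
    blurred, sharp = synthesize_blur_pair(window)
    print(blurred, sharp)
```

In the toy window, the moving pixel is smeared evenly across the row in the blurred output, while the sharp target keeps it in the middle column.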
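Upsampling videos of human activity is an interesting yet challenging task with many potential applications, ranging from gaming to entertainment and sports broadcasting. The main difficulty in synthesizing video frames in this setting stems from the highly complex and non-linear nature of human motion and the complex appearance and texture of the body. We propose to address these issues in a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance. A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset (AMASS). A neural rendering pipeline then uses the high-frame-rate pose predictions to produce full-frame output, taking pose and background consistency into account. Our pipeline requires only low-frame-rate videos and unpaired human motion data, and does not require high-frame-rate videos for training. Furthermore, we contribute the first evaluation dataset consisting of high-quality, high-frame-rate videos of human activities for this task. Compared with state-of-the-art video interpolation techniques, our method produces in-between frames with better quality and accuracy, as evidenced by state-of-the-art results on pixel-level and distributional metrics and by comparative user evaluations. Our code and the collected dataset are available at https://git.io/render-in-botween.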
Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is, however, intractable, and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a self-supervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution. Introduction: Motion estimation is a key component in video processing tasks such as temporal frame interpolation, video denoising,
A difficult example for video frame interpolation. Our approach produces a high-quality result in spite of the delicate flamingo leg that is subject to large motion. This is a video figure that is best viewed using Adobe Reader.
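Video frame interpolation (VFI) algorithms have improved considerably in recent years thanks to unprecedented progress in both data-driven algorithms and their implementations. Recent research has introduced advanced motion estimation or novel warping methods to address challenging VFI scenarios. However, no published VFI work considers the spatially non-uniform characteristics of the interpolation error (IE). This work introduces such a solution. By closely examining the correlation between optical flow and IE, this paper proposes novel error prediction metrics that partition the intermediate frame into distinct regions corresponding to different IE levels. Building on this IE-driven segmentation, and using novel error-controlled loss functions, it introduces an ensemble of spatially adaptive interpolation units that progressively processes and integrates the segmented regions. This spatial ensemble yields an effective and attractive VFI solution. Extensive experiments on popular video interpolation benchmarks indicate that the proposed solution outperforms the current state of the art (SOTA) in applications of current interest.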
We present a compact but effective CNN model for optical flow, called PWC-Net. PWC-Net has been designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume. Cast in a learnable feature pyramid, PWC-Net uses the current optical flow estimate to warp the CNN features of the second image. It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow. PWC-Net is 17 times smaller in size and easier to train than the recent FlowNet2 model. Moreover, it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks, running at about 35 fps on Sintel resolution (1024×436) images. Our models are available on https://github.com/NVlabs/PWC-Net.
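The cost-volume step in the PWC-Net abstract can be illustrated with a small correlation routine: for every pixel of the first image's feature map, it scores candidate displacements against the warped features of the second image. This is a plain-Python sketch on nested lists; the channel-averaged dot product, the zero score outside the map, the `max_disp` search range, and the function name are assumptions of this sketch rather than details given in the abstract.

```python
def cost_volume(feat1, feat2_warped, max_disp=1):
    """Correlate features of image 1 with warped features of image 2.

    feat1, feat2_warped: H x W x C nested lists of floats.
    Returns an H x W grid of dicts mapping each displacement (dy, dx) within
    max_disp to a channel-averaged dot product (an assumed matching cost).
    """
    height, width = len(feat1), len(feat1[0])
    channels = len(feat1[0][0])
    volume = []
    for y in range(height):
        row = []
        for x in range(width):
            costs = {}
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        dot = sum(a * b for a, b in
                                  zip(feat1[y][x], feat2_warped[yy][xx]))
                        costs[(dy, dx)] = dot / channels
                    else:
                        costs[(dy, dx)] = 0.0  # treat out-of-bounds as zero padding
            row.append(costs)
        volume.append(row)
    return volume


if __name__ == "__main__":
    f1 = [[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]]
    f2 = [[[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]]  # f1 shifted one pixel to the right
    costs = cost_volume(f1, f2)[0][0]
    print(max(costs, key=costs.get))
```

In the toy input the second feature map is the first one shifted right by one pixel, so the best-scoring displacement at the top-left pixel is one step in x. In the method described above, a CNN processes such a volume to estimate the flow instead of taking a hard maximum.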
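High-speed, high-resolution stereoscopic (H2-STEREO) video allows us to perceive dynamic 3D content at fine granularity. However, acquiring H2-STEREO video with commodity cameras remains challenging. Existing spatial super-resolution or temporal frame interpolation methods provide compromised solutions that lack temporal or spatial details, respectively. To alleviate this problem, we propose a dual-camera system in which one camera captures high-spatial-resolution low-frame-rate (HSR-LFR) video with rich spatial details, while the other captures low-spatial-resolution high-frame-rate (LSR-HFR) video with smooth temporal details. We then devise a learned information fusion network (LIFNET) that exploits cross-camera redundancy to enhance both camera views, effectively reconstructing the H2-STEREO video. We use a disparity network to transfer spatiotemporal information across views even in large-disparity scenes, based on which we propose disparity-guided flow-based warping for the LSR-HFR view and complementary warping for the HSR-LFR view. A multi-scale fusion method in the feature domain is proposed to minimize occlusion-induced warping ghosts and holes in the HSR-LFR view. LIFNET is trained end-to-end using a high-quality stereo video dataset collected from YouTube. Extensive experiments demonstrate that our model outperforms existing state-of-the-art methods on both synthetic data and camera-captured real data. Ablation studies explore various aspects, including spatiotemporal resolution, camera baseline, camera desynchronization, long/short exposures, and applications, to fully understand the model's capability for potential applications.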
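Video prediction is an extrapolation task that predicts future frames given past frames, whereas video frame interpolation is an interpolation task that estimates intermediate frames between two frames. We have witnessed tremendous advances in video frame interpolation, but general video prediction in the wild remains an open question. Inspired by the photo-realistic results of video frame interpolation, we present a new optimization framework for video prediction via video frame interpolation, in which we solve an extrapolation problem based on an interpolation model. Our video prediction framework is optimization-based and requires no training dataset, so there is no domain gap between training and test data. In addition, our approach does not need any additional information such as semantic or instance maps, which makes our framework applicable to any video. Extensive experiments on the Cityscapes, KITTI, DAVIS, Middlebury, and Vimeo90K datasets show that our video prediction results are robust in general scenarios, and that our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.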
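Rolling shutter (RS) distortion can be interpreted as the result of picking a row of pixels from instant global shutter (GS) frames over time during the exposure of the RS camera. This means that the information of each instant GS frame is partially, yet sequentially, embedded into the row-dependent distortion. Inspired by this fact, we address the challenging task of reversing this process, i.e., extracting undistorted GS frames from images suffering from RS distortion. However, since RS distortion is coupled with other factors such as readout settings and the relative velocity of scene elements and the camera, models that only exploit the geometric correlation between temporally adjacent images generalize poorly to data with different readout settings and to dynamic scenes with both camera motion and object motion. In this paper, instead of consecutive frames, we propose to exploit a pair of images captured by dual RS cameras with reversed RS directions for this highly challenging task. Grounded on the symmetric and complementary nature of dual reversed distortion, we develop a novel end-to-end model, IFED, to generate dual optical flow sequences through iterative learning of the velocity field during the RS time. Extensive experimental results demonstrate that IFED is superior to naive cascade schemes, as well as to state-of-the-art methods that utilize adjacent RS images. Most importantly, although it is trained on a synthetic dataset, IFED is shown to be effective at retrieving GS frame sequences from real-world RS-distorted images of dynamic scenes. Code is available at https://github.com/zzh-tech/dual-versed-rs.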
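The row-picking interpretation of rolling shutter distortion can be simulated directly. The sketch below builds an RS image by taking row r from the GS frame at readout instant r, and a second image with the opposite scan direction. It is a toy model that assumes exactly one instant GS frame per image row; the function names are hypothetical.

```python
def rolling_shutter_frame(gs_frames):
    """Top-to-bottom readout: row r comes from the GS frame at instant r.

    gs_frames: list of H global-shutter frames, each a list of H rows
    (assumption: one instant GS frame per image row).
    """
    n = len(gs_frames)
    return [gs_frames[r][r] for r in range(n)]


def reversed_rolling_shutter_frame(gs_frames):
    """Bottom-to-top readout: row r comes from the GS frame at instant H - 1 - r."""
    n = len(gs_frames)
    return [gs_frames[n - 1 - r][r] for r in range(n)]


if __name__ == "__main__":
    # A vertical bar that moves one column to the right at each instant.
    gs = [[[1 if col == t else 0 for col in range(3)] for _ in range(3)]
          for t in range(3)]
    for row in rolling_shutter_frame(gs):
        print(row)
    for row in reversed_rolling_shutter_frame(gs):
        print(row)
```

In the toy scene, the vertical bar is skewed one way by the top-to-bottom readout and the mirrored way by the reversed readout, which illustrates the symmetric and complementary distortion that the dual-camera setup exploits.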
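Videos typically record streaming, continuous visual data as discrete consecutive frames. Since storage is expensive for high-fidelity videos, most of them are stored at a relatively low resolution and frame rate. Recent work on Space-Time Video Super-Resolution (STVSR) incorporates temporal interpolation and spatial super-resolution in a unified framework. However, most of these methods support only a fixed up-sampling scale, which limits their flexibility and applications. In this work, instead of following discrete representations, we propose Video Implicit Neural Representation (VideoINR) and show its application to STVSR. The learned implicit neural representation can be decoded into videos of arbitrary spatial resolution and frame rate. We show that VideoINR achieves competitive performance with state-of-the-art STVSR methods on common up-sampling scales, and significantly outperforms prior works on continuous and out-of-training-distribution scales. Our project page is at http://zeyuan-chen.com/videoinr/.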
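We propose TAIN (Transformers and Attention for video INterpolation), a residual neural network for video interpolation, which aims to interpolate an intermediate frame given the two consecutive image frames around it. We first present a novel vision transformer module, named Cross Similarity (CS), to gather input image features whose appearance is similar to that of the predicted interpolated frame. These CS features are then used to refine the interpolated prediction. To account for occlusions in the CS features, we propose an Image Attention (IA) module that allows the network to focus on the CS features from one frame over those of the other. In addition, we augment our training dataset with an occluder patch that moves across frames, to improve the network's robustness to occlusions and large motions. Because existing methods yield smooth predictions, especially near motion boundaries (MBs), we use an additional training loss based on image gradients to yield sharper predictions. TAIN outperforms existing methods that do not require flow estimation and performs comparably to flow-based methods, while being computationally efficient in terms of inference time on the Vimeo90K, UCF101, and SNU-FILM benchmarks.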
Flow-guided synthesis provides a common framework for frame interpolation, where optical flow is typically estimated by a pyramid network, and then leveraged to guide a synthesis network to generate intermediate frames between input frames. In this paper, we present UPR-Net, a novel Unified Pyramid Recurrent Network for frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement for both optical flow and intermediate frame. In particular, we show that our iterative synthesis can significantly improve the robustness of frame interpolation on large motion cases. Despite being extremely lightweight (1.7M parameters), UPR-Net achieves excellent performance on a large range of benchmarks. Code will be available soon.