Videos shot by laymen using hand-held cameras contain undesirable shaky motion. Estimating the global motion between successive frames, in a manner not influenced by moving objects, is central to many video stabilization techniques, but poses significant challenges. A large body of work uses 2D affine transformations or homography for the global motion. However, in this work, we introduce a more general representation scheme, which adapts any existing optical flow network to ignore the moving objects and obtain a spatially smooth approximation of the global motion between video frames. We achieve this by a knowledge distillation approach, where we first introduce a low pass filter module into the optical flow network to constrain the predicted optical flow to be spatially smooth. This becomes our student network, named as \textsc{GlobalFlowNet}. Then, using the original optical flow network as the teacher network, we train the student network using a robust loss function. Given a trained \textsc{GlobalFlowNet}, we stabilize videos using a two stage process. In the first stage, we correct the instability in affine parameters using a quadratic programming approach constrained by a user-specified cropping limit to control loss of field of view. In the second stage, we stabilize the video further by smoothing global motion parameters, expressed using a small number of discrete cosine transform coefficients. In extensive experiments on a variety of different videos, our technique outperforms state of the art techniques in terms of subjective quality and different quantitative measures of video stability. The source code is publicly available at \href{https://github.com/GlobalFlowNet/GlobalFlowNet}{https://github.com/GlobalFlowNet/GlobalFlowNet}
translated by 谷歌翻译
视频是一种流行的媒体形式,其中在线视频流最近聚集了很多人气。在这项工作中,我们提出了一种新颖的实时视频稳定方法 - 将摇晃视频转换为稳定的视频,仿佛它实时通过万向节稳定。我们的框架是以自我监督的方式进行培训,不需要使用特殊硬件设置(即,在立体声钻机或附加运动传感器上的两个摄像机)捕获的数据。我们的框架包括在给定帧之间的转换估计器,用于全局稳定性调整,然后通过空间平滑的光学流动的场景视差减少模块,以进一步稳定。然后,保证金修整模块填充稳定期间创建的缺失的边缘区域,以减少裁剪后的数量。这些顺序步骤将失真和边距减少到最小,同时增强稳定性。因此,我们的方法优于最先进的实时视频稳定方法以及需要相机轨迹优化的离线方法。无论分辨率(例如,480p或1080p),我们的方法程序大约需要41 fps的24.3 ms。
translated by 谷歌翻译
以前的深度学习视频稳定器需要大量的配对不稳定和稳定的视频进行培训,这很难收集。另一方面,基于传统的基于轨迹的稳定器将任务分为几个子任务并随后对其进行处理,这些任务在使用手工制作的功能方面无纹理和遮挡的区域脆弱。在本文中,我们试图以一种深刻的无监督学习方式解决视频稳定问题,这借鉴了传统稳定器的分裂和纠纷思想,同时利用DNNS的代表权来应对现实情况下的挑战。从技术上讲,DUT由轨迹估计阶段和轨迹平滑阶段组成。在轨迹估计阶段,我们首先估计按键点的运动,初始化和完善网格的运动,分别通过新型的多摄影估计策略和运动改进网络,并通过临时关联获得基于网格的轨迹。在轨迹平滑阶段,我们设计了一个新颖的网络来预测轨迹平滑的动态平滑核,这可以很好地适应具有不同动态模式的轨迹。我们利用关键点和网格顶点的空间和时间连贯性来制定训练目标,从而导致无监督的培训计划。公共基准测试的实验结果表明,DUT在定性和定量上都优于最先进的方法。源代码可在https://github.com/annbless/dutcode上找到。
translated by 谷歌翻译
We present a novel camera path optimization framework for the task of online video stabilization. Typically, a stabilization pipeline consists of three steps: motion estimating, path smoothing, and novel view rendering. Most previous methods concentrate on motion estimation, proposing various global or local motion models. In contrast, path optimization receives relatively less attention, especially in the important online setting, where no future frames are available. In this work, we adopt recent off-the-shelf high-quality deep motion models for the motion estimation to recover the camera trajectory and focus on the latter two steps. Our network takes a short 2D camera path in a sliding window as input and outputs the stabilizing warp field of the last frame in the window, which warps the coming frame to its stabilized position. A hybrid loss is well-defined to constrain the spatial and temporal consistency. In addition, we build a motion dataset that contains stable and unstable motion pairs for the training. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art online methods both qualitatively and quantitatively and achieves comparable performance to offline methods.
translated by 谷歌翻译
可以通过定期预测未来的框架以增强虚拟现实应用程序中的用户体验,从而解决了低计算设备上图形渲染高帧速率视频的挑战。这是通过时间视图合成(TVS)的问题来研究的,该问题的目标是预测给定上一个帧的视频的下一个帧以及上一个和下一个帧的头部姿势。在这项工作中,我们考虑了用户和对象正在移动的动态场景的电视。我们设计了一个将运动解散到用户和对象运动中的框架,以在预测下一帧的同时有效地使用可用的用户运动。我们通过隔离和估计过去框架的3D对象运动,然后推断它来预测对象的运动。我们使用多平面图像(MPI)作为场景的3D表示,并将对象运动作为MPI表示中相应点之间的3D位移建模。为了在估计运动时处理MPI中的稀疏性,我们将部分卷积和掩盖的相关层纳入了相应的点。然后将预测的对象运动与给定的用户或相机运动集成在一起,以生成下一帧。使用不合格的填充模块,我们合成由于相机和对象运动而发现的区域。我们为动态场景的电视开发了一个新的合成数据集,该数据集由800个以全高清分辨率组成的视频组成。我们通过数据集和MPI Sintel数据集上的实验表明我们的模型优于文献中的所有竞争方法。
translated by 谷歌翻译
视频稳定在提高视频质量方面起着核心作用。但是,尽管这些方法取得了很大的进展,但它们主要是在标准天气和照明条件下进行的,并且在不利条件下的性能可能会差。在本文中,我们提出了一种用于视频稳定的综合感知不良天气鲁棒算法,该算法不需要真实数据,并且只能在合成数据上接受培训。我们还提出了Silver,这是一种新颖的渲染引擎,可通过自动地面提取程序生成所需的训练数据。我们的方法使用我们的特殊生成的合成数据来训练仿射转换矩阵估计器,避免了当前方法面临的特征提取问题。此外,由于在不利条件下没有视频稳定数据集,因此我们提出了新颖的VSAC105REAL数据集以进行评估。我们将我们的方法与使用两个基准测试的五种最先进的视频稳定算法进行了比较。我们的结果表明,当前的方法在至少一个天气条件下的表现差,即使在一个具有合成数据的小数据集中培训,我们就稳定性得分,失真得分,成功率和平均种植方面取得了最佳性能考虑所有天气条件时的比率。因此,我们的视频稳定模型在现实世界的视频上很好地概括了,并且不需要大规模的合成训练数据来收敛。
translated by 谷歌翻译
Motion blur from camera shake is a major problem in videos captured by hand-held devices. Unlike single-image deblurring, video-based approaches can take advantage of the abundant information that exists across neighboring frames. As a result the best performing methods rely on the alignment of nearby frames. However, aligning images is a computationally expensive and fragile procedure, and methods that aggregate information must therefore be able to identify which regions have been accurately aligned and which have not, a task that requires high level scene understanding. In this work, we introduce a deep learning solution to video deblurring, where a CNN is trained end-toend to learn how to accumulate information across frames. To train this network, we collected a dataset of real videos recorded with a high frame rate camera, which we use to generate synthetic motion blur for supervision. We show that the features learned from this dataset extend to deblurring motion blur that arises due to camera shake in a wide range of videos, and compare the quality of results to a number of other baselines 1 .
translated by 谷歌翻译
With the advent of Neural Style Transfer (NST), stylizing an image has become quite popular. A convenient way for extending stylization techniques to videos is by applying them on a per-frame basis. However, such per-frame application usually lacks temporal-consistency expressed by undesirable flickering artifacts. Most of the existing approaches for enforcing temporal-consistency suffers from one or more of the following drawbacks. They (1) are only suitable for a limited range of stylization techniques, (2) can only be applied in an offline fashion requiring the complete video as input, (3) cannot provide consistency for the task of stylization, or (4) do not provide interactive consistency-control. Note that existing consistent video-filtering approaches aim to completely remove flickering artifacts and thus do not respect any specific consistency-control aspect. For stylization tasks, however, consistency-control is an essential requirement where a certain amount of flickering can add to the artistic look and feel. Moreover, making this control interactive is paramount from a usability perspective. To achieve the above requirements, we propose an approach that can stylize video streams while providing interactive consistency-control. Apart from stylization, our approach also supports various other image processing filters. For achieving interactive performance, we develop a lite optical-flow network that operates at 80 Frames per second (FPS) on desktop systems with sufficient accuracy. We show that the final consistent video-output using our flow network is comparable to that being obtained using state-of-the-art optical-flow network. Further, we employ an adaptive combination of local and global consistent features and enable interactive selection between the two. By objective and subjective evaluation, we show that our method is superior to state-of-the-art approaches.
translated by 谷歌翻译
隐式神经表示(INR)被出现为代表信号的强大范例,例如图像,视频,3D形状等。尽管它已经示出了能够表示精细细节的能力,但其效率尚未得到广泛研究数据表示。在INR中,数据以神经网络的参数的形式存储,并且通用优化算法通常不会利用信号中的空间和时间冗余。在本文中,我们建议通过明确地删除数据冗余来表示和压缩视频的新型INR方法。我们提出了跨视频帧和残差的主体剩余流场(NRFF)而不是存储原始RGB颜色,而不是存储原始RGB颜色。维护通常更光滑和更复杂的运动信息,比原始信号更少,需要更少的参数。此外,重用冗余像素值进一步提高了网络参数效率。实验结果表明,所提出的方法优于基线方法的显着边际。代码可用于https://github.com/daniel03c1/eff_video_repruseentation。
translated by 谷歌翻译
Block based motion estimation is integral to inter prediction processes performed in hybrid video codecs. Prevalent block matching based methods that are used to compute block motion vectors (MVs) rely on computationally intensive search procedures. They also suffer from the aperture problem, which can worsen as the block size is reduced. Moreover, the block matching criteria used in typical codecs do not account for the resulting levels of perceptual quality of the motion compensated pictures that are created upon decoding. Towards achieving the elusive goal of perceptually optimized motion estimation, we propose a search-free block motion estimation framework using a multi-stage convolutional neural network, which is able to conduct motion estimation on multiple block sizes simultaneously, using a triplet of frames as input. This composite block translation network (CBT-Net) is trained in a self-supervised manner on a large database that we created from publicly available uncompressed video content. We deploy the multi-scale structural similarity (MS-SSIM) loss function to optimize the perceptual quality of the motion compensated predicted frames. Our experimental results highlight the computational efficiency of our proposed model relative to conventional block matching based motion estimation algorithms, for comparable prediction errors. Further, when used to perform inter prediction in AV1, the MV predictions of the perceptually optimized model result in average Bjontegaard-delta rate (BD-rate) improvements of -1.70% and -1.52% with respect to the MS-SSIM and Video Multi-Method Assessment Fusion (VMAF) quality metrics, respectively as compared to the block matching based motion estimation system employed in the SVT-AV1 encoder.
translated by 谷歌翻译
Given two consecutive frames, video interpolation aims at generating intermediate frame(s) to form both spatially and temporally coherent video sequences. While most existing methods focus on single-frame interpolation, we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, where the motion interpretation and occlusion reasoning are jointly modeled. We start by computing bi-directional optical flow between the input images using a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bi-directional optical flows. These approximate flows, however, only work well in locally smooth regions and produce artifacts around motion boundaries. To address this shortcoming, we employ another U-Net to refine the approximated flow and also predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated intermediate frame to avoid artifacts. Since none of our learned network parameters are time-dependent, our approach is able to produce as many intermediate frames as needed. To train our network, we use 1,132 240-fps video clips, containing 300K individual video frames. Experimental results on several datasets, predicting different numbers of interpolated frames, demonstrate that our approach performs consistently better than existing methods.
translated by 谷歌翻译
滚动快门(RS)失真可以解释为在RS摄像机曝光期间,随着时间的推移从瞬时全局快门(GS)框架中挑选一排像素。这意味着每个即时GS帧的信息部分,依次是嵌入到行依赖性失真中。受到这一事实的启发,我们解决了扭转这一过程的挑战性任务,即从rs失真中的图像中提取未变形的GS框架。但是,由于RS失真与其他因素相结合,例如读数设置以及场景元素与相机的相对速度,因此仅利用临时相邻图像之间的几何相关性的型号,在处理数据中,具有不同的读数设置和动态场景的数据中遭受了不良的通用性。带有相机运动和物体运动。在本文中,我们建议使用双重RS摄像机捕获的一对图像,而不是连续的框架,而RS摄像机则具有相反的RS方向,以完成这项极具挑战性的任务。基于双重反转失真的对称和互补性,我们开发了一种新型的端到端模型,即IFED,以通过卢比时间对速度场的迭代学习来生成双重光流序列。广泛的实验结果表明,IFED优于天真的级联方案,以及利用相邻RS图像的最新艺术品。最重要的是,尽管它在合成数据集上进行了训练,但显示出在从现实世界中的RS扭曲的动态场景图像中检索GS框架序列有效。代码可在https://github.com/zzh-tech/dual-versed-rs上找到。
translated by 谷歌翻译
本文介绍了一种基于CNN的完全无监督的方法,用于光流量的运动分段。我们假设输入光流可以表示为分段参数运动模型集,通常,仿射或二次运动模型。这项工作的核心思想是利用期望 - 最大化(EM)框架。它使我们能够以良好的方式设计丢失功能和我们运动分割神经网络的培训程序。然而,与经典迭代的EM相比,一旦培训网络,我们就可以为单个推理步骤中的任何看不见的光学流场提供分割,没有对运动模型参数的初始化,因为它们没有估计推断阶段。已经调查了不同的损失功能,包括强大的功能。我们还提出了一种关于光学流场的新型数据增强技术,对性能显着影响。我们在Davis2016数据集上测试了我们的运动分段网络。我们的方法优于相当的无监督方法,非常有效。实际上,它可以在125fps运行,使其可用于实时应用程序。
translated by 谷歌翻译
如何正确对视频序列中的框架间关系进行建模是视频恢复(VR)的重要挑战。在这项工作中,我们提出了一个无监督的流动对准​​序列模型(S2SVR)来解决此问题。一方面,在VR中首次探讨了在自然语言处理领域的序列到序列模型。优化的序列化建模显示了捕获帧之间远程依赖性的潜力。另一方面,我们为序列到序列模型配备了无监督的光流估计器,以最大程度地发挥其潜力。通过我们提出的无监督蒸馏损失对流量估计器进行了训练,这可以减轻数据差异和以前基于流动的方法的降解光流问题的不准确降解。通过可靠的光流,我们可以在多个帧之间建立准确的对应关系,从而缩小了1D语言和2D未对准框架之间的域差异,并提高了序列到序列模型的潜力。 S2SVR在多个VR任务中显示出卓越的性能,包括视频脱张,视频超分辨率和压缩视频质量增强。代码和模型可在https://github.com/linjing7/vr-baseline上公开获得
translated by 谷歌翻译
We pose video object segmentation as spectral graph clustering in space and time, with one graph node for each pixel and edges forming local space-time neighborhoods. We claim that the strongest cluster in this video graph represents the salient object. We start by introducing a novel and efficient method based on 3D filtering for approximating the spectral solution, as the principal eigenvector of the graph's adjacency matrix, without explicitly building the matrix. This key property allows us to have a fast parallel implementation on GPU, orders of magnitude faster than classical approaches for computing the eigenvector. Our motivation for a spectral space-time clustering approach, unique in video semantic segmentation literature, is that such clustering is dedicated to preserving object consistency over time, which we evaluate using our novel segmentation consistency measure. Further on, we show how to efficiently learn the solution over multiple input feature channels. Finally, we extend the formulation of our approach beyond the segmentation task, into the realm of object tracking. In extensive experiments we show significant improvements over top methods, as well as over powerful ensembles that combine them, achieving state-of-the-art on multiple benchmarks, both for tracking and segmentation.
translated by 谷歌翻译
现有的基于深度学习的无监督视频对象分割方法仍依靠地面真实的细分面具来训练。在这种情况下令人未知的意味着在推理期间没有使用注释帧。由于获得真实图像场景的地面真实的细分掩码是一种艰苦的任务,我们想到了一个简单的框架,即占主导地位的移动对象分割,既不需要注释数据训练,也不依赖于显着的电视或预先训练的光流程图。灵感来自分层图像表示,我们根据仿射参数运动引入对像素区域进行分组的技术。这使我们的网络能够仅使用RGB图像对为培训和推理的输入来学习主要前景对象的分割。我们使用新的MOVERCARS DataSet为这项新颖任务建立了基线,并对最近的方法表现出竞争性能,这些方法需要培训带有注释面具的最新方法。
translated by 谷歌翻译
高动态范围(HDR)视频提供比标准低动态范围(LDR)视频更具视觉上的体验。尽管HDR成像具有重要进展,但仍有一个具有挑战性的任务,可以使用传统的现成摄像头捕获高质量的HDR视频。现有方法完全依赖于在相邻的LDR序列之间使用致密光流来重建HDR帧。然而,当用嘈杂的框架应用于交替的曝光时,它们会导致颜色和暴露的曝光不一致。在本文中,我们提出了一种从LDR序列与交替曝光的LDR序列的HDR视频重建的端到端GAN框架。我们首先从Noisy LDR视频中提取清洁LDR帧,并具有在自我监督设置中培训的去噪网络的交替曝光。然后,我们将相邻的交流帧与参考帧对齐,然后在完全的对手设置中重建高质量的HDR帧。为了进一步提高所产生帧的鲁棒性和质量,我们在培训过程中将时间稳定性的正则化术语与成本函数的内容和风格的损耗一起融合。实验结果表明,我们的框架实现了最先进的性能,并通过现有方法生成视频的优质HDR帧。
translated by 谷歌翻译
我们提出了一种称为基于DNN的基于DNN的框架,称为基于增强的相关匹配的视频帧插值网络,以支持4K的高分辨率,其具有大规模的运动和遮挡。考虑到根据分辨率的网络模型的可扩展性,所提出的方案采用经常性金字塔架构,该架构分享每个金字塔层之间的参数进行光学流量估计。在所提出的流程估计中,通过追踪具有最大相关性的位置来递归地改进光学流。基于前扭曲的相关匹配可以通过排除遮挡区域周围的错误扭曲特征来提高流量更新的准确性。基于最终双向流动,使用翘曲和混合网络合成任意时间位置的中间帧,通过细化网络进一步改善。实验结果表明,所提出的方案在4K视频数据和低分辨率基准数据集中占据了之前的工作,以及具有最小型号参数的客观和主观质量。
translated by 谷歌翻译
The quantitative evaluation of optical flow algorithms by Barron et al. (1994) led to significant advances in performance. The challenges for optical flow algorithms today go beyond the datasets and evaluation methods proposed in that paper. Instead, they center on problems associated with complex natural scenes, including nonrigid motion, real sensor noise, and motion discontinuities. We propose a new set of benchmarks and evaluation methods for the next generation of optical flow algorithms. To that end, we contribute four types of data to test different aspects of optical flow algorithms: (1) sequences with nonrigid motion where the ground-truth flow is determined by A preliminary version of this paper appeared in the IEEE International Conference on Computer Vision (Baker et al. 2007).
translated by 谷歌翻译
我们提出了一种用于视频帧插值(VFI)的实时中流估计算法。许多最近的基于流的VFI方法首先估计双向光学流,然后缩放并将它们倒转到近似中间流动,导致运动边界上的伪像。RIFE使用名为IFNET的神经网络,可以直接估计中间流量从粗细流,速度更好。我们设计了一种用于训练中间流动模型的特权蒸馏方案,这导致了大的性能改善。Rife不依赖于预先训练的光流模型,可以支持任意时间的帧插值。实验表明,普里埃雷在若干公共基准上实现了最先进的表现。\ url {https://github.com/hzwer/arxiv2020-rife}。
translated by 谷歌翻译