Optical flow estimation is an important yet challenging problem in the field of video analysis. The features at different semantic levels/layers of a convolutional neural network can provide information of different granularity. To exploit such flexible and comprehensive information, we propose a semi-supervised Feature Pyramidal Correlation and Residual Reconstruction Network (FPCR-Net) for optical flow estimation from frame pairs. It consists of two main modules: pyramid correlation mapping and residual reconstruction. The pyramid correlation mapping module takes advantage of the multi-scale correlations of global/local patches by aggregating features of different scales to form a multi-level cost volume. The residual reconstruction module aims to reconstruct the sub-band high-frequency residuals of finer optical flow at each stage. Based on the pyramid correlation mapping, we further propose a correlation-warping-normalization (CWN) module to efficiently exploit correlation dependencies. Experimental results show that the proposed scheme achieves state-of-the-art performance, improving the average end-point error (AEE) by 0.80, 1.15, and 0.10 against the competing baseline methods FlowNet2, LiteFlowNet, and PWC-Net on the final pass of the Sintel dataset.
We present a compact but effective CNN model for optical flow, called PWC-Net. PWC-Net has been designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume. Cast in a learnable feature pyramid, PWC-Net uses the current optical flow estimate to warp the CNN features of the second image. It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow. PWC-Net is 17 times smaller in size and easier to train than the recent FlowNet2 model. Moreover, it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks, running at about 35 fps on Sintel resolution (1024×436) images. Our models are available on https://github.com/NVlabs/PWC-Net.
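The pyramid-warp-cost-volume recipe above is compact enough to sketch directly. Below is a minimal PyTorch sketch, not the NVlabs implementation: it backward-warps the second image's features by the current flow estimate, then correlates them against the first image's features over a small displacement range to form a partial cost volume.

```python
# Minimal sketch of PWC-Net's two core ops (illustrative, not the official code).
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Backward-warp feat (B,C,H,W) by flow (B,2,H,W) with bilinear sampling."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2,H,W), x first
    coords = grid.unsqueeze(0) + flow                             # sampling positions
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0           # normalize to [-1,1]
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_n = torch.stack((coords_x, coords_y), dim=-1)            # (B,H,W,2)
    return F.grid_sample(feat, grid_n, align_corners=True)

def cost_volume(feat1, feat2_warped, max_disp=4):
    """Correlate feat1 with shifted copies of the warped feat2: (2d+1)^2 channels."""
    b, c, h, w = feat1.shape
    pad = F.pad(feat2_warped, [max_disp] * 4)
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = pad[:, :, dy:dy + h, dx:dx + w]
            costs.append((feat1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)   # (B,(2d+1)^2,H,W), fed to the flow decoder

feat1, feat2 = torch.randn(1, 32, 48, 64), torch.randn(1, 32, 48, 64)
flow_up = torch.zeros(1, 2, 48, 64)   # upsampled flow from the coarser level
print(cost_volume(feat1, warp(feat2, flow_up)).shape)  # torch.Size([1, 81, 48, 64])
```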
Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground truth datasets are not sufficiently large to train a CNN, we generate a synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.
We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance. On KITTI, RAFT achieves an F1-all error of 5.10%, a 16% error reduction from the best published result (6.10%). On Sintel (final pass), RAFT obtains an end-point-error of 2.855 pixels, a 30% error reduction from the best published result (4.098 pixels). In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count. Code is available at https://github.com/princeton-vl/RAFT.
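The two components that distinguish RAFT, the all-pairs 4D correlation volume and the windowed lookup around the current flow estimate, can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the official code; RAFT additionally average-pools the last two dimensions of the volume into a 4-level pyramid before lookup.

```python
import torch
import torch.nn.functional as F

def all_pairs_correlation(f1, f2):
    """f1, f2: (B,C,H,W). Returns a 4D volume (B,H,W,H,W) of normalized dot
    products between every pixel pair (multi-scale pooling omitted)."""
    b, c, h, w = f1.shape
    corr = torch.einsum("bci,bcj->bij", f1.reshape(b, c, -1),
                        f2.reshape(b, c, -1)) / c ** 0.5
    return corr.reshape(b, h, w, h, w)

def lookup(corr, flow, radius=4):
    """Read a (2r+1)^2 correlation window around each pixel's current match."""
    b, h, w = corr.shape[:3]
    corr2 = corr.reshape(b * h * w, 1, h, w)          # one target map per source px
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    centre = torch.stack((xs, ys), 0).float() + flow  # matched position (B,2,H,W)
    centre = centre.permute(0, 2, 3, 1).reshape(b * h * w, 1, 1, 2)
    d = torch.arange(-radius, radius + 1).float()
    dy, dx = torch.meshgrid(d, d, indexing="ij")
    window = torch.stack((dx, dy), -1).reshape(1, 2 * radius + 1,
                                               2 * radius + 1, 2)
    coords = centre + window                          # sample positions, in pixels
    coords[..., 0] = 2 * coords[..., 0] / max(w - 1, 1) - 1   # normalize to [-1,1]
    coords[..., 1] = 2 * coords[..., 1] / max(h - 1, 1) - 1
    out = F.grid_sample(corr2, coords, align_corners=True)
    return out.reshape(b, h, w, -1).permute(0, 3, 1, 2)       # (B,(2r+1)^2,H,W)

f1, f2 = torch.randn(1, 64, 32, 40), torch.randn(1, 64, 32, 40)
corr = all_pairs_correlation(f1, f2)
print(lookup(corr, torch.zeros(1, 2, 32, 40)).shape)  # torch.Size([1, 81, 32, 40])
```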
We propose a DNN-based video frame interpolation network with enhanced correlation matching, designed to support high resolutions such as 4K, which involve large-scale motion and occlusion. Considering the scalability of the network model with respect to resolution, the proposed scheme adopts a recurrent pyramid architecture that shares parameters across pyramid levels for optical flow estimation. In the proposed flow estimation, the optical flow is refined recursively by tracing the position with the maximum correlation. Correlation matching based on forward warping improves the accuracy of the flow update by excluding incorrectly warped features around occluded regions. Based on the final bidirectional flows, intermediate frames at arbitrary temporal positions are synthesized using a warping and blending network and further improved by a refinement network. Experimental results show that the proposed scheme outperforms previous works on 4K video data and low-resolution benchmark datasets, in both objective and subjective quality, with the smallest number of model parameters.
We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. Third, unlike FlowNet, the learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction of combining classical flow methods with deep learning.
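The coarse-to-fine update rule described above is a short loop in code. A schematic PyTorch sketch follows; the `LevelNet` stand-in is far smaller than SPyNet's real per-level networks, and `warp` is the bilinear warping helper from the PWC-Net sketch earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LevelNet(nn.Module):
    """Tiny stand-in for SPyNet's per-level CNN: it sees frame 1, the warped
    frame 2, and the current flow, and predicts only a residual flow update."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + 3 + 2, 32, 7, padding=3), nn.ReLU(),
            nn.Conv2d(32, 2, 7, padding=3))

    def forward(self, i1, i2_warped, flow):
        return self.net(torch.cat((i1, i2_warped, flow), dim=1))

def spynet_forward(pyr1, pyr2, nets):
    """pyr1/pyr2: image pyramids (coarsest first); nets: one LevelNet per level.
    The residual motion at each level stays small (< 1 px), as the abstract notes."""
    flow = torch.zeros(pyr1[0].size(0), 2, *pyr1[0].shape[2:])
    for i1, i2, net in zip(pyr1, pyr2, nets):
        if flow.shape[2:] != i1.shape[2:]:
            flow = 2.0 * F.interpolate(flow, size=i1.shape[2:],
                                       mode="bilinear", align_corners=True)
        flow = flow + net(i1, warp(i2, flow), flow)   # refine with a residual
    return flow

i1, i2 = torch.rand(1, 3, 64, 96), torch.rand(1, 3, 64, 96)
pyr1 = [F.interpolate(i1, scale_factor=1 / 2 ** k) for k in (2, 1, 0)]
pyr2 = [F.interpolate(i2, scale_factor=1 / 2 ** k) for k in (2, 1, 0)]
print(spynet_forward(pyr1, pyr2, [LevelNet() for _ in range(3)]).shape)
# torch.Size([1, 2, 64, 96])
```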
Recent works have shown that optical flow can be learned by deep networks from unlabelled image pairs based on the brightness constancy assumption and a smoothness prior. Current approaches additionally impose an augmentation regularization term for continual self-supervision, which has proved effective on difficult matching regions. However, this method also amplifies the inevitable mismatches of the unsupervised setting, blocking the learning process from converging to the optimal solution. To break this dilemma, we propose a novel mutual distillation framework to transfer reliable knowledge back and forth between the teacher and student networks for alternate improvement. Concretely, taking the estimates of an off-the-shelf unsupervised approach as pseudo labels, our key insight is to define a confidence selection mechanism that extracts relatively good matches, and then to apply diverse data augmentation for distilling adequate and reliable knowledge from teacher to student. Thanks to the decoupled nature of our method, we can choose a stronger student architecture for sufficient learning. Finally, the better student prediction is adopted to transfer knowledge back to the efficient teacher without additional costs in real deployment. Rather than formulating this as a supervised task, we find that introducing an extra unsupervised term for multi-target learning achieves the best final results. Extensive experiments show that our approach, termed MDFlow, achieves state-of-the-art real-time accuracy and generalization ability on challenging benchmarks. Code is available at https://github.com/ltkong218/MDFlow.
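The abstract does not spell out the confidence selection mechanism; a common choice for flagging reliable matches in unsupervised flow, assumed here purely for illustration, is a forward-backward consistency check, after which distillation is applied only on the surviving pixels (`warp` as in the PWC-Net sketch above).

```python
import torch

def fb_consistency_mask(flow_fw, flow_bw, alpha=0.01, beta=0.5):
    """flow_fw/flow_bw: (B,2,H,W). A pixel passes if chaining the forward flow
    with the (warped) backward flow nearly returns it to its start point.
    NOTE: an assumed criterion, not necessarily MDFlow's exact mechanism."""
    flow_bw_at_match = warp(flow_bw, flow_fw)          # backward flow at the match
    diff = (flow_fw + flow_bw_at_match).pow(2).sum(1)  # round-trip error
    mag = flow_fw.pow(2).sum(1) + flow_bw_at_match.pow(2).sum(1)
    return (diff < alpha * mag + beta).float()         # (B,H,W) confidence mask

def distillation_loss(student_flow, teacher_flow, mask):
    """L1 between student and teacher flow, on confident pixels only."""
    err = (student_flow - teacher_flow).abs().sum(1)
    return (err * mask).sum() / mask.sum().clamp(min=1.0)
```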
Recently, flow-based frame interpolation methods have achieved great success by first modeling the optical flow between target and input frames, and then building a synthesis network for target frame generation. However, this cascaded architecture can lead to a large model size and inference delay, hindering its use in mobile and real-time applications. To solve this problem, we propose a novel Progressive Motion Context Refine Network (PMCRNet) that predicts motion fields and image context jointly for higher efficiency. Different from others that directly synthesize the target frame from deep features, we simplify the frame interpolation task by borrowing existing texture from adjacent input frames, which means that the decoder at each pyramid level of our PMCRNet only needs to update the easier intermediate optical flow, occlusion merge mask, and image residual. Moreover, we introduce a new annealed multi-scale reconstruction loss to better guide the learning process of this efficient PMCRNet. Experiments on multiple benchmarks show that the proposed approach not only achieves favorable quantitative and qualitative results but also reduces model size and running time significantly.
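The annealing schedule of the multi-scale reconstruction loss is not specified above; the sketch below assumes one common pattern, supervising every pyramid level early in training and decaying the auxiliary coarse-level terms over time.

```python
import torch
import torch.nn.functional as F

def annealed_multiscale_loss(preds, target, step, total_steps, gamma=0.9):
    """preds: list of synthesized frames, coarsest first; target: ground truth.
    An assumed schedule: coarse-level weights fade out as training progresses."""
    t = step / total_steps
    loss, n = 0.0, len(preds)
    for k, pred in enumerate(preds):
        tgt = F.interpolate(target, size=pred.shape[2:],
                            mode="bilinear", align_corners=False)
        weight = gamma ** (n - 1 - k)        # finer levels weigh more
        if k < n - 1:
            weight = weight * (1.0 - t)      # anneal away auxiliary levels
        loss = loss + weight * (pred - tgt).abs().mean()
    return loss
```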
Reconstructing ghost-free high dynamic range (HDR) images from a set of multi-exposure images is a challenging task, especially in the presence of large object motion and occlusion, where existing methods produce visible artifacts. To address this problem, we propose a deep network that attempts to learn multi-scale feature flow guided by a regularization loss. It first extracts multi-scale features and then aligns the features of the non-reference images. After alignment, we merge the features of different images using residual channel attention blocks. Extensive qualitative and quantitative comparisons show that our approach achieves state-of-the-art performance and produces excellent results with greatly reduced color artifacts and geometric distortion.
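The abstract names residual channel attention blocks for merging the aligned features; the sketch below shows the standard RCAN-style layout of such a block, which is an assumption for illustration rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class RCAB(nn.Module):
    """Residual channel attention block: two convs, a squeeze-and-excitation
    gate that rescales channels, and a skip connection."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.attn = nn.Sequential(                 # channel attention gate
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        res = self.body(x)
        return x + res * self.attn(res)            # rescale channels, then skip

print(RCAB()(torch.randn(1, 64, 32, 32)).shape)    # torch.Size([1, 64, 32, 32])
```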
The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second, we develop a stacked architecture that includes warping of the second image with intermediate optical flow. Third, we elaborate on small displacements by introducing a subnetwork specializing on small motions. FlowNet 2.0 is only marginally slower than the original FlowNet but decreases the estimation error by more than 50%. It performs on par with state-of-the-art methods, while running at interactive frame rates. Moreover, we present faster variants that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.
Flow-guided synthesis provides a common framework for frame interpolation, where optical flow is typically estimated by a pyramid network, and then leveraged to guide a synthesis network to generate intermediate frames between input frames. In this paper, we present UPR-Net, a novel Unified Pyramid Recurrent Network for frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement for both optical flow and intermediate frame. In particular, we show that our iterative synthesis can significantly improve the robustness of frame interpolation on large motion cases. Despite being extremely lightweight (1.7M parameters), UPR-Net achieves excellent performance on a large range of benchmarks. Code will be available soon.
Unsupervised deep learning for optical flow computation has achieved promising results. Most existing deep-network-based methods rely on image brightness constancy and local smoothness constraints to train the networks. Their performance degrades in regions with repetitive textures or occlusions. In this paper, we propose Deep Epipolar Flow, an unsupervised optical flow method that incorporates global geometric constraints into network learning. In particular, we investigate multiple ways of enforcing the epipolar constraint in flow estimation. To alleviate the chicken-and-egg type of problem encountered in dynamic scenes where multiple motions may be present, we propose a low-rank constraint as well as a union-of-subspaces constraint for training. Experimental results on various benchmark datasets show that our method achieves performance competitive with supervised methods and outperforms state-of-the-art unsupervised deep-learning methods.
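To make the epipolar constraint concrete: a pixel x and its flow-displaced match x' should satisfy x'ᵀFx = 0 for the fundamental matrix F. One simple way to fold this into training, sketched here as an illustration rather than the paper's exact loss, is to penalize the squared epipolar residual over all pixels.

```python
import torch

def epipolar_loss(flow, Fmat):
    """flow: (B,2,H,W) forward flow; Fmat: (B,3,3) fundamental matrices
    (assumed given here; in practice they might be estimated, e.g. by a
    differentiable eight-point solve on confident matches)."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    ones = torch.ones(h, w)
    x1 = torch.stack((xs.float(), ys.float(), ones), 0).reshape(1, 3, -1)
    x1 = x1.expand(b, -1, -1)                        # homogeneous pixels (B,3,HW)
    x2 = x1.clone()
    x2[:, :2] = x2[:, :2] + flow.reshape(b, 2, -1)   # matches x' = x + flow
    lines = Fmat @ x1                                # epipolar lines l' = F x
    residual = (x2 * lines).sum(dim=1)               # x'^T F x, ideally zero
    return residual.pow(2).mean()
```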
Video restoration tasks, including super-resolution, deblurring, etc, are drawing increasing attention in the computer vision community. A challenging benchmark named REDS is released in the NTIRE19 Challenge. This new benchmark challenges existing methods from two aspects: (1) how to align multiple frames given large motions, and (2) how to effectively fuse different frames with diverse motion and blur. In this work, we propose a novel Video Restoration framework with Enhanced Deformable convolutions, termed EDVR, to address these challenges. First, to handle large motions, we devise a Pyramid, Cascading and Deformable (PCD) alignment module, in which frame alignment is done at the feature level using deformable convolutions in a coarse-to-fine manner. Second, we propose a Temporal and Spatial Attention (TSA) fusion module, in which attention is applied both temporally and spatially, so as to emphasize important features for subsequent restoration. Thanks to these modules, our EDVR wins first place and outperforms the second-place entry by a large margin in all four tracks of the NTIRE19 video restoration and enhancement challenges. EDVR also demonstrates superior performance to state-of-the-art published methods on video super-resolution and deblurring. The code is available at https://github.com/xinntao/EDVR.
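A single level of deformable feature alignment is easy to sketch with `torchvision.ops.DeformConv2d`; the real PCD module cascades this coarse-to-fine across a pyramid, so the block below is a simplified illustration, not EDVR's code.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformAlign(nn.Module):
    """Align a neighbour frame's features to the reference frame's features:
    offsets are predicted from both, then a deformable conv samples the
    neighbour at those offsets."""
    def __init__(self, c=64, k=3):
        super().__init__()
        self.offset_pred = nn.Conv2d(2 * c, 2 * k * k, 3, padding=1)
        self.dcn = DeformConv2d(c, c, k, padding=k // 2)

    def forward(self, feat_nbr, feat_ref):
        offsets = self.offset_pred(torch.cat((feat_nbr, feat_ref), dim=1))
        return self.dcn(feat_nbr, offsets)   # neighbour aligned to reference

ref, nbr = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(DeformAlign()(nbr, ref).shape)         # torch.Size([1, 64, 32, 32])
```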
We propose a convolutional neural network that jointly detects motion boundaries (MBs) and occlusion regions (Occs) in video, in both the forward and backward directions. Detection is difficult because optical flow is discontinuous along MBs and undefined in Occs, while many flow estimators assume smoothness and a flow that is defined everywhere. To reason in both time directions simultaneously, we directly warp the estimated maps between the two frames. Since appearance differences between frames often signal the proximity of MBs or Occs, we construct a cost block that records, for each feature in one frame, the lowest discrepancy with matching features within a search range. This cost block is two-dimensional, and is much cheaper than the four-dimensional cost volumes used in flow analysis. Cost-block features are computed by an encoder, and MB and Occ estimates are computed by a decoder. We found that arranging the decoder layers fine-to-coarse, rather than coarse-to-fine, improves performance. MONet outperforms the state of the art on all tasks of the Sintel and FlyingChairs benchmarks without any fine-tuning on them.
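The key economy here, a 2D cost block instead of a 4D cost volume, amounts to keeping only the minimum matching cost per pixel. A minimal sketch, using an L1 feature discrepancy as an assumed cost:

```python
import torch
import torch.nn.functional as F

def cost_block(f1, f2, max_disp=4):
    """f1, f2: (B,C,H,W). Returns (B,1,H,W): per-pixel minimum matching cost
    over a (2d+1)^2 search window; one value per pixel instead of a 4D volume."""
    b, c, h, w = f1.shape
    pad = F.pad(f2, [max_disp] * 4, value=1e9)   # out-of-frame shifts cost a lot
    best = None
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = pad[:, :, dy:dy + h, dx:dx + w]
            cost = (f1 - shifted).abs().mean(dim=1, keepdim=True)  # L1 discrepancy
            best = cost if best is None else torch.minimum(best, cost)
    return best   # high values hint at occlusion; sharp spatial changes hint at MBs

print(cost_block(torch.randn(1, 32, 48, 64), torch.randn(1, 32, 48, 64)).shape)
# torch.Size([1, 1, 48, 64])
```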
Given two consecutive frames, video interpolation aims at generating intermediate frame(s) to form both spatially and temporally coherent video sequences. While most existing methods focus on single-frame interpolation, we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, where the motion interpretation and occlusion reasoning are jointly modeled. We start by computing bi-directional optical flow between the input images using a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bi-directional optical flows. These approximate flows, however, only work well in locally smooth regions and produce artifacts around motion boundaries. To address this shortcoming, we employ another U-Net to refine the approximated flow and also predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated intermediate frame to avoid artifacts. Since none of our learned network parameters are time-dependent, our approach is able to produce as many intermediate frames as needed. To train our network, we use 1,132 240-fps video clips, containing 300K individual video frames. Experimental results on several datasets, predicting different numbers of interpolated frames, demonstrate that our approach performs consistently better than existing methods.
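The linear combination of bidirectional flows and the visibility-weighted fusion follow closed forms given in the paper; below is a compact PyTorch rendering of both. The flow-refinement U-Net and the visibility prediction itself are omitted, and `warp` is the bilinear warp from the PWC-Net sketch above.

```python
import torch

def approx_intermediate_flows(f01, f10, t):
    """Combine bidirectional flows F_0->1, F_1->0 (B,2,H,W) into approximate
    flows from intermediate time t back to frames 0 and 1 (valid where motion
    is locally smooth; a refinement network cleans up motion boundaries)."""
    ft0 = -(1.0 - t) * t * f01 + t * t * f10
    ft1 = (1.0 - t) ** 2 * f01 - t * (1.0 - t) * f10
    return ft0, ft1

def fuse_frames(i0, i1, ft0, ft1, v0, v1, t, eps=1e-6):
    """Visibility-weighted blend of the two backward-warped inputs: a pixel
    occluded in one frame (v ~ 0 there) is taken mostly from the other frame,
    which avoids ghosting artifacts."""
    w0, w1 = (1.0 - t) * v0, t * v1
    return (w0 * warp(i0, ft0) + w1 * warp(i1, ft1)) / (w0 + w1 + eps)
```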
Optical flow estimation is a classical yet challenging task in computer vision. One of the essential factors in accurately predicting optical flow is to alleviate occlusions between frames. However, it is still a thorny problem for current top-performing optical flow estimation methods due to insufficient local evidence to model occluded areas. In this paper, we propose the Super Kernel Flow Network (SKFlow), a CNN architecture to ameliorate the impacts of occlusions on optical flow estimation. SKFlow benefits from the super kernels which bring enlarged receptive fields to complement the absent matching information and recover the occluded motions. We present efficient super kernel designs by utilizing conical connections and hybrid depth-wise convolutions. Extensive experiments demonstrate the effectiveness of SKFlow on multiple benchmarks, especially in the occluded areas. Without pre-trained backbones on ImageNet and with a modest increase in computation, SKFlow achieves compelling performance and ranks $\textbf{1st}$ among currently published methods on the Sintel benchmark. On the challenging Sintel clean and final passes (test), SKFlow surpasses the best published result in the unmatched areas ($7.96$ and $12.50$) by $9.09\%$ and $7.92\%$. The code is available at \href{https://github.com/littlespray/SKFlow}{https://github.com/littlespray/SKFlow}.
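The exact super kernel layout is not given in the abstract; the sketch below, an assumption for illustration, shows the general ingredient it names: depth-wise convolutions of mixed (and large) kernel sizes, whose wide receptive field can gather matching evidence from beyond an occluded region, followed by a cheap point-wise mix and a residual connection.

```python
import torch
import torch.nn as nn

class HybridDWBlock(nn.Module):
    """Illustrative large-kernel block (not SKFlow's exact design): parallel
    depth-wise convs at two kernel sizes, point-wise channel mixing, residual."""
    def __init__(self, c=128):
        super().__init__()
        self.dw_small = nn.Conv2d(c, c, 7, padding=3, groups=c)   # local detail
        self.dw_large = nn.Conv2d(c, c, 15, padding=7, groups=c)  # wide context
        self.pw = nn.Conv2d(c, c, 1)                              # channel mix
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.pw(self.act(self.dw_small(x) + self.dw_large(x)))

print(HybridDWBlock()(torch.randn(1, 128, 32, 32)).shape)
# torch.Size([1, 128, 32, 32])
```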
In this paper, we propose USegScene, a framework for the unsupervised learning of depth, optical flow, and ego-motion from stereo camera images using convolutional neural networks. Our framework leverages semantic information to improve the regularization of depth and optical flow maps, multimodal fusion, and occlusion filling, considering dynamic rigid object motions as independent SE(3) transformations. Furthermore, complementary to pure photometric matching, we propose matching of semantic features, pixel-wise classes, and object instance boundaries between consecutive images. In contrast to previous methods, we propose a network architecture that jointly predicts all outputs using a shared encoder and allows passing information across task domains; for example, the prediction of optical flow can benefit from the prediction of depth. Furthermore, we explicitly learn depth and optical flow occlusion maps inside the network, which are leveraged to improve the predictions in those regions. We present results on the popular KITTI dataset and show that our approach outperforms other methods by a large margin.
The challenge of rendering high-frame-rate videos on low-compute devices can be addressed by periodically predicting future frames to enhance the user experience in virtual reality applications. This is studied through the problem of temporal view synthesis (TVS), where the goal is to predict the next frame of a video given the previous frames, together with the head poses of the previous and next frames. In this work, we consider TVS of dynamic scenes in which both the user and objects are moving. We design a framework that decouples the motion into user motion and object motion, which allows us to effectively use the available user motion while predicting the next frame. We predict the motion of objects by isolating and estimating the 3D object motion in past frames and then extrapolating it. We use multiplane images (MPI) as a 3D representation of the scene and model object motion as the 3D displacement between corresponding points in the MPI representation. To handle sparsity in the MPI while estimating the motion, we incorporate partial convolutions and masked correlation layers to align corresponding points. The predicted object motion is then integrated with the given user or camera motion to generate the next frame. Using a disocclusion infilling module, we synthesize the regions uncovered due to camera and object motion. We develop a new synthetic dataset for TVS of dynamic scenes, consisting of 800 videos at full HD resolution. We show through experiments on our dataset and the MPI Sintel dataset that our model outperforms all competing methods in the literature.
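For readers unfamiliar with multiplane images: an MPI is a stack of fronto-parallel RGBA planes, and a frame is rendered from it by compositing the planes with the standard over operator. A minimal sketch follows (view-dependent plane warping omitted).

```python
import torch

def composite_mpi(rgb, alpha):
    """rgb: (D,3,H,W) plane colors, alpha: (D,1,H,W) plane opacities,
    index 0 = plane nearest the camera. Returns the rendered (3,H,W) image."""
    out = torch.zeros_like(rgb[0])
    transmittance = torch.ones_like(alpha[0])
    for d in range(rgb.shape[0]):                   # composite front to back
        out = out + transmittance * alpha[d] * rgb[d]
        transmittance = transmittance * (1.0 - alpha[d])
    return out

rgb, alpha = torch.rand(32, 3, 60, 80), torch.rand(32, 1, 60, 80)
print(composite_mpi(rgb, alpha).shape)   # torch.Size([3, 60, 80])
```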
Optical flow, which computes the apparent motion from a pair of video frames, is a critical tool for scene motion estimation. The correlation volume is the central component of neural optical flow models. It estimates the pairwise matching costs between cross-frame features, and is then used to decode optical flow. However, the traditional correlation volume is frequently noisy, outlier-prone, and sensitive to motion blur. We observe that, although the recent RAFT algorithm also adopts the traditional correlation volume, its additional context encoder provides semantically representative features to the flow decoder, implicitly compensating for the deficiency of the correlation volume. However, the benefits of this context encoder have been barely discussed or exploited. In this paper, we first investigate the functionality of RAFT's context encoder, then propose a new Context Guided Correlation Volume (CGCV) via gating and lifting schemes. CGCV can be universally integrated with RAFT-based flow computation methods for enhanced performance, and is especially effective in the presence of motion blur, de-focus blur and atmospheric effects. By incorporating the proposed CGCV into the previous Global Motion Aggregation (GMA) method, at a minor cost of 0.5% extra parameters, the rank of GMA is lifted by 23 places on the KITTI 2015 Leader Board, and by 3 places on the Sintel Leader Board. Moreover, at a similar model size, our correlation volume achieves competitive or superior performance to state-of-the-art supervised models that employ Transformers or Graph Reasoning, as verified by extensive experiments.
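The gating and lifting schemes themselves are not detailed in the abstract; the sketch below illustrates only the simplest form of the stated idea, assumed for illustration: a sigmoid gate computed from context features down-weights unreliable correlation responses before they reach the flow decoder.

```python
import torch
import torch.nn as nn

class ContextGatedCorrelation(nn.Module):
    """Illustrative context gating (not the paper's exact CGCV design)."""
    def __init__(self, corr_ch=81, ctx_ch=128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(ctx_ch, corr_ch, 3, padding=1), nn.Sigmoid())

    def forward(self, corr_feat, context):
        """corr_feat: (B,corr_ch,H,W) looked-up correlation features;
        context: (B,ctx_ch,H,W) from the context encoder."""
        return corr_feat * self.gate(context)   # suppress noisy/blurred matches

corr = torch.randn(1, 81, 32, 40)
ctx = torch.randn(1, 128, 32, 40)
print(ContextGatedCorrelation()(corr, ctx).shape)  # torch.Size([1, 81, 32, 40])
```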
Although motion compensation greatly improves video deblurring quality, separately performing motion compensation and video deblurring incurs substantial computational overhead. This paper proposes a real-time video deblurring framework consisting of a lightweight multi-task unit that supports both video deblurring and motion compensation in an efficient way. The multi-task unit is specifically designed to handle the majority of both tasks with a single shared network, and consists of a multi-task detail network and simple networks for deblurring and motion compensation. The multi-task unit minimizes the cost of incorporating motion compensation into video deblurring and enables real-time deblurring. Moreover, by stacking multiple multi-task units, our framework provides flexible control over the trade-off between cost and deblurring quality. We experimentally validate the state-of-the-art deblurring quality of our approach, which runs much faster than previous methods, and demonstrate its real-time performance (30.99 dB @ 30 fps measured on the DVD dataset).