We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. Third, unlike FlowNet, the learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction of combining classical flow methods with deep learning.1 This, of course, has well-known limitations, which we discuss later.
translated by 谷歌翻译
We present a compact but effective CNN model for optical flow, called PWC-Net. PWC-Net has been designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume. Cast in a learnable feature pyramid, PWC-Net uses the current optical flow estimate to warp the CNN features of the second image. It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow. PWC-Net is 17 times smaller in size and easier to train than the recent FlowNet2 model. Moreover, it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks, running at about 35 fps on Sintel resolution (1024×436) images. Our models are available on https://github.com/NVlabs/PWC-Net.
translated by 谷歌翻译
Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations.Since existing ground truth datasets are not sufficiently large to train a CNN, we generate a synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.
translated by 谷歌翻译
The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second, we develop a stacked architecture that includes warping of the second image with intermediate optical flow. Third, we elaborate on small displacements by introducing a subnetwork specializing on small motions. FlowNet 2.0 is only marginally slower than the original FlowNet but decreases the estimation error by more than 50%. It performs on par with state-of-the-art methods, while running at interactive frame rates. Moreover, we present faster variants that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.
translated by 谷歌翻译
光学流量估计是视频分析领域的一个重要而有挑战性问题。卷积神经网络的不同语义级别/层的特征可以提供不同粒度的信息。为了利用如此灵活和全面的信息,我们提出了一个半监督的特征金字塔形相关和残余重建网络(FPCR-Net),用于框架对的光学流量估计。它由两个主要模块组成:金字塔相关映射和剩余重建。金字塔相关映射模块利用全局/本地补丁的多尺度相关性来通过聚合不同尺度的特征来形成多级成本卷。剩余重建模块旨在重建每个阶段中更精细的光学流的子带高频残差。基于金字塔相关映射,我们进一步提出了相关 - 扭曲 - 归一化(CWN)模块,以有效地利用相关性依赖性。实验结果表明,该方案在针对竞争基线方法的平均终点误差(AEE)方面,实现了最先进的性能,改善了0.80,1.15和0.10 - Flownet2,LiteFlowNet和PWC-Net Sintel DataSet的最终通过。
translated by 谷歌翻译
无监督的对光流计算的深度学习取得了令人鼓舞的结果。大多数现有的基于深网的方法都依赖图像亮度一致性和局部平滑度约束来训练网络。他们的性能在发生重复纹理或遮挡的区域降低。在本文中,我们提出了深层的外两极流,这是一种无监督的光流方法,将全局几何约束结合到网络学习中。特别是,我们研究了多种方式在流量估计中强制执行外两极约束。为了减轻在可能存在多个动作的动态场景中遇到的“鸡肉和蛋”类型的问题,我们提出了一个低级别的约束以及对培训的订婚结合的约束。各种基准测试数据集的实验结果表明,与监督方法相比,我们的方法实现了竞争性能,并且优于最先进的无监督深度学习方法。
translated by 谷歌翻译
We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for optical flow. RAFT extracts perpixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves stateof-the-art performance. On KITTI, RAFT achieves an F1-all error of 5.10%, a 16% error reduction from the best published result (6.10%). On Sintel (final pass), RAFT obtains an end-point-error of 2.855 pixels, a 30% error reduction from the best published result (4.098 pixels). In addition, RAFT has strong cross-dataset generalization as well as high efficiency in inference time, training speed, and parameter count. Code is available at https://github.com/princeton-vl/RAFT.
translated by 谷歌翻译
Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.
translated by 谷歌翻译
The quantitative evaluation of optical flow algorithms by Barron et al. (1994) led to significant advances in performance. The challenges for optical flow algorithms today go beyond the datasets and evaluation methods proposed in that paper. Instead, they center on problems associated with complex natural scenes, including nonrigid motion, real sensor noise, and motion discontinuities. We propose a new set of benchmarks and evaluation methods for the next generation of optical flow algorithms. To that end, we contribute four types of data to test different aspects of optical flow algorithms: (1) sequences with nonrigid motion where the ground-truth flow is determined by A preliminary version of this paper appeared in the IEEE International Conference on Computer Vision (Baker et al. 2007).
translated by 谷歌翻译
Given two consecutive frames, video interpolation aims at generating intermediate frame(s) to form both spatially and temporally coherent video sequences. While most existing methods focus on single-frame interpolation, we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, where the motion interpretation and occlusion reasoning are jointly modeled. We start by computing bi-directional optical flow between the input images using a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bi-directional optical flows. These approximate flows, however, only work well in locally smooth regions and produce artifacts around motion boundaries. To address this shortcoming, we employ another U-Net to refine the approximated flow and also predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated intermediate frame to avoid artifacts. Since none of our learned network parameters are time-dependent, our approach is able to produce as many intermediate frames as needed. To train our network, we use 1,132 240-fps video clips, containing 300K individual video frames. Experimental results on several datasets, predicting different numbers of interpolated frames, demonstrate that our approach performs consistently better than existing methods.
translated by 谷歌翻译
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
translated by 谷歌翻译
We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-ofthe-art.
translated by 谷歌翻译
We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
translated by 谷歌翻译
Many video enhancement algorithms rely on optical flow to register frames in a video sequence. Precise flow estimation is however intractable; and optical flow itself is often a sub-optimal representation for particular video processing tasks. In this paper, we propose task-oriented flow (TOFlow), a motion representation learned in a selfsupervised, task-specific manner. We design a neural network with a trainable motion estimation component and a video processing component, and train them jointly to learn the task-oriented flow. For evaluation, we build Vimeo-90K, a large-scale, high-quality video dataset for low-level video processing. TOFlow outperforms traditional optical flow on standard benchmarks as well as our Vimeo-90K dataset in three video processing tasks: frame interpolation, video denoising/deblocking, and video super-resolution. IntroductionMotion estimation is a key component in video processing tasks such as temporal frame interpolation, video denoising,
translated by 谷歌翻译
光流估计是自动驾驶和机器人系统系统中的一项基本任务,它可以在时间上解释流量场景。自动驾驶汽车显然受益于360 {\ deg}全景传感器提供的超宽视野(FOV)。但是,由于全景相机的独特成像过程,专为针孔图像设计的模型不会令人满意地概括为360 {\ deg}全景图像。在本文中,我们提出了一个新颖的网络框架 - panoflow,以学习全景图像的光流。为了克服全景转化中等应角投影引起的扭曲,我们设计了一种流动失真增强(FDA)方法,其中包含径向流量失真(FDA-R)或等骨流量失真(FDA-E)。我们进一步研究了全景视频的环状光流的定义和特性,并通过利用球形图像的环状来推断360 {\ deg}光流并将大型位移转换为相对小的位移,从而提出了环状流量估计(CFE)方法移位。 Panoflow适用于任何现有的流量估计方法,并从狭窄的FOL流量估计的进度中受益。此外,我们创建并释放基于CARLA的合成全景数据集Flow360,以促进训练和定量分析。 Panoflow在公共Omniflownet和已建立的Flow360基准中实现了最先进的表现。我们提出的方法将Flow360上的端点误差(EPE)降低了27.3%。在Omniflownet上,Panoflow获得了3.17像素的EPE,从最佳发布的结果中降低了55.5%的误差。我们还通过收集工具和公共现实世界中的全球数据集对我们的方法进行定性验证我们的方法,这表明对现实世界导航应用程序的强大潜力和稳健性。代码和数据集可在https://github.com/masterhow/panoflow上公开获取。
translated by 谷歌翻译
我们提出了一种称为基于DNN的基于DNN的框架,称为基于增强的相关匹配的视频帧插值网络,以支持4K的高分辨率,其具有大规模的运动和遮挡。考虑到根据分辨率的网络模型的可扩展性,所提出的方案采用经常性金字塔架构,该架构分享每个金字塔层之间的参数进行光学流量估计。在所提出的流程估计中,通过追踪具有最大相关性的位置来递归地改进光学流。基于前扭曲的相关匹配可以通过排除遮挡区域周围的错误扭曲特征来提高流量更新的准确性。基于最终双向流动,使用翘曲和混合网络合成任意时间位置的中间帧,通过细化网络进一步改善。实验结果表明,所提出的方案在4K视频数据和低分辨率基准数据集中占据了之前的工作,以及具有最小型号参数的客观和主观质量。
translated by 谷歌翻译
The paucity of videos in current action classification datasets (UCF-101 and HMDB-51) has made it difficult to identify good video architectures, as most methods obtain similar performance on existing small-scale benchmarks. This paper re-evaluates state-of-the-art architectures in light of the new Kinetics Human Action Video dataset. Kinetics has two orders of magnitude more data, with 400 human action classes and over 400 clips per class, and is collected from realistic, challenging YouTube videos. We provide an analysis on how current architectures fare on the task of action classification on this dataset and how much performance improves on the smaller benchmark datasets after pre-training on Kinetics.We also introduce a new Two-Stream Inflated 3D Con-vNet (I3D) that is based on 2D ConvNet inflation: filters and pooling kernels of very deep image classification ConvNets are expanded into 3D, making it possible to learn seamless spatio-temporal feature extractors from video while leveraging successful ImageNet architecture designs and even their parameters. We show that, after pre-training on Kinetics, I3D models considerably improve upon the state-of-the-art in action classification, reaching 80.9% on HMDB-51 and 98.0% on UCF-101.
translated by 谷歌翻译
培训细节和数据集对于筏等最新的光流模型有多重要?它们会概括吗?为了探索这些问题,而不是开发新的模型,我们将重新访问三个突出的模型,即PWC-NET,IRR-PWC和RAFT,并采用一组常见的现代培训技术和数据集,并观察到显着的性能增长,证明了重要性和普遍性这些培训细节。我们新训练的PWC-NET和IRR-PWC模型显示出惊人的改进,与Sintel和Kitti 2015 Benchmarks相比,最高30%的结果与原始发布的结果相比。他们的表现胜过2015年Kitti的最新流程1D,而推断过程中的速度快3倍。我们新训练的筏子在2015年的Kitti上获得了4.31%的成绩,比写作时所有已发表的光流方法更准确。我们的结果表明,分析光流方法的性能提高时,分离模型,训练技术和数据集的贡献的好处。我们的源代码将公开可用。
translated by 谷歌翻译
Ground truth optical flow is difficult to measure in real scenes with natural motion. As a result, optical flow data sets are restricted in terms of size, complexity, and diversity, making optical flow algorithms difficult to train and test on realistic data. We introduce a new optical flow data set derived from the open source 3D animated short film Sintel. This data set has important features not present in the popular Middlebury flow evaluation: long sequences, large motions, specular reflections, motion blur, defocus blur, and atmospheric effects. Because the graphics data that generated the movie is open source, we are able to render scenes under conditions of varying complexity to evaluate where existing flow algorithms fail. We evaluate several recent optical flow algorithms and find that current highly-ranked methods on the Middlebury evaluation have difficulty with this more complex data set suggesting further research on optical flow estimation is needed. To validate the use of synthetic data, we compare the image-and flow-statistics of Sintel to those of real films and videos and show that they are similar. The data set, metrics, and evaluation website are publicly available.
translated by 谷歌翻译
通常将视频中的跟踪像素作为光流估计问题进行研究,其中每个像素都用位移向量描述,该位移向量将其定位在下一帧中。即使可以免费获得更广泛的时间上下文,但要考虑到这一点的事先努力仅在2框方法上产生了少量收益。在本文中,我们重新访问Sand and Teller的“粒子视频”方法,并将像素跟踪作为远程运动估计问题,其中每个像素都用轨迹描述,该轨迹将其定位在以后的多个帧中。我们使用该组件重新构建了这种经典方法,这些组件可以驱动流量和对象跟踪中最新的最新方法,例如密集的成本图,迭代优化和学习的外观更新。我们使用从现有的光流数据中挖掘出的远程Amodal点轨迹来训练我们的模型,并通过多帧的遮挡合成增强,这些轨迹会增强。我们在轨迹估计基准和关键点标签传播任务中测试我们的方法,并与最新的光流和功能跟踪方法进行比较。
translated by 谷歌翻译