基于深度学习的无监督光流估计器由于对地面真理的成本和难度而引起了越来越多的关注。尽管多年来通过平均终点误差(EPE)衡量的性能有所提高,但沿运动边界(MBS)的流量估计仍然较差,而流动不平稳,通常假定的流动不平滑,而神经网络计算的功能为多个动作污染。为了改善无监督的设置中的流量,我们设计了一个框架,该框架通过分析沿边界候选者的视觉变化来检测MB,并用更远的动作取代接近检测的动作。我们提出的算法比具有相同输入的基线方法更准确地检测边界,并且可以改善任何流动预测变量的估计值,而无需额外的训练。
translated by 谷歌翻译
我们提出了一个卷积神经网络,该卷积神经网络共同检测了运动边界(MBS)和遮挡区域(OCC)的视频,两者在前向后和向后。检测很困难,因为光流沿MBS是不连续的,并且在OCC中未定义,而许多流量估计器假设光滑度和到处定义的流程。在同时在两个时间方向上推理,我们将估计的映射直接扭曲在两个框架之间。由于帧之间的外观经常在MBS或OV中信号附近,因此构造一个成本块,其为一帧中的每个特征记录在搜索范围内具有匹配的特征的最低差异。该成本块是二维的,并且比流动分析中使用的四维成本量便宜得多。成本块特征由编码器计算,MB和OCC估计由解码器计算。我们发现将解码器层布置精细到粗,而不是粗细,提高性能。 Monet以烧结和飞行电影基准测试的所有任务优于最先进的技术,而不会对它们进行任何微调。
translated by 谷歌翻译
Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations.Since existing ground truth datasets are not sufficiently large to train a CNN, we generate a synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.
translated by 谷歌翻译
We present a compact but effective CNN model for optical flow, called PWC-Net. PWC-Net has been designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume. Cast in a learnable feature pyramid, PWC-Net uses the current optical flow estimate to warp the CNN features of the second image. It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow. PWC-Net is 17 times smaller in size and easier to train than the recent FlowNet2 model. Moreover, it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks, running at about 35 fps on Sintel resolution (1024×436) images. Our models are available on https://github.com/NVlabs/PWC-Net.
translated by 谷歌翻译
与无监督培训相比,对光流预测因子的监督培训通常会产生更好的准确性。但是,改进的性能通常以较高的注释成本。半监督的培训与注释成本相比,准确性的准确性。我们使用一种简单而有效的半监督训练方法来表明,即使一小部分标签也可以通过无监督的训练来提高流量准确性。此外,我们提出了基于简单启发式方法的主动学习方法,以进一步减少实现相同目标准确性所需的标签数量。我们对合成和真实光流数据集的实验表明,我们的半监督网络通常需要大约50%的标签才能达到接近全标签的精度,而在Sintel上有效学习只有20%左右。我们还分析并展示了有关可能影响主动学习绩效的因素的见解。代码可在https://github.com/duke-vision/optical-flow-active-learning-release上找到。
translated by 谷歌翻译
我们提出了Tain(视频插值的变压器和注意力),这是一个用于视频插值的残留神经网络,旨在插入中间框架,并在其周围连续两个图像框架下进行插值。我们首先提出一个新型的视觉变压器模块,称为交叉相似性(CS),以与预测插值框架相似的外观相似的外观。然后,这些CS特征用于完善插值预测。为了说明CS功能中的遮挡,我们提出了一个图像注意(IA)模块,以使网络可以从另一个框架上关注CS功能。此外,我们还使用封闭式贴片来增强培训数据集,该补丁可以跨帧移动,以改善网络对遮挡和大型运动的稳健性。由于现有方法产生平滑的预测,尤其是在MB附近,因此我们根据图像梯度使用额外的训练损失来产生更清晰的预测。胜过不需要流量估计并与基于流程的方法相当执行的现有方法,同时在VIMEO90K,UCF101和SNU-FILM基准的推理时间上具有计算有效的效率。
translated by 谷歌翻译
无监督的对光流计算的深度学习取得了令人鼓舞的结果。大多数现有的基于深网的方法都依赖图像亮度一致性和局部平滑度约束来训练网络。他们的性能在发生重复纹理或遮挡的区域降低。在本文中,我们提出了深层的外两极流,这是一种无监督的光流方法,将全局几何约束结合到网络学习中。特别是,我们研究了多种方式在流量估计中强制执行外两极约束。为了减轻在可能存在多个动作的动态场景中遇到的“鸡肉和蛋”类型的问题,我们提出了一个低级别的约束以及对培训的订婚结合的约束。各种基准测试数据集的实验结果表明,与监督方法相比,我们的方法实现了竞争性能,并且优于最先进的无监督深度学习方法。
translated by 谷歌翻译
The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second, we develop a stacked architecture that includes warping of the second image with intermediate optical flow. Third, we elaborate on small displacements by introducing a subnetwork specializing on small motions. FlowNet 2.0 is only marginally slower than the original FlowNet but decreases the estimation error by more than 50%. It performs on par with state-of-the-art methods, while running at interactive frame rates. Moreover, we present faster variants that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.
translated by 谷歌翻译
Ground truth optical flow is difficult to measure in real scenes with natural motion. As a result, optical flow data sets are restricted in terms of size, complexity, and diversity, making optical flow algorithms difficult to train and test on realistic data. We introduce a new optical flow data set derived from the open source 3D animated short film Sintel. This data set has important features not present in the popular Middlebury flow evaluation: long sequences, large motions, specular reflections, motion blur, defocus blur, and atmospheric effects. Because the graphics data that generated the movie is open source, we are able to render scenes under conditions of varying complexity to evaluate where existing flow algorithms fail. We evaluate several recent optical flow algorithms and find that current highly-ranked methods on the Middlebury evaluation have difficulty with this more complex data set suggesting further research on optical flow estimation is needed. To validate the use of synthetic data, we compare the image-and flow-statistics of Sintel to those of real films and videos and show that they are similar. The data set, metrics, and evaluation website are publicly available.
translated by 谷歌翻译
在本文中,我们提出了USEGSCENE,该框架用于使用卷积神经网络对立体声相机图像的深度,光流和自我感动的无监督学习。我们的框架利用语义信息来改善深度和光流图的正则化,多模式融合和遮挡填充考虑动态刚性对象运动作为独立的SE(3)转换。此外,我们与纯照相匹配匹配互补,我们提出了连续图像之间语义特征,像素类别和对象实例边界的匹配。与以前的方法相反,我们提出了一个网络体系结构,该网络体系结构可以使用共享编码器共同预测所有输出,并允许在任务域上传递信息,例如,光流的预测可以从深度的预测中受益。此外,我们明确地了解网络内部的深度和光流遮挡图,这些图被利用,以改善这些区域的预测。我们在流行的Kitti数据集上介绍了结果,并表明我们的方法以大幅度的优于其他方法。
translated by 谷歌翻译
Given two consecutive frames, video interpolation aims at generating intermediate frame(s) to form both spatially and temporally coherent video sequences. While most existing methods focus on single-frame interpolation, we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, where the motion interpretation and occlusion reasoning are jointly modeled. We start by computing bi-directional optical flow between the input images using a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bi-directional optical flows. These approximate flows, however, only work well in locally smooth regions and produce artifacts around motion boundaries. To address this shortcoming, we employ another U-Net to refine the approximated flow and also predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated intermediate frame to avoid artifacts. Since none of our learned network parameters are time-dependent, our approach is able to produce as many intermediate frames as needed. To train our network, we use 1,132 240-fps video clips, containing 300K individual video frames. Experimental results on several datasets, predicting different numbers of interpolated frames, demonstrate that our approach performs consistently better than existing methods.
translated by 谷歌翻译
The quantitative evaluation of optical flow algorithms by Barron et al. (1994) led to significant advances in performance. The challenges for optical flow algorithms today go beyond the datasets and evaluation methods proposed in that paper. Instead, they center on problems associated with complex natural scenes, including nonrigid motion, real sensor noise, and motion discontinuities. We propose a new set of benchmarks and evaluation methods for the next generation of optical flow algorithms. To that end, we contribute four types of data to test different aspects of optical flow algorithms: (1) sequences with nonrigid motion where the ground-truth flow is determined by A preliminary version of this paper appeared in the IEEE International Conference on Computer Vision (Baker et al. 2007).
translated by 谷歌翻译
Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.
translated by 谷歌翻译
We address the problem of synthesizing new video frames in an existing video, either in-between existing frames (interpolation), or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that hallucinate pixel values directly often produce blurry results. We combine the advantages of these two methods by training a deep network that learns to synthesize video frames by flowing pixel values from existing ones, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient, and can be applied at any video resolution. We demonstrate that our method produces results that both quantitatively and qualitatively improve upon the state-ofthe-art.
translated by 谷歌翻译
We learn to compute optical flow by combining a classical spatial-pyramid formulation with deep learning. This estimates large motions in a coarse-to-fine approach by warping one image of a pair at each pyramid level by the current flow estimate and computing an update to the flow. Instead of the standard minimization of an objective function at each pyramid level, we train one deep network per level to compute the flow update. Unlike the recent FlowNet approach, the networks do not need to deal with large motions; these are dealt with by the pyramid. This has several advantages. First, our Spatial Pyramid Network (SPyNet) is much simpler and 96% smaller than FlowNet in terms of model parameters. This makes it more efficient and appropriate for embedded applications. Second, since the flow at each pyramid level is small (< 1 pixel), a convolutional approach applied to pairs of warped images is appropriate. Third, unlike FlowNet, the learned convolution filters appear similar to classical spatio-temporal filters, giving insight into the method and how to improve it. Our results are more accurate than FlowNet on most standard benchmarks, suggesting a new direction of combining classical flow methods with deep learning.1 This, of course, has well-known limitations, which we discuss later.
translated by 谷歌翻译
This paper proposes a novel model and dataset for 3D scene flow estimation with an application to autonomous driving. Taking advantage of the fact that outdoor scenes often decompose into a small number of independently moving objects, we represent each element in the scene by its rigid motion parameters and each superpixel by a 3D plane as well as an index to the corresponding object. This minimal representation increases robustness and leads to a discrete-continuous CRF where the data term decomposes into pairwise potentials between superpixels and objects. Moreover, our model intrinsically segments the scene into its constituting dynamic components. We demonstrate the performance of our model on existing benchmarks as well as a novel realistic dataset with scene flow ground truth. We obtain this dataset by annotating 400 dynamic scenes from the KITTI raw data collection using detailed 3D CAD models for all vehicles in motion. Our experiments also reveal novel challenges which cannot be handled by existing methods.
translated by 谷歌翻译
从视频中获得地面真相标签很具有挑战性,因为在像素流标签的手动注释非常昂贵且费力。此外,现有的方法试图将合成数据集的训练模型调整到真实的视频中,该视频不可避免地遭受了域差异并阻碍了现实世界应用程序的性能。为了解决这些问题,我们提出了RealFlow,这是一个基于期望最大化的框架,可以直接从任何未标记的现实视频中创建大规模的光流数据集。具体而言,我们首先估计一对视频帧之间的光流,然后根据预测流从该对中合成新图像。因此,新图像对及其相应的流可以被视为新的训练集。此外,我们设计了一种逼真的图像对渲染(RIPR)模块,该模块采用软磁性裂口和双向孔填充技术来减轻图像合成的伪像。在E-Step中,RIPR呈现新图像以创建大量培训数据。在M-Step中,我们利用生成的训练数据来训练光流网络,该数据可用于估计下一个E步骤中的光流。在迭代学习步骤中,流网络的能力逐渐提高,流量的准确性以及合成数据集的质量也是如此。实验结果表明,REALFLOW的表现优于先前的数据集生成方法。此外,基于生成的数据集,我们的方法与受监督和无监督的光流方法相比,在两个标准基准测试方面达到了最先进的性能。我们的代码和数据集可从https://github.com/megvii-research/realflow获得
translated by 谷歌翻译
本文介绍了一种新颖的体系结构,用于同时估算高度准确的光流和刚性场景转换,以实现困难的场景,在这种情况下,亮度假设因强烈的阴影变化而违反了亮度假设。如果是旋转物体或移动的光源(例如在黑暗中驾驶汽车遇到的光源),场景的外观通常从一个视图到下一个视图都发生了很大变化。不幸的是,用于计算光学流或姿势的标准方法是基于这样的期望,即场景中特征在视图之间保持恒定。在调查的情况下,这些方法可能经常失败。提出的方法通过组合图像,顶点和正常数据来融合纹理和几何信息,以计算照明不变的光流。通过使用粗到最新的策略,可以学习全球锚定的光流,从而减少了基于伪造的伪相应的影响。基于学习的光学流,提出了第二个体系结构,该体系结构可预测扭曲的顶点和正常地图的稳健刚性变换。特别注意具有强烈旋转的情况,这通常会导致这种阴影变化。因此,提出了一个三步程序,该程序可以利用正态和顶点之间的相关性。该方法已在新创建的数据集上进行了评估,该数据集包含具有强烈旋转和阴影效果的合成数据和真实数据。该数据代表了3D重建中的典型用例,其中该对象通常在部分重建之间以很大的步骤旋转。此外,我们将该方法应用于众所周知的Kitti Odometry数据集。即使由于实现了Brighness的假设,这不是该方法的典型用例,因此,还建立了对标准情况和与其他方法的关系的适用性。
translated by 谷歌翻译
结合同时定位和映射(SLAM)估计和动态场景建模可以高效地在动态环境中获得机器人自主权。机器人路径规划和障碍避免任务依赖于场景中动态对象运动的准确估计。本文介绍了VDO-SLAM,这是一种强大的视觉动态对象感知SLAM系统,用于利用语义信息,使得能够在场景中进行准确的运动估计和跟踪动态刚性物体,而无需任何先前的物体形状或几何模型的知识。所提出的方法识别和跟踪环境中的动态对象和静态结构,并将这些信息集成到统一的SLAM框架中。这导致机器人轨迹的高度准确估计和对象的全部SE(3)运动以及环境的时空地图。该系统能够从对象的SE(3)运动中提取线性速度估计,为复杂的动态环境中的导航提供重要功能。我们展示了所提出的系统对许多真实室内和室外数据集的性能,结果表明了对最先进的算法的一致和实质性的改进。可以使用源代码的开源版本。
translated by 谷歌翻译
在本文中,通过以自我监督的方式将基于几何的方法纳入深度学习架构来实现强大的视觉测量(VO)的基本问题。通常,基于纯几何的算法与特征点提取和匹配中的深度学习不那么稳健,但由于其成熟的几何理论,在自我运动估计中表现良好。在这项工作中,首先提出了一种新颖的光学流量网络(PANET)内置于位置感知机构。然后,提出了一种在没有典型网络的情况下共同估计深度,光学流动和自我运动来学习自我运动的新系统。所提出的系统的关键组件是一种改进的束调节模块,其包含多个采样,初始化的自我运动,动态阻尼因子调整和Jacobi矩阵加权。另外,新颖的相对光度损耗函数先进以提高深度估计精度。该实验表明,所提出的系统在基于基于基于基于基于基于基于基于学习的基于学习的方法之间的深度,流量和VO估计方面不仅优于其他最先进的方法,而且与几何形状相比,也显着提高了鲁棒性 - 基于,基于学习和混合VO系统。进一步的实验表明,我们的模型在挑战室内(TMU-RGBD)和室外(KAIST)场景中实现了出色的泛化能力和性能。
translated by 谷歌翻译