Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.
Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground truth datasets are not sufficiently large to train a CNN, we generate a synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.
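To make the correlation layer mentioned above concrete, the following minimal PyTorch sketch (our own illustration, not the FlowNetC release; the names `correlation` and `max_disp` are ours) compares every feature vector of the first map with feature vectors of the second map inside a square search window:

```python
# Minimal sketch of a correlation layer in the spirit of FlowNetC (hypothetical
# re-implementation, not the authors' code): for every displacement within a
# search range, correlate feature vectors of the two feature maps.
import torch
import torch.nn.functional as F

def correlation(feat1: torch.Tensor, feat2: torch.Tensor, max_disp: int = 4) -> torch.Tensor:
    """feat1, feat2: (B, C, H, W) feature maps; returns (B, (2*max_disp+1)**2, H, W)."""
    b, c, h, w = feat1.shape
    feat2 = F.pad(feat2, [max_disp] * 4)          # pad so shifted windows stay in bounds
    volumes = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = feat2[:, :, dy:dy + h, dx:dx + w]
            volumes.append((feat1 * shifted).mean(dim=1, keepdim=True))  # per-pixel dot product
    return torch.cat(volumes, dim=1)
```

With a search range of 4 this yields 81 correlation channels per position, which the downstream convolutions can interpret much like a matching cost.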
This paper proposes a novel model and dataset for 3D scene flow estimation with an application to autonomous driving. Taking advantage of the fact that outdoor scenes often decompose into a small number of independently moving objects, we represent each element in the scene by its rigid motion parameters and each superpixel by a 3D plane as well as an index to the corresponding object. This minimal representation increases robustness and leads to a discrete-continuous CRF where the data term decomposes into pairwise potentials between superpixels and objects. Moreover, our model intrinsically segments the scene into its constituting dynamic components. We demonstrate the performance of our model on existing benchmarks as well as a novel realistic dataset with scene flow ground truth. We obtain this dataset by annotating 400 dynamic scenes from the KITTI raw data collection using detailed 3D CAD models for all vehicles in motion. Our experiments also reveal novel challenges which cannot be handled by existing methods.
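To make the plane-plus-rigid-motion representation concrete, the image correspondences induced by one such scene element follow a standard plane-induced homography (textbook multi-view geometry; the notation below is ours, not taken from the paper):

$$
H \;=\; K\,\bigl(R + \mathbf{t}\,\mathbf{n}^{\top}\bigr)\,K^{-1},
\qquad
\mathbf{p}' \;\sim\; H\,\tilde{\mathbf{p}},
$$

where the superpixel's plane satisfies $\mathbf{n}^{\top}\mathbf{X} = 1$ for 3D points $\mathbf{X}$ on it, the associated object moves rigidly as $\mathbf{X}' = R\mathbf{X} + \mathbf{t}$, $K$ denotes the camera intrinsics, and the flow (or disparity, for a stereo pair) at pixel $\mathbf{p}$ is $\mathbf{p}' - \mathbf{p}$.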
This paper presents a novel architecture for the simultaneous estimation of highly accurate optical flow and rigid scene transformations in difficult scenarios where the brightness constancy assumption is violated by strong shading changes. In the case of rotating objects or moving light sources, such as those encountered when driving a car in the dark, the appearance of the scene often changes significantly from one view to the next. Unfortunately, standard methods for computing optical flow or pose are based on the expectation that scene features remain constant between views, and in the investigated situations they frequently fail. The proposed method fuses texture and geometry information by combining image, vertex, and normal data to compute an illumination-invariant optical flow. Using a coarse-to-fine strategy, a globally anchored optical flow is learned, reducing the impact of erroneous shading-based pseudo-correspondences. Based on the learned optical flow, a second architecture is proposed that predicts robust rigid transformations from the warped vertex and normal maps. Particular attention is paid to scenarios with strong rotations, which usually cause such shading changes; accordingly, a three-step procedure is proposed that exploits the correlation between normals and vertices. The method is evaluated on a newly created dataset containing both synthetic and real data with strong rotations and shading effects. These data represent a typical use case in 3D reconstruction, where an object often rotates in large steps between partial reconstructions. Additionally, we apply the method to the well-known KITTI odometry dataset. Even though this is not a typical use case for the method, since the brightness assumption largely holds there, the applicability to standard situations and the relation to other methods is thereby established.
We introduce the problem of predicting, from a single video frame, a low-dimensional subspace of optical flows that includes the actual instantaneous optical flow. We show how several natural scene assumptions allow the appropriate flow subspace to be identified via a set of basis flow fields parameterized by disparity and a representation of object instances. The flow subspace, together with a novel loss function, can be used for the tasks of predicting monocular depth or predicting depth plus an object instance embedding. This provides a new approach to learning these tasks in an unsupervised fashion using monocular input video, without requiring camera intrinsics or pose.
We present a compact but effective CNN model for optical flow, called PWC-Net. PWC-Net has been designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume. Cast in a learnable feature pyramid, PWC-Net uses the current optical flow estimate to warp the CNN features of the second image. It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow. PWC-Net is 17 times smaller in size and easier to train than the recent FlowNet2 model. Moreover, it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks, running at about 35 fps on Sintel resolution (1024×436) images. Our models are available on https://github.com/NVlabs/PWC-Net.
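The warping step described above can be sketched as follows; this is an assumed, simplified re-implementation in PyTorch rather than the released NVlabs code (the helper name `warp` and the normalization details are ours):

```python
# Hypothetical sketch of the PWC-Net warping step: features of the second image
# are backward-warped with the current flow estimate before the cost volume is built.
import torch
import torch.nn.functional as F

def warp(feat2: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat2: (B, C, H, W); flow: (B, 2, H, W) in pixels; returns feat2 warped toward image 1."""
    b, _, h, w = feat2.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat2.device)      # (2, H, W), x first
    coords = grid.unsqueeze(0) + flow                                  # follow the flow
    # normalize sampling coordinates to [-1, 1] as required by grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)              # (B, H, W, 2)
    return F.grid_sample(feat2, grid_norm, align_corners=True)
```

At each pyramid level, the warped features and the features of the first image would feed a cost volume, from which a small CNN regresses the flow refinement.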
The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second, we develop a stacked architecture that includes warping of the second image with intermediate optical flow. Third, we elaborate on small displacements by introducing a subnetwork specializing on small motions. FlowNet 2.0 is only marginally slower than the original FlowNet but decreases the estimation error by more than 50%. It performs on par with state-of-the-art methods, while running at interactive frame rates. Moreover, we present faster variants that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.
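The stacking idea can be illustrated with placeholder modules; everything below (`TinyFlowNet`, the channel counts, and the residual formulation of the second stage) is a hedged sketch rather than the FlowNet 2.0 architecture, and `warp_fn` stands for a backward-warping routine such as the one sketched for PWC-Net above:

```python
# Sketch of stacked flow estimation: a second network refines the flow of a first
# network and is fed the warped second image plus the brightness error as evidence.
import torch
import torch.nn as nn

class TinyFlowNet(nn.Module):
    """Stand-in for a flow network; in_ch varies with what is stacked on the input."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

def stacked_flow(img1, img2, warp_fn):
    net1 = TinyFlowNet(in_ch=6)                     # two RGB images
    net2 = TinyFlowNet(in_ch=12)                    # images + warped img2 + flow + error
    flow1 = net1(torch.cat([img1, img2], dim=1))
    img2_w = warp_fn(img2, flow1)                   # warp image 2 with the intermediate flow
    err = (img1 - img2_w).abs().sum(dim=1, keepdim=True)
    flow2 = net2(torch.cat([img1, img2, img2_w, flow1, err], dim=1))
    return flow1 + flow2                            # here the second stage predicts a residual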
In this paper, we present USegScene, a framework for the unsupervised learning of depth, optical flow, and ego-motion from stereo camera images using convolutional neural networks. Our framework leverages semantic information to improve the regularization of the depth and optical flow maps, multimodal fusion, and occlusion filling, and treats dynamic rigid object motions as independent SE(3) transformations. Furthermore, complementary to pure photometric matching, we propose matching of semantic features, pixel-wise classes, and object instance borders between consecutive images. In contrast to previous methods, we propose a network architecture that jointly predicts all outputs using a shared encoder and allows information to be passed across task domains; for example, the prediction of optical flow can benefit from the prediction of depth. Furthermore, we explicitly learn depth and optical flow occlusion maps inside the network, which are exploited to improve the predictions in those regions. We present results on the popular KITTI dataset and show that our approach outperforms other methods by a large margin.
Ground truth optical flow is difficult to measure in real scenes with natural motion. As a result, optical flow data sets are restricted in terms of size, complexity, and diversity, making optical flow algorithms difficult to train and test on realistic data. We introduce a new optical flow data set derived from the open source 3D animated short film Sintel. This data set has important features not present in the popular Middlebury flow evaluation: long sequences, large motions, specular reflections, motion blur, defocus blur, and atmospheric effects. Because the graphics data that generated the movie is open source, we are able to render scenes under conditions of varying complexity to evaluate where existing flow algorithms fail. We evaluate several recent optical flow algorithms and find that current highly-ranked methods on the Middlebury evaluation have difficulty with this more complex data set, suggesting that further research on optical flow estimation is needed. To validate the use of synthetic data, we compare the image and flow statistics of Sintel to those of real films and videos and show that they are similar. The data set, metrics, and evaluation website are publicly available.
In stereo vision, self-similar or bland regions can make it difficult to match patches between two images. Active-stereo-based methods mitigate this problem by projecting a pseudo-random pattern onto the scene so that each patch of an image pair can be identified without ambiguity. However, the projected pattern significantly alters the appearance of the images. If this pattern acts as a form of adversarial noise, it could negatively impact the performance of deep-learning-based methods, which are now the de facto standard for dense stereo vision. In this paper, we propose the Active-Passive SimStereo dataset and a corresponding benchmark to evaluate the performance gap between passive and active stereo images for stereo matching algorithms. Using the proposed benchmark and an additional ablation study, we show that the feature extraction and matching modules of a selection of twenty deep-learning-based stereo matching methods generalize to active stereo without any problem. However, the disparity refinement modules of three of the twenty architectures (ACVNet, CascadeStereo, and StereoNet) are negatively affected by the active stereo pattern due to their reliance on the appearance of the input images.
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state-of-the-art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
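The left-right consistency idea can be sketched as follows (our notation and helper names, not the authors' code; the sign of the disparity shift depends on the rectification convention):

```python
# Illustrative sketch of left-right disparity consistency: the right disparity map,
# resampled into the left view using the left disparity, should agree with the left
# disparity itself.
import torch
import torch.nn.functional as F

def warp_horizontal(img: torch.Tensor, disp: torch.Tensor) -> torch.Tensor:
    """Backward-warp `img` (B, C, H, W) along x by `disp` (B, 1, H, W), in pixels."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.float().to(img.device).unsqueeze(0) - disp[:, 0]        # shift sampling horizontally
    ys = ys.float().to(img.device).unsqueeze(0).expand_as(xs)
    grid = torch.stack((2 * xs / max(w - 1, 1) - 1, 2 * ys / max(h - 1, 1) - 1), dim=-1)
    return F.grid_sample(img, grid, align_corners=True)

def lr_consistency_loss(disp_left: torch.Tensor, disp_right: torch.Tensor) -> torch.Tensor:
    disp_right_in_left = warp_horizontal(disp_right, disp_left)
    return (disp_left - disp_right_in_left).abs().mean()
```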
Supervised learning methods for depth estimation can achieve good performance when trained on high-quality ground truth such as LiDAR data. However, LiDAR only produces sparse 3D maps, which causes a loss of information, and high-quality per-pixel ground-truth depth is difficult to acquire. To overcome this limitation, we propose a novel approach that combines structure information from a promising plane-and-parallax geometry pipeline with depth information in a U-Net supervised learning network, leading to quantitative and qualitative improvements over existing popular learning-based methods. In particular, the model is evaluated on two large-scale and challenging datasets, the KITTI Vision Benchmark and the Cityscapes dataset, and achieves the best performance in terms of relative error. Compared with a purely depth-supervised model, our model shows impressive performance in depth prediction on thin objects and edges, and compared to a structure-prediction baseline, our model performs more robustly.
The challenge of rendering high-frame-rate video on low-compute devices can be addressed by periodically predicting future frames, thereby enhancing the user experience in virtual reality applications. This is studied through the problem of temporal view synthesis (TVS), where the goal is to predict the next frame of a video given the previous frames and the head poses of the previous and next frames. In this work, we consider TVS of dynamic scenes in which both the user and the objects are moving. We design a framework that disentangles the motion into user and object motion in order to effectively use the available user motion while predicting the next frame. We predict the motion of objects by isolating and estimating the 3D object motion in past frames and then extrapolating it. We use multi-plane images (MPI) as the 3D representation of the scene and model object motion as the 3D displacement between corresponding points in the MPI representation. To handle the sparsity of the MPI while estimating motion, we incorporate partial convolutions and masked correlation layers to find corresponding points. The predicted object motion is then integrated with the given user or camera motion to generate the next frame. Using a disocclusion infilling module, we synthesize the regions uncovered due to camera and object motion. We develop a new synthetic dataset for TVS of dynamic scenes consisting of 800 videos at full-HD resolution. We show through experiments on our dataset and the MPI Sintel dataset that our model outperforms all competing methods in the literature.
The quantitative evaluation of optical flow algorithms by Barron et al. (1994) led to significant advances in performance. The challenges for optical flow algorithms today go beyond the datasets and evaluation methods proposed in that paper. Instead, they center on problems associated with complex natural scenes, including nonrigid motion, real sensor noise, and motion discontinuities. We propose a new set of benchmarks and evaluation methods for the next generation of optical flow algorithms. To that end, we contribute four types of data to test different aspects of optical flow algorithms, including sequences with nonrigid motion where the ground-truth flow is determined by tracking hidden fluorescent texture. A preliminary version of this paper appeared in the IEEE International Conference on Computer Vision (Baker et al. 2007).
A significant weakness of most current deep Convolutional Neural Networks is the need to train them using vast amounts of manually labelled data. In this work we propose an unsupervised framework to learn a deep convolutional neural network for single view depth prediction, without requiring a pre-training stage or annotated ground-truth depths. We achieve this by training the network in a manner analogous to an autoencoder. At training time we consider a pair of images, source and target, with small, known camera motion between the two, such as a stereo pair. We train the convolutional encoder for the task of predicting the depth map for the source image. To do so, we explicitly generate an inverse warp of the target image using the predicted depth and known inter-view displacement, to reconstruct the source image; the photometric error in the reconstruction is the reconstruction loss for the encoder. The acquisition of this training data is considerably simpler than for equivalent systems, requiring no manual annotation, nor calibration of depth sensor to camera. We show that our network trained on less than half of the KITTI dataset gives comparable performance to that of the state-of-the-art supervised methods for single view depth estimation. The model and further information can be found on the project GitHub page: https://github.com/Ravi-Garg/Unsupervised_Depth_Estimation
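A minimal sketch of this training signal, assuming rectified stereo and reusing the `warp_horizontal` helper sketched above (the focal length and baseline values are illustrative, roughly KITTI-like, and the L1 photometric error stands in for the paper's loss):

```python
# Hedged sketch of the autoencoder-style training signal: predicted depth is
# converted to a horizontal disparity via the known baseline and focal length,
# the target view is inverse-warped, and the photometric error supervises depth.
import torch

def photometric_depth_loss(src_img: torch.Tensor, tgt_img: torch.Tensor,
                           pred_depth: torch.Tensor,
                           focal_px: float = 720.0, baseline_m: float = 0.54) -> torch.Tensor:
    """src_img, tgt_img: (B, 3, H, W); pred_depth: (B, 1, H, W) in metres."""
    disparity = focal_px * baseline_m / pred_depth.clamp(min=1e-3)   # d = f * B / z
    reconstruction = warp_horizontal(tgt_img, disparity)             # target resampled into source view
    return (src_img - reconstruction).abs().mean()                   # reconstruction loss
```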
Estimating the distance to objects is crucial for autonomous vehicles when depth sensors cannot be used. In that case, the distance has to be estimated from on-board mounted RGB cameras, which is a complex task especially in environments such as natural outdoor landscapes. In this paper, we present a new method named M4Depth for depth estimation. First, we establish a bijective relationship between the depth and the visual disparity of two consecutive frames and show how to exploit it to perform motion-invariant pixel-wise depth estimation. Then, we detail M4Depth, which is based on a pyramidal convolutional neural network architecture in which each level refines an input disparity map estimate by using two customized cost volumes. We use these cost volumes to leverage the visual spatio-temporal constraints imposed by motion and to increase the robustness of the network to varied scenes. We benchmarked our approach, both in test and generalization modes, on public datasets featuring synthetic camera trajectories recorded in a variety of outdoor scenes. The results show that our network outperforms the state of the art on these datasets, while also performing well on a standard depth estimation benchmark. The code of our method is publicly available at https://github.com/michael-fonder/m4depth.
Temporally consistent depth estimation is crucial for real-time applications such as augmented reality. While stereo depth estimation has received substantial attention that has led to improvements on a frame-by-frame basis, relatively little work has focused on temporal consistency across frames. Indeed, based on our analysis, current stereo depth estimation techniques still suffer from poor temporal consistency. Stabilizing depth in dynamic scenes is challenging due to concurrent object and camera motion. In an online setting, this problem is further aggravated because only past frames are available. In this paper, we present a technique to produce temporally consistent depth estimates in dynamic scenes in an online setting. Our network augments a current per-frame stereo network with novel motion and fusion networks. The motion network accounts for both object and camera motion by predicting a per-pixel SE(3) transformation. The fusion network improves consistency of the predictions by aggregating the current and previous predictions with regressed weights. We conduct extensive experiments across varied datasets (synthetic, outdoor, indoor, and medical). In both zero-shot generalization and domain fine-tuning, we demonstrate that our proposed approach outperforms competing methods in terms of temporal stability and per-frame accuracy, both quantitatively and qualitatively. Our code will be available online.
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
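The two losses named in (i) and (iii) can be sketched as follows; the shapes, names, and the plain L1 photometric error are assumptions made for brevity rather than the released Monodepth2 code:

```python
# Sketch of the per-pixel minimum reprojection loss over source frames, plus
# auto-masking of pixels whose unwarped photometric error is already lower than
# the reprojected one (e.g. static scenes or objects moving with the camera).
import torch

def photometric_error(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # The paper combines SSIM and L1; plain L1 keeps this sketch short.
    return (a - b).abs().mean(dim=1, keepdim=True)

def monodepth2_loss(target, warped_sources, raw_sources):
    """target: (B,3,H,W); warped_sources/raw_sources: lists of (B,3,H,W) source frames."""
    reproj = torch.cat([photometric_error(target, w) for w in warped_sources], dim=1)
    identity = torch.cat([photometric_error(target, s) for s in raw_sources], dim=1)
    min_reproj, _ = reproj.min(dim=1, keepdim=True)          # per-pixel minimum over sources
    min_identity, _ = identity.min(dim=1, keepdim=True)
    auto_mask = (min_reproj < min_identity).float()          # ignore pixels that do not move
    return (auto_mask * min_reproj).sum() / auto_mask.sum().clamp(min=1.0)
```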
We present a novel high-resolution and challenging stereo dataset framing indoor scenes annotated with dense and accurate ground-truth disparities. Peculiar to our dataset is the presence of several specular and transparent surfaces, i.e. the main causes of failure for state-of-the-art stereo networks. Our acquisition pipeline leverages a novel deep space-time stereo framework that allows for easy and accurate labeling with sub-pixel precision. In total, we release 419 samples collected in 64 different scenes and annotated with dense ground-truth disparities. Each sample includes a high-resolution pair (12 Mpx) as well as an unbalanced pair (left: 12 Mpx, right: 1.1 Mpx). Additionally, we provide manually annotated material segmentation masks and 15K unlabeled samples. We evaluate state-of-the-art deep networks on our dataset, highlighting their limitations in addressing the open challenges in stereo and drawing hints for future research.
Recent work has shown that depth estimation from a stereo pair of images can be formulated as a supervised learning task to be resolved with convolutional neural networks (CNNs). However, current architectures rely on patch-based Siamese networks, lacking the means to exploit context information for finding correspondence in ill-posed regions. To tackle this problem, we propose PSMNet, a pyramid stereo matching network consisting of two main modules: spatial pyramid pooling and 3D CNN. The spatial pyramid pooling module takes advantage of the capacity of global context information by aggregating context in different scales and locations to form a cost volume. The 3D CNN learns to regularize the cost volume using stacked multiple hourglass networks in conjunction with intermediate supervision. The proposed approach was evaluated on several benchmark datasets. Our method ranked first in the KITTI 2012 and 2015 leaderboards before March 18, 2018. The code of PSMNet is available at: https://github.com/JiaRenChang/PSMNet.
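A hedged sketch of the concatenation-based cost volume that feeds the 3D CNN (our own simplified code, not the released PSMNet implementation; the channel counts in the 3D convolution example are assumed):

```python
# Hypothetical sketch of a concatenation-based cost volume: left features are stacked
# with right features shifted by each candidate disparity, giving a 4D volume that
# 3D convolutions can regularize.
import torch
import torch.nn as nn

def build_cost_volume(feat_l: torch.Tensor, feat_r: torch.Tensor, max_disp: int) -> torch.Tensor:
    """feat_l, feat_r: (B, C, H, W) -> cost volume (B, 2C, max_disp, H, W)."""
    b, c, h, w = feat_l.shape
    volume = feat_l.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            volume[:, :c, d] = feat_l
            volume[:, c:, d] = feat_r
        else:
            volume[:, :c, d, :, d:] = feat_l[:, :, :, d:]
            volume[:, c:, d, :, d:] = feat_r[:, :, :, :-d]
    return volume

# The volume is then regularized with stacked 3D convolutions, e.g. (assuming C = 32):
regularizer = nn.Sequential(nn.Conv3d(64, 32, 3, padding=1), nn.ReLU(),
                            nn.Conv3d(32, 1, 3, padding=1))
```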