We present a novel high-resolution and challenging stereo dataset framing indoor scenes, annotated with dense and accurate ground-truth disparities. Peculiar to our dataset is the presence of several specular and transparent surfaces, i.e. the main causes of failure for state-of-the-art stereo networks. Our acquisition pipeline leverages a novel deep space-time stereo framework that enables easy and accurate labeling with sub-pixel precision. We release a total of 419 samples collected in 64 different scenes and annotated with dense ground-truth disparities. Each sample includes a high-resolution pair (12 Mpx) as well as an unbalanced pair (left: 12 Mpx, right: 1.1 Mpx). Additionally, we provide manually annotated material segmentation masks and 15K unlabeled samples. We evaluate state-of-the-art deep networks on our dataset, highlighting their limitations in addressing the open challenges of stereo and drawing hints for future research.
We tackle the problem of registering synchronized color (RGB) and multi-spectral (MS) images by solving for stereo matching correspondences. Purposely, we introduce a novel RGB-MS dataset framing 13 different scenes in indoor environments and providing 34 image pairs annotated with semi-dense, high-resolution ground-truth labels in the form of disparity maps. To tackle the task, we propose a deep learning architecture trained in a self-supervised manner by exploiting a further RGB camera, required only during training data acquisition. In this setup, we can conveniently learn cross-modal matching by distilling knowledge from the easier RGB-RGB matching task over a collection of about 11K unlabeled image triplets. Experiments show that the proposed pipeline sets a good performance bar (1.16 pixels average registration error) for future research on this novel, challenging task.
Modern computer vision has moved beyond the domain of internet photo collections and into the physical world, guiding camera-equipped robots and autonomous cars through unstructured environments. To enable these embodied agents to interact with real-world objects, cameras are increasingly being used as depth sensors, reconstructing the environment for a variety of downstream reasoning tasks. Machine-learning-aided depth perception, or depth estimation, predicts the distance to the scene for each pixel in an image. While impressive progress has been made in depth estimation, significant challenges remain: (1) ground-truth depth labels are difficult to collect at scale, (2) camera information is typically assumed to be known but is often unreliable, and (3) restrictive camera assumptions are common, even though a great variety of camera types and lenses are used in practice. In this thesis, we focus on relaxing these assumptions and describe contributions toward the ultimate goal of turning cameras into truly generic depth sensors.
In stereo vision, self-similar or bland regions can make it difficult to match patches between two images. Active-stereo-based methods mitigate this problem by projecting a pseudo-random pattern onto the scene so that each patch of an image pair can be identified without ambiguity. However, the projected pattern significantly alters the appearance of the images. If this pattern acts as a form of adversarial noise, it could negatively impact the performance of deep-learning-based methods, which are now the de facto standard for dense stereo vision. In this paper, we propose the Active-Passive SimStereo dataset and a corresponding benchmark to evaluate the performance gap between passive and active stereo images for stereo matching algorithms. Using the proposed benchmark and an additional ablation study, we show that the feature extraction and matching modules of a selection of twenty deep-learning-based stereo matching methods generalize to active stereo without issue. However, the disparity refinement modules of three of the twenty architectures (ACVNet, CascadeStereo, and StereoNet) are negatively affected by the active stereo patterns due to their dependence on the appearance of the input images.
Many mobile manufacturers have recently adopted dual-pixel (DP) sensors in their flagship models for faster auto-focus and aesthetic image capture. Despite their advantages, the use of DP sensors for 3D facial understanding has been under-explored due to the lack of datasets and algorithm designs that exploit the parallax in DP images. This is because the baseline of the sub-aperture images is extremely narrow and parallax exists in the defocus-blurred regions. In this paper, we introduce a DP-oriented depth/normal network that reconstructs 3D facial geometry. For this purpose, we collected a DP facial dataset of more than 135K images of 101 persons, captured with our multi-camera structured-light system. It contains the corresponding ground-truth 3D models, including metric-scale depth maps and surface normals. Our dataset allows the proposed matching network to generalize for 3D facial depth/normal estimation. The proposed network consists of two novel modules: an adaptive sampling module and an adaptive normal module, which are specialized in handling the defocus blur in DP images. Finally, the proposed method achieves state-of-the-art performance over recent DP-based depth/normal estimation methods. We also demonstrate the applicability of the estimated depth/normals to face spoofing and relighting.
Stereo matching is an important task in computer vision that has attracted tremendous research attention for decades. However, public stereo datasets have difficulty meeting the requirements of models in terms of disparity accuracy, density, and data size. In this paper, we aim to close the gap between datasets and models, and propose a large-scale stereo dataset with high-accuracy disparity ground truth, named PlantStereo. We used a semi-automatic way to construct the dataset: after camera calibration and image registration, high-accuracy disparity images can be obtained from the depth images. In total, 812 image pairs are collected covering a diverse set of plants: spinach, tomato, pepper, and pumpkin. We first evaluated our PlantStereo dataset with four different stereo matching methods. Extensive experiments on different models and plants show that, compared with ground truth at integer accuracy, the high-accuracy disparity images provided by PlantStereo can remarkably improve the training results of deep learning models. This paper provides a feasible and reliable method for dense plant surface reconstruction. The PlantStereo dataset and the related code are available at: https://www.github.com/wangqingyu985/plantstereo
A significant weakness of most current deep Convolutional Neural Networks is the need to train them using vast amounts of manually labelled data. In this work we propose an unsupervised framework to learn a deep convolutional neural network for single view depth prediction, without requiring a pre-training stage or annotated ground-truth depths. We achieve this by training the network in a manner analogous to an autoencoder. At training time we consider a pair of images, source and target, with small, known camera motion between the two such as a stereo pair. We train the convolutional encoder for the task of predicting the depth map for the source image. To do so, we explicitly generate an inverse warp of the target image using the predicted depth and known inter-view displacement, to reconstruct the source image; the photometric error in the reconstruction is the reconstruction loss for the encoder. The acquisition of this training data is considerably simpler than for equivalent systems, requiring no manual annotation, nor calibration of depth sensor to camera. We show that our network trained on less than half of the KITTI dataset gives comparable performance to that of the state-of-the-art supervised methods for single view depth estimation. The model and other information can be found on the project GitHub page: https://github.com/Ravi-Garg/Unsupervised_Depth_Estimation
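To make the training signal concrete, here is a minimal NumPy sketch of the inverse warp and photometric reconstruction loss described above, assuming a rectified grayscale stereo pair; the function name and interface are hypothetical, not the authors' code:

```python
import numpy as np

def photometric_reconstruction_loss(left, right, depth, focal, baseline):
    """Warp the right image into the left view using the predicted depth,
    then measure the L1 photometric error against the observed left image.
    left, right: (H, W) float images; depth: (H, W) predicted depth;
    focal, baseline: rectified focal length (px) and baseline (m)."""
    H, W = left.shape
    disparity = focal * baseline / np.maximum(depth, 1e-6)  # d = f*b/Z
    xs = np.arange(W, dtype=np.float32)[None, :].repeat(H, axis=0)
    src_x = xs - disparity            # left pixel x corresponds to right x - d
    x0 = np.clip(np.floor(src_x).astype(int), 0, W - 2)
    w = np.clip(src_x, 0, W - 1) - x0             # linear interpolation weight
    rows = np.arange(H)[:, None].repeat(W, axis=1)
    recon = (1 - w) * right[rows, x0] + w * right[rows, x0 + 1]
    valid = (src_x >= 0) & (src_x <= W - 1)       # drop out-of-view pixels
    return np.abs(recon - left)[valid].mean()
```

In the full method this scalar error is backpropagated through the depth prediction, so no ground-truth depth is ever needed.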
Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.
The success of monocular depth estimation relies on large and diverse training sets. Due to the challenges associated with acquiring dense ground-truth depth across different environments at scale, a number of datasets with distinct characteristics and biases have emerged. We develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible. In particular, we propose a robust training objective that is invariant to changes in depth range and scale, advocate the use of principled multi-objective learning to combine data from different sources, and highlight the importance of pretraining encoders on auxiliary tasks. Armed with these tools, we experiment with five diverse training datasets, including a new, massive data source: 3D films. To demonstrate the generalization power of our approach we use zero-shot cross-dataset transfer, i.e. we evaluate on datasets that were not seen during training. The experiments confirm that mixing data from complementary sources greatly improves monocular depth estimation. Our approach clearly outperforms competing methods across diverse datasets, setting a new state of the art for monocular depth estimation.
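The key ingredient that makes incompatible annotations mixable is the scale- and shift-invariant objective. A simplified sketch of the idea (the paper uses a robust, trimmed variant, and the names here are assumptions):

```python
import numpy as np

def scale_shift_invariant_loss(pred, gt, valid):
    """Align the prediction to the ground truth with a least-squares scale s
    and shift t before measuring the error, making the objective invariant
    to the unknown depth range and scale of each training source.
    pred, gt: (H, W) maps; valid: boolean mask of annotated pixels."""
    p, g = pred[valid], gt[valid]
    A = np.stack([p, np.ones_like(p)], axis=1)    # minimize ||s*p + t - g||^2
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return np.abs(s * p + t - g).mean()           # error after alignment
```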
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state of the art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
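The left-right consistency term can be sketched as follows, assuming the network predicts disparities for both rectified views; the names are hypothetical, and the actual objective also includes reconstruction and smoothness terms:

```python
import numpy as np

def left_right_consistency_loss(disp_left, disp_right):
    """The left disparity at pixel x should agree with the right disparity
    sampled at the corresponding location x - disp_left(x).
    disp_left, disp_right: (H, W) disparity maps for the two views."""
    H, W = disp_left.shape
    xs = np.arange(W, dtype=np.float32)[None, :].repeat(H, axis=0)
    src_x = np.clip(xs - disp_left, 0, W - 1)   # projection into the right view
    x0 = np.clip(np.floor(src_x).astype(int), 0, W - 2)
    w = src_x - x0                              # linear interpolation weight
    rows = np.arange(H)[:, None].repeat(W, axis=1)
    resampled = (1 - w) * disp_right[rows, x0] + w * disp_right[rows, x0 + 1]
    return np.abs(disp_left - resampled).mean()
```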
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
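Components (i) and (iii) are simple enough to sketch directly; a hedged NumPy version (the paper measures photometric error with a combination of SSIM and L1, simplified here to L1):

```python
import numpy as np

def min_reprojection_loss(target, warped_views, source_views):
    """(i) Minimum reprojection: take the per-pixel minimum over the
    reprojection errors of several source frames, which is robust to
    occlusion. (iii) Auto-masking: exclude pixels whose *unwarped* source
    already matches the target better than any warped source, as happens
    for static scenes or objects moving with the camera."""
    reproj = np.min([np.abs(w - target) for w in warped_views], axis=0)
    identity = np.min([np.abs(s - target) for s in source_views], axis=0)
    mask = reproj < identity               # auto-mask violating pixels
    return reproj[mask].mean() if mask.any() else 0.0
```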
3D object detection is an essential task in autonomous driving. Recent techniques excel with highly accurate detection rates, provided the 3D input data is obtained from precise but expensive LiDAR technology. Approaches based on cheaper monocular or stereo imagery data have, until now, resulted in drastically lower accuracies, a gap that is commonly attributed to poor image-based depth estimation. However, in this paper we argue that it is not the quality of the data but its representation that accounts for the majority of the difference. Taking the inner workings of convolutional neural networks into consideration, we propose to convert image-based depth maps to pseudo-LiDAR representations, essentially mimicking the LiDAR signal. With this representation we can apply different existing LiDAR-based detection algorithms. On the popular KITTI benchmark, our approach achieves impressive improvements over the existing state-of-the-art in image-based performance, raising the detection accuracy of objects within the 30m range from the previous state-of-the-art of 22% to an unprecedented 74%. At the time of submission our algorithm holds the highest entry on the KITTI 3D object detection leaderboard for stereo-image-based approaches. Our code is publicly available at https://github.com/mileyan/pseudo_lidar.
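The representation change at the heart of the paper is a plain pinhole back-projection of the estimated depth map into a 3D point cloud; a minimal sketch, with hypothetical names:

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project a dense depth map into a point cloud in the camera
    frame, mimicking a LiDAR signal. depth: (H, W) in meters;
    fx, fy, cx, cy: pinhole intrinsics in pixels. Returns (H*W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) * depth / fx              # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy              # Y = (v - cy) * Z / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```

The resulting cloud can then be fed, essentially unchanged, to existing LiDAR-based detectors.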
We present an end-to-end deep learning architecture for depth map inference from multi-view images. In the network, we first extract deep visual image features, and then build the 3D cost volume upon the reference camera frustum via the differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts arbitrary N-view inputs using a variance-based cost metric that maps multiple features into one cost feature. The proposed MVSNet is demonstrated on the large-scale indoor DTU dataset. With simple post-processing, our method not only significantly outperforms previous state-of-the-arts, but also is several times faster in runtime. We also evaluate MVSNet on the complex outdoor Tanks and Temples dataset, where our method ranks first before April 18, 2018 without any fine-tuning, showing the strong generalization ability of MVSNet.
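The variance-based cost metric that lets the network accept an arbitrary number of views reduces to a per-element variance across the warped feature volumes; a minimal sketch:

```python
import numpy as np

def variance_cost_volume(warped_features):
    """Map N feature volumes, already warped onto the reference camera
    frustum via differentiable homographies, to a single cost volume by
    taking their per-element variance over the view axis.
    warped_features: (N, C, D, H, W) array for N input views."""
    mean = warped_features.mean(axis=0)
    return ((warped_features - mean) ** 2).mean(axis=0)   # (C, D, H, W)
```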
Stereo vision is an effective technique for depth estimation with broad applicability in autonomous urban and highway driving. While various deep-learning-based approaches have been developed for stereo, the input data from a binocular setup with a fixed baseline are limited. Addressing this problem, we introduce an end-to-end network for processing data from a trinocular setup, i.e. a combination of a narrow and a wide stereo pair. In this design, the two pairs of binocular data, sharing a common reference image, are processed with shared network weights and an intermediate fusion. We also propose a Guided Addition method for merging the 4D data of the two baselines. Additionally, an iterative sequential self-supervised and supervised learning scheme on real and synthetic datasets is presented, making training of the trinocular system practical without needing ground-truth data for the real datasets. Experimental results demonstrate that the trinocular disparity network surpasses the scenario where the individual pairs are fed into a similar architecture. Code and dataset: https://github.com/cogsys-tuebingen/tristereonet.
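One plausible reading of such a merge, sketched in PyTorch with a learned per-location gate; this is an assumption for illustration, and the published Guided Addition module may differ in detail:

```python
import torch
import torch.nn as nn

class GuidedAddition(nn.Module):
    """Sketch: blend two 4D cost volumes (narrow and wide baseline) that
    share a reference view, using a learned sigmoid gate per location."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv3d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, vol_narrow, vol_wide):
        # vol_*: (B, C, D, H, W) cost volumes over the same disparity range.
        g = self.gate(torch.cat([vol_narrow, vol_wide], dim=1))
        return g * vol_narrow + (1 - g) * vol_wide
```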
Display technologies have evolved over the years. It is critical to develop practical HDR capture, processing, and display solutions to bring 3D technologies to the next level. Depth estimation of multi-exposure stereo image sequences is an essential task in the development of cost-effective 3D HDR video content. In this paper, we develop a novel deep architecture for multi-exposure stereo depth estimation. The proposed architecture has two novel components. First, the stereo matching technique used in traditional stereo depth estimation is revamped: for the stereo depth estimation component of our architecture, a mono-to-stereo transfer learning approach is deployed. The proposed formulation circumvents the cost volume construction requirement, which is replaced by a ResNet-based encoder-decoder CNN with different weights for feature fusion, and EfficientNet-based blocks are used to learn the disparity. Second, the disparity maps obtained from the stereo images at different exposure levels are combined using a robust disparity feature fusion approach: the disparity maps obtained at different exposures are merged using weight maps computed for different quality measures. The final predicted disparity map is more robust and retains the best features that preserve depth discontinuities. The proposed CNN offers the flexibility to train using standard dynamic range stereo data or multi-exposure low dynamic range stereo sequences. In terms of performance, the proposed model surpasses state-of-the-art monocular and stereo depth estimation methods, both quantitatively and qualitatively, on the challenging Scene Flow and differently exposed Middlebury stereo datasets. The architecture performs exceedingly well on complex natural scenes, demonstrating its usefulness for various 3D HDR applications.
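The exposure-level fusion reduces to a per-pixel weighted merge of the disparity maps; a minimal sketch, with the weight maps assumed to come from the paper's quality measures (names and interface are hypothetical):

```python
import numpy as np

def fuse_exposure_disparities(disparities, weights, eps=1e-6):
    """Merge disparity maps estimated at K different exposure levels with
    per-pixel weight maps, normalized so the weights sum to one.
    disparities, weights: (K, H, W) arrays for K exposures."""
    w = weights / (weights.sum(axis=0, keepdims=True) + eps)
    return (w * disparities).sum(axis=0)
```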
In this paper, a complete pipeline for image-based 3D reconstruction of urban scenes is proposed, based on PatchMatch Multi-View Stereo (MVS). First, the input images are fed into an off-the-shelf visual SLAM system to extract camera poses and sparse keypoints, which are used to initialize the PatchMatch optimization. Then, pixelwise depths and normals are computed iteratively in a multi-scale framework with a novel depth-normal consistency loss term and a global refinement algorithm that counterbalances the inherently local nature of PatchMatch. Finally, a large-scale point cloud is generated by back-projecting the multi-view consistent estimates into 3D. The proposed approach is carefully evaluated against classical MVS algorithms and monocular depth networks on the KITTI dataset, showing state-of-the-art performance.
Following recent advances in novel view synthesis, we propose an application to improve monocular depth estimation. In particular, we propose a novel training method split into three main steps. First, the prediction results of a monocular depth network are warped to an additional view point. Second, we apply an additional image synthesis network that corrects and improves the quality of the warped RGB image; the output of this network is required to look as similar as possible to the ground-truth view, by minimizing the pixel-wise RGB reconstruction error. Third, we reapply the same monocular depth estimation to the synthesized second view point and ensure that the depth predictions are consistent with the associated ground-truth depth. Experimental results prove that our method achieves state-of-the-art or comparable performance on the KITTI and NYU-Depth-v2 datasets with a lightweight and simple vanilla U-Net architecture.
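A heavily hedged sketch of the three-step objective, where `warp`, `depth_net`, and `synth_net` are hypothetical placeholders rather than the authors' actual interfaces:

```python
import torch.nn.functional as F

def three_step_loss(depth_net, synth_net, image, gt_view, gt_depth_2nd, warp):
    """Sketch under stated assumptions: warp(image, depth) forward-warps
    the input to the second viewpoint; losses are simplified to L1."""
    depth = depth_net(image)                    # step 1: predict depth + warp
    warped = warp(image, depth)
    synth = synth_net(warped)                   # step 2: refine the warped RGB
    loss_rgb = F.l1_loss(synth, gt_view)        # match the ground-truth view
    depth_2nd = depth_net(synth)                # step 3: re-apply the depth net
    loss_depth = F.l1_loss(depth_2nd, gt_depth_2nd)  # depth consistency
    return loss_rgb + loss_depth
```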
Depth estimation is one of the key technologies for fields such as autonomous driving and robot navigation. However, traditional methods relying on a single sensor are inevitably limited by that sensor's performance. We therefore propose a precise and robust method that fuses a LiDAR with a stereo camera. The method fully combines the advantages of the two sensors, retaining the high precision of LiDAR and the high resolution of images, and, compared with traditional stereo matching methods, is less affected by object texture and illumination conditions. First, the depth of the LiDAR data is converted into the disparity of the stereo camera; because the density of the LiDAR data is relatively sparse along the y-axis, the converted disparity map is upsampled using an interpolation method. Second, in order to make full use of the precise disparity map, the disparity map and stereo matching are fused to propagate the accurate disparities. Finally, the disparity map is converted back into a depth map. Moreover, the converted disparity map also increases the speed of the algorithm. We evaluated the proposed pipeline on the KITTI benchmark; the experiments demonstrate that our algorithm has higher accuracy than several classic methods.
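The first step, converting LiDAR depth into stereo disparity, follows directly from rectified stereo geometry (d = f * B / Z); a minimal sketch with hypothetical names:

```python
import numpy as np

def lidar_depth_to_disparity(depth, focal, baseline):
    """Convert a (sparse) LiDAR depth map projected into the left camera
    to stereo disparity via d = f * B / Z; zero-depth pixels stay invalid.
    depth: (H, W) in meters; focal in pixels; baseline in meters."""
    disp = np.zeros_like(depth, dtype=np.float32)
    valid = depth > 0
    disp[valid] = focal * baseline / depth[valid]
    return disp
```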
Monocular depth estimation is a challenging problem on which deep neural networks have demonstrated great potential. However, depth maps predicted by existing deep models usually lack fine-grained details due to the convolution operations and the down-samplings in networks. We find that increasing input resolution is helpful to preserve more local details while the estimation at low resolution is more accurate globally. Therefore, we propose a novel depth map fusion module to combine the advantages of estimations with multi-resolution inputs. Instead of merging the low- and high-resolution estimations equally, we adopt the core idea of Poisson fusion, trying to implant the gradient domain of high-resolution depth into the low-resolution depth. While classic Poisson fusion requires a fusion mask as supervision, we propose a self-supervised framework based on guided image filtering. We demonstrate that this gradient-based composition performs much better in terms of noise immunity, compared with the state-of-the-art depth map fusion method. Our lightweight depth fusion is one-shot and runs in real-time, making our method 80X faster than a state-of-the-art depth fusion method. Quantitative evaluations demonstrate that the proposed method can be integrated into many fully convolutional monocular depth estimation backbones with a significant performance boost, leading to state-of-the-art results of detail enhancement on depth maps.
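As a simplified stand-in for the Poisson-style composition (the paper implants the high-resolution gradient domain under a guided-filtering-based self-supervised mask), a frequency-split sketch conveys the idea:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_domain_fuse(depth_low, depth_high, sigma=5.0):
    """Keep the globally accurate low-frequency component of the
    low-resolution estimate and implant the high-frequency detail of the
    high-resolution estimate. Both maps are (H, W), with depth_low
    already upsampled to the high resolution; sigma is an assumption."""
    base = gaussian_filter(depth_low, sigma)                   # global layout
    detail = depth_high - gaussian_filter(depth_high, sigma)   # local detail
    return base + detail
```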
As an essential component of many autonomous driving and robotics activities, such as ego-motion estimation, obstacle avoidance, and scene understanding, monocular depth estimation (MDE) has attracted great attention from the computer vision and robotics communities. Over the past few decades, a large number of methods have been developed. However, to the best of our knowledge, there is no comprehensive survey of MDE. This paper aims to bridge this gap by reviewing 197 relevant articles published between 1970 and 2021. In particular, we provide a comprehensive survey of MDE covering various methods, introduce popular performance evaluation metrics, and summarize publicly available datasets. We also summarize available open-source implementations of some representative methods and compare their performance. Furthermore, we review the applications of MDE in some important robotics tasks. Finally, we conclude this paper by presenting some promising future research directions. This survey is expected to assist readers in navigating this research field.