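Estimating the pose of a moving camera from monocular video is a challenging problem, especially because of moving objects in dynamic environments, where existing camera pose estimation methods are susceptible to pixels that are not geometrically consistent. To tackle this challenge, we present a robust dense indirect structure-from-motion method for videos that is based on dense correspondence initialized from pairwise optical flow. Our key idea is to optimize long-range video correspondence as dense point trajectories and to use them to learn a robust estimate of motion segmentation. A novel neural network architecture is proposed for processing irregular point trajectory data. Camera poses are then estimated and optimized with global bundle adjustment over the portion of the long-range point trajectories that is classified as static. Experiments on the MPI Sintel dataset show that our system produces significantly more accurate camera trajectories than existing state-of-the-art methods. In addition, our method retains reasonable camera pose accuracy on fully static scenes, where it consistently outperforms strong state-of-the-art dense-correspondence methods based on end-to-end deep learning, demonstrating the potential of dense indirect methods based on optical flow and point trajectories. Because the point trajectory representation is general, we further present comparisons on in-the-wild monocular videos with complex motion of dynamic objects. Code is available at https://github.com/bytedance/particle-sfm.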
translated by 谷歌翻译
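Combining simultaneous localization and mapping (SLAM) estimation with dynamic scene modelling can greatly benefit robot autonomy in dynamic environments. Robot path planning and obstacle avoidance tasks rely on accurate estimates of the motion of dynamic objects in the scene. This paper presents VDO-SLAM, a robust visual dynamic object-aware SLAM system that exploits semantic information to enable accurate motion estimation and tracking of dynamic rigid objects in the scene without any prior knowledge of the objects' shape or geometric models. The proposed approach identifies and tracks the dynamic objects and the static structure in the environment and integrates this information into a unified SLAM framework. This results in highly accurate estimates of the robot's trajectory and of the full SE(3) motion of the objects, as well as a spatiotemporal map of the environment. The system is able to extract linear velocity estimates from the objects' SE(3) motion, providing an important capability for navigation in complex dynamic environments. We demonstrate the performance of the proposed system on a number of real indoor and outdoor datasets, and the results show consistent and substantial improvements over state-of-the-art algorithms. An open-source version of the source code is available.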
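As a rough illustration of how a linear velocity could be read off an SE(3) object motion, the sketch below applies a 4x4 motion matrix to an object's centroid and divides the displacement by the frame interval. The function names, the centroid-based formulation, and the frame interval `dt` are assumptions made for this example; the abstract does not specify how VDO-SLAM computes the estimate.

```python
def apply_se3(H, p):
    """Apply a 4x4 homogeneous transform H (nested lists) to a 3D point p."""
    return [sum(H[r][c] * p[c] for c in range(3)) + H[r][3] for r in range(3)]


def linear_velocity(H, centroid, dt):
    """Displacement of the centroid under motion H, divided by the frame interval.

    H is assumed to map object points at frame k-1 to frame k in a fixed
    world frame; dt is the time between the two frames in seconds.
    """
    moved = apply_se3(H, centroid)
    return [(m - c) / dt for m, c in zip(moved, centroid)]


# Hypothetical example: pure translation of 0.5 m along x over 0.1 s.
H = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
v = linear_velocity(H, [2.0, 0.0, 10.0], 0.1)
speed = sum(x * x for x in v) ** 0.5
```

When the motion includes rotation, the displacement depends on which point is used, which is why the sketch takes the centroid as an explicit argument.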
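We present a novel dual-flow representation of scene motion that decomposes the optical flow into a static flow field caused by the camera motion and a dynamic flow field caused by the motion of objects in the scene. Based on this representation, we present a dynamic SLAM system, dubbed DeFlowSLAM, that exploits both static and dynamic pixels in the images to solve for the camera poses, rather than simply using static background pixels as other dynamic SLAM systems do. We propose a dynamic update module to train DeFlowSLAM in a self-supervised manner, in which a dense bundle adjustment layer takes the estimated static flow fields and weights controlled by the dynamic mask, and outputs the residuals of the optimized static flow fields, camera poses, and inverse depths. The static and dynamic flow fields are estimated by warping the current image to the neighboring images, and the optical flow can be obtained by summing the two fields. Extensive experiments demonstrate that DeFlowSLAM generalizes well to both static and dynamic scenes: it performs comparably to the state-of-the-art DROID-SLAM in static and less dynamic scenes, while significantly outperforming DROID-SLAM in highly dynamic environments. Code and data are available on the project webpage: https://zju3dv.github.io/deflowslam/.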
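The dual-flow decomposition can be made concrete with a small geometric sketch. Assuming a pinhole camera with known depth and relative pose (DeFlowSLAM estimates these quantities with a learned pipeline, so this closed-form version is only an illustration), the static flow at a pixel is the displacement caused by camera motion alone, and the dynamic flow is what remains of the observed optical flow. All names and numbers below are hypothetical.

```python
def static_flow(u, v, depth, K, R, t):
    """Flow at pixel (u, v) induced by camera motion alone.

    K = (fx, fy, cx, cy) are pinhole intrinsics; R (3x3) and t (3,) map
    points from the first camera frame to the second.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the first camera frame.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Transform the point into the second camera frame.
    Y = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # Project into the second image and take the pixel displacement.
    u2 = fx * Y[0] / Y[2] + cx
    v2 = fy * Y[1] / Y[2] + cy
    return (u2 - u, v2 - v)


def dynamic_flow(optical, static):
    """Residual flow attributed to object motion: optical = static + dynamic."""
    return (optical[0] - static[0], optical[1] - static[1])


K = (500.0, 500.0, 320.0, 240.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-0.1, 0.0, 0.0]  # scene points shift 0.1 m along -x in the second frame

f_static = static_flow(320.0, 240.0, 5.0, K, R, t)
f_dynamic = dynamic_flow((-4.0, 3.0), f_static)
```

Summing `f_static` and `f_dynamic` recovers the observed optical flow, which mirrors the summation described in the abstract.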
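Dense depth and pose estimation is a vital prerequisite for various video applications. Traditional solutions suffer from the limited robustness of sparse feature tracking and from insufficient camera baselines in videos. Recent methods therefore use learning-based optical flow and depth priors to estimate dense depth. However, previous works require heavy computation time or yield sub-optimal depth results. In this paper we present GCVD, a globally consistent method for learning-based video structure from motion (SfM). GCVD integrates a compact pose graph into CNN-based optimization to achieve globally consistent estimation through an effective keyframe selection mechanism. It improves the robustness of learning-based methods with flow-guided keyframes and a well-established depth prior. Experimental results show that GCVD outperforms state-of-the-art methods on both depth and pose estimation. In addition, runtime experiments show that it is efficient on both short and long videos while providing global consistency.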
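In this paper, the fundamental problem of robust visual odometry (VO) is addressed by incorporating geometry-based methods into a deep learning architecture in a self-supervised manner. In general, purely geometry-based algorithms are less robust than deep learning in feature-point extraction and matching, but they perform well in ego-motion estimation because of their well-established geometric theory. In this work, a novel optical flow network (PANet) built on a position-aware mechanism is first proposed. Then, a novel system is proposed that jointly estimates depth, optical flow, and ego-motion without a typical network for learning ego-motion. The key component of the proposed system is an improved bundle adjustment module that includes multiple sampling, ego-motion initialization, dynamic damping factor adjustment, and Jacobi matrix weighting. In addition, a novel relative photometric loss function is introduced to improve depth estimation accuracy. Experiments show that the proposed system not only outperforms other state-of-the-art learning-based methods in depth, flow, and VO estimation, but also significantly improves robustness compared with geometry-based, learning-based, and hybrid VO systems. Further experiments show that our model achieves outstanding generalization ability and performance in challenging indoor (TMU-RGBD) and outdoor (KAIST) scenes.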
We propose GeoNet, a jointly unsupervised learning framework for monocular depth, optical flow and egomotion estimation from videos. The three components are coupled by the nature of 3D scene geometry, jointly learned by our framework in an end-to-end manner. Specifically, geometric relationships are extracted over the predictions of individual modules and then combined as an image reconstruction loss, reasoning about static and dynamic scene parts separately. Furthermore, we propose an adaptive geometric consistency loss to increase robustness towards outliers and non-Lambertian regions, which resolves occlusions and texture ambiguities effectively. Experimentation on the KITTI driving dataset reveals that our scheme achieves state-of-the-art results in all of the three tasks, performing better than previously unsupervised methods and comparably with supervised ones.
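Dynamic object-aware SLAM (DOS) exploits object-level information to enable robust motion estimation in dynamic environments. Existing methods mainly focus on identifying dynamic objects and excluding them from the optimization. In this paper, we show that feature-based visual SLAM systems can also benefit from the presence of dynamic articulated objects by taking advantage of two observations: (1) the 3D structure of each rigid part of an articulated object remains consistent over time; (2) points on the same rigid part follow the same motion. In particular, we present AirDOS, a dynamic object-aware system that introduces rigidity and motion constraints to model articulated objects. By jointly optimizing the camera pose, object motion, and object 3D structure, we can rectify the camera pose estimate, prevent tracking loss, and generate 4D spatio-temporal maps for both dynamic objects and static scenes. Experiments show that our algorithm improves the robustness of visual SLAM algorithms in challenging crowded urban environments. To the best of our knowledge, AirDOS is the first dynamic object-aware SLAM system to demonstrate that camera pose estimation can be improved by incorporating dynamic articulated objects.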
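Observation (1) above can be expressed as a simple residual: the distance between two points on the same rigid part should not change between frames. The sketch below is one plausible way to write such a rigidity term; the function name and the use of a plain distance difference are assumptions for illustration, not the formulation used in AirDOS.

```python
import math


def rigidity_residual(p_i_prev, p_j_prev, p_i_curr, p_j_curr):
    """Change in distance between two 3D points across two frames.

    For points on the same rigid part the value should be close to zero;
    a large magnitude suggests the points do not belong to one rigid body.
    """
    return math.dist(p_i_curr, p_j_curr) - math.dist(p_i_prev, p_j_prev)


# Hypothetical points: the pair is translated rigidly by (1, 1, 1).
rigid = rigidity_residual((0, 0, 0), (3, 4, 0), (1, 1, 1), (4, 5, 1))

# Same starting pair, but the second point drifts away along z.
non_rigid = rigidity_residual((0, 0, 0), (3, 4, 0), (1, 1, 1), (4, 5, 13))
```

In an optimization back end, a residual of this kind would typically be squared and summed over point pairs alongside the reprojection terms.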
Simultaneous Localization & Mapping (SLAM) is the process of building a mutual relationship between localization and mapping of the subject in its surrounding environment. With the help of different sensors, various types of SLAM systems have developed to deal with the problem of building the relationship between localization and mapping. A limitation in the SLAM process is the lack of consideration of dynamic objects in the mapping of the environment. We propose the Dynamic Object Tracking SLAM (DyOb-SLAM), which is a Visual SLAM system that can localize and map the surrounding dynamic objects in the environment as well as track the dynamic objects in each frame. With the help of a neural network and a dense optical flow algorithm, dynamic objects and static objects in an environment can be differentiated. DyOb-SLAM creates two separate maps for both static and dynamic contents. For the static features, a sparse map is obtained. For the dynamic contents, a trajectory global map is created as output. As a result, a frame to frame real-time based dynamic object tracking system is obtained. With the pose calculation of the dynamic objects and camera, DyOb-SLAM can estimate the speed of the dynamic objects with time. The performance of DyOb-SLAM is observed by comparing it with a similar Visual SLAM system, VDO-SLAM and the performance is measured by calculating the camera and object pose errors as well as the object speed error.
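Learning-based visual odometry (VO) algorithms achieve remarkable performance on common static scenes, benefiting from high-capacity models and massive annotated data, but tend to fail in dynamic, populated environments. Semantic segmentation is mostly used to discard dynamic associations before estimating camera motion, but at the cost of discarding static features, and it is hard to scale to unseen categories. In this paper, we leverage the mutual dependence between camera ego-motion and motion segmentation and show that both can be jointly refined in a single learning-based framework. In particular, we present DytanVO, the first learning-based VO method that deals with dynamic environments. It takes two consecutive monocular frames in real time and predicts camera ego-motion iteratively. Our method achieves an average improvement of 27.7% over state-of-the-art VO methods in real-world dynamic environments, and even performs competitively among dynamic visual SLAM systems that optimize the trajectory on the back end. Experiments on many unseen environments also demonstrate the generalizability of our method.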
We present ObjectMatch, a semantic and object-centric camera pose estimation for RGB-D SLAM pipelines. Modern camera pose estimators rely on direct correspondences of overlapping regions between frames; however, they cannot align camera frames with little or no overlap. In this work, we propose to leverage indirect correspondences obtained via semantic object identification. For instance, when an object is seen from the front in one frame and from the back in another frame, we can provide additional pose constraints through canonical object correspondences. We first propose a neural network to predict such correspondences on a per-pixel level, which we then combine in our energy formulation with state-of-the-art keypoint matching solved with a joint Gauss-Newton optimization. In a pairwise setting, our method improves registration recall of state-of-the-art feature matching from 77% to 87% overall and from 21% to 52% in pairs with 10% or less inter-frame overlap. In registering RGB-D sequences, our method outperforms cutting-edge SLAM baselines in challenging, low frame-rate scenarios, achieving more than 35% reduction in trajectory error in multiple scenes.
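We present a novel panoptic visual odometry framework, termed PVO, to achieve a more comprehensive modeling of the scene's motion, geometry, and panoptic segmentation information. PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, which enables the two tasks to benefit each other. Specifically, we introduce a panoptic update module into the VO module, which operates on the image panoptic segmentation. This panoptic-enhanced VO module can suppress the interference of dynamic objects in camera pose estimation by adjusting the weights of the optimized camera poses. In turn, the VO-enhanced VPS module improves segmentation accuracy by fusing the panoptic segmentation result of the current frame into adjacent frames, using the camera pose, depth, and optical flow obtained from the VO module. The two modules contribute to each other through recurrent iterative optimization. Extensive experiments demonstrate that PVO outperforms state-of-the-art methods in both visual odometry and video panoptic segmentation tasks. Code and data are available on the project webpage: https://zju3dv.github.io/pvo/pvo/.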
This paper presents ORB-SLAM, a feature-based monocular SLAM system that operates in real time, in small and large, indoor and outdoor environments. The system is robust to severe motion clutter, allows wide baseline loop closing and relocalization, and includes full automatic initialization. Building on excellent algorithms of recent years, we designed from scratch a novel system that uses the same features for all SLAM tasks: tracking, mapping, relocalization, and loop closing. A survival of the fittest strategy that selects the points and keyframes of the reconstruction leads to excellent robustness and generates a compact and trackable map that only grows if the scene content changes, allowing lifelong operation. We present an exhaustive evaluation in 27 sequences from the most popular datasets. ORB-SLAM achieves unprecedented performance with respect to other state-of-the-art monocular SLAM approaches. For the benefit of the community, we make the source code public.
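The indirect method for visual SLAM is popular because of its robustness to environmental variations. ORB-SLAM2 \cite{ORBSLM2} is a benchmark method in this domain; however, it spends time computing descriptors that are never reused unless a frame is selected as a keyframe. FastORB-SLAM is lightweight and efficient because it tracks keypoints between adjacent frames without computing descriptors. To achieve this, a two-stage coarse-to-fine descriptor-independent keypoint matching method is proposed based on sparse optical flow. In the first stage, we predict initial keypoint correspondences with a simple but effective motion model and then robustly establish the correspondences through pyramid-based sparse optical flow tracking. In the second stage, we use the constraints of motion smoothness and epipolar geometry to refine the correspondences. In particular, our method computes descriptors only for keyframes. We test FastORB-SLAM on the TUM and ICL-NUIM RGB-D datasets and compare its accuracy and efficiency with nine existing RGB-D SLAM methods. Qualitative and quantitative results show that our method achieves state-of-the-art accuracy and is about twice as fast as ORB-SLAM2.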
Photometric differences are widely used as supervision signals to train neural networks for estimating depth and camera pose from unlabeled monocular videos. However, this approach is detrimental for model optimization because occlusions and moving objects in a scene violate the underlying static scenario assumption. In addition, pixels in textureless regions or less discriminative pixels hinder model training. To solve these problems, in this paper, we first deal with moving objects and occlusions by utilizing the difference between the flow fields and depth structure generated by affine transformation and view synthesis, respectively. Secondly, we mitigate the effect of textureless regions on model optimization by measuring differences between features with more semantic and contextual information without adding networks. In addition, although the bidirectionality component is used in each sub-objective function, each pair of images is reasoned about only once, which helps reduce overhead. Extensive experiments and visual analysis demonstrate the effectiveness of the proposed method, which outperforms existing state-of-the-art self-supervised methods under the same conditions and without introducing additional auxiliary information.
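As a fundamental component of many autonomous driving and robotic activities, such as ego-motion estimation, obstacle avoidance, and scene understanding, monocular depth estimation (MDE) has attracted great attention from the computer vision and robotics communities. Over the past decades, a large number of methods have been developed. To the best of our knowledge, however, there is no comprehensive survey of MDE. This paper aims to bridge this gap by reviewing 197 relevant articles published between 1970 and 2021. In particular, we provide a comprehensive survey of MDE covering various methods, introduce the popular performance evaluation metrics, and summarize publicly available datasets. We also summarize available open-source implementations of some representative methods and compare their performance. Furthermore, we review the application of MDE in some important robotic tasks. Finally, we conclude the paper by presenting some promising directions for future research. This survey is expected to help readers navigate this research field.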
We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. In common with recent work [10,14,16], we use an end-to-end learning approach with view synthesis as the supervisory signal. In contrast to the previous work, our method is completely unsupervised, requiring only monocular video sequences for training. Our method uses single-view depth and multiview pose networks, with a loss based on warping nearby views to the target using the computed depth and pose. The networks are thus coupled by the loss during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performs comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performs favorably compared to established SLAM systems under comparable input settings.
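In this paper, we present TANDEM, a real-time monocular tracking and dense mapping framework. For pose estimation, TANDEM performs photometric bundle adjustment based on a sliding window of keyframes. To increase robustness, we propose a novel tracking front end that performs dense direct image alignment using depth maps rendered from a global model that is built incrementally from dense depth predictions. To predict the dense depth maps, we propose Cascade View-Aggregation MVSNet (CVA-MVSNet), which uses the entire active keyframe window by hierarchically constructing 3D cost volumes with adaptive view aggregation to balance the different stereo baselines between the keyframes. Finally, the predicted depth maps are fused into a consistent global map represented as a truncated signed distance function (TSDF) voxel grid. Our experimental results show that TANDEM outperforms other state-of-the-art traditional and learning-based monocular visual odometry (VO) methods in terms of camera tracking. Moreover, TANDEM shows state-of-the-art real-time 3D reconstruction performance.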
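To make the TSDF fusion step more tangible, the sketch below shows a commonly used per-voxel update: the signed distance to the measured surface is truncated and then merged into the voxel with a weighted running average. The abstract does not state which update rule is used for the global map, so the function names, the unit observation weight, and the truncation distance here are assumptions for illustration.

```python
def truncated_sdf(voxel_depth, measured_depth, trunc):
    """Signed distance from voxel to surface along the ray, clamped to [-trunc, trunc].

    Positive values lie in front of the measured surface, negative behind it.
    """
    sdf = measured_depth - voxel_depth
    return max(-trunc, min(trunc, sdf))


def fuse(tsdf, weight, new_tsdf, new_weight=1.0):
    """Weighted running average of TSDF observations for a single voxel."""
    total = weight + new_weight
    return (tsdf * weight + new_tsdf * new_weight) / total, total


# Hypothetical voxel at 2.0 m along the ray, observed in two depth maps.
tsdf, weight = 0.0, 0.0
for measured in (2.05, 2.15):
    observation = truncated_sdf(2.0, measured, trunc=0.1)
    tsdf, weight = fuse(tsdf, weight, observation)
```

The second measurement lies beyond the truncation band, so it contributes the clamped value 0.1 rather than 0.15, which keeps distant surfaces from dominating the average.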
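We propose Deep Patch Visual Odometry (DPVO), a new deep learning system for monocular visual odometry (VO). DPVO is accurate and robust while running at 2x-5x real-time speed on a single RTX-3090 GPU using only 4GB of memory. We evaluate on standard benchmarks and outperform all prior work (classical or learned) in both accuracy and speed. Code is available at https://github.com/princeton-vl/dpvo.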
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of an MLP, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. Our project webpage is at dynibar.github.io.
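Self-supervised monocular depth estimation has made significant progress in recent years, especially in outdoor environments. However, depth prediction results are unsatisfactory in indoor scenes, where most existing data are captured with hand-held devices. Compared with outdoor environments, estimating the depth of monocular videos in indoor environments with self-supervised methods poses two additional challenges: (i) the depth range of indoor video sequences varies considerably across frames, making it difficult for the depth network to induce consistent depth cues for training; (ii) indoor sequences recorded with hand-held devices often contain much more rotational motion, which makes it difficult for the pose network to predict accurate relative camera poses. In this work, we propose a novel framework, MonoIndoor++, by giving special consideration to these challenges and consolidating a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to estimate a global depth scale factor explicitly; the predicted scale factor can indicate the maximum depth values. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose a residual pose estimation module that estimates relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinate guidance into our residual pose estimation module, we propose to perform coordinate convolutional encoding directly over the inputs to the pose networks. The proposed method is validated on a variety of benchmark indoor datasets, namely EuRoC MAV, NYUv2, ScanNet, and 7-Scenes, demonstrating state-of-the-art performance.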
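The coordinate encoding mentioned as the third contribution can be pictured as appending per-pixel coordinate maps to the pose network's input. The sketch below follows the general CoordConv idea using plain nested lists; the function name, the channel layout, and the normalization of coordinates to [-1, 1] are assumptions for illustration, and the abstract does not describe the exact encoding.

```python
def add_coord_channels(image):
    """Append normalized x and y coordinate maps to a multi-channel image.

    `image` is a list of channels, each an H x W nested list. The result has
    two extra channels holding column and row positions scaled to [-1, 1].
    """
    height = len(image[0])
    width = len(image[0][0])

    def normalize(index, size):
        # Map 0..size-1 to -1..1; a single-element axis maps to 0.
        return 2.0 * index / (size - 1) - 1.0 if size > 1 else 0.0

    xs = [[normalize(c, width) for c in range(width)] for _ in range(height)]
    ys = [[normalize(r, height) for _ in range(width)] for r in range(height)]
    return image + [xs, ys]


# Hypothetical single-channel 2 x 3 input.
frame = [[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
encoded = add_coord_channels(frame)
```

With the extra channels, a convolution applied to `encoded` can condition on where in the image a feature appears, which is the kind of positional cue the abstract refers to as coordinate guidance.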