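Novel view synthesis (NVS) and video prediction (VP) are typically considered disjoint tasks in computer vision. However, both can be seen as ways to observe the spatial-temporal world: NVS aims to synthesize a scene from a new point of view, while VP aims to see a scene from a new point in time. The two tasks provide complementary signals for obtaining a scene representation, because viewpoint changes in spatial observations carry information about depth, and temporal observations carry information about the motion of the camera and of individual objects. Inspired by these observations, we propose to study the problem of Video Extrapolation in Space and Time (VEST). We propose a model that leverages the self-supervision and the complementary cues from both tasks, whereas existing methods can only solve one of them. Experiments show that our method achieves better performance than several state-of-the-art NVS and VP methods on indoor and outdoor real-world datasets.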
Translated by Google Translate
Figure 1: Our method can synthesize novel views in both space and time from a single monocular video of a dynamic scene. Here we show video results with various configurations of fixing and interpolating view and time (left), as well as a visualization of the recovered scene geometry (right). Please view with Adobe Acrobat or KDE Okular to see animations.
A recent strand of work in view synthesis uses deep learning to generate multiplane images (a camera-centric, layered 3D representation) given two or more input images at known viewpoints. We apply this representation to single-view view synthesis, a problem which is more challenging but has potentially much wider application. Our method learns to predict a multiplane image directly from a single image input, and we introduce scale-invariant view synthesis for supervision, enabling us to train on online video. We show this approach is applicable to several different datasets, that it additionally generates reasonable depth maps, and that it learns to fill in content behind the edges of foreground objects in background layers. Project page at https://single-view-mpi.github.io/.
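A multiplane image is commonly rendered by alpha-compositing its layers from back to front. The following is a minimal sketch of that compositing step for a single pixel, assuming the layers have already been warped into the target view. The function name `composite_mpi` and the two-layer example are illustrative assumptions, not code from the paper.

```python
def composite_mpi(layers):
    """Composite MPI layers for one pixel, from back to front.

    layers: list of (color, alpha) pairs ordered from the farthest plane
    to the nearest, where color is an (r, g, b) tuple and alpha is in [0, 1].
    """
    out = (0.0, 0.0, 0.0)
    for color, alpha in layers:
        # "Over" operator: the nearer layer partially covers what lies behind it.
        out = tuple(alpha * c + (1.0 - alpha) * o for c, o in zip(color, out))
    return out


# Hypothetical two-plane example: opaque blue background, half-transparent red in front.
pixel = composite_mpi([
    ((0.0, 0.0, 1.0), 1.0),
    ((1.0, 0.0, 0.0), 0.5),
])
print(pixel)
```

Because each layer carries its own alpha, background layers can hold content that is hidden in the input view and only becomes visible once the layers shift relative to each other in a new view.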
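The challenge of rendering high-frame-rate video on low-compute devices can be addressed by periodically predicting future frames, which enhances the user experience in virtual reality applications. This is studied through the problem of temporal view synthesis (TVS), where the goal is to predict the next frames of a video given the previous frames and the head poses of the previous and next frames. In this work, we consider TVS of dynamic scenes in which both the user and the objects are moving. We design a framework that decouples the motion into user motion and object motion, so that the available user motion can be used effectively while predicting the next frames. We predict the motion of objects by isolating and estimating the 3D object motion in past frames and then extrapolating it. We use multi-plane images (MPI) as a 3D representation of the scene and model object motion as the 3D displacement between corresponding points in the MPI representation. To handle the sparsity in MPIs while estimating motion, we incorporate partial convolutions and masked correlation layers to estimate corresponding points. The predicted object motion is then integrated with the given user or camera motion to generate the next frame. Using a disocclusion infilling module, we synthesize the regions uncovered by camera and object motion. We develop a new synthetic dataset for TVS of dynamic scenes, consisting of 800 videos at full HD resolution. Experiments on our dataset and on the MPI Sintel dataset show that our model outperforms all competing methods in the literature.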
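This paper aims to reduce the rendering time of generalizable radiance fields. Some recent works equip neural radiance fields with image encoders and are able to generalize across scenes, which avoids per-scene optimization. However, their rendering process is generally very slow. A major factor is that they sample a large number of points in empty space when inferring radiance fields. In this paper, we present a hybrid scene representation that combines the best of implicit radiance fields and explicit depth maps for efficient rendering. Specifically, we first build a cascade cost volume to efficiently predict the coarse geometry of the scene. The coarse geometry allows us to sample only a few points near the scene surface, which significantly improves rendering speed. This process is fully differentiable, enabling us to jointly learn the depth prediction and radiance field networks from RGB images alone. Experiments show that the proposed approach achieves state-of-the-art performance on the DTU, Real Forward-facing, and NeRF Synthetic datasets, while being at least 50 times faster than previous generalizable radiance field methods. We also demonstrate that our method can synthesize free-viewpoint videos of dynamic human performers in real time. The code will be available at https://zju3dv.github.io/enerf/.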
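The speed-up described above comes from where samples are placed along each ray. The sketch below contrasts sampling a whole ray with sampling a narrow interval around a coarse depth estimate. It is only an illustration of the idea: the helper names, the `half_width` parameter, and the numbers are assumptions, and the cascade cost volume that would supply the depth estimate is not reproduced here.

```python
def linspace(start, stop, num):
    """Evenly spaced values from start to stop inclusive (num >= 1)."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


def sample_full_ray(near, far, num_samples):
    """Baseline: spread samples over the whole ray, including empty space."""
    return linspace(near, far, num_samples)


def sample_near_surface(coarse_depth, half_width, num_samples):
    """Depth-guided: place a few samples in a narrow band around the estimate."""
    return linspace(coarse_depth - half_width, coarse_depth + half_width, num_samples)


# Hypothetical numbers: 5 samples over the whole ray versus 3 near the surface.
dense = sample_full_ray(0.0, 8.0, 5)
sparse = sample_near_surface(2.0, 0.5, 3)
print(dense, sparse)
```

With a reliable depth estimate, far fewer radiance field queries are needed per ray, and those queries dominate the rendering cost.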
Image view synthesis has seen great success in reconstructing photorealistic visuals, thanks to deep learning and various novel representations. The next key step in immersive virtual experiences is view synthesis of dynamic scenes. However, several challenges exist due to the lack of high-quality training datasets, and the additional time dimension for videos of dynamic scenes. To address this issue, we introduce a multi-view video dataset, captured with a custom 10-camera rig in 120FPS. The dataset contains 96 high-quality scenes showing various visual effects and human interactions in outdoor scenes. We develop a new algorithm, Deep 3D Mask Volume, which enables temporally-stable view extrapolation from binocular videos of dynamic scenes, captured by static cameras. Our algorithm addresses the temporal inconsistency of disocclusions by identifying the error-prone areas with a 3D mask volume, and replaces them with static background observed throughout the video. Our method enables manipulation in 3D space as opposed to simple 2D masks. We demonstrate better temporal stability than frame-by-frame static view synthesis methods, or those that use 2D masks. The resulting view synthesis videos show minimal flickering artifacts and allow for larger translational movements.
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of an MLP, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. Our project webpage is at dynibar.github.io.
We present a method for novel view synthesis from input images that are freely distributed around a scene. Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts. We calibrate the input images via SfM and erect a coarse geometric scaffold via MVS. This scaffold is used to create a proxy depth map for a novel view of the scene. Based on this depth map, a recurrent encoder-decoder network processes reprojected features from nearby views and synthesizes the new view. Our network does not need to be optimized for a given scene. After training on a dataset, it works in previously unseen environments with no finetuning or per-scene optimization. We evaluate the presented approach on challenging real-world datasets, including Tanks and Temples, where we demonstrate successful view synthesis for the first time and substantially outperform prior and concurrent work.
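We introduce FaDIV-Syn, a fast depth-independent method for novel view synthesis. Related methods are often limited by their depth estimation stage, where incorrect depth predictions can lead to large projection errors. To avoid this issue, we efficiently warp the input images into the target frame for a range of assumed depth planes. The resulting plane sweep volume (PSV) is fed directly into our network, which first estimates soft PSV masks in a self-supervised manner and then directly produces the novel output view. We thereby side-step explicit depth estimation. This improves efficiency and performance on transparent, reflective, thin, and featureless scene parts. FaDIV-Syn can perform both interpolation and extrapolation tasks, and it outperforms state-of-the-art extrapolation methods on the large-scale RealEstate10K dataset. In contrast to comparable methods, it achieves real-time performance thanks to its lightweight architecture. We thoroughly evaluate ablations, such as removing the soft-masking network, training from fewer examples, and generalization to higher resolutions and stronger depth discretization.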
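We present a portable multiscopic camera system with a dedicated model for novel view and time synthesis in dynamic scenes. Our goal is to render high-quality images of a dynamic scene from any viewpoint at any time using our portable multiscopic camera. To achieve such novel view and time synthesis, we develop a physical multiscopic camera equipped with five cameras to train a neural radiance field (NeRF) in both the temporal and spatial domains for dynamic scenes. Our model maps a 6D coordinate (3D spatial position, 1D temporal coordinate, and 2D viewing direction) to view-dependent and time-varying emitted radiance and volume density. Volume rendering is used to render a photo-realistic image at a specified camera pose and time. To improve the robustness of the physical cameras, we propose a camera parameter optimization module and a temporal frame interpolation module to promote information propagation across time. We conduct experiments on both real-world and synthetic datasets to evaluate our system, and the results show that our approach outperforms alternative solutions both qualitatively and quantitatively. Our code and dataset are available at https://yuenfuilau.github.io.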
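The volume rendering step mentioned above can be sketched with the standard NeRF quadrature rule, in which each sample along a ray contributes its color weighted by its opacity and by the transmittance accumulated in front of it. This is a minimal illustration, not the authors' implementation: the function name `render_ray` is an assumption, and the densities and colors are passed in directly rather than queried from a network at a 6D coordinate.

```python
import math


def render_ray(densities, colors, deltas):
    """Accumulate color along one ray, ordered from near to far.

    densities: volume density at each sample
    colors:    (r, g, b) emitted radiance at each sample
    deltas:    distance between consecutive samples
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return pixel


# Hypothetical ray: an empty green sample followed by an effectively opaque red one.
print(render_ray([0.0, 1e9], [(0.0, 1.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))
```

For a dynamic scene, the same accumulation is applied at a fixed time value, so rendering a different moment only changes the densities and colors returned by the model.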
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
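Self-supervised monocular depth estimation has seen significant progress in recent years, especially in outdoor environments. However, depth prediction results remain unsatisfactory in indoor scenes, where most existing data are captured with hand-held devices. Compared to outdoor environments, estimating depth from monocular videos of indoor environments with self-supervised methods raises two additional challenges: (i) the depth range of indoor video sequences varies considerably across frames, making it difficult for the depth network to induce consistent depth cues for training; (ii) indoor sequences recorded with hand-held devices often contain much more rotational motion, which makes it difficult for the pose network to predict accurate relative camera poses. In this work, we propose a novel framework, MonoIndoor++, by giving special consideration to these challenges and consolidating a set of good practices for improving the performance of self-supervised monocular depth estimation in indoor environments. First, a depth factorization module with a transformer-based scale regression network is proposed to explicitly estimate a global depth scale factor, and the predicted scale factor can indicate the maximum depth values. Second, rather than using a single-stage pose estimation strategy as in previous methods, we propose a residual pose estimation module that estimates relative camera poses across consecutive frames iteratively. Third, to incorporate extensive coordinate guidance into our residual pose estimation module, we propose to perform coordinate convolutional encoding directly on the inputs to the pose networks. The proposed method is validated on a variety of benchmark indoor datasets, namely EuRoC MAV, NYUv2, ScanNet, and 7-Scenes, demonstrating state-of-the-art performance.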
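Coordinate convolutional encoding is usually understood as appending pixel-coordinate channels to an input before convolution, so that the network can see where each pixel lies. The sketch below illustrates that general idea on a nested-list image. The function name, the channel order, and the normalization to [-1, 1] are assumptions and may differ from the encoding used in the paper.

```python
def add_coord_channels(image):
    """Append normalized x and y coordinate channels to an image.

    image: list of channels, each an H x W nested list of floats.
    Returns a new list of channels with two extra channels (x, then y).
    """
    height = len(image[0])
    width = len(image[0][0])

    def normalize(index, size):
        # Map index 0..size-1 to the range [-1, 1]; a single pixel maps to 0.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    xs = [[normalize(x, width) for x in range(width)] for _ in range(height)]
    ys = [[normalize(y, height) for _ in range(width)] for y in range(height)]
    return image + [xs, ys]


# Hypothetical single-channel image with 2 rows and 3 columns.
gray = [[[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]]
encoded = add_coord_channels(gray)
print(len(encoded), encoded[1][0], encoded[2])
```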
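Building on recent progress in novel view synthesis, we propose applying it to improve monocular depth estimation. In particular, we propose a novel training method split into three main steps. First, the prediction results of a monocular depth network are warped to an additional viewpoint. Second, we apply an additional image synthesis network, which corrects and improves the quality of the warped RGB image. The output of this network is required to look as similar as possible to the ground-truth view by minimizing the pixel-wise RGB reconstruction error. Third, we reapply the same monocular depth estimation to the synthesized second viewpoint and ensure that the depth predictions are consistent with the associated ground-truth depth. Experimental results show that our method achieves state-of-the-art or comparable performance on the KITTI and NYU-Depth-v2 datasets with a lightweight and simple vanilla U-Net architecture.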
We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. In common with recent work [10,14,16], we use an end-to-end learning approach with view synthesis as the supervisory signal. In contrast to the previous work, our method is completely unsupervised, requiring only monocular video sequences for training. Our method uses single-view depth and multiview pose networks, with a loss based on warping nearby views to the target using the computed depth and pose. The networks are thus coupled by the loss during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performs comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performs favorably compared to established SLAM systems under comparable input settings.
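Modern computer vision has moved beyond the domain of internet photo collections and into the physical world, guiding camera-equipped robots and autonomous cars through unstructured environments. To enable these embodied agents to interact with real-world objects, cameras are increasingly used as depth sensors, reconstructing the environment for a variety of downstream reasoning tasks. Machine-learning-aided depth perception, or depth estimation, predicts a distance for each pixel in an image. While impressive progress has been made in depth estimation, significant challenges remain: (1) ground-truth depth labels are difficult to collect at scale, (2) camera information is typically assumed to be known but is often unreliable, and (3) restrictive camera assumptions are common, even though a wide variety of camera types and lenses are used in practice. In this thesis, we focus on relaxing these assumptions and describe contributions toward the ultimate goal of turning cameras into truly generic depth sensors.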
In this paper, we target the problem of learning a generalizable dynamic radiance field from monocular videos. Unlike most existing NeRF methods, which are based on multiple views, monocular videos contain only one view at each timestamp and therefore suffer from ambiguity along the view direction when estimating point features and scene flows. Previous studies such as DynNeRF disambiguate point features by positional encoding, which is not transferable and severely limits the generalization ability. As a result, these methods have to train one independent model for each scene and suffer from heavy computational costs when applied to the growing number of monocular videos in real-world applications. To address this, we propose MonoNeRF to simultaneously learn point features and scene flows with point trajectory and feature correspondence constraints across frames. More specifically, we learn an implicit velocity field to estimate point trajectory from temporal features with Neural ODE, which is followed by a flow-based feature aggregation module to obtain spatial features along the point trajectory. We jointly optimize temporal and spatial features by training the network in an end-to-end manner. Experiments show that our MonoNeRF is able to learn from multiple scenes and support new applications such as scene editing, unseen frame synthesis, and fast novel scene adaptation.
Figure 1. Our method takes a single casually captured video as input and learns a space-time neural irradiance field. (Top) Sample frames from the input video. (Middle) Novel view images rendered from textured meshes constructed from depth maps. (Bottom) Our results rendered from the proposed space-time neural irradiance field. See https://video-nerf.github.io.
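Uncertainty plays a key role in future prediction. The future is uncertain, which means there may be many possible futures. To be robust, a future prediction method should cover the full range of possibilities. In autonomous driving, covering multiple modes in the prediction stage is crucial for making safety-critical decisions. Although computer vision systems have advanced tremendously in recent years, future prediction remains difficult today. Some of the reasons are the uncertainty of the future, the requirement of full scene understanding, and the noisy output space. In this thesis, we propose solutions to these challenges by explicitly modeling motion in a stochastic way and learning the temporal dynamics in a latent space.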
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
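Components (i) and (iii) above can be illustrated on toy data. The sketch below takes, for each pixel, the minimum photometric error over several warped source frames, and skips pixels where an unwarped source frame already matches the target at least as well. It is a simplified reading of the abstract: the function name is hypothetical, the images are flat lists of grayscale values, and a plain absolute difference stands in for whatever photometric error the paper actually uses.

```python
def min_reprojection_loss(target, warped_sources, raw_sources):
    """Average per-pixel minimum reprojection error with auto-masking.

    target:         flat list of grayscale intensities
    warped_sources: source frames warped into the target view
    raw_sources:    the same source frames without warping
    """
    total, kept = 0.0, 0
    for i, t in enumerate(target):
        # (i) Minimum over sources: an occluded pixel in one source frame
        # can still be explained by another frame.
        warped_err = min(abs(t - w[i]) for w in warped_sources)
        # (iii) Auto-mask: if the unwarped frame already matches, the pixel
        # likely violates the moving-camera assumption, so it is ignored.
        identity_err = min(abs(t - s[i]) for s in raw_sources)
        if warped_err < identity_err:
            total += warped_err
            kept += 1
    return total / kept if kept else 0.0


# Hypothetical two-pixel example: the second pixel is static and gets masked out.
loss = min_reprojection_loss(
    target=[0.5, 0.5],
    warped_sources=[[0.75, 0.0], [1.0, 0.25]],
    raw_sources=[[0.0, 0.5], [1.0, 0.5]],
)
print(loss)
```

Taking the minimum rather than the average keeps a single occluded source frame from dominating the error at a pixel.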