3D reconstruction and novel view synthesis of dynamic scenes from collections of single views recently gained increased attention. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is severely limited in the training speed and angular range for generating novel views. This paper addresses these limitations and proposes a new method for full 360{\deg} novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) An efficient deformation module that decouples the processing of spatial and temporal information for acceleration at training and inference time; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field. We evaluate the proposed approach on the established synthetic D-NeRF benchmark, that enables efficient reconstruction from a single monocular view per time-frame randomly sampled from a full hemisphere. We refer to this form of inputs as monocularized data. To prove its practicality for real-world scenarios, we recorded twelve challenging sequences with human actors by sampling single frames from a synchronized multi-view rig. In both cases, our method is trained significantly faster than previous methods (minutes instead of days) while achieving higher visual accuracy for generated novel views. Our source code and data is available at our project page https://graphics.tu-bs.de/publications/kappel2022fast.
translated by 谷歌翻译
Figure 1. Given a monocular image sequence, NR-NeRF reconstructs a single canonical neural radiance field to represent geometry and appearance, and a per-time-step deformation field. We can render the scene into a novel spatio-temporal camera trajectory that significantly differs from the input trajectory. NR-NeRF also learns rigidity scores and correspondences without direct supervision on either. We can use the rigidity scores to remove the foreground, we can supersample along the time dimension, and we can exaggerate or dampen motion.
translated by 谷歌翻译
Point of View & TimeFigure 1: We propose D-NeRF, a method for synthesizing novel views, at an arbitrary point in time, of dynamic scenes with complex non-rigid geometries. We optimize an underlying deformable volumetric function from a sparse set of input monocular views without the need of ground-truth geometry nor multi-view images. The figure shows two scenes under variable points of view and time instances synthesised by the proposed model.
translated by 谷歌翻译
最近,神经辐射场(NERF)正在彻底改变新型视图合成(NVS)的卓越性能。但是,NERF及其变体通常需要进行冗长的每场训练程序,其中将多层感知器(MLP)拟合到捕获的图像中。为了解决挑战,已经提出了体素网格表示,以显着加快训练的速度。但是,这些现有方法只能处理静态场景。如何开发有效,准确的动态视图合成方法仍然是一个开放的问题。将静态场景的方法扩展到动态场景并不简单,因为场景几何形状和外观随时间变化。在本文中,基于素素网格优化的最新进展,我们提出了一种快速变形的辐射场方法来处理动态场景。我们的方法由两个模块组成。第一个模块采用变形网格来存储3D动态功能,以及使用插值功能将观测空间中的3D点映射到规范空间的变形的轻巧MLP。第二个模块包含密度和颜色网格,以建模场景的几何形状和密度。明确对阻塞进行了建模,以进一步提高渲染质量。实验结果表明,我们的方法仅使用20分钟的训练就可以实现与D-NERF相当的性能,该训练比D-NERF快70倍以上,这清楚地证明了我们提出的方法的效率。
translated by 谷歌翻译
Recent methods for neural surface representation and rendering, for example NeuS, have demonstrated remarkably high-quality reconstruction of static scenes. However, the training of NeuS takes an extremely long time (8 hours), which makes it almost impossible to apply them to dynamic scenes with thousands of frames. We propose a fast neural surface reconstruction approach, called NeuS2, which achieves two orders of magnitude improvement in terms of acceleration without compromising reconstruction quality. To accelerate the training process, we integrate multi-resolution hash encodings into a neural surface representation and implement our whole algorithm in CUDA. We also present a lightweight calculation of second-order derivatives tailored to our networks (i.e., ReLU-based MLPs), which achieves a factor two speed up. To further stabilize training, a progressive learning strategy is proposed to optimize multi-resolution hash encodings from coarse to fine. In addition, we extend our method for reconstructing dynamic scenes with an incremental training strategy. Our experiments on various datasets demonstrate that NeuS2 significantly outperforms the state-of-the-arts in both surface reconstruction accuracy and training speed. The video is available at https://vcai.mpi-inf.mpg.de/projects/NeuS2/ .
translated by 谷歌翻译
In this paper, we take a significant step towards real-world applicability of monocular neural avatar reconstruction by contributing InstantAvatar, a system that can reconstruct human avatars from a monocular video within seconds, and these avatars can be animated and rendered at an interactive rate. To achieve this efficiency we propose a carefully designed and engineered system, that leverages emerging acceleration structures for neural fields, in combination with an efficient empty space-skipping strategy for dynamic scenes. We also contribute an efficient implementation that we will make available for research purposes. Compared to existing methods, InstantAvatar converges 130x faster and can be trained in minutes instead of hours. It achieves comparable or even better reconstruction quality and novel pose synthesis results. When given the same time budget, our method significantly outperforms SoTA methods. InstantAvatar can yield acceptable visual quality in as little as 10 seconds training time.
translated by 谷歌翻译
神经辐射场(NERF)在建模3D场景和合成新型视图图像方面取得了巨大成功。但是,大多数以前的NERF方法需要大量时间来优化一个场景。显式数据结构,例如体素特征,显示出加速训练过程的巨大潜力。但是,体素特征面临两个大挑战,要应用于动态场景,即建模时间信息并捕获不同的点运动尺度。我们通过用时间感知的体素特征(称为Tineuvox)表示场景来提出一个辐射现场框架。引入了一个微小的坐标变形网络,以模拟粗糙运动轨迹,并在辐射网络中进一步增强了时间信息。提出了一种多距离插值方法,并应用于体素特征,以模拟小运动和大型运动。我们的框架大大加快了动态光芒度场的优化,同时保持高渲染质量。经验评估均在合成场景和真实场景上进行。我们的Tineuvox仅需8分钟和8 MB的存储成本即可完成培训,同时表现出比以前的动态NERF方法相似甚至更好的渲染性能。
translated by 谷歌翻译
最近,我们看到了照片真实的人类建模和渲染的神经进展取得的巨大进展。但是,将它们集成到现有的下游应用程序中的现有网络管道中仍然具有挑战性。在本文中,我们提出了一种全面的神经方法,用于从密集的多视频视频中对人类表演进行高质量重建,压缩和渲染。我们的核心直觉是用一系列高效的神经技术桥接传统的动画网格工作流程。我们首先引入一个神经表面重建器,以在几分钟内进行高质量的表面产生。它与多分辨率哈希编码的截短签名距离场(TSDF)的隐式体积渲染相结合。我们进一步提出了一个混合神经跟踪器来生成动画网格,该网格将明确的非刚性跟踪与自我监督框架中的隐式动态变形结合在一起。前者将粗糙的翘曲返回到规范空间中,而后者隐含的一个隐含物进一步预测了使用4D哈希编码的位移,如我们的重建器中。然后,我们使用获得的动画网格讨论渲染方案,从动态纹理到各种带宽设置下的Lumigraph渲染。为了在质量和带宽之间取得复杂的平衡,我们通过首先渲染6个虚拟视图来涵盖表演者,然后进行闭塞感知的神经纹理融合,提出一个分层解决方案。我们证明了我们方法在各种平台上的各种基于网格的应用程序和照片真实的自由观看体验中的功效,即,通过移动AR插入虚拟人类的表演,或通过移动AR插入真实环境,或带有VR头戴式的人才表演。
translated by 谷歌翻译
Figure 1: Our method can synthesize novel views in both space and time from a single monocular video of a dynamic scene. Here we show video results with various configurations of fixing and interpolating view and time (left), as well as a visualization of the recovered scene geometry (right). Please view with Adobe Acrobat or KDE Okular to see animations.
translated by 谷歌翻译
最近的神经人类表示可以产生高质量的多视图渲染,但需要使用密集的多视图输入和昂贵的培训。因此,它们在很大程度上仅限于静态模型,因为每个帧都是不可行的。我们展示了人类学 - 一种普遍的神经表示 - 用于高保真自由观察动态人类的合成。类似于IBRNET如何通过避免每场景训练来帮助NERF,Humannerf跨多视图输入采用聚合像素对准特征,以及用于解决动态运动的姿势嵌入的非刚性变形场。原始人物员已经可以在稀疏视频输入的稀疏视频输入上产生合理的渲染。为了进一步提高渲染质量,我们使用外观混合模块增强了我们的解决方案,用于组合神经体积渲染和神经纹理混合的益处。各种多视图动态人类数据集的广泛实验证明了我们在挑战运动中合成照片 - 现实自由观点的方法和非常稀疏的相机视图输入中的普遍性和有效性。
translated by 谷歌翻译
本文旨在减少透明辐射场的渲染时间。一些最近的作品用图像编码器配备了神经辐射字段,能够跨越场景概括,这避免了每场景优化。但是,它们的渲染过程通常很慢。主要因素是,在推断辐射场时,它们在空间中的大量点。在本文中,我们介绍了一个混合场景表示,它结合了最佳的隐式辐射场和显式深度映射,以便有效渲染。具体地,我们首先构建级联成本量,以有效地预测场景的粗糙几何形状。粗糙几何允许我们在场景表面附近的几个点来样,并显着提高渲染速度。该过程是完全可疑的,使我们能够仅从RGB图像共同学习深度预测和辐射现场网络。实验表明,该方法在DTU,真正的前瞻性和NERF合成数据集上展示了最先进的性能,而不是比以前的最可推广的辐射现场方法快至少50倍。我们还展示了我们的方法实时综合动态人类执行者的自由观点视频。代码将在https://zju3dv.github.io/enerf/处提供。
translated by 谷歌翻译
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
translated by 谷歌翻译
我们介绍了一个自由视的渲染方法 - Humannerf - 这对人类进行了复杂的身体运动的给定单曲视频工作,例如,来自YouTube的视频。我们的方法可以在任何帧中暂停视频,并从任意新相机视点呈现对象,甚至是该特定帧和身体姿势的完整360度摄像机路径。这项任务特别具有挑战性,因为它需要合成身体的光电型细节,如从输入视频中可能不存在的各种相机角度所见,以及合成布折叠和面部外观的细细节。我们的方法优化了在规范T型姿势中的人的体积表示,同时通过运动场,该运动场通过向后的警报将估计的规范表示映射到视频的每个帧。运动场分解成骨骼刚性和非刚性运动,由深网络产生。我们对现有工作显示出显着的性能改进,以及从移动人类的单眼视频的令人尖锐的观点渲染的阐释示例,以挑战不受控制的捕获场景。
translated by 谷歌翻译
综合照片 - 现实图像和视频是计算机图形的核心,并且是几十年的研究焦点。传统上,使用渲染算法(如光栅化或射线跟踪)生成场景的合成图像,其将几何形状和材料属性的表示为输入。统称,这些输入定义了实际场景和呈现的内容,并且被称为场景表示(其中场景由一个或多个对象组成)。示例场景表示是具有附带纹理的三角形网格(例如,由艺术家创建),点云(例如,来自深度传感器),体积网格(例如,来自CT扫描)或隐式曲面函数(例如,截短的符号距离)字段)。使用可分辨率渲染损耗的观察结果的这种场景表示的重建被称为逆图形或反向渲染。神经渲染密切相关,并将思想与经典计算机图形和机器学习中的思想相结合,以创建用于合成来自真实观察图像的图像的算法。神经渲染是朝向合成照片现实图像和视频内容的目标的跨越。近年来,我们通过数百个出版物显示了这一领域的巨大进展,这些出版物显示了将被动组件注入渲染管道的不同方式。这种最先进的神经渲染进步的报告侧重于将经典渲染原则与学习的3D场景表示结合的方法,通常现在被称为神经场景表示。这些方法的一个关键优势在于它们是通过设计的3D-一致,使诸如新颖的视点合成捕获场景的应用。除了处理静态场景的方法外,我们还涵盖了用于建模非刚性变形对象的神经场景表示...
translated by 谷歌翻译
到目前为止,已经研究了基于学习坐标的体积3D场景表示,例如神经辐射场(NERF),假设RGB或RGB-D图像是输入。同时,从神经科学文献中知道,人类视觉系统(HVS)的定制是为了处理异步亮度而不是同步的RGB图像,以构建和不断更新周围环境的心理3D表示,以进行导航和生存。受HVS原理启发的视觉传感器是事件摄像机。因此,事件是稀疏和异步的每个像素亮度(或颜色通道)更改信号。与神经3D场景表示学习的现有作品相反,本文从新的角度解决了问题。我们证明,可以从异步事件流中学习适用于RGB空间中新型视图合成的NERF。我们的模型在RGB空间中具有挑战性场景的新颖的视野具有很高的视觉准确性,即使它们的数据训练得多(即,来自单个事件摄像机的事件流围绕对象移动)并更有效(由于其效率更高(由于其培训)(由于事件流的固有稀疏性)比现有的NERF模型接受了RGB图像。我们将发布我们的数据集和源代码,请参见https://4dqv.mpi-inf.mpg.de/eventnerf/。
translated by 谷歌翻译
我们向渲染和时间(4D)重建人类的渲染和时间(4D)重建的神经辐射场,通过稀疏的摄像机捕获或甚至来自单眼视频。我们的方法将思想与神经场景表示,新颖的综合合成和隐式统计几何人称的人类表示相结合,耦合使用新颖的损失功能。在先前使用符号距离功能表示的结构化隐式人体模型,而不是使用统一的占用率来学习具有统一占用的光域字段。这使我们能够从稀疏视图中稳健地融合信息,并概括超出在训练中观察到的姿势或视图。此外,我们应用几何限制以共同学习观察到的主题的结构 - 包括身体和衣服 - 并将辐射场正规化为几何合理的解决方案。在多个数据集上的广泛实验证明了我们方法的稳健性和准确性,其概括能力显着超出了一系列的姿势和视图,以及超出所观察到的形状的统计外推。
translated by 谷歌翻译
https://video-nerf.github.io Figure 1. Our method takes a single casually captured video as input and learns a space-time neural irradiance field. (Top) Sample frames from the input video. (Middle) Novel view images rendered from textured meshes constructed from depth maps. (Bottom) Our results rendered from the proposed space-time neural irradiance field.
translated by 谷歌翻译
由于其显着的合成质量,最近,神经辐射场(NERF)最近对3D场景重建和新颖的视图合成进行了相当大的关注。然而,由散焦或运动引起的图像模糊,这通常发生在野外的场景中,显着降低了其重建质量。为了解决这个问题,我们提出了DeBlur-nerf,这是一种可以从模糊输入恢复尖锐的nerf的第一种方法。我们采用逐合成方法来通过模拟模糊过程来重建模糊的视图,从而使NERF对模糊输入的鲁棒。该仿真的核心是一种新型可变形稀疏内核(DSK)模块,其通过在每个空间位置变形规范稀疏内核来模拟空间变形模糊内核。每个内核点的射线起源是共同优化的,受到物理模糊过程的启发。该模块作为MLP参数化,具有能够概括为各种模糊类型。联合优化NERF和DSK模块允许我们恢复尖锐的NERF。我们证明我们的方法可用于相机运动模糊和散焦模糊:真实场景中的两个最常见的模糊。合成和现实世界数据的评估结果表明,我们的方法优于几个基线。合成和真实数据集以及源代码将公开可用于促进未来的研究。
translated by 谷歌翻译
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of an MLP, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. Our project webpage is at dynibar.github.io.
translated by 谷歌翻译
我们提出了一种可区分的渲染算法,以进行有效的新型视图合成。通过偏离基于音量的表示,支持学习点表示,我们在训练和推理方面的内存和运行时范围内改进了现有方法的数量级。该方法从均匀采样的随机点云开始,并使用基于可区分的SPLAT渲染器来发展模型以匹配一组输入图像,从而学习了每点位置和观看依赖性外观。在训练和推理中,我们的方法比NERF快300倍,质量只有边缘牺牲,而在静态场景中使用少于10 〜MB的记忆。对于动态场景,我们的方法比Stnerf训练两个数量级,并以接近互动速率渲染,同时即使在不施加任何时间固定的正则化合物的情况下保持较高的图像质量和时间连贯性。
translated by 谷歌翻译