我们提出了神经像素组成(NPC),这是一种连续3D-4D视图合成的新方法,只有一组离散的多视图观测值作为输入。现有的最新方法需要密集的多视图监督和广泛的计算预算。所提出的配方可靠地在稀疏和宽基线的多视图图像上运行,并且可以在几秒钟至10分钟内进行高分辨率(12MP)含量(即比现有方法快200-400x的收敛速度)进行有效训练。对我们的方法至关重要的是两个核心新颖性:1)像素的表示,其中包含从视线沿特定位置和时间的多视图中积累的颜色和深度信息,以及2)多层感知器(MLP)这使得为​​像素位置提供了这种丰富信息的组成,以获得最终的颜色输出。我们尝试了各种各样的多视图序列,与现有方法相比,并在各种和挑战性的环境中获得更好的结果。最后,我们的方法从稀疏的多视图中实现了密集的3D重建,其中Colmap是最先进的3D重建方法,挣扎。
translated by 谷歌翻译
3D reconstruction and novel view synthesis of dynamic scenes from collections of single views recently gained increased attention. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is severely limited in the training speed and angular range for generating novel views. This paper addresses these limitations and proposes a new method for full 360{\deg} novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) An efficient deformation module that decouples the processing of spatial and temporal information for acceleration at training and inference time; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field. We evaluate the proposed approach on the established synthetic D-NeRF benchmark, that enables efficient reconstruction from a single monocular view per time-frame randomly sampled from a full hemisphere. We refer to this form of inputs as monocularized data. To prove its practicality for real-world scenarios, we recorded twelve challenging sequences with human actors by sampling single frames from a synchronized multi-view rig. In both cases, our method is trained significantly faster than previous methods (minutes instead of days) while achieving higher visual accuracy for generated novel views. Our source code and data is available at our project page https://graphics.tu-bs.de/publications/kappel2022fast.
translated by 谷歌翻译
基于图像的体积人类使用像素对齐的特征有望泛化,从而看不见姿势和身份。先前的工作利用全局空间编码和多视图几何一致性来减少空间歧义。但是,全球编码通常会过度适应培训数据的分布,并且很难从稀疏视图中学习多视图一致的重建。在这项工作中,我们研究了现有空间编码的常见问题,并提出了一种简单而高效的方法,可以从稀疏视图中对高保真体积的人类进行建模。关键思想之一是通过稀疏3D关键点编码相对空间3D信息。这种方法对观点和跨数据库域间隙的稀疏性很强。我们的方法的表现优于头部重建的最先进方法。关于人体的重建是看不见的受试者,我们还实现了与使用参数人体模型和时间特征聚集的先前工作相当的性能。 Our experiments show that a majority of errors in prior work stem from an inappropriate choice of spatial encoding and thus we suggest a new direction for high-fidelity image-based human modeling. https://markomih.github.io/keypointnerf
translated by 谷歌翻译
Figure 1: Our method can synthesize novel views in both space and time from a single monocular video of a dynamic scene. Here we show video results with various configurations of fixing and interpolating view and time (left), as well as a visualization of the recovered scene geometry (right). Please view with Adobe Acrobat or KDE Okular to see animations.
translated by 谷歌翻译
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multiview posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods. 1
translated by 谷歌翻译
新型视图综合的古典光场渲染可以准确地再现视图依赖性效果,例如反射,折射和半透明,但需要一个致密的视图采样的场景。基于几何重建的方法只需要稀疏的视图,但不能准确地模拟非兰伯语的效果。我们介绍了一个模型,它结合了强度并减轻了这两个方向的局限性。通过在光场的四维表示上操作,我们的模型学会准确表示依赖视图效果。通过在训练和推理期间强制执行几何约束,从稀疏的视图集中毫无屏蔽地学习场景几何。具体地,我们介绍了一种基于两级变压器的模型,首先沿着ePipoll线汇总特征,然后沿参考视图聚合特征以产生目标射线的颜色。我们的模型在多个前进和360 {\ DEG}数据集中优于最先进的,具有较大的差别依赖变化的场景更大的边缘。
translated by 谷歌翻译
在本文中,我们为复杂场景进行了高效且强大的深度学习解决方案。在我们的方法中,3D场景表示为光场,即,一组光线,每组在到达图像平面时具有相应的颜色。对于高效的新颖视图渲染,我们采用了光场的双面参数化,其中每个光线的特征在于4D参数。然后,我们将光场配向作为4D函数,即将4D坐标映射到相应的颜色值。我们训练一个深度完全连接的网络以优化这种隐式功能并记住3D场景。然后,特定于场景的模型用于综合新颖视图。与以前需要密集的视野的方法不同,需要密集的视野采样来可靠地呈现新颖的视图,我们的方法可以通过采样光线来呈现新颖的视图并直接从网络查询每种光线的颜色,从而使高质量的灯场呈现稀疏集合训练图像。网络可以可选地预测每光深度,从而使诸如自动重新焦点的应用。我们的小说视图合成结果与最先进的综合结果相当,甚至在一些具有折射和反射的具有挑战性的场景中优越。我们在保持交互式帧速率和小的内存占地面积的同时实现这一点。
translated by 谷歌翻译
我们提出了HRF-NET,这是一种基于整体辐射场的新型视图合成方法,该方法使用一组稀疏输入来呈现新视图。最近的概括视图合成方法还利用了光辉场,但渲染速度不是实时的。现有的方法可以有效地训练和呈现新颖的观点,但它们无法概括地看不到场景。我们的方法解决了用于概括视图合成的实时渲染问题,并由两个主要阶段组成:整体辐射场预测指标和基于卷积的神经渲染器。该架构不仅基于隐式神经场的一致场景几何形状,而且还可以使用单个GPU有效地呈现新视图。我们首先在DTU数据集的多个3D场景上训练HRF-NET,并且网络只能仅使用光度损耗就看不见的真实和合成数据产生合理的新视图。此外,我们的方法可以利用单个场景的密集参考图像集来产生准确的新颖视图,而无需依赖其他明确表示,并且仍然保持了预训练模型的高速渲染。实验结果表明,HRF-NET优于各种合成和真实数据集的最先进的神经渲染方法。
translated by 谷歌翻译
本文旨在减少透明辐射场的渲染时间。一些最近的作品用图像编码器配备了神经辐射字段,能够跨越场景概括,这避免了每场景优化。但是,它们的渲染过程通常很慢。主要因素是,在推断辐射场时,它们在空间中的大量点。在本文中,我们介绍了一个混合场景表示,它结合了最佳的隐式辐射场和显式深度映射,以便有效渲染。具体地,我们首先构建级联成本量,以有效地预测场景的粗糙几何形状。粗糙几何允许我们在场景表面附近的几个点来样,并显着提高渲染速度。该过程是完全可疑的,使我们能够仅从RGB图像共同学习深度预测和辐射现场网络。实验表明,该方法在DTU,真正的前瞻性和NERF合成数据集上展示了最先进的性能,而不是比以前的最可推广的辐射现场方法快至少50倍。我们还展示了我们的方法实时综合动态人类执行者的自由观点视频。代码将在https://zju3dv.github.io/enerf/处提供。
translated by 谷歌翻译
Figure 1. Given a monocular image sequence, NR-NeRF reconstructs a single canonical neural radiance field to represent geometry and appearance, and a per-time-step deformation field. We can render the scene into a novel spatio-temporal camera trajectory that significantly differs from the input trajectory. NR-NeRF also learns rigidity scores and correspondences without direct supervision on either. We can use the rigidity scores to remove the foreground, we can supersample along the time dimension, and we can exaggerate or dampen motion.
translated by 谷歌翻译
我们介绍了神经点光场,它用稀疏点云上的轻场隐含地表示场景。结合可分辨率的体积渲染与学习的隐式密度表示使得可以合成用于小型场景的新颖视图的照片现实图像。作为神经体积渲染方法需要潜在的功能场景表示的浓密采样,在沿着射线穿过体积的数百个样本,它们从根本上限制在具有投影到数百个训练视图的相同对象的小场景。向神经隐式光线推广稀疏点云允许我们有效地表示每个光线的单个隐式采样操作。这些点光场作为光线方向和局部点特征邻域的函数,允许我们在没有密集的物体覆盖和视差的情况下插入光场条件训练图像。我们评估大型驾驶场景的新型视图综合的提出方法,在那里我们综合了现实的看法,即现有的隐式方法未能代表。我们验证了神经点光场可以通过显式建模场景来实现沿着先前轨迹的视频来预测沿着看不见的轨迹的视频。
translated by 谷歌翻译
我们提出了一种新的方法来获取来自在线图像集合的对象表示,从具有不同摄像机,照明和背景的照片捕获任意物体的高质量几何形状和材料属性。这使得各种以各种对象渲染应用诸如新颖的综合,致密和协调的背景组合物,从疯狂的内部输入。使用多级方法延伸神经辐射场,首先推断表面几何形状并优化粗估计的初始相机参数,同时利用粗糙的前景对象掩模来提高训练效率和几何质量。我们还介绍了一种强大的正常估计技术,其消除了几何噪声的效果,同时保持了重要细节。最后,我们提取表面材料特性和环境照明,以球形谐波表示,具有处理瞬态元素的延伸部,例如,锋利的阴影。这些组件的结合导致高度模块化和有效的对象采集框架。广泛的评估和比较证明了我们在捕获高质量的几何形状和外观特性方面的方法,可用于渲染应用。
translated by 谷歌翻译
由于其简单性和最先进的性能,神经辐射场(NERF)被出现为新型视图综合任务的强大表示。虽然NERF可以在许多输入视图可用时产生看不见的观点的光静观渲染,但是当该数量减少时,其性能显着下降。我们观察到,稀疏输入方案中的大多数伪像是由估计场景几何中的错误引起的,并且在训练开始时通过不同的行为引起。我们通过规范从未观察的视点呈现的修补程序的几何和外观来解决这一点,并在训练期间退火光线采样空间。我们还使用规范化的流模型来规范未观察的视点的颜色。我们的车型不仅优于优化单个场景的其他方法,而是在许多情况下,还有条件模型,这些模型在大型多视图数据集上广泛预先培训。
translated by 谷歌翻译
https://video-nerf.github.io Figure 1. Our method takes a single casually captured video as input and learns a space-time neural irradiance field. (Top) Sample frames from the input video. (Middle) Novel view images rendered from textured meshes constructed from depth maps. (Bottom) Our results rendered from the proposed space-time neural irradiance field.
translated by 谷歌翻译
我们向渲染和时间(4D)重建人类的渲染和时间(4D)重建的神经辐射场,通过稀疏的摄像机捕获或甚至来自单眼视频。我们的方法将思想与神经场景表示,新颖的综合合成和隐式统计几何人称的人类表示相结合,耦合使用新颖的损失功能。在先前使用符号距离功能表示的结构化隐式人体模型,而不是使用统一的占用率来学习具有统一占用的光域字段。这使我们能够从稀疏视图中稳健地融合信息,并概括超出在训练中观察到的姿势或视图。此外,我们应用几何限制以共同学习观察到的主题的结构 - 包括身体和衣服 - 并将辐射场正规化为几何合理的解决方案。在多个数据集上的广泛实验证明了我们方法的稳健性和准确性,其概括能力显着超出了一系列的姿势和视图,以及超出所观察到的形状的统计外推。
translated by 谷歌翻译
最近的神经人类表示可以产生高质量的多视图渲染,但需要使用密集的多视图输入和昂贵的培训。因此,它们在很大程度上仅限于静态模型,因为每个帧都是不可行的。我们展示了人类学 - 一种普遍的神经表示 - 用于高保真自由观察动态人类的合成。类似于IBRNET如何通过避免每场景训练来帮助NERF,Humannerf跨多视图输入采用聚合像素对准特征,以及用于解决动态运动的姿势嵌入的非刚性变形场。原始人物员已经可以在稀疏视频输入的稀疏视频输入上产生合理的渲染。为了进一步提高渲染质量,我们使用外观混合模块增强了我们的解决方案,用于组合神经体积渲染和神经纹理混合的益处。各种多视图动态人类数据集的广泛实验证明了我们在挑战运动中合成照片 - 现实自由观点的方法和非常稀疏的相机视图输入中的普遍性和有效性。
translated by 谷歌翻译
Photo-realistic free-viewpoint rendering of real-world scenes using classical computer graphics techniques is challenging, because it requires the difficult step of capturing detailed appearance and geometry models. Recent studies have demonstrated promising results by learning scene representations that implicitly encode both geometry and appearance without 3D supervision. However, existing approaches in practice often show blurry renderings caused by the limited network capacity or the difficulty in finding accurate intersections of camera rays with the scene geometry. Synthesizing high-resolution imagery from these representations often requires time-consuming optical ray marching. In this work, we introduce Neural Sparse Voxel Fields (NSVF), a new neural scene representation for fast and high-quality free-viewpoint rendering. NSVF defines a set of voxel-bounded implicit fields organized in a sparse voxel octree to model local properties in each cell. We progressively learn the underlying voxel structures with a diffentiable ray-marching operation from only a set of posed RGB images. With the sparse voxel octree structure, rendering novel views can be accelerated by skipping the voxels containing no relevant scene content. Our method is typically over 10 times faster than the state-of-the-art (namely, NeRF (Mildenhall et al., 2020)) at inference time while achieving higher quality results. Furthermore, by utilizing an explicit sparse voxel representation, our method can easily be applied to scene editing and scene composition. We also demonstrate several challenging tasks, including multi-scene learning, free-viewpoint rendering of a moving human, and large-scale scene rendering. Code and data are available at our website: https://github.com/facebookresearch/NSVF.
translated by 谷歌翻译
Point of View & TimeFigure 1: We propose D-NeRF, a method for synthesizing novel views, at an arbitrary point in time, of dynamic scenes with complex non-rigid geometries. We optimize an underlying deformable volumetric function from a sparse set of input monocular views without the need of ground-truth geometry nor multi-view images. The figure shows two scenes under variable points of view and time instances synthesised by the proposed model.
translated by 谷歌翻译
神经辐射场(NERFS)产生最先进的视图合成结果。然而,它们慢渲染,需要每像素数百个网络评估,以近似卷渲染积分。将nerfs烘烤到明确的数据结构中实现了有效的渲染,但导致内存占地面积的大幅增加,并且在许多情况下,质量降低。在本文中,我们提出了一种新的神经光场表示,相反,相反,紧凑,直接预测沿线的集成光线。我们的方法支持使用每个像素的单个网络评估,用于小基线光场数据集,也可以应用于每个像素的几个评估的较大基线。在我们的方法的核心,是一个光线空间嵌入网络,将4D射线空间歧管映射到中间可间可动子的潜在空间中。我们的方法在诸如斯坦福光场数据集等密集的前置数据集中实现了最先进的质量。此外,对于带有稀疏输入的面对面的场景,我们可以在质量方面实现对基于NERF的方法具有竞争力的结果,同时提供更好的速度/质量/内存权衡,网络评估较少。
translated by 谷歌翻译
We present Depth-aware Image-based NEural Radiance fields (DINER). Given a sparse set of RGB input views, we predict depth and feature maps to guide the reconstruction of a volumetric scene representation that allows us to render 3D objects under novel views. Specifically, we propose novel techniques to incorporate depth information into feature fusion and efficient scene sampling. In comparison to the previous state of the art, DINER achieves higher synthesis quality and can process input views with greater disparity. This allows us to capture scenes more completely without changing capturing hardware requirements and ultimately enables larger viewpoint changes during novel view synthesis. We evaluate our method by synthesizing novel views, both for human heads and for general objects, and observe significantly improved qualitative results and increased perceptual metrics compared to the previous state of the art. The code will be made publicly available for research purposes.
translated by 谷歌翻译