Volumetric neural rendering methods like NeRF generate high-quality view synthesis results but are optimized per scene, leading to prohibitive reconstruction time. On the other hand, deep multi-view stereo methods can quickly reconstruct scene geometry via direct network inference. Point-NeRF combines the advantages of these two approaches by using neural 3D point clouds, with associated neural features, to model a radiance field. Point-NeRF can be rendered efficiently by aggregating neural point features near scene surfaces in a ray-marching-based rendering pipeline. Moreover, Point-NeRF can be initialized via direct inference of a pre-trained deep network to produce a neural point cloud; this point cloud can be fine-tuned to surpass the visual quality of NeRF with 30x faster training time. Point-NeRF can be combined with other 3D reconstruction methods and handles the errors and outliers of such methods via a novel pruning and growing mechanism. Experiments on the DTU, NeRF Synthetic, ScanNet, and Tanks and Temples datasets demonstrate that Point-NeRF can surpass existing methods and achieve state-of-the-art results.
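A minimal sketch (not the authors' implementation) of the aggregation idea above: the radiance field is queried at ray-marching samples by pooling the features of nearby neural points with inverse-distance weights; the function and parameter names (`aggregate_point_features`, `radius`) are illustrative.

```python
import numpy as np

def aggregate_point_features(sample_xyz, point_xyz, point_feat, k=8, radius=0.2):
    """Inverse-distance-weighted aggregation of neural point features.

    sample_xyz: (3,)  query location on a camera ray
    point_xyz:  (N,3) neural point positions
    point_feat: (N,C) per-point neural features
    Returns an aggregated (C,) feature, or None if no point lies within
    `radius` (such empty samples can simply be skipped during ray marching).
    """
    d = np.linalg.norm(point_xyz - sample_xyz, axis=1)      # (N,)
    idx = np.argsort(d)[:k]                                 # K nearest points
    idx = idx[d[idx] < radius]
    if idx.size == 0:
        return None
    w = 1.0 / (d[idx] + 1e-8)                               # inverse-distance weights
    w = w / w.sum()
    return (w[:, None] * point_feat[idx]).sum(axis=0)       # (C,)

# toy usage: 1000 neural points with 32-dim features
pts = np.random.rand(1000, 3)
feats = np.random.randn(1000, 32)
f = aggregate_point_features(np.array([0.5, 0.5, 0.5]), pts, feats)
# `f` would then be decoded by small MLPs into density and view-dependent color.
```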
[Figure: a) source views; b) MVS-NeRF without fine-tuning (SSIM 0.766); c) MVS-NeRF after 6 min of fine-tuning (SSIM 0.923); d) NeRF after 5.1 h of optimization (SSIM 0.924). * Equal contribution. Research done while Anpei Chen was in a remote internship with UCSD.]
…generalizable radiance field reconstruction. Moreover, if dense images are captured, our estimated radiance field representation can be easily fine-tuned; this leads to fast per-scene reconstruction with higher rendering quality and substantially less optimization time than NeRF.
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multiview posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods.
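As a rough illustration of "drawing appearance information on the fly from source views", the sketch below projects each 3D ray sample into the source views and averages the looked-up colors; the learned per-view blending (MLP) and ray transformer are replaced here by a plain mean, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def project(K, w2c, xyz):
    """Pinhole projection of world points (N,3) into one source view.
    K: 3x3 intrinsics, w2c: 3x4 world-to-camera matrix; points are assumed
    to lie in front of the camera."""
    cam = (w2c[:, :3] @ xyz.T + w2c[:, 3:4]).T          # (N,3) camera coords
    uvw = (K @ cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]                     # (N,2) pixel coords

def gather_source_colors(xyz, images, Ks, w2cs):
    """For every 3D ray sample, look up its color in each source view
    (nearest pixel) and average the results -- a crude stand-in for the
    learned aggregation."""
    out = np.zeros((xyz.shape[0], 3))
    for img, K, w2c in zip(images, Ks, w2cs):
        uv = np.round(project(K, w2c, xyz)).astype(int)
        uv = np.clip(uv, 0, np.array(img.shape[:2])[::-1] - 1)
        out += img[uv[:, 1], uv[:, 0]]                  # rows indexed by v, cols by u
    return out / len(images)

# toy usage: two 32x32 source views sharing a simple camera at the origin
K = np.array([[32.0, 0.0, 16.0], [0.0, 32.0, 16.0], [0.0, 0.0, 1.0]])
w2c = np.hstack([np.eye(3), np.zeros((3, 1))])
imgs = [np.random.rand(32, 32, 3) for _ in range(2)]
cols = gather_source_colors(np.array([[0.0, 0.0, 2.0]]), imgs, [K, K], [w2c, w2c])
```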
We introduce Neural Point Light Fields, which implicitly represent scenes with a light field living on a sparse point cloud. Combining differentiable volume rendering with learned implicit density representations has made it possible to synthesize photo-realistic images for novel views of small scenes. Because neural volumetric rendering methods require dense sampling of the underlying functional scene representation, with hundreds of samples along each ray cast through the volume, they are fundamentally limited to small scenes with the same objects projected into hundreds of training views. Promoting sparse point clouds to neural implicit light fields allows us to represent scenes effectively with a single implicit sampling operation per ray. These point light fields are a function of the ray direction and local point feature neighborhoods, allowing us to interpolate the light field conditioned on the training images without dense object coverage and parallax. We evaluate the proposed method for novel view synthesis on large driving scenes, where we synthesize realistic unseen views that existing implicit approaches fail to represent. We validate that Neural Point Light Fields make it possible to predict videos along unseen trajectories, which was previously only feasible by explicitly modeling the scene.
Photo-realistic free-viewpoint rendering of real-world scenes using classical computer graphics techniques is challenging, because it requires the difficult step of capturing detailed appearance and geometry models. Recent studies have demonstrated promising results by learning scene representations that implicitly encode both geometry and appearance without 3D supervision. However, existing approaches in practice often show blurry renderings caused by the limited network capacity or the difficulty in finding accurate intersections of camera rays with the scene geometry. Synthesizing high-resolution imagery from these representations often requires time-consuming optical ray marching. In this work, we introduce Neural Sparse Voxel Fields (NSVF), a new neural scene representation for fast and high-quality free-viewpoint rendering. NSVF defines a set of voxel-bounded implicit fields organized in a sparse voxel octree to model local properties in each cell. We progressively learn the underlying voxel structures with a differentiable ray-marching operation from only a set of posed RGB images. With the sparse voxel octree structure, rendering novel views can be accelerated by skipping the voxels containing no relevant scene content. Our method is typically over 10 times faster than the state-of-the-art (namely, NeRF (Mildenhall et al., 2020)) at inference time while achieving higher quality results. Furthermore, by utilizing an explicit sparse voxel representation, our method can easily be applied to scene editing and scene composition. We also demonstrate several challenging tasks, including multi-scene learning, free-viewpoint rendering of a moving human, and large-scale scene rendering. Code and data are available at our website: https://github.com/facebookresearch/NSVF.
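A small sketch of the empty-space skipping enabled by a sparse voxel structure: only ray-marching samples that fall inside occupied voxels are kept for network evaluation. The flat `occupied` set stands in for the sparse voxel octree, and all names are illustrative.

```python
import numpy as np

def march_with_voxel_skipping(origin, direction, occupied, voxel_size=0.1,
                              t_near=0.0, t_far=3.0, step=0.02):
    """Return the ray-marching sample distances that fall inside occupied voxels.

    `occupied` is a set of integer voxel coordinates (i, j, k) coming from the
    sparse structure; samples in empty space are skipped and never reach the
    implicit field."""
    kept = []
    for t in np.arange(t_near, t_far, step):
        p = origin + t * direction
        voxel = tuple(np.floor(p / voxel_size).astype(int))
        if voxel in occupied:
            kept.append(t)
    return np.array(kept)

# toy usage: a single occupied voxel, hit by a diagonal ray
occupied = {(5, 5, 5)}
samples = march_with_voxel_skipping(np.zeros(3), np.ones(3) / np.sqrt(3), occupied)
```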
We propose a differentiable rendering algorithm for efficient novel view synthesis. By departing from volume-based representations in favor of a learned point representation, we improve on existing methods by orders of magnitude in memory and runtime, both in training and inference. The method begins with a uniformly sampled random point cloud and learns per-point positions and view-dependent appearance, using a differentiable splat-based renderer to evolve the model to match a set of input images. Our method is up to 300x faster than NeRF in both training and inference, with only a marginal sacrifice in quality, while using less than 10 MB of memory for a static scene. For dynamic scenes, our method trains two orders of magnitude faster than STNeRF and renders at near-interactive rates, while maintaining high image quality and temporal coherence even without imposing any temporal-coherency regularizers.
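A much-simplified sketch of the splat-based rendering idea: every learned point contributes a Gaussian footprint to the image and the footprints are composited front to back by depth. This is an illustrative stand-in for the paper's differentiable splat renderer (no learned per-point appearance), with hypothetical names.

```python
import numpy as np

def splat(points_xy, colors, depths, H=64, W=64, sigma=1.0):
    """Soft Gaussian splatting of projected points into an H x W image,
    composited front to back in depth order (a simplified differentiable
    point renderer)."""
    img = np.zeros((H, W, 3))
    acc = np.zeros((H, W, 1))                       # accumulated opacity
    ys, xs = np.mgrid[0:H, 0:W]
    for i in np.argsort(depths):                    # near-to-far compositing
        cx, cy = points_xy[i]
        alpha = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2)
                       / (2 * sigma ** 2))[..., None]
        img += (1.0 - acc) * alpha * colors[i]
        acc += (1.0 - acc) * alpha
    return img

# toy usage: 50 random points with random colors and depths
n = 50
image = splat(np.random.rand(n, 2) * 64, np.random.rand(n, 3), np.random.rand(n))
```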
This paper aims to reduce the rendering time of generalizable radiance fields. Some recent works equip neural radiance fields with image encoders, enabling them to generalize across scenes and avoiding per-scene optimization. However, their rendering process is generally very slow. A major factor is that they sample many points in empty space when inferring radiance fields. In this paper, we present a hybrid scene representation which combines the best of implicit radiance fields and explicit depth maps for efficient rendering. Specifically, we first build a cascade cost volume to efficiently predict the coarse geometry of the scene. The coarse geometry allows us to sample only a few points near the scene surface, significantly improving the rendering speed. This process is fully differentiable, enabling us to jointly learn the depth prediction and radiance field networks from only RGB images. Experiments show that the proposed approach exhibits state-of-the-art performance on the DTU, Real Forward-facing, and NeRF Synthetic datasets, while being at least 50 times faster than previous generalizable radiance field methods. We also demonstrate that our method can synthesize free-viewpoint videos of dynamic human performers in real time. The code will be available at https://zju3dv.github.io/enerf/.
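The speed-up comes from where the samples are placed. The sketch below contrasts standard uniform ray sampling with sampling only a couple of points in a thin interval around the coarse depth; the interval half-width `delta` and the function names are illustrative, not from the released code.

```python
import numpy as np

def uniform_samples(t_near, t_far, n=128):
    """Standard NeRF-style sampling: many query points spread over the whole ray."""
    return np.linspace(t_near, t_far, n)

def depth_guided_samples(coarse_depth, n=2, delta=0.05):
    """Sketch of depth-guided sampling: a handful of query points inside a thin
    interval around the coarse depth predicted from the cost volume."""
    return np.linspace(coarse_depth - delta, coarse_depth + delta, n)

dense = uniform_samples(0.5, 5.0)        # 128 network queries per ray
sparse = depth_guided_samples(2.3)       # 2 network queries per ray
```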
We present HRF-Net, a novel view synthesis method based on holistic radiance fields that renders novel views using a set of sparse inputs. Recent generalizing view synthesis methods also leverage radiance fields, but their rendering speed is not real-time. There are existing methods that train and render novel views efficiently, but they cannot generalize to unseen scenes. Our approach addresses the problem of real-time rendering for generalizing view synthesis and consists of two main stages: a holistic radiance field predictor and a convolution-based neural renderer. This architecture not only infers consistent scene geometry based on implicit neural fields but also renders new views efficiently on a single GPU. We first train HRF-Net on multiple 3D scenes of the DTU dataset, and the network can produce plausible novel views on unseen real and synthetic data using only photometric losses. Moreover, our method can leverage a denser set of reference images of a single scene to produce accurate novel views without relying on additional explicit representations, while still maintaining the high-speed rendering of the pre-trained model. Experimental results show that HRF-Net outperforms state-of-the-art neural rendering methods on various synthetic and real datasets.
In recent years we have witnessed rapid development in NeRF-based image rendering due to its high quality. However, point cloud rendering is somewhat less explored. Compared to NeRF-based rendering, which suffers from dense spatial sampling, point cloud rendering is naturally less computation-intensive, which enables its deployment on mobile computing devices. In this work, we focus on boosting the image quality of point cloud rendering with a compact model design. We first analyze the adaptation of the volume rendering formulation to point clouds. Based on the analysis, we simplify the NeRF representation to a spatial mapping function which requires only a single evaluation per pixel. Further, motivated by ray marching, we rectify the noisy raw point clouds to the estimated intersections between rays and surfaces as queried coordinates, which avoids spatial frequency collapse and neighbor point disturbance. Composed of rasterization, spatial mapping, and refinement stages, our method achieves state-of-the-art performance on point cloud rendering, outperforming prior works by notable margins with a smaller model size. We obtain a PSNR of 31.74 on NeRF-Synthetic, 25.88 on ScanNet and 30.81 on DTU. Code and data are publicly available at https://github.com/seanywang0408/RadianceMapping.
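A tiny sketch of the rectification step as described: a rasterized noisy point is snapped to the closest point on its pixel ray, which then serves as the single query coordinate for the spatial mapping function. The helper name `rectify_to_ray` is illustrative and this is not the released code.

```python
import numpy as np

def rectify_to_ray(origin, direction, point):
    """Snap a rasterized (noisy) point onto its pixel ray: the query coordinate
    is the foot of the perpendicular from the point to the ray, standing in for
    the estimated ray-surface intersection. `direction` is assumed unit-norm."""
    t = np.dot(point - origin, direction)
    return origin + t * direction

o, d = np.zeros(3), np.array([0.0, 0.0, 1.0])
q = rectify_to_ray(o, d, np.array([0.02, -0.01, 1.5]))   # -> [0, 0, 1.5]
```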
We present TensoRF, a novel approach to model and reconstruct radiance fields. Unlike NeRF that purely uses MLPs, we model the radiance field of a scene as a 4D tensor, which represents a 3D voxel grid with per-voxel multi-channel features. Our central idea is to factorize the 4D scene tensor into multiple compact low-rank tensor components. We demonstrate that applying traditional CP decomposition -- that factorizes tensors into rank-one components with compact vectors -- in our framework leads to improvements over vanilla NeRF. To further boost performance, we introduce a novel vector-matrix (VM) decomposition that relaxes the low-rank constraints for two modes of a tensor and factorizes tensors into compact vector and matrix factors. Beyond superior rendering quality, our models with CP and VM decompositions lead to a significantly lower memory footprint in comparison to previous and concurrent works that directly optimize per-voxel features. Experimentally, we demonstrate that TensoRF with CP decomposition achieves fast reconstruction (<30 min) with better rendering quality and even a smaller model size (<4 MB) compared to NeRF. Moreover, TensoRF with VM decomposition further boosts rendering quality and outperforms previous state-of-the-art methods, while reducing the reconstruction time (<10 min) and retaining a compact model size (<75 MB).
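A minimal sketch of the CP-decomposition idea for the density part of the field only (the appearance channels and the VM decomposition are omitted): a dense G^3 grid is replaced by R rank-one components, i.e. three 1D vectors per component, which is where the small memory footprint comes from. The grid size, rank, and names are illustrative.

```python
import numpy as np

# A G^3 density grid factorized into R rank-one components: three 1D vectors
# per component instead of a dense voxel grid.
G, R = 128, 16
vx, vy, vz = (np.random.randn(R, G) for _ in range(3))

def density(i, j, k):
    """sigma(i, j, k) ~= sum_r vx[r, i] * vy[r, j] * vz[r, k]"""
    return np.sum(vx[:, i] * vy[:, j] * vz[:, k])

dense_params = G ** 3            # 2,097,152 values for the dense grid
cp_params = 3 * R * G            # 6,144 values for the CP factors
print(dense_params / cp_params)  # ~341x fewer parameters in this toy setting
```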
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (nonconvolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
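For reference, the classic volume-rendering quadrature mentioned above can be written in a few lines; this is the standard discretization used by NeRF-style methods, with illustrative names.

```python
import numpy as np

def composite(sigmas, colors, deltas):
    """NeRF's volume-rendering quadrature along one ray:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with transmittance T_i = prod_{j<i} exp(-sigma_j * delta_j)."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # (S,)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas]))[:-1]   # (S,)
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                   # RGB

# toy usage: 64 (density, RGB) samples along one ray
S = 64
rgb = composite(np.random.rand(S), np.random.rand(S, 3), np.full(S, 0.03))
```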
Recent neural human representations can produce high-quality multi-view rendering but require dense multi-view inputs and costly training. They are therefore largely limited to static models, as training per frame is infeasible. We present HumanNeRF, a generalizable neural representation, for high-fidelity free-view synthesis of dynamic humans. Analogous to how IBRNet assists NeRF by avoiding per-scene training, HumanNeRF employs aggregated pixel-aligned features across multi-view inputs, together with a pose-embedded non-rigid deformation field, to handle dynamic motions. The raw HumanNeRF can already produce reasonable renderings from sparse video inputs. To further improve the rendering quality, we augment our solution with an appearance blending module that combines the benefits of neural volumetric rendering and neural texture blending. Extensive experiments on various multi-view dynamic human datasets demonstrate the generalizability and effectiveness of our approach in synthesizing photo-realistic free-viewpoint humans under challenging motions and with very sparse camera view inputs.
Obtaining high-quality 3D reconstructions of room-scale scenes is crucial for upcoming AR and VR applications. These range from mixed-reality applications for teleconferencing, virtual measuring, and virtual room planning, to robotic applications. While volumetric view synthesis methods that use neural radiance fields (NeRFs) show promising results in reproducing the appearance of an object or scene, they do not reconstruct an actual surface. The volumetric representation of the surface based on densities leads to artifacts when a surface is extracted with marching cubes, since during optimization densities are accumulated along rays and are not constrained at a single sample point in isolation. Instead, we propose to represent the surface with an implicit function (a truncated signed distance function). We show how to incorporate this representation into the NeRF framework and extend it to use depth measurements from a commodity RGB-D sensor, such as a Kinect. In addition, we propose a pose and camera refinement technique which improves the overall reconstruction quality. In contrast to concurrent work that integrates depth priors into NeRF and focuses on novel view synthesis, our approach is able to reconstruct high-quality, metrical 3D reconstructions.
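One simple way to plug a truncated signed distance function into a volume-rendering framework, shown as a sketch rather than the paper's exact formulation: the product of two opposing sigmoids of the SDF peaks at the zero crossing, giving rendering weights concentrated at the surface. All names and the truncation value are illustrative.

```python
import numpy as np

def sdf_to_weights(sdf, trunc=0.05):
    """Map truncated signed-distance values along a ray to rendering weights:
    the product of two opposing sigmoids peaks at the zero crossing (the
    surface) and decays inside the truncation band; weights are normalized so
    they can be used in the usual volume-rendering sum."""
    sig = lambda x: 1.0 / (1.0 + np.exp(-x))
    w = sig(sdf / trunc) * sig(-sdf / trunc)
    return w / (w.sum() + 1e-8)

# toy ray hitting a planar surface at depth 1.0
t = np.linspace(0.5, 1.5, 100)
weights = sdf_to_weights(1.0 - t)          # signed distance to the plane
depth_estimate = (weights * t).sum()       # ~1.0, i.e. the surface depth
```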
With the success of neural volume rendering in novel view synthesis, neural implicit reconstruction with volume rendering has become popular. However, most methods optimize per-scene functions and are unable to generalize to novel scenes. We introduce VolRecon, a generalizable implicit reconstruction method with Signed Ray Distance Function (SRDF). To reconstruct with fine details and little noise, we combine projection features, aggregated from multi-view features with a view transformer, and volume features interpolated from a coarse global feature volume. A ray transformer computes SRDF values of all the samples along a ray to estimate the surface location, which are used for volume rendering of color and depth. Extensive experiments on DTU and ETH3D demonstrate the effectiveness and generalization ability of our method. On DTU, our method outperforms SparseNeuS by about 30% in sparse view reconstruction and achieves comparable quality as MVSNet in full view reconstruction. Besides, our method shows good generalization ability on the large-scale ETH3D benchmark. Project page: https://fangjinhuawang.github.io/VolRecon.
Neural Radiance Field (NeRF) has revolutionized free viewpoint rendering tasks and achieved impressive results. However, efficiency and accuracy problems hinder its wide application. To address these issues, we propose Geometry-Aware Generalized Neural Radiance Field (GARF) with a geometry-aware dynamic sampling (GADS) strategy to perform real-time novel view rendering and unsupervised depth estimation on unseen scenes without per-scene optimization. Distinct from most existing generalized NeRFs, our framework infers unseen scenes at both the pixel scale and the geometry scale with only a few input images. More specifically, our method learns common attributes of novel-view synthesis with an encoder-decoder structure and a point-level learnable multi-view feature fusion module that helps handle occlusions. To preserve scene characteristics in the generalized model, we introduce an unsupervised depth estimation module to derive the coarse geometry, narrow the ray sampling interval down to the proximity of the estimated surface, and sample at the position of maximum expectation, constituting the Geometry-Aware Dynamic Sampling (GADS) strategy. Moreover, we introduce a Multi-level Semantic Consistency loss (MSC) to assist more informative representation learning. Extensive experiments on indoor and outdoor datasets show that, compared with state-of-the-art generalized NeRF methods, GARF reduces the number of samples by more than 25%, while improving rendering quality and 3D geometry estimation.
Neural radiance fields (NeRF) encode a scene into a neural representation that enables photo-realistic rendering of novel views. However, a successful reconstruction from RGB images requires a large number of input views taken under static conditions - typically up to a few hundred images for room-size scenes. Our method aims to synthesize novel views of whole rooms from an order of magnitude fewer images. To this end, we leverage dense depth priors to constrain the NeRF optimization. First, we take advantage of the sparse depth data that is freely available from the structure-from-motion (SfM) preprocessing step used to estimate camera poses. Second, we use depth completion to convert these sparse points into dense depth maps and uncertainty estimates, which are used to guide the NeRF optimization. Our method enables data-efficient novel view synthesis on challenging indoor scenes, using as few as 18 images for an entire scene.
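A sketch of how a dense depth prior with per-pixel uncertainty can guide the optimization: an uncertainty-weighted (Gaussian negative log-likelihood) depth term that trusts confident depth-completion outputs more. This is an illustrative formulation, not necessarily the paper's exact loss, and the names are hypothetical.

```python
import numpy as np

def depth_loss(rendered_depth, prior_depth, prior_std):
    """Uncertainty-weighted depth supervision (Gaussian negative log-likelihood):
    rays with confident depth-completion estimates (small std) are penalized
    more strongly for deviating from the prior."""
    return np.mean((rendered_depth - prior_depth) ** 2 / (2.0 * prior_std ** 2)
                   + np.log(prior_std))

# toy usage over a small batch of rays
loss = depth_loss(np.array([1.02, 2.10]),      # depths rendered by the field
                  np.array([1.00, 2.00]),      # completed depth prior
                  np.array([0.05, 0.20]))      # per-pixel uncertainty
```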
We present GeoNeRF, a generalizable photorealistic novel view synthesis method based on neural radiance fields. Our approach consists of two main stages: a geometry reasoner and a renderer. To render a novel view, the geometry reasoner first constructs cascaded cost volumes for each nearby source view. Then, using a Transformer-based attention mechanism and the cascaded cost volumes, the renderer infers geometry and appearance and renders detailed images via classical volume rendering techniques. This architecture, in particular, allows sophisticated occlusion reasoning, gathering information from consistent source views. Moreover, our method can easily be fine-tuned on a single scene and renders results competitive with per-scene optimized neural rendering methods at a fraction of their computational cost. Experiments show that GeoNeRF outperforms state-of-the-art generalizable neural rendering models on various synthetic and real datasets. Lastly, with a slight modification to the geometry reasoner, we also propose an alternative model that adapts to RGBD images, directly exploiting the depth information that is often available thanks to depth sensors. The implementation code will be publicly available.
We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similarly photorealistic results in combination with scene completion, where a spatial 3D scene understanding is essential. To this end, we propose a generative pipeline that operates on a grid-based neural scene representation and completes unobserved scene parts via a learned distribution of scenes in a 2.5D-3D-2.5D manner. We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to extrapolate the missing areas. Photorealistic image sequences are finally obtained via consistency-aware differentiable rendering. Comprehensive experiments show that the graphical outputs of our method outperform the state of the art, especially in unobserved scene parts.
Novel view synthesis (NVS) is a challenging task requiring systems to generate photorealistic images of scenes from new viewpoints, where both quality and speed are important for applications. Previous image-based rendering (IBR) methods are fast but have poor quality when input views are sparse. Recent Neural Radiance Fields (NeRF) and its generalizable variants give impressive results but are not real-time. In our paper, we propose a generalizable NVS method with sparse inputs, called FWD, which delivers high-quality synthesis in real time. With explicit depth and differentiable rendering, it achieves results competitive with SOTA methods with a 130-1000x speedup and better perceptual quality. If available, we can seamlessly integrate sensor depth during either training or inference to improve image quality while retaining real-time speed. With the growing prevalence of depth sensors, we hope that methods making use of depth will become increasingly useful.
We revisit NPBG, the popular approach to novel view synthesis that introduced the now-ubiquitous point-based neural rendering paradigm. We are particularly interested in data-efficient learning with fast view synthesis. We achieve this through a view-dependent rasterization of mesh-based point descriptors, in addition to a foreground/background scene rendering split and an improved loss. By training on just a single scene, we outperform NPBG, which was trained on ScanNet and then fine-tuned on the scene. We are also competitive with the state-of-the-art method SVS, which was trained on the full datasets (DTU, Tanks and Temples) and then fine-tuned on the scene, despite its deeper neural renderer.