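We present HRF-Net, a novel view synthesis method based on holistic radiance fields that renders novel views from a sparse set of inputs. Recent generalizable view synthesis methods also leverage radiance fields, but their rendering speed is not real-time. Other existing methods can train and render novel views efficiently, but they cannot generalize to unseen scenes. Our approach addresses real-time rendering for generalizable view synthesis and consists of two main stages: a holistic radiance field predictor and a convolution-based neural renderer. This architecture not only infers consistent scene geometry based on implicit neural fields, but also renders new views efficiently on a single GPU. We first train HRF-Net on multiple 3D scenes of the DTU dataset, and the network can produce plausible novel views on unseen real and synthetic data using only photometric losses. Moreover, our method can leverage a denser set of reference images of a single scene to produce accurate novel views without relying on additional explicit representations, while still maintaining the high-speed rendering of the pre-trained model. Experimental results show that HRF-Net outperforms state-of-the-art neural rendering methods on various synthetic and real datasets.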
translated by 谷歌翻译
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multiview posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods.
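We present GeoNeRF, a generalizable photorealistic novel view synthesis method based on neural radiance fields. Our approach consists of two main stages: a geometry reasoner and a renderer. To render a novel view, the geometry reasoner first constructs cascaded cost volumes for each nearby source view. Then, using a Transformer-based attention mechanism and the cascaded cost volumes, the renderer infers geometry and appearance, and renders detailed images via classical volume rendering techniques. In particular, this architecture allows sophisticated occlusion reasoning, gathering information from consistent source views. Moreover, our method can easily be fine-tuned on a single scene, and renders results competitive with per-scene optimized neural rendering methods at a fraction of the computational cost. Experiments show that GeoNeRF outperforms state-of-the-art generalizable neural rendering models on various synthetic and real datasets. Lastly, with a slight modification to the geometry reasoner, we also propose an alternative model that adapts to RGBD images. This model directly exploits the depth information that is often available thanks to depth sensors. The implementation code will be publicly available.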
[Figure: (a) source views; (b) MVS-NeRF, no fine-tuning; (c) MVS-NeRF, 6 min fine-tuning; (d) NeRF, 5.1 h optimization. SSIM values as listed: 0.766, 0.923, 0.924.]
* Equal contribution. Research done when Anpei Chen was in a remote internship with UCSD.
[...] generalizable radiance field reconstruction. Moreover, if dense images are captured, our estimated radiance field representation can be easily fine-tuned; this leads to fast per-scene reconstruction with higher rendering quality and substantially less optimization time than NeRF.
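This paper aims to reduce the rendering time of generalizable radiance fields. Some recent works equip neural radiance fields with image encoders and are able to generalize across scenes, which avoids per-scene optimization. However, their rendering process is generally very slow. A major factor is that they sample a large number of points in empty space when inferring radiance fields. In this paper, we present a hybrid scene representation that combines the best of implicit radiance fields and explicit depth maps for efficient rendering. Specifically, we first build a cascade cost volume to efficiently predict the coarse geometry of the scene. The coarse geometry allows us to sample only a few points near the scene surface and significantly improves the rendering speed. This process is fully differentiable, enabling us to jointly learn the depth prediction and radiance field networks from RGB images alone. Experiments show that the proposed approach achieves state-of-the-art performance on the DTU, Real Forward-facing, and NeRF Synthetic datasets, while being at least 50 times faster than previous generalizable radiance field methods. We also demonstrate the capability of our method to synthesize free-viewpoint videos of dynamic human performers in real time. The code will be available at https://zju3dv.github.io/enerf/.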
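The idea of placing only a few samples near a predicted surface can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_near_surface`, the fixed `half_width` interval, and the uniform spacing are all assumptions made for illustration, and the per-ray depth is taken as given rather than predicted from a cost volume.

```python
import numpy as np

def sample_near_surface(depth, half_width, n_samples, near, far):
    """Place n_samples evenly inside [depth - half_width, depth + half_width].

    depth:      per-ray coarse depth estimates, shape (R,)  (assumed given)
    half_width: half-size of the sampling interval (hypothetical parameter)
    near, far:  scene bounds used to clip the interval
    Returns sample depths of shape (R, n_samples).
    """
    depth = np.asarray(depth, dtype=float)
    lo = np.clip(depth - half_width, near, far)   # interval start per ray
    hi = np.clip(depth + half_width, near, far)   # interval end per ray
    steps = np.linspace(0.0, 1.0, n_samples)      # normalized positions in [0, 1]
    return lo[..., None] + (hi - lo)[..., None] * steps

# Two rays, three samples each, instead of dozens spread over [near, far].
samples = sample_near_surface(np.array([2.0, 5.0]), 0.5, 3, near=1.0, far=5.2)
print(samples)
```

With only a handful of samples per ray, the radiance network is queried far less often than with uniform sampling over the full depth range, which is where a speedup of this kind would come from.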
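We explore a new strategy for few-shot novel view synthesis based on a neural light field representation. Given a target camera pose, an implicit neural network maps each ray directly to the color of its target pixel. The network is conditioned on local ray features generated by coarse volumetric rendering from an explicit 3D feature volume. This volume is built from the input images using a 3D ConvNet. Our method achieves competitive performance on synthetic and real MVS data compared with state-of-the-art neural radiance field based methods, while rendering 100 times faster.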
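We study the problem of novel view synthesis from sparse source observations of a scene composed of 3D objects. We propose a simple yet effective approach that is neither continuous nor implicit, challenging recent trends in view synthesis. Our approach explicitly encodes observations into a volumetric representation that enables amortized rendering. We demonstrate that, although continuous radiance field representations have gained a lot of attention due to their expressive power, our simple approach obtains comparable or even better novel view reconstruction quality than state-of-the-art baselines while increasing rendering speed by over 400x. Our model is trained in a category-agnostic manner and does not require scene-specific optimization. Therefore, it is able to generalize novel view synthesis to object categories not seen during training. In addition, we show that with our simple formulation, we can use view synthesis as a self-supervision signal for efficient learning of 3D geometry without explicit 3D supervision.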
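Novel view synthesis (NVS) is a challenging task that requires systems to generate photorealistic images of scenes from new viewpoints, where both quality and speed are important for applications. Previous image-based rendering (IBR) methods are fast, but have poor quality when input views are sparse. Recent Neural Radiance Fields (NeRF) and generalizable variants give impressive results but are not real-time. In our paper, we propose a generalizable NVS method with sparse inputs, called FWD, which gives high-quality synthesis in real time. With explicit depth and differentiable rendering, it achieves results competitive with SOTA methods with a 130-1000x speedup and better perceptual quality. If available, sensor depth can be seamlessly integrated during either training or inference to improve image quality while retaining real-time speed. With the growing prevalence of depth sensors, we hope that methods making use of depth will become increasingly useful.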
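Novel view synthesis for humans in motion is a challenging computer vision problem that enables applications such as free-viewpoint video. Existing methods typically use complex setups with multiple input views, 3D supervision, or pre-trained models that do not generalize well to new identities. Aiming to address these limitations, we present a novel view synthesis framework that generates realistic renders from unseen views of any human captured by a single-view sensor with sparse RGB-D, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture to learn dense features in novel views obtained by sphere-based neural rendering, and create complete renders using a global context inpainting model. Additionally, an enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single sparse RGB-D input. It generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior human view synthesis methods and is robust to different levels of input sparsity.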
Volumetric neural rendering methods like NeRF generate high-quality view synthesis results but are optimized per scene, leading to prohibitive reconstruction time. On the other hand, deep multi-view stereo methods can quickly reconstruct scene geometry via direct network inference. Point-NeRF combines the advantages of these two approaches by using neural 3D point clouds, with associated neural features, to model a radiance field. Point-NeRF can be rendered efficiently by aggregating neural point features near scene surfaces, in a ray marching-based rendering pipeline. Moreover, Point-NeRF can be initialized via direct inference of a pre-trained deep network to produce a neural point cloud; this point cloud can be fine-tuned to surpass the visual quality of NeRF with 30X faster training time. Point-NeRF can be combined with other 3D reconstruction methods and handles the errors and outliers in such methods via a novel pruning and growing mechanism. Experiments on the DTU, NeRF Synthetics, ScanNet, and Tanks and Temples datasets demonstrate that Point-NeRF can surpass existing methods and achieve state-of-the-art results.
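We present NeRF-SR, a solution for high-resolution (HR) novel view synthesis with mostly low-resolution (LR) inputs. Our method is built upon Neural Radiance Fields (NeRF), which predicts per-point density and color with a multi-layer perceptron. Although it can produce images at arbitrary scales, NeRF struggles with resolutions that go beyond those of the observed images. Our key insight is that NeRF has a local prior, which means that predictions for a 3D point can be propagated to the nearby region and remain accurate. We first exploit this with a super-sampling strategy that shoots multiple rays at each image pixel, which enforces multi-view constraints at a sub-pixel level. Then, we show that NeRF-SR can further boost the performance of super-sampling with a refinement network that leverages the estimated depth to hallucinate details from related patches on an HR reference image. Experimental results demonstrate that NeRF-SR generates high-quality results for HR novel view synthesis on both synthetic and real-world datasets.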
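One plausible reading of the super-sampling step is that each LR pixel is split into an s x s grid of sub-pixel rays whose rendered colors are averaged before being compared with the observed LR pixel. The sketch below shows only that averaging step; the function name `average_subpixels` and the assumption of a plain mean are illustrative and are not taken from the paper.

```python
import numpy as np

def average_subpixels(hr_render, s):
    """Average each s x s block of sub-pixel colors into one LR pixel.

    hr_render: array of shape (H*s, W*s, C) holding one rendered color per
               sub-pixel ray (assumed layout).
    Returns an array of shape (H, W, C).
    """
    hs, ws, c = hr_render.shape
    assert hs % s == 0 and ws % s == 0, "render size must be divisible by s"
    blocks = hr_render.reshape(hs // s, s, ws // s, s, c)
    return blocks.mean(axis=(1, 3))   # mean over the two sub-pixel axes

# A 4x4 single-channel render reduced to 2x2 with s = 2.
hr = np.arange(16, dtype=float).reshape(4, 4, 1)
print(average_subpixels(hr, 2)[..., 0])
```

A photometric loss between this averaged image and the LR observation would then constrain all sub-pixel rays jointly, which is one way a sub-pixel multi-view constraint could be realized.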
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (nonconvolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
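The "classic volume rendering" step mentioned above is commonly written as a discrete quadrature: each sample contributes its color weighted by its opacity and by the transmittance accumulated in front of it. The following is a minimal NumPy sketch of that standard rule for a single ray; the function name `composite_ray` and the choice to treat the last interval as effectively infinite are assumptions for illustration.

```python
import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Alpha-composite N samples along one ray.

    sigmas: volume densities, shape (N,)
    colors: RGB radiance per sample, shape (N, 3)
    t_vals: sample distances along the ray, increasing, shape (N,)
    Returns (rgb, weights).
    """
    sigmas = np.asarray(sigmas, dtype=float)
    colors = np.asarray(colors, dtype=float)
    t_vals = np.asarray(t_vals, dtype=float)
    # Spacing between adjacent samples; the last interval is treated as very large.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity of each interval: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance before sample i: T_i = prod_{j < i} (1 - alpha_j)
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Empty space, then an opaque green surface that hides the blue sample behind it.
rgb, w = composite_ray(
    sigmas=[0.0, 1e9, 5.0],
    colors=[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    t_vals=[1.0, 2.0, 3.0],
)
print(rgb, w)
```

Every operation here is differentiable with respect to `sigmas` and `colors`, which is why posed images alone can supervise the representation.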
Neural Radiance Field (NeRF) has revolutionized free-viewpoint rendering tasks and achieved impressive results. However, efficiency and accuracy problems hinder its wide application. To address these issues, we propose Geometry-Aware Generalized Neural Radiance Field (GARF) with a geometry-aware dynamic sampling (GADS) strategy to perform real-time novel view rendering and unsupervised depth estimation on unseen scenes without per-scene optimization. Distinct from most existing generalized NeRFs, our framework infers unseen scenes at both pixel scale and geometry scale with only a few input images. More specifically, our method learns common attributes of novel view synthesis with an encoder-decoder structure and a point-level learnable multi-view feature fusion module that helps avoid occlusion. To preserve scene characteristics in the generalized model, we introduce an unsupervised depth estimation module to derive the coarse geometry, narrow the ray sampling interval to the proximity of the estimated surface, and sample at the position of maximum expectation, which together constitute the Geometry-Aware Dynamic Sampling (GADS) strategy. Moreover, we introduce a Multi-level Semantic Consistency (MSC) loss to assist more informative representation learning. Extensive experiments on indoor and outdoor datasets show that, compared with state-of-the-art generalized NeRF methods, GARF reduces samples by more than 25%, while improving rendering quality and 3D geometry estimation.
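Neural rendering has received tremendous attention since the advent of Neural Radiance Fields (NeRF), and has pushed the state of the art in novel view synthesis considerably. The recent focus has been on models that overfit to a single scene, and the few attempts to learn models that can synthesize novel views of unseen scenes mostly consist of combining deep convolutional features with a NeRF-like model. We propose a different paradigm, in which neither deep features nor NeRF-like volume rendering is needed. Our method is capable of predicting the color of a target ray in a novel scene directly from a collection of patches sampled from the scene. We first leverage epipolar geometry to extract patches along the epipolar lines of each reference view. Each patch is linearly projected into a 1D feature vector, and a sequence of transformers processes the collection. For positional encoding, we parameterize rays as in a light field representation, with the crucial difference that the coordinates are canonicalized with respect to the target ray, which makes our method independent of the reference frame and improves generalization. We show that our approach outperforms the state of the art on novel view synthesis of unseen scenes, even when trained with considerably less data than prior work.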
We present a method for novel view synthesis from input images that are freely distributed around a scene. Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts. We calibrate the input images via SfM and erect a coarse geometric scaffold via MVS. This scaffold is used to create a proxy depth map for a novel view of the scene. Based on this depth map, a recurrent encoder-decoder network processes reprojected features from nearby views and synthesizes the new view. Our network does not need to be optimized for a given scene. After training on a dataset, it works in previously unseen environments with no finetuning or per-scene optimization. We evaluate the presented approach on challenging real-world datasets, including Tanks and Temples, where we demonstrate successful view synthesis for the first time and substantially outperform prior and concurrent work.
Photo-realistic free-viewpoint rendering of real-world scenes using classical computer graphics techniques is challenging, because it requires the difficult step of capturing detailed appearance and geometry models. Recent studies have demonstrated promising results by learning scene representations that implicitly encode both geometry and appearance without 3D supervision. However, existing approaches in practice often show blurry renderings caused by the limited network capacity or the difficulty in finding accurate intersections of camera rays with the scene geometry. Synthesizing high-resolution imagery from these representations often requires time-consuming optical ray marching. In this work, we introduce Neural Sparse Voxel Fields (NSVF), a new neural scene representation for fast and high-quality free-viewpoint rendering. NSVF defines a set of voxel-bounded implicit fields organized in a sparse voxel octree to model local properties in each cell. We progressively learn the underlying voxel structures with a differentiable ray-marching operation from only a set of posed RGB images. With the sparse voxel octree structure, rendering novel views can be accelerated by skipping the voxels containing no relevant scene content. Our method is typically over 10 times faster than the state-of-the-art (namely, NeRF (Mildenhall et al., 2020)) at inference time while achieving higher quality results. Furthermore, by utilizing an explicit sparse voxel representation, our method can easily be applied to scene editing and scene composition. We also demonstrate several challenging tasks, including multi-scene learning, free-viewpoint rendering of a moving human, and large-scale scene rendering. Code and data are available at our website: https://github.com/facebookresearch/NSVF.
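We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similar photorealistic results in combination with scene completion, where a spatial 3D scene understanding is essential. To this end, we propose a generative pipeline operating on a sparse grid-based neural scene representation to complete unobserved scene parts via a learned distribution of scenes in a 2.5D-3D-2.5D manner. We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to extrapolate the missing area. Photorealistic image sequences can finally be obtained via consistency-relevant differentiable rendering. Comprehensive experiments show that the graphical outputs of our method outperform the state of the art, especially within unobserved scene parts.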
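We present Generalizable NeRF Transformer (GNT), a pure, unified transformer-based architecture that efficiently reconstructs Neural Radiance Fields (NeRFs) from source views. Unlike prior works on NeRF that optimize a per-scene implicit representation by inverting a handcrafted rendering equation, GNT achieves generalizable neural scene representation and rendering by encapsulating two transformer-based stages. The first stage of GNT, called the view transformer, leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. The second stage of GNT, named the ray transformer, renders novel views by ray marching and directly decodes the sequence of sampled point features using the attention mechanism. Our experiments demonstrate that, when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula, and even improves the PSNR by ~1.3 dB on complex scenes thanks to the learnable ray renderer. When trained across various scenes, GNT consistently achieves state-of-the-art performance when transferring to the forward-facing LLFF dataset (LPIPS ~20%, SSIM ~25%) and the synthetic Blender dataset (LPIPS ~20%, SSIM ~4%). In addition, we show that depth and occlusion can be inferred from the learned attention maps, which implies that the pure attention mechanism is capable of learning a physically grounded rendering process. All these results bring us one step closer to the tantalizing hope of utilizing transformers as the "universal modeling tool" even for graphics. Please refer to our project page for video results: https://vita-group.github.io/gnt/.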
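Multi-View Stereo (MVS) is a core task in 3D computer vision. With the surge of novel deep learning methods, learned MVS has surpassed the accuracy of classical approaches, but still relies on building a memory-intensive dense cost volume. Novel View Synthesis (NVS) is a parallel line of research that has recently grown in popularity with Neural Radiance Field (NeRF) models, which optimize a per-scene radiance field. However, NeRF methods do not generalize to novel scenes and are slow to train and test. We propose to bridge the gap between these two methodologies with a novel network that can recover 3D scene geometry as a distance function, together with high-resolution color images. Our method uses only a sparse set of images as input and generalizes well to novel scenes. Additionally, we propose a coarse-to-fine sphere tracing approach to significantly increase speed. We show on various datasets that our method reaches accuracy comparable to per-scene optimized methods while being able to generalize and running significantly faster. We provide the source code at https://github.com/ais-bonn/neural_mvs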
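For readers unfamiliar with sphere tracing, the basic (single-resolution) algorithm marches along a ray by the distance returned by a signed distance function, since that distance is guaranteed to be free of surfaces. The sketch below shows plain sphere tracing only, not the coarse-to-fine variant proposed in the paper; the analytic `sphere_sdf` stands in for a learned distance function, and all names and thresholds are illustrative assumptions.

```python
import numpy as np

def sphere_sdf(p, center=np.zeros(3), radius=1.0):
    """Signed distance from point p to a sphere (stand-in for a learned SDF)."""
    return np.linalg.norm(p - center) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4, t_max=100.0):
    """Return the ray distance t of the first surface hit, or None on a miss."""
    origin = np.asarray(origin, dtype=float)
    direction = np.asarray(direction, dtype=float)
    direction = direction / np.linalg.norm(direction)  # unit-length ray direction
    t = 0.0
    for _ in range(max_steps):
        d = sdf(origin + t * direction)
        if d < eps:        # close enough to the surface: report a hit
            return t
        t += d             # safe step: no surface lies within distance d
        if t > t_max:      # left the scene bounds
            break
    return None

hit = sphere_trace([0.0, 0.0, -3.0], [0.0, 0.0, 1.0], sphere_sdf)
miss = sphere_trace([0.0, 0.0, -3.0], [0.0, 1.0, 0.0], sphere_sdf)
print(hit, miss)
```

Because step sizes adapt to the distance field, rays through empty space converge in a few iterations, which is the property a coarse-to-fine scheme would build on.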
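Although neural radiance fields (NeRF) have shown impressive advances in novel view synthesis, most methods typically require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints that are far away from the source view. To address this issue, we propose to leverage both global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints such as symmetry or canonical coordinate systems. Our method can render novel views from only a single input image and generalizes across multiple object categories using a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches.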