A common approach to estimating the volume density and color of a 3D point in multi-view based rendering is to check for the existence of consensus among the given source-image features, which serves as an informative cue in the estimation process. To this end, most previous methods aggregate the source features with equal weights. However, this can make it hard to check for consensus when the set of source-image features contains outliers, which frequently arise from occlusion. In this paper, we propose a novel source-view feature aggregation method that exploits the local structure of the feature set to find the consensus robustly. We first compute a source-view-wise distance distribution for each source feature in the proposed aggregation. The distance distributions are then converted to several similarity distributions with a proposed learnable similarity mapping function. Finally, for each element of the feature set, aggregated features are extracted by computing the weighted mean and variance, where the weights are derived from the similarity distributions. In experiments, we validate the proposed method on various benchmark datasets, including synthetic and real image scenes. The experimental results demonstrate that incorporating the proposed features improves the performance of state-of-the-art methods by a large margin.
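To make the aggregation concrete, here is a minimal NumPy sketch of the weighted mean-and-variance step. It is an illustration only: the fixed Gaussian kernel below stands in for the paper's learnable similarity mapping function, and all shapes are assumed.

```python
import numpy as np

def similarity_from_distance(d, bandwidth=1.0):
    # Stand-in for the paper's *learnable* similarity mapping:
    # a fixed Gaussian kernel that turns distances into similarities.
    return np.exp(-(d ** 2) / (2.0 * bandwidth ** 2))

def robust_aggregate(features):
    """features: (V, C) array, one C-dim feature per source view."""
    # Pairwise source-to-source distance distribution for each view.
    diffs = features[:, None, :] - features[None, :, :]   # (V, V, C)
    dists = np.linalg.norm(diffs, axis=-1)                # (V, V)
    sims = similarity_from_distance(dists)                # (V, V)
    weights = sims / sims.sum(axis=1, keepdims=True)      # row-normalize
    # Weighted mean and variance per view; outlier views far from the
    # consensus get low weight and barely influence the statistics.
    mean = weights @ features                             # (V, C)
    var = weights @ (features ** 2) - mean ** 2           # (V, C)
    return np.concatenate([mean, var], axis=-1)           # (V, 2C)
```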
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multiview posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods.
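The "classic volume rendering" used for supervision is standard alpha compositing of predicted colors and densities along a ray; a minimal NumPy version of that equation (not IBRNet's actual code) looks like:

```python
import numpy as np

def composite(rgb, sigma, t_vals):
    """Standard volume rendering: rgb (N,3), sigma (N,), t_vals (N,) sorted."""
    deltas = np.diff(t_vals, append=1e10)        # distance between samples
    alpha = 1.0 - np.exp(-sigma * deltas)        # opacity of each segment
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * trans
    return (weights[:, None] * rgb).sum(axis=0)  # composited pixel color
```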
We present Generalizable NeRF Transformer (GNT), a pure, unified transformer-based architecture that efficiently reconstructs Neural Radiance Fields (NeRFs) on the fly from source views. Unlike prior works on NeRF that optimize a per-scene implicit representation by inverting a handcrafted rendering equation, GNT achieves generalizable neural scene representation and rendering by encapsulating two transformer-based stages. The first stage of GNT, called the view transformer, leverages multi-view geometry as an inductive bias for attention-based scene representation and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. The second stage of GNT, named the ray transformer, renders novel views by ray marching and directly decodes the sequence of sampled point features using the attention mechanism. Our experiments demonstrate that, when optimized on a single scene, GNT can successfully reconstruct NeRF without an explicit rendering formula and even improves PSNR by ~1.3 dB on complex scenes thanks to the learnable ray renderer. When trained across various scenes, GNT consistently achieves state-of-the-art performance when transferring to the forward-facing LLFF dataset (LPIPS ~20%, SSIM ~25% improvement) and the synthetic Blender dataset (LPIPS ~20%, SSIM ~4% improvement). In addition, we show that depth and occlusion can be inferred from the learned attention maps, which implies that the pure attention mechanism is capable of learning a physically grounded rendering process. All these results bring us one step closer to the tantalizing hope of using transformers as a "universal modeling tool" even for graphics. Please refer to our project page for video results: https://vita-group.github.io/gnt/.
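As a rough sketch of the ray transformer idea (attention over the per-sample features of a single ray, replacing the hand-crafted compositing equation), one self-attention pass might look like the following; the projection matrices are placeholders for learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ray_attention_pool(point_feats, Wq, Wk, Wv):
    """point_feats: (N, C) features of N samples along one ray.
    Wq/Wk/Wv: (C, C) stand-ins for learned projection matrices."""
    q, k, v = point_feats @ Wq, point_feats @ Wk, point_feats @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, N) attention map
    mixed = attn @ v                                # samples exchange information
    return mixed.mean(axis=0)                       # pooled ray feature -> color head
```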
We present GeoNeRF, a generalizable photorealistic novel view synthesis method based on neural radiance fields. Our approach consists of two main stages: a geometry reasoner and a renderer. To render a novel view, the geometry reasoner first constructs cascaded cost volumes for each nearby source view. Then, using a Transformer-based attention mechanism and the cascaded cost volumes, the renderer infers geometry and appearance and renders detailed images via classical volume rendering techniques. This architecture, in particular, allows sophisticated occlusion reasoning by gathering information from consistent source views. Moreover, our method can easily be fine-tuned on a single scene, producing results competitive with per-scene optimized neural rendering methods at a fraction of their computational cost. Experiments show that GeoNeRF outperforms state-of-the-art generalizable neural rendering models on various synthetic and real datasets. Lastly, with a slight modification to the geometry reasoner, we also propose an alternative model that adapts to RGBD images, directly exploiting the depth information often available thanks to depth sensors. The implementation code will be publicly available.
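Cascaded cost volumes build on the ordinary plane-sweep construction: warp source-view features onto depth hypotheses of a reference view and score their agreement. A toy single-stage sketch, with the homography warp left as a caller-supplied placeholder `warp_to_plane`:

```python
import numpy as np

def cost_volume(ref_feat, src_feats, depths, warp_to_plane):
    """ref_feat: (C,H,W); src_feats: list of (C,H,W); depths: (D,) hypotheses.
    warp_to_plane(feat, d) -> (C,H,W): placeholder for the homography warp
    of a source feature map onto the reference depth plane d."""
    _, H, W = ref_feat.shape
    volume = np.empty((len(depths), H, W))
    for i, d in enumerate(depths):
        warped = np.stack([ref_feat] + [warp_to_plane(f, d) for f in src_feats])
        # Variance across views: low variance = photometric consensus at depth d.
        volume[i] = warped.var(axis=0).mean(axis=0)
    return volume  # (D, H, W); cascaded stages refine this coarse-to-fine
```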
This paper aims to reduce the rendering time of generalizable radiance fields. Some recent works equip neural radiance fields with image encoders, enabling them to generalize across scenes and avoid per-scene optimization. However, their rendering process is generally very slow. A major factor is that they sample a large number of points in empty space when inferring the radiance field. In this paper, we introduce a hybrid scene representation that combines the best of implicit radiance fields and explicit depth maps for efficient rendering. Specifically, we first build a cascade cost volume to efficiently predict the coarse geometry of the scene. The coarse geometry allows us to sample only a few points near the scene surface, significantly improving the rendering speed. This process is fully differentiable, enabling us to jointly learn the depth prediction and radiance field networks from only RGB images. Experiments show that the proposed approach achieves state-of-the-art performance on the DTU, Real Forward-facing and NeRF Synthetic datasets, while being at least 50 times faster than previous generalizable radiance field methods. We also demonstrate that our method can synthesize free-viewpoint videos of dynamic human performers in real time. The code will be available at https://zju3dv.github.io/enerf/.
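The speedup comes from sampling only a thin interval around the predicted coarse depth rather than the full ray; a sketch of such depth-guided sampling, with an assumed interval half-width:

```python
import numpy as np

def sample_near_surface(depth, half_width=0.05, n_samples=8, rng=None):
    """depth: (H, W) coarse depth from the cost volume.
    Returns per-pixel sample depths of shape (H, W, n_samples)."""
    rng = np.random.default_rng() if rng is None else rng
    lo = depth[..., None] - half_width          # lower bound of the interval
    hi = depth[..., None] + half_width          # upper bound of the interval
    u = rng.uniform(size=depth.shape + (n_samples,))
    return lo + u * (hi - lo)                   # stratification omitted for brevity
```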
Neural radiance fields (NeRF) have shown great potential in representing 3D scenes and synthesizing novel views, but the computational overhead of NeRF at the inference stage is still heavy. To alleviate the burden, we delve into the coarse-to-fine, hierarchical sampling procedure of NeRF and point out that the coarse stage can be replaced by a lightweight module, which we name a neural sample field. The proposed sample field maps rays into sample distributions, which can be transformed into point coordinates and fed into the radiance field for volume rendering. The overall framework is named NeuSample. We perform experiments on Realistic Synthetic 360° and Real Forward-Facing, two popular 3D scene sets, and show that NeuSample achieves better rendering quality than NeRF while enjoying faster inference speed. NeuSample is further compressed with a proposed sample field extraction method towards a better trade-off between quality and speed.
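A sample field in this spirit can be pictured as a tiny MLP mapping a ray to a sorted set of sample depths; the sketch below uses stand-in weights and a cumulative-softmax trick to keep the depths ordered (an illustrative construction, not the paper's architecture):

```python
import numpy as np

def sample_field(ray_o, ray_d, W1, b1, W2, b2, near=2.0, far=6.0):
    """Maps one ray (origin + direction, 6 numbers) to 3D sample points.
    W1 (H,6), b1 (H,), W2 (S,H), b2 (S,): stand-ins for trained MLP weights."""
    x = np.concatenate([ray_o, ray_d])                 # (6,) ray encoding
    h = np.maximum(W1 @ x + b1, 0.0)                   # ReLU hidden layer
    logits = W2 @ h + b2                               # (S,) one logit per sample
    e = np.exp(logits - logits.max())                  # stable softmax
    frac = np.cumsum(e / e.sum())                      # increasing values in (0, 1]
    t = near + (far - near) * frac                     # (S,) sorted sample depths
    return ray_o[None, :] + t[:, None] * ray_d[None, :]  # (S, 3) sample points
```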
We explore a new strategy for novel view synthesis based on a neural light field representation. Given a target camera pose, an implicit neural network maps each ray to the color of its target pixel. The network is conditioned on local ray features generated by a coarse volumetric rendering from an explicit 3D feature volume, which is built from the input images with a 3D ConvNet. Our method achieves competitive performance on synthetic and real MVS data against state-of-the-art neural radiance field based competitors, while offering a 100x speedup in rendering.
Neural Radiance Field (NeRF) has revolutionized free viewpoint rendering tasks and achieved impressive results. However, efficiency and accuracy problems hinder its wide applications. To address these issues, we propose Geometry-Aware Generalized Neural Radiance Field (GARF) with a geometry-aware dynamic sampling (GADS) strategy to perform real-time novel view rendering and unsupervised depth estimation on unseen scenes without per-scene optimization. Distinct from most existing generalized NeRFs, our framework infers the unseen scenes on both pixel-scale and geometry-scale with only a few input images. More specifically, our method learns common attributes of novel-view synthesis by an encoder-decoder structure and a point-level learnable multi-view feature fusion module which helps avoid occlusion. To preserve scene characteristics in the generalized model, we introduce an unsupervised depth estimation module to derive the coarse geometry, narrow down the ray sampling interval to the proximity space of the estimated surface, and sample at the expectation-maximum position, constituting the Geometry-Aware Dynamic Sampling strategy (GADS). Moreover, we introduce a Multi-level Semantic Consistency loss (MSC) to assist more informative representation learning. Extensive experiments on indoor and outdoor datasets show that, compared with state-of-the-art generalized NeRF methods, GARF reduces samples by more than 25% while improving rendering quality and 3D geometry estimation.
We propose a Transformer-based NeRF (TransNeRF) to learn a generic neural radiance field conditioned on observed-view images for the novel view synthesis task. By contrast, existing MLP-based NeRFs cannot directly receive an arbitrary number of observed views and require auxiliary pooling-based operations to fuse source-view information, resulting in missing complicated relationships between the source views and the target rendering view. Furthermore, current methods process each 3D point individually and ignore the local consistency of a radiance field's scene representation. These limitations can degrade performance in challenging real-world applications where large differences between source views and the novel rendering view may exist. To address these challenges, our TransNeRF utilizes the attention mechanism to naturally decode deep associations of an arbitrary number of source views into a coordinate-based scene representation. Local consistency of shape and appearance is considered in the ray-cast space and the surrounding-view space within a unified Transformer network. Experiments demonstrate that our TransNeRF, trained on a wide variety of scenes, achieves better performance than state-of-the-art image-based neural rendering methods in both scene-agnostic and per-scene fine-tuning settings, especially when there is a considerable gap between source views and the rendering view.
Classical light field rendering for novel view synthesis can accurately reproduce view-dependent effects such as reflection, refraction, and translucency, but requires a dense view sampling of the scene. Methods based on geometric reconstruction need only sparse views but cannot accurately model non-Lambertian effects. We introduce a model that combines the strengths and mitigates the limitations of these two directions. By operating on a four-dimensional representation of the light field, our model learns to represent view-dependent effects accurately. By enforcing geometric constraints during training and inference, the scene geometry is implicitly learned from a sparse set of views. Concretely, we introduce a two-stage transformer-based model that first aggregates features along epipolar lines and then aggregates features along reference views to produce the color of a target ray. Our model outperforms the state-of-the-art on multiple forward-facing and 360° datasets, with larger margins on scenes with severe view-dependent variations.
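Stage one ("aggregate features along epipolar lines") boils down to projecting samples of the target ray into each reference view and gathering features at the resulting pixels, which all lie on that view's epipolar line. A minimal pinhole-projection sketch (nearest-pixel lookup; K, R, t assumed given):

```python
import numpy as np

def gather_epipolar_features(points, feat_map, K, R, t):
    """points: (N,3) samples on the target ray; feat_map: (C,H,W) of one
    reference view; K: (3,3) intrinsics; R, t: world-to-camera pose.
    Returns (N,C) nearest-pixel features (assumes points are in front
    of the camera; bilinear sampling omitted for brevity)."""
    cam = points @ R.T + t             # world -> camera coordinates
    uvw = cam @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective divide -> pixel coords
    C, H, W = feat_map.shape
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat_map[:, v, u].T         # (N, C); the transformer attends over these
```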
Multi-view stereo (MVS) is a core task in 3D computer vision. With the surge of novel deep learning methods, learned MVS has surpassed the accuracy of classical approaches, but still relies on building memory-intensive dense cost volumes. Novel view synthesis (NVS) is a parallel line of research that has recently seen a rise in popularity with Neural Radiance Field (NeRF) models, which optimize a per-scene radiance field. However, NeRF methods do not generalize to novel scenes and are slow to train and test. We propose to bridge the gap between these two methodologies with a novel network that can recover 3D scene geometry as a distance function, together with high-resolution color images. Our method uses only a sparse set of images as input and can generalize well to novel scenes. Additionally, we propose a coarse-to-fine sphere tracing approach to significantly increase speed. We show on various datasets that our method reaches accuracy comparable to per-scene optimized methods while being able to generalize and running significantly faster. We provide the source code at https://github.com/ais-bonn/neural_mvs
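Sphere tracing advances a ray by the distance-function value itself, which is safe because no surface can be closer than that value; the coarse-to-fine variant runs the same loop at increasing resolution. A plain single-scale version:

```python
import numpy as np

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-3, t_max=10.0):
    """origin, direction: (3,) with direction normalized; sdf: point -> distance.
    Returns the hit point, or None if the ray escapes."""
    t = 0.0
    for _ in range(max_steps):
        p = origin + t * direction
        d = sdf(p)              # distance to the nearest surface
        if d < eps:             # close enough: report a hit
            return p
        t += d                  # safe step: a sphere of radius d is empty
        if t > t_max:
            break
    return None

# Example: a unit sphere at the origin.
hit = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]),
                   lambda p: np.linalg.norm(p) - 1.0)
```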
We present HRF-Net, a novel view synthesis method based on holistic radiance fields that renders novel views using a set of sparse inputs. Recent generalizing view synthesis methods also leverage radiance fields, but their rendering speed is not real-time. There are existing methods that can train and render novel views efficiently, but they cannot generalize to unseen scenes. Our method addresses the problem of real-time rendering for generalizing view synthesis and consists of two main stages: a holistic radiance field predictor and a convolution-based neural renderer. This architecture not only infers consistent scene geometry based on implicit neural fields but also renders new views efficiently on a single GPU. We first train HRF-Net on multiple 3D scenes of the DTU dataset, and the network can produce plausible novel views on unseen real and synthetic data using only photometric losses. Moreover, our method can leverage a denser set of reference images of a single scene to produce accurate novel views without relying on additional explicit representations, while still maintaining the high-speed rendering of the pre-trained model. Experimental results show that HRF-Net outperforms state-of-the-art neural rendering methods on various synthetic and real datasets.
Volumetric neural rendering methods like NeRF generate high-quality view synthesis results but are optimized per-scene, leading to prohibitive reconstruction time. On the other hand, deep multi-view stereo methods can quickly reconstruct scene geometry via direct network inference. Point-NeRF combines the advantages of these two approaches by using neural 3D point clouds, with associated neural features, to model a radiance field. Point-NeRF can be rendered efficiently by aggregating neural point features near scene surfaces in a ray marching-based rendering pipeline. Moreover, Point-NeRF can be initialized via direct inference of a pre-trained deep network to produce a neural point cloud; this point cloud can be finetuned to surpass the visual quality of NeRF with 30X faster training time. Point-NeRF can be combined with other 3D reconstruction methods and handles the errors and outliers in such methods via a novel pruning and growing mechanism. Experiments on the DTU, NeRF Synthetic, ScanNet, and Tanks and Temples datasets demonstrate that Point-NeRF can surpass the existing methods and achieve state-of-the-art results.
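The per-point aggregation can be approximated by an inverse-distance weighting of neural points near the shading location; the sketch below is a simplified stand-in for Point-NeRF's learned aggregation:

```python
import numpy as np

def aggregate_point_features(x, pts, feats, radius=0.1):
    """x: (3,) shading point; pts: (M,3) neural point cloud; feats: (M,C).
    Inverse-distance weighted aggregation of neural points within `radius`."""
    d = np.linalg.norm(pts - x, axis=1)
    near = d < radius
    if not near.any():
        return None                  # empty region: no density contribution here
    w = 1.0 / (d[near] + 1e-8)       # closer points dominate
    w = w / w.sum()
    return w @ feats[near]           # (C,) feature fed to the radiance MLP
```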
Photo-realistic free-viewpoint rendering of real-world scenes using classical computer graphics techniques is challenging, because it requires the difficult step of capturing detailed appearance and geometry models. Recent studies have demonstrated promising results by learning scene representations that implicitly encode both geometry and appearance without 3D supervision. However, existing approaches in practice often show blurry renderings caused by the limited network capacity or the difficulty in finding accurate intersections of camera rays with the scene geometry. Synthesizing high-resolution imagery from these representations often requires time-consuming optical ray marching. In this work, we introduce Neural Sparse Voxel Fields (NSVF), a new neural scene representation for fast and high-quality free-viewpoint rendering. NSVF defines a set of voxel-bounded implicit fields organized in a sparse voxel octree to model local properties in each cell. We progressively learn the underlying voxel structures with a differentiable ray-marching operation from only a set of posed RGB images. With the sparse voxel octree structure, rendering novel views can be accelerated by skipping the voxels containing no relevant scene content. Our method is typically over 10 times faster than the state-of-the-art (namely, NeRF (Mildenhall et al., 2020)) at inference time while achieving higher quality results. Furthermore, by utilizing an explicit sparse voxel representation, our method can easily be applied to scene editing and scene composition. We also demonstrate several challenging tasks, including multi-scene learning, free-viewpoint rendering of a moving human, and large-scale scene rendering. Code and data are available at our website: https://github.com/facebookresearch/NSVF.
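Skipping empty voxels can be sketched as filtering candidate ray samples through an occupancy grid (a flat-grid stand-in for the sparse voxel octree; grid corner assumed at the world origin):

```python
import numpy as np

def samples_in_occupied_voxels(origin, direction, occupancy, voxel_size,
                               t_near=0.0, t_far=10.0, step=0.02):
    """occupancy: (G,G,G) bool grid; keeps only ray samples whose voxel is
    occupied, so the implicit field is never queried in empty space."""
    t = np.arange(t_near, t_far, step)
    pts = origin + t[:, None] * direction          # (N, 3) candidate samples
    idx = np.floor(pts / voxel_size).astype(int)   # voxel index per sample
    G = occupancy.shape[0]
    inside = np.all((idx >= 0) & (idx < G), axis=1)
    keep = np.zeros(len(t), dtype=bool)
    i = idx[inside]
    keep[inside] = occupancy[i[:, 0], i[:, 1], i[:, 2]]
    return pts[keep], t[keep]   # only these are evaluated by the implicit field
```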
Neural rendering has received tremendous attention since the advent of Neural Radiance Fields (NeRF) and has pushed the state-of-the-art on novel view synthesis considerably. The recent focus has been on models that overfit to a single scene, and the few attempts to learn models that can synthesize novel views of unseen scenes mostly consist of combining deep convolutional features with a NeRF-like model. We propose a different paradigm in which no deep features and no NeRF-like volume rendering are needed. Our method is capable of predicting the color of a target ray in a novel scene directly, from just a collection of patches sampled from the scene. We first leverage epipolar geometry to extract patches along the epipolar lines of each reference view. Each patch is linearly projected into a 1D feature vector, and a sequence of transformers processes the collection. For positional encoding, we parameterize rays as in a light field representation, with the crucial difference that the coordinates are canonicalized with respect to the target ray, which makes our method independent of the reference frame and improves generalization. We show that our approach outperforms the state-of-the-art on novel view synthesis of unseen scenes even when trained with considerably less data than prior work.
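One plausible reading of "canonicalized with respect to the target ray" is to express every reference ray in a coordinate frame whose origin and +z axis are the target ray itself, so the representation no longer depends on the world frame; a sketch of that construction (an interpretation for illustration, not the paper's exact parameterization):

```python
import numpy as np

def canonical_frame(target_o, target_d):
    """Rotation mapping the target ray direction to +z, plus its origin;
    expressing all rays in this frame makes features target-relative."""
    z = target_d / np.linalg.norm(target_d)
    up = np.array([0., 1., 0.]) if abs(z[1]) < 0.99 else np.array([1., 0., 0.])
    x = np.cross(up, z); x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])            # world -> canonical rotation (rows)
    return R, target_o

def canonicalize_ray(o, d, R, origin):
    # In this frame the target ray is origin (0,0,0), direction (0,0,1).
    return R @ (o - origin), R @ d
```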
Although neural radiance fields (NeRF) have shown impressive advances in novel view synthesis, most methods require multiple input images of the same scene with accurate camera poses. In this work, we seek to substantially reduce the inputs to a single unposed image. Existing approaches condition on local image features to reconstruct a 3D object, but often render blurry predictions at viewpoints far away from the source view. To address this issue, we propose to leverage both global and local features to form an expressive 3D representation. The global features are learned from a vision transformer, while the local features are extracted from a 2D convolutional network. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. This novel 3D representation allows the network to reconstruct unseen regions without enforcing constraints such as symmetry or canonical coordinate systems. Our method can render novel views from a single input image and generalize across multiple object categories with a single model. Quantitative and qualitative evaluations demonstrate that the proposed method achieves state-of-the-art performance and renders richer details than existing approaches.
We present an information-theoretic regularization technique for few-shot novel view synthesis based on neural implicit representations. The proposed approach minimizes potential reconstruction inconsistency by imposing an entropy constraint on the density along each ray. In addition, to alleviate potential degeneracy when all training images are acquired from almost redundant viewpoints, we further incorporate a spatial smoothness constraint into the estimated images by restricting the information gain between rays from a pair of slightly different viewpoints. The main idea of our algorithm is to make the reconstructed scene compact along individual rays and consistent across neighboring rays. The proposed regularizers can be plugged into most existing neural volume rendering techniques based on NeRF in a straightforward way. Despite its simplicity, we achieve consistent performance gains over existing neural view synthesis methods by large margins on multiple standard benchmarks. Our project website is available at http://cvlab.snu.ac.kr/research/infonerf.
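The ray-wise entropy term can be written directly from the opacities along a ray: normalize them into a distribution over samples and penalize its entropy so density concentrates near the surface. A sketch of one common formulation (not necessarily the paper's exact loss):

```python
import numpy as np

def ray_entropy(sigma, deltas, eps=1e-10):
    """Entropy of the normalized opacity distribution along one ray;
    minimizing it concentrates density at few samples (a compact surface).
    sigma, deltas: (N,) densities and inter-sample distances."""
    alpha = 1.0 - np.exp(-sigma * deltas)
    p = alpha / (alpha.sum() + eps)        # discrete distribution over samples
    return -(p * np.log(p + eps)).sum()    # added to the photometric loss
```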
With the success of neural volume rendering in novel view synthesis, neural implicit reconstruction with volume rendering has become popular. However, most methods optimize per-scene functions and are unable to generalize to novel scenes. We introduce VolRecon, a generalizable implicit reconstruction method with Signed Ray Distance Function (SRDF). To reconstruct with fine details and little noise, we combine projection features, aggregated from multi-view features with a view transformer, and volume features interpolated from a coarse global feature volume. A ray transformer computes SRDF values of all the samples along a ray to estimate the surface location; these values are used for volume rendering of color and depth. Extensive experiments on DTU and ETH3D demonstrate the effectiveness and generalization ability of our method. On DTU, our method outperforms SparseNeuS by about 30% in sparse view reconstruction and achieves comparable quality to MVSNet in full view reconstruction. Besides, our method shows good generalization ability on the large-scale ETH3D benchmark. Project page: https://fangjinhuawang.github.io/VolRecon.
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (nonconvolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
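Before entering the MLP, each input coordinate is lifted by NeRF's frequency-based positional encoding γ, which is what lets the network represent high-frequency geometry and appearance:

```python
import numpy as np

def positional_encoding(x, n_freqs=10):
    """NeRF's frequency encoding gamma(x): maps each coordinate of x
    (shape (..., 3)) to [sin(2^k * pi * x), cos(2^k * pi * x)] for
    k = 0..n_freqs-1, giving shape (..., 3 * 2 * n_freqs)."""
    out = []
    for k in range(n_freqs):
        out.append(np.sin((2.0 ** k) * np.pi * x))
        out.append(np.cos((2.0 ** k) * np.pi * x))
    return np.concatenate(out, axis=-1)
```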
[Figure panels: three input views of a held-out scene, with novel views rendered by NeRF and by pixelNeRF.] Figure 1: NeRF from one or few images. We present pixelNeRF, a learning framework that predicts a Neural Radiance Field (NeRF) representation from a single (top) or few posed images (bottom). PixelNeRF can be trained on a set of multi-view images, allowing it to generate plausible novel view synthesis from very few input images without test-time optimization (bottom left). In contrast, NeRF has no generalization capabilities and performs poorly when only three input views are available (bottom right).