Neural Radiance Fields (NeRFs) encode the radiance in a scene parameterized by the scene's plenoptic function. This is achieved by using an MLP together with a mapping to a higher-dimensional space, and has been shown to capture scenes with a great level of detail. Naturally, the same parameterization can be used to encode additional properties of the scene, beyond just its radiance. A particularly interesting property in this regard is the semantic decomposition of the scene. We introduce a novel technique for semantic soft decomposition of neural radiance fields (named SSDNeRF) which jointly encodes semantic signals in combination with radiance signals of a scene. Our approach provides a soft decomposition of the scene into semantic parts, enabling us to correctly encode multiple semantic classes blending along the same direction -- an impossible feat for existing methods. Not only does this lead to a detailed, 3D semantic representation of the scene, but we also show that the regularizing effects of the MLP used for encoding help to improve the semantic representation. We show state-of-the-art segmentation and reconstruction results on a dataset of common objects and demonstrate how the proposed approach can be applied for high-quality, temporally consistent video editing and re-compositing on a dataset of casually captured selfie videos.
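To make the idea of classes blending along one viewing direction concrete, the following is a minimal sketch that composites per-sample class probabilities with the same weights used for color in standard NeRF volume rendering. It is not the SSDNeRF implementation: the function name `render_ray`, its arguments, and the assumption that semantics are accumulated with the radiance weights are all illustrative.

```python
import math

def render_ray(sigmas, deltas, colors, class_probs):
    """Composite color and soft semantics along one ray (front to back).

    sigmas:      per-sample volume density
    deltas:      per-sample segment length
    colors:      per-sample RGB triples
    class_probs: per-sample class distributions (hypothetical semantic head)
    """
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    semantics = [0.0] * len(class_probs[0])
    for sigma, delta, color, probs in zip(sigmas, deltas, colors, class_probs):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha          # contribution to the pixel
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        semantics = [acc + weight * p for acc, p in zip(semantics, probs)]
        transmittance *= 1.0 - alpha
    return rgb, semantics, 1.0 - transmittance

# A half-transparent sample of class 0 in front of an opaque sample of class 1.
rgb, semantics, opacity = render_ray(
    sigmas=[math.log(2.0), 50.0],
    deltas=[1.0, 1.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    class_probs=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy ray the pixel receives roughly half of its label mass from each class, which a hard per-pixel label could not express.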
Figure 1. Given a monocular image sequence, NR-NeRF reconstructs a single canonical neural radiance field to represent geometry and appearance, and a per-time-step deformation field. We can render the scene into a novel spatio-temporal camera trajectory that significantly differs from the input trajectory. NR-NeRF also learns rigidity scores and correspondences without direct supervision on either. We can use the rigidity scores to remove the foreground, we can supersample along the time dimension, and we can exaggerate or dampen motion.
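Neural implicit representations have shown their effectiveness in novel view synthesis and high-quality 3D reconstruction from multi-view images. However, most approaches focus on holistic scene representation and ignore the individual objects inside the scene, which limits potential downstream applications. To learn object-compositional representations, a few works use 2D semantic maps as a training cue to grasp the differences between objects. But they neglect the strong connection between object geometry and instance semantic information, which leads to inaccurate modeling of individual instances. This paper proposes a novel framework, ObjectSDF, to build an object-compositional neural implicit representation with high fidelity in both 3D reconstruction and object representation. Observing the ambiguity of conventional volume rendering pipelines, we model the scene by combining the signed distance functions (SDFs) of individual objects to exert an explicit surface constraint. The key to distinguishing different instances is to revisit the strong association between an individual object's SDF and its semantic label. In particular, we convert the semantic information into a function of the object SDF and develop a unified and compact representation for the scene and its objects. Experimental results show the superiority of the ObjectSDF framework in representing both the holistic object-compositional scene and the individual instances. Code can be found at https://qianyiwu.github.io/objectsdf/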
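The composition of per-object SDFs can be sketched as follows. This is a minimal illustration under two assumptions that the abstract does not confirm: that the object SDFs are combined with a pointwise minimum (the standard SDF union), and that the semantic value is a sigmoid of the negated, scaled SDF. The helper names `sphere_sdf`, `scene_sdf`, and `semantic_scores`, and the `sharpness` parameter, are hypothetical.

```python
import math

def sphere_sdf(center, radius):
    """Return an SDF for a sphere: negative inside, positive outside."""
    def sdf(point):
        return math.dist(point, center) - radius
    return sdf

def scene_sdf(object_sdfs, point):
    """Assumed union rule: the scene surface is the nearest object surface."""
    return min(sdf(point) for sdf in object_sdfs)

def semantic_scores(object_sdfs, point, sharpness=20.0):
    """Map each object's SDF to a soft label in (0, 1).

    Assumption: a sigmoid of the negated, scaled SDF, so points inside an
    object score near 1 for it and points well outside score near 0.
    """
    scores = []
    for sdf in object_sdfs:
        z = min(sharpness * sdf(point), 60.0)  # clamp to avoid overflow in exp
        scores.append(1.0 / (1.0 + math.exp(z)))
    return scores

objects = [sphere_sdf((0.0, 0.0, 0.0), 1.0), sphere_sdf((3.0, 0.0, 0.0), 1.0)]
query = (0.5, 0.0, 0.0)  # inside the first sphere, outside the second
distance = scene_sdf(objects, query)
labels = semantic_scores(objects, query)
```

Because the labels are derived from the same SDFs that define the geometry, a change in an object's surface moves its semantic boundary with it, which is the coupling the abstract emphasizes.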
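Obtaining 3D object representations is important for creating photo-realistic simulators and collecting assets for AR/VR applications. Neural fields have shown their effectiveness in learning a continuous volumetric representation of a scene from 2D images, but acquiring object representations from these models with weak supervision remains an open challenge. In this paper we introduce LaTeRF, a method for extracting an object of interest from a scene given 2D images of the entire scene with known camera poses, a natural language description of the object, and a small number of point labels marking object and non-object points in the input images. To faithfully extract the object from the scene, LaTeRF extends the NeRF formulation with an additional "objectness" probability at each 3D point. Additionally, we leverage the rich latent space of a pre-trained CLIP model, combined with our differentiable object renderer, to inpaint the occluded parts of the object. We demonstrate high-fidelity object extraction on both synthetic and real datasets and justify our design choices through an extensive ablation study.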
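Synthesizing photo-realistic images and videos is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, now often referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...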
https://video-nerf.github.io Figure 1. Our method takes a single casually captured video as input and learns a space-time neural irradiance field. (Top) Sample frames from the input video. (Middle) Novel view images rendered from textured meshes constructed from depth maps. (Bottom) Our results rendered from the proposed space-time neural irradiance field.
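We present neural radiance fields for rendering and temporal (4D) reconstruction of humans, as captured by a sparse set of cameras or even from a monocular video. Our approach combines ideas from neural scene representation, novel-view synthesis, and implicit statistical geometric human representations, coupled using novel loss functions. Instead of learning a radiance field with a uniform occupancy prior, we constrain it with a structured implicit human body model, represented using signed distance functions. This allows us to robustly fuse information from sparse views and to generalize well beyond the poses or views observed in training. Moreover, we apply geometric constraints to co-learn the structure of the observed subject (including both body and clothing) and to regularize the radiance field towards geometrically plausible solutions. Extensive experiments on multiple datasets demonstrate the robustness and accuracy of our approach, its ability to generalize significantly beyond a set of poses and views, and statistical extrapolation beyond the observed shape.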
Recent advances in neural radiance fields have enabled the high-fidelity 3D reconstruction of complex scenes for novel view synthesis. However, it remains underexplored how the appearance of such representations can be efficiently edited while maintaining photorealism. In this work, we present PaletteNeRF, a novel method for photorealistic appearance editing of neural radiance fields (NeRF) based on 3D color decomposition. Our method decomposes the appearance of each 3D point into a linear combination of palette-based bases (i.e., 3D segmentations defined by a group of NeRF-type functions) that are shared across the scene. While our palette-based bases are view-independent, we also predict a view-dependent function to capture the color residual (e.g., specular shading). During training, we jointly optimize the basis functions and the color palettes, and we also introduce novel regularizers to encourage the spatial coherence of the decomposition. Our method allows users to efficiently edit the appearance of the 3D scene by modifying the color palettes. We also extend our framework with compressed semantic features for semantic-aware appearance editing. We demonstrate that our technique is superior to baseline methods both quantitatively and qualitatively for appearance editing of complex real-world scenes.
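The color model described above, a linear combination of shared palette colors plus a view-dependent residual, can be sketched for a single 3D point as follows. This is an illustrative sketch only: the function name `palette_color` and its arguments are hypothetical, and in the method the weights and residual would come from learned functions rather than constants.

```python
def palette_color(weights, palette, residual):
    """Color of one 3D point: palette blend plus a view-dependent residual.

    weights:  one blending weight per palette entry (view-independent)
    palette:  list of RGB triples shared across the scene
    residual: RGB offset standing in for view-dependent effects
    """
    blended = [
        sum(w * base[channel] for w, base in zip(weights, palette))
        for channel in range(3)
    ]
    return [b + r for b, r in zip(blended, residual)]

weights = [0.75, 0.25]
residual = (0.0, 0.1, 0.0)

original = palette_color(weights, [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], residual)

# Editing: swap the first palette entry from red to green and re-evaluate.
edited = palette_color(weights, [(0.0, 1.0, 0.0), (0.0, 0.0, 1.0)], residual)
```

Recoloring touches only the palette, so the per-point weights and the residual are reused unchanged, which is why palette edits can be applied efficiently to the whole scene.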
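Given a monocular video, segmenting and decoupling dynamic objects while recovering the static environment is a widely studied problem in machine intelligence. Existing solutions usually approach this problem in the image domain, limiting their performance and understanding of the environment. We introduce Decoupled Dynamic Neural Radiance Field (D$^2$NeRF), a self-supervised approach that takes a monocular video and learns a 3D scene representation which decouples moving objects, including their shadows, from the static background. Our method represents the moving objects and the static background with two separate neural radiance fields, only one of which is allowed to change over time. A naive implementation of this approach leads to the dynamic component taking over the static one, as the representation of the former is inherently more general and prone to overfitting. To this end, we propose a novel loss to promote the correct separation of phenomena. We further propose a shadow field network to detect and decouple dynamically moving shadows. We introduce a new dataset containing various dynamic objects and shadows and demonstrate that our method achieves better performance than state-of-the-art approaches in decoupling dynamic and static 3D objects, occlusion and shadow removal, and image segmentation of moving objects.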
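One common way to render two radiance fields together is to add their densities at each sample and mix their colors in proportion to density. The sketch below assumes that rule; the abstract does not state the exact compositing used, and the names `blend_sample` and `render` are hypothetical. The separation loss and the shadow field are not modeled here.

```python
import math

def blend_sample(sigma_static, color_static, sigma_dynamic, color_dynamic):
    """Merge a static and a dynamic sample taken at the same 3D location."""
    sigma = sigma_static + sigma_dynamic
    if sigma == 0.0:
        return 0.0, (0.0, 0.0, 0.0), 0.0
    color = tuple(
        (sigma_static * cs + sigma_dynamic * cd) / sigma
        for cs, cd in zip(color_static, color_dynamic)
    )
    return sigma, color, sigma_dynamic / sigma  # last value: dynamic share

def render(samples, delta):
    """Composite blended samples front to back.

    Returns the pixel color and a soft dynamic-object mask value.
    """
    transmittance, rgb, dynamic = 1.0, [0.0, 0.0, 0.0], 0.0
    for sample in samples:
        sigma, color, share = blend_sample(*sample)
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        dynamic += weight * share
        transmittance *= 1.0 - alpha
    return rgb, dynamic

# A moving red object in front of a static blue wall.
moving = (0.0, (0.0, 0.0, 0.0), 100.0, (1.0, 0.0, 0.0))
wall = (100.0, (0.0, 0.0, 1.0), 0.0, (0.0, 0.0, 0.0))
rgb_with_object, mask_with_object = render([moving, wall], delta=1.0)
rgb_background, mask_background = render([wall], delta=1.0)
```

Rendering with the dynamic samples omitted yields the static background alone, and the accumulated dynamic share acts as a per-pixel segmentation of the moving content.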
Figure 1: Our method can synthesize novel views in both space and time from a single monocular video of a dynamic scene. Here we show video results with various configurations of fixing and interpolating view and time (left), as well as a visualization of the recovered scene geometry (right). Please view with Adobe Acrobat or KDE Okular to see animations.
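In this paper, we study the problem of 3D scene geometry decomposition and manipulation from 2D views. By leveraging recent implicit neural representation techniques, particularly the appealing neural radiance fields, we introduce an object field component to learn unique codes for all individual objects in 3D space from 2D supervision only. The key to this component is a series of carefully designed loss functions that enable every 3D point, especially in non-occupied space, to be effectively optimized even without 3D labels. In addition, we introduce an inverse query algorithm to freely manipulate any specified 3D object shape in the learned scene representation. Notably, our manipulation algorithm can explicitly tackle key issues such as object collisions and visual occlusions. Our method, called DM-NeRF, is among the first to simultaneously reconstruct, decompose, manipulate, and render complex 3D scenes in a single pipeline. Extensive experiments on three datasets clearly show that our method can accurately decompose all 3D objects from 2D views, allowing any object of interest to be freely manipulated in 3D space, for example by translation, rotation, size adjustment, and deformation.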
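Photorealistic rendering and reposing of humans is important for enabling augmented reality experiences. We propose a novel framework to reconstruct the human and the scene so that they can be rendered with novel human poses and views from just a single in-the-wild video. Given a video captured by a moving camera, we train two NeRF models: a human NeRF model and a scene NeRF model. To train these models, we rely on existing methods to estimate the rough geometry of the human and the scene. These rough geometry estimates allow us to create a warping field from the observation space to the canonical, pose-independent space. From just a 10-second video clip, our method provides high-quality renderings of the human under novel poses, from novel views, together with the background.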
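We present NeSF, a method for producing 3D semantic fields from posed RGB images alone. In place of classical 3D representations, our method builds on recent work in implicit neural scene representations, in which 3D structure is captured by point-wise functions. We leverage this methodology to recover 3D density fields, upon which we then train a 3D semantic segmentation model supervised by posed 2D semantic maps. Despite being trained on 2D signals alone, our method is able to generate 3D-consistent semantic maps from novel camera poses and can be queried at arbitrary 3D points. Notably, NeSF is compatible with any method that produces a density field, and its accuracy improves as the quality of the density field improves. Our empirical analysis demonstrates quality comparable to competitive 2D and 3D semantic segmentation baselines on complex, realistically rendered synthetic scenes. Our method is the first to offer truly dense 3D scene segmentation that requires only 2D supervision for training and does not require any semantic input for inference on novel scenes. We encourage readers to visit the project website.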
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of an MLP, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. Our project webpage is at dynibar.github.io.
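Volumetric neural rendering methods, such as neural radiance fields (NeRFs), have enabled photo-realistic novel view synthesis. However, in their standard form, NeRFs do not support the editing of objects, such as a human head, within a scene. In this work, we propose RigNeRF, a system that goes beyond novel view synthesis and enables full control of head pose and facial expressions learned from a single portrait video. We model changes in head pose and facial expressions using a deformation field that is guided by a 3D morphable face model (3DMM). The 3DMM effectively acts as a prior for RigNeRF, which learns to predict only residuals to the 3DMM deformations, and allows us to render novel (rigid) poses and (non-rigid) expressions that were not present in the input sequence. Using only a short smartphone-captured video for training, we demonstrate the effectiveness of our method on free-view synthesis of a portrait scene with explicit head pose and expression controls. The project page can be found here: http://shahrukhathar.github.io/2022/06/06/rignerf.html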
We propose Panoptic Lifting, a novel approach for learning panoptic 3D volumetric representations from images of in-the-wild scenes. Once trained, our model can render color images together with 3D-consistent panoptic segmentation from novel viewpoints. Unlike existing approaches which use 3D input directly or indirectly, our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network. Our core contribution is a panoptic lifting scheme based on a neural field representation that generates a unified and multi-view consistent, 3D panoptic representation of the scene. To account for inconsistencies of 2D instance identifiers across views, we solve a linear assignment with a cost based on the model's current predictions and the machine-generated segmentation masks, thus enabling us to lift 2D instances to 3D in a consistent way. We further propose and ablate contributions that make our method more robust to noisy, machine-generated labels, including test-time augmentations for confidence estimates, segment consistency loss, bounded segmentation fields, and gradient stopping. Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets, improving by 8.4, 13.8, and 10.6% in scene-level PQ over state of the art.
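The linear assignment step can be illustrated with a small sketch that matches per-view 2D instance identifiers to the model's instance slots. Two things here are assumptions rather than the paper's method: the cost is taken to be negative IoU between pixel sets, and the assignment is solved by brute force over permutations instead of a Hungarian-style solver, which is only practical for a handful of instances. The names `overlap_cost` and `match_instances` are hypothetical.

```python
from itertools import permutations

def overlap_cost(pred_mask, target_mask):
    """Negative IoU between two sets of pixel indices (lower is better)."""
    union = len(pred_mask | target_mask)
    return -len(pred_mask & target_mask) / union if union else 0.0

def match_instances(pred_masks, target_masks):
    """Return a dict mapping each target index to a distinct predicted index.

    Minimizes the total cost by brute force. Assumes there are at least as
    many predicted slots as target instances.
    """
    best = min(
        permutations(range(len(pred_masks)), len(target_masks)),
        key=lambda perm: sum(
            overlap_cost(pred_masks[p], target_masks[t])
            for t, p in enumerate(perm)
        ),
    )
    return dict(enumerate(best))

# The model's current instance slots, rendered into one view as pixel sets.
predicted = [{0, 1, 2}, {3, 4, 5}]
# Machine-generated masks for the same view, with arbitrary instance numbering.
machine_generated = [{3, 4}, {0, 1}]

mapping = match_instances(predicted, machine_generated)
```

Here the machine-generated instance 0 is mapped to slot 1 and instance 1 to slot 0, so the same physical object keeps one 3D identifier even though its 2D identifier differs from view to view.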
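The goal of this work is to perform 3D reconstruction and novel view synthesis from data captured by scanning platforms commonly deployed for world mapping in urban outdoor environments (e.g., Street View). Given a sequence acquired by cameras and scanners moving through an outdoor scene, we produce a model from which 3D surfaces can be extracted and novel RGB images can be synthesized. Our approach extends Neural Radiance Fields, which have been demonstrated to synthesize realistic novel images for small scenes in controlled settings, with new methods for leveraging asynchronously captured lidar data, for addressing exposure variation between captured images, and for leveraging predicted image segmentations to supervise densities on rays pointing at the sky. Each of these three extensions provides significant performance improvements in experiments on Street View data. Our system produces state-of-the-art 3D surface reconstructions and synthesizes higher-quality novel views in comparison to both traditional methods (e.g., COLMAP) and recent neural representations (e.g., Mip-NeRF).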
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
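The recent increase in popularity of volumetric representations for scene reconstruction and novel view synthesis has put renewed focus on animating volumetric content at high visual quality and in real time. While implicit deformation methods based on learned functions can produce impressive results, they are "black boxes" to artists and content creators, they require large amounts of training data to generalize meaningfully, and they do not produce realistic extrapolations outside the training data. In this work we address these issues by introducing a volume deformation method that runs in real time, is easy to edit with off-the-shelf software, and can extrapolate convincingly. To demonstrate the versatility of our method, we apply it in two scenarios: physics-based object deformation and telepresence, where avatars are controlled using blendshapes. We also perform thorough experiments showing that our method compares favorably both to volumetric approaches combined with implicit deformation and to methods based on mesh deformation.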
Neural Radiance Fields (NeRFs) are emerging as a ubiquitous scene representation that allows for novel view synthesis. Increasingly, NeRFs will be shareable with other people. Before sharing a NeRF, though, it might be desirable to remove personal information or unsightly objects. Such removal is not easily achieved with the current NeRF editing frameworks. We propose a framework to remove objects from a NeRF representation created from an RGB-D sequence. Our NeRF inpainting method leverages recent work in 2D image inpainting and is guided by a user-provided mask. Our algorithm is underpinned by a confidence-based view selection procedure. It chooses which of the individual 2D inpainted images to use in the creation of the NeRF, so that the resulting inpainted NeRF is 3D-consistent. We show that our method for NeRF editing is effective for synthesizing plausible inpaintings in a multi-view coherent manner. We validate our approach using a new and still-challenging dataset for the task of NeRF inpainting.