我们提出了一种便携式多型摄像头系统,该系统具有专用模型,用于动态场景中的新型视图和时间综合。我们的目标是使用我们的便携式多座相机从任何角度从任何角度出发为动态场景提供高质量的图像。为了实现这种新颖的观点和时间综合,我们开发了一个配备了五个相机的物理多型摄像头,以在时间和空间域中训练神经辐射场(NERF),以进行动态场景。我们的模型将6D坐标(3D空间位置,1D时间坐标和2D观看方向)映射到观看依赖性且随时间变化的发射辐射和体积密度。量渲染用于在指定的相机姿势和时间上渲染光真实的图像。为了提高物理相机的鲁棒性,我们提出了一个摄像机参数优化模块和一个时间框架插值模块,以促进跨时间的信息传播。我们对现实世界和合成数据集进行了实验以评估我们的系统,结果表明,我们的方法在定性和定量上优于替代解决方案。我们的代码和数据集可从https://yuenfuilau.github.io获得。
translated by 谷歌翻译
https://video-nerf.github.io Figure 1. Our method takes a single casually captured video as input and learns a space-time neural irradiance field. (Top) Sample frames from the input video. (Middle) Novel view images rendered from textured meshes constructed from depth maps. (Bottom) Our results rendered from the proposed space-time neural irradiance field.
translated by 谷歌翻译
Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach $\textbf{D}$istribution-$\textbf{D}$riven neural radiance fields offers high-quality view synthesis and a 3D solution to $\textbf{D}$etach the background from the entire $\textbf{D}$ynamic scene, which is called $\text{D}^4$NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate $\text{D}^4$NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.
translated by 谷歌翻译
Figure 1: Our method can synthesize novel views in both space and time from a single monocular video of a dynamic scene. Here we show video results with various configurations of fixing and interpolating view and time (left), as well as a visualization of the recovered scene geometry (right). Please view with Adobe Acrobat or KDE Okular to see animations.
translated by 谷歌翻译
Point of View & TimeFigure 1: We propose D-NeRF, a method for synthesizing novel views, at an arbitrary point in time, of dynamic scenes with complex non-rigid geometries. We optimize an underlying deformable volumetric function from a sparse set of input monocular views without the need of ground-truth geometry nor multi-view images. The figure shows two scenes under variable points of view and time instances synthesised by the proposed model.
translated by 谷歌翻译
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories, these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. Instead of encoding the entire dynamic scene within the weights of an MLP, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene-motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. Our project webpage is at dynibar.github.io.
translated by 谷歌翻译
最近的神经人类表示可以产生高质量的多视图渲染,但需要使用密集的多视图输入和昂贵的培训。因此,它们在很大程度上仅限于静态模型,因为每个帧都是不可行的。我们展示了人类学 - 一种普遍的神经表示 - 用于高保真自由观察动态人类的合成。类似于IBRNET如何通过避免每场景训练来帮助NERF,Humannerf跨多视图输入采用聚合像素对准特征,以及用于解决动态运动的姿势嵌入的非刚性变形场。原始人物员已经可以在稀疏视频输入的稀疏视频输入上产生合理的渲染。为了进一步提高渲染质量,我们使用外观混合模块增强了我们的解决方案,用于组合神经体积渲染和神经纹理混合的益处。各种多视图动态人类数据集的广泛实验证明了我们在挑战运动中合成照片 - 现实自由观点的方法和非常稀疏的相机视图输入中的普遍性和有效性。
translated by 谷歌翻译
Figure 1. Given a monocular image sequence, NR-NeRF reconstructs a single canonical neural radiance field to represent geometry and appearance, and a per-time-step deformation field. We can render the scene into a novel spatio-temporal camera trajectory that significantly differs from the input trajectory. NR-NeRF also learns rigidity scores and correspondences without direct supervision on either. We can use the rigidity scores to remove the foreground, we can supersample along the time dimension, and we can exaggerate or dampen motion.
translated by 谷歌翻译
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multiview posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods. 1
translated by 谷歌翻译
静态场景的新颖观看综合在生产照片逼真的结果方面取得了显着的进步。但是,对于动态内容的沉浸式渲染,仍然存在关键挑战。例如,一种基于精英图像的渲染框架之一,多平面图像(MPI)为静态场景产生高新的观看合成质量,但面临着建模动态部分的难度。此外,通过MPI建模动态变化可能需要庞大的存储空间和长期推理时间,其阻碍了其在实时方案中的应用。在本文中,我们提出了一种新的颞型MPI表示,其能够在整个视频中以紧凑的时间编码整个视频中的丰富的3D和动态变化信息。由于高度紧凑且表现力的潜在基础和共同学习的系数,任意时间实例的新颖 - 实例将能够实时具有高视觉质量。我们显示给定的可比内存消耗,我们提出的时间 - MPI框架能够生成时间实例MPI,只有0.002秒,速度快3000倍,与其他状态相比,3DB更高的平均视图合成PSNR - 艺术动态场景建模框架。
translated by 谷歌翻译
在本文中,我们为复杂场景进行了高效且强大的深度学习解决方案。在我们的方法中,3D场景表示为光场,即,一组光线,每组在到达图像平面时具有相应的颜色。对于高效的新颖视图渲染,我们采用了光场的双面参数化,其中每个光线的特征在于4D参数。然后,我们将光场配向作为4D函数,即将4D坐标映射到相应的颜色值。我们训练一个深度完全连接的网络以优化这种隐式功能并记住3D场景。然后,特定于场景的模型用于综合新颖视图。与以前需要密集的视野的方法不同,需要密集的视野采样来可靠地呈现新颖的视图,我们的方法可以通过采样光线来呈现新颖的视图并直接从网络查询每种光线的颜色,从而使高质量的灯场呈现稀疏集合训练图像。网络可以可选地预测每光深度,从而使诸如自动重新焦点的应用。我们的小说视图合成结果与最先进的综合结果相当,甚至在一些具有折射和反射的具有挑战性的场景中优越。我们在保持交互式帧速率和小的内存占地面积的同时实现这一点。
translated by 谷歌翻译
神经辐射场(NERF)最近在新型视图合成中取得了令人印象深刻的结果。但是,以前的NERF作品主要关注以对象为中心的方案。在这项工作中,我们提出了360ROAM,这是一种新颖的场景级NERF系统,可以实时合成大型室内场景的图像并支持VR漫游。我们的系统首先从多个输入$ 360^\ circ $图像构建全向神经辐射场360NERF。然后,我们逐步估算一个3D概率的占用图,该概率占用图代表了空间密度形式的场景几何形状。跳过空的空间和上采样占据的体素本质上可以使我们通过以几何学意识的方式使用360NERF加速量渲染。此外,我们使用自适应划分和扭曲策略来减少和调整辐射场,以进一步改进。从占用地图中提取的场景的平面图可以为射线采样提供指导,并促进现实的漫游体验。为了显示我们系统的功效,我们在各种场景中收集了$ 360^\ Circ $图像数据集并进行广泛的实验。基线之间的定量和定性比较说明了我们在复杂室内场景的新型视图合成中的主要表现。
translated by 谷歌翻译
Image view synthesis has seen great success in reconstructing photorealistic visuals, thanks to deep learning and various novel representations. The next key step in immersive virtual experiences is view synthesis of dynamic scenes. However, several challenges exist due to the lack of high-quality training datasets, and the additional time dimension for videos of dynamic scenes. To address this issue, we introduce a multi-view video dataset, captured with a custom 10-camera rig in 120FPS. The dataset contains 96 high-quality scenes showing various visual effects and human interactions in outdoor scenes. We develop a new algorithm, Deep 3D Mask Volume, which enables temporally-stable view extrapolation from binocular videos of dynamic scenes, captured by static cameras. Our algorithm addresses the temporal inconsistency of disocclusions by identifying the error-prone areas with a 3D mask volume, and replaces them with static background observed throughout the video. Our method enables manipulation in 3D space as opposed to simple 2D masks, We demonstrate better temporal stability than frame-by-frame static view synthesis methods, or those that use 2D masks. The resulting view synthesis videos show minimal flickering artifacts and allow for larger translational movements.
translated by 谷歌翻译
本文旨在减少透明辐射场的渲染时间。一些最近的作品用图像编码器配备了神经辐射字段,能够跨越场景概括,这避免了每场景优化。但是,它们的渲染过程通常很慢。主要因素是,在推断辐射场时,它们在空间中的大量点。在本文中,我们介绍了一个混合场景表示,它结合了最佳的隐式辐射场和显式深度映射,以便有效渲染。具体地,我们首先构建级联成本量,以有效地预测场景的粗糙几何形状。粗糙几何允许我们在场景表面附近的几个点来样,并显着提高渲染速度。该过程是完全可疑的,使我们能够仅从RGB图像共同学习深度预测和辐射现场网络。实验表明,该方法在DTU,真正的前瞻性和NERF合成数据集上展示了最先进的性能,而不是比以前的最可推广的辐射现场方法快至少50倍。我们还展示了我们的方法实时综合动态人类执行者的自由观点视频。代码将在https://zju3dv.github.io/enerf/处提供。
translated by 谷歌翻译
我们提出EV-NERF,这是一个从事件数据得出的神经辐射场。虽然事件摄像机可以测量高框架速率的细微亮度变化,但低照明或极端运动的测量却遭受了显着的域差异,并具有复杂的噪声。结果,基于事件的视觉任务的性能不会转移到具有挑战性的环境中,在这种环境中,事件摄像机预计会在普通摄像机上蓬勃发展。我们发现,NERF的多视图一致性提供了强大的自我实施信号,以消除虚假测量结果并提取一致的基础结构,尽管输入高度嘈杂。 EV-NERF的输入不是原始NERF的图像,而是事件测量值,并伴随着传感器的运动。使用反映传感器测量模型的损耗函数,EV-NERF创建了一个集成的神经体积,该量总结了捕获约2-4秒的非结构化和稀疏数据点。生成的神经体积还可以从具有合理深度估计的新型视图中产生强度图像,这可以作为各种基于视觉任务的高质量输入。我们的结果表明,EV-NERF在极端噪声条件和高动力范围成像下实现了强度图像重建的竞争性能。
translated by 谷歌翻译
我们提出了一种基于神经辐射场(NERF)的单个$ 360^\ PANORAMA图像合成新视图的方法。在类似环境中的先前研究依赖于多层感知的邻居插值能力来完成由遮挡引起的丢失区域,这导致其预测中的伪像。我们提出了360Fusionnerf,这是一个半监督的学习框架,我们介绍几何监督和语义一致性,以指导渐进式培训过程。首先,将输入图像重新投影至$ 360^\ Circ $图像,并在其他相机位置提取辅助深度图。除NERF颜色指导外,深度监督还改善了合成视图的几何形状。此外,我们引入了语义一致性损失,鼓励新观点的现实渲染。我们使用预先训练的视觉编码器(例如剪辑)提取这些语义功能,这是一个视觉变压器,经过数以千计的不同2D照片,并通过自然语言监督从网络中挖掘出来。实验表明,我们提出的方法可以在保留场景的特征的同时产生未观察到的区域的合理完成。 360fusionnerf在各种场景中接受培训时,转移到合成结构3D数据集(PSNR〜5%,SSIM〜3%lpips〜13%)时,始终达到最先进的性能,SSIM〜3%LPIPS〜9%)和replica360数据集(PSNR〜8%,SSIM〜2%LPIPS〜18%)。
translated by 谷歌翻译
我们人类正在进入虚拟时代,确实想将动物带到虚拟世界中。然而,计算机生成的(CGI)毛茸茸的动物受到乏味的离线渲染的限制,更不用说交互式运动控制了。在本文中,我们提出了Artemis,这是一种新型的神经建模和渲染管道,用于生成具有外观和运动合成的清晰神经宠物。我们的Artemis可以实现互动运动控制,实时动画和毛茸茸的动物的照片真实渲染。我们的Artemis的核心是神经生成的(NGI)动物引擎,该动物发动机采用了有效的基于OCTREE的动物动画和毛皮渲染的代表。然后,该动画等同于基于显式骨骼翘曲的体素级变形。我们进一步使用快速的OCTREE索引和有效的体积渲染方案来生成外观和密度特征地图。最后,我们提出了一个新颖的阴影网络,以在外观和密度特征图中生成外观和不透明度的高保真细节。对于Artemis中的运动控制模块,我们将最新动物运动捕获方法与最近的神经特征控制方案相结合。我们引入了一种有效的优化方案,以重建由多视图RGB和Vicon相机阵列捕获的真实动物的骨骼运动。我们将所有捕获的运动馈送到神经角色控制方案中,以生成具有运动样式的抽象控制信号。我们将Artemis进一步整合到支持VR耳机的现有引擎中,提供了前所未有的沉浸式体验,用户可以与各种具有生动动作和光真实外观的虚拟动物进行紧密互动。我们可以通过https://haiminluo.github.io/publication/artemis/提供我们的Artemis模型和动态毛茸茸的动物数据集。
translated by 谷歌翻译
3D reconstruction and novel view synthesis of dynamic scenes from collections of single views recently gained increased attention. Existing work shows impressive results for synthetic setups and forward-facing real-world data, but is severely limited in the training speed and angular range for generating novel views. This paper addresses these limitations and proposes a new method for full 360{\deg} novel view synthesis of non-rigidly deforming scenes. At the core of our method are: 1) An efficient deformation module that decouples the processing of spatial and temporal information for acceleration at training and inference time; and 2) A static module representing the canonical scene as a fast hash-encoded neural radiance field. We evaluate the proposed approach on the established synthetic D-NeRF benchmark, that enables efficient reconstruction from a single monocular view per time-frame randomly sampled from a full hemisphere. We refer to this form of inputs as monocularized data. To prove its practicality for real-world scenarios, we recorded twelve challenging sequences with human actors by sampling single frames from a synchronized multi-view rig. In both cases, our method is trained significantly faster than previous methods (minutes instead of days) while achieving higher visual accuracy for generated novel views. Our source code and data is available at our project page https://graphics.tu-bs.de/publications/kappel2022fast.
translated by 谷歌翻译
由于其显着的合成质量,最近,神经辐射场(NERF)最近对3D场景重建和新颖的视图合成进行了相当大的关注。然而,由散焦或运动引起的图像模糊,这通常发生在野外的场景中,显着降低了其重建质量。为了解决这个问题,我们提出了DeBlur-nerf,这是一种可以从模糊输入恢复尖锐的nerf的第一种方法。我们采用逐合成方法来通过模拟模糊过程来重建模糊的视图,从而使NERF对模糊输入的鲁棒。该仿真的核心是一种新型可变形稀疏内核(DSK)模块,其通过在每个空间位置变形规范稀疏内核来模拟空间变形模糊内核。每个内核点的射线起源是共同优化的,受到物理模糊过程的启发。该模块作为MLP参数化,具有能够概括为各种模糊类型。联合优化NERF和DSK模块允许我们恢复尖锐的NERF。我们证明我们的方法可用于相机运动模糊和散焦模糊:真实场景中的两个最常见的模糊。合成和现实世界数据的评估结果表明,我们的方法优于几个基线。合成和真实数据集以及源代码将公开可用于促进未来的研究。
translated by 谷歌翻译
综合照片 - 现实图像和视频是计算机图形的核心,并且是几十年的研究焦点。传统上,使用渲染算法(如光栅化或射线跟踪)生成场景的合成图像,其将几何形状和材料属性的表示为输入。统称,这些输入定义了实际场景和呈现的内容,并且被称为场景表示(其中场景由一个或多个对象组成)。示例场景表示是具有附带纹理的三角形网格(例如,由艺术家创建),点云(例如,来自深度传感器),体积网格(例如,来自CT扫描)或隐式曲面函数(例如,截短的符号距离)字段)。使用可分辨率渲染损耗的观察结果的这种场景表示的重建被称为逆图形或反向渲染。神经渲染密切相关,并将思想与经典计算机图形和机器学习中的思想相结合,以创建用于合成来自真实观察图像的图像的算法。神经渲染是朝向合成照片现实图像和视频内容的目标的跨越。近年来,我们通过数百个出版物显示了这一领域的巨大进展,这些出版物显示了将被动组件注入渲染管道的不同方式。这种最先进的神经渲染进步的报告侧重于将经典渲染原则与学习的3D场景表示结合的方法,通常现在被称为神经场景表示。这些方法的一个关键优势在于它们是通过设计的3D-一致,使诸如新颖的视点合成捕获场景的应用。除了处理静态场景的方法外,我们还涵盖了用于建模非刚性变形对象的神经场景表示...
translated by 谷歌翻译