We explore the problem of view synthesis from a narrow-baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds upon prior work in predicting a multiplane image (MPI), which represents scene content as a set of RGBα planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to 4× the lateral viewpoint movement allowed by prior work. Our method ameliorates two specific issues that limit the range of views renderable by prior methods: 1) We expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture along with a randomized-resolution training procedure, allowing our model to predict MPIs with increased disparity sampling frequency. 2) We reduce the repeated-texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth.
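As background for the MPI machinery this abstract builds on, here is a minimal NumPy sketch (illustrative names, not the paper's code) of how an MPI is rendered into a novel view: each fronto-parallel RGBα plane is warped into the target camera by the homography its depth induces, and the warped planes are over-composited from back to front.

import numpy as np

def plane_homography(K_ref, K_tgt, R, t, depth):
    """Homography mapping target-view pixels to reference-view pixels for a
    fronto-parallel plane at `depth` in the reference frame.
    (R, t) maps reference-frame points to the target frame: X_tgt = R @ X_ref + t."""
    n = np.array([0.0, 0.0, 1.0])                        # plane normal in the reference frame
    H_ref_to_tgt = K_tgt @ (R + np.outer(t, n) / depth) @ np.linalg.inv(K_ref)
    return np.linalg.inv(H_ref_to_tgt)                   # invert to get target -> reference

def warp_plane(plane_rgba, H_tgt_to_ref, out_hw):
    """Inverse-warp one float RGBA plane into the target view (nearest-neighbor)."""
    H_out, W_out = out_hw
    ys, xs = np.mgrid[0:H_out, 0:W_out]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    src = H_tgt_to_ref @ pix
    u, v = (src[:2] / src[2:]).round().astype(int)
    valid = (u >= 0) & (u < plane_rgba.shape[1]) & (v >= 0) & (v < plane_rgba.shape[0])
    out = np.zeros((H_out * W_out, 4))
    out[valid] = plane_rgba[v[valid], u[valid]]
    return out.reshape(H_out, W_out, 4)

def render_mpi(planes, depths, K_ref, K_tgt, R, t, out_hw):
    """Over-composite the warped RGBA planes from back (largest depth) to front."""
    rgb = np.zeros((*out_hw, 3))
    for rgba, d in sorted(zip(planes, depths), key=lambda pd: -pd[1]):
        warped = warp_plane(rgba, plane_homography(K_ref, K_tgt, R, t, d), out_hw)
        alpha = warped[..., 3:4]
        rgb = warped[..., :3] * alpha + rgb * (1.0 - alpha)
    return rgb

The disparity sampling frequency discussed in the abstract corresponds to how finely the depths list samples inverse depth; the paper's analysis ties the range of renderable views to that sampling rate.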
Figure overview: (1) fast and easy handheld capture, with the guideline that the closest object moves at most D pixels between views; (2) promote the sampled views to local light fields via a layered scene representation; (3) blend neighboring local light fields to render novel views.
A recent strand of work in view synthesis uses deep learning to generate multiplane images (a camera-centric, layered 3D representation) given two or more input images at known viewpoints. We apply this representation to single-view view synthesis, a problem which is more challenging but has potentially much wider application. Our method learns to predict a multiplane image directly from a single image input, and we introduce scale-invariant view synthesis for supervision, enabling us to train on online video. We show that this approach is applicable to several different datasets, that it additionally generates reasonable depth maps, and that it learns to fill in content behind the edges of foreground objects in background layers. Project page at https://single-view-mpi.github.io/.
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
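To make the rendering step concrete, below is a minimal NumPy sketch of the volume-rendering quadrature this abstract refers to: colors and densities sampled along a camera ray are composited into a pixel color using per-segment opacities and accumulated transmittance. The radiance_field callable stands in for the trained MLP and is an assumption; names are illustrative.

import numpy as np

def render_ray(radiance_field, origin, direction, near, far, n_samples=64):
    """Composite samples along one ray with the standard volume-rendering quadrature.
    `radiance_field(points, viewdirs)` must return (rgb, sigma): colors of shape
    (n_samples, 3) in [0, 1] and non-negative densities of shape (n_samples,)."""
    t_vals = np.linspace(near, far, n_samples)                 # sample depths along the ray
    points = origin[None, :] + t_vals[:, None] * direction[None, :]
    rgb, sigma = radiance_field(points, np.tile(direction, (n_samples, 1)))
    deltas = np.append(np.diff(t_vals), 1e10)                  # distances between adjacent samples
    alpha = 1.0 - np.exp(-sigma * deltas)                      # opacity contributed by each segment
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))       # transmittance accumulated up to each sample
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)                # expected color for this ray

Because every operation above is differentiable, training reduces to minimizing a photometric loss between such rendered pixels and the corresponding pixels of the posed input images, as the abstract notes.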
We introduce FaDIV-Syn, a fast deep method for novel view synthesis. Related methods are often limited by their depth estimation stage, where incorrect depth predictions can lead to large projection errors. To avoid this issue, we efficiently warp the input images into the target frame for a range of assumed depth planes. The resulting plane sweep volume (PSV) is fed directly into our network, which first estimates soft PSV masks in a self-supervised manner and then directly produces the novel output view. We thereby sidestep explicit depth estimation, which improves efficiency and performance on transparent, reflective, thin, and feature-rich scene parts. FaDIV-Syn can perform both interpolation and extrapolation tasks on the large-scale RealEstate10K dataset and outperforms state-of-the-art extrapolation methods. In contrast to comparable methods, it achieves real-time performance due to its lightweight architecture. We thoroughly evaluate ablations, such as removing the soft-masking network, training from fewer examples, and generalization to higher resolutions and stronger depth discretization.
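A minimal NumPy sketch of the plane sweep volume construction described above: a source image is warped into the target frame once per assumed fronto-parallel depth plane, and the warped slices are stacked into the volume that is fed to the network. The geometric conventions and names are illustrative assumptions, not the paper's code.

import numpy as np

def plane_sweep_volume(src_img, K_src, K_tgt, R, t, depth_planes):
    """Build a PSV by warping one source image to the target camera for each
    hypothesized depth plane (defined in the target frame).
    (R, t) maps target-frame points to the source frame: X_src = R @ X_tgt + t."""
    H, W = src_img.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    tgt_pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    n = np.array([0.0, 0.0, 1.0])                     # fronto-parallel plane normal
    slices = []
    for d in depth_planes:
        H_t2s = K_src @ (R + np.outer(t, n) / d) @ np.linalg.inv(K_tgt)
        src = H_t2s @ tgt_pix
        u, v = (src[:2] / src[2:]).round().astype(int)
        valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        plane = np.zeros((H * W, src_img.shape[2]))
        plane[valid] = src_img[v[valid], u[valid]]
        slices.append(plane.reshape(H, W, -1))
    return np.stack(slices)                           # shape (num_planes, H, W, C)

Stacking such volumes for both input images gives the network evidence across all depth hypotheses at once, without ever committing to an explicit depth map, which is the point of sidestepping depth estimation.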
Deep networks have recently enjoyed enormous success when applied to recognition and classification problems in computer vision [20,29], but their use in graphics problems has been limited ([21, 7] are notable recent exceptions). In this work, we present a novel deep architecture that performs new view synthesis directly from pixels, trained from a large number of posed image sets. In contrast to traditional approaches which consist of multiple complex stages of processing, each of which require careful tuning and can fail in unexpected ways, our system is trained end-to-end. The pixels from neighboring views of a scene are presented to the network which then directly produces the pixels of the unseen view. The benefits of our approach include generality (we only require posed image sets and can easily apply our method to different domains), and high quality results on traditionally difficult scenes. We believe this is due to the end-to-end nature of our system which is able to plausibly generate pixels according to color, depth, and texture priors learnt automatically from the training data. To verify our method we show that it can convincingly reproduce known test views from nearby imagery. Additionally we show images rendered from novel viewpoints. To our knowledge, our work is the first to apply deep learning to the problem of new view synthesis from sets of real-world, natural imagery.
The DeepView architecture. (a) The network takes a sparse set of input images shot from different viewpoints. (b, c) The scene is reconstructed using learned gradient descent, producing a multi-plane image (a series of fronto-parallel, RGBA textured planes). (d) The multi-plane image is suitable for real-time, high-quality rendering of novel viewpoints. The result above uses four input views in a 30cm × 20cm rectangular layout. The novel view was rendered with a virtual camera positioned at the centroid of the four input views. More results, including video and an interactive viewer, at: https://augmentedperception.github.io/deepview/
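The "learned gradient descent" mentioned in the caption refers to iteratively improving the MPI estimate with a learned update rule instead of a hand-tuned step size. A schematic sketch of that loop follows; grad_fn (gradient of a rendering loss with respect to the current MPI) and update_net (a trained network) are stand-ins assumed for illustration, not DeepView's actual components.

import numpy as np

def learned_gradient_descent(mpi_init, input_views, grad_fn, update_net, n_steps=4):
    """Refine an initial MPI with a small, fixed number of learned update steps.
    Each step feeds the current estimate and the gradient of the rendering loss
    to a learned network whose output replaces the classic '-lr * grad' update."""
    mpi = mpi_init
    for step in range(n_steps):
        grad = grad_fn(mpi, input_views)            # d(loss)/d(MPI) at the current estimate
        mpi = mpi + update_net(mpi, grad, step)     # learned update direction and magnitude
        mpi = np.clip(mpi, 0.0, 1.0)                # keep colors and alphas in a valid range
    return mpi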
Representing scenes with multiple semi-transparent colored layers has been a popular and successful choice for real-time novel view synthesis. Existing approaches infer colors and transparency values over regularly spaced layers of planar or spherical shape. In this work, we introduce a new view synthesis approach based on multiple semi-transparent layers with scene-adapted geometry. Our approach infers such representations from stereo pairs in two stages. The first stage estimates the geometry of a small number of data-adaptive layers from a given pair of views. The second stage produces the color and transparency values for these layers, yielding the final representation for novel view synthesis. Importantly, both stages are connected through a differentiable renderer and are trained end-to-end. In experiments, we demonstrate the advantage of the proposed approach over the use of regularly spaced layers that do not adapt to scene geometry. Despite being orders of magnitude faster at render time, our approach also outperforms the recently proposed IBRNet system, which is based on an implicit geometric representation. See the results at https://samsunglabs.github.io/stereolayers.
Virtual reality (VR) headsets provide an immersive, stereoscopic visual experience, but at the cost of blocking users from directly observing their physical environment. Passthrough techniques aim to address this limitation by leveraging outward-facing cameras to reconstruct the images that would otherwise be seen by the user without the headset. This is inherently a real-time view synthesis challenge, since passthrough cameras cannot be physically co-located with the eyes. Existing passthrough techniques suffer from distracting reconstruction artifacts, largely due to the lack of accurate depth information (especially for near-field and disoccluded objects), and also exhibit limited image quality (e.g., low resolution and monochromatic output). In this paper, we propose the first learned passthrough method and assess its performance using a custom VR headset that contains a stereo pair of RGB cameras. Through both simulations and experiments, we demonstrate that our learned passthrough method delivers superior image quality compared to state-of-the-art methods, while meeting the strict VR requirements of real-time, perspective-correct stereoscopic view synthesis over a wide field of view for desktop-connected headsets.
We present HRF-Net, a novel view synthesis method based on holistic radiance fields, which renders novel views using a sparse set of inputs. Recent generalizing view synthesis methods also leverage radiance fields, but their rendering speed is not real-time. There are existing methods that can train and render novel views efficiently, but they cannot generalize to unseen scenes. Our method addresses the problem of real-time rendering for generalizing view synthesis and consists of two main stages: a holistic radiance field predictor and a convolution-based neural renderer. This architecture not only infers consistent scene geometry based on implicit neural fields, but also renders novel views efficiently using a single GPU. We first train HRF-Net on multiple 3D scenes of the DTU dataset, and the network can produce plausible novel views on unseen real and synthetic data using only photometric losses. Moreover, our method can leverage a denser set of reference images of a single scene to produce accurate novel views without relying on additional explicit representations, while still maintaining the high rendering speed of the pre-trained model. Experimental results show that HRF-Net outperforms state-of-the-art neural rendering methods on various synthetic and real datasets.
We introduce the problem of generating long-range novel views corresponding to an arbitrarily long camera trajectory given a single image. This is a challenging problem that goes far beyond the capabilities of current view synthesis methods, which degrade quickly when presented with large camera motions. Methods for video generation also have a limited ability to produce long sequences and are often agnostic to scene geometry. We take a hybrid approach that integrates geometry and image synthesis in an iterative 'render, refine, repeat' framework, allowing for long-range generation that covers large distances over hundreds of frames. Our approach can be trained from a set of monocular video sequences. We propose a dataset of aerial footage of coastal scenes, and compare our method against recent view synthesis and conditional video generation baselines, showing that it can generate plausible scenes for much longer time horizons over large camera trajectories than existing methods. Project page: https://infinite-nature.github.io/.
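A schematic sketch of the iterative 'render, refine, repeat' loop described above. render_fn (reprojecting the current frame and its disparity toward the next camera, which leaves holes at disocclusions) and refine_fn (a learned network that fills the holes and adds detail) are stand-ins assumed for illustration; the paper's actual components may differ.

def perpetual_view_generation(frame, disparity, camera_path, render_fn, refine_fn):
    """Generate a long camera fly-through from a single starting frame by
    alternating geometric rendering and learned refinement at every step."""
    frames = [frame]
    for cam_prev, cam_next in zip(camera_path[:-1], camera_path[1:]):
        frame, disparity = render_fn(frame, disparity, cam_prev, cam_next)  # render: reproject to the next view
        frame, disparity = refine_fn(frame, disparity)                      # refine: inpaint holes, add detail
        frames.append(frame)                                                # repeat from the refined frame
    return frames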
https://video-nerf.github.io Figure 1. Our method takes a single casually captured video as input and learns a space-time neural irradiance field. (Top) Sample frames from the input video. (Middle) Novel view images rendered from textured meshes constructed from depth maps. (Bottom) Our results rendered from the proposed space-time neural irradiance field.
We present a method that synthesizes novel views of complex scenes by interpolating a sparse set of nearby views. The core of our method is a network architecture that includes a multilayer perceptron and a ray transformer that estimates radiance and volume density at continuous 5D locations (3D spatial locations and 2D viewing directions), drawing appearance information on the fly from multiple source views. By drawing on source views at render time, our method hearkens back to classic work on image-based rendering (IBR), and allows us to render high-resolution imagery. Unlike neural scene representation work that optimizes per-scene functions for rendering, we learn a generic view interpolation function that generalizes to novel scenes. We render images using classic volume rendering, which is fully differentiable and allows us to train using only multiview posed images as supervision. Experiments show that our method outperforms recent novel view synthesis methods that also seek to generalize to novel scenes. Further, if fine-tuned on each scene, our method is competitive with state-of-the-art single-scene neural rendering methods.
Novel view synthesis for humans in motion is a challenging computer vision problem that enables applications such as free-viewpoint video. Existing methods typically use complex setups with multiple input views, 3D supervision, or pre-trained models that do not generalize to new identities. Aiming to address these limitations, we propose a novel view synthesis framework that generates realistic renderings of any person captured by a single-view, sparse RGB-D sensor, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture that learns dense features in novel views obtained with sphere-based neural rendering, and creates complete renders using a global context inpainting model. Additionally, an enhancer network leverages the overall fidelity, even in occluded areas of the original view, to produce crisp renders with fine details. We show that our method produces high-quality novel views of synthetic and real human actors given a single sparse RGB-D input. It generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior human view synthesis methods and is robust to different levels of input sparsity.
where the highest resolution is required, using facial performance capture as a case in point.
The partial occlusion effect is the phenomenon that blurry objects near the camera appear semi-transparent, so that parts of the occluded background remain visible. However, simulating realistic partial occlusion effects is a challenge for existing bokeh rendering methods, because information about the occluded areas is missing from an all-in-focus image. Inspired by learnable 3D scene representations, we attempt to address partial occlusion by introducing a novel MPI-based high-resolution bokeh rendering framework, termed MPIB. To this end, we first present an analysis of how to apply the MPI representation to bokeh rendering. Based on this analysis, we propose an MPI representation module combined with a background inpainting module to obtain a high-resolution scene representation. This representation can then be reused to render various bokeh effects according to controlling parameters. To train and test our model, we also design a ray-tracing-based bokeh generator for data generation. Extensive experiments on synthetic and real-world images validate the effectiveness and flexibility of this framework.
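To illustrate why a layered representation is a natural fit for partial occlusion, here is a generic NumPy sketch of one simple way to render defocus blur from float RGBα layers: each layer is blurred with a circle-of-confusion radius proportional to its disparity offset from the focal plane, and the blurred layers are over-composited back to front, so a blurred foreground layer becomes semi-transparent over the (inpainted) background. Names and the blur model are illustrative; this is not the paper's rendering module.

import numpy as np

def _disk_blur(img, radius):
    """Naive disk blur; a stand-in for a proper bokeh scatter/gather kernel."""
    r = int(radius)
    if r < 1:
        return img
    pad = np.pad(img, ((r, r), (r, r), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    count = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dx * dx + dy * dy <= r * r:
                out += pad[r + dy:r + dy + img.shape[0], r + dx:r + dx + img.shape[1]]
                count += 1
    return out / count

def render_bokeh_from_mpi(layers, disparities, focus_disparity, aperture):
    """Blur each RGBA layer by its circle of confusion, then composite back to front."""
    rgb = np.zeros((*layers[0].shape[:2], 3))
    for rgba, disp in sorted(zip(layers, disparities), key=lambda ld: ld[1]):   # far (small disparity) first
        coc = aperture * abs(disp - focus_disparity)                  # circle-of-confusion radius in pixels
        color = _disk_blur(rgba[..., :3] * rgba[..., 3:4], coc)       # blur premultiplied color
        alpha = np.clip(_disk_blur(rgba[..., 3:4], coc), 0.0, 1.0)
        rgb = color + rgb * (1.0 - alpha)
    return rgb

Varying the aperture and focus_disparity parameters then produces different bokeh effects, while the background inpainting module supplies plausible content for the occluded regions that become visible through the blurred foreground.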
We present a method for novel view synthesis from input images that are freely distributed around a scene. Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts. We calibrate the input images via SfM and erect a coarse geometric scaffold via MVS. This scaffold is used to create a proxy depth map for a novel view of the scene. Based on this depth map, a recurrent encoder-decoder network processes reprojected features from nearby views and synthesizes the new view. Our network does not need to be optimized for a given scene. After training on a dataset, it works in previously unseen environments with no finetuning or per-scene optimization. We evaluate the presented approach on challenging real-world datasets, including Tanks and Temples, where we demonstrate successful view synthesis for the first time and substantially outperform prior and concurrent work.
Image view synthesis has seen great success in reconstructing photorealistic visuals, thanks to deep learning and various novel representations. The next key step in immersive virtual experiences is view synthesis of dynamic scenes. However, several challenges exist, due to the lack of high-quality training datasets and the additional time dimension of videos of dynamic scenes. To address this issue, we introduce a multi-view video dataset, captured with a custom 10-camera rig at 120 FPS. The dataset contains 96 high-quality scenes showing various visual effects and human interactions in outdoor scenes. We develop a new algorithm, Deep 3D Mask Volume, which enables temporally stable view extrapolation from binocular videos of dynamic scenes captured by static cameras. Our algorithm addresses the temporal inconsistency of disocclusions by identifying the error-prone areas with a 3D mask volume and replacing them with the static background observed throughout the video. Our method enables manipulation in 3D space, as opposed to simple 2D masks. We demonstrate better temporal stability than frame-by-frame static view synthesis methods, or those that use 2D masks. The resulting view synthesis videos show minimal flickering artifacts and allow for larger translational movements.
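A minimal sketch of the kind of mask-volume blending the abstract describes: each plane of the per-frame (dynamic) layered volume is blended with a static-background volume according to a per-plane, per-pixel mask that flags error-prone disocclusion regions. The shapes, names, and blending form are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def blend_with_mask_volume(dynamic_mpi, static_mpi, mask_volume):
    """Replace error-prone regions of a per-frame MPI with static background.
    dynamic_mpi, static_mpi: (D, H, W, 4) RGBA plane stacks for one time step
    mask_volume:             (D, H, W) values in [0, 1]; 1 marks error-prone areas
    Where the mask is 1 the static background content is used; elsewhere the
    per-frame dynamic content is kept."""
    m = mask_volume[..., None]
    return m * static_mpi + (1.0 - m) * dynamic_mpi

The blended stack can then be warped and over-composited into the extrapolated viewpoint like any other layered representation, which is what makes this a manipulation in 3D space rather than a 2D mask applied to rendered frames.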
Figure 1: Our method can synthesize novel views in both space and time from a single monocular video of a dynamic scene. Here we show video results with various configurations of fixing and interpolating view and time (left), as well as a visualization of the recovered scene geometry (right). Please view with Adobe Acrobat or KDE Okular to see animations.