A recent strand of work in view synthesis uses deep learning to generate multiplane images (a camera-centric, layered 3D representation) given two or more input images at known viewpoints. We apply this representation to single-view view synthesis, a problem which is more challenging but has potentially much wider application. Our method learns to predict a multiplane image directly from a single image input, and we introduce scale-invariant view synthesis for supervision, enabling us to train on online video. We show this approach is applicable to several different datasets, that it additionally generates reasonable depth maps, and that it learns to fill in content behind the edges of foreground objects in background layers. Project page at https://single-view-mpi.github.io/.
Fast and easy handheld capture with guideline: closest object moves at most D pixels between views. Promote sampled views to local light fields via a layered scene representation. Blend neighboring local light fields to render novel views.
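The capture guideline above can be turned into a rough camera-spacing budget. The sketch below is an illustration only, not the authors' procedure: it assumes ideal pinhole cameras that translate purely sideways, and measures disparity relative to points at infinity, so that disparity equals focal length (in pixels) times baseline divided by depth. The function names and the example numbers are hypothetical.

```python
def disparity_px(baseline, focal_px, depth):
    """Pixel shift of a point at `depth` between two laterally offset pinhole views."""
    if depth <= 0 or focal_px <= 0:
        raise ValueError("depth and focal_px must be positive")
    return focal_px * baseline / depth


def max_baseline(max_disparity_px, focal_px, min_depth):
    """Largest lateral spacing that keeps the closest object within max_disparity_px."""
    if max_disparity_px <= 0 or focal_px <= 0 or min_depth <= 0:
        raise ValueError("all arguments must be positive")
    # Invert disparity = focal_px * baseline / depth at the closest depth.
    return max_disparity_px * min_depth / focal_px


if __name__ == "__main__":
    # Hypothetical numbers: D = 64 px, focal length 800 px, closest object at 1 m.
    spacing = max_baseline(64, 800, 1.0)
    print(f"keep neighboring views within {spacing:.3f} m of each other")
```

Under these assumptions, halving the distance to the closest object halves the allowed spacing between views.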
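We introduce FaDIV-Syn, a fast depth-independent method for novel view synthesis. Related methods are often limited by their depth estimation stage, where incorrect depth predictions can lead to large projection errors. To avoid this issue, we efficiently warp the input images into the target frame for a range of assumed depth planes. The resulting plane sweep volume (PSV) is fed directly into our network, which first estimates soft PSV masks in a self-supervised manner and then directly produces the novel output view. We therefore side-step explicit depth estimation. This improves efficiency and performance on transparent, reflective, thin, and featureless scene parts. FaDIV-Syn can perform both interpolation and extrapolation tasks on the large-scale RealEstate10K dataset and outperforms state-of-the-art extrapolation methods. In contrast to comparable methods, it achieves real-time performance thanks to its lightweight architecture. We thoroughly evaluate ablations, such as removing the soft-masking network, training from fewer examples, and generalization to higher resolutions and stronger depth discretization.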
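As a rough illustration of how a plane sweep volume is built, the sketch below computes the homography that a fronto-parallel plane at an assumed depth induces between the target and a source camera. It is a generic textbook construction, not the paper's implementation. Assumptions: pinhole cameras, the pose convention X_src = R X_tgt + t, and depth planes spaced uniformly in inverse depth. All function names are hypothetical.

```python
def intrinsics(focal_px, cx, cy):
    """3x3 pinhole intrinsics matrix with square pixels and no skew."""
    return [[focal_px, 0.0, cx], [0.0, focal_px, cy], [0.0, 0.0, 1.0]]


def intrinsics_inverse(focal_px, cx, cy):
    """Closed-form inverse of the matrix returned by intrinsics()."""
    return [[1.0 / focal_px, 0.0, -cx / focal_px],
            [0.0, 1.0 / focal_px, -cy / focal_px],
            [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def plane_homography(k_src, k_tgt_inv, rot, trans, depth):
    """Map target pixels to source pixels for the plane z = depth (target frame).

    Uses H = K_src (R + t n^T / depth) K_tgt^{-1} with n = (0, 0, 1),
    under the convention X_src = R X_tgt + t.
    """
    m = [[rot[i][j] + (trans[i] / depth if j == 2 else 0.0) for j in range(3)]
         for i in range(3)]
    return matmul3(matmul3(k_src, m), k_tgt_inv)


def apply_homography(h, u, v):
    """Apply a 3x3 homography to pixel (u, v) and dehomogenize."""
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    return x / w, y / w


def inverse_depth_planes(near, far, count):
    """Depths spaced uniformly in 1/depth from near to far (count >= 2)."""
    inv_near, inv_far = 1.0 / near, 1.0 / far
    step = (inv_far - inv_near) / (count - 1)
    return [1.0 / (inv_near + step * i) for i in range(count)]
```

Sampling the source image through `plane_homography` once per depth in `inverse_depth_planes` and stacking the results gives one PSV slice per assumed plane.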
Deep networks have recently enjoyed enormous success when applied to recognition and classification problems in computer vision [20,29], but their use in graphics problems has been limited ([21, 7] are notable recent exceptions). In this work, we present a novel deep architecture that performs new view synthesis directly from pixels, trained from a large number of posed image sets. In contrast to traditional approaches, which consist of multiple complex stages of processing, each of which requires careful tuning and can fail in unexpected ways, our system is trained end-to-end. The pixels from neighboring views of a scene are presented to the network, which then directly produces the pixels of the unseen view. The benefits of our approach include generality (we only require posed image sets and can easily apply our method to different domains) and high-quality results on traditionally difficult scenes. We believe this is due to the end-to-end nature of our system, which is able to plausibly generate pixels according to color, depth, and texture priors learnt automatically from the training data. To verify our method we show that it can convincingly reproduce known test views from nearby imagery. Additionally we show images rendered from novel viewpoints. To our knowledge, our work is the first to apply deep learning to the problem of new view synthesis from sets of real-world, natural imagery.
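Virtual reality (VR) headsets provide an immersive, stereoscopic visual experience, but at the cost of blocking users from directly observing their physical environment. Passthrough techniques are intended to address this limitation by leveraging outward-facing cameras to reconstruct the images that would otherwise be seen by the user without the headset. This is inherently a real-time view synthesis challenge, since passthrough cameras cannot be physically co-located with the eyes. Existing passthrough techniques suffer from distracting reconstruction artifacts, largely due to the lack of accurate depth information (especially for near-field and disoccluded objects), and also exhibit limited image quality (e.g., low resolution and monochrome). In this paper, we propose the first learned passthrough method and assess its performance using a custom VR headset that contains a stereo pair of RGB cameras. Through both simulations and experiments, we demonstrate that our learned passthrough method delivers superior image quality compared to state-of-the-art methods, while meeting the strict VR requirements for real-time, perspective-correct stereoscopic view synthesis over a wide field of view for desktop-connected headsets.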
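We introduce the problem of perpetual view generation: long-range generation of novel views corresponding to an arbitrarily long camera trajectory, given a single image. This is a challenging problem that goes far beyond the capabilities of current view synthesis methods, which quickly degenerate when presented with large camera motions. Methods for video generation also have limited ability to produce long sequences and are often agnostic to scene geometry. We take a hybrid approach that integrates both geometry and image synthesis in an iterative "render, refine, and repeat" framework, allowing for long-range generation that covers large distances after hundreds of frames. Our approach can be trained from a set of monocular video sequences. We propose a dataset of aerial footage of coastal scenes, and compare our method with recent view synthesis and conditional video generation baselines, showing that it can generate plausible scenes for much longer time horizons over large camera trajectories than existing methods. Project page at https://infinite-nature.github.io/.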
Image view synthesis has seen great success in reconstructing photorealistic visuals, thanks to deep learning and various novel representations. The next key step in immersive virtual experiences is view synthesis of dynamic scenes. However, several challenges exist due to the lack of high-quality training datasets and the additional time dimension for videos of dynamic scenes. To address this issue, we introduce a multi-view video dataset, captured with a custom 10-camera rig at 120 FPS. The dataset contains 96 high-quality scenes showing various visual effects and human interactions in outdoor scenes. We develop a new algorithm, Deep 3D Mask Volume, which enables temporally stable view extrapolation from binocular videos of dynamic scenes captured by static cameras. Our algorithm addresses the temporal inconsistency of disocclusions by identifying the error-prone areas with a 3D mask volume and replacing them with static background observed throughout the video. Our method enables manipulation in 3D space as opposed to simple 2D masks. We demonstrate better temporal stability than frame-by-frame static view synthesis methods or those that use 2D masks. The resulting view synthesis videos show minimal flickering artifacts and allow for larger translational movements.
We explore the problem of view synthesis from a narrow baseline pair of images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds upon prior work in predicting a multiplane image (MPI), which represents scene content as a set of RGBα planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to 4× the lateral viewpoint movement allowed by prior work. Our method ameliorates two specific issues that limit the range of views renderable by prior methods: 1) We expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture along with a randomized-resolution training procedure to allow our model to predict MPIs with increased disparity sampling frequency. 2) We reduce the repeated texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth.
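To make the MPI rendering step concrete, here is a minimal sketch of the standard back-to-front "over" compositing of RGBα planes for a single pixel. It assumes the planes have already been warped into the target view, and it omits that projection step. It illustrates the general technique and is not code from the paper; `composite_over` is a hypothetical name.

```python
def composite_over(planes):
    """Alpha-composite one pixel of an MPI.

    `planes` is a list of (rgb, alpha) pairs ordered back to front, where
    rgb is a 3-tuple in [0, 1] and alpha is the plane's opacity in [0, 1].
    """
    out = (0.0, 0.0, 0.0)  # start from a black background
    for rgb, alpha in planes:
        # The nearer plane covers what is behind it in proportion to alpha.
        out = tuple(alpha * c + (1.0 - alpha) * o for c, o in zip(rgb, out))
    return out


if __name__ == "__main__":
    # An opaque blue far plane seen through a half-transparent red near plane.
    pixel = composite_over([((0.0, 0.0, 1.0), 1.0), ((1.0, 0.0, 0.0), 0.5)])
    print(pixel)
```

Because each plane's α is stored explicitly, content hidden behind a foreground layer stays in the representation and becomes visible when the viewpoint moves.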
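Novel view synthesis (NVS) and video prediction (VP) are typically considered disjoint tasks in computer vision. However, both can be seen as ways of observing the spatial-temporal world: NVS aims to synthesize a scene from a new point of view, while VP aims to view a scene from a new point in time. The two tasks provide complementary signals for obtaining a scene representation, since viewpoint changes in spatial observations inform depth, and temporal observations inform the motion of the camera and of individual objects. Inspired by these observations, we propose to study the problem of Video Extrapolation in Space and Time (VEST). We propose a model that leverages the self-supervision and complementary cues from both tasks, whereas existing methods can solve only one of them. Experiments show that our method performs better than several state-of-the-art NVS and VP methods on indoor and outdoor real-world datasets.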
View-dependent effects such as reflections pose a substantial challenge for image-based and neural rendering algorithms. Above all, curved reflectors are particularly hard, as they lead to highly non-linear reflection flows as the camera moves. We introduce a new point-based representation to compute Neural Point Catacaustics allowing novel-view synthesis of scenes with curved reflectors, from a set of casually-captured input photos. At the core of our method is a neural warp field that models catacaustic trajectories of reflections, so complex specular effects can be rendered using efficient point splatting in conjunction with a neural renderer. One of our key contributions is the explicit representation of reflections with a reflection point cloud which is displaced by the neural warp field, and a primary point cloud which is optimized to represent the rest of the scene. After a short manual annotation step, our approach allows interactive high-quality renderings of novel views with accurate reflection flow. Additionally, the explicit representation of reflection flow supports several forms of scene manipulation in captured scenes, such as reflection editing, cloning of specular objects, reflection tracking across views, and comfortable stereo viewing. We provide the source code and other supplemental material on https://repo-sam.inria.fr/fungraph/neural_catacaustics/
The DeepView architecture. (a) The network takes a sparse set of input images shot from different viewpoints. (b, c) The scene is reconstructed using learned gradient descent, producing a multi-plane image (a series of fronto-parallel, RGBA textured planes). (d) The multi-plane image is suitable for real-time, high-quality rendering of novel viewpoints. The result above uses four input views in a 30cm × 20cm rectangular layout. The novel view was rendered with a virtual camera positioned at the centroid of the four input views. More results, including video and an interactive viewer, at: https://augmentedperception.github.io/deepview/
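Novel view synthesis for humans in motion is a challenging computer vision problem that enables applications such as free-viewpoint video. Existing methods typically use complex setups with multiple input views, 3D supervision, or pre-trained models that do not generalize well to new identities. To address these limitations, we present a novel view synthesis framework that generates realistic renders from unseen views of any human captured by a single-view sensor with sparse RGB-D, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture that learns dense features in novel views obtained by sphere-based neural rendering, and creates complete renders using a global context inpainting model. Additionally, an enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single sparse RGB-D input. It generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms existing human view synthesis methods and is robust to different levels of input sparsity.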
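Synthesizing photo-realistic images and video is at the heart of computer graphics and has been the focus of decades of research. Traditionally, synthetic images of a scene are generated using rendering algorithms such as rasterization or ray tracing, which take representations of geometry and material properties as input. Collectively, these inputs define the actual scene and what is rendered, and are referred to as the scene representation (where a scene consists of one or more objects). Example scene representations are triangle meshes with accompanying textures (e.g., created by an artist), point clouds (e.g., from a depth sensor), volumetric grids (e.g., from a CT scan), or implicit surface functions (e.g., truncated signed distance fields). The reconstruction of such a scene representation from observations using differentiable rendering losses is known as inverse graphics or inverse rendering. Neural rendering is closely related, and combines ideas from classical computer graphics and machine learning to create algorithms for synthesizing images from real-world observations. Neural rendering is a leap forward towards the goal of synthesizing photo-realistic image and video content. In recent years, we have seen immense progress in this field through hundreds of publications that show different ways to inject learnable components into the rendering pipeline. This state-of-the-art report on advances in neural rendering focuses on methods that combine classical rendering principles with learned 3D scene representations, now often referred to as neural scene representations. A key advantage of these methods is that they are 3D-consistent by design, enabling applications such as novel viewpoint synthesis of a captured scene. In addition to methods that handle static scenes, we cover neural scene representations for modeling non-rigidly deforming objects...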
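Representing scenes with multiple semi-transparent colored layers has been a popular and successful choice for real-time novel view synthesis. Existing approaches infer color and transparency values over regularly spaced layers of planar or spherical shape. In this work, we introduce a new view synthesis approach based on multiple semi-transparent layers with scene-adapted geometry. Our approach infers such representations from stereo pairs in two stages. The first stage infers the geometry of a small number of data-adaptive layers from a given pair of views. The second stage infers the color and transparency values for these layers, producing the final representation for novel view synthesis. Importantly, both stages are connected through a differentiable renderer and are trained in an end-to-end manner. In the experiments, we demonstrate the advantage of the proposed approach over the use of regularly spaced layers with no adaptation to scene geometry. Despite being orders of magnitude faster during rendering, our approach also outperforms the recently proposed IBRNet system, which is based on an implicit geometry representation. See results at https://samsunglabs.github.io/stereolayers.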
We present a method for novel view synthesis from input images that are freely distributed around a scene. Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts. We calibrate the input images via SfM and erect a coarse geometric scaffold via MVS. This scaffold is used to create a proxy depth map for a novel view of the scene. Based on this depth map, a recurrent encoder-decoder network processes reprojected features from nearby views and synthesizes the new view. Our network does not need to be optimized for a given scene. After training on a dataset, it works in previously unseen environments with no finetuning or per-scene optimization. We evaluate the presented approach on challenging real-world datasets, including Tanks and Temples, where we demonstrate successful view synthesis for the first time and substantially outperform prior and concurrent work.
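Modern computer vision has moved beyond the domain of internet photo collections and into the physical world, guiding camera-equipped robots and autonomous cars through unstructured environments. To enable these embodied agents to interact with real-world objects, cameras are increasingly being used as depth sensors, reconstructing the environment for a variety of downstream reasoning tasks. Machine-learning-aided depth perception, or depth estimation, predicts the distance of each pixel in an image. While impressive strides have been made in depth estimation, significant challenges remain: (1) ground-truth depth labels are difficult to collect at scale, (2) camera information is typically assumed to be known but is often unreliable, and (3) restrictive camera assumptions are common, even though a great variety of camera types and lenses are used in practice. In this thesis, we focus on relaxing these assumptions, and describe contributions toward the ultimate goal of turning cameras into truly generic depth sensors.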
where the highest resolution is required, using facial performance capture as a case in point.
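We present a portable multiscopic camera system with a dedicated model for novel view and time synthesis in dynamic scenes. Our goal is to render high-quality images of a dynamic scene from any viewpoint at any time using our portable multiscopic camera. To achieve such novel view and time synthesis, we develop a physical multiscopic camera equipped with five cameras to train a neural radiance field (NeRF) in both the temporal and spatial domains for dynamic scenes. Our model maps a 6D coordinate (3D spatial position, 1D temporal coordinate, and 2D viewing direction) to view-dependent and time-varying emitted radiance and volume density. Volume rendering is applied to render a photo-realistic image at a specified camera pose and time. To improve the robustness of the physical cameras, we propose a camera parameter optimization module and a temporal frame interpolation module to promote information propagation across time. We conduct experiments on both real-world and synthetic datasets to evaluate our system, and the results show that our approach outperforms alternative solutions qualitatively and quantitatively. Our code and dataset are available at https://yuenfuilau.github.io.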
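The volume rendering step mentioned above can be sketched with the quadrature commonly used in the NeRF literature: each sample along a ray contributes its color weighted by its own opacity and by the transmittance accumulated in front of it. Whether this system uses exactly this discretization is an assumption. `render_ray` is a hypothetical name, and the densities and colors would come from the learned model queried at each sample's position, time, and viewing direction.

```python
import math


def render_ray(sigmas, colors, deltas):
    """Composite samples along one ray, ordered from near to far.

    sigmas: volume densities at the samples
    colors: emitted RGB radiance (3-tuples) at the samples
    deltas: distances between adjacent samples
    Returns the rendered RGB tuple and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    rgb = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        for channel in range(3):
            rgb[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return tuple(rgb), weights
```

The weights sum to at most one, so any remaining transmittance can be assigned to a background color if one is needed.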
Figure 1. Our method takes a single casually captured video as input and learns a space-time neural irradiance field. (Top) Sample frames from the input video. (Middle) Novel view images rendered from textured meshes constructed from depth maps. (Bottom) Our results rendered from the proposed space-time neural irradiance field. https://video-nerf.github.io