Deep networks have recently enjoyed enormous success when applied to recognition and classification problems in computer vision, but their use in graphics problems has been limited. In this work, we present a novel deep architecture that performs new view synthesis directly from pixels, trained from a large number of posed image sets. In contrast to traditional approaches, which consist of multiple complex stages of processing, each of which requires careful tuning and can fail in unexpected ways, our system is trained end-to-end. The pixels from neighboring views of a scene are presented to the network, which then directly produces the pixels of the unseen view. The benefits of our approach include generality (we only require posed image sets and can easily apply our method to different domains) and high-quality results on traditionally difficult scenes. We believe this is due to the end-to-end nature of our system, which is able to plausibly generate pixels according to color, depth, and texture priors learnt automatically from the training data. To verify our method, we show that it can convincingly reproduce known test views from nearby imagery. Additionally, we show images rendered from novel viewpoints. To our knowledge, our work is the first to apply deep learning to the problem of new view synthesis from sets of real-world, natural imagery.
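As a concrete (if simplified) picture of the "pixels in, pixels out" training described above, the sketch below wires posed neighbor views into a stand-in CNN and supervises it with the held-out view. The `ViewSynthesisNet` architecture, the plain L1 loss, and the omission of any explicit pose or plane-sweep handling are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ViewSynthesisNet(nn.Module):
    """Stand-in CNN: maps stacked neighbor views to the pixels of the unseen view."""
    def __init__(self, n_neighbors=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 * n_neighbors, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, neighbor_views):            # (B, N, 3, H, W)
        b, n, c, h, w = neighbor_views.shape
        return self.net(neighbor_views.reshape(b, n * c, h, w))

def training_step(model, optimizer, neighbor_views, target_view):
    """One end-to-end step: predict the held-out view, penalize pixel error."""
    optimizer.zero_grad()
    pred = model(neighbor_views)
    loss = (pred - target_view).abs().mean()      # simple L1 reconstruction loss
    loss.backward()
    optimizer.step()
    return loss.item()
```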
We present a practical and robust deep learning solution for capturing and rendering novel views of complex real-world scenes for virtual exploration. Previous approaches either require intractably dense view sampling or provide little to no guidance on how users should sample views of a scene to reliably render high-quality novel views. Instead, we propose an algorithm for view synthesis from an irregular grid of sampled views that first expands each sampled view into a local light field via a multiplane image (MPI) scene representation, then renders novel views by blending adjacent local light fields. We extend traditional plenoptic sampling theory to derive a bound that specifies precisely how densely users should sample views of a given scene when using our algorithm. In practice, we apply this bound to capture and render real-world scenes, achieving the perceptual quality of Nyquist-rate view sampling while using up to 4000x fewer views. We demonstrate the practicality of our approach with an augmented reality smartphone app that guides users to capture input images of a scene, and with viewers that enable real-time virtual exploration on desktop and mobile platforms.
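A hedged sketch of the "blend adjacent local light fields" step described above: each nearby sampled view carries an MPI that can be rendered into the target pose, and the renderings are mixed with distance-based weights. `render_mpi_into_view` is a hypothetical helper standing in for the per-plane warp and compositing, and the inverse-distance weighting over the two nearest views is an illustrative choice rather than the paper's exact blending rule.

```python
import numpy as np

def blend_local_light_fields(mpis, mpi_cam_positions, target_pose, render_mpi_into_view):
    """Render the nearest local light fields (MPIs) into the target view and blend them.

    mpis:                 list of per-view MPI representations
    mpi_cam_positions:    (N, 3) camera centers of the sampled views
    target_pose:          dict with at least target_pose["position"] (3,)
    render_mpi_into_view: callable(mpi, target_pose) -> (H, W, 3) image  [assumed helper]
    """
    t = np.asarray(target_pose["position"])
    dists = np.linalg.norm(np.asarray(mpi_cam_positions) - t, axis=1)
    nearest = np.argsort(dists)[:2]                 # blend the two closest local light fields

    weights = 1.0 / (dists[nearest] + 1e-8)         # inverse-distance weights (illustrative)
    weights = weights / weights.sum()

    blended = None
    for w, idx in zip(weights, nearest):
        rendering = render_mpi_into_view(mpis[idx], target_pose)
        blended = w * rendering if blended is None else blended + w * rendering
    return blended
```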
Fig. 1. We extract camera motion clips from YouTube videos and use them to train a neural network to generate the Multiplane Image (MPI) scene representation from narrow-baseline stereo image pairs. The inferred MPI representation can then be used to synthesize novel views of the scene, including ones that extrapolate significantly beyond the input baseline. (Video stills in this and other figures are used under Creative-Commons license from YouTube user SonaVisual.) The view synthesis problem (generating novel views of a scene from known imagery) has garnered recent attention due in part to compelling applications in virtual and augmented reality. In this paper, we explore an intriguing scenario for view synthesis: extrapolating views from imagery captured by narrow-baseline stereo cameras, including VR cameras and now-widespread dual-lens camera phones. We call this problem stereo magnification, and propose a learning framework that leverages a new layered representation that we call multiplane images (MPIs). Our method also uses a massive new data source for learning view extrapolation: online videos on YouTube. Using data mined from such videos, we train a deep network that predicts an MPI from an input stereo image pair. This inferred MPI can then be used to synthesize a range of novel views of the scene, including views that extrapolate significantly beyond the input baseline. We show that our method compares favorably with several recent view synthesis methods, and demonstrate applications in magnifying narrow-baseline stereo images.
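For reference, this is how an inferred MPI is typically turned into a novel view: warp each RGBα plane into the target camera and composite back to front with the standard "over" operator. The sketch below assumes a hypothetical `warp_plane` helper for the plane-induced homography; only the compositing is spelled out.

```python
import numpy as np

def render_mpi(rgba_planes, plane_depths, target_pose, warp_plane):
    """Composite an MPI (RGBA planes ordered far -> near) into a target view.

    rgba_planes:  (D, H, W, 4) arrays in [0, 1], index 0 = farthest plane
    plane_depths: (D,) depths used by the warp
    warp_plane:   callable(plane, depth, target_pose) -> warped (H, W, 4)  [assumed helper
                  applying the plane-induced homography for the relative pose]
    """
    out = np.zeros(rgba_planes.shape[1:3] + (3,), dtype=np.float32)
    for plane, depth in zip(rgba_planes, plane_depths):       # back to front
        warped = warp_plane(plane, depth, target_pose)
        rgb, alpha = warped[..., :3], warped[..., 3:4]
        out = rgb * alpha + out * (1.0 - alpha)               # standard "over" compositing
    return out
```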
Fig. 1. The setup and I/O of our system. (a) We attach an additional standard camera to a light field camera using a tripod screw, so they can be easily carried together. (b) The inputs consist of a standard 30 fps video and a 3 fps light field sequence. (c) Our system then generates a 30 fps light field video, which can be used for a number of applications such as refocusing and changing viewpoints as the video plays. Light field cameras have many advantages over traditional cameras, as they allow the user to change various camera settings after capture. However, capturing light fields requires a huge bandwidth to record the data: a modern light field camera can only take three images per second. This prevents current consumer light field cameras from capturing light field videos. Temporal interpolation at such extreme scale (10x, from 3 fps to 30 fps) is infeasible as too much information will be entirely missing between adjacent frames. Instead, we develop a hybrid imaging system, adding another standard video camera to capture the temporal information. Given a 3 fps light field sequence and a standard 30 fps 2D video, our system can then generate a full light field video at 30 fps. We adopt a learning-based approach, which can be decomposed into two steps: spatio-temporal flow estimation and appearance estimation. The flow estimation propagates the angular information from the light field sequence to the 2D video, so we can warp input images to the target view. The appearance estimation then combines these warped images to output the final pixels. The whole process is trained end-to-end using convolutional neural networks. Experimental results demonstrate that our algorithm outperforms current video interpolation methods, enabling consumer light field videography, and making applications such as refocusing and parallax view generation achievable on videos for the first time. Code and data are available at https://cseweb.ucsd.edu/~viscomp/projects/LF/papers/SIG17/lfv/.
We explore the problem of view synthesis from a set of narrow-baseline images, and focus on generating high-quality view extrapolations with plausible disocclusions. Our method builds on prior work that predicts a multiplane image (MPI), which represents scene content as a set of RGB$\alpha$ planes within a reference view frustum and renders novel views by projecting this content into the target viewpoints. We present a theoretical analysis showing how the range of views that can be rendered from an MPI increases linearly with the MPI disparity sampling frequency, as well as a novel MPI prediction procedure that theoretically enables view extrapolations of up to $4\times$ the lateral viewpoint movement allowed by prior work. Our method addresses two specific issues that limit the range of views renderable by existing approaches: 1) we expand the range of novel views that can be rendered without depth discretization artifacts by using a 3D convolutional network architecture together with a randomized-resolution training procedure, allowing our model to predict MPIs with increased disparity sampling frequency; and 2) we reduce the repeated texture artifacts seen in disocclusions by enforcing a constraint that the appearance of hidden content at any depth must be drawn from visible content at or behind that depth. Please see our results video at: https://www.youtube.com/watch?v=aJqAaMNL2m4.
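The sketch below illustrates two ingredients the abstract mentions, under stated assumptions: the per-plane visibility (transmittance) that determines which MPI content is hidden, and a constrained way of coloring hidden content as a convex combination of candidate colors taken from content at or behind each plane. The candidate-gathering step itself is assumed to have been done elsewhere; this is not the paper's exact prediction procedure.

```python
import numpy as np

def plane_visibility(alphas):
    """Per-plane visibility for an MPI whose planes are ordered far -> near.

    alphas: (D, H, W) alpha maps; the visibility of plane i is the product of
    (1 - alpha) over all planes in front of it (larger index here).
    """
    ones = np.ones_like(alphas[:1])
    # cumulative transmittance computed from the near side, then shifted by one plane
    trans_from_near = np.cumprod((1.0 - alphas)[::-1], axis=0)[::-1]
    return np.concatenate([trans_from_near[1:], ones], axis=0)

def constrained_plane_colors(candidate_rgb, blend_logits):
    """Color each plane as a convex combination of candidate colors, so that occluded
    regions cannot invent foreground texture.

    candidate_rgb: (D, K, H, W, 3) per-plane candidates, assumed to be gathered only
                   from content at or behind plane d (the gathering is not shown)
    blend_logits:  (D, K, H, W, 1) predicted scores over the K candidates
    """
    w = np.exp(blend_logits - blend_logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)            # softmax over candidates
    return (w * candidate_rgb).sum(axis=1)          # (D, H, W, 3)
```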
We present a learning-based approach to novel view synthesis for scanned objects that is trained on a set of multi-view images. Instead of using texture mapping or hand-designed image-based rendering, we directly train a deep neural network to synthesize view-dependent images of the object. First, we use a coverage-based nearest-neighbor lookup to retrieve a set of reference frames, which are explicitly warped to the given target view using cross-projection. Our network then learns to best composite these warped images. This allows us to generate photo-realistic results without having to spend capacity on "memorizing" object appearance; instead, the multi-view images can be reused. While this works well for diffuse objects, cross-projection does not generalize to view-dependent effects. We therefore propose a decomposition network that extracts view-dependent effects and is trained in a self-supervised manner. After decomposition, the diffuse shading is cross-projected, while the view-dependent layer of the target view is regressed. We demonstrate the effectiveness of our approach both qualitatively and quantitatively on real and synthetic data.
Learning based methods have shown very promising results for the task of depth estimation in single images. However, most existing approaches treat depth prediction as a supervised regression problem and as a result, require vast quantities of corresponding ground truth depth data for training. Just recording quality depth data in a range of environments is a challenging problem. In this paper, we innovate beyond existing approaches, replacing the use of explicit depth data during training with easier-to-obtain binocular stereo footage. We propose a novel training objective that enables our convolutional neural network to learn to perform single image depth estimation, despite the absence of ground truth depth data. Exploiting epipolar geometry constraints, we generate disparity images by training our network with an image reconstruction loss. We show that solving for image reconstruction alone results in poor quality depth images. To overcome this problem, we propose a novel training loss that enforces consistency between the disparities produced relative to both the left and right images, leading to improved performance and robustness compared to existing approaches. Our method produces state-of-the-art results for monocular depth estimation on the KITTI driving dataset, even outperforming supervised methods that have been trained with ground truth depth.
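A minimal sketch of the training signal described above, assuming rectified stereo pairs, disparities expressed as a fraction of image width, and PyTorch's `grid_sample` for the differentiable warp; the SSIM and smoothness terms of the full objective are omitted, and the warp's sign convention depends on the stereo setup.

```python
import torch
import torch.nn.functional as F

def warp_with_disparity(img, disp):
    """Sample `img` at x - disp per pixel (disparity as a fraction of image width).

    img: (B, C, H, W), disp: (B, 1, H, W)
    """
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=img.device),
        torch.linspace(-1, 1, w, device=img.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1).clone()
    grid[..., 0] = grid[..., 0] - 2.0 * disp.squeeze(1)       # shift x by the disparity
    return F.grid_sample(img, grid, align_corners=True, padding_mode="border")

def monocular_stereo_losses(left, right, disp_l, disp_r):
    """Image-reconstruction + left-right consistency terms (SSIM and smoothness omitted)."""
    left_rec = warp_with_disparity(right, disp_l)             # reconstruct left from right
    right_rec = warp_with_disparity(left, -disp_r)            # and vice versa (sign is a convention)
    photo = (left - left_rec).abs().mean() + (right - right_rec).abs().mean()

    # left-right consistency: the right disparity, warped into the left view, should match disp_l
    disp_r_in_l = warp_with_disparity(disp_r, disp_l)
    lr = (disp_l - disp_r_in_l).abs().mean()
    return photo, lr
```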
Light fields present a rich way of representing the 3D world by capturing the spatio-angular dimensions of the visual signal. However, the popular way of capturing a light field (LF) with a plenoptic camera presents a spatio-angular resolution trade-off. Computational imaging techniques such as compressive light fields and programmable coded apertures reconstruct full-sensor-resolution LFs from coded projections obtained by multiplexing the incoming spatio-angular light field. Here, we present a unified learning framework that can reconstruct LFs from a variety of multiplexing schemes with only a minimal number of coded images as input. We consider three light field capture schemes: a heterodyne capture scheme with the code placed near the sensor, a coded-aperture scheme with the code at the camera aperture, and finally a dual-exposure scheme that captures a focus-defocus pair with no explicit coding. Our algorithm consists of three stages: 1) we recover the all-in-focus image from the coded images; 2) we estimate disparity maps for all the LF views from the coded images and the all-in-focus image; and 3) we render the LF by warping the all-in-focus image with the disparity maps and then refining it. For these three stages we propose three deep neural networks: ViewNet, DisparityNet, and RefineNet. Our reconstructions show that our learning algorithm achieves state-of-the-art results for all three multiplexing schemes. In particular, our LF reconstructions from a focus-defocus pair are comparable to other learning-based view synthesis approaches that use multiple images. Our work thus paves the way for capturing high-resolution (~megapixel) LFs with conventional cameras such as DSLRs. Please see our supplementary material online at https://docs.google.com/presentation/d/1Vr-F8ZskrSd63tvnLfJ2xmEXY6OBc1Rll3XeOAtc11I/ to better appreciate the reconstructed light fields.
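Stage 3 of the pipeline, rendering LF views by warping the all-in-focus image with the estimated disparity maps, can be sketched as plain backward warping; the sign and scaling of the disparity-to-offset mapping below are assumptions about the light-field parameterization, not the paper's exact convention.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def render_light_field_view(all_in_focus, disparity, du, dv):
    """Warp an all-in-focus image to the angular view offset (du, dv) using a disparity map.

    all_in_focus: (H, W, 3), disparity: (H, W) in pixels per unit angular offset.
    A pixel (y, x) of the target view is fetched from (y + dv * d, x + du * d) in the
    all-in-focus image (sign/parameterization conventions differ between setups).
    """
    h, w = disparity.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = ys + dv * disparity
    src_x = xs + du * disparity
    out = np.empty_like(all_in_focus, dtype=np.float32)
    for c in range(all_in_focus.shape[2]):
        out[..., c] = map_coordinates(all_in_focus[..., c].astype(np.float32),
                                      [src_y, src_x], order=1, mode="nearest")
    return out
```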
Modern camera calibration and multiview stereo techniques enable users to smoothly navigate between different views of a scene captured using standard cameras. The underlying automatic 3D reconstruction methods work well for buildings and regular structures but often fail on vegetation, vehicles and other complex geometry present in everyday urban scenes. Consequently, missing depth information makes image-based rendering (IBR) for such scenes very challenging. Our goal is to provide plausible free-viewpoint navigation for such datasets. To do this, we introduce a new IBR algorithm that is robust to missing or unreliable geometry, providing plausible novel views even in regions quite far from the input camera positions. We first oversegment the input images, creating superpixels of homogeneous color content which often tends to preserve depth discontinuities. We then introduce a depth-synthesis approach for poorly reconstructed regions based on a graph structure on the oversegmentation and appropriate traversal of the graph. The superpixels augmented with synthesized depth allow us to define a local shape-preserving warp which compensates for inaccurate depth. Our rendering algorithm blends the warped images, and generates plausible image-based novel views for our challenging target scenes. Our results demonstrate novel view synthesis in real time for multiple challenging scenes with significant depth complexity, providing a convincing immersive navigation experience.
Figure 1: A video view interpolation example: (a,c) synchronized frames from two different input cameras and (b) a virtual interpolated view. (d) A depth-matted object from earlier in the sequence is inserted into the video. The ability to interactively control viewpoint while watching a video is an exciting application of image-based rendering. The goal of our work is to render dynamic scenes with interactive viewpoint control using a relatively small number of video cameras. In this paper, we show how high-quality video-based rendering of dynamic scenes can be accomplished using multiple synchronized video streams combined with novel image-based modeling and rendering algorithms. Once these video streams have been processed, we can synthesize any intermediate view between cameras at any time, with the potential for space-time manipulation. In our approach, we first use a novel color segmentation-based stereo algorithm to generate high-quality photoconsistent correspondences across all camera views. Mattes for areas near depth discontinuities are then automatically extracted to reduce artifacts during view synthesis. Finally, a novel temporal two-layer compressed representation that handles matting is developed for rendering at interactive rates.
This paper formulates and solves a new variant of the stereo correspondence problem: simultaneously recovering the disparities, true colors, and opacities of visible surface elements. This problem arises in newer applications of stereo reconstruction, such as view interpolation and the layering of real imagery with synthetic graphics for special effects and virtual studio applications. While this problem is intrinsically more difficult than traditional stereo correspondence, where only the disparities are being recovered, it provides a principled way of dealing with commonly occurring problems such as occlusions and the handling of mixed (foreground/background) pixels near depth discontinuities. It also provides a novel means for separating foreground and background objects (matting), without the use of a special blue screen. We formulate the problem as the recovery of colors and opacities in a generalized 3-D (x, y, d) disparity space, and solve the problem using a combination of initial evidence aggregation followed by iterative energy minimization.
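To make "initial evidence aggregation" in the generalized (x, y, d) disparity space concrete, the sketch below resamples every view into the reference frame for each disparity hypothesis and records the mean color and color variance; `warp_to_ref` is a hypothetical helper encapsulating the epipolar warp, and the subsequent iterative energy minimization over colors and opacities is not reproduced.

```python
import numpy as np

def disparity_space_evidence(views, warp_to_ref, disparities):
    """Build initial evidence in (x, y, d) disparity space.

    views:       list of (H, W, 3) images
    warp_to_ref: callable(view_index, d) -> (H, W, 3) view resampled at disparity d
                 (assumed helper standing in for the epipolar geometry)
    disparities: list of disparity hypotheses
    Low color variance across views indicates consistent evidence for that (x, y, d) cell.
    """
    h, w, _ = views[0].shape
    mean_color = np.zeros((len(disparities), h, w, 3), dtype=np.float32)
    variance = np.zeros((len(disparities), h, w), dtype=np.float32)
    for di, d in enumerate(disparities):
        warped = np.stack([warp_to_ref(i, d) for i in range(len(views))], axis=0)
        mean_color[di] = warped.mean(axis=0)
        variance[di] = ((warped - mean_color[di]) ** 2).mean(axis=(0, 3))
    return mean_color, variance
```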
We propose a novel approach to performing fine-grained 3D manipulation of image content via a convolutional neural network, which we call the Transformable Bottleneck Network (TBN). It applies a given spatial transformation directly to a volumetric bottleneck within our encoder-bottleneck-decoder architecture. Multi-view supervision encourages the network to learn to spatially disentangle the feature space within the bottleneck. The resulting spatial structure can be manipulated with arbitrary spatial transformations. We demonstrate the efficacy of TBNs for novel view synthesis, achieving state-of-the-art results on a challenging benchmark. We show that the bottlenecks produced by networks trained for this task contain meaningful spatial structure that allows us to intuitively perform a variety of image manipulations in 3D, well beyond the rigid transformations seen during training. These manipulations include non-uniform scaling, non-rigid warping, and combining content from different images. Finally, we extract explicit 3D structure from the bottleneck, performing impressive 3D reconstruction from a single input image.
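A minimal sketch of "applying a spatial transformation directly to the volumetric bottleneck": resample the encoder's feature volume under a rigid transform with PyTorch's 3D `affine_grid`/`grid_sample`. The rotation/translation parameterization in normalized grid coordinates is an assumption, and note that `affine_grid` expects the mapping from target-grid coordinates back to source coordinates (i.e., the inverse transform).

```python
import torch
import torch.nn.functional as F

def transform_bottleneck(volume, rotation, translation):
    """Resample a volumetric feature bottleneck under a rigid transform.

    volume:      (B, C, D, H, W) feature volume from the encoder
    rotation:    (B, 3, 3), translation: (B, 3), expressed in the normalized [-1, 1]^3
                 grid coordinates used by grid_sample, mapping target coords to source coords.
    """
    theta = torch.cat([rotation, translation.unsqueeze(-1)], dim=-1)   # (B, 3, 4)
    grid = F.affine_grid(theta, size=volume.shape, align_corners=False)
    return F.grid_sample(volume, grid, align_corners=False, padding_mode="zeros")
```

The transformed volume would then be passed to the decoder to render the manipulated view, which is where the multi-view supervision enters during training.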
We present a learnt system for multi-view stereopsis. In contrast to recent learning based methods for 3D reconstruction, we leverage the underlying 3D geometry of the problem through feature projection and unprojection along viewing rays. By formulating these operations in a differentiable manner, we are able to learn the system end-to-end for the task of metric 3D reconstruction. End-to-end learning allows us to jointly reason about shape priors while conforming to geometric constraints, enabling reconstruction from much fewer images (even a single image) than required by classical approaches as well as completion of unseen surfaces. We thoroughly evaluate our approach on the ShapeNet dataset and demonstrate the benefits over classical approaches and recent learning based methods.
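A hedged sketch of differentiable unprojection along viewing rays: project voxel centers into the image with the camera parameters and bilinearly sample the feature map, yielding a feature volume that can be fused across views. The grid layout and the absence of any cross-view fusion or recurrence are simplifications relative to the full system.

```python
import torch
import torch.nn.functional as F

def unproject_features(feat, K, R, t, voxel_centers):
    """Lift 2D image features onto a 3D voxel grid along viewing rays.

    feat:          (1, C, H, W) image feature map
    K, R, t:       intrinsics (3, 3), rotation (3, 3), translation (3,), with x_cam = R x_w + t
    voxel_centers: (D, Hv, Wv, 3) world-space coordinates of the voxel centers
    """
    d, hv, wv, _ = voxel_centers.shape
    pts = voxel_centers.reshape(-1, 3)                      # (N, 3)
    cam = pts @ R.T + t                                     # world -> camera
    pix = cam @ K.T                                         # camera -> homogeneous pixels
    uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)           # perspective divide

    h, w = feat.shape[-2:]
    u = 2.0 * uv[:, 0] / (w - 1) - 1.0                      # to grid_sample's [-1, 1] range
    v = 2.0 * uv[:, 1] / (h - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(1, 1, -1, 2)    # (1, 1, N, 2)

    sampled = F.grid_sample(feat, grid, align_corners=True) # (1, C, 1, N)
    return sampled.view(1, -1, d, hv, wv)                   # (1, C, D, Hv, Wv) feature volume
```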
We present a transformation-grounded image generation network for novel 3D view synthesis from a single image. Our approach first explicitly infers the parts of the geometry visible both in the input and novel views and then casts the remaining synthesis problem as image completion. Specifically, we predict a flow to move the pixels from the input to the novel view along with a novel visibility map that helps deal with occlusion/disocclusion. Next, conditioned on those intermediate results, we hallucinate (infer) the parts of the object invisible in the input image. In addition to the new network structure, training with a combination of adversarial and perceptual losses results in a reduction of common artifacts of novel view synthesis such as distortions and holes, while successfully generating high-frequency details and preserving visual aspects of the input image. We evaluate our approach on a wide range of synthetic and real examples. Both qualitative and quantitative results show our method achieves significantly better results compared to existing methods.
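The "warp what is visible, complete the rest" composition can be sketched in a few lines, assuming the network has already produced the flow (in `grid_sample`'s normalized convention), the visibility map, and the hallucinated content:

```python
import torch
import torch.nn.functional as F

def compose_novel_view(input_img, flow, visibility, hallucinated):
    """Combine flow-warped input pixels with hallucinated content using a visibility map.

    input_img:    (B, 3, H, W) source image
    flow:         (B, H, W, 2) sampling grid in grid_sample's [-1, 1] convention
    visibility:   (B, 1, H, W) in [0, 1]; 1 where the target pixel is visible in the input
    hallucinated: (B, 3, H, W) content predicted for disoccluded regions
    """
    warped = F.grid_sample(input_img, flow, align_corners=True, padding_mode="border")
    return visibility * warped + (1.0 - visibility) * hallucinated
```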
We present an approach to infer a layer-structured 3D representation of a scene from a single input image. This allows us to infer not only the depth of the visible pixels, but also to capture the texture and depth of content in the scene that is not directly visible. We overcome the challenge posed by the lack of direct supervision by instead leveraging a more naturally available multi-view supervisory signal. Our insight is to use view synthesis as a proxy task: we enforce that our representation (inferred from a single image), when rendered from a novel viewpoint, matches the true observed image. We present a learning framework that operationalizes this insight using a new, differentiable novel-view renderer. We provide qualitative and quantitative validation of our approach in two different settings, and demonstrate that we can learn to capture the hidden aspects of a scene.
We present an end-to-end deep learning architecture for depth map inference from multi-view images. In the network, we first extract deep visual image features, and then build a 3D cost volume upon the reference camera frustum via differentiable homography warping. Next, we apply 3D convolutions to regularize and regress the initial depth map, which is then refined with the reference image to generate the final output. Our framework flexibly adapts to arbitrary N-view inputs using a variance-based cost metric that maps multiple feature volumes into one cost volume. The proposed MVSNet is demonstrated on the large-scale indoor DTU dataset. With simple post-processing, our method not only significantly outperforms the previous state of the art, but is also several times faster at runtime. We also evaluate MVSNet on the complex outdoor Tanks and Temples dataset, where our method ranked first before April 18, 2018 without any fine-tuning, showing the strong generalization ability of MVSNet.
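The variance-based cost metric that maps N warped feature volumes into a single cost volume is simple to write down; the differentiable homography warping that produces the inputs, and the 3D-convolutional regularization applied afterwards, are assumed to happen outside this sketch.

```python
import torch

def variance_cost_volume(warped_feature_volumes):
    """Variance-based cost metric over N warped feature volumes.

    warped_feature_volumes: (N, B, C, D, H, W) source/reference features already warped
                            onto the reference frustum's D depth hypotheses.
    Returns a single (B, C, D, H, W) cost volume; low variance means the views agree
    on that depth hypothesis.
    """
    mean = warped_feature_volumes.mean(dim=0, keepdim=True)
    var = ((warped_feature_volumes - mean) ** 2).mean(dim=0)
    return var
```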
The modern computer graphics pipeline can synthesize images at remarkable visual quality; however, it requires well-defined, high-quality 3D content as input. In this work, we explore the use of imperfect 3D content, for instance obtained from photometric reconstruction with noisy and incomplete surface geometry, while still aiming to produce photo-realistic (re-)renderings. To address this challenging problem, we introduce Deferred Neural Rendering, a new paradigm for image synthesis that combines the traditional graphics pipeline with learnable components. Specifically, we propose Neural Textures, which are learned feature maps that are trained as part of the scene capture process. Like traditional textures, neural textures are stored as maps on top of a 3D mesh proxy; however, the high-dimensional feature maps contain significantly more information, which can be interpreted by our new deferred neural rendering pipeline. Both neural textures and the deferred neural renderer are trained end-to-end, enabling us to synthesize photo-realistic images even when the original 3D content is imperfect. In contrast to traditional black-box 2D generative neural networks, our 3D representation gives us explicit control over the generated output and allows for a wide range of application domains. For instance, we can synthesize temporally consistent video re-renderings of recorded 3D scenes, since our representation is inherently embedded in 3D space. In this way, neural textures can be used to coherently re-render or manipulate existing video content in both static and dynamic environments at real-time rates. We show the effectiveness of our approach in several experiments on novel view synthesis, scene editing, and facial manipulation, and compare to state-of-the-art approaches that leverage the standard graphics pipeline as well as conventional generative neural networks.
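A minimal sketch of the neural-texture idea, under illustrative choices of texture resolution, feature depth, and decoder: a learned feature texture is sampled with UV coordinates rasterized from the 3D proxy and decoded to RGB in screen space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeferredNeuralRendererSketch(nn.Module):
    """Sketch: a learned feature texture is sampled via rasterized UVs, then decoded by a CNN.
    Texture size, feature depth, and the decoder architecture are illustrative choices.
    """
    def __init__(self, tex_res=512, feat_dim=16):
        super().__init__()
        self.neural_texture = nn.Parameter(torch.randn(1, feat_dim, tex_res, tex_res) * 0.01)
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, uv):
        """uv: (B, H, W, 2) per-pixel texture coordinates in [-1, 1], rasterized from the proxy mesh."""
        b = uv.shape[0]
        tex = self.neural_texture.expand(b, -1, -1, -1)
        feats = F.grid_sample(tex, uv, align_corners=True)   # screen-space feature image
        return self.decoder(feats)                            # decoded RGB
```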
We address the problem of novel view synthesis: given an input image, synthesizing new images of the same object or scene observed from arbitrary viewpoints. We approach this as a learning task but, critically, instead of learning to synthesize pixels from scratch, we learn to copy them from the input image. Our approach exploits the observation that the visual appearance of different views of the same instance is highly correlated, and such correlation can be explicitly learned by training a convolutional neural network (CNN) to predict appearance flows: 2D coordinate vectors specifying which pixels in the input view can be used to reconstruct the target view. Furthermore, the proposed framework easily generalizes to multiple input views by learning how to optimally combine single-view predictions. We show that, for both objects and scenes, our approach is able to synthesize novel views of higher perceptual quality than previous CNN-based techniques.
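A hedged sketch of appearance flow: pixels are copied from the input by sampling at predicted 2D coordinates rather than synthesized from scratch, and multiple single-view reconstructions are fused with predicted confidences. The softmax fusion is an illustrative stand-in for the paper's learned combination.

```python
import torch
import torch.nn.functional as F

def apply_appearance_flow(src, flow):
    """Reconstruct the target view by copying pixels from `src` at predicted coordinates.

    src: (B, 3, H, W), flow: (B, H, W, 2) in grid_sample's [-1, 1] convention.
    """
    return F.grid_sample(src, flow, align_corners=True, padding_mode="border")

def combine_multi_view(predictions, confidences):
    """Fuse per-input-view reconstructions with predicted confidence scores.

    predictions: (V, B, 3, H, W), confidences: (V, B, 1, H, W) unnormalized scores.
    """
    weights = torch.softmax(confidences, dim=0)
    return (weights * predictions).sum(dim=0)
```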
We develop a continuous framework for the analysis of 4D light fields, and describe novel variational methods for disparity reconstruction as well as spatial and angular super-resolution. Disparity maps are estimated locally using epipolar plane image analysis without the need for expensive matching cost minimization. The method works fast and with inherent subpixel accuracy since no discretization of the disparity space is necessary. In a variational framework, we employ the disparity maps to generate super-resolved novel views of a scene, which corresponds to increasing the sampling rate of the 4D light field in both the spatial and angular directions. In contrast to previous work, we formulate the problem of view synthesis as a continuous inverse problem, which allows us to correctly take into account foreshortening effects caused by scene geometry transformations. All optimization problems are solved with state-of-the-art convex relaxation techniques. We test our algorithms on a number of real-world examples as well as our new benchmark data set for light fields, and compare results to a multiview stereo method. The proposed method is both faster and more accurate. Data sets and source code are provided online for additional evaluation.
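Disparity from epipolar plane image analysis can be sketched via the 2D structure tensor: scene points trace lines in the EPI whose slope equals their disparity, and the line direction is the structure tensor's eigenvector with the smaller eigenvalue. The sign and scaling depend on the light-field parameterization, so treat this as a convention-dependent sketch rather than the paper's full variational method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def epi_disparity(epi, sigma=1.5, eps=1e-8):
    """Local disparity from an epipolar plane image via the 2D structure tensor.

    epi: (S, X) light-field slice (view index s, spatial coordinate x). The line direction
    is the eigenvector of the structure tensor with the smaller eigenvalue; its slope dx/ds
    is the disparity estimate (up to the parameterization's sign convention).
    """
    gs, gx = np.gradient(epi.astype(np.float64))          # derivatives along s and x
    J = np.empty(epi.shape + (2, 2))
    J[..., 0, 0] = gaussian_filter(gs * gs, sigma)
    J[..., 0, 1] = J[..., 1, 0] = gaussian_filter(gs * gx, sigma)
    J[..., 1, 1] = gaussian_filter(gx * gx, sigma)

    _, vecs = np.linalg.eigh(J)                           # eigenvalues in ascending order
    line_dir = vecs[..., :, 0]                            # eigenvector of the smallest eigenvalue
    return line_dir[..., 1] / (line_dir[..., 0] + eps)    # slope dx/ds
```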
Depth from defocus (DfD) and stereo matching are two of the most studied passive depth sensing schemes. The techniques are essentially complementary: DfD can naturally handle repetitive textures that are problematic for stereo matching, whereas stereo matching is insensitive to defocus blur and can handle large depth ranges. In this paper, we present a unified learning-based technique for hybrid DfD and stereo matching. Our input is an image triplet: a stereo pair and a defocused image of one of the stereo views. We first apply depth-guided light field rendering to construct a comprehensive training dataset for such hybrid sensing setups. Next, we adopt the hourglass network architecture to separately conduct depth inference from DfD and from stereo. Finally, we exploit different connection schemes between the two separate networks to integrate them into a unified solution that produces high-fidelity 3D disparity maps. Comprehensive experiments on real and synthetic data show that our new learning-based hybrid 3D sensing technique can significantly improve the accuracy and robustness of 3D reconstruction.