Voxurf: Voxel-based Efficient and Accurate Neural Surface Reconstruction

Tong Wu, Jiaqi Wang, Xingang Pan, Xudong Xu, Christian Theobalt, Ziwei Liu, Dahua Lin
SenseTime-CUHK Joint Lab, The Chinese University of Hong Kong,
Shanghai AI Laboratory, Max Planck Institute for Informatics,
S-Lab, Nanyang Technological University, Centre of Perceptual and Interactive Intelligence
{wt020, xx018, dhlin}@ie.cuhk.edu.hk, wangjiaqi@pjlab.org.cn,
{xpan,theobalt}@mpi-inf.mpg.de, ziwei.liu@ntu.edu.sg
Abstract

Neural surface reconstruction aims to reconstruct accurate 3D surfaces from multi-view images. Previous methods based on neural volume rendering mostly train a fully implicit model, and they require hours of training for a single scene. Recent efforts explore the explicit volumetric representation, which substantially accelerates the optimization process by memorizing significant information in learnable voxel grids. However, these voxel-based methods often struggle to reconstruct fine-grained geometry. Through empirical studies, we find that high-quality surface reconstruction hinges on two key factors: the capability of constructing a coherent shape and the precise modeling of color-geometry dependency. In particular, the latter is the key to accurate reconstruction of fine details. Inspired by these findings, we develop Voxurf, a voxel-based approach for efficient and accurate neural surface reconstruction, which consists of two stages: 1) leverage a learnable feature grid to construct the color field and obtain a coherent coarse shape, and 2) refine detailed geometry with a dual color network that captures precise color-geometry dependency. We further introduce a hierarchical geometry feature to enable information sharing across voxels. Our experiments show that Voxurf achieves high efficiency and high quality at the same time. On the DTU benchmark, Voxurf achieves higher reconstruction quality than state-of-the-art methods, with a 20x speedup in training.

1 Introduction

Neural surface reconstruction based on multi-view images has recently seen dramatic progress. Inspired by the success of Neural Radiance Fields (NeRF) (Mildenhall et al., 2020) on Novel View Synthesis (NVS), recent works follow the neural volume rendering scheme to represent the 3D geometry with an SDF or occupancy field via a fully implicit model (Oechsle et al., 2021; Yariv et al., 2021; Wang et al., 2021). These approaches train a deep multilayer perceptron (MLP), which takes in hundreds of sampled points on each camera ray and outputs the corresponding colors and geometry information. Pixel-wise supervision is then applied by measuring the difference between the accumulated color on each ray and the ground truth. Because a pure MLP-based framework struggles to fit the geometric and color details needed for surface reconstruction, these methods require hours of training time for a single scene, substantially limiting their applications in the real world.

Recent advances in NeRF accelerate the training process with the aid of an explicit volumetric representation (Sun et al., 2021; Yu et al., 2021). They directly store and optimize the geometry and color information in explicit voxel grids: the density of a queried point can be readily interpolated from the eight neighboring voxels, and the view-dependent color is either represented with spherical harmonic coefficients (Yu et al., 2021) or predicted by shallow MLPs that take learnable features stored in grids as auxiliary inputs (Sun et al., 2021). These approaches achieve competitive rendering performance at a much lower training cost (around 20 minutes). However, the 3D surfaces they reconstruct cannot faithfully represent the exact geometry, suffering from conspicuous noise in free space and holes on the surface (Fig. 1 (a)). This is a common drawback shared by NeRF and its variants, caused by the density-based volume rendering trained for the NVS task.

Figure 1: Comparisons. We show examples of surface reconstruction and novel view synthesis with different methods. (a) DVGO (Sun et al., 2021) benefits from the fastest convergence but suffers from poor surface reconstruction; (b) NeuS (Wang et al., 2021) produces clean and decent surfaces after a long training time, but high-frequency details are lost in both the geometry and the image; (c) our method achieves around a 20x speedup over NeuS and recovers high-quality surfaces and images with fine details. All training times are measured on a single Nvidia A100 GPU.

In this work, we aim to take advantage of explicit volumetric representation for efficient training and propose customized designs to harvest high-quality surface reconstruction. A naive baseline model for this purpose is to embed the improved volume rendering scheme for surface reconstructions (Oechsle et al., 2021; Wang et al., 2021; Yariv et al., 2021) into an efficient framework with explicit volumetric representation. We evaluate this idea via an empirical study in Sec. 4, which unveils the challenges of our goal.

Specifically, we note that a high-quality surface reconstruction has two criteria: 1) maintaining a coherent coarse shape and 2) recovering fine geometry details. Considering the tight correlation between color and geometry in the volume rendering framework, the two criteria require, respectively, a) an accurate color field that is able to model complex scenes with different materials and high-frequency textures for constructing a coherent coarse shape and b) an exact color-geometry dependency to boost geometry learning with fine details. Our empirical study in Sec. 4 is based on a hybrid architecture (Sun et al., 2021) with explicit volumetric representation and shallow MLPs to predict colors. Experimental results reveal that a single shallow MLP for color prediction is incapable of representing the complex scenes mentioned in a), which sometimes leads to conspicuous geometry flaws; incorporating learnable feature grids boosts the capacity, but it disturbs the color-geometry dependency mentioned in b), because some high-frequency color details are stored directly in the feature grids and thus fail to affect geometric details. Furthermore, extracting geometry information from a single point hinders information sharing across different voxels, which hurts the surface smoothness and introduces local minima.

To tackle the challenges above, we introduce Voxurf, a two-stage pipeline for efficient and accurate voxel-based surface reconstruction: 1) we first make use of a learnable feature grid to stably obtain a coarse yet coherent structure; 2) then, for fine geometry optimization, we design a dual color network that leverages the learnable feature grid and the extracted geometric information, respectively, so that it can represent a complex color field and preserve the color-geometry dependency at the same time; we also introduce a hierarchical geometry feature based on the SDF voxel grid to encourage information sharing in a relatively larger region; 3) finally, several regularization terms are introduced to encourage compact and smooth geometry.

We conduct experiments on the DTU (Jensen et al., 2014) and BlendedMVS (Yao et al., 2020) datasets for quantitative and qualitative evaluations. Experimental results demonstrate that Voxurf achieves lower Chamfer Distance on the DTU (Jensen et al., 2014) benchmark than the state-of-the-art methods with around 20x speedup. It also achieves remarkable results on the auxiliary task of NVS. As illustrated in Fig. 1, our method is shown to be superior at preserving high-frequency details in both geometry reconstruction and image rendering compared to the previous approaches. In summary, our contributions are highlighted below:

  1. Our method achieves around 20x speedup for training compared to the state-of-the-art approaches, reducing the training time from over 5 hours to 16 minutes on a single Nvidia A100 GPU.

  2. Our method achieves higher surface reconstruction accuracy and novel view synthesis quality, and it is superior in representing fine details for both surface recovery and image rendering compared to previous approaches.

  3. Our study provides insightful observations and analysis on the architecture design of such a hybrid model for surface reconstruction.

2 Related Works

2.1 Neural surface reconstruction

Recently, implicit representations that encode the geometry or appearance of a 3D scene by neural networks have gained a lot of attention (Park et al., 2019; Mescheder et al., 2019; Atzmon et al., 2019; Jiang et al., 2020; Zhang et al., 2021). Among them, a plethora of papers have explored neural surface reconstruction from multi-view images. One branch of approaches (Niemeyer et al., 2020; Yariv et al., 2020; Liu et al., 2020; Kellnhofer et al., 2021) leverages surface rendering to obtain the color of each ray, i.e., the color of the intersection point of the ray with the surface is regarded as the final rendered color. However, they usually require accurate object masks and careful weight initialization. Another line of attempts (Wang et al., 2021; Yariv et al., 2021; Oechsle et al., 2021; Darmon et al., 2021) resorts to volume rendering (Max, 1995) to get rid of the mask requirement. Given that volume density alone cannot lead to satisfactory surface reconstruction, these methods formulate the radiance fields and implicit surface representations in a unified model, thereby achieving the merits of both representations. Encoding the whole scene in a fully implicit manner, however, requires a long training time. In a departure from these works with pure MLP networks only, our method leverages learnable voxel grids and shallow color networks for quick convergence, and it also produces better surfaces and novel-view images with fine details.

2.2 Explicit and hybrid volumetric representations

Despite the great success of neural implicit representations in 3D modeling (Mescheder et al., 2019; Chen & Zhang, 2019; Park et al., 2019; Lombardi et al., 2019; Sitzmann et al., 2019; Saito et al., 2019), hybrid volumetric representations that explicitly integrate separate conventional 3D representations, e.g., point clouds, voxels, and MPIs, have received growing attention in recent years. NeX (Wizadwongsa et al., 2021) proposes a hybrid NeRF-MPI modeling strategy for real-time novel view synthesis, which renders finer-scale details of much higher quality as well as view-dependent effects and thus leads to state-of-the-art results. Point-NeRF (Xu et al., 2022) combines neural features with explicit point clouds to model a radiance field, and thus enables efficient rendering by aggregating neural point features near scene surfaces. Neural Volumes (Lombardi et al., 2019) uses a voxel grid to model the scene and handles the variation along the time axis with a learned warping function. Instant-ngp (Müller et al., 2022) uses multi-resolution hashing for efficient encoding, and the whole system is implemented with fully-fused CUDA kernels for fast convergence. Plenoxels (Yu et al., 2021) represent a scene as a sparse 3D grid with spherical harmonics and are optimized two orders of magnitude faster than Neural Radiance Fields with no loss in visual quality. TensoRF (Chen et al., 2022) considers the full volume field as a 4D tensor and proposes to factorize the tensor into multiple compact low-rank tensor components for efficient scene modeling. The method most related to ours is DVGO (Sun et al., 2021), which represents a scene with a hybrid volumetric representation that optimizes voxel grids and a shallow MLP. Although these works have shown remarkable results on the novel view synthesis task, none of them is designed to learn the surface geometry of objects as a separate entity. In contrast, we aim not only at rendering photorealistic images with fine details from novel viewpoints but also at reconstructing high-quality surfaces from multiple input images.

3 Preliminaries

In this work, we follow the volume rendering scheme proposed by NeuS (Wang et al., 2021) to replace radiance fields with neural SDF representations (Park et al., 2019) for surface reconstruction. For the explicit volumetric representation, we adopt a hybrid model similar to DVGO (Sun et al., 2021) that consists of a shallow MLP and learnable feature voxel grids to represent the color field.

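Volume rendering with SDF representation. NeuS represents a scene as an implicit SDF representation parameterized by an MLP. For a specific pixel, the ray emitted from the camera center $\mathbf{o}$ through it in the viewing direction $\mathbf{v}$ can be expressed as $\{\mathbf{p}(t) = \mathbf{o} + t\mathbf{v} \mid t \geq 0\}$. The rendered color $C(\mathbf{r})$ for the pixel is integrated along the ray with volume rendering (Max, 1995), which can be approximated with $N$ discrete sampled points $\{\mathbf{p}(t_i)\}_{i=1}^{N}$ on the ray:

$$C(\mathbf{r}) = \sum_{i=1}^{N} T_i \alpha_i c_i, \quad T_i = \prod_{j=1}^{i-1} (1 - \alpha_j), \tag{1}$$

where $\alpha_i$ is the opacity value, $c_i$ is the color of the $i$-th sample, and $T_i$ is the accumulated transmittance. The key difference between NeuS and NeRF is the formula of $\alpha_i$. Given the implicit SDF function $f(\mathbf{p}(t))$, they introduce a probability density function $\phi_s(f(\mathbf{p}(t)))$, where $\phi_s$ adopts the logistic density distribution function and is defined as $\phi_s(x) = s e^{-sx} / (1 + e^{-sx})^2$. Note that $\phi_s$ is also the derivative of the Sigmoid function $\Phi_s(x) = (1 + e^{-sx})^{-1}$. After a series of derivations, $\alpha_i$ in NeuS is formulated as:

$$\alpha_i = \max\left( \frac{\Phi_s(f(\mathbf{p}(t_i))) - \Phi_s(f(\mathbf{p}(t_{i+1})))}{\Phi_s(f(\mathbf{p}(t_i)))},\ 0 \right). \tag{2}$$

Here, the value $s$ above is learned or manually updated during training.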
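To make the discrete rendering step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the standard NeuS opacity, alpha_i = max((Phi_s(f_i) - Phi_s(f_{i+1})) / Phi_s(f_i), 0), with Phi_s the sigmoid of sharpness s, and accumulates colors with the transmittance T_i. The function names, the toy SDF samples, the per-segment colors, and the value s = 50 are all hypothetical; plain Python lists are used to keep the example dependency-free.

```python
import math


def logistic_cdf(x, s):
    """Sigmoid Phi_s(x) = 1 / (1 + exp(-s * x))."""
    return 1.0 / (1.0 + math.exp(-s * x))


def neus_alphas(sdf_values, s):
    """Opacity of each ray segment from consecutive SDF samples (assumed NeuS form)."""
    alphas = []
    for f_cur, f_next in zip(sdf_values[:-1], sdf_values[1:]):
        phi_cur = logistic_cdf(f_cur, s)
        phi_next = logistic_cdf(f_next, s)
        # Clamp at zero so segments where the SDF increases contribute nothing.
        alphas.append(max((phi_cur - phi_next) / phi_cur, 0.0))
    return alphas


def render_ray(alphas, colors):
    """Return sum_i T_i * alpha_i * c_i and the per-segment weights T_i * alpha_i."""
    transmittance = 1.0  # T_1 = 1: nothing has been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for alpha, color in zip(alphas, colors):
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, weights


# Toy ray: the SDF falls from positive (outside) to negative (inside).
sdf_samples = [0.3, 0.2, 0.1, 0.0, -0.1, -0.2]
segment_colors = [[1.0, 0.5, 0.0]] * 5  # hypothetical per-segment colors
alphas = neus_alphas(sdf_samples, s=50.0)
pixel, weights = render_ray(alphas, segment_colors)
```

In this toy ray the weights concentrate on the two segments adjacent to the zero crossing of the SDF, which is the behavior that ties the rendered color to the surface location.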

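Hybrid volumetric representation. DVGO (Sun et al., 2021) proposes to represent the geometry with an explicit density voxel grid $V^{(\mathrm{density})}$. It applies a hybrid representation for color prediction that comprises a shallow MLP $f_\Theta$ parameterized by $\Theta$ and a feature voxel grid $V^{(\mathrm{feat})}$. Given a 3D position $\mathbf{x}$ and the viewing direction $\mathbf{v}$, the volume density $\sigma$ and color $c$ are estimated with:

$$\sigma = \mathrm{interp}(\mathbf{x}, V^{(\mathrm{density})}), \tag{3}$$
$$c = f_\Theta(\mathrm{interp}(\mathbf{x}, V^{(\mathrm{feat})}), \mathbf{x}, \mathbf{v}), \tag{4}$$

where 'interp' denotes trilinear interpolation. Following NeRF (Mildenhall et al., 2020; Tancik et al., 2020), the positional encoding for both $\mathbf{x}$ and $\mathbf{v}$ is applied in Eqn. 4.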
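The 'interp' operation can be illustrated with a small sketch of trilinear interpolation on a scalar voxel grid. It assumes the query point is given in continuous voxel-index coordinates and that each grid axis has at least two voxels; the function name, the clamping at the border, and the toy grid are illustrative choices, not details taken from DVGO. A multi-channel feature grid would apply the same weights to every channel.

```python
import math


def interp(x, grid):
    """Trilinearly interpolate grid[i][j][k] at continuous index x = (x0, x1, x2).

    Assumes every axis of the grid has at least two voxels.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for coord, n in zip(x, dims):
        coord = min(max(coord, 0.0), n - 1.0)      # clamp the query to the grid
        lower = min(int(math.floor(coord)), n - 2)  # index of the lower corner
        base.append(lower)
        frac.append(coord - lower)
    value = 0.0
    # Blend the eight corners of the enclosing cell.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = ((frac[0] if di else 1.0 - frac[0])
                          * (frac[1] if dj else 1.0 - frac[1])
                          * (frac[2] if dk else 1.0 - frac[2]))
                value += weight * grid[base[0] + di][base[1] + dj][base[2] + dk]
    return value


# Toy 3x3x3 grid holding a linear function, which trilinear interpolation reproduces exactly.
toy_grid = [[[i + 2 * j + 3 * k for k in range(3)] for j in range(3)] for i in range(3)]
sigma = interp((0.5, 1.25, 1.75), toy_grid)
```

Because each query touches only the eight surrounding voxels, a gradient step at one sample updates only those eight entries, which is the locality discussed later in the paper.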

We adopt the hybrid model because it lets us naturally incorporate the surface normal vectors obtained from the SDF field into the color network branch. Motivated by the BRDF, this design encodes the dependency of the color field on the surface normal and has proved effective for geometry learning in previous works such as NeuS (Wang et al., 2021) and IDR (Yariv et al., 2020).

4 Study on architecture design for geometry learning

Figure 2: Reconstruction results with different architecture designs. The surface normal and local feature are both optional inputs to the MLP. We conduct experiments on two cases on the left, and zoom in to visualize the produced surface, normal field, and feature field of different settings on the right.

In this section, we carry out some prior experiments with variants of the baseline hybrid model above for surface reconstruction, aiming to figure out the key factors for architecture design that lead to a decent geometry learning.

Based on the hybrid model introduced in Sec. 3, we modify it by replacing the density voxel grid $V^{(\mathrm{density})}$ with an SDF voxel grid $V^{(\mathrm{sdf})}$ and employ Eqn. 2 to compute the opacity. We start with a 3-layer shallow MLP as the color network, where 1) the local color feature $\mathbf{f}^{\mathrm{feat}}$ interpolated from the feature voxel grid $V^{(\mathrm{feat})}$ and 2) the normal direction $\mathbf{n}$ calculated from $V^{(\mathrm{sdf})}$ are both optional inputs to it. Recall that a decent surface reconstruction is expected to have both a coherent coarse structure and accurate fine details; we next analyze the performance of these baseline variants from these two points of view.

Intuitively, the capacity of a shallow MLP is limited, and it can hardly represent a complex scene with different materials, high-frequency textures, and view-dependent lighting information. When the ground truth image exhibits rapid color shifts, the volume rendering integration over an under-fitted color field results in corrupted geometry. Case (1) in Fig. 2 gives an example of this, as shown in (a) and (b). Incorporating the local feature $\mathbf{f}^{\mathrm{feat}}$ enables fast color learning and increases the representation ability of the network, and the problem is noticeably alleviated, as shown in Fig. 2, (c) and (d) in case (1).

The key to reconstructing accurate geometry details. We then introduce another "easier" case in Fig. 2, namely case (2). Its texture changes moderately, and the color largely depends on the surface normal due to diffuse reflection. Although the geometry still collapses when neither the normal nor the local feature is given as input (Fig. 2, (a) in case (2)), we observe a reasonable reconstruction with details when the MLP is conditioned only on the normal $\mathbf{n}$ (Fig. 2, (b) in case (2)). Incorporating the local feature $\mathbf{f}^{\mathrm{feat}}$ as a condition does not further reduce the Chamfer Distance (CD); instead, geometry details are missing, since using $\mathbf{f}^{\mathrm{feat}}$ together with $\mathbf{n}$ in the color network disturbs the geometry-color dependency, i.e., the relationship built between the color and the surface normal, as shown in Fig. 2, (c) and (d) in case (2).

In brief, the use of a feature voxel grid helps maintain a coherent coarse shape for complex scenes, yet it disturbs the color-geometry dependency and discourages the reconstruction of fine geometry details. For easy cases where the color shift is closely related to global lighting and geometry, the surface normal input itself is able to stabilize the shape and retain fine details at the same time.

5 Methodology

Based on the observation above, we propose a two-stage training scheme: we first take advantage of the feature voxel grid to stabilize geometry optimization and obtain a coarse yet coherent shape (Sec. 5.1); we then introduce a dual color network for fine geometry optimization, which is able to 1) maintain a coherent geometry structure, and 2) recover accurate surface and image details at the same time; meanwhile, we design a hierarchical geometry feature to enable information sharing in a relatively larger area and encourage a stable optimization (Sec. 5.2); we also introduce several regularization and smoothness priors for compact and smooth geometry learning (Sec. 5.3).

5.1 Coarse shape initialization

We first search for the coarse 3D areas of interest by optimizing the voxel-based density and color as in (Sun et al., 2021). In this way, we obtain the boundary of a smaller subspace for reconstruction and a coarse mask of the non-empty region for the following steps.

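We then initialize our SDF voxel grid $V^{(\mathrm{sdf})}$ with an ellipsoid-like zero level set based on the boundary above, and we perform coarse shape optimization with the aid of the feature voxel grid $V^{(\mathrm{feat})}$ as introduced in Sec. 4. Specifically, we train a 3-layer MLP with both the normal vector $\mathbf{n}$ and the local feature $\mathbf{f}^{\mathrm{feat}}$ as inputs, along with the embedded position $\mathbf{x}$ and viewing direction $\mathbf{v}$. To encourage a stable training process and a smooth surface, we propose to conduct the interpolation on a smoothed voxel grid rather than on the raw data of $V^{(\mathrm{sdf})}$. In particular, we denote by $G(V, k_g, \sigma_g)$ the application of a 3D convolution on the voxel grid $V$ with a kernel of size $k_g \times k_g \times k_g$, whose weight matrix $K$ follows a Gaussian distribution with a standard deviation of $\sigma_g$, as defined below:

$$K_{i,j,k} = \frac{1}{Z} \exp\left( -\frac{(i - \lfloor k_g/2 \rfloor)^2 + (j - \lfloor k_g/2 \rfloor)^2 + (k - \lfloor k_g/2 \rfloor)^2}{2\sigma_g^2} \right), \quad i, j, k \in \{0, 1, \ldots, k_g - 1\}, \tag{5}$$

where $Z$ denotes a normalization term. Querying a smoothed SDF value $\tilde{d}$ for an arbitrary point $\mathbf{x}$ thus becomes:

$$\tilde{d} = \mathrm{interp}(\mathbf{x}, G(V^{(\mathrm{sdf})}, k_g, \sigma_g)). \tag{6}$$

We use $\tilde{d}$ for the ray marching integration following Eqn. 1 and Eqn. 2 and calculate the reconstruction loss. We also apply total variation (TV) regularization on $V^{(\mathrm{sdf})}$ and its gradient, as introduced in Sec. 5.3.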
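The smoothing step can be sketched as follows, under stated assumptions: the kernel weights are proportional to a Gaussian of the offset from the kernel center and normalized to sum to one, and the border is handled by replicating edge voxels. The padding choice, the function names, and the values k = 3 and sigma = 0.8 are illustrative, not taken from the paper. The example smooths a single noisy spike to show how the kernel spreads isolated values over their neighborhood.

```python
import math


def gaussian_kernel(k, sigma):
    """k x k x k weights proportional to exp(-r^2 / (2 sigma^2)), normalized to sum to 1."""
    center = k // 2
    raw = [[[math.exp(-((i - center) ** 2 + (j - center) ** 2 + (l - center) ** 2)
                      / (2.0 * sigma ** 2))
             for l in range(k)] for j in range(k)] for i in range(k)]
    z = sum(w for plane in raw for row in plane for w in row)  # normalization term
    return [[[w / z for w in row] for row in plane] for plane in raw]


def smooth(grid, kernel):
    """Convolve a 3D grid with the kernel, replicating border voxels (assumed padding)."""
    n0, n1, n2 = len(grid), len(grid[0]), len(grid[0][0])
    k = len(kernel)
    center = k // 2

    def clamp(index, size):
        return min(max(index, 0), size - 1)

    out = [[[0.0] * n2 for _ in range(n1)] for _ in range(n0)]
    for i in range(n0):
        for j in range(n1):
            for l in range(n2):
                acc = 0.0
                for a in range(k):
                    for b in range(k):
                        for c in range(k):
                            acc += kernel[a][b][c] * grid[clamp(i + a - center, n0)][
                                clamp(j + b - center, n1)][clamp(l + c - center, n2)]
                out[i][j][l] = acc
    return out


# Toy 5x5x5 grid with a single spike, standing in for a noisy SDF voxel.
spike = [[[0.0] * 5 for _ in range(5)] for _ in range(5)]
spike[2][2][2] = 1.0
kernel = gaussian_kernel(3, 0.8)  # hypothetical kernel size and standard deviation
smoothed = smooth(spike, kernel)
```

Since the kernel is symmetric, correlation and convolution coincide here. In a full pipeline the smoothed grid would then be passed to the trilinear interpolation before ray marching.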

5.2 Fine geometry optimization

Figure 3: Overview of key components in our model. We leverage a hybrid representation with an SDF voxel grid $V^{(\mathrm{sdf})}$ and a feature voxel grid $V^{(\mathrm{feat})}$. In the middle, we show the design of our dual color network, where $\mathbf{f}^{\mathrm{feat}}$ is the interpolated feature from $V^{(\mathrm{feat})}$ at point $\mathbf{x}$, and $\mathbf{f}^{\mathrm{geo}}$ denotes the hierarchical feature constructed on the right. Here we show the multi-level sampling scheme and the region of grids that is affected by one point during optimization with different settings of levels.

At this stage, we aim to recover accurate geometry details based on the coarse initialization above. We note that the challenges are two-fold: 1) The study in Sec. 4 reveals a conflict introduced by the feature voxel grid, i.e., the representation capacity of the color field is improved at the expense of the color-geometry dependency. 2) The optimization of the SDF voxel grid is based on trilinear interpolation to query a 3D point. This operation brings fast convergence, but it also limits information sharing across different locations, which may lead to local minima with degenerate solutions and sub-optimal smoothness. We propose a dual color network and a hierarchical geometry feature design to address these two issues, respectively.

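Dual color network. We first take inspiration from the observation in Sec. 4 and design a dual color network with a color residual, which takes advantage of both the geometry information (e.g., the normal vector $\mathbf{n}$) and the local feature $\mathbf{f}^{\mathrm{feat}}$ interpolated from the learnable feature voxel grid $V^{(\mathrm{feat})}$. Specifically, we train two shallow MLPs with different extra inputs besides the embedded position and view direction. The first MLP $g_{\mathrm{geo}}$ takes only the geometry feature $\mathbf{f}^{\mathrm{geo}}$ to build the color-geometry dependency; the second one $g_{\mathrm{feat}}$ takes both $\mathbf{f}^{\mathrm{geo}}$ and $\mathbf{f}^{\mathrm{feat}}$ as inputs to enable faster and more accurate color learning, which in turn benefits the geometry optimization. The two networks are combined in a residual manner with detaching operations as shown in Fig. 3: the output of $g_{\mathrm{geo}}$, denoted by $c_0$, is detached before being input to $g_{\mathrm{feat}}$, and the output of $g_{\mathrm{feat}}$ is added back to a detached copy of $c_0$ to get the final color prediction $c$. The outputs of both $g_{\mathrm{geo}}$ and $g_{\mathrm{feat}}$ are supervised by a reconstruction loss between the ground truth image and the color integrated along the ray. Specifically, the rendered colors from them are denoted as $C_0(\mathbf{r})$ and $C(\mathbf{r})$, and the corresponding reconstruction losses are formulated by:

$$\mathcal{L}_{\mathrm{recon}} = \frac{1}{|\mathcal{R}|} \sum_{\mathbf{r} \in \mathcal{R}} \left( \left\| C(\mathbf{r}) - \hat{C}(\mathbf{r}) \right\| + \lambda_0 \left\| C_0(\mathbf{r}) - \hat{C}(\mathbf{r}) \right\| \right), \tag{7}$$

where $\mathcal{R}$ is the batch of sampled rays, $\hat{C}(\mathbf{r})$ denotes the ground truth color, and $\lambda_0$ denotes a loss weight.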

We intuitively explain why we adopt the detaching operations here: thanks to the assistance of the feature voxel grid $V^{(\mathrm{feat})}$, the feature-conditioned MLP $g_{\mathrm{feat}}$ fits the color field rapidly. However, this would affect the learning process of the purely geometry-conditioned MLP $g_{\mathrm{geo}}$ if their outputs were directly combined, since $g_{\mathrm{geo}}$ fits the scene at a slower pace. We thus detach the effect of $g_{\mathrm{feat}}$ to enable a stable optimization of $g_{\mathrm{geo}}$ by its own reconstruction loss, which further benefits the learning of the color-geometry dependency. We carry out an ablation study on this design in Sec. 6.2.
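The residual combination with detaching can be written compactly with a stop-gradient operator. The notation below is assumed for illustration and is not taken from the paper: $g_{\mathrm{geo}}$ and $g_{\mathrm{feat}}$ name the two MLPs, $\theta_{\mathrm{geo}}$ the parameters of the first, $\gamma$ the positional embedding, $\mathbf{f}^{\mathrm{geo}}$ and $\mathbf{f}^{\mathrm{feat}}$ the geometry and local features, and $\mathrm{sg}(\cdot)$ the identity in the forward pass with zero gradient in the backward pass.

```latex
\begin{aligned}
c_0 &= g_{\mathrm{geo}}\big(\gamma(\mathbf{x}), \gamma(\mathbf{v}), \mathbf{f}^{\mathrm{geo}}\big), \\
c   &= \mathrm{sg}(c_0) + g_{\mathrm{feat}}\big(\gamma(\mathbf{x}), \gamma(\mathbf{v}),
       \mathbf{f}^{\mathrm{geo}}, \mathbf{f}^{\mathrm{feat}}, \mathrm{sg}(c_0)\big), \\
\frac{\partial c}{\partial \theta_{\mathrm{geo}}} &= 0 .
\end{aligned}
```

Under this reading, the loss on the final color $c$ never updates the parameters of $g_{\mathrm{geo}}$, so that network is trained only through the loss on its own rendered color, which matches the intuition given above.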

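Hierarchical geometry feature. Recall that we have been using the normal vector $\mathbf{n}$ as the geometry information input to the color networks, which takes only the information from adjacent grids of $V^{(\mathrm{sdf})}$. In order to enlarge the perception area and encourage information sharing among grid points, we propose to look at a larger region of the SDF field and take the corresponding SDF values and gradients as an auxiliary condition to the color network. Specifically, for a given 3D position $\mathbf{x} = (x, y, z)$, we first define a selection rule for its neighbours. We take half of the voxel size $v_s$ as one step and define its neighbours along the $x$, $y$, and $z$ axes on both sides. Taking the $x$ axis as an example, the neighbouring coordinates are defined as $\mathbf{x}_x^{l-} = (x^{l-}, y, z)$ and $\mathbf{x}_x^{l+} = (x^{l+}, y, z)$, where $x^{l-} = \max(x - l \cdot v_s, 0)$, $x^{l+} = \min(x + l \cdot v_s, x_{\max})$, $l \in \{0.5, 1.0, 1.5, \ldots\}$ denotes the 'level' of the neighbour area, and $x_{\max}$ denotes the maximum of the voxel grid on axis $x$. We then extend the definition in a hierarchical manner by concatenating the neighbours from different levels together, as formulated below:

$$D(\mathbf{x}, l) = [d^{0}, d^{0.5}, d^{1.0}, \ldots, d^{l}], \quad d^{l} = [d_x^{l-}, d_x^{l+}, d_y^{l-}, d_y^{l+}, d_z^{l-}, d_z^{l+}], \tag{8}$$

where $d_a^{l-}$ and $d_a^{l+}$ denote the SDF values queried from $V^{(\mathrm{sdf})}$ at locations $\mathbf{x}_a^{l-}$ and $\mathbf{x}_a^{l+}$ for axis $a \in \{x, y, z\}$. When $l = 0$, $D(\mathbf{x}, 0) = d^{0}$, which is exactly the SDF value at the location $\mathbf{x}$ itself. Then, we also incorporate the gradient information into the geometry feature, as in previous works (Yariv et al., 2020; Wang et al., 2021). Specifically, we obtain the gradient vector $\mathbf{g}^{l} = (g_x^{l}, g_y^{l}, g_z^{l})$ with $g_a^{l} = (d_a^{l+} - d_a^{l-}) / (2 l \cdot v_s)$, and the approximate normal vector $\mathbf{n}^{l}$ is obtained by normalizing $\mathbf{g}^{l}$ to an l2-norm of 1. The hierarchical version of the normal is formulated as:

$$N(\mathbf{x}, l) = [\mathbf{n}^{0.5}, \mathbf{n}^{1.0}, \ldots, \mathbf{n}^{l}]. \tag{9}$$

Finally, the hierarchical geometry feature at point $\mathbf{x}$ for level $l$ combines the information above by:

$$\mathbf{f}_l^{\mathrm{geo}}(\mathbf{x}) = [D(\mathbf{x}, l), N(\mathbf{x}, l)]. \tag{10}$$

As shown in Fig. 3, $\mathbf{f}_l^{\mathrm{geo}}$ with a predefined level $l$ is input to the color MLPs to assist geometry learning.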
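The construction of the hierarchical feature can be sketched as below. This is one plausible reading of the description, not the authors' code: levels advance in steps of half a voxel, neighbours on each axis are clamped to the grid extent, each level contributes six SDF values and one normalized central-difference gradient, and level 0 contributes only the SDF value at the point itself. The callable `sdf`, the function name, the ordering of the concatenated entries, and the toy sphere are all hypothetical; the gradient divides by the actual (clamped) neighbour spacing.

```python
import math


def hierarchical_feature(sdf, p, voxel_size, grid_max, max_level):
    """Concatenate SDF values and unit gradients for levels 0.5, 1.0, ..., max_level."""
    sdf_feats = [sdf(p)]  # level 0: the SDF value at p itself
    normal_feats = []
    for step in range(1, int(round(max_level / 0.5)) + 1):
        level = 0.5 * step
        grad = []
        for axis in range(3):
            lo, hi = list(p), list(p)
            # Neighbours on both sides of the axis, clamped to the grid extent.
            lo[axis] = max(p[axis] - level * voxel_size, 0.0)
            hi[axis] = min(p[axis] + level * voxel_size, grid_max[axis])
            d_lo, d_hi = sdf(lo), sdf(hi)
            sdf_feats += [d_lo, d_hi]
            grad.append((d_hi - d_lo) / max(hi[axis] - lo[axis], 1e-12))
        norm = math.sqrt(sum(g * g for g in grad)) or 1.0  # avoid division by zero
        normal_feats += [g / norm for g in grad]
    return sdf_feats + normal_feats


def sphere_sdf(p):
    """Toy SDF: unit sphere centered at (2, 2, 2)."""
    return math.dist(p, (2.0, 2.0, 2.0)) - 1.0


feature = hierarchical_feature(sphere_sdf, (3.0, 2.0, 2.0), voxel_size=1.0,
                               grid_max=(4.0, 4.0, 4.0), max_level=2.0)
```

With four levels the feature holds 1 + 4 x 6 SDF values followed by 4 x 3 normal components. Each optimization step at a point therefore touches SDF values up to two voxels away, rather than only the enclosing cell.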

5.3 Regularization and smoothness prior

We incorporate several regularization terms to encourage the continuity and smoothness of voxel grids during training. Specifically, we adopt a total variation (TV) regularization (Rudin & Osher, 1994):

$$\mathcal{L}_{\mathrm{TV}}(V) = \sum_{d \in [D]} \sqrt{\Delta_x^2(V, d) + \Delta_y^2(V, d) + \Delta_z^2(V, d)}, \tag{11}$$

where $\Delta_x^2(V, d)$ denotes the squared difference between the $d$th value in voxel $(i, j, k)$ and the $d$th value in voxel $(i+1, j, k)$, which can be analogously extended to $\Delta_y^2(V, d)$ and $\Delta_z^2(V, d)$. We apply the TV term above to the SDF voxel grid, denoted by $\mathcal{L}_{\mathrm{TV}}(V^{(\mathrm{sdf})})$, which encourages a continuous and compact geometry. We also assume the surface to be relatively smooth in a local area, and we reuse the Gaussian convolution $G(V, k_g, \sigma_g)$ defined in Eqn. 5 to introduce a smoothness regularization formulated as:

$$\mathcal{L}_{\mathrm{smooth}}(V) = \left\| G(V, k_g, \sigma_g) - V \right\|_2^2. \tag{12}$$

We apply the smoothness term above to the gradient of the SDF voxel grid, denoted by $\mathcal{L}_{\mathrm{smooth}}(\nabla V^{(\mathrm{sdf})})$, which encourages a smooth surface and alleviates the issue of noisy points in the space. Notice that we can also naturally conduct post-processing on the SDF field after training thanks to its explicit representation. For example, applying the Gaussian kernel above before extracting the geometry can further boost surface smoothness for better visualization.
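As a small illustration of the TV term, the sketch below evaluates the square root of summed squared forward differences on a single-channel grid. Averaging over voxels and skipping the last voxel on each axis are assumptions made for this example; the function name and toy grids are hypothetical.

```python
import math


def tv_loss(grid):
    """Mean of sqrt(dx^2 + dy^2 + dz^2) over voxels, using forward differences."""
    n0, n1, n2 = len(grid), len(grid[0]), len(grid[0][0])
    total, count = 0.0, 0
    for i in range(n0 - 1):
        for j in range(n1 - 1):
            for k in range(n2 - 1):
                v = grid[i][j][k]
                dx = grid[i + 1][j][k] - v
                dy = grid[i][j + 1][k] - v
                dz = grid[i][j][k + 1] - v
                total += math.sqrt(dx * dx + dy * dy + dz * dz)
                count += 1
    return total / count


# Toy grids: a constant field, a linear ramp along the first axis, and the ramp with one outlier.
constant = [[[1.0] * 3 for _ in range(3)] for _ in range(3)]
ramp = [[[0.5 * i for _ in range(3)] for _ in range(3)] for i in range(3)]
noisy = [[row[:] for row in plane] for plane in ramp]
noisy[1][1][1] += 1.0

tv_constant = tv_loss(constant)
tv_ramp = tv_loss(ramp)
tv_noisy = tv_loss(noisy)
```

The constant grid has zero penalty, the ramp is penalized in proportion to its slope, and the single outlier raises the penalty further, which is why minimizing this term discourages isolated noisy voxels.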

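Finally, the overall training loss is formulated as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{recon}} + \lambda_{tv} \mathcal{L}_{\mathrm{TV}}(V^{(\mathrm{sdf})}) + \lambda_{s} \mathcal{L}_{\mathrm{smooth}}(\nabla V^{(\mathrm{sdf})}), \tag{13}$$

where $\mathcal{L}_{\mathrm{recon}}$ is the reconstruction loss of Eqn. 7, and $\lambda_{tv}$ and $\lambda_{s}$ denote the weights for the corresponding TV and smoothness terms.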

6 Experiments

Datasets. The DTU dataset contains different static scenes with 49 or 64 posed multi-view images for each scene. It covers a variety of objects with different materials, geometry, and texture. We evaluate our approach on DTU with the same 15 scenes following IDR (Yariv et al., 2020) and quantitatively compare with previous work by Chamfer Distance given the ground truth point clouds. The BlendedMVS (Yao et al., 2020) dataset contains 113 scenes that cover a variety of real-world environments, providing 31 to 143 posed multi-view images for each. We select 7 challenging scenes following NeuS (Wang et al., 2021) and present qualitative comparisons with previous works.
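For readers unfamiliar with the metric, the following is a minimal sketch of a symmetric Chamfer Distance between two point sets, computed as the average of the two one-sided mean nearest-neighbour distances. The exact DTU evaluation protocol (for example, any masking or outlier handling) is not spelled out in this section and is not reproduced here; the function names and toy points are hypothetical, and the brute-force search is only suitable for small inputs.

```python
import math


def one_sided(src, dst):
    """Mean distance from each point in src to its nearest neighbour in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(pred, gt):
    """Average of the prediction-to-ground-truth and ground-truth-to-prediction terms."""
    return 0.5 * (one_sided(pred, gt) + one_sided(gt, pred))


# Toy point sets: the prediction is the ground truth shifted by 0.1 along z.
gt_points = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]
pred_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cd = chamfer_distance(pred_points, gt_points)
```

Lower values indicate that the reconstructed surface lies closer to the reference point cloud in both directions.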

| Scan | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 | 97 | 105 | 106 | 110 | 114 | 118 | 122 | mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NeRF (Mildenhall et al., 2020) | 1.83 | 2.39 | 1.79 | 0.66 | 1.79 | 1.44 | 1.50 | 1.20 | 1.96 | 1.27 | 1.44 | 2.61 | 1.04 | 1.13 | 0.99 | 1.54 |
| DVGO (Sun et al., 2021) | 1.83 | 1.74 | 1.70 | 1.53 | 1.91 | 1.91 | 1.77 | 2.60 | 2.08 | 1.79 | 1.76 | 2.12 | 1.60 | 1.80 | 1.58 | 1.85 |
| DVGO with TV terms (Sun et al., 2021) | 1.34 | 1.51 | 1.13 | 0.89 | 1.61 | 1.61 | 1.50 | 2.46 | 2.02 | 1.53 | 1.32 | 1.89 | 1.23 | 0.88 | 2.07 | 1.53 |
| IDR (Yariv et al., 2020) | 1.63 | 1.87 | 0.63 | 0.48 | 1.04 | 0.79 | 0.77 | 1.33 | 1.16 | 0.76 | 0.67 | 0.90 | 0.42 | 0.51 | 0.53 | 0.90 |
| NeuS (Wang et al., 2021) | 0.83 | 0.98 | 0.56 | 0.37 | 1.13 | 0.59 | 0.60 | 1.45 | 0.95 | 0.78 | 0.52 | 1.43 | 0.36 | 0.45 | 0.45 | 0.77 |
| Ours | 0.65 | 0.74 | 0.39 | 0.35 | 0.96 | 0.64 | 0.85 | 1.58 | 1.01 | 0.68 | 0.60 | 1.11 | 0.37 | 0.45 | 0.47 | 0.72 |

Table 1: Quantitative evaluation (Chamfer Distance) on the DTU dataset.

Figure 4: Qualitative comparisons on the DTU dataset.

Baselines. We include several baselines for comparison in this work: 1) IDR (Yariv et al., 2020), 2) NeuS (Wang et al., 2021), 3) NeRF (Mildenhall et al., 2020), 4) DVGO (Sun et al., 2021). More detailed descriptions of these methods are in supplementary materials. The results of 1), 2), and 3) are taken from the original papers (Yariv et al., 2020; Wang et al., 2021), and all are reported in the setting with foreground object masks for a fair comparison. We include the mask-free comparisons in the supplementary materials.

Implementation Details. We set the expected number of voxels to be at the coarse stage and at the fine stage including an up-scaling step. We use a batch size of 8,192 rays with the point sampling step size on a ray to be half of the voxel size. Besides the coarse searching stage in (Sun et al., 2021), we train our coarse initialization stage for 10k iterations and the fine geometry optimization stage for 20k iterations with an Adam optimizer. The initial learning rate is set as for all the MLPs and for voxels in the coarse stage, while the SDF voxel starts by in fine stage. Please see the supplementary materials for additional implementation details.

Figure 5: Qualitative comparisons on the BlendedMVS dataset.

| Method | PSNR | SSIM | LPIPS | CD | Train Time |
|---|---|---|---|---|---|
| DVGO (Sun et al., 2021) | 31.64 | 0.916 | 0.159 | 1.85 | 11 mins |
| NeuS (Wang et al., 2021) | 29.63 | 0.892 | 0.199 | 0.77 | 5.5 hours |
| Ours | 32.16 | 0.929 | 0.144 | 0.72 | 16 mins |

Table 2: An overall comparison on surface reconstruction, the auxiliary task of novel view synthesis, and training time on DTU. The image metrics are reported on test views; the training time for each method is tested on a single Nvidia A100 GPU.

6.1 Comparisons

The quantitative results for surface reconstruction on DTU are reported in Table 1. We further add TV terms to the voxel grids of DVGO (Sun et al., 2021) to provide a stronger baseline, reported as the second DVGO row in Table 1. The quantitative results show that we achieve a lower mean Chamfer Distance than the SOTA methods under the same setting. We then conduct qualitative comparisons on both DTU and BlendedMVS, as shown in Fig. 4 and Fig. 5, respectively. The DVGO baseline shows poor reconstruction quality with noise and holes, since it is designed for novel view synthesis rather than surface reconstruction. NeuS and ours show accurate and continuous surface recovery in a variety of cases. In comparison, NeuS, as a fully implicit model, naturally benefits from intrinsic continuity that encourages smoothness in local areas, but it sometimes fails to recover very thin structures and high-frequency geometry details due to over-smoothing. In contrast, our method can recover fine geometry details thanks to the explicit representation for geometry.

We also evaluate our method on the auxiliary task of novel view synthesis. The quantitative results can be found in Table 2: our method outperforms NeuS by a clear margin and also surpasses DVGO on all the metrics. The qualitative results are contained in the supplementary materials. The overall comparisons in Table 2 demonstrate that Voxurf is highly efficient in training and is able to produce high-quality surface reconstruction as well as image rendering.
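The headline figures follow directly from the Table 2 entries for NeuS and our method; the short calculation below uses only those reported numbers.

```latex
\begin{aligned}
\text{speedup} &= \frac{5.5 \times 60\ \text{min}}{16\ \text{min}} = \frac{330}{16} \approx 20.6, \\
\Delta\text{PSNR} &= 32.16 - 29.63 = 2.53\ \text{dB}, \\
\Delta\text{CD} &= 0.77 - 0.72 = 0.05 \quad (\approx 6.5\%\ \text{relative to NeuS}).
\end{aligned}
```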

6.2 Analysis

Figure 6: The dual color network helps learn the color field for complex scenes which in turn encourages geometry learning; the hierarchical geometry feature directly promotes accurate surface reconstruction (see the normal field at the area with windows.)
Table 3: Ablation over the effect of our dual color network and hierarchical geometry feature. Both techniques prove effective when used alone, and they cooperate well, together reducing the CD by 0.19 from the baseline.
CD 0.91 0.79 0.77 0.72
Dual
Hierarchical
Table 4: Ablation over the design of the dual color network. Residual and Detach denote whether to apply the two operations introduced in Sec. 5.2. The design we adopt in our method is proved to be the best option.
CD 0.77 0.75 0.75 0.72
Residual
Detach

In this section, we carry out a series of ablation studies to figure out the contribution of each technical component.

The effect of the dual color network and the hierarchical geometry feature. As shown in Table 3, both techniques individually work well on the baseline model, and a combination of them produces the best result. 1) The effect of the dual color network can be directly seen in the improvement of image rendering quality, as shown by the comparison of roof textures in Fig. 6. An accurate color will in turn promote geometry learning, and thus applying the module leads to better CD results. Experimental results in Table 4 also validate the effectiveness of the design introduced in Sec. 5.2, including the residual color and detachment. 2) The hierarchical geometry feature directly promotes an accurate surface reconstruction, as demonstrated by the results in Table 3 and the difference between the normal images in Fig. 6. We also explore different design details, including the level selection and the effect of the gradient and SDF value, as shown in Table 5 and Table 6, respectively.

Table 5: Ablation over the level selection for the hierarchical geometry feature. The performance improves with a higher level at first and converges after level 2.0, and we apply that setting for all the experiments.

| level | 0 | 0.5 | 1 | 1.5 | 2 | 2.5 | 3 |
|---|---|---|---|---|---|---|---|
| CD | 0.98 | 0.75 | 0.74 | 0.74 | 0.72 | 0.73 | 0.72 |

Table 6: Ablation over the effect of Gradient and SDF as components of the geometry feature design. The results indicate that a combination of them produces the best results.
CD (mean) 0.79 0.76 0.74 0.72
Gradient
SDF
Figure 7: Studies on technical components that encourage surface smoothness at the (a) coarse shape initialization stage, (b) fine geometry optimization stage, and (c) post-processing when extracting the mesh after training.

Ablation over continuity and smoothness priors. We encourage the continuity and smoothness of the reconstructed surface at different training stages. As shown in Fig. 7 (a), during the coarse shape initialization stage, the naive solution produces holes and noise. Applying the Gaussian convolution substantially alleviates the problem and leads to a more compact geometry. The regularization terms further encourage a clean and smooth surface, which provides a good initialization for the next stage. As shown in Fig. 7 (b), during the fine geometry optimization stage, the regularization terms also help maintain surface smoothness. Finally, as shown in Fig. 7 (c), post-processing on a trained model can promote surface smoothness for better visualization quality while maintaining an accurate structure.
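As a rough illustration of the Gaussian convolution discussed above, the following sketch smooths a dense SDF voxel grid with a separable 3D Gaussian kernel. It is not the authors' implementation: the function names, the kernel size of 5, the sigma of 0.8, and the synthetic noisy sphere are all hypothetical choices made only for this example.

```python
import numpy as np

def gaussian_kernel_1d(size: int, sigma: float) -> np.ndarray:
    """Normalized 1D Gaussian kernel with an odd number of taps."""
    assert size % 2 == 1, "kernel size must be odd"
    offsets = np.arange(size) - size // 2
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def gaussian_smooth_grid(grid: np.ndarray, size: int = 5, sigma: float = 0.8) -> np.ndarray:
    """Smooth a 3D grid by convolving each axis with the same 1D kernel.

    Borders are edge-padded so the output keeps the input shape.
    """
    kernel = gaussian_kernel_1d(size, sigma)
    pad = size // 2
    out = grid.astype(np.float64)
    for axis in range(3):
        pad_width = [(0, 0)] * 3
        pad_width[axis] = (pad, pad)
        padded = np.pad(out, pad_width, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out

# Hypothetical example: a noisy SDF of a sphere on a 32^3 grid.
rng = np.random.default_rng(0)
coords = np.linspace(-1.0, 1.0, 32)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
clean_sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5
noisy_sdf = clean_sdf + 0.05 * rng.standard_normal(clean_sdf.shape)
smooth_sdf = gaussian_smooth_grid(noisy_sdf)
```

Because a Gaussian is separable, three 1D passes give the same result as one full 3D convolution at a lower cost, which is one plausible reason such a filter is cheap enough to apply during training.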

7 Conclusion

This paper proposes Voxurf, a voxel-based approach for efficient and accurate neural surface reconstruction with a two-stage pipeline. The first stage leverages a learnable feature grid to construct the color field and obtain a coherent coarse shape. The second stage refines detailed geometry with a dual color network that captures precise color-geometry dependency, and leverages a hierarchical geometry feature to facilitate the optimization. Our experiments show that Voxurf achieves high efficiency and high quality at the same time.

Appendix A Experimental details

A.1 Additional implementation details

We use the same hyper-parameters for all the scenes. We use a 3-layer MLP for the first training stage, and two 4-layer MLPs for the dual color network in the second training stage. We choose level 2.0 for the hierarchical geometry feature and set the dimension of the feature voxel grid as 6.

For the first stage, we set and for the regularization terms and introduce an additional TV term for with a weight of . The Gaussian kernel is in size with . For the second stage, we set for the reconstruction loss; we set and for the regularization terms. The fine SDF grid starts with a resolution of , which is upscaled by trilinear interpolation to after 15000 iterations.
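The trilinear upscaling of the SDF grid can be sketched as three successive linear interpolations, one per axis. This is a minimal NumPy illustration rather than the authors' code: the function names are introduced here, the 8^3 and 16^3 resolutions are placeholders and not the values used in the paper, and the corner-aligned sampling convention is an assumption.

```python
import numpy as np

def upsample_axis_linear(grid: np.ndarray, new_size: int, axis: int) -> np.ndarray:
    """Linearly resample one axis; the first and last samples stay aligned."""
    old_size = grid.shape[axis]
    # Positions of the new samples, expressed in old index units.
    pos = np.linspace(0.0, old_size - 1, new_size)
    lo = np.clip(np.floor(pos).astype(int), 0, old_size - 2)
    frac = pos - lo
    shape = [1] * grid.ndim
    shape[axis] = new_size
    frac = frac.reshape(shape)
    left = np.take(grid, lo, axis=axis)
    right = np.take(grid, lo + 1, axis=axis)
    return left * (1.0 - frac) + right * frac

def upsample_trilinear(grid: np.ndarray, new_shape) -> np.ndarray:
    """Trilinear upsampling as three separable linear interpolations."""
    out = grid.astype(np.float64)
    for axis, new_size in enumerate(new_shape):
        out = upsample_axis_linear(out, new_size, axis)
    return out

# Placeholder resolutions (assumptions, not the paper's settings).
coarse_res, fine_res = 8, 16
coords = np.linspace(-1.0, 1.0, coarse_res)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
coarse_sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5
fine_sdf = upsample_trilinear(coarse_sdf, (fine_res, fine_res, fine_res))
```

In a training loop, such a call would be made once at the scheduled iteration (15000 in the setting above), after which optimization continues on the finer grid.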

For the value in in Eqn. (2), we design a function based on the iteration, , where controls the beginning value of , denotes the iteration number, and basically controls the decaying speed of along with the increasing iterations. We set , for the first stage and , for the second stage.

For all the experimental results on the DTU (Jensen et al., 2014) dataset, our method is trained on a training set containing around 90% of the images of each scene, following the splitting scheme of (Wang et al., 2021), and the remaining 10% of the images are used to evaluate the novel view synthesis task. We notice that the CD performance is only slightly influenced compared to training on the full dataset. For experiments on the BlendedMVS (Yao et al., 2020) dataset, we use all the images for training.

A.2 Details for baseline methods

We include the following baseline methods for comparison: IDR (Yariv et al., 2020) reconstructs high-quality surfaces with an implicit representation based on foreground object masks and the corresponding mask loss. NeuS (Wang et al., 2021) is a state-of-the-art approach that develops a volume rendering method for surface reconstruction, where the mask supervision is optional. We mainly compare to its mask-aware version. Reconstruction results for NeuS are obtained with its official code (https://github.com/Totoro97/NeuS) and the pre-trained models, and the novel view rendering results are provided by the authors. NeRF (Mildenhall et al., 2020) first proposed using a neural radiance field for novel view synthesis. Though it is not specifically designed for surface reconstruction, we can extract a noisy geometry from a trained NeRF model with a selected threshold. In this paper, the reconstruction evaluation results for NeRF are directly taken from (Wang et al., 2021) for a fair comparison, while we also implement NeRF with nerf-pytorch (https://github.com/yenchenlin/nerf-pytorch) for novel view synthesis. DVGO (Sun et al., 2021) accelerates NeRF with a hybrid representation. We use the official code (https://github.com/sunset1995/DirectVoxGO) for implementation. Similarly, we select a threshold to extract the geometry from the density voxel grid, as shown in Sec. B.5.

Metric PSNR SSIM LPIPS
Scan NeRF DVGO NeuS Ours NeRF DVGO NeuS Ours NeRF DVGO NeuS Ours
24 26.97 27.77 26.13 27.89 0.772 0.830 0.764 0.857 0.331 0.277 0.348 0.239
37 25.99 25.96 24.08 26.90 0.811 0.833 0.798 0.870 0.206 0.184 0.222 0.160
40 27.68 27.75 26.73 28.81 0.786 0.791 0.747 0.841 0.304 0.303 0.352 0.274
55 29.39 30.42 28.06 31.02 0.917 0.939 0.887 0.950 0.143 0.116 0.177 0.108
63 33.07 34.35 28.69 34.38 0.936 0.953 0.937 0.957 0.128 0.095 0.129 0.083
65 30.87 31.18 31.41 31.48 0.954 0.956 0.958 0.960 0.114 0.103 0.112 0.094
69 27.90 29.52 28.96 30.13 0.844 0.921 0.909 0.928 0.308 0.190 0.223 0.181
83 33.49 36.94 31.56 37.43 0.948 0.969 0.950 0.968 0.125 0.084 0.120 0.084
97 27.43 27.67 25.51 28.35 0.900 0.914 0.901 0.923 0.200 0.168 0.192 0.155
105 31.68 32.85 29.18 32.94 0.910 0.928 0.896 0.932 0.186 0.154 0.218 0.148
106 30.73 33.75 32.60 34.17 0.879 0.933 0.914 0.947 0.244 0.167 0.201 0.138
110 29.61 33.10 30.83 32.70 0.872 0.941 0.917 0.937 0.241 0.153 0.200 0.153
114 29.37 30.18 29.32 30.97 0.901 0.914 0.897 0.926 0.193 0.174 0.216 0.159
118 33.44 36.11 35.91 37.24 0.915 0.957 0.948 0.964 0.199 0.123 0.156 0.110
122 33.41 36.99 35.49 37.97 0.935 0.967 0.957 0.972 0.142 0.088 0.114 0.076
mean 30.07 31.64 29.63 32.16 0.885 0.916 0.892 0.929 0.204 0.159 0.199 0.144
Table R7: Quantitative evaluation on the DTU dataset for novel view synthesis. Our method outperforms the baselines (Mildenhall et al., 2020; Sun et al., 2021; Wang et al., 2021) on all three metrics in terms of the mean over all scans.

Appendix B Additional experimental results

B.1 Novel view synthesis

We report the results for novel view synthesis on the DTU dataset in Table R7. Our method outperforms the baselines in all three metrics, including PSNR, SSIM (Wang et al., 2004), and LPIPS (Zhang et al., 2018) (VGG). Examples of rendered images at testing views are shown in Fig. S10 and Fig. S11 in Sec. C.
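For reference, PSNR is the simplest of the three metrics and can be computed directly from the mean squared error between a rendered image and its ground truth. The sketch below assumes pixel values normalized to [0, 1] and flattened into plain sequences; the function name and the toy inputs are hypothetical, and the exact evaluation script behind Table R7 may differ in details such as masking.

```python
import math

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(rendered) != len(reference):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: every pixel is off by 0.1, so the MSE is 0.01.
toy_psnr = psnr([0.5, 0.5], [0.6, 0.4])
```

Higher PSNR and SSIM indicate better renderings, whereas lower LPIPS is better.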

B.2 Comparisons for the w/o mask setting

Our method can also work for cases where no foreground mask is provided. We report our results and comparisons to previous approaches in Table R8.

Scan 24 37 40 55 63 65 69 83 97 105 106 110 114 118 122 mean
Colmap (Schönberger et al., 2016) 0.81 2.05 0.73 1.22 1.79 1.58 1.02 3.05 1.40 2.05 1.00 1.32 0.49 0.78 1.17 1.36
NeRF (Mildenhall et al., 2020) 1.90 1.60 1.85 0.58 2.28 1.27 1.47 1.67 2.05 1.07 0.88 2.53 1.06 1.15 0.96 1.49
UNISURF (Oechsle et al., 2021) 1.32 1.36 1.72 0.44 1.35 0.79 0.80 1.49 1.37 0.89 0.59 1.47 0.46 0.59 0.62 1.02
VolSDF (Yariv et al., 2021) 1.14 1.26 0.81 0.49 1.25 0.70 0.72 1.29 1.18 0.70 0.66 1.08 0.42 0.61 0.55 0.86
NeuS (Wang et al., 2021) 1.00 1.37 0.93 0.43 1.10 0.65 0.57 1.48 1.09 0.83 0.52 1.20 0.35 0.49 0.54 0.84
Ours 0.71 0.78 0.43 0.35 1.03 0.76 0.74 1.49 1.04 0.74 0.51 1.12 0.41 0.55 0.45 0.74
Table R8: Quantitative evaluation on DTU dataset (without mask).
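The per-scan numbers above are Chamfer distances (CD) between reconstructed and reference geometry. As a simplified illustration of the metric, the sketch below computes a symmetric Chamfer distance between two small point sets by brute force. It is not the DTU evaluation protocol, which additionally involves point sampling and filtering steps; the function name and the toy point sets are assumptions made for this example.

```python
import numpy as np

def chamfer_distance(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """Mean of the two directed average nearest-neighbor distances.

    points_a has shape (N, 3) and points_b has shape (M, 3).
    Brute force is O(N * M) in memory, so this suits small point sets only.
    """
    diff = points_a[:, None, :] - points_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # (N, M) pairwise distances
    a_to_b = dist.min(axis=1).mean()       # each point of A to its nearest in B
    b_to_a = dist.min(axis=0).mean()       # each point of B to its nearest in A
    return float(0.5 * (a_to_b + b_to_a))

# Toy example: the second set is the first one shifted by 0.1 along z.
set_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
set_b = set_a + np.array([0.0, 0.0, 0.1])
toy_cd = chamfer_distance(set_a, set_b)
```

For point clouds of realistic size, a spatial index such as a k-d tree would replace the dense distance matrix.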

B.3 Ablation over the voxel grid resolution

The voxel grid resolution denotes the number of voxels contained in . We study the effect of voxel grid resolution by keeping all the other settings to be the same, as shown in Table R9. Increasing the voxel grid resolution from to and from to consistently results in lower Chamfer Distance (CD) with longer training time. However, the case with achieves a similar CD with , and requires heavier training cost. Thus, we take the number of voxels to be as the default setting in the other experiments.

Resolution CD (mean) Train time
0.79 13 mins
0.75 14 mins
0.72 16 mins
0.73 19 mins
Table R9: The effect of voxel grid resolution on reconstruction performance and training time. All cases are trained with the same settings except for the voxel grid resolution.
Figure S8: The two-stage training process of Voxurf. The number in vertical axis is calculated by for a better visualization.

B.4 Two-stage training process

Our method adopts a two-stage training pipeline. We show the Chamfer Distance curve and the visualization result at the end of each stage in Fig. S8. We observe that 1) a coherent shape is obtained by the end of Stage-1, although the performance is limited by the low resolution, which makes details hard to reconstruct; and 2) the fine details are recovered by the end of Stage-2, while the overall structure remains consistent with the coarse shape of Stage-1.

B.5 Threshold selection

To extract the surface from a trained DVGO (Sun et al., 2021) model, we first obtain the alpha value for any point in 3D space by density interpolation and activation. Fig. S9 then shows how we select a proper alpha threshold for surface extraction. A small threshold such as 0.001 or 0.01 usually results in noisy regions floating above the surface, while a large one such as 0.5 or 0.8 leads to an incomplete surface with large holes. We thus select 0.1 as the alpha threshold adopted in this paper.
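The thresholding step can be illustrated with a small NumPy sketch that converts a raw density grid to alpha values and compares the occupied volume under the thresholds discussed above. The activation used here, alpha = 1 - exp(-softplus(density + shift) * interval), follows our understanding of DVGO's post-activation and should be treated as an assumption, as should the shift and interval values, the function names, and the synthetic density field. The mesh extraction itself (for example, marching cubes on the alpha grid) is omitted.

```python
import numpy as np

def density_to_alpha(raw_density: np.ndarray, shift: float = 0.0,
                     interval: float = 0.5) -> np.ndarray:
    """Map raw densities to alpha values in [0, 1) (assumed DVGO-style activation)."""
    softplus = np.logaddexp(0.0, raw_density + shift)  # numerically stable softplus
    return 1.0 - np.exp(-softplus * interval)

def occupancy_from_alpha(alpha: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Boolean mask of voxels whose alpha exceeds the threshold."""
    return alpha > threshold

# Hypothetical density field: positive inside a sphere of radius 0.5, negative outside.
coords = np.linspace(-1.0, 1.0, 32)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
raw_density = 20.0 * (0.5 - np.sqrt(x**2 + y**2 + z**2))
alpha = density_to_alpha(raw_density)

# Fraction of voxels kept for each candidate threshold.
thresholds = (0.001, 0.01, 0.1, 0.5, 0.8)
fractions = [float(occupancy_from_alpha(alpha, t).mean()) for t in thresholds]
```

Lower thresholds keep more low-alpha voxels, which corresponds to the floating noise described above, while higher thresholds discard more of the volume and eventually open holes in the surface.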

Figure S9: Comparison of alpha thresholds for surface extraction from a trained DVGO (Sun et al., 2021) model. A large threshold leads to holes and an incomplete surface, while a small one leads to noise floating above the surface.

Appendix C Additional qualitative comparisons

Finally, we show the qualitative comparisons for novel view synthesis in Fig. S10 and Fig. S11, and we show additional surface reconstruction results in Fig. S12, Fig. S13, and Fig. S14.

Figure S10: Qualitative comparisons on DTU for novel view synthesis. (Part 1/2)
Figure S11: Qualitative comparisons on DTU for novel view synthesis. (Part 2/2)
Figure S12: Additional surface reconstruction comparisons on DTU. (Part 1/2)
Figure S13: Additional surface reconstruction comparisons on DTU. (Part 2/2)
Figure S14: Additional surface reconstruction comparisons on BlendedMVS.

References

  • Atzmon et al. (2019) Matan Atzmon, Niv Haim, Lior Yariv, Ofer Israelov, Haggai Maron, and Yaron Lipman. Controlling neural level sets. Advances in Neural Information Processing Systems, 32, 2019.
  • Chen et al. (2022) Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. Tensorf: Tensorial radiance fields. In European Conference on Computer Vision (ECCV), 2022.
  • Chen & Zhang (2019) Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  5939–5948, 2019.
  • Darmon et al. (2021) François Darmon, Bénédicte Bascle, Jean-Clément Devaux, Pascal Monasse, and Mathieu Aubry. Improving neural implicit surfaces geometry with patch warping. 2021. URL https://arxiv.org/2112.09648.
  • Jensen et al. (2014) Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  406–413, 2014.
  • Jiang et al. (2020) Yue Jiang, Dantong Ji, Zhizhong Han, and Matthias Zwicker. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • Kellnhofer et al. (2021) Petr Kellnhofer, Lars C Jebe, Andrew Jones, Ryan Spicer, Kari Pulli, and Gordon Wetzstein. Neural lumigraph rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4287–4297, 2021.
  • Liu et al. (2020) Shaohui Liu, Yinda Zhang, Songyou Peng, Boxin Shi, Marc Pollefeys, and Zhaopeng Cui. Dist: Rendering deep implicit signed distance function with differentiable sphere tracing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  2019–2028, 2020.
  • Lombardi et al. (2019) Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751, 2019.
  • Max (1995) Nelson Max. Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics, 1(2):99–108, 1995.
  • Mescheder et al. (2019) Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  4460–4470, 2019.
  • Mildenhall et al. (2020) Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In European conference on computer vision, pp.  405–421. Springer, 2020.
  • Müller et al. (2022) Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 2022.
  • Niemeyer et al. (2020) Michael Niemeyer, Lars Mescheder, Michael Oechsle, and Andreas Geiger. Differentiable volumetric rendering: Learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  3504–3515, 2020.
  • Oechsle et al. (2021) Michael Oechsle, Songyou Peng, and Andreas Geiger. Unisurf: Unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  5589–5599, 2021.
  • Park et al. (2019) Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  165–174, 2019.
  • Rudin & Osher (1994) Leonid I Rudin and Stanley Osher. Total variation based image restoration with free local constraints. In Proceedings of 1st international conference on image processing, volume 1, pp.  31–35. IEEE, 1994.
  • Saito et al. (2019) Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  2304–2314, 2019.
  • Schönberger et al. (2016) Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pp.  501–518. Springer, 2016.
  • Sitzmann et al. (2019) Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, 2019.
  • Sun et al. (2021) Cheng Sun, Min Sun, and Hwann-Tzong Chen. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. arXiv preprint arXiv:2111.11215, 2021.
  • Tancik et al. (2020) Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020.
  • Wang et al. (2021) Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. NeurIPS, 2021.
  • Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wizadwongsa et al. (2021) Suttisak Wizadwongsa, Pakkapon Phongthawee, Jiraphon Yenphraphai, and Supasorn Suwajanakorn. Nex: Real-time view synthesis with neural basis expansion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  8534–8543, 2021.
  • Xu et al. (2022) Qiangeng Xu, Zexiang Xu, Julien Philip, Sai Bi, Zhixin Shu, Kalyan Sunkavalli, and Ulrich Neumann. Point-nerf: Point-based neural radiance fields. arXiv preprint arXiv:2201.08845, 2022.
  • Yao et al. (2020) Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  1790–1799, 2020.
  • Yariv et al. (2020) Lior Yariv, Yoni Kasten, Dror Moran, Meirav Galun, Matan Atzmon, Ronen Basri, and Yaron Lipman. Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems, 33, 2020.
  • Yariv et al. (2021) Lior Yariv, Jiatao Gu, Yoni Kasten, and Yaron Lipman. Volume rendering of neural implicit surfaces. In Thirty-Fifth Conference on Neural Information Processing Systems, 2021.
  • Yu et al. (2021) Alex Yu, Sara Fridovich-Keil, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. arXiv preprint arXiv:2112.05131, 2021.
  • Zhang et al. (2021) Jingyang Zhang, Yao Yao, and Long Quan. Learning signed distance field for multi-view surface reconstruction. International Conference on Computer Vision (ICCV), 2021.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  586–595, 2018.