Polarimetric Inverse Rendering for Transparent Shapes Reconstruction

Mingqi Shao, Chongkun Xia, Dongxu Duan, Xueqian Wang

Abstract

In this work, we propose a novel method for the detailed reconstruction of transparent objects by exploiting polarimetric cues. Most existing methods lack sufficient constraints and suffer from over-smoothing. Hence, we introduce polarization information as a complementary cue. We implicitly represent the object’s geometry as a neural network, while a polarimetric renderer renders the object’s polarization images from the given shape and illumination configuration. Directly comparing the rendered polarization images with real-world captured images introduces additional errors because of the transmission in the transparent object. To address this issue, we introduce the concept of reflection percentage, which represents the proportion of the reflection component. The reflection percentage is calculated by a ray tracer and then used to weight the polarization loss. We build a polarization dataset for multi-view transparent shape reconstruction to verify our method. The experimental results show that our method is capable of recovering detailed shapes and improving the reconstruction quality of transparent objects. Our dataset and code will be publicly available at https://github.com/shaomq2187/TransPIR.

Tsinghua Shenzhen International Graduate School
{smq21,ddx21}@mails.tsinghua.edu.cn
{xiachongkun,wang.xq}@sz.tsinghua.edu.cn

Introduction

The acquisition of transparent 3D shapes has always been a challenge in computer vision since their visual appearance is determined by the light paths with both reflected and refracted light. The complex optical characteristic of transparent shapes makes the observation of the RGB camera or the commonly used depth camera contain significant error and leads to the poor performance of the traditional 3D reconstruction methods such as active camera scanning and multi-view stereo.

Recently, neural inverse rendering based on the neural implicit representation Michalkiewicz et al. (2019); Oechsle et al. (2021); Yariv et al. (2020, 2021) has achieved impressive performance in the task of learning 3D shapes from 2D images of opaque objects. However, the high complexity of transparent objects’ light paths couples the color of the transparent surface with its geometry, the environment light, and the viewing direction. This makes it difficult to apply neural inverse rendering methods that exploit the photometric consistency constraint, e.g., IDR Yariv et al. (2020), to transparent shapes. In addition, shapes reconstructed with only the silhouette constraint suffer from over-smoothing. Li et al. (2020) proposed a physically-based neural network to handle the complex optical characteristics of transparent shapes under natural lighting conditions; however, it still suffers from the same over-smoothing problem. Therefore, additional cues need to be introduced to recover detailed shapes.

Figure 1: Our reconstruction results and their transparent renderings

Researchers have introduced the ray-ray correspondence between camera rays and the rays from a background pattern as a light-path constraint that serves as a cue for detailed shape reconstruction Kutulakos and Steger (2008); Tsai et al. (2015); Qian et al. (2016); Wu et al. (2018); Lyu et al. (2020). However, the ray-ray correspondence generally requires strict calibration and precise control.

As a passive imaging principle with weak assumptions about lighting conditions, polarimetric cues perform well on many tasks Ba et al. (2020); Zhao et al. (2020); Deschaintre et al. (2021), since they provide information about light in a dimension beyond intensity. The polarization state of the light reflected from the object’s surface encodes the azimuth and zenith angles of the surface normal. Therefore, this paper introduces polarization information as an additional cue for transparent shape reconstruction.

Due to the complex light interactions of transparent objects, the useful polarization state in the directly reflected light, encoding the normal vector of the surface, is frequently disturbed by the transmitted light. To reduce the negative contribution of severely disturbed areas’ polarization state to the reconstructed shape, we employ ray tracing to trace the proportion of reflected light at each point and apply it as the weight of the polarimetric cue.

In this paper, the neural implicit representation is used to represent the object’s geometry. The polarization maps at each view are rendered through a differentiable polarimetric renderer, and the reflection percentage is calculated through a ray tracer, which is used as the weight of the polarization loss. Fig.1 shows the reconstruction results of our method. Experimental results illustrate that our method can reconstruct detailed transparent shapes. We summarize our contributions as follows:

A polarimetric inverse rendering framework for transparent shape reconstruction from multi-view polarization images.
A weighted polarization loss that uses the reflection percentage so that the polarimetric cues can be used effectively.
Detailed reconstruction results for different transparent objects with complex and irregular shapes.
The first polarization dataset for multi-view transparent shapes reconstruction.

Related Work

Transparent Shapes Reconstruction

The main challenge of transparent shape reconstruction is that the observation of the surface is interfered with by light transmitted from the background. Researchers exploit this property by placing known patterns behind transparent objects to obtain the correspondence between the rays emitted from the transparent objects and the rays from the background patterns Kutulakos and Steger (2008); Wu et al. (2018); Lyu et al. (2020). In contrast to methods that require strict control of the settings to obtain accurate ray correspondences, our method only needs a dark background and uniform light intensity in the directions facing the transparent object.

Recently, data-driven approaches have shown advantages for shape reconstruction of transparent objects. Li et al. (2020) propose a neural 3D reconstruction framework for transparent objects, which simulates light transport within transparent objects through a physically-based rendering layer and uses a pre-trained network for point cloud reconstruction. However, the gap between synthetic and real-world data and the insufficient constraint provided by RGB information lead to over-smoothed reconstructed shapes Xu et al. (2022). Additional information is needed to strengthen the constraint.

Polarization information is an important cue for transparent shape estimation Miyazaki et al. (2004); Mingqi et al. (2022) because the polarization state of the light reflected from an object’s surface encodes the surface normal Atkinson and Hancock (2006). However, methods that rely purely on polarization information usually suffer from large errors, since the reflected light, which encodes the surface normal vector, is generally mixed with other light components. Hence, unlike these methods, our method uses polarization information as a cue for multi-view reconstruction to provide auxiliary constraints.

Figure 2: Overview of our method. The silhouette loss is used as supervision of initial shape reconstructing, then the polarimetric render and ray tracer calculate the weighted polarization loss for detailed shape optimization

Polarimetric Inverse Rendering

The polarization state of light reflected from an object surface encodes the surface normal and is often used for surface estimation Miyazaki and Ikeuchi (2005); Zou et al. (2020); Ba et al. (2020), reflection removal Lei et al. (2020), radiance decomposition Deschaintre et al. (2021); Dave et al. (2022), and multi-view stereo enhancement Cui et al. (2017); Yang et al. (2018); Ding et al. (2021). Miyazaki and Ikeuchi (2005) propose inverse polarization ray tracing for transparent shape reconstruction, but the assumption that the background shape is known limits its application. Zhao et al. (2020) propose a polarimetric multi-view inverse rendering framework for 3D reconstruction. This framework uses polarization information to optimize each vertex of the initial model generated from structure-from-motion in order to reconstruct a detailed shape. Our key idea is similar to Zhao et al. (2020), but for transparent surfaces the reliability of the polarization information at each surface point must also be assessed.

Method

Overview

Our goal is to recover the transparent shape from multi-view 2D images by exploiting polarimetric cues. The pipeline of our method is shown in Fig.2. Instead of starting with the space carving method Kutulakos and Seitz (2000), we adopt the neural implicit representation of IDR Yariv et al. (2020) to produce a smooth initial shape. The polarimetric renderer renders the AoLP maps of the different views, and the rendered AoLP maps are compared with the captured AoLP maps to obtain the polarization loss. Since the polarization information is reliable only at points with a high proportion of the reflection component, the ray tracer traces the reflection percentage of each point to weight the polarization loss. Afterward, the weighted polarization loss guides the optimization to produce the final detailed shape.

Implicit Surface Representation

A signed distance function (SDF) is a continuous function of spatial position: for a given position $x \in \mathbb{R}^{3}$ , the SDF outputs the signed distance $d \in \mathbb{R}$ to the closest point on the surface.

f : \mathbb{R}^{3} \to \mathbb{R}, \quad x \mapsto d = f (x)

(1)

The distance $d$ is positive when $x$ is inside the boundary, negative outside, and zero when $x$ is on the boundary.

Similar to IDR Yariv et al. (2020), we represent the transparent objects’ geometry as a neural network (MLP) $f_{θ}$ with learnable parameters $θ$ and optimize $f_{θ} (x)$ to the object’s ground-truth SDF $f (x)$ :

\mathrm{opt} : f_{θ} (x) \to f (x)

(2)

The normal $\hat{n}_{θ} (x)$ of the surface represented by the MLP-based SDF $f_{θ} (x)$ can be expressed as follows:

\hat{n}_{θ} (x) = \nabla_{x} f_{θ} (x) / ∥ \nabla_{x} f_{θ} (x) ∥_{2}

(3)

The derivative of $f_{θ} (x)$ can be easily acquired from the automatic differentiation mechanism.

The SDF representation has the benefits of being able to represent smooth surfaces and arbitrary topologies. With the SDF representation and the supervision of silhouettes, the transparent object’s initial shape can be obtained.
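
As a small illustration of Eq.1 and Eq.3, the sketch below evaluates a toy analytic SDF and its normalized gradient. It is not the MLP $f_{θ}$ of this paper: `sphere_sdf` and `unit_gradient` are hypothetical helpers, and central finite differences are swapped in for automatic differentiation so that the example needs only the Python standard library.

```python
import math

def sphere_sdf(x, radius=1.0):
    """Toy SDF of a sphere centred at the origin.

    Follows the sign convention stated above: positive inside,
    negative outside, zero on the surface.
    """
    return radius - math.sqrt(sum(c * c for c in x))

def unit_gradient(f, x, h=1e-5):
    """Approximate Eq.3: the normalized gradient of f at x.

    Central finite differences stand in for automatic differentiation.
    """
    grad = []
    for k in range(3):
        xp = list(x)
        xm = list(x)
        xp[k] += h
        xm[k] -= h
        grad.append((f(xp) - f(xm)) / (2.0 * h))
    norm = math.sqrt(sum(g * g for g in grad))
    return [g / norm for g in grad]

point = [0.0, 0.6, 0.8]                      # lies on the unit sphere
normal = unit_gradient(sphere_sdf, point)    # unit-length vector along the gradient
```

For this toy function and sign convention the normalized gradient points toward the interior of the sphere, so it would have to be negated if an outward-facing normal were required.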

Polarimetric Rendering

In this paper, we use the Mueller calculus Clarke (2009) to calculate the polarization state of the specular reflection from the transparent surface. In the Mueller calculus, the full polarization state of light is represented by the Stokes vector $s = [s_{0}, s_{1}, s_{2}, s_{3}]^{T}$ , where $s_{0}$ represents the light intensity, $s_{1}$ and $s_{2}$ denote the linear polarization components along the $x$ -axis and $45^{\circ}$ directions, and $s_{3}$ represents the right circular polarization component. The three polarimetric cues, intensity $I$ , degree of linear polarization (DoLP) $ρ$ , and angle of linear polarization (AoLP) $ψ$ , can be expressed in terms of the Stokes vector:

I = s_{0}

(4)

ρ = \frac{\sqrt{s_{1}^{2} + s_{2}^{2}}}{s_{0}}

(5)

ψ = \frac{1}{2} {tan}^{- 1} (\frac{s_{2}}{s_{1}})

(6)
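
The three cues of Eq.4, Eq.5 and Eq.6 can be computed directly from a Stokes vector. The following is a minimal sketch; the function name `polarimetric_cues` is hypothetical, and the use of `atan2` instead of a plain arctangent of the ratio is an assumption made here to handle $s_{1} = 0$ and to resolve the quadrant.

```python
import math

def polarimetric_cues(s):
    """Return (intensity, DoLP, AoLP) for a Stokes vector s = [s0, s1, s2, s3]."""
    s0, s1, s2 = s[0], s[1], s[2]
    intensity = s0                                  # Eq.4
    dolp = math.sqrt(s1 * s1 + s2 * s2) / s0        # Eq.5, assumes s0 > 0
    # Eq.6; atan2 is an assumption that also covers the case s1 = 0.
    aolp = 0.5 * math.atan2(s2, s1)
    return intensity, dolp, aolp

# Partially polarized light whose linear component lies along the 45 degree direction.
intensity, dolp, aolp = polarimetric_cues([2.0, 0.0, 1.0, 0.0])
```

With `atan2`, the returned AoLP lies in $(-π/2, π/2]$; any other choice of branch differs from it by a multiple of $π/2$ and would have to be matched to the convention of the captured AoLP maps.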

Figure 3: Reflection and transmission at the interface of different refractive index media

From Fresnel’s equations, the amplitude reflection coefficients $r_{s}, r_{p}$ for the components perpendicular and parallel to the plane of incidence, respectively, can be written as follows Clarke (2009):

r_{s} = \frac{η_{i} cos χ_{i} - η_{t} cos χ_{t}}{η_{i} cos χ_{i} + η_{t} cos χ_{t}} = - \frac{sin (χ_{i} - χ_{t})}{sin (χ_{i} + χ_{t})}

(7)

r_{p} = \frac{η_{i} cos χ_{t} - η_{t} cos χ_{i}}{η_{t} cos χ_{i} + η_{i} cos χ_{t}} = - \frac{tan (χ_{i} - χ_{t})}{tan (χ_{i} + χ_{t})}

(8)
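
Eq.7 and Eq.8 can be evaluated numerically as sketched below. The refraction angle $χ_{t}$ is obtained from Snell's law, the refractive indices 1.0 and 1.5 are assumed values for air and glass, and `fresnel_amplitudes` is a hypothetical helper name.

```python
import math

def fresnel_amplitudes(eta_i, eta_t, chi_i):
    """Amplitude reflection coefficients (r_s, r_p) of Eq.7 and Eq.8.

    chi_i is the incidence angle in radians; the refraction angle chi_t
    comes from Snell's law. Total internal reflection is not handled.
    """
    sin_t = eta_i * math.sin(chi_i) / eta_t
    if abs(sin_t) > 1.0:
        raise ValueError("total internal reflection: no refracted ray")
    chi_t = math.asin(sin_t)
    cos_i, cos_t = math.cos(chi_i), math.cos(chi_t)
    r_s = (eta_i * cos_i - eta_t * cos_t) / (eta_i * cos_i + eta_t * cos_t)
    r_p = (eta_i * cos_t - eta_t * cos_i) / (eta_t * cos_i + eta_i * cos_t)
    return r_s, r_p

ETA_AIR, ETA_GLASS = 1.0, 1.5   # assumed refractive indices
r_s_normal, r_p_normal = fresnel_amplitudes(ETA_AIR, ETA_GLASS, 0.0)
brewster = math.atan(ETA_GLASS / ETA_AIR)
r_s_brewster, r_p_brewster = fresnel_amplitudes(ETA_AIR, ETA_GLASS, brewster)
```

At the Brewster angle $r_{p}$ vanishes, so the reflected light is fully linearly polarized, which is the effect that makes the reflected component informative about the surface normal.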

As shown in Fig.3, when the Stokes vectors $s_{i}, s_{r}$ of the incident light and the reflected light are expressed in the same coordinate system, the transformation between $s_{i}$ and $s_{r}$ can be represented by a Mueller matrix $M_{r}$ , that is,

s_{r} = M_{r} s_{i}

(9)

M_{r} = \frac{1}{2} \begin{bmatrix} r_{s}^{2} + r_{p}^{2} & r_{s}^{2} - r_{p}^{2} & 0 & 0 \\ r_{s}^{2} - r_{p}^{2} & r_{s}^{2} + r_{p}^{2} & 0 & 0 \\ 0 & 0 & 2 r_{s} r_{p} & 0 \\ 0 & 0 & 0 & 2 r_{s} r_{p} \end{bmatrix}

(10)
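
A minimal sketch of Eq.9 and Eq.10 follows. It builds the reflection Mueller matrix from given amplitude coefficients and applies it to a Stokes vector; `reflection_mueller` and `apply_mueller` are hypothetical names, and the numeric coefficients are illustrative values rather than measurements.

```python
def reflection_mueller(r_s, r_p):
    """Mueller matrix M_r of Eq.10 for amplitude coefficients r_s, r_p."""
    a = 0.5 * (r_s * r_s + r_p * r_p)
    b = 0.5 * (r_s * r_s - r_p * r_p)
    c = r_s * r_p
    return [
        [a, b, 0.0, 0.0],
        [b, a, 0.0, 0.0],
        [0.0, 0.0, c, 0.0],
        [0.0, 0.0, 0.0, c],
    ]

def apply_mueller(m, s):
    """Matrix-vector product s_r = M s (Eq.9)."""
    return [sum(m[r][k] * s[k] for k in range(4)) for r in range(4)]

UNPOLARIZED = [1.0, 0.0, 0.0, 0.0]
# Normal incidence from air (1.0) to glass (1.5) gives r_s = r_p = -0.2.
s_normal = apply_mueller(reflection_mueller(-0.2, -0.2), UNPOLARIZED)
# Illustrative oblique case with |r_s| > |r_p|.
s_oblique = apply_mueller(reflection_mueller(-0.5, 0.1), UNPOLARIZED)
```

For unpolarized incident light the reflected Stokes vector is $[a, b, 0, 0]$, so the reflection stays unpolarized at normal incidence ($b = 0$) and becomes partially linearly polarized as $r_{s}^{2}$ and $r_{p}^{2}$ separate.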

Figure 4: Schematic of polarimetric rendering in our method. The initial Stokes vector from the light source is observed by the camera after four transformations: three coordinate transformations $R_{i}, R_{o}, R_{c}$ , and one polarization state transformation $M_{r}$ of the reflection

Fig.4 shows the schematic of the polarimetric rendering used in our method. Coordinate frame transformations are required since the Stokes vectors $s_{0}$ and $s_{c}$ are defined in different reference frames. We adopt frame transformations similar to those of Mitsuba 2 Nimier-David et al. (2019), and detailed expressions of $R_{i} (v_{i}; \hat{n}_{θ})$ , $R_{o} (v_{o}; \hat{n}_{θ})$ , $R_{c} (v_{o})$ are presented in the supplementary material. Finally, the Stokes vector $s_{c}$ captured by the polarization camera can be written as a sequence of linear operations:

s_{c} = R_{c} R_{o} M_{r} R_{i} s_{0} = [s_{c 0}, s_{c 1}, s_{c 2}, s_{c 3}]^{T}

(11)

From Eq.6, the rendered AoLP $\hat{ψ}$ is,

\hat{ψ} = \frac{1}{2} {tan}^{- 1} (\frac{s_{c 2}}{s_{c 1}})

(12)

The polarimetric renderer in Fig.2 renders the AoLP maps through Eq.11 and Eq.12. The rendered maps are then compared with the real-world captured AoLP maps to calculate the polarization loss, which is described in detail in the Optimization subsection.
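
The chain of Eq.11 followed by Eq.12 can be sketched as below. The exact expressions of $R_{i}, R_{o}, R_{c}$ are given only in the supplementary material, so this example assumes the standard Stokes rotator form and takes the three frame angles `alpha_i`, `alpha_o`, `alpha_c` as hypothetical inputs instead of deriving them from $v_{i}$, $v_{o}$ and the normal. The use of `atan2` is likewise an assumption.

```python
import math

def stokes_rotator(alpha):
    """Mueller matrix that re-expresses a Stokes vector in a frame rotated by alpha."""
    c, s = math.cos(2.0 * alpha), math.sin(2.0 * alpha)
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, c, s, 0.0],
        [0.0, -s, c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def matvec(m, v):
    """4x4 matrix times 4-vector."""
    return [sum(m[r][k] * v[k] for k in range(4)) for r in range(4)]

def render_aolp(s0, m_r, alpha_i, alpha_o, alpha_c):
    """Eq.11 followed by Eq.12, with hypothetical frame angles alpha_*."""
    s = matvec(stokes_rotator(alpha_i), s0)      # R_i: light frame -> incident frame
    s = matvec(m_r, s)                           # M_r: specular reflection
    s = matvec(stokes_rotator(alpha_o), s)       # R_o: reflection frame -> outgoing frame
    s = matvec(stokes_rotator(alpha_c), s)       # R_c: outgoing frame -> camera frame
    return 0.5 * math.atan2(s[2], s[1])          # Eq.12 (atan2 is an assumption)

# Reflection matrix of Eq.10 for the illustrative values r_s = -0.5, r_p = 0.1.
M_R = [
    [0.13, 0.12, 0.0, 0.0],
    [0.12, 0.13, 0.0, 0.0],
    [0.0, 0.0, -0.05, 0.0],
    [0.0, 0.0, 0.0, -0.05],
]
aolp = render_aolp([1.0, 0.0, 0.0, 0.0], M_R, 0.3, 0.1, 0.2)
```

With unpolarized illumination the first rotation has no effect, and the two rotations after the reflection simply shift the AoLP by the sum of their angles. The rendered AoLP therefore depends on the surface normal through the orientation of the plane of incidence relative to the camera frame.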

Figure 5: Ray tracing procedure. For the given shape, the illumination configuration, and the viewing direction $- v_{0}$ , the radiance of reflected and transmitted rays $I_{r 0}, I_{t 0}$ along the $v_{0}$ direction can be calculated from Fresnel’s laws

Reflection Percentage Estimation

The polarimetric renderer employs the Fresnel specular reflection model to render the AoLP maps for the given shape. However, in some areas, the rendered AoLP map has a large error compared with the AoLP map captured in the real world, as shown in Fig.6. This error is mainly caused by high transmission rather than by a shape difference. The observed light from the transparent surface consists of two components: the light directly reflected at the surface (the specular reflection component) and the light transmitted from the inside (the transmission component). The AoLP of the specular component depends only on the normal at the intersection point on the surface, while the AoLP of the transmission component depends on the normals at all intersection points along its transmission path. Therefore, the higher the proportion of specular reflection, the more reliable the captured AoLP. To quantify and reduce the resulting error, we use ray tracing to calculate the reflection percentage of each point on the transparent surface.

Figure 6: (a) The AoLP map rendered by the polarimetric renderer. (b) Real-world captured AoLP map. (c) The reflection percentage map rendered by the ray tracer. Areas with a low reflection percentage in the rendered AoLP map show a large error compared to the real-world captured AoLP map

We implement a 2-bounce ray tracer to estimate the radiance of the reflection and transmission components on the transparent surface. Fig.5 illustrates the ray tracing procedure. The rays in Fig.5 flow from the light source to the camera; our actual implementation traces them in the reverse direction. Given the shape, the illumination configuration, and the camera viewing direction $- v_{0}$ , the directions of the other rays can be derived from the law of reflection and Snell’s law of refraction. In each interaction, the energy of the incident light is distributed according to the Fresnel term $F$ Born and Wolf (2013),

F_{η_{i}, η_{t}}^{v_{i}, v_{t}, n} = \frac{1}{2} (\frac{η_{i} v_{i} \cdot n - η_{t} v_{t} \cdot n}{η_{i} v_{i} \cdot n + η_{t} v_{t} \cdot n})^{2} + \frac{1}{2} (\frac{η_{t} v_{i} \cdot n - η_{i} v_{t} \cdot n}{η_{t} v_{i} \cdot n + η_{i} v_{t} \cdot n})^{2}

(13)

I_{r} = F_{η_{i}, η_{t}}^{v_{i}, v_{t}, n} I_{i}

(14)

I_{t} = (1 - F_{η_{i}, η_{t}}^{v_{i}, v_{t}, n}) I_{i}

(15)

where $I_{i}, I_{r}, I_{t}$ are the intensity of incident, reflected and refracted rays, $v_{i}, v_{r}, v_{t}$ represent their directions, respectively. $n$ is the normal vector.
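
The energy split of Eq.13, Eq.14 and Eq.15 can be sketched as follows. The helper names are hypothetical, and the dot products $v_{i} \cdot n$ and $v_{t} \cdot n$ are passed in as positive cosines, which is an assumption about how the vectors are oriented.

```python
def fresnel_term(eta_i, eta_t, cos_i, cos_t):
    """Unpolarized Fresnel reflectance of Eq.13.

    cos_i and cos_t stand for the dot products v_i . n and v_t . n,
    taken here as positive cosines (an assumption about orientation).
    """
    r_perp = (eta_i * cos_i - eta_t * cos_t) / (eta_i * cos_i + eta_t * cos_t)
    r_par = (eta_t * cos_i - eta_i * cos_t) / (eta_t * cos_i + eta_i * cos_t)
    return 0.5 * r_perp ** 2 + 0.5 * r_par ** 2

def split_energy(intensity_in, fresnel):
    """Eq.14 and Eq.15: reflected and transmitted intensities."""
    return fresnel * intensity_in, (1.0 - fresnel) * intensity_in

# Normal incidence from air (assumed index 1.0) into glass (assumed index 1.5).
F_NORMAL = fresnel_term(1.0, 1.5, 1.0, 1.0)
i_reflected, i_transmitted = split_energy(1.0, F_NORMAL)
```

At normal incidence only about 4% of the energy is reflected for these indices, which is why the reflected component can easily be dominated by transmitted light.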

Using Eq.13, 14, 15, we can calculate the intensities of the transmitted ray and the reflected ray, $I_{t 0}$ and $I_{r 0}$ , observed by the camera in Fig.5:

I_{t 1} = (1 - F_{η_{air}, η_{glass}}^{v_{t 2}, v_{t 1}, \hat{n}_{2}}) I_{t 2}

(16)

I_{t 0} = (1 - F_{η_{glass}, η_{air}}^{v_{t 1}, v_{0}, \hat{n}_{1}}) I_{t 1}

(17)

I_{r 0} = F_{η_{air}, η_{glass}}^{v_{r 1}, v_{0}, \hat{n}_{1}} I_{r 1}

(18)

where the values of $I_{t 2}, I_{r 1}$ are sampled from the illumination configuration, which depends on our dataset acquisition setup and is presented in detail in the supplementary material. Because of total internal reflection, $I_{t 1}$ cannot be calculated by the 2-bounce ray tracer for some rays. In this case, we set $I_{t 1}$ to a constant to avoid a 100% reflection percentage, which is impossible in the real world.

Finally, the reflection percentage $w$ in this paper is defined as the ratio of the reflection intensity to total intensity:

w = \frac{I_{r 0}}{I_{r 0} + I_{t 0}}

(19)
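
Eq.16 to Eq.19 combine into a short per-ray computation, sketched below. The three Fresnel terms are passed in as precomputed numbers, the function and parameter names are hypothetical, and the fallback value `i_t1_fallback = 0.1` is a placeholder: the text only states that $I_{t 1}$ is set to a constant under total internal reflection, not which constant.

```python
def reflection_percentage(f_entry, f_exit, f_reflect, i_t2, i_r1,
                          total_internal_reflection=False, i_t1_fallback=0.1):
    """Reflection percentage w of Eq.19 for one camera ray.

    f_entry, f_exit, f_reflect are the Fresnel terms appearing in
    Eq.16, Eq.17 and Eq.18; i_t2 and i_r1 are the sampled light intensities.
    """
    if total_internal_reflection:
        i_t1 = i_t1_fallback                 # hypothetical constant, see lead-in
    else:
        i_t1 = (1.0 - f_entry) * i_t2        # Eq.16
    i_t0 = (1.0 - f_exit) * i_t1             # Eq.17
    i_r0 = f_reflect * i_r1                  # Eq.18
    return i_r0 / (i_r0 + i_t0)              # Eq.19

# Equal light on both paths: transmission dominates, so w is small.
w_bright_back = reflection_percentage(0.04, 0.04, 0.04, i_t2=1.0, i_r1=1.0)
# No light arriving along the transmitted path: only reflection is observed.
w_dark_back = reflection_percentage(0.04, 0.04, 0.04, i_t2=0.0, i_r1=1.0)
# Total internal reflection: the fallback keeps w strictly below 1.
w_tir = reflection_percentage(0.04, 0.04, 0.04, i_t2=0.0, i_r1=1.0,
                              total_internal_reflection=True)
```

The second case illustrates why a dark background and a light source on the camera side raise the reflection percentage.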

As shown in Fig.6, the real-world captured AoLP map (Fig.6(b)) has obvious demarcation in some areas compared to the rendered AoLP map (Fig.6(a)) since the proportion of reflection component changes. This change is consistent with the reflection percentage map(Fig.6(c)) rendered by the ray tracer, which further illustrates the importance of the reflection percentage. Otherwise, the error caused by the smaller reflection proportion will guide the optimization to a wrong shape.

Optimization

We minimize the following loss function to optimize the MLP-based SDF $f_{θ} (x)$ to the object’s ground-truth SDF $f (x)$ :

L_{net} = L_{sil} + λ_{sdf} L_{sdf} + λ_{pol} L_{pol}

(20)

where $L_{sil}, L_{sdf}, L_{pol}$ represent the silhouette loss, the SDF regularization loss term, and the weighted polarization loss term. The default values of $λ_{sdf}$ and $λ_{pol}$ are $0.1$ and $0.4$ , respectively.

Silhouette Loss.

We adopt the same silhouette loss as in IDRYariv et al. (2020) for shape optimization supervision, which plays an important role in initial shape reconstruction.

SDF Regularization Loss.

To encourage $f_{θ} (x)$ to approximate a signed distance function, we add a regularization loss, i.e., the Eikonal regularization Gropp et al. (2020):

L_{sdf} = \frac{1}{∥ P ∥_{1}} \sum_{p \in P} (∥ \nabla f_{θ} (p) ∥_{2} - 1)^{2}

(21)

where $P$ represents the set of intersection points of sampled rays in the mini-batch with the surface.
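
Eq.21 penalizes gradients whose norm differs from one. A minimal sketch, assuming the gradients at the sampled points have already been computed and are passed in as plain lists (the name `eikonal_loss` is hypothetical):

```python
import math

def eikonal_loss(gradients):
    """Eq.21: mean squared deviation of the gradient norm from 1."""
    total = 0.0
    for g in gradients:
        norm = math.sqrt(sum(c * c for c in g))
        total += (norm - 1.0) ** 2
    return total / len(gradients)

# One gradient with unit norm (no penalty) and one with norm 2 (penalty 1).
loss_sdf = eikonal_loss([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
```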

Weighted Polarization Loss.

We use polarimetric cues to calculate polarization loss to guide the optimization. The weighted polarization loss for each sampled ray is defined as follows:

L_{pol}^{p} = w_{p} ∥ \hat{ψ}_{p} - ψ_{p} ∥_{1}, \quad p \in P

(22)

where $w_{p}$ is the reflection percentage of the point $p$ , $\hat{ψ}_{p}$ represents the rendered AoLP, and $ψ_{p}$ denotes the real-world captured AoLP. As mentioned before, the reflection percentage is related to the reliability of $ψ_{p}$ , hence we employ it as the weight for the error $∥ \hat{ψ}_{p} - ψ_{p} ∥_{1}$ .

We assume that the difference of the normal between the initial shape and the ground-truth shape is smaller than $ε$ since the supervision of $L_{sil}$ can produce a good initial shape. With this assumption, we can clip the excessive polarization loss to avoid guiding the optimization to the wrong shape:

{L_{pol}^{p}}^{'} = \begin{cases} L_{pol}^{p}, & ∥ \hat{ψ}_{p} - ψ_{p} ∥_{1} \le ε \\ 0, & \text{otherwise} \end{cases}

(23)

L_{pol} = \frac{1}{∥ P ∥_{1}} \sum_{p \in P} {L_{pol}^{p}}^{'}

(24)

where the default value of $ε$ in this paper is $π / 6$ .
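
Eq.22 to Eq.24 can be sketched as a single loop over the sampled points. The sample format, a list of `(w_p, rendered_aolp, captured_aolp)` tuples, and the function name are assumptions made for this example. The error is the plain absolute difference of Eq.22, with no wrap-around handling for the periodicity of the AoLP.

```python
import math

EPSILON = math.pi / 6.0   # default clipping threshold stated above

def weighted_polarization_loss(samples, epsilon=EPSILON):
    """samples: list of (w_p, rendered_aolp, captured_aolp) tuples."""
    total = 0.0
    for w, psi_hat, psi in samples:
        error = abs(psi_hat - psi)
        if error <= epsilon:          # Eq.23: keep only errors within epsilon
            total += w * error        # Eq.22: weight by the reflection percentage
    return total / len(samples)       # Eq.24: average over all sampled points

loss_pol = weighted_polarization_loss([
    (0.8, 0.10, 0.00),   # kept, contributes 0.8 * 0.1
    (0.5, 1.00, 0.00),   # error above pi / 6, clipped to zero
    (0.2, 0.30, 0.50),   # kept, contributes 0.2 * 0.2
])
```

Clipped points still count in the denominator, following Eq.24, so a region dominated by unreliable points contributes little to the gradient rather than being renormalized.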

Experiments

Dataset and Metrics

We build a dataset containing $4$ transparent objects (CAT, FROG, ELEPHANT, SQUIRREL) to verify our method, since there is no public polarization dataset for multi-view reconstruction of transparent objects. Our dataset acquisition setup is shown in Fig.7. A DALSA G3-GM14-M2450 polarization camera is used as the capture device. For each object, we take $34$ polarization images from uniformly distributed views. The camera pose is accurately obtained from a commercial UR5 robot. To increase the proportion of the reflection component, a diffuse sphere is employed as the light source. The light source is fixed to the camera to ensure uniform and strong reflections in any pose, thereby increasing the amount of usable polarization information.

Figure 7: Setup for acquiring our dataset

We adopt the chamfer distance (CD) and the chamfer normal angle (CDN) as metrics. They are calculated by uniformly sampling $10000$ points from the ground-truth and reconstructed shapes, and the values reported in this section are summed over all sampled points. For convenience, we normalize the camera poses so that the reconstructed models all lie within a unit sphere, and all evaluations are performed on this basis.
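
For orientation, a summed chamfer distance between two sampled point sets can be sketched as below. The exact definition used for the reported numbers is not spelled out here, so this example assumes a symmetric sum of unsquared Euclidean nearest-neighbour distances. The brute-force search is only meant for small inputs, and `chamfer_distance` is a hypothetical name.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance, summed over all points (assumed definition)."""
    def one_sided(src, dst):
        # For each source point, add the distance to its nearest destination point.
        return sum(min(math.dist(p, q) for q in dst) for p in src)
    return one_sided(points_a, points_b) + one_sided(points_b, points_a)

reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reconstruction = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
cd = chamfer_distance(reference, reconstruction)
```

Because the values are summed rather than averaged, the magnitude of such a metric scales with the number of sampled points.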

Method	CAT		FROG		ELEPHANT		SQUIRREL
Method	CD	CDN	CD	CDN	CD	CDN	CD	CDN
VH	18.96	3143.56	34.06	2752.86	24.26	3810.97	14.76	3081.47
IDR	9.82	978.79	18.99	1152.16	12.52	1579.04	13.94	1691.07
Ours	8.97	744.05	17.22	1088.37	11.31	1510.27	13.35	1476.51

Table 1: Quantitative results of the comparisons with baselines

Comparisons with Baselines

We compare our method with two 3D reconstruction methods, IDR Yariv et al. (2020) and visual hull (space carving), to verify the effectiveness of our method.

IDR. IDR is one of the state-of-the-art methods for inverse rendering reconstruction using an MLP-based implicit representation. The renderer in IDR is only suitable for opaque objects; its differentiable renderer diverges when applied to transparent objects. Hence, we remove the RGB loss term in IDR and use only the silhouette loss and the regularization loss. All other hyperparameters are the same as in the original IDR.

Visual Hull (VH). Visual hull, or space carving, is a traditional algorithm for 3D reconstruction and is usually used to reconstruct an initial shape. We use the visual hull code from Li et al. (2020) to compute the visual hull. Since our camera poses are already normalized, we limit the visual hull to $[- 1, 1]^{3}$ and set the spatial resolution to $256$ .

Fig.8 visualizes the comparisons. Because of its limited resolution, the visual hull can only reconstruct the rough outline of the object, and its reconstructed surfaces are rough and lack details. In contrast, IDR exhibits the advantage of the MLP-based implicit representation, which can produce watertight surfaces. However, with only silhouette supervision, the reconstruction results of IDR also lack object details. Our method introduces polarimetric cues as a complement, hence it produces more detailed results than IDR, such as the eyes of the objects. Table.1 lists the quantitative results of our comparisons, and our method achieves the best performance on each object.

Figure 8: Visualization of the reconstruction results compared with baselines

Ablation Studies

We conduct ablation experiments on the important parts of our method, including the loss terms $L_{pol}$ , $L_{sdf}$ , and the reflection percentage $w$ . In each ablation, only the weight of the studied module is set to zero and all other parameters are kept the same. The CAT object is selected for the ablation studies. The ablation results for $L_{pol}$ have already been shown in Fig.8 and Table.1 in the previous subsection, i.e., the comparison between ours and IDR: the loss term $L_{pol}$ improves the reconstruction quality by supplying information about the shape’s details.

Metric	W/o $L_{pol}$	W/o $L_{sdf}$	W/o $w$	Full
CD	9.82	26.81	9.51	8.97
CDN	978.79	1788.01	1017.79	744.05

Table 2: Quantitative results of the ablation studies

Fig.9 shows the results of the ablation study on the SDF loss term $L_{sdf}$ . After removing $L_{sdf}$ , obvious contour lines and hollows appear on the reconstructed surface. $L_{sdf}$ constrains the gradient of the MLP that implicitly represents the shape to have unit norm, which ensures that the reconstructed shape is smooth and realistic. Therefore, when the SDF loss term is omitted, the MLP will have large gradients in some spatial areas, resulting in holes in the reconstructed shape.

Fig.10 presents the results with and without the reflection percentage $w$ . It shows that, without $w$ , the polarimetric cues misguide the shape reconstruction, especially in the folded areas. The high proportion of the transmission component in these areas couples the observed polarization state with all the interaction points along the transmitted light path. Calculating the loss directly with the rendered polarization images will lead to an erroneous shape, which is the reason for introducing the reflection percentage to weight the polarization loss.

Table.2 lists all the results of our ablation studies. The results illustrate that all the parts in our method are essential to the quality of the reconstructed shape.

Figure 9: Ablation study of the SDF loss term

Figure 10: Ablation study of the reflection percentage

Different Number of Views.

We uniformly sample $10$ and $20$ views from the $34$ views and compare our method with IDR without polarimetric cues. The visualization and quantitative results for the different numbers of views are summarized in Fig.11 and Table.3, respectively. Our method is able to reconstruct the detailed shape from $20$ views, although with some noise in the head region, and the quantitative results also show that the difference in chamfer distance between $20$ views and $34$ views is small. The details of the shape reconstructed from $10$ views are significantly reduced because fewer polarimetric cues are available. Nevertheless, our method outperforms IDR without polarimetric cues for every number of views.

Figure 11: Reconstruction results under the different number of views. The shapes listed in the two rows are the results without/with polarimetric cues, respectively

Method	10 Views		20 Views		34 Views
Method	CD	CDN	CD	CDN	CD	CDN
IDR	11.75	1259.79	11.61	1371.06	9.82	978.79
Ours	9.89	1009.11	9.29	934.67	8.97	744.05

Table 3: Quantitative results of the reconstructed shapes under the different number of views

Discussion

Conclusion. In this paper, we propose a polarimetric inverse rendering framework for transparent shape reconstruction from multi-view polarization images. We employ an implicit neural representation for the object’s geometry, which is rendered by the polarimetric renderer and compared to the real-world captured polarization images. To address the reduced reliability of the polarization information caused by transmission, a ray tracer traces the reflection percentage, which is used to calculate the weighted polarization loss. In addition, we construct the first polarization dataset for multi-view transparent shape reconstruction and verify our method on this dataset. The experimental results show that our method can recover high-quality transparent shapes and demonstrate that polarimetric cues can effectively recover the details of transparent objects.

Limitations and future work. The polarimetric rendering in this paper only considers the reflection component on the transparent surface, i.e., only the polarimetric cues of the areas with a high reflection percentage are effectively utilized. Hence, using polarization ray tracing to render more realistic polarization images of transparent objects will be our future work. In addition, the quality of our method heavily depends on the quality of the initial shape; reconstructing high-quality shapes from poor initial shapes is also one of our future directions.

References

G. A. Atkinson and E. R. Hancock (2006) Recovery of surface orientation from diffuse polarization. IEEE transactions on image processing 15 (6), pp. 1653–1664.
Y. Ba, A. Gilbert, F. Wang, J. Yang, R. Chen, Y. Wang, L. Yan, B. Shi, and A. Kadambi (2020) Deep shape from polarization. In European Conference on Computer Vision, pp. 554–571.
M. Born and E. Wolf (2013) Principles of optics: electromagnetic theory of propagation, interference and diffraction of light. Elsevier.
D. Clarke (2009) Stellar polarimetry. John Wiley & Sons.
Z. Cui, J. Gu, B. Shi, P. Tan, and J. Kautz (2017) Polarimetric multi-view stereo. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1558–1567.
A. Dave, Y. Zhao, and A. Veeraraghavan (2022) PANDORA: polarization-aided neural decomposition of radiance. arXiv preprint arXiv:2203.13458.
V. Deschaintre, Y. Lin, and A. Ghosh (2021) Deep polarization imaging for 3d shape and svbrdf acquisition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15567–15576.
Y. Ding, Y. Ji, M. Zhou, S. B. Kang, and J. Ye (2021) Polarimetric helmholtz stereopsis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5037–5046.
A. Gropp, L. Yariv, N. Haim, M. Atzmon, and Y. Lipman (2020) Implicit geometric regularization for learning shapes. In International Conference on Machine Learning, pp. 3789–3799.
D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
K. N. Kutulakos and S. M. Seitz (2000) A theory of shape by space carving. International journal of computer vision 38 (3), pp. 199–218.
K. N. Kutulakos and E. Steger (2008) A theory of refractive and specular 3d shape by light-path triangulation. International Journal of Computer Vision 76 (1), pp. 13–29.
C. Lei, X. Huang, M. Zhang, Q. Yan, W. Sun, and Q. Chen (2020) Polarized reflection removal with perfect alignment in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1750–1758.
Z. Li, Y. Yeh, and M. Chandraker (2020) Through the looking glass: neural 3d reconstruction of transparent shapes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1262–1271.
J. Lyu, B. Wu, D. Lischinski, D. Cohen-Or, and H. Huang (2020) Differentiable refraction-tracing for mesh reconstruction of transparent objects. ACM Transactions on Graphics (TOG) 39 (6), pp. 1–13.
M. Michalkiewicz, J. K. Pontes, D. Jack, M. Baktashmotlagh, and A. Eriksson (2019) Implicit surface representations as layers in neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4743–4752.
S. Mingqi, X. Chongkun, Y. Zhendong, H. Junnan, and W. Xueqian (2022) Transparent shape from single polarization images. arXiv preprint arXiv:2204.06331.
D. Miyazaki and K. Ikeuchi (2005) Inverse polarization raytracing: estimating surface shapes of transparent objects. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, pp. 910–917.
D. Miyazaki, M. Kagesawa, and K. Ikeuchi (2004) Transparent surface modeling from a pair of polarization images. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (1), pp. 73–82.
M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2020) Differentiable volumetric rendering: learning implicit 3d representations without 3d supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3504–3515.
M. Nimier-David, D. Vicini, T. Zeltner, and W. Jakob (2019) Mitsuba 2: a retargetable forward and inverse renderer. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–17.
M. Oechsle, S. Peng, and A. Geiger (2021) Unisurf: unifying neural implicit surfaces and radiance fields for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5589–5599.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32.
Y. Qian, M. Gong, and Y. H. Yang (2016) 3d reconstruction of transparent objects with position-normal consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4369–4377.
N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W. Lo, J. Johnson, and G. Gkioxari (2020) Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501.
C. Tsai, A. Veeraraghavan, and A. C. Sankaranarayanan (2015) What does a single light-ray reveal about a transparent object?. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 606–610. Cited by: Introduction.
B. Wu, Y. Zhou, Y. Qian, M. Gong, and H. Huang (2018) Full 3d reconstruction of transparent objects. arXiv preprint arXiv:1805.03482. Cited by: Introduction, Transparent Shapes Reconstruction.
J. Xu, Z. Zhu, H. Bao, and W. Xu (2022) A hybrid mesh-neural representation for 3d transparent object reconstruction. arXiv preprint arXiv:2203.12613. Cited by: Transparent Shapes Reconstruction.
L. Yang, F. Tan, A. Li, Z. Cui, Y. Furukawa, and P. Tan (2018) Polarimetric dense monocular slam. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3857–3866. Cited by: Polarimetric Inverse Rendering.
L. Yariv, J. Gu, Y. Kasten, and Y. Lipman (2021) Volume rendering of neural implicit surfaces. Advances in Neural Information Processing Systems 34, pp. 4805–4815. Cited by: Introduction.
L. Yariv, Y. Kasten, D. Moran, M. Galun, M. Atzmon, B. Ronen, and Y. Lipman (2020) Multiview neural surface reconstruction by disentangling geometry and appearance. Advances in Neural Information Processing Systems 33, pp. 2492–2502. Cited by: Introduction, Overview, Implicit Surface Representation, Silhouette Loss., Comparisons with Baselines, Implementation Details.
J. Zhao, Y. Monno, and M. Okutomi (2020) Polarimetric multi-view inverse rendering. In European Conference on Computer Vision, pp. 85–102. Cited by: Introduction, Polarimetric Inverse Rendering.
S. Zou, X. Zuo, Y. Qian, S. Wang, C. Xu, M. Gong, and L. Cheng (2020) 3d human shape reconstruction from a polarization image. In European Conference on Computer Vision, pp. 351–368. Cited by: Polarimetric Inverse Rendering.


Implementation Details

We implement $f_{θ}(x)$ as an MLP with $8$ layers and $512$ units per layer in PyTorch Paszke et al. (2019), with a skip connection from the input layer to the middle layer as in IDR Yariv et al. (2020). The model is trained on an RTX 3090 GPU (24 GB). We use the Adam optimizer Kingma and Ba (2014) with a learning rate of $1 \times 10^{-4}$ to optimize the network. We sample $20480$ rays per iteration and employ the secant algorithm to compute the intersection points of the sampled rays with the shape. Each object is trained for $1000$ epochs. The first $100$ epochs form the initial shape reconstruction stage, during which $λ_{pol}$ is set to zero. After $100$ epochs, $λ_{pol}$ is set to $0.4$ ($0.2$ for the object ELEPHANT) to introduce the polarimetric cues.
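The two-stage weighting of the polarization loss described above can be sketched as a small schedule function. This is only an illustration: the function name `pol_weight`, the 0-based epoch index, and the string comparison on the object name are assumptions and are not taken from the released code.

```python
def pol_weight(epoch, object_name, warmup_epochs=100):
    """Return the polarization loss weight (lambda_pol) for a given epoch.

    Assumes a 0-based epoch index, so epochs 0..99 are the initial
    shape reconstruction stage with the polarimetric loss disabled.
    """
    if epoch < warmup_epochs:
        return 0.0
    # The text reports 0.2 for the object ELEPHANT and 0.4 otherwise.
    return 0.2 if object_name.upper() == "ELEPHANT" else 0.4
```

Under these assumptions, the weight would be looked up once per epoch and multiplied into the polarization term of the total loss.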

Intersection of Ray and Geometry

In this paper, we employ a neural network to implicitly represent an object’s signed distance function (SDF), denoted as $f_{θ}(x)$. Both the polarimetric render and the ray tracer in our method require the intersections of rays with the geometry. The ray tracer in particular requires not only the intersections of the camera rays, but also the intersections of the internally refracted rays with the geometry. The ray marching method, which is often used with the SDF representation, cannot meet these requirements since it can only be used on one side of the SDF. Therefore, we adopt the secant algorithm to approximate the intersection of rays and geometry.

Figure 12: Schematic diagram of the intersection calculation of the ray and SDF

As shown in Fig. 12, we denote the ray as $x = x_{0} + t v, t \geq 0$, its intersection point with the geometry as $x_{*} = x_{0} + t_{*} v$, and its intersection point with the unit sphere as $x_{s} = x_{0} + t_{s} v$, where $x_{0}$ is the start point of the ray and $v$ is the unit direction vector. Similar to Niemeyer et al. (2020), we sample $100$ equal steps between $0$ and $t_{s}$, that is, $0 < t_{1} < \ldots < t_{99} < t_{s}$. We then find the first pair $t_{i}$ and $t_{i+1}$ for which $\mathrm{sgn}(f_{θ}(x_{0} + t_{i} v)) \neq \mathrm{sgn}(f_{θ}(x_{0} + t_{i+1} v))$; this change in the sign of the SDF indicates that the ray crosses the surface. The secant algorithm is then applied in the interval $(t_{i}, t_{i+1})$, and the nearest values to $t_{*}$ from both sides are recorded as $t_{*}^{+}$ and $t_{*}^{-}$, where $\mathrm{sgn}(f_{θ}(x_{0} + t_{*}^{+} v)) = \mathrm{sgn}(f_{θ}(x_{0} + t_{*} v))$ and $\mathrm{sgn}(f_{θ}(x_{0} + t_{*}^{-} v)) \neq \mathrm{sgn}(f_{θ}(x_{0} + t_{*} v))$. Both $t_{*}^{+}$ and $t_{*}^{-}$ can be used as an approximation of $t_{*}$; which one is selected depends on whether the next ray is refracted or reflected. Two values are computed because the sign of the SDF at the start point of the next ray affects the correctness of its intersection result. For example, when $x_{*}^{+} = x_{0} + t_{*}^{+} v$ is the start point of a refracted ray and its SDF value satisfies $f_{θ}(x_{*}^{+}) > 0$, the first sign change is necessarily found in the initial interval $(t_{0}, t_{1})$ since the refracted direction points toward the inside of the shape, which results in a wrong intersection. The secant method with two intersection estimates provides the basis for future multi-bounce ($> 2$) ray tracing in the SDF, although only 2-bounce ray tracing is used in this paper.
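The bracketing and refinement steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the released implementation: an analytic sphere SDF (`sphere_sdf`) stands in for the network $f_{θ}$, the names `ray_point` and `secant_bracket` are hypothetical, and each secant update is interleaved with a bisection step so that both ends of the bracket tighten (the paper only states that a secant algorithm is used). Which returned end plays the role of $t_{*}^{+}$ or $t_{*}^{-}$ depends on the sign convention and is left open here.

```python
import math

def sphere_sdf(p, radius=0.5):
    """Stand-in for f_theta: signed distance to a sphere at the origin."""
    return math.sqrt(sum(c * c for c in p)) - radius

def ray_point(x0, v, t):
    """Point x0 + t * v on the ray."""
    return tuple(o + t * d for o, d in zip(x0, v))

def secant_bracket(sdf, x0, v, t_s, n_steps=100, n_iter=10):
    """Return (t_before, t_after) around the first surface crossing, or None.

    t_before has the SDF sign seen just before the crossing,
    t_after has the opposite sign (or an SDF value of exactly zero).
    """
    def f(t):
        return sdf(ray_point(x0, v, t))

    ts = [t_s * i / n_steps for i in range(n_steps + 1)]
    for t_lo, t_hi in zip(ts[:-1], ts[1:]):
        f_lo, f_hi = f(t_lo), f(t_hi)
        if (f_lo > 0.0) == (f_hi > 0.0):
            continue  # no sign change inside this step
        for _ in range(n_iter):
            for use_secant in (True, False):
                if use_secant:
                    # Secant estimate of the root from the bracket ends.
                    t_new = t_lo - f_lo * (t_hi - t_lo) / (f_hi - f_lo)
                else:
                    # Bisection step (assumption) so neither end stagnates.
                    t_new = 0.5 * (t_lo + t_hi)
                f_new = f(t_new)
                if (f_new > 0.0) == (f_lo > 0.0):
                    t_lo, f_lo = t_new, f_new
                else:
                    t_hi, f_hi = t_new, f_new
        return t_lo, t_hi
    return None
```

For a ray that starts outside the shape, `t_before` stays outside and `t_after` lies on or just inside the surface, so a refracted continuation ray could be started from `t_after` and a reflected one from `t_before`.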

Dataset Details

Acquisition Details

Our dataset consists of images of four objects from $34$ views. Each view contains four raw polarization images and three polarization parameters: intensity, degree of linear polarization (DoLP), and angle of linear polarization (AoLP), which are calculated from the raw polarization images as described later. In addition, the mask and normal map of each view are also provided in our dataset. We used a 3D scanner to scan the powdered transparent objects to get the ground-truth shapes, then manually aligned them to each view to get the ground-truth normal maps.

Accurate camera poses for the $34$ views are also provided, which are obtained by a precisely controlled robotic arm. With the origin of the spherical coordinate system set at the center of the object, we uniformly sampled the views at azimuths in $(0^{\circ}, 340^{\circ})$ with an interval of $20^{\circ}$, and at zenith angles of $50^{\circ}$ and $70^{\circ}$.

Preprocessing

We use a polarization camera to capture polarization images with four built-in polarizer arrays at $0^{\circ}, 45^{\circ}, 90^{\circ}, 135^{\circ}$. Therefore, four intensity images $I_{0^{\circ}}, I_{45^{\circ}}, I_{90^{\circ}}, I_{135^{\circ}}$ can be obtained in a single shot, and the Stokes vector can be written in terms of the four intensities:

s = \begin{bmatrix} s_{0} \\ s_{1} \\ s_{2} \\ s_{3} \end{bmatrix} = \begin{bmatrix} I_{0^{\circ}} + I_{90^{\circ}} \\ I_{0^{\circ}} - I_{90^{\circ}} \\ I_{45^{\circ}} - I_{135^{\circ}} \\ 0 \end{bmatrix}

(25)

where $s_{3}$ represents the right circular polarization component, which is zero here because the polarization camera can only capture the linear polarization state.

The degree of linear polarization (DoLP) $ρ$ and the angle of linear polarization (AoLP) $ψ$ are calculated from the following equations:

ρ = \frac{\sqrt{s_{1}^{2} + s_{2}^{2}}}{s_{0}}

(26)

ψ = \frac{1}{2} \arctan \left( \frac{s_{2}}{s_{1}} \right)

(27)

The DoLP and AoLP maps are obtained by calculating the DoLP and AoLP values of each pixel and are used as the polarization part of our dataset.
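As a per-pixel illustration of Eqs. (25) to (27), the following sketch computes the Stokes vector, DoLP, and AoLP from four intensity values. The function names are hypothetical. Two choices here are assumptions rather than part of the paper: `math.atan2` replaces the plain arctangent of Eq. (27) so that $s_{1} = 0$ is handled and the quadrant is resolved, and the DoLP is set to zero when $s_{0}$ is not positive.

```python
import math

def stokes_from_intensities(i0, i45, i90, i135):
    """Linear Stokes vector from the four polarizer-angle intensities (Eq. 25)."""
    return (i0 + i90, i0 - i90, i45 - i135, 0.0)

def dolp_aolp(stokes):
    """Return (DoLP, AoLP in radians) for one Stokes vector (Eqs. 26 and 27)."""
    s0, s1, s2, _ = stokes
    # Guard against division by zero for dark pixels (assumption).
    dolp = math.hypot(s1, s2) / s0 if s0 > 0.0 else 0.0
    # atan2 instead of arctan(s2 / s1): well defined when s1 == 0.
    aolp = 0.5 * math.atan2(s2, s1)
    return dolp, aolp
```

For example, light that is fully linearly polarized along $45^{\circ}$ gives intensities $(0.5, 1.0, 0.5, 0.0)$, for which the sketch returns a DoLP of $1$ and an AoLP of $π/4$.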

Illumination Sampling

A diffuse sphere is used as the light source in our dataset acquisition, which enhances the reflection component of the transparent surface. Illumination sampling is needed in both our polarimetric render and ray tracer. Hence, we model the illumination sampling as shown in Fig. 13, where $v_{c}$ is the view direction and $v_{s}$ is the direction of the sampling ray (the ray that needs illumination sampling).

The relative position between the light source and the polarization camera is fixed, so the angle $α$ between the direction of the sampled ray $v_{s}$ and the view direction $v_{c}$ alone determines whether the sampled ray intersects the light source. Our illumination sampling function can be written as follows:

α = \arccos \left( \frac{- v_{c} \cdot v_{s}}{∥ v_{c} ∥_{2} ∥ v_{s} ∥_{2}} \right)

(28)

I_{sample} = \begin{cases} 1.0, & 0 \leq α < \frac{π}{2} - δ \\ 0.1, & \text{otherwise} \end{cases}

(29)

where $I_{sample}$ represents the intensity of the sampling ray. We limit the angular range that can directly sample the light source to $[0, \frac{π}{2} - δ)$, because the bottom of the light source cannot fit the object surface, so rays within an angular range $δ$ cannot sample the light source directly. The default value of $δ$ is $\frac{π}{18}$. We set the intensity of the light source to $1.0$ and the intensity of the other areas to $0.1$, since the environment contributes weak illumination from diffuse reflection and other sources.
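Eqs. (28) and (29) translate directly into a short function. The sketch below is illustrative only: the name `illumination_sample` is hypothetical, directions are plain 3-tuples, and the cosine is clamped to $[-1, 1]$ before the arccosine to absorb floating-point rounding (an added safeguard, not part of the paper).

```python
import math

def illumination_sample(v_c, v_s, delta=math.pi / 18):
    """Intensity seen by a sampling ray v_s for view direction v_c (Eqs. 28, 29)."""
    dot = -sum(a * b for a, b in zip(v_c, v_s))
    norm = math.sqrt(sum(a * a for a in v_c)) * math.sqrt(sum(b * b for b in v_s))
    # Clamp to the valid acos domain to absorb rounding error (assumption).
    cos_alpha = max(-1.0, min(1.0, dot / norm))
    alpha = math.acos(cos_alpha)
    # The light source is hit only inside the cone [0, pi/2 - delta).
    return 1.0 if alpha < math.pi / 2 - delta else 0.1
```

With the default $δ = π/18$, a sampling ray at $α = 70^{\circ}$ receives the light-source intensity $1.0$, while a ray at $α = 85^{\circ}$ falls back to the ambient value $0.1$.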

Transformations of Stokes Frames

The value of the Stokes vector depends on the reference frame. To facilitate Mueller calculus, we adopt a coordinate system and transformations similar to those of Mitsuba2 Ravi et al. (2020), which are introduced in detail below. We denote the reference frame as $(x, y)$, omitting its $z$-axis, which always lies along the ray direction.

Figure 14: Transformations of incident and outgoing frames

First, we define $\mathrm{Rotator}(Δθ)$ similar to Mitsuba2, which represents the transformation matrix from frame $(x, y)$ to $(x^{'}, y^{'})$, where $(x^{'}, y^{'})$ is obtained by rotating $(x, y)$ around the $z$-axis by an angle of $Δθ$:

\mathrm{Rotator}(Δθ) = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \cos (2 Δθ) & \sin (2 Δθ) & 0 \\ 0 & - \sin (2 Δθ) & \cos (2 Δθ) & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

(30)

The transformation matrix is written in the form of $4 \times 4$ to match the size of the Mueller matrix.
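The rotator of Eq. (30) and its action on a Stokes vector can be sketched with nested lists. The helper names `rotator` and `apply_mueller` are hypothetical and the code is only meant to make the $2Δθ$ dependence concrete.

```python
import math

def rotator(delta_theta):
    """4x4 Mueller matrix that re-expresses a Stokes vector in a frame
    rotated by delta_theta around the ray direction (Eq. 30)."""
    c = math.cos(2.0 * delta_theta)
    s = math.sin(2.0 * delta_theta)
    return [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, c, s, 0.0],
        [0.0, -s, c, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def apply_mueller(m, stokes):
    """Matrix-vector product of a 4x4 Mueller matrix and a Stokes vector."""
    return [sum(m[r][k] * stokes[k] for k in range(4)) for r in range(4)]
```

For instance, light polarized along the $x$-axis, $s = (1, 1, 0, 0)$, becomes approximately $(1, 0, -1, 0)$ after a frame rotation of $45^{\circ}$ and $(1, -1, 0, 0)$ after $90^{\circ}$. The intensity $s_{0}$ and the degree of linear polarization are unchanged, as expected for a pure change of reference frame.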

The reference frames used in our polarimetric render are shown in Fig. 14. The frame $(x_{in}, y_{in})$ is defined by the incident plane, where $x_{in}$ is perpendicular to the incident plane formed by $v_{i}$ and $\hat{n}$. Frames $(\hat{x}_{in}, \hat{y}_{in})$ and $(\hat{x}_{out}, \hat{y}_{out})$ are implicit orthogonal frames with $v_{i}$ and $v_{o}$ as the $z$-axis, respectively. Using implicit orthogonal frames enables us to easily extend our polarization renderer to a multi-bounce polarimetric ray tracer in the future. $(x_{in}, y_{in})$ and $(\hat{x}_{in}, \hat{y}_{in})$ share the same $z$-axis, and their transformation can be written using the definition of the $\mathrm{Rotator}$:

Δθ_{in} = \langle x_{in}, \hat{x}_{in} \rangle = \langle \hat{n} \times v_{i}, \mathrm{Implicit}(v_{i}) \rangle

(31)

R_{i}(v_{i}; \hat{n}) = \mathrm{Rotator}(Δθ_{in})

(32)

where $\langle \cdot, \cdot \rangle$ denotes the signed angle between two unit vectors. $\mathrm{Implicit}(\cdot)$ is the function that computes the implicit orthogonal frame: it takes the $z$-axis direction as input and outputs the $x$-axis direction vector.
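One common way to evaluate a signed angle between two unit vectors that are both perpendicular to a shared axis is the `atan2` form sketched below. The function name `signed_angle` and the sign convention (positive for a counter-clockwise rotation from the first vector to the second when looking against the axis) are assumptions; the convention actually used for the rotator may differ, and $\mathrm{Implicit}(\cdot)$ is not reproduced here.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def signed_angle(a, b, axis):
    """Signed angle from a to b about the unit vector axis, in (-pi, pi].

    Assumes a and b are unit vectors perpendicular to axis.
    """
    return math.atan2(dot(axis, cross(a, b)), dot(a, b))
```

In the setting of Eq. (31), the axis would be the shared $z$-axis of the two frames, that is, the ray direction.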

Similarly, the transformation between $(x_{out}, y_{out})$ and $(\hat{x}_{out}, \hat{y}_{out})$ can be obtained:

Δθ_{out} = \langle x_{out}, \hat{x}_{out} \rangle = \langle \mathrm{Implicit}(v_{i}), \mathrm{Implicit}(v_{o}) \rangle

(33)

R_{o}(v_{o}; \hat{n}) = \mathrm{Rotator}(Δθ_{out})

(34)

Finally, the Stokes vector in frame $(x_{out}, y_{out})$ needs to be converted into the pixel coordinate frame $(x_{c}, y_{c})$. The up direction $(- y_{c})$ of the camera can be written as follows:

\mathrm{up} = P_{c} \begin{bmatrix} 0 \\ -1 \\ 0 \end{bmatrix}

(35)

where $P_{c}$ is the extrinsic rotation matrix of the camera. With the camera’s up direction expressed in the world coordinate frame, we can describe the transformation from $(x_{out}, y_{out})$ to $(x_{c}, y_{c})$ using the $\mathrm{Rotator}$:

Δθ_{c} = \langle x_{out}, x_{c} \rangle = \langle \mathrm{Implicit}(v_{i}), \mathrm{up} \times v_{o} \rangle

(36)

R_{c}(v_{o}) = \mathrm{Rotator}(Δθ_{c})

(37)

Through the above frame transformations, the Stokes vector $s_{0}$ emitted from the light source is observed in the camera as $s_{c}$:

s_{c} = R_{c}(v_{o}) R_{o}(v_{o}; \hat{n}) M_{r} R_{i}(v_{i}; \hat{n}) s_{0}

(38)

where $M_{r}$ is the Mueller matrix of reflection and its detailed expression is presented in the main text.