SIGNet: Intrinsic Image Decomposition by a Semantic and Invariant Gradient Driven Network for Indoor Scenes

Partha Das (ORCID 0000-0003-0112-3638) 1,3, Sezer Karaoğlu 1,3, Arjan Gijsenij 2, Theo Gevers 1,3
1 CV Lab, University of Amsterdam, The Netherlands
2 AkzoNobel, The Netherlands
3 3DUniversum, Amsterdam, The Netherlands
Email: {p.das, th.gevers}@uva.nl, s.karaoglu@3duniversum.com, arjan.gijsenij@akzonobel.com
Abstract

Intrinsic image decomposition (IID) is an under-constrained problem. Therefore, traditional approaches use hand-crafted priors to constrain the problem. However, these constraints are limited when coping with complex scenes. Deep learning-based approaches learn these constraints implicitly through the data, but they often suffer from dataset biases (because a dataset cannot include all possible imaging conditions).

In this paper, a combination of the two is proposed. Component specific priors like semantics and invariant features are exploited to obtain semantically and physically plausible reflectance transitions. These transitions are used to steer a progressive CNN with implicit homogeneity constraints to decompose reflectance and shading maps.

An ablation study is conducted showing that the use of the proposed priors and the progressive CNN increases the IID performance. State-of-the-art performance on both our proposed dataset and the standard real-world IIW dataset shows the effectiveness of the proposed method. Code is made available here.

Keywords:
Priors, Semantic Segmentation, Intrinsic Image Decomposition, CNN, Indoor dataset.

1 Introduction

An image can be defined as the combination of an object’s colour and the incident light on it projected on a plane. Inverting the process of image formation is useful for many downstream computer vision tasks such as geometry estimation [henderson2019], relighting [shu2017], colour edits [Beigpour2011ObjectRB] and Augmented Reality (AR) insertion and interactions for applications like the Metaverse. The process of recovering the object colour (reflectance or albedo) and the incident light (shading) is known as Intrinsic Image Decomposition (IID). As the problem is ill-defined (with only one known), constraint-based approaches are explored to limit the solution space. For example, as an explicit gradient assumption, softer (or smoother) gradient transitions are attributed to shading transitions, while stronger (or abrupt) ones are related to reflectance transitions [Land1971]. Colour palette constraints in the form of sparsity priors and piece-wise consistency are also employed for reflectance estimation [Gehler2011], [Barron2015]. However, these approaches are based on strong assumptions of the imaging process and hence are limited in their applicability.

Implicit constraints, by means of deep learning-based methods, are proposed to expand previous approaches [Narihia2015]. For these methods, the losses implicitly formulate the constraints and are dependent on the training data. These methods learn a flexible representation based on training data which may lead to dataset biases. [Li2018ECCV] integrates multiple datasets to manage the dataset bias problem. However, introducing more datasets only acts as an expansion of the imaging distribution. Additionally, multiple purpose-built losses are needed to train the network. An alternative approach of combining constraints and deep learning is explored in [Fan2018] where edges are used as an additional constraint to guide the network. However, edges at image locations with strong illumination effects, like pronounced cast shadows, may lead to edge misclassification resulting in undesirable effects like shading-reflectance leakages.

On the other hand, [Baslamisli2018ECCV] forgoes priors and specialised losses to leverage joint learning of related modalities. They explore semantic segmentation as a closely related task to IID, arguing that jointly learning the semantic maps provides the network with information to jointly correct for reflectance-shading transitions. However, no explicit guidance or constraint between the semantics and the reflectance is imposed. The network thus relies on learning the constraints from the ground truth semantics, reflectance and shading jointly. Moreover, only outdoor gardens are considered, where most natural classes (e.g., bushes, trees, and roses) contain similar colours (i.e., constrained colour distributions).

Figure 1: Overview of the proposed network. The network consists of i) the global encoder module, ii) the reflectance edge module, iii) the initial estimation module, and iv) the final correction module. The final reflectance and shading outputs are used for all the evaluations. Please refer to the supplementary material for more details. The images shown here are ground truth images, for illustrative purposes.

This paper exploits physical and statistical image properties for IID of indoor scenes. Illumination and geometry invariant descriptors [Gevers1999] yield physics-based cues to detect reflectance transitions, while statistical grouping of pixels in an image provides initial starting estimates for IID components. To this end, a combination of semantic and invariant transition constraints is proposed. Semantic transitions provide valuable information about reflectance transitions, i.e., a change in semantics most likely matches a reflectance transition, but not always the other way around (objects may consist of different colours). Illumination invariant gradients provide useful information about reflectance transitions but can be unstable (noisy) at low intensities. The two sources of reflectance transition information compensate for each other's weaknesses and together provide stronger guidance for IID. In addition, indoor structures, like walls and ceilings, are often homogeneously coloured. The semantic map can therefore be used as an explicit homogeneity prior. This allows for integrating an explicit sparsity/piece-wise consistency (homogeneity) prior in the form of constant reflectance colour.

In this paper, a progressive CNN is employed, consisting of two stages. The first stage of the network exploits the prior information to arrive at an initial estimation. This estimation is based on the semantics, the invariant guided boundaries, and sparsity constraints. The second stage of the network takes the initial estimation and fine-tunes it using the original image cues to disentangle the reflectance and shading maps while being semantically correct. This allows the network to separate the problem into two distinct solution spaces that build progressively on each other. In addition, it allows the network to learn a continuous representation that can extrapolate even when the priors contain errors. An overview of the proposed network is shown in Fig. 1.

While deep learning networks have shown very good performance, they require high-quality datasets. Traditional physically based rendering methods are often time and resource intensive. Recently, these methods have become more efficient, i.e., they run in real time on consumer hardware. Hence, a dataset of physically based and photo-realistic rendered indoor images is provided. The synthetic dataset is used to train the proposed method.

In summary, our contributions are as follows:

  • Algorithm: An end-to-end semantic and physically invariant edge transition driven hybrid network is proposed for intrinsic image decomposition of indoor scenes.

  • Insight: The use of component specific priors outperforms learning from a single image.

  • Performance: The proposed algorithm is able to achieve state-of-the-art performance on both synthetic and real-world datasets.

  • Dataset: A new ray-traced and photo-realistic indoor dataset is provided.

2 Related Works

A considerable amount of effort has been put into exploring hand-crafted prior constraints for the problem of IID. [Land1971] pioneered the field by assuming reflectance changes to be related to sharper gradient changes, while smoother gradients correspond to shading changes. Other priors have been explored, like piece-wise constancy for the reflectance and smoothness priors for shading [Barron2015], and textures [Gehler2011]. Constraints in the form of additional inputs have also been explored. [Lee2012] explores the use of depth as an additional input, while [Jeon2014] explores surface normals. Near infrared priors are used by [Cheng2019] to decompose non-local intrinsics. Human-in-the-loop approaches are also studied by [Bonneel2014] and [Narihira2015-2]. However, these works mostly focus on single objects and do not generalise well to complete scenes.

In contrast to the use of explicit (hand-crafted) constraints, deep learning methods that implicitly learn (data-driven) specific constraints are also explored [Narihia2015].  [Baslamisli2019] explores disentangling the shading component into direct and indirect shading. [Zhou2019] differentiates shading into illumination and surface normals in addition to reflectance. [Bell2014] uses a piece-wise constancy property of reflectances and employs Conditional Random Fields to perform IID. [Fan2018] shows that image edges contain information about reflectance edges and uses them as a guidance for the IID problem. [Li2018ECCV] reduces the solution space by using multiple task specific losses. [Sengupta2019] directly learns the inverse of the rendering function. Finally, [Baslamisli2018ECCV] forgoes losses and jointly learns semantic segmentation to implicitly learn a posterior on the IID, while [Saini2019] uses estimated semantic features as a support for an iterative competing formulation for IID. However, the above approaches do not explicitly integrate the physics-based image formation information and rely on the datasets containing a large set of imaging conditions. Hence, they may fall short for images containing extreme imaging conditions such as strong shadows or reflectance transitions. Large datasets [Li2018ECCV, roberts2021, Li2021CVPR] are proposed to train networks. Unfortunately, they are limited in their photo-realistic appearance.

Unlike in IID, physics-based image formation priors have been explored in other tasks. [Finlayson1992] introduces Colour Ratios, which are illumination invariant descriptors for objects. [Gevers1999] then introduces Cross Colour Ratios, which are both geometry and illumination invariant reflectance descriptors. [Baslamisli2020] shows the applicability of the descriptors to the problem of IID. In contrast to previous methods, in this paper, a combination of explicit image formation-based priors and implicit intrinsic component property losses is explored.

3 Methodology

3.1 Priors

Semantic Segmentation:

[Baslamisli2018ECCV] shows that semantic segmentation provides useful information for the IID problem. However, components are jointly learned and hence their method lacks any explicit influence of the component's property. Since object boundaries correspond to reflectance changes, such boundary information can serve as a useful global reflectance transition guidance for the network. Furthermore, homogeneous colour (i.e., reflectance) constraints (e.g., a wall has a uniform colour) can be imposed on the segmentation explicitly. To this end, in this paper, an off-the-shelf segmentation algorithm, Mask2Former [cheng2021], is used to obtain segmentation maps.
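To illustrate how a segmentation map carries transition information, the sketch below marks the pixels at which the class label changes. It is only an illustration: the proposed network receives the segmentation through an encoder, and the function name `semantic_boundaries` and the toy label map are assumptions introduced here, not part of the method.

```python
import numpy as np

def semantic_boundaries(labels: np.ndarray) -> np.ndarray:
    """Mark pixels whose right or bottom neighbour has a different class label.

    Hypothetical helper for illustration; `labels` is an (H, W) integer map.
    """
    labels = np.asarray(labels)
    edges = np.zeros(labels.shape, dtype=bool)
    # Horizontal transitions: compare each pixel with its right neighbour.
    edges[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    # Vertical transitions: compare each pixel with its bottom neighbour.
    edges[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return edges

# Toy 4x4 label map: class 0 on the left half, class 1 on the right half.
toy_labels = np.array([[0, 0, 1, 1]] * 4)
toy_edges = semantic_boundaries(toy_labels)
```

In the toy example, only the column directly left of the class change is marked, which is the kind of global object transition cue the semantic prior is meant to provide.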

Invariant Gradient Domain:

Solely using semantic regions as priors may cause the network to be biased to the regions generated by the segmentation method. To prevent such a bias, an invariant (edge) map is included as an additional prior to the network. In this work, Cross Colour Ratios (CCR) [Gevers1999] are employed. These are illumination invariants, i.e., reflectance descriptors. Given an image with channels Red ($R$), Green ($G$) and Blue ($B$) and neighbouring pixels $p_1$ and $p_2$, CCR is defined by $M_{RG} = \frac{R_{p_1} G_{p_2}}{R_{p_2} G_{p_1}}$, $M_{RB} = \frac{R_{p_1} B_{p_2}}{R_{p_2} B_{p_1}}$ and $M_{GB} = \frac{G_{p_1} B_{p_2}}{G_{p_2} B_{p_1}}$, where $R_p$, $G_p$ and $B_p$ are the red, green, and blue channel values for pixel $p$. Descriptors $M_{RG}$, $M_{RB}$ and $M_{GB}$ are illumination free and therefore depend solely on reflectance transitions. Using the reflectance gradient as an additional prior allows the network to be steered by reflectance transitions.
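The invariance of the CCR can be checked numerically. The following is a minimal sketch, assuming horizontally adjacent pixels as the neighbouring pair and computing the ratios in the log domain for numerical stability. The function name `cross_colour_ratios`, the `eps` offset, and the random test data are assumptions made for this illustration.

```python
import numpy as np

def cross_colour_ratios(img: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Log cross colour ratios between each pixel p1 and its right neighbour p2.

    img: float RGB array of shape (H, W, 3).
    Returns an (H, W-1, 3) array holding log M_RG, log M_RB and log M_GB.
    """
    log_img = np.log(np.asarray(img, dtype=np.float64) + eps)
    p1, p2 = log_img[:, :-1], log_img[:, 1:]
    r1, g1, b1 = p1[..., 0], p1[..., 1], p1[..., 2]
    r2, g2, b2 = p2[..., 0], p2[..., 1], p2[..., 2]
    # log of (R_p1 * G_p2) / (R_p2 * G_p1), and likewise for the other pairs.
    m_rg = (r1 + g2) - (r2 + g1)
    m_rb = (r1 + b2) - (r2 + b1)
    m_gb = (g1 + b2) - (g2 + b1)
    return np.stack([m_rg, m_rb, m_gb], axis=-1)

# Synthetic check: multiplying by a per-pixel shading leaves the CCR unchanged.
rng = np.random.default_rng(0)
reflectance = rng.uniform(0.2, 1.0, size=(4, 5, 3))
shading = rng.uniform(0.2, 1.0, size=(4, 5, 1))
ccr_reflectance = cross_colour_ratios(reflectance)
ccr_image = cross_colour_ratios(reflectance * shading)
```

Because the per-pixel shading factor appears once in the numerator and once in the denominator of every ratio, it cancels, so `ccr_image` matches `ccr_reflectance` up to the small `eps` offset. A region of uniform reflectance yields zero log ratios.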

Reflectance and Shading Estimates:

Consider the simplified Lambertian [Shafer1985] image formation model: $I = R \times S$, where shading ($S$) is the scaling term on the reflectance component ($R$). Hence, for a given constant reflectance region, all the pixels are different shades of the same colour. In this way, the reflectance colour becomes a scale optimisation for which the pixel mean of a segment can be used, where $\mu_c$ is the channel-specific mean of the pixels in the segment for channel $c$. The $\mu_R$, $\mu_G$ and $\mu_B$ values are then spread within the region to obtain an initial starting point for the reflectance colour based on the homogeneity constraint. Conversely, these values can be inverted using the image formation model to obtain the corresponding scaled shading estimates. A CNN is then employed to implicitly learn the scaling for both priors. Additionally, since the mean of the segment does not consider textures, a deep learning method is proposed to compensate for this by means of a dedicated correction module, see Section 3.2. The supplementary material provides more visuals for these priors.
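The two estimates can be sketched as follows, assuming a segmentation label map is already available. The function name `initial_estimates`, the `eps` guard against division by zero, and the toy scene are assumptions for illustration. The shading estimate is computed per channel here, which is one possible reading of the inversion.

```python
import numpy as np

def initial_estimates(img, labels, eps=1e-6):
    """Per-segment mean reflectance estimate and the inverted shading estimate.

    img: float RGB array (H, W, 3); labels: integer segment map (H, W).
    """
    img = np.asarray(img, dtype=np.float64)
    labels = np.asarray(labels)
    reflectance = np.zeros_like(img)
    for seg_id in np.unique(labels):
        mask = labels == seg_id
        # Spread the channel-specific mean colour over the whole segment.
        reflectance[mask] = img[mask].mean(axis=0)
    # Invert I = R * S per pixel to obtain a scaled shading estimate.
    shading = img / (reflectance + eps)
    return reflectance, shading

# Toy scene: two homogeneous segments under smoothly varying shading.
labels = np.array([[0, 0, 1, 1]] * 2)
true_reflectance = np.where(labels[..., None] == 0,
                            [0.8, 0.4, 0.2], [0.2, 0.6, 0.9])
true_shading = np.linspace(0.3, 1.0, 8).reshape(2, 4, 1)
img = true_reflectance * true_shading
est_reflectance, est_shading = initial_estimates(img, labels)
```

In the toy scene, the estimated colour of each segment is the true reflectance multiplied by the mean shading of that segment. The hue is therefore already correct and only the scale remains to be learned.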

3.2 Network Architecture

The network consists of four components: i) the global encoder blocks, ii) the reflectance edge decoder, iii) the initial estimation decoder and iv) the final correction module. The network is trained end-to-end. The input to the network is an image and its corresponding segmentation obtained by Mask2Former [cheng2021]. The CCR, reflectance and shading estimates are computed from the input image for the respective encoder blocks. Additional details and visuals for the modules can be found in the supplementary materials.

Figure 2: Overview of the global encoder module. Each input is provided with its own independent encoder to enable modality-specific feature learning. The respective features are used in the downstream decoders to provide component-specific information for the network.

Global Encoder Module:

The input image, the segmentation image, the average reflectance estimate, the inverse shading estimate and the CCR images are encoded through their respective encoders. The encoders share the same configuration, but the intermediate features are independent of each other. The semantic features provide guidance for the general outlines of object boundaries, while the CCR features focus on local reflectance transitions, possibly including textures. Correspondingly, the average reflectance estimate features and the inverse shading estimate features provide a starting point for the reflectance and the shading estimation, respectively. Finally, the image features provide the network with a common conditioning to learn the scaling and boundary transitions for the intrinsic components. Fig. 2 shows an overview of the module.

Figure 3: Overview of the reflectance edge module and the attention guided initial estimation module. The edge module takes the image encoder, semantic encoder, and invariant encoder features to learn a semantically and physically guided reflectance transition. The edge features are then transferred through an attention block to the initial estimation decoder module. The reflectance decoder in this module takes the semantic encoder, image encoder and average reflectance estimation features as input. The shading decoder correspondingly takes the image encoder features along with the average shading estimation features. Interconnections in the decoders allow the network to use reflectance cues for shading and vice versa.

Reflectance Edge Module:

This sub-network decodes the reflectance edges of the given input. The decoded reflectance edges are used as an attention mechanism to the initial estimation module to provide (global) region consistency. The semantic and CCR features are concatenated with the image features and passed on to the edge decoder. The semantic and CCR features provide object and reflectance transitions, respectively. The image features allow the network to disentangle reflectance from illumination edges. Corresponding skip connections from the image, semantic and CCR encoders are used to generate high frequency details. Scale space supervision, following [Xie2015], is provided by a common deconvolution layer for the last layers, for scales of and , yielding a scale consistent reflectance edge prediction. The ground truth edges are calculated by applying a Canny edge operation to the ground truth reflectance. Fig. 3 shows an overview of the module.

Initial Estimation Module:

The initial estimation decoder block focuses on learning the IID from the respective initial estimates of the intrinsic components (Fig. 3). It consists of two parallel decoders. The reflectance decoder learns to predict the first estimation from the image features and the average reflectance estimate features. The features are further augmented with the learned boundaries from the reflectance edge decoder, passed through an attention layer [tang2020]. The semantic features are also passed to the decoder to guide global object transitions and act as an additional attention. Similarly, the shading decoder only receives the image features and the inverse shading estimate features, focusing on properties like smoother (shading) gradient changes. The reflectance and shading decoders are interconnected to provide an additional cue to learn an inverse of each other. Skip connections from the respective encoders to the decoders are also given. This allows the network to learn an implicit scaling on top of the average reflectance and the inverse shading estimation. The output at this stage is guided by transition and reflectance boundaries and may suffer from local inconsistencies like shading-reflectance leakages.

Figure 4: The final decoder module. The initial reflectance and shading estimates from the previous step are further corrected to obtain the final reflectance and shading. The encoder consists of independent parallel reflectance and shading encoders. The reflectance encoder receives the initial reflectance and the reflectance edge as input, while the shading encoder receives the initial shading. Two parallel decoders are used for reflectance and shading to obtain the final IID outputs.

Final Correction Module:

To deal with local inconsistencies, a final correction module is proposed. First, the reflectance edge from the edge decoder and the reflectance from the previous decoder are concatenated and passed through a feature calibration layer. This allows the network to focus on local inconsistencies guided by global boundaries. The output is then passed through a final reflectance encoder. The shading from the previous module is similarly passed through another encoder block. The outputs of these two encoders are then passed through another set of parallel decoders for the final reflectance and shading output. Since the reflectance and shading from the previous block are already globally consistent, this decoder acts as a localised correction. To constrain the corrections to local homogeneous regions, skip connections (through attention layers) of encoded reflectance edge features are provided to the decoders. In this way, the network limits the corrections to the local homogeneous regions and recovers local structures like textures. Skip connections from the respective reflectance and shading encoders are provided to transfer high frequency information. The reflectance and shading features in the decoders are shared with each other to enforce an implicit image formation model. Fig. 4 shows an overview of the module.

3.3 Dataset

Unreal Engine [unrealengine] is used to generate a dataset suited for the task. The rendering engine supports physically based rendering, with real-time ray tracing (RTX) support. The engine first calculates the intrinsic components from the various material and geometry properties of the objects making up the scene. Then, the illumination is physically simulated through ray tracing and the lighting is calculated. Finally, all these results are combined to render the final image. Since the engine calculates the intrinsic components, the ground truth intrinsics are recovered from the respective buffers. The dataset consists of dense reflectance and shading ground truths. The network learns the inversion of this process.

Figure 5: Samples from the proposed dataset. The dataset comes with the corresponding dense reflectance and shading maps. The dataset consists of various everyday objects and lighting, containing both near local light sources, like lamps, and more global light sources like sunlight and windows.

Assets from the Unreal Marketplace are used to generate the dataset. These assets are professionally created to be photo-realistic. images are generated of which images are used for training, and are used for validation and testing. To evaluate the generalisability of the network, Intrinsic Images in the Wild (IIW) [Bell2014] is used as a real-world test. Fig. 5 shows a number of samples from the dataset. The generated dataset is comparatively small. However, the purpose of the dataset is for the network to learn an efficient physics-guided representation, rather than a dataset-dependent one. The pretrained model and the dataset will be made available.

3.4 Loss Functions and Training details

MSE loss is applied for each output of the network: (i) the initial estimation losses and (ii) the final correction loss. The initial estimation losses comprise a loss applied on the scale-space reflectance edge and a loss on the reflectance and shading output from the initial estimation module. Additional losses are also applied on the reflectance and shading output from the final correction module. This reflectance and shading are also combined and compared with the input image for a reconstruction loss. These losses are collected in the final correction loss term. An invariance loss is added between the normalised and the prediction of the network for each segment. A Total Variation (TV) loss is included to deal with the assumption that large indoor classes like walls and ceilings are homogeneously coloured. This loss is only applied to ceiling and wall pixels and minimises the TV between the prediction and the ground truth reflectance. Finally, to encourage perceptually consistent and sharper textures, a perceptual loss and a DSSIM loss are included and grouped together. The final loss term to minimise for the network thus becomes:

(1)

where and are weighting terms for the edge and initial estimation losses. They are empirically set to and , respectively. The network is trained for 60 epochs, with a learning rate of and the Adam [kingma2014] optimiser. Please refer to the supplementary materials for more details.
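Two of the loss terms described above can be sketched in NumPy. This is an illustration under stated assumptions, not the training code: the reconstruction term assumes the model $I = R \times S$, and the homogeneity term is one possible reading of "the TV between the prediction and the ground truth reflectance", namely the absolute difference of their masked total variations. All function names are hypothetical, and no loss weights are assumed.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def reconstruction_loss(reflectance, shading, image):
    """MSE between the recombined prediction R * S and the input image."""
    return mse(np.asarray(reflectance) * np.asarray(shading), image)

def masked_tv(reflectance, mask):
    """Mean absolute difference between neighbouring pixels inside `mask`.

    reflectance: (H, W, C) float array; mask: (H, W) boolean map of the
    pixels (e.g. walls and ceilings) on which homogeneity is enforced.
    """
    reflectance = np.asarray(reflectance, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    dx = np.abs(reflectance[:, 1:] - reflectance[:, :-1])
    dy = np.abs(reflectance[1:, :] - reflectance[:-1, :])
    # A neighbour pair counts only if both of its pixels lie inside the mask.
    mx = (mask[:, 1:] & mask[:, :-1])[..., None]
    my = (mask[1:, :] & mask[:-1, :])[..., None]
    pairs = int(mx.sum() + my.sum())
    if pairs == 0:
        return 0.0
    total = (dx * mx).sum() + (dy * my).sum()
    return float(total / (pairs * reflectance.shape[-1]))

def homogeneity_loss(pred_reflectance, gt_reflectance, mask):
    """Assumed form: absolute difference of the masked total variations."""
    return abs(masked_tv(pred_reflectance, mask) - masked_tv(gt_reflectance, mask))

# Toy data: a flat wall versus a wall with a spurious reflectance step.
flat = np.full((4, 4, 3), 0.5)
step = flat.copy()
step[:, 2:] = 0.9
wall = np.ones((4, 4), dtype=bool)
left_only = np.zeros((4, 4), dtype=bool)
left_only[:, :2] = True
toy_shading = np.full((4, 4, 1), 0.7)
```

On the toy data, the flat wall has zero total variation, while the step is penalised only when both sides of the transition fall inside the mask.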

4 Experiments

4.1 Ablation Study

To study the influence of the different architecture components and losses, an ablation study is conducted. For a fair evaluation, the ablation study is performed on the test set of the rendered dataset. For all the ablations, all hyper-parameters are kept constant. The results of the ablation study are presented in Table 1.

| Experiment | Reflectance MSE | Reflectance LMSE | Reflectance DSSIM | Shading MSE | Shading LMSE | Shading DSSIM |
|---|---|---|---|---|---|---|
| w/o Final Correction | 0.0029 | 0.0020 | 0.0225 | 0.0044 | 0.0035 | 0.0276 |
| w/o Priors | 0.0105 | 0.0047 | 0.0444 | 0.0054 | 0.0034 | 0.0399 |
| w Canny Edges | 0.0032 | 0.0037 | 0.0229 | 0.0031 | 0.0049 | 0.0293 |
| w/o Average Estimates | 0.0030 | 0.0023 | 0.0232 | 0.0041 | 0.0043 | 0.0267 |
| w/o Reflectance Edge Module | 0.0097 | 0.0156 | 0.3254 | 0.0033 | 0.0061 | 0.0270 |
| No DSSIM Loss | 0.0131 | 0.0240 | 0.3704 | 0.0041 | 0.0055 | 0.1488 |
| No Perceptual Loss | 0.0032 | 0.0022 | 0.0289 | 0.0032 | 0.0038 | 0.0285 |
| No Invariant & Homogeneity Loss | 0.0032 | 0.0027 | 0.0288 | 0.0024 | 0.0024 | 0.0318 |
| Proposed | 0.0026 | 0.0018 | 0.0219 | 0.0030 | 0.0033 | 0.0252 |
Table 1: Ablation study for the proposed network. For each experiment, the respective parts of the network are modified. All the experiments are conducted on the same test and train split of the proposed dataset. All the applicable hyper-parameters are kept constant.
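For reference, the three reported metrics can be sketched as follows. The exact evaluation settings are not stated in this section, so the sketch makes assumptions: LMSE follows the commonly used windowed scale-invariant formulation with a window of 20 pixels and a shift of 10, and DSSIM is computed as $(1 - \mathrm{SSIM})/2$ from global image statistics rather than from local Gaussian windows. The function names and constants are illustrative.

```python
import numpy as np

def mse(gt, pred):
    """Mean squared error."""
    return float(np.mean((gt - pred) ** 2))

def _scale_invariant_ssq(gt, pred):
    """Sum of squared errors after fitting the best scalar scale to pred."""
    denom = np.sum(pred ** 2)
    alpha = np.sum(gt * pred) / denom if denom > 1e-5 else 0.0
    return np.sum((gt - alpha * pred) ** 2)

def lmse(gt, pred, window=20, shift=10):
    """Local MSE: scale-invariant error accumulated over sliding windows.

    Returns 0.0 when the image is smaller than one window.
    """
    h, w = gt.shape[:2]
    ssq = total = 0.0
    for i in range(0, h - window + 1, shift):
        for j in range(0, w - window + 1, shift):
            g = gt[i:i + window, j:j + window]
            p = pred[i:i + window, j:j + window]
            ssq += _scale_invariant_ssq(g, p)
            total += np.sum(g ** 2)
    return float(ssq / total) if total > 0 else 0.0

def dssim_global(gt, pred, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified DSSIM from global statistics (assumes values in [0, 1])."""
    mu_g, mu_p = gt.mean(), pred.mean()
    var_g, var_p = gt.var(), pred.var()
    cov = np.mean((gt - mu_g) * (pred - mu_p))
    ssim = ((2 * mu_g * mu_p + c1) * (2 * cov + c2)) / (
        (mu_g ** 2 + mu_p ** 2 + c1) * (var_g + var_p + c2))
    return float((1.0 - ssim) / 2.0)

# Toy check: a globally rescaled prediction has non-zero MSE but zero LMSE.
rng = np.random.default_rng(0)
gt_map = rng.uniform(0.1, 1.0, size=(40, 40, 3))
scaled_pred = 0.5 * gt_map
```

All three are error measures, so lower values are better in Tables 1 and 2.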

Influence of final correction module:

In this experiment, the influence of the final correction module is studied. The output from the initial estimation decoder is taken as the final output.

From the results, it is shown that the final correction module helps in improving the outputs. The improvement in the DSSIM metric for both components shows that the final correction module is able to deal with structural artefacts.

Influence of priors:

The influence of all the priors is studied in this ablation. The additional priors are removed, and the network only receives the image as an input. All network structures are kept the same. This setup studies if the network can disentangle the additional information from the input image without any specific priors.

Removing all priors makes the network perform worse on all metrics. In this setting, the network only uses the image to derive both the reflectance and shading changes. This is challenging for strong illumination effects. This shows that the priors are an important source of information, enabling a better disentanglement between intrinsic components.

Influence of specialised edges:

This experiment studies the need of specialised edges obtained from the semantic transition boundaries and invariant features. The edges obtained from the input image are provided to the network. The study focuses on whether the network can distinguish between reflectance, geometry, and shadow edges directly from the image.

From the results, it is shown that using image edges is not sufficient. Image edges can be ambiguous due to the presence of shadow edges. However, the performance is still better than using the image as the only input, showing that edges yield, to a certain extent, useful transition information.

Influence of reflectance and shading estimate priors:

In this experiment, the efficacy of the statistic-based homogeneous reflectance and the inverted shading estimate is studied.

Removing the average reflectance and shading estimates degrades the performance. With the estimates as priors, the network can use its learning capacity to deal with the scaling of the initial estimation to obtain the correct IID. Without them, the network needs to learn the colour as well as the scaling within the same learning capacity.

Influence of reflectance edge guidance module:

For this experiment, the edge guidance module is removed. As such, the network is then forced to learn the attention and the reflectance transition boundaries implicitly as part of the solution space.

Removing the reflectance edge module yields the second worst result. This shows that, apart from the priors, the ability to use those features to learn a reflectance transition is useful. It is shown that without such a transition guidance, the network is susceptible to misclassifying shadow edges as reflectance transitions. Furthermore, it is shown that without this module, the reflectance performance suffers more than the shading performance. Hence, using a learned edge guidance allows the network to be more flexible and better able to identify true reflectance transitions.

Influence of different losses:

The influence of the different losses is studied in this experiment. For each sub-experiment, the same proposed structure is used, and the respective losses are selectively turned off.

From the results, it is shown that the DSSIM loss contributes to a large extent to the performance, because this loss penalises perceptual variations in contrast, luminance, and structure. As such, by removing this supervision, the network learns an absolute difference which is not sensitive to smaller spatial changes. A similar trend of performance decrease is shown when removing the perceptual and homogeneity losses. This is expected since both losses contribute to region consistency. With the addition of the losses on the reflectance, the shading values suffer slightly. However, structurally they perform better when including the losses, as shown by the DSSIM metric. This indicates that applying such a loss helps not only to achieve a better reflectance value, but also jointly improves shading, resulting in sharper outputs.

4.2 Comparison to State of the Art

On the proposed Dataset:

To study the influence of the dataset, the proposed network is compared to the performance of baseline algorithms. For these experiments, the standard MSE, LMSE and DSSIM metrics are used. The baselines are chosen based on their performance on the Weighted Human Disagreement Rate (WHDR), widely used in the literature. Hence, [Li2018ECCV] is chosen as a baseline. [Zhou2019] does not provide any publicly available code, hence it is not included. Although [Fan2018] is the state of the art, their provided code generates errors when run on custom datasets and hence is not used for comparison. For completeness, [Xu2020] and [Shi2017] are also compared. [Xu2020] uses an optimisation-based method based on the pioneering Retinex model. Since it is a purely physical constraint-based model, it is included for comparison. For a fair comparison, methods focusing on indoor scenes are used. [Baslamisli2018ECCV] assumes outdoor settings and requires semantic ground truths to train and hence is not included. All the networks are retrained on the dataset proposed in this paper, using the optimal hyperparameters as mentioned in the respective publications. The results are shown in Table 2 and Fig. 6.
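Since WHDR is used to select the baselines, a minimal sketch of the metric is given here, following its usual definition for IIW [Bell2014]: each human judgement states which of two points has the darker reflectance (or that they are about equal), and the weighted fraction of judgements that the predicted reflectance disagrees with is reported. The threshold of 0.10, the tuple layout of a judgement, and the toy values are assumptions for illustration.

```python
def predicted_label(r1, r2, delta=0.10):
    """Classify a point pair from predicted reflectance intensities.

    Returns "1" if point 1 is darker, "2" if point 2 is darker, "E" if equal.
    """
    r1, r2 = max(r1, 1e-10), max(r2, 1e-10)
    if r2 / r1 > 1.0 + delta:
        return "1"
    if r1 / r2 > 1.0 + delta:
        return "2"
    return "E"

def whdr(judgements, delta=0.10):
    """Weighted human disagreement rate.

    judgements: iterable of (r1, r2, human_label, weight) tuples, where r1
    and r2 are predicted reflectance intensities at the two points.
    """
    error = total = 0.0
    for r1, r2, human, weight in judgements:
        total += weight
        if predicted_label(r1, r2, delta) != human:
            error += weight
    return error / total if total > 0 else 0.0

# Toy judgements: the third prediction contradicts the human label.
toy_judgements = [
    (0.2, 0.8, "1", 1.0),
    (0.5, 0.52, "E", 0.5),
    (0.9, 0.3, "1", 0.5),
]
toy_whdr = whdr(toy_judgements)
```

In the toy example, only the third judgement is violated, so the disagreeing weight of 0.5 out of a total weight of 2.0 gives a WHDR of 0.25.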

Figure 6: Comparison of the proposed to baseline methods. It is shown that the proposed method is able to better disentangle the illumination effect. In comparison, CGIntrinsics, which has comparable results on the WHDR SoTA, suffers from discolouration. STAR misses the illumination while ShapeNet suffers from artefacts.
Method | Reflectance: MSE LMSE DSSIM | Shading: MSE LMSE DSSIM
ShapeNet [Shi2017] 0.0084 0.0133 0.1052 0.0065 0.0129 0.1862
STAR [Xu2020] 0.0304 0.0166 0.1180 0.0290 0.0128 0.1572
CGIntrinsics [Li2018ECCV] 0.0211 0.0156 0.0976 0.0848 0.0577 0.2180
Proposed 0.0026 0.0018 0.0219 0.0030 0.0033 0.0252
Table 2: Comparison to the baseline methods on the proposed dataset. It is shown that the proposed method outperforms all other methods.

The table shows that the proposed model achieves the best scores, i.e., the lowest errors, on all metrics. In the figure, the baselines suffer from strong illumination effects. CGIntrinsics discolours the regions, while STAR mostly fails. ShapeNet suffers from artefacts and colour variations around the illumination regions. In comparison, the proposed network is able to recover from such effects.

On IIW [Bell2014]:

The proposed network is finetuned on the IIW dataset and compared to the baselines. The training and testing splits are used as specified in the original publication. For the baselines, the numbers are obtained from the respective original publications. The results are shown in table 3 and visuals in figure 7.

Figure 7: Visual results on the IIW test set. Compared to CGIntrinsics [Li2018ECCV] and Fan et al. [Fan2018], the proposed method better disentangles the shading and highlights (highlighted in red boxes), showing a smoother reflectance. Both CGIntrinsics and [Fan2018] are unable to remove the highlights from the reflectance, resulting in discolouration. They are also susceptible to reflectance colour change, as can be seen on the cat and furniture (highlighted in green boxes). The proposed method is able to better retain the original colour in the reflectance.
Methods WHDR (mean)
Direct Intrinsics [Narihia2015] 37.3
Color Retinex [Grosse2009] 26.9
Garces et al[Garces2012] 25.5
Zhao et al[Zhao2012] 23.2
IIW [Bell2014] 20.6
Nestmeyer et al[Nestmeyer2016] 19.5
Bi et al[Bi2015] 17.7
Sengupta et al[Sengupta2019] 16.7
Li et al[Li2020] 15.9
CGIntrinsics [Li2018ECCV] 15.5
GLoSH [Zhou2019] 15.2
Fan et al[Fan2018] 15.4
Proposed 15.2
CGIntrinsics [Li2018ECCV]* 14.8
GLoSH [Zhou2019]* 14.6
Fan et al[Fan2018]* 14.5
Proposed* 13.9
Table 3: Baseline comparison for the IIW dataset. Results marked with * are postprocessed with a guided filter [Nestmeyer2016]

The IIW dataset does not contain dense ground truth, hence the network is finetuned with the ordinal loss only. A guided filter [Nestmeyer2016] is used to further improve the results. Overall, the proposed method is on par with GLoSH [Zhou2019], which is the best performing method without any post filtering. However, GLoSH needs both lighting and normal information as supervision, while the proposed method is trained with just reflectance and shading, along with a smaller dataset ( images of [Zhou2019] vs. of the proposed method). For the filtered results, the proposed method achieves a comfortable lead over the current best of 14.5 obtained by [Fan2018], showing the effectiveness of the proposed model.
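For readers unfamiliar with the metric in table 3, the following is a minimal sketch of how WHDR is commonly computed for one image, following the definition introduced with the IIW dataset [Bell2014]. The function name, the tuple layout of the judgements, and the threshold of 0.10 are assumptions made for illustration; the value is a fraction, whereas table 3 reports percentages.

```python
import numpy as np

def whdr(reflectance, judgements, delta=0.10):
    """Weighted Human Disagreement Rate for a single image (sketch).

    reflectance: (H, W) array with the predicted reflectance intensity.
    judgements:  iterable of (y1, x1, y2, x2, label, weight), where label is
                 '1' if point 1 is darker, '2' if point 2 is darker and
                 'E' if both points are judged about equal (assumed layout).
    """
    error_sum, weight_sum = 0.0, 0.0
    for y1, x1, y2, x2, label, weight in judgements:
        # Clamp to avoid division by zero on black pixels.
        r1 = max(float(reflectance[y1, x1]), 1e-10)
        r2 = max(float(reflectance[y2, x2]), 1e-10)
        if r2 / r1 > 1.0 + delta:
            predicted = '1'  # point 1 is darker than point 2
        elif r1 / r2 > 1.0 + delta:
            predicted = '2'  # point 2 is darker than point 1
        else:
            predicted = 'E'  # roughly equal reflectance
        if predicted != label:
            error_sum += weight
        weight_sum += weight
    return error_sum / weight_sum if weight_sum > 0 else 0.0
```

Lower values indicate better agreement with the human annotations.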

5 Conclusions

In this paper, an end-to-end prior-driven approach for indoor scenes has been proposed for the task of intrinsic image decomposition. Reflectance transitions and invariant illuminant descriptors have been used to guide the reflectance decomposition. Image statistics-based priors have been used to provide the network with a starting point for learning. To integrate explicit homogeneous constraints, a progressive CNN was used. To train the network, a custom physically rendered dataset was proposed.

An extensive ablation was performed to validate the proposed network, showing that: i) using explicit reflectance transition priors helps the network to achieve an improved intrinsic image decomposition, ii) image statistics-based priors are helpful for simplifying the problem, and iii) the proposed method attains state-of-the-art performance on the standardised real-world dataset IIW.

6 Supplementary: Network Architecture Details

The network consists of three basic modules: i) encoder, ii) decoder and iii) attention. Figs 8, 9 and 10 visualise these modules.

Figure 8: Encoder structure. Each encoder structure consists of groups of convolutions, followed by a batch normalization and a ReLU non-linearity layer.
Figure 9: Decoder structure. Each decoder structure consists of a transposed convolution followed by a batch normalization and a ReLU non-linearity.
Figure 10: Attention layer. The layer receives two inputs, the guidance map, and the input over which the attention is applied. All the operations are elementwise, and the output is the same spatial and channel dimensions as the input.
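The exact operations of the attention layer are only given in figure 10. As an illustration of the two stated properties (purely elementwise operations, output with the same spatial and channel dimensions as the input), a hypothetical gating could look as follows. The sigmoid gate and the residual addition are assumptions, not the layer definition of the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention(guidance, features):
    """Hypothetical elementwise attention (assumed form, not the paper's).

    guidance: guidance map, same shape as `features` (or broadcastable to it).
    features: input over which the attention is applied, shape (H, W, C).
    """
    gate = sigmoid(guidance)           # elementwise weights in (0, 1)
    return features * gate + features  # elementwise product plus residual sum
```

Because every step is elementwise, the output shape always matches the shape of `features`.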

These structures are used iteratively to construct the larger modules.

Global Encoder Module:

All encoders share the same structure. Table 4 lists the configuration.

Name Layer Kernel Size, Stride, Padding Output Size
Input conv1 3x3x64, 1, 1 256x256x64
conv1 3x3x64, 1, 1 256x256x64
conv2 3x3x64, 2, 1 128x128x64
conv2 3x3x128, 1, 1 128x128x128
conv3 3x3x128, 2, 1 64x64x128
conv3 3x3x256, 1, 1 64x64x256
conv4 3x3x256, 2, 1 32x32x256
conv4 3x3x512, 1, 1 32x32x512
conv5 3x3x512, 2, 1 16x16x512
Bottleneck conv5 3x3x512, 1, 1 16x16x512
Table 4: Overview of our encoder configuration used for each encoder module.
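The output sizes in table 4 follow from the standard convolution size formula. The short sketch below walks through the listed layers and reproduces the spatial sizes; the 256x256 input resolution is inferred from the first row of the table, and the helper names are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1

# (layer name, output channels, stride) as listed in table 4;
# every layer uses a 3x3 kernel with padding 1.
ENCODER_LAYERS = [
    ("conv1", 64, 1), ("conv1", 64, 1),
    ("conv2", 64, 2), ("conv2", 128, 1),
    ("conv3", 128, 2), ("conv3", 256, 1),
    ("conv4", 256, 2), ("conv4", 512, 1),
    ("conv5", 512, 2), ("conv5", 512, 1),
]

def encoder_shapes(input_size=256):
    """Return (name, height, width, channels) after each encoder layer."""
    size, shapes = input_size, []
    for name, channels, stride in ENCODER_LAYERS:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        shapes.append((name, size, size, channels))
    return shapes
```

Each stride-2 convolution halves the resolution, so four of them reduce the 256x256 input to the 16x16x512 bottleneck.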

Reflectance Edge Module:

The configuration for the reflectance edge decoder module is listed in table 5. The features being multiplied and added denote skip connections from the respective encoders. All features are first depth-wise concatenated before being passed through the decoder structures. The module has three scaled outputs, listed in table 5 as Edge Output 64, Edge Output 128 and Full Edge 256.

Name Layer Kernel Size, Stride, Padding Output Size
BottleNeck deconv1 4x4x(512 * 3), 2, 1 32x32x512
deconv2 4x4x(512 * 4), 2, 1 64x64x512
deconv3 4x4x(512 + (256 * 3)), 2, 1 128x128x256
deconv4 4x4x(256 + (128 * 3)), 2, 1 256x256x128
conv 3x3x(128 + (64 * 3)), 1, 1 256x256x64
Full Edge 256 conv 3x3x3, 1, 1 256x256x3
Edge Output 64 conv 3x3x3, 1, 1 256x256x3
Edge Output 128 conv 3x3x3, 1, 1 256x256x3
Table 5: Overview of the configuration for the edge decoders. The summations and product represent skip connections.
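The decoder resolutions in table 5 can be checked with the transposed convolution size formula. The sketch below assumes no output padding and a dilation of 1, neither of which is stated in the table; the function names are illustrative.

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution
    (assuming output_padding = 0 and dilation = 1)."""
    return (size - 1) * stride - 2 * padding + kernel

def decoder_sizes(bottleneck=16, num_deconvs=4):
    """Resolutions from the bottleneck through deconv1..deconv4."""
    sizes = [bottleneck]
    for _ in range(num_deconvs):
        sizes.append(deconv_out(sizes[-1]))
    return sizes
```

With a 4x4 kernel, stride 2 and padding 1, each transposed convolution exactly doubles the resolution, taking the 16x16 bottleneck back to 256x256 after deconv4.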

Initial Estimation Module:

Tables 6 and 7 show the module configuration. The decoder is a joint decoder in the style of [Shi2017], i.e., the reflectance decoder blocks use the previous shading decoder outputs together with the previous reflectance decoder block outputs (denoted by ). The reflectance decoder receives the encoder features from the image encoder, the reflectance estimate encoder, and the semantic encoder as skip connections (denoted by ). Additionally, the edge decoder features and the current reflectance decoder outputs are passed through the attention layer before being passed to the next block. Similarly, the shading decoder blocks receive the shading estimate encoder features and the image encoder features as skip connections.

Name Layer Kernel Size, Stride, Padding Output Size
BottleNeck deconv1 4x4x(512 * 3), 2, 1 32x32x512
Attention reflect edge (re) & deconv1 32x32x512
deconv2 4x4x(512 * 2 + (512 + 512 + 512)), 2, 1 64x64x512
Attention re deconv2 & deconv2 64x64x512
deconv3 4x4x(512 * 2 + (256 + 256 + 256)), 2, 1 128x128x256
Attention re deconv3 & deconv3 128x128x256
deconv4 4x4x(256 * 2 + (128 + 128 + 128)), 2, 1 256x256x128
Attention re deconv4 & deconv4 256x256x128
conv6 3x3x(128 * 2 + (64 + 64 + 64)), 1, 1 256x256x64
conv6 3x3x3, 1, 1 256x256x3
Initial Reflectance Output Attention re output & conv6 256x256x3
Table 6: Overview of the configuration for the initial reflectance estimation module decoder
Name Layer Kernel Size, Stride, Padding Output Size
BottleNeck deconv1 4x4x(512 * 2), 2, 1 32x32x512
Attention reflect edge (re) & deconv1 32x32x512
deconv2 4x4x(512 * 2 + (512 + 512)), 2, 1 64x64x512
Attention re deconv2 & deconv2 64x64x512
deconv3 4x4x(512 * 2 + (256 + 256)), 2, 1 128x128x256
Attention re deconv3 & deconv3 128x128x256
deconv4 4x4x(256 * 2 + (128 + 128)), 2, 1 256x256x128
Attention re deconv4 & deconv4 256x256x128
conv6 3x3x(128 * 2 + (64 + 64)), 1, 1 256x256x64
conv6 3x3x1, 1, 1 256x256x1
Initial Shading Output Attention re output & conv6 256x256x1
Table 7: Overview of the configuration for the initial shading estimation module decoder. The attention layer outputs are reused from the reflectance decoder.

Final Correction Module:

This module consists of the following encoder and decoder structures: i) the reflectance edge encoder (), ii) the initial reflectance estimation encoder (), iii) the initial shading estimation encoder (), iv) the final reflectance decoder and v) the final shading decoder. The encoders use the previously introduced configuration detailed in table 4. Tables 9 and 10 give an overview of the final decoder configurations. Just like in the previous module, a joint decoder structure is used. The reflectance decoder receives skip connections from . In addition, the features from and are passed through an attention layer and forwarded as an additional skip connection. Similarly, the shading decoder receives skip connections from .

Name Layer Kernel Size, Stride, Padding Output Size
Reflectance Input reflec conv1 1x1x(3 + 3), 1, 0 256x256x8
Reflectance Bottleneck reflec conv1 1x1x16, 1, 0 256x256x16
Table 8: Overview of the feature calibrator. It has separate 1x1 convolutions for the reflectance input and bottleneck.

The input to the reflectance encoder is the concatenation of the reflectance edge and the initial reflectance estimate from the previous decoder. This concatenated output is first fed through a feature calibration module (detailed in table 8) before being passed to the final reflectance decoder.

Name Layer Kernel Size, Stride, Padding Output Size
BottleNeck deconv1 4x4x512, 2, 1 32x32x512
Attention conv4 & conv4 32x32x512
deconv2 4x4x(512 * 2 + (512 + 512)), 2, 1 64x64x512
Attention conv3 & conv3 64x64x512
deconv3 4x4x(512 * 2 + (256 + 256)), 2, 1 128x128x256
Attention conv2 & conv2 128x128x256
deconv4 4x4x(256 * 2 + (128 + 128)), 2, 1 256x256x128
Attention conv1 & conv1 256x256x128
conv6 3x3x(128 * 2 + (64 + 64)), 1, 1 256x256x64
Final Reflectance Output conv6 3x3x3, 1, 1 256x256x3
Table 9: Overview of the configuration for the final reflectance correction decoder
Name Layer Kernel Size, Stride, Padding Output Size
BottleNeck deconv1 4x4x512, 2, 1 32x32x512
Attention conv4 & conv4 32x32x512
deconv2 4x4x(512 * 2 + (512 + 512)), 2, 1 64x64x512
Attention conv3 & conv3 64x64x512
deconv3 4x4x(512 * 2 + (256 + 256)), 2, 1 128x128x256
Attention conv2 & conv2 128x128x256
deconv4 4x4x(256 * 2 + (128 + 128)), 2, 1 256x256x128
Attention conv1 & conv1 256x256x128
conv6 3x3x(128 * 2 + (64 + 64)), 1, 1 256x256x64
Final Shading Output conv6 3x3x1, 1, 1 256x256x1
Table 10: Overview of the configuration for the final shading correction decoder

7 Supplementary: Loss Function Details

The losses are broadly grouped into i) initial estimation loss ( & ), ii) final correction loss (), and iii) invariant and homogeneity constraint loss ( & ). All losses use the standard MSE loss.

1) Initial Estimation Loss:

The initial estimation block outputs consist of the scales of reflectance edges, along with the initial reflectance and shading estimations. The reflectance edge outputs consist of outputs, , and the full resolution of . The total edge loss is defined by:

(2)

where are the losses on the reflectance edges with resolution of , and . The ground truth for these losses is generated by using a Canny Edge operation on the reflectance ground truth. The initial estimations are matched to the ground truth reflectance and shading, hence:

(3)

where is the initial reflectance estimation and is the initial shading estimation losses.
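As a reading aid, the two sums described in this subsection can be written as follows. The symbol names and the unweighted form are assumptions made for illustration; the scales 64, 128 and 256 are taken from the output names in table 5.

```latex
% Assumed notation: superscripts denote the edge output scale,
% \hat{E}, \hat{R}_{init}, \hat{S}_{init} are network predictions,
% E, R, S are the corresponding ground truths.
\mathcal{L}_{edge} = \mathcal{L}_{e}^{64} + \mathcal{L}_{e}^{128} + \mathcal{L}_{e}^{256},
\qquad
\mathcal{L}_{e}^{k} = \mathrm{MSE}\big(\hat{E}^{k}, E^{k}\big)

\mathcal{L}_{init} = \mathcal{L}_{R_{init}} + \mathcal{L}_{S_{init}},
\qquad
\mathcal{L}_{R_{init}} = \mathrm{MSE}\big(\hat{R}_{init}, R\big),
\quad
\mathcal{L}_{S_{init}} = \mathrm{MSE}\big(\hat{S}_{init}, S\big)
```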

2) Final Correction Loss:

For the final outputs of the network a reconstruction loss is added. This ensures that the network can learn the image formation model. On top of this loss, the reflectance and shading are also compared to the ground truth. The final loss is:

(4)

where and are the losses between the ground truth reflectance and shading and the network prediction. The loss is the reconstruction loss between the product of the network prediction and the input image.
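A minimal NumPy sketch of the three terms described above is given below. It assumes that the terms are summed without weights and that the shading is a single-channel map that is broadcast over the colour channels; the function and argument names are illustrative.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def final_correction_loss(refl_pred, shad_pred, refl_gt, shad_gt, image):
    """Sketch of the final correction loss (unweighted sum is an assumption).

    refl_pred, refl_gt, image: arrays of shape (H, W, 3).
    shad_pred, shad_gt:        arrays of shape (H, W, 1).
    """
    # Image formation model: the image is the product of reflectance and shading.
    reconstruction = refl_pred * shad_pred
    return (mse(refl_pred, refl_gt)
            + mse(shad_pred, shad_gt)
            + mse(reconstruction, image))
```

The reconstruction term ties the two outputs together: a prediction pair can only reach zero loss if its product also reproduces the input image.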

3) Invariant and Homogeneity Constraint Loss:

In addition to the standard losses, constraint-specific losses are also applied to explicitly encourage invariance and homogeneity in the reflectance. Given a segment obtained from the segmentation map, the normalised values are compared between the prediction and the image. It is reasoned that, since the normalised values are illumination invariant features, the predicted reflectance should in turn have similar values for the corresponding pixels:

(5)

where, is the loss between the normalised of the predicted reflectance and the image, for the pixels in the segment belonging to class .
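The invariance constraint can be sketched in NumPy as follows. The sketch assumes that the normalised values are normalised RGB, i.e., each channel divided by the sum of the three channels, which cancels a per-pixel intensity scaling; the function names, the epsilon, and the integer class map are illustrative assumptions.

```python
import numpy as np

def normalised_rgb(img, eps=1e-6):
    """Divide each channel by the per-pixel channel sum (assumed invariant)."""
    return img / (img.sum(axis=-1, keepdims=True) + eps)

def invariant_loss(refl_pred, image, seg_map, class_id):
    """MSE between normalised values of prediction and image in one segment.

    refl_pred, image: arrays of shape (H, W, 3).
    seg_map:          integer class map of shape (H, W).
    """
    mask = seg_map == class_id
    if not mask.any():
        return 0.0  # the class does not occur in this image
    diff = normalised_rgb(refl_pred)[mask] - normalised_rgb(image)[mask]
    return float(np.mean(diff ** 2))
```

A reflectance that differs from the image only by a greyscale shading factor gives a loss close to zero, while a change of hue inside the segment is penalised.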

In addition, large classes, especially indoors, such as walls and ceilings, consist of largely homogeneous regions. Explicit homogeneity supervision is provided in the form of a Total Variation loss:

(6)

where, is the MSE between the total variation of the predicted reflectance () and the ground truth reflectance (), for the pixels belonging to the ceiling and wall class.
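The homogeneity term can be sketched as below. The anisotropic (absolute forward difference) form of the total variation and the class ids passed for walls and ceilings are assumptions made for illustration.

```python
import numpy as np

def total_variation_map(img):
    """Per-pixel sum of absolute forward differences along height and width."""
    dy = np.abs(np.diff(img, axis=0, append=img[-1:]))
    dx = np.abs(np.diff(img, axis=1, append=img[:, -1:]))
    return dx + dy

def homogeneity_loss(refl_pred, refl_gt, seg_map, class_ids):
    """MSE between the total variation of prediction and ground truth,
    restricted to pixels whose class is in `class_ids` (e.g. wall, ceiling).
    """
    mask = np.isin(seg_map, class_ids)
    if not mask.any():
        return 0.0  # none of the selected classes occur in this image
    diff = total_variation_map(refl_pred)[mask] - total_variation_map(refl_gt)[mask]
    return float(np.mean(diff ** 2))
```

For a flat ground truth wall, any residual shading gradient or texture in the predicted reflectance raises its total variation and therefore the loss.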

Finally, to make the network produce perceptually consistent outputs, a perceptual loss with a pretrained VGG16 [Simonyan2015] and a Structural Dissimilarity loss are also added:

(7)

where is the perceptual loss between the predicted reflectance and the ground truth reflectance and is the Structural dissimilarity metric between the predicted and ground truth reflectance and shading. and are empirically set as and , respectively.
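For reference, structural dissimilarity is commonly defined as DSSIM = (1 - SSIM) / 2. The sketch below computes it from global image statistics to stay short, whereas the standard SSIM averages over local windows; the constants 0.01 and 0.03 are the usual SSIM defaults, and the whole function is an illustrative simplification rather than the implementation used for the reported numbers.

```python
import numpy as np

def ssim_global(a, b, data_range=1.0):
    """SSIM from global statistics (simplification: no local windows)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    luminance = (2 * mu_a * mu_b + c1) / (mu_a ** 2 + mu_b ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (var_a + var_b + c2)
    return float(luminance * contrast_structure)

def dssim(a, b, data_range=1.0):
    """Structural dissimilarity: 0 for identical images, larger is worse."""
    return (1.0 - ssim_global(a, b, data_range)) / 2.0
```

The two factors make explicit why this loss reacts to luminance, contrast and structure changes that a plain MSE averages away.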

The final loss to minimise for the network then becomes:

(8)

where and are empirically found to be optimum as and , respectively.

8 Supplementary: Additional Visuals

8.1 IIW

Additional visual comparisons with the baselines on the IIW test set are provided in figs 11, 12, 13, 14 and 15.

Figure 11: The shadow behind the TV case is completely removed by the proposed method, while the competing method suffers from artefacts.
Figure 12: Additional visuals from the IIW test-set. The second row for each image group shows the zoom of the highlighted box. The proposed method can recover reflectance from various illumination effects including coloured illumination.
Figure 13: The proposed method can preserve reflectance colour while also being able to handle various illumination effects.
Figure 14: The proposed method’s reflectance is free from artefacts, while the competing method has that problem. Additionally, background details are preserved in sharper clarity, compared to the competing methods.
Figure 15: The proposed method can disentangle the shading and illumination effects, giving a comparatively flatter and piece-wise constant reflectance.

8.2 Dataset

Figs 16, 17 and 18 provide additional visuals for the proposed dataset, along with the corresponding reflectance and shading ground truths that are made available with this dataset.

Figure 16: Visuals from the proposed dataset. For each of the images, the dense ground truth reflectance and the shading is also provided.
Figure 17: Additional visuals from the dataset. The images are of realistic indoor scenes, consisting of various settings like living rooms, kitchen, and hallways.
Figure 18: More visuals from the dataset showing different bedroom and kitchen styles.

References