SIGNet: Intrinsic Image Decomposition by a Semantic and Invariant Gradient Driven Network for Indoor Scenes
Abstract
Intrinsic image decomposition (IID) is an under-constrained problem. Therefore, traditional approaches use hand-crafted priors to constrain the problem. However, these constraints are limited when coping with complex scenes. Deep learning-based approaches learn these constraints implicitly through the data, but they often suffer from dataset biases, because a dataset cannot include all possible imaging conditions.
In this paper, a combination of the two is proposed. Component specific priors like semantics and invariant features are exploited to obtain semantically and physically plausible reflectance transitions. These transitions are used to steer a progressive CNN with implicit homogeneity constraints to decompose reflectance and shading maps.
An ablation study is conducted showing that the use of the proposed priors and the progressive CNN increases the IID performance. State-of-the-art performance on both our proposed dataset and the standard real-world IIW dataset shows the effectiveness of the proposed method. Code is made available here.
Keywords:
Priors, Semantic Segmentation, Intrinsic Image Decomposition, CNN, Indoor dataset.
1 Introduction
An image can be defined as the combination of an object’s colour and the incident light on it projected on a plane. Inverting the process of image formation is useful for many downstream computer vision tasks such as geometry estimation [henderson2019], relighting [shu2017], colour edits [Beigpour2011ObjectRB] and Augmented Reality (AR) insertion and interactions for applications like the Metaverse. The process of recovering the object colour (reflectance or albedo) and the incident light (shading) is known as Intrinsic Image Decomposition (IID). As the problem is ill-defined (with only one known), constraint-based approaches are explored to limit the solution space. For example, as an explicit gradient assumption, softer (or smoother) gradient transitions are attributed to shading transitions, while stronger (or abrupt) ones are related to reflectance transitions [Land1971]. Colour palette constraints in the form of sparsity priors and piece-wise consistency are also employed for reflectance estimation [Gehler2011], [Barron2015]. However, these approaches are based on strong assumptions of the imaging process and hence are limited in their applicability.
Implicit constraints, by means of deep learning-based methods, are proposed to expand previous approaches [Narihia2015]. For these methods, the losses implicitly formulate the constraints and are dependent on the training data. These methods learn a flexible representation based on training data which may lead to dataset biases. [Li2018ECCV] integrates multiple datasets to manage the dataset bias problem. However, introducing more datasets only acts as an expansion of the imaging distribution. Additionally, multiple purpose-built losses are needed to train the network. An alternative approach of combining constraints and deep learning is explored in [Fan2018] where edges are used as an additional constraint to guide the network. However, edges at image locations with strong illumination effects, like pronounced cast shadows, may lead to edge misclassification resulting in undesirable effects like shading-reflectance leakages.
On the other hand, [Baslamisli2018ECCV] forgoes priors and specialised losses to leverage joint learning of related modalities. They explore semantic segmentation as a closely related task to IID, arguing that jointly learning the semantic maps provides the network with information to correct for reflectance-shading transitions. However, no explicit guidance or constraint between the semantics and reflectance is imposed. The network thus relies on learning the constraints from the ground truth semantics, reflectance and shading jointly. Moreover, only outdoor gardens are considered, where most natural classes (e.g., bushes, trees, and roses) contain similar colours (i.e., constrained colour distributions).
This paper exploits physical and statistical image properties for IID of indoor scenes. Illumination and geometry invariant descriptors [Gevers1999] yield physics-based cues to detect reflectance transitions, while statistical grouping of pixels in an image provides initial starting estimates for IID components. To this end, a combination of semantic and invariant transition constraints is proposed. Semantic transitions provide valuable information about reflectance transitions i.e., a change in semantics most likely matches a reflectance transition but not always the other way around (objects may consist of different colours). Illumination invariant gradients provide useful information about reflectance transitions but can be unstable (noisy) due to low intensity. Exploiting reflectance transition information on these two levels compensates each other and ensures a stronger guidance for IID. In addition, indoor structures, like walls and ceilings, are often homogeneously coloured. To this end, the semantic map can be used as an explicit homogeneous prior. This allows for integrating an explicit sparsity/piece-wise consistency (homogeneity) prior in the form of constant reflectance colour.
In this paper, a progressive CNN is employed, consisting of two stages. The first stage of the network exploits the prior information to arrive at an initial estimation. This estimation is based on the semantics, the invariant guided boundaries, and sparsity constraints. The second stage of the network takes the initial estimation and fine-tunes it using the original image cues to disentangle the reflectance and shading maps while being semantically correct. This allows the network to separate the problem into two distinct solution spaces that build progressively on each other. In addition, it allows the network to learn a continuous representation that can extrapolate even when the priors contain errors. An overview of the proposed network is shown in Fig. 1.
While deep learning networks have shown very good performance, they require high quality datasets. Traditional physically based rendering methods are often time and resource intensive. Recently, these methods have become more efficient, i.e., they run in real time on consumer hardware. Hence, a dataset of physically based and photo-realistic rendered indoor images is provided. The synthetic dataset is used to train the proposed method.
In summary, our contributions are as follows:
- Algorithm: An end-to-end semantic and physically invariant edge transition driven hybrid network is proposed for intrinsic image decomposition of indoor scenes.
- Insight: The use of component specific priors outperforms learning from a single image.
- Performance: The proposed algorithm is able to achieve state-of-the-art performance on both synthetic and real-world datasets.
- Dataset: A new ray-traced and photo-realistic indoor dataset is provided.
2 Related Works
A considerable amount of effort has been put into exploring hand-crafted prior constraints for the problem of IID. [Land1971] pioneered the field by assuming reflectance changes to be related to sharper gradient changes, while smoother gradients correspond to shading changes. Other priors have been explored, such as piece-wise constancy for the reflectance and smoothness priors for shading [Barron2015], and textures [Gehler2011]. Constraints in the form of additional inputs have also been explored. [Lee2012] explores the use of depth as an additional input, while [Jeon2014] explores surface normals. Near infrared priors are used by [Cheng2019] to decompose non-local intrinsics. Human-in-the-loop approaches are also studied by [Bonneel2014] and [Narihira2015-2]. However, these works mostly focus on single objects and do not generalise well to complete scenes.
In contrast to the use of explicit (hand-crafted) constraints, deep learning methods that implicitly learn (data-driven) specific constraints are also explored [Narihia2015]. [Baslamisli2019] explores disentangling the shading component into direct and indirect shading. [Zhou2019] differentiates shading into illumination and surface normals in addition to reflectance. [Bell2014] uses a piece-wise constancy property of reflectances and employs Conditional Random Fields to perform IID. [Fan2018] shows that image edges contain information about reflectance edges and uses them as a guidance for the IID problem. [Li2018ECCV] reduces the solution space by using multiple task specific losses. [Sengupta2019] directly learns the inverse of the rendering function. Finally, [Baslamisli2018ECCV] forgoes losses and jointly learns semantic segmentation to implicitly learn a posterior on the IID, while [Saini2019] uses estimated semantic features as a support for an iterative competing formulation for IID. However, the above approaches do not explicitly integrate the physics-based image formation information and rely on the datasets containing a large set of imaging conditions. Hence, they may fall short for images containing extreme imaging conditions such as strong shadows or reflectance transitions. Large datasets [Li2018ECCV, roberts2021, Li2021CVPR] are proposed to train networks. Unfortunately, they are limited in their photo-realistic appearance.
Unlike IID, physics-based image formation priors have been explored in other tasks. [Finlayson1992] introduces Colour Ratios, which are illumination invariant descriptors for objects. [Gevers1999] then introduces Cross Colour Ratios, which are both geometry and illumination invariant reflectance descriptors. [Baslamisli2020] shows the applicability of the descriptors to the problem of IID. In contrast to previous methods, in this paper, a combination of explicit image formation-based priors and implicit intrinsic component property losses is explored.
3 Methodology
3.1 Priors
Semantic Segmentation:
[Baslamisli2018ECCV] shows that semantic segmentation provides useful information for the IID problem. However, components are jointly learned and hence their method lacks any explicit influence of the component's property. Since object boundaries correspond to reflectance changes, such boundary information can serve as a useful global reflectance transition guidance for the network. Furthermore, homogeneous colour (i.e., reflectance) constraints (e.g., a wall has a uniform colour) can be imposed on the segmentation explicitly. To this end, in this paper, an off-the-shelf segmentation algorithm, Mask2Former [cheng2021], is used to obtain segmentation maps.
Invariant Gradient Domain:
Solely using semantic regions as priors may cause the network to be biased to the regions generated by the segmentation method. To prevent such a bias, an invariant (edge) map is included as an additional prior to the network. In this work, Cross Colour Ratios (CCR) [Gevers1999] are employed. These are illumination invariants, i.e., reflectance descriptors. Given an image $I$ with channels Red ($R$), Green ($G$) and Blue ($B$) and neighbouring pixels $p_1$ and $p_2$, CCR is defined by $M_{RG} = \frac{R_{p_1} G_{p_2}}{R_{p_2} G_{p_1}}$, $M_{RB} = \frac{R_{p_1} B_{p_2}}{R_{p_2} B_{p_1}}$ and $M_{GB} = \frac{G_{p_1} B_{p_2}}{G_{p_2} B_{p_1}}$, where $R_p$, $G_p$ and $B_p$ are the red, green, and blue channel values for pixel $p$. Descriptors $M_{RG}$, $M_{RB}$ and $M_{GB}$ are illumination free and therefore depend solely on reflectance transitions. Using the reflectance gradient as an additional prior allows the network to be steered by reflectance transitions.
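The invariance of the cross colour ratios can be checked numerically. The following is a minimal sketch, assuming the ratio form of [Gevers1999] (e.g., the red-green ratio is $R_{p_1} G_{p_2} / (R_{p_2} G_{p_1})$), horizontal neighbours only, and a log-domain evaluation with a small `eps` for stability. The function name `cross_colour_ratios` and these choices are illustrative assumptions, not the implementation used in the paper.

```python
import numpy as np

def cross_colour_ratios(image, eps=1e-6):
    """Log cross colour ratios between horizontally neighbouring pixels.

    image: float array of shape (H, W, 3) holding R, G, B values.
    Returns an array of shape (H, W - 1, 3) with log M_RG, log M_RB, log M_GB.
    """
    log_img = np.log(np.asarray(image, dtype=np.float64) + eps)
    p1 = log_img[:, :-1, :]  # pixel p1
    p2 = log_img[:, 1:, :]   # right-hand neighbour p2
    r1, g1, b1 = p1[..., 0], p1[..., 1], p1[..., 2]
    r2, g2, b2 = p2[..., 0], p2[..., 1], p2[..., 2]
    # log M_RG = log(R_p1 * G_p2) - log(R_p2 * G_p1), and likewise for RB and GB.
    m_rg = (r1 + g2) - (r2 + g1)
    m_rb = (r1 + b2) - (r2 + b1)
    m_gb = (g1 + b2) - (g2 + b1)
    return np.stack([m_rg, m_rb, m_gb], axis=-1)
```

In a region of constant reflectance the shading factors cancel and the log ratios are close to zero, whereas a reflectance change between $p_1$ and $p_2$ yields a non-zero response regardless of the shading.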
Reflectance and Shading Estimates:
Consider the simplified Lambertian [Shafer1985] image formation model: $I = R \times S$, where shading ($S$) is the scaling term on the reflectance component ($R$). Hence, for a given constant reflectance region, all the pixels are different shades of the same colour. In this way, the reflectance colour becomes a scale optimisation for which the pixel mean of a segment can be used: $(R_s, G_s, B_s) = (\mu(R), \mu(G), \mu(B))$, where $\mu$ is the channel-specific mean of the pixels in the segment. The $R_s$, $G_s$ and $B_s$ values are then spread within the region to obtain an initial starting point for the reflectance colour based on the homogeneity constraint. Conversely, these values can be inverted using the image formation model to obtain the corresponding scaled shading estimates. A CNN is then employed to implicitly learn the scaling for both priors. Additionally, since the mean of the segment does not consider textures, a deep learning method is proposed to compensate for it by means of a dedicated correction module, see Section 3.2. The supplementary material provides more visuals for these priors.
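The two statistical priors can be sketched in a few lines. This is a minimal illustration, assuming an integer label map as segmentation and a single-channel shading estimate obtained by averaging the channel-wise inversion of the image formation model. The function name `segment_mean_priors` and the `eps` term are assumptions for this example.

```python
import numpy as np

def segment_mean_priors(image, segmentation, eps=1e-6):
    """Average reflectance estimate and inverse shading estimate.

    image: float array (H, W, 3); segmentation: integer label array (H, W).
    """
    image = np.asarray(image, dtype=np.float64)
    reflectance_est = np.zeros_like(image)
    for label in np.unique(segmentation):
        mask = segmentation == label
        # Spread the channel-specific mean colour over the whole segment.
        reflectance_est[mask] = image[mask].mean(axis=0)
    # Invert I = R x S per channel, then average to a single shading channel.
    shading_est = (image / (reflectance_est + eps)).mean(axis=-1)
    return reflectance_est, shading_est
```

Because the mean colour absorbs the average shading of the segment, both estimates are correct only up to a per-segment scale, which is the scaling the network is left to learn.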
3.2 Network Architecture
The network consists of four components: i) Global encoder blocks, ii) Reflectance edge decoder, iii) Initial estimation decoder and iv) Final correction module. The network is trained end-to-end. The input to the network is an image and its corresponding segmentation obtained by Mask2Former [cheng2021]. The CCR, reflectance and shading estimates are computed from the input image for the respective encoder blocks. Additional details and visuals for the modules can be found in the supplementary materials.
Global Encoder Module:
The input image, the segmentation image, the average reflectance estimate, the inverse shading estimate and the CCR images are encoded through their respective encoders. The encoders share the same configuration, but the intermediate features are independent of each other. The semantic features provide guidance for the general outlines of object boundaries, while the CCR features focus on local reflectance transitions, possibly including textures. Correspondingly, the average reflectance estimate features and the inverse shading estimate features provide a starting point for the reflectance and the shading estimation, respectively. Finally, the image features provide the network with a common conditioning to learn the scaling and boundary transitions for the intrinsic components. Fig. 2 shows the overview of the module.
Reflectance Edge Module:
This sub-network decodes the reflectance edges of the given input. The decoded reflectance edges are used as an attention mechanism to the initial estimation module to provide (global) region consistency. The semantic and CCR features are concatenated with the image features and passed on to the edge decoder. The semantic and CCR features provide object and reflectance transitions, respectively. The image features allow the network to disentangle reflectance from illumination edges. Corresponding skip connections from the semantic, CCR and image encoders are used to generate high frequency details. Scale space supervision, following [Xie2015], is provided by a common deconvolution layer for the last layers, at scales of $64 \times 64$ and $128 \times 128$, yielding a scale consistent reflectance edge prediction. The ground truth edges are calculated by applying a Canny edge operator to the ground truth reflectance. Fig. 3 shows an overview of the module.
Initial Estimation Module:
The initial estimation decoder block focuses on learning the IID from the respective initial estimates of the intrinsics (Fig. 3). It consists of two parallel decoders. The Reflectance decoder learns to predict the first estimation from the average reflectance estimate features and the image features. The features are further augmented with the learned boundaries from the reflectance edge decoder passed through an attention layer [tang2020]. The semantic features are also passed to the decoder to guide global object transitions and act as an additional attention. Similarly, the Shading decoder only receives the inverse shading estimate features and the image features, focusing on properties like smoother (shading) gradient changes. The reflectance and shading decoders are interconnected to provide an additional cue to learn an inverse of each other. Skip connections from the respective encoders to the decoders are also given. This allows the network to learn an implicit scaling on top of the average reflectance and the inverse shading estimation. The output at this stage is guided by transition and reflectance boundaries and may suffer from local inconsistencies like shading-reflectance leakages.
Final Correction Module:
To deal with local inconsistencies, a final correction module is proposed. First, the reflectance edge from the edge decoder and the reflectance from the previous decoder are concatenated and passed through a feature calibration layer. This allows the network to focus on local inconsistencies guided by global boundaries. The output is then passed through a final reflectance encoder. The shading from the previous module is similarly passed through another encoder block. The outputs of these two encoders are then passed through another set of parallel decoders for the final reflectance and shading output. Since the reflectance and shading from the previous block are already globally consistent, this decoder acts as a localised correction. To constrain the corrections to local homogeneous regions, skip connections (through attention layers) of encoded reflectance edge features are provided to the decoders. In this way, the network limits the corrections to the local homogeneous regions and recovers local structures like textures. Skip connections from the respective reflectance and shading encoders are provided to include high frequency information transfer. The reflectance and shading features in the decoder are shared with each other to enforce an implicit image formation model. Fig. 4 shows the overview of the module.
3.3 Dataset
Unreal Engine [unrealengine] is used to generate a dataset suited for the task. The rendering engine supports physically based rendering, with real-time raytracing (RTX) support. The engine first calculates the intrinsic components from the various material and geometry properties of the objects making up the scene. Then, the illumination is physically simulated through ray tracing and the lighting is calculated. Finally, all these results are combined to render the final image. Since the engine calculates the intrinsic components, the ground truth intrinsics are recovered from the respective buffers. The dataset consists of dense reflectance and shading ground truths. The network learns the inversion of this process.
Assets from the Unreal Marketplace are used to generate the dataset. These assets are professionally created to be photo-realistic. The generated images are divided into training, validation and testing sets. To evaluate the generalisability of the network, Intrinsic Images in the Wild (IIW) [Bell2014] is used as a real-world test. Fig. 5 shows a number of samples from the dataset. The generated dataset is comparatively small. However, the purpose of the dataset is that the network learns an efficient physics guided representation, rather than a dataset dependent one. The pretrained model and the dataset will be made available.
3.4 Loss Functions and Training details
An MSE loss is applied to each output of the network: (i) the initial estimation losses ($\mathcal{L}_{e}$ & $\mathcal{L}_{ie}$) and (ii) the final correction loss ($\mathcal{L}_{fc}$). $\mathcal{L}_{e}$ is the loss applied on the scale space reflectance edge. $\mathcal{L}_{ie}$ is the loss on the reflectance and shading output from the initial estimation module. Additional losses are also applied on the reflectance and shading output from the final correction module. This reflectance and shading are also combined and compared with the input image for a reconstruction loss. These losses are collected in the term $\mathcal{L}_{fc}$. An invariance loss is also applied to the prediction of the network for each segment. A Total Variation (TV) loss ($\mathcal{L}_{tv}$) is included to deal with the assumption that large indoor classes like walls and ceilings are homogeneously coloured. This loss is only applied to ceiling and wall pixels and minimises the TV between the prediction and the ground truth reflectance. Finally, to encourage perceptually consistent and sharper textures, a perceptual and a DSSIM loss are included and grouped as $\mathcal{L}_{p}$. The final loss term to minimise for the network thus becomes:
$$\mathcal{L} = \lambda_{e}\,\mathcal{L}_{e} + \lambda_{ie}\,\mathcal{L}_{ie} + \mathcal{L}_{fc} + \mathcal{L}_{tv} + \mathcal{L}_{p} \qquad (1)$$
where $\lambda_{e}$ and $\lambda_{ie}$ are weighting terms for the edge and initial estimation losses, both set empirically. The network is trained for 60 epochs with the Adam [kingma2014] optimiser. Please refer to the supplementary materials for more details.
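The way these terms interact can be sketched with a few helper functions. This is an illustrative sketch only: the reconstruction loss assumes a single-channel shading multiplied onto the reflectance, the TV term is read here as the total variation of the residual between prediction and ground truth inside the wall and ceiling mask, and the default weights are placeholders, not the values used in the paper. All function names are hypothetical.

```python
import numpy as np

def mse(a, b):
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def reconstruction_loss(reflectance, shading, image):
    # I = R x S, with single-channel shading broadcast over the colour channels.
    return mse(np.asarray(reflectance) * np.asarray(shading)[..., None], image)

def masked_tv(prediction, target, mask):
    """Mean absolute neighbour difference of (prediction - target) inside mask."""
    residual = np.asarray(prediction, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    dx = np.abs(residual[:, 1:] - residual[:, :-1])
    dy = np.abs(residual[1:, :] - residual[:-1, :])
    valid_x = mask[:, 1:] & mask[:, :-1]  # both neighbours inside the mask
    valid_y = mask[1:, :] & mask[:-1, :]
    count = int(valid_x.sum() + valid_y.sum())
    if count == 0:
        return 0.0
    return float((dx[valid_x].sum() + dy[valid_y].sum()) / count)

def total_loss(edge, initial, final, tv, perceptual, lambda_edge=1.0, lambda_init=1.0):
    # Only the edge and initial estimation terms carry a weight (placeholder values).
    return lambda_edge * edge + lambda_init * initial + final + tv + perceptual
```

Restricting the TV term to pairs of neighbouring pixels that both lie inside the mask keeps the homogeneity assumption from leaking across the boundary into textured objects.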
4 Experiments
4.1 Ablation Study
To study the influence of different architecture components and losses, an ablation study is conducted. For a fair evaluation, the ablation study is performed on the test-set of the rendered dataset. For all the ablations, all hyper-parameters are kept constant. The results of the ablation study are presented in table 1.
Method | Reflectance MSE | Reflectance LMSE | Reflectance DSSIM | Shading MSE | Shading LMSE | Shading DSSIM |
---|---|---|---|---|---|---|
w/o Final Correction | 0.0029 | 0.0020 | 0.0225 | 0.0044 | 0.0035 | 0.0276 |
w/o Priors | 0.0105 | 0.0047 | 0.0444 | 0.0054 | 0.0034 | 0.0399 |
w Canny Edges | 0.0032 | 0.0037 | 0.0229 | 0.0031 | 0.0049 | 0.0293 |
w/o Average Estimates | 0.0030 | 0.0023 | 0.0232 | 0.0041 | 0.0043 | 0.0267 |
w/o Reflectance Edge Module | 0.0097 | 0.0156 | 0.3254 | 0.0033 | 0.0061 | 0.0270 |
No DSSIM Loss | 0.0131 | 0.0240 | 0.3704 | 0.0041 | 0.0055 | 0.1488 |
No Perceptual Loss | 0.0032 | 0.0022 | 0.0289 | 0.0032 | 0.0038 | 0.0285 |
No Invariant & Homogeneity Loss | 0.0032 | 0.0027 | 0.0288 | 0.0024 | 0.0024 | 0.0318 |
Proposed | 0.0026 | 0.0018 | 0.0219 | 0.0030 | 0.0033 | 0.0252 |
Influence of final correction module:
In this experiment, the influence of the final correction module is studied. The output from the initial estimation decoder is taken as the final output.
From the results, it is shown that the final correction module helps in improving the outputs. The improvement in the DSSIM metric for both components shows that the final correction module is able to deal with structural artefacts.
Influence of priors:
The influence of all the priors is studied in this ablation. The additional priors are removed, and the network only receives the image as an input. All network structures are kept the same. This setup studies if the network can disentangle the additional information from the input image without any specific priors.
Removing all priors makes the network perform worse on all metrics. In this setting, the network only uses the image to derive both the reflectance and shading changes. This is challenging for strong illumination effects. This shows that the priors are an important source of information enabling a better disentanglement between intrinsic components.
Influence of specialised edges:
This experiment studies the need of specialised edges obtained from the semantic transition boundaries and invariant features. The edges obtained from the input image are provided to the network. The study focuses on whether the network can distinguish between reflectance, geometry, and shadow edges directly from the image.
From the results, it is shown that using image edges is not sufficient. Image edges can be ambiguous due to the presence of shadow edges. However, the performance is still better than using the image as the only input, showing that edges yield, to a certain extent, useful transition information.
Influence of reflectance and shading estimate priors:
In this experiment, the efficacy of the statistic-based homogeneous reflectance and the inverted shading estimate is studied.
Removing the average reflectance and shading estimates degrades the performance. With the priors of the estimates, the network can use its learning capacity to deal with the scaling of the initial estimation to obtain the correct IID. Without them, the network needs to learn the colour as well as the scaling within the same learning capacity.
Influence of reflectance edge guidance module:
For this experiment, the edge guidance module is removed. As such, the network is then forced to learn the attention and the reflectance transition boundaries implicitly as part of the solution space.
Removing the reflectance edge module gives the second worst result. This shows that, apart from the priors, the ability to use those features to learn a reflectance transition is useful. It is shown that without such a transition guidance, the network is susceptible to misclassifying shadow edges as reflectance transitions. Furthermore, it is shown that without this module, the reflectance performance suffers more than the shading performance. Hence, using a learned edge guidance allows the network to be more flexible and better able to distinguish true reflectance transitions from shadow edges.
Influence of different losses:
The influence of the different losses is studied in this experiment. For each sub-experiment, the same proposed structure is used, and the respective losses are selectively turned off.
From the results, it is shown that the DSSIM loss contributes to a large extent to the performance, because this loss penalises perceptual variations like contrast, luminance, and structure. As such, by removing this supervision, the network learns an absolute difference which is not sensitive to smaller spatial changes. A similar trend of performance decrease is shown when removing the perceptual and homogeneity losses. This is expected since both losses contribute to region consistency. With the addition of the losses on the reflectance, the shading values suffer slightly. However, structurally they perform better when including the losses, as shown by the DSSIM metric. This indicates that applying such a loss helps not only to achieve a better reflectance value, but also jointly improves shading, resulting in sharper outputs.
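To make the luminance, contrast and structure terms concrete, the following sketch computes a simplified structural dissimilarity. It assumes the common definition DSSIM = (1 - SSIM) / 2 and, unlike the usual windowed SSIM, evaluates the statistics once over the whole image. The window size and exact variant used in the paper are not stated, so this is an illustration rather than the loss as implemented.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """SSIM computed from global image statistics (no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # stabilises the luminance term
    c2 = (0.03 * data_range) ** 2  # stabilises the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)

def dssim(x, y, data_range=1.0):
    return (1.0 - global_ssim(x, y, data_range)) / 2.0
```

Identical images give a DSSIM of zero, while an image with inverted structure approaches the maximum, which is why this term reacts to structural artefacts that a per-pixel MSE treats no differently from any other error of the same magnitude.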
4.2 Comparison to State of the Art
On the proposed Dataset:
To study the influence of the dataset, the proposed network is compared to the performance of baseline algorithms. For these experiments, the standard MSE, LMSE and DSSIM metrics are used. The baselines are chosen based on their performance on the Weighted Human Disagreement Rate (WHDR) metric, widely used in the literature. Hence, [Li2018ECCV] is chosen as a baseline. [Zhou2019] does not provide any publicly available code and hence is not included. Although [Fan2018] is the state of the art, their provided code generates errors when run on custom datasets and hence is not used for comparison. For completeness, [Xu2020] and [Shi2017] are also compared. [Xu2020] uses an optimization-based method based on the pioneering Retinex model. Since it is a purely physical constraint-based model, it is included for comparison. For a fair comparison, methods focusing on indoor scenes are used. [Baslamisli2018ECCV] assumes outdoor settings and requires semantic ground truths to train and hence is not included. All networks are retrained on the dataset proposed in this paper, using the optimum hyperparameters mentioned in the respective publications. The results are shown in Table 2 and Figure 6.
Method | Reflectance MSE | Reflectance LMSE | Reflectance DSSIM | Shading MSE | Shading LMSE | Shading DSSIM |
---|---|---|---|---|---|---|
ShapeNet [Shi2017] | 0.0084 | 0.0133 | 0.1052 | 0.0065 | 0.0129 | 0.1862 |
STAR [Xu2020] | 0.0304 | 0.0166 | 0.1180 | 0.0290 | 0.0128 | 0.1572 |
CGIntrinsics [Li2018ECCV] | 0.0211 | 0.0156 | 0.0976 | 0.0848 | 0.0577 | 0.2180 |
Proposed | 0.0026 | 0.0018 | 0.0219 | 0.0030 | 0.0033 | 0.0252 |
From the table, it is shown that the proposed model achieves the best scores, i.e., the lowest errors on all metrics. From the figure, the baselines suffer from strong illumination effects. CGIntrinsics discolours the regions while STAR mostly fails. ShapeNet suffers from artefacts and colour variations around the illumination regions. In comparison, the proposed network is able to recover from such effects.
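For readers unfamiliar with the LMSE metric reported in the tables, the sketch below follows the commonly used local, scale-invariant formulation: the prediction is rescaled per window by the least-squares optimal factor before the squared error is accumulated. The window size of 20 with a half-window step is an assumption for this example; the evaluation settings of the paper are not specified in the text.

```python
import numpy as np

def scale_invariant_sse(gt, pred):
    """Squared error after rescaling pred by the least-squares optimal factor."""
    denom = float(np.sum(pred * pred))
    alpha = float(np.sum(gt * pred)) / denom if denom > 1e-12 else 0.0
    return float(np.sum((gt - alpha * pred) ** 2))

def lmse(gt, pred, window=20):
    """Local scale-invariant MSE over overlapping windows (step = window // 2)."""
    gt = np.asarray(gt, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    step = max(window // 2, 1)
    height, width = gt.shape[:2]
    error, norm = 0.0, 0.0
    for i in range(0, max(height - window, 0) + 1, step):
        for j in range(0, max(width - window, 0) + 1, step):
            g = gt[i:i + window, j:j + window]
            p = pred[i:i + window, j:j + window]
            error += scale_invariant_sse(g, p)
            norm += float(np.sum(g * g))
    return error / norm if norm > 0 else 0.0
```

A prediction that differs from the ground truth only by a global scale scores zero under LMSE but not under plain MSE, which is why the two metrics are reported side by side.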
On IIW [Bell2014]:
The proposed network is finetuned on the IIW dataset and compared to the baselines. The training and testing splits are used as specified in the original publication. For the baselines, the numbers are obtained from the respective original publications. The results are shown in table 3 and visuals in figure 7.
Methods | WHDR (mean) |
---|---|
Direct Intrinsics [Narihia2015] | 37.3 |
Color Retinex [Grosse2009] | 26.9 |
Garces et al. [Garces2012] | 25.5 |
Zhao et al. [Zhao2012] | 23.2 |
IIW [Bell2014] | 20.6 |
Nestmeyer et al. [Nestmeyer2016] | 19.5 |
Bi et al. [Bi2015] | 17.7 |
Sengupta et al. [Sengupta2019] | 16.7 |
Li et al. [Li2020] | 15.9 |
CGIntrinsics [Li2018ECCV] | 15.5 |
GLoSH [Zhou2019] | 15.2 |
Fan et al. [Fan2018] | 15.4 |
Proposed | 15.2 |
CGIntrinsics [Li2018ECCV]* | 14.8 |
GLoSH [Zhou2019]* | 14.6 |
Fan et al. [Fan2018]* | 14.5 |
Proposed* | 13.9 |
The IIW dataset does not contain dense ground truth and hence the network is only finetuned with the ordinal loss. A guided filter [Nestmeyer2016] is used to further improve the results; filtered results are marked with * in the table. Overall, our proposed method is on par with GLoSH [Zhou2019], which is the best performing method without any post filtering. However, they need both lighting and normal information as supervision, while the proposed method is trained with just reflectance and shading, along with a smaller dataset than that of [Zhou2019]. For the filtered results, the proposed method achieves a WHDR of 13.9, a clear lead over the current best of 14.5 obtained by [Fan2018], showing the efficiency of the current model.
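The WHDR scores above can be understood through a small sketch of the metric as defined for IIW [Bell2014]: each human judgement compares two points and states which one has the darker reflectance (or that they are about equal), and the prediction is compared against it using a relative threshold. The tuple layout of `judgements` and the label strings are assumptions for this example, and the threshold of 0.10 is the value commonly used with this benchmark.

```python
import numpy as np

def whdr(reflectance, judgements, delta=0.10):
    """Weighted Human Disagreement Rate.

    reflectance: array (H, W) or (H, W, 3) of predicted reflectance.
    judgements: iterable of ((r1, c1), (r2, c2), label, weight) with label
        "1" (point 1 darker), "2" (point 2 darker) or "E" (about equal).
    """
    error, total = 0.0, 0.0
    for (r1, c1), (r2, c2), label, weight in judgements:
        l1 = max(float(np.mean(reflectance[r1, c1])), 1e-10)
        l2 = max(float(np.mean(reflectance[r2, c2])), 1e-10)
        if l2 / l1 > 1.0 + delta:
            predicted = "1"  # point 1 is darker
        elif l1 / l2 > 1.0 + delta:
            predicted = "2"  # point 2 is darker
        else:
            predicted = "E"
        if predicted != label:
            error += weight
        total += weight
    return error / total if total > 0 else 0.0
```

The function returns a fraction in [0, 1]; the table reports the same quantity as a percentage.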
5 Conclusions
In this paper, an end-to-end prior driven approach for indoor scenes has been proposed for the task of intrinsic image decomposition. Reflectance transitions and illumination invariant descriptors have been used to guide the reflectance decomposition. Image statistics-based priors have been used to provide the network with a starting point for learning. To integrate explicit homogeneous constraints, a progressive CNN was used. To train the network, a custom physically rendered dataset was proposed.
An extensive ablation was performed to validate the proposed network, showing that: i) using explicit reflectance transition priors helps the network to achieve an improved intrinsic image decomposition, ii) image statistics-based priors are helpful for simplifying the problem and iii) the proposed method attains state-of-the-art performance on the standardised real-world dataset IIW.
6 Supplementary: Network Architecture Details
The network consists of three basic modules: i) encoder, ii) decoder and iii) attention. Figs. 8, 9 and 10 visualise these modules.
These structures are used iteratively to construct the larger modules.
Global Encoder Module:
All encoders share the same structure. Table 4 lists the configuration.
| Name | Layer | Kernel Size, Stride, Padding | Output Size |
|---|---|---|---|
| Input | conv1 | 3x3x64, 1, 1 | 256x256x64 |
| | conv1 | 3x3x64, 1, 1 | 256x256x64 |
| | conv2 | 3x3x64, 2, 1 | 128x128x64 |
| | conv2 | 3x3x128, 1, 1 | 128x128x128 |
| | conv3 | 3x3x128, 2, 1 | 64x64x128 |
| | conv3 | 3x3x256, 1, 1 | 64x64x256 |
| | conv4 | 3x3x256, 2, 1 | 32x32x256 |
| | conv4 | 3x3x512, 1, 1 | 32x32x512 |
| | conv5 | 3x3x512, 2, 1 | 16x16x512 |
| Bottleneck | conv5 | 3x3x512, 1, 1 | 16x16x512 |
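The output sizes in the encoder table follow from the standard convolution size formula, $\lfloor (n + 2p - k) / s \rfloor + 1$. The short sketch below replays the listed layers for a 256x256 input; the helper names and the list encoding of the table are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution (floor division as in common frameworks)."""
    return (size + 2 * padding - kernel) // stride + 1

# (layer name, output channels, stride) as listed in the encoder table;
# every layer uses a 3x3 kernel with padding 1.
ENCODER_LAYERS = [
    ("conv1", 64, 1), ("conv1", 64, 1),
    ("conv2", 64, 2), ("conv2", 128, 1),
    ("conv3", 128, 2), ("conv3", 256, 1),
    ("conv4", 256, 2), ("conv4", 512, 1),
    ("conv5", 512, 2), ("conv5", 512, 1),
]

def encoder_shapes(input_size=256):
    shapes = []
    size = input_size
    for _name, channels, stride in ENCODER_LAYERS:
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        shapes.append((size, size, channels))
    return shapes
```

Each stride-2 layer halves the resolution, so the four downsampling steps take the 256x256 input to the 16x16x512 bottleneck.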
Reflectance Edge Module:
The configuration for the reflectance edge decoder module is listed in Table 5. The features being multiplied and added denote skip connections from the respective encoders. All features are first depth-wise concatenated before being passed through the decoder structures. The module has three scaled outputs: $64 \times 64$, $128 \times 128$ and $256 \times 256$.
| Name | Layer | Kernel Size, Stride, Padding | Output Size |
|---|---|---|---|
| BottleNeck | deconv1 | 4x4x(512 * 3), 2, 1 | 32x32x512 |
| | deconv2 | 4x4x(512 * 4), 2, 1 | 64x64x512 |
| | deconv3 | 4x4x(512 + (256 * 3)), 2, 1 | 128x128x256 |
| | deconv4 | 4x4x(256 + (128 * 3)), 2, 1 | 256x256x128 |
| | conv | 3x3x(128 + (64 * 3)), 1, 1 | 256x256x64 |
| Full Edge 256 | conv | 3x3x3, 1, 1 | 256x256x3 |
| Edge Output 64 | conv | 3x3x3, 1, 1 | 256x256x3 |
| Edge Output 128 | conv | 3x3x3, 1, 1 | 256x256x3 |
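The decoder sizes and input channel counts in the table above can be verified with the transposed convolution size formula, $(n - 1)s - 2p + k$, assuming no output padding and unit dilation. The sketch below is a consistency check on the listed numbers; the encoding of the table as Python tuples is illustrative.

```python
def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# (layer, previous decoder channels, skip channels per encoder, number of skips,
#  output channels), following the kernel column of the edge decoder table.
EDGE_DECODER = [
    ("deconv1", 0, 512, 3, 512),   # three concatenated 16x16x512 bottlenecks
    ("deconv2", 512, 512, 3, 512),
    ("deconv3", 512, 256, 3, 256),
    ("deconv4", 256, 128, 3, 128),
]

def edge_decoder_shapes(bottleneck_size=16):
    rows = []
    size = bottleneck_size
    for name, prev_ch, skip_ch, n_skips, out_ch in EDGE_DECODER:
        in_channels = prev_ch + skip_ch * n_skips
        size = deconv_out(size)
        rows.append((name, in_channels, (size, size, out_ch)))
    return rows
```

The resulting input channel counts (1536, 2048, 1280 and 640) match the expressions 512 * 3, 512 * 4, 512 + (256 * 3) and 256 + (128 * 3), and each layer doubles the resolution from 16 up to 256.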
Initial Estimation Module:
Tables 6 and 7 show the module configuration. The decoder is a joint decoder in the style of [Shi2017], i.e., the reflectance decoder blocks use the previous shading decoder outputs together with the previous reflectance decoder block outputs (denoted by the factor of 2 in the tables). The reflectance decoder receives the encoder features from the image encoder, the reflectance estimate encoder, and the semantic encoder as skip connections (denoted by the sums in parentheses). Additionally, the edge decoder features and the current reflectance decoder outputs are passed through the attention layer before being passed to the next block. Similarly, the shading decoder blocks receive the shading estimate encoder and the image encoder features as skip connections.
| Name | Layer | Kernel Size, Stride, Padding | Output Size |
|---|---|---|---|
| BottleNeck | deconv1 | 4x4x(512 * 3), 2, 1 | 32x32x512 |
| | Attention | reflect edge (re) & deconv1 | 32x32x512 |
| | deconv2 | 4x4x(512 * 2 + (512 + 512 + 512)), 2, 1 | 64x64x512 |
| | Attention | re deconv2 & deconv2 | 64x64x512 |
| | deconv3 | 4x4x(512 * 2 + (256 + 256 + 256)), 2, 1 | 128x128x256 |
| | Attention | re deconv3 & deconv3 | 128x128x256 |
| | deconv4 | 4x4x(256 * 2 + (128 + 128 + 128)), 2, 1 | 256x256x128 |
| | Attention | re deconv4 & deconv4 | 256x256x128 |
| | conv6 | 3x3x(128 * 2 + (64 + 64 + 64)), 1, 1 | 256x256x64 |
| | conv6 | 3x3x3, 1, 1 | 256x256x3 |
| Initial Reflectance Output | Attention | re output & conv6 | 256x256x3 |
| Name | Layer | Kernel Size, Stride, Padding | Output Size |
|---|---|---|---|
| BottleNeck | deconv1 | 4x4x(512 * 2), 2, 1 | 32x32x512 |
| | Attention | reflect edge (re) & deconv1 | 32x32x512 |
| | deconv2 | 4x4x(512 * 2 + (512 + 512)), 2, 1 | 64x64x512 |
| | Attention | re deconv2 & deconv2 | 64x64x512 |
| | deconv3 | 4x4x(512 * 2 + (256 + 256)), 2, 1 | 128x128x256 |
| | Attention | re deconv3 & deconv3 | 128x128x256 |
| | deconv4 | 4x4x(256 * 2 + (128 + 128)), 2, 1 | 256x256x128 |
| | Attention | re deconv4 & deconv4 | 256x256x128 |
| | conv6 | 3x3x(128 * 2 + (64 + 64)), 1, 1 | 256x256x64 |
| | conv6 | 3x3x1, 1, 1 | 256x256x1 |
| Initial Shading Output | Attention | re output & conv6 | 256x256x1 |
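
The channel expressions in the kernel column of the two decoder tables can be read as simple arithmetic. The sketch below is an interpretation rather than the authors' code: it assumes that the `* 2` factor stands for the concatenated outputs of the previous reflectance and shading decoder blocks, and that the bracketed sum lists the encoder skip connections. The helper `decoder_in_channels` is hypothetical.

```python
def decoder_in_channels(prev_channels, skip_channels, joint=True):
    """Input channels of a decoder block after depth-wise concatenation.

    prev_channels: channels of one previous decoder block output.
    skip_channels: channel counts of the encoder skip connections.
    joint: if True, the previous outputs of both decoder branches are used.
    """
    branches = 2 if joint else 1
    return prev_channels * branches + sum(skip_channels)


# Reflectance decoder, deconv2: image, reflectance estimate and semantic encoder skips.
print(decoder_in_channels(512, [512, 512, 512]))  # 2560
# Shading decoder, deconv2: shading estimate and image encoder skips.
print(decoder_in_channels(512, [512, 512]))  # 2048
```

Under this reading, the reflectance branch is consistently wider than the shading branch because it receives one additional (semantic) skip connection per block.
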
Final Correction Module:
This module consists of three encoders and two decoders: i) the reflectance edge encoder (), ii) the initial reflectance estimation encoder (), iii) the initial shading estimation encoder (), iv) the final reflectance decoder and v) the final shading decoder. The encoders use the previously introduced configuration detailed in table 4. Tables 9 and 10 give an overview of the final decoder configurations. As in the previous module, a joint decoder structure is used. The reflectance decoder receives skip connections from . In addition, the features from and are passed through an attention layer and forwarded as an additional skip connection. Similarly, the shading decoder receives skip connections from .
| Name | Layer | Kernel Size, Stride, Padding | Output Size |
|---|---|---|---|
| | reflec conv1 | 1x1x(3 + 3), 1, 0 | 256x256x8 |
| | reflec conv1 | 1x1x16, 1, 0 | 256x256x16 |
The input to the reflectance encoder is the concatenation of the reflectance edge and the initial reflectance estimate from the previous decoder. This concatenated output is first fed through a feature calibration module (detailed in table 8) before being passed to the final reflectance decoder.
| Name | Layer | Kernel Size, Stride, Padding | Output Size |
|---|---|---|---|
| BottleNeck | deconv1 | 4x4x512, 2, 1 | 32x32x512 |
| | Attention | | 32x32x512 |
| | deconv2 | 4x4x(512 * 2 + (512 + 512)), 2, 1 | 64x64x512 |
| | Attention | | 64x64x512 |
| | deconv3 | 4x4x(512 * 2 + (256 + 256)), 2, 1 | 128x128x256 |
| | Attention | | 128x128x256 |
| | deconv4 | 4x4x(256 * 2 + (128 + 128)), 2, 1 | 256x256x128 |
| | Attention | | 256x256x128 |
| | conv6 | 3x3x(128 * 2 + (64 + 64)), 1, 1 | 256x256x64 |
| Final Reflectance Output | conv6 | 3x3x3, 1, 1 | 256x256x3 |
| Name | Layer | Kernel Size, Stride, Padding | Output Size |
|---|---|---|---|
| BottleNeck | deconv1 | 4x4x512, 2, 1 | 32x32x512 |
| | Attention | | 32x32x512 |
| | deconv2 | 4x4x(512 * 2 + (512 + 512)), 2, 1 | 64x64x512 |
| | Attention | | 64x64x512 |
| | deconv3 | 4x4x(512 * 2 + (256 + 256)), 2, 1 | 128x128x256 |
| | Attention | | 128x128x256 |
| | deconv4 | 4x4x(256 * 2 + (128 + 128)), 2, 1 | 256x256x128 |
| | Attention | | 256x256x128 |
| | conv6 | 3x3x(128 * 2 + (64 + 64)), 1, 1 | 256x256x64 |
| Final Shading Output | conv6 | 3x3x1, 1, 1 | 256x256x1 |
7 Supplementary: Loss Function Details
The losses are broadly grouped into i) initial estimation loss ($\mathcal{L}_{edge}$ & $\mathcal{L}_{init}$), ii) final correction loss ($\mathcal{L}_{final}$), and iii) invariant and homogeneity constraint loss ($\mathcal{L}_{inv}$ & $\mathcal{L}_{hom}$). All losses use the standard MSE loss.
1) Initial Estimation Loss:
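The initial estimation block outputs consist of the three scales of reflectance edges, along with the initial reflectance and shading estimations. The reflectance edge outputs consist of $64 \times 64$ and $128 \times 128$ outputs, and the full resolution of $256 \times 256$. The total edge loss is defined by:
$$\mathcal{L}_{edge} = \mathcal{L}_{e}^{64} + \mathcal{L}_{e}^{128} + \mathcal{L}_{e}^{256} \tag{2}$$
where $\mathcal{L}_{e}^{64}$, $\mathcal{L}_{e}^{128}$ and $\mathcal{L}_{e}^{256}$ are the losses on the reflectance edges with resolutions of $64 \times 64$, $128 \times 128$ and $256 \times 256$, respectively. The ground truth for these losses is generated by applying a Canny edge operator to the reflectance ground truth. The initial estimations are matched to the ground truth reflectance and shading, hence: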
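$$\mathcal{L}_{init} = \mathcal{L}_{R}^{init} + \mathcal{L}_{S}^{init} \tag{3}$$
where $\mathcal{L}_{R}^{init}$ is the initial reflectance estimation loss and $\mathcal{L}_{S}^{init}$ is the initial shading estimation loss.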
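
The multi-scale edge supervision can be illustrated with a toy example. This is a sketch in plain Python under stated assumptions: edge maps are small 2-D lists, the ground-truth edge map is brought to each scale by nearest-neighbour subsampling (the source does not say how the targets are resized), and the per-scale terms are summed without weights. The names `mse`, `downsample` and `edge_loss` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2-D grids (lists of rows)."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def downsample(grid, factor):
    """Nearest-neighbour subsampling, a stand-in for resizing the target."""
    return [row[::factor] for row in grid[::factor]]


def edge_loss(preds, gt_edges):
    """Sum of per-scale MSE terms; `preds` maps a downsampling factor to a prediction."""
    return sum(mse(pred, downsample(gt_edges, f)) for f, pred in preds.items())


gt = [[0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]]
# Factors 4, 2 and 1 play the role of the 64, 128 and 256 pixel outputs.
perfect = {f: downsample(gt, f) for f in (4, 2, 1)}
print(edge_loss(perfect, gt))  # 0.0
```

In a real pipeline the binary target would come from an edge detector applied to the ground-truth reflectance, and the predictions from the three edge output heads.
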
2) Final Correction Loss:
For the final outputs of the network a reconstruction loss is added. This ensures that the network can learn the image formation model. On top of this loss, the reflectance and shading are also compared to the ground truth. The final loss is:
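$$\mathcal{L}_{final} = \mathcal{L}_{R} + \mathcal{L}_{S} + \mathcal{L}_{rec} \tag{4}$$
where $\mathcal{L}_{R}$ and $\mathcal{L}_{S}$ are the losses between the network predictions and the ground truth reflectance and shading, respectively. $\mathcal{L}_{rec}$ is the reconstruction loss between the product of the predicted reflectance and shading and the input image.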
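
The role of the reconstruction term can be seen in a few lines. The following is a minimal sketch, assuming single-channel values stored in flat lists and an element-wise product as the image formation model; the three terms are summed without weights, as described above. The function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally long flat lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def final_loss(r_pred, s_pred, r_gt, s_gt, image):
    """Reflectance loss + shading loss + reconstruction loss."""
    recon = [r * s for r, s in zip(r_pred, s_pred)]  # image formation: I = R * S
    return mse(r_pred, r_gt) + mse(s_pred, s_gt) + mse(recon, image)


r_gt, s_gt = [0.5, 0.8], [1.0, 0.5]
image = [r * s for r, s in zip(r_gt, s_gt)]
print(final_loss(r_gt, s_gt, r_gt, s_gt, image))  # 0.0
```

An error in the reflectance alone is penalised twice, once directly and once through the reconstruction, which couples the two predictions to the input image.
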
3) Invariant and Homogeneity Constraint Loss:
In addition to the standard losses, constraint-specific losses are applied to explicitly encourage invariance and homogeneity in the reflectance. Given a segment obtained from the segmentation map, the normalised RGB values are compared between the predicted reflectance and the image. Since normalised RGB values are illumination-invariant features, the predicted reflectance should have similar values for the corresponding pixels:
$$\mathcal{L}_{inv} = \mathrm{MSE}\left(\hat{R}_{c}^{nrgb},\, I_{c}^{nrgb}\right) \tag{5}$$
where $\mathcal{L}_{inv}$ is the loss between the normalised RGB of the predicted reflectance ($\hat{R}_{c}^{nrgb}$) and of the image ($I_{c}^{nrgb}$), for the pixels in the segment belonging to class $c$.
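
The reasoning behind this constraint can be checked numerically. The sketch below assumes that the invariant features are normalised RGB chromaticities (each channel divided by the channel sum) and that shading is a single grey-scale factor per pixel; under these assumptions, scaling a pixel by its shading leaves the normalised values unchanged. The function names and the small `eps` guard are hypothetical.

```python
def normalised_rgb(pixel, eps=1e-8):
    """Divide each channel by the channel sum; eps guards against black pixels."""
    total = sum(pixel) + eps
    return tuple(c / total for c in pixel)


def invariant_loss(reflectance, image, segment_mask):
    """MSE between normalised RGB of the prediction and the image inside a segment."""
    total, count = 0.0, 0
    for r_px, i_px, inside in zip(reflectance, image, segment_mask):
        if not inside:
            continue
        for a, b in zip(normalised_rgb(r_px), normalised_rgb(i_px)):
            total += (a - b) ** 2
            count += 1
    return total / count if count else 0.0


albedo = [(0.6, 0.3, 0.1), (0.2, 0.2, 0.6)]
shading = [0.5, 0.9]
image = [tuple(c * s for c in px) for px, s in zip(albedo, shading)]
# The loss is numerically zero when the prediction equals the true albedo.
assert invariant_loss(albedo, image, [1, 1]) < 1e-12
```

A prediction with the wrong chromaticity inside the segment yields a non-zero loss regardless of how bright or dark the shading makes the image there.
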
In addition, large classes such as walls and ceilings, especially indoors, consist of largely homogeneous regions. Explicit homogeneity supervision is provided in the form of a total variation (TV) loss:
$$\mathcal{L}_{hom} = \mathrm{MSE}\left(\mathrm{TV}(\hat{R}_{w}),\, \mathrm{TV}(R_{w})\right) \tag{6}$$
where $\mathcal{L}_{hom}$ is the MSE between the total variation of the predicted reflectance ($\hat{R}_{w}$) and the ground truth reflectance ($R_{w}$), for the pixels belonging to the ceiling and wall classes.
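
One possible reading of this homogeneity term is sketched below. It assumes an anisotropic total variation computed per pixel as the sum of absolute forward differences, followed by an MSE between the two variation maps restricted to a wall or ceiling mask; the source does not specify the exact TV variant, so this is an illustration only. Images are single-channel 2-D lists and the function names are hypothetical.

```python
def tv_map(img):
    """Per-pixel variation: |right - here| + |down - here|, zero at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx = abs(img[y][x + 1] - img[y][x]) if x + 1 < w else 0.0
            dy = abs(img[y + 1][x] - img[y][x]) if y + 1 < h else 0.0
            out[y][x] = dx + dy
    return out


def homogeneity_loss(pred, gt, mask):
    """MSE between the variation maps of prediction and ground truth inside the mask."""
    tv_p, tv_g = tv_map(pred), tv_map(gt)
    diffs = [
        (tv_p[y][x] - tv_g[y][x]) ** 2
        for y in range(len(mask))
        for x in range(len(mask[0]))
        if mask[y][x]
    ]
    return sum(diffs) / len(diffs) if diffs else 0.0


flat_wall = [[0.5, 0.5], [0.5, 0.5]]
noisy_pred = [[0.5, 0.7], [0.5, 0.5]]
wall_mask = [[1, 1], [1, 1]]
assert homogeneity_loss(flat_wall, flat_wall, wall_mask) == 0.0
assert homogeneity_loss(noisy_pred, flat_wall, wall_mask) > 0.0
```

Because the ground-truth reflectance of a wall is nearly constant, any texture or shading residue that leaks into the predicted reflectance raises its variation and is penalised.
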
Finally, to make the network produce perceptually consistent outputs, a perceptual loss with a pretrained VGG16 [Simonyan2015] and a structural dissimilarity (DSSIM) loss are also added:
$$\mathcal{L}_{pd} = \lambda_{p}\, \mathcal{L}_{perc} + \lambda_{d}\, \mathcal{L}_{dssim} \tag{7}$$
where $\mathcal{L}_{perc}$ is the perceptual loss between the predicted and ground truth reflectance, and $\mathcal{L}_{dssim}$ is the structural dissimilarity between the predicted and ground truth reflectance and shading. The weights $\lambda_{p}$ and $\lambda_{d}$ are set empirically.
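
The structural dissimilarity term can be sketched as follows. The example assumes the common definition DSSIM = (1 - SSIM) / 2 and, for brevity, computes SSIM over a single global window on flat lists instead of the usual sliding window; the perceptual VGG16 term is not reproduced here. The function names and the `data_range` parameter are hypothetical.

```python
def ssim_global(x, y, data_range=1.0):
    """SSIM computed over one global window (simplified, no sliding window)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )


def dssim(x, y, data_range=1.0):
    """Structural dissimilarity: 0 for identical inputs, larger for dissimilar ones."""
    return (1.0 - ssim_global(x, y, data_range)) / 2.0


ramp = [0.1, 0.5, 0.9]
assert abs(dssim(ramp, ramp)) < 1e-12
assert dssim(ramp, ramp[::-1]) > 0.9  # inverted structure is maximally dissimilar
```

Unlike a per-pixel MSE, this term reacts to changes in local structure, which is the reason for pairing it with the perceptual loss.
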
The final loss to minimise for the network then becomes:
(8)
where the two weighting factors of the combined loss are set to their empirically found optimum values.