FUSION: Fully Unsupervised Test-Time Stain Adaptation via Fused Normalization Statistics

Nilanjan Chattopadhyay, Shiv Gehlot, Nitin Singhal
Advanced Technology Group, AIRA MATRIX, India
Email: {nilanjan.chattopadhyay,shiv.gehlot,nitin.singhal}@airamatrix.com

Abstract

Staining reveals the micro-structure of the aspirate during the preparation of histopathology slides. Stain variation, defined as a chromatic difference between the source and the target, is caused by varying conditions during staining; it results in a distribution shift and poor performance on the target. The goal of stain normalization is to match the target’s chromatic distribution to that of the source. However, stain normalization can distort the underlying morphology, which may lead to an incorrect diagnosis. We propose FUSION, a new method for stain adaptation that adjusts the model to the target in an unsupervised test-time scenario, eliminating the need for extensive labelling at the target end. FUSION works by altering the target’s batch-normalization statistics and fusing them with the source statistics using a weighting factor. Depending on the weighting factor, the algorithm reduces to one of two extremes: using only the source statistics or only the target statistics. Despite requiring no training or supervision, FUSION surpasses existing equivalent algorithms for classification and dense prediction (segmentation), as demonstrated by comprehensive experiments on two public datasets.

Keywords:
Stain Variation Unsupervised Stain Adaptation.
* These authors contributed equally.

1 Introduction

Staining is used in histopathology slide preparation to highlight the essential structure of the aspirate. Differences in stain chemicals, lighting conditions, or staining time may produce stain colour variations in images collected at the same or at different facilities. This inter-center chromatic difference results in sub-optimal performance on the target test set, limiting model deployment across centres.

Transfer learning is a simple but effective strategy for dealing with distribution shifts, but it requires annotation at the target end, which can be challenging, especially in medical image analysis. Self-supervised learning eliminates the need for annotation by training a network to learn stain variation, mapping colour-augmented input images back to the original images [Tellez2019, shivssl]. However, the settings of the train-time augmentations influence the performance of these algorithms on the target domain. Stain normalisation handles stain variation without requiring any training at the source or target end. It aligns the source and target chromatic distributions using reference image(s) from the source domain [kothari2011automatic, mccann2014algorithm, ruifrok2001quantification, abe2005color, GUPTA2020, macenko2009method, reinhard2001color, ruderman1998statistics, magee2009colour]. However, its performance can vary greatly depending on the reference image(s), and changes in the underlying structure of the images degrade it further. Generative modelling-based solutions eliminate the need for a reference image, but they require data from the target domain as well as a larger training sample, which limits their application [IsbiZanjani2018, MidlZanjani2018, stainGan].

Domain adaptation [Ganin2015UnsupervisedDA, TzengHDS15] and test-time training [TTT] overcome these limitations, but require a modified training process. The goal of adaptation is to generalise a model trained on source data to target data. Test-time adaptation (TTA) modifies the testing procedure while preserving the initial training. Two TTA techniques explored in the literature are Entropy Minimization (EM) [wang2021tent, memo2021] and Statistics Normalization Adaptation (SNA) [vanillabna, snb]. In EM, the backpropagation that updates the adaptation parameters is driven by an entropy function formulated from the model’s predictions on the target data. In [wang2021tent], batch normalisation statistics and parameters are updated by minimising the entropy of the target predictions. For the gradient update, [memo2021] minimises the entropy of the marginal distribution computed from several augmented versions of a single test sample. SNA eliminates the need for backpropagation by focusing on batch normalisation-specific statistics (mean and variance) and requires only forward passes. During inference in [vanillabna], the batch normalisation statistics are updated with those of the target data. In contrast, [snb] updates the statistics with the target data while treating the source statistics as a prior.

FUSION, the proposed method, also performs batch statistics adaptation. FUSION is a generic technique that combines the source and target batch normalisation statistics via a weighting factor that favours either the source or the target end. The weighting factor is likewise related to the degree of stain variation between the source and the target. Because of this generic nature, FUSION outperforms conventional SNA techniques. In this study, the domain difference between the source and the target is represented by stain variation; thus, the terms “model adaptation” and “stain adaptation” are used interchangeably.

Figure 1: Panels: (a) source and target, (b) vanilla inference, (c, d) FUSION with two different weighting factors. Vanilla inference performs poorly on the target because of the considerable stain variation between the source and the target. FUSION can overcome this limitation by combining the source and target normalisation statistics, resulting in significant performance gains.

2 Fusion

Batch normalisation (BN) is built into current Convolutional Neural Networks (CNNs) for stable and faster training. BN layers normalise the features of each batch using the batch statistics (mean and variance) and two learnable parameters for scaling and shifting:

(1)

where the batch is a training batch drawn from the source data and the normalised quantities are its activations. Moving averages of the batch mean and variance are also kept during the training phase for use during inference.
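
As a minimal sketch of the training-time behaviour described above, the following pure-Python snippet normalises a single channel with its batch mean and variance, applies a scale and a shift, and keeps exponential moving averages for later inference. The names `gamma`, `shift`, `momentum` and `eps`, their default values, and the update rule `running = (1 - momentum) * running + momentum * batch_stat` are assumptions borrowed from common deep-learning frameworks; they are not notation taken from Eq. 1.

```python
from statistics import fmean, pvariance


def bn_train_step(batch, running, gamma=1.0, shift=0.0, momentum=0.1, eps=1e-5):
    """Normalise one channel of a batch and update the moving averages.

    `running` is a dict with keys "mean" and "var"; it is updated in place.
    All parameter names and defaults are illustrative assumptions.
    """
    mu = fmean(batch)
    var = pvariance(batch, mu)  # biased (population) variance of the batch
    out = [gamma * (x - mu) / (var + eps) ** 0.5 + shift for x in batch]
    # Exponential moving averages kept for use at inference time.
    running["mean"] = (1 - momentum) * running["mean"] + momentum * mu
    running["var"] = (1 - momentum) * running["var"] + momentum * var
    return out


running = {"mean": 0.0, "var": 1.0}
activations = bn_train_step([2.0, 4.0, 6.0, 8.0], running)
```

After this single step the normalised activations have zero mean, while the moving averages have only moved a fraction `momentum` of the way from their initial values towards the batch statistics.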

These statistics can differ greatly between two centres (source and target) in the case of stain variation (Fig. 1). As a result, in an inter-center training-testing setup, the training statistics may lead to poor performance during inference on a test set from a different center [vanillabna]. Replacing them with the statistics of the test batch is a natural remedy. However, we argue that for classification each test batch must then be representative of the target data distribution, which is not guaranteed because a batch may not contain samples from all classes. Accordingly, for inference on the target, Eq. 1 is modified as:

(2)

Even when each test batch represents the test data distribution, a single batch is not sufficient to describe the statistics of the complete test set. Although all test samples could be pooled into a single batch, doing so is computationally limiting. We therefore describe a computationally efficient strategy for leveraging the statistics of the complete test set during inference. In the first step, Eq. 2 is used for inference while running estimates of the mean and variance computed from the test batches are maintained (as in training). In the second step, the collected running averages are used in the final inference, altering Eq. 2 to:

(3)

where the mean and variance terms are the running statistics collected from the test batches in the first step.
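
The two-step procedure can be sketched for a single channel as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names, the momentum of 0.1, and the initial running values of 0 and 1 are all hypothetical, and the learnable scale and shift are omitted for brevity.

```python
from statistics import fmean, pvariance


def normalise(batch, mean, var, eps=1e-5):
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]


def two_step_inference(test_batches, momentum=0.1):
    """Step 1: normalise each batch with its own statistics while tracking
    running estimates. Step 2: re-normalise every batch with the collected
    running statistics."""
    run_mean, run_var = 0.0, 1.0  # assumed initial values
    first_pass = []
    for batch in test_batches:
        mu = fmean(batch)
        var = pvariance(batch, mu)
        first_pass.append(normalise(batch, mu, var))
        run_mean = (1 - momentum) * run_mean + momentum * mu
        run_var = (1 - momentum) * run_var + momentum * var
    second_pass = [normalise(b, run_mean, run_var) for b in test_batches]
    return first_pass, second_pass, (run_mean, run_var)


# Toy target data: 50 identical batches with mean 13 and variance 5.
test_batches = [[10.0, 12.0, 14.0, 16.0]] * 50
first_pass, second_pass, (run_mean, run_var) = two_step_inference(test_batches)
```

With a momentum-based update the running estimates approach the target statistics only gradually, so the number of test batches seen in the first step affects how well the second step reflects the complete test set.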

Figure 2: Conventional inference (a) uses the running statistics of the source for inference on the target. The modified statistics adaptation approach (b) replaces them with the target statistics. Unlike prior approaches to statistics adaptation, which only considered per-batch statistics, this method considers the complete target set. FUSION (c) is a generalized statistics adaptation approach that fuses the source and target statistics with a weighting factor, reducing to (a) and (b) at the two extreme values of that factor.

Fusing the Batch Normalizations

Only the statistics of the target are considered in Eq. 3, completely discarding the statistics of the source. Excessive perturbation of the source statistics may cause the performance to deteriorate. Better performance can therefore be expected from combining both sets of statistics. To this end, Eq. 3 is updated as follows:

(4)

where the weighting factor is a hyperparameter and the source terms are the moving averages recorded during training on the source. Eq. 4 is a generalized test-time statistics normalization adaptation. At one extreme of the weighting factor it reduces to Eq. 3, considering only the target statistics, while at the other extreme it uses only the source statistics. For intermediate values of the weighting factor, Eq. 4 exploits both domains.
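
A minimal sketch of the fusion step for one channel is given below. It assumes that the fused statistic is a convex combination of the source and target statistics, and that a weight of 1 selects the target while a weight of 0 selects the source; both the functional form and this direction are assumptions made for illustration, as are the names `fuse` and `fused_normalise`.

```python
def fuse(source_stat, target_stat, weight):
    """Convex combination of a source and a target statistic.

    Assumed convention: weight = 1 uses only the target statistic and
    weight = 0 uses only the source statistic.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * target_stat + (1.0 - weight) * source_stat


def fused_normalise(batch, source, target, weight, eps=1e-5):
    """Normalise a batch with fused mean and variance.

    `source` and `target` are dicts with keys "mean" and "var".
    """
    mean = fuse(source["mean"], target["mean"], weight)
    var = fuse(source["var"], target["var"], weight)
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]


source = {"mean": 0.0, "var": 1.0}
target = {"mean": 10.0, "var": 4.0}
partial = fused_normalise([9.0, 10.0, 11.0], source, target, weight=0.5)
```

Under this convention, `weight=0` reproduces conventional inference with source statistics, `weight=1` reproduces full adaptation to the target, and any value in between is a partial adaptation.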

3 Experiments

Applications: FUSION is evaluated on classification and segmentation tasks using the Camelyon-17 [CAMELYON17] and TUPAC [tupac] datasets. Classification and segmentation are performed on the five-center Camelyon-17 dataset, but only classification is performed on the three-center TUPAC dataset. A detailed dataset description and sample images may be found in Table 1 and Fig. 3, respectively. The data vary widely among centers, making it likely that a model developed for one center performs poorly when applied to the others. The effectiveness of FUSION in preventing such performance decline is put to the test.
Implementation: ResNet-18 [resnet] and EfficientNet-B0 [efnet] are trained for classification with and without train-time augmentations (TrTAug). Rotation-based augmentation is the default, whereas TrTAug additionally includes colour-based augmentations such as hue, saturation, and value (HSV) variations to induce stain invariance. The Feature Pyramid Network (FPN) [fpn] with ResNet-34 [resnet] as the encoder is used for the segmentation task. Each network is trained with the SGD-with-momentum (SGDM) optimizer for 55 epochs and a step learning-rate scheduler with a decay factor of 0.1. The learning rate is set to 0.001, the batch size to 64, and the weight decay to 0.01. For FUSION, the optimal weighting factor is selected through grid search.
Baselines: In addition to vanilla inference, we report the influence of test-time augmentations (TTAug), stain normalisation [macenko2009method, vahadane], and normalisation statistics adaptation [vanillabna, snb]. The methods of [vanillabna] and [snb] are referred to as vanilla batch normalization adaptation (Vanilla-BNA) and sample-based BNA (SB-BNA), respectively. The influence of Eq. 2 in conjunction with Vanilla-BNA (Vanilla-BNA + Eq. 2) is also investigated. FUSION-full, i.e. FUSION using only the target statistics (Eq. 3), is also analysed. N is set to 20 for SB-BNA, as suggested in [snb].

Application     Dataset (Centers)          C0      C1      C2      C3      C4      Size
Classification  Camelyon-17 [CAMELYON17]   13527   9690    14867   30517   26967   256 × 256
Classification  TUPAC [tupac]              4260    1600    1344    -       -       128 × 128
Segmentation    Camelyon-17 [CAMELYON17]   237     231     300     437     685     1024 × 1024

Table 1: Multi-center datasets used for classification and segmentation to analyze inter-center stain adaptation with FUSION. For each dataset, one center is considered as the source and the remaining centers as the targets.

Figure 3: Sample images from different centers of Camelyon-17 (a-e: C0 to C4) and TUPAC (f-h: C0 to C2), highlighting the inter-center stain variation.

4 Results

FUSION gives the best classification and segmentation results in varied test setups.

Table 2 indicates that conventional inference results in poor performance. This is attributed to the drastic stain variation between C2 (source) and the other centers (targets), implying the necessity of stain adaptation. FUSION outperforms the other approaches in most settings for classification on Camelyon-17 under stain variation. The improvement in balanced accuracy for C0, C1, C3, and C4 with EfficientNet-B0 (without TrTAug) is 43.25, 22.89, 38.85, and 27.60 percentage points, respectively. Introducing TrTAug, as expected, induces stain invariance in the network, resulting in increased performance across target centers. With TrTAug, FUSION outperforms the other methods for every target center and provides the highest increment over the baseline. Similarly, FUSION is the best-performing technique on the TUPAC dataset, as shown in Fig. 4. It performs on par with FUSION-full in a few cases and is superior in others.

FUSION is the best-performing approach for segmentation (Table 3), with the highest gain of 8.55 percentage points in Dice score for C0. Performance on C4 decreases, as was also observed in classification. However, FUSION results in the smallest decrement in comparison to the other methods. FUSION-full underperforms FUSION in most cases, implying that full adaptation is sub-optimal and that an appropriate combination of source and target statistics is required.
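
For reference, the two metrics reported in this section can be sketched as follows. The definitions used here are the common ones (balanced accuracy as the mean of per-class recall, and the Dice coefficient as twice the overlap divided by the sum of the mask sizes); how the scores are aggregated over images or runs is not specified above, so the function names and the flat-list inputs are illustrative assumptions.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


def dice_score(pred_mask, true_mask):
    """Dice coefficient for two flat binary masks (lists of 0/1)."""
    inter = sum(p and t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 1.0 if total == 0 else 2.0 * inter / total


# A classifier that always predicts class 0 on an imbalanced toy set.
bal_acc = balanced_accuracy([0, 0, 0, 1], [0, 0, 0, 0])
dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

On a two-class problem, balanced accuracy is 50 percent for a model that always predicts a single class, regardless of class imbalance.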

Source: C2. Targets: C0, C1, C3, C4.
Method                                C0                      C1                      C3                      C4
EfficientNet-B0 [efnet]               50.34 ± 0.57            49.43 ± 0.64            49.90 ± 0.13            63.99 ± 2.67
   + TTAug                            51.07 ± 1.27            43.37 ± 1.08            44.73 ± 0.94            54.06 ± 0.71
   + Macenko [macenko2009method]      50.21 ± 0.19            50.32 ± 2.84            49.83 ± 3.03            49.95 ± 0.05
   + Vahadane [vahadane]              84.65 ± 2.20            76.29 ± 3.26            76.67 ± 3.13 (↑26.77)   46.08 ± 4.35
   + Vanilla BNA [vanillabna]         53.47 ± 0.19 (↑3.13)    53.27 ± 0.24 (↑3.84)    53.47 ± 0.24 (↑3.57)    51.55 ± 0.17 (↓12.44)
   + Vanilla BNA [vanillabna] + (2)   92.43 ± 0.24            72.11 ± 0.40            87.29 ± 0.52            91.44 ± 0.73
   + SB-BNA [snb]                     67.94 ± 5.33            62.32 ± 2.35            57.52 ± 2.57            85.67 ± 1.46
   + FUSION-full (3)                  92.61 ± 0.12            71.27 ± 0.37            87.59 ± 0.31            92.14 ± 0.76
   + FUSION (4)                       93.59 ± 0.13 (↑43.25)   72.32 ± 0.54            88.75 ± 0.32            91.59 ± 0.90
+ TrTAug                              93.32 ± 1.11            75.28 ± 3.66            85.28 ± 1.51            82.01 ± 3.12
   + TTAug                            92.25 ± 0.99            75.03 ± 2.66            82.57 ± 0.88            79.02 ± 2.42
   + Macenko [macenko2009method]      92.72 ± 1.12            82.98 ± 3.09            83.80 ± 3.24            83.75 ± 1.83
   + Vahadane [vahadane]              93.41 ± 0.78 (↑0.09)    80.53 ± 1.73 (↑5.25)    88.79 ± 0.59 (↑3.21)    71.37 ± 1.18 (↓10.64)
   + Vanilla BNA [vanillabna]         53.78 ± 0.18            53.03 ± 0.23            53.75 ± 0.13            51.86 ± 0.20
   + Vanilla BNA [vanillabna] + (2)   94.29 ± 0.16 (↑0.97)    83.41 ± 1.08 (↑8.13)    91.67 ± 0.22 (↑6.39)    94.81 ± 0.22 (↑12.8)
   + SB-BNA [snb]                     93.96 ± 0.79            83.87 ± 1.01            90.45 ± 0.74            93.22 ± 0.72
   + FUSION-full (3)                  94.69 ± 0.22            83.83 ± 1.22            91.93 ± 0.29            95.38 ± 0.18
   + FUSION (4)                       95.28 ± 0.33            88.68 ± 0.87 (↑13.4)    93.02 ± 0.34 (↑7.74)    95.58 ± 0.12 (↑13.57)
Table 2: Balanced accuracy for classification on Camelyon-17 [CAMELYON17] with C2 as the source. Best results are highlighted in bold. Maximum increment and minimum decrement are represented in blue and red, respectively. Vanilla BNA performs poorly without Eq. 2. Hence, for classification, each batch at inference time must represent the test set.

Figure 4: Balanced accuracy for classification on TUPAC [tupac] with source C0 (a) and C1 (b). As for Camelyon-17 in Table 2, Vanilla BNA performs worse than vanilla inference in the absence of Eq. 2. In some cases FUSION-full and FUSION are comparable to Vanilla BNA, while in others they are superior. Also, with C1 as the source and C2 as the target, FUSION produces the smallest decrement, supporting the validity of merging the statistics.

Source: C1. Targets: C0, C2, C3, C4.
Method                                C0                      C2                      C3                      C4
ResNet-34 [resnet]                    62.75 ± 7.95            53.49 ± 14.00           58.23 ± 8.87            50.52 ± 12.61
   + TTA                              56.23 ± 6.65            47.30 ± 8.87            50.46 ± 7.01            39.60 ± 8.91
   + Vanilla BNA [vanillabna]         70.56 ± 4.36 (↑7.81)    55.70 ± 3.11 (↑2.21)    59.80 ± 3.44 (↑1.57)    38.05 ± 4.80 (↓10.92)
   + Vanilla BNA [vanillabna] + (2)   70.47 ± 4.23            56.08 ± 3.57            59.85 ± 3.57            38.26 ± 4.59
   + FUSION-full (3)                  70.36 ± 5.00            57.08 ± 6.17            60.62 ± 3.79            40.95 ± 6.73
   + FUSION (4)                       71.30 ± 5.95 (↑8.55)    60.68 ± 9.66            63.97 ± 4.81            48.06 ± 7.40
Table 3: Dice Score for segmentation on Camelyon-17 [CAMELYON17] with C1 as the source. FUSION has maximum increment for C0, C2, C3, and least decrement for C4. Due to dense predictions, Eq. 2 is inherently satisfied, making its combination with Vanilla BNA redundant.

In Vanilla BNA, each batch must be representative of the test data distribution.

In the absence of Eq. 2, Vanilla BNA performs poorly, in several cases worse than vanilla inference. Updating the batch normalisation statistics without Eq. 2 does not account for all classes in the dataset, resulting in poor performance. Because the pixels of a single image belong to multiple classes, this requirement is satisfied by default for segmentation. Also, because Eq. 3 accumulates the updates from all test samples (running averages) during inference, Eq. 2 is not required separately.
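
The effect of a non-representative batch can be illustrated with a deliberately simple toy example. The one-dimensional features, the two clusters, and the sign-based `predict` rule are all hypothetical and chosen only to make the mechanism visible; they are not taken from the experiments.

```python
from statistics import fmean, pvariance


def normalise(batch, mean, var, eps=1e-5):
    return [(x - mean) / (var + eps) ** 0.5 for x in batch]


def predict(features):
    # Toy classifier: positive normalised feature -> class 1, else class 0.
    return [1 if f > 0 else 0 for f in features]


# Toy target features: class 0 clusters near 2, class 1 clusters near 6.
class0 = [1.5, 2.0, 2.5, 2.0]
class1 = [5.5, 6.0, 6.5, 6.0]

mixed = class0 + class1   # batch containing both classes
single = class1           # batch containing only class 1

mixed_pred = predict(normalise(mixed, fmean(mixed), pvariance(mixed)))
single_pred = predict(normalise(single, fmean(single), pvariance(single)))
```

When the batch contains both classes, its mean falls between the two clusters and every sample is classified correctly. When the batch contains only class 1, normalising with the batch's own mean re-centres that single class around zero, so most of its samples fall on the wrong side of the decision rule.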

FUSION restores activation distribution shifts caused by stain differences.

BatchNorm normalizes the distribution of activations over a minibatch during training. By setting the first two moments (mean and variance) of the distribution of each activation to zero and one, it tries to correct for the changes in the distribution of a layer's activations caused by updates to the parameters of the previous layers. This change in the distribution of layer inputs after every minibatch is known as internal covariate shift.

By default, a running mean and variance of the activations over all minibatches are tracked in the BatchNorm layer and used at test time so that the activations have mean 0 and variance 1. However, when the model is used for prediction on an unseen and different stain, the distribution of activations changes and behaves differently from training time. The existing BatchNorm statistics, being based on a different distribution, cannot correct for the new distribution shift. This causes the mean and variance of the activations after the BatchNorm layer to differ from the intended values of 0 and 1.

Consider the running mean and the running variance recorded during training at a given layer and channel. With stain variation between the source and the target:

(5)
(6)

After applying BatchNorm the above equations can be written as:

(7)
(8)

FUSION is designed to restore these activations in a way that:

(9)
(10)

where FUSION can be calculated using Eq. 4.
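
The following toy example sketches the argument numerically for a single channel. The activation values are hypothetical, and the target is modelled as a shifted and scaled copy of the source purely for illustration. For simplicity the sketch restores the activations with target statistics only, which corresponds to the full-adaptation extreme.

```python
from statistics import fmean, pvariance


def normalise(values, mean, var, eps=1e-5):
    return [(x - mean) / (var + eps) ** 0.5 for x in values]


def moments(values):
    return fmean(values), pvariance(values)


# Hypothetical pre-BatchNorm activations of one channel.
source_acts = [1.0, 2.0, 3.0, 4.0]      # source stain
target_acts = [4.0, 7.0, 10.0, 13.0]    # target stain: 3 * source + 1

src_mean, src_var = moments(source_acts)
tgt_mean, tgt_var = moments(target_acts)

# Target activations normalised with source statistics: moments are off.
shifted_mean, shifted_var = moments(normalise(target_acts, src_mean, src_var))

# Target activations normalised with target statistics: moments restored.
restored_mean, restored_var = moments(normalise(target_acts, tgt_mean, tgt_var))
```

Normalising the target activations with the source statistics leaves a mean of roughly 5.4 and a variance of roughly 9, whereas normalising with statistics that match the target brings them back to approximately 0 and 1.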

The shift in activation distribution (Fig. 5a-c), which explains the performance decrease under stain variation, is correlated with the stain difference. With similar stains, the distribution shift is minor and the source batch normalisation statistics are sufficient. With substantial stain variation, however, the shift is significant and test performance drops considerably; FUSION corrects the distribution back towards the source. As a result, FUSION achieves larger performance improvements for greater stain differences (larger distribution shifts).

Figure 5: (a) Channel 325, (b) channel 1028, (c) channel 892. Due to the stain difference between the source and the target, the feature map distribution of EfficientNet-B0 shifts after the BatchNorm layer. The activations are extracted after the last BatchNorm layer of an EfficientNet-B0 model trained on C2. The blue line displays the density plot of a channel using source validation data, while the red line represents the shift owing to the target’s different stain. FUSION aims to match the training distribution by correcting the covariate shift (green line).

Figure 6: Impact of (a) step count and (b) batch size on FUSION’s performance in terms of balanced accuracy. The model is trained on C2, and the mean balanced accuracy over the remaining test centers (C0, C1, C3, C4) is reported for varying batch sizes and step counts. (a) FUSION’s performance depends on the BatchNorm statistics, which in turn depend on the batch size and the number of iterations/steps. Increasing the step count improves the performance irrespective of batch size. Because moving averages of the mean and variance are used to update the BatchNorm statistics, a large number of steps is required to overcome the effect of momentum. (b) When applying FUSION, batch size has a negligible effect on performance when the step count is fixed (set to 1). Similar observations can be made in (a), where points of different batch sizes overlap for the same step count.

The performance of FUSION correlates to the number of steps.

Batch size and step count both affect the calculation of batch normalisation statistics. The performance obtained by updating the statistics from the target does not improve as the batch size is increased. The number of steps taken to generate the new statistics, on the other hand, is critical. As shown in Fig. 6a, the performance for different batch sizes converges in roughly the same way as the step count increases. This could be related to the reduced influence of momentum on the batch statistics update as the step count grows. Further evidence of the importance of step count over batch size can be seen in Fig. 6b, where, for the same step count, batch size has no correlation with performance.
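
The role of momentum can be made concrete with a short calculation. Assuming the common moving-average update `running = (1 - momentum) * running + momentum * batch_stat` with a momentum of 0.1 (both assumptions, since the exact update rule and value are not stated above), the initial statistic still carries a weight of `(1 - momentum) ** steps` after a given number of steps:

```python
def residual_weight(steps, momentum=0.1):
    """Weight still carried by the initial statistic after `steps` updates."""
    return (1.0 - momentum) ** steps


def running_estimate(initial, batch_stat, steps, momentum=0.1):
    """Moving average after `steps` updates with a constant batch statistic."""
    value = initial
    for _ in range(steps):
        value = (1.0 - momentum) * value + momentum * batch_stat
    return value


# Fraction of the way from the initial value (0) to the target statistic (1).
progress = {steps: running_estimate(0.0, 1.0, steps) for steps in (1, 10, 50)}
```

Under these assumptions a single step moves the estimate only 10 percent of the way towards the target statistic, ten steps about 65 percent, and fifty steps more than 99 percent, independently of how many samples each batch contains.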

FUSION is generic and provides flexibility during inference.

Full adaptation is neither generic nor efficient, as it may result in substantial statistical deviation. Similarly, the poor performance without adaptation is due to the incompatibility of the source statistics with the target. A key advantage of FUSION is its generality: the weighting factor allows it to lean towards either the source or the target. Because it is entirely unsupervised and adapts at test time, FUSION provides more flexibility during inference.

5 Conclusion

FUSION was proposed for entirely unsupervised test-time stain adaptation, and its performance was evaluated in various scenarios. By combining source and target batch normalisation statistics, FUSION outperforms similar approaches in the literature. Testing with various datasets, applications, and architectures shows that FUSION provides the best performance among the compared methods in most settings. The characteristics of FUSION, such as its ability to counter the covariate shift, also explain the higher performance obtained. Analysis of the impact of batch size and step count on FUSION during inference highlights its dependency on step count rather than batch size. However, as expected, a batch size of one is not sufficient for FUSION to work, because batch normalization statistics must be calculated. Apart from its performance, FUSION provides inference-time flexibility to steer it towards the source or the target through a single hyperparameter. Hence, it allows no adaptation, full adaptation, or partial adaptation.

References