Coarse Retinal Lesion Annotations Refinement via Prototypical Learning

Qinji Yu^1, Kang Dang^2, Ziyu Zhou^1, Yongwei Chen^2, Xiaowei Ding^{1,2} (corresponding author)

^1 Shanghai Jiao Tong University, Shanghai, China
^2 VoxelCloud, Inc., Los Angeles, USA

Abstract

Deep-learning-based approaches for retinal lesion segmentation often require an abundant amount of precise pixel-wise annotated data. However, coarse annotations such as circles or ellipses for outlining the lesion area can be six times more efficient than pixel-level annotation. Therefore, this paper proposes an annotation refinement network to convert a coarse annotation into a pixel-level segmentation mask. Our main novelty is the application of the prototype learning paradigm to enhance the generalization ability across different datasets or types of lesions. We also introduce a prototype weighing module to handle challenging cases where the lesion is overly small. The proposed method was trained on the publicly available IDRiD dataset and then generalized to the public DDR and our real-world private datasets. Experiments show that our approach substantially improved the initial coarse mask and outperformed the non-prototypical baseline by a large margin. Moreover, we demonstrate the usefulness of the prototype weighing module in both cross-dataset and cross-class settings.

Keywords: Prototypical Learning, Retina Lesion Segmentation, Coarse Annotation Refinement.

† Q. Yu and K. Dang contributed equally to this work.

1 Introduction

Given the growing demand for retinal screening, automatic segmentation of retinal lesions is of increasing clinical relevance. By identifying which lesions exist in the image and where they are located, retinal lesion segmentation algorithms assist ophthalmologists in making clinical diagnoses and assessing disease severity [18]. While recent deep-learning approaches have substantially improved retinal lesion segmentation accuracy [18, 9, 4, 19], they often require abundant expert-level, pixel-wise annotated data, which takes significant time and expense to acquire. Previous studies show that coarse annotations such as circles or ellipses for outlining the lesion area can be six times more efficient than pixel-level annotation [5]. Therefore, it is essential to study novel methodologies tailored for lower-quality coarse annotations.

Existing works on exploiting coarse annotations can be categorized into weakly-supervised segmentation [11, 20, 15, 10, 16, 1, 24] and mask refinement [21, 5]. Weakly-supervised segmentation methods rely on prior assumptions such as the box tightness constraint [16] and the image contrast constraint [15] to utilize box-level and image-level coarse annotations. A few high-quality pixel-level retinal lesion datasets such as IDRiD [12] and DDR [8] provide precise lesion boundaries. While successful, weakly-supervised segmentation does not utilize these pre-existing lesion segmentation datasets, which provide rich knowledge of lesions’ exact appearance and shape. Instead of further developing weakly-supervised segmentation methods, we propose to use such datasets by training an annotation refinement model in a data-driven manner to convert a coarse annotation into a pixel-level segmentation mask. It should be emphasized that our work differs significantly from the weakly-supervised approach, as it is trained in a fully supervised way with paired coarse annotations and pixel-level ground truth. Additionally, we note several existing mask refinement methods [21, 5] that refine initial coarse masks into more accurate segmentation results; however, they are usually optimized for a particular dataset. In comparison, our method applies the prototype learning paradigm [17, 7, 14, 24] to enhance generalization across different datasets and lesion types. Good generalization is the key to putting a coarse annotation refinement algorithm into practice. For example, we can train the coarse annotation refinement network once on a large-scale dataset and reuse the trained model on other datasets with little fine-tuning or tweaking.

In particular, our prototype learning averages the features from the coarse mask region to form a lesion prototype and averages the background features to create a background prototype. A pixel is assigned to the lesion class if its feature vector is more similar to the lesion prototype than to the background prototype. Since our prototypical approach generates image-specific prototypes that adaptively describe the image itself, it is less sensitive to intra-class variance and to large distribution shifts from different datasets or unseen classes. However, averaging features uniformly may be problematic when the lesion is considerably smaller than the coarse mask, as the resultant lesion prototype becomes dominated by background features. We alleviate this issue with a superpixel-guided prototype weighing module. The module first divides the coarse mask into several superpixels [7] and computes a prototype for each superpixel. Each prototype’s dissimilarity to the background prototype is then calculated as a weighting factor. The final lesion prototype is the weighted combination of these superpixel-guided sub-prototypes.

Contributions. (1) To the best of our knowledge, our method is the first prototypical approach for the coarse retinal lesion annotation refinement problem. (2) We present a prototype weighing module to address the case where the actual lesion is much smaller than the coarse mask. (3) Experiments demonstrate that the proposed method substantially improves the initial coarse annotation and outperforms non-prototypical mask refinement baselines. They also confirm the benefit of the prototype weighing module in both cross-dataset and cross-class settings.

2 Methods

This section details the proposed coarse annotation refinement method with the overall structure shown in Fig. 1. We assume that there exists a set of image patches and the associated coarse lesion annotations, and our algorithm will convert them into the corresponding high-quality pixel-level annotations.

Figure 1: Framework of our prototype-based coarse annotation refinement network.

2.1 Annotation Refinement via Prototype Learning

2.1.1 Feature extraction

The input to the network is the concatenation of the image patch $I \in R^{H \times W \times 3}$ and its corresponding coarse lesion annotation $M \in R^{H \times W \times 1}$ . We use a modified U-Net backbone to extract its feature map $F \in R^{H^{'} \times W^{'} \times C}$ . Following [14], we remove the last two upsampling blocks in the U-Net to speed up the calculation. As a result, the resolution of the feature map is 1/4 of the original input. We concatenate the feature map with the down-sampled coarse mask $M^{'} \in R^{H^{'} \times W^{'} \times 1}$ in the feature channel dimension to further incorporate the coarse annotation prior. To get the final fusion feature map $F^{'} \in R^{H^{'} \times W^{'} \times C^{'}}$ , we adopt a simple 1-layer network with architecture: 1 $\times$ 1 Conv2d $+$ BatchNorm2d $+$ ReLU.

2.1.2 Coarse Prototype Extraction

Given the fused feature map, we want to learn representative and well-separated prototype vectors for the lesion region and the background based on the prototypical network. In previous research [17, 14, 22], the prototypical network condenses the masked object features in an image into a single prototype or a few prototypes. A relatively simple coarse foreground lesion prototype can be calculated by masked average pooling, as follows:

$$p_{fg} = \frac{\sum_{(x, y)} F^{'}(x, y) \, 1[M^{'}(x, y) = 1]}{\sum_{(x, y)} 1[M^{'}(x, y) = 1]}, \tag{1}$$

where $(x, y)$ indexes the spatial locations and $1[\cdot]$ is an indicator function. In addition, the background prototype is computed by

$$p_{bg} = \frac{\sum_{(x, y)} F^{'}(x, y) \, 1[M^{'}(x, y) = 0]}{\sum_{(x, y)} 1[M^{'}(x, y) = 0]}, \tag{2}$$

where $p_{f g}, p_{b g} \in R^{C^{'}}$ .
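As a minimal illustration of Eqs. (1) and (2), the two prototypes can be computed with a few lines of NumPy. This is a sketch rather than the authors' implementation: the function name `masked_average_pooling`, the array layout, and the small `eps` term guarding against empty regions are assumptions made here.

```python
import numpy as np

def masked_average_pooling(features, mask, eps=1e-8):
    """Compute foreground and background prototypes.

    features: (H', W', C') fused feature map F'.
    mask:     (H', W') binary down-sampled coarse mask M'.
    Returns (p_fg, p_bg), each of shape (C',).
    """
    mask = mask.astype(bool)
    # Sum the feature vectors inside / outside the coarse mask and divide
    # by the pixel count; eps is an added safeguard, not part of Eqs. (1)-(2).
    p_fg = features[mask].sum(axis=0) / (mask.sum() + eps)
    p_bg = features[~mask].sum(axis=0) / ((~mask).sum() + eps)
    return p_fg, p_bg
```

Because the prototypes are recomputed from each input patch, they describe that patch's own lesion and background statistics instead of relying on dataset-level class weights.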

2.1.3 Coarse Annotation Refinement

Refinement is done using a non-parametric metric learning method [17]. For each pixel at location $(x, y)$ of the final fusion feature map $F^{'}$ , we calculate the distance between its feature vector and the derived prototypes $P = \{p_{bg}, p_{fg}\}$ . Then, we apply the softmax operation over the distances to get the probability map $P_{c} \in R^{H^{'} \times W^{'} \times 1}$ with $c \in \{bg, fg\}$ . Formally, we have:

$$P_{c}(x, y) = \frac{\exp(-α \cdot d(F^{'}(x, y), p_{c}))}{\sum_{p_{j} \in P} \exp(-α \cdot d(F^{'}(x, y), p_{j}))}, \tag{3}$$

where $α$ is the scaling factor fixed at 20.
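The softmax over scaled negative distances in Eq. (3) can be sketched as follows. The text does not state which distance $d$ is used at this step; the sketch assumes the cosine distance, and the function names are hypothetical.

```python
import numpy as np

ALPHA = 20.0  # scaling factor from the text

def cosine_distance_map(features, prototype, eps=1e-8):
    """Cosine distance between every pixel feature and one prototype.

    features: (H', W', C'), prototype: (C',) -> (H', W').
    NOTE: the choice of cosine distance for d is an assumption.
    """
    num = features @ prototype
    den = np.linalg.norm(features, axis=-1) * np.linalg.norm(prototype) + eps
    return 1.0 - num / den

def prototype_probabilities(features, p_bg, p_fg, alpha=ALPHA):
    """Return (P_bg, P_fg), each (H', W'), following the form of Eq. (3)."""
    dists = np.stack(
        [cosine_distance_map(features, p_bg),
         cosine_distance_map(features, p_fg)],
        axis=-1,
    )
    logits = -alpha * dists
    # Subtract the per-pixel maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs[..., 0], probs[..., 1]
```

With only two prototypes, a larger $α$ sharpens the decision: a pixel slightly closer to the lesion prototype receives a foreground probability near 1.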

We train our model end-to-end using the sum of dice loss $L_{d i c e}$ and binary cross-entropy loss $L_{b c e}$ between the final probability map $P_{f g}$ and the well-annotated ground truth mask $M_{g t}$ . That is: $L_{l o s s} = L_{d i c e} + L_{b c e}$ .
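The combined objective can be sketched in NumPy as below. The exact Dice formulation is not given in the text; this sketch assumes a soft Dice loss with a smoothing constant, and the names `dice_loss`, `bce_loss`, and `total_loss` are illustrative.

```python
import numpy as np

def dice_loss(prob, target, smooth=1.0):
    """Soft Dice loss; the smoothing term is an assumption of this sketch."""
    intersection = (prob * target).sum()
    dice = (2.0 * intersection + smooth) / (prob.sum() + target.sum() + smooth)
    return float(1.0 - dice)

def bce_loss(prob, target, eps=1e-7):
    """Pixel-averaged binary cross-entropy on probabilities."""
    p = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def total_loss(prob, target):
    """L_loss = L_dice + L_bce between P_fg and the ground-truth mask."""
    return dice_loss(prob, target) + bce_loss(prob, target)
```

The Dice term measures region overlap and is less affected by the foreground/background imbalance, while the cross-entropy term supplies a per-pixel gradient.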

During testing, for each image patch $I$ and its corresponding coarse lesion annotation $M$ , we obtain a corresponding foreground probability map $P_{f g}$ . When mapping $P_{f g}$ back to the original image space (uncropped full image), some of them will overlap. For each pixel in the overlapping area, we choose the maximum probability of these probability maps as its value. In the end, thresholding is used to get the final refined mask.
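The patch-merging step at test time can be illustrated as follows. This is a sketch under stated assumptions: each patch probability map is taken to be already resized to its crop size in the full image, the crop is described by its top-left corner, and the threshold value 0.5 is a placeholder since the text does not specify it.

```python
import numpy as np

def merge_patches(image_shape, patches, threshold=0.5):
    """Paste patch-level foreground probabilities into the full image.

    image_shape: (H, W) of the uncropped fundus image.
    patches: iterable of (y0, x0, prob), where prob is an (h, w) map
             placed with its top-left corner at (y0, x0).
    Overlapping pixels keep the maximum probability.
    Returns (full_prob, refined_mask).
    """
    full = np.zeros(image_shape, dtype=np.float32)
    for y0, x0, prob in patches:
        h, w = prob.shape
        region = full[y0:y0 + h, x0:x0 + w]
        full[y0:y0 + h, x0:x0 + w] = np.maximum(region, prob)
    # The threshold is an assumed value, not one reported in the paper.
    return full, full >= threshold
```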

Figure 2: Illustration of superpixel-guided prototype weighting. Boundaries in red, green, and yellow are actual lesion, coarse lesion, and superpixel regions, respectively.

2.2 Superpixel-guided prototype weighting

As shown in Fig. 2, when the actual lesion is relatively small compared to the coarse mask, the coarse foreground prototype defined by Eq. (1) cannot represent the actual lesion features. To reduce the impact of false-positive pixels within the coarse annotation, we divide the initial coarse region into several sub-regions according to their feature similarity. Concretely, we follow maskSLIC [6] to aggregate the feature map within the masked region into multiple superpixel clusters. For each superpixel region $S_{i}$ , we obtain its corresponding superpixel-based sub-prototype $g_{i}$ according to Eq. (1). We collect the extracted sub-prototypes and denote them as the set $G = \{g_{i}\}$ , where $i \in \{1, 2, \dots, N_{sp}\}$ ( $N_{sp}$ is the number of superpixels). We compute the cosine distance to measure the dissimilarity between each $g_{i}$ and $p_{bg}$ :

$$d(g_{i}, p_{bg}) = 1 - \frac{g_{i} \cdot p_{bg}}{\| g_{i} \| \cdot \| p_{bg} \|}. \tag{4}$$

Intuitively, the prototypes dissimilar to the background prototype are more important parts for the final foreground prototype. Therefore, we can derive a weight coefficient for each prototype in set $G$ :

$$w_{i} = \frac{\exp(β \cdot d(g_{i}, p_{bg}))}{\sum_{g_{j} \in G} \exp(β \cdot d(g_{j}, p_{bg}))}, \tag{5}$$

where $β$ is the scaling factor fixed at 10. The final foreground prototype is then given by

$$p_{weighted} = \sum_{g_{i} \in G} w_{i} \cdot g_{i}. \tag{6}$$

As shown later, our proposed superpixel weighted prototype $p_{weighted}$ is a more representative foreground prototype that performed better in various experiments.
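Eqs. (4) to (6) can be sketched as below, assuming the superpixel partition has already been computed (for example by a maskSLIC-style method) and is supplied as an integer label map. The label convention (0 outside the coarse mask, positive labels inside) and the function names are assumptions of this sketch.

```python
import numpy as np

BETA = 10.0  # scaling factor from the text

def cosine_distance(a, b, eps=1e-8):
    """Eq. (4): cosine distance between two vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def weighted_foreground_prototype(features, superpixels, p_bg, beta=BETA):
    """Combine superpixel sub-prototypes into one foreground prototype.

    features:    (H', W', C') fused feature map.
    superpixels: (H', W') integer labels; 0 = outside the coarse mask,
                 1..N_sp = superpixel index (assumed convention).
    p_bg:        (C',) background prototype.
    Returns (p_weighted, weights).
    """
    labels = [l for l in np.unique(superpixels) if l > 0]
    # Sub-prototype g_i: average feature inside superpixel S_i (Eq. 1).
    G = np.stack([features[superpixels == l].mean(axis=0) for l in labels])
    d = np.array([cosine_distance(g, p_bg) for g in G])
    # Eq. (5): softmax over beta * distance, computed stably.
    logits = beta * d
    logits = logits - logits.max()
    w = np.exp(logits)
    w = w / w.sum()
    # Eq. (6): weighted sum of the sub-prototypes.
    return w @ G, w
```

Sub-prototypes that resemble the background receive small weights, so superpixels covering false-positive area inside the coarse ellipse contribute little to the final lesion prototype.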

3 Experiments and Results

3.1 Experimental Setup

3.1.1 Coarse Annotation Generation

There is no publicly available retinal lesion dataset with paired coarse annotations and pixel-level segmentation masks. To construct such a paired dataset, we develop a simple coarse annotation generation method. First, the coarse annotations are simulated from the well-annotated fine masks by applying the following chain of operations: smoothing, dilating, expanding, clustering the connected components using DBSCAN [2], and fitting an ellipse to each cluster. Second, the fundus image is cropped around each ellipse in the corresponding coarse annotation. Finally, the cropped image patches and their coarse annotations are resized to fixed dimensions $H \times W$ for subsequent model training and testing.
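To make the last step of this chain concrete, the sketch below fits an ellipse to the pixels of one cluster and rasterizes it as a coarse mask. It covers only the ellipse-fitting step, not smoothing, dilation, or DBSCAN clustering. The fitting procedure is not specified in the text; this sketch assumes a second-moment (covariance) fit, and `scale` and `min_var` are hypothetical parameters.

```python
import numpy as np

def fit_coarse_ellipse(cluster_mask, scale=1.5, min_var=1.0):
    """Rasterize an ellipse around the foreground pixels of one cluster.

    cluster_mask: (H, W) binary mask of a single lesion cluster.
    scale:   enlargement factor (> 1 makes the ellipse looser).
    min_var: variance floor so tiny lesions still get a visible ellipse.
    Returns a boolean (H, W) coarse mask.
    """
    ys, xs = np.nonzero(cluster_mask)
    pts = np.stack([ys, xs], axis=1).astype(float)
    center = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False, bias=True) + min_var * np.eye(2)
    inv_cov = np.linalg.inv(cov)
    yy, xx = np.indices(cluster_mask.shape)
    diff = np.stack([yy - center[0], xx - center[1]], axis=-1)
    # Squared Mahalanobis distance of every pixel to the cluster centre.
    m2 = np.einsum('...i,ij,...j->...', diff, inv_cov, diff)
    # For a uniformly filled ellipse, distance <= 2 recovers its outline,
    # hence the factor 2 before the enlargement.
    return m2 <= (2.0 * scale) ** 2
```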

3.1.2 Datasets and Evaluation Metrics

We evaluate the proposed methods on the public IDRiD and DDR datasets and on our real-world private dataset. IDRiD contains 81 fundus images (54 training images, 27 testing images) with pixel-level annotations for hard exudates (EX), hemorrhages (HE), microaneurysms (MA), and soft exudates (SE). Similarly, the testing part of DDR contains 225 fundus images with pixel-level annotations for EX, HE, MA, and SE. Our real-world private testing dataset contains 211 fundus images with pixel-level annotations for drusen (Drus) and pre-retinal hemorrhages (Prh) labeled by two experienced ophthalmologists. To train our annotation refinement network, we collect 32985 training patch pairs (EX: 9957, HE: 7752, MA: 14387, SE: 889) from the IDRiD training images using our coarse annotation generation algorithm. We also apply the coarse annotation generation algorithm to generate a coarse mask for each testing image. To compare different refinement methods, we calculate the Intersection over Union (IoU) between the refined annotation and the ground-truth mask.
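For reference, the IoU metric and its image-level aggregation can be sketched as follows. The handling of two empty masks and the use of the population standard deviation are assumptions of this sketch, since the text does not specify them.

```python
import numpy as np

def iou(pred, gt):
    """Intersection over Union between two binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(np.logical_and(pred, gt).sum() / union)

def image_level_iou(preds, gts):
    """Mean and standard deviation of per-image IoU over a test set."""
    scores = np.array([iou(p, g) for p, g in zip(preds, gts)])
    return float(scores.mean()), float(scores.std())
```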

3.1.3 Baselines

We implement several non-prototypical mask refinement baseline models that take an image patch and a coarse mask as input. In detail, we choose three widely used feature extraction backbones, Res18 [3], HRNet18 [23], and U-Net [13], each combined with the coarse mask fusion module to perform feature extraction. The feature extraction process is identical to the one described in Sec. 2.1.1 except for the backbone. After that, we attach a 1 $\times$ 1 Conv2d layer as a binary classifier to obtain a refined segmentation score map.

3.1.4 Implementation Details

For the prototypical methods, we set the superpixel number $N_{s p} = 20$ . All training patch pairs are resized to 256 $\times$ 256 and augmented by RandomShiftScaleRotate, RandomBlur, and RandomBrightnessContrast. All models are implemented in PyTorch and trained from scratch using the Adam optimizer with a batch size of 64 for 120 epochs. The initial learning rate is $10^{- 4}$ and is reduced according to the ReduceLROnPlateau strategy.

3.2 Results

3.2.1 Same-Dataset Experiments

We train the proposed coarse annotation refinement network using all four lesion types on IDRiD and evaluate the performance on the IDRiD testing set. In Table 1, “Initial Coarse” denotes the IoU between the actual lesion region and the coarse annotation region. Our prototypical method improves the initial coarse mask considerably. It also consistently obtains better refinement performance than the non-prototypical baselines on all four lesion classes in terms of IoU, improving the average IoU by up to 5.2 percentage points over the strongest baseline (U-Net, 70.0%). This experiment demonstrates the advantage of our method when training and testing images come from the same dataset.

Methods	MA	SE	EX	HE	Average
Initial Coarse	9.6 (2.6)	49.3 (10.4)	15.0 (5.2)	33.2 (8.6)	26.8 (6.7)
Res18 [3]	73.9 (7.3)	68.4 (14.2)	54.1 (9.4)	62.6 (10.9)	64.7 (10.5)
HRNet18 [23]	79.1 (7.5)	78.1 (5.3)	56.8 (9.1)	64.8 (11.8)	69.7 (8.4)
U-Net [13]	77.6 (6.8)	75.6 (12.6)	58.9 (8.9)	67.9 (11.4)	70.0 (9.9)
Our methods	84.2 (6.2)	80.7 (9.2)	65.3 (8.1)	69.9 (13.9)	75.0 (9.4)
w/ superpixel	84.1 (6.2)	79.6 (9.9)	65.9 (8.7)	71.1 (11.2)	75.2 (9.0)

Table 1: The image-level average IoU (%) and its standard deviation (%) on IDRiD. "w/ superpixel" means with superpixel-guided prototype weighing.

3.2.2 Cross-Dataset and Cross-Class Experiments

We directly evaluate the performance on the DDR testing set and our real-world private dataset using models trained on the IDRiD dataset without further fine-tuning. As shown in Table 2, our method with superpixel-guided weighting exceeds the U-Net baseline by 4.3 percentage points in average IoU on both the DDR and private datasets. For DDR, our superpixel weighted prototype performs better than the non-weighted prototype for all lesion types. Similarly, the weighted prototype is notably better than the non-weighted one on the private dataset, especially for the class Prh (56.3% $\to$ 62.6%). Overall, we see a general trend that our model generalizes well to new datasets and unseen classes.

			DDR				Private
Methods	MA	SE	EX	HE	Average	Drus	Prh	Average
Initial Coarse	6.9 (3.4)	32.3 (10.8)	14.8 (8.8)	23.1 (11.4)	19.3 (8.6)	33.6 (19.1)	49.7 (13.0)	41.6 (16.0)
Res18 [3]	56.3 (12.4)	67.6 (13.6)	52.3 (13.2)	54.4 (13.3)	57.6 (13.1)	43.9 (16.6)	52.9 (15.2)	48.4 (15.9)
HRNet18 [23]	65.1 (13.5)	68.2 (12.6)	54.2 (13.9)	58.5 (12.9)	61.5 (13.3)	44.2 (21.1)	57.5 (16.4)	50.9 (18.8)
U-Net [13]	60.9 (13.2)	70.2 (15.0)	55.4 (12.8)	59.8 (12.7)	61.6 (13.4)	45.2 (20.9)	57.9 (16.4)	51.6 (18.6)
Our methods	68.8 (12.0)	71.2 (16.9)	58.4 (12.7)	61.3 (15.7)	64.9 (14.3)	47.9 (23.4)	56.3 (20.7)	52.1 (22.1)
w/ superpixel	69.8 (11.9)	72.0 (17.5)	58.9 (12.4)	62.9 (14.5)	65.9 (14.1)	49.1 (23.0)	62.6 (15.3)	55.9 (19.1)

Table 2: The image-level average IoU (%) and its standard deviation (%) on DDR and our real-world private dataset, which contains two unseen classes.

3.2.3 Coarse Mask Reduction Factors

Since ophthalmologists tend to draw a single rough ellipse to cover several unconnected lesion regions, we simulate this process by setting different reduction factors in the DBSCAN clustering algorithm. Specifically, the number of generated ellipses equals the number of connected lesion regions divided by the reduction factor. In other words, a higher reduction factor yields a coarser mask. As shown in Fig. 3, although the refinement performance of all methods degrades as the reduction factor increases from 1.0 to 2.0, our prototypical method degrades less than the U-Net baseline. This suggests that our method is more robust to coarser annotations.
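The relation between the reduction factor and the number of ellipses can be illustrated with a short calculation. The rounding rule is not stated in the text; this sketch assumes rounding up, and the function name is hypothetical.

```python
import math

def expected_num_ellipses(num_regions, reduction_factor):
    """Number of coarse ellipses for a given reduction factor.

    num_regions:      number of connected lesion regions in the fine mask.
    reduction_factor: >= 1.0; larger values merge more regions per ellipse.
    Rounding up is an assumption of this sketch.
    """
    if reduction_factor < 1.0:
        raise ValueError("reduction_factor must be >= 1.0")
    if num_regions == 0:
        return 0
    return max(1, math.ceil(num_regions / reduction_factor))

# Example: 12 connected lesion regions.
for factor in (1.0, 1.5, 2.0):
    print(factor, expected_num_ellipses(12, factor))
```

For 12 connected regions, factors of 1.0, 1.5, and 2.0 give 12, 8, and 6 ellipses, so each ellipse covers on average 1, 1.5, and 2 regions.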

Figure 3: Refinement performance under different reduction factors.

3.2.4 Visual results

Fig. 4 visualizes several refinement results. Despite the large variation in lesion scale and color and the low contrast with surrounding regions, the first three rows show that our proposed superpixel weighted prototype approach generates the most accurate lesion boundaries. The last row shows a failure case in which the coarse mask contains two distinct lesion classes, EX and Drus, at the same time.

Figure 4: Visualization of refinement results on four categories of lesions.

4 Conclusions

This paper proposes a novel prototype-based network to convert a coarse annotation into a pixel-level segmentation mask. The proposed network first extracts the lesion and background prototypes and labels the image pixel as the lesion class if its feature is more similar to the lesion prototype. A superpixel-guided prototype weighing module is then proposed to tackle the issue of the actual lesion being overly small compared to the coarse mask. On the IDRiD dataset, our model outperformed non-prototypical baselines by a large margin. Extensive experiments on DDR and our real-world private dataset also demonstrate the proposed model enjoys better generalizability to new datasets and some unseen lesion classes.

References

[1] T. Chu, X. Li, H. V. Vo, R. M. Summers, and E. Sizikova (2021) Improving weakly supervised lesion segmentation using multi-task learning. In Medical Imaging with Deep Learning, pp. 60–73.
[2] M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231.
[3] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[4] S. Huang, J. Li, Y. Xiao, N. Shen, and T. Xu (2022) RTNet: relation transformer network for diabetic retinopathy multi-lesion segmentation. IEEE Transactions on Medical Imaging.
[5] Y. Huang, L. Lin, M. Li, J. Wu, P. Cheng, K. Wang, J. Yuan, and X. Tang (2020) Automated hemorrhage detection from coarsely annotated fundus images in diabetic retinopathy. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), pp. 1369–1372.
[6] B. Irving (2016) MaskSLIC: regional superpixel generation with application to local pathology characterisation in medical images. arXiv preprint arXiv:1606.09518.
[7] G. Li, V. Jampani, L. Sevilla-Lara, D. Sun, J. Kim, and J. Kim (2021) Adaptive prototype learning and allocation for few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8334–8343.
[8] T. Li, Y. Gao, K. Wang, S. Guo, H. Liu, and H. Kang (2019) Diagnostic assessment of deep learning algorithms for diabetic retinopathy screening. Information Sciences 501, pp. 511–522.
[9] Q. Liu, H. Liu, and Y. Liang (2021) M2MRF: many-to-many reassembly of features for tiny lesion segmentation in fundus images. arXiv preprint arXiv:2111.00193.
[10] X. Liu, Q. Yuan, Y. Gao, K. He, S. Wang, X. Tang, J. Tang, and D. Shen (2022) Weakly supervised segmentation of COVID-19 infection with scribble annotation on CT images. Pattern Recognition 122, pp. 108341.
[11] C. Playout, R. Duval, and F. Cheriet (2019) A novel weakly supervised multitask architecture for retinal lesions segmentation on fundus images. IEEE Transactions on Medical Imaging 38 (10), pp. 2434–2444.
[12] P. Porwal, S. Pachade, M. Kokare, G. Deshmukh, J. Son, W. Bae, L. Liu, J. Wang, X. Liu, L. Gao, et al. (2020) IDRiD: diabetic retinopathy–segmentation and grading challenge. Medical Image Analysis 59, pp. 101561.
[13] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[14] H. Tang, X. Liu, S. Sun, X. Yan, and X. Xie (2021) Recurrent mask refinement for few-shot medical image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3918–3928.
[15] Y. Tang, J. Cai, K. Yan, L. Huang, G. Xie, J. Xiao, J. Lu, G. Lin, and L. Lu (2021) Weakly-supervised universal lesion segmentation with regional level set loss. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 515–525.
[16] J. Wang and B. Xia (2021) Bounding box tightness prior for weakly supervised image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 526–536.
[17] K. Wang, J. H. Liew, Y. Zou, D. Zhou, and J. Feng (2019) PANet: few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9197–9206.
[18] Q. Wei, X. Li, W. Yu, X. Zhang, Y. Zhang, B. Hu, B. Mo, D. Gong, N. Chen, D. Ding, et al. (2021) Learn to segment retinal lesions and beyond. In 2020 25th International Conference on Pattern Recognition (ICPR), pp. 7403–7410.
[19] Z. Yan, X. Han, C. Wang, Y. Qiu, Z. Xiong, and S. Cui (2019) Learning mutually local-global U-Nets for high-resolution retinal lesion segmentation in fundus images. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 597–600.
[20] L. Yang, Y. Zhang, Z. Zhao, H. Zheng, P. Liang, M. T. Ying, A. T. Ahuja, and D. Z. Chen (2018) BoxNet: deep learning based biomedical image segmentation using boxes only annotation. arXiv preprint arXiv:1806.00593.
[21] Y. Yang, Z. Wang, J. Liu, K. Cheng, and X. Yang (2019) Label refinement with an iterative generative adversarial network for boosting retinal vessel segmentation. arXiv preprint arXiv:1912.02589.
[22] Q. Yu, K. Dang, N. Tajbakhsh, D. Terzopoulos, and X. Ding (2021) A location-sensitive local prototype network for few-shot medical image segmentation. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 262–266.
[23] Y. Yuan, X. Chen, and J. Wang (2020) Object-contextual representations for semantic segmentation.
[24] G. Zhang, X. Lu, J. Tan, J. Li, Z. Zhang, Q. Li, and X. Hu (2021) RefineMask: towards high-quality instance segmentation with fine-grained features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6861–6869.