Seg4Reg+: Consistency Learning between Spine Segmentation and Cobb Angle Regression

Yi Lin (🖂), Luyan Liu, Kai Ma, and Yefeng Zheng
Tencent Jarvis Lab, Shenzhen, China
linyi.pk@gmail.com

Abstract

Automated methods for Cobb angle estimation are in high demand for scoliosis assessment. Existing methods typically calculate the Cobb angle from landmark estimation, or simply combine a low-level task (e.g., landmark detection or spine segmentation) with the Cobb angle regression task, without fully exploring how the tasks can benefit each other. In this study, we propose a novel multi-task framework, named Seg4Reg+, which jointly optimizes the segmentation and regression networks. We thoroughly investigate both local and global consistency and the knowledge transfer between the two networks. Specifically, we propose an attention regularization module that leverages class activation maps (CAMs) from image-segmentation pairs to discover additional supervision in the regression network; in turn, the CAMs serve as a region-of-interest enhancement gate to facilitate the segmentation task. Meanwhile, we design a novel triangle consistency learning scheme to train the two networks jointly for global optimization. Evaluations on the public AASCE Challenge dataset demonstrate the effectiveness of each module and the superior performance of our model over state-of-the-art methods.

1 Introduction

Adolescent idiopathic scoliosis (AIS) causes a structural, lateral, rotated curvature of the spine that arises in children at or around puberty [12]. The Cobb angle, derived from an anterior-posterior X-ray and measured by selecting the most tilted vertebrae, is the primary means for the clinical diagnosis of AIS. However, manual measurement of the Cobb angle requires professional radiologists to carefully identify each vertebra and measure the angle, which is time-consuming and can suffer from large inter-/intra-observer variability. Hence, an accurate and robust method for automatic quantitative measurement of the Cobb angle is needed.

Numerous computer-aided methods have been developed for automated Cobb angle estimation. Conventional methods utilized active contour models [2], customized filters [1] and charged-particle models [8] for spine segmentation to calculate the Cobb angle; these methods are computationally expensive, and unclear spine boundaries result in inaccurate estimations. Recently, deep learning based methods [3, 6, 9, 13, 15, 17] have been proposed to consolidate the tasks of vertebral landmark detection and Cobb angle estimation to improve the robustness of spinal curvature assessment. Seg4Reg [7], which won the 1st place in the AASCE challenge (https://aasce19.grand-challenge.org), used the segmentation results of the spine as the input of the regression network for Cobb angle estimation. Its superior performance is owed to the segmentation mask of the spine, which retains the shape information and filters out distractions (e.g., artifacts and local contrast variation). However, the performance of this method depends heavily on the segmentation results.

Although the above methods have achieved great success, their applications to Cobb angle estimation suffer from three limitations: 1) the methods [3, 6, 9, 13, 15, 17] that rely on landmark coordinates are fragile, since a small error in the landmark coordinates may cause a large error in angle estimation; 2) the two-stage frameworks [3, 6, 7, 17] often suffer from error accumulation; and 3) the cascaded networks [3, 6, 7] cannot guarantee a global optimum.

Figure 1: An overview of the proposed framework.

In this paper, we propose a novel consistency learning framework, named Seg4Reg+, which incorporates segmentation into the regression task, as shown in Fig. 1. The segmentation task extracts representative features for the regression task by an attention regularization (AR) module with auxiliary constraint on the class activation map (CAM). And the regression task is able to provide specific hints for the segmentation task by a region-of-interest enhancement (ROIE) gate to force the segmentation network to pay more attention to the important area. To reach the global optimum, we further design a novel triangle consistency learning scheme for end-to-end training. In summary, our main contributions are as follows:

We propose a novel consistency learning framework, named Seg4Reg+, incorporating segmentation and regression tasks with an AR module and ROIE gate to boost the performance of both tasks.
We design a triangle consistency learning scheme for end-to-end training.
Extensive evaluations on the public AASCE Challenge dataset demonstrate the effectiveness of each module and superior performance of our model to the state-of-the-art methods.

2 Method

As illustrated in Figure 1, the proposed Seg4Reg+ model consists of a segmentation network $N_{S}$ and a regression network $N_{R}$ . We first pre-train the two networks separately for approximately optimized results to speed up the training process. Then the two networks are boosted by each other by the ROIE gate and AR module. In addition, a novel training strategy named triangle consistency learning scheme is designed for end-to-end training. In the following, we first introduce the proposed ROIE gate and AR module, then illustrate the triangle consistency learning scheme.

ROIE Gate. We first train $N_{S}$ to roughly segment the spine region. We adopt the same network as Seg4Reg [7] for a fair comparison. Specifically, we modify the PSPNet [18] by replacing the pooling layer with dilated convolution in the pyramid pooling module and take ResNet-50 [4] as the backbone. The objective function is a weighted sum of the Dice loss and the cross-entropy loss:

$$L_{seg} (s (x), y) = \sum_{i} \left( 1 - \frac{2 \times s (x_{i}) \, y_{i}}{s (x_{i}) + y_{i}} \right) + \lambda \sum_{i} \left( - y_{i} \log s (x_{i}) \right), \tag{1}$$

where $x$ denotes the input image, $s (x_{i})$ and $y_{i}$ denote the predicted and ground-truth labels of pixel $x_{i}$, respectively, and $\lambda$ denotes the hyperparameter weighting the two losses.
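As a minimal sketch of Equation (1), the per-pixel loss can be written in plain Python as follows. The function name `seg_loss`, the default `lam=1.0` and the stabilizer `eps` are assumptions made for illustration and are not specified in the paper; in practice the same computation would be vectorized in a deep learning framework.

```python
import math


def seg_loss(pred, target, lam=1.0, eps=1e-7):
    """Dice-style term plus lam-weighted cross-entropy, summed over pixels (Eq. 1).

    pred:   flat list of predicted foreground probabilities s(x_i)
    target: flat list of binary ground-truth labels y_i
    lam:    weight of the cross-entropy term (value assumed, not from the paper)
    eps:    assumed stabilizer that avoids division by zero and log(0)
    """
    dice_term = 0.0
    ce_term = 0.0
    for p, y in zip(pred, target):
        dice_term += 1.0 - (2.0 * p * y) / (p + y + eps)
        ce_term += -y * math.log(p + eps)
    return dice_term + lam * ce_term


# A more confident correct prediction on a foreground pixel gives a lower loss.
print(seg_loss([0.9, 0.1], [1, 0]) < seg_loss([0.5, 0.1], [1, 0]))
```

The sketch follows the equation literally, pixel by pixel; many implementations instead compute the Dice term over the whole mask, which is a different (though related) choice.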

To boost the performance of $N_{S}$ with $N_{R}$ , the ROIE gate is designed as an attention mechanism to transfer specific hints from $N_{R}$ to $N_{S}$ . The proposed ROIE gate is inspired by CAM, which is the most common technique in weakly supervised segmentation methods [19]. We expect that the CAM of $N_{R}$ can incorporate refined prior information about the spine area into the segmentation process. In addition, the value of each pixel on the CAM represents its significance to the regression output, which in turn guides $N_{S}$ to pay more attention to the important areas (i.e., the endplates of the most tilted vertebrae). Specifically, we treat the CAM of $N_{R}$ as an attention map and perform a matrix multiplication between the CAM and the feature map $f_{m} (x)$ from the middle layer of $N_{S}$ . Then, we multiply the result by a scalar parameter $\alpha$ and perform an element-wise sum with the feature $f_{m} (x)$ to obtain the final output $f_{m}^{'} (x)$ as follows:

$$f_{m}^{'} (x) = \alpha \left( C (r (x)) \circ f_{m} (x) \right) + f_{m} (x), \tag{2}$$

where $C (\cdot)$ is a CAM that indicates the discriminative part with respect to the regression results, $\alpha$ is a learnable parameter initialized to 0, and $\circ$ denotes the multiplication operation. It can be inferred from Equation (2) that the resulting feature $f_{m}^{'} (x)$ combines a global contextual view and selectively aggregates contexts according to the CAM, thus improving intra-class compactness and semantic consistency.
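The gating in Equation (2) can be illustrated with a small sketch on nested lists. Here the multiplication is assumed to be element-wise and broadcast over channels, and the CAM is assumed to be already resized to the feature resolution; both are interpretations made for illustration, and the name `roie_gate` is hypothetical.

```python
def roie_gate(feature, cam, alpha=0.0):
    """Compute f' = alpha * (CAM * f) + f as in Eq. (2).

    feature: nested list [channels][height][width], the mid-layer map f_m(x)
    cam:     nested list [height][width], assumed resized to the feature size
    alpha:   scalar gate; learnable in the paper and initialized to 0
    """
    return [
        [
            [alpha * cam[h][w] * value + value for w, value in enumerate(row)]
            for h, row in enumerate(channel)
        ]
        for channel in feature
    ]


feature = [[[1.0, 2.0], [3.0, 4.0]]]   # one channel, 2 x 2
cam = [[0.0, 1.0], [0.5, 1.0]]         # attention weights from the regression network
print(roie_gate(feature, cam, alpha=0.0))  # identical to the input feature
print(roie_gate(feature, cam, alpha=2.0))  # high-CAM positions are amplified
```

With `alpha = 0` the gate is the identity, so fine-tuning starts from the behaviour of the pre-trained segmentation network and the influence of the CAM grows only as `alpha` is learned.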

AR Module. For $N_{R}$ , we modify a state-of-the-art classification network by replacing the last convolutional layer so that its output channels correspond to the three clinically relevant Cobb angles: proximal thoracic (PT), main thoracic (MT) and thoracolumbar/lumbar (TL). The activation function in the last layer is set to the sigmoid function. Here, we design a novel objective function based on the symmetric mean absolute percentage error, named SMAPE loss:

$$L_{SMAPE} (r (x), \hat{y}) = \frac{\sum_{i = 1}^{n} | \hat{y}_{i} - r (x_{i}) |}{\sum_{i = 1}^{n} | \hat{y}_{i} + r (x_{i}) + \epsilon |}, \tag{3}$$

where $\hat{y}_{i}$ and $r (x_{i})$ denote the ground truth and prediction of the $i$ th angle of the total $n = 3$ Cobb angles (i.e., PT, MT and TL), and $\epsilon$ is a smooth factor.
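A minimal sketch of the SMAPE loss in Equation (3) for a single sample is given below. The function name `smape_loss` and the default value of `eps` are assumptions for illustration; the angles are taken to be already normalized to [0, 1], matching the sigmoid output.

```python
def smape_loss(pred, target, eps=1e-6):
    """Ratio of summed absolute errors to summed absolute totals (Eq. 3).

    pred:   predicted angles r(x_i), e.g. [PT, MT, TL] after the sigmoid
    target: ground-truth angles normalized to [0, 1]
    eps:    smooth factor (value assumed, not given in the paper)
    """
    numerator = sum(abs(t - p) for p, t in zip(pred, target))
    denominator = sum(abs(t + p + eps) for p, t in zip(pred, target))
    return numerator / denominator


print(smape_loss([0.2, 0.4, 0.6], [0.2, 0.4, 0.6]))  # perfect prediction
print(smape_loss([0.1, 0.2, 0.3], [0.3, 0.2, 0.1]))  # two of three angles are off
```

Because the numerator and denominator are each summed over the three angles before dividing, a large error on a small angle is weighted by the total magnitude rather than by that angle alone.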

To boost the performance of $N_{R}$ with $N_{S}$ , the AR module is designed to explore the hidden state representation of $N_{R}$ via class activation mapping (CAM) and to force the network to focus on the spine area. Specifically, to integrate regularization on $N_{R}$ , we expand $N_{R}$ into a shared-weight Siamese structure. One branch takes the concatenation of the raw image and its corresponding segmentation mask as input, and the other directly takes the segmentation mask as input. The output activation maps from the two branches are regularized by the mean absolute error to guarantee the consistency of the CAMs, and the regression network is consequently forced to focus on the spine area. The objective function is:

$$L_{AR} = \| C (x, s (x)) - C (s (x)) \|_{1}, \tag{4}$$
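The attention regularization in Equation (4) can be sketched as the mean absolute difference between the two CAMs produced by the Siamese branches. The name `ar_loss` is hypothetical, and averaging over pixels (rather than summing) is an assumption that follows the "mean absolute error" wording in the text; it differs from the plain L1 norm only by a constant factor.

```python
def ar_loss(cam_joint, cam_seg):
    """Mean absolute difference between two CAMs of equal size (Eq. 4).

    cam_joint: CAM from the branch fed with the image concatenated with its mask
    cam_seg:   CAM from the branch fed with the segmentation mask only
    """
    diffs = [
        abs(a - b)
        for row_joint, row_seg in zip(cam_joint, cam_seg)
        for a, b in zip(row_joint, row_seg)
    ]
    return sum(diffs) / len(diffs)


cam_joint = [[0.0, 1.0], [0.5, 0.5]]
cam_seg = [[0.0, 0.0], [0.5, 1.0]]
print(ar_loss(cam_joint, cam_seg))
```

The loss is zero only when both branches attend to the same locations, which is what pushes the image branch toward the spine region highlighted by the mask branch.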

Algorithm 1: Triangle Consistency Learning.

Input: input image, $x \in X$ ; ground truth of segmentation mask of spine, $y \in Y$ ; ground truth of three Cobb angles (PT, MT and TL), $\hat{y} \in \hat{Y}$ ; $s (\cdot)$ : segmentation task of network $N_{S}$ with parameter $\theta_{1}$ ; $r (\cdot)$ : regression task of network $N_{R}$ with parameter $\theta_{2}$ ; $C (\cdot)$ : class activation map (CAM) generated by $N_{R}$ .

// Training basic $N_{S}$ .
1: while stopping criterion not met do
2:     Compute the segmentation loss $L_{seg} (s (x), y)$ with Equation (1);
3:     Update parameters $\theta_{1}$ of $N_{S}$ by backpropagation;
4: end while
// Training $N_{R}$ with AR.
5: while stopping criterion not met do
6:     Compute the SMAPE loss $L_{SMAPE} (r (x, s (x)), \hat{y})$ with Equation (3); // $(x, s (x))$ means concatenation of raw image and its segmentation mask.
7:     Compute the SMAPE loss $L_{SMAPE} (r (s (x)), \hat{y})$ with Equation (3);
8:     Compute the attention regularization loss $L_{AR} = \| C (x, s (x)) - C (s (x)) \|_{1}$ ;
9:     Update the regression network parameters $\theta_{2}$ ;
10: end while
// Fine-tuning $N_{S}$ with ROIE.
11: while stopping criterion not met do
12:     Add local consistency constraints to $N_{S}$ with Equation (2);
13:     Update the parameters $\theta_{1}$ of $N_{S}$ ;
14: end while
// Fine-tuning $N_{S}$ by SMAPE loss.
15: while stopping criterion not met do
16:     Compute the regression loss $L_{SMAPE} (r (s), r (s (x, C (x))))$ ;
17:     Freeze $\theta_{2}$ and update $\theta_{1}$ ;
18: end while
// Fine-tuning $N_{R}$ with refined segmentation.
19: Repeat steps 5-10
20: return $\theta_{1}$ and $\theta_{2}$

Figure 2: The training strategy of triangle consistency learning: (a) training $N_{S}$ , (b) training $N_{R}$ with AR, (c) fine-tuning $N_{S}$ with ROIE, (d) fine-tuning $N_{S}$ by the SMAPE loss, and (e) fine-tuning $N_{R}$ with refined segmentation.

Triangle Consistency Learning. Our training strategy is inspired by the inference-path invariance theory [16], which states that inference paths with the same endpoints but different intermediate domains yield similar results. The segmentation process is essentially an auxiliary task for the regression task; thus the regression network has the potential to optimize the segmentation network. Based on this assumption, we design a novel training strategy, named triangle consistency learning, for end-to-end training, as shown in Figure 2. The details of the proposed training strategy are given in Algorithm 1, which can be divided into five processes: 1) training basic $N_{S}$ (steps 1-4); 2) training $N_{R}$ with AR (steps 5-10); 3) fine-tuning $N_{S}$ with ROIE (steps 11-14); 4) fine-tuning $N_{S}$ with the SMAPE loss on the regression outputs of $s (x)$ and $y$ (steps 15-18); and 5) fine-tuning $N_{R}$ with refined segmentation (step 19). In the fourth process, we compute the SMAPE loss between the regression outputs of the segmentation results and of the corresponding ground truth, denoted $r (s (x))$ and $r (y)$ , respectively. Then we freeze the parameters of $N_{R}$ and optimize $N_{S}$ by backpropagation. In this way, we can generate a segmentation mask that is more suitable for $N_{R}$ than one obtained by conventional segmentation training.
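The five processes can be summarized as a small schedule table. The sketch below only records which parameter group is optimized in each process and which objective the paper names for it; the `Phase` class, the field names and the string labels are hypothetical, and how the individual losses are weighted within a process is not specified in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str       # label of the process
    steps: str      # step range in Algorithm 1
    updated: str    # parameter group that is optimized
    objective: str  # loss or constraint named in the paper


TRIANGLE_SCHEDULE = (
    Phase("train basic N_S", "1-4", "theta_1", "L_seg, Eq. (1)"),
    Phase("train N_R with AR", "5-10", "theta_2", "L_SMAPE, Eq. (3) and L_AR, Eq. (4)"),
    Phase("fine-tune N_S with ROIE", "11-14", "theta_1", "ROIE gate, Eq. (2)"),
    Phase("fine-tune N_S by SMAPE loss", "15-18", "theta_1", "L_SMAPE with theta_2 frozen"),
    Phase("fine-tune N_R with refined segmentation", "19", "theta_2", "repeat steps 5-10"),
)


def updated_groups(schedule):
    """Return the parameter group optimized in each phase, in order."""
    return [phase.updated for phase in schedule]


for phase in TRIANGLE_SCHEDULE:
    print(f"steps {phase.steps:>5}: {phase.name} -> update {phase.updated} ({phase.objective})")
```

Reading the `updated` column top to bottom shows the alternation between the two networks: the segmentation parameters are trained, then the regression parameters, then the segmentation parameters twice more, and finally the regression parameters again.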

3 Experiments

Data and Implementation Details. We use the public dataset of MICCAI 2019 AASCE Challenge [13], which consists of 609 spinal anterior-posterior X-ray images to evaluate our method. The dataset is divided by the provider into 481 images for training and 128 images for testing. We evaluate the proposed method using two metrics, symmetric mean absolute percent error (SMAPE) and mean absolute error (MAE). And we evaluate the segmentation results using five performance metrics including the Jaccard index (JA), Dice coefficient (Dice), pixel-wise accuracy (pixel-AC), pixel-wise sensitivity (pixel-SE), and pixel-wise specificity (pixel-SP).
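For reference, the five segmentation metrics can be computed from pixel-wise confusion counts as sketched below. The paper does not spell out the formulas, so the standard definitions are assumed here, and the function name `segmentation_metrics` is hypothetical. The sketch also assumes that both classes occur in the masks, so that no denominator is zero.

```python
def segmentation_metrics(pred, target):
    """Standard pixel-wise metrics for flat binary masks (values 0 or 1)."""
    pairs = list(zip(pred, target))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    return {
        "JA": tp / (tp + fp + fn),            # Jaccard index
        "Dice": 2 * tp / (2 * tp + fp + fn),  # Dice coefficient
        "pixel-AC": (tp + tn) / len(pairs),   # accuracy
        "pixel-SE": tp / (tp + fn),           # sensitivity
        "pixel-SP": tn / (tn + fp),           # specificity
    }


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
for name, value in segmentation_metrics(pred, target).items():
    print(f"{name}: {100 * value:.2f}%")
```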

We pre-process the data before inputting it into our network. First, we resize the image to [512, 256]. Then, we linearly transform the Cobb angles into [0, 1], and augment the dataset by random flipping, rotation within ( $-45^{\circ}$ , $45^{\circ}$ ), and rescaling with a factor in (0.85, 1.25). We train $N_{S}$ for 90 epochs using the Adam optimizer with learning rate $1 \times 10^{- 4}$ and weight decay $1 \times 10^{- 5}$ . For regression, we train the network for 200 epochs in total with learning rate $1 \times 10^{- 3}$ and weight decay $1 \times 10^{- 5}$ .
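A minimal sketch of the label normalization and of sampling augmentation parameters within the stated ranges is shown below. The maximum angle used for the linear transform and the flip probability are not given in the paper, so `max_angle_deg` is left as a required argument and the probability of 0.5 is an assumption; both function names are hypothetical.

```python
import random


def normalize_angles(angles_deg, max_angle_deg):
    """Linearly map angles from [0, max_angle_deg] to [0, 1].

    max_angle_deg is a hypothetical parameter; the paper does not state it.
    """
    return [angle / max_angle_deg for angle in angles_deg]


def sample_augmentation(rng):
    """Draw one set of augmentation parameters within the ranges in the text."""
    return {
        "flip": rng.random() < 0.5,               # flip probability assumed
        "rotation_deg": rng.uniform(-45.0, 45.0),
        "scale": rng.uniform(0.85, 1.25),
    }


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
print(normalize_angles([0.0, 45.0, 90.0], max_angle_deg=90.0))
print(sample_augmentation(rng))
```

Normalizing the targets to [0, 1] keeps them in the output range of the sigmoid in the last layer of the regression network.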

Baseline	AR	ROIE	TCL	Img	Seg	MAE	SMAPE (%)
✓				✓		6.34, 7.77, 8.01	12.32
✓	✓			✓		4.55, 5.75, 5.92	9.39
✓	✓	✓		✓		4.21, 5.32, 5.22	9.32
✓	✓	✓	✓	✓		4.01, 5.16, 5.51	9.17
✓					✓	6.51, 6.22, 7.17	10.95
✓	✓				✓	4.03, 6.38, 5.80	9.52
✓	✓	✓			✓	3.61, 4.90, 5.53	9.01
✓	✓	✓	✓		✓	5.13, 4.73, 5.24	8.92
✓	✓	✓	✓	✓	✓	3.88, 4.62, 4.99	8.47

Table 1: The ablation study for each part of Seg4Reg+. AR: attention regularization module, ROIE: region-of-interest enhancement, TCL: triangle consistency learning, Img: raw image as input, and Seg: segmentation mask.

Ablation Studies. To verify the effectiveness of each module in the proposed Seg4Reg+ approach, we conduct a set of ablation experiments, whose results are shown in Table 1. We first test with the raw image as input: the AR module yields a 2.93% improvement in SMAPE over the baseline (12.32% to 9.39%), which validates that the AR module can gain from segmentation and benefit the regression task. Adding the ROIE gate further improves the performance by 0.07% (the improvement with the segmentation mask as input is more significant, with a 0.51% boost in SMAPE), which demonstrates that $N_{R}$ can boost $N_{S}$ in turn. Applying triangle consistency learning improves the SMAPE by a further 0.15%. A similar trend can also be observed when we test with segmentation results as input. Finally, when we concatenate the raw image and segmentation mask together as input, the SMAPE is further reduced to 8.47%.
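The quoted gains are absolute differences in SMAPE (percentage points) between consecutive rows of Table 1, which the following short check recomputes. The SMAPE values are copied from Table 1; the dictionary layout and the function name `stepwise_gains` are illustrative only.

```python
# SMAPE (%) from Table 1 for the raw-image rows and the segmentation-mask rows.
TABLE1_SMAPE = {
    "img": {"baseline": 12.32, "AR": 9.39, "AR+ROIE": 9.32, "AR+ROIE+TCL": 9.17},
    "seg": {"baseline": 10.95, "AR": 9.52, "AR+ROIE": 9.01, "AR+ROIE+TCL": 8.92},
}


def stepwise_gains(row):
    """Absolute SMAPE reduction obtained by each added component."""
    values = list(row.values())
    return [round(prev - cur, 2) for prev, cur in zip(values, values[1:])]


for input_type, row in TABLE1_SMAPE.items():
    print(input_type, stepwise_gains(row))
```

For the raw-image rows this gives 2.93, 0.07 and 0.15, and for the segmentation-mask rows the ROIE step gives 0.51, matching the figures quoted in the text.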

Figure 3: The visualization of CAMs with and without the AR module. The top and middle rows show the result of baseline method (without AR) using raw image and segmentation mask as input, respectively. And the bottom row shows the CAMs generated from the baseline method with AR.

Figure 4: Examples of segmentation results. Each group shows the original image, CAM, and segmentations without and with ROIE, respectively. The yellow mask is true positive, red mask is false negative, and green mask is false positive.

Effectiveness of AR Module. Figure 3 shows the results of the baseline method and the AR module. From the first row in Figure 3, we can see several drawbacks of the CAM generated by the baseline method: 1) focusing only on a local region (e.g., columns 1-11), 2) a relatively inferior ability of feature extraction (e.g., columns 12-14), and 3) vulnerability to blurred images (e.g., columns 15-16). The bottom row in Figure 3 shows that, with the AR module, $N_{R}$ can make more precise predictions with a more reasonable field of attention. In particular, in column 14, our AR module predicts a more precise attention area while the baseline method generates very weak attention maps.

Methods	JA	Dice	pixel-AC	pixel-SE	pixel-SP
w/o CAMs	75.47	86.02	94.83	86.27	98.21
MDC [11]	75.91	86.31	95.73	89.08	98.12
D-Net [5]	76.99	87.00	95.02	86.61	98.46
MB-DCNN [14]	76.17	86.47	95.12	87.18	98.19
Ours	77.86	87.55	95.49	88.06	98.42

Table 2: Comparison of segmentation performance (%) of ROIE with other methods.

Effectiveness of ROIE Gate. We evaluate the segmentation results of our ROIE gate both qualitatively and quantitatively. Table 2 shows that applying the ROIE gate to the original segmentation model raises JA from 75.47% to 77.86%, and the visualization results in Figure 4 show that the ROIE gate helps reduce false positives. Furthermore, we compare with different ways of transferring the CAM to the segmentation network, i.e., MDC [11], D-Net [5] and MB-DCNN [14]. Because these methods simply deliver the CAMs to $N_{S}$ , they can be affected by inaccurate CAMs. Our ROIE gate alleviates this problem by fusing the CAM adaptively with a learnable parameter. For a fair comparison, all of these methods use the same segmentation architecture. The results show that our ROIE gate improves JA by 0.9% and 1.7% over D-Net and MB-DCNN, respectively.

Methods	A-Net [3]	L-Net [3]	B-Net [13]	PFA [10]	VF [6]	VF [6]	Seg4Reg [7]	Seg4Reg [7]	Ours	Ours
Backbone	-	-	-	res50	UNet	MNet	res18	eff_b1	res18	eff_b1
MAE	8.58	10.46	9.31	6.69	3.90	3.51	6.63	3.96	4.50	3.73
SMAPE (%)	20.35	26.94	23.44	12.97	8.79	7.84	10.95	7.64	8.47	7.32

Table 3: Comparison of regression performance with the state-of-the-art methods.

Comparison with State-of-the-art. We compare the proposed method with state-of-the-art methods including A-Net [3], L-Net [3], BoostNet [13], PFA [10], VF [6], and Seg4Reg [7], which won the 1st place in the AASCE challenge. We follow the same experimental setting as VF [6], and the performance of all competing methods is taken from the original publications for a fair comparison.

As depicted in Table 3, promising results are observed in predicting Cobb angle using the proposed framework. The SMAPE is 7.32% with our Seg4Reg+ framework, whereas PFA, VF-MNet and Seg4Reg achieve 12.97%, 7.84% and 7.64%, respectively. The MAE of our method is 3.73, slightly inferior to VF-MNet. We believe the superior performance of our model in SMAPE is more convincing as SMAPE is the only evaluation metric in the AASCE challenge.

4 Conclusion

In this paper, we proposed a novel Seg4Reg+ model for automated Cobb angle estimation. Our Seg4Reg+ model incorporates segmentation into regression task via an attention regularization module and a region-of-interest enhancement gate to boost the performance of both tasks. Further, the two networks are integrated with global consistency learning for global optimization. Experimental results demonstrated the effectiveness of each module and the superior performance of our model to the state-of-the-art methods.

Acknowledgments. This work was funded by the Key-Area Research and Development Program of Guangdong Province, China (No. 2018B010111001), and the Scientific and Technical Innovation 2030 "New Generation Artificial Intelligence" Project (No. 2020AAA0104100).

References

[1] H. Anitha, A. Karunakar, and K. Dinesh (2014) Automatic extraction of vertebral endplates from scoliotic radiographs using customized filter. Biomedical Engineering Letters 4 (2), pp. 158–165.
[2] H. Anitha and G. Prabhu (2012) Automatic quantification of spinal curvature in scoliotic radiograph using image processing. Journal of Medical Systems 36 (3), pp. 1943–1951.
[3] B. Chen, Q. Xu, L. Wang, S. Leung, J. Chung, and S. Li (2019) An automated and accurate spine curve analysis system. IEEE Access 7, pp. 124596–124605.
[4] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[5] S. Hong, H. Noh, and B. Han (2015) Decoupled deep neural network for semi-supervised semantic segmentation. arXiv preprint arXiv:1506.04924.
[6] K. C. Kim, H. S. Yun, S. Kim, and J. K. Seo (2020) Automation of spine curve assessment in frontal radiographs using deep learning of vertebral-tilt vector. IEEE Access 8, pp. 84618–84630.
[7] Y. Lin, H. Zhou, K. Ma, X. Yang, and Y. Zheng (2019) Seg4Reg networks for automated spinal curvature estimation. In International Workshop and Challenge on Computational Methods and Clinical Applications for Spine Imaging, pp. 69–74.
[8] T. A. Sardjono, M. H. Wilkinson, A. G. Veldhuizen, P. M. van Ooijen, K. E. Purnama, and G. J. Verkerke (2013) Automatic Cobb angle determination from radiographic images. Spine 38 (20), pp. E1256–E1262.
[9] H. Sun, X. Zhen, C. Bailey, P. Rasoulinejad, Y. Yin, and S. Li (2017) Direct estimation of spinal Cobb angles by structured multi-output regression. In International Conference on Information Processing in Medical Imaging, pp. 529–540.
[10] J. Wang, L. Wang, and C. Liu (2019) A multi-task learning method for direct estimation of spinal curvature. In International Workshop and Challenge on Computational Methods and Clinical Applications for Spine Imaging, pp. 113–118.
[11] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang (2018) Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7268–7277.
[12] S. L. Weinstein, L. A. Dolan, J. C. Cheng, A. Danielsson, and J. A. Morcuende (2008) Adolescent idiopathic scoliosis. The Lancet 371 (9623), pp. 1527–1537.
[13] H. Wu, C. Bailey, P. Rasoulinejad, and S. Li (2017) Automatic landmark estimation for adolescent idiopathic scoliosis assessment using BoostNet. In International Conference on Medical Image Computing and Computer Assisted Intervention, pp. 127–135.
[14] Y. Xie, J. Zhang, Y. Xia, and C. Shen (2020) A mutual bootstrapping model for automated skin lesion segmentation and classification. IEEE Transactions on Medical Imaging 39 (7), pp. 2482–2493.
[15] J. Yi, P. Wu, Q. Huang, H. Qu, and D. N. Metaxas (2020) Vertebra-focused landmark detection for scoliosis assessment. In IEEE 17th International Symposium on Biomedical Imaging, pp. 736–740.
[16] A. R. Zamir, A. Sax, N. Cheerla, R. Suri, Z. Cao, J. Malik, and L. J. Guibas (2020) Robust learning through cross-task consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11197–11206.
[17] C. Zhang, J. Wang, J. He, P. Gao, and G. Xie (2021) Automated vertebral landmarks and spinal curvature estimation using non-directional part affinity fields. Neurocomputing.
[18] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2881–2890.
[19] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.