Weakly Supervised Airway Orifice Segmentation in Video Bronchoscopy

Ron Keuth, Institute of Medical Informatics, University of Lübeck, Germany
Mattias Heinrich, Institute of Medical Informatics, University of Lübeck, Germany
Martin Eichenlaub, University Heart Center Freiburg-Bad Krozingen, Germany
Marian Himstedt, Institute of Medical Informatics, University of Lübeck, Germany

Further author information: E-mail: ron.keuth@student.uni-luebeck.de, marian.himstedt@uni-luebeck.de

CNN: Convolutional Neural Network
Lite R-ASPP: Lite Reduced Atrous Spatial Pyramid Pooling
CVC-DS: Dataset from the Computer Vision Center’s interactive and augmented modelling group
RB-DS: Real Bronchoscopy Dataset
DSC: Dice Similarity Score
GAN: Generative Adversarial Network
AMCD: Average Minimum Centroid Distance
VB: Video Bronchoscopy
COPD: Chronic Obstructive Pulmonary Disease
ICU: Intensive Care Units
EMT: Electromagnetic Tracking

1 Description of purpose

Video Bronchoscopy (VB) is commonly applied in the context of lung diseases. It is a fundamental procedure for the diagnosis of lung cancer, enabling biopsy of deep airway tissue. In addition, VB is routinely conducted for monitoring Chronic Obstructive Pulmonary Disease (COPD) patients and for clarifying acute respiratory problems in Intensive Care Units (ICU). Navigation within the bronchial tree is challenging and physically demanding for physicians due to homogeneous textures and the perceptually similar appearance of bronchial orifices. This is particularly the case in the absence of prior CT scans and Electromagnetic Tracking (EMT) systems in ICUs. Airway orifice segmentation, which is the main objective of this paper, enables image-based guidance, e.g. by providing graphical overlays on top of the VB images. In conjunction with EMT or image-based tracking [8], these overlays can be accompanied by airway labels w.r.t. generic or interpatient lung models; this, however, is not addressed by the approach presented in this paper. The variety of tissue appearance, illumination, image artifacts, secretions and patient anatomy poses a challenge for airway segmentation, which we aim to address with deep learning-based approaches. However, this is currently hampered by a lack of readily available ground truth labels, which motivates the incorporation of traditional (non-learning-based) methods as weak supervision. In particular, we use an airway phantom dataset accompanied by ground truth depth images to generate airway orifice labels for training a deep learning-based segmentation model. Our proposed methods are also developed with a focus on their complexity and runtime, keeping them real-time capable even on low-end devices for intervention guidance.

2 Methods

2.1 Datasets

We utilize three datasets for training, validating and testing our method. The Phantom dataset [7] consists of about 30k RGB and depth images captured within a simplified model. For in-vivo evaluation, we use 125 samples from 20 different bronchoscopies with their annotated segmentations from the public CVC-DS dataset [5], and about 100 frames of the private dataset RB-DS with expert annotations.

Figure 1, panel (a): Detecting airways using $k$-means ($k = 2$) on the depth data.

2.2 Data-Driven Methods for Instance Orifice Segmentation

Fig. 1 shows the steps of our proposed pipeline, which generates an instance orifice segmentation map from a given depth image. A $k$-means clustering with $k = 2$ separates the two classes, airway and other tissue, considering only the depth distribution (see Fig. 1(a)). The obtained global airway labels are then used to generate a binary segmentation map (see Fig. 1(e)) and to define the region of interest for the next stage. To separate individual airway instances, we low-pass filter the depth image with an efficient box filter (3 $\times$ average pooling with kernel size 3), followed by a non-maximum suppression in which peaks have to be at least $5\%$ of the image resolution apart from each other (see Fig. 1(d)). These peaks define the markers for the compact marker-based watershed algorithm [4], which takes the inverted depth image as input. The watershed models the nature of the instance airway segmentation problem well: it allows adjacent airways to have different depth values and lets their segmentations flow smoothly into each other (see Fig. 1(b)). Finally, the result of the watershed is combined with the global labels into an instance orifice segmentation map (see Fig. 1(f)).
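The first stages of this pipeline can be sketched in a few lines of NumPy. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`kmeans_two_classes`, `box_filter`, `depth_peaks`, `find_markers`), the deterministic $k$-means initialization, the edge padding, the greedy form of the non-maximum suppression and the reading of "$5\%$ of the image resolution" as $5\%$ of the longer image side are all hypothetical choices. Larger depth values are assumed to correspond to airways.

```python
import numpy as np


def kmeans_two_classes(depth, n_iter=20):
    """1-D k-means with k = 2 on the depth values.

    Returns a boolean mask that is True for the cluster with the larger
    mean depth (assumed here to be the airway class).
    """
    values = depth.ravel().astype(float)
    centers = np.array([values.min(), values.max()])  # deterministic init (assumption)
    for _ in range(n_iter):
        # assign every pixel to its nearest cluster center, then update centers
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = values[labels == k].mean()
    labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
    return (labels == centers.argmax()).reshape(depth.shape)


def box_filter(img, passes=3):
    """Repeated 3x3 average pooling (stride 1, edge padding) as a low-pass filter."""
    out = img.astype(float)
    h, w = out.shape
    for _ in range(passes):
        p = np.pad(out, 1, mode="edge")
        out = sum(p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)) / 9.0
    return out


def depth_peaks(smoothed, mask, min_dist):
    """Greedy non-maximum suppression: keep local maxima inside `mask`
    that are at least `min_dist` pixels apart."""
    base = np.where(mask, smoothed, -np.inf)
    score = base.copy()
    ys, xs = np.mgrid[:base.shape[0], :base.shape[1]]
    peaks = []
    while np.isfinite(score).any():
        y, x = np.unravel_index(score.argmax(), score.shape)
        near = (ys - y) ** 2 + (xs - x) ** 2 < min_dist ** 2
        if base[y, x] >= base[near].max():  # true local maximum in its neighbourhood
            peaks.append((int(y), int(x)))
        score[near] = -np.inf  # suppress the neighbourhood either way
    return peaks


def find_markers(depth, rel_dist=0.05):
    """Binary airway mask plus one marker per detected depth peak."""
    mask = kmeans_two_classes(depth)
    peaks = depth_peaks(box_filter(depth), mask, rel_dist * max(depth.shape))
    return mask, peaks
```

The returned peaks would then serve as markers for a marker-based watershed on the inverted depth image, restricted to the binary mask. In scikit-image, for instance, `skimage.segmentation.watershed` accepts `markers`, `mask` and `compactness` arguments that fit this role.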

The pipeline is comfortably real-time capable, running at roughly $130\,\mathrm{Hz}$ on a laptop CPU (Intel i5-7200U, 2C/4T @ 3.1 GHz). One current limitation of our pipeline is that it always segments at least one orifice, even if this is a false positive. However, because such a false positive tends to cover an unusually large area, the problem can be solved with a heuristic such as a relative threshold on the area covered by one orifice instance.
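Such an area heuristic could look like the following sketch. The function name `drop_oversized_instances` and the default threshold of $50\%$ of the image area are hypothetical; the text does not state a concrete value, so the threshold would have to be tuned on real data.

```python
import numpy as np


def drop_oversized_instances(instance_map, max_area_fraction=0.5):
    """Reset every instance that covers more than `max_area_fraction`
    of the image to background (label 0).

    `instance_map` is an integer array with 0 = background and
    1..K = orifice instances. The default fraction is an assumption.
    """
    cleaned = instance_map.copy()
    total = instance_map.size
    for label in np.unique(instance_map):
        if label == 0:
            continue
        region = instance_map == label
        if region.sum() / total > max_area_fraction:
            cleaned[region] = 0  # likely a false positive: discard it
    return cleaned
```

Because the threshold is relative to the image area, the same value can be reused across image resolutions.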

2.3 CNN Architecture

For this paper, we focus solely on binary orifice segmentation, which enables the use of Lite Reduced Atrous Spatial Pyramid Pooling (Lite R-ASPP) [2] as an efficient Convolutional Neural Network (CNN) architecture for segmentation. We initialize the encoder with weights pretrained on ImageNet. The large variety in texture and illumination between the synthetic and the in-vivo VB images introduces a domain gap. To narrow this gap, we apply multiple data augmentation methods. For the intensity values of the RGB images, we follow the augmentation guideline of [1] and randomly choose transformations such as color jitter, quantization and histogram equalization. In in-vivo VB, the operator frequently rotates the endoscope during navigation, which makes rotation invariance a real-world requirement for our methods. We achieve this by rotating the images by angles sampled randomly from $[0, 2\pi]$ (in radians) [9]. To prevent the model's parameters from overfitting the synthetic data, we use weight decay in its AdamW implementation [3] during training.
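The rotation augmentation can be illustrated with a small NumPy sketch. It is an assumption-laden example rather than the training code: `rotate_nearest` and `random_rotation` are hypothetical names, nearest-neighbour sampling and zero filling of uncovered pixels are choices made here for simplicity, and the essential point is only that the RGB image and its label mask receive the same randomly sampled angle.

```python
import numpy as np


def rotate_nearest(img, angle):
    """Rotate a (H, W) or (H, W, C) array about its center by `angle` radians.

    Uses inverse mapping with nearest-neighbour sampling; output pixels
    whose source falls outside the image are set to 0.
    """
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[:h, :w]
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    # for every output pixel, compute the source pixel it is sampled from
    src_x = cos_a * (xs - cx) + sin_a * (ys - cy) + cx
    src_y = -sin_a * (xs - cx) + cos_a * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(img)
    out[valid] = img[sy[valid], sx[valid]]
    return out


def random_rotation(image, mask, rng):
    """Rotate image and mask jointly by one angle drawn uniformly from [0, 2*pi)."""
    angle = rng.uniform(0.0, 2.0 * np.pi)
    return rotate_nearest(image, angle), rotate_nearest(mask, angle), angle
```

Nearest-neighbour sampling keeps the mask binary after rotation; for the RGB image, a bilinear scheme would be a common alternative.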

2.4 Metric for Evaluation

A well-established metric for evaluating the overlap of two segmentations is the Dice Similarity Score (DSC). However, the DSC alone has only limited significance in our context. Unlike e.g. a liver, an airway orifice has no clear organ boundaries, which results in high inter- and intra-observer variability in the in-vivo ground truth segmentations. The first column of Fig. 2(b) shows a good example of such a situation: four segmentations of different sizes are provided for the same airway orifice, all of them correct, but they result in an underestimated DSC.
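A small worked example makes this sensitivity concrete. The sketch below uses the standard definition $\mathrm{DSC} = 2|A \cap B| / (|A| + |B|)$; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions of this example.

```python
import numpy as np


def dice_score(pred, target):
    """Dice Similarity Score of two binary masks, in [0, 1]."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: counted as perfect agreement (assumption)
    return 2.0 * np.logical_and(pred, target).sum() / denom


# Two concentric squares that mark the same orifice with different extents.
target = np.zeros((16, 16), dtype=bool)
target[4:12, 4:12] = True   # 8 x 8 = 64 pixels
pred = np.zeros((16, 16), dtype=bool)
pred[6:10, 6:10] = True     # 4 x 4 = 16 pixels, fully inside the target

dsc = dice_score(pred, target)  # 2 * 16 / (64 + 16) = 0.4
```

Although both masks are centered on the same location, the DSC is only 0.4, which is the kind of underestimation described above.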

As a solution, we also consider the distance between the first moments (centers of gravity) of the individual airway orifice instance segmentations. Moments are scale invariant and therefore well suited for this use case. We convert the distances of the first moments into our Average Minimum Centroid Distance (AMCD) metric as follows: given $N$ airways with ground truth segmentations and $M$ predicted segmentations, we collect their first moments in $C \in \mathbb{R}^{N \times 2}$ and $\hat{C} \in \mathbb{R}^{M \times 2}$, respectively. The minimal center distance $d_{c_{i}}$ is then the minimum of the Euclidean distances from the ground truth center $c_{i}$ to all predicted centers $\hat{c}_{j}$:

$$d_{c_{i}} = \min_{j} \lVert c_{i} - \hat{c}_{j} \rVert_{2} \qquad (1)$$

with $i \in \{1, \dots, N\}$ and $j \in \{1, \dots, M\}$. We finally obtain the AMCD for the overall image as the mean $\bar{d}_{c} = \frac{1}{N} \sum_{i=1}^{N} d_{c_{i}}$. Nevertheless, we also report the DSC because it is a well-established metric, even though its significance is rather limited for the evaluation of our approach.
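The metric is straightforward to compute once the centroids are available. The following NumPy sketch is one possible reading of the definition: the function names `centroids` and `amcd`, the (row, column) coordinate order and the choice to return NaN when either centroid set is empty are assumptions of this example, since the text does not cover that case.

```python
import numpy as np


def centroids(instance_map):
    """First moments (row, col) of every non-zero label in an instance map."""
    labels = [l for l in np.unique(instance_map) if l != 0]
    pts = [np.argwhere(instance_map == l).mean(axis=0) for l in labels]
    return np.array(pts, dtype=float).reshape(-1, 2)


def amcd(gt_centroids, pred_centroids):
    """Average Minimum Centroid Distance between N ground truth and
    M predicted centroids."""
    C = np.asarray(gt_centroids, dtype=float).reshape(-1, 2)        # (N, 2)
    C_hat = np.asarray(pred_centroids, dtype=float).reshape(-1, 2)  # (M, 2)
    if len(C) == 0 or len(C_hat) == 0:
        return float("nan")  # undefined without centroids (assumption)
    # pairwise Euclidean distances, shape (N, M)
    dists = np.linalg.norm(C[:, None, :] - C_hat[None, :, :], axis=2)
    # minimum over predictions for each ground truth center, then the mean
    return float(dists.min(axis=1).mean())
```

For example, ground truth centers at (0, 0) and (10, 0) with predictions at (0, 3), (10, 4) and (50, 50) give minimum distances of 3 and 4 and therefore an AMCD of 3.5. The unmatched third prediction does not enter the score, because the minimum is taken per ground truth center.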

2.5 Training and Evaluation

We train a Lite R-ASPP instance on the training split of the phantom dataset and evaluate its performance on the test split as well as on the two in-vivo VB datasets. In addition, one model is trained on each of the two in-vivo datasets to examine whether the semantic airway knowledge gained from the synthetic data is comparable to that gained from in-vivo data. Note that we decided against cross-validation on the in-vivo datasets due to their limited sizes. Thus, the performance of these models on their own training dataset does not demonstrate their ability to generalize to unseen data.

3 Results

Train dataset	Metric	Test: Phantom[test]	Test: CVC-DS	Test: RB-DS
Phantom[train]	DSC [%]	$73.48 \pm 17.72$	$60.91 \pm 17.83$	$50.58 \pm 16$
Phantom[train]	$\bar{d}_{c}$ [px]	$20.41 \pm 11.55$	$8.41 \pm 7.25$	$14.58 \pm 10.75$
CVC-DS	DSC [%]	$36.67 \pm 21.35$	$85.79 \pm 7.27$	$61.51 \pm 22.03$
CVC-DS	$\bar{d}_{c}$ [px]	$20.19 \pm 17.52$	$2.76 \pm 2.41$	$9.5 \pm 11.27$
RB-DS	DSC [%]	$32.72 \pm 12.66$	$80.95 \pm 9.66$	$80.57 \pm 0.1$
RB-DS	$\bar{d}_{c}$ [px]	$16.74 \pm 9.44$	$3.55 \pm 3.43$	$4.37 \pm 3.9$

Table 1: Quantitative results. We use the Dice Similarity Score (DSC) and the Average Minimum Centroid Distance $\bar{d}_{c}$ (see Eq. 1) at an image resolution of $128^{2}$. Please keep in mind the limited significance of the DSC in our context (see Sec. 2.4).

The qualitative results in Fig. 2 demonstrate that the model trained on the phantom dataset was able to extract semantic knowledge about airways from the noisy ground truth that our proposed data-driven pipeline generated from depth images. In some cases, it even outperforms this noisy ground truth, particularly when airways were not detected properly due to their low depth profiles (see Fig. 2(a)). This happens when an orifice belongs to an airway that lies at a steep angle to the camera perspective, so that only its wall remains visible. The data augmentation method [1] was able to close the domain gap from the phantom to the in-vivo datasets, as a comparison of the model's performance with that of the models trained directly on in-vivo data shows. Even frames without any visible airways were predicted correctly, although such situations were not directly covered by the ground truth during training (see Fig. 2(b)). With these results in mind, our training can be considered successful for the real-world use case.

Our quantitative results in Tab. 1 have only limited significance for the Phantom dataset because it lacks real ground truth, for which our noisy generated labels serve as a substitute. In contrast, the ground truth of the in-vivo datasets was created by human experts and is, apart from the very high observer variability, reliable. Judging by the resulting DSC, the model trained on the Phantom dataset shows great domain robustness. At first sight, the domain gap seems larger in the opposite direction, i.e. coming from the in-vivo domain. But this is a perfect example of the motivation for our AMCD $\bar{d}_{c}$: even though the DSC of the in-vivo models on the phantom data is only about half of that of the model trained on the phantom dataset (36.67 and 32.72 vs. 73.48), the AMCD is in the very same range (20.19 and 16.74 vs. 20.41 px), which goes hand in hand with the correct predictions visible in the qualitative results. This is due to the observer variability of the ground truth segmentation, which results in different segmentation diameters depending on the observer.

Considering the quantitative and especially the qualitative results, training on the Phantom data with the noisy ground truth as weak supervision enables the model to learn semantic knowledge of airways. We have also shown that this knowledge can be transferred to in-vivo data while reaching performance comparable to that of the models trained directly on these datasets.

4 New or breakthrough work to be presented

This paper presents a novel approach for weakly supervised CNN-based airway orifice segmentation in video bronchoscopy, which is trained on phantom data and evaluated on in-vivo patient data. To the best of our knowledge, this is the first paper presenting a deep learning-based approach that omits the use of depth images for inference.

5 Conclusion

In this work, we presented a real-time capable pipeline that extracts airway segmentations from a given bronchoscopy depth image using efficient data-driven classical methods. However, this method has some disadvantages. On the one hand, it requires a depth image, which the endoscope does not provide natively due to hardware limitations and which therefore has to be generated via a complex non-linear domain translation, e.g. with a Generative Adversarial Network (GAN). On the other hand, a data-driven approach is not equivalent to semantic understanding. Considering this and the lack of robustness in some edge cases, we consider this pipeline alone unsuitable for real-world applications. However, paired with the RGB images of the bronchoscopy, the generated segmentation maps can be used as weak supervision for training a shallow CNN for airway orifice segmentation. We showed that this model, trained on phantom data, gains substantial semantic knowledge of airway structures, overcomes noisy ground truth in edge cases and is even directly applicable to in-vivo VB thanks to extensive data augmentation. Altogether, our proposed method allows the generation of segmentation masks directly from RGB images without the need for hand-annotated datasets. We argue that this direct prediction from RGB images is superior to segmentation approaches operating on depth images, because it avoids the risks of a domain translation from RGB to depth via GANs: the unsupervised nature of GAN training is likely to generate wrong anatomies [6], such as additional or absent airway branches in the synthesized depth images.

References

[1] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019-06) AutoAugment: Learning augmentation strategies from data. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019 (Section 3), pp. 113–123. arXiv:1805.09501v3. ISBN 9781728132938, ISSN 10636919.
[2] A. Howard, W. Wang, G. Chu, L. Chen, B. Chen, and M. Tan. Searching for MobileNetV3. arXiv:1905.02244v5.
[3] I. Loshchilov and F. Hutter (2019-01) Decoupled Weight Decay Regularization. arXiv:1711.05101 [cs, math].
[4] P. Protzel (2014) Compact Watershed and Preemptive SLIC: On improving trade-offs of superpixel segmentation algorithms.
[5] C. Sánchez, J. Bernal, D. Gil, and F. J. Sánchez (2014) On-Line Lumen Centre Detection in Gastrointestinal and Respiratory Endoscopy. In Clinical Image-Based Procedures. Translational Research in Medical Imaging, M. Erdt, M. G. Linguraru, C. Oyarzun Laura, R. Shekhar, S. Wesarg, M. A. González Ballester, and K. Drechsler (Eds.), Cham, pp. 31–38. ISBN 978-3-319-05666-1.
[6] Y. Shin, H. A. Qadir, and I. Balasingham (2018) Abnormal Colon Polyp Image Synthesis Using Conditional Adversarial Networks for Improved Detection Performance. IEEE Access 6, pp. 56007–56017. ISSN 2169-3536.
[7] M. Visentini-Scarzanella, T. Sugiura, T. Kaneko, and S. Koto (2017-07) Deep monocular 3D reconstruction for assisted navigation in bronchoscopy. International Journal of Computer Assisted Radiology and Surgery 12 (7), pp. 1089–1099. Springer Verlag. ISSN 18616429.
[8] C. Wang, M. Oda, Y. Hayashi, B. Villard, T. Kitasaka, H. Takabatake, M. Mori, H. Honma, H. Natori, and K. Mori (2020-10) A visual SLAM-based bronchoscope tracking scheme for bronchoscopic navigation. International Journal of Computer Assisted Radiology and Surgery 15 (10), pp. 1619–1630. ISSN 1861-6410, 1861-6429.
[9] J. Y. Yoo, Y. Kang, J. S. Park, Y. Cho, S. Y. Park, H. I. Yoon, S. J. Park, H. Jeong, and T. Kim (2021) Deep learning for anatomical interpretation of video bronchoscopy images. Scientific Reports 11, pp. 23765.