Q-Net: Query-Informed Few-Shot Medical Image Segmentation

Qianqian Shen, Yanan Li*, Jiyong Jin, Bin Liu

*Corresponding author.
Code is available at https://github.com/ZJLAB-AMMI/Q-Net

Abstract

Deep learning has achieved tremendous success in computer vision, while medical image segmentation (MIS) remains a challenge, due to the scarcity of data annotations. Meta-learning techniques for few-shot segmentation (Meta-FSS) have been widely used to tackle this challenge, while they neglect possible distribution shifts between the query image and the support set. In contrast, an experienced clinician can perceive and address such shifts by borrowing information from the query image, then fine-tune or calibrate his (her) prior cognitive model accordingly. Inspired by this, we propose Q-Net, a Query-informed Meta-FSS approach, which mimics in spirit the learning mechanism of an expert clinician. We build Q-Net based on ADNet, a recently proposed anomaly detection-inspired method. Specifically, we add two query-informed computation modules into ADNet, namely a query-informed threshold adaptation module and a query-informed prototype refinement module. Combining them with a dual-path extension of the feature extraction module, Q-Net achieves state-of-the-art performance on two widely used datasets, which are composed of abdominal MR images and cardiac MR images, respectively. Our work sheds light on a novel way to improve Meta-FSS techniques by leveraging query information.

Research Center for Applied Mathematics and Machine Intelligence, Zhejiang Lab
shenqq377@163.com, {liyn, jinjy,liubin}@zhejianglab.com

Introduction

Figure 1: Comparison between (a) ADNet Hansen et al. (2022) and (b) our proposed Q-Net during inference. Q-Net differs from ADNet in three ways. First, it performs query-informed threshold adaptation, yielding an adaptive, rather than fixed, threshold for anomaly detection. Second, it has an additional query-informed prototype refinement module. Lastly, it runs in dual-path mode, capturing dual-scale features. Model parameters are determined via meta-learning. In meta-training, the flow of the model stays unchanged, except that the prototype refinement module of Q-Net is turned off.

Segmentation of organs, tissues, and lesions from biomedical images plays a key role in clinical research, including disease diagnosis, treatment planning and rehearsal, and surgical guidance Wang et al. (2018); Hesamian et al. (2019). Convolutional neural networks (CNNs) have achieved state-of-the-art (SOTA) performance in semantic segmentation for both natural images Noh, Hong, and Han (2015); Long, Shelhamer, and Darrell (2015); Chen et al. (2018) and medical images Ronneberger, Fischer, and Brox (2015); Milletari, Navab, and Ahmadi (2016); Wang et al. (2021). The widely used segmentation frameworks need a large number of labeled images for model training. Such a supervised learning paradigm is impractical for medical scenarios, as annotating medical images requires clinical expertise and is therefore much more expensive than labeling natural images. Some publicly available biomedical datasets provide annotations for major human organs such as the liver, kidney and spleen, while annotations for other organs such as the adrenal glands and duodenum are scarce and hard to collect Ørting et al. (2019); Kavur et al. (2021); Sun et al. (2022).

A natural strategy for machine learning with scarce data annotations is model fine-tuning. However, it does not work well for MIS, since adapting a pre-trained model to segment objects of a novel class based on extremely few labeled images can easily lead to over-fitting. Recently, meta-learning techniques for FSS have been widely used to tackle the above challenge Shaban et al. (2017); Rakelly et al. (2018); Wang et al. (2019); Zhang et al. (2020). In particular, they learn to segment objects of data-scarce classes over many simulated segmentation tasks in an episodic fashion. Such simulated tasks are built from images of data-rich classes, and each task is assigned a support set and a query set, mimicking a test-time scenario corresponding to a specific type of organ or lesion. The support set is used to learn discriminative representations of each class, which are then used to make segmentation predictions on the query set Dong and Xing (2018). After meta-training over these tasks, the resulting algorithm agent is expected to be capable of segmenting a previously “unseen” organ or lesion based solely on a few annotated images.

Compared with natural images, medical images have their own characteristics that bring new challenges. First, the imbalance between the foreground object and the background is more severe than for natural images. For medical images, the texture of the foreground object is generally homogeneous, while the background is usually spatially inhomogeneous due to the presence of many tissues and objects of “unseen” classes Sun et al. (2022). Further, human organs vary widely in size, shape, and location across patients. A specific organ or tissue may have different appearance patterns due to different image acquisition protocols and different imaging modalities produced by different equipment. Even in the same 3D volume, the 2D slices related to the same organ can be dissimilar.

The aforementioned issues can lead to distribution shifts between the query image and the support set. To our knowledge, current techniques such as prototypical networks do not explicitly take such shifts into account. In clinical practice, an experienced clinician can borrow information from the query image, such as the CT or MRI scans of a patient, to fine-tune or calibrate his or her prior knowledge for determining the size, shape and location of a target tissue. In this way, the distribution shift is removed or at least mitigated. Inspired by the above observation, we propose Q-Net, a query-informed FSS approach that mimics in spirit the learning mechanism of an expert clinician. Figure 1 compares Q-Net with ADNet, the prototypical network method on which it is built.

Our Contributions:

Inspired by the learning mechanism of an expert clinician mentioned above, we put forward a strategy, namely query-informed learning (QIL), for improving meta-FSS techniques.
Following the QIL strategy, we propose Q-Net, illustrated in Figures 1 and 2. It adds three novel algorithmic modules, namely query-informed threshold adaptation, query-informed prototype refinement, and a dual-path extension, to ADNet Hansen et al. (2022), a recent SOTA method. ADNet fixes its threshold and prototype during inference. In contrast, Q-Net can adjust its threshold and prototype automatically to mitigate possible distribution shifts between the query image and the support set. In addition, unlike ADNet, which uses single-scale features, Q-Net has a dual-path architecture that captures multi-scale features.
We validate the performance of Q-Net experimentally on two widely-used medical image datasets. Results demonstrate that Q-Net significantly outperforms other SOTA methods.

Related Work

Medical Image Segmentation

Deep learning with CNNs has become the dominant approach to MIS on various tissues, anatomical structures and lesions. Fully Convolutional Networks (FCNs) stand out as a powerful deep net architecture for semantic segmentation Long, Shelhamer, and Darrell (2015). FCNs replace the fully connected layers of standard CNNs with convolutional layers. Given an arbitrary-sized input image, FCNs can produce pixel-wise output of consistent size. Subsequently, encoder-decoder networks became the major architecture for semantic segmentation, among which U-Net is well recognized as a SOTA network for MIS Ronneberger, Fischer, and Brox (2015). U-Net adopts a symmetrical encoder-decoder architecture with skip connections. In the downsampling path, it applies convolutional layers, while in the upsampling path, it enlarges the feature map so that its size is consistent with that of the input.

Following U-Net, several FCNs have been developed, such as U-Net 3D Çiçek et al. (2016), V-Net Milletari, Navab, and Ahmadi (2016), Y-Net Mehta et al. (2018), attention U-Net Oktay et al. (2018) and nnU-Net Isensee et al. (2018), which can be regarded as variants of U-Net. All these CNN-type models require abundant expert-annotated data. When only a few labeled images of an “unseen” class are available for segmenting an object of this class, these methods fail to provide good performance. In such data-scarce cases, one needs to integrate data augmentation or transfer learning techniques into these methods to make them work.

Few-shot Semantic Segmentation

Different from supervised learning methods that require a large number of labeled samples to work, few-shot learning (FSL) methods aim to learn with few labeled samples. Thus, it is natural to adapt FSL methods to deal with MIS, leading to FSS methods. Ouyang et al. (2020) propose an adaptive local prototype pooling module to add extra focus on the background. Hansen et al. (2022) use a single foreground prototype, and view image segmentation from an anomaly detection perspective. Rakelly et al. (2018) use weak annotations by replacing the pre-defined binary mask with a few selected landmarks of foreground and background. Most of these FSS methods aim to fully exploit information covered in the support set and assume that the labeled training data is abundant.

The prototypical network, which uses masked average pooling to extract class-wise features from the support set, is another major category of FSS methods, characterized by its strong interpretability and robustness against noise. Dong and Xing (2018) adopt the meta-learning mechanism in prototypical networks for semantic segmentation. The resulting model contains a prototype network for learning class-specific prototypes and a segmentation network for making predictions on query images. Wang et al. (2019) formulate image segmentation as a non-parametric prototype matching process, then propose the prototype alignment network (PANet), which inversely predicts labels of the support images by using query images as the support set. Liu et al. (2020) improve PANet by introducing additional prototypes to capture more diverse features of the semantic classes. Based on PANet, Ouyang et al. (2020) and Hansen et al. (2022) employ superpixel/supervoxel segmentation for self-supervised prototypical FSS on medical images. Tang et al. (2021) propose a context relation encoder and mask refinement module to iteratively refine the segmentation.

Another branch of medical FSS methods follows the two-arm architecture of Shaban et al. (2017), which consists of a conditioner arm and a segmenter arm, corresponding to the support set and the query data, respectively. Roy et al. (2020) employ dense connections with squeeze and excitation blocks to strengthen the interaction between the conditioner arm and the segmenter arm. Sun et al. (2022) present a global correlation network with discriminative embedding, obtaining improved performance.

The aforementioned works do not explicitly take into account possible distribution shifts between the query image and the support set, while such shifts indeed exist in some practical MIS scenarios. Our proposed Q-Net learns to borrow information from the query image based on a meta-learning setting to remove or at least mitigate such distribution shifts.

Problem Statement

We consider an image segmentation task. First, we learn a segmentation model based on a training set $D_{train}$ with a label set $C_{train}$. Then we evaluate the model using a test set $D_{test}$, whose label set is denoted by $C_{test}$. For the FSS tasks of our concern, $C_{train}$ and $C_{test}$ are disjoint, namely $C_{train} \cap C_{test} = \emptyset$.

We consider a meta-learning setting, in which both the training set $D_{train} = \{(S_{i}, Q_{i})\}_{i = 1}^{N_{train}}$ and the test set $D_{test} = \{(S_{i}, Q_{i})\}_{i = 1}^{N_{test}}$ consist of several randomly sampled episodes, where $N_{train}$ and $N_{test}$ are the numbers of episodes for training and testing, respectively. Each episode includes $K$ support images with annotations and a set of $N_{qry}$ query images of $N$ classes. That is, we consider an $N$-way $K$-shot task. The support set $S_{i} = \{(x_{k}^{s}, m_{k}^{s} (c_{j}))\}_{k = 1}^{K}$ contains $K$ image-mask pairs, each consisting of a gray-scale image $x \in R^{H \times W \times 1}$ and its corresponding binary mask $m \in \{0, 1\}^{H \times W}$ for class $c_{j} \in C_{train}, j = 1, 2, \dots, N$. The query set $Q_{i}$ contains $N_{qry}$ image-mask pairs from the same class as the support set. We learn the model on the support set and make a prediction for the query masks. Following the common practice in Ouyang et al. (2020); Hansen et al. (2022), we adopt a 1-way 1-shot meta-learning strategy for few-shot MIS.

Supervoxel-guided Clustering

Following Hansen et al. (2022), we use supervoxel clustering to generate pseudo labels for training, where each supervoxel produces a pseudo label. In this way, we can better utilize the volumetric nature of the medical images, compared with 2D superpixel clustering. We then construct episodic tasks for meta-training of our deep net model based on segmentations given by the supervoxels. Specifically, for each pseudo class/supervoxel, we randomly select one 2D image slice as a support image, and then use another adjacent image slice in the same volume as the query image. In what follows, we elaborate the proposed Q-Net, which is trained over such episodic tasks.
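The episode construction described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `sample_episode`, the convention that label 0 means "unlabelled", and the toy volume are assumptions made for the example, and the supervoxel map is taken as given rather than computed.

```python
import numpy as np

def sample_episode(volume, supervoxels, rng):
    """Draw one 1-way 1-shot episode from a pseudo-labelled 3D volume.

    volume:      (D, H, W) stack of 2D slices.
    supervoxels: (D, H, W) integer pseudo-labels; 0 is assumed to mean "unlabelled".
    Returns (support_image, support_mask), (query_image, query_mask).
    """
    labels = np.unique(supervoxels)
    labels = labels[labels > 0]
    label = rng.choice(labels)  # one supervoxel acts as the pseudo class

    # Slices in which this supervoxel is visible.
    present = np.flatnonzero((supervoxels == label).any(axis=(1, 2))).tolist()
    present_set = set(present)

    # All (support, query) pairs of adjacent slices that both show the supervoxel.
    pairs = [(z, z + d) for z in present for d in (-1, 1) if z + d in present_set]
    if not pairs:
        raise ValueError("supervoxel lies in a single slice; no adjacent query slice")

    s, q = pairs[rng.integers(len(pairs))]
    support = (volume[s], (supervoxels[s] == label).astype(np.uint8))
    query = (volume[q], (supervoxels[q] == label).astype(np.uint8))
    return support, query


# Toy usage: two hypothetical supervoxels in a 4-slice volume.
rng = np.random.default_rng(0)
volume = rng.random((4, 8, 8))
supervoxels = np.zeros((4, 8, 8), dtype=int)
supervoxels[0:3, 1:4, 1:4] = 1   # pseudo class 1 spans slices 0..2
supervoxels[2:4, 5:7, 5:7] = 2   # pseudo class 2 spans slices 2..3
(x_s, m_s), (x_q, m_q) = sample_episode(volume, supervoxels, rng)
```

Restricting the query to a neighbouring slice that still contains the supervoxel keeps the pseudo class present in both images of the episode.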

Methodology

At test time, the proposed Q-Net operates in three steps: 1) extracting dual-path image features for both the support and the query images; 2) computing a query-informed anomaly threshold and then using it to separate the foreground object from the background; 3) refining the initial class prototype by incorporating query information and then using the refined prototype to produce the final segmentation. Figure 1 gives a brief conceptual illustration of our method, and Figure 2 shows a more detailed workflow at test time. Note that the model parameters are determined during meta-training, where the flow of the model stays unchanged but the prototype refinement module is turned off.

Figure 2: Illustration of Q-Net. We use a shared feature encoder to learn deep feature maps in dual-path mode, corresponding to two feature scales: $32 \times 32$ and $64 \times 64$. The flows of operations are the same for these two paths, so we modularize the operations of the second path and draw them in gray to save space. In each path, we first learn one foreground prototype $p$ from the support features. Next we compute the similarity map between each query feature vector and the prototype. Then we predict the initial segmentation mask $\tilde{m}_{32}^{q}$ via anomaly detection performed on the similarity map with threshold $T$. See the Subsection “Query-informed Threshold Adaptation” for our approach to learning an adaptive $T$ value. Then we refine the prototype by repeating the following two operations a fixed number of times: (1) replacing the foreground feature vectors with the prototype; (2) minimizing a reconstruction loss. See more details in the Methodology section.

Dual-Path Feature Extraction

We employ a shared feature extractor $f_{θ}$ to extract multi-scale features from both the support and the query images in the embedding space. We denote the support features as $F^{s} = f_{θ} (x^{s}), F^{s} \in R^{H' \times W' \times Z}$, where $H'$ and $W'$ denote the height and the width of the features and $Z$ the channel depth. The query features $F^{q} = f_{θ} (x^{q})$ are defined in the same way.

Following the common practice, we use ResNet-101 with pre-trained MS-COCO weights as the backbone net for feature extraction. It is composed of a $7 \times 7$ convolutional layer with a stride of 2, a max pooling layer, and four residual blocks with multiple 3-layer bottleneck blocks. The output features of the first residual block in the network are 4 times smaller than the input resolution. The network can provide another 3 times down-sampling in the last 3 residual blocks with dilation. In contrast to previous works Wang et al. (2019); Ouyang et al. (2020); Hansen et al. (2022), where the network outputs features that are 8 times smaller than the input resolution by using dilation in the second residual block, we use dilation in the last residual block and extract features with two different scales from the last two residual blocks. This results in the resolution of the multi-scale features being 1/4 and 1/8 of the image resolution $(H' = H / 4 \text{ or } H / 8,\ W' = W / 4 \text{ or } W / 8)$. For the $256 \times 256$ inputs used in our experiments, these correspond to the $64 \times 64$ and $32 \times 32$ feature maps shown in Figure 2. The multi-scale feature extraction can better capture class prototypes of tissues of different sizes and shapes.

Query-Informed Threshold Adaptation

Following Hansen et al. (2022), we consider a single foreground prototype to avoid introducing additional background prototypes in each training episode. The prototypical prediction is undertaken by non-parametric metric learning. First, we compute the foreground prototype $p \in R^{1 \times Z}$ from the support feature map $F^{s}$ via masked average pooling:

p = \frac{\sum F^{s} ⊙ m^{s}}{\sum m^{s}}

(1)

where the support feature map $F^{s}$ is resized to the mask size $(H, W)$ , and $⊙$ denotes the Hadamard product for masked average pooling.
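Equation (1) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function name and the small constant `eps` guarding against an empty mask are not from the paper, and the prototype is returned with shape `(Z,)` rather than `(1, Z)`.

```python
import numpy as np

def masked_average_pooling(features, mask, eps=1e-5):
    """Average the feature vectors that fall inside the foreground mask.

    features: (H, W, Z) support features, already resized to the mask size.
    mask:     (H, W) binary foreground mask.
    Returns the foreground prototype of shape (Z,).
    """
    mask = mask.astype(features.dtype)
    weighted = features * mask[..., None]          # Hadamard product, broadcast over Z
    return weighted.sum(axis=(0, 1)) / (mask.sum() + eps)


# Toy usage: a 2x2 foreground patch whose features all equal [1, 2, 3].
features = np.zeros((4, 4, 3))
features[1:3, 1:3] = [1.0, 2.0, 3.0]
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
prototype = masked_average_pooling(features, mask)  # approximately [1, 2, 3]
```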

In order to perform soft thresholding on the predicted mask, we employ a shifted Sigmoid on the negative cosine similarity between the query feature vectors and the foreground prototype, as follows

S (h, w) = - a \frac{F^{q} (h, w) \cdot p}{∥ F^{q} (h, w) ∥ ∥ p ∥}

(2)

\hat{m}_{f}^{q} (h, w) = 1 - σ (S (h, w) - T)

(3)

\hat{m}_{b}^{q} (h, w) = 1 - \hat{m}_{f}^{q} (h, w)

(4)

where $(h, w)$ denotes the spatial locations of the foreground mask, $a = 20$ a scaling factor Oreshkin, Rodríguez López, and Lacoste (2018), and $T$ the learned threshold. The subscripts $_{f}$ and $_{b}$ denote “foreground” and “background”, respectively.
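Equations (2) to (4) translate directly into array operations. The sketch below assumes NumPy, a query feature map of shape `(H', W', Z)` and a scalar threshold; the function names, the stabilising constant `1e-8`, and the threshold value used in the toy call are illustrative choices, not values taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_masks(query_features, prototype, T, a=20.0):
    """Soft foreground/background masks from a single foreground prototype.

    query_features: (H', W', Z); prototype: (Z,); T: scalar threshold.
    """
    q_norm = np.linalg.norm(query_features, axis=-1)
    p_norm = np.linalg.norm(prototype)
    cos = (query_features @ prototype) / (q_norm * p_norm + 1e-8)
    S = -a * cos                    # Eq. (2): scaled negative cosine similarity
    m_f = 1.0 - sigmoid(S - T)      # Eq. (3): high where S falls below T
    m_b = 1.0 - m_f                 # Eq. (4)
    return S, m_f, m_b


# Toy usage: three query vectors that are aligned with, orthogonal to,
# and opposite to the prototype. T = -10 is an arbitrary illustrative value.
prototype = np.array([1.0, 0.0])
query = np.array([[[2.0, 0.0], [0.0, 3.0], [-1.0, 0.0]]])
S, m_f, m_b = soft_masks(query, prototype, T=-10.0)
```

With $a = 20$, the anomaly score $S$ ranges over $[-20, 20]$, so a location is assigned to the foreground when its score lies below $T$, that is, when its feature vector is sufficiently similar to the prototype.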

In common FSS practice, once $T$ is learned in the meta-training stage, it is used directly in the meta-testing stage to segment unseen foreground objects. In the standard setting, the background may contain novel test objects, so supervoxels that overlap with the test labels might be used as training classes. Compared to the spatially heterogeneous background, the foreground objects, such as the right and the left kidneys in an abdominal MRI image, are homogeneous and thus can easily be clustered into the same supervoxel. The learned threshold used for anomaly detection has then been optimized on supervoxels that share similar shape and size with the “unseen” objects. For such cases, using a fixed $T$ makes sense. However, if image slices containing the test classes are removed, the test classes become completely unseen for the algorithm agent, and the threshold learned in the training process cannot be transferred straightforwardly to such “unseen” classes.

Following the QIL strategy, we add two fully-connected layers $g_{ϕ}$ at the top of the feature extractor $f_{θ}$ to learn a soft threshold from the query image instead of the predicted foreground masks on the seen classes:

T = g_{ϕ} (F^{q})

(5)

where the fully-connected layers reduce the number of channels of the flattened feature maps from 2048 to one. See more details in the Appendix Section.

Query-Informed Prototype Refinement

Given class prototypes extracted from the support images, we use cosine similarity to make segmentation predictions on the foreground and background. This prediction process assumes that a query image has a similar appearance to the support images. During the training phase, the image slices adjacent to the support image are selected as query samples, while during the inference phase, the query image and the support set can be selected from different image volumes. Therefore, a query image can be quite different from the support images in intensity and appearance. Following the QIL strategy, we propose a prototype refinement module to update the class prototypes by borrowing information from the query image features at test time.

In prototype refinement, we first produce the predicted mask by calculating the cosine similarity between the query image features and the class prototypes. At each iteration, we apply element-wise multiplication between the query feature maps and the background mask $\hat{m}_{b}^{q}$ to obtain new feature maps $\hat{F}^{q}$. Then, we assign the single foreground prototype $p$ as the feature vector at the foreground locations of the new feature maps:

\hat{F}^{q} (i, j) = \begin{cases} F^{q} (i, j), & \hat{m}_{b}^{q} (i, j) = 1 \\ p_{n}, & \hat{m}_{b}^{q} (i, j) = 0 \end{cases}

(6)

We design this module to mimic the steps of an optimization algorithm at inference time. This module is trained to modify the foreground prototype $p$ gradually so that the final predicted mask $\hat{m}^{q}$ converges to an optimal solution:

p_{n + 1} = p_{n} - v \frac{\partial \sum_{n = 1}^{N} L (\hat{F}^{q}, F^{q})}{\partial p_{n}}

(7)

where $L (\hat{F}^{q}, F^{q})$ is the cross-entropy loss computed between the new query features $\hat{F}^{q}$ and the original query features $F^{q}$, $v$ acts as the step size, and $N$ denotes the number of iterations of gradient back-propagation. $p_{N}$ denotes the refined prototype after $N$ iterations, from which the refined predicted mask $\tilde{m}_{f}^{q}$ is generated.
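The refinement loop of Equations (6) and (7) can be illustrated with a simplified NumPy sketch. Several choices here are assumptions made only for the example and are not the authors' implementation: the background mask is binarized beforehand and kept fixed across iterations, a mean squared error over foreground positions stands in for the loss $L$, the gradient of the current iteration's loss is used at each step, and `step=0.1` is an arbitrary value. The default of 30 iterations follows the empirical choice reported in the ablation study.

```python
import numpy as np

def replace_foreground(query_features, bg_mask, prototype):
    """Eq. (6): keep background features, write the prototype at foreground positions."""
    keep = bg_mask.astype(bool)[..., None]            # (H', W', 1)
    return np.where(keep, query_features, prototype)  # prototype broadcasts over (H', W')

def refine_prototype(query_features, bg_mask, prototype, step=0.1, n_iter=30):
    """Gradient-descent refinement of the foreground prototype (Eq. 7 form).

    query_features: (H', W', Z); bg_mask: hard (0/1) background mask of shape (H', W').
    Stand-in loss (assumption): mean over foreground positions of
    ||F_hat(i, j) - F(i, j)||^2, whose gradient w.r.t. p is computed analytically.
    """
    fg = ~bg_mask.astype(bool)
    n_fg = max(int(fg.sum()), 1)
    p = prototype.astype(float).copy()
    for _ in range(n_iter):
        f_hat = replace_foreground(query_features, bg_mask, p)
        grad = 2.0 * (f_hat - query_features)[fg].sum(axis=0) / n_fg
        p = p - step * grad
    return p


# Toy usage: the prototype starts at zero and drifts towards the query foreground.
rng = np.random.default_rng(0)
F_q = rng.random((4, 4, 3))
bg = np.ones((4, 4))
bg[1:3, 1:3] = 0                      # predicted foreground region
p_refined = refine_prototype(F_q, bg, np.zeros(3))
```

Under this stand-in loss the prototype converges to the mean query feature inside the predicted foreground, which makes the intended effect visible: the support-derived prototype is pulled towards the statistics of the query image.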

\tilde{m}^{q} = α \cdot \tilde{m}_{64}^{q} + (1 - α) \cdot \tilde{m}_{32}^{q}

(8)

Therefore, at inference time, the final segmentation mask $\tilde{m}^{q}$ is the weighted element-wise sum of the masks predicted at the two scales, where $α \in (0, 1)$ is the weight of the prediction based on the $64 \times 64$ features.

Loss function

We train our network by minimizing the cross-entropy between the predicted masks and the ground-truth segmentation $m^{q}$ . The dual-path predicted masks after anomaly threshold adaptation are up-sampled to $(H, W)$ and then combined through element-wise summation, as follows

\hat{m}^{q} = α \cdot \hat{m}_{64}^{q} + (1 - α) \cdot \hat{m}_{32}^{q}

(9)

where $α$ is a balance factor whose value is set empirically; see Table 3 for an ablation analysis.
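The dual-path combination in Equation (9), which has the same form as Equation (8), can be sketched as follows. The interpolation mode is not specified in the text, so nearest-neighbour upsampling by an integer factor is used here purely as an assumption; the function names are likewise illustrative.

```python
import numpy as np

def upsample_nearest(mask, out_size):
    """Nearest-neighbour upsampling of a square mask by an integer factor."""
    factor = out_size // mask.shape[0]
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)

def fuse_dual_path(m64, m32, alpha, out_size=256):
    """Weighted element-wise sum of the two upsampled soft masks."""
    up64 = upsample_nearest(m64, out_size)   # 64x64 -> out_size x out_size
    up32 = upsample_nearest(m32, out_size)   # 32x32 -> out_size x out_size
    return alpha * up64 + (1.0 - alpha) * up32


# Toy usage: the 64x64 path predicts foreground everywhere, the 32x32 path nowhere.
fused = fuse_dual_path(np.ones((64, 64)), np.zeros((32, 32)), alpha=0.9)
# Every pixel of the fused 256x256 mask equals alpha = 0.9.
```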

We compute the binary cross-entropy loss as the segmentation loss for each training episode:

L_{seg} = - \frac{1}{H W} \sum_{h}^{H} \sum_{w}^{W} \sum_{j \in \{f, b\}} m_{j}^{q} (h, w) \log (\hat{m}_{j}^{q} (h, w))

(10)

Following Wang et al. (2019); Ouyang et al. (2020), we inversely predict labels of the support images by using query images as the support set, then build a prototypical alignment regularization term as follows

L_{reg} = - \frac{1}{H W} \sum_{h}^{H} \sum_{w}^{W} \sum_{j \in \{f, b\}} m_{j}^{s} (h, w) \log (\hat{m}_{j}^{s} (h, w))

(11)

Overall, the loss function for each training episode is defined to be

L = L_{seg} + L_{reg}

(12)
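Equations (10) to (12) share one form, so a single helper covers both terms. The sketch below is an illustration in NumPy: the function names and the clipping constant `eps`, which avoids taking the logarithm of zero, are assumptions, and the background quantities are derived as one minus the foreground ones, in line with Equation (4).

```python
import numpy as np

def bce_loss(target_fg, pred_fg, eps=1e-7):
    """Two-class cross-entropy averaged over the H x W pixels (Eq. 10 / Eq. 11 form)."""
    pred_fg = np.clip(pred_fg, eps, 1.0 - eps)
    target_bg, pred_bg = 1.0 - target_fg, 1.0 - pred_fg
    per_pixel = target_fg * np.log(pred_fg) + target_bg * np.log(pred_bg)
    return -per_pixel.mean()

def episode_loss(m_q, m_q_hat, m_s, m_s_hat):
    """Eq. (12): segmentation loss on the query plus alignment loss on the support."""
    return bce_loss(m_q, m_q_hat) + bce_loss(m_s, m_s_hat)


# Toy usage: an uninformative prediction of 0.5 everywhere costs log(2) per term.
truth = np.ones((4, 4))
guess = np.full((4, 4), 0.5)
loss = episode_loss(truth, guess, truth, guess)   # 2 * log(2), about 1.386
```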

For more implementation details on prototype refinement, readers are referred to the Appendix Section.

ABD
Settings	Method	Liver	R.kidney	L.kidney	Spleen	mean
Setting 1	SE-Net	29.02	47.96	45.78	47.30	42.51
	PANet	$73.81 \pm 3.39$	$80.67 \pm 3.35$	$69.48 \pm 6.05$	$69.10 \pm 10.18$	$73.27 \pm 7.90$
	ALPNet	$78.55 \pm 2.48$	$83.11 \pm 5.35$	$\mathit{78.16 \pm 6.54}$	$70.58 \pm 6.16$	$77.60 \pm 7.01$
	ADNet	$\mathbf{82.11 \pm 2.19}$	$\mathit{85.80 \pm 4.51}$	$73.86 \pm 8.23$	$\mathit{72.29 \pm 8.87}$	$\mathit{78.51 \pm 8.63}$
	Q-Net	$\mathit{81.74 \pm 3.83}$	$\mathbf{87.98 \pm 4.55}$	$\mathbf{78.36 \pm 8.36}$	$\mathbf{75.99 \pm 8.64}$	$\mathbf{81.02 \pm 8.08}$
Setting 2	SE-Net	27.43	61.32	62.11	51.80	50.66
	PANet	$69.37 \pm 5.44$	$\mathit{66.94 \pm 5.67}$	$\mathit{63.17 \pm 7.84}$	$61.25 \pm 7.84$	$65.68 \pm 7.54$
	ALPNet	$70.73 \pm 3.81$	$\mathbf{73.30 \pm 9.72}$	$61.20 \pm 8.22$	$\mathit{62.98 \pm 9.28}$	$\mathit{67.05 \pm 9.57}$
	ADNet	$\mathit{77.03 \pm 3.36}$	$56.68 \pm 10.84$	$59.64 \pm 9.04$	$59.44 \pm 7.46$	$63.20 \pm 11.48$
	Q-Net	$\mathbf{78.25 \pm 4.81}$	$65.94 \pm 10.08$	$\mathbf{64.81 \pm 7.42}$	$\mathbf{65.37 \pm 8.61}$	$\mathbf{68.59 \pm 9.73}$
CMR
Settings	Method	LV-BP	LV-MYO	RV	mean
Setting 1	SE-Net	58.04	25.18	12.86	32.02
	PANet	$72.77 \pm 7.66$	$44.76 \pm 4.08$	$57.13 \pm 4.54$	$58.20 \pm 12.79$
	ALPNet	$85.42 \pm 3.95$	$63.38 \pm 3.95$	$74.07 \pm 4.47$	$74.29 \pm 9.90$
	ADNet	$\mathit{86.26 \pm 2.18}$	$\mathit{65.08 \pm 4.85}$	$\mathit{76.50 \pm 1.97}$	$\mathit{75.95 \pm 9.34}$
	Q-Net	$\mathbf{90.25 \pm 1.44}$	$\mathbf{65.92 \pm 2.96}$	$\mathbf{78.19 \pm 2.95}$	$\mathbf{78.15 \pm 10.22}$

Table 1: DSC comparison with other methods on ABD and CMR datasets. Bold and italic numbers denote the best and second best results, respectively.

Experiment	Method	Liver	R.kidney	L.kidney	Spleen	mean
Feature Extractor	DP (DS on Layer 2)	$72.56$	$55.91$	$52.61$	$60.65$	$60.43 \pm 15.90$
	DP (DS on Layer 3)	$71.00$	$56.74$	$55.05$	$66.01$	$62.20 \pm 12.02$
	DP (DS on Layer 4)	$77.59$	$62.08$	$62.86$	$62.42$	$66.24 \pm 9.62$
Added components	$32 \times 32$	$77.03$	$56.68$	$59.54$	$59.44$	$63.20 \pm 11.48$
	$64 \times 64$	$78.65$	$59.58$	$60.82$	$62.80$	$65.46 \pm 11.40$
	DP	$78.53$	$61.35$	$64.30$	$62.55$	$66.68 \pm 9.99$
	DP + PR	$78.35$	$62.91$	$64.94$	$63.13$	$67.33 \pm 9.65$
	DP + TA	$78.79$	$63.87$	$64.69$	$64.38$	$67.93 \pm 10.09$
	DP + TA + PR	$78.25$	$65.94$	$64.81$	$65.37$	$68.59 \pm 9.73$

Table 2: Ablation study for Q-Net on dataset ABD under Setting 2. ’DS’ means downsampling. ’DP’, ’TA’ and ’PR’ denote dual-path feature extraction, query-informed threshold adaptation, and query-informed prototype refinement, respectively.

Experiments

Datasets

We evaluate the proposed Q-Net on two popular MRI datasets for FSS, i.e., ABD Kavur et al. (2021) and CMR Zhuang (2016).

ABD is an abdominal MRI dataset from the ISBI 2019 Combined Healthy Abdominal Organ Segmentation Challenge (CHAOS). It includes 20 3D T2-SPIR MRI scans with 36 slices on average, covering the liver, left kidney, right kidney, and spleen.

CMR is an MRI dataset from the MICCAI 2019 Multi-sequence Cardiac MRI Segmentation Challenge (bSSFP fold). It contains 35 3D cardiac MRI scans with 13 slices on average.

Performance Metric

Following the common practice in medical few-shot image segmentation, we use the mean Sørensen-Dice coefficient (DSC) as the evaluation metric. It measures the overlap ratio of the predicted mask $A$ and the ground-truth segmentation $B$ as follows:

\mathrm{DSC} (A, B) = \frac{2 ∥ A \cap B ∥}{∥ A ∥ + ∥ B ∥} \times 100 \%

(13)

The greater this DSC score, the better the segmentation performance, and vice versa.
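For binary masks, Equation (13) reduces to counting pixels. The sketch below is a minimal NumPy illustration; the function name and the convention of returning 100 when both masks are empty are assumptions, not part of the evaluation protocol described here.

```python
import numpy as np

def dice_score(pred, target):
    """Sørensen-Dice coefficient of two binary masks, in percent."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 100.0  # assumption: two empty masks count as perfect agreement
    return 200.0 * np.logical_and(pred, target).sum() / denom


# Toy usage: 2 predicted pixels, 1 true pixel, 1 pixel in common.
pred = np.array([[1, 1], [0, 0]])
target = np.array([[1, 0], [0, 0]])
score = dice_score(pred, target)   # 2 * 1 / (2 + 1) * 100, about 66.67
```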

Data Pre-processing

In our experiments, we adopt the same image pre-processing scheme as in Ouyang et al. (2020) to make a fair performance comparison. Specifically, we first cut off the bright end (the top 0.5%) of the histogram to alleviate the off-resonance issue. Then we re-sample the image slices to the same spatial resolution. We use axial slices for ABD and short-axis slices for CMR. Finally, we crop these slices to a unified size of $256 \times 256$ pixels.

We implement the proposed method using PyTorch (v1.10.2) on an NVIDIA RTX 3090Ti GPU. The minimum supervoxel size is set at $ρ = 5000$ for ABD and $ρ = 1000$ for CMR, as recommended in Hansen et al. (2022). In the experiment, each 2D slice is repeated three times in the channel dimension to fit the network. During inference, we perform slice-by-slice segmentation on 3D image volumes under the evaluation protocols introduced by Roy et al. (2020). In each fold, one support volume is randomly selected and the remaining image volumes are treated as query volumes. We repeat this process 5 times independently and take the mean of the resulting scores as the final performance score.

Experimental Settings

We consider two experimental settings to evaluate the performance of our model.

Setting 1 is the standard setting, where objects of test classes can appear in the background of the training data. Since we perform self-supervised supervoxel-based segmentation on the whole image, the resulting supervoxels have similar shapes and sizes to objects of the test classes. Therefore, objects of test classes may be implicitly involved in the training process, so the test classes are not truly “unseen” classes for the algorithm agent in this setting.

Setting 2 modifies Setting 1 by directly removing image slices that contain the test classes from the training data. It guarantees that the test classes are truly “unseen” classes for the model. We follow the same protocol as in Ouyang et al. (2020) to separate the four testing organs into two groups: the upper abdomen (liver and spleen) and the lower abdomen (left and right kidneys).

Results

Comparison with SOTA methods

We compare Q-Net with modern SOTA models, including SE-Net Roy et al. (2020), PANet Wang et al. (2019), ALPNet Ouyang et al. (2020), and ADNet Hansen et al. (2022), for both settings mentioned above. The results are given in Table 1. Q-Net achieves the highest mean DSC in all cases: 81.02 versus 78.51 for the second-best method (ADNet) on ABD under Setting 1, 68.59 versus 67.05 (ALPNet) on ABD under Setting 2, and 78.15 versus 75.95 (ADNet) on CMR.

Specifically, for setting 1 where the test organs may appear in the background, Q-Net gives the largest mean DSC score on both ABD and CMR. In particular, its dice score for right-kidney on ABD achieves about 88%. For setting 2 where the test organs are completely removed from the background, Q-Net performs best again, while the SOTA ADNet performs poorly for small-sized organs, such as the right kidney. This result indicates that Q-Net is better at segmenting small-sized organs, compared with ADNet.

For the CMR dataset, we only consider Setting 1, as Setting 2 is impractical for cardiac MRI scans. The left-ventricle myocardium (LV-MYO) is wrapped around the left-ventricle blood pool (LV-BP), and the right ventricle (RV) is quite close to them. We see that Q-Net performs much better than the others in segmenting such adjacent organs, especially for LV-BP and RV. Although foreground objects in the CMR images have smaller variations in size, Q-Net still performs best, as shown in the bottom rows of Table 1.

We also provide some qualitative comparisons in Figure 3. It is shown that Q-Net is more robust against variations of the objects’ appearance patterns. It also reconfirms that Q-Net performs better in segmenting small-sized organs.

Figure 3: Qualitative comparisons for the ABD dataset. Left to right: Segmentation results and ground-truth segmentation of a query slice containing the target object. Top to bottom: spleen, left kidney, right kidney and liver. (Best viewed with zoom)

Ablation Study

First, we do an ablation study for Q-Net, to test how the way in which we construct the $64 \times 64$ feature maps affects the final segmentation performance. In the ResNet backbone, there are three residual blocks containing the dilation operation. We adopt the third downsampling operation by activating its dilation operation in one of these residual blocks (i.e., layers 2, 3 and 4). The outputs of the last residual block before the third downsampling component are $64 \times 64$ feature maps. The segmentation results on ABD dataset under Setting 2 are shown in Table 2, where DS denotes downsampling. We observe that using the downsampling component in the last residual block can provide a better combination of multi-scale predictions.

Second, we test the separate contributions of our three proposed computation modules, namely query-informed threshold adaptation (TA), query-informed prototype refinement (PR), and the dual-path extension (DP), by conducting an ablation study on the ABD dataset under Setting 2. The result is presented in Table 2. We see from the last column that each module indeed makes a separate contribution to the final segmentation performance. In particular, DP brings a gain of 1.22 percentage points in mean DSC (from 65.46 with the $64 \times 64$ path alone to 66.68), and the two query-informed modules contribute an additional gain of 1.91 percentage points (to 68.59).

In the third study, we evaluate the effect of the balance factor $α$ in Equation (8) on dual-path predictions. $α$ denotes the weight of the larger feature maps’ contribution to the final mask prediction. As shown in Table 3, for Setting 1, the larger the $α$, the better, while for Setting 2, the optimal $α$ value is 0.8.

Finally, we test the effect of the number of iterations for prototype refinement on the final performance of Q-Net. We do this experiment on dataset ABD, and compute DSC scores corresponding to different checkpoints at the prototype refinement process. The result is shown in Figure 4. We let the curves corresponding to different organs share the same starting point for ease of comparison. Such an analysis suggests an empirical choice of the iteration number at 30.

Limitations and Future Work

Currently, Q-Net uses a single prototype for each organ, which may be inappropriate for large organs with inhomogeneous structures, such as the liver, which can occupy almost half of a medical image. Larger feature maps may exacerbate this problem. How to assign an adaptive number of prototypes to a specific organ deserves future research.

In addition, Q-Net treats each query image independently. This means that the knowledge gained from the query images is not accumulated for future segmentation tasks. A future direction along this line is to investigate approaches to incremental query-informed learning.

Finally, the query-informed adaptation (QIA) mechanism has a natural Bayesian flavor. A formal analysis of the connection between QIA and Bayesian inference is missing here and can be conducted in the future.

$α$	ABD, Setting 1	ABD, Setting 2
0.9	$81.02 \pm 8.08$	$67.28 \pm 11.68$
0.8	$80.31 \pm 8.45$	$68.75 \pm 11.71$
0.6	$80.28 \pm 8.33$	$68.59 \pm 9.73$
0.5	$79.72 \pm 9.00$	$67.98 \pm 10.47$
0.4	$79.30 \pm 9.11$	$66.89 \pm 10.09$
0.2	$78.64 \pm 8.74$	$65.89 \pm 10.49$

Table 3: Ablation study for the effect of $α$ on the performance of Q-Net in terms of the DSC score.

Figure 4: DSC scores collected at different checkpoints of the iterative prototype refinement process. $μ$ and $σ$ denote the mean and the standard error, respectively.

Conclusion

We proposed Q-Net, a novel meta-learning approach to few-shot MIS, and validated its performance experimentally. The results show that Q-Net outperforms SOTA methods, especially for the segmentation of small-sized organs. Our work sheds light on a way to improve meta-learning techniques for FSS by borrowing information from the query image.

References

Chen et al. (2018) Chen, L.-C.; Zhu, Y.; Papandreou, G.; Schroff, F.; and Adam, H. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), 801–818.
Çiçek et al. (2016) Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S. S.; Brox, T.; and Ronneberger, O. 2016. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, 424–432. Springer.
Dong and Xing (2018) Dong, N.; and Xing, E. P. 2018. Few-shot semantic segmentation with prototype learning. In BMVC, 3(4).
Hansen et al. (2022) Hansen, S.; Gautam, S.; Jenssen, R.; and Kampffmeyer, M. 2022. Anomaly detection-inspired few-shot medical image segmentation through self-supervision with supervoxels. Medical Image Analysis, 78: 102385.
Hesamian et al. (2019) Hesamian, M. H.; Jia, W.; He, X.; and Kennedy, P. 2019. Deep learning techniques for medical image segmentation: achievements and challenges. Journal of digital imaging, 32(4): 582–596.
Isensee et al. (2018) Isensee, F.; Petersen, J.; Klein, A.; Zimmerer, D.; Jaeger, P. F.; Kohl, S.; Wasserthal, J.; Koehler, G.; Norajitra, T.; Wirkert, S.; et al. 2018. nnu-net: Self-adapting framework for u-net-based medical image segmentation. arXiv preprint arXiv:1809.10486.
Kavur et al. (2021) Kavur, A. E.; Gezer, N. S.; Barış, M.; Aslan, S.; Conze, P.-H.; Groza, V.; Pham, D. D.; Chatterjee, S.; Ernst, P.; Özkan, S.; et al. 2021. CHAOS challenge-combined (CT-MR) healthy abdominal organ segmentation. Medical Image Analysis, 69: 101950.
Liu et al. (2020) Liu, Y.; Zhang, X.; Zhang, S.; and He, X. 2020. Part-aware prototype network for few-shot semantic segmentation. In European Conference on Computer Vision, 142–158. Springer.
Long, Shelhamer, and Darrell (2015) Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3431–3440.
Mehta et al. (2018) Mehta, S.; Mercan, E.; Bartlett, J.; Weaver, D.; Elmore, J. G.; and Shapiro, L. 2018. Y-Net: joint segmentation and classification for diagnosis of breast biopsy images. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 893–901. Springer.
Milletari, Navab, and Ahmadi (2016) Milletari, F.; Navab, N.; and Ahmadi, S.-A. 2016. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), 565–571. IEEE.
Noh, Hong, and Han (2015) Noh, H.; Hong, S.; and Han, B. 2015. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, 1520–1528.
Oktay et al. (2018) Oktay, O.; Schlemper, J.; Folgoc, L. L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N. Y.; Kainz, B.; et al. 2018. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
Oreshkin, Rodríguez López, and Lacoste (2018) Oreshkin, B.; Rodríguez López, P.; and Lacoste, A. 2018. Tadam: Task dependent adaptive metric for improved few-shot learning. Advances in neural information processing systems, 31.
Ørting et al. (2019) Ørting, S.; Doyle, A.; van Hilten, A.; Hirth, M.; Inel, O.; Madan, C. R.; Mavridis, P.; Spiers, H.; and Cheplygina, V. 2019. A survey of crowdsourcing in medical image analysis. arXiv preprint arXiv:1902.09159.
Ouyang et al. (2020) Ouyang, C.; Biffi, C.; Chen, C.; Kart, T.; Qiu, H.; and Rueckert, D. 2020. Self-supervision with superpixels: Training few-shot medical image segmentation without annotation. In European Conference on Computer Vision, 762–780. Springer.
Rakelly et al. (2018) Rakelly, K.; Shelhamer, E.; Darrell, T.; Efros, A. A.; and Levine, S. 2018. Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373.
Ronneberger, Fischer, and Brox (2015) Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, 234–241. Springer.
Roy et al. (2020) Roy, A. G.; Siddiqui, S.; Pölsterl, S.; Navab, N.; and Wachinger, C. 2020. ‘Squeeze & excite’ guided few-shot segmentation of volumetric images. Medical image analysis, 59: 101587.
Shaban et al. (2017) Shaban, A.; Bansal, S.; Liu, Z.; Essa, I.; and Boots, B. 2017. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410.
Sun et al. (2022) Sun, L.; Li, C.; Ding, X.; Huang, Y.; Chen, Z.; Wang, G.; Yu, Y.; and Paisley, J. 2022. Few-shot medical image segmentation using a global correlation network with discriminative embedding. Computers in biology and medicine, 140: 105067.
Tang et al. (2021) Tang, H.; Liu, X.; Sun, S.; Yan, X.; and Xie, X. 2021. Recurrent mask refinement for few-shot medical image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3918–3928.
Wang et al. (2018) Wang, G.; Zuluaga, M. A.; Li, W.; Pratt, R.; Patel, P. A.; Aertsen, M.; Doel, T.; David, A. L.; Deprest, J.; Ourselin, S.; et al. 2018. DeepIGeoS: a deep interactive geodesic framework for medical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 41(7): 1559–1572.
Wang et al. (2019) Wang, K.; Liew, J. H.; Zou, Y.; Zhou, D.; and Feng, J. 2019. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9197–9206.
Wang et al. (2021) Wang, W.; Wang, G.; Wu, X.; Ding, X.; Cao, X.; Wang, L.; Zhang, J.; and Wang, P. 2021. Automatic segmentation of prostate magnetic resonance imaging using generative adversarial networks. Clinical Imaging, 70: 1–9.
Zhang et al. (2020) Zhang, X.; Wei, Y.; Yang, Y.; and Huang, T. S. 2020. Sg-one: Similarity guidance network for one-shot semantic segmentation. IEEE transactions on cybernetics, 50(9): 3855–3865.
Zhuang (2016) Zhuang, X. 2016. Multivariate mixture model for cardiac segmentation from multi-sequence MRI. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 581–588. Springer.

Appendix A

Implementation Details of Query-Informed Threshold Adaptation

The Query-Informed Threshold Adaptation module consists of two operations, namely prototypical prediction and threshold adaptation. Figure 5 shows the workflow of this module for the path corresponding to the feature scale of $64 \times 64$.

Figure 5: Illustration of the “Query-informed Threshold Adaptation” module.

Given the support feature $F_{64}^{s}$ and the query feature $F_{64}^{q}$, each of size $64 \times 64 \times 512$, we compute the foreground prototype $p \in \mathbb{R}^{1 \times 512}$ from the support feature $F_{64}^{s}$ via masked average pooling. We then compute the negative cosine similarity map $S (h, w)$ between each query feature vector and the foreground prototype. This map has the same size $(H, W)$ as the support mask. This is the workflow of prototypical prediction with a single foreground prototype.
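The two operations in this paragraph can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the released implementation: feature maps are nested lists, the support mask is assumed to be already resized to the feature grid, and the final resizing of $S$ to $(H, W)$ is omitted. The function names are hypothetical.

```python
import math


def masked_average_pooling(features, mask):
    """Average the feature vectors at positions where mask == 1.

    features: H x W grid of C-dimensional vectors (nested lists).
    mask:     H x W grid of 0/1 values on the same grid.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask contains no foreground position")
    return [t / count for t in total]


def negative_cosine_similarity(vec, prototype, eps=1e-8):
    """Return -cos(vec, prototype); eps guards against zero-norm vectors."""
    dot = sum(a * b for a, b in zip(vec, prototype))
    norm = math.sqrt(sum(a * a for a in vec)) * math.sqrt(sum(b * b for b in prototype))
    return -dot / max(norm, eps)


def similarity_map(query_features, prototype):
    """Negative cosine similarity between every query vector and the prototype."""
    return [[negative_cosine_similarity(v, prototype) for v in row]
            for row in query_features]


# Toy 2x2 grids with C = 2 channels (hypothetical values).
support = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 1.0], [0.0, 1.0]]]
support_mask = [[1, 1], [0, 0]]
query = [[[2.0, 0.0], [0.0, 3.0]], [[1.0, 1.0], [-1.0, 0.0]]]

prototype = masked_average_pooling(support, support_mask)
sim = similarity_map(query, prototype)
```

Low values of `sim` mark query positions that resemble the foreground prototype, so a threshold on this map separates foreground from background.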

In the query-informed adaptive thresholding module, the adaptive threshold $T$ is learned from the query feature $F_{32}^{q}$ of size $32 \times 32 \times 512$, which is passed to an adaptive threshold generator $g_{ϕ}$. The generator consists of two fully-connected layers that reduce the channel size of the flattened feature maps from 2048 to 1000, and then to one. The threshold $T$ is therefore not a fixed value: it depends on the query image, which may contain an object of an “unseen” semantic class. At test time, the parameters of the feature encoder and the adaptive threshold generator are kept fixed during iterative prototype refinement.
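A rough sketch of such a generator is given below. Several points are assumptions: the text does not name an activation between the two layers, so ReLU is used; the weights are random rather than meta-learned; and the hard comparison in `apply_threshold` is a simplification that treats positions with a negative cosine similarity below $T$ as foreground. All function names are hypothetical.

```python
import random


def linear(x, weights, bias):
    """Fully-connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (stand-ins for meta-learned parameters)."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_threshold_generator(n_in=2048, n_hidden=1000, seed=0):
    """Build g_phi as n_in -> n_hidden -> 1, matching the sizes quoted in the text."""
    rng = random.Random(seed)
    w1, b1 = init_layer(n_in, n_hidden, rng)
    w2, b2 = init_layer(n_hidden, 1, rng)

    def g_phi(flat_query_feature):
        hidden = linear(flat_query_feature, w1, b1)
        hidden = [max(0.0, h) for h in hidden]  # assumed ReLU
        return linear(hidden, w2, b2)[0]        # scalar threshold T

    return g_phi


def apply_threshold(sim_map, threshold):
    """Hard-threshold simplification: similarity below T is labelled foreground."""
    return [[1 if s < threshold else 0 for s in row] for row in sim_map]


# Small sizes keep the demo fast; the defaults above follow the text.
g_phi = make_threshold_generator(n_in=8, n_hidden=4, seed=0)
T_a = g_phi([0.1 * i for i in range(8)])        # threshold for one query feature
T_b = g_phi([1.0 - 0.1 * i for i in range(8)])  # threshold for another query feature
demo_mask = apply_threshold([[-0.9, 0.2]], threshold=-0.5)
```

Because $T$ is produced by `g_phi` from the query feature, two different queries generally receive two different thresholds, which is the contrast with a fixed threshold drawn in Figure 1.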

Implementation Details of Query-Informed Prototype Refinement

The Query-Informed Prototype Refinement module is activated only at test time. Figure 6 illustrates the iterative workflow that updates the foreground prototype, guided by the gradient of the loss incurred when the query feature vectors in the foreground part of the predicted mask are replaced by the single foreground prototype $p$. In Figure 6, the adaptive threshold $T$ is determined by the query-informed threshold adaptation module.

Figure 6: Illustration of the “Query-informed Prototype Refinement” module

Pseudocode for the prototype refinement module corresponding to the feature scale of $64 \times 64$ is presented in Algorithm 1.

Require: Support feature $F_{64}^{s}$, support mask $m^{s}$, and query feature $F_{64}^{q}$

1: Compute an initial foreground prototype $p_{0}$

2: Compute the negative cosine similarity $S (h, w)$ between each query feature vector and the foreground prototype. Set $n = 0$

3: Get a predicted mask $\hat{m}_{n}^{q}$

4: Reallocate the foreground prototype to give new query features $\hat{F}_{n}^{q} (i, j)$:

$\hat{F}_{n}^{q} (i, j) = \begin{cases} F^{q} (i, j), & \hat{m}_{n}^{q} (i, j) = 0 \\ p_{n}, & \hat{m}_{n}^{q} (i, j) = 1 \end{cases}$

5: Compute the cross-entropy loss between the reconstructed query feature $\hat{F}_{n}^{q}$ and the original query feature $F^{q}$, and update $p_{n}$:

$p_{n + 1} = p_{n} - v \frac{\partial \sum_{n = 1}^{N} L (\hat{F}^{q}, F^{q})}{\partial p_{n}}$

6: If the stopping criterion is not met, repeat from step 2; otherwise, output $p_{n + 1}$

Algorithm 1 Query-informed Prototype Refinement
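The loop in Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation. A squared reconstruction error replaces the cross-entropy loss of step 5 so that the gradient has a closed form. The predicted mask comes from a hard threshold and is treated as constant when differentiating. The stopping criterion is a fixed iteration count. The name `refine_prototype` and the toy values are hypothetical.

```python
import math


def neg_cos(vec, proto, eps=1e-8):
    """Negative cosine similarity between a feature vector and the prototype."""
    dot = sum(a * b for a, b in zip(vec, proto))
    norm = math.sqrt(sum(a * a for a in vec)) * math.sqrt(sum(b * b for b in proto))
    return -dot / max(norm, eps)


def refine_prototype(query, p0, threshold, step=0.1, n_iters=30):
    """Iteratively pull the prototype towards the query features it selects.

    query:     flat list of C-dimensional query feature vectors.
    p0:        initial prototype computed from the support set.
    threshold: scalar T, held fixed during refinement.
    Assumed loss (replaces cross-entropy): mean over foreground positions
    of ||p - F^q(i, j)||^2, with the predicted mask treated as constant.
    """
    p = list(p0)
    for _ in range(n_iters):
        # Steps 2-3: similarity to the current prototype, then a hard mask.
        foreground = [vec for vec in query if neg_cos(vec, p) < threshold]
        if not foreground:
            break  # nothing selected, so the loss carries no signal
        # Steps 4-5: foreground features are replaced by p; this is the
        # gradient of the squared reconstruction error with respect to p.
        grad = [2.0 * sum(p[c] - vec[c] for vec in foreground) / len(foreground)
                for c in range(len(p))]
        p = [pc - step * g for pc, g in zip(p, grad)]
    return p


# Toy query with two foreground-like vectors and two background-like vectors.
query = [[1.0, 0.2], [0.9, 0.1], [-1.0, 0.5], [0.0, -1.0]]
p_init = [1.0, 1.0]  # support prototype that is shifted relative to the query
p_refined = refine_prototype(query, p_init, threshold=-0.5)
```

In this toy case the refined prototype moves from the shifted support prototype towards the mean of the two query vectors it selects, which is the query-informed correction that the module is meant to provide.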