Compound Figure Separation of Biomedical Images: Mining Large Datasets for Self-supervised Learning

Tianyuan Yao (tianyuan.yao@vanderbilt.edu), Vanderbilt University, Department of Computer Science, Nashville, TN, USA 37215
Chang Qu (chang.qu@vanderbilt.edu), Vanderbilt University, Department of Computer Science, Nashville, TN, USA 37215
Jun Long (junlong@csu.edu.cn), Central South University, Big Data Institute, Changsha, Hunan, China 410083
Quan Liu (quan.liu@vanderbilt.edu), Vanderbilt University, Department of Computer Science, Nashville, TN, USA 37215
Ruining Deng (r.deng@vanderbilt.edu), Vanderbilt University, Department of Computer Science, Nashville, TN, USA 37215
Yuanhan Tian (yuanhan.tian@vanderbilt.edu), Vanderbilt University, Department of Computer Science, Nashville, TN, USA 37215
Jiachen Xu (jiachen.xu@vanderbilt.edu), Vanderbilt University, Department of Computer Science, Nashville, TN, USA 37215
Aadarsh Jha (aadarsh.jha@vanderbilt.edu), Vanderbilt University, Department of Computer Science, Nashville, TN, USA 37215
Zuhayr Asad (zuhayr.asad@vanderbilt.edu), Vanderbilt University, Department of Computer Science, Nashville, TN, USA 37215
Shunxing Bao (shunxing.bao@vanderbilt.edu), Vanderbilt University, Department of Electrical and Computer Engineering, Nashville, TN, USA 37215
Mengyang Zhao (mengyang.zhao@dartmouth.edu), Dartmouth College, Hanover, NH, USA 03755
Agnes B. Fogo (agnes.fogo@vumc.org), Vanderbilt University Medical Center, Department of Pathology, Nashville, TN, USA 37215
Bennett A. Landman (bennett.landman@vanderbilt.edu), Vanderbilt University, Department of Electrical and Computer Engineering, Nashville, TN, USA 37215
Haichun Yang (haichun.yang@vumc.org), Vanderbilt University Medical Center, Department of Pathology, Nashville, TN, USA 37215
Catie Chang (catie.chang@vanderbilt.edu), Vanderbilt University, Department of Electrical and Computer Engineering, Nashville, TN, USA 37215
Yuankai Huo (yuankai.huo@vanderbilt.edu), Vanderbilt University, Department of Electrical and Computer Engineering, Nashville, TN, USA 37215

Abstract

With the rapid development of self-supervised learning (e.g., contrastive learning), the importance of having large-scale images (even without annotations) for training a more generalizable AI model has been widely recognized in medical image analysis. However, collecting task-specific unannotated data at scale can be challenging for individual labs. Existing online resources, such as digital books, publications, and search engines, provide a new source of large-scale images. However, published images in healthcare (e.g., radiology and pathology) include a considerable number of compound figures with subplots. In order to separate compound figures into usable individual images for downstream learning, we propose a simple compound figure separation (SimCFS) framework that does not use the traditionally required detection bounding box annotations, together with a new loss function and a hard case simulation. Our technical contribution is four-fold: (1) we introduce a simulation-based training framework that minimizes the need for resource-intensive bounding box annotations; (2) we propose a new side loss that is optimized for compound figure separation; (3) we propose an intra-class image augmentation method to simulate hard cases; and (4) to the best of our knowledge, this is the first study that evaluates the efficacy of leveraging self-supervised learning with compound image separation. From the results, the proposed SimCFS achieved state-of-the-art performance on the ImageCLEF 2016 Compound Figure Separation Database. The pretrained self-supervised learning model using large-scale mined figures improved the accuracy of downstream image classification tasks with a contrastive learning algorithm. The source code of SimCFS is made publicly available at https://github.com/hrlblab/ImageSeperation.

Journal of Machine Learning for Biomedical Imaging (MELBA) 2022:025, pages 1-18, 2022. Submitted 02/2022; published 08/2022. https://www.melba-journal.org/papers/2022:025.html

Keywords

Compound figures, Biomedical data, Deep learning, Contrastive learning, Self-supervised learning

1 Introduction

Self-supervised learning algorithms (e.g., contrastive learning) allow deep learning models to learn effective image representations from large-scale unlabeled data (celebi2016unsupervised; sathya2013comparison; chen2020simple). Thus, the important role of having large-scale images (even without annotations) for training a more generalizable AI model has been widely recognized in medical image analysis. However, even unannotated medical images can be difficult to obtain at scale for individual labs (zhang2017deep). Fortunately, online resources (e.g., the NIH Open-i $^{®}$ (demner2012design) search engine, academic images released by journals) provide a cost-effective and scalable way of obtaining large-scale images. However, the images from such resources include a considerable number of compound figures with subplots that cannot be directly used by modern self-supervised learning approaches (Fig. 1). To make the data useful, we need to extract individual subplots from the compound figures with compound figure separation algorithms (lee2015dismantling).

Recent contrastive learning methods have demonstrated advantages in pretraining a more generalizable deep learning model using large-scale unannotated individual images. However, the web-mined images from medical literature and search engines are not necessarily single images that can be directly used for contrastive learning. Therefore, the proposed SimCFS framework can be used to separate such compound images into individual images as unannotated training data for self-supervised learning.

Figure 1: Value of compound figure separation. This figure shows the hurdle (red arrow) of training self-supervised machine learning algorithms directly using large-scale biomedical image data from biomedical image databases (e.g., NIH OpenI) and academic journals (e.g., AJKD). When searching for desired tissues (e.g., “glomeruli”), a large proportion of the returned data are compound figures. Once separated, such data would advance medical image research via recent self-supervised learning algorithms, such as contrastive learning and autoencoder networks (huo2021ai).

Figure 2: The overall workflow of the proposed simple compound figure separation (SimCFS) framework. In the training stage, SimCFS only requires individual images from different categories. The pseudo compound figures are generated by the proposed augmentation simulator (SimCFS-AUG). Then, a detection network (SimCFS-DET) is trained to perform compound figure separation. In the testing stage (the gray panel), only the trained SimCFS-DET is required for separating the images.

Various compound figure separation approaches have been developed (davila2020chart; lee2015detecting; apostolova2013image; tsutsui2017data; shi2019layout; jiang2021two; huang2005associating), especially with recent advances in deep learning. However, previous approaches typically required resource-intensive bounding box annotations in order to frame the problem as a supervised detection task. In this paper, we propose a simple compound figure separation (SimCFS) framework that minimizes the need for bounding box annotations in compound figure separation. Briefly, the contribution of this study is four-fold:

$∙$ We introduce a simulation-based training framework that minimizes the need for resource-intensive bounding box annotations.

$∙$ We propose a new Side loss, which is an optimized detection loss for figure separation.

$∙$ We propose an intra-class image augmentation method to mimic the hard cases of compound images without clear boundaries.

$∙$ To the best of our knowledge, this is the first study that evaluates the efficacy of leveraging self-supervised learning with compound image separation.

We apply our technique to conduct compound figure separation for renal pathology (in-house data) as well as on the ImageCLEF 2016 Compound Figure Separation Database (publicly available). Glomerular phenotyping (koziell2002genotype) is a fundamental task for efficient diagnosis and quantitative evaluations in renal pathology. Recently, deep learning techniques have played increasingly important roles in renal pathology to reduce the clinical workload of pathologists and enable large-scale population-based research (gadermayr2017cnn; bueno2020glomerulosclerosis; govind2018glomerular; kannan2019segmentation; ginley2019computational). Due to the lack of a publicly available dataset for renal pathology, it is appealing to extract large-scale glomerular images from public databases (e.g., the NIH Open-i $^{®}$ search engine) for downstream self-supervised or semi-supervised learning (huo2021ai). Meanwhile, the ImageCLEF 2016 dataset covers various types of organs and sources of large-scale medical images, and is arguably the most widely used testbed for compound image separation tasks. Both cohorts are used to evaluate the performance of different methods.

This work extends our conference paper (yao2021compound) with the new efforts listed below: (1) we include more technical and evaluation details for the proposed method; (2) we provide a more comprehensive literature review and discussion of related work; (3) we perform a more rigorous evaluation (five-fold cross-validation); (4) we conduct a more comprehensive evaluation with more baseline compound image generation and separation methods (e.g., tsutsui2017data); (5) we evaluate the efficacy of leveraging self-supervised learning with compound image separation by comparing against both supervised and semi-supervised methods; (6) our web-mined glomerular dataset (20,000 images), as well as the source code of SimCFS, are released to the public with this paper.

2 Related Work

2.1 Compound Figure Separation

In biomedical articles, about 40-60\% of figures are multi-panel (kalpathy2015evaluating). Several methods have been proposed in the document analysis community that involve extracting figures and their semantic information. For example, huang2005associating presented their recognition results for textual and graphical information in figures from the literature. davila2020chart presented a survey of several data mining pipelines for future research.

2.1.1 Traditional vision approaches

In order to collect scientific data massively and automatically, various approaches have been proposed in prior art (10.1093/bioinformatics/btx611; 10.1007/978-3-319-65813-1_20; lee2015dismantling). For example, lee2015detecting proposed an SVM-based binary classifier to distinguish completed charts from visual markers, such as labels, legends, and ticks. apostolova2013image proposed a figure separation method via a capital index. These traditional computer vision approaches commonly relied on the figure’s grid-based layout. Thus, the separation was usually accomplished by simple horizontal and vertical cuts based on the image boundary information.

2.1.2 Deep learning Methods

In the past few years, deep learning based algorithms, especially convolutional neural networks (CNNs), have provided considerably superior performance in extracting and separating subplots from compound images. tsutsui2017data proposed a CNN-based approach that treated compound figure segmentation as an object localization problem by estimating the bounding boxes of subplots. This was one of the earliest deep learning-based approaches to achieve compound figure separation via a deep convolutional neural network. Tsutsui et al. applied You Only Look Once (YOLO) Version 2, a CNN-based detection network, which utilized a single convolutional network to predict bounding boxes and class probabilities simultaneously. They also trained on artificially constructed datasets and reported superior performance on the ImageCLEF dataset (GSB2016). shi2019layout developed a multi-branch output CNN to predict irregular panel layouts and provided augmented data to drive learning. Their network separated compound figures with structures of different sizes with better accuracy.

More recently, anchor-based approaches have attracted great attention in the object detection field due to their concise network architectures and high computational efficiency. Introducing anchors encodes prior knowledge of the object distribution, which also matches the compound figure setting well. YOLOv4 was used by jiang2021two to achieve superior detection performance. They combined a traditional vision method with high-performance deep learning networks by detecting the sub-figure labels and then optimizing the feature selection process in sub-figure detection. YOLO has since been updated to v5, which inherits the advantages of YOLOv4 (bochkovskiy2020yolov4). YOLOv5 integrates spatial pyramid pooling with new data augmentation methods such as Mosaic training, and balances model size and detection speed, achieving faster detection and higher accuracy.

2.2 Self-supervised learning method

Supervised learning refers to the use of a set of input variables to predict the value of a labeled output variable. It requires labeled data (like an answer key that the model can use to evaluate its performance). Conversely, self-supervised learning (celebi2016unsupervised) refers to inferring underlying patterns from an unlabeled dataset without any reference to labeled outcomes or predictions.

Recently, a new family of self-supervised representation learning methods, called contrastive learning, has shown superior performance in various vision tasks (wu2018unsupervised; noroozi2016unsupervised; zhuang2019local; hjelm2018learning). By learning from large-scale unlabeled data, contrastive learning can learn discriminative features for downstream tasks. SimCLR (chen2020simple) maximizes the similarity between augmented views of the same image and repels the representations of different images. wu2018unsupervised uses an offline dictionary to store all data representations and randomly selects training data to maximize negative pairs. MoCo (he2020momentum) introduces a momentum design to maintain a negative sample pool instead of an offline dictionary. Such works demand a large batch size in order to include sufficient negative samples. To eliminate the need for negative samples, BYOL (grill2020bootstrap) was proposed to train a model with an asynchronous momentum encoder. Recently, SimSiam (chen2020exploring) was proposed to further eliminate the momentum encoder in BYOL, allowing for less GPU memory consumption.

3 Methods

The overall framework of SimCFS is presented in Fig. 2. The training stage of SimCFS contains two major steps: (1) compound figure simulation, and (2) sub-figure detection. In the training stage, the SimCFS network can be trained with either a binary (background and sub-figure) or a multi-class setting. The purpose of the compound figure simulation is to collect large-scale training compound images in an annotation-free manner. In the testing stage, only the detection network is needed; its output is the set of bounding boxes of the sub-figures, which enables us to crop those images in a fully automatic manner. The binary detector can serve as a compound figure separator, while the multi-class detector can be used for mining web images of the categories of interest.

3.1 Anchor-based detection

YOLOv5, the latest version in the YOLO family (bochkovskiy2020yolov4), is employed as the backbone network for sub-figure detection. The rationale for choosing YOLOv5 is that the sub-figures in compound figures are typically arranged in horizontal or vertical order. Hence, the grid-based design with anchor boxes adapts well to our application. A new Side loss is introduced to the detection network to further optimize the performance of compound figure separation.

3.2 Compound figure simulation

Our goal is to only utilize individual images, which are non-compound images with weak classification labels in training a compound image separation method. In previous studies, the same task typically requires stronger bounding box annotations of subplots using real compound figures. In compound figure separation tasks, a unique advantage is that the sub-figures are not overlapped. Moreover, their spatial distributions are more ordered as compared with natural images in object detection. Therefore, we propose to directly simulate compound figures from individual images as the training data for the downstream sub-figure detection.

tsutsui2017data proposed a compound figure synthesis approach (Fig. 3). The method first randomly samples a number of rows and generates random heights for each row. Then a random number of single figures fills the empty template. However, the single figures are naively resized to fit the template, with large distortion (Fig. 3).

Figure 3: Compound figure simulation. (a) The upper panel shows the previously proposed compound figure synthesis strategy. It first generates the figure grids and then fills with images that have undergone image distortion, which is unusual in real compound figures. (b) The lower panel presents the proposed SimCFS-AUG compound figure simulator. It keeps the original ratio of individual images in an adaptive manner. Beyond this step of keeping original ratios, an intra-class augmentation is introduced to simulate the hard cases in which the boundaries are not explicitly visible between similar subplots. (Bounding boxes are displayed for visualization and are not actually visible in the training data)

Algorithm 1 Compound figure simulation

Input:
     Single images $X_{i}$ in $k$ classes
     Set of training input indices with known labels $L_{1}, L_{2}, \ldots, L_{k}$
Output:
     Compound figure $\overline{C}_{j}$
     Annotation file $A_{j}$

1: for each pseudo compound figure $\overline{C}_{j}$ do
2:     Stage 1: Space initialization    ▹ Multi real world case simulation
3:     Layout $\leftarrow$ row-restricted or column-restricted
4:     Classes $\leftarrow$ multi or intra    ▹ Add intra-class augmentation
5:     Number of rows/columns $\leftarrow n \in [2, 5]$
6:     if layout is row-restricted then
7:         Width $W_{\overline{C}_{j}} \leftarrow 640$, Height $H_{\overline{C}_{j}} \leftarrow \sum_{p = 1}^{n} H_{p}$ while $\frac{3}{4} \leq \text{aspect ratio} \leq \frac{4}{3}$    ▹ Keep close to real world aspect ratio
8:         ▹ Each row’s height $H_{1}, \ldots, H_{p}$ should be in certain range
9:     else if layout is column-restricted then
10:        Height $H_{\overline{C}_{j}} \leftarrow 640$, Width $W_{\overline{C}_{j}} \leftarrow \sum_{q = 1}^{n} W_{q}$ while $\frac{3}{4} \leq \text{aspect ratio} \leq \frac{4}{3}$
11:        ▹ Each column’s width $W_{1}, \ldots, W_{q}$ should be in certain range
12:    Stage 2: Fit in preset space
13:    for row/column in $n$ do
14:        if Classes is multi then
15:            Create ImagePool $I$, for images $X_{i}$ in $I$, $i \in L_{1}, L_{2}, \ldots, L_{k}$
16:        else if Classes is intra then
17:            Create ImagePool $I$, for images $X_{i}$ in $I$, $i \in L_{m}, m \in [1, k]$
18:        Random fill in resized images from ImagePool (keeping original ratio)
19:        Save resized $w_{i}^{'}, h_{i}^{'}$, center position $x_{i}, y_{i}$ to $A_{j}$
20:    Stage 3: Output: compound figure $\overline{C}_{j}$, annotation $A_{j}$

Inspired by prior art (tsutsui2017data), we propose a simple augmentation strategy that is specific to compound figure separation data, called SimCFS-AUG, to perform compound figure simulation. The inputs of the simulator are single images with specified classes. Two groups of compound figures are simulated: row-restricted and column-restricted. The length of each row or column is randomly generated within a certain range. Then, images from our database are randomly selected and concatenated to fit in the preset space. As opposed to previous studies, the original aspect ratio of individual images is kept within our SimCFS-AUG simulator so as to mimic realistic compound images without distortion in the individual images.
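The row-restricted branch of Algorithm 1 can be sketched in a few lines of Python. This is a minimal layout-only sketch, not the released SimCFS-AUG code: the function name `simulate_row_restricted`, the `(w, h)` input format, and the choice to leave leftover row space blank are assumptions made for illustration. It only computes the canvas size and the annotation boxes; pasting pixels is omitted.

```python
import random

def simulate_row_restricted(image_sizes, width=640, rng=None):
    """Return (canvas_w, canvas_h, boxes) for one pseudo compound figure.

    image_sizes: list of (w, h) for the available single images (assumed format).
    boxes: list of (x_center, y_center, w, h) in pixels, one per placed image.
    """
    rng = rng or random.Random()
    n_rows = rng.randint(2, 5)
    # 3/4 <= width/height <= 4/3 is equivalent to 3w/4 <= height <= 4w/3.
    h_min, h_max = 0.75 * width, width * 4.0 / 3.0
    # Drawing every row height from [h_min/n, h_max/n] keeps the sum in range.
    row_heights = [rng.uniform(h_min / n_rows, h_max / n_rows)
                   for _ in range(n_rows)]

    boxes, y_top = [], 0.0
    for row_h in row_heights:
        x_left = 0.0
        while True:
            w, h = rng.choice(image_sizes)
            # Keep the original ratio; never exceed the row height or canvas width.
            scale = min(row_h / h, width / w)
            new_w, new_h = w * scale, h * scale
            if x_left > 0 and x_left + new_w > width:
                break  # row is full; leftover space stays blank (assumption)
            boxes.append((x_left + new_w / 2, y_top + new_h / 2, new_w, new_h))
            x_left += new_w
        y_top += row_h
    return width, y_top, boxes
```

Passing sizes drawn from a single class pool corresponds to the intra-class setting, while mixing pools corresponds to the multi-class setting; the column-restricted case follows by swapping the roles of width and height.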

Figure 4: Proposed Side loss for figure separation. The upper panel shows the principle of the Side loss, in which penalties only apply when vertices of detected bounding boxes are outside of the true box regions. The lower left panel shows the bias of the current IoU loss towards over detection. When an under detection case (yellow box) and an over detection case (red box) have the same margins ( $d$ ) from predicted to true boxes, the over detection has the smaller loss (larger IoU). The lower right panel shows under detection and over detection examples of compound figure separation with the same IoU loss. The Side loss is proposed to break this tie, given that the results in the yellow boxes are less contaminated by nearby figures than the results in the red boxes (green arrows).

3.3 Side loss for compound figure separation

For object detection on natural images, there is no specific preference between over detection and under detection, as objects can be randomly located and even overlapped. In medical compound images, however, objects are typically closely attached to each other without overlapping. In this case, over detection would introduce undesired pixels from the nearby plots (Fig. 4), which is not ideal for downstream deep learning tasks. Unfortunately, compared with under detection, over detection is often encouraged by the current Intersection over Union (IoU) loss in object detection (Fig. 4).
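The bias can be checked with a small worked example. The numbers are assumptions chosen for illustration only: a $100 \times 100$ ground truth box and a margin $d = 10$ on every side, so the under detection is an $80 \times 80$ box inside the truth and the over detection is a $120 \times 120$ box containing it.

```latex
\begin{align*}
\mathrm{IoU}_{\mathrm{under}} &= \frac{(100 - 2d)^{2}}{100^{2}} = \frac{80^{2}}{100^{2}} = 0.64, \\
\mathrm{IoU}_{\mathrm{over}}  &= \frac{100^{2}}{(100 + 2d)^{2}} = \frac{100^{2}}{120^{2}} \approx 0.694 .
\end{align*}
```

With the same margin, the over detection obtains the larger IoU and therefore the smaller IoU loss, even though it is the case that includes pixels from neighboring sub-figures.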

In the SimCFS-DET network, we introduce a simple side loss, which will penalize over detection. We define a predicted bounding box as $B^{p}$ and a ground truth box as $B^{g}$ , with coordinates: $B^{p} = (x_{1}^{p}, y_{1}^{p}, x_{2}^{p}, y_{2}^{p})$ , $B^{g} = (x_{1}^{g}, y_{1}^{g}, x_{2}^{g}, y_{2}^{g})$ . The over detection penalty of vertices for each box is computed as:

	$x_{1}^{I} = \max (0, x_{1}^{g} - x_{1}^{p}), \quad y_{1}^{I} = \max (0, y_{1}^{g} - y_{1}^{p})$
	$x_{2}^{I} = \max (0, x_{2}^{p} - x_{2}^{g}), \quad y_{2}^{I} = \max (0, y_{2}^{p} - y_{2}^{g})$		(1)

Then, the Side loss is defined as:

	$L_{side} = x_{1}^{I} + y_{1}^{I} + x_{2}^{I} + y_{2}^{I}$		(2)

The side loss is combined with the canonical loss functions in YOLOv5, including the bounding box loss ( $L_{box}$ ), the object probability loss ( $L_{obj}$ ), and the classification loss ( $L_{cls}$ ):
$L_{total} = \lambda_{1} L_{box} + \lambda_{2} L_{obj} + \lambda_{3} L_{cls} + \lambda_{4} L_{side}$ , where $\lambda_{1}$ , $\lambda_{2}$ , $\lambda_{3}$ , $\lambda_{4}$ are constant weights to balance the four loss functions. Following YOLOv5’s implementation (https://github.com/ultralytics/yolov5), the parameters were set as $\lambda_{1} = box \times (3 / nl)$ , $\lambda_{2} = obj \times (imgsize / 640)^{2} \times (3 / nl)$ , $\lambda_{3} = (cls \times num\_cls / 80) \times (3 / nl)$ , where $num\_cls$ was the number of classes, $nl$ was the number of layers, and $imgsize$ was the image size. The $\lambda_{4}$ of the Side loss was empirically set to $\lambda_{1} / 30$ across all experiments, as the Side loss and the box loss are both based on box coordinates.
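Equations (1) and (2) translate directly into code. The following is a minimal sketch for a single box pair in plain Python; the function name `side_loss` and the use of pixel coordinates are assumptions, and the actual SimCFS-DET implementation operates on batched tensors inside YOLOv5.

```python
def side_loss(pred, gt):
    """Side loss of Eq. (1)-(2) for one predicted and one ground truth box.

    Boxes are (x1, y1, x2, y2) with (x1, y1) the top-left corner.
    Only sides of the prediction that extend beyond the truth are penalized.
    """
    x1p, y1p, x2p, y2p = pred
    x1g, y1g, x2g, y2g = gt
    x1_i = max(0.0, x1g - x1p)  # prediction starts left of the truth
    y1_i = max(0.0, y1g - y1p)  # prediction starts above the truth
    x2_i = max(0.0, x2p - x2g)  # prediction ends right of the truth
    y2_i = max(0.0, y2p - y2g)  # prediction ends below the truth
    return x1_i + y1_i + x2_i + y2_i


gt = (100, 100, 200, 200)
under = (110, 110, 190, 190)  # inside the truth on every side
over = (90, 90, 210, 210)     # 10 pixels beyond the truth on every side
print(side_loss(under, gt), side_loss(over, gt))
```

An under detection incurs no Side loss, while an over detection is penalized in proportion to how far each side spills over, which is what counteracts the IoU bias.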

Figure 5: Qualitative results. This figure shows a qualitative comparison of the proposed SimCFS approach with the YOLOv5 benchmark.

4 Experimental Design

4.1 Data

We collected two in-house datasets for evaluating the performance of different compound figure separation strategies. One compound figure dataset (called Glomeruli-2000) consisted of 917 training and 917 testing real figure plots from the American Journal of Kidney Diseases (AJKD), with keywords “glomerular OR glomeruli OR glomerulus”. Each figure was annotated manually with four classes, including glomeruli from (1) light microscopy, (2) fluorescence microscopy, (3) electron microscopy, and (4) charts/plots.

To obtain individual images to simulate compound figures, we downloaded 5,663 single individual images from online resources. Briefly, we obtained 1,037 images from Twitter, and obtained 4,626 images from Google search, with five classes, including individual images from (1) glomeruli with light microscopy, (2) glomeruli with fluorescence microscopy, (3) glomeruli with electron microscopy, (4) charts/plots, and (5) others. The individual images were combined using the SimCFS-AUG simulator in order to generate 7,000 pseudo training images. 2,000 of the pseudo images (with multiple sub-figures) were simulated using intra-class augmentation. In addition, 2,947 individual images were further employed as training data. The implementation of SimCFS-DET was based on YOLOv5 with PyTorch implementations. Google Colab was used to perform all experiments in this study.

4.2 Implementation Details

In the experimental setting, the parameters were chosen empirically. We set the learning rate to 0.01, weight decay to 0.0005, and momentum to 0.937. The input image size was set to 640, $box$ to 0.5, $obj$ to 1, $cls$ to 0.5, and the number of layers to 3. For our in-house datasets, we trained for 50 epochs using a batch size of 64. For the ImageCLEF2016 dataset (GSB2016), we trained for 50 epochs using a smaller batch size of 8.
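Plugging these settings into the weight definitions of Section 3.3 gives the concrete loss weights. The sketch below is illustrative only: the function name `loss_weights` is hypothetical, and `num_cls=4` is an assumption based on the four annotated classes of the in-house data, since the value depends on the dataset.

```python
def loss_weights(box=0.5, obj=1.0, cls=0.5, num_cls=4, nl=3, imgsize=640):
    """Loss weights following the formulas in Section 3.3."""
    lam1 = box * (3 / nl)                          # bounding box loss
    lam2 = obj * (imgsize / 640) ** 2 * (3 / nl)   # object probability loss
    lam3 = (cls * num_cls / 80) * (3 / nl)         # classification loss
    lam4 = lam1 / 30                               # Side loss
    return lam1, lam2, lam3, lam4


print(loss_weights())
```

With three layers and a 640-pixel input, the scaling factors $(3 / nl)$ and $(imgsize / 640)^{2}$ both equal 1, so $\lambda_{1}$ and $\lambda_{2}$ reduce to the raw $box$ and $obj$ values and $\lambda_{4} = 0.5 / 30 \approx 0.0167$.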

4.3 Evaluation Metrics

Mean average precision was the primary metric used to evaluate detection performance. For a given IOU threshold, average precision was obtained by calculating the area under the 101-point interpolated precision-recall curve. Then, mean average precision ( $AP$ ) is the mean of the average precision over IOU thresholds from 0.5 to 0.95 with a step size of 0.05. $AP_{50}$ is the average precision with an IOU threshold of 0.5. $AP_{75}$ is the average precision with an IOU threshold of 0.75. $AP_{S}$ is the mean average precision for small objects (area less than $32^{2}$ ). $AP_{M}$ is the mean average precision for medium objects (area between $32^{2}$ and $96^{2}$ ). Since no objects had an area greater than $96^{2}$ , the mean average precision for large objects ( $AP_{L}$ ) was not used.
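As an illustration of the 101-point interpolation, the sketch below computes average precision at a single IOU threshold from a precision-recall curve. It is a simplified reading of the COCO-style procedure, not the evaluation code used in the paper; the function name `average_precision_101` and the list-based inputs are assumptions, and matching detections to ground truth boxes is taken as already done.

```python
def average_precision_101(recalls, precisions):
    """101-point interpolated AP at one IOU threshold.

    recalls, precisions: values after each detection, with detections sorted
    by descending confidence (so recalls are non-decreasing).
    """
    # Precision envelope: make precision non-increasing as recall grows.
    envelope = list(precisions)
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])

    total = 0.0
    for k in range(101):
        threshold = k / 100
        # Precision at the first point whose recall reaches the threshold,
        # or 0 if that recall level is never reached.
        p = 0.0
        for r, e in zip(recalls, envelope):
            if r >= threshold:
                p = e
                break
        total += p
    return total / 101
```

Averaging this value over the ten IOU thresholds 0.50, 0.55, ..., 0.95 would give the $AP$ figure described above.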

5 Results

5.1 Ablation Study

In this ablation study, we evaluate the image separation performance on 917 real compound images with manual box annotations as testing data (Table 1 and Fig. 5). For training, we assessed the performance of using 917 real compound training images (“Real Training Images”), as well as the performance when only using simulated training images (“Simulated Training Images”).

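From the results, the proposed Side loss consistently improves the detection performance: the overall mAP increases from 69.8 to 79.2 with real training images and from 66.4 to 69.4 with simulated training images (Table 1). The proposed compound image simulation method with intra-class self-augmentation achieves the best overall performance (80.3), exceeding the model trained on real compound figures.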

Method	Training Data	SL	AUG	All	Light	Fluo.	Elec.	Chart
YOLOv5	$R$			69.8	77.1	71.3	73.4	57.4
SimCFS-DET (ours)	$R$	✓		79.2	86.1	80.9	84.2	65.8
YOLOv5	$\bar{S}$			63.8	76.4	60.1	72.5	46.8
YOLOv5	$S$			66.4	79.3	62.1	76.1	48.0
YOLOv5	$S$		✓	71.4	82.8	72.1	75.3	47.1
SimCFS (ours)	$\bar{S}$	✓		68.9	77.1	66.8	82.5	49.1
SimCFS (ours)	$S$	✓		69.4	77.6	67.1	84.1	48.8
SimCFS (ours)	$S$	✓	✓	80.3	89.9	78.7	87.4	58.8

*The best and second best performances are denoted by bold and underline.
*For training data, $R$ denotes real compound figures, $S$ denotes simulated images, and $\bar{S}$ denotes the grid-based synthetic method of tsutsui2017data.
*SL is the side loss; AUG is the intra-class self-augmentation.
*All is the overall mAP $_{0.5 : .95}$ , which is reported for all concerned classes (light, fluorescence, and electron microscopy).

Table 1: The ablation study with different types of training data.

Method	Backbone	mAP $_{0.5}$	mAP $_{0.5 : .95}$
tsutsui2017data	YOLOv2	69.8	-
tsutsui2017data	Transfer	77.3	-
zou2020unified	ResNet152	78.4	-
zou2020unified	VGG19	81.1	-
YOLOv5 (bochkovskiy2020yolov4)	YOLOv5	85.3	69.5
SimCFS-DET (ours)	YOLOv5	88.9	71.2
SimCFS-DET ensemble (ours)	YOLOv5	90.3	71.5

Table 2: The results on ImageCLEF2016 dataset.

5.2 Comparison with State-of-the-art

We also compare SimCFS-DET with state-of-the-art approaches, including tsutsui2017data and zou2020unified, using the ImageCLEF2016 dataset (GSB2016). ImageCLEF2016 is the commonly accepted benchmark for compound figure separation, including a total of 8,397 annotated multi-panel figures (6,783 figures for training and 1,614 figures for testing). Table 2 shows the results on the ImageCLEF2016 dataset. The proposed SimCFS-DET approach outperforms the other methods on the reported evaluation metrics. Additionally, we applied five-fold cross-validation to our model training and merged the resulting predictions using weighted boxes fusion (solovyev2021weighted). To merge the bounding boxes from the five predictions, the method uses the confidence scores of all proposed bounding boxes to construct averaged boxes. When combining SimCFS-DET with weighted boxes fusion (SimCFS-DET ensemble), the performance was further improved.
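The idea of confidence-weighted box averaging can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of solovyev2021weighted in full and not the API of any fusion library: the function names, the greedy clustering, the `iou_thr=0.55` value, and the use of the mean score as the fused confidence are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def fuse_cluster(boxes, scores):
    """Confidence-weighted average box and mean confidence of one cluster."""
    total = sum(scores)
    fused = tuple(sum(s * b[i] for b, s in zip(boxes, scores)) / total
                  for i in range(4))
    return fused, total / len(scores)


def weighted_fusion(detections, iou_thr=0.55):
    """detections: list of (box, score) pooled from all fold models."""
    clusters = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        for c in clusters:
            if iou(c["fused"], box) > iou_thr:
                # Box matches an existing cluster: add it and refresh the average.
                c["boxes"].append(box)
                c["scores"].append(score)
                c["fused"], _ = fuse_cluster(c["boxes"], c["scores"])
                break
        else:
            clusters.append({"boxes": [box], "scores": [score], "fused": box})
    return [fuse_cluster(c["boxes"], c["scores"]) for c in clusters]
```

Unlike non-maximum suppression, which discards all but the highest-scoring box in a cluster, this kind of fusion lets every fold's prediction contribute to the final coordinates.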

5.3 Application on Contrastive Learning

We demonstrate the application of our SimCFS framework and how it helps to provide massive biomedical image data and benefits further data analysis with self-supervised representation learning.

In this study, self-supervised contrastive learning was employed as an example downstream task for our SimCFS compound image separation approach. To evaluate the benefit of introducing the separated images, a semi-supervised method was evaluated in addition to the supervised benchmark, using the same set of unannotated images as the contrastive learning approach (Table 3). Specifically, the stain and imaging modality classification task is employed to evaluate the performance of different approaches.

5.3.1 Data

We first collected 10,000 compound figures with the keywords ‘glomerular OR glomeruli OR glomerulus’. We then processed all compound figures with our SimCFS network, using a confidence threshold of 0.7, and obtained more than 20,000 glomerular pathology images acquired with different microscopy modalities or stains.
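The mining step amounts to keeping confident detections and cropping them out of each compound figure. The sketch below illustrates this on a nested-list image; the function name `crop_confident_subfigures`, the `(x1, y1, x2, y2, conf)` detection format, and the inclusive comparison with the threshold are assumptions for illustration.

```python
def crop_confident_subfigures(image, detections, conf_thr=0.7):
    """Crop sub-figures whose detection confidence reaches the threshold.

    image: 2D list of pixel rows.
    detections: iterable of (x1, y1, x2, y2, conf) in integer pixel coordinates.
    """
    crops = []
    for x1, y1, x2, y2, conf in detections:
        if conf < conf_thr:
            continue  # drop low-confidence boxes
        crops.append([row[x1:x2] for row in image[y1:y2]])
    return crops
```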

Other in-house data are 3,000 manually annotated glomeruli pathologies with seven classes, including glomeruli from (1) electron microscopy, (2) fluorescence microscopy, and light microscopy with different stains of (3) PAS, (4) silver, (5) H&E, (6) Masson and (7) other.

5.3.2 Approach

We used the SimSiam network (chen2020exploring) as the baseline method of contrastive learning. The 20,000 glomerular pathology images were used to train the SimSiam network. Two random augmentations of the same image were used as training data. In all of our self-supervised pre-training, images were resized to $224 \times 224$ pixels. We used momentum SGD as the optimizer. The weight decay was set to 0.0005. The base learning rate was $lr = 0.05$ and the batch size was 64. The learning rate was $lr \times$ BatchSize $/ 256$ , which followed a cosine decay schedule (loshchilov2017sgdr).
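The resulting schedule can be written out explicitly. This is a minimal sketch assuming a single cosine decay to zero with no warm-up or restarts, which the text does not specify; the function name `cosine_lr` and the step-based parameterization are hypothetical.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.05, batch_size=64):
    """Learning rate at a given step under linear scaling plus cosine decay."""
    init_lr = base_lr * batch_size / 256  # 0.05 * 64 / 256 = 0.0125
    return init_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

Under these assumptions the rate starts at 0.0125, falls to half of that at the midpoint of training, and reaches zero at the final step.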

To apply the self-supervised pre-trained networks, we froze the pretrained ResNet-50 backbone and added one extra linear layer after the global average pooling layer. When fine-tuning with the 3,000 manually annotated glomeruli data, only the extra linear layer was trained. To prevent model over-fitting, we applied 5-fold cross-validation by dividing our data into 5 folds, using four of the five folds as training data and the remaining fold as validation. We used the SGD optimizer to train the linear classifier with a base (initial) learning rate $lr$ =30, weight decay=0, momentum=0.9, and batch size=64 (following chen2020exploring). We trained the linear classifiers for 100 epochs and selected the best model based on the validation set.
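The fold rotation described above can be sketched with the standard library alone. The function name `k_fold_indices`, the shuffling seed, and the strided fold assignment are assumptions; the paper does not state how samples were assigned to folds.

```python
import random

def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, val_indices) for each of the n_folds rotations."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, val
```

Each sample appears in exactly one validation fold, so every labeled image is used for both training and validation across the five runs.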

5.3.3 Results

Fine-tuning our pretrained SimSiam (backbone: ResNet-50) on 2.3K labeled images performed better than training from scratch (F1 score 0.900 vs. 0.845). Interestingly, our model also outperformed the ResNet-50 model pretrained on ImageNet (F1 score 0.888). Table 3 shows the results.

Methods	Unlabeled Images	Labeled Images	F1 Score	Balanced Acc
Supervised methods:
Random Init	-	2.3k	0.845	0.843
ImageNet Init	-	2.3k	0.888	0.883
Semi-supervised method:
Temporal Ensembling	20k	2.3k	0.892	0.885
Self-supervised methods:
SimSiam	-	2.3k	0.891	0.893
SimSiam w. SimCFS	20k	2.3k	0.900	0.904

*For the supervised methods, we trained all layers of ResNet-50 (randomly initialized or ImageNet-pretrained) with fully supervised learning.

Table 3: Classification performance.
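For readers unfamiliar with the two metrics reported in Table 3, the sketch below shows one common way to compute them for a multi-class task. It is an assumption that balanced accuracy here means the mean of per-class recall and that the F1 score is macro-averaged over classes; the text does not state the averaging used, and the helper names are hypothetical.

```python
from typing import Dict, List


def per_class_counts(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, int]]:
    """Count true positives, false positives, and false negatives per class."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[t]["fn"] += 1
            counts[p]["fp"] += 1
    return counts


def balanced_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Mean recall over the classes that occur in y_true."""
    counts = per_class_counts(y_true, y_pred)
    recalls = [
        c["tp"] / (c["tp"] + c["fn"]) for c in counts.values() if c["tp"] + c["fn"] > 0
    ]
    return sum(recalls) / len(recalls)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    counts = per_class_counts(y_true, y_pred)
    scores = []
    for c in counts.values():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        scores.append(2 * c["tp"] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = ["PAS", "PAS", "H&E", "H&E"]
    preds = ["PAS", "H&E", "H&E", "H&E"]
    print(balanced_accuracy(truth, preds), macro_f1(truth, preds))
```

Both metrics weight every class equally, which matters when the seven stain and modality classes are not equally represented.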

6 Discussion

In this study, we developed a new compound image separation framework with the ultimate goal of advancing downstream machine learning tasks. Recent contrastive learning methods have demonstrated the advantage of pretraining a more generalizable deep learning model using large-scale unannotated individual images. However, web-mined images from the medical literature and search engines are not necessarily single images that can be used directly for contrastive learning. The proposed SimCFS can therefore be used to separate such compound images into individual images that serve as unannotated training data for self-supervised learning.

The YOLO method was employed because it is a broadly used anchor-based backbone in previous compound image separation algorithms. However, our framework is open: the YOLO method can be replaced by other object detection backbones (e.g., anchor-free methods), possibly with even better performance.

The proposed approach aims to improve the accuracy of image separation through both the Side loss function and hard case simulation. The Side loss is based on the knowledge that sub-figures in compound figures do not overlap. By adding a penalty for overestimated bounding boxes, it discourages predictions that extend beyond the true box regions.
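To make the idea of penalizing overestimated boxes concrete, the sketch below sums how far a predicted box extends past the ground-truth box on each of its four sides. This is only an illustration of the principle and is not the Side loss as defined in this paper: the function name `side_penalty`, the `(x_min, y_min, x_max, y_max)` box format, and the unweighted sum are all assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def side_penalty(pred: Box, truth: Box) -> float:
    """Total overshoot of the predicted box beyond the true box, summed over four sides."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = truth
    left = max(0.0, tx1 - px1)    # prediction starts left of the true box
    top = max(0.0, ty1 - py1)     # prediction starts above the true box
    right = max(0.0, px2 - tx2)   # prediction ends right of the true box
    bottom = max(0.0, py2 - ty2)  # prediction ends below the true box
    return left + top + right + bottom


if __name__ == "__main__":
    print(side_penalty((0, 0, 12, 10), (2, 0, 10, 10)))  # overshoots left and right
    print(side_penalty((3, 1, 9, 9), (2, 0, 10, 10)))    # fully inside the true box
```

A prediction that stays inside the true box incurs no penalty under this sketch, so such a term only pushes boxes inward and would need to be combined with a standard localization loss.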

Second, with our compound figure simulation method, SimCFS can be trained with only synthetic compound figures generated from a small number of annotated individual images. At the beginning of our experiments, when we synthesized row-restricted and column-restricted compound figures using images from all classes, the results were not as good as those obtained with real compound image data. To overcome this issue, we proposed the intra-class image augmentation method. After simulating those hard cases and adding the new intra-class compound figures to our previously synthesized data, training on simulated data outperformed training on real data, owing to the large quantity and variety of simulated cases.

Recent advances in computer vision are due, to a large extent, to the growing size of annotated training data. However, one key limitation for the SimCFS network is that the ImageCLEF Medical dataset, the largest available dataset for compound figure separation, has only 7,000 images for training, far fewer than most modern object detection datasets. An important goal for this community could be to build a much larger dataset with multi-class annotations such as MRI, pathology, and charts. In this study, we assessed a promising application of SimCFS: creating large-scale unlabeled image sets for downstream contrastive learning. Using NIH OpenI, tens of thousands of freely available biomedical images can be obtained by searching for the desired tissue types. The self-supervised learning strategy achieved better accuracy than the fully supervised approach with ImageNet initialization.

Several potential improvements for our SimCFS framework are as follows. First, we could further introduce image synthesis approaches to the proposed pipeline to obtain more unique images. Furthermore, we could extract textual content from captions, notes, and labels while separating figures. Such multi-form data could benefit further data mining research.

7 Conclusion

In this paper, we introduced the SimCFS framework to extract images of interest from large-scale compound figures with weak classification labels. The pseudo training data were built using the proposed SimCFS-AUG simulator. The anchor-based SimCFS-DET detector achieved state-of-the-art performance by introducing a simple Side loss. Additionally, our SimCFS framework provided cost-efficient, large-scale unannotated images to train un-/self-supervised representation learning methods (e.g., SimSiam), which achieved better performance than their ImageNet supervised pre-trained counterparts in classification tasks.

\acks

This work was supported in part by NIH NIDDK DK56942 (ABF) and NSF CAREER 1452485 (Landman).

\ethics

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.

\coi

We declare that we have no conflicts of interest.