A Fair Experimental Comparison of Neural Network Architectures for Latent Representations of Multi-Omics for Drug Response Prediction

Tony Hauptmann Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany Stefan Kramer Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany

September 1, 2022

Abstract

Background: Recent years have seen a surge of novel neural network architectures for the integration of multi-omics data for prediction. Most of the architectures include either encoders alone or encoders and decoders, i.e., autoencoders of various sorts, to transform multi-omics data into latent representations. One important parameter is the depth of integration: the point at which the latent representations are computed or merged, which can be either early, intermediate, or late. The literature on integration methods is growing steadily; however, close to nothing is known about the relative performance of these methods under fair experimental conditions and across different use cases.

Results: We developed a comparison framework that trains and optimizes multi-omics integration methods under equal conditions. We incorporated early integration and four recently published deep learning methods: MOLI, Super.FELT, OmiEmbed, and MOMA. Further, we devised a novel method, Omics Stacking, that combines the advantages of intermediate and late integration. Experiments were conducted on a public drug response data set with multiple omics data (somatic point mutations, somatic copy number profiles and gene expression profiles) that was obtained from cell lines, patient-derived xenografts, and patient samples. Our experiments confirmed that early integration has the lowest predictive performance. Overall, architectures that integrate triplet loss achieved the best results. Statistically significant differences can rarely be observed overall; however, in terms of the average ranks of methods, Super.FELT consistently performs best in the cross-validation setting and Omics Stacking performs best in the external test set setting.

Conclusions: We recommend that researchers follow fair comparison protocols, as suggested in the paper. When faced with a new data set, Super.FELT is a good option in the cross-validation setting, as is Omics Stacking in the external test set setting. Statistical significance is hardly observable, despite trends in the algorithms’ rankings. Future work on refined methods for transfer learning tailored for this domain may improve the situation for external test sets. The source code of all experiments is available at https://github.com/kramerlab/Multi-Omics_analysis

Keywords: machine learning, deep learning, multi-omics integration, neural network, latent representation, drug response prediction, autoencoder

1 Background

Data analysis in the life sciences often involves the integration of data from multiple modalities or views. Integration is necessary to obtain models with improved predictive performance or explanatory power. One currently popular approach to integrating multiple views is to take advantage of latent representations as computed by neural network architectures. Views are frequently defined as (potentially very large) groups of variables that originate from one measurement technology. In bioinformatics and computational biology, views often originate from different omics platforms, e.g., from genomics, transcriptomics, proteomics, and so forth.

The neural network architectures include either just encoders or encoders and decoders, as in autoencoder-type architectures. A myriad of architectures is possible for the integration of multi-omics data: encoders only, encoders-decoders (autoencoders of various sorts), integration either early (already for the computation of a joint latent representation), intermediate (concatenating latent representations following their computation), or late (combining the results of individual latent representations) \autociteintegrationSchemes, and using different loss functions. As the literature on the topic is growing, one would expect a more recent publication to improve upon a previous publication in terms of performance, or, at least, to identify a specific use case with superior performance. However, as it turns out, various approaches have been optimized and tested with different sets of hyperparameters (some of them fixed, some of them optimized), with different values for hyperparameters, and with different test protocols. Further, performance can be very different for cross-validation (assuming a similar distribution of data for each test set) and so-called external test sets. Thus, it is at this point far from clear which method performs best overall, and, specifically, which method is to be preferred in which setting, e.g., in a cross-validation-like setting or with an external test set from an unknown distribution.

In this paper, we level the playing field, establish a uniform, basic protocol for various existing methods, and test the methods with a state-of-the-art statistical method for the comparison of machine learning algorithms \autocitecritical_difference. As a side product, we derive a method that combines intermediate and late integration and fares well in settings where the predictive performance on external test sets is favored.

We study the prediction performance of the various algorithms on data sets for drug response prediction, which have been used widely in the literature in the past few years \autociteHossein:2019. The ultimate goal of such studies is to detect drug response biomarkers, which would help to develop personalized treatments of patients and improve clinical outcomes \autociteGeeleher2014. Drug screening studies on large patient cohorts are rarely performed, because it is ethically not feasible to change the chemotherapeutic regime and to cause a suboptimal therapy \autociteGeeleher2017, Geeleher2014. On the other hand, large-scale drug-screening efforts using human cancer cell line models have begun to establish a collection of gene–drug associations and have uncovered potential molecular markers predictive of therapeutic responses \autociteprecision_drug_response. A critical challenge that remains is the clinical utility of the results, i.e., the translatability from in vitro to in vivo \autociteHossein:2019.

In summary, the contributions of this paper are as follows:

a fair comparison of recent deep multi-omics integration algorithms for drug response prediction,
a new combined intermediate and late architecture for multi-omics integration, and
a thorough validation study of the methods’ predictions on external test sets.

The remainder of the paper is organized as follows: First, we give an overview of the data sets used and the design of the fair comparison framework, followed by details of the included integration architectures. Next, we interpret the methods’ results on the cross-validation test sets and the external test sets and test the significance of the observed differences. A discussion and the conclusions follow.

2 Methods

2.1 Drug Response Data

For our experiments, we used a publicly available drug response data set containing responses for six drugs: Docetaxel, Cisplatin, Gemcitabine, Paclitaxel, Erlotinib, and Cetuximab \autociteHossein:2020:dataset. The data set was chosen because it includes the patient-derived xenograft (PDX) or patient data necessary to test the methods’ translatability \autociteHossein:2019. Two external test sets are available for Gemcitabine.

The data set contains the drug response as target and data for three omics: somatic point mutations, somatic copy number profiles and gene expression profiles. Gene expressions are standardized and pairwise homogenized. Gene-level copy number estimates are binarized, representing copy-neutral genes as zeroes and genes with overlapping deletions or amplifications as ones. Somatic point mutations are also in a binary format: one for mutated genes and zero for genes without a mutation \autociteHossein:2019. For a fair comparison, the same data processing and training procedures were used for all methods. We used a variance threshold to filter genes with a small variance, as they hardly provide additional information. We set the variance thresholds in the same way as Park et al. \autocitesuperfelt.
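The variance-based gene filter described above can be illustrated with a minimal, framework-free sketch. The threshold value and the toy matrix below are made up for illustration; the thresholds actually used follow Park et al. and are not listed in this section.

```python
from statistics import pvariance


def variance_filter(rows, threshold):
    """Return the indices of columns whose population variance exceeds threshold."""
    columns = list(zip(*rows))
    return [j for j, column in enumerate(columns) if pvariance(column) > threshold]


def select_columns(rows, keep):
    """Keep only the selected column indices in every row."""
    return [[row[j] for j in keep] for row in rows]


# Toy gene-expression matrix: 4 samples x 3 genes (hypothetical values).
expression = [
    [0.0, 1.0, 5.0],
    [0.1, 1.0, -5.0],
    [0.0, 1.0, 4.0],
    [0.1, 1.0, -4.0],
]

# Hypothetical threshold; genes 0 and 1 barely vary and are dropped.
kept = variance_filter(expression, threshold=0.01)
filtered = select_columns(expression, kept)
```

In this toy example only the third gene survives the filter, which reduces the input dimension before any encoder is trained.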

Sharifi-Noghabi et al. acquired the data from PDX mouse models \autocitepdx, The Cancer Genome Atlas (TCGA) \autociteWeinstein:2013 and Genomics of Drug Sensitivity in Cancer (GDSC) \autocitegdsc. GDSC consists of cell line data and was used for training, validation and testing because of its high number of samples. The trained neural networks were additionally tested on either PDX or TCGA to validate the algorithms’ translatability.

The characteristics of the data set per drug are summarized in Table 1.

Drug	Resource	Number of Samples	Usage
Cetuximab	GDSC	856 (NR:735, R:121)	Train & test
Cisplatin	GDSC	829 (NR:752, R:77)	Train & test
Docetaxel	GDSC	829 (NR:764, R:65)	Train & test
Erlotinib	GDSC	362 (NR:298, R:64)	Train & test
Gemcitabine	GDSC	844 (NR:790, R:54)	Train & test
Paclitaxel	GDSC	389 (NR:363, R:26)	Train & test
Cetuximab	PDX	60 (NR:55, R:5)	External test
Cisplatin	TCGA	66 (NR:6, R:60)	External test
Docetaxel	TCGA	16 (NR:8, R:8)	External test
Erlotinib	PDX	21 (NR:18, R:3)	External test
Gemcitabine	PDX	25 (NR:18, R:7)	External test
Gemcitabine	TCGA	57 (NR:36, R:21)	External test
Paclitaxel	PDX	43 (NR:38, R:5)	External test
NR=Non-Responder
R=Responder

Table 1: Characteristics of the drug response multi-omics data set.

2.2 Comparison Framework

To compare the algorithms fairly, different precautions were taken: First, the same preprocessing was performed in all experiments to provide the same input data, and second, the hyperparameters of the algorithms were optimized with an equal number of iterations by random search from a fixed grid. All algorithms draw parameters from the same grids (Table 2).

Parameter	Values
Batch size	$\{8, 16, 32\}$
Dropout rate	$\{0.1, 0.3, 0.5, 0.7\}$
Epochs	$\{2, \dots, 20\}$
Gamma	$\{0.0, 0.1, 0.3, 0.5\}$
Layer Dimension	$\{32, 64, 128, 256, 512, 1024\}$
Learning rate	$\{0.001, 0.01\}$
Margin	$\{0.2, 0.5, 1\}$
Weight Decay	$\{0.0001, 0.001, 0.01, 0.05, 0.1\}$

Table 2: The hyperparameter grid used in the hyperparameter optimization.
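As an illustration of random search over the grid in Table 2, the following sketch draws a fixed number of hyperparameter sets. The dictionary keys, the function name, and the seed are hypothetical; reading the epoch range as the integers 2 to 20 inclusive and sampling with replacement are assumptions.

```python
import random

# Grid values copied from Table 2; the key names are illustrative.
GRID = {
    "batch_size": [8, 16, 32],
    "dropout_rate": [0.1, 0.3, 0.5, 0.7],
    "epochs": list(range(2, 21)),  # assumed to mean 2, 3, ..., 20
    "gamma": [0.0, 0.1, 0.3, 0.5],
    "layer_dimension": [32, 64, 128, 256, 512, 1024],
    "learning_rate": [0.001, 0.01],
    "margin": [0.2, 0.5, 1],
    "weight_decay": [0.0001, 0.001, 0.01, 0.05, 0.1],
}


def sample_hyperparameter_sets(grid, n_sets, seed):
    """Draw n_sets random configurations, one value per parameter (with replacement)."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in grid.items()}
        for _ in range(n_sets)
    ]


# 200 candidate sets, as used per outer cross-validation iteration.
candidates = sample_hyperparameter_sets(GRID, n_sets=200, seed=0)
```

Methods that do not use a parameter (for example, margin and gamma for architectures without triplet loss) would simply ignore the corresponding entries.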

A nested $5 \times 5$ stratified cross-validation (five outer and five inner folds) was performed to reduce the dependence on the data splitting. For each iteration of the outer cross-validation, 200 hyperparameter sets were created, and each set was used for training in the inner cross-validation. The mean area under the receiver operating characteristic curve (AUROC) was used as the performance measure. The algorithm was then retrained with the best hyperparameter set on the combined train and validation sets. The trained network was used to compute the final results on the test sets from cross-validation and on the external test set.
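The splitting scheme can be sketched as follows. This is a simplified illustration, not the framework's code: the helper `stratified_folds` and the seeds are hypothetical, and the class counts are those of the Cetuximab GDSC data in Table 1.

```python
import random


def stratified_folds(labels, n_folds, seed):
    """Assign sample indices to n_folds folds with similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Cetuximab (GDSC): 735 non-responders (0) and 121 responders (1).
labels = [0] * 735 + [1] * 121
outer = stratified_folds(labels, n_folds=5, seed=0)

nested_plan = []
for k, test_idx in enumerate(outer):
    # Everything outside the outer test fold is available for train/validation.
    train_val_idx = [i for j, fold in enumerate(outer) if j != k for i in fold]
    inner_labels = [labels[i] for i in train_val_idx]
    inner = stratified_folds(inner_labels, n_folds=5, seed=k + 1)
    # Map inner positions back to original sample indices.
    inner_folds = [[train_val_idx[p] for p in fold] for fold in inner]
    nested_plan.append((test_idx, inner_folds))
```

Under this reading, each outer iteration evaluates 200 hyperparameter sets on five inner folds, i.e., up to 1,000 inner training runs before the single retraining on the combined train and validation data.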

The data sets are imbalanced, as they contain only a few responders. In the drug discovery process, researchers are mainly interested in the positive samples to find an effective drug. To account for this requirement, the area under the precision-recall curve (AUPRC) \autociteauprc was additionally computed.
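To make the two measures concrete, the sketch below computes AUROC as a pairwise ranking probability and AUPRC as average precision on a toy example with 20% positives. Average precision is one common AUPRC estimator; which estimator was used in the experiments is not stated here, so this choice is an assumption. Labels and scores are made up.

```python
def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Tied scores count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_positive = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / n_positive


# Toy predictions: 2 responders among 10 samples (hypothetical values).
labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.1, 0.4, 0.9, 0.2, 0.8, 0.3, 0.5, 0.05, 0.15, 0.25]

roc = auroc(labels, scores)             # 15 of 16 positive-negative pairs ordered correctly
prc = average_precision(labels, scores)  # (1/1 + 2/3) / 2
```

A single highly scored non-responder lowers the AUROC only slightly (15/16) but costs the average precision considerably more (5/6), which illustrates why AUPRC is informative when responders are rare.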

A further requirement is that the methods work on patient data sets, which are normally too small for transfer learning, and still detect positive samples. To test this, the final model of each outer cross-validation iteration was applied to an external patient or PDX data set.

The comparison of multiple algorithms on multiple data sets can lead to ambiguous results if none of them performs significantly better. We therefore compute the mean rank of each method as a single value for a better comparison. Additionally, we compute the critical difference (CD) \autocitecritical_difference with the Nemenyi significance test at $α = 0.05$.
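For $k$ methods compared on $N$ data sets, the critical difference of the Nemenyi test is commonly computed as $CD = q_{α} \sqrt{k (k + 1) / (6 N)}$; two methods differ significantly if their mean ranks differ by more than $CD$. The sketch below applies this to the mean AUROC ranks reported in Table 4 ($k = 6$, $N = 7$). The critical value $q_{0.05} = 2.850$ for six methods is taken from the standard table for this test and is not stated in the text, so treat it as an assumption.

```python
import math
from itertools import combinations

# Assumed critical value for alpha = 0.05 and k = 6 methods (Nemenyi test table).
Q_ALPHA_005_K6 = 2.850


def critical_difference(q_alpha, n_methods, n_datasets):
    """CD = q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))


# Mean AUROC ranks on the cross-validation test sets, copied from Table 4.
mean_ranks = {
    "Omics Stacking": 2.86,
    "MOLI": 3.86,
    "Super.FELT": 2.29,
    "Early Integration": 5.29,
    "OmiEmbed": 3.71,
    "MOMA": 3.00,
}

cd = critical_difference(Q_ALPHA_005_K6, n_methods=6, n_datasets=7)

# Pairs of methods whose mean ranks differ by more than the critical difference.
significant = [
    (a, b)
    for a, b in combinations(mean_ranks, 2)
    if abs(mean_ranks[a] - mean_ranks[b]) > cd
]
```

With six methods and seven data sets the square root evaluates to one, so the critical difference equals the critical value itself; only the gap between Super.FELT and Early Integration exceeds it for these ranks.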

2.3 Multi-Omics Integration Architectures

Most deep learning algorithms for multi-omics integration are based upon the concept of encoding the feature space into a lower-dimensional latent space. The encoded representation of features in the latent space is commonly called latent features. The subnetwork that computes the latent features is called encoder and computes a non-linear dimensionality reduction. The smaller dimension of the latent representation assures that the encoder does not simply learn the identity function. An autoencoder transforms an input first into the latent representation and then decodes it back into the input dimension. The reconstructed sample should resemble the input as far as possible. The difference between input and reconstruction is measured and used as loss function \autociteautoencoder.

Next, we explain briefly the six multi-omics integration architectures that were used. The first one is Early integration (EI), which serves as baseline in our experiments. EI concatenates the omics before they serve as single input of a neural network, which consists of an encoding subnetwork and a classifier layer. The network is trained by minimizing the binary cross-entropy loss function, given that our target is to classify subjects into responders and non-responders.

The schematic architecture for three omics is visualized in Figure 1.

Figure 1: Schematic architecture of Early Integration with three input omics.

The next method is Multi-Omics Late Integration (MOLI) developed by Sharifi-Noghabi et al. \autociteHossein:2019, which is, notwithstanding its name, an intermediate integration method. MOLI uses an individual encoder for each omics to compute the latent representations, which are concatenated and used as input for the classifier (Figure 2).

Figure 2: Schematic architecture of MOLI with three input omics.

Due to the small sample size, the neural network is prone to overfitting, so MOLI uses the triplet loss \autocitetriplet_loss for regularization. The rationale behind triplet loss is that instances of the same class should have a shorter distance between them than instances of different classes. To calculate the loss value, the triplet loss uses the embeddings $f (x) \in \mathbb{R}^{d}$ of three samples. The first one is the anchor sample $x_{i}^{a}$, the second, $x_{i}^{p}$, belongs to the same class as the anchor, and the third, $x_{i}^{n}$, belongs to the opposite class.

With them, the following loss function is minimized:

$$L_{Triplet} = \sum_{i}^{N} \left[ \| f (x_{i}^{a}) - f (x_{i}^{p}) \|_{2}^{2} - \| f (x_{i}^{a}) - f (x_{i}^{n}) \|_{2}^{2} + α \right]_{+}, \tag{1}$$

where $α$ is a margin that defines the minimum distance between pairs of different classes. In MOLI the concatenated latent representations are used as embeddings. The final loss function is the sum of classification loss and triplet loss:

$$L_{MOLI} = L_{Classification} + γ L_{Triplet}, \tag{2}$$

where the influence of the triplet loss is weighted by the hyperparameter $γ$ and $L_{Classification}$ is the binary cross-entropy.

Supervised Feature Extraction Learning using Triplet Loss (Super.FELT) is a variation of MOLI, in which encoding and classification are not performed jointly, but in two different phases \autocitesuperfelt. The first phase trains a supervised encoder with triplet loss for each omics to extract latent features from the high-dimensional omics data and to avoid overfitting. The latent features are then concatenated and used to train a classifier. Figure 3 (a) shows the encoding phase and (b) the classification phase of Super.FELT.
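The triplet loss of Equation (1) and the combined MOLI loss of Equation (2) can be illustrated with a small, framework-free sketch. The function names and toy values are hypothetical, and the real implementations operate on batches of tensors rather than Python lists.

```python
import math


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(triplets, margin):
    """Equation (1): hinged distance differences summed over all triplets."""
    return sum(
        max(squared_distance(a, p) - squared_distance(a, n) + margin, 0.0)
        for a, p, n in triplets
    )


def binary_cross_entropy(labels, probabilities):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    return -sum(
        y * math.log(q) + (1 - y) * math.log(1 - q)
        for y, q in zip(labels, probabilities)
    ) / len(labels)


def moli_loss(labels, probabilities, triplets, gamma, margin):
    """Equation (2): classification loss plus gamma-weighted triplet loss."""
    return binary_cross_entropy(labels, probabilities) + gamma * triplet_loss(triplets, margin)


# Toy (anchor, positive, negative) embeddings in a 2-dimensional latent space.
triplets = [
    ([0.0, 0.0], [0.1, 0.0], [1.0, 1.0]),  # negative is far away: no penalty
    ([0.0, 0.0], [1.0, 0.0], [0.0, 1.0]),  # equal distances: penalty equals the margin
]
regularizer = triplet_loss(triplets, margin=0.5)
loss = moli_loss([1, 0], [0.8, 0.3], triplets, gamma=0.3, margin=0.5)
```

Only the second triplet contributes, because its negative sample is not farther from the anchor than the positive sample by at least the margin.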

Figure 3: Schematic architecture of Super.FELT with three input omics.

Super.FELT added a variance threshold in the feature selection to remove features with a low variance \autocitesuperfelt. It is based on the assumption that genes with a low variance might contain less important information and can be safely removed to reduce the dimensionality, hence increasing the focus on the more variable genes.

Additionally, we developed a novel extension of MOLI, which we call Omics Stacking. It is inspired by stacking and combines intermediate and late integration. The method uses a meta-learner to stack the outputs of several neural network classifier layers, each of which takes different latent features as input. The meta-learner can be any classifier, but we opted for a fully connected layer so that the neural network can be trained end-to-end.

The omics are transformed to the latent space with individual encoders, but instead of only classifying the concatenated features, Omics Stacking trains a separate classifier for each omics and the concatenated omics (Figure 4). The triplet loss is still used on the concatenated embeddings to regularize. The outputs of the classifiers are combined by a meta-learner that enables the weighting of different results to emphasize the most accurate one.

Figure 4: Schematic architecture of Omics Stacking with three input omics.

Omics Stacking combines the advantages of intermediate and late integration. It models the interaction between omics and retains the weak signals of individual omics. This method relies not only on the combined omics, but also on the individual omics and so fosters generalization. We performed an ablation study to validate different versions of Omics Stacking in Appendix B.

The preceding methods were developed and tested on drug response prediction, but can in fact be used for other tasks as well. Next, we present two methods that were validated for the prediction of a tumor’s type or stage.

The first of these methods, Multi-task Attention Learning Algorithm for Multi-omics Data (MOMA) \autocitemoma, uses a geometrical representation and an attention mechanism \autociteattention. It is composed of three components: First, it builds a module for each omics using a module encoder. Each omics has its own module encoder consisting of two fully connected layers that convert omics features to modules. A module is represented by a normalized two-dimensional vector.

Second, it focuses on important modules between omics using a module attention mechanism. This mechanism is designed to act as a mediator to identify modules with high similarity among multiple omics. The relevance between modules is measured by the cosine similarity and is converted to a probability distribution with the softmax function. The distributions are then used to create an attention matrix that stores the relationship information between modules of different omics. To highlight important modules, the module vectors are multiplied by the attention matrix (Figure 5).

Subsequently, fully connected layers and the logistic function are applied to flatten the multidimensional vectors and to compute the final probabilities for each omics \autocitemoma.

Figure 5: Schematic architecture of the network used in MOMA for two input omics.
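The module attention step can be illustrated with the following simplified sketch. It only shows the general pattern described above (cosine similarity between modules of two omics, softmax normalization, and reweighting of module vectors); the exact way MOMA multiplies module vectors with the attention matrix may differ, and all names and values are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two module vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def softmax(values):
    """Convert similarities into a probability distribution."""
    peak = max(values)
    exps = [math.exp(x - peak) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_matrix(modules_a, modules_b):
    """Row i holds the attention of module i of omics A over the modules of omics B."""
    return [softmax([cosine(a, b) for b in modules_b]) for a in modules_a]


def attend(attention, modules_b):
    """Assumed combination step: attention-weighted sum of the modules of omics B."""
    dim = len(modules_b[0])
    return [
        [sum(w * b[d] for w, b in zip(row, modules_b)) for d in range(dim)]
        for row in attention
    ]


# Two-dimensional toy modules for two omics (hypothetical values).
modules_a = [[1.0, 0.0], [0.0, 1.0]]
modules_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

attention = attention_matrix(modules_a, modules_b)
attended = attend(attention, modules_b)
```

Each row of the attention matrix sums to one and puts the most weight on the module of the other omics that points in the most similar direction.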

MOMA is trained with cross-entropy loss between the true label and the omics specific outputs. After the training of the neural network, a logistic regression is fit on the omics specific outputs to generate the combined prediction.

The last architecture we tested is OmiEmbed \autociteomiEmbed, which is based on a supervised variational autoencoder (VAE) \autocitevae. OmiEmbed was developed as a unified end-to-end multi-view multitask deep learning framework for high-dimensional multi-omics data. The method learns the latent features with the auxiliary unsupervised task of reconstructing the input omics. The trained latent features can be used for one or more supervised tasks. The overall architecture of OmiEmbed comprises a deep embedding module and one or multiple downstream task modules (Figure 6).

Figure 6: Schematic architecture of OmiEmbed for three input omics.

The loss function of OmiEmbed is the sum of two parts: a reconstruction loss and the loss of the downstream task. $L_{embed}$ is the unsupervised loss function of the VAE:

$$L_{embed} = \frac{1}{M} \sum_{i = 1}^{M} BCE (x_{i}, x_{i}^{'}) + D_{KL} (\mathcal{N} (μ, σ) \,\|\, \mathcal{N} (0, 1)). \tag{3}$$

$BCE$ is the binary cross-entropy, which measures the difference between the input $x$ and the reconstruction $x^{'}$ and is computed individually for each of the $M$ omics. $D_{KL}$ is the Kullback–Leibler divergence between the learned distribution and a standard normal distribution.

The embedding loss function is used together with the loss of the downstream task, which is in our case classification:

$$L_{total} = λ L_{embed} + L_{CE}, \tag{4}$$

where $L_{C E}$ is the cross-entropy loss and $λ$ a balancing weight.
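Equations (3) and (4) can be sketched in a few lines. The sketch assumes inputs scaled to the interval [0, 1], a diagonal Gaussian latent distribution with $σ$ read as a standard deviation, and the usual closed form of the Kullback–Leibler divergence to a standard normal; function names and values are hypothetical.

```python
import math


def bce(x, x_reconstructed):
    """Mean binary cross-entropy between an input and its reconstruction."""
    return -sum(
        a * math.log(b) + (1 - a) * math.log(1 - b)
        for a, b in zip(x, x_reconstructed)
    ) / len(x)


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL divergence of N(mu, diag(sigma^2)) from N(0, I)."""
    return 0.5 * sum(s * s + m * m - 1.0 - 2.0 * math.log(s) for m, s in zip(mu, sigma))


def embed_loss(inputs, reconstructions, mu, sigma):
    """Equation (3): reconstruction loss averaged over the M omics plus the KL term."""
    n_omics = len(inputs)
    reconstruction = sum(bce(x, r) for x, r in zip(inputs, reconstructions)) / n_omics
    return reconstruction + kl_to_standard_normal(mu, sigma)


def total_loss(l_embed, l_ce, lam):
    """Equation (4): weighted embedding loss plus downstream cross-entropy."""
    return lam * l_embed + l_ce


# Toy sample with M = 2 omics of different dimensionality (hypothetical values).
inputs = [[1.0, 0.0], [0.0, 1.0, 1.0]]
reconstructions = [[0.9, 0.2], [0.1, 0.8, 0.7]]
l_embed = embed_loss(inputs, reconstructions, mu=[0.5, -0.5], sigma=[1.0, 0.8])
l_total = total_loss(l_embed, l_ce=0.4, lam=0.5)
```

The KL term vanishes when the learned distribution already equals the standard normal, so in that case only the reconstruction error contributes to $L_{embed}$.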

Instead of training all layers at the same time, the network learns in three phases: First, only the VAE is trained. In the second phase, the VAE weights are fixed and only the downstream network is trained. In the last phase, the complete network is fine-tuned.

Table 3 summarizes the components and architectures of the described methods.

Architecture	Training	Triplet Loss	Integration Type	Encoding
Early Integration	End-to-End	-	Early	Supervised Encoder
MOLI	End-to-End	+	Intermediate	Supervised Encoder
Super.FELT	Encoding & Classifying	+	Intermediate	Supervised Encoder
Omics Stacking	End-to-End	+	Intermediate + Late	Supervised Encoder
MOMA	End-to-End	-	Intermediate + Late	Vector Encoding
OmiEmbed	Three phases	-	Intermediate	Variational Supervised Autoencoder

Table 3: Characteristics of the multi-omics integration methods.

3 Results

At first, we analyzed the results on the five cross-validation test sets (Tables 4 and 5). Super.FELT and Omics Stacking each achieved the highest AUROC for two data sets; Super.FELT was additionally second best for four data sets and Omics Stacking for one. MOLI, OmiEmbed, and MOMA each achieved the highest AUROC for one data set, and EI for none. EI performs worse (mean rank = 5.29) than the intermediate and late integration methods. No method clearly outperforms the others, but Super.FELT and Omics Stacking performed slightly better, according to their mean ranks of 2.29 and 2.86, respectively.

The AUPRC results are similar, but the three algorithms that use triplet loss achieved the best result for six out of seven data sets. This shows the regularization benefit of triplet loss for discriminating responders from non-responders. Again, EI was the worst performing algorithm.

All architectures have a high standard deviation for AUROC and AUPRC, which underlines the importance of stratified cross-validation to alleviate the influence of data splitting.

Drug	Omics Stacking	MOLI	Super.FELT	Early Integration	OmiEmbed	MOMA
Gemcitabine TCGA	$0.646 \pm 0.045$	$0.628 \pm 0.118$	$0.588 \pm 0.070$	$0.611 \pm 0.029$	$0.628 \pm 0.057$	$0.650 \pm 0.029$
Gemcitabine PDX	$0.651 \pm 0.071$	$0.622 \pm 0.098$	$0.646 \pm 0.063$	$0.586 \pm 0.092$	$0.539 \pm 0.078$	$0.625 \pm 0.034$
Cisplatin	$0.722 \pm 0.066$	$0.764 \pm 0.039$	$0.753 \pm 0.047$	$0.660 \pm 0.105$	$0.640 \pm 0.071$	$0.714 \pm 0.075$
Docetaxel	$0.772 \pm 0.077$	$0.792 \pm 0.097$	$0.813 \pm 0.051$	$0.731 \pm 0.080$	$0.803 \pm 0.035$	$0.783 \pm 0.060$
Erlotinib	$0.754 \pm 0.114$	$0.705 \pm 0.062$	$0.744 \pm 0.125$	$0.671 \pm 0.052$	$0.664 \pm 0.130$	$0.739 \pm 0.105$
Cetuximab	$0.731 \pm 0.090$	$0.731 \pm 0.033$	$0.768 \pm 0.045$	$0.677 \pm 0.075$	$0.754 \pm 0.044$	$0.751 \pm 0.043$
Paclitaxel	$0.667 \pm 0.138$	$0.596 \pm 0.117$	$0.726 \pm 0.121$	$0.607 \pm 0.060$	$0.740 \pm 0.098$	$0.692 \pm 0.081$
Mean Rank	2.86	3.86	2.29	5.29	3.71	3.00

Table 4: Mean AUROC on the test sets from cross-validation. Best results are shown in bold and second best are underlined. The values represent the means and standard deviations over five iterations.

Drug	Omics Stacking	MOLI	Super.FELT	Early Integration	OmiEmbed	MOMA
Gemcitabine TCGA	$0.161 \pm 0.053$	$0.155 \pm 0.065$	$0.108 \pm 0.024$	$0.117 \pm 0.025$	$0.132 \pm 0.072$	$0.138 \pm 0.041$
Gemcitabine PDX	$0.151 \pm 0.085$	$0.154 \pm 0.066$	$0.130 \pm 0.050$	$0.138 \pm 0.046$	$0.082 \pm 0.024$	$0.111 \pm 0.027$
Cisplatin	$0.293 \pm 0.089$	$0.316 \pm 0.084$	$0.282 \pm 0.052$	$0.222 \pm 0.065$	$0.204 \pm 0.086$	$0.262 \pm 0.075$
Docetaxel	$0.316 \pm 0.083$	$0.345 \pm 0.093$	$0.373 \pm 0.027$	$0.312 \pm 0.117$	$0.251 \pm 0.071$	$0.281 \pm 0.090$
Erlotinib	$0.479 \pm 0.153$	$0.446 \pm 0.108$	$0.499 \pm 0.162$	$0.346 \pm 0.120$	$0.468 \pm 0.162$	$0.476 \pm 0.151$
Cetuximab	$0.376 \pm 0.104$	$0.357 \pm 0.075$	$0.400 \pm 0.095$	$0.290 \pm 0.057$	$0.329 \pm 0.091$	$0.347 \pm 0.070$
Paclitaxel	$0.220 \pm 0.108$	$0.163 \pm 0.087$	$0.245 \pm 0.116$	$0.160 \pm 0.045$	$0.270 \pm 0.079$	$0.213 \pm 0.074$
Mean Rank	2.14	2.71	2.57	5.0	4.57	4.00

Table 5: Mean AUPRC on the test sets from cross-validation. Best results are shown in bold and second best are underlined. The values represent the means and standard deviations over five iterations.

The visualization of the mean ranks and the critical differences (Figure 7) supports our analysis. EI is, for AUROC as well as AUPRC, significantly worse than the best performing method. The mean ranks for AUPRC contain a visible gap between methods that use triplet loss (Omics Stacking, Super.FELT and MOLI) and methods without (MOMA, OmiEmbed and EI), but without significance.

Figure 7: Mean rank and critical difference of the AUROC and AUPRC on the test sets from cross-validation. The mean values of the outer cross-validation results are compared. The Nemenyi test with $α = 0.05$ was used to compute significant differences.

Next, we analyzed the results on the external test sets (Tables 6 and 7). Here, the method learns the source distribution, but at the same time, must generalize enough to predict drug response on an unknown distribution.

Again, no single method performed best on all data sets, but Omics Stacking achieved the lowest mean rank for both AUROC and AUPRC (see Figure 8 for the critical differences) and the best result on four of the seven data sets for each metric. The results support the benefit of classifying both individual and integrated features for translatability. EI achieved the best AUROC and second best AUPRC for Erlotinib, but classified worse in general. Surprisingly, OmiEmbed performed worst on the external data. One possible explanation is that the regularization of the VAE limits its ability to perform similarly on a shifted distribution without retraining.

Differences occurred between AUROC and AUPRC: Super.FELT had the second best mean rank for AUROC, but was only fourth for AUPRC. No method was significantly better regarding the AUPRC; however, OmiEmbed and EI had similarly poor mean ranks (4.86 and 4.71). Omics Stacking and MOLI performed similarly on the AUPRC (mean ranks of 2.29 and 2.43), but Omics Stacking had a better mean rank for AUROC (1.86 versus 3.00).

Drug	Omics Stacking	MOLI	Super.FELT	Early Integration	OmiEmbed	MOMA
Gemcitabine TCGA	$0.655 \pm 0.029$	$0.640 \pm 0.037$	$0.618 \pm 0.042$	$0.604 \pm 0.093$	$0.565 \pm 0.059$	$0.473 \pm 0.039$
Gemcitabine PDX	$0.714 \pm 0.089$	$0.614 \pm 0.044$	$0.692 \pm 0.054$	$0.525 \pm 0.099$	$0.657 \pm 0.119$	$0.627 \pm 0.092$
Cisplatin	$0.644 \pm 0.087$	$0.674 \pm 0.032$	$0.728 \pm 0.045$	$0.604 \pm 0.052$	$0.513 \pm 0.056$	$0.687 \pm 0.019$
Docetaxel	$0.584 \pm 0.101$	$0.647 \pm 0.038$	$0.588 \pm 0.056$	$0.456 \pm 0.065$	$0.478 \pm 0.056$	$0.581 \pm 0.064$
Erlotinib	$0.744 \pm 0.065$	$0.722 \pm 0.127$	$0.563 \pm 0.080$	$0.789 \pm 0.079$	$0.633 \pm 0.105$	$0.715 \pm 0.132$
Cetuximab	$0.575 \pm 0.049$	$0.476 \pm 0.111$	$0.556 \pm 0.099$	$0.470 \pm 0.130$	$0.468 \pm 0.075$	$0.505 \pm 0.019$
Paclitaxel	$0.619 \pm 0.152$	$0.547 \pm 0.121$	$0.527 \pm 0.114$	$0.418 \pm 0.062$	$0.516 \pm 0.026$	$0.573 \pm 0.124$
Mean Rank	1.86	3.00	2.86	4.71	5.00	3.57

Table 6: Mean AUROC on external test set. Best results are shown in bold and second best are underlined. The values represent the means and standard deviations over five iterations.

Drug	Omics Stacking	MOLI	Super.FELT	Early Integration	OmiEmbed	MOMA
Gemcitabine TCGA	$0.581 \pm 0.074$	$0.535 \pm 0.085$	$0.502 \pm 0.082$	$0.513 \pm 0.086$	$0.462 \pm 0.095$	$0.389 \pm 0.040$
Gemcitabine PDX	$0.510 \pm 0.132$	$0.424 \pm 0.038$	$0.457 \pm 0.055$	$0.362 \pm 0.108$	$0.466 \pm 0.100$	$0.414 \pm 0.068$
Cisplatin	$0.942 \pm 0.027$	$0.950 \pm 0.007$	$0.963 \pm 0.009$	$0.932 \pm 0.004$	$0.908 \pm 0.009$	$0.952 \pm 0.006$
Docetaxel	$0.560 \pm 0.051$	$0.590 \pm 0.021$	$0.565 \pm 0.024$	$0.491 \pm 0.034$	$0.544 \pm 0.049$	$0.578 \pm 0.060$
Erlotinib	$0.440 \pm 0.075$	$0.410 \pm 0.158$	$0.223 \pm 0.024$	$0.428 \pm 0.106$	$0.294 \pm 0.079$	$0.369 \pm 0.146$
Cetuximab	$0.125 \pm 0.018$	$0.141 \pm 0.086$	$0.126 \pm 0.039$	$0.108 \pm 0.028$	$0.101 \pm 0.018$	$0.148 \pm 0.065$
Paclitaxel	$0.256 \pm 0.147$	$0.191 \pm 0.077$	$0.147 \pm 0.028$	$0.120 \pm 0.016$	$0.135 \pm 0.007$	$0.172 \pm 0.047$
Mean Rank	2.29	2.43	3.43	4.71	4.86	3.29

Table 7: Mean AUPRC on external test set. Best results are shown in bold and second best are underlined. The values represent the means and standard deviations over five iterations.

Figure 8: Mean Rank and critical difference of the AUROC and AUPRC on the external test set. The mean values of the outer cross-validation results are compared. The Nemenyi test with $α = 0.05$ was used to compute significant differences.

4 Discussion

The emerging interest in multi-omics integration with neural networks produces an increasing number of different architectures. In this work, we compared a subset of recently published methods as fairly as possible to validate their predictive capabilities. Our experiments focused on drug response prediction, a high-dimensional problem, with the added complexity of having few samples.

Our work includes only data sets from cancer research, and there are two reasons for that decision. The first one is the availability of multi-omics data, as cancer samples are more common compared to other diseases. Data sets for other diseases with enough samples and a high enough quality are still rare. The second reason is that the extensive hyperparameter optimization with an inner and outer cross-validation increases the hardware requirements. The sheer amount of different, recently published algorithms made it necessary to focus on just one area.

Current drug response methods are trained on cell lines and predict directly on patient samples without adaptation. Transfer learning is a method where a model is trained on source data sets and later transferred to a target data set \autocitetransfer_learning. One promising current direction is to train on in vitro samples and fine-tune the neural network for in vivo samples.

Previous studies \autociteHossein:2019 showed that EI is not suitable for multi-omics integration, because concatenating the features leads to a higher-dimensional sparse space. Our experiments corroborate this finding: EI had the worst mean rank in the cross-validation setting and the second worst in the external test set setting. This makes early integration unsuitable as the sole baseline; at least one intermediate integration method should be included as well. However, our experiments also showed that no method performs best for all drugs.

5 Conclusions

In this paper, we showed that none of the current multi-omics integration methods excels at drug response prediction; however, Early Integration performed significantly worse. Researchers should not rely on a single method, but rather consider more than one method for their task at hand. If the number of experiments is limited or translatability is wanted, we recommend using the newly introduced method, Omics Stacking, as it achieved good results on test and external data. When faced with a new data set in a cross-validation-like setting, Super.FELT is another good option. Our experiments also suggest that a fair experimental comparison is necessary to see the strengths and weaknesses of the various algorithms, which were not visible from the publications alone. We hope that this comparison has shed some light on the relative performance of multi-omics integration methods, has produced valuable insights for their application, and encourages further research.

Declarations

Acknowledgment

Funding

This work was funded by the German Federal Ministry for Education and Research as part of the DIASyM project under grant number [031L0217A].

Contribution

TH implemented the methods and experiments and wrote the manuscript.
SK supervised the study, developed the concept, and wrote and reviewed the manuscript.

Appendix A Implementation Details

The algorithms were implemented in PyTorch 1.11.0, and NumPy 1.22.4 was used to compute the AUROC and AUPRC. The critical differences were computed and visualized with Orange3 3.32.0. The triplets were generated online with an all-triplets scheme, which creates all possible triplets of a batch, as in the experiments of the MOLI and Super.FELT papers. An open source implementation of triplet loss and triplet creation was used (https://github.com/adambielski/siamese-triplet). Adagrad [adagrad] was used to update the neural network weights.
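To make the all-triplets scheme concrete, the following is a minimal sketch in plain Python rather than the PyTorch code used in the experiments. It assumes that every unordered same-class pair in a batch acts as (anchor, positive) and is paired with every sample of a different class. The function names, the squared Euclidean distance, and the margin of 1.0 are illustrative assumptions, not values taken from the paper.

```python
from itertools import combinations


def all_triplets(labels):
    """Return every (anchor, positive, negative) index triplet of a batch."""
    triplets = []
    for label in sorted(set(labels)):
        same = [i for i, y in enumerate(labels) if y == label]
        other = [i for i, y in enumerate(labels) if y != label]
        # Every unordered same-class pair serves as (anchor, positive) ...
        for anchor, positive in combinations(same, 2):
            # ... and is combined with every sample of a different class.
            for negative in other:
                triplets.append((anchor, positive, negative))
    return triplets


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss on squared Euclidean distances for one triplet.

    The margin value is a hypothetical default, not the paper's setting.
    """
    d_pos = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_neg = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(d_pos - d_neg + margin, 0.0)


# Toy batch: two responders (1) and three non-responders (0).
batch_labels = [1, 1, 0, 0, 0]
triplets = all_triplets(batch_labels)
print(len(triplets), "triplets, e.g.", triplets[0])
```

Because the number of triplets grows quickly with the batch size and class balance, the batch size indirectly controls how many triplets contribute to each update under such a scheme.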

Early Integration and Omics Stacking were implemented by the authors. For MOMA, we used the provided example source code (https://github.com/DMCB-GIST/MOMA), adapted it to three omics, made the number of modules a hyperparameter, and combined the individual probability outputs with a logistic regression. For OmiEmbed (https://github.com/zhangxiaoyu11/OmiEmbed), MOLI (https://github.com/hosseinshn/MOLI) and Super.FELT (https://github.com/DMCB-GIST/Super.FELT), the published source code was used.

The inner cross-validation for a hyperparameter configuration was stopped early if it was no longer possible to achieve a higher AUROC than the best result found so far. This allowed us to speed up the hyperparameter optimization.
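One way such a stopping rule could work is sketched below. The sketch assumes that a configuration is scored by its mean AUROC over the inner folds and that each remaining fold can contribute at most a perfect AUROC of 1.0. Both the function name and this bound are assumptions made for illustration; the paper does not spell out the exact criterion.

```python
def inner_cv_mean_auroc(fold_evaluators, best_mean_so_far):
    """Evaluate the inner folds one by one and stop when the best mean is out of reach.

    fold_evaluators: list of zero-argument callables, each returning the
        AUROC of one inner fold (hypothetical interface).
    best_mean_so_far: best mean AUROC of any earlier configuration.
    Returns (mean AUROC or None if stopped early, number of folds evaluated).
    """
    n_folds = len(fold_evaluators)
    total = 0.0
    for done, evaluate_fold in enumerate(fold_evaluators, start=1):
        total += evaluate_fold()
        # Optimistic bound: all remaining folds reach a perfect AUROC of 1.0.
        upper_bound = (total + (n_folds - done)) / n_folds
        if upper_bound <= best_mean_so_far:
            return None, done
    return total / n_folds, n_folds


# Toy example with fixed per-fold scores instead of real training runs.
scores = [0.5, 0.5, 0.9, 0.9, 0.9]
folds = [lambda s=s: s for s in scores]
print(inner_cv_mean_auroc(folds, best_mean_so_far=0.85))
```

In the toy example, the second fold already pushes the optimistic bound down to 0.8, so the remaining three folds never need to be trained.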

Appendix B Ablation Study

In the ablation study, the impact of the individual components, especially the influence of different numbers of concurrent classification layers, was evaluated. The first altered architecture (Omics Stacking without integration) omits the subnetwork that classifies the concatenated omics features. This alteration results in a late integration neural network. The second one (Omics Stacking Complete) has classifier layers for every possible non-empty subset of the omics set, which adds three classification subnetworks. Finally, we tested Omics Stacking without Triplet Loss, which leaves out the triplet loss function.
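The classifier counts of the variants can be checked by enumeration. The sketch below assumes, following the description above, that standard Omics Stacking has one classifier per omics type plus one on the concatenation of all three; the omics identifiers are illustrative names only.

```python
from itertools import combinations

# Illustrative identifiers for the three omics types used in the study.
OMICS = ("expression", "mutation", "copy_number")


def nonempty_subsets(items):
    """All non-empty subsets of items, ordered by subset size."""
    return [
        subset
        for size in range(1, len(items) + 1)
        for subset in combinations(items, size)
    ]


# Assumed classifier inputs of each variant.
without_integration = [(omic,) for omic in OMICS]   # late integration only
omics_stacking = without_integration + [OMICS]      # plus the concatenation
complete = nonempty_subsets(OMICS)                  # every non-empty subset

# Classifiers that only the Complete variant has: the pairwise combinations.
extra = [subset for subset in complete if subset not in omics_stacking]
print(len(without_integration), len(omics_stacking), len(complete))
print(extra)
```

With three omics types there are seven non-empty subsets, so the Complete variant has three more classifiers than standard Omics Stacking, namely the three pairwise combinations.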

The results indicate that the omics integration is an essential part and that removing it worsens the prediction. Adding classifiers for the integration of two omics also makes the results worse. We suspect that too many classifiers hinder the network from identifying relevant patterns. The AUROCs, AUPRCs and mean ranks are given in Tables 8, 9, 10 and 11, and they are visualized in Figures 9 and 10.

Figure 9: Mean rank and critical difference of the AUROC and AUPRC on the test sets from cross-validation for the ablation study. The mean values of the outer cross-validation results are compared. The Nemenyi test with $α = 0.05$ was used to compute significant differences.

Figure 10: Mean rank and critical difference of the AUROC and AUPRC on the external test data for the ablation study. The mean values of the outer cross-validation results are compared. The Nemenyi test with $α = 0.05$ was used to compute significant differences.
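For orientation, the critical difference of the Nemenyi test can be computed by hand. The sketch below uses the standard formula $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ with the tabulated critical values for $\alpha = 0.05$. Taking $k = 4$ ablation variants and $N = 7$ drug data sets is our reading of the ablation tables, and the function name is illustrative; this is not the Orange3 code used for the figures.

```python
import math

# Critical values q_alpha of the two-tailed Nemenyi test for alpha = 0.05,
# indexed by the number of compared methods k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def nemenyi_critical_difference(n_methods, n_datasets, q_table=Q_ALPHA_005):
    """Critical difference in mean rank: q_alpha * sqrt(k(k+1) / (6N))."""
    q_alpha = q_table[n_methods]
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))


# Assumed setting: four ablation variants compared on seven drug data sets.
cd = nemenyi_critical_difference(n_methods=4, n_datasets=7)
print(round(cd, 3))
```

Under these assumptions the critical difference is about 1.77, so two variants would need mean ranks at least that far apart to differ significantly.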

	Omics Stacking
Drug	Complete Integration	Without Integration	Integration	Without Triplet Loss
Gemcitabine TCGA	$0.630 \pm 0.055$	$0.605 \pm 0.076$	$0.646 \pm 0.045$	$0.601 \pm 0.084$
Gemcitabine PDX	$0.640 \pm 0.089$	$0.656 \pm 0.062$	$0.651 \pm 0.071$	$0.593 \pm 0.044$
Cisplatin	$0.742 \pm 0.050$	$0.757 \pm 0.061$	$0.722 \pm 0.066$	$0.734 \pm 0.091$
Docetaxel	$0.775 \pm 0.089$	$0.759 \pm 0.031$	$0.772 \pm 0.077$	$0.813 \pm 0.024$
Erlotinib	$0.696 \pm 0.101$	$0.744 \pm 0.098$	$0.754 \pm 0.114$	$0.662 \pm 0.112$
Cetuximab	$0.748 \pm 0.048$	$0.679 \pm 0.052$	$0.731 \pm 0.090$	$0.721 \pm 0.055$
Paclitaxel	$0.695 \pm 0.104$	$0.634 \pm 0.114$	$0.667 \pm 0.138$	$0.522 \pm 0.085$
Rank	2.00	2.57	2.14	3.29

Table 8: Mean AUROC on test sets from cross-validation for the ablation study. Best results are shown in bold and second best are underlined. The values represent the means and standard deviations over five iterations.
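The Rank row can be reproduced from the mean values in Table 8. The sketch below ranks the four variants per drug (rank 1 for the highest AUROC, ties sharing the average rank) and averages the ranks over the seven rows. The column labels are copied from the table header, and working from the rounded table entries rather than the unrounded results is an assumption.

```python
COLUMNS = ("Complete Integration", "Without Integration",
           "Integration", "Without Triplet Loss")

# Mean AUROC values copied from Table 8 (standard deviations omitted).
TABLE_8_AUROC = {
    "Gemcitabine TCGA": (0.630, 0.605, 0.646, 0.601),
    "Gemcitabine PDX": (0.640, 0.656, 0.651, 0.593),
    "Cisplatin": (0.742, 0.757, 0.722, 0.734),
    "Docetaxel": (0.775, 0.759, 0.772, 0.813),
    "Erlotinib": (0.696, 0.744, 0.754, 0.662),
    "Cetuximab": (0.748, 0.679, 0.731, 0.721),
    "Paclitaxel": (0.695, 0.634, 0.667, 0.522),
}


def ranks_descending(values):
    """Rank 1 goes to the highest value; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for index in order[pos:end + 1]:
            ranks[index] = average_rank
        pos = end + 1
    return ranks


def mean_ranks(table):
    """Average the per-row ranks column by column."""
    per_row = [ranks_descending(row) for row in table.values()]
    return [sum(column) / len(per_row) for column in zip(*per_row)]


for name, rank in zip(COLUMNS, mean_ranks(TABLE_8_AUROC)):
    print(f"{name}: {rank:.2f}")
```

The resulting mean ranks agree with the Rank row of Table 8.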

	Omics Stacking
Drug	Complete Integration	Without Integration	Integration	Without Triplet Loss
Gemcitabine TCGA	$0.143 \pm 0.053$	$0.136 \pm 0.054$	$0.161 \pm 0.053$	$0.149 \pm 0.074$
Gemcitabine PDX	$0.138 \pm 0.046$	$0.204 \pm 0.092$	$0.151 \pm 0.085$	$0.119 \pm 0.046$
Cisplatin	$0.288 \pm 0.072$	$0.293 \pm 0.074$	$0.293 \pm 0.089$	$0.277 \pm 0.084$
Docetaxel	$0.317 \pm 0.064$	$0.262 \pm 0.072$	$0.316 \pm 0.083$	$0.358 \pm 0.046$
Erlotinib	$0.422 \pm 0.704$	$0.440 \pm 0.149$	$0.479 \pm 0.153$	$0.352 \pm 0.111$
Cetuximab	$0.379 \pm 0.102$	$0.264 \pm 0.042$	$0.376 \pm 0.104$	$0.350 \pm 0.049$
Paclitaxel	$0.186 \pm 0.076$	$0.188 \pm 0.103$	$0.220 \pm 0.108$	$0.090 \pm 0.024$
Rank	2.57	2.57	1.71	3.14

Table 9: Mean AUPRC on test sets from cross-validation for the ablation study. Best results are shown in bold and second best are underlined. The values represent the means and standard deviations over five iterations.

	Omics Stacking
Drug	Complete Integration	Without Integration	Integration	Without Triplet Loss
Gemcitabine TCGA	$0.646 \pm 0.048$	$0.641 \pm 0.056$	$0.655 \pm 0.029$	$0.624 \pm 0.068$
Gemcitabine PDX	$0.656 \pm 0.049$	$0.630 \pm 0.066$	$0.714 \pm 0.089$	$0.665 \pm 0.078$
Cisplatin	$0.668 \pm 0.046$	$0.685 \pm 0.078$	$0.644 \pm 0.087$	$0.680 \pm 0.063$
Docetaxel	$0.613 \pm 0.058$	$0.597 \pm 0.046$	$0.584 \pm 0.101$	$0.600 \pm 0.090$
Erlotinib	$0.704 \pm 0.141$	$0.715 \pm 0.193$	$0.744 \pm 0.065$	$0.741 \pm 0.077$
Cetuximab	$0.529 \pm 0.052$	$0.599 \pm 0.173$	$0.575 \pm 0.049$	$0.466 \pm 0.122$
Paclitaxel	$0.511 \pm 0.106$	$0.432 \pm 0.087$	$0.619 \pm 0.152$	$0.419 \pm 0.082$
Rank	2.57	2.57	2.00	2.86

Table 10: Mean AUROC on the external test set for the ablation study. Best results are shown in bold and second best are underlined. The values represent the means and standard deviations over five iterations.

	Omics Stacking
Drug	Complete Integration	Without Integration	Integration	Without Triplet Loss
Gemcitabine TCGA	$0.537 \pm 0.040$	$0.545 \pm 0.088$	$0.581 \pm 0.074$	$0.489 \pm 0.070$
Gemcitabine PDX	$0.522 \pm 0.087$	$0.455 \pm 0.072$	$0.510 \pm 0.132$	$0.483 \pm 0.109$
Cisplatin	$0.954 \pm 0.009$	$0.953 \pm 0.022$	$0.942 \pm 0.027$	$0.955 \pm 0.012$
Docetaxel	$0.571 \pm 0.028$	$0.644 \pm 0.048$	$0.560 \pm 0.051$	$0.568 \pm 0.056$
Erlotinib	$0.400 \pm 0.171$	$0.482 \pm 0.251$	$0.440 \pm 0.075$	$0.421 \pm 0.126$
Cetuximab	$0.107 \pm 0.012$	$0.188 \pm 0.164$	$0.125 \pm 0.018$	$0.102 \pm 0.024$
Paclitaxel	$0.176 \pm 0.094$	$0.127 \pm 0.027$	$0.256 \pm 0.147$	$0.121 \pm 0.017$
Rank	2.43	2.14	2.29	3.14

Table 11: Mean AUPRC on the external test set for the ablation study. Best results are shown in bold and second best are underlined. The values represent the means and standard deviations over five iterations.