TFusion: Transformer based N-to-One Multimodal Fusion Block

Zecheng Liu
School of Computer Science and Engineering
South China University of Technology
Guangzhou, China
scutzcliu@gmail.com

Jia Wei*
School of Computer Science and Engineering
South China University of Technology
Guangzhou, China
csjwei@scut.edu.cn

Rui Li
Golisano College of Computing and Information Sciences
Rochester Institute of Technology
Rochester, NY, USA
rxlics@rit.edu

* Corresponding author

Abstract

People perceive the world with different senses, such as sight, hearing, smell, and touch. Processing and fusing information from multiple modalities enables Artificial Intelligence to understand the world around us more easily. However, when modalities are missing, the number of available modalities varies from case to case, which leads to an N-to-One fusion problem. To solve this problem, we propose a transformer based fusion block called TFusion. Different from preset formulations or convolution based methods, the proposed block automatically learns to fuse available modalities without synthesizing or zero-padding missing ones. Specifically, the feature representations extracted by the upstream processing model are projected as tokens and fed into transformer layers to generate latent multimodal correlations. Then, to reduce the dependence on particular modalities, a modal attention mechanism is introduced to build a shared representation, which can be used by the downstream decision model. The proposed TFusion block can be easily integrated into existing multimodal analysis networks. In this work, we apply TFusion to different backbone networks for multimodal human activity recognition and brain tumor segmentation tasks. Extensive experimental results show that the TFusion block achieves better performance than the competing fusion strategies.

Index Terms: multimodal fusion, missing modalities, brain tumor segmentation, human activity recognition


I Introduction

People perceive the world with signals from different modalities, which often carry complementary information about varying aspects of an object or event of interest. Therefore, collecting and utilizing multimodal information is crucial for Artificial Intelligence to understand the world around us. Data collected from various sensors (e.g., microphones, cameras, motion controllers) are used to identify human activity [1]. Moreover, multimodal medical images obtained from different scanning protocols (e.g., Computed Tomography, Magnetic Resonance Imaging) are employed for disease diagnosis [2]. Satisfactory performances have been achieved with these multimodal data.

Fig. 1: Different fusion strategies for data with missing modalities. (a) The arithmetic strategy with a mean function. (b) The selection strategy adopts a max function as the selection rule. (c) The convolution strategy simulates missing data by zero-padding. (d) Our proposed TFusion. $*$ denotes the value obtained from an automatically learned mapping function.

In practical applications, however, missing modalities are common. Wirelessly connected sensors may occasionally disconnect and temporarily be unable to send any data [3]. Medical images may be missing due to artifacts and diverse patient conditions [4]. In these unexpected situations, any combinatorial subset of available modalities can be given as input. To handle this, one intuitive solution is to train a dedicated model for each possible subset of available modalities [5, 6, 7]. However, these methods are inefficient and time-consuming. Another way is to predict the missing modalities and then perform the task with the completed set of modalities [8]. However, these approaches also require additional prediction networks for each missing situation, and the quality of the recovered data directly affects the performance, especially when only a few modalities are available. Therefore, instead of training dedicated models for different situations or attempting to predict the missing modalities, fusing the available modalities into a shared representation is a more suitable approach. However, it is particularly challenging due to the varying number of input modalities, which results in the N-to-One fusion problem.

Currently, existing fusion strategies to tackle this challenge can be broadly grouped into three categories: the arithmetic strategy, the selection strategy and the convolution strategy. As shown in Fig. 1(a), in the arithmetic strategy, feature representations of available modalities are merged by an arithmetic function, such as averaging, computing the first and second moments or other designed formulas [9, 10, 11]. For the selection strategy, as shown in Fig. 1(b), each value of fused representation is selected from the values at the corresponding position of the inputs. The rule for selection can be defined as max, min or probability-based [12, 13, 14]. Although the above two fusion strategies are easily scalable to various data missing situations, their fusion operation is hard-coded. All available modalities contribute equally and their latent correlations are neglected.
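The two hard-coded strategies can be illustrated with a minimal NumPy sketch. The function names `mean_fusion` and `max_fusion` and the toy tensors are illustrative assumptions, not code from the cited works; they only show that both rules accept any number of available modalities and weight them in a fixed, data-independent way.

```python
import numpy as np

def mean_fusion(features):
    """Arithmetic strategy: average the available feature maps element-wise."""
    return np.mean(np.stack(features, axis=0), axis=0)

def max_fusion(features):
    """Selection strategy: keep the element-wise maximum over the available maps."""
    return np.max(np.stack(features, axis=0), axis=0)

# Two toy modalities with shape (B=1, C=1, L=3).
f1 = np.array([[[1.0, 4.0, 2.0]]])
f2 = np.array([[[3.0, 0.0, 2.0]]])

fused_mean = mean_fusion([f1, f2])   # [[[2., 2., 2.]]]
fused_max = max_fusion([f1, f2])     # [[[3., 4., 2.]]]
fused_single = mean_fusion([f1])     # with one modality the input is returned unchanged
print(fused_mean, fused_max, fused_single)
```

Neither function has learnable parameters, which is why the fusion rule cannot adapt to correlations between modalities.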

Fig. 2: The illustration of TFusion. Feature representations extracted from available modalities are projected as tokens and fed into the transformer layers (TransL) to learn multimodal correlations. Through a modal softmax function, the automatically learned weight maps are obtained to fuse the varying number of input feature representations. $R_{f}$ : L or H $\times$ W or D $\times$ H $\times$ W (shape of feature representation); T = L $\cdot | K |$ or H $\cdot$ W $\cdot | K |$ or D $\cdot$ H $\cdot$ W $\cdot | K |$ (number of tokens).

Unlike hard-coding the fusion operation, in the convolution strategy, the convolutional fusion network automatically learns how to fuse these feature representations, which is beneficial to exploiting the correlation between multiple modalities. However, as shown in Fig. 1(c), this fusion strategy needs a fixed number of inputs to match the input channels of the convolutional network. Therefore, it has to simulate missing data by crudely zero-padding or replacing it with similar modalities, which inevitably introduces a bias in computation and causes performance degradation [15, 16, 17].
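The fixed-channel constraint can be made concrete with a small sketch. The helper `zero_pad_concat` and the modality names are hypothetical; the example only shows how missing modalities are replaced by zeros so that the concatenated input always has the channel count a convolution expects.

```python
import numpy as np

def zero_pad_concat(available, all_modalities, shape):
    """Stack modalities along the channel axis; zeros stand in for missing ones."""
    parts = [available.get(name, np.zeros(shape)) for name in all_modalities]
    return np.concatenate(parts, axis=1)

ALL = ["T1ce", "T1", "T2", "FLAIR"]
shape = (1, 2, 4)  # (B, C, L) for each modality
available = {"T1ce": np.ones(shape), "FLAIR": np.ones(shape)}

x = zero_pad_concat(available, ALL, shape)
print(x.shape)                 # (1, 8, 4): always 4 * C channels
print(float((x == 0).mean()))  # 0.5: half of the input is padding in this case
```

With two of four modalities missing, half of what the convolution sees carries no information, which is the source of the bias discussed above.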

To tackle the above problems, we propose a transformer based multimodal fusion block, TFusion. As shown in Fig. 1(d), it can handle any number of inputs rather than a fixed number. In addition, TFusion is a learning-based fusion strategy that consists of two components: the correlation extraction (CE) module and the modal attention (MA) module. In the CE module, multiple transformer layers are employed [18, 19]. Feature representations extracted from available modalities are projected as tokens and fed into these transformer layers to learn multimodal correlations. Based on these correlations, a modal softmax function is proposed to generate weight maps in the MA module. Finally, TFusion builds a shared feature representation by fusing the varying inputs with the weight maps.

The contributions of this work are:

We propose the TFusion block, which is a data-dependent fusion strategy. It can learn the latent correlations between different modalities and build a shared representation adaptively. The entire fusion process is based on available modalities without completing missing ones.
TFusion is not limited to specific deep learning architectures. It takes inputs from any kind of upstream processing model and its output serves as the input of the downstream decision model, which enables applying TFusion to various backbone networks for different tasks.
We provide qualitative and quantitative performance evaluations on activity recognition with the SHL [20] dataset and brain tumor segmentation with the BraTS2020 [21] dataset. The results show the superiority of TFusion over competing fusion strategies.

II Methodology

II-A Method Overview

For multiple modalities, let $k \in K \subseteq \{1, 2, \dots, S\}$ index a specific modality within the available modality set $K$, where $S$ is the number of all possible modalities. Given an input $f_{k} \in \mathbb{R}^{B \times C \times R_{f}}$, $B$ and $C$ denote the batch size and the number of channels, respectively. $R_{f}$ represents the shape of the feature representation extracted from the $k$-th modality of a sample, which can be 1D ($L$), 2D ($H \times W$), 3D ($D \times H \times W$) or higher-dimensional. In addition, $I = \{f_{k} \mid k \in K\}$ denotes the input set of feature representations from all the available modalities. Our goal is to learn a fusion function $F$ that can project $I$ into a shared feature representation $f_{s}$, denoted as $F(I) \to f_{s}$. To achieve this goal, we design an N-to-One fusion block, TFusion. Its architecture, shown in Fig. 2, consists of two modules: the correlation extraction (CE) module and the modal attention (MA) module.

II-B Correlation Extraction

Given the feature representation $f_{k} \in \mathbb{R}^{B \times C \times R_{f}}$, we first flatten the $R_{f}$ dimensions of $f_{k}$ into one dimension and get a $B \times C \times R$ feature representation, where $R = L$ (1D), $R = H \times W$ (2D), $R = D \times H \times W$ (3D), etc. It can be viewed as $B \times R$ tokens $t_{k}$ of dimension $C$. Then, we obtain the concatenation of all the tokens $z_{0} \in \mathbb{R}^{B \times T \times C}$, where $T = R \times |K|$, and $|K|$ denotes the number of available modalities.
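The tokenization step can be sketched in NumPy as follows. The function names `to_tokens` and `concat_tokens` are illustrative, and the modality-major ordering of tokens along the $T$ axis is an assumption, since the text only states that all tokens are concatenated.

```python
import numpy as np

def to_tokens(f_k):
    """(B, C, *R_f) -> (B, R, C): flatten R_f so each position is a C-dimensional token."""
    B, C = f_k.shape[:2]
    return f_k.reshape(B, C, -1).transpose(0, 2, 1)

def concat_tokens(features):
    """Concatenate the tokens of all available modalities along the token axis."""
    return np.concatenate([to_tokens(f) for f in features], axis=1)

B, C, H, W = 2, 8, 4, 4
features = [np.random.rand(B, C, H, W) for _ in range(3)]  # |K| = 3 available modalities
z0 = concat_tokens(features)
print(z0.shape)  # (2, 48, 8): T = H * W * |K| = 16 * 3
```

Because only the token count $T$ changes with $|K|$, the same code handles any number of available modalities.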

Given $z_{0}$, a stack of eight transformer layers is introduced to learn the latent multimodal correlations. Each layer has a standard architecture that includes a multi-head attention (MHA) [18] block and a fully connected feed-forward network (FFN) [18]. Layer normalization (LN) [18] is applied before every block. The output of the $x$-th ($x \in \{1, 2, \dots, 8\}$) layer can be described as:

$$z_{x}^{'} = \mathrm{MHA}(\mathrm{LN}(z_{x-1})) + z_{x-1} \tag{1}$$

$$z_{x} = \mathrm{FFN}(\mathrm{LN}(z_{x}^{'})) + z_{x}^{'} \tag{2}$$
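Eqs. (1) and (2) describe a pre-normalization residual layer. The NumPy sketch below is a simplified illustration under stated assumptions: random weights, no biases or learnable LN parameters, a ReLU feed-forward network, two attention heads and a hidden width of $4C$. None of these hyperparameters are specified in the text; only the number of layers (eight) is.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalize the channel axis (learnable scale and shift omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mha(x, p, heads):
    """Multi-head self-attention over the token axis of x with shape (B, T, C)."""
    B, T, C = x.shape
    d = C // heads
    def split(y):  # (B, T, C) -> (B, heads, T, d)
        return y.reshape(B, T, heads, d).transpose(0, 2, 1, 3)
    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))  # (B, heads, T, T)
    out = (attn @ v).transpose(0, 2, 1, 3).reshape(B, T, C)
    return out @ p["Wo"]

def ffn(x, p):
    """Position-wise feed-forward network with a ReLU non-linearity."""
    return np.maximum(x @ p["W1"], 0.0) @ p["W2"]

def transformer_layer(z_prev, p, heads=2):
    z_mid = mha(layer_norm(z_prev), p, heads) + z_prev  # Eq. (1)
    return ffn(layer_norm(z_mid), p) + z_mid            # Eq. (2)

def init_params(C, hidden):
    shapes = {"Wq": (C, C), "Wk": (C, C), "Wv": (C, C), "Wo": (C, C),
              "W1": (C, hidden), "W2": (hidden, C)}
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}

B, T, C = 2, 12, 8
z = rng.normal(size=(B, T, C))
for p in [init_params(C, 4 * C) for _ in range(8)]:  # stack of eight layers
    z = transformer_layer(z, p)
print(z.shape)  # (2, 12, 8)
```

No weight depends on $T$, so the same layers apply unchanged when the number of tokens varies with the number of available modalities.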

Therefore, we get $z_{l} \in \mathbb{R}^{B \times T \times C}$, which is the output of the last transformer layer. By reverting $z_{l}$ to the size of $|K| \times B \times C \times R_{f}$, we obtain the output $I^{'} = \{f_{k}^{'} \mid k \in K\}$ of CE as:

$$I^{'} = \mathrm{split}(r(z_{l})) \tag{3}$$

where $r(\cdot)$ and $\mathrm{split}(\cdot)$ are the reshape and split operations, and $I^{'}$ is the set of transformed feature representations $f_{k}^{'} \in \mathbb{R}^{B \times C \times R_{f}}$, each of which contains multimodal correlations and has the same size as the original input $f_{k}$.
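Eq. (3) can be sketched as the inverse of the tokenization step. The helper `split_tokens` is hypothetical and assumes that tokens were concatenated modality by modality; the round trip below checks that reshape and split recover tensors of the original size.

```python
import numpy as np

def split_tokens(z_l, num_modalities, feat_shape):
    """(B, T, C) -> list of |K| arrays with shape (B, C, *R_f)."""
    B, T, C = z_l.shape
    R = T // num_modalities
    per_mod = z_l.reshape(B, num_modalities, R, C)             # r(.)
    return [per_mod[:, k].transpose(0, 2, 1).reshape(B, C, *feat_shape)
            for k in range(num_modalities)]                    # split(.)

B, C, H, W, K = 2, 8, 4, 4, 3
feats = [np.random.rand(B, C, H, W) for _ in range(K)]
# Forward tokenization, used here only to build a (B, T, C) tensor to invert.
z = np.concatenate([f.reshape(B, C, -1).transpose(0, 2, 1) for f in feats], axis=1)

restored = split_tokens(z, K, (H, W))
print(len(restored), restored[0].shape)  # 3 (2, 8, 4, 4)
```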

II-C Modal Attention

Given the transformed feature representations set $I^{'}$ , the weight map $m_{k}$ is generated with the modal attention mechanism. Feature representations extracted from different modalities are expected to have different weights for fusion at the voxel level. Therefore, we introduce a modal-wise and voxel-level softmax function to generate the weight maps from $I^{'}$ , as shown in Fig. 3.

We denote the $i$-th voxel of $f_{k}^{'}$ and $m_{k}$ as $v_{k}^{i}$ and $m_{k}^{i}$, respectively, and $e$ is the base of the natural logarithm. The values of the weight map $m_{k}$ are defined as:

$$m_{k}^{i} = \frac{e^{v_{k}^{i}}}{\sum_{j \in K} e^{v_{j}^{i}}} \tag{4}$$

By element-wise multiplying input feature map $f_{k}$ with the corresponding weight map $m_{k}$ and summing all the modalities, we can obtain a fused feature map $f_{s}$ as:

$$f_{s} = \sum_{k \in K} f_{k} \cdot m_{k} \tag{5}$$

Since the sum of $m_{1}^{i}, \dots, m_{|K|}^{i}$ is 1, the value range of the fused feature representation $f_{s}$ remains stable, which improves robustness to a variable number of input modalities. Moreover, the relative sizes of $v_{1}^{i}, \dots, v_{|K|}^{i}$ (which contain the latent multimodal correlations learned by the CE module) are retained in the corresponding weights. In particular, when only one modality is available, all the values of the weight map are 1, which means $f_{s} = f_{k}$ ($k \in K$, $|K| = 1$). In this case, the input feature representation remains unchanged. This encourages the backbone network (the upstream processing model and the downstream decision model) to enhance its capability to encode and decode information from different modalities rather than relying on a particular one, which is crucial for variable multimodal data analysis.
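The modal softmax of Eq. (4) and the weighted sum of Eq. (5) can be sketched in a few lines of NumPy. The function name `modal_attention_fuse` and the random stand-ins for the CE outputs are illustrative assumptions; subtracting the maximum before exponentiation is a standard numerical safeguard that does not change the softmax result.

```python
import numpy as np

def modal_attention_fuse(inputs, transformed):
    """Fuse inputs f_k using softmax weights computed across modalities from f'_k."""
    v = np.stack(transformed, axis=0)        # (|K|, B, C, *R_f)
    v = v - v.max(axis=0, keepdims=True)     # stabilize exp; softmax is shift-invariant
    e = np.exp(v)
    m = e / e.sum(axis=0, keepdims=True)     # Eq. (4): weights sum to 1 over modalities
    f = np.stack(inputs, axis=0)
    return (f * m).sum(axis=0), m            # Eq. (5)

rng = np.random.default_rng(0)
f1, f2 = rng.normal(size=(1, 2, 4)), rng.normal(size=(1, 2, 4))
t1, t2 = rng.normal(size=(1, 2, 4)), rng.normal(size=(1, 2, 4))  # stand-ins for CE outputs

fused, weights = modal_attention_fuse([f1, f2], [t1, t2])
fused_single, weights_single = modal_attention_fuse([f1], [t1])
print(weights.sum(axis=0))             # all ones
print(np.allclose(fused_single, f1))   # True: a single modality passes through unchanged
```

Since the weights at each voxel are non-negative and sum to 1, every fused value is a convex combination of the input values at that voxel, so it stays within their range.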

Fig. 3: The illustration of modal attention mechanism.

It is worth noting that TFusion is a flexible data-dependent fusion strategy by which complementary information from variable input feature representations can be fused automatically. In addition, TFusion does not require the introduction of any redundant information (e.g., zero-padding), which is different from existing convolutional fusion strategies. Experiments described in Sect. III are carried out on human activity recognition and brain tumor segmentation to evaluate TFusion.

III Experiments and Results

III-A Datasets

SHL2019. The SHL (Sussex-Huawei Locomotion) Challenge 2019 [20] dataset provides data from seven sensors of a smartphone to recognize eight modes of locomotion and transportation (activities): still, walking, run, bike, car, bus, train, and subway. The sensor data are collected from smartphones carried by a person at four locations: the bag, trousers front pocket, breast pocket and hand. These locations are called “Bag”, “Hips”, “Torso”, and “Hand”, respectively. The seven sensors are the accelerometer, gravity, gyroscope, linear accelerometer, magnetometer, orientation and pressure sensors. The orientation data consist of four channels, expressed as a spatial rotation in a quaternion (i.e., w, x, y, and z components). The pressure data have a single channel. The data of each remaining sensor consist of three channels corresponding to the three-dimensional axes (i.e., x, y, and z-axes).

Fig. 4: The illustration of the network architecture for activity recognition with EmbraceNet. (B $\times$ C $\times$ $R_{f}$ ) is given, where B, C and $R_{f}$ denote the batch size, channels and data shape, respectively.

The SHL2019 dataset is split into three subsets: train, validation, and test, which contain 588,215, 48,708, and 55,811 available samples, respectively. Each sample consists of 500 sensor values that are acquired for five seconds with a sampling rate of 100 Hz. Data acquired from the locations except the “Hand” are given in the train subset, while the validation subset provides the data of all four locations. In the test subset, only unlabeled “Hand” location data are available.

BraTS2020. The BraTS2020 [21] dataset provides scans of four modalities (T1ce, T1, T2 and FLAIR) for brain tumor segmentation. It contains 369 subjects. There are four labels in the segmentation mask, namely NCR (label 1: the necrotic tumor core), ED (label 2: the peritumoral edematous/invaded tissue), NET (label 3: the non-enhancing tumor core), and ET (label 4: the enhancing tumor). To better represent the clinical application tasks, different structures have been grouped into three mutually inclusive tumor regions: ET (the enhancing tumor), TC (union of labels 1, 3 and 4: the tumor core region), and WT (union of all labels: the whole tumor region) [21]. All the modalities from the same subject are co-registered, resampled to isotropic $1\,\mathrm{mm}^{3}$ resolution and skull-stripped, with a size of $155 \times 240 \times 240$. We use 70% of the data for training, 10% for validation and 20% for testing. To prevent overfitting, two data augmentation techniques (random flipping along the axes and rotation by a random angle in $[-10^{\circ}, 10^{\circ}]$) are applied during training. We apply z-score normalization [22] to the volumes individually and randomly crop $128 \times 128 \times 128$ patches as inputs to the networks.
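The label grouping and the per-volume preprocessing can be sketched as below. The helper names are hypothetical, the region definitions follow the grouping stated in this section, and normalizing over the whole volume (rather than, for example, only brain voxels) is an assumption. A small toy volume replaces the $155 \times 240 \times 240$ scans to keep the example light.

```python
import numpy as np

def tumor_regions(mask):
    """Binary maps for ET, TC and WT from an integer label mask."""
    et = mask == 4                   # enhancing tumor
    tc = np.isin(mask, [1, 3, 4])    # tumor core: union of labels 1, 3 and 4
    wt = mask > 0                    # whole tumor: union of all labels
    return et, tc, wt

def zscore(volume, eps=1e-8):
    """Normalize one volume to zero mean and unit standard deviation."""
    return (volume - volume.mean()) / (volume.std() + eps)

def random_crop(volume, size, rng):
    """Cut a random patch of the given size out of a 3D volume."""
    starts = [rng.integers(0, dim - s + 1) for dim, s in zip(volume.shape, size)]
    return volume[tuple(slice(st, st + s) for st, s in zip(starts, size))]

rng = np.random.default_rng(0)
mask = np.array([0, 1, 2, 3, 4])
et, tc, wt = tumor_regions(mask)
print(et.astype(int), tc.astype(int), wt.astype(int))  # [0 0 0 0 1] [0 1 0 1 1] [0 1 1 1 1]

volume = rng.normal(loc=100.0, scale=20.0, size=(20, 30, 30))
patch = random_crop(zscore(volume), (16, 16, 16), rng)
print(patch.shape)  # (16, 16, 16)
```

The regions are nested (ET within TC within WT), which is what "mutually inclusive" refers to.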

III-B Baseline Methods

EmbraceNet. In the experiments on activity recognition, we compare TFusion with EmbraceNet [23], which employs a selection strategy (shown in Fig. 1(b)) by generating feature masks ( $r_{1}, r_{2}, \dots, r_{7}$ ) with the rule of giving equal chances to all available modalities during each value selection. For a fair comparison, as shown in Fig. 4, we adopt the same processing (P) and decision (D) model as used in [23]. We obtain the performance of our fusion strategy by replacing EmbraceNet with TFusion.
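The selection rule described above can be illustrated with a simplified sketch. It is not the EmbraceNet reference implementation: the function `embrace_masks` is hypothetical, features are treated as flat vectors, and the only property it reproduces is that each fused value is taken from exactly one available modality chosen with equal probability.

```python
import numpy as np

def embrace_masks(available, feature_dim, rng):
    """One-hot masks r_k: each feature position selects one available modality uniformly."""
    available = np.asarray(available, dtype=bool)            # (S,)
    probs = available / available.sum()                      # zero probability if missing
    choice = rng.choice(len(available), size=feature_dim, p=probs)
    return (choice[None, :] == np.arange(len(available))[:, None]).astype(float)

rng = np.random.default_rng(0)
S, D = 7, 16                                 # seven sensors, toy feature length
feats = rng.normal(size=(S, D))
available = [True, True, False, True, True, True, False]

masks = embrace_masks(available, D, rng)     # (S, D)
fused = (masks * feats).sum(axis=0)          # each value copied from one modality
print(masks.sum(axis=0))                     # all ones
print(masks[2].sum(), masks[6].sum())        # 0.0 0.0: missing sensors are never selected
```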

Following the settings of [23], five decisions are obtained per sample, and each one is the result for a one-second time duration. The batch size is set to 8. A cross-entropy loss and the Adam optimization method [24] with $β_{1} = 0.9$ , $β_{2} = 0.999$ are employed. The learning rate is initially set to $1 \times 10^{- 4}$ and reduced by a factor of 2 every $1 \times 10^{5}$ steps. A total of $5 \times 10^{5}$ training steps are executed.
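The learning-rate schedule can be written as a short function. This is a sketch that assumes a staircase decay (the rate is halved exactly at multiples of $10^{5}$ steps); the function name `step_lr` is illustrative.

```python
def step_lr(step, base_lr=1e-4, decay_every=100_000, factor=2.0):
    """Learning rate at a given step: divided by `factor` every `decay_every` steps."""
    return base_lr / (factor ** (step // decay_every))

for step in (0, 99_999, 100_000, 499_999):
    print(step, step_lr(step))
# Over the 5e5 training steps the rate is halved four times, ending at 1e-4 / 16.
```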

Fig. 5: The illustration of the network architecture for brain tumor segmentation with GFF. (B $\times$ C $\times$ $R_{f}$ ) is given, where B, C and $R_{f}$ denote the batch size, channels and data shape, respectively.

Modalities				Dice (%)
T1ce	T1	T2	FLAIR	WT GFF	WT TFusion	TC GFF	TC TFusion	ET GFF	ET TFusion
$\bullet$	$\circ$	$\circ$	$\circ$	68.24	69.75*	73.27	75.63*	69.30	71.94*
$\circ$	$\bullet$	$\circ$	$\circ$	64.45	69.11*	46.93	53.86*	23.74	29.71*
$\circ$	$\circ$	$\bullet$	$\circ$	79.78	79.61	58.27	61.99*	36.13	35.87
$\circ$	$\circ$	$\circ$	$\bullet$	81.82	83.97	50.53	52.84	29.50	34.40*
$\bullet$	$\bullet$	$\circ$	$\circ$	74.99	75.30	75.89	80.35*	72.09	74.90*
$\bullet$	$\circ$	$\bullet$	$\circ$	83.93	84.27*	79.55	81.48*	72.87	74.74*
$\bullet$	$\circ$	$\circ$	$\bullet$	87.34	87.32	79.01	79.06	74.89	75.82
$\circ$	$\bullet$	$\bullet$	$\circ$	81.76	81.78	59.75	66.67*	36.50	40.38*
$\circ$	$\bullet$	$\circ$	$\bullet$	85.86	86.39	61.92	62.31	37.52	38.22
$\circ$	$\circ$	$\bullet$	$\bullet$	86.99	87.50*	61.92	66.38*	38.94	41.46*
$\bullet$	$\bullet$	$\bullet$	$\circ$	84.48	84.59	79.83	82.32*	73.74	74.78
$\bullet$	$\bullet$	$\circ$	$\bullet$	88.03	88.04	80.50	82.04*	74.53	75.44
$\bullet$	$\circ$	$\bullet$	$\bullet$	88.75	89.11*	81.60	82.06	74.43	74.91
$\circ$	$\bullet$	$\bullet$	$\bullet$	86.84	87.63*	65.38	68.76*	40.90	43.53*
$\bullet$	$\bullet$	$\bullet$	$\bullet$	88.65	88.93	81.29	82.18	74.55	73.76
Average				82.13	82.89*	69.04	71.86*	55.31	57.32*

TABLE I: Performance evaluation of the brain tumor segmentation on BraTS2020. The table shows the Dice score for different MRI modalities being either absent ($\circ$) or present ($\bullet$). A better method has a higher Dice score (best highlighted in bold). * denotes significant improvement according to a Wilcoxon test ($p$-value $< 0.05$).

Accuracy(%)	Bag	Hips	Torso	Hand	All
Early fusion $†$	–	–	–	–	46.73
Intermediate fusion $†$	–	–	–	–	63.87
Late fusion $†$	–	–	–	–	63.85
Confidence fusion $†$ [25]	–	–	–	–	63.60
EmbraceNet $†$ [23]	63.68	67.98	81.58	47.63	65.22
TFusion	67.41	68.91	85.22	48.35	67.47
TFusion w/o CE	56.82	63.14	74.69	46.70	60.33
TFusion w/o MA	65.01	67.95	83.49	47.52	65.99

TABLE II: Performance evaluation on the validation data of SHL2019. The table shows the accuracy for different phone locations. “w/o” means without. $†$ denotes results from [23].

GFF. In the experiments on brain tumor segmentation, we compare TFusion with a gated feature fusion block (GFF) [17], which belongs to the convolution strategy (shown in Fig. 1(c)). As shown in Fig. 5, a feature disentanglement architecture is employed. Multimodal medical images are decomposed into modality-invariant content codes and modality-specific appearance codes by encoders $E^{c}$ and $E^{a}$ , respectively. The content codes (e.g., $c_{2}$ and $c_{3}$ , shown in Fig. 5) of missing modalities are simulated with zero values. Then, all content codes are fused into a shared representation $c_{s}$ by GFF. Given $c_{s}$ and an appearance code $a_{i}$ , the corresponding image is reconstructed by the decoder $D_{i}^{r}$ . Given $c_{s}$ , the tumor segmentation results are generated by the decoder $D^{s}$ . For a fair comparison, we adopt the same encoders ( $E_{i}^{c}$ and $E_{i}^{a}$ ) and decoders ( $D^{s}$ and $D_{i}^{r}$ ) as used in [17]. We obtain the performance of our fusion strategy by replacing GFF with TFusion and removing the zero-padding operation.

The maximum number of training epochs (max_epoch) is set to 200. Following the settings of [17], the batch size is set to 1. Adam [24] is used with an initial learning rate of $1 \times 10^{-4}$, which is progressively multiplied by $(1 - \mathrm{epoch} / \mathrm{max\_epoch})^{0.9}$. The losses $L_{KL}$, $L_{rec}$ and $L_{seg}$ are employed as in [17]. During training, to simulate real missing-modality scenarios, each training patient’s data is fixed to one of the 15 possible missing cases. For a comprehensive evaluation, we test the performance of all 15 cases for each test patient.
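The 15 missing cases and the polynomial learning-rate decay can be sketched as follows. The function names, the seed and the way a case is assigned to each patient are illustrative assumptions; the text only states that each training patient is fixed to one case.

```python
import random
from itertools import combinations

MODALITIES = ("T1ce", "T1", "T2", "FLAIR")

def missing_cases(modalities=MODALITIES):
    """All non-empty subsets of available modalities: 2^4 - 1 = 15 for four scans."""
    return [subset for n in range(1, len(modalities) + 1)
            for subset in combinations(modalities, n)]

def assign_cases(patient_ids, seed=0):
    """Fix one case per training patient (random assignment is an assumption)."""
    rng = random.Random(seed)
    cases = missing_cases()
    return {pid: rng.choice(cases) for pid in patient_ids}

def poly_lr(epoch, max_epoch=200, base_lr=1e-4, power=0.9):
    """Initial rate multiplied by (1 - epoch / max_epoch) ** power."""
    return base_lr * (1 - epoch / max_epoch) ** power

cases = missing_cases()
print(len(cases))                      # 15
print(assign_cases(["p001", "p002"]))  # one fixed subset per patient
print(poly_lr(0), poly_lr(100), poly_lr(200))
```

At test time, iterating over `cases` for every patient gives the exhaustive evaluation over all 15 combinations.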

Our implementations are on an NVIDIA RTX 3090 (24G) with PyTorch 1.8.1.

III-C Results

Activity recognition. We compare TFusion with EmbraceNet [23] on SHL2019. In addition, as shown in TABLE II, we also report the results of other fusion methods, which use the same processing (P) model and decision (D) model as shown in Fig. 4. (1) In the early fusion method, the data of seven sensors are concatenated along their $C$ dimension. The prediction results are obtained by inputting the concatenation into a network of P and D in series. (2) For the intermediate fusion approach, EmbraceNet is replaced with the concatenation of feature representations along their $R_{f}$ dimension. (3) In the late fusion method, an independent network of P and D in series is trained for each sensor, and then the decision is made from the averaged softmax outputs. (4) In the confidence fusion model, EmbraceNet is replaced with the confidence calculation and fusion layers in [25]. The results of different fusion methods on the validation data are presented in TABLE II. Our proposed TFusion outperforms EmbraceNet in all four smartphone locations and improves the overall accuracy from 65.22% to 67.47%.
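As an illustration of the late fusion baseline, the sketch below averages per-sensor softmax outputs and takes the most probable class. The toy logits and the function name `late_fusion_decision` are assumptions; in the experiments each sensor has its own trained network.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion_decision(per_sensor_logits):
    """Average the per-sensor class probabilities and return the argmax class."""
    probs = np.mean([softmax(l) for l in per_sensor_logits], axis=0)
    return probs.argmax(axis=-1), probs

# Toy logits for one sample and eight activity classes from three sensors.
s1 = np.zeros((1, 8)); s1[0, 1] = 2.0   # sensor 1 favors class 1
s2 = np.zeros((1, 8)); s2[0, 2] = 2.0   # sensors 2 and 3 favor class 2
s3 = np.zeros((1, 8)); s3[0, 2] = 2.0

decision, probs = late_fusion_decision([s1, s2, s3])
print(decision)  # [2]: the class supported by most sensors wins
```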

		Average Dice (%)
CE	MA	WT	TC	ET
	✓	82.42	70.39	55.65
✓		82.76	70.93	55.56
✓	✓	82.89	71.86	57.32

TABLE III: Ablation experiments of different key components in TFusion on BraTS2020 (best highlighted in bold).

Fig. 6: Segmentation results from different methods for the necrotic/non-enhancing (yellow), active core (green) and edema (red) regions. Input modalities for inference are indicated. (Color figure online).

Brain tumor segmentation. The quantitative segmentation results are shown in TABLE I. Compared with GFF, the network integrated with TFusion achieves better average performance over the 15 possible combinations in all three tumor segmentation tasks. In particular, TFusion outperforms GFF for all the possible combinations in the TC segmentation task. Overall, TFusion achieves better Dice scores in most situations (13, 15 and 13 of the 15 situations for WT, TC and ET segmentation, respectively). In addition, we conduct a statistical significance analysis with a Wilcoxon test ($p$-value $< 0.05$). The numbers of situations with significant improvement are 6, 10 and 8 for WT, TC and ET, respectively. Besides, we find no significant drop in performance caused by TFusion. The visualization of segmentation results is shown in Fig. 6.
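For reference, the Dice score reported in TABLE I can be computed for a binary region as twice the overlap divided by the total size of the prediction and the ground truth. The sketch below uses this standard definition; the small `eps` term, which avoids division by zero when both masks are empty, is an implementation assumption.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice = 2 |P ∩ G| / (|P| + |G|) for binary masks P (prediction) and G (ground truth)."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)

pred = np.array([1, 1, 0, 0])
target = np.array([1, 0, 1, 0])
print(round(dice_score(pred, target), 4))    # 0.5: one shared voxel, two voxels in each mask
print(round(dice_score(target, target), 4))  # 1.0: perfect overlap
```

Applying this to the ET, TC and WT masks of each test case and averaging gives per-region percentages of the kind listed in the table.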

Ablation experiments. The correlation extraction (CE) module and the modal attention (MA) module are two key components in TFusion. To verify the contribution of these modules, we evaluate TFusion without CE and without MA, respectively. TFusion without CE denotes that feature representations are directly fed into the MA module (Fig. 2). TFusion without MA means that we directly sum the transformed feature representations ( $I^{'}$ ) to get the fusion result. For activity recognition, as shown in TABLE II, TFusion without CE (60.33% overall accuracy) performs worse than every other method except early fusion. Compared with EmbraceNet, the improvement of TFusion without MA is marginal (65.99% versus 65.22% overall accuracy). For brain tumor segmentation, as shown in TABLE III, we present the averaged performance over the 15 possible combinations of the input modalities on BraTS2020. It shows that both the CE and MA modules lead to performance improvement across all the tumor regions. Therefore, ablation experiments on two different tasks show that both CE and MA play an important role in TFusion.

IV Conclusion

In this paper, we propose a transformer based N-to-One fusion block, TFusion, to tackle the problem of multimodal fusion with missing modalities. As a data-dependent fusion strategy, TFusion can automatically learn the latent correlations between different modalities and build a shared feature representation. The entire fusion process is based on available data without simulating missing modalities. In addition, TFusion is compatible with any kind of upstream processing model and downstream decision model, making it applicable to a wide range of tasks. We show that it can be integrated into existing backbone networks by replacing their fusion operation or block to improve activity recognition and brain tumor segmentation performance. In the future, we will explore other tasks related to variable multimodal fusion with TFusion.

Acknowledgment

This work is supported in part by the Guangzhou Science and Technology Planning Project (202201010092), the Guangdong Provincial Natural Science Foundation (2020A1515010717), NSF-1850492 and NSF-2045804.

References

[1] C. Chen, R. Jafari, and N. Kehtarnavaz, “Utd-mhad: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor,” in 2015 IEEE International conference on image processing (ICIP). IEEE, 2015, pp. 168–172.
[2] Z. Guo, X. Li, H. Huang, N. Guo, and Q. Li, “Deep learning-based image segmentation on multimodal medical imaging,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 3, no. 2, pp. 162–169, 2019.
[3] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, G. Tröster, J. d. R. Millán, and D. Roggen, “The opportunity challenge: A benchmark database for on-body sensor-based activity recognition,” Pattern Recognition Letters, vol. 34, no. 15, pp. 2033–2042, 2013.
[4] M. J. Graves and D. G. Mitchell, “Body mri artifacts in clinical practice: a physicist’s and radiologist’s perspective,” Journal of Magnetic Resonance Imaging, vol. 38, no. 2, pp. 269–287, 2013.
[5] C. Chen, Q. Dou, Y. Jin, Q. Liu, and P. A. Heng, “Learning with privileged multimodal knowledge for unimodal segmentation,” IEEE Transactions on Medical Imaging, pp. 1–1, 2021.
[6] M. Hu, M. Maillard, Y. Zhang, T. Ciceri, G. La Barbera, I. Bloch, and P. Gori, “Knowledge distillation from multi-modal to mono-modal segmentation networks,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, and L. Joskowicz, Eds. Cham: Springer International Publishing, 2020, pp. 772–781.
[7] Y. Wang, Y. Zhang, Y. Liu, Z. Lin, J. Tian, C. Zhong, Z. Shi, J. Fan, and Z. He, “Acn: Adversarial co-training network for brain tumor segmentation with missing modalities,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert, Eds. Cham: Springer International Publishing, 2021, pp. 410–420.
[8] L. Shen, W. Zhu, X. Wang, L. Xing, J. M. Pauly, B. Turkbey, S. A. Harmon, T. H. Sanford, S. Mehralivand, P. L. Choyke, B. J. Wood, and D. Xu, “Multi-domain image completion for random missing input data,” IEEE Transactions on Medical Imaging, vol. 40, no. 4, pp. 1113–1122, 2021.
[9] K. Lau, J. Adler, and J. Sjölund, “A unified representation network for segmentation with missing modalities,” arXiv preprint arXiv:1908.06683, 2019.
[10] R. Dorent, S. Joutard, M. Modat, S. Ourselin, and T. Vercauteren, “Hetero-modal variational encoder-decoder for joint modality completion and segmentation,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, D. Shen, T. Liu, T. M. Peters, L. H. Staib, C. Essert, S. Zhou, P.-T. Yap, and A. Khan, Eds. Cham: Springer International Publishing, 2019, pp. 74–82.
[11] M. Havaei, N. Guizard, N. Chapados, and Y. Bengio, “Hemis: Hetero-modal image segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016, S. Ourselin, L. Joskowicz, M. R. Sabuncu, G. Unal, and W. Wells, Eds. Cham: Springer International Publishing, 2016, pp. 469–477.
[12] A. Chartsias, T. Joyce, M. V. Giuffrida, and S. A. Tsaftaris, “Multimodal mr synthesis via modality-invariant latent representation,” IEEE Transactions on Medical Imaging, vol. 37, no. 3, pp. 803–814, 2018.
[13] J. Ouyang, E. Adeli, K. M. Pohl, Q. Zhao, and G. Zaharchuk, “Representation disentanglement for multi-modal brain mri analysis,” in International Conference on Information Processing in Medical Imaging. Springer, 2021, pp. 321–333.
[14] J.-H. Choi and J.-S. Lee, “Embracenet: A robust deep learning architecture for multimodal classification,” Information Fusion, vol. 51, pp. 259–270, 2019.
[15] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in ICML, 2011.
[16] T. Zhou, S. Canu, P. Vera, and S. Ruan, “Brain tumor segmentation with missing modalities via latent multi-source correlation representation,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, and L. Joskowicz, Eds. Cham: Springer International Publishing, 2020, pp. 533–541.
[17] C. Chen, Q. Dou, Y. Jin, H. Chen, J. Qin, and P.-A. Heng, “Robust multimodal brain tumor segmentation via feature disentanglement and gated fusion,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 447–456.
[18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
[19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021.
[20] L. Wang, H. Gjoreski, M. Ciliberto, S. Mekki, S. Valentin, and D. Roggen, “Enabling reproducible research in sensor-based transportation mode recognition with the sussex-huawei dataset,” IEEE Access, vol. 7, pp. 10870–10891, 2019.
[21] S. Bakas, B. Menze, C. Davatzikos, J. Kalpathy-Cramer, K. Farahani et al., “MICCAI Brain Tumor Segmentation (BraTS) 2020 Benchmark: ‘Prediction of Survival and Pseudoprogression’,” Mar. 2020.
[22] F. Isensee, J. Petersen, A. Klein, D. Zimmerer, P. F. Jaeger, S. Kohl, J. Wasserthal, G. Koehler, T. Norajitra, S. Wirkert et al., “nnu-net: Self-adapting framework for u-net-based medical image segmentation,” arXiv preprint arXiv:1809.10486, 2018.
[23] J.-H. Choi and J.-S. Lee, “Embracenet for activity: A deep multimodal fusion architecture for activity recognition,” in Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 2019, pp. 693–698.
[24] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[25] J.-H. Choi and J.-S. Lee, “Confidence-based deep multimodal fusion for activity recognition,” in Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, 2018, pp. 1548–1556.
