UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering

Yuwei Sun
University of Tokyo
RIKEN AIP
ywsun@g.ecc.u-tokyo.ac.jp
Hideya Ochiai
University of Tokyo
ochiai@elab.ic.i.u-tokyo.ac.jp
Abstract

Visual question answering (VQA) that leverages multi-modality data has attracted intensive interest in real-life applications, such as home robots and clinical diagnosis. Nevertheless, one of the challenges is to design robust learning for different client tasks. This work aims to bridge the gap between the prerequisite of large-scale training data and the constraint on sharing client data, mainly due to confidentiality. We propose the Unidirectional Split Learning with Contrastive Loss (UniCon) to tackle VQA training on distributed data silos. In particular, UniCon trains a global model over the entire data distribution of different clients, learning refined cross-modal representations via contrastive learning. The learned representations of the global model aggregate knowledge from different local tasks. Moreover, we devise a unidirectional split learning framework to enable more efficient knowledge sharing. Comprehensive experiments with five state-of-the-art VQA models on the VQA-v2 dataset demonstrate the efficacy of UniCon, which achieves an accuracy of 49.89% on the validation set of VQA-v2. This work is the first study of VQA under the constraint of data confidentiality using self-supervised Split Learning.

Introduction

The real-world deployment of multi-modal machine learning (MMML) in safety-critical applications such as healthcare needs to address a variety of model vulnerabilities for robust architecture design. Adversarial attacks on MMML usually fall into two types according to the attack target, i.e., data poisoning (Bagdasaryan et al., 2020; Bagdasaryan and Shmatikov, 2020; Lin et al., 2020; Wang et al., 2020b; Walmer et al., 2022) and model poisoning (Wang et al., 2020b; Fang et al., 2020; Zhou et al., 2021). A large body of work has shown the vulnerability of MMML models such as Visual Question Answering (VQA) to these attacks (Walmer et al., 2022), but few studies have addressed robust architecture design for an MMML model. In this work, we devise a novel Split Learning-based VQA framework to alleviate the threats from the two types of poisoning attacks.

Figure 1: The overall architecture of UniCon. UniCon comprises the cross-modal representation learning, the answer projection network for semantic notion understanding of answers, and two adapter networks for the contrastive learning of different model component outputs. UniCon learns refined representations by encouraging the similarity of the relevant component outputs while discouraging the irrelevant component outputs.

Moreover, there has been an increasing need for personal models working at the edge, mainly due to data confidentiality. Nevertheless, a standalone model is prone to having little training data, which decreases the model performance. In this regard, decentralized machine learning methods (Sun et al., 2021) like federated learning (McMahan et al., 2017) are used to facilitate distributed model training across different data silos without disclosing the raw data. Furthermore, in MMML tasks like VQA, whose goal is to answer natural language questions according to images, decentralized learning has been attracting increasing attention for personal applications. For instance, connected home robots could learn to better generalize to new dialogues based on knowledge transferred from other robots via model parameters or representation sharing, so that sensitive personal data will not be disclosed to or used by a third party. Training a VQA model usually necessitates a large amount of multi-modality data, which involves a broad range of personal interests in both natural language texts and image contents. Privacy-preserving VQA has not been studied until recently (Liu et al., 2020). However, the previous work using federated learning could only guarantee data privacy but not model privacy, since training was performed via model parameter sharing.

In this work, we propose a novel contrastive loss-based split learning framework to train a VQA model without disclosing either the training data or the model. In particular, we tackle VQA as a self-supervised learning task instead of a multi-class classification task. The proposed method learns refined cross-modal representations from different question-image-answer triplets without the supervision of labels, so that the representations from relevant triplets stay close while the representations from irrelevant triplets are far apart (Fig. 1). We term the proposed framework the Unidirectional Split Learning with Contrastive Loss (UniCon). UniCon facilitates privacy-preserving model training on distributed data silos, safeguarding vision and natural language data for VQA. We hope that this work will motivate future research in robust learning for personal multimodal models.

The main contributions of this work are the following.

1) This work is the first study of Visual Question Answering (VQA) under the constraint of data confidentiality using Split Learning.

2) We devise a new contrastive loss-based split learning framework, called Unidirectional Split Learning with Contrastive Loss (UniCon), to overcome inefficiency in the bidirectional representations and gradients sharing in classical Split Learning and train a VQA model end-to-end.

3) This paper demonstrates the contrastive learning of different model components for training a global VQA model on the fly. UniCon aligns different modality representations and optimizes the entire model by encouraging the similarity of the relevant model component outputs while discouraging the irrelevant component outputs.

4) We present an in-depth evaluation on the VQA-v2 dataset using a wide range of state-of-the-art models including MFB, BUTD, BAN, MMNas, and MCAN. The empirical results demonstrated the efficacy of UniCon in the different learning settings achieving an accuracy of 49.89% in the VQA-v2 validation set.

Related Work

Multimodal Machine Learning and Visual Question Answering

Information in the real world usually comes in different modalities. Multimodal machine learning (MMML) (Alayrac et al., 2020; Radford et al., 2021; Rouditchenko et al., 2021; Ramesh et al., 2021, 2022) has been intensively studied with significant progress in cross-modal understanding and reasoning. In particular, Visual Question Answering (VQA) is the task to answer a natural language question according to the contents of a presented image. VQA is actively studied (Yang et al., 2016b; Anderson et al., 2018; Kim et al., 2018; Yu et al., 2020) with recent years’ progress in the attention mechanism (Vaswani et al., 2017a). Nevertheless, the vast majority of VQA studies so far are based on the modality network fusion methods (Yang et al., 2016b; Kim et al., 2017) where the VQA task is considered a multi-class classification. Such an assumption hinders the understanding of the semantic notions embedded in the natural language answers. Moreover, in practice, these studies do not touch on the privacy of VQA when using multi-modality data to train a model. In this work, we aim to bridge the gap between the large-scale training of VQA and the constraint of data sharing mainly due to confidentiality (Table 1).

Methods Shared Training Data Shared Training Model Learning Framework Loss Function
Yu et al. (2020) Single fusion model Cross entropy
Zhu et al. (2020) Single fusion model Cross entropy + Contrastive loss
Liu et al. (2020) Federated learning Cross entropy
UniCon (Ours) Unidirectional split learning Contrastive loss
Table 1: Summary of existing work that was designed for tackling VQA tasks. Compared to the previous work, UniCon does not require sharing training data or models. This is enabled by a unidirectional split learning framework and contrastive loss.

Contrastive Learning

The prerequisite of large-scale training tasks like VQA that requires a decent amount of labeled modality data hinders the efficacy of cross-modal learning since data collection and labeling processes are usually effortful. Most VQA research has focused on supervised learning where the one-hot vectors of answer classes are employed to perform a multi-class classification (Yang et al., 2016a; Anderson et al., 2018; Kim et al., 2018). Nevertheless, the learned correlation between the multi-modality input and the answer does not consider the semantic meaning of the answer.

In contrast, recent research in MMML has turned to self-supervised learning (SSL) (Chopra et al., 2005; Chen et al., 2020b). In particular, contrastive learning is commonly used to learn a shared embedding space from unlabelled data, in which similar sample pairs stay close to each other while dissimilar ones from different pairs are far apart. For instance, Barlow Twins (Zbontar et al., 2021) learns a cross-correlation matrix by keeping the representations of different distorted versions of an input sample similar, while minimizing the redundancy between these representations. Moreover, Momentum Contrast (MoCo) (He et al., 2020) trains a visual representation encoder by matching an encoded query to a dictionary of encoded keys based on the Information Noise Contrastive Estimation (InfoNCE) loss (Oord et al., 2018a). Contrastive learning has also been used to train multimodal models. For example, CLIP (Radford et al., 2021) computes a cosine similarity matrix between all possible candidates of images and texts within a batch. Then, the similarity between relevant pairs is maximized and the similarity between irrelevant ones is minimized. Furthermore, studies such as MultiModal Versatile Networks (Alayrac et al., 2020) and AVLnet (Rouditchenko et al., 2021) demonstrate model training on large collections of unlabelled video data with the contrastive loss.

The most recent work has also explored applying SSL to VQA. Question-Image Correlation Estimation (QICE) (Zhu et al., 2020) trains on relevant and irrelevant image and question pairs in VQA datasets to alleviate the language prior problem (Kurakin et al., 2017; Agrawal et al., 2018; Goyal et al., 2019), improving the understanding of image contents. Kim et al. (2021) studied Video Question Answering using SSL by measuring the similarity scores between the representations of questions and ground truth answers. The methods above adopt a two-step training strategy, i.e., cross-modality pretraining with SSL followed by fusion network fine-tuning. In contrast, we train a VQA model end-to-end based on the contrastive learning of different model components and determine a prediction result using the similarity measurement between modality representations. This work leverages several projection networks to align the cross-modal representations and embed the semantic meaning of answers.

Decentralized Machine Learning

Our method is inspired by Decentralized Machine Learning built upon distributed learning (Li et al., 2014; Smola and Narayanamurthy, 2010), such as Federated Learning (McMahan et al., 2017; Sun et al., 2020; Kairouz et al., 2021), Split Learning (Hancox and et al., 2020; Thapa et al., 2022), and Swarm Learning (Warnat et al., 2021). Though methods like Federated Learning have been intensively studied to tackle tasks of a single modality (Sun et al., 2021), studies of multimodal models such as VQA are still insufficient. Notably, for visual language grounding tasks, Liu et al. (2020) proposed a Federated Learning-based VQA framework called aimNet, which consists of an aligning module, an integrating module, and a mapping module. aimNet can acquire fine-grained representations from different clients for improved downstream tasks. However, aimNet is a supervised learning framework that trains on the hard labels for different answers. Moreover, though aimNet does not disclose the training data via Federated Learning, the model parameters are shared to obtain better-refined representations. Such shared model parameters can be used to mount adversarial attacks of model poisoning (Wang et al., 2020b; Fang et al., 2020; Zhou et al., 2021).

To overcome the above challenges, we devise a new Split Learning-based VQA framework that leverages a simpler and faster unidirectional learning flow to improve classical split learning. Furthermore, neither the training data nor the complete model’s parameters are disclosed during the training. To the best of our knowledge, this is the first work to employ split learning to tackle the VQA task and to combine split learning with the contrastive loss.

Methods

In this section, we first introduce the motivation for introducing the Unidirectional Split Learning with Contrastive Loss (UniCon) for visual question answering (VQA) tasks. We then discuss the technical underpinnings of UniCon comprising the unidirectional split learning framework, semantic notion understanding with the answer projection network, and contrastive learning of different model components using two adapter networks.

Motivation

In VQA, the training data of images, questions, and answers are usually prepared beforehand for the model training. Nevertheless, previous studies of adversarial attacks on neural networks have shown that the use of large-scale training data can be correlated with the model’s vulnerability to poisoning attacks. Poisoning attacks can usually be divided into two categories depending on the prior knowledge of the adversary (Sun et al., 2021), i.e., data poisoning and model poisoning. Moreover, in the most recent study of attacks on state-of-the-art VQA models, an adversary exploited the complex fusion mechanisms to successfully embed effective and stealthy backdoors (Walmer et al., 2022). Such threats are enabled by the adversary’s access to either the training data or the VQA models.

To this end, decentralized machine learning (DML) approaches such as Federated Learning (FL) have recently been used to facilitate privacy-preserving learning of VQA models (Liu et al., 2020). FL trains a global model over the entire data distribution of different tasks, thus attaining better-refined representations for downstream tasks. This is enabled by the sharing of local model parameters trained on different client tasks. Intuitively, FL can retain data confidentiality via model sharing; however, sharing the entire model renders the FL-based framework vulnerable to model poisoning attacks. For example, the adversary could reconstruct the input data from the shared model parameters (Hitaj et al., 2017). Our goal is to alleviate the threats of model poisoning while retaining the guarantee of data confidentiality based on an adapted method of Split Learning. The proposed method leverages contrastive learning to align knowledge between different components of a VQA model such that the model can be trained without disclosing the entire model architecture.

Split Learning for Visual Question Answering

Visual Question Answering (VQA) is the task of answering questions according to given image contents. The VQA problem is usually considered a supervised learning task with a fixed list of possible answer options. In particular, let $f$ be the VQA model that takes as input the pair of an image $v$ and a question $q$ and outputs an answer $\hat{a} \in \{a_1, \dots, a_C\}$, where $C$ is the number of answer options. The goal of the VQA model is to predict the correct answer $a$ given the input pair $(v, q) \in D$, where $D$ is the dataset: $\hat{a} = \arg\max_{a \in \{a_1, \dots, a_C\}} p(a \mid v, q)$, where $p(\cdot)$ is the conditional probability.

Figure 2: Split Learning vs. UniCon: Split Learning without label sharing (left) employs the one-hot vectors of answer classes as labels to train the model. The propagation direction is bidirectional increasing the client’s waiting time for the computation of the global component. The Unidirectional Split Learning with Contrastive Loss (right) considered in this paper, however, learns the semantic notion of the answer text with contrastive learning. Moreover, the direction of the propagation is unidirectional, therefore, the client components can be computed simultaneously without waiting for the computation of the global components.

Moreover, Split Learning (SL) splits a complete model into different parts and trains a global model via interactive representation and gradient sharing. There are mainly two types of SL for different tasks (Thapa et al., 2022). In particular, we consider Split Learning without label sharing that wraps the model on the server around the end layer and sends the layer output back to a client (Fig. 2.a). This architecture can guarantee the data confidentiality of a client since neither the input data nor the labels are shared for the training. Furthermore, a complete model is split into three different components for the training, i.e., a global component $f_s$ and two client components $f_{c_1}$ and $f_{c_2}$. We assume there are $K$ clients. The $k$th client has its own dataset $D^{(k)} = \{(v_i, q_i, a_i)\}_{i=1}^{N^{(k)}}$, where $N^{(k)}$ is the sample size of dataset $D^{(k)}$. Here, $D = \bigcup_{k=1}^{K} D^{(k)}$ and $N = \sum_{k=1}^{K} N^{(k)}$, where $N$ is the sample size of $D$. We suppose that the model components of different clients share the same architecture in SL and each client cannot share data mainly due to data confidentiality.

Then, SL proceeds by iterating the following steps: (1) each client computes the output of the component $f_{c_1}$ with its local inputs $(v, q)$ and sends the output to the server, (2) the server further forward-propagates the client input with the global component $f_s$ and sends back the output, (3) the probability distribution is computed with the client component $f_{c_2}$ and the loss is computed with the ground truths $a$, (4) the gradients of each client with respect to each component’s parameters are computed via the inverse path $f_{c_2} \rightarrow f_s \rightarrow f_{c_1}$, (5) the gradients are averaged across clients to update the different components $f_{c_1}$, $f_s$, and $f_{c_2}$. The process above is repeated until a given training goal is achieved. We refer to the appendix for full details on the SL algorithm.
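The five steps above can be sketched with a deliberately tiny example. This is an illustration only, not the authors' implementation: each component is a single scalar weight, the loss is a squared error rather than cross entropy, and the function names `sl_client_round` and `sl_step` are hypothetical.

```python
# Toy split learning round with scalar components (illustrative assumptions:
# scalar weights, squared-error loss, plain gradient descent).

def sl_client_round(w_c1, w_s, w_c2, x, y):
    """One forward/backward pass for one client sample; returns gradients."""
    h1 = w_c1 * x        # (1) client: first component, output sent to server
    h2 = w_s * h1        # (2) server: global component, output sent back
    y_hat = w_c2 * h2    # (3) client: final component, then the loss
    err = y_hat - y      # derivative of 0.5 * (y_hat - y)^2 w.r.t. y_hat
    # (4) gradients flow back along the inverse path c2 -> server -> c1
    g_c2 = err * h2
    g_h2 = err * w_c2    # passed from the client to the server
    g_s = g_h2 * h1
    g_h1 = g_h2 * w_s    # passed from the server back to the client
    g_c1 = g_h1 * x
    return g_c1, g_s, g_c2


def sl_step(weights, client_samples, lr=0.1):
    """(5) Average the clients' gradients and update every component."""
    grads = [sl_client_round(*weights, x, y) for x, y in client_samples]
    k = len(grads)
    avg = [sum(g[i] for g in grads) / k for i in range(3)]
    return tuple(w - lr * g for w, g in zip(weights, avg))
```

Note how the client has to wait twice per sample, once for the server's forward output and once for the server's backward gradient; this round trip is the cost that the unidirectional design aims to remove.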

Unidirectional Split Learning with Contrastive Loss

Previous work showed the plausible usages and efficacy of Contrastive Learning (Chopra et al., 2005) in a single modality (Chen et al., 2020b; He et al., 2020; Zbontar et al., 2021). In multimodal learning, Contrastive Learning has recently been used to learn refined cross-modal representations (Radford et al., 2021; Rouditchenko et al., 2021) by encouraging multimodal data from a relevant input to have more similar representations compared to data from irrelevant inputs. For the VQA tasks, one challenge is that most models are supervised where the one-hot vectors of answer classes are employed to perform a multi-class classification (Anderson et al., 2018; Kim et al., 2018). Therefore, the semantic notions of answers are usually not well correlated with the inputs reducing the generality of the trained model to unknown samples. To this end, we propose the use of the contrastive loss in Split Learning to correlate vision contents and language semantic notions such that each model component learns better-refined representations for the VQA tasks. Moreover, since the learning flow in classical SL is bidirectional between a client and the server, the waiting time for computing the next component is greatly increased (Fig. 2.a). By using the contrastive loss, we adapt SL to the unidirectional sharing architecture where the representations are shared by clients to the server and the gradients are shared by the server to clients. The unidirectional split learning with contrastive loss (UniCon) allows a client to compute the representations and gradients without waiting for the computation of the global component, which is more efficient and simpler compared to SL (Fig. 2.b).

Semantic Notion Understanding with Answer Projection Network.

The vast majority of VQA models are supervised where the one-hot vectors of answer classes are employed to perform a multi-class classification (Yang et al., 2016a; Anderson et al., 2018; Kim et al., 2018). To also learn semantic notions from the answers, we devise an Answer Projection Network (APN) to embed the answer language contexts into a feature vector $h_a$. Notably, APN comprises three building blocks: a text preprocessing module, a word-to-vector embedding using GloVe (Pennington et al., 2014), and a linear projection layer. The detailed architecture is described in Section Experiments.

Adapter Networks and the Shared Projection Space.

We propose the use of two adapter networks to project the outputs from different model components into a shared projection space. In this regard, a nonlinear projection head on more complex representations can improve the performance, while for simpler modality representations a nonlinear projection is not beneficial (Chen et al., 2020a, b; Alayrac et al., 2020). In particular, we replace the output layer of a VQA model $f$ with the Nonlinear Head Adapter (NHA) network that projects the high-level cross-modal representations from the layer before the output layer into the shared projection space, yielding $z^h$. Furthermore, we devise the Linear Tail Adapter (LTA) to project the low-level representations of APN into the shared projection space, yielding $z^t$. Note that $z^h$ and $z^t$ have the same dimension $d$. The detailed architectures of NHA and LTA are described in Section Experiments.

Learning with the Information Noise Contrastive Estimation loss.

The Information Noise Contrastive Estimation (InfoNCE) loss is commonly used for contrastive learning (Oord et al., 2018b) to identify the positive sample from a set of unrelated negative samples. Notably, UniCon employs the relevant NHA and LTA outputs in the shared projection space of the same input triplets within one training batch as positive pairs $(z^h_i, z^t_i)$, $i = 1, \dots, B$, where $B$ is the sample size of the training batch. In contrast, given an NHA output $z^h_i$, any irrelevant LTA outputs $z^t_j$ with $j \neq i$ are employed as the negative keys of the NHA output. Then, we train the model by aligning the knowledge between the component outputs in positive pairs while discouraging the similarity between the outputs in negative pairs (Fig. 1). We formulate the loss as follows

$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(z^h_i \cdot z^t_i / \tau)}{\sum_{j=1}^{B} \mathbb{1}_{[j \neq i]} \exp(z^h_i \cdot z^t_j / \tau)} \quad (1)$$

where $\tau$ is the temperature parameter and $\mathbb{1}_{[j \neq i]}$ is an indicator function: 1 if $j \neq i$, 0 otherwise.
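As a minimal sketch of this loss, the function below treats matching indices in a batch as positive pairs and all other LTA outputs as negative keys. It assumes a dot-product similarity, a denominator that sums over the negative keys only, and a mean over the batch; the name `info_nce` and the plain-list inputs are illustrative choices, not the authors' code.

```python
import math


def info_nce(nha_out, lta_out, tau=0.07):
    """Mean InfoNCE-style loss over a batch of paired vectors.

    nha_out[i] and lta_out[i] form the positive pair; lta_out[j] with
    j != i are the negative keys of nha_out[i]. Needs a batch size >= 2.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    n = len(nha_out)
    total = 0.0
    for i in range(n):
        pos = dot(nha_out[i], lta_out[i]) / tau
        negs = [dot(nha_out[i], lta_out[j]) / tau for j in range(n) if j != i]
        # log-sum-exp over the negative keys, stabilised by the max logit
        m = max(negs)
        log_denom = m + math.log(sum(math.exp(s - m) for s in negs))
        total += -(pos - log_denom)
    return total / n
```

With the temperature of 0.07, a batch whose pairs are correctly aligned gives a much lower loss than the same batch with the LTA outputs shuffled, which is the behaviour the training objective relies on.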

Model Parameter Aggregation.

Parameter aggregation of models trained on different VQA tasks aims to improve the generality of the models to unseen samples. This is enabled by aggregating the update gradients of different components after each epoch’s training. Since sharing all components’ gradients with one server for aggregation might disclose the complete model’s architecture, we employ a dual-server aggregation strategy where an auxiliary parameter server is adopted to aggregate the gradients of the client components (the VQA model and APN) and the main server is used to aggregate the gradients of the global components (NHA and LTA), respectively. The dual-server aggregation can limit an adversary’s prior knowledge of the training model, thus alleviating model poisoning attacks such as membership inference (Nasr et al., 2019) and information stealing (Wang et al., 2020a). However, since the main focus of this work is not adversarial attacks and defense of VQA models, we leave the discussion on the model’s robustness against poisoning attacks to future study. We formulate the parameter aggregation of APN, VQA, NHA, and LTA by the following

$$\theta_{t+1} = \theta_t + \frac{1}{K} \sum_{k=1}^{K} \Delta\theta^{(k)}_t \quad (2)$$

where $\theta$ is the parameters of a model component from $\{\text{APN}, \text{VQA}, \text{NHA}, \text{LTA}\}$, $K$ is the number of clients, and $\Delta\theta^{(k)}_t$ is the update computed from the $k$th client’s gradients at epoch $t$.

Furthermore, each client updates the local components and the main server updates the global components based on the aggregated gradients, respectively. We refer to the appendix for full details on the UniCon algorithm.
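The dual-server aggregation can be sketched as two independent averaging steps, one per server. The sketch assumes a simple unweighted mean over clients and that the VQA model and APN are the client components while NHA and LTA are the global components; the component keys and function names are hypothetical.

```python
def average_updates(client_updates):
    """Element-wise mean of per-client update vectors for one component."""
    k = len(client_updates)
    return [sum(vals) / k for vals in zip(*client_updates)]


def aggregate(updates_by_client):
    """Aggregate updates on two servers.

    updates_by_client: one dict per client, mapping a component name to a
    flat list of floats (the update for that component).
    """
    client_side = ("VQA", "APN")  # assumed: handled by the auxiliary server
    global_side = ("NHA", "LTA")  # assumed: handled by the main server
    aux = {c: average_updates([u[c] for u in updates_by_client])
           for c in client_side}
    main = {c: average_updates([u[c] for u in updates_by_client])
            for c in global_side}
    return aux, main
```

Because each server only ever sees its own subset of components, neither one can assemble the complete model from the updates it receives.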

Accuracy with Representation Similarity Measurement

The difficulty in measuring the accuracy in UniCon is that we do not have a discriminative model to infer the class of the input. Inspired by Radford et al. (2021), we evaluate the product similarity scores between the representation $z^h_i$ of an image and question input pair from the hold-out validation dataset of VQA and the representations of answer options $\{z^t_c\}_{c=1}^{C}$, where $z^t_c$ denotes the representation of answer option $a_c$. Then, we select the answer with the highest similarity with the input as the predicted answer. There also exist studies reporting the average accuracy of different batches, i.e., measuring the similarity scores within the same batch. However, we found that such a metric can easily produce a much higher accuracy compared to the evaluation metric based on all answer options. Therefore, we use as the metric the similarity scores between the representations of input and all answer options. We formulate the accuracy over the $M$ validation samples with ground-truth answer indices $y_i$ by the following

$$\text{Acc} = \frac{1}{M} \sum_{i=1}^{M} \mathbb{1}\left[\arg\max_{c \in \{1, \dots, C\}} z^h_i \cdot z^t_c = y_i\right] \quad (3)$$
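A minimal sketch of this evaluation, assuming a dot-product similarity and one ground-truth answer index per sample: every input representation is scored against all answer representations, and the prediction is the highest-scoring answer. The function names are illustrative.

```python
def predict(z_input, answer_reps):
    """Index of the answer whose representation has the largest dot product."""
    scores = [sum(a * b for a, b in zip(z_input, z_ans))
              for z_ans in answer_reps]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(inputs, answer_reps, labels):
    """Fraction of inputs whose top-scoring answer matches the label."""
    hits = sum(predict(z, answer_reps) == y for z, y in zip(inputs, labels))
    return hits / len(labels)
```

Scoring against all answer options rather than only those in the current batch is what makes this metric stricter than the within-batch variant mentioned above.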

Experiments

Datasets and Models

Datasets.

Our method is evaluated on the benchmark dataset VQA-v2 (Agrawal et al., 2017) where images are from the COCO dataset (Lin et al., 2014) and we report the results on its validation split using Eq. 3.

Models.

We studied our approach with the following state-of-the-art Visual Question Answering models: (1) Multi-modal Factorized Bilinear (MFB) (Yu et al., 2017) combines multi-modal features using an end-to-end network architecture to jointly learn the image and question attention; (2) Bottom-Up and Top-Down attention mechanism (BUTD) (Anderson et al., 2018) enables attention to be calculated at the level of objects and other salient image regions. The bottom-up mechanism based on Faster R-CNN proposes image regions, while the top-down mechanism determines feature weightings; (3) Bilinear Attention Networks (BAN) (Kim et al., 2018) considers bilinear interactions among two groups of input channels and extracts the joint representations for each pair of channels; (4) Multimodal neural architecture search (MMNas) (Yu et al., 2020) uses a gradient-based algorithm to learn the optimal architecture; (5) Modular Co-Attention Network (MCAN) (Yu et al., 2019b) consists of Modular Co-Attention (MCA) layers cascaded in depth where each MCA layer models both the self-attention and the guided-attention of the input channels.

Implementation Details

The proposed method is model agnostic and can be applied to different VQA models. We used PyTorch and the OpenVQA platform (Yu et al., 2019a) to implement the VQA models. We set the hyperparameters of different VQA models to their default author-recommended values. Due to the training time cost, we evaluated the model performance with three different seeds and reported the mean. We conducted experiments on four NVIDIA A100 Tensor Core GPUs with 40GB memory each. The code will be made publicly available upon acceptance.

Component Architecture and Hyperparameters

We employed the following architecture for the three model components in UniCon, respectively. For APN, we used GloVe (Pennington et al., 2014) trained on Common Crawl to convert the pre-processed texts with a maximum of eight words into the dimension of $8 \times 300$. Then, a fully connected layer followed by the ReLU activation function was applied to project the representations into $8 \times 512$. Finally, a max pooling layer was applied to the eight words with respect to each dimension, producing a 512-dimensional vector. Moreover, for LTA, we used a fully connected layer to project the representations received from each client to the shared space with a dimension of 256. For NHA, we used a two-layer architecture: (1) a fully connected layer to project the representations to a dimension of 512 followed by the ReLU, and (2) another fully connected layer to project the middle layer output to a dimension of 256 followed by batch normalization. We refer to the appendix for full details on the model architecture.
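The data flow through the three components can be sketched in plain Python to make the dimensions concrete. This is a shape-level illustration under stated assumptions: weights are random, GloVe vectors are taken to be 300-dimensional, the input width of NHA depends on the chosen VQA model and is left as a parameter, and the final batch normalization of NHA is omitted. All function names are hypothetical.

```python
import random


def linear(x, w, b):
    """y = W x + b for a single vector x; w is a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def init(out_dim, in_dim, rng):
    """Random weight matrix (out_dim x in_dim) and a zero bias."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
         for _ in range(out_dim)]
    return w, [0.0] * out_dim


def apn(word_vecs, w, b):
    """Per-word linear + ReLU, then max pooling over words per dimension."""
    projected = [relu(linear(v, w, b)) for v in word_vecs]
    return [max(col) for col in zip(*projected)]


def lta(h, w, b):
    """Linear Tail Adapter: a single fully connected layer."""
    return linear(h, w, b)


def nha(h, w1, b1, w2, b2):
    """Nonlinear Head Adapter: FC + ReLU, then FC (batch norm omitted)."""
    return linear(relu(linear(h, w1, b1)), w2, b2)
```

For an answer of up to eight words, `apn` maps eight 300-dimensional word vectors to one 512-dimensional vector, and `lta` then maps it to the 256-dimensional shared space where it is compared with the `nha` output.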

Furthermore, we employed a batch size of 128, a total of 20 epochs (693400 steps), the Adam optimizer with parameters $\beta_1$, $\beta_2$, and $\epsilon$, and an initial learning rate of 0.0001 with a linear warmup of 10K steps and a decay rate of 0.2 at epochs 10 and 15. The hyperparameters were chosen using grid search. We found that a smaller learning rate or a larger batch size would generally hinder the model learning, and the model ceased learning in some cases with larger parameter sizes. In addition, for the InfoNCE loss, we adopted a temperature of 0.07 as in (Wu et al., 2018; Patrick et al., 2020; Alayrac et al., 2020).
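The learning rate schedule described above can be written as a small function. This sketch makes several assumptions that the text does not state: the warmup factor and the step decay are multiplied together, epochs are counted from zero, and the 693400 steps are split evenly across the 20 epochs. The function name and argument names are hypothetical.

```python
def learning_rate(step, base_lr=1e-4, warmup_steps=10_000,
                  steps_per_epoch=693_400 // 20, milestones=(10, 15),
                  gamma=0.2):
    """Linear warmup followed by step decay at the given epoch milestones."""
    warmup = min(1.0, step / warmup_steps)      # ramps from 0 to 1
    epoch = step // steps_per_epoch             # zero-based epoch index
    decays = sum(epoch >= m for m in milestones)
    return base_lr * warmup * gamma ** decays
```

Under these assumptions the rate climbs linearly to 0.0001 over the first 10K steps, drops by a factor of 0.2 when epoch 10 begins, and drops by the same factor again at epoch 15.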

Empirical Results

Figure 3: Qualitative analysis of model predictions. We show different success and failure cases for different question types. VQA models trained with the contrastive loss tend to successfully answer the questions about the general contents of an image while poorly answering the counting questions, e.g., counting the pictures on the wall, and the questions about the detail of an image, e.g., what is the man wearing on his head.
Contrastive learning-based VQA accuracy (%)
VQA Models Overall Yes/No Number Other
BAN 36.23 66.90 12.71 19.11
BUTD 45.08 75.82 29.27 25.86
MFB 46.98 73.95 32.81 30.20
MCAN-s 53.18 81.06 41.95 34.93
MCAN-l 53.32 81.21 42.66 34.90
MMNas-s 51.54 78.06 39.76 34.46
MMNas-l 53.82 80.06 42.86 36.75
Table 2: Effectiveness of the contrastive learning-based VQA for different models. The highest reported accuracy under each task is in bold.

We performed extensive experiments based on the five state-of-the-art VQA models above. In particular, for MMNas and MCAN, we further considered the effectiveness of different model complexities using MMNas-small (MMNas-s) and MMNas-large (MMNas-l), and MCAN-small (MCAN-s) and MCAN-large (MCAN-l). The detailed architecture designs of these models followed the settings in (Yu et al., 2020, 2019b). Moreover, we evaluated the proposed method’s performance based on Eq. 3. Notably, for each triplet in the validation set, we input the image and question pair to UniCon. Then, the output representation of the nonlinear head adapter is used to measure the similarity scores with the representations of the 3048 answers from the linear tail adapter. Finally, the prediction result is the answer text with the highest similarity score and the accuracy is computed based on the predictions and the ground truths of the validation set. Note that the label space of VQA-v2 is 3048-dimension which is much larger than the datasets considered in CLIP (Radford et al., 2021), hence, the VQA tasks are more challenging to perform the contrastive learning. The largest label space in CLIP is ImageNet (Deng et al., 2009) with only 1000 classes.

Furthermore, we studied the effectiveness of the contrastive learning-based VQA. The results are shown in Table 2. The empirical results demonstrate that the contrastive learning-based approach can be effectively applied to most VQA models. BAN showed the worst performance, particularly in counting the number. MMNas-l showed the best overall performance of 53.82% outperforming the other models for counting the number (Number) and answering the contents of an image (Other). Nevertheless, MCAN-l performed the best in the Yes/No questions.

UniCon accuracy (%)
VQA Models Overall Yes/No Number Other
BAN 35.11 63.84 11.06 19.61
BUTD 40.96 66.98 13.34 28.74
MFB 42.43 68.65 23.33 27.52
MCAN-s 48.42 74.93 30.88 32.89
MCAN-l 48.44 77.44 30.72 32.01
MMNas-s 45.14 70.55 28.04 30.33
MMNas-l 49.89 74.85 36.88 34.33
Table 3: Effectiveness of UniCon for different models where two clients train a global model without disclosing their data or models. The highest reported accuracy under each task is in bold.
Figure 4: Similarity scores between component outputs before and after training in UniCon. The similarity scores are optimized such that the positive pairs have a higher similarity while the negative pairs have a lower similarity.

To evaluate the efficacy of UniCon, we randomly divided the training set into two subsets. Then, we employed the two subsets as the local datasets of two clients to perform the split learning. The two clients share the same model component architecture and cannot share data, mainly due to data confidentiality. The performance was evaluated on the global model after each round’s parameter aggregation. We show the numerical results in Table 3. As in the standalone setting, MMNas-l outperformed the other models overall, while MCAN-l performed the best on the Yes/No questions. Moreover, there exists a trade-off between the model performance and the use of split learning for confidentiality. Compared to the overall accuracy of 53.82% of MMNas-l in the standalone training over the entire dataset, the split learning-based approach obtained an accuracy of 49.89%. Nevertheless, when data sharing becomes an obstacle, the proposed approach can benefit the model training by leveraging other clients’ knowledge of different tasks. In addition, when data sharing is not allowed, a standalone model usually only has access to part of the data and trains over a partial distribution. UniCon allows clients to train over the entire data distribution, thus improving the global model performance. We aim to study the efficacy of UniCon in improving the performance of standalone learning over the partial distribution in our future endeavors. We also present the similarity scores between the pairs of NHA and LTA outputs within one training batch before and after training in Fig. 4.

Conclusion

We propose Unidirectional Split Learning with Contrastive Loss (UniCon) for VQA, which learns fine-grained cross-modal representations by aligning model component outputs of different client tasks. We evaluate the efficacy of the UniCon framework on five different VQA models on the VQA-v2 dataset. Extensive experiments show that our approach can be applied to different models, demonstrating its effectiveness and universality. This work can be extended by considering a broader list of answer options using prompt engineering (Gao et al., 2021; Radford et al., 2021), which could further improve the prediction performance. Moreover, the robustness of UniCon against adversarial attacks will be studied further using methods such as Differential Privacy (Abadi et al., 2016). We hope that this work will motivate future research in robust learning for personal multimodal models.

References

Appendix A Appendices

VQA-v2 Dataset

Our method was evaluated on the benchmark VQA-v2 dataset (Agrawal et al., 2017) (Fig. 5), which includes different question types and images from the COCO dataset (Lin et al., 2014).

Figure 5: VQA-v2 dataset including the question types of “yes/no”, “number”, and “other”.

Statistical T-test

We performed a two-tailed paired t-test (Kim, 2015) to compare the results in Table 2 and Table 3. Comparing the VQA model variants based on contrastive learning without split learning against the seven VQA model variants based on UniCon yielded a t-value of 1.357. With a p-value threshold of 0.05 and the corresponding degrees of freedom, the critical t-value is 2.477. Since the observed t-value lies in the range [-2.477, 2.477], using split learning has no significant effect on the prediction accuracy. Therefore, UniCon achieves a competitive prediction result while safeguarding a client’s training data and model.
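The test statistic of a paired t-test can be computed as in the sketch below. The function names are hypothetical and the per-model accuracies are not reproduced here; the critical value must be supplied by the caller, for example from a t-table.

```python
import math


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the pairwise differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("differences have zero variance")
    return mean / math.sqrt(var / n), n - 1


def is_significant(t_value, critical_value):
    """Two-tailed decision: significant only if |t| exceeds the critical value."""
    return abs(t_value) > critical_value
```

With the values reported above, `is_significant(1.357, 2.477)` evaluates to `False`, which matches the conclusion that split learning has no significant effect on accuracy.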

Attention Map Visualization

The attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017b) consists of three main components, namely the query Q, the key K, and the value V. Notably, in Visual Question Answering (VQA), attention weights are learned to represent the relative importance of visual representations at different spatial locations with respect to a given question. The attention layers are updated such that more weight is put on the visual regions that are more relevant to the question. Following Yang et al. (2016a), we visualized the attention map by extracting the weight matrix from the learned attention mechanism (Fig. 6).

Figure 6: The highlighted image regions are of importance to answering the given question. The attention maps are generated with the weight matrix of the attention mechanism in the cross-modal representation learning. The questions are “What color is the sail” for the samples above and “How many dogs is the man walking” for the samples below.
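As an illustration of how such a weight matrix arises, the sketch below computes attention weights for a single question query over a set of image-region keys. It assumes the scaled dot-product form of Vaswani et al.; the evaluated VQA models may use different attention variants, and the function names are hypothetical.

```python
import math


def attention_weights(query, keys):
    """Softmax of scaled dot products between one query and each key.

    Each weight can be read as the relative importance of one image
    region for the question; the weights sum to 1.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weighted sum of the value vectors under the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
```

Reshaping the weights of all regions back to the spatial grid of the image gives a heat map of the kind shown in Fig. 6.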

Model Components Architecture

We show the detailed architecture for the nonlinear head adapter (NHA), the linear tail adapter (LTA), and the answer projection network (APN) in Fig. 7.

Figure 7: The detailed architecture of NHA, LTA, and APN.

More Results of the Qualitative Analysis

The qualitative analysis examines success and failure cases for the different question types. Fig. 8 shows more examples of the prediction results of UniCon. The predictions are made with the similarity measurement approach over a total of 3048 different answers.

Figure 8: More prediction results of UniCon.
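Similarity-based prediction over a fixed answer set can be sketched as below. Cosine similarity, the dictionary layout, and the function names are assumptions made for illustration; the appendix does not specify the similarity function at this point.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_answer(fused_representation, answer_embeddings):
    """Return the answer whose embedding is most similar to the fused
    image-question representation.

    answer_embeddings maps each candidate answer string to its vector.
    """
    return max(
        answer_embeddings,
        key=lambda ans: cosine_similarity(fused_representation, answer_embeddings[ans]),
    )
```

In the setting described above, `answer_embeddings` would hold one entry for each of the 3048 candidate answers.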

Algorithms

Here we present the algorithms for Split Learning (without label sharing) and Unidirectional Split Learning with Contrastive Loss (UniCon) in Algorithms 1 and 2, respectively.

Notation: parameters of each of the three model parts; partial derivatives at each local epoch (two terms); total training rounds; total local epochs; negative log likelihood loss; learning rate.
for each round do
   for each client in parallel do
      for each local epoch do
         call ServerForward
         call ServerBackprop
      end for
   end for
end for

function ServerForward
   return result to client

function ServerBackprop
   return result to client
Algorithm 1 Split Learning (without label sharing)
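The following toy sketch illustrates the general pattern of split learning without label sharing: the client keeps the first and last parts of the model together with the labels, while the server holds the middle part and exchanges only activations and gradients. It is not the authors' algorithm. Every part is a single scalar weight, squared error replaces the negative log likelihood loss for brevity, and all names are hypothetical.

```python
class Server:
    """Holds the middle part of the model; never sees inputs or labels."""

    def __init__(self, weight, lr):
        self.weight = weight
        self.lr = lr
        self._last_input = None

    def forward(self, activation):
        """Apply the server part to the activation sent by the client."""
        self._last_input = activation
        return self.weight * activation

    def backprop(self, grad_output):
        """Update the server weight and return the gradient for the client."""
        grad_input = grad_output * self.weight  # computed with the old weight
        self.weight -= self.lr * grad_output * self._last_input
        return grad_input


def client_step(x, y, head_w, tail_w, server, lr):
    """One local step: head -> server -> tail, with the loss kept on the client."""
    h = head_w * x                    # client head forward
    s = server.forward(h)             # server forward
    pred = tail_w * s                 # client tail forward
    loss = 0.5 * (pred - y) ** 2      # assumed loss, stands in for NLL
    grad_pred = pred - y
    grad_tail = grad_pred * s
    grad_s = grad_pred * tail_w
    grad_h = server.backprop(grad_s)  # server backward
    grad_head = grad_h * x
    return head_w - lr * grad_head, tail_w - lr * grad_tail, loss
```

Calling `client_step` repeatedly on the same sample lowers the loss while the label `y` never leaves the client.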
Notation: parameters of each of the four model components; total training rounds; total local epochs; learning rate; Information Noise Contrastive Estimation (InfoNCE) loss.
for each round do
   for each client in parallel do
      for each local epoch do
         call Server with two arguments
      end for
   end for
end for

function Server (two arguments)
   return result to client
Algorithm 2 Unidirectional Split Learning with Contrastive Loss (UniCon)
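The InfoNCE loss named in Algorithm 2 can be sketched as below for a batch whose pairwise similarity matrix is already computed. The function name, the convention that positive pairs sit on the diagonal, and the temperature value are assumptions for illustration, not details taken from the paper.

```python
import math


def info_nce(similarities, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    similarities[i][j] is the similarity between anchor i and candidate j.
    Candidate i is treated as the positive for anchor i (assumption); all
    other candidates in the row act as negatives.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [s / temperature for s in row]
        top = max(logits)  # log-sum-exp with the maximum factored out
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(similarities)
```

Minimizing this quantity raises the similarity of positive pairs relative to negative pairs, which is the behavior described in the caption of Fig. 4.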