UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering
Abstract
Visual question answering (VQA) that leverages multi-modality data has attracted intensive interest in real-life applications, such as home robots and clinic diagnoses. Nevertheless, one of the challenges is to design robust learning for different client tasks. This work aims to bridge the gap between the prerequisite of large-scale training data and the constraint of client data sharing mainly due to confidentiality. We propose the Unidirectional Split Learning with Contrastive Loss (UniCon) to tackle VQA tasks training on distributed data silos. In particular, UniCon trains a global model over the entire data distribution of different clients learning refined cross-modal representations via contrastive learning. The learned representations of the global model aggregate knowledge from different local tasks. Moreover, we devise a unidirectional split learning framework to enable more efficient knowledge sharing. The comprehensive experiments with five state-of-the-art VQA models on the VQA-v2 dataset demonstrated the efficacy of UniCon, achieving an accuracy of 49.89% in the validation set of VQA-v2. This work is the first study of VQA under the constraint of data confidentiality using self-supervised Split Learning.
Introduction
The real-world deployment of multi-modal machine learning (MMML) in safety-critical applications such as healthcare needs to address a variety of model vulnerabilities for robust architecture design. Adversarial attacks on MMML usually fall into two types according to the attacking target, i.e., data poisoning (Bagdasaryan et al., 2020; Bagdasaryan and Shmatikov, 2020; Lin et al., 2020; Wang et al., 2020b; Walmer et al., 2022) and model poisoning (Wang et al., 2020b; Fang et al., 2020; Zhou et al., 2021). A large body of work has shown the vulnerability of MMML models such as Visual Question Answering (VQA) to these attacks (Walmer et al., 2022), but few studies have addressed robust architecture design for an MMML model. In this work, we devise a novel Split Learning-based VQA framework to alleviate the threats from the two types of poisoning attacks.
Moreover, there has been an increasing need for personal models working at the edge, mainly due to data confidentiality. Nevertheless, a standalone model is prone to having little training data, which degrades model performance. In this regard, decentralized machine learning methods (Sun et al., 2021) like federated learning (McMahan et al., 2017) are used to facilitate distributed model training across different data silos without disclosing the raw data. Furthermore, in MMML tasks like VQA, whose goal is to answer natural language questions according to images, decentralized learning has been attracting increasing attention for personal applications. For instance, connected home robots could learn to better generalize to new dialogues based on knowledge transferred from other robots via model parameters or representation sharing, so that sensitive personal data will not be disclosed to or used by a third party. Training a VQA model usually necessitates a large amount of multi-modality data, which involves a broad range of personal interests in both natural language texts and image contents. Privacy-preserving VQA had not been studied until recently (Liu et al., 2020). However, the previous work using federated learning could only guarantee data privacy but not model privacy, since training was performed via model parameter sharing.
In this work, we propose a novel contrastive loss-based split learning framework to train a VQA model without disclosing either the training data or the model. In particular, we tackle VQA as a self-supervised learning task instead of a multi-class classification task. The proposed method learns refined cross-modal representations from different question-image-answer triplets without the supervision of labels, so that the representations from relevant triplets stay close while the representations from irrelevant triplets are far apart (Fig. 1). We term the proposed framework Unidirectional Split Learning with Contrastive Loss (UniCon). UniCon facilitates privacy-preserving model training on distributed data silos, safeguarding vision and natural language data for VQA. We hope that this work will motivate future research in robust learning for personal multimodal models.
The main contributions of this work are the following.
1) This work is the first study of Visual Question Answering (VQA) under the constraint of data confidentiality using Split Learning.
2) We devise a new contrastive loss-based split learning framework, called Unidirectional Split Learning with Contrastive Loss (UniCon), to overcome inefficiency in the bidirectional representations and gradients sharing in classical Split Learning and train a VQA model end-to-end.
3) This paper demonstrates the contrastive learning of different model components for training a global VQA model on the fly. UniCon aligns different modality representations and optimizes the entire model by encouraging the similarity of the relevant model component outputs while discouraging the irrelevant component outputs.
4) We present an in-depth evaluation on the VQA-v2 dataset using a wide range of state-of-the-art models including MFB, BUTD, BAN, MMNas, and MCAN. The empirical results demonstrated the efficacy of UniCon in the different learning settings achieving an accuracy of 49.89% in the VQA-v2 validation set.
Related Work
Multimodal Machine Learning and Visual Question Answering
Information in the real world usually comes in different modalities. Multimodal machine learning (MMML) (Alayrac et al., 2020; Radford et al., 2021; Rouditchenko et al., 2021; Ramesh et al., 2021, 2022) has been intensively studied with significant progress in cross-modal understanding and reasoning. In particular, Visual Question Answering (VQA) is the task to answer a natural language question according to the contents of a presented image. VQA is actively studied (Yang et al., 2016b; Anderson et al., 2018; Kim et al., 2018; Yu et al., 2020) with recent years’ progress in the attention mechanism (Vaswani et al., 2017a). Nevertheless, the vast majority of VQA studies so far are based on the modality network fusion methods (Yang et al., 2016b; Kim et al., 2017) where the VQA task is considered a multi-class classification. Such an assumption hinders the understanding of the semantic notions embedded in the natural language answers. Moreover, in practice, these studies do not touch on the privacy of VQA when using multi-modality data to train a model. In this work, we aim to bridge the gap between the large-scale training of VQA and the constraint of data sharing mainly due to confidentiality (Table 1).
Methods | Shared Training Data | Shared Training Model | Learning Framework | Loss Function |
---|---|---|---|---|
Yu et al. (2020) | ✓ | | Single fusion model | Cross entropy |
Zhu et al. (2020) | ✓ | | Single fusion model | Cross entropy + Contrastive loss |
Liu et al. (2020) | | | Federated learning | Cross entropy |
UniCon (Ours) | | | Unidirectional split learning | Contrastive loss |
Contrastive Learning
Large-scale training tasks like VQA require a large amount of labeled multi-modality data, which hinders the efficacy of cross-modal learning since data collection and labeling are usually laborious. Most VQA research has focused on supervised learning where the one-hot vectors of answer classes are employed to perform a multi-class classification (Yang et al., 2016a; Anderson et al., 2018; Kim et al., 2018). Nevertheless, the learned correlation between the multi-modality input and the answer does not consider the semantic meaning of the answer.
In contrast, recent research in MMML has turned to self-supervised learning (SSL) (Chopra et al., 2005; Chen et al., 2020b). In particular, contrastive learning is commonly used to learn a shared embedding space from unlabelled data, in which similar sample pairs stay close to each other while dissimilar ones from different pairs are far apart. For instance, Barlow Twins (Zbontar et al., 2021) learns a cross-correlation matrix by keeping the representations of different distorted versions of an input sample similar, while minimizing the redundancy between these representations. Moreover, Momentum Contrast (MoCo) (He et al., 2020) trains a visual representation encoder by matching an encoded query to a dictionary of encoded keys based on the Information Noise Contrastive Estimation (InfoNCE) loss (Oord et al., 2018a). Contrastive learning has also been used to train multimodal models. For example, CLIP (Radford et al., 2021) computes a cosine similarity matrix between all possible candidates of images and texts within a batch. Then, the similarity between relevant pairs is maximized and the similarity between irrelevant ones is minimized. There are also studies such as MultiModal Versatile Networks (Alayrac et al., 2020) and AVLnet (Rouditchenko et al., 2021) that demonstrate model training on large collections of unlabelled video data with the contrastive loss.
Recent work has also explored the application of SSL to VQA. Question-Image Correlation Estimation (QICE) (Zhu et al., 2020) trains on relevant and irrelevant image and question pairs in VQA datasets to alleviate the language prior problem (Kurakin et al., 2017; Agrawal et al., 2018; Goyal et al., 2019), improving the understanding of image contents. Kim et al. (2021) studied Video Question Answering using SSL by measuring the similarity scores between the representations of questions and ground truth answers. The methods above adopt a two-step training strategy, i.e., cross-modality pretraining with SSL followed by fusion network fine-tuning. In contrast, we train a VQA model end-to-end based on the contrastive learning of different model components and determine a prediction result using the similarity measurement between modality representations. This work leverages several projection networks to align the cross-modal representations and embed the semantic meaning of answers.
Decentralized Machine Learning
Our method is inspired by Decentralized Machine Learning built upon distributed learning (Li et al., 2014; Smola and Narayanamurthy, 2010), such as Federated Learning (McMahan et al., 2017; Sun et al., 2020; Kairouz et al., 2021), Split Learning (Hancox and et al., 2020; Thapa et al., 2022), and Swarm Learning (Warnat et al., 2021). Though methods like Federated Learning have been intensively studied for tasks of a single modality (Sun et al., 2021), studies of multimodal models such as VQA are still insufficient. Notably, for visual language grounding tasks, Liu et al. (2020) proposed a Federated Learning-based VQA framework called aimNet, which consists of an aligning module, an integrating module, and a mapping module. AimNet can acquire fine-grained representations from different clients for improved downstream tasks. However, aimNet is a supervised learning framework that trains on the hard labels for different answers. Moreover, though aimNet does not disclose the training data via Federated Learning, the model parameters are shared to obtain better-refined representations. Such shared model parameters can be used to mount adversarial attacks of model poisoning (Wang et al., 2020b; Fang et al., 2020; Zhou et al., 2021).
To overcome the above challenges, we devise a new Split Learning-based VQA framework that leverages a simpler and faster unidirectional learning flow to improve on classical Split Learning. Furthermore, neither the training data nor the complete model’s parameters are disclosed during training. To the best of our knowledge, this is the first work to employ Split Learning for the VQA task and to combine Split Learning with the contrastive loss.
Methods
In this section, we first present the motivation for the Unidirectional Split Learning with Contrastive Loss (UniCon) for visual question answering (VQA) tasks. We then discuss the technical underpinnings of UniCon, comprising the unidirectional split learning framework, semantic notion understanding with the answer projection network, and contrastive learning of different model components using two adapter networks.
Motivation
In VQA, the training data of images, questions, and answers are usually prepared beforehand for model training. Nevertheless, previous studies of adversarial attacks on neural networks have shown that the use of large-scale training data can be correlated with the model’s vulnerability to poisoning attacks. Poisoning attacks can usually be divided into two categories depending on the prior knowledge of the adversary (Sun et al., 2021), i.e., data poisoning and model poisoning. Moreover, in the most recent study of attacks on state-of-the-art VQA models, an adversary exploited the complex fusion mechanisms to successfully embed effective and stealthy backdoors (Walmer et al., 2022). Such threats are enabled by the adversary’s access to either the training data or the VQA models.
To this end, decentralized machine learning (DML) approaches such as Federated Learning (FL) have recently been used to facilitate privacy-preserving learning of VQA models (Liu et al., 2020). FL trains a global model over the entire data distribution of different tasks, thus attaining better-refined representations for downstream tasks. This is enabled by the sharing of local model parameters trained on different client tasks. Intuitively, FL can retain data confidentiality via model sharing. However, sharing the entire model renders the FL-based framework vulnerable to model poisoning attacks. For example, the adversary could reconstruct the input data from the shared model parameters (Hitaj et al., 2017). Our goal is to alleviate the threats of model poisoning while retaining the guarantee of data confidentiality based on an adapted method of Split Learning. The proposed method leverages contrastive learning to align knowledge between different components of a VQA model such that the model can be trained without disclosing the entire model architecture.
Split Learning for Visual Question Answering
Visual Question Answering (VQA) is the task of answering questions according to given image contents. The VQA problem is usually considered a supervised learning task with a fixed list of possible answer options. In particular, a VQA model takes as input a pair of an image and a question and outputs an answer chosen from the fixed list of answer options. The goal of the VQA model is to predict the correct answer for each input pair in the dataset, i.e., to select the answer option with the highest conditional probability given the image and the question.
Moreover, Split Learning (SL) splits a complete model into different parts and trains a global model via interactive representation and gradient sharing. There are mainly two types of SL for different tasks (Thapa et al., 2022). In particular, we consider Split Learning without label sharing, which wraps the model on the server around the end layer and sends the layer output back to a client (Fig. 2.a). This architecture can guarantee the data confidentiality of a client since neither the input data nor the labels are shared for training. Furthermore, a complete model is split into three components for training, i.e., a global component held by the server and two client components. We assume there are multiple clients, each holding its own local dataset of image, question, and answer samples. We suppose that the model components of different clients share the same architecture in SL and that clients cannot share data, mainly due to data confidentiality.
Then, SL proceeds by iterating the following steps: (1) each client computes the output of its first client component on its local inputs and sends the output to the server, (2) the server further forward-propagates the client output through the global component and sends back the result, (3) the probability distribution is computed with the second client component and the loss is computed against the ground-truth answers, (4) the gradients of each client with respect to each component’s parameters are computed via the inverse path, from the second client component through the global component to the first client component, (5) the gradients are averaged across clients to update the different components. The process above is repeated until a given training goal is achieved. We refer to the appendix for full details on the SL algorithm.
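As a minimal sketch of steps (1) to (4) and the update in step (5) for a single client, the toy example below replaces each network component with one scalar weight. The function name `split_learning_step`, the scalar weights, the sigmoid output, and the binary cross-entropy loss are illustrative assumptions rather than the setup used in this work, and the averaging of gradients over several clients is omitted.

```python
import math


def split_learning_step(x, y, w_head, w_server, w_tail, lr=0.1):
    """One toy step of split learning without label sharing.

    Scalars stand in for tensors (an assumption for illustration).
    The client holds w_head and w_tail, the server holds w_server.
    Only activations and their gradients cross the client/server
    boundary; the input x and the label y never leave the client.
    """
    # (1) client: forward through the first client component, send h
    h = w_head * x
    # (2) server: forward through the global component, send g back
    g = w_server * h
    # (3) client: second client component gives a probability; the
    #     loss is computed locally against the label
    p = 1.0 / (1.0 + math.exp(-w_tail * g))
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # (4) backward along the inverse path: tail -> server -> head
    d_logit = p - y                # dL/d(w_tail * g) for sigmoid + BCE
    grad_tail = d_logit * g
    d_g = d_logit * w_tail         # gradient sent from client to server
    grad_server = d_g * h
    d_h = d_g * w_server           # gradient sent from server to client
    grad_head = d_h * x
    # (5) gradient step on every component (single client, no averaging)
    return (w_head - lr * grad_head,
            w_server - lr * grad_server,
            w_tail - lr * grad_tail,
            loss)


# Repeating the step on one sample should reduce the loss
weights = (0.5, 0.5, 0.5)
for _ in range(50):
    *weights, last_loss = split_learning_step(1.0, 1, *weights)
```

Each forward and backward pass needs two round trips between the client and the server, which is the waiting cost that the unidirectional design aims to remove.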
Unidirectional Split Learning with Contrastive Loss
Previous work showed the plausible usages and efficacy of Contrastive Learning (Chopra et al., 2005) in a single modality (Chen et al., 2020b; He et al., 2020; Zbontar et al., 2021). In multimodal learning, Contrastive Learning has recently been used to learn refined cross-modal representations (Radford et al., 2021; Rouditchenko et al., 2021) by encouraging multimodal data from a relevant input to have more similar representations compared to data from irrelevant inputs. For the VQA tasks, one challenge is that most models are supervised where the one-hot vectors of answer classes are employed to perform a multi-class classification (Anderson et al., 2018; Kim et al., 2018). Therefore, the semantic notions of answers are usually not well correlated with the inputs reducing the generality of the trained model to unknown samples. To this end, we propose the use of the contrastive loss in Split Learning to correlate vision contents and language semantic notions such that each model component learns better-refined representations for the VQA tasks. Moreover, since the learning flow in classical SL is bidirectional between a client and the server, the waiting time for computing the next component is greatly increased (Fig. 2.a). By using the contrastive loss, we adapt SL to the unidirectional sharing architecture where the representations are shared by clients to the server and the gradients are shared by the server to clients. The unidirectional split learning with contrastive loss (UniCon) allows a client to compute the representations and gradients without waiting for the computation of the global component, which is more efficient and simpler compared to SL (Fig. 2.b).
Semantic Notion Understanding with Answer Projection Network.
The vast majority of VQA models are supervised, where the one-hot vectors of answer classes are employed to perform a multi-class classification (Yang et al., 2016a; Anderson et al., 2018; Kim et al., 2018). To also learn semantic notions from the answers, we devise an Answer Projection Network (APN) to embed the answer language contexts into a feature vector. Notably, APN comprises three building blocks: a text preprocessing module, a word embedding module using GloVe (Pennington et al., 2014), and a linear projection layer. The detailed architecture is described in Section Experiments.
Adapter Networks and the Shared Projection Space.
We propose the use of two adapter networks to project the outputs from different model components into a shared projection space. In this regard, a nonlinear projection head on more complex representations can improve the performance, while for simpler modality representations it is not beneficial to use the nonlinear projection (Chen et al., 2020a, b; Alayrac et al., 2020). In particular, we replace a VQA model’s output layer with the Nonlinear Head Adapter (NHA) network, which projects the high-level cross-modal representations from the layer before the output layer into the shared projection space. Furthermore, we devise the Linear Tail Adapter (LTA) to project the low-level representations of APN into the shared projection space. Note that the NHA and LTA outputs have the same dimension. The detailed architectures of NHA and LTA are described in Section Experiments.
Learning with the Information Noise Contrastive Estimation loss.
The Information Noise Contrastive Estimation (InfoNCE) loss is commonly used for contrastive learning (Oord et al., 2018b) to identify the positive sample from a set of unrelated negative samples. Notably, UniCon employs the relevant NHA and LTA outputs in the shared projection space of the same input triplets within one training batch as positive pairs, so a batch yields as many positive pairs as it has samples. In contrast, given an NHA output, the LTA outputs of the other triplets in the batch are employed as its negative keys. Then, we train the model by aligning the knowledge between the component outputs in positive pairs while discouraging the similarity between the outputs in negative pairs (Fig. 1). We formulate the loss as follows
(1) |
where is the temperature parameter and is an indicator function: 1 if , 0 otherwise.
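For reference, a common form of the InfoNCE loss over one batch can be sketched as below. This is an illustration under stated assumptions and not necessarily the exact form of Eq. 1: it measures similarity as the cosine between an NHA output and an LTA output, and it keeps the positive pair in the denominator. The function names `l2_normalize` and `info_nce` are hypothetical, and the default temperature of 0.07 follows the value reported in Section Experiments.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(nha_outputs, lta_outputs, temperature=0.07):
    """Mean InfoNCE loss over a batch.

    Row i of nha_outputs and row i of lta_outputs come from the same
    triplet (positive pair); every other LTA row is a negative key.
    Cosine similarity and a denominator that includes the positive
    pair are assumptions of this sketch.
    """
    queries = [l2_normalize(v) for v in nha_outputs]
    keys = [l2_normalize(v) for v in lta_outputs]
    total = 0.0
    for i, query in enumerate(queries):
        logits = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in keys]
        # log-sum-exp with the maximum subtracted for numerical stability
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Aligned pairs give a loss near zero; mismatched pairs give a large loss
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

A lower temperature sharpens the softmax over the keys, so hard negatives contribute more strongly to the gradient.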
Model Parameter Aggregation.
Parameter aggregation of models trained on different VQA tasks aims to improve the generality of the models to unseen samples. This is enabled by aggregating the update gradients of different components after each epoch’s training. Since sharing all components’ gradients with a single server for aggregation might disclose the complete model’s architecture, we employ a dual-server aggregation strategy where an auxiliary parameter server aggregates the gradients of the client components and the main server aggregates the gradients of the global components. The dual-server aggregation can limit an adversary’s prior knowledge of the training model, thus alleviating model poisoning attacks such as membership inference (Nasr et al., 2019) and information stealing (Wang et al., 2020a). However, since the main focus of this work is not adversarial attacks on and defenses of VQA models, we leave the discussion of the model’s robustness against poisoning attacks to future study. We formulate the parameter aggregation of APN, VQA, NHA, and LTA by the following
(2) |
where is the parameters of a model component from .
Furthermore, each client updates the local components and the main server updates the global components based on the aggregated gradients, respectively. We refer to the appendix for full details on the UniCon algorithm.
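The dual-server idea can be sketched as follows, assuming equally weighted averaging over clients (the weighting in Eq. 2 may differ). The function names and the keys `client_part` and `global_part` are hypothetical placeholders; which components sit behind each key is left open here.

```python
def average_updates(client_updates):
    """Element-wise mean of per-client update vectors (equal weights assumed)."""
    n_clients = len(client_updates)
    return [sum(values) / n_clients for values in zip(*client_updates)]


def aggregate(inbox):
    """Average every component that this server received from the clients."""
    names = inbox[0].keys()
    return {name: average_updates([report[name] for report in inbox])
            for name in names}


# Each server receives only its own subset of the updates, so neither
# server ever holds the updates of the complete model.
auxiliary_inbox = [{"client_part": [0.25, 0.5]}, {"client_part": [0.75, 0.0]}]
main_inbox = [{"global_part": [1.0, 3.0]}, {"global_part": [3.0, 5.0]}]

auxiliary_result = aggregate(auxiliary_inbox)   # {"client_part": [0.5, 0.25]}
main_result = aggregate(main_inbox)             # {"global_part": [2.0, 4.0]}
```

Keeping the two inboxes separate is the point of the design: compromising one server exposes only part of the model.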
Accuracy with Representation Similarity Measurement
The difficulty in measuring accuracy in UniCon is that we do not have a discriminative model to infer the class of the input. Inspired by Radford et al. (2021), we evaluate the product similarity scores between the representation of an image and question input pair from the hold-out validation dataset of VQA and the representations of all answer options. Then, we select the answer with the highest similarity to the input as the predicted answer. There also exist studies reporting the average accuracy over different batches, i.e., measuring the similarity scores within the same batch. However, we found that such a metric can easily produce a much higher accuracy than the evaluation metric based on all answer options. Therefore, we use as the metric the similarity scores between the representations of the input and all answer options. We formulate the accuracy by the following
(3) |
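A minimal sketch of this evaluation is given below, assuming the similarity is a plain dot product and that each validation sample has a single ground-truth answer index. Both are simplifying assumptions, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_answer(input_repr, answer_reprs):
    """Index of the answer option most similar to the input representation."""
    scores = [dot(input_repr, answer) for answer in answer_reprs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(input_reprs, answer_reprs, ground_truth):
    """Share of inputs whose best-scoring answer is the ground truth.

    Every input is scored against all answer options, not only the
    answers that happen to share its batch.
    """
    correct = sum(predict_answer(x, answer_reprs) == truth
                  for x, truth in zip(input_reprs, ground_truth))
    return correct / len(ground_truth)


# Toy example with two answer options and three validation inputs
answers = [[1.0, 0.0], [0.0, 1.0]]
inputs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
score = accuracy(inputs, answers, ground_truth=[0, 1, 1])  # 2 of 3 correct
```

Scoring against the full answer list rather than the in-batch answers enlarges the set of distractors, which is why the in-batch metric tends to report higher numbers.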
Experiments
Datasets and Models
Datasets.
Models.
We studied our approach with the following state-of-the-art Visual Question Answering models: (1) Multi-modal Factorized Bilinear (MFB) (Yu et al., 2017) combines multi-modal features using an end-to-end network architecture to jointly learn the image and question attention; (2) Bottom-Up and Top-Down attention mechanism (BUTD) (Anderson et al., 2018) enables attention to be calculated at the level of objects and other salient image regions. The bottom-up mechanism based on Faster R-CNN proposes image regions, while the top-down mechanism determines feature weightings; (3) Bilinear Attention Networks (BAN) (Kim et al., 2018) considers bilinear interactions among two groups of input channels and extracts the joint representations for each pair of channels; (4) Multimodal neural architecture search (MMNas) (Yu et al., 2020) uses a gradient-based algorithm to learn the optimal architecture; (5) Modular Co-Attention Network (MCAN) (Yu et al., 2019b) consists of Modular Co-Attention (MCA) layers cascaded in depth where each MCA layer models both the self-attention and the guided-attention of the input channels.
Implementation Details
The proposed method is model agnostic and can be applied to different VQA models. We used PyTorch and the OpenVQA platform (Yu et al., 2019a) to implement the VQA models. We set the hyperparameters of the different VQA models to their default author-recommended values. Due to the training time cost, we evaluated the model performance with three different seeds and reported the mean. We conducted experiments on four NVIDIA A100 Tensor Core GPUs with 40GB memory each. The code will be made publicly available upon acceptance.
Component Architecture and Hyperparameters
We employed the following architecture for the three model components in UniCon. For APN, we used GloVe (Pennington et al., 2014) trained on Common Crawl to convert the pre-processed texts, with a maximum length of eight words, into word embeddings. Then, a fully connected layer followed by the ReLU activation function was applied to project each word embedding into 512 dimensions. Finally, a max pooling layer was applied over the eight words with respect to each dimension, producing a 512-dimension vector. Moreover, for LTA, we used a fully connected layer to project the representations received from each client to the shared space with a dimension of 256. For NHA, we used a two-layer architecture: (1) a fully connected layer to project the representations to a dimension of 512 followed by the ReLU, and (2) another fully connected layer to project the middle layer output to a dimension of 256 followed by batch normalization. We refer to the appendix for full details on the model architecture.
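The APN data flow described above (per-word projection, ReLU, then max pooling over words) can be sketched in a few lines. The helper names, the tiny dimensions, and the handling of answers shorter than eight words (pooling over the available words without padding) are assumptions for illustration only.

```python
def linear(x, weight, bias):
    """y = W x + b for one input vector; weight is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise max(0, x)."""
    return [max(0.0, value) for value in x]


def apn_forward(word_embeddings, weight, bias, max_words=8):
    """Project each word embedding, apply ReLU, then max-pool over words.

    Returns one vector whose length equals the number of rows in
    weight (512 in the setup described in the text; 3 in the toy
    example below).
    """
    words = word_embeddings[:max_words]
    projected = [relu(linear(word, weight, bias)) for word in words]
    # max over the word axis, separately for each feature dimension
    return [max(column) for column in zip(*projected)]


# Toy example: two words with 2-d embeddings projected to 3 dimensions
embeddings = [[1.0, -1.0], [0.5, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]
answer_vector = apn_forward(embeddings, weight, bias)  # [1.0, 2.0, 0.0]
```

Max pooling makes the output length independent of the number of words, so answers of different lengths map to vectors of the same size before they reach LTA.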
Furthermore, we employed a batch size of 128, a total of 20 epochs (693400 steps), the Adam optimizer, and an initial learning rate of 0.0001 with a linear warmup of 10K steps and a decay rate of 0.2 at epochs 10 and 15. The hyperparameters were chosen using grid search. We found that a smaller learning rate or a larger batch size would generally hinder model learning, and the model ceased learning in some cases with larger parameter sizes. In addition, for the InfoNCE loss, we adopted a temperature of 0.07 as in (Wu et al., 2018; Patrick et al., 2020; Alayrac et al., 2020).
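The learning-rate schedule can be written as a small function. This sketch assumes that the warmup rises linearly from near zero to the initial rate, that epochs are counted from zero, and that each decay multiplies the rate by 0.2; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(step, steps_per_epoch, base_lr=1e-4, warmup_steps=10_000,
                  decay_rate=0.2, decay_epochs=(10, 15)):
    """Learning rate at a given zero-based training step.

    Linear warmup over the first warmup_steps steps, then a step decay
    applied once for every decay epoch that has been reached.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    epoch = step // steps_per_epoch
    decays_applied = sum(epoch >= boundary for boundary in decay_epochs)
    return base_lr * decay_rate ** decays_applied


# With the reported totals, one epoch spans 693400 / 20 = 34670 steps
steps_per_epoch = 693400 // 20
rate_after_warmup = learning_rate(9_999, steps_per_epoch)
rate_at_epoch_10 = learning_rate(10 * steps_per_epoch, steps_per_epoch)
rate_at_epoch_15 = learning_rate(15 * steps_per_epoch, steps_per_epoch)
```

Under these assumptions the rate reaches 0.0001 at the end of warmup and is multiplied by 0.2 at each of the two decay points.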
Empirical Results
Table 2: Accuracy (%) of contrastive learning-based VQA on the VQA-v2 validation set.

VQA Models | Overall | Yes/No | Number | Other |
---|---|---|---|---|
BAN | 36.23 | 66.90 | 12.71 | 19.11 |
BUTD | 45.08 | 75.82 | 29.27 | 25.86 |
MFB | 46.98 | 73.95 | 32.81 | 30.20 |
MCAN-s | 53.18 | 81.06 | 41.95 | 34.93 |
MCAN-l | 53.32 | 81.21 | 42.66 | 34.90 |
MMNas-s | 51.54 | 78.06 | 39.76 | 34.46 |
MMNas-l | 53.82 | 80.06 | 42.86 | 36.75 |
We performed extensive experiments based on the five state-of-the-art VQA models above. In particular, for MMNas and MCAN, we further considered the effect of different model complexities using MMNas-small (MMNas-s) and MMNas-large (MMNas-l), and MCAN-small (MCAN-s) and MCAN-large (MCAN-l). The detailed architecture designs of these models followed the settings in (Yu et al., 2020, 2019b). Moreover, we evaluated the proposed method’s performance based on Eq. 3. Notably, for each triplet in the validation set, we input the image and question pair to UniCon. Then, the output representation of the nonlinear head adapter is used to measure the similarity scores with the representations of the 3048 answers from the linear tail adapter. Finally, the prediction result is the answer text with the highest similarity score, and the accuracy is computed based on the predictions and the ground truths of the validation set. Note that the label space of VQA-v2 has 3048 answer classes, which is much larger than those of the datasets considered in CLIP (Radford et al., 2021), where the largest label space is ImageNet (Deng et al., 2009) with only 1000 classes. Hence, contrastive learning is more challenging for the VQA tasks.
Furthermore, we studied the effectiveness of the contrastive learning-based VQA. The results are shown in Table 2. The empirical results demonstrate that the contrastive learning-based approach can be effectively applied to most VQA models. BAN showed the worst performance, particularly in counting the number. MMNas-l showed the best overall performance of 53.82% outperforming the other models for counting the number (Number) and answering the contents of an image (Other). Nevertheless, MCAN-l performed the best in the Yes/No questions.
Table 3: Accuracy (%) of UniCon with two clients on the VQA-v2 validation set.

VQA Models | Overall | Yes/No | Number | Other |
---|---|---|---|---|
BAN | 35.11 | 63.84 | 11.06 | 19.61 |
BUTD | 40.96 | 66.98 | 13.34 | 28.74 |
MFB | 42.43 | 68.65 | 23.33 | 27.52 |
MCAN-s | 48.42 | 74.93 | 30.88 | 32.89 |
MCAN-l | 48.44 | 77.44 | 30.72 | 32.01 |
MMNas-s | 45.14 | 70.55 | 28.04 | 30.33 |
MMNas-l | 49.89 | 74.85 | 36.88 | 34.33 |
To evaluate the efficacy of UniCon, we randomly divided the training set into two subsets. Then, we employed the two subsets as the local datasets of two clients to perform the split learning. The two clients share the same model component architecture and cannot share data, mainly due to data confidentiality. The performance was evaluated on the global model after each round’s parameter aggregation. We show the numerical results in Table 3. As before, MMNas-l outperformed the other models overall, while MCAN-l performed the best on the Yes/No questions. Moreover, there exists a trade-off between model performance and using split learning for confidentiality. Compared to the overall accuracy of 53.82% of MMNas-l in standalone training over the entire data set, the split learning-based approach obtained an accuracy of 49.89%. Nevertheless, when data sharing becomes an obstacle, the proposed approach can benefit model training by leveraging other clients’ knowledge of different tasks. In addition, when data sharing is not allowed, a standalone model usually only has access to part of the data and thus trains over a partial distribution. UniCon allows clients to train over the entire data distribution, thus improving the global model performance. We aim to study the efficacy of UniCon in improving on standalone learning over the partial distribution in future work. We also present the similarity scores between the pairs of NHA and LTA outputs within one training batch before and after training, respectively, in Fig. 4.
Conclusion
We propose a Unidirectional Split Learning with Contrastive loss (UniCon) for VQA, which learns fine-grained cross-modal representations by aligning model component outputs of different client tasks. We evaluate the efficacy of the UniCon framework on five different VQA models based on the VQA-v2 dataset. Extensive experiments show that our approach can be applied to different models demonstrating the effectiveness and universality of our approach. This work can be extended by considering a broader list of answer options using prompt engineering (Gao et al., 2021; Radford et al., 2021) further improving the prediction performance. Moreover, the robustness of UniCon against adversarial attacks will be studied with additional efforts using methods like Differential Privacy (Abadi et al., 2016). We hope that this work will motivate future research in robust learning for personal multimodal models.
References
- Deep learning with differential privacy. In ACM Conference on Computer and Communications Security.
- Don’t just assume; look and answer: overcoming priors for visual question answering. In CVPR.
- VQA: visual question answering - www.visualqa.org. Int. J. Comput. Vis. 123 (1), pp. 4–31.
- Self-supervised multimodal versatile networks. In NeurIPS.
- Bottom-up and top-down attention for image captioning and visual question answering. In CVPR.
- Blind backdoors in deep learning models. arXiv preprint.
- How to backdoor federated learning. In AISTATS.
- Neural machine translation by jointly learning to align and translate. In ICLR.
- A simple framework for contrastive learning of visual representations. In ICML.
- A simple framework for contrastive learning of visual representations. In ICML.
- Learning a similarity metric discriminatively, with application to face verification. In CVPR.
- Imagenet: a large-scale hierarchical image database. In CVPR.
- Local model poisoning attacks to byzantine-robust federated learning. In USENIX Security Symposium.
- Making pre-trained language models better few-shot learners. In ACL/IJCNLP.
- Making the V in VQA matter: elevating the role of image understanding in visual question answering. Int. J. Comput. Vis. 127 (4), pp. 398–414.
- The future of digital health with federated learning. Cited by: Decentralized Machine Learning.
- Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: Contrastive Learning, Unidirectional Split Learning with Contrastive Loss.
- Deep models under the GAN: information leakage from collaborative deep learning. In ACM Conference on Computer and Communications Security, Cited by: Motivation.
- Advances and open problems in federated learning. Found. Trends Mach. Learn. 14 (1-2), pp. 1–210. Cited by: Decentralized Machine Learning.
- Bilinear attention networks. In NeurIPS, Cited by: Multimodal Machine Learning and Visual Question Answering, Contrastive Learning, Semantic Notion Understanding with Answer Projection Network., Unidirectional Split Learning with Contrastive Loss, Models..
- DeepStory: video story QA by deep embedded memory networks. In IJCAI, Cited by: Multimodal Machine Learning and Visual Question Answering.
- Self-supervised pre-training and contrastive representation learning for multiple-choice video QA. In AAAI, Cited by: Contrastive Learning.
- T test as a parametric statistic. Korean journal of anesthesiology 68 (6). Cited by: Appendix A.
- Adversarial machine learning at scale. In ICLR, Cited by: Contrastive Learning.
- Scaling distributed machine learning with the parameter server. In USENIX, Cited by: Decentralized Machine Learning.
- Composite backdoor attack for deep neural network by mixing existing benign features. In ACM SIGSAC, pp. 113–131. Cited by: Introduction.
- Microsoft COCO: common objects in context. In ECCV, Cited by: Appendix A, Datasets..
- Federated learning for vision-and-language grounding problems. In AAAI, Cited by: Introduction, Decentralized Machine Learning, Table 1, Motivation.
- Communication-Efficient Learning of Deep Networks from Decentralized Data. In AISTATS, Cited by: Introduction, Decentralized Machine Learning.
- Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning. In IEEE Symposium on Security and Privacy (SP), Cited by: Model Parameter Aggregation..
- Representation learning with contrastive predictive coding. arXiv preprint. Cited by: Contrastive Learning.
- Representation learning with contrastive predictive coding. arXiv preprint. Cited by: Learning with the Information Noise Contrastive Estimation loss..
- Multi-modal self-supervision from generalized data transformations. arXiv preprint. Cited by: Component Architecture and Hyperparameters.
- GloVe: global vectors for word representation. In EMNLP, Cited by: Semantic Notion Understanding with Answer Projection Network., Component Architecture and Hyperparameters.
- Learning transferable visual models from natural language supervision. In ICML, Cited by: Multimodal Machine Learning and Visual Question Answering, Contrastive Learning, Unidirectional Split Learning with Contrastive Loss, Accuracy with Representation Similarity Measurement, Empirical Results, Conclusion.
- Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv.2204.06125. Cited by: Multimodal Machine Learning and Visual Question Answering.
- Zero-shot text-to-image generation. In ICML, Cited by: Multimodal Machine Learning and Visual Question Answering.
- AVLnet: learning audio-visual language representations from instructional videos. In Annual Conference of the International Speech Communication Association, Cited by: Multimodal Machine Learning and Visual Question Answering, Contrastive Learning, Unidirectional Split Learning with Contrastive Loss.
- An architecture for parallel topic models. Proc. VLDB Endow. 3 (1), pp. 703–710. Cited by: Decentralized Machine Learning.
- Intrusion detection with segmented federated learning for large-scale multiple lans. Cited by: Decentralized Machine Learning.
- Decentralized deep learning for multi-access edge computing: a survey on communication efficiency and trustworthiness. In IEEE Transactions on Artificial Intelligence, Cited by: Introduction, Decentralized Machine Learning, Motivation.
- SplitFed: when federated learning meets split learning. Cited by: Decentralized Machine Learning, Split Learning for Visual Question Answering.
- Attention is all you need. In NeurIPS, Cited by: Multimodal Machine Learning and Visual Question Answering.
- Attention is all you need. In NeurIPS, Cited by: Appendix A.
- Dual-key multimodal backdoors for visual question answering. In CVPR, Cited by: Introduction, Motivation.
- Attack of the tails: yes, you really can backdoor federated learning. In NeurIPS, Cited by: Model Parameter Aggregation..
- Attack of the tails: yes, you really can backdoor federated learning. In NeurIPS, Cited by: Introduction, Decentralized Machine Learning.
- Swarm learning for decentralized and confidential clinical machine learning. Cited by: Decentralized Machine Learning.
- Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: Component Architecture and Hyperparameters.
- Stacked attention networks for image question answering. In CVPR, Cited by: Appendix A, Contrastive Learning, Semantic Notion Understanding with Answer Projection Network..
- Stacked attention networks for image question answering. In CVPR, Cited by: Multimodal Machine Learning and Visual Question Answering.
- OpenVQA. Note: https://github.com/MILVLG/openvqa Cited by: Implementation Details.
- Deep multimodal neural architecture search. In ACM Multimedia, Cited by: Multimodal Machine Learning and Visual Question Answering, Table 1, Models., Empirical Results.
- Deep modular co-attention networks for visual question answering. In CVPR, Cited by: Models., Empirical Results.
- Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In ICCV, Cited by: Models..
- Barlow twins: self-supervised learning via redundancy reduction. Cited by: Contrastive Learning, Unidirectional Split Learning with Contrastive Loss.
- Deep model poisoning attack on federated learning. Future Internet 13 (3), pp. 73. Cited by: Introduction, Decentralized Machine Learning.
- Overcoming language priors with self-supervised learning for visual question answering. In IJCAI, Cited by: Contrastive Learning, Table 1.
Appendix A Appendices
VQA-v2 Dataset
Statistical T-test
We performed a two-tailed paired t-test (Kim, 2015) to compare the results in Table 2 with those in Table 3. Comparing the VQA model variants based on contrastive learning without split learning against the seven VQA model variants based on UniCon yielded a t-value of 1.357. At a significance level (p-value) of 0.05 and the corresponding degrees of freedom, the critical t-value is 2.477. Since 1.357 lies within the range [-2.477, 2.477], using split learning has no statistically significant effect on the prediction accuracy. Therefore, UniCon achieves a competitive prediction result while safeguarding a client’s training data and model.
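The test above can be illustrated with a short sketch of the paired t-statistic. It is a generic textbook computation and not the authors' script: the function names are hypothetical, the accuracy lists are made-up placeholders rather than values from Table 2 or Table 3, and the critical value is passed in as an argument instead of being derived from a t-distribution.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for two paired samples of equal length."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("paired samples must have equal length of at least 2")
    diffs = [x - y for x, y in zip(first, second)]
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("all paired differences are identical")
    standard_error = spread / math.sqrt(len(diffs))
    return mean(diffs) / standard_error, len(diffs) - 1


def is_significant(t_value, critical_value):
    """Two-tailed decision: significant only if t falls outside [-c, c]."""
    return abs(t_value) > critical_value


# Placeholder accuracies (hypothetical, for illustration only).
without_split = [50.0, 52.0, 51.0, 53.0]
with_split = [49.0, 52.5, 50.0, 52.0]
t_value, dof = paired_t_statistic(without_split, with_split)

# The values reported in the text: t = 1.357 against a critical value of 2.477.
reported_is_significant = is_significant(1.357, 2.477)  # False
```

The decision rule is the same one applied in the text: a t-value inside the symmetric critical interval means the null hypothesis of no difference is not rejected.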
Attention Map Visualization
The attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017b) consists of three main components, namely the query Q, the key K, and the value V. Notably, in Visual Question Answering (VQA), attention weights are learned to represent the relative importance of visual representations at different spatial locations with respect to a given question. The attention layers are updated such that larger weights are assigned to the visual regions that are more relevant to the question. Following Yang et al. (2016a), we visualized the attention map by extracting the weight matrix from the learned attention mechanism (Fig. 6).
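To make the role of the weight matrix concrete, the following is a minimal sketch of attention for a single question query over a set of visual regions. It assumes the scaled dot-product form of Vaswani et al.; the VQA models evaluated here may use other attention variants, and the function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attended output and the attention weights; the weights are
    what an attention-map visualization would display per visual region.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Toy example: one question vector attending over two visual regions.
question = [1.0, 0.0]
region_keys = [[1.0, 0.0], [0.0, 1.0]]
region_values = [[1.0, 0.0], [0.0, 1.0]]
attended, weights = attention(question, region_keys, region_values)
```

Here the first region's key aligns with the question vector, so it receives the larger weight; reshaping such weights to the image grid gives the kind of map shown in an attention visualization.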
Model Components Architecture
We show the detailed architecture for the nonlinear head adapter (NHA), the linear tail adapter (LTA), and the answer projection network (APN) in Fig. 7.
More Results of the Qualitative Analysis
The qualitative analysis evaluates the different success and failure cases for the different question types. Here we show more examples of the prediction results of UniCon in Fig. 8. The prediction results are based on the similarity measurement approach with a total of 3048 different answers.
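Similarity-based prediction over a fixed answer set can be sketched as a nearest-neighbor lookup. The snippet below is an illustration under stated assumptions and not the authors' code: it assumes cosine similarity between a fused image-question representation and precomputed answer embeddings, and the names `cosine_similarity` and `predict_answer` as well as the toy two-dimensional embeddings are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_answer(fused_representation, answer_embeddings):
    """Return the candidate answer whose embedding is most similar.

    answer_embeddings maps each candidate answer string to its vector; in the
    setting described in the text this would hold 3048 entries.
    """
    return max(
        answer_embeddings,
        key=lambda answer: cosine_similarity(
            fused_representation, answer_embeddings[answer]
        ),
    )


# Toy answer set (hypothetical embeddings).
answers = {"yes": [1.0, 0.0], "no": [0.0, 1.0], "two": [0.7, 0.7]}
prediction = predict_answer([0.9, 0.1], answers)
```

Because prediction reduces to a similarity ranking, extending the answer set only requires adding embeddings for the new candidates.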