Hybrid Fusion Based Interpretable Multimodal Emotion Recognition with Insufficient Labelled Data

Puneet Kumar (pkumar99@cs.iitr.ac.in, ORCID 0000-0002-4318-1353), Sarthak Malik (sarthak_m@mt.iitr.ac.in, ORCID 0000-0002-0980-8765), and Balasubramanian Raman (bala@cs.iitr.ac.in, ORCID 0000-0001-6277-6267)
Indian Institute of Technology Roorkee, India, 247667
Abstract.

This paper proposes a multimodal emotion recognition system, VIsual Spoken Textual Additive Net (VISTA Net), to classify the emotions reflected by a multimodal input containing image, speech, and text into discrete classes. A new interpretability technique, K-Average Additive exPlanation (KAAP), has also been developed to identify the important visual, spoken, and textual features leading to the prediction of a particular emotion class. The VISTA Net fuses the information from the image, speech, and text modalities using a hybrid of early and late fusion. It automatically adjusts the weights of their intermediate outputs while computing the weighted average, without human intervention. The KAAP technique computes the contribution of each modality and its corresponding features toward predicting a particular emotion class. To mitigate the insufficiency of multimodal emotion datasets labeled with discrete emotion classes, we have constructed a large-scale IIT-R MMEmoRec dataset consisting of real-life images, corresponding speech and text, and emotion labels (‘angry,’ ‘happy,’ ‘hate,’ and ‘sad’). The VISTA Net achieves 95.99% emotion recognition accuracy when considering the image, speech, and text modalities together, which is better than its performance when considering the inputs of any one or two modalities.

Affective Computing, Emotional and Social Signals, Multimodal Fusion, Media Interpretation, Emotion and Sentiment Analysis.
Journal: TOMM, Vol. 1, No. 1, Article 1. Publication date: July 2022. DOI: XXXXXXX.XXXXXXX. Copyright: ACM.
CCS Concepts: Information systems → Sentiment analysis; General and reference → Cross-computing tools and techniques; Information systems → Multimedia and multimodal retrieval; Computing methodologies → Supervised learning by classification.

1. Introduction

Multimedia data has grown rapidly in the last few years, leading multimodal emotion analysis to emerge as an important research trend (baltruvsaitis2018multimodal). Research in this direction aims to help machines become empathetic, as emotion analysis is used in various applications such as cognitive psychology, automated identification, intelligent devices, and human-machine interfaces (poria2017review). Humans portray different emotions through various modalities such as images, speech, and text (cimtay2020cross). Utilizing the multimodal information from them could increase the performance of emotion recognition (zeng2009survey).

Researchers have performed emotion recognition by analyzing visual, spoken, and textual information separately (rao2019learning; xu2020improve; majumder2019dialoguernn). Multimodal emotion recognition using two of these modalities has also been explored; however, emotion recognition using all three modalities, i.e., visual, spoken, and textual, has not been fully explored (zeng2009survey). Moreover, most of the existing multimodal approaches do not focus on interpreting the internal working of their emotion recognition systems. This inspired us to develop a multimodal emotion recognition system capable of recognizing emotions portrayed by the visual, spoken, and textual modalities and explaining each modality’s importance and the particular features contributing to emotion recognition.

Multimodal emotion recognition also suffers from a lack of sufficient labeled data for training. Moreover, real-life multimodal data contains generic images with facial, human, and non-human objects, but most of the existing multimodal datasets contain only facial and human images (busso2008iemocap). A few multimodal datasets contain generic images; however, they carry positive, negative, and neutral sentiment labels instead of multi-class emotion labels (gaspar2019multimodal; Vadicam2017ICCVW). In this paper, we construct a new dataset, the ‘IIT-R MMEmoRec dataset,’ that contains generic images, corresponding speech utterances, text transcripts, and discrete class labels, i.e., ‘happy,’ ‘sad,’ ‘hate,’ and ‘angry.’

We start with bimodal emotion recognition using the pre-trained speech (SER), text (TER), and image (IER) emotion recognition models built in the previous chapters. Intermediate fusion is first applied to the information from two modalities; late fusion is then applied to the output of the intermediate fusion and the outputs of the respective emotion recognition modules. For instance, to perform (speech + text) emotion recognition, the intermediate fusion of speech and text information is performed, and its output is then fused with the outputs of the SER and TER models. The modality weights for the fusion are computed using grid search. Further, we propose an interpretable multimodal emotion recognition system, VISTA Net, that fuses image, speech, and text features using hybrid fusion. A novel interpretability technique, KAAP, has also been developed to identify the important visual, spoken, and textual features that predict particular emotion classes. The VISTA Net uses KAAP and automatically adjusts the weights of its intermediate outputs while computing the weighted average, without human intervention.
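The grid search over modality weights mentioned above can be illustrated with a minimal sketch. The grid resolution (`steps`), the function names, and the toy validation data are assumptions made for illustration and are not taken from the paper; each sample holds three class-probability vectors (for example, the intermediate-fusion output and the two unimodal outputs).

```python
from itertools import product

def fuse(prob_vectors, weights):
    """Weighted average of class-probability vectors from several sources."""
    n_classes = len(prob_vectors[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_vectors))
            for c in range(n_classes)]

def accuracy(samples, labels, weights):
    """Fraction of samples whose fused arg-max matches the label."""
    hits = 0
    for prob_vectors, label in zip(samples, labels):
        fused = fuse(prob_vectors, weights)
        hits += fused.index(max(fused)) == label
    return hits / len(labels)

def grid_search_weights(samples, labels, n_sources, steps=10):
    """Evaluate every weight vector on a grid whose entries sum to 1."""
    best_weights, best_acc = None, -1.0
    for combo in product(range(steps + 1), repeat=n_sources):
        if sum(combo) != steps:
            continue  # keep only weight vectors that sum to 1
        weights = [c / steps for c in combo]
        acc = accuracy(samples, labels, weights)
        if acc > best_acc:
            best_weights, best_acc = weights, acc
    return best_weights, best_acc

# Hypothetical validation data: three sources, four emotion classes.
val_samples = [
    ([0.1, 0.6, 0.2, 0.1], [0.5, 0.3, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1]),
    ([0.6, 0.2, 0.1, 0.1], [0.3, 0.4, 0.2, 0.1], [0.5, 0.2, 0.2, 0.1]),
    ([0.1, 0.1, 0.2, 0.6], [0.1, 0.5, 0.2, 0.2], [0.2, 0.1, 0.2, 0.5]),
]
val_labels = [1, 0, 3]
best_weights, best_acc = grid_search_weights(val_samples, val_labels, n_sources=3)
```

The search is exhaustive, so its cost grows quickly with the number of sources and the grid resolution; this is one reason a learned weighting can be preferable when more combinations are fused.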

Using the modality weights computed through grid search, accuracies of 82.63%, 90.20%, and 84.17% have been observed on the IIT-R MMEmoRec dataset for (speech + text), (image + text), and (speech + image) emotion recognition, respectively. The VISTA Net has resulted in 95.99% emotion recognition accuracy when considering the image, speech, and text modalities, which is better than the performance when considering the input of any one or two modalities. The enhanced performance of the VISTA Net underlines the importance of utilizing complementary information from multiple modalities for emotion recognition. The importance of all the modalities has been reported, and their features contributing to emotion recognition are highlighted during the analysis. This paper makes the following contributions.

  1. The bimodal emotion recognition has been performed for each combination of speech, text, and image modalities using a hybrid of intermediate and late fusion. The SER, TER, and IER models proposed in the previous chapters have been utilized, and modality weights for fusion have been computed using grid-search.

  2. A hybrid-fusion-based novel interpretable multimodal emotion recognition system, VISTA Net, has been proposed to classify an input containing an image, corresponding speech, and text into discrete emotion classes.

  3. A novel interpretability technique, KAAP, has been developed to identify each modality’s importance and important image, speech, and text features contributing the most to recognizing emotions.

  4. A large-scale dataset, ‘IIT-R MMEmoRec dataset’ containing images, speech utterances, text transcripts, and emotion labels has been constructed.

In the rest of this paper, the related works are reviewed in Section 2. The proposed dataset, system, and interpretability technique are described in Section 3, along with the dataset construction procedure. Sections 4 and 5 discuss the experiments and results, and the paper is concluded in Section 6.

2. Related works

2.1. Unimodal emotion recognition

2.1.1. Speech emotion recognition

The traditional feature-based SER systems extract audio features such as cepstrum coefficients, voice tone, prosody, and pitch and use them for SER (el2011survey). For instance, Rong et al. (rong2009acoustic) worked on extracting the most important audio features from the speech samples, whereas Lee et al. (lee2005toward) used the extracted audio features to identify negative and positive emotions in speech samples. The feature-based SER systems depend on the polarity of the emotional features. The features of high-key classes (happiness and anger) are similar in properties among themselves, and they are very different from those of the low-key classes (sad and despair) (trigeorgis2016adieu). In the context of machine learning-based SER, Support Vector Machine (SVM) based classifiers and Hidden Markov Model (HMM) based statistical techniques have also been explored (jain2018cubic; lorenzo2015emotion). However, they require manual crafting of acoustic features, and HMM-based models cannot always reliably estimate the parameters of global speech features (el2011survey). Hence, it is challenging to develop an end-to-end SER system using them. The deep learning-based approaches using spectrogram features and attention mechanisms have shown state-of-the-art SER results (dai2019learning). In this context, Xu et al. (xu2020improve) generated multiple attention maps, fused them, and used them for SER. They observed an increased performance as compared to non-fusion-based approaches. In another work, Mao et al. (mao2014learning) used a Convolutional Neural Network (CNN) for spectrogram processing. On the other hand, Majumder et al. (majumder2019dialoguernn) performed Recurrent Neural Network (RNN) based SER. They determined speech embeddings and used them for speaker identification. In another work using attention maps, Mirsamadi et al. (mirsamadi2017automatic) implemented local attention to learn the emotion features automatically.

2.1.2. Text emotion recognition

Deep learning-based approaches have shown state-of-the-art TER results. With the evolution of deep learning, it has become possible to convert text into vectors and use Deep Neural Networks (DNNs) to process them. Emotion recognition in conversation could be useful to mine opinions from conversational data on platforms such as YouTube, Facebook, Reddit, Twitter, and others (poria2019emotion). More examples of deep learning-based emotion analysis include personality detection from text using document modeling (majumder2017deep) and text-based emotion recognition using YouTube comments (hajar2016using). In another work, Huang et al. (huang2019emotionx) used a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model for emotion recognition in text dialogs. Shrivastava et al. (shrivastava2019effective) applied sequence-based CNNs, whereas Batbaatar et al. (batbaatar2019semantic) utilized semantic and emotional information to train a deep TER model. Deep models focus on learning parameters such as features, sequence-related information, and contextual information from the text input.

2.1.3. Image emotion recognition

One of the most informative ways for machine perception of emotions is through facial expressions in images and videos. Identifying human emotions from facial expressions is a relatively more saturated research area than emotion recognition in other modalities. Various techniques such as face localization, face registration, micro-expression analysis, tracking the landmark points, shape feature analysis, eye gaze prediction, face segmentation, and detection have been developed for facial emotion recognition (li2018deep; corneanu2016survey). Image Emotion Recognition (IER) research is also an active domain. For instance, Kim et al. (kim2018building1) built a deep feed-forward neural network to combine different levels of emotion features obtained by using the semantic information of the image. In another work, Rao et al. (rao2019learning) prepared hierarchical notations for emotion recognition in the visual domain. Traditional, feature-based IER analysis with low-level (shape, color, and edge) and mid-level (composition and optical balance) image features (hanjalic2006extracting; joshi2011aesthetics; machajdik2010affective) used the semantic content of the images for emotion analysis. However, it is difficult for the handcrafted feature extraction techniques used by traditional and machine learning-based methods to accommodate all the low- and mid-level features. On the other hand, deep learning-based IER approaches are capable of extracting high-level visual features, but they struggle to extract low- and mid-level features. Moreover, they require well-labeled large-scale datasets for training (rao2019learning).

2.2. Multimodal emotion recognition

Different emotion representation methods are used in various modalities (gunes2011emotion; xu2015word). For example, emotion representation in face and gestures utilizes feature tracking, sensitivity analysis, and heat maps. In real-life scenarios, emotions are portrayed through various modalities such as vision, speech, and text. Analysis in a single modality may not be able to recognize the emotional context completely (zeng2009survey). This fact has drawn researchers’ attention towards multimodal emotion analysis (hossain2019emotion). Moreover, various modalities have different statistical properties associated with them. To correctly recognize complex human emotions portrayed through them, it is very important to consider their inter-relationships (poria2017review). As discussed earlier, various attempts have been made for emotion analysis using visual, spoken, and textual modalities individually. However, emotion analysis in a multimodal manner considering the inter-relationships of these modalities is still a largely unexplored space (page2019multimodal). The existing works in this direction are discussed as follows.

2.2.1. (Speech + text) emotion recognition

In the context of recognizing emotions from speech utterances and corresponding text transcripts, Chuang et al. (chuang2004multi) worked on analyzing the overlapping emotional information in speech and text. A simple fusion of spoken and textual information has also been used for emotion recognition in some other works. For instance, Makiuchi et al. (makiuchi2021multimodal) performed separate acoustic and textual analyses and determined the emotional context based on their collective result. In another work, Yoon et al. (yoon2018multimodal) extracted the audio and text information using dual RNNs and then combined it to perform emotion recognition. On the other hand, some research attempts used textual information to improve the SER performance. For example, Tripathi et al. (tripathi2018multi) performed emotion recognition on the Interactive EMOtional dyadic motion CAPture (IEMOCAP) dataset (busso2008iemocap) using data from speech and text modalities and complementing it with hand movement and facial expression data. In another work, Siriwardhana et al. (siriwardhana2020jointly) fine-tuned transformer-based models to improve the performance of multimodal speech emotion recognition.

2.2.2. (Text + image) emotion recognition

Several attempts have been made to recognize the emotional content portrayed in visual and textual modalities. In this direction, Kahou et al. (kahou2016emonets) developed a framework, EmoNets, for emotion recognition in video and text. Fortin et al. (page2019multimodal) implemented a multi-task architecture-based emotion recognition approach to perform predictions with one or two missing modalities by using a classifier for each combination of image, text, and tags. In another work, Xu et al. (xu2018co) modeled the interplay of visual and textual content for sentiment recognition using a co-memory-based network.

2.2.3. (Speech + image) emotion recognition

Multimodal emotion analysis from audio-visual data has also started getting researchers’ attention lately (hossain2019emotion). For instance, Aytar et al. (aytar2016soundnet) proposed SoundNet that extracts the emotional information by self-supervised learning of sound representations. In another work, Guanghui et al. (guanghui2021multi) implemented the feature correlation analysis algorithm for multimodal emotion recognition. They extracted speech and visual features using two-dimensional and three-dimensional CNNs, fused the features, and used SVM for emotion classification on the fused features.

2.2.4. (Speech + text + image) emotion recognition

There have been several attempts regarding emotion recognition in more than two modalities simultaneously. For example, Poria et al. (poria2016fusing) used information fusion techniques to combine the context from audio, visual, and textual modalities for sentiment analysis. They found that fusing the textual and spoken descriptions of the emotional information aids the emotion analysis for the visual modality. In another work, Tzirakis et al. (tzirakis2017end) extracted the features from speech, image, and text modalities, analyzed the correlation among them, and trained an end-to-end emotion recognition system using them jointly.

2.3. Explainable and interpretable emotion analysis

Explainability refers to the ability to describe an algorithm’s mechanism that led to a particular output. In contrast, interpretability concerns understanding the context of a model’s output, analyzing its functional design, and relating the design to the output (broniatowski2021psychological; kumartowards). The deep learning-based techniques act like a black box. The challenges involved with explaining and interpreting their internal working have given rise to a new research area known as explainable AI (lundberg2017unified). Among the recent research carried out in this direction, Ribeiro et al. (ribeiro2016should) pointed out the value of interpreting the internal working of deep learning-based classifiers. They also designed a framework that computes the importance of each input towards a particular output and interprets a classifier’s predictions. In another work, Fazi et al. (fazi2020beyond) developed a method to determine the part of the input leading to a particular output. Research has also been done on tracing every neuron’s contribution and understanding the output part-by-part (shrikumar2017learning). The existing interpretability techniques can be divided into the following categories.

2.3.1. Attribution Interpretability techniques

In these methods, the attribution values denoting the relevance of inputs concerning outputs are determined. A popular attribution value is ‘Shapley Values’ (SHAPley1953value). The attribution techniques are frequently used for local interpretability, which explains the impact of one instance instead of the overall model. The Shapley values have been used by Lundberg et al. (lundberg2017unified), who implemented an interpretability framework, SHAP (Shapley Additive exPlanations), which determines each feature’s contribution by analyzing its Shapley values (malik2021towards). The computation of Shapley values is very expensive because the number of models that must be trained grows exponentially with the number of features (lundberg2017unified; castro2009polynomial). Different approximations, such as Shapley value sampling (castro2009polynomial) and KernelSHAP (lundberg2017unified), have been used to speed up the computation of Shapley values. The attribution techniques are further classified into perturbation and back-propagation-based approaches, which are explained further.
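To make the cost of exact Shapley values concrete, the following minimal sketch enumerates every feature subset for a toy value function. The function `toy_value` and its numbers are hypothetical; in practice the value of a subset would come from a model evaluated on that subset of features.

```python
from itertools import combinations
from math import factorial

def shapley_values(n_features, value_fn):
    """Exact Shapley values; value_fn maps a frozenset of feature indices to a score."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Shapley weight for subsets of this size that exclude feature i.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to subset s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi

def toy_value(subset):
    """Hypothetical value function: additive scores plus one interaction term."""
    base = {0: 0.5, 1: 0.3, 2: 0.1}
    score = sum(base[j] for j in subset)
    if 0 in subset and 1 in subset:
        score += 0.1  # features 0 and 1 are worth more together
    return score

phi = shapley_values(3, toy_value)
```

Each feature requires a pass over all subsets of the remaining features, so the number of value-function evaluations doubles with every added feature, which is what sampling-based approximations try to avoid.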

2.3.2. Perturbation Interpretability techniques

These techniques make a small change in the input and observe its impact. The insights thus obtained are used to interpret the model’s working (fong2019understanding). The most frequently used perturbation technique is Local Interpretable Model-agnostic Explanations (LIME) (ribeiro2016should), which perturbs the given instance to synthesize new data. Each new instance is weighted according to its closeness to the original instance. A simple interpretable model is then trained on the weighted perturbed data, using the original model’s predictions as targets. The trained model’s weights approximate each feature’s contribution. LIME can be used with any machine learning model, though it is computationally expensive as it involves the generation of new data.

2.3.3. Backpropagation Interpretability techniques

The backpropagation-based interpretability techniques calculate the attributions by backpropagating through the network multiple times. A popular backpropagation-based technique is the ‘Saliency Map’ (simonyan2013deep), which uses the absolute gradient of the label output with respect to each input feature as the attribution. Another popular technique is the Gradient-weighted Class Activation Map (Grad-CAM), which assigns a score to each feature and computes the activation map using this score (selvaraju2017grad). Grad-CAM backpropagates only up to the last convolutional layer instead of all the way back to the image. It generates a map that highlights the important features of the input image. If the input image is slightly changed, it generates an entirely different map (kindermans2019reliability).

As suggested by the above survey, LIME, SHAP, and Grad-CAM are the most frequently used interpretability techniques for machine learning models. LIME incurs a very high computational cost, whereas Grad-CAM is not robust to small changes in the input image. The SHAP technique does not suffer from the aforementioned limitations. Furthermore, DNN interpretability has been applied to the visual modality, but it has not been fully explored for the speech and text modalities or for multimodal analysis. This inspired us to develop an interpretability technique for multimodal emotion recognition that explains the importance of each modality and identifies the important features of each modality that lead to the prediction of a particular class.

3. Proposed Work

3.1. Data compilation

Table 1 shows some samples from the IIT-R MMEmoRec dataset; the process used to construct it is elaborated below. The dataset contains generic images (facial, human, and non-human objects), as opposed to only facial images/videos in other known trimodal emotion datasets such as IEMOCAP (busso2008iemocap) and MOSEI (zadeh2018multi), along with speech utterances, text transcripts, emotion labels (‘angry,’ ‘happy,’ ‘hate,’ and ‘sad’), the probability of each emotion class given by each modality, and the probability of the final emotion class. The IIT-R MMEmoRec dataset has been constructed on top of the ‘Balanced Twitter for Sentiment Analysis’ (B-T4SA) dataset (Vadicam2017ICCVW). The B-T4SA dataset contains images, text, and sentiment (‘positive,’ ‘negative,’ and ‘neutral’) labels, whereas the IIT-R MMEmoRec dataset has been compiled to have discrete emotion labels for the image, text, and speech modalities. The following steps have been followed to construct the IIT-R MMEmoRec dataset.

Table 1. A few samples from IIT-R MMEmoRec dataset. Here, ‘Img_Prob,’ ‘Sp_Prob,’ ‘Txt_Prob,’ and ‘Final_Prob’ are image, speech, text and final prediction probabilities whereas angry, happy, hate and sad emotion labels are denoted as 0, 1, 2 & 3 respectively.
  • The text from the B-T4SA dataset is pre-processed by removing links, special characters, and tags, and the cleaned text is then converted to speech using the pre-trained state-of-the-art text-to-speech (TTS) model, Deep Voice 3 (ping2018deep). The use of a TTS model is motivated by recent studies showing that TTS models generate high-quality speech signals that can be used as a valid approximation of natural speech signals (ping2018deep; deepmind2016wavenet; oord2016wavenet).

  • The image, speech, and text components are passed through pre-trained IER, SER, and TER models trained on Flickr & Instagram (FI) (you2016building) dataset, IEMOCAP (busso2008iemocap) dataset, and ISEAR dataset (scherer1994evidence), respectively, and the prediction probabilities of each emotion class are obtained for each modality.

  • The prediction probabilities are then averaged to obtain the ground-truth emotion of each data sample. The averaging is done so that the chosen ground truth is the one best supported across the modalities. Fig. 1 shows an example of emotion label determination, with the probabilities for each emotion class given by each modality. In this example, the ‘happy’ class has the highest average prediction probability among the four classes, so the final emotion label for the sample is determined as ‘happy.’

Table 2. Class-wise data distribution.
Emotion  Samples
Angry    53,317
Happy    44,980
Hate     3,831
Sad      10,327

Figure 1. Example of emotion label determination.
  • The data is segregated according to classes, and the samples whose average prediction probability is less than the threshold confidence value times the maximum probability for the corresponding class are discarded. The threshold confidence is determined in Section 4.3.2.

  • The four emotion classes, ‘angry,’ ‘happy,’ ‘hate,’ and ‘sad,’ are common to the various datasets of different modalities considered in this work. The samples labeled as ‘excitement’ and ‘disgust’ have been re-labeled as ‘happy’ and ‘hate,’ respectively, as per Plutchik’s wheel of emotions (plutchik2001nature). The final dataset contains a total of 112,455 samples, with 53,317 labeled as ‘angry,’ 44,980 as ‘happy,’ and 10,327 and 3,831 as ‘sad’ and ‘hate,’ respectively, as described in Table 2.
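The label-determination step in the list above can be sketched as follows. The probability values are hypothetical and only illustrate the averaging; the class order follows the 0 to 3 encoding given in the caption of Table 1.

```python
EMOTIONS = ["angry", "happy", "hate", "sad"]  # indices 0..3, as in Table 1

def average_probabilities(img_prob, sp_prob, txt_prob):
    """Element-wise mean of the three modality-wise probability vectors."""
    return [(i + s + t) / 3 for i, s, t in zip(img_prob, sp_prob, txt_prob)]

def determine_label(img_prob, sp_prob, txt_prob):
    """Return (label index, averaged probability of that label)."""
    final_prob = average_probabilities(img_prob, sp_prob, txt_prob)
    label = final_prob.index(max(final_prob))
    return label, final_prob[label]

# Hypothetical per-modality probabilities for one sample.
img_prob = [0.10, 0.70, 0.10, 0.10]
sp_prob = [0.30, 0.40, 0.20, 0.10]
txt_prob = [0.20, 0.60, 0.10, 0.10]

label, confidence = determine_label(img_prob, sp_prob, txt_prob)
print(EMOTIONS[label], round(confidence, 4))
```

Averaging lets a strongly confident modality outweigh a weakly dissenting one, which a strict majority vote over per-modality arg-max labels would not do.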

3.1.1. Determining threshold confidence value for dataset construction

The original B-T4SA dataset contained 4.7M data samples labeled as ‘positive,’ ‘negative,’ and ‘neutral.’ While constructing the IIT-R MMEmoRec dataset with discrete emotion labels, i.e., ‘angry,’ ‘happy,’ ‘hate,’ and ‘sad,’ it was essential to retain only the samples having high confidence in the associated emotion label. After passing the image, speech, and text components of the inputs to the respective emotion recognition models as discussed in Section 3.2, we computed, for each data sample, its average prediction probability as a percentage of the maximum observed in its class. This value gave us the confidence of each data sample in its particular class. To determine the appropriate threshold, we plotted the possible threshold values against the ratio of each class present (the number of samples of each class divided by the total number of samples), as shown in Fig. 2.

The higher the threshold, the higher the confidence and the better the quality of data. However, a higher threshold value also leads to two issues: i) a reduction in the size of the dataset and ii) a disruption in the distribution of emotion classes compared to the original distribution. As seen in Fig. 2, the distribution of the classes at the highest thresholds is very different from the distribution obtained when all samples are retained.

Figure 2. Determining threshold confidence value for dataset construction.

An appropriate threshold value needs to be chosen that gives a good trade-off between high confidence and an appropriate size and distribution of the dataset. Up to a low threshold value, the distribution is almost the same as the original, but this confidence is too low to be acceptable. The next candidate range lies between two higher threshold values; within this range, the distribution of the various classes is almost unchanged and the confidence is acceptable. Hence, the average of these two values is chosen as the threshold confidence value.
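The trade-off described above can be explored with a small sweep over candidate thresholds. This is a minimal sketch under stated assumptions: the sample list, the candidate thresholds, and the function names are hypothetical, and confidence is taken as a sample's average probability divided by the maximum in its class.

```python
from collections import Counter

def confidences(samples):
    """samples: list of (label, avg_prob). Returns (label, avg_prob / class maximum)."""
    class_max = {}
    for label, prob in samples:
        class_max[label] = max(class_max.get(label, 0.0), prob)
    return [(label, prob / class_max[label]) for label, prob in samples]

def class_ratios(conf_samples, threshold):
    """Share of each class among the samples whose confidence meets the threshold."""
    kept = [label for label, conf in conf_samples if conf >= threshold]
    if not kept:
        return {}
    counts = Counter(kept)
    return {label: n / len(kept) for label, n in counts.items()}

def sweep(samples, thresholds):
    """Class distribution of the retained data for every candidate threshold."""
    conf_samples = confidences(samples)
    return {t: class_ratios(conf_samples, t) for t in thresholds}

# Hypothetical labeled samples: (emotion, averaged prediction probability).
toy_samples = [
    ("angry", 0.90), ("angry", 0.45), ("angry", 0.30),
    ("happy", 0.80), ("happy", 0.60),
    ("sad", 0.50),
]
ratios = sweep(toy_samples, thresholds=[0.0, 1.0])
```

In this toy data, retaining everything gives ‘angry’ half of the samples, while the strictest threshold leaves one sample per class, showing how a high threshold can distort the class distribution.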

3.1.2. Human evaluation

The MMEmoRec dataset has been evaluated by eight human evaluators. Two human readers (one male and one female) read aloud and recorded the text components of the data samples. The evaluators listened to the machine-synthesized speech alongside the speech recorded by the human readers and scored the contextual similarity between them on a scale of 0 to 100. The evaluators also judged whether the speech, image, and text components of the data samples agree with the annotated emotion label, individually and jointly. The samples were picked randomly, and the averages of the evaluators’ scores are reported in Table 3. Its first value column gives the percentage of evaluators reporting the synthetic speech (ss) to be similar to human speech (hs). The second and third columns give the percentage of synthetic and human speech components portraying the annotated emotion. Likewise, the fourth and fifth columns give the agreement of the image and text components with the annotated emotion class. The last two columns give the percentage of samples in which all three modalities agree with the annotated emotion class when considering synthetic and human speech, respectively.

Class    ss similar to hs  Emotion in ss  Emotion in hs  Image   Text    All (ss)  All (hs)
Angry    67.18%            82.81%         85.94%         70.31%  84.38%  75.00%    82.03%
Happy    52.78%            66.67%         69.44%         63.89%  72.22%  66.67%    69.44%
Hate     62.50%            71.43%         72.32%         67.86%  71.43%  73.21%    72.32%
Sad      60.42%            77.08%         78.13%         75.00%  87.50%  77.08%    83.33%
Overall  60.72%            74.49%         76.46%         69.26%  78.81%  72.99%    76.78%
Table 3. Human evaluation of MMEmoRec dataset.

The speech recorded by the two readers is referred to as human synthesized speech. 60.72% of the evaluators found the synthetic speech to be contextually similar to the human synthesized speech. 74.49% of the synthetic speech samples and 76.46% of the human synthesized speech samples were found to portray the annotated emotion labels. Further, 69.26% of the images and 78.81% of the text components of the data samples correspond to the annotated emotion labels. Moreover, the evaluators reported that 72.99% of the samples, considering machine-synthesized speech along with the corresponding text and image, were in line with the determined emotion label, which is comparable to the value of 76.78% obtained when considering human synthesized speech along with the corresponding text and image.

3.2. VISTA Net

The architecture of the proposed system, VISTA Net, is shown in Fig. 3; it has been decided based on the ablation studies discussed in Section 4.3.1. It fuses image, speech, and text features using a hybrid of two-stage intermediate and late fusion, which considers all possible pairs of the three modalities and automatically weights them without human intervention. The intermediate fusion combines the information from various modalities before classification, i.e., after feature extraction, whereas late fusion combines the information after classification.

Figure 3. Schematic architecture of the proposed multimodal emotion recognition system. Each modality has a pre-trained network and a simpler network; ‘i,’ ‘s,’ and ‘t’ denote the visual, speech, and text modalities, respectively.

The three modalities are fed into two types of networks: a pre-trained network and a simpler network. The intuition behind this approach is to build a fully automated multimodal emotion classifier by including the various modalities in all possible combinations and learning their weights during training without any human intervention. The proposed system contains a pre-trained network and a simpler network for each of the image, speech, and text modalities. The input speech is converted to a log-mel spectrogram before being fed into the network.

3.2.1. Intermediate fusion phase

The input images are fed into the pre-trained and simpler image networks. The pre-trained network consists of VGG16 (simonyan2014very) followed by a dense layer, and the simpler network contains three convolution layers followed by a dense layer. The spectrogram obtained from the speech input is first passed through a convolution layer to make it compatible with VGG16. It is then passed through the pre-trained and simpler speech networks, which have the same architectures as their image counterparts.

The text input is similarly passed through a pre-trained network containing BERT (devlin2018bert) and a simpler network consisting of an embedding layer and an LSTM layer. Both are followed by dense layers. In the intermediate fusion, all pairs of pre-trained and simpler networks from different modalities are formed and passed through a weighted-addition layer that we have defined. This gives six such combinations, each passed through dense layers, giving the classification based on each pair. Eq. 1 shows all the possible pairs formed from the combination of pre-trained and simpler networks such that the two networks do not belong to the same modality.

(1)

In Eq. 1, the six terms are the classification outputs for the various pairs of pre-trained and simpler networks. The weighted-addition layer ensures that, during training, the weight of each weighted addition is learned using back-propagation without any human intervention. Each weight in the layer is randomly initialized and then passed through a softmax layer, which gives the positive values that are used as the final weights and learned during training.
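The softmax-normalized weighting described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' layer: the function names `softmax` and `weighted_add` and the example vectors are hypothetical, and the learning of the raw weights by back-propagation is not shown.

```python
import math

def softmax(raw_weights):
    """Map raw (unconstrained) weights to positive values that sum to one."""
    peak = max(raw_weights)  # subtract the maximum for numerical stability
    exps = [math.exp(r - peak) for r in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_add(vectors, raw_weights):
    """Weighted sum of equally sized vectors with softmax-normalized weights."""
    weights = softmax(raw_weights)
    size = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(size)]

# Hypothetical outputs of one pre-trained and one simpler network.
pretrained_out = [0.2, 0.8]
simpler_out = [0.6, 0.4]

# Equal raw weights give a plain average; training would move them apart.
fused = weighted_add([pretrained_out, simpler_out], raw_weights=[0.0, 0.0])
```

Because the weights pass through a softmax, they stay positive and sum to one whatever raw values training produces, so the fused output remains a convex combination of its inputs.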

3.2.2. Late Fusion Phase

In this phase, the information from all possible pairs of the modalities is combined in a hybrid manner. The intermediate classification outputs obtained from Eq. 1 are passed through another weighted-addition layer, which combines these outputs dynamically and gives the final output, as depicted in Eq. 2. The output is passed through a dense layer with dimensions equal to the number of emotion classes, i.e., four.

(2)

In Eq. 2, the final output is obtained from the intermediate classification outputs.
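The two fusion stages can be sketched end to end as follows. This is a simplified illustration under stated assumptions: the dense layers between the stages are omitted, all raw weights are fixed rather than learned, and the names `cross_modal_pairs`, `hybrid_fusion`, `pair_raw`, and `late_raw` are hypothetical.

```python
from itertools import permutations
import math

MODALITIES = ("i", "s", "t")  # image, speech, text

def weighted_add(vectors, raw_weights):
    """Softmax-weighted sum of equally sized vectors."""
    peak = max(raw_weights)
    exps = [math.exp(r - peak) for r in raw_weights]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(vectors[0]))]

def cross_modal_pairs():
    """(pre-trained modality, simpler modality) pairs from different modalities."""
    return list(permutations(MODALITIES, 2))

def hybrid_fusion(pretrained, simpler, pair_raw, late_raw):
    """Intermediate fusion of each pair, then late fusion of the six outputs."""
    intermediate = [
        weighted_add([pretrained[a], simpler[b]], pair_raw[(a, b)])
        for a, b in cross_modal_pairs()
    ]
    return weighted_add(intermediate, late_raw)

# Hypothetical four-class outputs for every network.
pretrained = {m: [0.1, 0.2, 0.3, 0.4] for m in MODALITIES}
simpler = {m: [0.4, 0.3, 0.2, 0.1] for m in MODALITIES}
pair_raw = {pair: [0.0, 0.0] for pair in cross_modal_pairs()}
final = hybrid_fusion(pretrained, simpler, pair_raw, late_raw=[0.0] * 6)
```

Enumerating ordered pairs of distinct modalities yields exactly six combinations, which matches the count stated in the text.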

3.3. KAAP

This section proposes a novel multimodal interpretability technique, K-Average Additive exPlanation (KAAP), depicted in Fig. 4. It computes the importance of each modality and its features in predicting a particular emotion class. The existing interpretability techniques do not apply to speech and multimodal emotion recognition. Moreover, the most frequently used and accepted interpretability technique for images and text is SHAP (lundberg2017unified), which is an approximation of Shapley values (SHAPley1953value). SHAP and KAAP also differ in computational time complexity; the cost of KAAP is determined by a given hyper-parameter. In addition to multimodal emotion analysis, KAAP applies to a single modality or a combination of any two modalities.

Figure 4. Schematic representation of the proposed interpretability technique. Where , , and denote the no. of partitions for image, speech and text modalities; , and are the widths for image & speech feature matrices and is text feature vector’s length.
Algorithm 1 Predict (Probability prediction)
Inputs: the DNN model; image data; text data; speech spectrogram data; the type of data whose perturbed probability is required.
procedure Predict: three conditional branches, each with two steps, followed by a return.
Algorithm 2 KAAP_helper (Procedure to calculate KAAP value for each data instance)
Inputs: the DNN model; image data; text data; speech spectrogram data; the size of the data; the number of parts into which the data is divided; the type of data for which KAAP values are required.
procedure KAAP_helper: two nested loops; the inner loop contains three conditional branches of five steps each, followed by further conditional updates; the procedure ends with a return.
Algorithm 3 KAAP
Inputs: the DNN model; image data; text data; speech spectrogram data; the width of the input; the number of parts into which the data is divided; the type of data for which KAAP values are required.
procedure KAAP: a single return statement.

3.3.1. Calculating K exPlanable (KP) values

For a model with a given set of features, the K exPlanable (KP) value of a feature denotes its importance. Fig. 5 depicts an example calculation.

Figure 5. Sample model for KP values computation.

Consider four nodes: a node with no feature (i.e., NULL), a node with the single feature of interest, a node with all the remaining features, and a node with all the features. The ‘Marginal Contribution’ of an edge connecting two nodes is defined as the difference between the prediction probabilities obtained using their features. For a given predicted label, the marginal contribution of the feature for the edge from the NULL node to the single-feature node is calculated using Eq. 3. Here, the probability of the label for the single-feature node is calculated by having only that feature in the input and perturbing all other features to zero.

(3)

To calculate the overall importance of the feature, we take the weighted average of all of its marginal contributions, as given by Eq. 4.

(4)

The two weights in Eq. 4 must satisfy two conditions: i) the sum of the weights equals one, which normalizes them; ii) the weight of the first marginal contribution is a fixed multiple of the weight of the second. The second condition is based on the fact that the first marginal contribution is the effect of adding the feature to an empty set of features, while the second is the effect of adding it to a set containing the remaining features. This results in Eq. 5.

(5)

The values of the two weights, shown in Eq. 6, are computed using Eq. 5.

(6)

The KP values shown in Eq. 7 are computed using Eq. 4 and Eq. 6.

(7)
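One possible reading of the derivation in Eqs. 3 to 7 is written out below as a worked equation. Everything in it is an assumption introduced for illustration rather than notation confirmed by the text: $p_j$ stands for the predicted-label probability at node $j$ (nodes numbered in the order they are introduced), $MC_{12}$ and $MC_{34}$ for the two marginal contributions, $w_{12}$ and $w_{34}$ for their weights, $k$ for the number of features, and the multiplier in the second condition is taken to be $k-1$, the number of features already present in the larger set.

```latex
\begin{align}
MC_{12} &= p_2 - p_1, & MC_{34} &= p_4 - p_3,\\
w_{12} + w_{34} &= 1, & w_{12} &= (k-1)\, w_{34},\\
w_{34} &= \frac{1}{k}, & w_{12} &= \frac{k-1}{k},\\
KP &= \frac{k-1}{k}\, MC_{12} + \frac{1}{k}\, MC_{34}.
\end{align}
```

Substituting the second condition into the first gives $k\, w_{34} = 1$, from which both weights follow; any other multiplier would change the weights but not the structure of the argument.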

3.3.2. Calculating KAAP values

This section computes the KAAP values and uses them to determine the importance of each modality and its features. The information of the image, text, and speech modalities is in the same data format, i.e., a continuous format. For an image, a single pixel cannot define an object that can lead to a particular emotion, but a group of pixels can. For speech, the spectrogram at a single instance of time & frequency alone cannot define anything, but a time interval can. Likewise, for text, a single letter may not define an emotion, but a word is capable of doing so. Motivated by this fact, KAAP values are computed using the KP values for a group of features.

First, the input is divided into a number of parts governed by a hyperparameter that is decided through the ablation study in Section 4.3. These parts correspond to the feature groups of the input. Then, for each feature group, the KP value is computed for the given number of parts using Eq 6. It represents how a group of features performs compared to all the remaining groups. However, the number of parts can vary, which leads to groups of different sizes and thus to different KP values, affecting the importance assigned to the original features. To deal with this issue, the weighted average of the KP values over the different numbers of parts is taken, where each weight equals the number of features in the corresponding group, as given by Eq. 8. Note that the case of a single part is ignored here, as treating the whole input as one feature is not meaningful.

(8)

For an input image or speech spectrogram, the KP values for a given number of parts are calculated by dividing the input into that many parts along both axes. As both the image and the speech spectrogram are matrices, the feature groups form a two-dimensional grid, and the KAAP values for these two inputs are calculated using Eq. 9. It gives a matrix showing the importance of each pixel of a given image or speech input. For an image, this matrix directly represents the importance of its regions. For speech input, the values are averaged along the frequency axis to reduce the KAAP value matrix to the time axis, giving the importance of the speech at a given time.
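The grid partitioning and the frequency-axis averaging described above can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the split into nearly equal contiguous parts is an assumption, and the spectrogram is assumed to be stored with rows as frequency bins and columns as time steps.

```python
def split_bounds(size, parts):
    """Boundaries that cut `size` positions into nearly equal contiguous parts."""
    return [(j * size // parts, (j + 1) * size // parts) for j in range(parts)]

def grid_groups(height, width, parts):
    """Feature groups obtained by cutting both axes into `parts` pieces."""
    return [(rows, cols)
            for rows in split_bounds(height, parts)
            for cols in split_bounds(width, parts)]

def time_importance(kaap_matrix):
    """Average a (frequency x time) importance matrix over the frequency axis."""
    n_freq = len(kaap_matrix)
    n_time = len(kaap_matrix[0])
    return [sum(row[t] for row in kaap_matrix) / n_freq for t in range(n_time)]

groups = grid_groups(height=4, width=4, parts=2)
per_time = time_importance([[1.0, 2.0], [3.0, 4.0]])
```

Each group is a pair of (row range, column range), so cutting both axes into two parts gives four rectangular groups.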

(9)

For input text, the division is done such that each word is considered a feature, as an emotion can only be defined by a word, not a single letter, as discussed above. The text is then divided into parts, and as text can be represented by a linear array, the KAAP values are calculated using Eq. 8. The hyperparameter values used for the image, speech, and text modalities have been determined in Section 4.3.2. Furthermore, the importance of the visual, spoken, and textual modalities is computed by treating image, speech, and text as three distinct features and calculating each modality’s KAAP value. While finding the importance of the features of a particular modality, all the other modalities are perturbed to zero. The KAAP technique is depicted in Algorithm 3, which uses Algorithm 2 to calculate the KAAP values for each data instance and Algorithm 1 for probability prediction.
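A one-dimensional version of the procedure (the text case) can be sketched as below. It is a sketch under several assumptions, not the authors' Algorithms 1 to 3: `predict` is a stand-in for the model's probability of the predicted label, the weights (k-1)/k and 1/k are one reading of the two stated weight conditions, each group's KP value is assigned to every feature in the group, and `max_parts` is a hypothetical name for the hyperparameter.

```python
def split_bounds(size, parts):
    """Boundaries that cut `size` positions into nearly equal contiguous parts."""
    return [(j * size // parts, (j + 1) * size // parts) for j in range(parts)]

def kp_value(predict, x, lo, hi, parts):
    """KP value of the feature group x[lo:hi] when the input has `parts` groups."""
    only = [v if lo <= i < hi else 0.0 for i, v in enumerate(x)]
    without = [0.0 if lo <= i < hi else v for i, v in enumerate(x)]
    null = [0.0] * len(x)
    mc_empty = predict(only) - predict(null)   # group added to an empty set
    mc_rest = predict(x) - predict(without)    # group added to all other groups
    w_empty, w_rest = (parts - 1) / parts, 1 / parts  # assumed weights
    return w_empty * mc_empty + w_rest * mc_rest

def kaap_values(predict, x, max_parts):
    """Size-weighted average of KP values over 2..max_parts partitions."""
    if max_parts < 2:
        raise ValueError("at least two parts are needed")
    n = len(x)
    weighted_sum = [0.0] * n
    weight_total = [0.0] * n
    for parts in range(2, max_parts + 1):      # a single part is skipped
        for lo, hi in split_bounds(n, parts):
            size = hi - lo
            if size == 0:                      # more parts than features
                continue
            kp = kp_value(predict, x, lo, hi, parts)
            for i in range(lo, hi):
                weighted_sum[i] += size * kp
                weight_total[i] += size
    return [s / t for s, t in zip(weighted_sum, weight_total)]

# Toy model in which only the first feature affects the prediction.
toy_predict = lambda features: 0.5 * features[0]
values = kaap_values(toy_predict, [1.0, 1.0, 1.0, 1.0], max_parts=4)
```

In this toy case the first feature receives the full importance, its neighbour inherits a smaller share from the coarse partitions that group the two together, and the remaining features receive none.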

4. Implementation

4.1. Experimental setup

The network training for the proposed system has been carried out on an Nvidia Quadro P5000 GPU, whereas the testing & evaluation have been done on an Ubuntu machine with a 64-bit OS, an Intel(R) Core(TM) i7-8700 CPU at 3.70 GHz, and 16 GB of RAM.

4.2. Training strategy and hyperparameter setting

The model training has been performed using a batch-size of , train-test split of -, Adam optimizer, ReLU activation function with a learning rate of and ReduceLROnPlateau learning rate scheduler with patience value of . The baselines and proposed models converged in terms of validation loss in to epochs. As a safe upper bound, the models have been trained for epochs with EarlyStopping (prechelt1998early) with patience values of . The loss function used is the average of categorical focal loss (lin2017focal) and categorical cross-entropy loss. Accuracy, macro f1 (opitz2019macro), and CohenKappa (vieira2010cohen) have been analyzed for the model evaluation.
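The averaged loss can be sketched for a single sample as follows. This is a plain-Python illustration rather than the training code: the focusing parameter `gamma=2.0` is an assumed value that the text does not state, no class-balancing factor is applied, and the function names are hypothetical.

```python
import math

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))

def categorical_focal_loss(y_true, y_prob, gamma=2.0, eps=1e-7):
    """Focal loss: cross-entropy down-weighted for well-classified samples."""
    return -sum(t * (1.0 - p) ** gamma * math.log(max(p, eps))
                for t, p in zip(y_true, y_prob))

def combined_loss(y_true, y_prob, gamma=2.0):
    """Average of the categorical focal loss and the cross-entropy loss."""
    focal = categorical_focal_loss(y_true, y_prob, gamma)
    cross_entropy = categorical_cross_entropy(y_true, y_prob)
    return 0.5 * (focal + cross_entropy)

label = [1.0, 0.0, 0.0, 0.0]      # one-hot label over four emotion classes
probs = [0.7, 0.1, 0.1, 0.1]      # hypothetical model output
loss = combined_loss(label, probs)
```

With `gamma` set to zero the focal term reduces to the cross-entropy, so the combined loss equals the cross-entropy; larger values reduce the contribution of samples that are already classified confidently.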

4.3. Ablation studies and models

The ablation studies have been performed to determine the threshold confidence value for data construction, appropriate network configuration for VISTA Net, and suitable values for KAAP.

4.3.1. Ablation study 1: Determining baselines and proposed system’s architecture

To begin with, the emotion recognition has been performed for a single modality at a time, i.e., separate IER, SER, and TER using pre-trained VGG models (simonyan2014very) for Image & speech and BERT (devlin2018bert) for text. The performance has been evaluated in terms of Accuracy, CohenKappa metric (CK), F1 score, Precision, and Recall and summarized in Table 4. The CK metric measures if the distribution of the predicted class is in line with the ground truth or not.
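For reference, Cohen's kappa compares the observed agreement between predictions and ground truth with the agreement expected by chance. The sketch below uses the standard definition; the function name and the toy labels are illustrative, and it assumes that more than one class appears so that the chance agreement is below one.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement from the marginal class frequencies of both labelings.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1.0 - expected)

truth = ["happy", "happy", "sad", "sad"]
preds = ["happy", "sad", "sad", "sad"]
kappa = cohen_kappa(truth, preds)
```

A value of one means perfect agreement, while a value of zero means agreement no better than chance, which is why the metric is informative when class frequencies are unbalanced.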

| Model | Acc | F1 | CK | P | R |
|---|---|---|---|---|---|
| Image only | 60.44 | 0.60 | 0.324 | 0.60 | 0.60 |
| Speech only | 78.69 | 0.75 | 0.624 | 0.74 | 0.79 |
| Text only | 81.51 | 0.81 | 0.69 | 0.81 | 0.82 |
| Image + Text | 86.40 | 0.86 | 0.77 | 0.86 | 0.86 |
| Image + Speech | 84.66 | 0.85 | 0.746 | 0.85 | 0.85 |
| Text + Speech | 81.95 | 0.81 | 0.70 | 0.82 | 0.81 |
| Image + Speech + Text | 86.60 | 0.86 | 0.78 | 0.86 | 0.87 |

Table 4. Ablation Study 1. ‘Acc,’ ‘F1,’ ‘CK,’ ‘P,’ and ‘R’ denote accuracy, F1-score, CohenKappa score, precision and recall.

Next, we moved on to combinations of two modalities. The two chosen modalities are fed into their respective pre-trained models and then passed through a dense layer. The information from these modalities is then added using the weighted-addition layer defined in Section 3.2.1, and this output is passed through three dense layers, which then classify the emotion. Image + text comes out to be the best combination, beating the remaining two combinations in both Accuracy and CK values.

Finally, all three modalities are fed into their respective pre-trained models and then passed through a dense layer followed by a weighted-addition layer; the output of this layer is passed through dense layers as in the combination of two modalities. Combining all three modalities has matched or outperformed the remaining models in all the evaluation metrics. As observed during the experiments above, combining the information from the complementary modalities has led to better emotion recognition performance. Hence, the baselines and proposed model have been formulated, including all three modalities and various information fusion mechanisms, in Section 4.4.

4.3.2. Ablation study 2: Determining values for KAAP

An in-depth ablation study has been conducted here to decide the hyperparameter values used in Section 3.3.2. The dice coefficient (deng2018learning) is used to determine the best values. It measures the similarity of two data samples; a value of 1 denotes that the two compared data samples are completely similar, whereas a value of 0 denotes their complete dissimilarity. For each modality, KAAP values are calculated over a range of hyperparameter values, and the dice coefficient is calculated between the KAAP values at two adjacent values. This procedure has been performed for all three modalities, and the results are visualized in Fig. 6.
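The comparison of adjacent settings can be sketched as below. The text does not say how the dice coefficient is applied to continuous importance maps, so the soft form used here (twice the inner product divided by the sum of squared norms) is an assumption, as are the function names and the flat-list representation of each map.

```python
def dice_coefficient(a, b):
    """Soft Dice overlap of two equally sized, non-negative importance maps."""
    intersection = sum(x * y for x, y in zip(a, b))
    denominator = sum(x * x for x in a) + sum(y * y for y in b)
    if denominator == 0:          # two all-zero maps are treated as identical
        return 1.0
    return 2.0 * intersection / denominator

def adjacent_dice(maps):
    """Dice coefficient between each pair of consecutive importance maps."""
    return [dice_coefficient(first, second)
            for first, second in zip(maps, maps[1:])]

# Hypothetical importance maps obtained at three consecutive settings.
maps = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
scores = adjacent_dice(maps)
```

A score that approaches one between consecutive settings indicates that increasing the hyperparameter further no longer changes the importance map, which is the convergence behaviour the study looks for.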

Figure 6. Ablation Study 2: Determining appropriate values.

The effect of increasing values can be observed in the figure. For image & speech, the value converges to at = 7, while for text, the optimal value of k is .

4.4. Baselines and proposed models

The ‘Image + Speech + Text’ configuration described in Section 4.3.1 is considered as baseline 1, whereas the architectures of the other baseline models have been formulated by incorporating further improvements in the information fusion mechanisms.

The baseline models are built on a common idea, described below. First, all three modalities are fed into their pre-trained and simpler networks as described in Section 3.2, and the outputs are then passed through dense layers; the resulting outputs are combined using the weighted-addition layer to give three outputs. The following strategy is followed for combining them: every pre-trained network must be combined with a simpler network, and at least one combination must contain networks from different modalities, because if all the modalities combine with themselves, then such a combination will not lead to any information exchange. Thus, six such configurations are possible, as described in Eq. 10.

(10)

One configuration is discarded as it does not satisfy the condition that at least one combination must involve different modalities. Three configurations are partially complete, as one of their three outputs combines the pre-trained and simpler networks of the same modality. The remaining two configurations are complete.
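The count of six configurations and their split into discarded, partially complete, and complete ones can be checked with a short enumeration. The sketch assumes that each simpler network is used exactly once per configuration, which is one reading of the strategy above; the names and group labels are illustrative.

```python
from itertools import permutations

MODALITIES = ("image", "speech", "text")

def classify_configurations():
    """Group every pairing of pre-trained and simpler networks by completeness."""
    groups = {"discarded": [], "partially_complete": [], "complete": []}
    for simpler_order in permutations(MODALITIES):
        # Each entry pairs a pre-trained modality with a simpler modality.
        config = tuple(zip(MODALITIES, simpler_order))
        same_modality = sum(pre == simple for pre, simple in config)
        if same_modality == len(MODALITIES):
            groups["discarded"].append(config)           # no cross-modal pair
        elif same_modality > 0:
            groups["partially_complete"].append(config)  # one self-combination
        else:
            groups["complete"].append(config)            # all pairs cross-modal
    return groups

groups = classify_configurations()
counts = {name: len(configs) for name, configs in groups.items()}
```

Under this reading the enumeration reproduces the stated split: one configuration pairs every modality with itself, three contain a single self-combination, and two are fully cross-modal.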

The above strategy has two disadvantages: i) only two out of the five such baselines are complete, while the others are partially complete; ii) different datasets have different requirements. For example, a particular multimodal dataset may have better image and speech components, while other datasets may have better-quality text components. To generalize to any dataset and scenario, an automated multimodal emotion recognition system, VISTA Net, has been proposed, which combines all the outputs of baselines 2-6, leaves out any self-combination, and takes the weighted average of all the remaining ones. Hence, it automatically decides the weights of each combination according to the requirements of the problem statement and the dataset. The results of the baselines and the proposed system are summarized in Table 5 in the following section.

5. Results and Discussion

The emotion classification results have been discussed in this Section, along with their interpretation and a comparison of sentiment classification results with existing methods.

5.1. Quantitative results

The VISTA Net has achieved emotion recognition accuracy of 95.99%. Its class-wise accuracies are shown in Fig. 7 while its results, along with the results of baselines, are shown in Table 5.

Figure 7. Confusion matrix showing class-wise accuracies.
| Model | Acc | F1 | CK | P | R |
|---|---|---|---|---|---|
| Baseline 1 | 86.60 | 0.86 | 0.78 | 0.86 | 0.87 |
| Baseline 2 (#2) | 94.89 | 0.95 | 0.91 | 0.95 | 0.95 |
| Baseline 3 (#3) | 95.44 | 0.95 | 0.92 | 0.93 | 0.95 |
| Baseline 4 (#4) | 95.39 | 0.95 | 0.92 | 0.95 | 0.95 |
| Baseline 5 (#5) | 95.58 | 0.96 | 0.92 | 0.96 | 0.96 |
| Baseline 6 (#6) | 95.37 | 0.95 | 0.92 | 0.95 | 0.95 |
| VISTA Net | 95.99 | 0.96 | 0.93 | 0.96 | 0.96 |

Table 5. Results comparison for emotion recognition on IIT-R MMEmoRec dataset. ‘Acc,’ ‘F1,’ ‘CK,’ ‘P,’ & ‘R’ denote accuracy, F1-score, CohenKappa score, precision & recall. Baseline 1 has (Image + Speech + Text) configuration from Section 4.3.1.

5.2. Qualitative results

Fig. 8 shows sample emotion classification & interpretation results. The important speech and image features contributing to emotion classification are obtained, and corresponding words are highlighted. In the waveform, yellow and blue correspond to the most and least important features, respectively. As observed from Fig. 8, speech and text were the most contributing modalities for the prediction of ‘angry’ and ‘hate’ classes, whereas image and text modalities contributed equally to the determination of ‘happy’ and ‘sad’ classes.

Figure 8. Sample results; here, ‘P’ and ‘GT’ are the predicted and ground-truth labels, whereas ‘Score’ denotes the importance of the visual, spoken & textual modalities.

5.3. Results comparison

The emotion recognition results have been reported in Section 5.1. The IIT-R MMEmoRec dataset has been constructed from the B-T4SA dataset in this paper; hence, there are no existing emotion recognition results for it. However, sentiment classification (into ‘neutral,’ ‘negative,’ and ‘positive’ classes) results on the B-T4SA dataset are available in the literature, which have been compared with VISTA Net’s sentiment classification results in Table 6.

| Approach | Modality | Accuracy |
|---|---|---|
| Cross-Modal Learning (Vadicam2017ICCVW) | V + T | 51.30% |
| Multimodal Sentiment Analysis (gaspar2019multimodal) | V + T | 60.42% |
| Hybrid Fusion (kumar2021hybrid) | V + T | 86.70% |
| Automated ML (lopes2021automl) | V + T | 95.19% |
| VISTA Net (for Sentiment Classification) | V + S + T | 96.59% |

Table 6. Results comparison for sentiment classification on B-T4SA dataset with existing approaches. Here, ‘V,’ ‘S,’ and ‘T’ denote visual, spoken and textual modalities.

5.4. Performance for missing modalities

In real-life scenarios, some of the data samples in the multimodal data may be missing information about one of the modalities. The VISTA Net has been evaluated for such scenarios. We formulated four use-cases with image, speech, text, or no modality missing respectively and divided the test dataset into randomly selected equal parts accordingly. Then the information of the missing modality has been overridden to null, and VISTA Net has been evaluated for emotion recognition. Table 7 summarizes the results thus observed.
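The evaluation protocol above can be sketched as follows. This is an illustrative sketch only: samples are assumed to be dictionaries of flat feature lists, a missing modality is represented by zeros, and the names `drop_modality` and `assign_use_cases` are hypothetical.

```python
import random

USE_CASES = ("image", "speech", "text", None)  # None: no modality missing

def drop_modality(sample, missing):
    """Return a copy of `sample` with the given modality overridden by zeros."""
    masked = dict(sample)
    if missing is not None:
        masked[missing] = [0.0] * len(sample[missing])
    return masked

def assign_use_cases(n_samples, seed=0):
    """Randomly split sample indices into equal parts, one per use-case."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return {case: indices[j::len(USE_CASES)] for j, case in enumerate(USE_CASES)}

sample = {"image": [0.3, 0.7], "speech": [0.1, 0.9, 0.5], "text": [1.0]}
masked = drop_modality(sample, "speech")
split = assign_use_cases(n_samples=8)
```

The original sample is left untouched, so the same test item can be evaluated under several use-cases if needed.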

| Model | Acc | F1 | CK | P | R |
|---|---|---|---|---|---|
| Missing Image | 82.59 | 0.82 | 0.77 | 0.86 | 0.83 |
| Missing Speech | 57.62 | 0.45 | 0.75 | 0.75 | 0.58 |
| Missing Text | 62.82 | 0.68 | 0.70 | 0.87 | 0.63 |
| Missing None | 95.90 | 0.96 | 0.92 | 0.96 | 0.96 |

Table 7. Results for missing modalities. Here, ‘Acc,’ ‘F1,’ ‘CK,’ ‘P,’ and ‘R’ denote accuracy, F1-score, CohenKappa score, precision and recall.

As observed from Table 7, the emotion recognition performance with no modality missing (i.e., having the information from all three modalities) is in line with the results observed in Section 5.1. Further, missing image modality information has caused the smallest dip in performance: the speech and text modalities combined have resulted in an emotion classification accuracy of 82.59%, whereas including all the modalities resulted in 95.90% accuracy. These observations are in line with those in Section 4.3.1, where IER performance was lower than TER and SER performance.

5.5. Discussion

Various research tasks may require a particular modality’s information more than the others; for example, text and visual information may be secondary for multimodal speech recognition. Likewise, a multimodal emotion dataset might contain better quality information for a particular modality than other modalities. In such cases, it would require human intervention to decide which modality is more important for the analysis. However, the VISTA Net is capable of deciding that automatically. It considers all possible combinations of various modalities’ information and weighs them accordingly.

As the proposed MMEmoRec dataset contains the information of complementary modalities, it enables deep learning models to learn contextually related representations of the underlying emotions. The ground truth is obtained by applying unimodal models: the final label is obtained by averaging the probability of each emotion obtained from each model and is considered the ground truth of the dataset. If the same unimodal emotion recognition models that were used during the dataset construction are used for emotion recognition, then there will be a slight bias in the final performance. However, there will be no such bias in developing and using a newer multimodal emotion recognition model. Furthermore, human evaluators evaluated the IIT-R MMEmoRec dataset for the consistency of the determined emotion labels and the appropriateness of the speech component, which has been synthesized via text-to-speech.

The quantitative & qualitative results (Fig. 7 & 8 and Tables 5 & 6) have affirmed the importance of utilizing the information from complementary modalities. As observed from Fig. 8, different modalities have played a key role in determining the overall emotion portrayed by the input data sample. In some cases, the information for a particular modality may be missing from some of the data samples. The proposed system, VISTA Net, has been evaluated for such cases with missing modality information, and the observations are in accordance with the results previously observed and the insights gained during the ablation studies.

The proposed interpretability technique, KAAP, computes the importance of each modality and the importance of their respective features towards the prediction of a particular emotion class. The existing interpretability techniques such as SHAP and LIME are not applicable to speech modalities, whereas KAAP is applicable to all image, text, and speech modalities. The proposed technique is expected to pave the way for growth in multimedia emotion analysis. We also hope that the IIT-R MMEmoRec dataset will inspire further advancements in this context.

6. Conclusions and future work

The proposed system, VISTA Net, performs emotion recognition by considering the information from image, speech & text modalities. It combines the information from these modalities in a hybrid manner of intermediate and late fusion and determines their weights automatically. It has resulted in better performance on including image, speech & text modalities than including only one or two of these modalities. The proposed interpretability technique, KAAP, identifies each modality’s contribution and important features towards predicting a particular emotion class. The future research plan includes working on transforming emotional content from one modality to another. We will also work on controllable emotion generation, where the output contains the desired emotional tone.

References