Attentive pooling for Group Activity Recognition

Ding Li, Yuan Xie, Wensheng Zhang, Yongqiang Tang, Zhizhong Zhang
zhangwenshengia@hotmail.com
1 Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, People’s Republic of China
2 Research Center of Precision Sensing and Control, Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, People’s Republic of China
Abstract

In group activity recognition, a hierarchical framework is widely adopted to represent the relationships between individuals and their corresponding group, and it has achieved promising performance. However, existing methods simply employ max/average pooling in this framework, which ignores the distinct contributions of different individuals to group activity recognition. In this paper, we propose a new contextual pooling scheme, named attentive pooling, which enables a weighted information transition from individual actions to the group activity. By utilizing the attention mechanism, attentive pooling is intrinsically interpretable and able to embed member context into the existing hierarchical model. To verify the effectiveness of the proposed scheme, two specific attentive pooling methods are designed, i.e., global attentive pooling (GAP) and hierarchical attentive pooling (HAP). GAP rewards the individuals that are significant to the group activity, while HAP further considers the hierarchical division by introducing a subgroup structure. The experimental results on the benchmark dataset demonstrate that our proposal significantly outperforms the baseline and is comparable to state-of-the-art methods.

Research Article

1 Introduction

Understanding human actions in video sequences has attracted considerable research interest in computer vision. Over the past several years, substantial effort [simonyan2014two; wang2016temporal; ji20133d; tran2015learning; yu2017fully; varol2018long] has been devoted to individual action recognition, which focuses on the actions of a single person. When multiple individuals are in the scene, recognizing the activity performed by the group, rather than classifying person-level actions in isolation, supports many applications, e.g. video surveillance, sports analytics and video retrieval. Recently, group activity recognition has received considerable academic attention [choi2012unified; ibrahim2016hierarchical; shu2017cern; li2017sbgar; bagautdinov2017social; wang2017recurrent; biswas2018structural]. Compared with individual action recognition, the main challenge of group activity recognition is modelling the hierarchical relationships between individuals and the group.

To overcome this difficulty, numerous studies [ibrahim2016hierarchical; DBLP:journals/corr/IbrahimMDVM16; biswas2018structural] have focused on the hierarchical structure, building on the unity of individuals. Considering that individual actions collectively define the group activity, these approaches first establish a separate model to recognize and analyse individual actions and their dynamics. Then, the representation of the group activity is extracted by summarizing the information of the collected individual actions. Typically, in several deep learning approaches, recurrent neural networks, e.g. LSTM networks [LSTM], are used to capture the dynamics of actions and activities, and max pooling or average pooling is applied to aggregate the person-level representations.

Despite the promising performance, the max/average pooling adopted in [ibrahim2016hierarchical; DBLP:journals/corr/IbrahimMDVM16; shu2017cern; biswas2018structural] ignores the distinct contributions of different individuals to group activity recognition. For instance, Figure 1(a) shows a clip labelled as "Left Winpoint". From the unity of the players in this clip, we can conclude that the team on the left side of the court is gathering and celebrating its point, while their rivals have lost this round. The clip shown in Figure 1(b) is labelled as "Right Spike", where a person on the right side of the court performs the individual action "spiking" (marked by the yellow rectangle), and two players across the net are blocking (marked by the red rectangle). Meanwhile, several players perform the "Moving" action around their teammates, and persons labelled as "Standing" or "Waiting" are preparing to compete in the next round. In fact, we can recognize the group activity of the frame from the people who perform "spiking" and "blocking", because the others contribute less to the final group activity prediction. It is the individual differences that facilitate group activity recognition, and the more diverse the individuals are, the more complicated the contextual relationships of the group become. Previous pooling schemes aggregate features effectively when the contextual relationship is simple (Figure 1(a)), but achieve limited success when it becomes complicated (Figure 1(b)).

(a) A video clip labeled as group activity "Left Winpoint"
(b) A video clip labeled as group activity "Right Spike"
Figure 1: Illustration of our motivation for assigning different attention to different individuals. Players with different actions in the scene are marked by bounding boxes of different colours.

Thus, there is an imperative need to consider individual differences and the unity of the group simultaneously, and this motivates us to propose a novel pooling scheme, named attentive pooling. To verify the effectiveness of the proposed scheme, we first design a global attentive pooling (GAP) model, which rewards the individuals that are significant to the group activity. In addition, it has been shown that the structure of the focal group is beneficial for analysing the group activity [DBLP:journals/corr/IbrahimMDVM16], where subgroups are introduced by partitioning the whole group. Hence, we further present a hierarchical attentive pooling (HAP) model, which attaches variable importance to both person-level and subgroup-level features.

Our main contributions of this paper can be summarized as follows:

  1. We present a novel pooling scheme for group activity recognition, which can better explore the relationships between individual actions and group activities.

  2. We extend the attentive pooling scheme to the hierarchical mode, and thoroughly compare the different modes of proposed scheme.

  3. The evaluation of attentive pooling scheme is conducted on the widely-used Volleyball dataset, and the results confirm the effectiveness of our proposal.

The rest of this paper is organized as follows. In Section 2, we briefly review related work. Section 3 introduces several variants of attentive pooling based methods together with the implementation details. Experimental results and discussions are presented in Section 4. Finally, we conclude this paper in Section 5.

2 Related work

Human activity recognition is an active research topic, and surveys such as [herath2017going; chaquet2013survey] have reviewed the vast literature in this area. Here, we mainly focus on group activity recognition and on recent related advances in attention mechanisms.

Group activity recognition. Most early approaches used hand-crafted features to represent individual and group-level activity. Nabi et al. [nabi2013temporal] proposed a poselet activation pattern over time (TPOS) descriptor to capture human motions and interactions in groups. Choi et al. [choi2012unified] proposed a unified framework that simultaneously tracks multiple people and recognizes individual actions, interactions and collective activities. These methods rely on hand-crafted features, depend heavily on their underlying hypotheses, and cannot be trained in an end-to-end fashion.

In recent years, deep learning has been widely used in group activity recognition and has shown great potential to accelerate research. Considering the hierarchical structure of group activity analysis in video, a framework [ibrahim2016hierarchical] consisting of a multi-level cascade of recurrent neural networks was proposed, and it greatly inspired researchers in this area. The approach can be divided into two steps. First, tracklets of multiple persons are constructed from detections and trajectories, and spatio-temporal features are extracted from these tracklets by a deep convolutional neural network and a lower recurrent neural network. Second, the extracted features of the persons are fed into the higher recurrent neural network after passing through a pooling module along the person axis. The final predictions of individual actions and group activities are obtained via softmax in a feed-forward way. Ibrahim et al. [DBLP:journals/corr/IbrahimMDVM16] extended the hierarchical framework by splitting the whole group into several subgroups, which improved the performance. Thereafter, several methods were proposed from different perspectives. Shu et al. [shu2017cern] extended this hierarchical framework by specifying a novel energy layer and computing the corresponding p-values to estimate the most confident energy minimum. Bagautdinov et al. [bagautdinov2017social] presented a unified framework for detecting and recognizing human social behaviors in raw image sequences. Li et al. [li2017sbgar] proposed a semantics-based approach that generates captions from video frames and predicts the final activity categories based on the generated captions. Biswas et al. [biswas2018structural] proposed a structural recurrent neural network (SRNN) that uses a series of interconnected RNNs to jointly capture the actions of individuals, their interactions, and the group activity.
These methods, which employ max pooling, achieve clear improvements over those using hand-crafted features, but they ignore the distinct contributions of different individuals to group activity recognition. To address this issue, Wang et al. [wang2017recurrent] proposed a recurrent interactional context modelling scheme based on LSTM networks that produces more discriminative and descriptive interactional features. However, this framework uses three layers of LSTM networks, which increases the complexity of the model.

Attention mechanism. The attention mechanism can be regarded as content-based addressing when processing a focal sequence, and has been applied to several tasks in computer vision and natural language processing [vaswani2017attention]. According to how the attended parts are selected, attention mechanisms can be roughly divided into two branches, hard attention and soft attention. As a representative of the first branch, Mnih et al. [mnih2014recurrent] presented a recurrent neural network model that extracts information from an image or video by adaptively selecting a sequence of regions or locations and processing only the selected regions at high resolution. The hard attention block is trained with reinforcement learning, which may make convergence difficult. Instead of making hard selections of the parts of interest, soft attention mechanisms use weighted averages. In the soft attention branch, Sharma et al. [sharma2015action] proposed a Soft-Attention LSTM model built on top of multi-layered RNNs to selectively focus on parts of the video frames and classify videos after taking a few glimpses. Yang et al. [yang2016hierarchical] proposed a hierarchical attention network for document classification. In this approach, the network has two levels of attention mechanisms applied at the word and sentence level, enabling it to attend differentially to more and less important content. Li et al. [videolstm] presented an end-to-end sequence learning model for action recognition in video, which captures the spatial layout by hardwiring convolutions in the soft-attention LSTM. Long et al. [attentionclusters2019] explored the potential of the attention mechanism for aggregating local features, and proposed an attention-clusters-based framework for video classification.

Attention allows the model to reduce the negative impact of noisy parts, and can thus improve recognition performance. Specifically, attention allows the model to assign a relevance score to each element in the group and to highlight the elements with the most task-relevant information. Furthermore, it reveals the structural information within the group, which provides a way to make the results more interpretable.

3 Method

As mentioned before, the goal of this paper is to recognize activities of a group of people. We utilize a bottom-up approach to represent and recognize the individual actions and group activities in a hierarchical manner, and employ the attentive pooling scheme in the transition from person-level features to group-level features. Comprehensively understanding a single person action and its temporal dynamics is the foundation of group activity recognition, and modelling the relationships between individuals and their corresponding group in a hierarchical manner has been validated among researchers. By using attentive pooling scheme which is easy to be integrated into the hierarchical framework, we can take into account both individual differences and unity of the group.

Figure 2: Attentive pooling based framework for group activity recognition. Taking a sequence of frames where N players compete as input, we extract the static and dynamic representations of each individual by CNN and low-level LSTM layers, denoted as black rectangles. After that, we feed the individual features into attentive pooling layer. The output of attentive pooling layer is passed through high-level LSTM layers and softmax layers trained to recognize group activities. The timeline is represented by the dotted arrow.

3.1 Hierarchical Temporal Model (HTM) of Individual Action and Group Activity

To represent the action of a person clearly, it is necessary to capture spatial features and their temporal dynamics simultaneously. Human action can be viewed from both static and dynamic perspectives: some actions can be discriminated from a single frame, while recognizing others requires dynamic information extracted from multiple frames.

Inspired by the excellent performance of deep Convolutional Neural Network (CNN), we use CNN to extract features from the bounding box around the person in each time step on a person trajectory. Similar to [donahue2015long], we use long short-term memory (LSTM) models [LSTM] to represent temporal dynamics of individual action. Such temporal information and spatial features can represent an individual action, and are critical for group performance. LSTM has been widely employed for many sequential problems in computer vision. The content of the memory cell in a LSTM unit is regulated by several gating units that control the flow of information in and out of the cells. The control they offer also helps in avoiding spurious gradient updates that can typically happen in training RNNs when the length of a temporal input is large. This property enables us to stack such layers in order to learn complex dynamics present in the input in different ranges. The output of the CNN, represented by , can be considered as a region-based feature describing the appearance around a person.

For person in the scene, we denote the sequence of spatial features extracted by CNN by , and the sequence of temporal features . is the length of the sampled video clip. Then, will be fed into LSTM block where the forget gate, input gate and output gate are trained automatically, the update process at time-step t can be simply expressed as:
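
The standard LSTM update, which matches the forget, input and output gates named above, can be sketched as follows. The symbols are assumptions made here for illustration and are not taken from the original notation: $x_t$ is the CNN feature of one person at time-step $t$, $h_t$ and $c_t$ are the hidden and cell states, $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and the $W$, $U$ and $b$ terms are learned parameters.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
g_t &= \tanh\left(W_g x_t + U_g h_{t-1} + b_g\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```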

Besides, we choose to concatenate the spatial(static) and temporal (dynamic) feature (represented by ), and get the comprehensive representation of single person’s action, denoted as . At time-step t, it can be expressed as follows:
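
Under assumed notation, with $x_t$ the CNN (static) feature and $h_t$ the person-level LSTM hidden state (dynamic feature) of one person at time-step $t$, the concatenation can be written as below. The symbol $p_t$ for the combined person representation is likewise a placeholder chosen here.

```latex
p_t = \left[\, x_t \,;\, h_t \,\right]
```

Here $[\cdot\,;\cdot]$ denotes vector concatenation, so the dimension of $p_t$ is the sum of the dimensions of $x_t$ and $h_t$.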

For a group of people, the group-level activity is collectively defined by the individual person-level actions. Once the spatio-temporal features of the person-level actions are obtained, the next step is to model the group-level activity in a hierarchical architecture. The whole framework is shown in Fig. 2.

In order to get group-level activity information, we need to apply pooling operation to person-level spatio-temporal features. Here, we propose attentive pooling methods to aggregate person-level features. The aim of attentive pooling is to get a comprehensive collection of person-level information, so as to change our focal object into the whole group. Attentive pooling is the core of our method, and will be explicitly shown in next section. After that, we get the representation of the whole group, denoted as . Finally, we can take group representation as the input of group-level LSTM, thus, we can get final recognition by analysing the hidden state of the second-stage LSTM layer. Explicitly, we make pass through a fully connected layer, then input the second-stage LSTM layer. The hidden state of the second-stage LSTM layer is the group-level features we want. carries temporal information for the whole group dynamics. In the end, we feed into a softmax classification layer to get final predictions for group activity. In order to train the network, we minimize the joint loss:

where denotes the cross-entropy loss, is the prediction function of the model for the group activity, and is the prediction function for the actions of the individuals, and denote the labels of group activity and individual action, is number of person in the group, is the weight parameter.
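
One joint loss that is consistent with these definitions is sketched below. The symbols are assumed for illustration: $\ell$ is the cross-entropy loss, $f_g$ and $f_p$ are the prediction functions for the group activity and the individual actions, $X$ is the input clip, $x_n$ is the input of the $n$-th person, $y_g$ and $y_n$ are the group and individual labels, $N$ is the number of persons, and $\lambda$ is the weight parameter. Placing $\lambda$ on the person-level term and averaging that term over the $N$ persons are assumptions, not details confirmed by the text.

```latex
\mathcal{L} \;=\; \ell\big(f_g(X),\, y_g\big) \;+\; \lambda \,\frac{1}{N} \sum_{n=1}^{N} \ell\big(f_p(x_n),\, y_n\big)
```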

3.2 Attention Module and Attentive Pooling

As mentioned above, persons in the scene usually perform different actions, and these actions collectively define the activity of the group. In this case, some person actions are strongly correlated with the group activity, while others contribute less to the final group activity recognition. The aim of attentive pooling is to collect person-level information comprehensively, so as to obtain a more robust representation of the whole group and to shift the focus naturally from a single person to the whole group.

Inspired by success of attention mechanism applied in several tasks of computer vision and natural language processing, we derive several diversified forms of attentive pooling.

3.2.1 Global attentive pooling (GAP)

To reward individual actions that are clues to correctly classify a group activity, we introduce a person level weight vector and use the vector to measure the importance of the personal actions. At time step t, given representing the action of person, firstly, we feed the personal action feature through a one-layer MLP and get , and then compute the similarity of with the context vector . Here, is the hidden representation of , and corresponds to a person-level context vector. The aim is to get the importance of the person action by measuring the similarity of and , and the weighted averages of is the feature vector of the whole group. This process can be formulated as follows:

Where is the feature vector representing group activity that summarizes all the information of person-level actions in the scene, is the corresponding weight of , is number of person in the scene. The architecture of GAP is shown in Fig. 3.
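
A formulation consistent with this description, modelled on the attention of [yang2016hierarchical], can be sketched as follows. The notation is assumed here: $p_{n,t}$ is the feature of the $n$-th person at time-step $t$, $u_{n,t}$ is its hidden representation from the one-layer MLP with parameters $W_p$ and $b_p$, $c_p$ is the person-level context vector, $\alpha_{n,t}$ is the attention weight, $v_t$ is the group feature, and $N$ is the number of persons. The $\tanh$ activation and the choice to average the person features (rather than their hidden representations) are assumptions.

```latex
\begin{aligned}
u_{n,t} &= \tanh\left(W_p\, p_{n,t} + b_p\right) \\
\alpha_{n,t} &= \frac{\exp\left(u_{n,t}^{\top} c_p\right)}{\sum_{m=1}^{N} \exp\left(u_{m,t}^{\top} c_p\right)} \\
v_t &= \sum_{n=1}^{N} \alpha_{n,t}\, p_{n,t}
\end{aligned}
```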

Figure 3: Global attentive pooling at time-step t
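
The GAP step of Section 3.2.1 can be illustrated with a minimal NumPy sketch for a single time-step. This is not the authors' implementation: the function name, the `tanh` activation, the random parameters and the small dimensions are all assumptions chosen for illustration, and in the real model the MLP and context vector would be learned jointly with the rest of the network.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    exp = np.exp(scores - np.max(scores))
    return exp / exp.sum()

def global_attentive_pooling(person_feats, W, b, context):
    """Pool N person features into one group feature using attention.

    person_feats: (N, D) person-level features at one time-step
    W: (D, H) and b: (H,) form the one-layer MLP (tanh is an assumption)
    context: (H,) person-level context vector
    Returns the group feature (D,) and the attention weights (N,).
    """
    hidden = np.tanh(person_feats @ W + b)   # hidden representation, (N, H)
    scores = hidden @ context                # similarity with the context, (N,)
    weights = softmax(scores)                # attention weights sum to 1
    group_feat = weights @ person_feats      # weighted average, (D,)
    return group_feat, weights

# Hypothetical sizes: 12 players, 8-dim features, 4 hidden units.
rng = np.random.default_rng(0)
N, D, H = 12, 8, 4
feats = rng.normal(size=(N, D))
W = rng.normal(size=(D, H)) * 0.1
b = np.zeros(H)
context = rng.normal(size=H)
group_feat, weights = global_attentive_pooling(feats, W, b, context)
```

With an all-zero context vector every score is equal, so the weights become uniform and the result reduces to average pooling, which shows that average pooling is a special case of this scheme.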

3.2.2 Hierarchical attentive pooling (HAP)

Observing that the contributions of subgroups to the activity classification varies ,we also need to consider the subgroups discriminatingly. For instance, the global label of the whole group is "Right Winpoint", which means the volleyball team in right side of the court wins in this round. At this moment, the importance of subgroup in the right side is usually higher than that of subgroup in the left side. Inspired by the success of hierarchical attention network yang2016hierarchical, we extend the overall person attentive pooling into hierarchical person pooling for subgroups and individuals. First of all, we divide the group of person in the scene into several subgroups. Secondly, attentive person pooling which is similar to overall person pooling is adopted to aggregate person-level representation of each person in the subgroup, and the representations of subgroups are aggregated by attentive subgroup pooling. Finally, we get the representation of the whole group , and feed it into the softmax classification layer. Here, is the hidden representation of , and corresponds to a subgroup-level context vector. The architecture of HAP is shown in Fig. 4. Mathematically, for person in subgroup, person-level attentive pooling can be expressed as:

And subgroup-level attentive pooling is formulated as:

Where is number of subgroups in the group, denotes number of person in subgroup.
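
A two-level formulation consistent with this description can be sketched as follows, with the time index omitted for brevity. All symbols are assumed for illustration: $p_{k,n}$ is the feature of the $n$-th person in the $k$-th subgroup, $N_k$ is the number of persons in that subgroup, $K$ is the number of subgroups, $c_p$ and $c_s$ are the person-level and subgroup-level context vectors, $s_k$ is the pooled feature of subgroup $k$, and $v$ is the representation of the whole group. The $\tanh$ activations and the use of separate MLP parameters at the two levels are assumptions.

```latex
\begin{aligned}
u_{k,n} &= \tanh\left(W_p\, p_{k,n} + b_p\right), &
\alpha_{k,n} &= \frac{\exp\left(u_{k,n}^{\top} c_p\right)}{\sum_{m=1}^{N_k} \exp\left(u_{k,m}^{\top} c_p\right)}, &
s_k &= \sum_{n=1}^{N_k} \alpha_{k,n}\, p_{k,n} \\
q_k &= \tanh\left(W_s\, s_k + b_s\right), &
\beta_k &= \frac{\exp\left(q_k^{\top} c_s\right)}{\sum_{j=1}^{K} \exp\left(q_j^{\top} c_s\right)}, &
v &= \sum_{k=1}^{K} \beta_k\, s_k
\end{aligned}
```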

Figure 4: Hierarchical attentive pooling at time-step t
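
The two-level pooling of Section 3.2.2 can be illustrated with the following NumPy sketch for one time-step. It is a simplified stand-in rather than the authors' code: the helper names, the `tanh` activation, the random parameters and the split of players into equal consecutive chunks with `np.array_split` are assumptions made for illustration.

```python
import numpy as np

def attentive_pool(feats, W, b, context):
    """Attention-weighted average of the rows of feats, shape (M, D)."""
    scores = np.tanh(feats @ W + b) @ context        # (M,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over the M rows
    return weights @ feats, weights

def hierarchical_attentive_pooling(person_feats, num_subgroups,
                                   person_params, subgroup_params):
    """Pool persons into subgroups, then subgroups into the whole group."""
    # Split players into consecutive subgroups (assumed ordering).
    subgroups = np.array_split(person_feats, num_subgroups, axis=0)
    # Person-level attentive pooling inside each subgroup.
    subgroup_feats = np.stack(
        [attentive_pool(g, *person_params)[0] for g in subgroups])
    # Subgroup-level attentive pooling over the pooled subgroup features.
    return attentive_pool(subgroup_feats, *subgroup_params)

# Hypothetical sizes: 12 players, 8-dim features, 4 hidden units, 2 subgroups.
rng = np.random.default_rng(0)
N, D, H = 12, 8, 4
feats = rng.normal(size=(N, D))
person_params = (rng.normal(size=(D, H)), np.zeros(H), rng.normal(size=H))
subgroup_params = (rng.normal(size=(D, H)), np.zeros(H), rng.normal(size=H))
group_feat, subgroup_weights = hierarchical_attentive_pooling(
    feats, 2, person_params, subgroup_params)
```

The returned `subgroup_weights` indicate how much each subgroup contributes to the group feature, which is the quantity one would inspect to see which side of the court the model attends to.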

3.2.3 Integration and Separation

As mentioned above, the actions of subgroups and individuals collectively define the group activity, and the features of subgroups can be aggregated in two different ways. Some previous works treat the scene as a single group, integrating individuals or subgroups, while others consider the subgroups separately. The former normalize the local features (denoted as 1 group), while the latter concatenate them (denoted as 2 groups). To be consistent with other approaches [ibrahim2016hierarchical; shu2017cern; biswas2018structural], we perform experiments in both settings: we treat the group of people as an integrated whole, and we divide them into separate subgroups when using the concatenation block.

To harness the advantage of attentive pooling and group split, we employ GAP within the subgroups and get group-level feature by concatenating the representations of subgroups , named subgroup GAP. The architecture of subgroup GAP is shown in Fig. 5. Thus,

In general, this method treats the subgroups of the whole group separately, whereas GAP and HAP essentially integrate the local features.

Figure 5: Attentive pooling with concatenation of separate subgroups (subgroup GAP) at time-step t
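
Subgroup GAP, as described in Section 3.2.3, applies attentive pooling inside each subgroup and concatenates the results. The NumPy sketch below illustrates this for one time-step under stated assumptions: the function names are hypothetical, the attention parameters are shared across subgroups, the activation is `tanh`, and players are split into consecutive chunks in annotation order.

```python
import numpy as np

def attentive_pool(feats, W, b, context):
    """Attention-weighted average of the rows of feats, shape (M, D)."""
    scores = np.tanh(feats @ W + b) @ context
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ feats

def subgroup_gap(person_feats, num_subgroups, W, b, context):
    """Apply GAP within each subgroup and concatenate the pooled features."""
    subgroups = np.array_split(person_feats, num_subgroups, axis=0)
    pooled = [attentive_pool(g, W, b, context) for g in subgroups]
    return np.concatenate(pooled)        # shape (num_subgroups * D,)

# Hypothetical sizes: 12 players, 8-dim features, 4 hidden units, 2 subgroups.
rng = np.random.default_rng(0)
N, D, H = 12, 8, 4
feats = rng.normal(size=(N, D))
W = rng.normal(size=(D, H)) * 0.1
b = np.zeros(H)
context = rng.normal(size=H)
group_feat = subgroup_gap(feats, 2, W, b, context)
```

Unlike GAP and HAP, the output dimension grows with the number of subgroups, because the subgroup features are kept side by side instead of being averaged.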

3.3 Implementation Details

In accordance to [ibrahim2016hierarchical], we adopt the AlexNet [krizhevsky2012imagenet] pre-trained on the ImageNet [deng2009imagenet] and extract the feature vector for a person from the last convolutional layer. During training, we fine-tune the fc6 and fc7 layer. For temporal modelling, we employ LSTM implemented by Tensorflow library. The person-level LSTMs have 3000 hidden units, and group-level LSTMs have 2000 hidden units, initialized with a Gaussian distribution. And the MLP in attention module has 512 hidden units, initialized by Xavier [glorot2010understanding]. For convenience, we split the whole group into 2 subgroups in the order in which the players are arranged in the annotations. We train our hierarchical networks in two steps: training the person-level networks and training the hierarchical attentive networks jointly. In the experiments, we set , and use stochastic gradient descent with ADAM [kingma2014adam], with the initial learning rate set to 0.00001.

4 Experiment

4.1 Dataset

We evaluate our framework on the widely-used Volleyball dataset [ibrahim2016hierarchical]. This dataset was released in 2016 and contains 55 volleyball game videos with 4830 labelled frames, in which each player is annotated with a bounding box and an action label. Each player performs one of 9 individual actions, and each labelled frame is assigned one of 8 group activity labels. The dataset is divided into non-overlapping sets of 24 sequences for training, 15 sequences for validation, and the remaining sequences for testing. Similar to [ibrahim2016hierarchical; shu2017cern], we use both the training and validation sequences for training. Since not all frames are annotated with bounding boxes, the Dlib tracker [king2009dlib] is used to propagate the ground-truth bounding boxes to the unannotated frames.

4.2 Baselines

In our experiments, we compare the proposed methods with previous baselines. The following models are considered in all our experiments:

  1. B1 (HTM-max pooling): This baseline model is the same as [DBLP:journals/corr/IbrahimMDVM16]. We compare the performance of this method both in the experiment of integration and separation.

  2. B2 (HTM-avg pooling): This is similar to B1, but the max pooling operation has been replaced with average pooling operation when aggregating individual features. Both results in the experiment of integration and separation are shown as follows.

  3. B3 (HTM-GAP): To verify the effectiveness of overall attentive pooling, we employ GAP in the HTM. We assign the learned corresponding attention weight to each member in the group, and all the individual features would be attentively pooled.

  4. B4 (HTM-HAP): We employ HAP in the HTM, so that both subgroups and individuals are attentively pooled. Compared with the B3 model, the hierarchical attention framework is added.

  5. B5 (HTM-subgroup GAP and concatenation): In the track of separation, we employ GAP within the subgroup and concatenate the representations of subgroups in the HTM.
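
For contrast with the attentive variants, the max and average pooling used in B1 and B2 can be sketched in NumPy as below, in both the integration (1 group) and separation (2 groups) settings. The function names and the toy feature values are hypothetical, and splitting the players into two consecutive halves is an assumption.

```python
import numpy as np

def max_pool_persons(person_feats):
    """Element-wise maximum over the person axis (B1-style pooling)."""
    return person_feats.max(axis=0)

def avg_pool_persons(person_feats):
    """Element-wise mean over the person axis (B2-style pooling)."""
    return person_feats.mean(axis=0)

def pool_two_groups(person_feats, pool_fn):
    """Separation setting: pool each half of the players, then concatenate."""
    left, right = np.array_split(person_feats, 2, axis=0)
    return np.concatenate([pool_fn(left), pool_fn(right)])

# Toy example: 4 players with 2-dim features.
feats = np.array([[1.0, 0.0],
                  [3.0, 2.0],
                  [5.0, 4.0],
                  [7.0, 6.0]])
integrated_max = max_pool_persons(feats)                 # one group
integrated_avg = avg_pool_persons(feats)                 # one group
separated_max = pool_two_groups(feats, max_pool_persons) # two groups
```

Both operators treat every player identically, which is the limitation that the attention weights in B3 to B5 are meant to address.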

4.3 Experimental results on the volleyball dataset

We have compared the accuracy of group activity recognition of baselines (Table 4.3) and previous methods (Table 4.3), and the experimental results of integration as well as separation are reported.

Comparison of our methods with baseline methods on the volleyball dataset.

Approach      Methods                 Accuracy of group activity
Integration   B1-HTM-max pooling      70.3%
Integration   B2-HTM-avg pooling      68.5%
Integration   B3-HTM-GAP              74.2%
Integration   B4-HTM-HAP              77.7%
Separation    B1-HTM-max pooling      81.9%
Separation    B2-HTM-avg pooling      80.7%
Separation    B5-HTM-subgroup GAP     84.5%

As shown in Table 4.3, our methods using the GAP or HAP module outperform the baseline models. The integration results show that combining the HTM with the GAP module improves the accuracy of group activity recognition, which verifies the effectiveness of the proposed attentive pooling. Performance is further improved by employing the HAP module, which suggests that the attentive pooling scheme can be extended with structural information as additional assistance. Apart from the integration experiment, the accuracy in the separation setting is also increased by 1.5% compared with the B1 method. Because the two teams in the volleyball scenario are distributed on the two sides of the net, the separation models outperform the integration models by a large margin.

Comparison to the state-of-the-art on the volleyball dataset.

Approach      Methods                                                      Accuracy   Year
Integration   B1-Hierarchical LSTM [DBLP:journals/corr/IbrahimMDVM16]      70.3%      2016
Integration   CERN (1 group) [shu2017cern]                                 73.5%      2017
Integration   SBGAR [li2017sbgar]                                          66.9%      2017
Integration   SRNN (1 group) [biswas2018structural]                        74.4%      2018
Integration   Ours-B4                                                      77.7%
Separation    B1-Hierarchical LSTM [DBLP:journals/corr/IbrahimMDVM16]      81.9%      2016
Separation    CERN (2 groups) [shu2017cern]                                83.3%      2017
Separation    SRNN (2 groups) [biswas2018structural]                       83.4%      2018
Separation    Ours-B5                                                      84.5%

Table 4.3 compares the proposed methods with previous methods on this dataset. In the integration experiment, our HAP based model achieves higher accuracy than previous methods. In the separation experiment, the subgroup GAP based model performs better than CERN and the baseline method, and is comparable to the SRNN model. All the experimental results are obtained with AlexNet features.

4.4 Discussion

In this part, we first discuss the impact of the number of hidden units in the attention module and the confusion matrix of the predictions. Then we present some of the experimental results and analyse how our proposed methods work.

Comparison of different numbers of MLP hidden units in attention module.

Numbers of hidden units   Methods             Accuracy of group activity
1024                      HTM-GAP             69.0%
1024                      HTM-HAP             77.7%
1024                      HTM-subgroup GAP    83.1%
512                       HTM-GAP             74.2%
512                       HTM-HAP             77.4%
512                       HTM-subgroup GAP    83.4%
256                       HTM-GAP             73.3%
256                       HTM-HAP             77.4%
256                       HTM-subgroup GAP    82.4%

To evaluate the impact of the number of MLP hidden units in the attention module, we conduct a set of experiments outlined in Table 4.4. For both GAP and subgroup GAP, we obtain the best performance with 512 hidden units, while for HAP the accuracy is highest with 1024 hidden units.

(a) Original frame in the dataset
(b) Frame with individual attention weights assigned
(c) Frame with individual spotlight according to the attention weights
Figure 6: Examples of visual results (better viewed in colour). In figure (b), the length of the bold red line annotated alongside each bounding box corresponds to the value of the attention weight: the longer the red line, the more important the individual. In figure (c), we highlight the important individuals according to the values of the attention weights. The examples show that the attentive pooling method not only improves the accuracy of group activity recognition, but also makes the model more interpretable.
Figure 7: Confusion matrix on the Volleyball Dataset obtained by using our attentive pooling based model.

Figures 6 and 7 show the attention maps of group activities produced by our model and the confusion matrix, respectively. The case study in Figure 6 illustrates how attentive pooling works, using three types of group activities as typical examples. In the 1st row, the frame is labelled as "Left Set". The longer red line beside the player who is setting the ball for his teammates indicates that he receives a higher attention weight, and the different contribution of each person becomes easier to see when the important individuals are highlighted. Similarly, the frame labelled as "Right Spike" in the 2nd row highlights the players who perform spiking and blocking on the court. Intuitively, we can recognize the group activity just by considering the highlighted individual actions. The bottom frame, which is labelled as "Left Winpoint", is slightly different from the frames above. In this frame, the volleyball players on the left side of the court are gathering, while their rivals are scattered on the other side of the court. The results show that the average attention weight of the players who gather together is higher than that of the scattered players, which makes it easier to recognize the group label "Left Winpoint". The learned attention weights not only strengthen the positive effect of individual differences on recognition, but also cooperate well with the unity of individuals.

From the confusion matrix shown in Figure 7, we can observe that "Left Winpoint" achieves the highest accuracy, while the accuracies of "Right Set" and "Right Pass" are lower than 80%. The gathering of players is easily recognized by our model, which leads to the high performance on winpoint activities. Due to the complexity of the volleyball scenario and the similarity of set and pass activities, there is some confusion when distinguishing them. Besides, the interval between setting and spiking is always very short, which makes these activities difficult to distinguish.
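
The per-class accuracies discussed above are read off the diagonal of the confusion matrix. The sketch below shows that computation in NumPy on a toy matrix: the counts are hypothetical and are not the values of Figure 7, and the convention that rows are true classes and columns are predicted classes is an assumption.

```python
import numpy as np

def per_class_accuracy(confusion):
    """Fraction of each true class (row) that is predicted correctly."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / confusion.sum(axis=1)

# Hypothetical counts for three classes (rows: true, columns: predicted).
toy = np.array([[8, 2, 0],
                [1, 7, 2],
                [0, 0, 10]])
class_acc = per_class_accuracy(toy)
overall_acc = np.trace(toy) / toy.sum()
```

Off-diagonal entries in a row show which classes a given activity is confused with, which is how the confusion between set and pass activities would appear.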

5 Conclusions

In this work, we propose a novel pooling scheme based on the attention mechanism, named attentive pooling, which takes into account the different contributions that each part of the group makes. GAP attentively selects the focal parts inside the group, and HAP extends GAP by modelling the structure of the group in a hierarchical manner. By utilizing these variants of attentive pooling, the proposed model can generate more discriminative and interpretable representations of the group activity. We evaluated the models on the widely-used Volleyball dataset, and the experimental results show that the proposed model outperforms the baselines.

6 Acknowledgements

The authors are thankful for the financial support from the National Natural Science Foundation of China (U1636220, 61432008, 61472423).

References