Deep learning models for video-based action recognition usually generate features for short clips (consisting of a few frames); such clip-level features are aggregated into a video-level representation by computing statistics on them. Typically, zeroth-order (max) or first-order (average) statistics are used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, dubbed temporal correlation pooling, that generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of the clip-level CNN features computed across the video. Such a descriptor, while being computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than its first-order counterparts. We also propose higher-order extensions of this scheme by computing correlations after embedding the CNN features in a reproducing kernel Hilbert space. We provide experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking Activities and JHMDB, as well as the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes, which, when combined with hand-crafted features (as is standard practice), achieve state-of-the-art accuracy.
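A minimal sketch of the core second-order aggregation idea, assuming the clip-level features are stacked into a T x D matrix; the kernelized (RKHS) extension and the end-to-end learnable layer described in the abstract are not shown, and all function names and shapes below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def second_order_pool(clip_features: np.ndarray) -> np.ndarray:
    """Aggregate T clip-level descriptors (T x D) into a single
    second-order video descriptor via a D x D correlation matrix."""
    X = clip_features - clip_features.mean(axis=0, keepdims=True)  # center over time
    C = X.T @ X / max(len(X) - 1, 1)                               # D x D covariance
    # Normalize to correlations so the descriptor reflects co-activations of features
    d = np.sqrt(np.clip(np.diag(C), 1e-12, None))
    R = C / np.outer(d, d)
    # Symmetric matrix: keep the upper triangle as a compact vector
    iu = np.triu_indices(R.shape[0])
    return R[iu]

# Example: 16 clips, 128-dimensional CNN features per clip
video_descriptor = second_order_pool(np.random.rand(16, 128))
print(video_descriptor.shape)  # (8256,) = 128*129/2
```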
This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves. We develop a new framework called depth2action and experiment thoroughly into how best to incorporate the depth information. We introduce spatio-temporal depth normalization (STDN) to enforce temporal consistency in our estimated depth sequences. We also propose modified depth motion maps (MDMM) to capture the subtle temporal changes in depth. These two components significantly improve the action recognition performance. We evaluate our depth2action framework on three large-scale action recognition video benchmarks. Our model achieves state-of-the-art performance when combined with appearance and motion information thus demonstrating that depth2action is indeed complementary to existing approaches.
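The abstract does not give the exact formulation of the modified depth motion maps; the sketch below only illustrates the standard depth-motion-map idea they build on (accumulating absolute frame-to-frame depth differences over a window), as a hedged stand-in rather than the paper's MDMM. The window parameters are assumptions.

```python
import numpy as np

def depth_motion_map(depth_seq: np.ndarray, start: int = 0, length=None) -> np.ndarray:
    """Accumulate absolute differences of consecutive depth frames.

    depth_seq: (T, H, W) array of estimated depth maps.
    Returns an (H, W) map highlighting where depth changed over the window.
    """
    if length is None:
        length = len(depth_seq) - start
    window = depth_seq[start:start + length]
    diffs = np.abs(np.diff(window, axis=0))   # (length-1, H, W) frame-to-frame changes
    return diffs.sum(axis=0)

# Example: 30 frames of 240x320 estimated depth; accumulate motion over the whole clip
dmm = depth_motion_map(np.random.rand(30, 240, 320))
```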
Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames, failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition, UCF101 (92.7%) and HMDB51 (67.2%).
Current state-of-the-art approaches to video understanding adopt temporal jittering to simulate analyzing the video at varying frame rates. However, this does not work well for multi-rate videos, in which actions or sub-actions occur at different speeds; the frame sampling rate should vary in accordance with the different motion speeds. In this work, we propose a simple yet effective strategy, termed random temporal skipping, to address this situation. The strategy handles multi-rate videos effectively by randomizing the sampling rate during training. It is an exhaustive approach that can cover all motion speed variations. Furthermore, thanks to the large temporal skips, our network can see video clips that originally span more than 100 frames, a temporal range sufficient to analyze most actions and events. We also introduce an optical flow learning method that produces improved motion maps for human action recognition. Our framework is end-to-end trainable, runs in real time, and achieves state-of-the-art performance on six widely adopted video benchmarks.
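A minimal sketch of what randomized frame skipping during training might look like; the exact sampling scheme is not specified in the abstract, so the parameters below (clip length, maximum skip) and the fallback behaviour are illustrative assumptions only.

```python
import numpy as np

def sample_clip_with_random_skipping(num_frames: int, clip_len: int = 16,
                                     max_skip: int = 8, rng=None) -> np.ndarray:
    """Pick `clip_len` frame indices with a random temporal stride.

    A random skip in [1, max_skip] simulates viewing the video at a
    different frame rate each time the clip is sampled during training.
    Assumes num_frames >= clip_len.
    """
    assert num_frames >= clip_len, "sketch assumes at least clip_len frames"
    rng = rng or np.random.default_rng()
    skip = int(rng.integers(1, max_skip + 1))
    span = (clip_len - 1) * skip + 1
    if span > num_frames:                       # fall back to the densest stride that fits
        skip = max((num_frames - 1) // (clip_len - 1), 1)
        span = (clip_len - 1) * skip + 1
    start = int(rng.integers(0, num_frames - span + 1))
    return start + skip * np.arange(clip_len)

indices = sample_clip_with_random_skipping(300)   # e.g. a 300-frame video
```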
Driven by the rapid development of computer vision and machine learning, video analysis tasks have evolved from inferring the present state to predicting future states. Vision-based action recognition and action prediction from videos are two such tasks: action recognition infers human actions (the present state) from complete action executions, while action prediction forecasts actions (future states) from incomplete action executions. These two tasks have recently become particularly popular topics because of their rapidly emerging real-world applications, such as visual surveillance, autonomous driving, entertainment, and video retrieval. Over the past few decades, considerable effort has been devoted to building robust and effective frameworks for action recognition and prediction. In this paper, we survey the complete state of the art in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also discussed systematically.
We focus on the problem of wearer's action recognition in first person, a.k.a. egocentric, videos. This problem is more challenging than third person activity recognition due to the unavailability of the wearer's pose and the sharp movements in the videos caused by the natural head motion of the wearer. Carefully crafted features based on hand and object cues have been shown to be successful for limited targeted datasets. We propose convolutional neural networks (CNNs) for end-to-end learning and classification of wearer's actions. The proposed network makes use of egocentric cues by capturing hand pose, head motion and saliency maps. It is compact. It can also be trained from the relatively small number of labeled egocentric videos that are available. We show that the proposed network can generalize and give state-of-the-art performance on various disparate egocentric action datasets.
In this work, multimodal fusion of RGB-D data is analyzed for action recognition by using scene flow as early fusion and integrating the results of all modalities in a late fusion fashion. Recently, there has been a migration from traditional handcrafting to deep learning. However, handcrafted features are still widely used owing to their high performance and low computational complexity. In this research, multimodal dense trajectories (MMDT) are proposed to describe RGB-D videos. Dense trajectories are pruned based on scene flow data. Besides, the 2D CNN is extended to a multimodal version (MM2DCNN) by adding one more stream (scene flow) as input and then fusing the outputs of all models. We evaluate and compare the results from each modality and their fusion on two action datasets. The experimental results show that the new representation improves accuracy. Furthermore, the fusion of handcrafted and learning-based features boosts the final performance, achieving state-of-the-art results.
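The abstract does not specify how the late fusion is carried out; a common, simple choice is a weighted average of per-modality class scores, sketched below purely as an illustration (the modality names and equal weights are assumptions).

```python
import numpy as np

def late_fusion(scores_by_modality: dict, weights: dict = None) -> np.ndarray:
    """Combine per-modality class-score vectors into one fused prediction."""
    weights = weights or {name: 1.0 for name in scores_by_modality}
    total = sum(weights.values())
    fused = sum(weights[name] * np.asarray(s) for name, s in scores_by_modality.items())
    return fused / total

scores = {
    "rgb":        np.array([0.10, 0.70, 0.20]),
    "depth":      np.array([0.25, 0.55, 0.20]),
    "scene_flow": np.array([0.15, 0.60, 0.25]),
}
print(late_fusion(scores).argmax())   # index of the predicted class
```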
Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNN) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has been reported. This has triggered us to investigate in this paper a new deep architecture which can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FSTCN) that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make factorization in FSTCN possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FSTCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost the performance, FSTCN outperforms existing CNN based methods and achieves comparable performance with a recent method that benefits from using auxiliary training videos.
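A minimal sketch of the factorization idea (a 2D spatial convolution followed by a 1D temporal convolution), written with PyTorch layers; the transformation/permutation operator and the exact FSTCN architecture are not reproduced here, and the channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FactorizedSpatioTemporalBlock(nn.Module):
    """2D spatial conv applied per frame, then 1D temporal conv across frames.

    Tensors are (N, C, T, H, W). This is only a sketch of the factorization
    principle, not the architecture from the paper.
    """
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        # kernel (1, 3, 3): purely spatial
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # kernel (3, 1, 1): purely temporal
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.spatial(x))
        return self.relu(self.temporal(x))

block = FactorizedSpatioTemporalBlock(3, 32, 64)
out = block(torch.randn(2, 3, 16, 112, 112))   # -> (2, 64, 16, 112, 112)
```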
We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image encodes temporal data such as RGB or optical flow videos by using the concept of 'rank pooling'. The idea is to learn a ranking machine that captures the temporal evolution of the data and to use the parameters of the latter as a representation. When a linear ranking machine is used, the resulting representation is in the form of an image, which we call dynamic because it summarizes the video dynamics in addition to appearance. This is a powerful idea because it allows any video to be converted into an image, so that existing CNN models pre-trained for the analysis of still images can be immediately extended to videos. We also present an efficient and effective approximate rank pooling operator, accelerating standard rank pooling algorithms by orders of magnitude, and formulate it as a CNN layer. This new layer allows generalizing dynamic images to dynamic feature maps. We demonstrate the power of the new representations on standard benchmarks in action recognition, achieving state-of-the-art performance.
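A hedged sketch of the approximate rank pooling idea: the video is collapsed into a single image by a fixed weighted sum over frames, with weights that grow with time so later frames dominate. The specific weighting used below (2t - T - 1) is one simplified variant discussed in the rank-pooling literature and is used here only for illustration, not as the paper's exact operator.

```python
import numpy as np

def approximate_dynamic_image(frames: np.ndarray) -> np.ndarray:
    """Collapse a (T, H, W, C) frame stack into one (H, W, C) 'dynamic image'
    via a fixed, time-increasing weighting of the frames."""
    T = frames.shape[0]
    t = np.arange(1, T + 1, dtype=np.float64)
    alpha = 2.0 * t - T - 1.0                      # weights favour later frames
    return np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))

dyn = approximate_dynamic_image(np.random.rand(20, 224, 224, 3))
```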
Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities.
Most state-of-the-art methods for action recognition rely on a two-stream architecture that processes appearance and motion independently. In this paper, we claim that considering them jointly offers rich information for action recognition. We introduce a novel representation that gracefully encodes the movement of some semantic keypoints. We use the human joints as these keypoints and term our Pose moTion representation PoTion. Specifically, we first run a state-of-the-art human pose estimator [4] and extract heatmaps for the human joints in each frame. We obtain our PoTion representation by temporally aggregating these probability maps. This is achieved by 'colorizing' each of them depending on the relative time of the frames in the video clip and summing them. This fixed-size representation for an entire video clip is suitable to classify actions using a shallow convolutional neural network. Our experimental evaluation shows that PoTion outperforms other state-of-the-art pose representations [6, 48]. Furthermore, it is complementary to standard appearance and motion streams. When combining PoTion with the recent two-stream I3D approach [5], we obtain state-of-the-art performance on the JHMDB, HMDB and UCF101 datasets.
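A minimal sketch of the temporal aggregation described in the abstract: per-frame joint heatmaps are weighted by time-dependent 'colorization' coefficients and summed into a fixed-size representation. The two-channel linear colorization below is one simple choice assumed for illustration; the array shapes are likewise illustrative and this is not the paper's exact construction.

```python
import numpy as np

def potion_style_aggregate(heatmaps: np.ndarray) -> np.ndarray:
    """Temporally aggregate joint heatmaps into a fixed-size representation.

    heatmaps: (T, J, H, W) per-frame probability maps for J joints.
    Returns:  (J, 2, H, W) time-colorized sums (2-channel colorization).
    """
    T = heatmaps.shape[0]
    s = np.linspace(0.0, 1.0, T)                 # relative time of each frame
    # Channel 0 fades in over the clip, channel 1 fades out
    weights = np.stack([s, 1.0 - s], axis=1)     # (T, 2)
    # Weighted sum over time for every joint and colour channel
    return np.einsum("tc,tjhw->jchw", weights, heatmaps)

rep = potion_style_aggregate(np.random.rand(32, 18, 64, 64))   # e.g. 18 joints
```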
Automatic detection and classification of dynamic hand gestures in real-world systems intended for human-computer interaction is challenging as: 1) there is a large diversity in how people perform gestures, making detection and classification difficult; 2) the system must work online in order to avoid noticeable lag between performing a gesture and its classification; in fact, a negative lag (classification before the gesture is finished) is desirable, as feedback to the user can then be truly instantaneous. In this paper, we address these challenges with a recurrent three-dimensional convolutional neural network that performs simultaneous detection and classification of dynamic hand gestures from multi-modal data. We employ connectionist temporal classification to train the network to predict class labels from in-progress gestures in unsegmented input streams. In order to validate our method, we introduce a new challenging multi-modal dynamic hand gesture dataset captured with depth, color and stereo-IR sensors. On this challenging dataset, our gesture recognition system achieves an accuracy of 83.8%, outperforms competing state-of-the-art algorithms, and approaches human accuracy of 88.4%. Moreover, our method achieves state-of-the-art performance on SKIG and ChaLearn2014 benchmarks.
Spatio-temporal representations of frame sequences play an important role in the task of action recognition. Previously, methods that combine optical flow, as temporal information, with a set of RGB images containing spatial information have shown large performance gains on action recognition tasks. However, this comes at a high computational cost and requires a two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network), which contains motion blocks that make it possible to encode spatio-temporal information between adjacent frames within a unified network that can be trained end-to-end. The motion block can be attached to any existing CNN-based action recognition framework at only a small additional cost. We evaluate our network on two action recognition datasets (Jester and Something-Something) and, by training the networks from scratch, achieve competitive performance on both datasets.
Analyzing videos of human actions involves understanding the temporal relationships among video frames. State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for the CNNs. Such a two-stage approach is computationally expensive, storage demanding, and not end-to-end trainable. In this paper, we present a novel CNN architecture that implicitly captures motion information between adjacent frames. We name our approach hidden two-stream CNNs because it only takes raw video frames as input and directly predicts action classes without explicitly computing optical flow. Our end-to-end approach is 10x faster than its two-stage baseline. Experimental results on four challenging action recognition datasets, UCF101, HMDB51, THUMOS14 and ActivityNet v1.2, show that our approach significantly outperforms the previous best real-time approaches.
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, their advantage over traditional methods is not so evident. This paper aims to discover the principles for designing effective ConvNet architectures for action recognition in videos, and to learn these models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. Our other contribution is a study of a series of good practices for learning ConvNets on video data with the help of temporal segment networks. Our approach obtains state-of-the-art performance on the HMDB51 (69.4%) and UCF101 (94.2%) datasets. We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment networks and the proposed good practices.
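A minimal sketch of the sparse temporal sampling plus segmental consensus idea, assuming one snippet is drawn uniformly at random from each of K equal segments and per-snippet class scores are simply averaged; the backbone network and the exact consensus function are outside the abstract, so the defaults and names below are illustrative assumptions.

```python
import numpy as np

def sample_segment_snippets(num_frames: int, num_segments: int = 3, rng=None) -> np.ndarray:
    """Split the video into equal segments and pick one random frame index per segment."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return np.array([int(rng.integers(lo, max(hi, lo + 1)))
                     for lo, hi in zip(edges[:-1], edges[1:])])

def segmental_consensus(snippet_scores: np.ndarray) -> np.ndarray:
    """Average the per-snippet class scores into one video-level prediction."""
    return snippet_scores.mean(axis=0)

idx = sample_segment_snippets(300, num_segments=3)   # three frame indices from a 300-frame video
scores = np.random.rand(3, 101)                      # e.g. scores over 101 UCF101 classes
video_scores = segmental_consensus(scores)
```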
Convolutional neural networks with spatio-temporal 3D kernels (3D CNNs) have the ability to directly extract spatio-temporal features from videos for action recognition. Although the 3D kernels tend to overfit because of their large number of parameters, 3D CNNs have been greatly improved by recent huge video databases. However, the architecture of 3D CNNs is relatively shallow compared with the very deep networks that have been so successful among 2D CNNs, such as residual networks (ResNets). In this paper, we propose 3D CNNs based on ResNets toward a better action representation. We describe the training procedure of our 3D ResNets in detail. We experimentally evaluate the 3D ResNets on the ActivityNet and Kinetics datasets. The 3D ResNets trained on Kinetics did not suffer from overfitting despite the large number of parameters of the model, and achieved better performance than relatively shallow networks such as C3D. Our code and pretrained models (e.g. Kinetics and ActivityNet) are publicly available at https://github.com/kenshohara/3D-ResNets.
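A minimal sketch of a basic 3D residual block, the kind of building block that extending ResNets to spatio-temporal 3D kernels implies; channel sizes are illustrative, and this is not the authors' exact implementation, which is available at the linked repository.

```python
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """Two 3x3x3 convolutions with batch norm and an identity shortcut."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)        # residual connection

block = BasicBlock3D(64)
y = block(torch.randn(1, 64, 16, 56, 56))       # (N, C, T, H, W) shape is preserved
```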
Figure 1 (caption): Seeing these ordered frames from videos, can you tell whether each video is playing forward or backward? Depending on the video, solving the task may require (a) low-level understanding (e.g. physics), (b) high-level reasoning (e.g. semantics), or (c) familiarity with very subtle effects or with (d) camera conventions. In this work, we learn and exploit several types of knowledge to predict the arrow of time automatically with neural network models trained on large-scale video datasets.
We seek to understand the arrow of time in videos: what makes videos look like they are playing forwards or backwards? Can we visualize the cues? Can the arrow of time be a supervisory signal useful for activity analysis? To this end, we build three large-scale video datasets and apply a learning-based approach to these tasks. To learn the arrow of time efficiently and reliably, we design a ConvNet suitable for extended temporal footprints and for class activation visualization, and study the effect of artificial cues, such as cinematographic conventions, on learning. Our trained model achieves state-of-the-art performance on large-scale real-world video datasets. Through cluster analysis and localization of important regions for the prediction, we examine learned visual cues that are consistent among many samples and show when and where they occur. Lastly, we use the trained ConvNet for two applications: self-supervision for action recognition, and video forensics, i.e. determining whether Hollywood film clips have been deliberately reversed in time, as is often done for special effects.
We present a data-efficient representation learning approach to learn video representations with a small amount of labeled data. We propose a multitask learning model, ActionFlowNet, to train a single-stream network directly from raw pixels to jointly estimate optical flow while recognizing actions with convolutional neural networks, capturing both appearance and motion in a single model. Our model effectively learns video representations from motion information on unlabeled videos. Our model significantly improves action recognition accuracy by a large margin (23.6%) compared to state-of-the-art CNN-based unsupervised representation learning methods trained without external large-scale data and additional optical flow input. Without pre-training on large external labeled datasets, our model, by well exploiting the motion information, achieves competitive recognition accuracy to models trained with large labeled datasets such as ImageNet and Sport-1M.
Human skeleton joints are popular for action analysis since they can be easily extracted from videos to discard background noise. However, current skeleton representations do not fully benefit from machine learning with CNNs. We propose "Skepxels", a spatio-temporal representation for skeleton sequences that fully exploits the "local" correlations between joints using the 2D convolution kernels of CNNs. With Skepxels we transform skeleton videos into images of flexible dimensions and develop a CNN-based framework that uses the resulting images for effective human action recognition. Skepxels encode the spatio-temporal information of the skeleton joints in a frame by maximizing a unique distance metric, defined collaboratively over the distinct joint arrangements used in the skeletal image. Moreover, they are flexible enough to encode compound semantic notions such as the location and speed of the joints. The proposed action recognition exploits the representation in a hierarchical manner, first capturing the micro-temporal relations between the skeleton joints with Skepxels and then exploiting their macro-temporal relations by computing Fourier temporal pyramids over the CNN features of the skeletal images. We extend the Inception-ResNet CNN architecture with the proposed method and improve the state-of-the-art accuracy on the large-scale NTU human activity dataset by 4.4%. On the medium-sized N-UCLA and UTH-MHAD datasets, our method outperforms the existing results by 5.7% and 9.3% respectively.
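A heavily simplified sketch of the general idea of packing skeleton sequences into image-like arrays that a 2D CNN can consume; the distance-metric-based joint arrangements and the velocity channels described in the abstract are omitted, so this is an assumption-laden illustration and not the Skepxels construction itself.

```python
import numpy as np

def skeleton_sequence_to_image(skeleton_seq: np.ndarray) -> np.ndarray:
    """skeleton_seq: (T, J, 3) 3D joint coordinates per frame.
    Returns a (T, J, 3) array viewed as an H x W x 3 'image' whose rows are
    frames, columns are joints and channels are x, y, z coordinates."""
    img = skeleton_seq.astype(np.float32).copy()
    # Normalize each coordinate channel to [0, 1] so it behaves like pixel data
    mins = img.reshape(-1, 3).min(axis=0)
    maxs = img.reshape(-1, 3).max(axis=0)
    return (img - mins) / np.maximum(maxs - mins, 1e-6)

image = skeleton_sequence_to_image(np.random.rand(40, 25, 3))   # e.g. 25 joints per frame
```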
A prerequisite for successfully alleviating pain in animals is recognizing it, which is a great challenge in non-verbal species. Furthermore, prey animals such as horses tend to hide their pain. In this study, we propose a deep recurrent two-stream architecture for the task of distinguishing pain from non-pain in videos of horses. Different models are evaluated on a unique dataset showing horses under controlled trials with moderate pain induction, which has been presented in earlier work. Sequential models are compared experimentally to single-frame models, showing the importance of the temporal dimension of the data, and are benchmarked against a veterinary expert classification of the data. We additionally perform baseline comparisons with updated versions of state-of-the-art methods for human pain recognition. Although equine pain detection with machine learning is a novel field, our results surpass the performance of veterinary experts and outperform pain detection results reported for other, larger non-human species.