Deep learning models for video-based action recognition typically generate features over short clips (consisting of a few frames); such clip-level features are aggregated into a video-level representation by computing statistics on them. Zeroth-order (max) or first-order (average) statistics are commonly used. In this paper, we explore the benefits of using second-order statistics. Specifically, we propose a novel end-to-end learnable feature aggregation scheme, called temporal correlation pooling, which generates an action descriptor for a video sequence by capturing the similarities between the temporal evolution of the clip-level CNN features computed across the video. Such a descriptor, while computationally cheap, also naturally encodes the co-activations of multiple CNN features, thereby providing a richer characterization of actions than its first-order counterparts. We further propose higher-order extensions of this scheme by computing correlations after mapping the CNN features into a reproducing kernel Hilbert space. We report experiments on benchmark datasets such as HMDB-51 and UCF-101, fine-grained datasets such as MPII Cooking Activities and JHMDB, and the recent Kinetics-600. Our results demonstrate the advantages of higher-order pooling schemes, which achieve state-of-the-art accuracy when combined with hand-crafted features, as is standard practice.
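To make the second-order pooling idea concrete, the following is a minimal NumPy sketch that aggregates clip-level features into a co-activation (correlation) matrix and flattens it into a video descriptor; the function and variable names are illustrative, and the RKHS-based higher-order extension is omitted, so this is not the authors' implementation.

```python
# Hedged sketch of second-order (correlation) pooling over clip-level features.
import numpy as np

def second_order_pool(clip_features):
    """clip_features: (T, D) array of clip-level CNN features for one video.
    Returns a video-level descriptor built from second-order statistics."""
    X = np.asarray(clip_features, dtype=np.float64)   # T clips, D feature dimensions
    C = X.T @ X / X.shape[0]                          # D x D co-activation (correlation) matrix
    iu = np.triu_indices(C.shape[0])                  # the matrix is symmetric, keep the upper triangle
    return C[iu]                                      # flattened video descriptor

# Example: 20 clips with 128-dimensional features.
desc = second_order_pool(np.random.randn(20, 128))
print(desc.shape)  # (8256,) == 128 * 129 / 2
```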
Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames failing to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 (92.7%) and HMDB51 (67.2%).
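As a rough illustration of the long temporal extent, here is a hedged PyTorch sketch of a single 3D convolutional block applied to a 60-frame clip; the layer sizes and input resolution are assumptions for illustration, not the LTC architecture.

```python
# Minimal sketch of a "long-term temporal convolution" block: a 3D convolution applied to a
# clip of many frames (here 60) rather than a handful. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

ltc_block = nn.Sequential(
    nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1),  # spatio-temporal kernel
    nn.ReLU(inplace=True),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),  # downsample time and space together
)

clip = torch.randn(1, 3, 60, 58, 58)  # (batch, channels, frames, height, width): long temporal extent
out = ltc_block(clip)
print(out.shape)  # torch.Size([1, 64, 30, 29, 29])
```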
This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves. We develop a new framework called depth2action and experiment thoroughly into how best to incorporate the depth information. We introduce spatio-temporal depth normalization (STDN) to enforce temporal consistency in our estimated depth sequences. We also propose modified depth motion maps (MDMM) to capture the subtle temporal changes in depth. These two components significantly improve the action recognition performance. We evaluate our depth2action framework on three large-scale action recognition video benchmarks. Our model achieves state-of-the-art performance when combined with appearance and motion information thus demonstrating that depth2action is indeed complementary to existing approaches.
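The sketch below illustrates, under assumed inputs, the general depth-motion-map idea of accumulating frame-to-frame depth differences; it is a simplified stand-in rather than the paper's exact MDMM or STDN definitions.

```python
# Hedged sketch of a depth-motion-map style descriptor: accumulate absolute differences
# between consecutive (estimated) depth frames to capture subtle temporal changes in depth.
import numpy as np

def depth_motion_map(depth_frames, threshold=0.02):
    """depth_frames: (T, H, W) array of per-frame depth estimates in [0, 1]."""
    D = np.asarray(depth_frames, dtype=np.float32)
    diffs = np.abs(np.diff(D, axis=0))      # frame-to-frame depth change
    diffs[diffs < threshold] = 0.0          # suppress small fluctuations from depth-estimation noise
    return diffs.sum(axis=0)                # (H, W) map summarizing depth motion over the clip

mdmm = depth_motion_map(np.random.rand(30, 112, 112))
print(mdmm.shape)  # (112, 112)
```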
We focus on the problem of wearer's action recognition in first person, a.k.a. egocentric, videos. This problem is more challenging than third-person activity recognition due to the unavailability of the wearer's pose and the sharp movements in the videos caused by the natural head motion of the wearer. Carefully crafted features based on hand and object cues have been shown to be successful for limited targeted datasets. We propose convolutional neural networks (CNNs) for end-to-end learning and classification of the wearer's actions. The proposed network makes use of egocentric cues by capturing hand pose, head motion and saliency maps. It is compact. It can also be trained from the relatively small number of labeled egocentric videos that are available. We show that the proposed network can generalize and give state-of-the-art performance on various disparate egocentric action datasets.
Current state-of-the-art video understanding methods use temporal jittering to simulate analyzing videos at different frame rates. However, this does not work well for multi-rate videos, in which actions or sub-actions proceed at different speeds; the frame sampling rate should vary according to the motion speed. In this work, we propose a simple yet effective strategy, called random temporal skipping, to handle this situation. The strategy deals with multi-rate videos by randomizing the sampling rate during training, and it is exhaustive in the sense that it covers all motion speed variations. Moreover, thanks to the large temporal skips, our network can see video clips that originally span more than 100 frames, a temporal range sufficient to analyze most actions and events. We also introduce an occlusion-aware optical flow learning method that produces improved motion maps for human action recognition. Our framework is end-to-end trainable, runs in real time, and achieves state-of-the-art performance on six widely adopted video benchmarks.
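A minimal sketch of the random temporal skipping strategy follows, with illustrative clip length and skip range; the actual hyperparameters in the paper may differ.

```python
# Hedged sketch of random temporal skipping: draw a random skip rate per training clip and
# sample frames at that stride, so a fixed-length clip can cover 100+ original frames.
import random

def sample_clip_indices(num_frames, clip_len=10, max_skip=10):
    assert num_frames >= clip_len, "video must have at least clip_len frames"
    skip = random.randint(0, max_skip)            # random skip drawn per training clip
    span = clip_len + (clip_len - 1) * skip       # original frames covered by the skipped clip
    if span > num_frames:                         # fall back to the densest feasible skip
        skip = (num_frames - clip_len) // max(clip_len - 1, 1)
        span = clip_len + (clip_len - 1) * skip
    start = random.randint(0, num_frames - span)
    return [start + i * (skip + 1) for i in range(clip_len)]

print(sample_clip_indices(num_frames=300))  # with skip 10, the 10-frame clip spans 100 original frames
```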
Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNN) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has been reported. This has triggered us to investigate in this paper a new deep architecture which can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FSTCN) that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make factorization in FSTCN possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FSTCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost the performance, FSTCN outperforms existing CNN based methods and achieves comparable performance with a recent method that benefits from using auxiliary training videos.
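The following hedged PyTorch sketch factorizes a 3D convolution into a 2D spatial convolution followed by a 1D temporal convolution, in the spirit of the spatial-then-temporal layers described above; it omits the transformation and permutation operator and is not the authors' network definition.

```python
# Minimal sketch of a factorized spatio-temporal convolution: spatial (1 x k x k) kernels
# followed by temporal (k x 1 x 1) kernels. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FactorizedSTConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # 2D spatial kernel applied frame by frame
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # 1D temporal kernel applied across frames
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):  # x: (batch, channels, frames, H, W)
        return self.temporal(torch.relu(self.spatial(x)))

block = FactorizedSTConv(3, 32)
print(block(torch.randn(2, 3, 16, 56, 56)).shape)  # torch.Size([2, 32, 16, 56, 56])
```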
Driven by the rapid development of computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and action prediction from videos are two such tasks: action recognition infers human actions (the present state) from complete action executions, while action prediction forecasts actions (the future state) from incomplete action executions. These two tasks have recently become particularly popular topics because of their rapidly emerging real-world applications, such as visual surveillance, autonomous driving vehicles, entertainment, and video retrieval. Over the past few decades, a great deal of effort has been devoted to building robust and effective frameworks for action recognition and prediction. In this paper, we survey the complete state of the art in action recognition and prediction, and provide a systematic discussion of existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions.
In this work, multimodal fusion of RGB-D data is analyzed for action recognition by using scene flow as early fusion and integrating the results of all modalities in a late fusion fashion. Recently, there has been a migration from traditional handcrafting to deep learning. However, handcrafted features are still widely used owing to their high performance and low computational complexity. In this research, multimodal dense trajectories (MMDT) are proposed to describe RGB-D videos. Dense trajectories are pruned based on scene flow data. Besides, 2DCNN is extended to multimodal (MM2DCNN) by adding one more stream (scene flow) as input and then fusing the output of all models. We evaluate and compare the results from each modality and their fusion on two action datasets. The experimental results show that the new representation improves the accuracy. Furthermore, the fusion of handcrafted and learning-based features shows a boost in the final performance, achieving state-of-the-art results.
Spatio-temporal representations of frame sequences play an important role in action recognition. Previously, methods that use optical flow as temporal information in combination with a set of RGB images carrying spatial information have shown large performance gains on action recognition tasks. However, this comes at a high computational cost and requires a two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network), which contains motion blocks that encode spatio-temporal information between adjacent frames within a unified network that can be trained end to end. The motion block can be attached to any existing CNN-based action recognition framework at only a small additional cost. We evaluate our network on two action recognition datasets (Jester and Something-Something) and, training the networks from scratch, achieve competitive performance on both datasets.
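As a rough sketch of what a motion block can compute, the snippet below differences adjacent-frame feature maps and concatenates the result with the appearance features; this is a simplified stand-in for illustration, not MFNet's actual motion filters.

```python
# Hedged sketch: approximate temporal information by differencing feature maps of adjacent
# frames and appending the result to the appearance channels.
import torch

def motion_features(feats):
    """feats: (batch, frames, channels, H, W) per-frame CNN feature maps."""
    diff = feats[:, 1:] - feats[:, :-1]                               # adjacent-frame differences
    diff = torch.cat([diff, torch.zeros_like(feats[:, :1])], dim=1)   # pad to keep the frame count
    return torch.cat([feats, diff], dim=2)                            # motion channels next to appearance channels

x = torch.randn(2, 8, 64, 14, 14)
print(motion_features(x).shape)  # torch.Size([2, 8, 128, 14, 14])
```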
We propose a simple yet effective approach to spatio-temporal feature learning using deep 3D convolutional networks (3D ConvNets) trained on a large-scale supervised video dataset. Our findings are threefold: 1) compared with 2D ConvNets, 3D ConvNets are better suited to spatio-temporal feature learning; 2) a homogeneous architecture with small 3x3x3 convolution kernels in all layers is among the best-performing architectures for 3D ConvNets; and 3) our learned features, namely C3D (Convolutional 3D), with a simple linear classifier outperform state-of-the-art methods on four different benchmarks and are comparable to the current best methods on the other two benchmarks. In addition, the features are compact, reaching 52.8% accuracy on the UCF101 dataset with only 10 dimensions, and they are efficient to compute thanks to the fast inference of ConvNets. Finally, they are conceptually very simple and easy to train and use.
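A hedged sketch of a homogeneous 3x3x3 block stack in the C3D spirit follows; the channel counts and depth are illustrative, not the published C3D network.

```python
# Minimal sketch: every convolution uses a small 3x3x3 kernel; the first pooling keeps the
# temporal resolution, as is common for such networks.
import torch
import torch.nn as nn

def c3d_block(in_ch, out_ch, pool):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),  # homogeneous 3x3x3 kernels
        nn.ReLU(inplace=True),
        nn.MaxPool3d(kernel_size=pool),
    )

net = nn.Sequential(
    c3d_block(3, 64, pool=(1, 2, 2)),    # keep temporal resolution early
    c3d_block(64, 128, pool=(2, 2, 2)),
    c3d_block(128, 256, pool=(2, 2, 2)),
)
print(net(torch.randn(1, 3, 16, 112, 112)).shape)  # torch.Size([1, 256, 4, 14, 14])
```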
We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis, particularly in combination with convolutional neural networks (CNNs). A dynamic image encodes temporal data such as RGB or optical flow videos by using the concept of 'rank pooling'. The idea is to learn a ranking machine that captures the temporal evolution of the data and to use the parameters of the latter as a representation. When a linear ranking machine is used, the resulting representation is in the form of an image, which we call dynamic because it summarizes the video dynamics in addition to appearance. This is a powerful idea because it allows any video to be converted to an image, so that existing CNN models pre-trained for the analysis of still images can be immediately extended to videos. We also present an efficient and effective approximate rank pooling operator, accelerating standard rank pooling algorithms by orders of magnitude, and formulate it as a CNN layer. This new layer allows generalizing dynamic images to dynamic feature maps. We demonstrate the power of the new representations on standard benchmarks in action recognition achieving state-of-the-art performance.
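To illustrate rank pooling, the sketch below fits a linear model whose scores follow the temporal order of the frames and uses its parameters as the representation; a least-squares fit to the frame index is used as a simple stand-in for the ranking machine, so this is not the paper's approximate rank pooling operator.

```python
# Hedged sketch of the rank-pooling idea behind dynamic images.
import numpy as np

def rank_pool(frames):
    """frames: (T, D) array, each row a flattened frame or frame feature."""
    T = len(frames)
    X = np.cumsum(frames, axis=0) / np.arange(1, T + 1)[:, None]  # smoothed, time-varying mean features
    t = np.arange(1, T + 1, dtype=np.float64)                     # target: the temporal order
    w, *_ = np.linalg.lstsq(X, t, rcond=None)                     # parameters that encode the dynamics
    return w                                                      # same size as one frame: a "dynamic image"

video = np.random.rand(40, 32 * 32)   # 40 frames of 32x32 grayscale, flattened
print(rank_pool(video).shape)         # (1024,) -> reshape to 32x32 to view as an image
```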
A prerequisite for successfully alleviating pain in animals is recognizing it, which is a great challenge for non-verbal species; moreover, prey animals such as horses tend to hide their pain. In this study, we propose a deep recurrent two-stream architecture for distinguishing pain from non-pain in videos of horses. Different models are evaluated on a unique dataset, presented in earlier work, showing horses under controlled trials with moderate pain induction. Sequential models are experimentally compared with single-frame models, demonstrating the importance of the temporal dimension of the data, and are benchmarked against a veterinary-expert classification of the data. We additionally perform a baseline comparison using the latest version of a state-of-the-art method for human pain recognition. Although pain detection in horses is a new field for machine learning, our results surpass the veterinary-expert performance and outperform pain detection results reported for other, larger non-human species.
Human skeleton joints are popular in action analysis because they can be easily extracted from videos to discard background noise. However, current skeleton representations do not fully benefit from CNN-based machine learning. We propose 'Skepxels', a spatio-temporal representation of skeleton sequences designed to fully exploit the 'local' correlations between joints through the 2D convolution kernels of CNNs. Using Skepxels, we convert skeleton videos into images of flexible dimensions and develop a CNN-based framework on the resulting images for effective human action recognition. Skepxels encode the spatio-temporal information of the skeleton joints in a frame by maximizing a unique distance metric, defined collaboratively over the distinct joint arrangements used in the skeletal images. Moreover, they are flexible enough to encode compound semantic notions such as the positions and velocities of the joints. The proposed action recognition exploits the representation in a hierarchical manner, first capturing the micro-temporal relations between the skeleton joints with Skepxels and then exploiting their macro-temporal relations by computing Fourier temporal pyramids over the CNN features of the skeletal images. We extend the Inception-ResNet CNN architecture with the proposed method and improve the state-of-the-art accuracy on the large-scale NTU human activity dataset by 4.4%. On the medium-sized N-UCLA and UTH-MHAD datasets, our method outperforms existing results by 5.7% and 9.3%, respectively.
Visual repetition is ubiquitous in our world. It appears in human activity (sports, cooking), animal behavior (the waggle dance of bees), natural phenomena (leaves in the wind) and urban environments (flashing lights). Estimating visual repetition from realistic video is challenging because periodic motion is rarely perfectly static and stationary. To better handle realistic video, we should move beyond the static and stationary assumptions frequently made by existing work. Our spatiotemporal filtering approach, grounded in the theory of periodic motion, effectively handles a wide variety of appearances and requires no learning. Starting from motion in 3D, we derive three periodic motion types by decomposing the motion field into its fundamental components; in addition, three temporal motion continuities emerge from the temporal dynamics of the field. For the 2D perception of 3D motion, we consider the viewpoint relative to the motion, which yields 18 cases of recurrent motion perception. To estimate repetition in all cases, our theory implies constructing a mixture of differential motion maps: gradient, divergence and curl. We convolve the motion maps over time with wavelet filters to estimate the repetitive dynamics. Our method can spatially segment repetitive motion directly from the temporal filter responses computed densely over the motion maps. To verify our claims experimentally, we use our novel dataset for repetition estimation, which better reflects reality with non-static and non-stationary repetitive motion. On the task of repetition counting, we obtain favorable results compared with a deep learning alternative.
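The differential motion maps mentioned above can be illustrated with a few finite differences on a dense flow field; the hedged NumPy sketch below computes gradient-magnitude, divergence and curl maps, leaving out the temporal wavelet filtering and the 18-case analysis.

```python
# Hedged sketch of differential motion maps from a dense 2D flow field.
import numpy as np

def differential_motion_maps(flow):
    """flow: (H, W, 2) optical flow field with components (u, v)."""
    u, v = flow[..., 0], flow[..., 1]
    du_dy, du_dx = np.gradient(u)        # np.gradient returns derivatives along axis 0 (rows), then axis 1 (cols)
    dv_dy, dv_dx = np.gradient(v)
    divergence = du_dx + dv_dy           # local expansion / contraction
    curl = dv_dx - du_dy                 # local rotation
    grad_mag = np.hypot(du_dx, du_dy) + np.hypot(dv_dx, dv_dy)
    return grad_mag, divergence, curl

maps = differential_motion_maps(np.random.randn(64, 64, 2))
print([m.shape for m in maps])  # [(64, 64), (64, 64), (64, 64)]
```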
Deep convolutional networks have achieved great success in visual recognition for still images. However, for action recognition in videos, their advantage over traditional methods is less evident. This paper aims to explore the principles for designing effective ConvNet architectures for action recognition in videos, and to learn such models given limited training samples. Our first contribution is the temporal segment network (TSN), a novel framework for video-based action recognition, built on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy with video-level supervision to enable efficient and effective learning from the whole action video. Another contribution is our study of a series of good practices for learning ConvNets on video data with the help of temporal segment networks. Our approach obtains state-of-the-art performance on the HMDB51 (69.4%) and UCF101 (94.2%) datasets. We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment networks and the proposed good practices.
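A minimal sketch of the sparse sampling and segmental consensus idea follows: split the video into K segments, score one snippet per segment with a shared network, and average the scores. The toy backbone below is a placeholder, not the TSN models.

```python
# Hedged sketch of TSN-style sparse sampling with an averaging segmental consensus.
import torch
import torch.nn as nn

class TemporalSegmentConsensus(nn.Module):
    def __init__(self, backbone, num_segments=3):
        super().__init__()
        self.backbone = backbone          # any per-snippet classifier with shared weights
        self.num_segments = num_segments

    def forward(self, snippets):          # snippets: (batch, K, C, H, W), one snippet per segment
        b, k, c, h, w = snippets.shape
        scores = self.backbone(snippets.view(b * k, c, h, w))  # score every snippet
        return scores.view(b, k, -1).mean(dim=1)               # segmental consensus by averaging

toy_backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 101))  # stand-in for a deep ConvNet
tsn = TemporalSegmentConsensus(toy_backbone, num_segments=3)
print(tsn(torch.randn(4, 3, 3, 32, 32)).shape)  # torch.Size([4, 101])
```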
Most state-of-the-art methods for action recognition rely on a two-stream architecture that processes appearance and motion independently. In this paper, we claim that considering them jointly offers rich information for action recognition. We introduce a novel representation that gracefully encodes the movement of some semantic keypoints. We use the human joints as these keypoints and term our Pose moTion representation PoTion. Specifically, we first run a state-of-the-art human pose estimator [4] and extract heatmaps for the human joints in each frame. We obtain our PoTion representation by temporally aggregating these probability maps. This is achieved by 'colorizing' each of them depending on the relative time of the frames in the video clip and summing them. This fixed-size representation for an entire video clip is suitable to classify actions using a shallow convolutional neural network. Our experimental evaluation shows that PoTion outperforms other state-of-the-art pose representations [6, 48]. Furthermore, it is complementary to standard appearance and motion streams. When combining PoTion with the recent two-stream I3D approach [5], we obtain state-of-the-art performance on the JHMDB, HMDB and UCF101 datasets.
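The temporal 'colorization' can be sketched as a time-weighted sum of the joint heatmaps; the NumPy snippet below uses two color channels for brevity and is only an assumed simplification of the PoTion aggregation.

```python
# Hedged sketch of PoTion-style temporal aggregation: weight each frame's joint heatmaps by
# time-dependent "color" coefficients, then sum over time to get a fixed-size clip representation.
import numpy as np

def potion_aggregate(heatmaps):
    """heatmaps: (T, J, H, W) per-frame joint probability maps for J joints."""
    T = heatmaps.shape[0]
    t = np.linspace(0.0, 1.0, T)                 # relative time in [0, 1]
    colors = np.stack([1.0 - t, t], axis=1)      # channel 0 fades out, channel 1 fades in
    # weighted sum over time: one map per joint and per color channel
    return np.einsum('tjhw,tc->jchw', heatmaps, colors)

clip = np.random.rand(30, 18, 64, 64)            # 30 frames, 18 joints
print(potion_aggregate(clip).shape)              # (18, 2, 64, 64)
```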
[Figure 1: Seeing ordered frames from videos, can you tell whether each video is playing forward or backward? Depending on the video, solving the task may require (a) low-level understanding (e.g. physics), (b) high-level reasoning (e.g. semantics), or (c) familiarity with very subtle effects or with (d) camera conventions. In this work, we learn and exploit several types of knowledge to predict the arrow of time automatically with neural network models trained on large-scale video datasets.]
We seek to understand the arrow of time in videos: what makes videos look like they are playing forwards or backwards? Can we visualize the cues? Can the arrow of time be a supervisory signal useful for activity analysis? To this end, we build three large-scale video datasets and apply a learning-based approach to these tasks. To learn the arrow of time efficiently and reliably, we design a ConvNet suitable for extended temporal footprints and for class activation visualization, and study the effect of artificial cues, such as cinematographic conventions, on learning. Our trained model achieves state-of-the-art performance on large-scale real-world video datasets. Through cluster analysis and localization of important regions for the prediction, we examine learned visual cues that are consistent among many samples and show when and where they occur. Lastly, we use the trained ConvNet for two applications: self-supervision for action recognition, and video forensics, determining whether Hollywood film clips have been deliberately reversed in time, as is often done for special effects.
Accurately recognizing hands in images is a key sub-task of human activity understanding with wearable first-person-view cameras. Traditional hand segmentation methods rely on large amounts of manually labeled data to produce robust hand detectors. However, these methods still face challenges, because hand appearance varies greatly across users, tasks, environments and illumination conditions. A key observation for many wearable applications and interfaces is that the user's hands only need to be detected accurately in a specific situational context. Based on this observation, we introduce an interactive approach for learning person-specific hand segmentation models that does not require any manually labeled training data. Our approach proceeds in two steps: an interactive bootstrapping step that identifies moving hand regions, followed by learning a personalized, user-specific hand appearance model. Concretely, our approach uses two convolutional neural networks: (1) a gesture network that uses predefined motion information to detect hand regions; and (2) an appearance network that learns a person-specific model of the hand regions based on the output of the gesture network. During training, to make the appearance network robust to errors from the gesture network, the loss function of the former incorporates the confidence of the gesture network while learning. Experiments demonstrate the robustness of the approach, with F1 scores above 0.8 on all challenging datasets spanning a full range of illumination and hand appearance variations, an improvement of more than 10% over baseline methods.
We present a data-efficient representation learning approach to learn video representation with small amount of labeled data. We propose a multitask learning model ActionFlowNet to train a single stream network directly from raw pixels to jointly estimate optical flow while recognizing actions with convolutional neural networks, capturing both appearance and motion in a single model. Our model effectively learns video representation from motion information on unlabeled videos. Our model significantly improves action recognition accuracy by a large margin (23.6%) compared to state-of-the-art CNN-based unsupervised representation learning methods trained without external large scale data and additional optical flow input. Without pre-training on large external labeled datasets, our model, by well exploiting the motion information, achieves competitive recognition accuracy to the models trained with large labeled datasets such as ImageNet and Sport-1M.
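A hedged sketch of the multitask setup follows: one network with an action head and a flow head, trained with a combined loss. The architecture, loss weighting and input format are illustrative assumptions, not the ActionFlowNet definition.

```python
# Minimal sketch: a single encoder feeds both an action-classification head and a per-pixel
# optical-flow head; training combines the two losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadNet(nn.Module):
    def __init__(self, num_classes=101):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(6, 16, 3, padding=1), nn.ReLU())  # two stacked RGB frames
        self.flow_head = nn.Conv2d(16, 2, 3, padding=1)                          # per-pixel (u, v) flow
        self.action_head = nn.Linear(16, num_classes)

    def forward(self, frame_pair):
        feat = self.encoder(frame_pair)
        flow = self.flow_head(feat)
        action = self.action_head(feat.mean(dim=(2, 3)))  # global average pooling, then classify
        return action, flow

net = TwoHeadNet()
frames = torch.randn(2, 6, 64, 64)
action_logits, flow_pred = net(frames)
# Combined objective: classification loss plus a (here dummy-target) flow regression loss.
loss = F.cross_entropy(action_logits, torch.tensor([3, 7])) + 0.1 * F.mse_loss(flow_pred, torch.zeros_like(flow_pred))
print(action_logits.shape, flow_pred.shape, float(loss) > 0)
```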
We bring together ideas from recent work on feature design for egocentric action recognition under one framework by exploring the use of deep convolutional neural networks (CNN). Recent work has shown that features such as hand appearance, object attributes, local hand motion and camera ego-motion are important for characterizing first-person actions. To integrate these ideas under one framework, we propose a twin stream network architecture, where one stream analyzes appearance information and the other stream analyzes motion information. Our appearance stream encodes prior knowledge of the egocentric paradigm by explicitly training the network to segment hands and localize objects. By visualizing certain neuron activations of our network, we show that our proposed architecture naturally learns features that capture object attributes and hand-object configurations. Our extensive experiments on benchmark egocentric action datasets show that our deep architecture enables recognition rates that significantly outperform state-of-the-art techniques, with an average 6.6% increase in accuracy over all datasets. Furthermore, by learning to recognize objects, actions and activities jointly, the performance of the individual recognition tasks also increases by 30% (actions) and 14% (objects). We also include the results of extensive ablative analysis to highlight the importance of network design decisions.