Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts or traversal rules, thus resulting in limited expressive power and difficulties of generalization. In this work, we propose a novel model of dynamic skeletons called Spatial-Temporal Graph Convolutional Networks (ST-GCN), which moves beyond the limitations of previous methods by automatically learning both the spatial and temporal patterns from data. This formulation not only leads to greater expressive power but also stronger generalization capability. On two large datasets, Kinetics and NTU-RGBD, it achieves substantial improvements over mainstream methods.
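A minimal sketch of the core operation described above, assuming PyTorch and an `(N, C, T, V)` input layout (batch, channels, frames, joints); the layer structure and names are illustrative, not the authors' released code:

```python
import torch
import torch.nn as nn

class STGCNLayer(nn.Module):
    """One spatial-temporal block: graph convolution over joints within a
    frame, then a temporal convolution over frames for each joint."""
    def __init__(self, in_channels, out_channels, A, t_kernel=9):
        super().__init__()
        # A: (V, V) skeleton adjacency with self-loops; symmetric normalization
        d_inv_sqrt = A.sum(dim=1).pow(-0.5)
        self.register_buffer("A_norm", d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :])
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(t_kernel, 1),
                                  padding=((t_kernel - 1) // 2, 0))
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (N, C, T, V) -- batch, channels, frames, joints
        x = self.spatial(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A_norm)  # propagate along skeleton edges
        return self.relu(self.temporal(x))
```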
Conventional deep methods for skeleton-based action recognition typically structure skeletons as coordinate sequences or pseudo-images to feed into RNNs or CNNs, which cannot explicitly exploit the natural connectivity between joints. Recently, graph convolutional networks (GCNs), which generalize CNNs to more general non-Euclidean structures, have achieved remarkable performance for skeleton-based action recognition. However, the topology of the graph is set manually and fixed over all layers, which may not be optimal for the action recognition task or for the hierarchical CNN structure. Moreover, first-order information (the coordinates of joints) is mainly used in previous GCNs, while second-order information (the lengths and directions of bones) is less exploited. This paper proposes a novel two-stream non-local graph convolutional network to address these problems. The graph topology in each layer of the model can be learned either uniformly or individually by the back-propagation algorithm, offering greater flexibility and generality. Meanwhile, a two-stream framework is adopted to model joint and bone information simultaneously, which further improves recognition performance. Extensive experiments on two large-scale datasets, NTU-RGB+D and Kinetics, demonstrate that the performance of our model exceeds the state of the art by a significant margin.
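As a rough illustration of the two ideas, the sketch below derives second-order bone features from joint coordinates and adds a learnable offset to the fixed adjacency so the topology can be refined by back-propagation; the toy parent list and all names are assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

# Hypothetical parent index for each joint of a 5-joint toy skeleton
# (joint 0 is the root and is its own parent).
PARENTS = [0, 0, 1, 1, 3]

def bone_features(joints):
    # joints: (N, 3, T, V); a bone vector is a joint minus its parent joint,
    # carrying second-order information (bone length and direction).
    return joints - joints[..., PARENTS]

class AdaptiveGraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        self.register_buffer("A_fixed", A)                  # hand-set skeleton graph
        self.A_learned = nn.Parameter(torch.zeros_like(A))  # refined per layer by BP
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        A = self.A_fixed + self.A_learned   # flexible, layer-specific topology
        x = torch.einsum("nctv,vw->nctw", x, A)
        return self.proj(x)
```

In the two-stream setup, one such network would run on the joint stream and one on `bone_features(joints)`, with the class scores fused at the end.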
The task of action recognition or action detection involves analyzing videos and determining what action or motion is being performed. The primary subjects of these videos are predominantly humans performing some action. This requirement can, however, be relaxed to generalize to other subjects such as animals or robots. Applications range from human-computer interaction to automated video editing proposals. When we consider spatio-temporal action recognition, we deal with action localization: this task involves not only determining what action is being performed, but also when and where it is performed in the video. This paper aims to survey the approaches and algorithms that attempt to solve this task, give a comprehensive comparison of them, explore the various datasets available, and identify the most promising approaches.
With the prevalence of accessible depth sensors, dynamic human body skeletons have attracted much attention as a robust modality for action recognition. Previous methods model skeletons based on RNNs or CNNs, which have limited expressive power for irregular joints. In this paper, we represent skeletons naturally on graphs and propose a generalized graph convolutional neural network (GGCN) for skeleton-based action recognition, aiming to capture space-time variation via spectral graph theory. In particular, we construct a generalized graph over consecutive frames, where each joint is not only connected, strongly or weakly, to its neighboring joints within the same frame, but also linked with relevant joints in the previous and subsequent frames. The generalized graph is then fed into the GGCN along with the coordinate matrix of the skeleton sequence for feature learning, where we deploy high-order and fast Chebyshev approximations of spectral graph convolution in the network. Experiments show that we achieve state-of-the-art performance on the widely used NTU RGB+D, UT-Kinect, and SYSU 3D datasets.
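The fast Chebyshev approximation the abstract refers to is a standard construction in spectral graph convolution; a minimal sketch, with shapes as assumptions:

```python
import torch

def chebyshev_graph_conv(x, L_tilde, theta):
    # x: (V, C) node features; L_tilde: (V, V) rescaled Laplacian 2L/lambda_max - I;
    # theta: (K, C, F) coefficients of an order-K polynomial filter.
    K = theta.shape[0]
    Tx = [x, L_tilde @ x]                         # T_0(L~)x = x, T_1(L~)x = L~x
    for _ in range(2, K):
        Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])  # T_k = 2 L~ T_{k-1} - T_{k-2}
    return sum(Tx[k] @ theta[k] for k in range(K))
```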
We propose a novel stacked spatio-temporal graph convolutional network (Stacked-STGCN) for action segmentation, i.e., predicting and localizing a sequence of actions over long videos. We extend the spatio-temporal graph convolutional network (STGCN), originally proposed for skeleton-based action recognition, to enable nodes with different characteristics (e.g., scene, actor, object, action), feature descriptors with varied lengths, and arbitrary temporal edge connections, so as to account for the large graph deformation commonly associated with complex activities. We further introduce the stacked hourglass architecture to STGCN to leverage the advantages of an encoder-decoder design for improved generalization performance and localization accuracy. We explore various descriptors, such as frame-level VGG, segment-level I3D, and RCNN-based object features, as node descriptors to enable action segmentation based on joint inference over comprehensive contextual information. We show results on CAD120 (which provides pre-computed node features and edge weights for a fair performance comparison across algorithms) as well as a more complex real-world activity dataset, Charades. Our Stacked-STGCN in general achieves a 4.1% improvement over the best reported results on CAD120 and a 1.3% improvement on Charades using VGG features.
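One plausible way to accommodate nodes whose descriptors differ in length, as the abstract describes, is to project each node type into a shared embedding before graph convolution. This is an assumption about the mechanism, with invented type names and sizes:

```python
import torch.nn as nn

class NodeProjector(nn.Module):
    def __init__(self, dims_by_type, d_shared=512):
        # dims_by_type, e.g. {"scene_vgg": 4096, "segment_i3d": 1024, "object_rcnn": 2048}
        super().__init__()
        self.proj = nn.ModuleDict({t: nn.Linear(d, d_shared)
                                   for t, d in dims_by_type.items()})

    def forward(self, feats_by_type):
        # feats_by_type maps each node type to a (num_nodes, dim) tensor;
        # after projection every node lives in the same d_shared space.
        return {t: self.proj[t](x) for t, x in feats_by_type.items()}
```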
Human actions comprise of joint motion of articulated body parts, or "gestures". A human skeleton is intuitively represented as a sparse graph, with the joints as nodes and the natural connections between them as edges. Graph convolutional networks have been used to recognize actions from skeletal videos. We introduce a part-based graph convolutional network (PB-GCN) for this task, inspired by deformable part-based models (DPMs). We divide the skeleton graph into four subgraphs with joints shared across them and learn a recognition model using a part-based graph convolutional network. We show that such a model improves performance over a model using the entire skeleton graph. Instead of using 3D joint coordinates as node features, we use relative coordinates and temporal displacements to improve performance. Our model achieves state-of-the-art performance on two challenging benchmark datasets, NTU RGB+D and HDM05, for skeleton-based action recognition.
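A small sketch of the node features mentioned above: relative coordinates with respect to an anchor joint and frame-to-frame temporal displacements (the anchor choice is an assumption):

```python
import torch

def geometric_features(joints, anchor=0):
    # joints: (T, V, 3) 3D joint positions over T frames
    rel = joints - joints[:, anchor:anchor + 1, :]   # coordinates relative to anchor joint
    disp = torch.zeros_like(joints)
    disp[1:] = joints[1:] - joints[:-1]              # temporal displacement per joint
    return torch.cat([rel, disp], dim=-1)            # (T, V, 6) node features
```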
Recently, skeleton-based action recognition has gained popularity due to cost-effective depth sensors coupled with real-time skeleton estimation algorithms. Traditional approaches based on handcrafted features are limited in representing the complexity of motion patterns. Recent methods that use Recurrent Neural Networks (RNNs) to handle raw skeletons only focus on the contextual dependency in the temporal domain and neglect the spatial configurations of articulated skeletons. In this paper, we propose a novel two-stream RNN architecture to model both temporal dynamics and spatial configurations for skeleton-based action recognition. We explore two different structures for the temporal stream: stacked RNN and hierarchical RNN. The hierarchical RNN is designed according to human body kinematics. We also propose two effective methods to model the spatial structure by converting the spatial graph into a sequence of joints. To improve the generalization of our model, we further exploit 3D transformation based data augmentation techniques, including rotation and scaling transformations, to transform the 3D coordinates of skeletons during training. Experiments on 3D action recognition benchmark datasets show that our method brings a considerable improvement for a variety of actions, i.e., generic actions, interaction activities, and gestures.
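A minimal sketch of the hierarchical temporal-stream idea: one RNN per body part, whose outputs are fused by a body-level RNN. The part grouping, joint indices, and sizes are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Hypothetical grouping of joint indices into kinematic parts.
PARTS = {"left_arm": [4, 5, 6], "right_arm": [8, 9, 10],
         "left_leg": [12, 13, 14], "right_leg": [16, 17, 18],
         "torso": [0, 1, 2, 3]}

class HierarchicalRNN(nn.Module):
    def __init__(self, hidden=64, num_classes=60):
        super().__init__()
        self.part_rnns = nn.ModuleDict({
            p: nn.LSTM(3 * len(idx), hidden, batch_first=True)
            for p, idx in PARTS.items()})
        self.body_rnn = nn.LSTM(hidden * len(PARTS), hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):
        # x: (N, T, V, 3) 3D joint coordinates
        outs = []
        for p, idx in PARTS.items():
            part = x[:, :, idx, :].flatten(2)          # (N, T, 3*|part|)
            h, _ = self.part_rnns[p](part)             # per-part dynamics
            outs.append(h)
        h, _ = self.body_rnn(torch.cat(outs, dim=-1))  # fuse parts over time
        return self.fc(h[:, -1])                       # classify from last step
```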
Human skeleton joints are popular for action analysis since they can be easily extracted from videos to discard background noise. However, current skeleton representations do not fully benefit from machine learning with CNNs. We propose "Skepxels", a spatio-temporal representation for skeleton sequences that fully exploits the "local" correlations between joints using the 2D convolution kernels of CNNs. We use Skepxels to transform skeleton videos into images of flexible dimensions and develop a CNN-based framework for effective human action recognition using the resulting images. Skepxels encode the spatio-temporal information of skeleton joints within a frame by maximizing a unique distance metric, defined collaboratively over the distinct joint arrangements used in the skeletal images. Moreover, they are flexible in encoding compound semantic notions such as the locations and speeds of the joints. The proposed action recognition exploits the representation in a hierarchical manner by first capturing the micro-temporal relations between the skeleton joints with the Skepxels and then exploiting their macro-temporal relations by computing the Fourier Temporal Pyramids over the CNN features of the skeletal images. We extend the Inception-ResNet CNN architecture with the proposed method and improve the state-of-the-art accuracy by 4.4% on the large-scale NTU human activity dataset. On the medium-sized N-UCLA and UTH-MHAD datasets, our method outperforms existing results by 5.7% and 9.3% respectively.
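A rough sketch of the image construction: each frame contributes a small block of "skeletal pixels" whose RGB channels hold the (x, y, z) joint coordinates, and blocks are tiled over time. The joint-to-pixel arrangement here is arbitrary, whereas Skepxels searches for arrangements that maximize the distance metric:

```python
import numpy as np

def skeleton_image(seq, grid=(5, 5)):
    # seq: (T, V, 3) skeleton sequence with V <= grid[0] * grid[1]
    T, V, _ = seq.shape
    h, w = grid
    flat = np.zeros((T, h * w, 3), dtype=np.float32)
    flat[:, :V] = seq                        # one h x w block per frame
    blocks = flat.reshape(T, h, w, 3)
    # tile the per-frame blocks along the width -> (h, T*w, 3) RGB-like image
    return blocks.transpose(1, 0, 2, 3).reshape(h, T * w, 3)
```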
In this paper, we propose a deep progressive reinforcement learning (DPRL) method for action recognition in skeleton-based videos, which aims to distil the most informative frames and discard ambiguous frames in sequences for recognizing actions. Since the choices for selecting representative frames are multitudinous for each video, we model frame selection as a progressive process through deep reinforcement learning, during which we progressively adjust the chosen frames by taking two important factors into account: (1) the quality of the selected frames and (2) the relationship of the selected frames to the whole video. Moreover, considering that the topology of the human body inherently lies in a graph-based structure, where the vertices and edges represent the hinged joints and rigid bones respectively, we employ a graph-based convolutional neural network to capture the dependency between the joints for action recognition. Our approach achieves very competitive performance on three widely used benchmarks.
Recognizing human actions in untrimmed videos is an important and challenging task. Effective 3D motion representations and powerful learning models are two key factors influencing recognition performance. In this paper, we introduce a new skeleton-based representation for 3D action recognition in videos. The key idea of the proposed representation is to transform the 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a color-encoding process. By normalizing the 3D joint coordinates and dividing each skeleton frame into five parts, in which the joints are concatenated according to the order of their physical connections, the color-encoded representation is able to capture the spatio-temporal evolution of complex 3D motions, independently of the length of each sequence. We then design and train different deep convolutional neural networks (D-CNNs) based on the residual network architecture (ResNet) on the obtained image-based representations to learn 3D motion features and classify them into classes. Our method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. Experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches while requiring less computation for training and prediction.
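A sketch of the color-encoding idea: normalize the 3D joint coordinates into [0, 255] and lay them out as a joints-by-frames RGB image, so (x, y, z) map to the three color channels. The layout and normalization details are simplified assumptions:

```python
import numpy as np

def skeleton_to_rgb(seq):
    # seq: (T, V, 3) 3D joint coordinates over T frames
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    img = 255.0 * (seq - lo) / (hi - lo + 1e-8)      # per-channel normalization
    return img.transpose(1, 0, 2).astype(np.uint8)   # (V, T, 3): rows=joints, cols=frames
```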
Action recognition and human pose estimation are closely related, but both problems are generally handled as distinct tasks in the literature. In this work, we propose a multitask framework for joint 2D and 3D pose estimation from still images and human action recognition from video sequences. We show that a single architecture can be used to solve the two problems in an efficient way and still achieve state-of-the-art results. Additionally, we demonstrate that end-to-end optimization leads to significantly higher accuracy than separated learning. The proposed architecture can be trained seamlessly with data from different categories simultaneously. The reported results on four datasets (MPII, Human3.6M, Penn Action and NTU) demonstrate the effectiveness of our method on the targeted tasks.
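A toy sketch of the shared-backbone multitask idea, where both task losses backpropagate through the same features; all layer sizes and names are invented, and the actual architecture is far deeper:

```python
import torch.nn as nn

class PoseActionNet(nn.Module):
    """Shared backbone with one head per task; sizes are toy values."""
    def __init__(self, n_joints=16, n_actions=60):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8))
        self.pose_head = nn.Linear(32 * 8 * 8, n_joints * 2)  # 2D pose regression
        self.action_head = nn.Linear(32 * 8 * 8, n_actions)   # action logits

    def forward(self, x):
        f = self.backbone(x).flatten(1)
        return self.pose_head(f), self.action_head(f)  # train both heads jointly
```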
Most state-of-the-art methods for action recognition rely on a two-stream architecture that processes appearance and motion independently. In this paper, we claim that considering them jointly offers rich information for action recognition. We introduce a novel representation that gracefully encodes the movement of some semantic keypoints. We use the human joints as these keypoints and term our Pose moTion representation PoTion. Specifically, we first run a state-of-the-art human pose estimator [4] and extract heatmaps for the human joints in each frame. We obtain our PoTion representation by temporally aggregating these probability maps. This is achieved by 'colorizing' each of them depending on the relative time of the frames in the video clip and summing them. This fixed-size representation for an entire video clip is suitable to classify actions using a shallow convolutional neural network. Our experimental evaluation shows that PoTion outperforms other state-of-the-art pose representations [6, 48]. Furthermore, it is complementary to standard appearance and motion streams. When combining PoTion with the recent two-stream I3D approach [5], we obtain state-of-the-art performance on the JHMDB, HMDB and UCF101 datasets.
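A compact sketch of the temporal aggregation step using a two-channel colorization (the channel count and normalization details are simplified assumptions):

```python
import numpy as np

def potion(heatmaps):
    # heatmaps: (T, J, H, W) per-frame joint probability maps
    T = heatmaps.shape[0]
    t = np.linspace(0.0, 1.0, T)          # relative time of each frame in the clip
    w = np.stack([1.0 - t, t], axis=1)    # colorization weights: early vs. late
    # weighted sum over time -> (J, 2, H, W) fixed-size clip representation
    return np.einsum("tjhw,tc->jchw", heatmaps, w)
```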
Driven by the rapid development of computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition infers human actions (present state) based on complete action executions, and action prediction predicts actions (future state) based on incomplete action executions. These two tasks have recently become particularly popular topics because of their explosively emerging real-world applications, such as visual surveillance, autonomous driving vehicles, entertainment, and video retrieval. Over the past few decades, considerable effort has been devoted to building effective frameworks for action recognition and prediction. In this paper, we survey the complete state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also discussed systematically.
Action recognition with skeleton data has recently attracted much attention in computer vision. Previous studies are mostly based on fixed skeleton graphs that only capture the local physical dependencies among joints, which may miss implicit joint correlations. To capture richer dependencies, we introduce an encoder-decoder structure, called the A-link inference module, to capture action-specific latent dependencies, i.e., actional links, directly from actions. We also extend the existing skeleton graphs to represent higher-order dependencies, i.e., structural links. Combining the two types of links into a generalized skeleton graph, we further propose the actional-structural graph convolutional network (AS-GCN), which stacks actional-structural graph convolutions and temporal convolutions as basic building blocks to learn both spatial and temporal features for action recognition. A future pose prediction head is added in parallel to the recognition head to help capture more detailed action patterns through self-supervision. We validate AS-GCN on action recognition using two skeleton datasets, NTU-RGB+D and Kinetics. The proposed AS-GCN achieves consistently large improvements compared with the state-of-the-art methods. As a side product, AS-GCN also shows promising results for future pose prediction.
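As an illustration of the "structural link" idea, higher-order dependencies can be exposed by taking powers of the normalized adjacency; this is a sketch of the concept, not the paper's exact formulation, and the A-link inference module is omitted:

```python
import torch

def structural_links(A, k):
    # A: (V, V) normalized adjacency; A^k weights joints reachable within k hops
    A_k = torch.linalg.matrix_power(A, k)
    return A_k / A_k.sum(dim=1, keepdim=True).clamp(min=1e-8)  # row-normalize
```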
Skeleton-based action recognition has made great progress recently, but many problems still remain unsolved. For example, the representations of skeleton sequences captured by most of the previous methods lack spatial structure information and detailed temporal dynamics features. In this paper, we propose a novel model with spatial reasoning and temporal stack learning (SR-TSL) for skeleton-based action recognition, which consists of a spatial reasoning network (SRN) and a temporal stack learning network (TSLN). The SRN can capture the high-level spatial structural information within each frame by a residual graph neural network, while the TSLN can model the detailed temporal dynamics of skeleton sequences by a composition of multiple skip-clip LSTMs. During training, we propose a clip-based incremental loss to optimize the model. We perform extensive experiments on the SYSU 3D Human-Object Interaction dataset and NTU RGB+D dataset and verify the effectiveness of each network of our model. The comparison results illustrate that our approach achieves much better results than the state-of-the-art methods.
Many videos depict people, and it is their interactions that inform us about their activities, their relationships with one another, and the cultural and social setting. With advances in human action recognition, researchers have begun to address the automated recognition of these human-human interactions from video. The main challenges stem from dealing with the considerable variation in recording settings, the appearance of the people depicted, and the way their interactions unfold. This survey provides a summary of these challenges and datasets, followed by an in-depth discussion of relevant vision-based recognition and detection methods. We focus on recent, promising work based on convolutional neural networks (CNNs). Finally, we outline directions to overcome the limitations of the current state of the art.
This paper presents an overview of state-of-the-art methods in activity recognition using semantic features. Unlike low-level features, semantic features describe inherent characteristics of activities. Therefore, semantics make the recognition task more reliable especially when the same actions look visually different due to the variety of action executions. We define a semantic space including the most popular semantic features of an action namely the human body (pose and poselet), attributes, related objects, and scene context. We present methods exploiting these semantic features to recognize activities from still images and video data as well as four groups of activities: atomic actions, people interactions, human-object interactions, and group activities. Furthermore, we provide potential applications of semantic approaches along with directions for future research.
The dominant paradigm for learning-based approaches in computer vision is training generic models on large datasets, such as ResNet for image recognition or I3D for video understanding, and allowing them to discover the optimal representation for the problem at hand. While this is an obviously attractive approach, it is not applicable in all scenarios. We claim that action detection is one such challenging problem: the models that need to be trained are large, and labeled data is expensive to obtain. To address this limitation, we propose to incorporate domain knowledge into the structure of the model, simplifying optimization. In particular, we augment a standard I3D network with a tracking module to aggregate long-term motion patterns, and use a graph convolutional network to reason about interactions between actors and objects. Evaluated on the challenging AVA dataset, the proposed approach improves over the I3D baseline by 5.5% mAP and outperforms the state of the art by 4.8% mAP.
This paper presents a new method for 3D action recognition with skeleton sequences (i.e., 3D trajectories of human skeleton joints). The proposed method first transforms each skeleton sequence into three clips, each consisting of several frames, for spatial temporal feature learning using deep neural networks. Each clip is generated from one channel of the cylindrical coordinates of the skeleton sequence. Each frame of the generated clips represents the temporal information of the entire skeleton sequence, and incorporates one particular spatial relationship between the joints. The entire clips include multiple frames with different spatial relationships, which provide useful spatial structural information of the human skeleton. We propose to use deep convolutional neural networks to learn long-term temporal information of the skeleton sequence from the frames of the generated clips, and then use a Multi-Task Learning Network (MTLN) to jointly process all frames of the generated clips in parallel to incorporate spatial structural information for action recognition. Experimental results clearly show the effectiveness of the proposed new representation and feature learning method for 3D action recognition.
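A minimal sketch of the coordinate transform behind the three clips, one per channel (rho, theta, z), assuming the cylinder axis is z; the paper's axis and reference-point conventions may differ:

```python
import numpy as np

def to_cylindrical(seq):
    # seq: (T, V, 3) Cartesian joint coordinates (x, y, z)
    x, y, z = seq[..., 0], seq[..., 1], seq[..., 2]
    rho = np.hypot(x, y)      # radial distance from the z-axis
    theta = np.arctan2(y, x)  # azimuth angle
    return np.stack([rho, theta, z], axis=-1)  # each channel yields one clip
```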
This paper presents a new framework for human action recognition from 3D skeleton sequences. Previous studies do not fully utilize the temporal relationships between video segments in a human action. Some studies successfully used very deep convolutional neural network (CNN) models but suffered from the data insufficiency problem. In this study, we first divide a skeleton sequence into distinct temporal segments to exploit the correlations between them. The temporal and spatial features of the skeleton sequences are then extracted simultaneously by utilizing a fine-to-coarse (F2C) CNN architecture optimized for human skeleton sequences. We evaluate our proposed method on the NTU RGB+D and SBU Kinect Interaction datasets. It achieves 79.6% and 84.6% accuracy on NTU RGB+D with the cross-subject and cross-view protocols respectively, which is almost identical to state-of-the-art performance. In addition, our method significantly improves the accuracy of actions in two-person interactions.
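The first step above amounts to splitting the sequence along the time axis; a trivial sketch, where the segment count is an assumption:

```python
import numpy as np

def temporal_segments(seq, n_segments=4):
    # seq: (T, V, C) skeleton sequence; returns a list of n_segments sub-sequences
    return np.array_split(seq, n_segments, axis=0)
```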