由于其低成本和快速移动性,无人驾驶汽车(UAV)现在已广泛应用于数据获取。随着航空视频量的增加,对这些视频自动解析的需求正在激增。为了实现这一目标,当前的研究主要集中于在空间和时间维度沿着卷积的整体特征提取整体特征。但是,这些方法受到小时接收场的限制,无法充分捕获长期的时间依赖性,这对于描述复杂动力学很重要。在本文中,我们提出了一个新颖的深神经网络,称为futh-net,不仅为整体特征建模,而且还模拟了空中视频分类的时间关系。此外,在新型融合模块中,多尺度的时间关系可以完善整体特征,以产生更具歧视性的视频表示。更特别地,FUTH-NET采用了两条道路架构:(1)学习框架外观和短期时间变化的一般特征的整体代表途径,以及(2)捕获跨任意跨越任意时间关系的时间关系途径框架,提供长期的时间依赖性。之后,提出了一个新型的融合模块,以时空整合从这两种途径中学到的两个特征。我们的模型对两个航空视频分类数据集进行了评估,即ERA和无人机操作,并实现了最新结果。这表明了其在不同识别任务(事件分类和人类行动识别)之间的有效性和良好的概括能力。为了促进进一步的研究,我们在https://gitlab.lrz.de/ai4eo/reasoning/futh-net上发布该代码。
translated by 谷歌翻译
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years. A robust and efficient HAR system has a pivotal role in fields like video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. What makes it challenging are the complex poses, understanding different viewpoints, and the environmental scenarios where the action is taking place. To address such complexities, in this paper, we propose a novel Sparse Weighted Temporal Attention (SWTA) module to utilize sparsely sampled video frames for obtaining global weighted temporal attention. The proposed SWTA is comprised of two parts. First, temporal segment network that sparsely samples a given set of frames. Second, weighted temporal attention, which incorporates a fusion of attention maps derived from optical flow, with raw RGB images. This is followed by a basenet network, which comprises a convolutional neural network (CNN) module along with fully connected layers that provide us with activity recognition. The SWTA network can be used as a plug-in module to the existing deep CNN architectures, for optimizing them to learn temporal information by eliminating the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model has received an accuracy of 72.76%, 92.56%, and 78.86% on the respective datasets thereby surpassing the previous state-of-the-art performances by a margin of 25.26%, 18.56%, and 2.94%, respectively.
translated by 谷歌翻译
有效地对视频中的空间信息进行建模对于动作识别至关重要。为了实现这一目标,最先进的方法通常采用卷积操作员和密集的相互作用模块,例如非本地块。但是,这些方法无法准确地符合视频中的各种事件。一方面,采用的卷积是有固定尺度的,因此在各种尺度的事件中挣扎。另一方面,密集的相互作用建模范式仅在动作 - 欧元零件时实现次优性能,给最终预测带来了其他噪音。在本文中,我们提出了一个统一的动作识别框架,以通过引入以下设计来研究视频内容的动态性质。首先,在提取本地提示时,我们会生成动态尺度的时空内核,以适应各种事件。其次,为了将这些线索准确地汇总为全局视频表示形式,我们建议仅通过变压器在一些选定的前景对象之间进行交互,从而产生稀疏的范式。我们将提出的框架称为事件自适应网络(EAN),因为这两个关键设计都适应输入视频内容。为了利用本地细分市场内的短期运动,我们提出了一种新颖有效的潜在运动代码(LMC)模块,进一步改善了框架的性能。在几个大规模视频数据集上进行了广泛的实验,例如,某种东西,动力学和潜水48,验证了我们的模型是否在低拖鞋上实现了最先进或竞争性的表演。代码可在:https://github.com/tianyuan168326/ean-pytorch中找到。
translated by 谷歌翻译
Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blcoks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together. * The work was done during an internship at SenseTime.
translated by 谷歌翻译
Temporal modeling is key for action recognition in videos. It normally considers both short-range motions and long-range aggregations. In this paper, we propose a Temporal Excitation and Aggregation (TEA) block, including a motion excitation (ME) module and a multiple temporal aggregation (MTA) module, specifically designed to capture both short-and long-range temporal evolution. In particular, for short-range motion modeling, the ME module calculates the feature-level temporal differences from spatiotemporal features. It then utilizes the differences to excite the motion-sensitive channels of the features. The long-range temporal aggregations in previous works are typically achieved by stacking a large number of local temporal convolutions. Each convolution processes a local temporal window at a time. In contrast, the MTA module proposes to deform the local convolution to a group of subconvolutions, forming a hierarchical residual architecture. Without introducing additional parameters, the features will be processed with a series of sub-convolutions, and each frame could complete multiple temporal aggregations with neighborhoods. The final equivalent receptive field of temporal dimension is accordingly enlarged, which is capable of modeling the long-range temporal relationship over distant frames. The two components of the TEA block are complementary in temporal modeling. Finally, our approach achieves impressive results at low FLOPs on several action recognition benchmarks, such as Kinetics, Something-Something, HMDB51, and UCF101, which confirms its effectiveness and efficiency.
translated by 谷歌翻译
Drone-camera based human activity recognition (HAR) has received significant attention from the computer vision research community in the past few years. A robust and efficient HAR system has a pivotal role in fields like video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. What makes it challenging are the complex poses, understanding different viewpoints, and the environmental scenarios where the action is taking place. To address such complexities, in this paper, we propose a novel Sparse Weighted Temporal Fusion (SWTF) module to utilize sparsely sampled video frames for obtaining global weighted temporal fusion outcome. The proposed SWTF is divided into two components. First, a temporal segment network that sparsely samples a given set of frames. Second, weighted temporal fusion, that incorporates a fusion of feature maps derived from optical flow, with raw RGB images. This is followed by base-network, which comprises a convolutional neural network module along with fully connected layers that provide us with activity recognition. The SWTF network can be used as a plug-in module to the existing deep CNN architectures, for optimizing them to learn temporal information by eliminating the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model has received an accuracy of 72.76%, 92.56%, and 78.86% on the respective datasets thereby surpassing the previous state-of-the-art performances by a significant margin.
translated by 谷歌翻译
视频自我监督的学习是一项挑战的任务,这需要模型的显着表达力量来利用丰富的空间时间知识,并从大量未标记的视频产生有效的监督信号。但是,现有方法未能提高未标记视频的时间多样性,并以明确的方式忽略精心建模的多尺度时间依赖性。为了克服这些限制,我们利用视频中的多尺度时间依赖性,并提出了一个名为时间对比图学习(TCGL)的新型视频自我监督学习框架,该框架共同模拟了片段间和片段间的时间依赖性用混合图对比学习策略学习的时间表示学习。具体地,首先引入空间 - 时间知识发现(STKD)模块以基于离散余弦变换的频域分析从视频中提取运动增强的空间时间表。为了显式模拟未标记视频的多尺度时间依赖性,我们的TCGL将关于帧和片段命令的先前知识集成到图形结构中,即片段/间隙间时间对比图(TCG)。然后,特定的对比学习模块旨在最大化不同图形视图中节点之间的协议。为了为未标记的视频生成监控信号,我们介绍了一种自适应片段订购预测(ASOP)模块,它利用视频片段之间的关系知识来学习全局上下文表示并自适应地重新校准通道明智的功能。实验结果表明我们的TCGL在大规模行动识别和视频检索基准上的最先进方法中的优势。
translated by 谷歌翻译
在本文中,一种称为VigAt的纯粹发行的自下而上的方法,该方法将对象检测器与视觉变压器(VIT)骨干网络一起得出对象和框架功能,以及一个头网络来处理这些功能,以处理事件的任务提出了视频中的识别和解释。VIGAT头由沿空间和时间维度分解的图形注意网络(GAT)组成,以便有效捕获对象或帧之间的局部和长期依赖性。此外,使用从各个GAT块的邻接矩阵得出的加权内(wids),我们表明所提出的体系结构可以识别解释网络决策的最显着对象和框架。进行了全面的评估研究,表明所提出的方法在三个大型公开视频数据集(FCVID,Mini-Kinetics,ActivityNet)上提供了最先进的结果。
translated by 谷歌翻译
自动外科阶段识别在机器人辅助手术中起着重要作用。现有方法忽略了一个关键问题,即外科阶段应该通过学习段级语义来分类,而不是仅仅依赖于框架明智的信息。在本文中,我们提出了一种段 - 细分分层一致性网络(SAHC),用于来自视频的外科阶段识别。关键的想法是提取分层高级语义 - 一致的段,并使用它们来优化由暧昧帧引起的错误预测。为实现它,我们设计一个时间分层网络以生成分层高级段。然后,我们引入分层段帧注意力(SFA)模块,以捕获低级帧和高级段之间的关系。通过通过一致性损耗来规则地规范帧及其对应段的预测,网络可以生成语义 - 一致的段,然后纠正由模糊的低级帧引起的错误分类预测。我们在两个公共外科视频数据集上验证SAHC,即M2CAI16挑战数据集和CholeC80数据集。实验结果表明,我们的方法优于以前的最先进的余量,显着达到M2Cai16的4.1%。代码将在验收时在Github发布。
translated by 谷歌翻译
无人驾驶飞机(UAV)通过低成本,大型覆盖,实时和高分辨率数据采集能力而广泛应用于检查,搜索和救援行动的目的。在这些过程中产生了大量航空视频,在这些过程中,正常事件通常占压倒性的比例。本地化和提取异常事件非常困难,这些事件包含手动从长视频流中的潜在有价值的信息。因此,我们致力于开发用于解决此问题的异常检测方法。在本文中,我们创建了一个新的数据集,名为Droneanomaly,用于空中视频中的异常检测。该数据集提供了37个培训视频序列和22个测试视频序列,这些视频序列来自7个不同的现实场景,其中包括各种异常事件。有87,488个彩色视频框架(训练51,635,测试35,853),每秒30帧的尺寸为640美元\ times 640美元。基于此数据集,我们评估现有方法并为此任务提供基准。此外,我们提出了一种新的基线模型,即变压器(ANDT)的异常检测,该模型将连续的视频帧视为一系列小管,它利用变压器编码器从序列中学习特征表示,并利用解码器来预测下一帧。我们的网络模型在训练阶段模型正常,并确定了具有不可预测的时间动力学的事件,作为测试阶段的异常。此外,为了全面评估我们提出的方法的性能,我们不仅使用无人机 - 异常数据集,而且使用另一个数据集。我们将使我们的数据集和代码公开可用。可以在https://youtu.be/ancczyryoby上获得演示视频。我们使数据集和代码公开可用。
translated by 谷歌翻译
Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters;(ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves stateof-the-art results. Our code and models are available at http://www.robots.ox.ac.uk/ vgg/software/two stream action
translated by 谷歌翻译
In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significantly gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block "R(2+1)D" which produces CNNs that achieve results comparable or superior to the state-of-theart on Sports-1M, Kinetics, UCF101, and HMDB51.
translated by 谷歌翻译
人类对象相互作用(HOI)识别的关键是推断人与物体之间的关系。最近,该图像的人类对象相互作用(HOI)检测取得了重大进展。但是,仍然有改善视频HOI检测性能的空间。现有的一阶段方法使用精心设计的端到端网络来检测视频段并直接预测交互。它使网络的模型学习和进一步的优化更加复杂。本文介绍了空间解析和动态时间池(SPDTP)网络,该网络将整个视频作为时空图作为人类和对象节点作为输入。与现有方法不同,我们提出的网络通过显式空间解析预测交互式和非相互作用对之间的差异,然后执行交互识别。此外,我们提出了一个可学习且可区分的动态时间模块(DTM),以强调视频的关键帧并抑制冗余帧。此外,实验结果表明,SPDTP可以更多地关注主动的人类对象对和有效的密钥帧。总体而言,我们在CAD-1220数据集和某些ELSE数据集上实现了最先进的性能。
translated by 谷歌翻译
高效的时空建模是视频动作识别的重要而挑战性问题。现有的最先进的方法利用相邻的特征差异,以获得短期时间建模的运动线索,简单的卷积。然而,只有一个本地卷积,由于接收领域有限而无法处理各种动作。此外,摄像机运动带来的动作耳鸣还将损害提取的运动功能的质量。在本文中,我们提出了一个时间显着积分(TSI)块,其主要包含突出运动激励(SME)模块和交叉感知时间集成(CTI)模块。具体地,中小企业旨在通过空间级局部 - 全局运动建模突出显示运动敏感区域,其中显着对准和金字塔型运动建模在相邻帧之间连续进行,以捕获由未对准背景引起的噪声较少的运动动态。 CTI旨在分别通过一组单独的1D卷积进行多感知时间建模。同时,不同看法的时间相互作用与注意机制相结合。通过这两个模块,通过引入有限的附加参数,可以有效地编码长短的短期时间关系。在几个流行的基准测试中进行了广泛的实验(即,某种东西 - 某种东西 - 东西 - 400,uCF-101和HMDB-51),这证明了我们所提出的方法的有效性。
translated by 谷歌翻译
In this paper, we develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner. Unlike most existing methods with offline feature generation, our method directly takes frames as input and further models motion evolution on two different temporal scales.Therefore, we solve the complexity problems of the two stages of modeling and the problem of insufficient temporal and spatial information of a single scale. Our proposed End-to-End MultiScale Network (E2EMSNet) is composed of two scales which are named segment scale and observed global scale. The segment scale leverages temporal difference over consecutive frames for finer motion patterns by supplying 2D convolutions. For observed global scale, a Long Short-Term Memory (LSTM) is incorporated to capture motion features of observed frames. Our model provides a simple and efficient modeling framework with a small computational cost. Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101. The extensive experiments demonstrate the effectiveness of our method for action prediction in videos.
translated by 谷歌翻译
We present SlowFast networks for video recognition. Our model involves (i) a Slow pathway, operating at low frame rate, to capture spatial semantics, and (ii) a Fast pathway, operating at high frame rate, to capture motion at fine temporal resolution. The Fast pathway can be made very lightweight by reducing its channel capacity, yet can learn useful temporal information for video recognition. Our models achieve strong performance for both action classification and detection in video, and large improvements are pin-pointed as contributions by our SlowFast concept. We report state-of-the-art accuracy on major video recognition benchmarks, Kinetics, Charades and AVA. Code has been made available at: https://github.com/ facebookresearch/SlowFast.
translated by 谷歌翻译
translated by 谷歌翻译
Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species. In this paper, we introduce an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales. We evaluate TRN-equipped networks on activity recognition tasks using three recent video datasets -Something-Something, Jester, and Charades -which fundamentally depend on temporal relational reasoning. Our results demonstrate that the proposed TRN gives convolutional neural networks a remarkable capacity to discover temporal relations in videos. Through only sparsely sampled video frames, TRN-equipped networks can accurately predict human-object interactions in the Something-Something dataset and identify various human gestures on the Jester dataset with very competitive performance. TRN-equipped networks also outperform two-stream networks and 3D convolution networks in recognizing daily activities in the Charades dataset. Further analyses show that the models learn intuitive and interpretable visual common sense knowledge in videos 1 .
translated by 谷歌翻译
Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial when utilizing a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on spatial domain (equivalent to 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named , that exploits all the variants of blocks but composes each in different placement of ResNet, following the philosophy that enhancing structural diversity with going deep could improve the power of neural networks. Our P3D ResNet achieves clear improvements on Sports-1M video classification dataset against 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performances over several state-of-the-art techniques.
translated by 谷歌翻译
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
translated by 谷歌翻译