We address the problem of activity detection in continuous, untrimmed video streams. This is a difficult task that requires extracting meaningful spatio-temporal features to capture activities and accurately localizing the start and end times of each activity. We introduce a new model, Region Convolutional 3D Network (R-C3D), which encodes the video streams using a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities. Computation is saved due to the sharing of convolutional features between the proposal and the classification pipelines. The entire model is trained end-to-end with jointly optimized localization and classification losses. R-C3D is faster than existing methods (569 frames per second on a single Titan X Maxwell GPU) and achieves state-of-the-art results on THUMOS'14. We further demonstrate that our model is a general activity detection framework that does not rely on assumptions about particular dataset properties by evaluating our approach on ActivityNet and Charades. Our code is available at http://ai.bu.edu/r-c3d/.
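A minimal PyTorch sketch of the feature-sharing idea above (the two-layer backbone, layer sizes, and anchor count are illustrative assumptions, not the released R-C3D code): a single 3D ConvNet encodes the clip, and both the proposal head and the classification head consume its output.

```python
import torch
import torch.nn as nn

class SharedC3DBackbone(nn.Module):
    """One 3D ConvNet whose features feed both detection stages."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(2, 2, 2)),
        )

    def forward(self, clip):            # clip: (B, 3, T, H, W)
        return self.features(clip)      # shared spatio-temporal feature map

class ProposalHead(nn.Module):
    """Scores anchor segments at each temporal position (action vs. background)."""
    def __init__(self, in_ch=128, num_anchors=8):
        super().__init__()
        self.score = nn.Conv3d(in_ch, 2 * num_anchors, kernel_size=1)

    def forward(self, feat):
        # pool away space, keep the temporal axis for per-position anchor scores
        return self.score(feat.mean(dim=[3, 4], keepdim=True))

class ClassificationHead(nn.Module):
    """Classifies a pooled proposal feature into activity classes."""
    def __init__(self, in_ch=128, num_classes=21):
        super().__init__()
        self.fc = nn.Linear(in_ch, num_classes)

    def forward(self, pooled_proposal_feat):   # (num_proposals, in_ch)
        return self.fc(pooled_proposal_feat)

backbone, rpn, classifier = SharedC3DBackbone(), ProposalHead(), ClassificationHead()
feat = backbone(torch.randn(1, 3, 16, 112, 112))   # one 16-frame clip
print(feat.shape, rpn(feat).shape)
```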
Dense video captioning is a fine-grained video understanding task that involves two sub-problems: localizing distinct events in a long video stream, and generating captions for the localized events. We propose the Joint Event Detection and Description Network (JEDDi-Net), which solves the dense video captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and generates their captions. Proposal features are extracted within each proposal segment through 3D Segment-of-Interest pooling from the shared video feature encoding. In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by standard metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.
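As a rough illustration of the 3D Segment-of-Interest pooling step (the feature shapes and pooling resolution below are assumptions, not JEDDi-Net's actual configuration), one can crop the shared feature map along time for each proposal and pool it to a fixed size:

```python
import torch
import torch.nn.functional as F

def soi_pool(feat, start, end, out_len=4):
    """feat: (C, T, H, W) shared video encoding; start/end: temporal feature indices."""
    segment = feat[:, start:end]                       # crop along the temporal axis
    # adaptive pooling gives every proposal the same fixed-size output
    return F.adaptive_max_pool3d(segment, (out_len, 1, 1)).flatten(1)

feat = torch.randn(128, 32, 7, 7)        # e.g. conv features for one video chunk
pooled = soi_pool(feat, start=5, end=21)
print(pooled.shape)                       # (128, 4): fixed-size input for the captioner
```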
In this paper, we present a novel single-shot multi-span detector for temporal activity detection in long, untrimmed videos, using a simple end-to-end fully three-dimensional convolutional (Conv3D) network. Our architecture, named S3D, encodes the entire video stream and discretizes the output space of temporal activities into a set of default spans over different temporal locations and scales. At prediction time, S3D predicts scores for the presence of activity categories in each default span and produces temporal adjustments relative to the span location to predict the precise activity duration. Unlike state-of-the-art systems that require separate proposal and classification stages, S3D is intrinsically simple and purpose-built for single-shot, end-to-end temporal activity detection. When evaluated on the THUMOS'14 detection benchmark, S3D achieves state-of-the-art performance while being highly efficient, running at 1271 FPS.
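The discretization into default spans and the offset decoding can be illustrated with a short, self-contained sketch (the grid size, scales, and offset parameterization are assumptions for the illustration):

```python
import math

def default_spans(num_cells, scales=(1.0, 1.5, 2.0)):
    """Return (center, length) pairs in normalized [0, 1] video time."""
    spans = []
    for i in range(num_cells):
        center = (i + 0.5) / num_cells
        for s in scales:
            spans.append((center, s / num_cells))
    return spans

def decode(span, d_center, d_length):
    """Apply predicted adjustments to a default span to get a refined span."""
    c, l = span
    return c + d_center * l, l * math.exp(d_length)

spans = default_spans(num_cells=8)
print(len(spans), spans[0])          # 24 default spans over the video
print(decode(spans[0], 0.1, -0.2))   # refined (center, length) after regression
```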
We present a Temporal Context Network (TCN) for precise temporal localization of human activities. Similar to the Faster R-CNN architecture, proposals are placed at equal intervals in a video and span multiple temporal scales. We propose a novel representation for ranking these proposals. Since pooling features only inside a segment is not sufficient to predict activity boundaries, we construct a representation that explicitly captures context around a proposal for ranking it. For each temporal segment inside a proposal, features are uniformly sampled at a pair of scales and are input to a temporal convolutional neural network for classification. After ranking proposals, non-maximum suppression is applied and classification is performed to obtain final detections. TCN outperforms state-of-the-art methods on the ActivityNet dataset and the THUMOS14 dataset.
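The context idea can be sketched as follows (the sampling counts, mean pooling, and single scale pair are simplifying assumptions): describe a proposal by features pooled both inside the segment and over a wider window around it, so the ranker sees the contrast at the boundaries.

```python
import numpy as np

def proposal_with_context(features, s, e, context_scale=2.0, num_samples=8):
    """features: (T, D) frame-level features; s, e: proposal start/end indices."""
    length = e - s
    half_extra = (context_scale - 1.0) * length / 2.0
    cs = max(0, int(s - half_extra))
    ce = min(len(features), int(e + half_extra))
    inside = features[np.linspace(s, e - 1, num_samples).astype(int)].mean(0)
    around = features[np.linspace(cs, ce - 1, num_samples).astype(int)].mean(0)
    return np.concatenate([inside, around])   # fed to the temporal ConvNet ranker

feats = np.random.randn(100, 512)
print(proposal_with_context(feats, 30, 50).shape)   # (1024,)
```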
In this paper, we propose the Spatio-TEmporal Progressive (STEP) action detector, a progressive learning framework for spatio-temporal action detection in videos. Starting from a handful of coarse proposal cuboids, our approach progressively refines the proposals toward actions over a few steps. In this way, high-quality proposals (i.e., proposals that adhere to the action's motion) are gradually obtained at later steps by leveraging the regression outputs of earlier steps. At each step, we adaptively extend the proposals in time to incorporate more related temporal context. Compared with prior work that performs action detection in a single run, our progressive learning framework naturally handles the spatial displacement within action tubes and thus provides a more effective approach to spatio-temporal modeling. We extensively evaluate our method on UCF101 and AVA and demonstrate superior detection results. Notably, with 3 progressive steps we achieve mAPs of 75.0% and 18.6%, using only 11 and 34 initial proposals, respectively.
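The progressive scheme can be caricatured in a few lines (the identity regressor and the fixed temporal extension below are placeholders, not the STEP model): each step regresses the per-frame boxes and then widens the temporal extent.

```python
def progressive_refine(proposal, regress_fn, num_steps=3, extend=2):
    """proposal: {'boxes': per-frame boxes, 'span': (t0, t1)}; regress_fn refines boxes."""
    for _ in range(num_steps):
        proposal["boxes"] = regress_fn(proposal)        # follow the action's motion
        t0, t1 = proposal["span"]
        proposal["span"] = (t0 - extend, t1 + extend)   # adaptively add temporal context
    return proposal

proposal = {"boxes": [[10, 10, 50, 80]] * 4, "span": (8, 12)}
refined = progressive_refine(proposal, regress_fn=lambda p: p["boxes"])
print(refined["span"])   # (2, 18): the temporal extent grows with each step
```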
We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions.
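A minimal, assumed sketch of the word-level modulation (layer sizes and the gating form are illustrative, not the paper's exact design): a gate computed from the clip feature scales each word embedding before the recurrent query encoder reads it.

```python
import torch
import torch.nn as nn

class ModulatedQueryEncoder(nn.Module):
    def __init__(self, word_dim=300, vis_dim=512, hidden=256):
        super().__init__()
        self.gate = nn.Linear(vis_dim, word_dim)
        self.rnn = nn.GRU(word_dim, hidden, batch_first=True)

    def forward(self, word_embs, clip_feat):
        # word_embs: (B, L, word_dim), clip_feat: (B, vis_dim)
        gate = torch.sigmoid(self.gate(clip_feat)).unsqueeze(1)  # (B, 1, word_dim)
        _, h = self.rnn(word_embs * gate)                        # visually gated words
        return h.squeeze(0)                                      # query representation

encoder = ModulatedQueryEncoder()
query = encoder(torch.randn(2, 12, 300), torch.randn(2, 512))
print(query.shape)   # (2, 256)
```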
Object proposals have contributed significantly to recent advances in object understanding in images. Inspired by the success of this approach, we introduce Deep Action Proposals (DAPs), an effective and efficient algorithm for generating temporal action proposals from long videos. We show how to take advantage of the vast capacity of deep learning models and memory cells to retrieve from untrimmed videos those temporal segments that are likely to contain actions. A comprehensive evaluation indicates that our approach outperforms previous work on a large-scale action benchmark, runs at 134 FPS making it practical for large-scale scenarios, and exhibits an appealing ability to generalize, i.e., to retrieve good-quality temporal proposals for actions unseen in training.
Deep learning has been demonstrated to achieve excellent results for image classification and object detection. However, the impact of deep learning on video analysis has been limited due to the complexity of video data and the lack of annotations. Previous convolutional neural network (CNN) based video action detection approaches usually consist of two major steps: frame-level action proposal generation and association of proposals across frames. Also, most of these methods employ a two-stream CNN framework to handle spatial and temporal features separately. In this paper, we propose an end-to-end deep network called Tube Convolutional Neural Network (T-CNN) for action detection in videos. The proposed architecture is a unified deep network that is able to recognize and localize actions based on 3D convolution features. A video is first divided into equal-length clips, and for each clip a set of tube proposals is generated based on 3D Convolutional Network (ConvNet) features. Finally, the tube proposals of different clips are linked together using network flow, and spatio-temporal action detection is performed using these linked video proposals. Extensive experiments on several video datasets demonstrate the superior performance of T-CNN in classifying and localizing actions in both trimmed and untrimmed videos compared to state-of-the-art methods.
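A toy sketch of linking per-clip tube proposals into a video-level tube; the greedy chaining and the tube fields below are simplifying assumptions (the paper formulates linking as a network-flow problem instead).

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter)

def link_tubes(clips):
    """clips: list of per-clip tube lists; tube = {'score', 'first_box', 'last_box'}."""
    chain = [max(clips[0], key=lambda t: t["score"])]
    for clip in clips[1:]:
        prev = chain[-1]
        # prefer tubes that score high and overlap the previous tube at the clip boundary
        chain.append(max(clip, key=lambda t: t["score"] +
                         box_iou(prev["last_box"], t["first_box"])))
    return chain

clip_a = [{"score": 0.9, "first_box": (10, 10, 50, 80), "last_box": (12, 10, 52, 80)}]
clip_b = [{"score": 0.4, "first_box": (200, 5, 240, 70), "last_box": (205, 5, 245, 70)},
          {"score": 0.3, "first_box": (13, 11, 53, 81), "last_box": (15, 12, 55, 82)}]
print([t["score"] for t in link_tubes([clip_a, clip_b])])   # [0.9, 0.3]
```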
Detection of video shot transitions is a crucial pre-processing step in video analysis. Previous studies are restricted to detecting sudden content changes between frames through similarity measurement, and multi-scale operations are widely utilized to handle transitions of various lengths. However, the localization of gradual transitions remains under-explored due to the high visual similarity between adjacent frames. Cut transitions are abrupt semantic breaks, while gradual shot transitions contain low-level spatio-temporal patterns caused by video effects in addition to gradual semantic breaks, e.g., dissolves. To address this problem, we propose a structured network that detects these two kinds of shot transitions separately, using targeted models. Considering the speed-performance trade-off, we design a smart framework: with one TITAN GPU, the proposed method achieves a 30x real-time speed. Experiments on the public TRECVID07 and RAI databases show that our method outperforms state-of-the-art methods. To train a high-performance shot transition detector, we contribute a new database, ClipShots, which contains 128636 cut transitions and 38120 gradual transitions from 4039 online videos. ClipShots intentionally collects short videos to include more hard cases caused by hand-held camera vibration, large object motion, and occlusion.
Temporal action detection in video aims to temporally localize and recognize actions in untrimmed videos. Existing one-stage approaches mostly focus on unifying the two subtasks, i.e., localization of action proposals and classification of each proposal, through a fully shared backbone. However, such a design, which encapsulates all components of both subtasks in a single network, may restrict training by ignoring the specialized characteristics of each subtask. In this paper, we propose a novel Decoupled Single Shot temporal Action Detection (Decouple-SSAD) method that mitigates this problem by decoupling localization and classification within a one-stage scheme. In particular, two branches are designed in parallel so that each component owns its representations privately for accurate localization or classification. Each branch produces a set of action anchor layers by applying deconvolution to the feature maps of the main stream, so that higher-level semantic information from deeper layers is incorporated to enhance the feature representations. We conduct extensive experiments on the THUMOS14 dataset and demonstrate superior performance over state-of-the-art methods. Our code is available online.
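A rough sketch of the decoupling idea under assumed channel counts and anchor settings (not the released code): each branch deconvolves the main-stream feature map into its own anchor layers, one branch predicting boundary offsets and the other class scores.

```python
import torch
import torch.nn as nn

class AnchorBranch(nn.Module):
    """A private branch: deconvolve the main stream, then predict per-anchor outputs."""
    def __init__(self, in_ch=256, out_per_anchor=3, num_anchors=5):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, in_ch, kernel_size=4, stride=2, padding=1)
        self.head = nn.Conv1d(in_ch, num_anchors * out_per_anchor, kernel_size=3, padding=1)

    def forward(self, main_feat):                 # main_feat: (B, C, T')
        return self.head(torch.relu(self.deconv(main_feat)))

main = torch.randn(1, 256, 16)                    # main-stream temporal feature map
loc_branch = AnchorBranch(out_per_anchor=2)       # start/end offsets per anchor
cls_branch = AnchorBranch(out_per_anchor=21)      # class scores per anchor
print(loc_branch(main).shape, cls_branch(main).shape)   # (1, 10, 32) and (1, 105, 32)
```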
We propose TAL-Net, an improved approach to temporal action localization in video that is inspired by the Faster R-CNN object detection framework. TAL-Net addresses three key shortcomings of existing approaches: (1) we improve receptive field alignment using a multi-scale architecture that can accommodate extreme variation in action durations; (2) we better exploit the temporal context of actions for both proposal generation and action classification by appropriately extending receptive fields; and (3) we explicitly consider multi-stream feature fusion and demonstrate that fusing motion late is important. We achieve state-of-the-art performance for both action proposal and localization on the THUMOS'14 detection benchmark and competitive performance on the ActivityNet challenge.
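One plausible way to realize the receptive-field alignment, sketched here under assumptions rather than as TAL-Net's exact layers: give wider anchor scales larger dilation so each scale-specific head roughly covers its anchor's extent plus surrounding context.

```python
import torch
import torch.nn as nn

def anchor_head(scale_in_cells, channels=256, num_outputs=2):
    """Build a head whose receptive field grows with the anchor scale it serves."""
    dilation = max(1, scale_in_cells // 2)   # wider anchors -> wider receptive field
    return nn.Sequential(
        nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation, padding=dilation),
        nn.ReLU(),
        nn.Conv1d(channels, num_outputs, kernel_size=1),
    )

heads = {s: anchor_head(s) for s in (2, 4, 8, 16)}    # one head per anchor scale
print(heads[8](torch.randn(1, 256, 64)).shape)        # (1, 2, 64)
```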
We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and achieve high temporal localization accuracy. In the end, only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly superior performance compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014.
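A hedged illustration of folding temporal overlap into training (not the paper's exact localization loss): weight each segment's classification loss by its temporal IoU with the matched ground truth, so high-overlap segments are pushed to score higher than low-overlap ones.

```python
import torch
import torch.nn.functional as F

def temporal_iou(seg, gt):
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def overlap_weighted_loss(logits, labels, ious):
    per_segment = F.cross_entropy(logits, labels, reduction="none")
    return (per_segment * ious).mean()

logits = torch.randn(4, 21)                         # 20 actions + background
labels = torch.tensor([3, 3, 0, 7])
ious = torch.tensor([temporal_iou((10, 20), (12, 22)), 0.9, 0.1, 0.5])
print(overlap_weighted_loss(logits, labels, ious))
```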
Temporal Action Localization (TAL) in untrimmed video is important for many applications, but it is very expensive to annotate segment-level ground truth (action class and temporal boundary). This raises interest in addressing TAL with weak supervision, where only video-level annotations are available during training. However, state-of-the-art weakly-supervised TAL methods only focus on generating a good Class Activation Sequence (CAS) over time and then apply simple thresholding on the CAS to localize actions. In this paper, we first develop a novel weakly-supervised TAL framework called AutoLoc to directly predict the temporal boundary of each action instance. We propose a novel Outer-Inner-Contrastive (OIC) loss to automatically discover the needed segment-level supervision for training such a boundary predictor. Our method achieves dramatically improved performance: under the IoU threshold 0.5, our method improves mAP on THUMOS'14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. It is also very encouraging to see that our weakly-supervised method achieves results comparable to some fully-supervised methods.
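A small numpy sketch of an Outer-Inner-Contrastive style score on a 1-D class activation sequence (the inflation ratio and normalization are assumptions): a boundary is good when activation is high inside it and low just outside it.

```python
import numpy as np

def oic_score(cas, start, end, inflate=0.25):
    """cas: (T,) class activation sequence; lower returned value = better boundary."""
    length = end - start
    o_start = max(0, int(start - inflate * length))
    o_end = min(len(cas), int(end + inflate * length))
    inner = cas[start:end].mean()
    outer_vals = np.concatenate([cas[o_start:start], cas[end:o_end]])
    outer = outer_vals.mean() if outer_vals.size else 0.0
    return outer - inner

cas = np.zeros(100)
cas[30:60] = 1.0                                        # a clean activation "bump"
print(oic_score(cas, 30, 60), oic_score(cas, 20, 80))   # tight boundary beats loose one
```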
In many large-scale video analysis scenarios, one is interested in localizing and recognizing human activities that occur in short temporal intervals within long untrimmed videos. Current approaches for activity detection still struggle to handle large-scale video collections, and the task remains relatively unexplored. This is in part due to the computational complexity of current action recognition approaches and the lack of a method that proposes fewer intervals in the video, where activity processing can be focused. In this paper, we introduce a proposal method that aims to recover temporal segments containing actions in untrimmed videos. Building on techniques for learning sparse dictionaries, we introduce a learning framework to represent and retrieve activity proposals. We demonstrate the capabilities of our method not only in producing high-quality proposals but also in its efficiency. Finally, we show the positive impact our method has on recognition performance when it is used for action detection, while running at 10 FPS.
In this paper, we address the challenging problem of efficient temporal activity detection in untrimmed long videos. While most recent work has focused on and advanced detection accuracy, the inference time can take seconds to minutes per video, which is too slow to be useful in real-world settings. This motivates the proposed budget-aware framework, which learns to perform activity detection by intelligently selecting a small subset of frames according to a specified time budget. We formulate this problem as a Markov decision process and adopt a recurrent network to model the frame selection policy. We derive a recurrent policy gradient based approach to approximate the gradient of the non-decomposable and non-differentiable objective defined in our problem. In extensive experiments, we achieve competitive detection accuracy and, more importantly, our approach is able to substantially reduce computation time, detecting multiple activities in only 0.35s per untrimmed long video.
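A generic REINFORCE-style sketch of optimizing a non-differentiable objective of this kind (the tiny policy network, the state encoding, and the constant reward are placeholders, not the paper's model): sample which frame to observe next and weight the log-probability of the chosen action by the obtained reward.

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))  # 10 frame choices
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.randn(1, 16)                  # summary of what has been seen so far
probs = torch.softmax(policy(state), dim=-1)
dist = torch.distributions.Categorical(probs)
action = dist.sample()                      # pick the next frame to process
reward = torch.tensor(0.7)                  # stand-in for detection quality minus time cost

loss = -dist.log_prob(action) * reward      # policy-gradient surrogate loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```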
Temporal action localization is an important yet challenging problem. Given a long, untrimmed video consisting of multiple action instances and complex background contents, we need not only to recognize their action categories, but also to localize the start time and end time of each instance. Many state-of-the-art systems use segment-level classifiers to select and rank proposal segments of predetermined boundaries. However, a desirable model should move beyond segment-level and make dense predictions at a fine granularity in time to determine precise temporal boundaries. To this end, we design a novel Convolutional-De-Convolutional (CDC) network that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data. The proposed CDC filter performs the required temporal upsampling and spatial downsampling operations simultaneously to predict actions at the frame-level granularity. It is unique in jointly modeling action semantics in space-time and fine-grained temporal dynamics. We train the CDC network in an end-to-end manner efficiently. Our model not only achieves superior performance in detecting actions in every frame, but also significantly boosts the precision of localizing temporal boundaries. Finally, the CDC network demonstrates a very high efficiency with the ability to process 500 frames per second on a single GPU server. Source code and trained models are available online at https://bitbucket.org/columbiadvmm/cdc.
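The joint temporal upsampling and spatial downsampling can be approximated with standard 3D operations, as in the assumed sketch below (the real CDC filter performs both inside a single learned kernel, which this composition does not reproduce exactly):

```python
import torch
import torch.nn as nn

cdc_like = nn.Sequential(
    nn.AvgPool3d(kernel_size=(1, 4, 4)),                       # spatial downsampling
    nn.ConvTranspose3d(512, 512, kernel_size=(4, 1, 1),
                       stride=(2, 1, 1), padding=(1, 0, 0)),   # temporal upsampling x2
)

conv5 = torch.randn(1, 512, 8, 4, 4)      # 3D ConvNet features with reduced temporal length
out = cdc_like(conv5)
print(out.shape)                          # (1, 512, 16, 1, 1): finer temporal predictions
```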
Temporal Action Proposal (TAP) generation is an important problem, as fast and accurate extraction of semantically important segments (e.g., human actions) from untrimmed videos is a key step for large-scale video analysis. We propose a novel Temporal Unit Regression Network (TURN) model. There are two salient aspects of TURN: (1) TURN jointly predicts action proposals and refines the temporal boundaries by temporal coordinate regression; (2) fast computation is enabled by unit feature reuse: a long untrimmed video is decomposed into video units, which are reused as basic building blocks of temporal proposals. TURN outperforms the previous state-of-the-art methods under average recall (AR) by a large margin on the THUMOS-14 and ActivityNet datasets, and runs at over 880 frames per second (FPS) on a TITAN X GPU. We further apply TURN as a proposal generation stage in existing temporal action localization pipelines, where it surpasses state-of-the-art performance on THUMOS-14 and ActivityNet.
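Unit-level feature reuse can be illustrated as follows (the unit length, context size, and mean pooling are assumptions): unit features are computed once for the whole video, and any candidate clip is then described purely from the cached units, with a few context units on each side for boundary regression.

```python
import numpy as np

unit_feats = np.random.randn(200, 512)          # one cached feature per 16-frame unit

def clip_feature(start_unit, end_unit, ctx=4):
    inside = unit_feats[start_unit:end_unit].mean(0)
    left = unit_feats[max(0, start_unit - ctx):start_unit].mean(0)
    right = unit_feats[end_unit:end_unit + ctx].mean(0)
    return np.concatenate([left, inside, right])   # input to scoring and regression

print(clip_feature(40, 52).shape)               # (1536,), built only from cached units
```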
Object detection in video is crucial for many applications. Compared to still images, video provides additional cues that can help disambiguate the detection problem. Our goal in this paper is to learn discriminative models of the temporal evolution of object appearance and to use these models for object detection. To model temporal evolution, we introduce space-time tubes corresponding to temporal sequences of bounding boxes. We propose two CNN architectures for generating and classifying tubes, respectively. Our tube proposal network (TPN) first generates a large number of spatio-temporal tube proposals that maximize object recall. The Tube-CNN then implements a tube-level object detector in the video. Our method improves the state of the art on two large-scale datasets for object detection in video: HollywoodHeads and ImageNetVID. Tube models show particular advantages in difficult, dynamic scenes.
Temporal action proposal generation is an important task; akin to object proposals, temporal action proposals aim to capture "clips" or temporal intervals in videos that are likely to contain an action. Previous methods can be divided into two groups: sliding-window ranking and actionness-score grouping. Sliding windows uniformly cover all segments in a video, but their temporal boundaries are imprecise; grouping-based methods may have more precise boundaries, but they may omit some proposals when the actionness scores are of low quality. Based on the complementary characteristics of these two approaches, we propose a novel Complementary Temporal Action Proposal (CTAP) generator. Specifically, we apply a Proposal-level Actionness Trustworthiness Estimator (PATE) to the sliding-window proposals to generate probabilities indicating whether the actions can be correctly detected by actionness scores, and the windows with high scores are collected. The collected sliding windows and actionness proposals are then processed by a temporal convolutional neural network for proposal ranking and boundary adjustment. CTAP outperforms state-of-the-art methods on average recall (AR) by a large margin on the THUMOS-14 and ActivityNet 1.3 datasets. We further apply CTAP as the proposal generation method in existing action detectors and show consistent, significant improvements.
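A simplified, assumed illustration of the complementary scheme (the threshold and the toy PATE-style scorer are placeholders): keep the actionness-grouped proposals and add only those sliding windows whose trustworthiness score passes the threshold, before a ranking and boundary-adjustment stage.

```python
def complementary_proposals(grouped, windows, pate_score, tau=0.5):
    """grouped/windows: lists of (start, end) in seconds; pate_score scores a window."""
    kept_windows = [w for w in windows if pate_score(w) > tau]
    # the merged set is then ranked and boundary-adjusted by a temporal ConvNet
    return grouped + kept_windows

proposals = complementary_proposals(
    grouped=[(12.0, 30.5)],
    windows=[(0.0, 16.0), (10.0, 42.0)],
    pate_score=lambda w: 0.8 if w[0] >= 10.0 else 0.2,   # toy scorer for the demo
)
print(proposals)    # [(12.0, 30.5), (10.0, 42.0)]
```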
Our paper presents a new approach for temporal detection of human actions in long, untrimmed video sequences. We introduce Single-Stream Temporal Action Proposals (SST), a new effective and efficient deep architecture for the generation of temporal action proposals. Our network can run continuously in a single stream over very long input video sequences, without the need to divide input into short overlapping clips or temporal windows for batch processing. We demonstrate empirically that our model outperforms the state-of-the-art on the task of temporal action proposal generation, while achieving some of the fastest processing speeds in the literature. Finally, we demonstrate that using SST proposals in conjunction with existing action classifiers results in improved state-of-the-art temporal action detection performance.
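A compact sketch of the single-stream idea with assumed sizes (not the SST implementation): a recurrent encoder consumes visual features one step at a time and, at every step, emits K confidences for proposals of K different lengths ending there, so arbitrarily long videos need no overlapping windows.

```python
import torch
import torch.nn as nn

class SingleStreamProposer(nn.Module):
    def __init__(self, feat_dim=500, hidden=256, k=16):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.scores = nn.Linear(hidden, k)

    def forward(self, feats):                     # feats: (B, T, feat_dim), T can be large
        out, _ = self.gru(feats)
        return torch.sigmoid(self.scores(out))    # (B, T, k) proposal confidences

model = SingleStreamProposer()
print(model(torch.randn(1, 1000, 500)).shape)     # (1, 1000, 16)
```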