Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both the long-term dynamics and the short-term changes of a video. Unfortunately, in most existing methods the computational complexity grows linearly or quadratically with the length of the dynamics being considered. This issue is particularly pronounced in transformer-based architectures. To address it, we reformulate the cross-attention in a video transformer through the lens of kernels and apply two kinds of temporal smoothing kernel: a box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame and only requires a constant-time update for each frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes arbitrarily long inputs with constant caching and computing overhead. Specifically, it runs $6\times$ faster than an equivalent sliding-window based transformer with 2,048 frames in a streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchens-100, two standard online action detection and action anticipation datasets. A real-time version of TeSTra outperforms all prior methods on the THUMOS'14 dataset.
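To make the constant-time update concrete, here is a minimal numpy sketch of the recursive form that a Laplace (exponential-decay) temporal smoothing kernel admits; the function names, the decay value, and the plain feature averaging are illustrative assumptions and do not reproduce TeSTra's full kernelized cross-attention.

```python
import numpy as np

def init_state(dim):
    """Running numerator/denominator of the exponentially weighted sum."""
    return np.zeros(dim), 0.0

def laplace_stream_update(state, frame_feature, decay=0.95):
    """O(1) per-frame update: old contributions are discounted by `decay`
    (the Laplace kernel factor per step), then the new frame is added."""
    num, den = state
    num = decay * num + frame_feature
    den = decay * den + 1.0
    smoothed = num / den          # kernel-weighted average over all past frames
    return (num, den), smoothed

# Toy streaming loop over 2,048 random frame features of dimension 256.
rng = np.random.default_rng(0)
state = init_state(256)
for t in range(2048):
    state, smoothed = laplace_stream_update(state, rng.standard_normal(256))
print(smoothed.shape)  # (256,)
```

Because only the running numerator and denominator are carried over, the per-frame cost and the cache size stay constant no matter how long the stream gets.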
While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models are available at https://github.com/facebookresearch/memvit.
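As a rough illustration of the "process online and cache memory" idea, the following numpy sketch keeps a bounded queue of compressed features from previous clips and lets the current clip attend over both the cache and itself; the pooling-based compression, the cache size, and the single-head attention are assumptions, not MeMViT's actual multiscale design.

```python
from collections import deque
import numpy as np

def attention(q, k, v):
    """Plain single-head scaled dot-product attention."""
    w = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def compress(tokens, factor=4):
    """Stand-in for learned memory compression: average-pool groups of tokens."""
    n = (len(tokens) // factor) * factor
    return tokens[:n].reshape(-1, factor, tokens.shape[-1]).mean(axis=1)

memory = deque(maxlen=8)            # cached (compressed) clips = long temporal support
rng = np.random.default_rng(0)
for clip_idx in range(20):          # online iteration over consecutive clips
    clip = rng.standard_normal((16, 64))            # 16 tokens x 64 dims per clip
    context = np.concatenate([*memory, clip]) if memory else clip
    out = attention(clip, context, context)         # current clip attends to past + present
    memory.append(compress(clip))                   # cache a cheap summary, not raw frames
print(out.shape)  # (16, 64)
```

The marginal cost comes from the cached summaries being far smaller than the raw clips they replace.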
Online action detection aims at accurate action prediction for the current frame based on long historical observations, while also demanding real-time inference on online streaming videos. In this paper, we advocate a novel and efficient principle for online action detection: within a window, only the latest and the oldest historical representations are updated, while the intermediate ones that have already been computed are reused. Based on this principle, we introduce a window-based cascade transformer with a circular history queue, which performs multi-stage attention and cascade refinement on each window. We also explore the association between online action detection and its offline counterpart, action segmentation, as an auxiliary task. We find that this extra supervision helps cluster the history discriminatively and acts as feature augmentation for better training of the classifier and the cascade refinement. Our proposed method achieves state-of-the-art performance on three challenging datasets: THUMOS'14, TVSeries, and HDD. Code will be available after acceptance.
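The reuse principle can be pictured with a small ring buffer: each step, only the slot receiving the newest frame is recomputed, while everything in between is kept verbatim. The class below is an illustrative sketch under that reading (its name, the toy encoder, and the buffer size are assumptions), not the paper's cascade transformer.

```python
import numpy as np

class CircularHistoryQueue:
    """Fixed-size ring buffer of per-frame representations. Each step only the
    slot holding the newest frame is recomputed; every other cached entry is
    reused as-is (the paper additionally refreshes the oldest entry)."""

    def __init__(self, window, dim, encode):
        self.buf = np.zeros((window, dim))
        self.head = 0                         # slot to overwrite next (the oldest)
        self.encode = encode

    def push(self, frame_feature):
        self.buf[self.head] = self.encode(frame_feature)   # O(1) work per new frame
        self.head = (self.head + 1) % len(self.buf)
        return self.buf                                    # window fed to the attention stages

rng = np.random.default_rng(0)
queue = CircularHistoryQueue(window=64, dim=32, encode=np.tanh)
for _ in range(200):
    window = queue.push(rng.standard_normal(32))
print(window.shape)  # (64, 32)
```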
Online action detection is the task of predicting an action as soon as it happens in a streaming video. A major challenge is that the model has no access to the future and must rely solely on the history, i.e., the frames observed so far, to make predictions. It is therefore important to accentuate the parts of the history that are more informative for predicting the current frame. We present GateHUB, Gated History Unit with Background Suppression, which comprises a novel position-guided gated cross-attention mechanism to enhance or suppress parts of the history according to how informative they are for current frame prediction. GateHUB further proposes Future-augmented History (FaH) to make history features more informative by using subsequently observed frames. In a single unified framework, GateHUB integrates the transformer's ability for long-range temporal modeling with the recurrent model's capacity to selectively encode relevant information. GateHUB also introduces a background suppression objective to further mitigate false-positive background frames that closely resemble action frames. Extensive validation on three benchmark datasets, THUMOS, TVSeries, and HDD, demonstrates that GateHUB significantly outperforms all existing methods and is also more efficient than the best existing work. Moreover, a flow-free version of GateHUB achieves higher or comparable accuracy at a 2.8x higher frame rate than all existing methods, which require both RGB and optical flow information for prediction.
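A minimal sketch of what a position-guided gate over cross-attention could look like: a sigmoid gate computed from positional encodings scales each history step before the attended history reaches the current-frame query. The gate parameterization, shapes, and single-head attention here are assumptions, not GateHUB's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(query, history, pos_enc, w_gate, b_gate):
    """query: (1, d) current frame; history/pos_enc: (T, d) past frames.
    A position-driven sigmoid gate enhances or suppresses each history step."""
    gate = 1.0 / (1.0 + np.exp(-(pos_enc @ w_gate + b_gate)))        # (T, 1) in [0, 1]
    scores = softmax(query @ history.T / np.sqrt(query.shape[-1]))   # (1, T)
    return (scores * gate.T) @ history                               # gated aggregation of history

rng = np.random.default_rng(0)
T, d = 128, 64
out = gated_cross_attention(
    query=rng.standard_normal((1, d)),
    history=rng.standard_normal((T, d)),
    pos_enc=rng.standard_normal((T, d)),
    w_gate=rng.standard_normal((d, 1)),
    b_gate=0.0,
)
print(out.shape)  # (1, 64)
```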
Anticipating future actions based on video observations is an important task in video understanding, which would be useful for some precautionary systems that require response time to react before an event occurs. Since the input in action anticipation is only pre-action frames, models do not have enough information about the target action; moreover, similar pre-action frames may lead to different futures. Consequently, any solution using existing action recognition models can only be suboptimal. Recently, researchers have proposed using a longer video context to remedy the insufficient information in pre-action intervals, as well as the self-attention to query past relevant moments to address the anticipation problem. However, the indirect use of video input features as the query might be inefficient, as it only serves as the proxy to the anticipation goal. To this end, we propose an inductive attention model, which transparently uses the prior prediction as the query to derive the anticipation result by induction from past experience. Our method naturally considers the uncertainty of multiple futures via the many-to-many association. On large-scale egocentric video datasets, our model not only shows consistently better performance than the state of the art using the same backbone and is competitive with methods that employ a stronger backbone, but also achieves superior efficiency with fewer model parameters.
We present TubeR: a simple solution for spatio-temporal video action detection. Unlike existing methods that rely on an offline actor detector or hand-designed actor-position hypotheses, we propose to directly detect an action tubelet in a video by performing action localization and recognition simultaneously from a single representation. TubeR learns a set of tubelet queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared with using actor-position hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head that exploits short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal extent of an action. TubeR directly produces action tubes of variable length and maintains good results even for long video clips. TubeR outperforms the previous state-of-the-art on the commonly used action detection datasets AVA, UCF101-24, and JHMDB51-21.
Most modern approaches to temporal action localization divide the problem into two parts: (i) short-term feature extraction and (ii) long-range temporal boundary localization. Because of the high GPU memory cost of processing long untrimmed videos, many methods sacrifice the representational power of the short-term feature extractor by freezing the backbone or using a small spatial video resolution. This issue becomes even worse with recent video transformer models, many of which have quadratic memory complexity. To address these problems, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization transformer with Long-term memory. Our long-term memory mechanism eliminates the need to process hundreds of redundant video frames during each training iteration, thus significantly reducing GPU memory consumption and training time. These efficiency savings allow us (i) to use a powerful video-transformer feature extractor without freezing the backbone or reducing the spatial video resolution, while (ii) also maintaining long-range temporal boundary localization capability. With only RGB frames as input and no external action recognition classifier, TALLFormer outperforms the previous state-of-the-art by a large margin, achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The code is publicly available at https://github.com/klauscc/tallformer.
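The efficiency argument can be illustrated with a cache-plus-subsampling loop: per training iteration, only a small random subset of clip features is recomputed by the expensive short-term encoder, while the rest come from a long-term memory of previously computed features. Everything below (the sampling ratio, the toy encoder, the flat array cache) is an assumption for illustration, not TALLFormer's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
num_clips, dim, sample_ratio = 128, 64, 0.125   # re-encode only ~12.5% of clips per step

def short_term_encoder(clips):
    """Stand-in for the expensive (trainable) video-transformer feature extractor."""
    return np.tanh(clips)

video_clips = rng.standard_normal((num_clips, dim))
long_term_memory = short_term_encoder(video_clips)       # features cached from earlier steps

for iteration in range(3):
    k = int(num_clips * sample_ratio)
    fresh_idx = rng.choice(num_clips, size=k, replace=False)
    features = long_term_memory.copy()                   # most clips: cheap cached features
    features[fresh_idx] = short_term_encoder(video_clips[fresh_idx])   # few clips: recomputed
    long_term_memory[fresh_idx] = features[fresh_idx]    # refresh the memory with new values
    # `features` spans the full temporal extent and would feed the boundary-localization head
print(features.shape)  # (128, 64)
```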
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We propose a long-term feature bank (supportive information extracted over the entire span of a video) to augment state-of-the-art video models that otherwise would only view short clips of 2-5 seconds. Our experiments demonstrate that augmenting 3D convolutional networks with a long-term feature bank yields state-of-the-art results on three challenging video datasets: AVA, EPIC-Kitchens, and Charades. Code is available at https://github.com/facebookresearch/video-long-term-feature-banks
Anticipating future events is an essential capability for intelligent systems and embodied AI. However, compared with traditional recognition tasks, the uncertainty of the future and the required reasoning ability make the anticipation task very challenging and far from solved. Whereas previous methods usually care more about model architecture design and pay little attention to how to train an anticipation model with a proper learning policy, in this work we propose a novel training scheme called Dynamic Context Removal (DCR), which dynamically schedules the visibility of the observed future during learning. It follows a human-like curriculum learning process: the event context is gradually removed to increase the anticipation difficulty until the final anticipation target is met. Our learning scheme is plug-and-play and easy to integrate with any reasoning model, including transformers and LSTMs, with advantages in both effectiveness and efficiency. In extensive experiments, the proposed method achieves state-of-the-art results on four widely used benchmarks. Our code and models will be released publicly at https://github.com/allenxuuu/dcr.
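A minimal sketch of a dynamic-context-removal style curriculum: a schedule decides how many of the frames nearest the anticipated action remain visible at each epoch, shrinking toward the strict anticipation setting. The linear schedule, the masking convention, and all names below are illustrative assumptions, not DCR's exact policy.

```python
import numpy as np

def visible_extra_frames(epoch, total_epochs, max_extra):
    """Curriculum: start with all extra context frames visible, end with none."""
    remaining = 1.0 - epoch / max(total_epochs - 1, 1)
    return int(round(max_extra * remaining))

def remove_context(frames, extra_visible, max_extra):
    """frames: (T, d), ordered so the last `max_extra` frames are removable context.
    Zero out the context that the current curriculum stage no longer reveals."""
    masked = frames.copy()
    hidden = max_extra - extra_visible
    if hidden > 0:
        masked[len(frames) - hidden:] = 0.0
    return masked

MAX_EXTRA, TOTAL_EPOCHS = 8, 20
rng = np.random.default_rng(0)
clip = rng.standard_normal((32, 64))                   # 24 observed frames + 8 removable context
for epoch in range(TOTAL_EPOCHS):
    visible = visible_extra_frames(epoch, TOTAL_EPOCHS, MAX_EXTRA)
    batch = remove_context(clip, visible, MAX_EXTRA)   # train the anticipation model on `batch`
print(visible)  # 0 -> the final stage matches the strict anticipation setting
```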
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
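To make the bottleneck idea concrete, here is a small numpy sketch of one fusion layer: each modality may attend only to its own tokens plus a handful of shared bottleneck latents, and the bottlenecks are then averaged across modalities. The single-head attention and the averaging rule are simplifying assumptions rather than the exact MBT layer.

```python
import numpy as np

def attention(q, kv):
    w = q @ kv.T / np.sqrt(q.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ kv

def bottleneck_fusion_layer(modalities, bottlenecks):
    """modalities: list of (N_m, d) token arrays; bottlenecks: (B, d) shared latents.
    Cross-modal information can only flow through the few bottleneck tokens."""
    new_tokens, new_bottlenecks = [], []
    for tokens in modalities:
        joint = np.concatenate([tokens, bottlenecks])      # own tokens + shared bottlenecks
        updated = attention(joint, joint)                  # attention restricted to this pair
        new_tokens.append(updated[: len(tokens)])
        new_bottlenecks.append(updated[len(tokens):])
    return new_tokens, np.mean(new_bottlenecks, axis=0)    # condense and share across modalities

rng = np.random.default_rng(0)
video, audio = rng.standard_normal((196, 64)), rng.standard_normal((98, 64))
latents = rng.standard_normal((4, 64))                     # B = 4 bottleneck tokens
(video, audio), latents = bottleneck_fusion_layer([video, audio], latents)
print(video.shape, audio.shape, latents.shape)  # (196, 64) (98, 64) (4, 64)
```

Because the two token sets never attend to each other directly, the cost of cross-modal interaction scales with the (small) number of bottleneck latents rather than with the product of the modality lengths.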
Spatiotemporal and motion features are two complementary and crucial information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and another flow stream to learn motion features. In this work, we aim to efficiently encode these two features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to present the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network by introducing very limited extra computation cost. Extensive experiments demonstrate that the proposed STM network outperforms the state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51) with the help of encoding spatiotemporal and motion features together. * The work was done during an internship at SenseTime.
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. In order to handle the long sequences of tokens encountered in video, we propose several efficient variants of our model that factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to be effective only when large training datasets are available, we show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets. We conduct thorough ablation studies and achieve state-of-the-art results on multiple video classification benchmarks, including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic/tree/main/scenic/projects/vivit
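One such efficient variant factorises attention into a spatial step within each frame followed by a temporal step across frames at each spatial location. The sketch below shows just that factorisation on a (T, S, d) token grid, with a generic single-head attention standing in for the full transformer blocks; it is an illustration of the factorisation, not the released model.

```python
import numpy as np

def attention(x):
    """Single-head self-attention over the second-to-last axis of a (..., N, d) array."""
    w = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def factorised_spacetime_attention(tokens):
    """tokens: (T, S, d) spatio-temporal tokens (T frames, S patches per frame).
    Spatial attention within each frame, then temporal attention per patch location:
    cost O(T*S^2 + S*T^2) instead of O((T*S)^2) for joint space-time attention."""
    spatial = attention(tokens)                         # attend over S for each of the T frames
    temporal = attention(np.swapaxes(spatial, 0, 1))    # attend over T for each of the S patches
    return np.swapaxes(temporal, 0, 1)

rng = np.random.default_rng(0)
out = factorised_spacetime_attention(rng.standard_normal((16, 196, 64)))
print(out.shape)  # (16, 196, 64)
```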
Transformer trackers have achieved impressive advancements recently, in which the attention mechanism plays an important role. However, the independent correlation computation in the attention mechanism can result in noisy and ambiguous attention weights, which inhibits further performance improvement. To address this issue, we propose an attention in attention (AiA) module, which enhances appropriate correlations and suppresses erroneous ones by seeking consensus among all correlation vectors. Our AiA module can be readily applied to both self-attention blocks and cross-attention blocks to facilitate feature aggregation and information propagation for visual tracking. Moreover, we propose a streamlined transformer tracking framework, dubbed AiATrack, which makes full use of temporal references by introducing efficient feature reuse and target-background embeddings. Experiments show that our tracker achieves state-of-the-art performance on six tracking benchmarks while running at real-time speed.
Transformers in their common form are inherently limited to operate on whole token sequences rather than on one token at a time. Consequently, their use during online inference on time-series data entails considerable redundancy due to the overlap in successive token sequences. In this work, we propose novel formulations of the Scaled Dot-Product Attention, which enable Transformers to perform efficient online token-by-token inference on a continual input stream. Importantly, our modifications are purely to the order of computations, while the outputs and learned weights are identical to those of the original Transformer Encoder. We validate our Continual Transformer Encoder with experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable results: Our Continual one- and two-block architectures reduce the floating point operations per prediction by up to 63x and 2.6x, respectively, while retaining predictive performance.
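The following sketch only illustrates the streaming interface the paper targets: one token in, one prediction out, with the projected keys and values of earlier tokens kept in a FIFO cache so they are not re-projected for every overlapping window. It does not reproduce the paper's exact reformulated Scaled Dot-Product Attention, whose outputs match the standard Transformer Encoder; the class name and window size are assumptions.

```python
from collections import deque
import numpy as np

class ContinualAttention:
    """Token-by-token single-output attention over a sliding window of n tokens.
    Key/value projections of past tokens are cached instead of being recomputed."""

    def __init__(self, dim, window, seed=0):
        rng = np.random.default_rng(seed)
        self.wq, self.wk, self.wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim)
                                     for _ in range(3))
        self.keys, self.values = deque(maxlen=window), deque(maxlen=window)

    def step(self, token):
        self.keys.append(token @ self.wk)      # project the new token once, then cache it
        self.values.append(token @ self.wv)
        q = token @ self.wq
        k, v = np.stack(self.keys), np.stack(self.values)
        w = np.exp(q @ k.T / np.sqrt(len(q)))
        return (w / w.sum()) @ v               # output feature for the newest token only

rng = np.random.default_rng(1)
attn = ContinualAttention(dim=64, window=32)
for _ in range(100):
    out = attn.step(rng.standard_normal(64))
print(out.shape)  # (64,)
```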
We introduce the task of spotting temporally precise, fine-grained events in video (detecting the exact moment in time at which an event occurs). Precise spotting requires models to reason globally over the full time scale of an action and locally to identify subtle frame-to-frame appearance and motion differences that distinguish events during these actions. Surprisingly, we find that top-performing solutions to prior video understanding tasks, such as action detection and segmentation, do not simultaneously satisfy both requirements. In response, we propose E2E-Spot, a compact end-to-end model that performs well on the precise spotting task and can be trained quickly on a single GPU. We demonstrate that E2E-Spot significantly outperforms recent baselines adapted from the video action detection, segmentation, and spotting literature to the precise spotting task. Finally, we contribute new annotations and splits to several fine-grained sports action datasets to make them suitable for future work on precise spotting.
This work aims to advance temporal action detection (TAD) using an encoder-decoder framework with action queries, similar to DETR, which has shown great success in object detection. However, the framework suffers from several problems if directly applied to TAD: insufficient exploration of the relations between queries in the decoder, inadequate classification training due to the limited number of training samples, and unreliable classification scores at inference. To this end, we first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations. Moreover, we propose two losses to facilitate and stabilize the training of action classification. Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries. The proposed method, named ReAct, achieves state-of-the-art performance on THUMOS14 with much lower computational cost than previous methods. In addition, extensive ablation studies are conducted to verify the effectiveness of each proposed component. The code is available at https://github.com/sssste/React.
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically uses only one type of feature memory. For videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, we develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact long-term memory. Crucially, we develop a memory potentiation algorithm that routinely consolidates actively used working-memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction. Combined with a new memory reading mechanism, XMem greatly exceeds state-of-the-art performance on long-video datasets while being on par with state-of-the-art methods (which do not work on long videos) on short-video datasets. Code is available at https://hkchengrex.github.io/xmem
Vision transformers are emerging as a powerful tool to solve computer vision problems. Recent techniques have also proven the efficacy of transformers beyond the image domain in solving numerous video-related tasks. Among those, human action recognition is receiving special attention from the research community due to its widespread applications. This article provides the first comprehensive survey of vision transformer techniques for action recognition. We analyze and summarize the existing and emerging literature in this direction while highlighting the popular trends in adapting transformers for action recognition. Due to their specialized application, we collectively refer to these methods as "action transformers". Our literature review provides a suitable taxonomy for action transformers based on their architecture, modality, and intended objective. Within the context of action transformers, we explore techniques for encoding spatio-temporal data, dimensionality reduction, frame patch and spatio-temporal cube construction, and various representation methods. We also investigate the optimization of spatio-temporal attention in transformer layers to handle longer sequences, typically by reducing the number of tokens in a single attention operation. Moreover, we investigate different network learning strategies, such as self-supervised and zero-shot learning, along with their associated losses for transformer-based action recognition. This survey also summarizes progress in evaluation metric scores on important benchmarks for action transformers. Finally, it provides a discussion of the challenges, outlook, and future avenues of this research direction.
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow model, relation networks. Besides, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal transformer consists of two components: Temporal Query Encoder (TQE) to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to obtain current frame detection results. These designs boost the strong baseline deformable DETR by a significant margin (2%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performances on the benchmark of ImageNet VID. Then, we present two improved versions of TransVOD including TransVOD++ and TransVOD Lite. The former fuses object-level information into the object query via dynamic convolution while the latter models the entire video clip as the output to speed up the inference time. We give detailed analysis of all three models in the experiment part. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single V100 GPU device. Code and models will be available for further research.
Online Temporal Action Localization (On-TAL) aims to immediately provide action instances from untrimmed streaming videos. The model is not allowed to utilize future frames and any processing techniques to modify past predictions, making On-TAL much more challenging. In this paper, we propose a simple yet effective framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture in an end-to-end manner. Specifically, the model takes the current frame feature as a query and a set of past context information as keys and values of the Transformer. Different from the prior work that uses a set of outputs of the model as past contexts, we leverage the past visual context and the learnable context embedding for the current query. Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods, achieving a new state-of-the-art On-TAL performance. In addition, the evaluation for Online Detection of Action Start (ODAS) demonstrates the effectiveness and robustness of our method in the online setting. The code is available at https://github.com/TuanTNG/SimOn
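A rough sketch of the described query/key-value split: the current frame feature acts as the query while a queue of past visual context plus a learnable context embedding supplies the keys and values. The queue length, the way the learnable embedding is appended, and the single-head attention are assumptions for illustration, not SimOn's implementation.

```python
from collections import deque
import numpy as np

def cross_attention(query, context):
    w = query @ context.T / np.sqrt(query.shape[-1])
    w = np.exp(w - w.max())
    return (w / w.sum()) @ context

rng = np.random.default_rng(0)
dim, history_len = 64, 32
past_context = deque(maxlen=history_len)         # past visual context (keys and values)
learned_context = rng.standard_normal((1, dim))  # stand-in for the learnable context embedding

for t in range(200):                             # streaming, frame by frame
    frame = rng.standard_normal(dim)             # query: the current frame feature
    if past_context:
        keys_values = np.concatenate([np.stack(past_context), learned_context])
    else:
        keys_values = learned_context
    action_feature = cross_attention(frame, keys_values)   # feeds the per-frame action head
    past_context.append(frame)                   # past frames are cached, never re-predicted
print(action_feature.shape)  # (64,)
```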