In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
translated by 谷歌翻译
Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for FOUR different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.
translated by 谷歌翻译
动作检测的任务旨在在每个动作实例中同时推论动作类别和终点的本地化。尽管Vision Transformers推动了视频理解的最新进展,但由于在长时间的视频剪辑中,设计有效的架构以进行动作检测是不平凡的。为此,我们提出了一个有效的层次时空时空金字塔变压器(STPT)进行动作检测,这是基于以下事实:变压器中早期的自我注意力层仍然集中在局部模式上。具体而言,我们建议在早期阶段使用本地窗口注意来编码丰富的局部时空时空表示,同时应用全局注意模块以捕获后期的长期时空依赖性。通过这种方式,我们的STPT可以用冗余的大大减少来编码区域和依赖性,从而在准确性和效率之间进行有希望的权衡。例如,仅使用RGB输入,提议的STPT在Thumos14上获得了53.6%的地图,超过10%的I3D+AFSD RGB模型超过10%,并且对使用其他流量的额外流动功能的表现较少,该流量具有31%的GFLOPS ,它是一个有效,有效的端到端变压器框架,用于操作检测。
translated by 谷歌翻译
视觉变压器正在成为解决计算机视觉问题的强大工具。最近的技术还证明了超出图像域之外的变压器来解决许多与视频相关的任务的功效。其中,由于其广泛的应用,人类的行动识别是从研究界受到特别关注。本文提供了对动作识别的视觉变压器技术的首次全面调查。我们朝着这个方向分析并总结了现有文献和新兴文献,同时突出了适应变形金刚以进行动作识别的流行趋势。由于其专业应用,我们将这些方法统称为``动作变压器''。我们的文献综述根据其架构,方式和预期目标为动作变压器提供了适当的分类法。在动作变压器的背景下,我们探讨了编码时空数据,降低维度降低,框架贴片和时空立方体构造以及各种表示方法的技术。我们还研究了变压器层中时空注意的优化,以处理更长的序列,通常通过减少单个注意操作中的令牌数量。此外,我们还研究了不同的网络学习策略,例如自我监督和零局学习,以及它们对基于变压器的行动识别的相关损失。这项调查还总结了在具有动作变压器重要基准的评估度量评分方面取得的进步。最后,它提供了有关该研究方向的挑战,前景和未来途径的讨论。
translated by 谷歌翻译
计算机视觉任务可以从估计突出物区域和这些对象区域之间的相互作用中受益。识别对象区域涉及利用预借鉴模型来执行对象检测,对象分割和/或对象姿势估计。但是,由于以下原因,在实践中不可行:1)预用模型的训练数据集的对象类别可能不会涵盖一般计算机视觉任务的所有对象类别,2)佩戴型模型训练数据集之间的域间隙并且目标任务的数据集可能会影响性能,3)预磨模模型中存在的偏差和方差可能泄漏到导致无意中偏置的目标模型的目标任务中。为了克服这些缺点,我们建议利用一系列视频帧捕获一组公共对象和它们之间的相互作用的公共基本原理,因此视频帧特征之间的共分割的概念可以用自动的能力装配模型专注于突出区域,以最终的方式提高潜在的任务的性能。在这方面,我们提出了一种称为“共分割激活模块”(COSAM)的通用模块,其可以被插入任何CNN,以促进基于CNN的任何CNN的概念在一系列视频帧特征中的关注。我们在三个基于视频的任务中展示Cosam的应用即1)基于视频的人Re-ID,2)视频字幕分类,并证明COSAM能够在视频帧中捕获突出区域,从而引导对于显着的性能改进以及可解释的关注图。
translated by 谷歌翻译
近年来,随着对公共安全的需求越来越多,智能监测网络的快速发展,人员重新识别(RE-ID)已成为计算机视野领域的热门研究主题之一。人员RE-ID的主要研究目标是从不同的摄像机中检索具有相同身份的人。但是,传统的人重新ID方法需要手动标记人的目标,这消耗了大量的劳动力成本。随着深度神经网络的广泛应用,出现了许多基于深入的基于学习的人物的方法。因此,本文促进研究人员了解最新的研究成果和该领域的未来趋势。首先,我们总结了对几个最近公布的人的研究重新ID调查,并补充了系统地分类基于深度学习的人的重新ID方法的最新研究方法。其次,我们提出了一种多维分类,根据度量标准和表示学习,将基于深度学习的人的重新ID方法分为四类,包括深度度量学习,本地特征学习,生成的对抗学习和序列特征学习的方法。此外,我们根据其方法和动机来细分以上四类,讨论部分子类别的优缺点。最后,我们讨论了一些挑战和可能的研究方向的人重新ID。
translated by 谷歌翻译
视频变压器在主要视频识别基准上取得了令人印象深刻的结果,但它们遭受了高计算成本。在本文中,我们呈现Stts,一个令牌选择框架,动态地在输入视频样本上调节的时间和空间尺寸的几个信息令牌。具体而言,我们将令牌选择作为一个排名问题,估计每个令牌通过轻量级选择网络的重要性,并且只有顶级分数的人将用于下游评估。在时间维度中,我们将最相关的帧保持对识别作用类别的帧,而在空间维度中,我们确定特征映射中最辨别的区域,而不会影响大多数视频变换器中以分层方式使用的空间上下文。由于令牌选择的决定是不可差异的,因此我们采用了一个扰动最大的可分辨率Top-K运算符,用于最终培训。我们对动力学-400进行广泛的实验,最近推出的视频变压器骨架MVIT。我们的框架实现了类似的结果,同时需要计算20%。我们还表明我们的方法与其他变压器架构兼容。
translated by 谷歌翻译
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced with the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey we analyze main contributions and trends of works leveraging Transformers to model video. Specifically, we delve into how videos are handled as input-level first. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
translated by 谷歌翻译
自我关注学习成对相互作用以模型远程依赖性,从而产生了对视频动作识别的巨大改进。在本文中,我们寻求更深入地了解视频中的时间建模的自我关注。我们首先表明通过扁平所有像素通过扁平化的时空信息的缠结建模是次优的,未明确捕获帧之间的时间关系。为此,我们介绍了全球暂时关注(GTA),以脱钩的方式在空间关注之上进行全球时间关注。我们在像素和语义类似地区上应用GTA,以捕获不同水平的空间粒度的时间关系。与计算特定于实例的注意矩阵的传统自我关注不同,GTA直接学习全局注意矩阵,该矩阵旨在编码遍布不同样本的时间结构。我们进一步增强了GTA的跨通道多头方式,以利用通道交互以获得更好的时间建模。对2D和3D网络的广泛实验表明,我们的方法一致地增强了时间建模,并在三个视频动作识别数据集中提供最先进的性能。
translated by 谷歌翻译
变压器在自然语言处理中的成功最近引起了计算机视觉领域的关注。由于能够学习长期依赖性,变压器已被用作广泛使用的卷积运算符的替代品。事实证明,这种替代者在许多任务中都取得了成功,其中几种最先进的方法依靠变压器来更好地学习。在计算机视觉中,3D字段还见证了使用变压器来增加3D卷积神经网络和多层感知器网络的增加。尽管许多调查都集中在视力中的变压器上,但由于与2D视觉相比,由于数据表示和处理的差异,3D视觉需要特别注意。在这项工作中,我们介绍了针对不同3D视觉任务的100多种变压器方法的系统和彻底审查,包括分类,细分,检测,完成,姿势估计等。我们在3D Vision中讨论了变形金刚的设计,该设计使其可以使用各种3D表示形式处理数据。对于每个应用程序,我们强调了基于变压器的方法的关键属性和贡献。为了评估这些方法的竞争力,我们将它们的性能与12个3D基准测试的常见非转化方法进行了比较。我们通过讨论3D视觉中变压器的不同开放方向和挑战来结束调查。除了提出的论文外,我们的目标是频繁更新最新的相关论文及其相应的实现:https://github.com/lahoud/3d-vision-transformers。
translated by 谷歌翻译
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
translated by 谷歌翻译
人类自然有效地在复杂的场景中找到突出区域。通过这种观察的动机,引入了计算机视觉中的注意力机制,目的是模仿人类视觉系统的这一方面。这种注意机制可以基于输入图像的特征被视为动态权重调整过程。注意机制在许多视觉任务中取得了巨大的成功,包括图像分类,对象检测,语义分割,视频理解,图像生成,3D视觉,多模态任务和自我监督的学习。在本调查中,我们对计算机愿景中的各种关注机制进行了全面的审查,并根据渠道注意,空间关注,暂时关注和分支注意力进行分类。相关的存储库https://github.com/menghaoguo/awesome-vision-tions致力于收集相关的工作。我们还建议了未来的注意机制研究方向。
translated by 谷歌翻译
步态识别旨在通过相机来识别一个距离的人。随着深度学习的出现,步态识别的重大进步通过使用深度学习技术在许多情况下取得了鼓舞人心的成功。然而,对视频监视的越来越多的需求引入了更多的挑战,包括在各种方差下进行良好的识别,步态序列中的运动信息建模,由于协议方差,生物量标准安全性和预防隐私而引起的不公平性能比较。本文对步态识别的深度学习进行了全面的调查。我们首先介绍了从传统算法到深层模型的步态识别的奥德赛,从而提供了对步态识别系统的整个工作流程的明确知识。然后,从深度表示和建筑的角度讨论了步态识别的深入学习,并深入摘要。具体而言,深层步态表示分为静态和动态特征,而深度体系结构包括单流和多流架构。遵循我们提出的新颖性分类法,它可能有益于提供灵感并促进对步态认识的感知。此外,我们还提供了所有基于视觉的步态数据集和性能分析的全面摘要。最后,本文讨论了一些潜在潜在前景的开放问题。
translated by 谷歌翻译
由于其在智能城市和城市监测中的潜在应用,车辆重新ID最近引起了热烈的关注。然而,它遭受了通过观察变化和照明变化引起的大型阶级变化,以及阶级相似性,特别是对于具有类似外观的不同标识。为了处理这些问题,在本文中,我们提出了一种新颖的深度网络架构,其由有意义的属性引导,包括相机视图,车辆类型和用于车辆RE-ID的颜色。特别是,我们的网络是端到端训练的,并包含由相应属性嵌入的深度特征的三个子网(即,相机视图,车辆类型和车辆颜色)。此外,为了克服不同视图的有限载体图像的缺点,我们设计了一个视图指定的生成的对抗性网络来生成多视图车辆图像。对于网络培训,我们在Veri-776数据集上注释了视图标签。请注意,只能使用ID信息直接在其他数据集上直接在其他数据集上采用预先训练的视图(以及类型和颜色)子网,这展示了我们模型的泛化。基准数据集Veri-776和车辆的广泛实验表明,拟议的方法实现了有希望的性能,并对车辆重新ID的新型最先进的性能。
translated by 谷歌翻译
有效地对视频中的空间信息进行建模对于动作识别至关重要。为了实现这一目标,最先进的方法通常采用卷积操作员和密集的相互作用模块,例如非本地块。但是,这些方法无法准确地符合视频中的各种事件。一方面,采用的卷积是有固定尺度的,因此在各种尺度的事件中挣扎。另一方面,密集的相互作用建模范式仅在动作 - 欧元零件时实现次优性能,给最终预测带来了其他噪音。在本文中,我们提出了一个统一的动作识别框架,以通过引入以下设计来研究视频内容的动态性质。首先,在提取本地提示时,我们会生成动态尺度的时空内核,以适应各种事件。其次,为了将这些线索准确地汇总为全局视频表示形式,我们建议仅通过变压器在一些选定的前景对象之间进行交互,从而产生稀疏的范式。我们将提出的框架称为事件自适应网络(EAN),因为这两个关键设计都适应输入视频内容。为了利用本地细分市场内的短期运动,我们提出了一种新颖有效的潜在运动代码(LMC)模块,进一步改善了框架的性能。在几个大规模视频数据集上进行了广泛的实验,例如,某种东西,动力学和潜水48,验证了我们的模型是否在低拖鞋上实现了最先进或竞争性的表演。代码可在:https://github.com/tianyuan168326/ean-pytorch中找到。
translated by 谷歌翻译
我们呈现了基于纯变压器的视频分类模型,在图像分类中最近的近期成功进行了借鉴。我们的模型从输入视频中提取了时空令牌,然后由一系列变压器层编码。为了处理视频中遇到的令牌的长序列,我们提出了我们模型的几种有效的变体,它们将输入的空间和时间维构建。虽然已知基于变换器的模型只有在可用的大型训练数据集时才有效,但我们展示了我们如何在训练期间有效地规范模型,并利用预先训练的图像模型能够在相对小的数据集上训练。我们进行彻底的消融研究,并在包括动力学400和600,史诗厨房,东西的多个视频分类基准上实现最先进的结果,其中 - 基于深度3D卷积网络的现有方法表现出优先的方法。为了促进进一步的研究,我们在https://github.com/google-research/scenic/tree/main/scenic/projects/vivit发布代码
translated by 谷歌翻译
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames. Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and same backgrounds. In this paper, we propose a Disentanglement and Switching and Aggregation Network (DSANet), which segregates the features representing identity and features based on camera characteristics, and pays more attention to ID information. We also introduce an auxiliary task that utilizes a new pair of features created through switching and aggregation to increase the network's capability for various camera scenarios. Furthermore, we devise a Target Localization Module (TLM) that extracts robust features against a change in the position of the target according to the frame flow and a Frame Weight Generation (FWG) that reflects temporal information in the final representation. Various loss functions for disentanglement learning are designed so that each component of the network can cooperate while satisfactorily performing its own role. Quantitative and qualitative results from extensive experiments demonstrate the superiority of DSANet over state-of-the-art methods on three benchmark datasets.
translated by 谷歌翻译
在本文中,我们基于任何卷积神经网络中中间注意图的弱监督生成机制,并更加直接地披露了注意模块的有效性,以充分利用其潜力。鉴于现有的神经网络配备了任意注意模块,我们介绍了一个元评论家网络,以评估主网络中注意力图的质量。由于我们设计的奖励的离散性,提出的学习方法是在强化学习环境中安排的,在此设置中,注意力参与者和经常性的批评家交替优化,以提供临时注意力表示的即时批评和修订,因此,由于深度强化的注意力学习而引起了人们的关注。 (Dreal)。它可以普遍应用于具有不同类型的注意模块的网络体系结构,并通过最大程度地提高每个单独注意模块产生的最终识别性能的相对增益来促进其表现能力,如类别和实例识别基准的广泛实验所证明的那样。
translated by 谷歌翻译
自动手术场景细分是促进现代手术剧院认知智能的基础。以前的作品依赖于常规的聚合模块(例如扩张的卷积,卷积LSTM),仅利用局部环境。在本文中,我们提出了一个新颖的框架STSWINCL,该框架通过逐步捕获全球环境来探讨互补的视频内和访问间关系以提高细分性能。我们首先开发了层次结构变压器,以捕获视频内关系,其中包括来自邻居像素和以前的帧的富裕空间和时间提示。提出了一个联合时空窗口移动方案,以有效地将这两个线索聚集到每个像素嵌入中。然后,我们通过像素到像素对比度学习探索视频间的关系,该学习很好地结构了整体嵌入空间。开发了一个多源对比度训练目标,可以将视频中的像素嵌入和基础指导分组,这对于学习整个数据的全球属性至关重要。我们在两个公共外科视频基准测试中广泛验证了我们的方法,包括Endovis18 Challenge和Cadis数据集。实验结果证明了我们的方法的有希望的性能,这始终超过了先前的最新方法。代码可在https://github.com/yuemingjin/stswincl上找到。
translated by 谷歌翻译
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow model, relation networks. Besides, benefited from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal transformer consists of two components: Temporal Query Encoder (TQE) to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to obtain current frame detection results. These designs boost the strong baseline deformable DETR by a significant margin (2 %-4 % mAP) on the ImageNet VID dataset. TransVOD yields comparable performances on the benchmark of ImageNet VID. Then, we present two improved versions of TransVOD including TransVOD++ and TransVOD Lite. The former fuses object-level information into object query via dynamic convolution while the latter models the entire video clips as the output to speed up the inference time. We give detailed analysis of all three models in the experiment part. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0 % mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7 % mAP while running at around 30 FPS on a single V100 GPU device. Code and models will be available for further research.
translated by 谷歌翻译