Smart Paper Notes

Video Vision Transformers for Violence Detection

Sanskar Singh , Shivaibhav Dewangan , Ghanta Sai Krishna , Vandit Tyagi , Sainath Reddy

Categories: Computer Vision | Artificial Intelligence

2022-09-08

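Law enforcement and city safety are significantly affected by how well violent incidents are detected in surveillance systems. Although modern (smart) cameras are widely available and affordable, such technology is ineffective in most cases. In addition, personnel monitoring CCTV recordings often react late, which can lead to catastrophe for people and property. Automated violence detection that enables swift action is therefore crucial. The proposed solution uses a novel end-to-end deep-learning video vision transformer (ViViT) that can proficiently discern fights, hostile movements, and violent events in video sequences. The study uses a data augmentation strategy to overcome the weaker inductive bias of vision transformers when they are trained on smaller datasets. The evaluated results can then be sent to the relevant local authorities, and the captured video can be analyzed. Compared with state-of-the-art (SOTA) approaches, the proposed method achieves promising performance on several challenging benchmark datasets.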

Vision Transformers and YoloV5 based Driver Drowsiness Detection Framework

Ghanta Sai Krishna , Kundrapu Supriya , Jai Vardhan , Mallikharjuna Rao K

Categories: Computer Vision

2022-09-03

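Human drivers have distinct driving techniques, knowledge, and emotions owing to their unique driving traits. Driver drowsiness has long been a serious problem that endangers road safety, so an effective drowsiness detection algorithm is needed to prevent road accidents. Various research efforts have addressed the detection of anomalous driver behaviour by examining the driver's frontal face and the vehicle dynamics with computer vision techniques. Conventional methods, however, cannot capture complex driver behaviour features. With the advent of deep learning architectures, a substantial amount of research has also analyzed and recognized driver drowsiness using neural network algorithms. This paper introduces a novel framework based on vision transformers and YoloV5 architectures for driver drowsiness recognition. A custom pre-trained YoloV5 architecture is proposed for face extraction, with the aim of extracting the region of interest (ROI). Owing to the limitations of previous architectures, the paper introduces vision transformers for binary image classification, trained and validated on the public dataset UTA-RLDD. The model achieved training and validation accuracies of 96.2% and 97.4%, respectively. For further evaluation, the proposed framework was tested on a custom dataset of 39 participants under various lighting conditions and achieved 95.5% accuracy. The experiments reveal the significant potential of the framework for practical applications in smart transportation systems.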

ViViT: A Video Vision Transformer

Anurag Arnab , Mostafa Dehghani , Georg Heigold , Chen Sun , Mario Lučić , Cordelia Schmid

Categories: Computer Vision

2021-03-29

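We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification. Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers. To handle the long sequences of tokens encountered in video, we propose several efficient variants of our model that factorise the spatial and temporal dimensions of the input. Although transformer-based models are known to be effective only when large training datasets are available, we show how to regularise the model effectively during training and leverage pretrained image models so that we can train on comparatively small datasets. We conduct thorough ablation studies and achieve state-of-the-art results on multiple video classification benchmarks, including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time, outperforming prior methods based on deep 3D convolutional networks. To facilitate further research, we release code at https://github.com/google-research/scenic/tree/main/scenic/projects/vivit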
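
The extraction of spatio-temporal tokens mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `extract_tubelets`, the tubelet size (2 x 16 x 16) and the toy video shape are assumptions chosen for illustration, and the learned linear projection that would normally follow is omitted.

```python
import numpy as np

def extract_tubelets(video, t=2, h=16, w=16):
    """Split a video of shape (T, H, W, C) into non-overlapping tubelets.

    Returns an array of shape (num_tokens, t*h*w*C): one flattened
    tubelet per row. Tubelet sizes here are illustrative assumptions.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % h == 0 and W % w == 0, "dims must divide evenly"
    # Separate each axis into (grid index, offset inside the tubelet).
    x = video.reshape(T // t, t, H // h, h, W // w, w, C)
    # Move the three grid indices to the front, tubelet contents to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t * h * w * C)

# Toy input: 8 frames of 64x64 RGB (hypothetical sizes).
video = np.random.default_rng(0).random((8, 64, 64, 3), dtype=np.float32)
tokens = extract_tubelets(video)
print(tokens.shape)  # 4 * 4 * 4 tokens, each of length 2 * 16 * 16 * 3
```

Each row would then be linearly projected to the model width before entering the transformer layers; the token count grows with both clip length and resolution, which is what motivates the factorised variants.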

Vision Transformers for Action Recognition: A Survey

Anwaar Ulhaq , Naveed Akhtar , Ganna Pogrebna , Ajmal Mian

Categories: Computer Vision | Artificial Intelligence

2022-09-13

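Vision transformers are emerging as a powerful tool for solving computer vision problems. Recent techniques have also proven the efficacy of transformers beyond the image domain for numerous video-related tasks. Among these, human action recognition is receiving special attention from the research community because of its widespread applications. This article provides the first comprehensive survey of vision transformer techniques for action recognition. We analyze and summarize the existing and emerging literature in this direction while highlighting the popular trends in adapting transformers for action recognition. Because of their specialized application, we collectively refer to these methods as "action transformers". Our literature review provides suitable taxonomies for action transformers based on their architecture, modality, and intended objective. Within the context of action transformers, we explore techniques for encoding spatio-temporal data, dimensionality reduction, frame patch and spatio-temporal cube construction, and various representation methods. We also investigate the optimization of spatio-temporal attention in transformer layers to handle longer sequences, typically by reducing the number of tokens in a single attention operation. Moreover, we investigate different network learning strategies, such as self-supervised and zero-shot learning, along with their associated losses for transformer-based action recognition. The survey also summarizes the progress made on evaluation metric scores on important benchmarks with action transformers. Finally, it discusses the challenges, outlook, and future avenues for this research direction.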

Short and Long Range Relation Based Spatio-Temporal Transformer for Micro-Expression Recognition

Liangfei Zhang , Xiaopeng Hong , Ognjen Arandjelovic , Guoying Zhao

Categories: Computer Vision

2021-12-10

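Being unconscious and spontaneous, micro-expressions are useful for inferring a person's true emotions even when an attempt is made to conceal them. Because of their short duration and low intensity, recognizing micro-expressions is a difficult task in affective computing. Early work based on handcrafted spatio-temporal features has recently been superseded by different deep learning approaches, which now compete for state-of-the-art performance. Nevertheless, capturing both local and global spatio-temporal patterns remains challenging. To this end, we propose a novel spatio-temporal transformer architecture which is, to the best of our knowledge, the first purely transformer-based approach (i.e. one that uses no convolutional network) for micro-expression recognition. The architecture comprises a spatial encoder that learns spatial patterns, a temporal aggregator for temporal dimension analysis, and a classification head. A comprehensive evaluation on three widely used spontaneous micro-expression datasets, namely SMIC-HS, CASME II and SAMM, shows that the proposed approach consistently outperforms the state of the art and is the first framework in the published micro-expression recognition literature to achieve an unweighted F1-score greater than 0.9 on any of these datasets.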
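
For readers unfamiliar with the metric, the unweighted F1-score cited above is commonly computed as the plain average of per-class F1 scores, so that rare expression classes count as much as frequent ones. The sketch below follows that common definition; it is an assumption that the paper uses exactly this form, and the function name and toy labels are hypothetical.

```python
def unweighted_f1(y_true, y_pred):
    """Average of per-class F1 = 2*TP / (2*TP + FP + FN), each class weighted equally."""
    classes = sorted(set(y_true))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels (hypothetical): three classes with unequal support.
y_true = ["neg", "neg", "neg", "pos", "sur"]
y_pred = ["neg", "neg", "pos", "pos", "sur"]
uf1 = unweighted_f1(y_true, y_pred)
print(round(uf1, 4))
```

In the toy example the per-class scores are 0.8, 2/3 and 1.0, and the metric is their simple mean.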

In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation

Bolin Lai , Miao Liu , Fiona Ryan , James Rehg

Categories: Computer Vision

2022-08-08

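In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing the gaze fixation in egocentric video frames. To this end, we design the transformer encoder to embed the global context as one additional visual token, and we further propose a novel Global-Local Correlation (GLC) module to explicitly model the correlation between the global token and each local token. We validate our model on two egocentric video datasets, EGTEA Gaze+ and Ego4D. Our detailed ablation studies demonstrate the benefits of our method. In addition, our approach surpasses the previous state of the art by a large margin. We also provide additional visualizations to support our claim that global-local correlation is a key representation for predicting gaze fixation from egocentric videos. More details can be found on our website (https://bolinlai.github.io/glc-egogazeest).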

Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification

Ziyi Tang , Ruimao Zhang , Zhanglin Peng , Jinrui Chen , Liang Lin

Categories: Computer Vision

2023-01-02

In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages that encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving a holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting informative and discriminative information from the videos, and show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.

Two-Stream Transformer Architecture for Long Video Understanding

Edward Fish , Jon Weinbren , Andrew Gilbert

Categories: Computer Vision | Machine Learning

2022-08-02

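Pure vision transformer architectures are highly effective for short video classification and action recognition tasks. However, because of the quadratic complexity of self-attention and the lack of inductive bias, transformers are resource intensive and suffer from data inefficiency. Long-form video understanding tasks amplify the data and memory efficiency problems of transformers, making current approaches infeasible in data-restricted or memory-restricted domains. This paper introduces an efficient Spatio-Temporal Attention Network (STAN), which uses a two-stream transformer architecture to model dependencies between static image features and temporal contextual features. Our proposed approach can classify videos up to two minutes in length on a single GPU, is data efficient, and achieves SOTA performance on several long video understanding tasks.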
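
The quadratic cost mentioned above can be made concrete with a little arithmetic. The sketch below only counts query-key pairs for full self-attention over all patch tokens of a clip; it does not model STAN itself. The figure of 196 patches per frame (a 224x224 frame cut into 16x16 patches) and the sampled frame counts are illustrative assumptions.

```python
def full_attention_pairs(num_frames, patches_per_frame):
    """Number of query-key pairs when every token attends to every token."""
    n = num_frames * patches_per_frame
    return n * n

PATCHES = 196  # assumption: 14 x 14 patches per frame

short_clip = full_attention_pairs(16, PATCHES)   # e.g. a 16-frame clip
long_clip = full_attention_pairs(120, PATCHES)   # e.g. two minutes at 1 frame/s

print(short_clip)              # pairs for the short clip
print(long_clip)               # pairs for the long clip
print(long_clip / short_clip)  # growth factor = (120 / 16) ** 2
```

Under these assumptions, a 7.5x longer input costs about 56x more attention pairs per layer and head, which is why long-form video pushes full attention past the memory of a single GPU.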

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

AJ Piergiovanni , Weicheng Kuo , Anelia Angelova

Categories: Computer Vision

2022-12-06

We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results and the code will be open-sourced.

Multiple Instance Neuroimage Transformer

Ayush Singla , Qingyu Zhao , Daniel K. Do , Yuyin Zhou , Kilian M. Pohl , Ehsan Adeli

Categories: Computer Vision | Machine Learning

2022-08-19

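For the first time, we propose using a multiple-instance-learning-based, convolution-free transformer model, called Multiple Instance Neuroimage Transformer (MINiT), to classify T1-weighted (T1w) MRIs. We first present several transformer models adapted for neuroimages. These models extract non-overlapping 3D blocks from the input volume and perform multi-headed self-attention on their linear projections. MINiT, on the other hand, treats each non-overlapping 3D block of the input MRI as its own instance, splits it further into non-overlapping 3D patches, and computes multi-headed self-attention on those patches. As a proof of concept, we evaluate the efficacy of our model by training it to identify sex from the T1w MRIs of two public datasets: Adolescent Brain Cognitive Development (ABCD) and the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA). The learned attention maps highlight voxels that contribute to identifying sex differences in brain morphometry. The code is available at https://github.com/singlaayush/minit.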
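
The two-level split described above (volume into blocks, each block into patches) can be sketched with NumPy reshapes. This is an illustration under assumed sizes, not the released code: the helper `split_3d`, the 64-voxel cube, the 32-voxel blocks and the 8-voxel patches are all hypothetical choices.

```python
import numpy as np

def split_3d(volume, size):
    """Split a cubic volume (D, D, D) into non-overlapping cubes of edge `size`.

    Returns an array of shape (num_cubes, size, size, size).
    """
    d = volume.shape[0]
    assert volume.shape == (d, d, d) and d % size == 0
    n = d // size
    # Separate each axis into (cube index, offset inside the cube).
    x = volume.reshape(n, size, n, size, n, size)
    x = x.transpose(0, 2, 4, 1, 3, 5)
    return x.reshape(n ** 3, size, size, size)

mri = np.random.default_rng(0).random((64, 64, 64))  # stand-in for a T1w volume

blocks = split_3d(mri, 32)                            # 8 blocks, one instance each
patches = np.stack([split_3d(b, 8) for b in blocks])  # 64 patches per block
tokens = patches.reshape(blocks.shape[0], patches.shape[1], -1)

print(blocks.shape, tokens.shape)
```

Self-attention would then run over the 64 patch tokens inside each block, so each block acts as an independent instance in the multiple instance learning sense.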

Multiview Transformers for Video Recognition

Shen Yan , Xuehan Xiong , Anurag Arnab , Zhichao Lu , Mi Zhang , Chen Sun , Cordelia Schmid

Categories: Computer Vision | Machine Learning

2022-01-12

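Video understanding requires reasoning at multiple spatiotemporal resolutions, from short fine-grained motions to events taking place over longer durations. Although transformer architectures have recently advanced the state of the art, they have not explicitly modelled different spatiotemporal resolutions. To this end, we present Multiview Transformers for Video Recognition (MTV). Our model consists of separate encoders that represent different views of the input video, with lateral connections to fuse information across views. We present thorough ablation studies of our model and show that MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost across a range of model sizes. Furthermore, we achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining. We will release code and pretrained checkpoints.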

Recent Advances in Vision Transformer: A Survey for Different Domains

Khawar Islam

Categories: Computer Vision | Artificial Intelligence

2022-03-03

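Compared with convolutional neural networks (CNNs), vision transformers (ViTs) are becoming an increasingly popular and dominant technique. As an in-demand technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships. In this paper, we begin by introducing the fundamental concepts and background of the self-attention mechanism. Next, we provide a comprehensive overview of recent top-performing ViT methods, described in terms of strengths and weaknesses, computational cost, and training and testing datasets. We thoroughly compare the performance of various ViT algorithms and the most representative CNN methods on popular benchmark datasets. Finally, we explore some limitations with insightful observations and provide further research directions. The project page, along with the collection of papers, is available at https://github.com/khawar512/vit-survey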

Two-stream Multi-dimensional Convolutional Network for Real-time Violence Detection

Dipon Kumar Ghosh , Amitabha Chakrabarty

Categories: Computer Vision

2022-11-08

The increasing number of surveillance cameras and security concerns have made automatic violent activity detection from surveillance footage an active area for research. Modern deep learning methods have achieved good accuracy in violence detection and proved to be successful because of their applicability in intelligent surveillance systems. However, the models are computationally expensive and large in size because of their inefficient methods for feature extraction. This work presents a novel architecture for violence detection called Two-stream Multi-dimensional Convolutional Network (2s-MDCN), which uses RGB frames and optical flow to detect violence. Our proposed method extracts temporal and spatial information independently by 1D, 2D, and 3D convolutions. Despite combining multi-dimensional convolutional networks, our models are lightweight and efficient due to reduced channel capacity, yet they learn to extract meaningful spatial and temporal information. Additionally, combining RGB frames and optical flow yields 2.2% more accuracy than a single RGB stream. Regardless of having less complexity, our models obtained state-of-the-art accuracy of 89.7% on the largest violence detection benchmark dataset.

MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition

Jiawei Chen , Chiu Man Ho

Categories: Computer Vision

2021-08-20

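This paper presents a pure transformer-based approach, dubbed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Unlike other schemes that use only the decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals and audio waveform. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants that factorize self-attention across the space, time and modality dimensions. In addition, to further explore the rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs better than or on par with state-of-the-art CNN counterparts that rely on computationally heavy optical flow.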

Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition

Vittorio Mazzia , Simone Angarano , Francesco Salvetti , Federico Angelini , Marcello Chiaberge

Categories: Computer Vision | Machine Learning

2021-07-01

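Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have mainly been adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborate networks that mix convolutional, recurrent and attentive layers. To limit computational and energy demands, and building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low-latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to build a formal training and evaluation benchmark for real-time, short-time HAR. The proposed methodology was extensively tested on MPOSE2021 and compared with several state-of-the-art architectures, proving the effectiveness of the AcT model and laying the foundations for future work on HAR.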

Video Transformers: A Survey

Javier Selva , Anders S. Johansen , Sergio Escalera , Kamal Nasrollahi , Thomas B. Moeslund , Albert Clapés

Categories: Computer Vision

2022-01-16

Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced with the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey we analyze main contributions and trends of works leveraging Transformers to model video. Specifically, we delve into how videos are handled as input-level first. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.

Video-based Human Action Recognition using Deep Learning: A Review

Hieu H. Pham , Louahdi Khoudour , Alain Crouzil , Pablo Zegers , Sergio A. Velastin

Categories: Computer Vision

2022-08-07

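Human action recognition is an important application domain in computer vision. Its primary aim is to accurately describe human actions and their interactions from a previously unseen data sequence acquired by sensors. The ability to recognize, understand, and predict complex human actions enables many important applications, such as intelligent surveillance systems, human-computer interfaces, health care, security, and military applications. In recent years, the computer vision community has paid particular attention to deep learning. This paper presents an overview of the current state of the art in action recognition using video analysis with deep learning techniques. We present the most important deep learning models for recognizing human actions and analyze them to show the current progress of deep learning algorithms applied to human action recognition problems, highlighting their advantages and disadvantages. Based on a quantitative analysis of the recognition accuracies reported in the literature, our study identifies state-of-the-art deep architectures in action recognition and then presents current trends and open problems for future work in this field.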

Transformers in Vision: A Survey

Salman Khan , Muzammal Naseer , Munawar Hayat , Syed Waqas Zamir , Fahad Shahbaz Khan , Mubarak Shah

Categories:

2021-01-04

Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. 
We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.

Looking for the Signs: Identifying Isolated Sign Instances in Continuous Video Footage

Tao Jiang , Necati Cihan Camgoz , Richard Bowden

Categories: Computer Vision

2021-07-21

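In this paper, we focus on the task of one-shot sign spotting: given an example of an isolated sign (the query), we want to identify whether this sign appears in a continuous, co-articulated sign language video (the target). To achieve this goal, we propose a transformer-based network called SignLookup. We employ 3D convolutional neural networks (CNNs) to extract spatio-temporal representations from video clips. To resolve the temporal scale discrepancies between the query and target videos, we construct multiple queries from a single video clip using different frame-level strides. Self-attention is applied across these query clips to simulate a continuous scale space. We also use another self-attention module on the target video to learn the context within the sequence. Finally, mutual attention is used to match the temporal scales and localize the query within the target sequence. Extensive experiments demonstrate that the proposed approach can not only reliably identify isolated signs in continuous videos, regardless of the signer's appearance, but can also generalize to different sign languages. By taking advantage of the attention mechanism and adaptive features, our model achieves state-of-the-art performance on the sign spotting task, with accuracy as high as 96% on challenging benchmark datasets, and significantly outperforms other approaches.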
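
The idea of building several queries from one clip with different frame-level strides can be sketched as plain index sampling. How the paper actually samples, pads or centres its clips is not stated in the abstract, so everything below (the function name, starting at frame 0, repeating the last frame when the clip runs out) is an assumption for illustration.

```python
def multi_stride_queries(num_frames, clip_len, strides):
    """Return, for each stride, the frame indices of one query clip.

    Indices start at frame 0 and are clamped to the last frame, so a large
    stride on a short video repeats the final frame (an assumed padding rule).
    """
    last = num_frames - 1
    return {s: [min(i * s, last) for i in range(clip_len)] for s in strides}

# Hypothetical example: a 20-frame isolated sign, 8-frame query clips.
queries = multi_stride_queries(num_frames=20, clip_len=8, strides=(1, 2, 3))
for stride, indices in queries.items():
    print(stride, indices)
```

Each index list covers a different temporal extent of the same sign, giving the network slower and faster versions of the query to match against the target video.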

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

Michael S. Ryoo , AJ Piergiovanni , Anurag Arnab , Mostafa Dehghani , Anelia Angelova

Categories: Computer Vision | Machine Learning

2021-06-21

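In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This makes it possible to find a few important visual tokens efficiently and effectively, and to model pairwise attention between such tokens over a longer temporal horizon for videos or over the spatial content of images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, because our tokens are adaptive, we achieve competitive results with a significantly reduced amount of computation. We obtain results comparable to the state of the art on ImageNet while being computationally more efficient. We establish a new state of the art on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD. The code is available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner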
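
The core idea, reducing many patch features to a handful of adaptively weighted tokens, can be sketched as attention-weighted pooling. This is a simplified stand-in rather than the released TokenLearner module: it uses a single random linear map and a softmax over positions where the real module learns its attention maps, and the sizes (196 patches, 64 channels, 8 tokens) are assumptions.

```python
import numpy as np

def pool_to_tokens(features, weights):
    """Pool N patch features into S tokens with S spatial attention maps.

    features: (N, C) patch features; weights: (C, S) map producing one
    attention logit per patch and token. Returns tokens of shape (S, C).
    """
    logits = features @ weights                          # (N, S)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)        # softmax over the N patches
    return attn.T @ features                             # weighted sums, (S, C)

rng = np.random.default_rng(0)
features = rng.standard_normal((196, 64))   # e.g. 14 x 14 patches, 64 channels
weights = rng.standard_normal((64, 8))      # random stand-in for learned weights
tokens = pool_to_tokens(features, weights)
print(tokens.shape)
```

Later transformer layers would attend over the 8 pooled tokens instead of the 196 patches, which is where the reduction in computation comes from.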
