计算机视觉任务可以从估计突出物区域和这些对象区域之间的相互作用中受益。识别对象区域涉及利用预借鉴模型来执行对象检测,对象分割和/或对象姿势估计。但是,由于以下原因,在实践中不可行:1)预用模型的训练数据集的对象类别可能不会涵盖一般计算机视觉任务的所有对象类别,2)佩戴型模型训练数据集之间的域间隙并且目标任务的数据集可能会影响性能,3)预磨模模型中存在的偏差和方差可能泄漏到导致无意中偏置的目标模型的目标任务中。为了克服这些缺点,我们建议利用一系列视频帧捕获一组公共对象和它们之间的相互作用的公共基本原理,因此视频帧特征之间的共分割的概念可以用自动的能力装配模型专注于突出区域,以最终的方式提高潜在的任务的性能。在这方面,我们提出了一种称为“共分割激活模块”(COSAM)的通用模块,其可以被插入任何CNN,以促进基于CNN的任何CNN的概念在一系列视频帧特征中的关注。我们在三个基于视频的任务中展示Cosam的应用即1)基于视频的人Re-ID,2)视频字幕分类,并证明COSAM能够在视频帧中捕获突出区域,从而引导对于显着的性能改进以及可解释的关注图。
translated by 谷歌翻译
人类自然有效地在复杂的场景中找到突出区域。通过这种观察的动机,引入了计算机视觉中的注意力机制,目的是模仿人类视觉系统的这一方面。这种注意机制可以基于输入图像的特征被视为动态权重调整过程。注意机制在许多视觉任务中取得了巨大的成功,包括图像分类,对象检测,语义分割,视频理解,图像生成,3D视觉,多模态任务和自我监督的学习。在本调查中,我们对计算机愿景中的各种关注机制进行了全面的审查,并根据渠道注意,空间关注,暂时关注和分支注意力进行分类。相关的存储库https://github.com/menghaoguo/awesome-vision-tions致力于收集相关的工作。我们还建议了未来的注意机制研究方向。
translated by 谷歌翻译
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
translated by 谷歌翻译
Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras. With the advancement of deep neural networks and increasing demand of intelligent video surveillance, it has gained significantly increased interest in the computer vision community. By dissecting the involved components in developing a person Re-ID system, we categorize it into the closed-world and open-world settings. The widely studied closed-world setting is usually applied under various research-oriented assumptions, and has achieved inspiring success using deep learning techniques on a number of datasets. We first conduct a comprehensive overview with in-depth analysis for closed-world person Re-ID from three different perspectives, including deep feature representation learning, deep metric learning and ranking optimization. With the performance saturation under closed-world setting, the research focus for person Re-ID has recently shifted to the open-world setting, facing more challenging issues. This setting is closer to practical applications under specific scenarios. We summarize the open-world Re-ID in terms of five different aspects. By analyzing the advantages of existing methods, we design a powerful AGW baseline, achieving state-of-the-art or at least comparable performance on twelve datasets for FOUR different Re-ID tasks. Meanwhile, we introduce a new evaluation metric (mINP) for person Re-ID, indicating the cost for finding all the correct matches, which provides an additional criteria to evaluate the Re-ID system for real applications. Finally, some important yet under-investigated open issues are discussed.
translated by 谷歌翻译
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
translated by 谷歌翻译
连接视觉和语言在生成智能中起着重要作用。因此,已经致力于图像标题的大型研究工作,即用句法和语义有意义的句子描述图像。从2015年开始,该任务通常通过由Visual Encoder组成的管道和文本生成的语言模型来解决任务。在这些年来,两种组件通过对象区域,属性,介绍多模态连接,完全关注方法和伯特早期融合策略的利用而显着发展。但是,无论令人印象深刻的结果,图像标题的研究还没有达到结论性答案。这项工作旨在提供图像标题方法的全面概述,从视觉编码和文本生成到培训策略,数据集和评估度量。在这方面,我们量化地比较了许多相关的最先进的方法来确定架构和培训策略中最有影响力的技术创新。此外,讨论了问题的许多变体及其开放挑战。这项工作的最终目标是作为理解现有文献的工具,并突出显示计算机视觉和自然语言处理的研究领域的未来方向可以找到最佳的协同作用。
translated by 谷歌翻译
视频字幕定位目标将复杂的视觉内容解释为文本说明,这要求模型充分了解包括对象及其交互的视频场景。流行的方法采用现成的对象检测网络来提供对象建议,并使用注意机制来建模对象之间的关系。他们通常会错过一些预验证模型的不确定语义概念,并且无法识别对象之间的确切谓词关系。在本文中,我们研究了为给定视频生成文本描述的开放研究任务,并提出了带有元概念的跨模式图(CMG)。具体而言,为了涵盖视频字幕中有用的语义概念,我们弱地学习了文本描述的相应视觉区域,其中相关的视觉区域和文本单词被命名为跨模式元概念。我们通过学习的跨模式元概念动态地构建元概念图。我们还构建了整体视频级别和本地框架级视频图,并具有预测的谓词,以建模视频序列结构。我们通过广泛的实验来验证我们提出的技术的功效,并在两个公共数据集上实现最新结果。
translated by 谷歌翻译
近年来,随着对公共安全的需求越来越多,智能监测网络的快速发展,人员重新识别(RE-ID)已成为计算机视野领域的热门研究主题之一。人员RE-ID的主要研究目标是从不同的摄像机中检索具有相同身份的人。但是,传统的人重新ID方法需要手动标记人的目标,这消耗了大量的劳动力成本。随着深度神经网络的广泛应用,出现了许多基于深入的基于学习的人物的方法。因此,本文促进研究人员了解最新的研究成果和该领域的未来趋势。首先,我们总结了对几个最近公布的人的研究重新ID调查,并补充了系统地分类基于深度学习的人的重新ID方法的最新研究方法。其次,我们提出了一种多维分类,根据度量标准和表示学习,将基于深度学习的人的重新ID方法分为四类,包括深度度量学习,本地特征学习,生成的对抗学习和序列特征学习的方法。此外,我们根据其方法和动机来细分以上四类,讨论部分子类别的优缺点。最后,我们讨论了一些挑战和可能的研究方向的人重新ID。
translated by 谷歌翻译
基于文本的视频细分旨在通过用文本查询指定演员及其表演动作来细分视频序列中的演员。由于\ emph {emph {语义不对称}的问题,以前的方法无法根据演员及其动作以细粒度的方式将视频内容与文本查询对齐。 \ emph {语义不对称}意味着在多模式融合过程中包含不同量的语义信息。为了减轻这个问题,我们提出了一个新颖的演员和动作模块化网络,该网络将演员及其动作分别定位在两个单独的模块中。具体来说,我们首先从视频和文本查询中学习与参与者相关的内容,然后以对称方式匹配它们以定位目标管。目标管包含所需的参与者和动作,然后将其送入完全卷积的网络,以预测演员的分割掩模。我们的方法还建立了对象的关联,使其与所提出的时间建议聚合机制交叉多个框架。这使我们的方法能够有效地细分视频并保持预测的时间一致性。整个模型允许联合学习参与者的匹配和细分,并在A2D句子和J-HMDB句子数据集上实现单帧细分和完整视频细分的最新性能。
translated by 谷歌翻译
视觉变压器正在成为解决计算机视觉问题的强大工具。最近的技术还证明了超出图像域之外的变压器来解决许多与视频相关的任务的功效。其中,由于其广泛的应用,人类的行动识别是从研究界受到特别关注。本文提供了对动作识别的视觉变压器技术的首次全面调查。我们朝着这个方向分析并总结了现有文献和新兴文献,同时突出了适应变形金刚以进行动作识别的流行趋势。由于其专业应用,我们将这些方法统称为``动作变压器''。我们的文献综述根据其架构,方式和预期目标为动作变压器提供了适当的分类法。在动作变压器的背景下,我们探讨了编码时空数据,降低维度降低,框架贴片和时空立方体构造以及各种表示方法的技术。我们还研究了变压器层中时空注意的优化,以处理更长的序列,通常通过减少单个注意操作中的令牌数量。此外,我们还研究了不同的网络学习策略,例如自我监督和零局学习,以及它们对基于变压器的行动识别的相关损失。这项调查还总结了在具有动作变压器重要基准的评估度量评分方面取得的进步。最后,它提供了有关该研究方向的挑战,前景和未来途径的讨论。
translated by 谷歌翻译
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in both image and video domains. Different from the image domain, learning video representations are more challenging due to the temporal dimension, bringing in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domain. In this survey, we provide a review of existing approaches on self-supervised learning focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.
translated by 谷歌翻译
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years. A robust and efficient HAR system has a pivotal role in fields like video surveillance, crowd behavior analysis, sports analysis, and human-computer interaction. What makes it challenging are the complex poses, understanding different viewpoints, and the environmental scenarios where the action is taking place. To address such complexities, in this paper, we propose a novel Sparse Weighted Temporal Attention (SWTA) module to utilize sparsely sampled video frames for obtaining global weighted temporal attention. The proposed SWTA is comprised of two parts. First, temporal segment network that sparsely samples a given set of frames. Second, weighted temporal attention, which incorporates a fusion of attention maps derived from optical flow, with raw RGB images. This is followed by a basenet network, which comprises a convolutional neural network (CNN) module along with fully connected layers that provide us with activity recognition. The SWTA network can be used as a plug-in module to the existing deep CNN architectures, for optimizing them to learn temporal information by eliminating the need for a separate temporal stream. It has been evaluated on three publicly available benchmark datasets, namely Okutama, MOD20, and Drone-Action. The proposed model has received an accuracy of 72.76%, 92.56%, and 78.86% on the respective datasets thereby surpassing the previous state-of-the-art performances by a margin of 25.26%, 18.56%, and 2.94%, respectively.
translated by 谷歌翻译
在本文中,我们基于任何卷积神经网络中中间注意图的弱监督生成机制,并更加直接地披露了注意模块的有效性,以充分利用其潜力。鉴于现有的神经网络配备了任意注意模块,我们介绍了一个元评论家网络,以评估主网络中注意力图的质量。由于我们设计的奖励的离散性,提出的学习方法是在强化学习环境中安排的,在此设置中,注意力参与者和经常性的批评家交替优化,以提供临时注意力表示的即时批评和修订,因此,由于深度强化的注意力学习而引起了人们的关注。 (Dreal)。它可以普遍应用于具有不同类型的注意模块的网络体系结构,并通过最大程度地提高每个单独注意模块产生的最终识别性能的相对增益来促进其表现能力,如类别和实例识别基准的广泛实验所证明的那样。
translated by 谷歌翻译
深度学习技术导致了通用对象检测领域的显着突破,近年来产生了很多场景理解的任务。由于其强大的语义表示和应用于场景理解,场景图一直是研究的焦点。场景图生成(SGG)是指自动将图像映射到语义结构场景图中的任务,这需要正确标记检测到的对象及其关系。虽然这是一项具有挑战性的任务,但社区已经提出了许多SGG方法并取得了良好的效果。在本文中,我们对深度学习技术带来了近期成就的全面调查。我们审查了138个代表作品,涵盖了不同的输入方式,并系统地将现有的基于图像的SGG方法从特征提取和融合的角度进行了综述。我们试图通过全面的方式对现有的视觉关系检测方法进行连接和系统化现有的视觉关系检测方法,概述和解释SGG的机制和策略。最后,我们通过深入讨论当前存在的问题和未来的研究方向来完成这项调查。本调查将帮助读者更好地了解当前的研究状况和想法。
translated by 谷歌翻译
视频突出对象检测旨在在视频中找到最具视觉上的对象。为了探索时间依赖性,现有方法通常是恢复性的神经网络或光学流量。然而,这些方法需要高计算成本,并且往往会随着时间的推移积累不准确性。在本文中,我们提出了一种带有注意模块的网络,以学习视频突出物体检测的对比特征,而没有高计算时间建模技术。我们开发了非本地自我关注方案,以捕获视频帧中的全局信息。共注意配方用于结合低级和高级功能。我们进一步应用了对比学学习以改善来自相同视频的前景区域对的特征表示,并将前景 - 背景区域对被推除在潜在的空间中。帧内对比损失有助于将前景和背景特征分开,并且帧间的对比损失提高了时间的稠度。我们对多个基准数据集进行广泛的实验,用于视频突出对象检测和无监督的视频对象分割,并表明所提出的方法需要较少的计算,并且对最先进的方法进行有利地执行。
translated by 谷歌翻译
变压器架构已经带来了计算语言领域的根本变化,这已经由经常性神经网络主导多年。它的成功还意味着具有语言和愿景的跨模型任务的大幅度变化,许多研究人员已经解决了这个问题。在本文中,我们审查了该领域中的一些最关键的里程碑,以及变压器架构如何纳入Visuol语言跨模型任务的整体趋势。此外,我们讨论了当前的局限性,并推测了我们发现迫在眉睫的一些前景。
translated by 谷歌翻译
有效地对视频中的空间信息进行建模对于动作识别至关重要。为了实现这一目标,最先进的方法通常采用卷积操作员和密集的相互作用模块,例如非本地块。但是,这些方法无法准确地符合视频中的各种事件。一方面,采用的卷积是有固定尺度的,因此在各种尺度的事件中挣扎。另一方面,密集的相互作用建模范式仅在动作 - 欧元零件时实现次优性能,给最终预测带来了其他噪音。在本文中,我们提出了一个统一的动作识别框架,以通过引入以下设计来研究视频内容的动态性质。首先,在提取本地提示时,我们会生成动态尺度的时空内核,以适应各种事件。其次,为了将这些线索准确地汇总为全局视频表示形式,我们建议仅通过变压器在一些选定的前景对象之间进行交互,从而产生稀疏的范式。我们将提出的框架称为事件自适应网络(EAN),因为这两个关键设计都适应输入视频内容。为了利用本地细分市场内的短期运动,我们提出了一种新颖有效的潜在运动代码(LMC)模块,进一步改善了框架的性能。在几个大规模视频数据集上进行了广泛的实验,例如,某种东西,动力学和潜水48,验证了我们的模型是否在低拖鞋上实现了最先进或竞争性的表演。代码可在:https://github.com/tianyuan168326/ean-pytorch中找到。
translated by 谷歌翻译
为了为视频产生适当的标题,推理需要确定相关的概念并注意它们之间的空间关系以及剪辑中的时间发展。我们的端到端编码器视频字幕框架结合了两个基于变压器的体系结构,这是一种用于单个关节时空视频分析的改编变压器,以及用于高级文本生成的基于自我注意力的解码器。此外,我们引入了一种自适应框架选择方案,以减少所需的传入帧数,同时在训练两个变压器时保持相关内容。此外,我们通过汇总每个样本的所有基础真理标题来估计与视频字幕相关的语义概念。我们的方法在MSVD以及大规模的MSR-VTT和VATEX基准数据集上实现了最新的结果,并考虑了多个自然语言产生(NLG)指标。对多样性得分的其他评估突出了我们生成的标题结构的表现力和多样性。
translated by 谷歌翻译
具有注释的缺乏大规模的真实数据集使转移学习视频活动的必要性。我们的目标是为少数行动分类开发几次拍摄转移学习的有效方法。我们利用独立培训的本地视觉提示来学习可以从源域传输的表示,该源域只能使用少数示例来从源域传送到不同的目标域。我们使用的视觉提示包括对象 - 对象交互,手掌和地区内的动作,这些地区是手工位置的函数。我们采用了一个基于元学习的框架,以提取部署的视觉提示的独特和域不变组件。这使得能够在使用不同的场景和动作配置捕获的公共数据集中传输动作分类模型。我们呈现了我们转让学习方法的比较结果,并报告了阶级阶级和数据间数据间际传输的最先进的行动分类方法。
translated by 谷歌翻译
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
translated by 谷歌翻译