Existing temporal action detection (TAD) methods rely on large training data that includes segment-level annotations, and at inference they are limited to recognizing only the previously seen classes. Collecting and annotating a large training set for every class of interest is expensive and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging and significantly less studied. Inspired by the success of zero-shot image classification, we aim to tackle the more complex TAD task. An intuitive approach is to integrate an off-the-shelf proposal detector with CLIP-style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-shot temporal action detection model via vision-language prompting (STALE). This novel design effectively eliminates the dependence between localization and classification by breaking the error propagation route in between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. In addition, our model also yields superior results on supervised TAD over recent strong competitors. The PyTorch implementation of STALE is available at https://github.com/sauradip/stale.
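To make the decoupled design concrete, here is a minimal sketch (not the released STALE code) of a parallel head in which a class-agnostic temporal mask is predicted alongside a CLIP-style classification branch that matches projected snippet features against frozen class-name text embeddings; all module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelZSTADHead(nn.Module):
    """Parallel (decoupled) localization and classification over snippet features."""
    def __init__(self, feat_dim=512, embed_dim=512):
        super().__init__()
        # Localization branch: a class-agnostic per-snippet foreground mask, no proposals.
        self.mask_head = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(feat_dim, 1, 1))
        # Classification branch: project snippets into the text-embedding space.
        self.vis_proj = nn.Conv1d(feat_dim, embed_dim, 1)

    def forward(self, snippet_feats, text_embeds):
        # snippet_feats: (B, D, T); text_embeds: (C, E) from a frozen text encoder
        fg_mask = self.mask_head(snippet_feats).sigmoid()        # (B, 1, T)
        vis = F.normalize(self.vis_proj(snippet_feats), dim=1)   # (B, E, T)
        txt = F.normalize(text_embeds, dim=-1)                   # (C, E)
        cls_logits = torch.einsum('bet,ce->bct', vis, txt)       # (B, C, T)
        return fg_mask, cls_logits

# Toy run: 2 videos, 100 snippets, 20 (possibly unseen) class-name embeddings.
head = ParallelZSTADHead()
masks, logits = head(torch.randn(2, 512, 100), torch.randn(20, 512))
print(masks.shape, logits.shape)   # (2, 1, 100) and (2, 20, 100)
```

Because the mask branch never feeds the classifier, a poor localization estimate cannot corrupt the class scores, which is the error-propagation cut described above.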
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem and again achieves the state-of-the-art performance on MS-COCO dataset. The code will be available in https://github.com/sauradip/MUPPET
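The multi-modal prompt construction can be pictured with the following sketch, which assumes support-video features are pooled and then mapped into the text token space by an adapter-style tokenizer before being concatenated with class-name tokens; the tokenizer layout and all dimensions are hypothetical, not MUPPET's actual meta-learned module.

```python
import torch
import torch.nn as nn

class VisualSemanticsTokenizer(nn.Module):
    """Maps a pooled support-video feature into k pseudo text tokens (hypothetical layout)."""
    def __init__(self, vis_dim=768, txt_dim=512, k_tokens=4):
        super().__init__()
        self.k, self.txt_dim = k_tokens, txt_dim
        self.adapter = nn.Sequential(
            nn.Linear(vis_dim, txt_dim), nn.GELU(),
            nn.Linear(txt_dim, txt_dim * k_tokens))

    def forward(self, support_feat):                     # (B, vis_dim)
        return self.adapter(support_feat).view(-1, self.k, self.txt_dim)

tokenizer = VisualSemanticsTokenizer()
support_feat = torch.randn(5, 768)                       # pooled features of 5 support videos
class_name_tokens = torch.randn(5, 8, 512)               # toy stand-in for tokenized class names
visual_tokens = tokenizer(support_feat)                  # (5, 4, 512) pseudo text tokens
multimodal_prompt = torch.cat([visual_tokens, class_name_tokens], dim=1)   # (5, 12, 512)
print(multimodal_prompt.shape)
```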
Existing temporal action detection (TAD) methods rely on a large amount of training data with segment-level annotations. Collecting and annotating such a training set is therefore very expensive and unscalable. Semi-supervised TAD (SS-TAD) alleviates this problem by leveraging unlabeled videos that are freely available at scale. However, SS-TAD is also a much more challenging problem than supervised TAD and is consequently far less studied. Prior SS-TAD methods directly combine existing proposal-based TAD methods with SSL methods. Due to their sequential localization (e.g., proposal generation) and classification design, they are prone to error propagation. To overcome this limitation, in this work we propose a novel semi-supervised temporal action detection model based on a proposal-free temporal mask (SPOT), with a parallel localization (mask generation) and classification architecture. This novel design effectively eliminates the dependence between localization and classification by cutting off the error propagation route in between. We further introduce an interaction mechanism between classification and localization for prediction refinement, and a new pretext task for self-supervised model pre-training. Extensive experiments on two standard benchmarks show that our SPOT outperforms state-of-the-art alternatives, often by a large margin. The PyTorch implementation of SPOT is available at https://github.com/sauradip/spot.
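The abstract does not spell out the semi-supervised recipe, so the sketch below shows one generic way such a setup can consume unlabeled videos: a mean-teacher consistency loss on the predicted temporal masks, with confident teacher outputs used as pseudo-labels. This is illustrative only and should not be read as SPOT's actual training scheme.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track the student as an exponential moving average.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1 - momentum)

def consistency_loss(student_mask, teacher_mask, threshold=0.7):
    # Confident teacher predictions on unlabeled snippets act as pseudo masks.
    t = teacher_mask.detach()
    pseudo = (t > threshold).float()
    confident = ((t > threshold) | (t < 1 - threshold)).float()
    return (F.binary_cross_entropy(student_mask, pseudo, reduction='none') * confident).mean()

student = torch.nn.Conv1d(512, 1, 1)                 # stand-in for the mask branch
teacher = copy.deepcopy(student)
feats = torch.randn(4, 512, 100)                     # unlabeled snippet features
loss = consistency_loss(student(feats).sigmoid(), teacher(feats).sigmoid())
loss.backward()
ema_update(teacher, student)
print(float(loss))
```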
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs, owing to proposal generation and/or per-proposal action instance evaluation, and consequently high computational cost. In this work, we propose, for the first time, a proposal-free temporal action detection model with global segmentation masks (TAGS). Our core idea is to learn a global segmentation mask for each action instance jointly over the full video length. The TAGS model differs significantly from conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of actions without proposals. Furthermore, by modeling TAD holistically rather than locally at the individual proposal level, TAGS needs a simpler model architecture with lower computational cost. Extensive experiments show that, despite its simpler design, TAGS outperforms existing TAD methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is about 20x faster to train and more efficient at inference. The PyTorch implementation of our TAGS is available at https://github.com/sauradip/tags.
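A proposal-free read-out of this kind ultimately has to turn a global temporal mask into (start, end) segments; a minimal decoding sketch (thresholding plus merging of contiguous snippets, not the exact TAGS inference code) is shown below.

```python
import torch

def mask_to_segments(mask_1d, threshold=0.5):
    """mask_1d: (T,) foreground probabilities -> list of (start_idx, end_idx) segments."""
    fg = (mask_1d > threshold).tolist()
    segments, start = [], None
    for t, on in enumerate(fg):
        if on and start is None:
            start = t                                  # a segment opens
        elif not on and start is not None:
            segments.append((start, t - 1))            # a segment closes
            start = None
    if start is not None:
        segments.append((start, len(fg) - 1))          # segment runs to the last snippet
    return segments

print(mask_to_segments(torch.tensor([0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9])))
# [(1, 3), (6, 7)]
```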
Vision-language pre-training has shown great success in learning joint visual-textual representations from large-scale web data, demonstrating remarkable zero-shot generalization ability. This paper presents a simple method to efficiently adapt a pre-trained vision-language model to novel tasks with minimal training; here, we consider video understanding tasks. Specifically, we propose to optimize a few random vectors, termed continuous prompt vectors, that convert the novel task into the same format as the pre-training objective. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyze the critical components and necessities. On 9 public benchmarks of action recognition, action localization, and text-video retrieval, across closed-set, few-shot, and open-set scenarios, we achieve competitive or state-of-the-art performance to existing methods, despite training significantly fewer parameters.
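The two ingredients described above can be sketched as follows: a handful of learnable continuous prompt vectors prepended to frozen class-name token embeddings, and a lightweight temporal Transformer stacked on frame-wise visual features. Sizes and layer counts are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Learnable context vectors prepended to frozen class-name token embeddings."""
    def __init__(self, n_ctx=8, dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_token_embeds):               # (C, L, D)
        ctx = self.ctx.unsqueeze(0).expand(class_token_embeds.size(0), -1, -1)
        return torch.cat([ctx, class_token_embeds], dim=1)

class TemporalHead(nn.Module):
    """Lightweight Transformer over frame-wise features from a frozen image encoder."""
    def __init__(self, dim=512, n_layers=2, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frame_feats):                      # (B, T, D)
        return self.encoder(frame_feats).mean(dim=1)     # pooled video representation

prompts = ContinuousPrompt()(torch.randn(10, 6, 512))    # 10 classes, 6 name tokens each
video = TemporalHead()(torch.randn(2, 16, 512))          # 2 clips, 16 frames
print(prompts.shape, video.shape)                        # (10, 14, 512) and (2, 512)
```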
Recent progress has shown that large-scale pre-training with contrastive image-text pairs can be a promising alternative for learning high-quality visual representations from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new dense prediction framework that implicitly and explicitly leverages the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP into a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pre-trained knowledge. Our method is model-agnostic and can be applied to arbitrary dense prediction systems and various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our method on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/denseclip.
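The pixel-text matching idea can be written down compactly: dense visual features are normalized and compared with class text embeddings to produce a per-pixel score map that can supervise or guide a dense prediction head. The snippet below is a schematic under assumed feature shapes, not DenseCLIP's full pipeline.

```python
import torch
import torch.nn.functional as F

def pixel_text_score_map(dense_feats, text_embeds, temperature=0.07):
    # dense_feats: (B, D, H, W) from the image encoder; text_embeds: (C, D) class embeddings
    v = F.normalize(dense_feats, dim=1)
    t = F.normalize(text_embeds, dim=-1)
    return torch.einsum('bdhw,cd->bchw', v, t) / temperature   # (B, C, H, W) score map

score = pixel_text_score_map(torch.randn(2, 512, 32, 32), torch.randn(19, 512))
print(score.shape)   # torch.Size([2, 19, 32, 32])
```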
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
Prompt Tuning, conditioning on task-specific learned prompt vectors, has emerged as a data-efficient and parameter-efficient method for adapting large pretrained vision-language models to multiple downstream tasks. However, existing approaches usually consider learning prompt vectors for each task independently from scratch, thereby failing to exploit the rich shareable knowledge across different vision-language tasks. In this paper, we propose multitask vision-language prompt tuning (MVLPT), which incorporates cross-task knowledge into prompt tuning for vision-language models. Specifically, (i) we demonstrate the effectiveness of learning a single transferable prompt from multiple source tasks to initialize the prompt for each target task; (ii) we show many target tasks can benefit each other from sharing prompt vectors and thus can be jointly learned via multitask prompt tuning. We benchmark the proposed MVLPT using three representative prompt tuning methods, namely text prompt tuning, visual prompt tuning, and the unified vision-language prompt tuning. Results in 20 vision tasks demonstrate that the proposed approach outperforms all single-task baseline prompt tuning methods, setting the new state-of-the-art on the few-shot ELEVATER benchmarks and cross-task generalization benchmarks. To understand where the cross-task knowledge is most effective, we also conduct a large-scale study on task transferability with 20 vision tasks in 400 combinations for each prompt tuning method. It shows that the most performant MVLPT for each prompt tuning method prefers different task combinations and many tasks can benefit each other, depending on their visual similarity and label similarity. Code is available at https://github.com/sIncerass/MVLPT.
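A tiny sketch of the cross-task initialization idea follows: a single prompt learned on the source tasks is cloned as the starting point for each target task before task-specific tuning. The tensor shapes and task names are placeholders; MVLPT's actual prompt formats follow the underlying tuning method (text, visual, or unified).

```python
import torch
import torch.nn as nn

shared_prompt = nn.Parameter(torch.randn(16, 512) * 0.02)   # learned jointly on source tasks
target_tasks = ["caltech101", "eurosat", "dtd"]             # placeholder task names
per_task_prompts = {t: nn.Parameter(shared_prompt.detach().clone()) for t in target_tasks}
print({t: tuple(p.shape) for t, p in per_task_prompts.items()})
```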
Contrastive Vision-Language Pre-training (CLIP) has recently drawn much attention for its transferable visual representation learning. Supervised by large-scale image-text pairs, CLIP is able to align paired images and texts, and thus perform zero-shot recognition in open-vocabulary scenarios. However, there is a semantic gap between specific applications and the generally pre-trained knowledge, which makes the matching sub-optimal on downstream tasks. In this paper, we propose VT-CLIP to enhance vision-language modeling via visual-guided texts. Specifically, we guide the textual features to adaptively explore informative regions on the image and aggregate visual features through a cross-attention mechanism. In this way, the visual-guided texts become more semantically correlated with the image, which greatly benefits the matching process. In few-shot settings, we evaluate our VT-CLIP on 11 well-known classification datasets and conduct extensive ablation studies to demonstrate its effectiveness. The code will be released soon.
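The visual-guided text mechanism can be sketched as a cross-attention step in which class text features act as queries over image patch features, with a residual connection before matching. Layer sizes below are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualGuidedText(nn.Module):
    """Class text features attend over image patch features before matching."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text_embeds, patch_feats):
        # text_embeds: (B, C, D) class embeddings; patch_feats: (B, N, D) image regions
        guided, _ = self.cross_attn(text_embeds, patch_feats, patch_feats)
        return F.normalize(text_embeds + guided, dim=-1)      # residual + normalize

model = VisualGuidedText()
text = torch.randn(2, 11, 512)       # 11 classes
patches = torch.randn(2, 49, 512)    # 7x7 patch grid
print(model(text, patches).shape)    # torch.Size([2, 11, 512])
```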
Exploring large-scale pre-trained foundation models is of significant interest in computer vision, because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design that pre-trains an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder Transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, for multimodal image-text representations. In addition to the captioning loss on the multimodal decoder outputs, which predict text tokens autoregressively, we apply a contrastive loss between the unimodal image and text embeddings. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pre-trained end-to-end and from scratch on both web-scale alt-text data and annotated images, treating all labels simply as text and thereby seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments in Time), cross-modal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and a new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.
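The combined objective can be illustrated with a schematic loss function: a symmetric image-text contrastive term on the unimodal embeddings plus an autoregressive captioning (cross-entropy) term on the multimodal decoder logits. The tensors below are placeholders; the weight and temperature are assumed values.

```python
import torch
import torch.nn.functional as F

def coca_style_loss(img_embed, txt_embed, caption_logits, caption_targets,
                    temperature=0.07, caption_weight=2.0):
    img = F.normalize(img_embed, dim=-1)
    txt = F.normalize(txt_embed, dim=-1)
    logits = img @ txt.t() / temperature                     # (B, B) similarity matrix
    labels = torch.arange(img.size(0))
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    captioning = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    return contrastive + caption_weight * captioning

B, L, V = 4, 12, 1000                                        # batch, caption length, vocab size
loss = coca_style_loss(torch.randn(B, 512), torch.randn(B, 512),
                       torch.randn(B, L, V), torch.randint(0, V, (B, L)))
print(float(loss))
```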
Solving multi-label recognition (MLR) for images in the low-label regime is a challenging task for many real-world applications. Recent work learns an alignment between the textual and visual spaces to compensate for insufficient image labels, but loses accuracy because of the limited amount of available MLR annotations. In this work, we utilize the strong alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs and propose Dual Context Optimization (DualCoOp) as a unified framework for partial-label MLR and zero-shot MLR. DualCoOp encodes positive and negative contexts with class names as part of the linguistic input (i.e., prompts). Since DualCoOp only introduces a very light overhead on top of the pre-trained vision-language framework, it can quickly adapt to multi-label recognition tasks with limited annotations and even unseen classes. Experiments on standard multi-label recognition benchmarks across two challenging low-label settings demonstrate the advantages of our approach over state-of-the-art methods.
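The dual-prompt scoring rule can be sketched as follows: each class carries a "positive" and a "negative" learned context, both encoded into text embeddings, and the presence probability comes from a two-way softmax between the two similarities. The encoders are stubbed out here; shapes and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_prompt_scores(img_embed, pos_class_embeds, neg_class_embeds, temperature=0.07):
    # img_embed: (B, D); pos/neg_class_embeds: (C, D) from encoding the two prompts per class
    img = F.normalize(img_embed, dim=-1)
    pos = F.normalize(pos_class_embeds, dim=-1)
    neg = F.normalize(neg_class_embeds, dim=-1)
    s_pos = img @ pos.t() / temperature                      # (B, C) "present" evidence
    s_neg = img @ neg.t() / temperature                      # (B, C) "absent" evidence
    # Per-class binary decision via a 2-way softmax between the two prompts.
    return torch.softmax(torch.stack([s_neg, s_pos], dim=-1), dim=-1)[..., 1]

probs = dual_prompt_scores(torch.randn(2, 512), torch.randn(80, 512), torch.randn(80, 512))
print(probs.shape)   # torch.Size([2, 80]) multi-label presence probabilities
```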
Large-scale multi-modal training with image-text pairs imparts strong generalization to CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit on the given task distribution and lack in generalization aspect. This begs the following question: How to effectively transfer image-level CLIP representations to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that the frame-level processing from CLIP image-encoder followed by feature pooling and similarity matching with corresponding text embeddings helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a `bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on language and vision side to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.
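The baseline described above reduces to a very small amount of glue code: encode frames independently with the CLIP image tower, average-pool over time, and compute cosine similarities against class text embeddings. The sketch below uses random tensors in place of the actual CLIP encoders.

```python
import torch
import torch.nn.functional as F

def video_text_logits(frame_feats, text_embeds, logit_scale=100.0):
    # frame_feats: (B, T, D) per-frame CLIP features; text_embeds: (C, D) class embeddings
    video = F.normalize(frame_feats.mean(dim=1), dim=-1)     # temporal average pooling
    text = F.normalize(text_embeds, dim=-1)
    return logit_scale * video @ text.t()                    # (B, C) class scores

logits = video_text_logits(torch.randn(2, 16, 512), torch.randn(400, 512))
print(logits.shape)   # torch.Size([2, 400])
```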
Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 mainly focus on mapping image and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, that expands the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from web-scale image-text data, our Florence model can be easily adapted to various computer vision tasks such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance on many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results on the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
Moving towards a holistic understanding of multiple downstream tasks simultaneously requires extracting features with better transferability. Although many recent self-supervised pre-training methods have achieved impressive performance on various vision tasks under the prevailing pretrain-finetune paradigm, their generalization capacity to multi-task learning scenarios is yet to be explored. In this paper, we extensively investigate the transfer performance of various types of self-supervised methods, e.g., MoCo and SimCLR, on three downstream tasks, including semantic segmentation, drivable-area segmentation, and traffic object detection, on the large-scale driving dataset BDD100K. We surprisingly find that their performance is sub-optimal or even lags far behind the single-task baseline, which may be due to the distinctions in training objectives and architectural design within the pretrain-finetune paradigm. To overcome this dilemma, and to avoid redesigning the resource-intensive pre-training stage, we propose a simple yet effective pretrain-adapt-finetune paradigm for general multi-task training, in which off-the-shelf pre-trained models can be effectively adapted without increasing the training overhead. During the adapt stage, we utilize learnable multi-scale adapters to dynamically adjust the pre-trained model weights supervised by multi-task objectives, while leaving the pre-trained knowledge untouched. Furthermore, we regard the vision-language pre-trained model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and propose a novel adapter named LV-Adapter, which incorporates language priors into the multi-task model via task-specific prompting and alignment between visual and textual features.
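The adapt stage can be illustrated with a generic bottleneck adapter that adjusts frozen pre-trained features through a small residual module; this is a standard adapter sketch, not the exact LV-Adapter or multi-scale design.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter; the residual path keeps the pre-trained features intact."""
    def __init__(self, dim=256, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):                       # x: (B, N, D) frozen backbone features
        return x + self.up(self.act(self.down(x)))

adapted = Adapter()(torch.randn(2, 196, 256))   # features from a frozen backbone
print(adapted.shape)   # torch.Size([2, 196, 256])
```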
Recently, zero-shot image classification via vision-language pre-training has demonstrated incredible achievements: the model can classify arbitrary categories without seeing additional annotated images of those categories. However, it is still unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target zero-shot semantic segmentation by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. This is difficult because semantic segmentation and the CLIP model operate at different visual granularities: semantic segmentation processes pixels, while CLIP operates on images. To remedy the discrepancy in processing granularity, we refrain from using the prevalent one-stage FCN-based framework and advocate a two-stage semantic segmentation framework, where the first stage extracts generalizable mask proposals and the second stage leverages an image-based CLIP model to perform zero-shot classification on the masked image crops generated in the first stage. Our experimental results show that this simple framework surpasses the previous state of the art by a large margin: +29.5 hIoU on the Pascal VOC 2012 dataset and +8.9 hIoU on the COCO Stuff dataset. With its simplicity and strong performance, we hope this framework can serve as a baseline to facilitate future research.
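The two-stage pipeline can be sketched as: generate class-agnostic mask proposals, blank out the background of each proposal, resize the masked crop, and classify it with a CLIP-style image encoder against class-name embeddings. The proposal generator and encoder below are stand-ins, and all sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def classify_mask_proposals(image, masks, clip_image_encoder, text_embeds, crop_size=224):
    # image: (3, H, W); masks: (N, H, W) binary proposals; text_embeds: (C, D) class names
    scores = []
    for m in masks:
        masked = image * m.unsqueeze(0)                     # blank out the background
        crop = F.interpolate(masked.unsqueeze(0), size=crop_size,
                             mode='bilinear', align_corners=False)
        v = F.normalize(clip_image_encoder(crop), dim=-1)   # (1, D) image embedding
        scores.append(v @ F.normalize(text_embeds, dim=-1).t())   # (1, C) zero-shot scores
    return torch.cat(scores)                                # (N, C)

# Toy run with a random projection standing in for CLIP's image tower.
enc = lambda x: x.mean(dim=(2, 3)) @ torch.randn(3, 512)
out = classify_mask_proposals(torch.rand(3, 64, 64),
                              (torch.rand(5, 64, 64) > 0.5).float(),
                              enc, torch.randn(20, 512))
print(out.shape)   # torch.Size([5, 20])
```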
Contrastive language-image pre-training has achieved great success in learning visual-textual joint representations from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pre-training methods to the video domain remains an open problem. In this work, we present a simple yet effective approach that adapts pre-trained language-image models to video recognition directly, instead of pre-training a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be seamlessly plugged into pre-trained language-image models. Moreover, we propose a video-specific prompting scheme that leverages video content information to generate discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under the fully supervised setting, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400 while using 12 times less compute than Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://aka.ms/x-clip.
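A simplified stand-in for the cross-frame attention module is shown below: per-frame tokens attend to each other across time and the result is added back as a residual, so the module can be dropped into a pre-trained image backbone. Dimensions are illustrative, not the released X-CLIP code.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Per-frame tokens exchange information across time; residual keeps the backbone output."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens):              # (B, T, D) one token per frame
        x = self.norm(frame_tokens)
        msg, _ = self.attn(x, x, x)               # frames attend to each other
        return frame_tokens + msg

out = CrossFrameAttention()(torch.randn(2, 8, 512))
print(out.shape)   # torch.Size([2, 8, 512])
```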
The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.
Existing Temporal Action Detection (TAD) methods typically take a pre-processing step in converting an input varying-length video into a fixed-length snippet representation sequence, before temporal boundary estimation and action classification. This pre-processing step would temporally downsample the video, reducing the inference resolution and hampering the detection performance in the original temporal resolution. In essence, this is due to a temporal quantization error introduced during the resolution downsampling and recovery. This could negatively impact the TAD performance, but is largely ignored by existing methods. To address this problem, in this work we introduce a novel model-agnostic post-processing method without model redesign and retraining. Specifically, we model the start and end points of action instances with a Gaussian distribution for enabling temporal boundary inference at a sub-snippet level. We further introduce an efficient Taylor-expansion based approximation, dubbed as Gaussian Approximated Post-processing (GAP). Extensive experiments demonstrate that our GAP can consistently improve a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet (+0.2% -0.7% in average mAP) and THUMOS (+0.2% -0.5% in average mAP) benchmarks. Such performance gains are already significant and highly comparable to those achieved by novel model designs. Also, GAP can be integrated with model training for further performance gain. Importantly, GAP enables lower temporal resolutions for more efficient inference, facilitating low-resource applications. The code will be available in https://github.com/sauradip/GAP
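The sub-snippet refinement can be sketched under the stated Gaussian assumption: around the discrete peak of a boundary confidence curve, a Taylor expansion of the log-probability gives a continuous offset of roughly -f'(t)/f''(t). The function below uses central finite differences and is an illustration of the idea, not the released GAP implementation.

```python
import torch

def refine_boundary(probs, peak_idx, eps=1e-6):
    """probs: (T,) boundary confidences at snippet level -> float sub-snippet location."""
    t = int(peak_idx)
    if t <= 0 or t >= probs.numel() - 1:
        return float(t)                                    # cannot form central differences
    f = torch.log(probs.clamp_min(eps))
    d1 = 0.5 * (f[t + 1] - f[t - 1])                       # first derivative (central difference)
    d2 = f[t + 1] - 2 * f[t] + f[t - 1]                    # second derivative
    offset = 0.0 if abs(float(d2)) < eps else float(-d1 / d2)
    return t + max(-0.5, min(0.5, offset))                 # keep the shift within half a snippet

probs = torch.tensor([0.05, 0.20, 0.70, 0.95, 0.80, 0.30, 0.05])
print(refine_boundary(probs, probs.argmax()))              # ~3.14 instead of the integer 3
```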
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging the motion and audio that naturally exist in videos. We present \textbf{MOV}, a simple yet effective method for \textbf{M}ultimodal \textbf{O}pen-\textbf{V}ocabulary video classification. In MOV, we directly use the vision encoder of a pre-trained VLM, with minimal modifications, to encode video frames, optical flow, and audio spectrograms. We design a cross-modal fusion mechanism to aggregate the complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes while generalizing better to novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent VLM-based methods. Code and models will be released.
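The fusion step can be sketched as cross-attention from the pooled RGB representation to flow and audio tokens (all produced by the same visual encoder), followed by matching against class-name embeddings. Encoders are stubbed with random tensors and all shapes are assumptions, not MOV's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFusion(nn.Module):
    """The pooled RGB representation attends to flow/audio tokens and is matched to text."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, video_feat, extra_feats):
        # video_feat: (B, 1, D) pooled RGB; extra_feats: (B, M, D) flow/audio tokens
        fused, _ = self.attn(video_feat, extra_feats, extra_feats)
        return F.normalize((video_feat + fused).squeeze(1), dim=-1)   # (B, D)

fusion = CrossModalFusion()
video = torch.randn(2, 1, 512)                       # pooled RGB representation
flow_audio = torch.randn(2, 2, 512)                  # one flow token + one audio token
text = F.normalize(torch.randn(700, 512), dim=-1)    # class-name embeddings (e.g., 700 classes)
logits = fusion(video, flow_audio) @ text.t()
print(logits.shape)   # torch.Size([2, 700])
```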
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.