Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of their original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus eliminating the quadratic cost of standard cross-attention. Compared to existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/
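The abstract leaves the adapter computation implicit. The PyTorch sketch below shows one plausible way a small set of latent tokens can mediate audio-visual fusion with cost linear in the number of tokens; the module name, layer choices, and dimensions are assumptions for illustration, not the actual LAVISH implementation.

```python
import torch
import torch.nn as nn

class LatentBottleneckFusion(nn.Module):
    """Hypothetical sketch: a few latent tokens mediate cross-modal attention,
    so the cost grows linearly with the number of audio/visual tokens."""

    def __init__(self, dim=768, num_latents=8, num_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        # latents gather information from both modalities
        self.collect = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # the visual stream then reads the condensed latents back
        self.distribute = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, audio_tokens):
        b = visual_tokens.size(0)
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        both = torch.cat([visual_tokens, audio_tokens], dim=1)
        latents, _ = self.collect(latents, both, both)              # (b, L, d)
        fused_visual, _ = self.distribute(visual_tokens, latents, latents)
        return visual_tokens + fused_visual                         # residual update


# usage: fuse 196 visual tokens with 100 audio tokens through 8 latents
fusion = LatentBottleneckFusion()
v, a = torch.randn(2, 196, 768), torch.randn(2, 100, 768)
print(fusion(v, a).shape)  # torch.Size([2, 196, 768])
```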
We introduce an audio-visual method for long-range text-to-video retrieval. Unlike previous approaches designed for short-video retrieval (e.g., clips of 5-15 seconds in duration), our method aims to retrieve long videos that capture complex human actions. One challenge of standard video-only methods is the large computational cost associated with processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named ECLIPSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to the audio-visual video setting by adding a unified audio-visual transformer block that captures complementary cues from both the video and audio streams. In addition to being 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets, such as ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades.
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
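As a rough illustration of the bottleneck idea, the following sketch restricts each stream to attending over its own tokens plus a few shared bottleneck tokens, then averages the two modality-specific bottleneck updates; the layer choices and names are assumed for illustration and do not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """Illustrative sketch: each stream attends only to its own tokens plus a
    handful of shared bottleneck tokens, which carry all cross-modal information."""

    def __init__(self, dim=512, num_heads=8, num_bottlenecks=4):
        super().__init__()
        self.init_bottlenecks = nn.Parameter(torch.randn(num_bottlenecks, dim) * 0.02)
        self.audio_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens, bottlenecks=None):
        b, nb = audio_tokens.size(0), self.init_bottlenecks.size(0)
        if bottlenecks is None:  # first fusion layer uses the learned initialization
            bottlenecks = self.init_bottlenecks.unsqueeze(0).expand(b, -1, -1)

        a_out = self.audio_layer(torch.cat([audio_tokens, bottlenecks], dim=1))
        v_out = self.video_layer(torch.cat([video_tokens, bottlenecks], dim=1))

        audio_tokens, z_a = a_out[:, :-nb], a_out[:, -nb:]
        video_tokens, z_v = v_out[:, :-nb], v_out[:, -nb:]
        # average the two modality-specific updates of the shared bottleneck
        return audio_tokens, video_tokens, 0.5 * (z_a + z_v)


layer = BottleneckFusionLayer()
a, v, z = layer(torch.randn(2, 100, 512), torch.randn(2, 196, 512))
print(a.shape, v.shape, z.shape)  # (2, 100, 512) (2, 196, 512) (2, 4, 512)
```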
The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis and comparison of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in VidL model design. Among the factors that we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) pretraining and finetuning protocols, and (vi) dataset and model scaling. Our empirical study reveals that the most important design factors include: temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model trained using this recipe achieves results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo and 55.0% on ActivityNet, outperforming the current SOTA by 7.8% and 6.1%, respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at: https://github.com/klauscc/VindLU.
Leveraging temporal synchronization and association within vision and sound is an essential step towards robust localization of sounding objects. To this end, we propose a space-time memory network for sounding-object localization in videos. It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations from the audio and visual modalities. We demonstrate and analyze, both quantitatively and qualitatively, the effectiveness of incorporating spatio-temporal learning for localizing audio-visual objects, and we show that our approach generalizes across a variety of complex audio-visual scenes and outperforms recent state-of-the-art methods.
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, one promising approach adapts frozen autoregressive language models pretrained on web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.
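A minimal, text-only illustration of step (iii) using the Hugging Face `transformers` library is sketched below: candidate answers are scored from the logits at the [MASK] position. The `bert-base-uncased` checkpoint is a stand-in, and the visual conditioning through light trainable modules is omitted entirely.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Text-only sketch of zero-shot answering: the answer is read off the [MASK] slot.
name = "bert-base-uncased"  # stand-in checkpoint; the paper uses a larger BiLM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

prompt = "Question: what animal is jumping over the fence? Answer: " + tokenizer.mask_token + "."
candidates = ["dog", "cat", "horse", "car"]

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                              # (1, seq_len, vocab)

mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
scores = {c: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(c)].item()
          for c in candidates}
print(max(scores, key=scores.get))  # highest-scoring candidate answer
```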
Conventional audio-visual models have independent audio and video branches. We design a unified model for audio and video processing, called the Unified Audio-Visual Model (UAVM). In this paper, we describe UAVM, report its new state-of-the-art audio-visual event classification accuracy of 65.8% on VGGSound, and discuss intriguing properties of the model.
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks, and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to the fundamental concepts behind the success of Transformers, i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis of open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
We propose to explore a new problem called audio-visual segmentation (AVS), in which the goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame. To facilitate this research, we construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for sounding objects in audible videos. Two settings are studied with this benchmark: 1) semi-supervised audio-visual segmentation with a single sound source, and 2) fully-supervised audio-visual segmentation with multiple sound sources. To deal with the AVS problem, we propose a novel method that uses a temporal pixel-wise audio-visual interaction module to inject audio semantics as guidance for the visual segmentation process. We also design a regularization loss to encourage audio-visual mapping during training. Quantitative and qualitative experiments on AVSBench compare our approach with several existing methods from related tasks, demonstrating that the proposed method is promising for building a bridge between audio and pixel-wise visual semantics. Code is available at https://github.com/opennlplab/avsbench.
Current audio-visual separation methods share a standard architecture design where an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must finetune the entire visual and audio network for all musical instruments. We re-formulate the visual-sound separation task and propose Instrument as Query (iQuery) with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We utilize "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference at the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from text-prompt design, we insert an additional query as an audio prompt while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that our iQuery improves audio-visual sound source separation performance.
This paper presents a pure transformer-based approach, termed the Multi-Modal Video Transformer (MM-ViT), for video action recognition. Different from other schemes that solely utilize decoded RGB frames, MM-ViT operates exclusively in the compressed video domain and exploits all readily available modalities, i.e., I-frames, motion vectors, residuals, and audio waveforms. To handle the large number of spatiotemporal tokens extracted from multiple modalities, we develop several scalable model variants that factorize self-attention across the space, time, and modality dimensions. In addition, to further explore rich inter-modal interactions and their effects, we develop and compare three distinct cross-modal attention mechanisms that can be seamlessly integrated into the transformer building block. Extensive experiments on three public action recognition benchmarks (UCF-101, Something-Something-v2, Kinetics-600) demonstrate that MM-ViT outperforms state-of-the-art video transformers in both efficiency and accuracy, and performs as well as or better than state-of-the-art CNN counterparts that rely on computationally heavy optical flow.
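The factorization over space, time, and modality can be pictured with the following sketch, which applies self-attention along one axis at a time while folding the remaining axes into the batch; all names and dimensions are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    """Illustrative sketch: instead of joint attention over all space-time-modality
    tokens, attend along one axis at a time (space, then time, then modality)."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.space = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.modality = nn.MultiheadAttention(dim, heads, batch_first=True)

    @staticmethod
    def _attend(attn, x, axis):
        # move `axis` into the sequence position and fold the other axes into the batch
        perm = {1: (0, 2, 3, 1, 4), 2: (0, 1, 3, 2, 4), 3: (0, 1, 2, 3, 4)}[axis]
        y = x.permute(*perm).reshape(-1, x.shape[axis], x.shape[-1])
        y, _ = attn(y, y, y)
        y = y.reshape([x.shape[i] for i in perm])
        inv = [perm.index(i) for i in range(5)]
        return y.permute(*inv)

    def forward(self, tokens):            # (batch, modalities, time, space, dim)
        tokens = tokens + self._attend(self.space, tokens, axis=3)
        tokens = tokens + self._attend(self.time, tokens, axis=2)
        tokens = tokens + self._attend(self.modality, tokens, axis=1)
        return tokens

x = torch.randn(2, 4, 8, 49, 256)         # 4 modalities, 8 time steps, 49 patches
print(FactorizedAttention()(x).shape)     # torch.Size([2, 4, 8, 49, 256])
```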
This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries. The task is challenging because only overall labels indicating the video events are available for training. However, an event might be labeled yet not appear in one of the modalities, which leads to a modality-specific noisy-label problem. In this work, we propose a training strategy to dynamically identify and remove modality-specific noisy labels. It is motivated by two key observations: 1) networks tend to learn clean samples first; and 2) a labeled event appears in at least one modality. Specifically, we sort the losses of all instances within each modality individually, and then select noisy samples according to the relationship between intra-modal and inter-modal losses. In addition, we propose a simple yet effective noise-ratio estimation method that computes the proportion of instances whose confidence falls below a preset threshold. Our method yields large improvements over the previous state of the art (e.g., from 60.0% to 63.8% in the segment-level visual metric), demonstrating its effectiveness. Code and trained models are publicly available at https://github.com/mcg-nju/jomold.
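A simplified sketch of the two observations above, assuming per-instance losses and confidences are already available, might look as follows; the function names and the exact selection rule are hypothetical simplifications of the described strategy.

```python
import torch

def flag_noisy_labels(audio_loss, visual_loss, noise_ratio):
    """Hypothetical sketch of dynamic noisy-label removal.

    audio_loss, visual_loss: per-instance losses for one event class, shape (N,)
    noise_ratio: estimated fraction of modality-specific noisy labels
    Returns boolean masks marking labels to drop in each modality.
    """
    n_drop = int(noise_ratio * audio_loss.numel())
    drop_audio = torch.zeros_like(audio_loss, dtype=torch.bool)
    drop_visual = torch.zeros_like(visual_loss, dtype=torch.bool)
    if n_drop == 0:
        return drop_audio, drop_visual

    # networks fit clean samples first, so the highest-loss instances are suspects ...
    audio_suspects = audio_loss.topk(n_drop).indices
    visual_suspects = visual_loss.topk(n_drop).indices
    # ... but a labeled event appears in at least one modality, so only drop a label
    # in one modality when the other modality fits that instance comparatively well.
    drop_audio[audio_suspects] = audio_loss[audio_suspects] > visual_loss[audio_suspects]
    drop_visual[visual_suspects] = visual_loss[visual_suspects] > audio_loss[visual_suspects]
    return drop_audio, drop_visual


def estimate_noise_ratio(confidences, threshold=0.5):
    """Noise ratio as the fraction of labeled instances predicted with low confidence."""
    return (confidences < threshold).float().mean().item()


a = torch.tensor([0.1, 2.3, 0.2, 1.9])
v = torch.tensor([0.2, 0.1, 0.3, 2.0])
print(flag_noisy_labels(a, v, noise_ratio=0.5))
```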
Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.
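One round of the cross-modal supervision loop could be sketched as below, using scikit-learn k-means on precomputed audio features to produce pseudo-labels for the video branch; the feature dimensions, cluster count, and training loop are illustrative assumptions.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K = 64  # number of pseudo-classes (hypothetical)
video_head = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, K))
optimizer = torch.optim.SGD(video_head.parameters(), lr=0.01)

def xdc_round(audio_features, video_features):
    # 1) cluster the *audio* features of the unlabeled set
    pseudo = KMeans(n_clusters=K, n_init=10).fit_predict(audio_features.numpy())
    pseudo = torch.as_tensor(pseudo, dtype=torch.long)
    # 2) train the *video* branch to predict the audio cluster assignments
    loss = nn.functional.cross_entropy(video_head(video_features), pseudo)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    # the symmetric direction (video clusters supervising audio) is analogous
    return loss.item()

# toy call with random tensors standing in for real encoder outputs
print(xdc_round(torch.randn(256, 128), torch.randn(256, 512)))
```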
Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in both the auditory and visual modalities, fine-grained multimodal perception is essential for complete scene understanding. Most previous works attempt to analyze videos from a holistic perspective. However, they do not consider semantic information at multiple scales, which makes it difficult for the model to localize events of various lengths. In this paper, we present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing. Specifically, we first propose an attentive feature pyramid module. This module captures temporal pyramid features via several stacked pyramid units, each of which consists of a fixed-size attention block and a dilated convolution block. We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively. Extensive experiments on audio-visual event localization and weakly-supervised audio-visual video parsing tasks verify the effectiveness of our method.
In this paper, we consider the problem of audio-visual synchronization applied to videos "in the wild" (i.e., general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely VGG-Sound Sync. We compare a number of transformer-based architectural variants specifically designed to model audio and visual signals of arbitrary length, while significantly reducing memory requirements during training. We further conduct an in-depth analysis of the curated dataset and define an evaluation metric for open-domain audio-visual synchronization. We apply our method to the standard lip-reading speech benchmarks, LRS2 and LRS3, with ablations on various aspects. Finally, we set the first benchmark for general audio-visual synchronization with over 160 diverse classes in the new VGG-Sound Sync video dataset. In all cases, our proposed model outperforms the previous state of the art by a significant margin.
We present XKD, a novel self-supervised framework to learn meaningful representations from unlabelled video clips. XKD is trained with two pseudo tasks. First, masked data reconstruction is performed to learn modality-specific representations. Next, self-supervised cross-modal knowledge distillation is performed between the two modalities through teacher-student setups to learn complementary information. To identify the most effective information to transfer and also to tackle the domain gap between audio and visual modalities which could hinder knowledge transfer, we introduce a domain alignment strategy for effective cross-modal distillation. Lastly, to develop a general-purpose solution capable of handling both audio and visual streams, a modality-agnostic variant of our proposed framework is introduced, which uses the same backbone for both audio and visual modalities. Our proposed cross-modal knowledge distillation improves linear evaluation top-1 accuracy of video action classification by 8.4% on UCF101, 8.1% on HMDB51, 13.8% on Kinetics-Sound, and 14.2% on Kinetics400. Additionally, our modality-agnostic variant shows promising results in developing a general-purpose network capable of handling different data streams. The code is released on the project website.
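A hedged sketch of one distillation direction is given below: a learned alignment head maps the student's (e.g., video) features toward the frozen teacher's (e.g., audio) features under a cosine-distance objective. The alignment head and loss form are assumptions standing in for the paper's domain-alignment strategy, not its actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_step(student_feats, teacher_feats, align_head):
    """One direction of cross-modal distillation (sketch): the student branch is
    pulled toward the detached teacher representation through a learned alignment
    head intended to absorb part of the audio/visual domain gap."""
    pred = F.normalize(align_head(student_feats), dim=-1)
    target = F.normalize(teacher_feats.detach(), dim=-1)
    # cosine-distance objective; equals the MSE between unit-normalized vectors
    return (2 - 2 * (pred * target).sum(dim=-1)).mean()

align_head = nn.Linear(512, 512)  # hypothetical alignment head
loss = distill_step(torch.randn(8, 512), torch.randn(8, 512), align_head)
print(loss.item())
```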
We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries. The task is especially challenging since it is weakly supervised with only event labels available as a bag of labels for each video. An existing state-of-the-art model for AVVP uses a hybrid attention network (HAN) to generate cross-modal features for both audio and visual modalities, and an attentive pooling module that aggregates predicted audio and visual segment-level event probabilities to yield video-level event probabilities. We provide a detailed analysis of modality bias in the existing HAN architecture, where a modality is completely ignored during prediction. We also propose a variant of feature aggregation in HAN that leads to an absolute gain in F-scores of about 2% and 1.6% for visual and audio-visual events at both segment-level and event-level, in comparison to the existing HAN model.
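For intuition, the attentive pooling step can be sketched as a MIL-style module that turns segment-level event probabilities into video-level probabilities with learned temporal attention; this single-modality simplification uses hypothetical names and omits the cross-modal HAN features.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Sketch of MIL-style attentive pooling: segment-level event probabilities
    are aggregated into video-level probabilities with learned attention weights."""

    def __init__(self, dim, num_classes):
        super().__init__()
        self.att = nn.Linear(dim, num_classes)   # one attention score per class
        self.cls = nn.Linear(dim, num_classes)   # segment-level event logits

    def forward(self, segment_feats):            # (batch, T, dim)
        seg_prob = self.cls(segment_feats).sigmoid()      # (B, T, C)
        att = self.att(segment_feats).softmax(dim=1)      # attention over time
        video_prob = (att * seg_prob).sum(dim=1)          # (B, C)
        return seg_prob, video_prob

pool = AttentivePooling(dim=256, num_classes=25)
seg_p, vid_p = pool(torch.randn(4, 10, 256))
print(seg_p.shape, vid_p.shape)   # (4, 10, 25) (4, 25)
```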
This paper explores the task of temporal video grounding (TVG), in which, given an untrimmed video and a query sentence, the goal is to recognize and determine the temporal boundaries of the action instances in the video described by the provided natural-language query. Recent works have tackled this task by directly encoding the query with large pre-trained language models (PLMs). However, it is difficult to isolate the impact of these improved language representations, as these works also propose improvements to the visual inputs. Furthermore, such PLMs significantly increase the computational cost of training TVG models. This paper therefore studies the effect of PLMs on the TVG task and assesses the applicability of parameter-efficient training alternatives from NLP based on adapters. We couple popular PLMs with a selection of existing approaches and test different adapters to reduce the impact of the additional parameters. Our results on three challenging datasets show that TVG models benefit greatly from PLMs when these are fine-tuned for the task, and that adapters are an effective alternative to full fine-tuning even though they were not tailored to our task. Concretely, adapters help save computational cost, enabling PLM integration into larger TVG models and delivering results comparable to state-of-the-art models. Finally, by benchmarking different types of adapters in TVG, our results shed light on which adapters work best for each studied case.
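For reference, a standard bottleneck adapter of the kind evaluated in such studies is sketched below: a down-projection, nonlinearity, and zero-initialized up-projection inserted with a residual connection, so that only a small fraction of the PLM's parameters is trained; dimensions are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard bottleneck adapter: only these few parameters are trained while
    the surrounding pre-trained language model layer stays frozen."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden):           # (batch, seq, dim)
        return hidden + self.up(self.act(self.down(hidden)))

# roughly 0.1M trainable parameters per adapter vs. ~110M in a BERT-base PLM
adapter = Adapter()
print(sum(p.numel() for p in adapter.parameters()))
```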
Active speaker detection and speech enhancement have become increasingly attractive topics in audio-visual scene understanding. Owing to their respective characteristics, independently designed architectures have been widely adopted for each individual task. This can make the representations learned by a model task-specific and inevitably limits the generalization ability of features based on multi-modal modeling. Recent studies have shown that establishing cross-modal relationships between the auditory and visual streams is a promising solution to the challenge of audio-visual multi-task learning. Motivated by bridging the multi-modal associations in audio-visual tasks, in this study we propose a unified framework that achieves target speaker detection and speech enhancement by jointly learning an audio-visual model.
Multi-modal learning from video data has recently received increasing attention, as it allows training semantically meaningful embeddings without human annotation, enabling tasks such as zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation that aggregates multi-modal temporal information in a single embedding. We propose to train the system with a combinatorial loss over single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as positional or modality encodings. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow it to handle inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.