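In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training on the Japanese videos alone. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.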
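Video retrieval has seen tremendous progress with the development of vision models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose a framework, MKTVR, that utilizes knowledge transfer from a multilingual model to boost the performance of video retrieval. We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual video-text pairs. We then use this data to learn a video-text representation in which English and non-English text queries are represented in a common embedding space based on pretrained multilingual models. We evaluate our proposed approach on four English video retrieval datasets: MSRVTT, MSVD, DiDeMo, and Charades. Experimental results demonstrate that our approach achieves state-of-the-art results on all datasets, outperforming previous models. Finally, we also evaluate our model on a multilingual video retrieval dataset encompassing six languages and show that our model outperforms previous multilingual video retrieval models in a zero-shot setting.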
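Machines that can represent and describe environmental sounds have practical potential, e.g., for audio tagging and captioning systems. Prevailing learning paradigms have relied on parallel audio-text data, which is, however, scarcely available on the web. We propose VIP-ANT, which induces Audio-Text alignment without using any parallel audio-text data. Our key idea is to share the image modality between bi-modal image-text representations and bi-modal image-audio representations; the image modality functions as a pivot and connects audio and text in a tri-modal embedding space. In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1. We further investigate cases of minimal audio-text supervision, finding that, e.g., just a few hundred supervised audio-text pairs increase the zero-shot audio classification accuracy by 8% on US8K. However, to match human parity, our empirical scaling experiments suggest that we would need about $2^{21} \approx 2\mathrm{M}$ supervised audio-caption pairs. Our work opens up new avenues for learning audio-text connections with little to no parallel audio-text data.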
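Pre-training on large-scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, because speech is used to supervise the pre-training, it is never seen by the video encoder, which therefore does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2, and Condensed Movies datasets.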
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone. Our dataset, code and models are publicly available [1].
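Retrieving target videos based on text descriptions is a task of great practical value and has received increasing attention over the past few years. In this paper, we focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided for searching over the video archive. We first show that the multi-query retrieval task is more pragmatic and representative of real-world use cases and better evaluates the retrieval capabilities of current models, and therefore deserves further investigation alongside the more prevalent single-query retrieval setup. We then propose several new methods for leveraging multiple queries at training time, to improve over simply combining the similarity outputs of multiple queries from regular single-query trained models. Our models consistently outperform several competitive baselines on three different datasets. For instance, Recall@1 can be improved by 4.7 points on MSR-VTT, 4.1 points on MSVD, and 11.7 points on VATEX over a strong baseline built on the state-of-the-art CLIP4Clip model. We believe further modeling efforts will bring new insights to this direction and lead to new systems that perform better in real-world video retrieval applications. Code is available at https://github.com/princetonvisualai/mqvr.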
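The baseline mentioned above, simply combining the similarity outputs of multiple queries, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `mean_similarity_ranking` and the choice of averaging as the combination rule are assumptions made for this example.

```python
def mean_similarity_ranking(sims):
    """Rank videos by the mean similarity over several queries.

    sims[q][v] is the similarity between query q and video v, as produced
    by any single-query retrieval model (hypothetical input format).
    Returns video indices sorted from best to worst match.
    """
    num_queries = len(sims)
    num_videos = len(sims[0])
    # Average each video's score across all queries describing the target.
    mean_scores = [
        sum(row[v] for row in sims) / num_queries for v in range(num_videos)
    ]
    return sorted(range(num_videos), key=lambda v: mean_scores[v], reverse=True)


# Two queries, three candidate videos.
sims = [
    [0.2, 0.9, 0.1],
    [0.8, 0.7, 0.3],
]
ranking = mean_similarity_ranking(sims)
```

In this toy example the second video ranks first because it scores well under both queries, even though the first video is the top match for one of them.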
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
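A contrastive loss for a dual encoder can be sketched as below. This is a minimal, framework-free illustration under stated assumptions: the abstract only says "a contrastive loss", so the symmetric image-to-text plus text-to-image form, the temperature value, and the function names are choices made for this example rather than details taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    top = max(row)
    log_sum = top + math.log(sum(math.exp(x - top) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb[k] and text_emb[k] are assumed to be a matching pair; every
    other combination in the batch is treated as a negative.
    """
    images = [_normalize(v) for v in image_emb]
    texts = [_normalize(v) for v in text_emb]
    n = len(images)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: the same, over columns.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(columns[k])[k] for k in range(n)) / n
    return i2t + t2i


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image is closest to its own text (`aligned`) and large when the pairing is wrong (`shuffled`), which is the signal that pulls matching pairs together in the shared embedding space.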
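Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks such as zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and to integrate them into a joint multi-modal representation, yielding an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss over single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow inputs of different lengths to be processed. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.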
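Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English-language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together, by both aggregating pre-existing datasets and creating new ones, visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of available unlabelled textual data for pretraining, and only weakly by the typological distance between target and source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.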
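We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream VL tasks. Existing works either extract low-quality video features or learn limited text embeddings, while neglecting that high-resolution videos and diversified semantics can significantly improve cross-modality learning. In this paper, we propose a novel High-resolution and Diversified VIdeo-LAnguage pre-training model (HD-VILA) for many visual tasks. In particular, we collect a large dataset with two distinct properties: 1) it is the first high-resolution dataset, including 371.5k hours of 720p videos, and 2) it is the most diversified dataset, covering 15 popular YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA model with a hybrid Transformer that learns rich spatiotemporal features and a multimodal Transformer that enforces interactions of the learned video features with diversified texts. Our pre-trained model achieves new state-of-the-art results on 10 VL understanding tasks and 2 novel text-to-visual generation tasks. For example, we obtain relative increases of 38.5% R@1 on the zero-shot MSR-VTT text-to-video retrieval task and 53.6% on the high-resolution dataset LSMDC. The learned VL embedding is also effective at generating visually pleasing and semantically relevant results in text-to-visual manipulation and super-resolution tasks.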
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
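Despite recent progress in the field of cross-modal retrieval, less research has focused on low-resource languages because of the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use machine translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, corrupting the textual embeddings and thereby compromising retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets that provide direct supervision from a similarity-based view and a feature-based view. In addition, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancies between original sentences and back-translated sentences to further improve the noise robustness of the textual encoder. Extensive experiments are conducted on three video-text and image-text cross-modal retrieval benchmarks across different languages, and the results demonstrate that our method significantly improves the overall performance without using extra human-labeled data. Moreover, when equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/huiguanlab/nrccr.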
This work explores an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, cross-frame attention layer or perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates \(N\) token embeddings per frame for a total of \(T\) video frames. We flatten the \(N \times T\) token embeddings into a long sequence that serves as a frozen video representation and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights including pooling layers are directly loaded from an image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, Youcook2). Our approach establishes a simple and effective video-text baseline for future research.
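The flattening step described above can be illustrated with a small sketch. It is not the VideoCoCa implementation: `flatten_frames` and `attention_pool` are hypothetical helper names, and the single-query softmax pooling is a simplified stand-in for CoCa's attentional pooling layers, used here only to show that a pooler defined over a token sequence can consume \(N \times T\) flattened tokens unchanged.

```python
import math


def flatten_frames(frame_tokens):
    """Turn T frames of N tokens each into one sequence of N * T tokens.

    frame_tokens[t][n] is the D-dimensional embedding of token n in frame t.
    """
    return [token for frame in frame_tokens for token in frame]


def attention_pool(tokens, query):
    """Pool a token sequence into one vector with single-query attention.

    Simplified stand-in for an attentional pooling layer: softmax over
    scaled dot products between the query and every token.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, tok)) / scale for tok in tokens]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(query)
    return [
        sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
        for d in range(dim)
    ]


# T = 2 frames, N = 3 tokens per frame, D = 2 dimensions.
frames = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
]
video_tokens = flatten_frames(frames)  # sequence of N * T = 6 tokens
pooled = attention_pool(video_tokens, query=[0.0, 0.0])
```

Because the pooler only sees a list of tokens, nothing in it depends on whether those tokens came from one image or from many frames, which is the property the abstract relies on.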
Figure 1: We describe an efficient approach to learn visual representations from misaligned and noisy narrations (bottom) automatically extracted from instructional videos (top). Our video representations are learnt from scratch without relying on any manually annotated visual dataset yet outperform all self-supervised and many fully-supervised methods on several video recognition benchmarks.
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network: a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, Audioset and ESC-50 when compared to previous self-supervised work. Our models are publicly available [1, 2, 3]. 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of offline videos and textual query sentences. However, in real scenarios, online videos are frequently accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This inspires us to generate associated captions from offline videos to help with existing text-video retrieval methods. To do so, we propose to use the zero-shot video captioner with knowledge of pre-trained web-scale models (e.g., CLIP and GPT-2) to generate captions for offline videos without any training. Given the captions, one question naturally arises: what can auxiliary captions do for text-video retrieval? In this paper, we present a novel framework Cap4Video, which makes use of captions from three aspects: i) Input data: The video and captions can form new video-caption pairs as data augmentation for training. ii) Feature interaction: We perform feature interaction between video and caption to yield enhanced video representations. iii) Output score: The Query-Caption matching branch can be complementary to the original Query-Video matching branch for text-video retrieval. We conduct thorough ablation studies to demonstrate the effectiveness of our method. Without any post-processing, our Cap4Video achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%).
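Using natural language as supervision for training visual recognition models holds great promise. Recent works have shown that if such supervision is used in the form of alignment between images and captions in large training datasets, the resulting aligned models perform well on zero-shot classification as a downstream task. In this paper, we focus on teasing out which parts of the language supervision are essential for training zero-shot image classification models. Through extensive and careful experiments, we show that: 1) A simple Bag-of-Words (BoW) caption can be used as a replacement for most of the image captions in the dataset. Surprisingly, we observe that this approach improves zero-shot classification performance when combined with word balancing. 2) Using a BoW pretrained model, we can obtain more training data by generating pseudo-BoW captions for images that do not have a caption. Models trained on images with real and pseudo-BoW captions achieve stronger zero-shot performance. On ImageNet-1k zero-shot evaluation, our best model, which uses only 3M image-caption pairs, performs on par with a CLIP model trained on 15M image-caption pairs (31.5% vs 31.3%).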
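As a rough illustration of what replacing a caption with a Bag-of-Words caption might look like, the sketch below discards word order and repetition. The tokenization rule, the alphabetical ordering, and the function name `bow_caption` are assumptions made for this example; the abstract does not specify how the authors build their BoW captions or how word balancing is applied.

```python
import re


def bow_caption(caption):
    """Reduce a caption to an unordered bag of its distinct words.

    Assumed procedure for illustration: lowercase, keep alphanumeric
    tokens, drop duplicates, and sort so the result is order-independent.
    """
    words = re.findall(r"[a-z0-9]+", caption.lower())
    return " ".join(sorted(set(words)))


original = "A dog chases a red ball"
reordered = "a red ball, a dog chases"
print(bow_caption(original))
```

Two captions that contain the same words in a different order, such as `original` and `reordered`, map to the same BoW caption, so only the vocabulary of the caption is left as a supervision signal.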
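Pre-trained image-text models, such as CLIP, have demonstrated the strong power of visual representations learned from large-scale web-collected image-text data. In light of these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to utilize an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) remains under-explored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a large impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video retrieval. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/xpretrain/tree/main/main/main/clip-vip.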
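This work is on training a generative action/video recognition model whose output is a free-form, action-specific caption describing the video (rather than an action class label). A generative approach has practical advantages, such as producing more fine-grained and human-readable output and being naturally open-world. To this end, we propose to adapt a pre-trained generative Vision & Language (V&L) foundation model for video/action recognition. While there have recently been a few attempts to adapt V&L models trained with contrastive learning (e.g., CLIP) for video/action recognition, to the best of our knowledge we propose the first method that sets out to accomplish this goal for a generative model. We first show that direct fine-tuning of a generative model to produce action classes suffers from severe overfitting. To alleviate this, we introduce REST, a training framework consisting of two key components: (a) an unsupervised method for adapting the generative model to action/video by means of pseudo-caption generation and self-training, i.e., without using any action-specific labels; and (b) a retrieval approach based on CLIP for discovering a diverse set of pseudo-captions for each video to train the model. Importantly, we show that both components are necessary to obtain high accuracy. We evaluate REST on the problem of zero-shot action recognition, where we show that our approach is very competitive when compared to contrastive learning-based methods. Code will be made available.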
The canonical approach to video-and-language learning (e.g., video question answering) dictates a neural model to learn from offline-extracted dense video features from vision models and text features from language models. These feature extractors are trained independently and usually on tasks different from the target domains, rendering these fixed features sub-optimal for downstream tasks. Moreover, due to the high computational overload of dense video features, it is often difficult (or infeasible) to plug feature extractors directly into existing approaches for easy finetuning. To provide a remedy to this dilemma, we propose a generic framework CLIPBERT that enables affordable end-to-end learning for video-and-language tasks, by employing sparse sampling, where only a single or a few sparsely sampled short clips from a video are used at each training step. Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that CLIPBERT outperforms (or is on par with) existing methods that exploit full-length videos, suggesting that end-to-end learning with just a few sparsely sampled clips is often more accurate than using densely extracted offline features from full-length videos, proving the proverbial less-is-more principle. Videos in the datasets are from considerably different domains and lengths, ranging from 3-second generic-domain GIF videos to 180-second YouTube human activity videos, showing the generalization ability of our approach. Comprehensive ablation studies and thorough analyses are provided to dissect what factors lead to this success. Our code is publicly available.
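Sparse sampling, as described above, means drawing only a few short clips from a video at each training step instead of processing the whole video. The sketch below shows one way this could be done; the function name `sample_sparse_clips`, uniform random placement of clip starts, and the parameter values are assumptions for illustration and are not taken from the CLIPBERT code.

```python
import random


def sample_sparse_clips(num_frames, num_clips, clip_len, rng=None):
    """Pick a few short clips from a video as (start, end) frame ranges.

    Each clip covers clip_len consecutive frames; start positions are drawn
    uniformly at random (an assumed strategy). `end` is exclusive.
    """
    if clip_len > num_frames:
        raise ValueError("clip_len must not exceed num_frames")
    rng = rng or random.Random()
    last_start = num_frames - clip_len
    starts = sorted(rng.randrange(0, last_start + 1) for _ in range(num_clips))
    return [(start, start + clip_len) for start in starts]


# A 1000-frame video: each training step would see only 2 clips of 16 frames.
clips = sample_sparse_clips(1000, num_clips=2, clip_len=16, rng=random.Random(0))
```

Calling the function again at the next training step yields different clips, so over many steps the model sees much of the video while each step stays cheap enough for end-to-end training.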