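Video retrieval has made tremendous progress with the development of vision models. However, further improving these models requires additional labelled data, which is a huge manual effort. In this paper, we propose MKTVR, a framework that uses knowledge transfer from a multilingual model to boost video retrieval performance. We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual video-text pairs. We then use this data to learn a video-text representation in which English and non-English text queries are represented in a common embedding space based on pretrained multilingual models. We evaluate the proposed approach on four English video retrieval datasets: MSRVTT, MSVD, DiDeMo, and Charades. Experimental results show that our approach achieves state-of-the-art results on all datasets, outperforming previous models. Finally, we also evaluate our model on a multilingual video retrieval dataset covering six languages and show that it outperforms previous multilingual video retrieval models in a zero-shot setting.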
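Despite recent progress in cross-modal retrieval, low-resource languages have received less research attention because manually annotated datasets are lacking. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use machine translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, because MT is not perfect, it tends to introduce noise during translation, which corrupts the textual embeddings and thereby hurts retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets from a similarity-based view and a feature-based view. In addition, inspired by back-translation in unsupervised MT, we minimize the semantic discrepancy between original sentences and back-translated sentences to further improve the noise robustness of the textual encoder. Extensive experiments on three video-text and image-text cross-modal retrieval benchmarks across different languages show that our method significantly improves overall performance without using extra human-labeled data. Moreover, when equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/huiguanlab/nrccr.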
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of offline videos and textual query sentences. However, in real scenarios, online videos are frequently accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This inspires us to generate associated captions from offline videos to help with existing text-video retrieval methods. To do so, we propose to use the zero-shot video captioner with knowledge of pre-trained web-scale models (e.g., CLIP and GPT-2) to generate captions for offline videos without any training. Given the captions, one question naturally arises: what can auxiliary captions do for text-video retrieval? In this paper, we present a novel framework Cap4Video, which makes use of captions from three aspects: i) Input data: The video and captions can form new video-caption pairs as data augmentation for training. ii) Feature interaction: We perform feature interaction between video and caption to yield enhanced video representations. iii) Output score: The Query-Caption matching branch can be complementary to the original Query-Video matching branch for text-video retrieval. We conduct thorough ablation studies to demonstrate the effectiveness of our method. Without any post-processing, our Cap4Video achieves state-of-the-art performance on MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%).
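As a rough illustration of the output-score idea in (iii), the sketch below blends a query-video similarity with a query-caption similarity through a weighted sum. The abstract does not say how the two branches are combined, so the weighted sum, the weight `alpha`, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def fused_score(query, video, caption, alpha=0.5):
    """Blend query-video and query-caption similarity.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    return alpha * cosine(query, video) + (1.0 - alpha) * cosine(query, caption)


def rank_videos(query, videos, captions, alpha=0.5):
    """Return candidate indices sorted from best to worst fused score."""
    scores = [fused_score(query, v, c, alpha) for v, c in zip(videos, captions)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Setting `alpha=1.0` recovers a plain query-video ranking, which makes it easy to check whether the caption branch changes the ordering at all.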
Recent cross-lingual cross-modal works attempt to extend Vision-Language Pre-training (VLP) models to non-English inputs and achieve impressive performance. However, these models focus only on understanding tasks utilizing encoder-only architecture. In this paper, we propose ERNIE-UniX2, a unified cross-lingual cross-modal pre-training framework for both generation and understanding tasks. ERNIE-UniX2 integrates multiple pre-training paradigms (e.g., contrastive learning and language modeling) based on encoder-decoder architecture and attempts to learn a better joint representation across languages and modalities. Furthermore, ERNIE-UniX2 can be seamlessly fine-tuned for varieties of generation and understanding downstream tasks. Pre-trained on both multilingual text-only and image-text datasets, ERNIE-UniX2 achieves SOTA results on various cross-lingual cross-modal generation and understanding tasks such as multimodal machine translation and multilingual visual question answering.
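Modern video-text retrieval frameworks basically consist of three parts: a video encoder, a text encoder, and a similarity head. With the success of both visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in video-text retrieval. In this report, we present CLIP2TV, which aims to explore where the critical elements lie in transformer-based methods. To achieve this, we first revisit some recent works on multi-modal learning, then introduce several techniques into video-text retrieval, and finally evaluate them through extensive experiments in different configurations. Notably, CLIP2TV achieves 52.9@R1 on the MSR-VTT dataset, outperforming the previous SOTA result by 4.1%.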
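Video-text retrieval is a class of cross-modal representation learning problems, where the goal is to select the video that corresponds to a given text query from a pool of candidate videos. The contrastive paradigm of vision-language pretraining has shown promising success with large-scale datasets and unified transformer architectures, and has demonstrated the power of a joint latent space. Nevertheless, the intrinsic divergence between the visual domain and the textual domain has not been eliminated, and projecting different modalities into a joint latent space may distort the information inside each single modality. To overcome this issue, we present a novel mechanism that learns to translate from a source modality space $\mathcal{S}$ to a target modality space $\mathcal{T}$, bridging the gap between the visual and textual domains. Furthermore, to keep cycle consistency between translations, we adopt a cycle loss involving both the forward translation from $\mathcal{S}$ to the predicted target space $\mathcal{T'}$ and the backward translation from $\mathcal{T'}$ back to $\mathcal{S}$. Extensive experiments on the MSR-VTT, MSVD, and DiDeMo datasets demonstrate the superiority and effectiveness of our LaT approach compared with vanilla state-of-the-art methods.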
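In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training solely on the Japanese videos. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.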
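Pre-trained image-text models such as CLIP have demonstrated the strong power of vision-language representations learned from large-scale web-collected image-text data. In light of these well-learned visual features, some existing works transfer image representations to the video domain and achieve good results. However, how to use an image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still under-explored. In this paper, we investigate two questions: 1) what factors hinder post-pretraining CLIP from further improving performance on video-language tasks? and 2) how can the impact of these factors be mitigated? Through a series of comparative experiments and analyses, we find that the data scale and the domain gap between language sources have a large impact. Motivated by these findings, we propose an Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/xpretrain/tree/main/main/main/clip-vip.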
This work explores an efficient approach to establish a foundational video-text model for tasks including open-vocabulary video classification, text-to-video retrieval, video captioning and video question-answering. We present VideoCoCa, which reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules (for example, a cross-frame attention layer or a perceiver resampler) and finetune the modified architecture on video-text data, we surprisingly find that the generative attentional pooling and contrastive attentional pooling layers in the image-text CoCa design are instantly adaptable to "flattened frame embeddings", yielding a strong zero-shot transfer baseline for many video-text tasks. Specifically, the frozen image encoder of a pretrained image-text CoCa takes each video frame as input and generates \(N\) token embeddings per frame for a total of \(T\) video frames. We flatten the \(N \times T\) token embeddings into a long sequence that serves as a frozen video representation and apply CoCa's generative attentional pooling and contrastive attentional pooling on top. All model weights, including the pooling layers, are directly loaded from an image-text CoCa pretrained model. Without any video or video-text data, VideoCoCa's zero-shot transfer baseline already achieves state-of-the-art results on zero-shot video classification on Kinetics 400/600/700, UCF101, HMDB51, and Charades, as well as zero-shot text-to-video retrieval on MSR-VTT and ActivityNet Captions. We also explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question-answering (iVQA, MSRVTT-QA, MSVD-QA) and video captioning (MSR-VTT, ActivityNet, Youcook2). Our approach establishes a simple and effective video-text baseline for future research.
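The flattening step described in this abstract can be sketched in a few lines: per-frame token embeddings are concatenated into one sequence of length N x T, and a pooling query attends over that sequence. This is a simplified stand-in, assuming a single fixed query vector and plain dot-product attention; the actual CoCa poolers use learned queries and projections, and the function names here are hypothetical.

```python
import math


def flatten_frames(frame_tokens):
    """Turn T frames of N token embeddings each into one sequence of N*T tokens."""
    return [token for frame in frame_tokens for token in frame]


def attention_pool(query, tokens):
    """Single-query softmax attention over tokens; returns one pooled vector.

    A simplified stand-in for an attentional pooler (no learned projections).
    """
    scale = math.sqrt(len(query))
    logits = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    dim = len(tokens[0])
    return [
        sum(w * tok[d] for w, tok in zip(weights, tokens)) / total
        for d in range(dim)
    ]
```

Because the pooler only sees a token sequence, the same pooling code works whether the tokens come from one image or from many flattened frames, which is the property the abstract relies on.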
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without
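Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English-language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together, by both aggregating pre-existing datasets and creating new ones, visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of unlabelled textual data available for pretraining, and only weakly by the typological distance between target and source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.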
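Retrieving target videos based on text descriptions is a task of great practical value and has received increasing attention over the past few years. In this paper, we focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided to the model for searching over the video archive. We first show that the multi-query retrieval task is more pragmatic, is representative of real-world use cases, and better evaluates the retrieval capabilities of current models, and therefore deserves further investigation alongside the more prevalent single-query retrieval setup. We then propose several new methods for leveraging multiple queries at training time, improving over simply combining the similarity outputs of multiple queries from regular single-query trained models. Our models consistently outperform several competitive baselines on three different datasets. For instance, Recall@1 can be improved by 4.7 points on MSR-VTT, 4.1 points on MSVD, and 11.7 points on VATEX over a strong baseline built on the state-of-the-art CLIP4Clip model. We believe further modeling efforts will bring new insights to this direction and spark new systems that perform better in real-world video retrieval applications. Code is available at https://github.com/princetonvisualai/mqvr.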
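The baseline this abstract improves on, combining the similarity outputs of several queries from a single-query model, can be sketched as follows. Averaging is only one possible combination rule and is an assumption here, as are the function names and the use of a plain dot product as the similarity.

```python
def dot(a, b):
    """Dot-product similarity between two equal-length embeddings."""
    return sum(x * y for x, y in zip(a, b))


def multi_query_scores(queries, videos):
    """Mean query-video similarity per video across all queries for one target."""
    return [sum(dot(q, v) for q in queries) / len(queries) for v in videos]


def recall_at_1(scores, target_index):
    """Return 1 if the top-scoring video is the target, else 0."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return int(best == target_index)
```

Averaging `recall_at_1` over a test set gives the Recall@1 figure that the abstract reports improvements on.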
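We introduce an audiovisual method for long-range text-to-video retrieval. Unlike previous approaches designed for short video retrieval (e.g., 5-15 seconds in duration), our approach aims to retrieve long videos that capture complex human actions. One challenge of standard video-only approaches is the large computational cost of processing hundreds of densely extracted frames from such long videos. To address this issue, we propose to replace parts of the video with compact audio cues that succinctly summarize dynamic audio events and are cheap to process. Our method, named ECLIPSE (Efficient CLIP with Sound Encoding), adapts the popular CLIP model to an audiovisual video setting by adding a unified audiovisual transformer block that captures complementary cues from the video and audio streams. In addition to being 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches, our method also achieves better text-to-video retrieval accuracy on several diverse long-range video datasets such as ActivityNet, QVHighlights, YouCook2, DiDeMo, and Charades.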
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and lead to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
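A dual-encoder contrastive objective of the kind mentioned in this abstract is commonly written as a symmetric cross-entropy over in-batch similarities. The sketch below is a minimal pure-Python version under stated assumptions: the embeddings are already L2-normalized, matching pairs share an index, the temperature is a fixed hypothetical value, and the two directions are summed. It is not taken from the paper's implementation.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax of a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric image-to-text and text-to-image cross-entropy over a batch.

    Pair i of image_embs matches pair i of text_embs; all other pairs in the
    batch act as negatives. temperature=0.1 is an assumed value.
    """
    n = len(image_embs)
    sims = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text: each row should put its mass on the diagonal entry.
    i2t = -sum(log_softmax_row(sims[i])[i] for i in range(n)) / n
    # Text-to-image: same computation over the columns.
    columns = [[sims[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax_row(columns[j])[j] for j in range(n)) / n
    return i2t + t2i
```

The loss is close to zero when each image is most similar to its own text and grows when the pairing is shuffled, which is what pulls matching pairs together in the shared embedding space.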
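Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks such as zero-shot retrieval and classification. In this work, we present a multi-modal, modality-agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrates them into a joined multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss over single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow processing inputs of different lengths. To evaluate the proposed approach, we train the model on the large-scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets, obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.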
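Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, to produce multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs, which predict text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), cross-modal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and a new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.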
We present a new approach to perform zero-shot cross-modal transfer between speech and text for translation tasks. Multilingual speech and text are encoded in a joint fixed-size representation space. Then, we compare different approaches to decode these multimodal and multilingual fixed-size representations, enabling zero-shot translation between languages and modalities. All our models are trained without the need of cross-modal labeled translation data. Despite a fixed-size representation, we achieve very competitive results on several text and speech translation tasks. In particular, we significantly improve the state-of-the-art for zero-shot speech translation on Must-C. Incorporating a speech decoder in our framework, we introduce the first results for zero-shot direct speech-to-speech and text-to-speech translation.
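Pre-training a model to learn transferable video-text representations for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to let video interact with text, but this is inefficient because each text-video pair needs to be fed into the model. In this work, we enable fine-grained video-text interactions for retrieval via a novel pretext task, dubbed Multiple Choice Questions (MCQ), in which a parametric module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features can be properly established. BridgeFormer can be removed for downstream retrieval, leaving an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets with different experimental setups (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.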
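State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to obtain good performance on a variety of downstream tasks. Generally, such models are either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.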
Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively cluster correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations for frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.