我们可以训练一个能够处理多个模态和数据集的单个变压器模型,同时分享几乎所有的学习参数?我们呈现Polyvit,一种培训的模型,在图像,音频和视频上接受了讲述这个问题。通过在单一的方式上培训不同的任务,我们能够提高每个任务的准确性,并在5个标准视频和音频分类数据集中实现最先进的结果。多种模式和任务上的共同训练Polyvit会导致一个更具参数效率的模型,并学习遍历多个域的表示。此外,我们展示了实施的共同培训和实用,因为我们不需要调整数据集的每个组合的超级参数,但可以简单地调整来自标准的单一任务培训。
translated by 谷歌翻译
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
translated by 谷歌翻译
我们呈现了基于纯变压器的视频分类模型,在图像分类中最近的近期成功进行了借鉴。我们的模型从输入视频中提取了时空令牌,然后由一系列变压器层编码。为了处理视频中遇到的令牌的长序列,我们提出了我们模型的几种有效的变体,它们将输入的空间和时间维构建。虽然已知基于变换器的模型只有在可用的大型训练数据集时才有效,但我们展示了我们如何在训练期间有效地规范模型,并利用预先训练的图像模型能够在相对小的数据集上训练。我们进行彻底的消融研究,并在包括动力学400和600,史诗厨房,东西的多个视频分类基准上实现最先进的结果,其中 - 基于深度3D卷积网络的现有方法表现出优先的方法。为了促进进一步的研究,我们在https://github.com/google-research/scenic/tree/main/scenic/projects/vivit发布代码
translated by 谷歌翻译
视频理解需要在多种时空分辨率下推理 - 从短的细粒度动作到更长的持续时间。虽然变压器架构最近提出了最先进的,但它们没有明确建模不同的时空分辨率。为此,我们为视频识别(MTV)提供了多视图变压器。我们的模型由单独的编码器组成,表示输入视频的不同视图,以横向连接,以跨视图熔断信息。我们对我们的模型提供了彻底的消融研究,并表明MTV在一系列模型尺寸范围内的准确性和计算成本方面始终如一地表现优于单视对应力。此外,我们在五个标准数据集上实现最先进的结果,并通过大规模预制来进一步提高。我们将释放代码和备用检查点。
translated by 谷歌翻译
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
translated by 谷歌翻译
转移学习是用于训练小型目标数据集深层网络的主要范式。通常在大型``上游''数据集上预估计用于分类的模型,因为此类标签易于收集,然后在``下游''任务(例如动作本地化)上进行了填充,这些任务由于其较细粒度的注释而较小。在本文中,我们质疑这种方法,并提出共同访问 - 同时在多个``上游''和``下游''任务上训练单个模型。我们证明,在使用相同的数据总量时,共同传统的表现优于传统的转移学习,并且还展示了我们如何轻松地将方法扩展到多个``上游''数据集以进一步提高性能。尤其是,共同访问可以显着提高我们下游任务中稀有类别的性能,因为它具有正规化的效果,并使网络能够学习在不同数据集之间传输的功能表示。最后,我们观察到如何与公共,视频分类数据集共同进行,我们能够在挑战性的AVA和AVA-Kinetics数据集上实现最新的时空动作的结果,超过了最新的作品,这些作品的最新作品会发展出复杂的作品楷模。
translated by 谷歌翻译
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train. 1
translated by 谷歌翻译
我们使用无卷积的变压器架构提出了一种从未标记数据学习多式式表示的框架。具体而言,我们的视频音频文本变压器(Vatt)将原始信号作为输入提取,提取丰富的多式化表示,以使各种下游任务受益。我们使用多模式对比损失从头划线训练Vatt端到端,并通过视频动作识别,音频事件分类,图像分类和文本到视频检索的下游任务评估其性能。此外,我们通过共享三种方式之间的重量来研究模型 - 无话的单骨架变压器。我们表明,无卷积VATT优于下游任务中的最先进的Convnet架构。特别是,Vatt的视觉变压器在动力学-400上实现82.1%的高精度82.1%,在动力学-600,72.7%的动力学-700上的72.7%,以及时间的时间,新的记录,在避免受监督的预训练时,新的记录。通过从头划伤训练相同的变压器,转移到图像分类导致图像分类导致78.7%的ImageNet精度为64.7%,尽管视频和图像之间的域间差距,我们的模型概括了我们的模型。 Vatt的音雅音频变压器还通过在没有任何监督的预训练的情况下在Audioset上实现39.4%的地图来设置基于波形的音频事件识别的新记录。 Vatt的源代码是公开的。
translated by 谷歌翻译
We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results and the code will be open-sourced.
translated by 谷歌翻译
在本文中,我们探讨了构建统一基础模型的可能性,该模型可以适应愿景和仅文本任务。从BERT和VIT开始,我们设计一个由模态特定标记,共享变压器编码器和任务特定的输出头组成的统一变压器。为了有效地预先列车在未配对的图像和文本上联合培训拟议的模型,我们提出了两种新颖的技术:(i)我们使用单独培训的BERT和VIT模型作为老师,并应用知识蒸馏,为关节提供额外的准确的监督信号训练; (ii)我们提出了一种新颖的渐变掩蔽策略,以平衡图像和文本预培训损失的参数更新。我们通过微调它分别在图像分类任务和自然语言理解任务上进行微调,评估联合预训练的变压器。实验表明,由此产生的统一基础变压器令人惊讶地在视觉和仅文本任务中令人惊讶地令人惊讶,并且所提出的知识蒸馏和梯度掩蔽策略可以有效地提升分别训练模型水平的性能。
translated by 谷歌翻译
基于变压器的体系结构已在各种视觉域(最著名的图像和视频)中变得更具竞争力。虽然先前的工作已经孤立地研究了这些模式,但拥有一个共同的体系结构表明,人们可以训练单个统一模型以多种视觉方式。事先尝试进行统一建模通常使用针对视觉任务量身定制的体系结构,或与单个模态模型相比获得较差的性能。在这项工作中,我们表明可以使用蒙版的自动编码来在图像和视频上训练简单的视觉变压器,而无需任何标记的数据。该单个模型学习了与图像和视频基准上的单模式表示相当或更好的视觉表示,同时使用了更简单的体系结构。特别是,我们的单一预算模型可以进行审核,以在ImageNet上获得86.5%的速度,而在挑战性的事物V2视频基准测试中,可以实现75.3%的范围。此外,可以通过丢弃90%的图像和95%的视频补丁来学习该模型,从而实现非常快速的训练。
translated by 谷歌翻译
Many real-world problems are inherently multimodal, from the communicative modalities humans use to express social and emotional states to the force, proprioception, and visual sensors ubiquitous on robots. While there has been an explosion of interest in multimodal representation learning, these methods are still largely focused on a small set of modalities, primarily in the language, vision, and audio space. In order to accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios. Since adding new models for every new modality or task becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? We propose two new information-theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar 2 modalities $\{X_1,X_2\}$ are by measuring how much information can be transferred from $X_1$ to $X_2$, while (2) interaction heterogeneity studies how similarly pairs of modalities $\{X_1,X_2\}, \{X_3,X_4\}$ interact by measuring how much interaction information can be transferred from $\{X_1,X_2\}$ to $\{X_3,X_4\}$. We show the importance of these proposed metrics in high-modality scenarios as a way to automatically prioritize the fusion of modalities that contain unique information or interactions. The result is a single model, HighMMT, that scales up to $10$ modalities and $15$ tasks from $5$ different research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and transfers to entirely new modalities and tasks during fine-tuning.
translated by 谷歌翻译
深网络架构在不忘记以前的任务的情况下努力继续学习新任务。最近的趋势表明,基于参数扩展的动态架构可以在持续学习中有效地减少灾难性忘记。但是,现有方法通常需要在测试时需要任务标识符,需要复杂调整以平衡越来越多的参数,并且几乎不在任务中共享任何信息。结果,他们努力扩展到大量任务,而无需显着开销。在本文中,我们提出了一种基于专用编码器/解码器框架的变压器体系结构。批判性地,编码器和解码器在所有任务中共享。通过特殊令牌的动态扩展,我们专注于任务分发的解码器网络的各个向前。由于严格控制参数扩展,我们的策略缩小到大量任务,同时具有可忽略的内存和时间开销。此外,这种有效的策略不需要任何HyperParameter调整来控制网络的扩展。我们的模型在大型ImageNet100和ImageNet100上达到了Cifar100和最先进的表演,而参数比并发动态框架的参数越小。
translated by 谷歌翻译
随着变压器作为语言处理的标准及其在计算机视觉方面的进步,参数大小和培训数据的数量相应地增长。许多人开始相信,因此,变形金刚不适合少量数据。这种趋势引起了人们的关注,例如:某些科学领域中数据的可用性有限,并且排除了该领域研究资源有限的人。在本文中,我们旨在通过引入紧凑型变压器来提出一种小规模学习的方法。我们首次表明,具有正确的尺寸,卷积令牌化,变压器可以避免在小数据集上过度拟合和优于最先进的CNN。我们的模型在模型大小方面具有灵活性,并且在获得竞争成果的同时,参数可能仅为0.28亿。当在CIFAR-10上训练Cifar-10,只有370万参数训练时,我们的最佳模型可以达到98%的准确性,这是与以前的基于变形金刚的模型相比,数据效率的显着提高,比其他变压器小于10倍,并且是15%的大小。在实现类似性能的同时,重新NET50。 CCT还表现优于许多基于CNN的现代方法,甚至超过一些基于NAS的方法。此外,我们在Flowers-102上获得了新的SOTA,具有99.76%的TOP-1准确性,并改善了Imagenet上现有基线(82.71%精度,具有29%的VIT参数)以及NLP任务。我们针对变压器的简单而紧凑的设计使它们更可行,可以为那些计算资源和/或处理小型数据集的人学习,同时扩展了在数据高效变压器中的现有研究工作。我们的代码和预培训模型可在https://github.com/shi-labs/compact-transformers上公开获得。
translated by 谷歌翻译
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without
translated by 谷歌翻译
Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domainspecific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver -a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
translated by 谷歌翻译
我们研究了可靠的功能表示的任务,旨在在多个数据集上良好地概括以进行行动识别。我们建立了有关变形金刚的功效的方法。尽管在过去的十年中,我们目睹了视频动作识别的巨大进展,但如何培训单个模型可以在多个数据集中表现良好的单一模型仍然充满挑战而有价值。在这里,我们提出了一种新颖的多数据集训练范式,Multitrain,设计了两个新的损失条款,即信息丰富的损失和投射损失,旨在学习稳健的表现以进行行动识别。特别是,信息性损失最大化了功能嵌入的表现力,而每个数据集的投影损失遍历了数据集的类之间的内在关系。我们验证方法对五个具有挑战性的数据集的有效性,即动力学400,动力学700,矩矩,活动网络和某种效果 - v2数据集。广泛的实验结果表明,我们的方法可以始终如一地提高最新性能。
translated by 谷歌翻译
General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This however hinders them from scaling up to the inputs sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). In sum our contributions are: 1) scaling Perceiver-type models to raw high-resolution images and audio+video, 2) showing the feasibility of learning 1M+ positional embeddings from scratch using masked auto-encoding, 3) demonstrating competitive performance on raw data from ImageNet, AudioSet, PASCAL VOC, ModelNet40 and Kinetics datasets with the same exact, unchanged model and without specialized preprocessing or any tokenization.
translated by 谷歌翻译
来自视频数据的多模态学习最近看过,因为它允许在没有人为注释的情况下培训语义有意义的嵌入,从而使得零射击检索和分类等任务。在这项工作中,我们提出了一种多模态,模态无政府主义融合变压器方法,它学会在多个模态之间交换信息,例如视频,音频和文本,并将它们集成到加入的多模态表示中,以获取聚合的嵌入多模态时间信息。我们建议培训系统的组合丢失,单个模态以及成对的方式,明确地留出任何附加组件,如位置或模态编码。在测试时间时,产生的模型可以处理和融合任意数量的输入模态。此外,变压器的隐式属性允许处理不同长度的输入。为了评估所提出的方法,我们在大规模HOWASET上培训模型,并评估四个具有挑战性的基准数据集上产生的嵌入空间获得最先进的视频检索和零射击视频动作定位。
translated by 谷歌翻译
由于具有强大的代表性,变形金刚在包括自然语言处理(NLP),计算机视觉和语音识别在内的广泛应用中越来越受欢迎。但是,利用这种代表性的能力有效地需要大量的数据,强大的正则化或两者兼而有之以减轻过度拟合。最近,基于掩盖的自动编码器的自我监督预处理策略已解锁了变压器的功能,这些策略依赖于直接或从未掩盖的内容对比的掩蔽输入进行重建。这种预训练的策略已在NLP中的BERT模型,Speak2VEC模型中使用,最近在Vision中的MAE模型中,该模型迫使该模型使用自动编码相关的目标来了解输入不同部分中的内容之间的关系。在本文中,我们提出了一种小说但令人惊讶的简单替代内容,以预测内容的位置,而无需为其提供位置信息。这样做需要变压器仅凭内容就可以理解输入不同部分之间的位置关系。这相当于有效的实现,其中借口任务是每个输入令牌所有可能位置之间的分类问题。我们在视觉和语音基准上进行了实验,我们的方法对强有力的监督训练基准进行了改进,并且与现代的无监督/自我监督预审方法相媲美。我们的方法还可以使经过训练的变压器在没有位置嵌入的情况下胜过训练有完整位置信息的训练的变压器。
translated by 谷歌翻译