Smart Paper Notes

Language Models are General-Purpose Interfaces

Yaru Hao , Haoyu Song , Li Dong , Shaohan Huang , Zewen Chi , Wenhui Wang , Shuming Ma , Furu Wei

Category: Natural Language Processing

2022-06-13

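Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Although there is a big convergence in terms of architecture, most pretrained models are still typically developed for specific tasks or modalities. In this work, we propose to use language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders. We subsume the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to finetuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks combinations of the above capabilities, e.g., enabling in-context learning or instruction following with finetuned encoders. Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on finetuning, zero-shot generalization, and few-shot learning.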

translated by Google Translate

Finetuned Language Models Are Zero-Shot Learners

Jason Wei , Maarten Bosma , Vincent Y. Zhao , Kelvin Guu , Adams Wei Yu , Brian Lester , Nan Du , Andrew M. Dai , Quoc V. Le

Category: Natural Language Processing

2021-09-03

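This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (finetuning language models on a collection of tasks described via instructions) substantially improves zero-shot performance on unseen tasks. We take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components in the success of instruction tuning.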
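
The phrase "verbalized via natural language instruction templates" can be illustrated with a minimal sketch. The template wording, the `verbalize` helper, and the option list below are hypothetical and are not taken from the FLAN paper; they only show how one raw example might be rendered as several instruction-style prompts.

```python
# Hypothetical instruction templates for a natural language inference task.
# The wording is illustrative only and is not taken from the FLAN paper.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? {options}",
    "Read the following and decide whether the hypothesis follows.\n"
    "{premise}\nHypothesis: {hypothesis}\n{options}",
]


def verbalize(example, templates, options=("yes", "no", "maybe")):
    """Render one raw example as several instruction-style prompts."""
    option_text = "OPTIONS: " + ", ".join(options)
    return [
        t.format(premise=example["premise"],
                 hypothesis=example["hypothesis"],
                 options=option_text)
        for t in templates
    ]


example = {"premise": "A dog is running in the park.",
           "hypothesis": "An animal is outdoors."}
prompts = verbalize(example, NLI_TEMPLATES)
```

Using several templates per task is one way to expose the model to varied phrasings of the same instruction.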

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen , Xiao Wang , Soravit Changpinyo , AJ Piergiovanni , Piotr Padlewski , Daniel Salz , Sebastian Goodman , Adam Grycner , Basil Mustafa , Lucas Beyer

Category: Computer Vision | Natural Language Processing

2022-09-14

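Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits of even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results in multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.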

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

Junke Wang , Dongdong Chen , Zuxuan Wu , Chong Luo , Luowei Zhou , Yucheng Zhao , Yujia Xie , Ce Liu , Yu-Gang Jiang , Lu Yuan

Category: Computer Vision

2022-09-15

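This paper presents OmniVL, a new foundation model that supports both image-language and video-language tasks using one universal architecture. It adopts a unified transformer-based visual encoder for both image and video inputs, and can therefore perform joint image-language and video-language pretraining. We demonstrate, for the first time, that such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer (e.g., using image-language to help video-language). To this end, we propose a decoupled joint pretraining of image-language and video-language to effectively decompose vision-language modeling into spatial and temporal dimensions and obtain a performance boost on both image and video tasks. Moreover, we introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), and video-label (e.g., video action recognition) data together, so that both supervised and noisily supervised pretraining data are utilized as much as possible. Without incurring extra task-specific adaptors, OmniVL can simultaneously support visual-only tasks (e.g., image classification, video action recognition), cross-modal alignment tasks (e.g., image/video-text retrieval), and multimodal understanding and generation tasks (e.g., image/video question answering, captioning). We evaluate OmniVL on a wide range of downstream tasks and achieve state-of-the-art or competitive results with similar model size and data scale.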

CoCa: Contrastive Captioners are Image-Text Foundation Models

Jiahui Yu , Zirui Wang , Vijay Vasudevan , Legg Yeung , Mojtaba Seyedhosseini , Yonghui Wu

Category: Computer Vision | Machine Learning

2022-05-04

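Exploring large-scale pretrained foundation models is of significant interest in computer vision because these models can be quickly transferred to many downstream tasks. This paper presents Contrastive Captioner (CoCa), a minimalist design to pretrain an image-text encoder-decoder foundation model jointly with a contrastive loss and a captioning loss, thereby subsuming model capabilities from contrastive approaches like CLIP and generative methods like SimVLM. In contrast to standard encoder-decoder transformers where all decoder layers attend to encoder outputs, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the remaining decoder layers, which cross-attend to the image encoder, for multimodal image-text representations. We apply a contrastive loss between unimodal image and text embeddings, in addition to a captioning loss on the multimodal decoder outputs, which predict text tokens autoregressively. By sharing the same computational graph, the two training objectives are computed efficiently with minimal overhead. CoCa is pretrained end-to-end and from scratch on both web-scale alt-text data and annotated images by treating all labels simply as text, seamlessly unifying natural language supervision for representation learning. Empirically, CoCa achieves state-of-the-art performance with zero-shot transfer or minimal task-specific adaptation on a broad range of downstream tasks, spanning visual recognition (ImageNet, Kinetics-400/600/700, Moments-in-Time), cross-modal retrieval (MSCOCO, Flickr30K, MSR-VTT), multimodal understanding (VQA, SNLI-VE, NLVR2), and image captioning (MSCOCO, NoCaps). Notably, on ImageNet classification, CoCa obtains 86.3% zero-shot top-1 accuracy, 90.6% with a frozen encoder and a learned classification head, and a new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder.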
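
The joint objective described above (a contrastive loss between unimodal embeddings plus a captioning loss on decoder outputs) can be sketched in plain Python. This is a toy illustration, not CoCa's implementation: the function names, the loss weights `w_con` and `w_cap`, and the tiny similarity matrix and logits are all assumptions made for the example.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def contrastive_loss(sim):
    """Image-to-text plus text-to-image cross-entropy.

    sim[i][j] is the (temperature-scaled) similarity between image i and
    text j; matched pairs sit on the diagonal.
    """
    n = len(sim)
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return i2t + t2i


def captioning_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the target tokens."""
    nll = [-log_softmax(logits)[t]
           for logits, t in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)


def joint_loss(sim, token_logits, target_ids, w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are assumed values."""
    return (w_con * contrastive_loss(sim)
            + w_cap * captioning_loss(token_logits, target_ids))


# Toy batch: two image-text pairs and a two-token caption over a 3-word vocabulary.
sim = [[5.0, 0.0], [0.0, 5.0]]
token_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
target_ids = [0, 1]
loss = joint_loss(sim, token_logits, target_ids)
```

In a real model both terms would be computed from one forward pass, which is the shared computational graph the abstract refers to.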

MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning

Zhiyang Xu , Ying Shen , Lifu Huang

Category: Natural Language Processing

2022-12-21

Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions, has shown promising zero-shot performance on various natural language processing tasks. However, it has not yet been explored for vision and multimodal tasks. In this work, we introduce MultiInstruct, the first multimodal instruction tuning benchmark dataset that consists of 47 diverse multimodal tasks covering 11 broad categories. Each task is designed with at least 5,000 instances (input-output pairs) from existing open-source datasets and 5 expert-written instructions. We take OFA as the base pre-trained model for multimodal instruction tuning, and to improve its performance, we explore multiple transfer learning strategies to leverage the large-scale Natural Instructions dataset. Experimental results demonstrate its strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from text-only instructions. We also design a new evaluation metric, Sensitivity, to evaluate how sensitive the model is to the variety of instructions. Our results indicate that the model is less sensitive to the varying instructions after finetuning on a diverse set of tasks and instructions for each task.
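
The abstract names the Sensitivity metric but does not give its formula. One plausible reading, assumed here purely for illustration, is the spread of a task's scores across instruction variants relative to their mean, averaged over tasks. The `sensitivity` function and the scores below are hypothetical.

```python
from statistics import mean, pstdev


def sensitivity(scores_by_task):
    """Average over tasks of (std / mean) of scores across instructions.

    Assumes every task has a positive mean score; this definition is an
    illustrative assumption, not a quotation of the paper's formula.
    """
    ratios = [pstdev(scores) / mean(scores)
              for scores in scores_by_task.values()]
    return mean(ratios)


# Hypothetical accuracies for five instruction variants per task.
stable_model = {"vqa": [50.0, 51.0, 49.0, 50.0, 50.0],
                "grounding": [30.0, 30.0, 31.0, 29.0, 30.0]}
brittle_model = {"vqa": [50.0, 20.0, 65.0, 35.0, 55.0],
                 "grounding": [30.0, 10.0, 45.0, 25.0, 40.0]}

stable_score = sensitivity(stable_model)
brittle_score = sensitivity(brittle_model)
```

Under this reading a lower value means the model behaves more consistently when the same task is phrased differently.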

MAGMA -- Multimodal Augmentation of Generative Models through Adapter-based Finetuning

Constantin Eichenberg , Sidney Black , Samuel Weinbach , Letitia Parcalabescu , Anette Frank

Category: Computer Vision | Natural Language Processing

2021-12-09

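Large-scale pretraining is fast becoming the norm in vision-language (VL) modeling. However, prevailing VL approaches are limited by the requirement for labeled data and the use of complex multi-step pretraining objectives. We present MAGMA, a simple method for augmenting generative language models with additional modalities using adapter-based finetuning. Building on Frozen, we train a series of VL models that autoregressively generate text from arbitrary combinations of visual and textual input. The pretraining is entirely end-to-end using a single language modeling objective, simplifying optimization compared to previous approaches. Importantly, the language model weights remain unchanged during training, allowing for the transfer of encyclopedic knowledge and in-context learning abilities from language pretraining. MAGMA outperforms Frozen on open-ended generative tasks, achieving state-of-the-art results on the OKVQA benchmark and competitive results on a range of other popular VL benchmarks, while pretraining on 0.2% of the number of samples used to train SimVLM.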

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Antoine Yang , Antoine Miech , Josef Sivic , Ivan Laptev , Cordelia Schmid

Category: Computer Vision | Natural Language Processing | Machine Learning

2022-06-16

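Video question answering (VideoQA) is a complex task that requires diverse multimodal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on web-scale text-only data to multimodal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using web-scraped multimodal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA, and TVQA. It also demonstrates competitive performance in the few-shot and fully supervised settings. Our code and models will be made publicly available at https://antoyang.github.io/frozenbilm.html.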

VL-BEiT: Generative Vision-Language Pretraining

Hangbo Bao , Wenhui Wang , Li Dong , Furu Wei

Category: Computer Vision | Natural Language Processing

2022-06-02

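We introduce a vision-language foundation model called VL-BEiT, a bidirectional multimodal Transformer learned by generative pretraining. Our minimalist solution conducts masked prediction on both monomodal and multimodal data with a shared Transformer. Specifically, we perform masked vision-language modeling on image-text pairs, masked language modeling on texts, and masked image modeling on images. VL-BEiT is learned from scratch with one unified pretraining task, one shared backbone, and one-stage training. Our method is conceptually simple and empirically effective. Experimental results show that VL-BEiT obtains strong results on various vision-language benchmarks, such as visual question answering, visual reasoning, and image-text retrieval. Moreover, our method learns transferable visual features, achieving competitive performance on image classification and semantic segmentation.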

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Linjie Li , Zhe Gan , Kevin Lin , Chung-Ching Lin , Zicheng Liu , Ce Liu , Lijuan Wang

Category: Computer Vision

2022-06-14

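Unified vision-language frameworks have advanced greatly in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework, LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval, and video captioning. Extensive analyses further demonstrate the advantages of LAVENDER over existing VidL methods: (i) it supports all downstream tasks with just a single set of parameter values when multi-task finetuned; (ii) it shows few-shot generalization on various downstream tasks; and (iii) it enables zero-shot evaluation on video question answering tasks. Code is available at https://github.com/microsoft/lavender.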

Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation

Tianyi Liu , Zuxuan Wu , Wenhan Xiong , Jingjing Chen , Yu-Gang Jiang

Category: Computer Vision | Natural Language Processing | Machine Learning

2021-12-10

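Most existing vision-language pre-training methods focus on understanding tasks and use BERT-like objectives (masked language modeling and image-text matching) during pretraining. Although they perform well on many understanding downstream tasks, e.g., visual question answering, image-text retrieval, and visual entailment, they do not possess the ability to generate. To tackle this problem, we propose Unified multimodal pre-training for both Vision-Language understanding and generation (UniVL). The proposed UniVL is capable of handling both understanding tasks and generative tasks. We augment existing pretraining paradigms that only use random masks with causal masks, i.e., triangular masks that mask out future tokens, so that the pre-trained models have autoregressive generation abilities by design. We formulate several previous understanding tasks as text generation tasks and propose to use a prompt-based method for fine-tuning on different downstream tasks. Our experiments show that there is a trade-off between understanding tasks and generation tasks when using the same model, and that a feasible way to improve both is to use more data. Our UniVL framework attains performance comparable to recent vision-language pre-training methods on both understanding tasks and generation tasks. Moreover, we demonstrate that prompt-based finetuning is more data-efficient: it outperforms discriminative methods in few-shot scenarios.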

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

Wenhui Wang , Hangbo Bao , Li Dong , Johan Bjorck , Zhiliang Peng , Qiang Liu , Kriti Aggarwal , Owais Khan Mohammed , Saksham Singhal , Subhojit Som

Category: Computer Vision | Natural Language Processing

2022-08-22

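A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model, BEiT-3, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We introduce Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains state-of-the-art performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO).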

Prefix Language Models are Unified Modal Learners

Shizhe Diao , Wangchunshu Zhou , Xinsong Zhang , Jiawei Wang

Category: Computer Vision | Natural Language Processing | Machine Learning

2022-06-15

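With the success of vision-language pre-training, we have witnessed the state of the art being pushed forward on multimodal understanding and generation. However, the current pre-training paradigm is either incapable of targeting all modalities at once (e.g., text generation and image generation) or requires multi-fold well-designed tasks, which significantly limits scalability. We demonstrate that a unified modal model can be learned with a prefix language modeling objective over text and image sequences. Thanks to this simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks across modalities (language / vision / vision+language), types (understanding / generation), and settings (e.g., zero-shot, fine-tuning, linear evaluation) with a single unified architecture. DaVinci achieves competitive performance on a wide range of 26 understanding / generation tasks and outperforms previous unified vision-language models on most tasks, including ImageNet classification (+1.6%), VQAv2 (+1.4%), COCO caption generation (BLEU@4 +1.1%, CIDEr +1.5%), and COCO image generation (IS +0.9%, FID -1.0%), at a comparable model and data scale. Furthermore, we offer a well-defined benchmark for future research by reporting performance on different scales of the pre-training dataset with a heterogeneous and wide distribution coverage. Our results establish new, stronger baselines for future comparisons at different data scales and shed light on the difficulties of comparing VLP models more generally.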
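
A prefix language modeling objective splits each sequence into a visible prefix and a suffix that must be predicted. The sketch below shows how one training example might be prepared; the `prefix_lm_example` helper, the random split point, and the `<img_…>` token names are hypothetical and are not taken from the DaVinci paper.

```python
import random


def prefix_lm_example(tokens, rng):
    """Split a sequence into a visible prefix and a suffix to be predicted.

    Returns the prefix, the suffix, and a per-position loss mask in which
    only suffix positions (value 1) contribute to the training loss.
    """
    split = rng.randint(1, len(tokens) - 1)  # keep both parts non-empty
    prefix, suffix = tokens[:split], tokens[split:]
    loss_mask = [0] * len(prefix) + [1] * len(suffix)
    return prefix, suffix, loss_mask


# Hypothetical mixed sequence: discrete image tokens followed by caption words.
tokens = ["<img_17>", "<img_4>", "<img_9>", "a", "cat", "on", "a", "mat"]
prefix, suffix, loss_mask = prefix_lm_example(tokens, random.Random(0))
```

Because the same split applies whether the suffix holds text tokens or image tokens, one objective can cover both text generation and image generation.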

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

AJ Piergiovanni , Wei Li , Weicheng Kuo , Mohammad Saffar , Fred Bertsch , Anelia Angelova

Category: Computer Vision

2022-05-02

We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses only noisy image captioning data, and is formulated to use the entire architecture end-to-end with both a strong language encoder and decoder. Our results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results across a variety of question answering tasks. Our multi-task mixture training learns from tasks of various question intents and thus generalizes better, including on zero-shot vision-language tasks. We conduct experiments in the challenging multi-task and open-vocabulary settings and across a variety of datasets and tasks, such as VQA2.0, SNLI-VE, NLVR2, and GQA. We observe that the proposed approach is able to generalize to unseen tasks and that more diverse mixtures lead to higher accuracy in both known and novel tasks.

Unified language model pre-training for natural language understanding and generation

Category:

This paper presents a new UNIfied pre-trained Language Model (UNILM) that can be fine-tuned for both natural language understanding and generation tasks. The model is pre-trained using three types of language modeling tasks: unidirectional, bidirectional, and sequence-to-sequence prediction. The unified modeling is achieved by employing a shared Transformer network and utilizing specific self-attention masks to control what context the prediction conditions on. UNILM compares favorably with BERT on the GLUE benchmark, and the SQuAD 2.0 and CoQA question answering tasks. Moreover, UNILM achieves new state-of-the-art results on five natural language generation datasets, including improving the CNN/DailyMail abstractive summarization ROUGE-L to 40.51 (2.04 absolute improvement), the Gigaword abstractive summarization ROUGE-L to 35.75 (0.86 absolute improvement), the CoQA generative question answering F1 score to 82.5 (37.1 absolute improvement), the SQuAD question generation BLEU-4 to 22.12 (3.75 absolute improvement), and the DSTC7 document-grounded dialog response generation NIST-4 to 2.67 (human performance is 2.65). The code and pre-trained models are available at https://github.com/microsoft/unilm.
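
The three language modeling tasks differ only in which positions each token may attend to. A minimal sketch of the corresponding self-attention masks follows, with 1 meaning "may attend" and 0 meaning "blocked". The function names are assumptions, and only the left-to-right direction of the unidirectional case is shown.

```python
def unidirectional_mask(n):
    """Left-to-right: position i may attend to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position."""
    return [[1] * n for _ in range(n)]


def seq2seq_mask(n_src, n_tgt):
    """Source tokens see the whole source; target tokens see the source
    plus the target tokens up to and including themselves."""
    n = n_src + n_tgt
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            if j < n_src:
                row.append(1)  # every position sees the source segment
            elif i >= n_src and j <= i:
                row.append(1)  # target sees itself and earlier targets
            else:
                row.append(0)  # future targets stay hidden
        mask.append(row)
    return mask


mask = seq2seq_mask(n_src=2, n_tgt=2)
```

Sharing one Transformer and switching only the mask is what lets a single set of weights serve both understanding and generation.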

Unified Vision-Language Pre-Training for Image Captioning and VQA

Luowei Zhou , Hamid Palangi , Lei Zhang , Houdong Hu , Jason J. Corso , Jianfeng Gao

Category:

2019-09-24

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be finetuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.

ILLUME: Rationalizing Vision-Language Models by Interacting with their Jabber

Manuel Brack , Patrick Schramowski , Björn Deiseroth , Kristian Kersting

Category: Machine Learning | Artificial Intelligence | Natural Language Processing | Computer Vision

2022-08-17

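Bootstrapping from pre-trained language models has been proven to be an efficient approach for building foundation vision-language models (VLMs) for tasks such as image captioning or visual question answering. However, it is difficult to use this approach to make the model conform to a user's rationales for specific answers. To elicit and reinforce commonsense reasons, we propose an iterative sampling and tuning paradigm, called ILLUME, that executes the following loop: given an image-question-answer prompt, the VLM samples multiple candidate rationales, and a human critic provides minimal feedback via preference selection, which is used for fine-tuning. This loop increases the training data and gradually carves out the VLM's rationalization capabilities. Our exhaustive experiments demonstrate that ILLUME is competitive with standard supervised fine-tuning while using significantly less training data and requiring only minimal feedback.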
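
The sampling-and-selection loop can be sketched with toy stand-ins for the model and the critic. Everything below is hypothetical: `toy_sampler` replaces VLM sampling, `toy_critic` replaces the human preference selection, and the fine-tuning step is left as a comment.

```python
def sampling_round(prompts, sample_rationales, critic_accepts, k=3):
    """One round: sample k candidates per prompt, keep those the critic picks."""
    accepted = []
    for prompt in prompts:
        for rationale in sample_rationales(prompt, k):
            if critic_accepts(prompt, rationale):
                accepted.append((prompt, rationale))
    return accepted


def toy_sampler(prompt, k):
    # Stand-in for VLM sampling: returns up to k canned candidate rationales.
    candidates = ["because the sky is visible",
                  "because it is a photo",
                  "because of reasons"]
    return candidates[:k]


def toy_critic(prompt, rationale):
    # Stand-in for the human critic's preference selection.
    return "sky" in rationale


training_data = []
for _ in range(2):  # two iterations of the loop
    training_data.extend(
        sampling_round(["Is it daytime? yes"], toy_sampler, toy_critic))
    # A real system would fine-tune the VLM on training_data here, so the
    # next round samples from an updated model.
```

The critic only selects among generated candidates and never writes a rationale, which is what keeps the feedback minimal.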

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

Jinze Bai , Rui Men , Hao Yang , Xuancheng Ren , Kai Dang , Yichang Zhang , Xiaohuan Zhou , Peng Wang , Sinan Tan , An Yang

Category: Computer Vision | Artificial Intelligence | Natural Language Processing | Machine Learning

2022-12-08

Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Being, hopefully, an alternative to approaching general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly-diverse example tasks in OFASys, with which we also develop a first-in-kind, single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance in average with only 16% parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys

GIT: A Generative Image-to-text Transformer for Vision and Language

Jianfeng Wang , Zhengyuan Yang , Xiaowei Hu , Linjie Li , Kevin Lin , Zhe Gan , Zicheng Liu , Ce Liu , Lijuan Wang

Category: Computer Vision

2022-05-27

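In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture to one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost model performance. Without bells and whistles, our GIT establishes a new state of the art on 12 challenging benchmarks. For instance, our model surpasses human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.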

Compound Tokens: Channel Fusion for Vision-Language Representation Learning

Maxwell Mbabilla Aladago , AJ Piergiovanni

Category: Computer Vision | Machine Learning

2022-12-02

We present an effective method for fusing visual-and-language representations for several question answering tasks including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal representations or use only cross-attention, we compose multimodal representations via channel fusion. By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods. These multimodal representations, which we call compound tokens, are generated with cross-attention transformer layers. First, vision tokens are used as queries to retrieve compatible text tokens through cross-attention. We then chain the vision tokens and the queried text tokens along the channel dimension. We call the resulting representations compound tokens. A second group of compound tokens is generated using an analogous process where the text tokens serve as queries to the cross-attention layer. We concatenate all the compound tokens for further processing with a multimodal encoder. We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting. Compound Tokens achieve highly competitive performance across a range of question answering tasks including GQA, VQA2.0, and SNLI-VE.
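
The channel-fusion steps above can be sketched in plain Python with tiny hand-written tokens. This is an illustration under simplifying assumptions: attention is single-head with no learned projections, so each compound token simply has twice the channel width of its inputs, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Single-head dot-product cross-attention without learned projections."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[c] for w, kv in zip(weights, keys_values))
                    for c in range(d)])
    return out


def compound_tokens(vision, text):
    """Build both groups of compound tokens and concatenate them."""
    # Vision tokens query the text; results are joined along the channels.
    vision_compound = [v + t for v, t in zip(vision, cross_attend(vision, text))]
    # Text tokens query the vision tokens in the analogous second pass.
    text_compound = [t + v for t, v in zip(text, cross_attend(text, vision))]
    return vision_compound + text_compound  # concatenate along the token axis


vision = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 vision tokens, 2 channels
text = [[0.5, 0.5], [1.0, -1.0]]               # 2 text tokens, 2 channels
tokens = compound_tokens(vision, text)
```

Each output token keeps its original features in the first channels and the features it retrieved from the other modality in the remaining channels.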
