Smart Paper Notes

MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning

Zhiyang Xu , Ying Shen , Lifu Huang

Category: Natural Language Processing

2022-12-21

Instruction tuning, a new learning paradigm that fine-tunes pre-trained language models on tasks specified through instructions, has shown promising zero-shot performance on various natural language processing tasks. However, it has not yet been explored for vision and multimodal tasks. In this work, we introduce MultiInstruct, the first multimodal instruction tuning benchmark dataset that consists of 47 diverse multimodal tasks covering 11 broad categories. Each task is designed with at least 5,000 instances (input-output pairs) from existing open-source datasets and 5 expert-written instructions. We take OFA as the base pre-trained model for multimodal instruction tuning, and to improve its performance, we explore multiple transfer learning strategies to leverage the large-scale Natural Instructions dataset. Experimental results demonstrate its strong zero-shot performance on various unseen multimodal tasks and the benefit of transfer learning from text-only instructions. We also design a new evaluation metric, Sensitivity, to evaluate how sensitive the model is to the variety of instructions. Our results indicate that the model is less sensitive to varying instructions after fine-tuning on a diverse set of tasks and instructions for each task.
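
The abstract names the Sensitivity metric but does not give its formula. As a rough illustration only, the sketch below assumes sensitivity is the coefficient of variation (standard deviation divided by mean) of a task's scores across instruction wordings, averaged over tasks. The function name `sensitivity` and the input layout are hypothetical, not taken from the paper.

```python
from statistics import mean, pstdev


def sensitivity(scores_by_task):
    """Average, over tasks, of std/mean of scores across instruction wordings.

    scores_by_task maps a task name to a list of scores, one score per
    instruction wording. This definition is an assumption for illustration.
    """
    ratios = []
    for scores in scores_by_task.values():
        mu = mean(scores)
        if mu == 0:
            continue  # the ratio is undefined when the mean score is zero
        ratios.append(pstdev(scores) / mu)
    if not ratios:
        raise ValueError("no task with a non-zero mean score")
    return mean(ratios)


# Hypothetical scores: one task is stable, the other varies with wording.
example = {"stable_task": [50.0, 50.0, 50.0], "variable_task": [40.0, 60.0]}
print(sensitivity(example))
```

Under this assumed definition, a lower value means the model's scores change less when the instruction wording changes.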

translated by Google Translate

Language Models are General-Purpose Interfaces

Yaru Hao , Haoyu Song , Li Dong , Shaohan Huang , Zewen Chi , Wenhui Wang , Shuming Ma , Furu Wei

Category: Natural Language Processing

2022-06-13

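Foundation models have received much attention due to their effectiveness across a broad range of downstream applications. Although there is a big convergence in terms of architecture, most pretrained models are still typically developed for specific tasks or modalities. In this work, we propose to use language models as a general-purpose interface to various foundation models. A collection of pretrained encoders perceive diverse modalities (such as vision and language), and they dock with a language model that plays the role of a universal task layer. We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders. We subsume the advantages and capabilities of both causal and non-causal modeling, thereby combining the best of both worlds. Specifically, the proposed method not only inherits the capabilities of in-context learning and open-ended generation from causal language modeling, but is also conducive to fine-tuning because of the bidirectional encoders. More importantly, our approach seamlessly unlocks combinations of the above capabilities, e.g., enabling in-context learning or instruction following with fine-tuned encoders. Experimental results across various language-only and vision-language benchmarks show that our model outperforms or is competitive with specialized models on fine-tuning, zero-shot generalization, and few-shot learning.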

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models

Jinze Bai , Rui Men , Hao Yang , Xuancheng Ren , Kai Dang , Yichang Zhang , Xiaohuan Zhou , Peng Wang , Sinan Tan , An Yang

Category: Computer Vision | Artificial Intelligence | Natural Language Processing | Machine Learning

2022-12-08

Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Although they are a hopeful route toward general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly diverse example tasks in OFASys, with which we also develop a first-of-its-kind single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% of the performance on average with only 16% of the parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys

Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Jiasen Lu , Christopher Clark , Rowan Zellers , Roozbeh Mottaghi , Aniruddha Kembhavi

Category: Computer Vision

2022-06-17

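We propose Unified-IO, a model that performs a large variety of AI tasks, spanning classical computer vision tasks (including pose estimation, object detection, depth estimation, and image generation), vision-and-language tasks (such as region captioning and referring expression comprehension), and natural language processing tasks (such as question answering and paraphrasing). Developing a single unified model for such a variety of tasks poses unique challenges because of the heterogeneous inputs and outputs of each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture jointly on over 80 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark, and it produces strong results across 16 diverse benchmarks such as NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWiz, BoolQ, and SciTail, with no task-specific or benchmark-specific fine-tuning. Demos for Unified-IO are available at https://unified-io.allenai.org.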
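
To make the idea of turning a structured output into discrete vocabulary tokens concrete, here is a minimal sketch for bounding boxes. The abstract does not describe the actual encoding, so everything here is an assumption for illustration: the bin count, the `<loc_i>` token spelling, and the helper names `box_to_tokens` and `tokens_to_box` are hypothetical.

```python
def box_to_tokens(box, width, height, num_bins=1000):
    """Quantize a pixel-space box (x0, y0, x1, y1) into four location tokens.

    Each coordinate is normalized by the image size and mapped to one of
    num_bins integer bins. The "<loc_i>" spelling is a hypothetical choice.
    """
    x0, y0, x1, y1 = box
    normalized = [x0 / width, y0 / height, x1 / width, y1 / height]
    tokens = []
    for value in normalized:
        index = int(value * num_bins)
        index = max(0, min(index, num_bins - 1))  # clamp to the valid bin range
        tokens.append(f"<loc_{index}>")
    return tokens


def tokens_to_box(tokens, width, height, num_bins=1000):
    """Invert box_to_tokens approximately, using the center of each bin."""
    indices = [int(token[len("<loc_"):-1]) for token in tokens]
    centers = [(index + 0.5) / num_bins for index in indices]
    return (centers[0] * width, centers[1] * height,
            centers[2] * width, centers[3] * height)


tokens = box_to_tokens((0, 0, 100, 50), width=200, height=100, num_bins=10)
print(tokens)
print(tokens_to_box(tokens, width=200, height=100, num_bins=10))
```

The round trip is lossy by design: coordinates are recovered only up to the bin width, which is the price of sharing one token vocabulary across text and geometry.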

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

AJ Piergiovanni , Wei Li , Weicheng Kuo , Mohammad Saffar , Fred Bertsch , Anelia Angelova

Category: Computer Vision

2022-05-02

We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works using contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is multi-task as well. The pre-training uses only noisy image captioning data, and is formulated to use the entire architecture end-to-end with both a strong language encoder and decoder. Our results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results across a variety of question answering tasks. Our multi-task mixture training learns from tasks of various question intents and thus generalizes better, including on zero-shot vision-language tasks. We conduct experiments in the challenging multi-task and open-vocabulary settings and across a variety of datasets and tasks, such as VQA2.0, SNLI-VE, NLVR2, and GQA. We observe that the proposed approach is able to generalize to unseen tasks and that more diverse mixtures lead to higher accuracy in both known and novel tasks.

OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

Srinivasan Iyer , Xi Victoria Lin , Ramakanth Pasunuru , Todor Mihaylov , Daniel Simig , Ping Yu , Kurt Shuster , Tianlu Wang , Qing Liu , Punit Singh Koura

Category: Natural Language Processing

2022-12-22

Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.

Finetuned Language Models Are Zero-Shot Learners

Jason Wei , Maarten Bosma , Vincent Y. Zhao , Kelvin Guu , Adams Wei Yu , Brian Lester , Nan Du , Andrew M. Dai , Quoc V. Le

Category: Natural Language Processing

2021-09-03

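This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning (fine-tuning language models on a collection of tasks described via instructions) substantially boosts zero-shot performance on unseen tasks. We take a 137B-parameter pretrained language model and instruction-tune it on over 60 NLP tasks verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves on the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of the 25 tasks that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of tasks and model scale are key components of the success of instruction tuning.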

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen , Xiao Wang , Soravit Changpinyo , AJ Piergiovanni , Piotr Padlewski , Daniel Salz , Sebastian Goodman , Adam Grycner , Basil Mustafa , Lucas Beyer

Category: Computer Vision | Natural Language Processing

2022-09-14

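Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks in many languages. To train PaLI, we make use of large pretrained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits of even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.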

Prompt Tuning for Generative Multimodal Pretrained Models

Hao Yang , Junyang Lin , An Yang , Peng Wang , Chang Zhou , Hongxia Yang

Category: Natural Language Processing

2022-08-04

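Prompt tuning has become a new paradigm for model tuning, and it has demonstrated success in natural language pretraining and even vision pretraining. In this work, we explore the transfer of prompt tuning to multimodal pretraining, with a focus on generative multimodal pretrained models rather than contrastive ones. Specifically, we implement prompt tuning on a unified sequence-to-sequence pretrained model that is adaptive to both understanding and generation tasks. Experimental results demonstrate that lightweight prompt tuning can achieve performance comparable to fine-tuning and surpass other lightweight tuning methods. Moreover, in comparison with fine-tuned models, the prompt-tuned models demonstrate improved robustness against adversarial attacks. We further find that experimental factors, including the prompt length, prompt depth, and reparameterization, have a large impact on model performance, and we therefore empirically provide recommendations for the setup of prompt tuning. Despite the observed advantages, we still find some limitations in prompt tuning, and we correspondingly point out directions for future studies. Code is available at https://github.com/ofa-sys/ofa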

Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts

Soravit Changpinyo , Piyush Sharma , Nan Ding , Radu Soricut

Category:

2021-02-17

The availability of large-scale image captioning and visual question answering datasets has contributed significantly to recent successes in vision-and-language pretraining. However, these datasets are often collected with overrestrictive requirements inherited from their original target tasks (e.g., image caption generation), which limit the resulting dataset scale and diversity. We take a step further in pushing the limits of vision-and-language pretraining data by relaxing the data collection pipeline used in Conceptual Captions 3M (CC3M) [70] and introduce the Conceptual 12M (CC12M), a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. We perform an analysis of this dataset and benchmark its effectiveness against CC3M on multiple downstream tasks with an emphasis on long-tail visual recognition. Our results clearly illustrate the benefit of scaling up pre-training data for vision-and-language tasks, as indicated by the new state-of-the-art results on both the nocaps and Conceptual Captions benchmarks.

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data

Sophia Gu , Christopher Clark , Aniruddha Kembhavi

Category: Computer Vision | Natural Language Processing

2022-11-17

Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether this makes it possible to learn those skills from text data and then use them to complete vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study a variety of strategies to mitigate this concern. We produce models using only text training data on three tasks: image captioning, visual entailment and visual question answering, and evaluate them on standard benchmarks using images. We find that this kind of transfer is possible and results in only a small drop in performance relative to models trained on images. We also showcase a variety of stylistic image captioning models that were trained using no image data and no human-curated language data, but instead text data from books, the web, or language models.

Prefix Language Models are Unified Modal Learners

Shizhe Diao , Wangchunshu Zhou , Xinsong Zhang , Jiawei Wang

Category: Computer Vision | Natural Language Processing | Machine Learning

2022-06-15

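With the success of vision-language pre-training, we have witnessed the state of the art being pushed forward in multi-modal understanding and generation. However, the current pre-training paradigm is either incapable of targeting all modalities at once (e.g., text generation and image generation) or requires multiple well-designed tasks, which significantly limits scalability. We demonstrate that a unified modal model can be learned with a prefix language modeling objective over text and image sequences. Thanks to this simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks across modalities (language / vision / vision+language), types (understanding / generation), and settings (e.g., zero-shot, fine-tuning, linear evaluation) with a single unified architecture. DaVinci achieves competitive performance on a wide range of 26 understanding / generation tasks and outperforms previous unified vision-language models on most tasks, including ImageNet classification (+1.6%), VQAv2 (+1.4%), COCO caption generation (BLEU@4 +1.1%, CIDEr +1.5%), and COCO image generation (IS +0.9%, FID -1.0%), at comparable model and data scale. Furthermore, we offer a well-defined benchmark for future research by reporting performance at different scales of the pre-training dataset over a heterogeneous and wide distribution coverage. Our results establish new, stronger baselines for future comparisons at different data scales and shed light on the difficulties of comparing VLP models more broadly.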
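
As a generic illustration of what a prefix language modeling objective implies for attention, the sketch below builds the usual prefix-LM mask: positions inside the prefix attend to the whole prefix bidirectionally, while later positions attend causally. This is the standard construction, not necessarily the exact masking used in the paper, and the function name `prefix_lm_mask` is hypothetical.

```python
def prefix_lm_mask(prefix_len, total_len):
    """Return a total_len x total_len boolean mask for prefix language modeling.

    mask[q][k] is True when query position q may attend to key position k.
    Every position sees the full prefix; positions after the prefix also see
    earlier and current positions (causal attention).
    """
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must lie between 0 and total_len")
    return [
        [(k < prefix_len) or (k <= q) for k in range(total_len)]
        for q in range(total_len)
    ]


# A 2-token prefix followed by 2 generated tokens.
for row in prefix_lm_mask(prefix_len=2, total_len=4):
    print(["x" if allowed else "." for allowed in row])
```

Setting `prefix_len=0` recovers an ordinary causal mask, and `prefix_len=total_len` gives fully bidirectional attention, so the same helper covers both extremes.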

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

Or Honovich , Thomas Scialom , Omer Levy , Timo Schick

Category: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-12-19

Instruction tuning enables pretrained language models to perform new tasks from inference-time natural language descriptions. These approaches rely on vast amounts of human supervision in the form of crowdsourced datasets or user interactions. In this work, we introduce Unnatural Instructions: a large dataset of creative and diverse instructions, collected with virtually no human labor. We collect 64,000 examples by prompting a language model with three seed examples of instructions and eliciting a fourth. This set is then expanded by prompting the model to rephrase each instruction, creating a total of approximately 240,000 examples of instructions, inputs, and outputs. Experiments show that despite containing a fair amount of noise, training on Unnatural Instructions rivals the effectiveness of training on open-source manually-curated datasets, surpassing the performance of models such as T0++ and Tk-Instruct across various benchmarks. These results demonstrate the potential of model-generated data as a cost-effective alternative to crowdsourcing for dataset expansion and diversification.
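
The collection step described above (three seed examples in the prompt, a fourth elicited from the model) can be sketched as a prompt builder. The prompt layout, the field names `instruction` and `input`, and the function name `build_elicitation_prompt` are assumptions for illustration; the abstract does not specify the actual template, and no model call is made here.

```python
def build_elicitation_prompt(seed_examples):
    """Format three seed examples and leave a fourth open for the model.

    seed_examples is a list of exactly three dicts with the hypothetical
    keys "instruction" and "input". The returned text ends with an open
    "Instruction:" slot for the language model to complete.
    """
    if len(seed_examples) != 3:
        raise ValueError("expected exactly three seed examples")
    parts = []
    for number, example in enumerate(seed_examples, start=1):
        parts.append(
            f"Example {number}\n"
            f"Instruction: {example['instruction']}\n"
            f"Input: {example['input']}\n"
        )
    parts.append("Example 4\nInstruction:")
    return "\n".join(parts)


# Hypothetical seed examples, written only for this sketch.
seeds = [
    {"instruction": "Translate the sentence to French.", "input": "Good morning."},
    {"instruction": "List the nouns in the sentence.", "input": "The cat sat on the mat."},
    {"instruction": "Classify the review as positive or negative.", "input": "I loved it."},
]
print(build_elicitation_prompt(seeds))
```

The completion returned by a language model for this prompt would then be parsed into a new instruction and input, and sampling different seed triples yields different elicited examples.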

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Pan Lu , Swaroop Mishra , Tony Xia , Liang Qiu , Kai-Wei Chang , Song-Chun Zhu , Oyvind Tafjord , Peter Clark , Ashwin Kalyan

Category: Natural Language Processing | Artificial Intelligence | Computer Vision | Machine Learning

2022-09-20

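When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). In deep learning models such as large-scale language models, this process is normally a black box. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of AI systems. However, existing datasets either fail to provide annotations for the answers or are restricted to the text-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (SQA), a new benchmark that consists of ~21k multimodal multiple-choice questions covering a diverse set of science topics, with annotations of their answers and corresponding lectures and explanations. We further design language models that learn to generate lectures and explanations as the chain of thought (CoT), mimicking the multi-hop reasoning process when answering SQA questions. SQA demonstrates the utility of CoT in language models, as CoT improves question answering performance by 1.20% for GPT-3 and 3.99% for UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding them in the input; we observe that this improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations, learning from less data and achieving the same performance with just 40% of the data.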

Florence: A New Foundation Model for Computer Vision

Lu Yuan , Dongdong Chen , Yi-Ling Chen , Noel Codella , Xiyang Dai , Jianfeng Gao , Houdong Hu , Xuedong Huang , Boxin Li , Chunyuan Li

Category: Computer Vision | Artificial Intelligence | Machine Learning

2021-11-22

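Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping image and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from Web-scale image-text data, our Florence model can be easily adapted for various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results on the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.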

ALERT: Adapting Language Models to Reasoning Tasks

Ping Yu , Tianlu Wang , Olga Golovneva , Badr Alkhamissy , Gargi Ghosh , Mona Diab , Asli Celikyilmaz

Category: Natural Language Processing

2022-12-16

Current large language models can perform reasonably well on complex tasks that require step-by-step reasoning with few-shot learning. Are these models applying reasoning skills they have learnt during pre-training and reasoning outside of their training context, or are they simply memorizing their training corpus at finer granularity and have learnt to better understand their context? To tease apart these possibilities, we introduce ALERT, a benchmark and suite of analyses for assessing language models' reasoning ability by comparing pre-trained and finetuned models on complex tasks that require reasoning skills to solve. ALERT provides a test bed to assess any language model on fine-grained reasoning skills, which spans over 20 datasets and covers 10 different reasoning skills. We leverage ALERT to further investigate the role of finetuning. With extensive empirical analysis we find that language models learn more reasoning skills, such as textual entailment, abductive reasoning, and analogical reasoning, during the finetuning stage than during the pretraining stage. We also find that when language models are finetuned they tend to overfit to the prompt template, which hurts the robustness of the models and causes generalization problems.

Achieving Human Parity on Visual Question Answering

Ming Yan , Haiyang Xu , Chenliang Li , Junfeng Tian , Bin Bi , Wei Wang , Weihua Chen , Xianzhe Xu , Fan Wang , Zheng Cao

Category: Natural Language Processing | Computer Vision

2021-11-17

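The Visual Question Answering (VQA) task utilizes both visual image and language analysis to answer a textual question about an image. It has been a popular research topic with an increasing number of real-world applications over the last decade. This paper describes our recent research on AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding), which obtains similar or even slightly better results than humans on VQA. This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with the corresponding expertise plays an important role in boosting the performance of our VQA architecture up to the human level. An extensive set of experiments and analyses is conducted to demonstrate the effectiveness of the new research work.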

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu , Dhruv Batra , Devi Parikh , Stefan Lee

Category:

2019-08-06

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks (visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval) by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models, achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

A Survey of Vision-Language Pre-Trained Models

Yifan Du , Zikang Liu , Junyi Li , Wayne Xin Zhao

Category: Computer Vision | Natural Language Processing | Machine Learning

2022-02-18

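As the Transformer has evolved, pre-trained models have advanced at a breakneck pace in recent years. They have come to dominate the mainstream techniques in natural language processing (NLP) and computer vision (CV). How to adapt pre-training to vision-and-language (V-L) learning and improve downstream task performance has become a focus of multimodal learning. In this paper, we review the recent progress in Vision-Language Pre-Trained Models (VL-PTMs). As the core content, we first briefly introduce several ways to encode raw images and texts into single-modal embeddings before pre-training. Then, we dive into the mainstream architectures of VL-PTMs for modeling the interaction between text and image representations. We further present widely used pre-training tasks, and then we introduce some common downstream tasks. We finally conclude the paper and present some promising research directions. Our survey aims to provide researchers with a synthesis of, and pointers to, related research.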

InstructRL: Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

Hao Liu , Lisa Lee , Kimin Lee , Pieter Abbeel

Category: Computer Vision | Robotics

2022-10-24

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically come with divided language and visual representations, requiring the design of specialized network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our InstructRL method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better model scalability and generalization ability than prior work.
