Visual Prompting via Image Inpainting

Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, Alexei A. Efros

UC Berkeley                  Tel Aviv University
Abstract

How does one adapt a pre-trained visual model to novel downstream tasks without task-specific finetuning or any model modification? Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image, consistent with the given examples. We show that posing this problem as simple image inpainting -- literally just filling in a hole in a concatenated visual prompt image -- turns out to be surprisingly effective, provided that the inpainting algorithm has been trained on the right data. We train masked auto-encoders on a new dataset that we curated -- 88k unlabeled figures from academic papers sources on Arxiv. We apply visual prompting to these pretrained models and demonstrate results on various downstream image-to-image tasks, including foreground segmentation, single object detection, colorization, edge detection, etc.111Project page: https://yossigandelsman.github.io/visual_prompt.

footnotetext: * Equal contribution.

1 Introduction

In the past few years, self-supervised learning has gained popularity in computer vision and natural language processing (NLP). The growing capacity of modern deep learning models made them prone to overfitting when trained on relatively small labeled datasets. Self-supervised learning provides a solution to this problem by generating “free labels” for any dataset, without the need for manual annotation, addressing the data hunger in these high-capacity deep learning models. However, features learned via self-supervision are not “ready for use” – they typically need to be adapted for a given downstream task by fine-tuning on some labeled dataset. Could this fine-tuning be avoided?

In NLP, prompting gpt3 has recently emerged as a way to employ a model for a new task without any additional training. A common way of task-prompting for a specific language understanding task at test time is to provide the trained model with an input corresponding to example(s) of the target task together with the query. E.g., typing the following input prompt:

{BVerbatim}

Je suis désolé I’m sorry J’adore la glace

will prompt the model gpt3 to perform the task of French-to-English translation, returning: {BVerbatim} I love ice cream

Can this idea of test-time task prompting be generalized to the visual domain? That is, instead of the current situation in computer vision, where each trained model serves its predefined task (e.g. segmentation, detection, classification), can we have a single general model that can perform a wide range of user-specified tasks without any fine-tuning (i.e., weight modification)?

Figure 1: Visual prompting via Image Inpainting. Top: Prompting Image Inpainting Models. Given input-output example(s) and image query , we construct a grid-like single image called a visual prompt . The visual prompt is composed of the desired task example(s) and a new query image (all in green). The inpainting model goal is then to predict the masked region (red) such that it is consistent with the example(s). Bottom: an inpainting model can solve this way various computer vision tasks, given that it was trained on the right data. The model predictions are annotated in red.

In this paper we take a step toward this goal by demonstrating that large-capacity image inpainting models, when trained on the right data, can be surprisingly effective tools for visual prompting. As shown in Figure 1, we define each task by constructing a new grid-like image that contains an input-output example(s) of the task and a novel query (green border). The input-output example describes the task, and the image query defines a new input. The model then produces the result by simply inpainting the rest of the image (red border). This setting is most similar to the classic Image Analogies hertzmann2001image formulation, but is less constrained: instead of explicitly defining the A, A’, and B images separately, we simply concatenate them into a single image with a hole (hence, visual prompting is not exactly an analogy since there is no implied left-to-right ordering). Our goal is also not dissimilar from the aims of meta-learning and few-shot learning methods, except that we make no distinction between tasks and example pairs. The only requirement of our formulation is that the tasks must be defined as image-to-image translations, which is a very large subset of vision problems.

To obtain training data that is most useful for our framework, we utilize a domain that spans across a variety of computer vision tasks - figures and infographics from computer vision articles available on Arxiv. We build a large dataset of 88 thousand figures, many of which contain grids of images and their corresponding task results (e.g. images and their segmentation masks/stylized versions/edges, etc.). We then train large-capacity inpainting models to predict randomly masked patches from figures given other patches from the same figure.

Our main contributions are as follows. First, we present a simple yet surprisingly powerful general approach for visual prompting. We show that various computer vision tasks can be treated as grid inpainting problems, given a few examples of task inputs and outputs and a query image. Second, we provide a new dataset that allows a model to learn such grid structures without any labeling, task descriptions, or any additional information about the grid structure. Finally, we show that while using our new dataset for training is essential, adding more generic image data from other sources (e.g. ImageNet) further improves the results.

2 Related Work

Natural Image Inpainting. Filling empty regions in an image has been widely explored for natural images. Earlier methods used data from the input image itself for inpainting Efros99; bertalmio2000image; criminisi2004region; barnes2009patchmatch; UlyanovVL17, whereas later works utilized datasets of images as source data Hays:2007; pathak2016context; yang2017high; liu2018image; Liu_2018_ECCV. Recent methods have attempted to apply transformers to visual synthesis tasks chen2020generative; yu2021diverse; esser2021taming; yu2021vector; chang2022maskgit. Due to the exponentially large number of completion options for a single output patch, these approaches rely on a discrete latent codebook van2017neural; ramesh2021zero; esser2021taming which serves as a smaller yet expressive vocabulary. To tackle the multimodal nature of synthesis, different approaches have been proposed to model the distribution over possible completions esser2021taming; yu2021vector; chang2022maskgit. For example, esser2021taming; yu2021vector proposed to synthesize images line-by-line using an autoregressive model and chang2022maskgit have proposed iterative parallel decoding. While the standard inpainting task typically aims to complete blank parts in natural images, our focus is on completing grid-like visual prompts, which require reasoning across multiple images within the visual prompt image.

Hole-filling as a Pretext task. Recent work has shown that self-supervised pretraining can generate powerful representations for transfer learning, even outperforming its supervised counterparts on challenging vision benchmarks van2018representation; goyal2019scaling; chen2020simple; he2020momentum; caron2020unsupervised; chen2020improved; misra2020self; gidaris2020learning. Pathak et al. pathak2016context first proposed using hole-filling as a pretext task for self-supervision with Context Encoders, where the goal is to predict an random image region given its context. Based on the recent success of Vision Transformers (ViTs) dosovitskiy2020image, multiple works have proposed to hole-filling a self-supervised pretext task for ViTs bao2021beit; mae; xie2021simmim. For example, in MAE mae, the goals is to reconstruct the image given a small subset of input patches. After pretraining on unlabeled data, MAE can produce representations that transfer well when fine-tuned on downstream tasks. Here, we use visual prompting to adapt these models to downstream tasks without any finetuning.

Few-Shot Learning. In this setting, the algorithm is trained on a labeled dataset of base classes, from which it should transfer to a set of novel classes given only a few training examples (like 10 or 30) nguyen2019feature; kang2019few; liu2020part; wang2020frustratingly; yang2020prototype; tian2020prior; zhang2021few; bar2021detreg. Unlike Few-Shot approaches, here we do not assume access to a large training set of base-classes, and our architecture is not task-specific. Our approach is Few-Shot only in the sense that we construct a visual prompt that contains one or two task examples.

Image Analogies. Hertzmann et. al. hertzmann2001image proposed the framework of Image Analogies for texture synthesis, where the algorithm is given a pair of training images (A and A’) and a query image (B). The goal is to synthesize a new image (B’) conditioned on the query, following the relationship inferred from the training pair. Other works have used analogies in style transfer upchurch2016z, and as a supervised image synthesis task reed2015deep. Predicting the correct completion was previously modeled as a classification problem sadeghi2015visalogy; hill2019learning, and other works have explored analogies in the context of learning different transformations between pairs of images memisevic2007unsupervised; taylor2010convolutional. Unlike these approaches, we use inpainting MAE models that learn from data, without assuming any predefined analogies structure.

Prompting in NLP. With the recent success of large unsupervised language models sarzynska2021detecting; devlin2018bert, Brown et al. gpt3 presented how a variety of NLP problems can be reformulated to a text completion problem given a predefined prompt, which can be used to solve different tasks without any finetuning. Prompting was shown to be a useful tool for solving various NLP tasks and benchmarks radford2019language; gpt3. More recently different approaches to prompting have emerged including Prompt Engineering gpt3; lu2021fantastically, Prompt Ensembling jiang2020can, and Prompt Prefix Tuning li2021prefix; prompt_ensemble. Inspired by the success of prompting in NLP, we aim to study prompting in computer vision where prompting hasn’t been widely explored.

3 Visual Prompting via Image Inpainting

Figure 2: MAE-VQGAN Architecture. During training, an input image is patchified, masked and fed into an MAE mae. For each masked token, the decoder outputs a distribution over a pretrained VQGAN esser2021taming codebook. The model is trained using cross entropy loss.

We turn to describe how to perform visual prompting using Image Inpainting models. In Section 3.1, we describe our proposed inpainting model, which is a combination of MAE and VQGAN. We then proceed to discuss visual prompting and propose different ways to create visual prompts in Section 3.2 (see example in Figure 1). Finally, we describe the dataset we collected for training our model in Section 3.3. The training process is illustrated in Figure 2

3.1 Inpainting using MAE-VQGAN

Given an input image and a binary mask , the goal of an inpainting function is to synthesize a new image , with the masked regions filled:

To implement with a neural network, it is necessary to consider design choices like the network architecture, how to train it, and whether it outputs a distribution over possible completions or pixels. We propose the MAE-VQGAN model, which combines ideas from MAE mae and VQGAN esser2021taming. Following the design of MAE, the model is based on ViT vaswani2017attention; dosovitskiy2020image and it is trained via masked auto-encoding by randomly masking image patches and then applying to reconstruct the image from the non-masked parts. MAE-VQGAN models the distribution , where is a visual token from a VQGAN vocabulary that corresponds to the ViT patch. For simplicity, we use a fixed ImageNet pretrained VQGAN codebook.222Using a publicly available checkpoint from https://github.com/CompVis/taming-transformers Unlike MAE which directly predicts pixels, MAE-VQGAN assigns probabilities to visual tokens via a softmax layer, which is better suited for capturing ambiguities. During training, we obtain ground truth visual tokens by mapping the image to visual tokens indices using the VQGAN encoder. The model is trained using cross entropy loss.

Let be the ordered set of predicted visual tokens. To obtain , we use argmax:

Then, to decode the visual tokens to pixels, we apply VQGAN decoder to to obtain .

3.2 Prompting Inpainting Models

To prompt an inpainting model, we construct a visual prompt, a grid-like image composed of task input-output example(s), and a new query image. The model then has to inpaint the rest of the image such that it is consistent with the task defined in the examples (see Figure 1).

Let be the set of input-output examples where is an image and is a function of (e.g is a segmentation mask). We assume is small (one or few examples). Then, given and a new input query , the goal is to predict the corresponding label . To prompt the inpainting model discussed in Section 3.1, we need to define a function that maps the examples set and query image to a new image and a mask:

The image is the visual prompt and the mask defines the masked region has to predict. For a given task, there might exist multiple implementations of that can be considered. The goal of the inpainting model is to reason about the visual prompt , and output a plausible completion without performing any additional training:

To obtain , we just take the part of corresponding to the mask .

Visual Prompt Engineering. For the visual prompting to work, should output a good visual prompt, composed of the examples and query image . Therefore, has to determine where and how to embed the inputs in the visual prompt image, considering the nature of the completion task. All the functions used in this work were hard-coded and manually engineered. In most cases, stacks the examples and image query horizontally by creating an image grid of cells, where the example is placed in the row, and the image query is in the last row. The grid has a fixed size, and therefore before populating it the input-output example pair(s) and query are first resized. Another consideration is how to draw every pair. For example, if is a segmentation mask, we can choose to use different colors to draw it. In Section 4.4, we describe different prompt design choices and their effect on the results.

Visual Prompt Ensembling. There could be multiple options to define . The idea in prompt ensembling, inspired by NLP jiang2020can; prompt_ensemble, is to construct multiple different prompts, apply the inpainting model on each prompt individually to obtain a set of predictions. The final prediction can be determined, for example, via majority voting, or weighted average. For simplicity, here we use a simple average.

3.3 The Computer Vision Figures Dataset

The images produced by are by construction not natural. Specifically, these images have a grid-like figure structure that stitches together images coming from different distributions, like natural images and segmentation masks. Therefore, a model trained on a standard dataset (e.g., ImageNet ILSVRC15) might struggle to process these grid-like images. To mitigate the domain gap, we collected a new dataset.

The Computer Vision Figures (Figures) dataset consists of images that more closely resemble the structure of our visual prompts. The dataset was collected from Arxiv, the open-access web archive for scholarly articles from a variety of academic fields. Arxiv sources are publicly available to download starting from . We downloaded all paper sources from to and selected the Computer-Vision partition “cs.CV” sources, as they contain images that more closely resemble a grid structure, as shown in Figure 3. To remove unrelated source images like graphs or charts, we manually tagged images and trained a binary image classifier to assign a high score to source images in a figure-like structure with at least one natural image. We then used the classifier over the entire data to keep only the most informative source images, coming from different papers. We randomly partitioned of the data to train and left the rest for validation. We include a datasheet with more information in the Supplementary Material.

Figure 3: Random images from our Computer Vision Figures dataset. We curated a dataset of 88k unlabeled figures from Computer Vision academic papers. During training, we randomly sample crops from these figures, without any additional parsing.

4 Experiments and Results

To study visual prompting, we pretrain different models (see Section 4.1) on ImageNet and on the Figures dataset, then quantitatively evaluate the models using different prompts on simple downstream computer vision tasks (see Section 4.2). Using a synthetic dataset, we assess how the choice of model and data affect the success of visual prompting in Section 4.3, and explore different prompting design choices in Section 4.4. We provide a large variate of qualitative results both in this section as well as in the Supplementary Material.

4.1 Models and Baselines

To study the effect of model choice on prompting results, we experiment using different models, including MAE-VQGAN (see Section 3.1) and several other inpainting models briefly described below.

VQGAN esser2021taming is an autoregressive transformer model used for inpainting and image generation. Visual tokens are predicted sequentially, line-by-line, and the model is trained using cross-entropy loss. The VQGAN model codebook is used to encode visual tokens, and it is trained beforehand using perceptual loss johnson2016perceptual and GAN loss goodfellow2014generative. We train it on ImageNet and our Figures dataset, following hyperparams in esser2021taming, and use a pretrained codebook with a vocabulary of size .

BEiT bao2021beit is a masked auto-encoder. The model maps each input patch to a visual token from a d-VAE ramesh2021zero vocabulary of size . To encode each visual token, the image is first resized to and then mapped to tokens. We use the publicly available BEiT large model, pretrained on ImageNet-21k. We also pretrain a large BEiT model on Figures for epochs.

MAE mae. Similar to BEiT, MAEs attempt to reconstruct a masked input image. Unlike in BEiT, the model directly regresses pixels and it is trained with loss. During pretraining, only non-masked tokens are fed into the encoder, which results in a faster training time. We use a publicly released checkpoint pretrained on ImageNet, and pretrain another model for epochs on our dataset.

Copy Example. This simple baseline simply replicates the first example label as the output.

Implementation Details. All the models we describe are large transformer-based models vaswani2017attention; dosovitskiy2020image, with patch size 1616, embedding dim , layers, and heads. For training, we used a machine with 8 Quadro RTX 6000 GPUs, with a batch size of . The input image size is .

Model Foreground Segmentation  Single Object Detection  Colorization 
Split 0 Split 1 Split 2 Split 3 Split 1 Split 2 Split 3 Split 4 MSE LPIPS
Copy 12.92 17.90 13.52 15.29 12.14 13.50 13.03 12.38 2.63 0.75
BEiT (IN-21k) 0.38 0.93 0.90 0.95 0.24 0.32 0.19 0.10 1.25 0.73
VQGAN (IN-1k) 6.96 10.55 9.59 9.43 5.19 4.99 5.09 5.10 2.44 0.66
MAE (IN-1k) 1.92 6.76 3.85 4.57 1.37 1.98 1.62 1.62 1.13 0.87
MAE-VQGAN (IN-1k) 2.22 7.07 5.48 6.28 3.34 3.21 2.80 2.80 3.31 0.75
BEiT (Figures) 5.38 3.94 3.20 3.29 0.17 0.02 0.14 0.16 0.60 0.70
VQGAN (Figures) 12.56 17.51 14.27 15.06 2.27 2.37 2.48 1.99 1.50 0.56
MAE (Figures) 17.42 25.70 18.64 16.53 5.49 4.98 5.24 5.84 0.43 0.55
MAE-VQGAN (Figures) 27.83 30.44 26.15 24.25 24.19 25.20 25.36 25.23 0.67 0.40
Table 1: Visual prompting results on computer vision tasks. For Foreground Segmentation and Single Object Detection, we report the mIOU score. For Colorization, we report the MSE.

4.2 Downstream Computer Vision Tasks

We quantitatively evaluate the inpainting models described above on computer vision tasks.

Visual Prompt. Given one example pair and a query image, we structure the prompt in the same fashion for all tasks. We construct a grid of sub-images, where the example pair is embedded in the first row, and the query image appears in the bottom left cell. See the example in Figure 1.

Computer vision tasks. We evaluate the inpainting models on standard image to image tasks like Foreground Segmentation, Single Object Detection and Colorization.

  • [leftmargin=*]

  • Foreground Segmentation. The goal is to binary-segment the query image to Foreground and Background. The example is an image and corresponding binary segmentation mask. The query is a new image, and the goal is to complete a corresponding segmentation mask. We use the Pascal-5i pascal_5i dataset, which is comprised of 4 different image splits where every split contains between 346 and 725 images and associated segmentation masks. For each class, the data contains a few image-mask pairs, together with held-out image queries. For every image query, we choose one random example pair. To evaluate, every pixel in the completed image is first mapped to the nearest Foreground or Background color. Finally, we report the mean IOU (mIOU) metric.

  • Single Object Detection. Similarly to Foreground Segmentation, the goal here is to binary-segment the object that appears in the query image. However, this task is more challenging than Foreground Segmentation because the example mask is obtained from a bounding box which is more coarse than a segmentation mask. We use the Pascal VOC 2012 dataset using images and their associated detection boxes. For simplicity, we use Pascal annotations to include only images with a single object and filter out trivial images that have an object covering more than of the image. We randomly select an example pair and image query of the same object class and repeat the process with different random seeds. For evaluation, we follow a similar process as in Foreground Segmentation to obtain a binary segmentation mask. Then we keep the connected component with the largest area using morphological operations and draw a bounding box around it. We report the mIOU results.

  • Colorization. The goal is to map a gray-scale image to a color image. The example pair is a gray-scaled image and the corresponding color image, as shown in Figures 1 and 4. We randomly sampled example pairs and image query from ImageNet ILSVRC15 validation set and converted them to gray-scale to obtain gray-scale and color version for each image. We report the MSE loss and LPIPS zhang2018unreasonable.

Results. We include quantitative results in Table 1, and qualitative completion results in Figure 4. Training on the Figures dataset improves the results for most models in all the downstream tasks. MAE-VQGAN outperforms the other models by a large margin for detection and segmentation and generates much sharper images than the MAE. We find that VQGAN struggles to output accurate results, likely due to sequential decoding. The BEiT model is outperformed by MAE, most likely because its training process is less sample efficient. For more results, see the Supplementary Material.

Figure 4: Visual prompting prediction examples. Each visual prompt was fed to an MAE-VQGAN model trained on the Figures dataset. For each visual prompt, the result is marked in red.
Figure 5: Synthetic data study results. MAE-VQGAN predictions are annotated with a red square.

4.3 Synthetic Data Study

To assess the compositional prediction capabilities of inpainting models, we created a set of 3 simple synthetic tasks and 3 of their combinations, and evaluated each model on examples per task.

Visual Prompt. Given two example pairs and a query image, we structure the prompt in the same fashion for all tasks. We construct a grid of sub-images, where the example pairs are embedded in the first two rows, and the query image in the bottom left cell. We include examples in Figure 5.

Change prediction tasks. Each example pair is an image of a colored shape, and a corresponding image with an introduced change. The change can be either in color, shape, size or a combination of two changes. Next, we describe each individual task in more detail.

  • [leftmargin=*]

  • Resize. Each example pair contains an image of a circle, and a corresponding image with the circle smaller in size. The goal is to predict the image with the resized version given image query.

  • Shape. Here every example pair is an image with circle, and a corresponding image with a rectangle. Both are similar in size and appear in the same location. The goal is to predict the image with rectangle, given a new image query.

  • Color. Each example pair contains an image of a circle appearing in the same location, with the color changed from green to blue. Given a new image query, the goal is to predict the corresponding image with the circle colored in blue.

Evaluation. We map each predicted pixel to its nearest neighbor color from a predefined set of options: black, white, blue, or green. We then measure and report the color-aware mIOU, by considering pixel predictions that appear in the ground-truth shape color as foreground and treat the rest as background.

Results. The results are presented in Table 2, for MAE-VQGAN prediction examples see Figure 5. Without training on the Figures dataset, inpainting models fail to generalize to these previously unseen tasks. The performance of all models increases when they are trained on the Figures dataset. Yet, the same models struggle with combinations of tasks due to the increasing complexity. The VQGAN model utilizes sequential decoding and therefore lacks context, which leads to poor performance. The MAE model outperforms MAE-VQGAN on color, and BEiT performs poorly in size. These models rely on pretrained codebooks (VQGAN and dVAE) that are likely not geared towards these tasks.

Color Shape Size Color & Shape Color & Size Shape & Size
Copy 5.53 6.71 1.17 6.74 1.17 1.86
VQGAN (IN-1k) 0.91 6.51 6.24 2.40 0.70 6.53
BEiT (IN-22k) 15.99 9.08 1.26 7.23 2.84 2.66
MAE (IN-1k) 0.00 2.07 1.20 0.00 0.00 1.56
MAE-VQGAN (IN-1k) 0.13 2.94 3.71 0.00 0.01 3.60
VQGAN (Figures) 6.96 19.11 16.21 7.40 2.24 18.41
BEiT (Figures) 40.92 31.43 7.12 33.10 21.21 12.98
MAE (Figures) 70.23 43.99 34.72 19.30 18.99 46.02
MAE-VQGAN (Figures) 40.40 46.53 42.04 20.41 18.27 40.33
Table 2: Synthetic data study results. We report the color-aware mIOU on the six tasks.
Pretraining # Labeled Images # Shots Model Split 0 Split 1 Split 2 Split 3
Unlabeled ImageNet 1 1 Finetune MAE 11.1 13.4 13.0 12.3
4 4 12.9 15.8 14.3 15.0
16 16 13.7 16.1 16.8 17.1
Unlabeled Figures 1 1 MAE-VQGAN 32.5 33.8 32.7 27.2
Labeled Pascal 5i (Segmentation masks) 1 FWB nguyen2019feature 51.3 64.5 56.7 52.2
1 CyCTR zhang2021few 67.2 71.1 57.6 59.0
Table 3: Comparison to Fine Tuning and Classic 1-Shot Segmentation baselines. MAE-VQGAN image query and output resolution is 111111. CyCTR and FWB resolution is 473473 and 512512, both approach utilize Pascal 5i labeled baseclasses data.
Figure 6: More examples, better results. Left: we construct visual prompts with increasing number of input-output pair examples, for a fixed query image (inpaintings annotated in red). Right: We observe that more examples improve the overall mIOU results on the four Pascal-5i splits.

4.4 Analysis

Figure 7: Training MAE-VQGAN on more data improves visual prompting results. Foreground Segmentation results on Pascal-5i, when trained over the Figures dataset and on the combined Figuresand ImageNet dataset.

Comparison to finetuning and Few-Shot baselines.We include a comparison to baselines that utilize training examples for each target class. For completeness, we also include the results of FWB nguyen2019feature and CyCTR zhang2021few, classic 1-shot baselines, which we view as an upper-bound of our approach. FWB and CyCTR utilize a fully labeled base classes train set (2086 to 5883 on different Pascal 5i splits). Additionally, their architecture was designed for the foreground segmentation task (e.g, they operate in higher resolution). The results in Table 3 indicate that the Visual Prompting results of MAE-VQGAN trained on Figures are significantly superior to standard finetuning baselines of MAEs pretrained on unlabeled ImageNet. FWB nguyen2019feature and CyCTR zhang2021few outperform Visual Prompting, mainly because they pretrain on a large tagged base classes dataset and utilize architectures that are specific to image segmentation.

Dataset effect. We evaluate the effect of pretraining on a larger and more diverse dataset. We compare training on ImageNet only, Figures only, and a combination of the two. We report the mIOU results on Pascal 5i for Foreground Segmentation in Figure 7. The MAE-VQGAN trained on ImageNet achieves a consistently low points mIOU. The model trained on the combined dataset performs best, which demonstrates that MAE-VQGAN can benefit from additional amounts of unlabeled images.

More examples, better results. We study how increasing the number of input-output pair examples in the visual prompt affects the results. Intuitively, we expect that including more examples should reduce ambiguities and lead to better results. We use an MAE-VQGAN pretrained on the Figures dataset, and use data from PASCAL-5i. We construct a large grid that can populate up to examples and an image query. We randomly choose different numbers of examples and randomize the placements. The results in Figure 6 confirm that using more examples leads to better segmentation results.

Figure 8: Prompt layout design. Two prompt orderings, and the corresponding average attention maps of the selected patch (annotated with black bounding box). The highest attention values appear on similar (corresponding) areas in the query image.

Prompt Engineering. We explore the effect of constructing different visual prompts for Foreground Segmentation and their corresponding MAE-VQGAN results (see Figure 8.a-b). The model generates plausible completions when changing the prompt layout (e.g. horizontal order vs. vertical order) and when changing the mask colors, texture or using only edges (see Figure 9). The mIOU results in Table 5 indicate that the model performs better with a vertical layout and when the segmentation mask colors are black and white. Interestingly, by analyzing the average attention heads of a masked patch token, we observe that the attention changes following the change in the prompts layout (see Figures 8.d-e).

Prompt Ensembling. Inspired by Prompt Ensembling in NLP prompt_ensemble, given the same example pair and image query, we construct multiple different visual prompts (e.g, horizontal and vertical layouts, see Figure 8.a-b). We then average the completion results. The results in Table 5 on the Synthetic Study tasks demonstrate that utilizing multiple prompts can lead to improved, and more stable performance.

Style/content extrapolation. Inspired by the classic example from Tenenbaum and Freeman tenenbaum2000separating (on the task originally suggested by Hofstadter hofstadter1995fluid), we use MAE-VQGAN to extrapolate letter sequences printed in different fonts (see Figure 10). We find that the model can extrapolate given style and new content (Figure 10a) but that it struggles to extrapolate new content (Figure 10b). The model also struggles to extrapolate more complex letter sequences; the performance deteriorates even if both style and content are given (Figure 10 c-d).

Figure 9: Task performance under different label choices. Prompting results when using different mask colors (e.g, purple/yellow vs. green/red), when drawing full mask compared to edges only, and when changing the mask texture. Compared to other alternatives, purple/yellow and black/white (see Figure 8) masks works best.
Table 4: Prompt Engineering. Foreground Segmentation mIOU results on Pascal-5i when using different prompt colors.
Horizontal Vertical
Black/White 27.17 31.57
Purple/Yellow 23.44 28.47
Table 5: Prompt Ensembling. We report here color-aware mIOU. In every line, the result is based on an ensemble of all previous prompts.
Prompt Layout Color Shape Size
Horizontal 39.97 46.54 42.01
+ Vertical 41.31 54.71 46.18
+ Vertical w/ Rows Swap 44.14 60.42 49.42
Figure 10: Style and content extrapolation using MAE-VQGAN. The model can extrapolate the style of a new content (a), but fails to predict a new content (b). The model struggles to extrapolate new style and content of longer sequences (c-e).

Limitations. The focus of this work is to present a proof of concept that shows it is possible to visually prompt simple image inpainting models trained on noisy, unlabeled data. Specifically, we demonstrate how to pre-train a network once, then prompt it to perform reasonably well on many tasks. The fact that this is possible is surprising and scientifically interesting, although this approach is not competitive with supervised task-specific models. For visual prompting to work, the inpainting models require training on the Computer Vision Figures dataset. However, our initial experiments suggest that it can benefit from training on additional natural image data (see Figure 7). Other limitations include ambiguities in the task definition, reliance on a pretrained VQGAN decoder, and worse performance when the input-output example(s) are not aligned (see examples in Figure 11).

Figure 11: Limitations and failure cases. Single input-output example might be ambiguous and can lead to unintended completions. The MAE-VQGAN model performs worse given non-aligned input-output example, and by using a VQGAN vocabulary, it is limited in synthesizing out-of-distribution pixels (like blurry images).

5 Discussion

Why does our proposed method, despite its simplicity, performs so well on a large subset of visual tasks? At this point, we do not have a good answer. Clearly, the specific training data we use plays an important role, but the amount of generalization observed is still surprising. Perhaps some of these image-to-image tasks are actually simpler than we believed. But it’s also evident that contemporary large-scale inpainting models are learning quite sophisticated long-range co-occurrences and symmetries in the data which can often enable impressive visual reasoning. We hope that our work will encourage further research to better our understanding of what is being learned by inpainting.

Acknowledgements: We would like to thank Assaf Shocher for insightful discussions and ideas related to the Figures dataset. We thank Aaron Hertzmann and Sanjay Subramanian for helpful feedback on the manuscript. This project has received funding from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell’s group was supported in part by DoD including DARPA’s LwLL and/or SemaFor programs, as well as BAIR’s industrial alliance programs. Prof. Efros’s group was supported by in part by DoD including DARPA’s MCS and/or ONR MURI, as well as funding from SAP.

References

Supplementary Material

Figure 12: Additional visual prompting results. For each visual prompt, the result is marked in red.

We include more information about the experimental study, as well as the Computer Vision Figures Dataset datasheet. Additional visual prompting results are included in Figure 12.

6 Experiments

6.1 Downstream Computer Vision Tasks

Qualitative model comparison. We include a qualitative comparison of visual prompting results when applied to Foreground Segmentation using different inpainting models (see Figure 13). Compared to other models, MAE-VQGAN outputs more smooth and accurate segmentation results.

Single Object Detection. We include qualitative results of MAE-VQGAN when applied to Single Object Detection (see Figure 14). As mentioned in Section 4.2, the raw output is rounded and postprocessed using morphological operations to convert the foreground segmentation to a box.

The effect of the input-output example. We fix the input query and change the input-output examples and visual prompt MAE-VQGAN (see Figure 15). Different examples could lead to slightly different synthesis results. However, as long as the example is meaningful (e.g, not fully background or foreground) the model results are plausible regardless of the choice of example.

Figure 13: Visual Prompting for Foreground Segmentation using different models. Compared to other models, MAE-VQGAN produces more smooth and accurate results.
Figure 14: Visual prompting applied for Single Object Detection. The raw MAE-VQGAN results are rounded and post processed using morphological operations.
Figure 15: The effect of input-output examples on prompting results. We fix the input query and change the input-output example. Different input-output examples could lead to slightly different results.

Visual Prompting results on computer vision tasks. We include the mean and standard deviation for each Visual Prompting model on Pascal 5i (see Table 6).

Model Foreground Segmentation Single Object Detection
Mean Std Mean Std
Copy 14.91 1.93 12.76 0.54
BEiT (IN-21k) 0.79 0.24 0.21 0.08
VQGAN (IN-1k) 9.13 1.33 5.09 0.07
MAE (IN-1k) 4.28 1.73 1.65 0.22
MAE-VQGAN (IN-1k) 5.26 1.84 3.04 0.24
BEiT (Figures) 3.95 0.87 0.12 0.06
VQGAN (Figures) 14.85 1.78 2.28 0.18
MAE (Figures) 19.57 3.62 5.39 0.32
MAE-VQGAN (Figures) 27.17 2.27 25.00 0.47
Table 6: Visual prompting results on computer vision tasks. We report mean and standard deviation of mIOU scores for Foreground Segmentation and Single Object Detection.

7 The Computer Vision Figures Dataset Datasheet

In Section 3.3 we presented the Computer Vision Figures dataset. The full dataset will be made available upon acceptance. Next, we include the dataset datasheet.

Motivation

For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

The Computer Vision Figures (Figures) dataset was proposed enable our approach for Visual Prompting. In our setup, a Visual Prompt is a single image that have a grid-like figure structure that stitches together images coming from different distributions, like natural images and segmentation masks. Therefore, a model trained on a standard dataset (e.g., ImageNet ILSVRC15) might struggle to process these grid-like images. To mitigate the domain gap, we collected a new dataset with images that more closely resemble the proposed Visual Prompt structure.

The dataset was collected from Arxiv, the open-access web archive for scholarly articles from a variety of academic fields. Arxiv sources are publicly available to download and the Computer-Vision partition contains images that more closely resemble a grid structure, as shown in Figure 3.

Who created this dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

N/A

Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

N/A

Any other comments?

No.


Composition

What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

Each instance in the dataset is an image, which is a single arXiv paper figure.

How many instances are there in total (of each type, if appropriate)?

The data is comprised of images, partitioned to train and validation. We include descriptive statistics of the data in Table 7. We find that around of the images in the data does not include any embedded annotation. The majority of the data () is comprised of grid, figure-like images. To obtain these statistics, random Figures images were manually labeled by a human tagger as a single image type (“single”) or a grid-like image type (“grid”). In addition, if the image contained an annotation, the tagger picked the annotation category from a predefined list (e.g, mask, box, pose, heatmap) or “Other” if the annotation doesn’t match any of these categories.

Annotation Type %
No Annotation Grid 37
Other Grid 15
Box Grid 11
Mask Grid 9
Heatmap Grid 8
Box Single 6
No Annotation Single 5
Pose Grid 4
Other Single 3
Pose Single 1
Heatmap Single 1
Table 7: Computer Vision Figures Dataset Annotations Statistics. The results were obtained by manually labeling images.

Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).

The dataset is based on Arxiv papers figures from to in the the Computer-Vision partition “cs.CV”. It only consists of figures with at least one natural image. To remove unrelated source images like graphs or charts, we manually tagged images and trained a binary image classifier to assign a high score to source images in a figure-like structure with at least one natural image. We then used the classifier over the entire data to keep only the most informative source images, coming from different papers. In a manual review of random figures, we found that the dataset is clean of graphs or charts.

What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.

Each instance image is in PNG image format with resolution up to . Each image has an accompanied Arxiv paper id and an indicator to an associated train/val partition.

Is there a label or target associated with each instance? If so, please provide a description.

There are no labels associated with each instance.

Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

All instances are complete.

Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

Images can have a similar or different arXiv source paper and we release this information.

Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

The dataset was randomly split into train and validation. In this paper, we only made use of the train partition for training on unlabeled figures. In future works, the validation partition may be used in different ways, like hyper-parameters tuning or for the evaluation of generative models.

Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

Noise. In a manual review of random figures, we found that the dataset is clean of unintended graphs or charts. Naturally, different figures could have potential overlap and we acknowledge this potential redundancies.

Overlap with Computer Vision Datasets. The Computer Vision Figures dataset is a collection of figures from different computer vision papers. Therefore, some of the images may overlap with computer vision datasets test, with or without annotations. To evaluate to what extent this is the case, we randomly sampled Pascal 5i validation images and for each image computed its 10 nearest neighbors in Figures. Given the pascal image and a Figures image, we used the OpenCV template matching function (“cv.matchTemplate”) that operates in a sliding window to compute similarity in pixel space using mean squared error. We also attempted to use an ImageNet pretrained ResNet50 to compute similarity and found that they work worse, likely because the Figures dataset contains out-of-distribution images. A human tagger examined the results and did not find duplicates. In 72% of the cases the nearest neighbors retrieved were irrelevant, and in 28% there was at least one Figures image that is somewhat similar (e.g, the pascal image contains a person and an image of a different person was retrieved). We conducted an additional independent check to evaluate the overlap. For each image of 100 random Pascal 5i test images we computed the 5 nearest neighbors in the Figures datasets using CLIP embeddings. Out of the 100 images, we found only a single one contained in a Figure image, and this image did not have any associated ground-truth annotation. Therefore, we conclude that even if there are potential overlaps they are likely very small and insignificant.

Imprecise Embedded Annotations. Instances in the dataset are images that may include an embedded annotation. Naturally, this annotation might be noisy, incomplete or just imprecise.

Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a future user? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

We will share and release links to the relevant Arxiv sources and figures, as well as an automated code for downloading and preprocessing the sources which is consistent with the Arxiv license (Non-exclusive license to distribute). Arxiv is a reliable source that is backed up by the entire research community. As commonly done, we may allow an alternative download option of the dataset on a “fair use” basis.

Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals non-public communications)? If so, please provide a description.

No.

Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

No.

Does the dataset relate to people? If not, you may skip the remaining questions in this section.

No.

Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

N/A

Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

N/A

Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

N/A

Any other comments?

No.


Collection Process

How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

The data was directly observable, e.g, automatically extracted from Arxiv sources.

What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?

The data was downloaded in accordance with the offical Arxiv guidlines for data access: https://arxiv.org/help/bulk_data.

If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

The dataset is based on Arxiv paper figures from to in the the Computer-Vision partition “cs.CV”. It only consists of figures with at least one natural image. To remove unrelated source images like graphs or charts, we manually tagged images and trained a binary image classifier to assign a high score to source images in a figure-like structure with at least one natural image. We then used the classifier over the entire data to keep only the most informative source images, coming from different papers. In a manual review of random figures, we found that the dataset is clean of graphs or charts.

Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

The team of students who worked together on this project. None of the team members were compensated for this work, beyond their regular compensation for the position they held.

Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

The papers and their accompanied sources were collected by Arxiv as part of their normal work protocols between 2010 to 2022. The dataset was created in 2022 based on these sources.

Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

N/A

Does the dataset relate to people? If not, you may skip the remaining questions in this section.

No.

Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

N/A

Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

N/A

Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

N/A

If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

N/A

Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

N/A

Any other comments?

No.


Preprocessing/cleaning/labeling

Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

To remove unrelated source images like graphs or charts, we manually tagged images and trained a binary image classifier to assign a high score to source images in a figure-like structure with at least one natural image. We then used the classifier over the entire data to keep only the most informative source images, coming from different papers. In a manual review of random figures, we found that the dataset is clean of graphs or charts. Each kept figure image was then resized to a resolution of up to and saved in a PNG format.

Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

We will release a list of links to the raw figures source files.

Is the software used to preprocess/clean/label the instances available? If so, please provide a link or other access point.

The software to download, extract, and preprocess the images will be made publicly available.

Any other comments?

No.


Uses

Has the dataset been used for any tasks already? If so, please provide a description.

The dataset was used for unsupervised learning algorithms, e.g, for pretraining inpainting models.

Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

A partial list of papers that use the dataset will be made available after the review stage.

What (other) tasks could the dataset be used for?

Potential other use cases include generative modeling, image retrieval and explainabilty.

Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks) If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

We do not anticipate any negative biases in the data or potential harms. Additionally, since Arxiv is moderated we do not anticipate the presence of offensive content.

Are there tasks for which the dataset should not be used? If so, please provide a description.

N/A

Any other comments?

No.


Distribution

Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

The dataset will be made publicly available after the review period.

How will the dataset will be distributed (e.g., tarball on website, API, GitHub) Does the dataset have a digital object identifier (DOI)?

The dataset will be distributed as a comma-separated (.csv) file describing the Arxiv links, partition, and paper id, containing the original figure sources. Additionaly, we will provide a direct download of the dataset in the form of a tarball on a "fair-use" basis. The dataset DOI will be the same as the one of this work.

When will the dataset be distributed?

The dataset will be made available after the review period.

Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

The comma-separated (.csv) file and accompanying code will be distributed under the MIT license.

Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

Arxiv holds a (non-exclusively) license to distribute all submitted papers and sources (https://arxiv.org/licenses/nonexclusive-distrib/1.0/license.html). Publishing the download links for Arxiv sources is in compliance with this license. As commonly done, we may allow an alternative download option of the dataset on a “fair use” basis.

Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

No.

Any other comments?

No.


Maintenance

Who will be supporting/hosting/maintaining the dataset?

The authors of the paper will be maintaining the dataset, which will be hosted on GitHub.

How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

We will post the contact information after the review period.

Is there an erratum? If so, please provide a link or other access point.

No.

Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to users (e.g., mailing list, GitHub)?

There are no plans to update the dataset at this time.

If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

N/A

Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to users.

No.

If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to other users? If so, please provide a description.

Contributions will be made possible using standard open source tools, submitted as pull request to the relevant GitHub repository.

Any other comments?

No.