Injecting Image Details into CLIP’s Feature Space

Zilun Zhang & Cuifeng Shen & Yuan Shen & Huixin Xiong & Xinyu Zhou
Megvii Technology Inc.
Beijing, BJ 10000, CHN
{zhangzilun,shencuifeng,xionghuixin,zxy}@megvii.com

Abstract

Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, the limited image input size of CLIP-like models (e.g., 224) means that subtle details are lost in the feature representation when the input is a high-resolution image (e.g., 2240). In this work, we introduce an efficient framework that produces a single feature representation for a high-resolution image, injecting image details while sharing the same semantic space as the original CLIP. In the framework, we train a feature fusing model on CLIP features extracted from patches generated by a carefully designed image patching method that can cover objects of any scale, weakly supervised by image-agnostic class-prompted queries. We validate our framework by retrieving images with class-prompted queries on real-world and synthetic datasets, showing significant performance improvement on these tasks. Furthermore, to fully demonstrate our framework’s detail retrieval ability, we construct a CLEVR-like synthetic dataset called CLEVR-DS, which is fully annotated and has a controllable object scale.


1 Introduction

The text-to-image retrieval task is to retrieve relevant images given a text query. The query can either be a sentence describing the whole image or an object name focusing on a small part of the image.

For instance, suppose we use CLIP to retrieve images containing a red helmet and use "red helmet" with a prompt as the text query; what appears first is always an image with a sizeable red helmet right in the middle. However, images showing people wearing red helmets on a football field are also what we want, and the red helmets in them can be tiny. Since CLIP was trained to match an image as a whole to a text description, it is hard for CLIP to retrieve such images by just using "red helmet" with a prompt as a text query. This means that images are retrieved when their main parts match the text description. However, in many practical scenarios, we need to use a word to retrieve all related images in a database, as illustrated in Figure 1.

The problem mentioned above can be partially solved by dividing each image into many small patches and doing the retrieval task based on the features of these small patches. However, this multi-feature method cannot provide a single feature that is directly usable for downstream tasks or end-to-end training. Models such as Crop-CLIP by crop-clip complete the retrieval by first detecting the objects in the image with a detector, YOLOv5. However, the capacity of the detector severely constrains the model performance. For example, the model behaves unpredictably when encountering objects from unseen classes, and the information about these objects is lost before entering the retrieval stage. Models such as MDETR by kamath_mdetr_2021, XDETR by cai2022x, and FILIP by yao2021filip achieve many improvements in detail retrieval by retraining a model with multi-modality data and mining the fine-grained relationships during training. However, they all require enormous amounts of training data and computational resources.

In this paper, we propose a framework, "Detail Injected CLIP (DetailCLIP)," which can solve the above problem at a small cost. By analyzing the retrieval performance of CLIP-like models over a spectrum of object scales, we identify the object scales at which CLIP-like models maintain most of their performance. Given a feasible object scale, we design an image patching scheme, "Complete Cover (CC)," which can cover objects of any scale. Then, we use a transformer to fuse the features extracted by CLIP-like models from the patches sliced by CC. We also construct a self-supervised learning loss function to align detailed information with the fused single feature.

Figure 1: Retrieval results from CLIP model and our DetailCLIP model. The red incorrect sign stands for failing to retrieve.

In this framework, we could inject detailed information in the image into a single feature that can be directly used for end-to-end training, with the cost of training a small transformer. Our main contributions are summarized as follows:

We focus on the detailed class-with-prompt text-to-image retrieval task and propose an efficient framework DetailCLIP which can produce a single image feature with detailed information. We test the above text-to-image retrieval result on MSCOCO, LVIS, and synthetic datasets, and our framework outperforms current vision-language models.
We design an image patching scheme, "Complete Cover (CC)". CC patches can theoretically cover objects of any scale while reducing the number of redundant patches.
We construct a retrieval benchmark based on CLEVR johnson2017clevr and ShapeNet ShapeNet2015 3D objects, called "CLEVR of Different Scales (CLEVR-DS)". On this fully annotated dataset with controllable object scale, our framework’s retrieval result outperforms the current method by a large margin.

We arrange the subsequent sections of this paper as follows. Section 2 introduces the related work. Section 3 gives a detailed description of the "Complete Cover (CC)" method proposed in this paper and of our DetailCLIP framework with its corresponding loss. Section 4 introduces the benchmark proposed in this paper. In Section 5, we go through many experiments to verify our framework’s superiority.

2 Related Work

2.1 Vision-Language model overview

Current Vision-Language Models can be divided into different types according to their task objectives. CLIP radford_learning_2021, ALIGN jia_scaling_2021, and Flamingo alayrac2022flamingo align the textual and visual information into a shared semantic space through a contrastive learning task. VisualBERT li_visualbert_2019, UNITER chen_uniter_2020, and DALL-E ramesh_zero-shot_2021 focus on region modeling or text-guided image generation. Single-stream models such as VisualBERT and ViLT kim_vilt_2021 concatenate the visual and textual features and feed them to a unified encoder-like transformer. Dual-stream models, such as CLIP, FILIP yao2021filip, and ViLBERT, have separate encoders for different modalities. There are other works for different tasks, such as MDETR kamath_mdetr_2021 (object detection), PhraseCut wu_phrasecut_2020 (segmentation), and Florence yuan_florence_2021 (foundation model).

2.2 CLIP-like models

CLIP radford_learning_2021 is a neural network trained by OpenAI on various image-text pairs. Given an image, CLIP predicts the most relevant text snippet, or vice versa. CLIP’s "zero-shot" performance on the ImageNet classification task matches the performance of the original ResNet50, similar to the "zero-shot" capabilities of GPT-2 and GPT-3, and it is also robust to natural distribution shift goh2021multimodal. However, the CLIP feature has limitations, such as failing easily on typographic attacks or fine-grained concepts. Besides, zhou2021learning argues that the quality of the text feature is highly related to the prompt method (the way class labels are augmented to generate text sentences). In SLIP by slip, different views of each input image are used for text supervision and image self-supervision, which demonstrates that image self-supervision benefits the performance of CLIP. At the same time, DeCLIP li2021supervision adds several training objectives to CLIP to improve the performance of CLIP-style language supervision. Most recently, glip proposed GLIP, which unifies object detection and phrase grounding for pre-training and can learn object-level, language-aware, and semantic-rich visual representations.

2.3 Vision Language Model applications

MDETR by kamath_mdetr_2021, a modulated detector for multi-modal understanding, has achieved state-of-the-art results on multiple datasets. However, it employs an expensive joint-modality transformer to align language and vision, which makes it impractical for real applications. XDETR by cai2022x aligns the object’s bounding box, rather than the entire image, with free-form language. Its architecture has three main components: an object detector, a language encoder, and an alignment module. The visual and linguistic data streams are independent until the end and are aligned by dot product operations. XDETR shows good accuracy and speed in multi-instance multi-modal tasks (e.g., 16.4 AP on LVIS detection of 1.2K categories). In order to capture the fine-grained alignment between image and text, RegionCLIP, proposed by regionclip, creates a pool of object concepts from the text corpus and uses a pre-trained CLIP model to align a concept with an image region, producing pseudo labels. They use region-text pairs and ground-truth image-text pairs to pretrain a vision-language model. RegionCLIP shows a better ability to recognize region objects and transfers successfully to the open-vocabulary object detection task, but it provides no analysis of whether it can detect small objects in an image. GroupViT by groupvit introduces a grouping mechanism into deep networks, which allows semantic segments to emerge automatically with only text supervision. GroupViT learns to group semantic regions together and transfers successfully to the semantic segmentation task.

3 Methodology

3.1 Motivation and Effective Scale Sensitivity

CLIP’s capability to retrieve an object deteriorates as the object becomes smaller. We perform our experiment on the LVIS dataset gupta2019lvis and use

r_{max} = \frac{\text{Maximum Area of the Objects in the Image}}{\text{Area of the Image}}

(1)

with different values as an upper bound (threshold) to create subsets of LVIS. We choose LVIS because it has more categories than COCO and has more annotations for small objects. As shown in Table 1, CLIP performance decreases monotonically as the object size decreases.

$r_{max}$	$†$ Recall@1	Recall@3	Recall@5
$10^{0}$	8.63%	15.19%	18.60%
$10^{-0.5}$	7.52%	13.87%	17.45%
$10^{-1}$	5.75%	11.28%	14.66%
$10^{-1.5}$	4.92%	9.61%	12.16%
$10^{-2}$	3.98%	8.76%	11.48%

$†$ Refer to Section 4.4 for the metric details.

Table 1: Performance of CLIP text-image retrieval task with different LVIS subsets.
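The subset construction behind Table 1 can be sketched in a few lines. This is a minimal illustration, assuming that each image record carries a list of object areas in pixels and that an image belongs to a subset when its $r_{max}$ does not exceed the threshold; the record layout, the function names, and the toy values are hypothetical and are not taken from LVIS.

```python
def r_max(object_areas, image_width, image_height):
    """Largest object area divided by the image area (Eq. 1)."""
    if not object_areas:
        raise ValueError("image has no annotated objects")
    return max(object_areas) / float(image_width * image_height)


def subset_by_r_max(images, threshold):
    """Ids of the images whose r_max does not exceed the threshold."""
    return [
        img["id"]
        for img in images
        if r_max(img["areas"], img["width"], img["height"]) <= threshold
    ]


# Toy annotations (hypothetical values, for illustration only).
images = [
    {"id": "a", "width": 100, "height": 100, "areas": [5000, 40]},
    {"id": "b", "width": 100, "height": 100, "areas": [90, 25]},
    {"id": "c", "width": 200, "height": 100, "areas": [1500]},
]

# Image "a" has r_max = 0.5 and is dropped at the 10^-1 threshold.
print(subset_by_r_max(images, threshold=10 ** -1))
```

Lowering the threshold keeps only images whose largest object is small, which is why the smaller thresholds in Table 1 correspond to harder subsets.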

We define the Effective Scale Sensitivity (ESS) of CLIP-like models as the minimum percentage of an image that an object must occupy for CLIP-like models to retrieve it. Input images therefore need to present objects within this sensitivity. A natural approach is to slice an image into small patches and retrieve objects from those patches. In the following sections, we propose our method to solve the above problem at a small cost.

3.2 Problem Definition

In this section, we give a formal definition of the Detail Injection problem in terms of CLIP-like models, an image, and a set of image patches. Suppose we generate $p$ patches from an image $X$, and we denote each image patch as $x_{i} \in X$, where $i \in \{1, \ldots, p\}$. Then $F : R^{c \times w \times h} \to R^{d}$ represents the image encoder of a CLIP-like model, and the $d$-dimensional feature $u_{i}$ extracted from a single image patch $x_{i}$ can be represented as:

u_{i} = F (x_{i}),

(2)

We denote the set of image patch features for a given image as $U = \{u_{i}\}$, where $i \in \{1, \ldots, p\}$. We define our fusing model as $D : R^{p \times d} \to R^{d}$. The DetailCLIP feature $v$ is obtained from:

v = D (U), where v \in R^{d}

(3)

3.3 Complete Cover

We propose a patch generation scheme in this subsection. Let the side length of an image be $n$. The number of possible patches needed to cover all possible objects is then on the order of $O (n^{4})$, which is computationally infeasible. We therefore devise the "Complete Cover (CC)" method to eliminate redundant patches while still covering all possible objects. The schematic diagram of Complete Cover is shown in Figure 2.

Figure 2: Illustration of Patch Selection of Complete Cover. Minimum effective object means the minimum object that can be retrieved by CLIP from the patch. Each color represents a different patch size, which slides over the whole image to cover objects equal to or bigger than the minimum effective size.

Firstly, we define the meaning of cover. Let $Q$ be the set of pixels of a patch, $P$ be the set of pixels in the bounding box of an object. In order for the patch to include objects while maintaining the capability to retrieve the patch, we define $Q$ covers $P$ as follows:

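		$Q \text{ covers } P \coloneqq C (Q, P) = \begin{cases} \text{True}, & \text{if } \forall p = (x, y) \in P \to p \in Q \text{ and } | P | > c^{2} \cdot | Q | \\ \text{False}, & \text{otherwise} \end{cases}$		(4)
		where $| \cdot |$ is the number of pixels in $\cdot$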
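For axis-aligned rectangles, the predicate of Eq. (4) reduces to a containment test plus an area comparison. The following sketch assumes that a patch and a bounding box are both given as `(x0, y0, x1, y1)` tuples in pixel coordinates with exclusive upper bounds; this box convention and the function names are assumptions made for illustration.

```python
def box_area(box):
    """Number of pixels in an axis-aligned box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return (x1 - x0) * (y1 - y0)


def covers(patch, obj, c):
    """C(Q, P) from Eq. 4: obj lies inside patch and |P| > c^2 * |Q|."""
    qx0, qy0, qx1, qy1 = patch
    px0, py0, px1, py1 = obj
    contained = qx0 <= px0 and qy0 <= py0 and px1 <= qx1 and py1 <= qy1
    large_enough = box_area(obj) > c ** 2 * box_area(patch)
    return contained and large_enough


patch = (0, 0, 100, 100)
print(covers(patch, (10, 10, 30, 30), c=0.1))    # inside and large enough
print(covers(patch, (10, 10, 18, 18), c=0.1))    # inside but too small
print(covers(patch, (90, 90, 110, 110), c=0.1))  # not fully inside
```

The second condition is what ties the predicate to the effective scale sensitivity: a patch that contains an object still does not count as covering it if the object occupies too small a fraction of that patch.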

Secondly, we call the set of all possible $P$ in the image as $S_{full}$ . In "Complete Cover (CC)" scheme, we design a greedy algorithm to generate a set of $Q$ as $S_{cc}$ , which satisfies:

\forall P \in S_{full}, \exists Q \in S_{cc}, \text{ s.t. } C (Q, P) = 1

(5)

The specific patch selection method is as follows. Given an effective scale sensitivity $c$, assume the full-image pixel set is $Q_{0} \in S_{cc}$. Every object bounding box pixel set $P_{0} \in S_{full}$ that it can cover satisfies:

C (Q_{0}, P_{0}) = 1, \forall P_{0} \in S_{full}, if | P_{0} | > c^{2} \cdot | Q_{0} |,

(6)

Without loss of generality, consider $P \in S_{full}$ to be square and $P_{0}$ to have a side length of $a$. In order to cover $P_{1} \in S_{full}$ whose side length is $a / c - 1$, we use a greedy way to get $Q_{1} \in S_{cc}$ such that $C (Q_{1}, P_{1}) = 1$ by passing a global sliding window with a side length of $a - c$ and a step size of $a / c - 2$. We repeat this procedure until we have the patches that can cover objects with side lengths ranging from $a / c$ to $a / c - n$, where $n = a / c - 1$.

Our CC method can better retain detailed information than simply slicing the image into non-overlapping, equal-sized patches as in ViT by vit. CC faces a trade-off between completeness and computational complexity that depends on $c$. A reasonable choice of $c$ needs to generate a bearable number of patches while ensuring the retention of detailed information.
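As a rough illustration of how overlapping windows at several sizes can satisfy the cover condition of Eq. (5), the following sketch builds a simplified multi-scale sliding-window schedule and checks it by brute force against the predicate of Eq. (4). It is not the CC schedule described above. The window sizes (each twice the largest object side that the previous window size fails to cover), the half-window stride, the restriction to square objects of at least `min_object_side` pixels, and all function names are assumptions chosen so that the cover property is easy to verify.

```python
import math
from fractions import Fraction


def covers(patch, obj, c):
    """C(Q, P) from Eq. 4 for boxes given as (x0, y0, x1, y1)."""
    qx0, qy0, qx1, qy1 = patch
    px0, py0, px1, py1 = obj
    contained = qx0 <= px0 and qy0 <= py0 and px1 <= qx1 and py1 <= qy1
    obj_area = (px1 - px0) * (py1 - py0)
    patch_area = (qx1 - qx0) * (qy1 - qy0)
    return contained and obj_area > c ** 2 * patch_area


def sliding_windows(image_side, window_side):
    """Square windows with a half-window stride; the last one touches the border."""
    stride = max(1, window_side // 2)
    starts = list(range(0, image_side - window_side + 1, stride))
    if starts[-1] != image_side - window_side:
        starts.append(image_side - window_side)
    return [(x, y, x + window_side, y + window_side) for y in starts for x in starts]


def multiscale_patches(image_side, c, min_object_side):
    """Full image plus progressively smaller overlapping windows (assumed schedule)."""
    if not 0 < c < Fraction(1, 2):
        raise ValueError("this sketch requires 0 < c < 1/2")
    patches = [(0, 0, image_side, image_side)]
    window = image_side
    while True:
        # Largest square side that is too small for the current window size.
        largest_uncovered = math.floor(c * window)
        if largest_uncovered < min_object_side:
            break
        # A half-window stride guarantees containment of squares up to window // 2.
        window = 2 * largest_uncovered
        patches.extend(sliding_windows(image_side, window))
    return patches


def is_complete_cover(patches, image_side, c, min_object_side):
    """Brute-force check of Eq. 5 over all square boxes of the allowed sizes."""
    for side in range(min_object_side, image_side + 1):
        for y in range(image_side - side + 1):
            for x in range(image_side - side + 1):
                box = (x, y, x + side, y + side)
                if not any(covers(q, box, c) for q in patches):
                    return False
    return True


c = Fraction(1, 4)  # exact arithmetic avoids floating-point edge cases
patches = multiscale_patches(16, c, min_object_side=2)
grid = [(0, 0, 16, 16)] + [
    (x, y, x + 4, y + 4) for y in range(0, 16, 4) for x in range(0, 16, 4)
]
print(len(patches), is_complete_cover(patches, 16, c, 2))
print(len(grid), is_complete_cover(grid, 16, c, 2))
```

On this toy 16-pixel image, the overlapping schedule passes the check, whereas the non-overlapping 4-pixel grid plus the full image fails: a small object that straddles a grid line is contained only in the full image, where it is too small to count as covered.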

3.4 Model & Loss

We will propose a fusion model and a proxy loss in this subsection. Assume the patch selection method is determined. We extract the features of each image patch, and then we use CLIP prompts to prompt all class labels into sentences and extract the features of each sentence. This section will discuss fusing multiple patch features into a single feature and using the image-agnostic text feature as a proxy to inject detailed information. The overview architecture of the framework is illustrated in Figure 3.

Figure 3: DetailCLIP framework with Query Proxy Loss.

3.4.1 Fusing Model

The essence of DetailCLIP is injecting features from different patches into one new feature while keeping as many details as possible. There are many options for implementing the fusing model, such as learning a weight to average the patch features, a linear projection, or an MLP. We choose to implement the fusing model with a small transformer. Specifically, the input of this fusing model consists of two parts: the features of all patches and the feature of the entire image. We use the former as the input source and the latter as the input target of the transformer.

3.4.2 Query Proxy Loss

For a small object in an image, there will be one patch feature that contains the most information about the object, and we denote that feature as $u_{max} \in U$. The purpose of the loss function is to inject this patch feature, which contains the most detailed information about the small object, into the fused feature. Specifically, we use a text feature $w$ that describes the small object in the image as a proxy feature to draw the fused feature $v$ close to the patch feature $u_{max}$, since the text feature is in the same joint feature space as the image feature. We use a similarity function to measure the similarity between the proxy feature and each candidate patch feature and choose the most similar patch feature as $u_{max}$. Meanwhile, we compute the similarities between the fused feature and the proxy feature. Then, we minimize a distance function to draw these two similarity distributions together.

L_{qp} = D [\text{sim} (v, w), \text{sim} (w, u_{max})]

(7)

where $w \in R^{d}$ has the same dimension as $v$ and $u_{max}$. The symbol sim represents the similarity measure function, and the symbol $D$ represents the distance measure function. We try to learn $v$ so that the distance between the similarity distribution of $v$ and $w$ and the similarity distribution of $u_{max}$ and $w$ is minimized. For every batch of images, we use all class-prompted text features to calculate the similarities, and no supervision is used besides the class names in the dataset. PyTorch-style code and the complete pipeline of our model can be found in Appendix A.6.
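To make the structure of Eq. (7) concrete, the following plain-Python sketch computes the loss for one image. The text above leaves sim and $D$ generic, so the choices here are assumptions: cosine similarity for sim, a softmax over all class-prompted text features to form each similarity distribution, a temperature of 0.1, and the KL divergence for $D$. The function names and the toy features are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (assumed choice of sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_proxy_loss(fused, patch_feats, text_feats, temperature=0.1):
    """KL divergence between the u_max and fused similarity distributions."""
    # For each text proxy w, sim(w, u_max) is the best patch similarity.
    target = [max(cosine(u, w) for u in patch_feats) for w in text_feats]
    # Similarity of the single fused feature v to each text proxy w.
    predicted = [cosine(fused, w) for w in text_feats]
    p = softmax(target, temperature)
    q = softmax(predicted, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy 3-d features: two patches, three class prompts (hypothetical values).
patch_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

good = query_proxy_loss([1.0, 1.0, 0.0], patch_feats, text_feats)
bad = query_proxy_loss([0.0, 0.0, 1.0], patch_feats, text_feats)
print(good < bad)
```

A fused feature that keeps the content of both patches gives a lower loss than one that points toward a class present in no patch. Under these assumptions the loss needs only the class prompts, not bounding boxes or per-image labels.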

4 Benchmark

In text-image retrieval, the traditional task "use a caption to retrieve single image" is widely used to evaluate the retrieval capability of the model. However, in practice, the need to "use a word to retrieve all related images in a database" is waiting to be fulfilled. Traditional text-image retrieval datasets such as Flickr30k and COCO-caption (MSCOCO dataset using caption to retrieve) are unsuitable for evaluating the latter task. We select some object detection datasets and use "prompting with class name" as a text query to retrieve all related images in a database. We construct the benchmark for this vital task by adopting existing datasets, making a new synthetic dataset, and designing the evaluation metrics.

4.1 Existing Datasets

Theoretically, many existing datasets with class-wise object supervision and class names can be used to construct the benchmark. However, both real-world and synthetic datasets have deficits, not only for the task we propose but also for analyzing the detail retrieval ability of different methods. Examples are shown in Figure 4.

For real-world datasets, the label information for Visual Genome krishna2017visual, ImageNet-1K imagenet, GPR1200 gpr1200 is more focused on the main object in the image. Since these datasets are collected from real-world scenes, full-annotation of the images is difficult to achieve. Therefore the annotation of fine detail is insufficient. Furthermore, the large dataset LVIS gupta2019lvis has many missing labels, making it impossible to obtain accurate conclusions during the retrieval evaluation. Synthetic datasets such as CLEVR by johnson2017clevr, SHOP-VRB by SHOP-VRB have too few object categories. Besides, the datasets have no additional design and attention to the object size in the images.

4.2 CLEVR-DS

4.2.1 Summary of CLEVR-DS

In order to achieve accurate retrieval evaluation, we built a dataset, "CLEVR of Different Scales (CLEVR-DS)," which includes some common object categories in ShapeNet. Because the images are rendered, we can obtain full annotation information (e.g., spatial position, bounding box, category, and attribute) for every image.

Our CLEVR-DS covers a wide range of object scales (mixed, large, and small) and is fully annotated, but it falls short on the number of categories and scene complexity. Other datasets show greater variability in scene complexity and therefore complement the CLEVR-DS dataset. In CLEVR-DS, the instance number per image is nearly three, the instance area ranges from 504 to 4849600, and the instance number per class is nearly 590. Detailed information is shown in Figure 5.

Figure 5: (a) Distribution of instance area (log scale). The x-axis represents the range of sizes, and the y-axis represents the count of objects whose areas fall in this range. (b) The number of objects per class for the train, validation, and test sets. The x-axis shows the class names, and the y-axis shows the number of objects per class. The figure shows that CLEVR-DS is almost class-balanced in each partition and that object sizes are densely distributed across different scales.

4.2.2 Render Setting

The dataset is rendered by combining the rendering method of CLEVR with calls to Blender’s Python API, using some object categories in ShapeNet as candidate targets. The objects are scaled before being added to the canvas for rendering, which controls the size distribution of objects in the image. Specifically, we use 51 categories in ShapeNet and ten objects in each category to complete the rendering. This choice addresses the class-wise long-tail effect of the ShapeNet models themselves and prevents some categories from being indistinguishable for CLIP when materials are ignored.

4.3 Datasets Statistics

This section lists the datasets we choose for the class-with-prompt text-to-image retrieval task. The statistics of CLEVR-DS, Unity-Retail by unity, MSCOCO, and LVIS are given in Tables 2 to 6. These four datasets have relatively more instances per image than other datasets and are therefore more likely to contain small objects in an image. In particular, the mean number of classes per image in CLEVR-DS is comparable with MSCOCO, but the number of instances per image is lower than in MSCOCO.

Per Image	Mean	Min	Max
Instance	3.01	1	5
Class	3.01	1	5

Table 2: Statistics of CLEVR-DS

Per Image	Mean	Min	Max
Instance	7.33	1	93
Class	2.92	1	18

Table 3: Statistics of MSCOCO

Per Image	Mean	Min	Max
Instance	25.52	16	42
Class	13.52	9	19

Table 4: Statistics of Unity-Retail

Per Image	Mean	Min	Max
Instance	11.2	1	294
Class	3.4	1	24

Table 5: Statistics of LVIS

DATASET	COCO	LVIS	CLEVR-DS	Unity-Retail
Image Number	122,219	122,219	10,000	1,000
Class Number	80	1230	51	16

Table 6: Class number of datasets.

4.4 Evaluation Metrics

The current evaluation metric for text-to-image retrieval is the top-$k$ recall accuracy of a text query. However, we focus on retrieving all related images in a database, and the top-$k$ retrieval result is not enough to evaluate the performance of our method. We propose a new evaluation metric, Recall@k, to evaluate the retrieval performance. Recall@k is calculated as follows:

Recall@k = \frac{t_{k}}{n}

(8)

where $n$ is the number of images of the query (also a class) in the database, and $t_{k}$ represents the number of images that contain the query among the first $n \times k$ retrieved images. Suppose $I_{j}$ is an image that contains the query; $t_{k}$ is calculated as follows:

		$t_{k} = \sum_{j = 1}^{N} I (I_{j}), \text{ where}$		(9)
		$I (I_{j}) = \begin{cases} 1 & \text{if } I_{j} \text{ is in the top } n \times k \text{ results} \\ 0 & \text{if } I_{j} \text{ is not in the top } n \times k \text{ results} \end{cases}$

For the number $k$ , we select 1, 3, and 5 as the anchor points of the evaluation indicators. All datasets are evaluated in this method.
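The metric of Eq. (8) can be computed directly from a ranked list. The sketch below assumes that retrieval results are given as a list of image ids sorted from most to least similar, without duplicates, and that the ground truth is the set of image ids containing the queried class; the function name and the toy ids are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Recall@k from Eq. 8: relevant images found in the top n*k, divided by n."""
    relevant = set(relevant_ids)
    n = len(relevant)
    if n == 0:
        raise ValueError("the query has no relevant images")
    top = ranked_ids[: n * k]  # cutoff scales with the number of relevant images
    t_k = sum(1 for image_id in top if image_id in relevant)
    return t_k / n


# Toy ranking for one class query with n = 2 relevant images.
ranked = ["i3", "i7", "i1", "i9", "i2", "i5"]
relevant = {"i1", "i2"}

for k in (1, 3, 5):
    print(k, recall_at_k(ranked, relevant, k))
```

Because the cutoff is $n \times k$ rather than a fixed $k$, a class with many positive images is given a proportionally longer result list, which suits the goal of retrieving all related images.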

5 Experiments

5.1 Implementation Details

Dataset setting: We focus primarily on two types of datasets: synthetic and real-world datasets. Since the synthetic dataset is fully annotated, we test our model’s image fine-detail retrieval ability by setting the size of objects in the image. For CLEVR-DS, we define two sub-datasets as CLEVR-DS-S/CLEVR-DS-L, which contain only small/large objects of the query semantic information and the distractor data. For Unity-Retail datasets, we randomly split them in 7:1:2 for train, validation, and test set. We also evaluate our model on real-world datasets MSCOCO and LVIS. For MSCOCO, we randomly take 5000 images for validation and use the original validation set to test. Since LVIS has the same images as MSCOCO, we use the same split setting as MSCOCO.

Model architecture: In order to test the generalizability of DetailCLIP framework, we use several CLIP-like models as image encoders to extract image features. Besides, the image and text features are mapped into 512 and 768 dimensions for different image encoder architecture. We use a small transformer structure for the image feature fusing model with three encoders and three decoders. The whole framework is optimized through query proxy loss.

Patch selection mode: In the main result, we use CC to represent the patch selection method. To save computational cost, we set the number of patches to 166. We test the CLIP-like model’s retrieval ability without training on both the CC patches and the whole-image features. Specifically, we choose the highest CC patch score as the image retrieval score. For CC@ $\cdot$ , the $\cdot$ represents the selection of hyper-parameter $k$ , where $k = a / c$ . A study on choosing the value of $k$ is in Appendix 10. In the DetailCLIP scenario, the CC patch features are used to train the model.

Training Details: We used the Adam optimizer to train the DetailCLIP model for 100 epochs on a 2080 Ti GPU, tuned the hyperparameters on the validation set, and tested the best model on the test set. The parameters we searched include the learning rate, the weight decay, and the step size and gamma value of the scheduler. We also searched the values of the gradient clip and the layer norm’s epsilon.

5.2 Main Result Analysis

5.2.1 Results on CLEVR-DS

This section uses CLIP with a ViT-B/32 backbone as the image and text feature extractor. We choose a tiny transformer with three encoders and three decoders to fuse the patch features. In the non-training scenario, we retrieve with either a single feature or multiple features. The former uses the similarity between the whole-image feature and the text feature for decision-making, and the latter uses the largest similarity between any patch feature and the text feature. As shown in Table 7, DetailCLIP improves Recall@1 from 20.35% to 41.10% on the CLEVR-DS dataset with mixed object sizes. At the same time, DetailCLIP improves Recall@1 significantly on the CLEVR-DS-S subset (from 3.18% to 26.30%) and by a good margin on the CLEVR-DS-L subset (from 25.67% to 39.23%).

The above results prove that the feature space of CLIP has enough potential, and the information in the image is enough to obtain a good feature that represents the semantics of all the details. In addition, as the $k$ of $R @ k$ increases, the performance superiority of DetailCLIP is more prominent. In a practical scenario, when trying to retrieve several objects, people usually do not mind returning more retrieval results, as long as they include the target.

Dataset	Method	Single Feature	Input	Recall@1	Recall@3	Recall@5
CLEVR-DS	CLIP	✔	Full Image	20.35%	35.59%	47.30%
CLEVR-DS	CLIP	✕	CC@10	41.78%	61.49%	71.63%
CLEVR-DS	DetailCLIP	✔	CC@10	41.10%	63.01%	73.41%
CLEVR-DS-S	CLIP	✔	Full Image	3.18%	8.55%	14.81%
CLEVR-DS-S	CLIP	✕	CC@10	29.30%	44.48%	51.94%
CLEVR-DS-S	DetailCLIP	✔	CC@10	26.30%	43.11%	52.26%
CLEVR-DS-L	CLIP	✔	Full Image	25.67%	41.25%	50.16%
CLEVR-DS-L	CLIP	✕	CC@10	38.97%	58.44%	66.61%
CLEVR-DS-L	DetailCLIP	✔	CC@10	39.23%	59.10%	68.74%

Table 7: Retrieval Performance of DetailCLIP and CLIP on CLEVR-DS.

5.2.2 Results on Real-World and synthetic dataset

In this section, we use several feature extractors to test the performance of our DetailCLIP framework. To verify the effectiveness of our method on datasets with complicated scenes, we test on the MSCOCO, LVIS, and Unity-Retail datasets. The former two are real-world datasets but are not fully annotated. The latter is a synthetic dataset with more complicated scenes and is fully annotated. Table 8 shows that the retrieval results of DetailCLIP on most datasets are better than the full image baseline and the CC@10 baseline. For example, on MSCOCO, DetailCLIP outperforms the full image baseline by $\sim$ 6% in Recall@1, showing that the DetailCLIP feature is better than the original CLIP feature for retrieval on the target domain. However, the margin between the full image baseline and DetailCLIP on MSCOCO ( $\sim$ 6%) is smaller than on Unity-Retail ( $\sim$ 17%) and CLEVR-DS-mix ( $\sim$ 27%). Although Unity-Retail is also a synthetic dataset, it has more complicated scenes than CLEVR-DS. We speculate that the difference between full and partial annotation causes the margin gap across datasets.

DATASET			LVIS	COCO	Unity	CLEVR-DS
Method	Single Feature	Input	Recall@1	Recall@1	Recall@1	Recall@1
$§$ CLIP-ViT-B/16	✔	Full Image	7.49%	40.93%	24.63%	10.83%
CLIP-ViT-B/16	✕	CC@10	9.40%	41.24%	23.11%	23.13%
DetailCLIP	✔	CC@10	7.66%	44.19%	25.02%	24.20%
$†$ CLIP-ViT-B/14	✔	Full Image	15.12%	56.74%	35.74%	31.21%
CLIP-ViT-B/14	✕	CC@10	22.00%	59.40%	52.40%	56.82%
DetailCLIP	✔	CC@10	15.29%	62.63%	55.21%	58.88%
$§$ SLIP	✔	Full Image	9.65%	47.49%	24.42%	15.32%
$‡$ RegionCLIP	✔	Full Image	10.13%	46.06%	24.10%	23.41%

$§$ Trained on YFCC-15M
$†$ Trained on 400M image-text pairs
$‡$ Trained on Conceptual Captions (CC3M)

Table 8: Retrieval Performance of DetailCLIP framework and other methods on four datasets.

5.3 Ablation Study

We perform ablation analysis on our framework, data sources, and retrieval methods to further demonstrate the effectiveness of the method proposed in the paper and its components. We use CLIP-ViT-B/32 as the feature extractor in all ablation experiments.

5.3.1 Patch Generation Ablation

Firstly, we define patch generation schemes, "Patch-cc" and "Patch-grid." "Patch-grid" is almost identical to the "Patch" concept in ViT by vit, simply slicing the image into non-overlapped, equal-sized patches. "Patch-cc" generates patches for an image following the Complete Cover (CC) scheme. In order to verify that CC can effectively generate the patches with different levels of details to inject and fuse, we compare the above two patch generation schemes on the CLEVR-DS dataset.

Figure 6: Multi-Feature retrieval performance under patch-cc and patch-grid scheme.

According to Figure 6, patch-cc outperforms the patch-grid scheme on the CLEVR-DS datasets except for CLEVR-DS-S, and the gain is largest on CLEVR-DS-L. These results indicate that our patch-cc method generates better patches than patch-grid.

5.3.2 Performance Upper Bound Ablation

The upper bound of the text-to-image retrieval task is to directly retrieve the bounding box patch that contains the target in the image. "Patch-obj" generates patches by cropping objects from the image using their bounding boxes. We use the "Patch-obj" method to generate several bounding box patches and select the one most similar to the target as the retrieved result, which tests the upper bound of CLIP’s retrieval ability on the CLEVR-DS dataset. In Figure 7, "Full image" denotes retrieval with the CLIP feature of the original image, and "DetailCLIP" denotes retrieval with the DetailCLIP feature. The result of DetailCLIP with CC input (41.1%) is close to the result of "Patch-obj" (51.57%). The "Patch-obj" result also shows the limitation of the CLEVR-DS dataset: due to rendering choices, such as the setting of materials and backgrounds, it does not match the real-world situation.

Figure 7: Retrieval result of DetailCLIP, Patch-obj scheme and Full image.

5.3.3 Information Injection Ablation

In this section, we test the DetailCLIP framework’s ability to inject detailed information from different numbers of patches, and we use CLEVR-DS to perform the task. We use different numbers of patches to train the DetailCLIP fusing model and test the retrieval performance of the fused feature. In Figure 8, "Grid" and "CC" represent the patch selection methods, which are the same as in the above setting, and "Mix," "Small," and "Large" denote the object scales in the CLEVR-DS dataset. As shown in Figure 8, DetailCLIP’s performance stays stable as the number of patches increases, which indicates that our framework can inject detailed information from many patches. It also shows that, whether the patch generation method is patch-cc or patch-grid, the results of DetailCLIP are similar to the multi-feature CLIP results, which indicates that DetailCLIP can effectively inject information from different sources.

Figure 8: After applying DetailCLIP, we achieve 100x #Patch Reduction with no significant performance loss.

5.3.4 Fine-tune On Target Domain

In order to address the fairness of the comparison, we experiment with four different adaptation methods on the CLEVR-DS dataset. Each adds a trainable module on top of the CLIP model, as our method does. We tried two modules, MLP and Transformer, which use the original CLIP feature as input and the adapted feature on the target domain as output. The Query Proxy Loss is applied, and hyperparameters are searched as well. The results in Table 9 show that the gain of our method is not because of the fine-tuning.

fine-tune module	without BN	with BN
Not-Tuned	20.35%	20.35%
MLP-1-layer	18.57%	17.22%
MLP-2-layer	19.85%	15.90%
MLP-3-layer	19.42%	15.12%
Transformer	17.16%	-

Table 9: Results of fine-tuning trainable modules added on top of the CLIP model.

5.3.5 Retrieval Protocol Ablation

After slicing an image into different patches, we can get the similarity score between the text and every image patch. In the above experiments, we choose the maximum similarity score as the retrieval score of the image. However, since our CC method has overlapping patches that may contain the same target object, it is also reasonable to use the mean similarity of all patches as the score. In Table 10, we test the performance of the two approaches.

CLEVR-DS	Mix (recall@1)	Small (recall@1)	Large (recall@1)
CC@10 max	41.78%	29.30%	38.97%
CC@10 mean	40.21%	25.29%	24.83%

Table 10: Mean similarity and max similarity result on CLEVR-DS
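The two protocols compared in Table 10 differ only in how the per-patch similarities of an image are aggregated into one score. The following is a minimal sketch of that aggregation, assuming the text-to-patch similarities are already computed; the function names `image_score` and `rank_images` and the toy numbers are hypothetical and are not taken from our experiments.

```python
from statistics import mean


def image_score(patch_similarities, protocol="max"):
    """Aggregate the text-to-patch similarities of one image into one score."""
    if protocol == "max":
        return max(patch_similarities)
    if protocol == "mean":
        return mean(patch_similarities)
    raise ValueError(f"unknown protocol: {protocol}")


def rank_images(similarities_per_image, protocol="max"):
    """Return image indices sorted from most to least similar to the query."""
    scores = [image_score(s, protocol) for s in similarities_per_image]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data (made up): image 0 has a single strongly matching patch,
# image 1 has moderately similar patches everywhere.
toy = [[0.10, 0.12, 0.35], [0.22, 0.24, 0.23]]
ranking_max = rank_images(toy, "max")
ranking_mean = rank_images(toy, "mean")
```

In this toy case the max protocol ranks the image with one strongly matching patch first, while the mean protocol dilutes that patch with the non-matching ones. This may be one reason why the mean protocol scores lower in Table 10.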

6 Conclusion & Future Work

Our paper presents a feature fusion model, DetailCLIP, for the text-to-image retrieval task. DetailCLIP shares the same semantic space with CLIP-like models and achieves outstanding performance in detail retrieval. We proposed a CC patch selection scheme and a Transformer-based framework with a query proxy loss to obtain a detail-friendly feature representation. To verify the retrieval performance of DetailCLIP, especially for objects at different scales, we constructed the CLEVR-DS dataset. Extensive experiments on this dataset and other popular datasets demonstrate that DetailCLIP can surpass the retrieval performance of CLIP-like models. The ablation experiments support the claims made in this paper. Follow-up work will replace the text features in the training stage with patch features in a self-supervised or unsupervised manner.

References

Appendix A Appendix

A.1 The Effectiveness of Complete Cover

Let us first take a look at the brute force algorithm. For each image with sidelength $a$ , the number of all possible patches is $O (a^{4})$ , since a rectangular patch is defined by its top-left and bottom-right corners. Each corner comprises two coordinates $(x, y)$ , which leads to $O (a^{4})$ patches. Even if we confine the patches to be square, there are still $O (a^{3})$ patches.

For simplicity, we take square patches as our example. Please note that the $c$ we defined as the ratio of the perimeter in the main paper is equivalent to the ratio of the sidelength. Moreover, if we adopt the Complete Cover scheme, we can obtain patches (covers) at different levels with sidelengths of $[c, 2 c, 3 c, \dots, a]$ . For each level, the number of patches needed to cover all targets of the corresponding sidelength is $O ({(\frac{a}{c})}^{2}), O ({(\frac{a}{2 c})}^{2}), O ({(\frac{a}{3 c})}^{2}), \dots, O ({(\frac{a}{a})}^{2})$ , respectively. The total number of patches introduced by Complete Cover is:

	${(\frac{a}{c})}^{2} \times (1 + \frac{1}{2^{2}} + \frac{1}{3^{2}} + \dots)$	(10)
$=$	${(\frac{a}{c})}^{2} \times \frac{\pi^{2}}{6}$	(11)
$=$	$O (a^{2})$	(12)

where $c$ is a constant across the experiment.
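The quadratic growth can be checked numerically. The sketch below sums the per-level terms $(\frac{a}{j c})^{2}$ for levels $j = 1, \dots, a / c$ and compares the total with the closed form of Eq. (11). It is a simplified model that ignores rounding to integer patch counts, and the function names are hypothetical. Because an image has only finitely many levels, the finite sum stays slightly below the closed form and approaches it as $a / c$ grows.

```python
import math


def complete_cover_patch_count(a, c):
    """Sum (a / (j * c))**2 over levels j = 1 .. a // c (simplified model)."""
    levels = a // c
    return sum((a / (j * c)) ** 2 for j in range(1, levels + 1))


def closed_form(a, c):
    """(a / c)**2 * pi**2 / 6, the closed form in Eq. (11)."""
    return (a / c) ** 2 * math.pi ** 2 / 6


c = 3
for a in (30, 300, 3000):
    total = complete_cover_patch_count(a, c)
    bound = closed_form(a, c)
    print(f"a={a:5d}  patches={total:12.1f}  closed form={bound:12.1f}  "
          f"ratio={total / bound:.4f}")
```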

We plot the number of patches under different sidelengths with $c = 3$ in Figure 9, together with a quadratic function $y = 0.25 x^{2}$ . We can see that they fit well.

Figure 9: Patch numbers for different sidelengths when $c = 3$ .

Please note that, before cropping patches, we resize the image to ensure that the sidelength $a$ is divisible by $c$ . The minimum effective range for Q is defined as the range of sidelengths for P that makes $C (P, Q) = 1$ (the definition of $C (P, Q)$ is in Section 3.1, Eq. (1)). The formulas for the cover sidelength, the minimum effective range, and the patch number for a given $c$ are presented in Table 11.

Level	Cover Sidelength	Minimum Effective Range	Patch Numbers
1	$a$	$a \geq x \geq \frac{a}{c}$	1
2	$a - c$	$\frac{a}{c} > x \geq \frac{a - c}{c}$	$(\frac{2 c + a c - a}{- c^{2} + 2 c + a c - a})^{2}$
3	$a - 2 c$	$\frac{a - c}{c} > x \geq \frac{a - 2 c}{c}$	$(\frac{3 c + a c - a}{- 2 c^{2} + 3 c + a c - a})^{2}$
$\dots$	$\dots$	$\dots$	$\dots$
n	$a - (n - 1) c$	$\frac{a - (n - 2) c}{c} > x \geq \frac{a - (n - 1) c}{c}$	$(\frac{n c + a c - a}{- (n - 1) c^{2} + n c + a c - a})^{2}$
$\dots$	$\dots$	$\dots$	$\dots$

Table 11: Relationship between the cover sidelength, the minimum effective range, and the number of patches at a given c

A.2 DetailCLIP Performance Under Different Complete Cover Schemes

Based on different $c$ , we list the number of patches at different levels respectively in Table 12. We set $\frac{a}{c} = k$ for convenience of notation.

CC@k	Patch Numbers	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7
1	1	1
2	5	1	4
3	14	1	4	9
4	30	1	4	9	16
5	39	1	4	9	25
6	66	1	4	9	16	36
7	79	1	4	9	16	49
8	103	1	4	9	25	64
9	136	1	4	9	16	25	81
10	166	1	4	9	16	36	100
11	187	1	4	9	16	36	121
12	248	1	4	9	16	25	49	144
13	273	1	4	9	16	25	49	169
14	315	1	4	9	16	25	64	196
15	355	1	4	9	16	36	64	225

Table 12: Number of patches at different levels.
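As a quick consistency check of Table 12, the sketch below verifies three properties of each row: the total equals the sum of the per-level counts, every per-level count is a perfect square, and the last level contains $k^{2}$ patches. The per-level counts and totals are copied from the table; the variable and function names are hypothetical.

```python
import math

# Per-level patch counts copied from Table 12, keyed by k.
levels_by_k = {
    1: [1],
    2: [1, 4],
    3: [1, 4, 9],
    4: [1, 4, 9, 16],
    5: [1, 4, 9, 25],
    6: [1, 4, 9, 16, 36],
    7: [1, 4, 9, 16, 49],
    8: [1, 4, 9, 25, 64],
    9: [1, 4, 9, 16, 25, 81],
    10: [1, 4, 9, 16, 36, 100],
    11: [1, 4, 9, 16, 36, 121],
    12: [1, 4, 9, 16, 25, 49, 144],
    13: [1, 4, 9, 16, 25, 49, 169],
    14: [1, 4, 9, 16, 25, 64, 196],
    15: [1, 4, 9, 16, 36, 64, 225],
}
# "Patch Numbers" column copied from Table 12.
totals_in_table = {1: 1, 2: 5, 3: 14, 4: 30, 5: 39, 6: 66, 7: 79, 8: 103,
                   9: 136, 10: 166, 11: 187, 12: 248, 13: 273, 14: 315, 15: 355}


def check_row(k, levels, total):
    """Check one table row against the three properties described above."""
    return {
        "sum_matches": sum(levels) == total,
        "all_squares": all(math.isqrt(n) ** 2 == n for n in levels),
        "last_is_k_squared": levels[-1] == k * k,
    }


report = {k: check_row(k, levels_by_k[k], totals_in_table[k]) for k in levels_by_k}
all_ok = all(all(row.values()) for row in report.values())
print(all_ok)
```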

To select a suitable $k$ , we test the performance of the DetailCLIP model with 14 different $k$ . The results are shown in Figure 10. Lines with star markers are retrieval results, and lines without star markers are DetailCLIP results. Dashed lines are results for recall@1, and we use the same color for the same patch selection method under the same recall. The figure shows several facts:

The recall@1 results for DetailCLIP (single feature) are comparable with the retrieval baseline (multi-feature) for any $k$ . DetailCLIP has slightly better performance with large $k$ .
The performance of both DetailCLIP and the retrieval baseline begins to increase from $k = 2$ and saturates at $k = 9$ . Patch-cc’s performance exceeds that of patch-grid from $k = 4$ onward for recall@1.
For $k \geq 8$ , the number of patches increases dramatically, but the performance of the DetailCLIP model does not.

Figure 10: Recall for retrieval and DetailCLIP models with different $k$ Values

Based on the analysis above and the trade-off between DetailCLIP performance and computational complexity, we select $k = 10$ for the experiments in the main body of our paper. Also, when $k = 10$ , the number of patches at the last level is the same as in the patch-grid method.

A.3 Similarity Distribution of Retrieval Results

We further inspect the similarity distribution of positive and negative text-image pairs in Figure 11. Note that the similarity x-axis is in log scale. The vertical dashed lines denote the log-scale mean of each distribution. We can see that the distribution of the positive pairs of DetailCLIP leans more to the right than that of CLIP. One may find that the distribution in log scale has the shape of a normal distribution. This is because the similarity is computed as a normalized dot-product between text and image features, which, under mild assumptions, naturally follows a log-normal distribution.

Figure 11: Distribution of retrieval results for positive & negative samples.
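For readers who want to reproduce the dashed lines, the sketch below shows one way to compute a log-scale mean. It assumes that "log-scale mean" means the mean taken in log space and mapped back to the similarity axis (that is, the geometric mean), and that all similarities are positive. The function name and the toy values are hypothetical and are not our measured similarities.

```python
import math


def log_scale_mean(similarities):
    """Mean of log-similarities, mapped back with exp (the geometric mean).

    Assumes every similarity is strictly positive.
    """
    logs = [math.log(s) for s in similarities]
    return math.exp(sum(logs) / len(logs))


toy_similarities = [0.20, 0.25, 0.40]  # made-up values for illustration
geometric = log_scale_mean(toy_similarities)
arithmetic = sum(toy_similarities) / len(toy_similarities)
print(f"log-scale mean={geometric:.4f}  arithmetic mean={arithmetic:.4f}")
```

On a log-scale axis the log-scale mean sits at the center of a symmetric distribution, whereas the arithmetic mean is pulled toward the larger values.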

A.4 Hyper-parameter Tuning

We use the AdamW optimizer with a linear learning rate scheduler and a linear warm-up training strategy for ten epochs. DetailCLIP is trained on a single RTX 2080 Ti GPU. Throughout our DetailCLIP experiments, the batch size is set to 30. For each $k$ , the hyper-parameters of DetailCLIP are independently grid searched over the candidates in the table below. We select the hyper-parameters on a validation set and report the result on a held-out test set.

Name	Candidate
Learning Rate	[0.001, 0.003, 0.005, 0.007, 0.01]
Weight Decay	[0, 0.001]
Step Size for Learning Rate Decay	[60, 120]
Gamma Value for Learning Rate Decay	[0.5, 0.7, 0.9]
Gradient Clip Value	[0.00001, 0.0001]
Layer Normalization’s Epsilon	[0.0001, 0.001, 0.01]

Table 13: Hyper-parameter candidates searched on the validation set.
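To give a sense of the size of this search space, the sketch below enumerates the candidates of Table 13. It assumes the grid search takes the full Cartesian product of all candidate lists. The dictionary keys and the function name are hypothetical; only the candidate values come from the table.

```python
from itertools import product

# Candidate values copied from Table 13; the key names are illustrative.
search_space = {
    "learning_rate": [0.001, 0.003, 0.005, 0.007, 0.01],
    "weight_decay": [0, 0.001],
    "lr_decay_step_size": [60, 120],
    "lr_decay_gamma": [0.5, 0.7, 0.9],
    "gradient_clip_value": [0.00001, 0.0001],
    "layer_norm_eps": [0.0001, 0.001, 0.01],
}


def iter_configs(space):
    """Yield one dict per point of the Cartesian product of the candidates."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 5 * 2 * 2 * 3 * 2 * 3 = 360
```

Under this assumption, each value of $k$ requires 360 training runs, from which the configuration with the best validation result would be selected.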

A.5 Retrieval Result Visualization

Figure 12: Retrieval result visualization. The ground-truth images for the query are surrounded by blue frames, while green frames surround the others. Note that two rows comprise a group. Each row is a ranked retrieval result; the larger the portion of blue frames, the better. The upper row in each group is the top-10 retrieval result for CLIP, and the lower row is the result for DetailCLIP. The ground-truth objects are marked with red bounding boxes for visualization. Note that none of the methods here produces bounding boxes.

Figure 13: More retrieval result visualization.

Figure 14: Selective zoom-ins of Figure 12 and Figure 13. We choose the top-5 retrieval results of the Bear and Guitar classes to show that, compared to CLIP, DetailCLIP can retrieve more images that contain small objects.

A.6 PyTorch-Like Code

PyTorch-style code for the complete pipeline of our DetailCLIP model is given in the listing below.

Listing: PyTorch-like code

```python
from einops import rearrange
from torch.nn.functional import mse_loss

# b: batch size
# p: number of patches
# t: number of text features
# f: feature dim
# vanilla_feature: CLIP feature for the entire image, (b, f)
# patch_feature: CLIP features for the different patches, (b, p, f)
# text_feature: CLIP features for the text prompts, (t, f)
# fusing_model: DetailCLIP model
# All features are normalized.

b = patch_feature.shape[0]
# fused single feature per image, (b, f)
DetailCLIP_feature = fusing_model(patch_feature, vanilla_feature)
patch_feature = rearrange(patch_feature, 'b p f -> (b p) f')
proxy_feature = text_feature
# (t, f) @ (f, b * p) -> (t, b * p)
q_p_similarity = proxy_feature @ patch_feature.T
# the batch size must be given to split the merged (b p) axis
q_p_similarity = rearrange(q_p_similarity, 't (b p) -> t b p', b=b)
# (t, b, p) -> (t, b); max over a dim returns (values, indices)
q_p_similarity_max = q_p_similarity.max(dim=-1).values
# (t, f) @ (f, b) -> (t, b)
q_c_similarity = proxy_feature @ DetailCLIP_feature.T
query_proxy_loss = mse_loss(q_p_similarity_max, q_c_similarity)
```

A.7 Datasheet

Motivation
For what purpose was the dataset created? Who created this dataset? Who funded the creation of the dataset?	The dataset was created for training and evaluating the text-image retrieval models.
Any other comments?	Compared with other datasets for text-image retrieval such as MSCOCO, LVIS, Conceptual Captions, etc., our dataset focuses on detail retrieval (small objects), and every object that appears in an image is annotated. Also, we can generate as many images as we need.
Composition
What do the instances that comprise the dataset represent (e.g. documents, photos, people, countries)?	51 different types of common objects from ShapeNet, but with simple textures, rendered at different sizes on a clean background.
How many instances are there in total (of each type, if appropriate)?	The dataset contains 10k image-annotation pairs.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set?	Our dataset is procedurally generated. There can be as many instances as one needs.
What data does each instance consist of?	Each instance contains an image with every object in the image annotated with its box and class.
Is there a label or target associated with each instance?	Yes. Each instance has full annotations, including all objects with their bounding box, texture, size, category, etc.
Is any information missing from individual instances?	No.
Are relationships between individual instances made explicit?	Instances are i.i.d. generated from the same program. Objects in the images share the same object classes, but with different sizes and view angles. All instances share the same visual appearance.
Are there recommended data splits?	We use random splits for the training, validation, and testing sets.
Are there any errors, sources of noise, or redundancies in the dataset?	No.
Is the dataset self-contained, or does it link to or otherwise rely on external resources?	The dataset is self-contained.
Does the dataset contain data that might be considered confidential?	No.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety?	No.
Any other comments?	None.
Collection Process
How was the data associated with each instance acquired?	The objects we used to generate our dataset are taken from ShapeNet, which is available publicly on the web.
What mechanisms or procedures were used to collect the data?	The data was generated using a modified CLEVR pipeline and Blender.
If the dataset is a sample from a larger set, what was the sampling strategy?	Only objects from ShapeNet are sampled. We first discard all broken objects that cannot be loaded by the modified CLEVR pipeline (259 classes remained). Then, we filter out all classes that have fewer than 10 objects (51 classes remained). We sample 10 3D object models from the remaining 51 classes. These strategies help alleviate the long-tailed problem.
Who was involved in the data collection process?	Researchers at our institute.
Over what timeframe was the data collected?	The dataset was generated in April 2022. We didn’t filter the sources based on the creation date.
Were any ethical review processes conducted?	Yes.
Preprocessing / Cleaning / Labeling
Was any preprocessing/cleaning/labeling of the data done (e.g. discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)?	No.
Is the software used to preprocess/clean/label the instances available?	No (but we will release the code later).
Any other comments?	None.
Uses
Has the dataset been used for any tasks already?	No.
Is there a repository that links to any or all papers or systems that use the dataset?	No, the dataset is only used to train and evaluate the models in this paper for now.
What (other) tasks could the dataset be used for?	The dataset can be used for relation detection and localization tasks.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses?	The dataset is rendered from ShapeNet objects; one may need to replace the objects with other 3D models if ShapeNet cannot satisfy the needs.
Are there tasks for which the dataset should not be used?	The dataset should not be used for tasks whose data have an enormous domain gap to ShapeNet objects.
Any other comments?	None.
Distribution
Will the dataset be distributed to third parties outside of the entity (e.g. company, institution, organization) on behalf of which the dataset was created?	Yes.
How will the dataset be distributed (e.g., tarball on website, API, GitHub)?	GitHub.
When will the dataset be distributed?	TBD.
Any other comments?	None.

A.8 Model Card

Model Details
Person or organization developing model	DetailCLIP was developed by our institute.
Model date	DetailCLIP was released on May 20, 2022.
Model version	DetailCLIP described in this paper is version 1.0.0.
Model type	DetailCLIP is a shallow transformer-based feature fusing model for the text-image retrieval task.
Information about training algorithms, parameters, fairness constraints or other applied approaches, and features	Please see the Datasheet (Appendix A.7) for information about the training data and Section 6.1 for information about the training process. We list the choice of hyperparameters in Appendix A.4.
Paper or other resource for more information	Please see the paper for details on DetailCLIP. Our implementation will be available later at our GitHub repository.
License	TBD.
Where to send questions or comments about the model	Please contact the corresponding authors for any questions or comments.
Intended Use
Primary intended uses	We release DetailCLIP for text-based image retrieval tasks, especially for detail retrieval (small objects).
Primary intended users	We primarily target researchers and the related research community who are interested in the detail retrieval task or the CLIP feature fusing task.
Out-of-scope use cases	TBD.
Data, Limitations, and Recommendations
Data selection for training	Training data for DetailCLIP is randomly sampled from CLEVR-DS. As for the dataset generation procedure, please see our Datasheet (Appendix A.7) for more information.
Data selection for evaluation	The validation set is randomly sampled from CLEVR-DS, with an emphasis on detail retrieval for small objects in an image from text.
Limitations	The features input to the DetailCLIP model are based on CLIP, so the model inherits not only the capabilities of CLIP but also its limitations; e.g., if an object cannot be recognized in any patch by CLIP, DetailCLIP cannot improve the situation either. Caution should be taken when using a model trained on synthetic data in real-world scenarios. Furthermore, our synthetic dataset is background-free, which means the visual context is relatively simple. The text vocabulary we used is fairly small, which could lead to bias on out-of-vocabulary classes.
Recommendations for future work	Extend DetailCLIP to a larger real-world dataset with a large text vocabulary. Another direction is to use patch features in place of text features as retrieval queries, since they share the same feature space. This could lead to a new text-free learning-by-retrieval feature learning paradigm.