Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts the segmentation performance, especially for rare object categories. Although diverse, high-quality object instances used in Copy-Paste result in more performance gain, previous works utilize object instances either from human-annotated instance segmentation datasets or rendered from 3D object models, and both approaches are too expensive to scale up to obtain good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images or a zero-shot recognition model to filter noisily crawled images for different object categories is a feasible way to make Copy-Paste truly scalable. To make such success happen, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves +2.6 box AP and +2.1 mask AP gains on all classes and even more significant gains of +6.8 box AP and +6.5 mask AP on long-tail classes.
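The filtering idea in the abstract above (using a zero-shot recognition model to discard noisily crawled images) can be illustrated with a minimal sketch. It assumes that image and text embeddings have already been computed by some model such as CLIP and are given as plain lists of floats; the function names `cosine` and `filter_crawled` and the threshold value are hypothetical and are not taken from the X-Paste code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_crawled(image_embs, text_emb, threshold):
    """Keep indices of images whose embedding is close enough to the
    embedding of the category text (hypothetical filtering rule)."""
    return [i for i, emb in enumerate(image_embs)
            if cosine(emb, text_emb) >= threshold]

# Toy 2-D embeddings standing in for real model outputs.
text_emb = [1.0, 0.0]
image_embs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
kept = filter_crawled(image_embs, text_emb, threshold=0.6)
print(kept)  # indices of images that pass the filter
```

In practice the threshold would need tuning per category; a fixed value is used here only to keep the example short.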
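The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where class-agnostic object proposals are classified with the text encoder of a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit available online resources via a novel self-training framework, which allows the proposed detector to be trained on noisy, uncurated web images. Lastly, (iv) to evaluate our proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches, with fewer additional training images and zero manual annotations. Project page with code: https://fcjian.github.io/promptdet.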
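Object cut-and-paste has become a promising approach for efficiently generating large sets of labeled training data. It involves compositing foreground object masks onto background images. The background images, when congruent with the objects, provide helpful context information for training object recognition models. While the approach can easily generate large amounts of labeled data, finding congruent context images for downstream tasks has remained an elusive problem. In this work, we propose a new paradigm for automatic context image generation at scale. At the core of our approach lies the interplay between language descriptions of context and language-driven image generation. A language description of a context is obtained by applying an image captioning method to a small set of images representing the context. These language descriptions are then used to generate diverse sets of context images with the language-based DALL-E image generation framework. The generated images are then composited with objects to provide an augmented training set for a classifier. We demonstrate the advantages of our approach over prior context image generation approaches on four object detection datasets. Furthermore, we highlight the compositional nature of our data generation approach in out-of-distribution and zero-shot data generation scenarios.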
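Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yields consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.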
Existing object detection methods are bounded to a fixed-set vocabulary by costly labeled data. When dealing with novel categories, the model has to be retrained with more bounding box annotations. Natural language supervision is an attractive alternative for its annotation-free attributes and broader object concepts. However, learning open-vocabulary object detection from language is challenging since image-text pairs do not contain fine-grained object-language alignments. Previous solutions rely on either expensive grounding annotations or distilling classification-oriented vision models. In this paper, we propose a novel open-vocabulary object detection framework that learns directly from image-text pair data. We formulate object-language alignment as a set matching problem between a set of image region features and a set of word embeddings. This enables us to train an open-vocabulary object detector on image-text pairs in a much simpler and more effective way. Extensive experiments on two benchmark datasets, COCO and LVIS, demonstrate our superior performance over competing approaches on novel categories, e.g., achieving 32.0% mAP on COCO and 21.7% mask mAP on LVIS. Code is available at: https://github.com/clin1223/VLDet.
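The set matching formulation mentioned in the abstract above can be made concrete with a small sketch. It assumes dot-product similarity between region features and word embeddings and uses brute-force search over assignments, which is only practical for toy sizes; a real system would use a proper assignment solver. The function name `match_regions_to_words` and all data are hypothetical and do not come from the VLDet code.

```python
from itertools import permutations

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def match_regions_to_words(region_feats, word_embs):
    """Assign each word to a distinct region so that the summed similarity
    is maximal. Brute force; assumes len(word_embs) <= len(region_feats)."""
    best_score, best_assign = float("-inf"), None
    for perm in permutations(range(len(region_feats)), len(word_embs)):
        score = sum(dot(region_feats[r], word_embs[w])
                    for w, r in enumerate(perm))
        if score > best_score:
            best_score, best_assign = score, perm
    return list(best_assign), best_score

# Toy data: three region features, two word embeddings.
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
words = [[0.0, 1.0], [1.0, 0.0]]
assignment, total = match_regions_to_words(regions, words)
print(assignment, total)  # assignment[w] is the region index matched to word w
```

The one-to-one constraint is what distinguishes set matching from taking an independent argmax per word, where several words could collapse onto the same region.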
Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation (e.g., [13,12]) for instance segmentation where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g., self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
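The random pasting mechanism described in the abstract above can be sketched in a few lines. This is a simplified illustration, assuming single-channel images stored as nested lists and a binary object mask; it omits the scale jittering and blending that a full pipeline would likely include. The function names `paste_instance` and `random_paste` are hypothetical.

```python
import random

def paste_instance(background, obj, mask, top, left):
    """Return a copy of `background` with `obj` pixels written wherever
    `mask` is 1, plus the instance mask in background coordinates."""
    h, w = len(background), len(background[0])
    out = [row[:] for row in background]
    out_mask = [[0] * w for _ in range(h)]
    for i, mask_row in enumerate(mask):
        for j, m in enumerate(mask_row):
            y, x = top + i, left + j
            if m and 0 <= y < h and 0 <= x < w:
                out[y][x] = obj[i][j]
                out_mask[y][x] = 1
    return out, out_mask

def random_paste(background, obj, mask, rng=random):
    """Paste the object at a uniformly random location that keeps it
    inside the background whenever it fits."""
    h, w = len(background), len(background[0])
    oh, ow = len(mask), len(mask[0])
    top = rng.randint(0, max(0, h - oh))
    left = rng.randint(0, max(0, w - ow))
    return paste_instance(background, obj, mask, top, left)

# Toy example: a 2x2 object with an L-shaped mask on a 4x4 background.
background = [[0] * 4 for _ in range(4)]
obj = [[7, 7], [7, 7]]
mask = [[1, 0], [1, 1]]
image, new_mask = paste_instance(background, obj, mask, top=1, left=2)
```

The returned mask doubles as the new instance annotation, which is why this augmentation produces labeled training data at no extra annotation cost.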
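Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to poor performance due to a domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual concepts. Our method leverages a CLIP model to match image regions with template captions and then pretrains our model to align these region-text pairs in the feature space. When transferring our pretrained model to open-vocabulary object detection tasks, our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets, respectively. Moreover, the learned region representations support zero-shot inference for object detection, showing promising results on both the COCO and LVIS datasets. Our code is available at https://github.com/microsoft/regionclip.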
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
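Open-world object detection, as a more general and challenging goal, aims to recognize and localize objects described by arbitrary category names. The recent work GLIP formulates this problem as a grounding problem by concatenating all category names of detection datasets into sentences, which leads to inefficient interaction between category names. This paper presents DetCLIP, a paralleled visual-concept pre-training method for open-world detection that resorts to knowledge enrichment from a designed concept dictionary. To achieve better learning efficiency, we propose a novel paralleled concept formulation that extracts concepts separately to better utilize heterogeneous datasets (i.e., detection, grounding, and image-text pairs) for training. We further design a concept dictionary (with descriptions) from various online sources and detection datasets to provide prior knowledge for each concept. By enriching the concepts with their descriptions, we explicitly build the relationships among various concepts to facilitate open-domain learning. The proposed concept dictionary is further used to provide sufficient negative concepts for the construction of the word-region alignment loss, and to complete labels for objects with missing descriptions in the captions of image-text pair data. The proposed framework demonstrates strong zero-shot detection performance, e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories compared to the fully-supervised model with the same backbone as ours.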
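Many open-world applications require the detection of novel objects, yet state-of-the-art object detection and instance segmentation networks do not excel at this task. The key issue lies in their assumption that regions without any annotations should be suppressed as negatives, which teaches the model to treat unannotated objects as background. To address this issue, we propose a simple yet surprisingly powerful data augmentation and training scheme we call Learning to Detect Every Thing (LDET). To avoid suppressing hidden objects, that is, background objects that are visible but unlabeled, we paste annotated objects onto a background image sampled from a small region of the original image. Since training solely on such synthetically augmented images suffers from domain shift, we decouple the training into two parts: 1) training the region classification and regression head on augmented images, and 2) training the mask head on original images. In this way, the model does not learn to classify hidden objects as background while still generalizing well to real images. LDET leads to significant improvements on many datasets in the open-world instance segmentation task, outperforming baselines on cross-category generalization on COCO, as well as on cross-dataset evaluation on UVO and Cityscapes.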
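Recently, zero-shot image classification by vision-language pre-training has demonstrated incredible achievements: the model can classify arbitrary categories without seeing additional annotated images of those categories. However, it is still unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation. In this paper, we target zero-shot semantic segmentation by building it on an off-the-shelf pre-trained vision-language model, i.e., CLIP. This is difficult because semantic segmentation and the CLIP model operate at different visual granularities: semantic segmentation processes pixels, while CLIP operates on whole images. To remedy this discrepancy in processing granularity, we forgo the prevalent one-stage FCN-based framework and advocate a two-stage semantic segmentation framework, in which the first stage extracts generalizable mask proposals and the second stage leverages an image-based CLIP model to perform zero-shot classification on the masked image crops generated in the first stage. Our experimental results show that this simple framework surpasses the previous state of the art by a large margin: +29.5 hIoU on the Pascal VOC 2012 dataset and +8.9 hIoU on the COCO Stuff dataset. With its simplicity and strong performance, we hope this framework will serve as a baseline to facilitate future research.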
Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR -- hence the name OV-DETR -- which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge of turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence to generalize to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on LVIS and COCO datasets, we demonstrate that our OV-DETR -- the first end-to-end Transformer-based open-vocabulary detector -- achieves non-trivial improvements over the current state of the art.
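Despite great progress in object detection, most existing methods are limited to a small set of object categories, due to the tremendous human effort needed for instance-level bounding-box annotations. To alleviate the problem, recent open-vocabulary and zero-shot detection methods attempt to detect object categories not seen during training. However, these approaches still rely on manually provided bounding-box annotations on a set of base classes. We propose an open-vocabulary detection framework that can be trained without manually provided bounding-box annotations. Our method achieves this by leveraging the localization ability of pre-trained vision-language models and generating pseudo bounding-box labels that can be used directly for training object detectors. Experimental results on COCO, PASCAL VOC, Objects365 and LVIS demonstrate the effectiveness of our method. Specifically, our method outperforms the state-of-the-art (SOTA) methods trained with human-annotated bounding boxes by 3% AP on COCO novel categories, even though our training source is not equipped with manual bounding-box labels. When utilizing the manual bounding-box labels as the baselines do, our method surpasses the SOTA by a large margin of 8% AP.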
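Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic reaches 41.7 mAP for all classes and 41.7 mAP for rare classes. For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning. Code is available at https://github.com/facebookresearch/Detic.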
Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI.
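The goal of this work is to segment and name regions of images without access to pixel-level labels during training. To tackle this task, we construct segmenters by distilling the complementary strengths of two foundation models. The first, CLIP (Radford et al., 2021), exhibits the ability to assign names to image content but lacks an accessible representation of object structure. The second, DINO (Caron et al., 2021), captures the spatial extent of objects but has no knowledge of object names. Our method, termed NamedMask, begins by using CLIP to construct category-specific archives of images. These images are pseudo-labelled with a category-agnostic object detector derived from DINO, then refined by category-specific segmenters using the CLIP archive labels. Thanks to the high quality of the refined masks, we show that a standard segmentation architecture trained on these archives with appropriate data augmentation achieves impressive semantic segmentation abilities for both single-object and multi-object images. As a result, our proposed NamedMask performs favourably against a range of prior work on five benchmarks, including the VOC2012, COCO and large-scale ImageNet-S datasets.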
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVISv1.0 dataset with over a thousand categories and the COCO dataset with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the training-from-scratch counterpart with much fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. To facilitate future research, we release all the code and models at https://github.com/baaivision/EVA.
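Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we enable our model to better exploit the pre-trained knowledge. Our method is model-agnostic: it can be applied to arbitrary dense prediction systems and various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our method on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/denseclip.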
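To illustrate the pixel-text matching idea in the abstract above, the following sketch computes one score map per text embedding as the dot product between every pixel embedding and that text embedding. It assumes the embeddings are already available as nested lists; the function names and the argmax labelling step are hypothetical illustrations, not the DenseCLIP implementation, which uses the score maps to guide a dense prediction model rather than reading labels off them directly.

```python
def pixel_text_score_maps(pixel_embs, text_embs):
    """pixel_embs: H x W x C nested lists; text_embs: K x C nested lists.
    Returns K score maps of shape H x W (dot-product similarity)."""
    return [
        [[sum(p * t for p, t in zip(pixel, text)) for pixel in row]
         for row in pixel_embs]
        for text in text_embs
    ]

def argmax_labels(score_maps):
    """Per-pixel index of the highest-scoring text embedding."""
    k = len(score_maps)
    h, w = len(score_maps[0]), len(score_maps[0][0])
    return [[max(range(k), key=lambda c: score_maps[c][y][x])
             for x in range(w)] for y in range(h)]

# Toy 2x2 feature map with 2-D embeddings and two class texts.
pixel_embs = [[[1.0, 0.0], [0.0, 1.0]],
              [[0.8, 0.2], [0.1, 0.9]]]
text_embs = [[1.0, 0.0], [0.0, 1.0]]
maps = pixel_text_score_maps(pixel_embs, text_embs)
labels = argmax_labels(maps)
```

Replacing the single image-level embedding with a grid of pixel embeddings is the only structural change relative to image-text matching, which is what makes the formulation easy to attach to existing dense prediction models.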
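Recent work has shown that training deep neural networks directly on large-scale collections of image-text pairs enables zero-shot transfer on various recognition tasks. One central issue is how this can be generalized to object detection, which involves the non-semantic task of localization as well as the semantic task of classification. To solve this problem, we introduce a vision-language embedding alignment method that transfers the generalization capabilities of a pretrained model such as CLIP to an object detector like YOLOv5. We formulate a loss function that allows us to align the image and text embeddings from the pretrained CLIP model with the modified semantic prediction head of the detector. With this method, we are able to train an object detector that achieves state-of-the-art performance on the COCO, ILSVRC, and Visual Genome zero-shot detection benchmarks. During inference, our model can be adapted to detect any number of object classes without additional training. We also find that standard object detection scaling transfers well to our method, with consistent improvements across various scales of YOLOv5 models and the YOLOv3 model. Lastly, we develop a self-labeling method that provides a significant score improvement without needing extra images or labels.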
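Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks: open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a new state of the art for open-vocabulary object detection. Our code is available at https://github.com/xiaofeng94/vl-plm.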