The goal of the AVA Challenge is to provide vision-based benchmarks and methods relevant to accessibility. In this paper, we present the technical details of our submission to the CVPR 2022 AVA Challenge. First, we conduct a series of experiments to select an appropriate model and data augmentation strategy for this task. Second, an effective training strategy is adopted to improve performance. Third, we ensemble the results of two different segmentation frameworks to further boost performance. Experimental results demonstrate that our method achieves competitive results on the AVA test set. Finally, our approach reaches 63.008% AP@0.50:0.95 on the test set of the CVPR 2022 AVA Challenge.
The goal of the ACM MMSports 2022 DeepSportRadar Instance Segmentation Challenge is to segment individual humans, including players, coaches, and referees, on a basketball court. The main characteristics of this challenge are heavy occlusion between players and a very limited amount of data. To address these issues, we design a strong instance segmentation pipeline. First, we adopt an appropriate data augmentation strategy for this task, mainly consisting of photometric distortion and a copy-paste strategy, which generates more image instances with a wider distribution. Second, we employ a strong segmentation model, a Hybrid Task Cascade based detector on a CBNetV2 backbone with Swin-Base, and add a MaskIoU head to the HTCMaskHead, which simply and effectively improves instance segmentation performance. Finally, an SWA training strategy is applied to further improve performance. Experimental results show that the proposed pipeline achieves competitive results in the DeepSportRadar challenge, with 0.768 AP@0.50:0.95 on the challenge set. The source code is available at https://github.com/yjingyu/instanc_segentation_pro.
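The SWA (stochastic weight averaging) training strategy mentioned above can be realized with PyTorch's built-in utilities. The following is a minimal sketch under assumed settings; `model`, `train_loader`, `compute_loss`, the learning rates, and the epoch counts are placeholders, not the team's actual configuration.

```python
# Minimal SWA sketch using torch.optim.swa_utils; all hyperparameters are illustrative.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR, update_bn

def train_with_swa(model, train_loader, compute_loss, epochs=12, swa_start=8, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=1e-4)
    swa_model = AveragedModel(model)                 # keeps a running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=0.001)   # constant LR during the SWA phase

    for epoch in range(epochs):
        model.train()
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = compute_loss(model, images.to(device), targets)
            loss.backward()
            optimizer.step()
        if epoch >= swa_start:
            swa_model.update_parameters(model)       # average the weights after each epoch
            swa_scheduler.step()

    # Recompute BatchNorm statistics for the averaged weights before evaluation.
    update_bn(train_loader, swa_model, device=device)
    return swa_model
```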
The Visual Inductive Priors for Data-Efficient Computer Vision Challenge asks competitors to train models from scratch in a data-deficient setting. In this paper, we present the technical details of our submission to the ICCV 2021 ViPriors Instance Segmentation Challenge. First, we design an effective data augmentation method to mitigate the data-deficiency problem. Second, we conduct experiments to select an appropriate model and make several task-specific improvements. Third, we propose an effective training strategy that improves performance. Experimental results demonstrate that our method achieves competitive results on the test set. Following the competition rules, we do not use any external image or video data or pre-trained weights. The implementation details are described in Sections 2 and 3. Finally, our method achieves 40.2% AP@0.50:0.95 on the test set of the ICCV 2021 ViPriors Instance Segmentation Challenge.
In this paper, we present a data-efficient instance segmentation method used in the 2021 ViPriors Instance Segmentation Challenge. Our solution is a modified version of the Swin Transformer built on MMDetection, a powerful toolbox. To address the data-deficiency problem, we employ data augmentation, including random flipping and multi-scale training, to train our model. During inference, multi-scale fusion is used to improve performance. We use only a single GPU throughout the training and testing phases. Our team achieved an AP@0.50:0.95 of 0.366 on the test set, competitive with other ranked methods while using only one GPU. In addition, our method reaches an AP@0.50:0.95 (medium) of 0.592, ranking second. Overall, our team ranked third among all participants announced by the organizers.
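Multi-scale fusion at inference is commonly configured in MMDetection through a test-time augmentation pipeline. The snippet below is only an illustrative MMDetection 2.x configuration; the scales and normalization values are assumptions, not the team's actual settings.

```python
# Illustrative MMDetection 2.x test pipeline enabling multi-scale + flip test-time augmentation.
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=[(1333, 800), (1666, 1000), (2000, 1200)],  # placeholder scales
        flip=True,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize',
                 mean=[123.675, 116.28, 103.53],
                 std=[58.395, 57.12, 57.375],
                 to_rgb=True),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
```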
Recently, instance segmentation based on synthetic data has become a highly attractive optimization paradigm, since it leverages simulation rendering and physics to generate high-quality image-annotation pairs. In this paper, we propose a Parallel Pre-trained Transformers (PPT) framework for the synthetic-data-based instance segmentation task. Specifically, we leverage off-the-shelf pre-trained vision Transformers to alleviate the gap between natural and synthetic data, which helps provide good generalization in downstream synthetic-data scenarios with few samples. Swin-B-based CBNet V2, Swin-L-based CBNet V2, and Swin-L-based Uniformer are used for parallel feature learning, and the results of these three models are fused by a pixel-level Non-Maximum Suppression (NMS) algorithm to obtain more robust results. Experimental results show that PPT ranks first in the CVPR 2022 AVA Accessibility Vision and Autonomy Challenge with a 65.155% mAP.
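The pixel-level NMS fusion described above can be sketched as a mask-IoU-based greedy suppression over the instances pooled from all three models. The details of the paper's exact algorithm are not given here, so the threshold and tie-breaking below are assumptions.

```python
# Hedged sketch of mask-level NMS used to fuse instance masks from several models.
import numpy as np

def mask_iou(m1: np.ndarray, m2: np.ndarray) -> float:
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union > 0 else 0.0

def pixel_level_nms(masks, scores, iou_thr=0.5):
    """masks: list of boolean HxW arrays pooled from all models; scores: list of floats."""
    order = np.argsort(scores)[::-1]          # process the most confident masks first
    keep = []
    for idx in order:
        if all(mask_iou(masks[idx], masks[k]) < iou_thr for k in keep):
            keep.append(idx)
    return keep                               # indices of the fused, de-duplicated instances
```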
Video scene parsing in the wild under diverse scenarios is a challenging and important task, especially with the rapid development of autonomous driving technology. The Video Scene Parsing in the Wild (VSPW) dataset contains well-trimmed, long-temporal, densely annotated, high-resolution clips. Based on VSPW, we design a Temporal Bilateral Network with a Vision Transformer. We first design a spatial path with convolutions to generate low-level features that preserve spatial information. Meanwhile, a context path with a Vision Transformer is employed to obtain sufficient contextual information. Furthermore, a temporal context module is designed to exploit inter-frame content information. Finally, the proposed method achieves a mean Intersection over Union (mIoU) of 49.85% on the VSPW 2021 Challenge test dataset.
Building instance segmentation models that are dataefficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation (e.g., [13,12]) for instance segmentation where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g. self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
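The random Copy-Paste mechanism described above can be sketched in a few lines: binary instance masks from a source image are pasted onto a target image, and occluded parts of the original instances are removed. This is a simplified sketch; the scale jittering, blending, and box bookkeeping of the full method are omitted, and the 0.5 paste probability is an assumption.

```python
# Minimal sketch of random Copy-Paste augmentation for instance segmentation.
import numpy as np

def copy_paste(target_img, target_masks, source_img, source_masks, rng=None):
    """Images are HxWx3 uint8 arrays of the same size; masks are lists of HxW boolean arrays."""
    rng = rng or np.random.default_rng()
    out_img = target_img.copy()
    pasted = []
    for mask in source_masks:
        if rng.random() < 0.5:                 # paste each source instance with probability 0.5
            out_img[mask] = source_img[mask]   # copy the instance pixels onto the target
            pasted.append(mask)
    if pasted:
        # Remove occluded pixels from the original instance masks.
        occluder = np.any(np.stack(pasted), axis=0)
        target_masks = [np.logical_and(m, ~occluder) for m in target_masks]
    return out_img, target_masks + pasted
```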
This report introduces the technical details of the team FuXi-Fresher for LVIS Challenge 2021. Our method focuses on the problem in the following two aspects: the long-tail distribution and the segmentation quality of mask and boundary. Based on the advanced HTC instance segmentation algorithm, we connect a transformer backbone (Swin-L) through composite connections inspired by CBNetV2 to enhance the baseline results. To alleviate the problem of long-tail distribution, we design a Distribution Balanced method which includes dataset-balanced and loss-function-balanced modules. Further, we use a Mask and Boundary Refinement method composed of mask scoring and RefineMask algorithms to improve the segmentation quality. In addition, we are pleasantly surprised to find that early stopping combined with the EMA method can achieve a great improvement. Finally, by using multi-scale testing and increasing the upper limit of the number of objects detected per image, we achieved more than 45.4% boundary AP on the val set of LVIS Challenge 2021. On the test data of LVIS Challenge 2021, we rank 1st and achieve 48.1% AP. Notably, our APr of 47.5% is very close to the APf of 48.0%. * indicates equal contribution.
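The EMA component mentioned above keeps an exponential moving average of the model weights and evaluates with the averaged copy. A minimal PyTorch sketch follows; the decay value is an assumption, and early stopping would simply select the EMA checkpoint with the best validation score.

```python
# Hedged sketch of an exponential moving average (EMA) of model weights.
import copy
import torch

class ModelEMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.ema = copy.deepcopy(model).eval()     # shadow copy used for evaluation
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        msd = model.state_dict()
        for name, ema_v in self.ema.state_dict().items():
            model_v = msd[name]
            if ema_v.dtype.is_floating_point:
                ema_v.mul_(self.decay).add_(model_v, alpha=1.0 - self.decay)
            else:
                ema_v.copy_(model_v)               # integer buffers (e.g. BN counters) are copied
```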
Object segmentation research has made great progress in recent years. In addition to generic objects, aquatic animals have also attracted research attention. Deep-learning-based methods are widely used for aquatic animal segmentation and have achieved promising performance. However, there is a lack of challenging datasets for benchmarking. Therefore, we create a new dataset dubbed "Aquatic Animal Species." Moreover, we devise a novel multimodal-based scene-aware segmentation framework that leverages the advantages of multiple view-specific segmentation models to segment images of aquatic animals effectively. To improve training performance, we develop a guided mixup augmentation method. Extensive experiments comparing the performance of the proposed framework with state-of-the-art instance segmentation methods demonstrate that our method is effective and significantly outperforms existing methods.
Recently, diffusion frameworks have achieved comparable performance with previous state-of-the-art image generation models. Researchers are curious about its variants in discriminative tasks because of its powerful noise-to-image denoising pipeline. This paper proposes DiffusionInst, a novel framework that represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process. The model is trained to reverse the noisy groundtruth without any inductive bias from RPN. During inference, it takes a randomly generated filter as input and outputs mask in one-step or multi-step denoising. Extensive experimental results on COCO and LVIS show that DiffusionInst achieves competitive performance compared to existing instance segmentation models. We hope our work could serve as a simple yet effective baseline, which could inspire designing more efficient diffusion frameworks for challenging discriminative tasks. Our code is available in https://github.com/chenhaoxing/DiffusionInst.
Semantic, instance, and panoptic segmentation have been addressed with different and specialized frameworks, despite their underlying connections. This paper presents a unified, simple, and effective framework for these essentially similar tasks. The framework, named K-Net, segments both instances and semantic categories consistently with a set of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. To remedy the difficulty of distinguishing various instances, we propose a kernel update strategy that makes each kernel dynamic and conditioned on its meaningful group in the input image. K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free. Without bells and whistles, K-Net surpasses all previously published state-of-the-art single-model results of panoptic segmentation on the MS COCO test-dev split and semantic segmentation on ADE20K val, with 55.2% PQ and 54.3% mIoU, respectively. Its instance segmentation performance is also on par with Cascade Mask R-CNN on MS COCO with 60%-90% faster inference speed. Code and models will be released at https://github.com/zwwwayne/k-net/.
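The bipartite matching used in such set-based training assigns each ground-truth mask to exactly one predicted kernel via a cost matrix and the Hungarian algorithm. The sketch below uses a simplified cost (classification score plus Dice overlap); the weights and cost terms are assumptions, not K-Net's exact formulation.

```python
# Hedged sketch of bipartite matching between predicted masks and ground-truth masks.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_masks(pred_masks, gt_masks, pred_scores, cls_weight=1.0, dice_weight=1.0):
    """pred_masks: (N, H, W) soft masks in [0, 1]; gt_masks: (M, H, W) binary;
    pred_scores: (N, M) probability of each prediction for each ground truth's category."""
    N, M = pred_masks.shape[0], gt_masks.shape[0]
    cost = np.zeros((N, M))
    for i in range(N):
        for j in range(M):
            inter = (pred_masks[i] * gt_masks[j]).sum()
            dice = 2 * inter / (pred_masks[i].sum() + gt_masks[j].sum() + 1e-6)
            cost[i, j] = -cls_weight * pred_scores[i, j] - dice_weight * dice
    rows, cols = linear_sum_assignment(cost)   # Hungarian matching on the cost matrix
    return list(zip(rows, cols))               # (prediction index, ground-truth index) pairs
```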
Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts the segmentation performance, especially for rare object categories. Although diverse, high-quality object instances used in Copy-Paste result in more performance gain, previous works utilize object instances either from human-annotated instance segmentation datasets or rendered from 3D object models, and both approaches are too expensive to scale up to obtain good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images or a zero-shot recognition model to filter noisily crawled images for different object categories is a feasible way to make Copy-Paste truly scalable. To make such success happen, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves +2.6 box AP and +2.1 mask AP gains on all classes and even more significant gains with +6.8 box AP and +6.5 mask AP on long-tail classes.
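Using a zero-shot recognition model to filter noisily crawled or generated images can be sketched with an off-the-shelf CLIP checkpoint: keep an image only if CLIP prefers the target category over a generic negative prompt. The model name, prompts, and threshold below are assumptions for illustration, not X-Paste's actual filtering rules.

```python
# Hedged sketch of CLIP-based filtering of candidate images for a target category.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def keep_image(image: Image.Image, category: str, threshold: float = 0.5) -> bool:
    """Keep the image only if CLIP assigns enough probability to the target category."""
    prompts = [f"a photo of a {category}", "a photo of something else"]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return probs[0].item() > threshold
```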
In this paper, we aim to design an efficient real-time object detector that exceeds the YOLO series and is easily extensible for many object recognition tasks such as instance segmentation and rotated object detection. To obtain a more efficient model architecture, we explore an architecture that has compatible capacities in the backbone and neck, constructed by a basic building block that consists of large-kernel depth-wise convolutions. We further introduce soft labels when calculating matching costs in the dynamic label assignment to improve accuracy. Together with better training techniques, the resulting object detector, named RTMDet, achieves 52.8% AP on COCO with 300+ FPS on an NVIDIA 3090 GPU, outperforming the current mainstream industrial detectors. RTMDet achieves the best parameter-accuracy trade-off with tiny/small/medium/large/extra-large model sizes for various application scenarios, and obtains new state-of-the-art performance on real-time instance segmentation and rotated object detection. We hope the experimental results can provide new insights into designing versatile real-time object detectors for many object recognition tasks. Code and models are released at https://github.com/open-mmlab/mmdetection/tree/3.x/configs/rtmdet.
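The soft labels mentioned above replace hard 0/1 targets in the classification term of the matching cost, with the predicted-to-ground-truth box IoU serving as the soft target. The sketch below is a simplified illustration of this idea under assumed shapes and weighting; it is not RTMDet's exact cost function.

```python
# Hedged sketch of a soft-label classification cost for dynamic label assignment.
import torch
import torch.nn.functional as F

def soft_cls_cost(pred_scores: torch.Tensor, ious: torch.Tensor) -> torch.Tensor:
    """pred_scores: (num_preds, num_gts) class probabilities for each GT's category;
    ious: (num_preds, num_gts) box IoUs used as soft labels. Returns a cost matrix."""
    soft_label = ious
    ce = F.binary_cross_entropy(pred_scores, soft_label, reduction="none")
    # Down-weight pairs whose prediction already agrees with the soft label.
    return ce * (soft_label - pred_scores).pow(2)
```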
Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguishing hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4% and 1.5% improvement over a strong Cascade Mask R-CNN baseline on MSCOCO dataset. Moreover, our overall system achieves 48.6 mask AP on the test-challenge split, ranking 1st in the COCO 2018 Challenge Object Detection Task. Code is available at: https://github.com/open-mmlab/mmdetection.
Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance, or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO), and semantic segmentation (57.7 mIoU on ADE20K).
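Masked attention constrains each query's cross-attention to the foreground of the mask it predicted in the previous decoder layer. The sketch below illustrates this at a high level only; the tensor shapes, thresholding, and the single-head formulation are simplifications, not the actual Mask2Former implementation.

```python
# Hedged sketch of masked cross-attention between queries and flattened image features.
import torch

def masked_attention(queries, keys, values, mask_logits, threshold=0.5):
    """queries: (Q, C); keys, values: (HW, C); mask_logits: (Q, HW) from the previous layer."""
    scores = queries @ keys.t() / keys.shape[-1] ** 0.5    # (Q, HW) attention logits
    attn_mask = mask_logits.sigmoid() < threshold          # True where attention is blocked
    # If a predicted mask is empty, fall back to full attention for that query.
    empty = attn_mask.all(dim=-1, keepdim=True)
    attn_mask = attn_mask & ~empty
    scores = scores.masked_fill(attn_mask, float("-inf"))  # keep only foreground locations
    attn = scores.softmax(dim=-1)
    return attn @ values                                   # (Q, C) updated query features
```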
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.
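The dot-product mask prediction described above reduces to a single contraction between the decoder's query embeddings and the high-resolution pixel embedding map. A minimal sketch, with tensor names assumed for illustration:

```python
# Hedged sketch of predicting masks as a dot product between query and pixel embeddings.
import torch

def predict_masks(query_embed: torch.Tensor, pixel_embed: torch.Tensor) -> torch.Tensor:
    """query_embed: (num_queries, C); pixel_embed: (C, H, W) high-resolution embedding map.
    Returns (num_queries, H, W) mask logits; sigmoid + threshold yields binary masks."""
    return torch.einsum("qc,chw->qhw", query_embed, pixel_embed)
```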
Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation in the last decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on the semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, CityScapes, and COCO, despite the latter being trained on each of the three tasks individually with three times the resources. With new ConvNeXt and DiNAT backbones, we observe even more performance improvement. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we open-source our code and models at https://github.com/SHI-Labs/OneFormer
Frozen pretrained models have become a viable alternative to the pretraining-then-finetuning paradigm for transfer learning. However, with frozen models there are relatively few parameters available for adapting to downstream tasks, which is problematic in computer vision where tasks vary significantly in input/output format and the type of information that is of value. In this paper, we present a study of frozen pretrained models when applied to diverse and representative computer vision tasks, including object detection, semantic segmentation and video action recognition. From this empirical analysis, our work answers the questions of what pretraining task fits best with this frozen setting, how to make the frozen setting more flexible to various downstream tasks, and the effect of larger model sizes. We additionally examine the upper bound of performance using a giant frozen pretrained model with 3 billion parameters (SwinV2-G) and find that it reaches competitive performance on a varied set of major benchmarks with only one shared frozen base network: 60.0 box mAP and 52.2 mask mAP on COCO object detection test-dev, 57.6 val mIoU on ADE20K semantic segmentation, and 81.7 top-1 accuracy on Kinetics-400 action recognition. With this work, we hope to bring greater attention to this promising path of freezing pretrained image models.
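In the frozen setting, the pretrained backbone is kept fixed and only the newly attached task head is optimized. The sketch below shows the general pattern with a torchvision backbone; the model, feature dimension, and head are illustrative stand-ins, not the SwinV2-G setup studied in the paper.

```python
# Hedged sketch of the frozen-backbone setting: train only the task head.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()            # drop the classification head, keep features
for p in backbone.parameters():
    p.requires_grad_(False)                  # freeze all backbone weights
backbone.eval()                              # keep BatchNorm statistics fixed as well

head = torch.nn.Linear(2048, 10)             # task-specific head, the only trainable part
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)

def forward(images: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():                    # no gradients flow through the frozen backbone
        features = backbone(images)
    return head(features)
```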
Many open-world applications require detecting novel objects, yet state-of-the-art object detection and instance segmentation networks fall short at this task. The key issue lies in their assumption that regions without any annotations should be suppressed as negatives, which teaches the model to treat unannotated objects as background. To address this problem, we propose a simple yet surprisingly powerful data augmentation and training scheme, which we call Learning to Detect Every Thing (LDET). To avoid suppressing hidden objects, background objects that are visible but unlabeled, we paste annotated objects onto a background image sampled from a small region of the original image. Since training solely on such synthetically augmented images suffers from domain shift, we decouple training into two parts: 1) training the region classification and regression heads on the augmented images, and 2) training the mask head on the original images. In this way, the model does not learn to classify hidden objects as background while generalizing to real images. LDET leads to significant improvements on many datasets in the open-world instance segmentation task, outperforming baselines in cross-category generalization on COCO as well as in cross-dataset evaluation on UVO and Cityscapes.
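The background synthesis step described above can be sketched as follows: a small patch of the original image is upsampled into a full-size background (which is unlikely to contain hidden objects), and the annotated instances are pasted back on top. Patch size, sampling, and interpolation choices here are assumptions for illustration.

```python
# Hedged sketch of LDET-style augmentation: synthetic background + pasted annotated objects.
import numpy as np
import cv2

def ldet_augment(image, instance_masks, patch_size=32, rng=None):
    """image: HxWx3 uint8; instance_masks: list of HxW boolean masks of annotated objects."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    # Sample a small patch and enlarge it to the full image size as the new background.
    y = rng.integers(0, max(1, h - patch_size))
    x = rng.integers(0, max(1, w - patch_size))
    patch = image[y:y + patch_size, x:x + patch_size]
    background = cv2.resize(patch, (w, h), interpolation=cv2.INTER_LINEAR)
    # Paste the annotated instances onto the synthetic background.
    for mask in instance_masks:
        background[mask] = image[mask]
    return background
```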
We find that Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss, or even the training pipeline. In this report, we show that a universal image segmentation architecture trivially generalizes to video segmentation by directly predicting 3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art of 60.4 AP on YouTubeVIS-2019 and 52.6 AP on YouTubeVIS-2021. Given its versatility in image segmentation, we believe Mask2Former is also capable of handling video semantic and panoptic segmentation. We hope this will make state-of-the-art video segmentation research more accessible and bring more attention to designing universal image and video segmentation architectures.