Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation (e.g., [13,12]) for instance segmentation, where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g., self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
Translated by Google Translate
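As an illustration of the random pasting described in the Copy-Paste abstract above, here is a minimal NumPy sketch. It is not the authors' implementation: the function name `copy_paste`, the requirement that both images share the same size, and the use of a single object mask are assumptions made for brevity.

```python
import numpy as np


def copy_paste(dst_img, src_img, src_mask, rng):
    """Paste the object selected by src_mask from src_img onto dst_img.

    dst_img, src_img: uint8 arrays of shape (H, W, 3) with identical H and W
    (an assumption of this sketch).
    src_mask: boolean array of shape (H, W), True on the object's pixels.
    Returns the augmented image and the pasted object's mask in dst coordinates.
    """
    h, w = src_mask.shape
    ys, xs = np.nonzero(src_mask)
    # Tight bounding box of the object in the source image.
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    obj_h, obj_w = y1 - y0, x1 - x0
    # Random top-left corner, chosen without looking at the destination content.
    ty = int(rng.integers(0, h - obj_h + 1))
    tx = int(rng.integers(0, w - obj_w + 1))

    # Object mask moved to its new location.
    pasted_mask = np.zeros((h, w), dtype=bool)
    pasted_mask[ty:ty + obj_h, tx:tx + obj_w] = src_mask[y0:y1, x0:x1]

    # Source pixels moved to the same location.
    shifted_src = np.zeros_like(dst_img)
    shifted_src[ty:ty + obj_h, tx:tx + obj_w] = src_img[y0:y1, x0:x1]

    # Take source pixels where the pasted mask is set, destination pixels elsewhere.
    out = np.where(pasted_mask[..., None], shifted_src, dst_img)
    return out, pasted_mask
```

A call would look like `copy_paste(dst, src, mask, np.random.default_rng(0))`. In a full training pipeline the masks of destination objects that the paste covers would also need to be updated; that step is omitted here.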
Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al. [1], however, show a contrasting result: ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled data further diminish the value of pre-training, 2) unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and 3) in the case that pre-training is helpful, self-training improves upon pre-training. For example, on the COCO object detection dataset, pre-training helps when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data. Self-training, on the other hand, shows positive improvements from +1.3 to +3.4 AP across all dataset sizes. In other words, self-training works well on exactly the same setup where pre-training does not work (using ImageNet to help COCO). On the PASCAL segmentation dataset, which is a much smaller dataset than COCO, pre-training does help significantly, yet self-training still improves upon the pre-trained model. On COCO object detection, we achieve 54.3 AP, an improvement of +1.5 AP over the strongest SpineNet model. On PASCAL segmentation, we achieve 90.5 mIOU, an improvement of +1.5% mIOU over the previous state-of-the-art result by DeepLabv3+. Code and checkpoints for our models are available at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/self_training. * Authors contributed equally. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when using the hyper-parameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pre-trained models, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization is surprisingly robust; our results hold even when: (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics. Experiments show that ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy. To push the envelope we demonstrate 50.9 AP on COCO object detection without using any external data, a result on par with the top COCO 2017 competition results that used ImageNet pre-training. These observations challenge the conventional wisdom of ImageNet pre-training for dependent tasks and we expect these discoveries will encourage people to rethink the current de facto paradigm of 'pre-training and fine-tuning' in computer vision.
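A common practice in transfer learning is to initialize the downstream model weights by pre-training on a data-abundant upstream task. In object detection, the feature backbone is typically initialized with the weights of an ImageNet classifier and fine-tuned on the object detection task. Recent works show that this is not strictly necessary under longer training regimes and provide recipes for training the backbone from scratch. We investigate the opposite direction of this end-to-end training trend: we show that an extreme form of knowledge preservation, freezing the classifier-initialized backbone, consistently improves many different detection models and leads to considerable resource savings. We hypothesize, and corroborate experimentally, that the capacity and structure of the remaining detector components are a crucial factor in leveraging the frozen backbone. Immediate applications of our findings include performance improvements on hard cases, such as detection of long-tail object classes, as well as computational and memory resource savings that help make the field more accessible to researchers with fewer computational resources.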
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between 'enormous data' and visual deep learning. By exploiting the JFT-300M dataset, which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically with the volume of training data. Second, we show that representation learning (or pre-training) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires the vision community to not undervalue the data and to develop collective efforts in building larger datasets.
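Many open-world applications require the detection of novel objects, yet state-of-the-art object detection and instance segmentation networks do not excel at this task. The key issue lies in their assumption that regions without any annotations should be suppressed as negatives, which teaches the model to treat unannotated objects as background. To address this issue, we propose a simple yet surprisingly powerful data augmentation and training scheme that we call Learning to Detect Every Thing (LDET). To avoid suppressing hidden objects, that is, background objects that are visible but unlabeled, we paste annotated objects onto a background image sampled from a small region of the original image. Since training solely on such synthetically augmented images suffers from domain shift, we decouple the training into two parts: 1) training the region classification and regression heads on augmented images, and 2) training the mask heads on original images. In this way, the model does not learn to classify hidden objects as background while still generalizing well to real images. LDET leads to significant improvements on many datasets in the open-world instance segmentation task, outperforming baselines on cross-category generalization on COCO as well as on cross-dataset evaluation on UVO and Cityscapes.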
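We propose an embarrassingly simple point annotation scheme to collect weak supervision for instance segmentation. In addition to bounding boxes, we collect binary labels for a set of points uniformly sampled inside each bounding box. We show that existing instance segmentation models developed for full mask supervision can be seamlessly trained with the point-based supervision collected via our scheme. Remarkably, Mask R-CNN trained on COCO, PASCAL VOC, Cityscapes, and LVIS with only 10 annotated random points per object achieves 94%-98% of its fully supervised performance, setting a strong baseline for weakly supervised instance segmentation. The new point annotation scheme is approximately 5 times faster than annotating full object masks, making high-quality instance segmentation more accessible in practice. Inspired by the point-based annotation form, we propose a modification to the PointRend instance segmentation module. For each object, the new architecture, called Implicit PointRend, generates the parameters of a function that makes the final point-level mask prediction. Implicit PointRend is more straightforward and uses a single point-level mask loss. Our experiments show that the new module is more suitable for point-based supervision.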
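The point-based supervision described above can be sketched as a mask loss evaluated only at the annotated points. This is a simplified illustration, not the authors' code: the helper names `sample_points_in_box` and `point_bce_loss` are hypothetical, and sampling integer pixel locations (rather than continuous coordinates) is an assumption of the sketch.

```python
import numpy as np


def sample_points_in_box(box, num_points, rng):
    """Uniformly sample integer pixel locations inside box = (x0, y0, x1, y1).

    The upper bounds x1 and y1 are exclusive.
    """
    x0, y0, x1, y1 = box
    xs = rng.integers(x0, x1, size=num_points)
    ys = rng.integers(y0, y1, size=num_points)
    return xs, ys


def point_bce_loss(mask_prob, xs, ys, point_labels, eps=1e-7):
    """Mean binary cross-entropy at the annotated points only.

    mask_prob: float array (H, W) of predicted foreground probabilities.
    point_labels: boolean array, True where the annotator marked "object".
    """
    p = np.clip(mask_prob[ys, xs], eps, 1.0 - eps)
    y = point_labels.astype(np.float64)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))
```

With 10 points per box, as in the abstract, the loss touches 10 pixels per object instead of the full mask, which is what makes the annotation cheap.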
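Data augmentation is an essential technique for improving the generalization of deep neural networks. Most existing image-domain augmentations either rely on geometric and structural transformations or apply different kinds of photometric distortions. In this paper, we propose an effective technique for image augmentation that injects contextually meaningful knowledge into scenes. Our method, semantically meaningful image augmentation for object detection via language grounding (SemAug), first computes semantically appropriate new objects that can be placed at relevant locations in the image (the what and where problems). It then embeds these objects at their relevant target locations, thereby promoting diversity in the distribution of object instances. Our method allows the introduction of new object instances and categories that may not exist in the training set. Furthermore, it does not require the additional overhead of training a context network, so it can easily be added to existing architectures. Our comprehensive set of evaluations shows that the proposed method is very effective in improving generalization, while the overhead is negligible. In particular, for a wide range of model architectures, our method achieves mAP improvements of about 2-4% and 1-2% on the Pascal VOC and COCO datasets, respectively.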
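Open-world instance segmentation (OWIS) aims to segment class-agnostic instances from images, which has a wide range of real-world applications such as autonomous driving. Most existing approaches follow a two-stage pipeline: class-agnostic detection first, followed by class-specific mask segmentation. In contrast, this paper proposes a single-stage framework that produces a mask for each instance directly. In addition, instance mask annotations in existing datasets can be noisy. To overcome this issue, we introduce a new regularization loss. Specifically, we first train an extra branch to perform the auxiliary task of predicting foreground regions (i.e., regions belonging to any object instance), and then encourage the prediction of the auxiliary branch to be consistent with the predictions of the instance masks. The key insight is that such a cross-task consistency loss can act as an error-correcting mechanism to combat errors in the annotations. Furthermore, we find that the proposed cross-task consistency loss can be applied to images without any annotations, lending itself to a semi-supervised learning method. Through extensive experiments, we demonstrate that the proposed method achieves impressive results in both fully supervised and semi-supervised settings. Compared to SOTA methods, the proposed method improves the $AP_{100}$ score by 4.75% in the [...] $\rightarrow$ UVO setting and by 4.05% in the [...] $\rightarrow$ UVO setting. In the case of semi-supervised learning, our model trained with only 30% labeled data even outperforms its fully supervised counterpart trained with 50% labeled data. The code will be released soon.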
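Many recent semi-supervised learning (SSL) studies build a teacher-student architecture and train the student network with the supervisory signal generated by the teacher. The data augmentation strategy plays a significant role in the SSL framework, since it is hard to create a weak-strong augmented input pair without losing label information. Especially when extending SSL to semi-supervised object detection (SSOD), many strong augmentation methods related to image geometry and interpolation regularization are hard to utilize, because they may corrupt the location information of the bounding boxes in the object detection task. To address this, we introduce a simple yet effective data augmentation method, Mix/UnMix (MUM), which unmixes feature tiles for the mixed image tiles in the SSOD framework. Our proposed method mixes tiles of the input images and reconstructs them in the feature space. Thus, MUM can enjoy the interpolation-regularization effect from non-interpolated pseudo-labels and successfully generate a meaningful weak-strong pair. Furthermore, MUM can easily be added on top of various SSOD methods. Extensive experiments on the MS-COCO and PASCAL VOC datasets demonstrate the superiority of MUM by consistently improving mAP over the baseline in all tested SSOD benchmark protocols.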
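The tile mixing and unmixing described above can be sketched as a per-tile shuffle across the batch and its inverse. This is a simplified illustration under assumptions: the names `mix_tiles` and `unmix_tiles` are hypothetical, tiles form a regular `grid` x `grid` layout, and the same unmix routine is applied to any (B, H, W, C) array, which stands in for a feature map.

```python
import numpy as np


def mix_tiles(batch, grid, rng):
    """Shuffle every tile position across the batch dimension.

    batch: array of shape (B, H, W, C) with H and W divisible by grid.
    Returns the mixed batch and perms, where perms[i, j] is the batch
    permutation that was applied to tile (i, j).
    """
    b, h, w, _ = batch.shape
    th, tw = h // grid, w // grid
    mixed = np.empty_like(batch)
    perms = np.empty((grid, grid, b), dtype=np.int64)
    for i in range(grid):
        for j in range(grid):
            perm = rng.permutation(b)
            perms[i, j] = perm
            rows = slice(i * th, (i + 1) * th)
            cols = slice(j * tw, (j + 1) * tw)
            # Image k of the mixed batch receives tile (i, j) of image perm[k].
            mixed[:, rows, cols] = batch[perm][:, rows, cols]
    return mixed, perms


def unmix_tiles(mixed, perms):
    """Undo mix_tiles by applying the inverse permutation to every tile."""
    grid, _, _ = perms.shape
    _, h, w, _ = mixed.shape
    th, tw = h // grid, w // grid
    out = np.empty_like(mixed)
    for i in range(grid):
        for j in range(grid):
            inverse = np.argsort(perms[i, j])
            rows = slice(i * th, (i + 1) * th)
            cols = slice(j * tw, (j + 1) * tw)
            out[:, rows, cols] = mixed[inverse][:, rows, cols]
    return out
```

Because `unmix_tiles` only needs the stored permutations and a spatial size divisible by `grid`, the same permutations could be reused on a downsampled feature map of the mixed images, which is the role unmixing plays in the abstract.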
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
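To make the role of the IoU threshold concrete, the following sketch labels a fixed set of proposals at several thresholds. It illustrates only the positive/negative assignment, not the Cascade R-CNN architecture; the function names and the thresholds 0.5, 0.6 and 0.7 are illustrative assumptions, since the abstract does not state specific values.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_proposals(proposals, gt_boxes, iou_threshold):
    """Mark a proposal positive when its best IoU with any ground-truth box
    reaches the threshold."""
    return [
        max((box_iou(p, g) for g in gt_boxes), default=0.0) >= iou_threshold
        for p in proposals
    ]


gt_boxes = [(0, 0, 10, 10)]
proposals = [(0, 0, 10, 10), (0, 0, 8, 8), (0, 0, 10, 5.5), (20, 20, 30, 30)]
# Number of positives at each illustrative threshold.
positives_per_stage = [
    sum(label_proposals(proposals, gt_boxes, t)) for t in (0.5, 0.6, 0.7)
]
```

With a fixed proposal set, the number of positives shrinks as the threshold grows, which is the vanishing-positives problem the abstract names. The cascade counters it by feeding each stage the refined output of the previous one.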
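To improve instance-level detection and segmentation performance, existing self-supervised and semi-supervised methods extract either very task-unrelated or very task-specific training signals from unlabeled data. We argue that these two approaches, which sit at the two extremes of the task-specificity spectrum, are suboptimal for task performance. Using too little task-specific training signal leads to underfitting to the ground-truth labels of the downstream task, while the opposite leads to overfitting to the ground-truth labels. To this end, we propose a novel class-agnostic semi-supervised pretraining (CASP) framework that achieves a more favorable task-specificity balance when extracting training signals from unlabeled data. Compared to semi-supervised learning, CASP reduces the task specificity of the training signals by ignoring class information in the pseudo labels and by having a separate pretraining stage that uses only task-unrelated unlabeled data. On the other hand, CASP preserves the right amount of task specificity by leveraging box/mask-level pseudo labels. As a result, our pretrained model can better avoid underfitting or overfitting to ground-truth labels when fine-tuned on the downstream task. Using 3.6M unlabeled images, we achieve a significant performance gain of 4.7% on object detection. Our pretrained model also demonstrates excellent transferability to other detection and segmentation tasks and frameworks.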
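This paper presents a comprehensive evaluation of instance segmentation models with respect to real-world image corruptions as well as out-of-domain image collections, e.g., images captured with a different setup than the training dataset. The out-of-domain image evaluation shows the generalization capability of models, which is an essential aspect of real-world applications and an extensively studied topic in domain adaptation. The presented robustness and generalization evaluations are important when designing instance segmentation models for real-world applications and when picking an off-the-shelf pretrained model to use directly for the task at hand. Specifically, this benchmark study covers state-of-the-art network architectures, network backbones, normalization layers, models trained from scratch versus pretrained networks, and the effect of multi-task training on robustness and generalization. Through this study, we gain several insights. For example, we find that group normalization enhances the robustness of networks across corruptions where the image content stays the same but corruptions are added on top. On the other hand, batch normalization improves generalization across different datasets, where the statistics of image features change. We also find that single-stage detectors do not generalize well to image resolutions larger than their training size. Multi-stage detectors, on the other hand, can easily be used on images of different sizes. We hope that our comprehensive study will motivate the development of more robust and reliable instance segmentation models.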
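Labeling data is often expensive and time-consuming, especially for tasks such as object detection and instance segmentation, which require dense labeling of the image. While few-shot object detection is about training a model on novel (unseen) object classes with little data, it still requires prior training on many labeled examples of base (seen) classes. Self-supervised methods, on the other hand, aim at learning representations from unlabeled data that transfer well to downstream tasks such as object detection. Combining few-shot and self-supervised object detection is a promising research direction. In this survey, we review and characterize the most recent approaches to few-shot and self-supervised object detection. We then give our main takeaways and discuss future research directions. Project page: https://gabrielhuang.github.io/fsod-survey/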
Jitendra Malik once said, "Supervision is the opium of the AI researcher". Most deep learning techniques heavily rely on extreme amounts of human labels to work effectively. In today's world, the rate of data creation greatly surpasses the rate of data annotation. Full reliance on human annotations is just a temporary means to solve current closed problems in AI. In reality, only a tiny fraction of data is annotated. Annotation Efficient Learning (AEL) is a study of algorithms to train models effectively with fewer annotations. To thrive in AEL environments, we need deep learning techniques that rely less on manual annotations (e.g., image, bounding-box, and per-pixel labels), but learn useful information from unlabeled data. In this thesis, we explore five different techniques for handling AEL.
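Recent advances in semi-supervised object detection (SSOD) are largely driven by consistency-based pseudo-labeling methods developed for image classification tasks, which produce pseudo labels as supervisory signals. However, when pseudo labels are used, localization precision and amplified class imbalance are not taken into account, although both are critical for detection tasks. In this paper, we introduce certainty-aware pseudo labels tailored for object detection, which can effectively estimate the classification and localization quality of the derived pseudo labels. This is achieved by converting conventional localization into a classification task. Conditioned on the classification and localization quality scores, we dynamically adjust the thresholds used to generate pseudo labels and reweight the loss functions for each category to alleviate the class imbalance problem. Extensive experiments show that our method improves state-of-the-art SSOD performance by 1-2% AP on COCO and PASCAL VOC, while being orthogonal and complementary to most existing methods. In the limited-annotation regime, our method improves over supervised baselines when using only 1-10% of the labeled data from COCO.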
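A core component of the recent success of self-supervised learning is cropping data augmentation, which selects sub-regions of an image to be used as positive views in the self-supervised loss. The underlying assumption is that randomly cropped and resized regions of a given image share information about the objects of interest, which the learned representation will capture. This assumption is mostly satisfied in datasets such as ImageNet, where there is a large, centered object that is highly likely to be present in random crops of the full image. However, in other datasets such as OpenImages or COCO, which are more representative of real-world uncurated data, there are typically multiple small objects in an image. In this work, we show that self-supervised learning based on the usual random cropping performs poorly on such datasets. We propose replacing one or both of the random crops with crops obtained from an object proposal algorithm. This encourages the model to learn both object-level and scene-level semantic representations. Using this approach, which we call object-aware cropping, results in significant improvements over scene cropping on classification and object detection benchmarks. For example, on OpenImages, our approach achieves an improvement of 8.8% mAP using MoCo-v2 based pre-training. We also show significant improvements on COCO and PASCAL VOC object detection and segmentation tasks over state-of-the-art self-supervised learning approaches. Our approach is efficient, simple and general, and can be used in most existing contrastive and non-contrastive self-supervised learning frameworks.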
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-of-the-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, top-performing method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
Object recognition techniques using convolutional neural networks (CNN) have achieved great success. However, state-of-the-art object detection methods still perform poorly on large vocabulary and long-tailed datasets, e.g. LVIS. In this work, we analyze this problem from a novel perspective: each positive sample of one category can be seen as a negative sample for other categories, making the tail categories receive more discouraging gradients. Based on this observation, we propose a simple but effective loss, named equalization loss, to tackle the problem of long-tailed rare categories by simply ignoring those gradients for rare categories. The equalization loss protects the learning of rare categories from being at a disadvantage during the network parameter updating. Thus the model is capable of learning better discriminative features for objects of rare classes. Without any bells and whistles, our method achieves AP gains of 4.1% and 4.8% for the rare and common categories on the challenging LVIS benchmark, compared to the Mask R-CNN baseline. With the utilization of the effective equalization loss, we finally won the 1st place in the LVIS Challenge 2019. Code has been made available at: https://github.com/tztztztztz/eql.detectron2
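The idea of ignoring discouraging gradients for rare categories can be sketched as a per-class weight on a sigmoid cross-entropy loss. This is a simplified illustration and not the released implementation: the function name `equalization_style_loss`, the boolean `is_rare` vector, and the restriction to a single foreground sample are assumptions of the sketch.

```python
import numpy as np


def equalization_style_loss(logits, target, is_rare):
    """Sigmoid cross-entropy for one foreground sample that skips the
    negative terms of rare classes.

    logits: float array (C,) of class scores.
    target: one-hot float array (C,), 1.0 at the ground-truth class.
    is_rare: boolean array (C,), True for categories treated as rare.
    """
    p = 1.0 / (1.0 + np.exp(-logits))
    per_class = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    # Weight is 0 only where the class is rare AND this sample is a negative
    # for it, so no discouraging gradient reaches that rare class.
    weight = 1.0 - is_rare.astype(np.float64) * (1.0 - target)
    return float(np.sum(weight * per_class))
```

When the sample's ground-truth class is itself rare, its positive term keeps weight 1, so rare classes are still learned from their own examples and only the suppression from other categories' samples is removed.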
In contrast to fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations, which has recently attracted increasing research attention. This paper presents a novel single-shot instance segmentation approach, namely Box2Mask, which integrates the classical level-set evolution model into deep neural network learning to achieve accurate mask prediction with only bounding box supervision. Specifically, both the input image and its deep features are employed to evolve the level-set curves implicitly, and a local consistency module based on a pixel affinity kernel is used to mine the local context and spatial relations. Two types of single-stage frameworks, i.e., CNN-based and transformer-based frameworks, are developed to empower the level-set evolution for box-supervised instance segmentation, and each framework consists of three essential components: instance-aware decoder, box-level matching assignment and level-set evolution. By minimizing the level-set energy function, the mask map of each instance can be iteratively optimized within its bounding box annotation. The experimental results on five challenging testbeds, covering general scenes, remote sensing, medical and scene text images, demonstrate the outstanding performance of our proposed Box2Mask approach for box-supervised instance segmentation. In particular, with the Swin-Transformer large backbone, our Box2Mask obtains 42.4% mask AP on COCO, which is on par with the recently developed fully mask-supervised methods. The code is available at: https://github.com/LiWentomng/boxlevelset.