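Long-tailed instance segmentation is a challenging task because training samples are extremely imbalanced across classes. This imbalance causes a severe bias of head classes (those with the majority of samples) against tail classes, which makes "how to appropriately define and alleviate the bias" one of the most important questions. Prior work mainly uses label distribution or mean score information to characterize a coarse-grained bias. In this paper, we explore the confusion matrix, which carries fine-grained misclassification details, to alleviate the pairwise bias, generalizing the coarse one. To this end, we propose a novel Pairwise Class Balance (PCB) method, built on a confusion matrix that is updated during training to accumulate the ongoing prediction preferences. PCB generates error-correcting soft labels for regularization during training. In addition, an iterative learning paradigm is developed to support progressive and smooth regularization of this debiasing. PCB can be plugged into any existing method as a complement. Experimental results on LVIS show that our method achieves state-of-the-art performance without bells and whistles. Superior results across various architectures demonstrate its generalization ability.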
translated by 谷歌翻译
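Vanilla models for object detection and instance segmentation suffer from a heavy bias toward detecting frequent objects in the long-tailed setting. Existing methods address this problem mostly during training, e.g., by re-sampling or re-weighting. In this paper, we investigate a largely overlooked approach: post-processing calibration of confidence scores. We propose NorCal, Normalized Calibration for long-tailed object detection and instance segmentation, a simple and straightforward recipe that reweights the predicted scores of each class by its training sample size. We show that handling the background class separately and normalizing the scores over classes for each proposal are keys to achieving superior performance. On the LVIS dataset, NorCal effectively improves all baseline models, not only on rare classes but also on common and frequent classes. Finally, we conduct extensive analysis and ablation studies to offer insights into the various modeling choices and mechanisms of our approach. Our code is publicly available at https://github.com/tydpan/norcal/.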
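The calibration idea described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the NorCal implementation: it assumes softmax-style scores with the background logit stored last, and the function name `calibrate_scores` and the exponent `gamma` are hypothetical names introduced for illustration.

```python
import math

def calibrate_scores(logits, class_counts, gamma=1.0):
    """Rescale foreground scores by training frequency, then renormalize.

    logits: per-proposal logits, foreground classes first, background last.
    class_counts: training sample count for each foreground class.
    gamma: hypothetical exponent controlling the calibration strength.
    """
    *fg_logits, bg_logit = logits
    m = max(logits)  # subtract the max for numerical stability
    # Divide each foreground score by its (powered) training sample count.
    fg = [math.exp(z - m) / (n ** gamma) for z, n in zip(fg_logits, class_counts)]
    bg = math.exp(bg_logit - m)  # the background class is left uncalibrated
    total = sum(fg) + bg         # normalize over all classes of this proposal
    return [s / total for s in fg] + [bg / total]

# Two foreground classes with equal logits but very different sample counts.
scores = calibrate_scores([2.0, 2.0, 0.0], class_counts=[1000, 10])
```

With equal logits, the class with fewer training samples ends up with the higher calibrated score, and the scores of each proposal still sum to one.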
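Major advances have recently been made in object detection and segmentation. However, when it comes to rare categories, state-of-the-art methods fail to detect them, resulting in a significant performance gap between rare and frequent categories. In this paper, we identify that the Sigmoid or Softmax functions used in deep detectors are a major reason for the low performance and are sub-optimal for long-tailed detection and segmentation. To address this, we develop a Gumbel Optimized Loss (GOL) for long-tailed detection and segmentation. It aligns with the Gumbel distribution of rare classes in the dataset, considering that most classes in long-tailed detection have a low expected probability. The proposed GOL significantly outperforms the best state-of-the-art method on AP. On the LVIS dataset, compared to Mask-RCNN, it improves overall segmentation by 9.0% and detection by 8.0%, and in particular improves the detection of rare classes by 20.3%. Code is available at https://github.com/kostas1515/gol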
Object recognition techniques using convolutional neural networks (CNN) have achieved great success. However, state-of-the-art object detection methods still perform poorly on large-vocabulary and long-tailed datasets, e.g., LVIS. In this work, we analyze this problem from a novel perspective: each positive sample of one category can be seen as a negative sample for other categories, so the tail categories receive more discouraging gradients. Based on this observation, we propose a simple but effective loss, named equalization loss, to tackle the problem of long-tailed rare categories by simply ignoring those gradients for rare categories. The equalization loss protects the learning of rare categories from being at a disadvantage during network parameter updates. Thus the model is capable of learning better discriminative features for objects of rare classes. Without any bells and whistles, our method achieves AP gains of 4.1% and 4.8% for the rare and common categories on the challenging LVIS benchmark, compared to the Mask R-CNN baseline. Using the equalization loss, we won 1st place in the LVIS Challenge 2019. Code has been made available at: https://github.com/tztztztztz/eql.detectron2
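The idea of ignoring discouraging gradients for rare categories can be illustrated with a per-class sigmoid cross-entropy in which the negative term is dropped for rare classes. This is a minimal sketch for a single foreground proposal, not the released code: the function name, the `class_freq` input, and the `freq_threshold` value are assumptions made for illustration.

```python
import math

def equalization_loss(logits, target, class_freq, freq_threshold=1e-3):
    """Sigmoid cross-entropy that skips negative terms of rare classes.

    logits: one logit per class for a single foreground proposal.
    target: index of the ground-truth class.
    class_freq: fraction of training images containing each class (assumed input).
    freq_threshold: hypothetical cutoff below which a class counts as rare.
    """
    loss = 0.0
    for j, z in enumerate(logits):
        p = 1.0 / (1.0 + math.exp(-z))
        y = 1.0 if j == target else 0.0
        is_rare = class_freq[j] < freq_threshold
        # A rare class that is not the target would only receive a
        # discouraging (negative) gradient, so its term is zeroed out.
        w = 0.0 if (is_rare and y == 0.0) else 1.0
        loss -= w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return loss

# Class 1 is rare in the first call and frequent in the second.
loss_rare_ignored = equalization_loss([0.0, 0.0], 0, [0.5, 1e-4])
loss_plain = equalization_loss([0.0, 0.0], 0, [0.5, 0.5])
```

In the first call only the positive term for class 0 remains, so the loss is smaller than the plain sigmoid cross-entropy of the second call.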
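Despite the recent success of long-tailed object detection, almost all long-tailed object detectors are developed on the two-stage paradigm. In practice, one-stage detectors are more prevalent in industry because their simple and fast pipeline is easy to deploy. However, in the long-tailed scenario, this line of work has not been explored so far. In this paper, we investigate whether one-stage detectors can perform well in this case. We find that the primary obstacle preventing one-stage detectors from achieving excellent performance is that, under a long-tailed data distribution, categories suffer from different degrees of positive-negative imbalance. The conventional focal loss balances the training process with the same modulating factor for all categories and thus fails to handle the long-tailed problem. To address this issue, we propose the Equalized Focal Loss (EFL), which rebalances the loss contribution of positive and negative samples of different categories independently according to their degree of imbalance. Specifically, EFL adopts a category-relevant modulating factor that can be adjusted dynamically according to the training status of different categories. Extensive experiments on the challenging LVIS v1 benchmark demonstrate the effectiveness of our proposed method. With an end-to-end training pipeline, EFL achieves 29.2% overall AP and obtains significant performance improvements on rare categories, surpassing all existing state-of-the-art methods. Code is available at https://github.com/modeltc/eod.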
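To make the notion of a category-relevant modulating factor concrete, the sketch below uses the standard focal-loss term and simply lets the focusing exponent differ per category. It is an illustration only: the abstract does not give the EFL formula, the dynamic adjustment from training status is not modelled, and the `category_gamma` values are hypothetical.

```python
import math

def focal_term(p, is_positive, gamma):
    """Focal-style loss for one binary prediction with probability p."""
    p_t = p if is_positive else 1.0 - p
    # (1 - p_t) ** gamma is the modulating factor that down-weights easy samples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Hypothetical per-category focusing factors (illustrative values only).
category_gamma = {"frequent": 2.0, "rare": 4.0}

# The same easy negative (p = 0.1) under two different category factors.
loss_frequent = focal_term(0.1, False, category_gamma["frequent"])
loss_rare = focal_term(0.1, False, category_gamma["rare"])
```

A larger exponent suppresses easy samples more strongly, which is one way a per-category factor can rebalance positive and negative contributions independently for each category.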
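The long-tailed distribution is a common phenomenon in the real world. Large-scale image datasets inevitably exhibit the long-tailed property, and models trained on imbalanced data can achieve high performance on over-represented categories but struggle on under-represented ones, leading to biased predictions and degraded performance. To address this challenge, we propose a novel de-biasing method named Inverse Image Frequency (IIF). IIF is a multiplicative margin adjustment transformation of the logits in the classification layer of a convolutional neural network. Our method achieves stronger performance than similar works, and it is especially useful for downstream tasks such as long-tailed instance segmentation because it produces fewer false positive detections. Our extensive experiments show that IIF surpasses the state of the art on many long-tailed benchmarks such as ImageNet-LT, CIFAR-LT, Places-LT and LVIS, reaching 55.8% top-1 accuracy with ResNet50 on ImageNet-LT and 26.2% segmentation AP with MaskRCNN on LVIS. Code is available at https://github.com/kostas1515/iif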
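A multiplicative logit adjustment of this kind can be sketched as follows. The specific weight used here, the logarithm of total samples over per-class samples, is an assumption chosen for illustration; the abstract does not state which inverse-frequency variant the authors use, and `iif_adjust` is a hypothetical name.

```python
import math

def iif_adjust(logits, class_counts):
    """Multiply each logit by an inverse-frequency weight.

    The weight log(total / count) is one plausible choice, assumed here.
    """
    total = sum(class_counts)
    weights = [math.log(total / n) for n in class_counts]
    return [w * z for w, z in zip(weights, logits)]

# Equal raw logits, but class 1 has far fewer training samples.
adjusted = iif_adjust([1.0, 1.0], class_counts=[900, 100])
```

With equal raw logits, the rarer class receives the larger adjusted logit, which counteracts the bias toward over-represented categories.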
Letting a deep network be aware of the quality of its own predictions is an interesting yet important problem. In the task of instance segmentation, the confidence of instance classification is used as the mask quality score in most instance segmentation frameworks. However, the mask quality, quantified as the IoU between the instance mask and its ground truth, is usually not well correlated with the classification score. In this paper, we study this problem and propose Mask Scoring R-CNN, which contains a network block to learn the quality of the predicted instance masks. The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. The mask scoring strategy calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance by prioritizing more accurate mask predictions during COCO AP evaluation. In extensive evaluations on the COCO dataset, Mask Scoring R-CNN brings consistent and noticeable gains with different models, and outperforms the state-of-the-art Mask R-CNN. We hope our simple and effective approach will provide a new direction for improving instance segmentation. The source code of our method is available at https://github.com/zjhuang22/maskscoring_rcnn. * The work was done when Zhaojin Huang was an intern in Horizon Robotics Inc.
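The mask quality that the network block regresses is the IoU between a predicted binary mask and its ground truth. A minimal sketch of that target, using nested lists of 0/1 values in place of tensors (the function name `mask_iou` is chosen for illustration):

```python
def mask_iou(pred, gt):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0  # pixel set in both masks
            union += 1 if (p or g) else 0   # pixel set in at least one mask
    return inter / union if union else 0.0

pred_mask = [[1, 1],
             [0, 0]]
gt_mask = [[1, 0],
           [1, 0]]
iou = mask_iou(pred_mask, gt_mask)
```

Here one pixel is shared and three pixels are covered by at least one mask, so the regression target is 1/3.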
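Object detection with single-point supervision (point-supervised object detection, PSOD) has received increasing attention over the years. In this paper, we attribute the large performance gap between point-supervised and bounding-box-supervised detection to the failure to generate high-quality proposal bags, which are crucial for multiple instance learning (MIL). To address this problem, we introduce a lightweight alternative to the off-the-shelf proposal (OTSP) method, thereby creating the Point-to-Box Network (P2BNet), which can construct an inter-object-balanced proposal bag by generating proposals in an anchor-like way. By fully exploiting the accurate position information, P2BNet further constructs an instance-level bag, avoiding the mixture of multiple objects. Finally, a coarse-to-fine policy in a cascade fashion is used to improve the IoU between proposals and the ground truth (GT). Benefiting from these strategies, P2BNet is able to produce high-quality instance-level bags for object detection. Compared with the previous best PSOD method on the MS COCO dataset, P2BNet improves the mean average precision (AP) by more than 50%. It also shows great potential to bridge the performance gap between point-supervised and bounding-box-supervised detectors. The code will be released at github.com/ucas-vg/p2bnet.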
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
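The role of the IoU threshold in defining positives and negatives can be sketched as below. The three stage thresholds are illustrative values that only reflect the "increasing IoU thresholds" mentioned above, and the helper names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def stage_labels(proposal, gt_box, thresholds=(0.5, 0.6, 0.7)):
    """Positive/negative label of one proposal at each cascade stage."""
    iou = box_iou(proposal, gt_box)
    return [iou >= t for t in thresholds]

# A proposal with IoU 0.65 against its ground-truth box.
labels = stage_labels((0, 0, 10, 6.5), (0, 0, 10, 10))
```

The same proposal counts as a positive for the first two stages and as a negative for the strictest one, which is why positives vanish as the threshold grows unless each stage is fed better hypotheses from the previous one.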
As the class size grows, maintaining a balanced dataset across many classes is challenging because the data are long-tailed in nature; it is even impossible when the samples of interest co-exist with each other in one collectable unit, e.g., multiple visual instances in one image. Therefore, long-tailed classification is the key to deep learning at scale. However, existing methods are mainly based on re-weighting/re-sampling heuristics that lack a fundamental theory. In this paper, we establish a causal inference framework, which not only unravels the whys of previous methods, but also derives a new principled solution. Specifically, our theory shows that the SGD momentum is essentially a confounder in long-tailed classification. On one hand, it has a harmful causal effect that misleads the tail prediction to be biased towards the head. On the other hand, its induced mediation also benefits the representation learning and head prediction. Our framework elegantly disentangles the paradoxical effects of the momentum by pursuing the direct causal effect caused by an input sample. In particular, we use causal intervention in training, and counterfactual reasoning in inference, to remove the "bad" while keeping the "good". We achieve new state-of-the-art results on three long-tailed visual recognition benchmarks: Long-tailed CIFAR-10/-100, ImageNet-LT for image classification, and LVIS for instance segmentation.
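Recent end-to-end multi-object detectors simplify the inference pipeline by removing hand-crafted processes such as non-maximum suppression (NMS). However, during training they still rely on bipartite matching to compute the loss from the detector output. Contrary to the directness that is at the core of end-to-end learning, bipartite matching makes the training of end-to-end detectors complex, heuristic, and dependent on the matching procedure. In this paper, we propose a method to train an end-to-end multi-object detector without bipartite matching. To this end, we treat end-to-end multi-object detection as a density estimation problem using a mixture model. Our proposed detector, called the Sparse Mixture Density Object Detector (Sparse MDOD), estimates the distribution of bounding boxes using a mixture model. Sparse MDOD is trained by minimizing the negative log-likelihood together with our proposed regularization term, the maximum component maximization (MCM) loss, which prevents duplicate predictions. During training, no additional procedure such as bipartite matching is needed, and the loss is computed directly from the network output. Moreover, our Sparse MDOD outperforms existing detectors on MS-COCO, a well-known multi-object detection benchmark.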
Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguish hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4% and 1.5% improvement over a strong Cascade Mask R-CNN baseline on MSCOCO dataset. Moreover, our overall system achieves 48.6 mask AP on the test-challenge split, ranking 1st in the COCO 2018 Challenge Object Detection Task. Code is available at: https://github.com/open-mmlab/mmdetection.
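Average precision (AP) loss has recently shown promising performance on dense object detection tasks. However, a deep understanding of how AP loss affects the detector has not yet been developed. In this work, we revisit the AP loss and reveal that the crucial element is the selection of ranking pairs. Based on this observation, we propose two strategies to improve the AP loss. The first is a novel Adaptive Pairwise Error (APE) loss that focuses on ranking pairs in both positive and negative samples. In addition, we select more accurate ranking pairs by exploiting normalized ranking scores and localization scores with a clustering algorithm. Experiments on the MSCOCO dataset support our analysis and demonstrate the superiority of our proposed method over current classification and ranking losses. The code is available at https://github.com/xudangliatiger/ape-loss.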
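Most state-of-the-art instance-level human parsing models adopt two-stage anchor-based detectors and therefore cannot avoid heuristic anchor box design and a lack of pixel-level analysis. To address these two issues, we design an instance-level human parsing network that is anchor-free and solvable at the pixel level. It consists of two simple sub-networks: an anchor-free detection head for bounding box prediction and an edge-guided parsing head for human segmentation. The anchor-free detection head inherits the pixel-wise merits and effectively avoids the sensitivity to hyper-parameters, as demonstrated in object detection applications. By introducing part-aware boundary clues, the edge-guided parsing head is able to distinguish adjacent human parts from one another, within a single human instance and even across overlapping instances. Meanwhile, a refinement head that integrates box-level scores and part-level parsing quality is used to improve the quality of the parsing results. Experiments on two multiple human parsing datasets (i.e., CIHP and LV-MHP-V2.0) and one video instance-level human parsing dataset (i.e., VIP) show that our method achieves the best global-level and instance-level performance over state-of-the-art one-stage top-down alternatives.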
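Knowledge distillation has achieved great success in classification; however, it remains challenging for detection. In a typical image for detection, representations from different locations may contribute differently to the detection targets, making the distillation hard to balance. In this paper, we propose a conditional distillation framework to distill the desired knowledge, namely knowledge that is beneficial with respect to both classification and localization for every instance. The framework introduces a learnable conditional decoding module, which retrieves information given each target instance as a query. Specifically, we encode the condition information as the query and use the teacher's representations as the key. The attention between query and key is used to measure the contribution of different features, guided by a localization-recognition-sensitive auxiliary task. Extensive experiments demonstrate the efficacy of our method: we observe impressive improvements under various settings. Notably, under the 1x schedule, we boost RetinaNet with a ResNet-50 backbone from 37.4 to 40.7 mAP (+3.3). Code has been released at https://github.com/megvii-research/icd.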
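Long-tailed learning aims to tackle the crucial challenge that head classes dominate the training procedure under severe class imbalance in real-world scenarios. However, little attention has been paid to how to quantify the dominance severity of head classes in the representation space. Motivated by this, we generalize the cosine-based classifier to a von Mises-Fisher (vMF) mixture model, denoted the vMF classifier, which enables us to quantitatively measure representation quality on the hypersphere space by computing the distribution overlap coefficient. To our knowledge, this is the first work to measure the representation quality of classifiers and features from the perspective of the distribution overlap coefficient. On top of this, we formulate inter-class discrepancy and class-feature consistency loss terms to alleviate the interference among classifier weights and to align features with classifier weights. Furthermore, a novel post-training calibration algorithm is devised to boost performance at zero cost via inter-class overlap coefficients. Our method outperforms previous work by a large margin and achieves state-of-the-art performance on long-tailed image classification, semantic segmentation, and instance segmentation tasks (e.g., we achieve 55.0% overall accuracy on ImageNet-LT). Our code is available at https://github.com/vipailab/vmf \_op.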
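Existing instance segmentation methods have achieved impressive performance but still suffer from a common dilemma: redundant representations (e.g., multiple boxes, grids, and anchor points) are inferred for one instance, which leads to multiple duplicated predictions. Mainstream methods therefore usually rely on a hand-designed non-maximum suppression (NMS) post-processing step to select the optimal prediction, which hinders end-to-end training. To address this issue, we propose a box-free and NMS-free end-to-end instance segmentation framework, termed UniInst, that yields only one unique representation for each instance. Specifically, we design an instance-aware one-to-one assignment scheme, namely Only Yield One Representation (OYOR), which dynamically assigns one unique representation to each instance according to the matching quality between predictions and ground truths. Then, a novel prediction re-ranking strategy is elegantly integrated into the framework to address the misalignment between the classification score and the mask quality, making the learned representation more discriminative. With these techniques, our UniInst, the first FCN-based box-free and NMS-free instance segmentation framework, achieves competitive performance against mainstream methods on COCO test-dev with ResNet-50-FPN, and 40.2 mask AP with ResNet-101-FPN. In addition, the proposed instance-aware method is robust to occlusion scenes, outperforming common baselines by a remarkable mask AP margin on the heavily occluded OCHuman benchmark. Our code will be available upon publication.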
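Despite being widely used as a performance measure for visual detection tasks, Average Precision (AP) is limited in (i) reflecting localization quality, (ii) robustness to the design choices of its computation, and (iii) applicability to outputs without confidence scores. Panoptic Quality (PQ), a measure proposed for evaluating panoptic segmentation (Kirillov et al., 2019), does not suffer from these limitations but is limited to panoptic segmentation. In this paper, we propose the Localization Recall Precision (LRP) Error as the average matching error of a visual detector, computed from both its localization and classification quality. LRP Error, initially proposed only for object detection by Oksuz et al. (2018), does not suffer from the aforementioned limitations and is applicable to all visual detection tasks. We also introduce the Optimal LRP (oLRP) Error, the minimum LRP Error obtained over confidence scores, to evaluate visual detectors and obtain optimal thresholds for deployment. We provide a detailed comparative analysis of LRP Error against AP and PQ, and we use nearly 100 state-of-the-art visual detectors from seven visual detection tasks (i.e., object detection, keypoint detection, instance segmentation, panoptic segmentation, visual relationship detection, zero-shot detection, and generalized zero-shot detection) on 10 datasets to show that LRP Error provides richer and more discriminative information than its counterparts. Code is available at https://github.com/kemaloksuz/lrp-error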
Most existing 3D point cloud object detection approaches rely heavily on large amounts of labeled training data. However, the labeling process is costly and time-consuming. This paper considers few-shot 3D point cloud object detection, where only a few annotated samples of novel classes are needed alongside abundant samples of base classes. To this end, we propose Prototypical VoteNet to recognize and localize novel instances, which incorporates two new modules: Prototypical Vote Module (PVM) and Prototypical Head Module (PHM). Specifically, as the basic 3D geometric structures can be shared among categories, PVM is designed to leverage class-agnostic geometric prototypes, which are learned from base classes, to refine local features of novel categories. PHM is then proposed to utilize class prototypes to enhance the global feature of each object, facilitating subsequent object localization and classification; it is trained with an episodic training strategy. To evaluate the model in this new setting, we contribute two new benchmark datasets, FS-ScanNet and FS-SUNRGBD. We conduct extensive experiments to demonstrate the effectiveness of Prototypical VoteNet, and our proposed method shows significant and consistent improvements compared to baselines on the two benchmark datasets.
Previous knowledge distillation (KD) methods for object detection mostly focus on feature imitation instead of mimicking the prediction logits, because logit mimicking is inefficient at distilling localization information. In this paper, we investigate whether logit mimicking always lags behind feature imitation. Towards this goal, we first present a novel localization distillation (LD) method which can efficiently transfer the localization knowledge from the teacher to the student. Second, we introduce the concept of a valuable localization region, which helps to selectively distill the classification and localization knowledge for a certain region. Combining these two new components, we show for the first time that logit mimicking can outperform feature imitation, and that the absence of localization distillation is a critical reason why logit mimicking has underperformed for years. Thorough studies exhibit the great potential of logit mimicking, which can significantly alleviate localization ambiguity, learn robust feature representations, and ease the training difficulty in the early stage. We also provide the theoretical connection between the proposed LD and the classification KD: they share an equivalent optimization effect. Our distillation scheme is simple as well as effective and can be easily applied to both dense horizontal object detectors and rotated object detectors. Extensive experiments on the MS COCO, PASCAL VOC, and DOTA benchmarks demonstrate that our method can achieve considerable AP improvement without any sacrifice in inference speed. Our source code and pretrained models are publicly available at https://github.com/HikariTJU/LD.