Mask r-cnn
分类:
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without tricks, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code will be made available.
translated by 谷歌翻译
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A topdown architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art singlemodel results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
translated by 谷歌翻译
We present a new, embarrassingly simple approach to instance segmentation. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that have made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the "detect-then-segment" strategy (e.g., Mask R-CNN), or predict embedding vectors first then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance segmentation into a single-shot classification-solvable problem. We demonstrate a much simpler and flexible instance segmentation framework with strong performance, achieving on par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation. Code is available at https://git.io/AdelaiDet
translated by 谷歌翻译
The recently introduced panoptic segmentation task has renewed our community's interest in unifying the tasks of instance segmentation (for thing classes) and semantic segmentation (for stuff classes). However, current state-ofthe-art methods for this joint task use separate and dissimilar networks for instance and semantic segmentation, without performing any shared computation. In this work, we aim to unify these methods at the architectural level, designing a single network for both tasks. Our approach is to endow Mask R-CNN, a popular instance segmentation method, with a semantic segmentation branch using a shared Feature Pyramid Network (FPN) backbone. Surprisingly, this simple baseline not only remains effective for instance segmentation, but also yields a lightweight, topperforming method for semantic segmentation. In this work, we perform a detailed study of this minimally extended version of Mask R-CNN with FPN, which we refer to as Panoptic FPN, and show it is a robust and accurate baseline for both tasks. Given its effectiveness and conceptual simplicity, we hope our method can serve as a strong baseline and aid future research in panoptic segmentation.
translated by 谷歌翻译
Cascade is a classic yet powerful architecture that has boosted performance on various tasks. However, how to introduce cascade to instance segmentation remains an open question. A simple combination of Cascade R-CNN and Mask R-CNN only brings limited gain. In exploring a more effective approach, we find that the key to a successful instance segmentation cascade is to fully leverage the reciprocal relationship between detection and segmentation. In this work, we propose a new framework, Hybrid Task Cascade (HTC), which differs in two important aspects: (1) instead of performing cascaded refinement on these two tasks separately, it interweaves them for a joint multi-stage processing; (2) it adopts a fully convolutional branch to provide spatial context, which can help distinguishing hard foreground from cluttered background. Overall, this framework can learn more discriminative features progressively while integrating complementary features together in each stage. Without bells and whistles, a single HTC obtains 38.4% and 1.5% improvement over a strong Cascade Mask R-CNN baseline on MSCOCO dataset. Moreover, our overall system achieves 48.6 mask AP on the test-challenge split, ranking 1st in the COCO 2018 Challenge Object Detection Task. Code is available at: https://github.com/ open-mmlab/mmdetection.
translated by 谷歌翻译
The way that information propagates in neural networks is of great importance. In this paper, we propose Path Aggregation Network (PANet) aiming at boosting information flow in proposal-based instance segmentation framework. Specifically, we enhance the entire feature hierarchy with accurate localization signals in lower layers by bottom-up path augmentation, which shortens the information path between lower layers and topmost feature. We present adaptive feature pooling, which links feature grid and all feature levels to make useful information in each feature level propagate directly to following proposal subnetworks. A complementary branch capturing different views for each proposal is created to further improve mask prediction.These improvements are simple to implement, with subtle extra computational overhead. Our PANet reaches the 1 st place in the COCO 2017 Challenge Instance Segmentation task and the 2 nd place in Object Detection task without large-batch training. It is also state-of-the-art on MVD and Cityscapes. Code is available at https://github. com/ShuLiu1993/PANet.
translated by 谷歌翻译
Letting a deep network be aware of the quality of its own predictions is an interesting yet important problem. In the task of instance segmentation, the confidence of instance classification is used as mask quality score in most instance segmentation frameworks. However, the mask quality, quantified as the IoU between the instance mask and its ground truth, is usually not well correlated with classification score. In this paper, we study this problem and propose Mask Scoring R-CNN which contains a network block to learn the quality of the predicted instance masks. The proposed network block takes the instance feature and the corresponding predicted mask together to regress the mask IoU. The mask scoring strategy calibrates the misalignment between mask quality and mask score, and improves instance segmentation performance by prioritizing more accurate mask predictions during COCO AP evaluation. By extensive evaluations on the COCO dataset, Mask Scoring R-CNN brings consistent and noticeable gain with different models, and outperforms the state-of-the-art Mask R-CNN. We hope our simple and effective approach will provide a new direction for improving instance segmentation. The source code of our method is available at https:// github.com/zjhuang22/maskscoring_rcnn. * The work was done when Zhaojin Huang was an intern in Horizon Robotics Inc.
translated by 谷歌翻译
In this paper, we introduce an anchor-box free and single shot instance segmentation method, which is conceptually simple, fully convolutional and can be used by easily embedding it into most off-the-shelf detection methods. Our method, termed PolarMask, formulates the instance segmentation problem as predicting contour of instance through instance center classification and dense distance regression in a polar coordinate. Moreover, we propose two effective approaches to deal with sampling high-quality center examples and optimization for dense distance regression, respectively, which can significantly improve the performance and simplify the training process. Without any bells and whistles, PolarMask achieves 32.9% in mask mAP with single-model and single-scale training/testing on the challenging COCO dataset.For the first time, we show that the complexity of instance segmentation, in terms of both design and computation complexity, can be the same as bounding box object detection and this much simpler and flexible instance segmentation framework can achieve competitive accuracy. We hope that the proposed PolarMask framework can serve as a fundamental and strong baseline for single shot instance segmentation task. Code is available at: github.com/xieenze/PolarMask.
translated by 谷歌翻译
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
translated by 谷歌翻译
We propose a simple yet effective instance segmentation framework, termed CondInst (conditional convolutions for instance segmentation). Top-performing instance segmentation methods such as Mask R-CNN rely on ROI operations (typically ROIPool or ROIAlign) to obtain the final instance masks. In contrast, we propose to solve instance segmentation from a new perspective. Instead of using instancewise ROIs as inputs to a network of fixed weights, we employ dynamic instance-aware networks, conditioned on instances. CondInst enjoys two advantages: 1) Instance segmentation is solved by a fully convolutional network, eliminating the need for ROI cropping and feature alignment.2) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference. We demonstrate a simpler instance segmentation method that can achieve improved performance in both accuracy and inference speed. On the COCO dataset, we outperform a few recent methods including welltuned Mask R-CNN baselines, without longer training schedules needed.
translated by 谷歌翻译
translated by 谷歌翻译
We present a simple, fully-convolutional model for realtime instance segmentation that achieves 29.8 mAP on MS COCO at 33.5 fps evaluated on a single Titan Xp, which is significantly faster than any previous competitive approach. Moreover, we obtain this result after training on only one GPU. We accomplish this by breaking instance segmentation into two parallel subtasks: (1) generating a set of prototype masks and (2) predicting per-instance mask coefficients. Then we produce instance masks by linearly combining the prototypes with the mask coefficients. We find that because this process doesn't depend on repooling, this approach produces very high-quality masks and exhibits temporal stability for free. Furthermore, we analyze the emergent behavior of our prototypes and show they learn to localize instances on their own in a translation variant manner, despite being fully-convolutional. Finally, we also propose Fast NMS, a drop-in 12 ms faster replacement for standard NMS that only has a marginal performance penalty.
translated by 谷歌翻译
我们为变体视觉任务提供了一个概念上简单,灵活和通用的视觉感知头,例如分类,对象检测,实例分割和姿势估计以及不同的框架,例如单阶段或两个阶段的管道。我们的方法有效地标识了图像中的对象,同时同时生成高质量的边界框或基于轮廓的分割掩码或一组关键点。该方法称为Unihead,将不同的视觉感知任务视为通过变压器编码器体系结构学习的可分配点。给定固定的空间坐标,Unihead将其自适应地分散到了不同的空间点和有关它们的关系的原因。它以多个点的形式直接输出最终预测集,使我们能够在具有相同头部设计的不同框架中执行不同的视觉任务。我们展示了对成像网分类的广泛评估以及可可套件的所有三个曲目,包括对象检测,实例分割和姿势估计。如果没有铃铛和口哨声,Unihead可以通过单个视觉头设计统一这些视觉任务,并与为每个任务开发的专家模型相比,实现可比的性能。我们希望我们的简单和通用的Unihead能够成为可靠的基线,并有助于促进通用的视觉感知研究。代码和型号可在https://github.com/sense-x/unihead上找到。
translated by 谷歌翻译
两阶段和基于查询的实例分段方法取得了显着的结果。然而,他们的分段面具仍然非常粗糙。在本文中,我们呈现了用于高质量高效的实例分割的掩模转发器。我们的掩模转发器代替常规密集的张量,而不是在常规密集的张量上进行分解,并表示作为Quadtree的图像区域。我们基于变换器的方法仅处理检测到的错误易于树节点,并并行自我纠正其错误。虽然这些稀疏的像素仅构成总数的小比例,但它们对最终掩模质量至关重要。这允许掩模转换器以低计算成本预测高精度的实例掩模。广泛的实验表明,掩模转发器在三个流行的基准上优于当前实例分段方法,显着改善了COCO和BDD100K上的大型+3.0掩模AP的+3.0掩模AP的大余量和CityScapes上的+6.6边界AP。我们的代码和培训的型号将在http://vis.xyz/pub/transfiner提供。
translated by 谷歌翻译
大多数最先进的实例级人类解析模型都采用了两阶段的基于锚的探测器,因此无法避免启发式锚盒设计和像素级别缺乏分析。为了解决这两个问题,我们设计了一个实例级人类解析网络,该网络在像素级别上无锚固且可解决。它由两个简单的子网络组成:一个用于边界框预测的无锚检测头和一个用于人体分割的边缘引导解析头。无锚探测器的头继承了像素样的优点,并有效地避免了对象检测应用中证明的超参数的敏感性。通过引入部分感知的边界线索,边缘引导的解析头能够将相邻的人类部分与彼此区分开,最多可在一个人类实例中,甚至重叠的实例。同时,利用了精炼的头部整合盒子级别的分数和部分分析质量,以提高解析结果的质量。在两个多个人类解析数据集(即CIHP和LV-MHP-V2.0)和一个视频实例级人类解析数据集(即VIP)上进行实验,表明我们的方法实现了超过全球级别和实例级别的性能最新的一阶段自上而下的替代方案。
translated by 谷歌翻译
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
translated by 谷歌翻译
分割高度重叠的图像对象是具有挑战性的,因为图像上的真实对象轮廓和遮挡边界之间通常没有区别。与先前的实例分割方法不同,我们将图像形成模拟为两个重叠层的组成,并提出了双层卷积网络(BCNET),其中顶层检测到遮挡对象(遮挡器),而底层则渗透到部分闭塞实例(胶囊)。遮挡关系与双层结构的显式建模自然地将遮挡和遮挡实例的边界解散,并在掩模回归过程中考虑了它们之间的相互作用。我们使用两种流行的卷积网络设计(即完全卷积网络(FCN)和图形卷积网络(GCN))研究了双层结构的功效。此外,我们通过将图像中的实例表示为单独的可学习封闭器和封闭者查询,从而使用视觉变压器(VIT)制定双层解耦。使用一个/两个阶段和基于查询的对象探测器具有各种骨架和网络层选择验证双层解耦合的概括能力,如图像实例分段基准(可可,亲戚,可可)和视频所示实例分割基准(YTVIS,OVIS,BDD100K MOTS),特别是对于重闭塞病例。代码和数据可在https://github.com/lkeab/bcnet上找到。
translated by 谷歌翻译
In this paper, we propose a unified panoptic segmentation network (UPSNet) for tackling the newly proposed panoptic segmentation task. On top of a single backbone residual network, we first design a deformable convolution based semantic segmentation head and a Mask R-CNN style instance segmentation head which solve these two subtasks simultaneously. More importantly, we introduce a parameter-free panoptic head which solves the panoptic segmentation via pixel-wise classification. It first leverages the logits from the previous two heads and then innovatively expands the representation for enabling prediction of an extra unknown class which helps better resolve the conflicts between semantic and instance segmentation. Additionally, it handles the challenge caused by the varying number of instances and permits back propagation to the bottom modules in an end-to-end manner. Extensive experimental results on Cityscapes, COCO and our internal dataset demonstrate that our UPSNet achieves stateof-the-art performance with much faster inference. Code has been made available at: https://github.com/ uber-research/UPSNet. * Equal contribution.† This work was done when Hengshuang Zhao was an intern at Uber ATG.
translated by 谷歌翻译
在重建掩码的实例分段网络的设计中,分段通常是其文字定义 - 分配每个像素标签。这导致了将问题视为匹配一个问题,其中一个目标是最小化重建和地面真相像素之间的损耗。重新思考重建网络作为发电机,我们定义了预测掩模作为GAN游戏框架的问题:分割网络生成掩码,鉴别器网络决定掩码的质量。为了演示这个游戏,我们对掩模R-CNN的普通分段框架显示了有效修改。我们发现,在特征空间中播放游戏比导致鉴别器和发电机之间的稳定训练的像素空间更有效,应该通过预测对象的上下文区域来替换预测对象坐标,并且整体对抗性损失有助于性能和消除每个不同数据域的任何自定义设置都需要。我们在各个域中测试我们的框架并报告手机回收,自动驾驶,大规模对象检测和医用腺体。我们观察到一般的GANS产生掩模,该掩模占克里克里德界,杂乱,小物体和细节,处于规则形状或异质和聚结形状的领域。我们的再现结果的代码可公开提供。
translated by 谷歌翻译
In object detection, an intersection over union (IoU) threshold is required to define positives and negatives. An object detector, trained with low IoU threshold, e.g. 0.5, usually produces noisy detections. However, detection performance tends to degrade with increasing the IoU thresholds. Two main factors are responsible for this: 1) overfitting during training, due to exponentially vanishing positive samples, and 2) inference-time mismatch between the IoUs for which the detector is optimal and those of the input hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, is proposed to address these problems. It consists of a sequence of detectors trained with increasing IoU thresholds, to be sequentially more selective against close false positives. The detectors are trained stage by stage, leveraging the observation that the output of a detector is a good distribution for training the next higher quality detector. The resampling of progressively improved hypotheses guarantees that all detectors have a positive set of examples of equivalent size, reducing the overfitting problem. The same cascade procedure is applied at inference, enabling a closer match between the hypotheses and the detector quality of each stage. A simple implementation of the Cascade R-CNN is shown to surpass all single-model object detectors on the challenging COCO dataset. Experiments also show that the Cascade R-CNN is widely applicable across detector architectures, achieving consistent gains independently of the baseline detector strength. The code will be made available at https://github.com/zhaoweicai/cascade-rcnn.
translated by 谷歌翻译