In this paper, we aim to design an efficient real-time object detector that exceeds the YOLO series and is easily extensible for many object recognition tasks such as instance segmentation and rotated object detection. To obtain a more efficient model architecture, we explore an architecture that has compatible capacities in the backbone and neck, constructed by a basic building block that consists of large-kernel depth-wise convolutions. We further introduce soft labels when calculating matching costs in the dynamic label assignment to improve accuracy. Together with better training techniques, the resulting object detector, named RTMDet, achieves 52.8% AP on COCO with 300+ FPS on an NVIDIA 3090 GPU, outperforming the current mainstream industrial detectors. RTMDet achieves the best parameter-accuracy trade-off with tiny/small/medium/large/extra-large model sizes for various application scenarios, and obtains new state-of-the-art performance on real-time instance segmentation and rotated object detection. We hope the experimental results can provide new insights into designing versatile real-time object detectors for many object recognition tasks. Code and models are released at https://github.com/open-mmlab/mmdetection/tree/3.x/configs/rtmdet.
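As a rough illustration of why large-kernel depth-wise convolutions can stay cheap, the following Python sketch compares weight counts for a standard convolution and a depth-wise one. The channel count and kernel sizes are hypothetical values chosen for the arithmetic; they are not taken from the abstract, and bias terms are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k


def depthwise_conv_params(channels, k):
    """Weights in a depth-wise k x k convolution: one k x k filter per channel."""
    return channels * k * k


# Hypothetical layer width and kernel sizes, for illustration only.
channels = 256
standard_3x3 = conv_params(channels, channels, 3)
depthwise_5x5 = depthwise_conv_params(channels, 5)
print(standard_3x3, depthwise_5x5)
```

Under these assumptions the depth-wise layer uses far fewer weights than the standard one even though its kernel is larger, which is the kind of trade-off a block built on large-kernel depth-wise convolutions can exploit.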
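For years, the YOLO series has been the de facto industry-level standard for efficient object detection. The YOLO community has prospered overwhelmingly, enriching its use across a multitude of hardware platforms and abundant scenarios. In this technical report, we strive to push its limits to the next level, stepping forward with an unwavering mindset for industry application. Considering the diverse requirements for speed and accuracy in real environments, we extensively examine up-to-date object detection advancements from both industry and academia. Specifically, we heavily assimilate ideas from recent network design, training strategies, testing techniques, quantization, and optimization methods. On top of this, we integrate our thoughts and practice to build a suite of deployment-ready networks at various scales to accommodate diversified use cases. With the generous permission of the YOLO authors, we name it YOLOv6. We also warmly welcome users and contributors for further enhancement. For a glimpse of performance, our YOLOv6-N hits 35.9% AP on the COCO dataset at a throughput of 1234 FPS on an NVIDIA Tesla T4 GPU. YOLOv6-S strikes 43.5% AP at 495 FPS, outperforming other mainstream detectors at the same scale (YOLOv5-S, YOLOX-S, and PPYOLOE-S). Our quantized version of YOLOv6-S even delivers 43.3% AP at 869 FPS. Furthermore, YOLOv6-M/L achieves better accuracy (i.e., 49.5%/52.3%) than other detectors with similar inference speed. We carefully conducted experiments to validate the effectiveness of each component. Our code is available at https://github.com/meituan/yolov6.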
In this report, we present a fast and accurate object detection method dubbed DAMO-YOLO, which achieves higher performance than the state-of-the-art YOLO series. DAMO-YOLO is extended from YOLO with some new technologies, including Neural Architecture Search (NAS), efficient Reparameterized Generalized-FPN (RepGFPN), a lightweight head with AlignedOTA label assignment, and distillation enhancement. In particular, we use MAE-NAS, a method guided by the principle of maximum entropy, to search our detection backbone under the constraints of low latency and high performance, producing ResNet-like / CSP-like structures with spatial pyramid pooling and focus modules. In the design of necks and heads, we follow the rule of "large neck, small head". We import Generalized-FPN with accelerated queen-fusion to build the detector neck and upgrade its CSPNet with efficient layer aggregation networks (ELAN) and reparameterization. Then we investigate how detector head size affects detection performance and find that a heavy neck with only one task projection layer would yield better results. In addition, AlignedOTA is proposed to solve the misalignment problem in label assignment. And a distillation schema is introduced to improve performance to a higher level. Based on these new techs, we build a suite of models at various scales to meet the needs of different scenarios, i.e., DAMO-YOLO-Tiny/Small/Medium. They can achieve 43.0/46.8/50.0 mAPs on COCO with the latency of 2.78/3.83/5.62 ms on T4 GPUs respectively. The code is available at https://github.com/tinyvision/damo-yolo.
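In the past decade, object detection has achieved significant progress in natural images but not in aerial images, due to the massive variations in the scale and orientation of objects caused by the bird's-eye view of aerial images. More importantly, the lack of large-scale benchmarks has become a major obstacle to the development of object detection in aerial images (ODAI). In this paper, we present a large-scale Dataset of Object deTection in Aerial images (DOTA) and comprehensive baselines for ODAI. The proposed DOTA dataset contains 1,793,658 object instances of 18 categories with oriented-bounding-box annotations, collected from 11,268 aerial images. Based on this large-scale and well-annotated dataset, we build baselines covering 10 state-of-the-art algorithms with over 70 configurations, where the speed and accuracy of each model have been evaluated. Furthermore, we provide a code library for ODAI and build a website for evaluating different algorithms. Previous challenges run on DOTA have attracted more than 1300 teams worldwide. We believe that the expanded large-scale DOTA dataset, the extensive baselines, the code library, and the challenges can facilitate the design of robust algorithms and reproducible research on the problem of object detection in aerial images.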
In this paper, we introduce an anchor-box free and single shot instance segmentation method, which is conceptually simple, fully convolutional and can be used by easily embedding it into most off-the-shelf detection methods. Our method, termed PolarMask, formulates the instance segmentation problem as predicting the contour of an instance through instance center classification and dense distance regression in polar coordinates. Moreover, we propose two effective approaches to deal with sampling high-quality center examples and optimization for dense distance regression, respectively, which can significantly improve the performance and simplify the training process. Without any bells and whistles, PolarMask achieves 32.9% in mask mAP with single-model and single-scale training/testing on the challenging COCO dataset. For the first time, we show that the complexity of instance segmentation, in terms of both design and computation complexity, can be the same as bounding box object detection and that this much simpler and more flexible instance segmentation framework can achieve competitive accuracy. We hope that the proposed PolarMask framework can serve as a fundamental and strong baseline for the single shot instance segmentation task. Code is available at: github.com/xieenze/PolarMask.
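The polar contour representation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that a contour is encoded as an instance center plus one distance per ray and that rays are evenly spaced over 360 degrees; the function name `polar_to_contour` and the sample values are hypothetical and are not taken from the PolarMask code base.

```python
import math


def polar_to_contour(center, distances):
    """Convert a center and per-ray distances into contour (x, y) points.

    Assumption: rays are evenly spaced over 360 degrees, starting at angle 0.
    """
    cx, cy = center
    n = len(distances)
    points = []
    for i, d in enumerate(distances):
        theta = 2.0 * math.pi * i / n
        points.append((cx + d * math.cos(theta), cy + d * math.sin(theta)))
    return points


# Hypothetical example: four rays of length 2 around the center (10, 10).
contour = polar_to_contour((10.0, 10.0), [2.0, 2.0, 2.0, 2.0])
```

With this encoding, a network only needs to classify center locations and regress a fixed-length vector of distances, which is what makes the formulation resemble bounding-box regression.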
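Perception models for autonomous driving require fast inference within low latency. While existing works ignore the inevitable environmental changes that occur after processing, streaming perception jointly evaluates latency and accuracy in a single metric for online video perception, guiding previous works to search for trade-offs between accuracy and speed. In this paper, we explore the performance of real-time models on this metric and endow the models with the capacity to predict the future, significantly improving the results for streaming perception. Specifically, we build a simple framework with two effective modules. One is a Dual Flow Perception module (DFP). It consists of a dynamic flow and a static flow in parallel to capture the moving tendency and basic detection features, respectively. Trend Aware Loss (TAL) is the other module, which adaptively generates a loss weight for each object according to its moving speed. Realistically, we consider driving scenes with multiple velocities and further propose Velocity-aware streaming AP (VsAP) to jointly evaluate accuracy. In this realistic setting, we design an efficient mixed-velocity training strategy to guide the detector to perceive any velocity. Our simple method achieves state-of-the-art performance on the Argoverse-HD dataset and improves sAP and VsAP by 4.7% and 8.2%, respectively, compared to the strong baseline, validating its effectiveness.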
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
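The top-down pathway with lateral connections can be sketched on toy data in plain Python. This is a simplified illustration under stated assumptions: feature maps are single-channel 2D lists, each level is exactly half the size of the previous one, a scalar weight stands in for the 1x1 lateral convolution, and the smoothing convolution usually applied after merging is omitted. The names `upsample2x` and `top_down_merge` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(bottom_up, lateral_weights):
    """Build pyramid levels from the coarsest map down to the finest.

    bottom_up: 2D maps ordered fine -> coarse, each half the size of the previous.
    lateral_weights: one scalar per level, standing in for a 1x1 lateral conv.
    """
    laterals = [[[w * v for v in row] for row in fmap]
                for fmap, w in zip(bottom_up, lateral_weights)]
    pyramid = [laterals[-1]]  # the coarsest level starts the top-down pathway
    for lat in reversed(laterals[:-1]):
        up = upsample2x(pyramid[0])
        merged = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lat, up)]
        pyramid.insert(0, merged)
    return pyramid  # ordered fine -> coarse


# Toy bottom-up features at three scales (hypothetical values).
c2 = [[1] * 4 for _ in range(4)]
c3 = [[2] * 2 for _ in range(2)]
c4 = [[3]]
p2, p3, p4 = top_down_merge([c2, c3, c4], [1, 1, 1])
```

Each finer level receives the upsampled coarser level added to its own lateral features, so semantic information from the top of the network reaches every scale at the cost of one upsampling and one addition per level.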
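Most state-of-the-art instance-level human parsing models adopt two-stage anchor-based detectors and therefore cannot avoid heuristic anchor box design and the lack of pixel-level analysis. To address these two issues, we design an instance-level human parsing network that is anchor-free and solvable at the pixel level. It consists of two simple sub-networks: an anchor-free detection head for bounding box prediction and an edge-guided parsing head for human segmentation. The anchor-free detection head inherits pixel-wise merits and effectively avoids the sensitivity to hyper-parameters, as has been shown in object detection applications. By introducing part-aware boundary clues, the edge-guided parsing head is able to distinguish adjacent human parts from one another within a single human instance, even for overlapping instances. Meanwhile, a refinement head that integrates the box-level score and part-level parsing quality is exploited to improve the quality of the parsing results. Experiments on two multiple-human parsing datasets (i.e., CIHP and LV-MHP-v2.0) and one video instance-level human parsing dataset (i.e., VIP) show that our method achieves the best global-level and instance-level performance over state-of-the-art one-stage top-down alternatives.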
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
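The effect of the IoU threshold on the number of positive samples can be illustrated with a small Python sketch. The boxes and the thresholds 0.5, 0.6, and 0.7 below are illustrative values, and the helper names `iou` and `count_positives` are hypothetical; the sketch shows only the thresholding step, not the cascade training procedure itself.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_positives(proposals, gt_box, threshold):
    """Count proposals whose IoU with the ground-truth box reaches the threshold."""
    return sum(1 for p in proposals if iou(p, gt_box) >= threshold)


# One ground-truth box and four proposals shifted progressively to the right.
gt = (0.0, 0.0, 10.0, 10.0)
proposals = [
    (0.0, 0.0, 10.0, 10.0),
    (1.0, 0.0, 11.0, 10.0),
    (3.0, 0.0, 13.0, 10.0),
    (5.0, 0.0, 15.0, 10.0),
]
for t in (0.5, 0.6, 0.7):
    print(t, count_positives(proposals, gt, t))
```

Raising the threshold shrinks the positive set for a fixed pool of proposals, which is the overfitting risk the abstract mentions; resampling proposals from the previous stage's output is how the cascade keeps the positive set from vanishing.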
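Detection transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer significant performance drops on small-size datasets such as Cityscapes. In other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply altering how the key and value sequences are constructed in the cross-attention layer, with minimal modifications to the original models. In addition, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improves their performance on both small-size and sample-rich datasets. Code will be made publicly available at https://github.com/encounter1997/de-detrs.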
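Existing instance segmentation methods have achieved impressive performance but still suffer from a common dilemma: redundant representations (e.g., multiple boxes, grids, and anchor points) are inferred for one instance, which leads to multiple duplicated predictions. Thus, mainstream methods usually rely on a hand-designed non-maximum suppression (NMS) post-processing step to select the optimal prediction, which hinders end-to-end training. To address this issue, we propose a box-free and NMS-free end-to-end instance segmentation framework, termed UniInst, that yields only one unique representation for each instance. Specifically, we design an instance-aware one-to-one assignment scheme, namely Only Yield One Representation (OYOR), which dynamically assigns one unique representation to each instance according to the matching quality between predictions and ground truths. Then, a novel prediction re-ranking strategy is integrated into the framework to address the misalignment between the classification score and the mask quality, making the learned representation more discriminative. With these techniques, our UniInst, the first FCN-based box-free and NMS-free instance segmentation framework, achieves competitive performance with ResNet-50-FPN and reaches 40.2 mask AP with ResNet-101-FPN, against mainstream methods on COCO test-dev. Moreover, the proposed instance-aware method is robust to occlusion scenes, outperforming common baselines by a remarkable mask AP margin on the heavily occluded OCHuman benchmark. Our code will be available upon publication.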
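For context, the NMS post-processing step that this abstract aims to remove can be sketched as follows. This is a generic greedy NMS on axis-aligned boxes, written as a minimal illustration; it is not code from UniInst, and the function names, the default threshold of 0.5, and the sample boxes are assumptions.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap any already-kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate predictions of one object plus one separate object.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

The sorting and the hard threshold are hand-designed and not differentiable, which is why a one-to-one assignment that yields a single prediction per instance allows the whole pipeline to be trained end to end.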
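Object detection has made tremendous strides in computer vision. Detecting small objects with degraded appearance is a prominent challenge, especially for aerial observations. To collect sufficient positive/negative samples for heuristic training, most object detectors preset region anchors in order to calculate the Intersection-over-Union (IoU) against the ground-truth data. In this case, small objects are frequently abandoned or mislabeled. In this paper, we present an effective Dynamic Enhancement Anchor (DEA) network to construct a novel training sample generator. Unlike other state-of-the-art techniques, the proposed network leverages a sample discriminator to realize interactive sample screening between an anchor-based unit and an anchor-free unit to generate eligible samples. In addition, multi-task joint training with a conservative anchor-based inference scheme enhances the performance of the proposed model while reducing computational complexity. The proposed scheme supports both oriented and horizontal object detection tasks. Extensive experiments on two challenging aerial benchmarks (i.e., DOTA and HRSC2016) indicate that our method achieves state-of-the-art accuracy with moderate inference speed and computational overhead for training. On DOTA, our DEA-Net integrated with the RoI-Transformer baseline surpasses the advanced method by 0.40% mean Average Precision (mAP) for oriented object detection with a weaker backbone network (ResNet-101 vs. ResNet-152) and by 3.08% mAP for horizontal object detection with the same backbone. In addition, our DEA-Net integrated with the ReDet baseline achieves state-of-the-art performance of 80.37%. On HRSC2016, it surpasses the previous best model by 1.1% using only 3 horizontal anchors.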
Arbitrary-oriented object detection is a fundamental task in visual scenes involving aerial images and scene text. In this report, we present PP-YOLOE-R, an efficient anchor-free rotated object detector based on PP-YOLOE. We introduce a bag of useful tricks in PP-YOLOE-R to improve detection precision with marginal extra parameters and computational cost. As a result, PP-YOLOE-R-l and PP-YOLOE-R-x achieve 78.14 and 78.28 mAP respectively on DOTA 1.0 dataset with single-scale training and testing, which outperform almost all other rotated object detectors. With multi-scale training and testing, PP-YOLOE-R-l and PP-YOLOE-R-x further improve the detection precision to 80.02 and 80.73 mAP. In this case, PP-YOLOE-R-x surpasses all anchor-free methods and demonstrates competitive performance to state-of-the-art anchor-based two-stage models. Further, PP-YOLOE-R is deployment friendly and PP-YOLOE-R-s/m/l/x can reach 69.8/55.1/48.3/37.1 FPS respectively on RTX 2080 Ti with TensorRT and FP16-precision. Source code and pre-trained models are available at https://github.com/PaddlePaddle/PaddleDetection, which is powered by https://github.com/PaddlePaddle/Paddle.
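To make the notion of an arbitrarily oriented box concrete, the Python sketch below converts a rotated box into its four corner points. It assumes a (center x, center y, width, height, angle) parameterization with the angle in radians and rotation counter-clockwise in a y-up coordinate system; this convention and the function name `rotated_box_corners` are assumptions for illustration, and angle conventions differ between datasets and toolkits.

```python
import math


def rotated_box_corners(cx, cy, w, h, angle_rad):
    """Return the four corners of a w x h box rotated by angle_rad about its center.

    Assumption: counter-clockwise rotation in a y-up coordinate system.
    """
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    half_w, half_h = w / 2.0, h / 2.0
    corners = []
    for dx, dy in ((-half_w, -half_h), (half_w, -half_h),
                   (half_w, half_h), (-half_w, half_h)):
        # Rotate the corner offset, then translate it to the box center.
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners


# Hypothetical box: 4 wide, 2 high, centered at the origin, rotated 90 degrees.
corners = rotated_box_corners(0.0, 0.0, 4.0, 2.0, math.pi / 2)
```

A horizontal detector predicts only the first four numbers; a rotated detector adds the angle, which lets the box fit elongated objects such as ships or text lines more tightly.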
Recently, diffusion frameworks have achieved comparable performance with previous state-of-the-art image generation models. Researchers are curious about its variants in discriminative tasks because of its powerful noise-to-image denoising pipeline. This paper proposes DiffusionInst, a novel framework that represents instances as instance-aware filters and formulates instance segmentation as a noise-to-filter denoising process. The model is trained to reverse the noisy groundtruth without any inductive bias from RPN. During inference, it takes a randomly generated filter as input and outputs mask in one-step or multi-step denoising. Extensive experimental results on COCO and LVIS show that DiffusionInst achieves competitive performance compared to existing instance segmentation models. We hope our work could serve as a simple yet effective baseline, which could inspire designing more efficient diffusion frameworks for challenging discriminative tasks. Our code is available in https://github.com/chenhaoxing/DiffusionInst.
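Existing anchor-based oriented object detection methods have achieved amazing results, but these methods require manually preset boxes, which introduce additional hyperparameters and computation. Existing anchor-free methods usually have complex architectures and are not easy to deploy. Our goal is to propose a simple and easy-to-deploy algorithm for aerial image detection. In this paper, we present a one-stage anchor-free rotated object detector (FCOSR) based on FCOS, which can be deployed on most platforms. FCOSR has a simple architecture consisting of convolution layers. Our work focuses on the label assignment strategy for the training phase. We use an ellipse center sampling method to define a suitable sampling region for an oriented bounding box (OBB). A fuzzy sample assignment strategy provides reasonable labels for overlapping objects. To solve the insufficient sampling problem, a multi-level sampling module is designed. These strategies allocate more appropriate labels to training samples. Our algorithm achieves 79.25, 75.41, and 90.15 mAP on the DOTA1.0, DOTA1.5, and HRSC2016 datasets, respectively. FCOSR demonstrates superior performance to other methods in single-scale evaluation. We convert a lightweight FCOSR model to TensorRT format, which achieves 73.93 mAP on DOTA1.0 at a speed of 10.68 FPS on a Jetson Xavier NX. The code is available at: https://github.com/lzh420202/fcosr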
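Semantic, instance, and panoptic segmentation have been addressed using different and specialized frameworks despite their underlying connections. This paper presents a unified, simple, and effective framework for these essentially similar tasks. The framework, named K-Net, segments both instances and semantic categories consistently with a group of learnable kernels, where each kernel is responsible for generating a mask for either a potential instance or a stuff class. To remedy the difficulty of distinguishing various instances, we propose a kernel update strategy that makes each kernel dynamic and conditional on its meaningful group in the input image. K-Net can be trained in an end-to-end manner with bipartite matching, and its training and inference are naturally NMS-free and box-free. Without bells and whistles, K-Net surpasses all previously published state-of-the-art single-model results of panoptic segmentation on the MS COCO test-dev split and semantic segmentation on the ADE20K val split, with 55.2% PQ and 54.3% mIoU, respectively. Its instance segmentation performance is also on par with Cascade Mask R-CNN on MS COCO, with 60%-90% faster inference speeds. Code and models will be released at https://github.com/zwwwayne/k-net/.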
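Real-time object detection on unmanned aerial vehicles (UAVs) is a challenging problem because edge GPU devices, acting as Internet of Things (IoT) nodes, have limited computing resources. To solve this problem, in this paper we propose a novel lightweight deep learning architecture based on the YOLOX model for real-time object detection on edge GPUs. First, we design an effective and lightweight PixSF head to replace the original head of YOLOX to better detect small objects; it can be further combined with depthwise separable convolution (DS Conv) to obtain an even lighter head. Then, a slimmer structure in the neck layer is developed to reduce the number of network parameters, as a trade-off between accuracy and speed. Furthermore, we embed an attention module in the head layer to improve the feature extraction of the prediction head. Meanwhile, we also improve the label assignment strategy and loss function to alleviate the category imbalance and box optimization problems of UAV datasets. Finally, auxiliary heads are introduced for online distillation to improve the position embedding and feature extraction abilities of the PixSF head. The performance of our lightweight models is validated experimentally on the NVIDIA Jetson NX and Jetson Nano GPU embedded platforms. Extensive experiments show that the FasterX models achieve a better trade-off between accuracy and latency on the VisDrone2021 dataset compared to current models.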
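Achieving a better trade-off between accuracy and efficiency is a challenging problem in object detection. In this work, we study key optimizations and neural network architecture choices for object detection to improve accuracy and efficiency. We investigate the applicability of the anchor-free strategy to lightweight object detection models. We enhance the backbone structure and design a lightweight neck structure, which improves the feature extraction ability of the network. We improve the label assignment strategy and loss function to make training more stable and efficient. Through these optimizations, we create a new family of real-time object detectors, named PP-PicoDet, which achieves superior performance on object detection for mobile devices. Our models achieve better trade-offs between accuracy and latency compared to other popular models. PicoDet-S with only 0.99M parameters achieves 30.6% mAP, an absolute 4.8% improvement in mAP while reducing mobile CPU inference latency by 55% compared to YOLOX-Nano, and an absolute 7.1% improvement in mAP compared to NanoDet. It reaches 123 FPS (using Paddle Lite) on a mobile ARM CPU when the input size is 320. PicoDet-L with only 3.3M parameters achieves 40.9% mAP, an absolute 3.7% improvement in mAP, and is 44% faster than YOLOv5s. As shown in Figure 1, our models far outperform the state-of-the-art results for lightweight object detection. Code and pre-trained models are available at https://github.com/PaddlePaddle/PaddleDetection.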
In this report, we present PP-YOLOE, an industrial state-of-the-art object detector with high performance and friendly deployment. We optimize on the basis of the previous PP-YOLOv2, using anchor-free paradigm, more powerful backbone and neck equipped with CSPRepResStage, ET-head and dynamic label assignment algorithm TAL. We provide s/m/l/x models for different practice scenarios. As a result, PP-YOLOE-l achieves 51.4 mAP on COCO test-dev and 78.1 FPS on Tesla V100, yielding a remarkable improvement of (+1.9 AP, +13.35% speed up) and (+1.3 AP, +24.96% speed up), compared to the previous state-of-the-art industrial models PP-YOLOv2 and YOLOX respectively. Further, PP-YOLOE inference speed achieves 149.2 FPS with TensorRT and FP16-precision. We also conduct extensive experiments to verify the effectiveness of our designs. Source code and pre-trained models are available at https://github.com/PaddlePaddle/PaddleDetection.
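Object detectors based on deep neural networks are continuously evolving and are used in a multitude of applications, each with its own set of requirements. While safety-critical applications need high accuracy and reliability, low-latency tasks need resource-efficient and energy-efficient networks. Real-time detectors, which are a necessity in high-impact real-world applications, are continuously proposed, but they overemphasize improvements in accuracy and speed while other capabilities such as versatility, robustness, and resource and energy efficiency are omitted. No reference benchmark exists for existing networks, nor does a standard evaluation guideline for designing new networks, which results in ambiguous and inconsistent comparisons. We therefore conduct a comprehensive study of multiple real-time detectors (anchor-based, keypoint-based, and transformer-based) on a wide range of datasets and report results on an extensive set of metrics. We also study the impact of variables such as image size, anchor dimensions, confidence thresholds, and architecture layers on overall performance. We analyze the robustness of detection networks against distribution shifts, natural corruptions, and adversarial attacks. In addition, we provide a calibration analysis to gauge the reliability of the predictions. Finally, to highlight the real-world impact, we conduct two unique case studies on autonomous driving and healthcare applications. To further gauge the capability of networks in critical real-time applications, we report performance after deploying the detection networks on edge devices. Our extensive empirical study can act as a guideline for the industrial community to make an informed choice among existing networks. We also hope to inspire the research community towards a new direction in the design and evaluation of networks, one that focuses on a bigger and more holistic overview for far-reaching impact.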