物体检测通常需要在现代深度学习方法中基于传统或锚盒的滑动窗口分类器。但是,这些方法中的任何一个都需要框中的繁琐配置。在本文中,我们提供了一种新的透视图,其中检测对象被激励为高电平语义特征检测任务。与边缘,角落,斑点和其他特征探测器一样,所提出的探测器扫描到全部图像的特征点,卷积自然适合该特征点。但是,与这些传统的低级功能不同,所提出的探测器用于更高级别的抽象,即我们正在寻找有物体的中心点,而现代深层模型已经能够具有如此高级别的语义抽象。除了Blob检测之外,我们还预测了中心点的尺度,这也是直接的卷积。因此,在本文中,通过卷积简化了行人和面部检测作为直接的中心和规模预测任务。这样,所提出的方法享有一个无盒设置。虽然结构简单,但它对几个具有挑战性的基准呈现竞争准确性,包括行人检测和面部检测。此外,执行交叉数据集评估,证明所提出的方法的卓越泛化能力。可以访问代码和模型(https://github.com/liuwei16/csp和https://github.com/hasanirtiza/pedestron)。
translated by 谷歌翻译
面部检测是为了在图像中搜索面部的所有可能区域,并且如果有任何情况,则定位面部。包括面部识别,面部表情识别,面部跟踪和头部姿势估计的许多应用假设面部的位置和尺寸在图像中是已知的。近几十年来,研究人员从Viola-Jones脸上检测器创造了许多典型和有效的面部探测器到当前的基于CNN的CNN。然而,随着图像和视频的巨大增加,具有面部刻度的变化,外观,表达,遮挡和姿势,传统的面部探测器被挑战来检测野外面孔的各种“脸部。深度学习技术的出现带来了非凡的检测突破,以及计算的价格相当大的价格。本文介绍了代表性的深度学习的方法,并在准确性和效率方面提出了深度和全面的分析。我们进一步比较并讨论了流行的并挑战数据集及其评估指标。进行了几种成功的基于深度学习的面部探测器的全面比较,以使用两个度量来揭示其效率:拖鞋和延迟。本文可以指导为不同应用选择合适的面部探测器,也可以开发更高效和准确的探测器。
translated by 谷歌翻译
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
translated by 谷歌翻译
两阶段探测器在物体检测和行人检测中是最新的。但是,当前的两个阶段探测器效率低下,因为它们在多个步骤中进行边界回归,即在区域提案网络和边界框头中进行回归。此外,基于锚的区域提案网络在计算上的训练价格很高。我们提出了F2DNET,这是一种新型的两阶段检测体系结构,通过使用我们的焦点检测网络和边界框以我们的快速抑制头替换区域建议网络,从而消除了当前两阶段检测器的冗余。我们在顶级行人检测数据集上进行基准F2DNET,将其与现有的最新检测器进行彻底比较,并进行交叉数据集评估,以测试我们模型对未见数据的普遍性。我们的F2DNET在城市人员,加州理工学院行人和欧元城市人数据集中分别获得8.7 \%,2.2 \%和6.1 \%MR-2,分别在单个数据集上进行培训并达到20.4 \%\%\%和26.2 \%MR-2。使用渐进式微调时,加州理工学院行人和城市人员数据集的重型闭塞设置。此外,与当前的最新时间相比,F2DNET的推理时间明显较小。代码和训练有素的模型将在https://github.com/abdulhannankhan/f2dnet上找到。
translated by 谷歌翻译
We propose a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion, analogue to semantic segmentation. Almost all state-of-the-art object detectors such as RetinaNet, SSD, YOLOv3, and Faster R-CNN rely on pre-defined anchor boxes. In contrast, our proposed detector FCOS is anchor box free, as well as proposal free. By eliminating the predefined set of anchor boxes, FCOS completely avoids the complicated computation related to anchor boxes such as calculating overlapping during training. More importantly, we also avoid all hyper-parameters related to anchor boxes, which are often very sensitive to the final detection performance. With the only post-processing non-maximum suppression (NMS), FCOS with ResNeXt-64x4d-101 achieves 44.7% in AP with single-model and single-scale testing, surpassing previous one-stage detectors with the advantage of being much simpler. For the first time, we demonstrate a much simpler and flexible detection framework achieving improved detection accuracy. We hope that the proposed FCOS framework can serve as a simple and strong alternative for many other instance-level tasks. Code is available at:tinyurl.com/FCOSv1
translated by 谷歌翻译
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
translated by 谷歌翻译
随着深度卷积神经网络的兴起,对象检测在过去几年中取得了突出的进步。但是,这种繁荣无法掩盖小物体检测(SOD)的不令人满意的情况,这是计算机视觉中臭名昭著的挑战性任务之一,这是由于视觉外观不佳和由小目标的内在结构引起的嘈杂表示。此外,用于基准小对象检测方法基准测试的大规模数据集仍然是瓶颈。在本文中,我们首先对小物体检测进行了详尽的审查。然后,为了催化SOD的发展,我们分别构建了两个大规模的小物体检测数据集(SODA),SODA-D和SODA-A,分别集中在驾驶和空中场景上。 SODA-D包括24704个高质量的交通图像和277596个9个类别的实例。对于苏打水,我们收集2510个高分辨率航空图像,并在9个类别上注释800203实例。众所周知,拟议的数据集是有史以来首次尝试使用针对多类SOD量身定制的大量注释实例进行大规模基准测试。最后,我们评估主流方法在苏打水上的性能。我们预计发布的基准可以促进SOD的发展,并产生该领域的更多突破。数据集和代码将很快在:\ url {https://shaunyuan22.github.io/soda}上。
translated by 谷歌翻译
无锚的检测器基本上将对象检测作为密集的分类和回归。对于流行的无锚检测器,通常是引入单个预测分支来估计本地化的质量。当我们深入研究分类和质量估计的实践时,会观察到以下不一致之处。首先,对于某些分配了完全不同标签的相邻样品,训练有素的模型将产生相似的分类分数。这违反了训练目标并导致绩效退化。其次,发现检测到具有较高信心的边界框与相应的地面真相具有较小的重叠。准确的局部边界框将被非最大抑制(NMS)过程中的精确量抑制。为了解决不一致问题,提出了动态平滑标签分配(DSLA)方法。基于最初在FCO中开发的中心概念,提出了平稳的分配策略。在[0,1]中将标签平滑至连续值,以在正样品和负样品之间稳定过渡。联合(IOU)在训练过程中会动态预测,并与平滑标签结合。分配动态平滑标签以监督分类分支。在这样的监督下,质量估计分支自然合并为分类分支,这简化了无锚探测器的体系结构。全面的实验是在MS Coco基准上进行的。已经证明,DSLA可以通过减轻上述无锚固探测器的不一致来显着提高检测准确性。我们的代码在https://github.com/yonghaohe/dsla上发布。
translated by 谷歌翻译
We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that Corner-Net achieves a 42.2% AP on MS COCO, outperforming all existing one-stage detectors.
translated by 谷歌翻译
The detection of human body and its related parts (e.g., face, head or hands) have been intensively studied and greatly improved since the breakthrough of deep CNNs. However, most of these detectors are trained independently, making it a challenging task to associate detected body parts with people. This paper focuses on the problem of joint detection of human body and its corresponding parts. Specifically, we propose a novel extended object representation that integrates the center location offsets of body or its parts, and construct a dense single-stage anchor-based Body-Part Joint Detector (BPJDet). Body-part associations in BPJDet are embedded into the unified representation which contains both the semantic and geometric information. Therefore, BPJDet does not suffer from error-prone association post-matching, and has a better accuracy-speed trade-off. Furthermore, BPJDet can be seamlessly generalized to jointly detect any body part. To verify the effectiveness and superiority of our method, we conduct extensive experiments on the CityPersons, CrowdHuman and BodyHands datasets. The proposed BPJDet detector achieves state-of-the-art association performance on these three benchmarks while maintains high accuracy of detection. Code will be released to facilitate further studies.
translated by 谷歌翻译
大多数最先进的实例级人类解析模型都采用了两阶段的基于锚的探测器,因此无法避免启发式锚盒设计和像素级别缺乏分析。为了解决这两个问题,我们设计了一个实例级人类解析网络,该网络在像素级别上无锚固且可解决。它由两个简单的子网络组成:一个用于边界框预测的无锚检测头和一个用于人体分割的边缘引导解析头。无锚探测器的头继承了像素样的优点,并有效地避免了对象检测应用中证明的超参数的敏感性。通过引入部分感知的边界线索,边缘引导的解析头能够将相邻的人类部分与彼此区分开,最多可在一个人类实例中,甚至重叠的实例。同时,利用了精炼的头部整合盒子级别的分数和部分分析质量,以提高解析结果的质量。在两个多个人类解析数据集(即CIHP和LV-MHP-V2.0)和一个视频实例级人类解析数据集(即VIP)上进行实验,表明我们的方法实现了超过全球级别和实例级别的性能最新的一阶段自上而下的替代方案。
translated by 谷歌翻译
Single-frame InfraRed Small Target (SIRST) detection has been a challenging task due to a lack of inherent characteristics, imprecise bounding box regression, a scarcity of real-world datasets, and sensitive localization evaluation. In this paper, we propose a comprehensive solution to these challenges. First, we find that the existing anchor-free label assignment method is prone to mislabeling small targets as background, leading to their omission by detectors. To overcome this issue, we propose an all-scale pseudo-box-based label assignment scheme that relaxes the constraints on scale and decouples the spatial assignment from the size of the ground-truth target. Second, motivated by the structured prior of feature pyramids, we introduce the one-stage cascade refinement network (OSCAR), which uses the high-level head as soft proposals for the low-level refinement head. This allows OSCAR to process the same target in a cascade coarse-to-fine manner. Finally, we present a new research benchmark for infrared small target detection, consisting of the SIRST-V2 dataset of real-world, high-resolution single-frame targets, the normalized contrast evaluation metric, and the DeepInfrared toolkit for detection. We conduct extensive ablation studies to evaluate the components of OSCAR and compare its performance to state-of-the-art model-driven and data-driven methods on the SIRST-V2 benchmark. Our results demonstrate that a top-down cascade refinement framework can improve the accuracy of infrared small target detection without sacrificing efficiency. The DeepInfrared toolkit, dataset, and trained models are available at https://github.com/YimianDai/open-deepinfrared to advance further research in this field.
translated by 谷歌翻译
近年来,自主驾驶LIDAR数据的3D对象检测一直在迈出卓越的进展。在最先进的方法中,已经证明了将点云进行编码为鸟瞰图(BEV)是有效且有效的。与透视图不同,BEV在物体之间保留丰富的空间和距离信息;虽然在BEV中相同类型的更远物体不会较小,但它们包含稀疏点云特征。这一事实使用共享卷积神经网络削弱了BEV特征提取。为了解决这一挑战,我们提出了范围感知注意网络(RAANET),提取更强大的BEV功能并产生卓越的3D对象检测。范围感知的注意力(RAA)卷曲显着改善了近距离的特征提取。此外,我们提出了一种新的辅助损耗,用于密度估计,以进一步增强覆盖物体的Raanet的检测精度。值得注意的是,我们提出的RAA卷积轻量级,并兼容,以集成到用于BEV检测的任何CNN架构中。 Nuscenes DataSet上的广泛实验表明,我们的提出方法优于基于LIDAR的3D对象检测的最先进的方法,具有16 Hz的实时推断速度,为LITE版本为22 Hz。该代码在匿名GitHub存储库HTTPS://github.com/Anonymous0522 / ange上公开提供。
translated by 谷歌翻译
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A topdown architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art singlemodel results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
translated by 谷歌翻译
深神网络的对象探测器正在不断发展,并用于多种应用程序,每个应用程序都有自己的要求集。尽管关键安全应用需要高准确性和可靠性,但低延迟任务需要资源和节能网络。不断提出了实时探测器,在高影响现实世界中是必需的,但是它们过分强调了准确性和速度的提高,而其他功能(例如多功能性,鲁棒性,资源和能源效率)则被省略。现有网络的参考基准不存在,设计新网络的标准评估指南也不存在,从而导致比较模棱两可和不一致的比较。因此,我们对广泛的数据集进行了多个实时探测器(基于锚点,关键器和变压器)的全面研究,并报告了一系列广泛指标的结果。我们还研究了变量,例如图像大小,锚固尺寸,置信阈值和架构层对整体性能的影响。我们分析了检测网络的鲁棒性,以防止分配变化,自然腐败和对抗性攻击。此外,我们提供了校准分析来评估预测的可靠性。最后,为了强调现实世界的影响,我们对自动驾驶和医疗保健应用进行了两个独特的案例研究。为了进一步衡量关键实时应用程序中网络的能力,我们报告了在Edge设备上部署检测网络后的性能。我们广泛的实证研究可以作为工业界对现有网络做出明智选择的指南。我们还希望激发研究社区的设计和评估网络的新方向,该网络着重于更大而整体的概述,以实现深远的影响。
translated by 谷歌翻译
我们提出对象盒,这是一种新颖的单阶段锚定且高度可推广的对象检测方法。与现有的基于锚固的探测器和无锚的探测器相反,它们更偏向于其标签分配中的特定对象量表,我们仅将对象中心位置用作正样本,并在不同的特征级别中平均处理所有对象,而不论对象'尺寸或形状。具体而言,我们的标签分配策略将对象中心位置视为形状和尺寸不足的锚定,并以无锚固的方式锚定,并允许学习每个对象的所有尺度。为了支持这一点,我们将新的回归目标定义为从中心单元位置的两个角到边界框的四个侧面的距离。此外,为了处理比例变化的对象,我们提出了一个量身定制的损失来处理不同尺寸的盒子。结果,我们提出的对象检测器不需要在数据集中调整任何依赖数据集的超参数。我们在MS-Coco 2017和Pascal VOC 2012数据集上评估了我们的方法,并将我们的结果与最先进的方法进行比较。我们观察到,与先前的作品相比,对象盒的性能优惠。此外,我们执行严格的消融实验来评估我们方法的不同组成部分。我们的代码可在以下网址提供:https://github.com/mohsenzand/objectbox。
translated by 谷歌翻译
Object detection has been dominated by anchor-based detectors for several years. Recently, anchor-free detectors have become popular due to the proposal of FPN and Focal Loss. In this paper, we first point out that the essential difference between anchor-based and anchor-free detection is actually how to define positive and negative training samples, which leads to the performance gap between them. If they adopt the same definition of positive and negative samples during training, there is no obvious difference in the final performance, no matter regressing from a box or a point. This shows that how to select positive and negative training samples is important for current object detectors. Then, we propose an Adaptive Training Sample Selection (ATSS) to automatically select positive and negative samples according to statistical characteristics of object. It significantly improves the performance of anchor-based and anchor-free detectors and bridges the gap between them. Finally, we discuss the necessity of tiling multiple anchors per location on the image to detect objects. Extensive experiments conducted on MS COCO support our aforementioned analysis and conclusions. With the newly introduced ATSS, we improve stateof-the-art detectors by a large margin to 50.7% AP without introducing any overhead. The code is available at https://github.com/sfzhang15/ATSS.
translated by 谷歌翻译
近年来,基于深度学习的面部检测算法取得了长足的进步。这些算法通常可以分为两类,即诸如更快的R-CNN和像Yolo这样的单阶段检测器之类的两个阶段检测器。由于准确性和速度之间的平衡更好,因此在许多应用中广泛使用了一阶段探测器。在本文中,我们提出了一个基于一阶段检测器Yolov5的实时面部检测器,名为Yolo-Facev2。我们设计一个称为RFE的接收场增强模块,以增强小面的接受场,并使用NWD损失来弥补IOU对微小物体的位置偏差的敏感性。对于面部阻塞,我们提出了一个名为Seam的注意模块,并引入了排斥损失以解决它。此外,我们使用重量函数幻灯片来解决简单和硬样品之间的不平衡,并使用有效的接收场的信息来设计锚。宽面数据集上的实验结果表明,在所有简单,中和硬子集中都可以找到我们的面部检测器及其变体的表现及其变体。源代码https://github.com/krasjet-yu/yolo-facev2
translated by 谷歌翻译
Modern object detectors rely heavily on rectangular bounding boxes, such as anchors, proposals and the final predictions, to represent objects at various recognition stages. The bounding box is convenient to use but provides only a coarse localization of objects and leads to a correspondingly coarse extraction of object features. In this paper, we present RepPoints (representative points), a new finer representation of objects as a set of sample points useful for both localization and recognition. Given ground truth localization and recognition targets for training, RepPoints learn to automatically arrange themselves in a manner that bounds the spatial extent of an object and indicates semantically significant local areas. They furthermore do not require the use of anchors to sample a space of bounding boxes. We show that an anchor-free object detector based on RepPoints can be as effective as the state-of-the-art anchor-based detection methods, with 46.5 AP and 67.4 AP 50 on the COCO test-dev detection benchmark, using ResNet-101 model. Code is available at https://github.com/microsoft/RepPoints.
translated by 谷歌翻译
Accurate whole-body multi-person pose estimation and tracking is an important yet challenging topic in computer vision. To capture the subtle actions of humans for complex behavior analysis, whole-body pose estimation including the face, body, hand and foot is essential over conventional body-only pose estimation. In this paper, we present AlphaPose, a system that can perform accurate whole-body pose estimation and tracking jointly while running in realtime. To this end, we propose several new techniques: Symmetric Integral Keypoint Regression (SIKR) for fast and fine localization, Parametric Pose Non-Maximum-Suppression (P-NMS) for eliminating redundant human detections and Pose Aware Identity Embedding for jointly pose estimation and tracking. During training, we resort to Part-Guided Proposal Generator (PGPG) and multi-domain knowledge distillation to further improve the accuracy. Our method is able to localize whole-body keypoints accurately and tracks humans simultaneously given inaccurate bounding boxes and redundant detections. We show a significant improvement over current state-of-the-art methods in both speed and accuracy on COCO-wholebody, COCO, PoseTrack, and our proposed Halpe-FullBody pose estimation dataset. Our model, source codes and dataset are made publicly available at https://github.com/MVIG-SJTU/AlphaPose.
translated by 谷歌翻译