Object detection with multimodal inputs can improve many safety-critical systems such as autonomous vehicles (AVs). Motivated by AVs that operate in both day and night, we study multimodal object detection with RGB and thermal cameras, since the latter provide much stronger object signatures under poor illumination. We explore strategies for fusing information from different modalities. Our key contribution is ProbEn, a probabilistic ensembling technique: a simple non-learned method that fuses detections across modalities. We derive ProbEn from Bayes' rule and first principles that assume conditional independence across modalities. Through probabilistic marginalization, ProbEn elegantly handles missing modalities when detectors do not fire on the same object. Importantly, ProbEn significantly improves multimodal detection even when the conditional-independence assumption does not hold, e.g., when fusing the outputs of other fusion methods, both off-the-shelf and newly trained. We validate ProbEn on two benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal images, showing that it outperforms prior work by more than 13% in relative performance!
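As a rough illustration of the fusion rule described above, the sketch below composes per-modality class posteriors with Bayes' rule under the conditional-independence assumption; the uniform prior, two-class layout, and function name are illustrative, not the paper's implementation.

```python
import numpy as np

def proben_fuse(posteriors, prior):
    # Bayes' rule under conditional independence across modalities:
    #   p(y | x_1..x_M) ∝ prior(y) * Π_i [ p(y | x_i) / prior(y) ]
    # A detection missing a modality simply omits that factor, which is
    # the probabilistic marginalization the abstract mentions.
    log_fused = np.log(prior)
    for p in posteriors:
        log_fused += np.log(p) - np.log(prior)
    fused = np.exp(log_fused)
    return fused / fused.sum()  # renormalize over classes

prior = np.array([0.5, 0.5])               # assumed uniform prior over {person, background}
rgb = np.array([0.7, 0.3])                 # p(y | x_rgb)
thermal = np.array([0.8, 0.2])             # p(y | x_thermal)
print(proben_fuse([rgb, thermal], prior))  # agreement -> fused score above either input
print(proben_fuse([thermal], prior))       # thermal-only detection still scored sensibly
```

Note how two moderately confident detections of the same object fuse into a higher confidence, which is the ensembling behavior the abstract credits for the gains.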
Pedestrian detection is among the most safety-critical modules of an autonomous driving system. While cameras are commonly used for this purpose, their image quality degrades severely in low-light nighttime driving scenarios. The quality of thermal camera images, on the other hand, remains unaffected under similar conditions. This paper proposes an end-to-end multimodal fusion model for pedestrian detection using RGB and thermal images. Its novel spatio-contextual deep network architecture is able to exploit the multimodal input effectively. It consists of two distinct deformable ResNeXt-50 encoders for feature extraction from the two modalities. Fusion of the two encoded features takes place inside a multimodal feature embedding module (MuFEM) consisting of several graph attention networks and a feature fusion unit. The output of the last feature fusion unit of MuFEM is subsequently passed to two CRFs for spatial refinement. Further enhancement of the features is achieved by applying channel-wise attention and extracting contextual information with the help of four RNNs traversing in four different directions. Finally, a single-stage decoder uses these feature maps to generate a bounding box for each pedestrian and a score map. We conducted extensive experiments on three publicly available multimodal pedestrian detection benchmark datasets, namely KAIST, CVC-14, and UTokyo, and the results improve the respective state-of-the-art performance on each of them. A short video with an overview of the qualitative results can be seen at https://youtu.be/fdjdsifuucs. Our source code will be released upon publication of the paper.
Cross-modality fusion of the complementary information in multispectral remote sensing image pairs can improve the perception ability of detection algorithms, making them more robust and reliable over a wider range of applications, such as nighttime detection. Compared with previous methods, we argue that different features should be processed specifically: modality-specific features should be preserved and enhanced, while modality-shared features should be selected from the RGB and thermal IR modalities. Following this idea, we propose a novel and lightweight multispectral feature-fusion approach with joint common-modality and differential-modality attention, named Cross-Modality Attention Feature Fusion (CMAFF). Given the intermediate feature maps of RGB and IR images, our module infers attention maps in parallel from two separate modalities, common and differential, and the attention maps are then multiplied with the input features for adaptive feature enhancement or selection. Extensive experiments demonstrate that our proposed approach can achieve state-of-the-art performance at a low computational cost.
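A minimal sketch of the two-branch idea described above, assuming a squeeze-and-excitation style channel attention; the exact pooling, layer sizes, and the way the two branches are recombined are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Squeeze-and-excitation style channel attention over a feature map.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))

    def forward(self, f):                                  # f: (B, C, H, W)
        w = torch.sigmoid(self.mlp(f.mean(dim=(2, 3))))    # per-channel weights
        return f * w[:, :, None, None]

class CMAFFSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.att_common = ChannelAttention(channels)
        self.att_diff = ChannelAttention(channels)

    def forward(self, f_rgb, f_ir):
        common = 0.5 * (f_rgb + f_ir)   # modality-shared information
        diff = f_rgb - f_ir             # modality-specific information
        # attention is inferred from each branch in parallel and used to
        # adaptively enhance (common) or select (differential) features
        return self.att_common(common) + self.att_diff(diff)

fused = CMAFFSketch(256)(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
print(fused.shape)  # torch.Size([1, 256, 32, 32])
```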
The 1$^{\text{st}}$ Workshop on Maritime Computer Vision (MaCVi) 2023 focused on maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicle (USV), and organized several subchallenges in this domain: (i) UAV-based Maritime Object Detection, (ii) UAV-based Maritime Object Tracking, (iii) USV-based Maritime Obstacle Segmentation and (iv) USV-based Maritime Obstacle Detection. The subchallenges were based on the SeaDronesSee and MODS benchmarks. This report summarizes the main findings of the individual subchallenges and introduces a new benchmark, called SeaDronesSee Object Detection v2, which extends the previous benchmark by including more classes and footage. We provide statistical and qualitative analyses, and assess trends in the best-performing methodologies of over 130 submissions. The methods are summarized in the appendix. The datasets, evaluation code and the leaderboard are publicly available at https://seadronessee.cs.uni-tuebingen.de/macvi.
We tackle the problem of learning object detectors without supervision. Unlike weakly supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound-source localization, it is considerably harder because the detector must classify objects by type, enumerate each instance of an object, and do so even when the object is silent. We tackle this problem by first designing a self-supervised framework with a contrastive objective that jointly learns to classify and localize objects. Then, without using any supervision, we simply use these self-supervised labels and boxes to train an image-based object detector. As a result, we outperform previous unsupervised and weakly supervised detectors on the tasks of object detection and sound-source localization. We also show that we can align this detector to ground-truth classes with as little as one label per pseudo-class, and show how our method can learn to detect generic objects beyond instruments, such as airplanes and cats.
Vanilla models for object detection and instance segmentation suffer from a heavy bias toward detecting frequent objects in long-tailed settings. Existing methods address this issue mostly during training, e.g., by re-sampling or re-weighting. In this paper, we investigate a largely overlooked approach: post-processing calibration of confidence scores. We propose NorCal, Normalized Calibration for long-tailed object detection and instance segmentation, a simple and straightforward recipe that reweighs the predicted scores of each class by its training sample size. We show that separately handling the background class and normalizing the scores over classes for each proposal are keys to achieving superior performance. On the LVIS dataset, NorCal effectively improves nearly all baseline models not only on rare classes but also on common and frequent classes. Finally, we conduct extensive analysis and ablation studies to offer insights into the various modeling choices and mechanisms of our approach. Our code is publicly available at https://github.com/tydpan/norcal/.
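The recipe lends itself to a few lines of post-processing. The sketch below follows the abstract's description: down-weight each foreground class score by its training sample size, handle background separately, and renormalize per proposal. The exponent gamma, the index layout, and the exact renormalization are assumptions rather than the paper's precise formula.

```python
import numpy as np

def norcal(scores, class_counts, gamma=1.0):
    # Calibrate per-proposal class scores by training-class size: frequent
    # classes are penalized, the background score is left untouched, and the
    # foreground mass is renormalized so the vector still sums to one.
    bg = scores[0]                        # assume index 0 is background
    fg = scores[1:] / class_counts ** gamma
    fg = fg / fg.sum() * (1.0 - bg)       # renormalize foreground mass
    return np.concatenate([[bg], fg])

scores = np.array([0.10, 0.60, 0.30])    # [background, frequent class, rare class]
counts = np.array([10000, 10])           # training samples per foreground class
print(norcal(scores, counts))            # the rare class now dominates
```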
Figure 1: Results obtained from our single image, monocular 3D object detection network MonoDIS on a KITTI3D test image with corresponding bird's-eye view, showing its ability to estimate size and orientation of objects at different scales.
Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We find that, when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking only a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialize, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet-50 approaches real-time performance on a single GPU.
Lidar-based sensing drives current autonomous vehicles. Despite rapid progress, current lidar sensors still lag two decades behind traditional color cameras in terms of resolution and cost. For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one or two measurements. This is a problem, especially when these objects turn out to be driving hazards. On the other hand, these same objects are clearly visible in onboard RGB sensors. In this work, we present an approach to seamlessly fuse RGB sensors into lidar-based 3D recognition. Our approach takes a set of 2D detections to generate dense 3D virtual points that augment an otherwise sparse 3D point cloud. These virtual points naturally integrate into any standard lidar-based 3D detector alongside regular lidar measurements. The resulting multimodal detector is simple and effective. Experimental results on the large-scale nuScenes dataset show that our framework improves a strong CenterPoint baseline by a significant 6.6 mAP and outperforms competing fusion approaches. Code and more visualizations are available at https://tianweiy.github.io/mvp/
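To make the virtual-point idea concrete, here is a hedged sketch: sample pixels inside a 2D detection and lift each one to 3D using the depth of the nearest lidar point projected into the image. The sampling count, nearest-neighbor depth transfer, and array layout are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def virtual_points(mask_pixels, lidar_uvz, samples=50, seed=0):
    # mask_pixels: (N, 2) pixel coordinates inside one 2D detection mask
    # lidar_uvz:   (M, 3) lidar points projected to image space as (u, v, depth)
    rng = np.random.default_rng(seed)
    picks = mask_pixels[rng.choice(len(mask_pixels), size=samples)]
    # the nearest projected lidar point (in image space) donates its depth
    d2 = ((picks[:, None, :] - lidar_uvz[None, :, :2]) ** 2).sum(-1)
    depth = lidar_uvz[d2.argmin(axis=1), 2]
    return np.concatenate([picks, depth[:, None]], axis=1)  # (samples, 3)

mask = np.argwhere(np.ones((4, 4), bool)) + 100  # pixels of a toy detection mask
lidar = np.array([[101.0, 102.0, 8.5], [120.0, 90.0, 30.0]])
print(virtual_points(mask, lidar, samples=5))
```

In a real pipeline the resulting (u, v, z) triples would be unprojected through the camera intrinsics into metric 3D points and appended to the lidar sweep before running the 3D detector.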
Confluence is a novel non-Intersection-over-Union (IoU) alternative to non-maximum suppression (NMS) in bounding-box post-processing for object detection. It overcomes the inherent limitations of IoU-based NMS variants by using a proximity measure inspired by the normalized Manhattan distance, which represents a more stable and consistent predictor of bounding-box clustering. Unlike greedy and soft NMS, it does not rely solely on classification confidence scores to select optimal bounding boxes; instead, it selects the box that is closest to every other box within a given cluster and removes highly confluent neighboring boxes. On the MS COCO and CrowdHuman benchmarks, Confluence improves average precision by up to 2.3-3.8% and average recall by up to 5.3-7.2% when compared against the de facto standard and state-of-the-art NMS variants. Extensive qualitative analysis and threshold-sensitivity experiments support the quantitative results, leading to the conclusion that Confluence is more robust than NMS variants. Confluence represents a paradigm shift in bounding-box processing, with the potential to replace IoU in the bounding-box regression process.
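A hedged sketch of the proximity measure at the heart of the method, as the abstract describes it: the Manhattan distance between box coordinates after min-max normalizing each axis over the pair. The paper's full algorithm also weights by confidence and iterates over clusters, which is omitted here.

```python
import numpy as np

def confluence_proximity(a, b):
    # a, b: boxes as (x1, y1, x2, y2); smaller result = more "confluent"
    xs = np.array([a[0], a[2], b[0], b[2]], float)
    ys = np.array([a[1], a[3], b[1], b[3]], float)
    xs = (xs - xs.min()) / (xs.max() - xs.min())  # normalize x coords over the pair
    ys = (ys - ys.min()) / (ys.max() - ys.min())  # normalize y coords over the pair
    return abs(xs[0] - xs[2]) + abs(xs[1] - xs[3]) + abs(ys[0] - ys[2]) + abs(ys[1] - ys[3])

# Heavily overlapping boxes score well below 1; disjoint boxes score higher.
print(confluence_proximity((10, 10, 50, 50), (12, 11, 52, 49)))   # ~0.15
print(confluence_proximity((10, 10, 50, 50), (80, 80, 120, 120))) # ~2.5
```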
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
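The sketch below illustrates the quality definition the abstract builds on: label assignment at a given IoU threshold, and how a cascade sweeps that threshold upward stage by stage. The toy boxes are illustrative, and the per-stage box regression that keeps positives alive at high thresholds is only noted in a comment.

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def assign_labels(proposals, gts, threshold):
    # A proposal is positive iff its best IoU over ground truth clears the
    # threshold -- the "quality" definition the abstract refers to.
    return [int(max(iou(p, g) for g in gts) >= threshold) for p in proposals]

gts = [(10, 10, 50, 50)]
proposals = [(12, 12, 52, 52), (20, 20, 70, 70), (60, 60, 90, 90)]

# Each cascade stage trains at a higher threshold; in the real model each
# stage's regressor also refines the boxes before the next stage sees them,
# which prevents positives from vanishing at u = 0.6 and 0.7 (omitted here).
for u in (0.5, 0.6, 0.7):
    print(u, assign_labels(proposals, gts, u))
```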
Recent multi-object tracking (MOT) systems leverage highly accurate object detectors; however, training such detectors requires large amounts of labeled data. While such data is widely available for humans and vehicles, it is significantly more scarce for other animal species. We present Robust Confidence Tracking (RCT), an algorithm designed to maintain robust performance even when detection quality is poor. In contrast to prior methods that discard detection confidence information, RCT takes a fundamentally different approach, relying on the exact detection confidence values to initialize tracks, extend tracks, and filter tracks. In particular, RCT is able to minimize identity switches by making effective use of low-confidence detections (together with a single-object tracker) to maintain continuous tracks of objects. To evaluate trackers in the presence of unreliable detections, we present a challenging real-world underwater fish tracking dataset, FISHTRAC. In an evaluation on FISHTRAC as well as the UA-DETRAC dataset, we find that RCT outperforms other algorithms when given imperfect detections, including state-of-the-art deep single- and multi-object trackers as well as more classical approaches. Specifically, RCT has the best average HOTA across methods that successfully return results for all sequences, and it has significantly fewer identity switches than the other methods.
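As a loose illustration (not RCT's actual algorithm), the sketch below shows the general pattern the abstract points at: exact confidence values gate what may initialize a track versus what may merely extend one, so low-confidence detections keep tracks alive without spawning spurious ones. All thresholds and the data layout are assumptions.

```python
def iou(a, b):  # overlap of two (x1, y1, x2, y2) boxes
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    ua = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / ua

def update_tracks(tracks, detections, init_conf=0.7, extend_conf=0.1, iou_gate=0.3):
    unmatched = list(detections)
    for tr in tracks:
        # even low-confidence detections may extend a track they overlap
        cands = [d for d in unmatched
                 if d["conf"] >= extend_conf and iou(tr["box"], d["box"]) >= iou_gate]
        if cands:
            best = max(cands, key=lambda d: iou(tr["box"], d["box"]))
            tr["box"] = best["box"]          # extend the track
            unmatched.remove(best)
    # only confident leftovers are trusted to initialize new tracks
    tracks += [{"box": d["box"]} for d in unmatched if d["conf"] >= init_conf]
    return tracks

tracks = [{"box": (0, 0, 10, 10)}]
dets = [{"box": (1, 1, 11, 11), "conf": 0.2},    # low conf: may extend, not init
        {"box": (50, 50, 60, 60), "conf": 0.9}]  # high conf: may start a track
print(update_tracks(tracks, dets))
```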
Object detection in autonomous driving applications involves detecting and tracking semantic objects that are commonly native to urban driving environments, such as pedestrians and vehicles. One of the major challenges in state-of-the-art deep-learning-based object detection is false positives that occur with overconfident scores. This is highly undesirable in autonomous driving and other critical robotic-perception domains because of safety concerns. This paper proposes an approach to alleviate the problem of overconfident predictions by introducing a novel probabilistic layer into the deep object detection network at test time. The proposed approach avoids the traditional sigmoid or softmax prediction layer, which often produces overconfident predictions. It is demonstrated that the proposed technique reduces overconfidence in false positives without degrading the performance on true positives. The approach is validated on 2D KITTI object detection with YOLOv4 and SECOND (a lidar-based detector). The proposed method enables interpretable probabilistic predictions without retraining the network and is therefore highly practical.
Tracking has traditionally been the art of following interest points through space and time. This changed with the rise of powerful deep networks. Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection. We present a simultaneous detection and tracking algorithm that is simpler, faster, and more accurate than the state of the art. Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That's it. CenterTrack is simple, online (no peeking into the future), and real-time. It achieves 67.8% MOTA on the MOT17 challenge at 22 FPS and 89.4% MOTA on the KITTI tracking benchmark at 15 FPS, setting a new state of the art on both datasets. CenterTrack is easily extended to monocular 3D tracking by regressing additional 3D attributes. Using monocular video input, it achieves 28.3% AMOTA@0.2 on the newly released nuScenes 3D tracking benchmark, substantially outperforming the monocular baseline on this benchmark while running at 28 FPS.
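A hedged sketch of the greedy, offset-based association the abstract implies: each detection is displaced by its predicted offset into the previous frame and matched to the nearest unclaimed prior center. The fixed matching radius and data layout are simplifications; the actual tracker derives the radius from object size.

```python
import numpy as np

def associate(curr_centers, pred_offsets, prev_centers, prev_ids, radius=32.0):
    # Greedily match each current detection, projected back in time by its
    # predicted offset, to the nearest unclaimed previous-frame center.
    ids, claimed, next_id = [], set(), max(prev_ids, default=0) + 1
    for c, off in zip(curr_centers, pred_offsets):
        proj = c + off                               # estimated prior-frame position
        dists = np.linalg.norm(prev_centers - proj, axis=1)
        order = np.argsort(dists)
        match = next((j for j in order if j not in claimed and dists[j] < radius), None)
        if match is None:
            ids.append(next_id); next_id += 1        # unmatched: start a new track
        else:
            claimed.add(match); ids.append(prev_ids[match])
    return ids

prev = np.array([[100.0, 100.0], [300.0, 200.0]])
curr = np.array([[110.0, 104.0], [295.0, 210.0], [50.0, 50.0]])
offs = np.array([[-10.0, -4.0], [5.0, -10.0], [0.0, 0.0]])
print(associate(curr, offs, prev, [7, 9]))  # -> [7, 9, 10]
```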
In this paper, we propose a radio-assisted human detection framework that incorporates radio information into state-of-the-art detection methods, including anchor-based one-stage detectors and two-stage detectors. We extract radio localization and identifier information from radio signals to assist human detection, whereby the problems of false positives and false negatives can be greatly alleviated. For both kinds of detectors, we use radio-localization-based confidence score revision to improve detection performance. For two-stage detection methods, we propose to utilize region proposals generated from radio localization rather than relying on a region proposal network (RPN). Moreover, utilizing the radio identifier information, a non-maximum suppression method with radio-localization constraints is also proposed to further suppress false detections and reduce missed detections. Experiments on the simulative Microsoft COCO dataset and the Caltech pedestrian dataset show that the mean average precision (mAP) and miss rate of state-of-the-art detection methods can be improved with the aid of radio information. Finally, we conduct experiments in real-world scenarios to demonstrate the feasibility of our proposed method in practice.
Accurate and efficient pedestrian detection is crucial for intelligent transportation systems concerning pedestrian safety and mobility, e.g., advanced driver-assistance systems and smart pedestrian crosswalk systems. Among all pedestrian detection approaches, vision-based detection has proven to be the most effective in previous studies. However, existing vision-based pedestrian detection algorithms still have two limitations that hinder their implementation: real-time performance and resistance to the impact of environmental factors, e.g., low-light conditions. To address these issues, this study proposes a lightweight Illumination- and Temperature-aware Multispectral Network (IT-MN) for accurate and efficient pedestrian detection. The proposed IT-MN is an efficient one-stage detector. To adapt to the impact of environmental factors and enhance sensing accuracy when the visual image quality is limited, thermal image data is fused by the proposed IT-MN to enrich the useful information. In addition, an innovative and effective late fusion strategy is developed to optimize the image fusion performance. To make the proposed model deployable for edge computing, model quantization is applied to reduce the model size while shortening the inference time significantly. The proposed algorithm is evaluated against selected state-of-the-art algorithms on a public dataset collected by in-vehicle cameras. The results show that the proposed algorithm achieves a low miss rate of 14.19% and an inference time of 0.03 seconds per image pair on GPU. Moreover, the quantized IT-MN achieves an inference time of 0.21 seconds per image pair on an edge device, which demonstrates the potential of deploying the proposed model on edge devices as an efficient pedestrian detection algorithm.
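The edge-deployment step can be illustrated with PyTorch's post-training dynamic quantization. The stand-in model below is not IT-MN (whose exact quantization scheme the abstract does not detail); it only shows the kind of int8 weight compression involved.

```python
import torch
import torch.nn as nn

# A stand-in detection head; the real IT-MN architecture is not shown here.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 4))

# Post-training dynamic quantization: weights stored as int8, activations
# quantized on the fly -- a common way to shrink a model and speed up
# CPU/edge inference, in the spirit of the quantization step the abstract
# mentions (the paper's exact scheme may differ).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface, smaller weights
```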
Object detectors trained with weak annotations are affordable alternatives to fully supervised counterparts. However, there is still a significant performance gap between them. We propose to narrow this gap by fine-tuning a pre-trained weakly supervised detector with a few fully annotated samples automatically selected from the training set by "box-in-box" (BiB), a novel active learning strategy designed specifically for the well-documented failure modes of weakly supervised detectors. Experiments on the VOC07 and COCO benchmarks show that BiB outperforms other active learning techniques and significantly improves the performance of the base weakly supervised detector with only a few fully annotated images per class. BiB reaches 97% of the performance of fully supervised Fast R-CNN with only 10% of fully annotated images on VOC07. On COCO, using on average 10 fully annotated images per class, or the equivalent of 1% of the training set, BiB also reduces the performance gap (in AP) between the weakly supervised detector and fully supervised Fast R-CNN by over 70%, showing a good trade-off between performance and data efficiency. Our code is publicly available at https://github.com/huyvvo/bib.
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300 × 300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for 512 × 512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at: https://github.com/weiliu89/caffe/tree/ssd.
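A hedged sketch of the default-box grid described above, for a single square feature map; the scales and aspect ratios are illustrative, and the extra ratio-1 box at the geometric-mean scale follows the published SSD recipe.

```python
import numpy as np

def default_boxes(fmap_size, scale, next_scale, aspect_ratios=(1.0, 2.0, 0.5)):
    boxes = []
    for i in range(fmap_size):
        for j in range(fmap_size):
            cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size  # cell center
            for ar in aspect_ratios:  # one box per aspect ratio at this scale
                boxes.append((cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)))
            extra = np.sqrt(scale * next_scale)  # extra ratio-1 box between scales
            boxes.append((cx, cy, extra, extra))
    return np.array(boxes)  # rows of (cx, cy, w, h), normalized to [0, 1]

# A 38x38 map with 4 boxes per cell yields 5776 default boxes for this layer.
print(default_boxes(38, scale=0.1, next_scale=0.2).shape)
```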
Due to object detection's close relationship with video analysis and image understanding, it has attracted much research attention in recent years. Traditional object detection methods are built on handcrafted features and shallow trainable architectures. Their performance easily stagnates by constructing complex ensembles which combine multiple low-level image features with high-level context from object detectors and scene classifiers. With the rapid development in deep learning, more powerful tools, which are able to learn semantic, high-level, deeper features, are introduced to address the problems existing in traditional architectures. These models behave differently in network architecture, training strategy and optimization function, etc. In this paper, we provide a review on deep learning based object detection frameworks. Our review begins with a brief introduction on the history of deep learning and its representative tool, namely Convolutional Neural Network (CNN). Then we focus on typical generic object detection architectures along with some modifications and useful tricks to improve detection performance further. As distinct specific detection tasks exhibit different characteristics, we also briefly survey several specific tasks, including salient object detection, face detection and pedestrian detection. Experimental analyses are also provided to compare various methods and draw some meaningful conclusions. Finally, several promising directions and tasks are provided to serve as guidelines for future work in both object detection and relevant neural network based learning systems.
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.