Detection Transformer (DETR) directly transforms queries to unique objects by using one-to-one bipartite matching during training and enables end-to-end object detection. Recently, these models have surpassed traditional detectors on COCO with undeniable elegance. However, they differ from traditional detectors in multiple designs, including model architecture and training schedules, and thus the effectiveness of one-to-one matching is not fully understood. In this work, we conduct a strict comparison between the one-to-one Hungarian matching in DETRs and the one-to-many label assignments in traditional detectors with non-maximum suppression (NMS). Surprisingly, we observe that one-to-many assignments with NMS consistently outperform standard one-to-one matching under the same setting, with a significant gain of up to 2.5 mAP. Our detector, which trains Deformable-DETR with traditional IoU-based label assignment, achieves 50.2 COCO mAP within 12 epochs (1x schedule) with a ResNet50 backbone, outperforming all existing traditional or transformer-based detectors in this setting. On multiple datasets, schedules, and architectures, we consistently show that bipartite matching is unnecessary for performant detection transformers. Furthermore, we attribute the success of detection transformers to their expressive transformer architecture. Code is available at https://github.com/jozhang97/DETA.
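To make the one-to-many side of this comparison concrete, the following minimal sketch labels every candidate box whose IoU with a ground-truth box reaches a threshold as a positive for that ground truth, so a single object can receive several positives. The function names and the 0.5 threshold are illustrative assumptions and are not taken from the DETA code.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_one_to_many(candidates, gts, pos_thresh=0.5):
    """Per candidate, return the index of its best-IoU ground truth, or -1 for background.

    pos_thresh is a hypothetical threshold chosen for illustration.
    """
    labels = []
    for c in candidates:
        ious = [iou(c, g) for g in gts]
        best = max(range(len(gts)), key=lambda j: ious[j]) if gts else -1
        labels.append(best if best >= 0 and ious[best] >= pos_thresh else -1)
    return labels


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
candidates = [(0, 0, 10, 10), (1, 1, 10, 10), (21, 20, 30, 30), (50, 50, 60, 60)]
labels = assign_one_to_many(candidates, gts)
print(labels)
```

Here the first ground truth receives two positives, which is exactly what one-to-one matching forbids; duplicates produced by such training are what NMS removes at inference.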
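We present DINO (\textbf{D}ETR with \textbf{I}mproved de\textbf{N}oising anch\textbf{O}r boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive approach to denoising training, a mixed query selection method for anchor initialization, and a look-forward-twice scheme for box prediction. DINO achieves $49.4$ AP in $12$ epochs and $51.3$ AP in $24$ epochs on COCO with a ResNet-50 backbone and multi-scale features, a significant improvement of $\textbf{+6.0}$ \textbf{AP} and $\textbf{+2.7}$ \textbf{AP}, respectively, over DN-DETR, the previous best DETR-like model. DINO scales well in both model size and data size. Without bells and whistles, after pre-training on the Objects365 dataset with a SwinL backbone, DINO obtains the best results on both COCO \texttt{val2017} ($\textbf{63.2}$ \textbf{AP}) and \texttt{test-dev} ($\textbf{63.3}$ \textbf{AP}). Compared with other models on the leaderboard, DINO significantly reduces its model size and pre-training data size while achieving better results. Our code will be available at \url{https://github.com/ideacvr/dino}.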
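Detection Transformer (DETR) relies on one-to-one label assignment, i.e., assigning one ground-truth (GT) object to only one positive object query, for end-to-end object detection, and it lacks the capability of exploiting multiple positive queries. We present a novel DETR training approach, named {\em Group DETR}, to support multiple positive queries. Specifically, we decouple the positives into multiple independent groups and keep only one positive per GT object in each group. We make simple modifications during training: (i) adopt $K$ groups of object queries; (ii) conduct decoder self-attention on each group of object queries with the same parameters; (iii) perform one-to-one label assignment for each group, leading to $K$ positive object queries for each GT object. In inference, we use only one group of object queries, making no modifications to either the architecture or the process. We validate the effectiveness of the proposed approach on DETR variants, including Conditional DETR, DAB-DETR, DN-DETR, and DINO.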
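One possible way to realize step (ii) is a block-diagonal attention mask over the concatenated queries, so that each query attends only to queries in its own group. This is a minimal sketch under that assumption; the function name and the mask convention (True means "blocked") are hypothetical and may differ from how the authors implement group-wise self-attention.

```python
def group_attention_mask(num_groups, queries_per_group):
    """Build an N x N boolean mask, N = num_groups * queries_per_group.

    True means query i may NOT attend to query j (they are in different groups).
    """
    n = num_groups * queries_per_group
    return [
        [(i // queries_per_group) != (j // queries_per_group) for j in range(n)]
        for i in range(n)
    ]


# Two groups of three queries each: the mask is block-diagonal.
mask = group_attention_mask(num_groups=2, queries_per_group=3)
for row in mask:
    print("".join("x" if blocked else "." for blocked in row))
```

At inference only one group is kept, so no mask is needed and the decoder runs unchanged.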
We present in this paper a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in early training stages. To address this issue, in addition to the Hungarian loss, our method feeds noised ground-truth bounding boxes into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to faster convergence. Our method is universal and can be easily plugged into any DETR-like method by adding dozens of lines of code to achieve a remarkable improvement. As a result, our DN-DETR yields a remarkable improvement ($+1.9$AP) under the same setting and achieves the best result (AP $43.4$ and $48.6$ with $12$ and $50$ epochs of training respectively) among DETR-like methods with a ResNet-$50$ backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with $50\%$ of the training epochs. Code is available at \url{https://github.com/FengLi-ust/DN-DETR}.
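The denoising idea can be sketched as follows: jitter each ground-truth box, feed the noised box to the decoder, and penalize the distance between the decoder's output and the original box. The snippet below only illustrates the noising and a reconstruction loss; the noise scales, the function names, and the plain L1 loss are assumptions for illustration, not the settings of the DN-DETR code.

```python
import random


def noise_box(box, center_scale=0.4, size_scale=0.4, rng=random):
    """Jitter a (cx, cy, w, h) box. Both scale arguments are hypothetical hyperparameters."""
    cx, cy, w, h = box
    # Shift the center by at most center_scale * half the box size.
    cx += rng.uniform(-1, 1) * center_scale * w / 2
    cy += rng.uniform(-1, 1) * center_scale * h / 2
    # Rescale width and height by a factor in [1 - size_scale, 1 + size_scale].
    w *= 1 + rng.uniform(-1, 1) * size_scale
    h *= 1 + rng.uniform(-1, 1) * size_scale
    return (cx, cy, w, h)


def l1_reconstruction_loss(pred, target):
    """L1 distance between a predicted box and the clean ground-truth box."""
    return sum(abs(p - t) for p, t in zip(pred, target))


rng = random.Random(0)
gt = (0.5, 0.5, 0.2, 0.4)
noised = noise_box(gt, rng=rng)
# A decoder that simply echoed its noised input would pay this loss.
print(noised, l1_reconstruction_loss(noised, gt))
```

Because each noised query is derived from a known ground-truth box, its target is fixed and no bipartite matching is needed for this part of the loss.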
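One-to-one matching is a key design by which DETR establishes its end-to-end capability, so that object detection does not require a hand-crafted NMS (non-maximum suppression) step to remove duplicate detections. This end-to-end signature is important for the versatility of DETR, and it has been generalized to a wide range of vision problems, including instance/semantic segmentation, human pose estimation, and point cloud/multi-view-image based detection. However, we note that because too few queries are assigned as positive samples, one-to-one matching significantly reduces the training efficiency of positive samples. This paper proposes a simple yet effective method based on a hybrid matching scheme that combines the original one-to-one matching branch with auxiliary queries that use a one-to-many matching loss during training. This hybrid strategy is shown to significantly improve training efficiency and accuracy. In inference, only the original one-to-one matching branch is used, thus maintaining the end-to-end merit and the same inference efficiency of DETR. The method is named $\mathcal{H}$-DETR, and it shows that a wide range of representative DETR methods can be consistently improved across a wide range of visual tasks, including Deformable-DETR, 3DETR/PETRv2, PETR, and TransTrack, among others. Code will be available at: https://github.com/hdetr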
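Detection transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer significant performance drops on small datasets such as Cityscapes. In other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply altering how the key and value sequences are constructed in the cross-attention layer, with minimal modifications to the original models. In addition, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improves their performance on both small and sample-rich datasets. Code will be made publicly available at \url{https://github.com/encounter1997/de-detrs}.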
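Recent end-to-end multi-object detectors simplify the inference pipeline by removing hand-crafted processes such as duplicate bounding box removal with non-maximum suppression (NMS). However, during training they require bipartite matching to compute the loss from the detector outputs. Contrary to the spirit of end-to-end learning, bipartite matching makes the training of an end-to-end detector complex, heuristic, and dependent on the matching procedure. In this paper, we propose a method to train an end-to-end multi-object detector without bipartite matching. To this end, we approach end-to-end multi-object detection as a density estimation problem using a mixture model. Our proposed detector, called the Sparse Mixture Density Object Detector (Sparse MDOD), estimates the distribution of bounding boxes using a mixture model. Sparse MDOD is trained by minimizing the negative log-likelihood together with our proposed regularization term, the maximum component maximization (MCM) loss, which prevents duplicate predictions. During training, no additional procedure such as bipartite matching is needed, and the loss is computed directly from the network outputs. Moreover, our Sparse MDOD outperforms existing detectors on MS-COCO, a well-known multi-object detection benchmark.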
The DETR object detection approach applies the transformer encoder and decoder architecture to detect objects and achieves promising performance. In this paper, we present a simple approach to address the main problem of DETR, its slow convergence, by using a representation learning technique. In this approach, we detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders. By detecting objects as paired keypoints, the model builds up a joint classification and pair association on the output queries from the two decoders. For the pair association, we propose utilizing a contrastive self-supervised learning algorithm that does not require a specialized architecture. Experimental results on the MS COCO dataset show that Pair DETR can converge at least 10x faster than the original DETR and 1.5x faster than Conditional DETR during training, while having consistently higher Average Precision scores.
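A top-left corner and a center determine an axis-aligned box uniquely, because the center is the midpoint of the two opposite corners. The short sketch below shows that geometric step only; the function name is hypothetical and the snippet says nothing about how Pair DETR associates the two keypoints.

```python
def box_from_corner_and_center(top_left, center):
    """Recover (x1, y1, x2, y2) from a top-left corner and a box center."""
    x1, y1 = top_left
    cx, cy = center
    # The bottom-right corner is the reflection of the top-left corner through the center.
    x2, y2 = 2 * cx - x1, 2 * cy - y1
    return (x1, y1, x2, y2)


box = box_from_corner_and_center((10, 20), (30, 50))
print(box)
```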
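Various models have been proposed to perform object detection. However, most require many hand-designed components, such as anchors and non-maximum suppression (NMS), to demonstrate good performance. To mitigate these issues, the transformer-based DETR and its variant, Deformable DETR, were proposed. These resolve much of the complexity in designing a head for object detection models. However, doubts about performance remain when transformer-based models are regarded as the state of the art in object detection, because other models that depend on anchors and NMS have shown better results. Furthermore, it has been unclear whether an end-to-end pipeline can be built from attention modules alone, because the DETR-adapted transformer methods use a convolutional neural network (CNN) as the backbone. In this study, we propose that combining several attention modules with our new Task Specific Split Transformer (TSST) is a powerful way to produce state-of-the-art performance on COCO without traditional hand-designed components. By splitting the general-purpose attention module into two separate goal-specific attention modules, the proposed method allows simpler object detection models to be designed. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is available at https://github.com/navervision/tsst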
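Vision transformers (ViTs) are changing the landscape of object detection approaches. A natural use of ViTs in detection is to replace the CNN-based backbone with a transformer-based backbone, which is straightforward and effective but comes at the price of a considerable computation burden at inference. A more subtle use is the DETR family, which eliminates the need for many hand-designed components in object detection but introduces a decoder that demands an extra-long time to converge. As a result, transformer-based object detection cannot prevail in large-scale applications. To overcome these issues, we propose a novel decoder-free fully transformer-based (DFFT) object detector, which for the first time achieves high efficiency in both the training and inference stages. We simplify object detection into an encoder-only, single-level, anchor-based dense prediction problem by centering on two entry points: 1) eliminate the training-inefficient decoder and leverage two strong encoders to preserve the accuracy of single-level feature map prediction; 2) explore low-level semantic features for the detection task with limited computational resources. In particular, we design a novel lightweight detection-oriented transformer backbone that efficiently captures low-level features with rich semantics, based on a well-conceived ablation study. Extensive experiments on the MS COCO benchmark show that DFFT_SMALL outperforms DETR by 2.5% AP with a 28% reduction in computation cost and more than $10\times$ fewer training epochs. Compared with the cutting-edge anchor-based detector RetinaNet, DFFT_SMALL obtains a gain of over 5.5% AP while cutting computation cost by 70%.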
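The query mechanism introduced in the DETR method is changing the paradigm of object detection, and many query-based methods have recently obtained strong object detection performance. However, current query-based detection pipelines suffer from two issues. First, multi-stage decoders are required to optimize the randomly initialized object queries, incurring a large computation burden. Second, the queries are fixed after training, leading to unsatisfactory generalization capability. To remedy these issues, we present featurized object queries predicted by a query generation network within the Faster R-CNN framework and develop a Featurized Query R-CNN. Extensive experiments on the COCO dataset show that our Featurized Query R-CNN obtains the best speed-accuracy trade-off among all R-CNN detectors, including the recent state-of-the-art Sparse R-CNN detector. The code is available at \url{https://github.com/hustvl/featurized-queryrcnn}.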
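DETR is the first end-to-end object detector using a transformer encoder-decoder architecture, and it demonstrates competitive performance but low computational efficiency on high-resolution feature maps. The subsequent work, Deformable DETR, improves the efficiency of DETR by replacing dense attention with deformable attention, which achieves 10x faster convergence and improved performance. Deformable DETR uses multi-scale features to improve performance; however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck. In our preliminary experiments, we observe that detection performance hardly deteriorates even if only a part of the encoder tokens is updated. Inspired by this observation, we propose Sparse DETR, which selectively updates only the tokens expected to be referenced by the decoder, thereby helping the model detect objects efficiently. In addition, we show that applying an auxiliary detection loss on the selected tokens in the encoder improves performance while minimizing computational overhead. We validate that Sparse DETR achieves better performance than Deformable DETR even with only 10% of the encoder tokens on the COCO dataset. Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR. Code is available at https://github.com/kakaobrain/sparse-dett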
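Although detection with transformers (DETR) is increasingly popular, its global attention modeling requires an extremely long training period to optimize and achieve promising detection performance. As an alternative to existing studies that mainly develop advanced feature or embedding designs to tackle the training issue, we point out that Region-of-Interest (RoI) based detection refinement can readily help mitigate the training difficulty of DETR methods. Based on this, we introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper. In particular, REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects more accurately. In each processing stage, visual features are extracted as glimpse features from RoIs given by the enlarged bounding boxes of the detection results from the previous stage. Then, a glimpse-based decoder is introduced to provide refined detection results based on both the glimpse features and the attention modeling outputs of the previous stage. In practice, REGO can be easily embedded in representative DETR variants while maintaining their fully end-to-end training and inference pipelines. In particular, REGO helps Deformable DETR achieve 44.8 AP on the MSCOCO dataset with only 36 training epochs, compared with the original DETR and Deformable DETR, which require 500 and 50 epochs, respectively, to achieve comparable performance. Experiments also show that REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain under the same setting of 50 training epochs. Code is available via https://github.com/zhechen/deformable-detr-rego.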
This paper investigates a phenomenon where query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage. We review the training process and attribute the overlooked phenomenon to two limitations: lack of training emphasis and cascading errors from the decoding sequence. We design and present Selective Query Recollection (SQR), a simple and effective training strategy for query-based object detectors. It cumulatively collects intermediate queries as decoding stages go deeper and selectively forwards the queries to the downstream stages, bypassing the sequential structure. In this way, SQR places training emphasis on later stages and allows later stages to work directly with intermediate queries from earlier stages. SQR can be easily plugged into various query-based object detectors and significantly enhances their performance while leaving the inference pipeline unchanged. We apply SQR to Adamixer, DAB-DETR, and Deformable-DETR across various settings (backbone, number of queries, schedule), and it consistently brings a 1.4-2.8 AP improvement.
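In this paper, we propose a simple attention mechanism that we call box-attention. It enables spatial interaction between grid features, sampled from boxes of interest, and improves the learning capability of transformers for several vision tasks. Specifically, we present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map. BoxeR computes attention weights on these boxes by considering their grid structure. Notably, BoxeR-2D naturally reasons about box information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks. By learning invariance to rotation in the box-attention module, BoxeR-3D is capable of generating discriminative information from a bird's-eye-view plane for end-to-end 3D object detection. Our experiments show that the proposed BoxeR-2D achieves better results on COCO detection and reaches performance comparable to the well-established and highly optimized Mask R-CNN on COCO instance segmentation. BoxeR-3D already obtains compelling performance for the vehicle category of Waymo Open, without any class-specific optimization. The code will be released.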
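Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as their datasets are larger and easier to collect. We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. Unlike prior work, Detic does not assign image labels to boxes based on model predictions, making it much easier to implement and compatible with a range of detection architectures and backbones. Our results show that Detic yields excellent detectors even for classes without box annotations. It outperforms prior work on both open-vocabulary and long-tail detection benchmarks. Detic provides a gain of 2.4 mAP for all classes and 8.3 mAP for novel classes on the open-vocabulary LVIS benchmark. On the standard LVIS benchmark, Detic obtains 41.7 mAP on all classes and 41.7 mAP on rare classes. For the first time, we train a detector with all twenty-one thousand classes of the ImageNet dataset and show that it generalizes to new datasets without fine-tuning. Code is available at https://github.com/facebookresearch/dorm.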
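Without densely tiled anchor boxes or grid points in the image, Sparse R-CNN achieves promising results through a set of object queries and proposal boxes updated in a cascaded training manner. However, due to its sparse nature and the one-to-one relation between a query and its attended region, it depends heavily on self-attention, which is usually inaccurate in the early training stage. Moreover, in scenes with dense objects, an object query interacts with many irrelevant ones, reducing its uniqueness and harming performance. This paper proposes to use the IoU between different boxes as a prior for value routing in self-attention. The original attention matrix is multiplied by a matrix of the same size computed from the IoU of the proposal boxes, and together they determine the routing scheme so that irrelevant features can be suppressed. Furthermore, to accurately extract features for both classification and regression, we add two lightweight projection heads that provide dynamic channel masks based on the object query; these are multiplied with the output of the dynamic convs, making the results suitable for the two different tasks. We validate the proposed scheme on different datasets, including MS-COCO and CrowdHuman, showing that it significantly improves performance and increases model convergence speed.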
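The IoU prior described above can be sketched as an element-wise product between the attention matrix and the pairwise IoU matrix of the proposal boxes. The sketch below additionally renormalizes each row so the weights sum to one; that renormalization, the `eps` constant, and the function names are assumptions made for illustration and are not stated in the abstract.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_weighted_attention(attn, boxes, eps=1e-9):
    """Multiply attention weights by pairwise box IoU, then renormalize each row (assumed step)."""
    n = len(boxes)
    out = []
    for i in range(n):
        row = [attn[i][j] * iou(boxes[i], boxes[j]) for j in range(n)]
        total = sum(row)
        out.append([v / (total + eps) for v in row])
    return out


boxes = [(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)]
uniform = [[1 / 3] * 3 for _ in range(3)]
routed = iou_weighted_attention(uniform, boxes)
for row in routed:
    print([round(v, 3) for v in row])
```

In this toy case the third box overlaps neither of the others, so its query stops exchanging information with them, which is the suppression of irrelevant features that the abstract describes.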
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. Code is available at \url{https://github.com/IDEACVR/MaskDINO}.
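The mask prediction step described above, a dot product between each query embedding and every pixel embedding, can be illustrated with plain lists. This is a minimal sketch: the function name, the tensor layout, and thresholding the logits at zero to obtain binary masks are assumptions, not details taken from the Mask DINO code.

```python
def predict_masks(query_embeds, pixel_embeds, threshold=0.0):
    """query_embeds: Q x C list; pixel_embeds: H x W x C list. Returns Q binary H x W masks.

    A pixel is set to 1 when its dot product with the query exceeds `threshold`
    (a zero threshold on logits is an illustrative assumption).
    """
    masks = []
    for q in query_embeds:
        mask = [
            [1 if sum(qc * pc for qc, pc in zip(q, pix)) > threshold else 0 for pix in row]
            for row in pixel_embeds
        ]
        masks.append(mask)
    return masks


queries = [[1.0, 0.0], [0.0, 1.0]]  # two queries, embedding dimension 2
pixels = [
    [[2.0, -1.0], [-1.0, 3.0]],
    [[0.5, 0.5], [-2.0, -2.0]],
]  # a 2 x 2 pixel embedding map
masks = predict_masks(queries, pixels)
print(masks)
```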
Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector such that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and a better user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR -- hence the name OV-DETR -- which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge of turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence to generalize to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that our OV-DETR -- the first end-to-end Transformer-based open-vocabulary detector -- achieves non-trivial improvements over the current state of the art.