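Because breast cancer has high incidence and mortality, detecting masses in mammograms is important. In mammogram mass detection, explicitly modeling pairwise lesion correspondence is particularly important. However, most existing methods build relatively coarse correspondences and do not use correspondence supervision. In this paper, we propose CL-Net, a new transformer-based framework that learns lesion detection and pairwise correspondence in an end-to-end manner. In CL-Net, a View-Interactive Lesion Detector enables dynamic interaction among candidates across views, while a Lesion Linker uses correspondence supervision to guide the interaction process more accurately. Together, these two designs yield a precise understanding of pairwise lesion correspondence in mammograms. Experiments show that CL-Net achieves state-of-the-art performance on the public DDSM dataset and on our in-house dataset. Moreover, it outperforms previous methods in the low-FPI (false positives per image) regime.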
translated by 谷歌翻译
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the VOD pipeline, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. Besides, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain current-frame detection results. These designs boost the strong baseline Deformable DETR by a significant margin (2%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. Then, we present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object query via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single V100 GPU. Code and models will be available for further research.
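In this paper, we are interested in Detection Transformer (DETR), an end-to-end object detection approach based on a transformer encoder-decoder architecture that needs no hand-crafted post-processing such as NMS. Inspired by Conditional DETR, an improved DETR with fast training convergence that introduced box queries (originally called spatial queries) for internal decoder layers, we reformulate the object query into the format of the box query, which is a composition of the embedding of the reference point and the transformation of the box with respect to the reference point. This reformulation reveals the connection between the object query in DETR and the anchor box that is widely studied in Faster R-CNN. Furthermore, we learn the box queries from the image content, further improving the detection quality of Conditional DETR while keeping fast training convergence. In addition, we adopt the idea of axial self-attention to save memory and accelerate the encoder. The resulting detector, called Conditional DETR V2, achieves better results than Conditional DETR, saves memory, and runs more efficiently. For example, with the DC5-ResNet-50 backbone, our approach achieves 44.8 AP at 16.4 FPS on the COCO val set; compared to Conditional DETR, it runs 1.6× faster, saves 74% of the overall memory cost, and improves the AP score by 1.0.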
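As a rough illustration of why axial self-attention saves memory, the sketch below counts attention-matrix entries for full self-attention versus a generic row-then-column axial scheme on an H × W feature map. The function names are hypothetical, and the generic axial scheme is an assumption: the exact variant used in Conditional DETR V2 is not described in the abstract and may differ.

```python
def full_attention_entries(height, width):
    """Entries in one full self-attention matrix over all H*W tokens."""
    tokens = height * width
    return tokens * tokens


def axial_attention_entries(height, width):
    """Entries for a row pass plus a column pass (generic axial attention)."""
    row_pass = height * width * width      # each of H rows attends over W tokens
    column_pass = width * height * height  # each of W columns attends over H tokens
    return row_pass + column_pass


full = full_attention_entries(32, 32)
axial = axial_attention_entries(32, 32)
print(full, axial, full // axial)  # 1048576 65536 16
```

For a square 32 × 32 map, the axial scheme stores 16 times fewer attention entries under these assumptions; the saving grows with resolution.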
The DETR object detection approach applies the transformer encoder and decoder architecture to detect objects and achieves promising performance. In this paper, we present a simple approach to address the main problem of DETR, its slow convergence, by using a representation learning technique. In this approach, we detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders. By detecting objects as paired keypoints, the model builds up a joint classification and pair association on the output queries from the two decoders. For the pair association, we propose using a contrastive self-supervised learning algorithm that requires no specialized architecture. Experimental results on the MS COCO dataset show that Pair DETR converges at least 10x faster than the original DETR and 1.5x faster than Conditional DETR during training, while achieving consistently higher Average Precision scores.
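To make the paired-keypoint representation concrete, here is a minimal sketch of how a box could be recovered from a top-left corner and a center. The helper name and the `(x_min, y_min, x_max, y_max)` output convention are assumptions for illustration, not the paper's code.

```python
def box_from_corner_and_center(top_left, center):
    """Recover (x_min, y_min, x_max, y_max) from a top-left corner and a center."""
    x_tl, y_tl = top_left
    cx, cy = center
    if cx < x_tl or cy < y_tl:
        raise ValueError("center must not lie above or left of the top-left corner")
    # The center is the midpoint of the two opposite corners, so the
    # bottom-right corner is the reflection of the top-left one through it.
    x_br = 2 * cx - x_tl
    y_br = 2 * cy - y_tl
    return (x_tl, y_tl, x_br, y_br)


box = box_from_corner_and_center((10.0, 20.0), (30.0, 50.0))
print(box)  # (10.0, 20.0, 50.0, 80.0)
```

Two keypoints fix all four box coordinates, which is why associating the right corner with the right center matters.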
Although DETR-based 3D detectors can simplify the detection pipeline and achieve direct sparse predictions, their performance still lags behind dense detectors with post-processing for 3D object detection from point clouds. DETRs usually adopt a larger number of queries than GTs in a scene (e.g., 300 queries vs. 40 objects in Waymo), which inevitably incurs many false positives during inference. In this paper, we propose a simple yet effective sparse 3D detector, named Query Contrast Voxel-DETR (ConQueR), to eliminate the challenging false positives and achieve more accurate and sparser predictions. We observe that most false positives are highly overlapping in local regions, caused by the lack of explicit supervision to discriminate locally similar queries. We thus propose a Query Contrast mechanism to explicitly enhance queries towards their best-matched GTs over all unmatched query predictions. This is achieved by constructing positive and negative GT-query pairs for each GT, and by a contrastive loss that enhances positive GT-query pairs against negative ones based on feature similarities. ConQueR closes the gap between sparse and dense 3D detectors and reduces false positives by up to ~60%. Our single-frame ConQueR achieves a new state-of-the-art (SOTA) 71.6 mAPH/L2 on the challenging Waymo Open Dataset validation set, outperforming previous SOTA methods (e.g., PV-RCNN++) by over 2.0 mAPH/L2.
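The contrastive idea described above can be sketched with a generic InfoNCE-style loss over feature similarities. This is only an illustration under stated assumptions: the function names, the cosine similarity, and the temperature value are hypothetical, and the abstract does not specify the exact loss ConQueR uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def query_contrast_loss(gt_feat, query_feats, pos_index, temperature=0.1):
    """InfoNCE-style loss for one GT: favor its matched query over all others."""
    logits = [cosine(gt_feat, q) / temperature for q in query_feats]
    # Numerically stable log-sum-exp over the positive and all negatives.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return -(logits[pos_index] - log_denominator)


gt = [1.0, 0.0]
queries = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
# The loss is lower when the matched query is the one most similar to the GT.
print(query_contrast_loss(gt, queries, 0) < query_contrast_loss(gt, queries, 1))  # True
```

Note that the third query, which is locally similar to the first, keeps the loss above zero even for the correct match; that is the kind of near-duplicate the contrast is meant to separate.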
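People looking at each other, or mutual gaze, is ubiquitous in our daily interactions, and detecting mutual gaze is of great significance for understanding human social scenes. Current mutual gaze detection methods are two-stage: their inference speed is limited by the two-stage pipeline, and the performance of the second stage is affected by the first. In this paper, we propose a novel one-stage mutual gaze detection framework, called Mutual Gaze TRansformer or MGTR, which performs mutual gaze detection in an end-to-end manner. By designing mutual gaze instance triples, MGTR can detect each human head bounding box and simultaneously infer the mutual gaze relationship based on global image information, which streamlines the whole process. Experimental results on two mutual gaze datasets show that our method accelerates mutual gaze detection without losing performance. An ablation study shows that different components of MGTR capture different levels of semantic information in images. Code is available at https://github.com/gmbition/mgtr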
We present in this paper a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in early training stages. To address this issue, in addition to the Hungarian loss, our method feeds noised ground-truth bounding boxes into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the bipartite graph matching difficulty and leads to faster convergence. Our method is universal and can be easily plugged into any DETR-like method by adding dozens of lines of code. As a result, our DN-DETR achieves a remarkable improvement (+1.9 AP) under the same setting and the best result (43.4 AP and 48.6 AP with 12 and 50 epochs of training, respectively) among DETR-like methods with a ResNet-50 backbone. Compared with the baseline under the same setting, DN-DETR achieves comparable performance with 50% of the training epochs. Code is available at https://github.com/FengLi-ust/DN-DETR.
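The denoising task can be pictured as jittering each ground-truth box and asking the decoder to recover the original. The sketch below shows one plausible noising step and an L1 reconstruction target; the function names, the noise scales, and the normalized `(cx, cy, w, h)` box format are assumptions for illustration and are not taken from the DN-DETR code.

```python
import random


def add_box_noise(box, shift_scale=0.4, size_scale=0.4, rng=random):
    """Jitter a normalized (cx, cy, w, h) box; the model is trained to undo this."""
    cx, cy, w, h = box
    # Shift the center by at most shift_scale times half the box size.
    cx += rng.uniform(-1, 1) * shift_scale * w / 2
    cy += rng.uniform(-1, 1) * shift_scale * h / 2
    # Rescale width and height by a factor in [1 - size_scale, 1 + size_scale].
    w *= 1 + rng.uniform(-1, 1) * size_scale
    h *= 1 + rng.uniform(-1, 1) * size_scale
    # Keep every coordinate inside the normalized image range.
    return tuple(min(max(v, 0.0), 1.0) for v in (cx, cy, w, h))


def reconstruction_l1(pred_box, gt_box):
    """L1 distance between a predicted box and the clean ground-truth box."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))


gt_box = (0.5, 0.5, 0.2, 0.4)
noised = add_box_noise(gt_box, rng=random.Random(0))
# Before any training, the "prediction" is just the noised box itself.
print(reconstruction_l1(noised, gt_box) > 0.0)
```

Because each noised box is tied to a known ground truth, this auxiliary task needs no bipartite matching, which is the property the abstract credits for the faster convergence.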
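This work aims to advance temporal action detection (TAD) using an encoder-decoder framework with action queries, similar to DETR, which has shown great success in object detection. However, the framework suffers from several problems when directly applied to TAD: insufficient exploration of inter-query relations in the decoder, inadequate classification training due to the limited number of training samples, and unreliable classification scores at inference. To this end, we first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations. Moreover, we propose two losses to facilitate and stabilize the training of action classification. Finally, we propose predicting the localization quality of each action query at inference in order to distinguish high-quality queries. The proposed method, named ReAct, achieves state-of-the-art performance on THUMOS14 with much lower computational cost than previous methods. In addition, extensive ablation studies are conducted to verify the effectiveness of each proposed component. The code is available at https://github.com/sssste/reaeact.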
DETR has recently been proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitations of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules attend only to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10× fewer training epochs. Extensive experiments on the COCO benchmark demonstrate the effectiveness of our approach. Code is released at https://github.com/fundamentalvision/Deformable-DETR.
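The core idea of attending to a few sampled points around a reference can be sketched in plain Python. This is a heavily simplified, single-head, single-scale, scalar-feature illustration: the function names are hypothetical, and the offsets and scores are passed in directly here, whereas in a real model they would be predicted from the query.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at a continuous (x, y)."""
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)   # clamp to the grid
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = feature_map[y0][x0] * (1 - fx) + feature_map[y0][x1] * fx
    bottom = feature_map[y1][x0] * (1 - fx) + feature_map[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attend(feature_map, reference, offsets, scores):
    """Weighted sum over a small set of points sampled around `reference`."""
    weights = softmax(scores)
    rx, ry = reference
    samples = [bilinear_sample(feature_map, rx + dx, ry + dy) for dx, dy in offsets]
    return sum(w * s for w, s in zip(weights, samples))


grid = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
out = deformable_attend(grid, (1.0, 1.0), [(-0.5, 0.0), (0.5, 0.0)], [0.0, 0.0])
print(out)  # 4.0
```

The cost depends on the number of sampled points rather than on the full feature-map size, which is what makes high-resolution feature maps affordable.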
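Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt a bottom-up two-stage approach or a point-based one-stage approach, which often suffer from high time complexity or sub-optimal design assumptions. In this work, we propose a novel SGG method to address these issues, formulating the task as a bipartite graph construction problem. To solve the problem, we develop a transformer-based end-to-end framework that first generates entity and predicate proposal sets and then infers directed edges to form relation triplets. In particular, we develop a new entity-aware predicate representation based on a structural predicate generator, which exploits the compositional property of relationships. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Extensive experimental results show that our design achieves state-of-the-art or comparable performance on two challenging benchmarks, surpassing most existing approaches while enjoying higher inference efficiency. We hope our model can serve as a strong baseline for Transformer-based scene graph generation.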
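Human-Object Interaction (HOI) detection has received considerable attention in the context of scene understanding. Despite growing progress on benchmarks, we find that existing methods often perform poorly on distant interactions, for two main reasons: 1) Distant interactions are by nature more difficult to recognize than close ones. A natural scene often involves multiple humans and objects with intricate spatial relations, so recognizing interactions between distant humans and objects is heavily affected by complex visual context. 2) The insufficient number of distant interactions in benchmark datasets leads to under-fitting on these instances. To address these problems, we propose a novel two-stage method for better handling distant interactions in HOI detection. One essential component of our method is a novel Far Near Distance Attention module. It enables information propagation between humans and objects while taking spatial distance into account. In addition, we devise a novel Distance-Aware loss function that makes the model focus more on distant yet rare interactions. We conduct extensive experiments on two challenging datasets, HICO-DET and V-COCO. The results show that the proposed method surpasses existing approaches by a large margin, resulting in new state-of-the-art performance.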
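Detection transformers have achieved competitive performance on the sample-rich COCO dataset. However, we show that most of them suffer significant performance drops on small datasets such as Cityscapes. In other words, detection transformers are generally data-hungry. To tackle this problem, we empirically analyze the factors that affect data efficiency through a step-by-step transition from a data-efficient RCNN variant to the representative DETR. The empirical results suggest that sparse feature sampling from local image areas holds the key. Based on this observation, we alleviate the data-hungry issue of existing detection transformers by simply altering how the key and value sequences are constructed in the cross-attention layer, with minimal modifications to the original models. In addition, we introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency. Experiments show that our method can be readily applied to different detection transformers and improves their performance on both sample-rich and sample-poor datasets. Code will be made publicly available at https://github.com/encounter1997/de-detrs.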
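Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We find that, when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient while taking a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialize, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.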
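Human-object interaction (HOI) detection, as a downstream task of object detection, requires localizing pairs of humans and objects and extracting the semantic relationships between them from an image. Recently, one-stage approaches have become a new trend for this task due to their high efficiency. However, these approaches focus on detecting possible interaction points or filtering human-object pairs, ignoring the variability in the location and size of different objects across spatial scales. To address this problem, we propose a transformer-based method, QAHOI (Query-Based Anchors for Human-Object Interaction detection), which leverages a multi-scale architecture to extract features from different spatial scales and uses query-based anchors to predict all the elements of an HOI instance. We further find that a powerful backbone significantly increases the accuracy of QAHOI, and QAHOI with a transformer-based backbone outperforms recent state-of-the-art methods by a large margin on the HICO-DET benchmark. The source code is available at https://github.com/cjw2021/qhoii.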
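The Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing. Inspired by this significant achievement, pioneering works have recently adapted Transformer-like architectures to the computer vision (CV) field and demonstrated their effectiveness on various CV tasks. Relying on competitive modeling capability, visual Transformers compare favorably with modern convolutional neural networks. In this paper, we provide a comprehensive review of three hundred different visual Transformers for three fundamental CV tasks (classification, detection, and segmentation), and propose a taxonomy that organizes these methods according to their motivations, structures, and usage scenarios. Because of differences in training settings and target tasks, we also evaluate these methods under different configurations for easy and intuitive comparison, rather than only on various benchmarks. Furthermore, we reveal a series of essential but unexploited aspects that may enable the Transformer to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising future research directions are suggested for further investment.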
Open-vocabulary object detection, which is concerned with the problem of detecting novel objects guided by natural language, has gained increasing attention from the community. Ideally, we would like to extend an open-vocabulary detector so that it can produce bounding box predictions based on user inputs in the form of either natural language or an exemplar image. This offers great flexibility and a better user experience for human-computer interaction. To this end, we propose a novel open-vocabulary detector based on DETR (hence the name OV-DETR) which, once trained, can detect any object given its class name or an exemplar image. The biggest challenge of turning DETR into an open-vocabulary detector is that it is impossible to calculate the classification cost matrix of novel classes without access to their labeled images. To overcome this challenge, we formulate the learning objective as a binary matching one between input queries (class name or exemplar image) and the corresponding objects, which learns useful correspondence that generalizes to unseen queries during testing. For training, we choose to condition the Transformer decoder on the input embeddings obtained from a pre-trained vision-language model like CLIP, in order to enable matching for both text and image queries. With extensive experiments on the LVIS and COCO datasets, we demonstrate that our OV-DETR, the first end-to-end Transformer-based open-vocabulary detector, achieves non-trivial improvements over the current state of the art.
The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction, which requires complicated indexing of local image-view features via 3D-to-2D projection. Other methods implicitly introduce geometric positional encoding and perform global attention (e.g., PETR) to build the relationship between image tokens and 3D objects. The 3D-to-2D perspective inconsistency and global attention lead to a weak correlation between foreground tokens and queries, resulting in slow convergence. We propose Focal-PETR, with instance-guided supervision and a spatial alignment module, to adaptively focus object queries on discriminative foreground regions. Focal-PETR additionally introduces a down-sampling strategy to reduce the cost of global attention. Thanks to its highly parallelized implementation and down-sampling strategy, our model, without depth supervision, achieves leading performance on the large-scale nuScenes benchmark and a superior speed of 30 FPS on a single RTX3090 GPU. Extensive experiments show that our method outperforms PETR while consuming 3x fewer training hours. The code will be made publicly available.
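3D object detection has achieved remarkable progress by taking point clouds as the only input. However, point clouds often suffer from incomplete geometric structure and a lack of semantic information, which makes it hard for detectors to accurately classify detected objects. In this work, we focus on how to effectively use object-level information from images to boost the performance of point-based 3D detectors. We present DeMF, a simple yet effective method for fusing image information into point features. Given a set of point features and image feature maps, DeMF adaptively aggregates image features by taking the projected 2D location of each 3D point as a reference. We evaluate our method on the challenging SUN RGB-D dataset, improving state-of-the-art results (+2.1 mAP@0.25 and +2.3 mAP@0.5). Code is available at https://github.com/haoy945/demf.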
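The "projected 2D location of a 3D point" can be illustrated with a standard pinhole camera projection. This is a minimal sketch assuming the point is already expressed in the camera frame and that the intrinsics `fx`, `fy`, `cx`, `cy` are known; the function name and the example values are hypothetical and not taken from the DeMF code.

```python
def project_point(point_3d, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame 3D point to pixel coordinates (u, v)."""
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    # Perspective divide by depth, then scale and shift by the intrinsics.
    u = fx * x / z + cx
    v = fy * y / z + cy
    return (u, v)


# Hypothetical intrinsics for a 640 x 480 image.
reference = project_point((1.0, 0.5, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(reference)  # (570.0, 365.0)
```

The resulting pixel location would then serve as the reference around which image features are gathered for that point.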
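Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by DETR, which excels at object detection, we view scene graph generation as a set prediction problem and propose RelTR, an end-to-end scene graph generation model with an encoder-decoder architecture. The encoder reasons about the visual feature context, while the decoder infers a fixed-size set of subject-predicate-object triplets using different types of attention mechanisms with coupled subject and object queries. We design a set prediction loss that performs the matching between ground-truth and predicted triplets for end-to-end training. In contrast to most existing scene graph generation methods, RelTR is a one-stage method that predicts a set of relationships directly from visual appearance alone, without combining entities and labeling all possible predicates. Extensive experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.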
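Set prediction losses of this kind rest on a one-to-one matching between ground-truth items and predictions. The sketch below finds the minimum-cost assignment by brute force over a small, made-up cost matrix; the function name and the cost values are hypothetical, and the abstract does not say how RelTR defines its matching cost. For realistic sizes, an efficient solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be used instead of enumeration.

```python
from itertools import permutations


def match_predictions(cost):
    """Brute-force minimum-cost one-to-one assignment.

    cost[i][j] is the cost of assigning ground truth i to prediction j;
    there must be at least as many predictions as ground truths.
    """
    num_gt, num_pred = len(cost), len(cost[0])
    best_total, best_assignment = float("inf"), None
    for candidate in permutations(range(num_pred), num_gt):
        total = sum(cost[i][j] for i, j in enumerate(candidate))
        if total < best_total:
            best_total, best_assignment = total, candidate
    return best_assignment, best_total


# Two ground-truth triplets, three predicted triplets (illustrative costs).
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
]
assignment, total = match_predictions(cost)
print(assignment)  # (1, 0)
```

Here ground truth 0 is matched to prediction 1 and ground truth 1 to prediction 0; the unmatched prediction 2 would be supervised as "no relation".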
Person search aims to simultaneously localize and recognize a target person in realistic, uncropped gallery images. One major challenge of person search comes from the contradictory goals of the two sub-tasks: person detection focuses on finding what all persons have in common so as to distinguish them from the background, while person re-identification (re-ID) focuses on the differences among persons. In this paper, we propose a novel Sequential Transformer (SeqTR) for end-to-end person search to deal with this challenge. Our SeqTR contains a detection transformer and a novel re-ID transformer that sequentially address the detection and re-ID tasks. The re-ID transformer comprises a self-attention layer that uses contextual information and a cross-attention layer that learns local fine-grained discriminative features of the human body. Moreover, the re-ID transformer is shared across and supervised by multi-scale features to improve the robustness of the learned person representations. Extensive experiments on two widely used person search benchmarks, CUHK-SYSU and PRW, show that our proposed SeqTR not only outperforms all existing person search methods with a 59.3% mAP on PRW but also achieves performance comparable to the state of the art with an mAP of 94.8% on CUHK-SYSU.