In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple, yet its performance is impressive: CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
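In this paper, we propose PETRv2, a unified framework for 3D perception from multi-view images. Building on PETR, PETRv2 explores the effectiveness of temporal modeling, which uses the temporal information of previous frames to boost 3D object detection. More specifically, we extend the 3D position embedding (3D PE) in PETR for temporal modeling. The 3D PE achieves temporal alignment of object positions across different frames. A feature-guided position encoder is further introduced to improve the data adaptability of the 3D PE. To support high-quality BEV segmentation, PETRv2 provides a simple yet effective solution by adding a set of segmentation queries. Each segmentation query is responsible for segmenting one specific patch of the BEV map. PETRv2 achieves state-of-the-art performance on 3D object detection and BEV segmentation. A detailed robustness analysis is also conducted on the PETR framework. We hope PETRv2 can serve as a strong baseline for 3D perception. Code is available at https://github.com/megvii-research/petr.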
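In this paper, we develop position embedding transformation (PETR) for multi-view 3D object detection. PETR encodes the position information of 3D coordinates into image features, producing 3D position-aware features. Object queries can perceive the 3D position-aware features and perform end-to-end object detection. PETR achieves state-of-the-art performance (50.4% NDS and 44.1% mAP) on the standard nuScenes dataset and ranks first on the benchmark. It can serve as a simple yet strong baseline for future research. Code is available at https://github.com/megvii-research/petr.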
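To make the idea of 3D coordinates tied to image locations more concrete, the sketch below back-projects a strided pixel grid at a few candidate depths into camera-frame 3D points with a pinhole model. It is an illustration only, not the authors' implementation: the function name frustum_points, the intrinsics (fx, fy, cx, cy), the stride, and the depth list are all assumptions, and the abstract does not specify how the resulting coordinates are turned into an embedding.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def frustum_points(
    width: int,
    height: int,
    depths: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    stride: int = 1,
) -> List[Point3D]:
    """Back-project a strided pixel grid at each candidate depth.

    Hypothetical helper: uses the pinhole relations
    x = (u - cx) * d / fx and y = (v - cy) * d / fy, with z = d.
    """
    points: List[Point3D] = []
    for v in range(0, height, stride):      # image rows
        for u in range(0, width, stride):   # image columns
            for d in depths:                # candidate depths along the ray
                x = (u - cx) * d / fx
                y = (v - cy) * d / fy
                points.append((x, y, d))
    return points


if __name__ == "__main__":
    pts = frustum_points(4, 2, [1.0, 2.0], fx=2.0, fy=2.0, cx=2.0, cy=1.0, stride=2)
    for p in pts:
        print(p)
```

Each feature location ends up with one 3D point per candidate depth; a learned encoder could then map these coordinates to an embedding that is combined with the image features, which is the role the abstract assigns to the 3D position-aware features.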
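We propose a novel implicit feature refinement module for high-quality instance segmentation. Existing image/video instance segmentation methods rely on explicitly stacked convolutions to refine instance features before the final prediction. In this paper, we first give an empirical comparison of different refinement strategies, which reveals that the widely used four consecutive convolutions are not necessary. As an alternative, weight-sharing convolution blocks provide competitive performance. When such a block is iterated an infinite number of times, its output eventually converges to an equilibrium state. Based on this observation, implicit feature refinement (IFR) is developed by constructing an implicit function. The equilibrium state of the instance features can be obtained by fixed-point iteration through a simulated infinite-depth network. Our IFR enjoys several advantages: 1) it simulates an infinite-depth refinement network while requiring only the parameters of a single residual block; 2) it produces high-level equilibrium instance features with a global receptive field; 3) it serves as a general plug-and-play module that is easily extended to most object recognition frameworks. Experiments on the COCO and YouTube-VIS benchmarks show that our IFR improves the performance of state-of-the-art image/video instance segmentation frameworks while reducing the parameter burden (e.g., a 1% AP improvement on Mask R-CNN with only 30.0% of the parameters in the mask head). Code is available at https://github.com/lufanma/ifr.git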
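The equilibrium idea can be illustrated with a toy fixed-point iteration. The sketch below repeatedly applies one weight-sharing block z -> tanh(W z + x) until the output stops changing. It is a minimal example under stated assumptions, not the IFR module itself: the block is a small dense map rather than a convolutional residual block, the names block and fixed_point are hypothetical, and W is chosen with row sums below 1 so that the iteration is guaranteed to contract.

```python
import math
from typing import List, Sequence, Tuple


def block(z: Sequence[float], x: Sequence[float], w: Sequence[Sequence[float]]) -> List[float]:
    """One application of the shared block: tanh(W z + x)."""
    return [
        math.tanh(sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + x_i)
        for row, x_i in zip(w, x)
    ]


def fixed_point(
    x: Sequence[float],
    w: Sequence[Sequence[float]],
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> Tuple[List[float], int]:
    """Iterate the same block until successive outputs differ by less than tol."""
    z = [0.0] * len(x)
    for step in range(1, max_iter + 1):
        z_next = block(z, x, w)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, step
        z = z_next
    return z, max_iter


if __name__ == "__main__":
    # Assumed toy weights: absolute row sums are 0.5, so the map is a contraction.
    W = [[0.3, -0.2], [0.1, 0.4]]
    z_star, steps = fixed_point([0.5, -0.2], W)
    residual = max(abs(a - b) for a, b in zip(block(z_star, [0.5, -0.2], W), z_star))
    print(steps, residual)
```

The returned state satisfies z = block(z, x) up to the tolerance, which is what an arbitrarily deep stack of the same block would produce, while only one set of weights is stored.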
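Temporal modeling of objects is a key challenge in multiple object tracking (MOT). Existing methods track by associating detections through motion-based and appearance-based similarity heuristics. The post-processing nature of association prevents end-to-end exploitation of temporal variations in the video sequence. In this paper, we propose MOTR, which extends DETR and introduces track queries to model the tracked instances across the entire video. Track queries are transferred and updated frame by frame to perform iterative prediction over time. We propose tracklet-aware label assignment to train track queries and newborn object queries. We further propose a temporal aggregation network and a collective average loss to enhance temporal relation modeling. Experimental results on DanceTrack show that MOTR significantly outperforms the state-of-the-art method, ByteTrack, by 6.5% on the HOTA metric. On MOT17, MOTR outperforms our concurrent works, TrackFormer and TransTrack, in association performance. MOTR can serve as a stronger baseline for future research on temporal modeling and Transformer-based trackers. Code is available at https://github.com/megvii-research/motr.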
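In this technical report, we introduce our solution to the human-centric spatio-temporal video grounding task. We propose a concise and effective framework named STVGFormer, which models spatio-temporal visual-linguistic dependencies with a static branch and a dynamic branch. The static branch performs cross-modal understanding within a single frame and learns to localize the target object spatially according to intra-frame visual cues such as object appearance. The dynamic branch performs cross-modal understanding across multiple frames. It learns to predict the start and end times of the target moment according to dynamic visual cues such as motion. Both the static and dynamic branches are designed as cross-modal transformers. We further design a novel static-dynamic interaction block that lets the two branches pass useful and complementary information to each other, which is shown to be effective in improving predictions on hard cases. Our proposed method achieved 39.6% vIoU and won first place in the HC-STVG track of the 4th Person in Context Challenge.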
Knowledge graphs (KGs) serve as a key component of various natural language processing applications. Commonsense knowledge graphs (CKGs) are a special type of KG in which entities and relations are composed of free-form text. However, previous work on KG completion and CKG completion suffers from long-tail relations and newly added relations that have few known triples for training. In light of this, few-shot KG completion (FKGC), which requires the strengths of both graph representation learning and few-shot learning, has been proposed to tackle the problem of limited annotated data. In this paper, we comprehensively survey previous attempts at such tasks in the form of a series of methods and applications. Specifically, we first introduce FKGC challenges, commonly used KGs, and CKGs. We then systematically categorize and summarize existing works in terms of the type of KG and the methods. Finally, we present applications of FKGC models to prediction tasks in different areas and share our thoughts on future research directions for FKGC.
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After these steps, performance on novel classes improves significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarked on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive. It is therefore difficult to deploy them on edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution for compressing GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods do not consider fairness. As a consequence, the student model usually inherits and even exaggerates the bias of the teacher GNN. To handle this problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. We then propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
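As background for the teacher-student setup described above, the sketch below computes a generic temperature-softened distillation loss: the KL divergence between the teacher's and the student's softened class distributions for one node. This is the standard distillation objective, shown here as an assumption about what "mimic the behavior" typically means; it is not RELIANT, it contains no fairness term, and the function names and the temperature value are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # teacher soft targets
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]
    print(distillation_loss(teacher, teacher))          # matching student
    print(distillation_loss(teacher, [0.0, 0.0, 0.0]))  # uninformed student
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which also makes clear why a student trained this way can inherit whatever bias the teacher's outputs carry.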
This paper focuses on designing efficient models with few parameters and FLOPs for dense prediction. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even when the framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern Inverted Residual Mobile Block (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependencies and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase Efficient MOdel (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, e.g., our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing SoTA CNN-/Transformer-based models while trading off model accuracy and efficiency well.