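Unlike static images, videos contain additional temporal and spatial information that can support better object detection. However, it is costly to obtain the large number of videos with bounding-box annotations that supervised deep learning requires. Although humans can easily learn to recognize new objects by watching only a few video clips, deep learning usually suffers from overfitting. This raises an important question: how can a video object detector be learned effectively from only a few labeled video clips? In this paper, we study the new problem of few-shot learning for video object detection. We first define the few-shot setting and create a new benchmark dataset for few-shot video object detection derived from the widely used ImageNet VID dataset. We employ a transfer-learning framework to effectively train the video object detector on a large number of base-class objects and a few video clips of novel-class objects. By analyzing the results of two methods under this framework (Joint and Freeze) on our designed weak and strong base datasets, we reveal insufficiency and overfitting problems. A simple but effective method, called Thaw, is naturally developed to trade off the two problems and validate our analysis. Extensive experiments on our proposed benchmark datasets under different scenarios demonstrate the effectiveness of our novel analysis in this new few-shot video object detection problem.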
translated by 谷歌翻译
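Few-shot object detection (FSOD) aims to detect never-before-seen objects using only a few examples. The field has seen recent improvement owing to meta-learning techniques that learn how to match a query image against few-shot class examples, so that the learned model can generalize to few-shot novel classes. However, most current meta-learning-based methods perform pairwise matching between query image regions (usually proposals) and novel classes separately, and therefore fail to take into account the multiple relationships among them. In this paper, we propose a novel FSOD model using heterogeneous graph convolutional networks. Through efficient message passing among all proposal and class nodes with three different types of edges, we obtain context-aware proposal features and query-adaptive, multiclass-enhanced prototype representations for each class, which can help promote pairwise matching and improve final FSOD accuracy. Extensive experimental results show that our proposed model, denoted QA-FewDet, outperforms current state-of-the-art methods on the PASCAL VOC and MSCOCO FSOD benchmarks under different shots and evaluation metrics.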
Conventional training of a deep CNN based object detector demands a large number of bounding box annotations, which may be unavailable for rare categories. In this work we develop a few-shot object detector that can learn to detect novel objects from only a few annotated examples. Our proposed model leverages fully labeled base classes and quickly adapts to novel classes, using a meta feature learner and a reweighting module within a one-stage detection architecture. The feature learner extracts meta features that are generalizable to detect novel object classes, using training data from base classes with sufficient samples. The reweighting module transforms a few support examples from the novel classes to a global vector that indicates the importance or relevance of meta features for detecting the corresponding objects. These two modules, together with a detection prediction module, are trained end-to-end based on an episodic few-shot learning scheme and a carefully designed loss function. Through extensive experiments we demonstrate that our model outperforms well-established baselines by a large margin for few-shot object detection, on multiple datasets and settings. We also present analysis on various aspects of our proposed model, aiming to provide some inspiration for future few-shot detection works.
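The reweighting idea in the abstract above can be illustrated with a minimal sketch. It assumes that the class-specific "global vector" is applied as a per-channel multiplier on the meta features; the abstract does not state the exact operation, and the function name, the feature layout, and the toy numbers below are all hypothetical.

```python
from typing import Dict, List

# A feature map is stored as one list of spatial activations per channel.
FeatureMap = List[List[float]]


def reweight_features(meta_features: FeatureMap, class_vector: List[float]) -> FeatureMap:
    """Scale every channel of the meta features by one class-specific coefficient.

    Assumption: channel-wise multiplication stands in for the reweighting
    step; the original model may combine the two differently.
    """
    if len(meta_features) != len(class_vector):
        raise ValueError("need exactly one coefficient per feature channel")
    return [[w * v for v in channel] for channel, w in zip(meta_features, class_vector)]


# Toy meta features: 3 channels, 2 spatial positions each (made-up values).
features: FeatureMap = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Hypothetical reweighting vectors, one per novel class.
class_vectors: Dict[str, List[float]] = {
    "cat": [1.0, 0.0, 0.5],
    "dog": [0.0, 2.0, 1.0],
}

# One class-specific feature map per novel class, ready for a detection head.
class_specific = {name: reweight_features(features, vec) for name, vec in class_vectors.items()}
```

In this sketch, a coefficient of zero suppresses a channel entirely for that class, which is one way to read "importance or relevance of meta features".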
Few-shot object detection (FSOD), which aims at learning a generic detector that can adapt to unseen tasks with scarce training samples, has witnessed consistent improvement recently. However, most existing methods ignore the efficiency issues, e.g., high computational complexity and slow adaptation speed. Notably, efficiency has become an increasingly important evaluation metric for few-shot techniques due to an emerging trend toward embedded AI. To this end, we present an efficient pretrain-transfer framework (PTF) baseline with no computational increment, which achieves comparable results with previous state-of-the-art (SOTA) methods. Upon this baseline, we devise an initializer named knowledge inheritance (KI) to reliably initialize the novel weights for the box classifier, which effectively facilitates the knowledge transfer process and boosts the adaptation speed. Within the KI initializer, we propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights. Finally, our approach not only achieves the SOTA results across three public benchmarks, i.e., PASCAL VOC, COCO and LVIS, but also exhibits high efficiency with 1.8-100x faster adaptation speed against the other methods on COCO/LVIS benchmark during few-shot transfer. To our best knowledge, this is the first work to consider the efficiency problem in FSOD. We hope to motivate a trend toward powerful yet efficient few-shot technique development. The codes are publicly available at https://github.com/Ze-Yang/Efficient-FSOD.
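The length inconsistency that the abstract above attributes to predicted novel weights can be sketched as follows. This is only an illustration under a stated assumption: the predicted novel weight is rescaled to the mean L2 norm of the pretrained base weights. The abstract does not give the actual ALR rule, and the helper names and numbers here are hypothetical.

```python
import math
from typing import List


def l2_norm(vector: List[float]) -> float:
    """Euclidean length of a weight vector."""
    return math.sqrt(sum(x * x for x in vector))


def rescale_to_base_length(novel: List[float], base_weights: List[List[float]]) -> List[float]:
    """Rescale a predicted novel-class weight to the average base-weight length.

    Assumption: matching the mean L2 norm of the base weights is used here
    as a simple stand-in for the adaptive rule described in the paper.
    """
    target = sum(l2_norm(w) for w in base_weights) / len(base_weights)
    norm = l2_norm(novel)
    if norm == 0.0:
        return list(novel)  # a zero vector has no direction to preserve
    scale = target / norm
    return [x * scale for x in novel]


# Toy classifier weights (made-up values): base norms are 5 and 10.
base = [[3.0, 4.0], [6.0, 8.0]]
novel = [0.6, 0.8]  # predicted novel weight with a much smaller norm

rescaled = rescale_to_base_length(novel, base)
```

The rescaling changes only the length of the novel weight, not its direction, so the novel class competes with base classes on a comparable logit scale.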
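Few-shot object detection has been extensively investigated by incorporating meta-learning into region-based detection frameworks. Despite its success, this paradigm is still constrained by several factors, such as (i) low-quality region proposals for novel classes and (ii) neglect of the inter-class correlation among different classes. Such limitations hinder the generalization of base-class knowledge to the detection of novel-class objects. In this work, we design Meta-DETR, which (i) is the first image-level few-shot detector and (ii) introduces a novel inter-class correlational meta-learning strategy to capture and leverage the correlation among different classes for robust and accurate few-shot object detection. Meta-DETR works entirely at the image level without any region proposals, which circumvents the constraint of inaccurate proposals in prevalent few-shot detection frameworks. In addition, the introduced correlational meta-learning enables Meta-DETR to attend to multiple support classes simultaneously within a single feed-forward pass, which allows it to capture the inter-class correlation among different classes, thereby significantly reducing misclassification among similar classes and enhancing knowledge generalization to novel classes. Experiments on multiple few-shot object detection benchmarks show that the proposed Meta-DETR outperforms state-of-the-art methods by large margins. The implementation code is available at https://github.com/zhanggongjie/meta-detr.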
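We introduce few-shot video object detection (FSVOD) with three contributions to visual learning in our highly diverse and dynamic world: 1) a large-scale video dataset, FSVOD-500, comprising 500 classes with videos in each category for few-shot learning; 2) a novel Tube Proposal Network (TPN) that generates high-quality video tube proposals for aggregating feature representations of the target video object, which can be highly dynamic; 3) a strategically improved Temporal Matching Network (TMN+) for matching representative query tube features with better discriminative ability, thus achieving higher diversity. Our TPN and TMN+ are trained jointly and end-to-end. Extensive experiments demonstrate that our method produces significantly better detection results on two few-shot video object detection datasets than image-based methods and other video-based extensions. Code and datasets will be released at https://github.com/fanq15/fewx.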
Conventional methods for object detection typically require a substantial amount of training data and preparing such high-quality training data is very labor-intensive. In this paper, we propose a novel few-shot object detection network that aims at detecting objects of unseen categories with only a few annotated examples. Central to our method are our Attention-RPN, Multi-Relation Detector and Contrastive Training strategy, which exploit the similarity between the few shot support set and query set to detect novel objects while suppressing false detection in the background. To train our network, we contribute a new dataset that contains 1000 categories of various objects with high-quality annotations. To the best of our knowledge, this is one of the first datasets specifically designed for few-shot object detection. Once our few-shot network is trained, it can detect objects of unseen categories without further training or finetuning. Our method is general and has a wide range of potential applications. We produce a new state-of-the-art performance on different datasets in the few-shot setting. The dataset link is https://github.com/fanq15/Few-Shot-Object-Detection-Dataset.
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem and again achieves state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip/MUPPET
Few-shot classification aims to learn a classifier to recognize unseen classes during training with limited labeled examples. While significant progress has been made, the growing complexity of network designs, meta-learning algorithms, and differences in implementation details make a fair comparison difficult. In this paper, we present 1) a consistent comparative analysis of several representative few-shot classification algorithms, with results showing that deeper backbones significantly reduce the performance differences among methods on datasets with limited domain differences, 2) a modified baseline method that surprisingly achieves competitive performance when compared with the state-of-the-art on both the mini-ImageNet and the CUB datasets, and 3) a new experimental setting for evaluating the cross-domain generalization ability for few-shot classification algorithms. Our results reveal that reducing intra-class variation is an important factor when the feature backbone is shallow, but not as critical when using deeper backbones. In a realistic cross-domain evaluation setting, we show that a baseline method with a standard fine-tuning practice compares favorably against other state-of-the-art few-shot learning algorithms.
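Most existing work on few-shot object detection (FSOD) focuses on a setting in which the pre-training and few-shot learning datasets come from similar domains. However, few-shot algorithms are important in multiple domains; hence, evaluation needs to reflect this broad range of applications. We propose a Multi-dOmain Few-Shot Object Detection (MoFSOD) benchmark, consisting of 10 datasets from a wide range of domains, to evaluate FSOD algorithms. We comprehensively analyze the effects of freezing layers, different architectures, and different pre-training datasets on FSOD performance. Our empirical results reveal several key factors that have not been explored in previous work: 1) contrary to previous belief, on a multi-domain benchmark, fine-tuning (FT) is a strong baseline for FSOD, performing on par with or better than state-of-the-art (SOTA) algorithms; 2) using FT as the baseline allows us to explore multiple architectures, and we find that they have a significant impact on downstream few-shot tasks, even with similar pre-training performance; 3) by decoupling pre-training and few-shot learning, MoFSOD allows us to explore the impact of different pre-training datasets, and the right choice can significantly boost the performance of downstream tasks. Based on these findings, we list possible avenues of investigation for improving FSOD performance and make two simple modifications to existing algorithms that lead to SOTA performance on the MoFSOD benchmark. The code is available at https://github.com/amazon-research/few-shot-object-detection-benchmark.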
Most existing 3D point cloud object detection approaches heavily rely on large amounts of labeled training data. However, the labeling process is costly and time-consuming. This paper considers few-shot 3D point cloud object detection, where only a few annotated samples of novel classes are needed with abundant samples of base classes. To this end, we propose Prototypical VoteNet to recognize and localize novel instances, which incorporates two new modules: Prototypical Vote Module (PVM) and Prototypical Head Module (PHM). Specifically, as the 3D basic geometric structures can be shared among categories, PVM is designed to leverage class-agnostic geometric prototypes, which are learned from base classes, to refine local features of novel categories. Then PHM is proposed to utilize class prototypes to enhance the global feature of each object, facilitating subsequent object localization and classification, which is trained by the episodic training strategy. To evaluate the model in this new setting, we contribute two new benchmark datasets, FS-ScanNet and FS-SUNRGBD. We conduct extensive experiments to demonstrate the effectiveness of Prototypical VoteNet, and our proposed method shows significant and consistent improvements compared to baselines on two benchmark datasets.
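The lack of large-scale real datasets with annotations makes transfer learning a necessity for video activity understanding. We aim to develop an effective method for few-shot transfer learning for action classification. We leverage independently trained local visual cues to learn representations that can be transferred from a source domain to a different target domain using only a few examples. The visual cues we use include object-object interactions, hands, and motion within regions that are a function of hand locations. We employ a meta-learning-based framework to extract the distinctive and domain-invariant components of the deployed visual cues. This enables the transfer of action classification models across public datasets captured with diverse scene and action configurations. We present comparative results for our transfer-learning method and compare against state-of-the-art action classification methods for both inter-class and inter-dataset transfer.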