We report competitive results on object detection and instance segmentation on the COCO dataset using standard models trained from random initialization. The results are no worse than their ImageNet pre-training counterparts even when using the hyper-parameters of the baseline system (Mask R-CNN) that were optimized for fine-tuning pre-trained models, with the sole exception of increasing the number of training iterations so the randomly initialized models may converge. Training from random initialization is surprisingly robust; our results hold even when: (i) using only 10% of the training data, (ii) for deeper and wider models, and (iii) for multiple tasks and metrics. Experiments show that ImageNet pre-training speeds up convergence early in training, but does not necessarily provide regularization or improve final target task accuracy. To push the envelope we demonstrate 50.9 AP on COCO object detection without using any external data, a result on par with the top COCO 2017 competition results that used ImageNet pre-training. These observations challenge the conventional wisdom of ImageNet pre-training for dependent tasks and we expect these discoveries will encourage people to rethink the current de facto paradigm of 'pre-training and fine-tuning' in computer vision.
Group Normalization
Yuxin Wu, Kaiming He
2018-03-22
Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. This limits BN's usage for training larger models and transferring features to computer vision tasks including detection, segmentation, and video, which require small batches constrained by memory consumption. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO, and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented by a few lines of code in modern libraries.
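The grouping step described in this abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `group_norm`, the list-of-channels layout, and the `eps` value are assumptions made for illustration, and the learnable per-channel scale and shift used in practice are omitted.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample x, given as a list of channels (each a flat list of floats).

    Channels are split into num_groups contiguous groups; the mean and variance
    are computed over all values in a group. No batch statistics are involved.
    """
    num_channels = len(x)
    if num_channels % num_groups != 0:
        raise ValueError("num_channels must be divisible by num_groups")
    per_group = num_channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in channel] for channel in group])
    return out

# Hypothetical input: 4 channels with 2 values each, normalized in 2 groups.
sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]
normalized = group_norm(sample, num_groups=2)
```

Because the statistics are computed per sample, the result for one input does not change with the number of other samples in the batch, which is the property the abstract emphasizes.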
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning [29] as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder. This enables building a large and consistent dictionary on-the-fly that facilitates contrastive unsupervised learning. MoCo provides competitive results under the common linear protocol on ImageNet classification. More importantly, the representations learned by MoCo transfer well to downstream tasks. MoCo can outperform its supervised pre-training counterpart in 7 detection/segmentation tasks on PASCAL VOC, COCO, and other datasets, sometimes surpassing it by large margins. This suggests that the gap between unsupervised and supervised representation learning has been largely closed in many vision tasks.
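The two ingredients named in this abstract, a queue of keys and a moving-averaged encoder, can be sketched in a few lines. This is an illustrative sketch only: the names `momentum_update` and `KeyQueue`, the flat parameter lists, and the momentum values are assumptions, and the contrastive loss itself is not shown.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter a small step toward the query-encoder parameter."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class KeyQueue:
    """Fixed-size FIFO dictionary of encoded keys; the oldest keys are dropped first."""

    def __init__(self, max_size):
        self.keys = deque(maxlen=max_size)

    def enqueue(self, batch_keys):
        self.keys.extend(batch_keys)

# Hypothetical two-parameter encoders; m=0.9 is chosen only to make the effect visible.
query_params = [1.0, -2.0]
key_params = [0.0, 0.0]
key_params = momentum_update(key_params, query_params, m=0.9)

# Strings stand in for encoded key vectors.
queue = KeyQueue(max_size=4)
queue.enqueue(["k0", "k1", "k2"])
queue.enqueue(["k3", "k4"])  # "k0" is evicted once the queue is full
```

The queue decouples the dictionary size from the mini-batch size, while the slow update keeps keys encoded at different steps comparable to each other.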
Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation (e.g., [13,12]) for instance segmentation where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g. self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
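The random pasting mechanism described above can be sketched as follows. This is a simplified illustration, not the paper's pipeline: the function name `copy_paste`, the nested-list image format, and the example values are assumptions, and steps such as scale jittering or updating the masks of occluded objects are left out.

```python
import random

def copy_paste(target, source, source_mask, rng):
    """Paste the masked pixels of `source` onto a copy of `target` at a random offset.

    Images are 2D lists (rows of pixel values); source_mask holds 0/1 flags.
    Returns the new image and the pasted object's mask in target coordinates.
    """
    th, tw = len(target), len(target[0])
    sh, sw = len(source), len(source[0])
    top = rng.randint(0, th - sh)    # random vertical placement, no context modeling
    left = rng.randint(0, tw - sw)   # random horizontal placement
    result = [row[:] for row in target]
    pasted_mask = [[0] * tw for _ in range(th)]
    for i in range(sh):
        for j in range(sw):
            if source_mask[i][j]:
                result[top + i][left + j] = source[i][j]
                pasted_mask[top + i][left + j] = 1
    return result, pasted_mask

# Hypothetical data: a blank 4x5 target and a 2x2 source with a 3-pixel object.
target = [[0] * 5 for _ in range(4)]
source = [[7, 7], [7, 7]]
source_mask = [[1, 0], [1, 1]]
image, mask = copy_paste(target, source, source_mask, random.Random(0))
```

The returned mask gives the new instance annotation for the pasted object, so the augmented image can be used directly as an extra training example.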
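A common practice in transfer learning is to initialize the downstream model weights by pre-training on a data-abundant upstream task. In object detection, the feature backbone is typically initialized with ImageNet classifier weights and fine-tuned on the object detection task. Recent works show that this is not strictly necessary under longer training regimes and provide recipes for training the backbone from scratch. We investigate the opposite direction of this end-to-end training trend: we show that an extreme form of knowledge preservation, freezing the classifier-initialized backbone, consistently improves many different detection models and leads to considerable resource savings. We hypothesize and corroborate experimentally that the capacity and structure of the remaining detector components are crucial factors in leveraging the frozen backbone. Immediate applications of our findings include performance improvements on hard cases, such as detection of long-tail object classes, and savings in computational and memory resources that help make the field more accessible to researchers with fewer computational resources.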
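Object detection is a central downstream task used to test whether pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random initialization baseline. Our results show that recent masking-based unsupervised learning methods may provide convincing transfer learning improvements on COCO, increasing box AP by up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, these masking-based initializations scale better, with the improvement growing as model size increases.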
Pre-training is a dominant paradigm in computer vision. For example, supervised ImageNet pre-training is commonly used to initialize the backbones of object detection and segmentation models. He et al. [1], however, show a contrasting result that ImageNet pre-training has limited impact on COCO object detection. Here we investigate self-training as another method to utilize additional data on the same setup and contrast it against ImageNet pre-training. Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled data further diminish the value of pre-training, 2) unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and 3) in the case that pre-training is helpful, self-training improves upon pre-training. For example, on the COCO object detection dataset, pre-training helps when we use one fifth of the labeled data, and hurts accuracy when we use all labeled data. Self-training, on the other hand, shows positive improvements from +1.3 to +3.4 AP across all dataset sizes. In other words, self-training works well on exactly the same setup where pre-training does not work (using ImageNet to help COCO). On the PASCAL segmentation dataset, which is a much smaller dataset than COCO, though pre-training does help significantly, self-training improves upon the pre-trained model. On COCO object detection, we achieve 54.3 AP, an improvement of +1.5 AP over the strongest SpineNet model. On PASCAL segmentation, we achieve 90.5 mIOU, an improvement of +1.5% mIOU over the previous state-of-the-art result by DeepLabv3+. Authors contributed equally. Code and checkpoints for our models are available at https://github.com/tensorflow/tpu/tree/master/models/official/detection/projects/self_training. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between 'enormous data' and visual deep learning. By exploiting the JFT-300M dataset, which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically with the volume of training data. Second, we show that representation learning (or pre-training) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires the vision community to not undervalue the data and develop collective efforts in building larger datasets.
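Benchmarks, such as COCO, play a crucial role in object detection. However, existing benchmarks are insufficient in scale variation, and their protocols are inadequate for fair comparison. In this paper, we introduce the Universal-Scale object detection Benchmark (USB). USB has variations in object scales and image domains by incorporating COCO with the recently proposed Waymo Open Dataset and Manga109-S dataset. To enable fair comparison and inclusive research, we propose training and evaluation protocols. They have multiple divisions for training epochs and evaluation image resolutions, like weight classes in sports, and compatibility across training protocols, like the backward compatibility of the Universal Serial Bus. Specifically, we request participants to report results with not only higher protocols (longer training) but also lower protocols (shorter training). Using the proposed benchmark and protocols, we analyzed eight methods and found weaknesses of existing COCO-biased methods. The code is available at https://github.com/shinya7y/universenet.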
In object detection, the intersection over union (IoU) threshold is frequently used to define positives/negatives. The threshold used to train a detector defines its quality. While the commonly used threshold of 0.5 leads to noisy (low-quality) detections, detection performance frequently degrades for larger thresholds. This paradox of high-quality detection has two causes: 1) overfitting, due to vanishing positive samples for large thresholds, and 2) inference-time quality mismatch between detector and test hypotheses. A multi-stage object detection architecture, the Cascade R-CNN, composed of a sequence of detectors trained with increasing IoU thresholds, is proposed to address these problems. The detectors are trained sequentially, using the output of a detector as training set for the next. This resampling progressively improves hypotheses quality, guaranteeing a positive training set of equivalent size for all detectors and minimizing overfitting. The same cascade is applied at inference, to eliminate quality mismatches between hypotheses and detectors. An implementation of the Cascade R-CNN without bells or whistles achieves state-of-the-art performance on the COCO dataset, and significantly improves high-quality detection on generic and specific object detection datasets, including VOC, KITTI, CityPerson, and WiderFace. Finally, the Cascade R-CNN is generalized to instance segmentation, with nontrivial improvements over the Mask R-CNN. To facilitate future research, two implementations are made available at https://github.com/zhaoweicai/cascade-rcnn (Caffe) and https://github.com/zhaoweicai/Detectron-Cascade-RCNN (Detectron).
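The effect of the IoU threshold on which proposals count as positives can be illustrated with a short sketch. It shows only the labeling step under assumptions: the helper names `iou` and `label_proposals`, the example boxes, and the stage thresholds (0.5, 0.6, 0.7) are illustrative choices, and the box refinement that a cascade performs between stages is not modeled.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_box, threshold):
    """Mark a proposal as positive when its IoU with the ground truth meets the threshold."""
    return [iou(p, gt_box) >= threshold for p in proposals]

# Hypothetical ground-truth box and proposals with IoU 1.0, 0.65, and about 0.14.
gt = (0.0, 0.0, 10.0, 10.0)
proposals = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 6.5), (5.0, 5.0, 15.0, 15.0)]
stage_thresholds = (0.5, 0.6, 0.7)  # assumed increasing thresholds, one per stage
labels_per_stage = [label_proposals(proposals, gt, t) for t in stage_thresholds]
positives_per_stage = [sum(labels) for labels in labels_per_stage]
```

With fixed proposals the number of positives shrinks as the threshold rises, which is the vanishing-positives problem the abstract describes; the cascade counters it by resampling from the improved output of the previous stage.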
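Adder neural networks (AdderNets) have shown impressive performance on image classification using only addition operations, which are more energy efficient than traditional convolutional neural networks built with multiplications. Compared with classification, there is a strong demand for reducing the energy consumption of modern object detectors via AdderNets in real-world applications such as autonomous driving and face detection. In this paper, we present an empirical study of AdderNets for object detection. We first reveal that the batch normalization statistics in the pre-trained adder backbone should not be frozen, because of the relatively large feature variance of AdderNets. Moreover, we insert more shortcut connections in the neck and design a new feature fusion architecture to avoid the sparse features of adder layers. We present extensive ablation studies to explore several design choices of adder detectors. Comparisons with the state of the art are conducted on the COCO and PASCAL VOC benchmarks. Specifically, the proposed Adder FCOS achieves 37.8% AP on the COCO val set, demonstrating performance comparable to that of its convolutional counterpart with about a 1.4x energy reduction.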
To date, most existing self-supervised learning methods are designed and optimized for image classification. These pre-trained models can be sub-optimal for dense prediction tasks due to the discrepancy between image-level prediction and pixel-level prediction. To fill this gap, we aim to design an effective, dense self-supervised learning method that directly works at the level of pixels (or local features) by taking into account the correspondence between local features. We present dense contrastive learning (DenseCL), which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images. Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin. Specifically, over the strong MoCo-v2 baseline, our method achieves significant improvements of 2.0% AP on PASCAL VOC object detection, 1.1% AP on COCO object detection, 0.9% AP on COCO instance segmentation, 3.0% mIoU on PASCAL VOC semantic segmentation and 1.8% mIoU on Cityscapes semantic segmentation.
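We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design), and (ii) it is sufficient to use window attention (without shifting) aided by very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.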
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes, from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 6 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
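The top-down pathway with lateral connections can be sketched on toy one-dimensional feature maps. This is a heavily simplified illustration under assumptions: the function names, the 1D lists standing in for 2D feature maps, and the identity lateral connection (in place of a learned 1x1 convolution) are all illustrative, and the smoothing convolution applied after merging is omitted.

```python
def upsample2x(feature):
    """Nearest-neighbor upsampling of a 1D feature map by a factor of 2."""
    return [v for v in feature for _ in range(2)]

def top_down_pyramid(bottom_up):
    """Build pyramid levels from bottom-up features ordered fine to coarse.

    Each entry in bottom_up is a 1D list half the length of the previous one.
    The lateral connection is the identity here (stand-in for a 1x1 convolution).
    """
    pyramid = [bottom_up[-1][:]]  # the coarsest level starts the top-down pathway
    for lateral in reversed(bottom_up[:-1]):
        upsampled = upsample2x(pyramid[0])
        merged = [a + b for a, b in zip(lateral, upsampled)]  # element-wise addition
        pyramid.insert(0, merged)
    return pyramid

# Hypothetical bottom-up features at three scales (fine to coarse).
c3 = [1.0, 1.0, 1.0, 1.0]
c4 = [2.0, 3.0]
c5 = [10.0]
p3, p4, p5 = top_down_pyramid([c3, c4, c5])
```

Each finer level receives the coarser level's semantics through upsampling while keeping its own resolution, which is how the architecture obtains high-level features at all scales.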
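In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it on ImageNet classification, COCO detection and Kinetics video recognition, where it outperforms prior work. We further compare MViT's pooling attention to window attention mechanisms, where it outperforms the latter in accuracy/compute. Without bells and whistles, MViT has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 56.1 box AP on COCO object detection, and 86.1% on Kinetics-400 video classification. Code and models will be made publicly available.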
Object detection with transformers (DETR) reaches competitive performance with Faster R-CNN via a transformer encoder-decoder architecture. Inspired by the great success of pre-training transformers in natural language processing, we propose a pretext task named random query patch detection to Unsupervisedly Pre-train DETR (UP-DETR) for object detection. Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder. The model is pre-trained to detect these query patches from the original image. During the pre-training, we address two critical issues: multi-task learning and multi-query localization. (1) To trade off classification and localization preferences in the pretext task, we freeze the CNN backbone and propose a patch feature reconstruction branch which is jointly optimized with patch detection. (2) To perform multi-query localization, we introduce UP-DETR from single-query patch and extend it to multi-query patches with object query shuffle and attention mask. In our experiments, UP-DETR significantly boosts the performance of DETR with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation. Code and pre-training models: https://github.com/dddzg/up-detr.
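Pre-training models on large-scale datasets, like ImageNet, is a standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are orders of magnitude smaller than ImageNet. Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets from different domains. On COCO, when pre-training solely using COCO images, the detection and instance segmentation performance surpasses supervised ImageNet pre-training in a comparable setting.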
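We focus on better understanding the critical factors of augmentation-invariant representation learning. We revisit MoCo v2 and BYOL and try to verify the following hypothesis: different frameworks bring about representations with different characteristics even with the same pretext task. We establish the first benchmark for fair comparisons between MoCo v2 and BYOL, and observe: (i) sophisticated model configurations enable better adaptation to the pre-training dataset; (ii) mismatched optimization strategies between pre-training and fine-tuning hinder the model from achieving competitive transfer performance. Given the fair benchmark, we investigate further and find that asymmetry of the network structure enables contrastive frameworks to work well under the linear evaluation protocol, while it may hurt transfer performance on long-tailed classification tasks. Moreover, neither negative samples nor the asymmetric network structure make the model more sensitive to the choice of data augmentations. We believe our findings provide useful information for future work.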
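The goal of contrastive learning based pre-training is to leverage large quantities of unlabeled data to produce a model that can be readily adapted downstream. Current approaches revolve around solving an image discrimination task: given an anchor image, an augmented counterpart of that image, and some other images, the model must produce representations such that the distance between the anchor and its counterpart is small, and the distances between the anchor and the other images are large. There are two significant problems with this approach: (i) by contrasting representations at the image level, it is hard to generate detailed object-sensitive features that are beneficial to downstream object-level tasks such as instance segmentation; (ii) the augmentation strategy for producing the augmented counterpart is fixed, making learning less effective at the later stages of pre-training. In this work, we introduce Curricular Contrastive Object-level Pre-training (CCOP) to tackle these problems: (i) we use selective search to find rough object regions and use them to build an inter-image object-level contrastive loss and an intra-image object-level discrimination loss into our pre-training objective; (ii) we present a curriculum learning mechanism that adaptively augments the generated regions, which allows the model to consistently acquire a useful learning signal, even in the later stages of pre-training. Our experiments show that our approach improves on the MoCo v2 baseline by a large margin on multiple object-level tasks when pre-training on multi-object scene image datasets. Code is available at https://github.com/chenhongyiyang/ccop.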