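Recent progress in image recognition has stimulated the deployment of vision systems at an unprecedented scale. As a result, visual data are now often consumed not only by humans but also by machines. Existing image processing methods optimize only for better human perception, yet the resulting images may not be accurately recognized by machines. This can be undesirable, e.g., the images may be improperly handled by search engines or recommendation systems. In this work, we examine simple approaches to improve machine recognition of processed images: optimizing the recognition loss directly on the image processing network or through an intermediate transforming model. Interestingly, the processing model's ability to enhance recognition quality can transfer when evaluated on models with different architectures, recognized categories, tasks, and training datasets. This makes the methods applicable even when we have no knowledge of future recognition models, e.g., if we upload processed images to the Internet. We conduct experiments on multiple image processing tasks, with ImageNet classification and PASCAL VOC detection as recognition tasks. With these simple yet effective methods, substantial accuracy gains can be achieved with strong transferability and minimal loss of image quality. Through a user study, we further show that the accuracy gain can transfer to a black-box cloud model. Finally, we try to explain this transferability phenomenon by demonstrating the similarities between different models' decision boundaries. Code is available at https://github.com/liuzhuang13/transferable_ra.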
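As a rough illustration of the first approach described above (adding a recognition loss to the image processing objective), the following minimal sketch combines a pixel-wise processing loss with a classification loss. The function names, the choice of MSE as the processing loss, and the weight `lam` are assumptions made for illustration; the abstract does not specify them.

```python
import math

def mse(output, target):
    """Mean squared error between two equal-length pixel lists."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def recognition_aware_loss(output, target, logits, label, lam=0.1):
    """Processing loss plus a weighted recognition loss.

    `logits` stands for the output of a fixed recognition model applied to the
    processed image; `lam` is a hypothetical trade-off weight.
    """
    return mse(output, target) + lam * cross_entropy(logits, label)

# Toy values: a 2-pixel "image" and a 2-class recognizer.
loss = recognition_aware_loss([1.0, 3.0], [1.0, 1.0], [0.0, 0.0], label=1, lam=1.0)
```

In a real training loop the gradient of this combined loss would flow back into the processing network only, with the recognition model kept fixed; that detail is also an assumption of this sketch.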
An intriguing property of deep neural networks is the existence of adversarial examples, which can transfer among different architectures. These transferable adversarial examples may severely hinder deep neural network-based applications. Previous works mostly study the transferability using small scale datasets. In this work, we are the first to conduct an extensive study of the transferability over large models and a large scale dataset, and we are also the first to study the transferability of targeted adversarial examples with their target labels. We study both non-targeted and targeted adversarial examples, and show that while transferable non-targeted adversarial examples are easy to find, targeted adversarial examples generated using existing approaches almost never transfer with their target labels. Therefore, we propose novel ensemble-based approaches to generating transferable adversarial examples. Using such approaches, we observe, for the first time, a large proportion of targeted adversarial examples that are able to transfer with their target labels. We also present some geometric studies to help understand the transferable adversarial examples. Finally, we show that the adversarial examples generated using ensemble-based approaches can successfully attack Clarifai.com, which is a black-box image classification system. * Work is done while visiting UC Berkeley.
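As the computing power of modern hardware increases rapidly, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This progress is mainly attributed to the representation ability of the transformer and its variant architectures. In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution, and deraining) and develop a new pre-trained model, namely the image processing transformer (IPT). To fully exploit the capability of the transformer, we propose using the well-known ImageNet benchmark to generate a large number of corrupted image pairs. The IPT model is trained on these images with multiple heads and multiple tails. In addition, contrastive learning is introduced to adapt well to different image processing tasks. The pre-trained model can therefore be used efficiently on the desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks. Code is available at https://github.com/huawei-noah/Pretrained-IPT and https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/ipt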
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between 'enormous data' and visual deep learning. By exploiting the JFT-300M dataset, which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically with the volume of training data. Second, we show that representation learning (or pretraining) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires the vision community to not undervalue the data and develop collective efforts in building larger datasets.
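The logarithmic relationship mentioned above can be read as score ≈ a + b · log10(N) for N training images. The sketch below fits such a curve by ordinary least squares; the function name and the data points are synthetic and purely illustrative, not results from the paper.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * log10(size); returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic, made-up points generated from a known curve (not paper results).
sizes = [10_000_000, 30_000_000, 100_000_000, 300_000_000]
scores = [30.0 + 2.0 * math.log10(n) for n in sizes]
a, b = fit_log_linear(sizes, scores)
# Under this model, every 10x increase in data adds b points to the score.
```

Under such a model, gains per added image shrink as the dataset grows, while gains per tenfold increase stay constant.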
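Super-resolution (SR) is a fundamental and representative task in low-level vision. It is generally thought that the features extracted from an SR network carry no specific semantic information, and that the network simply learns a complex non-linear mapping from input to output. Can we find any "semantics" in SR networks? In this paper, we give an affirmative answer to this question. By analyzing the feature representations with dimensionality reduction and visualization, we successfully discover deep semantic representations in SR networks, i.e., deep degradation representations (DDR), which relate to image degradation types and degrees. We also reveal the differences in representation semantics between classification and SR networks. Through extensive experiments and analysis, we draw a series of observations and conclusions that are of great significance for future work, such as interpreting the intrinsic mechanisms of low-level CNN networks and developing new evaluation approaches for blind SR.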
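We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models. Instead of randomly inserting mask tokens into the input embeddings in the spatial domain, in this paper we shift the perspective to the frequency domain. Specifically, MFM first masks out a portion of the frequency components of the input image and then predicts the missing frequencies on the frequency spectrum. Our key insight is that, due to heavy spatial redundancy, predicting masked components in the frequency domain is better suited to revealing underlying image patterns than predicting masked patches in the spatial domain. Our findings suggest that, with the right configuration of the mask-and-predict strategy, both the structural information within high-frequency components and the low-level statistics among their low-frequency counterparts are useful for learning good representations. For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even when using none of the following: (i) extra data, (ii) an extra model, (iii) mask tokens. Experimental results on ImageNet and several robustness benchmarks show the competitive performance and advanced robustness of MFM compared with recent masked image modeling approaches. Furthermore, we comprehensively investigate the effectiveness of classical image restoration tasks for representation learning from a unified frequency perspective and reveal their intriguing relations with our MFM approach. Project page: https://www.mmlab-ntu.com/project/mfm/index.html.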
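The masking step described above can be sketched on a tiny image with a naive 2D discrete Fourier transform. This is only an illustration of masking frequency components, not the authors' implementation: the helper names, the circular low-pass mask, and the radius are assumptions, and a real pipeline would use an FFT library and a learned predictor.

```python
import cmath

def dft2(img):
    """Naive 2D discrete Fourier transform of a small H x W grid."""
    h, w = len(img), len(img[0])
    return [[sum(img[y][x] * cmath.exp(-2j * cmath.pi * (u * y / h + v * x / w))
                 for y in range(h) for x in range(w))
             for v in range(w)] for u in range(h)]

def idft2(spec):
    """Inverse of dft2; returns the real part of each pixel."""
    h, w = len(spec), len(spec[0])
    return [[sum(spec[u][v] * cmath.exp(2j * cmath.pi * (u * y / h + v * x / w))
                 for u in range(h) for v in range(w)).real / (h * w)
             for x in range(w)] for y in range(h)]

def mask_high_frequencies(spec, radius):
    """Zero every component whose wrapped distance to the DC term exceeds radius."""
    h, w = len(spec), len(spec[0])
    out = []
    for u in range(h):
        row = []
        for v in range(w):
            du, dv = min(u, h - u), min(v, w - v)  # wrap-around frequency index
            keep = (du * du + dv * dv) ** 0.5 <= radius
            row.append(spec[u][v] if keep else 0j)
        out.append(row)
    return out

# A 4x4 checkerboard: all of its structure sits in one high-frequency component.
image = [[0.0, 1.0, 0.0, 1.0],
         [1.0, 0.0, 1.0, 0.0],
         [0.0, 1.0, 0.0, 1.0],
         [1.0, 0.0, 1.0, 0.0]]
spectrum = dft2(image)
masked_input = idft2(mask_high_frequencies(spectrum, radius=1.0))  # model input
full_recon = idft2(mask_high_frequencies(spectrum, radius=3.0))    # nothing masked
```

In a pre-training setup along these lines, the network would receive `masked_input` and be trained to predict the spectrum entries that were zeroed out.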
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes, from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance.
Jitendra Malik once said, "Supervision is the opium of the AI researcher". Most deep learning techniques heavily rely on extreme amounts of human labels to work effectively. In today's world, the rate of data creation greatly surpasses the rate of data annotation. Full reliance on human annotations is just a temporary means to solve current closed problems in AI. In reality, only a tiny fraction of data is annotated. Annotation Efficient Learning (AEL) is a study of algorithms to train models effectively with fewer annotations. To thrive in AEL environments, we need deep learning techniques that rely less on manual annotations (e.g., image, bounding-box, and per-pixel labels), but learn useful information from unlabeled data. In this thesis, we explore five different techniques for handling AEL.
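Several image processing tasks, such as image classification and object detection, have been significantly improved by convolutional neural networks (CNNs). Many architectures, such as ResNet and EfficientNet, achieved outstanding results on at least one dataset at the time of their creation. A critical factor in training is the regularization of the network, which prevents the model from overfitting. This work analyzes several regularization methods developed in the past few years that show significant improvements for different CNN models. The works are classified into three main areas: the first, called "data augmentation", covers techniques that focus on changing the input data. The second, named "internal changes", describes procedures that modify the feature maps generated by the neural network or its kernels. The last, called "label", concerns transforming the labels of a given input. This work differs from other available surveys on regularization in two main ways: (i) the papers gathered in the manuscript are no more than five years old, and (ii) regarding reproducibility, all works referred to here have code available in public repositories or have been implemented directly in a framework such as TensorFlow or Torch.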
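We propose self-adaptive training, a unified training algorithm that dynamically calibrates and enhances the training process using model predictions without incurring extra computational cost, to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions are able to magnify useful underlying information in the data, and that this phenomenon occurs broadly even in the absence of any label information, highlighting that model predictions could substantially benefit the training process: self-adaptive training improves the generalization of deep networks under noise and enhances self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., a potential explanation of the recently discovered double-descent phenomenon in empirical risk minimization and of the collapsing issue of state-of-the-art self-supervised learning algorithms. Experiments on the CIFAR, STL, and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification, and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training.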
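While humans can effortlessly transform complex visual scenes into simple words, and the other way around, by leveraging their high-level understanding of the content, conventional and more recent learned image compression codecs do not seem to utilize the semantic meaning of visual content to its full potential. Moreover, they focus mostly on rate-distortion and tend to underperform in perceptual quality, especially in the low-bitrate regime, and they often disregard the performance of downstream computer vision algorithms, which are a fast-growing group of consumers of compressed images. In this paper, we (1) present a generic framework that enables any image codec to leverage high-level semantics, and (2) study the joint optimization of perceptual quality and distortion. Our idea is that, given any codec, we use high-level semantics to augment the low-level visual features it extracts and thereby produce what is essentially a new, semantic-aware codec. We propose a three-phase training scheme that teaches semantic-aware codecs to leverage the power of semantics to jointly optimize rate-perception-distortion (R-PD) performance. As an additional benefit, semantic-aware codecs also boost the performance of downstream computer vision algorithms. To validate our claims, we perform extensive empirical evaluations and provide both quantitative and qualitative results.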
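This paper presents a comprehensive evaluation of instance segmentation models with respect to real-world image corruptions as well as out-of-domain image collections, e.g., images captured with a different set-up than the training dataset. The out-of-domain evaluation shows the generalization capability of models, an essential aspect of real-world applications and an extensively studied topic in domain adaptation. These robustness and generalization evaluations are important when designing instance segmentation models for real-world applications and when picking an off-the-shelf pretrained model to use directly for the task at hand. Specifically, this benchmark study covers state-of-the-art network architectures, network backbones, normalization layers, models trained from scratch versus pretrained networks, and the effect of multi-task training on robustness and generalization. Through this study, we gain several insights. For example, we find that group normalization enhances the robustness of networks across corruptions where the image content stays the same but corruptions are added on top. On the other hand, batch normalization improves the generalization of models across different datasets where the statistics of image features change. We also find that single-stage detectors do not generalize well to image resolutions larger than their training size. Multi-stage detectors, on the other hand, can easily be used on images of different sizes. We hope that our comprehensive study will motivate the development of more robust and reliable instance segmentation models.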
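In recent years, deep learning has been one of the most popular techniques in the computer vision community. As a data-driven technique, a deep model requires enormous amounts of accurately labeled training data, which is often inaccessible in many real-world applications. A data-space solution is data augmentation (DA), which artificially generates new images from the original samples. Image augmentation strategies can vary by dataset, as different data types might require different augmentations to facilitate model training. However, the design of DA policies has largely been decided by human experts with domain knowledge, which is considered highly subjective and error-prone. To mitigate this problem, a novel direction is to automatically learn image augmentation policies from the given dataset using automated data augmentation (AutoDA) techniques. The goal of AutoDA models is to find the optimal DA policies that maximize the gain in model performance. This survey discusses the underlying reasons for the emergence of AutoDA technology from the perspective of image classification. We identify three key components of a standard AutoDA model: a search space, a search algorithm, and an evaluation function. Based on their architecture, we provide a systematic taxonomy of existing image AutoDA approaches. This paper presents the major works in the AutoDA field, discusses their pros and cons, and proposes several potential directions for future improvement.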
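Recent work has shown that learned image compression strategies can outperform standard hand-crafted compression algorithms that have been developed over decades of intensive research on the rate-distortion trade-off. With the growing number of computer vision applications, high-quality image reconstruction from a compressible representation is often a secondary objective. Compression that ensures high accuracy on computer vision tasks such as image segmentation, classification, and detection therefore has the potential for significant impact across a wide variety of settings. In this work, we develop a framework that produces a compression format suitable for both human perception and machine perception. We show that representations can be learned that simultaneously optimize for compression and for performance on core vision tasks. Our approach allows models to be trained directly from compressed representations, and it yields increased performance on new tasks and in low-shot learning settings. We present results that improve on segmentation and detection performance compared with standard high-quality JPEGs, but with representations that are four to ten times smaller in terms of bits per pixel. Furthermore, unlike naive compression methods, at a level ten times smaller than standard JPEGs, segmentation and detection models trained on our format suffer only minor degradation in performance.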
Learning a good image prior is a long-term goal for image restoration and manipulation. While existing methods like deep image prior (DIP) capture low-level image statistics, there are still gaps toward an image prior that captures rich image semantics including color, spatial coherence, textures, and high-level concepts. This work presents an effective way to exploit the image prior captured by a generative adversarial network (GAN) trained on large-scale natural images. As shown in Fig. 1, the deep generative prior (DGP) provides compelling results to restore missing semantics, e.g., color, patch, resolution, of various degraded images. It also enables diverse image manipulation including random jittering, image morphing, and category transfer. Such highly flexible restoration and manipulation are made possible through relaxing the assumption of existing GAN-inversion methods, which tend to fix the generator. Notably, we allow the generator to be fine-tuned on-the-fly in a progressive manner regularized by feature distance obtained by the discriminator in GAN. We show that these easy-to-implement and practical changes help keep the reconstruction within the manifold of natural images, and thus lead to more precise and faithful reconstruction for real images. Code is available at https://github.com/XingangPan/deepgenerative-prior.
In this paper we establish rigorous benchmarks for image classifier robustness. Our first benchmark, ImageNet-C, standardizes and expands the corruption robustness topic, while showing which classifiers are preferable in safety-critical applications. Then we propose a new dataset called ImageNet-P which enables researchers to benchmark a classifier's robustness to common perturbations. Unlike recent robustness research, this benchmark evaluates performance on common corruptions and perturbations, not worst-case adversarial perturbations. We find that there are negligible changes in relative corruption robustness from AlexNet classifiers to ResNet classifiers. Afterward we discover ways to enhance corruption and perturbation robustness. We even find that a bypassed adversarial defense provides substantial common perturbation robustness. Together our benchmarks may aid future work toward networks that robustly generalize.
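We introduce several new datasets, namely ImageNet-A/O and ImageNet-R, as well as a synthetic environment and testing suite that we call CAOS. ImageNet-A/O allow researchers to focus on the blind spots remaining in ImageNet. ImageNet-R was created specifically to track robust representations, as its images are no longer simply natural but include art and other renditions. The CAOS suite is built on the CARLA simulator, which allows anomalous objects to be included and makes it possible to create reproducible synthetic environments and scenes for testing robustness. All of the datasets were created to test robustness and to measure progress in robustness. The datasets have been used in various other works to measure their own progress in robustness and to allow for tangential progress that does not focus exclusively on natural accuracy. Given these datasets, we created several novel methods that aim to advance robustness research. We build simple baselines in the form of Maximum Logit and Typicality Score, and we create a new data augmentation method in the form of DeepAugment that improves on the aforementioned benchmarks. Maximum Logit considers the logit values instead of the values after the softmax operation, and this small change produces noticeable improvements. The Typicality Score compares the output distribution to a posterior distribution over classes. We show that this improves performance over the baseline on all tasks except segmentation. We speculate that, at the pixel level, the semantic information of a pixel is less meaningful than class-level semantic information. Finally, the new augmentation technique DeepAugment uses neural networks to create augmentations of images that are radically different from the traditional geometric and camera-based transformations used previously.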
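The difference between scoring with logits and scoring with softmax outputs, as described above, can be sketched in a few lines. The function names and the sign convention (higher score means more anomalous) are assumptions made for this illustration, and the logits are made-up values.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def max_softmax_score(logits):
    """Baseline anomaly score: negative maximum softmax probability."""
    return -max(softmax(logits))

def max_logit_score(logits):
    """Maximum Logit anomaly score: negative maximum unnormalized logit."""
    return -max(logits)

# Two made-up logit vectors that differ only by a constant shift.
confident = [10.0, 10.0, 0.0]
weak = [1.0, 1.0, -9.0]
```

Because softmax is invariant to adding a constant to all logits, both vectors receive the same maximum softmax probability, while the Maximum Logit score still separates them: `weak` scores as more anomalous than `confident`.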
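We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks by introducing a part-based model for object classification. We believe that richer forms of annotation help guide neural networks to learn more robust features without requiring more samples or larger models. Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts and then classify the segmented object. Empirically, our part-based models achieve both higher accuracy and higher adversarial robustness than a ResNet-50 baseline on all three datasets. For instance, the clean accuracy of our part models is up to 15 percentage points higher than that of the baseline at the same level of robustness. Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations. The code is publicly available at https://github.com/chawins/adv-part-model.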
Conventional training of a deep CNN based object detector demands a large number of bounding box annotations, which may be unavailable for rare categories. In this work we develop a few-shot object detector that can learn to detect novel objects from only a few annotated examples. Our proposed model leverages fully labeled base classes and quickly adapts to novel classes, using a meta feature learner and a reweighting module within a one-stage detection architecture. The feature learner extracts meta features that are generalizable to detect novel object classes, using training data from base classes with sufficient samples. The reweighting module transforms a few support examples from the novel classes to a global vector that indicates the importance or relevance of meta features for detecting the corresponding objects. These two modules, together with a detection prediction module, are trained end-to-end based on an episodic few-shot learning scheme and a carefully designed loss function. Through extensive experiments we demonstrate that our model outperforms well-established baselines by a large margin for few-shot object detection, on multiple datasets and settings. We also present analysis on various aspects of our proposed model, aiming to provide some inspiration for future few-shot detection works.
To date, the best-performing blind super-resolution (SR) techniques follow one of two paradigms: A) generate and train a standard SR network on synthetic low-resolution - high-resolution (LR - HR) pairs or B) attempt to predict the degradations an LR image has suffered and use these to inform a customised SR network. Despite significant progress, subscribers to the former miss out on useful degradation information that could be used to improve the SR process. On the other hand, followers of the latter rely on weaker SR networks, which are significantly outperformed by the latest architectural advancements. In this work, we present a framework for combining any blind SR prediction mechanism with any deep SR network, using a metadata insertion block to insert prediction vectors into SR network feature maps. Through comprehensive testing, we prove that state-of-the-art contrastive and iterative prediction schemes can be successfully combined with high-performance SR networks such as RCAN and HAN within our framework. We show that our hybrid models consistently achieve stronger SR performance than both their non-blind and blind counterparts. Furthermore, we demonstrate our framework's robustness by predicting degradations and super-resolving images from a complex pipeline of blurring, noise and compression.