Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies. We then demonstrate that the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on 'Stylized-ImageNet', a stylized version of ImageNet. This provides a much better fit for human behavioural performance in our well-controlled psychophysical lab setting (nine experiments totalling 48,560 psychophysical trials across 97 observers) and comes with a number of unexpected emergent benefits such as improved object detection performance and previously unseen robustness towards a wide range of image distortions, highlighting advantages of a shape-based representation.
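As an illustration of how a texture-shape cue-conflict evaluation can be scored, the sketch below computes a shape-bias fraction from per-trial decisions. The category names and trial data are hypothetical placeholders, not the stimuli or results of the paper.

```python
# Minimal sketch: scoring cue-conflict trials for shape bias.
# shape_bias = (# shape-consistent decisions) / (# shape- or texture-consistent decisions)
# Trials where the model predicts neither cue are ignored.
from collections import namedtuple

Trial = namedtuple("Trial", ["shape_label", "texture_label", "prediction"])

def shape_bias(trials):
    shape_hits = sum(t.prediction == t.shape_label for t in trials)
    texture_hits = sum(t.prediction == t.texture_label for t in trials)
    cue_decisions = shape_hits + texture_hits
    return shape_hits / cue_decisions if cue_decisions else float("nan")

# Hypothetical example: an image with cat shape and elephant texture, etc.
trials = [
    Trial("cat", "elephant", "elephant"),   # texture-based decision
    Trial("car", "clock", "car"),           # shape-based decision
    Trial("dog", "knife", "boat"),          # neither cue -> ignored
]
print(f"shape bias = {shape_bias(trials):.2f}")  # 0.50 for this toy set
```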
Over the past few years, an increasing number of parallels have been drawn between human vision and convolutional neural networks (CNNs). However, vanilla CNNs often fall short when generalizing to adversarial or out-of-distribution (OOD) examples. Adversarial training is a leading learning algorithm for improving the robustness of CNNs on adversarial and OOD data; however, little is known about the properties, specifically the shape bias and internal features, learned inside adversarially trained CNNs. In this paper, we perform a thorough, systematic study to understand the shape bias and some internal mechanisms behind the generalizability of AlexNet, GoogLeNet and ResNet-50 models trained via adversarial training. We find that while standard ImageNet classifiers have a strong texture bias, their robust ('R') counterparts rely heavily on shape. Remarkably, adversarial training induces three simplicity biases in the hidden neurons while 'robustifying' the CNNs. That is, each convolutional neuron in the R networks often changes to detect (1) pixel-wise smoother patterns, i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) more lower-level features, i.e. textures and colors (rather than objects); and (3) fewer types of inputs. Our findings reveal interesting mechanisms that make networks more adversarially robust, and they explain some recent findings, e.g. why R networks benefit from larger capacity (Xie et al., 2020) and can serve as a strong image prior in image synthesis (Santurkar et al., 2019).
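Since this abstract centers on CNNs hardened by adversarial training, a minimal PGD-style adversarial training step is sketched below (PyTorch, assuming an L-infinity threat model). The tiny model, the hyperparameters `eps`, `step_size` and `steps`, and the random data are illustrative assumptions, not the settings of the cited study.

```python
# Sketch of one PGD adversarial-training step under an L-infinity constraint.
import torch
import torch.nn as nn
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step_size=2/255, steps=7):
    """Gradient-ascent steps on the loss, projected onto the eps-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + step_size * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y):
    model.train()
    x_adv = pgd_attack(model, x, y)          # craft adversarial examples on the fly
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)  # train on the perturbed batch only
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a tiny CNN and random data (illustrative only).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
print(adversarial_training_step(model, opt, x, y))
```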
In recent years, convolutional neural networks (CNNs) have been successfully applied in many fields. However, such deep neural models are still treated as black boxes in most tasks. One of the fundamental questions behind this issue is understanding which features are most influential in image recognition tasks and how they are processed by CNNs. It is widely accepted that CNN models combine low-level features to form complex shapes until the object can be readily classified; however, several recent studies have shown that texture features are more important than other features. In this paper, we hypothesize that the importance of certain features varies with the specific task, i.e., a specific task exhibits a feature bias. We design two classification tasks based on human intuition to train deep neural models and identify the expected biases. We conduct experiments comprising many tasks to test for these biases in ResNet and DenseNet models. From the results we conclude that (1) the combined effect of certain features is usually more influential than any single feature; (2) in different tasks, neural models can exhibit different biases, i.e., we can design specific tasks to bias a neural model towards specific expected features.
Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underconstrained, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and use class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward pass in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a "colorization Turing test," asking human participants to choose between a generated and ground truth color image. Our method successfully fools humans on 32% of the trials, significantly higher than previous methods. Moreover, we show that colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. This approach results in state-of-the-art performance on several feature learning benchmarks.
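The abstract frames colorization as classification over quantized color bins with class rebalancing at training time. The sketch below shows that idea in generic form as a per-bin weighted cross-entropy; the bin count, smoothing constant and random stand-in frequencies are chosen purely for illustration and are not taken from the paper.

```python
# Sketch: class-rebalanced cross-entropy over quantized ab color bins.
# Rarer colors receive larger weights so vibrant, infrequent colors are not washed out.
import numpy as np
import torch
import torch.nn.functional as F

NUM_BINS = 313                                          # illustrative bin count
empirical_p = np.random.dirichlet(np.ones(NUM_BINS))    # stand-in for dataset bin frequencies

lam = 0.5                                               # mix empirical prior with uniform
smoothed = (1 - lam) * empirical_p + lam / NUM_BINS
weights = 1.0 / smoothed
weights /= (empirical_p * weights).sum()                # normalise to unit expected weight

# Per-pixel classification: logits (N, BINS, H, W) against bin targets (N, H, W).
logits = torch.randn(2, NUM_BINS, 8, 8)
targets = torch.randint(0, NUM_BINS, (2, 8, 8))
loss = F.cross_entropy(logits, targets, weight=torch.tensor(weights, dtype=torch.float32))
print(loss.item())
```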
Even state-of-the-art deep learning models lack fundamental abilities compared to humans. Multiple comparison paradigms have been proposed to explore the distinctions between humans and deep learning. Although most comparisons rely on corruptions inspired by mathematical transformations, few are grounded in human cognitive phenomena. In this study, we propose a novel corruption method based on the abutting grating illusion, a visual phenomenon widely found in humans and in a wide range of animal species. The corruption method destroys gradient-defined boundaries and generates the perception of illusory contours using line gratings abutting each other. We apply the method to MNIST, high-resolution MNIST and silhouette object images. A variety of deep learning models are tested on the corruption, including models trained from scratch and 109 models pretrained on ImageNet or with various data augmentation techniques. Our results show that the abutting grating corruption is challenging even for state-of-the-art deep learning models, as most models are reduced to random guessing. We also discover that the DeepAugment technique can greatly improve robustness against the abutting grating illusion. Visualization of early layers indicates that better-performing models exhibit stronger end-stopping properties, consistent with neuroscience findings. To validate the corruption method, 24 human subjects were recruited to classify the corrupted datasets.
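To make the corruption concrete, here is a small NumPy sketch of an abutting-grating pattern: the line gratings on the two sides of a boundary are phase-shifted so that the boundary is signalled only by an illusory contour, not by any luminance edge. The geometry and parameters are assumptions for illustration, not the exact procedure of the paper.

```python
# Sketch: an abutting-grating stimulus. A vertical illusory contour at column `edge`
# arises because the horizontal line gratings on the two sides are offset in phase.
import numpy as np

def abutting_grating(size=64, period=8, thickness=2, edge=None):
    edge = size // 2 if edge is None else edge
    img = np.zeros((size, size), dtype=np.float32)
    rows = np.arange(size)
    left_on = (rows % period) < thickness                   # grating phase on the left
    right_on = ((rows + period // 2) % period) < thickness  # shifted phase on the right
    img[left_on, :edge] = 1.0
    img[right_on, edge:] = 1.0
    return img

grating = abutting_grating()
print(grating.shape, grating.mean())  # no luminance edge exists along the boundary itself
```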
The well-documented presence of texture bias in modern convolutional neural networks has led to a plethora of algorithms that promote an emphasis on shape cues, often to support generalization to new domains. Yet, common datasets, benchmarks and general model selection strategies are missing, and there is no agreed, rigorous evaluation protocol. In this paper, we investigate difficulties and limitations when training networks with reduced texture bias. In particular, we also show that proper evaluation and meaningful comparisons between methods are not trivial. We introduce BiasBed, a testbed for texture- and style-biased training, including multiple datasets and a range of existing algorithms. It comes with an extensive evaluation protocol that includes rigorous hypothesis testing to gauge the significance of the results, despite the considerable training instability of some style bias methods. Our extensive experiments shed new light on the need for careful, statistically founded evaluation protocols for style bias (and beyond). For example, we find that some algorithms proposed in the literature do not significantly mitigate the impact of style bias at all. With the release of BiasBed, we hope to foster a common understanding of consistent and meaningful comparisons, and consequently faster progress towards learning methods free of texture bias. Code is available at https://github.com/D1noFuzi/BiasBed
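In the spirit of the rigorous hypothesis testing that BiasBed advocates, a minimal sketch of comparing two algorithms across repeated, paired training runs is shown below. The accuracies are synthetic and the paired t-test is one possible choice of test, not necessarily the one implemented in BiasBed.

```python
# Sketch: test whether algorithm B significantly outperforms algorithm A
# across paired training runs (same seeds / splits), using a paired t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
acc_a = 70 + rng.normal(0, 2.0, size=10)      # synthetic accuracies over 10 runs
acc_b = acc_a + rng.normal(0.5, 1.5, size=10)

t_stat, p_value = stats.ttest_rel(acc_b, acc_a)
print(f"mean gain = {np.mean(acc_b - acc_a):.2f}, p = {p_value:.3f}")
if p_value < 0.05 and np.mean(acc_b - acc_a) > 0:
    print("B is significantly better at the 5% level")
else:
    print("no significant difference detected")
```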
We introduce several new datasets, namely ImageNet-A/O and ImageNet-R, as well as a synthetic environment and testing suite we call CAOS. ImageNet-A/O allows researchers to focus on the blind spots that remain in ImageNet. ImageNet-R was created specifically with robust representations in mind, as the representations are no longer simply natural but include artistic and other renditions. The CAOS suite is built on the CARLA simulator, allows for the inclusion of anomalous objects, and enables the creation of reproducible synthetic environments and scenes for testing robustness. All of the datasets were created to test robustness and to measure progress in robustness. The datasets have been used in various other works to measure their own progress in robustness and to allow for tangential progress that does not focus solely on natural accuracy. Given these datasets, we created several new methods aimed at advancing robustness research. We build simple baselines in the form of MaxLogit and a typicality score, and create a new data augmentation method in the form of DeepAugment, which improve on the aforementioned benchmarks. MaxLogit considers the logit values rather than the values after the softmax operation, and this small change yields noticeable improvements. The typicality score compares the output distribution with the class posterior distribution. We show that this improves performance over the baseline, except for segmentation tasks; we conjecture that at the pixel level, the semantic information of a pixel is less meaningful than class-level semantic information. Finally, the new DeepAugment augmentation technique uses neural networks to create augmentations of images that are radically different from the traditional geometric and camera-based transformations used previously.
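The MaxLogit baseline mentioned here is simple enough to state in a few lines: score each input by its largest raw logit rather than its largest softmax probability. The sketch below contrasts the two scores on random logits; the data and threshold handling are illustrative.

```python
# Sketch: MaxLogit vs. maximum softmax probability (MSP) as anomaly scores.
# Lower scores are treated as more anomalous / out-of-distribution.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

logits = np.random.randn(5, 1000) * 3.0       # stand-in for classifier logits

msp_score = softmax(logits).max(axis=1)       # classic baseline (post-softmax)
max_logit_score = logits.max(axis=1)          # MaxLogit: skip the softmax entirely

print("MSP:      ", np.round(msp_score, 3))
print("MaxLogit: ", np.round(max_logit_score, 3))
# Detection would compare these scores against a threshold fit on in-distribution data.
```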
Current learning machines have successfully solved hard application problems, reaching high accuracy and displaying seemingly "intelligent" behavior. Here we apply recent techniques for explaining decisions of state-of-the-art learning machines and analyze various tasks from computer vision and arcade games. This showcases a spectrum of problem-solving behaviors ranging from naive and short-sighted to well-informed and strategic. We observe that standard performance evaluation metrics can be oblivious to distinguishing these diverse problem-solving behaviors. Furthermore, we propose our semi-automated Spectral Relevance Analysis that provides a practically effective way of characterizing and validating the behavior of nonlinear learning machines. This helps to assess whether a learned model indeed delivers reliably for the problem that it was conceived for. Furthermore, our work intends to add a voice of caution to the ongoing excitement about machine intelligence and pledges to evaluate and judge some of these recent successes in a more nuanced manner.
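Spectral Relevance Analysis, as described here, clusters per-sample explanation heatmaps to surface recurring decision strategies. The sketch below imitates that pipeline with random heatmaps and off-the-shelf spectral clustering, so the preprocessing, cluster count and parameters are assumptions rather than the authors' exact recipe.

```python
# Sketch of a Spectral-Relevance-Analysis-style pipeline: cluster (downsampled)
# relevance heatmaps to reveal groups of samples the model treats alike.
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
heatmaps = rng.random((200, 32, 32))          # stand-in for per-sample relevance maps

# Downsample and flatten so clustering operates on coarse spatial strategies.
coarse = heatmaps.reshape(200, 8, 4, 8, 4).mean(axis=(2, 4)).reshape(200, -1)

clusterer = SpectralClustering(n_clusters=4, affinity="nearest_neighbors",
                               n_neighbors=10, random_state=0)
labels = clusterer.fit_predict(coarse)
print(np.bincount(labels))                    # cluster sizes; inspect each cluster manually
```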
The Vision Transformer (ViT) architecture has recently achieved competitive performance across a variety of computer vision tasks. One of the motivations behind ViTs, compared to convolutional neural networks (CNNs), is their weaker inductive bias. However, this also makes ViTs harder to train: they require very large training datasets, heavy regularization and strong data augmentation. Despite the significant differences between the two architectures, the data augmentation strategies used to train ViTs have largely been inherited from CNN training. In this work, we empirically evaluate how different data augmentation strategies developed for CNNs (e.g., ResNet) perform on the ViT architecture for image classification. We introduce a style-transfer data augmentation, termed StyleAug, which works best for training ViTs, while RandAugment and AugMix typically work best for training CNNs. We also find that, in addition to a classification loss, using a consistency loss between multiple augmentations of the same image is especially helpful when training ViTs.
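The consistency loss described at the end of this abstract can be sketched generically: penalize divergence between a model's predictions on different augmentations of the same image, added to the usual classification loss. The symmetric-KL formulation, the loss weight and the toy model below are illustrative assumptions, not the paper's exact objective.

```python
# Sketch: classification loss plus a consistency loss across two augmented views.
import torch
import torch.nn.functional as F

def consistency_training_loss(model, view1, view2, labels, weight=1.0):
    logits1, logits2 = model(view1), model(view2)
    cls_loss = 0.5 * (F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels))
    # Symmetric KL divergence between the two predictive distributions.
    p1 = F.log_softmax(logits1, dim=-1)
    p2 = F.log_softmax(logits2, dim=-1)
    consistency = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
                         + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))
    return cls_loss + weight * consistency

# Toy usage with a linear "model" on flattened inputs (illustrative only).
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
view1, view2 = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
labels = torch.randint(0, 10, (4,))
print(consistency_training_loss(model, view1, view2, labels).item())
```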
We build new test sets for the CIFAR-10 and ImageNet datasets. Both benchmarks have been the focus of intense research for almost a decade, raising the danger of overfitting to excessively re-used test sets. By closely following the original dataset creation processes, we test to what extent current classification models generalize to new data. We evaluate a broad range of models and find accuracy drops of 3%–15% on CIFAR-10 and 11%–14% on ImageNet. However, accuracy gains on the original test sets translate to larger gains on the new test sets. Our results suggest that the accuracy drops are not caused by adaptivity, but by the models' inability to generalize to slightly "harder" images than those found in the original test sets.
Data distortions are commonly applied to vision models both during training (e.g., MixUp and CutMix) and during evaluation (e.g., shape-texture bias and robustness measurements). Such data modifications can introduce artificial information, and it is often assumed that the resulting artifacts are detrimental to training while being negligible when analysing models. We investigate these assumptions and conclude that in some cases they are unfounded and lead to incorrect results. Specifically, we show that current shape bias identification methods and occlusion robustness measurements are biased, and we propose a fairer alternative for the latter. Subsequently, through a series of experiments, we seek to correct and strengthen the community's perception of how augmentations affect the learning of vision models. Based on our empirical results, we argue that the impact of the artifacts must be understood and exploited rather than eliminated.
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) availability of large-scale labeled data. Since 2012, there have been significant advances in representation capabilities of the models and computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10× or 100×? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between 'enormous data' and visual deep learning. By exploiting the JFT-300M dataset which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that the performance on vision tasks increases logarithmically based on volume of training data size. Second, we show that representation learning (or pretraining) still holds a lot of promise. One can improve performance on many vision tasks by just training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation and human pose estimation. Our sincere hope is that this inspires the vision community to not undervalue the data and to develop collective efforts in building larger datasets.
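The claim that performance grows logarithmically with training-set size can be checked with a one-line fit; the data points below are made up solely to show the procedure, not results from the paper.

```python
# Sketch: fit accuracy = a * log10(num_examples) + b to (synthetic) data-scaling points.
import numpy as np

data_sizes = np.array([10e6, 30e6, 100e6, 300e6])   # hypothetical training-set sizes
accuracies = np.array([61.0, 64.5, 67.8, 71.2])     # hypothetical downstream accuracies

a, b = np.polyfit(np.log10(data_sizes), accuracies, deg=1)
print(f"about {a:.1f} points of accuracy per 10x more data (intercept {b:.1f})")
```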
Recent progress in self-supervised learning has demonstrated promising results across multiple vision tasks. An important ingredient in high-performing self-supervised methods is the use of data augmentation by training models to place different augmented views of the same image nearby in embedding space. However, commonly used augmentation pipelines treat images holistically, ignoring the semantic relevance of parts of an image—e.g., a subject versus a background—which can lead to the learning of spurious correlations. Our work addresses this problem by investigating a class of simple but highly effective "background augmentations", which encourage models to focus on semantically relevant content and discourage them from focusing on image backgrounds. Through a systematic investigation, we show that background augmentations lead to substantial improvements in performance across a range of state-of-the-art self-supervised methods (MoCo-v2, BYOL, SwAV) on a variety of tasks, with gains of ~+1–2% on ImageNet, enabling performance on par with supervised baselines. Further, we find that the improvements in limited-label settings are even larger (up to 4.2%). Background augmentations also improve robustness to a number of distribution shifts, including natural adversarial examples, ImageNet-9, adversarial attacks and ImageNet-Renditions. We also make progress in completely unsupervised saliency detection in the process of generating the saliency masks used for background augmentations.
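A background augmentation of the kind described here amounts to compositing an image's salient foreground onto a different background using a (possibly unsupervised) saliency mask. The sketch below shows only that compositing step, with random arrays standing in for real images and masks.

```python
# Sketch: swap an image's background using a soft saliency mask in [0, 1].
# Foreground pixels (mask ~ 1) are kept; background pixels come from another image.
import numpy as np

def background_swap(image, saliency_mask, new_background):
    mask = saliency_mask[..., None]            # (H, W) -> (H, W, 1) for broadcasting
    return mask * image + (1.0 - mask) * new_background

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
new_background = rng.random((224, 224, 3))
saliency_mask = (rng.random((224, 224)) > 0.5).astype(np.float32)  # stand-in mask

augmented = background_swap(image, saliency_mask, new_background)
print(augmented.shape)  # feed `augmented` into the usual self-supervised pipeline
```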
Deep neural networks set the state of the art in many computer vision tasks, but their ability to generalize to distorted objects is surprisingly fragile. In contrast, the mammalian visual system is robust to a wide range of perturbations. Recent work suggests that this generalization ability can be explained by useful inductive biases encoded in the representations of visual stimuli throughout the visual cortex. Here, we successfully leveraged these inductive biases with a multi-task learning approach: we jointly trained a deep network to perform image classification and to predict neural activity in macaque primary visual cortex (V1). We measured the out-of-distribution generalization ability of our network by testing its robustness to image distortions. We found that co-training on monkey V1 data leads to increased robustness despite the absence of those distortions during training. Additionally, we showed that our network's robustness comes very close to that of an oracle network in which parts of the architecture are trained directly on noisy images. Our results also demonstrate that the network's representations become more brain-like as robustness improves. Using a novel constrained reconstruction analysis, we investigated what makes our brain-regularized network more robust. Compared to a baseline network trained only on image classification, our co-trained network is more sensitive to content than to noise. Using saliency maps for ImageNet images predicted by a deep model, we found that our monkey-co-trained network tends to be more sensitive to salient regions in a scene, reminiscent of the role of V1 in detecting object borders and in bottom-up saliency. Overall, our work expands the promising research avenue of transferring inductive biases from the brain and provides novel analyses of the effects of our transfer.
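The multi-task co-training described here boils down to sharing a backbone between an image-classification head and a neural-response-prediction head and summing the two losses. The architecture sizes, neuron count and equal loss weighting below are illustrative assumptions, not the settings of the study.

```python
# Sketch: joint objective = classification loss + neural-prediction loss,
# with a shared convolutional backbone (all sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoTrainedNet(nn.Module):
    def __init__(self, num_classes=10, num_neurons=100):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(16, num_classes)   # object categories
        self.neural_head = nn.Linear(16, num_neurons)  # predicted V1 responses

    def forward(self, x):
        h = self.backbone(x)
        return self.classifier(h), self.neural_head(h)

model = CoTrainedNet()
images = torch.rand(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))
responses = torch.rand(4, 100)                         # stand-in neural recordings

logits, pred_responses = model(images)
loss = F.cross_entropy(logits, labels) + 1.0 * F.mse_loss(pred_responses, responses)
loss.backward()
print(loss.item())
```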
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the chal-
This work explores the use of spatial context as a source of free and plentiful supervisory signal for training a rich visual representation. Given only a large, unlabeled image collection, we extract random pairs of patches from each image and train a convolutional neural net to predict the position of the second patch relative to the first. We argue that doing well on this task requires the model to learn to recognize objects and their parts. We demonstrate that the feature representation learned using this within-image context indeed captures visual similarity across images. For example, this representation allows us to perform unsupervised visual discovery of objects like cats, people, and even birds from the Pascal VOC 2011 detection dataset. Furthermore, we show that the learned ConvNet can be used in the R-CNN framework [21] and provides a significant boost over a randomly-initialized ConvNet, resulting in state-of-the-art performance among algorithms which use only Pascal-provided training set annotations.
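The pretext task described here, predicting where a second patch lies relative to a first, can be sketched as follows: sample a center patch and one of its eight neighbours, and use the neighbour index as the classification target. Patch size, gap and the absence of jitter are illustrative simplifications.

```python
# Sketch: sample a (center patch, neighbour patch, position label) training triple
# for the relative-position pretext task. The label is one of 8 neighbour positions.
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_patch_pair(image, patch=32, gap=8, rng=np.random.default_rng()):
    h, w = image.shape[:2]
    stride = patch + gap
    cy = rng.integers(stride, h - stride - patch)
    cx = rng.integers(stride, w - stride - patch)
    label = rng.integers(len(OFFSETS))
    dy, dx = OFFSETS[label]
    ny, nx = cy + dy * stride, cx + dx * stride
    center = image[cy:cy + patch, cx:cx + patch]
    neighbour = image[ny:ny + patch, nx:nx + patch]
    return center, neighbour, label   # a CNN is trained to predict `label` from the pair

image = np.random.rand(256, 256, 3)
c, n, y = sample_patch_pair(image)
print(c.shape, n.shape, y)
```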
Deep convolutional neural networks (DCNNs), originally inspired by principles of biological vision, have evolved into the best current computational models of object recognition, showing strong structural and functional parallels to the ventral visual pathway in comparisons with neuroimaging and neural time-series data. As recent advances in deep learning appear to decrease this similarity, computational neuroscience is challenged to reverse-engineer biological plausibility back into useful models. While previous studies have shown that biologically inspired architectures can increase the human-likeness of models, in this study we investigate a purely data-driven approach. We use human eye-tracking data to directly modify training examples and thereby guide the models' visual attention during object recognition in natural images either towards or away from the focus of human fixations. We compare and validate the different manipulation types (i.e., standard, human-like and non-human-like attention) against the eye-tracking data of human participants via Grad-CAM saliency maps. Our results demonstrate that the proposed guided-focus manipulations work as intended in the negative direction, with non-human-like models focusing on significantly different image parts compared to humans. The observed effects were highly category-specific, were enhanced by animacy and the presence of faces, developed only after feedforward processing was completed, and indicated a strong influence on face detection. However, no increase in human-likeness was found with this approach. Possible applications of overt visual attention in DCNNs and further implications for theories of face detection are discussed.
Precisely understanding why units in artificial networks respond to certain stimuli would constitute a step towards explainable artificial intelligence. One widely used approach towards this goal is to visualize unit responses via activation maximization. These synthetic feature visualizations are claimed to provide humans with precise information about the image features that cause a unit to activate—an advantage over other alternatives such as strongly activating natural dataset samples. If humans indeed gain causal insight from visualizations, this should enable them to predict the effect of an intervention, such as how occluding a certain patch of an image (say, a dog's head) changes a unit's activation. Here, we test this hypothesis by asking humans to decide which of two square occlusions causes a larger change in a unit's activation. Both a large-scale crowdsourced experiment and measurements with experts show that, on average, the extremely activating feature visualizations of Olah et al. (2017) do indeed help humans on this task (68 ± 4% accuracy; baseline performance without any visualization is 60 ± 3%). However, they do not provide any substantial advantage over other visualizations (such as dataset samples), which yield similar performance (66 ± 3% to 67 ± 3% accuracy). Taken together, we propose an objective psychophysical task to quantify the benefit of unit-level interpretability methods for humans, and find no evidence that the widely used feature visualization method provides humans with a better 'causal understanding' of unit activations than simple alternative visualizations.
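Activation maximization, the visualization technique under scrutiny here, can be sketched as gradient ascent on the input to maximize a chosen unit's activation. The tiny untrained network, the target unit and the step size below are placeholders; real feature visualizations add regularizers and transformations omitted in this sketch.

```python
# Sketch: bare-bones activation maximization by gradient ascent on the input.
import torch
import torch.nn as nn

model = nn.Sequential(                       # placeholder network, not a trained model
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
model.eval()

unit = 5                                     # channel whose mean activation we maximize
x = torch.rand(1, 3, 64, 64, requires_grad=True)
optimizer = torch.optim.Adam([x], lr=0.05)

for step in range(100):
    optimizer.zero_grad()
    activation = model(x)[0, unit].mean()
    (-activation).backward()                 # ascend the activation
    optimizer.step()
    with torch.no_grad():
        x.clamp_(0, 1)                       # keep the synthetic image in a valid range

print(float(activation))                     # x now (weakly) drives unit 5
```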
Convolutional neural networks (CNNs) are one of the most successful computer vision systems to solve object recognition. Furthermore, CNNs have major applications in understanding the nature of visual representations in the human brain. Yet it remains poorly understood how CNNs actually make their decisions, what the nature of their internal representations is, and how their recognition strategies differ from humans. Specifically, there is a major debate about the question of whether CNNs primarily rely on surface regularities of objects, or whether they are capable of exploiting the spatial arrangement of features, similar to humans. Here, we develop a novel feature-scrambling approach to explicitly test whether CNNs use the spatial arrangement of features (i.e. object parts) to classify objects. We combine this approach with a systematic manipulation of effective receptive field sizes of CNNs as well as minimal recognizable configurations (MIRCs) analysis. In contrast to much previous literature, we provide evidence that CNNs are in fact capable of using relatively long-range spatial relationships for object classification. Moreover, the extent to which CNNs use spatial relationships depends heavily on the dataset, e.g. texture vs. sketch. In fact, CNNs even use different strategies for different classes within heterogeneous datasets (ImageNet), suggesting CNNs have a continuous spectrum of classification strategies. Finally, we show that CNNs learn the spatial arrangement of features only up to an intermediate level of granularity, which suggests that intermediate rather than global shape features provide the optimal trade-off between sensitivity and specificity in object classification. These results provide novel insights into the nature of CNN representations and the extent to which they rely on the spatial arrangement of features for object classification.
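A feature-scrambling manipulation of the kind described can be sketched by cutting an image into a grid of patches and permuting them, which preserves local texture statistics while destroying their spatial arrangement. The grid size and permutation scheme below are illustrative, not the paper's exact procedure.

```python
# Sketch: scramble the spatial arrangement of image patches while keeping
# each patch (and hence local texture) intact.
import numpy as np

def scramble_patches(image, grid=4, rng=np.random.default_rng()):
    h, w, c = image.shape
    ph, pw = h // grid, w // grid
    patches = [image[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(grid) for j in range(grid)]
    order = rng.permutation(len(patches))
    rows = [np.concatenate([patches[order[i * grid + j]] for j in range(grid)], axis=1)
            for i in range(grid)]
    return np.concatenate(rows, axis=0)

image = np.random.rand(224, 224, 3)
scrambled = scramble_patches(image, grid=4)   # finer grids destroy longer-range relations
print(scrambled.shape)
```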
Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M. The resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks.
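The core DeepCluster loop alternates between clustering the current features with k-means and training the network on the cluster assignments as pseudo-labels. The sketch below shows that alternation with a toy feature extractor; dimensions and epoch counts are illustrative, and details such as PCA whitening, classifier re-initialization and cluster re-balancing are omitted.

```python
# Sketch of the DeepCluster alternation: (1) k-means on features -> pseudo-labels,
# (2) supervised update of the network on those pseudo-labels. Heavily simplified.
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU())
classifier = nn.Linear(64, 10)                       # one output per cluster
opt = torch.optim.SGD(list(backbone.parameters()) + list(classifier.parameters()), lr=0.01)

images = torch.rand(256, 3, 32, 32)                  # stand-in for an unlabeled dataset

for epoch in range(3):
    # Step 1: cluster the current features to obtain pseudo-labels.
    with torch.no_grad():
        feats = backbone(images).numpy()
    pseudo_labels = torch.as_tensor(
        KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(feats)).long()

    # Step 2: one pass of supervised training against the pseudo-labels.
    for i in range(0, len(images), 64):
        x, y = images[i:i + 64], pseudo_labels[i:i + 64]
        opt.zero_grad()
        loss = F.cross_entropy(classifier(backbone(x)), y)
        loss.backward()
        opt.step()
    print(f"epoch {epoch}: last batch loss = {loss.item():.3f}")
```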