Modern deep networks generalize better when trained with noisy samples and regularization techniques. Mixup and CutMix have proven to be effective data augmentation techniques that help avoid overfitting. Previous Mixup-based methods linearly combine images and labels to generate additional training data. However, this is problematic if the object does not occupy the whole image, as we demonstrate in Figure 1. Correctly assigning the label weights is hard even for humans, and there is no clear criterion to measure it. To tackle this problem, we propose LUMix, which models such uncertainty by adding label perturbation during training. LUMix is simple: it can be implemented in just a few lines of code and can be universally applied to any deep network, \eg CNNs and Vision Transformers, with minimal computational cost. Extensive experiments show that LUMix consistently boosts performance for networks of diverse architectures and capacities on ImageNet, \eg $+0.7\%$ for the small model DeiT-S and $+0.6\%$ for the large variant XCiT-L. We also demonstrate that LUMix leads to better robustness when evaluated on ImageNet-O and ImageNet-A. The source code can be found \href{https://github.com/kevin-ssy/LUMix}{here}.
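To make the idea concrete, below is a minimal PyTorch-style sketch of label perturbation in a Mixup/CutMix setting, written only from the description above; the helper name, the noise distribution, and the `noise_std` parameter are assumptions, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def perturbed_mix_targets(targets_a, targets_b, lam, num_classes, noise_std=0.05):
        """Build soft targets for a mixed batch while perturbing the label weight."""
        # Perturb the nominal mixing ratio to model uncertainty about how much
        # of each source object is actually visible in the mixed image.
        lam_noisy = lam + noise_std * torch.randn(1).item()
        lam_noisy = float(min(max(lam_noisy, 0.0), 1.0))  # keep it a valid weight
        y_a = F.one_hot(targets_a, num_classes).float()
        y_b = F.one_hot(targets_b, num_classes).float()
        return lam_noisy * y_a + (1.0 - lam_noisy) * y_b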
Mixup-based augmentation has been found to be effective for improving model generalization during training, especially for Vision Transformers (ViTs), since they easily overfit. However, previous mixup-based methods carry an underlying prior assumption that the linearly interpolated ratio of the targets should stay the same as the ratio used for input interpolation. This can lead to a strange phenomenon: due to the random process in augmentation, the mixed image sometimes contains no valid object, yet there is still a response in the label space. To bridge this gap between the input and label spaces, we propose TransMix, which mixes labels based on the attention maps of Vision Transformers. The confidence of a label is larger if its corresponding input image is weighted more heavily by the attention map. TransMix is embarrassingly simple and can be implemented in just a few lines of code without introducing any extra parameters or FLOPs to ViT-based models. Experimental results show that our method can consistently improve various ViT-based models on ImageNet classification. After pre-training with TransMix on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection, and instance segmentation. TransMix also exhibits stronger robustness when evaluated on four different benchmarks. Code will be made publicly available at https://github.com/beckschen/transmix.
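A simplified sketch of what attention-based label re-weighting can look like, assuming the model exposes the class token's attention over patch tokens and a binary mask marks which patches were pasted from the second image; the exact normalization in TransMix may differ.

    import torch

    def attention_reweighted_lambda(cls_attn, patch_mask_b):
        """cls_attn: (B, N) class-token attention over N patch tokens.
        patch_mask_b: (B, N) binary mask, 1 where the patch comes from image B."""
        attn = cls_attn / cls_attn.sum(dim=1, keepdim=True)   # normalize per image
        lam_b = (attn * patch_mask_b).sum(dim=1)              # attention mass on B's patches
        return 1.0 - lam_b, lam_b                             # label weights for A and B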
CutMix is a popular augmentation technique commonly used for training modern convolutional and transformer vision networks. It was originally designed to encourage convolutional neural networks (CNNs) to focus more on an image's global context rather than on local information, which greatly improves the performance of CNNs. However, we find it brings limited benefits for transformer-based architectures, which naturally have a global receptive field. In this paper, we propose a novel data augmentation technique, TokenMix, to improve the performance of vision transformers. TokenMix mixes two images at the token level by partitioning the mixed region into multiple separated parts. Moreover, we show that the mixed learning target in CutMix, a linear combination of a pair of ground-truth labels, can be inaccurate and sometimes counter-intuitive. To obtain a more suitable target, we propose to assign the target score according to the content-based neural activation maps of the two images from a pre-trained teacher model, which does not need to have high performance. Through extensive experiments on various vision transformer architectures, we show that the proposed TokenMix helps vision transformers focus on foreground regions to infer the class and enhances their robustness, with consistent performance gains. Notably, we improve DeiT-T/S/B by +1% ImageNet top-1 accuracy. Moreover, TokenMix benefits from longer training, achieving 81.2% top-1 accuracy on ImageNet with DeiT-S trained for 400 epochs. Code is available at https://github.com/sense-x/tokenmix.
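The token-level mask and the teacher-based target described above could be sketched roughly as follows; the mask-sampling loop and the normalization are simplifications for illustration rather than the released implementation.

    import torch

    def sample_token_mask(grid_h, grid_w, mix_ratio=0.5):
        """Return a (grid_h, grid_w) 0/1 token mask made of several separated blocks."""
        mask = torch.zeros(grid_h, grid_w)
        target = int(mix_ratio * grid_h * grid_w)
        while int(mask.sum()) < target:
            bh = torch.randint(1, grid_h // 2 + 1, (1,)).item()
            bw = torch.randint(1, grid_w // 2 + 1, (1,)).item()
            top = torch.randint(0, grid_h - bh + 1, (1,)).item()
            left = torch.randint(0, grid_w - bw + 1, (1,)).item()
            mask[top:top + bh, left:left + bw] = 1.0  # tokens taken from image B
        return mask

    def teacher_based_target(act_a, act_b, mask, y_a, y_b):
        """act_*: (grid_h, grid_w) teacher activation maps; y_*: one-hot labels."""
        score_a = (act_a * (1 - mask)).sum() / (act_a.sum() + 1e-6)  # activation A keeps
        score_b = (act_b * mask).sum() / (act_b.sum() + 1e-6)        # activation B contributes
        z = score_a + score_b + 1e-6
        return (score_a / z) * y_a + (score_b / z) * y_b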
Regional dropout strategies have been proposed to enhance the performance of convolutional neural network classifiers. They have proved to be effective for guiding the model to attend on less discriminative parts of objects (e.g. leg as opposed to head of a person), thereby letting the network generalize better and have better object localization capabilities. On the other hand, current methods for regional dropout remove informative pixels on training images by overlaying a patch of either black pixels or random noise. Such removal is not desirable because it leads to information loss and inefficiency during training. We therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images where the ground truth labels are also mixed proportionally to the area of the patches. By making efficient use of training pixels and retaining the regularization effect of regional dropout, CutMix consistently outperforms the state-of-the-art augmentation strategies on CIFAR and ImageNet classification tasks, as well as on the ImageNet weakly-supervised localization task. Moreover, unlike previous augmentation methods, our CutMix-trained ImageNet classifier, when used as a pretrained model, results in consistent performance gains in Pascal detection and MS-COCO image captioning benchmarks. We also show that CutMix improves the model robustness against input corruptions and its out-of-distribution detection performances. Source code and pretrained models are available at https://github.com/clovaai/CutMix-PyTorch.
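The core operation is compact enough to show directly; the sketch below follows the description in the abstract (cut a random box, paste it across the batch, and weight the two labels by the pasted area), with minor details such as the Beta prior on the box size assumed from common practice.

    import numpy as np
    import torch

    def cutmix(images, targets, alpha=1.0):
        """images: (B, C, H, W); targets: (B,) integer class labels."""
        lam = np.random.beta(alpha, alpha)
        perm = torch.randperm(images.size(0))
        H, W = images.shape[2:]
        cut_h, cut_w = int(H * np.sqrt(1 - lam)), int(W * np.sqrt(1 - lam))
        cy, cx = np.random.randint(H), np.random.randint(W)
        y1, y2 = np.clip(cy - cut_h // 2, 0, H), np.clip(cy + cut_h // 2, 0, H)
        x1, x2 = np.clip(cx - cut_w // 2, 0, W), np.clip(cx + cut_w // 2, 0, W)
        images[:, :, y1:y2, x1:x2] = images[perm, :, y1:y2, x1:x2]
        # Re-derive lambda from the area actually pasted.
        lam = 1.0 - (y2 - y1) * (x2 - x1) / (H * W)
        return images, targets, targets[perm], lam

The loss for the mixed batch is then the weighted sum lam * criterion(outputs, targets) + (1 - lam) * criterion(outputs, shuffled_targets).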
CutMix is a vital augmentation strategy that determines the performance and generalization ability of vision transformers (ViTs). However, the inconsistency between the mixed images and the corresponding labels harms its efficacy. Existing CutMix variants tackle this problem by generating more consistent mixed images or more precise mixed labels, but inevitably introduce heavy training overhead or require extra information, undermining ease of use. To this end, we propose an efficient and effective Self-Motivated image Mixing method (SMMix), which motivates both image and label enhancement by the model under training itself. Specifically, we propose a max-min attention region mixing approach that enriches the attention-focused objects in the mixed images. Then, we introduce a fine-grained label assignment technique that co-trains the output tokens of mixed images with fine-grained supervision. Moreover, we devise a novel feature consistency constraint to align features from mixed and unmixed images. Thanks to the careful design of the self-motivated paradigm, SMMix stands out from other CutMix variants with smaller training overhead and better performance. In particular, SMMix improves the accuracy of DeiT-T/S, CaiT-XXS-24/36, and PVT-T/S/M/L by more than +1% on ImageNet-1k. The generalization capability of our method is also demonstrated on downstream tasks and out-of-distribution datasets. Code of this project is available at https://github.com/ChenMnZ/SMMix.
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs). They mix two images as inputs for training and assign them with a mixed label with the same ratio. While they are shown effective for vision transformers (ViTs), we identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We empirically observe that the contributions of input tokens fluctuate as forward propagating, which might induce a different mixing ratio in the output tokens. The training target computed by the original data mixing strategy can thus be inaccurate, resulting in less effective training. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token. We reuse the computed attention at each layer for efficient token-label alignment, introducing only negligible additional training costs. Extensive experiments demonstrate that our method improves the performance of ViTs on image classification, semantic segmentation, object detection, and transfer learning tasks. Code is available at: https://github.com/Euphoria16/TL-Align.
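A bare-bones sketch of the alignment idea, assuming each patch token starts with the (soft) label of the image it was cut from and that the layer's attention weights are reused to re-mix those labels; the paper's exact update (handling of heads, residual connections, and the class token) may differ.

    import torch

    def update_token_labels(token_labels, attn):
        """token_labels: (B, N, num_classes) soft label carried by each token.
        attn: (B, heads, N, N) attention weights of the current layer."""
        attn_avg = attn.mean(dim=1)               # average over heads: (B, N, N)
        return torch.bmm(attn_avg, token_labels)  # labels follow the same mixing as features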
Data mixing has proved effective for improving the generalization ability of deep neural networks. While early methods mix samples via hand-crafted policies (e.g., linear interpolation), recent methods exploit saliency information to match mixed samples and labels through complex offline optimization. There is, however, a trade-off between precise mixing policies and optimization complexity. To address this challenge, we propose a novel automatic mixup (AutoMix) framework, in which the mixup policy is parameterized and serves the ultimate classification goal directly. Specifically, AutoMix reformulates mixup classification into two sub-tasks (i.e., mixed-sample generation and mixup classification) with corresponding sub-networks, and solves them in a bi-level optimization framework. For generation, a learnable lightweight mixup generator, Mix Block, is designed to produce mixed samples by modeling patch-wise relationships under the direct supervision of the corresponding mixed labels. To prevent the degradation and instability of bi-level optimization, we further introduce a momentum pipeline to train AutoMix in an end-to-end manner. Extensive experiments on nine image benchmarks demonstrate the superiority of AutoMix over the state of the art in various classification scenarios and downstream tasks.
Weakly supervised semantic segmentation (WSSS) is challenging, particularly when image-level labels are used to supervise pixel-level prediction. To bridge this gap, a class activation map (CAM) is usually generated to provide pixel-level pseudo labels. CAMs in convolutional neural networks suffer from partial activation, i.e., only the most discriminative regions are activated. Transformer-based methods, on the other hand, are highly effective at exploring global context with long-range dependency modeling, which can potentially alleviate the partial-activation problem. In this paper, we propose the first transformer-based WSSS approach and introduce the Gradient-weighted Element-wise Transformer Attention Map (GETAM). GETAM shows fine-scale activation for all feature-map elements, revealing different parts of the object across transformer layers. Furthermore, we propose an activation-aware label completion module to generate high-quality pseudo labels. Finally, we incorporate our method into an end-to-end framework for WSSS using double backward propagation. Extensive experiments on PASCAL VOC and COCO show that our results beat the state-of-the-art end-to-end methods by a significant margin and outperform most multi-stage methods.
Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based methods, our approach allows modeling global context already at the first layer and throughout the network. We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation. To do so, we rely on the output embeddings corresponding to image patches and obtain class labels from these embeddings with a point-wise linear decoder or a mask transformer decoder. We leverage models pre-trained for image classification and show that we can fine-tune them on moderately sized datasets available for semantic segmentation. The linear decoder already yields excellent results, but the performance can be further improved by a mask transformer generating class masks. We conduct an extensive ablation study to show the impact of the different parameters; in particular, performance is better for large models and small patch sizes. Segmenter attains excellent results for semantic segmentation. It outperforms the state of the art on both ADE20K and Pascal Context datasets and is competitive on Cityscapes.
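As a rough illustration of the point-wise linear decoder mentioned above, each patch embedding can be mapped to class logits, reshaped to the patch grid, and upsampled to pixel resolution; shapes and names below are illustrative, and the mask-transformer decoder is not shown.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LinearDecoder(nn.Module):
        def __init__(self, embed_dim, num_classes, patch_size):
            super().__init__()
            self.head = nn.Linear(embed_dim, num_classes)
            self.patch_size = patch_size

        def forward(self, patch_tokens, img_h, img_w):
            """patch_tokens: (B, N, D) output embeddings of the image patches."""
            logits = self.head(patch_tokens)                  # (B, N, K) per-patch class scores
            B, N, K = logits.shape
            gh, gw = img_h // self.patch_size, img_w // self.patch_size
            logits = logits.transpose(1, 2).reshape(B, K, gh, gw)
            return F.interpolate(logits, size=(img_h, img_w),
                                 mode="bilinear", align_corners=False)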
Mix-up training approaches have proven to be effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community has expanded mix-up methods in two directions, with extensive efforts to improve saliency-guided procedures but minimal focus on the random path, leaving the randomization domain underexplored. In this paper, inspired by the complementary strengths of the two directions, we introduce a novel method that lies at the junction of the two routes. By combining the best elements of randomness and saliency utilization, our method balances speed, simplicity, and accuracy. We name our method R-Mix following the concept of "Random Mix-up". We demonstrate its effectiveness in generalization, weakly supervised object localization, calibration, and robustness to adversarial attacks. Finally, in order to address the question of whether there exists a better decision protocol, we train a Reinforcement Learning agent that decides the mix-up policies based on the classifier's performance, reducing dependency on human-designed objectives and hyperparameter tuning. Extensive experiments further show that the agent is capable of performing at the cutting-edge level, laying the foundation for a fully automatic mix-up. Our code is released at [https://github.com/minhlong94/Random-Mixup].
Patch-based models, e.g., Vision Transformers (ViTs) and Mixers, have shown impressive results on various visual recognition tasks as alternatives to classic convolutional networks. While the initial patch-based models (ViTs) treated all patches equally, recent studies reveal that incorporating inductive bias like spatiality benefits the representations. However, most prior works solely focused on the location of patches, overlooking the scene structure of images. Thus, we aim to further guide the interaction of patches using the object information. Specifically, we propose OAMixer (object-aware mixing layer), which calibrates the patch mixing layers of patch-based models based on the object labels. Here, we obtain the object labels in an unsupervised or weakly supervised manner, i.e., no additional human annotation cost is necessary. Using the object labels, OAMixer computes a reweighting mask with a learnable scale parameter that intensifies the interaction of patches containing similar objects and applies the mask to the patch mixing layers. By learning an object-centric representation, we demonstrate that OAMixer improves the classification accuracy and background robustness of various patch-based models, including ViTs, MLP-Mixers, and ConvMixers. Moreover, we show that OAMixer enhances various downstream tasks, including large-scale classification, self-supervised learning, and multi-object recognition, verifying the generic applicability of OAMixer.
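A schematic sketch of the reweighting described above, assuming per-patch object ids are already available and that the patch-mixing weights are attention-like row-normalized matrices; the learnable-scale parameterization here is only one plausible reading of the abstract.

    import torch
    import torch.nn as nn

    class ObjectAwareReweight(nn.Module):
        def __init__(self):
            super().__init__()
            self.scale = nn.Parameter(torch.zeros(1))  # learnable reweighting intensity

        def forward(self, attn, obj_labels):
            """attn: (B, heads, N, N) patch-mixing weights; obj_labels: (B, N) object ids."""
            same = (obj_labels.unsqueeze(1) == obj_labels.unsqueeze(2)).float()  # (B, N, N)
            mask = torch.exp(self.scale * same).unsqueeze(1)  # boost same-object interactions
            attn = attn * mask
            return attn / attn.sum(dim=-1, keepdim=True)      # keep rows normalized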
The Mixup scheme suggests mixing a pair of samples to create an augmented training sample and has recently gained considerable attention for improving the generalizability of neural networks. A straightforward and widely used extension of Mixup is to combine it with regional dropout-like methods: removing random patches from a sample and replacing them with features from another sample. Despite their simplicity and effectiveness, these methods are prone to creating harmful samples due to their randomness. To address this issue, "maximum saliency" strategies were recently proposed: they select only the most informative features to prevent this phenomenon. However, they now lack sample diversification, as they always deterministically select the regions with maximum saliency, injecting bias into the augmented data. In this paper, we present a novel yet simple Mixup variant that captures the best of both worlds. Our idea is two-fold. By stochastically sampling the features and "grafting" them onto another sample, our method effectively generates diverse yet meaningful samples. Its second ingredient is to produce the label of the grafted sample by mixing the labels in a saliency-calibrated fashion, which rectifies the supervision misguidance introduced by the random sampling procedure. Our experiments on the CIFAR, Tiny-ImageNet, and ImageNet datasets show that our scheme not only outperforms the current state-of-the-art augmentation strategies in terms of classification accuracy, but is also superior in coping with stress conditions such as data corruption and object occlusion.
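The two ingredients (stochastic, saliency-driven patch selection and saliency-calibrated label mixing) could look roughly like the sketch below; the per-cell sampling rule and the calibration formula are illustrative guesses based only on the description above.

    import torch

    def graft(img_a, img_b, sal_a, sal_b, y_a, y_b, cell=32):
        """img_*: (C, H, W); sal_*: (H//cell, W//cell) per-cell saliency; y_*: one-hot labels."""
        gh, gw = sal_b.shape
        prob = sal_b / sal_b.sum()                                  # salient donor cells are likelier
        keep = torch.bernoulli((prob * gh * gw * 0.5).clamp(0, 1)).bool()
        out = img_a.clone()
        for i in range(gh):
            for j in range(gw):
                if keep[i, j]:                                      # graft this cell from image B
                    out[:, i*cell:(i+1)*cell, j*cell:(j+1)*cell] = \
                        img_b[:, i*cell:(i+1)*cell, j*cell:(j+1)*cell]
        # Calibrate the label weight by the saliency each source keeps in the result.
        s_b = (sal_b * keep).sum() / (sal_b.sum() + 1e-6)
        s_a = (sal_a * (~keep)).sum() / (sal_a.sum() + 1e-6)
        lam = s_a / (s_a + s_b + 1e-6)
        return out, lam * y_a + (1.0 - lam) * y_b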
Jitendra Malik once said, "Supervision is the opium of the AI researcher". Most deep learning techniques heavily rely on extreme amounts of human labels to work effectively. In today's world, the rate of data creation greatly surpasses the rate of data annotation. Full reliance on human annotations is just a temporary means to solve current closed problems in AI. In reality, only a tiny fraction of data is annotated. Annotation Efficient Learning (AEL) is a study of algorithms to train models effectively with fewer annotations. To thrive in AEL environments, we need deep learning techniques that rely less on manual annotations (e.g., image, bounding-box, and per-pixel labels), but learn useful information from unlabeled data. In this thesis, we explore five different techniques for handling AEL.
In this paper, we present a new approach for model acceleration by exploiting spatial sparsity in visual data. We observe that the final prediction in vision transformers is based only on a subset of the most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input, thereby accelerating vision transformers. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. Although the framework is inspired by our observation of the sparse attention in vision transformers, we find that the idea of adaptive and asymmetric computation can be a general solution for accelerating various architectures. We extend our method to hierarchical models, including CNNs and hierarchical vision transformers, as well as more complex dense prediction tasks that require structured feature maps, by formulating a more generic dynamic spatial sparsification framework with progressive sparsification and asymmetric computation for different spatial locations. By applying lightweight fast paths to less informative features and more expressive slow paths to more important locations, we can maintain the structure of feature maps while significantly reducing the overall computation. Extensive experiments demonstrate the effectiveness of our framework on various modern architectures and different visual recognition tasks. Our results clearly show that dynamic spatial sparsification offers a new and more effective dimension for model acceleration. Code is available at https://github.com/raoyongming/dynamicvit
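For illustration, a token-scoring-and-pruning step in the spirit of the framework above might look like the following; the module architecture and the hard top-k selection are simplifications (the paper trains a differentiable variant), and all names are illustrative.

    import torch
    import torch.nn as nn

    class TokenScorer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim // 4),
                                     nn.GELU(), nn.Linear(dim // 4, 1))

        def forward(self, tokens, keep_ratio=0.7):
            """tokens: (B, N, D) patch tokens (the class token is handled separately)."""
            scores = self.mlp(tokens).squeeze(-1)               # (B, N) importance per token
            k = max(1, int(keep_ratio * tokens.size(1)))
            idx = scores.topk(k, dim=1).indices                 # tokens to keep
            idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
            return torch.gather(tokens, 1, idx)                 # (B, k, D) pruned sequence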
Transformer models have shown promising effectiveness in handling various vision tasks. However, compared with training convolutional neural network (CNN) models, training vision transformer (ViT) models is more difficult and relies on large-scale training sets. To explain this observation, we make the hypothesis that ViT models are less effective than CNN models at capturing the high-frequency components of images, and verify it by a frequency analysis. Inspired by this finding, we first investigate the effects of existing techniques for improving ViT models from a new frequency perspective, and find that the success of some techniques (e.g., RandAugment) can be attributed to better use of high-frequency components. Then, to compensate for this insufficient ability of ViT models, we propose HAT, which directly augments the high-frequency components of images via adversarial training. We show that HAT can consistently boost the performance of various ViT models (e.g., +1.2% for ViT-B, +0.5% for Swin-B) and, in particular, raises the advanced model VOLO-D5 to 87.3% using only ImageNet-1K data; the advantage is also maintained on out-of-distribution data and transfers to downstream tasks. The code is available at: https://github.com/jiawangbai/hat.
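One simple way to realize "adversarially perturbing only high-frequency content", sketched below as a single FGSM-like step through an FFT mask; the frequency radius, the step size, and the exact optimization used in HAT are assumptions here.

    import torch
    import torch.nn.functional as F

    def high_freq_adversarial(model, images, labels, eps=4 / 255, radius=0.25):
        """Return images with a high-frequency-only adversarial perturbation added."""
        H, W = images.shape[-2:]
        fy = torch.fft.fftshift(torch.fft.fftfreq(H)).abs()
        fx = torch.fft.fftshift(torch.fft.fftfreq(W)).abs()
        high = ((fy[:, None] ** 2 + fx[None, :] ** 2).sqrt() > radius).to(images)  # high-freq mask
        images = images.clone().requires_grad_(True)
        loss = F.cross_entropy(model(images), labels)
        grad = torch.autograd.grad(loss, images)[0]
        # Project the gradient-sign perturbation onto high frequencies only.
        g_fft = torch.fft.fftshift(torch.fft.fft2(grad.sign()), dim=(-2, -1)) * high
        delta = torch.fft.ifft2(torch.fft.ifftshift(g_fft, dim=(-2, -1))).real
        return (images + eps * delta).detach().clamp(0, 1)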
Diagnosing cancer from histology data using deep learning models presents several challenges. Cancer grading and localization of regions of interest (ROIs) in these images normally rely on both image- and pixel-level labels, the latter requiring a costly annotation process. Deep weakly supervised object localization (WSOL) methods provide different strategies for low-cost training of deep learning models. Using only image-level annotations, these methods can be trained to classify an image and to produce class activation maps (CAMs) for ROI localization. This paper reviews state-of-the-art DL methods for WSOL. We propose a taxonomy in which these methods are divided into bottom-up and top-down methods according to the information flow in the models. Although progress on the latter has been limited, recent bottom-up methods are currently driving much of the progress in deep WSOL. Early works focused on designing different spatial pooling functions. However, these methods reached limited localization accuracy and revealed a major limitation: the under-activation of CAMs, which leads to high false-negative localization. Subsequent works aimed to mitigate this issue and recover complete objects. Classification and localization accuracy of the methods in our taxonomy are evaluated and compared on two challenging histology datasets. Overall, the results indicate poor localization performance, particularly for generic methods initially designed to process natural images. Methods designed to address the challenges of histology data yield good results. However, all methods suffer from high false-positive/negative localization. Key challenges are identified for applying deep WSOL methods in histology: under-/over-activation of CAMs, sensitivity to thresholding, and model selection.
Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task in computer vision. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without any further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel framework called CLIP-ES for WSSS. Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to mitigate noise and focus on confident regions. Our proposed framework dramatically reduces the cost of training for WSSS and shows the capability of localizing objects in CLIP. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time previous methods need for pseudo-mask generation. Code is available at https://github.com/linyq2117/CLIP-ES.
Data augmentation has been widely used to improve the performance of deep neural networks. Numerous methods, such as dropout, regularization, and image augmentation, have been proposed to avoid overfitting and enhance the generalization of neural networks. One sub-area of data augmentation is image mixing and deleting. This specific type of augmentation either mixes two images or deletes image regions to hide or obscure certain features of an image, forcing the model to emphasize the overall structure of the object in the image. Models trained with this approach have been shown to perform well compared with training without mixing or deleting. An additional benefit of this kind of training is robustness against image corruption. Owing to its low computational cost and recent success, many image mixing and deleting techniques have been proposed. This paper provides a detailed review of these methods, dividing the augmentation strategies into three main categories: cut and delete, cut and mix, and mix up. The second part of the paper evaluates these methods on image classification, fine-grained image recognition, and object detection, where it is shown that this category of data augmentation improves the overall performance of deep neural networks.
Vision transformers (ViTs) have demonstrated impressive performance across various machine vision problems. These models are based on a multi-head self-attention mechanism that can flexibly attend to a sequence of image patches to encode contextual cues. An important question is how such flexibility in attending to image-wide context conditioned on a given patch helps handle nuisances in natural images, e.g., severe occlusions, domain shifts, spatial permutations, and adversarial and natural perturbations. We systematically study this question through an extensive set of experiments covering three ViT families and comparisons with high-performing convolutional neural networks (CNNs). We show and analyze the following intriguing properties of ViTs: (a) Transformers are highly robust to severe occlusions, perturbations, and domain shifts, e.g., they retain as much as 60% top-1 accuracy on ImageNet even after 80% of the image content is randomly occluded. (b) This robustness to occlusion is not due to a bias towards local textures; compared with CNNs, ViTs are significantly less biased towards textures. When properly trained to encode shape-based features, ViTs demonstrate shape recognition capability comparable to that of the human visual system, previously unmatched in the literature. (c) Using ViTs to encode shape representations leads to the interesting consequence of accurate semantic segmentation without pixel-level supervision. (d) Off-the-shelf features from a single ViT model can be combined to create a feature ensemble, leading to high accuracy across a range of classification datasets in both traditional and few-shot learning paradigms. We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
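The random patch-occlusion protocol referred to in (a) can be reproduced with a few lines; the patch size and the zero-filling choice below are assumptions for illustration.

    import torch

    def occlude_patches(images, drop_ratio=0.8, patch=16):
        """images: (B, C, H, W). Zero out a random subset of non-overlapping patches."""
        B, _, H, W = images.shape
        gh, gw = H // patch, W // patch
        keep = torch.rand(B, gh, gw, device=images.device) > drop_ratio        # True = keep patch
        mask = keep.repeat_interleave(patch, 1).repeat_interleave(patch, 2)    # (B, H, W)
        return images * mask.unsqueeze(1).to(images.dtype)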
Many recent semi-supervised learning (SSL) studies build a teacher-student architecture and train the student network with supervisory signals generated by the teacher. The data augmentation strategy plays a significant role in SSL frameworks, since it is hard to create weak-strong augmented input pairs without losing label information. In particular, when extending SSL to semi-supervised object detection (SSOD), many strong augmentation methods related to image geometry and interpolation regularization are difficult to use, as they may corrupt the location information of bounding boxes in the object detection task. To address this, we introduce a simple yet effective data augmentation method, Mix/UnMix (MUM), which unmixes feature tiles for the mixed image tiles of the SSOD framework. Our proposed method mixes input image tiles and reconstructs them in the feature space. Thus, MUM can enjoy the interpolation-regularization effect from non-interpolated pseudo labels and successfully generate meaningful weak-strong pairs. Furthermore, MUM can easily be equipped on various SSOD methods. Extensive experiments on the MS-COCO and PASCAL VOC datasets demonstrate the superiority of MUM by consistently improving the mAP over the baseline in all tested SSOD benchmark protocols.
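The mix/unmix pair described above can be sketched as a tile shuffle before the backbone and the inverse shuffle on the feature map afterwards; the grid size and the batch-wise grouping are illustrative, and this is not the reference implementation.

    import torch

    def mix_tiles(x, grid=2):
        """x: (B, C, H, W). Shuffle each tile position independently across the batch."""
        B, _, H, W = x.shape
        th, tw = H // grid, W // grid
        perms = [torch.randperm(B) for _ in range(grid * grid)]
        out, t = x.clone(), 0
        for i in range(grid):
            for j in range(grid):
                out[:, :, i*th:(i+1)*th, j*tw:(j+1)*tw] = \
                    x[perms[t], :, i*th:(i+1)*th, j*tw:(j+1)*tw]
                t += 1
        return out, perms

    def unmix_tiles(feat, perms, grid=2):
        """feat: (B, C, h, w) backbone features of the mixed batch; undo the tile shuffle."""
        B, _, h, w = feat.shape
        th, tw = h // grid, w // grid
        out, t = feat.clone(), 0
        for i in range(grid):
            for j in range(grid):
                inv = torch.argsort(perms[t])  # inverse permutation for this tile position
                out[:, :, i*th:(i+1)*th, j*tw:(j+1)*tw] = \
                    feat[inv, :, i*th:(i+1)*th, j*tw:(j+1)*tw]
                t += 1
        return out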