Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, this efficient augmentation strategy has been found to adversely affect the accuracy of CLIP-based training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to apply multiple augmentation views to the image, as well as introducing instance contrastive learning tasks between these views into the CLIP framework. Compared to other CLIP improvements that combine different pre-training targets such as SLIP and MaskCLIP, our method is not only more effective, but also much more efficient. Specifically, using ViT-B and YFCC-15M dataset, our approach achieves $43.9\%$ top-1 accuracy on ImageNet-1K zero-shot classification, as well as $62.7/42.1$ and $38.0/23.2$ I2T/T2I retrieval accuracy on Flickr30K and MS COCO, which are $+1.1\%$, $+5.5/+0.9$, and $+4.4/+1.3$ higher than the SLIP method, while being $2.30\times$ faster. An efficient version of our approach running $1.16\times$ faster than the plain CLIP model achieves significant gains of $+5.3\%$, $+11.3/+8.0$, and $+9.5/+4.9$ on these benchmarks.
translated by 谷歌翻译
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants and only 50% of the participants performed ensembling based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
translated by 谷歌翻译
In the presence of noisy labels, designing robust loss functions is critical for securing the generalization performance of deep neural networks. Cross Entropy (CE) loss has been shown to be not robust to noisy labels due to its unboundedness. To alleviate this issue, existing works typically design specialized robust losses with the symmetric condition, which usually lead to the underfitting issue. In this paper, our key idea is to induce a loss bound at the logit level, thus universally enhancing the noise robustness of existing losses. Specifically, we propose logit clipping (LogitClip), which clamps the norm of the logit vector to ensure that it is upper bounded by a constant. In this manner, CE loss equipped with our LogitClip method is effectively bounded, mitigating the overfitting to examples with noisy labels. Moreover, we present theoretical analyses to certify the noise-tolerant ability of LogitClip. Extensive experiments show that LogitClip not only significantly improves the noise robustness of CE loss, but also broadly enhances the generalization performance of popular robust losses.
translated by 谷歌翻译
The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed the name DDCap, to allow more decoding flexibility. Unlike image generation, where the output is continuous and redundant with a fixed length, texts in image captions are categorical and short with varied lengths. Therefore, naively applying the discrete diffusion model to text decoding does not work well, as shown in our experiments. To address the performance gap, we propose several key techniques including best-first inference, concentrated attention mask, text length prediction, and image-free training. On COCO without additional caption pre-training, it achieves a CIDEr score of 117.8, which is +5.0 higher than the auto-regressive baseline with the same architecture in the controlled setting. It also performs +26.8 higher CIDEr score than the auto-regressive baseline (230.3 v.s.203.5) on a caption infilling task. With 4M vision-language pre-training images and the base-sized model, we reach a CIDEr score of 125.1 on COCO, which is competitive to the best well-developed auto-regressive frameworks. The code is available at https://github.com/buxiangzhiren/DDCap.
translated by 谷歌翻译
自我监督学习的一个重要目标是使模型预训练能够从几乎无限的数据中受益。但是,一种最近变得流行的方法,即掩盖图像建模(MIM),被怀疑无法从较大的数据中受益。在这项工作中,我们通过广泛的实验打破了这一误解,数据量表从10 \%imagenet-1k到完整的Imagenet-22K,型号的尺寸从4,900万到10亿,培训长度从125k迭代到500k迭代迭代范围不等。我们的研究表明:(i)蒙版的图像建模也要求对较大的数据进行要求。我们观察到,非常大的模型被相对较小的数据过度。 (ii)培训的时间长度。接受掩盖图像建模训练的大型模型可以从更多的数据中受益,并具有更长的培训。 (iii)预训练中的验证损失是衡量模型在多个任务上进行微调的表现的好指标。该观察结果使我们能够预先评估预训练的模型,而无需对下游任务进行昂贵的试用和错误评估。我们希望我们的发现能够从缩放能力方面提高对蒙版图像建模的理解。
translated by 谷歌翻译
蒙版的图像建模(MIM)学习具有非常好的微调性能的表示形式,掩盖了先前普遍的预训练方法,例如图像分类,实例对比度学习和图像文本对齐。在本文中,我们表明,通过以功能蒸馏(FD)形式进行简单的后处理,可以显着改善这些预训练方法的下部微调性能。功能蒸馏将旧表示形式转换为具有一些理想属性的新表示形式,就像MIM产生的表示一样。这些属性总共称为优化友好性,通过一组与注意力和优化相关的诊断工具来识别和分析。借助这些属性,新表示表现出强烈的微调性能。具体而言,对比度的自我监督学习方法在微调方面具有竞争力,就像最先进的蒙版图像建模(MIM)算法一样。剪辑模型的微调性能也得到了显着改善,夹子VIT-L模型达到\ TextBf {89.0%} TOP-1的ImagEnet-1K分类精度。在30亿参数SWINV2-G模型上,ADE20K语义分割的微调精度通过+1.5 miou提高到\ textbf {61.4 miou},创建了新记录。更重要的是,我们的工作为未来的研究提供了一种方法,可以将更多的精力集中在学习表现的通用性和可扩展性上,而不会与优化友好性相处,因为它可以很容易地增强。该代码将在https://github.com/swintransformer/feature-distillation上找到。
translated by 谷歌翻译
检测到分布输入对于在现实世界中安全部署机器学习模型至关重要。然而,已知神经网络遭受过度自信的问题,在该问题中,它们对分布和分布的输入的信心异常高。在这项工作中,我们表明,可以通过在训练中实施恒定的向量规范来通过logit归一化(logitnorm)(logitnorm)来缓解此问题。我们的方法是通过分析的激励,即logit的规范在训练过程中不断增加,从而导致过度自信的产出。因此,LogitNorm背后的关键思想是将网络优化期间输出规范的影响解散。通过LogitNorm培训,神经网络在分布数据和分布数据之间产生高度可区分的置信度得分。广泛的实验证明了LogitNorm的优势,在公共基准上,平均FPR95最高为42.30%。
translated by 谷歌翻译
我们提出了用于将Swin变压器缩放到3亿参数的技术,并使其能够使用高达1,536美元的图像培训1,536美元。通过缩放容量和分辨率,Swin变压器在四个代表视觉基准上设置新记录:84.0%的Top-1在Imagenet-V2图像分类准确度,63.1 / 54.4盒/掩模地图上的Coco对象检测,59.9 Miou在Ade20K语义细分中,在动力学-400视频动作分类上的86.8%的前1个精度。我们的技术通常适用于缩放视觉模型,这尚未广泛探索为NLP语言模型,部分原因是培训和应用中的困难:1)视觉模型经常面临规模的不稳定问题,2)许多下游愿景任务需要高分辨率图像或窗口,并且目前尚不清楚如何有效地将模型在低分辨率上预先培训到更高分辨率。当图像分辨率高时,GPU存储器消耗也是一个问题。为了解决这些问题,我们提出了几种技术,通过使用Swin Transformer作为案例研究来说明:1)归一化技术和缩放的余弦注意力,提高大视觉模型的稳定性; 2)一种日志间隔的连续位置偏置技术,以有效地将在低分辨率图像和窗口预先训练的模型转移到其更高分辨率的对应物。此外,我们分享了我们的关键实施细节,导致GPU内存消耗的大量节省,从而使得用常规GPU培训大型视觉模型可行。使用这些技术和自我监督的预训练,我们成功培训了强大的3B往返变压器模型,并有效地将其转移到涉及高分辨率图像或窗口的各种视觉任务,实现了各种最先进的准确性基准。
translated by 谷歌翻译
This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO testdev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-theart by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures. The code and models are publicly available at https://github. com/microsoft/Swin-Transformer.
translated by 谷歌翻译
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译