Deep learning models trained in a fully supervised manner have been shown to rely on so-called "shortcut" features. Shortcut features are inputs that are correlated with the outcome of interest in the training data but are no longer correlated with, or present in, the test or deployment setting. Here we provide experiments showing that recent self-supervised models trained on images and text provide more robust image representations and reduce the model's reliance on visual shortcut features in a realistic medical imaging example. Additionally, we find that these self-supervised models "forget" shortcut features more quickly than fully supervised models when fine-tuned on labeled data. Though not a complete solution, our experiments provide compelling evidence that self-supervised models trained on images and text provide resilience against visual shortcut features.
translated by 谷歌翻译
In this paper, we study the effect of a novel regularization scheme on contrastive language-image pre-trained (CLIP) models. Our approach is based on the observation that, in many domains, text tokens should only describe a small number of image regions and, likewise, each image region should correspond to only a few text tokens. In CLIP-style models, this implies that text-token embeddings should have high similarity to only a small number of image-patch embeddings for a given image-text pair. We formalize this observation using a novel regularization scheme that penalizes the entropy of the text-token to image-patch similarity scores. We qualitatively and quantitatively demonstrate that the proposed regularization scheme shrinks the text-token and image-patch similarity scores towards zero, thus achieving the desired effect. We demonstrate the promise of our approach in an important medical context where this underlying hypothesis naturally arises. Using our proposed approach, we achieve state-of-the-art (SOTA) zero-shot performance on all tasks from the CheXpert chest x-ray dataset, outperforming an unregularized version of the model and several recently published self-supervised models.
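The entropy penalty described in the regularization abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the per-token softmax over patches, the penalty weight `lam`, and the function names are all illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_patch_entropy(sim):
    """Mean entropy of the text-token -> image-patch distributions.

    sim: (num_tokens, num_patches) similarity scores for one image-text pair.
    Low entropy means each token attends to only a few patches, which is the
    sparsity the regularizer encourages.
    """
    p = softmax(sim, axis=-1)                    # per-token distribution over patches
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)  # entropy per token
    return ent.mean()

def regularized_loss(clip_loss, sim, lam=0.1):
    """Total loss = contrastive loss + entropy penalty (hypothetical weighting)."""
    return clip_loss + lam * token_patch_entropy(sim)
```

Minimizing the added term pushes each token's similarity mass onto a few patches, shrinking the remaining scores toward zero as the abstract describes.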
Deep neural networks have been successfully adopted in diverse domains, including pathology classification based on medical images. However, large-scale, high-quality data for training powerful neural networks are rare in the medical domain, as labeling must be done by qualified experts. Researchers recently tackled this problem with some success by taking advantage of models pre-trained on large-scale general-domain data. Specifically, researchers took contrastive image-text encoders (e.g., CLIP) and fine-tuned them with chest X-ray images and paired reports to perform zero-shot pathology classification, thus completely removing the need for pathology-annotated images to train a classification model. Existing studies, however, fine-tuned the pre-trained model with the same contrastive learning objective and failed to exploit the multi-labeled nature of medical image-report pairs. In this paper, we propose a new fine-tuning strategy based on sentence sampling and positive-pair loss relaxation for improving downstream zero-shot pathology classification performance, which can be applied to any pre-trained contrastive image-text encoder. Our method consistently showed dramatically improved zero-shot pathology classification performance on four different chest X-ray datasets and three different pre-trained models (5.77% average AUROC increase). In particular, fine-tuning CLIP with our method performed comparably to, or marginally better than, board-certified radiologists (0.619 vs 0.625 in F1 score and 0.530 vs 0.544 in MCC) in zero-shot classification of five prominent diseases from the CheXpert dataset.
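The sentence-sampling idea from the fine-tuning strategy above can be sketched as follows. This is a hypothetical illustration: the paper's actual sampling procedure and its positive-pair loss relaxation are not reproduced; `sample_positive_sentence` is an assumed helper name and splitting on periods is a simplification.

```python
import random

def sample_positive_sentence(report: str, rng: random.Random) -> str:
    """Pair the image with one randomly chosen report sentence rather than the
    whole report, so each finding mentioned in a multi-label report can serve
    as its own positive text during contrastive fine-tuning."""
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    return rng.choice(sentences)
```

Over many epochs each image is paired with different sentences of its report, exposing the model to every labeled finding rather than a single monolithic caption.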
Learning visual representations of medical images (e.g., X-rays) is core to medical image understanding, but its progress has been held back by the scarcity of human annotations. Existing work commonly relies on fine-tuning weights transferred from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or on rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. Meanwhile, several recent studies show exciting results from unsupervised contrastive learning on natural images, but we find these methods help little on medical images because of their high inter-class similarity. We propose ConVIRT, an alternative unsupervised strategy to learn medical visual representations by exploiting naturally occurring paired descriptive text. Our new method of pretraining medical image encoders via a bidirectional contrastive objective between the two modalities is domain-agnostic and requires no additional expert input. We test ConVIRT by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that it leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet-initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.
Multi-modal data abounds in biomedicine, such as radiology images and reports. Interpreting these data at scale is essential for improving clinical care and accelerating clinical research. Compared with the general domain, biomedical text with its complex semantics poses additional challenges in vision-language modelling, and previous work has used insufficiently adapted models that lack domain-specific language understanding. In this paper, we show that principled textual semantic modelling can substantially improve contrastive learning in self-supervised vision-language processing. We release a language model that achieves state-of-the-art results in radiology natural language inference through an improved vocabulary and a novel language pretraining objective that leverages semantic and discourse characteristics of radiology reports. Furthermore, we propose a self-supervised joint vision-language approach with a focus on better text modelling. It establishes new state-of-the-art results on a wide range of publicly available benchmarks, in part by leveraging our new domain-specific language model. We release a new dataset with locally-aligned phrase grounding annotations by radiologists to facilitate the study of complex semantic modelling in biomedical vision-language processing. A broad evaluation, including on this new dataset, shows that our contrastive learning approach, aided by textual semantic modelling, outperforms prior methods on segmentation tasks despite using only a global alignment objective.
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
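The pre-training task in the abstract above, predicting which caption goes with which image, is commonly implemented as a symmetric contrastive (InfoNCE) loss over a batch of embedding pairs. A minimal numpy sketch of that loss follows; the temperature value and function names are illustrative, not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of N (image, text) pairs.

    Matched pairs sit on the diagonal of the N x N cosine-similarity matrix;
    each row and each column is treated as an N-way classification problem
    whose correct answer is the diagonal entry.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature                     # (N, N)
    loss_i = -np.diag(log_softmax(logits, axis=1)).mean()  # image -> text
    loss_t = -np.diag(log_softmax(logits, axis=0)).mean()  # text -> image
    return (loss_i + loss_t) / 2
```

Training drives matched pairs toward high similarity and mismatched pairs toward low similarity, which is what later enables zero-shot classification from class-name prompts.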
Automated analysis of chest radiographs using deep learning has tremendous potential to enhance the clinical diagnosis of disease in patients. However, deep learning models typically require large amounts of annotated data to achieve high performance, often an obstacle to adoption in the medical domain. In this paper, we build a framework that leverages radiology reports to improve medical image classification performance with limited labeled data (fewer than 1,000 examples). Specifically, we examine image-captioning pretraining to learn high-quality medical image representations that can be trained with fewer examples. Following joint pretraining of a convolutional encoder and a transformer decoder, we transfer the learned encoder to various classification tasks. Averaged over 9 pathologies, we find that our model achieves higher classification performance than both ImageNet-supervised and in-domain supervised pretraining when labeled training data is limited.
Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of \emph{adapting large-scale models for zero-shot adversarial robustness}. We first identify two key factors during model adaptation -- training losses and adaptation methods -- that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaptation methods, model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of texts, while finetuning wins in the presence of text guidance. Overall, our approach significantly improves the zero-shot adversarial robustness over CLIP, with an average improvement of over 31 points across ImageNet and 15 zero-shot datasets. We hope this work sheds light on understanding the zero-shot adversarial robustness of large-scale models.
We present a combined scaling method, called BASIC, that achieves 85.7% top-1 zero-shot accuracy on the ImageNet ILSVRC-2012 validation set, surpassing the best published zero-shot models, CLIP and ALIGN, by 9.3%. Our BASIC model also shows significant improvements on robustness benchmarks. For instance, on 5 test sets with natural distribution shifts, such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our model achieves 83.7% top-1 average accuracy, only a small drop from its original ImageNet accuracy. To achieve these results, we scale up the contrastive learning framework of CLIP and ALIGN in three dimensions: data size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs, 4x larger than ALIGN's and 16x larger than CLIP's. Our largest model has 3B weights, 3.75x larger in parameters and FLOPs than ALIGN and CLIP. Our batch size is 65536, 2x larger than CLIP's and 4x larger than ALIGN's. The main challenge of scaling is the limited memory of accelerators such as GPUs and TPUs. We therefore propose a simple method of online gradient caching to overcome this limit.
Contrastive learning has proven effective for pre-training image models on unlabeled data, with promising results for tasks such as medical image classification. Using paired text and images (e.g., radiology reports and images) during pre-training improves the results even further. Still, most existing methods target image classification as the downstream task and may not be optimal for localized tasks like semantic segmentation or object detection. We therefore propose Localized representation learning from Vision and Text (LoVT), to our best knowledge the first text-supervised pre-training method targeting localized medical imaging tasks. Our method combines instance-level image-report contrastive learning with local contrastive learning on image-region and report-sentence representations. We evaluate LoVT and commonly used pre-training methods on a novel evaluation framework consisting of 18 localization tasks on chest X-rays from five public datasets. While there is no single best method, LoVT performs best on 11 of the 18 studied tasks, making it the method of choice for localized tasks.
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission to solve real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and textual representations to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, which expands the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from web-scale image-text data, our Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully-sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer for novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results on 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with top-1 accuracy of 83.74 and top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
Understanding visual and linguistic representations of product content is vital for search and recommendation applications in e-commerce. As the backbone of an online shopping platform, and inspired by recent successes in representation learning research, we propose a contrastive learning framework that aligns language and vision models using unlabeled raw product text and images. We present the techniques we used to train a large-scale representation learning model and share solutions to domain-specific challenges. We study the pre-trained model as a backbone for multiple downstream tasks, including category classification, attribute extraction, product matching, product clustering, and adult-product recognition. Experimental results show that our proposed method outperforms single-modality and multi-modality baselines on each downstream task.
Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data and models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip
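Power-law scaling of the kind the study above reports is typically characterized by fitting y = a * x^b in log-log space. A short sketch of such a fit, with illustrative variable names that are not taken from the paper:

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x^b by ordinary least squares in log-log space.

    Taking logs turns the power law into a line: log y = log a + b * log x,
    so a degree-1 polynomial fit recovers the exponent b and prefactor a.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    return np.exp(intercept), slope  # (a, b)
```

Fitting exponents like this per task is what lets the authors compare scaling behavior between the OpenAI and OpenCLIP training distributions.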
Contrastively trained language-image models such as CLIP, ALIGN, and BASIC have demonstrated unprecedented robustness to multiple challenging natural distribution shifts. Since these language-image models differ from previous training approaches in several ways, an important question is what causes the large robustness gains. We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause of the robustness gains, with the other factors contributing little to no robustness. Beyond our experimental results, we also introduce ImageNet-Captions, a version of ImageNet with original text annotations from Flickr, to enable further controlled experiments on language-image training.
This paper presents contrastive-tuning, a simple method that employs contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study, we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Text tuning" (LiT-tuning), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT-tuned model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT-tuning is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers, and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT-tuned model achieves 84.5% zero-shot transfer accuracy on the ImageNet test set, and 81.1% on the challenging out-of-distribution ObjectNet test set.
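The "locked image tower" idea above has a practical consequence: since the image encoder never changes, its embeddings can be computed once and cached, and only the text side receives gradient updates. A toy numpy sketch under simplifying assumptions (a linear text projection and a squared-error alignment objective standing in for the contrastive loss; all names are illustrative):

```python
import numpy as np

def lit_training_step(image_emb_cached, text_tokens, W_text, lr=0.05):
    """One LiT-style update: the image tower is locked, so its embeddings are
    precomputed and cached; only the text projection W_text is trained to move
    text embeddings toward their paired (fixed) image embeddings.

    Objective (stand-in for the contrastive loss): 0.5 * ||X W - I||^2 / N.
    """
    txt = text_tokens @ W_text                     # (N, D) text embeddings
    grad_W = text_tokens.T @ (txt - image_emb_cached) / len(txt)
    return W_text - lr * grad_W                    # image tower untouched
```

Caching the locked tower's outputs is part of what makes this setup cheap: the expensive image encoder runs only once per training image.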
Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization on many downstream tasks with properly designed text prompts. Instead of relying on hand-engineered prompts, recent works learn prompts using training data from downstream tasks. While effective, training on domain-specific data reduces a model's generalization capability to unseen new domains. In this work, we propose test-time prompt tuning (TPT), a method that can learn adaptive prompts on the fly with a single test sample. For image classification, TPT optimizes the prompt by minimizing entropy with confidence selection, so that the model has consistent predictions across different augmented views of each test sample. In evaluating generalization to natural distribution shifts, TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with state-of-the-art approaches that use additional training data. Project page: https://azshue.github.io/tpt.
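The entropy-minimization-with-confidence-selection objective described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt optimizer itself is omitted, and the `keep_fraction` value and function names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def tpt_objective(view_logits, keep_fraction=0.5):
    """TPT-style objective for one test image: given class logits for several
    augmented views, keep only the most confident (lowest-entropy) views,
    average their predicted distributions, and return the entropy of that
    average. The prompt is then tuned to minimize this value."""
    probs = softmax(view_logits, axis=-1)   # (num_views, num_classes)
    ent = entropy(probs)                    # per-view entropy
    k = max(1, int(len(ent) * keep_fraction))
    keep = np.argsort(ent)[:k]              # confidence selection
    avg = probs[keep].mean(axis=0)
    return entropy(avg)
```

Minimizing this quantity with respect to the prompt pushes the model toward confident, consistent predictions across augmentations of the single test sample.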
Pre-training lays the foundation for recent successes in deep-learning-powered radiograph analysis. It learns transferable image representations by conducting large-scale fully supervised or self-supervised learning on a source domain. However, supervised pre-training requires a complex and labor-intensive two-stage human-assisted annotation process, while self-supervised learning cannot compete with the supervised paradigm. To tackle these issues, we propose a cross-supervised methodology named REviewing FreE-text Reports for Supervision (REFERS), which acquires free supervision signals from the original radiology reports accompanying the radiographs. The method employs a vision transformer and is designed to learn joint representations from multiple views within each patient study. Under extremely limited supervision, REFERS outperforms its transfer learning and self-supervised learning counterparts on 4 well-known X-ray datasets. Moreover, REFERS even surpasses methods based on a source domain of radiographs with human-assisted structured labels. It thus has the potential to replace canonical pre-training methodologies.
We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss, intending to train an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. With such an alignment, a model can identify regions of an image corresponding to a given text input, and therefore transfer seamlessly to the task of open vocabulary semantic segmentation without requiring any segmentation annotations during training. Using pre-trained CLIP encoders with PACL, we are able to set the state-of-the-art on the task of open vocabulary zero-shot segmentation on 4 different segmentation benchmarks: Pascal VOC, Pascal Context, COCO Stuff and ADE20K. Furthermore, we show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy compared to CLIP, across a suite of 12 image classification datasets.
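The patch-to-text alignment in the PACL abstract above can be sketched with a similarity-weighted pooling of patch tokens against the text CLS embedding. This is a rough illustration under stated assumptions, not the paper's exact compatibility function; all names are illustrative.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def pacl_compatibility(patch_tokens, text_cls):
    """Score each image patch against the text CLS embedding, turn the scores
    into attention weights over patches, pool the patches with those weights,
    and return both the pooled image-text similarity (usable in a contrastive
    loss) and the per-patch weights (which act as a free segmentation mask)."""
    patches = l2_normalize(patch_tokens)              # (P, D)
    text = l2_normalize(text_cls)                     # (D,)
    scores = patches @ text                           # (P,) patch-text cosine
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over patches
    pooled = weights @ patches                        # weighted patch pooling
    return l2_normalize(pooled) @ text, weights
```

Because the weights concentrate on patches that match the text, thresholding them at inference time yields a segmentation for an arbitrary text query without any segmentation labels, which is the open-vocabulary transfer the abstract describes.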
Recent advances in vision-language pre-training have shown impressive performance across diverse vision tasks, shedding light on the long-standing problem in AI research of comprehensively understanding both visual and textual concepts. In the medical domain, however, the limited amount and diversity of data available for vision-language pre-training has hindered successful learning of joint vision-language concepts. In this study, we introduce MAX-VL, a model tailored for effective vision-language pre-training in the medical domain. We experimentally demonstrate that the pre-trained MAX-VL model outperforms current state-of-the-art vision-language models on various vision tasks. We also suggest its clinical utility for diagnosing newly emerging diseases and detecting human error, and show the model's wide applicability to data from different domains.