Modern retrieval systems often require recomputing the representation of every piece of data in the gallery when updating to a better representation model. This process is known as backfilling and can be especially costly in the real world, where the gallery often contains billions of samples. Recently, researchers have proposed the idea of Backward Compatible Training (BCT), where the new representation model is trained with an auxiliary loss to make it backward compatible with the old representation. In this way, the new representation can be directly compared with the old representation, in principle avoiding the need for any backfilling. However, follow-up work shows that there is an inherent tradeoff: a backward compatible representation model cannot simultaneously maintain the performance of the new model itself. This paper reports our ``not-so-surprising'' finding that adding extra dimensions to the representation can help here. However, we also found that naively increasing the dimension of the representation did not work. To deal with this, we propose Backward-compatible Training with a novel Basis Transformation ($BT^2$). A basis transformation (BT) is simply a learnable set of parameters that applies an orthonormal transformation. Such a transformation possesses an important property: the original information contained in its input is retained in its output. We show in this paper how a BT can be utilized to add only the necessary amount of additional dimensions. We empirically verify the advantage of $BT^2$ over other state-of-the-art methods in a wide range of settings. We then further extend $BT^2$ to other challenging yet more practical settings, including a significant change in model architecture (CNN to Transformers), a modality change, and even a series of updates to the model architecture mimicking the evolution of deep learning models.
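As a rough illustration of the orthonormal transformation described above, the following PyTorch sketch constrains a square linear layer to stay orthogonal during training; the module name, dimensions, and the split into "old" and "extra" coordinates are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class BasisTransform(nn.Module):
    """Learnable orthonormal map over an extended feature space.

    Because the weight is constrained to be orthogonal, the map preserves
    inner products and norms, so no information in the input representation
    is lost -- the property the abstract relies on.
    """

    def __init__(self, old_dim: int, extra_dim: int):
        super().__init__()
        dim = old_dim + extra_dim
        # Register an orthogonal parametrization on a square linear layer so
        # its weight matrix stays orthogonal throughout training.
        self.rotate = orthogonal(nn.Linear(dim, dim, bias=False))
        self.old_dim = old_dim

    def forward(self, new_feat: torch.Tensor) -> torch.Tensor:
        # new_feat: (batch, old_dim + extra_dim) features from the new model.
        z = self.rotate(new_feat)
        # The first old_dim coordinates can then be trained (via a BCT-style
        # loss) to stay comparable with the old gallery embeddings.
        return z

# Example: 128-d old gallery features extended by 32 extra dimensions.
bt = BasisTransform(old_dim=128, extra_dim=32)
out = bt(torch.randn(4, 160))
print(out.shape)  # torch.Size([4, 160])
```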
In visual retrieval systems, updating the embedding model requires recomputing the features of every piece of data. This expensive process is called backfilling. Recently, the idea of backward compatible training (BCT) was proposed. To avoid the cost of backfilling, BCT modifies the training of the new model so that its representations are compatible with those of the old model. However, BCT can significantly hinder the performance of the new model. In this work, we propose a new learning paradigm for representation learning: forward compatible training (FCT). In FCT, when the old model is trained, we also prepare for a future, as-yet-unknown version of the model. We propose learning side information, an auxiliary feature for each sample, that facilitates future model updates. To develop a powerful and flexible framework for model compatibility, we combine the side information with a forward transformation from the old embedding to the new embedding. Training of the new model is not modified, so its accuracy is not degraded. We demonstrate significant retrieval accuracy improvements over BCT on various datasets: ImageNet-1k (+18.1%), Places-365 (+5.4%), and VGGFace2 (+8.3%). FCT obtains model compatibility when the models are trained on different datasets, with different losses and architectures.
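A minimal sketch of what such a forward transformation could look like, assuming a small MLP over the concatenation of the old embedding and its stored side information; the dimensions, architecture, and cosine training loss here are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ForwardTransform(nn.Module):
    """Maps an old gallery embedding plus its stored side information into
    the new model's embedding space (a sketch of the FCT idea)."""

    def __init__(self, old_dim=256, side_dim=64, new_dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(old_dim + side_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, new_dim),
        )

    def forward(self, old_emb, side_info):
        return self.net(torch.cat([old_emb, side_info], dim=-1))

def fct_loss(transformed, new_emb):
    # Train the transform so the mapped old embedding matches the new
    # model's embedding of the same image (cosine loss as an example).
    return 1.0 - F.cosine_similarity(transformed, new_emb, dim=-1).mean()
```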
In this paper, we propose a new method for learning internal feature representation models that are compatible with previously learned ones. Compatible features allow old and new learned features to be compared directly, so that they can be used interchangeably over time. This eliminates the need, when sequentially upgrading the representation model, for a visual search system to extract new features for all previously seen images in the gallery set. Extracting new features is typically very expensive or infeasible for very large gallery sets and/or real-time systems (e.g., face recognition systems, social networks, lifelong learning systems, robotics, and surveillance systems). Our approach, Compatible Representations via Stationarity (CoReS), achieves compatibility by encouraging stationarity in the learned representation model, without relying on previously learned models. Stationarity keeps the statistical properties of features unchanged under time shift, so that the currently learned features interoperate with older versions. We evaluate single and sequential multi-model upgrades on growing large-scale training datasets and show that our method improves on the state of the art by a large margin in achieving compatible features. In particular, upgrading ten times with training data from CASIA-WebFace and evaluating on Labeled Faces in the Wild (LFW), we obtain a 49% increase in the average number of times compatibility is achieved, a 544% relative improvement over the previous state of the art.
Learning visual representations of medical images (e.g., X-rays) is central to medical image understanding, but its progress has been held back by the scarcity of human annotations. Existing work commonly relies on fine-tuning weights transferred from ImageNet pretraining, which is suboptimal because the image characteristics are drastically different, or on rule-based label extraction from the text reports paired with medical images, which is inaccurate and hard to generalize. Meanwhile, several recent studies have shown exciting results from contrastive learning on natural images, but we find these methods offer little help on medical images because of their high inter-class similarity. We propose ConVIRT, an alternative unsupervised strategy that learns medical visual representations by exploiting the naturally occurring paired descriptive text. Our new method of pretraining medical image encoders with a bidirectional contrastive objective between the two modalities is domain-agnostic and requires no additional expert input. We test ConVIRT by transferring the pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that it leads to image representations that outperform strong baselines in most settings. Notably, on all 4 classification tasks, our method needs only 10% of the labeled training data of an ImageNet-initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.
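The bidirectional contrastive objective mentioned above amounts to a symmetric InfoNCE loss over paired image and report embeddings, roughly as sketched below; the temperature, the equal weighting of the two directions, and the omission of projection heads are simplifying assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired image/report embeddings."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (batch, batch) similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> matching report
    loss_t2i = F.cross_entropy(logits.t(), targets)  # report -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```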
Continual Learning (CL) is a field dedicated to devising algorithms able to achieve lifelong learning. Overcoming the disruption of previously acquired knowledge, a drawback of deep learning models that goes by the name of catastrophic forgetting, is a hard challenge. Currently, deep learning methods can attain impressive results when the modeled data does not undergo a considerable distributional shift across subsequent learning sessions, but whenever we expose such systems to this incremental setting, performance drops very quickly. Overcoming this limitation is fundamental, as it would allow us to build truly intelligent systems showing both stability and plasticity. Secondly, it would allow us to overcome the onerous limitation of retraining these architectures from scratch on the newly updated data. In this thesis, we tackle the problem from multiple directions. In a first study, we show that in rehearsal-based techniques (systems that use a memory buffer), the quantity of data stored in the rehearsal buffer is a more important factor than the quality of the data. Secondly, we present one of the early works on incremental learning for ViT architectures, comparing functional, weight, and attention regularization approaches, and propose an effective novel asymmetric loss. We end with a study on pretraining and how it affects performance in Continual Learning, raising some questions about the effective progression of the field. We then conclude with some future directions and closing remarks.
Automated visual understanding of our diverse and open world demands computer vision models that generalize well with minimal customization for specific tasks, similar to human vision. Computer vision foundation models, which are trained on diverse, large-scale datasets and can be adapted to a wide range of downstream tasks, are critical for this mission of solving real-world computer vision applications. While existing vision foundation models such as CLIP, ALIGN, and Wu Dao 2.0 focus mainly on mapping images and text to a cross-modal shared representation, we introduce a new computer vision foundation model, Florence, to expand the representations from coarse (scene) to fine (object), from static (images) to dynamic (videos), and from RGB to multiple modalities (caption, depth). By incorporating universal visual-language representations from web-scale image-text data, our Florence model can be easily adapted to various computer vision tasks, such as classification, retrieval, object detection, VQA, image captioning, video retrieval, and action recognition. Moreover, Florence demonstrates outstanding performance in many types of transfer learning: fully-sampled fine-tuning, linear probing, few-shot transfer, and zero-shot transfer to novel images and objects. All of these properties are critical for our vision foundation model to serve general-purpose vision tasks. Florence achieves new state-of-the-art results on the majority of 44 representative benchmarks, e.g., ImageNet-1K zero-shot classification with a top-1 accuracy of 83.74 and a top-5 accuracy of 97.18, 62.4 mAP on COCO fine-tuning, 80.36 on VQA, and 87.8 on Kinetics-600.
Despite significant advances, the performance of state-of-the-art continual learning approaches hinges on the unrealistic scenario of fully labeled data. In this paper, we tackle this challenge and propose an approach for continual semi-supervised learning -- a setting where not all the data samples are labeled. An underlying issue in this scenario is the model forgetting representations of unlabeled data and overfitting the labeled ones. We leverage the power of nearest-neighbor classifiers to non-linearly partition the feature space and learn a strong representation for the current task, as well as distill relevant information from previous tasks. We perform a thorough experimental evaluation and show that our method outperforms all the existing approaches by large margins, setting a strong state of the art on the continual semi-supervised learning paradigm. For example, on CIFAR100 we surpass several others even when using at least 30 times less supervision (0.8% vs. 25% of annotations).
Continual learning aims to enable a single model to learn a sequence of tasks without catastrophic forgetting. Top-performing methods usually require a rehearsal buffer to store past pristine examples for experience replay, which, however, limits their practical value due to privacy and memory constraints. In this work, we present a simple yet effective framework, DualPrompt, which learns a tiny set of parameters, called prompts, to properly instruct a pre-trained model to learn tasks arriving sequentially without buffering past examples. DualPrompt presents a novel approach for attaching complementary prompts to the pre-trained backbone, and then formulates the objective as learning task-invariant and task-specific "instructions". With extensive experimental validation, DualPrompt consistently sets state-of-the-art performance under the challenging class-incremental setting. In particular, DualPrompt outperforms recent advanced continual learning methods that use relatively large buffer sizes. We also introduce a more challenging benchmark, Split ImageNet-R, to help generalize rehearsal-free continual learning research. The source code is available at https://github.com/google-research/l2p.
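A minimal sketch of the prompting idea, assuming learnable task-invariant and task-specific prompt tokens that are simply prepended to the frozen backbone's token sequence; the prompt lengths, initialization, and where each prompt attaches inside the backbone are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualPromptPool(nn.Module):
    """Task-invariant ("general") and task-specific ("expert") prompts that
    are prepended to a frozen backbone's token sequence. Only the prompts
    (and a classifier head) would be trained; the backbone stays frozen."""

    def __init__(self, n_tasks, g_len=5, e_len=20, dim=768):
        super().__init__()
        self.g_prompt = nn.Parameter(torch.randn(g_len, dim) * 0.02)
        self.e_prompts = nn.Parameter(torch.randn(n_tasks, e_len, dim) * 0.02)

    def forward(self, tokens, task_id):
        # tokens: (batch, seq_len, dim) patch embeddings from the backbone.
        b = tokens.size(0)
        g = self.g_prompt.unsqueeze(0).expand(b, -1, -1)
        e = self.e_prompts[task_id].unsqueeze(0).expand(b, -1, -1)
        return torch.cat([g, e, tokens], dim=1)
```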
Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision that provides finer-grained descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use the InfoNCE loss to train a model to predict the pairing between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. The inefficiency can be partially attributed to the fact that the image-text pairs are noisy. To address this, we propose OTTER (Optimal TransporT distillation for Efficient zero-shot Recognition), which uses online entropic optimal transport to find a soft image-text match as labels for contrastive learning. Based on pretrained image and text encoders, models trained with OTTER achieve strong performance with only 3M image-text pairs. Compared with the InfoNCE loss, label smoothing, and knowledge distillation, OTTER consistently outperforms these baselines in zero-shot evaluation on Google Open Images (19,958 classes) and the multi-labeled ImageNet 10K (10,032 classes) from Tencent ML-Images. Over 42 evaluations spanning 7 different dataset/architecture settings x 6 metrics, OTTER outperforms (32) or ties (2) all baselines in 34 of them.
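The "online entropic optimal transport" step can be pictured as a few Sinkhorn normalizations over the batch similarity matrix, as in the hedged sketch below; the cost construction, epsilon, and iteration count are illustrative, not the paper's exact settings. The resulting soft targets would then replace the one-hot labels in an InfoNCE-style loss.

```python
import torch

@torch.no_grad()
def sinkhorn_soft_targets(sim, epsilon=0.05, n_iters=3):
    """Entropy-regularized optimal transport over a (batch x batch)
    image-text similarity matrix, producing soft matching targets."""
    # Treat exp(sim / epsilon) as the transport kernel and alternately
    # normalize rows and columns (Sinkhorn-Knopp iterations).
    Q = torch.exp(sim / epsilon)
    Q = Q / Q.sum()
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=1, keepdim=True)  # each image spreads mass over texts
        Q = Q / Q.sum(dim=0, keepdim=True)  # each text spreads mass over images
    # Row-normalize so each image's soft labels over texts sum to one.
    return Q / Q.sum(dim=1, keepdim=True)
```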
Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications.
Retrieving a target video according to a text description is a task of great practical value that has received continuous attention over the past few years. In this paper, we focus on the less-studied setting of multi-query video retrieval, where multiple queries are provided for searching a video archive. We first show that the multi-query retrieval task is more pragmatic, representing real-world use cases and better evaluating the retrieval capability of current models, and thus deserves further investigation alongside the more prevalent single-query retrieval setup. We then propose several new methods for leveraging multiple queries at training time, improving over simply combining the similarity outputs of multiple queries from a conventionally trained single-query model. Our models consistently outperform several competitive baselines over three different datasets. For example, Recall@1 can be improved by 4.7 points on MSR-VTT, 4.1 points on MSVD, and 11.7 points on VATEX over a strong baseline built on the state-of-the-art CLIP4Clip model. We believe further modeling efforts will bring new insights in this direction and lead to new systems that perform better in real-world video retrieval applications. Code is available at https://github.com/princetonvisualai/mqvr.
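For intuition, the snippet below shows the simple inference-time baseline of fusing the similarity scores of several queries describing the same target video; the mean/max reduction is an illustrative choice, and the paper's training-time methods go beyond this.

```python
import torch
import torch.nn.functional as F

def rank_videos_multi_query(query_embs, video_embs, reduce="mean"):
    """Combine similarities from several text queries for one target video.

    query_embs: (n_queries, d) text embeddings of the queries.
    video_embs: (n_videos, d) embeddings of the video archive.
    Returns video indices sorted from best to worst match."""
    q = F.normalize(query_embs, dim=-1)
    v = F.normalize(video_embs, dim=-1)
    sims = q @ v.t()                                        # (n_queries, n_videos)
    fused = sims.mean(dim=0) if reduce == "mean" else sims.max(dim=0).values
    return fused.argsort(descending=True)
```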
We address the problem of retrieving images with both a sketch and a text query. We present TASK-former (Text And SKetch transformer), an end-to-end trainable model that takes a text description and a sketch as inputs. We argue that the two input modalities complement each other in a way that cannot easily be achieved by either one alone. TASK-former follows a late-fusion dual-encoder approach, similar to CLIP, which allows efficient and scalable retrieval since the retrieval set can be indexed independently of the query. We empirically demonstrate that using an input sketch (even a poorly drawn one) in addition to text considerably increases retrieval recall compared to traditional text-based image retrieval. To evaluate our approach, we collect 5,000 hand-drawn sketches for images in the test set of the COCO dataset. The collected sketches are available at https://janesjanes.github.io/tsbir/.
In existing image classification systems that use deep neural networks, the knowledge required for image classification is stored implicitly in the model parameters. If users want to update this knowledge, they need to fine-tune the model parameters. Moreover, users cannot verify the validity of inference results or assess the contribution of the knowledge to those results. In this paper, we study a system that stores the knowledge for image classification, such as image feature maps, labels, and original images, not in the model parameters but in external high-capacity storage. When classifying an input image, our system refers to the storage like a database. To add knowledge, our system updates the database instead of fine-tuning the model parameters, thereby avoiding catastrophic forgetting in incremental learning scenarios. We revisit the kNN (k-Nearest Neighbor) classifier and employ it in our system. By analyzing the neighborhood samples referenced by the kNN algorithm, we can explain how knowledge learned in the past is used for an inference result. Our system achieves 79.8% top-1 accuracy on the ImageNet dataset without fine-tuning model parameters after pretraining, and 90.8% accuracy on the Split CIFAR-100 dataset in a task-incremental learning setting.
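A minimal sketch of a classifier whose "knowledge" lives in an external store rather than in model weights, assuming precomputed feature vectors; the storage layout and the exact items the paper stores (feature maps, labels, raw images) are simplified here.

```python
import numpy as np

class ExternalKNNClassifier:
    """kNN classification over an external feature store.

    Adding knowledge means appending to the store (no fine-tuning), so
    earlier classes are never overwritten; returning the neighbour indices
    makes each decision inspectable."""

    def __init__(self, k=5):
        self.k = k
        self.features = []   # list of (d,) feature vectors
        self.labels = []

    def add_knowledge(self, feats, labels):
        self.features.extend(feats)
        self.labels.extend(labels)

    def predict(self, query_feat):
        feats = np.stack(self.features)                       # (N, d)
        sims = feats @ query_feat / (
            np.linalg.norm(feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
        nn_idx = np.argsort(-sims)[: self.k]                  # k nearest neighbours
        votes = [self.labels[i] for i in nn_idx]
        return max(set(votes), key=votes.count), nn_idx       # majority vote + evidence
```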
The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model that has attracted increasing attention from the computer vision community. Benefiting from its huge image-text training set, the CLIP model has learned outstanding capabilities in zero-shot learning and image-text matching. To improve CLIP's recognition performance on certain target visual concepts, it is often desirable to further update the CLIP model by fine-tuning it on additional training data for the classes of interest. This operation, however, raises an important concern: will the update hurt CLIP's zero-shot learning or image-text matching capability, i.e., cause catastrophic forgetting? And if so, can existing continual learning algorithms be adapted to mitigate that risk? To answer these questions, this work conducts a systematic study of the continual learning problem for the CLIP model. We construct evaluation protocols to measure the impact of fine-tuning updates and explore different ways of upgrading existing continual learning methods to alleviate the forgetting problem of the CLIP model. Our study reveals the particular challenges of continual learning with CLIP and lays the groundwork for further research. Moreover, we propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which shows clear effectiveness in alleviating the forgetting problem of the CLIP model.
Pre-training vision-language models with contrastive objectives has shown promising results that scale to large uncurated datasets and transfer to many downstream applications. Some follow-up works aim to improve data efficiency by adding self-supervision terms, but in these works the inter-domain (image-text) contrastive loss and the intra-domain (image-image) contrastive loss are defined on separate spaces, so many feasible combinations of supervision are overlooked. To overcome this issue, we propose UniCLIP, a unified framework for contrastive language-image pre-training. UniCLIP integrates the contrastive losses of both inter-domain pairs and intra-domain pairs into a single universal space. Three key components of UniCLIP resolve the discrepancies that arise when integrating contrastive losses across different domains: (1) augmentation-aware feature embedding, (2) the MP-NCE loss, and (3) a domain-dependent similarity measure. UniCLIP outperforms previous vision-language pre-training methods on various single-modality and multi-modality downstream tasks. In our experiments, we show that each component contributes well to the final performance.
Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without the expensive filtering or post-processing steps used in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enable zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision
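The core trick, sampling a patch size per training step and resizing the patch-embedding kernel accordingly, might look roughly like the sketch below; `base_patch_weight` and `forward_tokens` are hypothetical attributes of a ViT wrapper, and plain bilinear interpolation stands in for the more principled resize derived in the paper.

```python
import random
import torch
import torch.nn.functional as F

def resize_patch_embed(weight, new_patch):
    """Resize a ViT patch-embedding kernel (out_ch, 3, p, p) to a new patch
    size. Plain bilinear interpolation here; treat this as a sketch."""
    return F.interpolate(weight, size=(new_patch, new_patch),
                         mode="bilinear", align_corners=False)

def training_step(model, images, labels, patch_sizes=(8, 12, 16, 24, 32)):
    # Sample a patch size per step so one set of weights learns to work
    # across many patch sizes (and thus many compute budgets).
    p = random.choice(patch_sizes)
    w = resize_patch_embed(model.base_patch_weight, p)
    tokens = F.conv2d(images, w, stride=p)   # patchify at the sampled size
    # Position embeddings would also need resizing to the new token grid;
    # that is assumed to happen inside this hypothetical ViT body call.
    logits = model.forward_tokens(tokens)
    return F.cross_entropy(logits, labels)
```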
Can we train a single transformer model capable of processing multiple modalities and datasets, while sharing almost all of its learnable parameters? We present PolyViT, a model trained on images, audio, and video, to answer this question. By co-training on different tasks of a single modality, we are able to improve the accuracy of each individual task and achieve state-of-the-art results on 5 standard video and audio classification datasets. Co-training PolyViT on multiple modalities and tasks leads to a model that is even more parameter-efficient and learns representations that generalize across multiple domains. Moreover, we show that co-training is simple and practical to implement, as we do not need to tune hyperparameters for each combination of datasets and can simply adapt those from standard single-task training.
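One co-training step of the kind described above could be sketched as follows, assuming per-task tokenizers and heads around a single shared transformer; the task-sampling schedule here is plain random choice, whereas the paper studies several schedules.

```python
import random
import torch.nn.functional as F

def cotrain_step(shared_transformer, tokenizers, heads, task_batches):
    """One co-training step for a shared transformer across tasks/modalities."""
    task = random.choice(list(task_batches))   # e.g. "image", "audio", "video"
    x, y = task_batches[task]
    tokens = tokenizers[task](x)               # modality-specific tokenizer (assumed)
    features = shared_transformer(tokens)      # almost all parameters shared here
    logits = heads[task](features)             # small task-specific classifier head
    loss = F.cross_entropy(logits, y)
    loss.backward()                            # optimizer step done by the caller
    return loss
```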
Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data, which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs that are aligned at different levels, the inherent noise (e.g., misaligned pairs) makes it difficult to learn a precise captioning model. While a filtering strategy can effectively remove noisy data, it leads to a decrease in learnable knowledge and sometimes brings about a new problem of data deficiency. To get the best of both worlds, we propose a noise-aware learning framework, which learns rich knowledge from the whole web-crawled data while being less affected by the noise. This is achieved by the proposed quality-controllable model, which is learned using the alignment levels of the image-text pairs as an additional control signal during training. The alignment-conditioned training allows the model to generate high-quality, well-aligned captions by simply setting the control signal to the desired alignment level at inference time. Through in-depth analysis, we show that our controllable captioning model is effective in handling noise. In addition, with the two tasks of zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate that our model can produce high-quality captions in terms of descriptiveness and distinctiveness. Code is available at \url{https://github.com/kakaobrain/noc}.
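As a hedged illustration of the control signal, the snippet below bucketizes an image-text similarity score into discrete alignment levels; how the paper actually measures alignment and feeds the level to the captioner may differ from this sketch.

```python
import torch
import torch.nn.functional as F

def alignment_bucket(img_emb, txt_emb, n_levels=4):
    """Turn an image-caption cosine similarity into a discrete alignment level
    that can serve as a control signal during training."""
    sim = F.cosine_similarity(img_emb, txt_emb, dim=-1)      # (batch,) in [-1, 1]
    # Map similarity to one of n_levels buckets (0 = worst, n_levels-1 = best).
    return torch.clamp(((sim + 1) / 2 * n_levels).long(), max=n_levels - 1)

# At training time the bucket would be given to the captioner as an extra
# input; at inference the caption is generated with the control signal set
# to the highest alignment level.
```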
Contrastive learning (CL) methods effectively learn data representations without label supervision, where the encoder contrasts each positive sample against multiple negative samples via a one-vs-many softmax cross-entropy loss. By leveraging large amounts of unlabeled image data, recent CL methods have obtained promising results when pretrained on ImageNet, a curated dataset with balanced image classes. However, they tend to yield worse performance when pretrained on images in the wild. In this paper, to further improve the performance of CL and enhance its robustness to uncurated datasets, we propose a doubly contrastive strategy that contrasts a query's positive (negative) samples among themselves before deciding how strongly to pull (push) each of them. We realize this strategy with Contrastive Attraction and Contrastive Repulsion (CACR), which makes the query not only exert a larger force to attract more distant positive samples but also to repel closer negative samples. Theoretical analysis shows that CACR generalizes the behavior of CL by accounting for the discrepancy between the distributions from which positive/negative samples are drawn, which are typically sampled independently of the query, and their true conditional distributions given the query. We demonstrate that this unique mechanism of positive attraction and negative repulsion reduces the need for a uniform prior distribution over the data and its latent representations, which is especially beneficial when the dataset is less curated. Large-scale experiments on many standard vision tasks show that CACR not only consistently outperforms existing CL methods on benchmark datasets for representation learning, but also exhibits better robustness when pretrained on imbalanced image datasets.