There is a growing interest in learning data representations that work well for many different types of problems and data. In this paper, we look in particular at the task of learning a single visual representation that can be successfully utilized in the analysis of very different types of images, from dog breeds to stop signs and digits. Inspired by recent work on learning networks that predict the parameters of another, we develop a tunable deep network architecture that, by means of adapter residual modules, can be steered on the fly to diverse visual domains. Our method achieves a high degree of parameter sharing while maintaining or even improving the accuracy of domain-specific representations. We also introduce the Visual Decathlon Challenge, a benchmark that evaluates the ability of representations to capture simultaneously ten very different visual domains and measures their ability to perform well uniformly.
Translated by Google Translate.
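We propose a unified look at jointly learning multiple vision tasks and visual domains through universal representations, a single deep neural network. Learning multiple problems simultaneously involves minimizing a weighted sum of multiple loss functions with different magnitudes and characteristics, which results in an unbalanced state where one loss dominates the optimization and yields poor results compared to learning a separate model for each problem. To this end, we propose distilling the knowledge of multiple task/domain-specific networks into a single deep neural network through small-capacity adapters. We rigorously show that universal representations achieve state-of-the-art performance in learning multiple dense prediction problems on NYU-v2 and Cityscapes, multiple image classification problems from diverse domains in the Visual Decathlon dataset, and cross-domain few-shot learning in Meta-Dataset. Finally, we also conduct multiple analyses through ablation and qualitative studies.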
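The imbalance described in the abstract above can be illustrated with a small numerical sketch. The task names, loss values, and weights are hypothetical, chosen only to show how a loss with a larger magnitude can dominate a weighted sum; they are not taken from the paper.

```python
def weighted_total(losses, weights):
    """Return the weighted sum of per-task losses."""
    if losses.keys() != weights.keys():
        raise ValueError("losses and weights must cover the same tasks")
    return sum(weights[task] * losses[task] for task in losses)


def contribution_shares(losses, weights):
    """Return the fraction of the total objective contributed by each task."""
    total = weighted_total(losses, weights)
    if total == 0:
        raise ValueError("total loss is zero, shares are undefined")
    return {task: weights[task] * losses[task] / total for task in losses}


# Hypothetical per-task losses with very different magnitudes.
losses = {"segmentation": 2.0, "depth": 0.05, "normals": 0.2}
uniform_weights = {task: 1.0 for task in losses}

shares = contribution_shares(losses, uniform_weights)
dominant_task = max(shares, key=shares.get)
```

With uniform weights, the task with the largest raw loss accounts for most of the objective, which is one plausible way to read the "unbalanced state" the abstract refers to.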
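In this paper, we look at the problem of cross-domain few-shot classification, which aims to learn a classifier from previously unseen classes and domains. Recent approaches broadly solve this problem by parameterizing their few-shot classifiers with task-agnostic and task-specific weights, where the former are typically learned on a large training set and the latter are dynamically predicted by an auxiliary network conditioned on a small support set. In this work, we focus on the estimation of the latter and propose to learn task-specific weights directly on a small support set, in contrast to dynamically estimating them. In particular, through systematic analysis, we show that task-specific weights implemented as parametric adapters in matrix form, attached to multiple intermediate layers of a backbone network, significantly improve the performance of state-of-the-art models on the Meta-Dataset benchmark at minor additional cost.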
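Expandable networks have demonstrated their advantages in dealing with the catastrophic forgetting problem. Considering that different tasks may need different structures, recent methods design dynamic structures that adapt to different tasks via sophisticated techniques. Their routine is to first search for expandable structures and then train on the new tasks, which, however, splits each task into multiple training stages and leads to suboptimal results or excessive computational cost. In this paper, we propose an end-to-end trainable adaptively expandable network named E2-AEN, which dynamically generates lightweight structures for new tasks without any accuracy drop on previous tasks. Specifically, the network contains a series of powerful feature adapters that augment the previously learned representations for new tasks and avoid task interference. These adapters are controlled by an adaptive gate-based pruning strategy that decides whether the expanded structures can be pruned, making the network structure dynamically changeable according to the complexity of the new tasks. Moreover, we introduce a novel sparsity-activation regularization to encourage the model to learn discriminative features with limited parameters. E2-AEN reduces cost and can be built upon any feed-forward architecture in an end-to-end manner. Extensive experiments on both classification (i.e., CIFAR and VDD) and detection (i.e., COCO, VOC, and the ICCV2021 SSLAD challenge) benchmarks demonstrate the effectiveness of the proposed method, which achieves remarkable new results.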
When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities is unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaptation techniques and performs similarly to multitask learning that uses the original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning with similar old and new task datasets for improved new task performance.
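A minimal sketch of how an objective of this kind might be written is given below. It assumes, which the abstract does not state, that the original capabilities are preserved by a distillation-style term that keeps the old-task outputs on new-task images close to the responses recorded from the original network. The function names, the temperature of 2, and the weight `lambda_old` are illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def lwf_style_loss(new_logits, new_label, old_logits_now, old_logits_recorded,
                   lambda_old=1.0, temperature=2.0):
    """New-task loss plus a term that discourages drift of old-task outputs.

    old_logits_recorded: old-task outputs of the original network on this
    new-task image, computed once before training starts (assumption).
    old_logits_now: old-task outputs of the network being trained.
    """
    one_hot = [1.0 if i == new_label else 0.0 for i in range(len(new_logits))]
    new_term = cross_entropy(one_hot, softmax(new_logits))
    old_term = cross_entropy(softmax(old_logits_recorded, temperature),
                             softmax(old_logits_now, temperature))
    return new_term + lambda_old * old_term
```

Only new-task images appear in this loss; the old-task data is replaced by the recorded responses, which is the property the abstract emphasizes.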
Fine-tuning large pre-trained models is an effective transfer mechanism in NLP. However, in the presence of many downstream tasks, fine-tuning is parameter inefficient: an entire new model is required for every task. As an alternative, we propose transfer with adapter modules. Adapter modules yield a compact and extensible model; they add only a few trainable parameters per task, and new tasks can be added without revisiting previous ones. The parameters of the original network remain fixed, yielding a high degree of parameter sharing. To demonstrate the effectiveness of adapters, we transfer the recently proposed BERT Transformer model to 26 diverse text classification tasks, including the GLUE benchmark. Adapters attain near state-of-the-art performance, whilst adding only a few parameters per task. On GLUE, we attain within 0.4% of the performance of full fine-tuning, adding only 3.6% parameters per task. By contrast, fine-tuning trains 100% of the parameters per task.
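As a rough illustration of why adapters are parameter efficient, the sketch below assumes a bottleneck design: project down, apply a nonlinearity, project up, and add a skip connection. The abstract does not spell out this structure, so the layout, the ReLU nonlinearity, and the function names are assumptions made for the example.

```python
def adapter_param_count(hidden_size, bottleneck_size):
    """Trainable parameters of one bottleneck adapter (weights plus biases)."""
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapter_forward(x, w_down, b_down, w_up, b_up):
    """Apply the adapter to x and add the result back to x (skip connection)."""
    hidden = [max(0.0, v + b) for v, b in zip(matvec(w_down, x), b_down)]
    update = [v + b for v, b in zip(matvec(w_up, hidden), b_up)]
    return [xi + ui for xi, ui in zip(x, update)]
```

Because of the skip connection, an adapter whose up-projection is initialized to zero leaves the frozen network's activations unchanged, so training can start from the behavior of the pre-trained model.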
Image classification with small datasets has been an active research area in the recent past. However, as research in this scope is still in its infancy, two key ingredients are missing for ensuring reliable and truthful progress: a systematic and extensive overview of the state of the art, and a common benchmark to allow for objective comparisons between published methods. This article addresses both issues. First, we systematically organize and connect past studies to consolidate a community that is currently fragmented and scattered. Second, we propose a common benchmark that allows for an objective comparison of approaches. It consists of five datasets spanning various domains (e.g., natural images, medical imagery, satellite data) and data types (RGB, grayscale, multispectral). We use this benchmark to re-evaluate the standard cross-entropy baseline and ten existing methods published between 2017 and 2021 at renowned venues. Surprisingly, we find that thorough hyper-parameter tuning on held-out validation data results in a highly competitive baseline and highlights a stunted growth of performance over the years. Indeed, only a single specialized method dating back to 2019 clearly wins our benchmark and outperforms the baseline classifier.
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be repurposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Deep learning has produced state-of-the-art results for a variety of tasks. While such approaches for supervised learning have performed well, they assume that training and testing data are drawn from the same distribution, which may not always be the case. As a complement to this challenge, single-source unsupervised domain adaptation can handle situations where a network is trained on labeled data from a source domain and unlabeled data from a related but different target domain with the goal of performing well at test-time on the target domain. Many single-source and typically homogeneous unsupervised deep domain adaptation approaches have thus been developed, combining the powerful, hierarchical representations from deep learning with domain adaptation to reduce reliance on potentially-costly target data labels. This survey will compare these approaches by examining alternative methods, the unique and common elements, results, and theoretical insights. We follow this with a look at application areas and open research directions.
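Generalization to out-of-distribution (OOD) data is a capability natural to humans yet challenging for machines. This is because most learning algorithms strongly rely on the i.i.d. assumption on source/target data, which is often violated in practice due to domain shift. Domain generalization (DG) aims to achieve OOD generalization by using only source data for model learning. Over the last ten years, research in DG has made great progress, leading to a broad spectrum of methodologies, e.g., those based on domain alignment, meta-learning, data augmentation, or ensemble learning, to name a few; DG has also been studied in various application areas, including computer vision, speech recognition, natural language processing, medical imaging, and reinforcement learning. In this paper, a comprehensive literature review of DG is provided for the first time to summarize the developments over the past decade. Specifically, we first cover the background by formally defining DG and relating it to other relevant fields such as domain adaptation and transfer learning. Then, we conduct a thorough review of existing methods and theories. Finally, we conclude this survey with insights and discussions on future research directions.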
Domain generalization (DG) is the challenging and topical problem of learning models that generalize to novel testing domains with different statistics than a set of known training domains. The simple approach of aggregating data from all source domains and training a single deep neural network end-to-end on all the data provides a surprisingly strong baseline that surpasses many prior published methods. In this paper we build on this strong baseline by designing an episodic training procedure that trains a single deep network in a way that exposes it to the domain shift that characterises a novel domain at runtime. Specifically, we decompose a deep network into feature extractor and classifier components, and then train each component by simulating it interacting with a partner who is badly tuned for the current domain. This makes both components more robust, ultimately leading to our networks producing state-of-the-art performance on three DG benchmarks. Furthermore, we consider the pervasive workflow of using an ImageNet trained CNN as a fixed feature extractor for downstream recognition tasks. Using the Visual Decathlon benchmark, we demonstrate that our episodic-DG training improves the performance of such a general purpose feature extractor by explicitly training a feature for robustness to novel problems. This shows that DG training can benefit standard practice in computer vision.
We propose a novel multi-task learning architecture, which allows learning of task-specific feature-level attention. Our design, the Multi-Task Attention Network (MTAN), consists of a single shared network containing a global feature pool, together with a soft-attention module for each task. These modules allow for learning of task-specific features from the global features, whilst simultaneously allowing for features to be shared across different tasks. The architecture can be trained end-to-end and can be built upon any feed-forward neural network, is simple to implement, and is parameter efficient. We evaluate our approach on a variety of datasets, across both image-to-image predictions and image classification tasks. We show that our architecture is state-of-the-art in multi-task learning compared to existing methods, and is also less sensitive to various weighting schemes in the multi-task loss function. Code is available at https://github.com/lorenmt/mtan.
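Most existing studies on unsupervised domain adaptation (UDA) assume that each domain's training samples come with domain labels (e.g., painting, photo). Samples from each domain are assumed to follow the same distribution, and the domain labels are exploited to learn domain-invariant features via feature alignment. However, such an assumption often does not hold: there often exist numerous finer-grained domains (e.g., dozens of modern painting styles have been developed, each differing dramatically from the classic styles). Therefore, forcing feature distribution alignment across each artificially defined and coarse-grained domain can be ineffective. In this paper, we address both single-source and multi-source UDA from a completely different perspective, which is to view each instance as a fine domain. Feature alignment across domains is thus redundant. Instead, we propose to perform dynamic instance domain adaptation (DIDA). Concretely, a dynamic neural network with adaptive convolutional kernels is developed to generate instance-adaptive residuals that adapt domain-agnostic deep features to each individual instance. This enables a shared classifier to be applied to both source and target domain data without relying on any domain annotation. Further, instead of imposing intricate feature alignment losses, we adopt a simple semi-supervised learning paradigm using only a cross-entropy loss for both labeled source and pseudo-labeled target data. Our model, dubbed DIDA-Net, achieves state-of-the-art performance on several commonly used single-source and multi-source UDA datasets, including Digits, Office-Home, DomainNet, Digit-Five, and PACS.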
Meta-learning has been proposed as a framework to address the challenging few-shot learning setting. The key idea is to leverage a large number of similar few-shot tasks in order to learn how to adapt a base-learner to a new task for which only a few labeled samples are available. As deep neural networks (DNNs) tend to overfit using a few samples only, meta-learning typically uses shallow neural networks (SNNs), thus limiting its effectiveness. In this paper we propose a novel few-shot learning method called meta-transfer learning (MTL), which learns to adapt a deep NN for few-shot learning tasks. Specifically, meta refers to training multiple tasks, and transfer is achieved by learning scaling and shifting functions of DNN weights for each task. In addition, we introduce the hard task (HT) meta-batch scheme as an effective learning curriculum for MTL. We conduct experiments using (5-class, 1-shot) and (5-class, 5-shot) recognition tasks on two challenging few-shot learning benchmarks: miniImageNet and Fewshot-CIFAR100. Extensive comparisons to related works validate that our meta-transfer learning approach trained with the proposed HT meta-batch scheme achieves top performance. An ablation study also shows that both components contribute to fast convergence and high accuracy.
Algorithm fragment (the opening steps are missing from the source):
    Optimize θ by Eq. 3
end
Optimize Φ_S{1,2} and θ by Eq. 4 and Eq. 5
while not done do
    Sample class-k in T^(te)
    Compute Acc_k for T^(te)
end
Return class-m with the lowest accuracy Acc_m
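The scaling-and-shifting idea from the abstract above can be sketched for a single fully connected layer. This is a simplified illustration, assuming one scale per output unit applied to the frozen weights and one shift added to the frozen bias; the names `scale` and `shift` are hypothetical, and the paper's exact parameterization may differ.

```python
def scale_shift_linear(x, frozen_w, frozen_b, scale, shift):
    """Linear layer whose frozen weights are scaled and whose bias is shifted.

    frozen_w, frozen_b: pre-trained parameters that are never updated.
    scale, shift: small per-output-unit parameters learned for each task.
    """
    outputs = []
    for row, bias, s, t in zip(frozen_w, frozen_b, scale, shift):
        weighted = sum(s * w * xi for w, xi in zip(row, x))
        outputs.append(weighted + bias + t)
    return outputs


def identity_init(num_outputs):
    """Scale of one and shift of zero reproduce the frozen layer exactly."""
    return [1.0] * num_outputs, [0.0] * num_outputs


# Hypothetical frozen layer with two inputs and two outputs.
frozen_w = [[1.0, 2.0], [3.0, 4.0]]
frozen_b = [0.5, -0.5]
scale, shift = identity_init(2)
baseline = scale_shift_linear([1.0, 1.0], frozen_w, frozen_b, scale, shift)
```

Under this reading, only `scale` and `shift` change per task, which keeps the number of task-specific parameters small compared with fine-tuning every weight of a deep network.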
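Integrating knowledge across different domains is an essential feature of human learning. Learning paradigms such as transfer learning, meta-learning, and multi-task learning reflect the human learning process by exploiting prior knowledge for new tasks, encouraging faster learning and good generalization on new tasks. This article gives a detailed view of these learning paradigms along with a comparative analysis. The weakness of one learning algorithm turns out to be the strength of another, and merging them is therefore a prevalent trait in the literature. This work delivers a literature review of articles that fuse two algorithms to accomplish multiple tasks. A global generic learning network, an ensemble of meta-learning, transfer learning, and multi-task learning, is also introduced here, along with some open research questions and directions for future research.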
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes, from 1 example per class to 1M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19-task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct a detailed analysis of the main components that lead to high transfer performance.
Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, with endeavours to extend this knowledge without targeting the original task resulting in a catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern (1) a taxonomy and extensive overview of the state-of-the-art; (2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner; (3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks, considering Tiny Imagenet and large-scale unbalanced iNaturalist and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compare methods in terms of required memory, computation time and storage.
State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
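Zero-shot transfer of this kind can be sketched as a nearest-caption lookup: embed the image, embed one text prompt per candidate class, and pick the prompt with the highest cosine similarity. The embeddings below are made-up vectors standing in for encoder outputs, and the function names are hypothetical; this is not the API of the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_embedding, prompt_embeddings):
    """Return the best-matching prompt and the similarity score of every prompt.

    prompt_embeddings: dict mapping a text prompt to its embedding.
    """
    scores = {prompt: cosine(image_embedding, emb)
              for prompt, emb in prompt_embeddings.items()}
    best_prompt = max(scores, key=scores.get)
    return best_prompt, scores


# Made-up embeddings; a real system would obtain these from trained encoders.
prompts = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a stop sign": [0.0, 0.1, 0.9],
}
image = [0.8, 0.3, 0.1]
label, scores = zero_shot_classify(image, prompts)
```

New classes are added by writing new prompts rather than by collecting labeled images, which is what allows the transfer to proceed without dataset-specific training.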
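Marine ecosystems and their fish habitats are becoming increasingly important due to their integral role in providing a valuable food source and conservation outcomes. Because they are remote and difficult to access, marine environments and fish habitats are often monitored using underwater cameras. These cameras generate a massive volume of digital data, which cannot be efficiently analysed by current manual processing methods that involve a human observer. Deep learning (DL) is a cutting-edge AI technology that has demonstrated unprecedented performance in analysing visual data. Despite its application to a myriad of domains, its use in underwater fish habitat monitoring remains underexplored. In this paper, we provide a tutorial that covers the key concepts of DL and helps the reader gain a high-level understanding of how DL works. The tutorial also explains a step-by-step procedure for developing DL algorithms for challenging applications such as underwater fish monitoring. In addition, we provide a comprehensive survey of key deep learning techniques for fish habitat monitoring, including classification, counting, localization, and segmentation. Furthermore, we survey publicly available underwater fish datasets and compare various DL techniques in the underwater fish monitoring domain. We also discuss some challenges and opportunities in the emerging field of deep learning for fish habitat processing. This paper is written to serve as a tutorial for marine scientists who would like to gain a high-level understanding of DL, develop it for their applications by following our step-by-step tutorial, and see how it is evolving to facilitate their research efforts. At the same time, it is suitable for computer scientists who would like to survey state-of-the-art DL-based methodologies for fish habitat monitoring.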
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.