In continual learning (CL), the goal is to design models that can learn a sequence of tasks without catastrophic forgetting. While there is a rich set of techniques for CL, relatively little understanding exists of how representations built by previous tasks benefit new tasks that are added to the network. To address this, we study the problem of continual representation learning (CRL), where we learn an evolving representation as new tasks arrive. Focusing on zero-forgetting methods where tasks are embedded in subnetworks (e.g., PackNet), we first provide experiments demonstrating that CRL can significantly boost sample efficiency when learning new tasks. To explain this, we establish theoretical guarantees for CRL by providing sample complexity and generalization error bounds for new tasks, formalizing the statistical benefits of previously learned representations. Our analysis and experiments also highlight the importance of the order in which we learn the tasks. Specifically, we show that CL benefits if the initial tasks have large sample size and high "representation diversity". Diversity ensures that adding new tasks incurs small representation mismatch and that they can be learned with few samples while training only a few additional nonzero weights. Finally, we ask whether one can ensure that each task subnetwork is efficient at inference time while retaining the benefits of representation learning. To this end, we propose an inference-efficient variation of PackNet called Efficient Sparse PackNet (ESPN), which employs joint channel and weight pruning. ESPN embeds tasks in channel-sparse subnets that require up to 80% fewer FLOPs to compute while approximately retaining accuracy, and it is very competitive with a variety of baselines. In summary, this work takes a step towards data- and compute-efficient CL from a representation learning perspective. GitHub page: https://github.com/ucr-optml/CtRL
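For intuition, here is a minimal PyTorch sketch (illustrative only, not the released ESPN code) of embedding a task in a channel- and weight-sparse subnetwork via binary masks; the masking rule, layer sizes, and thresholds are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv layer whose effective weights are gated by per-task binary masks."""
    def set_task_mask(self, weight_mask, channel_mask):
        # weight_mask matches self.weight's shape; channel_mask has one entry per output channel
        self.register_buffer("weight_mask", weight_mask)
        self.register_buffer("channel_mask", channel_mask.view(-1, 1, 1, 1))

    def forward(self, x):
        w = self.weight * self.weight_mask * self.channel_mask
        return F.conv2d(x, w, self.bias, self.stride, self.padding,
                        self.dilation, self.groups)

conv = MaskedConv2d(16, 32, kernel_size=3, padding=1)
# Toy masks: keep half of the output channels and the larger-magnitude weights.
channel_mask = (torch.arange(32) < 16).float()
weight_mask = (conv.weight.detach().abs() > conv.weight.detach().abs().median()).float()
conv.set_task_mask(weight_mask, channel_mask)
out = conv(torch.randn(1, 16, 8, 8))  # channel-sparse forward pass for this task
```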
Artificial neural networks thrive in solving the classification problem for a particular rigid task, acquiring knowledge through a generalized learning behaviour from a distinct training phase. The resulting network resembles a static entity of knowledge, and endeavours to extend this knowledge without targeting the original task result in catastrophic forgetting. Continual learning shifts this paradigm towards networks that can continually accumulate knowledge over different tasks without the need to retrain from scratch. We focus on task-incremental classification, where tasks arrive sequentially and are delineated by clear boundaries. Our main contributions concern (1) a taxonomy and extensive overview of the state of the art; (2) a novel framework to continually determine the stability-plasticity trade-off of the continual learner; (3) a comprehensive experimental comparison of 11 state-of-the-art continual learning methods and 4 baselines. We empirically scrutinize method strengths and weaknesses on three benchmarks, considering Tiny ImageNet, the large-scale unbalanced iNaturalist dataset, and a sequence of recognition datasets. We study the influence of model capacity, weight decay and dropout regularization, and the order in which the tasks are presented, and qualitatively compare methods in terms of required memory, computation time, and storage.
This paper argues that continual learning methods can benefit by splitting the capacity of the learner across multiple models. We use statistical learning theory and experimental analysis to show that multiple tasks interact with each other in non-trivial ways when trained within a single model. The generalization error on a particular task can improve when it is trained alongside synergistic tasks, but it can also deteriorate when trained with competing tasks. This theory motivates our method, named Model Zoo, which, inspired by the boosting literature, grows an ensemble of small models, each trained during one episode of continual learning. We demonstrate that Model Zoo improves accuracy on a variety of continual learning benchmark problems.
Continual Learning (CL) is a field dedicated to devising algorithms able to achieve lifelong learning. Overcoming the disruption of previously acquired knowledge, a drawback affecting deep learning models that goes by the name of catastrophic forgetting, is a hard challenge. Currently, deep learning methods can attain impressive results when the modeled data do not undergo a considerable distributional shift in subsequent learning sessions, but whenever we expose such systems to this incremental setting, performance drops very quickly. Overcoming this limitation is fundamental, as it would allow us to build truly intelligent systems showing stability and plasticity. Secondly, it would allow us to overcome the onerous limitation of retraining these architectures from scratch with newly updated data. In this thesis, we tackle the problem from multiple directions. In a first study, we show that in rehearsal-based techniques (systems that use a memory buffer), the quantity of data stored in the rehearsal buffer is a more important factor than the quality of the data. Secondly, we propose one of the early works on incremental learning for ViT architectures, comparing functional, weight, and attention regularization approaches, and propose an effective novel asymmetric loss. Finally, we conclude with a study on pretraining and how it affects performance in continual learning, raising some questions about the effective progression of the field, followed by future directions and closing remarks.
An important problem in machine learning is the ability to learn tasks in a sequential manner. When trained with standard first-order methods, most models forget previously learned tasks while training on a new task, a phenomenon commonly referred to as catastrophic forgetting. A popular approach to overcoming forgetting is to regularize the loss function so as to penalize models that perform poorly on previous tasks. For example, elastic weight consolidation (EWC) regularizes with a quadratic form involving a diagonal matrix built from past data. While EWC works very well in some settings, we show that, even under otherwise ideal conditions, it can provably suffer catastrophic forgetting if the diagonal matrix is a poor approximation of the Hessian of the previous tasks. We propose a simple approach to overcome this: regularize training of the new task with sketches of the Jacobian matrix of past data. This provably overcomes catastrophic forgetting for linear models and for wide neural networks, at a modest memory cost. The overarching goal of this paper is to provide insight into when regularization-based continual learning algorithms work, and at what memory cost.
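A minimal numpy sketch of the idea, for a linear model and with illustrative dimensions (not the paper's implementation): rather than a diagonal EWC-style penalty, the new task is regularized with a random sketch of the past-data Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
n_old, n_new, d, m = 500, 100, 50, 20          # m = sketch size << n_old

X_old = rng.normal(size=(n_old, d))
w_old = rng.normal(size=d)                     # weights learned on the old task
X_new = rng.normal(size=(n_new, d))
y_new = X_new @ rng.normal(size=d)

# For linear regression the Jacobian of predictions w.r.t. w is X_old itself.
S = rng.normal(size=(m, n_old)) / np.sqrt(m)   # Gaussian sketch
SJ = S @ X_old                                 # only an m x d matrix needs to be stored

lam = 1.0
# Minimize ||X_new w - y_new||^2 + lam * ||SJ (w - w_old)||^2 in closed form.
A = X_new.T @ X_new + lam * SJ.T @ SJ
b = X_new.T @ y_new + lam * SJ.T @ (SJ @ w_old)
w_new = np.linalg.solve(A, b)
```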
Existing generalization bounds fail to explain crucial factors that drive generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization, and fail to account for the strong inductive bias of initialization and stochastic gradient descent. As an alternative, we propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds, and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
Imbalanced datasets are commonplace in modern machine learning problems. The presence of under-represented classes or groups with sensitive attributes raises concerns about generalization and fairness. Such concerns are further exacerbated by the fact that large-capacity deep networks can perfectly fit the training data, seemingly achieving perfect accuracy and fairness during training while performing poorly at test time. To address these challenges, we propose AutoBalance, a bi-level optimization framework that automatically designs a training loss function to optimize a blend of accuracy and fairness-seeking objectives. Specifically, the lower-level problem trains the model weights, and the upper-level problem tunes the loss function by monitoring and optimizing the desired objective on validation data. Our loss design enables personalized treatment of classes/groups by employing a parametric cross-entropy loss and individualized data augmentation schemes. We evaluate the benefits and performance of our approach for imbalanced and group-sensitive classification scenarios. Extensive empirical evaluations demonstrate the benefits of AutoBalance over state-of-the-art approaches. Our experimental findings are complemented by theoretical insights on loss function design and the benefits of the train-validation split. All code is available open source.
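A minimal PyTorch sketch (an assumption-laden illustration, not the paper's code) of a parametric cross-entropy whose per-class logit adjustments could serve as upper-level variables tuned on validation data, while the lower level trains the model on the training split; a full bi-level method would differentiate through the lower-level solution, whereas this sketch simply alternates.

```python
import torch
import torch.nn.functional as F

def parametric_ce(logits, targets, delta, gamma):
    # delta: additive per-class logit offsets, gamma: multiplicative per-class scales
    adjusted = gamma[None, :] * logits + delta[None, :]
    return F.cross_entropy(adjusted, targets)

num_classes = 10
delta = torch.zeros(num_classes, requires_grad=True)   # upper-level variables
gamma = torch.ones(num_classes, requires_grad=True)

logits = torch.randn(32, num_classes, requires_grad=True)  # stand-in for model outputs
targets = torch.randint(0, num_classes, (32,))

# Lower level: update the model (here just the logits) on the training loss.
train_loss = parametric_ce(logits, targets, delta.detach(), gamma.detach())
train_loss.backward()

# Upper level: update (delta, gamma) on a validation loss encoding the desired
# accuracy/fairness objective (plain CE on a held-out batch, for brevity).
val_logits = torch.randn(32, num_classes)
val_targets = torch.randint(0, num_classes, (32,))
val_loss = parametric_ce(val_logits, val_targets, delta, gamma)
val_loss.backward()   # gradients w.r.t. delta and gamma drive the loss design
```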
The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse yet accurate. Recent work has investigated the even harder case of sparse training, where the DNN weights are kept as sparse as possible to reduce computational cost during training. Existing sparse training methods are often empirical and can have lower accuracy relative to the dense baseline. In this paper, we present a general approach for training DNNs called Alternating Compressed/DeCompressed (AC/DC) training, prove convergence for a variant of the algorithm, and show that AC/DC outperforms existing sparse training methods in accuracy at comparable computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models. An important property of AC/DC is that it allows co-training of dense and sparse models, yielding accurate sparse-dense model pairs at the end of the training process. This is useful in practice, where compressed variants may be desired for deployment in resource-constrained settings without re-doing the entire training flow, and it also provides insight into the accuracy gap between dense and compressed models. Code is available at: https://github.com/ist-daslab/acdc.
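A toy numpy illustration of the alternating idea (not the authors' implementation, and with arbitrary phase lengths and sparsity): training proceeds densely for some steps, then with a top-k magnitude mask applied, and the two phases alternate.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sparsity, lr = 100, 200, 0.9, 0.1
X = rng.normal(size=(n, d))
y = X @ (rng.normal(size=d) * (rng.random(d) < 0.1))   # sparse ground truth
w = np.zeros(d)

def grad(w):
    return X.T @ (X @ w - y) / n

for phase in range(6):
    compressed = phase % 2 == 1
    if compressed:
        k = int((1 - sparsity) * d)
        mask = np.zeros(d)
        mask[np.argsort(-np.abs(w))[:k]] = 1.0       # keep the top-k weights
        w = w * mask
    for _ in range(50):
        g = grad(w)
        if compressed:
            g = g * mask                              # train only the kept weights
        w = w - lr * g
# The final compressed phase yields the sparse model (w, mask); the preceding
# decompressed phase yields its dense counterpart, giving a sparse-dense pair.
```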
A primary focus area in continual learning research is to alleviate the "catastrophic forgetting" problem in neural networks by designing new algorithms that are more robust to distribution shifts. While recent progress in the continual learning literature is encouraging, our understanding of which properties of neural networks contribute to catastrophic forgetting is still limited. To address this, instead of focusing on continual learning algorithms, in this work we focus on the model itself and study the impact of the "width" of the neural network architecture on catastrophic forgetting, showing that width has a surprisingly significant effect on forgetting. To explain this effect, we study the network's learning dynamics from various perspectives, such as gradient orthogonality, sparsity, and the lazy training regime. We provide potential explanations that are consistent with the empirical results across different architectures and continual learning benchmarks.
All famous machine learning algorithms, comprising both supervised and semi-supervised learning, work well only under a common assumption: the training and test data follow the same distribution. When the distribution changes, most statistical models must be rebuilt from newly collected data, which for some applications may be costly or impossible to obtain. It has therefore become necessary to develop methods that reduce the need and effort of acquiring new labeled samples by exploiting data available in related areas and further using them across similar fields. This has given rise to a new machine learning framework called transfer learning: a learning setting inspired by the human ability to extrapolate knowledge across tasks in order to learn more efficiently. Despite the large number of different transfer learning scenarios, the main objective of this survey is to provide an overview of state-of-the-art theoretical results in a specific, and arguably the most popular, sub-field of transfer learning, called domain adaptation. In this sub-field, the data distribution is assumed to change between the training and the test data, while the learning task remains the same. We provide the first up-to-date account of existing results related to the domain adaptation problem, covering learning bounds based on different statistical learning frameworks.
The goal of continual learning (CL) is to learn different tasks over time. The main desiderata associated with CL are maintaining performance on older tasks, leveraging them to improve learning of future tasks, and introducing minimal overhead in the training process (e.g., not requiring a growing model or retraining). We propose the Neuro-Inspired Stability-Plasticity Adaptation (NISPA) architecture, which addresses these desiderata through a sparse neural network with fixed density. NISPA forms stable paths to preserve knowledge learned from older tasks. In addition, NISPA uses connection rewiring to create new plastic paths that reuse existing knowledge on novel tasks. Our extensive evaluation on the EMNIST, FashionMNIST, CIFAR10, and CIFAR100 datasets shows that NISPA significantly outperforms representative state-of-the-art continual learning baselines while using up to ten times fewer learnable parameters. We also make the case that sparsity is an essential ingredient of continual learning. The NISPA code is available at https://github.com/burakgurbuz97/nispa.
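A small numpy sketch of rewiring at fixed density in general terms (an illustration of the idea, not the NISPA algorithm itself): frozen stable connections are kept, a fraction of the weakest plastic connections is dropped, and the same number of new random connections is grown elsewhere, so overall density never changes; the freezing rule below is an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)) * (rng.random((64, 64)) < 0.2)   # sparse weights
frozen = np.abs(W) > 1.0                    # pretend these paths serve old tasks

plastic = (W != 0) & ~frozen
drop_n = int(0.3 * plastic.sum())
idx = np.argwhere(plastic)
weakest = idx[np.argsort(np.abs(W[plastic]))[:drop_n]]
W[weakest[:, 0], weakest[:, 1]] = 0.0       # drop the weakest plastic connections

empty = np.argwhere(W == 0)
grow = empty[rng.choice(len(empty), size=drop_n, replace=False)]
W[grow[:, 0], grow[:, 1]] = rng.normal(scale=0.01, size=drop_n)  # grow new ones
assert (W != 0).sum() == (plastic | frozen).sum()                # density preserved
```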
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points.
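As a schematic illustration (generic notation, not the paper's exact formulation), the quadratic approximation augments the NTK linearization of the network output around its initialization $W_0$ with a second-order term:

```latex
f(x; W_0 + \Delta)
  \;\approx\; \underbrace{f(x; W_0) + \langle \nabla_W f(x; W_0), \Delta \rangle}_{\text{NTK / first-order linearization}}
  \;+\; \underbrace{\tfrac{1}{2}\, \Delta^{\top} \nabla^2_W f(x; W_0)\, \Delta}_{\text{quadratic correction used beyond NTK}}
```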
This paper presents a method for adding multiple tasks to a single deep neural network while avoiding catastrophic forgetting. Inspired by network pruning techniques, we exploit redundancies in large deep networks to free up parameters that can then be employed to learn new tasks. By performing iterative pruning and network re-training, we are able to sequentially "pack" multiple tasks into a single network while ensuring minimal drop in performance and minimal storage overhead. Unlike prior work that uses proxy losses to maintain accuracy on older tasks, we always optimize for the task at hand. We perform extensive experiments on a variety of network architectures and large-scale datasets, and observe much better robustness against catastrophic forgetting than prior work. In particular, we are able to add three fine-grained classification tasks to a single ImageNet-trained VGG-16 network and achieve accuracies close to those of separately trained networks for each task. Code available at https://github.com/arunmallya/packnet
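A compact numpy sketch of the prune-and-retrain idea (a simplification, not the released PackNet code): after training task t on the currently free weights, the smallest-magnitude free weights are pruned away and the survivors are frozen for task t, leaving the pruned weights free for future tasks; at inference for task t, only weights owned by tasks up to t would be used.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256,))
owner = np.full(W.shape, -1)                 # -1 means the weight is still free

def finish_task(task_id, keep_fraction=0.5):
    free_idx = np.flatnonzero(owner == -1)
    k = int(keep_fraction * len(free_idx))
    keep = free_idx[np.argsort(-np.abs(W[free_idx]))[:k]]   # largest free weights
    owner[keep] = task_id                    # freeze the survivors for this task
    W[np.setdiff1d(free_idx, keep)] = 0.0    # pruned weights stay free for later tasks

for t in range(3):
    # ... here W would be retrained, updating only positions where owner == -1 ...
    finish_task(t)
print([(owner == t).sum() for t in range(3)], (owner == -1).sum())
```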
Meta-learning, or learning-to-learn, seeks to design algorithms that can leverage previous experience to rapidly learn new skills or adapt to new environments. Representation learning, a key tool for performing meta-learning, learns a data representation that can transfer knowledge across multiple tasks, which is essential in regimes where data is scarce. Despite a recent surge of interest in the practice of meta-learning, the theoretical foundations of meta-learning algorithms are lacking, especially in the context of learning transferable representations. In this paper, we focus on the problem of multi-task linear regression, in which multiple linear regression models share a common low-dimensional linear representation. Here, we provide provably fast, sample-efficient algorithms that address the dual challenges of (1) learning a common set of features from multiple related tasks and (2) transferring this knowledge to new, unseen tasks, both of which are central to the general problem of meta-learning. Finally, we complement these results with information-theoretic lower bounds on the sample complexity of learning these linear features.
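A toy numpy sketch of the setting (a generic subspace estimator for intuition, not necessarily the paper's algorithm): many source tasks share a low-dimensional linear representation, the subspace is recovered from per-task estimates, and a new task is then fit with only k coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, T, n = 50, 5, 40, 80
B = np.linalg.qr(rng.normal(size=(d, k)))[0]             # shared representation

# Source tasks: y = X B w_t + noise, with per-task heads w_t.
thetas = []
for _ in range(T):
    w_t = rng.normal(size=k)
    X = rng.normal(size=(n, d))
    y = X @ B @ w_t + 0.1 * rng.normal(size=n)
    thetas.append(np.linalg.lstsq(X, y, rcond=None)[0])   # crude per-task estimate
B_hat = np.linalg.svd(np.stack(thetas, axis=1))[0][:, :k]  # top-k left singular vectors

# New task: only k parameters to fit, so few samples suffice.
w_new = rng.normal(size=k)
X_new = rng.normal(size=(15, d))
y_new = X_new @ B @ w_new + 0.1 * rng.normal(size=15)
coef = np.linalg.lstsq(X_new @ B_hat, y_new, rcond=None)[0]
theta_new = B_hat @ coef                                  # estimated regressor in R^d
```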
We study the excess capacity of deep networks in the context of supervised classification. That is, given a capacity measure of the underlying hypothesis class (in our case, empirical Rademacher complexity), to what extent can we (a priori) constrain this class while maintaining an empirical error comparable to that of the unconstrained regime? To assess the excess capacity of modern architectures such as residual networks, we extend and unify prior Rademacher complexity bounds to accommodate function composition and addition, as well as the structure of convolutions. The capacity-driving terms in our bounds are the Lipschitz constants of the layers and a (2,1) group-norm distance of the convolutional weights to their initialization. Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. Overall, this suggests a notion of compressibility with respect to weight norms that is orthogonal to classic compression via weight pruning.
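A short PyTorch sketch of the kind of capacity-driving quantities mentioned above (illustrative; the exact norms in the paper's bound may be parameterized differently): a (2,1) group-norm distance of the convolutional weights to their initialization, alongside a spectral-norm proxy for the layer's Lipschitz constant.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
W0 = conv.weight.detach().clone()            # snapshot at initialization
# ... training would update conv.weight here ...

diff = (conv.weight.detach() - W0).flatten(start_dim=1)   # one row per output filter
group_21_distance = diff.norm(dim=1).sum()   # l2 over each filter, l1 across filters

W_mat = conv.weight.detach().flatten(start_dim=1)
lipschitz_proxy = torch.linalg.svdvals(W_mat).max()       # spectral norm of the matrix
print(float(group_21_distance), float(lipschitz_proxy))
```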
We introduce a new training paradigm that enforces interval constraints on the neural network parameter space to control forgetting. Contemporary continual learning (CL) methods train neural networks efficiently from a stream of data while reducing the negative impact of catastrophic forgetting, but they provide no guarantee that network performance will not deteriorate uncontrollably over time. In this work, we show how to bound forgetting by reformulating continual learning of a model as a continual contraction of its parameter space. To this end, we propose Hyperrectangle Training, a new training method in which each task is represented by a hyperrectangle in parameter space that is entirely contained within the hyperrectangle of the previous task. This formulation reduces the NP-hard CL problem to polynomial time while providing full resilience against forgetting. We validate our claims by developing the InterContiNet (Interval Continual Learning) algorithm, which leverages interval arithmetic to efficiently model parameter regions as hyperrectangles. Through experimental results, we show that our approach performs well in a continual learning setting without storing data from previous tasks.
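A tiny numpy sketch of the interval idea (a schematic, not the InterContiNet implementation): each task is assigned a hyperrectangle [lo, hi] in parameter space contained in the previous task's hyperrectangle, so any parameter chosen for a later task remains valid for all earlier ones; the shrinking rule is an arbitrary placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
lo, hi = -np.ones(d), np.ones(d)                 # task-1 region

def shrink_for_new_task(lo, hi, shrink=0.5):
    """Return a new hyperrectangle strictly inside [lo, hi]."""
    center = rng.uniform(lo, hi)
    half = shrink * np.minimum(center - lo, hi - center)
    return center - half, center + half

regions = [(lo, hi)]
for _ in range(3):                               # three more tasks
    lo, hi = shrink_for_new_task(lo, hi)
    regions.append((lo, hi))

w = rng.uniform(lo, hi)                          # any point in the last region ...
assert all(np.all(l <= w) and np.all(w <= h) for l, h in regions)  # ... fits every task
```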
Modern quantum machine learning (QML) methods involve variationally optimizing a parameterized quantum circuit on a training dataset and subsequently making predictions on a testing dataset (i.e., generalizing). In this work, we provide a comprehensive study of the generalization performance of QML after training on a limited number $N$ of training data points. We show that the generalization error of a quantum machine learning model with $T$ trainable gates scales at worst as $\sqrt{T/N}$. When only $K \ll T$ gates undergo substantial change during the optimization process, we prove that the generalization error improves to $\sqrt{K/N}$. Our results imply that compiling unitaries into a polynomial number of native gates, a crucial application for the quantum computing industry that typically uses exponential-size training data, can be sped up significantly. We also show that classifying quantum states across a phase transition with a quantum convolutional neural network requires only a very small training dataset. Other potential applications include learning quantum error-correcting codes and quantum dynamical simulation. Our work injects new hope into the field of QML, as good generalization is guaranteed from few training data.
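In symbols, a schematic restatement of the scaling claimed above (constants and logarithmic factors omitted), where gen(N) denotes the gap between test and training loss:

```latex
\mathrm{gen}(N) \;=\; \Big|\, \mathbb{E}_{\text{test}}[\ell] \;-\; \tfrac{1}{N}\sum_{i=1}^{N}\ell_i \,\Big|
\;\lesssim\; \sqrt{\frac{T}{N}},
\qquad\text{and}\qquad
\mathrm{gen}(N) \;\lesssim\; \sqrt{\frac{K}{N}} \quad \text{when only } K \ll T \text{ gates change substantially.}
```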
Structured pruning is a commonly used technique for deploying deep neural networks (DNNs) on resource-constrained devices. However, existing pruning methods are usually heuristic, task-specific, and require an additional fine-tuning procedure. To overcome these limitations, we propose a framework that compresses DNNs into slimmer architectures with competitive performance and significant FLOPs reduction by Only-Train-Once (OTO). OTO contains two keys: (i) we partition the parameters of DNNs into zero-invariant groups, enabling us to prune zero groups without affecting the output; and (ii) to promote zero groups, we formulate a structured-sparsity optimization problem and propose a novel optimization algorithm, Half-Space Stochastic Projected Gradient (HSPG), to solve it, which outperforms standard proximal methods on group sparsity exploration while maintaining comparable convergence. To demonstrate the effectiveness of OTO, we train and compress full models simultaneously from scratch without fine-tuning for inference speedup and parameter reduction, achieving state-of-the-art results on VGG16 for CIFAR10, and competitive results on ResNet50 for CIFAR10 and ImageNet and on BERT for SQuAD. The source code is available at https://github.com/tianyic/only_train_once.
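A PyTorch illustration of the grouping idea only (HSPG itself is not shown, and the threshold is arbitrary): for a Conv-BN pair, the k-th zero-invariant group collects the k-th convolution filter, its bias, and the matching BatchNorm scale and shift, so setting the whole group to zero silences that channel without altering the layer's output on other channels.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=True)
bn = nn.BatchNorm2d(32)

def group_params(k):
    # All parameters that must be zeroed together for channel k to vanish.
    return [conv.weight[k], conv.bias[k], bn.weight[k], bn.bias[k]]

def group_norm(k):
    return torch.sqrt(sum(p.detach().pow(2).sum() for p in group_params(k)))

# After structured-sparsity training, groups whose norm is (near) zero can be pruned.
with torch.no_grad():
    for k in range(32):
        if group_norm(k) < 1e-3:          # illustrative threshold
            for p in group_params(k):
                p.zero_()
```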
Model-agnostic meta-learning (MAML) has become increasingly popular for training models that can quickly adapt to new tasks via one or a few stochastic gradient descent steps. However, the MAML objective is harder to optimize than standard non-adaptive learning (NAL), and little is understood about how much MAML improves over NAL in terms of the fast adaptability of their solutions in various scenarios. We address this issue analytically in a linear regression setting consisting of a mixture of easy and hard tasks, where hardness is related to the rate at which gradient descent converges on the task. Specifically, we prove that for MAML to achieve substantial gains over NAL, (i) there must be some discrepancy in hardness among the tasks, and (ii) the optimal solutions of the hard tasks must be closely packed, with their center far from the center of the easy tasks' optimal solutions. We also provide numerical and analytical results suggesting that these insights carry over to two-layer neural networks. Finally, we provide few-shot image classification experiments that support our insights on when MAML should be used and highlight the importance of training MAML on hard tasks in practice.
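For reference, the two objectives being contrasted can be written, in the linear regression setting and with schematic notation (the step size $\alpha$ and the task distribution are generic), as:

```latex
\min_{w}\ \mathbb{E}_{t}\big[\, L_t(w) \,\big] \quad\text{(NAL)}
\qquad\text{vs.}\qquad
\min_{w}\ \mathbb{E}_{t}\big[\, L_t\big(w - \alpha \nabla L_t(w)\big) \,\big] \quad\text{(MAML)},
\qquad
L_t(w) \;=\; \tfrac{1}{2}\,\mathbb{E}_{(x,y)\sim\mathcal{D}_t}\big[(\langle x, w\rangle - y)^2\big].
```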
Using task-specific components within a neural network is a compelling strategy in continual learning (CL) to address the stability-plasticity dilemma in fixed-capacity models without access to past data. Current methods focus only on selecting a sub-network for each new task that reduces forgetting of past tasks. However, this selection can limit the forward transfer of relevant past knowledge that would help future learning. Our study shows that jointly satisfying both objectives is more challenging when a unified classifier is used for all classes of the seen tasks (class-incremental learning, class-IL), as it is prone to ambiguities between classes across tasks. Moreover, the challenge grows as the semantic similarity of classes across tasks increases. To address this challenge, we propose a new CL method named AFAF, which aims to avoid forgetting and allow forward transfer in class-IL using fixed-capacity models. AFAF allocates a sub-network that enables selective transfer of relevant knowledge to a new task while preserving past knowledge, reuses some of the previously allocated components to exploit the fixed capacity, and addresses class ambiguities when similarities exist. Experiments show the effectiveness of AFAF in providing models with multiple desirable CL properties while outperforming state-of-the-art methods on a variety of challenging benchmarks with different degrees of semantic similarity.