Batch normalization is an essential component of all state-of-the-art neural network architectures. However, because it introduces many practical issues, recent research has been devoted to designing normalization-free architectures. In this paper, we show that weight initialization is key to training ResNet-like normalization-free networks. In particular, we propose a slight modification of the summation operation that combines the skip connection branch and the block output, which properly initializes the whole network. We show that this modified architecture achieves competitive results on CIFAR-10 without further normalization or algorithmic modifications.
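A minimal sketch of the general pattern described above, assuming a PyTorch-style residual block: the branch output is rescaled at its summation with the skip connection so that each block stays close to the identity at initialization. The exact modification used by the authors is not given in the abstract; the zero-initialized learnable scale below is an illustrative stand-in.

```python
import torch
import torch.nn as nn

class ScaledResidualBlock(nn.Module):
    """Hypothetical sketch: rescale the residual branch at the summation so the
    whole network is well-conditioned at initialization (not the paper's exact scheme)."""
    def __init__(self, channels, init_scale=0.0):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()
        # learnable scalar on the residual branch; starting at zero keeps
        # each block equal to the identity at initialization
        self.scale = nn.Parameter(torch.full((1,), init_scale))

    def forward(self, x):
        branch = self.conv2(self.relu(self.conv1(x)))
        return x + self.scale * branch   # modified summation with the skip connection

print(ScaledResidualBlock(16)(torch.randn(2, 16, 8, 8)).shape)
```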
Innovations in neural architectures have fostered significant breakthroughs in language modeling and computer vision. Unfortunately, novel architectures often lead to challenging hyperparameter choices and training instability if the network parameters are not properly initialized. Many architecture-specific initialization schemes have been proposed, but these schemes are not always portable to new architectures. This paper presents GradInit, an automated and architecture-agnostic method for initializing neural networks. GradInit is based on a simple heuristic: the norm of each network layer is adjusted so that a single step of SGD or Adam with prescribed hyperparameters results in the smallest possible loss value. This adjustment is done by introducing a scalar multiplier variable in front of each parameter block and then optimizing these variables with a simple numerical scheme. GradInit accelerates the convergence and improves the test performance of many convolutional architectures, both with and without skip connections, and even without normalization layers. It also improves the stability of the original Transformer architecture for machine translation, allowing it to be trained with Adam or SGD under a wide range of learning rates and momentum coefficients without learning-rate warmup. Code is available at https://github.com/zhuchen03/gradinit.
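The heuristic can be illustrated with a toy version of the idea, assuming a tiny two-layer MLP in PyTorch: one scalar multiplier per weight block is optimized so that the loss after a single simulated SGD step is as small as possible. This is a simplified sketch, not the released GradInit implementation, which adds further constraints.

```python
import torch
import torch.nn.functional as F

# Toy GradInit-style sketch: optimize per-block scales s so that one SGD step
# on the rescaled weights yields the smallest possible loss.
torch.manual_seed(0)
W1, W2 = torch.randn(64, 32), torch.randn(10, 64)          # fixed random init
s = torch.ones(2, requires_grad=True)                       # one scale per parameter block
x, y = torch.randn(128, 32), torch.randint(0, 10, (128,))
opt = torch.optim.Adam([s], lr=1e-2)
lr_inner = 0.1                                              # prescribed SGD step size

def loss_with(w1, w2):
    return F.cross_entropy(torch.relu(x @ w1.T) @ w2.T, y)

for _ in range(50):
    w1, w2 = s[0] * W1, s[1] * W2
    g1, g2 = torch.autograd.grad(loss_with(w1, w2), (w1, w2), create_graph=True)
    meta_loss = loss_with(w1 - lr_inner * g1, w2 - lr_inner * g2)  # loss after one SGD step
    opt.zero_grad(); meta_loss.backward(); opt.step()

print("learned per-block scales:", s.detach())
```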
Neural networks require careful weight initialization to prevent signals from exploding or vanishing. Existing initialization schemes solve this problem in specific cases by assuming that the network has a certain activation function or topology. It is difficult to derive such weight initialization strategies, and modern architectures therefore often use these same initialization schemes even though their assumptions do not hold. This paper introduces AutoInit, a weight initialization algorithm that automatically adapts to different neural network architectures. By analytically tracking the mean and variance of signals as they propagate through the network, AutoInit appropriately scales the weights at each layer to avoid exploding or vanishing signals. Experiments demonstrate that AutoInit improves performance of convolutional, residual, and transformer networks across a range of activation function, dropout, weight decay, learning rate, and normalizer settings, and does so more reliably than data-dependent initialization methods. This flexibility allows AutoInit to initialize models for everything from small tabular tasks to large datasets such as ImageNet. Such generality turns out particularly useful in neural architecture search and in activation function discovery. In these settings, AutoInit initializes each candidate appropriately, making performance evaluations more accurate. AutoInit thus serves as an automatic configuration tool that makes design of new neural network architectures more robust. The AutoInit package provides a wrapper around TensorFlow models and is available at https://github.com/cognizant-ai-labs/autoinit.
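A rough sketch of the variance-tracking idea, using NumPy and assuming only dense layers with ReLU activations: the second moment of the signal is propagated analytically and each weight matrix is rescaled so the pre-activation variance stays at one. The actual AutoInit package wraps TensorFlow models, handles many more layer types, and tracks both mean and variance.

```python
import numpy as np

# Illustrative sketch of the AutoInit idea (not the released package).
rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 256))
second_moment = 1.0                       # analytic E[x^2] of the standard-normal input

for fan_in, fan_out in [(256, 512), (512, 512), (512, 128)]:
    W = rng.normal(size=(fan_in, fan_out))
    # pre-activation variance ~= fan_in * E[x^2] * Var[W]; rescale W to make it 1
    W *= 1.0 / np.sqrt(fan_in * second_moment * W.var())
    x = np.maximum(x @ W, 0.0)            # ReLU
    second_moment = 0.5                   # analytic E[relu(a)^2] when a ~ N(0, 1)
    print(f"empirical E[x^2] after layer: {(x**2).mean():.3f}")  # stays near 0.5
```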
We study characteristics of receptive fields of units in deep convolutional networks. The receptive field size is a crucial issue in many visual tasks, as the output must respond to large enough areas in the image to capture information about large objects. We introduce the notion of an effective receptive field, and show that it both has a Gaussian distribution and only occupies a fraction of the full theoretical receptive field. We analyze the effective receptive field in several architecture designs, and the effect of nonlinear activations, dropout, sub-sampling and skip connections on it. This leads to suggestions for ways to address its tendency to be too small.
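The effective receptive field can be probed empirically with a short PyTorch snippet: inject a unit gradient at the central output location of a stack of 3x3 convolutions (an illustrative architecture, not one from the paper) and inspect the gradient magnitude over the input plane.

```python
import torch
import torch.nn as nn

# Measure an effective receptive field by back-propagating from the center output unit.
net = nn.Sequential(*[nn.Conv2d(1, 1, 3, padding=1, bias=False) for _ in range(10)])
x = torch.zeros(1, 1, 64, 64, requires_grad=True)
out = net(x)

grad_at_center = torch.zeros_like(out)
grad_at_center[0, 0, 32, 32] = 1.0        # unit gradient at the central output only
out.backward(grad_at_center)

erf = x.grad[0, 0].abs()                  # gradient magnitude over the input
# The theoretical receptive field of ten 3x3 convs is 21x21, yet most of the
# gradient mass concentrates in a much smaller central region.
print("gradient mass in the central 11x11 window:",
      (erf[27:38, 27:38].sum() / erf.sum()).item())
```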
Transformers have achieved remarkable success in several domains, from natural language processing to computer vision. However, it has recently been shown that stacking self-attention layers, the distinctive architectural component of Transformers, can lead to rank collapse of the token representations at initialization. Whether and how rank collapse affects training remains unanswered, and investigating it is necessary for a more comprehensive understanding of this architecture. In this work, we shed new light on the causes and effects of this phenomenon. First, we show that rank collapse of the token representations causes the gradients of the queries and keys to vanish at initialization, thereby hindering training. Furthermore, we provide a thorough description of the origin of rank collapse and discuss how it can be prevented through an appropriate depth-dependent scaling of the residual branches. Finally, our analysis reveals that particular architectural hyperparameters affect the gradients of the queries and values differently, leading to disproportionate gradient norms. This suggests an explanation for the widespread use of adaptive methods in optimizing Transformers.
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66% [29]). To our knowledge, our result is the first to surpass human-level performance (5.1%, [22]) on this visual recognition challenge.
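Both ideas are easy to reproduce in PyTorch: a PReLU with learnable negative slope a, and the rectifier-aware initialization Var[W] = 2 / ((1 + a^2) * fan_in), which keeps the signal's second moment roughly constant through the layer at initialization.

```python
import torch
import torch.nn as nn

a = 0.25                                   # initial negative slope of PReLU
conv = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
# "He" initialization accounting for the rectifier: std = sqrt(2 / ((1 + a^2) * fan_in))
nn.init.kaiming_normal_(conv.weight, a=a, nonlinearity='leaky_relu')
prelu = nn.PReLU(num_parameters=128, init=a)   # f(x) = max(0, x) + a * min(0, x)

x = torch.randn(8, 64, 32, 32)
y = prelu(conv(x))
print(x.pow(2).mean().item(), y.pow(2).mean().item())  # second moments stay comparable
```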
Activation functions are essential for introducing nonlinearity into neural networks. Many empirical experiments have validated various activation functions, but theoretical research on them remains insufficient. In this work, we study the influence of activation functions on the variance of gradients and propose a method to normalize activation functions so that the gradient variance stays the same across all layers, allowing neural networks to achieve better convergence. First, we complement previous analyses of gradient variance, in which the influence of the activation function is considered only in an idealized initial state that is hardly preserved during training, and derive a property that a good activation function should satisfy as far as possible. Second, we provide a method to normalize activation functions and verify its effectiveness on prevalent activation functions. By observing the experiments, we find that the speed of convergence is roughly correlated with the property derived in the first part. We run experiments comparing our normalized activation functions against their common counterparts, and the results show that our method consistently outperforms the unnormalized versions; for example, in terms of top-1 accuracy, normalized Swish outperforms vanilla Swish on ResNet50 trained on CIFAR-100. Our method improves performance by simply replacing activation functions with their normalized counterparts in both fully connected networks and residual networks.
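A hedged sketch of what such a normalization could look like (the paper's exact procedure may differ): rescale an activation so that, under standard-normal pre-activations, its output has zero mean and unit variance, which keeps forward and backward variances comparable across layers.

```python
import torch

def normalize_activation(act, n_samples=1_000_000):
    """Return a shifted and rescaled version of `act` that is zero-mean,
    unit-variance on standard-normal inputs (illustrative normalization)."""
    z = torch.randn(n_samples)
    out = act(z)
    mu, sigma = out.mean(), out.std()
    return lambda x: (act(x) - mu) / sigma

def swish(x):
    return x * torch.sigmoid(x)

norm_swish = normalize_activation(swish)
z = torch.randn(100_000)
print(swish(z).var().item(), norm_swish(z).var().item())   # the latter is ~1
```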
Deep neural networks are usually initialized with random weights, with the initial variance chosen carefully to ensure stable signal propagation during training. However, choosing an appropriate variance becomes challenging, especially as the number of layers grows. In this work, we replace random weight initialization with a fully deterministic initialization scheme, ZerO, which initializes the network's weights with only zeros and ones (up to a normalization factor), based on identity and Hadamard transforms. Through theoretical and empirical studies, we demonstrate that ZerO can train networks without harming their expressivity. Applying ZerO to ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests that random weights may not be necessary for network initialization. Furthermore, ZerO has many benefits, such as training ultra-deep networks (without batch normalization), exhibiting low-rank learning trajectories that lead to low-rank and sparse solutions, and improving training reproducibility.
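A simplified NumPy sketch of the flavor of such a deterministic scheme: square layers start as the identity, and dimension-changing layers use rows of a normalized Hadamard matrix built by Sylvester's construction. The paper's exact construction and scaling differ; this only illustrates that no randomness is involved.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def zero_like_init(fan_out, fan_in):
    """Deterministic weights from zeros/ones (up to normalization) -- a sketch,
    not the authors' exact ZerO scheme."""
    if fan_out == fan_in:
        return np.eye(fan_out)                          # identity for square layers
    n = 1 << int(np.ceil(np.log2(max(fan_in, fan_out))))
    H = hadamard(n) / np.sqrt(n)                        # orthonormal, entries +-1/sqrt(n)
    return H[:fan_out, :fan_in]

print(zero_like_init(8, 8))                 # identity
print(zero_like_init(4, 8) * np.sqrt(8))    # +-1 pattern from the Hadamard matrix
```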
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
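The batch normalizing transform itself is compact; a direct PyTorch rendering of the formula from the paper, x_hat = (x - mu_B) / sqrt(var_B + eps) followed by y = gamma * x_hat + beta:

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-norm transform over a mini-batch of shape [batch, features]."""
    mu = x.mean(dim=0)                    # per-feature batch mean
    var = x.var(dim=0, unbiased=False)    # per-feature batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta           # learnable scale and shift

x = torch.randn(32, 64) * 5 + 3
y = batch_norm(x, gamma=torch.ones(64), beta=torch.zeros(64))
print(y.mean().item(), y.var().item())    # ~0 and ~1
```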
Batch normalization (BN) consists of a normalization component followed by an affine transformation and is essential for training deep neural networks. The standard initialization of each BN in a network sets the affine transformation's scale and shift to 1 and 0, respectively. However, after training we observe that these parameters do not change much from their initialization. Moreover, we notice that the normalization process can still produce excessively large values, which is undesirable for training. We revisit the BN formulation and propose a new initialization and update method for BN to address the issues above. Experiments are designed to emphasize and demonstrate the positive influence of proper BN scale initialization on performance, and are evaluated with rigorous statistical significance tests. The method can be used with existing implementations at no extra computational cost. Source code is available at https://github.com/osu-cvl/revisiting-bninit.
Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).
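A minimal PyTorch sketch of the training procedure: each residual block is bypassed by the identity with some probability during training, and its branch is rescaled by the survival probability at test time. The linear decay of survival probabilities with depth used in the paper is omitted here.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block with stochastic depth (simplified, fixed survival probability)."""
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.p = survival_prob

    def forward(self, x):
        if self.training:
            if torch.rand(()) < self.p:
                return x + self.branch(x)     # block survives this mini-batch
            return x                          # block bypassed by the identity
        return x + self.p * self.branch(x)    # expected branch contribution at test time

block = StochasticDepthBlock(16)
print(block(torch.randn(2, 16, 8, 8)).shape)
```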
We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce failure modes in the network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of reduced expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization", which normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates the behavior of batch normalization and consistently matches or exceeds its performance.
Although residual connections enable training very deep neural networks, their multi-branch topology makes them unfriendly for online inference. This has encouraged many researchers to design DNNs without residual connections at inference time. For example, RepVGG re-parameterizes its multi-branch topology into a VGG-like (single-branch) model at deployment time and shows great performance when the network is relatively shallow. However, RepVGG cannot transform a ResNet into a VGG equivalently, because the re-parameterization method can only be applied to linear blocks and the non-linear layers (ReLU) have to be placed outside of the residual connection, which leads to limited representation ability, especially for deeper networks. In this paper, we aim to remedy this problem and propose to remove the residual connections of a vanilla ResNet equivalently via a reserving-and-merging (RM) operation on the ResBlocks. Specifically, the RM operation lets the input feature maps pass through the block while reserving their information, and merges all the information at the end of each block, which removes the residual connections without changing the original output. As a plug-in method, the RM operation has three main advantages: 1) its implementation makes it well suited to high-ratio network pruning; 2) it helps break the depth limitation of RepVGG; 3) it yields networks (RMNet) with a better accuracy-speed trade-off than ResNet and RepVGG. We believe the ideology behind the RM operation can inspire many insights into model design for the community in the future. Code is available at: https://github.com/fxmeng/rmnet.
The study of feature propagation at initialization in neural networks lies at the root of numerous initialization designs. An assumption very commonly made in the field states that the pre-activations are Gaussian. Although this convenient Gaussian hypothesis can be justified when the number of neurons per layer tends to infinity, it is challenged by both theoretical and experimental works for finite-width neural networks. Our major contribution is to construct a family of pairs of activation functions and initialization distributions that ensure that the pre-activations remain Gaussian throughout the network's depth, even in narrow neural networks. In the process, we discover a set of constraints that a neural network should fulfill to ensure Gaussian pre-activations. Additionally, we provide a critical review of the claims of the Edge of Chaos line of works and build an exact Edge of Chaos analysis. We also propose a unified view on pre-activations propagation, encompassing the framework of several well-known initialization procedures. Finally, our work provides a principled framework for answering the much-debated question: is it desirable to initialize the training of a neural network whose pre-activations are ensured to be Gaussian?
Appropriate weight initialization is of key importance for successfully training neural networks. Recently, batch normalization has diminished the role of weight initialization by simply normalizing each layer based on batch statistics. Unfortunately, batch normalization has several drawbacks when applied to small batch sizes, which are required to cope with memory limitations when learning on point clouds. While a well-founded weight initialization strategy can render batch normalization unnecessary and thus avoid these drawbacks, no such approach has been proposed for point convolutional networks. To fill this gap, we propose a framework that unifies the multitude of continuous convolutions. This enables our main contribution: variance-aware weight initialization. We show that this initialization can avoid batch normalization while achieving similar, and in some cases better, performance.
Capsule networks are a class of neural networks that achieve promising results on many computer vision tasks. However, baseline capsule networks have failed to reach state-of-the-art results on more complex datasets due to their high computation and memory requirements. We address this problem by proposing a new network architecture called Momentum Capsule Network (MoCapsNet). MoCapsNets are inspired by Momentum ResNets, a type of network that applies reversible residual building blocks. Reversible networks allow the activations of the forward pass to be recomputed during backpropagation, so memory requirements can be drastically reduced. In this paper, we provide a framework for how invertible residual building blocks can be applied to capsule networks. We show that MoCapsNet beats the accuracy of baseline capsule networks on MNIST, SVHN, CIFAR-10 and CIFAR-100 while using considerably less memory. The source code is available at https://github.com/moejoe95/mocapsnet.
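To see why reversibility saves memory, consider a generic RevNet-style reversible block (not the exact Momentum ResNet block used by MoCapsNet): inputs can be reconstructed exactly from outputs, so forward activations need not be stored for backpropagation.

```python
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Generic additive-coupling reversible block (illustrative, RevNet-style)."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):        # recompute the inputs from the outputs
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2

block = ReversibleBlock(32)
x1, x2 = torch.randn(4, 32), torch.randn(4, 32)
y1, y2 = block(x1, x2)
r1, r2 = block.inverse(y1, y2)
print(torch.allclose(x1, r1, atol=1e-5), torch.allclose(x2, r2, atol=1e-5))
```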
Deep ResNets are recognized for achieving state-of-the-art results on machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show, in a probabilistic setting, that with standard i.i.d. initializations the only non-trivial dynamics arise for $\alpha_L = 1/\sqrt{L}$ (other choices lead either to explosion or to an identity mapping). This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to the widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = 1/L$. Our analysis suggests a strong interplay between the scaling and the regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly influence performance before and after training.
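The role of the scaling factor is easy to check numerically; a short NumPy experiment (an illustration, not the paper's setting) propagates a signal through $L$ residual layers $x_{l+1} = x_l + \alpha_L W_l x_l$ with i.i.d. Gaussian weights and compares the three choices of $\alpha_L$:

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 100, 64

for label, alpha in [("1", 1.0), ("1/sqrt(L)", L ** -0.5), ("1/L", 1.0 / L)]:
    x = rng.normal(size=d)
    x0_norm = np.linalg.norm(x)
    for _ in range(L):
        W = rng.normal(size=(d, d)) / np.sqrt(d)   # i.i.d. entries with variance 1/d
        x = x + alpha * W @ x
    # alpha=1 explodes, 1/L stays essentially at the identity, 1/sqrt(L) is in between
    print(f"alpha_L = {label}: ||x_L|| / ||x_0|| ~= {np.linalg.norm(x) / x0_norm:.3e}")
```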
Deep residual networks [1] have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62% error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.
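The proposed unit is straightforward to write down; a minimal PyTorch version of the pre-activation block, with BN and ReLU moved before each convolution and a pure identity skip:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual unit: BN -> ReLU -> Conv, twice, plus identity skip."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return x + out                  # identity skip, no post-addition activation

print(PreActBlock(16)(torch.randn(2, 16, 8, 8)).shape)
```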
Permutation-invariant neural networks are a promising tool for making predictions from sets. However, we show that existing permutation-invariant architectures, Deep Sets and Set Transformer, can suffer from vanishing or exploding gradients when they are deep. In addition, layer norm, the normalization of choice in Set Transformer, can hurt performance by removing information that is useful for prediction. To address these issues, we introduce the clean path principle for equivariant residual connections and develop set norm, a normalization tailored for sets. With these, we build Deep Sets++ and Set Transformer++, models that achieve comparable or better performance than their original counterparts on a diverse suite of tasks. We also introduce Flow-RBC, a new single-cell dataset and a real-world application of permutation-invariant prediction. We open-source our data and code here: https://github.com/rajesh-lab/deep_permunt_invariant.
Differential privacy (DP) provides a formal privacy guarantee that prevents adversaries with access to a machine learning model from extracting information about individual training points. Differentially private stochastic gradient descent (DP-SGD), the most popular DP training method, realizes this protection by injecting noise during training. However, previous work has found that DP-SGD often causes a significant degradation in performance on standard image classification benchmarks. Furthermore, some authors have postulated that DP-SGD inherently performs poorly on large models, since the norm of the noise required to preserve privacy is proportional to the model dimension. In contrast, we demonstrate that DP-SGD on over-parameterized models can perform significantly better than previously thought. Combining careful hyperparameter tuning with simple techniques to ensure signal propagation and improve the convergence rate, we obtain a new SOTA on CIFAR-10 without extra data of 81.4% under (8, 10^{-5})-DP using a 40-layer Wide-ResNet, improving over the previous SOTA of 71.7%. When fine-tuning a pre-trained NFNet-F3, we achieve 83.8% top-1 accuracy on ImageNet under (0.5, 8*10^{-7})-DP. In addition, we achieve 86.7% top-1 accuracy under (8, 8*10^{-7})-DP, which is only 4.3% below the current non-private SOTA for this task. We believe our results are a significant step towards closing the accuracy gap between private and non-private image classification.
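DP-SGD itself, independent of the paper's training recipe, reduces to per-example gradient clipping plus Gaussian noise. A minimal PyTorch sketch of one update, assuming per-example gradients have already been computed and with placeholder hyperparameter values:

```python
import torch

def dp_sgd_step(params, per_example_grads, lr=0.1, clip_norm=1.0, sigma=1.0):
    """One DP-SGD update (sketch): clip each example's total gradient to L2 norm
    `clip_norm`, add Gaussian noise with std sigma * clip_norm, then average.
    `per_example_grads` holds one tensor of shape [B, *p.shape] per parameter."""
    B = per_example_grads[0].shape[0]
    # per-example L2 norm of the full gradient, accumulated across parameters
    sq = sum(g.reshape(B, -1).pow(2).sum(dim=1) for g in per_example_grads)
    scale = (clip_norm / (sq.sqrt() + 1e-12)).clamp(max=1.0)      # shape [B]
    with torch.no_grad():
        for p, g in zip(params, per_example_grads):
            clipped_sum = (g * scale.view(-1, *[1] * (g.dim() - 1))).sum(dim=0)
            noisy = clipped_sum + sigma * clip_norm * torch.randn_like(clipped_sum)
            p -= lr * noisy / B
```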