In training neural networks, batch normalization has many benefits, not all of them entirely understood. But it also has some drawbacks. Foremost is arguably memory consumption, as computing the batch statistics requires all instances within the batch to be processed simultaneously, whereas without batch normalization it would be possible to process them one by one while accumulating the weight gradients. Another drawback is that the distribution parameters (mean and standard deviation) are unlike all other model parameters in that they are not trained using gradient descent but require special treatment, complicating implementation. In this paper, I show a simple and straightforward way to address these issues. The idea, in short, is to add terms to the loss that, for each activation, minimize the negative log-likelihood of a Gaussian distribution, which is then used to normalize that activation. Among other benefits, this will hopefully contribute to the democratization of AI research by lowering the hardware requirements for training larger models.
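A minimal sketch of this idea, under the assumption that the extra loss term is the per-feature Gaussian negative log-likelihood of the activations and that the same mean and standard deviation then normalize them; the coefficient on the extra term and all names are illustrative, not the paper's implementation:

```python
import numpy as np

def gaussian_nll_normalize(a, mu, log_sigma):
    """For pre-activations `a` of shape [batch, features], normalize each
    feature with the learned Gaussian parameters (mu, log_sigma) and return
    the extra loss term: the mean negative log-likelihood of `a` under
    N(mu, sigma^2). Minimizing this NLL by ordinary gradient descent pulls
    mu/sigma toward the activation statistics, so no cross-example batch
    statistics and no special (non-gradient) parameter updates are needed."""
    sigma = np.exp(log_sigma)
    normalized = (a - mu) / sigma
    nll = 0.5 * np.mean(normalized ** 2) + np.mean(log_sigma)  # + constant
    return normalized, nll

# Toy usage: the NLL term is simply added to the task loss.
a = np.random.randn(32, 8) * 3.0 + 1.0
mu, log_sigma = np.zeros(8), np.zeros(8)
normed, extra_loss = gaussian_nll_normalize(a, mu, log_sigma)
total_loss = 0.0 + 1.0 * extra_loss  # task_loss + lambda * NLL term (lambda assumed)
```

Nothing in this forward pass couples examples within a batch, which also illustrates the memory argument: instances can be processed one by one while accumulating gradients.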
This study introduces a new normalization layer termed Batch Layer Normalization (BLN) to reduce the problem of internal covariate shift in deep neural network layers. As a combined version of batch and layer normalization, BLN adaptively puts appropriate weight on mini-batch and layer statistics, based on the inverse size of the mini-batch, to standardize the input to a layer during learning. It also performs the exact computation at inference time with a minor change, using either mini-batch statistics or population statistics. The decision process of whether to use mini-batch or population statistics gives BLN the ability to play a comprehensive role in the hyperparameter optimization of models. A key advantage of BLN is that it is supported by a theoretical analysis that is independent of the input data, while its statistical configuration depends heavily on the task performed, the amount of training data, and the batch size. Test results demonstrate the application potential of BLN and its faster convergence compared with batch normalization and layer normalization in both convolutional and recurrent neural networks. The code of the experiments is publicly available online (https://github.com/a2amir/batch-layer-normalization).
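A rough sketch of the combination described above, in which batch and layer statistics are mixed with a weight derived from the inverse mini-batch size. The mixing direction and the omission of the learnable affine parameters are assumptions for illustration; the exact BLN formula is given in the paper and the linked repository.

```python
import numpy as np

def batch_layer_norm(x, eps=1e-5):
    """x: [batch, features]. Mix per-feature batch statistics with
    per-example layer statistics, weighting by the inverse mini-batch
    size (assumed here to weight the layer statistics more heavily as
    the batch shrinks), then normalize."""
    w = 1.0 / x.shape[0]                          # inverse mini-batch size
    mu_b, var_b = x.mean(axis=0), x.var(axis=0)   # batch statistics
    mu_l = x.mean(axis=1, keepdims=True)          # layer statistics
    var_l = x.var(axis=1, keepdims=True)
    mu = (1.0 - w) * mu_b + w * mu_l
    var = (1.0 - w) * var_b + w * var_l
    return (x - mu) / np.sqrt(var + eps)
```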
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
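For reference, the training-time transform the abstract refers to, in a minimal NumPy sketch for a fully connected layer (running averages for inference and the convolutional variant are omitted):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature of the mini-batch x ([batch, features])
    using the mini-batch mean and variance, then apply the learned
    affine parameters gamma (scale) and beta (shift)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```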
We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
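The reparameterization itself is a one-liner; here is a minimal sketch for a single weight vector, where the trainable quantities become the direction v and the scalar g:

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: the effective weight is w = g * v / ||v||,
    so its length is controlled by the scalar g and its direction by v.
    Nothing here depends on the mini-batch, which is why the method
    also applies to RNNs and noise-sensitive settings."""
    return g * v / np.linalg.norm(v)
```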
Deep learning has achieved promising results in a wide range of AI applications. Larger datasets and models consistently yield better performance, but generally at the cost of longer training times, more computation, and more communication. In this survey, we aim to provide a clear sketch of optimization for large-scale deep learning with respect to both model accuracy and model efficiency. We review the algorithms most commonly used for optimization, elaborate on the debatable topic of the generalization gap that arises in large-batch training, and survey state-of-the-art strategies for addressing communication overhead and reducing memory footprint.
Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called "internal covariate shift". In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.
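One way to make the "predictive gradients" claim concrete is to measure how much the gradient changes when moving along its own direction; a hedged sketch of that kind of probe is below (grad_fn is an assumed callable returning the gradient at a parameter vector, not part of the paper's code):

```python
import numpy as np

def gradient_predictiveness(grad_fn, params, step_sizes):
    """For each step size s, report ||grad(params - s * g0) - g0||, where
    g0 is the gradient at the current point. Smaller values mean the
    current gradient remains a good predictor of the landscape ahead,
    i.e. the loss surface is effectively smoother."""
    g0 = grad_fn(params)
    return [float(np.linalg.norm(grad_fn(params - s * g0) - g0))
            for s in step_sizes]
```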
Differentially private stochastic gradient descent (DPSGD) is a variation of stochastic gradient descent based on the differential privacy (DP) paradigm, which can mitigate privacy threats arising from the presence of sensitive information in the training data. However, a major drawback of training deep neural networks with DPSGD is a reduction in model accuracy. This paper studies the effect of normalization layers on the performance of DPSGD. We demonstrate that normalization layers significantly affect the utility of deep neural networks with noisy parameters and should be considered an essential ingredient of training with DPSGD. In particular, we propose a novel method for integrating batch normalization with DPSGD without incurring additional privacy loss. With our approach, we are able to train deeper networks and achieve a better utility-privacy trade-off.
This digital book contains a practical and comprehensive introduction to everything related to deep learning in the context of physical simulations. As much as possible, all topics come with hands-on code examples in the form of Jupyter notebooks for a quick start. Beyond standard supervised learning from data, we look at physical loss constraints, more tightly coupled learning algorithms with differentiable simulations, as well as reinforcement learning and uncertainty modeling. We live in exciting times: these methods have enormous potential to fundamentally change what computer simulations can achieve.
Deep learning uses neural networks parameterized by their weights. The networks are usually trained by tuning the weights to directly minimize a given loss function. In this paper we propose to reparameterize the weights into targets for the firing strengths of the individual nodes in the network. Given a set of targets, it is possible to compute the weights that make the firing strengths best satisfy those targets. It is argued that using targets for training addresses the problem of exploding gradients, through a process we call cascade untangling, and makes the loss surface smoother to traverse, therefore leading to easier and faster training, as well as potentially better generalization, of the neural network. It also allows easier learning of deeper and recurrent network structures. The necessary conversion of targets into weights comes at an extra computational cost, which is in many cases manageable. Learning in target space can be combined with existing neural network optimizers for additional gain. Experimental results show the speed of using target space and examples of improved generalization, for fully connected and convolutional networks, as well as the ability to recall and process long time series and perform natural language processing with recurrent networks.
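A heavily simplified sketch of the targets-to-weights conversion as it is described above: store target pre-activations for a layer and recover the weights that best realize them by least squares. The exact parameterization and the cascade-untangling procedure in the paper may differ.

```python
import numpy as np

def weights_from_targets(h_prev, targets):
    """h_prev: previous layer's activations, shape [batch, inputs];
    targets: desired firing strengths for this layer, shape [batch, units].
    Return the weight matrix W ([inputs, units]) minimizing
    ||h_prev @ W - targets||^2, i.e. the weights that best satisfy
    the targets."""
    W, *_ = np.linalg.lstsq(h_prev, targets, rcond=None)
    return W
```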
Current adaptive optimizers for deep learning adjust the magnitude of parameter updates by altering the effective learning rate used for each parameter. Motivated by the known inverse relationship between batch size and learning rate on the magnitude of update steps, we introduce a novel training procedure that dynamically decides the dimension and composition of the current update step. Our procedure, Dynamic Batch Adaptation (DBA), analyzes the gradients of every sample and selects the subset that best improves certain metrics, such as gradient variance, for each layer of the network. We present results showing that DBA significantly improves the speed of model convergence. Additionally, we find that DBA yields its largest gains over standard optimizers when used in data-scarce conditions, where it reaches a test accuracy of 97.79% on MNIST. In an even more extreme scenario, it manages to reach a test accuracy of 97.44% using only 10 samples per class. Compared with the standard optimizers, stochastic gradient descent (SGD) and Adam, these results represent relative error rate reductions of 81.78% and 88.07%, respectively.
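A speculative sketch of the selection step described above, assuming per-sample gradients are available and using distance to the mean gradient as a stand-in for the "gradient variance" criterion; the actual DBA metric and selection rule may differ.

```python
import numpy as np

def select_sub_batch(per_sample_grads, k):
    """per_sample_grads: [n_samples, n_params], the gradient of each sample.
    Pick the k samples whose gradients are closest to the mean gradient,
    a simple proxy for composing an update step with low gradient
    variance; return their indices."""
    mean_g = per_sample_grads.mean(axis=0)
    dist = np.linalg.norm(per_sample_grads - mean_g, axis=1)
    return np.argsort(dist)[:k]
```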
This is an introductory machine learning course specifically developed with STEM students in mind. Our goal is to provide the interested reader with the basics needed to employ machine learning in their own projects and to become familiar with the terminology as a foundation for further reading of the relevant literature. In these lecture notes, we discuss supervised, unsupervised, and reinforcement learning. The notes start with an exposition of machine learning methods without neural networks, such as principal component analysis, t-SNE, clustering, as well as linear regression and linear classifiers. We continue with an introduction to both basic and advanced neural network structures such as dense feed-forward and conventional neural networks, recurrent neural networks, restricted Boltzmann machines, (variational) autoencoders, and generative adversarial networks. Questions of interpretability of latent-space representations are discussed, illustrated with examples from dreaming and adversarial attacks. The final section is dedicated to reinforcement learning, where we introduce the basic concepts of value-function and policy learning.
Innovations in neural architectures have fostered significant breakthroughs in language modeling and computer vision. Unfortunately, novel architectures often result in challenging hyperparameter choices and training instability if the network parameters are not properly initialized. Many architecture-specific initialization schemes have been proposed, but these schemes are not always portable to new architectures. This paper introduces GradInit, an automated and architecture-agnostic method for initializing neural networks. GradInit is based on a simple heuristic: the norm of each network layer is adjusted so that a single step of SGD or Adam with prescribed hyperparameters results in the smallest possible loss value. This adjustment is done by introducing a scalar multiplier variable in front of each parameter block and then optimizing these variables using a simple numerical scheme. GradInit accelerates the convergence and improves the test performance of many convolutional architectures, both with and without skip connections, and even without normalization layers. It also improves the stability of the original Transformer architecture for machine translation, enabling it to be trained with Adam or SGD under a wide range of learning rates and momentum coefficients without learning-rate warmup. Code is available at https://github.com/zhuchen03/gradinit.
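A toy sketch of the heuristic as stated in the abstract: scale each parameter block by a scalar multiplier and tune the multipliers so that the loss after a single plain SGD step is as small as possible. The crude coordinate search and finite-difference gradients below stand in for the paper's numerical scheme; loss_fn and param_blocks are assumed inputs, not GradInit's actual interface.

```python
import numpy as np

def finite_diff_grad(loss_fn, params, eps=1e-4):
    """Finite-difference gradient of loss_fn at a flat parameter vector."""
    g = np.zeros_like(params)
    for i in range(params.size):
        d = np.zeros_like(params)
        d[i] = eps
        g[i] = (loss_fn(params + d) - loss_fn(params - d)) / (2 * eps)
    return g

def gradinit_like(param_blocks, loss_fn, lr=0.1, eta=0.05, sweeps=5):
    """param_blocks: list of 1-D float arrays (one per layer/block).
    Tune one scalar multiplier per block so that the loss obtained
    after a single SGD step with learning rate `lr` is minimized."""
    scales = np.ones(len(param_blocks))

    def loss_after_one_sgd_step(s):
        flat = np.concatenate([si * b for si, b in zip(s, param_blocks)])
        return loss_fn(flat - lr * finite_diff_grad(loss_fn, flat))

    for _ in range(sweeps):
        for i in range(len(scales)):
            for factor in (1.0 + eta, 1.0 - eta):
                trial = scales.copy()
                trial[i] *= factor
                if loss_after_one_sgd_step(trial) < loss_after_one_sgd_step(scales):
                    scales = trial
    return [si * b for si, b in zip(scales, param_blocks)]
```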
The generalization of machine learning models has a complex dependence on the data, the model, and the learning algorithm. We study the train and test performance, as well as the generalization gap given by their difference across different samples of the dataset, in order to understand their "typical" behavior. We derive an expression for the gap as a function of the covariance between the model parameter distribution and the train loss, and another expression for the average test performance, showing that test generalization depends only on the data-averaged parameter distribution and the data-averaged loss. We show that, for a large class of model parameter distributions, a modified generalization gap is always non-negative. By further specializing to parameter distributions produced by stochastic gradient descent (SGD), together with some approximations and modeling considerations, we are able to predict some aspects of how the generalization gap and the model's train and test performance vary as a function of SGD noise. We evaluate these predictions empirically on the CIFAR10 classification task with a ResNet architecture.
There is mounting empirical evidence about the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work explores this question through the lens of learning a $k$-sparse parity of $n$ bits, a canonical problem that poses a theoretical computational barrier. In this setting, we find that neural networks exhibit surprising phase transitions as dataset size and running time are scaled up. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with $n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after $n^{O(k)}$ iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity prior. We elucidate the mechanism behind these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm that also runs in $n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.
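For readers unfamiliar with the task, here is a small data generator for the $(n, k)$-sparse parity problem referenced above; the architectures and training loops used in the paper's experiments are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_parity_task(n, k, num_samples):
    """Inputs are uniform n-bit strings; the label is the XOR (parity)
    of a fixed, hidden subset of k coordinates. The learner does not
    know the subset, which is what creates the computational barrier."""
    support = rng.choice(n, size=k, replace=False)
    x = rng.integers(0, 2, size=(num_samples, n))
    y = x[:, support].sum(axis=1) % 2
    return x.astype(np.float32), y.astype(np.float32), support

x, y, support = sparse_parity_task(n=50, k=3, num_samples=10_000)
```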
AlphaZero, Leela Chess Zero, and Stockfish NNUE revolutionized computer chess. This book gives a complete introduction into the technical inner workings of such engines. The book is split into four main chapters, excluding Chapter 1 (introduction) and Chapter 6 (conclusion): Chapter 2 introduces neural networks and covers all the basic building blocks used to construct deep networks such as the one used by AlphaZero. Topics include perceptrons, back-propagation and gradient descent, classification, regression, multilayer perceptrons, vectorization techniques, convolutional networks, squeeze networks, squeeze-and-excitation networks, fully connected networks, batch normalization and layer normalization, rectified linear units, residual layers, and overfitting and underfitting. Chapter 3 introduces the classical search techniques used in chess engines as well as those used by AlphaZero. Topics include minimax, alpha-beta search, and Monte Carlo tree search. Chapter 4 shows how modern chess engines are designed. In addition to the pioneering AlphaGo, AlphaGo Zero, and AlphaZero, we cover Leela Chess Zero, Fat Fritz, Fat Fritz 2, efficiently updatable neural networks (NNUE), and Maia. Chapter 5 is about implementing a miniature AlphaZero. Hexapawn, a minimalistic version of chess, is used as the example for this. Minimax search solves Hexapawn and generates training positions for supervised learning. Then, as a comparison, an AlphaZero-like training loop is implemented, in which training is carried out by self-play combined with reinforcement learning. Finally, AlphaZero-like training and supervised training are compared.
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles using Gaussian scale mixture priors on neural network weights, learns the variational posterior distribution of Bernoulli random variables multiplying the units/filters similarly to adaptive dropout. Our algorithm ensures that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. We analytically derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection and leads to consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We prove the convergence properties of our algorithm establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST and CIFAR-10 data sets and the commonly used fully connected and convolutional LeNet and VGG16 architectures. The simulations show that our method achieves pruning levels on par with state-of-the-art methods for structured pruning, while maintaining better test accuracy and, more importantly, in a manner robust with respect to network initialization and initial size.
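A simplified sketch of the gating mechanism the abstract describes: each unit is multiplied by a Bernoulli variable whose probability is learned and driven toward 0 or 1. The variational machinery, the Gaussian scale mixture prior, and the hyper-prior are omitted; the names below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gated_layer_forward(x, W, theta, training=True):
    """x: [batch, inputs], W: [inputs, units], theta: per-unit keep
    probabilities. During training the gates are sampled (similar to
    adaptive dropout); at inference theta has converged to ~0 or ~1,
    so thresholding gives a deterministic network."""
    h = x @ W
    z = rng.binomial(1, theta) if training else (theta > 0.5).astype(float)
    return h * z

def prune_columns(W, theta):
    """Units whose theta converged to 0 can be removed entirely,
    shrinking the layer and all subsequent computation."""
    return W[:, theta > 0.5]
```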
Batch normalization (BN) consists of a normalization component followed by an affine transformation, and it has become essential for training deep neural networks. Standard initialization of each BN in a network sets the scale and shift of the affine transformation to 1 and 0, respectively. However, after training we observe that these parameters do not change much from their initialization. Furthermore, we notice that the normalization process can still yield overly large values, which is problematic for training. We revisit the BN formulation and present a new initialization method and update approach for BN to address the aforementioned issues. Experiments are designed to emphasize and demonstrate the positive influence of proper BN scale initialization on performance, and they are evaluated using rigorous statistical significance tests. The method can be used with existing implementations at no additional computational cost. Source code is available at https://github.com/osu-cvl/revisiting-bninit.
This paper presents a novel activation method, termed FPLUS, which exploits a mathematical power function with a polar sign in its form. It is inspired by common inverse operations while being endowed with an intuitive meaning from bionics. The formulation is derived theoretically under conditions of some prior knowledge and anticipated properties, and its feasibility is then verified through a series of experiments on typical benchmark datasets. The results indicate that our method holds competitive superiority among many activation functions, as well as compatible stability across many CNN architectures. Furthermore, we extend the presented function to a broader family called PFPLUS, with two parameters that can be fixed or learnable, so as to increase its expressive capacity, and the same tests verify this improvement.
Methods such as Layer Normalization (LN) and Batch Normalization (BN) have proven effective in improving the training of recurrent neural networks (RNNs). However, existing methods normalize using only the instantaneous information at one particular time step, and the result of the normalization is a pre-activation state with a time-independent distribution. This implementation fails to account for certain temporal differences inherent in the inputs and the architecture of RNNs. Since these networks share weights across time steps, it may also be desirable to account for the connections between time steps in the normalization scheme. In this paper, we propose a normalization method called Assorted-Time Normalization (ATN), which preserves information from multiple consecutive time steps and normalizes using them. This setup allows us to introduce longer time dependencies into traditional normalization methods without adding any new trainable parameters. We present a theoretical derivation of the gradient propagation and prove the weight-scaling invariance property. Our experiments applying ATN to LN show consistent improvements on a variety of tasks, such as the Adding, Copying, and Denoise problems, as well as language modeling.
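A minimal sketch of the idea as described, pooling layer-normalization statistics over a window of recent time steps; the exact pooling and weighting used by ATN may differ, and the learned affine parameters are omitted.

```python
import numpy as np

def assorted_time_norm(h_history, window, eps=1e-5):
    """h_history: list of hidden pre-activations over time, each of
    shape [batch, hidden]. Normalize the current step using per-example
    statistics computed over the last `window` steps and all hidden
    units, instead of the current step alone. No new trainable
    parameters are introduced."""
    recent = np.stack(h_history[-window:], axis=0)   # [T, batch, hidden]
    mu = recent.mean(axis=(0, 2))                    # [batch]
    var = recent.var(axis=(0, 2))                    # [batch]
    h_t = h_history[-1]
    return (h_t - mu[:, None]) / np.sqrt(var[:, None] + eps)
```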
We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
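A single-layer sketch of the Kronecker-factored preconditioning described above, for a fully connected layer with inputs a and back-propagated gradients g. The Fisher block is approximated as kron(A, G) with A = E[a a^T] and G = E[g g^T], so applying its inverse reduces to two small matrix solves; a small damping term keeps the factors invertible, and the full optimizer loop is omitted.

```python
import numpy as np

def kfac_precondition(grad_W, a, g, damping=1e-3):
    """grad_W: [out, in] gradient of the layer's weight matrix;
    a: [batch, in] layer inputs; g: [batch, out] back-propagated
    gradients. Since kron(A, G)^{-1} vec(grad_W) = vec(G^{-1} grad_W A^{-1}),
    only the two small Kronecker factors are ever inverted, independent
    of how much data was used to estimate them."""
    A = a.T @ a / a.shape[0] + damping * np.eye(a.shape[1])
    G = g.T @ g / g.shape[0] + damping * np.eye(g.shape[1])
    return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
```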