Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batchnormalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
translated by 谷歌翻译
这项研究介绍了一个称为批处理层归一化(BLN)的新的归一化层,以减少深神经网络层中内部协变量转移的问题。作为批处理和层归一化的组合版本,BLN自适应地将适当的重量放在迷你批处理上,并基于迷你批次的逆尺寸,在学习过程中将输入标准化为层。它还使用微型批量统计或人口统计数据,在推理时间执行精确的计算,并在推理时间进行较小的更改。使用迷你批量或人口统计的决策过程使BLN具有在模型的超参数优化过程中发挥全面作用的能力。 BLN的关键优势是对独立于输入数据的理论分析的支持,其统计配置在很大程度上取决于执行的任务,培训数据的量和批次的大小。测试结果表明,BLN的应用潜力及其更快的收敛性在卷积和复发性神经网络中都比批处理归一化和层归一化。实验的代码在线公开可用(https://github.com/a2amir/batch-layer-normalization)。
translated by 谷歌翻译
Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers' input distributions during training to reduce the so-called "internal covariate shift". In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.
translated by 谷歌翻译
We introduce a method to train Quantized Neural Networks (QNNs) -neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
translated by 谷歌翻译
We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PRe-LUs), ELUs alleviate the vanishing gradient problem via the identity for positive values. However ELUs have improved learning characteristics compared to the units with other activation functions. In contrast to ReLUs, ELUs have negative values which allows them to push mean unit activations closer to zero like batch normalization but with lower computational complexity. Mean shifts toward zero speed up learning by bringing the normal gradient closer to the unit natural gradient because of a reduced bias shift effect. While LReLUs and PReLUs have negative values, too, they do not ensure a noise-robust deactivation state. ELUs saturate to a negative value with smaller inputs and thereby decrease the forward propagated variation and information. Therefore ELUs code the degree of presence of particular phenomena in the input, while they do not quantitatively model the degree of their absence. In experiments, ELUs lead not only to faster learning, but also to significantly better generalization performance than ReLUs and LReLUs on networks with more than 5 layers. On CIFAR-100 ELUs networks significantly outperform ReLU networks with batch normalization while batch normalization does not improve ELU networks. ELU networks are among the top 10 reported CIFAR-10 results and yield the best published result on CIFAR-100, without resorting to multi-view evaluation or model averaging. On ImageNet, ELU networks considerably speed up learning compared to a ReLU network with the same architecture, obtaining less than 10% classification error for a single crop, single model network.
translated by 谷歌翻译
深度学习使用由其重量进行参数化的神经网络。通常通过调谐重量来直接最小化给定损耗功能来训练神经网络。在本文中,我们建议将权重重新参数转化为网络中各个节点的触发强度的目标。给定一组目标,可以计算使得发射强度最佳地满足这些目标的权重。有人认为,通过我们称之为级联解压缩的过程,使用培训的目标解决爆炸梯度的问题,并使损失功能表面更加光滑,因此导致更容易,培训更快,以及潜在的概括,神经网络。它还允许更容易地学习更深层次和经常性的网络结构。目标对重量的必要转换有额外的计算费用,这是在许多情况下可管理的。在目标空间中学习可以与现有的神经网络优化器相结合,以额外收益。实验结果表明了使用目标空间的速度,以及改进的泛化的示例,用于全连接的网络和卷积网络,以及调用和处理长时间序列的能力,并使用经常性网络进行自然语言处理。
translated by 谷歌翻译
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
translated by 谷歌翻译
分批归一化(BN)由归一化组成部分,然后是仿射转化,并且对于训练深神经网络至关重要。网络中每个BN的标准初始化分别设置了仿射变换量表,并将其转移到1和0。但是,经过训练,我们观察到这些参数从初始化中并没有太大变化。此外,我们注意到归一化过程仍然可以产生过多的值,这对于训练是不可能的。我们重新审视BN公式,并为BN提出了一种新的初始化方法和更新方法,以解决上述问题。实验旨在强调和证明适当的BN规模初始化对性能的积极影响,并使用严格的统计显着性测试进行评估。该方法可以与现有实施方式一起使用,没有额外的计算成本。源代码可在https://github.com/osu-cvl/revisiting-bninit上获得。
translated by 谷歌翻译
We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
translated by 谷歌翻译
Convolutional networks are at the core of most stateof-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and with using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.
translated by 谷歌翻译
神经架构的创新促进了语言建模和计算机视觉中的重大突破。不幸的是,如果网络参数未正确初始化,新颖的架构通常会导致挑战超参数选择和培训不稳定。已经提出了许多架构特定的初始化方案,但这些方案并不总是可移植到新体系结构。本文介绍了毕业,一种用于初始化神经网络的自动化和架构不可知论由方法。毕业基础是一个简单的启发式;调整每个网络层的规范,使得具有规定的超参数的SGD或ADAM的单个步骤导致可能的损耗值最小。通过在每个参数块前面引入标量乘数变量,然后使用简单的数字方案优化这些变量来完成此调整。 GradInit加速了许多卷积架构的收敛性和测试性能,无论是否有跳过连接,甚至没有归一化层。它还提高了机器翻译的原始变压器架构的稳定性,使得在广泛的学习速率和动量系数下使用ADAM或SGD来训练它而无需学习速率预热。代码可在https://github.com/zhuchen03/gradinit上获得。
translated by 谷歌翻译
在现代深网(DNS)中,至关重要的,无处不在且知之甚少的成分是批处理(BN),它以特​​征图为中心并归一化。迄今为止,只有有限的进步才能理解为什么BN会提高DN学习和推理表现。工作专注于表明BN平滑DN的损失格局。在本文中,我们从函数近似的角度从理论上研究BN。我们利用这样一个事实,即当今最先进的DNS是连续的分段仿射(CPA),可以通过定义在输入空间的分区上定义的仿射映射来预测培训数据(所谓的“线性”区域”)。 {\ em我们证明了BN是一种无监督的学习技术,它独立于DN的权重或基于梯度的学习 - 适应DN的样条分区的几何形状以匹配数据。} BN提供了“智能初始化”,可提高“智能初始化” DN学习的性能,因为它甚至适应了以随机权重初始化的DN,以使其样条分区与数据保持一致。我们还表明,微型批次之间的BN统计数据的变化引入了辍学的随机扰动,以对分区边界,因此分类问题的决策边界。每次微型摄入扰动可通过增加训练样本和决策边界之间的边距来减少过度拟合并改善概括。
translated by 谷歌翻译
通过卫星摄像机获取关于地球表面的大面积的信息使我们能够看到远远超过我们在地面上看到的更多。这有助于我们在检测和监测土地使用模式,大气条件,森林覆盖和许多非上市方面的地区的物理特征。所获得的图像不仅跟踪连续的自然现象,而且对解决严重森林砍伐的全球挑战也至关重要。其中亚马逊盆地每年占最大份额。适当的数据分析将有助于利用可持续健康的氛围来限制对生态系统和生物多样性的不利影响。本报告旨在通过不同的机器学习和优越的深度学习模型用大气和各种陆地覆盖或土地使用亚马逊雨林的卫星图像芯片。评估是基于F2度量完成的,而用于损耗函数,我们都有S形跨熵以及Softmax交叉熵。在使用预先训练的ImageNet架构中仅提取功能之后,图像被间接馈送到机器学习分类器。鉴于深度学习模型,通过传输学习使用微调Imagenet预训练模型的集合。到目前为止,我们的最佳分数与F2度量为0.927。
translated by 谷歌翻译
In training neural networks, batch normalization has many benefits, not all of them entirely understood. But it also has some drawbacks. Foremost is arguably memory consumption, as computing the batch statistics requires all instances within the batch to be processed simultaneously, whereas without batch normalization it would be possible to process them one by one while accumulating the weight gradients. Another drawback is that that distribution parameters (mean and standard deviation) are unlike all other model parameters in that they are not trained using gradient descent but require special treatment, complicating implementation. In this paper, I show a simple and straightforward way to address these issues. The idea, in short, is to add terms to the loss that, for each activation, cause the minimization of the negative log likelihood of a Gaussian distribution that is used to normalize the activation. Among other benefits, this will hopefully contribute to the democratization of AI research by means of lowering the hardware requirements for training larger models.
translated by 谷歌翻译
Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).
translated by 谷歌翻译
由于稀疏神经网络通常包含许多零权重,因此可以在不降低网络性能的情况下潜在地消除这些不必要的网络连接。因此,设计良好的稀疏神经网络具有显着降低拖鞋和计算资源的潜力。在这项工作中,我们提出了一种新的自动修剪方法 - 稀疏连接学习(SCL)。具体地,重量被重新参数化为可培训权重变量和二进制掩模的元素方向乘法。因此,由二进制掩模完全描述网络连接,其由单位步进函数调制。理论上,从理论上证明了使用直通估计器(STE)进行网络修剪的基本原理。这一原则是STE的代理梯度应该是积极的,确保掩模变量在其最小值处收敛。在找到泄漏的Relu后,SoftPlus和Identity Stes可以满足这个原理,我们建议采用SCL的身份STE以进行离散面膜松弛。我们发现不同特征的面具梯度非常不平衡,因此,我们建议将每个特征的掩模梯度标准化以优化掩码变量训练。为了自动训练稀疏掩码,我们将网络连接总数作为我们的客观函数中的正则化术语。由于SCL不需要由网络层设计人员定义的修剪标准或超级参数,因此在更大的假设空间中探讨了网络,以实现最佳性能的优化稀疏连接。 SCL克服了现有自动修剪方法的局限性。实验结果表明,SCL可以自动学习并选择各种基线网络结构的重要网络连接。 SCL培训的深度学习模型以稀疏性,精度和减少脚波特的SOTA人类设计和自动修剪方法训练。
translated by 谷歌翻译
神经网络的架构和参数通常独立优化,这需要每当修改体系结构时对参数的昂贵再次再次再次进行验证。在这项工作中,我们专注于在不需要昂贵的再培训的情况下越来越多。我们提出了一种在训练期间添加新神经元的方法,而不会影响已经学到的内容,同时改善了培训动态。我们通过最大化新重量的梯度来实现后者,并通过奇异值分解(SVD)有效地找到最佳初始化。我们称这种技术渐变最大化增长(Gradmax),并展示其各种视觉任务和架构的效力。
translated by 谷歌翻译
In this work we introduce a binarized deep neural network (BDNN) model. BDNNs are trained using a novel binarized back propagation algorithm (BBP), which uses binary weights and binary neurons during the forward and backward propagation, while retaining precision of the stored weights in which gradients are accumulated. At test phase, BDNNs are fully binarized and can be implemented in hardware with low circuit complexity. The proposed binarized networks can be implemented using binary convolutions and proxy matrix multiplications with only standard binary XNOR and population count (popcount) operations. BBP is expected to reduce energy consumption by at least two orders of magnitude when compared to the hardware implementation of existing training algorithms. We obtained near state-of-the-art results with BDNNs on the permutation-invariant MNIST, CIFAR-10 and SVHN datasets.
translated by 谷歌翻译
深度学习在广泛的AI应用方面取得了有希望的结果。较大的数据集和模型一致地产生更好的性能。但是,我们一般花费更长的培训时间,以更多的计算和沟通。在本调查中,我们的目标是在模型精度和模型效率方面提供关于大规模深度学习优化的清晰草图。我们调查最常用于优化的算法,详细阐述了大批量培训中出现的泛化差距的可辩论主题,并审查了解决通信开销并减少内存足迹的SOTA策略。
translated by 谷歌翻译
Batch Normalization (BN) is an important preprocessing step to many deep learning applications. Since it is a data-dependent process, for some homogeneous datasets it is a redundant or even a performance-degrading process. In this paper, we propose an early-stage feasibility assessment method for estimating the benefits of applying BN on the given data batches. The proposed method uses a novel threshold-based approach to classify the training data batches into two sets according to their need for normalization. The need for normalization is decided based on the feature heterogeneity of the considered batch. The proposed approach is a pre-training processing, which implies no training overhead. The evaluation results show that the proposed approach achieves better performance mostly in small batch sizes than the traditional BN using MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets. Additionally, the network stability is increased by reducing the occurrence of internal variable transformation.
translated by 谷歌翻译