Deep Neural Networks (DNN) have achieved state-of-the-art results in a wide range of tasks, with the best results obtained with large training sets and large models. In the past, GPUs enabled these breakthroughs because of their greater computational speed. In the future, faster computation at both training and test time is likely to be crucial for further progress and for consumer applications on low-power devices. As a result, there is much interest in research and development of dedicated hardware for Deep Learning (DL). Binary weights, i.e., weights which are constrained to only two possible values (e.g. -1 or 1), would bring great benefits to specialized DL hardware by replacing many multiply-accumulate operations by simple accumulations, as multipliers are the most space and powerhungry components of the digital implementation of neural networks. We introduce BinaryConnect, a method which consists in training a DNN with binary weights during the forward and backward propagations, while retaining precision of the stored weights in which gradients are accumulated. Like other dropout schemes, we show that BinaryConnect acts as regularizer and we obtain near state-of-the-art results with BinaryConnect on the permutation-invariant MNIST, CIFAR-10 and SVHN.
translated by 谷歌翻译
We introduce a method to train Quantized Neural Networks (QNNs) -neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
translated by 谷歌翻译
In this work we introduce a binarized deep neural network (BDNN) model. BDNNs are trained using a novel binarized back propagation algorithm (BBP), which uses binary weights and binary neurons during the forward and backward propagation, while retaining precision of the stored weights in which gradients are accumulated. At test phase, BDNNs are fully binarized and can be implemented in hardware with low circuit complexity. The proposed binarized networks can be implemented using binary convolutions and proxy matrix multiplications with only standard binary XNOR and population count (popcount) operations. BBP is expected to reduce energy consumption by at least two orders of magnitude when compared to the hardware implementation of existing training algorithms. We obtained near state-of-the-art results with BDNNs on the permutation-invariant MNIST, CIFAR-10 and SVHN datasets.
translated by 谷歌翻译
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58× faster convolutional operations (in terms of number of the high precision operations) and 32× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than 16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.
translated by 谷歌翻译
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding.
translated by 谷歌翻译
使用浮点实数实现标准深度学习算法。这呈现了在可能没有专用浮点单元(FPU)的低端设备上实现它们的障碍。因此,Tinyml的研究人员认为可以使用Integer操作在低端设备上培训和运行深神经网络(DNN)的机器学习算法。本文在纯C ++中提出了Pocketnn,轻型和独立的概念概念框架,用于仅使用整数的DNN训练和推断。与其他方法不同,PocketNN直接在整数上运行,而无需任何显式量化算法或定制的定期点格式。这是通过口袋激活来实现的,这是一个用于整数DNN的激活函数系列,以及称为直接反馈对准(DFA)的新兴DNN训练算法。与标准BackPropagation(BP)不同,DFA独立列举每个图层,从而避免在使用仅具有整数操作的BP时是一个关键问题的整数溢出。我们使用Pocketnn在两个着名的数据集,Mnist和Fashion-Mnist上培训一些DNN。我们的实验表明,DNN与我们的PocketNN接受过的DNN培训,分别在MNIST和Fashion-Mnist数据集中获得了96.98%和87.7%的准确性。精度非常接近使用具有浮点实数操作的BP培训的等效DNN,使得精度降解分别为1.02%p和2.09%p。最后,我们的PocketNN为低端设备具有高兼容性和可移植性,因为它是开源的开源,并在纯C ++中实现,没有任何依赖项。
translated by 谷歌翻译
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems.This article aims to provide a comprehensive tutorial and survey about the recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-designs, being proposed in academia and industry.The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the trade-offs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
translated by 谷歌翻译
Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91% on CIFAR-10).
translated by 谷歌翻译
我们最近提出了S4NN算法,基本上是对多层尖峰神经网络的反向化的适应,该网上网络使用简单的非泄漏整合和火神经元和一种形式称为第一峰值编码的时间编码。通过这种编码方案,每次刺激最多一次都是神经元火灾,但射击令携带信息。这里,我们引入BS4NN,S4NN的修改,其中突触权重被约束为二进制(+1或-1),以便减少存储器(理想情况下,每个突触的一个比特)和计算占地面积。这是使用两组权重完成:首先,通过梯度下降更新的实际重量,并在BackProjagation的后退通行证中使用,其次是在前向传递中使用的迹象。类似的策略已被用于培训(非尖峰)二值化神经网络。主要区别在于BS4NN在时域中操作:尖峰依次繁殖,并且不同的神经元可以在不同时间达到它们的阈值,这增加了计算能力。我们验证了两个流行的基准,Mnist和Fashion-Mnist上的BS4NN,并获得了这种网络的合理精度(分别为97.0%和87.3%),具有可忽略的准确率,具有可忽略的重量率(0.4%和0.7%,分别)。我们还展示了BS4NN优于具有相同架构的简单BNN,在这两个数据集上(分别为0.2%和0.9%),可能是因为它利用时间尺寸。建议的BS4NN的源代码在HTTPS://github.com/srkh/bs4nn上公开可用。
translated by 谷歌翻译
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batchnormalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
translated by 谷歌翻译
We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
translated by 谷歌翻译
日益复杂的机器学习模型的不断增长的计算需求通常需要使用强大的基于云的基础架构进行培训。已知二元神经网络由于其极端的计算和内存节省了更高精确的替代方案,因此有望进行现场推断。但是,他们现有的训练方法需要同时存储所有层的高精度激活,这通常使在内存受限的设备上学习不可行。在本文中,我们证明了二进制神经网络训练所需的向后传播操作对量化非常强大,从而使现代模型的现场学习成为实用命题。我们介绍了一种低成本的二元神经网络训练策略,该策略表现出相当大的记忆范围减少,同时几乎没有准确的损失与Courbariaux&Bengio的标准方法。这些减少主要是通过仅以二进制格式保留激活来实现的。在后一种算法上,我们的置换替换量看到记忆需求减少3--5 $ \ times $,同时在可比时间内达到相似的测试准确性,这些型号跨越了一系列经过培训的小型模型,用于对流行数据集进行分类。我们还展示了对二进制RESNET-18的从划痕成像网训练,并实现了3.78 $ \ times $减少内存。我们的工作是开源的,包括覆盆子Pi靶向原型,我们用来验证建模的内存降低并捕获相关的能量滴。这样的节省将避免不必要的云下载,减少延迟,提高能源效率和保护最终用户的隐私。
translated by 谷歌翻译
由于神经网络变得更加强大,因此在现实世界中部署它们的愿望是一个上升的愿望;然而,神经网络的功率和准确性主要是由于它们的深度和复杂性,使得它们难以部署,尤其是在资源受限的设备中。最近出现了神经网络量化,以满足这种需求通过降低网络的精度来降低神经网络的大小和复杂性。具有较小和更简单的网络,可以在目标硬件的约束中运行神经网络。本文调查了在过去十年中开发的许多神经网络量化技术。基于该调查和神经网络量化技术的比较,我们提出了该地区的未来研究方向。
translated by 谷歌翻译
Increasing the size of a neural network typically improves accuracy but also increases the memory and compute requirements for training the model. We introduce methodology for training deep neural networks using half-precision floating point numbers, without losing model accuracy or having to modify hyperparameters. This nearly halves memory requirements and, on recent GPUs, speeds up arithmetic. Weights, activations, and gradients are stored in IEEE halfprecision format. Since this format has a narrower range than single-precision we propose three techniques for preventing the loss of critical information. Firstly, we recommend maintaining a single-precision copy of weights that accumulates the gradients after each optimizer step (this copy is rounded to half-precision for the forward-and back-propagation). Secondly, we propose loss-scaling to preserve gradient values with small magnitudes. Thirdly, we use half-precision arithmetic that accumulates into single-precision outputs, which are converted to halfprecision before storing to memory. We demonstrate that the proposed methodology works across a wide variety of tasks and modern large scale (exceeding 100 million parameters) model architectures, trained on large datasets.
translated by 谷歌翻译
神经网络的外包计算允许用户访问艺术模型的状态,而无需投资专门的硬件和专业知识。问题是用户对潜在的隐私敏感数据失去控制。通过同性恋加密(HE)可以在加密数据上执行计算,而不会显示其内容。在这种知识的系统化中,我们深入了解与隐私保留的神经网络相结合的方法。我们将更改分类为神经网络模型和架构,使其在他和这些变化的影响方面提供影响。我们发现众多挑战是基于隐私保留的深度学习,例如通过加密方案构成的计算开销,可用性和限制。
translated by 谷歌翻译
The aim of this paper is to introduce a new learning procedure for neural networks and to demonstrate that it works well enough on a few small problems to be worth further investigation. The Forward-Forward algorithm replaces the forward and backward passes of backpropagation by two forward passes, one with positive (i.e. real) data and the other with negative data which could be generated by the network itself. Each layer has its own objective function which is simply to have high goodness for positive data and low goodness for negative data. The sum of the squared activities in a layer can be used as the goodness but there are many other possibilities, including minus the sum of the squared activities. If the positive and negative passes could be separated in time, the negative passes could be done offline, which would make the learning much simpler in the positive pass and allow video to be pipelined through the network without ever storing activities or stopping to propagate derivatives.
translated by 谷歌翻译
代表低精度的深度神经网络(DNN)是一种有希望的方法来实现有效的加速和记忆力。以前的方法在低精度中培训DNN的方法通常在重量更新期间在高精度中保持重量的重量副本。由于低精度数字系统与学习算法之间的复杂相互作用,直接具有低精度重量的培训导致精度下降。为了解决这个问题,我们开发了一个共同设计的低精度训练框架,被称为LNS-MADAM,我们共同设计了对数号系统(LNS)和乘法权重算法(MADAM)。我们证明了LNS-MADAM在重量更新期间导致低量化误差,即使精度有限,也导致稳定的收敛。我们进一步提出了LNS-MADAM的硬件设计,可以解决实现LNS计算的有效数据路径的实际挑战。我们的实现有效地降低了LNS - 整数转换和部分总和累积所产生的能量开销。实验结果表明,LNS-MADAM为全精密对应物达到了可比的准确性,只有8位对流行的计算机视觉和自然语言任务。与全精密浮点实施相比,LNS-MADAM将能耗降低超过90。
translated by 谷歌翻译