我们提出了一种培训具有二进制权重的深神经网络(DNN)的新算法。特别是,我们首先将培训二元神经网络(Binns)作为彼得纤维优化实例的问题施放,随后构建这种携手节目的灵活放松。由此产生的训练方法与若干培训箱的几种现有方法共享其算法简单性,特别是在BinaryConnect中成功使用的直通梯度估计器和随后的方法。实际上,我们所提出的方法可以被解释为原始直通估计器的自适应变型,其有条件地(但不总是)起作用在误差传播的后向通过中的线性映射。实验结果表明,与现有方法相比,我们的新算法提供了有利的性能。
translated by 谷歌翻译
In this paper, we develop a new algorithm, Annealed Skewed SGD - AskewSGD - for training deep neural networks (DNNs) with quantized weights. First, we formulate the training of quantized neural networks (QNNs) as a smoothed sequence of interval-constrained optimization problems. Then, we propose a new first-order stochastic method, AskewSGD, to solve each constrained optimization subproblem. Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization under the entire feasible set and allows iterates that are infeasible. The numerical complexity of AskewSGD is comparable to existing approaches for training QNNs, such as the straight-through gradient estimator used in BinaryConnect, or other state of the art methods (ProxQuant, LUQ). We establish convergence guarantees for AskewSGD (under general assumptions for the objective function). Experimental results show that the AskewSGD algorithm performs better than or on par with state of the art methods in classical benchmarks.
translated by 谷歌翻译
基于量化的模型压缩是高性能和快速的推理方法,其产生与其全精密浮点对应物相比高压缩的模型。最极端的量化是参数的1位表示,使得它们仅具有两个可能的值,通常是-1(0)或+1,从而能够仅使用添加能够实现普遍存在的点产品。这项工作的主要贡献是引入一种方法来平滑确定权重的二进制载体的组合问题,以通过反向衰竭的经验风险最小化最小化给定目标的预期损失。这是通过利用实际值,连续参数的确定性和可微分的变换来近似多变量二进制状态来实现的。该方法在训练中增加了很少的开销,可以很容易地应用于对原始架构的任何实质性修改,不会引入额外的饱和非线性或辅助损耗,并且不会禁止应用其他二值化的其他方法。与文献中的常见断言相反,证明二元加权网络可以用相同的标准优化技术和类似的封路数据设置作为它们的全精密对应物,专门具有大学学习率和$ L_2 $正常化的动量SGD。为了得出结论实验,证明该方法在与其全精密对应物相比,各种架构的许多电感图像分类任务中的方法非常好。源代码在https://bitbucket.org/yanivshu/binary_weighted_networks_public中公开可用。
translated by 谷歌翻译
While machine learning is traditionally a resource intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensure a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday's applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We also briefly discuss different concepts of embedded hardware for DNNs and their compatibility with machine learning techniques as well as potential for energy and latency reduction. We substantiate our discussion with experiments on well-known benchmark datasets using compression techniques (quantization, pruning) for a set of resource-constrained embedded systems, such as CPUs, GPUs and FPGAs. The obtained results highlight the difficulty of finding good trade-offs between resource efficiency and predictive performance.
translated by 谷歌翻译
深度学习在广泛的AI应用方面取得了有希望的结果。较大的数据集和模型一致地产生更好的性能。但是,我们一般花费更长的培训时间,以更多的计算和沟通。在本调查中,我们的目标是在模型精度和模型效率方面提供关于大规模深度学习优化的清晰草图。我们调查最常用于优化的算法,详细阐述了大批量培训中出现的泛化差距的可辩论主题,并审查了解决通信开销并减少内存足迹的SOTA策略。
translated by 谷歌翻译
我们介绍了正规化的弗兰克 - 沃尔夫(Frank-Wolfe),这是一种通用有效的算法,用于推断和学习密集的有条件随机场(CRF)。该算法使用Vanilla Frank-Wolfe优化了CRF推理问题的不连续放松,并具有近似更新,这相当于最大程度地减少正则能量函数。我们提出的方法是对现有算法(例如平均字段或凹形通用程序)的概括。这种观点不仅提供了对这些算法的统一分析,而且还允许一种简单的方法来探索不同的变体,这些变体可能会产生更好的性能。我们在标准语义分割数据集的经验结果中说明了这一点,在该数据集中,我们正规化的Frank-Wolfe优于均值均值推断的几个实例化,无论是独立的组件还是作为神经网络中的端到端可训练层。我们还表明,密集的CRF与我们的新算法相结合,对强CNN基准产生了重大改进。
translated by 谷歌翻译
We propose a simultaneous learning and pruning algorithm capable of identifying and eliminating irrelevant structures in a neural network during the early stages of training. Thus, the computational cost of subsequent training iterations, besides that of inference, is considerably reduced. Our method, based on variational inference principles using Gaussian scale mixture priors on neural network weights, learns the variational posterior distribution of Bernoulli random variables multiplying the units/filters similarly to adaptive dropout. Our algorithm, ensures that the Bernoulli parameters practically converge to either 0 or 1, establishing a deterministic final network. We analytically derive a novel hyper-prior distribution over the prior parameters that is crucial for their optimal selection and leads to consistent pruning levels and prediction accuracy regardless of weight initialization or the size of the starting network. We prove the convergence properties of our algorithm establishing theoretical and practical pruning conditions. We evaluate the proposed algorithm on the MNIST and CIFAR-10 data sets and the commonly used fully connected and convolutional LeNet and VGG16 architectures. The simulations show that our method achieves pruning levels on par with state-of the-art methods for structured pruning, while maintaining better test-accuracy and more importantly in a manner robust with respect to network initialization and initial size.
translated by 谷歌翻译
模型二进制化是一种压缩神经网络并加速其推理过程的有效方法。但是,1位模型和32位模型之间仍然存在显着的性能差距。实证研究表明,二进制会导致前进和向后传播中的信息损失。我们提出了一个新颖的分布敏感信息保留网络(DIR-NET),该网络通过改善内部传播和引入外部表示,将信息保留在前后传播中。 DIR-NET主要取决于三个技术贡献:(1)最大化二进制(IMB)的信息:最小化信息损失和通过重量平衡和标准化同时同时使用权重/激活的二进制误差; (2)分布敏感的两阶段估计器(DTE):通过共同考虑更新能力和准确的梯度来通过分配敏感的软近似来保留梯度的信息; (3)代表性二进制 - 意识蒸馏(RBD):通过提炼完整精确和二元化网络之间的表示来保留表示信息。 DIR-NET从统一信息的角度研究了BNN的前进过程和后退过程,从而提供了对网络二进制机制的新见解。我们的DIR-NET中的三种技术具有多功能性和有效性,可以在各种结构中应用以改善BNN。关于图像分类和客观检测任务的综合实验表明,我们的DIR-NET始终优于主流和紧凑型体系结构(例如Resnet,vgg,vgg,EfficityNet,darts和mobilenet)下最新的二进制方法。此外,我们在现实世界中的资源有限设备上执行DIR-NET,该设备可实现11.1倍的存储空间和5.4倍的速度。
translated by 谷歌翻译
近期在应用于培训深度神经网络和数据分析中的其他优化问题中的非凸优化的优化算法的兴趣增加,我们概述了最近对非凸优化优化算法的全球性能保证的理论结果。我们从古典参数开始,显示一般非凸面问题无法在合理的时间内有效地解决。然后,我们提供了一个问题列表,可以通过利用问题的结构来有效地找到全球最小化器,因为可能的问题。处理非凸性的另一种方法是放宽目标,从找到全局最小,以找到静止点或局部最小值。对于该设置,我们首先为确定性一阶方法的收敛速率提出了已知结果,然后是最佳随机和随机梯度方案的一般理论分析,以及随机第一阶方法的概述。之后,我们讨论了非常一般的非凸面问题,例如最小化$ \ alpha $ -weakly-are-convex功能和满足Polyak-lojasiewicz条件的功能,这仍然允许获得一阶的理论融合保证方法。然后,我们考虑更高阶和零序/衍生物的方法及其收敛速率,以获得非凸优化问题。
translated by 谷歌翻译
在资源受限的嵌入式系统上部署卷积神经网络的关键推动力是二进制神经网络(BNN)。 BNNS通过将功能和权重进行分配来保存内存并简化计算。不幸的是,二进制不可避免地伴随着准确性的严重降低。为了减少二进制和完整精确网络之间的准确性差距,最近提出了许多维修方法,我们已经将其分类并在本章中进行了单一概述。维修方法分为两个主要分支,培训技术和网络拓扑变化,可以进一步分为较小的类别。后一个类别为嵌入式系统引入了额外的成本(能源消耗或额外的面积),而前者则没有。从我们的概述中,我们可以观察到在减少准确性差距方面取得了进展,但是BNN论文并不对应使用哪种修复方法进行对齐,以获得高度准确的BNN。因此,本章包含一项经验综述,该综述评估了许多维修方法的好处,而不是Resnet-20 \&Cifar10和Resnet-18 \&Cifar100基准。我们发现三个维修类别最有益:功能二进制器,功能归一化和双重残留。基于这篇评论,我们讨论未来的方向和研究机会。我们勾勒出与BNN在嵌入式系统上相关的收益和成本,因为BNN是否能够缩小准确性差距,同时在资源受限的嵌入式系统上保持高能效率仍然有待观察。
translated by 谷歌翻译
Although weight and activation quantization is an effective approach for Deep Neural Network (DNN) compression and has a lot of potentials to increase inference speed leveraging bit-operations, there is still a noticeable gap in terms of prediction accuracy between the quantized model and the full-precision model. To address this gap, we propose to jointly train a quantized, bit-operation-compatible DNN and its associated quantizers, as opposed to using fixed, handcrafted quantization schemes such as uniform or logarithmic quantization. Our method for learning the quantizers applies to both network weights and activations with arbitrary-bit precision, and our quantizers are easy to train. The comprehensive experiments on CIFAR-10 and ImageNet datasets show that our method works consistently well for various network structures such as AlexNet, VGG-Net, GoogLeNet, ResNet, and DenseNet, surpassing previous quantization methods in terms of accuracy by an appreciable margin. Code available at https://github.com/Microsoft/LQ-Nets
translated by 谷歌翻译
我们引入了一种新型的数学公式,用于训练以(可能非平滑)近端图作为激活函数的馈送前向神经网络的培训。该公式基于布雷格曼的距离,关键优势是其相对于网络参数的部分导数不需要计算网络激活函数的导数。我们没有使用一阶优化方法和后传播的组合估算参数(如最先进的),而是建议使用非平滑一阶优化方法来利用特定结构新颖的表述。我们提出了几个数值结果,这些结果表明,与更常规的培训框架相比,这些训练方法可以很好地很好地适合于培训基于神经网络的分类器和具有稀疏编码的(DeNoising)自动编码器。
translated by 谷歌翻译
由于在量化网络上的按位操作产生的有效存储器消耗和更快的计算,神经网络量化已经变得越来越受欢迎。尽管它们表现出优异的泛化能力,但其鲁棒性属性并不是很好地理解。在这项工作中,我们系统地研究量化网络对基于梯度的对抗性攻击的鲁棒性,并证明这些量化模型遭受梯度消失问题并显示出虚假的鲁棒感。通过归因于培训的网络中的渐变消失到较差的前后信号传播,我们引入了一个简单的温度缩放方法来缓解此问题,同时保留决策边界。尽管对基于梯度的对抗攻击进行了简单的修改,但具有多个网络架构的多个图像分类数据集的实验表明,我们的温度缩放攻击在量化网络上获得了近乎完美的成功率,同时表现出对普遍培训的模型以及浮动的原始攻击以及浮动 - 点网络。代码可在https://github.com/kartikgupta-at-anu/Attack-bnn获得。
translated by 谷歌翻译
Quantization has become a predominant approach for model compression, enabling deployment of large models trained on GPUs onto smaller form-factor devices for inference. Quantization-aware training (QAT) optimizes model parameters with respect to the end task while simulating quantization error, leading to better performance than post-training quantization. Approximation of gradients through the non-differentiable quantization operator is typically achieved using the straight-through estimator (STE) or additive noise. However, STE-based methods suffer from instability due to biased gradients, whereas existing noise-based methods cannot reduce the resulting variance. In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator. We show this method combines gradient scale and quantization noise in a better optimized way, providing finer-grained estimation of gradients at each weight and activation layer's quantizer bin size. Our controlled noise also contains an implicit curvature term that could encourage flatter minima, which we show is indeed the case in our experiments. Experiments training ResNet architectures on the CIFAR-10, CIFAR-100 and ImageNet benchmarks show that our method obtains state-of-the-art top-1 classification accuracy for uniform (non mixed-precision) quantization, out-performing previous methods by 0.5-1.2% absolute.
translated by 谷歌翻译
混合精确的深神经网络达到了硬件部署所需的能源效率和吞吐量,尤其是在资源有限的情况下,而无需牺牲准确性。但是,不容易找到保留精度的最佳每层钻头精度,尤其是在创建巨大搜索空间的大量模型,数据集和量化技术中。为了解决这一困难,最近出现了一系列文献,并且已经提出了一些实现有希望的准确性结果的框架。在本文中,我们首先总结了文献中通常使用的量化技术。然后,我们对混合精液框架进行了彻底的调查,该调查是根据其优化技术进行分类的,例如增强学习和量化技术,例如确定性舍入。此外,讨论了每个框架的优势和缺点,我们在其中呈现并列。我们最终为未来的混合精液框架提供了指南。
translated by 谷歌翻译
人工神经网络(ANN)训练景观的非凸起带来了固有的优化困难。虽然传统的背传播随机梯度下降(SGD)算法及其变体在某些情况下是有效的,但它们可以陷入杂散的局部最小值,并且对初始化和普通公共表敏感。最近的工作表明,随着Relu激活的ANN的培训可以重新重整为凸面计划,使希望能够全局优化可解释的ANN。然而,天真地解决凸训练制剂具有指数复杂性,甚至近似启发式需要立方时间。在这项工作中,我们描述了这种近似的质量,并开发了两个有效的算法,这些算法通过全球收敛保证培训。第一算法基于乘法器(ADMM)的交替方向方法。它解决了精确的凸形配方和近似对应物。实现线性全局收敛,并且初始几次迭代通常会产生具有高预测精度的解决方案。求解近似配方时,每次迭代时间复杂度是二次的。基于“采样凸面”理论的第二种算法更简单地实现。它解决了不受约束的凸形制剂,并收敛到大约全球最佳的分类器。当考虑对抗性培训时,ANN训练景观的非凸起加剧了。我们将稳健的凸优化理论应用于凸训练,开发凸起的凸起制剂,培训Anns对抗对抗投入。我们的分析明确地关注一个隐藏层完全连接的ANN,但可以扩展到更复杂的体系结构。
translated by 谷歌翻译
一场堆放堡拥堵游戏(SCG)是一个双重计划,领导者的目标是通过预测和操纵均衡状态来最大程度地提高自己的收益,在该状态下,追随者通过玩拥堵游戏而定居。大规模的SCG以其顽固性和复杂性而闻名。这项研究通过可区分的编程来处理SCG,该编程将机器学习的最新发展与常规方法结合在一起。核心思想以模仿logit动力学形成的进化路径代表低级平衡问题。它可以在朝着平衡的演化路径上使用自动分化,从而导致双环梯度下降算法。我们进一步表明,对低级平衡的固定可能是一个自我强加的计算障碍。取而代之的是,领导者只能沿着追随者的演变路径向前看几个步骤,同时通过共同进化过程更新其决策。启示产生了一种单循环算法,该算法在记忆消耗和计算时间方面都更有效。通过涵盖广泛基准问题的数值实验,我们发现单循环算法始终达到解决方案质量和效率之间的良好平衡,不仅优于标准的双环实现,而且优于文献中的其他方法。重要的是,我们的结果既突出了“充分期待”的浪费和“零预期”的危险。如果需要快速启发术来解决一个非常大的SCG,则提议的单环算法具有一步的外观,使其成为理想的候选人。
translated by 谷歌翻译
量子哈密顿学习和量子吉布斯采样的双重任务与物理和化学中的许多重要问题有关。在低温方案中,这些任务的算法通常会遭受施状能力,例如因样本或时间复杂性差而遭受。为了解决此类韧性,我们将量子自然梯度下降的概括引入了参数化的混合状态,并提供了稳健的一阶近似算法,即量子 - 固定镜下降。我们使用信息几何学和量子计量学的工具证明了双重任务的数据样本效率,因此首次将经典Fisher效率的开创性结果推广到变异量子算法。我们的方法扩展了以前样品有效的技术,以允许模型选择的灵活性,包括基于量子汉密尔顿的量子模型,包括基于量子的模型,这些模型可能会规避棘手的时间复杂性。我们的一阶算法是使用经典镜下降二元性的新型量子概括得出的。两种结果都需要特殊的度量选择,即Bogoliubov-Kubo-Mori度量。为了从数值上测试我们提出的算法,我们将它们的性能与现有基准进行了关于横向场ISING模型的量子Gibbs采样任务的现有基准。最后,我们提出了一种初始化策略,利用几何局部性来建模状态的序列(例如量子 - 故事过程)的序列。我们从经验上证明了它在实际和想象的时间演化的经验上,同时定义了更广泛的潜在应用。
translated by 谷歌翻译
We introduce a method to train Quantized Neural Networks (QNNs) -neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At traintime the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well which enables gradients computation using only bit-wise operation. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved comparable accuracy as their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.
translated by 谷歌翻译
Although considerable progress has been obtained in neural network quantization for efficient inference, existing methods are not scalable to heterogeneous devices as one dedicated model needs to be trained, transmitted, and stored for one specific hardware setting, incurring considerable costs in model training and maintenance. In this paper, we study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one. With this representation, we can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model. To this end, we propose a simple once quantization-aware training (QAT) scheme for obtaining high-performance vertical-layered models. Our design incorporates a cascade downsampling mechanism which allows us to obtain multiple quantized networks from one full precision source model by progressively mapping the higher precision weights to their adjacent lower precision counterparts. Then, with networks of different bit-widths from one source model, multi-objective optimization is employed to train the shared source model weights such that they can be updated simultaneously, considering the performance of all networks. By doing this, the shared weights will be optimized to balance the performance of different quantized models, thus making the weights transferable among different bit widths. Experiments show that the proposed vertical-layered representation and developed once QAT scheme are effective in embodying multiple quantized networks into a single one and allow one-time training, and it delivers comparable performance as that of quantized models tailored to any specific bit-width. Code will be available.
translated by 谷歌翻译