The Transformer architecture has become a fundamental building block of a wide range of natural language processing (NLP) models. With the trend toward ever larger NLP models, the growing memory and computation costs hinder their efficient deployment on resource-constrained devices, so transformer quantization has attracted wide research interest. Recent work recognizes structured outliers as the key bottleneck for quantization performance, but the proposed remedies add computation overhead and still leave the outliers in place. To address the problem at its root, this paper investigates the inherent cause and the importance of the outliers. We find that $\boldsymbol{\gamma}$ in LayerNorm (LN) acts as a sinful amplifier of the outliers, and that the importance of outliers varies greatly: some outliers provided by a few tokens cover a large area but can be clipped sharply without negative impact. Motivated by these findings, we propose an outlier suppression framework with two components: Gamma Migration and Token-Wise Clipping. Gamma Migration moves the outlier amplifier into subsequent modules through an equivalent transformation, yielding a more quantization-friendly model without any extra burden. Token-Wise Clipping exploits the large variance of token ranges and designs a token-wise coarse-to-fine pipeline that obtains a clipping range with minimal final quantization loss in an efficient way. The framework effectively suppresses the outliers and can be used in a plug-and-play manner. Extensive experiments show that our framework surpasses existing works and, for the first time, pushes 6-bit post-training BERT quantization to the full-precision (FP) level. Our code is available at https://github.com/wimh966/outlier_suppression.
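To make the equivalent transformation behind Gamma Migration concrete, the numpy sketch below folds the LayerNorm $\boldsymbol{\gamma}$ into the weight of a following linear layer and checks numerical equivalence. It is a minimal illustration under the assumption that the next module is a plain linear layer; the function names are ours, not the released implementation.

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    # Standard LayerNorm over the last dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def migrate_gamma(gamma, beta, W, b):
    # Equivalent transformation: strip gamma from the LayerNorm and fold it
    # into the weight of the following linear layer (y = h @ W.T + b), so the
    # outlier amplifier no longer appears in the activation being quantized.
    beta_new = beta / gamma           # the LN keeps only a shifted normalization
    W_new = W * gamma[None, :]        # scale the columns that consume each channel
    return np.ones_like(gamma), beta_new, W_new, b

# Tiny numerical check of the equivalence.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))
gamma = rng.uniform(0.5, 2.0, size=8)
beta, W, b = rng.normal(size=8), rng.normal(size=(4, 8)), rng.normal(size=4)

y_ref = layernorm(x, gamma, beta) @ W.T + b
g2, b2, W2, bias2 = migrate_gamma(gamma, beta, W, b)
y_mig = layernorm(x, g2, b2) @ W2.T + bias2
assert np.allclose(y_ref, y_mig)
```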
Model quantization has emerged as an indispensable technique for accelerating deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable, because researchers do not choose consistent training pipelines and ignore the requirements of hardware deployment. In this work, we propose the Model Quantization Benchmark (MQBench), a first attempt to evaluate, analyze, and benchmark the reproducibility and deployability of model quantization algorithms. We select multiple different platforms for real-world deployment, including CPU, GPU, ASIC, and DSP, and evaluate an extensive set of state-of-the-art quantization algorithms under a unified training pipeline. MQBench acts as a bridge connecting algorithms and hardware. We conduct a comprehensive analysis and find considerable intuitive and counter-intuitive insights. By aligning the training settings, we find that existing algorithms perform roughly the same on the conventional academic track, whereas for hardware-deployable quantization there remains a large and unstable accuracy gap. Surprisingly, no existing algorithm wins every challenge in MQBench, and we hope this work can inspire future research directions.
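For readers unfamiliar with what such benchmarks configure differently per backend, the sketch below shows the min-max uniform affine quantizer (quantize-dequantize) that most of the compared algorithms build on, with per-tensor and per-channel granularity. It is a generic illustration, not the MQBench API.

```python
import numpy as np

def fake_quantize(x, num_bits=8, per_channel_axis=None):
    # Uniform affine quantize-dequantize ("fake quantization"): the common
    # primitive that algorithms and hardware backends realize with different
    # scale/zero-point choices, bit-widths, and granularities.
    qmin, qmax = 0, 2 ** num_bits - 1
    axes = (tuple(i for i in range(x.ndim) if i != per_channel_axis)
            if per_channel_axis is not None else None)
    x_min = x.min(axis=axes, keepdims=True)
    x_max = x.max(axis=axes, keepdims=True)
    scale = np.maximum((x_max - x_min) / (qmax - qmin), 1e-8)
    zero_point = np.round(qmin - x_min / scale)
    q = np.clip(np.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

w = np.random.randn(16, 32).astype(np.float32)
w_int8 = fake_quantize(w, num_bits=8, per_channel_axis=0)   # per-channel, 8-bit
w_int4 = fake_quantize(w, num_bits=4)                        # per-tensor, 4-bit
```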
Model binarization is an effective method for compressing neural networks and accelerating their inference. However, a significant performance gap remains between 1-bit models and 32-bit models. Empirical studies show that binarization causes information loss in both the forward and backward propagation. We propose a novel Distribution-sensitive Information Retention Network (DIR-Net) that retains information in the forward and backward propagation by improving internal propagation and introducing external representations. DIR-Net rests on three technical contributions: (1) Information Maximized Binarization (IMB), which minimizes the information loss and the binarization error of weights/activations simultaneously through weight balance and standardization; (2) Distribution-sensitive Two-stage Estimator (DTE), which retains gradient information through a distribution-sensitive soft approximation that jointly considers updating capability and gradient accuracy; and (3) Representation Binarization-aware Distillation (RBD), which retains representation information by distilling the representations between the full-precision and binarized networks. DIR-Net studies both the forward and backward processes of BNNs from a unified information perspective, providing new insight into the mechanism of network binarization. The three techniques in our DIR-Net are versatile and effective and can be applied to various structures to improve BNNs. Comprehensive experiments on image classification and object detection tasks show that our DIR-Net consistently outperforms state-of-the-art binarization methods under both mainstream and compact architectures such as ResNet, VGG, EfficientNet, DARTS, and MobileNet. Moreover, we deploy DIR-Net on real-world resource-limited devices, achieving 11.1x storage savings and a 5.4x speedup.
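As a rough illustration of the first two components, the sketch below standardizes and balances the weights before taking the sign (an IMB-flavored forward pass) and uses a progressively sharpened soft sign whose derivative serves as the backward surrogate (a DTE-flavored estimator). The scaling factor and the sharpness schedule are our own illustrative choices, not the paper's exact formulations.

```python
import numpy as np

def imb_binarize(w, eps=1e-8):
    # IMB-flavored forward pass: balance (zero-mean) and standardize the
    # weights before taking the sign so the binary tensor retains more
    # information; a scalar factor keeps the magnitude, as is common in BNNs.
    w_std = (w - w.mean()) / (w.std() + eps)
    alpha = np.abs(w_std).mean()
    return alpha * np.sign(w_std)

def dte_soft_sign(x, t):
    # DTE-flavored surrogate: a soft, differentiable approximation of sign(x)
    # whose sharpness is scheduled by a stage parameter t in [0, 1]; its
    # gradient replaces the (almost everywhere zero) gradient of sign during
    # backpropagation. The linear schedule here is illustrative only.
    k = 1.0 + 9.0 * t
    return np.tanh(k * x), k * (1.0 - np.tanh(k * x) ** 2)   # value and gradient

w = np.random.randn(64, 64)
w_bin = imb_binarize(w)                      # forward: binary weights
_, grad = dte_soft_sign(w, t=0.5)            # backward: surrogate gradient w.r.t. w
```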
Quantization has become one of the most prevalent methods for compressing and accelerating neural networks. Recently, data-free quantization has been widely studied as a practical and promising solution: it synthesizes data for calibrating the quantized model from the FP32 batch normalization (BN) statistics, greatly relieving the heavy dependence on real training data in traditional quantization methods. Unfortunately, we find that in practice the data synthesized from BN statistics suffers severe homogenization at both the distribution level and the sample level, which further causes a significant performance drop of the quantized model. We propose the Diverse Sample Generation (DSG) scheme to mitigate the adverse effects of this homogenization. Specifically, we slacken the alignment of feature statistics in the BN layer to relax the constraint at the distribution level, and design a layerwise enhancement that reinforces specific layers for different data samples. Our DSG scheme is versatile and can even be applied to modern post-training quantization methods such as AdaRound. We evaluate the DSG scheme on large-scale image classification tasks and consistently obtain significant improvements across various network architectures and quantization methods, especially when quantizing to lower bit-widths (e.g., up to 22% on W4A4). Moreover, benefiting from the enhanced diversity, models calibrated on synthetic data perform close to those calibrated on real data and even outperform them on W4A4.
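The distribution-level relaxation can be pictured as replacing an exact BN-statistics matching loss with a slackened one during data synthesis. The torch sketch below shows one such hinge-style slack for a single conv+BN layer; the margin value, the hinge form, and the toy optimization loop are assumptions for illustration, not the paper's objective.

```python
import torch
import torch.nn.functional as F

def slack_bn_loss(feat, running_mean, running_var, delta=0.1):
    # Relaxed BN-statistics matching for data synthesis: instead of forcing the
    # batch statistics of the synthetic features to match the stored BN
    # statistics exactly, tolerate a margin delta so the generated samples are
    # not all pulled toward one identical "BN-perfect" solution.
    mean = feat.mean(dim=(0, 2, 3))
    var = feat.var(dim=(0, 2, 3), unbiased=False)
    return (F.relu((mean - running_mean).abs() - delta).square().mean()
            + F.relu((var - running_var).abs() - delta).square().mean())

# Toy usage: optimize a synthetic batch against one conv+BN layer's statistics.
conv = torch.nn.Conv2d(3, 8, 3, padding=1)
bn = torch.nn.BatchNorm2d(8).eval()          # pretrained running stats would live here
x = torch.randn(16, 3, 32, 32, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.05)
for _ in range(10):
    opt.zero_grad()
    loss = slack_bn_loss(conv(x), bn.running_mean, bn.running_var)
    loss.backward()
    opt.step()
```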
The interplay between quantum physics and machine learning gives rise to the emergent frontier of quantum machine learning, where advanced quantum learning models may outperform their classical counterparts in solving certain challenging problems. However, quantum learning systems are vulnerable to adversarial attacks: adding tiny carefully-crafted perturbations on legitimate input samples can cause misclassifications. To address this issue, we propose a general scheme to protect quantum learning systems from adversarial attacks by randomly encoding the legitimate data samples through unitary or quantum error correction encoders. In particular, we rigorously prove that both global and local random unitary encoders lead to exponentially vanishing gradients (i.e. barren plateaus) for any variational quantum circuits that aim to add adversarial perturbations, independent of the input data and the inner structures of adversarial circuits and quantum classifiers. In addition, we prove a rigorous bound on the vulnerability of quantum classifiers under local unitary adversarial attacks. We show that random black-box quantum error correction encoders can protect quantum classifiers against local adversarial noises and their robustness increases as we concatenate error correction codes. To quantify the robustness enhancement, we adapt quantum differential privacy as a measure of the prediction stability for quantum classifiers. Our results establish versatile defense strategies for quantum classifiers against adversarial perturbations, which provide valuable guidance to enhance the reliability and security for both near-term and future quantum learning technologies.
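To give a classical-simulation flavor of the random unitary encoding (not the quantum error correction encoders, which are beyond a short snippet), the sketch below amplitude-encodes a data vector and applies a Haar-random unitary drawn per sample; the classifier would then act on the encoded state.

```python
import numpy as np

def haar_random_unitary(dim, rng):
    # Sample a Haar-random unitary via QR decomposition of a complex Gaussian
    # matrix, fixing the phases of the R diagonal (standard construction).
    z = (rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    return q * (np.diag(r) / np.abs(np.diag(r)))

# Encode a legitimate sample (amplitude-encoded as a normalized state vector)
# with a randomly drawn unitary before it reaches the classifier; the adversary
# does not know which unitary was used. This only illustrates the encoding step.
rng = np.random.default_rng(0)
n_qubits = 4
x = rng.normal(size=2 ** n_qubits)
state = x / np.linalg.norm(x)
U = haar_random_unitary(2 ** n_qubits, rng)
encoded = U @ state
assert np.isclose(np.linalg.norm(encoded), 1.0)   # the encoding preserves the norm
```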
Long short-term memory (LSTM) is a type of powerful deep neural network that has been widely used in many sequence analysis and modeling applications. However, the large model size of LSTM networks makes their practical deployment very challenging, especially for video recognition tasks that require high-dimensional input data. Aiming to overcome this limitation and fully unlock the potential of LSTM models, in this paper we propose to perform algorithm and hardware co-design towards high-performance, energy-efficient LSTM networks. At the algorithm level, we develop a fully decomposed hierarchical Tucker (FDHT) structure-based LSTM, namely FDHT-LSTM, which enjoys ultra-low model complexity while still achieving high accuracy. In order to fully reap this attractive algorithmic benefit, we further develop a corresponding customized hardware architecture to support the efficient execution of the proposed FDHT-LSTM model. With a carefully designed memory access scheme, the complicated matrix transformation can be efficiently supported by the underlying hardware on the fly without any access conflict. Our evaluation results show that both the proposed ultra-compact FDHT-LSTM models and the corresponding hardware accelerator achieve very high performance. Compared with state-of-the-art compressed LSTM models, FDHT-LSTM enjoys both an order-of-magnitude reduction in model size and significant accuracy improvements across different video recognition datasets. Meanwhile, compared with TIE, the state-of-the-art hardware for tensor-decomposed models, our proposed FDHT-LSTM architecture achieves better throughput, area efficiency, and energy efficiency on the LSTM-Youtube workload. On the LSTM-UCF workload, our design also outperforms TIE with higher throughput, higher energy efficiency, and comparable area efficiency.
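As a back-of-the-envelope illustration of why factorized weight formats matter for high-dimensional video inputs, the snippet below compares the parameter count of dense LSTM gate matrices with a plain low-rank factorization; the paper's fully decomposed hierarchical Tucker format is a tree-structured factorization with even lower complexity and is not reproduced here. The dimensions are illustrative.

```python
# Rough parameter-count comparison: dense LSTM input-to-hidden gate matrices
# versus a simple rank-r factorization of each gate matrix (a stand-in for the
# paper's FDHT format). Dimensions are illustrative, e.g. a flattened frame.
input_dim, hidden_dim, rank = 57600, 256, 32

dense_params = 4 * hidden_dim * input_dim                    # four gate matrices
factored_params = 4 * (input_dim * rank + rank * hidden_dim)

print(f"dense:    {dense_params:,} parameters")
print(f"low-rank: {factored_params:,} parameters "
      f"({dense_params / factored_params:.1f}x smaller)")
```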
Diffusion models have emerged as a powerful tool for point cloud generation. A key component that drives their impressive performance in generating high-quality samples from noise is the iterative denoising over thousands of steps. While beneficial, the large number of denoising steps has limited their use in many real-world 3D applications. To address this limitation, we propose Point Straight Flow (PSF), a model that exhibits impressive performance using one step. Our idea is based on reformulating the standard diffusion model to optimize the curvy learning trajectory into a straight path. Further, we develop a distillation strategy to shorten the straight path to a single step without performance loss, enabling real-world 3D applications with latency constraints. We evaluate on multiple 3D tasks and find that our PSF performs comparably to the standard diffusion model while outperforming other efficient 3D point cloud generation methods. On real-world applications such as point cloud completion and training-free text-guided generation in a low-latency setup, PSF performs favorably.
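A minimal sketch of the straight-path idea, under the assumption that it behaves like rectified-flow training: pair noise with data along a straight line, regress the constant velocity, and sample with a single Euler step. The toy per-point MLP and shapes are ours; this is not the PSF training code.

```python
import torch

# Toy per-point velocity field conditioned on time t (3D point + time -> 3D velocity).
net = torch.nn.Sequential(torch.nn.Linear(3 + 1, 128), torch.nn.ReLU(),
                          torch.nn.Linear(128, 3))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def training_step(x1):                       # x1: (N, 3) points from a shape
    z0 = torch.randn_like(x1)                # noise endpoint
    t = torch.rand(x1.shape[0], 1)           # random time in [0, 1]
    xt = (1 - t) * z0 + t * x1               # point on the straight path
    v_target = x1 - z0                       # constant velocity along the path
    v_pred = net(torch.cat([xt, t], dim=1))
    loss = (v_pred - v_target).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

x1 = torch.randn(2048, 3)                    # stand-in for points sampled from a shape
for _ in range(5):
    training_step(x1)

# One-step sampling once trained: a single Euler step from noise.
with torch.no_grad():
    z0 = torch.randn(2048, 3)
    sample = z0 + net(torch.cat([z0, torch.zeros(2048, 1)], dim=1))
```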
The decomposition-based multi-objective evolutionary algorithm (MOEA/D) transforms a multi-objective optimization problem (MOP) into a set of single-objective subproblems for collaborative optimization. Mismatches between subproblems and solutions can lead to severe performance degradation of MOEA/D. Most existing mismatch coping strategies only work when the $L_{\infty}$ scalarization is used. A mismatch coping strategy that can use any $L_{p}$ scalarization, even when facing MOPs with non-convex Pareto fronts, is of great significance for MOEA/D. This paper uses the global replacement (GR) as the backbone. We analyze why GR can no longer avoid mismatches when $L_{\infty}$ is replaced by another $L_{p}$ with $p\in [1,\infty)$, and find that the $L_p$-based ($1\leq p<\infty$) subproblems have inconsistently large preference regions. When $p$ is set to a small value, some middle subproblems have preference regions so small that their direction vectors cannot pass through them. Therefore, we propose a generalized $L_p$ (G$L_p$) scalarization to ensure that each subproblem's direction vector passes through its preference region. Our theoretical analysis shows that GR can always avoid mismatches when using the G$L_p$ scalarization for any $p\geq 1$. Experimental studies on various MOPs are consistent with the theoretical analysis.
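For reference, the weighted $L_p$ scalarization family that the analysis above is concerned with can be written in a few lines; $p=\infty$ recovers the Chebyshev form. The generalized G$L_p$ correction proposed in the paper is not reproduced here.

```python
import numpy as np

def lp_scalarize(f, weight, z_star, p):
    # Weighted L_p scalarization of a MOEA/D subproblem: aggregate the weighted
    # distances of the objective vector f to the ideal point z_star. p = inf
    # gives the Chebyshev (L_inf) form; finite p gives smoother aggregations
    # whose preference regions differ in size.
    d = weight * np.abs(np.asarray(f) - np.asarray(z_star))
    if np.isinf(p):
        return d.max()
    return (d ** p).sum() ** (1.0 / p)

f = [0.6, 0.3]                 # objective values of a candidate solution
w = np.array([0.5, 0.5])       # direction/weight vector of the subproblem
z = np.array([0.0, 0.0])       # ideal point
for p in (1, 2, np.inf):
    print(p, lp_scalarize(f, w, z, p))
```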
Mode collapse is still a major unsolved problem in generative adversarial networks. In this work, we analyze the causes of mode collapse from a new perspective. Due to nonuniform sampling in the training process, some sub-distributions may be missed when sampling data, so the GAN objective can reach its minimum even when the generated distribution is not the same as the real one. To alleviate this problem, we propose a global distribution fitting (GDF) method that constrains the generated data distribution with a penalty term. Without changing the global minimum of the GAN objective, GDF makes that minimum harder to reach when the generated distribution differs from the real one. Furthermore, we propose a local distribution fitting (LDF) method to cope with the situation where the real distribution is unknown. Experiments on several benchmarks demonstrate the effectiveness and competitive performance of GDF and LDF.
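The sketch below shows where such a penalty term would enter the generator objective, using a simple moment-matching gap between generated and real batches as a stand-in; the actual GDF/LDF penalties are defined in the paper and differ from this illustrative choice.

```python
import torch
import torch.nn.functional as F

def distribution_penalty(fake, real):
    # Stand-in "global distribution fitting" term: penalize the gap between
    # first- and second-order statistics of the generated and real batches.
    # It vanishes when the batches share those statistics, so the adversarial
    # optimum (generated distribution equal to the real one) is not moved.
    mean_gap = (fake.mean(0) - real.mean(0)).pow(2).sum()
    cov_gap = (torch.cov(fake.T) - torch.cov(real.T)).pow(2).sum()
    return mean_gap + cov_gap

def generator_loss(d_fake_logits, fake, real, lam=1.0):
    # Non-saturating GAN generator loss plus the distribution penalty.
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))
    return adv + lam * distribution_penalty(fake, real)

real = torch.randn(128, 2)
fake = 0.1 * torch.randn(128, 2)          # a (nearly) collapsed generated batch
logits = torch.zeros(128, 1)              # stand-in discriminator outputs
print(generator_loss(logits, fake, real))
```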
The space-air-ground integrated network (SAGIN), one of the key technologies for next-generation mobile communication systems, can facilitate data transmission for users all over the world, especially in remote areas where vast amounts of informative data are collected by Internet of remote things (IoRT) devices to support various data-driven artificial intelligence (AI) services. However, training AI models centrally with the assistance of the SAGIN faces the challenges of a highly constrained network topology, inefficient data transmission, and privacy issues. To tackle these challenges, we first propose a novel topology-aware federated learning framework for the SAGIN, namely Olive Branch Learning (OBL). Specifically, the IoRT devices in the ground layer leverage their private data to perform model training locally, while the air nodes in the air layer and the ring-structured low earth orbit (LEO) satellite constellation in the space layer are in charge of model aggregation (synchronization) at different scales. To further enhance the communication efficiency and inference performance of OBL, an efficient Communication and Non-IID-aware Air node-Satellite Assignment (CNASA) algorithm is designed, taking into account the data class distribution of the air nodes as well as their geographic locations. Furthermore, we extend our OBL framework and CNASA algorithm to adapt to more complex multi-orbit satellite networks. We analyze the convergence of our OBL framework and conclude that the CNASA algorithm contributes to the fast convergence of the global model. Extensive experiments based on realistic datasets corroborate the superior performance of our algorithm over the benchmark policies.
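A minimal sketch of the layered aggregation pattern described above, with FedAvg-style weighted averaging at the air and space layers; the device-to-air-node assignment criterion, ring synchronization among satellites, and the CNASA algorithm itself are omitted, and all names and sizes are illustrative.

```python
import numpy as np

def fedavg(models, weights):
    # Weighted average of model parameter vectors (FedAvg-style aggregation).
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * m for w, m in zip(weights, models))

# IoRT devices train locally; each air node averages its assigned devices;
# the satellite layer averages the air-node models into a global model.
dim = 10
device_models = {f"dev{i}": np.random.randn(dim) for i in range(6)}
device_sizes = {f"dev{i}": 100 + 10 * i for i in range(6)}
assignment = {"air0": ["dev0", "dev1", "dev2"], "air1": ["dev3", "dev4", "dev5"]}

air_models, air_sizes = {}, {}
for air, devs in assignment.items():
    air_models[air] = fedavg([device_models[d] for d in devs],
                             [device_sizes[d] for d in devs])
    air_sizes[air] = sum(device_sizes[d] for d in devs)

global_model = fedavg(list(air_models.values()), list(air_sizes.values()))
```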