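Deep neural networks such as Deep-FSMN have been widely studied for keyword spotting (KWS) applications. However, the computational resources available to these networks are severely constrained, since they usually run on-call on edge devices. In this paper, we present BiFSMN, an accurate and extremely efficient binary neural network for KWS. We first construct a high-frequency enhancement distillation scheme for binarization-aware training, which emphasizes the high-frequency information in the full-precision network's representations that is more crucial for optimizing the binarized network. Then, to allow instant and adaptive accuracy-efficiency trade-offs at runtime, we also propose a thinnable binarization architecture that further liberates the acceleration potential of the binarized network from the topology perspective. Moreover, we implement a fast bitwise computation kernel for BiFSMN on ARMv8 devices, which fully utilizes registers and increases instruction throughput to push the limit of deployment efficiency. Extensive experiments show that BiFSMN outperforms existing binarization methods by convincing margins on various datasets and is even comparable with its full-precision counterpart (e.g., less than a 3% drop on Speech Commands V1-12). We highlight that, benefiting from the thinnable architecture and the optimized 1-bit implementation, BiFSMN achieves an impressive 22.3x speedup and 15.5x storage saving on real-world edge hardware.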
translated by 谷歌翻译
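The BiFSMN abstract above mentions a bitwise computation kernel for 1-bit networks. The following is a minimal sketch of the general XNOR/popcount idea behind such kernels, not the authors' ARMv8 implementation: the helper names `pack_signs` and `binary_dot` are hypothetical, and the sketch assumes weights and activations are already binarized to +1/-1.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an integer bitmask (bit i set means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    XOR marks the positions where the signs differ. Each mismatch
    contributes -1 and each match contributes +1, so the result is
    n - 2 * (number of mismatches).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


a = [1, -1, 1, 1, -1, -1, 1, -1]
b = [1, 1, -1, 1, -1, 1, 1, -1]
packed = binary_dot(pack_signs(a), pack_signs(b), len(a))
reference = sum(x * y for x, y in zip(a, b))
```

Here `packed` and `reference` hold the same value; the packed form replaces n multiply-accumulate operations with one XOR and one population count per machine word, which is where the speedup of 1-bit inference generally comes from.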
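Model binarization is an effective method for compressing neural networks and accelerating their inference. However, a significant performance gap still exists between 1-bit models and 32-bit models. Empirical studies show that binarization causes information loss in both forward and backward propagation. We present a novel Distribution-sensitive Information Retention Network (DIR-Net) that retains information in forward and backward propagation by improving internal propagation and introducing external representations. DIR-Net relies mainly on three technical contributions: (1) Information Maximized Binarization (IMB): minimizing the information loss and the binarization error of weights/activations simultaneously through weight balance and standardization; (2) Distribution-sensitive Two-stage Estimator (DTE): retaining gradient information through a distribution-sensitive soft approximation that jointly considers the updating capability and gradient accuracy; (3) Representation Binarization-aware Distillation (RBD): retaining representation information by distilling the representations between full-precision and binarized networks. DIR-Net investigates both the forward and backward processes of BNNs from a unified information perspective, thereby providing new insight into the mechanism of network binarization. The three techniques in DIR-Net are versatile and effective, and can be applied to various structures to improve BNNs. Comprehensive experiments on image classification and object detection tasks show that DIR-Net consistently outperforms state-of-the-art binarization approaches under mainstream and compact architectures (e.g., ResNet, VGG, EfficientNet, DARTS, and MobileNet). Additionally, we run DIR-Net on real-world resource-limited devices, where it achieves 11.1x storage saving and 5.4x speedup.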
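The weight balance and standardization step named in contribution (1) can be illustrated with a small sketch. This is an assumption-laden illustration rather than the DIR-Net code: the function names are hypothetical, "balance" is taken to mean subtracting the mean, "standardization" is taken to mean dividing by the standard deviation, and scaling factors or activation handling are omitted.

```python
import statistics


def balance_and_standardize(weights):
    """Shift weights to zero mean and rescale them to unit standard deviation."""
    mean = statistics.fmean(weights)
    std = statistics.pstdev(weights)
    if std == 0.0:
        # Degenerate case: all weights are equal, so only centering is possible.
        return [0.0 for _ in weights]
    return [(w - mean) / std for w in weights]


def binarize(weights):
    """Binarize with the sign function after balancing and standardizing."""
    return [1 if w >= 0 else -1 for w in balance_and_standardize(weights)]


raw = [0.1, 0.2, 0.3, 0.9]
naive = [1 if w >= 0 else -1 for w in raw]  # all +1: the sign carries no information
balanced = binarize(raw)                    # signs now split around the mean
```

In this toy case, taking the sign of the raw weights yields only +1 values, while centering first produces a mix of -1 and +1, so the binary weights retain more information about the original distribution.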
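Recently, generative data-free quantization has emerged as a practical approach that compresses neural networks to low bit-widths without accessing real data. It quantizes a network by generating data from the batch normalization (BN) statistics of its full-precision counterpart. However, our study shows that in practice, synthetic data constrained by BN statistics is severely homogenized at both the distribution level and the sample level, which causes serious degradation of the quantized network. This paper presents a generic Diverse Sample Generation (DSG) scheme for generative data-free post-training quantization and quantization-aware training, to mitigate this detrimental homogenization. In DSG, we first slack the statistics alignment for features in the BN layers to relax the distribution constraint. Then, we strengthen the loss impact of specific BN layers for different samples and inhibit the correlation among samples in the generation process, diversifying samples from the statistical and spatial perspectives, respectively. Extensive experiments show that for large-scale image classification tasks, DSG consistently outperforms existing data-free quantization methods on various neural architectures, especially under ultra-low bit-widths (e.g., a 22% gain under the W4A4 setting). Moreover, the data diversification brought by DSG yields a general gain for various quantization methods, demonstrating that diversity is an important property of high-quality synthetic data for data-free quantization.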
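Quantization has emerged as one of the most prevalent approaches to compressing and accelerating neural networks. Recently, data-free quantization has been widely studied as a practical and promising solution. It synthesizes data for calibrating the quantized model according to the batch normalization (BN) statistics of the FP32 model, and significantly relieves the heavy dependency on real training data in traditional quantization methods. Unfortunately, we find that in practice, synthetic data constrained by BN statistics suffers from serious homogenization at both the distribution level and the sample level, which in turn causes a significant performance drop in the quantized model. We propose the Diverse Sample Generation (DSG) scheme to mitigate the adverse effects of homogenization. Specifically, we slack the alignment of feature statistics in the BN layers to relax the constraint at the distribution level, and design a layerwise enhancement to reinforce specific layers for different data samples. Our DSG scheme is versatile and can even be applied to state-of-the-art post-training quantization methods such as AdaRound. We evaluate the DSG scheme on large-scale image classification tasks and consistently obtain significant improvements across various network architectures and quantization methods, especially when quantizing to lower bit-widths (e.g., up to 22% on W4A4). Moreover, benefiting from the enhanced diversity, models calibrated with synthetic data perform close to those calibrated with real data, and even outperform them on W4A4.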
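To make the idea of slackening BN statistics alignment concrete, here is a minimal sketch. The hinge-style form below is an assumption chosen for illustration and is not claimed to be the loss used by DSG; the function name and the `slack_mean` / `slack_var` parameters are hypothetical.

```python
import statistics


def slack_alignment_loss(features, bn_mean, bn_var, slack_mean, slack_var):
    """Penalize the gap between feature statistics and stored BN statistics,
    but only for the part of the gap that exceeds a slack margin.

    With zero slack this reduces to a plain squared-error alignment of
    mean and variance; a positive slack leaves a tolerance band in which
    synthetic samples are free to differ from one another.
    """
    mean = statistics.fmean(features)
    var = statistics.pvariance(features)
    mean_gap = max(abs(mean - bn_mean) - slack_mean, 0.0)
    var_gap = max(abs(var - bn_var) - slack_var, 0.0)
    return mean_gap ** 2 + var_gap ** 2


features = [0.0, 1.0, 2.0]  # toy activations of one channel
strict = slack_alignment_loss(features, 1.05, 0.7, 0.0, 0.0)
relaxed = slack_alignment_loss(features, 1.05, 0.7, 0.1, 0.1)
```

In this toy example `strict` is positive while `relaxed` is zero, because the small mismatch in mean and variance falls inside the slack band and is no longer pushed toward a single shared target.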
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
Unmanned aerial vehicle (UAV) swarms are considered a promising technique for next-generation communication networks due to their flexibility, mobility, low cost, and ability to provide services collaboratively and autonomously. Distributed learning (DL) enables UAV swarms to intelligently provide communication services, multi-directional remote surveillance, and target tracking. In this survey, we first introduce several popular DL algorithms, such as federated learning (FL), multi-agent reinforcement learning (MARL), distributed inference, and split learning, and present a comprehensive overview of their applications for UAV swarms, such as trajectory design, power control, wireless resource allocation, user assignment, perception, and satellite communications. Then, we present several state-of-the-art applications of UAV swarms in wireless communication systems, such as reconfigurable intelligent surfaces (RIS), virtual reality (VR), and semantic communications, and discuss the problems and challenges that DL-enabled UAV swarms can solve in these applications. Finally, we describe open problems of using DL in UAV swarms and future research directions for DL-enabled UAV swarms. In summary, this survey provides a comprehensive review of various DL applications for UAV swarms in a wide range of scenarios.
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
Non-line-of-sight (NLOS) imaging aims to reconstruct three-dimensional hidden scenes from data measured in the line of sight, using photon time-of-flight information encoded in light after multiple diffuse reflections. Under-sampled scanning data can facilitate fast imaging. However, the resulting reconstruction problem becomes a severely ill-posed inverse problem, whose solution is highly likely to be degraded by noise and distortion. In this paper, we propose two novel NLOS reconstruction models based on curvature regularization: the object-domain curvature regularization model and the dual (i.e., signal and object)-domain curvature regularization model. Fast numerical optimization algorithms are developed based on the alternating direction method of multipliers (ADMM) with a backtracking stepsize rule, and are further accelerated by GPU implementation. We evaluate the proposed algorithms on both synthetic and real datasets, where they achieve state-of-the-art performance, especially in the compressed sensing setting. All our code and data are available at https://github.com/Duanlab123/CurvNLOS.
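For readers unfamiliar with ADMM, the following toy sketch shows its three-step iteration on a scalar problem. It deliberately swaps in a much simpler objective than the paper's: an L1 penalty replaces curvature regularization, the penalty parameter `rho` is fixed rather than chosen by backtracking, and the function names are hypothetical. It solves min over x of 0.5 * (x - a)^2 + lam * |z| subject to x = z.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_scalar_l1(a, lam, rho=1.0, iterations=100):
    """Scaled-form ADMM for min 0.5*(x - a)^2 + lam*|z| subject to x = z."""
    x = z = u = 0.0
    for _ in range(iterations):
        # x-update: minimize the smooth data term plus the quadratic penalty.
        x = (a + rho * (z - u)) / (1.0 + rho)
        # z-update: proximal step on the non-smooth regularizer.
        z = soft_threshold(x + u, lam / rho)
        # Dual update: accumulate the constraint violation x - z.
        u = u + x - z
    return z


solution = admm_scalar_l1(a=3.0, lam=1.0)
closed_form = soft_threshold(3.0, 1.0)
```

For this toy problem the minimizer has a closed form, the soft-thresholded value of `a`, so `solution` can be compared against `closed_form`. The same split between a smooth data term and a non-smooth regularizer is what makes ADMM attractive for regularized inverse problems.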
Stance detection refers to the task of extracting the standpoint (Favor, Against, or Neither) towards a target in given texts. Such research has gained increasing attention with the proliferation of social media content. The conventional framework for handling stance detection converts it into a text classification task. Deep learning models have already replaced rule-based models and traditional machine learning models in solving such problems. Current deep neural networks face two main challenges: insufficient labeled data and information in social media posts, and the unexplainable nature of deep learning models. ChatGPT, a new pre-trained language model, was launched on Nov 30, 2022. For stance detection tasks, our experiments show that ChatGPT can achieve SOTA or similar performance on commonly used datasets including SemEval-2016 and P-Stance. At the same time, ChatGPT can provide explanations for its own predictions, which is beyond the capability of any existing model. The explanations for the cases in which it cannot provide classification results are especially useful. ChatGPT has the potential to be the best AI model for stance detection tasks in NLP, or at least to change the research paradigm of this field. ChatGPT also opens up the possibility of building explanatory AI for stance detection.
Motivated by the problem of matching vertices in two correlated Erdős-Rényi graphs, we study the problem of matching two correlated Gaussian Wigner matrices. We propose an iterative matching algorithm, which succeeds in polynomial time as long as the correlation between the two Gaussian matrices does not vanish. Our result is the first polynomial time algorithm that solves a graph matching type of problem when the correlation is an arbitrarily small constant.