Denoising Diffusion Probabilistic Models (DDPMs) are emerging in text-to-speech (TTS) synthesis because of their strong capability of generating high-fidelity samples. However, their iterative refinement process in high-dimensional data space results in slow inference, which restricts their application in real-time systems. Previous works have explored speeding up inference by minimizing the number of inference steps, but at the cost of sample quality. In this work, to improve the inference speed of DDPM-based TTS models while achieving high sample quality, we propose ResGrad, a lightweight diffusion model which learns to refine the output spectrogram of an existing TTS model (e.g., FastSpeech 2) by predicting the residual between the model output and the corresponding ground-truth speech. ResGrad has several advantages: 1) Compared with other acceleration methods for DDPMs, which need to synthesize speech from scratch, ResGrad reduces the complexity of the task by changing the generation target from the ground-truth mel-spectrogram to the residual, resulting in a more lightweight model and thus a smaller real-time factor. 2) ResGrad is employed in the inference process of the existing TTS model in a plug-and-play way, without re-training this model. We verify ResGrad on the single-speaker dataset LJSpeech and two more challenging datasets with multiple speakers (LibriTTS) and a high sampling rate (VCTK). Experimental results show that, in comparison with other speed-up methods for DDPMs: 1) ResGrad achieves better sample quality at the same inference speed, measured by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech more than 10 times faster than baseline methods. Audio samples are available at https://resgrad1.github.io/.
Translated by Google Translate
Disentangled representation learning remains challenging as ground truth factors of variation do not naturally exist. To address this, we present Vocabulary Disentanglement Retrieval (VDR), a simple yet effective retrieval-based disentanglement framework that leverages natural language as distant supervision. Our approach is built upon the widely used bi-encoder architecture with disentanglement heads and is trained on data-text pairs that are readily available on the web or in existing datasets. This makes our approach task- and modality-agnostic, with potential for a wide range of downstream applications. We conduct experiments on 16 datasets in both text-to-text and cross-modal scenarios and evaluate VDR in a zero-shot setting. With the incorporation of disentanglement heads and a minor increase in parameters, VDR achieves significant improvements over the base retriever it is built upon, with a 9% higher NDCG@10 score in zero-shot text-to-text retrieval and an average of 13% higher recall in cross-modal retrieval. VDR also outperforms the other baselines on most tasks, while improving explainability and efficiency.
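For readers unfamiliar with the NDCG@10 metric quoted above, the following is a minimal sketch of one common way to compute it. The function names are illustrative, and the sketch assumes that `relevances` holds the graded relevance of every judged document for a query, listed in the order the retriever ranked them; it is not taken from the VDR evaluation code.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked items."""
    # The item at 0-based position `rank` is discounted by log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical ranking: the single relevant document was placed second.
score = ndcg_at_k([0, 1, 0])
```

A score of 1.0 means the ranking already matches the ideal ordering; placing relevant documents lower in the list reduces the score logarithmically.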
As a neural network compression technique, post-training quantization (PTQ) transforms a pre-trained model into a quantized model using a lower-precision data type. However, prediction accuracy decreases because of quantization noise, especially in extremely low-bit settings. Determining appropriate quantization parameters (e.g., scaling factors and the rounding of weights) is the main open problem. Many existing methods determine the quantization parameters by minimizing the distance between features before and after quantization. Using this distance as the optimization metric considers only local information. We analyze the problem of minimizing local metrics and show that it does not result in optimal quantization parameters. Furthermore, the quantized model suffers from overfitting due to the small number of calibration samples in PTQ. In this paper, we propose PD-Quant to solve these problems. PD-Quant uses the difference between network predictions before and after quantization to determine the quantization parameters. To mitigate the overfitting problem, PD-Quant adjusts the distribution of activations in PTQ. Experiments show that PD-Quant leads to better quantization parameters and improves the prediction accuracy of quantized models, especially in low-bit settings. For example, PD-Quant pushes the accuracy of ResNet-18 up to 53.08% and RegNetX-600MF up to 40.92% with 2-bit weights and 2-bit activations. The code will be released at https://github.com/hustvl/PD-Quant.
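The contrast between a local feature distance and a prediction-level difference can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the two-layer linear "network", the grid search over scaling factors, and the use of KL divergence as the prediction-difference measure are not taken from the PD-Quant code.

```python
import numpy as np


def quantize(w, scale, bits=2):
    """Uniform symmetric quantization: round to a signed integer grid, then rescale."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(w / scale), qmin, qmax) * scale


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def local_distance(x, w, w_q):
    """Local metric: MSE between layer features before and after quantization."""
    return float(np.mean((x @ w - x @ w_q) ** 2))


def prediction_difference(x, w, w_q, head):
    """Global metric (assumed here to be KL divergence) between final predictions."""
    p = softmax((x @ w) @ head)
    q = softmax((x @ w_q) @ head)
    eps = 1e-12  # avoids log(0)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))     # small calibration batch (hypothetical)
w = rng.normal(size=(8, 8))      # weights of the layer being quantized
head = rng.normal(size=(8, 4))   # remaining layers, kept in full precision

candidates = np.linspace(0.1, 2.0, 20)  # candidate scaling factors
best_local = min(candidates, key=lambda s: local_distance(x, w, quantize(w, s)))
best_pred = min(candidates, key=lambda s: prediction_difference(x, w, quantize(w, s), head))
```

The two criteria need not select the same scaling factor, because the prediction-level metric also accounts for how quantization error propagates through the rest of the network.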
Accurately predicting the future trajectories of interacting road agents and planning a socially compliant, human-like trajectory accordingly are important for autonomous vehicles. In this paper, we propose a planning-centric prediction neural network, which takes surrounding agents' historical states and map context information as input, and outputs joint multi-modal predicted trajectories for surrounding agents, as well as a sequence of control commands for the ego vehicle learned by imitation learning. Our network architecture includes an agent-agent interaction module along the time axis to better capture the relationships among the other intelligent agents on the road. To incorporate the map's topological information, a Dynamic Graph Convolutional Neural Network (DGCNN) is employed to process the road network topology. In addition, the whole architecture can serve as a backbone for the Differentiable Integrated motion Prediction with Planning (DIPP) method by providing accurate prediction results and initial planning commands. Experiments on real-world datasets demonstrate the improvements of our proposed method in both planning and prediction accuracy over previous state-of-the-art methods.
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs, which have many computational and memory constraints. In this Mobile AI challenge, we address this problem and ask the participants to design an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to perform high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating a rate of up to 60 FPS when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
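Knowledge distillation (KD) has shown its effectiveness for object detection, where it trains a compact object detector under the supervision of both AI knowledge (a teacher detector) and human knowledge (human experts). However, existing studies treat AI knowledge and human knowledge in the same way and adopt a uniform data augmentation strategy during learning, which leads to biased learning of multi-scale objects and insufficient learning from the teacher detector, resulting in unsatisfactory distillation performance. To address these problems, we propose sample-specific data augmentation and adversarial feature augmentation. First, to mitigate the impact of multi-scale objects, we propose adaptive data augmentation based on our observations from the Fourier perspective. Second, we propose a feature augmentation method based on adversarial examples to better mimic AI knowledge and compensate for the insufficient information obtained from the teacher detector. Furthermore, our proposed method is unified and can easily be extended to other KD methods. Extensive experiments demonstrate the effectiveness of our framework, which improves the performance of state-of-the-art methods on both one-stage and two-stage detectors, with gains of up to 0.5 mAP.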
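Neural text ranking models have witnessed significant advancement and are increasingly being deployed in practice. Unfortunately, they also inherit the adversarial vulnerabilities of general neural models, which have been detected but remain underexplored by prior studies. Moreover, the inherited adversarial vulnerabilities might be exploited by blackhat SEO to defeat protected search engines. In this study, we propose an imitation adversarial attack on black-box neural passage ranking models. We first show that the target passage ranking model can be made transparent and imitated by enumerating critical queries/candidates and then training a ranking imitation model. Leveraging the ranking imitation model, we can elaborately manipulate the ranking results and transfer the manipulation attack to the target ranking model. To this end, we propose an innovative gradient-based attack method, empowered by a pairwise objective function, to generate adversarial triggers that cause premeditated disorder with very few tokens. To camouflage the triggers, we add a next-sentence prediction loss and a language model fluency constraint to the objective function. Experimental results on passage ranking demonstrate the effectiveness of the ranking imitation attack model and the adversarial triggers against various SOTA neural ranking models. Furthermore, various mitigation analyses and human evaluation show the effectiveness of the camouflage when facing potential mitigation approaches. To motivate other scholars to further investigate this novel and important problem, we make the experimental data and code publicly available.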
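Conventional co-salient object detection (CoSOD) makes the strong assumption that "a common salient object exists in every image of the same group". However, this biased assumption contradicts real scenarios, in which co-salient objects may be partially or completely absent from a group of images. We propose a random-sampling-based Generalised CoSOD Training (GCT) strategy to distill awareness of the inter-image absence of co-salient objects into CoSOD models. In addition, the random sampling process inherent in GCT enables the generation of a high-quality uncertainty map, with which we can further remediate less confident model predictions that are prone to localising non-common salient objects. To evaluate the generalisation ability of CoSOD models, we propose two new test datasets, namely CoCA-Common and CoCA-Zero, where a common salient object is partially present in the former and completely absent in the latter. Extensive experiments demonstrate that our proposed method significantly improves the generalisation ability of CoSOD models on the two new datasets, without negatively affecting their performance under the conventional CoSOD setting. Code is available at https://github.com/carlisle-liu/gcosod.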
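Recent trends in cloud computing technology have effectively boosted the application of visual inspection. However, most available systems work in a human-in-the-loop manner and cannot provide long-term support for online applications. To take a step forward, this paper outlines an automatic annotation system called SSAA, which works in a self-supervised learning manner to continuously perform online visual inspection in manufacturing automation scenarios. Benefiting from self-supervised learning, SSAA is effective in establishing a visual inspection application for the whole life cycle of manufacturing. In the early stage, with only anomaly-free data, an unsupervised algorithm is adopted to handle the pretext task and generate coarse labels for the subsequent data. A supervised algorithm is then trained for the downstream task. With a user-friendly web-based interface, SSAA makes it very convenient to integrate and deploy both the unsupervised and supervised algorithms. So far, the SSAA system has been adopted in several real-life industrial applications.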
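Deep-learning (DL) compilers such as TVM and TensorRT are increasingly being used to optimize deep neural network (DNN) models to meet performance, resource utilization and other requirements. Bugs in these compilers can produce optimized models whose semantics differ from those of the original models and that yield incorrect results, affecting the correctness of downstream applications. However, finding bugs in these compilers is challenging due to their complexity. In this work, we propose a new fuzz testing approach for finding bugs in deep-learning compilers. Our core approach uses (i) lightweight operator specifications to generate diverse yet valid DNN models, allowing us to exercise a large part of the compiler's transformation logic; (ii) a gradient-based search process for finding model inputs that avoid any floating-point exceptional values during model execution, reducing the chance of missed bugs or false alarms; and (iii) differential testing to identify bugs. We implemented this approach in NNSmith, which has found 65 new bugs in TVM, TensorRT, ONNXRuntime, and PyTorch over the last seven months. Of these, 52 have been confirmed and 44 have been fixed by project maintainers.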
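The differential-testing step (iii) can be sketched in a few lines of plain Python. This is a toy illustration and not NNSmith's implementation: `reference_model` and `optimized_model` are hypothetical scalar functions standing in for a DNN before and after compilation, the bug is injected deliberately, and step (ii) is simplified to skipping inputs whose outputs are not finite rather than searching for good inputs with gradients.

```python
import math
import random


def reference_model(x):
    """Stand-in for the original model: 0.5 * log(x^2 + 1)."""
    return 0.5 * math.log(x * x + 1.0)


def optimized_model(x):
    """Stand-in for the compiled model, with an injected incorrect rewrite."""
    if abs(x) > 3.0:
        return math.log(abs(x))  # wrong "simplification": drops the +1 term
    return 0.5 * math.log(x * x + 1.0)


def differential_test(ref, opt, inputs, rel_tol=1e-6, abs_tol=1e-9):
    """Run both models on each input and collect disagreements."""
    mismatches = []
    for x in inputs:
        expected, actual = ref(x), opt(x)
        # Skip NaN/Inf results: comparing them leads to false alarms or missed bugs.
        if not (math.isfinite(expected) and math.isfinite(actual)):
            continue
        if not math.isclose(expected, actual, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((x, expected, actual))
    return mismatches


random.seed(0)
inputs = [random.uniform(-10.0, 10.0) for _ in range(200)]
bugs = differential_test(reference_model, optimized_model, inputs)
```

Each entry in `bugs` records an input on which the two implementations disagree beyond the tolerance, which is the signal a fuzzer would report as a candidate compiler bug.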