We introduce a block-online variant of the temporal feature-wise linear modulation (TFiLM) model to achieve bandwidth extension. The proposed architecture simplifies the UNet backbone of TFiLM to reduce inference time and employs an efficient transformer at the bottleneck to alleviate the performance degradation. We also utilize self-supervised pretraining and data augmentation to enhance the quality of bandwidth-extended signals and reduce the sensitivity to downsampling methods. Experimental results on the VCTK dataset show that the proposed method outperforms several recent baselines on both intrusive and non-intrusive metrics. Pretraining and filter augmentation also help stabilize and improve the overall performance.
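For readers unfamiliar with the TFiLM mechanism this abstract builds on, the sketch below shows a minimal temporal feature-wise linear modulation layer: activations are split into fixed-size blocks along time, each block is pooled, an RNN summarizes the block sequence, and its outputs scale and shift every block. The block size, the LSTM choice, and all shapes are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TemporalFiLM(nn.Module):
    """Minimal TFiLM-style layer: block-wise pooling, an RNN over block summaries,
    and per-block scale/shift (FiLM) applied back to the activations."""
    def __init__(self, channels: int, block_size: int):
        super().__init__()
        self.block_size = block_size
        self.rnn = nn.LSTM(input_size=channels, hidden_size=channels, batch_first=True)
        self.to_scale_shift = nn.Linear(channels, 2 * channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); time is assumed to be a multiple of block_size
        b, c, t = x.shape
        n_blocks = t // self.block_size
        blocks = x.view(b, c, n_blocks, self.block_size)
        pooled = blocks.max(dim=-1).values.transpose(1, 2)       # (b, n_blocks, c)
        summary, _ = self.rnn(pooled)                            # (b, n_blocks, c)
        scale, shift = self.to_scale_shift(summary).chunk(2, dim=-1)
        scale = scale.transpose(1, 2).unsqueeze(-1)              # (b, c, n_blocks, 1)
        shift = shift.transpose(1, 2).unsqueeze(-1)
        return (blocks * scale + shift).view(b, c, t)

x = torch.randn(2, 32, 256)
y = TemporalFiLM(channels=32, block_size=16)(x)   # same shape as x
```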
Conventionally, audio super-resolution models fix the initial and target sampling rates, which requires a trained model for each pair of sampling rates. We introduce NU-Wave 2, a diffusion model for neural audio upsampling that can generate 48 kHz audio signals from inputs of various sampling rates with a single model. Based on the architecture of NU-Wave, NU-Wave 2 uses short-time Fourier convolution (STFC) to generate harmonics to resolve the main failure modes of NU-Wave, and incorporates bandwidth spectral feature transform (BSFT) to condition the bandwidths of inputs in the frequency domain. We experimentally demonstrate that NU-Wave 2 produces high-resolution audio regardless of the sampling rate of the input, while requiring fewer parameters than other models. The official code and audio samples are available at https://mindslab-ai.github.io/nuwave2.
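A rough sketch of bandwidth-conditioned modulation in the spirit of the BSFT idea mentioned above: a binary mask marking which frequency bins are present in the input predicts a per-bin scale and shift applied to the spectral feature map. The 1x1-convolution parameterization and all sizes are assumptions for illustration, not NU-Wave 2's exact design.

```python
import torch
import torch.nn as nn

class BandwidthSFT(nn.Module):
    """Scale/shift spectral features from a bandwidth mask (illustrative sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.scale = nn.Conv2d(1, channels, kernel_size=1)
        self.shift = nn.Conv2d(1, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, band_mask: torch.Tensor) -> torch.Tensor:
        # feat: (batch, channels, freq, time) spectral features
        # band_mask: (batch, 1, freq, time), 1 where the input band has energy, 0 above the cutoff
        return feat * self.scale(band_mask) + self.shift(band_mask)

feat = torch.randn(2, 64, 513, 100)
mask = torch.zeros(2, 1, 513, 100)
mask[:, :, :256] = 1.0               # e.g. only the lower half of the band is present
out = BandwidthSFT(64)(feat, mask)   # same shape as feat
```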
Generative adversarial networks have recently demonstrated outstanding performance in neural vocoding, outperforming the best autoregressive and flow-based models. In this paper, we show that this success can be extended to other tasks of conditional audio generation. In particular, building on HiFi vocoders, we propose a novel HiFi++ general framework for bandwidth extension and speech enhancement. We show that, with an improved generator architecture and simplified multi-discriminator training, HiFi++ performs better than or on par with the state of the art in these tasks while using significantly fewer computational resources. The effectiveness of our approach is validated through a series of extensive experiments.
We present a model for denoising historical music recordings. Our model transforms its input into a time-frequency representation via the short-time Fourier transform (STFT) and processes the resulting complex spectrogram with a convolutional neural network. The network is trained with reconstruction and adversarial objectives on a synthetic music dataset, created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method quantitatively on held-out test examples from the synthetic dataset, and qualitatively through human ratings on samples from actual historical recordings. Our results show that the proposed method is effective at removing noise while preserving the quality and detail of the original music.
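The synthetic training pairs described above can be illustrated with a small mixing routine: a clean music excerpt is combined with a real noise excerpt, extracted from a quiet segment of an old recording, at a randomly drawn signal-to-noise ratio. The SNR range and scaling convention here are illustrative assumptions, not the paper's exact augmentation settings.

```python
import torch

def mix_at_random_snr(clean, noise, snr_db_range=(0.0, 20.0)):
    """Mix a clean excerpt with a noise excerpt at a random SNR (in dB)."""
    snr_db = torch.empty(1).uniform_(*snr_db_range)
    clean_power = clean.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-8)
    # Scale the noise so that 10*log10(clean_power / scaled_noise_power) == snr_db
    gain = torch.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    noisy = clean + gain * noise
    return noisy, clean          # (network input, reconstruction target)

clean = torch.randn(44100)       # stand-in for one second of clean music
noise = 0.1 * torch.randn(44100) # stand-in for noise extracted from an old recording
noisy, target = mix_at_random_snr(clean, noise)
```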
Convolution-augmented transformers (Conformers) have recently been proposed for various speech-domain applications, such as automatic speech recognition (ASR) and speech separation, because they can capture both local and global dependencies. In this paper, we propose a conformer-based metric generative adversarial network (CMGAN) for speech enhancement (SE) in the time-frequency (TF) domain. The generator encodes magnitude and complex spectrogram information using two-stage conformer blocks to model both time and frequency dependencies. The decoder then decouples the estimation into a magnitude mask decoder branch, which filters out unwanted distortions, and a complex refinement branch, which further improves the magnitude estimate and implicitly enhances the phase information. In addition, we include a metric discriminator to alleviate metric mismatch by optimizing the corresponding evaluation score. Objective and subjective evaluations illustrate that CMGAN shows superior performance compared with state-of-the-art methods on three speech enhancement tasks (denoising, dereverberation, and super-resolution). For example, a quantitative denoising analysis on the Voice Bank+DEMAND dataset indicates that CMGAN outperforms previous models by a margin, i.e., a PESQ of 3.41 and an SSNR of 11.10 dB.
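One plausible reading of how the two decoder branches can be combined is sketched below: the mask branch scales the noisy magnitude, and the complex refinement branch adds a residual to the real and imaginary parts, which implicitly adjusts the phase. Variable names and the recombination formula are assumptions based on the abstract, not CMGAN's reference implementation.

```python
import torch

def combine_branches(noisy_real, noisy_imag, mag_mask, complex_residual):
    """Combine a magnitude-mask branch with a complex refinement branch (sketch)."""
    # noisy_real / noisy_imag: (batch, freq, time); mag_mask: same shape
    # complex_residual: (batch, 2, freq, time) predicted by the refinement branch
    noisy_mag = torch.sqrt(noisy_real ** 2 + noisy_imag ** 2 + 1e-9)
    phase = torch.atan2(noisy_imag, noisy_real)
    masked_mag = mag_mask * noisy_mag
    # Reconstruct real/imag from the masked magnitude and noisy phase, then refine
    enhanced_real = masked_mag * torch.cos(phase) + complex_residual[:, 0]
    enhanced_imag = masked_mag * torch.sin(phase) + complex_residual[:, 1]
    return enhanced_real, enhanced_imag
```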
Removing background noise from speech audio has been the subject of considerable research and effort, especially in recent years due to the rise of virtual communication and amateur sound recording. However, background noise is not the only unpleasant disturbance that can impede intelligibility: reverberation, clipping, codec artifacts, problematic equalization, limited bandwidth, and inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement with mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that it achieves competitive objective scores with just 4-8 diffusion steps, despite not considering any particular strategy for fast sampling. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
In this work, we present CleanUNet, a causal speech denoising model that operates on the raw waveform. The proposed model is based on an encoder-decoder architecture combined with several self-attention blocks that refine its bottleneck representations, which is crucial for obtaining good results. The model is optimized through a set of losses defined over both the waveform and multi-resolution spectrograms. The proposed method outperforms state-of-the-art models in terms of speech quality on various objective and subjective evaluation metrics.
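A compact sketch of a multi-resolution spectrogram loss of the kind referred to above: waveform magnitudes are compared at several FFT sizes and the terms are summed. The chosen resolutions and the L1-on-magnitude form follow common practice and are only an assumption about the paper's exact loss definition.

```python
import torch

def multi_resolution_stft_loss(pred, target,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Sum of magnitude and log-magnitude L1 terms over several STFT resolutions."""
    # pred, target: (batch, samples) raw waveforms
    loss = 0.0
    for n_fft, hop in resolutions:
        window = torch.hann_window(n_fft, device=pred.device)
        spec_p = torch.stft(pred, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs().clamp_min(1e-7)
        spec_t = torch.stft(target, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs().clamp_min(1e-7)
        loss = loss + (spec_p - spec_t).abs().mean() \
                    + (spec_p.log() - spec_t.log()).abs().mean()
    return loss
```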
Neural vocoders based on generative adversarial networks (GANs) are widely used because of their fast inference speed and lightweight networks while producing high-quality speech waveforms. Since perceptually important speech components are concentrated primarily in the low-frequency bands, most GAN-based neural vocoders perform multi-scale analysis that evaluates downsampled speech waveforms. This multi-scale analysis helps the generator improve speech intelligibility. However, in preliminary experiments, we observed that multi-scale analysis focused on the low-frequency bands causes unintended artifacts, e.g., aliasing and imaging artifacts, which degrade the quality of the synthesized speech waveform. Therefore, in this paper, we investigate the relationship between these artifacts and GAN-based neural vocoders, and propose a GAN-based neural vocoder, called Avocodo, that allows synthesizing high-fidelity speech with reduced artifacts. We introduce two kinds of discriminators to evaluate waveforms from different perspectives: a collaborative multi-band discriminator and a sub-band discriminator. We also utilize a pseudo quadrature mirror filter bank to obtain downsampled multi-band waveforms while avoiding aliasing. Experimental results show that Avocodo outperforms conventional GAN-based neural vocoders in both speech and singing voice synthesis tasks and can synthesize artifact-free speech. In particular, Avocodo is even capable of reproducing high-quality waveforms for unseen speakers.
Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision. They have achieved great success in these domains in tasks such as machine translation and image generation. Owing to this success, these data-driven techniques have also been applied in the audio domain. More specifically, DNN models have been applied in the speech enhancement domain to achieve denoising, dereverberation and multi-speaker separation in monaural speech enhancement. In this paper, we review some dominant DNN techniques being employed to achieve speech separation. The review covers the whole pipeline of speech enhancement: feature extraction, how DNN-based tools model both global and local features of speech, and model training (supervised and unsupervised). We also review the use of speech-enhancement pre-trained models to boost the speech enhancement process. The review is geared towards covering the dominant trends with regard to DNN application in speech enhancement for speech obtained via a single speaker.
Despite recent progress in generative adversarial network (GAN)-based vocoders, in which the model generates a raw waveform from a mel spectrogram, synthesizing high-fidelity audio for numerous speakers across various recording environments remains challenging. In this work, we present BigVGAN, a universal vocoder that generalizes well to various unseen conditions in zero-shot settings. We introduce periodic nonlinearities and anti-aliased representations into the generator, which brings the desired inductive bias for waveform synthesis and significantly improves audio quality. Based on our improved generator and state-of-the-art discriminators, we train our GAN vocoder at the largest scale, up to 112M parameters, which is unprecedented in the literature. In particular, we identify and address training instabilities specific to this scale, while maintaining high-fidelity output without over-regularization. Our BigVGAN achieves state-of-the-art zero-shot performance for various out-of-distribution scenarios, including new speakers, novel languages, singing voices, music and instrumental audio in unseen (even noisy) recording environments. We will release our code and models at: https://github.com/nvidia/bigvgan
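One concrete example of a periodic nonlinearity with a learnable frequency is the Snake activation, x + sin^2(ax)/a, which keeps a monotonic trend while injecting a periodic inductive bias. The sketch below is shown purely to illustrate the idea named in the abstract; the per-channel parameterization is an assumption.

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + sin^2(alpha * x) / alpha, with a learnable alpha per channel."""
    def __init__(self, channels: int, init_alpha: float = 1.0):
        super().__init__()
        # one learnable frequency per channel, broadcast over the time axis
        self.alpha = nn.Parameter(init_alpha * torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

act = Snake(channels=64)
y = act(torch.randn(2, 64, 1024))
```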
Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
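The encoder / mask / decoder flow described in the abstract can be sketched as follows: a learned 1-D convolutional encoder produces a latent representation of the mixture, one mask per speaker is applied to it, and a transposed-convolution decoder maps each masked representation back to a waveform. The separator that actually predicts the masks (the stacked dilated-convolution TCN) is replaced by a placeholder here, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    """Toy encoder-mask-decoder pipeline in the style of Conv-TasNet (sketch only)."""
    def __init__(self, n_speakers=2, n_filters=256, kernel=16, stride=8):
        super().__init__()
        self.n_speakers = n_speakers
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        self.separator = nn.Conv1d(n_filters, n_speakers * n_filters, 1)  # stand-in for the TCN
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, mixture):
        # mixture: (batch, 1, samples)
        rep = torch.relu(self.encoder(mixture))                    # (b, F, frames)
        masks = torch.sigmoid(self.separator(rep))                 # (b, S*F, frames)
        masks = masks.view(-1, self.n_speakers, rep.size(1), rep.size(2))
        outs = [self.decoder(masks[:, s] * rep) for s in range(self.n_speakers)]
        return torch.stack(outs, dim=1)                            # (b, S, 1, samples)

est = TinyTasNet()(torch.randn(2, 1, 16000))
```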
Real-world image denoising is a practical image restoration problem that aims to obtain clean images from noisy inputs captured in the wild. Recently, the Vision Transformer (ViT) has exhibited a strong ability to capture long-range dependencies, and many researchers have attempted to apply ViT to image denoising tasks. However, a real-world image is an isolated frame, which makes the ViT build long-range dependencies over internal patches: dividing the image into patches disarranges the noise patterns and gradient continuity. In this paper, we propose to address this issue with a continuous wavelet sliding transformer that builds frequency correspondences in the real-world scenario, called DnSwin. Specifically, we first use a CNN encoder to extract low-level features from the noisy input image. The key to DnSwin is to separate high-frequency and low-frequency information from these features and to build frequency dependencies. To this end, we propose a wavelet sliding window transformer that utilizes the discrete wavelet transform, self-attention, and the inverse discrete wavelet transform to extract deep features. Finally, we use a CNN decoder to reconstruct the deep features into a denoised image. Both quantitative and qualitative evaluations on real-world denoising benchmarks demonstrate that the proposed DnSwin performs favorably against state-of-the-art methods.
We present a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, in which a convolutional neural network (CNN) performs encoding and decoding during its feedforward routine as a neural waveform codec (NWC). The proposed NWC also defines quantization and entropy coding as trainable modules, so coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components into the NWC, such as gated residual networks and depthwise separable convolutions. In addition, the proposed model features a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we adopt the residual coding concept to concatenate multiple NWC autoencoding modules, where each NWC module performs residual coding to recover any reconstruction loss created by its preceding modules. CMRL can also scale down to cover lower bitrates, since it employs a linear predictive coding (LPC) module as its first autoencoder. The hybrid design integrates LPC and NWC by redefining LPC's quantization as a differentiable process, making the system trainable end to end. The decoder of the proposed system consists of either a single NWC (0.12 million parameters) in the low-to-medium bitrate range (12 to 20 kbps) or two NWCs at the high bitrate (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly lower than that of other neural speech coders, such as WaveNet-based vocoders. For wideband speech coding quality, our system achieves performance comparable or superior to that of AMR-WB on test utterances at low and medium bitrates. The proposed system can scale up to higher bitrates to achieve near-transparent performance.
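The cross-module residual idea can be illustrated with a toy cascade: each coding module receives the residual left by the previous modules, and the decoded signal is the sum of all module outputs. The "codec" below is a trivial stand-in that only demonstrates the cascade, not the NWC architecture, quantization, or entropy coding.

```python
import torch
import torch.nn as nn

class TinyCodecModule(nn.Module):
    """Trivial autoencoder stand-in for one coding module."""
    def __init__(self, hidden=64):
        super().__init__()
        self.enc = nn.Conv1d(1, hidden, 9, padding=4)
        self.dec = nn.Conv1d(hidden, 1, 9, padding=4)

    def forward(self, x):                      # x: (batch, 1, samples)
        return self.dec(torch.tanh(self.enc(x)))

def cascaded_residual_coding(x, modules):
    residual, parts = x, []
    for m in modules:
        decoded = m(residual)                  # this module codes what is still missing
        parts.append(decoded)
        residual = residual - decoded          # pass the remaining error to the next module
    return sum(parts)                          # reconstruction = sum of all module outputs

modules = nn.ModuleList([TinyCodecModule(), TinyCodecModule()])
recon = cascaded_residual_coding(torch.randn(2, 1, 16000), modules)
```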
Music discovery services let users identify songs from short mobile recordings. These solutions are often based on Audio Fingerprinting (AFP), and rely more specifically on the extraction of spectral peaks in order to be robust to a number of distortions. Little work has been done to study the robustness of these algorithms to background noise captured in real environments. In particular, AFP systems still struggle when the signal-to-noise ratio is low, i.e., when the background noise is strong. In this project, we tackle this problem with deep learning. We test a new hybrid strategy which consists of inserting a denoising DL model in front of a peak-based AFP algorithm. We simulate noisy music recordings using a realistic data augmentation pipeline, and train a DL model to denoise them. The denoising model limits the impact of background noise on the AFP system's extracted peaks, improving its robustness to noise. We further propose a novel loss function to adapt the DL model to the considered AFP system, increasing its precision in terms of retrieved spectral peaks. To the best of our knowledge, this hybrid strategy has not been tested before.
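One way to bias a denoising loss toward the spectral peaks an AFP system relies on, offered here only as a rough illustration of the idea, is to detect local maxima of the clean magnitude spectrogram with max-pooling and up-weight the reconstruction error at those positions. The peak detector, neighborhood size, and weighting are assumptions, not the loss proposed in the paper.

```python
import torch
import torch.nn.functional as F

def peak_weighted_loss(pred_mag, clean_mag, neighborhood=5, peak_weight=10.0):
    """L1 spectrogram loss with extra weight at local maxima of the clean magnitude."""
    # pred_mag, clean_mag: (batch, freq, time) magnitude spectrograms
    pooled = F.max_pool2d(clean_mag.unsqueeze(1), neighborhood,
                          stride=1, padding=neighborhood // 2).squeeze(1)
    peak_mask = (clean_mag >= pooled).float()           # 1 at local maxima, 0 elsewhere
    weights = 1.0 + peak_weight * peak_mask
    return (weights * (pred_mag - clean_mag).abs()).mean()
```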
Speech neuroprostheses have the potential to provide communication for people with severe speech impairments. Recent advances have demonstrated high-quality text decoding and speech synthesis from electrocorticographic (ECoG) grids placed on the cortical surface. Here, we investigate a less invasive measurement modality, namely stereotactic EEG (sEEG), which provides sparse sampling from multiple brain regions, including subcortical regions. To evaluate whether sEEG can also be used to synthesize high-quality audio from neural recordings, we employ a recurrent encoder-decoder framework based on modern deep learning methods. We demonstrate that, despite limited training data, high-quality speech can be reconstructed from these minimally invasive recordings. Finally, we utilize variational feature dropout to successfully identify the most informative electrode contacts.
Most GAN-based (generative adversarial network) approaches to high-fidelity waveform generation rely heavily on the discriminator to improve their performance. However, overuse of this GAN strategy introduces much uncertainty into the generation process and often causes mismatches in pitch and intensity, which is fatal for sensitive applications such as singing voice synthesis (SVS). To address this problem, we propose RefineGAN, a high-fidelity neural vocoder with faster-than-real-time generation capability that focuses on robustness, pitch and intensity accuracy, and full-band audio generation. We employ a pitch-guided refinement architecture with a multi-scale spectrogram-based loss function to help stabilize the training process and maintain the robustness of the neural vocoder while using GAN-based training methods. Audio generated with this method shows better performance in subjective tests than the corresponding ground-truth audio. This result suggests that fidelity is even improved during waveform reconstruction by eliminating defects produced by the speaker and the recording procedure. Moreover, a further study shows that a model trained on a specific type of data can perform equally well on entirely unseen languages and unseen speakers. Generated sample pairs are available at https://timedomain-tech.github.io/refinegor.
Binaural audio plays a significant role in constructing immersive augmented and virtual realities. As it is expensive to record binaural audio from the real world, synthesizing it from mono audio has attracted increasing attention. This synthesis process involves not only the basic physical warping of the mono audio, but also room reverberations and head/ear related filtrations, which, however, are difficult to accurately simulate in traditional digital signal processing. In this paper, we formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part that is shared by the left and right channels as well as a specific part that differs in each channel. Accordingly, we propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively. Specifically, in the first stage, the common information of the binaural audio is generated with a single-channel diffusion model conditioned on the mono audio, based on which the binaural audio is generated by a two-channel diffusion model in the second stage. Combining this novel perspective of two-stage synthesis with advanced generative models (i.e., the diffusion models), the proposed BinauralGrad is able to generate accurate and high-fidelity binaural audio samples. Experimental results show that on a benchmark dataset, BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics (Wave L2: 0.128 vs. 0.157, MOS: 3.80 vs. 3.61). The generated audio samples (https://speechresearch.github.io/binauralgrad) and code (https://github.com/microsoft/NeuralSpeech/tree/master/BinauralGrad) are available online.
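A small illustration of the two-part decomposition described above, taking the common part as the per-sample average of the left and right channels and the specific part as what each channel adds on top of it; this is a plausible reading of the abstract rather than necessarily the paper's exact definition.

```python
import torch

def decompose_binaural(left, right):
    """Split a binaural pair into a shared component and per-channel residuals."""
    common = 0.5 * (left + right)
    specific_left, specific_right = left - common, right - common
    return common, specific_left, specific_right

left, right = torch.randn(48000), torch.randn(48000)
common, dl, dr = decompose_binaural(left, right)
# The decomposition is exactly invertible: common + residual recovers each channel.
assert torch.allclose(common + dl, left) and torch.allclose(common + dr, right)
```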
Several recent works on speech synthesis have employed generative adversarial networks (GANs) to produce raw waveforms. Although such methods improve the sampling efficiency and memory usage, their sample quality has not yet reached that of autoregressive and flow-based generative models. In this work, we propose HiFi-GAN, which achieves both efficient and high-fidelity speech synthesis. As speech audio consists of sinusoidal signals with various periods, we demonstrate that modeling periodic patterns of an audio is crucial for enhancing sample quality. A subjective human evaluation (mean opinion score, MOS) of a single speaker dataset indicates that our proposed method demonstrates similarity to human quality while generating 22.05 kHz high-fidelity audio 167.9 times faster than real-time on a single V100 GPU. We further show the generality of HiFi-GAN to the mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis. Finally, a small footprint version of HiFi-GAN generates samples 13.4 times faster than real-time on CPU with comparable quality to an autoregressive counterpart.
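HiFi-GAN captures these periodic patterns with multi-period discriminators that view the waveform as a 2-D array whose width equals a candidate period; the sketch below shows only that reshaping step, with an arbitrary period list. The discriminator networks themselves are omitted, so this is an illustration of the idea rather than the model.

```python
import torch
import torch.nn.functional as F

def fold_by_period(wave, period):
    """Reshape (batch, 1, samples) audio into (batch, 1, samples/period, period)."""
    b, c, t = wave.shape
    if t % period != 0:
        # pad so the length divides the period before folding
        wave = F.pad(wave, (0, period - t % period), mode="reflect")
        t = wave.shape[-1]
    return wave.reshape(b, c, t // period, period)

# One 2-D "view" of the same waveform per candidate period.
views = [fold_by_period(torch.randn(1, 1, 22050), p) for p in (2, 3, 5, 7, 11)]
```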
Diffusion-based generative models have had a high impact on the computer vision and speech processing communities in the past few years. Besides data generation tasks, they have also been employed for data restoration tasks like speech enhancement and dereverberation. While discriminative models have traditionally been argued to be more powerful, e.g., for speech enhancement, generative diffusion approaches have recently been shown to narrow this performance gap considerably. In this paper, we systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks. For this, we extend our prior contributions on diffusion-based speech enhancement in the complex time-frequency domain to the task of bandwidth extension. We then compare it to a discriminatively trained neural network with the same network architecture on three restoration tasks, namely speech denoising, dereverberation and bandwidth extension. We observe that the generative approach performs globally better than its discriminative counterpart on all tasks, with the strongest benefit for non-additive distortion models, as in dereverberation and bandwidth extension. Code and audio examples can be found online at https://uhh.de/inf-sp-sgmsemultitask
Given recent advances in music source separation and automatic mixing, removing audio effects from music tracks is a meaningful step toward developing automated mixing systems. This paper focuses on removing distortion audio effects applied to guitar tracks in music production. We explore whether effect removal can be solved with neural networks designed for source separation and audio effect modeling. Our approach proves to be particularly effective for effects that mix the processed and clean signals. Compared with a state-of-the-art solution based on sparse optimization, these models achieve better quality and faster inference. We show that these models are suitable not only for a single distortion type but also for other kinds of distortion effects. In discussing the results, we highlight the usefulness of multiple evaluation metrics for assessing different aspects of reconstruction in distortion effect removal.