Training an enhancement model with objective functions related to human perception has recently become a popular topic, mainly because the conventional mean squared error (MSE) loss does not reflect auditory perception well. Among perception-related metrics, the Perceptual Evaluation of Speech Quality (PESQ) is representative and has been shown to correlate highly with human quality ratings. However, because the PESQ function is complex and non-differentiable, it cannot be used to directly optimize a speech enhancement model. In this study, we propose optimizing the enhancement model with an approximated PESQ function that is differentiable and learned from training data. Experimental results show that, compared with an MSE-based pretrained model, fine-tuning with the learned loss function can further improve the average PESQ score of the enhanced speech by 0.1 points.
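As a rough illustration of the idea, the sketch below fits a differentiable proxy of PESQ from (feature, score) pairs and then fine-tunes by following the proxy's gradient instead of an MSE. The linear proxy, the feature dimension, and all names are illustrative stand-ins; the paper learns a neural approximation.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Fit a proxy q(x) ~ PESQ from spectral-distortion features.
#    Here a linear model via least squares; the paper uses a network.
feats = rng.normal(size=(200, 8))        # features of (enhanced, clean) pairs
true_w = rng.normal(size=8)
pesq = feats @ true_w + 3.0              # stand-in PESQ labels
w, *_ = np.linalg.lstsq(feats, pesq - 3.0, rcond=None)

def proxy_pesq(x):
    """Differentiable stand-in for the learned PESQ approximation."""
    return float(x @ w + 3.0)

# 2) Fine-tune: ascend the proxy's gradient instead of descending MSE.
x = rng.normal(size=8)                   # current enhanced-speech features
before = proxy_pesq(x)
for _ in range(50):
    x = x + 0.1 * w                      # d(proxy)/dx = w for a linear proxy
after = proxy_pesq(x)
```

Because the proxy is learned rather than exact, its gradient is only a surrogate for the true PESQ gradient; the fine-tuned model still has to be validated against the real PESQ function.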
Most current objective speech quality assessment tools (e.g., the Perceptual Evaluation of Speech Quality (PESQ)) are based on comparing degraded/processed speech with its clean counterpart. Because a clean reference is usually unavailable in practice, the need for a "gold" reference greatly limits the applicability of such assessment tools in real-world scenarios. Humans, on the other hand, can easily judge speech quality without any reference (e.g., in mean opinion score (MOS) tests), which implies that an objective and non-intrusive (no clean reference required) quality assessment mechanism exists. In this study, we propose a novel end-to-end, non-intrusive speech quality assessment model, termed Quality-Net, based on bidirectional long short-term memory. In Quality-Net, the estimate of utterance-level quality is based on frame-level estimates; a frame constraint and sensible initialization of the forget gate biases are used to learn meaningful frame-level quality assessment from utterance-level quality labels. Experimental results show that Quality-Net yields high correlation with PESQ (0.9 for noisy speech and 0.84 for speech processed by speech enhancement). We believe Quality-Net has the potential to be used in a variety of speech signal processing applications.
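A minimal numpy sketch of Quality-Net's frame-to-utterance aggregation and its frame-constrained loss may clarify the mechanism. The real model scores frames with a BLSTM; the equal loss weighting (`alpha`) and all names here are assumptions for illustration.

```python
import numpy as np

def utterance_score(frame_scores):
    # Utterance-level quality is the average of the frame-level estimates.
    return float(np.mean(frame_scores))

def quality_net_loss(frame_scores, utt_label, alpha=1.0):
    # Utterance-level regression error against the (e.g., PESQ) label.
    utt_err = (utterance_score(frame_scores) - utt_label) ** 2
    # Frame constraint: each frame estimate is also pulled toward the
    # utterance label, giving frames a meaningful learning signal even
    # though only utterance-level labels exist.
    frame_err = float(np.mean((frame_scores - utt_label) ** 2))
    return utt_err + alpha * frame_err

scores = np.array([3.1, 2.9, 3.0, 3.2])   # hypothetical frame-level outputs
loss = quality_net_loss(scores, utt_label=3.0)
```

The frame constraint is what keeps the frame-level outputs interpretable: without it, the average alone could match the label while individual frames drift arbitrarily.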
Supervised learning based on deep neural networks has recently achieved substantial improvements in speech enhancement. A denoising network learns either a mapping from noisy speech directly to clean speech, or a spectral mask between the clean and noisy spectra. In either case, the network is optimized by minimizing the mean squared error (MSE) between predefined labels and the network output, on the spectrum or the time-domain signal. Existing schemes, however, suffer from two critical problems: spectrum mismatch and metric mismatch. Spectrum mismatch is a well-known issue: in general, any spectral modification after the short-time Fourier transform (STFT) cannot be fully recovered after the inverse transform. Metric mismatch means that the conventional MSE criterion is suboptimal for maximizing our target metrics, the signal-to-distortion ratio (SDR) and the Perceptual Evaluation of Speech Quality (PESQ). This paper proposes a new end-to-end denoising framework aimed at joint SDR and PESQ optimization. First, the network is optimized on the time-domain signal after the ISTFT to avoid spectrum mismatch. Second, two improved loss functions related to the SDR and PESQ metrics are proposed to minimize metric mismatch. Experimental results show that the proposed denoising scheme significantly improves both SDR and PESQ performance compared with existing methods.
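The SDR half of such a metric-driven objective is easy to write down on time-domain signals; the sketch below shows the standard form, negated so a trainer can minimize it (the PESQ-related term and the paper's exact formulation are omitted; `eps` is a numerical-stability assumption).

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-8):
    # Signal-to-distortion ratio in dB between a clean reference
    # waveform and an enhanced estimate (computed after the ISTFT,
    # so no spectrum mismatch is introduced).
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den + eps)

def sdr_loss(reference, estimate):
    # Negate: minimizing this loss maximizes the SDR.
    return -sdr_db(reference, estimate)

t = np.linspace(0.0, 1.0, 8000)
clean = np.sin(2 * np.pi * 220 * t)
noisy = clean + 0.1 * np.random.default_rng(0).normal(size=t.size)
```

A closer estimate yields a higher SDR and hence a lower loss, which is exactly the monotone relation the MSE lacks with respect to the evaluation metric.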
We propose a training method for deep neural network (DNN)-based source enhancement that increases objective sound quality assessment (OSQA) scores such as the Perceptual Evaluation of Speech Quality (PESQ). In many conventional studies, DNNs have been used as mapping functions to estimate time-frequency masks, trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores have been widely used for quality evaluation, constructing DNNs to increase OSQA scores should be preferable to creating high-quality output signals by minimizing the MSE. However, because most OSQA scores are not analytically tractable, i.e., they are black boxes, the gradient of the objective function cannot be computed by simply applying back-propagation. To compute the gradient of an OSQA-based objective function, we formulate a DNN optimization scheme on the basis of black-box optimization, as used for training game-playing computers. In this scheme, we adopt a policy-gradient method to compute the gradient on the basis of a sampling algorithm. To simulate output signals with the sampling algorithm, the DNN is used to estimate the probability density function of the output signals that maximize the OSQA scores. OSQA scores are calculated from the simulated output signals, and the DNN is trained to increase the probability of generating simulated output signals that achieve high OSQA scores. Through several experiments, we found that OSQA scores increased significantly by applying the proposed method, even though the MSE was not minimized.
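An illustrative REINFORCE-style sketch of this black-box scheme: a Gaussian sampling distribution (standing in for the DNN's output density) draws candidate outputs, a non-differentiable scorer rewards them, and the policy gradient shifts the distribution toward high-scoring samples. The quadratic `osqa` function, the Gaussian parameterization, and all hyperparameters are assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([0.5, -0.2, 0.8])

def osqa(y):
    # Black box: we never differentiate through this function. The
    # quadratic form is only a stand-in for a PESQ-like scorer.
    return -float(np.sum((y - target) ** 2))

mu, sigma, lr = np.zeros(3), 0.3, 0.05
for _ in range(300):
    samples = mu + sigma * rng.normal(size=(16, 3))   # sample candidate outputs
    rewards = np.array([osqa(s) for s in samples])    # score with the black box
    baseline = rewards.mean()                         # baseline for variance reduction
    # Policy gradient for a Gaussian: grad_mu log p(y) = (y - mu) / sigma^2
    grad = ((rewards - baseline)[:, None] * (samples - mu)).mean(axis=0) / sigma**2
    mu = mu + lr * grad          # raise the probability of high-scoring outputs
```

Note that no gradient of `osqa` itself is ever taken; only the log-probability of the sampling distribution is differentiated, which is what makes non-differentiable metrics usable as training signals.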
In real-world situations, speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and also pose a serious problem to many speech-related applications, including automatic speech and speaker recognition. In order to deal with the combined effects of noise and reverberation, we propose a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes, which would in turn yield better phase estimates when combined with iterative phase reconstruction. The two-stage model is then jointly trained to optimize the proposed objective function. Systematic evaluations and comparisons show that the proposed algorithm improves objective metrics of speech intelligibility and quality substantially, and significantly outperforms previous one-stage enhancement systems.
Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level feature. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are being increasingly used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, training the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model using an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm the effectiveness of it. With that, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.
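A hedged sketch of the waveform-level adversarial objective used by models of this family: a least-squares GAN loss plus an L1 term pulling the enhanced waveform toward the clean one. The L1 weight `lambda_l1` and the exact combination are illustrative, not necessarily the paper's values.

```python
import numpy as np

def d_loss(d_real, d_fake):
    # Discriminator: push its outputs on (clean, noisy) pairs toward 1
    # and on (enhanced, noisy) pairs toward 0 (least-squares GAN form).
    return 0.5 * float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def g_loss(d_fake, enhanced, clean, lambda_l1=100.0):
    # Generator: fool the discriminator, while an L1 term keeps the
    # enhanced waveform close to the clean target sample-by-sample.
    adv = 0.5 * float(np.mean((d_fake - 1.0) ** 2))
    return adv + lambda_l1 * float(np.mean(np.abs(enhanced - clean)))

x = np.zeros(16)                     # toy waveforms
perfect = g_loss(np.ones(4), x, x)   # discriminator fully fooled, zero L1
```

Operating directly on waveforms means the adversarial game, not a hand-designed spectral distance, decides what "realistic clean speech" looks like; the L1 term keeps training stable.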
In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by means of finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to be able to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over-smoothing problem of the regression model, and the dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificial synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods. Index Terms-Deep neural networks (DNNs), dropout, global variance equalization, noise aware training, noise reduction, non-stationary noise, speech enhancement.
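The global variance equalization mentioned above can be sketched as a post-processing rescaling of the enhanced features so their variance matches clean-speech training statistics, counteracting the regression model's over-smoothing. This per-dimension formulation is a common variant assumed for illustration, not necessarily the paper's exact one.

```python
import numpy as np

def gv_equalize(enhanced, clean_var):
    # Rescale each feature dimension of the enhanced log-spectral
    # features so its variance over time matches the clean statistics,
    # while leaving the per-dimension mean untouched.
    mean = enhanced.mean(axis=0, keepdims=True)
    enh_var = enhanced.var(axis=0) + 1e-12      # guard against zero variance
    scale = np.sqrt(clean_var / enh_var)
    return mean + (enhanced - mean) * scale

rng = np.random.default_rng(0)
clean_var = np.full(4, 2.0)                     # clean-speech feature variances
over_smoothed = rng.normal(scale=0.5, size=(100, 4))  # regression output: too flat
equalized = gv_equalize(over_smoothed, clean_var)
```

Restoring the variance sharpens spectral detail that MSE-trained regression tends to average away, which is why it helps against the muffled quality of over-smoothed output.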
Neural networks that learn directly from speech waveform samples, such as WaveNet and SampleRNN, have recently achieved synthetic speech of very high quality in terms of naturalness and speaker similarity, even in multi-speaker text-to-speech synthesis systems. Such neural networks are used as alternatives to vocoders, and they are therefore commonly called neural vocoders. A neural vocoder uses acoustic features as local conditioning parameters, and these parameters need to be accurately predicted by another acoustic model. However, it is not yet clear how such an acoustic model should be trained, which is problematic because the final quality of the synthesized speech is significantly affected by the performance of the acoustic model. Significant degradation occurs, in particular, when the predicted acoustic features have mismatched characteristics compared to natural ones. To reduce the mismatched characteristics between natural and generated acoustic features, we propose frameworks that incorporate either a conditional generative adversarial network (GAN) or its variant, Wasserstein GAN with gradient penalty (WGAN-GP), into multi-speaker speech synthesis that uses a WaveNet vocoder. We also extend the GAN frameworks and use the discretized-mixture-of-logistics loss of a well-trained WaveNet, in addition to the mean squared error and adversarial losses, as parts of the objective function. Experimental results show that acoustic models trained using the WGAN-GP framework with the back-propagated discretized-mixture-of-logistics (DML) loss achieve the highest subjective evaluation scores in terms of both quality and speaker similarity.
In the context of improving the noise robustness of automatic speech recognition (ASR) systems, we investigate the effectiveness of generative adversarial networks (GANs) for speech enhancement. Prior work has shown that GANs can effectively suppress additive noise in raw waveform speech signals, improving perceptual quality metrics; however, this technique has not been justified in the context of ASR. In this work, we conduct a detailed study to measure the effectiveness of GANs in enhancing speech contaminated by both additive and reverberant noise. Motivated by recent advances in image processing, we propose operating GANs on log-Mel filterbank spectra instead of waveforms, which requires less computation and is more robust to reverberant noise. While GAN enhancement improves the performance of a clean-trained ASR system on noisy speech, it falls short of the performance achieved by conventional multi-style training (MTR). By appending the GAN-enhanced features to the noisy inputs and retraining, we achieve a 7% relative WER improvement over the MTR system.
Recent literature has demonstrated promising results for training generative adversarial networks with a set of discriminators, in contrast to the traditional game involving one generator against a single adversary. These methods perform single-objective optimization over some simple merging of the losses, e.g., an arithmetic average. In this work, we revisit the multiple-discriminator setting by framing the simultaneous minimization of losses provided by different models as a multi-objective optimization problem. Specifically, we evaluate the performance of multiple gradient descent and of hypervolume maximization algorithms on a number of different datasets. Moreover, we argue that the previously proposed methods and hypervolume maximization can all be seen as variations of multiple gradient descent in which the direction of the update can be computed efficiently. Our results indicate that hypervolume maximization presents a better compromise between sample quality and computational cost than previous methods.
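A sketch of a hypervolume-style combination of K discriminator losses may make the contrast with the arithmetic mean concrete: instead of averaging the generator losses l_k, one minimizes -sum_k log(eta - l_k) for a nadir point eta exceeding every l_k, which automatically up-weights the discriminators the generator is currently losing against. This formulation is assumed for illustration and may differ in detail from the paper's.

```python
import numpy as np

def hypervolume_loss(losses, eta):
    # Single scalar objective over K per-discriminator generator losses.
    losses = np.asarray(losses, dtype=float)
    assert np.all(losses < eta), "nadir point must dominate every loss"
    return -float(np.sum(np.log(eta - losses)))

def loss_weights(losses, eta):
    # The gradient puts weight 1 / (eta - l_k) on each loss, so losses
    # close to the nadir point (hard discriminators) dominate the update,
    # unlike the uniform weights of a plain average.
    losses = np.asarray(losses, dtype=float)
    w = 1.0 / (eta - losses)
    return w / w.sum()

w = loss_weights([0.9, 0.5, 0.1], eta=1.0)
```

With these numbers the discriminator the generator is losing to (loss 0.9) receives roughly 10x the weight of the easiest one, which is the adaptive behavior an arithmetic mean cannot provide.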
This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture. In the DNN learning process, a large training set ensures a powerful modeling capability to estimate the complicated nonlinear mapping from observed noisy speech to desired clean signals. Acoustic context was found to improve the continuity of speech to be separated from the background noises successfully without the annoying musical artifact commonly observed in conventional speech enhancement algorithms. A series of pilot experiments were conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in a good generalization capability even in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach, the proposed DNN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 76.35% of the subjects were found to prefer the DNN-based enhanced speech to that obtained with the conventional technique. Index Terms-Deep neural networks, noise reduction, regression model, speech enhancement.
We investigate the use of generative adversarial networks (GANs) in speech dereverberation for robust speech recognition. GANs have recently been studied for speech enhancement to remove additive noise, but the examination of their ability in speech dereverberation is still lacking, and the advantages of using GANs have not been fully established. In this paper, we provide an in-depth study of using GAN-based dereverberation front-ends in ASR. First, we study the effectiveness of different dereverberation networks (the generator in a GAN) and find that LSTMs bring significant improvements compared to feed-forward DNNs and CNNs on our dataset. Second, further adding residual connections in the deep LSTMs also improves the performance. Finally, we find that, for the success of a GAN, it is important to update the generator and the discriminator using the same mini-batch data during training. Moreover, as in previous studies, using the reverberant spectrogram as a condition for the discriminator may degrade performance. In summary, compared with the baseline DNN dereverberation network, our GAN-based dereverberation front-end achieves a 14%-19% relative CER reduction when tested on a strong multi-condition-trained acoustic model.
In this paper, we propose a single-channel speech dereverberation system (DeReGAT) based on convolutional, bidirectional long short-term memory and deep feed-forward neural networks (CBLDNN) with generative adversarial training (GAT). To obtain better speech quality rather than merely minimizing a mean squared error (MSE), GAT is employed so that the dereverberated speech becomes indistinguishable from clean samples. In addition, our system can handle a wide range of reverberation conditions and adapts well to variant environments. Experimental results show that the proposed model outperforms weighted prediction error (WPE) and deep neural network-based systems. Furthermore, DeReGAT is also extended to an online speech dereverberation scenario, in which it reports performance comparable to the offline case.
Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of convolutional neural networks (CNNs) in SE, we propose an audio visual deep CNNs (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. We also propose a multi-task learning framework for reconstructing audio and visual signals at the output layer. Precisely speaking, the proposed AVDCNN model is structured as an audiovisual encoder-decoder network, in which audio and visual data are first processed using individual CNNs, and then fused into a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and parameters are jointly learned through back-propagation. We evaluate enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields a notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model also outperforms an existing audiovisual SE model, confirming its capability of effectively combining audio and visual information in SE. Index Terms-Audiovisual systems, deep convolutional neural networks, multimodal learning, speech enhancement.
Eliminating the negative effects of non-stationary environmental noise is a foundational research topic in automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-world acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
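Two of the compared training targets can be sketched directly from clean-speech and noise magnitude spectrograms S and N; beta = 0.5 is the commonly used IRM exponent, and the 0 dB local SNR criterion for the IBM is an illustrative choice.

```python
import numpy as np

def ideal_binary_mask(S, N, lc_db=0.0):
    # IBM: keep a time-frequency unit (1) when its local SNR exceeds
    # the local criterion lc_db, otherwise discard it (0).
    local_snr_db = 10.0 * np.log10((S ** 2 + 1e-12) / (N ** 2 + 1e-12))
    return (local_snr_db > lc_db).astype(float)

def ideal_ratio_mask(S, N, beta=0.5):
    # IRM: a soft mask in [0, 1] based on the speech-to-mixture
    # energy ratio in each time-frequency unit.
    return (S ** 2 / (S ** 2 + N ** 2 + 1e-12)) ** beta

S = np.array([[2.0, 0.1], [1.0, 1.0]])   # toy clean magnitude spectrogram
N = np.array([[1.0, 1.0], [1.0, 0.1]])   # toy noise magnitude spectrogram
ibm = ideal_binary_mask(S, N)
irm = ideal_ratio_mask(S, N)
```

The soft IRM values explain the study's finding: where speech and noise energies are comparable, the ratio mask attenuates gracefully instead of making the hard keep/discard decision that the IBM forces.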
In this paper we consider the problem of speech enhancement in real-world like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech quality in office environments where multiple stationary as well as non-stationary noises can be simultaneously present in speech. We propose several strategies based on Deep Neural Networks (DNN) for speech enhancement in these scenarios. We also investigate a DNN training strategy based on psychoacoustic models from speech coding for enhancement of noisy speech.
Thanks to the growing availability of spoofing databases and rapid advances in using them, systems for detecting voice spoofing attacks are becoming more and more capable, and error rates close to zero are being reached for the ASVspoof2015 database. However, speech synthesis and voice conversion paradigms that are not considered in the ASVspoof2015 database are appearing. Such examples include direct waveform modelling and generative adversarial networks. We also need to investigate the feasibility of training spoofing systems using only low-quality found data. For that purpose, we developed a generative adversarial network-based speech enhancement system that improves the quality of speech data found in publicly available sources. Using the enhanced data, we trained state-of-the-art text-to-speech and voice conversion models and evaluated them in terms of perceptual speech quality and speaker similarity. The results show that the enhancement models significantly improved the SNR of low-quality degraded data found in publicly available sources and that they significantly improved the perceptual cleanliness of the source speech without significantly degrading the naturalness of the voice. However, the results also show limitations when generating speech with the low-quality found data.
In recent years, adversarial methods have been widely used for data generation; however, such methods have not been widely used for classification training. In this paper, we propose an adversarial framework for classification training that can also handle imbalanced data. In effect, a network is trained by the adversarial approach to assign weights to samples of the majority class, making the resulting classification problem more challenging for the classifier and thereby improving its classification capability. Beyond the general imbalanced classification problem, the proposed method can also be used for problems such as graph representation learning, where it is desired to distinguish similar nodes from dissimilar ones. Experimental results on tasks such as imbalanced-data classification and graph link prediction show the superiority of the proposed method compared with state-of-the-art methods.
Generative models, in particular generative adversarial networks (GANs), have received significant attention recently. Many GAN variants have been proposed and used in numerous applications. Despite great theoretical progress, evaluating and comparing GANs remains a daunting task. While several measures have been introduced, there is currently no consensus as to which measure best captures the strengths and limitations of models and should be used for fair model comparison. As in other areas of computer vision and machine learning, it is critical to settle on one or a few good measures to steer the progress of this field. In this paper, I review and critically discuss more than 24 quantitative and 5 qualitative measures for evaluating generative models, with particular emphasis on GAN-derived models. I also provide a set of 7 desiderata and then evaluate whether a given measure, or family of measures, is compatible with them.