The adversarial loss in a conditional generative adversarial network (GAN) is not designed to directly optimize the evaluation metrics of a target task, and thus may not always guide the generator in a GAN to produce data with improved metric scores. To overcome this problem, we propose a novel MetricGAN approach that aims to optimize the generator with respect to one or multiple evaluation metrics. Moreover, based on MetricGAN, the metric scores of the generated data can also be arbitrarily specified by users. We tested the proposed MetricGAN on a speech enhancement task, which is particularly suitable for verifying the proposed approach because multiple metrics exist to measure different aspects of speech signals. Furthermore, these metrics are generally complex and cannot be fully optimized by Lp or conventional adversarial losses.
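As a rough illustration of the idea, the following PyTorch sketch trains the discriminator to regress a metric score rather than a real/fake label, then trains the generator toward a user-specified target score. The tiny networks, the feature size F, and `toy_metric` (a stand-in for a black-box metric such as PESQ or STOI) are assumptions of the sketch, not the paper's configuration.

```python
# A minimal MetricGAN-style training sketch (PyTorch). D is a learned surrogate
# of an evaluation metric Q, and G is pushed toward a target score t.
import torch
import torch.nn as nn

F = 257  # number of frequency bins (assumption for this sketch)
G = nn.Sequential(nn.Linear(F, 128), nn.ReLU(), nn.Linear(128, F), nn.Sigmoid())
D = nn.Sequential(nn.Linear(2 * F, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

def toy_metric(est, clean):  # placeholder for a non-differentiable metric in [0, 1]
    return 1.0 / (1.0 + torch.mean((est - clean) ** 2, dim=1, keepdim=True))

for step in range(100):
    noisy = torch.rand(8, F)
    clean = torch.rand(8, F)
    # --- D step: regress the true metric score of enhanced and clean speech ---
    with torch.no_grad():
        enhanced = G(noisy) * noisy          # mask-based enhancement
        q = toy_metric(enhanced, clean)      # black-box score, no gradient needed
    d_loss = ((D(torch.cat([enhanced, clean], 1)) - q) ** 2).mean() \
           + ((D(torch.cat([clean, clean], 1)) - 1.0) ** 2).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # --- G step: drive D's predicted score toward the target t (here 1.0) ---
    t = 1.0
    enhanced = G(noisy) * noisy
    g_loss = ((D(torch.cat([enhanced, clean], 1)) - t) ** 2).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```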
Nowadays, most objective speech quality assessment tools (e.g., the perceptual evaluation of speech quality (PESQ)) are based on comparing degraded/processed speech with its clean counterpart. Since a clean reference is usually unavailable, the need for a "golden" reference greatly limits the practicality of such assessment tools in real-world scenarios. On the other hand, humans can easily evaluate speech quality without any reference (e.g., in mean opinion score (MOS) tests), implying the existence of an objective and non-intrusive (no clean reference required) quality assessment mechanism. In this study, we propose a novel end-to-end, non-intrusive speech quality evaluation model, termed Quality-Net, based on bidirectional long short-term memory. The evaluation of utterance-level quality in Quality-Net is based on frame-level assessment. A frame constraint and a sensible initialization of the forget gate biases are used to learn meaningful frame-level quality assessment from utterance-level quality labels. Experimental results show that Quality-Net yields a high correlation with PESQ (0.9 for noisy speech and 0.84 for speech processed by speech enhancement). We believe that Quality-Net has the potential to be used in a wide variety of speech signal processing applications.
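A minimal sketch of the Quality-Net idea (PyTorch), assuming magnitude-spectrogram inputs: a BLSTM scores every frame, the utterance score is the average of frame scores, and training uses only the utterance-level label. The extra frame term below is one hedged reading of the paper's frame constraint, in which frame scores of clean utterances (label at the PESQ ceiling) are pulled toward the label.

```python
import torch
import torch.nn as nn

class QualityNet(nn.Module):
    def __init__(self, n_feat=257, hidden=100):
        super().__init__()
        self.blstm = nn.LSTM(n_feat, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # frame-level quality

    def forward(self, x):                     # x: (batch, frames, n_feat)
        h, _ = self.blstm(x)
        frame_q = self.head(h).squeeze(-1)    # (batch, frames)
        return frame_q.mean(dim=1), frame_q   # utterance score, frame scores

model = QualityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
feats = torch.rand(4, 120, 257)               # toy magnitude-spectrogram batch
pesq_label = torch.tensor([2.1, 3.3, 4.5, 1.7])

utt_q, frame_q = model(feats)
is_clean = (pesq_label >= 4.5).float().unsqueeze(1)   # assumption: 4.5 = ceiling
loss = ((utt_q - pesq_label) ** 2).mean() \
     + 0.5 * (is_clean * (frame_q - pesq_label.unsqueeze(1)) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()
```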
In real-world situations, speech reaching our ears is commonly corrupted by both room reverberation and background noise. These distortions are detrimental to speech intelligibility and quality, and also pose a serious problem to many speech-related applications, including automatic speech and speaker recognition. In order to deal with the combined effects of noise and reverberation, we propose a two-stage strategy to enhance corrupted speech, where denoising and dereverberation are conducted sequentially using deep neural networks. In addition, we design a new objective function that incorporates clean phase during model training to better estimate spectral magnitudes, which would in turn yield better phase estimates when combined with iterative phase reconstruction. The two-stage model is then jointly trained to optimize the proposed objective function. Systematic evaluations and comparisons show that the proposed algorithm improves objective metrics of speech intelligibility and quality substantially, and significantly outperforms previous one-stage enhancement systems.
We propose a deep neural network (DNN)-based source enhancement training method to increase objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks and are trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores have been widely used for quality evaluation, constructing DNNs to increase OSQA scores would be better suited to creating high-quality output signals than minimizing the MSE. However, since most OSQA scores are not analytically tractable, i.e., they are black boxes, the gradient of the objective function cannot be computed by simply applying back-propagation. To compute the gradient of an OSQA-based objective function, we formulate a DNN optimization scheme on the basis of black-box optimization, as used for training computers that play games. For the black-box optimization scheme, we adopt the policy gradient method to compute the gradient on the basis of a sampling algorithm. To simulate output signals with the sampling algorithm, the DNN is used to estimate the probability density function of the output signals so as to maximize the OSQA score. OSQA scores are computed from the simulated output signals, and the DNN is trained to increase the probability of generating simulated output signals that achieve high OSQA scores. Through several experiments, we found that OSQA scores increased significantly by applying the proposed method, even though the MSE was not minimized.
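A hedged REINFORCE-style sketch of this black-box training scheme (PyTorch): the DNN parameterizes a Gaussian over the enhanced spectrum, samples are scored by a non-differentiable OSQA stand-in, and the policy gradient raises the probability of high-scoring samples. `osqa_score`, the network sizes, and the moving-average baseline are assumptions of the sketch.

```python
import torch
import torch.nn as nn

F = 257
net = nn.Sequential(nn.Linear(F, 128), nn.ReLU(), nn.Linear(128, F))
log_sigma = torch.zeros(F, requires_grad=True)        # learned sampling noise
opt = torch.optim.Adam(list(net.parameters()) + [log_sigma], lr=1e-4)

def osqa_score(est, clean):   # black box: only evaluated, never differentiated
    return -torch.mean((est - clean) ** 2, dim=1)

baseline = 0.0
for step in range(100):
    noisy, clean = torch.rand(8, F), torch.rand(8, F)
    dist = torch.distributions.Normal(net(noisy), log_sigma.exp())
    sample = dist.sample()                            # simulated output signal
    reward = osqa_score(sample, clean)                # per-utterance score
    log_p = dist.log_prob(sample).sum(dim=1)
    loss = -((reward - baseline).detach() * log_p).mean()   # policy gradient
    opt.zero_grad(); loss.backward(); opt.step()
    baseline = 0.9 * baseline + 0.1 * reward.mean().item()  # variance reduction
```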
In contrast to the conventional minimum mean square error (MMSE)-based noise reduction techniques, we propose a supervised method to enhance speech by means of finding a mapping function between noisy and clean speech signals based on deep neural networks (DNNs). In order to be able to handle a wide range of additive noises in real-world situations, a large training set that encompasses many possible combinations of speech and noise types, is first designed. A DNN architecture is then employed as a nonlinear regression function to ensure a powerful modeling capability. Several techniques have also been proposed to improve the DNN-based speech enhancement system, including global variance equalization to alleviate the over-smoothing problem of the regression model, and the dropout and noise-aware training strategies to further improve the generalization capability of DNNs to unseen noise conditions. Experimental results demonstrate that the proposed framework can achieve significant improvements in both objective and subjective measures over the conventional MMSE based technique. It is also interesting to observe that the proposed DNN approach can well suppress highly nonstationary noise, which is tough to handle in general. Furthermore, the resulting DNN model, trained with artificial synthesized data, is also effective in dealing with noisy speech data recorded in real-world scenarios without the generation of the annoying musical artifact commonly observed in conventional enhancement methods. Index Terms-Deep neural networks (DNNs), dropout, global variance equalization, noise aware training, noise reduction, non-stationary noise, speech enhancement.
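A small sketch of the noise-aware training input described above: each frame's feature vector is augmented with a fixed noise estimate taken from the first few frames of the utterance (assumed noise-only). The frame count and feature size are assumptions of the sketch.

```python
import numpy as np

def noise_aware_features(log_power, n_noise_frames=6):
    """log_power: (frames, bins) log-power spectra of one noisy utterance."""
    noise_est = log_power[:n_noise_frames].mean(axis=0)         # (bins,)
    noise_tiled = np.tile(noise_est, (log_power.shape[0], 1))   # one copy per frame
    return np.concatenate([log_power, noise_tiled], axis=1)     # (frames, 2*bins)

feats = noise_aware_features(np.random.rand(100, 257))
print(feats.shape)  # (100, 514): original features plus the noise estimate
```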
This work proposes a new learning framework that uses a loss function in the frequency domain to train a convolutional neural network (CNN) in the time domain. At the training time, an extra operation is added after the speech enhancement network to convert the estimated signal in the time domain to the frequency domain. This operation is differentiable and is used to train the system with a loss in the frequency domain. This proposed approach replaces learning in the frequency domain, i.e., short-time Fourier transform (STFT) magnitude estimation, with learning in the original time domain. The proposed method is a spectral mapping approach in which the CNN first generates a time domain signal then computes its STFT that is used for spectral mapping. This way the CNN can exploit the additional domain knowledge about calculating the STFT magnitude from the time domain signal. Experimental results demonstrate that the proposed method substantially outperforms the other methods of speech enhancement. The proposed approach is easy to implement and applicable to related speech processing tasks that require spectral mapping or time-frequency (T-F) masking.
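A minimal sketch of this training setup (PyTorch): the network operates on raw waveforms, a differentiable STFT is appended at training time, and the loss is taken between the estimated and clean STFT magnitudes. The 1-D CNN here is a toy stand-in, not the paper's architecture.

```python
import torch
import torch.nn as nn

net = nn.Sequential(                      # toy time-domain enhancement network
    nn.Conv1d(1, 16, kernel_size=11, padding=5), nn.ReLU(),
    nn.Conv1d(16, 1, kernel_size=11, padding=5),
)

def stft_mag(wave, n_fft=512, hop=128):
    window = torch.hann_window(n_fft)
    spec = torch.stft(wave, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    return spec.abs()

noisy = torch.randn(4, 1, 16000)          # 1 s of 16 kHz audio (batch of 4)
clean = torch.randn(4, 1, 16000)
est = net(noisy).squeeze(1)               # time-domain estimate, (batch, samples)
loss = torch.mean((stft_mag(est) - stft_mag(clean.squeeze(1))) ** 2)
loss.backward()                           # gradients flow back through the STFT
```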
This letter presents a regression-based speech enhancement framework using deep neural networks (DNNs) with a multiple-layer deep architecture. In the DNN learning process, a large training set ensures a powerful modeling capability to estimate the complicated nonlinear mapping from observed noisy speech to desired clean signals. Acoustic context was found to improve the continuity of speech to be separated from the background noises successfully without the annoying musical artifact commonly observed in conventional speech enhancement algorithms. A series of pilot experiments were conducted under multi-condition training with more than 100 hours of simulated speech data, resulting in a good generalization capability even in mismatched testing conditions. When compared with the logarithmic minimum mean square error approach, the proposed DNN-based algorithm tends to achieve significant improvements in terms of various objective quality measures. Furthermore, in a subjective preference evaluation with 10 listeners, 76.35% of the subjects were found to prefer DNN-based enhanced speech to that obtained with the conventional technique. Index Terms-Deep neural networks, noise reduction, regression model, speech enhancement.
Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent success of convolutional neural networks (CNNs) in SE, we propose an audio-visual deep CNN (AVDCNN) SE model, which incorporates audio and visual streams into a unified network model. We also propose a multi-task learning framework for reconstructing audio and visual signals at the output layer. Precisely speaking, the proposed AVDCNN model is structured as an audiovisual encoder-decoder network, in which audio and visual data are first processed using individual CNNs, and then fused into a joint network to generate enhanced speech (the primary task) and reconstructed images (the secondary task) at the output layer. The model is trained in an end-to-end manner, and parameters are jointly learned through back-propagation. We evaluate enhanced speech using five instrumental criteria. Results show that the AVDCNN model yields a notably superior performance compared with an audio-only CNN-based SE model and two conventional SE approaches, confirming the effectiveness of integrating visual information into the SE process. In addition, the AVDCNN model also outperforms an existing audiovisual SE model, confirming its capability of effectively combining audio and visual information in SE. Index Terms-Audiovisual systems, deep convolutional neural networks, multimodal learning, speech enhancement.
Supervised learning based on deep neural networks has recently achieved substantial improvement in speech enhancement. A denoising network learns a mapping from noisy speech directly to clean speech, or to a spectral mask between the clean and noisy spectra. In either case, the network is optimized by minimizing the mean square error (MSE) between predefined labels and the network output of spectral or time-domain signals. However, existing schemes suffer from two critical issues: spectrum mismatch and metric mismatch. Spectrum mismatch is a well-known issue: in general, any spectral modification after the short-time Fourier transform (STFT) cannot be fully recovered after the inverse transform. Metric mismatch refers to the fact that the conventional MSE metric is sub-optimal for maximizing our target metrics, the signal-to-distortion ratio (SDR) and the perceptual evaluation of speech quality (PESQ). This paper presents a new end-to-end denoising framework aimed at joint SDR and PESQ optimization. First, the network optimization is performed on time-domain signals after the ISTFT to avoid spectrum mismatch. Second, two improved loss functions related to the SDR and PESQ metrics are proposed to minimize metric mismatch. Experimental results show that the proposed denoising scheme significantly improves both SDR and PESQ performance over existing methods.
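For concreteness, below is the standard differentiable negative-SDR surrogate often used for this kind of time-domain optimization (PyTorch); the paper's exact SDR- and PESQ-related losses may differ in detail.

```python
import torch

def neg_sdr_loss(est, ref, eps=1e-8):
    """est, ref: (batch, samples) time-domain signals after the ISTFT."""
    num = torch.sum(ref ** 2, dim=1)
    den = torch.sum((ref - est) ** 2, dim=1)
    sdr = 10.0 * torch.log10((num + eps) / (den + eps))
    return -sdr.mean()                    # minimizing negative SDR maximizes SDR

est = torch.randn(4, 16000, requires_grad=True)
ref = torch.randn(4, 16000)
loss = neg_sdr_loss(est, ref)
loss.backward()
```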
Speech separation based on deep neural networks (DNNs) has been widely studied recently, and has achieved considerable success. However, previous studies are mostly based on fully-connected neural networks. In order to capture the local information of speech signals, we propose to use convolutional maxout neural networks (CMNNs) to separate speech and noise by estimating the ideal ratio mask of the time-frequency units. In our work the proposed CMNN is applied in the frequency domain. By using local filtering and max-pooling, convolutional neural networks can model the local structure of speech signals. Instead of sigmoid function, maxout is selected to address the saturation problem. In addition, dropout is integrated into the network to get better generalization ability. The proposed system outperforms a traditional DNN-based system in both objective speech quality and intelligibility.
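A minimal maxout unit (PyTorch), sketched under the assumption of k linear "pieces" per output: each output takes the maximum over its pieces, which avoids the saturation of sigmoid activations mentioned above.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    def __init__(self, in_dim, out_dim, k=3):
        super().__init__()
        self.k, self.out_dim = k, out_dim
        self.linear = nn.Linear(in_dim, out_dim * k)

    def forward(self, x):
        z = self.linear(x).view(*x.shape[:-1], self.out_dim, self.k)
        return z.max(dim=-1).values       # elementwise max over the k pieces

layer = Maxout(257, 128)
print(layer(torch.rand(8, 257)).shape)    # torch.Size([8, 128])
```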
This paper proposes a novel framework that integrates audio and visual information for speech enhancement. Most speech enhancement approaches consider audio features only to design filters or transfer functions to convert noisy speech signals to clean ones. Visual data, which provide useful complementary information to audio data, have been integrated with audio data in many speech-related approaches to attain more effective speech processing performance. This paper presents our investigation into the use of the visual features of the motion of lips as additional visual information to improve the performance of deep neural network (DNN)-based speech enhancement. The experimental results show that the performance of DNN with audiovisual inputs exceeds that of DNN with audio inputs only in four standardized objective evaluations, thereby confirming the effectiveness of the inclusion of visual information into an audio-only speech enhancement framework.
Speech separation systems usually operate on the short-time Fourier transform (STFT) of noisy speech, and enhance only the magnitude spectrum while leaving the phase spectrum unchanged. This is done because there was a belief that the phase spectrum is unimportant for speech enhancement. Recent studies, however, suggest that phase is important for perceptual quality, leading some researchers to consider magnitude and phase spectrum enhancements. We present a supervised monaural speech separation approach that simultaneously enhances the magnitude and phase spectra by operating in the complex domain. Our approach uses a deep neural network to estimate the real and imaginary components of the ideal ratio mask defined in the complex domain. We report separation results for the proposed method and compare them to related systems. The proposed approach improves over other methods when evaluated with several objective metrics, including the perceptual evaluation of speech quality (PESQ), and a listening test where subjects prefer the proposed approach with at least a 69% rate. Index Terms-Complex ideal ratio mask, deep neural networks, speech quality, speech separation.
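A sketch of the complex ideal ratio mask (cIRM) training target: the complex ratio S/Y expanded into real and imaginary components. The bounded compression below follows common practice in this line of work; the constants K and C are assumptions of the sketch.

```python
import numpy as np

def cirm(S, Y, K=10.0, C=0.1, eps=1e-8):
    """S, Y: complex STFTs of clean and noisy speech (same shape)."""
    denom = Y.real ** 2 + Y.imag ** 2 + eps
    Mr = (Y.real * S.real + Y.imag * S.imag) / denom   # real component
    Mi = (Y.real * S.imag - Y.imag * S.real) / denom   # imaginary component
    # bounded compression so the DNN regresses values in (-K, K)
    compress = lambda m: K * (1 - np.exp(-C * m)) / (1 + np.exp(-C * m))
    return compress(Mr), compress(Mi)

S = np.random.randn(257, 100) + 1j * np.random.randn(257, 100)
Y = S + 0.5 * (np.random.randn(257, 100) + 1j * np.random.randn(257, 100))
Mr, Mi = cirm(S, Y)    # the DNN's two regression targets per T-F unit
```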
Supervised speech separation algorithms seldom utilize output patterns. This study proposes a novel recurrent deep stacking approach for time-frequency masking based speech separation, where the output context is explicitly employed to improve the accuracy of mask estimation. The key idea is to incorporate the estimated masks of several previous frames as additional inputs to better estimate the mask of the current frame. Rather than formulating it as a recurrent neural network (RNN), which is potentially much harder to train, we propose to train a deep neural network (DNN) with implicit deep stacking. The estimated masks of the previous frames are updated only at the end of each DNN training epoch, and then the updated estimated masks provide additional inputs to train the DNN in the next epoch. At the test stage, the DNN makes predictions sequentially in a recurrent fashion. In addition, we propose to use the L1 loss for training. Experiments on the CHiME-2 (task-2) dataset demonstrate the effectiveness of our proposed approach. Index Terms-deep stacking networks, recurrent neural networks, deep neural networks, speech separation
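A schematic sketch of this implicit deep stacking loop (PyTorch, kept runnable on toy tensors): cached mask estimates from the previous epoch are fed back as extra inputs, and the cache is refreshed only once per epoch. The sizes, the two-frame context, and the stand-in target are assumptions of the sketch.

```python
import torch
import torch.nn as nn

T, F, CTX = 200, 64, 2                    # frames, bins, past-mask context
dnn = nn.Sequential(nn.Linear(F + CTX * F, 256), nn.ReLU(),
                    nn.Linear(256, F), nn.Sigmoid())
opt = torch.optim.Adam(dnn.parameters(), lr=1e-3)

noisy = torch.rand(T, F)
ideal_mask = torch.rand(T, F)             # stand-in for the training target
cached = torch.zeros(T, F)                # last epoch's mask estimates

for epoch in range(5):
    inputs = []
    for t in range(T):                    # append the previous CTX cached masks
        past = [cached[max(t - d, 0)] for d in range(1, CTX + 1)]
        inputs.append(torch.cat([noisy[t]] + past))
    x = torch.stack(inputs)
    est = dnn(x)
    loss = torch.mean(torch.abs(est - ideal_mask))   # L1 loss, as in the text
    opt.zero_grad(); loss.backward(); opt.step()
    cached = est.detach()                 # update stacked inputs once per epoch
```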
Eliminating the negative effects of non-stationary environmental noise is a fundamental research topic in automatic speech recognition and still remains an important challenge. Data-driven supervised approaches, including those based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of unsupervised methods in various real-world acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutive degradation of speech, with the aim of providing guidelines for those involved in developing environmentally robust speech recognition systems. We separately discuss single-channel and multi-channel techniques developed for the front end and back end of speech recognition systems, as well as joint front-end and back-end training frameworks.
In this paper we consider the problem of speech enhancement in real-world-like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech quality in office environments where multiple stationary as well as non-stationary noises can be simultaneously present in speech. We propose several strategies based on Deep Neural Networks (DNN) for speech enhancement in these scenarios. We also investigate a DNN training strategy based on psychoacoustic models from speech coding for enhancement of noisy speech.
Formulation of speech separation as a supervised learning problem has shown considerable promise. In its simplest form, a supervised learning algorithm, typically a deep neural network, is trained to learn a mapping from noisy features to a time-frequency representation of the target of interest. Traditionally, the ideal binary mask (IBM) is used as the target because of its simplicity and large speech intelligibility gains. The supervised learning framework, however, is not restricted to the use of binary targets. In this study, we evaluate and compare separation results by using different training targets, including the IBM, the target binary mask, the ideal ratio mask (IRM), the short-time Fourier transform spectral magnitude and its corresponding mask (FFT-MASK), and the Gammatone frequency power spectrum. Our results in various test conditions reveal that the two ratio mask targets, the IRM and the FFT-MASK, outperform the other targets in terms of objective intelligibility and quality metrics. In addition, we find that masking based targets, in general, are significantly better than spectral envelope based targets. We also present comparisons with recent methods in non-negative matrix factorization and speech enhancement, which show clear performance advantages of supervised speech separation.
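A short sketch of the two mask targets most discussed above, computed per time-frequency unit from clean-speech and noise power spectra (NumPy; the shapes, local criterion, and exponent are assumptions of the sketch).

```python
import numpy as np

def ideal_binary_mask(S_pow, N_pow, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds a criterion, else 0."""
    snr_db = 10.0 * np.log10(S_pow / np.maximum(N_pow, 1e-12))
    return (snr_db > lc_db).astype(np.float32)

def ideal_ratio_mask(S_pow, N_pow, beta=0.5):
    """IRM: soft speech-to-mixture energy ratio, raised to a tunable power."""
    return (S_pow / (S_pow + N_pow + 1e-12)) ** beta

S_pow = np.random.rand(64, 100)           # clean speech power spectrogram
N_pow = np.random.rand(64, 100)           # noise power spectrogram
ibm, irm = ideal_binary_mask(S_pow, N_pow), ideal_ratio_mask(S_pow, N_pow)
```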
In this paper we propose the use of Long Short-Term Memory recurrent neural networks for speech enhancement. Networks are trained to predict clean speech as well as noise features from noisy speech features, and a magnitude domain soft mask is constructed from these features. Extensive tests are run on 73k noisy and reverberated utterances from the AudioVisual Interest Corpus of spontaneous, emotionally colored speech, degraded by several hours of real noise recordings comprising stationary and non-stationary sources and convolutive noise from the Aachen Room Impulse Response database. As a result, the proposed method is shown to provide superior noise reduction at low signal-to-noise ratios while creating very little artifacts at higher signal-to-noise ratios, thereby outperforming unsupervised magnitude domain spectral subtraction by a large margin in terms of source-distortion ratio.
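A sketch of the magnitude-domain soft mask construction described above: the network's speech and noise magnitude estimates are combined into a Wiener-like mask applied to the noisy magnitude (NumPy; toy shapes are assumptions).

```python
import numpy as np

def soft_mask(speech_est, noise_est, eps=1e-12):
    """speech_est, noise_est: non-negative magnitude estimates per T-F unit."""
    return speech_est / (speech_est + noise_est + eps)

noisy_mag = np.random.rand(257, 100)
speech_est = np.random.rand(257, 100)     # network output 1: speech magnitudes
noise_est = np.random.rand(257, 100)      # network output 2: noise magnitudes
enhanced_mag = soft_mask(speech_est, noise_est) * noisy_mag
```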
The human auditory cortex excels at selectively suppressing background noise to focus on a target speaker. The process of selective attention in the brain is known to contextually exploit the available audio and visual cues to better focus on the target speaker while filtering out other noises. In this study, we propose a novel deep neural network (DNN)-based audio-visual (AV) mask estimation model. The proposed AV mask estimation model contextually integrates the temporal dynamics of both audio and noise-immune visual features for improved mask estimation and speech separation. For optimal AV feature extraction and ideal binary mask (IBM) estimation, a hybrid DNN architecture is exploited to leverage the complementary strengths of stacked long short-term memory (LSTM) and convolutional LSTM networks. Comparative simulation results in terms of speech quality and intelligibility demonstrate significant performance improvements of our proposed AV mask estimation model over audio-only and visual-only mask estimation approaches, for both speaker-dependent and speaker-independent scenarios.
Numerous studies have investigated the effectiveness of neural network quantization on pattern classification tasks. This study, for the first time, investigates the performance of speech enhancement (a regression task in speech processing) using a novel exponent-only floating-point quantized neural network (EOFP-QNN). The proposed EOFP-QNN consists of two stages: mantissa quantization and exponent quantization. In the mantissa-quantization stage, the EOFP-QNN learns how to quantize the mantissa bits of the model parameters while preserving the regression accuracy with the least mantissa precision. In the exponent-quantization stage, the exponent part of the parameters is further quantized without causing any additional performance degradation. We evaluated the proposed EOFP quantization technique on two types of neural networks, namely a bidirectional long short-term memory (BLSTM) network and a fully convolutional neural network (FCN), on a speech enhancement task. Experimental results show that the model sizes can be significantly reduced (the sizes of the quantized BLSTM and FCN models are only 18.75% and 21.89% of those of the original models, respectively) while maintaining satisfactory speech enhancement performance.
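A hedged illustration of the mantissa-quantization idea: zeroing low-order mantissa bits of IEEE-754 float32 parameters while leaving the sign and exponent untouched. The paper learns how many bits to keep; here the count is fixed as an assumption.

```python
import numpy as np

def quantize_mantissa(weights, keep_bits=3):
    """Keep only the `keep_bits` most significant of the 23 mantissa bits."""
    raw = weights.astype(np.float32).view(np.uint32)
    shift = 23 - keep_bits
    mask = np.uint32((0xFFFFFFFF << shift) & 0xFFFFFFFF)  # clears low mantissa bits
    return (raw & mask).view(np.float32)

w = np.random.randn(5).astype(np.float32)
print(w)
print(quantize_mantissa(w, keep_bits=3))  # close to w, far fewer mantissa bits
```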
In this study, we explore long short-term memory recurrent neural networks (LSTM-RNNs) for speech enhancement. First, a regression LSTM-RNN approach for a direct mapping from the noisy to clean speech features is presented and verified to be more effective than deep neural network (DNN) based regression techniques in modeling long-term acoustic context. Then, a comprehensive comparison between the proposed direct mapping based LSTM-RNN and ideal ratio mask (IRM) based LSTM-RNNs is conducted. We observe that the direct mapping framework achieves better speech intelligibility at low signal-to-noise ratios (SNRs) while the IRM approach shows its superiority at high SNRs. Accordingly, to fully utilize this complementarity, a novel multiple-target joint learning approach is designed. The experiments under unseen noises show that the proposed framework can consistently and significantly improve the objective measures for both speech quality and intelligibility.
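A compact sketch of the multiple-target joint learning idea (PyTorch): one recurrent body with two heads, trained jointly on the direct-mapping target (clean features) and the IRM target. The equal loss weighting and the sizes are assumptions of the sketch.

```python
import torch
import torch.nn as nn

class JointLSTM(nn.Module):
    def __init__(self, n_feat=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_feat, hidden, batch_first=True)
        self.map_head = nn.Linear(hidden, n_feat)            # direct mapping
        self.irm_head = nn.Sequential(nn.Linear(hidden, n_feat), nn.Sigmoid())

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.map_head(h), self.irm_head(h)

model = JointLSTM()
noisy = torch.rand(4, 100, 257)
clean = torch.rand(4, 100, 257)           # direct-mapping target
irm = torch.rand(4, 100, 257)             # ideal ratio mask target
est_clean, est_irm = model(noisy)
loss = torch.mean((est_clean - clean) ** 2) + torch.mean((est_irm - irm) ** 2)
loss.backward()                           # both targets train the shared LSTM
```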