Progress in solving the cocktail party problem, i.e., separating the speech of multiple overlapping speakers, has recently accelerated with the invention of techniques such as deep clustering and permutation-free mask inference. These approaches typically focus on estimating target STFT magnitudes and ignore problems of phase inconsistency. In this paper, we explicitly integrate phase reconstruction into our separation algorithm using a loss function defined on time-domain signals. A deep neural network structure is defined by unfolding a phase reconstruction algorithm and treating each iteration as a layer in our network. Furthermore, instead of using fixed STFT/iSTFT time-frequency representations, we allow our network to learn a modified version of these representations from data. We compare several variants of these unfolded phase reconstruction networks, achieving state-of-the-art results on the publicly available wsj0-2mix dataset, and show improved performance when the STFT/iSTFT-like representations are allowed to adapt.
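Below is a minimal PyTorch sketch of the unfolding idea: a fixed estimated magnitude is combined with an iteratively refined phase by alternating iSTFT and STFT, with each iteration acting as one layer. The function name, STFT parameters, and the use of the mixture phase as initialization are illustrative assumptions, not the paper's exact configuration.

```python
import torch

def unfolded_phase_reconstruction(est_mag, mix_phase, n_iters=3,
                                  n_fft=512, hop=128):
    """est_mag, mix_phase: (freq, frames) tensors for one source."""
    window = torch.hann_window(n_fft, device=est_mag.device)
    spec = est_mag * torch.exp(1j * mix_phase)
    for _ in range(n_iters):                        # each iteration = one "layer"
        wav = torch.istft(spec, n_fft, hop, window=window)        # iSTFT layer
        phase = torch.angle(torch.stft(wav, n_fft, hop, window=window,
                                       return_complex=True))      # STFT layer
        spec = est_mag * torch.exp(1j * phase)      # keep magnitude, update phase
    return torch.istft(spec, n_fft, hop, window=window)
```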
Robust speech processing in multi-talker acoustic environments requires automatic speech separation. Although single-channel, speaker-independent speech separation methods have recently made great progress, the accuracy, latency, and computational cost of speech separation remain insufficient. Most previous methods formulate the separation problem through a time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the signal's phase and magnitude, the suboptimality of spectrogram representations for speech separation, and the long latency in computing the spectrogram. To address these shortcomings, we propose the Time-domain Audio Separation Network (TasNet), a deep learning autoencoder framework for time-domain speech separation. TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted back to the sound waveform using a linear decoder. The masks are found using a temporal convolutional network consisting of dilated convolutions, which allows the network to model long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in separating speakers from mixed audio, even when compared with the separation accuracy achieved using the speakers' ideal time-frequency masks. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents an important step toward making speech separation practical for real-world speech processing technologies.
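As a rough illustration of the encoder-mask-decoder structure described above, the following PyTorch sketch uses a 1-D convolutional encoder, a small convolutional stack standing in for the dilated temporal convolutional network, sigmoid masks, and a transposed-convolution decoder. All layer sizes and the simplified separator are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    def __init__(self, n_src=2, n_filters=256, kernel=16, stride=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        self.separator = nn.Sequential(               # stand-in for the TCN
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU(),
            nn.Conv1d(n_filters, n_src * n_filters, 1))
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel,
                                          stride=stride, bias=False)
        self.n_src, self.n_filters = n_src, n_filters

    def forward(self, mix):                           # mix: (batch, 1, time)
        feats = torch.relu(self.encoder(mix))         # non-negative representation
        masks = torch.sigmoid(self.separator(feats))  # one mask per source
        masks = masks.view(mix.size(0), self.n_src, self.n_filters, -1)
        srcs = [self.decoder(m * feats) for m in masks.unbind(dim=1)]
        return torch.stack(srcs, dim=1)               # (batch, n_src, 1, time)
```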
Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging, particularly in real-time, short-latency applications. Most methods attempt to construct a mask for each source in a time-frequency representation of the mixture signal, which is not necessarily an optimal representation for speech separation. In addition, time-frequency decomposition introduces inherent problems such as phase/magnitude decoupling and the long time window required to achieve sufficient frequency resolution. We propose the Time-domain Audio Separation Network (TasNet) to overcome these limitations. We directly model the signal in the time domain using an encoder-decoder framework and perform source separation on nonnegative encoder outputs. This method removes the frequency decomposition step and reduces the separation problem to the estimation of source masks on the encoder outputs, which are then synthesized by the decoder. Our system outperforms current state-of-the-art causal and non-causal speech separation algorithms, reduces the computational cost of speech separation, and significantly reduces the minimum required output latency. This makes TasNet suitable for applications where low-power, real-time implementation is desirable, such as in hearable and telecommunication devices.
We propose a novel deep learning training criterion, named permutation invariant training (PIT), for speaker-independent multi-talker speech separation, commonly known as the cocktail-party problem. Different from the multi-class regression technique and the deep clustering (DPCL) technique, our novel approach minimizes the separation error directly. This strategy effectively solves the long-standing label permutation problem that has prevented progress in deep learning based techniques for speech separation. We evaluated PIT on the WSJ0 and Danish mixed-speech separation tasks and found that it compares favorably to non-negative matrix factorization (NMF), computational auditory scene analysis (CASA), and DPCL, and generalizes well to unseen speakers and languages. Since PIT is simple to implement and can be easily integrated and combined with other advanced techniques, we believe improvements built upon PIT can eventually solve the cocktail-party problem.
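A minimal sketch of the PIT idea follows: the loss is computed under every possible speaker ordering and the smallest one is used for training. The mean-squared-error criterion and the utterance-level formulation are assumptions chosen here for brevity.

```python
import itertools
import torch
import torch.nn.functional as F

def pit_mse_loss(est, ref):
    """est, ref: (batch, n_src, time) separated and reference signals."""
    n_src = ref.size(1)
    losses = []
    for perm in itertools.permutations(range(n_src)):
        perm_est = est[:, list(perm)]                      # reorder the estimates
        losses.append(F.mse_loss(perm_est, ref, reduction='none')
                        .mean(dim=(1, 2)))                 # per-utterance loss
    losses = torch.stack(losses, dim=1)                    # (batch, n_perms)
    return losses.min(dim=1).values.mean()                 # keep best permutation
```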
Monaural source separation is important for many real-world applications. It is challenging because, with only a single channel of information available and without any constraints, an infinite number of solutions are possible. In this paper, we explore joint optimization of masking functions and deep recurrent neural networks for monaural source separation tasks, including monaural speech separation, monaural singing voice separation, and speech denoising. The joint optimization of the deep recurrent neural networks with an extra masking layer enforces a reconstruction constraint. Moreover, we explore a discriminative criterion for training neural networks to further enhance the separation performance. We evaluate the proposed system on the TSP, MIR-1K, and TIMIT datasets for speech separation, singing voice separation, and speech denoising tasks, respectively. Our approaches achieve 2.30-4.98 dB SDR gain compared to NMF models in the speech separation task, 2.30-2.48 dB GNSDR gain and 4.32-5.42 dB GSIR gain compared to existing models in the singing voice separation task, and outperform NMF and DNN baselines in the speech denoising task.
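The masking layer that enforces the reconstruction constraint can be sketched as follows: the network's two nonnegative outputs are converted into soft ratio masks that sum to one in every time-frequency bin and then applied to the mixture magnitude. The epsilon term and the two-source setting are illustrative assumptions.

```python
import torch

def ratio_mask_layer(y1, y2, mix_mag, eps=1e-8):
    """y1, y2: nonnegative network outputs; mix_mag: mixture magnitude (same shape)."""
    total = y1 + y2 + eps
    s1_hat = (y1 / total) * mix_mag      # reconstruction constraint:
    s2_hat = (y2 / total) * mix_mag      # the estimates sum back to the mixture magnitude
    return s1_hat, s2_hat
```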
Speech separation has been very successful with deep learning techniques. A substantial body of work has been reported based on the spectrogram, which is well known as the standard time-and-frequency cross-domain representation of speech signals. It is highly correlated with the phonetic structure of speech, or "how the speech sounds" as perceived by humans, but consists primarily of frequency-domain features that carry temporal behavior. Very impressive work achieving speech separation in the time domain has been reported recently, probably because waveforms in the time domain can describe the different realizations of speech more precisely than the spectrogram. In this paper, we propose a framework that properly integrates the above two directions, hoping to achieve both goals. We construct a time-and-frequency feature map by concatenating a 1-dim convolution-encoded feature map (for the time domain) with the spectrogram (for the frequency domain), which is then processed by a joint embedding network and clustering methods very similar to those used in prior time-domain and frequency-domain work. In this way, the information in the time and frequency domains, as well as the interactions between them, can be jointly considered during embedding and clustering. Very encouraging results (state-of-the-art to our knowledge) were obtained on the WSJ0-2mix dataset in preliminary experiments.
We address the problem of acoustic source separation in a deep learning framework we call "deep clustering." Rather than directly estimating signals or masking functions, we train a deep network to produce spectrogram embeddings that are discriminative for the partition labels given in the training data. Previous deep network approaches offer great advantages in learning power and speed, but it was previously unclear how to use them to separate signals in a class-independent way. In contrast, spectral clustering approaches are flexible with respect to the classes and number of items to be segmented, but it is unclear how to leverage the learning power and speed of deep networks. To obtain the best of both worlds, we train embeddings with an objective function that yields a low-rank approximation to an ideal pairwise affinity matrix, in a class-independent manner. This avoids the high cost of spectral factorization and instead produces compact clusters amenable to simple clustering methods. The segmentation is therefore implicitly encoded in the embeddings and can be "decoded" by clustering. Preliminary experiments show that the proposed method can separate speech: when trained on spectrogram features of mixtures containing two speakers, and tested on mixtures of a held-out set of speakers, it can infer masking functions that improve signal quality by about 6 dB. We show that the model can generalize to three-speaker mixtures despite being trained only on two-speaker mixtures. The framework can be used without class labels and therefore has the potential to be trained on different sound types and to generalize to novel sources. We hope that future work will lead to segmentation of arbitrary sounds, with extensions to microphone array methods as well as image segmentation and other domains.
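The objective described above can be written as the Frobenius distance between the affinity matrix of the learned embeddings V and that of the ideal assignments Y, ||VV^T - YY^T||_F^2, and it can be evaluated without forming the large N x N affinity matrices. A sketch, assuming unit-norm embeddings and one-hot assignments:

```python
import torch

def deep_clustering_loss(V, Y):
    """V: (N, D) unit-norm embeddings, Y: (N, C) one-hot source assignments,
    where N is the number of T-F bins.  Returns ||VV^T - YY^T||_F^2."""
    VtV = V.t() @ V            # (D, D)
    VtY = V.t() @ Y            # (D, C)
    YtY = Y.t() @ Y            # (C, C)
    return VtV.pow(2).sum() - 2 * VtY.pow(2).sum() + YtY.pow(2).sum()
```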
Despite the overwhelming success of deep learning in various speech processing tasks, the problem of separating simultaneous speakers in a mixture remains challenging. Two major difficulties in such systems are the arbitrary source permutation and the unknown number of sources in the mixture. We propose a novel deep learning framework for single-channel speech separation by creating attractor points in a high-dimensional embedding space of the acoustic signals which pull together the time-frequency bins corresponding to each source. Attractor points in this study are created by finding the centroids of the sources in the embedding space, which are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings. The proposed model is different from prior works in that it implements end-to-end training and does not depend on the number of sources in the mixture. Two strategies are explored at test time, K-means and fixed attractor points, where the latter requires no post-processing and can be implemented in real time. We evaluated our system on the Wall Street Journal dataset and show a 5.49% improvement over the previous state-of-the-art methods.
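A compact sketch of the attractor computation: during training the attractors are the centroids of each source's embeddings under the ideal assignments, and masks come from the similarity of every time-frequency embedding to each attractor. The softmax over sources is an assumption for illustration; the fixed-attractor test-time variant is not shown.

```python
import torch

def danet_masks(V, Y):
    """V: (N, D) embeddings of T-F bins, Y: (N, C) ideal binary assignments."""
    attractors = (Y.t() @ V) / (Y.t().sum(dim=1, keepdim=True) + 1e-8)  # (C, D) centroids
    similarity = V @ attractors.t()                                     # (N, C)
    return torch.softmax(similarity, dim=1)                             # per-bin masks
```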
Deep clustering is a recently introduced deep learning architecture that uses discriminatively trained embeddings as the basis for clustering. It was recently applied to spectrogram segmentation, resulting in impressive results on speaker-independent multi-speaker separation. In this paper we extend the baseline system with an end-to-end signal approximation objective that greatly improves performance on a challenging speech separation task. We first significantly improve upon the baseline system performance by incorporating better regularization, larger temporal context, and a deeper architecture, culminating in an overall improvement in signal-to-distortion ratio (SDR) of 10.3 dB compared to the baseline of 6.0 dB for two-speaker separation, as well as a 7.1 dB SDR improvement for three-speaker separation. We then extend the model to incorporate an enhancement layer to refine the signal estimates, and perform end-to-end training through both the clustering and enhancement stages to maximize signal fidelity. We evaluate the results using automatic speech recognition. The new signal approximation objective, combined with end-to-end training, produces unprecedented performance, reducing the word error rate (WER) from 89.1% down to 30.8%. This represents a major advancement towards solving the cocktail party problem.
This paper proposes an end-to-end approach for single-channel speaker-independent multi-speaker speech separation, where time-frequency (T-F) masking, the short-time Fourier transform (STFT), and its inverse are represented as layers within a deep network. Previous approaches, rather than computing a loss on the reconstructed signal, used a surrogate loss based on the target STFT magnitudes. This ignores reconstruction error introduced by phase inconsistency. In our approach, the loss function is directly defined on the reconstructed signals, which are optimized for best separation. In addition, we train through unfolded iterations of a phase reconstruction algorithm, represented as a series of STFT and inverse STFT layers. While mask values are typically limited to lie between zero and one for approaches using the mixture phase for reconstruction, this limitation is less relevant if the estimated magnitudes are to be used together with phase reconstruction. We thus propose several novel activation functions for the output layer of the T-F masking to allow mask values beyond one. On the publicly available wsj0-2mix dataset, our approach achieves state-of-the-art 12.6 dB scale-invariant signal-to-distortion ratio (SI-SDR) and 13.1 dB SDR, revealing new possibilities for deep learning based phase reconstruction and representing fundamental progress towards solving the notoriously hard cocktail party problem.
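For reference, the scale-invariant signal-to-distortion ratio (SI-SDR) reported above can be computed as follows; the same quantity is often used (negated) as a time-domain training objective. Zero-mean normalization of both signals and the epsilon terms are assumptions added here.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """est, ref: (..., time) estimated and reference waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    alpha = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps)
    target = alpha * ref                         # scaled reference (target component)
    noise = est - target                         # residual distortion
    ratio = target.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps)
    return 10 * torch.log10(ratio + eps)         # in dB
```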
We present a joint audio-visual model for isolating a single speech signal from a mixture of sounds such as other speakers and background noise. Solving this task using only audio as input is extremely challenging and does not provide an association between the separated speech signals and the speakers in the video. In this paper, we present a deep network-based model that combines visual and auditory signals to solve this task. The visual features are used to "focus" the audio on the desired speaker in a scene and to improve the audio separation quality. To train our joint audio-visual model, we introduce AVSpeech, a new dataset comprising thousands of hours of video segments from the web. We demonstrate the applicability of our method to classic speech separation tasks, as well as to real-world scenarios involving heated interviews, noisy bars, and screaming children, requiring only that the user specify the face of the person in the video whose speech they want to isolate. For mixed speech, our method shows a clear advantage over state-of-the-art audio-only speech separation. In addition, our model is speaker-independent (trained once, applicable to any speaker) and produces better results than recent audio-visual separation methods that are speaker-dependent (requiring a separate model to be trained for each speaker of interest).
Despite the recent success of deep learning for many speech processing tasks, single-microphone, speaker-independent speech separation remains challenging for two main reasons. The first reason is the arbitrary order of the target and masker speakers in the mixture (permutation problem), and the second is the unknown number of speakers in the mixture (output dimension problem). We propose a novel deep learning framework for speech separation that addresses both of these issues. We use a neural network to project the time-frequency representation of the mixture signal into a high-dimensional embedding space. A reference point (attractor) is created in the embedding space to represent each speaker, defined as the centroid of that speaker in the embedding space. The time-frequency embeddings of each speaker are then forced to cluster around the corresponding attractor point, which is used to determine the time-frequency assignment of the speaker. We propose three methods for finding the attractors for each source in the embedding space and compare their advantages and limitations. The objective function for the network is the standard signal reconstruction error, which enables end-to-end operation during both training and test phases. We evaluated our system using the Wall Street Journal dataset (WSJ0) on two- and three-speaker mixtures and report comparable or better performance than other state-of-the-art deep learning methods for speech separation.
Given the recent surge of developments in deep learning, this paper provides a review of state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveforms) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, and more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, namely audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key issues and open questions for deep learning applied to audio signal processing are identified.
This study investigates phase reconstruction for monaural talker-independent speaker separation in the short-time Fourier transform (STFT) domain. The key observation is that, for a mixture of two sources with accurately estimated magnitudes, and under a geometric constraint, the absolute phase difference between each source and the mixture can be uniquely determined; moreover, the phase of each source at every time-frequency (T-F) unit can be narrowed down to only two candidates. To pick the right candidate, we propose three algorithms based on iterative phase reconstruction, group delay estimation, and phase-difference prediction. State-of-the-art results are obtained on the publicly available wsj0-2mix and 3mix corpora.
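The geometric observation can be made concrete with the law of cosines: given the mixture magnitude and the two source magnitudes in a T-F unit, the absolute phase difference between source 1 and the mixture is fixed, leaving two candidate phases. A sketch follows; the clipping and epsilon are numerical safeguards added here, not part of the paper's formulation.

```python
import numpy as np

def candidate_phases(mix_mag, mix_phase, s1_mag, s2_mag, eps=1e-8):
    """Inputs are scalars or equally shaped arrays of T-F units (mixture = s1 + s2)."""
    cos_diff = (mix_mag**2 + s1_mag**2 - s2_mag**2) / (2 * mix_mag * s1_mag + eps)
    diff = np.arccos(np.clip(cos_diff, -1.0, 1.0))    # absolute phase difference
    return mix_phase + diff, mix_phase - diff         # the two candidate phases
```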
We address talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we decompose the multi-speaker separation task into the stages of simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers with a permutation-invariantly trained neural network. In the second stage, the frame-level separated spectra are sequentially grouped to different speakers by a clustering network. The proposed deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both objectives. Experimental results on the benchmark WSJ0-2mix database show that the new approach achieves state-of-the-art results with a modest model size.
Deep clustering is the first method to handle general audio separation scenarios with multiple sources of the same type and an arbitrary number of sources, performing impressively in speaker-independent speech separation tasks. However, little is known about its effectiveness in other challenging situations such as music source separation. Contrary to conventional networks that directly estimate the source signals, deep clustering generates an embedding for each time-frequency bin, and separates sources by clustering the bins in the embedding space. We show that deep clustering outperforms conventional networks on a singing voice separation task, in both matched and mismatched conditions, even though conventional networks have the advantage of end-to-end training for best signal approximation, presumably because its more flexible objective engenders better regularization. Since the strengths of deep clustering and conventional network architectures appear complementary, we explore combining them in a single hybrid network trained via an approach akin to multi-task learning. Remarkably, the combination significantly outperforms either of its components.
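A rough sketch of the hybrid, multi-task-style training follows: a deep clustering term on the embeddings is combined with a magnitude-spectrum mean-squared-error term from the conventional (mask-inference) head. The equal weighting, the MSE head, and the normalization by the number of bins are assumptions for illustration.

```python
import torch

def hybrid_loss(V, Y, est_mag, ref_mag, alpha=0.5):
    """V: (N, D) embeddings, Y: (N, C) ideal assignments,
    est_mag, ref_mag: (C, F, T) estimated / reference source magnitudes."""
    dc = ((V.t() @ V).pow(2).sum() - 2 * (V.t() @ Y).pow(2).sum()
          + (Y.t() @ Y).pow(2).sum()) / V.size(0)      # deep clustering term
    mi = (est_mag - ref_mag).pow(2).mean()             # mask-inference (MSE) term
    return alpha * dc + (1 - alpha) * mi
```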
In recent years, the performance of single-channel source separation algorithms has improved greatly with the development and deployment of neural networks. However, many such networks continue to operate on the magnitude spectrogram of the mixture and produce estimates of the source magnitude spectrograms in order to perform source separation. In this paper, we interpret these steps as additional neural network layers and propose end-to-end source separation networks that allow the separated speech waveforms to be estimated by operating directly on the waveform of the mixture. We also propose masking-based end-to-end separation networks that jointly optimize the masks and the latent representation of the mixture waveform. Compared with existing architectures in our experiments, these networks show significant improvements in separation performance. To train these end-to-end models, we investigate the use of composite cost functions derived from objective evaluation metrics measured on the waveform. We present subjective listening test results that demonstrate the improvements obtained with masking-based end-to-end networks, and also provide insight into how these cost functions affect end-to-end source separation performance.
Separating a singing voice from its musical accompaniment remains an important challenge in the field of music information retrieval. We propose a unique neural network approach inspired by a technique that has revolutionized computer vision: pixel-wise image classification, which we combine with a cross-entropy loss and pre-training of the CNN as an autoencoder on singing voice spectrograms. The pixel-wise classification technique directly estimates the sound source label for each time-frequency (T-F) bin of the spectrogram, thereby eliminating common pre- and post-processing tasks. The proposed network is trained using the Ideal Binary Mask (IBM) as the target output label. The IBM identifies the dominant sound source in each T-F bin of the magnitude spectrogram of the mixture signal, by treating each T-F bin as a pixel with a multi-label (one for each sound source). Cross-entropy is used as the training objective, so as to minimize the average probability error between the target and predicted label for each pixel. By treating the singing voice separation problem as a pixel-wise classification task, we also eliminate a commonly used but not well understood post-processing step: Wiener filter post-processing. When applied to the iKala dataset, the proposed CNN outperforms the winners of the Music Information Retrieval Evaluation eXchange (MIREX) 2016 and MIREX 2014 by 2.2702 to 5.9563 dB in global normalized source-to-distortion ratio (GNSDR). Experiments on the DSD100 dataset for the full-track song evaluation task also show that our model is able to compete with cutting-edge singing voice separation systems that use multi-channel modeling, data augmentation, and model blending.
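The ideal binary mask targets and per-bin cross-entropy can be sketched as below for the two-source (vocal/accompaniment) case; treating each time-frequency bin as an independently classified "pixel" is the core idea, while the binary formulation and the simple dominance threshold are simplifying assumptions made here.

```python
import torch
import torch.nn.functional as F

def ibm_targets(vocal_mag, accomp_mag):
    """1 where the vocal dominates the T-F bin, 0 otherwise."""
    return (vocal_mag > accomp_mag).float()

def pixel_ce_loss(logits, vocal_mag, accomp_mag):
    """logits: network output per T-F bin, same shape as the magnitude spectrograms."""
    targets = ibm_targets(vocal_mag, accomp_mag)
    return F.binary_cross_entropy_with_logits(logits, targets)
```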
Many real-world applications of speech enhancement, such as hearing aids and cochlear implants, desire real-time processing with no or low latency. In this paper, we propose a novel convolutional recurrent network (CRN) to address real-time monaural speech enhancement. We incorporate a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into the CRN architecture, which leads to a causal system that is naturally suitable for real-time processing. Moreover, the proposed model is noise- and speaker-independent, i.e. noise types and speakers can differ between training and test. Our experiments suggest that the CRN leads to consistently better objective intelligibility and perceptual quality than an existing LSTM-based model. Moreover, the CRN has far fewer trainable parameters.
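In the spirit of the description above, a toy CRN can be sketched with a convolutional encoder, an LSTM over the encoded frame sequence, and a simple linear mask head standing in for the convolutional decoder. Channel counts, kernel sizes, and the mask-based output are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TinyCRN(nn.Module):
    def __init__(self, freq_bins=161, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.ELU(),
            nn.Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 2), padding=(1, 1)),
            nn.ELU())
        enc_freq = ((freq_bins + 1) // 2 + 1) // 2        # freq dim after two stride-2 convs
        self.lstm = nn.LSTM(32 * enc_freq, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, freq_bins)

    def forward(self, spec):                              # spec: (batch, 1, time, freq)
        z = self.encoder(spec)                            # (batch, 32, time, enc_freq)
        b, c, t, f = z.shape
        z = z.permute(0, 2, 1, 3).reshape(b, t, c * f)    # frame sequence for the LSTM
        h, _ = self.lstm(z)
        mask = torch.sigmoid(self.proj(h))                # (batch, time, freq) mask
        return mask.unsqueeze(1) * spec                   # enhanced magnitude
```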
In this paper we present a convolutive basis decomposition method and its application to the separation of simultaneous speakers from monophonic recordings. The model we propose is a convolutive version of the non-negative matrix factorization algorithm. Due to the non-negativity constraint, this type of coding is very well suited for intuitively and efficiently representing magnitude spectra. We present results that reveal the nature of these basis functions and demonstrate their utility in separating monophonic mixtures of known speakers.
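A compact NumPy sketch of convolutive NMF with multiplicative updates, in the spirit of the method above: each basis has one matrix per time shift, and the activations are updated by averaging the per-shift rules. The KL-divergence-style updates, the averaging strategy, and all sizes are assumptions for illustration, not the paper's exact algorithm.

```python
import numpy as np

def shift(H, t):
    """Shift the columns of H to the right by t, zero-padding on the left."""
    if t == 0:
        return H
    out = np.zeros_like(H)
    out[:, t:] = H[:, :-t]
    return out

def conv_nmf(V, n_bases=20, n_shifts=8, n_iters=100, eps=1e-9, seed=0):
    """V: (freq, frames) nonnegative float magnitude spectrogram."""
    rng = np.random.default_rng(seed)
    F, N = V.shape
    W = rng.random((n_shifts, F, n_bases)) + eps       # one basis matrix per time shift
    H = rng.random((n_bases, N)) + eps
    for _ in range(n_iters):
        Lam = sum(W[t] @ shift(H, t) for t in range(n_shifts)) + eps   # current model
        R = V / Lam                                     # element-wise ratio V / approximation
        H_new = np.zeros_like(H)
        for t in range(n_shifts):
            Ht = shift(H, t)
            W[t] *= (R @ Ht.T) / (np.ones_like(V) @ Ht.T + eps)        # basis update
            Rt = np.zeros_like(R)                       # R shifted left by t
            Rt[:, :N - t] = R[:, t:]
            H_new += H * (W[t].T @ Rt) / (W[t].T @ np.ones_like(V) + eps)
        H = H_new / n_shifts                            # average the per-shift updates
    return W, H
```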