Music source separation denotes the task of extracting all the instruments from a given song. Recent breakthroughs on this challenge have gravitated around a single dataset, MUSDB, which is limited to four instrument classes. Larger datasets and more instruments are costly and time-consuming when it comes to collecting data and training deep neural networks (DNNs). In this work, we propose a fast method to evaluate the separability of the instruments in any dataset without training and tuning a DNN. This separability measure helps to select appropriate samples for the efficient training of neural networks. Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning methods such as TasNet or Open-Unmix. Our results contribute to revealing two essential points for audio source separation: 1) the ideal ratio mask, although light and simple, provides an accurate measure of the separability performance achievable by recent neural networks, and 2) new end-to-end learning methods such as TasNet, which operate directly on the waveform, actually build internal time-frequency (TF) representations, so that they encounter the same limitations as TF-based approaches when separating audio patterns that overlap in the TF plane.
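For a concrete picture of the oracle principle mentioned above, the following sketch applies an ideal ratio mask to a mixture and scores the result with a plain SDR; it is an illustration under our own naming (librosa/NumPy assumed), not the paper's code.

```python
# Minimal sketch of the ideal-ratio-mask (IRM) oracle; function names and the
# toy SDR computation are ours, not the paper's.
import numpy as np
import librosa

def ideal_ratio_mask_separation(sources, n_fft=2048, hop=512, eps=1e-8):
    """sources: list of time-domain stems whose sum is the mixture."""
    specs = [librosa.stft(s, n_fft=n_fft, hop_length=hop) for s in sources]
    mix_spec = sum(specs)
    mag_sum = sum(np.abs(S) for S in specs) + eps
    estimates = []
    for S in specs:
        irm = np.abs(S) / mag_sum           # oracle mask in [0, 1]
        est_spec = irm * mix_spec           # mask the mixture, keep its phase
        estimates.append(librosa.istft(est_spec, hop_length=hop))
    return estimates

def sdr(reference, estimate, eps=1e-8):
    """Plain signal-to-distortion ratio in dB (no alignment or filtering)."""
    n = min(len(reference), len(estimate))
    reference, estimate = reference[:n], estimate[:n]
    return 10 * np.log10(np.sum(reference**2) / (np.sum((reference - estimate)**2) + eps))
```

In this reading, the per-stem oracle SDR obtained this way acts as the cheap separability proxy that the abstract describes.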
Given the recent advances in music source separation and automatic mixing, removing audio effects from music tracks is a meaningful step toward developing automatic mixing systems. This paper focuses on removing distortion audio effects applied to guitar tracks in music production. We explore whether effect removal can be addressed with neural networks designed for source separation and audio effect modeling. Our approach proves particularly effective for effects that mix the processed and clean signals. Compared with a state-of-the-art solution based on sparse optimization, these models achieve better quality and faster inference. We show that the models are suitable not only for the distortion effect considered here but also for other types of distortion. In discussing the results, we highlight the usefulness of multiple evaluation metrics for assessing different aspects of the reconstruction in distortion effect removal.
Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
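To illustrate the encoder–mask–decoder structure described above, here is a deliberately simplified PyTorch sketch: a shallow dilated depthwise stack stands in for the full TCN, and all sizes are arbitrary rather than taken from the paper.

```python
# Simplified illustration of the encoder / mask / decoder idea in Conv-TasNet;
# a shallow dilated stack stands in for the full TCN and the sizes are arbitrary.
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    def __init__(self, n_src=2, n_filters=256, kernel=16, stride=8):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        self.separator = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(n_filters, n_filters, 3, padding=2**d, dilation=2**d,
                          groups=n_filters),          # dilated depthwise conv
                nn.Conv1d(n_filters, n_filters, 1),   # pointwise mixing
                nn.PReLU(),
            ) for d in range(4)
        ])
        self.mask_head = nn.Conv1d(n_filters, n_src * n_filters, 1)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, wav):                       # wav: (batch, samples)
        mix = self.encoder(wav.unsqueeze(1))      # learned spectrogram-like representation
        feats = self.separator(mix)
        masks = torch.sigmoid(self.mask_head(feats))
        masks = masks.view(wav.size(0), self.n_src, -1, mix.size(-1))
        est = masks * mix.unsqueeze(1)            # one mask per speaker
        est = est.flatten(0, 1)                   # (batch * n_src, filters, frames)
        out = self.decoder(est).squeeze(1)        # back to waveforms
        return out.view(wav.size(0), self.n_src, -1)
```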
Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they struggle to generalize to unseen sources. In this paper, we propose a three-component pipeline to train a universal audio source separator from a large but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we design a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode the queries that specify the audio targets for separation, allowing zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled training data. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen in training. To evaluate the separation performance, we test our model on MUSDB18 while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types held out from training. The model achieves source-to-distortion ratio (SDR) performance comparable to current supervised models in both cases.
We present a single-stage causal waveform-to-waveform multichannel model that can separate moving sound sources based on their broad spatial locations in a dynamic acoustic scene. We divide the scene into two spatial regions containing, respectively, the target and the interfering sound sources. The model is trained end-to-end and performs spatial processing implicitly, without any components based on traditional processing or hand-crafted spatial features. We evaluate the proposed model on a real-world dataset and show that it matches the performance of an oracle beamformer followed by a state-of-the-art single-channel enhancement network.
Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision. They have achieved great success in these domains in tasks such as machine translation and image generation. Due to their success, these data-driven techniques have been applied in the audio domain. More specifically, DNN models have been applied in the speech enhancement domain to achieve denoising, dereverberation, and multi-speaker separation in monaural speech enhancement. In this paper, we review some dominant DNN techniques employed to achieve speech separation. The review covers the whole speech enhancement pipeline, from feature extraction, to how DNN-based tools model both global and local features of speech, to model training (supervised and unsupervised). We also review the use of pre-trained speech-enhancement models to boost the speech enhancement process. The review is geared towards covering the dominant trends with regard to DNN application to the enhancement of speech obtained from a single speaker.
The marine ecosystem is changing at an alarming rate, exhibiting biodiversity loss and the migration of tropical species to temperate basins. Monitoring the underwater environments and their inhabitants is of fundamental importance to understand the evolution of these systems and implement safeguard policies. However, assessing and tracking biodiversity is often a complex task, especially in large and uncontrolled environments, such as the oceans. One of the most popular and effective methods for monitoring marine biodiversity is passive acoustics monitoring (PAM), which employs hydrophones to capture underwater sound. Many aquatic animals produce sounds characteristic of their own species; these signals travel efficiently underwater and can be detected even at great distances. Furthermore, modern technologies are becoming more and more convenient and precise, allowing for very accurate and careful data acquisition. To date, audio captured with PAM devices is frequently manually processed by marine biologists and interpreted with traditional signal processing techniques for the detection of animal vocalizations. This is a challenging task, as PAM recordings are often over long periods of time. Moreover, one of the causes of biodiversity loss is sound pollution; in data obtained from regions with loud anthropic noise, it is hard to separate the artificial from the fish sound manually. Nowadays, machine learning and, in particular, deep learning represents the state of the art for processing audio signals. Specifically, sound separation networks are able to identify and separate human voices and musical instruments. In this work, we show that the same techniques can be successfully used to automatically extract fish vocalizations in PAM recordings, opening up the possibility for biodiversity monitoring at a large scale.
We introduce AudioScopeV2, a state-of-the-art universal audio-visual on-screen sound separation system that is capable of learning to associate sounds with on-screen objects by looking at in-the-wild videos. We identify several limitations of previous work on audio-visual on-screen sound separation, including the coarse resolution of spatio-temporal attention, poor convergence of the audio separation model, limited variety of training and evaluation data, and failure to account for the trade-off between preserving on-screen sounds and suppressing off-screen sounds. We provide solutions to all of these issues. Our proposed cross-modal and self-attention network architectures capture audio-visual dependencies at a fine resolution over time, and we also propose efficient separable variants that are able to scale to longer videos without sacrificing much performance. We also find that pre-training the model on audio only greatly improves the results. For training and evaluation, we collected new human annotations of on-screen sounds from a large database of in-the-wild videos (YFCC100M). This new dataset is more diverse and challenging. Finally, we propose a calibration procedure that allows precise tuning of on-screen reconstruction versus off-screen suppression, greatly simplifying performance comparisons between models with different operating points. Overall, our experimental results show marked improvements in on-screen separation performance under more general conditions than previous methods, with minimal additional computational complexity.
Recent years have seen progress beyond domain-specific sound separation for speech or music towards universal sound separation for arbitrary sounds. Prior work on universal sound separation has investigated separating a target sound out of an audio mixture given a text query. Such text-queried sound separation systems provide a natural and scalable interface for specifying arbitrary target sounds. However, supervised text-queried sound separation systems require costly labeled audio-text pairs for training. Moreover, the audio provided in existing datasets is often recorded in a controlled environment, causing a considerable generalization gap to noisy audio in the wild. In this work, we aim to approach text-queried universal sound separation by using only unlabeled data. We propose to leverage the visual modality as a bridge to learn the desired audio-textual correspondence. The proposed CLIPSep model first encodes the input query into a query vector using the contrastive language-image pretraining (CLIP) model, and the query vector is then used to condition an audio separation model to separate out the target sound. While the model is trained on image-audio pairs extracted from unlabeled videos, at test time we can instead query the model with text inputs in a zero-shot setting, thanks to the joint language-image embedding learned by the CLIP model. Further, videos in the wild often contain off-screen sounds and background noise that may hinder the model from learning the desired audio-textual correspondence. To address this problem, we further propose an approach called noise invariant training for training a query-based sound separation model on noisy data. Experimental results show that the proposed models successfully learn text-queried universal sound separation using only noisy unlabeled videos, even achieving competitive performance against a supervised model in some settings.
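As a rough sketch of the query-conditioning idea (assuming the open-source `clip` package from https://github.com/openai/CLIP; the FiLM-conditioned masker is a placeholder of our own, not the CLIPSep architecture), an image embedding can condition the separator during training while a text embedding in the same joint space drives it zero-shot at test time:

```python
# Sketch of query-conditioned separation through a shared CLIP embedding space.
import torch
import torch.nn as nn
import clip

class QueryConditionedMasker(nn.Module):
    def __init__(self, n_freq=513, query_dim=512, hidden=256):
        super().__init__()
        self.film = nn.Linear(query_dim, 2 * hidden)      # scale and shift from the query
        self.pre = nn.Conv1d(n_freq, hidden, 1)
        self.post = nn.Conv1d(hidden, n_freq, 1)

    def forward(self, mix_mag, query_vec):                # mix_mag: (batch, n_freq, frames)
        gamma, beta = self.film(query_vec).chunk(2, dim=-1)
        h = self.pre(mix_mag)
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)   # FiLM conditioning
        return torch.sigmoid(self.post(h)) * mix_mag       # masked target magnitude

clip_model, preprocess = clip.load("ViT-B/32", device="cpu")
masker = QueryConditionedMasker()

# Training-time query: an image frame from the video, e.g.
# image_query = clip_model.encode_image(preprocess(frame).unsqueeze(0))

# Zero-shot test-time query: free-form text mapped into the same space.
text_query = clip_model.encode_text(clip.tokenize(["a person playing violin"])).float()
mix_mag = torch.rand(1, 513, 100)                          # dummy mixture magnitude
target_mag = masker(mix_mag, text_query)
```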
Recent research has shown remarkable performance in leveraging multiple extraneous conditional and non-mutually exclusive semantic concepts for sound source separation, allowing the flexibility to extract a given target source based on multiple different queries. In this work, we propose a new optimal condition training (OCT) method for single-channel target source separation, based on greedy parameter updates using the highest performing condition among equivalent conditions associated with a given target source. Our experiments show that the complementary information carried by the diverse semantic concepts significantly helps to disentangle and isolate sources of interest much more efficiently compared to single-conditioned models. Moreover, we propose a variation of OCT with condition refinement, in which an initial conditional vector is adapted to the given mixture and transformed to a more amenable representation for target source extraction. We showcase the effectiveness of OCT on diverse source separation experiments where it improves upon permutation invariant models with oracle assignment and obtains state-of-the-art performance in the more challenging task of text-based source separation, outperforming even dedicated text-only conditioned models.
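One way to read the "optimal condition" rule is as a per-update arg-min over the losses obtained with each equivalent condition; the sketch below is our schematic interpretation (batch-level for brevity, whereas the formulation is per target source), not the authors' code.

```python
# Schematic reading of optimal condition training (OCT): evaluate every
# equivalent condition for a target, keep only the best-performing one,
# and update parameters with that condition's loss.
import torch

def oct_step(model, mixture, target, condition_vectors, optimizer, loss_fn):
    """condition_vectors: (n_cond, cond_dim) equivalent queries for this target."""
    with torch.no_grad():
        losses = torch.stack([
            loss_fn(model(mixture, c.expand(mixture.size(0), -1)), target)
            for c in condition_vectors
        ])                                    # one scalar loss per condition
    best = losses.argmin()                    # greedy: pick the best condition
    optimizer.zero_grad()
    loss = loss_fn(model(mixture, condition_vectors[best].expand(mixture.size(0), -1)),
                   target)
    loss.backward()
    optimizer.step()
    return loss.item(), int(best)
```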
Convolution-augmented transformers (Conformers) have recently been proposed for various speech-domain applications, such as automatic speech recognition (ASR) and speech separation, as they can capture both local and global dependencies. In this paper, we propose a Conformer-based metric generative adversarial network (CMGAN) for speech enhancement (SE) in the time-frequency (TF) domain. The generator encodes the magnitude and complex spectrogram information using two-stage Conformer blocks to model both time and frequency dependencies. The decoder then decouples the estimation into a magnitude-mask decoder branch to filter out unwanted distortions and a complex refinement branch to further improve the magnitude estimation and implicitly enhance the phase information. In addition, we include a metric discriminator to alleviate metric mismatch by optimizing the corresponding evaluation score. Objective and subjective evaluations illustrate that CMGAN is able to show superior performance compared to state-of-the-art methods on three speech enhancement tasks (denoising, dereverberation, and super-resolution). For instance, quantitative denoising analysis on the Voice Bank+DEMAND dataset indicates that CMGAN outperforms previous models by a margin, i.e., a PESQ of 3.41 and an SSNR of 11.10 dB.
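The metric discriminator mentioned above can be pictured as a small network that regresses a normalized quality score (e.g., PESQ rescaled to [0, 1]) and then serves as a differentiable stand-in for that metric; the following is a generic MetricGAN-style sketch with a placeholder discriminator, not the CMGAN implementation.

```python
# Generic metric-GAN style losses: the discriminator D regresses a normalized
# quality score, and the generator is pushed toward the score a clean
# reference would receive (i.e., 1.0).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetricDiscriminator(nn.Module):           # tiny placeholder network
    def __init__(self, n_freq=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * n_freq, 128), nn.ReLU(),
                                 nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, clean_mag, enh_mag):      # inputs: (batch, n_freq), e.g. time-averaged
        return self.net(torch.cat([clean_mag, enh_mag], dim=-1))

def discriminator_loss(D, clean_mag, enhanced_mag, measured_score):
    """measured_score: the true metric (e.g., PESQ) rescaled to [0, 1]."""
    pred_clean = D(clean_mag, clean_mag)        # clean vs. itself -> ideal score
    pred_enh = D(clean_mag, enhanced_mag.detach())
    return F.mse_loss(pred_clean, torch.ones_like(pred_clean)) + \
           F.mse_loss(pred_enh, measured_score)

def generator_metric_loss(D, clean_mag, enhanced_mag):
    pred = D(clean_mag, enhanced_mag)           # gradients flow into the generator
    return F.mse_loss(pred, torch.ones_like(pred))

D = MetricDiscriminator()
clean, enh = torch.rand(4, 257), torch.rand(4, 257)
score = torch.rand(4, 1)                        # pretend PESQ already rescaled to [0, 1]
d_loss = discriminator_loss(D, clean, enh, score)
g_loss = generator_metric_loss(D, clean, enh)
```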
In this paper, we present a blockwise optimization method for masking-based networks (Bloom-Net) to train scalable speech enhancement networks. Here, we design our network with a residual learning scheme and train the internal separator blocks sequentially to obtain a scalable masking-based deep neural network for speech enhancement. Its scalability lets it adjust its run-time complexity to test-time resource constraints: once deployed, the model can dynamically change its complexity depending on the test-time environment. To this end, we modularize our models so that they can flexibly accommodate varying needs for enhancement performance and resource constraints, incurring minimal memory or training overhead due to the added scalability. Our experiments on speech enhancement demonstrate that the proposed blockwise optimization method achieves the desired scalability with only a slight performance degradation compared to the corresponding models trained end-to-end.
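The scalability described here can be pictured as a chain of masking blocks in which each newly trained block refines the output of the frozen blocks before it, and inference may stop after any block; the sketch below is our schematic rendering, not the Bloom-Net code.

```python
# Schematic block-wise sequential training of a scalable masking network:
# block k is trained while blocks 1..k-1 stay frozen, and at test time the
# chain can be truncated after any block to meet a compute budget.
import torch
import torch.nn as nn

def make_block(n_freq=257, hidden=128):
    return nn.Sequential(nn.Conv1d(n_freq, hidden, 1), nn.ReLU(),
                         nn.Conv1d(hidden, n_freq, 1))

class ScalableMasker(nn.Module):
    def __init__(self, n_blocks=3, n_freq=257):
        super().__init__()
        self.blocks = nn.ModuleList(make_block(n_freq) for _ in range(n_blocks))

    def forward(self, noisy_mag, n_active=None):
        n_active = len(self.blocks) if n_active is None else n_active
        logits = 0.0
        for block in self.blocks[:n_active]:
            logits = logits + block(noisy_mag)    # residual refinement of the mask
        return torch.sigmoid(logits) * noisy_mag

model = ScalableMasker()
for k in range(len(model.blocks)):                # sequential (block-wise) training
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.blocks[k].parameters():        # only the newest block learns
        p.requires_grad_(True)
    # ... run the usual enhancement loss on model(x, n_active=k + 1) here ...
```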
Recent work on monaural source separation has shown that performance can be improved by using fully learned filterbanks with short windows. On the other hand, it is widely known that the performance of conventional beamforming techniques increases with longer analysis windows. This also holds for most hybrid neural beamforming methods, which rely on a deep neural network (DNN) to estimate the spatial covariance matrices. In this work, we try to bridge the gap between these two worlds and explore fully end-to-end hybrid neural beamforming in which, instead of using the short-time Fourier transform, the analysis and synthesis filterbanks are jointly learned with the DNN. In detail, we explore two different types of learned filterbanks: fully learned and analytic. We perform a detailed analysis using data from the recent Clarity Challenge and show that, by using learned filterbanks, it is possible to surpass oracle-mask-based beamforming with short windows.
On-device directional hearing requires separating an audio source from a given direction while meeting stringent, humanly imperceptible latency requirements. While neural networks can achieve significantly better performance than traditional beamformers, all existing models fall short of supporting low-latency causal inference on computationally constrained wearables. We present a hybrid model that combines a traditional beamformer with a custom lightweight neural network. The former reduces the computational burden of the latter and also improves its generalizability, while the latter is designed to further reduce the memory and computational overhead to enable real-time, low-latency operation. Our evaluation shows performance comparable to state-of-the-art causal inference models on synthetic data while achieving a 5x reduction in model size, a 4x reduction in computations per second, a 5x reduction in processing time, and better generalization to real hardware data. Furthermore, our real-time hybrid model runs in 8 ms on mobile CPUs designed for low-power wearable devices and achieves an end-to-end latency of 17.5 ms.
We consider the problem of audio voice separation for binaural applications such as earphones and hearing aids. While today's neural networks perform remarkably well (separating 4+ sources with 2 microphones), they assume a known or fixed maximum number of sources, K, and a generic human head shape. This paper intends to relax both of these constraints at the expense of a slight alteration in the problem definition. We observe that, when the received mixture contains too many sources, it is still helpful to separate them by region, i.e., to isolate the signal mixture from each conical sector around the user's head. This requires learning the fine-grained spatial properties of each region, including the signal distortions imposed by the human head. We propose a two-stage self-supervised framework in which sounds overheard by the earphones are pre-processed to extract relatively clean personalized signals, which are then used to train a region-wise separation model. The results show promising performance, underscoring the importance of personalization over a generic supervised approach. (Audio samples are available on our project website: https://uiuc-earable-computing.github.io/binaural/.) We believe this result could help real-world applications in selective hearing, noise cancellation, and audio augmented reality.
Source separation models operate either on the spectrogram or on the waveform domain. In this work, we show how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source, and even combining both. The proposed hybrid version of the Demucs architecture won the Music Demixing Challenge 2021 organized by Sony. This architecture also comes with additional improvements, such as compressed residual branches, local attention, and singular-value regularization. Overall, a 1.4 dB improvement of the signal-to-distortion ratio (SDR) is observed across all sources as measured on the MUSDB HQ dataset, an improvement confirmed by human subjective evaluation, with an overall quality rated at 2.83 out of 5 (2.36 for the non-hybrid Demucs), and absence of contamination rated at 3.04 (against the non-hybrid Demucs and 2.44 for the second-ranking model submitted at the competition).
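For reference, the SDR figures quoted here (and elsewhere in this list) are, in their simplest form, the log energy ratio between the reference source and the estimation error; evaluation toolkits may add chunking or distortion-filter alignment on top of this basic definition.

```latex
\mathrm{SDR}(s,\hat{s}) \;=\; 10\,\log_{10}\frac{\lVert s \rVert^{2}}{\lVert s-\hat{s} \rVert^{2}}\ \text{dB}.
```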
Recently, diffusion-based generative models have been introduced to the task of speech enhancement. The corruption of clean speech is modeled as a fixed forward process in which increasing amounts of noise are gradually added. Clean speech is generated by learning to reverse this process in an iterative fashion, conditioned on the noisy input. We build upon our previous work and derive the training task within the formalism of stochastic differential equations. We present a detailed theoretical review of the underlying score-matching objective and explore different sampler configurations for solving the reverse process at test time. By using a sophisticated network architecture from the natural image generation literature, we significantly improve performance compared to previous publications. We also show that we can compete with recent discriminative models and achieve better generalization when evaluating on a corpus different from the one used for training. We complement the evaluation results with a subjective listening test in which our proposed method is rated best. Furthermore, we show that the proposed method achieves remarkable state-of-the-art performance in single-channel speech dereverberation. Our code and audio examples are available online; see https://uhh.de/inf-sp-sgmse
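As a hedged reminder of the SDE formalism invoked above (generic conditional score-based notation; the exact drift and diffusion terms used in this work may differ), the forward corruption and its learned reverse process take the form

```latex
\mathrm{d}x_t = f(x_t, y)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t,
\qquad
\mathrm{d}x_t = \bigl[ f(x_t, y) - g(t)^{2}\, s_\theta(x_t, y, t) \bigr]\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}_t,
```

where $y$ is the noisy recording on which the process is conditioned, $w_t$ and $\bar{w}_t$ are forward- and reverse-time Wiener processes, and $s_\theta$ is trained with a score-matching objective to approximate $\nabla_{x_t}\log p_t(x_t \mid y)$.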
Music discovery services let users identify songs from short mobile recordings. These solutions are often based on Audio Fingerprinting (AFP), and rely more specifically on the extraction of spectral peaks in order to be robust to a number of distortions. Little work has been done to study the robustness of these algorithms to background noise captured in real environments. In particular, AFP systems still struggle when the signal-to-noise ratio is low, i.e., when the background noise is strong. In this project, we tackle this problem with Deep Learning. We test a new hybrid strategy which consists of inserting a denoising DL model in front of a peak-based AFP algorithm. We simulate noisy music recordings using a realistic data augmentation pipeline, and train a DL model to denoise them. The denoising model limits the impact of background noise on the AFP system's extracted peaks, improving its robustness to noise. We further propose a novel loss function to adapt the DL model to the considered AFP system, increasing its precision in terms of retrieved spectral peaks. To the best of our knowledge, this hybrid strategy has not been tested before.
Radar sensors are gradually becoming widespread equipment on road vehicles, playing a crucial role in autonomous driving and road safety. The broad adoption of radar sensors increases the chance of interference among sensors from different vehicles, generating corrupted range profiles and range-Doppler maps. In order to extract the distance and velocity of multiple targets from range-Doppler maps, the interference affecting each range profile needs to be mitigated. In this paper, we propose a fully convolutional neural network for automotive radar interference mitigation. To train our network on realistic scenarios, we introduce a new dataset of realistic automotive radar signals with multiple targets and multiple interferers. To our knowledge, we are the first to apply weight pruning in the automotive radar domain, obtaining superior results compared to the widely-used dropout. While most previous works successfully estimate the magnitude of automotive radar signals, we propose a deep learning model that can accurately estimate the phase. For instance, our novel approach halves the phase estimation error with respect to the commonly adopted zeroing technique, from 12.55 degrees to 6.58 degrees. Considering the lack of databases for automotive radar interference mitigation, we release as open source our large-scale dataset, which closely replicates real-world automotive scenarios with multiple interference cases, allowing others to objectively compare their future work in this domain. Our dataset is available for download at: http://github.com/ristea/arim-v2.
Music source separation (MSS) has shown active progress with deep learning models in recent years. Many MSS models perform separation on spectrograms by estimating bounded ratio masks and reusing the phase of the mixture. When using convolutional neural networks (CNNs), weights are usually shared across the whole spectrogram during convolution, regardless of the different patterns between frequency bands. In this study, we propose a new MSS model, the channel-wise subband phase-aware ResUNet (CWS-PResUNet), to decompose signals into subbands and estimate an unbound complex ideal ratio mask (cIRM) for each source. CWS-PResUNet utilizes a channel-wise subband (CWS) feature to limit unnecessary global weight-sharing on the spectrogram and to reduce computational resource consumption. The saved computational cost and memory can in turn allow for a larger architecture. On the MUSDB18HQ test set, we propose a 276-layer CWS-PResUNet and achieve state-of-the-art (SOTA) performance on vocals with an 8.92 signal-to-distortion ratio (SDR) score. By combining CWS-PResUNet and Demucs, our ByteMSS system ranks 2nd on the vocals score and 5th on the average score in the 2021 ISMIR Music Demixing (MDX) Challenge limited training data track (leaderboard A). Our code and pre-trained models are publicly available at: https://github.com/haoheliu/2021-ismir-mss-challenge-cws-presunet
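For readers unfamiliar with the mask referred to above, a complex ideal ratio mask is simply the complex ratio between the source and mixture STFTs, so applying it corrects both magnitude and phase; the snippet below (generic NumPy, with a toy version of the channel-wise subband folding, not the CWS-PResUNet code) makes both ideas concrete.

```python
# Generic complex ideal ratio mask (cIRM) and a toy channel-wise subband fold.
import numpy as np

def complex_ideal_ratio_mask(source_stft, mix_stft, eps=1e-8):
    """Unbounded complex ratio of the source STFT to the mixture STFT."""
    return source_stft / (mix_stft + eps)

def apply_mask(mix_stft, cirm):
    return cirm * mix_stft    # complex multiplication: corrects magnitude and phase

def channel_wise_subband(spec, n_subbands=4):
    """Fold frequency sub-bands into the channel axis so that convolution
    weights are no longer shared across the whole frequency range.
    spec: ndarray of shape (batch, channels, freq, time), freq divisible by n_subbands."""
    b, c, f, t = spec.shape
    return spec.reshape(b, c * n_subbands, f // n_subbands, t)
```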