Conditioned source separation has attracted significant attention because of its flexibility, applicability, and extensibility. Its performance was usually inferior to that of existing approaches, such as single-source separation models. However, a recently proposed method called LaSAFT-Net has shown that conditioned models can reach performance comparable to that of existing single-source separation models. This paper presents LightSAFT-Net, a lightweight version of LaSAFT-Net. As a baseline, it provided sufficient SDR performance for comparison during the Music Demixing Challenge at ISMIR 2021. This paper also enhances the existing LightSAFT-Net by replacing the LightSAFT blocks in its encoder with TFC-TDF blocks. Our enhanced LightSAFT-Net outperforms the previous one with fewer parameters.
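A common way to realize such conditioning, and a reasonable mental model for LaSAFT-style networks, is to modulate intermediate features with an embedding of the requested source. The sketch below is illustrative only (FiLM-style modulation with made-up sizes), not the paper's exact block:

```python
import torch
import torch.nn as nn

class FiLMConditionedBlock(nn.Module):
    """Scale/shift intermediate features by an embedding of the target source."""
    def __init__(self, channels: int, num_sources: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gamma = nn.Embedding(num_sources, channels)  # per-source scale
        self.beta = nn.Embedding(num_sources, channels)   # per-source shift

    def forward(self, x: torch.Tensor, source_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time); source_id: (batch,)
        h = torch.relu(self.conv(x))
        g = self.gamma(source_id)[:, :, None, None]
        b = self.beta(source_id)[:, :, None, None]
        return g * h + b

block = FiLMConditionedBlock(channels=16, num_sources=4)
spec = torch.randn(2, 16, 128, 64)       # mixture spectrogram features
out = block(spec, torch.tensor([0, 2]))  # e.g. request "vocals" vs. "bass"
```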
In recent years, neural-network-based methods have been proposed for learning representations of music, but these representations are not human-readable and are hardly analyzable by humans. To address this issue, we propose a novel method for learning source-aware latent representations of music through a vector-quantized variational autoencoder (VQ-VAE). We train our VQ-VAE to encode an input mixture into a tensor of integers in a discrete latent space, and design it to have a decomposed structure that allows humans to manipulate the latent vectors in a source-aware manner. This paper also shows that we can generate bass lines by estimating latent vectors in the discrete space.
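For intuition, a minimal vector-quantization layer of the kind a VQ-VAE uses is sketched below; the codebook size, latent width, and straight-through trick are standard VQ-VAE ingredients, not specifics of this paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Map continuous encoder outputs to their nearest codebook entries."""
    def __init__(self, num_codes: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, dim) continuous latents from the encoder
        w = self.codebook.weight                       # (num_codes, dim)
        dist = (z_e.pow(2).sum(-1, keepdim=True)       # squared distances
                - 2 * z_e @ w.T + w.pow(2).sum(-1))
        codes = dist.argmin(dim=-1)                    # the integer latent tensor
        z_q = self.codebook(codes)
        # straight-through: gradients reach the encoder as if quantization were identity
        z_q_st = z_e + (z_q - z_e).detach()
        commit_loss = F.mse_loss(z_e, z_q.detach())
        return z_q_st, codes, commit_loss

vq = VectorQuantizer(num_codes=512, dim=64)
z_q, codes, loss = vq(torch.randn(2, 100, 64))  # `codes` is human-inspectable
```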
Deep learning techniques for separating audio into different sound sources face several challenges. Standard architectures require training separate models for different types of audio sources. Although some universal separators employ a single model to target multiple sources, they have difficulty generalizing to unseen sources. In this paper, we propose a three-component pipeline for training a universal audio source separator from a large, but weakly-labeled dataset: AudioSet. First, we propose a transformer-based sound event detection system for processing weakly-labeled training data. Second, we devise a query-based audio separation model that leverages this data for model training. Third, we design a latent embedding processor to encode queries that specify the audio targets for separation, allowing for zero-shot generalization. Our approach uses a single model for source separation of multiple sound types, and relies solely on weakly-labeled data for training. In addition, the proposed audio separator can be used in a zero-shot setting, learning to separate types of audio sources that were never seen during training. To evaluate the separation performance, we test our model on MUSDB18 while training on the disjoint AudioSet. We further verify the zero-shot performance by conducting another experiment on audio source types held out from training. In both scenarios, the model achieves Source-to-Distortion Ratio (SDR) performance comparable to current supervised models.
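The query-conditioning idea can be pictured with a small sketch: a query embedding (from a label or a reference clip of a possibly unseen class) multiplicatively gates the mixture features before a mask is predicted. All module names and sizes below are assumptions for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class QueryConditionedSeparator(nn.Module):
    """Estimate a mask for whatever source the query embedding describes."""
    def __init__(self, n_freq: int, query_dim: int, hidden: int = 256):
        super().__init__()
        self.mix_proj = nn.Linear(n_freq, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        self.mask_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_spec, query_emb):
        # mix_spec: (batch, time, n_freq); query_emb: (batch, query_dim)
        h = self.mix_proj(mix_spec) * self.query_proj(query_emb).unsqueeze(1)
        return self.mask_head(h) * mix_spec  # masked spectrogram of the target

sep = QueryConditionedSeparator(n_freq=513, query_dim=128)
target = sep(torch.rand(2, 100, 513), torch.randn(2, 128))
```

Because the condition is an embedding rather than a fixed class index, a query computed from an example of an unseen source type can, in principle, drive the same separator zero-shot.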
The marine ecosystem is changing at an alarming rate, exhibiting biodiversity loss and the migration of tropical species to temperate basins. Monitoring the underwater environments and their inhabitants is of fundamental importance to understand the evolution of these systems and implement safeguard policies. However, assessing and tracking biodiversity is often a complex task, especially in large and uncontrolled environments, such as the oceans. One of the most popular and effective methods for monitoring marine biodiversity is passive acoustics monitoring (PAM), which employs hydrophones to capture underwater sound. Many aquatic animals produce sounds characteristic of their own species; these signals travel efficiently underwater and can be detected even at great distances. Furthermore, modern technologies are becoming more and more convenient and precise, allowing for very accurate and careful data acquisition. To date, audio captured with PAM devices is frequently manually processed by marine biologists and interpreted with traditional signal processing techniques for the detection of animal vocalizations. This is a challenging task, as PAM recordings are often over long periods of time. Moreover, one of the causes of biodiversity loss is sound pollution; in data obtained from regions with loud anthropic noise, it is hard to separate the artificial from the fish sound manually. Nowadays, machine learning and, in particular, deep learning represents the state of the art for processing audio signals. Specifically, sound separation networks are able to identify and separate human voices and musical instruments. In this work, we show that the same techniques can be successfully used to automatically extract fish vocalizations in PAM recordings, opening up the possibility for biodiversity monitoring at a large scale.
Music source separation (MSS) has shown active progress with deep learning models in recent years. Many MSS models perform separation on spectrograms by estimating bounded ratio masks and reusing the phase of the mixture. When using convolutional neural networks (CNNs), weights are usually shared within a spectrogram during convolution, regardless of the different patterns between frequency bands. In this study, we propose a new MSS model, the channel-wise subband phase-aware ResUNet (CWS-PResUNet), which decomposes signals into subbands and estimates an unbound complex ideal ratio mask (cIRM) for each source. CWS-PResUNet utilizes a channel-wise subband (CWS) feature to limit unnecessary global weight sharing on the spectrogram and to reduce computational resource consumption. The saved computational cost and memory can in turn allow for a larger architecture. On the MUSDB18HQ test set, we propose a 276-layer CWS-PResUNet and achieve state-of-the-art (SOTA) performance on vocals with an 8.92 signal-to-distortion ratio (SDR) score. By combining CWS-PResUNet and Demucs, our ByteMSS system ranked 2nd on vocals score and 5th on average score in the 2021 ISMIR Music Demixing (MDX) Challenge limited training data track (leaderboard A). Our code and pre-trained models are publicly available at: https://github.com/haoheliu/2021-ismir-mss-challenge-cws-presunet
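The channel-wise subband (CWS) feature itself is a simple tensor rearrangement: the frequency axis is cut into subbands that are stacked along the channel axis, so each subband gets its own convolution weights. A minimal sketch (band count and shapes are illustrative):

```python
import torch

def to_channel_wise_subbands(spec: torch.Tensor, num_subbands: int) -> torch.Tensor:
    """Fold frequency subbands into the channel axis.

    spec: (batch, channels, freq, time) with freq divisible by num_subbands.
    Returns (batch, channels * num_subbands, freq // num_subbands, time), so a
    CNN no longer shares one set of weights across the whole frequency range.
    """
    b, c, f, t = spec.shape
    assert f % num_subbands == 0
    spec = spec.reshape(b, c, num_subbands, f // num_subbands, t)
    return spec.reshape(b, c * num_subbands, f // num_subbands, t)

x = torch.randn(2, 2, 1024, 100)          # e.g. a stereo magnitude spectrogram
subband = to_channel_wise_subbands(x, 4)  # (2, 8, 256, 100)
```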
Music source separation denotes the task of extracting all the instruments from a given song. Recent breakthroughs on this challenge have gravitated around a single dataset, MUSDB, which is limited to four instrument classes. Larger datasets and more instruments make collecting data and training deep neural networks (DNNs) costly and time-consuming. In this work, we propose a fast method for evaluating the separability of instruments in any dataset without training and tuning a DNN. This separability measure helps to select appropriate samples for the efficient training of neural networks. Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches such as TasNet or Open-Unmix. Our results contribute to revealing two essential points for audio source separation: 1) the ideal ratio mask, although light and simple, provides an accurate measure of the audio separability performance of recent neural networks, and 2) new end-to-end learning methods such as TasNet, which operate directly on waveforms, are in fact internally building a time-frequency (TF) representation, so that they encounter the same limitations as TF-based approaches when separating audio patterns that overlap in the TF plane.
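The oracle at the core of this measure is easy to state: build ideal ratio masks from the ground-truth sources, apply them to the mixture spectrogram, and score the resynthesized estimates. A minimal sketch using librosa (STFT settings are illustrative):

```python
import numpy as np

def irm_oracle_estimate(sources, n_fft=2048, hop=512):
    """Oracle separation via ideal ratio masks (IRM).

    sources: list of mono waveforms (np.ndarray) that sum to the mixture.
    Returns each source re-estimated from the mixture; its SDR against the
    ground truth serves as the separability measure.
    """
    import librosa
    specs = [librosa.stft(s, n_fft=n_fft, hop_length=hop) for s in sources]
    mags = np.abs(np.stack(specs))            # (n_src, freq, time)
    mix = np.sum(specs, axis=0)               # mixture spectrogram
    masks = mags / (mags.sum(axis=0) + 1e-8)  # IRM: per-source energy ratio
    return [librosa.istft(m * mix, hop_length=hop, length=len(sources[0]))
            for m in masks]
```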
Recent research has shown remarkable performance in leveraging multiple extraneous conditional and non-mutually exclusive semantic concepts for sound source separation, allowing the flexibility to extract a given target source based on multiple different queries. In this work, we propose a new optimal condition training (OCT) method for single-channel target source separation, based on greedy parameter updates using the highest performing condition among equivalent conditions associated with a given target source. Our experiments show that the complementary information carried by the diverse semantic concepts significantly helps to disentangle and isolate sources of interest much more efficiently compared to single-conditioned models. Moreover, we propose a variation of OCT with condition refinement, in which an initial conditional vector is adapted to the given mixture and transformed to a more amenable representation for target source extraction. We showcase the effectiveness of OCT on diverse source separation experiments where it improves upon permutation invariant models with oracle assignment and obtains state-of-the-art performance in the more challenging task of text-based source separation, outperforming even dedicated text-only conditioned models.
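The greedy update can be summarized in a few lines: score every equivalent condition for the target, then back-propagate only through the best one. The sketch below assumes a `model(mixture, condition)` interface and is illustrative, not the authors' code:

```python
import torch

def oct_step(model, mixture, target, conditions, optimizer, loss_fn):
    """One optimal-condition-training update: among all conditions that refer
    to the same target source, back-propagate only the best-performing one."""
    with torch.no_grad():  # cheap scoring pass over the equivalent conditions
        losses = [loss_fn(model(mixture, c), target) for c in conditions]
    best = conditions[int(torch.stack(losses).argmin())]
    optimizer.zero_grad()
    loss = loss_fn(model(mixture, best), target)  # gradient pass, best only
    loss.backward()
    optimizer.step()
    return loss.item()
```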
Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two-and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
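A toy version of the encoder/mask/decoder pipeline conveys the structure (the real TCN mask network is far deeper; the two-layer stand-in below and all sizes are illustrative):

```python
import torch
import torch.nn as nn

class TinyTasNet(nn.Module):
    """Encoder -> per-speaker masks -> decoder, all in the time domain."""
    def __init__(self, n_filters=256, kernel=16, stride=8, n_speakers=2):
        super().__init__()
        self.n_speakers = n_speakers
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # stand-in for the stacked dilated 1-D conv blocks (TCN) of the paper
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.PReLU(),
            nn.Conv1d(n_filters, n_filters * n_speakers, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel,
                                          stride=stride, bias=False)

    def forward(self, wav):                  # wav: (batch, 1, samples)
        rep = torch.relu(self.encoder(wav))  # learned, non-negative "spectrogram"
        masks = self.mask_net(rep).chunk(self.n_speakers, dim=1)
        return torch.stack([self.decoder(rep * m) for m in masks], dim=1)

est = TinyTasNet()(torch.randn(2, 1, 16000))  # (2, n_speakers, 1, 16000)
```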
We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner, while also benefiting from the performance transformer-based architectures provide. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. Open-source code and datasets: https://github.com/vb000/Waveformer
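The encoder's key ingredient, dilated causal convolution, can be sketched compactly: left-only padding keeps each frame from seeing the future, while doubling dilations grow the receptive field exponentially. Sizes below are illustrative, not Waveformer's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalStack(nn.Module):
    """Stack of dilated causal 1-D convolutions: each output frame sees only
    the past, and the receptive field grows exponentially with depth."""
    def __init__(self, channels=64, kernel=3, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel, dilation=2 ** i)
            for i in range(n_layers))
        self.kernel = kernel

    def forward(self, x):                     # x: (batch, channels, time)
        for i, conv in enumerate(self.layers):
            pad = (self.kernel - 1) * 2 ** i  # left-pad only => causal
            x = x + torch.relu(conv(F.pad(x, (pad, 0))))
        return x

enc = DilatedCausalStack()
y = enc(torch.randn(1, 64, 1000))  # receptive field: 1 + 2*(2**8 - 1) = 511 frames
```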
Modelling long-term dependencies for audio signals is a particularly challenging problem, as even small time scales yield on the order of a hundred thousand samples. With the recent advent of transformers, neural architectures have become good at modelling dependencies over longer time scales, but they are limited by quadratic constraints on scaling them. We propose a generative auto-regressive architecture that can model audio waveforms over quite a large context, greater than 500,000 samples. Our work adapts the approach of learning a latent representation via a CNN front-end, and then learning temporal dependencies with a transformer encoder, trained fully end-to-end, thereby allowing it to learn a representation suited to predicting the next sample. Unlike previous works that compared across different time scales to show improvement, we use a standard dataset, with the same number of parameters and the same context, to show improvement. We achieve state-of-the-art performance on a standard dataset for modelling long-term structure, compared to other approaches such as WaveNet, SaShiMi, and SampleRNN. This work gives a very exciting direction for the field, given the improvements in context modelling, which can be scaled with more data, and potentially better results by using billions or trillions of parameters.
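The overall shape of such a model, a strided CNN front-end feeding a causally masked transformer that predicts the next step, can be sketched as follows (frame size, depth, and the mu-law output head are assumptions for illustration):

```python
import torch
import torch.nn as nn

class WaveAR(nn.Module):
    """CNN front-end compresses the waveform into frames; a causally masked
    transformer encoder models dependencies; a head predicts the next step."""
    def __init__(self, frame=256, d_model=256, n_layers=4):
        super().__init__()
        self.frontend = nn.Conv1d(1, d_model, kernel_size=frame, stride=frame)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 256)  # e.g. 256-way mu-law sample logits

    def forward(self, wav):                  # wav: (batch, 1, samples)
        h = self.frontend(wav).transpose(1, 2)  # (batch, frames, d_model)
        n = h.size(1)
        # upper-triangular -inf mask => each frame attends only to the past
        mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.transformer(h, mask=mask)
        return self.head(h)                  # logits for the next step

logits = WaveAR()(torch.randn(2, 1, 256 * 64))  # (2, 64, 256)
```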
This paper presents an audio-visual approach for voice separation that produces state-of-the-art results at low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed into an audio-visual transformer, which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced with an audio-only network. We present different ablation studies and comparisons with state-of-the-art methods. Finally, we explore the transferability of models trained for speech separation to the task of singing-voice separation. Demos, code, and weights are available at https://ipcv.github.io/vovit/
For immersive applications, the generation of binaural sound that matches its visual counterpart is crucial for bringing meaningful experiences to people in a virtual environment. Recent works have shown the possibility of using neural networks to synthesize binaural audio from mono audio, using 2D visual information as guidance. Extending this approach by guiding the audio with 3D visual information and operating in the waveform domain may allow a more accurate auralization of a virtual audio scene. In this paper, we present Points2Sound, a multi-modal deep learning model that generates a binaural version from mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network with 3D sparse convolutions, which extracts visual features from the point cloud scene to condition an audio network operating in the waveform domain, which synthesizes the binaural version. Experimental results indicate that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis. In addition, we investigate different loss functions and 3D point cloud attributes, showing that directly predicting the full binaural signal and using RGB-depth features increase the performance of our proposed model.
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
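For reference, the conditioning signal in question is simply a log-scale mel spectrogram; a minimal computation with librosa is shown below (the hop, window, and 80-band settings are illustrative, not necessarily Tacotron 2's exact parameters):

```python
import numpy as np
import librosa

# Synthesize one second of a 440 Hz tone as a stand-in input waveform.
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
wav = 0.5 * np.sin(2 * np.pi * 440 * t).astype(np.float32)

# The compact acoustic intermediate representation: a log-mel spectrogram,
# like the one Tacotron 2 predicts and the WaveNet vocoder consumes.
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
print(log_mel.shape)  # (80 mel bands, frames) -- the vocoder conditioning
```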
In this paper, we introduce Jointist, an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. The instrument conditioning is designed for explicit multi-instrument functionality, while the connection between the transcription and source separation modules is designed for better transcription performance. Our challenging problem formulation makes the model highly useful in the real world, given that modern popular music typically consists of multiple instruments. However, its novelty necessitates a new perspective on how such a model should be evaluated. In our experiments, we assess the model from various aspects, providing a new evaluation perspective for multi-instrument transcription. We also argue that transcription models can be used as a preprocessing module for other music analysis tasks. In experiments on several downstream tasks, the symbolic representation provided by our transcription model proved helpful alongside spectrograms in solving downbeat detection, chord recognition, and key estimation.
Approaches for extracting audio and speech features have been studied since the pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition of developing general-purpose audio representations. For example, deep neural networks can extract optimal embeddings if they are trained on large audio datasets. This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. Lastly, we present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features. The proposed representations were evaluated on the auditory scene classification and timestamp detection tasks of the HEAR NeurIPS 2021 challenge. Our results indicate that the hybrid model with a convolutional transformer yields superior performance in most HEAR challenge tasks.
An onomatopoeic word, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of a sound such as its duration, pitch, and timbre. We propose an environmental-sound-extraction method that uses onomatopoeic words to specify the target sound to be extracted. With this method, we estimate a time-frequency mask from an input mixture spectrogram and an onomatopoeic word using a U-Net architecture, then extract the corresponding target sound by masking the spectrogram. Experimental results indicate that the proposed method can extract only the target sound corresponding to the onomatopoeic word, and that it performs better than conventional methods that use sound-event classes to specify the target sound.
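One plausible way to turn an onomatopoeic word into a conditioning signal for the mask-estimating U-Net is a small character-level encoder; the vocabulary and sizes below are assumptions for illustration, not the paper's model:

```python
import torch
import torch.nn as nn

class OnomatopoeiaEncoder(nn.Module):
    """Encode a character sequence such as 'pi-po-pi-po' into a single
    conditioning vector for the mask-estimation network (a sketch)."""
    def __init__(self, n_chars=64, dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids):          # char_ids: (batch, seq_len) int ids
        _, h = self.rnn(self.embed(char_ids))
        return h[-1]                      # (batch, dim) conditioning vector

enc = OnomatopoeiaEncoder()
cond = enc(torch.randint(0, 64, (2, 12)))  # feed to the U-Net, e.g. via FiLM
```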
Source separation models either work on the spectrogram or on the waveform domain. In this work, we show how to perform end-to-end hybrid source separation, letting the model decide which domain is best suited for each source, and even combining both. The proposed hybrid version of the Demucs architecture won the Music Demixing Challenge 2021 organized by Sony. This architecture also comes with additional improvements, such as compressed residual branches, local attention, and singular-value regularization. Overall, a 1.4 dB improvement of the Signal-to-Distortion Ratio (SDR) was observed across all sources as measured on the MusDB HQ dataset, an improvement confirmed by human subjective evaluation, with an overall quality rated at 2.83 out of 5 (2.36 for the non-hybrid Demucs) and an absence of contamination rated at 3.04 (against 2.37 for the non-hybrid Demucs and 2.44 for the second-ranking model submitted to the competition).
Motivated by state-of-the-art psychological research, we note that piano performances transcribed with existing automatic music transcription (AMT) methods cannot be successfully resynthesized without affecting the artistic content of the performance. This is due to 1) the different mappings between MIDI parameters used by different instruments, and 2) the way in which musicians adapt to the surrounding acoustic environment. To face this problem, we propose a methodology for building acoustics-specific AMT systems that are able to model the adaptations that musicians apply to convey their interpretation. Specifically, we train virtual-instrument models tailored in a modular architecture that takes as input an audio recording and the relative aligned music score, and outputs the acoustics-specific velocity of each note. We test different model shapes and show that the proposed methodology generally outperforms the usual AMT pipeline, which does not consider the specificities of the instrument and the acoustic environment. Interestingly, this methodology is easily extensible, since only slight effort is needed to train models to infer other piano parameters, such as pedaling.
In this paper, we present a blockwise optimization method for masking-based networks (BLOOM-Net) for training scalable speech enhancement networks. Here, we design our network with a residual learning scheme and train the internal separator blocks sequentially to obtain a scalable masking-based deep neural network for speech enhancement. Its scalability lets it adjust the run-time complexity based on test-time resource constraints: once deployed, the model can dynamically change its complexity depending on the test-time environment. To this end, we modularize our models so that they can flexibly accommodate varying needs for enhancement performance and constraints on the resources, incurring minimal memory or training overhead due to the added scalability. Our experiments on speech enhancement demonstrate that the proposed blockwise optimization method achieves the desired scalability with only a slight performance degradation compared to corresponding models trained end-to-end.
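The sequential recipe can be sketched as: freeze the trained prefix of residual blocks, then fit one additional block on top. The interfaces below (block shapes, feature batches) are assumed for illustration:

```python
import torch

def train_block_sequentially(blocks, new_block, batch_iter, loss_fn, lr=1e-3):
    """BLOOM-Net-style stage: freeze already-trained separator blocks and fit
    one more block on the residual refinement (a sketch, interfaces assumed)."""
    for b in blocks:
        b.requires_grad_(False)           # earlier blocks stay fixed
    opt = torch.optim.Adam(new_block.parameters(), lr=lr)
    for noisy_feat, clean_feat in batch_iter:
        h = noisy_feat
        with torch.no_grad():
            for b in blocks:              # run the frozen prefix
                h = h + b(h)              # residual learning scheme
        est = h + new_block(h)            # only this block receives gradients
        loss = loss_fn(est, clean_feat)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return new_block
```

At inference time, the stack can simply be truncated to however many blocks the deployment device affords, which is where the test-time scalability comes from.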
In subcellular biological research, fluorescence staining is a key technique to reveal the locations and morphology of subcellular structures. However, fluorescence staining is slow, expensive, and harmful to cells. In this paper, we treat it as a deep learning task termed subcellular structure prediction (SSP), aiming to predict the 3D fluorescent images of multiple subcellular structures from a 3D transmitted-light image. Unfortunately, due to the limitations of current biotechnology, each image is partially labeled in SSP. Besides, naturally, the subcellular structures vary considerably in size, which causes the multi-scale issue in SSP. However, traditional solutions can not address SSP well since they organize network parameters inefficiently and inflexibly. To overcome these challenges, we propose Re-parameterizing Mixture-of-Diverse-Experts (RepMode), a network that dynamically organizes its parameters with task-aware priors to handle specified single-label prediction tasks of SSP. In RepMode, the Mixture-of-Diverse-Experts (MoDE) block is designed to learn the generalized parameters for all tasks, and gating re-parameterization (GatRep) is performed to generate the specialized parameters for each task, by which RepMode can maintain a compact practical topology exactly like a plain network, and meanwhile achieves a powerful theoretical topology. Comprehensive experiments show that RepMode outperforms existing methods on ten of twelve prediction tasks of SSP and achieves state-of-the-art overall performance.
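The re-parameterization trick rests on the linearity of convolution: a gated sum of parallel expert convolutions equals one convolution with the gated sum of their kernels. The sketch below verifies this equivalence numerically (expert counts and shapes are made up; this illustrates the general mechanism, not RepMode's exact GatRep):

```python
import torch
import torch.nn.functional as F

def gatrep_merge(expert_weights, gates):
    """Collapse a gated mixture of parallel conv experts into one kernel,
    so inference runs a single plain convolution.

    expert_weights: (n_experts, out_c, in_c, k, k); gates: (n_experts,).
    """
    return torch.einsum("e,eoikl->oikl", gates, expert_weights)

experts = torch.randn(4, 8, 3, 3, 3)                 # 4 experts, 8 out channels
gates = torch.softmax(torch.randn(4), dim=0)         # task-dependent gating
merged = gatrep_merge(experts, gates)

x = torch.randn(1, 3, 32, 32)
mixture = sum(g * F.conv2d(x, w, padding=1) for g, w in zip(gates, experts))
single = F.conv2d(x, merged, padding=1)
print(torch.allclose(mixture, single, atol=1e-5))    # True: same output, one conv
```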