Music mixing traditionally involves recording instruments in the form of clean, individual tracks and blending them into a final mixture using audio effects and expert knowledge (e.g., that of a mixing engineer). In recent years, the automation of music production tasks has become an emerging field in which both rule-based methods and machine learning approaches have been explored. However, the lack of dry or clean instrument recordings limits the performance of such models, which remains far from professional human-made mixes. We explore whether out-of-domain data, such as wet or processed multitrack music recordings, can be repurposed to train supervised deep learning models that bridge the current gap in automatic mixing quality. To achieve this, we propose a novel data preprocessing method that allows the models to perform automatic music mixing. We also redesign a listening test method for evaluating music mixing systems, and validate our results in such tests with experienced mixing engineers as participants.
Given the recent advances in music source separation and automatic mixing, removing audio effects from music tracks is a meaningful step toward developing automated mixing systems. This paper focuses on removing distortion audio effects applied to guitar tracks in music production. We explore whether effect removal can be solved by neural networks designed for source separation and audio effect modeling. Our approach proves particularly effective for effects that mix the processed and clean signals. Compared with a state-of-the-art solution based on sparse optimization, these models achieve better quality and considerably faster inference. We demonstrate that the models are suitable not only for overdrive but also for other types of distortion effects. In discussing the results, we highlight the usefulness of multiple evaluation metrics for assessing different aspects of reconstruction in distortion effect removal.
We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio-effects-related information from a reference music recording. All our models are trained in a self-supervised manner on an already-processed wet multitrack dataset, using an effective data preprocessing method that alleviates the scarcity of unprocessed dry data. We analyze the proposed encoder's ability to disentangle audio effects and validate its performance for mixing style transfer through both objective and subjective evaluations. The results show that the proposed system not only converts the mixing style of a multitrack close to that of a reference, but is also robust for mixture-wise style transfer when combined with a music source separation model.
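As an illustration, here is a minimal sketch of the kind of contrastive objective described above: two segments rendered with the same effects chain form a positive pair, pushing the encoder to keep effect information and discard musical content. The function name and temperature value are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.1):
    """NT-Xent loss over a batch of positive pairs (z_a[i], z_b[i])."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))           # positives on the diagonal
    # symmetric cross-entropy: a -> b and b -> a
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```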
We propose a method for denoising old music recordings. Our model converts its input into a time-frequency representation via the short-time Fourier transform (STFT) and processes the resulting complex spectrogram with a convolutional neural network. The network is trained with reconstruction and adversarial objectives on a synthetic noisy music dataset, created by mixing clean music with real noise samples extracted from quiet segments of old recordings. We evaluate our method quantitatively on held-out test examples from the synthetic dataset and qualitatively through human ratings of actual historical recordings. Our results show that the proposed method effectively removes noise while preserving the quality and details of the original music.
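A minimal sketch of the STFT-to-CNN-to-iSTFT pipeline the abstract describes follows; the two-channel (real/imaginary) layout and the tiny stand-in network are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SpectrogramDenoiser(nn.Module):
    def __init__(self, n_fft: int = 1024, hop: int = 256):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.net = nn.Sequential(              # stand-in for the real CNN
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # wav: (B, T)
        win = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop, window=win,
                          return_complex=True)             # (B, F, T')
        x = torch.view_as_real(spec).permute(0, 3, 1, 2)   # (B, 2, F, T')
        y = self.net(x).permute(0, 2, 3, 1).contiguous()
        denoised = torch.view_as_complex(y)
        return torch.istft(denoised, self.n_fft, self.hop, window=win,
                           length=wav.shape[-1])
```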
Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision, where they have achieved great success in tasks such as machine translation and image generation. Owing to this success, these data-driven techniques have been applied in the audio domain. More specifically, DNN models have been applied to speech enhancement to achieve denoising, dereverberation, and multi-speaker separation in monaural settings. In this paper, we review the dominant DNN techniques employed for speech separation. The review covers the whole speech enhancement pipeline: feature extraction, how DNN-based tools model both the global and local features of speech, and model training (supervised and unsupervised). We also review the use of pre-trained speech enhancement models to boost the enhancement process. The review focuses on the dominant trends in DNN-based enhancement of single-channel (monaural) speech.
Single-channel, speaker-independent speech separation methods have recently seen great progress. However, the accuracy, latency, and computational cost of such methods remain insufficient. The majority of the previous methods have formulated the separation problem through the time-frequency representation of the mixed signal, which has several drawbacks, including the decoupling of the phase and magnitude of the signal, the suboptimality of time-frequency representation for speech separation, and the long latency in calculating the spectrograms. To address these shortcomings, we propose a fully-convolutional time-domain audio separation network (Conv-TasNet), a deep learning framework for end-to-end time-domain speech separation. Conv-TasNet uses a linear encoder to generate a representation of the speech waveform optimized for separating individual speakers. Speaker separation is achieved by applying a set of weighting functions (masks) to the encoder output. The modified encoder representations are then inverted back to the waveforms using a linear decoder. The masks are found using a temporal convolutional network (TCN) consisting of stacked 1-D dilated convolutional blocks, which allows the network to model the long-term dependencies of the speech signal while maintaining a small model size. The proposed Conv-TasNet system significantly outperforms previous time-frequency masking methods in separating two- and three-speaker mixtures. Additionally, Conv-TasNet surpasses several ideal time-frequency magnitude masks in two-speaker speech separation as evaluated by both objective distortion measures and subjective quality assessment by human listeners. Finally, Conv-TasNet has a significantly smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications. This study therefore represents a major step toward the realization of speech separation systems for real-world speech processing technologies.
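The following is a heavily condensed sketch of the structure summarized above: a learned 1-D convolutional encoder, a stack of dilated convolutions standing in for the full TCN, sigmoid masks, and a transposed-convolution decoder. Channel sizes and depths are placeholders, far smaller than the published configuration.

```python
import torch
import torch.nn as nn

class TinyConvTasNet(nn.Module):
    def __init__(self, n_src=2, n_filters=128, kernel=16, stride=8):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        # stack of 1-D dilated conv blocks standing in for the full TCN
        self.tcn = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(n_filters, n_filters, 3, padding=2**i, dilation=2**i),
                nn.PReLU(), nn.GroupNorm(1, n_filters))
            for i in range(4)])
        self.mask_head = nn.Conv1d(n_filters, n_src * n_filters, 1)
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel,
                                          stride=stride, bias=False)

    def forward(self, wav):                      # wav: (B, 1, T)
        feats = torch.relu(self.encoder(wav))    # (B, N, L)
        masks = torch.sigmoid(self.mask_head(self.tcn(feats)))
        masks = masks.view(wav.size(0), self.n_src, -1, feats.size(-1))
        # apply each speaker's mask, then decode back to waveforms
        srcs = [self.decoder(feats * masks[:, s]) for s in range(self.n_src)]
        return torch.stack(srcs, dim=1)          # (B, n_src, 1, T')
```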
Musical expression requires control of both what notes are played and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but offer few mechanisms for control. In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a three-level hierarchy (notes, performance, synthesis) that lets individuals choose to intervene at each level, or use trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes of a note sequence, independently manipulate the attributes of a given performance, and, as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools that empower individuals across a diverse range of musical experience.
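To make the synthesis level of the hierarchy concrete, here is a toy additive synthesizer in the spirit of DDSP's harmonic model: a fundamental-frequency curve plus per-harmonic amplitudes are rendered to audio. This is an illustrative reconstruction, not the MIDI-DDSP code.

```python
import numpy as np

def harmonic_synth(f0, harm_amps, sr=16000):
    """f0: (T,) Hz per sample; harm_amps: (T, K) amplitude per harmonic."""
    phase = 2 * np.pi * np.cumsum(f0) / sr            # running phase of f0
    k = np.arange(1, harm_amps.shape[1] + 1)          # harmonic numbers
    # silence harmonics that would alias above Nyquist
    audible = (np.outer(f0, k) < sr / 2)
    return np.sum(np.sin(np.outer(phase, k)) * harm_amps * audible, axis=1)

# usage: a one-second A4 note with gently decaying harmonics
t = 16000
f0 = np.full(t, 440.0)
amps = np.tile(1.0 / np.arange(1, 9), (t, 1)) * np.linspace(1, 0.2, t)[:, None]
wav = harmonic_synth(f0, amps)
```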
Removing background noise from speech audio has been the subject of considerable research and effort, especially in recent years due to the rise of virtual communication and amateur sound recording. Yet background noise is not the only unpleasant disturbance that can prevent intelligibility: reverberation, clipping, codec artifacts, problematic equalization, limited bandwidth, and inconsistent loudness are equally disturbing and ubiquitous. In this work, we propose to consider the task of speech enhancement as a holistic endeavor, and present a universal speech enhancement system that tackles 55 different distortions at the same time. Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network that performs enhancement using mixture density networks. We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners. We also show that, despite not considering any particular fast-sampling strategy, it achieves competitive objective scores with as few as 4-8 diffusion steps. We hope that both our methodology and technical contributions encourage researchers and practitioners to adopt a universal approach to speech enhancement, possibly framing it as a generative task.
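For orientation, a generic sketch of few-step sampling from a conditional score-based diffusion model follows. The noise schedule, the `score_model` interface, and the deterministic update rule are generic assumptions for illustration; the paper's exact sampler is not reproduced here.

```python
import torch

@torch.no_grad()
def sample(score_model, cond, shape, n_steps=8):
    """score_model(x, sigma, cond) estimates grad_x log p(x | sigma)."""
    sigmas = torch.logspace(0, -2, n_steps)       # noise levels 1.0 -> 0.01
    x = torch.randn(shape) * sigmas[0]            # start from pure noise
    for i, sigma in enumerate(sigmas):
        score = score_model(x, sigma, cond)
        next_sigma = sigmas[i + 1] if i + 1 < n_steps else torch.tensor(0.0)
        # deterministic DDIM-style update for a variance-exploding process
        x = x + sigma * (sigma - next_sigma) * score
    return x
```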
Upsampling artifacts are caused by problematic upsampling layers and by the spectral replicas that emerge while upsampling. Depending on the layer used, such artifacts can be either tonal artifacts (additive high-frequency noise) or filtering artifacts (subtractive, attenuating some bands). In this work, we investigate the practical implications of having upsampling artifacts in the resulting audio by studying how different artifacts interact and assessing their impact on model performance. To that end, we benchmark a large set of upsampling layers for music source separation: different transposed and subpixel convolution setups, different interpolation upsamplers (including two novel layers based on stretch and sinc interpolation), and different wavelet-based upsamplers (including a novel learnable wavelet layer). Our results show that the filtering artifacts associated with interpolation upsamplers are perceptually preferable, even though these upsamplers tend to achieve worse objective scores.
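Minimal examples of three of the upsampler families benchmarked above are sketched below. Channel counts are arbitrary; the point is the structural difference, which determines whether tonal or filtering artifacts dominate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 100)                     # (batch, channels, time)

# 1) transposed convolution (prone to tonal artifacts)
up_transposed = nn.ConvTranspose1d(64, 64, kernel_size=4, stride=2, padding=1)

# 2) subpixel convolution: widen channels, then interleave them in time
conv_wide = nn.Conv1d(64, 64 * 2, kernel_size=3, padding=1)
def subpixel(x, r=2):
    b, c, t = x.shape
    return x.view(b, c // r, r, t).permute(0, 1, 3, 2).reshape(b, c // r, t * r)

# 3) interpolation upsampler (filtering artifacts: attenuated bands)
conv_post = nn.Conv1d(64, 64, kernel_size=3, padding=1)
def interp_up(x, r=2):
    return conv_post(F.interpolate(x, scale_factor=r, mode="linear"))

for y in (up_transposed(x), subpixel(conv_wide(x)), interp_up(x)):
    print(y.shape)                               # each: (1, 64, 200)
```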
Music discovery services let users identify songs from short mobile recordings. These solutions are often based on Audio Fingerprinting, and rely more specifically on the extraction of spectral peaks in order to be robust to a number of distortions. Few works have been done to study the robustness of these algorithms to background noise captured in real environments. In particular, AFP systems still struggle when the signal to noise ratio is low, i.e when the background noise is strong. In this project, we tackle this problematic with Deep Learning. We test a new hybrid strategy which consists of inserting a denoising DL model in front of a peak-based AFP algorithm. We simulate noisy music recordings using a realistic data augmentation pipeline, and train a DL model to denoise them. The denoising model limits the impact of background noise on the AFP system's extracted peaks, improving its robustness to noise. We further propose a novel loss function to adapt the DL model to the considered AFP system, increasing its precision in terms of retrieved spectral peaks. To the best of our knowledge, this hybrid strategy has not been tested before.
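A hedged sketch of the idea behind adapting the denoiser to a peak-based fingerprinter: weight the spectrogram reconstruction error more heavily at the locations of the clean signal's spectral peaks. The peak picking and weighting scheme here are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def peak_weighted_loss(est_mag, ref_mag, peak_weight=10.0):
    """est_mag, ref_mag: (B, F, T) magnitude spectrograms."""
    # a bin is a "peak" if it is the local maximum of its 3x3 neighborhood
    pooled = F.max_pool2d(ref_mag.unsqueeze(1), 3, stride=1, padding=1)
    peaks = (ref_mag.unsqueeze(1) == pooled).float().squeeze(1)
    weights = 1.0 + (peak_weight - 1.0) * peaks
    return torch.mean(weights * (est_mag - ref_mag) ** 2)
```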
Wideband Audio Waveform Evaluation Networks (WAWEnets) are convolutional neural networks that operate directly on wideband audio waveforms in order to produce evaluations of those waveforms. In the present work, these evaluations give qualities of telecommunications speech (e.g., noisiness, intelligibility, overall speech quality). WAWEnets are no-reference networks because they do not require "reference" (original or undistorted) versions of the waveforms they evaluate. Our initial WAWEnet publication introduced four WAWEnets that emulated the outputs of established full-reference speech quality or intelligibility estimation algorithms. We have since updated the WAWEnet architecture to be more efficient and effective. Here we present a single WAWEnet that closely tracks seven different quality and intelligibility values, a second network that additionally tracks four subjective speech quality dimensions, and a third network that focuses on subjective quality scores alone and achieves very high agreement. This work leverages 334 hours of speech in 13 languages, more than two million full-reference target values, and more than 93,000 subjective mean opinion scores. We also interpret the operation of WAWEnets and identify the key to that operation in the language of signal processing: ReLUs strategically move spectral information from non-DC components into the DC component. The DC values of 96 output signals define a vector in a 96-D latent space, and this vector is then mapped to a quality or intelligibility value for the input waveform.
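A toy rendering of the mechanism the authors describe follows: after the convolutional stages, the time-axis mean (the DC component) of each of 96 feature signals forms a 96-D vector, which a small head maps to a quality value. Layer sizes other than the 96-channel latent are guesses.

```python
import torch
import torch.nn as nn

class DCReadout(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(            # stand-in conv trunk
            nn.Conv1d(1, 96, 9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(96, 96, 9, stride=4, padding=4), nn.ReLU())
        self.head = nn.Linear(96, 1)              # latent -> quality score

    def forward(self, wav):                       # wav: (B, 1, T)
        z = self.features(wav).mean(dim=-1)       # DC value per channel: (B, 96)
        return self.head(z)
```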
The marine ecosystem is changing at an alarming rate, exhibiting biodiversity loss and the migration of tropical species to temperate basins. Monitoring underwater environments and their inhabitants is of fundamental importance to understand the evolution of these systems and implement safeguard policies. However, assessing and tracking biodiversity is often a complex task, especially in large and uncontrolled environments such as the oceans. One of the most popular and effective methods for monitoring marine biodiversity is passive acoustic monitoring (PAM), which employs hydrophones to capture underwater sound. Many aquatic animals produce sounds characteristic of their own species; these signals travel efficiently underwater and can be detected even at great distances. Furthermore, modern technologies are becoming more and more convenient and precise, allowing for very accurate and careful data acquisition. To date, audio captured with PAM devices is frequently processed manually by marine biologists and interpreted with traditional signal processing techniques for the detection of animal vocalizations. This is a challenging task, as PAM recordings often span long periods of time. Moreover, one of the causes of biodiversity loss is sound pollution; in data obtained from regions with loud anthropogenic noise, it is hard to manually separate the artificial sounds from the fish sounds. Nowadays, machine learning and, in particular, deep learning represent the state of the art for processing audio signals; specifically, sound separation networks are able to identify and separate human voices and musical instruments. In this work, we show that the same techniques can be successfully used to automatically extract fish vocalizations in PAM recordings, opening up the possibility of biodiversity monitoring at a large scale.
Despite phenomenal progress in recent years, state-of-the-art music separation systems produce source estimates with significant perceptual flaws, such as added extraneous noise or removed harmonics. We propose a post-processing model, the Make it Sound Good (MSG) post-processor, to enhance the output of music source separation systems. We apply our post-processing model to state-of-the-art waveform-based and spectrogram-based music source separators, including separators unseen during training. Our analysis of the errors produced by source separators shows that waveform models tend to introduce more high-frequency noise, while spectrogram models tend to lose transients and high-frequency content. We introduce objective measures that quantify both kinds of errors and show that MSG improves source reconstruction with respect to both. Crowdsourced subjective evaluations demonstrate that human listeners prefer bass and drum source estimates that have been post-processed by MSG.
The convolution-augmented transformer (Conformer) has recently been proposed for various speech-domain applications, such as automatic speech recognition (ASR) and speech separation, as it can capture both local and global dependencies. In this paper, we propose a Conformer-based metric generative adversarial network (CMGAN) for speech enhancement (SE) in the time-frequency (TF) domain. The generator encodes magnitude and complex spectrogram information using two-stage Conformer blocks to model both time and frequency dependencies. The decoder then decouples the estimation into a mask decoder branch, which filters out unwanted distortions, and a complex refinement branch, which further improves the magnitude estimation and implicitly enhances the phase information. We additionally include a metric discriminator that alleviates metric mismatch by optimizing the corresponding evaluation score. Objective and subjective evaluations show that CMGAN outperforms state-of-the-art methods on three speech enhancement tasks (denoising, dereverberation, and super-resolution). For example, quantitative denoising analysis on the Voice Bank+DEMAND dataset indicates that CMGAN surpasses previous models by a margin, with a PESQ of 3.41 and an SSNR of 11.10 dB.
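A compact sketch of the metric-discriminator idea: a network regresses a normalized evaluation score (e.g., PESQ scaled to [0, 1]) from spectrogram pairs, and the generator is then trained to drive that predicted score toward its maximum. Shapes and the `pesq_normalized` function referenced in the comments are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MetricDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 5, stride=2), nn.PReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.PReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1), nn.Sigmoid())       # predicted score in [0, 1]

    def forward(self, est_mag, ref_mag):          # (B, F, T) each
        return self.net(torch.stack([est_mag, ref_mag], dim=1))

# discriminator step: match the true metric; generator step: maximize it
# d_loss = mse(D(est, ref), pesq_normalized(est, ref)) + mse(D(ref, ref), 1)
# g_loss = mse(D(est, ref), 1)
```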
The production of modern digital music typically involves combining many acoustic elements to compile a track. An important class of such elements is drum samples, which determine the character of the percussive component of the piece. Artists must use aesthetic judgment to assess whether a given drum sample fits the current musical context. However, selecting drum samples from potentially large libraries is tedious and may interrupt the creative flow. In this work, we explore automatic drum sample retrieval based on aesthetic principles learned from data. As a result, artists can rank the samples in their library by their fit to an incomplete song mix at different stages of the production process. To this end, we use contrastive learning to maximize the score of drum samples originating from the same song as the mixture. We conduct a listening test to determine whether human ratings match the automatic scoring function, and we also perform objective quantitative analyses to evaluate the efficacy of the method.
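A sketch of how such a learned scoring function might be used at retrieval time: embed the incomplete mix and every drum sample with trained encoders, then rank samples by cosine similarity. The encoder names are placeholders; training would use a contrastive objective that scores drums from the same song as the mix highly.

```python
import torch
import torch.nn.functional as F

def rank_drum_samples(mix_encoder, drum_encoder, mix_wav, drum_wavs):
    """mix_wav: (1, T); drum_wavs: (N, T). Returns sample indices, best first."""
    with torch.no_grad():
        q = F.normalize(mix_encoder(mix_wav), dim=-1)      # (1, D)
        k = F.normalize(drum_encoder(drum_wavs), dim=-1)   # (N, D)
    scores = (k @ q.t()).squeeze(-1)                       # cosine similarity
    return torch.argsort(scores, descending=True)
```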
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the conditioning input to WaveNet instead of linguistic, duration, and F0 features. We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically a mel spectrogram, into a waveform. Modern speech generation pipelines use a vocoder as their final component. Vocoder models developed recently for speech achieve a high degree of realism, so it is natural to wonder how they would perform on music signals. Compared to speech, the heterogeneity and structure of musical sound textures pose new challenges. In this work, we focus on one specific artifact that vocoder models designed for speech tend to exhibit when applied to music: pitch instability when synthesizing sustained notes. We argue that the characteristic sound of this artifact is due to a lack of horizontal phase coherence, which is often the result of using a time-domain target space with models that are invariant to time shifts, such as convolutional neural networks. We propose a new vocoder model designed specifically for music. The key to improving pitch stability is the choice of a shift-invariant target space consisting of the magnitude spectrum and the phase gradient. We discuss the reasons that inspired us to re-formulate the vocoder task, outline a working example, and evaluate it on musical signals. Evaluated with a novel harmonic error metric, our method yields 60% and 10% improvements over existing models in the reconstruction of sustained notes and chords, respectively.
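As a small demonstration of such a target space, the snippet below computes the magnitude plus time- and frequency-direction derivatives of the unwrapped STFT phase. The exact normalizations used in the paper may differ; this only shows how such targets can be obtained.

```python
import numpy as np
import librosa

sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)                # a sustained note

stft = librosa.stft(y, n_fft=1024, hop_length=256)
magnitude = np.abs(stft)
phase = np.angle(stft)
# horizontal (time) phase gradient: unwrap along frames, then difference
dphi_dt = np.diff(np.unwrap(phase, axis=1), axis=1)
# vertical (frequency) phase gradient: unwrap along bins, then difference
dphi_df = np.diff(np.unwrap(phase, axis=0), axis=0)
```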
Fast and user-controllable music generation could enable novel ways of composing or performing music. However, state-of-the-art music generation systems require large amounts of data and computational resources for training and are slow at inference, making them impractical for real-time interactive use. In this work, we introduce Musika, a music generation system that can be trained on hundreds of hours of music using a single consumer GPU, and that can generate music of arbitrary length much faster than real time on a consumer CPU. We first learn a compact, invertible representation of spectrogram magnitudes and phases with adversarial autoencoders, and then train a generative adversarial network (GAN) on this representation for a particular music domain. A latent coordinate system makes it possible to generate arbitrarily long sequences of excerpts in parallel, while a global context vector keeps the music stylistically coherent over time. We perform quantitative evaluations to assess the quality of the generated samples and showcase user-control options for piano and techno music generation. We release the source code and pretrained autoencoder weights at github.com/marcoppasini/musika, such that a GAN can be trained on a new music domain with a single GPU in a matter of hours.
Although extensive research has produced efficient architectures and numerous augmentations for end-to-end image classification tasks, the current state of the art in audio classification still relies on multiple representations of the audio signal together with large architectures fine-tuned from large-scale datasets. By exploiting the inherently lightweight nature of audio and novel audio augmentations, we are able to present an efficient end-to-end network with strong generalization ability. Experiments on a variety of sound classification datasets demonstrate the effectiveness and robustness of our approach, which achieves state-of-the-art results in various settings. Public code is available at: https://github.com/alibaba-miil/audioclassfication
We propose a scalable and efficient neural waveform coding system for speech compression. We formulate the speech coding problem as an autoencoding task, in which a convolutional neural network (CNN) performs encoding and decoding during its feedforward routine as a neural waveform codec (NWC). The proposed NWC also defines quantization and entropy coding as trainable modules, so coding artifacts and bitrate control are handled during the optimization process. We achieve efficiency by introducing compact model components into the NWC, such as gated residual networks and depthwise separable convolutions. In addition, the proposed model has a scalable architecture, cross-module residual learning (CMRL), to cover a wide range of bitrates. To this end, we adopt the residual coding concept to cascade multiple NWC autoencoding modules, where each NWC module performs residual coding to recover any reconstruction loss that its preceding modules have created. CMRL can also scale down to cover lower bitrates, for which it employs a linear predictive coding (LPC) module as its first autoencoder. This hybrid design integrates LPC and NWC by redefining LPC quantization as a differentiable process, making the system trainable end to end. The decoder of the proposed system uses a single NWC (0.12 million parameters) in the low-to-medium bitrate range (12 to 20 kbps), or two NWCs in the high bitrate mode (32 kbps). Although the decoding complexity is not yet as low as that of conventional speech codecs, it is significantly lower than that of other neural speech coders, such as a WaveNet-based vocoder. For wideband speech coding quality, our system achieves performance comparable or superior to that of AMR-WB in listening tests at low and medium bitrates, and it can scale up to higher bitrates to achieve near-transparent performance.
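A schematic of the cross-module residual learning idea described above: each coding module encodes what the previous modules failed to reconstruct, and the decoded signals are summed. The small convolutional "codec modules" are placeholders for the actual NWC autoencoders, which also include quantization and entropy coding.

```python
import torch
import torch.nn as nn

class ResidualCascade(nn.Module):
    def __init__(self, n_modules=2):
        super().__init__()
        self.modules_list = nn.ModuleList(
            nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.Tanh(),
                          nn.Conv1d(16, 1, 9, padding=4))
            for _ in range(n_modules))

    def forward(self, x):                     # x: (B, 1, T)
        residual, decoded = x, []
        for codec in self.modules_list:
            y = codec(residual)               # this module codes the residual
            decoded.append(y)
            residual = residual - y           # what remains for the next module
        return torch.stack(decoded).sum(0)    # sum of all decoded layers
```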