Mel Spectrogram Inversion with Stable Pitch

Abstract

Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent vocoder models developed for speech achieve a high degree of realism, such that it is natural to wonder how they would perform on music signals. Compared to speech, the heterogeneity and structure of the musical sound texture offers new challenges. In this work we focus on one specific artifact that some vocoder models designed for speech tend to exhibit when applied to music: the perceived instability of pitch when synthesizing sustained notes. We argue that the characteristic sound of this artifact is due to the lack of horizontal phase coherence, which is often the result of using a time-domain target space with a model that is invariant to time-shifts, such as a convolutional neural network.

We propose a new vocoder model that is specifically designed for music. Key to improving the pitch stability is the choice of a shift-invariant target space that consists of the magnitude spectrum and the phase gradient. We discuss the reasons that inspired us to re-formulate the vocoder task, outline a working example, and evaluate it on musical signals (example reconstructions: https://machinelearning.apple.com/research/mel-spectrogram). Our method results in 60% and 10% improved reconstruction of sustained notes and chords with respect to existing models, using a novel harmonic error metric.

Bruno Di Giorgi*, Apple, bdigiorgi@apple.com
Mark Levy*, Apple, mark_levy@apple.com
Richard Sharp, Apple, richard_sharp@apple.com

* Equal contribution

1 Introduction

In modern speech synthesis pipelines, a first model generates a low-dimensional audio representation, usually the mel spectrogram, from text, and a second model, called a vocoder, transforms the mel spectrogram into an audio waveform. Theoretically, vocoders designed for speech could be directly applied to musical signals; however, closer inspection reveals features and constraints that are exclusive to the music domain. For example, unlike speech, music signals can be polyphonic and contain longer sustained notes whose pitch precision and stability are essential.

The stability of a sustained pitched note manifests in the time-domain audio signal as the steady repetition of a periodic waveform. Periodic patterns are by definition not shift-invariant, except for shifts of an integer number of periods; therefore they require some form of auto-regression in order to be reproduced accurately. As expected, time-domain vocoders using shift-invariant architectures [1, 2], despite other advantages such as generation efficiency, produce jitters that are perceived as pitch and timbre instability. For this reason, other time-domain generative models for audio include an autoregressive mechanism in the neural architecture [3, 4, 5]. In practice, time-domain models are required to learn all possible shifts of periodic patterns, a space that increases exponentially for polyphonic music, and how to create smooth sequences of these patterns.

Figure 1: Our proposed model for mel spectrogram inversion. A one dimensional CNN estimates the magnitude and the phase gradient from the mel spectrogram. The phase gradient is then integrated to estimate the phase spectrum and finally audio is obtained via the inverse STFT.

Inspired by a recent generative model for single notes [6], we propose a new vocoder model for music (Fig. 1), where the target of the neural network is an intermediate frequency-domain audio representation that is shift-invariant for sustained notes. This representation is composed of the magnitude spectrum and the phase gradient, and can be later turned to audio via: 1. a phase integration algorithm and 2. the inverse STFT. The proposed design can be used with an efficient shift-invariant neural architecture and still yield stable reconstruction of sustained notes. Specifically, our contributions include:

a formulation of the mel spectrogram inversion task, matching shift-invariant network and target, in order to improve the perceived stability of sustained notes
a phase integration algorithm
an evaluation metric measuring pitch stability for multiple notes

2 Background

Figure 2: Magnitude $M$ , phase $ϕ$ and phase gradient $\nabla ϕ$ patterns for a sinusoidal (top row) and an impulse (bottom row) signal. While the magnitude spectrum is easy to interpret visually, patterns in the phase spectrum are harder to decipher, but become evident in the phase gradient.

A discrete audio signal $x$ can be analyzed in the time-frequency space using the STFT:

X [m, n] = \sum_{i} x [i + n R] w [i] e^{- j ω_{m} i},

(1)

where $m$ and $n$ are the integer frequency bin and time frame indices, $R$ is the hop size between successive frames, $w$ is a window function defined in the $[- N / 2, N / 2)$ interval, with $N$ being the frame size and $ω_{m} = 2 π m / N$ the angular frequency. The STFT is a complex-valued matrix, as such it can be represented in the polar form:

X [m, n] = M [m, n] e^{j ϕ [m, n]} .

(2)

The magnitude component $M$ highlights the energy of the signal at various locations in the time-frequency grid: it is easier to interpret and more widely used than the phase component $ϕ$ . However, the phase spectrum is of primary perceptual importance to reconstruct the audio signal precisely, and while harder to interpret at first sight, it does contain patterns that can guide model design choices (Fig. 2).
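As an illustration of Eqs. (1) and (2), the following minimal sketch evaluates the STFT sum directly with NumPy and splits it into magnitude and phase. The function name `stft`, the Hann window, the small frame and hop sizes, and the zero padding that centers frame $n$ on sample $n R$ are assumptions made for the example, not choices stated in the text. The phase is taken as the argument of $X$, which is the sign convention under which Eq. (6) returns a positive frequency for a positive-frequency sinusoid.

```python
import numpy as np

def stft(x, N=64, R=16):
    """Direct evaluation of Eq. (1) for the non-negative frequency bins."""
    i = np.arange(-N // 2, N // 2)                     # window support [-N/2, N/2)
    w = 0.5 + 0.5 * np.cos(2 * np.pi * i / N)          # Hann window centered on i = 0
    m = np.arange(N // 2 + 1)                          # frequency bin indices
    kernel = np.exp(-2j * np.pi * np.outer(m, i) / N)  # e^{-j w_m i}
    x_pad = np.pad(x, N // 2)                          # frame n is centered on sample n * R
    n_frames = 1 + (len(x) - 1) // R
    frames = np.stack([x_pad[n * R:n * R + N] for n in range(n_frames)], axis=1)
    return kernel @ (frames * w[:, None])              # shape (N/2 + 1, n_frames)

# A sinusoid whose frequency falls exactly on bin 8 of a 64-sample frame.
t = np.arange(512)
x = np.cos(2 * np.pi * 8 * t / 64)

X = stft(x)
M, phi = np.abs(X), np.angle(X)                        # polar form, Eq. (2)
```

For frames that do not overlap the zero padding, `M` peaks at bin 8, while `phi` changes from frame to frame even though the signal is stationary.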

In Sect. 2.1 we describe two patterns that form in the magnitude and phase spectrum corresponding to the occurrence of ideal sinusoidal and impulsive signal components.

2.1 Sinusoidal and impulsive components

2.1.1 Sinusoidal components

In the magnitude spectrum, sinusoidal components such as any single harmonic of a pitched instrument’s sustained note, show up as horizontal lines (Fig. 2(a)). While the magnitude spectrum does not depend on the frame index $n$ , and is therefore shift-invariant, the phase spectrum depends linearly on $n$ (Fig. 2(b)), and the rate of change is given by the frequency of the sinusoidal component. Failing to reconstruct this linear relation between phase and time results in loss of horizontal phase coherence, perceived as unstable pitch, because errors are attributed to sudden changes of the frequency of the sinusoidal component.

2.1.2 Impulsive components

Impulsive components such as the attack of a percussion instrument show up as vertical lines in the magnitude spectrum (Fig. 2(d)). While the magnitude spectrum does not depend on the frequency index $m$ , the phase spectrum depends linearly on $m$ (Fig. 2(e)), and the rate of change depends on the offset between the location of the impulse and the frame center. Failing to reconstruct this linear relation between phase and frequency results in loss of vertical phase coherence, which is perceived as smeared transients, because the errors are attributed to the location of the impulse.

2.2 Phase gradient

The linear patterns that emerge in the phase for sinusoidal and impulsive components are better highlighted in the two components of the phase gradient $\nabla ϕ = (ϕ_{i}^{'}, ϕ_{m}^{'})$ .

The partial derivative of phase along the time dimension $ϕ_{i}^{'}$ is called instantaneous frequency. For the bins that belong to sinusoidal components, $ϕ_{i}^{'}$ is constant and the phase can be propagated horizontally:

ϕ [m, n + 1] = ϕ [m, n] + R ϕ_{i}^{'} [m, n] .

(3)

The partial derivative along the frequency dimension $ϕ_{m}^{'}$ is called local group delay. For the bins that belong to impulsive components, $ϕ_{m}^{'}$ is constant and the phase can be propagated vertically:

ϕ [m + 1, n] = ϕ [m, n] + ϕ_{m}^{'} [m, n] .

(4)
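The two propagation rules can be checked numerically on ideal signals. The sketch below, which assumes a Hann window and small illustrative values of $N$ and $R$, evaluates single STFT frames from Eq. (1) and measures how far Eq. (3) and Eq. (4) are from the actual phase of a complex sinusoid and of a unit impulse. All helper names are hypothetical.

```python
import numpy as np

N, R = 64, 16                                      # illustrative frame and hop sizes
i = np.arange(-N // 2, N // 2)                     # window support [-N/2, N/2)
w = 0.5 + 0.5 * np.cos(2 * np.pi * i / N)          # Hann window centered on i = 0
m = np.arange(N // 2 + 1)                          # frequency bin indices
kernel = np.exp(-2j * np.pi * np.outer(m, i) / N)  # e^{-j w_m i} from Eq. (1)

def frame_phase(signal, n):
    """Phase of X[:, n]; `signal` maps an array of sample indices to values."""
    return np.angle(kernel @ (signal(i + n * R) * w))

def wrap(p):
    """Wrap phase differences to (-pi, pi]."""
    return np.angle(np.exp(1j * p))

# Sinusoid: constant instantaneous frequency, so Eq. (3) predicts the next frame.
omega0 = 2 * np.pi * 10.3 / N                      # rad/sample

def sinusoid(t):
    return np.exp(1j * omega0 * t)

time_error = wrap(frame_phase(sinusoid, 0) + R * omega0 - frame_phase(sinusoid, 1))

# Impulse t0 samples after the center of frame 0: constant local group delay,
# so Eq. (4) predicts the next bin.
t0 = 5
group_delay = -2 * np.pi * t0 / N                  # rad/bin

def impulse(t):
    return (t == t0).astype(float)

phase = frame_phase(impulse, 0)
freq_error = wrap(phase[:-1] + group_delay - phase[1:])
```

Both `time_error` and `freq_error` vanish up to floating-point rounding, which is the sense in which the phase "can be propagated" along one axis for each kind of component.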

In the time-frequency reassignment literature (see e.g. [7]), the phase gradient components are used to assign the energy of a spectral bin $(m, n)$ to a nearby point of maximum contribution $(\dot{m}, \dot{n})$:

\begin{matrix} \dot{m} [m, n] & = m + Δ m [m, n] \\ \dot{n} [m, n] & = n + Δ n [m, n], \end{matrix}

(5)

where $Δ n [m, n]$ (Fig. 2(f)) and $Δ m [m, n]$ (Fig. 2(c)) represent time and frequency bin offsets and are derived from the phase gradient:

\begin{matrix} Δ m [m, n] & = ϕ_{i}^{'} [m, n] \frac{N}{2 π} - m \\ Δ n [m, n] & = - ϕ_{m}^{'} [m, n] \frac{N}{2 π R} . \end{matrix}

(6)
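A short worked example may help fix the units in Eq. (6). The sketch below assumes that $ϕ_{i}^{'}$ is expressed in radians per sample and $ϕ_{m}^{'}$ in radians per bin, which is what Eqs. (3) and (4) imply; the function names are hypothetical.

```python
import math

def bin_offsets(phi_t, phi_f, m, N, R):
    """Eq. (6): frequency and time bin offsets from the phase gradient.

    phi_t: phase derivative along time (rad/sample)
    phi_f: phase derivative along frequency (rad/bin)
    """
    delta_m = phi_t * N / (2 * math.pi) - m
    delta_n = -phi_f * N / (2 * math.pi * R)
    return delta_m, delta_n

def phase_gradient(delta_m, delta_n, m, N, R):
    """Inverse of Eq. (6): recover the phase gradient from the bin offsets."""
    phi_t = 2 * math.pi * (m + delta_m) / N
    phi_f = -2 * math.pi * R * delta_n / N
    return phi_t, phi_f

N, R = 2048, 256
# A sinusoid at 10.3 bins observed in bin 10, and an impulse located
# 64 samples after the center of the frame.
phi_t = 2 * math.pi * 10.3 / N
phi_f = -2 * math.pi * 64 / N
delta_m, delta_n = bin_offsets(phi_t, phi_f, m=10, N=N, R=R)
```

Here `delta_m` is 0.3 bins (the sinusoid lies slightly above the center of bin 10) and `delta_n` is 0.25 frames (64 samples at a hop of 256).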

In the following sections, we discuss how the phase gradient can be used for mel spectrogram inversion.

2.3 Mel spectrogram inversion

The log-amplitude mel spectrogram (simply mel spectrogram from now on) is a low-resolution time-frequency representation that is derived from the power spectrogram $M^{2}$ , by first warping the frequency axis using the mel scale, then scaling the values to log-amplitude. Estimating the original audio signal $x$ from the mel spectrogram requires recovering the information that has been lost in the direct computation, i.e. the phase information and the linearly-spaced and higher frequency resolution of the magnitude spectrum.
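The direct computation described above can be sketched as follows. The text does not specify the mel formula, the filter shape, or the logarithm base, so the HTK-style mel scale, the triangular filters, the natural logarithm, and the `eps` floor used here are all assumptions, as are the function names. The frame size, sample rate, and number of bands match the values reported in Sect. 4.1.

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel scale (an assumption; the text does not name a variant)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel) / 2595.0) - 1.0)

def mel_filterbank(n_mels, N, sr):
    """Triangular filters mapping N/2 + 1 linear bins to n_mels mel bands."""
    bin_hz = np.arange(N // 2 + 1) * sr / N
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2))
    lo, ctr, hi = edges[:-2, None], edges[1:-1, None], edges[2:, None]
    rising = (bin_hz - lo) / (ctr - lo)
    falling = (hi - bin_hz) / (hi - ctr)
    return np.clip(np.minimum(rising, falling), 0.0, None)

def log_mel(power_spec, fb, eps=1e-10):
    """Warp the power spectrogram M^2 to the mel scale, then take log-amplitude."""
    return 0.5 * np.log(fb @ power_spec + eps)

fb = mel_filterbank(n_mels=96, N=2048, sr=44100)
mel = log_mel(np.ones((1025, 4)), fb)   # dummy power spectrogram with 4 frames
```

The matrix `fb` has far fewer rows than columns, so the mapping discards frequency resolution in addition to phase; this is the information the inversion model has to recover.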

While the majority of the recent approaches try to learn this inverse transformation end-to-end, this is especially hard for a polyphonic music signal. To precisely reproduce a sustained note, an end-to-end model needs to learn: 1. different patterns for every combination of phase shift and period of a periodic waveform, and 2. how to activate them in the right sequence [6]. Accomplishing both tasks is challenging for speech and arguably even more so for music, which contains generally longer pitched sounds with wider pitch range, possibly multiple concurrent fundamental frequencies (polyphony), and whose absolute precision is essential.

Instead of reconstructing the signal in the time domain, we propose to use as output space an intermediate time-frequency representation consisting of three channels: the magnitude spectrum and the two components of the phase gradient: $(M, ϕ_{i}^{'}, ϕ_{m}^{'})$ . The phase gradient is later integrated to estimate the phase spectrum $ϕ$ , and finally audio is computed via the inverse STFT.

A model trained on our proposed output representation does not need to learn: 1. the shift variations of periodic waveforms, as those are explicitly modeled by the inverse STFT, and 2. how to sequence phase, which is handled via the phase integration algorithm. Unlike the phase spectrum, the phase derivative along time is shift-invariant; thus it is a more suitable target for a shift-invariant architecture, such as the convolutional neural network.

The approaches that have been suggested in recent years for neural audio synthesis in the time domain have to use auto-regression to achieve horizontal phase coherence. For example, autoregression is at the core of models like WaveNet [3] and WaveRNN [4]; however, the fact that it is applied at the rate of audio samples makes these models prohibitively expensive for the generation of high-resolution audio signals.

Audio domain shift-invariant convolutional neural vocoders can generate audio samples with much higher efficiency, but are not suited to reconstruct long pitched components precisely. This holds regardless of the training strategy, and includes for example generative adversarial networks (GAN) based models [1, 2] and diffusion based models [8, 9]. A recent neural vocoder for speech mel spectrogram inversion [5] adds an autoregressive loop that works on chunks of audio. The autoregressive nature of this architecture allows performing temporal integration, the operation needed to reconstruct stable sinusoidal components, while advancing by audio chunks rather than samples improves efficiency. However, the poor reconstruction quality observed when applying this model to music signals suggests that it is difficult to learn signal properties such as the rotation of phase from data with sufficient generalization.

In the neural audio synthesis literature, using instantaneous frequency has been considered explicitly in [6], where it is generated alongside the magnitude spectrum in order to reconstruct the audio of single notes, conditioned on the pitch contour and a timbre embedding.

2.4 Phase integration

Recovering the phase spectrum from the phase gradients requires an integration step. Theoretically, perfect integration should be possible under specific constraints, such as continuous phase gradient spectrum and window functions with infinite support. In practice, however, there is no closed-form solution for integrating typical discrete phase gradient spectra [10].

Well-known phase gradient integration algorithms have been developed in the Time-Scale Modification (TSM) literature. The standard Phase-Vocoder (PV) algorithm propagates the phase derivative along the time dimension to modify the duration of sinusoidal components [11]. The PV is able to preserve horizontal phase coherence, but struggles with vertical phase coherence, leading to smeared transients. Improvements to the standard phase-vocoder algorithm [12, 13, 14] use the magnitude spectrum to identify sinusoidal and/or impulsive components, and can propagate phase in either direction (time or frequency) depending on the local properties of the signal.

The phase gradient integration algorithm that we develop (Sect. 3.2) is inspired by these recent variations, and leads to subjectively improved reconstruction quality, alongside increased computational efficiency. A formal evaluation of the integration algorithm is out of the scope of this manuscript and left as future work.

3 Model

In this section we describe our proposed model for mel spectrogram inversion. The model is composed of a time-wise convolutional neural network (Sect. 3.1) that estimates magnitude and phase gradient from the mel spectrogram, and a phase integration algorithm (Sect. 3.2) that estimates the phase spectrum given the phase gradient. The time-domain reconstructed audio is finally obtained via inverse STFT from the magnitude and phase spectra.

3.1 Network architecture

The neural network is a stack of 8 1-d (time) convolutional layers with 1536 hidden channels, a kernel size of 3 frames and ReLU activations (Fig. 3). The input and output have the same number of time frames, but different numbers of frequency bins, and their center frequencies are different, i.e. log-spaced for the mel spectrogram input and linearly spaced for the magnitude and phase gradient outputs. The frequency bins of the input and the magnitude channel of the output are independently standardized using the mean and standard deviation values computed from the training set. We found that a direct path from the input to the magnitude channel of the output leads to significant improvements in the reconstruction of magnitude and the training speed. This direct path consists only of a frequency warping operation from mel- to linear-scale.

Figure 3: Convolutional network architecture

The phase gradients $(ϕ_{i}^{'}, ϕ_{m}^{'})$ are computed using the Auger-Flandrin technique [15]. Although we have not experimented with other ways to compute the phase gradient, it was argued [16] that using a less precise method, such as finite differences, might perform just as well. Instead of using the phase gradients directly, we use the vertical and horizontal bin offsets (Eq. (6)), which are derived from the two components of the phase gradient. In order to remove outliers, the absolute bin offsets are clipped to $4.0$ on the frequency dimension and to $N / (2 R)$ on the time dimension.

The model uses linear output activation on all output channels except for the magnitude channel, where we apply a scaled $tanh$ activation $f_{β} (x) = β tanh (x / β)$ with $β = 5$ to use mostly the linear regime while also preventing overflow [6]. The three channels are then scaled and offset appropriately to match the statistics of the targets.
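The behaviour of the scaled $tanh$ activation is easy to verify numerically. The sketch below uses NumPy and the hypothetical name `scaled_tanh`; only the formula and the value $β = 5$ come from the text.

```python
import numpy as np

def scaled_tanh(x, beta=5.0):
    """f_beta(x) = beta * tanh(x / beta): near-identity for |x| << beta, bounded by beta."""
    return beta * np.tanh(np.asarray(x, dtype=float) / beta)

small = scaled_tanh(0.1)      # close to the input: linear regime
large = scaled_tanh(1.0e6)    # saturates instead of overflowing
```

For standardized magnitudes, which mostly fall within a few units of zero, the activation is close to the identity, while extreme pre-activations are limited to the interval $[-β, β]$.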

3.1.1 Losses

The magnitude channel is trained with a Mean Square Error (MSE) loss term, and a further MSE loss term computed on the first 20 linear-frequency cepstral coefficients (LFCC).

L_{1} = (\hat{M} - M)^{2},

(7)

L_{2} = ({DCT}_{: 20} (\hat{M}) - {DCT}_{: 20} (M))^{2},

(8)

where $\hat{M}$ indicates the estimated magnitude spectrum, $M$ the target magnitude spectrum, and DCT the normalized Discrete Cosine Transform; in these and the following loss formulas the $[m, n]$ indices and the global average operation have been omitted for simplicity. While $L_{1}$ acts on point estimates, $L_{2}$ pushes the spectral envelope towards its true value, leading to faster convergence and better reconstruction quality.

The phase gradient channels are trained with another MSE loss, weighted by the power spectrum of the target signal $M^{2}$ . A matrix $λ \in [0, 1]$ , computed from the phase gradient (Sect. 3.2), is used to distinguish sinusoidal and impulsive components. The idea is that the phase derivative along the time/frequency dimension contributes to the loss only for sinusoidal/impulsive components:

L_{3} = \begin{cases} M^{2} (\hat{Δ m} - Δ m)^{2} & λ > 0.5 \\ M^{2} (\hat{Δ n} - Δ n)^{2} & λ \leq 0.5, \end{cases}

(9)

where $(\hat{Δ m}, \hat{Δ n})$ are the estimated bin offsets.

Finally, because $λ$ is a function of $\nabla ϕ$ and is used for integration, we add a loss term:

L_{4} = M^{2} (\hat{λ} - λ)^{2},

(10)

where $\hat{λ}$ is computed using the phase gradient estimates and $λ$ using the target values. The final loss is a weighted sum of all the loss terms:

L = \sum_{l} α_{l} L_{l},

(11)

where all weights are set to 1 except $α_{2} = 0.1$ to balance the contribution of all terms during training.
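The four loss terms and their weighted sum can be written compactly in NumPy. The sketch below is a reading of Eqs. (7)-(11) under stated assumptions: the DCT is an orthonormal DCT-II taken along the frequency axis, the global average is a plain mean over all entries, the target $λ$ selects the branch in Eq. (9), and the dictionary layout and function names are hypothetical. It computes loss values only and is not tied to any training framework.

```python
import numpy as np

def dct_matrix(n_bins, n_coef):
    """First n_coef rows of the orthonormal DCT-II matrix over n_bins points."""
    k = np.arange(n_coef)[:, None]
    m = np.arange(n_bins)[None, :]
    D = np.sqrt(2.0 / n_bins) * np.cos(np.pi * (m + 0.5) * k / n_bins)
    D[0] /= np.sqrt(2.0)
    return D

def total_loss(est, tgt, alphas=(1.0, 0.1, 1.0, 1.0), n_coef=20):
    """est, tgt: dicts of (bins, frames) arrays with keys 'M', 'dm', 'dn', 'lam'."""
    D = dct_matrix(tgt["M"].shape[0], n_coef)
    power = tgt["M"] ** 2                                    # weighting by M^2
    l1 = np.mean((est["M"] - tgt["M"]) ** 2)                 # Eq. (7)
    l2 = np.mean((D @ est["M"] - D @ tgt["M"]) ** 2)         # Eq. (8)
    err = np.where(tgt["lam"] > 0.5,
                   (est["dm"] - tgt["dm"]) ** 2,             # sinusoidal bins
                   (est["dn"] - tgt["dn"]) ** 2)             # impulsive bins
    l3 = np.mean(power * err)                                # Eq. (9)
    l4 = np.mean(power * (est["lam"] - tgt["lam"]) ** 2)     # Eq. (10)
    terms = (l1, l2, l3, l4)
    return sum(a * l for a, l in zip(alphas, terms)), terms  # Eq. (11)

# Toy example: the estimate differs from the target by a constant magnitude offset.
shape = (40, 6)
tgt = {"M": np.ones(shape), "dm": np.zeros(shape),
       "dn": np.zeros(shape), "lam": np.ones(shape)}
est = dict(tgt, M=tgt["M"] + 1.0)
loss, terms = total_loss(est, tgt)
```

In the toy example a constant offset changes only the zeroth cepstral coefficient, so the $L_{2}$ term reacts to an envelope-level error while $L_{3}$ and $L_{4}$ remain zero.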

Unlike time-domain methods for mel spectrogram inversion, we found reconstruction losses to yield satisfying results and did not add any adversarial loss. Evaluating the advantages of including adversarial losses is left for future research.

3.2 Phase integration

The algorithm we use for integrating phase from phase gradients relies on the classification of spectral bins into either sinusoidal, transient or noise components.

The classification uses the phase gradient and relies on the following rationale [7]: around sinusoidal/impulsive components the reassigned frequency/time is approximately constant along frequency/time

λ [m, n] = e^{- (\frac{d}{d m} \dot{m} [m, n] / \frac{d}{d n} \dot{n} [m, n])^{2}},

(12)

where the derivatives $d / d m$ and $d / d n$ are computed with centered finite differences.

After computing $λ$, the phase is propagated horizontally (Eq. (3)) if $λ > λ^{S}$, vertically (Eq. (4)) if $λ < λ^{I}$, and set to a random value otherwise. $λ^{I}$ and $λ^{S}$ are threshold values for impulsive and sinusoidal components, and are used to identify the spectral bins over which respectively vertical or horizontal phase coherence should be enforced. We empirically set $λ^{I} = 0.4$ and $λ^{S} = 0.5$, as these values performed well on early trials.
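The classification and propagation rules can be sketched as below. This is a simplified illustration and not the full algorithm: the text does not specify the order in which bins are visited, which bin's $λ$ decides a propagation step, or how division by zero in Eq. (12) is handled. The sketch assumes a frame-by-frame sweep, uses the $λ$ of the destination bin, adds a small `eps` to the denominator, and relies on `np.gradient`, which is centered in the interior and one-sided at the edges. All names are hypothetical.

```python
import numpy as np

def sinusoidality(delta_m, delta_n, eps=1e-8):
    """Eq. (12): close to 1 for sinusoidal bins, close to 0 for impulsive bins."""
    n_bins, n_frames = delta_m.shape
    m_dot = np.arange(n_bins)[:, None] + delta_m       # reassigned frequency, Eq. (5)
    n_dot = np.arange(n_frames)[None, :] + delta_n     # reassigned time, Eq. (5)
    d_m = np.gradient(m_dot, axis=0)                   # d/dm of reassigned frequency
    d_n = np.gradient(n_dot, axis=1)                   # d/dn of reassigned time
    return np.exp(-(d_m ** 2) / (d_n ** 2 + eps))      # eps avoids division by zero

def integrate_phase(delta_m, delta_n, lam, N, R, lam_s=0.5, lam_i=0.4, seed=0):
    """Propagate phase along time (Eq. (3)) or frequency (Eq. (4)) per bin."""
    rng = np.random.default_rng(seed)
    n_bins, n_frames = lam.shape
    m = np.arange(n_bins)[:, None]
    phi_t = 2 * np.pi * (m + delta_m) / N              # invert Eq. (6), rad/sample
    phi_f = -2 * np.pi * R * delta_n / N               # invert Eq. (6), rad/bin
    phi = rng.uniform(-np.pi, np.pi, size=lam.shape)   # default: random phase
    for n in range(n_frames):
        if n > 0:                                      # horizontal propagation
            s = lam[:, n] > lam_s
            phi[s, n] = phi[s, n - 1] + R * phi_t[s, n - 1]
        for k in range(1, n_bins):                     # vertical propagation
            if lam[k, n] < lam_i:
                phi[k, n] = phi[k - 1, n] + phi_f[k - 1, n]
    return phi

# Toy field: every bin is reassigned to the same frequency of 3.3 bins.
N, R = 64, 16
delta_m = 3.3 - np.arange(8)[:, None] * np.ones((1, 5))
delta_n = np.zeros((8, 5))
lam = sinusoidality(delta_m, delta_n)
phi = integrate_phase(delta_m, delta_n, lam, N, R)
```

In the toy field all bins are classified as sinusoidal, so the phase of every bin advances by $R$ times the same instantaneous frequency at each frame, which is the behaviour that keeps a sustained partial stable.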

4 Experiments

In this section we discuss how we evaluate the pitch stability of the proposed model, comparing to strong baseline vocoder models from the speech synthesis literature.

4.1 Experimental Setup

We compare the reconstruction of the proposed phase-gradient model against state-of-the-art approaches: melgan [1] (https://github.com/descriptinc/melgan-neurips), hifigan [2] (https://github.com/kan-bayashi/ParallelWaveGAN.git), cargan [5] (https://github.com/descriptinc/cargan), and diffwave [8] (https://github.com/lmnt-com/diffwave). All models have been trained on the same data, which we refer to as the Ambient dataset: 13 hours of ambient music loops from commercial libraries, split into training, validation and test sets in the ratio of 80/10/10. We used the following loop packs licensed from Big Fish Audio Ltd.: Ambient Piano, Ambient Skyline 3, Ambient Waves, Eclipse: Ambient Guitars, Ethereal Harp, Zen Ambient Vol. 2. The audio from these loop libraries is converted to mono at 44.1 kHz, 16-bit; the spectrograms are computed with a frame size of 2048 samples and a hop size of 256 samples; and 96 bands are used for the mel spectrogram.

All neural networks were trained from scratch. The phase-gradient network was trained using 2 Volta GPUs in parallel and batches of 32 examples, with the Adam optimizer and learning rate set to $3 \times 10^{-5}$. The training was stopped after 1024 epochs, when the validation loss converged, which took approximately 2 days.

As a further non-machine-learned baseline, we consider a reconstruction algorithm, griffin-lim, which generates audio from the mel spectrogram by first inverting the frequency warping and the log scaling of the values to obtain a magnitude spectrogram, then applying the Griffin-Lim algorithm [17] for 500 iterations.

Figure 4: Harmonic error of different mel spectrogram inversion models for synthesized notes in the (C2, C7) range (top row). The phase-gradient model achieves lower error than the baseline models over the entire range. The values have been smoothed with a moving average with size/stride equal to 12/6 semitones to filter out noise. The bottom row shows the error when adding more notes in different combinations, where the error is averaged over the entire pitch range, and the numbers on the x axis indicate the intervals in semitones that are played simultaneously, e.g. "0, 4, 7" is the major triad.

Figure 5: Reconstructions of an E2 nylon guitar note using different mel spectrogram inversion models. The figure shows the reassigned spectrograms [7] which highlight the instability on the fundamental and the harmonics caused by the lack of temporal phase coherence.

4.2 Pitch stability

To evaluate the pitch stability of our model, we used FluidSynth (http://www.fluidsynth.org) to synthesize a dataset of one-second-long notes and chords in the (C2, C7) range, using a set of four different sounds: a rhodes piano, a church organ, a string ensemble and a nylon guitar.

Measuring stability with a pitch tracker is only feasible for single notes, and we found that errors computed with a pitch tracker even for monophonic signals did not reflect our perception of reconstruction quality. A possible explanation is that pitch-trackers average out errors in the harmonic frequencies, which is inconvenient in this context because the precise location of the overtones is important for timbre perception [18].

For this reason, we define a harmonic error metric $H_{err}$ as the sum of the frequency errors of the fundamental and the first 4 harmonic frequencies, expressed in the pitch scale:

H_{err} [p, h, n] = 12 | {log}_{2} (\frac{\hat{f}_{p, h} [n]}{f_{p, h} [n]}) |,

(13)

where $f_{p, h} [n]$ and $\hat{f}_{p, h} [n]$ are the frequencies of the $h$-th harmonic of the $p$-th note, at frame $n$, for the original and the reconstructed audio respectively; the frequencies are estimated as the closest peaks to the nominal frequency of the partial, using quadratic interpolation, on a magnitude spectrum computed using frame/hop size equal to 4096/256 samples. We look at the mean and maximum values of the harmonic error, as a way to summarize the expected and worst case reconstruction errors.
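The per-harmonic error of Eq. (13) and the peak refinement step can be illustrated with a few lines of standard-library Python. The sketch assumes a three-point parabolic fit around the local maximum closest to the nominal bin; the text does not say whether the fit is applied to linear or log magnitudes, and the function names are hypothetical.

```python
import math

def closest_peak(mag, k_nominal):
    """Index of the local maximum of `mag` closest to the nominal bin."""
    peaks = [k for k in range(1, len(mag) - 1) if mag[k - 1] < mag[k] >= mag[k + 1]]
    return min(peaks, key=lambda k: abs(k - k_nominal))

def quadratic_peak(mag, k):
    """Fractional peak position from a parabola through bins k-1, k, k+1."""
    a, b, c = mag[k - 1], mag[k], mag[k + 1]
    denom = a - 2 * b + c
    return float(k) if denom == 0 else k + 0.5 * (a - c) / denom

def harmonic_error(f_hat, f):
    """Eq. (13): absolute frequency error in semitones."""
    return 12.0 * abs(math.log2(f_hat / f))

# A parabolic magnitude profile with its true maximum at 3.3 bins.
mag = [-(k - 3.3) ** 2 for k in range(8)]
k_peak = closest_peak(mag, k_nominal=4)
refined = quadratic_peak(mag, k_peak)

# A reconstructed partial that is one equal-tempered semitone sharp.
error = harmonic_error(440.0 * 2 ** (1 / 12), 440.0)
```

Expressing the error on the pitch scale makes deviations comparable across registers: a ratio of $2^{1/12}$ counts as one semitone whether the partial is at 100 Hz or at 4 kHz.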

Model	FAD ($↓$) Ambient	FAD ($↓$) NSynth	FAD ($↓$) N+C	$H_{err}$ ($↓$) Notes	$H_{err}$ ($↓$) Chords	MOS ($↑$) NSynth	RTF ($↑$)	#Params
griffin-lim [17]	10.59	6.16	6.79	0.28	0.24	1.42 $\pm$ 0.11	7.14	0
melgan [1]	2.07	2.80	2.84	0.36	0.30	1.58 $\pm$ 0.15	179.90	4M
cargan [5]	6.47	8.31	8.45	0.21	0.21	1.74 $\pm$ 0.12	4.36	25M
hifigan [2]	0.85	1.31	1.81	0.16	0.17	2.89 $\pm$ 0.12	68.12	13M
diffwave [8]	2.62	6.84	1.89	0.14	0.15	3.19 $\pm$ 0.12	0.32	7M
phase-gradient	1.26	1.26	1.86	0.09	0.14	3.73 $\pm$ 0.13	3.58	28M
oracle	0.51	0.33	0.00	0.00	0.00	4.34 $\pm$ 0.15	n/a	n/a

Table 1: Reconstruction results

Results show that our phase-gradient model was able to reconstruct the single notes with lower mean and maximum harmonic error over the entire range of notes, especially in the lower pitch range (Fig. 4(a)). A possible explanation for the larger improvement on the low register is that the long waveform period of lower-pitched notes makes them challenging to learn in the time-domain, given the high number of phase variations. A visualization of the different models’ reconstruction of the same E2 guitar note is provided in Fig. 5.

We also synthesized different combinations of notes, to test the reconstruction quality on more challenging signals. The combinations include octaves, major 12ths, perfect fifths, an open- and a close-position voicing of the major triad, and a close-position voicing of the major seventh chord. As expected, the harmonic error increases when adding more notes (Fig. 4(b)), as they are harder to recognize in the input mel spectrogram. The phase-gradient model is able to yield lower mean and maximum harmonic error on all the considered combinations of notes. However, the decrease in the reconstruction quality, particularly on close-position chord voicings, suggests possible connections with the target’s frequency resolution.

4.3 Reconstruction quality

To evaluate the overall reconstruction quality of the proposed model, we compute the Fréchet Audio Distance (FAD) [19] on the Ambient dataset, the dataset of 1920 one-second-long notes and chords (“N+C”) used in Sect. 4.2, and the NSynth dataset [20]. The results are shown in Table 1 alongside aggregated mean harmonic error results from the experiment discussed in Sect. 4.2.

The FAD metric compares embedding statistics generated on two potentially different sets of audio signals, i.e. evaluation and reference set. On the Ambient and NSynth datasets we compute the FAD between the reconstructed test split (evaluation) and the original training split (reference). On N+C, the reconstructed and original signals from the entire dataset are used as evaluation and reference sets. An ideal model (oracle) is added to provide reference values, useful when the evaluation and reference sets are different.
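For readers unfamiliar with the metric, the Fréchet distance underlying FAD fits a Gaussian to each set of embeddings and compares the two Gaussians. The sketch below assumes the embeddings have already been extracted (FAD [19] uses a pretrained audio classifier for that step, which is not reproduced here) and computes the trace of the matrix square root from the eigenvalues of the covariance product; the function name is hypothetical.

```python
import numpy as np

def frechet_distance(emb_eval, emb_ref):
    """Frechet distance between Gaussians fitted to two (n_items, dim) embedding sets."""
    mu_e, mu_r = emb_eval.mean(axis=0), emb_ref.mean(axis=0)
    cov_e = np.cov(emb_eval, rowvar=False)
    cov_r = np.cov(emb_ref, rowvar=False)
    # tr((cov_e cov_r)^(1/2)) equals the sum of square roots of the eigenvalues
    # of the product, which are real and non-negative up to rounding.
    eig = np.linalg.eigvals(cov_e @ cov_r)
    tr_sqrt = np.sqrt(np.clip(eig.real, 0.0, None)).sum()
    diff = mu_e - mu_r
    return float(diff @ diff + np.trace(cov_e) + np.trace(cov_r) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))                 # stand-in embeddings
shifted = reference + np.array([1.0, 2.0, 0.0, 0.0])  # same covariance, shifted mean

same = frechet_distance(reference, reference)
moved = frechet_distance(shifted, reference)
```

When the two sets share the same covariance, the distance reduces to the squared distance between the means, which is why the score depends on which reference set is chosen and why an oracle row is useful for calibration.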

The aggregated mean harmonic error results are computed over two meaningful subsets of the N+C dataset: a “Notes” dataset, representing all single notes, and a “Chords” dataset, containing all note combinations with more than one pitch class (see Fig. 4 bottom row).

We conducted a small listening test to evaluate a set of notes reconstructed with different models. The set includes G2 and G3 notes randomly selected from NSynth, for each instrument family. The 5-point Mean Opinion Score (MOS) values and the 95% confidence intervals are shown in Table 1.

The results show that the phase-gradient model is competitive with other state-of-the-art models, despite having a simpler neural network and training procedure. The fact that the hifigan model is able to score lower FAD than the proposed model on the Ambient and N+C datasets suggests it has higher reconstruction accuracy on different sonic characteristics.

Specifically, we noticed that hifigan was able to reconstruct impulsive components such as transients and percussive onsets with higher energy and often more accurately than the phase-gradient model, while struggling with the stability of pitched notes and chords. The diffwave model exhibited high frequency “hissing” noise, but was otherwise surprisingly stable on harmonic components. This quality likely stems from the wide ~7s receptive field, obtained using the entire reverse process at generation time instead of a fast sampling schedule [8]. As reported by its authors in [5], we confirm that the reconstructions made by the cargan model contained “boundary artifacts that appear as repeated clicks”, compromising their usability. Finally, the reconstructions obtained with the melgan model were characterized by heavy phase artifacts such as metallic sounds, and very unstable pitch.

Generation time is also shown in Table 1 in terms of real-time factor (RTF), defined as the number of seconds of audio that can be generated per second, evaluated on a single NVIDIA Volta GPU. The phase-gradient model generates audio faster than real time, at a speed close to cargan. While the neural network of the phase-gradient model is as fast as hifigan and melgan, 99% of the generation time is spent during the auto-regressive phase integration stage, suggesting a clear direction for optimization.

5 Conclusions

In this work we have proposed a new mel spectrogram inversion model designed for music that achieves improved reconstruction of sustained notes and chords, compared to state-of-the-art models from the speech synthesis literature. This improvement is obtained using a frequency-domain target representation that is time shift invariant for harmonic signal components. The proposed model is able to reconstruct single notes and chords respectively 60% and 10% more precisely than existing models, when evaluated with a novel harmonic error metric, while still being competitive on generic loop reconstruction. Potential directions for improvement include using pitch-shift augmentation, investigating log-frequency target representations, and training a separate time-domain model to supply the percussive components.

6 Acknowledgements

We would like to thank the following colleagues for their valuable help and feedback: David Varas Gonzalez, Tim O’Brien, Avery Wang, Meghna Ranjit, and André Bergner.

References

[1] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in neural information processing systems, vol. 32, 2019.
[2] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis,” Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
[3] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” in ISCA Speech Synthesis Workshop (SSW), 2016.
[4] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” in International Conference on Machine Learning (ICML), 2018.
[5] M. Morrison, R. Kumar, K. Kumar, P. Seetharaman, A. Courville, and Y. Bengio, “Chunked autoregressive gan for conditional waveform synthesis,” in International Conference on Learning Representations (ICLR), 2022.
[6] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, “Gansynth: Adversarial neural audio synthesis,” in International Conference on Learning Representations (ICLR), 2019.
[7] K. R. Fitz and S. A. Fulop, “A unified theory of time-frequency reassignment,” arXiv preprint arXiv:0903.3080, 2009.
[8] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “Diffwave: A versatile diffusion model for audio synthesis,” in International Conference on Learning Representations (ICLR), 2021.
[9] N. Kandpal, O. Nieto, and Z. Jin, “Music enhancement via image translation and vocoding,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 3124–3128.
[10] Z. Pruša and P. L. Søndergaard, “Real-time spectrogram inversion using phase gradient heap integration,” in International Conference on Digital Audio Effects (DAFx), 2016, pp. 17–21.
[11] M. Dolson, “The phase vocoder: A tutorial,” Computer Music Journal, vol. 10, no. 4, pp. 14–27, 1986.
[12] Z. Pruša and N. Holighaus, “Phase vocoder done right,” in IEEE European Signal Processing Conference (EUSIPCO), 2017, pp. 976–980.
[13] J. Laroche and M. Dolson, “Improved phase vocoder time-scale modification of audio,” IEEE Transactions on Speech and Audio Processing, vol. 7, no. 3, pp. 323–332, 1999.
[14] E.-P. Damskägg and V. Välimäki, “Audio time stretching using fuzzy classification of spectral bins,” Applied Sciences, vol. 7, no. 12, p. 1293, 2017.
[15] F. Auger and P. Flandrin, “Improving the readability of time-frequency and time-scale representations by the reassignment method,” IEEE Transactions on signal processing, vol. 43, no. 5, pp. 1068–1089, 1995.
[16] S. A. Fulop and K. Fitz, “Algorithms for computing the time-corrected instantaneous frequency (reassigned) spectrogram, with applications,” Journal of the Acoustical Society of America, vol. 119, no. 1, pp. 360–371, 2006.
[17] N. Perraudin, P. Balazs, and P. L. Søndergaard, “A fast griffin-lim algorithm,” in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013, pp. 1–4.
[18] H. Fletcher, E. D. Blackham, and R. A. Stratton, “Quality of piano tones,” Journal of the Acoustical Society of America, vol. 34, pp. 749–761, 1962.
[19] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in ISCA Interspeech, 2019, pp. 2350–2354.
[20] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in International Conference on Machine Learning (ICML), 2017.