Improving Natural-Language-based Audio Retrieval
with Transfer Learning and Audio & Text Augmentations

Abstract

The absence of large labeled datasets remains a significant challenge in many application areas of deep learning. Researchers and practitioners typically resort to transfer learning and data augmentation to alleviate this issue. We study these strategies in the context of audio retrieval with natural language queries (Task 6b of the DCASE 2022 Challenge). Our proposed system uses pre-trained embedding models to project recordings and textual descriptions into a shared audio-caption space in which related examples from different modalities are close. We employ various data augmentation techniques on audio and text inputs and systematically tune their corresponding hyperparameters with sequential model-based optimization. Our results show that the augmentation strategies used reduce overfitting and improve retrieval performance. We further show that pre-training the system on the AudioCaps dataset leads to additional improvements.

Paul Primus $^{1}$, Gerhard Widmer $^{1, 2}$
$^{1}$ Institute of Computational Perception (CP-JKU)
$^{2}$ LIT Artificial Intelligence Lab
Johannes Kepler University, Austria

Keywords

Language-based Audio Retrieval, Transfer Learning, Audio Augmentation, Text Augmentation

1 Introduction

Natural-language-based audio retrieval is concerned with ranking audio recordings depending on their content’s similarity to textual descriptions. Retrieval tasks like this are typically solved by converting recordings and textual descriptions into high-level representations and then aligning them in a shared audio-caption space; ranking can then be done based on the distance between embeddings. These systems’ retrieval performance highly depends on the quality of the audio and text embedding models, which must extract features that accurately and discriminatively represent the high-level content. Current state-of-the-art approaches [recent_work_1, recent_work_2, clap] create such feature extractors by training models with millions of parameters directly from raw input features, i.e., deep learning. These large embedding models require a large number of training examples, such as the 400 million image-text pairs used to train CLIP [clip], a cutting-edge image-retrieval model. However, publicly available audio-caption datasets like Clotho and AudioCaps are significantly smaller.

This work showcases how to use off-the-shelf pre-trained audio and text neural networks to create a state-of-the-art retrieval model under this limiting condition. We evaluate our approach in the context of Task 6b of the DCASE 2022 Challenge [dcase2022_task6b] and demonstrate how the already well-performing model can be further improved by using a range of audio and text augmentation methods and by pre-training on AudioCaps.

Figure 1: The proposed audio-retrieval system in a nutshell: Audio and descriptions are transformed into the shared audio-caption embedding space via the audio and description embedding models $ϕ_{a}$ and $ϕ_{c}$, respectively. The contrastive loss maximizes the similarities between matching pairs.

2 Related Work

The idea of aligning text and audio features for content-based retrieval is not new: Early audio retrieval methods connected bag-of-words text queries and MFCC features via density or discriminative models [previous_work_2]. However, the handcrafted features and the relatively small vocabulary limited these methods’ performance. Current methods build on top of learnable feature extractors that produce high-level audio and text representations from raw input features. Xie et al. [recent_work_2], for example, used a convolutional recurrent neural network to extract frame-wise acoustic embeddings and aligned those to Word2Vec features via a linear transformation. Recently, language-based audio retrieval has received increased attention due to the newly introduced Task 6b of the DCASE 2022 Challenge [dcase2022_task6b]. The task’s objective was to create a retrieval system that takes natural-language queries as input and retrieves the ten best-matching recordings from the test set. The top-ranking systems among the nine submitted ones leveraged large pre-trained audio and text embedding models like CNN14 [panns] and BERT [bert]. While most systems applied SpecAugment [specaugment], other data augmentation methods, especially text augmentations, have received little to no attention. We address this gap and study a range of audio and text augmentation methods in the context of audio retrieval.

Figure 2: Overview of the audio augmentation pipeline.

Augmentation	Caption
Original	The rain pours down.
Back Translation	It rains cats and dogs.
Insert	It tree rains cats and dogs.
Delete	It rains cats and dogs.
Swap	It and cats rains dogs.
Synonym	It drizzles cats and dogs.

3 Retrieval System

Our model uses separate audio and caption embedding networks $ϕ_{a} (\cdot)$ and $ϕ_{c} (\cdot)$ to embed tuples of spectrograms and descriptions $\{(a_{i}, c_{i})\}_{i = 1}^{N}$ into a shared $D$ -dimensional space such that representations of matching audio-caption pairs are close. This behavior is achieved by contrastive training, which pulls the embeddings of matching audio-caption pairs $(a_{i}, c_{i})$ together, while pushing the representations of mismatching pairs $(a_{i}, c_{j; j \neq i})$ apart. The agreement between audio $a_{i}$ and description $c_{j}$ is estimated via the normalized dot product in the shared embedding space:

C_{ij} = \frac{ϕ_{a} (a_{i})^{T} ϕ_{c} (c_{j})}{\| ϕ_{a} (a_{i}) \| \, \| ϕ_{c} (c_{j}) \|}

The similarity matrix $C \in R^{N \times N}$ holds the agreement of matching pairs on the diagonal and the agreement of mismatching pairs off-diagonal. We train the system using the NT-Xent [NTxent] loss, which is defined as the average Cross Entropy ( $C E$ ) loss over the audio and text dimension; the ground truth is given by the identity matrix $I \in R^{N \times N}$ :

L = \frac{1}{2 N} \sum_{i = 1}^{N} \left[ CE (C_{i *}, I_{i *}) + CE (C_{* i}, I_{* i}) \right]
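The loss can be illustrated with a minimal, standard-library-only sketch. It assumes that the rows (and columns) of $C$ are used directly as logits of a softmax cross entropy; the `temperature` argument is an assumption added for illustration (NT-Xent is commonly used with one) and is not specified in the text. All function names are hypothetical.

```python
import math

def cosine_similarity_matrix(audio_emb, text_emb):
    """C[i][j] = normalized dot product between audio i and caption j."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    a = [normalize(v) for v in audio_emb]
    c = [normalize(v) for v in text_emb]
    return [[sum(x * y for x, y in zip(ai, cj)) for cj in c] for ai in a]

def cross_entropy(logits, target_index):
    """Cross entropy between softmax(logits) and a one-hot target."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def nt_xent_loss(C, temperature=1.0):
    """Average of the row-wise (audio-to-text) and column-wise
    (text-to-audio) cross entropies; the targets are the diagonal entries.
    `temperature` is an assumption, not taken from the text."""
    n = len(C)
    total = 0.0
    for i in range(n):
        row = [C[i][k] / temperature for k in range(n)]
        col = [C[k][i] / temperature for k in range(n)]
        total += cross_entropy(row, i) + cross_entropy(col, i)
    return total / (2 * n)
```

With perfectly aligned embeddings the diagonal of $C$ dominates, so the loss is smaller than when captions are assigned to the wrong recordings.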

4 Audio Augmentations

To reduce overfitting of the audio embedding model and improve generalization, we employ three regularization techniques during training: Gain augmentation, MixStyle [mixstyle] along the frequency dimension (Freq-MixStyle), and SpecAugment [specaugment]. Figure 2 gives an overview of the audio augmentation pipeline.

Gain Augmentation tries to make the model invariant to changes in volume by randomly altering the loudness of the raw audio input signal. Volume manipulations are done by multiplying the waveform with the factor

w = 10^{(g / 20)}

where $g$ controls the change in volume (in dB); its value is randomly drawn from a uniform distribution in the range $[- g_{max}, g_{max}]$ .
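As a minimal sketch of this step, assuming the waveform is given as a plain list of samples (the function names are hypothetical):

```python
import random

def db_to_factor(g):
    """Convert a gain in dB to a linear amplitude factor w = 10^(g / 20)."""
    return 10 ** (g / 20)

def gain_augment(waveform, g_max, rng=random):
    """Scale the waveform by a gain drawn uniformly from [-g_max, g_max] dB.

    Returns the scaled waveform and the sampled gain g.
    """
    g = rng.uniform(-g_max, g_max)
    w = db_to_factor(g)
    return [w * x for x in waveform], g
```

For example, a gain of 6 dB corresponds to roughly doubling the amplitude, and -6 dB to roughly halving it.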

SpecAugment [specaugment] randomly masks time and frequency stripes in the input spectrogram, thereby reducing the audio embedding model’s reliance on specific input patterns. The number of stripes along the frequency and time dimensions is controlled via hyperparameters $n_{f}$ and $n_{t}$ , respectively. Parameters $w_{f}$ and $w_{t}$ control the maximum width of the frequency and time stripes, respectively. The actual width and the offset of the stripes are chosen from a uniform distribution; masked values are replaced with zeros. We omitted the warping transformation proposed in the original work as it is computationally expensive and only leads to marginal improvements.
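The masking step could look roughly as follows. This is a sketch that assumes the spectrogram is stored as a list of frequency rows (`spec[f][t]`) and that stripe widths are drawn from $\{0, \dots, w\}$; the exact sampling range is an assumption, not taken from the text.

```python
import random

def spec_augment(spec, n_f, w_f, n_t, w_t, rng=random):
    """Zero out n_f frequency stripes and n_t time stripes.

    spec is a list of frequency rows, each a list of time frames.
    Stripe widths are drawn uniformly from {0, ..., w_f} and {0, ..., w_t}
    (assumed range); offsets are drawn uniformly over all valid positions.
    """
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy, so the input stays untouched
    for _ in range(n_f):
        width = rng.randint(0, min(w_f, n_freq))
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            out[f] = [0.0] * n_time
    for _ in range(n_t):
        width = rng.randint(0, min(w_t, n_time))
        start = rng.randint(0, n_time - width)
        for row in out:
            row[start:start + width] = [0.0] * width
    return out
```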

Freq-MixStyle [mixstyle] tries to transfer device-style characteristics between recordings by exchanging statistics along the frequency dimension of spectrograms. To this end, the original spectrogram is normalized along the frequency dimension and un-normalized with adjusted mean and standard deviation statistics. The adjusted statistics are a convex combination of the original statistics $(μ_{i}, σ_{i})$ and the statistics of a randomly selected spectrogram $(μ_{j}, σ_{j})$ :

μ_{new} = λ μ_{i} + (1 - λ) μ_{j}

σ_{new} = λ σ_{i} + (1 - λ) σ_{j}

The coefficient $λ$ is drawn from a symmetric beta distribution such that the original statistics always receive a higher weight:

λ \sim Beta (α, α)

$α$ controls the shape of the Beta distribution. Freq-MixStyle is applied to each input example with a probability of $p_{MS}$ .
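A minimal sketch of Freq-MixStyle for a single pair of spectrograms is given below. It assumes that the statistics are computed per frequency bin over time, and that the constraint on $λ$ is enforced by taking $\max(λ, 1 - λ)$; both choices are assumptions made for illustration. The function names are hypothetical, and $α$ must be strictly positive for the sampler used here.

```python
import math
import random

def freq_stats(spec, eps=1e-6):
    """Per-frequency-bin mean and standard deviation over time (assumed axis)."""
    mus, sigmas = [], []
    for row in spec:  # one row per frequency bin
        mu = sum(row) / len(row)
        var = sum((x - mu) ** 2 for x in row) / len(row)
        mus.append(mu)
        sigmas.append(math.sqrt(var + eps))
    return mus, sigmas

def freq_mixstyle(spec_i, spec_j, alpha, p_ms, rng=random):
    """Mix the frequency-wise statistics of spec_i with those of spec_j."""
    if rng.random() >= p_ms:  # apply only with probability p_ms
        return [row[:] for row in spec_i]
    lam = rng.betavariate(alpha, alpha)
    lam = max(lam, 1.0 - lam)  # assumption: keeps the larger weight on spec_i
    mu_i, sd_i = freq_stats(spec_i)
    mu_j, sd_j = freq_stats(spec_j)
    out = []
    for row, mi, si, mj, sj in zip(spec_i, mu_i, sd_i, mu_j, sd_j):
        mu_new = lam * mi + (1 - lam) * mj
        sd_new = lam * si + (1 - lam) * sj
        # normalize with the original statistics, un-normalize with the mixed ones
        out.append([(x - mi) / si * sd_new + mu_new for x in row])
    return out
```

In a training loop, `spec_j` would typically be another example from the same batch.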

5 Text Augmentations

We apply Back Translation [backtranslation] and Easy Data Augmentation [EDA] (in that order) to reduce overfitting of the sentence embedding model. Examples of these augmentations are given in the table of example captions that follows Figure 2.

Back Translation (BT) [backtranslation] introduces variation into the input sentence without changing its semantics by translating the input sentence to a foreign language and back to the source language. We translate the training captions from English to German, French, or Spanish, and back to English.

Easy Data Augmentation (EDA) [EDA] chooses one of four word-level manipulations and applies the selected operation to each word with a certain probability (indicated in parentheses): insertion of a random word ( $p_{ins}$ ), deletion ( $p_{del}$ ), swap with another word in the sentence ( $p_{swp}$ ), or replacement with a synonym according to WordNet [wordnet] ( $p_{syn}$ ). EDA is applied with a probability of $p_{EDA}$ .
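The EDA procedure described above can be sketched as follows. The tiny synonym table and vocabulary are hypothetical stand-ins for WordNet and for the source of random words, and choosing the operation uniformly at random is an assumption.

```python
import random

# Hypothetical stand-ins for a WordNet lookup and a word list for insertion.
TOY_SYNONYMS = {"rains": ["drizzles", "pours"], "dogs": ["hounds"]}
TOY_VOCAB = ["tree", "bird", "car"]

def eda(sentence, p_eda, p_syn, p_swp, p_ins, p_del, rng=random):
    """Apply one randomly chosen word-level operation to a sentence."""
    words = sentence.split()
    if not words or rng.random() >= p_eda:
        return sentence
    op = rng.choice(["syn", "swp", "ins", "del"])  # assumed: uniform choice
    if op == "syn":
        out = [rng.choice(TOY_SYNONYMS[w])
               if w in TOY_SYNONYMS and rng.random() < p_syn else w
               for w in words]
    elif op == "swp":
        out = words[:]
        for i in range(len(out)):
            if rng.random() < p_swp:
                j = rng.randrange(len(out))
                out[i], out[j] = out[j], out[i]
    elif op == "ins":
        out = []
        for w in words:
            if rng.random() < p_ins:
                out.append(rng.choice(TOY_VOCAB))
            out.append(w)
    else:
        # keep the original words if every word would be deleted
        out = [w for w in words if rng.random() >= p_del] or words
    return " ".join(out)
```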

6 Experiments & Discussion

We first established a baseline without augmentation and then conducted a series of experiments to investigate the impact of using pre-trained weights for the audio embedding model, augmenting the audio and text inputs, and pre-training on AudioCaps. The model architecture, the exact experimental setup, and the results are discussed below.

6.1 Dataset & Input Features

We trained our proposed system on ClothoV2 [clotho], which contains $10$ to $30$ second long audio recordings sampled at $32$ kHz and five human-generated captions for each recording. We used the training-validation-test split suggested by the dataset’s creators. To make processing in batches easier, we zero-padded all audio snippets to the maximum audio length in the batch. The resulting waveforms were converted to $64$ -bin log-mel spectrograms using a $1024$ -point FFT ( $32$ ms) and hop size of $320$ ( $10$ ms). The audio features were normalized via batch normalization [batchnorm] along the frequency dimension before feeding them into the CNN10 embedding model. The input sentences were pre-processed by converting all characters to lowercase and removing punctuation. The resulting strings were tokenized with the WordPiece tokenizer [wordpiece], padded to the maximum sequence length in the batch, and truncated to 32 tokens.
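The window and hop durations quoted above follow directly from the sampling rate. The sketch below reproduces this arithmetic and illustrates the batch padding and caption clean-up. The helper names are hypothetical, and the frame count assumes centered framing, which the text does not specify.

```python
import string

SAMPLE_RATE = 32_000  # Hz
N_FFT = 1024          # FFT window length in samples
HOP = 320             # hop size in samples

def duration_ms(n_samples, sample_rate=SAMPLE_RATE):
    """Duration of n_samples in milliseconds."""
    return 1000 * n_samples / sample_rate

def num_frames(n_samples, hop=HOP):
    """Number of spectrogram frames, assuming centered framing."""
    return 1 + n_samples // hop

def pad_batch(waveforms):
    """Zero-pad all waveforms to the longest one in the batch."""
    max_len = max(len(w) for w in waveforms)
    return [w + [0.0] * (max_len - len(w)) for w in waveforms]

def clean_caption(caption):
    """Lowercase the caption and strip ASCII punctuation."""
    return caption.lower().translate(str.maketrans("", "", string.punctuation))
```

Under these assumptions, a 10-second recording yields 1001 frames and a 30-second recording 3001 frames.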

6.2 Audio Embedding Model

We used a slightly modified version of the popular CNN10 architecture [panns] to embed spectrograms into the 1024-dimensional audio-caption space. The architecture is detailed in Table 1. The network aggregates the output after the last convolutional block over the frequency and time dimensions and transforms the result with a two-layer neural network. The weights of the convolutional blocks were transferred from a model pre-trained on AudioSet [audioset], and the fully-connected layers were randomly initialized. The audio embedding model has approximately 9 million parameters. We chose this simple architecture because it allowed us to train on a single consumer-grade GPU with a reasonable batch size.

CNN10
$2 \times (3 \times 3) @ 64$ , BN, ReLU
Pool $(2 \times 2)$
$2 \times (3 \times 3) @ 128$ , BN, ReLU
Pool $(2 \times 2)$
$2 \times (3 \times 3) @ 256$ , BN, ReLU
Pool $(2 \times 2)$
$2 \times (3 \times 3) @ 512$ , BN, ReLU
Pool $(2 \times 2)$
Frequency Pooling (mean)
Time Pooling (average of mean and max)
FC 2048, ReLU
FC 1024

Table 1: The architecture of the audio embedding model (CNN10).

6.3 Text Embedding Model

We used a pre-trained BERT model [bert] (‘bert-base-uncased’) to generate embeddings for the audio captions. BERT is a bi-directional self-attention-based sentence encoder that was pre-trained on BookCorpus [bookcorpus] and WikiText datasets [wikitext] for masked language modeling and next sentence prediction. The learned semantic representations proved effective in multiple downstream tasks. We projected the output vector that corresponds to the class token into the shared audio-caption space by using a neural network with one hidden layer of size 2048 and ReLU activations. The text embedding model has approximately 112 million parameters.

6.4 Training & Evaluation

Both embedding models were jointly optimized using mini-batch gradient descent with a batch size of 30. We used the Adam update rule [adam] for 50 epochs, set the initial learning rate to $10^{- 4}$ , and dropped it by a factor of $3$ every $10$ epochs. The hyperparameters of the optimizer were set to PyTorch’s [pytorch] defaults. Our main evaluation criterion was the mean Average Precision among the top-10 results (mAP) because this criterion takes the rank of the correct recording into account. We also report the recall among the top-1, top-5, and top-10 retrieved results. All results are averaged over three runs.
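For reference, the evaluation criteria can be computed from the rank of the correct recording for each query. The sketch below assumes exactly one relevant recording per query, in which case the average precision among the top ten reduces to the reciprocal rank (or zero if the correct recording is not among the top ten). The function names are hypothetical.

```python
def rank_of_match(similarities, target_index):
    """1-based rank of the target among all candidates (higher score is better)."""
    target = similarities[target_index]
    return 1 + sum(1 for s in similarities if s > target)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct recording is among the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

def map_at_10(ranks):
    """Mean average precision among the top ten results,
    assuming a single relevant recording per query."""
    return sum(1.0 / r if r <= 10 else 0.0 for r in ranks) / len(ranks)
```

Unlike the recall values, this criterion rewards placing the correct recording at rank one more than placing it at rank ten.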

6.5 Baseline

The performance of the baseline system (without augmentations) is given in Table 2. On average, our system ranks the correct recording first for every eighth query, among the top five results for every third query, and among the top ten results for every second query. The proposed system also considerably outperforms the DCASE baseline [recent_work_2], which is based on a convolutional recurrent neural network and Word2Vec [word2vec] for audio and word embedding. The absolute improvement in terms of mAP is $15.3$ pp. Our proposed system retrieves the correct audio approximately four times more often on rank one, three times more often among the first five, and two times more often among the first ten results compared to the DCASE baseline.

	R@1	R@5	R@10	mAP@10
DCASE baseline	3.50	11.50	19.50	$7.50 \pm 0.00$
baseline (no aug)	13.18	35.30	48.61	$22.80 \pm 0.29$

Table 2: Comparison of the official DCASE baseline and the custom baseline without data augmentation.

6.6 Ablation Study: Pre-trained Embedding Models

Our first ablation study investigated the impact of using pre-trained weights for the audio embedding model. To that end, we retrained our baseline system (again without augmentations) but randomly initialized the parameters of the audio embedding model instead of transferring them from a CNN10 pre-trained on AudioSet. The outcomes are detailed in Table 3. The resulting system’s mAP is 5 pp. higher than the mAP of the DCASE baseline system; however, it is still 10 pp. mAP worse than the system that used pre-trained weights. This confirms that using pre-trained audio embedding models is an effective strategy to alleviate the data scarcity problem.

	R@1	R@5	R@10	mAP@10
DCASE baseline	3.50	11.50	19.50	$7.50 \pm 0.00$
baseline (no aug)	13.18	35.30	48.61	$22.80 \pm 0.29$
baseline w/o ASP	6.63	20.06	31.52	$12.53 \pm 0.08$

Table 3: Comparison between the custom baseline (no augmentation) and the same model trained without AudioSet Pretraining (ASP).

6.7 Sequential Model-Based Optimization

We performed sequential model-based optimization (SMBO) in the hyperparameter space of the audio and text augmentations to find a good configuration without manual tuning. The search space is detailed in Table 4. SMBO was initialized with ten runs using randomly chosen hyperparameters. After that, we performed 100 trials with hyperparameters sampled using the Tree-structured Parzen Estimator algorithm [TPE] to maximize the mAP on the validation set. To reduce the overall computation time, we stopped runs for which the mAP on the validation set did not increase for ten consecutive epochs. The resulting best hyperparameters and the performance on the test set are given in Tables 4 and 5, respectively. The best configuration found suggests that the parameter which controls the frequency of EDA is superfluous, as the best value is very close to one. Synonym replacement appears beneficial, and future experiments should search for the optimal value of this parameter in a larger range. The probabilities of swapping and inserting random words are close to zero, suggesting that these two transformations are less beneficial or even detrimental. All in all, we observed an absolute improvement of approximately 1.5 pp. mAP when training with text and audio augmentations.
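The early-stopping rule used to shorten unpromising runs can be sketched as follows. The function name and the list-based interface are hypothetical; the sketch only mirrors the stopping criterion, not the search itself.

```python
def run_with_early_stopping(val_map_per_epoch, patience=10):
    """Return (number of epochs run, best validation mAP).

    Training stops once the validation mAP has not increased for
    `patience` consecutive epochs.
    """
    best = float("-inf")
    epochs_since_best = 0
    for epoch, value in enumerate(val_map_per_epoch, start=1):
        if value > best:
            best, epochs_since_best = value, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                return epoch, best
    return len(val_map_per_epoch), best
```

For instance, a run whose validation mAP peaks in epoch 2 and never improves afterwards would be stopped after epoch 12.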

	Augmentation	Parameter	Range	Best
Text	EDA	$p_{EDA}$	$[0, 1.0]$	.9936
		$p_{syn}$	$[0, 0.3]$	.2962
		$p_{swp}$	$[0, 0.3]$	.0085
		$p_{ins}$	$[0, 0.3]$	.0269
		$p_{del}$	$[0, 0.3]$	.1944
	Backtranslation	$p_{bt}$	$[0, 1.0]$	.1812
Audio	SpecAugment	$n_{f}$	$\{0, 1\}$	1
		$w_{f}$	$\{1, \dots, 32\}$	4
		$n_{t}$	$\{0, \dots, 8\}$	7
		$w_{t}$	$\{1, \dots, 64\}$	58
	Audio Gain	$g_{max}$	$\{0, \dots, 6\}$	3
	Freq-MixStyle	$p_{MS}$	$[0, 1.0]$	.1045
		$α$	$[0, 1.0]$	.8286

Table 4: Hyperparameter search space for the sequential model-based optimization, and the best configuration found.

	R@1	R@5	R@10	mAP@10
baseline (no aug)	13.18	35.30	48.61	$22.80 \pm 0.29$
SMBO	14.50	37.24	51.04	$24.27 \pm 0.19$

Table 5: Comparison between the custom baseline without augmentation and the model obtained via sequential model-based optimization (SMBO).

6.8 Ablation Study: Augmentations

Based on the previous results, we performed another ablation study to investigate the effect of the audio and text augmentations. To that end, we re-trained the proposed system twice: once without audio augmentations and once without text augmentations. The results are given in Table 6. Using all augmentations gave the best results. We observed a drop of approximately 1.4 and 0.5 pp. mAP without text and audio augmentations, respectively. This could indicate that the text augmentations have a larger impact than the audio augmentations, which might be caused by the large difference in trainable parameters between the sentence and audio embedding models.

To isolate the effect of each individual augmentation method, we further re-trained the system in five variants, always leaving out one of the augmentation methods. The results are summarized in Table 6. The text augmentations have the largest impact, which is in line with the previous results: Leaving out EDA and BT reduced the mAP by 1.0 and 0.8 pp., respectively. Eliminating SpecAugment and Freq-MixStyle reduced the performance by 0.7 and 0.6 pp., respectively. Gain augmentation seems to have the least impact: eliminating it reduced the mAP by only 0.2 pp.

	R@1	R@5	R@10	mAP@10
SMBO	14.50	37.24	51.04	$24.27 \pm 0.19$
no audio aug	13.88	36.94	51.06	$23.74 \pm 0.16$
no text aug	13.12	35.77	49.25	$22.91 \pm 0.08$
no SpecAugment	13.50	36.60	50.91	$23.53 \pm 0.20$
no FreqMixStyle	13.61	36.91	50.69	$23.62 \pm 0.14$
no Gain Augment	14.84	37.81	50.95	$24.05 \pm 0.26$
no BT	13.61	36.38	50.00	$23.43 \pm 0.18$
no EDA	13.33	37.02	49.94	$23.27 \pm 0.03$

Table 6: Results of the ablation study on data augmentation.

6.9 Pre-Training on AudioCaps

We hypothesized that pre-training the retrieval system on additional audio-caption pairs could further improve audio-retrieval results; we therefore pre-trained the system on the $46 K$ training examples in AudioCaps [audiocaps]. To this end, we used the same training procedure as described in Section 6.4 for pre-training and fine-tuning, but decreased the initial learning rate for fine-tuning by a factor of 10. Table 7 gives the results. Pre-training on AudioCaps further improved the model’s retrieval performance by 0.3 pp. mAP.

	R@1	R@5	R@10	mAP@10
SMBO	14.50	37.24	51.04	$24.27 \pm 0.19$
SMBO + AudioCaps	14.34	38.12	52.04	$24.57 \pm 0.15$

Table 7: Comparison of the model trained with and without pre-training on AudioCaps.
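The step-wise learning-rate schedule from Section 6.4, together with the reduced initial rate for fine-tuning, can be written compactly. This is a sketch with zero-based epoch indices; the function name and argument names are hypothetical.

```python
def learning_rate(epoch, initial_lr=1e-4, drop_factor=3.0, drop_every=10):
    """Learning rate for a zero-based epoch index: the initial rate is
    divided by drop_factor after every drop_every epochs."""
    return initial_lr / drop_factor ** (epoch // drop_every)

# Pre-training starts at 1e-4; fine-tuning starts ten times lower.
pretrain_schedule = [learning_rate(e) for e in range(50)]
finetune_schedule = [learning_rate(e, initial_lr=1e-5) for e in range(50)]
```

Over 50 epochs the rate is dropped four times, so the final epochs of pre-training use a learning rate of $10^{-4} / 81$.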

7 Conclusion

This study set out to investigate transfer learning and data augmentation strategies to alleviate the data scarcity problem in natural-language-based audio retrieval. Our research has shown that using pre-trained audio and text embedding models greatly increases the retrieval performance on ClothoV2. We enriched this already well-performing retrieval system with a range of augmentation methods and showed that augmenting both text and audio inputs significantly reduces overfitting. Finally, we showed that pre-training on AudioCaps leads to additional improvements.

8 Acknowledgment

The LIT AI Lab is financed by the Federal State of Upper Austria.
