MuLan: A Joint Embedding of Music Audio and Natural Language

Abstract

Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.

Qingqing Huang $^{1}$ Aren Jansen $^{1}$ Joonseok Lee $^{1, 2}$ Ravi Ganti $^{1}$ Judith Yue Li $^{1}$ Daniel P. W. Ellis $^{1}$
Google Research $^{1}$ , Seoul National University $^{2}$
{qqhuang, arenjansen, joonseok, gmravi, judithyueli, dpwe}@google.com

1 Introduction

Classifiers are generally trained to label examples with predefined and fixed class inventories, which are often manually specified as a structured ontology indicating inter-class relationships. Empowered by recent advances in neural language modeling and their demonstrated transfer learning competence, researchers have begun exploring less restrictive natural language interfaces to access the categorical information underlying raw content signals. The majority of this work has been in the visual and audio event domains, where a recent series of studies has demonstrated the utility of jointly embedding media content with natural language captions [24, 36, 28, 26, 32]. These joint embeddings have demonstrated strong capabilities in a range of applications, including transfer learning, cross-modal retrieval, automatic captioning, and zero-shot classification.

The success of these efforts strongly depends on large-scale training resources and hefty neural network architectures that are flexible enough to model the complex, non-monotonic relationship between language and other modalities. In particular, the visual domain has greatly benefited from the large number of captioned images available across the web [24]. However, in the general environmental audio domain, such large-scale audio-caption pairs are less readily available and related efforts have relied on small captioned datasets [9, 27]. Critically, these datasets do not span the diversity of sound-descriptive language and their success in the more difficult zero-shot setting has been lacking [11, 12, 28].

This paper considers this task of jointly embedding audio and natural language, but focuses specifically on the music domain. Our goal is to produce a flexible language interface with which any musical concept can be linked to related music audio. We face training data prerequisites similar to those of the works listed above. However, while general environmental audio consists of background sounds that are unlikely to elicit unprompted description, music audio is often a central focus. Consequently, text associated with music videos is much more likely to relate to the underlying musical concepts that we aim to model (e.g., genres, artists, moods, structure). Thus, our strategy is to assemble a collection of textual annotations extracted from metadata, comments, and playlist data and map them to a training set of over 44 million internet music videos. As was the case with image-text model training in [24], our text data only truly refers to the musical content in a fraction of cases. Therefore, we also explore text pre-filtering using a text classifier separately trained to identify music descriptions.

We use this large-scale dataset to train MuLan, a new-generation, semantically structured music audio embedding model equipped with a natural language interface. MuLan employs a two-tower parallel encoder architecture, using a contrastive loss objective that elicits a shared embedding space between music audio and text. We demonstrate that MuLan not only leads to state-of-the-art performance in transfer learning for various music information retrieval tasks, but also enables a range of functionalities in cross-modal text-to-music retrieval, zero-shot music tagging, and music-domain language understanding.

2 Related Work

Audio representation learning. Transfer learning using large-scale, task-agnostic pretraining of general-purpose content representations has become a dominant approach in several fields. Audio representation learning has been no exception, including both general environmental audio [17, 18] and music audio [20, 44, 6, 8]. Different pretraining mechanisms have been explored. In supervised pretraining, an Audio Spectrogram Transformer (AST) [17], pretrained on ImageNet [7] and AudioSet [16], achieved state-of-the-art results in various tagging tasks. A strong early baseline for music audio representation learning was provided in [44], using the Million Song Dataset [2].

In unsupervised and self-supervised pretraining, both discriminative and generative model approaches have been demonstrated to be successful. Discriminative training was explored in [23, 38, 42, 39] where the models tried to learn representations that assign higher similarity to audio segments extracted from the same recording compared to segments from different recordings. SSAST [18] explored similar discriminative losses, as well as generative masked spectrogram patch modeling. It was shown in [3] that the intermediate embedding of a generative model also provides a strong audio representation for downstream classification. Various forms of weak supervision, such as user interaction statistics and visual cues, have also been examined in [34, 22, 13].

Our work focuses on developing a recipe of cross-modal supervision using an abundance of text annotations that are weakly associated with the music audio. We benchmark the transfer learning capabilities of the learned representations against analogous past work, and also evaluate different audio encoder architectures.

Cross-modal contrastive learning. Spurred by the success of using contrastive learning to align image features and free-form natural language using large-scale data [24, 36], tri-modal architectures were proposed in [19] and [32], where an audio tower was introduced to the image-text model and contrastive learning was used to enforce the cross-modal alignment. Along the same line in the audio domain, [11] used contrastive learning to align the latent representation of audio and associated tags. The tags come from a fixed vocabulary of size 1K from Freesound [14], and the input to the text encoder was the multi-hot encoded tags. Follow-up work in [12] used a pretrained, non-contextual word embedding (Word2Vec) model to support generalization to new terms beyond the 1K tags. However, this still does not support generalization to free-form natural language. Contrastive learning was also explored in [47] for zero-shot audio classification, using AudioSet and ESC-50 [35] data. Our method focuses on mining a much larger scale collection of audio-text pairs specifically for the music domain. Our data scale supports using state-of-the-art Transformer-based audio and contextual language encoders, which enables truly arbitrary zero-shot music tagging and retrieval for the first time.

Music text joint embedding models. Content-based music information retrieval requires linking the rich semantics expressible in free-form text with both broad and fine-grained musical properties. One approach is to consider a large number of text label classes and try to ground the semantics in music with a multi-label classification task. In [22], a large vocabulary of 100K $n$-grams was mined from noisy natural language text associated with music videos. Then, a cross entropy loss was employed to train the music audio encoder, where the softmax layer weights served as text label embeddings that were aligned with audio features by construction. The work in [46] explored various training tasks (classification, regression, metric learning) to align free-form text and music audio, relying on pre-existing emotion labels to connect the modalities.

Closest to our work is MuLaP [31], where 250K audio-caption pairs were mined from a private production music library and used to train a multimodal Transformer with early fusion of the two modalities. Their choice of early fusion, as accomplished with cross-attention layers, restricts the utility of the resulting embeddings to transfer learning applications. Critically, our two-tower parallel encoder approach results in a joint embedding space that provides a natural language interface to arbitrary music audio. This opens up downstream opportunities for cross-modal retrieval, zero-shot tagging, and language understanding.

3 Proposed approach

Our goal is to construct a shared embedding space for music audio and free-form natural language text, in which proximity is predictive of shared semantics both within and across modalities. To accomplish this, we rely on cross-modal contrastive learning and a simple two-tower architecture. This is a highly data-intensive endeavor, which we support by mining a large-scale training dataset of (audio, text) pairs. We describe these components in turn below.

3.1 Learning Framework

Figure 1 shows a high-level schematic of the learning framework. Each MuLan model consists of two separate embedding networks for the audio and text input modalities. These networks share no weights, but each terminates in an $ℓ_{2}$-normalized embedding space with the same dimensionality, $d$. The audio embedding network, $f : R^{F \times T} \to R^{d}$, takes as input log mel spectrogram context windows with $F$ mel channels and $T$ frames. The text embedding network, $g : A^{n} \to R^{d}$, takes as input a null-padded text token sequence of length $n$ over a token vocabulary $A$.

Given a set of music recordings and the associated text elements for each recording, we construct a cross-modal training dataset of (audio, text) pairs as follows. For each recording, we compute an $F$-channel log mel spectrogram and extract a collection of $T$-frame context windows. We null-pad or truncate each associated text element to a fixed length $n$. Then, each mini-batch $B$ consists of a set of $B$ target audio-text pairs of the form $\{(x^{(i)}, t^{(i)})\}_{i = 1}^{B}$. Here, each target pair is sampled by first selecting a random recording and sampling a random spectrogram context window $x^{(i)} \in R^{F \times T}$ from it. Next, we randomly select one of its associated text elements $t^{(i)} \in A^{n}$. This sampling scheme means that multiple epochs are required to cover the entirety of the training audio and all the associated text. We also experimented with concatenating multiple text annotations for each example, but it did not generally work as well.
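The pair sampling described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the dictionary keys `windows` and `texts`, the function name `sample_batch`, and the choice to draw recordings independently with replacement are all assumptions made for the example.

```python
import random

def sample_batch(recordings, batch_size, rng):
    """Draw `batch_size` (audio window, text) target pairs.

    `recordings` is a list of dicts with hypothetical keys:
      "windows": spectrogram context windows extracted from the recording
      "texts":   text elements associated with the recording
    """
    batch = []
    for _ in range(batch_size):
        rec = rng.choice(recordings)      # select a random recording
        x = rng.choice(rec["windows"])    # one random context window from it
        t = rng.choice(rec["texts"])      # one random associated text element
        batch.append((x, t))
    return batch

# Toy stand-ins: strings replace real spectrogram windows and token sequences.
recordings = [
    {"windows": ["r0_w0", "r0_w1"], "texts": ["jazz trio", "late night piano"]},
    {"windows": ["r1_w0"], "texts": ["salsa"]},
]
batch = sample_batch(recordings, batch_size=4, rng=random.Random(0))
```

Because only one window and one text element are drawn per selected recording, a single pass over the recordings leaves most (window, text) combinations unseen, which is why multiple epochs are needed.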

We train to minimize a batch-wise Contrastive Multiview Coding loss function [41], which is a cross-modal extension of the popular InfoNCE and NT-Xent losses [33, 5]. For each batch $B$ , this loss $L (B)$ takes the form

where $h$ is a critic function given by $h [a, b] = \exp (a^{T} b / τ)$ for $a, b \in R^{d}$, and $τ \in (0, 1]$ is a trainable temperature hyperparameter. For our $ℓ_{2}$-normalized embedding model outputs, the inner product is effectively cosine similarity. The critic’s goal is to produce a large positive value for target audio-text pairs, and a small value close to zero for all non-target pairs constructed within the batch. Temperature values below one increase the output range of $h$. Previous research [30, 5] demonstrated that a large batch size is beneficial to contrastive loss optimization.
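To make the role of the critic concrete, the sketch below computes a symmetric InfoNCE-style batch loss from the critic $h$ defined above. It is an assumed instantiation for illustration only: the normalization (the target pair is included in each denominator, and the result is averaged over the batch) is a common convention and may differ from the exact form used for MuLan. The function names are hypothetical.

```python
import math

def critic(a, b, tau):
    # h[a, b] = exp(a^T b / tau); for l2-normalized inputs a^T b is cosine similarity.
    return math.exp(sum(x * y for x, y in zip(a, b)) / tau)

def contrastive_loss(audio_emb, text_emb, tau=0.1):
    """Symmetric InfoNCE-style loss over one batch (assumed normalization)."""
    batch_size = len(audio_emb)
    total = 0.0
    for i in range(batch_size):
        pos = critic(audio_emb[i], text_emb[i], tau)
        # Audio-to-text direction: audio i scored against every text in the batch.
        a2t = sum(critic(audio_emb[i], text_emb[j], tau) for j in range(batch_size))
        # Text-to-audio direction: text i scored against every audio in the batch.
        t2a = sum(critic(audio_emb[j], text_emb[i], tau) for j in range(batch_size))
        total += -math.log(pos / a2t) - math.log(pos / t2a)
    return total / batch_size

# Toy 2-D unit vectors: correctly paired embeddings vs. deliberately swapped ones.
audio = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = contrastive_loss(audio, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = contrastive_loss(audio, [[0.0, 1.0], [1.0, 0.0]])
```

With $τ = 0.1$, a cosine similarity of 1 maps to $\exp(10)$ while a similarity of 0 maps to 1, so the aligned batch yields a loss near zero and the swapped batch a large one.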

3.2 Audio Embedding Network

For the audio embedding tower, $f$ , we consider two proven audio architectures. Following its introduction to the audio machine learning community [21], the Resnet-50 architecture has become a common and well-performing option. It is a straightforward adaptation of the original vision architecture: as in [21], we remove the stride of 2 in the first convolutional layer and apply to log mel spectrograms ( $F = 64$ mel channels, 25 ms Hanning window, 10 ms step size) treated as grayscale images. Unlike the Resnet-50 model in [21] which operated on 0.96-second context windows, in order to allow the modeling of longer-term musical structure, our implementation takes as input 10-second windows (randomly selected from each training clip), in the form of $(F = 64) \times (T = 1000)$ spectrogram patches. During training, we apply SpecAugment to each spectrogram using the parameters from [17] before passing it into the embedding network. A final mean pooling operation is applied across time and mel channels, followed by a linear fully connected layer with $d = 128$ units, whose output is $ℓ_{2}$ -normalized. We pretrain all but the final linear transform layer via logistic regression on AudioSet [16], including all 527 classes, and discard the final classifier layer before fine-tuning for our task.

Audio Spectrogram Transformer (AST) is a port of the successful Vision Transformer (ViT) base architecture and is currently the state-of-the-art in the audio event classification space [17]. AST consists of a stack of 12 Transformer blocks (hidden dimension 768, 12 self-attention heads) that are applied to a sequence of “tokens” corresponding to a flattened set of linear-transformed $16 \times 16$ (stride 10 along both axes) time-frequency patches extracted from the $(F = 128) \times (T = 1000)$ log mel spectrogram context windows. We again apply SpecAugment during training. Similar to the Transformer-based language models, trainable positional encodings are added to the sequence of patch tokens, and a [CLS] token is prepended to the sequence as a summary of the contextual patch embeddings. We apply a linear fully-connected layer with $d = 128$ units and $ℓ_{2}$ -normalization to the final 768-dimensional encoding at the [CLS] token position, and this forms the output of audio embedding network $f$ . We warm-start training for all but the final linear transform layer using the public AST checkpoint [17].
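The input sizes quoted above imply concrete sequence lengths, which the following sketch works out. It assumes that patches are extracted without padding and that only the [CLS] token is prepended; the helper name `num_positions` is hypothetical.

```python
def num_positions(size, patch, stride):
    # Number of patch positions along one axis when no padding is applied.
    return (size - patch) // stride + 1

# 10-second context window with a 10 ms step size.
frames = 10_000 // 10

# AST input: F = 128 mel channels, T = 1000 frames, 16 x 16 patches, stride 10.
n_freq = num_positions(128, patch=16, stride=10)
n_time = num_positions(1000, patch=16, stride=10)
n_patches = n_freq * n_time
seq_len = n_patches + 1  # patch tokens plus the prepended [CLS] token
```

Under these assumptions the Transformer attends over roughly 1.2K tokens per 10-second window, which is consistent with the smaller batch size used for M-AST.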

3.3 Text Embedding Network

For the text embedding model, we consider the commonly-used Bidirectional Encoder Representations from Transformers (BERT) model with the base-uncased architecture [25], which consists of a stack of 12 Transformer blocks (hidden dimension of 768 and 12 self-attention heads). We apply the BERT wordpiece tokenizer to convert a text input string into a sequence of tokens ($n = 512$). The output of the text embedding network is defined to be the [CLS] token embedding, linearly transformed to the shared audio-text embedding space of dimension $d = 128$ and subsequently $ℓ_{2}$-normalized. We warm-start our text embedding network using the publicly available checkpoint [1].

3.4 Training Dataset Mining

To assemble a large-scale collection of (audio, text) pairs needed to train our MuLan embedding models, we start with a collection of 50 million internet music videos. From the soundtrack of each video, we extract a 30-second clip starting at the 30 second mark. We then apply a pre-existing music audio detector and discard any clip that is less than half music content. After this filtering, we are left with approximately 44 million 30-second clips, which amounts to nearly 370K hours of audio.
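The clip selection rule and the resulting audio volume can be checked with a few lines. This is an illustrative sketch: `keep_clip` and its `music_fraction` argument are hypothetical stand-ins for the output of the music audio detector, which is not specified further here.

```python
CLIP_START_S = 30   # clip starts at the 30 second mark
CLIP_LEN_S = 30     # clip duration in seconds

def clip_bounds():
    # (start, end) of the extracted clip within the soundtrack, in seconds.
    return CLIP_START_S, CLIP_START_S + CLIP_LEN_S

def keep_clip(music_fraction):
    # Discard any clip that is less than half music content.
    return music_fraction >= 0.5

# 44 million retained clips of 30 seconds each, expressed in hours.
total_hours = 44_000_000 * CLIP_LEN_S / 3600
```

The total comes to about 367K hours, in line with the "nearly 370K hours" figure.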

Type	Examples
Short-form (SF)	tags like genre, mood, instrument, artist name, song title, album name
Long-form (LF)	‘Hip-hop features rap with an electronic backing.’, ‘The melody is so nostalgic and unforgettable.’
Playlist (PL)	‘Feel-good mandopop indie’, ‘Latin workout’, ‘Salsa for broken hearts’, ‘Piano for study’

Table 1: Text annotation examples.

	Pre-filter		Post-filter
Type	Tokens (B)	APV	Tokens (B)	APV
Short-form	31.2	42.9	5.4	29.6
Long-form	30.7	70.7	0.2	0.4
Playlists	2.5	24.3	-	-

Table 2: Statistics for text data sources. Token counts (in billions) are across all 44M videos. APV is the average number of text annotations (i.e. separate free-form strings) per video, including those with none.

For each music video, we consider 3 sources of noisy text data: (i) short-form (SF) text including video titles and tags; (ii) long-form (LF) text including video descriptions and comments; and (iii) titles of 171 million playlists (PL) that are linked to the internet music videos in our dataset. None of these text sources is guaranteed to be referring to the musical properties of the soundtrack. In particular, comments data contains the most noise, and can be subjective or less directly related to the music content. In Table 1, we show examples that are indeed music-related to give the readers a flavor of each type of text annotation.

Given the highly noisy text, we experimented with training MuLan with the SF and LF text data filtered to a cleaner set of music-descriptive annotations (PL is used unfiltered). For this, we fine-tune a pre-trained BERT model on a binary classification task using a small curated set of 700 sentences, which are manually labeled as music-descriptive or not. We then apply this text classifier to filter the sentences in the LF annotations. Separately, we apply a set of rule-based filtering heuristics to clean up the SF annotations. Table 2 shows the size and coverage of each of these text sources, both before and after filtering. Note that playlist titles and filtered long-form annotations are only available for a minority of recordings in the dataset (18M and 6.8M out of the total 44M, respectively).

We also convert AudioSet into a set of audio-text pairs, denoted below as ASET. Specifically, we include all examples for all 527 classes, using each label string attached to an example as an associated text annotation. This results in a set of approximately 2 million 10-second clips for training, each with 1.8 label annotations on average. Given the great scale imbalance of these four different data sources, which is often at odds with their linguistic richness and quality, we construct each mini-batch with a prescribed set of proportions that were chosen without any optimization: 2:2:1:1 for SF:LF:PL:ASET. This means that despite its small scale, the (e.g.) filtered LF annotations still comprise 1/3 of each mini-batch.
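The prescribed mixing proportions translate into per-source example counts as sketched below. The function name `batch_allocation` is hypothetical, and the use of floor division is an assumption: how a remainder is distributed when the batch size is not divisible by the ratio total is not specified.

```python
def batch_allocation(batch_size, ratios):
    # Split a mini-batch across data sources according to integer ratios.
    total = sum(ratios.values())
    return {name: batch_size * r // total for name, r in ratios.items()}

ratios = {"SF": 2, "LF": 2, "PL": 1, "ASET": 1}
alloc = batch_allocation(6144, ratios)   # batch size of 6144 pairs
lf_share = alloc["LF"] / 6144
```

For a batch of 6144 pairs this gives 2048 SF, 2048 LF, 1024 PL, and 1024 ASET examples, so LF accounts for one third of the batch regardless of how few filtered LF annotations exist overall.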

4 Experiments

We evaluate MuLan using both the Resnet-50 audio encoder (M-Resnet-50) and AST audio encoder (M-AST). In both cases we use the BERT-base-uncased architecture as the text encoder. We train all models for 14 epochs on the collection of audio-text pairs mined from the 44M music recordings and the processed text labels in all categories: AudioSet (ASET), short-form tags (SF), long-form sentences (LF), playlist information (PL). We use the Adam optimizer with a step decay learning rate schedule using a decay factor 0.9 applied every 40K steps and initial values of $5 \times 10^{- 5}$ for M-Resnet-50 and $4 \times 10^{- 5}$ for M-AST. The temperature parameter is initialized to $τ = 0.1$ for all models. M-Resnet-50 is trained with a batch size of $B = 6144$ pairs, while $B = 5120$ pairs were used for M-AST due to memory limitations. Since M-AST and M-Resnet-50 show roughly similar performance in the evaluation tasks considered, we use M-Resnet-50 throughout the text ablation study for its better training efficiency.
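The step decay schedule can be written as a one-line function. This sketch assumes a staircase schedule, in which the rate is held constant within each 40K-step interval; the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, initial_lr, decay=0.9, decay_every=40_000):
    # Multiply the initial rate by `decay` once per completed `decay_every` steps.
    return initial_lr * decay ** (step // decay_every)

lr_resnet_start = learning_rate(0, 5e-5)          # M-Resnet-50 initial value
lr_resnet_40k = learning_rate(40_000, 5e-5)       # after the first decay
lr_ast_80k = learning_rate(80_000, 4e-5)          # M-AST after two decays
```

With a decay factor of 0.9 the rate falls slowly: it takes seven decays (280K steps) for the rate to drop below half of its initial value.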

4.1 Evaluation Tasks

4.1.1 Zero-shot Music Tagging

Given a music clip and a set of candidate text label tags, we define each prediction score as the cosine similarity between the audio embedding of the music clip and the text embedding of each tag string. The generalization ability of the proposed method to potentially unseen target labels is achieved through (i) the use of a contextual text encoder, which provides a flexible prediction space, and (ii) the use of cross-modal contrastive learning to anchor the language semantics to an audio representation.

We conduct this evaluation with two music tagging benchmarks: MagnaTagATune (MTAT) [29] and the music related portion of AudioSet [16]. For MagnaTagATune, we consider both the well-exercised top-50 tag set, as well as the full 188 tag set. We use standard train/validation/test partitions (note that zero-shot experiments do not use train/validation) and report class-balanced area under the receiver operating characteristic curve (AUC-ROC) on the test set. The audio clips in MagnaTagATune are 29 seconds long, so we split each into 3 non-overlapping 10-second segments and average the segment-level embeddings to get the clip-level embedding. For AudioSet, we consider a 25-way genre tagging task (Gen-25) as studied in [22], and a richer 141-way tagging task (Mu-141) that includes the entire music subtree of AudioSet ontology.
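The scoring procedure can be sketched in a few lines: segment embeddings are averaged into a clip embedding, which is then compared with each tag embedding by cosine similarity. The toy 2-D vectors below stand in for the 128-dimensional outputs of the audio and text towers, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def clip_embedding(segment_embeddings):
    # Element-wise mean of the segment-level embeddings.
    n = len(segment_embeddings)
    return [sum(col) / n for col in zip(*segment_embeddings)]

def zero_shot_scores(segment_embeddings, tag_embeddings):
    # One prediction score per candidate tag string.
    clip = clip_embedding(segment_embeddings)
    return {tag: cosine(clip, emb) for tag, emb in tag_embeddings.items()}

# Toy data: three segment embeddings for one clip, two candidate tags.
segments = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
tags = {"rock": [1.0, 0.0], "piano": [0.0, 1.0]}
scores = zero_shot_scores(segments, tags)
```

Since cosine similarity is invariant to scale, the averaged clip embedding does not need to be re-normalized before scoring. New tags can be added at evaluation time simply by embedding their text, with no retraining.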

It is important to note that AudioSet is included in contrastive training, and a fraction of MTAT classes overlap with the AudioSet ontology. As a result, AudioSet and (to a lesser extent) MTAT evaluations are not strictly zero-shot from a label exposure perspective. However, the explicit, matched AudioSet supervision is diluted by the abundance of free-form language supervision during MuLan training. Therefore, by comparing MuLan models and conventional AudioSet classifiers, we can measure the cost of moving to a flexible natural language interface that additionally supports classes outside the AudioSet ontology.

4.1.2 Transfer Learning with Linear Probes

In addition to the zero-shot experiments introduced above, we also evaluate the audio encoder as a general purpose feature extractor for downstream tagging tasks. We again consider the two benchmarks of MagnaTagATune and AudioSet, and use the training datasets to train an independent per-class logistic regression layer on top of the frozen 128-dimensional audio embeddings. We follow the same evaluation protocol of past transfer learning studies using these datasets, allowing for a direct comparison of performance.
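A linear probe of this kind can be illustrated with a small pure-Python sketch that fits one independent logistic regression per class on fixed embeddings. The optimizer (plain batch gradient descent), its learning rate and epoch count, and all function names are assumptions for illustration; the toy 2-D embeddings stand in for the frozen 128-dimensional audio embeddings.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(probe, x):
    w, b = probe
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_probe(embeddings, labels, lr=0.5, epochs=200):
    """Fit a single binary logistic regression on frozen embeddings."""
    d, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(embeddings, labels):
            err = predict((w, b), x) - y   # gradient of the log loss w.r.t. the logit
            for k in range(d):
                grad_w[k] += err * x[k]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def train_all_probes(embeddings, label_matrix):
    # One independent probe per class; the embeddings themselves are never updated.
    num_classes = len(label_matrix[0])
    return [train_probe(embeddings, [row[c] for row in label_matrix])
            for c in range(num_classes)]

embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
label_matrix = [[1, 0], [1, 0], [0, 1], [0, 1]]   # rows: clips, columns: classes
probes = train_all_probes(embeddings, label_matrix)
```

Because only the per-class weights and biases are trained, tagging quality under this protocol reflects how linearly separable the classes already are in the pretrained embedding space.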

4.1.3 Music Retrieval from Text Queries

Given a music search collection and a text query, MuLan provides the ability to retrieve the music clips that are closest to the query in the embedding space. This evaluation is relevant to music retrieval applications, where content features can offer finer-grained and more complete similarity information when compared with metadata-based methods [43]. We consider a proprietary collection of 7,000 expert-curated playlists, which do not overlap with the playlist information used in training. Each expert-curated playlist has a title and a description, and consists of 10-100 music recordings. The playlist titles are usually short phrases, including a mixture of genres, sub-genres, moods, activities, artist names, and compositional elements (e.g. ‘Indie Pop Workout’, ‘Relaxing Korean Pop’). Playlist descriptions consist of one or more complete sentences (see pos/neg entries of "Playlist" row of Table 3 for examples). The playlist evaluation includes approximately 100K unique recordings.

We construct two cross-modal retrieval evaluation sets from the expert-curated playlist data, one using titles as queries and the other using descriptions. For each dataset, we use the recordings belonging to the corresponding playlist as the ground truth retrieval targets, and all the 100K recordings as the pool of candidates. We report both AUC-ROC and mean average precision (mAP). We use the same embedding averaging and cosine similarity-based scoring mechanism as in the zero-shot tagging case. However, the playlist information is of substantially different nature compared to the tags involved in the music tagging benchmarks. Instead of a small vocabulary of mostly basic genres and instruments, the playlist titles and descriptions have much finer-grained information and are similar to queries that are presented to music search engines.
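For reference, the ranking and average precision computation for a single query can be sketched as below; mAP is then the mean of this value over all queries. The embeddings are toy 2-D unit vectors, so the dot product equals cosine similarity, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(query_emb, candidates):
    # Sort candidate recording ids by decreasing similarity to the text query.
    return sorted(candidates, key=lambda rid: dot(query_emb, candidates[rid]),
                  reverse=True)

def average_precision(ranked_ids, relevant):
    # Mean of precision@k taken at each rank k where a relevant item appears.
    hits, precision_sum = 0, 0.0
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

# Toy candidate pool; the ground truth playlist contains recordings "a" and "c".
candidates = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.6, 0.8], "d": [0.0, 1.0]}
ranked = rank_candidates([1.0, 0.0], candidates)
ap = average_precision(ranked, relevant={"a", "c"})
```

Here the relevant items land at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2. With 10-100 targets among roughly 100K candidates, average precision is a much stricter measure than AUC-ROC, which helps explain why mAP values can be low even when AUC-ROC is high.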

Eval Set	Anchor / Positive / Negative
Ontology	Steelpan / Sounds of a tuned percussion instrument originally constructed from steel oil drums by hammering out small patches on the head to produce separate pitches. / The sound of a musical instrument that produces sound by vibration of air in a tubular resonator in sympathy with the vibration of the player’s lips.
Playlist	Relaxing Korean Pop / Lets make your chill mood with a collection of easy-going sounds from Korean artists. / These fun and upbeat songs from the alternative side of the pop music spectrum will keep you energized while you exercise.

Table 3: Text triplet evaluation examples.

(a) Zero-shot (Trained w/ ASET + SF + LF + PL)
	AudioSet		MTAT
Model	Gen-25	Mu-141	Top-50	All-188
M-AST	0.840	0.909	0.778	0.776
M-Resnet-50	0.840	0.899	0.782	0.772
(b) Text ablation (using M-Resnet-50 Zero-shot)
ASET + SF + LF	0.839	0.907	0.760	0.756
ASET + SF	0.839	0.885	0.754	0.747
ASET	0.886	0.942	0.753	0.771
SF/LF Unfiltered	0.845	0.908	0.774	0.766
(c) Linear probe
M-AST	0.906	0.942	0.925	0.953
M-Resnet-50	0.910	0.940	0.927	0.954
Baselines:
Hybrid [22]	0.904	0.920	0.915	0.941
JukeBox [8, 3]	-	-	0.915 $^{*}$	-
MuLaP [31]	-	-	0.893 $^{*}$	-
CLMR [39]	-	-	0.866 $^{*}$	-
(d) End-to-end training baselines
AST [17]	0.888	0.949	-	-
SC-CNN [45]	-	-	0.913 $^{*}$	-

$^{*}$ indicates that the number is brought from the original paper.

Table 4: Music tagging results reported in AUC-ROC.

4.1.4 Text Triplet Classification

Compared to the conventional pre-trained BERT model, our text encoder is fine-tuned using in-domain music data and a cross-modal contrastive loss. Note that there are no text-only training objectives. To measure whether our proposed method deepens the text encoder’s understanding of music-related text, we directly evaluate the text embeddings with a triplet classification task. Each triplet consists of 3 text strings of the form (anchor, pos, neg), and it is considered correct if pos is closer than neg to anchor in the text embedding space. We derive two such text triplet evaluation sets. The first uses the AudioSet ontology [16]: for each of the 141 music-related classes, we use its label string as the anchor text and its long-form description as the positive text, and we sample the long-form descriptions of 5 other random classes as negative texts to construct 5 triplets. For the second set, we sample 1,000 triplets from the expert-curated playlist data in a similar fashion: we first sample a playlist, set the anchor and positive text to be its title and description, respectively, and then set the negative text to be the description of another randomly sampled playlist. Examples of both sets are shown in Table 3.
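The triplet criterion can be sketched as follows. Closeness is measured here with cosine similarity, which is an assumption (for $ℓ_{2}$-normalized embeddings it orders pairs the same way as Euclidean distance). The lookup table `toy_embeddings` is a hypothetical stand-in for a real text encoder.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def triplet_accuracy(triplets, embed):
    # A triplet is correct when pos is closer to anchor than neg is.
    correct = 0
    for anchor, pos, neg in triplets:
        a, p, n = embed(anchor), embed(pos), embed(neg)
        if cosine(a, p) > cosine(a, n):
            correct += 1
    return correct / len(triplets)

# Hypothetical text embeddings; a real evaluation would call the text tower instead.
toy_embeddings = {
    "A": [1.0, 0.0], "A pos": [0.9, 0.1], "A neg": [0.0, 1.0],
    "B": [0.0, 1.0], "B pos": [1.0, 0.0], "B neg": [0.1, 0.9],
}
triplets = [("A", "A pos", "A neg"), ("B", "B pos", "B neg")]
accuracy = triplet_accuracy(triplets, toy_embeddings.__getitem__)
```

In this toy case the first triplet is classified correctly and the second is not, so the accuracy is 0.5, which is also the expected value for an encoder that orders pos and neg at random.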

4.2 Results and Discussion

4.2.1 Music Tagging

Table 4(a) shows the zero-shot tagging metrics, where M-Resnet-50 and M-AST obtain comparable performance. Note that there can be a significant misalignment between the word sense of a label in the tagging evaluation and its sense in our training text. This can lead to a degradation in performance relative to the explicitly supervised linear probe setting, where the task-expected tag semantics can be learned. The MTAT gap is substantially larger than AudioSet’s, driven by particularly poor performance for (i) MTAT tags with nonspecific meaning or multiple senses, e.g. “weird” and “beats”; and (ii) MTAT tags involving simple negation (e.g. “not rock”, “no piano”). This is a result of the text encoder not adequately modeling the meaning of these negated concepts, which is a well-known problem with BERT [10, 40] (the text embedding of “not rock” is similar to that of “rock”, and performance suffers).

Table 4(b) shows the results of the text ablation study, which aims to understand the benefits of different sources of text labels. Note that as we remove each dataset we maintain the same proportions described in Section 3.4. Unsurprisingly, training with AudioSet alone gets the highest AUC in AudioSet evaluation, with the text encoder learning the exact label semantics reflected in the test data. On the other hand, including more data sources in general improves performance on all other downstream tasks (MTAT, retrieval/text triplet evaluations in Tables 5 and 6) and the loss on AudioSet AUC is relatively minor. We observe that for the music tagging tasks considered, training with unfiltered data actually achieves performance comparable to that of the filtered version. That the model learns similarly useful associations without being overwhelmed by the sheer amount of noise in the raw text data came as a surprise. We speculate that our text filtering was too aggressive, having removed annotations that were not obviously music-related, but semantically important nonetheless. Since contrastive learning is highly noise tolerant, the gain from restricting to more strongly aligned audio-text pairs may have been offset by the loss of a large set of additional useful pairs.

	Title		Description
Model	AUC	mAP	AUC	mAP
M-AST	0.933	0.110	0.903	0.090
M-Resnet-50	0.931	0.104	0.901	0.084
Text Ablation:
ASET+SF+LF	0.917	0.101	0.892	0.077
ASET+SF	0.913	0.089	0.867	0.060
ASET	0.626	0.005	0.688	0.009
SF/LF Unfiltered	0.933	0.111	0.897	0.081

Table 5: Text query music retrieval evaluation results. Text ablation/unfiltered models use M-Resnet-50.

Model	Playlist	AudioSet
M-AST	0.959	0.962
M-Resnet-50	0.945	0.951
Text Ablation:
ASET + SF + LF	0.935	0.952
ASET + SF	0.910	0.938
ASET	0.693	0.818
SF/LF Unfiltered	0.949	0.959
Baselines:
SimCSE [15]	0.950	0.938
SBERT [37]	0.942	0.889
USE [4]	0.918	0.946
BERT [25]	0.850	0.847

Table 6: Text triplet classification accuracy on the playlist title-to-description evaluation and the AudioSet ontology evaluation. Text ablation/unfiltered models use M-Resnet-50.

Table 4(c) shows that when applying linear probes on MuLan audio embeddings, we achieve SOTA transfer learning performance on all tagging tasks. This demonstrates that MuLan’s pretrained audio encoder continues to produce high quality general-purpose music audio embeddings, while also supporting new natural language applications. Finally, Table 4(d) lists end-to-end training baselines for 3 of these tasks. Our linear probe results exceed 2 of the 3, and only slightly trail a SOTA AST AudioSet classifier.

4.2.2 Music Retrieval from Text Queries

In Table 5, we evaluate MuLan models (including with text/filter ablation) on the query retrieval evaluation tasks introduced in Section 4.1.3. Even though we start with a BERT checkpoint pretrained with massive language resources, training MuLan with only AudioSet clips and label annotations provides very limited ability to ground in-domain natural language to music. Such limited cross-modal supervision does not generalize to the rich semantics that appear in the playlist titles and descriptions, which are more in line with the complex queries that are presented to real-world music search engines. We observe significant gain after including the large-scale short-form tags mined from the internet, which helps the model learn to ground more fine-grained music concepts. There is additional gain when including comments and playlist data, where the complete sentences are helpful for grounding the more complex queries, including multi-term queries (e.g. ‘instrumental action movie soundtrack’), compositional queries (e.g. ‘classical music with middle eastern influence’), and even queries with negation (e.g. ‘hard rock without vocals’). Again, we find that training is surprisingly robust to annotation noise, achieving similar performance using unfiltered training text.

4.2.3 Text Triplet Classification

Table 6 lists triplet classification accuracy on the evaluations introduced in Section 4.1.4. We compare the MuLan text embedding against the following baselines: Sentence Transformer [37], SimCSE [15], Universal Sentence Encoder [4], and the average token embedding of BERT-base-uncased (this outperforms the [CLS] encoding by a large margin). All baselines are Transformer-based models with similar size to ours. The first three were trained with sentence-level contrastive loss, while BERT is trained with masked language prediction. We warm-start the MuLan text encoder using this same BERT baseline, but it is subsequently only trained with the cross-modal loss. We find that when including our long-form text annotations, the resulting text embedding model, which is now specialized to the music domain, outperforms the generic sentence embedding models. While it is not surprising that in-domain text is helpful, it is remarkable that successful specialization is accomplished without using any text-only fine-tuning loss.

5 Conclusions

We presented a joint embedding model of music audio and natural language, trained on an unprecedented scale of weakly paired text and audio data. Our experiments demonstrate the versatility of the natural language interface in a range of applications. The pretrained audio embeddings also achieve state-of-the-art transfer learning performance on music tagging benchmarks. This is a first attempt at building a free-form natural language interface for music audio, and there is plenty of room for improvement. Specifically, we believe that improved text filtering methods, which better distinguish weak signal from pure noise, will result in better handling of rare and subtle language constructs.

References

[1] TensorFlow Hub (2022) https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4. Cited by: §3.3.
[2] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere (2011) The million song dataset. In Proc. of International Conference on Music Information Retrieval (ISMIR), Cited by: §2.
[3] R. Castellon, C. Donahue, and P. Liang (2021) Codified audio language modeling learns useful representations for music information retrieval. In Proc. of International Conference on Music Information Retrieval (ISMIR), Cited by: §2, Table 4.
[4] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al. (2018) Universal sentence encoder. arXiv:1803.11175. Cited by: §4.2.3, Table 6.
[5] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In Proc. of International Conference on Machine Learning (ICML), Cited by: §3.1.
[6] K. Choi, G. Fazekas, M. Sandler, and K. Cho (2017) Transfer learning for music classification and regression tasks. In Proc. of International Conference on Music Information Retrieval (ISMIR), Cited by: §2.
[7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
[8] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever (2020) Jukebox: a generative model for music. arXiv:2005.00341. Cited by: §2, Table 4.
[9] K. Drossos, S. Lipping, and T. Virtanen (2020) Clotho: an audio captioning dataset. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1.
[10] A. Ettinger (2020) What BERT is not: lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics 8, pp. 34–48. Cited by: §4.2.1.
[11] X. Favory, K. Drossos, T. Virtanen, and X. Serra (2020) COALA: co-aligned autoencoders for learning semantically enriched audio representations. arXiv:2006.08386. Cited by: §1, §2.
[12] X. Favory, K. Drossos, T. Virtanen, and X. Serra (2021) Learning contextual tag embeddings for cross-modal alignment of audio and tags. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1, §2.
[13] A. Ferraro, X. Favory, K. Drossos, Y. Kim, and D. Bogdanov (2021) Enriched music representations with multiple cross-modal contrastive learning. IEEE Signal Processing Letters 28, pp. 733–737. Cited by: §2.
[14] F. Font, G. Roma, and X. Serra (2013) Freesound technical demo. In Proc. of the ACM International conference on Multimedia (MM), Cited by: §2.
[15] T. Gao, X. Yao, and D. Chen (2021) SimCSE: simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6894–6910. Cited by: §4.2.3, Table 6.
[16] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) AudioSet: an ontology and human-labeled dataset for audio events. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2, §3.2, §4.1.1, §4.1.4.
[17] Y. Gong, Y. Chung, and J. Glass (2021) AST: audio spectrogram transformer. arXiv:2104.01778. Cited by: §2, §3.2, §3.2, Table 4.
[18] Y. Gong, C. J. Lai, Y. Chung, and J. Glass (2021) SSAST: self-supervised audio spectrogram transformer. arXiv:2110.09784. Cited by: §2, §2.
[19] A. Guzhov, F. Raue, J. Hees, and A. Dengel (2022) AudioCLIP: extending CLIP to image, text and audio. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2.
[20] P. Hamel, M. E. Davies, K. Yoshii, and M. Goto (2013) Transfer learning in MIR: sharing learned latent representations for music audio classification and similarity. In Proc. of International Conference on Music Information Retrieval (ISMIR), Cited by: §2.
[21] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. (2017) CNN architectures for large-scale audio classification. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §3.2.
[22] Q. Huang, A. Jansen, L. Zhang, D. P. Ellis, R. A. Saurous, and J. Anderson (2020) Large-scale weakly-supervised content embeddings for music recommendation and tagging. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2, §2, §4.1.1, Table 4.
[23] A. Jansen, M. Plakal, R. Pandya, D. P. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous (2018) Unsupervised learning of semantic audio representations. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2.
[24] C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. Le, Y. Sung, Z. Li, and T. Duerig (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In Proc. of the International Conference on Machine Learning (ICML), Cited by: §1, §1, §1, §2.
[25] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186. Cited by: §3.3, Table 6.
[26] K. Kilgour, B. Gfeller, Q. Huang, A. Jansen, S. Wisdom, and M. Tagliasacchi (2022) Text-driven separation of arbitrary sounds. arXiv:2204.05738. Cited by: §1.
[27] C. D. Kim, B. Kim, H. Lee, and G. Kim (2019) AudioCaps: generating captions for audios in the wild. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL): Human Language Technologies, Cited by: §1.
[28] A. S. Koepke, A. Oncescu, J. Henriques, Z. Akata, and S. Albanie (2022) Audio retrieval with natural language queries: a benchmark study. IEEE Transactions on Multimedia. Cited by: §1, §1.
[29] E. Law and L. Von Ahn (2009) Input-agreement: a new mechanism for collecting data using human computation games. In Proc. of the ACM SIGCHI Conference on Human Factors in Computing Systems, Cited by: §4.1.1.
[30] H. Lee, J. Lee, J. Y. Ng, and P. Natsev (2020) Large scale video representation learning via relational graph clustering. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.1.
[31] I. Manco, E. Benetos, E. Quinton, and G. Fazekas (2022) Learning music audio representations via weak language supervision. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2, Table 4.
[32] A. Nagrani, P. H. Seo, B. Seybold, A. Hauth, S. Manen, C. Sun, and C. Schmid (2022) Learning audio-video modalities from image captions. arXiv:2204.00679. Cited by: §1, §2.
[33] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv:1807.03748. Cited by: §3.1.
[34] S. Oramas, O. Nieto, F. Barbieri, and X. Serra (2017) Multi-label music genre classification from audio, text, and images using deep features. In Proc. of International Conference on Music Information Retrieval (ISMIR), Cited by: §2.
[35] K. J. Piczak (2015) ESC: dataset for environmental sound classification. In Proc. of the ACM International Conference on Multimedia (MM), Cited by: §2.
[36] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In Proc. of the International Conference on Machine Learning (ICML), Cited by: §1, §2.
[37] N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992. Cited by: §4.2.3, Table 6.
[38] A. Saeed, D. Grangier, and N. Zeghidour (2021) Contrastive learning of general-purpose audio representations. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2.
[39] J. Spijkervet and J. A. Burgoyne (2021) Contrastive learning of musical representations. arXiv:2103.09410. Cited by: §2, Table 4.
[40] G. N. C. Tejada, J. Scholtes, and G. Spanakis (2021) A study of BERT’s processing of negations to determine sentiment. In Proc. of the Benelux Conference on Artificial Intelligence and Belgian-Dutch Conference on Machine Learning, Cited by: §4.2.1.
[41] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In Proc. of European Conference on Computer Vision (ECCV), Cited by: §3.1.
[42] N. Turpault, R. Serizel, and E. Vincent (2019) Semi-supervised triplet loss based learning of ambient audio embeddings. In Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §2.
[43] R. Typke, F. Wiering, R. C. Veltkamp, J. D. Reiss, G. A. Wiggins, et al. (2005) A survey of music information retrieval systems. In Proc. of International Conference on Music Information Retrieval (ISMIR), Cited by: §4.1.3.
[44] A. Van Den Oord, S. Dieleman, and B. Schrauwen (2014) Transfer learning by supervised pre-training for audio-based music classification. In Proc. of International Conference on Music Information Retrieval (ISMIR), Cited by: §2.
[45] M. Won, A. Ferraro, D. Bogdanov, and X. Serra (2020) Evaluation of CNN-based automatic music tagging models. arXiv:2006.00751. Cited by: Table 4.
[46] M. Won, J. Salamon, N. J. Bryan, G. J. Mysore, and X. Serra (2021) Emotion embedding spaces for matching music to stories. In Proc. of International Conference on Music Information Retrieval (ISMIR), Cited by: §2.
[47] H. Xie and T. Virtanen (2021) Zero-shot audio classification via semantic embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 1233–1242. Cited by: §2.