A Study on Broadcast Networks for Music Genre Classification

Ahmed Heakl^{1}, Abdelrahman Abdelgawad^{2}, Victor Parque^{3}

^{1} Department of Computer Science, Egypt-Japan University of Science and Technology, Alexandria, Egypt

^{2} Department of Mechatronics and Robotics, Egypt-Japan University of Science and Technology, Alexandria, Egypt

^{3} Department of Modern Mechanical Engineering, Waseda University, Tokyo, Japan

Abstract

Due to the increased demand for music streaming and recommender services and the recent developments of music information retrieval frameworks, Music Genre Classification (MGC) has attracted the community’s attention. However, convolutional-based approaches are known to lack the ability to efficiently encode and localize temporal features. In this paper, we study broadcast-based neural networks, aiming to improve localization and generalizability under a small set of parameters (about 180k), and investigate twelve variants of broadcast networks, discussing the effect of block configuration, pooling method, activation function, normalization mechanism, label smoothing, channel interdependency, LSTM block inclusion, and variants of inception schemes. Our computational experiments using relevant datasets such as GTZAN, Extended Ballroom, HOMBURG, and Free Music Archive (FMA) show state-of-the-art classification accuracies in MGC. Our approach offers insights and the potential to enable compact and generalizable broadcast networks for music classification.

Index Terms: convolutional neural networks, broadcast networks, music classification, music genre classification

I Introduction

Along with recent developments in Music Information Retrieval (MIR) and the surge of music streaming services, the demand for Music Genre Classification (MGC) has increased rapidly. It is natural for a person’s musical taste to favor specific music genres, and the automatic classification of music tracks into genres is desirable for tailored streaming services. Consumers presently access millions of songs via services such as Spotify, Tidal, and Apple Music, all of which need automatic genre classification for the end user. Millions of music files in internet databases also require classification. As such, the automated classification of audio files is expected to relieve the time-consuming and laborious nature of manual MGC. Machine learning approaches have been used to address the MGC problem since 2002 [1], when the earliest reported classification accuracy of 60% became the benchmark.

According to the type of input data, MGC is divided into symbolic MGC and direct audio MGC [2]. The earliest attempts at MGC were symbolic in nature: features are manually extracted from the audio signal and then fed to the chosen model. As such, after the work of [1], researchers began searching for the most representative hand-crafted features for music classification.

Current work in symbolic MGC focuses on improving feature extraction so that the model can identify the genre of a track from the most dominant features representing its musical identity. This process requires considerable knowledge of audio signal processing and music. Furthermore, manually extracted features are not universal: even if they perform well in a specific task (MGC in this case), they may not perform equally well in a different one [3]. The most crucial problem with hand-crafted features is that genres are not clearly defined mathematically, making it hard to determine which features of a song best represent its genre. Classifiers often confuse similar genres [4], e.g., pop and rock, or experimental and instrumental. Attached metadata, which provides information about the samples, can help in some of these cases. However, as metadata is not always available and because local computing power has dramatically improved, interest in local automatic MGC has grown [5].

On the other hand, direct audio MGC uses the audio signal as input, optionally after preprocessing operations. [6] and [7] paved the way for using CNNs in MGC: they were the first to introduce MGC in the visual domain using spectrograms and wavelet packets, respectively, as inputs for the classifier. Having proved effective in image classification [8, 9, 10, 11, 12, 13], CNNs received favorable attention from the MGC community due to their ability to extract features automatically from the audio signal. For CNNs to be used in MGC, the common practice is to use a mel-spectrogram as input, which is much more efficient than the raw amplitude signal [14]. Spectrograms are constructed by applying overlapping short windows to the original signal and taking the Fast Fourier Transform of each window; as such, they are rich in textural features, which is advantageous when used in conjunction with CNNs [15].

Although mel-spectrograms are an effective way to encode audio signals, CNNs are generally better at extracting spatial features than at capturing the temporal features of the audio signal. Extracting temporal features is not an issue in image classification, but it is a challenging task in MGC [5]. RNNs, on the other hand, can extract temporal features from the spectrogram because they operate on sequential data. Using both convolutional and recurrent neural networks enabled the extraction of both spatial and temporal features of the audio signal, as in [3] and [16]. [17] compared CNNs, LSTMs, and several classical algorithms for MGC and found that CNNs were the most effective and accurate, outperforming even ensemble classifiers, in line with the findings of [18] and [19]. Some works, such as [20], suggested that low-level features are important in MGC, and [21] showed that low-level features are more important than high-level features, as many genres depend heavily on rhythmic patterns and tempo, which are essentially low-level features. Yet the relevance of low-level features poses a new challenge, since they are usually lost in the depth of CNNs.

The class of Bottom-up Broadcast Neural Networks (BBNN) is the most recent architecture implementing the inception mechanism to extract high-level features while preserving low-level ones from the spectrograms [22]. Although attractive classification performance was reported on the GTZAN and Ballroom-based datasets, it remains unclear whether broadcast networks offer competitive generalization on a broader class of music classification settings [22]. Thus, to further explore the performance and generalization landscape of BBNN, we investigate the class of broadcast network architectures and its variants for MGC. In particular, our contributions are as follows:

we propose a broadcast network that improves the localization and generalizability under a smaller set of parameters (about 180k),
we conduct the computational experiments considering well-known and challenging benchmark datasets such as GTZAN, Extended Ballroom, HOMBURG, and Free Music Archive (FMA), showing the state-of-the-art classification accuracies in MGC, and
we investigate twelve variants of broadcast networks discussing the effect of block configuration, pooling method, activation function, normalization mechanism, label smoothing, channel interdependency, LSTM block inclusion, and variants of inception schemes.

The above contributions aim at elucidating the performance frontiers of broadcast architectures towards a more generalizable class for MGC. To the best of our knowledge, our studies comparing broadcast networks and their variations in well-known and challenging MGC datasets are the first in the community.

Fig. 1: The architecture of our proposed model.

II Proposed Architecture

In this section, we present the basic concepts and motivations of our proposed approach.

II-A Basic Concept

Fig. 1 shows the proposed architecture of the broadcast network. Inspired by BBNN [22] and by the way in which the ear uses distinct filters to analyze sound characteristics [9, 23], we propose broadcast networks comprising multiple feature extractors that render feature maps at different scales. Extending the inception scheme [9], each inception block uses three convolutions with two different kernel sizes along with a max-pooling layer and an activation layer. Moreover, we used residual connections, as they smooth the loss landscape and allow for deeper architectures [24]. Residual connections also transmit low-level features, which are crucial to MGC [21], to all layers. As such, our module aims at exploiting multiple features in the classifier layer. Compared to the baseline bottom-up broadcast network [22], we implement five relevant variations: (1) the removal of the 3x3 convolution, (2) the replacement of the 5x5 convolution by a 3x3 convolution, (3) the extension of the number of blocks to four instead of three, (4) the change of the kernel initializer to LeCun normal, and (5) the replacement of the 3x3 convolutions by asymmetric factorized convolutions. For each block, we adopt filter sizes of 1x1, 1x1, and 3x3, with a 1x1 convolution before the latter to reduce its channel size and create a bottleneck representation [9]. We also factorized the 3x3 convolutions into 3x1 and 1x3 convolutions, as suggested by [10]. These choices are motivated by the following notions:

The removal of 3x3 convolutions and the replacement of 5x5 convolutions allowed our model to use more feature maps under the same complexity constraints ($\sim$180k parameters), which enriches the model capacity and its ability to capture more textural content from the corresponding spectrograms. Moreover, these changes provide better generalization and more depth under the same constraints, as shown by the training on the FMA dataset (Table II).
By adding one additional block, we compensate for the capacity loss from removing the 3x3 convolutions.
Changing the kernel initializer was found experimentally to increase accuracy significantly.
Finally, the asymmetric factorized convolutions were implemented following the results of [10].
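To make the parameter argument concrete, the following minimal sketch counts the weights of a full 5x5 convolution, a full 3x3 convolution, and a 3x3 convolution factorized into 3x1 and 1x3. The channel widths of 32 and the inclusion of biases are assumptions chosen for illustration, and `conv2d_params` is a hypothetical helper, not part of any library.

```python
def conv2d_params(kh, kw, c_in, c_out, bias=True):
    """Trainable parameters of a standard 2D convolution layer."""
    return kh * kw * c_in * c_out + (c_out if bias else 0)


# Hypothetical channel widths, used only for illustration.
C_IN, C_OUT = 32, 32

full_5x5 = conv2d_params(5, 5, C_IN, C_OUT)
full_3x3 = conv2d_params(3, 3, C_IN, C_OUT)
# Asymmetric factorization: a 3x1 convolution followed by a 1x3 convolution.
factorized_3x3 = (conv2d_params(3, 1, C_IN, C_OUT)
                  + conv2d_params(1, 3, C_OUT, C_OUT))

print(full_5x5, full_3x3, factorized_3x3)  # 25632 9248 6208
```

Under these assumptions, the factorized pair needs roughly two thirds of the parameters of a full 3x3 convolution, which is the kind of saving that can be reinvested in additional feature maps or blocks.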

II-B Network Architecture

As shown in Fig. 1, the proposed network consists of $L$ connected modules with residual connections in between, allowing higher-level layers to receive all feature maps from previous layers. We denote by $X_{SL}$ the feature maps rendered by the shallow layers, which serve as the input of our module, and by $L$ the number of blocks of our module. For simplicity and without loss of generality, we fixed $L = 4$. Thus, the input of the $l$-th block, $l = 1, \ldots, L$, can be represented as:

$$X_{l} = f_{1}([X_{SL}, X_{1}, \ldots, X_{l-1}]), \tag{1}$$

where $[X_{SL}, X_{1}, \ldots, X_{l-1}]$ refers to the concatenation of the feature maps produced by blocks $0, \ldots, l - 1$, and $f_{1}$ is a composite function of all operations in the inception block [22].

Fig. 2: Confusion matrices of our proposed model in all datasets.

As shown in Fig. 1, the model comprises 22 convolution layers (5 are 3x3 and 17 are 1x1) with 4 inception blocks, which increases generalizability and allows the extraction of low-level features. As recommended by [25], we used Batch Normalization after each convolution layer, followed by a ReLU activation; however, we changed this order inside the inception blocks, inspired by the findings of [26]. Batch Normalization reduces the need for additional regularization and mitigates the dying ReLU problem. All the layers of the network can be grouped into four parts: the shallow feature extraction layers, the proposed module, the transition layers, and the decision layers. The model aims to learn the parameters $θ$ of a function $F (X_{0} | θ)$ that maps the input spectrogram $X_{0}$ to its corresponding genre $p$, as follows:

$$p = F(X_{0} \mid θ), \tag{2}$$

$$p = f_{DL}(f_{TL}(f_{PM}(f_{SL}(X_{0} \mid θ_{SL}) \mid θ_{PM}) \mid θ_{TL}) \mid θ_{DL}). \tag{3}$$

We used a block-based architecture to simplify the design process and to keep a constant growth rate, which facilitates scaling towards large datasets. Our model has a growth rate of $k_{l} = k_{l - 1} + 4 \times f$, where $k_{l}$ is the number of output channels of the $l$-th block and $f$ is the number of convolution filters. We fixed the number of convolution filters to 32. Hence, the output channel dimension of the proposed module is 544, which is reduced in later steps by the transition layer before classification.
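The growth rule can be checked with a few lines of arithmetic. The sketch below unrolls $k_{l} = k_{l-1} + 4 \times f$ for $L = 4$ blocks and $f = 32$; the initial width $k_{0} = 32$ (the number of channels delivered by the shallow layers) is an assumption inferred from the reported total of 544, and `module_channels` is a hypothetical helper.

```python
def module_channels(k0, filters, branches, blocks):
    """Channel count after each block when every block concatenates
    `branches * filters` new feature maps onto its input."""
    channels = [k0]
    for _ in range(blocks):
        channels.append(channels[-1] + branches * filters)
    return channels


# k0=32 is an assumption; f=32 filters and L=4 blocks follow the text.
widths = module_channels(k0=32, filters=32, branches=4, blocks=4)
print(widths)  # [32, 160, 288, 416, 544]
```

Each block adds 128 channels, so four blocks on top of 32 input channels give the 544 channels that the transition layer later reduces.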

GTZAN		HOMBURG		Extended Ballroom		FMA
Genre	Tracks	Genre	Tracks	Genre	Tracks	Genre	Tracks
Classic	100	Electronic	113	Cha Cha	455	Rock	1000
Jazz	100	Jazz	319	Jive	350	International	1000
Blues	100	Blues	120	Quickstep	497	Folk	1000
Metal	100	Funk/Soul	47	Rumba	470	Experimental	1000
Pop	100	Pop	116	Samba	468	Instrumental	1000
Rock	100	Rock	504	Tango	464	Pop	1000
Country	100	Country	222	Viennese Waltz	252	Hip-Hop	1000
Disco	100	Alternative	145	Waltz	529	Electronic	1000
Hiphop	100	Hiphop	300	Foxtrot	507
Reggae	100			Pasodoble	53
				Salsa	47
				Slow Waltz	65
				West Coast Swing	23
Total	1000	Total	1886	Total	4180	Total	8000

TABLE I: Number of tracks per genre for each Dataset

III Computational Experiments

We performed computational comparisons using relevant datasets in the community. This section describes our findings.

III-A Datasets

Table I shows the characteristics of the datasets for MGC used in our study. Basically, we used the well-known and challenging datasets in the community, as follows:

GTZAN Dataset. Introduced by [1], and being the first publicly available and well-structured benchmark, the GTZAN dataset is a popular dataset in MGC [27] [28]. Basically, the GTZAN is a collection of 10 popular genres (blues, classical, country, disco, hip-hop, jazz, metal, pop, reggae, rock) with 100 audio files each, all having a length of 30 seconds.
Extended Ballroom Dataset. Assembled for the ISMIR 2004 rhythm description contest, the ballroom dataset includes 698 30-second music clips grouped into eight categories reflecting distinct ballroom dances. The extended ballroom dataset, introduced by [29], has 4180 30-second tracks and covers 13 distinct genres, improving on the original ballroom dataset with about six times the number of tracks and superior audio quality.
HOMBURG Dataset. The HOMBURG dataset, presented by [30], is a freely available benchmark dataset for audio classification and clustering. The HOMBURG dataset consists of 1886 songs taken from the Garageband site and classified into nine unbalanced genres. Each sample taken randomly from the song lasts for 10 seconds.
Free Music Archive Dataset (FMA). Introduced by [27], this dataset is much larger and more diverse than the datasets mentioned above. It comprises four subsets: Full, Large, Medium, and Small. FMA is a freely available dataset used to evaluate a variety of MIR tasks. It has 917 GB and 343 days of licensed audio from 106,574 tracks from 16,341 artists and 14,854 albums, organized into 161 genres in a hierarchical taxonomy. Due to hardware constraints in our environment, we used the Small subset, which contains about 8,000 30-second tracks distributed over eight balanced genres (7.2 GiB). Since this subset is still larger than GTZAN and the Extended Ballroom, we use it to evaluate the ability to generalize and scale.

Fig. 3: Comparison of the train and validation loss, as well as accuracy between our proposed model and the baseline model for all datasets.

III-B Preprocessing

For fairness, we used the same preprocessing technique as the baseline [22]. The mel-spectrogram of the audio file was fed to the model as input. Spectrograms were created using a Short-Time Fourier Transform (STFT) and then converted to the Mel scale by applying a logarithmic scale to the frequency axis of the spectrogram with the librosa toolbox [31]. We used 128 Mel filters (bands) to represent the audible spectrum range 0-22050 Hz, a frame length of 2048, and a hop size of 1024. With this setup, the mel-spectrogram was of size 646x128.
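The time dimension of the mel-spectrogram follows directly from the clip length and the hop size. The sketch below computes the frame count of a centered STFT, where the signal is padded so that frames are centered on multiples of the hop size. The sampling rate of 22,050 Hz is an assumption (the text does not state it), and the second clip length is a hypothetical value slightly above 30 seconds, included to show how sensitive the count is to a few hundred extra samples.

```python
def centered_stft_frames(n_samples, hop_length):
    """Number of frames of an STFT whose frames are centered on
    multiples of `hop_length` (signal padded at both ends)."""
    return 1 + n_samples // hop_length


SAMPLE_RATE = 22050   # assumption: not stated in the text
HOP_LENGTH = 1024     # hop size given in the text

exact_30s = 30 * SAMPLE_RATE   # 661500 samples
slightly_longer = 661794       # hypothetical clip a bit longer than 30 s

print(centered_stft_frames(exact_30s, HOP_LENGTH))        # 646
print(centered_stft_frames(slightly_longer, HOP_LENGTH))  # 647
```

Under these assumptions, an exactly 30-second clip yields 646 frames, while a clip that is only about 300 samples longer yields 647, so the frame count depends on the exact length of each audio file.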

III-C Training Setup

We used a similar setup for a fair comparison with the baseline BBNN model. Our model was built in Keras and trained on an NVIDIA Quadro RTX 5000 GPU. The model took half an hour to train for 100 epochs on GTZAN. We used a small batch size of 8 since the input resolution is high (647 x 128). Results were measured using 10-fold cross-validation for each dataset and each variation of the model, and we report the highest test accuracy. All models were trained using Adam, as it is suitable for noisy gradients, which is the case for highly textured spectrograms [32]. Following the BBNN model, we set the initial learning rate to 0.01 and decreased it by a factor of 0.5 whenever the loss stopped improving for three epochs. To avoid overfitting, we stopped training early when the training loss stopped decreasing.
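The learning-rate schedule described above can be sketched in plain Python as follows. This is a minimal illustration of a reduce-on-plateau rule with the stated values (initial rate 0.01, factor 0.5, patience of three epochs); the class name `ReduceOnPlateau` and the loss sequence are hypothetical, and the actual training presumably relied on the corresponding Keras callback.

```python
class ReduceOnPlateau:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr=0.01, factor=0.5, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.wait = 0

    def step(self, loss):
        """Update the schedule with this epoch's loss and return the new rate."""
        if loss < self.best:
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr


scheduler = ReduceOnPlateau()
losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]  # hypothetical per-epoch losses
rates = [scheduler.step(loss) for loss in losses]
print(rates)  # [0.01, 0.01, 0.01, 0.01, 0.005, 0.005]
```

In this toy run, the loss stalls at 0.8 for three consecutive epochs after its last improvement, so the rate is halved once and then kept when the loss improves again.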

	GTZAN		HOMBURG		Extended Ballroom		FMA
Model	Validation	Test	Validation	Test	Validation	Test	Validation	Test
BBNN	90.0	89.0	64.0	61.0	94.7	92.3	58.1	56.9
Ours	91.0	90.0	66.1	64.0	93.3	93.1	58.9	58.3

TABLE II: Results comparison between our proposed model and baseline

III-D Results and Discussion

To show the effectiveness of our proposed model, Fig. 2 shows its confusion matrices on all the datasets, Table II compares the validation and test performance of our proposed model and the baseline, and Fig. 3 shows the learning performance in training, validation, and testing. From Fig. 2, Table II, and Fig. 3, we observe the following:

GTZAN Results. As shown by Fig. 3 (a), our model achieved a smaller loss. We also observed that BBNN was unable to differentiate between country, metal, and rock, as they differ in the high-frequency components and are similar in the low frequencies. Since our model focuses more on localization, i.e., it captures the high-frequency components, it achieved a much higher accuracy on these three genres, as shown in Fig. 2 (a). Our model confused hip-hop and pop as the architecture discriminates low-frequency components. Our model improved the test accuracy by 1%, as shown by Table II.
Extended Ballroom Results. As the extended ballroom dataset has four times the samples of GTZAN, as shown in Table I, the model must use its parameters efficiently. Although both our model and the BBNN show similar training loss values, our model exceeds the performance of the BBNN on the test set but not on the validation set, as shown in Fig. 3 (b). Specifically, our model achieves a slightly lower validation accuracy than the BBNN model (-1.4%) and a higher test accuracy (+0.8%), as shown in Table II. The confusion matrix in Fig. 2 (b) shows that our model was unable to differentiate between Rumba, Slow Waltz, and Waltz. Pasodoble was also confused with Tango, which we attribute to the low number of samples of the Pasodoble genre.
HOMBURG Results. To evaluate the ability to generalize to datasets other than GTZAN and the Ballroom dataset, we experimented with the HOMBURG dataset, which has a size comparable to the datasets used to test the baseline model. Our model outperformed the baseline by 3% in test accuracy, as shown in Table II. We also observed that the model was most successful at identifying jazz, rock, and hip-hop, yet it failed at identifying the funk/soul genre, as it is a hybrid genre. Furthermore, our model confused jazz and blues in some cases, as they have very similar low-frequency components, implying the need for further localization in the convolutional architecture.
FMA Results. To evaluate the generalizability in a scaled environment, we trained both our proposed model and the baseline on a larger dataset, i.e., the FMA dataset. Since the FMA dataset has 8000 samples (8 times the size of GTZAN), both our model and the baseline were unable to capture the variance in the data. However, our model slightly outperformed the baseline, by 1.4% in test accuracy, as shown in Table II. We argue this is due to the increased capacity of our model under the same constraints ($\sim$180k parameters), as our model is one block deeper than the original BBNN model. As shown by the learning curves in Fig. 3, our model has a slightly lower loss in most cases. Our model confused pop with rock, which is likely due to the similarity of their low-frequency components, indicating that the model might need further localization to better distinguish them. The model also confused electronic with hip-hop, which is likely due to the similarity between both genres in almost all features, as hip-hop is sometimes categorized as a sub-genre of electronic music.
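The accuracy differences quoted above can be recomputed from Table II with the short sketch below. The numbers are copied from the table; the dictionary layout and the helper `accuracy_deltas` are illustrative choices, and the second column group of Table II is assumed to correspond to the HOMBURG dataset.

```python
# (validation, test) accuracies in percent, copied from Table II.
TABLE_II = {
    "GTZAN": {"BBNN": (90.0, 89.0), "Ours": (91.0, 90.0)},
    "HOMBURG": {"BBNN": (64.0, 61.0), "Ours": (66.1, 64.0)},  # assumed column
    "Extended Ballroom": {"BBNN": (94.7, 92.3), "Ours": (93.3, 93.1)},
    "FMA": {"BBNN": (58.1, 56.9), "Ours": (58.9, 58.3)},
}


def accuracy_deltas(table):
    """Return {dataset: (validation delta, test delta)} of Ours minus BBNN."""
    deltas = {}
    for name, rows in table.items():
        (base_val, base_test), (our_val, our_test) = rows["BBNN"], rows["Ours"]
        deltas[name] = (round(our_val - base_val, 1),
                        round(our_test - base_test, 1))
    return deltas


for name, (d_val, d_test) in accuracy_deltas(TABLE_II).items():
    print(f"{name}: validation {d_val:+.1f}, test {d_test:+.1f}")
```

The test-set gains are +1.0 (GTZAN), +3.0 (HOMBURG), +0.8 (Extended Ballroom), and +1.4 (FMA) percentage points, while the validation accuracy is lower than the baseline only on Extended Ballroom (-1.4).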

III-E Variants of Broadcast Networks

We conducted a study to understand further the ability of broadcast networks to outperform state-of-the-art music genre classifiers like those presented in [33, 34, 35, 21, 36, 37, 38, 39]. In this section, we describe our variations and discuss our findings.

III-E1 Removing 3x3 Convolution

As shown in Table III, removing the 3x3 convolutions dropped the validation and test accuracy on the GTZAN dataset to 84% and 83%, respectively. We argue this is due to the decreased model capacity (the number of parameters decreased by 27,744), implying a decreased localization ability without the 3x3 convolutions. As such, the features extracted from the 5x5 convolution became more dominant, which reduced the overall localization of the features.
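The reported reduction of 27,744 parameters is consistent with removing one biased 3x3 convolution with 32 input and 32 output channels from each of three blocks. These widths and the block count are assumptions inferred from the number itself, not values stated in the text; the sketch below only verifies the arithmetic, with `conv2d_params` as a hypothetical helper.

```python
def conv2d_params(kh, kw, c_in, c_out, bias=True):
    """Trainable parameters of a standard 2D convolution layer."""
    return kh * kw * c_in * c_out + (c_out if bias else 0)


# Assumed configuration: 32 -> 32 channels, bias included, three blocks.
per_block = conv2d_params(3, 3, 32, 32)
removed = 3 * per_block

print(per_block, removed)  # 9248 27744
```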

III-E2 Replacing 5x5 Convolution

Replacing the 5x5 convolutions with 3x3 convolutions did not appear to affect the model capacity, as it gave a validation accuracy of 88%. However, it caused overfitting, as the test accuracy dropped to 77%. This is explained by the significant yet uncompensated decrease in the number of model parameters, which decreased the model’s ability to generalize.

III-E3 Global Average Pooling

As shown in Table III, replacing the global average pooling layer in the decision layer with a global max-pooling layer caused the accuracy to decrease drastically. Global average pooling has a few advantages [40]:

It is more native to the convolution structure by enforcing correspondences between feature maps and categories.
It can be seen as a structural regularizer since there are no parameters to be optimized at this layer and, hence, overfitting is expected to be avoided.
It is more robust to spatial translations of the input as it sums out the spatial information making it useful to be used in the decision layer [25], unlike global max-pooling, which acts as a rectifying unit.
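The difference between the two pooling schemes can be seen on a toy example. The sketch below applies global average pooling and global max pooling to two small, made-up feature maps; the function names and values are illustrative only.

```python
def global_average_pool(feature_maps):
    """One value per feature map: the mean over all spatial positions."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]


def global_max_pool(feature_maps):
    """One value per feature map: the single largest activation."""
    return [max(max(row) for row in fm) for fm in feature_maps]


# Two hypothetical 2x2 feature maps with the same mean but different peaks.
maps = [
    [[0.0, 1.0], [2.0, 9.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]

print(global_average_pool(maps))  # [3.0, 3.0]
print(global_max_pool(maps))      # [9.0, 3.0]
```

Average pooling summarizes every position of a map, whereas max pooling keeps only the strongest response and discards the rest, which is one way to read its rectifying behavior in the decision layer.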

Variant	Validation Accuracy	Test Accuracy
Remove 3x3	84.0 %	83.0 %
Replace 5x5	88.0 %	77.0 %
Global Max Pooling	76.3 %	70.6 %
Dropout	81.5 %	78.3 %
Blocks 1	83.8 %	82.4 %
Blocks 2	86.0 %	82.8 %
Blocks 5	89.0 %	86.0 %
SELU Activation	89.0 %	83.0 %
Group Normalization	78.0 %	82.0 %
Label Smoothing	87.3 %	84.0 %
Squeeze & Excitation	85.0 %	85.0 %
LSTM Layer	85.0 %	82.0 %
Inception Resnet v1	87.5 %	87.1 %
Xception	89.2 %	89.0 %

TABLE III: Variations accuracy on the GTZAN Dataset

III-E4 Dropout

Inspired by the inception model [9], we added a dropout layer before the fully connected layer that maps the 32 feature maps to the number of classes. This dropout keeps 60% of the dense connections. We observed that dropout lowered the validation accuracy to 81.5%. We argue this is due to the small number of parameters of the BBNN model, which leaves little redundant capacity for dropout to regularize.
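As a rough illustration of a dropout layer that keeps 60% of its inputs, the sketch below implements inverted dropout in plain Python: surviving activations are rescaled by the inverse keep probability during training, and the layer is the identity at inference. The function name, the seed, and the input vector are hypothetical.

```python
import random


def inverted_dropout(values, keep_prob=0.6, training=True, rng=None):
    """Zero each value with probability 1 - keep_prob and rescale survivors."""
    if not training:
        return list(values)  # dropout is disabled at inference time
    rng = rng or random.Random(0)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


features = [1.0] * 32  # stand-in for the 32 pooled feature maps
dropped = inverted_dropout(features, rng=random.Random(42))
kept = sum(1 for v in dropped if v != 0.0)
print(f"{kept} of {len(features)} features kept")
```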

III-E5 Number of Blocks

As shown in Table III, variations with 1, 2, and 5 blocks were evaluated on the BBNN. Although none of the variations caused overfitting, none of them rendered a higher accuracy than that obtained from the baseline BBNN with 3 blocks. Decreasing the number of blocks to 1 or 2 decreased the model’s capacity, making it unable to fit the GTZAN dataset, yet increasing the number of blocks did not increase the model’s accuracy.

III-E6 SELU

We used the Scaled Exponential Linear Unit (SELU) to avoid the dying ReLU problem [41], as the SELU activation does not have a zero gradient for input values less than zero. We observed that the model with SELU activation lowered the test accuracy by 3%, as shown by Table III. We argue this is due to SELU mimicking a soft rectifying unit, implying that piecewise linear functions are better at encoding spectrogram features by stacking frequency bands on top of each other.
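The gradient argument can be made explicit with the definitions of both activations. The sketch below implements ReLU and SELU and their derivatives in plain Python, using the standard SELU constants; the function names are illustrative.

```python
import math

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def relu(x):
    return max(0.0, x)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def selu_grad(x):
    return SELU_SCALE * (1.0 if x > 0 else SELU_ALPHA * math.exp(x))


for x in (-2.0, -0.5, 1.0):
    print(x, relu_grad(x), round(selu_grad(x), 4))
```

For negative inputs, the ReLU gradient is exactly zero, whereas the SELU gradient stays positive and decays smoothly, which is why SELU units cannot "die" in the same way.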

Fig. 4: Variants of inception with (a) adding a Squeeze and Excitation Block and (b) adding an LSTM layer.

III-E7 Group Normalization

Batch normalization adds stochasticity and reduces the model’s ability to create discriminative features. Thus, we used Group Normalization (GN) with 16 features per group to keep the performance stable for any batch size, as it calculates the mean and standard deviation over the feature dimension rather than the batch dimension[42]. However, GN reduced the test accuracy by 4% as shown by Table III, implying that batch normalization brings a rather positive effect.
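To illustrate why Group Normalization is independent of the batch size, the sketch below normalizes a single sample by computing the mean and variance over groups of channels, without touching any other sample in the batch. It is a simplified version that omits the learnable scale and shift; the function name, the toy input, and the group size of 2 (instead of the 16 features per group used in the experiment) are illustrative assumptions.

```python
import math


def group_norm(sample, channels_per_group, eps=1e-5):
    """Normalize one sample; `sample` is a list of channels, each a flat
    list of activations. Statistics are shared within each channel group."""
    normalized = []
    for start in range(0, len(sample), channels_per_group):
        group = sample[start:start + channels_per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        normalized.extend([[(v - mean) * inv_std for v in channel]
                           for channel in group])
    return normalized


# Hypothetical sample with 4 channels of 2 activations each, 2 channels per group.
sample = [[1.0, 3.0], [5.0, 7.0], [10.0, 10.0], [20.0, 20.0]]
out = group_norm(sample, channels_per_group=2)
print([[round(v, 3) for v in channel] for channel in out])
```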

III-E8 Label Smoothing

For over-confident models, [10] proposed a mechanism to regularize the classifier layer by estimating the marginalized effect of label dropout during training. As our training loss was low ($\sim 0.4$), suggesting that our model might be over-confident, we implemented label smoothing on the one-hot encoded labels. We used the same fixed distribution as [10]: $u (k) = ϵ / K$ with $ϵ = 0.1$ and $K = 10$ for the GTZAN dataset. We observed that the test accuracy of the model dropped by 2%, as shown by Table III, implying that our model is not over-confident, contrary to what the low loss suggests.
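With a uniform prior $u(k) = ϵ / K$, label smoothing replaces each one-hot target $q(k)$ by $(1 - ϵ)\, q(k) + ϵ / K$. The following sketch applies this rule for $ϵ = 0.1$ and $K = 10$; the function name and the chosen class index are illustrative.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with a uniform distribution over the K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]


target = [0.0] * 10  # K = 10 genres, as in GTZAN
target[3] = 1.0      # arbitrary true class for illustration

smoothed = smooth_labels(target)
print([round(p, 2) for p in smoothed])
```

The true class receives a target of about 0.91 and every other class about 0.01, so the targets still sum to one while discouraging extreme logits.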

III-E9 Squeeze and Excitation Block

Inspired by [43], we modeled the interdependencies between the channels of convolutional features by adding a Squeeze and Excitation (SE) block after each inception block to improve the efficiency of the feature maps produced by the BBNN model. The inception-SE block is shown in Fig. 4 (a). We used a reduction factor $r = 4$ to create a bottleneck representation and reduce the computation needed. Nonetheless, as shown in Table III, the test accuracy was reduced slightly, by 1%. We argue this occurred due to (1) the increase in the number of parameters with no regularization to tackle overfitting and (2) the lack of depth of our model, implying that the feature maps were not sophisticated enough to emphasize class-specific encodings.
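The three steps of an SE block (squeeze, excitation, scale) can be sketched in plain Python as below. This is a minimal illustration with 4 channels and a reduction factor $r = 4$, so the bottleneck has a single unit; the weights are hypothetical placeholders set to zero rather than trained values, and `se_block` is not a library function.

```python
import math


def se_block(feature_maps, w1, b1, w2, b2):
    """feature_maps: list of C channels, each a flat list of activations.
    w1: (C/r) x C weights, w2: C x (C/r) weights. Returns (scaled, gates)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(fm) / len(fm) for fm in feature_maps]
    # Excitation: bottleneck layer with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z)) + b)
              for row, b in zip(w1, b1)]
    gates = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
             for row, b in zip(w2, b2)]
    # Scale: reweight every channel by its gate.
    scaled = [[g * v for v in fm] for g, fm in zip(gates, feature_maps)]
    return scaled, gates


C, R = 4, 4  # 4 channels, reduction factor r = 4 -> bottleneck of size 1
maps = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0], [5.0, 5.0]]
w1, b1 = [[0.0] * C for _ in range(C // R)], [0.0] * (C // R)  # placeholders
w2, b2 = [[0.0] * (C // R) for _ in range(C)], [0.0] * C       # placeholders

scaled, gates = se_block(maps, w1, b1, w2, b2)
print(gates)  # [0.5, 0.5, 0.5, 0.5] for all-zero weights
```

With trained weights, the gates differ per channel, so informative channels are emphasized and others are attenuated; the two small dense layers are the extra parameters mentioned above.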

III-E10 Recurrent Neural Networks

As recurrent neural networks (RNNs) capture temporal features of the inputs, we embedded a Long Short-Term Memory (LSTM) layer into our inception module to act as a temporal feature extractor. The architecture of the inception module with LSTM is shown in Fig. 4 (b). However, as seen in Table III, the test accuracy dropped to about 82%, implying that the added filters from the RNN lead to an undesirable increase in model complexity with respect to CNNs for MGC [17].

Fig. 5: Inception-Based Models: (a) Xception-Based Model (b) Inception-Resnet-Based Model

III-E11 Inception ResNet v1

We constructed the inception-resnet while keeping space and computational complexity similar to the above-described models [44]. The model is implemented by replacing each inception block in the BBNN model with its corresponding inception-resnet-v1 block, as shown by Fig. 5 (a). Our inception-resnet model used the same stem block as the original inception-resnet in [44] and the same decision block as the BBNN. Also, every convolution in each inception-resnet block was preceded by batch normalization and ReLU activation layers [26]. We observed that the constructed inception-resnet converged at 87.1% test accuracy, which was 2% lower than BBNN, as shown in Table III. We argue this was due to the relatively small size of our model and of GTZAN, whereas inception-resnet-v1 targets very large models aided by activation scaling and dimensionality reduction blocks. Thus, contrary to its claimed ability to generalize to different contexts, our results suggest that inception-resnet does not generalize easily outside the scope of ImageNet, e.g., to the MGC problem.

III-E12 Xception

To avoid using the hand-engineered set of filters in the baseline BBNN model and to better generalize the representation of broadcast convolution filters, we built an xception-based architecture following the principle of decoupling the mapping of cross-channel correlations and spatial correlations [45]; the architecture is shown by Fig. 5 (b). Here, we replaced each inception module with a separable convolution layer to generalize model sparsity [45]. Our xception-based model had about 141k parameters with four separable convolution blocks, making it more compact than the baseline BBNN by 40k parameters. Inspired by [26], each block comprises a separable convolution layer preceded by batch normalization and a ReLU activation layer, with residual connections that preserve the low-level features. We observed that the xception-based model achieved a test accuracy of 89% using about 75% of the number of parameters of the BBNN, as shown by Table III, demonstrating comparable performance under smaller space complexity. Further tuning of the layer configuration is expected to outperform the baseline [10], yet this line of work is out of the scope of this paper.
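The compactness of separable convolutions follows from splitting one convolution into a depthwise (spatial) step and a pointwise (cross-channel) step. The sketch below compares parameter counts for a single layer; the 3x3 kernel, the channel widths of 32, and a depth multiplier of 1 are assumptions chosen for illustration, and both helpers are hypothetical.

```python
def conv2d_params(k, c_in, c_out):
    """Standard k x k convolution with bias."""
    return k * k * c_in * c_out + c_out


def separable_conv2d_params(k, c_in, c_out):
    """Depthwise k x k step (one filter per input channel) followed by a
    pointwise 1x1 step, with a single bias on the output."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise + c_out


# Hypothetical layer: 3x3 kernel, 32 -> 32 channels.
standard = conv2d_params(3, 32, 32)
separable = separable_conv2d_params(3, 32, 32)

print(standard, separable)  # 9248 1344
```

For this hypothetical layer, the separable version needs roughly one seventh of the parameters, which illustrates how a separable design can be more compact; the overall saving in the full model also depends on the layers that are not replaced.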

IV Conclusions and Future Work

We have studied broadcast network architectures for Music Genre Classification (MGC) and proposed a model that improves localization and generalizability under a small number of parameters (about 180k), with potential for practical deployment in embedded devices. Our computational experiments on well-known and challenging benchmark datasets such as GTZAN, Extended Ballroom, HOMBURG, and Free Music Archive (FMA) have shown state-of-the-art classification accuracies. Furthermore, we investigated twelve variants of broadcast networks and discussed the effect of block configuration, pooling method, activation function, normalization mechanism, label smoothing, channel interdependency, LSTM block, and inception scheme.

In future work, we will investigate mechanisms to balance localization and dilation so that the model can capture both high- and low-frequency components. At the same time, we aim to scale our model to fit large datasets such as the FMA dataset. At present, the classification performance on the HOMBURG and FMA datasets is still low, suggesting that broadcast networks still lack the desired generalization abilities. Furthermore, we plan to explore different representations of audio files; although spectrograms are convenient and straightforward to use in MGC, there is potential in other representations such as scalograms and wavelet scattering transforms.

References

[1] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on speech and audio processing, vol. 10, no. 5, pp. 293–302, 2002.
[2] T. L. Li, A. B. Chan, and A. Chun, “Automatic musical pattern feature extraction using convolutional neural network,” Genre, vol. 10, no. 2010, p. 1x1, 2010.
[3] L. Feng, S. Liu, and J. Yao, “Music genre classification with paralleling recurrent convolutional neural network,” arXiv preprint arXiv:1712.08370, 2017.
[4] D. Kostrzewa, P. Kaminski, and R. Brzeski, “Music genre classification: Looking for the perfect network,” in International Conference on Computational Science. Springer, 2021, pp. 55–67.
[5] M. McKinney and J. Breebaart, “Features for audio and music classification,” 2003.
[6] Y. M. Costa, L. S. Oliveira, A. L. Koerich, and F. Gouyon, “Music genre recognition using spectrograms,” in 2011 18th International Conference on Systems, Signals and Image Processing. IEEE, 2011, pp. 1–4.
[7] Y. M. Costa, L. Oliveira, A. L. Koerich, F. Gouyon, and J. G. Martins, “Music genre classification using lbp textural features,” Signal Processing, vol. 92, no. 11, pp. 2723–2737, 2012.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
[9] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[10] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
[14] D. A. Huang, A. A. Serafini, and E. J. Pugh, “Music genre classification.”
[15] L. G. Hafemann, L. S. Oliveira, and P. Cavalin, “Forest species recognition using deep convolutional neural networks,” in 2014 22nd international conference on pattern recognition. IEEE, 2014, pp. 1103–1107.
[16] K. Choi, G. Fazekas, M. Sandler, and K. Cho, “Convolutional recurrent neural networks for music classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2392–2396.
[17] R. M. Pereira, Y. M. Costa, R. L. Aguiar, A. S. Britto, L. E. Oliveira, and C. N. Silla, “Representation learning vs. handcrafted features for music genre classification,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–8.
[18] Y. M. Costa, L. S. Oliveira, and C. N. Silla Jr, “An evaluation of convolutional neural networks for music classification using spectrograms,” Applied soft computing, vol. 52, pp. 28–38, 2017.
[19] L. Nanni, S. Ghidoni, and S. Brahnam, “Handcrafted vs. non-handcrafted features for computer vision classification,” Pattern Recognition, vol. 71, pp. 158–172, 2017.
[20] T. Li, M. Ogihara, and Q. Li, “A comparative study on content-based music genre classification,” in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, 2003, pp. 282–289.
[21] K. Choi, G. Fazekas, M. Sandler, and K. Cho, “Transfer learning for music classification and regression tasks,” arXiv preprint arXiv:1703.09179, 2017.
[22] C. Liu, L. Feng, G. Liu, H. Wang, and S. Liu, “Bottom-up broadcast neural network for music genre classification,” Multimedia Tools and Applications, vol. 80, no. 5, pp. 7313–7331, 2021.
[23] S. Arora, A. Bhaskara, R. Ge, and T. Ma, “Provable bounds for learning some deep representations,” in International conference on machine learning. PMLR, 2014, pp. 584–592.
[24] H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein, “Visualizing the loss landscape of neural nets,” arXiv preprint arXiv:1712.09913, 2017.
[25] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. PMLR, 2015, pp. 448–456.
[26] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European conference on computer vision. Springer, 2016, pp. 630–645.
[27] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “Fma: A dataset for music analysis,” arXiv preprint arXiv:1612.01840, 2016.
[28] B. L. Sturm, “A survey of evaluation in music genre recognition,” in International Workshop on Adaptive Multimedia Retrieval. Springer, 2012, pp. 29–66.
[29] U. Marchand and G. Peeters, “The extended ballroom dataset,” 2016.
[30] H. Homburg, I. Mierswa, B. Möller, K. Morik, and M. Wurst, “A benchmark dataset for audio classification and clustering.” in ISMIR, vol. 2005, 2005, pp. 528–31.
[31] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in python,” in Proceedings of the 14th python in science conference, vol. 8. Citeseer, 2015, pp. 18–25.
[32] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[33] M. Freitag, S. Amiriparian, S. Pugachevskiy, N. Cummins, and B. Schuller, “audeep: Unsupervised learning of representations from audio with deep recurrent neural networks,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6340–6344, 2017.
[34] W. Zhang, W. Lei, X. Xu, and X. Xing, “Improved music genre classification with convolutional neural networks.” in Interspeech, 2016, pp. 3304–3308.
[35] N. Karunakaran and A. Arya, “A scalable hybrid classifier for music genre classification using machine learning concepts and spark,” in 2018 International Conference on Intelligent Autonomous Systems (ICoIAS). IEEE, 2018, pp. 128–135.
[36] Y. Yu, S. Luo, S. Liu, H. Qiao, Y. Liu, and L. Feng, “Deep attention based music genre classification,” Neurocomputing, vol. 372, pp. 84–91, 2020.
[37] L. Nanni, Y. M. Costa, D. R. Lucio, C. N. Silla Jr, and S. Brahnam, “Combining visual and acoustic features for audio classification tasks,” Pattern Recognition Letters, vol. 88, pp. 49–56, 2017.
[38] C. Senac, T. Pellegrini, F. Mouret, and J. Pinquier, “Music feature maps with convolutional neural networks for music genre classification,” in Proceedings of the 15th international workshop on content-based multimedia indexing, 2017, pp. 1–5.
[39] J. Dai, W. Liu, C. Ni, L. Dong, and H. Yang, “‘Multilingual’ deep neural network for music genre classification,” in Sixteenth annual conference of the international speech communication association, 2015.
[40] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[41] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Proceedings of the 31st international conference on neural information processing systems, 2017, pp. 972–981.
[42] Y. Wu and K. He, “Group normalization,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
[43] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
[44] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Thirty-first AAAI conference on artificial intelligence, 2017.
[45] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258.