15-826 Final Report
Dawen Liang,† Haijie Gu,‡ and Brendan O’Connor‡
† School of Music, ‡ Machine Learning Department Carnegie Mellon University
December 3, 2011
The field of Music Information Retrieval (MIR) draws from musicology, signal processing, and artificial intelligence. A long line of work addresses problems such as music understanding (extracting musically meaningful information from audio waveforms) and automatic music annotation (e.g., measuring song and artist similarity). However, very little of this work has scaled to commercially sized datasets. Both the algorithms and the data are complex: an extraordinary range of information is hidden inside music waveforms, from perceptual to auditory, which inevitably makes large-scale applications challenging. There are a number of commercially successful online music services, such as Pandora, Last.fm, and Spotify, but most of them rely merely on traditional text IR. Our course project focuses on large-scale data mining of music information with the recently released Million Song Dataset (Bertin-Mahieux et al., 2011), which consists of
300 GB of audio features and metadata. This dataset was released to push the boundaries of Music IR research to commercial scales. In addition, the associated musiXmatch dataset provides textual lyrics for many of the MSD songs. Combining these two datasets, we propose a cross-modal retrieval framework that joins musical and textual data for the task of genre classification: given N song-genre pairs (S1, G1), . . . , (SN, GN), where Si ∈ F for some feature space F and Gi ∈ G for some genre set G, output the classifier with the highest classification accuracy on a held-out test set. The raw feature space F contains multiple domains of sub-features, which may be of variable length; the genre label set G is discrete.
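To make the formulation concrete, the following is a minimal sketch of this setup: songs summarized as fixed-length feature vectors, a simple nearest-centroid classifier standing in for the learned model, and accuracy measured on a held-out split. The feature extraction and the classifier choice here are illustrative assumptions, not the method used in the project.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one mean feature vector (centroid) per genre label."""
    labels = np.unique(y)
    centroids = np.stack([X[y == g].mean(axis=0) for g in labels])
    return labels, centroids

def nearest_centroid_predict(labels, centroids, X):
    """Assign each song to the genre whose centroid is closest."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return labels[np.argmin(dists, axis=1)]

def holdout_accuracy(X, y, n_test):
    """Train on the first N - n_test songs; report accuracy on the rest."""
    Xtr, ytr = X[:-n_test], y[:-n_test]
    Xte, yte = X[-n_test:], y[-n_test:]
    labels, cents = nearest_centroid_fit(Xtr, ytr)
    preds = nearest_centroid_predict(labels, cents, Xte)
    return float(np.mean(preds == yte))
```

Any classifier with a fit/predict interface could be dropped into this skeleton; the hold-out accuracy is the quantity the problem statement asks to maximize.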
Genre classification is a standard problem in Music IR research. Most music genre classification techniques employ pattern recognition algorithms to classify feature vectors, extracted from short-time recording segments, into genres. Commonly used classifiers include Support Vector Machines (SVMs), Nearest-Neighbor (NN) classifiers, Gaussian Mixture Models, and Linear Discriminant Analysis (LDA). Several common audio datasets have been used in experiments to make the reported classification accuracies comparable; the GTZAN dataset (Tzanetakis and Cook, 2002) is the most widely used for music genre classification. However, the datasets involved in those studies are very small compared to the Million Song Dataset. In fact, most Music IR research still focuses on very small datasets, such as GTZAN (Tzanetakis and Cook, 2002), with only 1000 audio tracks of 30 seconds each, or CAL-500 (Turnbull et al., 2008), a set of 1700 human-generated musical annotations describing 500 popular Western musical tracks. Both datasets are widely used in state-of-the-art Music IR research, but are far from practical application scale. Furthermore, most research on genre classification focuses only on music features, ignoring lyrics (mostly due to the difficulty of collecting large-scale lyric data).
Nevertheless, besides musical features (styles, forms), genre is also closely related to lyrics: songs in different genres may involve different topics or moods, which could be recoverable from word frequencies in the lyrics. This motivates us to join the musical and lyrical information from the two datasets for this task.
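One simple way to realize this joining, sketched below under the assumption that each song's lyrics are available as word counts (the form the musiXmatch dataset provides) and its audio as a fixed-length vector, is to normalize the counts into word frequencies and concatenate the two modalities into a single feature vector. The vocabulary and function names here are illustrative.

```python
import numpy as np

def bag_of_words(counts, vocab):
    """Turn a {word: count} lyric dictionary into a fixed-length vector
    over a shared vocabulary, normalized to word frequencies."""
    vec = np.array([float(counts.get(w, 0)) for w in vocab])
    total = vec.sum()
    return vec / total if total > 0 else vec

def join_modalities(audio_vec, lyric_counts, vocab):
    """Concatenate audio features with lyric word frequencies into one
    cross-modal feature vector suitable for a standard classifier."""
    return np.concatenate([audio_vec, bag_of_words(lyric_counts, vocab)])
```

Concatenation is only the most basic fusion strategy, but it lets any off-the-shelf classifier see both modalities at once.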
To the best of our knowledge, no published work performs large-scale genre classification using cross-modal methods. • We propose a cross-modal retrieval framework of model...