Music Genre Classification with the Million Song Dataset
15-826 Final Report
Dawen Liang,† Haijie Gu,‡ and Brendan O’Connor‡
† School of Music, ‡ Machine Learning Department, Carnegie Mellon University
December 3, 2011
The field of Music Information Retrieval (MIR) draws from musicology, signal processing, and artificial intelligence. A long line of work addresses problems such as music understanding (extracting musically meaningful information from audio waveforms) and automatic music annotation (e.g., measuring song and artist similarity). However, very little of this work has scaled to commercially sized datasets: both the algorithms and the data are complex, and an extraordinary range of information, from the perceptual to the auditory, is hidden inside music waveforms, which inevitably makes large-scale applications challenging. A number of commercially successful online music services exist, such as Pandora, Last.fm, and Spotify, but most of them rely on traditional text IR. Our course project focuses on large-scale data mining of music information with the recently released Million Song Dataset (Bertin-Mahieux et al., 2011), which consists of
300GB of audio features and metadata. This dataset was released to push the boundaries of Music IR research to commercial scales. The associated musiXmatch dataset also provides textual lyrics information for many of the MSD songs. Combining these two datasets, we propose a cross-modal retrieval framework that joins the musical and textual data for the task of genre classification: given N song–genre pairs (S1, G1), . . . , (SN, GN), where Si ∈ F for some feature space F and Gi ∈ G for some genre set G, output the classifier with the highest classification accuracy on a hold-out test set. The raw feature space F contains multiple domains of sub-features, which can be of variable length; the genre label set G is discrete.
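The task statement above can be made concrete with a small sketch. Everything below is illustrative: the synthetic feature vectors, the toy genre set, and the centroid classifier are placeholders for exposition, not the features or models used in this project.

```python
import random

GENRES = ["rock", "jazz", "hip-hop"]  # a toy genre set G

def make_song(genre, rng):
    """Synthetic stand-in for a song's feature vector S_i in F."""
    center = {"rock": 0.0, "jazz": 5.0, "hip-hop": 10.0}[genre]
    return [rng.gauss(center, 1.0) for _ in range(4)]

def accuracy(classify, pairs):
    """Fraction of (S_i, G_i) pairs the classifier labels correctly."""
    return sum(classify(s) == g for s, g in pairs) / len(pairs)

rng = random.Random(0)
pairs = [(make_song(g, rng), g) for g in GENRES for _ in range(50)]
rng.shuffle(pairs)
train, test = pairs[:100], pairs[100:]  # hold-out split

# A trivial nearest-centroid classifier as a placeholder model.
centroids = {}
for g in GENRES:
    vecs = [s for s, gg in train if gg == g]
    centroids[g] = [sum(col) / len(vecs) for col in zip(*vecs)]

def classify(s):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(GENRES, key=lambda g: dist(s, centroids[g]))

print(round(accuracy(classify, test), 2))
```

The point of the sketch is only the interface: any model that maps a feature vector in F to a label in G can be plugged in and scored on the held-out pairs.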
Genre classification is a standard problem in Music IR research. Most genre classification techniques apply pattern recognition algorithms to classify feature vectors, extracted from short-time recording segments, into genres. Commonly used classifiers include Support Vector Machines (SVMs), nearest-neighbor (NN) classifiers, Gaussian Mixture Models, and Linear Discriminant Analysis (LDA). Several common audio datasets have been used in experiments to make the reported classification accuracies comparable, but those datasets are very small compared to the Million Song Dataset. In fact, most Music IR research still focuses on very small datasets, such as the GTZAN dataset (Tzanetakis and Cook, 2002), the most widely used dataset for genre classification, with only 1000 audio tracks, each 30 seconds long; or CAL-500 (Turnbull et al., 2008), a set of 1700 human-generated musical annotations describing 500 popular Western musical tracks. Both datasets are widely used in state-of-the-art Music IR research, but are far from practical application scale. Furthermore, most research on genre classification considers only audio features and ignores lyrics, largely because of the difficulty of collecting large-scale lyric data.
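As an illustration of this segment-level scheme, the following sketch labels each short-time segment with a 1-nearest-neighbor rule and takes a majority vote over a song's segments. The data and feature dimensions are invented for the example; a real system would use audio descriptors such as MFCCs per segment.

```python
import random
from collections import Counter

def synth_segment(genre, rng):
    """Synthetic stand-in for one short-time segment's feature vector."""
    base = {"blues": -2.0, "metal": 2.0}[genre]
    return [rng.gauss(base, 1.0) for _ in range(3)]

def nn_label(seg, labeled_segments):
    """1-NN: label of the closest training segment (squared Euclidean)."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(labeled_segments, key=lambda item: d(seg, item[0]))[1]

rng = random.Random(1)
seg_train = [(synth_segment(g, rng), g)
             for g in ("blues", "metal") for _ in range(40)]

def classify_song(segments):
    """Classify each segment, then majority-vote a song-level genre."""
    votes = Counter(nn_label(s, seg_train) for s in segments)
    return votes.most_common(1)[0][0]

song = [synth_segment("metal", rng) for _ in range(10)]
print(classify_song(song))
```

The vote step matters: individual segments are noisy, but aggregating per-segment decisions over a whole track is the standard way these classifiers produce one genre per song.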
Nevertheless, besides musical features (styles, forms), genre is also closely related to lyrics: songs in different genres may involve different topics or moods, which could be recoverable from word frequencies in lyrics. This motivates us to join the musical and lyrical information from the two datasets for this task.
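The lyrics intuition can be sketched with a tiny multinomial naive Bayes classifier over word counts. The toy "lyrics" and genres below are invented for illustration; the musiXmatch dataset supplies real bag-of-words counts per track.

```python
import math
from collections import Counter

train_lyrics = [
    ("love heart tonight baby love", "pop"),
    ("baby dance tonight love", "pop"),
    ("road whiskey truck hometown", "country"),
    ("truck road dirt hometown whiskey", "country"),
]

def train_nb(docs):
    """Per-genre word counts and genre priors from (text, genre) pairs."""
    counts, priors = {}, Counter(g for _, g in docs)
    for text, g in docs:
        counts.setdefault(g, Counter()).update(text.split())
    return counts, priors

def classify_lyrics(text, counts, priors):
    """Pick the genre maximizing the (log) posterior of the word counts."""
    vocab = {w for c in counts.values() for w in c}
    def log_post(g):
        c, total = counts[g], sum(counts[g].values())
        lp = math.log(priors[g])
        for w in text.split():
            lp += math.log((c[w] + 1) / (total + len(vocab)))  # Laplace smoothing
        return lp
    return max(counts, key=log_post)

counts, priors = train_nb(train_lyrics)
print(classify_lyrics("whiskey and a truck on the road", counts, priors))
```

Even this crude model captures the claim above: genre-specific topics leave a footprint in word frequencies that a classifier can exploit, independently of the audio.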
To the best of our knowledge, there has been no published work that performs large-scale genre classification using cross-modal methods.

• We proposed a cross-modal retrieval framework of model...
References

Leonard E. Baum, Ted Petrie, George Soules, and Norman Weiss. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The Annals of Mathematical Statistics, 41(1):164–171, 1970. URL http://www.jstor.org/stable/2239727.

Robert M. Bell, Yehuda Koren, and Chris Volinsky. The BellKor solution to the Netflix Prize, 2007. URL http://www2.research.att.com/~volinsky/netflix/ProgressPrize2007BellKorSolution.pdf.

Robert M. Bell, Yehuda Koren, and Chris Volinsky. The BellKor 2008 solution to the Netflix Prize, 2008. URL http://www2.research.att.com/~volinsky/netflix/Bellkor2008.pdf.

Thierry Bertin-Mahieux, Daniel P. W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Conference on Music Information Retrieval (ISMIR 2011), 2011.

Byron Boots and Geoffrey J. Gordon. An online spectral learning algorithm for partially observable nonlinear dynamical systems. In AAAI, 2011.

M. M. Bradley and P. J. Lang. Affective norms for English words (ANEW): Instruction manual and affective ratings. University of Florida: The Center for Research in Psychophysiology, 1999.

P. S. Dodds and C. M. Danforth. Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies, 2009.

J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting (with discussion and a rejoinder by the authors). The Annals of Statistics, 28(2):337–407, 2000.

Daniel Hsu, Sham M. Kakade, and Tong Zhang. A spectral algorithm for learning hidden Markov models. CoRR, abs/0811.4413, 2008.

Herbert Jaeger. Observable operator models for discrete stochastic time series. Neural Computation, 12(6):1371–1398, 2000.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 1st edition, July 2008. ISBN 0521865719.

M. McVicar, T. Freeman, and T. De Bie. Mining the correlation between lyrical and audio features and the emergence of mood. In Proceedings of the 12th International Conference on Music Information Retrieval, 2011.

M. Müller. Information Retrieval for Music and Motion. Springer, 2007.

Bo Pang and Lillian Lee. Opinion Mining and Sentiment Analysis. Now Publishers Inc, July 2008. ISBN 1601981503.

N. Rasiwasia, J. C. Pereira, E. Coviello, G. Doyle, G. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In Proceedings of the International Conference on Multimedia, 2010.

L. Ren, D. Dunson, S. Lindroth, and L. Carin. Dynamic nonparametric Bayesian models for analysis of music. Journal of the American Statistical Association, 105(490):458–472, 2010.

Greg Ridgeway. Generalized boosted models: A guide to the gbm package, 2007. URL http://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf.

Matthew Rosencrantz, Geoff Gordon, and Sebastian Thrun. Learning low dimensional predictive representations. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), pages 88–, New York, NY, USA, 2004. ACM. ISBN 1-58113-838-5. URL http://doi.acm.org/10.1145/1015330.1015441.

Sajid Siddiqi, Byron Boots, and Geoffrey J. Gordon. Reduced-rank hidden Markov models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.

Satinder Singh and Michael R. James. Predictive state representations: A new theory for modeling dynamical systems. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI), pages 512–519. AUAI Press, 2004.

Yla R. Tausczik and James W. Pennebaker. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 2009. URL http://jls.sagepub.com/cgi/rapidpdf/0261927X09351676v1.

Douglas Turnbull, Luke Barrington, David Torres, and Gert Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech and Language Processing, 16(2):467–476, February 2008.

G. Tzanetakis and P. Cook. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing, 10(5), July 2002.