Abstract
Over the decades, music labels have shaped easily identifiable genres to improve the recognition value, and thus the market sales, of new music acts. With print magazines and later music television serving as important distribution channels, visual representation has played and still plays a significant role in music marketing. Visual stereotypes have developed over the decades that enable us to identify the referenced music by sight alone, without listening. Despite the richness of music-related visual information provided by music videos and album covers, as well as T-shirts, advertisements, and magazines, research on harnessing this information to advance existing problems of music retrieval or recommendation, or to approach new ones, is scarce or missing. In this article, we present our research on visual music computing, which aims to extract stereotypical music-related visual information from music videos. To provide comprehensive and reproducible results, we present the Music Video Dataset, a thoroughly assembled suite of datasets with dedicated evaluation tasks aligned with current Music Information Retrieval tasks. Based on this dataset, we evaluate conventional low-level image-processing and affect-related features to give an overview of the expressiveness of fundamental visual properties such as color, illumination, and contrast. Further, we introduce a high-level approach based on visual concept detection to capture visual stereotypes. This approach decomposes the semantic content of music video frames into concrete concepts, such as vehicles or tools, defined in a wide visual vocabulary. Concepts are detected using convolutional neural networks, and their frequency distributions serve as semantic descriptions of a music video. Evaluations showed that these descriptions perform well in predicting the music genre of a video and even outperform audio-content descriptors on cross-genre thematic tags. Further, highly significant performance improvements were observed when augmenting audio-based approaches with the introduced visual approach.
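The aggregation step described in the abstract — per-frame concept detections summarized into a frequency distribution that serves as the video-level descriptor — can be illustrated with a minimal, hypothetical sketch. The vocabulary, frame labels, and function name below are illustrative assumptions, not taken from the article; in practice the per-frame labels would come from a CNN classifier over a much larger visual vocabulary.

```python
from collections import Counter

def concept_histogram(frame_concepts, vocabulary):
    """Return the relative frequency of each vocabulary concept
    over all frames of one music video (the video-level descriptor)."""
    counts = Counter(frame_concepts)
    total = len(frame_concepts) or 1  # guard against empty input
    return [counts[c] / total for c in vocabulary]

# Hypothetical example: top-1 concept predicted per sampled frame.
vocabulary = ["vehicle", "tool", "person", "landscape"]
frames = ["person", "person", "vehicle", "person", "landscape"]

descriptor = concept_histogram(frames, vocabulary)
print(descriptor)  # [0.2, 0.0, 0.6, 0.2]
```

The resulting fixed-length vector can then be fed to any conventional classifier (e.g., for genre prediction), independent of the video's length.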