
Harnessing Music-Related Visual Stereotypes for Music Information Retrieval

Published: 25 October 2016

Abstract

Over decades, music labels have shaped easily identifiable genres to improve recognition value and, subsequently, market sales of new music acts. With print magazines and later music television serving as important distribution channels, visual representation has played, and still plays, a significant role in music marketing. Visual stereotypes have developed over decades that enable us to quickly identify the referenced music by sight alone, without listening. Despite the richness of music-related visual information provided by music videos, album covers, T-shirts, advertisements, and magazines, research on harnessing this information to advance existing or approach new problems of music retrieval or recommendation is scarce or missing. In this article, we present our research on visual music computing, which aims to extract stereotypical music-related visual information from music videos. To provide comprehensive and reproducible results, we present the Music Video Dataset, a thoroughly assembled suite of datasets with dedicated evaluation tasks aligned with current Music Information Retrieval tasks. Based on this dataset, we evaluate conventional low-level image processing and affect-related features to provide an overview of the expressiveness of fundamental visual properties such as color, illumination, and contrast. Further, we introduce a high-level approach based on visual concept detection to capture visual stereotypes. This approach decomposes the semantic content of music video frames into concrete concepts, such as vehicles and tools, defined in a wide visual vocabulary. Concepts are detected using convolutional neural networks, and their frequency distributions serve as semantic descriptions of a music video. Evaluations show that these descriptions perform well in predicting the music genre of a video and even outperform audio-content descriptors on cross-genre thematic tags. Further, highly significant performance improvements were observed when audio-based approaches were augmented with the introduced visual approach.
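To make the concept-frequency idea concrete, the following is a minimal Python sketch. It assumes a ResNet-50 pretrained on ImageNet as a stand-in concept detector and pre-extracted video frames on disk; the frame sampling rate, the vocabulary (the 1,000 ImageNet classes), the top-5 threshold, and all paths are illustrative assumptions rather than the configuration used in the article.

    # Sketch: video-level "concept frequency" descriptor from sampled frames.
    # Assumptions (not the article's exact setup): a torchvision ResNet-50
    # pretrained on ImageNet stands in for the concept detector, frames are
    # pre-extracted JPEGs (e.g., one per second via ffmpeg), and the top-5
    # classes per frame are counted into a normalized histogram that serves
    # as the video descriptor.
    from pathlib import Path

    import torch
    from PIL import Image
    from torchvision import models


    def concept_histogram(frame_dir: str, top_k: int = 5) -> torch.Tensor:
        weights = models.ResNet50_Weights.DEFAULT        # ImageNet-1k concepts
        model = models.resnet50(weights=weights).eval()
        preprocess = weights.transforms()                # resize/crop/normalize

        counts = torch.zeros(1000)                       # one bin per concept
        frames = sorted(Path(frame_dir).glob("*.jpg"))
        with torch.no_grad():
            for frame_path in frames:
                img = preprocess(Image.open(frame_path).convert("RGB"))
                logits = model(img.unsqueeze(0))          # shape (1, 1000)
                top = logits.topk(top_k, dim=1).indices[0]  # top-k concept ids
                counts[top] += 1.0

        # Relative concept frequencies act as the semantic video descriptor.
        return counts / counts.sum().clamp(min=1.0)


    if __name__ == "__main__":
        # Hypothetical frame directory for one video.
        descriptor = concept_histogram("frames/some_music_video")
        print(descriptor.shape, descriptor.sum())

The resulting histogram can then be passed to any standard classifier for genre or thematic tag prediction, and concatenated with audio descriptors to realize the kind of audio-visual augmentation described above.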




• Published in

  ACM Transactions on Intelligent Systems and Technology, Volume 8, Issue 2
  Survey Paper, Special Issue: Intelligent Music Systems and Applications and Regular Papers
  March 2017
  407 pages
  ISSN: 2157-6904
  EISSN: 2157-6912
  DOI: 10.1145/3004291
  • Editor: Yu Zheng

              Copyright © 2016 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 25 October 2016
              • Accepted: 1 April 2016
              • Revised: 1 February 2016
              • Received: 1 October 2015
Published in TIST Volume 8, Issue 2


              Qualifiers

              • research-article
              • Research
              • Refereed
