DOI: 10.1145/1498759.1498809
research-article

Clustering the tagged web

Published: 09 February 2009

ABSTRACT

Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
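As a rough illustration of the first approach, the sketch below builds an extended vector for each page by normalizing word counts and tag counts separately and then concatenating them, clusters the result with a small cosine-based K-means, and scores the clusters against gold labels with a pairwise F-score. It is a minimal sketch under stated assumptions, not the authors' implementation: the `tag_weight` parameter, the deterministic seeding, the toy pages, and the pairwise variant of the F-score are all hypothetical choices made here for illustration.

```python
import math
from collections import Counter
from itertools import combinations

def l2_normalize(counts):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}

def combined_vector(words, tags, tag_weight=0.5):
    """Concatenate separately normalized word and tag vectors.

    tag_weight (an assumption of this sketch) is the share of the
    squared vector length given to the tag dimensions.
    """
    w = l2_normalize(Counter(words))
    t = l2_normalize(Counter(tags))
    vec = {("w", k): math.sqrt(1 - tag_weight) * v for k, v in w.items()}
    vec.update({("t", k): math.sqrt(tag_weight) * v for k, v in t.items()})
    return vec

def cosine(a, b):
    """Dot product of sparse vectors; equals cosine for unit vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def centroid(vectors):
    """Unit-length mean direction of a group of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for k, x in vec.items():
            total[k] += x
    return l2_normalize(total)

def kmeans(vectors, k, iterations=10):
    """Cosine K-means, seeded with the first k vectors (a simplification)."""
    centroids = [dict(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment

def pairwise_f1(predicted, gold):
    """F1 over page pairs: a pair is positive if both pages share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        tp += same_pred and same_gold
        fp += same_pred and not same_gold
        fn += same_gold and not same_pred
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy pages (hypothetical): (page words, user tags)
pages = [
    (["python", "code", "tutorial"], ["programming", "python"]),
    (["pasta", "recipe", "sauce"], ["cooking", "food"]),
    (["java", "code", "compiler"], ["programming", "java"]),
    (["bread", "recipe", "oven"], ["cooking", "baking"]),
]
gold = ["computers", "food", "computers", "food"]  # stand-in for directory labels

vectors = [combined_vector(words, tags) for words, tags in pages]
labels = kmeans(vectors, k=2)
score = pairwise_f1(labels, gold)
```

Normalizing the two parts separately keeps a long page text from drowning out a handful of tags, which is one plausible reading of a "more principled inclusion" of tagging data. Simply appending raw tag counts to the word counts would correspond to the naive variant.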


Published in

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
February 2009
314 pages
ISBN: 9781605583907
DOI: 10.1145/1498759

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 498 of 2,863 submissions, 17%
