ABSTRACT
Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
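As a rough illustration of the first approach, the sketch below clusters a handful of pages with K-means in a vector space that holds both page words and tags. It is a minimal sketch, not the paper's implementation: the toy pages, the function names, the `tag_weight` parameter, the choice to L2-normalize the word and tag parts separately before concatenating them, and the seeding with the first k vectors are all assumptions made for the example.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def build_vectors(docs, tag_weight=1.0):
    """docs is a list of (words, tags) pairs; returns one combined vector per page.

    Assumption: the word part and the tag part are normalized separately and
    then concatenated, so neither source dominates purely by volume.
    """
    word_vocab = sorted({w for words, _ in docs for w in words})
    tag_vocab = sorted({t for _, tags in docs for t in tags})
    vectors = []
    for words, tags in docs:
        word_counts, tag_counts = Counter(words), Counter(tags)
        word_part = l2_normalize([word_counts[w] for w in word_vocab])
        tag_part = l2_normalize([tag_counts[t] for t in tag_vocab])
        vectors.append(word_part + [tag_weight * x for x in tag_part])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k vectors for determinism."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each page goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical toy pages: (page words, user tags).
docs = [
    (["python", "code", "tutorial"], ["programming", "python"]),
    (["recipe", "pasta", "sauce"], ["cooking", "food"]),
    (["python", "library", "code"], ["programming"]),
    (["pasta", "dinner", "recipe"], ["food", "cooking"]),
]
labels = kmeans(build_vectors(docs), k=2)
print(labels)  # [0, 1, 0, 1]
```

Setting `tag_weight` to 0 reduces the sketch to clustering on page text alone, which gives a simple way to compare the two feature sets on the same pages.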