Clustering the tagged web

Authors:
Daniel Ramage (Serra Mall, Stanford, CA)
Paul Heymann (Serra Mall, Stanford, CA)
Christopher D. Manning (Serra Mall, Stanford, CA)
Hector Garcia-Molina (Serra Mall, Stanford, CA)

WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining, February 2009, Pages 54–63
https://doi.org/10.1145/1498759.1498809

Published: 9 February 2009

ABSTRACT

Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. This paper explores the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
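
The first approach in the abstract, K-means over a vector space that holds both page text and tags, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy pages, the choice to L2-normalize the word and tag counts separately before joining them, the fixed seed pages, and the pairwise F1 used to compare clusters with directory labels (the abstract does not say which F-score variant is used).

```python
import math
from collections import Counter
from itertools import combinations

def l2_normalize(counts):
    """Scale a sparse count vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()} if norm else {}

def combined_vector(words, tags):
    """Normalize word and tag counts separately, then join them in one space.
    (Assumed weighting scheme; one of several ways to include tags.)"""
    vec = {("word", k): v for k, v in l2_normalize(Counter(words)).items()}
    vec.update({("tag", k): v for k, v in l2_normalize(Counter(tags)).items()})
    return vec

def dot(a, b):
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return dot(a, a) + dot(b, b) - 2.0 * dot(a, b)

def kmeans(vectors, seeds, iterations=10):
    """Plain K-means; `seeds` are indices of the pages used as initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each page goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if not members:
                continue  # keep the old centroid if a cluster is empty
            total = Counter()
            for m in members:
                for k, val in m.items():
                    total[k] += val
            centroids[c] = {k: val / len(members) for k, val in total.items()}
    return labels

def pairwise_f1(predicted, gold):
    """F1 over page pairs: a pair is positive when both pages share a label."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy data: (page words, user tags) plus made-up directory labels.
pages = [
    (["python", "code", "tutorial"], ["programming", "python"]),
    (["java", "code", "compiler"], ["programming"]),
    (["recipe", "pasta", "sauce"], ["cooking", "food"]),
    (["bread", "recipe", "oven"], ["cooking"]),
]
gold = ["computers", "computers", "home", "home"]

vectors = [combined_vector(words, tags) for words, tags in pages]
labels = kmeans(vectors, seeds=[0, 2])
print(labels, pairwise_f1(labels, gold))
```

On this toy input the two programming pages and the two cooking pages end up in separate clusters. The sketch says nothing about the generative LDA-based model, which is not covered here.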

Published in
WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining
February 2009
314 pages
ISBN: 9781605583907
DOI: 10.1145/1498759
Editors:
Ricardo Baeza-Yates (Yahoo! Research, Spain)
Paolo Boldi (Universita degli Studi di Milano, Italy)
Berthier Ribeiro-Neto (Google Engineering, Brazil & CS Dept., Univ. Fed. de Minas Gerais, Brazil)
B. Barla Cambazoglu (Yahoo! Research)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher:
Association for Computing Machinery, New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 498 of 2,863 submissions, 17%
Article Metrics
- Total citations: 172
- Total downloads: 1,957
- Downloads (last 12 months): 10
- Downloads (last 6 weeks): 1