DOI: 10.1145/1286240.1286245

User-assisted similarity estimation for searching related web pages

Published: 10 September 2007

Abstract

To exploit the similarity information hidden in the Web graph, we investigate the problem of adaptively retrieving related Web pages with user assistance. Given a definition of similarity between pages, it is intuitive to expect that similarity propagates from page to page, inducing an implicit topical relatedness between pages. In this paper, we extract connected subgraphs from the graph consisting of all pairs of pages whose similarity scores are above a given threshold, and then sort the candidate related pages by a novel rank measure based on the combination distances of a flexible hierarchical clustering. Moreover, because similarity values are subjective, we dynamically supply the ordered list of related pages according to a parameter adjusted by users. We show that our approach effectively handles a set of pages originating from three related categories of a Web hierarchy such as Google Directory. Experiments with three similarity measures demonstrate that using in-link information is favorable, while using a combined measure of in-links and out-links lowers the precision of identifying similar pages.
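
The thresholding and connected-subgraph step described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names `cocitation_similarity` and `related_component`, the toy in-link data, and the use of Jaccard overlap of in-link sets as the similarity measure. The sketch shows only how similarity can propagate through a thresholded graph and how a user-adjusted threshold changes the candidate set; it does not implement the paper's rank measure based on hierarchical clustering.

```python
from collections import defaultdict, deque


def cocitation_similarity(in_links):
    """Score each pair of pages by the Jaccard overlap of their in-link sets.

    This is an assumed, illustrative measure, not the paper's definition.
    """
    pages = sorted(in_links)
    scores = {}
    for i, p in enumerate(pages):
        for q in pages[i + 1:]:
            union = in_links[p] | in_links[q]
            if union:
                scores[(p, q)] = len(in_links[p] & in_links[q]) / len(union)
    return scores


def related_component(scores, query, threshold):
    """Return the pages connected to `query` through pairs scoring above `threshold`."""
    # Keep only the pairs whose similarity exceeds the threshold.
    adjacency = defaultdict(set)
    for (p, q), score in scores.items():
        if score > threshold:
            adjacency[p].add(q)
            adjacency[q].add(p)

    # Breadth-first search collects the connected subgraph around the query.
    seen = {query}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(query)
    return sorted(seen)


# Hypothetical toy data: each page maps to the set of pages linking to it.
in_links = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"y", "z", "w"},
    "d": {"v"},
}
scores = cocitation_similarity(in_links)

# A looser threshold lets similarity propagate from b to c by way of a.
loose = related_component(scores, "b", 0.4)
# A stricter threshold keeps only the direct neighbour of b.
strict = related_component(scores, "b", 0.6)
```

In this toy example, page `c` is not directly similar enough to `b` at the looser threshold, yet it still appears as a candidate because both are similar to `a`. Raising the threshold removes it, which mirrors the role of the user-adjusted parameter.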

Published In

HT '07: Proceedings of the eighteenth conference on Hypertext and hypermedia
September 2007
240 pages
ISBN: 9781595938206
DOI: 10.1145/1286240

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. graph partitioning
  3. similarity search

Qualifiers

  • Article

Conference

HT07
HT07: 18th Conference on Hypertext and Hypermedia
September 10 - 12, 2007
Manchester, UK

Acceptance Rates

Overall Acceptance Rate 378 of 1,158 submissions, 33%
