Article

User-assisted similarity estimation for searching related web pages

Authors:

Kulwadee Somboonviwat,

Masaru Kitsuregawa

HT '07: Proceedings of the eighteenth conference on Hypertext and hypermedia

Pages 11 - 20

https://doi.org/10.1145/1286240.1286245

Published: 10 September 2007

Abstract

To exploit the similarity information hidden in the Web graph, we investigate the problem of adaptively retrieving related Web pages with user assistance. Given a definition of similarity between pages, it is intuitive to expect that similarity propagates from page to page, inducing an implicit topical relatedness between pages. In this paper, we build a graph consisting of all pairs of pages whose similarity scores are above a given threshold, extract its connected subgraphs, and then sort the candidate related pages by a novel rank measure based on the combination distances of a flexible hierarchical clustering. Moreover, because similarity values are subjective, we dynamically generate the ordered list of related pages according to a parameter adjusted by the user. We show that our approach effectively handles a set of pages originating from three related categories of Web hierarchies such as Google Directory. Experiments with three similarity measures demonstrate that using in-link information is favorable, while using a combined measure of in-links and out-links lowers the precision of identifying similar pages.
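The thresholding and connected-subgraph steps described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the Jaccard overlap of in-link sets stands in for an in-link similarity measure, the names `inlink_jaccard`, `threshold_graph`, `component_of`, and `related_pages` are hypothetical, and candidates are ranked by their direct similarity to the query page rather than by the paper's clustering-based rank measure, which the abstract does not specify.

```python
from itertools import combinations


def inlink_jaccard(inlinks, a, b):
    """Jaccard overlap of the in-link sets of pages a and b (assumed measure)."""
    set_a, set_b = inlinks.get(a, set()), inlinks.get(b, set())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def threshold_graph(pages, inlinks, threshold):
    """Undirected graph linking every pair whose similarity exceeds the threshold."""
    adj = {p: set() for p in pages}
    for a, b in combinations(pages, 2):
        if inlink_jaccard(inlinks, a, b) > threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def component_of(adj, start):
    """Connected subgraph containing `start`, found by depth-first search."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def related_pages(query, pages, inlinks, threshold):
    """Candidates from the query's component, ordered by a stand-in rank.

    The stand-in rank is direct similarity to the query (ties broken by name);
    it is NOT the rank measure proposed in the paper.
    """
    adj = threshold_graph(pages, inlinks, threshold)
    candidates = component_of(adj, query) - {query}
    return sorted(candidates, key=lambda p: (-inlink_jaccard(inlinks, query, p), p))


# Toy data (made up): each page maps to the set of pages that link to it.
inlinks = {
    "a": {"x", "y", "z"},
    "b": {"x", "y"},
    "c": {"y", "w"},
    "d": {"w", "v"},
    "e": {"u"},
}
pages = ["a", "b", "c", "d", "e"]

loose = related_pages("a", pages, inlinks, threshold=0.3)
strict = related_pages("a", pages, inlinks, threshold=0.5)
print(loose, strict)
```

In this toy data, page "d" shares no in-links with "a" but still lands in the same component at the lower threshold through the chain a, b, c, d, which is the kind of page-to-page propagation the abstract describes. Raising the threshold shrinks the candidate list, mimicking a user-adjusted parameter.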

Index Terms

User-assisted similarity estimation for searching related web pages
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Graph algorithms

Published In

HT '07: Proceedings of the eighteenth conference on Hypertext and hypermedia

September 2007

240 pages

ISBN:9781595938206

DOI:10.1145/1286240

General Chair:
Simon Harper, University of Manchester, UK

Program Chairs:
Helen Ashman, The University of South Australia, Australia
Mark Bernstein, Eastgate Systems, USA
Alexandra Cristea, The University of Warwick, UK
Hugh C. Davis, University of Southampton, UK
Paul De Bra, Eindhoven University of Technology, The Netherlands
Vicki Hanson, IBM T.J. Watson Research Center, USA
Dave Millard, University of Southampton, UK

Copyright © 2007 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

HT07: 18th Conference on Hypertext and Hypermedia

September 10 - 12, 2007

Manchester, UK

Acceptance Rates

Overall Acceptance Rate 378 of 1,158 submissions, 33%
