DOI: 10.1145/2187980.2188109
poster

A statistical approach to URL-based web page clustering

Published: 16 April 2012

ABSTRACT

Most web page classifiers use features extracted from page content, so a page has to be downloaded before it can be classified. We propose a technique that clusters web pages using their URLs exclusively. In contrast to other proposals, we analyze features that are external to the page; hence, we do not need to download a page to classify it. The technique is also unsupervised and requires little intervention from the user. Furthermore, we do not need to crawl a site extensively to build a classifier for that site; a small subset of its pages suffices. We evaluated our classifier in an experiment over 21 highly visited websites and obtained good precision and recall.
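
The abstract does not spell out the statistical model, so the following is only a minimal sketch of the general idea of grouping pages from their URLs alone, without downloading anything. It is not the authors' method: the function names `url_signature` and `cluster_urls`, the digit-based wildcard heuristic, and the `example.org` URLs are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def url_signature(url):
    """Reduce a URL to a coarse pattern (hypothetical heuristic).

    Path tokens that contain a digit are assumed to be variable parts
    (ids, dates) and are replaced by a wildcard; other tokens are kept.
    """
    parts = urlsplit(url)
    tokens = [t for t in parts.path.split("/") if t]
    generalized = tuple(
        "*" if any(c.isdigit() for c in t) else t for t in tokens
    )
    # Keep query parameter names only; their values are ignored.
    params = tuple(
        sorted(p.split("=", 1)[0] for p in parts.query.split("&") if p)
    )
    return (parts.netloc.lower(), generalized, params)


def cluster_urls(urls):
    """Group URLs that share a signature; no page content is fetched."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_signature(url)].append(url)
    return dict(clusters)


# Illustrative input only; these URLs are made up.
urls = [
    "http://example.org/product/123",
    "http://example.org/product/456",
    "http://example.org/author/jane-doe",
    "http://example.org/search?q=web",
]
clusters = cluster_urls(urls)
for signature, members in clusters.items():
    print(signature, len(members))
```

In this sketch the two product URLs fall into one cluster because their numeric identifiers collapse to the same wildcard, while the author and search URLs each form their own cluster. A statistical approach such as the one proposed would presumably learn which tokens are variable from a sample of the site's URLs instead of relying on a fixed digit rule.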

Published in

WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web
April 2012
1250 pages
ISBN: 9781450312301
DOI: 10.1145/2187980

Copyright © 2012 Authors

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
