poster

A statistical approach to URL-based web page clustering

Authors:
Inma Hernández, Carlos R. Rivero, David Ruiz, Rafael Corchuelo

University of Seville, Seville, Spain

WWW '12 Companion: Proceedings of the 21st International Conference on World Wide Web, April 2012, Pages 525–526

https://doi.org/10.1145/2187980.2188109

Published: 16 April 2012

ABSTRACT

Most web page classifiers use features from the page content, which means that a page has to be downloaded before it can be classified. We propose a technique to cluster web pages using their URLs exclusively. In contrast to other proposals, we analyze features that are outside the page; hence, we do not need to download a page to classify it. The technique is also unsupervised, requiring little intervention from the user. Furthermore, we do not need to crawl a site extensively to build a classifier for it; a small subset of its pages suffices. We have performed an experiment over 21 highly visited websites to evaluate the performance of our classifier, obtaining good precision and recall results.
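
To make the idea of grouping pages without downloading them concrete, the following is a minimal sketch and not the statistical method of the paper, which the abstract does not detail. It assumes a simple heuristic: path segments containing digits and all query values are treated as variable parts, and URLs that share the resulting pattern fall into the same cluster. The function names `url_pattern` and `cluster_urls` are hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse pattern (illustrative heuristic only)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: a segment containing a digit is an identifier, so it
    # is replaced by a wildcard; other segments are kept literally.
    generalised = ["*" if re.search(r"\d", s) else s for s in segments]
    # Keep query parameter names, but discard their values.
    names = sorted(p.split("=", 1)[0] for p in parts.query.split("&") if p)
    pattern = parts.netloc + "/" + "/".join(generalised)
    if names:
        pattern += "?" + "&".join(f"{n}=*" for n in names)
    return pattern


def cluster_urls(urls):
    """Group URLs by pattern; no page content is ever fetched."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)
```

Under this heuristic, two URLs that differ only in a numeric identifier, such as `http://example.org/product/123` and `http://example.org/product/456`, end up in the same cluster. A statistical approach would presumably learn which tokens are variable from a sample of the site's URLs rather than rely on a fixed digit rule.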

Index Terms

Computing methodologies → Machine learning → Learning paradigms → Supervised learning → Supervised learning by classification
Computing methodologies → Machine learning → Machine learning approaches → Classification and regression trees

Author Tags
URL classification
URL patterns
web page clustering