ABSTRACT
Most web page classifiers rely on features extracted from page content, so a page has to be downloaded before it can be classified. We propose a technique that clusters web pages using their URLs exclusively. In contrast to other proposals, we analyze features that lie outside the page; hence, we do not need to download a page to classify it. The technique is also unsupervised and requires little intervention from the user. Furthermore, building a classifier for a site does not require crawling the site extensively; a small subset of its pages suffices. We evaluated our classifier in an experiment over 21 highly visited websites and obtained good precision and recall.
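To make the idea of URL-only clustering concrete, the following is a minimal sketch and not the algorithm proposed in the paper, whose details the abstract does not give. It assumes a simple frequency heuristic: a URL token is kept as a literal when it is common at its position and replaced by a wildcard otherwise, and URLs that share the resulting pattern form a cluster. The names `tokenize`, `cluster_by_url`, and `min_support`, as well as the `example.com` URLs, are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def tokenize(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parts = urlsplit(url)
    return [parts.netloc] + [seg for seg in parts.path.split("/") if seg]


def cluster_by_url(urls, min_support=0.3):
    """Group URLs by a pattern in which infrequent tokens become '*'.

    min_support is an assumed threshold: the minimum share of same-length
    URLs that must carry a token at a position for it to stay literal.
    """
    tokenized = [tokenize(u) for u in urls]

    # Count token occurrences per (URL length, position, token).
    counts = Counter()
    totals = Counter()
    for tokens in tokenized:
        totals[len(tokens)] += 1
        for pos, tok in enumerate(tokens):
            counts[(len(tokens), pos, tok)] += 1

    # Replace rare tokens with a wildcard and group by the resulting pattern.
    clusters = defaultdict(list)
    for url, tokens in zip(urls, tokenized):
        n = len(tokens)
        pattern = tuple(
            tok if counts[(n, pos, tok)] / totals[n] >= min_support else "*"
            for pos, tok in enumerate(tokens)
        )
        clusters[pattern].append(url)
    return dict(clusters)


if __name__ == "__main__":
    sample = [
        "http://example.com/product/101",
        "http://example.com/product/102",
        "http://example.com/product/103",
        "http://example.com/author/ann",
        "http://example.com/author/bob",
        "http://example.com/author/cy",
    ]
    for pattern, members in cluster_by_url(sample).items():
        print("/".join(pattern), len(members))
```

Only the URL strings are inspected, so no page is downloaded. A real system would need a more careful statistical model than this single threshold.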
Index Terms
- A statistical approach to URL-based web page clustering