ABSTRACT
In this paper, we present a novel method for the classification of Web sites. This method exploits both structure and content of Web sites in order to discern their functionality. It allows for distinguishing between eight of the most relevant functional classes of Web sites. We show that a pre-classification of Web sites utilizing structural properties considerably improves a subsequent textual classification with standard techniques. We evaluate this approach on a dataset comprising more than 16,000 Web sites with about 20 million crawled and 100 million known Web pages. Our approach achieves an accuracy of 92% for the coarse-grained classification of these Web sites.