A Comprehensive Study of Techniques for URL-Based Web Page Language Classification

Authors:
Eda Baykan, Izmir University
Monika Henzinger, University of Vienna
Ingmar Weber, Yahoo! Research Barcelona

ACM Transactions on the Web, Volume 7, Issue 1, Article No. 3, pp. 1–37
https://doi.org/10.1145/2435215.2435218

Published: 01 March 2013

Abstract

Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or when downloading the content would waste bandwidth and time.

We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms, we used machine learning algorithms that are widely applied for text classification and state-of-the-art algorithms for language identification of text. As features, we used words, n-grams of various sizes, and custom-made features (our novel feature set). We compared our approaches with two baseline methods: classification by country code top-level domain and classification by the IP address of the hosting Web server.
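To make the feature and baseline descriptions concrete, the following is a minimal sketch, not the authors' implementation. It shows a country code top-level domain baseline and character n-gram extraction from a URL. The mapping table, the function names `cctld_baseline` and `char_ngrams`, and the choice to tokenize on runs of lowercase letters are all assumptions made for illustration; the abstract does not specify them.

```python
import re
from urllib.parse import urlparse

# Hypothetical, deliberately tiny ccTLD-to-language table (an assumption;
# the article's actual table is not given in the abstract).
CCTLD_TO_LANGUAGE = {
    "de": "German",
    "at": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def cctld_baseline(url):
    """Guess the language from the last label of the host name, or return None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_LANGUAGE.get(tld)


def char_ngrams(url, n=3):
    """Return the character n-grams of each alphabetic token in the URL."""
    tokens = re.findall(r"[a-z]+", url.lower())
    # Tokens shorter than n contribute no n-grams.
    return [tok[i:i + n] for tok in tokens for i in range(len(tok) - n + 1)]
```

In this sketch, a URL on a generic domain such as `.com` yields `None` from the baseline, which is the kind of case where word or n-gram features from the rest of the URL would have to carry the decision.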

We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best-performing classifiers, the F1-measure ranged from 94 (English, the lowest) to 98 (German, the highest).
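As a reminder of how these quantities are computed, here is a small sketch of the F1-measure and of a 10-fold split. It assumes that the reported values are percentages and that the counts refer to one language treated as the positive class; the function names `f1_measure` and `k_fold_splits` and the interleaved fold assignment are illustrative choices, not details taken from the article.

```python
def f1_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, scaled to the range 0-100."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 100.0 * 2 * precision * recall / (precision + recall)


def k_fold_splits(items, k=10):
    """Yield (train, test) lists; fold i tests on every k-th item starting at i."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test
```

Each item appears in exactly one test fold, so averaging a per-fold score over the `k` folds uses every labeled URL for testing once and for training `k - 1` times.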

We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract, and in the second case, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers perform well, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.

Index Terms

A Comprehensive Study of Techniques for URL-Based Web Page Language Classification
1. Information systems
  1. Information retrieval
    1. Document representation

Published in

ACM Transactions on the Web Volume 7, Issue 1
March 2013
128 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2435215

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2013
- Accepted: 1 October 2012
- Revised: 1 July 2012
- Received: 1 September 2011
Author Tags
Document and text processing
URL
Web page classification
language classification
Qualifiers
- research-article
- Research
- Refereed
