Abstract
Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time.
We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms, we used machine learning algorithms that are widely applied to text classification, as well as state-of-the-art algorithms for language identification of text. As features, we used words, n-grams of various sizes, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers.
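To make the word and n-gram features and the country code top-level domain baseline concrete, here is a minimal sketch. It is not the authors' implementation: the tokenization rule, the function names, and the small ccTLD-to-language table are assumptions chosen for illustration, and the custom-made features are not reproduced.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Illustrative mapping only (an assumption); the table used in the article
# is not given in the abstract.
CCTLD_TO_LANGUAGE = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens ("words").

    Assumption: every non-letter character acts as a separator, and
    tokens such as "http" or "www" are kept rather than filtered out.
    """
    return [t for t in re.split(r"[^a-z]+", url.lower()) if t]


def char_ngrams(tokens, n):
    """Count the character n-grams that occur inside each token."""
    counts = Counter()
    for tok in tokens:
        for i in range(len(tok) - n + 1):
            counts[tok[i:i + n]] += 1
    return counts


def cctld_baseline(url):
    """Guess the language from the country code top-level domain.

    Returns None when the host has no ccTLD in the table.
    """
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_LANGUAGE.get(tld)
```

For a hypothetical URL such as http://www.wetter.de/berlin, the word features would be the tokens themselves, the trigram features would come from char_ngrams(tokens, 3), and the baseline would answer from the ".de" suffix alone. A learned classifier would take the token or n-gram counts as its input vector.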
We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best-performing classifiers, the F1-measure ranged from 94 for English (the lowest) to 98 for German (the highest).
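As a reminder of how such numbers are computed, the following sketch derives the F1-measure for one language from true and predicted labels and generates index splits for k-fold cross-validation. It assumes the reported values are percentages and that each language is scored one-versus-rest; the function names and the round-robin fold assignment are illustrative choices, not details taken from the article.

```python
def f1_measure(true_labels, predicted_labels, language):
    """Return the F1-measure (as a percentage) for one language.

    F1 is the harmonic mean of precision and recall, where a "positive"
    is a URL labeled with the given language.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == language and p == language)
    fp = sum(1 for t, p in pairs if t != language and p == language)
    fn = sum(1 for t, p in pairs if t == language and p != language)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Assumption: items are assigned to folds round-robin by index; a real
    setup would usually shuffle the data first.
    """
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test
```

In a 10-fold setup, a classifier would be trained on each train split, evaluated on the matching test split, and the per-fold scores averaged.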
We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second case, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers perform well, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
Index Terms
- A Comprehensive Study of Techniques for URL-Based Web Page Language Classification