A Comprehensive Study of Techniques for URL-Based Web Page Language Classification

Published: 01 March 2013

Abstract

Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time.

We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied to text classification, as well as state-of-the-art algorithms for language identification of text. As features we used words, n-grams of various sizes, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domain and classification by the IP address of the hosting Web server.
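As a rough illustration of the feature types and the top-level-domain baseline named above, the following sketch splits a URL into word tokens, derives character n-grams from a token, and guesses a language from the country code top-level domain. The tokenization rule, the function names, and the `CCTLD_TO_LANG` table are assumptions made for this example; the abstract does not specify how the authors implemented either step.

```python
import re
from urllib.parse import urlparse

# Hypothetical ccTLD-to-language table for the five languages in the study.
# The mapping actually used by the authors is not given in the abstract.
CCTLD_TO_LANG = {
    "uk": "English", "us": "English",
    "de": "German", "at": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
}


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (assumed tokenization)."""
    tokens = re.split(r"[^a-z]+", url.lower())
    # Drop empty strings and a few tokens that carry no language signal.
    return [t for t in tokens if t and t not in {"http", "https", "www"}]


def char_ngrams(token, n):
    """Return all character n-grams of a single token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def cctld_baseline(url):
    """Guess the language from the ccTLD; return None if it is not in the table."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_LANG.get(tld)


if __name__ == "__main__":
    example = "http://www.wetter.de/berlin/heute.html"
    print(url_tokens(example))
    print(char_ngrams("heute", 3))
    print(cctld_baseline(example))
```

A baseline of this kind returns no answer for generic domains such as .com, which is one reason a learned classifier over tokens and n-grams may be preferable.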

We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best performing classifiers.
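For readers unfamiliar with the metric, the F1-measure is the harmonic mean of precision and recall, reported here on a 0 to 100 scale. The sketch below computes it for one language from counts of true positives, false positives, and false negatives. The function name and the counts in the usage example are made up for illustration and are not results from the article.

```python
def f1_measure(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) for one class, each in the range 0..1."""
    predicted = true_pos + false_pos   # URLs the classifier assigned to the language
    actual = true_pos + false_neg      # URLs that really are in the language
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    denom = 2 * true_pos + false_pos + false_neg
    # Equivalent to 2 * precision * recall / (precision + recall).
    f1 = 2 * true_pos / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative counts only, not taken from the study.
    p, r, f1 = f1_measure(true_pos=90, false_pos=10, false_neg=10)
    print(f"precision={p:.2f} recall={r:.2f} F1={100 * f1:.0f}")
```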

We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second case, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers perform well, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.

    • Published in

      ACM Transactions on the Web, Volume 7, Issue 1
      March 2013
      128 pages
      ISSN: 1559-1131
      EISSN: 1559-114X
      DOI: 10.1145/2435215
      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 March 2013
      • Accepted: 1 October 2012
      • Revised: 1 July 2012
      • Received: 1 September 2011

      Qualifiers

      • research-article
      • Research
      • Refereed
