DOI: 10.1145/2797115.2797125

Implicit Links based Web Page Representation for Web Page Classification

Published: 13 July 2015

ABSTRACT

With the rapid growth of the web, web page classification has become increasingly important. Both the way a web page is represented and the contextual features used in that representation affect classification performance, so finding an adequate representation is essential for better web page classification. In this paper, we propose a web page representation based on the structure of the implicit graph, which is built from implicit links extracted from the query log. In this representation, a web page is described by its textual content together with its neighbors themselves as features, rather than by the features of those neighbors. The underlying intuition is that when two or more web pages in the implicit graph share the same direct neighbors and belong to the same class c_i, any other web page with the same direct neighbors most likely belongs to c_i as well. We propose two kinds of web page representation: Boolean Neighbor Vector (BNV) and Weighted Neighbor Vector (WNV). In BNV, the feature vector representing the textual content of a web page is supplemented by a Boolean vector that indicates, for each web page, whether it is a direct neighbor of the target web page. In WNV, the textual feature vector is supplemented by a weighted vector that gives the strength of the relation between the target web page and each of its neighbors. We conduct experiments with four classifiers, SVM (Support Vector Machine), NB (Naive Bayes), RF (Random Forest) and KNN (K-Nearest Neighbors), on two subsets of ODP (Open Directory Project). Results show that (1) the proposed representation yields better classification results with SVM, NB, RF and KNN for both Bag of Words (BW) and 5-gram representations, and (2) BNV performs better than WNV.
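The two representations described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: two pages are treated as implicitly linked when they were clicked for the same query, the link weight is the number of queries they share, and the function names (`build_implicit_graph`, `neighbor_vector`, `represent`), the toy click log, and the toy text vector are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_implicit_graph(click_log):
    """Build {page: {neighbor: weight}} from (query, clicked_page) pairs.

    Assumption: pages clicked for the same query are implicit neighbors,
    and the weight counts how many distinct queries they share.
    """
    pages_by_query = defaultdict(set)
    for query, page in click_log:
        pages_by_query[query].add(page)

    graph = defaultdict(lambda: defaultdict(int))
    for pages in pages_by_query.values():
        for a, b in combinations(sorted(pages), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def neighbor_vector(page, all_pages, graph, weighted):
    """One entry per page in all_pages: 0/1 (Boolean) or link weight."""
    neighbors = graph.get(page, {})
    if weighted:
        return [float(neighbors.get(p, 0)) for p in all_pages]
    return [1.0 if p in neighbors else 0.0 for p in all_pages]


def represent(page, text_vector, all_pages, graph, weighted=False):
    """Concatenate the text features with the neighbor vector.

    weighted=False gives a BNV-style vector, weighted=True a WNV-style one.
    """
    return list(text_vector) + neighbor_vector(page, all_pages, graph, weighted)


# Hypothetical toy data: a click log and a 3-term bag-of-words vector.
click_log = [
    ("python tutorial", "a.com"),
    ("python tutorial", "b.com"),
    ("learn python", "a.com"),
    ("learn python", "b.com"),
    ("learn python", "c.com"),
    ("cheap flights", "d.com"),
]
all_pages = ["a.com", "b.com", "c.com", "d.com"]
text_a = [2, 0, 1]

graph = build_implicit_graph(click_log)
bnv_a = represent("a.com", text_a, all_pages, graph, weighted=False)
wnv_a = represent("a.com", text_a, all_pages, graph, weighted=True)
print(bnv_a)  # [2, 0, 1, 0.0, 1.0, 1.0, 0.0]
print(wnv_a)  # [2, 0, 1, 0.0, 2.0, 1.0, 0.0]
```

In this sketch the last four entries of each vector index the pages in `all_pages`, so two pages with the same direct neighbors end up with identical neighbor parts, which is the property the abstract relies on. The resulting vectors could then be passed to any standard classifier.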


Published in

WIMS '15: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics
July 2015, 176 pages
ISBN: 9781450332934
DOI: 10.1145/2797115

Copyright © 2015 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: research-article, Research, Refereed limited

Overall acceptance rate: 140 of 278 submissions, 50%
