ABSTRACT
With the rapid growth of the web, web page classification has become increasingly important. Both the way a web page is represented and the contextual features used in that representation affect classification performance, so finding an adequate representation is essential for better web page classification. In this paper, we propose a web page representation based on the structure of the implicit graph built from implicit links extracted from the query log. In this representation, we describe each web page by its textual content together with its neighbors themselves as features, rather than by the features of its neighbors. When two or more web pages in the implicit graph share the same direct neighbors and belong to the same class ci, any other web page with the same direct neighbors most likely belongs to ci as well. We propose two kinds of web page representation: Boolean Neighbor Vector (BNV) and Weighted Neighbor Vector (WNV). In BNV, we supplement the feature vector that represents the textual content of a web page with a Boolean vector indicating, for each web page, whether it is a direct neighbor of the target web page. In WNV, we supplement the same textual feature vector with a weighted vector giving the strength of the relation between the target web page and each of its neighbors. We conduct experiments with four classifiers, SVM (Support Vector Machine), NB (Naive Bayes), RF (Random Forest) and KNN (K-Nearest Neighbors), on two subsets of ODP (Open Directory Project). The results show that (1) the proposed representation yields better classification results with SVM, NB, RF and KNN for both Bag of Words (BW) and 5-gram representations, and (2) BNV performs better than WNV.
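As an illustration of the two representations, the following minimal Python sketch builds an implicit graph from a click log and appends a neighbor vector to a textual feature vector. The abstract does not specify how implicit links are extracted or how relation strengths are computed, so the choices here are assumptions: two pages are linked when they were clicked for the same query, and the weight of a link is the number of distinct queries the two pages share. The function names (`build_implicit_graph`, `neighbor_vector`, `represent`) and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_implicit_graph(click_log):
    """Build link weights from (query, clicked_page) pairs.

    Assumption: two pages share an implicit link when they were clicked
    for the same query; the weight counts the distinct shared queries.
    """
    pages_by_query = defaultdict(set)
    for query, page in click_log:
        pages_by_query[query].add(page)

    weights = defaultdict(lambda: defaultdict(int))
    for pages in pages_by_query.values():
        for a, b in combinations(sorted(pages), 2):
            weights[a][b] += 1
            weights[b][a] += 1
    return weights


def neighbor_vector(page, all_pages, weights, weighted=False):
    """Return the BNV part (weighted=False) or WNV part (weighted=True).

    The vector has one entry per page in all_pages, in that order.
    """
    row = weights.get(page, {})
    if weighted:
        return [float(row.get(p, 0)) for p in all_pages]
    return [1.0 if row.get(p, 0) > 0 else 0.0 for p in all_pages]


def represent(page, text_vector, all_pages, weights, weighted=False):
    """Concatenate the textual feature vector with the neighbor vector."""
    return list(text_vector) + neighbor_vector(page, all_pages, weights, weighted)


# Toy click log (hypothetical data): page "d" is never clicked.
click_log = [
    ("python tutorial", "a"),
    ("python tutorial", "b"),
    ("learn python", "a"),
    ("learn python", "b"),
    ("learn python", "c"),
]
all_pages = ["a", "b", "c", "d"]
graph = build_implicit_graph(click_log)
bnv_a = represent("a", [0.2, 0.0, 0.7], all_pages, graph)
wnv_a = represent("a", [0.2, 0.0, 0.7], all_pages, graph, weighted=True)
```

In this sketch the neighbor part of the vector has one dimension per page in the collection, so pages that share the same direct neighbors receive identical neighbor parts, which is the property the classifier is expected to exploit. The resulting vectors could then be passed to any of the four classifiers named above.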
Index Terms
- Implicit Links based Web Page Representation for Web Page Classification