ABSTRACT
Automatically classifying Web directories is an effective way to manage Web information. However, our experiments showed that state-of-the-art text classification techniques do not achieve acceptable performance on this task. According to our analysis, the main problem is the lack of effective training data for rare categories in Web directories. To tackle this problem, we propose a novel technique named Site Abstraction, which synthesizes new training examples from the website of an existing training document. The main idea is to propagate features through the parent-child relationships in the sitemap tree. Experiments showed that our method significantly improved classification performance.
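The abstract does not say how features are propagated, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes each page in the sitemap tree is represented by a bag of term counts, and that a parent page receives its descendants' term weights scaled by a decay factor per tree level. The `Page` class, the `propagate` function, and the `decay` parameter are all hypothetical names introduced for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """One node of a hypothetical sitemap tree (an assumption, not from the paper)."""
    url: str
    terms: Counter
    children: List["Page"] = field(default_factory=list)


def propagate(page: Page, decay: float = 0.5) -> Counter:
    """Combine a page's own term counts with decayed weights from its subtree.

    Each level of depth below `page` multiplies a term's weight by `decay`,
    so nearby child pages contribute more than distant descendants.
    """
    combined = Counter()
    for term, count in page.terms.items():
        combined[term] += float(count)
    for child in page.children:
        # Recurse first, then scale the child's aggregated weights once.
        for term, weight in propagate(child, decay).items():
            combined[term] += decay * weight
    return combined


# Toy site: a home page with two child pages.
home = Page(
    "/",
    Counter({"hotel": 1}),
    [
        Page("/rooms", Counter({"hotel": 2, "suite": 4})),
        Page("/contact", Counter({"phone": 2})),
    ],
)

# The aggregated vector could serve as a synthesized training example
# for the category of the site's labeled page.
features = propagate(home)
```

In this toy case the home page, which has only one term of its own, ends up with a richer feature vector drawn from its children. Under the stated assumptions, that is the sense in which extra training signal could be synthesized for a category with few labeled documents.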
Index Terms
- Site abstraction for rare category classification in large-scale web directory