DOI: 10.1145/1062745.1062892
Article

Site abstraction for rare category classification in large-scale web directory

Published: 10 May 2005

ABSTRACT

Automatically classifying Web directories is an effective way to manage Web information. However, our experiments showed that state-of-the-art text classification technologies do not achieve acceptable performance on this task. According to our analysis, the main problem is the lack of effective training data in the rare categories of Web directories. To tackle this problem, we propose a novel technique named Site Abstraction, which synthesizes new training examples from the website of an existing training document. The main idea is to propagate features through the parent-child relationships in the sitemap tree. Experiments showed that our method significantly improved classification performance.
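The abstract does not specify how features are propagated, so the following Python sketch is only one possible reading of the idea and not the authors' method. It assumes a bag-of-words representation, bottom-up propagation from child pages to their parent, and a fixed decay weight per tree level. The names `PageNode`, `propagate_up`, `synthesize_examples`, the `decay` parameter, and the example URLs are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class PageNode:
    """One page in a sitemap tree (hypothetical structure)."""
    url: str
    terms: Counter                      # raw term counts of this page
    children: List["PageNode"] = field(default_factory=list)


def propagate_up(node: PageNode, decay: float = 0.5) -> Dict[str, float]:
    """Return the page's own term weights plus decayed weights from its subtree.

    Assumption: each level of distance multiplies a term's weight by `decay`.
    """
    weights: Dict[str, float] = {t: float(c) for t, c in node.terms.items()}
    for child in node.children:
        for term, w in propagate_up(child, decay).items():
            weights[term] = weights.get(term, 0.0) + decay * w
    return weights


def synthesize_examples(
    root: PageNode, label: str, decay: float = 0.5
) -> List[Tuple[Dict[str, float], str]]:
    """Build one synthetic (features, label) pair per page of the site.

    Every page inherits the category label of the original training document.
    """
    examples = [(propagate_up(root, decay), label)]
    for child in root.children:
        examples.extend(synthesize_examples(child, label, decay))
    return examples


# Toy site: a home page with a single child page.
site = PageNode(
    "example.org/",
    Counter({"orchid": 1}),
    [PageNode("example.org/care", Counter({"orchid": 2, "watering": 1}))],
)
for features, category in synthesize_examples(site, "Gardening/Orchids"):
    print(category, features)
```

Under these assumptions, a rare category with a single labelled home page gains one extra training vector for every page of that site, which is the kind of data augmentation the abstract describes.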


Published in

WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web
May 2005, 454 pages
ISBN: 1595930515
DOI: 10.1145/1062745

Copyright © 2005 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 1,899 of 8,196 submissions (23%)
