Article

Site abstraction for rare category classification in large-scale web directory

Authors:
Tie-Yan LIU, Microsoft Research Asia, Beijing, P. R. China
Hao WAN, Tsinghua University, Beijing, P. R. China
Tao QIN, Tsinghua University, Beijing, P. R. China
Zheng CHEN, Microsoft Research Asia, Beijing, P. R. China
Yong REN, Tsinghua University, Beijing, P. R. China
Wei-Ying MA, Microsoft Research Asia, Beijing, P. R. China

WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web, May 2005, Pages 1108–1109, https://doi.org/10.1145/1062745.1062892

Published: 10 May 2005

ABSTRACT

Automatically classifying Web directories is an effective way to manage Web information. However, our experiments showed that state-of-the-art text classification technologies do not achieve acceptable performance on this task. According to our analysis, the main problem is the lack of effective training data in the rare categories of Web directories. To tackle this problem, we propose a novel technique named Site Abstraction, which synthesizes new training examples from the website of an existing training document. The main idea is to propagate features through the parent-child relationships in the sitemap tree. Experiments showed that our method significantly improved classification performance.
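
The abstract does not give the propagation formula, so the following is only a minimal sketch of what propagating features along parent-child links in a sitemap tree might look like. The `Page` class, the `propagate` and `synthesize_examples` functions, the `decay` weight, the example URLs, and the category label are all hypothetical names and values chosen for illustration; they are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Page:
    """One node of a hypothetical sitemap tree with bag-of-words features."""
    url: str
    terms: Counter
    parent: Optional["Page"] = field(default=None, repr=False, compare=False)
    children: List["Page"] = field(default_factory=list)

    def add_child(self, child: "Page") -> "Page":
        child.parent = self
        self.children.append(child)
        return child


def propagate(page: Page, decay: float = 0.5) -> Dict[str, float]:
    """Blend a page's own term counts with decayed counts from its parent
    and children. The additive weighting is an assumption for illustration,
    not the formula used in the paper."""
    features = {term: float(count) for term, count in page.terms.items()}
    neighbours = list(page.children)
    if page.parent is not None:
        neighbours.append(page.parent)
    for other in neighbours:
        for term, count in other.terms.items():
            features[term] = features.get(term, 0.0) + decay * count
    return features


def synthesize_examples(
    root: Page, label: str, decay: float = 0.5
) -> List[Tuple[str, Dict[str, float]]]:
    """Walk the sitemap tree and emit one labelled example per page."""
    examples = []
    stack = [root]
    while stack:
        page = stack.pop()
        examples.append((label, propagate(page, decay)))
        stack.extend(page.children)
    return examples


# Toy site: a home page listed in a rare directory category, plus one child.
home = Page("http://example.org/", Counter({"orchid": 2, "society": 1}))
care = home.add_child(
    Page("http://example.org/care", Counter({"orchid": 1, "watering": 3}))
)
examples = synthesize_examples(home, "Home/Gardening/Orchids")
```

In this toy setup a single labelled home page yields two training examples, each enriched with terms from its neighbour in the tree. The resulting feature dictionaries could then be fed to any text classifier; the decay value would need to be tuned on real data.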

Index Terms

1. Applied computing
  1. Document management and text processing
2. Information systems
  1. Information systems applications

Published in
WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web
May 2005
454 pages
ISBN: 1595930515
DOI: 10.1145/1062745
Conference Chairs: Allan Ellis (Southern Cross University, Australia), Tatsuya Hagino (Keio University, Japan)
Program Chairs: Fred Douglis (IBM Research), Prabhakar Raghavan (Verity, Inc.)
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
hierarchical classification
site abstraction
support vector machines (SVM)
text classification
web directory
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%