Taxonomies by the numbers: building high-performance taxonomies

Authors:
Stephen C. Gates, IBM T.J. Watson Research Center, Hawthorne, NY
Wilfried Teiken, IBM T.J. Watson Research Center, Hawthorne, NY
Keh-Shin F. Cheng, IBM T.J. Watson Research Center, Hawthorne, NY

CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management, October 2005, Pages 568–577
https://doi.org/10.1145/1099554.1099703

Published: 31 October 2005

ABSTRACT

In this paper, we describe a system for constructing taxonomies that yield high accuracy when used with automated categorization systems, even on Web and intranet documents. In particular, we describe how measuring five key features of the system can be used to predict when categories are sufficiently well defined to yield high-accuracy categorization. We describe the use of this system to construct a large (8800-category) general-purpose taxonomy and categorization system.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in
CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge management
October 2005
854 pages
ISBN: 1595931406
DOI: 10.1145/1099554
General Chair: Otthein Herzog, University of Bremen, Germany
Program Chairs:
Hans-Jörg Schek, University for Health Sciences, Medical Informatics and Technology, Austria
Norbert Fuhr, University of Duisburg-Essen, Germany
Abdur Chowdhury, America Online, USA
Wilfried Teiken, IBM T.J. Watson Research Center, USA
Copyright © 2005 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering for document selection
quality measurements
taxonomy development
text categorization
Qualifiers
- Article
Acceptance Rates
CIKM '05 Paper Acceptance Rate: 77 of 425 submissions, 18%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%