ABSTRACT
In this paper, we describe a system for constructing taxonomies that yield high accuracy when used with automated categorization systems, even on Web and intranet documents. In particular, we describe how measuring five key features of the system can be used to predict when categories are sufficiently well defined to yield high-accuracy categorization. We also describe the use of this system to construct a large (8,800-category) general-purpose taxonomy and categorization system.
Index Terms
- Taxonomies by the numbers: building high-performance taxonomies