Taxonomy generation for text segments: A practical web-based approach

Authors:
Shui-Lung Chuang (Institute of Information Science, Academia Sinica, Taipei, Taiwan)
Lee-Feng Chien (Institute of Information Science, Academia Sinica and Department of Information Management, National Taiwan University)

ACM Transactions on Information Systems, Volume 23, Issue 4, pp. 363–396. https://doi.org/10.1145/1095872.1095873

Published: 01 October 2005

Abstract

It is crucial in many information systems to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the possibilities of using highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed for creating the hierarchical topic structure of text segments. Text segments with close concepts can be grouped together in a cluster, and relevant clusters are linked at the same or nearby levels. Unlike traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on different domains of text segments, including subject terms, people's names, paper titles, and natural language questions. The experimental results show the potential of the proposed approach, which provides a basis for the in-depth analysis of text segments on a larger scale and is expected to benefit many information systems.
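The two steps the abstract describes (enriching each short segment with search-result snippets, then clustering the enriched representations hierarchically) can be illustrated with a minimal sketch. This is not the authors' algorithm: the `SNIPPETS` table is a hypothetical stand-in for real search results, the TF-IDF weighting and cosine similarity are assumed choices, and the clustering is plain average-link agglomerative clustering, which yields a binary tree rather than the more natural multi-way hierarchy the article aims for.

```python
import math
from collections import Counter

# Hypothetical stand-in for highly ranked search-result snippets.
# A real system would fetch these from a search engine per text segment.
SNIPPETS = {
    "python": ["python programming language code", "python code tutorial programming"],
    "java": ["java programming language code", "java virtual machine code programming"],
    "jaguar": ["jaguar big cat animal wildlife", "jaguar animal rainforest predator"],
    "tiger": ["tiger big cat animal wildlife", "tiger animal predator stripes"],
}

def tfidf_vectors(snippets):
    """Build one sparse TF-IDF vector per segment from its concatenated snippets."""
    counts = {seg: Counter(" ".join(texts).split()) for seg, texts in snippets.items()}
    n = len(counts)
    # Document frequency: number of segments whose snippets contain the term.
    df = Counter(term for c in counts.values() for term in c)
    return {
        seg: {t: tf * math.log(1 + n / df[t]) for t, tf in c.items()}
        for seg, c in counts.items()
    }

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def leaves(tree):
    """Return the segments under a tree node (a string leaf or a pair of subtrees)."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def average_link(a, b, vecs):
    """Mean pairwise cosine similarity between the leaves of two subtrees."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(cosine(vecs[x], vecs[y]) for x, y in pairs) / len(pairs)

def cluster(snippets):
    """Repeatedly merge the two most similar subtrees until one binary tree remains."""
    vecs = tfidf_vectors(snippets)
    trees = sorted(vecs)
    while len(trees) > 1:
        i, j = max(
            ((i, j) for i in range(len(trees)) for j in range(i + 1, len(trees))),
            key=lambda p: average_link(trees[p[0]], trees[p[1]], vecs),
        )
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]
```

Calling `cluster(SNIPPETS)` returns a nested tuple in which the two programming-related segments share one subtree and the two animal-related segments share the other, even though the segments themselves have no words in common; the similarity comes entirely from the snippet text.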


Index Terms

Taxonomy generation for text segments: A practical web-based approach
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published in

ACM Transactions on Information Systems, Volume 23, Issue 4
October 2005
135 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/1095872

Copyright © 2005 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
Taxonomy generation
hierarchical clustering
partitioning
search-result snippet
text data mining
text segment