Abstract
Many information systems need to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information for reliable feature extraction. This work investigates the use of highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed to create the hierarchical topic structure of the text segments: segments with closely related concepts can be grouped together in a cluster, and related clusters are linked at the same or nearby levels. Unlike traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the proposed algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on text segments from different domains, including subject terms, people's names, paper titles, and natural language questions. The experimental results show the potential of the proposed approach, which provides a basis for in-depth analysis of text segments on a larger scale and is expected to benefit many information systems.
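The two-stage idea in the abstract (enrich each segment with snippet text, then cluster hierarchically) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the snippets are hard-coded stand-ins for search results (no search API is called), the weighting is a plain TF-IDF, and the clustering is ordinary binary average-linkage agglomeration rather than the article's own algorithm, which aims at a more natural tree shape.

```python
import math
from collections import Counter

# Hypothetical snippets standing in for highly ranked search results.
SNIPPETS = {
    "jaguar speed": ["the jaguar is a big cat that runs fast",
                     "big cat speed in the wild"],
    "cheetah": ["the cheetah is the fastest big cat",
                "cheetah speed in the wild"],
    "python tutorial": ["learn python programming language basics",
                        "python code tutorial for beginners"],
    "java course": ["learn java programming language basics",
                    "java code course for beginners"],
}

def build_vectors(snippets):
    """Represent each segment by a TF-IDF vector over its snippet terms."""
    tf = {seg: Counter(" ".join(texts).split()) for seg, texts in snippets.items()}
    n = len(tf)
    df = Counter(term for counts in tf.values() for term in counts)
    return {seg: {t: c * math.log(n / df[t]) for t, c in counts.items()}
            for seg, counts in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def leaves(tree):
    """List the segments under a node; a node is a segment or a pair of nodes."""
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def average_link(a, b, vectors):
    """Average pairwise similarity between the segments of two nodes."""
    pairs = [(x, y) for x in leaves(a) for y in leaves(b)]
    return sum(cosine(vectors[x], vectors[y]) for x, y in pairs) / len(pairs)

def cluster(vectors):
    """Repeatedly merge the two most similar nodes into a binary tree."""
    trees = list(vectors)
    while len(trees) > 1:
        i, j = max(((i, j) for i in range(len(trees))
                    for j in range(i + 1, len(trees))),
                   key=lambda p: average_link(trees[p[0]], trees[p[1]], vectors))
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]

if __name__ == "__main__":
    print(cluster(build_vectors(SNIPPETS)))
```

In this toy data the segments "jaguar speed" and "cheetah" share no words with each other, yet their snippets overlap, so the enriched vectors place them in the same branch. A strictly binary tree like this one is the kind of shape the abstract describes as unnatural; producing a wider, multi-way hierarchy would require a further step that this sketch does not attempt.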