Taxonomy generation for text segments: A practical web-based approach

Published: 01 October 2005

Abstract

Many information systems need to organize short text segments, such as keywords in documents and queries from users, into a well-formed taxonomy. In this article, we address the problem of taxonomy generation for diverse text segments with a general and practical approach that uses the Web as an additional knowledge source. Unlike long documents, short text segments typically do not contain enough information to extract reliable features. This work investigates the use of highly ranked search-result snippets to enrich the representation of text segments. A hierarchical clustering algorithm is then designed to create the hierarchical topic structure of the text segments: segments with close concepts are grouped together in a cluster, and related clusters are linked at the same or nearby levels. Unlike traditional clustering algorithms, which tend to produce cluster hierarchies with a very unnatural shape, the algorithm tries to produce a more natural and comprehensive tree hierarchy. Extensive experiments were conducted on different domains of text segments, including subject terms, people's names, paper titles, and natural language questions. The experimental results show the potential of the proposed approach, which provides a basis for the in-depth analysis of text segments on a larger scale and is expected to benefit many information systems.
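
To make the two stages in the abstract concrete, here is a minimal Python sketch under stated assumptions; it is not the article's algorithm. The `SNIPPETS` table is hypothetical hand-written text standing in for highly ranked search-result snippets (no search engine is queried), each segment is represented by plain term frequencies, and the clustering is ordinary average-linkage agglomerative clustering with cosine similarity. That baseline yields a binary tree, which is exactly the kind of shape the article aims to improve on.

```python
import math
from collections import Counter

# Hypothetical snippets standing in for top-ranked search results per segment.
SNIPPETS = {
    "support vector machine": ["svm classifier kernel margin learning",
                               "kernel machine learning classifier"],
    "neural network": ["neural learning classifier layers training",
                       "deep learning network training"],
    "relational database": ["sql database tables query index",
                            "database query transactions sql"],
    "query optimization": ["sql query planner index database",
                           "query cost optimizer database"],
}

def vectorize(snippets):
    """Build a term-frequency vector from all snippets of one segment."""
    return Counter(word for s in snippets for word in s.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def leaves(tree):
    """Return the segments under a tree (a str leaf or a pair of subtrees)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]

def average_link(t1, t2, vecs):
    """Mean pairwise similarity between the segments of two subtrees."""
    pairs = [(x, y) for x in leaves(t1) for y in leaves(t2)]
    return sum(cosine(vecs[x], vecs[y]) for x, y in pairs) / len(pairs)

def cluster(snippets_by_segment):
    """Repeatedly merge the two most similar subtrees into a binary tree."""
    vecs = {seg: vectorize(sn) for seg, sn in snippets_by_segment.items()}
    trees = list(vecs)
    while len(trees) > 1:
        index_pairs = [(i, j) for i in range(len(trees))
                       for j in range(i + 1, len(trees))]
        i, j = max(index_pairs,
                   key=lambda p: average_link(trees[p[0]], trees[p[1]], vecs))
        merged = (trees[i], trees[j])
        trees = [t for k, t in enumerate(trees) if k not in (i, j)] + [merged]
    return trees[0]

if __name__ == "__main__":
    print(cluster(SNIPPETS))
```

With this toy input the two machine-learning segments end up under one branch and the two database segments under the other, even though the segments themselves share almost no words; the shared vocabulary comes entirely from the snippets.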
