ABSTRACT
Categorisation is a useful method for organising documents into subcollections that can be browsed or searched to more accurately and quickly meet information needs. On the Web, category-based portals such as Yahoo! and DMOZ are extremely popular: DMOZ is maintained by over 56,000 volunteers, is used as the basis of the popular Google directory, and is perhaps used by millions of users each day. Support Vector Machines (SVM) is a machine-learning algorithm which has been shown to be highly effective for automatic text categorisation. However, a problem with iterative training techniques such as SVM is that during their learning or training phase, they require the entire training collection to be held in main-memory; this is infeasible for large training collections such as DMOZ or large news wire feeds. In this paper, we show how inverted indexes can be used for scalable training in categorisation, and propose novel heuristics for a fast, accurate, and memory efficient approach. Our results show that an index can be constructed on a desktop workstation with little effect on categorisation accu-racy compared to a memory-based approach. We conclude that our techniques permit automatic categorisation using very large train-ing collections, vocabularies, and numbers of categories.
- R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. On feature distributional clustering for text categorization. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 146--153, New Orleans, LA, 2001. ACM Press, NY. Google ScholarDigital Library
- C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121--167, 1998. Google ScholarDigital Library
- S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. In Proc. of 28th International Conference on Very Large Data Bases, pages 658--669, Hong Kong, China, August 2002. Morgan Kaufmann. Google ScholarDigital Library
- N. Craswell, D. Hawking, and S. Robertson. Effective site finding using link anchor information. In W.B. Croft, D.J. Harper, D.H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 250--257, New Orleans, LA, 2001. ACM Press, NY. Google ScholarDigital Library
- I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. In C. Cardie and R. Weischedel, editors, Proc. Conference on Empirical Methods in NLP, pages 55--63, Providence, RI, 1997.Google Scholar
- P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194--203, March 1975.Google ScholarDigital Library
- S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. Journal of the American Society for Information Science and Technology. (To appear). Google ScholarDigital Library
- T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In D. H. Fisher, editor, Proc. International Conference on Machine Learning, pages 143--151, Nashville, TN, July 1997. Morgan Kaufmann. Google ScholarDigital Library
- T. Joachims. Text categorisation with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proc. European Conference on Machine Learning, pages 137--142, Chemnitz, Germany, 1998. Springer. Google ScholarDigital Library
- D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In D.H. Fisher, editor, Proc. of the 14th International Conference on Machine Learning (ICML97), pages 170--178, Nashville, 1997. Morgan Kaufmann. Google ScholarDigital Library
- D. Lewis, R. Schapire, J. Callan, and R. Papka. Training algorithms for linear text classifiers. In H. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 296-306, Zurich, Switzerland, August 1996. Google ScholarDigital Library
- D. D. Lewis. Applying support vector machines to the TREC-2001 batch filtering and routing tasks. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, Gaithersburg, MD, 2001. NIST publication number SN003-003-03750-8.Google Scholar
- C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999. Google ScholarDigital Library
- A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at: http://www.cs.cmu.edu/mccallum/bow, 1996.Google Scholar
- S. Robertson and I. Soboroff. The TREC 2001 filtering track report. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, pages 26--37, Gaithersburg, MD, 2001. NIST publication number SN003-003-03750-8.Google Scholar
- S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, pages 253--264, Gaithersburg, MD, November 1998. NIST publication number SN003-003-03614-5.Google Scholar
- J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System Experiments in Automatic Document Processing, pages 313--323. Prentice-Hall, 1971.Google Scholar
- T. G. Rose, M. Stevenson, and M. Whitehead. The Reuters corpus volume 1 - from yesterday's news to tomorrow's language resources. In Proc. Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, May 2002.Google Scholar
- D. Heckerman S. T. Dumais, J. Platt and M. Sahami. Inductive learning algorithms and representations for text categorization. In K. Makki and L. Bouganim, editors, Proc. International Conference on Information and Knowledge Management, pages 148--155, Bethesda, MD, November 1998. Google ScholarDigital Library
- G. Salton. Automatic Text Processing. Addison Wesley, Massachusetts, 1989. Google ScholarDigital Library
- G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983. Google ScholarDigital Library
- F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In K. Järvelin, M. Beaulieu, R. Baeza-Yates, and S. H. Myaeng, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 222--229, Tampere, Finland, 2002. Google ScholarDigital Library
- F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002. Google ScholarDigital Library
- V. R. Shanks, H. E. Williams, and A. Cannane. Indexing for fast categorisation. In Proc. Australasian Computer Science Conference, pages 119-127, Adelaide, Australia, 2003. Australian Computer Society. Google ScholarDigital Library
- A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalisation. Information Processing & Management, 32(5):619-633, 1996. Google ScholarDigital Library
- C. J. van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979. Google ScholarDigital Library
- W. Wibowo and H. E. Williams. Strategies for minimising errors in hierarchical web categorisation. In C. Nicholas, D. Grossman, K. Kalpakis, S. Qureshi, H. van Dissel, and L. Seligman, editors, Proc. International Conference on Information and Knowledge Management, pages 525--531, McLean, VA, 2002. Google ScholarDigital Library
- H. E. Williams. Indexing and Retrieval for Genomic Databases. PhD thesis, RMIT University, 1998.Google Scholar
- H. E. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3):193--201, 1999.Google ScholarDigital Library
- I. H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. J. Cunningham. Weka: Practical machine learning tools and techniques with Java implementations. In N. Kasabov and K. Ko, editors, Proc. ICONIP/ANZIIS/ANNES'99 International Workshop: Emerging Knowledge Engineering and Connectionist-Based Information Systems, pages 192-196, Dunedin, NZ, 1999.Google Scholar
- I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA, second edition, 1999. Google ScholarDigital Library
- Y. Yang. A study on thresholding strategies for text categorization. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 137--145, New Orleans, LA, 2001. ACM Press, NY. Google ScholarDigital Library
- Y. Yang and X. Liu. A re-examination of text categorization methods. In M. A. Hearst, F. Gey, and R. Tong, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 42-49, Berkeley, CA, August 1999. Google ScholarDigital Library
- Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In D. H. Fisher, editor, Proc. of ICML-97, 14th International Conference on Machine Learning, pages 412--420, Nashville, TX, 1997. Morgan Kaufmann. Google ScholarDigital Library
- J. Zobel and A. Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18--34, Spring 1998. Google ScholarDigital Library
Index Terms
- Index construction for linear categorisation
Recommendations
Indexing for fast categorisation
ACSC '03: Proceedings of the 26th Australasian computer science conference - Volume 16Automatic categorisation is an important technique for the management of large document collections. Categorisation can be used to store or locate documents that satisfy an information need when the need cannot be expressed as a concise list of query ...
Simple and accurate feature selection for hierarchical categorisation
DocEng '02: Proceedings of the 2002 ACM symposium on Document engineeringCategorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical ...
Fast construction of the HYB index
As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, ...
Comments