skip to main content
10.1145/956863.956926acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

Index construction for linear categorisation

Published:03 November 2003Publication History

ABSTRACT

Categorisation is a useful method for organising documents into subcollections that can be browsed or searched to more accurately and quickly meet information needs. On the Web, category-based portals such as Yahoo! and DMOZ are extremely popular: DMOZ is maintained by over 56,000 volunteers, is used as the basis of the popular Google directory, and is perhaps used by millions of users each day. Support Vector Machines (SVM) is a machine-learning algorithm which has been shown to be highly effective for automatic text categorisation. However, a problem with iterative training techniques such as SVM is that during their learning or training phase, they require the entire training collection to be held in main-memory; this is infeasible for large training collections such as DMOZ or large news wire feeds. In this paper, we show how inverted indexes can be used for scalable training in categorisation, and propose novel heuristics for a fast, accurate, and memory efficient approach. Our results show that an index can be constructed on a desktop workstation with little effect on categorisation accu-racy compared to a memory-based approach. We conclude that our techniques permit automatic categorisation using very large train-ing collections, vocabularies, and numbers of categories.

References

  1. R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. On feature distributional clustering for text categorization. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 146--153, New Orleans, LA, 2001. ACM Press, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121--167, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. In Proc. of 28th International Conference on Very Large Data Bases, pages 658--669, Hong Kong, China, August 2002. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. N. Craswell, D. Hawking, and S. Robertson. Effective site finding using link anchor information. In W.B. Croft, D.J. Harper, D.H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 250--257, New Orleans, LA, 2001. ACM Press, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. In C. Cardie and R. Weischedel, editors, Proc. Conference on Empirical Methods in NLP, pages 55--63, Providence, RI, 1997.Google ScholarGoogle Scholar
  6. P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194--203, March 1975.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. Journal of the American Society for Information Science and Technology. (To appear). Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In D. H. Fisher, editor, Proc. International Conference on Machine Learning, pages 143--151, Nashville, TN, July 1997. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. T. Joachims. Text categorisation with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proc. European Conference on Machine Learning, pages 137--142, Chemnitz, Germany, 1998. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In D.H. Fisher, editor, Proc. of the 14th International Conference on Machine Learning (ICML97), pages 170--178, Nashville, 1997. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. Lewis, R. Schapire, J. Callan, and R. Papka. Training algorithms for linear text classifiers. In H. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 296-306, Zurich, Switzerland, August 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. D. D. Lewis. Applying support vector machines to the TREC-2001 batch filtering and routing tasks. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, Gaithersburg, MD, 2001. NIST publication number SN003-003-03750-8.Google ScholarGoogle Scholar
  13. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at: http://www.cs.cmu.edu/mccallum/bow, 1996.Google ScholarGoogle Scholar
  15. S. Robertson and I. Soboroff. The TREC 2001 filtering track report. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, pages 26--37, Gaithersburg, MD, 2001. NIST publication number SN003-003-03750-8.Google ScholarGoogle Scholar
  16. S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, pages 253--264, Gaithersburg, MD, November 1998. NIST publication number SN003-003-03614-5.Google ScholarGoogle Scholar
  17. J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System Experiments in Automatic Document Processing, pages 313--323. Prentice-Hall, 1971.Google ScholarGoogle Scholar
  18. T. G. Rose, M. Stevenson, and M. Whitehead. The Reuters corpus volume 1 - from yesterday's news to tomorrow's language resources. In Proc. Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, May 2002.Google ScholarGoogle Scholar
  19. D. Heckerman S. T. Dumais, J. Platt and M. Sahami. Inductive learning algorithms and representations for text categorization. In K. Makki and L. Bouganim, editors, Proc. International Conference on Information and Knowledge Management, pages 148--155, Bethesda, MD, November 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Salton. Automatic Text Processing. Addison Wesley, Massachusetts, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In K. Järvelin, M. Beaulieu, R. Baeza-Yates, and S. H. Myaeng, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 222--229, Tampere, Finland, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. V. R. Shanks, H. E. Williams, and A. Cannane. Indexing for fast categorisation. In Proc. Australasian Computer Science Conference, pages 119-127, Adelaide, Australia, 2003. Australian Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalisation. Information Processing & Management, 32(5):619-633, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. C. J. van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. W. Wibowo and H. E. Williams. Strategies for minimising errors in hierarchical web categorisation. In C. Nicholas, D. Grossman, K. Kalpakis, S. Qureshi, H. van Dissel, and L. Seligman, editors, Proc. International Conference on Information and Knowledge Management, pages 525--531, McLean, VA, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. H. E. Williams. Indexing and Retrieval for Genomic Databases. PhD thesis, RMIT University, 1998.Google ScholarGoogle Scholar
  29. H. E. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3):193--201, 1999.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. I. H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. J. Cunningham. Weka: Practical machine learning tools and techniques with Java implementations. In N. Kasabov and K. Ko, editors, Proc. ICONIP/ANZIIS/ANNES'99 International Workshop: Emerging Knowledge Engineering and Connectionist-Based Information Systems, pages 192-196, Dunedin, NZ, 1999.Google ScholarGoogle Scholar
  31. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA, second edition, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Y. Yang. A study on thresholding strategies for text categorization. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 137--145, New Orleans, LA, 2001. ACM Press, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Y. Yang and X. Liu. A re-examination of text categorization methods. In M. A. Hearst, F. Gey, and R. Tong, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 42-49, Berkeley, CA, August 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In D. H. Fisher, editor, Proc. of ICML-97, 14th International Conference on Machine Learning, pages 412--420, Nashville, TX, 1997. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. J. Zobel and A. Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18--34, Spring 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Index construction for linear categorisation

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management
      November 2003
      592 pages
      ISBN:1581137230
      DOI:10.1145/956863

      Copyright © 2003 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 November 2003

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    • Article Metrics

      • Downloads (Last 12 months)0
      • Downloads (Last 6 weeks)0

      Other Metrics

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader