Article

Index construction for linear categorisation

Authors:
Vaughan R. Shanks

RMIT University, Melbourne, Australia

RMIT University, Melbourne, Australia
View Profile

,
Hugh E. Williams

RMIT University, Melbourne, Australia

RMIT University, Melbourne, Australia
View Profile

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge managementNovember 2003Pages 334–341https://doi.org/10.1145/956863.956926

Published:03 November 2003Publication History

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management

Pages 334–341

ABSTRACT

Categorisation is a useful method for organising documents into subcollections that can be browsed or searched to more accurately and quickly meet information needs. On the Web, category-based portals such as Yahoo! and DMOZ are extremely popular: DMOZ is maintained by over 56,000 volunteers, is used as the basis of the popular Google directory, and is perhaps used by millions of users each day. Support Vector Machines (SVM) is a machine-learning algorithm which has been shown to be highly effective for automatic text categorisation. However, a problem with iterative training techniques such as SVM is that during their learning or training phase, they require the entire training collection to be held in main-memory; this is infeasible for large training collections such as DMOZ or large news wire feeds. In this paper, we show how inverted indexes can be used for scalable training in categorisation, and propose novel heuristics for a fast, accurate, and memory efficient approach. Our results show that an index can be constructed on a desktop workstation with little effect on categorisation accu-racy compared to a memory-based approach. We conclude that our techniques permit automatic categorisation using very large train-ing collections, vocabularies, and numbers of categories.

References

R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. On feature distributional clustering for text categorization. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 146--153, New Orleans, LA, 2001. ACM Press, NY. Google ScholarDigital Library
C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121--167, 1998. Google ScholarDigital Library
S. Chakrabarti, S. Roy, and M. Soundalgekar. Fast and accurate text classification via multiple linear discriminant projections. In Proc. of 28th International Conference on Very Large Data Bases, pages 658--669, Hong Kong, China, August 2002. Morgan Kaufmann. Google ScholarDigital Library
N. Craswell, D. Hawking, and S. Robertson. Effective site finding using link anchor information. In W.B. Croft, D.J. Harper, D.H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 250--257, New Orleans, LA, 2001. ACM Press, NY. Google ScholarDigital Library
I. Dagan, Y. Karov, and D. Roth. Mistake-driven learning in text categorization. In C. Cardie and R. Weischedel, editors, Proc. Conference on Empirical Methods in NLP, pages 55--63, Providence, RI, 1997.Google Scholar
P. Elias. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory, IT-21(2):194--203, March 1975.Google ScholarDigital Library
S. Heinz and J. Zobel. Efficient single-pass index construction for text databases. Journal of the American Society for Information Science and Technology. (To appear). Google ScholarDigital Library
T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In D. H. Fisher, editor, Proc. International Conference on Machine Learning, pages 143--151, Nashville, TN, July 1997. Morgan Kaufmann. Google ScholarDigital Library
T. Joachims. Text categorisation with support vector machines: Learning with many relevant features. In C. Nédellec and C. Rouveirol, editors, Proc. European Conference on Machine Learning, pages 137--142, Chemnitz, Germany, 1998. Springer. Google ScholarDigital Library
D. Koller and M. Sahami. Hierarchically classifying documents using very few words. In D.H. Fisher, editor, Proc. of the 14th International Conference on Machine Learning (ICML97), pages 170--178, Nashville, 1997. Morgan Kaufmann. Google ScholarDigital Library
D. Lewis, R. Schapire, J. Callan, and R. Papka. Training algorithms for linear text classifiers. In H. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 296-306, Zurich, Switzerland, August 1996. Google ScholarDigital Library
D. D. Lewis. Applying support vector machines to the TREC-2001 batch filtering and routing tasks. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, Gaithersburg, MD, 2001. NIST publication number SN003-003-03750-8.Google Scholar
C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, 1999. Google ScholarDigital Library
A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. Available at: http://www.cs.cmu.edu/mccallum/bow, 1996.Google Scholar
S. Robertson and I. Soboroff. The TREC 2001 filtering track report. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, pages 26--37, Gaithersburg, MD, 2001. NIST publication number SN003-003-03750-8.Google Scholar
S. E. Robertson, S. Walker, and M. Beaulieu. Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In E. M. Voorhees and D. K. Harman, editors, Proc. Text REtrieval Conference, pages 253--264, Gaithersburg, MD, November 1998. NIST publication number SN003-003-03614-5.Google Scholar
J. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System Experiments in Automatic Document Processing, pages 313--323. Prentice-Hall, 1971.Google Scholar
T. G. Rose, M. Stevenson, and M. Whitehead. The Reuters corpus volume 1 - from yesterday's news to tomorrow's language resources. In Proc. Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, May 2002.Google Scholar
D. Heckerman S. T. Dumais, J. Platt and M. Sahami. Inductive learning algorithms and representations for text categorization. In K. Makki and L. Bouganim, editors, Proc. International Conference on Information and Knowledge Management, pages 148--155, Bethesda, MD, November 1998. Google ScholarDigital Library
G. Salton. Automatic Text Processing. Addison Wesley, Massachusetts, 1989. Google ScholarDigital Library
G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983. Google ScholarDigital Library
F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. Compression of inverted indexes for fast query evaluation. In K. Järvelin, M. Beaulieu, R. Baeza-Yates, and S. H. Myaeng, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 222--229, Tampere, Finland, 2002. Google ScholarDigital Library
F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1):1--47, 2002. Google ScholarDigital Library
V. R. Shanks, H. E. Williams, and A. Cannane. Indexing for fast categorisation. In Proc. Australasian Computer Science Conference, pages 119-127, Adelaide, Australia, 2003. Australian Computer Society. Google ScholarDigital Library
A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalisation. Information Processing & Management, 32(5):619-633, 1996. Google ScholarDigital Library
C. J. van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979. Google ScholarDigital Library
W. Wibowo and H. E. Williams. Strategies for minimising errors in hierarchical web categorisation. In C. Nicholas, D. Grossman, K. Kalpakis, S. Qureshi, H. van Dissel, and L. Seligman, editors, Proc. International Conference on Information and Knowledge Management, pages 525--531, McLean, VA, 2002. Google ScholarDigital Library
H. E. Williams. Indexing and Retrieval for Genomic Databases. PhD thesis, RMIT University, 1998.Google Scholar
H. E. Williams and J. Zobel. Compressing integers for fast file access. Computer Journal, 42(3):193--201, 1999.Google ScholarDigital Library
I. H. Witten, E. Frank, L. Trigg, M. Hall, G. Holmes, and S. J. Cunningham. Weka: Practical machine learning tools and techniques with Java implementations. In N. Kasabov and K. Ko, editors, Proc. ICONIP/ANZIIS/ANNES'99 International Workshop: Emerging Knowledge Engineering and Connectionist-Based Information Systems, pages 192-196, Dunedin, NZ, 1999.Google Scholar
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, Los Altos, CA, second edition, 1999. Google ScholarDigital Library
Y. Yang. A study on thresholding strategies for text categorization. In W. B. Croft, D. J. Harper, D. H. Kraft, and J. Zobel, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 137--145, New Orleans, LA, 2001. ACM Press, NY. Google ScholarDigital Library
Y. Yang and X. Liu. A re-examination of text categorization methods. In M. A. Hearst, F. Gey, and R. Tong, editors, Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 42-49, Berkeley, CA, August 1999. Google ScholarDigital Library
Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In D. H. Fisher, editor, Proc. of ICML-97, 14th International Conference on Machine Learning, pages 412--420, Nashville, TX, 1997. Morgan Kaufmann. Google ScholarDigital Library
J. Zobel and A. Moffat. Exploring the similarity space. ACM SIGIR Forum, 32(1):18--34, Spring 1998. Google ScholarDigital Library

Index Terms

Index construction for linear categorisation
1. Information systems
  1. Information retrieval
    1. Evaluation of retrieval results

Recommendations

Indexing for fast categorisation
ACSC '03: Proceedings of the 26th Australasian computer science conference - Volume 16

Automatic categorisation is an important technique for the management of large document collections. Categorisation can be used to store or locate documents that satisfy an information need when the need cannot be expressed as a concise list of query ...
Read More
Simple and accurate feature selection for hierarchical categorisation
DocEng '02: Proceedings of the 2002 ACM symposium on Document engineering

Categorisation of digital documents is useful for organisation and retrieval. While document categories can be a set of unstructured category labels, some document categories are hierarchically structured. This paper investigates automatic hierarchical ...
Read More
Fast construction of the HYB index

As shown in a series of recent works, the HYB index is an alternative to the inverted index (INV) that enables very fast prefix searches, which in turn is the basis for fast processing of many other types of advanced queries, including autocompletion, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management
November 2003
592 pages
ISBN:1581137230
DOI:10.1145/956863
General Chair:
Donald Kraft
Louisiana State University
,
Program Chairs:
Ophir Frieder
Illinois Institute of Technology
,
Joachim Hammer
University of Florida
,
Sajda Qureshi
University of Nebraska, Omaha
,
Len Seligman
The MITRE Corporation
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 November 2003
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
categorisation
efficiency
index construction
support vector machines
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 619
  Total Downloads
- Downloads (Last 12 months)0
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Index construction for linear categorisation

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Indexing for fast categorisation

Simple and accurate feature selection for hierarchical categorisation

Fast construction of the HYB index

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Index construction for linear categorisation

CIKM '03: Proceedings of the twelfth international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Indexing for fast categorisation

Simple and accurate feature selection for hierarchical categorisation

Fast construction of the HYB index

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media