|
ABSTRACT
Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbor, ridge regression, linear least square fit and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation on the usage of different classifiers in a hierarchical setting of categorization, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we are able to obtain tight bounds for the complexities by using the power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported on, as concrete examples.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
M. Berry. Large-scale singular value computations. volume 6-1, pages 13--49, 1992.
|
| |
2
|
|
| |
3
|
R. F. E. Osuna and F. Girosi. An improved training algorithm for support vector machines. In Neural Networks for Signal Processing VII-Proceedings of 1997 IEEE Workshop, New York, 1995.
|
 |
4
|
Michalis Faloutsos , Petros Faloutsos , Christos Faloutsos, On power-law relationships of the Internet topology, Proceedings of the conference on Applications, technologies, architectures, and protocols for computer communication, p.251-262, August 30-September 03, 1999, Cambridge, Massachusetts, United States
|
| |
5
|
|
| |
6
|
|
| |
7
|
T. Joachims. The Maximum-Margin Approach to Learning Text Classifiers: Methods, Theory, and Algorithms. Ph.D. thesis, University of Dortmund, 2000.
|
| |
8
|
D. Lewis, F. Li, T. Rose, and Y. Yang. The reuters corpus volume i as a text categorization test collection. In Journal of Machine Learning Research, page (to appear), 2003.
|
 |
9
|
David D. Lewis , Robert E. Schapire , James P. Callan , Ron Papka, Training algorithms for linear text classifiers, Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, p.298-306, August 18-22, 1996, Zurich, Switzerland
[doi> 10.1145/243199.243277]
|
| |
10
|
F. Li and Y. Yang. A loss function analysis for classification methods in text categorization. In ICML, 2003 (submitted).
|
| |
11
|
A. Newell and P. Rosenbloom. Mechanisms of skill acquisition and the law of practice. In J. Anderson, editor, Cognitive Skills and Their Acquisition, pages chapter 1, pp 1--55, Hillsdale, NJ, 1981. Lawrence Erlbaum Associates, Inc.
|
| |
12
|
J. Platt. Sequetial minimal optimization: A fast algorithm for training support vector machines. In Technical Report MST-TR-98-14. Microsoft Research, 1998.
|
| |
13
|
S. Robertson and S. Walker. Microsoft cambridge at trec-9. In D. Harmon, editor, Proceedings of the Nineth Text REtrieval Conference (TREC-9), 2001.
|
| |
14
|
V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.
|
| |
15
|
|
 |
16
|
|
| |
17
|
|
 |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
|
CITED BY 13
|
|
|
|
|
|
|
|
|
|
Tie-Yan Liu , Yiming Yang , Hao Wan , Hua-Jun Zeng , Zheng Chen , Wei-Ying Ma, Support vector machines classification with a very large-scale taxonomy, ACM SIGKDD Explorations Newsletter, v.7 n.1, p.36-43, June 2005
|
|
|
|
|
|
|
|
Bin Gao , Tie-Yan Liu , Guang Feng , Tao Qin , Qian-Sheng Cheng , Wei-Ying Ma, Hierarchical Taxonomy Preparation for Text Categorization Using Consistent Bipartite Spectral Graph Copartitioning, IEEE Transactions on Knowledge and Data Engineering, v.17 n.9, p.1263-1273, September 2005
|
|
Dikan Xing , Gui-Rong Xue , Qiang Yang , Yong Yu, Deep classifier: automatically categorizing search results into large-scale hierarchies, Proceedings of the international conference on Web search and web data mining, February 11-12, 2008, Palo Alto, California, USA
|
|
|
|
|
|
|
|
Nayer M. Wanas , Dina A. Said , Nadia H. Hegazy , Nevin M. Darwish, A study of local and global thresholding techniques in text categorization, Proceedings of the fifth Australasian conference on Data mining and analystics, p.91-101, November 29-30, 2006, Sydney, Australia
|
|
|
Peer to Peer - Readers of this Article have also read:
-
Constructing reality
Proceedings of the 11th annual international conference on Systems documentation
Douglas A. Powell
, Norman R. Ball
, Mansel W. Griffiths
-
M4: a metamodel for data preprocessing
Proceedings of the 4th ACM international workshop on Data warehousing and OLAP
Anca Vaduva
, Jörg-Uwe Kietz
, Regina Zücker
-
Data structures for quadtree approximation and compression
Communications of the ACM
28, 9
Hanan Samet
-
A hierarchical single-key-lock access control using the Chinese remainder theorem
Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing
Kim S. Lee
, Huizhu Lu
, D. D. Fisher
-
The GemStone object database management system
Communications of the ACM
34, 10
Paul Butterworth
, Allen Otis
, Jacob Stein
|