ABSTRACT
Handling massive datasets is difficult not only because of prohibitively large numbers of entries but, in some cases, also because of the very high dimensionality of the data. Often, aggressive feature selection is performed to limit the number of attributes to a manageable size, which unfortunately can discard useful information. Feature space reduction may well be necessary for many stand-alone classifiers, but recent advances in ensemble classifier techniques indicate that accurate classifier aggregates can be learned even when each individual classifier operates on incomplete "feature view" training data, i.e., data from which certain input attributes have been excluded. In fact, using only small random subsets of features to build the individual component classifiers can yield surprisingly accurate and robust models. In this work we demonstrate how such architectures effectively reduce the feature space seen by sub-models and groups of sub-models, a reduction that lends itself to efficient sequential and/or parallel implementations. We support our arguments with experiments on a randomized version of AdaBoost, using text classification as an example task.
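To make the idea concrete, the sketch below is a minimal Python/NumPy illustration, not the authors' implementation: an AdaBoost.M1-style loop in which each boosting round restricts its weak learner (here a decision stump, our own choice) to a small random "feature view". The subset size n_view_features and all function names are illustrative assumptions.

```python
# Minimal sketch (assumed details, not the paper's exact algorithm):
# AdaBoost.M1 with each round's decision stump trained on a random
# "feature view", i.e., a small random subset of the input attributes.
import numpy as np

def fit_stump(X, y, w):
    """Return the best stump (feature, threshold, polarity, error) under weights w."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, t, pol, err)
    return best

def randomized_adaboost(X, y, n_rounds=50, n_view_features=5, seed=0):
    """Boost stumps, each restricted to a random feature subset; y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        view = rng.choice(d, size=min(n_view_features, d), replace=False)
        j, t, pol, err = fit_stump(X[:, view], y, w)
        err = max(err, 1e-10)
        if err >= 0.5:
            continue  # skip weak learners no better than chance on this view
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, view[j]] - t) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)  # standard AdaBoost reweighting
        w /= w.sum()
        ensemble.append((view[j], t, pol, alpha))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of all stumps."""
    score = np.zeros(X.shape[0])
    for j, t, pol, alpha in ensemble:
        score += alpha * np.where(pol * (X[:, j] - t) >= 0, 1, -1)
    return np.sign(score)
```

Note the efficiency property the abstract alludes to: because each round touches only its view, per-round training cost scales with the view size rather than the full dimensionality, and independently drawn views make groups of sub-models natural units for parallel training.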