A refinement approach to handling model misfit in text categorization
Source: Conference on Knowledge Discovery in Data
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Edmonton, Alberta, Canada
SESSION: Text classification
Pages: 207 - 216  
Year of Publication: 2002
ISBN:1-58113-567-X
Authors
Haoran Wu  National University of Singapore, Singapore
Tong Heng Phang  National University of Singapore, Singapore
Bing Liu  National University of Singapore, Singapore
Xiaoli Li  National University of Singapore, Singapore
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
AAAI
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/775047.775078

ABSTRACT

Text categorization, or classification, is the automated assignment of text documents to pre-defined classes based on their contents. The problem has been studied in information retrieval, machine learning and data mining, and many effective techniques have been proposed. However, most techniques rest on some underlying model and/or assumptions. When the data fits the model well, classification accuracy is high; when it does not, accuracy can be very low. In this paper, we propose a refinement approach to this problem of model misfit. We show that the classification technique itself (or its underlying model) need not be changed to make it more flexible. Instead, we use successive refinements of classification on the training data to correct the model misfit. We apply the proposed technique to two simple and efficient text classifiers, the Rocchio classifier and the naïve Bayesian classifier. These classifiers are suitable for very large text collections because they allow the data to reside on disk and need only one scan of the data to build a text classifier. Extensive experiments on two benchmark document corpora show that the proposed technique improves the text categorization accuracy of both classifiers dramatically. In particular, our refined model improves the prediction performance of the naïve Bayesian or Rocchio classifier by 45% on average.
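
The abstract does not spell out the refinement procedure, so the following is only an illustrative sketch, not the authors' algorithm. It assumes a simplified Rocchio-style classifier (one centroid per class, cosine similarity, no negative-example term) and assumes that "refinement" means: classify the training documents with the first-level model, group them by predicted class, and train a further classifier inside any group that still mixes true classes. All function names, the dictionary-based model layout, and the `depth` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Turn text into an L2-normalised bag-of-words vector (dict term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def centroid(vectors):
    """Average a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for term, weight in v.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_centroids(docs):
    """docs is a list of (vector, label) pairs; returns one centroid per label."""
    by_label = defaultdict(list)
    for vec, label in docs:
        by_label[label].append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def predict_flat(centroids, vec):
    """Assign the label of the most similar centroid (ties: first in sorted order)."""
    return max(sorted(centroids), key=lambda label: cosine(centroids[label], vec))


def train_refined(docs, depth=1):
    """Assumed refinement: re-train inside each group of training documents
    that the current model assigns to the same class but that mixes true labels."""
    model = {"centroids": train_centroids(docs), "children": {}}
    if depth == 0:
        return model
    groups = defaultdict(list)
    for vec, label in docs:
        groups[predict_flat(model["centroids"], vec)].append((vec, label))
    for predicted, group in groups.items():
        mixed = len({label for _, label in group}) > 1
        if mixed and len(group) < len(docs):
            model["children"][predicted] = train_refined(group, depth - 1)
    return model


def predict_refined(model, vec):
    """Follow first-level predictions down into any refinement classifiers."""
    predicted = predict_flat(model["centroids"], vec)
    child = model["children"].get(predicted)
    return predict_refined(child, vec) if child else predicted


if __name__ == "__main__":
    training = [
        (vectorize("ball goal team"), "sport"),
        (vectorize("goal team match"), "sport"),
        (vectorize("vote election party"), "politics"),
        (vectorize("party election law"), "politics"),
    ]
    refined = train_refined(training, depth=1)
    print(predict_refined(refined, vectorize("team goal")))
```

In this sketch the base learner is untouched; only the training data it sees changes from one level to the next, which is the property the abstract emphasises. When the first-level model already separates the training data cleanly, no second-level classifiers are built and prediction reduces to the plain centroid classifier.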


