ABSTRACT
Text categorization, or classification, is the automated assignment of text documents to predefined classes based on their contents. The problem has been studied in information retrieval, machine learning, and data mining, and many effective techniques have been proposed. Most of these techniques, however, rest on some underlying model and/or assumptions. When the data fits the model well, classification accuracy is high; when it does not, accuracy can be very low. In this paper, we propose a refinement approach to this problem of model misfit. We show that the classification technique itself (or its underlying model) does not need to be changed to make it more flexible. Instead, we use successive refinements of classification on the training data to correct the model misfit. We apply the proposed technique to improve the classification performance of two simple and efficient text classifiers, the Rocchio classifier and the naïve Bayesian classifier. These classifiers are suitable for very large text collections because they allow the data to reside on disk and need only one scan of the data to build a text classifier. Extensive experiments on two benchmark document corpora show that the proposed technique improves the text categorization accuracy of both classifiers dramatically. In particular, our refined model improves the prediction performance of the naïve Bayesian or Rocchio classifier by 45% on average.
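As an illustration of the one-scan property mentioned in the abstract, the following is a minimal sketch of a Rocchio-style base classifier. It is not the refinement technique proposed in the paper, which the abstract does not specify. The function names, the whitespace tokenizer, the unit-length term-frequency weighting, and the defaults `alpha=16.0` and `beta=4.0` are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Unit-length term-frequency vector (assumed weighting, whitespace tokens)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def train_rocchio(stream, alpha=16.0, beta=4.0):
    """Build one prototype per class from a single pass over (text, label) pairs."""
    class_sum = defaultdict(lambda: defaultdict(float))  # per-class vector sums
    class_n = Counter()                                  # documents per class
    total_sum = defaultdict(float)                       # sum over all documents
    total_n = 0
    for text, label in stream:  # the only scan of the training data
        pos = class_sum[label]
        class_n[label] += 1
        total_n += 1
        for term, weight in tf_vector(text).items():
            pos[term] += weight
            total_sum[term] += weight

    prototypes = {}
    for label, pos in class_sum.items():
        n_pos = class_n[label]
        n_neg = total_n - n_pos
        proto = {}
        for term, total in total_sum.items():
            p = pos.get(term, 0.0)
            w = alpha * p / n_pos
            if n_neg:
                # Negative-example sum is recovered as (total - class sum),
                # so no second pass over the data is needed.
                w -= beta * (total - p) / n_neg
            if w > 0:  # negative weights are clipped to zero
                proto[term] = w
        prototypes[label] = proto
    return prototypes


def classify(prototypes, text):
    """Return the label whose prototype has the highest cosine similarity."""
    vec = tf_vector(text)

    def cosine(proto):
        norm = math.sqrt(sum(w * w for w in proto.values()))
        if not norm:
            return 0.0
        return sum(w * proto.get(t, 0.0) for t, w in vec.items()) / norm

    return max(prototypes, key=lambda label: cosine(prototypes[label]))
```

Because training only accumulates per-class and global sums, the input can be a generator that streams documents from disk, and each document is read exactly once.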