DOI: 10.1145/1281192.1281237
Article

Raising the baseline for high-precision text classifiers

Published: 12 August 2007

Abstract

Many important application areas of text classifiers demand high precision, and it is common to compare prospective solutions to the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection, we propose monotonic constraints on document term weighting, which we show to be an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.
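As a rough illustration of the idea (a sketch, not the paper's exact formulation), document length normalization turns the sum of per-term log-likelihood ratios in multinomial Naive Bayes into an average, i.e. the per-term evidence is combined as a geometric mean, which is a logarithmic opinion pool over the document's tokens. The function name and the toy `log_ratio` table below are hypothetical:

```python
from collections import Counter

def nb_log_odds(doc_tokens, log_ratio, prior_log_odds=0.0, normalize=True):
    """Score a document with multinomial Naive Bayes log-odds.

    log_ratio[t] = log P(t|spam) - log P(t|ham), estimated from training data.
    With normalize=True, the per-term contributions are averaged over the
    document's (in-vocabulary) length, so the per-term likelihood ratios are
    combined as a geometric mean -- a logarithmic opinion pool. A document
    repeated twice then gets the same score as the original.
    """
    counts = Counter(t for t in doc_tokens if t in log_ratio)
    n = sum(counts.values())
    if n == 0:
        return prior_log_odds  # no known terms: fall back on the prior
    score = sum(c * log_ratio[t] for t, c in counts.items())
    if normalize:
        score /= n
    return prior_log_odds + score
```

With normalization disabled this is the standard multinomial Naive Bayes log-odds; enabling it makes the score invariant to duplicating a document's content, one simple form of the length normalization the abstract refers to.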

Supplementary Material

JPG File (p400-kolcz-200.jpg)
JPG File (p400-kolcz-768.jpg)
Low Resolution (p400-kolcz-200.mov)
High Resolution (p400-kolcz-768.mov)



Published In

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. email spam detection
  2. high precision text classification
  3. low false positive rates
  4. naive bayes

Qualifiers

  • Article

Conference

KDD07

Acceptance Rates

KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


Cited By

  • (2023) Evaluation of language processing classification models. 2nd International Conference of Mathematics, Applied Sciences, Information and Communication Technology, 050017. DOI: 10.1063/5.0162627
  • (2016) The effectiveness of homogenous ensemble classifiers for Turkish and English texts. 2016 International Symposium on INnovations in Intelligent SysTems and Applications (INISTA), pages 1-7, August 2016. DOI: 10.1109/INISTA.2016.7571854
  • (2016) An efficient parameter estimation method for generalized Dirichlet priors in naïve Bayesian classifiers with multinomial models. Pattern Recognition, 60(C):62-71, December 2016. DOI: 10.1016/j.patcog.2016.04.019
  • (2015) Using Machine Learning and Probabilistic Frameworks to Enhance Incident and Problem Management. Maximizing Management Performance and Quality with Service Analytics, pages 259-298. DOI: 10.4018/978-1-4666-8496-6.ch010
  • (2015) Evaluation of classification models for language processing. 2015 International Symposium on Innovations in Intelligent SysTems and Applications (INISTA), pages 1-8, September 2015. DOI: 10.1109/INISTA.2015.7276787
  • (2014) Sentiment Analysis of Chinese Micro Blog Using Machine Learning and an Improved Feature Selection Method. Applied Mechanics and Materials, 631-632:1219-1223, September 2014. DOI: 10.4028/www.scientific.net/AMM.631-632.1219
  • (2014) Higher-Order Smoothing: A Novel Semantic Smoothing Method for Text Classification. Journal of Computer Science and Technology, 29(3):376-391, May 2014. DOI: 10.1007/s11390-014-1437-6
  • (2014) Generalized Dirichlet priors for Naïve Bayesian classifiers with multinomial models in document classification. Data Mining and Knowledge Discovery, 28(1):123-144, January 2014. DOI: 10.1007/s10618-012-0296-4
  • (2014) Topic Identification Strategy for English Academic Resources. Advances in Computer Science and its Applications, pages 1073-1078. DOI: 10.1007/978-3-642-41674-3_149
  • (2013) Automatic classification of documents in cold-start scenarios. Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, pages 1-10, June 2013. DOI: 10.1145/2479787.2479789
