DOI: 10.1145/1281192.1281237
Article

Raising the baseline for high-precision text classifiers

Published: 12 August 2007

Abstract

Many important application areas of text classifiers demand high precision, and it is common to compare prospective solutions to the performance of Naive Bayes. This baseline is usually easy to improve upon, but in this work we demonstrate that appropriate document representation can make outperforming this classifier much more challenging. Most importantly, we provide a link between Naive Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization. Motivated by document-specific feature selection, we propose monotonic constraints on document term weighting, which we show to be an effective method of fine-tuning document representation. The discussion is supported by experiments using three large email corpora corresponding to the problem of spam detection, where high precision is of particular importance.
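As a rough illustration of the idea (a sketch, not the paper's exact formulation), document length normalization turns the sum of per-term log-likelihood ratios in multinomial Naive Bayes into an average, i.e. the per-term evidence is combined as a geometric mean, which is a logarithmic opinion pool over the document's tokens. The function name and the toy `log_ratio` table below are hypothetical:

```python
from collections import Counter

def nb_log_odds(doc_tokens, log_ratio, prior_log_odds=0.0, normalize=True):
    """Score a document with multinomial Naive Bayes log-odds.

    log_ratio[t] = log P(t|spam) - log P(t|ham), estimated from training data.
    With normalize=True, the per-term contributions are averaged over the
    document's (in-vocabulary) length, so the per-term likelihood ratios are
    combined as a geometric mean -- a logarithmic opinion pool. A document
    repeated twice then gets the same score as the original.
    """
    counts = Counter(t for t in doc_tokens if t in log_ratio)
    n = sum(counts.values())
    if n == 0:
        return prior_log_odds  # no known terms: fall back on the prior
    score = sum(c * log_ratio[t] for t, c in counts.items())
    if normalize:
        score /= n
    return prior_log_odds + score
```

With normalization disabled this is the standard multinomial Naive Bayes log-odds; enabling it makes the score invariant to duplicating a document's content, one simple form of the length normalization the abstract refers to.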

Supplementary Material

JPG File (p400-kolcz-200.jpg)
JPG File (p400-kolcz-768.jpg)
Low Resolution (p400-kolcz-200.mov)
High Resolution (p400-kolcz-768.mov)



Published In

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. email spam detection
  2. high precision text classification
  3. low false positive rates
  4. naive bayes

Qualifiers

  • Article

Conference

KDD07

Acceptance Rates

KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


Cited By

  • (2023) Evaluation of language processing classification models. 2nd International Conference of Mathematics, Applied Sciences, Information and Communication Technology, 050017. DOI: 10.1063/5.0162627
  • (2016) The effectiveness of homogenous ensemble classifiers for Turkish and English texts. 2016 International Symposium on INnovations in Intelligent SysTems and Applications (INISTA), pages 1-7, August 2016. DOI: 10.1109/INISTA.2016.7571854
  • (2016) An efficient parameter estimation method for generalized Dirichlet priors in naïve Bayesian classifiers with multinomial models. Pattern Recognition, 60(C):62-71, December 2016. DOI: 10.1016/j.patcog.2016.04.019
  • (2015) Using Machine Learning and Probabilistic Frameworks to Enhance Incident and Problem Management. Maximizing Management Performance and Quality with Service Analytics, pages 259-298. DOI: 10.4018/978-1-4666-8496-6.ch010
  • (2015) Evaluation of classification models for language processing. 2015 International Symposium on Innovations in Intelligent SysTems and Applications (INISTA), pages 1-8, September 2015. DOI: 10.1109/INISTA.2015.7276787
  • (2014) Sentiment Analysis of Chinese Micro Blog Using Machine Learning and an Improved Feature Selection Method. Applied Mechanics and Materials, 631-632:1219-1223, September 2014. DOI: 10.4028/www.scientific.net/AMM.631-632.1219
  • (2014) Higher-Order Smoothing: A Novel Semantic Smoothing Method for Text Classification. Journal of Computer Science and Technology, 29(3):376-391, May 2014. DOI: 10.1007/s11390-014-1437-6
  • (2014) Generalized Dirichlet priors for Naïve Bayesian classifiers with multinomial models in document classification. Data Mining and Knowledge Discovery, 28(1):123-144, January 2014. DOI: 10.1007/s10618-012-0296-4
  • (2014) Topic Identification Strategy for English Academic Resources. Advances in Computer Science and its Applications, pages 1073-1078. DOI: 10.1007/978-3-642-41674-3_149
  • (2013) Automatic classification of documents in cold-start scenarios. Proceedings of the 3rd International Conference on Web Intelligence, Mining and Semantics, pages 1-10, June 2013. DOI: 10.1145/2479787.2479789
