Partitioned logistic regression for spam filtering

Published: 24 August 2008
DOI: 10.1145/1401890.1401907

ABSTRACT

Naive Bayes and logistic regression perform well in different regimes. While the former is a very simple generative model that is efficient to train and performs well empirically in many applications, the latter is a discriminative model that often achieves better accuracy and can be shown to outperform naive Bayes asymptotically. In this paper, we propose a novel hybrid model, partitioned logistic regression, which has several advantages over both naive Bayes and logistic regression. This model separates the original feature space into several disjoint feature groups. Individual models are learned on these groups of features using logistic regression, and their predictions are combined using the naive Bayes principle to produce a robust final estimate. We show that our model is better both theoretically and empirically. In addition, when applied to a practical application, email spam filtering, it improves the normalized AUC score at a 10% false-positive rate by 28.8% and 23.6% over naive Bayes and logistic regression, respectively, using exactly the same training examples.
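The model just described is straightforward to sketch. Below is a minimal illustration in Python, not the authors' implementation: it assumes the disjoint feature groups are given as lists of column indices, uses scikit-learn's LogisticRegression as the per-group learner, and combines the per-group posteriors under the naive Bayes principle, where (assuming the groups are conditionally independent given the class) the combined log-odds is the sum of the per-group log-odds minus (k - 1) copies of the prior log-odds. All names here are hypothetical.

# A minimal sketch of partitioned logistic regression, NOT the authors' code.
# Assumptions: feature groups are disjoint lists of column indices, and
# scikit-learn's LogisticRegression serves as each per-group learner.
import numpy as np
from sklearn.linear_model import LogisticRegression

def _logit(p):
    """Log-odds of a probability, clipped for numerical stability."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return np.log(p / (1.0 - p))

class PartitionedLogisticRegression:
    def __init__(self, feature_groups, C=1.0):
        self.feature_groups = feature_groups  # disjoint column-index lists
        self.models = [LogisticRegression(C=C, max_iter=1000)
                       for _ in feature_groups]

    def fit(self, X, y):
        # Train one logistic regression per feature group, each seeing
        # only its own slice of the feature space.
        for model, cols in zip(self.models, self.feature_groups):
            model.fit(X[:, cols], y)
        # Prior log-odds of the positive (spam) class, used when combining.
        self.prior_logit_ = _logit(float(np.mean(y)))
        return self

    def predict_positive_proba(self, X):
        # Naive Bayes combination: if the groups are conditionally
        # independent given the class, the posterior log-odds equals the
        # sum of per-group log-odds minus (k - 1) copies of the prior.
        k = len(self.models)
        log_odds = sum(
            _logit(m.predict_proba(X[:, cols])[:, 1])
            for m, cols in zip(self.models, self.feature_groups)
        ) - (k - 1) * self.prior_logit_
        return 1.0 / (1.0 + np.exp(-log_odds))

In a spam filter, the groups could plausibly separate, for example, message-content features from sender or header features; the abstract does not fix a particular partition, so the grouping passed to the constructor is an assumption of this sketch.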

Published in

KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions, 20%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
