Partitioned logistic regression for spam filtering

Published: 24 August 2008
DOI: 10.1145/1401890.1401907

ABSTRACT

Naive Bayes and logistic regression perform well in different regimes. While the former is a very simple generative model that is efficient to train and performs well empirically in many applications, the latter is a discriminative model that often achieves better accuracy and can be shown to outperform naive Bayes asymptotically. In this paper, we propose a novel hybrid model, partitioned logistic regression, which has several advantages over both naive Bayes and logistic regression. This model separates the original feature space into several disjoint feature groups. Individual models are learned on these groups of features using logistic regression, and their predictions are combined using the naive Bayes principle to produce a robust final estimate. We show that our model is better both theoretically and empirically. In addition, when applied to a practical application, email spam filtering, it improves the normalized AUC score at a 10% false-positive rate by 28.8% and 23.6% over naive Bayes and logistic regression, respectively, using exactly the same training examples.
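The model just described is straightforward to sketch. Below is a minimal illustration in Python, not the authors' implementation: it assumes the disjoint feature groups are given as lists of column indices, uses scikit-learn's LogisticRegression as the per-group learner, and combines the per-group posteriors under the naive Bayes principle, where (assuming the groups are conditionally independent given the class) the combined log-odds is the sum of the per-group log-odds minus (k - 1) copies of the prior log-odds. All names here are hypothetical.

# A minimal sketch of partitioned logistic regression, NOT the authors' code.
# Assumptions: feature groups are disjoint lists of column indices, and
# scikit-learn's LogisticRegression serves as each per-group learner.
import numpy as np
from sklearn.linear_model import LogisticRegression

def _logit(p):
    """Log-odds of a probability, clipped for numerical stability."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return np.log(p / (1.0 - p))

class PartitionedLogisticRegression:
    def __init__(self, feature_groups, C=1.0):
        self.feature_groups = feature_groups  # disjoint column-index lists
        self.models = [LogisticRegression(C=C, max_iter=1000)
                       for _ in feature_groups]

    def fit(self, X, y):
        # Train one logistic regression per feature group, each seeing
        # only its own slice of the feature space.
        for model, cols in zip(self.models, self.feature_groups):
            model.fit(X[:, cols], y)
        # Prior log-odds of the positive (spam) class, used when combining.
        self.prior_logit_ = _logit(float(np.mean(y)))
        return self

    def predict_positive_proba(self, X):
        # Naive Bayes combination: if the groups are conditionally
        # independent given the class, the posterior log-odds equals the
        # sum of per-group log-odds minus (k - 1) copies of the prior.
        k = len(self.models)
        log_odds = sum(
            _logit(m.predict_proba(X[:, cols])[:, 1])
            for m, cols in zip(self.models, self.feature_groups)
        ) - (k - 1) * self.prior_logit_
        return 1.0 / (1.0 + np.exp(-log_odds))

In a spam filter, the groups could plausibly separate, for example, message-content features from sender or header features; the abstract does not fix a particular partition, so the grouping passed to the constructor is an assumption of this sketch.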

Published in

KDD '08: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2008, 1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions, 20%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
