Mining e-mail content for author identification forensics

Published: 1 December 2001

Abstract

We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus on the ability to discriminate between authors both when e-mail topics are aggregated and when the e-mails span different topics. An extended set of e-mail document features, including structural characteristics and linguistic patterns, was derived and used, together with a Support Vector Machine learning algorithm, to mine the e-mail content. Experiments on a number of e-mail documents written by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
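
To make the idea of structural and linguistic e-mail features concrete, the following is a minimal sketch in Python. It is not the feature set used in the article, which the abstract does not enumerate: the function name `extract_features`, the function-word list, and every individual feature are assumptions chosen purely for illustration. The sketch only turns an e-mail body into a numeric feature vector; it does not train a classifier.

```python
import re
from collections import Counter

# Hypothetical function-word list (an assumption for illustration only;
# the article's actual features are not given in the abstract).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")


def extract_features(body: str) -> dict:
    """Map one e-mail body to a small dict of illustrative style features."""
    lines = body.splitlines()
    words = re.findall(r"[A-Za-z']+", body.lower())
    n_words = max(len(words), 1)   # avoid division by zero on empty input
    n_lines = max(len(lines), 1)
    counts = Counter(words)

    features = {
        # Structural characteristics (illustrative choices)
        "n_lines": len(lines),
        "blank_line_ratio": sum(1 for ln in lines if not ln.strip()) / n_lines,
        "has_greeting": float(
            bool(re.match(r"\s*(hi|hello|dear)\b", body, re.IGNORECASE))
        ),
        # Linguistic patterns (illustrative choices)
        "avg_word_len": sum(len(w) for w in words) / n_words,
        "vocab_richness": len(counts) / n_words,
    }
    # Relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = counts[fw] / n_words
    return features


sample = "Hi Bob,\n\nThe report is in the folder.\n\nRegards"
vec = extract_features(sample)
```

Vectors of this kind, one per e-mail and labelled with the author, could then be passed to any Support Vector Machine implementation; which kernel and parameters suit the task is an empirical question that this sketch does not address.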


Published in

ACM SIGMOD Record, Volume 30, Issue 4
December 2001
104 pages
ISSN: 0163-5808
DOI: 10.1145/604264

Copyright © 2001 Authors

Publisher: Association for Computing Machinery, New York, NY, United States
