Abstract
We describe an investigation into mining e-mail content for author identification (authorship attribution) in support of forensic investigation. We focus on the ability to discriminate between authors both when e-mail topics are aggregated and across different e-mail topics. An extended set of e-mail document features, including structural characteristics and linguistic patterns, was derived and used together with a Support Vector Machine learning algorithm to mine the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
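As a rough illustration of the feature-derivation step, the sketch below maps an e-mail body to a fixed-length vector mixing structural and linguistic measurements. The abstract does not list the actual features, so every feature here (blank-line ratio, greeting flag, average word length, vocabulary richness, function-word frequencies), the `FUNCTION_WORDS` list, and the `extract_features` name are assumptions chosen for illustration only. Vectors of this kind could then be passed to a Support Vector Machine classifier; that step is not shown.

```python
import re

# Hypothetical function-word list; the paper's real feature set is not
# given in the abstract.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is")


def extract_features(email_body):
    """Map one e-mail body to a fixed-length list of style features."""
    lines = email_body.splitlines()
    words = re.findall(r"[A-Za-z']+", email_body.lower())
    n_words = max(len(words), 1)  # avoid division by zero on empty bodies
    n_lines = max(len(lines), 1)

    # Structural characteristics (illustrative choices).
    blank_line_ratio = sum(1 for ln in lines if not ln.strip()) / n_lines
    greeting = re.match(r"\s*(hi|hello|dear)\b", email_body, re.IGNORECASE)
    has_greeting = 1.0 if greeting else 0.0

    # Linguistic patterns (illustrative choices).
    avg_word_len = sum(len(w) for w in words) / n_words
    vocab_richness = len(set(words)) / n_words
    fw_freqs = [words.count(fw) / n_words for fw in FUNCTION_WORDS]

    return [blank_line_ratio, has_greeting, avg_word_len, vocab_richness] + fw_freqs


if __name__ == "__main__":
    sample = "Hi Bob,\n\nThe report is in the mail.\n"
    print(extract_features(sample))
```

Topic-independent measurements such as function-word frequencies are a common choice in stylometry because they depend less on what a message is about, which matters when authors must be told apart across different topics.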