Mining e-mail content for author identification forensics

Authors:
O. de Vel, Defence Science and Technology Organisation, Salisbury, Australia
A. Anderson, Queensland University of Technology, Brisbane, Australia
M. Corney, Queensland University of Technology, Brisbane, Australia
G. Mohay, Queensland University of Technology, Brisbane, Australia

ACM SIGMOD Record, Volume 30, Issue 4, December 2001, pp. 55–64. https://doi.org/10.1145/604264.604272

Published: 01 December 2001

Abstract

We describe an investigation into e-mail content mining for author identification, or authorship attribution, for the purpose of forensic investigation. We focus our discussion on the ability to discriminate between authors both for aggregated e-mail topics and across different e-mail topics. An extended set of e-mail document features, including structural characteristics and linguistic patterns, was derived and, together with a Support Vector Machine learning algorithm, was used for mining the e-mail content. Experiments using a number of e-mail documents generated by different authors on a set of topics gave promising results for both aggregated and multi-topic author categorisation.
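
The pipeline outlined in the abstract (per-message features fed to a Support Vector Machine) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not list the actual features, so `FUNCTION_WORDS` and the four structural and lexical measures are hypothetical stand-ins, and the trainer is a simple Pegasos-style sub-gradient method for a linear SVM, not the solver used by the authors.

```python
import re

# Hypothetical function-word list; the paper's real feature set is not given in the abstract.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")


def extract_features(body):
    """Map an e-mail body to a small vector of structural and linguistic features."""
    lines = body.splitlines()
    words = re.findall(r"[A-Za-z']+", body.lower())
    n_lines = max(len(lines), 1)
    n_words = max(len(words), 1)
    features = [
        sum(1 for ln in lines if not ln.strip()) / n_lines,              # blank-line ratio
        1.0 if re.match(r"\s*(hi|hello|dear)\b", body, re.I) else 0.0,  # opens with a greeting
        sum(len(w) for w in words) / n_words,                            # mean word length
        len(set(words)) / n_words,                                       # type-token ratio
    ]
    # Relative frequency of each (assumed) function word.
    features += [words.count(fw) / n_words for fw in FUNCTION_WORDS]
    return features


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(X, y, lam=0.01, epochs=200):
    """Train a linear SVM by sub-gradient descent on the L2-regularised hinge loss.

    Labels in y must be +1 or -1. A constant 1.0 is appended to every
    example so that the last weight acts as a (regularised) bias term.
    """
    Xb = [list(x) + [1.0] for x in X]
    w = [0.0] * len(Xb[0])
    t = 0
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            t += 1
            eta = 1.0 / (lam * t)                    # decaying step size
            margin = yi * _dot(w, xi)                # margin under the current weights
            w = [(1.0 - eta * lam) * wj for wj in w] # shrink (regularisation step)
            if margin < 1:                           # hinge loss active: move toward the example
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
    return w


def predict(w, x):
    """Return +1 or -1 for a feature vector x (bias constant appended as in training)."""
    return 1 if _dot(w, list(x) + [1.0]) >= 0 else -1
```

For a two-author case, one would call `extract_features` on each training message, label one author +1 and the other -1, pass the vectors to `train_linear_svm`, and apply `predict` to the features of a disputed message. More than two authors would need a scheme such as one-against-all, which this sketch does not cover.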

Index Terms

Mining e-mail content for author identification forensics
1. Information systems
  1. Information systems applications
    1. Data mining

Published in

ACM SIGMOD Record Volume 30, Issue 4
December 2001
104 pages
ISSN: 0163-5808
DOI: 10.1145/604264

Copyright © 2001 Authors

Publisher: Association for Computing Machinery, New York, NY, United States