DOI: 10.1145/1451983.1451991

Latent Dirichlet allocation in web spam filtering

Published: 22 April 2008

Abstract

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to web spam classification. We create a bag-of-words document for every Web site and run LDA separately on the corpus of sites labeled as spam and on the corpus of sites labeled as non-spam. In this way, collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus and reach a relative improvement of 11% in F-measure by a logistic-regression-based combination with strong link and content baseline classifiers.
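
The test-phase decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the topic-word distributions are assumed to be already available as plain dictionaries, the topic mixture of an unseen site is estimated with a simple EM fold-in (a stand-in for LDA inference, which is more commonly done with Gibbs sampling or variational methods), and the names `infer_topic_mixture`, `spam_topic_probability`, `is_spam`, the toy topics, and the default threshold of 0.5 are all hypothetical.

```python
from collections import Counter


def infer_topic_mixture(doc_words, topics, n_iter=50, smoothing=1e-9):
    """Estimate a document's topic mixture with the topic-word distributions fixed.

    Assumption: a plain EM fold-in is used here instead of full LDA inference.
    `topics` is a list of dicts mapping word -> probability under that topic.
    """
    counts = Counter(doc_words)
    k = len(topics)
    theta = [1.0 / k] * k
    if not counts:
        return theta  # no evidence: keep the uniform mixture
    for _ in range(n_iter):
        new_theta = [0.0] * k
        for word, n in counts.items():
            # E-step: how strongly each topic explains this word.
            weights = [theta[t] * (topics[t].get(word, 0.0) + smoothing)
                       for t in range(k)]
            total = sum(weights)
            for t in range(k):
                new_theta[t] += n * weights[t] / total
        # M-step: renormalise the expected counts into a distribution.
        norm = sum(new_theta)
        theta = [value / norm for value in new_theta]
    return theta


def spam_topic_probability(doc_words, spam_topics, ham_topics):
    """Total probability mass assigned to spam topics in the union model."""
    union = spam_topics + ham_topics  # spam topics come first
    theta = infer_topic_mixture(doc_words, union)
    return sum(theta[:len(spam_topics)])


def is_spam(doc_words, spam_topics, ham_topics, threshold=0.5):
    """Deem a site spam if its total spam topic probability exceeds the threshold."""
    return spam_topic_probability(doc_words, spam_topics, ham_topics) > threshold


# Toy topics (hypothetical); in practice these would come from the two LDA runs.
toy_spam_topics = [{"casino": 0.5, "pills": 0.5}]
toy_ham_topics = [{"research": 0.5, "library": 0.5}]
print(is_spam(["casino", "pills", "casino"], toy_spam_topics, toy_ham_topics))
```

In a fuller pipeline, the value returned by `spam_topic_probability` could also serve as one input feature to a combined classifier alongside link and content features, which is the kind of combination the abstract reports.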

References

[1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.
[2] E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The Connectivity Sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HT), pages 38--47, Nottingham, United Kingdom, 2003.
[3] I. Bhattacharya and L. Getoor. A latent dirichlet model for unsupervised entity resolution. SIAM International Conference on Data Mining, 2006.
[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993--1022, 2003.
[5] A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research, 7:2673--2698, 2006.
[6] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, DELIS -- Dynamically Evolving, Large-Scale Information Systems, 2006.
[7] G. Cormack. Content-based Web Spam Detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2007.
[8] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. Proc. CVPR, 5, 2005.
[9] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics -- Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1--6, Paris, France, 2004.
[10] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.
[11] T. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1):5228--5235, 2004.
[12] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.
[13] G. Heinrich. Parameter estimation for text analysis. Technical report, 2004.
[14] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11--22, 2002.
[15] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50--57, 1999.
[16] T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. Proc. of the 29th international ACM SIGIR conference on Research and development in information retrieval, pages 123--130, 2006.
[17] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. Uncertainty in Artificial Intelligence (UAI), 2002.
[18] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83--92, Edinburgh, Scotland, 2006.
[19] X.-H. Phan. http://gibbslda.sourceforge.net/.
[20] A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004.
[21] A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalization. Information Processing and Management, 32(5):619--633, 1996.
[22] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering Objects and their Localization in Images. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV), volume 1, 2005.
[23] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.
[24] D. Xing and M. Girolami. Employing Latent Dirichlet Allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13):1727--1734, 2007.

Published In

AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web
April 2008
81 pages
ISBN:9781605581590
DOI:10.1145/1451983
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document classification
  2. feature selection
  3. information retrieval
  4. latent dirichlet allocation
  5. text analysis
  6. web content spam

Qualifiers

  • Research-article

Conference

AIRWeb'08
