research-article

Latent Dirichlet allocation in web spam filtering

Authors: István Bíró, Jácint Szabó, András A. Benczúr

AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web

Pages 29 - 32

https://doi.org/10.1145/1451983.1451991

Published: 22 April 2008

Abstract

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to web spam classification. We create a bag-of-words document for every Web site and run LDA separately on the corpus of sites labeled as spam and on the corpus of sites labeled as non-spam. In this way, collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus and reach a relative improvement of 11% in F-measure by a logistic regression based combination with strong link and content baseline classifiers.
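The training and test phases described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy vocabulary, the topic counts, the hyperparameters `alpha` and `beta`, the 0.5 threshold, and the helper names `train_lda`, `infer_topics`, and `spam_score` are all hypothetical. Inference on an unseen site uses a simple EM-style fold-in with the topics held fixed, which is an assumption and may differ from the inference procedure used in the paper.

```python
import random

def train_lda(docs, vocab_size, k, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns k topic-word distributions."""
    rng = random.Random(seed)
    n_dk = [[0] * k for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(k)]  # word counts per topic
    n_k = [0] * k                                # total tokens per topic

    def bump(d, t, w, delta):
        n_dk[d][t] += delta
        n_kw[t][w] += delta
        n_k[t] += delta

    # Random initial topic assignment for every token.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            bump(d, t, w, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                bump(d, z[d][i], w, -1)          # remove the current assignment
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + vocab_size * beta) for j in range(k)]
                z[d][i] = rng.choices(range(k), weights)[0]  # resample the topic
                bump(d, z[d][i], w, +1)

    return [[(n_kw[j][w] + beta) / (n_k[j] + vocab_size * beta)
             for w in range(vocab_size)] for j in range(k)]

def infer_topics(doc, phi, alpha=0.1, iters=50):
    """Estimate the topic mixture of an unseen document with topics phi fixed
    (EM-style fold-in; an assumption made for this sketch)."""
    k = len(phi)
    theta = [1.0 / k] * k
    for _ in range(iters):
        counts = [alpha] * k
        for w in doc:
            post = [theta[j] * phi[j][w] for j in range(k)]
            total = sum(post)
            for j in range(k):
                counts[j] += post[j] / total
        norm = sum(counts)
        theta = [c / norm for c in counts]
    return theta

def spam_score(doc, spam_phi, ham_phi):
    """Total probability mass on spam topics within the union of both collections."""
    theta = infer_topics(doc, spam_phi + ham_phi)
    return sum(theta[:len(spam_phi)])

# Hypothetical toy data: one bag-of-words document per site.
vocab = {w: i for i, w in enumerate(
    ["casino", "pills", "cheap", "library", "lecture", "museum"])}

def encode(text):
    return [vocab[w] for w in text.split()]

spam_sites = [encode(t) for t in ["casino pills cheap casino",
                                  "cheap pills pills casino",
                                  "cheap cheap casino pills"]]
ham_sites = [encode(t) for t in ["library lecture museum library",
                                 "museum lecture lecture library",
                                 "museum museum library lecture"]]

# Training phase: one LDA run per labeled corpus.
spam_phi = train_lda(spam_sites, len(vocab), k=2, seed=1)
ham_phi = train_lda(ham_sites, len(vocab), k=2, seed=2)

# Test phase: threshold the total spam topic probability of an unseen site.
THRESHOLD = 0.5
for text in ["cheap casino pills cheap", "lecture library museum lecture"]:
    score = spam_score(encode(text), spam_phi, ham_phi)
    print(text, round(score, 3), "spam" if score > THRESHOLD else "non-spam")
```

In a combined system such as the one the abstract describes, the spam topic probabilities would be passed as features to a logistic regression alongside the link and content baselines; that step is omitted here.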

References

[1] J. Abernethy, O. Chapelle, and C. Castillo. WITCH: A New Approach to Web Spam Detection. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2008.

[2] E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The Connectivity Sonar: Detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia (HT), pages 38--47, Nottingham, United Kingdom, 2003.

[3] I. Bhattacharya and L. Getoor. A latent Dirichlet model for unsupervised entity resolution. SIAM International Conference on Data Mining, 2006.

[4] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993--1022, 2003.

[5] A. Bratko, B. Filipič, G. Cormack, T. Lynam, and B. Zupan. Spam Filtering Using Statistical Data Compression Models. The Journal of Machine Learning Research, 7:2673--2698, 2006.

[6] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, DELIS -- Dynamically Evolving, Large-Scale Information Systems, 2006.

[7] G. Cormack. Content-based Web Spam Detection. In Proceedings of the 3rd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2007.

[8] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. Proc. CVPR, 5, 2005.

[9] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics -- Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB), pages 1--6, Paris, France, 2004.

[10] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR), Salvador, Brazil, 2005.

[11] T. Griffiths. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl_1):5228--5235, 2004.

[12] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), Chiba, Japan, 2005.

[13] G. Heinrich. Parameter estimation for text analysis. Technical report, 2004.

[14] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11--22, 2002.

[15] T. Hofmann. Probabilistic latent semantic indexing. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50--57, 1999.

[16] T. Lynam, G. Cormack, and D. Cheriton. On-line spam filter fusion. Proc. of the 29th international ACM SIGIR conference on Research and development in information retrieval, pages 123--130, 2006.

[17] T. Minka and J. Lafferty. Expectation-propagation for the generative aspect model. Uncertainty in Artificial Intelligence (UAI), 2002.

[18] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW), pages 83--92, Edinburgh, Scotland, 2006.

[19] X.-H. Phan. http://gibbslda.sourceforge.net/.

[20] A. Singhal. Challenges in running a commercial search engine. In IBM Search and Collaboration Seminar 2004. IBM Haifa Labs, 2004.

[21] A. Singhal, G. Salton, M. Mitra, and C. Buckley. Document length normalization. Information Processing and Management, 32(5):619--633, 1996.

[22] J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. Discovering Objects and their Localization in Images. Computer Vision, ICCV 2005. Tenth IEEE International Conference on, 1, 2005.

[23] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.

[24] D. Xing and M. Girolami. Employing Latent Dirichlet Allocation for fraud detection in telecommunications. Pattern Recognition Letters, 28(13):1727--1734, 2007.

Cited By

Ojha A, Chakravarty A (2023). Identification and Analysis of Email Spam using Filtering Techniques. 2023 International Conference on Advances in Computation, Communication and Information Technology (ICAICCIT), pages 1260-1263. Online publication date: 23-Nov-2023. https://doi.org/10.1109/ICAICCIT60255.2023.10465913

Gourisaria M, Chandra S, Das H, Patra S, Sahni M, Leon-Castro E, Singh V, Kumar S (2022). Semantic Analysis and Topic Modelling of Web-Scrapped COVID-19 Tweet Corpora through Data Mining Methodologies. Healthcare, 10:5 (881). Online publication date: 10-May-2022. https://doi.org/10.3390/healthcare10050881

Odden T, Marin A, Caballero M (2020). Thematic analysis of 18 years of physics education research conference proceedings using natural language processing. Physical Review Physics Education Research, 16:1. Online publication date: 29-Jun-2020. https://doi.org/10.1103/PhysRevPhysEducRes.16.010142

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

AIRWeb '08: Proceedings of the 4th international workshop on Adversarial information retrieval on the web

April 2008, 81 pages

ISBN: 9781605581590

DOI: 10.1145/1451983

Editors: Carlos Castillo (Yahoo! Research), Kumar Chellapilla (Microsoft Live Labs), Dennis Fetterly (Microsoft Research)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

AIRWeb '08: 4th International Workshop on Adversarial Information Retrieval on the Web

April 22, 2008

Beijing, China
