Web spam classification: a few features worth more

Authors:
Miklós Erdélyi (Hungarian Academy of Sciences and University of Pannonia, Veszprém)
András Garzó (Hungarian Academy of Sciences)
András A. Benczúr (Hungarian Academy of Sciences)

WebQuality '11: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, March 2011, Pages 27–34. https://doi.org/10.1145/1964114.1964121

Published: 28 March 2011

ABSTRACT

In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows:

• We collect and handle a large number of features based on recent advances in Web spam filtering.

• We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.

• We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all previously published results on our data set, and that computationally expensive features improve on it only slightly.

• We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEB-SPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010.
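
As an illustration of the ensemble selection idea named in the list above, the following is a minimal sketch of greedy forward selection with replacement over a library of models, in the spirit of Caruana et al. It is not the authors' configuration: the model names, the validation scores, and the `rounds` parameter are made-up assumptions, and AUC is computed with a simple pairwise count that only suits small examples.

```python
from typing import Dict, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) count."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count spam/non-spam pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ensemble_select(
    library: Dict[str, List[float]],
    labels: Sequence[int],
    rounds: int = 10,
) -> List[str]:
    """Greedy forward selection with replacement over a model library.

    `library` maps a (hypothetical) model name to its scores on a held-out
    validation set. Each round adds the model that gives the best AUC for
    the summed scores (rank-equivalent to averaging); selection stops as
    soon as no model improves the AUC.
    """
    chosen: List[str] = []
    total = [0.0] * len(labels)  # running sum of the chosen models' scores
    best = 0.0
    for _ in range(rounds):
        candidates = []
        for name, scores in library.items():
            merged = [t + s for t, s in zip(total, scores)]
            candidates.append((auc(labels, merged), name))
        top_auc, top_name = max(candidates)
        if chosen and top_auc <= best:
            break
        chosen.append(top_name)
        total = [t + s for t, s in zip(total, library[top_name])]
        best = top_auc
    return chosen


# Toy validation data (1 = spam, 0 = non-spam); all values are illustrative.
labels = [1, 1, 1, 0, 0, 0]
library = {
    "random_forest": [0.9, 0.4, 0.8, 0.5, 0.2, 0.1],
    "logitboost": [0.3, 0.9, 0.7, 0.2, 0.6, 0.1],
}
selected = ensemble_select(library, labels)
```

In this toy library each model misranks one spam/non-spam pair on its own, while the combination ranks all pairs correctly, which is the kind of gain ensemble selection is meant to find.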

Our classifier ensemble improves AUC by 5% over the best Web Spam Challenge 2008 result; more importantly, the improvement is 3.5% using fewer than 100 inexpensive content features alone, and 5% if a small-vocabulary bag of words representation is included. For DC2010, we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using only inexpensive content features and a small bag of words representation.
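
For readers unfamiliar with the NDCG measure reported for DC2010, the sketch below computes it for a ranked list of hosts. It assumes the common formulation with the label as gain and a log2(rank + 1) discount; the exact variant used in the DC2010 evaluation is not stated here, and the labels and scores are made up.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(labels: Sequence[float], scores: Sequence[float]) -> float:
    """NDCG of the ranking induced by `scores`, highest score first."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal = sorted(labels, reverse=True)  # best possible ordering
    best = dcg(ideal)
    return dcg(ranked) / best if best > 0 else 0.0


# Illustrative data: 1 = spam, 0 = non-spam, scores from some classifier.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_ndcg = ndcg(toy_labels, toy_scores)
```

A ranking that places every spam host above every non-spam host scores 1.0; the toy ranking above scores lower because one non-spam host outranks a spam host.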


Published in

WebQuality '11: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
March 2011
55 pages
ISBN: 9781450307062
DOI: 10.1145/1964114

Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
document classification
ensemble classification
feature selection
hyperlink analysis
information retrieval
machine learning
web quality
web spam