ABSTRACT
In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows:
• We collect and handle a large number of features based on recent advances in Web spam filtering.
• We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.
• We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all previously published results on our data set and can be improved only slightly by adding computationally expensive features.
• We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEB-SPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010.
Our classifier ensemble improves AUC by 5% over the best Web Spam Challenge 2008 result; more importantly, the improvement is 3.5% based solely on fewer than 100 inexpensive content features, and 5% if a small-vocabulary bag-of-words representation is included. For DC2010 we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using inexpensive content features and a small bag-of-words representation.
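To make the two central notions of the abstract concrete, the following is a minimal sketch of AUC and of greedy ensemble selection over a library of model predictions. It is not the authors' implementation: the function names `auc` and `ensemble_selection`, the `max_rounds` parameter, and the toy scores are all assumptions made for illustration, and the selection loop is a simplified forward selection with replacement that averages the predictions of the chosen models.

```python
from typing import Dict, List, Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (spam, non-spam) pairs ranked correctly.

    Ties count as half a correct pair. Labels: 1 = spam, 0 = non-spam.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ensemble_selection(
    library: Dict[str, List[float]],
    labels: Sequence[int],
    max_rounds: int = 10,  # hypothetical cap, chosen for illustration
) -> List[str]:
    """Greedily add the model that most improves the AUC of the averaged
    prediction; a model may be picked more than once. Stops as soon as no
    candidate strictly improves the validation AUC."""
    chosen: List[str] = []
    total = [0.0] * len(labels)  # running sum of chosen predictions
    best = 0.0
    for _ in range(max_rounds):
        best_name = None
        for name, preds in library.items():
            averaged = [(t + p) / (len(chosen) + 1) for t, p in zip(total, preds)]
            score = auc(averaged, labels)
            if score > best:
                best, best_name = score, name
        if best_name is None:
            break
        chosen.append(best_name)
        total = [t + p for t, p in zip(total, library[best_name])]
    return chosen


# Toy validation data (made up): two weak models that complement each other.
labels = [1, 1, 0, 0]
library = {
    "a": [0.9, 0.4, 0.5, 0.1],
    "b": [0.6, 0.7, 0.2, 0.65],
}
print(ensemble_selection(library, labels))
```

In this toy example each model on its own ranks three of the four spam/non-spam pairs correctly, while their average ranks all four correctly, which is the kind of gain an ensemble can provide without any new features.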
Index Terms
- Web spam classification: a few features worth more