DOI: 10.1145/1964114.1964121
research-article

Web spam classification: a few features worth more

Published: 28 March 2011

ABSTRACT

In this paper we investigate how much various classes of Web spam features, some requiring very high computational effort, add to classification accuracy. We find that advances in machine learning, an area that has received less attention in the adversarial IR community, yield more improvement than new features and result in low-cost yet accurate spam filters. Our original contributions are as follows:

• We collect and handle a large number of features based on recent advances in Web spam filtering.

• We show that machine learning techniques including ensemble selection, LogitBoost and Random Forest significantly improve accuracy.

• We conclude that, with appropriate learning techniques, a small and computationally inexpensive feature subset outperforms all results previously published on our data set and can be improved only slightly by computationally expensive features.

• We test our method on two major publicly available data sets, the Web Spam Challenge 2008 data set WEB-SPAM-UK2007 and the ECML/PKDD Discovery Challenge data set DC2010.

Our classifier ensemble reaches an improvement of 5% in AUC over the Web Spam Challenge 2008 best result; more importantly, our improvement is 3.5% based solely on fewer than 100 inexpensive content features and 5% if a small-vocabulary bag of words representation is included. For DC2010 we improve over the best achieved NDCG for spam by 7.5%, and by over 5% using only inexpensive content features and a small bag of words representation.
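To make the two central notions of the abstract concrete, the following is a minimal Python sketch of AUC and of a simplified greedy ensemble selection over a library of models. It is an illustration under stated assumptions, not the authors' implementation: the model names, the validation scores, the plain averaging of scores, and the stop-when-no-improvement rule are all hypothetical choices made for brevity.

```python
from typing import Dict, List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random spam host (label 1) is scored
    above a random non-spam host (label 0); ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ensemble_selection(
    library: Dict[str, List[float]],
    labels: Sequence[int],
    rounds: int = 10,
) -> List[str]:
    """Greedy forward selection (with replacement) over a model library.

    `library` maps a hypothetical model name to its scores on a held-out
    validation set. Each round adds the model whose inclusion gives the
    best AUC of the averaged ensemble; stops early if nothing improves.
    """
    chosen: List[str] = []
    total = [0.0] * len(labels)  # running sum of the chosen models' scores
    best_auc = 0.0
    for _ in range(rounds):
        best_name = None
        for name, scores in library.items():
            trial = [(t + s) / (len(chosen) + 1) for t, s in zip(total, scores)]
            trial_auc = auc(labels, trial)
            if trial_auc > best_auc:
                best_auc, best_name = trial_auc, name
        if best_name is None:
            break
        chosen.append(best_name)
        total = [t + s for t, s in zip(total, library[best_name])]
    return chosen


if __name__ == "__main__":
    # Hypothetical validation labels (1 = spam) and per-model scores.
    labels = [1, 1, 0, 0]
    library = {
        "content": [0.9, 0.4, 0.5, 0.1],
        "link": [0.3, 0.8, 0.2, 0.6],
    }
    picked = ensemble_selection(library, labels)
    print(picked, auc(labels, library["content"]))
```

In this toy setup each model alone misranks one spam/non-spam pair, while their average ranks every spam host above every non-spam host, which is the kind of gain an ensemble can offer without any new features.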

Published in

WebQuality '11: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality
March 2011
55 pages
ISBN: 9781450307062
DOI: 10.1145/1964114

Copyright © 2011 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
