ABSTRACT
In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them a higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification, and we note that this problem occurs generally in learning problems and can be hard to detect. In particular, we find that ensuring that no domain appears in both the training and test sets is necessary to test a web spam classifier accurately. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
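The domain-separation requirement can be illustrated with a short sketch. This is not the procedure used in the paper, only one plausible way to build such a split. The function names `domain_of` and `domain_separated_split`, the example URLs, and the use of the URL host name as the "domain" are all assumptions made for illustration. A real pipeline might instead group pages by registered domain.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host name of a URL.

    Assumption: the host name stands in for the 'domain'. URLs without a
    scheme have no parsed host and all fall into one empty-string group.
    """
    return (urlparse(url).hostname or "").lower()


def domain_separated_split(urls, test_fraction=0.2, seed=0):
    """Split URLs so that no domain appears in both training and test sets."""
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[domain_of(url)].append(url)

    # Shuffle the domains (not the pages) with a fixed seed for repeatability.
    domains = sorted(by_domain)
    random.Random(seed).shuffle(domains)

    # Assign whole domains to the test set until it holds roughly the
    # requested share of pages; every remaining domain goes to training.
    target = test_fraction * len(urls)
    train, test = [], []
    for domain in domains:
        bucket = test if len(test) < target else train
        bucket.extend(by_domain[domain])
    return train, test


# Hypothetical example data: three domains with two pages each.
example_urls = [
    "http://a.example/page1", "http://a.example/page2",
    "http://b.example/page1", "http://b.example/page2",
    "http://c.example/page1", "http://c.example/page2",
]
train_urls, test_urls = domain_separated_split(example_urls, test_fraction=0.4)

# The set of domains shared by the two splits should be empty.
overlap = {domain_of(u) for u in train_urls} & {domain_of(u) for u in test_urls}
```

Because whole domains are assigned at once, the test set only approximates the requested fraction. A page-level random split would hit the fraction exactly, but it would let pages from one domain land on both sides, which is the situation the abstract warns against.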
Index Terms
- Improving web spam classification using rank-time features