DOI: 10.1145/1244408.1244411
Article

Improving web spam classification using rank-time features

Published: 08 May 2007

ABSTRACT

In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them a higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification, and we note that this problem arises in learning problems generally and can be hard to detect. In particular, we find that ensuring no overlapping domains between test and training sets is necessary to accurately test a web spam classifier. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
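The domain-separation requirement described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names, the `test_fraction` and `seed` parameters, and the use of the URL host as a stand-in for "domain" are all assumptions made for illustration. The idea is to assign whole domains, rather than individual pages, to either the training or the test set.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Return the host of a URL (assumption: host is a proxy for 'domain')."""
    return (urlsplit(url).hostname or "").lower()


def domain_separated_split(urls, test_fraction=0.2, seed=0):
    """Split URLs so that no host appears in both training and test sets."""
    # Group pages by host so each group is assigned as a unit.
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[domain_of(url)].append(url)

    # Shuffle the hosts (not the pages) reproducibly.
    domains = sorted(by_domain)
    random.Random(seed).shuffle(domains)

    # Hold out a fraction of hosts, at least one, for testing.
    n_test = max(1, round(len(domains) * test_fraction))
    test_domains = set(domains[:n_test])

    train = [u for d in domains if d not in test_domains for u in by_domain[d]]
    test = [u for d in domains if d in test_domains for u in by_domain[d]]
    return train, test
```

A page-level random split, by contrast, can place pages from the same site on both sides, which is the kind of overlap the abstract warns can inflate measured precision.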

Published in

AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
May 2007
98 pages
ISBN: 9781595937322
DOI: 10.1145/1244408

Copyright © 2007 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
