Improving web spam classification using rank-time features

Authors:
Krysta M. Svore, Microsoft Research, Redmond, WA
Qiang Wu, Microsoft Research, Redmond, WA
Chris J. C. Burges, Microsoft Research, Redmond, WA
Aaswath Raman, Microsoft Redmond, WA

AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 2007, Pages 9–16
https://doi.org/10.1145/1244408.1244411

Published: 08 May 2007

ABSTRACT

In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them a higher rank, thus increasing their site traffic. Our contributions are twofold. First, we find that the method of dataset construction is crucial for accurate spam classification; we note that this problem arises generally in learning problems and can be hard to detect. In particular, we find that ensuring that no domains overlap between the test and training sets is necessary to accurately test a web spam classifier. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show that rank-time features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent rank-time features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.
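The domain-separation requirement described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the abstract does not say how a domain is defined or how the split was drawn, so the sketch assumes that the URL hostname stands in for the domain and that whole hostnames are assigned at random to either the training or the test set. The function names `host_of` and `domain_separated_split` and the example URLs are hypothetical.

```python
import random
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL (assumed stand-in for 'domain')."""
    return (urlsplit(url).hostname or "").lower()


def domain_separated_split(urls, test_fraction=0.2, seed=0):
    """Split URLs so that no hostname appears in both train and test.

    Hostnames, not individual pages, are sampled into the test set, so
    every page of a given host lands on exactly one side of the split.
    """
    hosts = sorted({host_of(u) for u in urls})
    rng = random.Random(seed)
    rng.shuffle(hosts)
    # Keep at least one host for testing when there is more than one host.
    n_test = max(1, round(len(hosts) * test_fraction)) if len(hosts) > 1 else 0
    test_hosts = set(hosts[:n_test])
    train = [u for u in urls if host_of(u) not in test_hosts]
    test = [u for u in urls if host_of(u) in test_hosts]
    return train, test


# Hypothetical example data: five pages spread over three hosts.
urls = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://c.example/1",
    "http://c.example/2",
]
train, test = domain_separated_split(urls, test_fraction=0.4)
overlap = {host_of(u) for u in train} & {host_of(u) for u in test}
print("hosts shared between train and test:", len(overlap))
```

A page-level random split would instead allow pages from the same host on both sides, which is the non-domain-separated setting that the abstract reports can change measured precision by as much as 40%.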

Published in
AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
May 2007
98 pages
ISBN: 9781595937322
DOI: 10.1145/1244408
Conference Chairs: Carlos Castillo (Yahoo! Research), Kumar Chellapilla (Microsoft Live Labs), Brian D. Davison (Lehigh University)
Copyright © 2007 ACM
Publisher
Association for Computing Machinery
New York, NY, United States