|
ABSTRACT
In this paper, we continue our investigations of "web spam": the injection of artificially-created pages into the web in order to influence the results from search engines, to drive traffic to certain pages for fun or profit. This paper considers some previously-undescribed techniques for automatically detecting spam pages, examines the effectiveness of these techniques in isolation and when aggregated using classification algorithms. When combined, our heuristics correctly identify 2,037 (86.2%) of the 2,364 spam pages (13.8%) in our judged collection of 17,168 pages, while misidentifying 526 spam and non-spam pages (3.1%).
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
S. Adali, T. Liu and M. Magdon-Ismail. Optimal Link Bombs are Uncoordinated. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
|
 |
2
|
Einat Amitay , David Carmel , Adam Darlow , Ronny Lempel , Aya Soffer, The connectivity sonar: detecting site functionality by structural patterns, Proceedings of the fourteenth ACM conference on Hypertext and hypermedia, August 26-30, 2003, Nottingham, UK
[doi> 10.1145/900051.900060]
|
| |
3
|
R. Baeza-Yates, C. Castillo and V. López. PageRank Increase under Different Collusion Topologies. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
|
| |
4
|
A. Benczúr, K. Csalogány, T. Sarlós and M. Uher. SpamRank -- Fully Automatic Link Spam Detection. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
|
| |
5
|
|
| |
6
|
U.S. Census Bureau. Quarterly Retail E-Commerce Sales -- 4th Quarter 2004. http://www.census.gov/mrts/www/data/html/04Q4.html (dated Feb. 2005, visited Sept. 2005)
|
| |
7
|
B. Davison. Recognizing Nepotistic Links on the Web. In AAAI-2000 Workshop on Artificial Intelligence for Web Search, July 2000.
|
 |
8
|
Dennis Fetterly , Mark Manasse , Marc Najork, Spam, damn spam, and statistics: using statistical analysis to locate spam web pages, Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004, June 17-18, 2004, Paris, France
[doi> 10.1145/1017074.1017077]
|
 |
9
|
|
| |
10
|
|
| |
11
|
Z. Gyöngyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. In 30th International Conference on Very Large Data Bases, Aug. 2004.
|
| |
12
|
|
| |
13
|
Z. Gyöngyi and H. Garcia-Molina. Web Spam Taxonomy. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
|
| |
14
|
GZIP. http://www.gzip.org/
|
 |
15
|
|
 |
16
|
|
| |
17
|
B. Jansen and A. Spink. An Analysis of Web Documents Retrieved and Viewed. In International Conference on Internet Computing, June 2003.
|
| |
18
|
C. Johnson. US eCommerce: 2005 To 2010. http://www.forrester.com/Research/Document/Excerpt/0,7211,37626,00.html (dated Sept. 2005, visited Sept. 2005)
|
| |
19
|
|
| |
20
|
P. Metaxas and J. DeStefano. Web Spam, Propaganda and Trust. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
|
| |
21
|
G. Mishne, D. Carmel and R. Lempel. Blocking Blog Spam with Language Model Disagreement. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
|
| |
22
|
MSN Search. http://search.msn.com/
|
| |
23
|
J. Nielsen. Statistics for Traffic Referred by Search Engines and Navigation Directories to Useit. http://useit.com/about/searchreferrals.html (dated April 2004, visited Sept. 2005)
|
| |
24
|
L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Stanford Digital Library Technologies Project, 1998.
|
| |
25
|
A. Perkins. The Classification of Search Engine Spam. http://www.silverdisc.co.uk/articles/spam-classification/ (dated Sept. 2001, visited Sept. 2005)
|
| |
26
|
|
| |
27
|
J. R. Quinlan. Bagging, Boosting, and C4.5. In 13th National Conference on Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conference, Vol. 1, 725--730, Aug. 1996.
|
| |
28
|
M. Sahami, S. Dumais, D. Heckerman and E. Horvitz. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the 1998 Workshop, AAAI Technical Report WS-98-05, 1998.
|
 |
29
|
|
| |
30
|
B. Wu and B. Davison. Cloaking and Redirection: a preliminary study. In 1st International Workshop on Adversarial Information Retrieval on the Web, May 2005.
|
| |
31
|
H. Zhang, A. Goel, R. Govindan, K. Mason and B. Van Roy. Making Eigenvector-Based Systems Robust to Collusion. In 3rd International Workshop on Algorithms and Models for the Web Graph, Oct. 2004.
|
CITED BY 27
|
|
|
|
|
|
|
|
|
|
|
|
|
Yu-Ru Lin , Hari Sundaram , Yun Chi , Junichi Tatemura , Belle L. Tseng, Splog detection using self-similarity analysis on blog temporal dynamics, Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 08-08, 2007, Banff, Alberta, Canada
|
|
Krysta M. Svore , Qiang Wu , Chris J. C. Burges , Aaswath Raman, Improving web spam classification using rank-time features, Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 08-08, 2007, Banff, Alberta, Canada
|
|
|
|
|
|
Carlos Castillo , Debora Donato , Luca Becchetti , Paolo Boldi , Stefano Leonardi , Massimo Santini , Sebastiano Vigna, A reference collection for web spam, ACM SIGIR Forum, v.40 n.2, p.11-24, December 2006
|
|
Yi-Min Wang , Ming Ma , Yuan Niu , Hao Chen, Spam double-funnel: connecting web spammers with advertisers, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
|
|
|
|
|
|
|
|
András Benczúr , István Bíró , Károly Csalogány , Tamás Sarlós, Web spam detection via commercial intent analysis, Proceedings of the 3rd international workshop on Adversarial information retrieval on the web, May 08-08, 2007, Banff, Alberta, Canada
|
|
|
|
Ziming Zhuang , Ergin Elmacioglu , Dongwon Lee , C. Lee Giles, Measuring conference quality by mining program committee characteristics, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
|
|
|
|
|
|
Carlos Castillo , Debora Donato , Aristides Gionis , Vanessa Murdock , Fabrizio Silvestri, Know your neighbors: web spam detection using the web topology, Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 23-27, 2007, Amsterdam, The Netherlands
|
|
|
|
|
|
|
|
|
|
|
|
|
|