ABSTRACT
By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
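The two-step structure described above can be illustrated with a minimal sketch. The abstract does not specify which signals the filter or the classifier use, so everything below is an assumption made for illustration: the term-difference filter, the `threshold` value, and the toy `toy_classifier` rule are hypothetical stand-ins, not the method evaluated in the paper.

```python
import re
from typing import Callable, Dict, List, Set, Tuple

# Maps a URL to (copy fetched by the crawler, copy fetched by the browser).
PagePair = Tuple[str, str]


def terms(html: str) -> Set[str]:
    """Lowercased word tokens after crudely stripping markup tags."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def term_difference(crawler_copy: str, browser_copy: str) -> int:
    """Count terms that appear in only one of the two copies."""
    return len(terms(crawler_copy) ^ terms(browser_copy))


def filter_candidates(pages: Dict[str, PagePair], threshold: int = 3) -> List[str]:
    """Step 1 (assumed filter): keep URLs whose copies differ by more than `threshold` terms."""
    return [
        url
        for url, (crawler_copy, browser_copy) in pages.items()
        if term_difference(crawler_copy, browser_copy) > threshold
    ]


def detect(
    pages: Dict[str, PagePair],
    classify: Callable[[str, str], bool],
    threshold: int = 3,
) -> List[str]:
    """Step 2: run the classifier only on the candidates that passed the filter."""
    return [
        url
        for url in filter_candidates(pages, threshold)
        if classify(*pages[url])
    ]


def toy_classifier(crawler_copy: str, browser_copy: str) -> bool:
    """Hypothetical rule: flag pages that show the crawler many terms the browser never sees."""
    return len(terms(crawler_copy) - terms(browser_copy)) >= 5


pages = {
    "http://a.example/": (
        "<p>cheap flights hotels cars deals travel</p>",
        "<p>welcome</p>",
    ),
    "http://b.example/": ("<p>hello world</p>", "<p>hello world</p>"),
}
flagged = detect(pages, toy_classifier)
```

The point of the design is that the inexpensive filter shrinks the set of pages the classifier must examine. In practice the hand-written rule would be replaced by a classifier trained on manually labeled pages, as the abstract describes.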