Detecting semantic cloaking on the web
Source: International World Wide Web Conference archive
Proceedings of the 15th International Conference on World Wide Web
Edinburgh, Scotland
SESSION: Industrial practice & experience
Pages: 819 - 828
Year of Publication: 2006
ISBN: 1-59593-323-9
Authors
Baoning Wu, Lehigh University, Bethlehem, PA
Brian D. Davison, Lehigh University, Bethlehem, PA
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1135777.1135901

ABSTRACT

By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
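
The two-step structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the filtering rule or the classifier features, so the term-difference heuristic, the `min_diff` threshold, and all function names below are assumptions chosen only to show how a cheap filter can feed a more expensive classifier, and how precision and recall would be computed against manually labeled pages.

```python
import re
from typing import Callable, Iterable, List, Set, Tuple

# (url, copy fetched as a crawler, copy fetched as a browser)
Page = Tuple[str, str, str]


def terms(html: str) -> Set[str]:
    """Lowercased word tokens after crudely stripping HTML tags."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_candidate(crawler_copy: str, browser_copy: str, min_diff: int = 3) -> bool:
    """Step 1 (filter): flag a page when the crawler copy contains enough
    terms that never appear in the browser copy.
    The heuristic and the min_diff threshold are hypothetical."""
    only_crawler = terms(crawler_copy) - terms(browser_copy)
    return len(only_crawler) >= min_diff


def detect(pages: Iterable[Page],
           classify: Callable[[str, str], bool]) -> List[str]:
    """Step 2: run the (assumed, caller-supplied) classifier only on
    pages that survived the filter, and return the flagged URLs."""
    return [url for url, crawler, browser in pages
            if is_candidate(crawler, browser) and classify(crawler, browser)]


def precision_recall(predicted: Iterable[str],
                     actual: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall of predicted URLs against labeled URLs."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall
```

In this sketch, `classify` stands in for whatever trained model is used in the second step; passing `lambda crawler, browser: True` reduces the pipeline to the filter alone, which is useful for checking how many candidates the first step produces.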



