ABSTRACT
Web spam is behavior that attempts to deceive search engine ranking algorithms. TrustRank is a recent algorithm that can combat web spam. However, TrustRank is vulnerable in that its seed set may not be representative enough to cover the different topics on the Web well. Also, for a given seed set, TrustRank is biased towards larger communities. To address these issues, we propose using topical information to partition the seed set and calculating trust scores for each topic separately. A combination of these trust scores for a page is used to determine its ranking. Experimental results on two large datasets show that our Topical TrustRank performs better than TrustRank in demoting spam sites or pages. Compared to TrustRank, our best technique can decrease spam among the top-ranked sites by as much as 43.1%.
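The partition-then-combine idea can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes TrustRank is computed as a PageRank biased so that random jumps land only on seed pages, and it assumes the per-topic scores are combined by a plain sum, since the abstract does not say which combination is used. The names `trust_rank`, `topical_trust_rank`, `out_links`, and `seed_topics`, the damping value 0.85, and the toy graph are all hypothetical.

```python
from collections import defaultdict


def trust_rank(out_links, seeds, alpha=0.85, iterations=50):
    """Biased PageRank sketch: random jumps land only on the seed pages."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    jump = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * jump[n] for n in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            if not targets:
                continue  # simplification: trust at pages without out-links is dropped
            share = alpha * score[u] / len(targets)
            for v in targets:
                nxt[v] += share
        score = nxt
    return score


def topical_trust_rank(out_links, seed_topics, alpha=0.85, iterations=50):
    """Run one trust propagation per topic, then combine the results.

    Assumption: the combination is a plain sum of the per-topic scores.
    """
    seeds_by_topic = defaultdict(set)
    for page, topic in seed_topics.items():
        seeds_by_topic[topic].add(page)

    combined = defaultdict(float)
    for seeds in seeds_by_topic.values():
        for page, s in trust_rank(out_links, seeds, alpha, iterations).items():
            combined[page] += s
    return dict(combined)


# Toy graph: two small trusted communities and a spam pair that links to "a".
out_links = {
    "a": ["b"], "b": ["a"],          # topic "arts"
    "c": ["d"], "d": ["c"],          # topic "science"
    "s1": ["s2", "a"], "s2": ["s1"], # spam pages, not linked from trusted pages
}
seed_topics = {"a": "arts", "c": "science"}
scores = topical_trust_rank(out_links, seed_topics)
```

In this toy graph no trusted page links to the spam pair, so the spam pages receive no trust from either topic, while each topic's seed propagates trust within its own community regardless of how large the other community is.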