research-article

A study of link farm distribution and evolution using a time series of web snapshots

Authors:

Young-joo Chung,

Masashi Toyoda,

Masaru Kitsuregawa

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web

Pages 9 - 16

https://doi.org/10.1145/1531914.1531917

Published: 21 April 2009

Abstract

In this paper, we study the overall structure of link-based spam and its evolution, which should help in developing robust analysis tools and in researching Web spamming as a social activity in cyberspace. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, the so-called core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we find new large link farms during each iteration, and this trend continues for at least 10 iterations. In addition, we measure the spamicity of such link farms. Next, we examine the evolution of link farms over two years. Results show that almost all large link farms no longer grow, while some of them shrink, and that many large link farms are created in one year.
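
The extraction loop described in the abstract (SCC decomposition, node filtering on the core, then decomposition again) might be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say which filtering criterion is used, so the degree threshold that rises by one per iteration, the `min_size` cutoff, and the function names `strongly_connected_components` and `extract_link_farms` are all assumptions made for the example. SCCs are computed with Kosaraju's algorithm, which the paper does not necessarily use.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (iterative). Returns a list of node sets."""
    node_set = set(nodes)
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        # Only keep edges whose endpoints are both in the current node set.
        if u in node_set and v in node_set:
            fwd[u].append(v)
            rev[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in node_set:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        comp, todo = {start}, [start]
        while todo:
            node = todo.pop()
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    comp.add(prev)
                    todo.append(prev)
        components.append(comp)
    return components


def extract_link_farms(nodes, edges, max_iterations=10, min_size=2):
    """Peel non-core SCCs off the core, iteration by iteration.

    Returns a list of (iteration, component) pairs. The filter below
    (in- and out-degree >= iteration + 1 inside the core) is a
    hypothetical stand-in for the paper's node filtering step.
    """
    farms = []
    core = set(nodes)
    for iteration in range(1, max_iterations + 1):
        if len(core) < 2:
            break
        comps = strongly_connected_components(core, edges)
        comps.sort(key=len, reverse=True)
        largest, rest = comps[0], comps[1:]
        # Every SCC other than the largest is a link farm candidate.
        farms.extend((iteration, c) for c in rest if len(c) >= min_size)

        # Assumed node filter: drop weakly linked nodes from the core.
        threshold = iteration + 1
        indeg, outdeg = defaultdict(int), defaultdict(int)
        for u, v in edges:
            if u in largest and v in largest:
                outdeg[u] += 1
                indeg[v] += 1
        core = {n for n in largest
                if indeg[n] >= threshold and outdeg[n] >= threshold}
    return farms
```

Calling `extract_link_farms(nodes, edges)` on a host graph given as a node list and a list of `(source, target)` pairs returns the components that split away from the core, tagged with the iteration in which they appeared. Under the assumed filter, later iterations surface denser components, which matches the abstract's description only in spirit.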

Index Terms

1. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web

April 2009

67 pages

ISBN: 9781605584386

DOI: 10.1145/1531914

Editors: Dennis Fetterly (Microsoft Research), Zoltán Gyöngyi (Google Research)
Copyright © 2009 ACM.


Publisher

Association for Computing Machinery

New York, NY, United States


Conference

AIRWeb '09: 5th International Workshop on Adversarial Information Retrieval on the Web

April 21, 2009

Madrid, Spain
