DOI: 10.1145/1531914.1531917
research-article

A study of link farm distribution and evolution using a time series of web snapshots

Published: 21 April 2009

Abstract

In this paper, we study the overall structure of link-based spam and its evolution, which should help in developing robust analysis tools and in researching Web spamming as a social activity in cyberspace. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, the so-called core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we find new large link farms in each iteration, and this trend continues for at least 10 iterations. In addition, we measure the spamicity of these link farms. Next, we examine the evolution of link farms over two years. Results show that almost all large link farms no longer grow, some of them shrink, and many large link farms are created in one year.
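The recursive decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how nodes are filtered, so the degree-based filter here (keep core nodes whose in-degree and out-degree inside the core reach a threshold that rises by one per iteration) is an assumption. The function names, the `base_degree` and `min_size` parameters, and the toy graph are likewise hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (iterative) on the subgraph induced by `nodes`."""
    nodes = set(nodes)
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        if u in nodes and v in nodes:
            fwd[u].append(v)
            rev[v].append(u)

    # Pass 1: record nodes in order of DFS finishing time.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:  # all successors explored: the node is finished
                order.append(node)
                stack.pop()

    # Pass 2: DFS on the reversed graph, latest finisher first.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        comp, stack = set(), [start]
        while stack:
            node = stack.pop()
            comp.add(node)
            for prev in rev[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(comp)
    return components


def peel_link_farm_candidates(nodes, edges, base_degree=2, min_size=3,
                              max_iterations=10):
    """Return (iteration, component) pairs for non-core SCCs of at least min_size."""
    current, edges, found = set(nodes), set(edges), []
    for iteration in range(1, max_iterations + 1):
        comps = strongly_connected_components(current, edges)
        if not comps:
            break
        core = max(comps, key=len)  # the largest SCC is treated as the core
        found += [(iteration, c) for c in comps
                  if c is not core and len(c) >= min_size]
        # Node filtering (assumed rule): count degrees inside the core only.
        indeg, outdeg = defaultdict(int), defaultdict(int)
        for u, v in edges:
            if u != v and u in core and v in core:
                outdeg[u] += 1
                indeg[v] += 1
        threshold = base_degree + iteration - 1
        current = {n for n in core
                   if indeg[n] >= threshold and outdeg[n] >= threshold}
    return found


def complete(names):
    """All directed edges between distinct nodes in `names`."""
    return [(u, v) for u in names for v in names if u != v]


# Toy graph: two dense groups joined through low-degree bridge nodes x and y,
# plus a small cycle that is reachable from the core but does not link back.
edges = complete(["k0", "k1", "k2", "k3", "k4"]) + complete(["f0", "f1", "f2", "f3"])
edges += [("s0", "s1"), ("s1", "s2"), ("s2", "s0"), ("k0", "s0")]
edges += [("k0", "x"), ("x", "f0"), ("f0", "y"), ("y", "k0")]
nodes = {n for edge in edges for n in edge}
farms = peel_link_farm_candidates(nodes, edges)
```

On this toy graph, the first decomposition separates the small cycle from the core. Filtering then drops the two bridge nodes, so the second decomposition splits the four-node dense group away from the five-node one. This mirrors, on a tiny scale, how new dense structures can surface in later iterations.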


Published In

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
April 2009
67 pages
ISBN: 9781605584386
DOI: 10.1145/1531914

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. information retrieval
2. link analysis
3. web spam
