DOI:10.1145/2910896.2910902
research-article

ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation

Published: 19 June 2016

Abstract

Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for the extraction and derivation of smaller datasets. Besides efficient access, we identify five other objectives based on practical researcher needs, such as ease of use, extensibility and reusability.
Towards these objectives, we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that builds a research corpus by working on existing and standardized data formats commonly held by Web archiving institutions. Performance optimizations in ArchiveSpark, facilitated by the use of a widely available metadata index, result in significant speed-ups of data processing. Our benchmarks show that ArchiveSpark is faster than alternative approaches, without depending on any additional data stores, while improving usability by seamlessly integrating queries and derivations with external tools.
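
The index-first idea mentioned in the abstract can be illustrated with a small sketch. This is not ArchiveSpark's API or implementation; it is a plain-Python toy under stated assumptions: `IndexEntry`, `PayloadStore` and `select_then_load` are hypothetical names, the index is assumed to hold lightweight per-capture metadata plus an offset into the archive, and the archive itself is simulated with an in-memory dictionary. The point it shows is that filtering on metadata first means only the selected records are ever read in full.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical lightweight metadata record for one archived capture."""
    url: str
    timestamp: str  # capture time, e.g. "20150101000000"
    mime: str
    status: int
    offset: int     # assumed position of the full record in the archive


class PayloadStore:
    """Simulated archive storage that counts how many full records are read."""

    def __init__(self, payloads: Dict[int, str]) -> None:
        self._payloads = payloads
        self.reads = 0

    def read(self, offset: int) -> str:
        self.reads += 1
        return self._payloads[offset]


def select_then_load(
    index: Iterable[IndexEntry],
    store: PayloadStore,
    keep: Callable[[IndexEntry], bool],
) -> List[Tuple[IndexEntry, str]]:
    """Filter on index metadata first; read payloads only for kept entries."""
    return [(entry, store.read(entry.offset)) for entry in index if keep(entry)]


# Made-up sample data, for illustration only.
index = [
    IndexEntry("http://example.org/", "20150101000000", "text/html", 200, 0),
    IndexEntry("http://example.org/logo.png", "20150101000005", "image/png", 200, 100),
    IndexEntry("http://example.org/missing", "20150102000000", "text/html", 404, 200),
    IndexEntry("http://example.org/about", "20160301000000", "text/html", 200, 300),
]
store = PayloadStore({
    0: "<html>home</html>",
    100: "PNG-bytes",
    200: "<html>not found</html>",
    300: "<html>about</html>",
})

# Keep only successfully fetched HTML pages; images and errors are never read.
results = select_then_load(
    index, store, lambda e: e.mime == "text/html" and e.status == 200
)
print(len(results), "records loaded with", store.reads, "payload reads")
```

In this toy setup the two discarded captures are rejected using index metadata alone, so the read counter reflects only the records that end up in the derived corpus.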

Information & Contributors

Information

Published In

JCDL '16: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries
June 2016
316 pages
ISBN:9781450342292
DOI:10.1145/2910896

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. big data
  2. data extraction
  3. web archives

Qualifiers

  • Research-article

Funding Sources

  • European Research Council

Conference

JCDL '16

Acceptance Rates

JCDL '16 Paper Acceptance Rate: 15 of 52 submissions, 29%
Overall Acceptance Rate: 415 of 1,482 submissions, 28%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 20
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 19 Feb 2025

Cited By

  • (2024) Exploration of Web Contents of Selangor Royal Family using Web Archiving Tools. Environment-Behaviour Proceedings Journal 9:SI18 (255-261). DOI:10.21834/e-bpj.v9iSI18.5485. Online publication date: 17-Jan-2024
  • (2023) Summarizing Web Archive Corpora via Social Media Storytelling by Automatically Selecting and Visualizing Exemplars. ACM Transactions on the Web 18:1 (1-48). DOI:10.1145/3606030. Online publication date: 11-Oct-2023
  • (2023) To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages. ACM Transactions on the Web 17:4 (1-49). DOI:10.1145/3589206. Online publication date: 11-Jul-2023
  • (2023) Synthesizing Web Archive Collections into Big Data: Lessons from Mining Data from Web Archives. Linking Theory and Practice of Digital Libraries (220-229). DOI:10.1007/978-3-031-43849-3_19. Online publication date: 26-Sep-2023
  • (2022) WARChain. Journal of Computer Security 30:3 (499-515). DOI:10.3233/JCS-210040. Online publication date: 1-Jan-2022
  • (2022) ABCDEF. Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries (1-11). DOI:10.1145/3529372.3530916. Online publication date: 20-Jun-2022
  • (2022) CDX Summary: Web Archival Collection Insights. Linking Theory and Practice of Digital Libraries (297-305). DOI:10.1007/978-3-031-16802-4_25. Online publication date: 20-Sep-2022
  • (2021) Was this the real Web? Quantitative overview of the Polish… ccTLD Internet Archive data (1996–2001). Archeion 122 (44-68). DOI:10.4467/26581264ARC.21.015.14495. Online publication date: 23-Dec-2021
  • (2021) From archive to analysis: accessing web archives at scale through a cloud-based interface. International Journal of Digital Humanities. DOI:10.1007/s42803-020-00029-6. Online publication date: 6-Jan-2021
  • (2021) WARChain: Blockchain-Based Validation of Web Archives. Socio-Technical Aspects in Security and Trust (121-134). DOI:10.1007/978-3-030-79318-0_7. Online publication date: 22-Jun-2021
