Research article, DOI: 10.1145/2534248.2534249

On assisting scientific data curation in collection-based dataflows using labels

Published: 17 November 2013

Abstract

Thanks to the proliferation of computational techniques and the availability of datasets, data-intensive research has become commonplace in science. Sharing and re-use of datasets is key to scientific progress. A critical requirement for data re-use is that data be accompanied by lineage metadata describing the context in which the data was produced, the source datasets from which it was derived, and the tooling or settings involved in its generation. By and large, this metadata is provided through a manual curation process, which is tedious, repetitive and time-consuming.
In this paper, we explore the problem of curating data artifacts generated from scientific workflows, which have become an established method for organizing computational data analyses. Most workflow systems can be instrumented to gather provenance (i.e. lineage) information about the data artifacts generated by their execution. This raw provenance provides detailed information on the localized lineage traced during a run, in the form of data derivation or activity causality relations, but it is of little use when one needs to report on lineage in a broader scientific context. Consequently, datasets resulting from workflow-based analyses also require manual curation before they are published.
We argue that, by making the analysis process explicit, workflow-based investigations provide an opportunity for semi-automating data curation. We introduce a novel approach that semi-automates curation through a special kind of workflow, which we call a Labeling Workflow. Using (1) the description of a scientific workflow, (2) a set of semantic annotations characterizing the data processing in workflows, and (3) a library of label handling functions, we devise a Labeling Workflow, which can be executed over raw provenance in order to curate the data artifacts it refers to. We semi-formally describe the elements of our solution and showcase its usefulness using an example from biodiversity.
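To make the three ingredients concrete, the following Python sketch shows one way labels could be pushed through a provenance trace. It is an illustration under stated assumptions, not the paper's implementation: the function names (`curate`, `union_labels`, `shared_labels`), the motif names, the trace record layout and the artifact identifiers are all hypothetical, and the two label handling functions are merely plausible examples of what such a library might contain.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical label handling functions. Each receives the label sets of a
# step's inputs and returns the label set to attach to the step's outputs.
def union_labels(input_labels):
    """Outputs inherit every label found on any input."""
    merged = set()
    for labels in input_labels:
        merged |= labels
    return merged

def shared_labels(input_labels):
    """Outputs keep only the labels common to all inputs."""
    return set.intersection(*input_labels) if input_labels else set()

# Hypothetical mapping from a step's semantic annotation to a handler.
HANDLERS = {"transform": union_labels, "combine": shared_labels}

def curate(step_motifs, trace, seed_labels):
    """Label every artifact mentioned in a raw provenance trace.

    step_motifs: step name -> semantic annotation of that step
    trace:       list of (step, input_artifacts, output_artifacts) records
    seed_labels: artifact -> labels already known for the source data
    """
    # Order the records so each runs after the records that produce its inputs.
    producer = {out: i for i, (_, _, outs) in enumerate(trace) for out in outs}
    deps = {i: {producer[a] for a in ins if a in producer}
            for i, (_, ins, _) in enumerate(trace)}
    labels = {artifact: set(ls) for artifact, ls in seed_labels.items()}
    for i in TopologicalSorter(deps).static_order():
        step, ins, outs = trace[i]
        handler = HANDLERS[step_motifs[step]]
        result = handler([labels.get(a, set()) for a in ins])
        for out in outs:
            labels[out] = set(result)
    return labels

# Made-up biodiversity-flavoured example; the records are deliberately out of order.
step_motifs = {"clean_names": "transform", "join_occurrences": "combine"}
trace = [
    ("join_occurrences", ["names_clean", "occurrences"], ["joined"]),
    ("clean_names", ["names_raw"], ["names_clean"]),
]
seed_labels = {
    "names_raw": {"source:checklist", "taxon:Aves"},
    "occurrences": {"source:portal", "taxon:Aves"},
}
labels = curate(step_motifs, trace, seed_labels)
```

In this sketch the handler is chosen by the step's annotation rather than by the step itself, so the same small library can be reused across workflows. Whether the actual approach dispatches this way is an assumption.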


Published In

WORKS '13: Proceedings of the 8th Workshop on Workflows in Support of Large-Scale Science
November 2013
133 pages
ISBN: 9781450325028
DOI: 10.1145/2534248

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data curation
  2. scientific workflows
  3. workflow provenance

Conference

SC13

Acceptance Rates

WORKS '13 Paper Acceptance Rate 13 of 16 submissions, 81%;
Overall Acceptance Rate 30 of 54 submissions, 56%
