research-article

Enhancing and abstracting scientific workflow provenance for data publishing

Published: 18 March 2013

Abstract

Many scientists use workflows to systematically design and run computational experiments. Once a workflow has been executed, the scientist may want to publish the resulting dataset so that, for example, other scientists can reuse it as input to their own experiments. In doing so, the scientist needs to curate the dataset by specifying metadata that describes it, e.g., its derivation history, origins, and ownership. To assist the scientist in this task, we explore in this paper the use of provenance traces collected by workflow management systems when enacting workflows. Specifically, we identify the shortcomings of such raw provenance traces in supporting the data publishing task, and propose an approach for deriving distilled, yet more informative, provenance traces that are fit for that task.


Published In

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops
March 2013
423 pages
ISBN: 9781450315999
DOI: 10.1145/2457317

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data publishing
  2. provenance
  3. scientific workflows

Qualifiers

  • Research-article

Conference

EDBT/ICDT '13

Acceptance Rates

EDBT '13 Paper Acceptance Rate 7 of 10 submissions, 70%;
Overall Acceptance Rate 7 of 10 submissions, 70%

Cited By

  • (2024) Trusted Provenance of Collaborative, Adaptive, Process-Based Data Processing Pipelines. Enterprise Design, Operations, and Computing. EDOC 2023 Workshops, (363-370). DOI: 10.1007/978-3-031-54712-6_25. Online publication date: 2-Mar-2024
  • (2023) Workflows for Bioinformatics Data Integration. Biological Data Integration, (53-85). DOI: 10.1002/9781394257317.ch3. Online publication date: 8-Dec-2023
  • (2022) Simplifying Text Mining Activities: Scalable and Self-Tuning Methodology for Topic Detection and Characterization. Applied Sciences, 12:10, (5125). DOI: 10.3390/app12105125. Online publication date: 19-May-2022
  • (2020) Provenance Holder: Bringing Provenance, Reproducibility and Trust to Flexible Scientific Workflows and Choreographies. Business Process Management Workshops, (664-675). DOI: 10.1007/978-3-030-37453-2_53. Online publication date: 3-Jan-2020
  • (2019) Capturing and Reporting Provenance Information of Simulation Studies Based on an Artifact-Based Workflow Approach. Proceedings of the 2019 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation, (185-196). DOI: 10.1145/3316480.3325514. Online publication date: 29-May-2019
  • (2018) A Templating System to Generate Provenance. IEEE Transactions on Software Engineering, 44:2, (103-121). DOI: 10.1109/TSE.2017.2659745. Online publication date: 1-Feb-2018
  • (2018) Provenance Network Analytics. Data Mining and Knowledge Discovery, 32:3, (708-735). DOI: 10.1007/s10618-017-0549-3. Online publication date: 1-May-2018
  • (2017) Provenance in DISC systems. Proceedings of the 9th USENIX Conference on Theory and Practice of Provenance, (13-13). DOI: 10.5555/3183865.3183883. Online publication date: 23-Jun-2017
  • (2017) A survey on provenance. The VLDB Journal — The International Journal on Very Large Data Bases, 26:6, (881-906). DOI: 10.1007/s00778-017-0486-1. Online publication date: 1-Dec-2017
  • (2017) SHARP: Harmonizing and Bridging Cross-Workflow Provenance. The Semantic Web: ESWC 2017 Satellite Events, (219-234). DOI: 10.1007/978-3-319-70407-4_35. Online publication date: 8-Nov-2017
