skip to main content
research-article

Evaluation of a Hybrid Approach for Efficient Provenance Storage

Published:01 November 2013Publication History
Skip Abstract Section

Abstract

Provenance is the metadata that describes the history of objects. Provenance provides new functionality in a variety of areas, including experimental documentation, debugging, search, and security. As a result, a number of groups have built systems to capture provenance. Most of these systems focus on provenance collection, a few systems focus on building applications that use the provenance, but all of these systems ignore an important aspect: efficient long-term storage of provenance.

In this article, we first analyze the provenance collected from multiple workloads and characterize the properties of provenance with respect to long-term storage. We then propose a hybrid scheme that takes advantage of the graph structure of provenance data and the inherent duplication in provenance data. Our evaluation indicates that our hybrid scheme, a combination of Web graph compression (adapted for provenance) and dictionary encoding, provides the best trade-off in terms of compression ratio, compression time, and query performance when compared to other compression schemes.

References

  1. Adler, M. and Mitzenmacher, M. 2001. Towards compressing web graphs. In Proceedings of the IEEE Data Compression Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Barga, R. S. and Digiampietri, L. A. 2007. Automatic capture and efficient storage of escience experiment provenance. Concur. Comput. Pract. Exper. 1--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Boldi, P. and Vigna, S. 2004a. The webgraph framework I: Compression techniques. In Proceedings of the 13th International World Wide Web Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Boldi, P. and Vigna, S. 2004b. The webgraph framework II: Codes for the world-wide web. In Proceedings of the International Data Compression Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Boncz, P. A. 2002. Monet: A next generation dbms kernel for query-intensive application. Ph.D. thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands. http://oai.cwi.nl/oai/asset/14832/14832A.pdf.Google ScholarGoogle Scholar
  6. Bose, R. and Frew, J. 2004. Composing lineage metadata with xml for custom satellite-derived data products. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Cao, B., Plale, B., Subramanian, G., Robertson, E., and Simmhan, Y. 2009. Provenance information model of karma version 3. In Proceedings of the 3rd IEEE International Workshop on Scientific Workflows (SWF’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Challenge3. 2009. The third provenance challenge. http://twiki.ipaw.info/bin/view/Challenge/ParticipatingTeams3.Google ScholarGoogle Scholar
  9. Chapman, A. P., Jagadish, H. V., and Ramanan, P. 2008. Efficient provenance storage. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Cheah, Y.-W., Plale, B., Kendall-Morwick, J., Leake, D., and Ramakrishnan, L. 2011. A noisy 10GB provenance database. In Proceedings of the 2nd International Workshop on Traceability and Compliance of Semi-Structured Processes, in conjunction with the 9th International Conference on Business Process Management.Google ScholarGoogle Scholar
  11. Chen, Z., Gehrke, J., and Korn, F. 2001. Query optimization in compressed database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 271--282. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Futrelle, J., Gaynor, J., Plutchak, J., Myers, J. D., McGrath, R. E., Bajcsy, P., Kastner, J., Kotwani, K., Lee, J. S., Marini, L., Kooper, R., Mclaren, T., and Liu, Y. 2009. Semantic middleware for e-science knowledge spaces. In Proceedings of the 7th International Workshop on Middleware for Grids, Clouds and e-Science. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Goldstein, J., Ramakrishnan, R., and Shaft, U. 1998. Compressing relations and indexes. In Proceedings of the International Conference on Data Engineering. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Graefe, G. and Shapiro, L. 1991. Data compression and database performance. In Proceedings of the ACM/IEEE-CS Symposium on Applied Computing. 22--27.Google ScholarGoogle Scholar
  15. Groth, P., Miles, S., Fang, W., Wong, S. C., Zauner, K., and Moreau, L. 2005. Recording and using provenance in a protein compressibility experiment. In Proceedings of the International ACM Symposium on High-Performance Parallel and Distributed Computing. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Groth, P., Jiang, S., Miles, S., Munroe, S., Tan, V., Tsasakou, S., and Moreau, L. 2006. An architecture for provenance system. Tech. rep. http://eprints.soton.ac.uk/263196/1/provenanceArchitecture7.pdf.Google ScholarGoogle Scholar
  17. Jayapandian, M., Chapman, A. P., Tarcea, V. G., Yu, C., Elkiss, A., Ianni, A., Liu, B., Nandi, A., Santos, C., Andrews, P., Athey, B., States, D., and Jagadish, H. V. 2007. Michigan molecular interactions (MiMI): Putting the jigsaw puzzle together. Nucleic Acids Res. 35, D566-571.Google ScholarGoogle ScholarCross RefCross Ref
  18. King, S. T. and Chen, P. M. 2003. Backtracking intrusions. In Proceedings of the ACM Symposium on Operating Systems Principles. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Liefke, H. and Suciu, D. 2000. XMill: An efficient compressor for xml data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Missier, P., Soiland-Reyes, S., Owen, S., Tan, W., Nenadic, A., Dunlop, I., Williams, A., Oinn, T., and Goble, C. 2010. Taverna, reloaded. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Nanagement (SSDBM’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Plale, B., Simmhan, Y., Stephan, E., and van den Bussche, J. 2011. The open provenance model core specification (v1.1). Future Gener. Comput. Syst. 27, 6, 743--756. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Muniswamy-Reddy, K.-K., Holland, D. A., Braun, U., and Seltzer, M. I. 2006. Provenance-aware storage systems. In Proceedings of the USENIX Annual Technical Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Muniswamy-Reddy, K.-K., Braun, U., Holland, D. A., Macko, P., Maclean, D., Margo, D., Seltzer, M. I., and Smogor, R. 2009. Layering in provenance systems. In Proceedings of the USENIX Annual Technical Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Passtrace. 2008. http://www.eecs.harvard.edu/syrah/pass/traces/.Google ScholarGoogle Scholar
  25. Poss, M. and Potapov, D. 2003. Data compression in oracle. In Proceedings of the International Conference on Very Large Data Bases. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Randall, K., Wickremesinghe, R., and Wiener, J. 2001. The link database: Fast access to graphs of the web. Res. rep. 175, Compaq Systems Research Center, Palo Alto, CA.Google ScholarGoogle Scholar
  27. Roth, M. A. and Horn, S. J. V. 1993. Database compression. SIGMOD Rec. 22, 3, 31--39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Shah, S., Soules, C. A. N., Ganger, G. R., and Noble, B. D. 2007. Using provenance to aid in personal file search. In Proceedings of the USENIX Annual Technical Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Simmhan, Y. L., Plale, B., and Gannon, D. 2006. A framework for collecting provenance in data-centric scientific workflows. In Proceedings of the IEEE International Conference on Web Services. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Suel, T. and Yuan, J. 2001. Compressing the graph structure of the web. In Proceedings of the IEEE Data Compression Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Tolani, P. M. and Haritsa, J. R. 2002. XGRIND: A query-friendly xml compressor. In Proceedings of the International Conference on Data Engineering. 225--234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Vahdat, A. and Anderson, T. 1997. Transparent result caching. Tech. rep. CSD-97-974,8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Widom, J. 2005. Trio: A system for integrated management of data, accuracy, and lineage. In Proceedings of the International Conference on Innovation Data Systems Research (CIDR).Google ScholarGoogle Scholar
  34. Witten, I., Moffat, A., and Bell, T. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishing, San Francisco. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Xie, Y., K. Muniswamy-Reddy, K., Long, D. D. E., Amer, A., Feng, D., and Tan, Z. 2011. Compressing provenance graphs. In Proceedings of the 3rd USENIX Workshop on the Theory and Practice of Provenance.Google ScholarGoogle Scholar
  36. Xie, Y., K. Muniswamy-Reddy, K., Feng, D., Yan, L., Long, D. D. E., Tan, Z., and Chen, L. 2012. A hybrid approach for efficient provenance storage. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3, 337--343. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Zukowski, M., Heman, S., Nes, N., and Boncz, P. 2006. Super-scalar ram-cpu cache compression. In Proceedings of the International Conference on Data Engineering. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Evaluation of a Hybrid Approach for Efficient Provenance Storage

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM Transactions on Storage
            ACM Transactions on Storage  Volume 9, Issue 4
            November 2013
            117 pages
            ISSN:1553-3077
            EISSN:1553-3093
            DOI:10.1145/2555948
            • Editor:
            • Darrell Long
            Issue’s Table of Contents

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 1 November 2013
            • Accepted: 1 June 2013
            • Revised: 1 April 2013
            • Received: 1 October 2012
            Published in tos Volume 9, Issue 4

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article
            • Research
            • Refereed

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader