Abstract
Provenance is the metadata that describes the history of objects. Provenance provides new functionality in a variety of areas, including experimental documentation, debugging, search, and security. As a result, a number of groups have built systems to capture provenance. Most of these systems focus on provenance collection, a few systems focus on building applications that use the provenance, but all of these systems ignore an important aspect: efficient long-term storage of provenance.
In this article, we first analyze the provenance collected from multiple workloads and characterize the properties of provenance with respect to long-term storage. We then propose a hybrid scheme that takes advantage of the graph structure of provenance data and the inherent duplication in provenance data. Our evaluation indicates that our hybrid scheme, a combination of Web graph compression (adapted for provenance) and dictionary encoding, provides the best trade-off in terms of compression ratio, compression time, and query performance when compared to other compression schemes.
- Adler, M. and Mitzenmacher, M. 2001. Towards compressing web graphs. In Proceedings of the IEEE Data Compression Conference. Google ScholarDigital Library
- Barga, R. S. and Digiampietri, L. A. 2007. Automatic capture and efficient storage of escience experiment provenance. Concur. Comput. Pract. Exper. 1--10. Google ScholarDigital Library
- Boldi, P. and Vigna, S. 2004a. The webgraph framework I: Compression techniques. In Proceedings of the 13th International World Wide Web Conference. Google ScholarDigital Library
- Boldi, P. and Vigna, S. 2004b. The webgraph framework II: Codes for the world-wide web. In Proceedings of the International Data Compression Conference. Google ScholarDigital Library
- Boncz, P. A. 2002. Monet: A next generation dbms kernel for query-intensive application. Ph.D. thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands. http://oai.cwi.nl/oai/asset/14832/14832A.pdf.Google Scholar
- Bose, R. and Frew, J. 2004. Composing lineage metadata with xml for custom satellite-derived data products. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management. Google ScholarDigital Library
- Cao, B., Plale, B., Subramanian, G., Robertson, E., and Simmhan, Y. 2009. Provenance information model of karma version 3. In Proceedings of the 3rd IEEE International Workshop on Scientific Workflows (SWF’09). Google ScholarDigital Library
- Challenge3. 2009. The third provenance challenge. http://twiki.ipaw.info/bin/view/Challenge/ParticipatingTeams3.Google Scholar
- Chapman, A. P., Jagadish, H. V., and Ramanan, P. 2008. Efficient provenance storage. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Cheah, Y.-W., Plale, B., Kendall-Morwick, J., Leake, D., and Ramakrishnan, L. 2011. A noisy 10GB provenance database. In Proceedings of the 2nd International Workshop on Traceability and Compliance of Semi-Structured Processes, in conjunction with the 9th International Conference on Business Process Management.Google Scholar
- Chen, Z., Gehrke, J., and Korn, F. 2001. Query optimization in compressed database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 271--282. Google ScholarDigital Library
- Futrelle, J., Gaynor, J., Plutchak, J., Myers, J. D., McGrath, R. E., Bajcsy, P., Kastner, J., Kotwani, K., Lee, J. S., Marini, L., Kooper, R., Mclaren, T., and Liu, Y. 2009. Semantic middleware for e-science knowledge spaces. In Proceedings of the 7th International Workshop on Middleware for Grids, Clouds and e-Science. Google ScholarDigital Library
- Goldstein, J., Ramakrishnan, R., and Shaft, U. 1998. Compressing relations and indexes. In Proceedings of the International Conference on Data Engineering. Google ScholarDigital Library
- Graefe, G. and Shapiro, L. 1991. Data compression and database performance. In Proceedings of the ACM/IEEE-CS Symposium on Applied Computing. 22--27.Google Scholar
- Groth, P., Miles, S., Fang, W., Wong, S. C., Zauner, K., and Moreau, L. 2005. Recording and using provenance in a protein compressibility experiment. In Proceedings of the International ACM Symposium on High-Performance Parallel and Distributed Computing. Google ScholarDigital Library
- Groth, P., Jiang, S., Miles, S., Munroe, S., Tan, V., Tsasakou, S., and Moreau, L. 2006. An architecture for provenance system. Tech. rep. http://eprints.soton.ac.uk/263196/1/provenanceArchitecture7.pdf.Google Scholar
- Jayapandian, M., Chapman, A. P., Tarcea, V. G., Yu, C., Elkiss, A., Ianni, A., Liu, B., Nandi, A., Santos, C., Andrews, P., Athey, B., States, D., and Jagadish, H. V. 2007. Michigan molecular interactions (MiMI): Putting the jigsaw puzzle together. Nucleic Acids Res. 35, D566-571.Google ScholarCross Ref
- King, S. T. and Chen, P. M. 2003. Backtracking intrusions. In Proceedings of the ACM Symposium on Operating Systems Principles. Google ScholarDigital Library
- Liefke, H. and Suciu, D. 2000. XMill: An efficient compressor for xml data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Missier, P., Soiland-Reyes, S., Owen, S., Tan, W., Nenadic, A., Dunlop, I., Williams, A., Oinn, T., and Goble, C. 2010. Taverna, reloaded. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Nanagement (SSDBM’10). Google ScholarDigital Library
- Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Plale, B., Simmhan, Y., Stephan, E., and van den Bussche, J. 2011. The open provenance model core specification (v1.1). Future Gener. Comput. Syst. 27, 6, 743--756. Google ScholarDigital Library
- Muniswamy-Reddy, K.-K., Holland, D. A., Braun, U., and Seltzer, M. I. 2006. Provenance-aware storage systems. In Proceedings of the USENIX Annual Technical Conference. Google ScholarDigital Library
- Muniswamy-Reddy, K.-K., Braun, U., Holland, D. A., Macko, P., Maclean, D., Margo, D., Seltzer, M. I., and Smogor, R. 2009. Layering in provenance systems. In Proceedings of the USENIX Annual Technical Conference. Google ScholarDigital Library
- Passtrace. 2008. http://www.eecs.harvard.edu/syrah/pass/traces/.Google Scholar
- Poss, M. and Potapov, D. 2003. Data compression in oracle. In Proceedings of the International Conference on Very Large Data Bases. Google ScholarDigital Library
- Randall, K., Wickremesinghe, R., and Wiener, J. 2001. The link database: Fast access to graphs of the web. Res. rep. 175, Compaq Systems Research Center, Palo Alto, CA.Google Scholar
- Roth, M. A. and Horn, S. J. V. 1993. Database compression. SIGMOD Rec. 22, 3, 31--39. Google ScholarDigital Library
- Shah, S., Soules, C. A. N., Ganger, G. R., and Noble, B. D. 2007. Using provenance to aid in personal file search. In Proceedings of the USENIX Annual Technical Conference. Google ScholarDigital Library
- Simmhan, Y. L., Plale, B., and Gannon, D. 2006. A framework for collecting provenance in data-centric scientific workflows. In Proceedings of the IEEE International Conference on Web Services. Google ScholarDigital Library
- Suel, T. and Yuan, J. 2001. Compressing the graph structure of the web. In Proceedings of the IEEE Data Compression Conference. Google ScholarDigital Library
- Tolani, P. M. and Haritsa, J. R. 2002. XGRIND: A query-friendly xml compressor. In Proceedings of the International Conference on Data Engineering. 225--234. Google ScholarDigital Library
- Vahdat, A. and Anderson, T. 1997. Transparent result caching. Tech. rep. CSD-97-974,8. Google ScholarDigital Library
- Widom, J. 2005. Trio: A system for integrated management of data, accuracy, and lineage. In Proceedings of the International Conference on Innovation Data Systems Research (CIDR).Google Scholar
- Witten, I., Moffat, A., and Bell, T. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishing, San Francisco. Google ScholarDigital Library
- Xie, Y., K. Muniswamy-Reddy, K., Long, D. D. E., Amer, A., Feng, D., and Tan, Z. 2011. Compressing provenance graphs. In Proceedings of the 3rd USENIX Workshop on the Theory and Practice of Provenance.Google Scholar
- Xie, Y., K. Muniswamy-Reddy, K., Feng, D., Yan, L., Long, D. D. E., Tan, Z., and Chen, L. 2012. A hybrid approach for efficient provenance storage. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. Google ScholarDigital Library
- Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3, 337--343. Google ScholarDigital Library
- Zukowski, M., Heman, S., Nes, N., and Boncz, P. 2006. Super-scalar ram-cpu cache compression. In Proceedings of the International Conference on Data Engineering. Google ScholarDigital Library
Index Terms
- Evaluation of a Hybrid Approach for Efficient Provenance Storage
Recommendations
Distributed Provenance Compression
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of DataNetwork provenance, which records the execution history of network events as meta-data, is becoming increasingly important for network accountability and failure diagnosis. For example, network provenance may be used to trace the path that a message ...
Efficient provenance storage
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of dataAs the world is increasingly networked and digitized, the data we store has more and more frequently been chopped, baked, diced and stewed. In consequence, there is an increasing need to store and manage provenance for each data item stored in a ...
A hybrid approach for efficient provenance storage
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge managementEfficient provenance storage is an essential step towards the adoption of provenance. In this paper, we analyze the provenance collected from multiple workloads with a view towards efficient storage. Based on our analysis, we characterize the properties ...
Comments