ABSTRACT
Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction, and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General-purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach that tracks the user's actions while the user browses source databases and copies data into a curated database, and records those actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naive approach is fairly high, it can be decreased to an acceptable level using simple optimizations.