ABSTRACT
It seems inevitable that the datasets associated with a research project proliferate over time: collaborators extend datasets with new measurements and new attributes, new experimental runs produce new files with similar structures, and subsets of data are extracted for independent analysis. As these "residual" datasets accrete, scientists can lose track of the derivation history that connects them, which complicates data sharing, provenance tracking, and scientific reproducibility. In this paper, focusing on data in spreadsheets, we consider how observable relationships between two datasets can help scientists recall how one was derived from the other. For instance, if dataset A is wholly contained in dataset B, B may be a more recent version of A and should be preferred when archiving or publishing.
We articulate a space of relevant relationships, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery. Our evaluation shows that existing approaches that rely on flagging differences between two spreadsheets are impractical for many relationship-discovery tasks, and a user study shows that ReConnect can improve scientists' ability to detect useful relationships and subsequently identify the best dataset for a given task.
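The containment example in the abstract (dataset A wholly contained in dataset B) can be made concrete with a small sketch. This is not the ReConnect algorithm; it is a minimal illustration that assumes each spreadsheet has already been loaded as a list of rows with the same column order. The function names `row_containment` and `is_wholly_contained` and the sample data are hypothetical.

```python
from collections import Counter


def row_containment(a_rows, b_rows):
    """Fraction of rows in A that also appear in B, counting duplicates.

    Assumption: rows are sequences of hashable cell values and both
    datasets share the same column order.
    """
    a_counts = Counter(tuple(row) for row in a_rows)
    b_counts = Counter(tuple(row) for row in b_rows)
    total = sum(a_counts.values())
    if total == 0:
        return 1.0  # an empty dataset is trivially contained
    # A duplicated row in A only matches as many copies as B holds.
    matched = sum(min(n, b_counts[row]) for row, n in a_counts.items())
    return matched / total


def is_wholly_contained(a_rows, b_rows):
    """True when every row of A (with multiplicity) is present in B."""
    return row_containment(a_rows, b_rows) == 1.0


# Hypothetical sample data: B extends A with one extra measurement.
dataset_a = [("s1", 4.2), ("s2", 3.9)]
dataset_b = [("s1", 4.2), ("s2", 3.9), ("s3", 5.1)]

if is_wholly_contained(dataset_a, dataset_b):
    print("A is contained in B; B may be the more recent version.")
```

Exact row matching like this is brittle under reordered columns or reformatted values, so it only suggests what a containment relationship looks like in the simplest case.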
Index Terms
- Helping scientists reconnect their datasets