research-article

Helping scientists reconnect their datasets

Authors:

Abdussalam Alawini,

Bill Howe

SSDBM '14: Proceedings of the 26th International Conference on Scientific and Statistical Database Management

Article No.: 29, Pages 1 - 12

https://doi.org/10.1145/2618243.2618263

Published: 30 June 2014

Abstract

It seems inevitable that the datasets associated with a research project proliferate over time: collaborators may extend datasets with new measurements and new attributes, new experimental runs result in new files with similar structures, and subsets of data are extracted for independent analysis. As these "residual" datasets begin to accrete over time, scientists can lose track of the derivation history that connects them, complicating data sharing, provenance tracking, and scientific reproducibility. In this paper, focusing on data in spreadsheets, we consider how observable relationships between two datasets can help scientists recall their original derivation connection. For instance, if dataset A is wholly contained in dataset B, B may be a more recent version of A and should be preferred when archiving or publishing.

We articulate a space of relevant relationships, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery. Our evaluation shows that existing approaches that rely on flagging differences between two spreadsheets are impractical for many relationship-discovery tasks, and a user study shows that ReConnect can improve scientists' ability to detect useful relationships and subsequently identify the best dataset for a given task.
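The containment example from the abstract (dataset A wholly contained in dataset B) can be illustrated with a minimal sketch. This is not ReConnect's algorithm, which the abstract does not describe. The function names, the exact-match comparison of header names and cell values, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def is_row_contained(a_rows, b_rows):
    """Return True if every row of A occurs in B at least as many times."""
    a_counts = Counter(map(tuple, a_rows))
    b_counts = Counter(map(tuple, b_rows))
    return all(b_counts[row] >= n for row, n in a_counts.items())


def is_contained(a_header, a_rows, b_header, b_rows):
    """Check whether table A is wholly contained in table B.

    Assumption: columns are matched by exact header name and cells by
    exact string equality. A's columns must be a subset of B's, and each
    row of A must appear in B after projecting B onto A's columns.
    """
    if not set(a_header) <= set(b_header):
        return False
    idx = [b_header.index(col) for col in a_header]
    projected = [tuple(row[i] for i in idx) for row in b_rows]
    return is_row_contained(a_rows, projected)


# Hypothetical spreadsheets: B extends A with one attribute and one row.
a_header = ["site", "temp"]
a_rows = [("S1", "12.1"), ("S2", "13.4")]
b_header = ["site", "temp", "salinity"]
b_rows = [
    ("S1", "12.1", "31.0"),
    ("S2", "13.4", "30.2"),
    ("S3", "11.9", "29.8"),
]

a_in_b = is_contained(a_header, a_rows, b_header, b_rows)
b_in_a = is_contained(b_header, b_rows, a_header, a_rows)
print(a_in_b, b_in_a)
```

Here A is contained in B but not the reverse, which under the abstract's reasoning suggests that B may be the more recent version. Real spreadsheets would likely need tolerance for renamed columns, reordered rows, and reformatted values, which this sketch ignores.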

Index Terms

Helping scientists reconnect their datasets
1. Information systems
  1. Information systems applications

Published In

SSDBM '14: Proceedings of the 26th International Conference on Scientific and Statistical Database Management

June 2014, 417 pages

ISBN: 9781450327220

DOI: 10.1145/2618243

Editors: Christian S. Jensen, Hua Lu, Torben Bach Pedersen, Christian Thomsen, Kristian Torp (all Aalborg University)

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

Division of Information and Intelligent Systems

Conference

SSDBM '14: Conference on Scientific and Statistical Database Management

June 30 - July 2, 2014

Aalborg, Denmark

Acceptance Rates

SSDBM '14 Paper Acceptance Rate: 26 of 71 submissions, 37%

Overall Acceptance Rate: 56 of 146 submissions, 38%

Bibliometrics

Total Citations: 8
Total Downloads: 145
Downloads (Last 12 months): 4
Downloads (Last 6 weeks): 1

Reflects downloads up to 01 Mar 2025

Cited By

Thanos C, Meghini C, Bartalesi V, Coro G (2023). An exploratory approach to data driven knowledge creation. Journal of Big Data, 10:1. Online publication date: 6-Mar-2023.
https://doi.org/10.1186/s40537-023-00702-x
Pan S, Wang P, Wang C, Wang W, Wang J (2022). NLC. Proceedings of the VLDB Endowment, 15:7 (1363-1375). Online publication date: 1-Mar-2022.
https://dl.acm.org/doi/10.14778/3523210.3523215
Rehman M, Huang S, Elmore A (2021). A demonstration of RELIC. Proceedings of the VLDB Endowment, 14:12 (2795-2798). Online publication date: 1-Jul-2021.
https://dl.acm.org/doi/10.14778/3476311.3476347
Ho N, Vo H, Vu M, Pedersen T (2021). AMIC: An Adaptive Information Theoretic Method to Identify Multi-Scale Temporal Correlations in Big Time Series Data. IEEE Transactions on Big Data, 7:1 (128-146). Online publication date: 1-Mar-2021.
https://doi.org/10.1109/TBDATA.2019.2907987
Stoyanovich J, Howe B, Abiteboul S, Miklau G, Sahuguet A, Weikum G, Choudhary A, Wu K, Dong B (2017). Fides. Proceedings of the 29th International Conference on Scientific and Statistical Database Management (1-6). Online publication date: 27-Jun-2017.
https://dl.acm.org/doi/10.1145/3085504.3085530
Chirigati F, Doraiswamy H, Damoulas T, Freire J, Özcan F, Koutrika G, Madden S (2016). Data Polygamy. Proceedings of the 2016 International Conference on Management of Data (1011-1025). Online publication date: 26-Jun-2016.
https://dl.acm.org/doi/10.1145/2882903.2915245
Ho N, Vo H, Vu M (2016). An adaptive information-theoretic approach for identifying temporal correlations in big data sets. 2016 IEEE International Conference on Big Data (Big Data) (666-675). Online publication date: Dec-2016.
https://doi.org/10.1109/BigData.2016.7840659
Alawini A, Maier D, Tufte K, Howe B, Nandikur R (2015). Towards automated prediction of relationships among scientific datasets. Proceedings of the 27th International Conference on Scientific and Statistical Database Management (1-5). Online publication date: 29-Jun-2015.
https://dl.acm.org/doi/10.1145/2791347.2791358
