DOI: 10.1145/2618243.2618263
research-article

Helping scientists reconnect their datasets

Published: 30 June 2014

ABSTRACT

It seems inevitable that the datasets associated with a research project proliferate over time: collaborators may extend datasets with new measurements and new attributes, new experimental runs result in new files with similar structures, and subsets of data are extracted for independent analysis. As these "residual" datasets begin to accrete over time, scientists can lose track of the derivation history that connects them, complicating data sharing, provenance tracking, and scientific reproducibility. In this paper, focusing on data in spreadsheets, we consider how observable relationships between two datasets can help scientists recall their original derivation connection. For instance, if dataset A is wholly contained in dataset B, B may be a more recent version of A and should be preferred when archiving or publishing.

We articulate a space of relevant relationships, develop a set of algorithms for efficient discovery of these relationships, and organize these algorithms into a new system called ReConnect to assist scientists in relationship discovery. Our evaluation shows that existing approaches that rely on flagging differences between two spreadsheets are impractical for many relationship-discovery tasks, and a user study shows that ReConnect can improve scientists' ability to detect useful relationships and subsequently identify the best dataset for a given task.
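To make the containment example from the abstract concrete, here is a minimal sketch of one way to test whether dataset A is wholly contained in dataset B. It is not the ReConnect algorithm, which the abstract does not describe. The function name `row_containment`, the list-of-dicts row representation, and the sample data are all assumptions made for illustration.

```python
from collections import Counter

# Sentinel for "column absent", so a missing column never equals a real value.
_MISSING = object()


def row_containment(a, b):
    """Return the fraction of rows of `a` that also occur in `b`.

    Each dataset is a list of dicts, one dict per spreadsheet row
    (a hypothetical representation chosen for this sketch).
    Rows are compared only on the columns of `a`, so `b` may carry
    extra attributes. Duplicate rows are matched as a multiset.
    """
    if not a:
        return 1.0  # an empty dataset is trivially contained

    cols = sorted(a[0].keys())

    def project(row):
        # Reduce a row to the columns of `a`, in a fixed order.
        return tuple(row.get(c, _MISSING) for c in cols)

    a_counts = Counter(project(r) for r in a)
    b_counts = Counter(project(r) for r in b)
    matched = sum(min(n, b_counts[key]) for key, n in a_counts.items())
    return matched / len(a)


# Illustrative data only: B extends A with a new attribute and a new row.
A = [
    {"site": "S1", "temp": 20.5},
    {"site": "S2", "temp": 19.0},
]
B = [
    {"site": "S1", "temp": 20.5, "ph": 7.1},
    {"site": "S2", "temp": 19.0, "ph": 6.9},
    {"site": "S3", "temp": 18.2, "ph": 7.0},
]

print(row_containment(A, B))  # A is wholly contained in B
print(row_containment(B, A))  # B is not contained in A
```

A score of 1.0 in one direction and a lower score in the other suggests that B may be a later, extended version of A. Exact matching of this kind ignores reordered columns with renamed headers, reformatted values, and near-duplicate rows, all of which a practical tool would have to handle.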

    • Published in

      ACM Other conferences
      SSDBM '14: Proceedings of the 26th International Conference on Scientific and Statistical Database Management
      June 2014
      417 pages
      ISBN:9781450327220
      DOI:10.1145/2618243

      Copyright © 2014 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      SSDBM '14 paper acceptance rate: 26 of 71 submissions, 37%. Overall acceptance rate: 56 of 146 submissions, 38%.
