Goods: Organizing Google's Datasets

Authors:

Natalya F. Noy,

Christopher Olston,

Neoklis Polyzotis,

Steven Euijong Whang

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 795 - 806

https://doi.org/10.1145/2882903.2903730

Published: 14 June 2016

Abstract

Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, and may change every day. In this paper, we present GOODS, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. GOODS extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them so that others can use them, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.

Index Terms

Goods: Organizing Google's Datasets
1. Information systems
  1. Data management systems
    1. Database management system engines
    2. Information integration

Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

June 2016

2300 pages

ISBN:9781450335317

DOI:10.1145/2882903

General Chairs: Fatma Özcan (IBM Research, USA), Georgia Koutrika (HP Labs, USA)

Program Chair: Sam Madden (Massachusetts Institute of Technology, USA)

Copyright © 2016 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS'16

Sponsor:

SIGMOD

SIGMOD/PODS'16: International Conference on Management of Data

June 26 - July 1, 2016

San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 121

Total Downloads: 8,204

Downloads (Last 12 months): 887

Downloads (Last 6 weeks): 127

Reflects downloads up to 08 Mar 2025
