DOI: 10.1145/2882903.2903730
research-article
Open access

Goods: Organizing Google's Datasets

Published: 14 June 2016 Publication History

Abstract

Enterprises increasingly rely on structured datasets to run their businesses. These datasets take a variety of forms, such as structured files, databases, spreadsheets, or even services that provide access to the data. The datasets often reside in different storage systems, may vary in their formats, and may change every day. In this paper, we present GOODS, a project to rethink how we organize structured datasets at scale, in a setting where teams use diverse and often idiosyncratic ways to produce the datasets and where there is no centralized system for storing and querying them. GOODS extracts metadata ranging from salient information about each dataset (owners, timestamps, schema) to relationships among datasets, such as similarity and provenance. It then exposes this metadata through services that allow engineers to find datasets within the company, to monitor datasets, to annotate them so that others can use them, and to analyze relationships between them. We discuss the technical challenges that we had to overcome in order to crawl and infer the metadata for billions of datasets, to maintain the consistency of our metadata catalog at scale, and to expose the metadata to users. We believe that many of the lessons that we learned are applicable to building large-scale enterprise-level data-management systems in general.
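
To make the kind of metadata named in the abstract concrete, here is a minimal sketch of a catalog entry holding owners, a timestamp, a schema, and provenance links, together with two simple lookups. Everything in it is an assumption made for illustration: the `CatalogEntry` fields, the `find_by_owner` and `lineage` helpers, and the sample paths are hypothetical and do not reflect the actual GOODS schema or services.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one dataset (not the GOODS schema)."""
    path: str
    owners: List[str]
    last_modified: datetime
    schema: Dict[str, str]  # field name -> type name
    # Provenance: paths of the datasets this one was derived from.
    upstream: Set[str] = field(default_factory=set)


def find_by_owner(catalog: Dict[str, CatalogEntry], owner: str) -> List[str]:
    """Return the sorted paths of all datasets that list `owner` as an owner."""
    return sorted(e.path for e in catalog.values() if owner in e.owners)


def lineage(catalog: Dict[str, CatalogEntry], path: str) -> Set[str]:
    """Return every dataset reachable by following upstream links from `path`."""
    seen: Set[str] = set()
    stack = [path]
    while stack:
        entry = catalog.get(stack.pop())
        if entry is None:
            continue  # dataset not in the catalog; nothing to follow
        for parent in entry.upstream:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Illustrative data only; paths and owners are made up.
now = datetime(2016, 6, 14, tzinfo=timezone.utc)
catalog = {
    "/logs/raw": CatalogEntry("/logs/raw", ["alice"], now, {"line": "string"}),
    "/logs/clean": CatalogEntry(
        "/logs/clean", ["alice", "bob"], now, {"user": "string", "ts": "int64"},
        upstream={"/logs/raw"},
    ),
    "/reports/daily": CatalogEntry(
        "/reports/daily", ["bob"], now, {"day": "string", "count": "int64"},
        upstream={"/logs/clean"},
    ),
}
```

In this sketch, `find_by_owner(catalog, "bob")` lists the datasets that `bob` owns, and `lineage(catalog, "/reports/daily")` walks the provenance links back to the raw logs. A catalog at the scale the abstract describes would need persistent storage and incremental updates, which this in-memory dictionary does not attempt.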


Published In

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN: 9781450335317
DOI: 10.1145/2882903
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data culture
  2. data flow
  3. data lakes
  4. data monitoring
  5. data organization
  6. data provenance
  7. data search
  8. enterprise data management
  9. metadata extraction

Qualifiers

  • Research-article

Conference

SIGMOD/PODS'16: International Conference on Management of Data
June 26 - July 1, 2016
San Francisco, California, USA

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%
