research-article

Detecting Inclusion Dependencies on Very Many Tables

Authors:
Fabian Tschirschnitz

Hasso Plattner Institute, Potsdam, Germany

Hasso Plattner Institute, Potsdam, Germany
View Profile

,
Thorsten Papenbrock

Hasso Plattner Institute, Potsdam, Germany

Hasso Plattner Institute, Potsdam, Germany
View Profile

,
Felix Naumann

Hasso Plattner Institute, Potsdam, Germany

Hasso Plattner Institute, Potsdam, Germany
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 42 Issue 3Article No.: 18pp 1–29https://doi.org/10.1145/3105959

Published:31 July 2017Publication History

ACM Transactions on Database Systems

Abstract

Detecting inclusion dependencies, the prerequisite of foreign keys, in relational data is a challenging task. Detecting them among the hundreds of thousands or even millions of tables on the web is daunting. Still, such inclusion dependencies can help connect disparate pieces of information on the Web and reveal unknown relationships among tables.

With the algorithm Many, we present a novel inclusion dependency detection algorithm, specialized for the very many—but typically small—tables found on the Web. We make use of Bloom filters and indexed bit-vectors to show the feasibility of our approach. Our evaluation on two corpora of Web tables shows a superior runtime over known approaches and its usefulness to reveal hidden structures on the Web.

References

Ziawasch Abedjan, John Morcos, Michael Gubanov, Ihab F. Ilyas, Michael Stonebraker, Paolo Papotti, and Mourad Ouzzani. 2015. DataXFormer: Leveraging the web for semantic transformations. In Proceedings of the Conference on Innovative Data Systems Research (CIDR). www.cidrdb.org.Google Scholar
Jana Bauckmann, Ulf Leser, and Felix Naumann. 2010. Efficient and Exact Computation of Inclusion Dependencies for Data Integration. Technical Report. Universitátsverlag Potsdam.Google Scholar
Jana Bauckmann, Ulf Leser, Felix Naumann, and Veronique Tietz. 2007. Efficiently detecting inclusion dependencies. In Proceedings of the International Conference on Data Engineering (ICDE). IEEE, 1448--1450. Google ScholarCross Ref
Siegfried Bell and Peter Brockhausen. 1995. Discovery of Data Dependencies in Relational Databases. Technical Report. Universitát Dortmund.Google Scholar
Frank Benford. 1938. The law of anomalous numbers. Proceedings of the American Philosophical Society 78 (1938), 551--572.Google Scholar
Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. 2013. Methods for exploring and mining tables on wikipedia. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics. ACM, New York, NY, 18--26. Google ScholarDigital Library
Thomas Blásius, Tobias Friedrich, and Martin Schirneck. 2016. The parameterized complexity of dependency detection in relational databases. In International Symposium on Parametrized and Exact Computation (IPEC). Schloss Dagstuhl&dash;Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany.Google Scholar
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13, 7 (1970), 422--426. Google ScholarDigital Library
Michael J. Cafarella, Alon Y. Halevy, Daisy Z. Wang, Eugene Wu, and Yang Zhang. 2008. WebTables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment 1, 1 (2008), 538--549. Google ScholarDigital Library
Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Symposium on Theory of Computing (STOC). ACM, New York, NY, 380--388. Google ScholarDigital Library
Benjamin Kille, Frank Hopfgartner, Torben Brodt, and Tobias Heintz. 2013. The plista dataset. In Proceedings of the International Workshop and Challenge on News Recommender Systems. ACM, New York, NY, USA, 16--23. Google ScholarDigital Library
Stéphane Lopes, Jean-Marc Petit, and Farouk Toumani. 2002. Discovering interesting inclusion dependencies: Application to logical database tuning. Information Systems 27, 1 (2002), 1--19. Google ScholarDigital Library
Fabien De Marchi, Stphane Lopes, and Jean-Marc Petit. 2002. Efficient algorithms for mining inclusion dependencies. In Proceedings of the International Conference on Extending Database Technology (EDBT). Springer, Berlin, 464--476. Google ScholarCross Ref
Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015a. Data profiling with metanome. Proceedings of the VLDB Endowment 8, 12 (2015), 1860--1871. Google ScholarDigital Library
Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Naumann. 2015b. Divide 8 conquer-based inclusion dependency discovery. Proceedings of the VLDB Endowment 8, 7 (2015), 774--785.Google ScholarDigital Library
Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, and Ulf Leser. 2009. A machine learning approach to foreign key discovery. In Proceedings of the ACM Workshop on the Web and Databases (WebDB). ACM, Providence, RI.Google Scholar
Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri. 2012. InfoGather: Entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the International Conference on Management of Data (SIGMOD). ACM, New York, NY, 97--108. Google ScholarDigital Library
Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2010. On multi-column foreign key discovery. Proceedings of the VLDB Endowment 3, 1--2 (2010), 805--814.Google ScholarDigital Library

Index Terms

Detecting Inclusion Dependencies on Very Many Tables
1. Information systems
  1. Data management systems
    1. Database design and models
      1. Data model extensions
        Incomplete data
  2. World Wide Web
    1. Web mining
      1. Data extraction and integration

Recommendations

Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms
CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management

Inclusion dependencies are an important type of metadata in relational databases, because they indicate foreign key relationships and serve a variety of data management tasks, such as data linkage, query optimization, and data integration. The discovery ...
Read More
Discovering Similarity Inclusion Dependencies
PACMMOD

Inclusion dependencies (INDs) are a well-known type of data dependency, specifying that the values of one column are contained in those of another column. INDs can be used for various purposes, such as foreign-key candidate selection or join partner ...
Read More
Discovering interesting inclusion dependencies: application to logical database tuning
Databases: Creation, management and utilization

Inclusion dependencies together with functional dependencies form the most important data dependencies used in practice. Inclusion dependencies are important for various database applications such as database design and maintenance, semantic query ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Database Systems Volume 42, Issue 3
Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical Correspondence
September 2017
220 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3129336
Editor:
Christian S. Jensen
Aalborg University, Denmark
Issue’s Table of Contents
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 31 July 2017
- Accepted: 1 May 2017
- Revised: 1 April 2017
- Received: 1 February 2016
Published in tods Volume 42, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Inclusion dependency discovery
data profiling
foreign key discovery
web data management
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 18
  Total Citations
  View Citations
- 435
  Total Downloads
- Downloads (Last 12 months)39
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Detecting Inclusion Dependencies on Very Many Tables

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms

Discovering Similarity Inclusion Dependencies

Discovering interesting inclusion dependencies: application to logical database tuning

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Detecting Inclusion Dependencies on Very Many Tables

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms

Discovering Similarity Inclusion Dependencies

Discovering interesting inclusion dependencies: application to logical database tuning

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media