skip to main content
research-article

Detecting Inclusion Dependencies on Very Many Tables

Published:31 July 2017Publication History
Skip Abstract Section

Abstract

Detecting inclusion dependencies, the prerequisite of foreign keys, in relational data is a challenging task. Detecting them among the hundreds of thousands or even millions of tables on the web is daunting. Still, such inclusion dependencies can help connect disparate pieces of information on the Web and reveal unknown relationships among tables.

With the algorithm Many, we present a novel inclusion dependency detection algorithm, specialized for the very many—but typically small—tables found on the Web. We make use of Bloom filters and indexed bit-vectors to show the feasibility of our approach. Our evaluation on two corpora of Web tables shows a superior runtime over known approaches and its usefulness to reveal hidden structures on the Web.

References

  1. Ziawasch Abedjan, John Morcos, Michael Gubanov, Ihab F. Ilyas, Michael Stonebraker, Paolo Papotti, and Mourad Ouzzani. 2015. DataXFormer: Leveraging the web for semantic transformations. In Proceedings of the Conference on Innovative Data Systems Research (CIDR). www.cidrdb.org.Google ScholarGoogle Scholar
  2. Jana Bauckmann, Ulf Leser, and Felix Naumann. 2010. Efficient and Exact Computation of Inclusion Dependencies for Data Integration. Technical Report. Universitátsverlag Potsdam.Google ScholarGoogle Scholar
  3. Jana Bauckmann, Ulf Leser, Felix Naumann, and Veronique Tietz. 2007. Efficiently detecting inclusion dependencies. In Proceedings of the International Conference on Data Engineering (ICDE). IEEE, 1448--1450. Google ScholarGoogle ScholarCross RefCross Ref
  4. Siegfried Bell and Peter Brockhausen. 1995. Discovery of Data Dependencies in Relational Databases. Technical Report. Universitát Dortmund.Google ScholarGoogle Scholar
  5. Frank Benford. 1938. The law of anomalous numbers. Proceedings of the American Philosophical Society 78 (1938), 551--572.Google ScholarGoogle Scholar
  6. Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. 2013. Methods for exploring and mining tables on wikipedia. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics. ACM, New York, NY, 18--26. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Thomas Blásius, Tobias Friedrich, and Martin Schirneck. 2016. The parameterized complexity of dependency detection in relational databases. In International Symposium on Parametrized and Exact Computation (IPEC). Schloss Dagstuhl‐Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany.Google ScholarGoogle Scholar
  8. Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13, 7 (1970), 422--426. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Michael J. Cafarella, Alon Y. Halevy, Daisy Z. Wang, Eugene Wu, and Yang Zhang. 2008. WebTables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment 1, 1 (2008), 538--549. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Symposium on Theory of Computing (STOC). ACM, New York, NY, 380--388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Benjamin Kille, Frank Hopfgartner, Torben Brodt, and Tobias Heintz. 2013. The plista dataset. In Proceedings of the International Workshop and Challenge on News Recommender Systems. ACM, New York, NY, USA, 16--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Stéphane Lopes, Jean-Marc Petit, and Farouk Toumani. 2002. Discovering interesting inclusion dependencies: Application to logical database tuning. Information Systems 27, 1 (2002), 1--19. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Fabien De Marchi, Stphane Lopes, and Jean-Marc Petit. 2002. Efficient algorithms for mining inclusion dependencies. In Proceedings of the International Conference on Extending Database Technology (EDBT). Springer, Berlin, 464--476. Google ScholarGoogle ScholarCross RefCross Ref
  14. Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015a. Data profiling with metanome. Proceedings of the VLDB Endowment 8, 12 (2015), 1860--1871. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Naumann. 2015b. Divide 8 conquer-based inclusion dependency discovery. Proceedings of the VLDB Endowment 8, 7 (2015), 774--785.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, and Ulf Leser. 2009. A machine learning approach to foreign key discovery. In Proceedings of the ACM Workshop on the Web and Databases (WebDB). ACM, Providence, RI.Google ScholarGoogle Scholar
  17. Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri. 2012. InfoGather: Entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the International Conference on Management of Data (SIGMOD). ACM, New York, NY, 97--108. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2010. On multi-column foreign key discovery. Proceedings of the VLDB Endowment 3, 1--2 (2010), 805--814.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Detecting Inclusion Dependencies on Very Many Tables

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Database Systems
        ACM Transactions on Database Systems  Volume 42, Issue 3
        Invited Paper from SIGMOD 2015, Invited Paper from PODS 2015, Regular Papers and Technical Correspondence
        September 2017
        220 pages
        ISSN:0362-5915
        EISSN:1557-4644
        DOI:10.1145/3129336
        Issue’s Table of Contents

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 31 July 2017
        • Accepted: 1 May 2017
        • Revised: 1 April 2017
        • Received: 1 February 2016
        Published in tods Volume 42, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader