Abstract
Detecting inclusion dependencies, the prerequisite of foreign keys, in relational data is a challenging task. Detecting them among the hundreds of thousands or even millions of tables on the web is daunting. Still, such inclusion dependencies can help connect disparate pieces of information on the Web and reveal unknown relationships among tables.
With the algorithm Many, we present a novel inclusion dependency detection algorithm, specialized for the very many—but typically small—tables found on the Web. We make use of Bloom filters and indexed bit-vectors to show the feasibility of our approach. Our evaluation on two corpora of Web tables shows a superior runtime over known approaches and its usefulness to reveal hidden structures on the Web.
- Ziawasch Abedjan, John Morcos, Michael Gubanov, Ihab F. Ilyas, Michael Stonebraker, Paolo Papotti, and Mourad Ouzzani. 2015. DataXFormer: Leveraging the web for semantic transformations. In Proceedings of the Conference on Innovative Data Systems Research (CIDR). www.cidrdb.org.Google Scholar
- Jana Bauckmann, Ulf Leser, and Felix Naumann. 2010. Efficient and Exact Computation of Inclusion Dependencies for Data Integration. Technical Report. Universitátsverlag Potsdam.Google Scholar
- Jana Bauckmann, Ulf Leser, Felix Naumann, and Veronique Tietz. 2007. Efficiently detecting inclusion dependencies. In Proceedings of the International Conference on Data Engineering (ICDE). IEEE, 1448--1450. Google ScholarCross Ref
- Siegfried Bell and Peter Brockhausen. 1995. Discovery of Data Dependencies in Relational Databases. Technical Report. Universitát Dortmund.Google Scholar
- Frank Benford. 1938. The law of anomalous numbers. Proceedings of the American Philosophical Society 78 (1938), 551--572.Google Scholar
- Chandra Sekhar Bhagavatula, Thanapon Noraset, and Doug Downey. 2013. Methods for exploring and mining tables on wikipedia. In Proceedings of the ACM SIGKDD Workshop on Interactive Data Exploration and Analytics. ACM, New York, NY, 18--26. Google ScholarDigital Library
- Thomas Blásius, Tobias Friedrich, and Martin Schirneck. 2016. The parameterized complexity of dependency detection in relational databases. In International Symposium on Parametrized and Exact Computation (IPEC). Schloss Dagstuhl‐Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany.Google Scholar
- Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13, 7 (1970), 422--426. Google ScholarDigital Library
- Michael J. Cafarella, Alon Y. Halevy, Daisy Z. Wang, Eugene Wu, and Yang Zhang. 2008. WebTables: Exploring the power of tables on the web. Proceedings of the VLDB Endowment 1, 1 (2008), 538--549. Google ScholarDigital Library
- Moses S. Charikar. 2002. Similarity estimation techniques from rounding algorithms. In Proceedings of the Symposium on Theory of Computing (STOC). ACM, New York, NY, 380--388. Google ScholarDigital Library
- Benjamin Kille, Frank Hopfgartner, Torben Brodt, and Tobias Heintz. 2013. The plista dataset. In Proceedings of the International Workshop and Challenge on News Recommender Systems. ACM, New York, NY, USA, 16--23. Google ScholarDigital Library
- Stéphane Lopes, Jean-Marc Petit, and Farouk Toumani. 2002. Discovering interesting inclusion dependencies: Application to logical database tuning. Information Systems 27, 1 (2002), 1--19. Google ScholarDigital Library
- Fabien De Marchi, Stphane Lopes, and Jean-Marc Petit. 2002. Efficient algorithms for mining inclusion dependencies. In Proceedings of the International Conference on Extending Database Technology (EDBT). Springer, Berlin, 464--476. Google ScholarCross Ref
- Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015a. Data profiling with metanome. Proceedings of the VLDB Endowment 8, 12 (2015), 1860--1871. Google ScholarDigital Library
- Thorsten Papenbrock, Sebastian Kruse, Jorge-Arnulfo Quiané-Ruiz, and Naumann. 2015b. Divide 8 conquer-based inclusion dependency discovery. Proceedings of the VLDB Endowment 8, 7 (2015), 774--785.Google ScholarDigital Library
- Alexandra Rostin, Oliver Albrecht, Jana Bauckmann, Felix Naumann, and Ulf Leser. 2009. A machine learning approach to foreign key discovery. In Proceedings of the ACM Workshop on the Web and Databases (WebDB). ACM, Providence, RI.Google Scholar
- Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri. 2012. InfoGather: Entity augmentation and attribute discovery by holistic matching with web tables. In Proceedings of the International Conference on Management of Data (SIGMOD). ACM, New York, NY, 97--108. Google ScholarDigital Library
- Meihui Zhang, Marios Hadjieleftheriou, Beng Chin Ooi, Cecilia M. Procopiuc, and Divesh Srivastava. 2010. On multi-column foreign key discovery. Proceedings of the VLDB Endowment 3, 1--2 (2010), 805--814.Google ScholarDigital Library
Index Terms
- Detecting Inclusion Dependencies on Very Many Tables
Recommendations
Inclusion Dependency Discovery: An Experimental Evaluation of Thirteen Algorithms
CIKM '19: Proceedings of the 28th ACM International Conference on Information and Knowledge ManagementInclusion dependencies are an important type of metadata in relational databases, because they indicate foreign key relationships and serve a variety of data management tasks, such as data linkage, query optimization, and data integration. The discovery ...
Discovering Similarity Inclusion Dependencies
PACMMODInclusion dependencies (INDs) are a well-known type of data dependency, specifying that the values of one column are contained in those of another column. INDs can be used for various purposes, such as foreign-key candidate selection or join partner ...
Discovering interesting inclusion dependencies: application to logical database tuning
Databases: Creation, management and utilizationInclusion dependencies together with functional dependencies form the most important data dependencies used in practice. Inclusion dependencies are important for various database applications such as database design and maintenance, semantic query ...
Comments