Abstract
The World-Wide Web consists of a huge number of unstructured documents, but it also contains structured data in the form of HTML tables. We extracted 14.1 billion HTML tables from Google's general-purpose web crawl, and used statistical classification techniques to find the estimated 154M that contain high-quality relational data. Because each relational table has its own "schema" of labeled and typed columns, each such table can be considered a small structured database. The resulting corpus of databases is larger than any other corpus we are aware of, by at least five orders of magnitude.
We describe the WEBTABLES system to explore two fundamental questions about this collection of databases. First, what are effective techniques for searching for structured data at search-engine scales? Second, what additional power can be derived by analyzing such a huge corpus?
First, we develop new techniques for keyword search over a corpus of tables, and show that they can achieve substantially higher relevance than solutions based on a traditional search engine. Second, we introduce a new object derived from the database corpus: the attribute correlation statistics database (AcsDB) that records corpus-wide statistics on co-occurrences of schema elements. In addition to improving search relevance, the AcsDB makes possible several novel applications: schema auto-complete, which helps a database designer to choose schema elements; attribute synonym finding, which automatically computes attribute synonym pairs for schema matching; and join-graph traversal, which allows a user to navigate between extracted schemas using automatically generated join links.
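To make the idea of corpus-wide co-occurrence statistics concrete, the following is a minimal sketch rather than the WebTables implementation. It assumes each extracted schema is simply a list of attribute names, counts how often attributes appear alone and in pairs, and uses a conditional-probability estimate to suggest attributes in the spirit of schema auto-complete. The function names (`build_acsdb`, `cond_prob`, `suggest`), the ranking rule, and the toy schemas are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def build_acsdb(schemas):
    """Count the schemas each attribute, and each unordered attribute pair, appears in."""
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        # De-duplicate within a schema; sort so every pair has one canonical key.
        attrs = sorted(set(schema))
        attr_counts.update(attrs)
        pair_counts.update(combinations(attrs, 2))
    return attr_counts, pair_counts


def cond_prob(attr, given, attr_counts, pair_counts):
    """Estimate p(attr | given) as count(attr, given) / count(given)."""
    if attr_counts[given] == 0:
        return 0.0
    pair = tuple(sorted((attr, given)))
    return pair_counts[pair] / attr_counts[given]


def suggest(given, attr_counts, pair_counts, k=3):
    """Rank the other attributes by estimated p(attr | given); ties break alphabetically."""
    scored = [
        (cond_prob(a, given, attr_counts, pair_counts), a)
        for a in attr_counts
        if a != given
    ]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [a for p, a in scored[:k] if p > 0]


# Hypothetical toy corpus of extracted schemas (not data from the paper).
SCHEMAS = [
    ["make", "model", "year", "price"],
    ["make", "model", "color"],
    ["name", "address", "phone"],
    ["make", "model", "year"],
]

attr_counts, pair_counts = build_acsdb(SCHEMAS)
suggestions = suggest("make", attr_counts, pair_counts)
```

On this toy corpus, `model` co-occurs with `make` in every schema that contains `make`, so it ranks first among the suggestions. A real system would need smoothing and normalization of attribute labels, which this sketch omits.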