article

Towards a query optimizer for text-centric tasks

Authors:
Panagiotis G. Ipeirotis

New York University, New York, NY

New York University, New York, NY
View Profile

,
Eugene Agichtein

Emory University, Atlanta, GA

Emory University, Atlanta, GA
View Profile

,
Pranay Jain

Columbia University, New York, NY

Columbia University, New York, NY
View Profile

,
Luis Gravano

Columbia University, New York, NY

Columbia University, New York, NY
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 32 Issue 4pp 21–eshttps://doi.org/10.1145/1292609.1292611

Published:01 November 2007Publication History

ACM Transactions on Database Systems

Abstract

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.

References

Agichtein, E. 2005. Scaling information extraction to large document collections. IEEE Data Eng. Bull. 28, 4 (Dec.), 3--10.Google Scholar
Agichtein, E. and Gravano, L. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries (DL'00). Google ScholarDigital Library
Agichtein, E. and Gravano, L. 2003. Querying text databases for efficient information extraction. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE'03).Google Scholar
Agichtein, E., Ipeirotis, P. G., and Gravano, L. 2003. Modeling query-based access to text databases. In Proceedings of the Sixth International Workshop on the Web and Databases (WebDB'03). 87--92.Google Scholar
Avnur, R. and Hellerstein, J. M. 2000. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00). 261--272. Google ScholarDigital Library
Baayen, R. H. 2006. Word Frequency Distributions. Springer, Berlin, Germany.Google Scholar
Banko, M., Brill, E., Dumais, S., and Lin, J. 2002. AskMSR: Question answering using the World Wide Web. In Proceedings of 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases.Google Scholar
Bergman, M. K. 2001. The deep Web: Surfacing hidden value. J. Electron. Publish. 7, 1 (Aug.).Google ScholarCross Ref
Blake, C. L. and Merz, C. J. 1998. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.Google Scholar
Brin, S. 1998. Extracting patterns and relations from the World Wide Web. In Proceedings of the First International Workshop on the Web and Databases (WebDB'98). 172--183. Google ScholarDigital Library
Cafarella, M. J. and Etzioni, O. 2005. A search engine for natural language applications. In Proceedings of the 14th International World Wide Web Conference (WWW'05). 442--452. Google ScholarDigital Library
Cafarella, M. J., Re, C., Suciu, D., and Etzioni, O. 2007. Structured querying of Web text data: A technical challenge. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR'07). 225--234.Google Scholar
Callan, J. P. and Connell, M. 2001. Query-based sampling of text databases. ACM Trans. Inform. Syst. 19, 2, 97--130. Google ScholarDigital Library
Callan, J. P., Connell, M., and Du, A. 1999. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99). 479--490. Google ScholarDigital Library
Callan, J. P., Lu, Z., and Croft, W. B. 1995. Searching distributed collections with inference networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95). 21--28. Google ScholarDigital Library
Chakrabarti, S., Punera, K., and Subramanyam, M. 2002. Accelerated focused crawling through online relevance feedback. In Proceedings of the 11th International World Wide Web Conference (WWW11). 148--159. Google ScholarDigital Library
Chakrabarti, S., van den Berg, M., and Dom, B. 1999. Focused crawling: A new approach to topic-specific Web resource discovery. Comput. Netw. 31, 11--16 (May), 1623--1640. Google ScholarDigital Library
Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD'03). 313--324. Google ScholarDigital Library
Chaudhuri, S., Motwani, R., and Narasayya, V. R. 1998. Random sampling for histogram construction: How much is enough&quest; In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98). 436--447. Google ScholarDigital Library
Chaudhuri, S. and Shim, K. 1999. Optimization of queries with user-defined predicates. ACM Trans. Database Syst. 24, 2, 177--228. Google ScholarDigital Library
Chung, F. and Lu, L. 2002. Connected components in random graphs with given degree sequences. Ann. Combin. 6, 125--145.Google ScholarCross Ref
Cohen, W. W. 1996. Learning trees and rules with set-valued features. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), Eighth Conference on Innovative Applications of Artificial Intelligence (IAAI-96). 709--716. Google ScholarDigital Library
Cohen, W. W. and Singer, Y. 1996. Learning to query the Web. In Proceedings of the AAAI Workshop on Internet-Based Information Systems. 16--25.Google Scholar
Cunningham, H. 2006. Information extraction, automatic. In Encyclopedia of Language and Linguistics, 2nd ed. Elsevier Science, Amsterdam, The Netherlands.Google Scholar
Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. 2000. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Databases (VLDB'00). 527--534. Google ScholarDigital Library
Doorenbos, R. B., Etzioni, O., and Weld, D. S. 1997. A scalable comparison-shopping agent for the World-Wide Web. In AGENTS '97: Proceedings of the First International Conference on Autonomous Agents. 39--48. Google ScholarDigital Library
Duda, R. O., Hart, P. E., and Stork, D. G. 2000. Pattern Classification, 2nd ed. Wiley, New York, NY. Google ScholarDigital Library
Etzioni, O., Cafarella, M. J., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. 2004. Web-scale information extraction in KnowItAll (preliminary results). In Proceedings of the 13th International World Wide Web Conference (WWW'04). 100--110. Google ScholarDigital Library
Gravano, L., García-Molina, H., and Tomasic, A. 1999. GlOSS: Text-source discovery over the Internet. ACM Trans. Database Syst. 24, 2 (June), 229--264. Google ScholarDigital Library
Gravano, L., Ipeirotis, P. G., and Sahami, M. 2002. Query- vs. crawling-based classification of searchable Web databases. IEEE Data Eng. Bull. 25, 1 (Mar.), 43--50.Google Scholar
Grishman, R. 1997. Information extraction: Techniques and challenges. In Proceedings of Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, International Summer School, (SCIE-97). 10--27. Google ScholarDigital Library
Grishman, R., Huttunen, S., and Yangarber, R. 2002. Information extraction for enhanced access to disease outbreak reports. J. Biomed. Inform. 35, 4 (Aug.), 236--246. Google ScholarDigital Library
Ipeirotis, P. G., Agichtein, E., Jain, P., and Gravano, L. 2006. To search or to crawl&quest; Towards a query optimizer for text-centric tasks. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD'06). 265--276. Google ScholarDigital Library
Ipeirotis, P. G. and Gravano, L. 2002. Distributed search over the hidden Web: Hierarchical database sampling and selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB'02). 394--405. Google ScholarDigital Library
Ives, Z. G., Florescu, D., Friedman, M., Levy, A. Y., and Weld, D. S. 1999. An adaptive query execution system for data integration. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99). 299--310. Google ScholarDigital Library
Jain, A., Doan, A., and Gravano, L. 2007. SQL queries over unstructured text databases (poster paper). In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE'07).Google Scholar
Jones, R. 2005. Learning to extract entities from labeled and unlabeled text. Ph.D. dissertation. Carnegie Mellon University, School of Computer Science, Pittsburgh, PA.Google Scholar
Kabra, N. and DeWitt, D. J. 1998. Efficient mid-query re-optimization of sub-optimal query execution plans. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98). 106--117. Google ScholarDigital Library
Ling, Y. and Sun, W. 1995. An evaluation of sampling-based size estimation methods for selections in database systems. In Proceedings of the 11th IEEE International Conference on Data Engineering (ICDE'95). 532--539. Google ScholarDigital Library
Markl, V., Raman, V., Simmen, D., Lohman, G., Pirahesh, H., and Cilimdzic, M. 2004. Robust query processing through progressive optimization. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD'04). 659--670. Google ScholarDigital Library
McCallum, A. 2005. Information extraction: Distilling structured data from unstructured text. ACM Queue 3, 9, 48--57. Google ScholarDigital Library
Menczer, F., Pant, G., and Srinivasan, P. 2004. Topical Web crawlers: Evaluating adaptive algorithms. ACM Trans. Internet Tech. 4, 4 (Nov.), 378--419. Google ScholarDigital Library
Mitzenmacher, M. 2004. Dynamic models for file sizes and double pareto distributions. Internet Math. 1, 3, 305--334.Google ScholarCross Ref
Newman, M. E. J., Strogatz, S. H., and Watts, D. J. 2001. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E (Statistical, Nonlinear, and Soft Matter Physics) 64, 2 (Aug.), 026118 (1--17).Google ScholarCross Ref
Ntoulas, A., Zerfos, P., and Cho, J. 2005. Downloading textual hidden Web content by keyword queries. In Proceedings of the Fifth ACM+IEEE Joint Conference on Digital Libraries (JCDL'05). Google ScholarDigital Library
Oard, D. W. 1997. The state of the art in text filtering. User Model. User-Adapt. Interact. 7, 3, 141--178. Google ScholarDigital Library
Ross, S. M. 2002. Introduction to Probability Models, 8th ed. Academic Press, New York, NY. Google ScholarDigital Library
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1 (Mar.), 1--47. Google ScholarDigital Library
Vapnik, V. N. 1998. Statistical Learning Theory. Wiley-Interscience, New York, NY. Google ScholarDigital Library
Wilf, H. S. 1990. Generatingfunctionology. Academic Press Professional, Inc., New York, NY. Google ScholarDigital Library
Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET system as used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7).Google Scholar
Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA.Google Scholar

Index Terms

Towards a query optimizer for text-centric tasks
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Recommendations

To search or to crawl?: towards a query optimizer for text-centric tasks
SIGMOD '06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused ...
Read More
Distributed text retrieval from overlapping collections
ADC '07: Proceedings of the eighteenth conference on Australasian database - Volume 63

In standard text retrieval systems, the documents are gathered and indexed on a single server. In distributed information retrieval (DIR), the documents are held in multiple collections; answers to queries are produced by selecting the collections to ...
Read More
The MemSQL query optimizer: a modern optimizer for real-time analytics in a distributed database

Real-time analytics on massive datasets has become a very common need in many enterprises. These applications require not only rapid data ingest, but also quick answers to analytical queries operating on the latest data. MemSQL is a distributed SQL ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 32, Issue 4
November 2007
364 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1292609
Issue’s Table of Contents

Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 November 2007
Published in tods Volume 32, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Metasearching
distributed information retrieval
focused crawling
information extraction
text database selection
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 20
  Total Citations
  View Citations
- 912
  Total Downloads
- Downloads (Last 12 months)2
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Towards a query optimizer for text-centric tasks

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

To search or to crawl?: towards a query optimizer for text-centric tasks

Distributed text retrieval from overlapping collections

The MemSQL query optimizer: a modern optimizer for real-time analytics in a distributed database

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Towards a query optimizer for text-centric tasks

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

To search or to crawl?: towards a query optimizer for text-centric tasks

Distributed text retrieval from overlapping collections

The MemSQL query optimizer: a modern optimizer for real-time analytics in a distributed database

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media