skip to main content
article

Towards a query optimizer for text-centric tasks

Published:01 November 2007Publication History
Skip Abstract Section

Abstract

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.

References

  1. Agichtein, E. 2005. Scaling information extraction to large document collections. IEEE Data Eng. Bull. 28, 4 (Dec.), 3--10.Google ScholarGoogle Scholar
  2. Agichtein, E. and Gravano, L. 2000. Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries (DL'00). Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Agichtein, E. and Gravano, L. 2003. Querying text databases for efficient information extraction. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE'03).Google ScholarGoogle Scholar
  4. Agichtein, E., Ipeirotis, P. G., and Gravano, L. 2003. Modeling query-based access to text databases. In Proceedings of the Sixth International Workshop on the Web and Databases (WebDB'03). 87--92.Google ScholarGoogle Scholar
  5. Avnur, R. and Hellerstein, J. M. 2000. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00). 261--272. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Baayen, R. H. 2006. Word Frequency Distributions. Springer, Berlin, Germany.Google ScholarGoogle Scholar
  7. Banko, M., Brill, E., Dumais, S., and Lin, J. 2002. AskMSR: Question answering using the World Wide Web. In Proceedings of 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases.Google ScholarGoogle Scholar
  8. Bergman, M. K. 2001. The deep Web: Surfacing hidden value. J. Electron. Publish. 7, 1 (Aug.).Google ScholarGoogle ScholarCross RefCross Ref
  9. Blake, C. L. and Merz, C. J. 1998. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.Google ScholarGoogle Scholar
  10. Brin, S. 1998. Extracting patterns and relations from the World Wide Web. In Proceedings of the First International Workshop on the Web and Databases (WebDB'98). 172--183. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Cafarella, M. J. and Etzioni, O. 2005. A search engine for natural language applications. In Proceedings of the 14th International World Wide Web Conference (WWW'05). 442--452. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Cafarella, M. J., Re, C., Suciu, D., and Etzioni, O. 2007. Structured querying of Web text data: A technical challenge. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR'07). 225--234.Google ScholarGoogle Scholar
  13. Callan, J. P. and Connell, M. 2001. Query-based sampling of text databases. ACM Trans. Inform. Syst. 19, 2, 97--130. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Callan, J. P., Connell, M., and Du, A. 1999. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99). 479--490. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Callan, J. P., Lu, Z., and Croft, W. B. 1995. Searching distributed collections with inference networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95). 21--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Chakrabarti, S., Punera, K., and Subramanyam, M. 2002. Accelerated focused crawling through online relevance feedback. In Proceedings of the 11th International World Wide Web Conference (WWW11). 148--159. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Chakrabarti, S., van den Berg, M., and Dom, B. 1999. Focused crawling: A new approach to topic-specific Web resource discovery. Comput. Netw. 31, 11--16 (May), 1623--1640. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD'03). 313--324. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Chaudhuri, S., Motwani, R., and Narasayya, V. R. 1998. Random sampling for histogram construction: How much is enough? In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98). 436--447. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Chaudhuri, S. and Shim, K. 1999. Optimization of queries with user-defined predicates. ACM Trans. Database Syst. 24, 2, 177--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Chung, F. and Lu, L. 2002. Connected components in random graphs with given degree sequences. Ann. Combin. 6, 125--145.Google ScholarGoogle ScholarCross RefCross Ref
  22. Cohen, W. W. 1996. Learning trees and rules with set-valued features. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), Eighth Conference on Innovative Applications of Artificial Intelligence (IAAI-96). 709--716. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Cohen, W. W. and Singer, Y. 1996. Learning to query the Web. In Proceedings of the AAAI Workshop on Internet-Based Information Systems. 16--25.Google ScholarGoogle Scholar
  24. Cunningham, H. 2006. Information extraction, automatic. In Encyclopedia of Language and Linguistics, 2nd ed. Elsevier Science, Amsterdam, The Netherlands.Google ScholarGoogle Scholar
  25. Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. 2000. Focused crawling using context graphs. In Proceedings of the 26th International Conference on Very Large Databases (VLDB'00). 527--534. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Doorenbos, R. B., Etzioni, O., and Weld, D. S. 1997. A scalable comparison-shopping agent for the World-Wide Web. In AGENTS '97: Proceedings of the First International Conference on Autonomous Agents. 39--48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Duda, R. O., Hart, P. E., and Stork, D. G. 2000. Pattern Classification, 2nd ed. Wiley, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Etzioni, O., Cafarella, M. J., Downey, D., Kok, S., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D. S., and Yates, A. 2004. Web-scale information extraction in KnowItAll (preliminary results). In Proceedings of the 13th International World Wide Web Conference (WWW'04). 100--110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Gravano, L., García-Molina, H., and Tomasic, A. 1999. GlOSS: Text-source discovery over the Internet. ACM Trans. Database Syst. 24, 2 (June), 229--264. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Gravano, L., Ipeirotis, P. G., and Sahami, M. 2002. Query- vs. crawling-based classification of searchable Web databases. IEEE Data Eng. Bull. 25, 1 (Mar.), 43--50.Google ScholarGoogle Scholar
  31. Grishman, R. 1997. Information extraction: Techniques and challenges. In Proceedings of Information Extraction: A Multidisciplinary Approach to an Emerging Information Technology, International Summer School, (SCIE-97). 10--27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Grishman, R., Huttunen, S., and Yangarber, R. 2002. Information extraction for enhanced access to disease outbreak reports. J. Biomed. Inform. 35, 4 (Aug.), 236--246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Ipeirotis, P. G., Agichtein, E., Jain, P., and Gravano, L. 2006. To search or to crawl? Towards a query optimizer for text-centric tasks. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD'06). 265--276. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Ipeirotis, P. G. and Gravano, L. 2002. Distributed search over the hidden Web: Hierarchical database sampling and selection. In Proceedings of the 28th International Conference on Very Large Databases (VLDB'02). 394--405. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Ives, Z. G., Florescu, D., Friedman, M., Levy, A. Y., and Weld, D. S. 1999. An adaptive query execution system for data integration. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99). 299--310. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Jain, A., Doan, A., and Gravano, L. 2007. SQL queries over unstructured text databases (poster paper). In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE'07).Google ScholarGoogle Scholar
  37. Jones, R. 2005. Learning to extract entities from labeled and unlabeled text. Ph.D. dissertation. Carnegie Mellon University, School of Computer Science, Pittsburgh, PA.Google ScholarGoogle Scholar
  38. Kabra, N. and DeWitt, D. J. 1998. Efficient mid-query re-optimization of sub-optimal query execution plans. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD'98). 106--117. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Ling, Y. and Sun, W. 1995. An evaluation of sampling-based size estimation methods for selections in database systems. In Proceedings of the 11th IEEE International Conference on Data Engineering (ICDE'95). 532--539. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Markl, V., Raman, V., Simmen, D., Lohman, G., Pirahesh, H., and Cilimdzic, M. 2004. Robust query processing through progressive optimization. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD'04). 659--670. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. McCallum, A. 2005. Information extraction: Distilling structured data from unstructured text. ACM Queue 3, 9, 48--57. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Menczer, F., Pant, G., and Srinivasan, P. 2004. Topical Web crawlers: Evaluating adaptive algorithms. ACM Trans. Internet Tech. 4, 4 (Nov.), 378--419. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Mitzenmacher, M. 2004. Dynamic models for file sizes and double pareto distributions. Internet Math. 1, 3, 305--334.Google ScholarGoogle ScholarCross RefCross Ref
  44. Newman, M. E. J., Strogatz, S. H., and Watts, D. J. 2001. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E (Statistical, Nonlinear, and Soft Matter Physics) 64, 2 (Aug.), 026118 (1--17).Google ScholarGoogle ScholarCross RefCross Ref
  45. Ntoulas, A., Zerfos, P., and Cho, J. 2005. Downloading textual hidden Web content by keyword queries. In Proceedings of the Fifth ACM+IEEE Joint Conference on Digital Libraries (JCDL'05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Oard, D. W. 1997. The state of the art in text filtering. User Model. User-Adapt. Interact. 7, 3, 141--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Ross, S. M. 2002. Introduction to Probability Models, 8th ed. Academic Press, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1 (Mar.), 1--47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Vapnik, V. N. 1998. Statistical Learning Theory. Wiley-Interscience, New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Wilf, H. S. 1990. Generatingfunctionology. Academic Press Professional, Inc., New York, NY. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET system as used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7).Google ScholarGoogle Scholar
  52. Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA.Google ScholarGoogle Scholar

Index Terms

  1. Towards a query optimizer for text-centric tasks

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image ACM Transactions on Database Systems
      ACM Transactions on Database Systems  Volume 32, Issue 4
      November 2007
      364 pages
      ISSN:0362-5915
      EISSN:1557-4644
      DOI:10.1145/1292609
      Issue’s Table of Contents

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 November 2007
      Published in tods Volume 32, Issue 4

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader