skip to main content
10.1145/2588555.2610531acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Dynamically optimizing queries over large scale data platforms

Published:18 June 2014Publication History

ABSTRACT

Enterprises are adapting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this paper, we propose new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.

References

  1. S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Re-optimizing data-parallel computing. In NSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Babu, P. Bizarro, and D. J. DeWitt. Proactive re-optimization. In SIGMOD Conference, pages 107--118, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In SoCC, pages 119--130, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. A. Bernstein, N. Goodman, E. Wong, C. L. Reeve, and J. B. R. Jr. Query processing in a system for distributed databases (SDD-1). ACM Trans. Database Syst., 6(4):602--625, 1981. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. K. S. Beyer, V. Ercegovac, R. Gemulla, A. Balmin, M. Y. Eltabakh, C.-C. Kanne, F. Özcan, and E. J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 4(12), 2011.Google ScholarGoogle Scholar
  6. K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and R. Gemulla. On synopses for distinct-value estimation under multiset operations. In SIGMOD, pages 199--210, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In SIGMOD, pages 975--986, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. N. Bruno, S. Jain, and J. Zhou. Continuous cloud-scale query optimization and processing. In VLDB, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In PODS, pages 268--279, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chaudhuri, G. Das, and U. Srivastava. Effective use of block-level sampling in statistics estimation. In SIGMOD Conference, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM Trans. Database Syst., 24(2):177--228, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Columbia Query Optimizer. http://web.cecs.pdx.edu/len/Columbia.Google ScholarGoogle Scholar
  13. B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In SoCC, pages 143--154, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Deshpande, Z. G. Ives, and V. Raman. Adaptive query processing. Foundations and Trends in Databases, 1(1):1--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. J. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85--98, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Gates, J. Dai, and T. Nair. Apache Pig's optimizer. IEEE Data Eng. Bull., 36(1):34--45, 2013.Google ScholarGoogle Scholar
  18. A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a highlevel dataflow system on top of MapReduce: The Pig experience. PVLDB, 2(2):1414--1425, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. A. Ghazal, T. Rabl, M. Hu, F. Raab, M. Poess, A. Crolotte, and H.-A. Jacobsen. BigBench: towards an industry standard benchmark for big data analytics. In SIGMOD, pages 1197--1208, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2):73--170, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. G. Graefe. The Cascades framework for query optimization. IEEE Data Eng. Bull., 18(3):19--29, 1995.Google ScholarGoogle Scholar
  22. W.-S. Han, J. Ng, V. Markl, H. Kache, and M. Kandil. Progressive optimization in a shared-nothing parallel database. In SIGMOD, pages 809--820, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Z. He, B. S. Lee, and R. R. Snapp. Self-tuning cost modeling of user-defined functions in an object-relational dbms. ACM Trans. Database Syst., 30(3):812--853, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. M. Hellerstein. Optimization techniques for queries with expensive methods. ACM TODS, 23(2):113--157, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. F. Hueske, M. Peters, M. Sax, A. Rheinländer, R. Bergmann, A. Krettek, and K. Tzoumas. Opening the black boxes in data flow optimization. PVLDB, 5(11):1256--1267, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. I. F. Ilyas, V. Markl, P. J. Haas, P. Brown, and A. Aboulnaga. CORDS: Automatic discovery of correlations and soft functional dependencies. In SIGMOD, pages 647--658, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Y. E. Ioannidis and S. Christodoulakis. On the propagation of errors in the size of join results. In SIGMOD Conference, pages 268--277, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. N. Kabra and D. J. DeWitt. Efficient mid-query re-optimization of sub-optimal query execution plans. In SIGMOD Conference, pages 106--117, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. B. S. Lee, L. Chen, J. Buzas, and V. Kannoth. Regression-based self-tuning modeling of smooth user-defined function costs for an object-relational database management system query optimizer. Comput. J., 47(6):673--693, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  30. H. Lim, H. Herodotou, and S. Babu. Stubby: A transformation-based optimizer for mapreduce workflows. PVLDB, 5(11):1196--1207, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. V. Markl, V. Raman, D. E. Simmen, G. M. Lohman, and H. Pirahesh. Robust query processing through progressive optimization. In SIGMOD, pages 659--670, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large MapReduce jobs. PVLDB, 4(11):1135--1145, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165--178, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. H. Pirahesh, J. M. Hellerstein, and W. Hasan. Extensible/rule based query rewrite optimization in Starburst. In SIGMOD, pages 39--48, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pages 23--34, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - a warehousing solution over a Map-Reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. TPC-H Benchmark. http://www.tpc.org/tpch.Google ScholarGoogle Scholar
  38. R. Vernica, A. Balmin, K. S. Beyer, and V. Ercegovac. Adaptive MapReduce using situation-aware mappers. In EDBT, pages 420--431, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. S. Wu, F. Li, S. Mehrotra, and B. C. Ooi. Query optimization for massively parallel data processing. In SoCC, page 12, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: SQL and rich analytics at scale. In SIGMOD, pages 13--24, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1--14, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Dynamically optimizing queries over large scale data platforms

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '14: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data
        June 2014
        1645 pages
        ISBN:9781450323765
        DOI:10.1145/2588555

        Copyright © 2014 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 June 2014

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '14 Paper Acceptance Rate107of421submissions,25%Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader