ABSTRACT
The availability of large data centers with tens of thousands of servers has led to the popular adoption of massive parallelism for data analysis on large datasets. Several query languages exist for running queries on massively parallel architectures, some based on the MapReduce infrastructure, others using proprietary implementations. Motivated by this trend, this paper analyzes the parallel complexity of conjunctive queries. We propose a very simple model of parallel computation that captures these architectures, in which the complexity parameter is the number of parallel steps requiring synchronization of all servers. We study the complexity of conjunctive queries and give a complete characterization of the queries which can be computed in one parallel step. These form a strict subset of hierarchical queries, and include flat queries like R(x,y), S(x,z), T(x,v), U(x,w), tall queries like R(x), S(x,y), T(x,y,z), U(x,y,z,w), and combinations thereof, which we call tall-flat queries. We describe an algorithm for computing in parallel any tall-flat query, and prove that any query that is not tall-flat cannot be computed in one step in this model. Finally, we present extensions of our results to queries that are not tall-flat.
- F. N. Afrati and J. D. Ullman. Optimizing joins in a map-reduce environment. In EDBT, pages 99--110, 2010. Google ScholarDigital Library
- P. Alvaro, W. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. C. Sears. Dedalus: Datalog in time and space. Technical Report UCB/EECS-2009-173, EECS Department, University of California, Berkeley, Dec 2009.Google Scholar
- L. Carter and M. N. Wegman. Universal classes of hash functions. J. Comput. Syst. Sci., 18(2):143--154, 1979.Google ScholarCross Ref
- R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1:1265--1276, August 2008. Google ScholarDigital Library
- S. Cohen. Containment of aggregate queries. SIGMOD Record, 34(1):77--85, 2005. Google ScholarDigital Library
- D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. E. Santos, R. Subramonian, and T. von Eicken. Logp: Towards a realistic model of parallel computation. In PPOPP, pages 1--12, 1993. Google ScholarDigital Library
- N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB J., 16(4):523--544, 2007. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137--150, 2004. Google ScholarDigital Library
- D. J. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85--98, 1992. Google ScholarDigital Library
- A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a highlevel dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414--1425, 2009. Google ScholarDigital Library
- M. Grohe, Y. Gurevich, D. Leinders, N. Schweikardt, J. Tyszkiewicz, and J. V. den Bussche. Database query processing using finite cursor machines. Theory Comput. Syst., 44(4):533--560, 2009. Google ScholarDigital Library
- J. M. Hellerstein. The declarative imperative: experiences and conjectures in distributed logic. SIGMOD Rec., 39:5--19, September 2010. Google ScholarDigital Library
- N. Immerman. Expressibility and parallel complexity. SIAM J. Comput., 18(3):625--638, 1989. Google ScholarDigital Library
- H. J. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for mapreduce. In SODA, pages 938--948, 2010. Google ScholarDigital Library
- P. Koutris and D. Suciu. Parallel evaluation of conjunctive queries. Research Report UW-CSE-11-03-01, University of Washington, 2011.Google Scholar
- L. Libkin. Elements of Finite Model Theory. Springer, 2004. Google ScholarDigital Library
- S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. PVLDB, 3(1):330--339, 2010. Google ScholarDigital Library
- A. Pagh and R. Pagh. Uniform hashing in constant time and optimal space. SIAM J. Comput., 38(1):85--96, 2008. Google ScholarDigital Library
- M. Raab and A. Steger. "balls into bins" - a simple and tight analysis. In RANDOM, pages 159--170, 1998. Google ScholarDigital Library
- P. Sanders. On the competitive analysis of randomized static load balancing. In Proceedings of the first Workshop on Randomized Parallel Algorithms, RANDOM, 1996.Google Scholar
- L. J. Stockmeyer and U. Vishkin. Simulation of parallel random access machines by circuits. SIAM J. Comput., 13(2):409--422, 1984.Google ScholarCross Ref
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarDigital Library
- L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, 1990. Google ScholarDigital Library
- Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1043--1052, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1--14, 2008. Google ScholarDigital Library
Index Terms
- Parallel evaluation of conjunctive queries
Recommendations
The complexity of acyclic conjunctive queries
This paper deals with the evaluation of acyclic Boolean conjunctive queries in relational databases. By well-known results of Yannakakis[1981], this problem is solvable in polynomial time; its precise complexity, however, has not been pinpointed so far. We ...
The Complexity of Acyclic Conjunctive Queries
FOCS '98: Proceedings of the 39th Annual Symposium on Foundations of Computer ScienceWe show that the problem of evaluating acyclic Boolean database-queries is LOGCFL-complete and thus highly parallelizable. We present a parallel database algorithm solving this problem with a logarithmic number of parallel join operations. We also show ...
Interleaving a Join Sequence with Semijoins in Distributed Query Processing
The problem of combining join and semijoin reducers for distributed query processing is studied. An approach based on interleaving a join sequence with beneficial semijoins is proposed. A join sequence is mapped into a join sequence tree first. The join ...
Comments