skip to main content
10.1145/3106237.3117774acmconferencesArticle/Chapter ViewAbstractPublication PagesfseConference Proceedingsconference-collections
research-article

Static analysis for optimizing big data queries

Published: 21 August 2017 Publication History

Abstract

Query languages for big data analysis provide user extensibility through a mechanism of user-defined operators (UDOs). These operators allow programmers to write proprietary functionalities on top of a relational query skeleton. However, achieving effective query optimization for such languages is extremely challenging since the optimizer needs to understand data dependencies induced by UDOs. SCOPE, the query language from Microsoft, allows for hand coded declarations of UDO data dependencies. Unfortunately, most programmers avoid using this facility since writing and maintaining the declarations is tedious and error-prone. In this work, we designed and implemented two sound and robust static analyses for computing UDO data dependencies. The analyses can detect what columns of an input table are never used or pass-through a UDO unchanged. This information can be used to significantly improve execution of SCOPE scripts. We evaluate our analyses on thousands of real-world queries and show we can catch many unused and pass-through columns automatically without relying on any manually provided declarations.

References

[1]
U-sql, the new big data language for azure data lake. https://azure.microsoft.com/ en-us/blog/u-sql-the-new-big-data-language-for-azure-data-lake/. Accessed: 2017-05-09.
[2]
A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.
[3]
M. Barnett, M. Fähndrich, F. Logozzo, and D. Garbervetsky. Annotations for (more) precise points-to analysis. In IWACO, pages 11ś18, 2007.
[4]
B. Blanchet. Escape analysis: correctness proof, implementation and experimental results. In In POPL, pages 25ś37. ACM, 1998.
[5]
R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and eicient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2):1265ś1276, 2008.
[6]
P. Cousot and R. Cousot. Abstract interpretation frameworks. Journal of logic and computation, 2(4):511ś547, 1992.
[7]
Z. Guo, X. Fan, R. Chen, J. Zhang, H. Zhou, S. McDirmid, C. Liu, W. Lin, J. Zhou, and L. Zhou. Spotting code optimizations in data-parallel pipelines through periscope. In OSDI, pages 121ś133, 2012.
[8]
B. Livshits, M. Sridharan, Y. Smaragdakis, O. Lhoták, J. N. Amaral, B. E. Chang, S. Z. Guyer, U. P. Khedker, A. Mùller, and D. Vardoulakis. In defense of soundiness: a manifesto. Commun. ACM, 58(2):44ś46, 2015.
[9]
F. Logozzo. Clousot: Static contract checking with abstract interpretation. Formal Veriication of Object-Oriented Software, page 5.
[10]
S. S. Muchnick. Advanced compiler design implementation. Morgan Kaufmann, 1997.
[11]
A. Sălcianu and M. Rinard. Purity and side efect analysis for java programs. In In VMCAI, pages 199ś215. Springer, 2005.
[12]
B. Steensgaard. Points-to analysis in almost linear time. In In POPL, pages 32ś41. ACM, 1996.
[13]
S. Xia, M. Fähndrich, and F. Logozzo. Inferring datalow properties of user deined table processors. In SAS, pages 19ś35, 2009.
[14]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.
[15]
J. Zhou, N. Bruno, M. Wu, P. Larson, R. Chaiken, and D. Shakib. SCOPE: parallel databases meet mapreduce. VLDB J., 21(5):611ś636, 2012.

Cited By

View all
  • (2021)Design of Distributed Network Mass Data Processing System based on Cloud Computing Technology2021 5th International Conference on Trends in Electronics and Informatics (ICOEI)10.1109/ICOEI51242.2021.9452963(1317-1320)Online publication date: 3-Jun-2021
  • (2021)SODA: A Semantics-Aware Optimization Framework for Data-Intensive Applications Using Hybrid Program Analysis2021 IEEE 14th International Conference on Cloud Computing (CLOUD)10.1109/CLOUD53861.2021.00058(433-444)Online publication date: Sep-2021
  • (2019)NiijimaProceedings of the 27th ACM Symposium on Operating Systems Principles10.1145/3341301.3359649(306-321)Online publication date: 27-Oct-2019
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ESEC/FSE 2017: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering
August 2017
1073 pages
ISBN:9781450351058
DOI:10.1145/3106237
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 August 2017

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Big Data
  2. Query optimization
  3. Static analysis
  4. UDOs

Qualifiers

  • Research-article

Conference

ESEC/FSE'17
Sponsor:

Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)8
  • Downloads (Last 6 weeks)0
Reflects downloads up to 08 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2021)Design of Distributed Network Mass Data Processing System based on Cloud Computing Technology2021 5th International Conference on Trends in Electronics and Informatics (ICOEI)10.1109/ICOEI51242.2021.9452963(1317-1320)Online publication date: 3-Jun-2021
  • (2021)SODA: A Semantics-Aware Optimization Framework for Data-Intensive Applications Using Hybrid Program Analysis2021 IEEE 14th International Conference on Cloud Computing (CLOUD)10.1109/CLOUD53861.2021.00058(433-444)Online publication date: Sep-2021
  • (2019)NiijimaProceedings of the 27th ACM Symposium on Operating Systems Principles10.1145/3341301.3359649(306-321)Online publication date: 27-Oct-2019
  • (2019)Going big: a large-scale study on what big data developers askProceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering10.1145/3338906.3338939(432-442)Online publication date: 12-Aug-2019
  • (2017)Optimizing Big-Data Queries Using Program SynthesisProceedings of the 26th Symposium on Operating Systems Principles10.1145/3132747.3132773(631-646)Online publication date: 14-Oct-2017

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media