DOI: 10.1145/3035918.3058735
short-paper

The Best of Both Worlds: Big Data Programming with Both Productivity and Performance

Published: 09 May 2017

Abstract

Coarse-grained operators such as map and reduce have been widely used for large-scale data processing. While they are easy to master, over-simplified APIs sometimes prevent programmers from exercising fine-grained control over how computation is performed, and hence from designing more efficient algorithms. On the other hand, resorting to domain-specific languages (DSLs) is not a practical solution either, since programmers may need to learn many systems that differ greatly from one another, and the use of low-level tools may even lead to error-prone programming.
In our prior work [7], we proposed Husky, which provides a highly expressive API to resolve this dilemma. It allows developers to program in a variety of patterns, such as MapReduce, GAS, vertex-centric programs, and even asynchronous machine learning. While the Husky C++ engine provides great performance, in this demo proposal we introduce PyHusky and ScHusky, which allow users (e.g., data scientists) without system knowledge or low-level programming skills to leverage the performance of Husky and build high-level applications with ease using Python and Scala.
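
To make the notion of coarse-grained operators concrete, the following is a minimal single-machine sketch of a word count written only with a map step and a reduce step. It uses plain Python and does not use or reproduce the PyHusky API; the function names `map_phase`, `reduce_phase`, and `word_count` are hypothetical and chosen for illustration only.

```python
from functools import reduce


def map_phase(lines):
    """Coarse-grained map: emit a (word, 1) pair for every word in every line."""
    return [(word, 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Coarse-grained reduce: sum the counts that share the same word."""
    def combine(acc, pair):
        word, count = pair
        acc[word] = acc.get(word, 0) + count
        return acc

    return reduce(combine, pairs, {})


def word_count(lines):
    # The programmer only chooses the two functions; how the data is
    # partitioned, shuffled, and scheduled is outside their control.
    return reduce_phase(map_phase(lines))


counts = word_count(["big data", "big performance"])
```

The sketch shows why such operators are easy to learn: the whole program is two small functions. It also hints at the limitation the abstract describes, since nothing in this style lets the programmer decide how or when intermediate results are exchanged.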

References

[1] Apache Flink. https://flink.apache.org/.
[2] J. Canny and H. Zhao. Butterfly mixing: Accelerating incremental-update algorithms on clusters. In Proceedings of the 13th SIAM International Conference on Data Mining, May 2-4, 2013, Austin, Texas, USA, pages 785–793, 2013.
[3] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17–30, 2012.
[4] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing. On model parallelization and scheduling strategies for distributed machine learning. In NIPS, pages 2834–2842, 2014.
[5] J. Li, J. Cheng, Y. Zhao, F. Yang, Y. Huang, H. Chen, and R. Zhao. A comparison of general-purpose distributed systems for data processing. In IEEE International Conference on Big Data, pages 378–383, 2016.
[6] G. Malewicz, M. H. Austern, A. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, pages 135–146, 2010.
[7] F. Yang, J. Li, and J. Cheng. Husky: Towards a more efficient and expressive distributed computing framework. PVLDB, 9(5):420–431, 2016.
[8] F. Yang, F. Shang, Y. Huang, J. Cheng, J. Li, Y. Zhao, and R. Zhao. LFTF: A framework for efficient tensor analytics at scale. PVLDB, 10(7), 2017.
[9] H. Yun, H. Yu, C. Hsieh, S. V. N. Vishwanathan, and I. S. Dhillon. NOMAD: Nonlocking, stochastic multi-machine algorithm for asynchronous and decentralized matrix completion. PVLDB, 7(11):975–986, 2014.
[10] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 15–28, 2012.


    Published In

    SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
    May 2017
    1810 pages
    ISBN: 9781450341974
    DOI: 10.1145/3035918

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data-parallel processing
    2. distributed system
    3. programming model

    Qualifiers

    • Short-paper

    Funding Sources

    • ITF
    • Hong Kong RGC
    • Research Committee of CUHK

    Conference

    SIGMOD/PODS'17

    Acceptance Rates

    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Cited By

    • (2023) A Highly Interactive Honeypot-Based Approach to Network Threat Management. Future Internet 15:4 (127). DOI: 10.3390/fi15040127. Online publication date: 28-Mar-2023
    • (2020) Synthesis of Incremental Linear Algebra Programs. ACM Transactions on Database Systems 45:3 (1-44). DOI: 10.1145/3385398. Online publication date: 26-Aug-2020
    • (2019) Grasper. Proceedings of the ACM Symposium on Cloud Computing (87-100). DOI: 10.1145/3357223.3362715. Online publication date: 20-Nov-2019
    • (2018) G-Miner. Proceedings of the Thirteenth EuroSys Conference (1-12). DOI: 10.1145/3190508.3190545. Online publication date: 23-Apr-2018
    • (2018) FlexPS. Proceedings of the VLDB Endowment 11:5 (566-579). DOI: 10.1145/3187009.3177734. Online publication date: 1-Jan-2018
    • (2018) A General and Efficient Querying Method for Learning to Hash. Proceedings of the 2018 International Conference on Management of Data (1333-1347). DOI: 10.1145/3183713.3183750. Online publication date: 27-May-2018
    • (2018) FlexPS. Proceedings of the VLDB Endowment 11:5 (566-579). DOI: 10.1145/3177732.3177734. Online publication date: 5-Oct-2018
    • (2017) Architectural implications on the performance and cost of graph analytics systems. Proceedings of the 2017 Symposium on Cloud Computing (40-51). DOI: 10.1145/3127479.3128606. Online publication date: 24-Sep-2017
    • (2017) LoSHa. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (635-644). DOI: 10.1145/3077136.3080800. Online publication date: 7-Aug-2017
