DOI: 10.1145/3132402.3132427
research-article

Identifying the potential of near data processing for Apache Spark

Published: 02 October 2017

ABSTRACT

While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has remained at the forefront of big data analytics as a unified framework for both batch and stream data processing. There is also renewed interest in Near Data Processing (NDP) due to technological advances in the last decade. However, it is not known whether NDP architectures can improve the performance of big data processing frameworks such as Apache Spark. In this paper, we build the case for an NDP architecture comprising programmable-logic-based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, through extensive profiling of Apache Spark based workloads on an Ivy Bridge server.

    Published in

      MEMSYS '17: Proceedings of the International Symposium on Memory Systems
      October 2017
      409 pages
      ISBN:9781450353359
      DOI:10.1145/3132402

      Copyright © 2017 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States