ABSTRACT
While cluster computing frameworks are continuously evolving to provide real-time data analysis capabilities, Apache Spark has remained at the forefront of big data analytics as a unified framework for both batch and stream data processing. There is also renewed interest in Near Data Processing (NDP) due to technological advances in the last decade. However, it is not known whether NDP architectures can improve the performance of big data processing frameworks such as Apache Spark. In this paper, we build the case for an NDP architecture comprising programmable-logic-based hybrid 2D integrated processing-in-memory and in-storage processing for Apache Spark, through extensive profiling of Apache Spark based workloads on an Ivy Bridge server.
Identifying the potential of near data processing for Apache Spark