ABSTRACT
Implementing parallel operators in multi-core machines often involves a data partitioning step that divides the data into cache-size blocks and arranges them so to allow concurrent threads to process them in parallel. Data partitioning is expensive, in some cases up to 90% of the cost of, e.g., a parallel hash join. In this paper we explore the use of an FPGA to accelerate data partitioning. We do so in the context of new hybrid architectures where the FPGA is located as a co-processor residing on a socket and with coherent access to the same memory as the CPU residing on the other socket. Such an architecture reduces data transfer overheads between the CPU and the FPGA, enabling hybrid operator execution where the partitioning happens on the FPGA and the build and probe phases of a join happen on the CPU. Our experiments demonstrate that FPGA-based partitioning is significantly faster and more robust than CPU-based partitioning. The results open interesting options as FPGAs are gradually integrated tighter with the CPU.
- I. Absalyamov, R. Halstead, P. Budhkar, W. Najjar, et al. Fpga-accelerated group-by aggregation using synchronizing caches. In ACM DaMoN, 2016. Google ScholarDigital Library
- A. Appleby. https://github.com/aappleby/smhasher. January 2016.Google Scholar
- C. Balkesen, G. Alonso, J. Teubner, and M. T. Özsu. Multi-core, main-memory joins: Sort vs. hash revisited. PVLDB, 7(1):85--96, 2013. Google ScholarDigital Library
- C. Balkesen, J. Teubner, G. Alonso, and M. T. Özsu. Main-memory hash joins on multi-core cpus: Tuning to the underlying hardware. In IEEE ICDE, 2013. Google ScholarDigital Library
- R. Barber, G. Lohman, I. Pandis, et al. Memory-efficient hash joins. PVLDB, 8(4):353--364, 2014. Google ScholarDigital Library
- C. Barthels, S. Loesing, G. Alonso, and D. Kossmann. Rack-scale in-memory join processing using rdma. In ACM SIGMOD, 2015. Google ScholarDigital Library
- C. Barthels, I. Müller, T. Schneider, G. Alonso, and T. Hoefler. Distributed join algorithms on thousands of cores. PVLDB, 10(5), 2017. Google ScholarDigital Library
- S. Blanas, Y. Li, and J. M. Patel. Design and evaluation of main memory hash join algorithms for multi-core cpus. In ACM SIGMOD, 2011. Google ScholarDigital Library
- J. Casper and K. Olukotun. Hardware acceleration of database operations. In ACM FPGA, 2014. Google ScholarDigital Library
- R. Chen and V. K. Prasanna. Accelerating Equi-Join on a CPU-FPGA Heterogeneous Platform. IEEE FCCM, 2016.Google ScholarCross Ref
- R. J. Halstead, I. Absalyamov, W. A. Najjar, and V. J. Tsotras. FPGA-based Multithreading for In-Memory Hash Joins. In CIDR, 2015.Google Scholar
- R. J. Halstead, B. Sukhwani, H. Min, M. Thoennes, et al. Accelerating join operation for relational databases with FPGAs. In IEEE FCCM, 2013. Google ScholarDigital Library
- J. He, S. Zhang, and B. He. In-cache query co-processing on coupled cpu-gpu architectures. PVLDB, 8(4):329--340, 2014. Google ScholarDigital Library
- Z. Istvan, D. Sidler, and G. Alonso. Runtime parameterizable regular expression operators for databases. 2016.Google ScholarCross Ref
- Z. Istvan, L. Woods, and G. Alonso. Histograms as a side effect of data movement for big data. In ACM SIGMOD, 2014. Google ScholarDigital Library
- S. Jha, B. He, M. Lu, X. Cheng, and H. P. Huynh. Improving main memory hash joins on intel xeon phi processors: An experimental approach. PVLDB, 8(6):642--653, 2015. Google ScholarDigital Library
- T. Kaldewey, G. Lohman, R. Mueller, and P. Volk. Gpu join processing revisited. In ACM DaMoN, 2012. Google ScholarDigital Library
- K. Kara and G. Alonso. Fast and robust hashing for database operators. In IEEE FPL, 2016.Google ScholarCross Ref
- C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, et al. Sort vs. hash revisited: fast join implementation on modern multi-core cpus. PVLDB, 2(2):1378--1389, 2009. Google ScholarDigital Library
- H. Lang, V. Leis, M.-C. Albutiu, T. Neumann, and A. Kemper. Massively parallel numa-aware hash joins. In In Memory Data Management and Analysis. 2015.Google ScholarCross Ref
- S. Manegold, P. Boncz, and M. Kersten. Optimizing main-memory join on modern hardware. IEEE TKDE, 14(4):709--730, 2002. Google ScholarDigital Library
- N. Mirzadeh, O. Kocberber, B. Falsafi, and B. Grot. Sort vs. hash join revisited for near-memory execution. In ASBD, 2015.Google Scholar
- R. Mueller, J. Teubner, and G. Alonso. Glacier: a query-to-hardware compiler. In ACM SIGMOD, 2010. Google ScholarDigital Library
- N. Oliver et al. A reconfigurable computing system based on a cache-coherent fabric. In IEEE ReConFig, 2011. Google ScholarDigital Library
- H. Pirk, S. Manegold, and M. Kersten. Waste not... efficient co-processing of relational data. In IEEE ICDE, 2014.Google ScholarCross Ref
- O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking SIMD vectorization for in-memory databases. In ACM SIGMOD, 2015. Google ScholarDigital Library
- O. Polychroniou and K. A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In ACM SIGMOD, 2014. Google ScholarDigital Library
- A. Putnam, A. M. Caulfield, E. S. Chung, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In ACM/IEEE ISCA, 2014. Google ScholarDigital Library
- S. Richter, V. Alvarez, and J. Dittrich. A seven-dimensional analysis of hashing methods and its implications on query processing. PVLDB, 9(3):96--107, 2015. Google ScholarDigital Library
- N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, et al. Fast sort on cpus and gpus: a case for bandwidth oblivious simd sort. In ACM SIGMOD, 2010. Google ScholarDigital Library
- S. Schuh, X. Chen, and J. Dittrich. An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory. In ACM SIGMOD, 2016. Google ScholarDigital Library
- F. M. Schuhknecht, P. Khanchandani, and J. Dittrich. On the surprising difficulty of simple things: the case of radix partitioning. PVLDB, 8(9):934--937, 2015. Google ScholarDigital Library
- D. Sidler, Z. Istvan, M. Owaida, and G. Alonso. Accelerating pattern matching queries in hybrid cpu-fpga architectures. In ACM SIGMOD, 2017. Google ScholarDigital Library
- J. Stuecheli, B. Blaner, C. Johns, and M. Siegel. Capi: A coherent accelerator processor interface. IBM Journal of Research and Development, 59(1):7--1, 2015.Google ScholarDigital Library
- B. Sukhwani, H. Min, M. Thoennes, et al. Database analytics acceleration using fpgas. In ACM PACT, 2012. Google ScholarDigital Library
- T. Ueda, M. Ito, and M. Ohara. A Dynamically Reconfigurable Equi-Joiner on FPGA. IBM Tehnical Report RT0969, 2015.Google Scholar
- Z. Wang, B. He, and W. Zhang. A study of data partitioning on OpenCL-based FPGAs. In IEEE FPL, 2015.Google ScholarCross Ref
- J. Wassenberg and P. Sanders. Engineering a multi-core radix sort. In Euro-Par, 2011. Google ScholarDigital Library
- S. Werner, S. Groppe, V. Linnemann, and T. Pionteck. Hardware-accelerated join processing in large Semantic Web databases with FPGAs. In IEEE HPCS, 2013.Google ScholarCross Ref
- L. Woods, G. Alonso, and J. Teubner. Parallel computation of skyline queries. In IEEE FCCM, 2013. Google ScholarDigital Library
- L. Wu, R. J. Barker, M. A. Kim, and K. A. Ross. Navigating big data with high-throughput, energy-efficient data partitioning. In ACM SIGARCH, 2013. Google ScholarDigital Library
Index Terms
- FPGA-based Data Partitioning
Recommendations
Hardware accelerated FPGA placement
A key advantage of field-programmable gate arrays (FPGAs) over full-custom and semi-custom devices is that they provide relatively quick implementation from concept to physical realization. However, as modern FPGAs reach close to one million logic ...
Optimizing CNN-based Segmentation with Deeply Customized Convolutional and Deconvolutional Architectures on FPGA
Special Issue on Deep learning on FPGAsConvolutional Neural Networks-- (CNNs) based algorithms have been successful in solving image recognition problems, showing very large accuracy improvement. In recent years, deconvolution layers are widely used as key components in the state-of-the-art ...
Comments