skip to main content
10.1145/2925426.2926259acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article

Fast Multiplication in Binary Fields on GPUs via Register Cache

Published: 01 June 2016 Publication History

Abstract

Finite fields of characteristic 2 -- "binary fields" -- are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in "large" binary fields of sizes greater than 232. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply large number of finite field elements, and achieves high performance via bit-slicing and fine-grained parallelization.
The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations.
We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138x speedup for fields of size 232 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30x for larger fields of size below 2256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory based design.

References

[1]
A. Davidson, and J. D. Owens. Register packing for cyclic reduction: A case study. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 4:1--4:6. ACM, 2011.
[2]
A. E. Cohen and K. K. Parhi. GPU Accelerated Elliptic Curve Cryptography in GF(2m). In IEEE 53rd International Midwest Symposium on Circuits and Systems, pages 57--60, Aug 2010.
[3]
A. Magni, C. Dubach, and M. F. P. O'Boyle. A Large-scale Cross-architecture Evaluation of Thread-coarsening. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 11:1--11:11. ACM, 2013.
[4]
B. Catanzaro, A. Keller, and M. Garland. A decomposition for in-place matrix transposition. Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 193--206, 2014.
[5]
C. Su, and H. Fan. Impact of Intel's new instruction sets on software implementation of GF(2){x} multiplication. Inf. Process. Lett., 112(12):497--502, June 2012.
[6]
D. B. Kirk, and W. W. Hwu. Programming Massively Parallel Pocessors: A Hands-on Approach. Newnes, 2012.
[7]
D. G. Cantor, and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28(7):693--701, 1991.
[8]
E. Ben-Sasson, and M. Sudan. Short PCPs with polylog query complexity. SIAM Journal on Computing, 38(2):551--607, 2008. Preliminary version appeared in STOC '05.
[9]
E. D. Win, A. Bosselaers, S. Vandenberghe, P. D. Gersem, and J. Vandewalle. A fast software implementation for arithmetic operations in GF(2n). In Proceedings of the International Conference on the Theory and Applications of Cryptology and Information Security: Advances in Cryptology, pages 65--76. Springer-Verlag, 1996.
[10]
A. Fog. Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. Available at http://www.agner.org/optimize/instruction_tables.pdf, 1996--2016. {Online; accessed 28-Mar-2016}.
[11]
G. L. Steele Jr., and J. B. Tristan. Using butterfly-patterned partial sums to optimize GPU memory accesses for drawing from discrete distributions. CoRR, abs/1505.03851, 2015.
[12]
G. Shay, and M. E. Kounavis. Intel(R) carry-less multiplication instruction and its usage for computing the GCM mode - rev 2.02. Intel Corporation, April 2014.
[13]
J. Daemen, and V. Rijmen. AES proposal: Rijndael. Available at http://jda.noekeon.org/JDA_VRI_Rijndael_V2_1999.pdf, 1998. {Online; accessed 28-Mar-2016}.
[14]
J. L. Massey, and J. K. Omura. Computational method and apparatus for finite field arithmetic. US patent number 4587627. May 1986.
[15]
J. Luo, K. D. Bowers, A. Oprea, and L. Xu. Efficient software implementations of large finite fields GF(2n) for secure storage applications. Trans. Storage, 8(1):2:1--2:27, February 2012.
[16]
J. S. Plank, K. M. Greenan, and E. L. Miller. Screaming fast Galois field arithmetic using Intel SIMD instructions. In 11th USENIX Conference on File and Storage Technologies, pages 298--306, February 2013.
[17]
K. Leboeuf, R. Muscedere, and M. Ahmadi. High performance prime field multiplication for GPU. In IEEE International Symposium on Circuits and Systems, pages 93--96, May 2012.
[18]
A. Karatsuba and Y. Ofman. Multiplication of Many-Digital Numbers by Automatic Computers. Doklady Akad. Nauk SSSR145, 293--294, 1962. Translation in Physics-Doklady 7, 595--596, 1963.
[19]
M. Arabi. Comparison of Traditional, Karatsuba and Fourier Big Integer Multiplication. B.Sc. Thesis. University of Bath, May 2005.
[20]
nVidia. Kepler Tuning Guide. http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html, 2015. {Online; accessed 26-Jan-2016}.
[21]
P. Enfedaque, F. Aulí-Llinàs, and J. C. Moure. Implementation of the DWT in a GPU through a register-based strategy. IEEE Transactions on Parallel and Distributed Systems, 26(12):3394--3406, Dec 2015.
[22]
R. Lidl and H. Niederreiter. Finite Fields. (2nd ed.), Cambridge University Press, 1997.
[23]
S. Ashkiani, A. Davidson, U. Meyer, and J. D. Owens. GPU Multisplit. Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 12:1--12:13, 2016.
[24]
S. Kalcher, and V. Lindenstruth. Accelerating Galois field arithmetic for Reed-Solomon erasure codes in storage applications. In IEEE International Conference on Cluster Computing, pages 290--298, Sept 2011.
[25]
A. Schonhage and V. Strassen. Schnelle multiplikation grosser zahlen. Computing, 7(3-4):281--292, 1971.
[26]
V. Shoup. NTL: A library for doing number theory. Avilable at http://www.shoup.net/ntl, 2003. {Online; accessed 28-Mar-2016}.
[27]
V. Volkov, and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2008., pages 1--11.
[28]
V. Volkov. Better performance at lower occupancy. Proceedings of the GPU Technology Conference, 2010.
[29]
Wikipedia. Bit slicing--- Wikipedia,. {Online; accessed 27-Mar-2016}.

Cited By

View all
  • (2023)A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX CodeProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580253(110-121)Online publication date: 17-Feb-2023
  • (2023)BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593725(264-276)Online publication date: 21-Jun-2023
  • (2023)Ligero: lightweight sublinear arguments without a trusted setupDesigns, Codes and Cryptography10.1007/s10623-023-01222-891:11(3379-3424)Online publication date: 13-Jul-2023
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Finite Field Multiplication
  2. GPGPU
  3. GPU Code Optimization
  4. Parallel Algorithms
  5. SIMD

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '16
Sponsor:

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)59
  • Downloads (Last 6 weeks)10
Reflects downloads up to 09 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2023)A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX CodeProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580253(110-121)Online publication date: 17-Feb-2023
  • (2023)BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593725(264-276)Online publication date: 21-Jun-2023
  • (2023)Ligero: lightweight sublinear arguments without a trusted setupDesigns, Codes and Cryptography10.1007/s10623-023-01222-891:11(3379-3424)Online publication date: 13-Jul-2023
  • (2021)Pisces: A New Zero-Knowledge Protocol for Blockchain PrivacyFoundations and Practice of Security10.1007/978-3-030-70881-8_12(180-204)Online publication date: 27-Feb-2021
  • (2020)Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUsProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414649(317-328)Online publication date: 30-Sep-2020
  • (2020)Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2020.3045828(1-1)Online publication date: 2020
  • (2020)Efficient Modular Squaring in Binary Fields on CPU Supporting AVX and GPUParallel Processing and Applied Mathematics10.1007/978-3-030-43229-4_5(49-57)Online publication date: 19-Mar-2020
  • (2019)Swizzle InventorProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304059(65-78)Online publication date: 4-Apr-2019
  • (2019)BSTCProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356169(1-30)Online publication date: 17-Nov-2019
  • (2019)A versatile software systolic execution model for GPU memory-bound kernelsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356162(1-81)Online publication date: 17-Nov-2019
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media