research-article

Fast Multiplication in Binary Fields on GPUs via Register Cache

Authors:

Eli Ben-Sasson,

Mark Silberstein,

Eran TromerAuthors Info & Claims

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 35, Pages 1 - 12

https://doi.org/10.1145/2925426.2926259

Published: 01 June 2016 Publication History

Abstract

Finite fields of characteristic 2 -- "binary fields" -- are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in "large" binary fields of sizes greater than 232. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply large number of finite field elements, and achieves high performance via bit-slicing and fine-grained parallelization.

The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations.

We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138x speedup for fields of size 232 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30x for larger fields of size below 2256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory based design.

References

[1]

A. Davidson, and J. D. Owens. Register packing for cyclic reduction: A case study. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pages 4:1--4:6. ACM, 2011.

Digital Library

[2]

A. E. Cohen and K. K. Parhi. GPU Accelerated Elliptic Curve Cryptography in GF(2m). In IEEE 53rd International Midwest Symposium on Circuits and Systems, pages 57--60, Aug 2010.

[3]

A. Magni, C. Dubach, and M. F. P. O'Boyle. A Large-scale Cross-architecture Evaluation of Thread-coarsening. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, pages 11:1--11:11. ACM, 2013.

Digital Library

[4]

B. Catanzaro, A. Keller, and M. Garland. A decomposition for in-place matrix transposition. Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 193--206, 2014.

Digital Library

[5]

C. Su, and H. Fan. Impact of Intel's new instruction sets on software implementation of GF(2){x} multiplication. Inf. Process. Lett., 112(12):497--502, June 2012.

Digital Library

[6]

D. B. Kirk, and W. W. Hwu. Programming Massively Parallel Pocessors: A Hands-on Approach. Newnes, 2012.

Digital Library

[7]

D. G. Cantor, and E. Kaltofen. On fast multiplication of polynomials over arbitrary algebras. Acta Informatica, 28(7):693--701, 1991.

Digital Library

[8]

E. Ben-Sasson, and M. Sudan. Short PCPs with polylog query complexity. SIAM Journal on Computing, 38(2):551--607, 2008. Preliminary version appeared in STOC '05.

Digital Library

[9]

E. D. Win, A. Bosselaers, S. Vandenberghe, P. D. Gersem, and J. Vandewalle. A fast software implementation for arithmetic operations in GF(2n). In Proceedings of the International Conference on the Theory and Applications of Cryptology and Information Security: Advances in Cryptology, pages 65--76. Springer-Verlag, 1996.

Digital Library

[10]

A. Fog. Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. Available at http://www.agner.org/optimize/instruction_tables.pdf, 1996--2016. {Online; accessed 28-Mar-2016}.

[11]

G. L. Steele Jr., and J. B. Tristan. Using butterfly-patterned partial sums to optimize GPU memory accesses for drawing from discrete distributions. CoRR, abs/1505.03851, 2015.

[12]

G. Shay, and M. E. Kounavis. Intel(R) carry-less multiplication instruction and its usage for computing the GCM mode - rev 2.02. Intel Corporation, April 2014.

[13]

J. Daemen, and V. Rijmen. AES proposal: Rijndael. Available at http://jda.noekeon.org/JDA_VRI_Rijndael_V2_1999.pdf, 1998. {Online; accessed 28-Mar-2016}.

[14]

J. L. Massey, and J. K. Omura. Computational method and apparatus for finite field arithmetic. US patent number 4587627. May 1986.

[15]

J. Luo, K. D. Bowers, A. Oprea, and L. Xu. Efficient software implementations of large finite fields GF(2n) for secure storage applications. Trans. Storage, 8(1):2:1--2:27, February 2012.

Digital Library

[16]

J. S. Plank, K. M. Greenan, and E. L. Miller. Screaming fast Galois field arithmetic using Intel SIMD instructions. In 11th USENIX Conference on File and Storage Technologies, pages 298--306, February 2013.

Digital Library

[17]

K. Leboeuf, R. Muscedere, and M. Ahmadi. High performance prime field multiplication for GPU. In IEEE International Symposium on Circuits and Systems, pages 93--96, May 2012.

[18]

A. Karatsuba and Y. Ofman. Multiplication of Many-Digital Numbers by Automatic Computers. Doklady Akad. Nauk SSSR145, 293--294, 1962. Translation in Physics-Doklady 7, 595--596, 1963.

[19]

M. Arabi. Comparison of Traditional, Karatsuba and Fourier Big Integer Multiplication. B.Sc. Thesis. University of Bath, May 2005.

[20]

nVidia. Kepler Tuning Guide. http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html, 2015. {Online; accessed 26-Jan-2016}.

[21]

P. Enfedaque, F. Aulí-Llinàs, and J. C. Moure. Implementation of the DWT in a GPU through a register-based strategy. IEEE Transactions on Parallel and Distributed Systems, 26(12):3394--3406, Dec 2015.

Digital Library

[22]

R. Lidl and H. Niederreiter. Finite Fields. (2nd ed.), Cambridge University Press, 1997.

Digital Library

[23]

S. Ashkiani, A. Davidson, U. Meyer, and J. D. Owens. GPU Multisplit. Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 12:1--12:13, 2016.

Digital Library

[24]

S. Kalcher, and V. Lindenstruth. Accelerating Galois field arithmetic for Reed-Solomon erasure codes in storage applications. In IEEE International Conference on Cluster Computing, pages 290--298, Sept 2011.

Digital Library

[25]

A. Schonhage and V. Strassen. Schnelle multiplikation grosser zahlen. Computing, 7(3-4):281--292, 1971.

[26]

V. Shoup. NTL: A library for doing number theory. Avilable at http://www.shoup.net/ntl, 2003. {Online; accessed 28-Mar-2016}.

[27]

V. Volkov, and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2008., pages 1--11.

Digital Library

[28]

V. Volkov. Better performance at lower occupancy. Proceedings of the GPU Technology Conference, 2010.

[29]

Wikipedia. Bit slicing--- Wikipedia,. {Online; accessed 27-Mar-2016}.

Cited By

Matsumura KDe Gonzalo SPeña AVerbrugge CLhoták OShen X(2023)A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX CodeProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580253(110-121)Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3578360.3580253
Chen JSung HShen XChoudhury SLi AGallivan KNikolopoulos DBeivide RGallopoulos E(2023)BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593725(264-276)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593725
Ames SHazay CIshai YVenkitasubramaniam M(2023)Ligero: lightweight sublinear arguments without a trusted setupDesigns, Codes and Cryptography10.1007/s10623-023-01222-891:11(3379-3424)Online publication date: 13-Jul-2023
https://doi.org/10.1007/s10623-023-01222-8
Show More Cited By

Recommendations

Cooperative Caching for GPUs

The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad ...
Early miss prediction based periodic cache bypassing for high performance GPUs

The aim of the hierarchical cache memories that are equipped for GPUs is the management of irregular memory access patterns for general purpose workloads. The level-1 data cache (L1D) of the GPU plays an important role for its ability in the provision ...
Efficient utilization of GPGPU cache hierarchy
GPGPU-8: Proceedings of the 8th Workshop on General Purpose Processing using GPUs

Recent GPUs are equipped with general-purpose L1 and L2 caches in an attempt to reduce memory bandwidth demand and improve the performance of some irregular GPGPU applications. However, due to the massive multithreading, GPGPU caches suffer from severe ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

16
Total Citations
View Citations
414
Total Downloads

Downloads (Last 12 months)59
Downloads (Last 6 weeks)10

Reflects downloads up to 09 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Matsumura KDe Gonzalo SPeña AVerbrugge CLhoták OShen X(2023)A Symbolic Emulator for Shuffle Synthesis on the NVIDIA PTX CodeProceedings of the 32nd ACM SIGPLAN International Conference on Compiler Construction10.1145/3578360.3580253(110-121)Online publication date: 17-Feb-2023
https://dl.acm.org/doi/10.1145/3578360.3580253
Chen JSung HShen XChoudhury SLi AGallivan KNikolopoulos DBeivide RGallopoulos E(2023)BitGNN: Unleashing the Performance Potential of Binary Graph Neural Networks on GPUsProceedings of the 37th International Conference on Supercomputing10.1145/3577193.3593725(264-276)Online publication date: 21-Jun-2023
https://dl.acm.org/doi/10.1145/3577193.3593725
Ames SHazay CIshai YVenkitasubramaniam M(2023)Ligero: lightweight sublinear arguments without a trusted setupDesigns, Codes and Cryptography10.1007/s10623-023-01222-891:11(3379-3424)Online publication date: 13-Jul-2023
https://doi.org/10.1007/s10623-023-01222-8
Fu SGong G(2021)Pisces: A New Zero-Knowledge Protocol for Blockchain PrivacyFoundations and Practice of Security10.1007/978-3-030-70881-8_12(180-204)Online publication date: 27-Feb-2021
https://doi.org/10.1007/978-3-030-70881-8_12
Jangda AGuha ASarkar VKim H(2020)Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUsProceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques10.1145/3410463.3414649(317-328)Online publication date: 30-Sep-2020
https://dl.acm.org/doi/10.1145/3410463.3414649
Li ASu S(2020)Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2020.3045828(1-1)Online publication date: 2020
https://doi.org/10.1109/TPDS.2020.3045828
Augustynowicz PPaszkiewicz A(2020)Efficient Modular Squaring in Binary Fields on CPU Supporting AVX and GPUParallel Processing and Applied Mathematics10.1007/978-3-030-43229-4_5(49-57)Online publication date: 19-Mar-2020
https://doi.org/10.1007/978-3-030-43229-4_5
Phothilimthana PElliott AWang AJangda AHagedorn BBarthels HKaufman SGrover VTorlak EBodik RBahar IHerlihy MWitchel ELebeck A(2019)Swizzle InventorProceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3297858.3304059(65-78)Online publication date: 4-Apr-2019
https://dl.acm.org/doi/10.1145/3297858.3304059
Li AGeng TWang THerbordt MSong SBarker KTaufer MBalaji PPeña A(2019)BSTCProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356169(1-30)Online publication date: 17-Nov-2019
https://dl.acm.org/doi/10.1145/3295500.3356169
Chen PWahib MTakizawa STakano RMatsuoka STaufer MBalaji PPeña A(2019)A versatile software systolic execution model for GPU memory-bound kernelsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3295500.3356162(1-81)Online publication date: 17-Nov-2019
https://dl.acm.org/doi/10.1145/3295500.3356162
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten