| Fast scan algorithms on graphics processors |
| Full text |
Pdf
(829 KB)
|
Source
|
International Conference on Supercomputing
archive
Proceedings of the 22nd annual international conference on Supercomputing
table of contents
Island of Kos, Greece
SESSION: Algorithms & applications 2
table of contents
Pages 205-213
Year of Publication: 2008
ISBN:978-1-60558-158-3
|
|
Authors
|
|
Yuri Dotsenko
|
Microsoft Corporation, Redmond, WA, USA
|
|
Naga K. Govindaraju
|
Microsoft Corporation, Redmond, WA, USA
|
|
Peter-Pike Sloan
|
Microsoft Corporation, Redmond, WA, USA
|
|
Charles Boyd
|
Microsoft Corporation, Redmond, WA, USA
|
|
John Manferdelli
|
Microsoft Corporation, Redmond, WA, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 49, Downloads (12 Months): 158, Citation Count: 0
|
|
|
ABSTRACT
Scan and segmented scan are important data-parallel primitives for a wide range of applications. We present fast, work-efficient algorithms for these primitives on graphics processing units (GPUs). We use novel data representations that map well to the GPU architecture. Our algorithms exploit shared memory to improve memory performance. We further improve the performance of our algorithms by eliminating shared-memory bank conflicts and reducing the overheads in prior shared-memory GPU algorithms. Furthermore, our algorithms are designed to work well on general data sets, including segmented arrays with arbitrary segment lengths. We also present optimizations to improve the performance of segmented scans based on the segment lengths. We implemented our algorithms on a PC with an NVIDIA GeForce 8800 GPU and compared our results with prior GPU-based algorithms. Our results indicate up to 10x higher performance over prior algorithms on input sequences with millions of elements.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Iverson, K. E. A Programming Language. Wiley, New York, 1962.
|
| |
2
|
Schwartz, J. T. Ultracomputers. ACM Transactions on Programming Languages and Systems, 2(4), pp 484-521, Oct. 1980.
|
| |
3
|
NVIDIA. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, v. 1.1, Nov. 2007, http://developer.download.nvidia.com/compute/cuda/1_1/NVIDIA_CUDA_Programming_Guide_1.1.pdf.
|
| |
4
|
Blelloch, G. E. Prefix Sums and Their Applications. Technical Report CMU-CS-90-190, Carnegie Mellon University, Pittsburgh, PA, Nov. 1990. Book Chapter in Synthesis of Parallel Algorithms, Reif, J. H. (Ed.).
|
| |
5
|
Harris, M., Sengupta, S., and Owens, J. D. Parallel Prefix Sum (Scan) with CUDA. GPU Gems 3, Hguyen, H. (Ed.). Addison-Wesley, Aug. 2007, ch. 39.
|
| |
6
|
Sengupta, S., Harris, M., Zhang, Y., and Owens, J. D. Scan Primitives for GPU Computing. Graphics Hardware 2007. San Diego, CA, Aug. 2007.
|
| |
7
|
Chatterjee, S., Blelloch, G. E., and Zagha, M. Scan Primitives for Vector Computers. Proceedings of the 1990 Conference on Supercomputing. New York, NY, 1990, pp. 666 -- 675.
|
| |
8
|
NVIDIA CUDA SDK. http://developer.nvidia.com/object/cuda.html.
|
| |
9
|
CUDA Data Parallel Primitives Library -- CUDPP. http://www.gpgpu.org/developer/cudpp/rel/rel_gems3/html/index.html.
|
| |
10
|
Blelloch, G. E. Scans as Primitive Parallel Operations. Proceeding of the International Conference on Parallel Processing. 1987, pp 355--362.
|
| |
11
|
Horn, D. Stream Reduction Operations for GPGPU Applications. GPU Gems 2, Pharr, M. (Ed.). Addison-Wesley, pp 573--589.
|
| |
12
|
Hensley, J., Scheuermann, T., Coombe, G., Singh, M., Lastra A. Fast Summed-Area Table Generation and its Applications. Computer Graphics Forum 24 (3), pp. 547--555.
|
| |
13
|
Sengupta S., Lefohn A., and Owens, J. A Work-Efficient Step-Efficient Prefix-Sum Algorithm. Proceedings of the Workshop on Edge Computing Using New Commodity Architectures. Chapel Hill, NC, May 2006, pp. 26--27.
|
| |
14
|
Cray-Cyber.org. Cray Y-MP EL. http://www.cray-cyber.org/systems/yel.php.
|
| |
15
|
Hwu, W. and Kirk, D. UIUC ELE 498 AL1: Programming Massively Parallel Processors. http://courses.ece.uiuc.edu/ece498/al1
|
| |
16
|
Blelloch, G. E., Heroux, M. A., and Zagha, M. Segmented Operations for Sparse Matrix Computation on Vector Multiprocessors. Tech. Rep. CMU-CS-93-173, School of Computer Science, Carnegie Mellon University, Aug. 1993.
|
| |
17
|
Blelloch G. E. Vector Models for Data-Parallel Computing. MIT Press, 1990.
|
| |
18
|
Gre² A., Guthe M., Klein R. GPU-based collision detection for deformable parameterized surfaces. Computer Graphics Forum 25, 3 (Sept. 2006), 497--506.
|
| |
19
|
Blelloch, G. E., Siddhartha, C., Marco Z. Solving linear recurrences with loop raking. Journal of Parallel and Distributed Computing, 25(1), pp 91--97, Feb. 2005.
|
|