research-article

Exploiting loop-dependent stream reuse for stream processors

Authors:

Guibin WangAuthors Info & Claims

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Pages 22 - 31

https://doi.org/10.1145/1454115.1454121

Published: 25 October 2008 Publication History

Abstract

The memory access limits the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip storage, the number of memory accesses can be reduced. In current stream compilers reuse is only attempted for simple stream references, those whose start and end are known. Compiler analysis from outside of stream processors does not directly enable the consideration of other complex stream references. In this paper we propose a transformation to automatically optimize stream programs to exploit the reuse supplied by loop-dependent stream references. The transformation is based on three results: algorithms to recognize the reuse supplied by stream references, a new abstract expression called the Stream Reuse Graph (SRG) to depict the reuse and the optimization of the SRG for the transformation. Both the reuse between whole sequences accessed by stream references and that between partial sequences are exploited in the paper. In particular, the problem of exploiting partial stream reuse does not have its parallel in the traditional data reuse exploitation setting (for scalars and arrays). Finally, we have implemented our techniques using the StreamC/KernelC compiler for Imagine. Experimental results show a resultant speedup of 1.14 to 2.54 times using a range of typical stream processing application kernels.

References

[1]

J. H. Ahn, M. Erez, and W. J. Dally. The design space of data-parallel memory systems. In SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 80, New York, NY, USA, 2006. ACM Press.

Digital Library

[2]

J. R. Allen. Dependence analysis for subscripted variables and its application to program transformations. PhD thesis, Houston, TX, USA, 1983.

Digital Library

[3]

R. Allen and K. Kennedy. Vector register allocation. IEEE Trans. Comput., 41(10):1290--1317, 1992.

Digital Library

[4]

I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for gpus: stream computing on graphics hardware. ACM Trans. Graph., 23(3):777--786, 2004.

Digital Library

[5]

D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In PLDI'90: Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation, pages 53--65, New York, NY, USA, 1990. ACM.

Digital Library

[6]

S. Carr and K. Kennedy. Scalar replacement in the presence of conditional control flow. Softw. Pract. Exper., 24(1):51--77, 1994.

Digital Library

[7]

T. F. Chan, E. Gallopoulos, V. Simoncini, T. Szeto, and C. H. Tong. A quasi-minimal residual variant of the bi-cgstab algorithm for nonsymmetric systems. SIAM J. Sci. Comput., 15(2):338--347, 1994.

Digital Library

[8]

R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K Zadeck. Efficiently computing static single assignment form and the control dependence graph. Technical report, Providence, RI, USA, 1991.

Digital Library

[9]

W. J. Dally, F. Labonte, A. Das, P. Hanrahan, J. H. Ahn, J. Gummaraju, M. Erez, N. Jayasena, I. Buck, T. J. K., and U. J. Kapasi. Merrimac: Supercomputing with streams. In SC '03: Proceedings of the 2003 ACM/IEEE conference on Supercomputing, page 35, Washington, DC, USA, 2003. IEEE Computer Society.

Digital Library

[10]

A. Das, W. J. Dally, and P. Mattson. Compiling for stream processing. In PACT '06: Proceedings of the 15th international conference on Parallel architectures and compilation techniques, pages 33--42, New York, NY, USA, 2006. ACM Press.

Digital Library

[11]

D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory management by global program transformation. J. Parallel Distrib. Comput., 5(5):587--616, 1988.

Digital Library

[12]

N. Jayasena, M. Erez, J. H. Ahn, and W. J. Dally. Stream register files with indexed access. In HPCA '04: Proceedings of the 10th International Symposium on High Performance Computer Architecture, page 60, Washington, DC, USA, 2004. IEEE Computer Society.

Digital Library

[13]

J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the cell multiprocessor. IBM J. Res. Dev., 49(4/5):589--604, 2005.

[14]

U. J. Kapasi, W. J. Dally, S. Rixner, J. D. Owens, and B. Khailany. The imagine stream processor. In ICCD, pages 282--288, 2002.

Digital Library

[15]

U. J. Kapasi, S. Rixner, W. J. Dally, B. Khailany, J. H. Ahn, P. Mattson, and J. D. Owens. Programmable stream processors. Computer, 36(8):54--62, 2003.

Digital Library

[16]

S. W. Liao, Z. H. Du, G. S. Wu, and G. Y. Lueh. Data and computation transformations for brook streaming applications on multiprocessors. In CGO '06: Proceedings of the International Symposium on Code Generation and Optimization, pages 196--207, Washington, DC, USA, 2006. IEEE Computer Society.

Digital Library

[17]

R. Lo, F. Chow, R. Kennedy, S. Liu, and P. Tu. Register promotion by sparse partial redundancy elimination of loads and stores. In PLDI '98: Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, pages 26--37, New York, NY, USA, 1998. ACM.

Digital Library

[18]

J. Lu and K. D. Cooper. Register promotion in c programs. In PLDI '97: Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, pages 308--319, New York, NY, USA, 1997. ACM.

Digital Library

[19]

P. Mattson. A programming system for the imagine media processor. PhD thesis, Computer Systems Laboratory, Stanford University, 2002. Adviser-William J. Dally.

Digital Library

[20]

P. Mattson and et al. Imagine Programming System Developer's Guide, 2004.

[21]

M. Narayanan, L. Oliker, A. Janin, P. Husbands, X Ye, and S. Li. Scientific kernels on viram and imagine media processors. Lawrence Berkeley National Laboratory, 2002.

[22]

J. D. Owens, U. J. Kapasi, P. Mattson, B. Towles, B. Serebrin, S. Rixner, and W. J. Dally. Media processing applications on the Imagine stream processor. In Proceedings of the IEEE International Conference on Computer Design, pages 295--302, September 2002.

Digital Library

[23]

S. Rixner. Stream Processor Architecture. Kluwer Academic Publishers, 2002.

Digital Library

[24]

A. V. S. Sastry and R. D. C. Ju. A new algorithm for scalar register promotion based on ssa form. In PLDI '98: Proceedings of the ACM SIGPLAN 1998 conference on Programming language design and implementation, pages 15--25, New York, NY, USA, 1998. ACM.

Digital Library

[25]

R. Stephens. A survey of stream processing. Acta Informatica, 34(7):491--541, 1997.

[26]

M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. The raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro, 22(2):25--35, 2002.

Digital Library

[27]

M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In PLDI '91: Proceedings of the ACM SIGPLAN 1991 conference on Programming language design and implementation, pages 30--44, New York, NY, USA, 1991. ACM.

Digital Library

[28]

W. A. Wulf and S. A. McKee. Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News, 23(1):20--24, 1995.

Digital Library

[29]

J. Xue and C.-H. Huang. Reuse-driven tiling for data locality. In 10th Workshop on Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science 1366, pages 16--33, Minneapolis, Minn., 1997. Springer-Verlag.

Digital Library

[30]

X. J. Yang, X. B. Yan, Z. C. Xing, Y. Deng, J. Jiang, and Y. Zhang. A 64-bit stream processor architecture for scientific applications. In ISCA '07, volume 35, pages 210--219, New York, NY, USA, 2007. ACM Press.

Digital Library

Cited By

Yang XWang LXue JWu Q(2012)Comparability Graph Coloring for Optimizing Utilization of Software-Managed Stream Register Files for Stream ProcessorsACM Transactions on Architecture and Code Optimization10.1145/2133382.21333879:1(1-30)Online publication date: 1-Mar-2012
https://dl.acm.org/doi/10.1145/2133382.2133387
Rashidy RYousefpour SKoohi M(2011)Parallel bubble sort using stream programming paradigm2011 5th International Conference on Application of Information and Communication Technologies (AICT)10.1109/ICAICT.2011.6111024(1-5)Online publication date: Oct-2011
https://doi.org/10.1109/ICAICT.2011.6111024
Wang LXue JYang XDe Micheli GAl-Hashimi BMueller WMacii E(2010)Reuse-aware modulo scheduling for stream processorsProceedings of the Conference on Design, Automation and Test in Europe10.5555/1870926.1871197(1112-1117)Online publication date: 8-Mar-2010
https://dl.acm.org/doi/10.5555/1870926.1871197
Show More Cited By

Index Terms

Exploiting loop-dependent stream reuse for stream processors
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
    2. Context specific languages
      1. Specialized application languages

Recommendations

Exploiting the reuse supplied by loop-dependent stream references for stream processors

Memory accesses limit the performance of stream processors. By exploiting the reuse of data held in the Stream Register File (SRF), an on-chip, software controlled storage, the number of memory accesses can be reduced. In current stream compilers, reuse ...
Loop fusion and reordering for register file optimization on stream processors

Stream processors are gaining popularity and getting deployed in many multimedia and scientific applications. stream register file (SRF) is a non-bypassing software-managed on-chip memory. Unlike conventional register files, the input data must be all ...
Optimizing modulo scheduling to achieve reuse and concurrency for stream processors

Both reuse and concurrency are performance-critical for stream processors. When applying loop unrolling and software pipelining separately to stream-level loops, either reuse or concurrency or both may be inadequately exploited. In this paper, we ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

October 2008

328 pages

ISBN:9781605582825

DOI:10.1145/1454115

General Chair:
Andreas Moshovos
University of Toronto, Canada
,
Program Chairs:
David Tarditi
Microsoft, USA
,
Kunle Olukotun
Stanford University, USA

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 October 2008

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

PACT '08

Sponsor:

PACT '08: International Conference on Parallel Architectures and Compilation Techniques

October 25 - 29, 2008

Ontario, Toronto, Canada

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

7
Total Citations
View Citations
265
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)0

Reflects downloads up to 08 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Yang XWang LXue JWu Q(2012)Comparability Graph Coloring for Optimizing Utilization of Software-Managed Stream Register Files for Stream ProcessorsACM Transactions on Architecture and Code Optimization10.1145/2133382.21333879:1(1-30)Online publication date: 1-Mar-2012
https://dl.acm.org/doi/10.1145/2133382.2133387
Rashidy RYousefpour SKoohi M(2011)Parallel bubble sort using stream programming paradigm2011 5th International Conference on Application of Information and Communication Technologies (AICT)10.1109/ICAICT.2011.6111024(1-5)Online publication date: Oct-2011
https://doi.org/10.1109/ICAICT.2011.6111024
Wang LXue JYang XDe Micheli GAl-Hashimi BMueller WMacii E(2010)Reuse-aware modulo scheduling for stream processorsProceedings of the Conference on Design, Automation and Test in Europe10.5555/1870926.1871197(1112-1117)Online publication date: 8-Mar-2010
https://dl.acm.org/doi/10.5555/1870926.1871197
Li Wang Jingling Xue Xuejun Yang (2010)Reuse-aware modulo scheduling for stream processors2010 Design, Automation & Test in Europe Conference & Exhibition (DATE 2010)10.1109/DATE.2010.5456975(1112-1117)Online publication date: Mar-2010
https://doi.org/10.1109/DATE.2010.5456975
Yang XWang LXue JDeng YZhang Y(2009)Comparability graph coloring for optimizing utilization of stream register files in stream processorsACM SIGPLAN Notices10.1145/1594835.150419544:4(111-120)Online publication date: 14-Feb-2009
https://dl.acm.org/doi/10.1145/1594835.1504195
Yang XWang LXue JDeng YZhang YReed DSarkar V(2009)Comparability graph coloring for optimizing utilization of stream register files in stream processorsProceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming10.1145/1504176.1504195(111-120)Online publication date: 14-Feb-2009
https://dl.acm.org/doi/10.1145/1504176.1504195
Yang XZhang YLu XXue JRogers ILi GWang GFang X(2008)Exploiting the reuse supplied by loop-dependent stream references for stream processorsACM Transactions on Architecture and Code Optimization (TACO)10.1145/1839667.18396737:2(1-35)Online publication date: 5-Oct-2008
https://dl.acm.org/doi/10.1145/1839667.1839673

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten