article

MPSoC memory optimization using program transformation

Authors:

Youcef Bouchebaba,

Bruno Girodias,

Gabriela Nicolescu,

El Mostapha Aboulhamid,

Bruno Lavigueur,

Pierre PaulinAuthors Info & Claims

ACM Transactions on Design Automation of Electronic Systems (TODAES), Volume 12, Issue 4

Pages 43 - es

https://doi.org/10.1145/1278349.1278356

Published: 01 September 2007 Publication History

Abstract

Multiprocessor system-on-a-chip (MPSoC) architectures have received a lot of attention in the past years, but few advances in compilation techniques target these architectures. This is particularly true for the exploitation of data locality. Most of the compilation techniques for parallel architectures discussed in the literature are based on a single loop nest. This article presents new techniques that consist in applying loop fusion and tiling to several loop nests and to parallelize the resulting code across different processors. These two techniques reduce the number of memory accesses. However, they increase dependencies and thereby reduce the exploitable parallelism in the code. This article tries to address this contradiction. To optimize the memory space used by temporary arrays, smaller buffers are used as a replacement. Different strategies are studied to optimize the processing time spent accessing these buffers. The experiments show that these techniques yield a significant reduction in the number of data cache misses (30%) and in processing time (50%).

References

[1]

Anderson, J. M. and Lam, M. S. 1993. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Albuquerque, New Mexico. R. Cartwright, Ed. ACM Press, New York, NY, 112--125.

Digital Library

[2]

Banerjee, U. 1991. Unimodular transformation of doubles loops. In Advances in Languages and Compiler for Parallel Processing, chapter 10. the MIT Press.

[3]

Bastoul, C. 2004. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 7--16.

Digital Library

[4]

Bouchebaba, Y. 2002. Optimisation des transferts de données pour le traitement du signal: Fusion, pavage et réallocation des tableaux. PhD thesis, Ecole des mines de Paris, France.

[5]

Bouchebaba, Y., Nicolescu, G., Aboulhamid, E., and Coelho, F. 2006. Buffer and register allocation for memory space optimization. In Proceedings of the IEEE 17th International Conference on Application-Specific Systems. Architectures and Processors. IEEE Computer Society, 283--290.

Digital Library

[6]

Carr, S. and Kennedy, K. 1994. Scalar replacement in the presence of conditional control flow. Softw. Pract. Exper. 24, 1, 51--77.

Digital Library

[7]

Carr, S., Ding, C., and Sweany, P. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Hawaii International Conference on System Sciences (Hicss'96) Volume 1: Software Technology and Architecture. IEEE Computer Society, 183.

Digital Library

[8]

Catthoor, F., Greef, E. d., and Suytack, S. 1998. HICSS. Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design. Kluwer Academic Publishers.

Digital Library

[9]

Cierniak, M. and Li, W. 1995. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (La Jolla, CA). ACM Press, New York, NY, 205--217.

Digital Library

[10]

Darte, A. and Huard, G. 2002. New results on array contraction. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors. IEEE Computer Society, 359.

Digital Library

[11]

Darte, A. 1999. On the complexity of loop fusion. In Proceedings of the International Conference in Parallel Architectures and Compilation Techniques. 149--157.

Digital Library

[12]

Darte, A., Schreiber, R., and Villard, G. 2003. Lattice-based memory allocation. In Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (San Jose, CA). ACM Press, New York, NY, 298--308.

Digital Library

[13]

Darte, A., Robert, Y., and Vivien, F. 2000. Scheduling and Automatic Parallelization. Birkhauser, Boston, MA.

Digital Library

[14]

Davidson, J. W. and Jinturkar, S. 1995. Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arborn, MI), IEEE Computer Society Press, Los Alamitos, CA, 125--132.

Digital Library

[15]

Feautrier, P. 1996. Automatic parallelization in the polytope model. In the Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications, G. Perrin and A. Darte, Eds. Lecture Notes In Computer Science, vol. 1132. Springer-Verlag, London, 79--103.

Digital Library

[16]

Fraboulet, A., Kodary, K., and Mignotte, A. 2001. Loop fusion for memory space optimization. In Proceedings of the 14th International Symposium on Systems Synthesis (Montréal, Canada). ACM Press, New York, NY, 95--100.

Digital Library

[17]

Gannon, D., Jalby, W., and Gallivan, K. 1988. Strategies for cache and local memory management by global program transformation. J. Parallel Distrib. Comput. 5, 5, 587--616.

Digital Library

[18]

Greef, E. 1998. Storage size reduction for multimedia application. PhD thesis, Katholieke Universiteit Leuven---IMEC.

[19]

Irigoin, F. and Triolet, R. 1988. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (San Diego, CA). ACM Press, New York, NY, 319--329.

Digital Library

[20]

ITRS. 2003. International Technology Roadmap for Semiconductors.

[21]

Kadayif, I. and Kandemir, M. 2005. Data space-oriented tiling for enhancing locality. Trans. Emb. Comput. Sys. 4, 2, 388--414.

Digital Library

[22]

Kandemir, M., Kadayif, I., Choudhary, A., and Zambreno, J. A. 2002. Optimizing inter-nest data locality. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis For Embedded Systems (Grenoble, France). ACM Press, New York, NY, 127--135.

Digital Library

[23]

Karp, R. M., Miller, R. E., and Winograd, S. 1967. The organization of computations for uniform recurrence equations. J. ACM 14, 3, 563--590.

Digital Library

[24]

Kennedy, K. 2000. Fast greedy weighted fusion. In Proceedings of the 14th International Conference on Supercomputing (Santa Fe, NM). ACM Press, New York, NY, 131--140.

Digital Library

[25]

Kennedy, K. and McKinley, K. S. 1994. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the 6th International Workshop on Languages and Compilers For Parallel Computing. U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, Eds. Lecture Notes In Computer Science, vol. 768. Springer-Verlag, London, 301--320.

Digital Library

[26]

Krishnan, V. and Torrellas, J. 1999. A chip-multiprocessor architecture with speculative multithreading. IEEE Trans. Comput. 48, 9, 866--880.

Digital Library

[27]

Kulkarni, D. and Stumm, M. 1995. Linear loop transformations in optimizing compilers for parallel machines. Australian Comput. J. 27, 2, 41--50.

[28]

Lam, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, CA). R. L. Wexelblat, Ed. ACM Press, New York, NY, 318--328.

Digital Library

[29]

Lamport, L. 1974. The parallel execution of DO loops. Commun. ACM 17, 83--93.

Digital Library

[30]

Lengauer, C. 1993. Loop parallelization in the polytope model. In Proceedings of the 4th International Conference on Concurrency Theory. E. Best, Ed. Lecture Notes In Computer Science, vol. 715. Springer-Verlag, London, 398--416.

Digital Library

[31]

Lefebvre, V. and Feautrier, P. 1998. Automatic storage management for parallel programs. Parall. Comput. 24, 3--4, 649--671.

Digital Library

[32]

Li, F. and Kandemir, M. 2005. Locality-conscious workload assignment for array-based computations in MPSOC architectures. In Proceedings of the 42nd Annual Conference on Design Automation (San Diego, CA). ACM Press, New York, NY, 95--100.

Digital Library

[33]

Manjikian, N. and Abdelrahman, T. S. 1997. Fusion of loops for parallelism and locality. IEEE Trans. Parallel Distrib. Syst. 8, 2, 193--209.

Digital Library

[34]

Marchal, P., Gómez, J. I., and Catthoor, F. 2004. Optimizing the memory bandwidth with loop fusion. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. ACM Press, New York, NY, 188--193.

Digital Library

[35]

Megiddo, N. and Sarkar, V. 1997. Optimal weighted loop fusion for parallel programs. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (Newport, RI). ACM Press, New York, NY, 282--291.

Digital Library

[36]

Mowry, T. C., Lam, M. S., and Gupta, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems. (Boston, MA). R. L. Wexelblat, Ed. ACM Press, New York, NY, 62--73.

Digital Library

[37]

Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K., and Chang, K. 1996. The case for a single-chip multiprocessor. SIGOPS Oper. Syst. Rev. 30, 5, 2--11.

Digital Library

[38]

Paulin, P. G., Pilkington, C., Langevin, M., Bensoudane, E., and Nicolescu, G. 2004. Parallel programming models for a multi-processor SoC platform applied to high-speed traffic management. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (Stockholm, Sweden). ACM Press, New York, NY, 48--53.

Digital Library

[39]

Qian, Y., Carr, S., and Sweany, P. 2002. Loop fusion for clustered VLIW architectures. In Proceedings of the Joint Conference on Languages, Compilers and Tools For Embedded Systems: Software and Compilers For Embedded Systems (Berlin, Germany). ACM Press, New York, NY, 112--119.

Digital Library

[40]

Quilleré, F. and Rajopadhye, S. 2000. Optimizing memory usage in the polyhedral model. ACM Trans. Program. Lang. Syst. 22, 5, 773--815.

Digital Library

[41]

Quinton, P. 1987. The systematic design of systolic arrays. In Centre National De Recherche Scientifique on Automata Networks in Computer Science: Theory and Applications. F. F. Soulié, Y. Robert, and M. Tchuente, Eds. Princeton University Press, Princeton, NJ, 229--260.

Digital Library

[42]

Song, Y., Xu, R., and Wang, C. 2004. Improving data locality by array contraction. IEEE Trans. Comput. 53, 9, 1073--1084.

Digital Library

[43]

Song, Y., Wang, C., and Li, Z. 2005. A polynomial-time algorithm for memory space reduction. Int. J. Parallel Program. 33, 1, 1--33.

Digital Library

[44]

Tu, P. and Padua, D. A. 1994. Automatic array privatization. In Proceedings of the 6th International Workshop on Languages and Compilers For Parallel Computing. U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, Eds. Lecture Notes In Computer Science, vol. 768. Springer-Verlag, London, 500--521.

Digital Library

[45]

Wolf, M. E. and Lam, M. S. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation. ACM Press, New York, NY, 30--44.

Digital Library

[46]

Wolf, M. E. and Lam, M. S. 1991. A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans. Parallel Distrib. Syst. 2, 4, 452--471.

Digital Library

[47]

Xue, J. 2000. Loop Tiling for Parallelism. Kluwer Academic Publishers.

Digital Library

[48]

Zhu, Y., Magklis, G., Scott, M. L., Ding, C., and Albonesi, D. H. 2004. The energy impact of aggressive loop fusion. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 153--164.

Cited By

Bandyopadhyay SBanerjee KKumar ASarkar SGerndt M(2017)PRESGenProceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications10.1145/3085158.3086158(13-20)Online publication date: 26-Jun-2017
https://dl.acm.org/doi/10.1145/3085158.3086158
Banerjee KMandal CSarkar D(2017)An Equivalence Checking Framework for Array-Intensive ProgramsAutomated Technology for Verification and Analysis10.1007/978-3-319-68167-2_6(84-90)Online publication date: 27-Sep-2017
https://doi.org/10.1007/978-3-319-68167-2_6
Banerjee KMandal CSarkar D(2016)Translation validation of loop and arithmetic transformations in the presence of recurrencesACM SIGPLAN Notices10.1145/2980930.290795451:5(31-40)Online publication date: 13-Jun-2016
https://dl.acm.org/doi/10.1145/2980930.2907954
Show More Cited By

Index Terms

MPSoC memory optimization using program transformation
1. Hardware
  1. Electronic design automation
    1. High-level and register-transfer level synthesis
      1. Datapath optimization
    2. Logic synthesis
      1. Circuit optimization
  2. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Selective Victim Caching: A Method to Improve the Performance of Direct-Mapped Caches

Although direct-mapped caches suffer from higher miss ratios as compared to set-associative caches, they are attractive for today's high-speed pipelined processors that require very low access times. Victim caching was proposed by Jouppi [1] as an ...
Precise Data Locality Optimization of Nested Loops

A significant source for enhancing application performance and for reducing power consumption in embedded processor applications is to improve the usage of the memory hierarchy. In this paper, a temporal and spatial locality optimization framework of ...
Trading cache hit rate for memory performance
PACT '14: Proceedings of the 23rd international conference on Parallel architectures and compilation

Most of the prior compiler based data locality optimization works target exclusively cache locality optimization, and row-buffer locality in DRAM banks received much less attention. In particular, to the best of our knowledge, there is no single ...

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Design Automation of Electronic Systems

ACM Transactions on Design Automation of Electronic Systems Volume 12, Issue 4

September 2007

449 pages

ISSN:1084-4309

EISSN:1557-7309

DOI:10.1145/1278349

Issue’s Table of Contents

Copyright © 2007 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

ACM Journals for the Design of Smart and Connected Systems

Publication History

Published: 01 September 2007

Published in TODAES Volume 12, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

15
Total Citations
View Citations
539
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Bandyopadhyay SBanerjee KKumar ASarkar SGerndt M(2017)PRESGenProceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications10.1145/3085158.3086158(13-20)Online publication date: 26-Jun-2017
https://dl.acm.org/doi/10.1145/3085158.3086158
Banerjee KMandal CSarkar D(2017)An Equivalence Checking Framework for Array-Intensive ProgramsAutomated Technology for Verification and Analysis10.1007/978-3-319-68167-2_6(84-90)Online publication date: 27-Sep-2017
https://doi.org/10.1007/978-3-319-68167-2_6
Banerjee KMandal CSarkar D(2016)Translation validation of loop and arithmetic transformations in the presence of recurrencesACM SIGPLAN Notices10.1145/2980930.290795451:5(31-40)Online publication date: 13-Jun-2016
https://dl.acm.org/doi/10.1145/2980930.2907954
Banerjee KMandal CSarkar DKuo TWhalley D(2016)Translation validation of loop and arithmetic transformations in the presence of recurrencesProceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory for Embedded Systems10.1145/2907950.2907954(31-40)Online publication date: 13-Jun-2016
https://dl.acm.org/doi/10.1145/2907950.2907954
Banerjee KMandal CSarkar D(2015)Translation Validation of Transformations of Embedded System Specifications Using Equivalence Checking2015 IEEE Computer Society Annual Symposium on VLSI10.1109/ISVLSI.2015.10(183-186)Online publication date: Jul-2015
https://doi.org/10.1109/ISVLSI.2015.10
Karfa CBanerjee KSarkar DMandal CGupta SKrishnapuram RVerma A(2013)Experimentation with SMT solvers and theorem provers for verification of loop and arithmetic transformationsProceedings of the 5th IBM Collaborative Academia Research Exchange Workshop10.1145/2528228.2528231(1-4)Online publication date: 17-Oct-2013
https://dl.acm.org/doi/10.1145/2528228.2528231
Karfa CBanerjee KSarkar DMandal C(2013)Verification of Loop and Arithmetic Transformations of Array-Intensive BehaviorsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2013.227253632:11(1787-1800)Online publication date: 1-Nov-2013
https://dl.acm.org/doi/10.1109/TCAD.2013.2272536
Corvino RGamatie AGeilen MJozwiak L(2012)Design space exploration in application-specific hardware synthesis for multiple communicating nested loops2012 International Conference on Embedded Computer Systems (SAMOS)10.1109/SAMOS.2012.6404166(128-135)Online publication date: Jul-2012
https://doi.org/10.1109/SAMOS.2012.6404166
Corvino RDiken EGamatie AJozwiak L(2012)Transformation-Based Exploration of Data Parallel Architecture for Customizable HardwareProceedings of the 2012 15th Euromicro Conference on Digital System Design10.1109/DSD.2012.133(774-781)Online publication date: 5-Sep-2012
https://dl.acm.org/doi/10.1109/DSD.2012.133
Karfa CBanerjee KSarkar DMandal C(2011)Equivalence Checking of Array-Intensive ProgramsProceedings of the 2011 IEEE Computer Society Annual Symposium on VLSI10.1109/ISVLSI.2011.61(156-161)Online publication date: 4-Jul-2011
https://dl.acm.org/doi/10.1109/ISVLSI.2011.61
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents