skip to main content
article

MPSoC memory optimization using program transformation

Published: 01 September 2007 Publication History

Abstract

Multiprocessor system-on-a-chip (MPSoC) architectures have received a lot of attention in the past years, but few advances in compilation techniques target these architectures. This is particularly true for the exploitation of data locality. Most of the compilation techniques for parallel architectures discussed in the literature are based on a single loop nest. This article presents new techniques that consist in applying loop fusion and tiling to several loop nests and to parallelize the resulting code across different processors. These two techniques reduce the number of memory accesses. However, they increase dependencies and thereby reduce the exploitable parallelism in the code. This article tries to address this contradiction. To optimize the memory space used by temporary arrays, smaller buffers are used as a replacement. Different strategies are studied to optimize the processing time spent accessing these buffers. The experiments show that these techniques yield a significant reduction in the number of data cache misses (30%) and in processing time (50%).

References

[1]
Anderson, J. M. and Lam, M. S. 1993. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Albuquerque, New Mexico. R. Cartwright, Ed. ACM Press, New York, NY, 112--125.
[2]
Banerjee, U. 1991. Unimodular transformation of doubles loops. In Advances in Languages and Compiler for Parallel Processing, chapter 10. the MIT Press.
[3]
Bastoul, C. 2004. Code generation in the polyhedral model is easier than you think. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 7--16.
[4]
Bouchebaba, Y. 2002. Optimisation des transferts de données pour le traitement du signal: Fusion, pavage et réallocation des tableaux. PhD thesis, Ecole des mines de Paris, France.
[5]
Bouchebaba, Y., Nicolescu, G., Aboulhamid, E., and Coelho, F. 2006. Buffer and register allocation for memory space optimization. In Proceedings of the IEEE 17th International Conference on Application-Specific Systems. Architectures and Processors. IEEE Computer Society, 283--290.
[6]
Carr, S. and Kennedy, K. 1994. Scalar replacement in the presence of conditional control flow. Softw. Pract. Exper. 24, 1, 51--77.
[7]
Carr, S., Ding, C., and Sweany, P. 1996. Improving software pipelining with unroll-and-jam. In Proceedings of the 29th Hawaii International Conference on System Sciences (Hicss'96) Volume 1: Software Technology and Architecture. IEEE Computer Society, 183.
[8]
Catthoor, F., Greef, E. d., and Suytack, S. 1998. HICSS. Custom Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia System Design. Kluwer Academic Publishers.
[9]
Cierniak, M. and Li, W. 1995. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (La Jolla, CA). ACM Press, New York, NY, 205--217.
[10]
Darte, A. and Huard, G. 2002. New results on array contraction. In Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors. IEEE Computer Society, 359.
[11]
Darte, A. 1999. On the complexity of loop fusion. In Proceedings of the International Conference in Parallel Architectures and Compilation Techniques. 149--157.
[12]
Darte, A., Schreiber, R., and Villard, G. 2003. Lattice-based memory allocation. In Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (San Jose, CA). ACM Press, New York, NY, 298--308.
[13]
Darte, A., Robert, Y., and Vivien, F. 2000. Scheduling and Automatic Parallelization. Birkhauser, Boston, MA.
[14]
Davidson, J. W. and Jinturkar, S. 1995. Improving instruction-level parallelism by loop unrolling and dynamic memory disambiguation. In Proceedings of the 28th Annual International Symposium on Microarchitecture (Ann Arborn, MI), IEEE Computer Society Press, Los Alamitos, CA, 125--132.
[15]
Feautrier, P. 1996. Automatic parallelization in the polytope model. In the Data Parallel Programming Model: Foundations, HPF Realization, and Scientific Applications, G. Perrin and A. Darte, Eds. Lecture Notes In Computer Science, vol. 1132. Springer-Verlag, London, 79--103.
[16]
Fraboulet, A., Kodary, K., and Mignotte, A. 2001. Loop fusion for memory space optimization. In Proceedings of the 14th International Symposium on Systems Synthesis (Montréal, Canada). ACM Press, New York, NY, 95--100.
[17]
Gannon, D., Jalby, W., and Gallivan, K. 1988. Strategies for cache and local memory management by global program transformation. J. Parallel Distrib. Comput. 5, 5, 587--616.
[18]
Greef, E. 1998. Storage size reduction for multimedia application. PhD thesis, Katholieke Universiteit Leuven---IMEC.
[19]
Irigoin, F. and Triolet, R. 1988. Supernode partitioning. In Proceedings of the 15th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (San Diego, CA). ACM Press, New York, NY, 319--329.
[20]
ITRS. 2003. International Technology Roadmap for Semiconductors.
[21]
Kadayif, I. and Kandemir, M. 2005. Data space-oriented tiling for enhancing locality. Trans. Emb. Comput. Sys. 4, 2, 388--414.
[22]
Kandemir, M., Kadayif, I., Choudhary, A., and Zambreno, J. A. 2002. Optimizing inter-nest data locality. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis For Embedded Systems (Grenoble, France). ACM Press, New York, NY, 127--135.
[23]
Karp, R. M., Miller, R. E., and Winograd, S. 1967. The organization of computations for uniform recurrence equations. J. ACM 14, 3, 563--590.
[24]
Kennedy, K. 2000. Fast greedy weighted fusion. In Proceedings of the 14th International Conference on Supercomputing (Santa Fe, NM). ACM Press, New York, NY, 131--140.
[25]
Kennedy, K. and McKinley, K. S. 1994. Maximizing loop parallelism and improving data locality via loop fusion and distribution. In Proceedings of the 6th International Workshop on Languages and Compilers For Parallel Computing. U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, Eds. Lecture Notes In Computer Science, vol. 768. Springer-Verlag, London, 301--320.
[26]
Krishnan, V. and Torrellas, J. 1999. A chip-multiprocessor architecture with speculative multithreading. IEEE Trans. Comput. 48, 9, 866--880.
[27]
Kulkarni, D. and Stumm, M. 1995. Linear loop transformations in optimizing compilers for parallel machines. Australian Comput. J. 27, 2, 41--50.
[28]
Lam, M. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, CA). R. L. Wexelblat, Ed. ACM Press, New York, NY, 318--328.
[29]
Lamport, L. 1974. The parallel execution of DO loops. Commun. ACM 17, 83--93.
[30]
Lengauer, C. 1993. Loop parallelization in the polytope model. In Proceedings of the 4th International Conference on Concurrency Theory. E. Best, Ed. Lecture Notes In Computer Science, vol. 715. Springer-Verlag, London, 398--416.
[31]
Lefebvre, V. and Feautrier, P. 1998. Automatic storage management for parallel programs. Parall. Comput. 24, 3--4, 649--671.
[32]
Li, F. and Kandemir, M. 2005. Locality-conscious workload assignment for array-based computations in MPSOC architectures. In Proceedings of the 42nd Annual Conference on Design Automation (San Diego, CA). ACM Press, New York, NY, 95--100.
[33]
Manjikian, N. and Abdelrahman, T. S. 1997. Fusion of loops for parallelism and locality. IEEE Trans. Parallel Distrib. Syst. 8, 2, 193--209.
[34]
Marchal, P., Gómez, J. I., and Catthoor, F. 2004. Optimizing the memory bandwidth with loop fusion. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. ACM Press, New York, NY, 188--193.
[35]
Megiddo, N. and Sarkar, V. 1997. Optimal weighted loop fusion for parallel programs. In Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (Newport, RI). ACM Press, New York, NY, 282--291.
[36]
Mowry, T. C., Lam, M. S., and Gupta, A. 1992. Design and evaluation of a compiler algorithm for prefetching. In Proceedings of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems. (Boston, MA). R. L. Wexelblat, Ed. ACM Press, New York, NY, 62--73.
[37]
Olukotun, K., Nayfeh, B. A., Hammond, L., Wilson, K., and Chang, K. 1996. The case for a single-chip multiprocessor. SIGOPS Oper. Syst. Rev. 30, 5, 2--11.
[38]
Paulin, P. G., Pilkington, C., Langevin, M., Bensoudane, E., and Nicolescu, G. 2004. Parallel programming models for a multi-processor SoC platform applied to high-speed traffic management. In Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (Stockholm, Sweden). ACM Press, New York, NY, 48--53.
[39]
Qian, Y., Carr, S., and Sweany, P. 2002. Loop fusion for clustered VLIW architectures. In Proceedings of the Joint Conference on Languages, Compilers and Tools For Embedded Systems: Software and Compilers For Embedded Systems (Berlin, Germany). ACM Press, New York, NY, 112--119.
[40]
Quilleré, F. and Rajopadhye, S. 2000. Optimizing memory usage in the polyhedral model. ACM Trans. Program. Lang. Syst. 22, 5, 773--815.
[41]
Quinton, P. 1987. The systematic design of systolic arrays. In Centre National De Recherche Scientifique on Automata Networks in Computer Science: Theory and Applications. F. F. Soulié, Y. Robert, and M. Tchuente, Eds. Princeton University Press, Princeton, NJ, 229--260.
[42]
Song, Y., Xu, R., and Wang, C. 2004. Improving data locality by array contraction. IEEE Trans. Comput. 53, 9, 1073--1084.
[43]
Song, Y., Wang, C., and Li, Z. 2005. A polynomial-time algorithm for memory space reduction. Int. J. Parallel Program. 33, 1, 1--33.
[44]
Tu, P. and Padua, D. A. 1994. Automatic array privatization. In Proceedings of the 6th International Workshop on Languages and Compilers For Parallel Computing. U. Banerjee, D. Gelernter, A. Nicolau, and D. A. Padua, Eds. Lecture Notes In Computer Science, vol. 768. Springer-Verlag, London, 500--521.
[45]
Wolf, M. E. and Lam, M. S. 1991. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation. ACM Press, New York, NY, 30--44.
[46]
Wolf, M. E. and Lam, M. S. 1991. A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans. Parallel Distrib. Syst. 2, 4, 452--471.
[47]
Xue, J. 2000. Loop Tiling for Parallelism. Kluwer Academic Publishers.
[48]
Zhu, Y., Magklis, G., Scott, M. L., Ding, C., and Albonesi, D. H. 2004. The energy impact of aggressive loop fusion. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques. IEEE Computer Society, 153--164.

Cited By

View all
  • (2017)PRESGenProceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications10.1145/3085158.3086158(13-20)Online publication date: 26-Jun-2017
  • (2017)An Equivalence Checking Framework for Array-Intensive ProgramsAutomated Technology for Verification and Analysis10.1007/978-3-319-68167-2_6(84-90)Online publication date: 27-Sep-2017
  • (2016)Translation validation of loop and arithmetic transformations in the presence of recurrencesACM SIGPLAN Notices10.1145/2980930.290795451:5(31-40)Online publication date: 13-Jun-2016
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Design Automation of Electronic Systems
ACM Transactions on Design Automation of Electronic Systems  Volume 12, Issue 4
September 2007
449 pages
ISSN:1084-4309
EISSN:1557-7309
DOI:10.1145/1278349
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

Publication History

Published: 01 September 2007
Published in TODAES Volume 12, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Data locality
  2. compiler transformations
  3. data cache
  4. embedded systems

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)2
  • Downloads (Last 6 weeks)0
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2017)PRESGenProceedings of the 2017 Workshop on Software Engineering Methods for Parallel and High Performance Applications10.1145/3085158.3086158(13-20)Online publication date: 26-Jun-2017
  • (2017)An Equivalence Checking Framework for Array-Intensive ProgramsAutomated Technology for Verification and Analysis10.1007/978-3-319-68167-2_6(84-90)Online publication date: 27-Sep-2017
  • (2016)Translation validation of loop and arithmetic transformations in the presence of recurrencesACM SIGPLAN Notices10.1145/2980930.290795451:5(31-40)Online publication date: 13-Jun-2016
  • (2016)Translation validation of loop and arithmetic transformations in the presence of recurrencesProceedings of the 17th ACM SIGPLAN/SIGBED Conference on Languages, Compilers, Tools, and Theory for Embedded Systems10.1145/2907950.2907954(31-40)Online publication date: 13-Jun-2016
  • (2015)Translation Validation of Transformations of Embedded System Specifications Using Equivalence Checking2015 IEEE Computer Society Annual Symposium on VLSI10.1109/ISVLSI.2015.10(183-186)Online publication date: Jul-2015
  • (2013)Experimentation with SMT solvers and theorem provers for verification of loop and arithmetic transformationsProceedings of the 5th IBM Collaborative Academia Research Exchange Workshop10.1145/2528228.2528231(1-4)Online publication date: 17-Oct-2013
  • (2013)Verification of Loop and Arithmetic Transformations of Array-Intensive BehaviorsIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2013.227253632:11(1787-1800)Online publication date: 1-Nov-2013
  • (2012)Design space exploration in application-specific hardware synthesis for multiple communicating nested loops2012 International Conference on Embedded Computer Systems (SAMOS)10.1109/SAMOS.2012.6404166(128-135)Online publication date: Jul-2012
  • (2012)Transformation-Based Exploration of Data Parallel Architecture for Customizable HardwareProceedings of the 2012 15th Euromicro Conference on Digital System Design10.1109/DSD.2012.133(774-781)Online publication date: 5-Sep-2012
  • (2011)Equivalence Checking of Array-Intensive ProgramsProceedings of the 2011 IEEE Computer Society Annual Symposium on VLSI10.1109/ISVLSI.2011.61(156-161)Online publication date: 4-Jul-2011
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media