ABSTRACT
The advance of multi-core processors has led to renewed interest in extracting parallelism from programs. It is sometimes useful to know how much parallelism is exploitable in the limit for general programs, to put the speedups of various parallelisation techniques into perspective. Wall's study [19] was one of the first to examine the limits of parallelism in detail. We present an extension of Wall's analysis, constructing Dynamic Dependency Graphs from execution traces of a number of benchmark programs. This allows better visualisation of the types of dependencies that limit parallelism, as well as flexibility in transforming the graphs when exploring possible optimisations. We confirm some of the results of Wall and subsequent studies, including that average available parallelism is often above 100, but that exploiting it requires effective measures to resolve control dependencies, as well as memory renaming. We also study how certain compiler artifacts affect the limits of parallelism. In particular, we show that the use of a spaghetti stack, a technique that implicitly renames stack memory and breaks chains of true dependencies on the stack pointer, can double the potential parallelism.
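In a Dynamic Dependency Graph, each dynamic instruction is placed one level below its latest producer, so the depth of the graph is the critical path of true (read-after-write) dependencies, and the ratio of trace length to depth is the average available parallelism. The following sketch illustrates this idea under idealised assumptions (perfect control resolution and renaming, so only true dependencies constrain scheduling); the trace representation and the `available_parallelism` helper are illustrative, not the paper's actual tooling:

```python
from collections import defaultdict

def available_parallelism(trace):
    """Average available parallelism of an instruction trace.

    trace: list of (reads, writes) pairs, where reads/writes are sets of
    register or memory location names. Only true (read-after-write)
    dependencies are modelled, mirroring an idealised machine with perfect
    branch prediction and full renaming.
    """
    last_writer_level = defaultdict(int)  # location -> DDG level of last write
    depth = 0
    for reads, writes in trace:
        # An instruction issues one level after all of its producers.
        level = 1 + max((last_writer_level[r] for r in reads), default=0)
        for w in writes:
            last_writer_level[w] = level
        depth = max(depth, level)
    return len(trace) / depth if depth else 0.0

# A serial dependency chain yields parallelism 1; fully independent
# instructions all sit at level 1, so parallelism equals the trace length.
chain = [({"a"}, {"a"})] * 4
independent = [(set(), {f"r{i}"}) for i in range(4)]
```

A chain of true dependencies on a single location (such as the stack pointer) collapses the whole trace onto one path; renaming that location, as the spaghetti stack does implicitly, breaks the chain and raises the measured limit.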
- A. W. Appel and Z. Shao. An empirical and analytic study of stack vs. heap cost for languages with closures. Technical Report CS-TR-450-94, 1994.
- T. M. Austin and G. S. Sohi. Dynamic dependency analysis of ordinary programs. In Nineteenth International Symposium on Computer Architecture, pages 342--351, Gold Coast, Australia, 1992. ACM and IEEE Computer Society.
- F. Bellard. QEMU, a fast and portable dynamic translator. In ATEC '05: Proceedings of the annual conference on USENIX Annual Technical Conference, pages 41--41, Berkeley, CA, USA, 2005. USENIX Association.
- B. Blume, R. Eigenmann, K. Faigin, J. Grout, J. Hoe, D. Padua, P. Petersen, B. Pottenger, L. Rauchwerger, P. Tu, and S. Weatherford. Polaris: The next generation in parallelizing compilers. In Proceedings of the Workshop on Languages and Compilers for Parallel Computing, pages 10--1. Springer-Verlag, Berlin/Heidelberg, 1994.
- H. J. Curnow and B. A. Wichmann. A synthetic benchmark. Computer Journal, 19(1):43--49, 1976.
- J. Ferrante, K. J. Ottenstein, and J. D. Warren. The program dependence graph and its use in optimization. ACM Trans. Program. Lang. Syst., 9(3):319--349, 1987.
- S. C. Goldstein, K. E. Schauser, and D. E. Culler. Lazy threads: Implementing a fast parallel call. Journal of Parallel and Distributed Computing, 37(1):5--20, 1996.
- M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In WWC '01: Proceedings of the Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop, pages 3--14, Washington, DC, USA, 2001. IEEE Computer Society.
- K. Kennedy and J. R. Allen. Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.
- M. S. Lam and R. P. Wilson. Limits of control flow on parallelism. In Nineteenth International Symposium on Computer Architecture, pages 46--57, Gold Coast, Australia, 1992. ACM and IEEE Computer Society.
- M. H. Lipasti, C. B. Wilkerson, and J. P. Shen. Value locality and load value prediction. SIGOPS Oper. Syst. Rev., 30(5):138--147, 1996.
- G. Ottoni and D. I. August. Global multi-threaded instruction scheduling. In Proceedings of the 40th annual IEEE/ACM International Symposium on Microarchitecture, pages 56--68, 2007.
- G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, pages 105--118, Washington, DC, USA, 2005. IEEE Computer Society.
- M. A. Postiff, D. A. Greene, G. S. Tyson, and T. N. Mudge. The limits of instruction level parallelism in SPEC95 applications. Computer Architecture News, 27(1):31--34, 1999.
- J. E. Smith. A study of branch prediction strategies. In ISCA '98: 25 years of the international symposia on Computer architecture (selected papers), pages 202--215, New York, NY, USA, 1998. ACM.
- D. Stefanović and M. Martonosi. Limits and graph structure of available instruction-level parallelism (research note). Lecture Notes in Computer Science, 1900:1018--1022, 2001.
- J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation. SIGARCH Comput. Archit. News, 28(2):1--12, 2000.
- H. Sutter. A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3):16--20, March 2005.
- D. W. Wall. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), volume 26, pages 176--189, New York, NY, 1991. ACM Press.
- R. P. Weicker. Dhrystone: a synthetic systems programming benchmark. Commun. ACM, 27(10):1013--1030, 1984.
Index Terms
- Limits of parallelism using dynamic dependency graphs