DOI: 10.1145/291069.291061

An empirical study of decentralized ILP execution models

Published: 01 October 1998 (ASPLOS Conference Proceedings)

ABSTRACT

Recent interest in dynamic scheduling as a means of exploiting instruction-level parallelism has drawn significant attention to the scalability of dynamic scheduling hardware. To overcome the scalability problems of centralized hardware schedulers, many decentralized execution models have recently been proposed and investigated. The crux of all these models is to split the instruction window across multiple processing elements (PEs) that perform independent scheduling of instructions. The decentralized execution models proposed so far can be grouped into three categories, based on the criterion used for assigning an instruction to a particular PE: (i) execution unit dependence based decentralization (EDD), (ii) control dependence based decentralization (CDD), and (iii) data dependence based decentralization (DDD). This paper investigates the performance aspects of these three decentralization approaches. Using a suite of important benchmarks and realistic system parameters, we examine performance differences resulting from the type of partitioning as well as from specific implementation issues such as the type of PE interconnect. We found that with a ring-type PE interconnect, the DDD approach performs best when the number of PEs is moderate, and the CDD approach performs best when the number of PEs is large. The currently used approach, EDD, does not perform well for any configuration. With a realistic crossbar, performance does not increase with the number of PEs for any of the partitioning approaches. The results give insight into the best way to use the transistor budget available for implementing the instruction window.
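To make the three partitioning criteria concrete, the following is a minimal sketch, not taken from the paper, of how a dispatcher might steer each dynamic instruction to one of several PEs under the EDD, CDD, and DDD criteria. The instruction fields, mapping tables, and fallback choices here are illustrative assumptions; the models evaluated in the paper differ in their exact mechanisms.

from dataclasses import dataclass
from itertools import count

NUM_PES = 4

@dataclass
class Instr:
    opcode_class: str        # e.g. "int", "fp", "mem", "branch" (used by EDD)
    task_id: int             # dynamic basic-block / task number (used by CDD)
    src_regs: tuple = ()     # source register names (used by DDD)
    dst_reg: str = ""        # destination register, "" if none (used by DDD)

# (i) EDD: execution-unit-dependence-based decentralization -- an instruction
# goes to the PE that hosts the matching type of functional unit.
EDD_MAP = {"int": 0, "fp": 1, "mem": 2, "branch": 3}

def assign_edd(instr):
    return EDD_MAP[instr.opcode_class] % NUM_PES

# (ii) CDD: control-dependence-based decentralization -- all instructions of
# the same control-flow task (basic block, trace, ...) go to one PE, and
# successive tasks rotate around the PEs.
def assign_cdd(instr):
    return instr.task_id % NUM_PES

# (iii) DDD: data-dependence-based decentralization -- steer an instruction to
# the PE that produced one of its source operands, so register dependence
# chains stay local to a PE; independent instructions fall back to round-robin.
producer_pe = {}      # register name -> PE that last wrote it
rr = count()          # fallback counter for instructions with no known producer

def assign_ddd(instr):
    for reg in instr.src_regs:
        if reg in producer_pe:
            pe = producer_pe[reg]
            break
    else:
        pe = next(rr) % NUM_PES
    if instr.dst_reg:
        producer_pe[instr.dst_reg] = pe
    return pe

if __name__ == "__main__":
    trace = [
        Instr("mem",    task_id=0, src_regs=("r1",), dst_reg="r2"),
        Instr("int",    task_id=0, src_regs=("r2",), dst_reg="r3"),
        Instr("fp",     task_id=1, src_regs=("r4",), dst_reg="r5"),
        Instr("branch", task_id=1, src_regs=("r3",)),
    ]
    for ins in trace:
        print(ins.opcode_class, assign_edd(ins), assign_cdd(ins), assign_ddd(ins))

In this toy policy, DDD keeps each register dependence chain on the PE that produced its operands, which is one intuition for why a data-dependence-based split can tolerate a slow ring interconnect at moderate PE counts, as the abstract reports; the exact reasons in the paper are more detailed.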

