DOI: 10.1145/1555754.1555774
ISCA Conference Proceedings · Research Article

Rigel: an architecture and scalable programming interface for a 1000-core accelerator

Published: 20 June 2009

ABSTRACT

This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications.

We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45nm, which is comparable to high-end GPUs scaled to 45nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.


Published in

ISCA '09: Proceedings of the 36th Annual International Symposium on Computer Architecture
June 2009, 510 pages
ISBN: 9781605585260
DOI: 10.1145/1555754

ACM SIGARCH Computer Architecture News, Volume 37, Issue 3
June 2009, 495 pages
ISSN: 0163-5964
DOI: 10.1145/1555815

      Copyright © 2009 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall acceptance rate: 543 of 3,203 submissions, 17%
