ABSTRACT
This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model in which parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications.
We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45nm, which is comparable to high-end GPUs scaled to 45nm. We also evaluate several applications ported to Rigel's low-level programming interface, examining scalability issues related to work distribution, synchronization, and load balancing on 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while fast task distribution and barrier operations are important, these operations can be implemented efficiently in software on top of a small set of flexible hardware primitives, without dedicated synchronization hardware.