ABSTRACT
This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model in which parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications.
We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45nm, which is comparable to high-end GPUs scaled to 45nm. We also evaluate several applications ported to Rigel's low-level programming interface, examining scalability issues related to work distribution, synchronization, and load balancing on 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while fast task distribution and barrier operations are important, these operations can be implemented efficiently in software on top of a small set of flexible hardware primitives, without dedicated synchronization hardware.