ABSTRACT
To compete performance-wise, modern VLIW processors must have fast clock rates and high instruction-level parallelism (ILP). Partitioning resources (functional units and registers) into clusters allows the processor to be clocked faster, but operand transfers across clusters can easily become a bottleneck. Increasing the number of functional units increases the potential ILP, but only helps if the functional units can be kept busy.

To support these features, optimizations such as loop unrolling must be applied to expose ILP, and instructions must be explicitly assigned to clusters to minimize cross-cluster transfers. In an architecture with homogeneous clusters, the number of functional units of a given type is typically a multiple of the number of clusters. Thus, it is common to unroll a loop so that the number of copies of the loop body is a multiple of the number of clusters. The result is that there is a natural mapping of instructions to clusters, which is often the best mapping. While this mapping can be obvious by inspection, we have found that existing cluster assignment algorithms often miss this natural split. The consequence is an excessive number of inter-cluster transfers, which slows down the loop.

Because we were unable to find an existing cluster-assignment algorithm that performed well for unrolled loops, we developed our own. Our Affinity-Based Clustering (ABC) algorithm has been implemented in a production compiler for the Texas Instruments TMS320C6000, a two-cluster VLIW architecture. It is tailored for exploiting the patterns that result from either manual or compiler-based unrolling. As demonstrated experimentally, it performs well, even when post-unrolling optimizations partially obscure the natural split.
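The "natural split" described above can be illustrated with a small sketch. This is not the ABC algorithm, which the abstract does not specify; it only shows why mapping each unrolled copy of the loop body to its own cluster avoids cross-cluster transfers, while a hypothetical assignment by opcode does not. The three-instruction loop body, the function names, and the `by_opcode` assignment are all assumptions made for illustration.

```python
# Illustrative sketch only: hypothetical names, not the ABC algorithm.

def natural_split(instructions, num_clusters):
    """Assign each (opcode, copy_index) instruction to cluster copy_index % num_clusters."""
    return {ins: ins[1] % num_clusters for ins in instructions}

def cross_cluster_transfers(edges, assignment):
    """Count dependence edges whose producer and consumer sit on different clusters."""
    return sum(1 for src, dst in edges if assignment[src] != assignment[dst])

# Assumed loop body: load -> mul -> add, unrolled twice for a two-cluster machine.
BODY = ["load", "mul", "add"]
UNROLL = 2
NUM_CLUSTERS = 2

instructions = [(op, k) for k in range(UNROLL) for op in BODY]

# Dependences stay within one copy of the body: load_k -> mul_k -> add_k.
edges = [((BODY[i], k), (BODY[i + 1], k))
         for k in range(UNROLL)
         for i in range(len(BODY) - 1)]

# Natural split: copy 0 on cluster 0, copy 1 on cluster 1.
natural = natural_split(instructions, NUM_CLUSTERS)

# A contrasting (hypothetical) assignment that alternates clusters by opcode.
by_opcode = {ins: BODY.index(ins[0]) % NUM_CLUSTERS for ins in instructions}

print(cross_cluster_transfers(edges, natural))    # 0
print(cross_cluster_transfers(edges, by_opcode))  # 4
```

In this toy case every dependence edge stays inside one copy of the body, so the per-copy mapping needs no transfers. Real unrolled loops may share values across copies, which is where an assignment algorithm has to make trade-offs.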
INDEX TERMS
Primary Classification: D. Software; D.3 PROGRAMMING LANGUAGES; D.3.4 Processors. Subjects: Code generation
Additional Classification: D. Software; D.3 PROGRAMMING LANGUAGES; D.3.4 Processors. Subjects: Optimization; Compilers
General Terms: Algorithms, Design, Performance
Keywords: VLIW architectures, affinity-based clustering (ABC) algorithms, cluster assignment, homogeneous clusters, loop optimizations, loop scheduling, loop unrolling, partitioned register files, software pipelining