Affinity-based cluster assignment for unrolled loops
Source: International Conference on Supercomputing
Proceedings of the 16th International Conference on Supercomputing
New York, New York, USA
Session: Compilers I
Pages: 107 - 116  
Year of Publication: 2002
ISBN:1-58113-483-5
Authors
Gayathri Krishnamurthy, Texas Instruments, Houston, TX
Elana D. Granston, Texas Instruments, Houston, TX
Eric J. Stotzer, Texas Instruments, Houston, TX
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/514191.514209

ABSTRACT

To compete performance-wise, modern VLIW processors must have fast clock rates and high instruction-level parallelism (ILP). Partitioning resources (functional units and registers) into clusters allows the processor to be clocked faster, but operand transfers across clusters can easily become a bottleneck. Increasing the number of functional units increases the potential ILP, but only helps if the functional units can be kept busy.

To support these features, optimizations such as loop unrolling must be applied to expose ILP, and instructions must be explicitly assigned to clusters to minimize cross-cluster transfers. In an architecture with homogeneous clusters, the number of functional units of a given type is typically a multiple of the number of clusters. Thus, it is common to unroll a loop so that the number of copies of the loop body is a multiple of the number of clusters. The result is that there is a natural mapping of instructions to clusters, which is often the best mapping. While this mapping can be obvious by inspection, we have found that existing cluster assignment algorithms often miss this natural split. The consequence is an excessive number of inter-cluster transfers, which slows down the loop.

Because we were unable to find an existing cluster-assignment algorithm that performed well for unrolled loops, we developed our own. Our Affinity-Based Clustering (ABC) algorithm has been implemented in a production compiler for the Texas Instruments TMS320C6000, a two-cluster VLIW architecture. It is tailored for exploiting the patterns that result from either manual or compiler-based unrolling. As demonstrated experimentally, it performs well, even when post-unrolling optimizations partially obscure the natural split.
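
The "natural mapping" described in the abstract can be illustrated with a toy model. The sketch below is not the ABC algorithm, whose details the abstract does not give. It assumes a hypothetical loop body of four dependent operations (load, mul, add, store), unrolled twice for a two-cluster machine, and counts dependence edges whose endpoints land on different clusters. All function names and the round-robin baseline are illustrative assumptions.

```python
from typing import List, Tuple

NUM_CLUSTERS = 2  # two clusters, as on the TMS320C6000

Op = Tuple[str, int]    # (operation name, index of the unrolled copy)
Edge = Tuple[int, int]  # (producer op index, consumer op index)


def unrolled_body(unroll_factor: int) -> Tuple[List[Op], List[Edge]]:
    """Build a toy unrolled loop: each copy is load -> mul -> add -> store."""
    ops: List[Op] = []
    edges: List[Edge] = []
    for copy in range(unroll_factor):
        base = len(ops)
        for name in ("load", "mul", "add", "store"):
            ops.append((name, copy))
        # Dependences stay inside one copy of the loop body.
        edges += [(base, base + 1), (base + 1, base + 2), (base + 2, base + 3)]
    return ops, edges


def natural_assignment(ops: List[Op]) -> List[int]:
    """Place every operation of copy k on cluster k mod NUM_CLUSTERS."""
    return [copy % NUM_CLUSTERS for _, copy in ops]


def round_robin_assignment(ops: List[Op]) -> List[int]:
    """Hypothetical baseline: alternate clusters, ignoring the copy structure."""
    return [i % NUM_CLUSTERS for i in range(len(ops))]


def cross_cluster_transfers(edges: List[Edge], assignment: List[int]) -> int:
    """Count dependence edges whose producer and consumer are on different clusters."""
    return sum(1 for src, dst in edges if assignment[src] != assignment[dst])


if __name__ == "__main__":
    ops, edges = unrolled_body(unroll_factor=2)
    print(cross_cluster_transfers(edges, natural_assignment(ops)))      # 0
    print(cross_cluster_transfers(edges, round_robin_assignment(ops)))  # 6
```

In this toy case the copy-based split keeps every dependence inside one cluster and balances the load evenly, while the baseline turns all six dependences into inter-cluster transfers. Real loops have cross-copy dependences and post-unrolling optimizations that blur this picture, which is the situation the abstract says ABC is designed to handle.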


