ABSTRACT
The traditional VLIW (very long instruction word) architecture with a single register file does not scale well to the growing performance demands on embedded media processors. However, splitting a VLIW processor into smaller clusters, each composed of function units fully connected to a local register file, can significantly improve the VLSI implementation characteristics of the processor, such as speed, energy consumption, and area. In this paper we show that achieving the best characteristics of a clustered VLIW requires a careful selection of an Inter-cluster Communication (ICC) model, which is the way clustering is exposed in the Instruction Set Architecture. For our study we first define a taxonomy of ICC models comprising copy operations, dedicated issue slots, extended operands, extended results, and multicast. Evaluating the execution time of the models requires both the dynamic cycle count and the clock period. We developed an advanced instruction scheduler for all five ICC models in order to quantify the dynamic cycle counts of our multimedia C benchmarks. To assess the clock period of the ICC models, we designed and laid out VLIW datapaths using RTL hardware descriptions derived from a deeply pipelined commercial TriMedia processor. In contrast to prior art, our research shows that fully distributed register file architectures (with eight clusters in our study) often underperform moderately clustered machines with two or four clusters, because the cycle count overhead explodes in the former. Among the evaluated ICC models, the performance of the copy operation model, popular in both academia and industry, is severely limited by copy operations hampering the scheduling of regular operations in high-ILP (instruction-level parallelism) code. The dedicated issue slots model combats this limitation by dedicating extra VLIW issue slots purely to ICC, reaching the highest execution time speedup of 1.74 relative to the unicluster.
Furthermore, our VLSI experiments show that the lowest area and energy consumption, 42% and 57% of the unicluster's respectively, are achieved by the extended operands model, which nevertheless provides higher performance than the copy operation model.
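The relation between execution time, dynamic cycle count, and clock period mentioned in the abstract can be sketched as follows. This is a minimal illustration only: the function names and all numeric values are hypothetical and are not taken from the paper's measurements.

```python
def execution_time(cycle_count: int, clock_period_ns: float) -> float:
    """Execution time in ns: dynamic cycle count times clock period."""
    return cycle_count * clock_period_ns


def speedup(baseline_time: float, candidate_time: float) -> float:
    """Speedup of a candidate machine relative to a baseline machine."""
    return baseline_time / candidate_time


# Hypothetical numbers, chosen only to show the trade-off:
# the clustered machine needs more cycles (ICC overhead)
# but runs at a shorter clock period.
uni_time = execution_time(cycle_count=1_000_000, clock_period_ns=2.0)
clustered_time = execution_time(cycle_count=1_250_000, clock_period_ns=1.0)

print(speedup(uni_time, clustered_time))
```

In this made-up case the clustered machine executes 25% more cycles yet still finishes faster, because its clock period is half that of the unicluster. The same arithmetic also shows why a large enough cycle count overhead can cancel the clock period gain.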
INDEX TERMS
Primary Classification:
C. Computer Systems Organization
C.1 PROCESSOR ARCHITECTURES
C.1.4 Parallel Architectures
Additional Classification:
D. Software
D.3 PROGRAMMING LANGUAGES
D.3.4 Processors
Subjects: Code generation; Compilers
H. Information Systems
H.4 INFORMATION SYSTEMS APPLICATIONS
H.4.3 Communications Applications
General Terms: Design, Experimentation, Languages, Performance
Keywords: Instruction-level parallelism, VLIW, clock frequency, cluster assignment, instruction scheduler, intercluster communication, optimizing compiler, pipelining, register allocation