Late-binding: enabling unordered load-store queues
International Symposium on Computer Architecture
Proceedings of the 34th annual international symposium on Computer architecture
San Diego, California, USA
SESSION: Clocks, scheduling, and stores
Pages: 347 - 357
Year of Publication: 2007
ISBN: 978-1-59593-706-3
Authors
Simha Sethumadhavan, The University of Texas at Austin, Austin, TX
Franziska Roesner, The University of Texas at Austin, Austin, TX
Joel S. Emer, Intel Corporation, Boston, MA
Doug Burger, The University of Texas at Austin, Austin, TX
Stephen W. Keckler, The University of Texas at Austin, Austin, TX
ABSTRACT
Conventional load/store queues (LSQs) are an impediment to both power-efficient execution in superscalar processors and scaling to large-window designs. In this paper, we propose techniques to improve the area and power efficiency of LSQs by allocating entries when instructions issue ("late binding"), rather than when they are dispatched. This approach enables lower occupancy and thus smaller LSQs. Efficient implementations of late-binding LSQs, however, require the entries in the LSQ to be unordered with respect to age. In this paper, we show how to provide full LSQ functionality in an unordered design with only small additional complexity and negligible performance losses. We show that late-binding, unordered LSQs work well for small-window superscalar processors, but can also be scaled effectively to large, kilo-window processors by breaking the LSQs into address-interleaved banks. To handle the increased overflows, we apply classic network flow control techniques to the processor micronetworks, enabling low-overhead recovery mechanisms from bank overflows. We evaluate three such mechanisms: instruction replay, skid buffers, and virtual-channel buffering in the on-chip memory network. We show that for an 80-instruction window, the LSQ can be reduced to 32 entries. For a 1,024-instruction window, the unordered, late-binding LSQ works well with four banks of 48 entries each. By applying a Bloom filter as well, this design achieves full hardware memory disambiguation for a 1,024-instruction window while requiring low average power per load and store access of 8 and 12 CAM entries, respectively.
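The ideas in the abstract can be illustrated with a small software model. The following is a minimal sketch, not the design evaluated in the paper: the names `Entry`, `UnorderedLSQBank`, and `bank_for` are hypothetical, and the single-hash counting filter, the 8-byte word granularity, and the 64-byte interleaving are assumptions made only for illustration. It shows entries being allocated at issue time into arbitrary slots, with an explicit age tag standing in for queue position, a filter check that lets an access skip the associative search, and a failed allocation signalling a bank overflow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    age: int                      # program-order tag, stored explicitly
    is_store: bool
    addr: int
    value: Optional[int] = None


def bank_for(addr: int, n_banks: int = 4) -> int:
    """Address-interleaved bank choice (assumes 64-byte interleaving)."""
    return (addr >> 6) % n_banks


class UnorderedLSQBank:
    """One LSQ bank whose slot positions carry no age information."""

    def __init__(self, capacity: int, filter_bits: int = 64):
        self.capacity = capacity
        self.entries = []                     # unordered list of Entry
        self.counters = [0] * filter_bits     # counting filter, one hash

    def _hash(self, addr: int) -> int:
        return (addr >> 3) % len(self.counters)   # assumes 8-byte words

    def issue(self, entry: Entry) -> bool:
        """Late binding: allocate at issue. False means bank overflow."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(entry)
        self.counters[self._hash(entry.addr)] += 1
        return True

    def forward(self, load_age: int, addr: int) -> Optional[int]:
        """Value of the youngest store older than the load, if any."""
        if self.counters[self._hash(addr)] == 0:
            return None                       # filter miss: no search needed
        older = [e for e in self.entries
                 if e.is_store and e.addr == addr and e.age < load_age]
        return max(older, key=lambda e: e.age).value if older else None

    def commit(self, age: int) -> None:
        """Free the entry carrying this age tag."""
        for i, e in enumerate(self.entries):
            if e.age == age:
                self.counters[self._hash(e.addr)] -= 1
                del self.entries[i]
                return


# Usage: stores arrive out of program order, yet forwarding respects age.
bank = UnorderedLSQBank(capacity=2)
bank.issue(Entry(age=5, is_store=True, addr=0x40, value=7))
bank.issue(Entry(age=2, is_store=True, addr=0x40, value=3))
forwarded = bank.forward(load_age=4, addr=0x40)   # sees only the age-2 store
overflowed = not bank.issue(Entry(age=6, is_store=False, addr=0x48))
```

In this model a caller that sees `issue` return `False` would have to pick a recovery policy. The abstract names three candidates (instruction replay, skid buffers, and virtual-channel buffering), none of which is modelled here.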