ABSTRACT
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is creating parallel software easily enough that a single program can effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that scales seamlessly both within a chip and beyond, because it is a straightforward extension of write-back invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, program performance improves by 86% and 56% for two of the 13 general-purpose applications studied, by more than 8% for four others, and by 16% on average across all applications, confirming that TLS is a promising way to exploit the naturally multithreaded processing resources of future computer systems.