|
ABSTRACT
In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3--1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.
|
 |
2
|
|
| |
3
|
|
| |
4
|
R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. A dynamically tunable memory hierarchy. IEEE Trans. on Computers, 52(10), 2003.
|
 |
5
|
|
 |
6
|
|
 |
7
|
Guy E. Blelloch , Phillip B. Gibbons , Girija J. Narlikar , Yossi Matias, Space-efficient scheduling of parallelism with synchronization variables, Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, p.12-23, June 23-25, 1997, Newport, Rhode Island, United States
[doi> 10.1145/258492.258494]
|
 |
8
|
Robert D. Blumofe , Matteo Frigo , Christopher F. Joerg , Charles E. Leiserson , Keith H. Randall, An analysis of dag-consistent distributed shared-memory algorithms, Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures, p.297-308, June 24-26, 1996, Padua, Italy
[doi> 10.1145/237502.237574]
|
 |
9
|
Robert D. Blumofe , Christopher F. Joerg , Bradley C. Kuszmaul , Charles E. Leiserson , Keith H. Randall , Yuli Zhou, Cilk: an efficient multithreaded runtime system, Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.207-216, July 19-21, 1995, Santa Barbara, California, United States
|
 |
10
|
|
| |
11
|
|
| |
12
|
J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM JRD, 44(6), 2000.
|
| |
13
|
|
| |
14
|
G. Chen, H. Chen, M. Haurylau, N. Nelson, D. Albonesi, P. M. Fauchet, and E. G. Friedman. Electrical and optical on-chip interconnects in scaled microprocessors. In International Symp. on Circuits and Systems, 2005.
|
| |
15
|
|
| |
16
|
S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. Technical Report IRP-TR-07-01, Intel Research Pittsburgh, 2007.
|
| |
17
|
Y.-Y. Chen, J.-K. Peir, and C.-T. King. Performance of shared caches on multithreaded architectures. J. of Information Science and Engineering, 14(2), 1998.
|
 |
18
|
|
 |
19
|
Joachim Clabes , Joshua Friedrich , Mark Sweet , Jack DiLullo , Sam Chu , Donald Plass , James Dawson , Paul Muench , Larry Powell , Michael Floyd , Balaram Sinharoy , Mike Lee , Michael Goulet , James Wagoner , Nicole Schwartz , Steve Runyon , Gary Gorman , Phillip Restle , Ronald Kalla , Joseph McGill , Steve Dodson, Design and implementation of the POWER5™ microprocessor, Proceedings of the 41st annual conference on Design automation, June 07-11, 2004, San Diego, CA, USA
[doi> 10.1145/996566.996749]
|
| |
20
|
|
| |
21
|
S. Eddy. HMMER: profile HMMs for protein sequence analysis. http://hmmer.wustl.edu/.
|
| |
22
|
Alexandra Fedorova , Margo Seltzer , Christoper Small , Daniel Nussbaum, Performance of multithreaded chip multiprocessors and implications for operating system design, Proceedings of the USENIX Annual Technical Conference 2005 on USENIX Annual Technical Conference, p.26-26, April 10-15, 2005, Anaheim, CA
|
 |
23
|
|
| |
24
|
|
 |
25
|
Ken Mai , Tim Paaske , Nuwan Jayasena , Ron Ho , William J. Dally , Mark Horowitz, Smart Memories: a modular reconfigurable architecture, Proceedings of the 27th annual international symposium on Computer architecture, p.161-171, June 2000, Vancouver, British Columbia, Canada
|
| |
26
|
T. Moreshet, R. I. Bahar, and M. Herlihy. Energy-aware microprocessor synchronization: Transactional memory vs. locks. In WMPI, 2006.
|
| |
27
|
G. J. Narlikar. A parallel, multithreaded decision tree builder. Technical Report CMU-CS-98-184, Carnegie Mellon University, 1998.
|
 |
28
|
|
| |
29
|
S. Parekh, S. Eggers, and H. Levy. Thread-sensitive scheduling for SMT processors. Technical report, U. Washington, 2000.
|
 |
30
|
James Philbin , Jan Edler , Otto J. Anshus , Craig C. Douglas , Kai Li, Thread scheduling for cache locality, Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, p.60-71, October 01-04, 1996, Cambridge, Massachusetts, United States
|
 |
31
|
|
| |
32
|
Semiconductor Industry Association. The International Technology Roadmap for Semiconductors (ITRS) 2005 Edition, 2005.
|
| |
33
|
|
| |
34
|
P. Shivakumar and N. P. Jouppi. Cacti 3.0: An integrated cache timing, power and area model. Technical Report WRL 2001/2, Compaq Computer Corporation, 2001.
|
 |
35
|
|
 |
36
|
|
| |
37
|
|
| |
38
|
|
 |
39
|
|
| |
40
|
M. W. Weissmann. Libpmsort. http://freshmeat.net/projects/libpmsort.
|
| |
41
|
|
 |
42
|
|
CITED BY 4
|
|
|
|
Sangeetha Seshadri , Lawrence Chiu , Cornel Constantinescu , Subashini Balachandran , Clem Dickey , Ling Liu , Paul Muench, Enhancing storage system availability on multi-core architectures with recovery-conscious scheduling, Proceedings of the 6th USENIX Conference on File and Storage Technologies, p.1-16, February 26-29, 2008, San Jose, California
|
|
Jeffrey R. Diamond , Behnam Robatmili , Stephen W. Keckler , Robert van de Geijn , Kazushige Goto , Doug Burger, High performance dense linear algebra on a spatially distributed processor, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming, February 20-23, 2008, Salt Lake City, UT, USA
|
|
|
Guy E. Blelloch , Rezaul A. Chowdhury , Phillip B. Gibbons , Vijaya Ramachandran , Shimin Chen , Michael Kozuch, Provably good multicore cache performance for divide-and-conquer algorithms, Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, p.501-510, January 20-22, 2008, San Francisco, California
|
|