ACM Home Page
Please provide us with feedback. Feedback
Scheduling threads for constructive cache sharing on CMPs
Full text PdfPdf (302 KB)
Source ACM Symposium on Parallel Algorithms and Architectures archive
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures table of contents
San Diego, California, USA
SESSION: Multicore architectures and algorithms table of contents
Pages: 105 - 115  
Year of Publication: 2007
ISBN:978-1-59593-667-7
Authors
Shimin Chen  Intel Research Pittsburgh
Phillip B. Gibbons  Intel Research Pittsburgh
Michael Kozuch  Intel Research Pittsburgh
Vasileios Liaskovitis  Carnegie Mellon University
Anastassia Ailamaki  Carnegie Mellon University
Guy E. Blelloch  Carnegie Mellon University
Babak Falsafi  Carnegie Mellon University
Limor Fix  Intel Research Pittsburgh
Nikos Hardavellas  Carnegie Mellon University
Todd C. Mowry  Carnegie Mellon University and Intel Research Pittsburgh
Chris Wilkerson  Intel Microprocessor Research Lab
Sponsors
ACM: Association for Computing Machinery
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 38,   Downloads (12 Months): 335,   Citation Count: 4
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1248377.1248396
What is a DOI?

ABSTRACT

In chip multiprocessors (CMPs), limiting the number of offchip cache misses is crucial for good performance. Many multithreaded programs provide opportunities for constructive cache sharing, in which concurrently scheduled threads share a largely overlapping working set. In this paper, we compare the performance of two state-of-the-art schedulers proposed for fine-grained multithreaded programs: Parallel Depth First (PDF), which is specifically designed for constructive cache sharing, and Work Stealing (WS), which is a more traditional design. Our experimental results indicate that PDF scheduling yields a 1.3--1.6X performance improvement relative to WS for several fine-grain parallel benchmarks on projected future CMP configurations; we also report several issues that may limit the advantage of PDF in certain applications. These results also indicate that PDF more effectively utilizes off-chip bandwidth, making it possible to trade-off on-chip cache for a larger number of cores. Moreover, we find that task granularity plays a key role in cache performance. Therefore, we present an automatic approach for selecting effective grain sizes, based on a new working set profiling algorithm that is an order of magnitude faster than previous approaches. This is the first paper demonstrating the effectiveness of PDF on real benchmarks, providing a direct comparison between PDF and WS, revealing the limiting factors for PDF in practice, and presenting an approach for overcoming these factors.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002.
2
 
3
 
4
R. Balasubramonian, D. H. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. A dynamically tunable memory hierarchy. IEEE Trans. on Computers, 52(10), 2003.
5
6
7
8
9
10
 
11
 
12
J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM JRD, 44(6), 2000.
 
13
 
14
G. Chen, H. Chen, M. Haurylau, N. Nelson, D. Albonesi, P. M. Fauchet, and E. G. Friedman. Electrical and optical on-chip interconnects in scaled microprocessors. In International Symp. on Circuits and Systems, 2005.
 
15
 
16
S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. Technical Report IRP-TR-07-01, Intel Research Pittsburgh, 2007.
 
17
Y.-Y. Chen, J.-K. Peir, and C.-T. King. Performance of shared caches on multithreaded architectures. J. of Information Science and Engineering, 14(2), 1998.
18
19
 
20
 
21
S. Eddy. HMMER: profile HMMs for protein sequence analysis. http://hmmer.wustl.edu/.
 
22
23
 
24
25
 
26
T. Moreshet, R. I. Bahar, and M. Herlihy. Energy-aware microprocessor synchronization: Transactional memory vs. locks. In WMPI, 2006.
 
27
G. J. Narlikar. A parallel, multithreaded decision tree builder. Technical Report CMU-CS-98-184, Carnegie Mellon University, 1998.
28
 
29
S. Parekh, S. Eggers, and H. Levy. Thread-sensitive scheduling for SMT processors. Technical report, U. Washington, 2000.
30
31
 
32
Semiconductor Industry Association. The International Technology Roadmap for Semiconductors (ITRS) 2005 Edition, 2005.
 
33
 
34
P. Shivakumar and N. P. Jouppi. Cacti 3.0: An integrated cache timing, power and area model. Technical Report WRL 2001/2, Compaq Computer Corporation, 2001.
35
36
 
37
 
38
39
 
40
M. W. Weissmann. Libpmsort. http://freshmeat.net/projects/libpmsort.
 
41
42


Collaborative Colleagues:
Shimin Chen: colleagues
Phillip B. Gibbons: colleagues
Michael Kozuch: colleagues
Vasileios Liaskovitis: colleagues
Anastassia Ailamaki: colleagues
Guy E. Blelloch: colleagues
Babak Falsafi: colleagues
Limor Fix: colleagues
Nikos Hardavellas: colleagues
Todd C. Mowry: colleagues
Chris Wilkerson: colleagues