HPC-Colony: services and interfaces for very large systems
Source: ACM SIGOPS Operating Systems Review
Volume 40, Issue 2 (April 2006)
Column: Operating and runtime systems for high-end computing systems
Pages: 43-49
Year of Publication: 2006
ISSN: 0163-5980
Authors
Sayantan Chakravorty  University of Illinois
Celso L. Mendes  University of Illinois
Laxmikant V. Kalé  University of Illinois
Terry Jones  Lawrence Livermore National Lab.
Andrew Tauferner  IBM
Todd Inglett  IBM
José Moreira  IBM
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1131322.1131334

ABSTRACT

Traditional full-featured operating systems are known to have properties that limit the scalability of distributed-memory parallel programs, the most common programming paradigm used in high-end computing. Furthermore, as processor counts increase in the most capable systems, the activity needed to manage the system becomes more of a burden. To make a general-purpose operating system scale to such levels, new technology is required for parallel resource management and global system management (including fault management). In this paper, we describe the shortcomings of full-featured operating systems and runtime systems and discuss an approach to scaling such systems to one hundred thousand processors with both scalable parallel application performance and efficient system management.

