ABSTRACT
We present an automated methodology for detecting software interference in Single Program, Multiple Data (SPMD) parallel applications. Interference comes from the system and from unexpected processes. If not detected and corrected, such interference may result in performance degradation. Our goal is to provide a reliable metric for software interference that can be used in soft-failure protection and recovery systems. A unique feature of our algorithm is that we measure the relative timing of application events (i.e., the time between MPI calls) rather than system-level events such as CPU utilization. This approach lets our system automatically accommodate natural variations in an application's utilization of resources. We use performance irregularities and degradation as signs of software interference. However, instead of relying on temporal changes in performance, our system detects spatial performance degradation across multiple processors. We also include a case study that demonstrates our technique's effectiveness, resilience, and robustness.
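The spatial comparison described above can be illustrated with a minimal sketch. It assumes each rank of an SPMD job has recorded the elapsed time between the same pair of MPI calls, and it flags ranks that are slow relative to their peers at that step rather than relative to their own history. The median baseline, the 1.5 threshold, and the name `flag_slow_ranks` are illustrative assumptions, not the metric defined in the paper.

```python
from statistics import median


def flag_slow_ranks(intervals_by_rank, threshold=1.5):
    """Return ranks whose interval exceeds `threshold` times the median.

    `intervals_by_rank` maps a rank to the time (seconds) it spent
    between the same two application events. Both the median baseline
    and the threshold are assumptions made for this sketch.
    """
    typical = median(intervals_by_rank.values())
    if typical <= 0:
        return []
    return sorted(
        rank
        for rank, elapsed in intervals_by_rank.items()
        if elapsed / typical > threshold
    )


# Hypothetical timings for one pair of MPI calls on four ranks.
sample = {0: 0.101, 1: 0.099, 2: 0.100, 3: 0.240}
suspects = flag_slow_ranks(sample)

# A phase that is uniformly ten times slower on every rank flags the
# same ranks, because only the ratio to the peers matters.
slower_phase = {rank: 10 * elapsed for rank, elapsed in sample.items()}
suspects_slower_phase = flag_slow_ranks(slower_phase)
```

Because every rank is compared against its peers at the same point in the program, a phase that is uniformly more expensive on all ranks raises no additional flags; only a rank that falls behind the others is reported.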