Automatic software interference detection in parallel applications
Source
Conference on High Performance Networking and Computing archive
Proceedings of the 2007 ACM/IEEE conference on Supercomputing table of contents
Reno, Nevada
SESSION: Security and fault tolerance table of contents
Article No. 14  
Year of Publication: 2007
ISBN:978-1-59593-764-3
Authors
Vahid Tabatabaee  University of Maryland at College Park
Jeffrey K. Hollingsworth  University of Maryland at College Park
Sponsors
IEEE-CS\DATC : IEEE Computer Society
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1362622.1362642

ABSTRACT

We present an automated software interference detection methodology for Single Program, Multiple Data (SPMD) parallel applications. Interference comes from the system and from unexpected processes. If not detected and corrected, such interference may result in performance degradation. Our goal is to provide a reliable metric for software interference that can be used in soft-failure protection and recovery systems. A unique feature of our algorithm is that we measure the relative timing of application events (i.e., the time between MPI calls) rather than system-level events such as CPU utilization. This approach lets our system automatically accommodate natural variations in an application's utilization of resources. We use performance irregularities and degradation as signs of software interference. However, instead of relying on temporal changes in performance, our system detects spatial performance degradation across multiple processors. We also include a case study that demonstrates our technique's effectiveness, resilience, and robustness.
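
The idea of comparing application-event timings across processors, rather than over time, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify. It assumes that each rank has already recorded the time between consecutive MPI calls for the same sequence of event windows, and it uses a simple ratio-to-median test. The function name `spatial_outliers`, the input layout, and the `threshold` value are all hypothetical.

```python
from statistics import median


def spatial_outliers(intervals_by_rank, threshold=1.5):
    """Flag ranks that are slow relative to their peers in the same window.

    intervals_by_rank: dict mapping rank -> list of durations (seconds)
        between consecutive application events; index w in every list is
        assumed to refer to the same event window on every rank.
    threshold: hypothetical ratio to the cross-rank median above which a
        rank is reported as slowed.
    Returns a dict mapping window index -> list of flagged ranks.
    """
    if not intervals_by_rank:
        return {}
    ranks = sorted(intervals_by_rank)
    # Only compare windows that every rank has reported.
    n_windows = min(len(intervals_by_rank[r]) for r in ranks)
    flagged = {}
    for w in range(n_windows):
        window = {r: intervals_by_rank[r][w] for r in ranks}
        baseline = median(window.values())
        if baseline <= 0:
            continue  # nothing meaningful to compare against
        slow = [r for r in ranks if window[r] / baseline > threshold]
        if slow:
            flagged[w] = slow
    return flagged


# Made-up timings for four ranks over three event windows.
intervals = {
    0: [1.0, 1.0, 2.0],
    1: [1.0, 1.1, 2.1],
    2: [1.0, 1.9, 2.0],  # rank 2 is slowed in window 1 only
    3: [1.1, 1.0, 2.0],
}
print(spatial_outliers(intervals))
```

In this made-up data, window 2 takes about twice as long on every rank. A purely temporal check might treat that as degradation, but the cross-rank comparison does not, because all ranks slow down together. Only rank 2 in window 1 stands out from its peers.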

