DOI: 10.1145/1150402.1150488
Article

Mining for misconfigured machines in grid systems

Published: 20 August 2006

Abstract

Grid systems are proving increasingly useful for managing the batch computing jobs of organizations. One well-known example is Intel, whose internally developed NetBatch system manages tens of thousands of machines. The size, heterogeneity, and complexity of grid systems, however, make them very difficult to configure. This often results in misconfigured machines, which may adversely affect the entire system.

We investigate a distributed data mining approach for detecting misconfigured machines. Our Grid Monitoring System (GMS) non-intrusively collects data from all sources (log files, system services, etc.) available throughout the grid system. It converts the raw data into semantically meaningful data and stores it on the machine from which it was obtained, which limits the incurred overhead and allows the system to scale. Later, when analysis is requested, a distributed outlier detection algorithm is employed to identify misconfigured machines. The algorithm itself is implemented as a recursive workflow of grid jobs. It is especially suited to grid systems, in which machines might be unavailable most of the time and often fail altogether.
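As a rough illustration of the idea in the abstract, the following Python sketch scores each machine by the sum of distances to its k nearest neighbours and reports the highest-scoring machines as candidate misconfigurations. It is not the paper's algorithm: the scoring rule, the feature vectors, and the names `collect`, `fetch`, and `top_outliers` are all assumptions made for illustration. For brevity the sketch gathers the data in one place, whereas the system described above keeps data on the machine it came from. The recursive `collect` function only loosely mirrors a recursive workflow that tolerates unavailable machines.

```python
import math


def collect(machines, fetch):
    """Recursively gather feature vectors by halving the machine list.

    `fetch` is a hypothetical callable returning a machine's feature
    vector; a machine whose fetch raises OSError is treated as
    unavailable and skipped.
    """
    if not machines:
        return {}
    if len(machines) == 1:
        try:
            return {machines[0]: fetch(machines[0])}
        except OSError:
            return {}
    mid = len(machines) // 2
    merged = collect(machines[:mid], fetch)
    merged.update(collect(machines[mid:], fetch))
    return merged


def knn_outlier_scores(points, k):
    """Score each machine by the sum of distances to its k nearest neighbours."""
    scores = {}
    for name, p in points.items():
        dists = sorted(
            math.dist(p, q) for other, q in points.items() if other != name
        )
        scores[name] = sum(dists[:k])
    return scores


def top_outliers(points, k, n):
    """Return the n machine names with the largest outlier scores."""
    scores = knn_outlier_scores(points, k)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical data: two made-up numeric features per machine.
DATA = {
    "m1": (1.0, 1.0),
    "m2": (1.1, 0.9),
    "m3": (0.9, 1.1),
    "m4": (9.0, 9.0),
    "m5": (1.0, 0.9),
}


def fetch(name):
    # Simulate one unavailable machine.
    if name == "m5":
        raise OSError("machine unavailable")
    return DATA[name]


points = collect(sorted(DATA), fetch)
suspects = top_outliers(points, k=2, n=1)
```

Here `m5` is skipped because it cannot be reached, and `m4` is flagged because its features lie far from those of every other reachable machine.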

Published In

KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2006
986 pages
ISBN:1595933395
DOI:10.1145/1150402
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed data mining
  2. grid information system
  3. grid systems
  4. outliers detection
  5. system monitoring

Qualifiers

  • Article

Conference

KDD06

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) Using Datalog for Effective Continuous Integration Policy Evaluation. Software Quality as a Foundation for Security, pp. 41-52. DOI: 10.1007/978-3-031-56281-5_3. Online publication date: 12-Apr-2024
  • (2023) Test Selection for Unified Regression Testing. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 1687-1699. DOI: 10.1109/ICSE48619.2023.00145. Online publication date: May-2023
  • (2022) Troubleshooting Configuration Errors via Information Retrieval and Configuration Testing. 2022 4th International Academic Exchange Conference on Science and Technology Innovation (IAECST), pp. 422-426. DOI: 10.1109/IAECST57965.2022.10062229. Online publication date: 9-Dec-2022
  • (2021) Test-case prioritization for configuration testing. Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis, pp. 452-465. DOI: 10.1145/3460319.3464810. Online publication date: 11-Jul-2021
  • (2020) Testing configuration changes in context to prevent production failures. Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation, pp. 735-751. DOI: 10.5555/3488766.3488808. Online publication date: 4-Nov-2020
  • (2018) MisconfDoctor: Diagnosing Misconfiguration via Log-Based Configuration Testing. 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS), pp. 1-12. DOI: 10.1109/QRS.2018.00014. Online publication date: Jul-2018
  • (2016) Early detection of configuration errors to reduce failure damage. Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, pp. 619-634. DOI: 10.5555/3026877.3026925. Online publication date: 2-Nov-2016
  • (2015) Systems Approaches to Tackling Configuration Errors. ACM Computing Surveys 47:4, pp. 1-41. DOI: 10.1145/2791577. Online publication date: 21-Jul-2015
  • (2012) Failure analysis of distributed scientific workflows executing in the cloud. Proceedings of the 8th International Conference on Network and Service Management, pp. 46-54. DOI: 10.5555/2499406.2499412. Online publication date: 22-Oct-2012
  • (2012) Latent fault detection in large scale services. IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012), pp. 1-12. DOI: 10.1109/DSN.2012.6263932. Online publication date: Jun-2012
