Stardust: tracking activity in a distributed storage system

Authors:

Brandon Salmon,

Michael Abd-El-Malek,

Gregory R. Ganger

ACM SIGMETRICS Performance Evaluation Review, Volume 34, Issue 1

Pages 3 - 14

https://doi.org/10.1145/1140103.1140280

Published: 26 June 2006

Abstract

Performance monitoring in most distributed systems provides minimal guidance for tuning, problem diagnosis, and decision making. Stardust is a monitoring infrastructure that replaces traditional performance counters with end-to-end traces of requests and allows for efficient querying of performance metrics. Such traces better inform key administrative performance challenges by enabling, for example, extraction of per-workload, per-resource demand information and per-workload latency graphs. This paper reports on our experience building and using end-to-end tracing as an on-line monitoring tool in a distributed storage system. Using diverse system workloads and scenarios, we show that such fine-grained tracing can be made efficient (less than 6% overhead) and is useful for on- and off-line analysis of system behavior. These experiences make a case for having other systems incorporate such an instrumentation framework.
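The difference between counters and end-to-end traces can be illustrated with a minimal sketch. The record layout, field names, and sample values below are hypothetical and are not taken from Stardust itself; the sketch only assumes that each trace record carries a request identifier, the issuing workload, the resource used, and start and end timestamps. Under that assumption, per-workload, per-resource demand and per-request latency fall out of simple aggregations over the records.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityRecord:
    """One hypothetical trace record emitted at an instrumentation point."""
    request_id: int   # ties the record to one end-to-end request
    workload: str     # workload that issued the request
    resource: str     # e.g. "cpu", "disk", "network"
    start_us: int     # when this step began, in microseconds
    end_us: int       # when this step finished, in microseconds


def demand_per_workload(records):
    """Sum the time spent on each resource, keyed by (workload, resource)."""
    demand = defaultdict(int)
    for r in records:
        demand[(r.workload, r.resource)] += r.end_us - r.start_us
    return dict(demand)


def latency_per_request(records):
    """Map request id to (workload, latency): last end minus first start."""
    first, last, owner = {}, {}, {}
    for r in records:
        first[r.request_id] = min(first.get(r.request_id, r.start_us), r.start_us)
        last[r.request_id] = max(last.get(r.request_id, r.end_us), r.end_us)
        owner[r.request_id] = r.workload
    return {rid: (owner[rid], last[rid] - first[rid]) for rid in first}


# Made-up records for two requests from two workloads (illustration only).
records = [
    ActivityRecord(1, "oltp", "network", 0, 40),
    ActivityRecord(1, "oltp", "cpu", 40, 100),
    ActivityRecord(1, "oltp", "disk", 100, 600),
    ActivityRecord(2, "scan", "cpu", 50, 80),
    ActivityRecord(2, "scan", "disk", 80, 1080),
]
demand = demand_per_workload(records)
latency = latency_per_request(records)
```

A plain counter would report only total disk busy time across both workloads; keeping the request and workload identifiers on every record is what allows the breakdown by workload.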



Index Terms

Stardust: tracking activity in a distributed storage system
1. General and reference
  1. Cross-computing tools and techniques


Published In

ACM SIGMETRICS Performance Evaluation Review Volume 34, Issue 1

Performance evaluation review

June 2006

388 pages

ISSN:0163-5999

DOI:10.1145/1140103

SIGMETRICS '06/Performance '06: Proceedings of the joint international conference on Measurement and modeling of computer systems
June 2006
404 pages
ISBN:1595933190
DOI:10.1145/1140277
General Chair: Raymond Marie, University of Rennes 1/IRISA, France
Program Chairs: Peter Key, Microsoft Research, Cambridge, U.K.; Evgenia Smirni, College of William and Mary, USA

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Citations

Cited By

Zhao YLiu CYu TWang LShi XYang YLi YWang ZShangguan D(2025)Tracing Service Request Processing in CloudSoftware Testing, Verification and Reliability10.1002/stvr.191135:2Online publication date: 2-Feb-2025
https://doi.org/10.1002/stvr.1911
Phothilimthana PKadekodi SGhodrati SMoon SMaas MTsafrir DMusuvathi MGupta RAbu-Ghazaleh N(2024)Thesios: Synthesizing Accurate Counterfactual I/O Traces from I/O SamplesProceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 310.1145/3620666.3651337(1016-1032)Online publication date: 27-Apr-2024
https://dl.acm.org/doi/10.1145/3620666.3651337
Zhao LCui YYang YZhou XQiu TLi KBao Y(2023)Component-distinguishable Co-location and Resource Reclamation for High-throughput ComputingACM Transactions on Computer Systems10.1145/363000642:1-2(1-37)Online publication date: 18-Nov-2023
https://dl.acm.org/doi/10.1145/3630006
Yang YWang LGu JLi Y(2023)Capturing Request Execution Path for Understanding Service Behavior and Detecting Anomalies Without Code InstrumentationIEEE Transactions on Services Computing10.1109/TSC.2022.314994916:2(996-1010)Online publication date: 1-Mar-2023
https://doi.org/10.1109/TSC.2022.3149949
Yu GHuang ZChen P(2023)TraceRankJournal of Software: Evolution and Process10.1002/smr.241335:10Online publication date: 12-Oct-2023
https://dl.acm.org/doi/10.1002/smr.2413
Zhao YLiu CYu TWang LShi XYang YLi YWang ZShangguan D(2022)Tracing Processing of Service Requests in Cloud Environments2022 IEEE 27th Pacific Rim International Symposium on Dependable Computing (PRDC)10.1109/PRDC55274.2022.00017(24-33)Online publication date: Nov-2022
https://doi.org/10.1109/PRDC55274.2022.00017
Ezzati-Jivan NDaoud HDagenais M(2021)Debugging of Performance Degradation in Distributed Requests Handling Using Multilevel Trace AnalysisWireless Communications & Mobile Computing10.1155/2021/84780762021Online publication date: 16-Nov-2021
https://dl.acm.org/doi/10.1155/2021/8478076
Huang LZhu T(2021)tprofProceedings of the ACM Symposium on Cloud Computing10.1145/3472883.3486994(76-91)Online publication date: 1-Nov-2021
https://dl.acm.org/doi/10.1145/3472883.3486994
Esteves TNeves FOliveira RPaulo JZhang KGherbi AVenkatasubramanian NVeiga L(2021)CATProceedings of the 22nd International Middleware Conference10.1145/3464298.3493396(223-235)Online publication date: 6-Dec-2021
https://dl.acm.org/doi/10.1145/3464298.3493396
Li YWu ZWang HSun KLi ZJee KRhee JChen HJoshi ACarminati BVerma R(2021)UTrackProceedings of the Eleventh ACM Conference on Data and Application Security and Privacy10.1145/3422337.3447831(161-172)Online publication date: 26-Apr-2021
https://dl.acm.org/doi/10.1145/3422337.3447831
Show More Cited By
