Pivot tracing: dynamic causal monitoring for distributed systems

Authors:

Rodrigo Fonseca

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

Pages 378 - 393

https://doi.org/10.1145/2815400.2815415

Published: 04 October 2015

Abstract

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today (logs, counters, and metrics) have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluated it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We also show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.
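The idea of measuring at one point in the system while grouping by an event observed at another can be illustrated with a small sketch. The code below is not Pivot Tracing's API, and the prototype described in the abstract targets Java; every name here (`Baggage`, `client_request`, `datanode_read`, `run_demo`, and the process names) is a hypothetical stand-in. It assumes a per-request context that travels with the request, so a downstream event can be joined with the upstream tuples that happened before it on the same request.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Baggage:
    """Hypothetical per-request context carried across component boundaries."""
    tuples: list = field(default_factory=list)

    def pack(self, **fields):
        # Record a tuple observed at an upstream point of the request.
        self.tuples.append(dict(fields))

    def unpack(self):
        # Return every tuple packed earlier on this request's causal path.
        return list(self.tuples)


def client_request(baggage, proc_name):
    # Upstream point (assumed): the client tags the request with its name.
    baggage.pack(proc_name=proc_name)


def datanode_read(baggage, bytes_read, results):
    # Downstream point (assumed): join the local event with each upstream
    # tuple that preceded it on the same request, then aggregate by client.
    for upstream in baggage.unpack():
        results[upstream["proc_name"]] += bytes_read


def run_demo():
    results = defaultdict(int)
    requests = [("HGet", 100), ("MRsort", 4096), ("HGet", 50)]
    for proc, nbytes in requests:
        baggage = Baggage()  # fresh context for each request
        client_request(baggage, proc)
        datanode_read(baggage, nbytes, results)
    return dict(results)


if __name__ == "__main__":
    print(run_demo())
```

In this sketch the bytes are measured at the downstream function, yet the totals are grouped by a value that only the upstream function knew. Events from unrelated requests never join, because each request carries its own context.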

Published In

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

October 2015

499 pages

ISBN: 9781450338349

DOI: 10.1145/2815400

General Chair: Ethan Miller, UC Santa Cruz

Program Chair: Steven Hand, Google

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SSRC: Storage Systems Research Center, UC Santa Cruz
SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Best Paper

Qualifiers

Research-article

Conference

SOSP '15

SOSP '15: ACM SIGOPS 25th Symposium on Operating Systems Principles

October 4 - 7, 2015

Monterey, California

Acceptance Rates

SOSP '15 Paper Acceptance Rate: 30 of 181 submissions, 17%

Overall Acceptance Rate: 174 of 961 submissions, 18%

Bibliometrics

Total Citations: 134
Total Downloads: 2,687
Downloads (Last 12 months): 137
Downloads (Last 6 weeks): 8

Reflects downloads up to 07 Jan 2025
