Pivot tracing: dynamic causal monitoring for distributed systems

Published: 04 October 2015

Abstract

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today -- logs, counters, and metrics -- have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluate it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.
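
The abstract names the happened-before join but does not describe how it is computed. The following is a minimal sketch of the idea only, not the prototype's implementation (which targets Java-based systems). It assumes each event records the ids of its direct causal predecessors; the `parents` map, the event fields, and names such as `client-A` and `node-1` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

def happened_before(a, b, parents):
    """Return True if event id `a` causally precedes event id `b`.

    `parents` maps each event id to the ids of its direct causal predecessors.
    """
    stack, seen = list(parents.get(b, ())), set()
    while stack:
        cur = stack.pop()
        if cur == a:
            return True
        if cur not in seen:
            seen.add(cur)
            stack.extend(parents.get(cur, ()))
    return False

def happened_before_join(left, right, parents):
    """Pair each right-hand event with every left-hand event that precedes it."""
    return [(l, r) for r in right for l in left
            if happened_before(l["id"], r["id"], parents)]

# Hypothetical events: two client requests and the disk reads they caused.
parents = {"c1": [], "c2": [], "d1": ["c1"], "d2": ["c2"], "d3": ["d1"]}
clients = [{"id": "c1", "client": "client-A"}, {"id": "c2", "client": "client-B"}]
reads = [
    {"id": "d1", "host": "node-1", "bytes": 4096},
    {"id": "d2", "host": "node-2", "bytes": 1024},
    {"id": "d3", "host": "node-2", "bytes": 512},
]

# Metric defined at the read site, grouped by an attribute from the client site.
bytes_by_client = defaultdict(int)
for client_event, read_event in happened_before_join(clients, reads, parents):
    bytes_by_client[client_event["client"]] += read_event["bytes"]

print(dict(bytes_by_client))
```

Here the metric (bytes read) is measured at one point, while the grouping key (the client name) comes from an earlier event on the same causal path, possibly on another machine. This is the kind of cross-boundary query the abstract describes.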

Published In

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles
October 2015
499 pages
ISBN:9781450338349
DOI:10.1145/2815400
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Best Paper

Qualifiers

  • Research-article

Conference

SOSP '15

Acceptance Rates

SOSP '15 paper acceptance rate: 30 of 181 submissions, 17%
Overall acceptance rate: 174 of 961 submissions, 18%
