Abstract
IBM Spectrum Scale’s parallel file system, the General Parallel File System (GPFS), has a 20-year development history with over 100 contributing developers. Its ability to support strict POSIX semantics across more than 10K clients leads to a complex design with intricate interactions between the cluster nodes. Tracing has proven to be a vital tool for understanding the behavior and anomalies of such a complex software product. However, the necessary trace information is often buried in hundreds of gigabytes of by-product trace records. Further, the overhead of tracing can significantly impact running applications and file system performance, limiting the use of tracing in production systems.
In this research article, we discuss the evolution of the mature and highly scalable GPFS tracing tool and present an exploratory study of GPFS’ new tracing interface, FlexTrace, which allows developers and users to specify precisely what to trace for the problem they are trying to solve. We evaluate our methodology and prototype, demonstrating that the proposed approach has negligible overhead, even under intensive I/O workloads and with low-latency storage devices.
Challenges and Solutions for Tracing Storage Systems: A Case Study with Spectrum Scale