research-article · DOI: 10.1145/3149393.3149398

Software-defined storage for fast trajectory queries using a DeltaFS indexed massive directory

Published: 12 November 2017

ABSTRACT

In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient indexing mechanism for reordering and indexing data, and a log-structured storage layout to pack small writes into large log objects, all while ensuring compute node resources are used frugally. We demonstrate the efficiency of this indexing mechanism through VPIC, a widely-used simulation code that scales to trillions of particles. With DeltaFS, we modify VPIC to create a file for each particle to receive writes of that particle's output data. Dynamically indexing the directory's underlying storage allows us to achieve a 5000x speedup in single particle trajectory queries, which require reading all data for a single particle. This speedup increases with application scale while the overhead is fixed at 3% of available memory.
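The write and query pattern the abstract describes can be sketched in a few lines of C. The sketch below is illustrative only: it uses plain POSIX stdio and mkdir as stand-ins for the DeltaFS client, and the particles/ directory name, the particle_frame record layout, and the particle count are hypothetical, not taken from the paper. The point is the access pattern itself: each particle appends tiny per-timestep records to its own file, and a trajectory query reads exactly one of those files back.

    /* Hypothetical sketch of the file-per-particle pattern described in
     * the abstract. The paths and record layout are illustrative; plain
     * stdio stands in for the DeltaFS client API. */
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <sys/stat.h>

    #define NUM_PARTICLES 1024u   /* tiny stand-in for trillions of particles */

    /* One timestep of output for one particle. */
    struct particle_frame {
        float    pos[3];   /* position */
        float    mom[3];   /* momentum */
        uint32_t step;     /* simulation timestep */
    };

    /* Write path: every particle appends its per-timestep record to its
     * own file inside a single directory. Under DeltaFS these tiny
     * appends would be reordered and indexed in memory, then packed into
     * large log objects on the backing store. */
    static void write_timestep(const struct particle_frame *frames)
    {
        char path[64];
        for (uint32_t p = 0; p < NUM_PARTICLES; p++) {
            snprintf(path, sizeof(path), "particles/p%08" PRIx32, p);
            FILE *f = fopen(path, "ab");
            if (!f) continue;                 /* minimal error handling */
            fwrite(&frames[p], sizeof frames[p], 1, f);
            fclose(f);
        }
    }

    /* Query path: a single-particle trajectory query reads one particle's
     * file front to back. With the directory index, only that particle's
     * records are fetched instead of scanning every particle's output. */
    static void read_trajectory(uint32_t p)
    {
        char path[64];
        struct particle_frame fr;
        snprintf(path, sizeof(path), "particles/p%08" PRIx32, p);
        FILE *f = fopen(path, "rb");
        if (!f) return;
        while (fread(&fr, sizeof fr, 1, f) == 1)
            printf("step %" PRIu32 ": pos=(%g, %g, %g)\n", fr.step,
                   (double)fr.pos[0], (double)fr.pos[1], (double)fr.pos[2]);
        fclose(f);
    }

    int main(void)
    {
        mkdir("particles", 0755);    /* local stand-in for the massive directory */
        static struct particle_frame frames[NUM_PARTICLES];
        write_timestep(frames);      /* one timestep of synthetic (zeroed) output */
        read_trajectory(42);         /* query one particle's trajectory */
        return 0;
    }

On a conventional parallel file system this file-per-particle pattern would collapse under metadata load and small-write overhead; reordering, indexing, and log-packing the appends in situ is what lets both halves of this sketch stay fast as the particle count grows.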

Published in

PDSW-DISCS '17: Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems
November 2017, 74 pages
ISBN: 9781450351348
DOI: 10.1145/3149393

Copyright © 2017 ACM. Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 17 of 41 submissions, 41%
