research-article · DOI: 10.1145/3149393.3149398

Software-defined storage for fast trajectory queries using a DeltaFS indexed massive directory

Published: 12 November 2017

ABSTRACT

In this paper we introduce the Indexed Massive Directory, a new technique for indexing data within DeltaFS. With its design as a scalable, server-less file system for HPC platforms, DeltaFS scales file system metadata performance with application scale. The Indexed Massive Directory is a novel extension to the DeltaFS data plane, enabling in-situ indexing of massive amounts of data written to a single directory simultaneously, and in an arbitrarily large number of files. We achieve this through a memory-efficient indexing mechanism for reordering and indexing data, and a log-structured storage layout to pack small writes into large log objects, all while ensuring compute node resources are used frugally. We demonstrate the efficiency of this indexing mechanism through VPIC, a widely-used simulation code that scales to trillions of particles. With DeltaFS, we modify VPIC to create a file for each particle to receive writes of that particle's output data. Dynamically indexing the directory's underlying storage allows us to achieve a 5000x speedup in single particle trajectory queries, which require reading all data for a single particle. This speedup increases with application scale while the overhead is fixed at 3% of available memory.
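The write and query pattern the abstract describes can be sketched in a few lines of C. The sketch below is illustrative only: it uses plain POSIX stdio and mkdir as stand-ins for the DeltaFS client, and the particles/ directory name, the particle_frame record layout, and the particle count are hypothetical, not taken from the paper. The point is the access pattern itself: each particle appends tiny per-timestep records to its own file, and a trajectory query reads exactly one of those files back.

    /* Hypothetical sketch of the file-per-particle pattern described in
     * the abstract. The paths and record layout are illustrative; plain
     * stdio stands in for the DeltaFS client API. */
    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <sys/stat.h>

    #define NUM_PARTICLES 1024u   /* tiny stand-in for trillions of particles */

    /* One timestep of output for one particle. */
    struct particle_frame {
        float    pos[3];   /* position */
        float    mom[3];   /* momentum */
        uint32_t step;     /* simulation timestep */
    };

    /* Write path: every particle appends its per-timestep record to its
     * own file inside a single directory. Under DeltaFS these tiny
     * appends would be reordered and indexed in memory, then packed into
     * large log objects on the backing store. */
    static void write_timestep(const struct particle_frame *frames)
    {
        char path[64];
        for (uint32_t p = 0; p < NUM_PARTICLES; p++) {
            snprintf(path, sizeof(path), "particles/p%08" PRIx32, p);
            FILE *f = fopen(path, "ab");
            if (!f) continue;                 /* minimal error handling */
            fwrite(&frames[p], sizeof frames[p], 1, f);
            fclose(f);
        }
    }

    /* Query path: a single-particle trajectory query reads one particle's
     * file front to back. With the directory index, only that particle's
     * records are fetched instead of scanning every particle's output. */
    static void read_trajectory(uint32_t p)
    {
        char path[64];
        struct particle_frame fr;
        snprintf(path, sizeof(path), "particles/p%08" PRIx32, p);
        FILE *f = fopen(path, "rb");
        if (!f) return;
        while (fread(&fr, sizeof fr, 1, f) == 1)
            printf("step %" PRIu32 ": pos=(%g, %g, %g)\n", fr.step,
                   (double)fr.pos[0], (double)fr.pos[1], (double)fr.pos[2]);
        fclose(f);
    }

    int main(void)
    {
        mkdir("particles", 0755);    /* local stand-in for the massive directory */
        static struct particle_frame frames[NUM_PARTICLES];
        write_timestep(frames);      /* one timestep of synthetic (zeroed) output */
        read_trajectory(42);         /* query one particle's trajectory */
        return 0;
    }

On a conventional parallel file system this file-per-particle pattern would collapse under metadata load and small-write overhead; reordering, indexing, and log-packing the appends in situ is what lets both halves of this sketch stay fast as the particle count grows.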

Published in

PDSW-DISCS '17: Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems
November 2017, 74 pages
ISBN: 9781450351348
DOI: 10.1145/3149393

Copyright © 2017 ACM. Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 17 of 41 submissions, 41%
