ABSTRACT
High-performance computing fault tolerance depends on scalable parallel file system performance. For more than a decade, scalable bandwidth has been available from the object storage systems that underlie modern parallel file systems, and recently we have seen demonstrations of scalable parallel metadata using dynamic partitioning of the namespace over multiple metadata servers. But even these scalable parallel file systems require significant numbers of dedicated servers, and some workloads still experience bottlenecks. We envision exascale parallel file systems that do not have any dedicated server machines. Instead, a parallel job instantiates a file system namespace service in client middleware that operates only on scalable object storage and communicates with other jobs by sharing or publishing namespace snapshots. Experiments show that our serverless file system design, DeltaFS, performs metadata operations orders of magnitude faster than traditional file system architectures.
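The idea of a job-private namespace that reaches other jobs only through published snapshots can be illustrated with a minimal sketch. This is not the DeltaFS API or implementation: the `ObjectStore` and `JobNamespace` classes, their methods, the JSON snapshot encoding, and the object names are all hypothetical, and an in-memory dictionary stands in for scalable object storage.

```python
import json


class ObjectStore:
    """Hypothetical stand-in for a scalable object store: flat name -> bytes."""

    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        self._objects[name] = bytes(data)

    def get(self, name):
        return self._objects[name]


class JobNamespace:
    """Namespace held in a job's own middleware, with no dedicated metadata server."""

    def __init__(self, store, base=None):
        self.store = store
        # Start from a previously published snapshot if one is named, else empty.
        self.entries = json.loads(store.get(base)) if base else {}

    def create(self, path, size=0):
        # Metadata mutations stay local to this job until it publishes.
        self.entries[path] = {"size": size}

    def stat(self, path):
        return self.entries[path]

    def publish(self, snapshot_name):
        # Write the whole namespace as one object that other jobs can read.
        payload = json.dumps(self.entries, sort_keys=True).encode()
        self.store.put(snapshot_name, payload)
        return snapshot_name


store = ObjectStore()

# A producer job builds its namespace privately, then publishes a snapshot.
producer = JobNamespace(store)
producer.create("/run1/particle.0001", size=4096)
snap = producer.publish("snapshots/run1")

# A later job starts from that snapshot; its own changes do not flow back.
consumer = JobNamespace(store, base=snap)
consumer.create("/run1/analysis.out", size=128)
```

In this sketch the two jobs never contact a shared metadata service: the only shared state is the snapshot object, which is the property the abstract attributes to the serverless design.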