Abstract
Synchronous small writes play a critical role in system availability because they safely log recent state modifications for fast recovery from crashes. Demanding systems typically dedicate separate devices to logging, both for adequate performance during normal operation and for redundancy during state reconstruction. However, storage stacks enforce page-sized granularity in data transfers from memory to disk. As a result, they consume excessive storage bandwidth to handle small writes, which hurts performance. The problem worsens because filesystems often handle multiple concurrent streams, which effectively generate random I/O traffic. For journaled filesystems, we introduce wasteless journaling, a mount mode that coalesces synchronous concurrent small data writes into full page-sized journal blocks. Additionally, we propose selective journaling, which automatically activates wasteless journaling for data writes whose size is below a fixed threshold. We implemented a functional prototype of our design over a widely used filesystem. We compare our modes against existing methods using microbenchmarks and application-level workloads on stand-alone servers and a multitier networked system, and we examine both synchronous and asynchronous writes. Sequentially coalescing small data updates into the journal preserves filesystem consistency while reducing consumed bandwidth by up to several factors, decreasing recovery time by up to 22%, and lowering write latency by up to orders of magnitude.
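To make the bandwidth argument concrete, the following is a minimal sketch, not the prototype described above. It compares journal traffic when each small write occupies at least one whole page against traffic when writes are packed back to back into full journal blocks. The 4 KiB page size, the 16-byte per-write record tag, the function names, and the fallback mode label `"default"` are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page and journal block size, in bytes


def round_up_to_pages(nbytes):
    """Round a byte count up to a whole number of pages."""
    return -(-nbytes // PAGE_SIZE) * PAGE_SIZE


def page_granularity_bytes(write_sizes):
    """Journal traffic when every write is logged in whole pages of its own."""
    return sum(round_up_to_pages(size) for size in write_sizes)


def coalesced_bytes(write_sizes, tag=16):
    """Journal traffic when writes are packed contiguously into full blocks.

    `tag` is a hypothetical per-write record header (for example, file
    offset and length) needed to locate each update inside a shared block.
    """
    return round_up_to_pages(sum(size + tag for size in write_sizes))


def choose_mode(size, threshold=PAGE_SIZE):
    """Selective journaling sketch: coalesce data writes below the threshold.

    What happens at or above the threshold is an assumption here; the
    sketch simply labels it "default".
    """
    return "wasteless" if size < threshold else "default"


if __name__ == "__main__":
    writes = [256] * 64  # 64 concurrent synchronous writes of 256 bytes each
    print(page_granularity_bytes(writes), coalesced_bytes(writes))
    print(choose_mode(512), choose_mode(8192))
```

Under these assumptions, 64 writes of 256 bytes occupy 64 journal pages when logged at page granularity but only 5 pages when coalesced, which illustrates where the bandwidth savings come from.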
Index Terms
- Improving Bandwidth Efficiency for Consistent Multistream Storage