research-article

Improving Bandwidth Efficiency for Consistent Multistream Storage

Published: 01 March 2013
Abstract

Synchronous small writes play a critical role in system availability because they safely log recent state modifications for fast recovery from crashes. Demanding systems typically dedicate separate devices to logging for adequate performance during normal operation and redundancy during state reconstruction. However, storage stacks enforce page-sized granularity in data transfers from memory to disk. Thus, they consume excessive storage bandwidth to handle small writes, which hurts performance. The problem worsens because filesystems often handle multiple concurrent streams, which effectively generate random I/O traffic. For journaled filesystems, we introduce wasteless journaling, a mount mode that coalesces synchronous concurrent small data writes into full page-sized journal blocks. Additionally, we propose selective journaling, which automatically activates wasteless journaling for data writes smaller than a fixed threshold. We implemented a functional prototype of our design over a widely used filesystem. We compare our modes against existing methods using microbenchmarks and application-level workloads on stand-alone servers and a multitier networked system, and we examine both synchronous and asynchronous writes. Coalescing small data updates to the journal sequentially preserves filesystem consistency while reducing consumed bandwidth by up to several factors, decreasing recovery time by up to 22%, and lowering write latency by up to orders of magnitude.
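
The bandwidth argument in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the authors' implementation: the 4096-byte page size, the function names, and the `"default"` label for the non-coalescing path are all assumptions made for illustration, and the model ignores any per-write headers or metadata a real journal would add. It only compares the bytes sent to the journal when every small write occupies its own page against the bytes sent when writes are packed back to back, and shows a threshold-based choice between the two paths.

```python
PAGE_SIZE = 4096  # bytes; assumed value, the abstract does not state a page size


def pages_needed(nbytes, page_size=PAGE_SIZE):
    """Whole pages required to hold nbytes (ceiling division)."""
    return -(-nbytes // page_size)


def padded_journal_bytes(write_sizes, page_size=PAGE_SIZE):
    """Bytes transferred if each write is rounded up to page granularity."""
    return sum(pages_needed(n, page_size) * page_size for n in write_sizes)


def coalesced_journal_bytes(write_sizes, page_size=PAGE_SIZE):
    """Bytes transferred if writes are packed together into full pages.

    Only the final page may be partially filled. Metadata overhead is
    deliberately ignored in this simplified model.
    """
    return pages_needed(sum(write_sizes), page_size) * page_size


def choose_mode(write_size, threshold):
    """Hypothetical selective policy: coalesce only writes below the threshold."""
    return "wasteless" if write_size < threshold else "default"


if __name__ == "__main__":
    # Eight concurrent streams, each issuing one 512-byte synchronous write.
    writes = [512] * 8
    padded = padded_journal_bytes(writes)
    packed = coalesced_journal_bytes(writes)
    print(f"page-granular: {padded} bytes, coalesced: {packed} bytes")
    print(f"mode for 512 B write:   {choose_mode(512, PAGE_SIZE)}")
    print(f"mode for 64 KiB write:  {choose_mode(64 * 1024, PAGE_SIZE)}")
```

Under these assumptions, eight 512-byte writes cost eight pages when each is padded separately but fit in a single page when coalesced, which is the kind of saving the abstract describes qualitatively. The actual gains reported in the article depend on the workload and the prototype.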

            • Published in

              ACM Transactions on Storage  Volume 9, Issue 1
              March 2013
              84 pages
              ISSN:1553-3077
              EISSN:1553-3093
              DOI:10.1145/2435204
              Copyright © 2013 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 March 2013
              • Accepted: 1 October 2012
              • Revised: 1 July 2012
              • Received: 1 April 2012

              Qualifiers

              • research-article
              • Research
              • Refereed
