research-article

A study of practical deduplication

Published: 02 February 2012

Abstract

We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of redundancy. We found that whole-file deduplication achieves about three quarters of the space savings of the most aggressive block-level deduplication for storage of live file systems, and 87% of the savings for backup images. We also studied file fragmentation, finding that it is not prevalent, and updated prior file system metadata studies, finding that the distribution of file sizes continues to skew toward very large unstructured files.
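As a rough illustration of the two strategies compared above, the following sketch computes the fraction of bytes saved by whole-file deduplication and by block-level deduplication over a small in-memory collection of files. It is not the study's measurement code: the function names, the use of SHA-256, the fixed 4096-byte block size, and the toy input are all assumptions made for this example, and fixed-size blocks are a simplification of block-level deduplication in general.

```python
import hashlib


def whole_file_savings(files):
    """Fraction of bytes saved when each distinct file is stored once."""
    total = sum(len(data) for data in files)
    if total == 0:
        return 0.0
    unique = {}  # content hash -> size of the single stored copy
    for data in files:
        unique[hashlib.sha256(data).digest()] = len(data)
    return 1 - sum(unique.values()) / total


def block_savings(files, block_size=4096):
    """Fraction of bytes saved when each distinct fixed-size block is stored once."""
    total = sum(len(data) for data in files)
    if total == 0:
        return 0.0
    unique = {}  # block hash -> size of the single stored copy
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            unique[hashlib.sha256(block).digest()] = len(block)
    return 1 - sum(unique.values()) / total


# Hypothetical input: two identical files plus one that shares only its first block.
files = [b"A" * 8192, b"A" * 8192, b"A" * 4096 + b"B" * 4096]
whole = whole_file_savings(files)
block = block_savings(files)
print(f"whole-file savings: {whole:.3f}")
print(f"block-level savings: {block:.3f}")
print(f"whole-file as a share of block-level: {whole / block:.3f}")
```

On this toy input, whole-file deduplication saves one third of the bytes and block-level deduplication saves two thirds, so whole-file achieves half of the block-level savings. The ratio depends entirely on the data; the figures reported in the abstract come from the measured file systems, not from a model like this one.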

      • Published in

        ACM Transactions on Storage, Volume 7, Issue 4
        January 2012
        65 pages
        ISSN: 1553-3077
        EISSN: 1553-3093
        DOI: 10.1145/2078861

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 2 February 2012
        • Received: 1 September 2011
        • Accepted: 1 September 2011

        Qualifiers

        • research-article
        • Research
        • Refereed