Abstract
We collected file system content data from 857 desktop computers at Microsoft over a span of 4 weeks. We analyzed the data to determine the relative efficacy of data deduplication, particularly considering whole-file versus block-level elimination of redundancy. We found that whole-file deduplication achieves about three quarters of the space savings of the most aggressive block-level deduplication for storage of live file systems, and 87% of the savings for backup images. We also studied file fragmentation, finding that it is not prevalent, and updated prior file system metadata studies, finding that the distribution of file sizes continues to skew toward very large unstructured files.
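The two granularities compared above can be illustrated with a minimal sketch. It is not the study's measurement code: the fixed 4096-byte block size, the SHA-256 hash, and the function names `whole_file_savings` and `fixed_block_savings` are assumptions chosen for illustration, and the abstract does not say which chunking or hashing methods were used.

```python
import hashlib


def whole_file_savings(files):
    """Fraction of bytes saved when each distinct file is stored once.

    `files` is a list of bytes objects (hypothetical in-memory stand-ins
    for file contents).
    """
    total = sum(len(data) for data in files)
    seen = set()
    stored = 0
    for data in files:
        digest = hashlib.sha256(data).digest()
        if digest not in seen:
            seen.add(digest)
            stored += len(data)
    return 1 - stored / total if total else 0.0


def fixed_block_savings(files, block_size=4096):
    """Fraction of bytes saved when each distinct fixed-size block is stored once.

    The block size is an assumed parameter; the final block of a file may
    be shorter than `block_size`.
    """
    total = sum(len(data) for data in files)
    seen = set()
    stored = 0
    for data in files:
        for offset in range(0, len(data), block_size):
            block = data[offset:offset + block_size]
            digest = hashlib.sha256(block).digest()
            if digest not in seen:
                seen.add(digest)
                stored += len(block)
    return 1 - stored / total if total else 0.0


if __name__ == "__main__":
    # Two identical files plus one file that shares only its first block.
    sample = [
        b"A" * 8192,
        b"A" * 8192,
        b"A" * 4096 + b"B" * 4096,
    ]
    whole = whole_file_savings(sample)
    block = fixed_block_savings(sample)
    print(f"whole-file savings: {whole:.3f}")
    print(f"block-level savings: {block:.3f}")
    print(f"whole-file share of block-level savings: {whole / block:.3f}")
```

On this toy input, whole-file deduplication captures only part of what block-level deduplication saves, because the third file shares a block with the others without being an exact copy. The abstract's "three quarters" and "87%" figures are ratios of this kind, measured on real file systems and backup images.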
Index Terms
- A study of practical deduplication