Abstract
To catch and reduce waste amid the exponentially increasing demand for disk storage, we have developed very efficient technology for detecting approximate duplication of large directory hierarchies. Such duplication can arise, for example, when uncoordinated employees or departments unnecessarily mirror the same repositories. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level; for example, multiple copies could be coordinated and consolidated in one location.
Index Terms
- Efficient detection of large-scale redundancy in enterprise file systems