Efficient detection of large-scale redundancy in enterprise file systems

Authors: George Forman, Kave Eshghi, Jaap Suermondt
Hewlett-Packard Labs, Palo Alto, CA

ACM SIGOPS Operating Systems Review, Volume 43, Issue 1, January 2009, pp. 84–91. https://doi.org/10.1145/1496909.1496926

Published: 01 January 2009

Abstract

In order to catch and reduce waste in the exponentially increasing demand for disk storage, we have developed very efficient technology to detect approximate duplication of large directory hierarchies. Such duplication can be caused, for example, by unnecessary mirroring of repositories by uncoordinated employees or departments. Identifying these duplicate or near-duplicate hierarchies allows appropriate action to be taken at a high level. For example, one could coordinate and consolidate multiple copies in one location.
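The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way approximate duplication between two directory trees could be estimated. It assumes each tree is represented as the set of its file-content hashes and that the sets are compared with min-hash sketches. The function names (`directory_file_set`, `minhash_sketch`, `estimated_similarity`), the use of SHA-1, and the sketch length of 128 are illustrative assumptions, not the authors' implementation.

```python
import hashlib


def directory_file_set(files):
    """Represent a directory tree as the set of SHA-1 digests of its file contents.

    `files` maps a relative path to the file's bytes. Paths are ignored, so a
    renamed or relocated copy still contributes the same element.
    (Hypothetical representation chosen for this illustration.)
    """
    return {hashlib.sha1(content).digest() for content in files.values()}


def minhash_sketch(items, num_hashes=128):
    """Return a fixed-length min-hash sketch of a non-empty set of byte strings."""
    if not items:
        raise ValueError("cannot sketch an empty set")
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")  # one salted hash function per sketch slot
        sketch.append(min(hashlib.sha1(salt + item).digest() for item in items))
    return sketch


def estimated_similarity(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of sketch slots that agree."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


if __name__ == "__main__":
    # A repository and a mirrored copy under a different root with one extra file.
    original = {f"docs/file{i}.txt": f"content {i}".encode() for i in range(10)}
    mirror = {f"backup/{path}": data for path, data in original.items()}
    mirror["backup/notes.txt"] = b"one extra file"
    a = minhash_sketch(directory_file_set(original))
    b = minhash_sketch(directory_file_set(mirror))
    print(f"estimated similarity: {estimated_similarity(a, b):.2f}")
```

In this sketch each directory is reduced to a fixed number of values regardless of how many files it holds, so two hierarchies can be compared without comparing their full file lists. The example trees share 10 of 11 distinct files, so the estimate should land near their true Jaccard similarity of about 0.91.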

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications

Published in

ACM SIGOPS Operating Systems Review Volume 43, Issue 1
January 2009
97 pages
ISSN: 0163-5980
DOI: 10.1145/1496909
Copyright © 2009 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data mining
directory similarity and de-duplication
file systems
min-hashing
scalability
set sketches
storage management
Qualifiers
- research-article