ABSTRACT
We present a simple and efficient external perfect hashing scheme (referred to as EPH algorithm) for very large static key sets. We use a number of techniques from the literature to obtain a novel scheme that is theoretically well-understood and at the same time achieves an order-of-magnitude increase in the size of the problem to be solved compared to previous "practical" methods. We demonstrate the scalability of our algorithm by constructing minimum perfect hash functions for a set of 1.024 billion URLs from the World Wide Web of average length 64 characters in approximately 62 minutes, using a commodity PC. Our scheme produces minimal perfect hash functions using approximately 3.8 bits per key. For perfect hash functions in the range {0,...,2n - 1} the space usage drops to approximately 2.7 bits per key. The main contribution is the first algorithm that has experimentally proven practicality for sets in the order of billions of keys and has time and space usage carefully analyzed without unrealistic assumptions.
- N. Alon, M. Dietzfelbinger, P. B. Miltersen, E. Petrank, and G. Tardos. Linear hash functions. J. of the ACM, 46(5):667--683, 1999. Google ScholarDigital Library
- N. Alon and M. Naor. Derandomization, witnesses for Boolean matrix multiplication and construction of perfect hash functions. Algorithmica, 16(4-5):434--449, 1996.Google ScholarCross Ref
- P. Boldi and S. Vigna. The webgraph framework i: Compression techniques. In Proc. of the 13th Intl. World Wide Web Conference, pages 595--602, 2004. Google ScholarDigital Library
- F. Botelho, Y. Kohayakawa, and N. Ziviani. A practical minimal perfect hashing method. In Proc. of the 4th Intl. Workshop on Efficient and Experimental Algorithms, pages 488--500. Springer LNCS, 2005. Google ScholarDigital Library
- F. Botelho, R. Pagh, and N. Ziviani. Simple and space-efficient minimal perfect hash functions. In Proc. of the 10th Intl. Workshop on Data Structures and Algorithms, pages 139--150. Springer LNCS, 2007. Google ScholarDigital Library
- F. C. Botelho, R. Pagh, and N. Ziviani. Perfect hashing for data management applications. Technical Report TR002/07, Federal University of Minas Gerais, 2007. Available at http://arxiv.org/pdf/cs/0702159.Google Scholar
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the 7th Intl. World Wide Web Conference, pages 107--117, April 1998. Google ScholarDigital Library
- C.-C. Chang and C.-Y. Lin. A perfect hashing schemes for mining association rules. The Computer Journal, 48(2):168--179, 2005. Google ScholarDigital Library
- C.-C. Chang, C.-Y. Lin, and H. Chou. Perfect hashing schemes for mining traversal patterns. J. of Fundamenta Informaticae, 70(3):185--202, 2006. Google ScholarDigital Library
- Z. Czech, G. Havas, and B. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257--264, 1992. Google ScholarDigital Library
- Z. Czech, G. Havas, and B. Majewski. Fundamental study perfect hashing. Theoretical Computer Science, 182:1--143, 1997. Google ScholarDigital Library
- M. Dietzfelbinger and T. Hagerup. Simple minimal perfect hashing in less space. In Proc. of the 9th European Symposium on Algorithms, pages 109--120. Springer LNCS vol. 2161, 2001. Google ScholarDigital Library
- E. Fox, Q. Chen, and L. Heath. A faster algorithm for constructing minimal perfect hash functions. In Proc. of the 15th Intl. ACM SIGIR Conference on Research and Development in Information Retrieval, pages 266--273, 1992. Google ScholarDigital Library
- E. Fox, L. S. Heath, Q. Chen, and A. Daoud. Practical minimal perfect hash functions for large databases. Communications of the ACM, 35(1):105--121, 1992. Google ScholarDigital Library
- M. L. Fredman, J. Komlós, and E. Szemerédi. On the size of separating systems and families of perfect hashing functions. SIAM J. on Algebraic and Discrete Methods, 5:61--68, 1984.Google ScholarCross Ref
- M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with O(1) worst case access time. J. of the ACM, 31(3):538--544, July 1984. Google ScholarDigital Library
- T. Hagerup and T. Tholey. Efficient minimal perfect hashing in nearly minimal space. In Proc. of the 18th Symposium on Theoretical Aspects of Computer Science, pages 317--326. Springer LNCS vol. 2010, 2001. Google ScholarDigital Library
- R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.Google Scholar
- B. Jenkins. Algorithm alley: Hash functions. Dr. Dobb's J. of Software Tools, 22(9), 1997.Google Scholar
- D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, second edition, 1973.Google Scholar
- P. Larson and G. Graefe. Memory management during run generation in external sorting. In Proc. of the 1998 ACM SIGMOD intl. conference on Management of data, pages 472--483. ACM Press, 1998. Google ScholarDigital Library
- S. Lefebvre and H. Hoppe. Perfect spatial hashing. ACM Transactions on Graphics, 25(3):579--588, 2006. Google ScholarDigital Library
- B. Majewski, N. Wormald, G. Havas, and Z. Czech. A family of perfect hashing methods. The Computer Journal, 39(6):547--554, 1996.Google ScholarCross Ref
- S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing database architecture for the new bottleneck: Memory access. The VLDB journal, 9:231--246, 2000. Google ScholarDigital Library
- K. Mehlhorn. Data Structures and Algorithms 1: Sorting and Searching. Springer-Verlag, 1984.Google Scholar
- A. Pagh, R. Pagh, and S. S. Rao. An optimal bloom filter replacement. In Proc. of the 16th ACM-SIAM symposium on Discrete algorithms, pages 823--829, 2005. Google ScholarDigital Library
- R. Pagh. Hash and displace: Efficient evaluation of minimal perfect hash functions. In Workshop on Algorithms and Data Structures, pages 49--54, 1999. Google ScholarDigital Library
- B. Prabhakar and F. Bonomi. Perfect hashing for network applications. In Proc. of the IEEE International Symposium on Information Theory. IEEE Press, 2006.Google Scholar
- J. Radhakrishnan. Improved bounds for covering complete uniform hypergraphs. Information Processing Letters, 41:203--207, 1992. Google ScholarDigital Library
- J. P. Schmidt and A. Siegel. The spatial complexity of oblivious k-probe hash functions. SIAM J. on Computing, 19(5):775--786, October 1990. Google ScholarDigital Library
- M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005. Google ScholarDigital Library
- M. Thorup. Even strongly universal hashing is pretty fast. In Proc. of the 11th ACM-SIAM symposium on Discrete algorithms, pages 496--497, 2000. Google ScholarDigital Library
- P. Woelfel. Maintaining external memory efficient hash tables. In Proc. of the 10th International Workshop on Randomization and Computation (RANDOM '06), pages 508--519. Springer LNCS vol. 4110, 2006. Google ScholarDigital Library
Index Terms
- External perfect hashing for very large key sets
Recommendations
Distributed perfect hashing for very large key sets
InfoScale '08: Proceedings of the 3rd international conference on Scalable information systemsA perfect hash function (PHF) h: S → [0, m -- 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the ...
Existence of Perfect 3-Deletion-CorrectingCodes
Bours [4] recently showed some constructions for perfect 2 and 3-deletion-correcting codes from combinatorial designs. He settled existence of perfect 2-deletion-correcting codes with words of length 4. However, the existence of perfect 3-deletion-...
Comments