skip to main content
10.1145/1321440.1321532acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

External perfect hashing for very large key sets

Published:06 November 2007Publication History

ABSTRACT

We present a simple and efficient external perfect hashing scheme (referred to as EPH algorithm) for very large static key sets. We use a number of techniques from the literature to obtain a novel scheme that is theoretically well-understood and at the same time achieves an order-of-magnitude increase in the size of the problem to be solved compared to previous "practical" methods. We demonstrate the scalability of our algorithm by constructing minimum perfect hash functions for a set of 1.024 billion URLs from the World Wide Web of average length 64 characters in approximately 62 minutes, using a commodity PC. Our scheme produces minimal perfect hash functions using approximately 3.8 bits per key. For perfect hash functions in the range {0,...,2n - 1} the space usage drops to approximately 2.7 bits per key. The main contribution is the first algorithm that has experimentally proven practicality for sets in the order of billions of keys and has time and space usage carefully analyzed without unrealistic assumptions.

References

  1. N. Alon, M. Dietzfelbinger, P. B. Miltersen, E. Petrank, and G. Tardos. Linear hash functions. J. of the ACM, 46(5):667--683, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. N. Alon and M. Naor. Derandomization, witnesses for Boolean matrix multiplication and construction of perfect hash functions. Algorithmica, 16(4-5):434--449, 1996.Google ScholarGoogle ScholarCross RefCross Ref
  3. P. Boldi and S. Vigna. The webgraph framework i: Compression techniques. In Proc. of the 13th Intl. World Wide Web Conference, pages 595--602, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. F. Botelho, Y. Kohayakawa, and N. Ziviani. A practical minimal perfect hashing method. In Proc. of the 4th Intl. Workshop on Efficient and Experimental Algorithms, pages 488--500. Springer LNCS, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. F. Botelho, R. Pagh, and N. Ziviani. Simple and space-efficient minimal perfect hash functions. In Proc. of the 10th Intl. Workshop on Data Structures and Algorithms, pages 139--150. Springer LNCS, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. F. C. Botelho, R. Pagh, and N. Ziviani. Perfect hashing for data management applications. Technical Report TR002/07, Federal University of Minas Gerais, 2007. Available at http://arxiv.org/pdf/cs/0702159.Google ScholarGoogle Scholar
  7. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the 7th Intl. World Wide Web Conference, pages 107--117, April 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C.-C. Chang and C.-Y. Lin. A perfect hashing schemes for mining association rules. The Computer Journal, 48(2):168--179, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. C.-C. Chang, C.-Y. Lin, and H. Chou. Perfect hashing schemes for mining traversal patterns. J. of Fundamenta Informaticae, 70(3):185--202, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Z. Czech, G. Havas, and B. Majewski. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, 43(5):257--264, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Z. Czech, G. Havas, and B. Majewski. Fundamental study perfect hashing. Theoretical Computer Science, 182:1--143, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Dietzfelbinger and T. Hagerup. Simple minimal perfect hashing in less space. In Proc. of the 9th European Symposium on Algorithms, pages 109--120. Springer LNCS vol. 2161, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. E. Fox, Q. Chen, and L. Heath. A faster algorithm for constructing minimal perfect hash functions. In Proc. of the 15th Intl. ACM SIGIR Conference on Research and Development in Information Retrieval, pages 266--273, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. E. Fox, L. S. Heath, Q. Chen, and A. Daoud. Practical minimal perfect hash functions for large databases. Communications of the ACM, 35(1):105--121, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. L. Fredman, J. Komlós, and E. Szemerédi. On the size of separating systems and families of perfect hashing functions. SIAM J. on Algebraic and Discrete Methods, 5:61--68, 1984.Google ScholarGoogle ScholarCross RefCross Ref
  16. M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with O(1) worst case access time. J. of the ACM, 31(3):538--544, July 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. T. Hagerup and T. Tholey. Efficient minimal perfect hashing in nearly minimal space. In Proc. of the 18th Symposium on Theoretical Aspects of Computer Science, pages 317--326. Springer LNCS vol. 2010, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. R. Jain. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. John Wiley, first edition, 1991.Google ScholarGoogle Scholar
  19. B. Jenkins. Algorithm alley: Hash functions. Dr. Dobb's J. of Software Tools, 22(9), 1997.Google ScholarGoogle Scholar
  20. D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison-Wesley, second edition, 1973.Google ScholarGoogle Scholar
  21. P. Larson and G. Graefe. Memory management during run generation in external sorting. In Proc. of the 1998 ACM SIGMOD intl. conference on Management of data, pages 472--483. ACM Press, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. S. Lefebvre and H. Hoppe. Perfect spatial hashing. ACM Transactions on Graphics, 25(3):579--588, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. B. Majewski, N. Wormald, G. Havas, and Z. Czech. A family of perfect hashing methods. The Computer Journal, 39(6):547--554, 1996.Google ScholarGoogle ScholarCross RefCross Ref
  24. S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing database architecture for the new bottleneck: Memory access. The VLDB journal, 9:231--246, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. K. Mehlhorn. Data Structures and Algorithms 1: Sorting and Searching. Springer-Verlag, 1984.Google ScholarGoogle Scholar
  26. A. Pagh, R. Pagh, and S. S. Rao. An optimal bloom filter replacement. In Proc. of the 16th ACM-SIAM symposium on Discrete algorithms, pages 823--829, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. R. Pagh. Hash and displace: Efficient evaluation of minimal perfect hash functions. In Workshop on Algorithms and Data Structures, pages 49--54, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. B. Prabhakar and F. Bonomi. Perfect hashing for network applications. In Proc. of the IEEE International Symposium on Information Theory. IEEE Press, 2006.Google ScholarGoogle Scholar
  29. J. Radhakrishnan. Improved bounds for covering complete uniform hypergraphs. Information Processing Letters, 41:203--207, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. J. P. Schmidt and A. Siegel. The spatial complexity of oblivious k-probe hash functions. SIAM J. on Computing, 19(5):775--786, October 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. Seltzer. Beyond relational databases. ACM Queue, 3(3), April 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Thorup. Even strongly universal hashing is pretty fast. In Proc. of the 11th ACM-SIAM symposium on Discrete algorithms, pages 496--497, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. P. Woelfel. Maintaining external memory efficient hash tables. In Proc. of the 10th International Workshop on Randomization and Computation (RANDOM '06), pages 508--519. Springer LNCS vol. 4110, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. External perfect hashing for very large key sets

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management
        November 2007
        1048 pages
        ISBN:9781595938039
        DOI:10.1145/1321440

        Copyright © 2007 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 November 2007

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate1,861of8,427submissions,22%

        Upcoming Conference

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader