skip to main content
research-article

Reordering rows for better compression: Beyond the lexicographic order

Published:06 September 2012Publication History
Skip Abstract Section

Abstract

Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases.

References

  1. Abadi, D., Madden, S., and Ferreira, M. 2006. Integrating compression and execution in column-oriented database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 671--682. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Abadi, D. J., Madden, S. R., and Hachem, N. 2008. Column-stores vs. row-stores: How different are they really? In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 967--980. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J. F., Ramakrishnan, R., and Sarawagi, S. 1996. On the computation of multidimensional aggregates. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB'96). Morgan Kaufmann, San Francisco, CA, 506--521. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Anantha, M., Bose, B., and AlBdaiwi, B. 2007. Mixed-Radix Gray codes in Lee metric. IEEE Trans. Comput. 56, 10, 1297--1307. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Anh, V. N. and Moffat, A. 2010. Index compression using 64-bit words. Softw. Pract. Exper. 40, 2, 131--147. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Antoshenkov, G. 1995. Byte-aligned bitmap compression. In Proceedings of the Data Compression Conference (DCC'95). IEEE Computer Society, Washington, DC, 476. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Aouiche, K. and Lemire, D. 2007. A comparison of five probabilistic view-size estimation techniques in OLAP. In Proceedings of the ACM 10th International Workshop on Data Warehousing and OLAP (DOLAP '07). ACM, New York, 17--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Applegate, D., Cook, W., and Rohe, A. 2003. Chained Lin-Kernighan for large traveling salesman problems. INFORMS J. Comput. 15, 1, 82--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Bellmore, M. and Nemhauser, G. L. 1968. The traveling salesman problem: A survey. Oper. Res. 16, 3, 538--558.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Bentley, J. 1992. Fast algorithms for geometric traveling salesman problems. INFORMS J. Comput. 4, 4, 387--411.Google ScholarGoogle ScholarCross RefCross Ref
  11. Berman, P. and Karpinski, M. 2006. 8/7-approximation algorithm for (1,2)-TSP. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms. ACM, New York, 641--648. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Bhattacharjee, B., Lim, L., Malkemus, T., Mihaila, G., Ross, K., Lau, S., McArthur, C., Toth, Z., and Sherkat, R. 2009. Efficient index compression in DB2 LUW. Proc. VLDB Endow. 2, 2, 1462--1473. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Binnig, C., Hildenbrand, S., and Färber, F. 2009. Dictionary-Based order-preserving string compression for main memory column stores. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 283--296. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Bruno, N. 2009. Teaching an old elephant new tricks. In Proceedings of the 4th Biennial Conference on Innovative Data Systems (CIDR'09). https://database.cs.wisc.edu/cidr/cidr2009/cidr2009.zip, last checked 2012-07-04.Google ScholarGoogle Scholar
  15. Cai, J. and Paige, R. 1995. Using multiset discrimination to solve language processing problems without hashing. Theor. Comput. Sci. 145, 1-2, 189--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Cesari, G. 1996. Divide and conquer strategies for parallel TSP heuristics. Comput. Oper. Res. 23, 7, 681--694. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Chakrabarti, A., Chazelle, B., Gum, B., and Lvov, A. 1999. A lower bound on the complexity of approximate nearest-neighbor searching on the Hamming cube. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing. ACM, New York, 305--311. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Cho, D.-S. and Hong, B.-H. 2000. Optimal page ordering for region queries in static spatial databases. In Proceedings of the 11th International Conference on Database and Expert System Applications (DEXA'00). Lecture Notes in Computer Science, vol. 1873. Springer, 366--375. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Christofides, N. 1976. Worst-Case analysis of a new heuristic for the travelling salesman problem. Tech. rep. 388, Graduate School of Industrial Administration, Carnegie Mellon University.Google ScholarGoogle Scholar
  20. Clarke, G. and Wright, J. W. 1964. Scheduling of vehicles from a central depot to a number of delivery points. Oper. Res. 12, 4, 568--581.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Croes, G. A. 1958. A method for solving traveling-salesman problems. Oper. Res. 6, 6, 791--812.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Dean, T. and Boddy, M. 1988. An analysis of time-dependent planning. In Proceedings of the 7th Nationalc Conference on Artificial Intelligence (AAAI). 49--54.Google ScholarGoogle Scholar
  23. Ding, S., Attenberg, J., and Suel, T. 2010. Scalable techniques for document identifier assignment in inverted indexes. In Proceedings of the 19th International Conference on World Wide Web (WWW '10). ACM, New York, 311--320. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Eavis, T. and Cueva, D. 2007. A Hilbert space compression architecture for data warehouse environments. In Proceedings of the Conference on Data Warehousing and Knowledge Discovery (DaWaK'07). Lecture Notes in Computer Science, vol. 4654. Springer, 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Ernvall, J., Katajainen, J., and Penttonen, M. 1985. NP-Completeness of the Hamming salesman problem. Bit 25, 1, 289--292.Google ScholarGoogle ScholarCross RefCross Ref
  26. Faloutsos, C. 1986. Multiattribute hashing using Gray codes. SIGMOD Rec. 15, 2, 227--238. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml (checked 2012-07-04).Google ScholarGoogle Scholar
  28. Fusco, F., Stoecklin, M. P., and Vlachos, M. 2010. NET-FLi: On-the-Fly compression, archiving and indexing of streaming network traffic. Proc. VLDB Endow. 3, 1382--1393. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Fusco, F., Vlachos, M., and Stoecklin, M. 2012. Real-Time creation of bitmap indexes on streaming network data. VLDB J. 21, 3, 287--307. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Gabow, H. N. and Tarjan, R. E. 1991. Faster scaling algorithms for general graph matching problems. J. ACM 38, 4, 815--853. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Gharan, S. O., Saberi, A., and Singh, M. 2011. A randomized rounding approach to the traveling salesman problem. In Proceedings of the IEEE 52nd Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, DC, 550--559. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB'99). Morgan Kaufmann, San Francisco, CA, 518--529. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Golomb, S. W. 1966. Run-Length encodings. IEEE Trans. Inf. Theory 12, 3, 399--401. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Gray, J., Sundaresan, P., Englert, S., Baclawski, K., and Weinberger, P. J. 1994. Quickly generating billion-record synthetic databases. SIGMOD Rec. 23, 2, 243--252. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Grigoriadis, M. D. and Kalantari, B. 1988. A new class of heuristic algorithms for weighted perfect matching. J. ACM 35, 4, 769--776. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Hahn, C., Warren, S., and London, J. 2004. Edited synoptic cloud reports from ships and land stations over the globe, 1982--1991. http://cdiac.ornl.gov/ftp/ndp026b/ (checked 2012-07-04).Google ScholarGoogle Scholar
  37. Hamilton, C. H. and Rau-Chaplin, A. 2008. Compact Hilbert indices: Space-Filling curves for domains with unequal side lengths. Inf. Process. Lett. 105, 5, 155--163. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Helsgaun, K. 2000. An effective implementation of the Lin--Kernighan traveling salesman heuristic. Eur. J. Oper. Res. 126, 1, 106--130.Google ScholarGoogle ScholarCross RefCross Ref
  39. Holloway, A. L. and DeWitt, D. J. 2008. Read-Optimized databases, in depth. Proc. VLDB Endow. 1, 1, 502--513. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Houkjær, K., Torp, K., and Wind, R. 2006. Simple and realistic data generation. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06). ACM, New York, 1243--1246. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Indyk, P. and Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing. ACM, New York, 604--613. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Indyk, P., Motwani, R., Raghavan, P., and Vempala, S. 1997. Locality-Preserving hashing in multidimensional spaces. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing. ACM, New York, 618--625. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Johnson, D. S., Krishnan, S., Chhugani, J., Kumar, S., and Venkatasubramanian, S. 2004. Compressing large boolean matrices using reordering techniques. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04). Morgan Kaufmann, San Francisco, CA, 13--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Johnson, D. S. and McGeoch, L. A. 1997. The traveling salesman problem: A case study. In Local Search in Combinatorial Optimization, E. H. L. Aarts and J. K. Lenstra, Eds. John Wiley and Sons, Hoboken, NJ, 215--310. http://akpublic.research.att.com/~dsj/papers/TSPchapter.pdfGoogle ScholarGoogle Scholar
  45. Johnson, D. S. and McGeoch, L. A. 2004. Experimental analysis of heuristics for the STSP. In The Traveling Salesman Problem and Its Variations, G. Gutin and A. P. Punnen, Eds. Springer, 369--443.Google ScholarGoogle Scholar
  46. Kahng, A. and Reda, S. 2004. Match twice and stitch: A new TSP tour construction heuristic. Oper. Res. Lett. 32, 6, 499--509. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Kamel, I. and Faloutsos, C. 1994. Hilbert R-tree: An improved R-tree using fractals. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94). Morgan Kaufmann, San Francisco, CA, 500--509. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Kane, D. M., Nelson, J., and Woodruff, D. P. 2010. An optimal algorithm for the distinct elements problem. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, New York, 41--52. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Karp, R. M. 1977. Probabilistic analysis of partitioning algorithms for the traveling-salesman problem in the plane. Math. Oper. Res. 2, 3, 209--224.Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Knuth, D. E. 1997. Searching and Sorting. The Art of Computer Programming Series, vol. 3. Addison-Wesley, Reading, MA.Google ScholarGoogle Scholar
  51. Knuth, D. E. 2011. Combinatorial Algorithms, Part 1. The Art of Computer Programming Series, vol. 4A. Addison-Wesley, Boston, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Kushilevitz, E., Ostrovsky, R., and Rabani, Y. 1998. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing. ACM, New York, 614--623. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Lam, F. and Newman, A. 2008. Traveling salesman path problems. Math. Program. 113, 1, 39--59. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Lemire, D. and Kaser, O. 2011. Reordering columns for smaller indexes. Inf. Sci. 181, 12, 2550--2570. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Lemire, D., Kaser, O., and Aouiche, K. 2010. Sorting improves word-aligned bitmap indexes. Data Knowl. Eng. 69, 1, 3--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Lemire, D., Kaser, O., and Gutarra, E. 2012. Reordering rows for better compression: Beyond the lexicographical order. Tech. rep. TR-12-001, UNBSJ CSAS. July.Google ScholarGoogle Scholar
  57. Lemke, C., Sattler, K.-U., Faerber, F., and Zeier, A. 2010. Speeding up queries in column stores. In Proceedings of the Conference on Data Warehousing and Knowledge Discovery (DaWaK'10). Lecture Notes in Computer Science, vol.6263. Springer, 117--129. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Lin, S. and Kernighan, B. W. 1973. An effective heuristic algorithm for the traveling-salesman problem. Oper. Res. 21, 2, 498--516.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Liu, D. 2004. A strong lower bound for approximate nearest neighbor searching. Inf. Process. Lett. 92, 1, 23--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Malik, H. H. and Kender, J. R. 2007. Optimizing frequency queries for data mining applications. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM'07). IEEE Computer Society, Washington, DC, 595--600. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Moffat, A. and Stuiver, L. 2000. Binary interpolative coding for effective index compression. Inf. Retr. 3, 1, 25--47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Ng, W. and Ravishankar, C. 1997. Block-Oriented compression techniques for large statistical databases. IEEE Trans. Knowl. Data Eng. 9, 2, 314--328. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Oberhumer, M. F. X. J. 2011. LZO real-time data compression library. http://www.oberhumer.com/opensource/lzo/.Google ScholarGoogle Scholar
  64. Olken, F. and Rotem, D. 1986. Rearranging data to maximize the efficiency of compression. In Proceedings of the 5th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, New York, 78--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. O'Neil, P., O'Neil, E., Chen, X., and Revilak, S. 2009. The star schema benchmark and augmented fact table indexing. In Proceedings of the Conference on Performance Evaluation and Benchmarking. Lecture Notes in Computer Science, vol. 5895. Springer, 237--252. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Pinar, A. and Heath, M. T. 1999. Improving performance of sparse matrix-vector multiplication. In Proceedings of the ACM/IEEE Conference on Supercomputing. ACM, New York, Article No. 30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Pinar, A., Tao, T., and Ferhatosmanoglu, H. 2005. Compressing bitmap indices by data reorganization. In Proceedings of the 21st International Conference on Data Engineering (ICDE'05). IEEE Computer Society, Washington, DC, 310--321. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Platzman, L. and Bartholdi, III, J. 1989. Spacefilling curves and the planar travelling salesman problem. J. ACM 36, 4, 719--737. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Poess, M. and Potapov, D. 2003. Data compression in Oracle. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB'03). Morgan Kaufmann, San Francisco, CA, 937--947. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Pourabbas, E., Shoshani, A., and Wu, K. 2012. Minimizing index size by reordering rows and columns. In Proceedings of the 24th International Conference on Scientific and Statistical Database Management (SSDBM'08). Springer, 467--484. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Programme de recherche en démographie historique. 2009. The 1852 and 1881 historical censuses of Canada. http://www.prdh.umontreal.ca/census/en/main.aspx.Google ScholarGoogle Scholar
  72. Reinelt, G. 1994. The Traveling Salesman: Computational Solutions for TSP Applications. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Richards, D. 1986. Data compression and Gray-code sorting. Inf. Process. Lett. 22, 4, 201--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Rosenkrantz, D., Stearns, R., and Lewis, II, P. 1977. An analysis of several heuristics for the traveling salesman problem. SIAM J. Comput. 6, 3, 563--581.Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. Schaller, M. 1999. Reclustering of high energy physics data. In Proceedings of the 11th International Conference on Scientific and Statistical Database Management (SSDBM'99). IEEE Computer Society, Washington, DC. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Scholer, F., Williams, H., Yiannis, J., and Zobel, J. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 222--229. Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., and Zdonik, S. 2005. C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05). ACM, New York, 553--564. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Trevisan, L. 1997. When Hamming meets Euclid: The approximability of geometric TSP and MST. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing. ACM, New York, 21--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. US Census Bureau. 2002. United States census 2000—summary file 3. http://www.census.gov/census2000/sumfile3.html.Google ScholarGoogle Scholar
  80. Wu, K., Otoo, E. J., and Shoshani, A. 2006. Optimizing bitmap indices with efficient compression. ACM Trans. Database Syst. 31, 1, 1--38. Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5, 530--536. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Reordering rows for better compression: Beyond the lexicographic order

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Database Systems
        ACM Transactions on Database Systems  Volume 37, Issue 3
        August 2012
        191 pages
        ISSN:0362-5915
        EISSN:1557-4644
        DOI:10.1145/2338626
        Issue’s Table of Contents

        Copyright © 2012 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 September 2012
        • Accepted: 1 June 2012
        • Revised: 1 February 2012
        • Received: 1 November 2010
        Published in tods Volume 37, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader