Abstract
Sorting database tables before compressing them improves the compression rate. Can we do better than the lexicographical order? For minimizing the number of runs in a run-length encoding compression scheme, the best approaches to row-ordering are derived from traveling salesman heuristics, although there is a significant trade-off between running time and compression. A new heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades off compression for a major running-time speedup, is a good option for very large tables. However, for some compression schemes, it is more important to generate long runs rather than few runs. For this case, another novel heuristic, Vortex, is promising. We find that we can improve run-length encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%: these gains are on top of the gains due to lexicographically sorting the table. We prove that the new row reordering is optimal (within 10%) at minimizing the runs of identical values within columns, in a few cases.
- Abadi, D., Madden, S., and Ferreira, M. 2006. Integrating compression and execution in column-oriented database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 671--682. Google ScholarDigital Library
- Abadi, D. J., Madden, S. R., and Hachem, N. 2008. Column-stores vs. row-stores: How different are they really? In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 967--980. Google ScholarDigital Library
- Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J. F., Ramakrishnan, R., and Sarawagi, S. 1996. On the computation of multidimensional aggregates. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB'96). Morgan Kaufmann, San Francisco, CA, 506--521. Google ScholarDigital Library
- Anantha, M., Bose, B., and AlBdaiwi, B. 2007. Mixed-Radix Gray codes in Lee metric. IEEE Trans. Comput. 56, 10, 1297--1307. Google ScholarDigital Library
- Anh, V. N. and Moffat, A. 2010. Index compression using 64-bit words. Softw. Pract. Exper. 40, 2, 131--147. Google ScholarDigital Library
- Antoshenkov, G. 1995. Byte-aligned bitmap compression. In Proceedings of the Data Compression Conference (DCC'95). IEEE Computer Society, Washington, DC, 476. Google ScholarDigital Library
- Aouiche, K. and Lemire, D. 2007. A comparison of five probabilistic view-size estimation techniques in OLAP. In Proceedings of the ACM 10th International Workshop on Data Warehousing and OLAP (DOLAP '07). ACM, New York, 17--24. Google ScholarDigital Library
- Applegate, D., Cook, W., and Rohe, A. 2003. Chained Lin-Kernighan for large traveling salesman problems. INFORMS J. Comput. 15, 1, 82--92. Google ScholarDigital Library
- Bellmore, M. and Nemhauser, G. L. 1968. The traveling salesman problem: A survey. Oper. Res. 16, 3, 538--558.Google ScholarDigital Library
- Bentley, J. 1992. Fast algorithms for geometric traveling salesman problems. INFORMS J. Comput. 4, 4, 387--411.Google ScholarCross Ref
- Berman, P. and Karpinski, M. 2006. 8/7-approximation algorithm for (1,2)-TSP. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms. ACM, New York, 641--648. Google ScholarDigital Library
- Bhattacharjee, B., Lim, L., Malkemus, T., Mihaila, G., Ross, K., Lau, S., McArthur, C., Toth, Z., and Sherkat, R. 2009. Efficient index compression in DB2 LUW. Proc. VLDB Endow. 2, 2, 1462--1473. Google ScholarDigital Library
- Binnig, C., Hildenbrand, S., and Färber, F. 2009. Dictionary-Based order-preserving string compression for main memory column stores. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, New York, 283--296. Google ScholarDigital Library
- Bruno, N. 2009. Teaching an old elephant new tricks. In Proceedings of the 4th Biennial Conference on Innovative Data Systems (CIDR'09). https://database.cs.wisc.edu/cidr/cidr2009/cidr2009.zip, last checked 2012-07-04.Google Scholar
- Cai, J. and Paige, R. 1995. Using multiset discrimination to solve language processing problems without hashing. Theor. Comput. Sci. 145, 1-2, 189--228. Google ScholarDigital Library
- Cesari, G. 1996. Divide and conquer strategies for parallel TSP heuristics. Comput. Oper. Res. 23, 7, 681--694. Google ScholarDigital Library
- Chakrabarti, A., Chazelle, B., Gum, B., and Lvov, A. 1999. A lower bound on the complexity of approximate nearest-neighbor searching on the Hamming cube. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing. ACM, New York, 305--311. Google ScholarDigital Library
- Cho, D.-S. and Hong, B.-H. 2000. Optimal page ordering for region queries in static spatial databases. In Proceedings of the 11th International Conference on Database and Expert System Applications (DEXA'00). Lecture Notes in Computer Science, vol. 1873. Springer, 366--375. Google ScholarDigital Library
- Christofides, N. 1976. Worst-Case analysis of a new heuristic for the travelling salesman problem. Tech. rep. 388, Graduate School of Industrial Administration, Carnegie Mellon University.Google Scholar
- Clarke, G. and Wright, J. W. 1964. Scheduling of vehicles from a central depot to a number of delivery points. Oper. Res. 12, 4, 568--581.Google ScholarDigital Library
- Croes, G. A. 1958. A method for solving traveling-salesman problems. Oper. Res. 6, 6, 791--812.Google ScholarDigital Library
- Dean, T. and Boddy, M. 1988. An analysis of time-dependent planning. In Proceedings of the 7th Nationalc Conference on Artificial Intelligence (AAAI). 49--54.Google Scholar
- Ding, S., Attenberg, J., and Suel, T. 2010. Scalable techniques for document identifier assignment in inverted indexes. In Proceedings of the 19th International Conference on World Wide Web (WWW '10). ACM, New York, 311--320. Google ScholarDigital Library
- Eavis, T. and Cueva, D. 2007. A Hilbert space compression architecture for data warehouse environments. In Proceedings of the Conference on Data Warehousing and Knowledge Discovery (DaWaK'07). Lecture Notes in Computer Science, vol. 4654. Springer, 1--12. Google ScholarDigital Library
- Ernvall, J., Katajainen, J., and Penttonen, M. 1985. NP-Completeness of the Hamming salesman problem. Bit 25, 1, 289--292.Google ScholarCross Ref
- Faloutsos, C. 1986. Multiattribute hashing using Gray codes. SIGMOD Rec. 15, 2, 227--238. Google ScholarDigital Library
- Frank, A. and Asuncion, A. 2010. UCI machine learning repository. http://archive.ics.uci.edu/ml (checked 2012-07-04).Google Scholar
- Fusco, F., Stoecklin, M. P., and Vlachos, M. 2010. NET-FLi: On-the-Fly compression, archiving and indexing of streaming network traffic. Proc. VLDB Endow. 3, 1382--1393. Google ScholarDigital Library
- Fusco, F., Vlachos, M., and Stoecklin, M. 2012. Real-Time creation of bitmap indexes on streaming network data. VLDB J. 21, 3, 287--307. Google ScholarDigital Library
- Gabow, H. N. and Tarjan, R. E. 1991. Faster scaling algorithms for general graph matching problems. J. ACM 38, 4, 815--853. Google ScholarDigital Library
- Gharan, S. O., Saberi, A., and Singh, M. 2011. A randomized rounding approach to the traveling salesman problem. In Proceedings of the IEEE 52nd Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, DC, 550--559. Google ScholarDigital Library
- Gionis, A., Indyk, P., and Motwani, R. 1999. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB'99). Morgan Kaufmann, San Francisco, CA, 518--529. Google ScholarDigital Library
- Golomb, S. W. 1966. Run-Length encodings. IEEE Trans. Inf. Theory 12, 3, 399--401. Google ScholarDigital Library
- Gray, J., Sundaresan, P., Englert, S., Baclawski, K., and Weinberger, P. J. 1994. Quickly generating billion-record synthetic databases. SIGMOD Rec. 23, 2, 243--252. Google ScholarDigital Library
- Grigoriadis, M. D. and Kalantari, B. 1988. A new class of heuristic algorithms for weighted perfect matching. J. ACM 35, 4, 769--776. Google ScholarDigital Library
- Hahn, C., Warren, S., and London, J. 2004. Edited synoptic cloud reports from ships and land stations over the globe, 1982--1991. http://cdiac.ornl.gov/ftp/ndp026b/ (checked 2012-07-04).Google Scholar
- Hamilton, C. H. and Rau-Chaplin, A. 2008. Compact Hilbert indices: Space-Filling curves for domains with unequal side lengths. Inf. Process. Lett. 105, 5, 155--163. Google ScholarDigital Library
- Helsgaun, K. 2000. An effective implementation of the Lin--Kernighan traveling salesman heuristic. Eur. J. Oper. Res. 126, 1, 106--130.Google ScholarCross Ref
- Holloway, A. L. and DeWitt, D. J. 2008. Read-Optimized databases, in depth. Proc. VLDB Endow. 1, 1, 502--513. Google ScholarDigital Library
- Houkjær, K., Torp, K., and Wind, R. 2006. Simple and realistic data generation. In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB'06). ACM, New York, 1243--1246. Google ScholarDigital Library
- Indyk, P. and Motwani, R. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing. ACM, New York, 604--613. Google ScholarDigital Library
- Indyk, P., Motwani, R., Raghavan, P., and Vempala, S. 1997. Locality-Preserving hashing in multidimensional spaces. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing. ACM, New York, 618--625. Google ScholarDigital Library
- Johnson, D. S., Krishnan, S., Chhugani, J., Kumar, S., and Venkatasubramanian, S. 2004. Compressing large boolean matrices using reordering techniques. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB'04). Morgan Kaufmann, San Francisco, CA, 13--23. Google ScholarDigital Library
- Johnson, D. S. and McGeoch, L. A. 1997. The traveling salesman problem: A case study. In Local Search in Combinatorial Optimization, E. H. L. Aarts and J. K. Lenstra, Eds. John Wiley and Sons, Hoboken, NJ, 215--310. http://akpublic.research.att.com/~dsj/papers/TSPchapter.pdfGoogle Scholar
- Johnson, D. S. and McGeoch, L. A. 2004. Experimental analysis of heuristics for the STSP. In The Traveling Salesman Problem and Its Variations, G. Gutin and A. P. Punnen, Eds. Springer, 369--443.Google Scholar
- Kahng, A. and Reda, S. 2004. Match twice and stitch: A new TSP tour construction heuristic. Oper. Res. Lett. 32, 6, 499--509. Google ScholarDigital Library
- Kamel, I. and Faloutsos, C. 1994. Hilbert R-tree: An improved R-tree using fractals. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94). Morgan Kaufmann, San Francisco, CA, 500--509. Google ScholarDigital Library
- Kane, D. M., Nelson, J., and Woodruff, D. P. 2010. An optimal algorithm for the distinct elements problem. In Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, New York, 41--52. Google ScholarDigital Library
- Karp, R. M. 1977. Probabilistic analysis of partitioning algorithms for the traveling-salesman problem in the plane. Math. Oper. Res. 2, 3, 209--224.Google ScholarDigital Library
- Knuth, D. E. 1997. Searching and Sorting. The Art of Computer Programming Series, vol. 3. Addison-Wesley, Reading, MA.Google Scholar
- Knuth, D. E. 2011. Combinatorial Algorithms, Part 1. The Art of Computer Programming Series, vol. 4A. Addison-Wesley, Boston, MA. Google ScholarDigital Library
- Kushilevitz, E., Ostrovsky, R., and Rabani, Y. 1998. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing. ACM, New York, 614--623. Google ScholarDigital Library
- Lam, F. and Newman, A. 2008. Traveling salesman path problems. Math. Program. 113, 1, 39--59. Google ScholarDigital Library
- Lemire, D. and Kaser, O. 2011. Reordering columns for smaller indexes. Inf. Sci. 181, 12, 2550--2570. Google ScholarDigital Library
- Lemire, D., Kaser, O., and Aouiche, K. 2010. Sorting improves word-aligned bitmap indexes. Data Knowl. Eng. 69, 1, 3--28. Google ScholarDigital Library
- Lemire, D., Kaser, O., and Gutarra, E. 2012. Reordering rows for better compression: Beyond the lexicographical order. Tech. rep. TR-12-001, UNBSJ CSAS. July.Google Scholar
- Lemke, C., Sattler, K.-U., Faerber, F., and Zeier, A. 2010. Speeding up queries in column stores. In Proceedings of the Conference on Data Warehousing and Knowledge Discovery (DaWaK'10). Lecture Notes in Computer Science, vol.6263. Springer, 117--129. Google ScholarDigital Library
- Lin, S. and Kernighan, B. W. 1973. An effective heuristic algorithm for the traveling-salesman problem. Oper. Res. 21, 2, 498--516.Google ScholarDigital Library
- Liu, D. 2004. A strong lower bound for approximate nearest neighbor searching. Inf. Process. Lett. 92, 1, 23--29. Google ScholarDigital Library
- Malik, H. H. and Kender, J. R. 2007. Optimizing frequency queries for data mining applications. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM'07). IEEE Computer Society, Washington, DC, 595--600. Google ScholarDigital Library
- Moffat, A. and Stuiver, L. 2000. Binary interpolative coding for effective index compression. Inf. Retr. 3, 1, 25--47. Google ScholarDigital Library
- Ng, W. and Ravishankar, C. 1997. Block-Oriented compression techniques for large statistical databases. IEEE Trans. Knowl. Data Eng. 9, 2, 314--328. Google ScholarDigital Library
- Oberhumer, M. F. X. J. 2011. LZO real-time data compression library. http://www.oberhumer.com/opensource/lzo/.Google Scholar
- Olken, F. and Rotem, D. 1986. Rearranging data to maximize the efficiency of compression. In Proceedings of the 5th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM, New York, 78--90. Google ScholarDigital Library
- O'Neil, P., O'Neil, E., Chen, X., and Revilak, S. 2009. The star schema benchmark and augmented fact table indexing. In Proceedings of the Conference on Performance Evaluation and Benchmarking. Lecture Notes in Computer Science, vol. 5895. Springer, 237--252. Google ScholarDigital Library
- Pinar, A. and Heath, M. T. 1999. Improving performance of sparse matrix-vector multiplication. In Proceedings of the ACM/IEEE Conference on Supercomputing. ACM, New York, Article No. 30. Google ScholarDigital Library
- Pinar, A., Tao, T., and Ferhatosmanoglu, H. 2005. Compressing bitmap indices by data reorganization. In Proceedings of the 21st International Conference on Data Engineering (ICDE'05). IEEE Computer Society, Washington, DC, 310--321. Google ScholarDigital Library
- Platzman, L. and Bartholdi, III, J. 1989. Spacefilling curves and the planar travelling salesman problem. J. ACM 36, 4, 719--737. Google ScholarDigital Library
- Poess, M. and Potapov, D. 2003. Data compression in Oracle. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB'03). Morgan Kaufmann, San Francisco, CA, 937--947. Google ScholarDigital Library
- Pourabbas, E., Shoshani, A., and Wu, K. 2012. Minimizing index size by reordering rows and columns. In Proceedings of the 24th International Conference on Scientific and Statistical Database Management (SSDBM'08). Springer, 467--484. Google ScholarDigital Library
- Programme de recherche en démographie historique. 2009. The 1852 and 1881 historical censuses of Canada. http://www.prdh.umontreal.ca/census/en/main.aspx.Google Scholar
- Reinelt, G. 1994. The Traveling Salesman: Computational Solutions for TSP Applications. Springer. Google ScholarDigital Library
- Richards, D. 1986. Data compression and Gray-code sorting. Inf. Process. Lett. 22, 4, 201--205. Google ScholarDigital Library
- Rosenkrantz, D., Stearns, R., and Lewis, II, P. 1977. An analysis of several heuristics for the traveling salesman problem. SIAM J. Comput. 6, 3, 563--581.Google ScholarDigital Library
- Schaller, M. 1999. Reclustering of high energy physics data. In Proceedings of the 11th International Conference on Scientific and Statistical Database Management (SSDBM'99). IEEE Computer Society, Washington, DC. Google ScholarDigital Library
- Scholer, F., Williams, H., Yiannis, J., and Zobel, J. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, 222--229. Google ScholarDigital Library
- Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O'Neil, E., O'Neil, P., Rasin, A., Tran, N., and Zdonik, S. 2005. C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05). ACM, New York, 553--564. Google ScholarDigital Library
- Trevisan, L. 1997. When Hamming meets Euclid: The approximability of geometric TSP and MST. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing. ACM, New York, 21--29. Google ScholarDigital Library
- US Census Bureau. 2002. United States census 2000—summary file 3. http://www.census.gov/census2000/sumfile3.html.Google Scholar
- Wu, K., Otoo, E. J., and Shoshani, A. 2006. Optimizing bitmap indices with efficient compression. ACM Trans. Database Syst. 31, 1, 1--38. Google ScholarDigital Library
- Ziv, J. and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5, 530--536. Google ScholarDigital Library
Index Terms
- Reordering rows for better compression: Beyond the lexicographic order
Recommendations
Reordering columns for smaller indexes
Column-oriented indexes-such as projection or bitmap indexes-are compressed by run-length encoding to reduce storage and increase speed. Sorting the tables improves compression. On realistic data sets, permuting the columns in the right order before ...
Lossless compression of VLSI layout image data
We present a novel lossless compression algorithm called Context Copy Combinatorial Code (C4), which integrates the advantages of two very disparate compression techniques: context-based modeling and Lempel-Ziv (LZ) style copying. While the algorithm ...
Post BWT stages of the Burrows–Wheeler compression algorithm
The lossless Burrows–Wheeler compression algorithm has received considerable attention over recent years for both its simplicity and effectiveness. It is based on a permutation of the input sequence—the Burrows–Wheeler transformation (BWT)—which groups ...
Comments