skip to main content
research-article

A General SIMD-Based Approach to Accelerating Compression Algorithms

Published:23 March 2015Publication History
Skip Abstract Section

Abstract

Compression algorithms are important for data-oriented tasks, especially in the era of “Big Data.” Modern processors equipped with powerful SIMD instruction sets provide us with an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset, and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art nonvectorized algorithms with respect to decoding speeds.

References

  1. V. N. Anh and A. Moffat. 2005. Inverted index compression using word-aligned binary codes. Information Retrieval 8, 1, 151--166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. V. N. Anh and A. Moffat. 2006. Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering 18, 6, 857--861. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. V. N. Anh and A. Moffat. 2010. Index compression using 64-bit words. Software: Practice and Experience 40, 2, 131--147. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Z. Broder, D. Carmel, M. Herscovici, M. Soffer, and J. Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 426--434. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Büttcher, C. Clarke, and G. V. Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Chatterjee, L. R. Bachega, P. Bergner, A. D. Kenneth, J. A. Gunnels, M. Gupta, F. G. Gustavson, C. A. Lapkowski, G. K. Liu, M. P. Mendell, R. D. Nair, C. D. Wait, C. Ward, and P. Wu. 2005. Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L. IBM Journal of Research and Development 49, 2--3, 377--392. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Dean. 2009. Challenges in building large-scale information retrieval systems: Invited talk. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Ding and T. Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 993--1002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. R. Delbru, S. Campinas, and G. Tummarello. 2012. Searching web data: An entity retrieval and high-performance indexing model. Web Semantics: Science, Services and Agents on the World Wide Web 10, 33--58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Elias. 1975. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21, 2, 194--203. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Intel Corporation. 2010. Intel 64 and IA-32 Architectures Software Developers Manual (Version 37). Intel Corporation, Santa Clara, CA.Google ScholarGoogle Scholar
  12. D. Inkster, M. Zukowski, and P. Boncz. 2011. Integration of VectorWise with Ingres. ACM SIGMOD Record 40, 3, 45--53. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Jonassen and S. E. Bratsberg. 2011. Efficient compressed inverted index skipping for disjunctive text-queries. In Proceedings of the 33rd European Conference on Advances in Information Retrieval. Springer, Berlin, 530--542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International World Wide Web Conference. ACM, 591--600. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. K. Liu, X. Qin, X. Yan, and L. Quan. 2006. A SIMD video signal processor with efficient data organization. In Proceedings of IEEE Asian Solid-State Circuits Conference, 115--118.Google ScholarGoogle Scholar
  16. C. Lemke, K. U. Sattler, F. Faerber, and A. Zeier. 2010. Speeding up queries in column stores: A case for compression. In Proceedings of the 12th International Conference on Data Warehousing and Knowledge Discovery, 117--129. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. C. Lomont. 2011. Introduction to Intel advanced vector extensions. In Proceedings of the 2nd Annual ASCI Conference.Google ScholarGoogle Scholar
  18. D. Lemire and L. Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. W. Ma and C. Yang. 2002. Using Intel streaming SIMD extensions for 3D geometry processing. In Proceedings of IEEE Pacific Rim Conference on Multimedia. 1080--1087. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. D. Manning, P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. G. Navarro, E. S. De Moura, M. Neubert, N. Ziviani, and R. B. Yates. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. Rice and J. Plaunt. 1971. Adaptive variable-length coding for efficient compression of spacecraft television data. IEEE Transactions on Communication Technology 19, 6, 889--897.Google ScholarGoogle ScholarCross RefCross Ref
  23. S. E. Robertson, S. Walker, M. Beaulieu, and P. Willett. 1999. Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track. NIST Special Publication SP 253--264.Google ScholarGoogle Scholar
  24. V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, and L. Zhang. 2013. DB2 with BLU acceleration: So much more than just a column store. Proceedings of the VLDB Endowment 6, 11, 1080--1091. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 222--229. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. B. Schlegel, R. Gemulla, and W. Lehner. 2010. Fast integer compression using SIMD instructions. In Proceedings of the 6th International Workshop on Data Management on New Hardware. ACM, 34--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. F. Silvestri and Venturini R. VSEncoding. 2010. Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 1219--1228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and R. S. Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 317--326. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. D. Shan, S. Ding, J. He, H. Yan, and X. Li. 2012. Optimized top-k processing with global page scores on block-max indexes. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining. ACM, 423--432. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. 2009. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. Proceedings of the VLDB Endowment 2, 1, 385--394. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. T. Willhalm, I. Oukid, I. Müller, and F. Faerber. 2013. Vectorizing database column scans with complex predicates. Accelerating Data Management Systems Using Modern Processor and Storage Architectures, 1--12.Google ScholarGoogle Scholar
  32. I. H. Witten, A. Moffat, and T. C. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Walder, M. Krátký, R. Bača, J. Platoš, and V. Snášel. 2012. Fast decoding algorithms for variable-lengths codes. Information Sciences 183, 1, 66--91. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. H. Yan, S. Ding, and T. Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International World Wide Web Conference. ACM, 401--410. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. Zukowski, S. Heman, N. Nes, and Boncz, P. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering. IEEE, 59--71. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. J. Zhang, X. Long, and T. Suel. 2008. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th International World Wide Web Conference. ACM, 387--396. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A General SIMD-Based Approach to Accelerating Compression Algorithms

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM Transactions on Information Systems
            ACM Transactions on Information Systems  Volume 33, Issue 3
            March 2015
            184 pages
            ISSN:1046-8188
            EISSN:1558-2868
            DOI:10.1145/2737814
            Issue’s Table of Contents

            Copyright © 2015 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 23 March 2015
            • Accepted: 2 January 2015
            • Revised: 1 January 2015
            • Received: 1 July 2014
            Published in tois Volume 33, Issue 3

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article
            • Research
            • Refereed

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader