Abstract
Compression algorithms are important for data-oriented tasks, especially in the era of “Big Data.” Modern processors equipped with powerful SIMD instruction sets provide us with an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset, and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art nonvectorized algorithms with respect to decoding speeds.
- V. N. Anh and A. Moffat. 2005. Inverted index compression using word-aligned binary codes. Information Retrieval 8, 1, 151--166. Google ScholarDigital Library
- V. N. Anh and A. Moffat. 2006. Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering 18, 6, 857--861. Google ScholarDigital Library
- V. N. Anh and A. Moffat. 2010. Index compression using 64-bit words. Software: Practice and Experience 40, 2, 131--147. Google ScholarDigital Library
- A. Z. Broder, D. Carmel, M. Herscovici, M. Soffer, and J. Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 426--434. Google ScholarDigital Library
- S. Büttcher, C. Clarke, and G. V. Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. Google ScholarDigital Library
- S. Chatterjee, L. R. Bachega, P. Bergner, A. D. Kenneth, J. A. Gunnels, M. Gupta, F. G. Gustavson, C. A. Lapkowski, G. K. Liu, M. P. Mendell, R. D. Nair, C. D. Wait, C. Ward, and P. Wu. 2005. Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L. IBM Journal of Research and Development 49, 2--3, 377--392. Google ScholarDigital Library
- J. Dean. 2009. Challenges in building large-scale information retrieval systems: Invited talk. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining. ACM. Google ScholarDigital Library
- S. Ding and T. Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 993--1002. Google ScholarDigital Library
- R. Delbru, S. Campinas, and G. Tummarello. 2012. Searching web data: An entity retrieval and high-performance indexing model. Web Semantics: Science, Services and Agents on the World Wide Web 10, 33--58. Google ScholarDigital Library
- P. Elias. 1975. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21, 2, 194--203. Google ScholarDigital Library
- Intel Corporation. 2010. Intel 64 and IA-32 Architectures Software Developers Manual (Version 37). Intel Corporation, Santa Clara, CA.Google Scholar
- D. Inkster, M. Zukowski, and P. Boncz. 2011. Integration of VectorWise with Ingres. ACM SIGMOD Record 40, 3, 45--53. Google ScholarDigital Library
- S. Jonassen and S. E. Bratsberg. 2011. Efficient compressed inverted index skipping for disjunctive text-queries. In Proceedings of the 33rd European Conference on Advances in Information Retrieval. Springer, Berlin, 530--542. Google ScholarDigital Library
- H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International World Wide Web Conference. ACM, 591--600. Google ScholarDigital Library
- K. Liu, X. Qin, X. Yan, and L. Quan. 2006. A SIMD video signal processor with efficient data organization. In Proceedings of IEEE Asian Solid-State Circuits Conference, 115--118.Google Scholar
- C. Lemke, K. U. Sattler, F. Faerber, and A. Zeier. 2010. Speeding up queries in column stores: A case for compression. In Proceedings of the 12th International Conference on Data Warehousing and Knowledge Discovery, 117--129. Google ScholarDigital Library
- C. Lomont. 2011. Introduction to Intel advanced vector extensions. In Proceedings of the 2nd Annual ASCI Conference.Google Scholar
- D. Lemire and L. Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1.Google ScholarDigital Library
- W. Ma and C. Yang. 2002. Using Intel streaming SIMD extensions for 3D geometry processing. In Proceedings of IEEE Pacific Rim Conference on Multimedia. 1080--1087. Google ScholarDigital Library
- C. D. Manning, P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarDigital Library
- G. Navarro, E. S. De Moura, M. Neubert, N. Ziviani, and R. B. Yates. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarDigital Library
- R. Rice and J. Plaunt. 1971. Adaptive variable-length coding for efficient compression of spacecraft television data. IEEE Transactions on Communication Technology 19, 6, 889--897.Google ScholarCross Ref
- S. E. Robertson, S. Walker, M. Beaulieu, and P. Willett. 1999. Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track. NIST Special Publication SP 253--264.Google Scholar
- V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, and L. Zhang. 2013. DB2 with BLU acceleration: So much more than just a column store. Proceedings of the VLDB Endowment 6, 11, 1080--1091. Google ScholarDigital Library
- F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 222--229. Google ScholarDigital Library
- B. Schlegel, R. Gemulla, and W. Lehner. 2010. Fast integer compression using SIMD instructions. In Proceedings of the 6th International Workshop on Data Management on New Hardware. ACM, 34--40. Google ScholarDigital Library
- F. Silvestri and Venturini R. VSEncoding. 2010. Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 1219--1228. Google ScholarDigital Library
- A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and R. S. Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 317--326. Google ScholarDigital Library
- D. Shan, S. Ding, J. He, H. Yan, and X. Li. 2012. Optimized top-k processing with global page scores on block-max indexes. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining. ACM, 423--432. Google ScholarDigital Library
- T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. 2009. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. Proceedings of the VLDB Endowment 2, 1, 385--394. Google ScholarDigital Library
- T. Willhalm, I. Oukid, I. Müller, and F. Faerber. 2013. Vectorizing database column scans with complex predicates. Accelerating Data Management Systems Using Modern Processor and Storage Architectures, 1--12.Google Scholar
- I. H. Witten, A. Moffat, and T. C. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. Google ScholarDigital Library
- J. Walder, M. Krátký, R. Bača, J. Platoš, and V. Snášel. 2012. Fast decoding algorithms for variable-lengths codes. Information Sciences 183, 1, 66--91. Google ScholarDigital Library
- H. Yan, S. Ding, and T. Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International World Wide Web Conference. ACM, 401--410. Google ScholarDigital Library
- M. Zukowski, S. Heman, N. Nes, and Boncz, P. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering. IEEE, 59--71. Google ScholarDigital Library
- J. Zhang, X. Long, and T. Suel. 2008. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th International World Wide Web Conference. ACM, 387--396. Google ScholarDigital Library
Index Terms
- A General SIMD-Based Approach to Accelerating Compression Algorithms
Recommendations
SIMD compression and the intersection of sorted integers
Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single-instruction, multiple data SIMD instructions available in common processors to boost the speed of integer ...
Comparison between text compression algorithms in biological sequences
AbstractInverted indexes are mainstream in Information Retrieval systems and many compression techniques have been proposed. The purpose of this paper is to explore the compression efficiency on a two-level inverted index tailored for n-gram ...
Optimizing partitioning strategies for faster inverted index compression
The inverted index is a key component for search engines to manage billions of documents and quickly respond to users' queries.Whereas substantial effort has been devoted to reducing space occupancy and decoding speed, the encoding speed when ...
Comments