research-article

A General SIMD-Based Approach to Accelerating Compression Algorithms

Authors:
Wayne Xin Zhao

Renmin University of China, Beijing, China

Renmin University of China, Beijing, China
View Profile

,
Xudong Zhang

Peking University, Beijing, China

Peking University, Beijing, China
View Profile

,
Daniel Lemire

Université du Québec, Canada

Université du Québec, Canada
View Profile

,
Dongdong Shan

Alibaba Group

Alibaba Group
View Profile

,
Jian-Yun Nie

Université de Montréal, Canada

Université de Montréal, Canada
View Profile

,
Hongfei Yan

Peking University, Beijing, China

Peking University, Beijing, China
View Profile

,
Ji-Rong Wen

Renmin University of China, Beijing, China

Renmin University of China, Beijing, China
View Profile

Authors Info & Claims

ACM Transactions on Information Systems Volume 33 Issue 3Article No.: 15pp 1–28https://doi.org/10.1145/2735629

Published:23 March 2015Publication History

ACM Transactions on Information Systems

Abstract

Compression algorithms are important for data-oriented tasks, especially in the era of “Big Data.” Modern processors equipped with powerful SIMD instruction sets provide us with an opportunity for achieving better compression performance. Previous research has shown that SIMD-based optimizations can multiply decoding speeds. Following these pioneering studies, we propose a general approach to accelerate compression algorithms. By instantiating the approach, we have developed several novel integer compression algorithms, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and implemented their corresponding vectorized versions. We evaluate the proposed algorithms on two public TREC datasets, a Wikipedia dataset, and a Twitter dataset. With competitive compression ratios and encoding speeds, our SIMD-based algorithms outperform state-of-the-art nonvectorized algorithms with respect to decoding speeds.

References

V. N. Anh and A. Moffat. 2005. Inverted index compression using word-aligned binary codes. Information Retrieval 8, 1, 151--166. Google ScholarDigital Library
V. N. Anh and A. Moffat. 2006. Improved word-aligned binary compression for text indexing. IEEE Transactions on Knowledge and Data Engineering 18, 6, 857--861. Google ScholarDigital Library
V. N. Anh and A. Moffat. 2010. Index compression using 64-bit words. Software: Practice and Experience 40, 2, 131--147. Google ScholarDigital Library
A. Z. Broder, D. Carmel, M. Herscovici, M. Soffer, and J. Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 426--434. Google ScholarDigital Library
S. Büttcher, C. Clarke, and G. V. Cormack. 2010. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. Google ScholarDigital Library
S. Chatterjee, L. R. Bachega, P. Bergner, A. D. Kenneth, J. A. Gunnels, M. Gupta, F. G. Gustavson, C. A. Lapkowski, G. K. Liu, M. P. Mendell, R. D. Nair, C. D. Wait, C. Ward, and P. Wu. 2005. Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L. IBM Journal of Research and Development 49, 2--3, 377--392. Google ScholarDigital Library
J. Dean. 2009. Challenges in building large-scale information retrieval systems: Invited talk. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining. ACM. Google ScholarDigital Library
S. Ding and T. Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 993--1002. Google ScholarDigital Library
R. Delbru, S. Campinas, and G. Tummarello. 2012. Searching web data: An entity retrieval and high-performance indexing model. Web Semantics: Science, Services and Agents on the World Wide Web 10, 33--58. Google ScholarDigital Library
P. Elias. 1975. Universal codeword sets and representations of the integers. IEEE Transactions on Information Theory 21, 2, 194--203. Google ScholarDigital Library
Intel Corporation. 2010. Intel 64 and IA-32 Architectures Software Developers Manual (Version 37). Intel Corporation, Santa Clara, CA.Google Scholar
D. Inkster, M. Zukowski, and P. Boncz. 2011. Integration of VectorWise with Ingres. ACM SIGMOD Record 40, 3, 45--53. Google ScholarDigital Library
S. Jonassen and S. E. Bratsberg. 2011. Efficient compressed inverted index skipping for disjunctive text-queries. In Proceedings of the 33rd European Conference on Advances in Information Retrieval. Springer, Berlin, 530--542. Google ScholarDigital Library
H. Kwak, C. Lee, H. Park, and S. Moon. 2010. What is Twitter, a social network or a news media&quest; In Proceedings of the 19th International World Wide Web Conference. ACM, 591--600. Google ScholarDigital Library
K. Liu, X. Qin, X. Yan, and L. Quan. 2006. A SIMD video signal processor with efficient data organization. In Proceedings of IEEE Asian Solid-State Circuits Conference, 115--118.Google Scholar
C. Lemke, K. U. Sattler, F. Faerber, and A. Zeier. 2010. Speeding up queries in column stores: A case for compression. In Proceedings of the 12th International Conference on Data Warehousing and Knowledge Discovery, 117--129. Google ScholarDigital Library
C. Lomont. 2011. Introduction to Intel advanced vector extensions. In Proceedings of the 2nd Annual ASCI Conference.Google Scholar
D. Lemire and L. Boytsov. 2015. Decoding billions of integers per second through vectorization. Software: Practice and Experience 45, 1.Google ScholarDigital Library
W. Ma and C. Yang. 2002. Using Intel streaming SIMD extensions for 3D geometry processing. In Proceedings of IEEE Pacific Rim Conference on Multimedia. 1080--1087. Google ScholarDigital Library
C. D. Manning, P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. Google ScholarDigital Library
G. Navarro, E. S. De Moura, M. Neubert, N. Ziviani, and R. B. Yates. 2000. Adding compression to block addressing inverted indexes. Information Retrieval 3, 1, 49--77. Google ScholarDigital Library
R. Rice and J. Plaunt. 1971. Adaptive variable-length coding for efficient compression of spacecraft television data. IEEE Transactions on Communication Technology 19, 6, 889--897.Google ScholarCross Ref
S. E. Robertson, S. Walker, M. Beaulieu, and P. Willett. 1999. Okapi at TREC-7: Automatic ad hoc, filtering, VLC and interactive track. NIST Special Publication SP 253--264.Google Scholar
V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, and L. Zhang. 2013. DB2 with BLU acceleration: So much more than just a column store. Proceedings of the VLDB Endowment 6, 11, 1080--1091. Google ScholarDigital Library
F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel. 2002. Compression of inverted indexes for fast query evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 222--229. Google ScholarDigital Library
B. Schlegel, R. Gemulla, and W. Lehner. 2010. Fast integer compression using SIMD instructions. In Proceedings of the 6th International Workshop on Data Management on New Hardware. ACM, 34--40. Google ScholarDigital Library
F. Silvestri and Venturini R. VSEncoding. 2010. Efficient coding and fast decoding of integer lists via dynamic programming. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management. ACM, 1219--1228. Google ScholarDigital Library
A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and R. S. Oberoi. 2011. SIMD-based decoding of posting lists. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM, 317--326. Google ScholarDigital Library
D. Shan, S. Ding, J. He, H. Yan, and X. Li. 2012. Optimized top-k processing with global page scores on block-max indexes. In Proceedings of the 5th ACM International Conference on Web Search and Data Mining. ACM, 423--432. Google ScholarDigital Library
T. Willhalm, N. Popovici, Y. Boshmaf, H. Plattner, A. Zeier, and J. Schaffner. 2009. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. Proceedings of the VLDB Endowment 2, 1, 385--394. Google ScholarDigital Library
T. Willhalm, I. Oukid, I. Müller, and F. Faerber. 2013. Vectorizing database column scans with complex predicates. Accelerating Data Management Systems Using Modern Processor and Storage Architectures, 1--12.Google Scholar
I. H. Witten, A. Moffat, and T. C. Bell. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann. Google ScholarDigital Library
J. Walder, M. Krátký, R. Bača, J. Platoš, and V. Snášel. 2012. Fast decoding algorithms for variable-lengths codes. Information Sciences 183, 1, 66--91. Google ScholarDigital Library
H. Yan, S. Ding, and T. Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proceedings of the 18th International World Wide Web Conference. ACM, 401--410. Google ScholarDigital Library
M. Zukowski, S. Heman, N. Nes, and Boncz, P. 2006. Super-scalar RAM-CPU cache compression. In Proceedings of the 22nd International Conference on Data Engineering. IEEE, 59--71. Google ScholarDigital Library
J. Zhang, X. Long, and T. Suel. 2008. Performance of compressed inverted list caching in search engines. In Proceedings of the 17th International World Wide Web Conference. ACM, 387--396. Google ScholarDigital Library

Index Terms

A General SIMD-Based Approach to Accelerating Compression Algorithms

Recommendations

SIMD compression and the intersection of sorted integers

Sorted lists of integers are commonly used in inverted indexes and database systems. They are often compressed in memory. We can use the single-instruction, multiple data SIMD instructions available in common processors to boost the speed of integer ...
Read More
Comparison between text compression algorithms in biological sequences
Abstract
Inverted indexes are mainstream in Information Retrieval systems and many compression techniques have been proposed. The purpose of this paper is to explore the compression efficiency on a two-level inverted index tailored for n-gram ...
Read More
Optimizing partitioning strategies for faster inverted index compression

The inverted index is a key component for search engines to manage billions of documents and quickly respond to users' queries.Whereas substantial effort has been devoted to reducing space occupancy and decoding speed, the encoding speed when ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Information Systems Volume 33, Issue 3
March 2015
184 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2737814
Editor:
Maarten de Rijke
University of Amsterdam, The Netherlands
Issue’s Table of Contents
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 March 2015
- Accepted: 2 January 2015
- Revised: 1 January 2015
- Received: 1 July 2014
Published in tois Volume 33, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
SIMD
index compression
integer encoding
inverted index
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 25
  Total Citations
  View Citations
- 433
  Total Downloads
- Downloads (Last 12 months)26
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A General SIMD-Based Approach to Accelerating Compression Algorithms

ACM Transactions on Information Systems

Abstract

References

Cited By

Index Terms

Recommendations

SIMD compression and the intersection of sorted integers

Comparison between text compression algorithms in biological sequences

Optimizing partitioning strategies for faster inverted index compression