skip to main content
research-article

LCP Array Construction in External Memory

Published: 05 April 2016 Publication History

Abstract

One of the most important data structures for string processing—the suffix array—needs to be augmented with the longest-common-prefix (LCP) array in numerous applications. We describe the first external memory algorithm for constructing the LCP array given the suffix array as input. The only previous way to compute the LCP array for data that is bigger than the RAM is to use an external memory suffix array construction algorithm (SACA) with complex modifications to produce the LCP array as a by-product. Compared to the best prior method, our algorithm needs much less disk space (by more than a factor of three) and is significantly faster. Furthermore, our algorithm can be combined with any SACA, including a better one developed in the future.

References

[1]
Mohamed I. Abouelhoda, Stefan Kurtz, and Enno Ohlebusch. 2004. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2, 1, 53--86.
[2]
Lars Arge and Mikkel Thorup. 2013. RAM-efficient external memory sorting. In Algorithms and Computation. Lecture Notes in Computer Science, Vol. 8283. Springer, 491--501.
[3]
Markus J. Bauer, Anthony J. Cox, Giovanna Rosone, and Marinella Sciortino. 2012. Lightweight LCP construction for next-generation sequencing datasets. In Algorithms and Bioinformatics. Lecture Notes in Computer Science, Vol. 7534. Springer, 326--337.
[4]
Timo Beller, Simon Gog, Enno Ohlebusch, and Thomas Schnattinger. 2013. Computing the longest common prefix array based on the Burrows-Wheeler transform. Journal of Discrete Algorithms 18, 22--31.
[5]
Timo Bingmann, Johannes Fischer, and Vitaly Osipov. 2013. Inducing suffix and LCP arrays in external memory. In Proceedings of the 15th Workshop on Algorithm Engineering and Experiments (ALENEX'13). 88--102.
[6]
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1, 107--113.
[7]
Roman Dementiev, Juha Kärkkäinen, Jens Mehnert, and Peter Sanders. 2008a. Better external memory suffix array construction. ACM Journal of Experimental Algorithmics 12, Article No. 3.4.
[8]
Roman Dementiev, Lutz Kettner, and Peter Sanders. 2008b. STXXL: Standard template library for XXL data sets. Software: Practice and Experience 38, 6, 589--637.
[9]
Mrinal Deo and Sean Keely. 2013. Parallel suffix array and least common prefix for the GPU. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'13). ACM, New York, NY, 197--206.
[10]
Paolo Ferragina, Travis Gagie, and Giovanni Manzini. 2012. Lightweight data indexing and compression in external memory. Algorithmica 63, 3, 707--730.
[11]
Johannes Fischer. 2011. Inducing the LCP-array. In Algorithms and Data Structures. Lecture Notes in Computer Science, Vol. 6844. Springer, 374--385.
[12]
Simon Gog and Enno Ohlebusch. 2011. Fast and lightweight LCP-array construction algorithms. In Proceedings of the 13th Workshop on Algorithm Engineering and Experiments (ALENEX'11). 25--34.
[13]
Gaston H. Gonnet, Ricardo A. Baeza-Yates, and Tim Snider. 1992. New indices for text: Pat trees and Pat arrays. In Information Retrieval: Data Structures & Algorithms, W. B. Frakes and R. A. Baeza-Yates (Ed.). Prentice Hall, Upper Saddle River, NJ, 66--82.
[14]
Juha Kärkkäinen and Dominik Kempa. 2014. Engineering a lightweight external memory suffix array construction algorithm. In Proceedings of the 2nd International Conference on Algorithms for Big Data (ICABD'14) (CEUR Workshop Proceedings), Vol. 1146. 53--60.
[15]
Juha Kärkkäinen, Dominik Kempa, and Marcin Piatkowski. 2015. Tighter bounds for the sum of irreducible LCP values. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 9133. Springer, 316--328.
[16]
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. 2013. Lightweight Lempel-Ziv parsing. In Experimental Algorithms. Lecture Notes in Computer Science, Vol. 7933. Springer, 139--150.
[17]
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. 2014. Lempel-Ziv parsing in external memory. In Proceedings of the 2014 Data Compression Conference (DCC'14). IEEE, Los Alamitos, CA, 153--162.
[18]
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. 2015. Parallel external memory suffix sorting. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 9133. Springer, 329--342.
[19]
Juha Kärkkäinen, Giovanni Manzini, and Simon J. Puglisi. 2009. Permuted longest-common-prefix array. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 5577. Springer, 181--192.
[20]
Juha Kärkkäinen and Peter Sanders. 2003. Simple linear work suffix array construction. In Automata, Languages and Programming. Lecture Notes in Computer Science, Vol. 2719. Springer, 943--955.
[21]
Juha Kärkkäinen, Peter Sanders, and Stefan Burkhardt. 2006. Linear work suffix array construction. Journal of the ACM 53, 6, 918--936.
[22]
Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, and Kunsoo Park. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 2089. Springer, 181--192.
[23]
Weijun Liu, Ge Nong, Wai Hong Chan, and Yi Wu. 2015. Induced sorting suffixes in external memory with better design and less space. In String Processing and Information Retrieval. Lecture Notes in Computer Science, Vol. 9309. Springer, 83--94.
[24]
Felipe A. Louza, Guilherme P. Telles, and Cristina D. A. Ciferri. 2013. External memory generalized suffix and LCP arrays construction. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 7922. Springer, 201--210.
[25]
Veli Mäkinen. 2003. Compact suffix array—a space efficient full-text index. Fundamenta Informaticae 56, 1--2, 191--210.
[26]
Udi Manber and Eugene W. Myers. 1993. Suffix arrays: A new method for on-line string searches. SIAM Journal on Computing 22, 5, 935--948.
[27]
Giovanni Manzini. 2004. Two space saving tricks for linear time LCP array computation. In Algorithm Theory—SWAT 2004. Lecture Notes in Computer Science, Vol. 3111. Springer, 372--383.
[28]
Gonzalo Navarro and Veli Mäkinen. 2007. Compressed full-text indexes. ACM Computing Surveys 39, 1, Article No. 2.
[29]
Ge Nong, Wai Hong Chan, Sheng Qing Hu, and Yi Wu. 2015. Induced sorting suffixes in external memory. ACM Transactions on Information Systems 33, 3, Article No. 12.
[30]
Ge Nong, Wai Hong Chan, Sen Zhang, and Xiao Feng Guan. 2014. Suffix array construction in external memory using d-critical substrings. ACM Transactions on Information Systems 32, 1, Article No. 1.
[31]
Ge Nong, Sen Zhang, and Wai Hong Chan. 2011. Two efficient algorithms for linear time suffix array construction. IEEE Transactions on Computers 60, 10, 1471--1484.
[32]
Enno Ohlebusch. 2013. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruction. Oldenbusch Verlag, Bremen, Germany.
[33]
Simon J. Puglisi, William F. Smyth, and Andrew Turpin. 2007. A taxonomy of suffix array construction algorithms. ACM Computing Surveys 39, 2, Article No. 4.
[34]
Simon J. Puglisi and Andrew Turpin. 2008. Space-time tradeoffs for longest-common-prefix array computation. In Algorithms and Computation. Lecture Notes in Computer Science, Vol. 5369. Springer, 124--135.
[35]
Peter Sanders. 2009. Algorithm engineering—an attempt at a definition. In Efficient Algorithms. Lecture Notes in Computer Science, Vol. 5760. Springer, 321--340.
[36]
Julian Shun. 2014. Fast parallel computation of longest common prefixes. In Proceedings of the 2014 ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC'14). IEEE, Los Alamitos, CA, 387--398.
[37]
Jouni Sirén. 2010. Sampled longest common prefix array. In Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 6129. Springer, 227--237.
[38]
Jeffrey S. Vitter. 2008. Algorithms and data structures for external memory. Foundations and Trends in Theoretical Computer Science 2, 4, 305--474.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Journal of Experimental Algorithmics
ACM Journal of Experimental Algorithmics  Volume 21, Issue
Special Issue SEA 2014, Regular Papers and Special Issue ALENEX 2013
2016
404 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/2888418
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 April 2016
Accepted: 01 November 2015
Revised: 01 April 2015
Received: 01 October 2014
Published in JEA Volume 21

Author Tags

  1. External memory algorithms
  2. LCP array
  3. suffix array

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Academy of Finland

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)7
  • Downloads (Last 6 weeks)1
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Text Indexing for Long Patterns: Anchors are All you NeedProceedings of the VLDB Endowment10.14778/3598581.359858616:9(2117-2131)Online publication date: 1-May-2023
  • (2023)Scalable Text Index ConstructionAlgorithms for Big Data10.1007/978-3-031-21534-6_14(252-284)Online publication date: 18-Jan-2023
  • (2022)Computing Lexicographic Parsings2022 Data Compression Conference (DCC)10.1109/DCC52660.2022.00031(232-241)Online publication date: Mar-2022
  • (2019)Better External Memory LCP Array ConstructionACM Journal of Experimental Algorithmics10.1145/329772324(1-27)Online publication date: 14-Feb-2019
  • (2017)Longest Common Prefixes with k-Mismatches and ApplicationsSOFSEM 2018: Theory and Practice of Computer Science10.1007/978-3-319-73117-9_45(636-649)Online publication date: 22-Dec-2017
  • (2017)Lightweight BWT and LCP Merging via the Gap AlgorithmString Processing and Information Retrieval10.1007/978-3-319-67428-5_15(176-190)Online publication date: 6-Sep-2017

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media