|
ABSTRACT
Given a sequence S = s1s2…sn of integers smaller than r = O(polylog(n)), we show how S can be represented using nH0(S) + o(n) bits, so that we can know any sq, as well as answer rank and select queries on S, in constant time. H0(S) is the zero-order empirical entropy of S and nH0(S) provides an information-theoretic lower bound to the bit storage of any sequence S via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in O(log r) time. For larger r, we can still represent S in nH0(S) + o(n log r) bits and answer queries in O(log r/log log n) time. Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet Σ. Specifically, we design a variant of the FM-index that indexes a string T[1, n] within nHk(T) + o(n) bits of storage, where Hk(T) is the kth-order empirical entropy of T. This space bound holds simultaneously for all k ≤ α log|Σ| n, constant 0 < α < 1, and |Σ| = O(polylog(n)). This index counts the occurrences of an arbitrary pattern P[1, p] as a substring of T in O(p) time; it locates each pattern occurrence in O(log1+ϵ n) time for any constant 0 < ϵ < 1; and reports a text substring of length ℓ in O(ℓ + log1+ϵ n) time. Compared to all previous works, our index is the first that removes the alphabet-size dependance from all query times, in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space of the kth-order entropy of the text T, which is the best space obtained in previous work. We can also handle larger alphabets of size |Σ| = O(nβ), for any 0 < β < 1, by paying o(n log|Σ|) extra space and multiplying all query times by O(log |Σ|/log log n).
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Burrows, M., and Wheeler, D. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation.
|
| |
2
|
Chan, W.-L., Hon, W.-K., and Lam, T.-W. 2004. Compressed index for a dynamic collection of texts. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer Verlag, Berlin. 445--456.
|
| |
3
|
|
| |
4
|
|
| |
5
|
|
 |
6
|
|
| |
7
|
|
| |
8
|
|
 |
9
|
|
| |
10
|
Ferragina, P., Manzini, G., Mäkinen, V., and Navarro, G. 2004. An alphabet-friendly FM-index. In Proceedings of the 11th International Symposium on String Processing and Information Retrieval (SPIRE). Lecture Notes in Computer Science, vol. 3246. Springer Verlag, Berlin. 150--160.
|
 |
11
|
|
| |
12
|
Geary, R., Rahman, N., Raman, R., and V.Raman. 2004. A simple optimal representation for balanced parentheses. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer Verlag, Berlin. 159--172.
|
 |
13
|
|
| |
14
|
|
| |
15
|
González, R., and Navarro, G. 2006. Statistical encoding of succinct data structures. In Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 4009. Springer Verlag, Berlin. 295--306.
|
| |
16
|
Grabowski, S., Navarro, G., Przywarski, R., Salinger, A., and Mäkinen, V. 2006. A simple alphabet-independent FM-index. Int. J. Found. Comput. Sci. 17, 6, 1365--1384.
|
| |
17
|
|
| |
18
|
|
| |
19
|
Healy, J., Thomas, E., Schwartz, J., and Wigler, M. 2003. Annotating large genomes with exact word matches. Genome Res. 13, 2306--2315.
|
| |
20
|
|
| |
21
|
Hon, W.-K., Lam, T.-W., Sung, W. K., Tse, W. L., Wong, C.-K., and Yiu, S. M. 2004. Practical aspects of compressed suffix arrays and FM-index in searching DNA sequences. In Proceedings of the Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM, 31--38.
|
| |
22
|
Huynh, N., Hon, W., Lam, T., and Sung, W. 2004. Approximate string matching using compressed suffix arrays. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer Verlag, Berlin. 434--444.
|
| |
23
|
Jacobson, G. 1989. Space-Efficient static trees and graphs. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science (FOCS). 549--554.
|
| |
24
|
Mäkinen, V., and Navarro, G. 2004a. Compressed compact suffix arrays. In Proceedings of the Symposium on Combinatorial Pattern Matching (CPM). Lecture Notes in Computer Science, vol. 3109. Springer Verlag, Berlin. 420--433.
|
| |
25
|
Mäkinen, V., and Navarro, G. 2004b. New search algorithms and time/space tradeoffs for succinct suffix arrays. Tech. Rep. C-2004-20, Department of Computer Science, University of Helsinki.
|
| |
26
|
|
| |
27
|
Mäkinen, V., Navarro, G., and Sadakane, K. 2004. Advantages of backward searching---Efficient secondary memory and distributed implementation of compressed suffix arrays. In Proceedings of the International Symposium on Algorithms and Computation (ISAAC). Lecture Notes in Computer Science, vol. 3341. Springer Verlag, Berlin. 681--692.
|
| |
28
|
|
 |
29
|
|
| |
30
|
|
| |
31
|
|
| |
32
|
Munro, J., Raman, R., Raman, V., and Rao, S. 2003. Succinct representations of permutations. In Proceedings of the International Colloquium on Automata, Languages and Programming (ICALP). Lecture Notes in Computer Science, vol. 2719. Springer Verlag, Berlin. 345--356.
|
| |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
|
| |
37
|
|
 |
38
|
|
| |
39
|
Sadakane, K., and Shibuya, T. 2001. Indexing huge genome sequences for solving various problems. Genome Inf. 12, 175--183.
|
|