Abstract
Given a finite set of texts S = {w1, … , wk} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(w), which returns the longest prefix of w that occurs (as a subword of a word) in S; freq(w), which returns the number of times w occurs in S; and locations(w), which returns the set of positions where w occurs in S. A data structure that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the suffix tree.
- 1 AHO, V., HOPCROFT, J. E., AND ULLMAN, J.D. The Design and Analysis of Computer Algorithms Addison-Wesley, Reading, Mass., 1974. Google Scholar
- 2 APOSTOLICO, A., AND PREPARATA, F.P. The myriad virtues of suffix trees. In Proceedings of the NA TO Advanced Research Workshop on Combinatorial Algorithms on Words (Maratea, Italy, June 18-22). Springer-Verlag, New York, 1985.Google Scholar
- 3 BAGGETT, P., EHRENFEUCHT, A., AND PERRY, M. A technique for designing computer access and selecting good terminology. In Proceedings of the 1st Annual Rocky Mountain Conference on Artificial Intelligence. Breit International Inc., Boulder, Colo., 1986.Google Scholar
- 4 BLUMER, J. Correctness and linearity of the on-line directed acyclic word graph algorithm. Tech. Rep. MS-8410, Univ. of Denver, Denver, Colo., 1984.Google Scholar
- 5 BLUMER, A., HAUSSLER, D., AND EHRENFEUCHT, A. Average sizes of suffix trees and DAWGs. Presented at the Ist Montreal Conference on Combinatorics and Computer Science, Univ. of Montreal, Canada, May 1987.Google Scholar
- 6 BLUMER, A., BLUMER, J., EHRENFEUCHT, A., HA USSLER, D., AND MCCONNELL, R. Linear size finite automata for the set of all subwords of a word: An outline of results. Bull. Fur. Assoc. Theoret. Comput. Sci. 21 (1983), 12-20.Google Scholar
- 7 BLUMER, A., BLUMER, J., EHRENFEUCHT, A., HAUSSLER, D., AND MCCONNELL, R. Building a complete inverted file for a set of text files in linear time. In Proceedings of the 16th ACM Symposium on the Theory of Computing. ACM, New York, 1984, pp. 349-358. Google Scholar
- 8 BLUMER, A., BLUMER J., EHRENFEUCHT, A., HA USSLER, D., CHEN, M. T., AND SEIFERAS, J. The smallest automaton recognizing the subwords of a text. Theoret. Comput. Sci., 40 (1985), 31-55.Google Scholar
- 9 CARDENAS, A.F. Analysis and performance of inverted data base structures. Commun. ACM 18, 5 (May 1975), 253-263. Google Scholar
- 10 CHEN, M. T., AND SEIFERAS, J. Efficient and elegant subword-tree construction. Univ. of Rochester 1983-84 C.S. and C.E. Res. Rev., Univ. of Rochester, N.Y., 1984, pp. 10-14.Google Scholar
- 11 CLIFT, B., HAUSSLER, D., MCCONNELL, R., SCHNEIDER, T. D., AND STORMO, G.O. Sequence Landscapes. Nucleic Acids Res., 14, l (1986), 141-158.Google Scholar
- 12 EHRENFEUCHT, A., AND HAUSSLER, D. A new distance metric on strings computable in linear time. Tech. Rep. UCSC-CRL-86-27, Dept. of Computer and Information Sciences, Univ. of California at Santa Cruz, Oct. 1986.Google Scholar
- 13 GOLDSMITH, N. An appraisal of factors affecting the performance of text retrieval systems. Inf. Tech.: Res. Dev. 1 (1982), 41-53.Google Scholar
- 14 KOHONEN, T. Content-Addressable Memories. Springer-Vedag, New York, 1980. Google Scholar
- 15 MAJSTER, M. E., AND REISER, A. Efficient on-line construction and correction of position trees. SIAM J. Comput 9, 4 (Nov. 1980), 785-807.Google Scholar
- 16 MALLER, V. The content addressable fde storemA technical overview. Angwte. Inf. 3 (1981), 100-106.Google Scholar
- 17 MCCREIGHT, E. M. A space-economical suffix tree construction algorithm. J. ACM 23, 2 (Apr. 1976), 262-272. Google Scholar
- 18 MORRISON, D. R. PATRICIA--Practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4 (Oct. 1968), 514-534. Google Scholar
- 19 NERODE, A. Linear automaton transformations. Proc. AMS 9 (1958), 541-544.Google Scholar
- 20 PRATT, V. R. Improvements and applications for the Weiner repetition finder. Unpublished manuscript, Mar. 1975.Google Scholar
- 21 SLISENKO, A.O. Detection of periodicities and string matching in real time (English translation). J. Soy. Math. 22, 3 (1983), 1316-1387. (Originally published 1980).Google Scholar
- 22 TANIMOTO, S. L. A method for detecting structure in polygons. Pattern Rec. 13, 6 (1981), 389-394.Google Scholar
- 23 VAN RIJSBERGEN, C.J. File organization in library automation and information retrieval.I. Doc. 32, 4 (Dec. 1976), 294-317.Google Scholar
- 24 WEINER, P. Linear pattern matching algorithms. In IEEE 14th Annual Symposium on Switching and Automata Theory. IEEE, New York, 1973, pp. 1-11.Google Scholar
Index Terms
- Complete inverted files for efficient text retrieval and analysis
Recommendations
Inverted files versus signature files for text indexing
Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We ...
Self-indexing inverted files for fast text retrieval
Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time ...
Compressing Inverted Files
AbstractResearch into inverted file compression has focused on compression ratio—how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read, the smaller the index, the faster the search.
The premise “...
Comments