article

Free Access

Complete inverted files for efficient text retrieval and analysis

Authors:
A. Blumer

Univ. of Denver, Denver, CO

Univ. of Denver, Denver, CO
View Profile

,
J. Blumer

Univ. of Denver, Denver, CO

Univ. of Denver, Denver, CO
View Profile

,
D. Haussler

Univ. of Denver, Denver, CO

Univ. of Denver, Denver, CO
View Profile

,
R. McConnell

Univ. of Denver, Denver, CO

Univ. of Denver, Denver, CO
View Profile

,
A. Ehrenfeucht

Univ. of Colorado at Boulder, Boulder

Univ. of Colorado at Boulder, Boulder
View Profile

Authors Info & Claims

Journal of the ACM Volume 34 Issue 3pp 578–595https://doi.org/10.1145/28869.28873

Published:01 July 1987Publication History

Journal of the ACM

Abstract

Given a finite set of texts S = {w1, … , w_k} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(w), which returns the longest prefix of w that occurs (as a subword of a word) in S; freq(w), which returns the number of times w occurs in S; and locations(w), which returns the set of positions where w occurs in S. A data structure that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the suffix tree.

References

1 AHO, V., HOPCROFT, J. E., AND ULLMAN, J.D. The Design and Analysis of Computer Algorithms Addison-Wesley, Reading, Mass., 1974. Google Scholar
2 APOSTOLICO, A., AND PREPARATA, F.P. The myriad virtues of suffix trees. In Proceedings of the NA TO Advanced Research Workshop on Combinatorial Algorithms on Words (Maratea, Italy, June 18-22). Springer-Verlag, New York, 1985.Google Scholar
3 BAGGETT, P., EHRENFEUCHT, A., AND PERRY, M. A technique for designing computer access and selecting good terminology. In Proceedings of the 1st Annual Rocky Mountain Conference on Artificial Intelligence. Breit International Inc., Boulder, Colo., 1986.Google Scholar
4 BLUMER, J. Correctness and linearity of the on-line directed acyclic word graph algorithm. Tech. Rep. MS-8410, Univ. of Denver, Denver, Colo., 1984.Google Scholar
5 BLUMER, A., HAUSSLER, D., AND EHRENFEUCHT, A. Average sizes of suffix trees and DAWGs. Presented at the Ist Montreal Conference on Combinatorics and Computer Science, Univ. of Montreal, Canada, May 1987.Google Scholar
6 BLUMER, A., BLUMER, J., EHRENFEUCHT, A., HA USSLER, D., AND MCCONNELL, R. Linear size finite automata for the set of all subwords of a word: An outline of results. Bull. Fur. Assoc. Theoret. Comput. Sci. 21 (1983), 12-20.Google Scholar
7 BLUMER, A., BLUMER, J., EHRENFEUCHT, A., HAUSSLER, D., AND MCCONNELL, R. Building a complete inverted file for a set of text files in linear time. In Proceedings of the 16th ACM Symposium on the Theory of Computing. ACM, New York, 1984, pp. 349-358. Google Scholar
8 BLUMER, A., BLUMER J., EHRENFEUCHT, A., HA USSLER, D., CHEN, M. T., AND SEIFERAS, J. The smallest automaton recognizing the subwords of a text. Theoret. Comput. Sci., 40 (1985), 31-55.Google Scholar
9 CARDENAS, A.F. Analysis and performance of inverted data base structures. Commun. ACM 18, 5 (May 1975), 253-263. Google Scholar
10 CHEN, M. T., AND SEIFERAS, J. Efficient and elegant subword-tree construction. Univ. of Rochester 1983-84 C.S. and C.E. Res. Rev., Univ. of Rochester, N.Y., 1984, pp. 10-14.Google Scholar
11 CLIFT, B., HAUSSLER, D., MCCONNELL, R., SCHNEIDER, T. D., AND STORMO, G.O. Sequence Landscapes. Nucleic Acids Res., 14, l (1986), 141-158.Google Scholar
12 EHRENFEUCHT, A., AND HAUSSLER, D. A new distance metric on strings computable in linear time. Tech. Rep. UCSC-CRL-86-27, Dept. of Computer and Information Sciences, Univ. of California at Santa Cruz, Oct. 1986.Google Scholar
13 GOLDSMITH, N. An appraisal of factors affecting the performance of text retrieval systems. Inf. Tech.: Res. Dev. 1 (1982), 41-53.Google Scholar
14 KOHONEN, T. Content-Addressable Memories. Springer-Vedag, New York, 1980. Google Scholar
15 MAJSTER, M. E., AND REISER, A. Efficient on-line construction and correction of position trees. SIAM J. Comput 9, 4 (Nov. 1980), 785-807.Google Scholar
16 MALLER, V. The content addressable fde storemA technical overview. Angwte. Inf. 3 (1981), 100-106.Google Scholar
17 MCCREIGHT, E. M. A space-economical suffix tree construction algorithm. J. ACM 23, 2 (Apr. 1976), 262-272. Google Scholar
18 MORRISON, D. R. PATRICIA--Practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4 (Oct. 1968), 514-534. Google Scholar
19 NERODE, A. Linear automaton transformations. Proc. AMS 9 (1958), 541-544.Google Scholar
20 PRATT, V. R. Improvements and applications for the Weiner repetition finder. Unpublished manuscript, Mar. 1975.Google Scholar
21 SLISENKO, A.O. Detection of periodicities and string matching in real time (English translation). J. Soy. Math. 22, 3 (1983), 1316-1387. (Originally published 1980).Google Scholar
22 TANIMOTO, S. L. A method for detecting structure in polygons. Pattern Rec. 13, 6 (1981), 389-394.Google Scholar
23 VAN RIJSBERGEN, C.J. File organization in library automation and information retrieval.I. Doc. 32, 4 (Dec. 1976), 294-317.Google Scholar
24 WEINER, P. Linear pattern matching algorithms. In IEEE 14th Annual Symposium on Switching and Automata Theory. IEEE, New York, 1973, pp. 1-11.Google Scholar

Index Terms

Complete inverted files for efficient text retrieval and analysis
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory

Recommendations

Inverted files versus signature files for text indexing

Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We ...
Read More
Self-indexing inverted files for fast text retrieval

Query-processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Retrieval time for inverted lists can be greatly reduced by the use of compression, but this adds to the CPU time ...
Read More
Compressing Inverted Files
Abstract
Research into inverted file compression has focused on compression ratio—how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read, the smaller the index, the faster the search.
The premise “...
Read More

Reviews

Reviewer: Jeffrey A. Fried

The authors provide an excellent treatment of a new data structure for text analysis, called the compact directed acyclic word graph (compact DAWG). This data structure is both smaller than and functionally superior to conventional keyword-based retrieval structures and to structures proposed in the mid-1970s such as suffix and position trees. One of the more powerful aspects of the approach treated in the paper is that searches for arbitrary strings are allowed. This is particularly important when standard keywords are undefined or inadequate (the authors suggest that searching a library of DNA sequences is just such a problem). The flavor of the paper is theoretical. Linear (and hence optimal) times are shown for finding the longest prefix of a keyword that is present in a set of texts, counting the number of occurrences of this prefix, and finding the locations of all the occurrences. Bounds on the sizes of the data structures involved are also proven. In addition, the treatment is complete enough to show how to construct the needed data structures and implement text retrieval functions. In summary, the paper is interesting reading for anyone interested in text retrieval problems or data structures.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Journal of the ACM Volume 34, Issue 3
July 1987
248 pages
ISSN:0004-5411
EISSN:1557-735X
DOI:10.1145/28869
Issue’s Table of Contents

Copyright © 1987 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 July 1987
Published in jacm Volume 34, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 194
  Total Citations
  View Citations
- 1,087
  Total Downloads
- Downloads (Last 12 months)86
- Downloads (Last 6 weeks)17
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Complete inverted files for efficient text retrieval and analysis

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Inverted files versus signature files for text indexing

Self-indexing inverted files for fast text retrieval

Compressing Inverted Files

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Complete inverted files for efficient text retrieval and analysis

Journal of the ACM

Abstract

References

Cited By

Index Terms

Recommendations

Inverted files versus signature files for text indexing

Self-indexing inverted files for fast text retrieval

Compressing Inverted Files

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media