skip to main content
article
Free Access

Complete inverted files for efficient text retrieval and analysis

Published:01 July 1987Publication History
Skip Abstract Section

Abstract

Given a finite set of texts S = {w1, … , wk} over some fixed finite alphabet Σ, a complete inverted file for S is an abstract data type that provides the functions find(w), which returns the longest prefix of w that occurs (as a subword of a word) in S; freq(w), which returns the number of times w occurs in S; and locations(w), which returns the set of positions where w occurs in S. A data structure that implements a complete inverted file for S that occupies linear space and can be built in linear time, using the uniform-cost RAM model, is given. Using this data structure, the time for each of the above query functions is optimal. To accomplish this, techniques from the theory of finite automata and the work on suffix trees are used to build a deterministic finite automaton that recognizes the set of all subwords of the set S. This automaton is then annotated with additional information and compacted to facilitate the desired query functions. The result is a data structure that is smaller and more flexible than the suffix tree.

References

  1. 1 AHO, V., HOPCROFT, J. E., AND ULLMAN, J.D. The Design and Analysis of Computer Algorithms Addison-Wesley, Reading, Mass., 1974. Google ScholarGoogle Scholar
  2. 2 APOSTOLICO, A., AND PREPARATA, F.P. The myriad virtues of suffix trees. In Proceedings of the NA TO Advanced Research Workshop on Combinatorial Algorithms on Words (Maratea, Italy, June 18-22). Springer-Verlag, New York, 1985.Google ScholarGoogle Scholar
  3. 3 BAGGETT, P., EHRENFEUCHT, A., AND PERRY, M. A technique for designing computer access and selecting good terminology. In Proceedings of the 1st Annual Rocky Mountain Conference on Artificial Intelligence. Breit International Inc., Boulder, Colo., 1986.Google ScholarGoogle Scholar
  4. 4 BLUMER, J. Correctness and linearity of the on-line directed acyclic word graph algorithm. Tech. Rep. MS-8410, Univ. of Denver, Denver, Colo., 1984.Google ScholarGoogle Scholar
  5. 5 BLUMER, A., HAUSSLER, D., AND EHRENFEUCHT, A. Average sizes of suffix trees and DAWGs. Presented at the Ist Montreal Conference on Combinatorics and Computer Science, Univ. of Montreal, Canada, May 1987.Google ScholarGoogle Scholar
  6. 6 BLUMER, A., BLUMER, J., EHRENFEUCHT, A., HA USSLER, D., AND MCCONNELL, R. Linear size finite automata for the set of all subwords of a word: An outline of results. Bull. Fur. Assoc. Theoret. Comput. Sci. 21 (1983), 12-20.Google ScholarGoogle Scholar
  7. 7 BLUMER, A., BLUMER, J., EHRENFEUCHT, A., HAUSSLER, D., AND MCCONNELL, R. Building a complete inverted file for a set of text files in linear time. In Proceedings of the 16th ACM Symposium on the Theory of Computing. ACM, New York, 1984, pp. 349-358. Google ScholarGoogle Scholar
  8. 8 BLUMER, A., BLUMER J., EHRENFEUCHT, A., HA USSLER, D., CHEN, M. T., AND SEIFERAS, J. The smallest automaton recognizing the subwords of a text. Theoret. Comput. Sci., 40 (1985), 31-55.Google ScholarGoogle Scholar
  9. 9 CARDENAS, A.F. Analysis and performance of inverted data base structures. Commun. ACM 18, 5 (May 1975), 253-263. Google ScholarGoogle Scholar
  10. 10 CHEN, M. T., AND SEIFERAS, J. Efficient and elegant subword-tree construction. Univ. of Rochester 1983-84 C.S. and C.E. Res. Rev., Univ. of Rochester, N.Y., 1984, pp. 10-14.Google ScholarGoogle Scholar
  11. 11 CLIFT, B., HAUSSLER, D., MCCONNELL, R., SCHNEIDER, T. D., AND STORMO, G.O. Sequence Landscapes. Nucleic Acids Res., 14, l (1986), 141-158.Google ScholarGoogle Scholar
  12. 12 EHRENFEUCHT, A., AND HAUSSLER, D. A new distance metric on strings computable in linear time. Tech. Rep. UCSC-CRL-86-27, Dept. of Computer and Information Sciences, Univ. of California at Santa Cruz, Oct. 1986.Google ScholarGoogle Scholar
  13. 13 GOLDSMITH, N. An appraisal of factors affecting the performance of text retrieval systems. Inf. Tech.: Res. Dev. 1 (1982), 41-53.Google ScholarGoogle Scholar
  14. 14 KOHONEN, T. Content-Addressable Memories. Springer-Vedag, New York, 1980. Google ScholarGoogle Scholar
  15. 15 MAJSTER, M. E., AND REISER, A. Efficient on-line construction and correction of position trees. SIAM J. Comput 9, 4 (Nov. 1980), 785-807.Google ScholarGoogle Scholar
  16. 16 MALLER, V. The content addressable fde storemA technical overview. Angwte. Inf. 3 (1981), 100-106.Google ScholarGoogle Scholar
  17. 17 MCCREIGHT, E. M. A space-economical suffix tree construction algorithm. J. ACM 23, 2 (Apr. 1976), 262-272. Google ScholarGoogle Scholar
  18. 18 MORRISON, D. R. PATRICIA--Practical algorithm to retrieve information coded in alphanumeric. J. ACM 15, 4 (Oct. 1968), 514-534. Google ScholarGoogle Scholar
  19. 19 NERODE, A. Linear automaton transformations. Proc. AMS 9 (1958), 541-544.Google ScholarGoogle Scholar
  20. 20 PRATT, V. R. Improvements and applications for the Weiner repetition finder. Unpublished manuscript, Mar. 1975.Google ScholarGoogle Scholar
  21. 21 SLISENKO, A.O. Detection of periodicities and string matching in real time (English translation). J. Soy. Math. 22, 3 (1983), 1316-1387. (Originally published 1980).Google ScholarGoogle Scholar
  22. 22 TANIMOTO, S. L. A method for detecting structure in polygons. Pattern Rec. 13, 6 (1981), 389-394.Google ScholarGoogle Scholar
  23. 23 VAN RIJSBERGEN, C.J. File organization in library automation and information retrieval.I. Doc. 32, 4 (Dec. 1976), 294-317.Google ScholarGoogle Scholar
  24. 24 WEINER, P. Linear pattern matching algorithms. In IEEE 14th Annual Symposium on Switching and Automata Theory. IEEE, New York, 1973, pp. 1-11.Google ScholarGoogle Scholar

Index Terms

  1. Complete inverted files for efficient text retrieval and analysis

      Recommendations

      Reviews

      Jeffrey A. Fried

      The authors provide an excellent treatment of a new data structure for text analysis, called the compact directed acyclic word graph (compact DAWG). This data structure is both smaller than and functionally superior to conventional keyword-based retrieval structures and to structures proposed in the mid-1970s such as suffix and position trees. One of the more powerful aspects of the approach treated in the paper is that searches for arbitrary strings are allowed. This is particularly important when standard keywords are undefined or inadequate (the authors suggest that searching a library of DNA sequences is just such a problem). The flavor of the paper is theoretical. Linear (and hence optimal) times are shown for finding the longest prefix of a keyword that is present in a set of texts, counting the number of occurrences of this prefix, and finding the locations of all the occurrences. Bounds on the sizes of the data structures involved are also proven. In addition, the treatment is complete enough to show how to construct the needed data structures and implement text retrieval functions. In summary, the paper is interesting reading for anyone interested in text retrieval problems or data structures.

      Access critical reviews of Computing literature here

      Become a reviewer for Computing Reviews.

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image Journal of the ACM
        Journal of the ACM  Volume 34, Issue 3
        July 1987
        248 pages
        ISSN:0004-5411
        EISSN:1557-735X
        DOI:10.1145/28869
        Issue’s Table of Contents

        Copyright © 1987 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 July 1987
        Published in jacm Volume 34, Issue 3

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader