skip to main content
10.1145/1183614.1183623acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

An approximate multi-word matching algorithm for robust document retrieval

Published:06 November 2006Publication History

ABSTRACT

Document generation from low level data and its utilization is one of the most challenging tasks in document engineering. Word occurrence detection is a fundamental problem in the recognized document utilization obtained by a recognizer, such as OCR and speech recognition. Given a set of words, such as a dictionary, this paper proposes an efficient dynamic programming (DP) algorithm to find the occurrences of each word in a text. In this paper, the string similarity is measured by a statistical similarity model that enables a definition of the similarities in the character level as well as edit operation level. The proposed algorithm uses tree structures to measure similarities in order to avoid measuring similarities of the same substrings appearing in different parts of the text and words. The time complexity of the proposed algorithm is O(|W|⋅|S|⋅|Q|), where |W| (resp. |S|) denote the number of nodes in the trees representing the word set (resp. the text), and |Q| donotes the number of the states of the model used for string similarity. This paper shows the proposed algorithm is experimentally about six times faster than a naive DP algorithm.

References

  1. R. Baeza-Yates and G. Navarro. New and faster filters for multiple approximate string matching. Random Structures and Algorithms, 20(1):23--49, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. Baeza-Yates and B. Ribeiro-Neto. "Modern Information Retrieval". ACM Press, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. W. Chang and T. Marr. "Theoretical and Empirical Comparisons of Approximate String Matching Algorithms". In Proc. of 3rd Annual Symposium on Combinatorial Pattern Matching, pages 172--181, 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. E. Chávez and G. Navarro. "A Metric Index for Approximate String Matching". In Proc. of LATIN, pages 181--195, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Inenaga, A. Shinohara, and M. Takeda. A fully compressed pattern matching algorithm for simple collage systems. Foundations of Computer Science, 16(6):1155--1166, 2005.]]Google ScholarGoogle ScholarCross RefCross Ref
  6. S. Kahan, T. Pavlidis, and H. S. Baird. On the recognition of printed characters of any font and size. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(2):274--288, March 1987.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. K. Kukich. "Techniques for Automatically Correcting Words in Text". ACM Computing Surveys, 24(4):377--439, 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. Landau and U. Vishkin. "Fast Parallel and Serial Approximate String Matching". Journal of Algorithms, 10:157--169, 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Y. Li, D. Lopresti, and A. Tomkins. "Validation of Document Image Defect Models for Optical Character Recognition". In Proc. of 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 137--150, 1994.]]Google ScholarGoogle Scholar
  10. J. Mamou and R. H. D. Carmel. Spoken document retrieval from call-center conversations. In Proc. of 29th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 51--58, 2006.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Myka and U. Guntzer. "Fuzzy Full-Text Searches in OCR Database". In Proc. of Forum on Research & Technology Advances in Digital Libraries, pages 87--100, 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. Navarro and R. Baeza-Yates. A practical q-gram index for text retrieval allowing errors. CLEI Electron, 1(2):31--88, 1998.]]Google ScholarGoogle Scholar
  14. M. Ohta, A. Takasu, and J. Adachi. "Probabilistic Automaton Model for Fuzzy English-text Retrieval". In Lecture Notes in Computer Science 1923, pages 35--44, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. E. S. Ristad and P. N. Yianilos. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522--532, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Srinivasan and G. Petkovic. Phonetic confusion matrix based spoken document retrieval. In Proc. of 23rd Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 81--87, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. K. Taghva, J. Borsack, and A. Condit. Results of applying probabilistic ir to ocr text. In Proc. of 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 202--211, 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. K. Taghva, J. Borsack, and A. Condit. Effects of ocr errors on ranking and feedback using the vector space model. Information Processing and Management, 32(3):317--327, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. K. Taghva, J. Borsack, and A. Condit. Evaluation of model-based retrieval effectiveness with ocr text. ACM Transaction on Information Systems, 14(1):64--93, January 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Takasu. "Bibliographic Attribute Extraction from Erroneous References Based on a Statistical Model". In Proc. of 3rd ACM & IEEE Joint Conference on Digital Libraries, pages 49--60, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. Takasu and K. Aihara. "DVHMM: Variable Length Text Recognition Error Model". In Proc. of 15th International Conference on Pattern Recognition, pages 110--114, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Y. Tanaka and H. Torii. "Transmedia Machine and its Keyword Search over Image Texts". In Proc. of RIAO 88, pages 248--258, 1988.]]Google ScholarGoogle Scholar
  23. E. Ukkonen. "Algorithms for Approximate String Matching". Information and Control, 64:100--118, 1985.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. E. Ukkonen. "Approximate String Matching with q-grams and Maximal Matches". Theoretical Computer Science, 1:191--211, 1985.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. G. Valiente. "Algorithms on Trees and Graphs". Springer, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Wechsler, E. Munteanu, and P. Schauble. New techniques for open-vocabulary spoken document retrieval. In Proc. of 21st Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 20--27, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. N. Ziviani, E. S. Moura, G. Navarro, and R. Baeza-Yates. Compression: A key for next-generation text retrieval systems. IEEE Computer, 33(11):37--44, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. An approximate multi-word matching algorithm for robust document retrieval

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management
        November 2006
        916 pages
        ISBN:1595934332
        DOI:10.1145/1183614

        Copyright © 2006 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 6 November 2006

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • Article

        Acceptance Rates

        Overall Acceptance Rate1,861of8,427submissions,22%

        Upcoming Conference

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader