Article

An approximate multi-word matching algorithm for robust document retrieval

Author:
Atsuhiro Takasu

National Institute of Informatics, Chiyoda-ku, Tokyo, Japan

National Institute of Informatics, Chiyoda-ku, Tokyo, Japan
View Profile

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge managementNovember 2006Pages 34–42https://doi.org/10.1145/1183614.1183623

Published:06 November 2006Publication History

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

Pages 34–42

ABSTRACT

Document generation from low level data and its utilization is one of the most challenging tasks in document engineering. Word occurrence detection is a fundamental problem in the recognized document utilization obtained by a recognizer, such as OCR and speech recognition. Given a set of words, such as a dictionary, this paper proposes an efficient dynamic programming (DP) algorithm to find the occurrences of each word in a text. In this paper, the string similarity is measured by a statistical similarity model that enables a definition of the similarities in the character level as well as edit operation level. The proposed algorithm uses tree structures to measure similarities in order to avoid measuring similarities of the same substrings appearing in different parts of the text and words. The time complexity of the proposed algorithm is O(|W|⋅|S|⋅|Q|), where |W| (resp. |S|) denote the number of nodes in the trees representing the word set (resp. the text), and |Q| donotes the number of the states of the model used for string similarity. This paper shows the proposed algorithm is experimentally about six times faster than a naive DP algorithm.

References

R. Baeza-Yates and G. Navarro. New and faster filters for multiple approximate string matching. Random Structures and Algorithms, 20(1):23--49, 2002.]] Google ScholarDigital Library
R. Baeza-Yates and B. Ribeiro-Neto. "Modern Information Retrieval". ACM Press, 1999.]] Google ScholarDigital Library
W. Chang and T. Marr. "Theoretical and Empirical Comparisons of Approximate String Matching Algorithms". In Proc. of 3rd Annual Symposium on Combinatorial Pattern Matching, pages 172--181, 1992.]] Google ScholarDigital Library
E. Chávez and G. Navarro. "A Metric Index for Approximate String Matching". In Proc. of LATIN, pages 181--195, 2002.]] Google ScholarDigital Library
S. Inenaga, A. Shinohara, and M. Takeda. A fully compressed pattern matching algorithm for simple collage systems. Foundations of Computer Science, 16(6):1155--1166, 2005.]]Google ScholarCross Ref
S. Kahan, T. Pavlidis, and H. S. Baird. On the recognition of printed characters of any font and size. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(2):274--288, March 1987.]] Google ScholarDigital Library
K. Kukich. "Techniques for Automatically Correcting Words in Text". ACM Computing Surveys, 24(4):377--439, 1992.]] Google ScholarDigital Library
G. Landau and U. Vishkin. "Fast Parallel and Serial Approximate String Matching". Journal of Algorithms, 10:157--169, 1989.]] Google ScholarDigital Library
Y. Li, D. Lopresti, and A. Tomkins. "Validation of Document Image Defect Models for Optical Character Recognition". In Proc. of 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 137--150, 1994.]]Google Scholar
J. Mamou and R. H. D. Carmel. Spoken document retrieval from call-center conversations. In Proc. of 29th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 51--58, 2006.]] Google ScholarDigital Library
A. Myka and U. Guntzer. "Fuzzy Full-Text Searches in OCR Database". In Proc. of Forum on Research & Technology Advances in Digital Libraries, pages 87--100, 1995.]] Google ScholarDigital Library
G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001.]] Google ScholarDigital Library
G. Navarro and R. Baeza-Yates. A practical q-gram index for text retrieval allowing errors. CLEI Electron, 1(2):31--88, 1998.]]Google Scholar
M. Ohta, A. Takasu, and J. Adachi. "Probabilistic Automaton Model for Fuzzy English-text Retrieval". In Lecture Notes in Computer Science 1923, pages 35--44, 2000.]] Google ScholarDigital Library
E. S. Ristad and P. N. Yianilos. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522--532, 1998.]] Google ScholarDigital Library
S. Srinivasan and G. Petkovic. Phonetic confusion matrix based spoken document retrieval. In Proc. of 23rd Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 81--87, 2000.]] Google ScholarDigital Library
K. Taghva, J. Borsack, and A. Condit. Results of applying probabilistic ir to ocr text. In Proc. of 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 202--211, 1994.]] Google ScholarDigital Library
K. Taghva, J. Borsack, and A. Condit. Effects of ocr errors on ranking and feedback using the vector space model. Information Processing and Management, 32(3):317--327, 1996.]] Google ScholarDigital Library
K. Taghva, J. Borsack, and A. Condit. Evaluation of model-based retrieval effectiveness with ocr text. ACM Transaction on Information Systems, 14(1):64--93, January 1996.]] Google ScholarDigital Library
A. Takasu. "Bibliographic Attribute Extraction from Erroneous References Based on a Statistical Model". In Proc. of 3rd ACM & IEEE Joint Conference on Digital Libraries, pages 49--60, 2003.]] Google ScholarDigital Library
A. Takasu and K. Aihara. "DVHMM: Variable Length Text Recognition Error Model". In Proc. of 15th International Conference on Pattern Recognition, pages 110--114, 2002.]] Google ScholarDigital Library
Y. Tanaka and H. Torii. "Transmedia Machine and its Keyword Search over Image Texts". In Proc. of RIAO 88, pages 248--258, 1988.]]Google Scholar
E. Ukkonen. "Algorithms for Approximate String Matching". Information and Control, 64:100--118, 1985.]] Google ScholarDigital Library
E. Ukkonen. "Approximate String Matching with q-grams and Maximal Matches". Theoretical Computer Science, 1:191--211, 1985.]] Google ScholarDigital Library
G. Valiente. "Algorithms on Trees and Graphs". Springer, 2002.]] Google ScholarDigital Library
M. Wechsler, E. Munteanu, and P. Schauble. New techniques for open-vocabulary spoken document retrieval. In Proc. of 21st Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 20--27, 1998.]] Google ScholarDigital Library
N. Ziviani, E. S. Moura, G. Navarro, and R. Baeza-Yates. Compression: A key for next-generation text retrieval systems. IEEE Computer, 33(11):37--44, 2000.]] Google ScholarDigital Library

Index Terms

An approximate multi-word matching algorithm for robust document retrieval
1. Information systems
  1. Information retrieval
  2. Information storage systems

Recommendations

An efficient pruning strategy for approximate string matching over suffix tree

Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is one of ...
Read More
A fast longest common subsequence algorithm for similar strings
LATA'10: Proceedings of the 4th international conference on Language and Automata Theory and Applications

The longest common subsequence problem is a very important computational problem for which there are many algorithms. We present a new algorithm for this problem. Let X and Y be any two given strings each of length O(n). We observe that a longest common ...
Read More
Approximate string matching in sublinear expected time
SFCS '90: Proceedings of the 31st Annual Symposium on Foundations of Computer Science

The k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, and the number k of differences (insertions, deletions, substitutions) allowed in a match, and asks for every location in the text ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management
November 2006
916 pages
ISBN:1595934332
DOI:10.1145/1183614
General Chair:
Philip S. Yu
IBM T.J. Watson Research Center (USA)
,
Program Chairs:
Vassilis Tsotras
University of California-Riverside (USA)
,
Edward Fox
Virginia Tech (USA)
,
Bing Liu
University of Illinois at Chicago (USA)
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 November 2006
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
dynamic programming
statistical model
suffix tree
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate1,861of8,427submissions,22%
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1
  Total Citations
  View Citations
- 609
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

An approximate multi-word matching algorithm for robust document retrieval

CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management

ABSTRACT

References

Cited By

Index Terms

Recommendations

An efficient pruning strategy for approximate string matching over suffix tree

A fast longest common subsequence algorithm for similar strings

Approximate string matching in sublinear expected time