ABSTRACT
Document generation from low level data and its utilization is one of the most challenging tasks in document engineering. Word occurrence detection is a fundamental problem in the recognized document utilization obtained by a recognizer, such as OCR and speech recognition. Given a set of words, such as a dictionary, this paper proposes an efficient dynamic programming (DP) algorithm to find the occurrences of each word in a text. In this paper, the string similarity is measured by a statistical similarity model that enables a definition of the similarities in the character level as well as edit operation level. The proposed algorithm uses tree structures to measure similarities in order to avoid measuring similarities of the same substrings appearing in different parts of the text and words. The time complexity of the proposed algorithm is O(|W|⋅|S|⋅|Q|), where |W| (resp. |S|) denote the number of nodes in the trees representing the word set (resp. the text), and |Q| donotes the number of the states of the model used for string similarity. This paper shows the proposed algorithm is experimentally about six times faster than a naive DP algorithm.
- R. Baeza-Yates and G. Navarro. New and faster filters for multiple approximate string matching. Random Structures and Algorithms, 20(1):23--49, 2002.]] Google ScholarDigital Library
- R. Baeza-Yates and B. Ribeiro-Neto. "Modern Information Retrieval". ACM Press, 1999.]] Google ScholarDigital Library
- W. Chang and T. Marr. "Theoretical and Empirical Comparisons of Approximate String Matching Algorithms". In Proc. of 3rd Annual Symposium on Combinatorial Pattern Matching, pages 172--181, 1992.]] Google ScholarDigital Library
- E. Chávez and G. Navarro. "A Metric Index for Approximate String Matching". In Proc. of LATIN, pages 181--195, 2002.]] Google ScholarDigital Library
- S. Inenaga, A. Shinohara, and M. Takeda. A fully compressed pattern matching algorithm for simple collage systems. Foundations of Computer Science, 16(6):1155--1166, 2005.]]Google ScholarCross Ref
- S. Kahan, T. Pavlidis, and H. S. Baird. On the recognition of printed characters of any font and size. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(2):274--288, March 1987.]] Google ScholarDigital Library
- K. Kukich. "Techniques for Automatically Correcting Words in Text". ACM Computing Surveys, 24(4):377--439, 1992.]] Google ScholarDigital Library
- G. Landau and U. Vishkin. "Fast Parallel and Serial Approximate String Matching". Journal of Algorithms, 10:157--169, 1989.]] Google ScholarDigital Library
- Y. Li, D. Lopresti, and A. Tomkins. "Validation of Document Image Defect Models for Optical Character Recognition". In Proc. of 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 137--150, 1994.]]Google Scholar
- J. Mamou and R. H. D. Carmel. Spoken document retrieval from call-center conversations. In Proc. of 29th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 51--58, 2006.]] Google ScholarDigital Library
- A. Myka and U. Guntzer. "Fuzzy Full-Text Searches in OCR Database". In Proc. of Forum on Research & Technology Advances in Digital Libraries, pages 87--100, 1995.]] Google ScholarDigital Library
- G. Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001.]] Google ScholarDigital Library
- G. Navarro and R. Baeza-Yates. A practical q-gram index for text retrieval allowing errors. CLEI Electron, 1(2):31--88, 1998.]]Google Scholar
- M. Ohta, A. Takasu, and J. Adachi. "Probabilistic Automaton Model for Fuzzy English-text Retrieval". In Lecture Notes in Computer Science 1923, pages 35--44, 2000.]] Google ScholarDigital Library
- E. S. Ristad and P. N. Yianilos. Learning string-edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):522--532, 1998.]] Google ScholarDigital Library
- S. Srinivasan and G. Petkovic. Phonetic confusion matrix based spoken document retrieval. In Proc. of 23rd Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 81--87, 2000.]] Google ScholarDigital Library
- K. Taghva, J. Borsack, and A. Condit. Results of applying probabilistic ir to ocr text. In Proc. of 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 202--211, 1994.]] Google ScholarDigital Library
- K. Taghva, J. Borsack, and A. Condit. Effects of ocr errors on ranking and feedback using the vector space model. Information Processing and Management, 32(3):317--327, 1996.]] Google ScholarDigital Library
- K. Taghva, J. Borsack, and A. Condit. Evaluation of model-based retrieval effectiveness with ocr text. ACM Transaction on Information Systems, 14(1):64--93, January 1996.]] Google ScholarDigital Library
- A. Takasu. "Bibliographic Attribute Extraction from Erroneous References Based on a Statistical Model". In Proc. of 3rd ACM & IEEE Joint Conference on Digital Libraries, pages 49--60, 2003.]] Google ScholarDigital Library
- A. Takasu and K. Aihara. "DVHMM: Variable Length Text Recognition Error Model". In Proc. of 15th International Conference on Pattern Recognition, pages 110--114, 2002.]] Google ScholarDigital Library
- Y. Tanaka and H. Torii. "Transmedia Machine and its Keyword Search over Image Texts". In Proc. of RIAO 88, pages 248--258, 1988.]]Google Scholar
- E. Ukkonen. "Algorithms for Approximate String Matching". Information and Control, 64:100--118, 1985.]] Google ScholarDigital Library
- E. Ukkonen. "Approximate String Matching with q-grams and Maximal Matches". Theoretical Computer Science, 1:191--211, 1985.]] Google ScholarDigital Library
- G. Valiente. "Algorithms on Trees and Graphs". Springer, 2002.]] Google ScholarDigital Library
- M. Wechsler, E. Munteanu, and P. Schauble. New techniques for open-vocabulary spoken document retrieval. In Proc. of 21st Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, pages 20--27, 1998.]] Google ScholarDigital Library
- N. Ziviani, E. S. Moura, G. Navarro, and R. Baeza-Yates. Compression: A key for next-generation text retrieval systems. IEEE Computer, 33(11):37--44, 2000.]] Google ScholarDigital Library
Index Terms
- An approximate multi-word matching algorithm for robust document retrieval
Recommendations
An efficient pruning strategy for approximate string matching over suffix tree
Approximate string matching over suffix tree with depth-first search (ASM_ST_DFS), a classical algorithm in the field of approximate string matching, was originally proposed by Ricardo A. Baeza-Yates and Gaston H. Gonnet in 1990. The algorithm is one of ...
A fast longest common subsequence algorithm for similar strings
LATA'10: Proceedings of the 4th international conference on Language and Automata Theory and ApplicationsThe longest common subsequence problem is a very important computational problem for which there are many algorithms. We present a new algorithm for this problem. Let X and Y be any two given strings each of length O(n). We observe that a longest common ...
Approximate string matching in sublinear expected time
SFCS '90: Proceedings of the 31st Annual Symposium on Foundations of Computer ScienceThe k differences approximate string matching problem specifies a text string of length n, a pattern string of length m, and the number k of differences (insertions, deletions, substitutions) allowed in a match, and asks for every location in the text ...
Comments