Abstract
With the increase in word and text processing computer systems, programs which check and correct spelling will become more and more common. Peterson investigates the basic structure of several such existing programs and their approaches to solving the problems which arise when this type of program is created. The basic framework and background necessary to write a spelling checker or corrector are provided.
- 1 Alberga, C.N. String similarity and misspellings. Comm. ACM 10, 5 (May 1967), 302- 313. This master's thesis reviews previous work, referring to the work of two researchers at IBM Watson who suggest finding the longest common substrings and assigning probabilities based on the portion of the correct string matched. The paper presents rather extensive but unreadable analysis of different algorithms, with no real results. (Reviewed in Comptng. Reviews 8, 5 (Sept.-Oct. 1967), 440- 441.) Google ScholarDigital Library
- 2 Bledsoe W.W., and Browning, 1. Pattern recognition and reading by machine. Proc. Eastern Joint Comptr. Conf., Boston, Mass., Dec. 1959, pp. 225-232. Here, the authors have used a small dictionary with the probability of each word for OCR.Google ScholarDigital Library
- 3 Blair, C.R. A program for correcting spelling errors. Inform. and Control 3, I (March 1960), 60-67. Blair weights the letters to create a four- or five-letter abbreviation for each word. If abbreviations match, the words are assumed to be the same. The possibility (impossibility) of building in rules like "i before e except after c and when like a as in neighbor and weigh" is mentioned.Google ScholarCross Ref
- 4 Bourne, C.P. Frequency and impact of spelling errors in bibliographic data bases. Inform. Processing and Mgmt. 13, 1 (1977), 1- 12. Bourne examines the frequency of spelling errors in a sample drawn from 11 machinereadable bibliographic databases. He finds that spelling errors are severe enough to influence the search strategy to find information in the database. Errors are not only in the input queries, but also in the database itself.Google ScholarCross Ref
- 5 Carlson, G. Techniques for replacing characters that are garbled on input. Proc. 1966 Spring Joint Comptr. Conf., AFIPS Press, Arlington, Va., 1966, pp. 189-192. The author uses trigrams to correct the OCR input of genealogical records.Google ScholarDigital Library
- 6 Cornew, R.W. A statistical method of spelling correction. Inform. and Control 12, 2 (Feb. 1968), 79-93. Cornew employs digrams first, then a dictionary search to correct one character substitutions. The dictionary already exists for a speech output problem.Google ScholarCross Ref
- 7 Damerau, F.J. A technique for computer detection and correction of spelling errors. Comm. ACM 7, 3 (March 1964), 171-176. The four errors (wrong, missing, extra, and transposed) are mentioned here as accounting for 80 percent of errors. Damerau uses a bit vector for preliminary compare (bit {i} is on if letter i is in word). (Reviewed in Comptng. Reviews 5, 4 (July-Aug. 1964), 219.) Google ScholarDigital Library
- 8 Davidson, L. Retrieval of misspelled names in an airlines passenger record system. Comm. ACM 5, 3 (March 1962), 169-171. The author abbreviates a name to match a stored one. Either name (token or dictionary) may be wrong. Google ScholarDigital Library
- 9 Fisher, E.G. The use of context in character recognition. Tech. Rep. 76-12, Dept. of Comptr. and Inform. Sci., Univ. of Massachusetts, Amherst, July 1976. Fisher considers the problems of automatically reading addresses from envelopes in the Post Office and also Morse code recognition.Google Scholar
- 10 Freeman, D.N. Error correction in CORC: The Cornell computing language. Ph.D. Th., Dept. of Comptr. Sci., Cornell Univ., Ithaca, N.Y., Sept. 1963.Google Scholar
- 11 Galli, E.J. and Yamada, H.M. An automatic dictionary and verification of machinereadable text. IBM Syst. J. 6, 3 (1967), 192- 207. This article is a good discussion of the general problem of token identification and verification.Google ScholarDigital Library
- 12 Galli, E.J., and Yamada, H.M. Experimental studies in computer-assisted correction of unorthographic text. IEEE Trans. Engineering Writing and Speech EWS-11, 2 (Aug. 1968), 75-84. Galli and Yamada provide a good review and explanation of techniques and problems.Google ScholarCross Ref
- 13 Giangardella, J.J., Hudson, J., and Roper, R.S. Spelling correction by vector representation using a digital computer. 1EEE Trans. Engineering Writing and Speech EWS-IO, 2 (Dec. 1967), 57-62. The authors define hash functions to give a vector representation of a word as norm, angle, and distance. This speeds search time (over linear search) and aids in localizing the search for correct spellings since interchanged characters have the same norm and the extra or deleted letter is within fixed range.Google ScholarCross Ref
- 14 Glantz, H.T. On the recognition of information with a digital computer. J. A CM 4, 2 (April 1957), 178-188. Glantz seems to want either exact match or greatest number of equal characters in equal positions; his approach is good for OCR. Google ScholarDigital Library
- 15 Hanson, A.R., Riseman, E.M., and Fisher, E.G. Context in word recognition. Pattern Recognition 8, 1 (Jan. 1976), 35~,5. The authors suggest positional binary trigrams to correct words with wrong letters. See also {33}.Google ScholarCross Ref
- 16 Harmon, L.D. Automatic reading of cursive script. Proc. Symp. on Optical Character Recognition, Spartan Books, Washington, D.C., Jan. 1962, pp. 151-152. Harmon uses digrams and a "confusion matrix" to give the probability of letter substitutions.Google Scholar
- 17 Harmon, L.D. Automatic recognition of print and script. Proc. 1EEE 60, 10 (Oct. 1972) 1165-1176. Here, Harmon surveys the techniques for computer input of print, including a section on error detection and correction. He indicates that digrams can catch 70 percent of incorrect letter errors.Google ScholarCross Ref
- 18 James, E.B., and Partridge, D.P. Tolerance to inaccuracy in computer programs. Comptr. J. 19, 3 (Aug. 1976), 207-212. The authors describe an approach to correcting spelling errors in Fortran key words.Google Scholar
- 19 Knuth, D.E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973. This is a classic reference on sorting and searching methods. Google ScholarDigital Library
- 20 Kucera H., and Francis, W.N. Computational Analysis of Present-Day American English. Brown Univ. Press, Providence, R.I., 1967. The authors give frequency and statistical information for the Brown Corpus of over a million tokens.Google Scholar
- 21 Leslie, L.A. 20,000 Words. McGraw-Hill, N.Y., 1977. This work is representative of several books which list words.Google Scholar
- 22 Lowrance, R., and Wagner, R.A. An extension of the string-to-string correction problem. J. ACM 22, 2 (April 1975), 175-183. This article extends the work of Wagner and Fischer {431 to include adjacent transpositions as an edit operation. Google ScholarDigital Library
- 23 McElwain, C.K., and Evans, M.E. The degarbler a program for correcting machineread Morse code. Inform. and Control 5, 4 (Dec. 1962), 368-384. The authors' program uses digrams, trigrams, and a dictionary to correct up to 70 percent of errors in machine recognized Morse code. It also uses 5 special rules for the types of errors which can occur (dot interpreted as dash, etc.)Google ScholarCross Ref
- 24 McMahon, L.E., Cherry, L.L., and Morris, R. Statistical text processing. Bell Syst. Tech. J. 57, 6, part 2 (July-August 1978), 2137-2154. McMahon et al. provide a good description of how computer systems can be used to process text, including spelling correction and an attempt at a syntax checker.Google ScholarCross Ref
- 25 Morgan, H.L. Spelling correction in systems programs. Comm. ACM 13, 2 (Feb. 1970), 90-94. Morgan discusses the use of spelling correction for compilers and operating system JCL. He uses a dictionary with the four classes of errors and also notes that it is possible to use syntax and semantics to narrow search space. He reports on the CORC compiler {10} which associated a probability with each possible misspelling. Google ScholarDigital Library
- 26 Morris, R., and Cherry, L.L. Computer detection of typograhical errors. IEEE Trans. Professional Comm. PC-18, 1 (March 1975), 54-64. Morris and Cherry describe the TYPO program for the UNIX system.Google ScholarCross Ref
- 27 Muth, F., and Tharp, A.L. Correcting human error in alphanumeric terminal input. Inform. Processing and Mgmt. 13, 6 (1977), 329-337. This paper suggests using a tree structure (like a trie) with special search procedures to find corrections. Damerau's review points out that their search strategies need improvement and that their tree is much too big to be practical. Each node of the tree has one character (data) and three pointers (father, brother, son). (Reviewed in Comptng. Reviews 19, 6 (June 1978), 231.)Google ScholarCross Ref
- 28 O'Brien, J.A. Computer program for automatic spelling correction. Tech. Rep. RADC-TR-66-696, Rome Air Development Ctr., New York, March 1967. This report describes an early prototype of a spelling correction program designed to correct input from an OCR reader. It was implemented on a CDC 160A with special hardware.Google Scholar
- 29 Okuda, T., Tanaka, E., and Kasai, T. A method for the correction of garbled words based on the Levenshtein metric. 1EEE Trans. Comptrs. C-25, 2 (Feb. 1976), 172-177. The authors suggest a method for correcting an incorrectly spelled token.Google Scholar
- 30 Partridge, D.P., and James, E.B. Natural information processing. Internat. J. Man-Machine Studies 6, 2 (March 1974), 205-235. Here, the authors use a tree structure representation of words to allow checks for incorrect input words. This is done in the context of correcting key words in a Fortran program, but more is there. Frequencies are kept with tree branches to allow the tree to modify itself to optimize search.Google ScholarCross Ref
- 31 Peterson, J.L. Design of a spelling program: An experiment in program design. Lecture Notes in Comptr. ScL 96, Springer-Verlag, New York, Oct. 1980. In this work the complete top-down design of a program to detect and correct spelling errors is given. The design is machine-independent, but directly results in a Pascal program which implements the design.Google Scholar
- 32 Riseman, E.M., and Ehrich, R.W. Contextual word recognition using binary digrams. IEEE Trans. Comptrs. C-20, 4 (April 1971), 397403. The authors indicate that the important property of digrams is only their zero or nonzero nature.Google Scholar
- 33 Riseman, E.M., and Hanson, A.R. A contextual postprocessing system for error correction using binary n-grams. IEEE Trans. Comptrs. C-23, 5 (May 1974), 480--493. Riseman and Hanson suggest using digrams (2- grams), trigrams (3-grams), or in general ngrams, but only storing whether the probability is zero or nonzero (1 bit). They also favor using positional n-grams which means a separate n-gram table for each pair of positions is kept (for each i and j we have the digram table for characters in position i and position j in a word).Google ScholarDigital Library
- 34 Rosenbaum, W.S., and Hilliard, J.J. Multifont OCR postprocessing system. IBM J. Res. and Develop. 19, 4 (July 1975), 398-421. This paper focuses very specifically on OCR problems. Searching with a match-any character is discussed.Google ScholarDigital Library
- 35 Sheil, B.A. Median split trees: A fast lookup technique for frequently occurring keys. Comm. ACM 21, 11 (Nov. 1978), 947-958. Sheil defines a tree data structure for searching and applies this, as an example, to a dictionary of common English words. Google ScholarDigital Library
- 36 Szanser, A.J. Error-correcting methods in natural language processing. Inform. Processing 68, North-Holland, Amsterdam, Aug. 1968, pp. 1412-1416. This is a confused paper dealing with correction for machine translation and automatic interpretation of shorthand transcript tapes; it suggests using "elastic" matching.Google Scholar
- 37 Szanser, A.J. Automatic error-correction in natural languages. Inform. Storage and Retrieval 5, 4 (Feb. 1970), 169-174.Google ScholarCross Ref
- 38 Tanaka, E., and Kasai, T. Correcting method of garbled languages using ordered key letters. Electronics and Comm. in Japan 55, 6 (1972), 127-133.Google Scholar
- 39 Tenczar, P.J., and Golden, W.W. Spelling, word and concept recognition. Rep. CERL-X-35, Univ. of Illinois, Urbana, II1., Oct. 1962.Google Scholar
- 40 Thomas, R.B., and Kassler, M. Character recognition in context. Inform. and Control 10, 1 (Jan. 1967), 43-64. Thomas and Kassler consider tetragrams (sequences of 4 letters); of 274 possible tetragrams, only 12 percent (61,273) are legal.Google ScholarCross Ref
- 41 Thorelli, L. Automatic correction of errors in text. BIT 2, 1 (1962), 45-62. This paper is a kind of survey/tutorial. It mentions diagrams and dictionary look-up and suggests maximizing probabilities.Google ScholarCross Ref
- 42 Vossler, C.M. and Branston, N.M. The use of context for correcting garbled English text. Proc. 19th ACM Nat. Convention, Aug. 1964, pp. D2.4-I-D2.4-13. The authors use confusion matrix and word probabilities to select the most probable input word. They also use digrams and strive to improve OCR input. Google ScholarDigital Library
- 43 Wagner, R.A., and Fischer, M.J. The string-to-string correction problem. J. A CM 21, 1 (Jan. 1974), 168-173. Here, an algorithm is presented for determining the similarity of two strings as the minimum number of edit operations to transform one into the other. The allowed edit operations are add, delete, Google ScholarDigital Library
- 44 Wong, C.K., and Chandra, A.K. Bounds for the string editing problem. J. A CM 23, 1 (Jan. 1976), 13-16. Wong and Chandra show that the complexity bounds of {43} are not 3nly sufficient but also necessary. Google ScholarDigital Library
Recommendations
Context-aware correction of spelling errors in Hungarian medical documents
HighlightsWe propose two methods to automatically correct Hungarian clinical text.Method 1 generates a ranked list of correction candidates disregarding context.Method 2 uses an SMT decoder to implement context-aware error correction.Method 1 is ...
Word2Vec based spelling correction method of Twitter message
SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied ComputingTwitter1 became popular owing to the devices like smartphones and tablets, with which short messages can be easily composed. Due to the popularity of Twitter, the volume of Twitter messages has increased rapidly. Accordingly, studies have been carried ...
Context-aware correction of spelling errors in hungarian medical documents
SLSP'13: Proceedings of the First international conference on Statistical Language and Speech ProcessingIn our paper, we present a method for automated correction of spelling errors in Hungarian clinical records. We model the problem of spelling correction as a translation task, where the source language is the erroneous text and the target language is ...
Comments