skip to main content
article
Free Access

Computer programs for detecting and correcting spelling errors

Published:01 December 1980Publication History
Skip Abstract Section

Abstract

With the increase in word and text processing computer systems, programs which check and correct spelling will become more and more common. Peterson investigates the basic structure of several such existing programs and their approaches to solving the problems which arise when this type of program is created. The basic framework and background necessary to write a spelling checker or corrector are provided.

References

  1. 1 Alberga, C.N. String similarity and misspellings. Comm. ACM 10, 5 (May 1967), 302- 313. This master's thesis reviews previous work, referring to the work of two researchers at IBM Watson who suggest finding the longest common substrings and assigning probabilities based on the portion of the correct string matched. The paper presents rather extensive but unreadable analysis of different algorithms, with no real results. (Reviewed in Comptng. Reviews 8, 5 (Sept.-Oct. 1967), 440- 441.) Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2 Bledsoe W.W., and Browning, 1. Pattern recognition and reading by machine. Proc. Eastern Joint Comptr. Conf., Boston, Mass., Dec. 1959, pp. 225-232. Here, the authors have used a small dictionary with the probability of each word for OCR.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3 Blair, C.R. A program for correcting spelling errors. Inform. and Control 3, I (March 1960), 60-67. Blair weights the letters to create a four- or five-letter abbreviation for each word. If abbreviations match, the words are assumed to be the same. The possibility (impossibility) of building in rules like "i before e except after c and when like a as in neighbor and weigh" is mentioned.Google ScholarGoogle ScholarCross RefCross Ref
  4. 4 Bourne, C.P. Frequency and impact of spelling errors in bibliographic data bases. Inform. Processing and Mgmt. 13, 1 (1977), 1- 12. Bourne examines the frequency of spelling errors in a sample drawn from 11 machinereadable bibliographic databases. He finds that spelling errors are severe enough to influence the search strategy to find information in the database. Errors are not only in the input queries, but also in the database itself.Google ScholarGoogle ScholarCross RefCross Ref
  5. 5 Carlson, G. Techniques for replacing characters that are garbled on input. Proc. 1966 Spring Joint Comptr. Conf., AFIPS Press, Arlington, Va., 1966, pp. 189-192. The author uses trigrams to correct the OCR input of genealogical records.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. 6 Cornew, R.W. A statistical method of spelling correction. Inform. and Control 12, 2 (Feb. 1968), 79-93. Cornew employs digrams first, then a dictionary search to correct one character substitutions. The dictionary already exists for a speech output problem.Google ScholarGoogle ScholarCross RefCross Ref
  7. 7 Damerau, F.J. A technique for computer detection and correction of spelling errors. Comm. ACM 7, 3 (March 1964), 171-176. The four errors (wrong, missing, extra, and transposed) are mentioned here as accounting for 80 percent of errors. Damerau uses a bit vector for preliminary compare (bit {i} is on if letter i is in word). (Reviewed in Comptng. Reviews 5, 4 (July-Aug. 1964), 219.) Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. 8 Davidson, L. Retrieval of misspelled names in an airlines passenger record system. Comm. ACM 5, 3 (March 1962), 169-171. The author abbreviates a name to match a stored one. Either name (token or dictionary) may be wrong. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9 Fisher, E.G. The use of context in character recognition. Tech. Rep. 76-12, Dept. of Comptr. and Inform. Sci., Univ. of Massachusetts, Amherst, July 1976. Fisher considers the problems of automatically reading addresses from envelopes in the Post Office and also Morse code recognition.Google ScholarGoogle Scholar
  10. 10 Freeman, D.N. Error correction in CORC: The Cornell computing language. Ph.D. Th., Dept. of Comptr. Sci., Cornell Univ., Ithaca, N.Y., Sept. 1963.Google ScholarGoogle Scholar
  11. 11 Galli, E.J. and Yamada, H.M. An automatic dictionary and verification of machinereadable text. IBM Syst. J. 6, 3 (1967), 192- 207. This article is a good discussion of the general problem of token identification and verification.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12 Galli, E.J., and Yamada, H.M. Experimental studies in computer-assisted correction of unorthographic text. IEEE Trans. Engineering Writing and Speech EWS-11, 2 (Aug. 1968), 75-84. Galli and Yamada provide a good review and explanation of techniques and problems.Google ScholarGoogle ScholarCross RefCross Ref
  13. 13 Giangardella, J.J., Hudson, J., and Roper, R.S. Spelling correction by vector representation using a digital computer. 1EEE Trans. Engineering Writing and Speech EWS-IO, 2 (Dec. 1967), 57-62. The authors define hash functions to give a vector representation of a word as norm, angle, and distance. This speeds search time (over linear search) and aids in localizing the search for correct spellings since interchanged characters have the same norm and the extra or deleted letter is within fixed range.Google ScholarGoogle ScholarCross RefCross Ref
  14. 14 Glantz, H.T. On the recognition of information with a digital computer. J. A CM 4, 2 (April 1957), 178-188. Glantz seems to want either exact match or greatest number of equal characters in equal positions; his approach is good for OCR. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15 Hanson, A.R., Riseman, E.M., and Fisher, E.G. Context in word recognition. Pattern Recognition 8, 1 (Jan. 1976), 35~,5. The authors suggest positional binary trigrams to correct words with wrong letters. See also {33}.Google ScholarGoogle ScholarCross RefCross Ref
  16. 16 Harmon, L.D. Automatic reading of cursive script. Proc. Symp. on Optical Character Recognition, Spartan Books, Washington, D.C., Jan. 1962, pp. 151-152. Harmon uses digrams and a "confusion matrix" to give the probability of letter substitutions.Google ScholarGoogle Scholar
  17. 17 Harmon, L.D. Automatic recognition of print and script. Proc. 1EEE 60, 10 (Oct. 1972) 1165-1176. Here, Harmon surveys the techniques for computer input of print, including a section on error detection and correction. He indicates that digrams can catch 70 percent of incorrect letter errors.Google ScholarGoogle ScholarCross RefCross Ref
  18. 18 James, E.B., and Partridge, D.P. Tolerance to inaccuracy in computer programs. Comptr. J. 19, 3 (Aug. 1976), 207-212. The authors describe an approach to correcting spelling errors in Fortran key words.Google ScholarGoogle Scholar
  19. 19 Knuth, D.E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973. This is a classic reference on sorting and searching methods. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. 20 Kucera H., and Francis, W.N. Computational Analysis of Present-Day American English. Brown Univ. Press, Providence, R.I., 1967. The authors give frequency and statistical information for the Brown Corpus of over a million tokens.Google ScholarGoogle Scholar
  21. 21 Leslie, L.A. 20,000 Words. McGraw-Hill, N.Y., 1977. This work is representative of several books which list words.Google ScholarGoogle Scholar
  22. 22 Lowrance, R., and Wagner, R.A. An extension of the string-to-string correction problem. J. ACM 22, 2 (April 1975), 175-183. This article extends the work of Wagner and Fischer {431 to include adjacent transpositions as an edit operation. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. 23 McElwain, C.K., and Evans, M.E. The degarbler a program for correcting machineread Morse code. Inform. and Control 5, 4 (Dec. 1962), 368-384. The authors' program uses digrams, trigrams, and a dictionary to correct up to 70 percent of errors in machine recognized Morse code. It also uses 5 special rules for the types of errors which can occur (dot interpreted as dash, etc.)Google ScholarGoogle ScholarCross RefCross Ref
  24. 24 McMahon, L.E., Cherry, L.L., and Morris, R. Statistical text processing. Bell Syst. Tech. J. 57, 6, part 2 (July-August 1978), 2137-2154. McMahon et al. provide a good description of how computer systems can be used to process text, including spelling correction and an attempt at a syntax checker.Google ScholarGoogle ScholarCross RefCross Ref
  25. 25 Morgan, H.L. Spelling correction in systems programs. Comm. ACM 13, 2 (Feb. 1970), 90-94. Morgan discusses the use of spelling correction for compilers and operating system JCL. He uses a dictionary with the four classes of errors and also notes that it is possible to use syntax and semantics to narrow search space. He reports on the CORC compiler {10} which associated a probability with each possible misspelling. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. 26 Morris, R., and Cherry, L.L. Computer detection of typograhical errors. IEEE Trans. Professional Comm. PC-18, 1 (March 1975), 54-64. Morris and Cherry describe the TYPO program for the UNIX system.Google ScholarGoogle ScholarCross RefCross Ref
  27. 27 Muth, F., and Tharp, A.L. Correcting human error in alphanumeric terminal input. Inform. Processing and Mgmt. 13, 6 (1977), 329-337. This paper suggests using a tree structure (like a trie) with special search procedures to find corrections. Damerau's review points out that their search strategies need improvement and that their tree is much too big to be practical. Each node of the tree has one character (data) and three pointers (father, brother, son). (Reviewed in Comptng. Reviews 19, 6 (June 1978), 231.)Google ScholarGoogle ScholarCross RefCross Ref
  28. 28 O'Brien, J.A. Computer program for automatic spelling correction. Tech. Rep. RADC-TR-66-696, Rome Air Development Ctr., New York, March 1967. This report describes an early prototype of a spelling correction program designed to correct input from an OCR reader. It was implemented on a CDC 160A with special hardware.Google ScholarGoogle Scholar
  29. 29 Okuda, T., Tanaka, E., and Kasai, T. A method for the correction of garbled words based on the Levenshtein metric. 1EEE Trans. Comptrs. C-25, 2 (Feb. 1976), 172-177. The authors suggest a method for correcting an incorrectly spelled token.Google ScholarGoogle Scholar
  30. 30 Partridge, D.P., and James, E.B. Natural information processing. Internat. J. Man-Machine Studies 6, 2 (March 1974), 205-235. Here, the authors use a tree structure representation of words to allow checks for incorrect input words. This is done in the context of correcting key words in a Fortran program, but more is there. Frequencies are kept with tree branches to allow the tree to modify itself to optimize search.Google ScholarGoogle ScholarCross RefCross Ref
  31. 31 Peterson, J.L. Design of a spelling program: An experiment in program design. Lecture Notes in Comptr. ScL 96, Springer-Verlag, New York, Oct. 1980. In this work the complete top-down design of a program to detect and correct spelling errors is given. The design is machine-independent, but directly results in a Pascal program which implements the design.Google ScholarGoogle Scholar
  32. 32 Riseman, E.M., and Ehrich, R.W. Contextual word recognition using binary digrams. IEEE Trans. Comptrs. C-20, 4 (April 1971), 397403. The authors indicate that the important property of digrams is only their zero or nonzero nature.Google ScholarGoogle Scholar
  33. 33 Riseman, E.M., and Hanson, A.R. A contextual postprocessing system for error correction using binary n-grams. IEEE Trans. Comptrs. C-23, 5 (May 1974), 480--493. Riseman and Hanson suggest using digrams (2- grams), trigrams (3-grams), or in general ngrams, but only storing whether the probability is zero or nonzero (1 bit). They also favor using positional n-grams which means a separate n-gram table for each pair of positions is kept (for each i and j we have the digram table for characters in position i and position j in a word).Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. 34 Rosenbaum, W.S., and Hilliard, J.J. Multifont OCR postprocessing system. IBM J. Res. and Develop. 19, 4 (July 1975), 398-421. This paper focuses very specifically on OCR problems. Searching with a match-any character is discussed.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. 35 Sheil, B.A. Median split trees: A fast lookup technique for frequently occurring keys. Comm. ACM 21, 11 (Nov. 1978), 947-958. Sheil defines a tree data structure for searching and applies this, as an example, to a dictionary of common English words. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. 36 Szanser, A.J. Error-correcting methods in natural language processing. Inform. Processing 68, North-Holland, Amsterdam, Aug. 1968, pp. 1412-1416. This is a confused paper dealing with correction for machine translation and automatic interpretation of shorthand transcript tapes; it suggests using "elastic" matching.Google ScholarGoogle Scholar
  37. 37 Szanser, A.J. Automatic error-correction in natural languages. Inform. Storage and Retrieval 5, 4 (Feb. 1970), 169-174.Google ScholarGoogle ScholarCross RefCross Ref
  38. 38 Tanaka, E., and Kasai, T. Correcting method of garbled languages using ordered key letters. Electronics and Comm. in Japan 55, 6 (1972), 127-133.Google ScholarGoogle Scholar
  39. 39 Tenczar, P.J., and Golden, W.W. Spelling, word and concept recognition. Rep. CERL-X-35, Univ. of Illinois, Urbana, II1., Oct. 1962.Google ScholarGoogle Scholar
  40. 40 Thomas, R.B., and Kassler, M. Character recognition in context. Inform. and Control 10, 1 (Jan. 1967), 43-64. Thomas and Kassler consider tetragrams (sequences of 4 letters); of 274 possible tetragrams, only 12 percent (61,273) are legal.Google ScholarGoogle ScholarCross RefCross Ref
  41. 41 Thorelli, L. Automatic correction of errors in text. BIT 2, 1 (1962), 45-62. This paper is a kind of survey/tutorial. It mentions diagrams and dictionary look-up and suggests maximizing probabilities.Google ScholarGoogle ScholarCross RefCross Ref
  42. 42 Vossler, C.M. and Branston, N.M. The use of context for correcting garbled English text. Proc. 19th ACM Nat. Convention, Aug. 1964, pp. D2.4-I-D2.4-13. The authors use confusion matrix and word probabilities to select the most probable input word. They also use digrams and strive to improve OCR input. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. 43 Wagner, R.A., and Fischer, M.J. The string-to-string correction problem. J. A CM 21, 1 (Jan. 1974), 168-173. Here, an algorithm is presented for determining the similarity of two strings as the minimum number of edit operations to transform one into the other. The allowed edit operations are add, delete, Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. 44 Wong, C.K., and Chandra, A.K. Bounds for the string editing problem. J. A CM 23, 1 (Jan. 1976), 13-16. Wong and Chandra show that the complexity bounds of {43} are not 3nly sufficient but also necessary. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Communications of the ACM
    Communications of the ACM  Volume 23, Issue 12
    Dec. 1980
    53 pages
    ISSN:0001-0782
    EISSN:1557-7317
    DOI:10.1145/359038
    Issue’s Table of Contents

    Copyright © 1980 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 1 December 1980

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader