article

Free Access

Computer programs for detecting and correcting spelling errors

Author:
James L. Peterson

Univ. of Texas, Austin

Univ. of Texas, Austin
View Profile

Authors Info & Claims

Communications of the ACM Volume 23 Issue 12Dec. 1980pp 676–687https://doi.org/10.1145/359038.359041

Published:01 December 1980Publication History

Communications of the ACM

Abstract

With the increase in word and text processing computer systems, programs which check and correct spelling will become more and more common. Peterson investigates the basic structure of several such existing programs and their approaches to solving the problems which arise when this type of program is created. The basic framework and background necessary to write a spelling checker or corrector are provided.

References

1 Alberga, C.N. String similarity and misspellings. Comm. ACM 10, 5 (May 1967), 302- 313. This master's thesis reviews previous work, referring to the work of two researchers at IBM Watson who suggest finding the longest common substrings and assigning probabilities based on the portion of the correct string matched. The paper presents rather extensive but unreadable analysis of different algorithms, with no real results. (Reviewed in Comptng. Reviews 8, 5 (Sept.-Oct. 1967), 440- 441.) Google ScholarDigital Library
2 Bledsoe W.W., and Browning, 1. Pattern recognition and reading by machine. Proc. Eastern Joint Comptr. Conf., Boston, Mass., Dec. 1959, pp. 225-232. Here, the authors have used a small dictionary with the probability of each word for OCR.Google ScholarDigital Library
3 Blair, C.R. A program for correcting spelling errors. Inform. and Control 3, I (March 1960), 60-67. Blair weights the letters to create a four- or five-letter abbreviation for each word. If abbreviations match, the words are assumed to be the same. The possibility (impossibility) of building in rules like "i before e except after c and when like a as in neighbor and weigh" is mentioned.Google ScholarCross Ref
4 Bourne, C.P. Frequency and impact of spelling errors in bibliographic data bases. Inform. Processing and Mgmt. 13, 1 (1977), 1- 12. Bourne examines the frequency of spelling errors in a sample drawn from 11 machinereadable bibliographic databases. He finds that spelling errors are severe enough to influence the search strategy to find information in the database. Errors are not only in the input queries, but also in the database itself.Google ScholarCross Ref
5 Carlson, G. Techniques for replacing characters that are garbled on input. Proc. 1966 Spring Joint Comptr. Conf., AFIPS Press, Arlington, Va., 1966, pp. 189-192. The author uses trigrams to correct the OCR input of genealogical records.Google ScholarDigital Library
6 Cornew, R.W. A statistical method of spelling correction. Inform. and Control 12, 2 (Feb. 1968), 79-93. Cornew employs digrams first, then a dictionary search to correct one character substitutions. The dictionary already exists for a speech output problem.Google ScholarCross Ref
7 Damerau, F.J. A technique for computer detection and correction of spelling errors. Comm. ACM 7, 3 (March 1964), 171-176. The four errors (wrong, missing, extra, and transposed) are mentioned here as accounting for 80 percent of errors. Damerau uses a bit vector for preliminary compare (bit {i} is on if letter i is in word). (Reviewed in Comptng. Reviews 5, 4 (July-Aug. 1964), 219.) Google ScholarDigital Library
8 Davidson, L. Retrieval of misspelled names in an airlines passenger record system. Comm. ACM 5, 3 (March 1962), 169-171. The author abbreviates a name to match a stored one. Either name (token or dictionary) may be wrong. Google ScholarDigital Library
9 Fisher, E.G. The use of context in character recognition. Tech. Rep. 76-12, Dept. of Comptr. and Inform. Sci., Univ. of Massachusetts, Amherst, July 1976. Fisher considers the problems of automatically reading addresses from envelopes in the Post Office and also Morse code recognition.Google Scholar
10 Freeman, D.N. Error correction in CORC: The Cornell computing language. Ph.D. Th., Dept. of Comptr. Sci., Cornell Univ., Ithaca, N.Y., Sept. 1963.Google Scholar
11 Galli, E.J. and Yamada, H.M. An automatic dictionary and verification of machinereadable text. IBM Syst. J. 6, 3 (1967), 192- 207. This article is a good discussion of the general problem of token identification and verification.Google ScholarDigital Library
12 Galli, E.J., and Yamada, H.M. Experimental studies in computer-assisted correction of unorthographic text. IEEE Trans. Engineering Writing and Speech EWS-11, 2 (Aug. 1968), 75-84. Galli and Yamada provide a good review and explanation of techniques and problems.Google ScholarCross Ref
13 Giangardella, J.J., Hudson, J., and Roper, R.S. Spelling correction by vector representation using a digital computer. 1EEE Trans. Engineering Writing and Speech EWS-IO, 2 (Dec. 1967), 57-62. The authors define hash functions to give a vector representation of a word as norm, angle, and distance. This speeds search time (over linear search) and aids in localizing the search for correct spellings since interchanged characters have the same norm and the extra or deleted letter is within fixed range.Google ScholarCross Ref
14 Glantz, H.T. On the recognition of information with a digital computer. J. A CM 4, 2 (April 1957), 178-188. Glantz seems to want either exact match or greatest number of equal characters in equal positions; his approach is good for OCR. Google ScholarDigital Library
15 Hanson, A.R., Riseman, E.M., and Fisher, E.G. Context in word recognition. Pattern Recognition 8, 1 (Jan. 1976), 35~,5. The authors suggest positional binary trigrams to correct words with wrong letters. See also {33}.Google ScholarCross Ref
16 Harmon, L.D. Automatic reading of cursive script. Proc. Symp. on Optical Character Recognition, Spartan Books, Washington, D.C., Jan. 1962, pp. 151-152. Harmon uses digrams and a "confusion matrix" to give the probability of letter substitutions.Google Scholar
17 Harmon, L.D. Automatic recognition of print and script. Proc. 1EEE 60, 10 (Oct. 1972) 1165-1176. Here, Harmon surveys the techniques for computer input of print, including a section on error detection and correction. He indicates that digrams can catch 70 percent of incorrect letter errors.Google ScholarCross Ref
18 James, E.B., and Partridge, D.P. Tolerance to inaccuracy in computer programs. Comptr. J. 19, 3 (Aug. 1976), 207-212. The authors describe an approach to correcting spelling errors in Fortran key words.Google Scholar
19 Knuth, D.E. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973. This is a classic reference on sorting and searching methods. Google ScholarDigital Library
20 Kucera H., and Francis, W.N. Computational Analysis of Present-Day American English. Brown Univ. Press, Providence, R.I., 1967. The authors give frequency and statistical information for the Brown Corpus of over a million tokens.Google Scholar
21 Leslie, L.A. 20,000 Words. McGraw-Hill, N.Y., 1977. This work is representative of several books which list words.Google Scholar
22 Lowrance, R., and Wagner, R.A. An extension of the string-to-string correction problem. J. ACM 22, 2 (April 1975), 175-183. This article extends the work of Wagner and Fischer {431 to include adjacent transpositions as an edit operation. Google ScholarDigital Library
23 McElwain, C.K., and Evans, M.E. The degarbler a program for correcting machineread Morse code. Inform. and Control 5, 4 (Dec. 1962), 368-384. The authors' program uses digrams, trigrams, and a dictionary to correct up to 70 percent of errors in machine recognized Morse code. It also uses 5 special rules for the types of errors which can occur (dot interpreted as dash, etc.)Google ScholarCross Ref
24 McMahon, L.E., Cherry, L.L., and Morris, R. Statistical text processing. Bell Syst. Tech. J. 57, 6, part 2 (July-August 1978), 2137-2154. McMahon et al. provide a good description of how computer systems can be used to process text, including spelling correction and an attempt at a syntax checker.Google ScholarCross Ref
25 Morgan, H.L. Spelling correction in systems programs. Comm. ACM 13, 2 (Feb. 1970), 90-94. Morgan discusses the use of spelling correction for compilers and operating system JCL. He uses a dictionary with the four classes of errors and also notes that it is possible to use syntax and semantics to narrow search space. He reports on the CORC compiler {10} which associated a probability with each possible misspelling. Google ScholarDigital Library
26 Morris, R., and Cherry, L.L. Computer detection of typograhical errors. IEEE Trans. Professional Comm. PC-18, 1 (March 1975), 54-64. Morris and Cherry describe the TYPO program for the UNIX system.Google ScholarCross Ref
27 Muth, F., and Tharp, A.L. Correcting human error in alphanumeric terminal input. Inform. Processing and Mgmt. 13, 6 (1977), 329-337. This paper suggests using a tree structure (like a trie) with special search procedures to find corrections. Damerau's review points out that their search strategies need improvement and that their tree is much too big to be practical. Each node of the tree has one character (data) and three pointers (father, brother, son). (Reviewed in Comptng. Reviews 19, 6 (June 1978), 231.)Google ScholarCross Ref
28 O'Brien, J.A. Computer program for automatic spelling correction. Tech. Rep. RADC-TR-66-696, Rome Air Development Ctr., New York, March 1967. This report describes an early prototype of a spelling correction program designed to correct input from an OCR reader. It was implemented on a CDC 160A with special hardware.Google Scholar
29 Okuda, T., Tanaka, E., and Kasai, T. A method for the correction of garbled words based on the Levenshtein metric. 1EEE Trans. Comptrs. C-25, 2 (Feb. 1976), 172-177. The authors suggest a method for correcting an incorrectly spelled token.Google Scholar
30 Partridge, D.P., and James, E.B. Natural information processing. Internat. J. Man-Machine Studies 6, 2 (March 1974), 205-235. Here, the authors use a tree structure representation of words to allow checks for incorrect input words. This is done in the context of correcting key words in a Fortran program, but more is there. Frequencies are kept with tree branches to allow the tree to modify itself to optimize search.Google ScholarCross Ref
31 Peterson, J.L. Design of a spelling program: An experiment in program design. Lecture Notes in Comptr. ScL 96, Springer-Verlag, New York, Oct. 1980. In this work the complete top-down design of a program to detect and correct spelling errors is given. The design is machine-independent, but directly results in a Pascal program which implements the design.Google Scholar
32 Riseman, E.M., and Ehrich, R.W. Contextual word recognition using binary digrams. IEEE Trans. Comptrs. C-20, 4 (April 1971), 397403. The authors indicate that the important property of digrams is only their zero or nonzero nature.Google Scholar
33 Riseman, E.M., and Hanson, A.R. A contextual postprocessing system for error correction using binary n-grams. IEEE Trans. Comptrs. C-23, 5 (May 1974), 480--493. Riseman and Hanson suggest using digrams (2- grams), trigrams (3-grams), or in general ngrams, but only storing whether the probability is zero or nonzero (1 bit). They also favor using positional n-grams which means a separate n-gram table for each pair of positions is kept (for each i and j we have the digram table for characters in position i and position j in a word).Google ScholarDigital Library
34 Rosenbaum, W.S., and Hilliard, J.J. Multifont OCR postprocessing system. IBM J. Res. and Develop. 19, 4 (July 1975), 398-421. This paper focuses very specifically on OCR problems. Searching with a match-any character is discussed.Google ScholarDigital Library
35 Sheil, B.A. Median split trees: A fast lookup technique for frequently occurring keys. Comm. ACM 21, 11 (Nov. 1978), 947-958. Sheil defines a tree data structure for searching and applies this, as an example, to a dictionary of common English words. Google ScholarDigital Library
36 Szanser, A.J. Error-correcting methods in natural language processing. Inform. Processing 68, North-Holland, Amsterdam, Aug. 1968, pp. 1412-1416. This is a confused paper dealing with correction for machine translation and automatic interpretation of shorthand transcript tapes; it suggests using "elastic" matching.Google Scholar
37 Szanser, A.J. Automatic error-correction in natural languages. Inform. Storage and Retrieval 5, 4 (Feb. 1970), 169-174.Google ScholarCross Ref
38 Tanaka, E., and Kasai, T. Correcting method of garbled languages using ordered key letters. Electronics and Comm. in Japan 55, 6 (1972), 127-133.Google Scholar
39 Tenczar, P.J., and Golden, W.W. Spelling, word and concept recognition. Rep. CERL-X-35, Univ. of Illinois, Urbana, II1., Oct. 1962.Google Scholar
40 Thomas, R.B., and Kassler, M. Character recognition in context. Inform. and Control 10, 1 (Jan. 1967), 43-64. Thomas and Kassler consider tetragrams (sequences of 4 letters); of 274 possible tetragrams, only 12 percent (61,273) are legal.Google ScholarCross Ref
41 Thorelli, L. Automatic correction of errors in text. BIT 2, 1 (1962), 45-62. This paper is a kind of survey/tutorial. It mentions diagrams and dictionary look-up and suggests maximizing probabilities.Google ScholarCross Ref
42 Vossler, C.M. and Branston, N.M. The use of context for correcting garbled English text. Proc. 19th ACM Nat. Convention, Aug. 1964, pp. D2.4-I-D2.4-13. The authors use confusion matrix and word probabilities to select the most probable input word. They also use digrams and strive to improve OCR input. Google ScholarDigital Library
43 Wagner, R.A., and Fischer, M.J. The string-to-string correction problem. J. A CM 21, 1 (Jan. 1974), 168-173. Here, an algorithm is presented for determining the similarity of two strings as the minimum number of edit operations to transform one into the other. The allowed edit operations are add, delete, Google ScholarDigital Library
44 Wong, C.K., and Chandra, A.K. Bounds for the string editing problem. J. A CM 23, 1 (Jan. 1976), 13-16. Wong and Chandra show that the complexity bounds of {43} are not 3nly sufficient but also necessary. Google ScholarDigital Library

Recommendations

Context-aware correction of spelling errors in Hungarian medical documents

HighlightsWe propose two methods to automatically correct Hungarian clinical text.Method 1 generates a ranked list of correction candidates disregarding context.Method 2 uses an SMT decoder to implement context-aware error correction.Method 1 is ...
Read More
Word2Vec based spelling correction method of Twitter message
SAC '19: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing

Twitter¹ became popular owing to the devices like smartphones and tablets, with which short messages can be easily composed. Due to the popularity of Twitter, the volume of Twitter messages has increased rapidly. Accordingly, studies have been carried ...
Read More
Context-aware correction of spelling errors in hungarian medical documents
SLSP'13: Proceedings of the First international conference on Statistical Language and Speech Processing

In our paper, we present a method for automated correction of spelling errors in Hungarian clinical records. We model the problem of spelling correction as a translation task, where the source language is the erroneous text and the target language is ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Communications of the ACM Volume 23, Issue 12
Dec. 1980
53 pages
ISSN:0001-0782
EISSN:1557-7317
DOI:10.1145/359038
Editor:
Robert L. Ashenhurst
Univ. of Chicago, Chicago, IL
Issue’s Table of Contents
Copyright © 1980 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 December 1980
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
spelling
spelling correction
spelling dictionary
spelling programs
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 238
  Total Citations
  View Citations
- 2,652
  Total Downloads
- Downloads (Last 12 months)353
- Downloads (Last 6 weeks)65
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Computer programs for detecting and correcting spelling errors

Communications of the ACM

Abstract

References

Cited By

Recommendations

Context-aware correction of spelling errors in Hungarian medical documents

Word2Vec based spelling correction method of Twitter message

Context-aware correction of spelling errors in hungarian medical documents

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Computer programs for detecting and correcting spelling errors

Communications of the ACM

Abstract

References

Cited By

Recommendations

Context-aware correction of spelling errors in Hungarian medical documents

Word2Vec based spelling correction method of Twitter message

Context-aware correction of spelling errors in hungarian medical documents

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media