ABSTRACT
While digital libraries based on page images and automatically generated text have made possible massive projects such as the Million Book Library, Open Content Alliance, and Google, humanists still depend upon textual corpora expensively produced with labor-intensive methods such as double-keyboarding and manual correction. This paper reports the results of an analysis of OCR-generated text for classical Greek source texts. Classicists have depended upon specialized manual keyboarding, which costs two or more times as much as keyboarding of English, both for the sake of accuracy and because classical Greek OCR produced no usable results. We found that we could produce texts by OCR that, in some cases, approached the 99.95% accuracy rate of professional data entry. In most cases, OCR-generated text yielded results that, by including the variant readings that digital corpora have traditionally left out, provide better recall and, we argue, can serve many scholarly needs better than the expensive corpora upon which classicists have relied for a generation. As digital collections expand, we will be able to collate multiple editions against each other, identify quotations of primary sources, and provide a new generation of services.