A new generation of textual corpora: mining corpora from very large collections
Source
International Conference on Digital Libraries
Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries
Vancouver, BC, Canada
SESSION: Large-scale collections
Pages: 356-365
Year of Publication: 2007
ISBN: 978-1-59593-644-8
Authors
Gordon Stewart, Harvard University, Cambridge, MA
Gregory Crane, Tufts University, Medford, MA
Alison Babeu, Tufts University, Medford, MA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1255175.1255247

ABSTRACT

While digital libraries based on page images and automatically generated text have made possible massive projects such as the Million Book Library, the Open Content Alliance, Google, and others, humanists still depend upon textual corpora expensively produced with labor-intensive methods such as double-keyboarding and manual correction. This paper reports the results of an analysis of OCR-generated text for classical Greek source texts. Classicists have depended upon specialized manual keyboarding, which costs two or more times as much as keyboarding English, both for its accuracy and because classical Greek OCR produced no usable results. We found that we could produce texts by OCR that, in some cases, approached the 99.95% accuracy rate of professional data entry. In most cases, OCR-generated text, by including the variant readings that digital corpora have traditionally left out, provides better recall and, we argue, can better serve many scholarly needs than the expensive corpora upon which classicists have relied for a generation. As digital collections expand, we will be able to collate multiple editions against each other, identify quotations of primary sources, and provide a new generation of services.


