Extracting two thousand years of Latin from a million book library

Published: 27 April 2012

Abstract

With the rise of large open digitization projects such as the Internet Archive and Google Books, we are witnessing an explosive growth in the number of source texts becoming available to researchers in historical languages. The Internet Archive alone contains over 27,014 texts catalogued as Latin, including classical prose and poetry written under the Roman Empire, ecclesiastical treatises from the Middle Ages, and dissertations from 19th-century Germany written (in Latin) on the philosophy of Hegel. At one billion words, this collection eclipses the extant corpus of Classical Latin by several orders of magnitude. In addition, the much larger collection of books in English, German, French, and other languages already scanned contains unknown numbers of translations of many Latin books, or parts of books.
The sheer scale of this collection offers a broad vista of new research questions, and we focus here on both the opportunities and challenges of computing over such a large space of heterogeneous texts. The works in this massive collection do not constitute a finely curated (much less balanced) corpus of Latin; the collection is, instead, simply all the Latin that can be extracted, and in its reach of twenty-one centuries (from approximately 200 BCE to 1922 CE) it arguably spans the greatest historical distance of any major textual collection today. While we might hope that the size and historical reach of this collection can eventually offer insight into grand questions such as the evolution of a language over both time and space, we must contend as well with the noise inherent in a corpus that has been assembled with minimal human intervention.

Information

Published In

Journal on Computing and Cultural Heritage, Volume 5, Issue 1
April 2012
53 pages
ISSN:1556-4673
EISSN:1556-4711
DOI:10.1145/2160165
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 April 2012
Accepted: 01 June 2011
Received: 01 February 2011
Published in JOCCH Volume 5, Issue 1

Qualifiers

  • Research-article
  • Research
  • Refereed
