Thumbnail
Access Restriction
Subscribed

Author Bamman, David ♦ Smith, David
Source ACM Digital Library
Content type Text
Publisher Association for Computing Machinery (ACM)
File Format PDF
Copyright Year ©2012
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Abstract With the rise of large open digitization projects such as the Internet Archive and Google Books, we are witnessing an explosive growth in the number of source texts becoming available to researchers in historical languages. The Internet Archive alone contains over 27,014 texts catalogued as Latin, including classical prose and poetry written under the Roman Empire, ecclesiastical treatises from the Middle Ages, and dissertations from 19th-century Germany written—in Latin—on the philosophy of Hegel. At one billion words, this collection eclipses the extant corpus of Classical Latin by several orders of magnitude. In addition, the much larger collection of books in English, German, French, and other languages already scanned contains unknown numbers of translations for many Latin books, or parts of books. The sheer scale of this collection offers a broad vista of new research questions, and we focus here on both the opportunities and challenges of computing over such a large space of heterogeneous texts. The works in this massive collection do not constitute a finely curated (or much less balanced) corpus of Latin; it is, instead, simply $\textit{all}$ the Latin that can be extracted, and in its reach of twenty-one centuries (from approximately 200 BCE to 1922 CE) arguably spans the greatest historical distance of any major textual collection today. While we might hope that the size and historical reach of this collection can eventually offer insight into grand questions such as the evolution of a language over both time and space, we must contend as well with the noise inherent in a corpus that has been assembled with minimal human intervention.
ISSN 15564673
Age Range 18 to 22 years ♦ above 22 year
Educational Use Research
Education Level UG and PG
Learning Resource Type Article
Publisher Date 2012-04-01
Publisher Place New York
e-ISSN 15564711
Journal Journal on Computing and Cultural Heritage (JOCCH)
Volume Number 5
Issue Number 1
Page Count 13
Starting Page 1
Ending Page 13


Open content in new tab

   Open content in new tab
Source: ACM Digital Library