Extracting structured data from unstructured or semi-structured data has been a long-standing problem in the digital humanities. Humanities data is inherently, and in most respects I would say wonderfully, diverse. Texts, performances, music, art, among many other things – with each example reflecting the weird workings of an individual mind – testify to the amazing creativity of the human spirit.
At the same time, though, they present a thorny problem for those who wish to analyze these artifacts with digital tools. I had a Latin teacher who, when we hit a particularly difficult construction or syntactical problem, used to say something to the effect of, “hit it until it stops moving, then analyze it.” Digital analysis is kind of like that. Digital tools require consistency. Humanistic artifacts resist it. One of the major barriers to the wider adoption of digital tools in the humanities, I suspect, is scholars’ abhorrence of this process of flattening. Most of us were drawn into our fields by the very richness, color, and ineffability of our source material. Digital analysis necessarily bleeds off much of this.
Yet as even the most hardened humanist would grudgingly admit, there is a place for digital analysis. Approaches such as distant reading have demonstrated their usefulness, especially when we approach a corpus of material so large as to be beyond the human capability to read, much less analyze. Some corpora lend themselves better than others to this kind of analysis.
Ancient inscriptions, for example, comprise such a corpus. There are hundreds of thousands of extant Greek and Latin inscriptions from the Greek and Roman worlds alone. These inscriptions, as a whole, have a bit less flourish and variety than literary texts and works of art. Many mark gravesites or honor a donor and follow certain generic formulae. Most are short. Even the longest of inscriptions is dwarfed by a short story. They are also fantastic historical sources, adding an entirely new dimension to what literary texts tell us about the past. Inscriptions, in many respects, comprise an ideal dataset.
Yet it turns out that in many other respects they do not. Many are fragmentary, contain unrecoverable gaps or illegible handwriting, or are full of misspellings. They are sometimes written in odd ways, like winding around a column, up a stone, or scattered about a mosaic. Their original context is not always known, making their interpretation speculative. Despite what appears to be a set of conventions around the scholarly publication of inscriptions, these publications are deeply uneven and inconsistent. Scholars, like everyone else, have different levels of expertise and do not always share the same vocabulary. When an inscription is said, for example, to be from “the Hellenistic era,” what particular dates are being referenced? Is my “etched” your “carved”? Why should I trust that you got this critical but blurred letter or that longer reconstruction right? Moreover, inscriptions are a moving target. New ones are constantly found and published as journal articles in many different places and languages, and old readings are continually revised. Any corpus of inscriptions is therefore provisional.
Scholars were not slow to recognize the allure of moving inscriptions (“epigraphy” is the term that scholars use to denote the study of inscriptions) to a digital platform. The pathbreaking work by the Packard Humanities Institute demonstrated, before the Internet, that being able to do simple searches on large quantities of inscriptions could transform scholarship. There was much more work to be done, and teams quickly formed to advance digitization. And that is when the allure met the challenge.
As scores of teams sought to digitize particular subsets of inscriptions, they soon understood that digitized inscriptions need a shared structure to be truly useful. For example, if I want to search for all inscriptions written in Latin using Greek characters that were produced in Rome between 100 BCE and 100 CE, the data would need to be encoded to allow searches by date, language, script, and location. It was also immediately understood that a shared data structure would facilitate the development of more useful interfaces and analytical tools and would allow users to search across multiple digital collections.
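To make the point about shared structure concrete, here is a minimal sketch in Python of what such a search could look like once date, language, script, and location are stored as separate fields. The record layout, the field names, the codes ("la", "Grek"), and the sample records are all invented for illustration; they do not reflect the schema of any actual epigraphic database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InscriptionRecord:
    """Hypothetical metadata record; every field name here is an assumption."""
    record_id: str
    language: str    # language of the text, e.g. "la" for Latin
    script: str      # writing system, e.g. "Grek" for Greek characters
    place: str       # where the inscription was produced
    not_before: int  # earliest possible year; negative values are BCE
    not_after: int   # latest possible year


def search(records, language, script, place, start, end):
    """Return the records that match all criteria and whose whole date
    range falls inside [start, end]."""
    return [
        r for r in records
        if r.language == language
        and r.script == script
        and r.place == place
        and r.not_before >= start
        and r.not_after <= end
    ]


# Invented sample data, purely for demonstration.
SAMPLE = [
    InscriptionRecord("demo-1", "la", "Grek", "Rome", -50, 50),
    InscriptionRecord("demo-2", "la", "Latn", "Rome", -50, 50),
    InscriptionRecord("demo-3", "la", "Grek", "Athens", 1, 100),
    InscriptionRecord("demo-4", "la", "Grek", "Rome", 150, 250),
]

# Latin in Greek characters, produced in Rome, 100 BCE to 100 CE.
matches = search(SAMPLE, "la", "Grek", "Rome", -100, 100)
print([r.record_id for r in matches])
```

The query only works because every record answers the same four questions in the same way. If one collection stored “Hellenistic era” as free text and another stored year ranges, no single filter like this could span both.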
It was not only the information about the inscription – known as the metadata – that needed to be standardized. Decades ago, scholars banded together in a massive team effort to create a shared digital language (a set of tags) to describe the physical features of texts. Known as the Text Encoding Initiative (TEI), this ongoing project creates and maintains standardized descriptive tags that allow texts to be rendered and exchanged across multiple digital platforms. That is, it makes them interoperable.
Different texts have different kinds of features, and those who deal with ancient papyri and ostraca (texts written on shards of pottery), particularly in library settings, began to adapt the TEI schema for their own needs. Thus when epigraphers, who generally work outside the better-financed institutions such as libraries, began looking to standardize their own work, they turned to the papyrologists.

A seventh-century ostracon from Egypt now at the Metropolitan Museum of Art (record here)
The upshot of this effort was the development of a customized form of TEI that is now known as EpiDoc.
Digitizing an inscription into EpiDoc is not a trivial task. Information about an inscription, such as its context, size, find location, current location, language, place, etc., must all be entered into special fields in some database system (or directly given strict, uniform tags). This takes time and expertise. Even more laborious, though, is the digitization of the inscription’s text, even at its simplest level. Epigraphers typically (and ideally) transcribe an inscription in two formats. The first is called a diplomatic transcription, and seeks to record the inscription as it appears, with all the gaps, misspellings, etc. The second, sometimes called a normalized transcription, presents the text as the editor reads and reconstructs it. Both employ a specialized set of typographical markers.
Let me illustrate with an example. The inscription below is located in a church in Rome and dates to the middle of the fourth century CE. It has been digitally republished in the Epigraphic Database Bari (see the record here) and is shared under a Creative Commons license.

The editors do not supply a diplomatic transcription for this Latin inscription, but it is easy enough to see the damage around the inscription and its lack of spacing and punctuation. They do supply a normalized transcription:
[— cum sa]ṇctis aeterṇ[am]
[domum] Ṃarcianus e[t —]
[—]ne compare[s —]
[— s]ibi fecerun̂t [—]
The typography follows what epigraphers call the Leiden conventions (or Leiden+ conventions). The brackets with the dashes indicate gaps. The dot under the letter M means that the letter is unclear. Words in the brackets are reconstructions. This appears to be a funerary inscription for the Christians Marcianus and his wife, who “made this for themselves” and now wish to dwell with the saints.
Now, though, look at an EpiDoc rendering:
<div type="edition" subtype="transcription"> <p> <gap reason="lost" extent="unknown" unit="character"/> <supplied reason="lost">cum sa</supplied><unclear>ṇ</unclear>ctis aeter<unclear>ṇ</unclear><supplied reason="lost">am</supplied> <lb/>
<supplied reason="lost">domum</supplied> <unclear>Ṃ</unclear>arcianus e<supplied reason="lost">t</supplied> <gap reason="lost" extent="unknown" unit="character"/>
<lb/> <gap reason="lost" extent="unknown" unit="character"/>ne compare<supplied reason="lost">s</supplied> <gap reason="lost" extent="unknown" unit="character"/>
<lb/> <gap reason="lost" extent="unknown" unit="character"/> <supplied reason="lost">s</supplied>ibi fecerun<unclear>̂</unclear>t <gap reason="lost" extent="unknown" unit="character"/> </p> </div>
This is a relatively uncomplicated inscription that has gaps, supplied texts, unclear letters, and line breaks. By encoding it this way, any platform attuned to TEI should be able to render it correctly in whatever typographical conventions it uses. Other programs that analyze such texts can also process them better. For example, a program creating word lists might include the word “domum” along with a notation that the word has been supplied or is doubtful.
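As a rough illustration of the word-list idea, the following Python sketch walks an EpiDoc fragment and records, for each word, whether any part of it was supplied by the editor or marked as unclear. It rests on several assumptions: the sample is a simplified excerpt of the first two lines of the inscription (plain letters inside `<unclear>`, no TEI namespace), the function names `tagged_chars` and `word_list` are mine, and gaps and line breaks are treated as word boundaries unless a line break carries `break="no"`.

```python
import xml.etree.ElementTree as ET

# Simplified excerpt (an assumption, not the published record).
EPIDOC = """<div type="edition" subtype="transcription"><p>
<gap reason="lost" extent="unknown" unit="character"/> <supplied reason="lost">cum sa</supplied><unclear>n</unclear>ctis aeter<unclear>n</unclear><supplied reason="lost">am</supplied>
<lb/><supplied reason="lost">domum</supplied> <unclear>M</unclear>arcianus e<supplied reason="lost">t</supplied> <gap reason="lost" extent="unknown" unit="character"/>
</p></div>"""


def tagged_chars(elem, status=frozenset()):
    """Yield (character, statuses) pairs in document order."""
    if elem.tag in ("supplied", "unclear"):
        status = status | {elem.tag}
    # Gaps and ordinary line breaks separate words.
    if elem.tag == "gap" or (elem.tag == "lb" and elem.get("break") != "no"):
        yield " ", status
    for ch in elem.text or "":
        yield ch, status
    for child in elem:
        yield from tagged_chars(child, status)
        # Tail text sits outside the child, so it keeps this element's status.
        for ch in child.tail or "":
            yield ch, status


def word_list(xml_text):
    """Return (word, flags) pairs; flags name the markup touching the word."""
    root = ET.fromstring(xml_text)
    words, letters, flags = [], [], set()
    for ch, status in tagged_chars(root):
        if ch.isspace():
            if letters:
                words.append(("".join(letters), sorted(flags)))
            letters, flags = [], set()
        else:
            letters.append(ch)
            flags |= status
    if letters:
        words.append(("".join(letters), sorted(flags)))
    return words


for word, flags in word_list(EPIDOC):
    print(word, flags)
```

For this excerpt, “domum” comes back flagged as supplied and “Marcianus” as unclear, while “sanctis” carries both flags because part of it is reconstructed and one letter is doubtful. A flag here means only that some part of the word is affected; a real tool would probably want to record which letters.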
Other inscriptions are more complicated. I run a project that digitizes the inscriptions of the area of Israel/Palestine that date from the sixth century BCE to the seventh century CE (Inscriptions of Israel/Palestine, or IIP). Below is a fragment of a funerary inscription for a man named Samuel and his family inscribed on a lintel that was found completely out of context in a backyard near the city of Sepphoris.

The full record is here. The EpiDoc rendering is:
<p>Σαμουῆλος υἱὼς <gap unit="character" extent="unknown" reason="lost"/>
<lb/>γαμετὴ αὐτοῦ <orig>Θ</orig><gap unit="character" extent="unknown" reason="lost"/>
<lb/><expan><abbr>κα</abbr><ex>ὶ</ex></expan> σ<supplied reason="omitted">ύ</supplied>νγων<supplied reason="lost">οι</supplied> <gap unit="character" extent="unknown" reason="lost"/></p>
In this case the editor had to make several decisions about how to encode this inscription. Note that it is in Greek and damaged at the end of the lines. It also has abbreviations and missing letters, especially in the last line.
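One payoff of these encoding decisions is that a single encoded text can yield more than one reading. The Python sketch below flattens the last line of the fragment twice: once keeping the editor's expansions and supplied letters, and once dropping them to approximate what survives on the stone. The function name `render`, the `editorial` switch, and the `[...]` placeholder for gaps are my own illustrative choices, not part of EpiDoc or of any existing tool, and the sample again omits the TEI namespace.

```python
import xml.etree.ElementTree as ET

# Last line of the fragment, without the leading <lb/>.
FRAGMENT = (
    '<p><expan><abbr>κα</abbr><ex>ὶ</ex></expan> '
    'σ<supplied reason="omitted">ύ</supplied>νγων'
    '<supplied reason="lost">οι</supplied> '
    '<gap unit="character" extent="unknown" reason="lost"/></p>'
)


def render(elem, editorial=True):
    """Flatten an element to plain text.

    editorial=True keeps abbreviation expansions (<ex>) and supplied
    letters; editorial=False drops both.
    """
    if elem.tag == "gap":
        return "[...]"
    if elem.tag in ("ex", "supplied") and not editorial:
        return ""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(render(child, editorial))
        # Tail text follows the child but belongs to this element.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(FRAGMENT)
print(render(root, editorial=True))   # reading with the editor's additions
print(render(root, editorial=False))  # prints: κα σνγων [...]
```

This is only an approximation of the two transcription styles. A proper diplomatic rendering would also need to mark where the lost letters stood, and a full converter would have to handle many more elements than the four used here.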
The complexity of these decisions and the actual manual encoding of inscriptions, particularly multilingual ones (and even more so when they are in right-to-left languages like Hebrew and Aramaic), make the process costly. The scale of the problem is enormous. Even short inscriptions can take trained specialists an hour to encode correctly, and there are tens of thousands still unpublished or undigitized. This is precisely the kind of repetitive but expertise-intensive work that recent AI systems may be able to assist with.
Over the past year, I have been working with a team at the Center for Digital Scholarship at the Brown University Library on exploring the use of AI, and specifically LLMs, to do this work more efficiently. I have previously reported on some early probes. We are now wrapping up the first phase of our work, and I recently had an opportunity to present it at the Tenth Epigraphy.info Workshop. At that meeting I was thrilled to discover that there are several scholarly teams working on applying various aspects of AI to inscriptions. I will discuss our paper as well as some of these initiatives in my next post.






