
Strange image generated by DALL-E for this post
First published on my Substack.
In an earlier post, I discussed a very simple and quick experiment that I did on a photograph of an ancient Jewish inscription written in Greek. The results of that experiment were not perfect – epigraphers need not fear for their jobs (yet) – but they were also not bad. An off-the-shelf, untuned ChatGPT model “knew” how to decipher the inscription; write out an edited version of it in Greek (ancient inscriptions usually do not contain punction or spaces and lack both distinctions between upper and lower cases and important accent and other marks); add the conventional markings used by epigraphers to denote things like gaps in the text and unclear letters; translate it into English; and provide a short commentary. What might it do with a bit of training?
I am working with a team from the Center for Digital Scholarship at Brown University to probe this question. The immediate impetus for our work was a practical problem I was confronting. Our funding for Inscriptions of Israel/Palestine -an accessible and robustly searchable database of ancient inscriptions from the region of modern day Israel/Palestine – has paused. We have always used human labor (mostly students) to enter inscriptions into the database. The data entry was time (and resource) expensive. Encoders would have to comb through publications of inscriptions to extract data (e.g., creation date) that is then entered into a complex XML document. Most time consuming was converting the edited version of the inscription, which is often a complex text containing many editorial markings, into our own encoding, which is based on a standard known as EpiDoc. Even with all of our voluminous documentation it takes a significant time to train an encoder. A well-trained encoder might easily take 30 minutes to encode an inscription. We then run a series of automated and manual tasks on the inscription to enter the words into our indices. Without funding it was hard for me to see a practical way to continue to add to the database.
Could, though, AI do it? Could I feed it a pdf of a published inscription and have it generate a high-quality XML file? If not all of it, what about part?
We started with one central part of the workflow, the conversion of printed, edited inscriptions into EpiDoc. On the surface, this might seem like a trivial thing. Replacing the Leiden convention typographical marks (e.g., using square brackets to denote text supplied by an editor) with the equivalent EpiDoc tags seems like a simple search and replace operation. And, perhaps 60%-70% of the time, it is. The other times, though, there can be a significant degree of decision-making. We decided to start with the inscriptions in Greek. We now feel that we have made enough progress on this problem that it is worth publicly sharing, although we have not rigorously tested it for errors. The work below was largely conducted by Daniel Kang and Justin Uhr.
We are happy with two solutions to this problem, both of which use Claude.ai.
- The first approach involved a long set of instructions, with seven examples:
You are an expert system designed to translate epigraphic and papyrological inscriptions from Leiden Conventions format into XML that conforms to the EpiDoc schema. Your task is to accurately convert the given text, preserving all meaningful information while translating the special symbols into appropriate XML tags.
Here is the text in Leiden Conventions format that you need to translate:
[FILL IN YOUR TEXT HERE]
When translating the text, please make sure to meticulously follow the guideline attached as “2h. Leiden to EpiDoc Cheatsheet” for translating specific Leiden Convention symbols to EpiDoc-compliant XML
Please also follow these steps:
1. Read through the entire text to familiarize yourself with its content and structure.
2. Identify all Leiden Convention symbols and their corresponding XML translations according to the EpiDoc schema.
3. Convert each symbol to its appropriate XML tag, ensuring proper nesting when multiple features apply to the same text.
4. Preserve all alphabetic characters and spaces as they appear in the original text.
5. Review your translation to ensure all symbols have been accurately converted and tags are properly nested. Before providing your final translation, wrap your thought process in tags.
Include these in your response:
1. List all Leiden Convention symbols present in the given text.
2. Map each identified symbol to its corresponding EpiDoc XML tag using the guideline “2h. Leiden to EpiDoc Cheatsheet”
3. Consider and explain how you will handle nested tags and their proper order.
4. Outline any potential challenges in the translation and how you plan to address them.
This detailed breakdown will help ensure a thorough and accurate translation. After your analysis, provide the final XML translation wrapped in tags. Ensure that your output strictly adheres to the EpiDoc schema and conventions.
Examples are given below———————————————————
Example Leiden Input 1
Εἶς θεὸ[ς μόνο-]
ς ὁ βοηθ[ῶν]
Γαδιωναν
κ(αὶ) Ἰουλιανῷ
κ(αὶ) πᾶσιν τοῖς ἀξί-
οις
Example EpiDoc Output 1
<div type=”edition” subtype=”transcription” ana=”b1″>
<p xml:lang=”grc”>
<lb/>Εἶς θεὸ<supplied reason=”lost”>ς</supplied> <supplied reason=”lost”>μόνο</supplied><lb break=”no”/>ς ὁ
βοηθ<supplied reason=”lost”>ῶν</supplied>
<lb/>Γαδιωναν <lb/><expan><abbr>κ</abbr><ex>αὶ</ex></expan> Ἰουλιανῷ
<lb/><expan><abbr>κ</abbr><ex>αὶ</ex></expan> πᾶσιν τοῖς ἀξ<lb break=”no”/>ίοις <lb/><foreign xml:lang=”heb”>פעלהבדה</foreign></p>
</div>
Example Leiden Input 2
Κ(ύρι)ε μνήσ(θητι) τῶν πρ-
[οσ]νε(γ)καντ(ων) καὶ
[—]
Example EpiDoc Output 2
<p><expan><abbr>Κ</abbr><ex>ύρι</ex><abbr>ε</abbr></expan> <expan><abbr>μνήσ</abbr><ex>θητι</ex></expan> τῶν <expan><abbr>πρ<lb break=”no”/><supplied reason=”lost”>οσ</supplied>νε</abbr><ex>γ</ex><abbr>καντ</abbr><ex>ων</ex></expan> καὶ <lb/><gap reason=”lost” extent=”unknown” unit=”character”/></p>
Example Leiden Input 3
εἷς θεὸς ὁ νικῶν τὰ κα[κὰ]
Ἰάω θ[εὸς]
εἷς θ[εὸ]ς
Example EpiDoc Output 3
<p><lb/>εἷς θεὸς ὁ νικῶν τὰ κα<supplied reason=”lost”>κὰ</supplied> <lb/> Ἰάω θ<supplied reason=”lost”>εὸς</supplied> <lb/>εἷς θ<supplied reason=”lost”>εὸ</supplied>ς </p>
Example Leiden Input 4
Κ(ύρι)ε Ἰ(ησο)ῦ Χ(ριστ)ὲ πρόσδεξε τὴν
καρποφορίαν τῶν δούλω(ν)
σοῦ Ἰωάννου τοῦ πρ(εσβυτέρ)ου καὶ
Ἀββοσόβου ὅτι ἐξ ἰδίων κό-
πων ἤγιραν τὸν οἴκον τοῦτον.
Example EpiDoc Output 4
<p><lb/><expan><abbr>Κ</abbr><ex>ύρι</ex><abbr>ε</abbr></expan> <expan><abbr>Ἰ</abbr><ex>ησο</ex><abbr>ῦ</abbr></expan> <expan><abbr>Χ</abbr><ex>ριστ</ex><abbr>ὲ</abbr></expan> πρόσδεξε τὴν <lb/>καρποφορίαν τῶν <expan><abbr>δούλω</abbr><ex>ν</ex></expan> <lb/>σοῦ Ἰωάννου τοῦ <expan><abbr>πρ</abbr><ex>εσβυτέρ</ex><abbr>ου</abbr></expan> καὶ <lb/>Ἀββοσόβου ὅτι ἐξ ἰδίων κό<lb break=”no”/>πων ἤγιραν τὸν οἴκον τοῦτον.</p>
Example Leiden Input 5
Ἐπὶ τοῦ <δ>ὁσιωτάτου Γεωργίου δια-
κόνου καὶ Ϲαμουήλου λαμπροτ(άτου)
καὶ Ἀββεος Ζαχαρίου ἐγένετο τὸ π(ᾶν)
ἔργον τ<ῆ?>ς ψιφώσεως ταύτης
ἐν μ(ηνὶ) Ἱουν[ίῳ ἔτους] [Ἐλευθερο]πόλε(ως) βφʹ
Example EpiDoc Output 5
<p>Ἐπὶ τοῦ <supplied reason=”omitted”>δ</supplied>ὁσιωτάτου Γεωργίου δια<lb break=”no”/>κόνου καὶ Ϲαμουήλου <expan><abbr>λαμπροτ</abbr><ex>άτου</ex></expan><lb/>καὶ Ἀββεος Ζαχαρίου ἐγένετο τὸ <expan><abbr>π</abbr><ex>ᾶν</ex></expan><lb/>ἔργον τ<supplied reason=”omitted” cert=”low”>ῆ</supplied>ς ψιφώσεως ταύτης<lb/>ἐν <expan><abbr>μ</abbr><ex>ηνὶ</ex></expan> Ἱουν<supplied reason=”lost”>ίῳ</supplied> <supplied reason=”lost”>ἔτους</supplied> <supplied reason=”lost”><abbr><expan>Ἐλευθερο</expan></abbr></supplied><expan><abbr>πόλε</abbr><ex>ως</ex></expan> <num value=”502″>βφʹ</num></p>
Example Leiden Input 6
[Ἐπὶ Σι]λουανοῦ θεοφιλ(εστάτου) διακό(νου) κ(αὶ) ἡγουμέ(νου) ἡ παροῦσα
[ψήφωσ]ις ἐγένετο κ(αὶ) κόγχη κ(αὶ) ἡ προσθήκη τοῦ ναοῦ μ<ή>κος
[πήχεις … ὕ]ψους π(ή)χ(εις) ς’ μνήσθητ[ί μου] Κ(ύρι)ε ἐν [τῇ β]ασιλ<ε>ίᾳ σου.
Example EpiDoc Output 6
<p><lb/><supplied reason=”lost”>Ἐπὶ</supplied> <supplied reason=”lost”>Σι</supplied>λουανοῦ <expan><abbr>θεοφιλ</abbr><ex>εστάτου</ex></expan> <expan><abbr>διακό</abbr><ex>νου</ex></expan> <expan><abbr>κ</abbr><ex>αὶ</ex></expan> <expan><abbr>ἡγουμέ</abbr><ex>νου</ex></expan> ἡ παροῦσα <lb/><supplied reason=”lost”>ψήφωσ</supplied>ις ἐγένετο <expan><abbr>κ</abbr><ex>αὶ</ex></expan> κόγχη <expan><abbr>κ</abbr><ex>αὶ</ex></expan> ἡ προσθήκη τοῦ ναοῦ μ<supplied reason=”omitted”>ή</supplied>κος <lb/><supplied reason=”lost”>πήχεις</supplied> <gap reason=”lost” extent=”unknown” unit=”character”/> <supplied reason=”lost”>ὕ</supplied>ψους <expan><abbr>π</abbr><ex>ή</ex><abbr>χ</abbr><ex>εις</ex></expan> <num value=”6″>ς'</num> μνήσθητ<supplied reason=”lost”>ί</supplied> <supplied reason=”lost”>μου</supplied> <expan><abbr>Κ</abbr><ex>ύρι</ex><abbr>ε</abbr></expan> ἐν <supplied reason=”lost”>τῇ</supplied> <supplied reason=”lost”>β</supplied>ασιλ<supplied reason=”omitted”>ε</supplied>ίᾳ σου.</p>
Example Leiden Input 7
(+) Ἀνεπάη μακά-
ριος Ζαχαρίας
Ἐρασίνου ἐν
μηνὶ Πανέμου
δεκάτῃ ἰνδ(ικτιῶνος) ιδʹ ἡ-
μέρᾳ κυριακῇ ὧραν
τρίτῃ τῆς νυκτὸς κα-
τετέθη δὲ ἐνταῦθα
τῇ τρίτῃ τοῦ σάμ-
βατος ὥραν ὀγδόην
Πανέμῷ δοδεκα-
τῃ ἰν(δικτιῶνος) ιδʹ ἔτους κα-
τὰ Ἐλού(σην) ΥΟςʹ Κ(ύρι)ε ἀ-
νάπαυσον τὴν ψυ-
χὴν αὐτοῦ μετὰ τῶν
ἁγίων σου. Ἀμήν
Example EpiDoc Output 7
<g ref=”cross”>+</g> Ἀνεπάη μακά<lb break=”no”/>ριος Ζαχαρίας
<lb/>Ἐρασίνου ἐν
<lb/>μηνὶ Πανέμου
<lb/>δεκάτῃ <expan><abbr>ἰνδ</abbr><ex>ικτιῶνος</ex></expan> <num value=”14″>ιδʹ</num> ἡ<lb break=”no”/>μέρᾳ κυριακῇ ὧραν
<lb/>τρίτῃ τῆς νυκτὸς κα<lb break=”no”/>τετέθη δὲ ἐνταῦθα
<lb/>τῇ τρίτῃ τοῦ σάμβα<lb break=”no”/>τος ὥραν ὀγδόην
<lb/>Πανέμῷ δοδεκα<lb break=”no”/>τῃ <expan><abbr>ἰνδ</abbr><ex>ικτιῶνος</ex></expan> <num value=”14″>ιδʹ</num> ἔτους κα<lb break=”no”/>τὰ <expan><abbr>Ἐλού</abbr><ex>σην</ex></expan> <num value=”476″>ΥΟςʹ</num> <expan><abbr>Κ</abbr><ex>ύρι</ex><abbr>ε</abbr></expan> ἀ<lb break=”no”/>νάπαυσον τὴν ψυ<lb break=”no”/>χὴν αὐτοῦ μετὰ τῶν
<lb/>ἁγίων σου. Ἀμήν
- We also tried an automated approach using an API call to Claude. The advantage of this approach is that it allowed for batch processing of multiple inscriptions. The full code, with examples, can be found at our Github site: https://github.com/Brown-University-Library/ai-experiments/tree/main/01_leiden-translator. The code passes along detailed instructions and a few complex examples. We were surprised at how few examples the model needs to do a good job.
The results of both approaches were excellent. The advantage of the first approach is that it is free. The second approach allows processing in bulk. There is a cost for the second approach, but it in the range of ten cents an inscription (and would be much cheaper on DeepSeek, although we haven’t tested whether it works as well) there is a cost but it is relatively inexpensive (and may become even radically more so with the introduction of DeepSeek).
We are now working on applying these same approaches to Hebrew/Aramaic inscriptions and working straight from pdfs. I’ll update as we get results.