In my previous post, I discussed the field of “digital epigraphy” and the issue of structuring epigraphic data to make it more digitally useful. Much of the conversation at the Tenth Epigraphy.info Workshop, in fact, intersected with the FAIR principles for the tagging of data. By using consistent tagging and ensuring the inclusion of certain metadata (e.g., where the data came from, how it was captured, and its license for reuse), data becomes Findable, Accessible, Interoperable, and Reusable. Adherence to such standards makes the data far more valuable for research. In my last post, I noted that getting data into that format is often not trivial.
Over the past year I have been working with a team at the Center for Digital Scholarship at Brown University on the application of AI to the encoding of inscriptions. This began as an experiment meant both to test the practical applications of AI to humanities data and to facilitate the actual tagging of inscriptions for my Inscriptions of Israel/Palestine database. The other team members were Justin Uhr, Daniel Kang, Christopher Zeichmann, and Patrick Rashleigh. While I presented our work at the Workshop, they were in fact responsible for most of it. The purpose of this post is to summarize our research and introduce a new app we designed.
Our initial question was ambitious: could an LLM take either a photograph of an inscription or a PDF of a scholarly edition and produce a well-formed, accurate, EpiDoc-compliant XML file?
The answer, we learned quickly, is no. The LLMs showed glimmers of hope, but at least in the stage that they were in several months ago they were not ready for that intense level of extraction, structuring, and analysis. So we set our sights a bit lower.
Our more modest goal was to input Leiden tagged inscriptions and return as output proper EpiDoc XML files. We experimented with some different evaluation metrics and discovered that for this project using a text difference metric gave us the best results. We also used a XML difference routine to evaluate our results. TextDiff measured textual similarity between generated and expert XML, while XMLDiff evaluated structural differences in the markup itself. Evaluation was done through comparison with a set of inscriptions that were manually tagged by an expert.
We tested several variables: the underlying model, the structure of the prompt, and the amount of instructional material supplied to the LLM.
Model
All of our tests were done through APIs. The higher the scores the better:


For our purposes, Claude Sonnet 4.5 gave us the best results. LLM models, of course, quickly develop and we do not know if any of the current models are, at this point, better.
Prompt
We also experimented with how the prompt itself was framed.
- System prompt vs. user prompt. We can label our prompt either as a system prompt or a user prompt.
- Full vs. succinct. We tried different levels of instruction (see below). We also experimented with using an abbreviated (“succinct”) prompt suggested by Claude Workbench.
- Analysis. We tried asking the LLM to return results with and without an analysis of the inscription.


Claude 4.5 responded marginally better with a user rather than system prompt and when it was asked not to provide analysis. It did not seem to make much difference if the level of instruction was full or succinct.
Level of Instruction


We tried different kinds of prompts. IIP has an extensive manual that it uses to encode inscriptions. Our best results came when we used that full manual along with eleven examples as a prompt.
The examples included inscriptions in Latin, Greek, Hebrew, and Aramaic. The results improved when we limited the test to just Latin and Greek inscriptions.


Application
We have developed an application that assists in the translation of Leiden encoded inscriptions into EpiDoc using an LLM. The application comes preloaded with a prompt and an example. You will need your own Anthropic API Key to run it.
The application can be downloaded through Github at this link.
This short video describes how to use the application:
Conclusions and Next Steps
Our work has demonstrated that an LLM can be used to produce accurate EpiDoc compliant XML files from a Leiden tagged input. We have found Claude Sonnet 4.5 to be most effective (at least as of several months ago). The prompt is best supplied as a user prompt with limited examples. We have also developed an application to facilitate these conversions.
More research is needed to ascertain the best methods to evaluate such LLM outputs. Our preliminary results also indicated that the LLM was better at converting Greek and Latin inscriptions than it was converting Hebrew and Aramaic ones. This is not surprising. What was surprising was that the results for Aramaic inscriptions were better than those for Hebrew. Since our sample size was small, however, this result needs further testing.
More broadly, this experiment suggests that LLMs may become valuable assistants for humanities data curation—not by replacing expert encoders, but by accelerating highly structured scholarly workflows.
Leave a Reply