Unfortunately, archaeologists often discover ancient Latin texts in damaged condition, with words missing throughout. To address this, Latin experts emend the texts, proposing words to fill in the gaps. This process, however, is often difficult and time-consuming. In 2020, researchers at UC Berkeley and UT Austin developed a model called Latin BERT [1], which, among many other natural language processing tasks, can reconstruct the human emendation with some accuracy. In their evaluation, the model’s top choice matched the human emendation 33.1% of the time; in 62% of cases, the correct word appeared in the model’s top 10 guesses, and in 74% of cases, it was in the top 50.

Latin BERT was trained on 642.7 million words of Latin text specifically for Latin natural language processing. However, I wondered if ChatGPT, with its vast knowledge spanning nearly the entire internet, could do better. First, I procured the authors’ dataset of 1,161 emendations and tested the GPT-4o and GPT-4.1 models using the following system prompt: “You are an expert in Latin emendation. What is the most likely word that fills in the blank (only say the word and nothing else).” The user prompt was the original sentence or two of text, with “_____” in place of the human emendation.
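A minimal sketch of this querying step, using the OpenAI Python SDK, looks roughly like the following (the helper function name and the example passage, Aeneid 1.1, are illustrative rather than taken from the evaluation data):

```python
# Minimal sketch: ask a base model to fill in one blank, using the OpenAI Python SDK (v1+).
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an expert in Latin emendation. What is the most likely word "
    "that fills in the blank (only say the word and nothing else)."
)

def guess_emendation(passage_with_blank: str, model: str = "gpt-4o") -> str:
    """Return the model's single-word guess for the '_____' blank."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": passage_with_blank},
        ],
    )
    return response.choices[0].message.content.strip()

# Illustrative passage (not from the test set): the blank hides "oris".
print(guess_emendation("Arma virumque cano, Troiae qui primus ab _____"))
```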
Since OpenAI trained GPT-4o and GPT-4.1 on nearly all the data they could obtain, I was concerned they might achieve near-perfect accuracy on this data but not be able to generalize beyond it. I was wrong. Both performed poorly, each matching the human emendation only 8.4% of the time. While this suggested that the emendations were likely not present in the training data, it was still disheartening to see the base models perform so badly. I attributed this in part to Latin’s linguistic complexity: much of a word’s grammatical meaning is carried by its endings (declensions and conjugations).
While looking into ways to improve large language model (LLM) performance, I learned about fine-tuning, the process of adapting an LLM to perform better on a specific task. Conveniently, the Latin BERT paper’s authors had excluded about half their emendations (1,203, to be exact) from the testing dataset because those passages were part of their model’s training data. Formatting this excluded set with the same prompts and supplying the correct emendations as the target outputs, I fine-tuned GPT-4.1-2025-04-14 and GPT-4o-2024-08-06 with the default settings. Then I evaluated these fine-tuned models on Latin BERT’s test set.
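OpenAI’s fine-tuning API expects a JSONL file of chat-formatted examples. Roughly, preparing the data and launching a job would look like the sketch below (the file name and the single stand-in example are placeholders; in practice the 1,203 held-out emendations would populate the list):

```python
# Minimal sketch: build the fine-tuning file and launch a job with default hyperparameters.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an expert in Latin emendation. What is the most likely word "
    "that fills in the blank (only say the word and nothing else)."
)

held_out_examples = [
    # (passage with blank, correct emendation); Caesar, De Bello Gallico 1.1 as a stand-in
    ("Gallia est omnis divisa in partes _____", "tres"),
]

with open("latin_emendations.jsonl", "w", encoding="utf-8") as f:
    for passage, emendation in held_out_examples:
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": passage},
                {"role": "assistant", "content": emendation},
            ]
        }
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Upload the training file and start a fine-tuning job.
training_file = client.files.create(
    file=open("latin_emendations.jsonl", "rb"), purpose="fine-tune"
)
client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-4o-2024-08-06"
)
```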
The results were much better: fine-tuned GPT-4.1 achieved 32.0% accuracy, and fine-tuned GPT-4o reached 33.2%, which is slightly better than Latin BERT! I was surprised that GPT-4o did better than GPT-4.1, given that GPT-4o is the older model. I suspect this is because GPT-4o tends to excel at creativity and writing, skills more closely related to emendation. One major advantage of this approach is cost: fine-tuning each model cost less than $10, compared to the $540 required to train Latin BERT. This suggests that researchers could fine-tune modern LLMs to emend other texts, or even other languages, at a fraction of the cost.
| Model | Accuracy | Training Set | Training Cost |
|---|---|---|---|
| Latin BERT | 33.1% | 642.7 million words | $540 |
| GPT-4o | 8.4% | n/a | n/a |
| GPT-4.1 | 8.4% | n/a | n/a |
| GPT-4o (fine-tuned) | 33.2% | 1,203 infilling examples | $9.86 |
| GPT-4.1 (fine-tuned) | 32.0% | 1,203 infilling examples | $9.86 |
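Throughout, accuracy means the model’s single-word answer exactly matching the human emendation. A minimal sketch of that scoring is below; the lowercasing and punctuation stripping are simplifying assumptions, not details from the Latin BERT evaluation:

```python
# Minimal sketch of exact-match scoring between model guesses and human emendations.
def exact_match_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Fraction of model guesses that exactly match the human emendation."""
    def normalize(word: str) -> str:
        # Simplifying assumption: ignore case and trailing punctuation.
        return word.strip().strip(".,;:").lower()
    correct = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers))
    return correct / len(answers)

# e.g. exact_match_accuracy(["oris", "tres"], ["oris", "partes"]) == 0.5
```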
Special thanks to Mr. Wilairat, my Latin teacher, for encouraging me to pursue this project!
References
[1] Bamman D, Burns PJ. Latin BERT: A contextual language model for classical philology. arXiv preprint arXiv:2009.10053. 2020.