Masked Language Modeling for Predicting Missing Words in Damaged Ancient Greek Texts

Kyle Riccardi; Danushka Bandara


Conference: 2024 ASEE North East Section
Location: Fairfield, Connecticut
Publication Date: April 19, 2024
Start Date: April 19, 2024
End Date: April 20, 2024
Page Count: 5
DOI: 10.18260/1-2--45774
Permanent URL: https://peer.asee.org/45774

DANUSHKA BANDARA received his bachelor’s degree in Electrical Engineering from the University of Moratuwa, Sri Lanka, in 2009. He received his master’s and Ph.D. degrees in Computer Engineering and Electrical and Computer Engineering from Syracuse University, Syracuse, NY, USA, in 2013 and 2018, respectively. From 2019 to 2020, he worked as a Data Scientist at Corning Incorporated, Corning, NY, USA. He is currently an Assistant Professor of Computer Science and Engineering at Fairfield University, Fairfield, CT, USA. His current research interests include applied machine learning, bioinformatics, human-computer interaction, and computational social science.


Abstract

Ancient Greek texts are valuable for understanding the history, culture, and nuances of ancient Greek life. The texts survive in many forms, including papyri and pottery fragments. Because these materials are fragile and have degraded over time, some texts are missing words or even entire phrases, which makes them difficult for historians to interpret. The data for this project were collected from the Perseus Collection and the 1KGreek collection, which contains 250,000 unique sentences of ancient Greek literature. The dataset was preprocessed using the Classical Language Toolkit (CLTK), and sentences were normalized for better encoding. After encoding, the data were split into sentences and fed into a DistilBERT masked language model. The WordPiece tokenizer for this model was trained with a vocabulary of 35,000 words. Using the DistilBERT transformer model, we trained a word-level masked language model that achieved a Hit@5 of 34 percent, a Hit@10 of 35 percent, a Hit@100 of 36 percent, and a perplexity of 1.04. This model can be a valuable aid in historians' workflows for deciphering damaged ancient texts.
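
The abstract reports Hit@k and perplexity without defining them. The following is a minimal sketch of how such metrics are commonly computed for masked-word prediction, assuming each masked position has a ranked list of candidate words and a probability assigned to the true word. The function names `hit_at_k` and `perplexity` and the toy data are illustrative assumptions, not the authors' code or data.

```python
import math
from typing import Sequence


def hit_at_k(ranked_candidates: Sequence[Sequence[str]],
             targets: Sequence[str], k: int) -> float:
    """Fraction of masked positions whose true word is in the top-k predictions."""
    if len(ranked_candidates) != len(targets):
        raise ValueError("need one ranked candidate list per target")
    if not targets:
        raise ValueError("no masked positions to score")
    hits = sum(1 for cands, gold in zip(ranked_candidates, targets)
               if gold in cands[:k])
    return hits / len(targets)


def perplexity(target_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability given to the true words."""
    if not target_probs:
        raise ValueError("no probabilities to score")
    mean_nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(mean_nll)


# Toy, made-up predictions for three masked positions (hypothetical data).
predictions = [
    ["λόγος", "θεός", "ἄνθρωπος"],
    ["πόλις", "οἶκος", "ναῦς"],
    ["καί", "δέ", "γάρ"],
]
gold = ["θεός", "ναῦς", "μέν"]

print(hit_at_k(predictions, gold, k=1))
print(hit_at_k(predictions, gold, k=3))
print(perplexity([0.5, 0.5, 0.5]))
```

Under this definition, Hit@k can only stay the same or increase as k grows, which matches the ordering of the reported Hit@5, Hit@10, and Hit@100 values, and perplexity has a lower bound of 1, reached when the model assigns probability 1 to every true word.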

Citation
Riccardi, K., & Bandara, D. (2024, April). Masked Language Modeling for Predicting Missing Words in Damaged Ancient Greek Texts. Paper presented at 2024 ASEE North East Section, Fairfield, Connecticut. 10.18260/1-2--45774
