Masked Language Modeling for Predicting Missing Words in Damaged Ancient Greek Texts

Conference

2024 ASEE North East Section

Location

Fairfield, Connecticut

Publication Date

April 19, 2024

Start Date

April 19, 2024

End Date

April 20, 2024

Page Count

5

DOI

10.18260/1-2--45774

Permanent URL

https://peer.asee.org/45774

Paper Authors

Kyle Riccardi

Danushka Bandara, Fairfield University

DANUSHKA BANDARA received the bachelor's degree in Electrical Engineering from the University of Moratuwa, Sri Lanka, in 2009. He received his master's and Ph.D. degrees in Computer Engineering and Electrical and Computer Engineering from Syracuse University, Syracuse, NY, USA, in 2013 and 2018, respectively. From 2019 to 2020, he worked as a Data Scientist at Corning Incorporated, Corning, NY, USA. Currently, he is an Assistant Professor of Computer Science and Engineering at Fairfield University, Fairfield, CT, USA. His current research interests include applied machine learning, bioinformatics, human-computer interaction, and computational social science.

Abstract

Ancient Greek texts are valuable for understanding the history, culture, and nuances of ancient Greek life. These texts survive in many forms, including papyri and fragments of pottery. Due to the nature of these materials and their degradation over time, some texts are missing words or even entire phrases, making them difficult for historians to interpret. The data for this project was collected from the Perseus Collection and the 1KGreek collection, which together contain 250,000 unique sentences of ancient Greek literature. The dataset was preprocessed using the Classical Language Toolkit (CLTK), and sentences were normalized for better encodings. After encoding, the data was split by sentence and fed into a DistilBERT masked language model. The WordPiece tokenizer for this model was trained with a vocabulary of 35,000 words. Using the DistilBERT transformer model, we trained a word-level masked language model that achieves a Hit@5 of 34 percent, a Hit@10 of 35 percent, a Hit@100 of 36 percent, and a perplexity of 1.04. This model can be a valuable aid in historians' workflow for deciphering damaged ancient texts.
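The evaluation metrics reported in the abstract can be illustrated with a short sketch. The following is not the authors' code; it is a minimal, self-contained example (with invented toy predictions) of how Hit@k and perplexity are conventionally computed: Hit@k checks whether the correct masked word appears among the model's top-k candidates, and perplexity is the exponential of the mean negative log-probability the model assigns to each correct word.

```python
import math

def hit_at_k(ranked_predictions, target, k):
    """Hit@k: 1.0 if the target word is among the top-k predictions, else 0.0."""
    return 1.0 if target in ranked_predictions[:k] else 0.0

def perplexity(target_probs):
    """Perplexity: exp of the mean negative log-probability assigned
    to each masked target word (lower is better; 1.0 is perfect)."""
    nll = -sum(math.log(p) for p in target_probs) / len(target_probs)
    return math.exp(nll)

# Toy example: the model's ranked guesses for two masked tokens
preds = [["λόγος", "θεός", "ἄνθρωπος"], ["θεός", "λόγος", "κόσμος"]]
targets = ["θεός", "λόγος"]

hits5 = sum(hit_at_k(p, t, 5) for p, t in zip(preds, targets)) / len(targets)
print(hits5)

# Probabilities the model assigns to each correct word (illustrative values):
# a perplexity near 1.04, as reported in the paper, corresponds to the model
# assigning roughly 0.96-0.97 probability to each correct masked word.
print(round(perplexity([0.96, 0.97]), 2))
```

Note how low the reported perplexity of 1.04 is: it implies the model concentrates nearly all of its probability mass on the correct word on average, even though the correct word only appears in the top five candidates about a third of the time.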

Riccardi, K., & Bandara, D. (2024, April), Masked Language Modeling for Predicting Missing Words in Damaged Ancient Greek Texts Paper presented at 2024 ASEE North East Section, Fairfield, Connecticut. 10.18260/1-2--45774

ASEE holds the copyright on this document. It may be read by the public free of charge. Authors may archive their work on personal websites or in institutional repositories with the following citation: © 2024 American Society for Engineering Education. Other scholars may excerpt or quote from these materials with the same citation. When excerpting or quoting from Conference Proceedings, authors should, in addition to noting the ASEE copyright, list all the original authors and their institutions and name the host city of the conference. - Last updated April 1, 2015