Fairfield, Connecticut
April 19, 2024
April 20, 2024
5
10.18260/1-2--45774
https://peer.asee.org/45774
62
DANUSHKA BANDARA received the bachelor’s degree in Electrical Engineering from the University of Moratuwa, Sri Lanka, in 2009. He received his master’s and Ph.D. degrees in Computer Engineering and Electrical and Computer Engineering from Syracuse University, Syracuse, NY, USA, in 2013 and 2018, respectively. From 2019 to 2020, he worked as a Data Scientist at Corning Incorporated, Corning, NY, USA. Currently, he is an Assistant Professor of Computer Science and Engineering at Fairfield University, Fairfield, CT, USA. His current research interests include applied machine learning, bioinformatics, human-computer interaction, and computational social science.
Ancient Greek texts are valuable for understanding the history, culture, and nuances of ancient Greek life. They survive in many forms, including papyri and fragments of pottery. Because these materials degrade over time, some texts are missing words or even entire phrases, which makes them difficult for historians to interpret. The data for this project were collected from the Perseus Collection and the 1KGreek collection and comprise 250,000 unique sentences of ancient Greek literature. The dataset was preprocessed using the Classical Language Toolkit (CLTK), and sentences were normalized for better encoding. After encoding, the data were split into sentences and fed into a DistilBERT masked language model. The WordPiece tokenizer for this model was trained using a vocabulary list of 35,000 words. Using the DistilBERT transformer model, we trained a word-level masked language model that achieved a Hit@5 of 34 percent, a Hit@10 of 35 percent, a Hit@100 of 36 percent, and a perplexity of 1.04. This model can be a valuable aid in historians' workflow for deciphering damaged ancient texts.
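The abstract reports Hit@k and perplexity without defining them. The sketch below shows one common way such metrics are computed for masked-word prediction; it is an assumption about the usual definitions, not the authors' evaluation code. The function names `hit_at_k` and `perplexity`, the toy candidate lists, and the probabilities are all hypothetical and chosen only for illustration.

```python
import math


def hit_at_k(ranked_predictions, targets, k):
    """Fraction of masked positions whose true word is among the top-k candidates.

    ranked_predictions: one list of candidate words per masked position,
                        ordered from most to least likely.
    targets:            the true word for each masked position.
    """
    hits = sum(
        1
        for ranked, target in zip(ranked_predictions, targets)
        if target in ranked[:k]
    )
    return hits / len(targets)


def perplexity(true_word_probs):
    """Exponential of the mean negative log-probability given to the true words."""
    mean_nll = -sum(math.log(p) for p in true_word_probs) / len(true_word_probs)
    return math.exp(mean_nll)


# Hypothetical toy data: four masked positions, three ranked candidates each.
ranked = [
    ["λόγος", "θεός", "ἀνήρ"],
    ["πόλις", "οἶκος", "ναῦς"],
    ["καί", "δέ", "γάρ"],
    ["ἵππος", "βοῦς", "κύων"],
]
targets = ["θεός", "ναῦς", "μέν", "ἵππος"]

# Hypothetical probabilities the model assigned to each true word.
probs = [0.5, 0.25, 0.5, 0.25]

for k in (1, 2, 3):
    print(f"Hit@{k} = {hit_at_k(ranked, targets, k):.2f}")
print(f"Perplexity = {perplexity(probs):.3f}")
```

Under these definitions Hit@k can only stay the same or grow as k increases, and a perplexity close to 1 means the model assigns probability close to 1 to the true words on average.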
Riccardi, K., & Bandara, D. (2024, April). Masked Language Modeling for Predicting Missing Words in Damaged Ancient Greek Texts. Paper presented at 2024 ASEE North East Section, Fairfield, Connecticut. 10.18260/1-2--45774
ASEE holds the copyright on this document. It may be read by the public free of charge. Authors may archive their work on personal websites or in institutional repositories with the following citation: © 2024 American Society for Engineering Education. Other scholars may excerpt or quote from these materials with the same citation. When excerpting or quoting from Conference Proceedings, authors should, in addition to noting the ASEE copyright, list all the original authors and their institutions and name the host city of the conference. - Last updated April 1, 2015