Automating Structured Information Extraction from Images of Academic Transcripts Using Machine Learning

Declan Kirk Bracken; Sinisa Colic

Conference: 2025 ASEE Annual Conference & Exposition
Location: Montreal, Quebec, Canada
Publication Date: June 22, 2025
Start Date: June 22, 2025
End Date: August 15, 2025
Conference Session: DSAI Technical Session 6: Academic Success, Performance & Complexity
Tagged Division: Data Science and Artificial Intelligence (DSAI) Constituent Committee
Page Count: 11
DOI: 10.18260/1-2--55493
Permanent URL: https://peer.asee.org/55493

Paper Authors

Declan Kirk Bracken University of Toronto

Declan Bracken is an M.Eng. student in the Department of Mechanical and Industrial Engineering at the University of Toronto, pursuing an emphasis in Analytics. This paper is the final product of an 8-month M.Eng. project supervised by Professor Sinisa Colic, and its work is intended for implementation in the admissions process of the University of Toronto's MIE department.

Sinisa Colic Ph.D. University of Toronto

Dr. Colic is an Assistant Professor, Teaching Stream, in the Department of Mechanical and Industrial Engineering. He completed his PhD at the University of Toronto in the area of personalized treatment options for epilepsy using advanced signal processing techniques and machine learning. Dr. Colic currently teaches several courses at the University of Toronto covering a broad range of topics in mechatronics, data science, machine learning, and deep learning.

Abstract

The admissions process at the University of Toronto requires staff to spend many hours manually reviewing images of student transcripts in order to make critical decisions about applicants' academic futures. Academic transcript images are tedious to read and transcribe because of their many visual features, such as colored backgrounds, watermarks, multi-column layouts, and small text. To streamline this process, this report investigates the development of an AI system designed specifically for transcribing grade data from academic transcript images into organized tables. While models for table extraction are not novel, existing methods perform poorly on academic transcripts because of these unique features and because transcripts are underrepresented in the pre-existing datasets used for training. To our knowledge, this report presents the first labeled, open-source dataset consisting purely of academic transcript images for training computer-vision-based machine learning algorithms. Two primary approaches to image-to-text table reconstruction were explored. The first is a pipeline comprising a YOLOv8 object detection model, the Tesseract OCR engine, and a Mistral7b large language model (LLM). The second is a fine-tuned multimodal language model (MiniCPM-Llama3-V-2_5). The multimodal LLM showed superior accuracy on a small test set, and a multi-stage prompting strategy further improved its recall on images with more complex multi-column layouts. Future work could improve on this solution by using the trained YOLOv8 object detection model as a preprocessing step and by continuing to develop the dataset with a greater diversity of images and prompting formats. Additionally, because the number of transcript formats in circulation is finite, it is hypothesized that a larger, more inclusive dataset could be used to train a high-precision model with near-universal applicability within the target domain.
This work forms the foundation of future analytics projects at the University of Toronto, providing a platform with which admissions data may be used to predict student success and to better track student progress over their academic careers.
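
The first approach described in the abstract chains three stages: region detection, OCR, and LLM-based structuring. The following is a minimal sketch of how such a pipeline might be wired together, not the authors' implementation. The function `extract_grades`, the `GradeRow` type, the stage signatures, and the stub stages are all hypothetical; in practice the three callables would wrap a YOLOv8 detector, the Tesseract engine, and an LLM. The column-major reading order is also an assumption, made here to handle multi-column layouts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (left, top, right, bottom) in pixels; hypothetical region format.
Box = Tuple[int, int, int, int]


@dataclass
class GradeRow:
    course: str
    grade: str


# Hypothetical stage signatures. Real stages would wrap an object
# detector, an OCR engine, and a language model respectively.
Detector = Callable[[object], List[Box]]
Ocr = Callable[[object, Box], str]
Structurer = Callable[[str], List[GradeRow]]


def extract_grades(image, detect: Detector, ocr: Ocr,
                   structure: Structurer) -> List[GradeRow]:
    """Run detect -> OCR -> structure over one transcript image."""
    rows: List[GradeRow] = []
    # Assumption: each detected region is one table inside a column, so
    # reading left column first, then top to bottom, preserves order.
    for box in sorted(detect(image), key=lambda b: (b[0], b[1])):
        text = ocr(image, box)
        rows.extend(structure(text))
    return rows


# Stub stages standing in for the real models, for demonstration only.
def fake_detect(image) -> List[Box]:
    return [(400, 50, 780, 300), (10, 50, 390, 300)]


def fake_ocr(image, box: Box) -> str:
    return {10: "MIE100 A-\nMIE201 B+", 400: "MIE301 A"}[box[0]]


def fake_structure(text: str) -> List[GradeRow]:
    out = []
    for line in text.splitlines():
        course, grade = line.rsplit(" ", 1)
        out.append(GradeRow(course, grade))
    return out


rows = extract_grades(None, fake_detect, fake_ocr, fake_structure)
for row in rows:
    print(row.course, row.grade)
```

Passing the stages in as callables keeps the sketch independent of any particular model library, so each stage can be swapped or tested in isolation.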

Citation

Bracken, D. K., & Colic, S. (2025, June). Automating Structured Information Extraction from Images of Academic Transcripts Using Machine Learning. Paper presented at 2025 ASEE Annual Conference & Exposition, Montreal, Quebec, Canada. https://doi.org/10.18260/1-2--55493

TY  - CPAPER
AU  - Declan Kirk Bracken
AU  - Sinisa Colic Ph.D.
CY  - Montreal, Quebec, Canada 
DA  - 2025/06/22
PB  - ASEE Conferences
TI  - Automating Structured Information Extraction from Images of Academic Transcripts Using Machine Learning
UR  - https://peer.asee.org/55493
DO  - 10.18260/1-2--55493
ER  -