Continuous Speech Emotion Recognition from Audio Segments with Supervised Learning and Reinforcement Learning Approaches

Conference

2024 ASEE Annual Conference & Exposition

Location

Portland, Oregon

Publication Date

June 23, 2024

Start Date

June 23, 2024

End Date

July 12, 2024

Conference Session

DSA Technical Session 4

Tagged Topic

Data Science & Analytics Constituent Committee (DSA)

Permanent URL

https://peer.asee.org/47074

Paper Authors

Fengbo Ma, Northeastern University

Fengbo Ma is a second-year master's student in Data Analytics Engineering in the Department of Mechanical and Industrial Engineering at Northeastern University. He holds a BME from Auburn University's Mechanical Engineering Department and has gained practical experience as an Engineering Intern with the Alabama Board of Licensure for Professional Engineers and Land Surveyors (BELS).

Xuemin Jin, Northeastern University

Dr. Xuemin Jin is a teaching professor in the Department of Mechanical and Industrial Engineering at Northeastern University. He teaches two core courses in the Data Analytics Engineering graduate program: Data Management for Analytics and Data Mining in Engineering. His current research interests include emotion detection, remote sensing, and atmospheric compensation. Before joining Northeastern University, Dr. Jin was a data scientist at State Street Corporation, a principal scientist at Spectral Sciences, Inc., a software engineer at eXcelon Corp., and a scientist at SerOptics, Inc. He received his Ph.D. in physics from the University of Maryland, College Park, and held postdoctoral positions at MIT and at TRIUMF in Canada.

Abstract

Emotion plays a pivotal role in communication. When expressed appropriately, emotion not only facilitates the conveyance of a message but also enables audiences to grasp information beyond its literal meaning. Speech emotion recognition can enrich the human-computer interaction experience, allowing artificial intelligence (AI) systems to engage with humans more effectively. However, most of today's AI applications recognize emotions only after an utterance ends, so there is a pressing need for solutions in continuous speech emotion recognition.

Emotion is a complex pattern to capture and can introduce high bias into AI systems. Conventional machine learning methods such as the Support Vector Machine (SVM) have become popular for this task because of their ability to handle high-dimensional feature spaces. Deep learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have also shown significant success in speech emotion recognition. All of these approaches, however, recognize emotions only after an utterance is complete, so they are not considered real-time recognition. Reinforcement learning (RL) offers a departure from supervised learning: its real-time decision-making process allows for exploration and evaluation of results as they unfold. This characteristic aligns closely with human cognition and makes model outcomes more interpretable, which suits RL to tasks that require a nuanced understanding of the environment and a rapid response. Our study explores the potential of applying RL to speech emotion recognition.
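To make this per-step decision process concrete, the following is a minimal sketch, not the paper's actual implementation, of how labeling a stream of audio segments could be framed as an RL environment with a discrete action space, using the Gymnasium API. The per-segment feature vectors, the 0/1 label coding, and the plus/minus-one reward scheme are all illustrative assumptions.

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class SegmentEmotionEnv(gym.Env):
    # One episode walks through the segments of one utterance; at each
    # step the agent labels the current segment (0 = neutral, 1 = anger;
    # coding assumed for illustration).
    def __init__(self, features, labels):
        self.features = features.astype(np.float32)  # (n_segments, feat_dim)
        self.labels = labels                         # per-segment labels
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(features.shape[1],), dtype=np.float32)
        self.t = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self.features[0], {}

    def step(self, action):
        # Assumed reward: +1 for a correct segment label, -1 otherwise.
        reward = 1.0 if action == self.labels[self.t] else -1.0
        self.t += 1
        terminated = self.t >= len(self.features)
        obs = self.features[-1] if terminated else self.features[self.t]
        return obs, reward, terminated, False, {}

Because the action space is discrete, off-the-shelf DQN and PPO implementations (for example, stable-baselines3's DQN("MlpPolicy", env).learn(total_timesteps=...), and likewise for PPO) can train directly against such an environment.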

Our study uses the signal features inherent in audio segments. We draw audio data from the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which provides high-quality recordings with labeled emotions, and focus on detecting two emotions: anger and neutral. We first split each recording into 50 ms segments and then use HTK-style Mel-frequency cepstral coefficient (MFCC) arrays computed from those segments as inputs to the machine learning models. We apply supervised methods including SVMs, deep neural networks (DNNs), and long short-term memory (LSTM) networks. In parallel, we construct an RL environment for emotion recognition and train agents with the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms; both are compatible with the environment's discrete action space and are trained on the same inputs. We then evaluate the performance of all approaches side by side. The RL approaches achieve accuracy and recall rates of about 70%, establishing the viability of RL in speech emotion recognition. Our study shows that 1) by splitting audio appropriately, speech emotion recognition can be achieved in real time; 2) RL approaches are suitable for real-time continuous speech emotion recognition; and 3) the comparable performance of the RL and supervised machine learning methods demonstrates RL's potential in this domain.
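As a sketch of the front end of this pipeline, the snippet below splits a recording into 50 ms segments and computes HTK-style MFCCs for each. The 16 kHz sample rate, 13 coefficients, and frame parameters are illustrative assumptions rather than the paper's exact settings; librosa's htk=True flag selects the HTK-style mel filterbank.

import librosa
import numpy as np

def segment_mfccs(path, seg_ms=50, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    seg_len = int(sr * seg_ms / 1000)   # samples per 50 ms segment
    feats = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        # HTK-style mel filterbank; 25 ms analysis frames with a 10 ms hop
        m = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc, htk=True,
                                 n_fft=400, hop_length=160)
        feats.append(m.flatten())       # one fixed-length vector per segment
    return np.asarray(feats)

The resulting per-segment vectors can feed the supervised baselines directly, for example sklearn.svm.SVC().fit(X_train, y_train) for the SVM, or serve as observations in an RL environment like the one sketched above.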

Ma, F., & Jin, X. (2024, June), Continuous Speech Emotion Recognition from Audio Segments with Supervised Learning and Reinforcement Learning Approaches Paper presented at 2024 ASEE Annual Conference & Exposition, Portland, Oregon. https://peer.asee.org/47074

ASEE holds the copyright on this document. It may be read by the public free of charge. Authors may archive their work on personal websites or in institutional repositories with the following citation: © 2024 American Society for Engineering Education. Other scholars may excerpt or quote from these materials with the same citation. When excerpting or quoting from Conference Proceedings, authors should, in addition to noting the ASEE copyright, list all the original authors and their institutions and name the host city of the conference. - Last updated April 1, 2015