Continuous Speech Emotion Recognition from Audio Segments with Supervised Learning and Reinforcement Learning Approaches

Conference

2024 ASEE Annual Conference & Exposition

Location

Portland, Oregon

Publication Date

June 23, 2024

Start Date

June 23, 2024

End Date

July 12, 2024

Conference Session

DSA Technical Session 4

Tagged Topic

Data Science & Analytics Constituent Committee (DSA)

Permanent URL

https://peer.asee.org/47074

Paper Authors

Fengbo Ma, Northeastern University

Fengbo Ma is a second-year master's student in Data Analytics Engineering in the Department of Mechanical and Industrial Engineering at Northeastern University. He holds a BME from Auburn University's Mechanical Engineering Department and has gained practical experience as an Engineering Intern with the Alabama Board of Licensure for Professional Engineers and Land Surveyors (BELS).

Xuemin Jin, Northeastern University

Dr. Xuemin Jin is a teaching professor in the Department of Mechanical and Industrial Engineering at Northeastern University. He teaches two core courses in the Data Analytics Engineering graduate program: Data Management for Analytics and Data Mining in Engineering. His current research interests include emotion detection, remote sensing, and atmospheric compensation. Before joining Northeastern University, Dr. Jin was a data scientist at State Street Corporation, a principal scientist at Spectral Sciences, Inc., a software engineer at eXcelon Corp., and a scientist at SerOptics, Inc. He received his Ph.D. in physics from the University of Maryland, College Park, and held postdoctoral positions at MIT and at TRIUMF in Canada.

Abstract

Emotion plays a pivotal role in communication. When expressed appropriately, emotion not only facilitates the conveyance of a message but also enables audiences to grasp information beyond its literal meaning. Speech emotion recognition can enrich the human-computer interaction experience, allowing artificial intelligence (AI) systems to engage with humans more effectively. However, most of today's AI applications recognize emotions only after an utterance ends, so there is a pressing need for solutions in continuous speech emotion recognition.

Emotion is a complex pattern to capture and can introduce high bias into AI systems. Conventional machine learning methods such as the Support Vector Machine (SVM) have become popular for this task because of their ability to handle high-dimensional feature spaces. Deep learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) have also shown significant success in speech emotion recognition. All of these approaches, however, recognize emotions only after an utterance is complete, so they are not considered real-time recognition. Reinforcement learning (RL) offers a departure from supervised learning: its real-time decision-making process allows for exploration and evaluation of results as they unfold. This characteristic aligns closely with human cognition and makes model outcomes more interpretable, which suits RL to tasks that require a nuanced understanding of the environment and a rapid response. Our study explores the potential of applying RL to speech emotion recognition.
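To make this per-step decision process concrete, the following is a minimal sketch, not the paper's actual implementation, of how labeling a stream of audio segments could be framed as an RL environment with a discrete action space, using the Gymnasium API. The per-segment feature vectors, the 0/1 label coding, and the plus/minus-one reward scheme are all illustrative assumptions.

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class SegmentEmotionEnv(gym.Env):
    # One episode walks through the segments of one utterance; at each
    # step the agent labels the current segment (0 = neutral, 1 = anger;
    # coding assumed for illustration).
    def __init__(self, features, labels):
        self.features = features.astype(np.float32)  # (n_segments, feat_dim)
        self.labels = labels                         # per-segment labels
        self.action_space = spaces.Discrete(2)
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(features.shape[1],), dtype=np.float32)
        self.t = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.t = 0
        return self.features[0], {}

    def step(self, action):
        # Assumed reward: +1 for a correct segment label, -1 otherwise.
        reward = 1.0 if action == self.labels[self.t] else -1.0
        self.t += 1
        terminated = self.t >= len(self.features)
        obs = self.features[-1] if terminated else self.features[self.t]
        return obs, reward, terminated, False, {}

Because the action space is discrete, off-the-shelf DQN and PPO implementations (for example, stable-baselines3's DQN("MlpPolicy", env).learn(total_timesteps=...), and likewise for PPO) can train directly against such an environment.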

Our study uses the signal features inherent in audio segments. We draw audio data from the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, which provides high-quality recordings with labeled emotions, and focus on detecting two emotions: anger and neutral. We first split each recording into 50 ms segments and then use HTK-style Mel-frequency cepstral coefficient (MFCC) arrays computed from those segments as inputs to the machine learning models. We apply supervised methods including SVMs, deep neural networks (DNNs), and long short-term memory (LSTM) networks. In parallel, we construct an RL environment for emotion recognition and train agents with the Deep Q-Network (DQN) and Proximal Policy Optimization (PPO) algorithms; both are compatible with the environment's discrete action space and are trained on the same inputs. We then evaluate the performance of all approaches side by side. The RL approaches achieve accuracy and recall rates of about 70%, establishing the viability of RL in speech emotion recognition. Our study shows that 1) by splitting audio appropriately, speech emotion recognition can be achieved in real time; 2) RL approaches are suitable for real-time continuous speech emotion recognition; and 3) the comparable performance of the RL and supervised machine learning methods demonstrates RL's potential in this domain.
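As a sketch of the front end of this pipeline, the snippet below splits a recording into 50 ms segments and computes HTK-style MFCCs for each. The 16 kHz sample rate, 13 coefficients, and frame parameters are illustrative assumptions rather than the paper's exact settings; librosa's htk=True flag selects the HTK-style mel filterbank.

import librosa
import numpy as np

def segment_mfccs(path, seg_ms=50, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    seg_len = int(sr * seg_ms / 1000)   # samples per 50 ms segment
    feats = []
    for start in range(0, len(y) - seg_len + 1, seg_len):
        seg = y[start:start + seg_len]
        # HTK-style mel filterbank; 25 ms analysis frames with a 10 ms hop
        m = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=n_mfcc, htk=True,
                                 n_fft=400, hop_length=160)
        feats.append(m.flatten())       # one fixed-length vector per segment
    return np.asarray(feats)

The resulting per-segment vectors can feed the supervised baselines directly, for example sklearn.svm.SVC().fit(X_train, y_train) for the SVM, or serve as observations in an RL environment like the one sketched above.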

Ma, F., & Jin, X. (2024, June), Continuous Speech Emotion Recognition from Audio Segments with Supervised Learning and Reinforcement Learning Approaches Paper presented at 2024 ASEE Annual Conference & Exposition, Portland, Oregon. https://peer.asee.org/47074

ASEE holds the copyright on this document. It may be read by the public free of charge. Authors may archive their work on personal websites or in institutional repositories with the following citation: © 2024 American Society for Engineering Education. Other scholars may excerpt or quote from these materials with the same citation. When excerpting or quoting from Conference Proceedings, authors should, in addition to noting the ASEE copyright, list all the original authors and their institutions and name the host city of the conference. - Last updated April 1, 2015