TY - JOUR
T1 - Cluster search optimisation of deep neural networks for audio emotion classification
AU - Slade, Sam
AU - Zhang, Li
AU - Asadi, Houshyar
AU - Lim, Chee Peng
AU - Yu, Yonghong
AU - Zhao, Dezong
AU - Panesar, Arjun
AU - Wu, Philip Fei
AU - Gao, Rong
PY - 2025/4/8
Y1 - 2025/4/8
N2 - Automated patient monitoring solutions greatly benefit from audio emotion classification, although the considerable variance in individual expression and interpretation of emotions poses a challenge. Current approaches often employ standard Audio Spectrogram Transformer (AST) and deep learning models such as Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN)-based networks. However, their performance can be enhanced by integrating neural architecture search techniques using swarm optimisation algorithms. In this research, we explore AST with hyperparameter optimisation for speech emotion recognition. Three deep learning architectures with optimisable τb-block structures and variable filter numbers, i.e. 1DCNN, bidirectional LSTM (BiLSTM) and CNN-BiLSTM, are also proposed, enabling the optimisation of network depth and width. A novel Cluster Search Optimisation (CSO) algorithm is introduced. It incorporates Cluster Centroid Search, a Cluster Distance Improvement metric and reinforcement learning, dispatching different search actions based on clustering convergence and Q-learning strategies. A novel Noise Tempered K-means (NTKM) clustering model is also proposed, integrating Gaussian-based noise insertion and cluster compactness-separation measurement to further fine-tune the cluster centroids obtained using OPTICS clustering. CSO is used for hyperparameter and architecture search for AST and the aforementioned deep networks. Attention mechanisms are also integrated with CSO-optimised networks to further enhance feature learning. We evaluate the resulting models against those devised by other optimisation algorithms across the EMO-DB, SAVEE, and TESS datasets. The empirical results demonstrate that CSO-optimised AST and CNN-BiLSTM with attention mechanisms outperform other architectures and yield favourable comparison results against those from existing state-of-the-art audio emotion classification methods.
KW - Audio emotion classification
KW - Deep neural network
KW - Hyperparameter optimisation
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85219586343&partnerID=8YFLogxK
U2 - 10.1016/j.knosys.2025.113223
DO - 10.1016/j.knosys.2025.113223
M3 - Article
AN - SCOPUS:85219586343
SN - 0950-7051
VL - 314
JO - Knowledge-Based Systems
JF - Knowledge-Based Systems
M1 - 113223
ER -