Abstract
Audio classification comprises a set of important and challenging tasks that group speech signals according to speakers’ identities, accents, and emotional states. Owing to the high dimensionality of audio data, task-specific hand-crafted feature extraction is always required, and it is regarded as cumbersome across audio classification tasks. More importantly, the inherent relationships among features have not been fully exploited. In this paper, the original speech signal is first represented as a spectrogram and then split along the frequency domain to form a frequency-distributed spectrogram. This paper proposes a task-independent model, called FreqCNN, to automatically extract distinctive features from each frequency band using convolutional kernels. Furthermore, an attention mechanism is introduced to systematically enhance the features from certain frequency bands. The proposed FreqCNN is evaluated on three publicly available speech databases through three independent classification tasks. The results demonstrate superior performance over the state of the art.
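The pipeline outlined in the abstract (split a spectrogram into frequency bands, derive features per band, then re-weight the bands with attention) can be illustrated with a minimal sketch. This is not the FreqCNN implementation: the function names `split_bands`, `band_feature`, `softmax`, and `attend` are hypothetical, the mean-magnitude feature is a stand-in for the learned convolutional features, and the attention scores, which a real model would learn, are supplied by hand.

```python
import math


def split_bands(spectrogram, n_bands):
    """Split a spectrogram into contiguous frequency bands.

    `spectrogram` is a list of frequency-bin rows, each row a list of
    time-frame magnitudes. Bands differ in size by at most one bin.
    """
    n_bins = len(spectrogram)
    if n_bands <= 0 or n_bands > n_bins:
        raise ValueError("n_bands must be between 1 and the number of bins")
    base, extra = divmod(n_bins, n_bands)
    bands, start = [], 0
    for i in range(n_bands):
        size = base + (1 if i < extra else 0)
        bands.append(spectrogram[start:start + size])
        start += size
    return bands


def band_feature(band):
    """Assumed placeholder feature: mean magnitude of the band.

    Stands in for the features a convolutional kernel would extract.
    """
    total = sum(sum(row) for row in band)
    count = sum(len(row) for row in band)
    return total / count


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def attend(features, scores):
    """Scale each band feature by its softmax attention weight."""
    weights = softmax(scores)
    weighted = [w * f for w, f in zip(weights, features)]
    return weighted, weights


# Toy spectrogram: 8 frequency bins x 4 time frames.
spec = [[float(f + t) for t in range(4)] for f in range(8)]
bands = split_bands(spec, 4)
features = [band_feature(b) for b in bands]
# Hand-picked scores (an assumption); a higher score emphasises that band.
weighted, weights = attend(features, [0.0, 0.0, 1.0, 0.0])
```

In this toy run the third band receives the largest weight, which mirrors the idea of enhancing features from certain frequency bands while keeping the others in play.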
Original language | English |
---|---|
Pages (from-to) | 90-100 |
Number of pages | 11 |
Journal | Knowledge-Based Systems |
Volume | 161 |
Early online date | 26 Jul 2018 |
Publication status | Published - 1 Dec 2018 |
Keywords
- Audio classification
- Spectrograms
- Convolutional neural networks
- Attention mechanism