2021 COSE461 Team 14
@noparkee @Xenor99 @jooeun9199
Uses not only audio features but also text features.
See our paper: `Speech Emotion Recognition with Text Features.pdf`
An overview of our proposed model, which consists of three featurizers and one classifier.
- Python3
- PyTorch
- librosa-0.8.1
- Uses only 4 labels
- Merges the happy class with the excitement class
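The merge can be expressed as a simple label map. This is a minimal sketch; the raw label strings below are illustrative assumptions, not necessarily the dataset's exact names:

```python
# Map raw emotion labels to the 4 classes used here.
# "excitement" is folded into the same class as "happy".
LABEL_MAP = {
    "neutral": 0,
    "happy": 1,
    "excitement": 1,  # merged with happy
    "sad": 2,
    "angry": 3,
}

def to_class_id(label: str) -> int:
    """Return the 4-class id for a raw emotion label."""
    return LABEL_MAP[label]
```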
Text features can improve speech emotion recognition (SER) accuracy.
| Model | UA (%) | WA (%) |
|---|---|---|
| (1): audio | 51.47 | 52.75 |
| (2): audio + text | 68.29 | 69.20 |
| (3): audio + image | 51.01 | 53.12 |
| (4): audio + text + image | 68.20 | 71.02 |
- Model (1) uses only audio features
- Model (2) uses audio features and text features
- Model (3) uses audio features and image features
- Model (4) uses audio features, text features, and image features
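A minimal sketch of how per-modality feature vectors might be fused into a single classifier, as in model (4). The layer sizes and feature dimensions here are illustrative assumptions; the repo's actual featurizers and classifier live in `network.py` and `model.py`:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Concatenate audio/text/image feature vectors, then classify.

    Dimensions are placeholders, not the repo's real hyperparameters.
    """
    def __init__(self, audio_dim=128, text_dim=768, image_dim=512, n_classes=4):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, audio_feat, text_feat, image_feat):
        # Late fusion: concatenate along the feature dimension.
        fused = torch.cat([audio_feat, text_feat, image_feat], dim=-1)
        return self.classifier(fused)
```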
- `make_description.py`: defines data paths and aligns the data -> creates `description.pkl`
- `make_mel_spectrogram.py`: creates mel-spectrogram images and records each image's path (edits `description.pkl`)
- `make_audio.py`: extracts MFCC vectors from the wav files -> creates `audio.pkl`
- `make_data.py`: joins `description.pkl` with `audio.pkl` -> creates `data.pkl`
- `train_model.py`: main training script
- `model.py`: defines our model
- `network.py`: defines the featurizers
- `data.py`: dataloader
- `utils.py`: sets seeds, calculates scores
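The UA and WA metrics reported above can be computed as follows. This is a minimal sketch of the scoring, not the repo's exact implementation in `utils.py`; the function names are illustrative:

```python
import numpy as np

def weighted_accuracy(y_true, y_pred):
    """WA: overall fraction of correct predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so every class counts equally
    regardless of how many samples it has."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
    return float(np.mean(recalls))
```

UA penalizes models that ignore rare classes, which is why both metrics are reported for imbalanced emotion data.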
- SOTA
- Uses only audio data
- BiLSTM + attention
Multimodal Speech Emotion Recognition and Ambiguity Resolution
- Uses both audio and text data
- Audio data: 8 hand-crafted features
- Text data: TF-IDF
- ML models, a simple LSTM, etc.
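For reference, TF-IDF weighting can be sketched in a few lines. This is a toy version (tf = count / doc length, idf = log(N / df)); library implementations such as scikit-learn's use smoothed variants:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, vectors): one TF-IDF vector per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many docs each term appears.
    df = Counter(t for toks in tokenized for t in set(toks))
    vocab = sorted(df)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            [(tf[t] / len(toks)) * math.log(n / df[t]) for t in vocab]
        )
    return vocab, vectors
```

Terms that appear in every document get an idf of zero, so they carry no weight.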