
Multi-modal Emotion Recognition Classification

Multimodal speech emotion recognition using an Audio Spectrogram Transformer (AST, a ViT-based model) as the audio encoder and a Multi-scale Attention Network (MANet) as the visual encoder.
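A minimal sketch of how the two encoders can be combined, assuming late fusion by concatenation (the README does not state the fusion strategy; the encoder objects, feature dimensions, and class names here are hypothetical):

    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        """Concatenate audio (AST) and visual (MANet) embeddings, then classify."""
        def __init__(self, audio_encoder, visual_encoder,
                     audio_dim=768, visual_dim=1024, num_classes=6):
            super().__init__()
            self.audio_encoder = audio_encoder    # e.g. an AST returning (B, audio_dim)
            self.visual_encoder = visual_encoder  # e.g. a MANet returning (B, visual_dim)
            self.classifier = nn.Linear(audio_dim + visual_dim, num_classes)

        def forward(self, spectrogram, frames):
            a = self.audio_encoder(spectrogram)   # (B, audio_dim)
            v = self.visual_encoder(frames)       # (B, visual_dim)
            return self.classifier(torch.cat([a, v], dim=-1))

num_classes=6 matches the six emotion categories of CREMA-D (see Dataset below).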

Prerequisites

  • Python 3
  • Linux
  • PyTorch 0.4+
  • GPU with CUDA and cuDNN (a quick environment check is sketched after this list)
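A quick sanity check for these prerequisites, using only standard-library and stock PyTorch calls:

    import sys
    import torch

    print(sys.version)                           # expect Python 3.x
    print(torch.__version__)                     # expect 0.4 or newer
    print(torch.cuda.is_available())             # expect True on a CUDA machine
    print(torch.backends.cudnn.is_available())   # expect True if cuDNN is installed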

Dataset

CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset), an audio-visual corpus of acted emotional speech labeled with six emotion categories. A label-parsing sketch follows.
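CREMA-D encodes the emotion in each clip's filename (ActorID_Sentence_Emotion_Intensity, e.g. 1001_DFA_ANG_XX). A minimal parsing sketch, assuming the unzipped archive keeps the original filenames:

    from pathlib import Path

    # The six CREMA-D emotion codes mapped to class indices.
    EMOTIONS = {"ANG": 0, "DIS": 1, "FEA": 2, "HAP": 3, "NEU": 4, "SAD": 5}

    def label_from_filename(path):
        """Extract the emotion label from a name like 1001_DFA_ANG_XX.flv."""
        code = Path(path).stem.split("_")[2]
        return EMOTIONS[code]

    print(label_from_filename("1001_DFA_ANG_XX.flv"))  # -> 0 (anger)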

Pretrained Weights for the Multimodal Model

  1. Audio_Encoder
  2. Visual_Encoder
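A minimal loading sketch, assuming the downloaded checkpoints are plain PyTorch state_dicts (the file names audio_encoder.pth and visual_encoder.pth are hypothetical; use whatever the downloads are actually named):

    import torch

    # Load checkpoints onto CPU first; move the models to GPU afterwards.
    audio_state = torch.load("audio_encoder.pth", map_location="cpu")
    visual_state = torch.load("visual_encoder.pth", map_location="cpu")

    # After instantiating the encoders (see main.py), restore the weights:
    # audio_encoder.load_state_dict(audio_state)
    # visual_encoder.load_state_dict(visual_state)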

Getting Started

  • Installation:
git clone https://github.com/PVTHust/Speech_project_Vin.git
  • Download the dataset:
gdown 1CdjCD2amHDsjJFfb5OuTbIfwEikgn6u-
  • Unzip the dataset:
unzip cremad.zip
  • Train and evaluate the model:
python /content/Speech_project_Vin/main.py
(the /content/ prefix is the Google Colab working directory; adjust the path to wherever you cloned the repository)

Or, if you use Kaggle or a Jupyter notebook, you can run:

train-kaggle.ipynb

and update the dataset path in config.yaml (a minimal editing sketch follows).
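A minimal sketch of pointing the code at the unzipped dataset, assuming config.yaml stores the path under a key such as dataset_path (the key name and path are hypothetical; check the file for the actual ones; requires PyYAML):

    import yaml

    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)

    cfg["dataset_path"] = "/kaggle/input/cremad"  # hypothetical key and path

    with open("config.yaml", "w") as f:
        yaml.safe_dump(cfg, f)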
