Baseline Features

Emotion recognition from audiovisual signals usually relies on feature sets whose extraction is based on knowledge gained over several decades of research in speech processing and computer vision. Along with the recent trend of representation learning, whose objective is to learn representations of data that are best suited for the recognition task, there has been noticeable effort in affective computing to learn representations of audio/visual data in the context of emotion.

There are three different levels of supervision in the way expert knowledge is exploited at the feature extraction step:

  1. Supervised: Expert knowledge
  2. Semi-supervised: Bags-of-X-words
  3. Unsupervised: Deep Spectrum

Supervised

The traditional approach in time-continuous emotion recognition consists in summarising low-level descriptors (LLDs) of speech and video data over time with a set of statistical measures (functionals) computed over a fixed-duration sliding window. These descriptors usually include spectral, cepstral, prosodic, and voice quality information for the audio channel, and appearance and geometric information for the video channel. A sketch of this windowed summarisation is given after the example feature sets below.

e.g.

  • ComParE (openSMILE)
  • FAUs (OpenFace)
  • eGeMAPS (openSMILE)
  • MFCCs (openSMILE)

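Below is a minimal sketch of the windowed summarisation described above, assuming NumPy and frame-level LLDs already extracted (e.g. with openSMILE). The choice of functionals and the window/hop sizes are illustrative, not the exact ComParE or eGeMAPS configuration.

```python
import numpy as np

def functionals(window):
    # window: (n_frames, n_lld) block of frame-level low-level descriptors
    return np.concatenate([
        window.mean(axis=0),   # arithmetic mean
        window.std(axis=0),    # standard deviation
        window.min(axis=0),    # minimum
        window.max(axis=0),    # maximum
    ])

def windowed_features(llds, win_len=400, hop=40):
    # Slide a fixed-duration window (lengths given in frames) over the LLD
    # sequence and summarise each window with the statistical functionals above
    starts = range(0, len(llds) - win_len + 1, hop)
    return np.stack([functionals(llds[s:s + win_len]) for s in starts])
```
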
Semi-supervised

The technique of bags-of-words (BoW), which originates from text processing, can be seen as semi-supervised representation learning, because it represents the distribution of LLDs according to a dictionary learned from them. To generate the XBoW representations, both the acoustic and the visual features are processed and summarised over blocks of fixed duration; a sketch of this quantisation step follows the list below.

e.g.

  • BoAW (openXBOW)
  • BoVW (openXBOW)

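Below is a minimal sketch of a bag-of-audio-words (BoAW) representation, shown here with scikit-learn k-means in place of the openXBOW toolkit; the codebook size and block lengths are illustrative assumptions. The same code yields BoVW when visual LLDs are passed in.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_codebook(lld_frames, n_words=100, seed=0):
    # lld_frames: (n_frames, n_lld) matrix of LLDs pooled from the training data
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(lld_frames)

def boaw_block(codebook, block_llds):
    # Assign each LLD frame to its nearest "word" and return the normalised
    # histogram of word counts for the block
    words = codebook.predict(block_llds)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

def boaw_sequence(codebook, llds, block_len=400, hop=100):
    # Summarise the LLD sequence over blocks of fixed duration (in frames)
    starts = range(0, len(llds) - block_len + 1, hop)
    return np.stack([boaw_block(codebook, llds[s:s + block_len]) for s in starts])
```
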
Unsupervised

Deep Spectrum features were first introduced for snore sound classification and are extracted using a deep representation learning paradigm heavily inspired by image processing. To generate Deep Spectrum features, the speech files are first transformed into mel-spectrogram images using Hanning windows, with the power spectral density computed on the dB power scale. These plots are then scaled and cropped to square images of 227 × 227 pixels, without axes and margins, to comply with the input requirements of AlexNet, a deep CNN pre-trained for image classification. The spectrogram images are then forwarded through AlexNet, and 4096-dimensional feature vectors are extracted from the activations of its second fully-connected layer.

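Below is a minimal sketch of Deep Spectrum-style extraction, assuming librosa, matplotlib, and torchvision are available; the colormap, spectrogram parameters, and the use of torchvision's AlexNet are assumptions rather than the exact Deep Spectrum toolkit recipe.

```python
import librosa
import numpy as np
import torch
import torch.nn.functional as F
from matplotlib import cm
from torchvision import models

def deep_spectrum_features(wav_path, sr=16000):
    # 1. Mel-spectrogram with Hanning windows, power expressed on the dB scale
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=512, window="hann")
    mel_db = librosa.power_to_db(mel, ref=np.max)

    # 2. Map the spectrogram to an RGB image and resize it to 227 x 227 pixels,
    #    with no axes or margins
    norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)
    rgb = cm.viridis(norm)[..., :3]                        # (H, W, 3) in [0, 1]
    img = torch.from_numpy(rgb).permute(2, 0, 1).float()   # (3, H, W)
    img = F.interpolate(img[None], size=(227, 227),
                        mode="bilinear", align_corners=False)

    # 3. Forward the image through a pre-trained AlexNet and keep the
    #    activations of the second fully-connected layer (4096-D)
    alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
    with torch.no_grad():
        x = alexnet.features(img)
        x = alexnet.avgpool(x)
        x = torch.flatten(x, 1)
        x = alexnet.classifier[:6](x)   # up to and including the 2nd FC layer
    return x.squeeze(0).numpy()         # 4096-dimensional feature vector
```
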
How Yang2018, Du2018, Xing2018, and Syed2018 use the baseline features

methods     acoustic features     visual features
Yang2018    arousal hist          hands dist hist
            audio LLDs            body HDR
            pause/rate hist       action units
Xing2018    eGeMAPS + MFCCs       AUs MHH
            topic-level feat      AUs
                                  eyesight feat
                                  emotion feat
                                  body movement

Baseline System

The baseline recognition system of the Bipolar Disorder Sub-challenge (BDS) consists of a late fusion of the best-performing audio and video representations using a linear SVM trained with the LIBLINEAR toolkit; training instances of the minority classes are duplicated to balance them with the majority class, and the type of solver and the value of the complexity parameter C are optimised by a grid search, using a logarithmic scale for the latter. A sketch of this pipeline is given below.
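
Below is a minimal sketch of such a system, using scikit-learn's LinearSVC (which wraps LIBLINEAR). The fusion rule (averaged decision scores), the grid range for C, and the helper names are illustrative assumptions, not the official baseline recipe.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def train_branch(X, y):
    # Duplicate minority-class instances so every class matches the majority class
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = np.concatenate([np.resize(np.where(y == c)[0], n_max) for c in classes])
    X_bal, y_bal = X[idx], y[idx]

    # Grid search over the complexity C on a logarithmic scale
    grid = GridSearchCV(LinearSVC(max_iter=10000),
                        param_grid={"C": np.logspace(-5, 1, 7)},
                        cv=3)
    grid.fit(X_bal, y_bal)
    return grid.best_estimator_

def late_fusion_predict(audio_model, video_model, X_audio, X_video):
    # Late fusion: average the per-class decision scores of the two branches
    # (assumes a multi-class task, e.g. the three mania levels of the BDS)
    scores = (audio_model.decision_function(X_audio) +
              video_model.decision_function(X_video)) / 2.0
    return audio_model.classes_[scores.argmax(axis=1)]
```

LIBLINEAR's different solver types can be approximated here by varying LinearSVC's loss and dual parameters, which is one way to mimic the solver-type search mentioned above.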