The proposed method offers a potential solution to the high cost of composing and recording original music for films.
Our model is capable of recognizing a scene's dominant emotion by analyzing factors such as body language, facial expressions, lip reading, and background color. Using this emotion, the model generates a new piece of music suited to the scene. First, we detect the emotion the video is trying to portray using four models:
- facial expression detection: we first created and trained a facial emotion recognition model based on the VGG-16 architecture. We then used MediaPipe, an open-source framework for building cross-platform machine learning pipelines for perception tasks such as object detection, tracking, and facial recognition (Lugaresi et al., 2019), to locate the face in a given video, and applied the model to the detected face to recognize its emotion.
- background hue recognition: we extract the color values of each pixel in the video, compute the average color of the video by averaging those values, and then assign that color to a specific emotion according to a predefined color-to-emotion mapping (see 4).
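The background hue step lends itself to a short illustration. The following is a minimal sketch, assuming OpenCV and NumPy are available; the function names and the hue-to-emotion thresholds are illustrative placeholders, not the project's actual mapping.

import cv2
import numpy as np

def average_frame_color(video_path, frame_step=10):
    # Mean BGR color over sampled frames of the video.
    cap = cv2.VideoCapture(video_path)
    colors, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            colors.append(frame.reshape(-1, 3).mean(axis=0))
        idx += 1
    cap.release()
    return np.mean(colors, axis=0)

def color_to_emotion(bgr):
    # Map the average color to an emotion via its hue (thresholds are illustrative).
    hsv = cv2.cvtColor(np.uint8([[bgr]]), cv2.COLOR_BGR2HSV)[0, 0]
    hue = int(hsv[0])  # OpenCV hue range is 0-179
    if hue < 15 or hue >= 160:   # reds
        return 'angry'
    if hue < 35:                 # yellows / oranges
        return 'happy'
    if hue < 100:                # greens
        return 'calm'
    return 'sad'                 # blues / purples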
The detection models are trained as follows:
- Facial expression recognition: using the FER2013 dataset and implementing a VGG-16 neural network architecture.
- Body language recognition: collecting keypoints with the MediaPipe Holistic model, then training the model with 30 frames per action (see the sketch after this list).
- Lip reading: collecting frames of the lower face, then training the model with 75 frames per action.
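As a rough illustration of the body-language step, here is a minimal sketch of turning a clip into a 30-frame keypoint sequence with MediaPipe Holistic, assuming OpenCV and NumPy; the exact feature layout expected by BodySentimentModel is an assumption, not taken from the repository.

import cv2
import numpy as np
import mediapipe as mp

mp_holistic = mp.solutions.holistic
SEQUENCE_LENGTH = 30  # frames per action, as noted above

def extract_keypoints(results):
    # Flatten pose and hand landmarks into one feature vector (zeros if a part is missing).
    pose = (np.array([[lm.x, lm.y, lm.z, lm.visibility] for lm in results.pose_landmarks.landmark]).flatten()
            if results.pose_landmarks else np.zeros(33 * 4))
    lh = (np.array([[lm.x, lm.y, lm.z] for lm in results.left_hand_landmarks.landmark]).flatten()
          if results.left_hand_landmarks else np.zeros(21 * 3))
    rh = (np.array([[lm.x, lm.y, lm.z] for lm in results.right_hand_landmarks.landmark]).flatten()
          if results.right_hand_landmarks else np.zeros(21 * 3))
    return np.concatenate([pose, lh, rh])

def collect_sequence(video_path):
    # Return a (SEQUENCE_LENGTH, n_features) array of keypoints for one clip.
    cap = cv2.VideoCapture(video_path)
    frames = []
    with mp_holistic.Holistic(min_detection_confidence=0.5, min_tracking_confidence=0.5) as holistic:
        while len(frames) < SEQUENCE_LENGTH:
            ok, frame = cap.read()
            if not ok:
                break
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            frames.append(extract_keypoints(results))
    cap.release()
    return np.array(frames)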
Here are some examples of the emotions detected by our model and the corresponding music generated:
Click on the thumbnail to watch the Sound-IT demo.
Getting Started

To get started with Sound-IT, you can clone our repository and follow the instructions in the README file.
To run the code:
- First, run pip install -r requirements.txt to install all the required packages.
- Then change the model paths in the UISOUND folder in the following files:
  - allinference.py
  - inference.py
  - inferenceCam.py
For example:

# Facial emotion recognition weights (VGG-based models)
weights_1 = '/Users/kevynkrancenblum/Desktop/Data Science/Final Project/Facial_emotion_recognition/saved_models/vggnet.h5'
weights_2 = '/Users/kevynkrancenblum/Desktop/Data Science/Final Project/Facial_emotion_recognition/saved_models/vggnet_up.h5'

# Body language recognition models
model_V1 = BodySentimentModel(body_input_shape, actions.shape[0])
model_V1.load_weights('/Users/kevynkrancenblum/Desktop/Data Science/Final Project/Body_Language_recognition/modelsSaved/BodyModelCamv1.h5')

model_V2 = BodySentimentModel(body_input_shape, actions.shape[0])
model_V2.load_weights('/Users/kevynkrancenblum/Desktop/Data Science/Final Project/Body_Language_recognition/modelsSaved/BodyModelCamv2.h5')
Change these paths to wherever your models are located.
To train your own emotion or action recognition based on body language, go to Body_language_recognition/streamlitRecording.py and change the code where you need to add or remove emotions or actions, as sketched below. To train your own emotion or facial micro-emotion recognition, first add your own data and then run the model (important: because this is micro-emotion recognition, you will need a significant amount of data).
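As an illustration only, adding or removing body-language classes typically comes down to editing the actions array that the recorded sequences are labeled with; the class names below are placeholders, not the project's actual labels.

import numpy as np

# Edit this array in Body_language_recognition/streamlitRecording.py to add or
# remove the emotions/actions you want to record and train on (names are examples).
actions = np.array(['happy', 'sad', 'angry', 'surprised'])

# The model's output size must then match the number of classes, e.g.
# BodySentimentModel(body_input_shape, actions.shape[0])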
For lip reading, run the code in LipReading/lipnet.ipynb to download the model weights.
You can also train on your own language and your own sentences by creating your own dataset of videos with text alignments. Note that the model is a CNN+RNN architecture, which means the recurrent network expects a predefined sentence length; here, 75 frames is the sentence length for every video and alignment, and the model will not work otherwise.
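As a minimal sketch of that constraint, the helper below pads or trims a stack of mouth-region frames to a fixed length of 75; the function name and array layout are illustrative assumptions, not code from the repository.

import numpy as np

SENTENCE_LENGTH = 75  # fixed number of frames per video/alignment, as noted above

def pad_or_trim(frames):
    # Force a (T, H, W) stack of mouth-region frames to exactly SENTENCE_LENGTH frames.
    t = frames.shape[0]
    if t >= SENTENCE_LENGTH:
        return frames[:SENTENCE_LENGTH]
    pad = np.zeros((SENTENCE_LENGTH - t,) + frames.shape[1:], dtype=frames.dtype)
    return np.concatenate([frames, pad], axis=0)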
Contributing

We welcome contributions from the community. If you have any suggestions or would like to contribute, please open an issue or pull request on our GitHub repository.