A simple model that detects specific human actions from just a few training samples, intended to be used as a live sign language translator.
Using MediaPipe's Holistic model, we can acquire keypoints from a human subject's hands, shoulders, and face.
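For context, the sketch below (a minimal, hypothetical example, not the project's own utilities) shows how MediaPipe Holistic can extract these keypoints from a single frame; the flattening into one fixed-length vector is an assumption about how the inputs are arranged:

# Minimal, hypothetical sketch of keypoint extraction with MediaPipe Holistic
# (not part of this repository's utils; shown only to illustrate the inputs).
import cv2
import mediapipe as mp
import numpy as np

def extract_keypoints(frame_bgr):
    with mp.solutions.holistic.Holistic(static_image_mode=True) as holistic:
        results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

    def flatten(landmarks, count):
        # Missing detections are padded with zeros so the vector length stays fixed
        if landmarks is None:
            return np.zeros(count * 3)
        return np.array([[lm.x, lm.y, lm.z] for lm in landmarks.landmark]).flatten()

    # 33 pose landmarks (these include the shoulders), 21 per hand, 468 face landmarks
    return np.concatenate([
        flatten(results.pose_landmarks, 33),
        flatten(results.left_hand_landmarks, 21),
        flatten(results.right_hand_landmarks, 21),
        flatten(results.face_landmarks, 468),
    ])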
These keypoints are the input to our Convolutional Neural Network (CNN) model. The model structure is very basic, as shown below:
Install the required packages
pip3 install -r requirements.txt
Using utils.CaptureUtils, you can automatically capture video data with specific parameters for training this model. Note that all training data should be in a single directory, and each video file should follow the filename format {class}-{data_id}.mp4; CaptureUtils handles this for you.
from utils import CaptureUtils

cap = CaptureUtils()
for n in range(num_sample):
    # Captures one video sample of the action and saves it as {action}-{n + 1}.mp4 in data_save_path
    cap.capture_action(action=action, frame_count=frame_count, save_path=data_save_path, samp_num=str(n + 1))
Creating a CustomDataset is straightforward. You can use the following code for reference:
dataset = CustomDataset(data_path=data_save_path, save_dataset=is_save_data, preprocessed_dataset_path=preprocessed_dataset_path)
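The training call below expects separate train, validation, and test splits (train_ds, val_ds, test_ds). How you build them depends on the CustomDataset interface; as one hypothetical approach, if the dataset behaves like an indexable PyTorch-style dataset, it could be split with torch.utils.data.random_split:

# Hypothetical split; assumes CustomDataset is an indexable, PyTorch-style dataset.
import torch
from torch.utils.data import random_split

n_total = len(dataset)
n_train = int(0.7 * n_total)          # 70% for training
n_val = int(0.15 * n_total)           # 15% for validation
n_test = n_total - n_train - n_val    # remainder for testing

train_ds, val_ds, test_ds = random_split(
    dataset, [n_train, n_val, n_test], generator=torch.Generator().manual_seed(42)
)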
You can create and train an ActionDetector model using the following code:
detector = ActionDetector(class_mapping=dataset.class_mapping, frames_interval=dataset.frames_interval, include_pose=dataset.include_pose, pose_positions=dataset.pose_positions,
                          model_type=model_type, initial_hidden_layer=initial_hidden_layer)
detector.train(train_data=train_ds, validation_data=val_ds, epochs=epochs, batch_size=batch_size, early_stopping_patience=early_stopping_patience, test=is_test, test_data=test_ds)
To use the model, call the ActionDetector.run function.
detector.run()
For a complete demonstration, you can follow the sample_implementation.py script above.