Human Action Recognition from Depth Maps

This repository contains the implementation of a system for Human Action Recognition (HAR) from depth map data. The system is designed to assist individuals with dementia in bathroom settings by recognizing human actions in a privacy-preserving manner. The pipeline combines the SPiKE model for 3D Human Pose Estimation (HPE) with a Relational Graph Convolutional Network (RGCN) for action classification.



Overview of the System

The system pipeline consists of the following stages:

  1. Depth Map Acquisition: Collects 3D depth data in bathroom environments using depth sensors.
  2. Point Cloud Generation: Converts each depth map into a 3D point cloud by back-projecting its pixels into 3D space.
  3. 3D Skeleton Estimation (SPiKE Model): Extracts the skeletal structure of humans from the point clouds.
  4. Spatio-Temporal Graph Construction: Builds graphs where nodes represent body joints, and edges encode spatial and temporal relationships.
  5. Action Classification (RGCN Model): Predicts human actions using the relational information in the spatio-temporal graph.

This system aims to provide real-time assistance while preserving user privacy by avoiding the capture of detailed visual information.
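The point cloud generation stage can be illustrated with a minimal sketch. It assumes a pinhole camera model and depth values in metres; the function name `depth_to_point_cloud` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and are not taken from this repository.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into an (N, 3) point cloud.

    Assumes a pinhole camera; fx, fy, cx, cy are hypothetical intrinsics.
    Pixels with depth <= 0 are treated as invalid and dropped.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float64)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Tiny synthetic example: a 2x2 depth map with one invalid pixel.
depth = np.array([[1.0, 2.0], [0.0, 4.0]])
cloud = depth_to_point_cloud(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud.shape)  # (3, 3): three valid points with x, y, z each
```

With a real sensor, the intrinsics would come from the camera's calibration rather than the placeholder values used here.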

System Architecture

System Architecture


Deep Learning Models

SPiKE Model for 3D Human Pose Estimation

The SPiKE model is a neural network designed to predict 3D human poses from point clouds. It performs the following:

  • Local Feature Extraction: Analyzes spatial features in local volumes of the point clouds.
  • Temporal Encoding: Utilizes a transformer network to model motion dynamics.
  • Pose Regression: Outputs 3D coordinates for 15 human body joints, which together form the skeleton passed to the action classifier.

The model was fine-tuned on a custom dataset, BAD, annotated with 2D skeletons, to improve its performance in real-world settings.

Visualizing a Skeleton

Skeleton Example

SPiKE Model Workflow

SPiKE Workflow


Relational Graph Convolutional Network (RGCN)

The RGCN extends graph convolutional networks to graphs with several edge types, which here are the spatial and temporal relationships between skeleton joints. Features include:

  • Graph Construction: Represents human poses as graphs, with joints as nodes and spatial-temporal connections as edges.
  • Relational Convolutions: Uses distinct convolutional operations for different types of edges, such as spatial or temporal.
  • Action Classification: Processes graph data to identify one of eight actions, including walking, sitting, and washing hands.

What is a Spatio-Temporal Graph?

Spatio-Temporal Graph
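As a rough illustration of such a graph, the sketch below builds a typed edge list in which spatial edges link joints connected by a bone within one frame, and temporal edges link the same joint across consecutive frames. The function name, the relation ids, and the toy three-joint skeleton are assumptions for illustration; the actual 15-joint connectivity used by this project is not reproduced here.

```python
SPATIAL, TEMPORAL = 0, 1  # hypothetical relation ids

def build_spatio_temporal_edges(num_frames, num_joints, bones):
    """Return a list of (source, target, relation) edges.

    Node index = frame * num_joints + joint. Every edge is added in both
    directions so that information can flow either way.
    """
    edges = []
    for t in range(num_frames):
        offset = t * num_joints
        # Spatial edges: bones inside the current frame.
        for a, b in bones:
            edges.append((offset + a, offset + b, SPATIAL))
            edges.append((offset + b, offset + a, SPATIAL))
        # Temporal edges: same joint in the next frame, if there is one.
        if t + 1 < num_frames:
            for j in range(num_joints):
                edges.append((offset + j, offset + num_joints + j, TEMPORAL))
                edges.append((offset + num_joints + j, offset + j, TEMPORAL))
    return edges

# Toy skeleton: 3 joints in a chain, observed over 2 frames.
edges = build_spatio_temporal_edges(num_frames=2, num_joints=3, bones=[(0, 1), (1, 2)])
print(len(edges))  # 14: 8 spatial edges plus 6 temporal edges
```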

RGCN Architecture

RGCN Architecture
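The core idea of a relational graph convolution, one weight matrix per edge type plus a self-connection, can be sketched in NumPy. This follows the standard R-GCN propagation rule with mean normalisation over neighbours; it is a simplified illustration, not this repository's model, and the function name and array shapes are assumptions.

```python
import numpy as np

def rgcn_layer(h, edges, w_rel, w_self):
    """One relational graph convolution followed by ReLU.

    h:      (N, F_in) node features
    edges:  list of (source, target, relation) tuples
    w_rel:  (R, F_in, F_out), one weight matrix per relation
    w_self: (F_in, F_out), weight for the self-connection
    """
    n = h.shape[0]
    out = h @ w_self
    for r in range(w_rel.shape[0]):
        adj = np.zeros((n, n))
        for src, dst, rel in edges:
            if rel == r:
                adj[dst, src] = 1.0
        # Average over incoming neighbours; nodes without any keep a zero row.
        deg = adj.sum(axis=1, keepdims=True)
        adj = adj / np.maximum(deg, 1.0)
        out = out + adj @ h @ w_rel[r]
    return np.maximum(out, 0.0)

# Two nodes, one edge 0 -> 1 of relation 0, and all-ones weights.
h = np.array([[1.0], [2.0]])
out = rgcn_layer(h, [(0, 1, 0)], w_rel=np.ones((2, 1, 1)), w_self=np.ones((1, 1)))
print(out)  # node 0 keeps 1.0; node 1 becomes 2.0 + 1.0 = 3.0
```

A practical implementation would use sparse operations and learned weights, typically through a graph learning library, rather than dense adjacency matrices.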


Dataset

The system is trained and tested on the BAD dataset, a custom dataset of depth maps recorded in bathroom settings. This dataset includes:

  • Depth Maps: Images in which each pixel stores the distance from the sensor to the scene.
  • Annotations: 2D skeletons manually labeled for training.
  • Actions: Eight human actions, such as sitting, standing, and washing hands.

Dataset Example

BAD Dataset Example


Results

SPiKE Model Results

  • Quantitative: High mean Average Precision (mAP) and Percentage of Correct Keypoints (PCK) across key joints.
  • Qualitative: Strong alignment of predicted skeletons with ground truth.
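For reference, PCK can be computed as the fraction of predicted joints that lie within a distance threshold of their ground-truth position. The sketch below assumes 3D joints in metres and a hypothetical threshold of 0.1 m; the exact evaluation protocol used for the reported results is not specified here.

```python
import numpy as np

def pck(pred, gt, threshold):
    """Percentage of Correct Keypoints over all frames and joints.

    pred, gt: arrays of shape (num_frames, num_joints, 3).
    A joint counts as correct if its Euclidean error is <= threshold.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float((dist <= threshold).mean())

# Synthetic data: 2 frames, 3 joints, one joint off by 0.05 and one by 0.2.
gt = np.zeros((2, 3, 3))
pred = gt.copy()
pred[0, 0] = [0.05, 0.0, 0.0]
pred[1, 2] = [0.0, 0.2, 0.0]
score = pck(pred, gt, threshold=0.1)
print(round(score, 3))  # 0.833: five of the six joints are within the threshold
```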

SPiKE Results

RGCN Model Results

  • Quantitative: Consistently decreasing training/testing losses and increasing accuracies.
  • Qualitative: Accurate classification of human actions in testing scenarios.

RGCN Results


Key Features

  • Privacy-Preserving: Relies on depth sensors rather than color cameras, which helps protect individuals' dignity and anonymity.
  • Real-Time Processing: Designed for real-time human action recognition in practical scenarios.
  • Custom Dataset Support: Fine-tuned and tested on the BAD dataset for accurate performance in bathroom environments.
  • Extendable Framework: Can incorporate additional actions or adapt to other domains by modifying the graph construction and model training pipelines.

Getting Started

  1. Setup Environment: Install dependencies using:

    pip install -r requirements.txt

  2. Prepare Dataset: Organize depth maps and skeleton annotations in the required format.

  3. Train Models: Use the provided training scripts to fine-tune SPiKE and train the RGCN model.

  4. Inference: Run the system on live or pre-recorded depth maps to classify human actions.

Future Improvements

  • Complete Dataset Annotation: The entire dataset should be comprehensively annotated to provide a richer and more diverse set of training examples, improving the model’s ability to generalize across various actions and scenarios.
  • Incorporation of Edge Features in Spatio-Temporal Graphs: The spatio-temporal graph can be enriched by adding edge features, such as the lengths of the edges (i.e., distances between joints). This additional information could help improve action classification accuracy.
  • Potential Integration of Symbolic Reasoning: Incorporate symbolic reasoning to create a Neurosymbolic AI system, enhancing the model’s ability to understand contextual information and make more informed decisions.
  • Answer Set Programming (ASP) for Safety Rules: Use ASP to model safety rules and reason about action sequences, improving real-time decision-making, safety, and patient autonomy.
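The edge-feature idea above could start from something as simple as the Euclidean length of each spatial edge. The following is a minimal sketch under that assumption; `bone_lengths` and the toy skeleton are hypothetical and not part of the current code base.

```python
import numpy as np

def bone_lengths(joints, bones):
    """Euclidean length of every (a, b) joint pair in `bones`.

    joints: (num_joints, 3) array of 3D coordinates for one frame.
    """
    a = np.array([i for i, _ in bones])
    b = np.array([j for _, j in bones])
    return np.linalg.norm(joints[a] - joints[b], axis=-1)

# Toy skeleton with three joints and two bones.
joints = np.array([[0.0, 0.0, 0.0], [0.0, 3.0, 4.0], [1.0, 3.0, 4.0]])
lengths = bone_lengths(joints, [(0, 1), (1, 2)])
print(lengths)  # lengths 5.0 and 1.0
```

These lengths could then be attached to the corresponding spatial edges as scalar features before the graph is passed to the classifier.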