This repository implements a Human Action Recognition (HAR) system that operates on depth map data. The system is designed to assist individuals with dementia in bathroom settings by recognizing their actions in a privacy-preserving manner. It combines the SPiKE model for 3D Human Pose Estimation (HPE) with a Relational Graph Convolutional Network (RGCN) for action classification.
The system pipeline consists of the following stages:
- Depth Map Acquisition: Collects 3D depth data in bathroom environments using depth sensors.
- Point Cloud Generation: Converts each depth map into a 3D point cloud, i.e. a set of (x, y, z) points describing the scene geometry.
- 3D Skeleton Estimation (SPiKE Model): Extracts the skeletal structure of humans from the point clouds.
- Spatio-Temporal Graph Construction: Builds graphs where nodes represent body joints, and edges encode spatial and temporal relationships.
- Action Classification (RGCN Model): Predicts human actions using the relational information in the spatio-temporal graph.
This system aims to provide real-time assistance while preserving user privacy by avoiding the capture of detailed visual information.
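The point cloud generation stage can be illustrated with a standard pinhole back-projection. This is a minimal sketch and not the repository's own code: the function name `depth_to_point_cloud` and the intrinsics `fx`, `fy`, `cx`, `cy` are assumptions, and the real values depend on the depth sensor used. It is written in plain Python for clarity; a practical implementation would likely vectorize it.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (rows of depth values) into (x, y, z) points.

    fx, fy, cx, cy are assumed pinhole intrinsics of the depth sensor.
    """
    points = []
    for v, row in enumerate(depth):        # v: pixel row
        for u, z in enumerate(row):        # u: pixel column
            if z <= 0:                     # non-positive depth marks a missing measurement
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one missing pixel.
depth = [[0.0, 2.0],
         [2.0, 4.0]]
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Pixels without a valid depth reading are skipped, so the resulting cloud can contain fewer points than the depth map has pixels.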
The SPiKE model is a neural network designed to predict 3D human poses from point clouds. It performs the following:
- Local Feature Extraction: Analyzes spatial features in local volumes of the point clouds.
- Temporal Encoding: Utilizes a transformer network to model motion dynamics.
- Pose Regression: Outputs the 3D coordinates of 15 human body joints, which together form the estimated skeleton.
The model was fine-tuned on a custom dataset, BAD, annotated with 2D skeletons, to improve its performance in real-world settings.
The RGCN extends standard graph convolutional networks to graphs with several edge types, which here encode the spatial and temporal relationships between skeleton joints. Its main components are:
- Graph Construction: Represents human poses as graphs, with joints as nodes and spatial-temporal connections as edges.
- Relational Convolutions: Uses distinct convolutional operations for different types of edges, such as spatial or temporal.
- Action Classification: Processes graph data to identify one of eight actions, including walking, sitting, and washing hands.
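The graph construction and relational convolution described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the repository's implementation: the bone list `BONES` is a hypothetical 15-joint topology, the helper names are made up for this example, and the layer follows the commonly used R-GCN formulation (one weight matrix per edge type, mean-normalized over neighbours, plus a self-connection), which may differ from the layer actually used here.

```python
NUM_JOINTS = 15
SPATIAL, TEMPORAL = 0, 1  # relation (edge type) ids

# Hypothetical bone list for a 15-joint skeleton; replace with the real topology.
BONES = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 5), (4, 6), (5, 7),
         (1, 8), (8, 9), (8, 10), (9, 11), (10, 12), (11, 13), (12, 14)]


def build_st_graph(num_frames, num_joints=NUM_JOINTS, bones=BONES):
    """Return (num_nodes, edges); each edge is a (src, dst, relation) triple."""
    def node(t, j):
        return t * num_joints + j  # one node per joint per frame

    edges = []
    for t in range(num_frames):
        for a, b in bones:  # spatial edges inside a frame, both directions
            edges.append((node(t, a), node(t, b), SPATIAL))
            edges.append((node(t, b), node(t, a), SPATIAL))
        if t + 1 < num_frames:  # temporal edges: same joint in consecutive frames
            for j in range(num_joints):
                edges.append((node(t, j), node(t + 1, j), TEMPORAL))
                edges.append((node(t + 1, j), node(t, j), TEMPORAL))
    return num_frames * num_joints, edges


def matvec(weight, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def rgcn_layer(features, edges, rel_weights, self_weight):
    """One relational graph convolution followed by ReLU."""
    out = [matvec(self_weight, h) for h in features]  # self-connection term
    neighbours = {}  # (dst, relation) -> list of source nodes
    for src, dst, rel in edges:
        neighbours.setdefault((dst, rel), []).append(src)
    for (dst, rel), srcs in neighbours.items():
        for src in srcs:
            message = matvec(rel_weights[rel], features[src])
            for k, m in enumerate(message):
                out[dst][k] += m / len(srcs)  # mean over neighbours of this relation
    return [[max(0.0, v) for v in h] for h in out]


num_nodes, edges = build_st_graph(num_frames=3)
```

Keeping the relation id on every edge is what lets the layer apply a different weight matrix to spatial and temporal neighbours.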
The system is trained and tested on the BAD dataset, a custom dataset of depth maps recorded in bathroom settings. This dataset includes:
- Depth Maps: 3D representations of the environment.
- Annotations: 2D skeletons manually labeled for training.
- Actions: Eight human actions like sitting, standing, and washing hands.
Evaluation results for the two models:
- SPiKE (quantitative): High mean Average Precision (mAP) and Percentage of Correct Keypoints (PCK) across key joints.
- SPiKE (qualitative): Predicted skeletons align closely with the ground truth.
- RGCN (quantitative): Training and testing losses decrease consistently while accuracies increase.
- RGCN (qualitative): Actions are classified accurately in the testing scenarios.
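For readers unfamiliar with the PCK metric mentioned above, the following minimal sketch shows how it is typically computed: a joint counts as correct when its prediction lies within a distance threshold of the ground truth. The function name `pck` and the threshold value are assumptions for illustration; the threshold used in this project is not stated here.

```python
import math


def pck(pred, gt, threshold):
    """Fraction of joints whose prediction lies within `threshold` of ground truth."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of joints")
    correct = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return correct / len(gt)


# Toy example with three joints (coordinates in metres, threshold of 0.1 m assumed).
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.5)]
gt = [(0.0, 0.0, 0.05), (1.0, 0.2, 0.0), (0.0, 0.0, 0.5)]
score = pck(pred, gt, threshold=0.1)
```

In this toy example two of the three joints fall within the threshold, so the score is two thirds.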
Key features of the system:
- Privacy-Preserving: Works with depth sensors instead of detailed visual imagery, protecting individuals' dignity and anonymity.
- Real-Time Processing: Designed for real-time human action recognition in practical scenarios.
- Custom Dataset Support: Fine-tuned and tested on the BAD dataset for accurate performance in bathroom environments.
- Extendable Framework: Can incorporate additional actions or adapt to other domains by modifying the graph construction and model training pipelines.
To set up and run the system:
- Setup Environment: Install dependencies using `pip install -r requirements.txt`.
- Prepare Dataset: Organize depth maps and skeleton annotations in the required format.
- Train Models: Use the provided training scripts to fine-tune SPiKE and train the RGCN model.
- Inference: Run the system on live or pre-recorded depth maps to classify human actions.
Possible directions for future work:
- Complete Dataset Annotation: Annotating the entire dataset would provide a richer and more diverse set of training examples, improving the model’s ability to generalize across actions and scenarios.
- Incorporation of Edge Features in Spatio-Temporal Graphs: The spatio-temporal graph can be enriched by adding edge features, such as the lengths of the edges (i.e., distances between joints). This additional information could help improve action classification accuracy.
- Potential Integration of Symbolic Reasoning: Incorporate symbolic reasoning to create a Neurosymbolic AI system, enhancing the model’s ability to understand contextual information and make more informed decisions.
- Answer Set Programming (ASP) for Safety Rules: Use ASP to model safety rules and reason about action sequences, improving real-time decision-making, safety, and patient autonomy.
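The edge-feature idea listed above (using distances between joints as edge lengths) could look roughly like the sketch below. It is only an illustration: the function name `edge_lengths` and the data layout (one 3D coordinate per node, edges as index pairs) are assumptions, and how such features would be fed into the RGCN is left open.

```python
import math


def edge_lengths(joints, edges):
    """Euclidean length of each (src, dst) edge, given one 3D coordinate per node."""
    return [math.dist(joints[src], joints[dst]) for src, dst in edges]


# Toy example with three nodes.
joints = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
lengths = edge_lengths(joints, [(0, 1), (1, 2), (0, 2)])
```

For a temporal edge, which links the same joint in two consecutive frames, this length is the joint's displacement between those frames.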