This project consists of a two-step framework for analyzing mental health-related posts and predicting depression risk using machine learning and deep learning models. The dataset used is sourced from Zenodo and includes Reddit posts from various mental health subreddits.
- Project Overview
- Steps Overview
- Dataset Download and Setup
- Installation Requirements
- Workflow
- Files and Outputs
- Usage Instructions
This project aims to:
- Extract themes/topics from mental health-related text data using clustering techniques.
- Build a supervised learning model to predict the likelihood of depression in user posts.
Step 1: Topic Extraction
- Goal: Identify key themes in mental health posts, grouped by half-year intervals.
- Techniques:
  - Text Preprocessing (tokenization, stopword removal, lemmatization)
  - Feature Extraction using TF-IDF
  - Clustering using K-Means
  - Extracting top unigrams and bigrams for each cluster
Input: Preprocessed text from mental health-related posts.
Output: A summary of extracted topics and corresponding keywords saved to a file.
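The actual implementation lives in Step1-Topic-extracition.ipynb. As a rough, self-contained sketch of the approach described above (not the notebook's code), TF-IDF vectorization plus K-Means and top-term extraction could look like this; the documents list, cluster count, and feature settings are placeholder assumptions:

```python
# Sketch: TF-IDF features + K-Means clustering, then top unigrams/bigrams per cluster.
# `documents` stands in for the preprocessed posts of one half-year interval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "feel anxious about work every day",
    "cannot sleep at night keep worrying",
    "therapy and exercise helped me a lot",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english", max_features=5000)
X = vectorizer.fit_transform(documents)

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X)

terms = vectorizer.get_feature_names_out()
for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
    top_terms = [terms[i] for i in centroid.argsort()[::-1][:10]]  # highest-weight n-grams
    print(f"Cluster {cluster_id}: {top_terms}")
```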
Step 2: Depression Prediction
- Goal: Predict whether a given text indicates depressive tendencies.
- Techniques:
  - Data Annotation using:
    - Keyword-based matching
    - Sentiment analysis (VADER and Transformer-based sentiment models)
  - Feature Extraction using word embeddings (BERT)
  - Supervised Learning:
    - Logistic Regression
    - LSTM Model with BERT Embeddings
  - Evaluation metrics: Accuracy, F1-Score, Confusion Matrix
Input: Annotated dataset of posts labeled for depression risk.
Output: Trained models, evaluation metrics, and visualizations.
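The notebook contains the actual annotation code. Purely as an illustration of the keyword matching plus VADER sentiment idea described above, a simplified annotator might look like the sketch below; the keyword list and the compound-score threshold are assumptions, and the Transformer-based sentiment model is omitted for brevity:

```python
# Simplified annotation sketch: keyword matching combined with VADER sentiment (via NLTK).
# The keyword set and the -0.5 compound threshold are illustrative assumptions.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

DEPRESSION_KEYWORDS = {"hopeless", "worthless", "depressed", "empty", "numb"}

def annotate(text: str) -> int:
    """Return 1 if the post looks like a depression-risk example, else 0."""
    has_keyword = bool(set(text.lower().split()) & DEPRESSION_KEYWORDS)
    very_negative = sia.polarity_scores(text)["compound"] <= -0.5
    return int(has_keyword or very_negative)

print(annotate("I feel hopeless and empty lately"))  # 1
```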
The dataset used in this project can be downloaded from Zenodo:
Zenodo link: [Reddit Dataset](https://zenodo.org/records/3941387)
- Run the following command in the terminal to download the dataset:
wget https://zenodo.org/api/records/3941387/files-archive -O reddit_dataset.zip
- Unzip the downloaded file into the data/ folder:
unzip reddit_dataset.zip -d data/
- Ensure the extracted files include the required mental_health_support.csv and non_mental_health.csv files.
Before running the project, install the required libraries:
pip install -r requirements.txt
Key Libraries:
- Python 3.8+
- PyTorch
- scikit-learn
- transformers (Hugging Face)
- NLTK
- Matplotlib
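If requirements.txt is missing from your copy of the repository, a minimal version covering the libraries above might look like this (version pins are not specified here and may need adjusting):

```text
torch
scikit-learn
transformers
nltk
matplotlib
notebook
```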
- Preprocessing and Data Preparation:
  - Run data_processing.ipynb to preprocess and prepare the data.
- Step 1 - Topic Extraction:
  - Run Step1-Topic-extracition.ipynb to extract topics and keywords using clustering.
- Step 2 - Depression Prediction:
  - Run Step2-Depression-Predict.ipynb to train a classifier for depression risk prediction.
- data/ - Folder containing the downloaded and unzipped dataset.
- Step1-Topic-extracition.ipynb - Notebook for topic extraction.
- Step2-Depression-Predict.ipynb - Notebook for depression risk prediction.
- data_processing.ipynb - Notebook for preprocessing the dataset.
- lstm_depression_model.pth - Trained LSTM model weights.
- vocab.json - Saved vocabulary file for inference.
- README.md - Project documentation (this file).
- Step 1:
  - merged_clusters_output.csv - File containing topic keywords for each cluster. (There is an example in the repository.)
- Step 2:
  - combined_annotated_data_updated.csv - Annotated dataset with depression labels.
  - classification_report.txt - Performance metrics (precision, recall, F1-score).
  - lstm_depression_model.pth - Saved model for inference.
Run the preprocessing notebook:
jupyter notebook data_processing.ipynb
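The notebook is the authoritative source for the preprocessing logic. As an illustration only, the tokenization, stopword removal, and lemmatization mentioned earlier could be implemented with NLTK roughly like this (the preprocess function and resource choices are assumptions, not taken from data_processing.ipynb):

```python
# Illustrative NLTK preprocessing: lowercase, tokenize, drop stopwords, lemmatize.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    tokens = word_tokenize(text.lower())
    kept = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return " ".join(lemmatizer.lemmatize(t) for t in kept)

print(preprocess("I haven't been sleeping well and feel exhausted."))
```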
Run the topic extraction notebook:
jupyter notebook Step1-Topic-extracition.ipynb
Expected Output:
- Topic clusters and corresponding top keywords for each half-year interval.
Run the depression prediction notebook:
jupyter notebook Step2-Depression-Predict.ipynb
Expected Output:
- Trained LSTM model with BERT embeddings.
- Performance metrics and confusion matrix.
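The notebook computes and reports these metrics itself; for reference, the same metrics can be reproduced with scikit-learn from any model's predictions, as in this generic sketch (y_true and y_pred are placeholders):

```python
# Generic evaluation sketch; y_true/y_pred stand in for test labels and model predictions.
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["Non-Depression", "Depression"]))
```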
To use the trained LSTM model for inference on new text data:
- Load the saved model:
```python
import torch
from transformers import BertTokenizer
from model import BertLSTMClassifier

# Load tokenizer and model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertLSTMClassifier()
model.load_state_dict(torch.load("lstm_depression_model.pth"))
model.eval()
```
- Predict:
```python
def predict(text):
    tokens = tokenizer(text, return_tensors='pt', truncation=True,
                       padding='max_length', max_length=100)
    input_ids, attention_mask = tokens['input_ids'], tokens['attention_mask']
    with torch.no_grad():
        output = model(input_ids, attention_mask)
    prediction = torch.argmax(output, dim=1)
    return 'Depression' if prediction.item() == 1 else 'Non-Depression'

# Example
print(predict("I feel hopeless and sad all the time."))
```
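Note that the snippet above imports BertLSTMClassifier from a model module that is not documented in this README. If you need to recreate it, a minimal BERT-embeddings-plus-LSTM classifier compatible with the calls above could look like the sketch below; the hidden size, dropout, and pooling choice are assumptions rather than the project's actual architecture, and load_state_dict will only succeed if your definition matches the trained checkpoint:

```python
# Hypothetical sketch of a BERT + LSTM classifier matching the usage above.
# Layer sizes, dropout, and pooling are assumptions; the real architecture must match
# the weights saved in lstm_depression_model.pth for load_state_dict to work.
import torch
import torch.nn as nn
from transformers import BertModel

class BertLSTMClassifier(nn.Module):
    def __init__(self, hidden_dim=128, num_classes=2, dropout=0.3):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for param in self.bert.parameters():        # use BERT as a frozen embedding extractor
            param.requires_grad = False
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():
            bert_out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        lstm_out, _ = self.lstm(bert_out.last_hidden_state)  # (batch, seq_len, 2 * hidden_dim)
        pooled = self.dropout(lstm_out[:, -1, :])            # representation at the last position
        return self.fc(pooled)                               # logits over the two classes
```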
If you have any issues or questions regarding the project, please feel free to contact the author or raise an issue on GitHub.
End of README