Single-cell RNA-seq Analysis Pipeline

This repository contains a comprehensive data analysis pipeline for single-cell RNA sequencing (scRNA-seq) data. The pipeline is designed to facilitate the identification of distinct cellular states using high-dimensional data. It implements dimensionality reduction techniques, clustering algorithms, and visualization strategies to analyze and interpret scRNA-seq datasets effectively. The project is structured with object-oriented programming (OOP) principles, ensuring modularity and scalability.

This analysis was performed as part of the 3rd assignment for the "Machine Learning in Computational Biology" graduate course of the MSc Data Science & Information Technologies Master's programme (Bioinformatics - Biomedical Data Science Specialization) of the Department of Informatics and Telecommunications of the National and Kapodistrian University of Athens (NKUA), under the supervision of Professor Elias Manolakos, in the academic year 2023-2024.

Project Overview

Project Description

The pipeline processes single-cell RNA-seq data to achieve the following:

Dimensionality Reduction: Reducing the complexity of the dataset using PCA, t-SNE, and UMAP, with the main parameter of each method tuned to give a faithful low-dimensional representation.
Clustering: Identifying cellular states using clustering algorithms such as Gaussian Mixture Models (GMM) and DBSCAN, with automatic selection of the optimal number of clusters using metrics like BIC.
Visualization: Generating intuitive plots to showcase clustering results, posterior probabilities, and the structure of reduced data embeddings.
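
As a rough illustration of the dimensionality reduction step, the sketch below applies PCA with a variance threshold and then t-SNE using scikit-learn. It is not taken from codebase.py: the synthetic matrix, the 90% threshold, the standardization step, and the perplexity of 30 are all assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix: 137 cells x 500 genes (assumed sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(137, 500))

# Assumption: standardize each gene so no single gene dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# A float n_components in (0, 1) makes PCA keep the fewest components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.90, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

# t-SNE on the PCA output; perplexity must be smaller than the number of cells.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X_pca)

print(X_pca.shape, X_tsne.shape)
```

UMAP from the umap-learn package follows the same fit_transform pattern, for example umap.UMAP(n_neighbors=15).fit_transform(X_pca), where the n_neighbors value is again only illustrative.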

Key Features

Parameter optimization for dimensionality reduction techniques (e.g., variance threshold for PCA, perplexity for t-SNE, and n_neighbors for UMAP).
Automatic model selection for clustering based on Bayesian Information Criterion (BIC).
Support for both probabilistic (GMM) and density-based (DBSCAN) clustering methods.
Comprehensive visualizations for insights into data structure and clustering outcomes.
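
BIC-based model selection can be sketched as follows, assuming scikit-learn's GaussianMixture: fit one model per candidate number of components and keep the one with the lowest BIC. The helper name select_gmm_by_bic, the candidate range of 1 to 6, and the synthetic blobs are hypothetical and may differ from what the DataPipeline class does.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(X, max_components=6, random_state=0):
    """Fit GMMs with 1..max_components components; return the lowest-BIC model and all BIC values."""
    best_model, best_bic, bics = None, np.inf, []
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X)
        bic = gmm.bic(X)  # lower BIC means a better fit/complexity trade-off
        bics.append(bic)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model, bics


# Synthetic 2D data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]], cluster_std=0.5, random_state=0
)

model, bics = select_gmm_by_bic(X)
labels = model.predict(X)            # hard cluster assignment per point
posteriors = model.predict_proba(X)  # posterior probability of each component per point

print("selected components:", model.n_components)
```

The posteriors array is the kind of quantity a posterior probability heatmap would display: one row per cell, one column per mixture component.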

Main Workflow

Data Input: Load your scRNA-seq dataset in .csv format.
Dimensionality Reduction: Perform PCA, t-SNE, and UMAP to project data into lower dimensions.
Clustering: Use GMM or DBSCAN to identify cell groups. Automatically select the optimal number of clusters for GMM based on BIC.
Visualization: Generate 2D/3D scatter plots, posterior probability heatmaps, and clustering visualizations.
Export Results: Save clustering labels as a .csv file.
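
The last two workflow steps for the density-based route can be sketched as below: DBSCAN on a 2D embedding, followed by writing the labels to a .csv file. The eps and min_samples values, the column names, and the temporary output path are assumptions for illustration, not the project's actual settings.

```python
import os
import tempfile

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic 2D embedding standing in for a t-SNE or UMAP projection.
embedding, _ = make_blobs(
    n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]], cluster_std=0.5, random_state=0
)

# eps and min_samples are illustrative values, not the project's settings.
dbscan = DBSCAN(eps=1.0, min_samples=5)
labels = dbscan.fit_predict(embedding)  # -1 marks points treated as noise

# One row per cell; the "cell_id" and "cluster" column names are hypothetical.
results = pd.DataFrame(
    {"cell_id": [f"cell_{i}" for i in range(len(labels))], "cluster": labels}
)

# Written to a temporary directory here; the README's target is data/labels.csv.
out_path = os.path.join(tempfile.mkdtemp(), "labels.csv")
results.to_csv(out_path, index=False)
```

Unlike GMM, DBSCAN does not take a number of clusters as input, so there is no BIC step here; the number of groups follows from eps and min_samples.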

Results Overview

The pipeline has been tested with a sample scRNA-seq dataset containing 137 cells and 54,675 genes. It successfully reduces dimensionality, clusters cells into biologically meaningful groups, and provides intuitive visualizations. Example results can be viewed in the Jupyter Notebook (scRNAseq_data_analysis.ipynb).

Installation and Usage

Cloning the Repository

git clone https://github.com/GiatrasKon/scRNAseq-Analysis-Pipeline.git

Package Dependencies

Ensure you have the following packages installed:

pandas
numpy
scikit-learn (imported as sklearn)
matplotlib
seaborn
umap-learn (imported as umap)

Install dependencies using:

pip install pandas matplotlib seaborn numpy scikit-learn umap-learn

Repository Structure

codebase.py: Python script implementing the DataPipeline class.
scRNAseq_data_analysis.ipynb: Notebook demonstrating the usage of the DataPipeline class.
data/: Placeholder for the input dataset (RNA-seq.csv) and output (labels.csv).
documents/: Assignment description and professor's feedback.

Usage

Prepare Your Dataset: Place your scRNA-seq dataset in a .csv file (e.g., data/RNA-seq.csv) with cells as rows and genes as columns.
Run the Pipeline: Use the provided files to analyze the data step-by-step:
- codebase.py: Contains the main pipeline implementation. Key methods include:
  - dimensionality_reduction(): Perform PCA, t-SNE, and UMAP.
  - clustering(): Cluster the reduced data using GMM and DBSCAN.
  - plot_dimensionality_reduction(), plot_clustering_results(), and plot_posterior_probabilities(): Visualize results.
- scRNAseq_data_analysis.ipynb: Demonstrates how to use the DataPipeline class in an interactive Jupyter Notebook. Modify and run this notebook for step-by-step guidance.
Export Results:
- Clustering results, including labels, will be saved as data/labels.csv.
- Visualizations will be displayed for inspection and can be saved manually.
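
Before running the pipeline, it can help to confirm that the input file really has cells as rows and genes as columns. The sketch below writes a tiny stand-in file and loads it with pandas; the assumption that the first column holds cell identifiers, along with the toy gene and cell names, is hypothetical and may not match data/RNA-seq.csv.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Write a tiny stand-in file so the example runs without the real dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.poisson(lam=5, size=(4, 3)),
    index=[f"cell_{i}" for i in range(4)],
    columns=["gene_a", "gene_b", "gene_c"],
)
csv_path = os.path.join(tempfile.mkdtemp(), "RNA-seq.csv")
toy.to_csv(csv_path)

# Assumption: the first column holds cell identifiers; drop index_col=0 if it does not.
data = pd.read_csv(csv_path, index_col=0)

# Rows are cells, columns are genes.
n_cells, n_genes = data.shape

# scikit-learn estimators such as PCA reject NaN values, so check up front.
assert not data.isna().any().any(), "input contains missing values"

X = data.to_numpy(dtype=float)
print(f"{n_cells} cells x {n_genes} genes")
```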
