This repository contains the code and data for recreating and extending the gastric cancer (GC) single-cell RNA-seq (scRNA-seq) data analysis pipeline described in "Comprehensive analysis of metastatic gastric cancer tumour cells using single-cell RNA-seq" by Wang B. et al., using the raw counts matrix available in GEO (GSE158631). The pipeline extends the original analysis with multiple dimensionality reduction techniques, several clustering methods, marker gene identification, and GO functional annotation.
This analysis was performed as part of the final project for the "Machine Learning in Computational Biology" graduate course of the MSc programme in Data Science & Information Technologies (Bioinformatics - Biomedical Data Science specialization) of the Department of Informatics and Telecommunications of the National and Kapodistrian University of Athens (NKUA), under the supervision of Professor Elias Manolakos, in the academic year 2023-2024.
git clone https://github.com/GiatrasKon/gastric-cancer-scRNAseq-data-analysis.git
Ensure you have the following packages installed:
- pandas
- IPython
- matplotlib
- seaborn
- numpy
- anndata
- scanpy
- scipy
- scikit-learn (imported as sklearn)
- umap-learn
- gprofiler-official
Install dependencies using:
pip install pandas ipython matplotlib seaborn numpy anndata scanpy scipy scikit-learn umap-learn gprofiler-official
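Several of the packages above are installed under one name and imported under another (for example, scikit-learn is imported as sklearn and umap-learn as umap). The following is a minimal sketch of an environment check, not part of the repository; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here for illustration.

```python
from importlib.util import find_spec

# pip distribution name -> top-level import name.
# Hypothetical helper mapping; not part of the repository's code.
REQUIRED = {
    "pandas": "pandas",
    "ipython": "IPython",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "numpy": "numpy",
    "anndata": "anndata",
    "scanpy": "scanpy",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "umap-learn": "umap",
    "gprofiler-official": "gprofiler",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose import name cannot be found in this environment."""
    return sorted(pip for pip, module in required.items() if find_spec(module) is None)


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        # Print a ready-to-run install command for whatever is absent.
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

Running the script before opening the notebook reports any package that still needs to be installed, without importing the heavier libraries themselves.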
The GCSingleCellAnalysis class (located in src/codebase.py) provides a comprehensive workflow for preprocessing, dimensionality reduction, clustering, and functional annotation of single-cell RNA-seq data.
- Preprocessing and Normalization:

      from src.codebase import GCSingleCellAnalysis

      analysis = GCSingleCellAnalysis('data/GSE158631_count.csv')
      analysis.preprocess_adata()
      analysis.normalize_adata()
      analysis.filter_adata()
      analysis.prepare_adata()
- Dimensionality Reduction:

      analysis.perform_pca()
      analysis.prepare_pca_reduced_adata()
      analysis.plot_pca()
      analysis.perform_tsne()
      analysis.plot_tsne()
      analysis.perform_umap()
      analysis.plot_umap()
- Clustering:

      methods = ['gmm', 'average_link', 'ward', 'spectral', 'louvain', 'leiden']
      results_pca = analysis.cluster_and_evaluate(methods, embeddings=['X_pca'])
      results_tsne = analysis.cluster_and_evaluate(methods, embeddings=['X_tsne'])
      results_umap = analysis.cluster_and_evaluate(methods, embeddings=['X_umap'])
      combined_results = {'PCA': results_pca, 't-SNE': results_tsne, 'UMAP': results_umap}
      results_df = analysis.create_results_dataframe(combined_results)
      analysis.plot_clustering_evaluation(results_df)
- Post-Clustering Analysis:

      analysis.analyze_and_plot_markers(group_key='average_link_X_umap', n_genes=10)
      go_annotations = analysis.fetch_go_annotations(group_key='average_link_X_umap', n_genes=10)
      analysis.print_go_annotations(go_annotations)
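The group_key used in the post-clustering step, 'average_link_X_umap', appears to join a clustering method name and an embedding name with an underscore. Assuming that convention holds for every method and embedding (it is inferred from the single example above, not confirmed against src/codebase.py), the candidate keys can be enumerated as sketched below. The helper names `make_group_key` and `split_group_key` are hypothetical.

```python
from itertools import product
from typing import Tuple

# Method and embedding names copied from the usage example above.
METHODS = ["gmm", "average_link", "ward", "spectral", "louvain", "leiden"]
EMBEDDINGS = ["X_pca", "X_tsne", "X_umap"]


def make_group_key(method: str, embedding: str) -> str:
    """Join names as '<method>_<embedding>' (assumed convention, see lead-in)."""
    return f"{method}_{embedding}"


def split_group_key(group_key: str) -> Tuple[str, str]:
    """Split a key back into (method, embedding) using the 'X_' embedding prefix."""
    method, sep, suffix = group_key.rpartition("_X_")
    if not sep:
        raise ValueError(f"not a '<method>_X_<embedding>' key: {group_key!r}")
    return method, f"X_{suffix}"


# Every method/embedding combination from the clustering step.
ALL_GROUP_KEYS = [make_group_key(m, e) for m, e in product(METHODS, EMBEDDINGS)]
```

If the convention holds, any entry of `ALL_GROUP_KEYS` could be passed as `group_key` to the marker and GO annotation calls once the corresponding clustering has been run; check the notebook for the keys actually stored.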
For a detailed step-by-step analysis, refer to the Jupyter notebook:
notebooks/GC_scRNAseq_data_analysis.ipynb
Images of the results of the analysis can be found in the images directory.
Refer to the documents directory for the project proposal, presentation, report, and the original authors' paper.