This repository utilizes Short Tandem Repeats (STR) for in-depth genetic analysis, featuring hierarchical clustering with Ward Linkage to construct phylogenetic trees and various metrics to assess genetic similarity and distance among individuals. Includes comprehensive visualizations such as dendrograms and heatmaps.

MjdMahasneh/sTR_Human_Phylogenetic_Tree

Human Phylogenetic Tree and Genetic Similarity Analysis using Short Tandem Repeats (STR)

This repository combines two key analyses on genetic data using Short Tandem Repeats (STR):

  1. Ward Linkage Hierarchical Clustering: Constructs a phylogenetic tree to visualize genetic relationships between individuals.
  2. Genetic Similarity & Distance Matrices: Calculates and visualizes multiple similarity and distance metrics between individuals based on their genetic profiles.

Description

  • Hierarchical Clustering (Ward Linkage): Uses Ward's method for agglomerative clustering: at each step it merges the two clusters whose union gives the smallest increase in within-cluster variance. The result is displayed as a dendrogram, where the height at which two branches join reflects the genetic dissimilarity between them.
  • Similarity/Distance Metrics: Computes and visualizes the following matrices:
    • Cosine Similarity Matrix
    • Euclidean Distance Matrix
    • Pearson Correlation Matrix
    • Spearman Correlation Matrix

Both analyses are designed to infer genetic relationships, visualize similarity, and highlight genetic variance between samples.
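
As a rough illustration of the Ward linkage step described above, the following minimal sketch clusters a handful of made-up STR profiles with SciPy. The `profiles` array and its values are assumptions for demonstration only, and the repository's own script may structure this differently.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical STR profiles: one row per individual, one column per locus,
# each value a repeat count. The numbers are invented for illustration.
profiles = np.array([
    [12, 15, 9, 30],
    [12, 16, 9, 30],
    [18, 11, 14, 22],
    [17, 11, 14, 23],
], dtype=float)

# Ward linkage merges, at every step, the pair of clusters whose union
# gives the smallest increase in within-cluster variance.
Z = linkage(profiles, method="ward")

# Each row of Z is [cluster_a, cluster_b, merge_distance, new_cluster_size].
merge_heights = Z[:, 2]

# Cut the tree into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")

print(merge_heights)
print(labels)
```

The third column of `Z` holds the heights at which branches join in the dendrogram, so smaller values indicate more similar profiles.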

Features

  • Hierarchical Clustering: Constructs a dendrogram annotated with genetic distances between individuals.
  • Similarity & Distance Matrices: Generates and visualizes key metrics such as cosine similarity, Euclidean distance, and Pearson/Spearman correlations.
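
The four matrices listed above can be computed in a few lines. The sketch below uses NumPy and SciPy on made-up profiles; this is an assumption about one possible approach, and the repository's script may rely on scikit-learn or pandas instead. Here correlations are taken between individuals (rows), which is also an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import rankdata

# Hypothetical STR profiles: rows are individuals, columns are loci.
profiles = np.array([
    [12, 15, 9, 30],
    [12, 16, 9, 30],
    [18, 11, 14, 22],
    [17, 11, 14, 23],
], dtype=float)

# Cosine similarity is 1 minus the pairwise cosine distance.
cosine_sim = 1.0 - squareform(pdist(profiles, metric="cosine"))

# Euclidean distance between every pair of individuals.
euclidean_dist = squareform(pdist(profiles, metric="euclidean"))

# Pearson correlation between individuals (np.corrcoef treats rows as variables).
pearson_corr = np.corrcoef(profiles)

# Spearman correlation is the Pearson correlation of the within-profile ranks.
spearman_corr = np.corrcoef(rankdata(profiles, axis=1))

print(np.round(euclidean_dist, 2))
```

Each result is a square, symmetric matrix with one row and one column per individual, which is the shape a heatmap expects.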

Requirements

  • pandas
  • numpy
  • scikit-learn
  • scipy
  • matplotlib
  • seaborn

Usage

  1. CSV File: Ensure your genetic data is stored in a CSV file.
    • Example: synthatic_sTR.csv. Note that this file contains synthetic STR data for demonstration purposes.
  2. Change File Paths: Update the file paths in the .py scripts to match the location of your .csv files.
file_path = 'your_file.csv'  # Replace with your file path
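
Before running the scripts, it can help to check that the CSV loads into a purely numeric table. The sketch below assumes a layout with one row per individual, a `sample_id` column, and one column per locus; these column names are hypothetical and the real layout of synthatic_sTR.csv may differ.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a real file.
csv_text = """sample_id,locus_1,locus_2,locus_3
ind_A,12,15,9
ind_B,12,16,9
ind_C,18,11,14
"""

# In a real run, pass file_path to read_csv instead of the in-memory buffer.
df = pd.read_csv(io.StringIO(csv_text), index_col="sample_id")

# Sanity checks: every remaining column is numeric and nothing is missing.
numeric = df.select_dtypes(include="number")
assert numeric.shape == df.shape, "non-numeric columns found"
assert not numeric.isna().any().any(), "missing repeat counts found"

# Array of repeat counts and the matching sample names.
profiles = numeric.to_numpy(dtype=float)
names = df.index.tolist()

print(profiles.shape, names)
```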

Running the Clustering Script

python clustering_script.py

This will generate a dendrogram saved as gene_hierarchy_with_distances.png.
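
For readers who want to see how such a figure can be produced, here is a minimal sketch that draws a Ward dendrogram and labels each merge with its distance. The data, the sample names, and the output name `dendrogram_sketch.png` are all hypothetical; this is not the repository's clustering_script.py.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical STR profiles and sample names.
profiles = np.array([
    [12, 15, 9, 30],
    [12, 16, 9, 30],
    [18, 11, 14, 22],
    [17, 11, 14, 23],
], dtype=float)
names = ["ind_A", "ind_B", "ind_C", "ind_D"]

Z = linkage(profiles, method="ward")

fig, ax = plt.subplots(figsize=(6, 4))
tree = dendrogram(Z, labels=names, ax=ax)

# dendrogram() returns the x ("icoord") and y ("dcoord") coordinates of each
# U-shaped link; the top of the U sits at the merge distance.
for xs, ys in zip(tree["icoord"], tree["dcoord"]):
    x = 0.5 * (xs[1] + xs[2])
    y = ys[1]
    ax.annotate(f"{y:.2f}", (x, y), ha="center", va="bottom", fontsize=8)

ax.set_ylabel("Ward linkage distance")
fig.tight_layout()

out_path = Path("dendrogram_sketch.png")  # hypothetical output name
fig.savefig(out_path, dpi=150)
plt.close(fig)
```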

Running the Similarity Analysis Script

python genetic_similarity_analysis.py

This will generate the following visualizations:

  • Cosine Similarity Matrix: Cosine_Similarity_Matrix.png
  • Euclidean Distance Matrix: Euclidean_Distance_Matrix.png
  • Pearson Correlation Matrix: Pearson_Correlation_Matrix.png
  • Spearman Correlation Matrix: Spearman_Correlation_Matrix.png
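
A heatmap of this kind can be sketched with seaborn as follows. The example plots only a Pearson matrix computed from made-up profiles, and the output name `pearson_sketch.png`, the colour map, and the figure size are assumptions rather than the settings used by genetic_similarity_analysis.py.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # render to a file without needing a display
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Hypothetical STR profiles and sample names.
profiles = np.array([
    [12, 15, 9, 30],
    [12, 16, 9, 30],
    [18, 11, 14, 22],
    [17, 11, 14, 23],
], dtype=float)
names = ["ind_A", "ind_B", "ind_C", "ind_D"]

# Pearson correlation between individuals (rows).
pearson = np.corrcoef(profiles)

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    pearson,
    annot=True,          # print the value inside each cell
    fmt=".2f",
    cmap="viridis",
    vmin=-1.0,
    vmax=1.0,            # fixed scale so different plots stay comparable
    xticklabels=names,
    yticklabels=names,
    ax=ax,
)
ax.set_title("Pearson Correlation Matrix")
fig.tight_layout()

out_path = Path("pearson_sketch.png")  # hypothetical output name
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

The same pattern applies to the other three matrices; for Euclidean distance the colour scale would need a range starting at 0 rather than -1 to 1.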

Output

The clustering script saves the dendrogram as gene_hierarchy_with_distances.png. The similarity analysis script saves one heatmap per similarity or distance metric, each as a separate PNG file.
