This repository combines two analyses of Short Tandem Repeat (STR) genetic data:
- Ward Linkage Hierarchical Clustering: Constructs a phylogenetic tree to visualize genetic relationships between individuals.
- Genetic Similarity & Distance Matrices: Calculates and visualizes multiple similarity and distance metrics between individuals based on their genetic profiles.
- Hierarchical Clustering (Ward Linkage): Uses Ward's method to perform agglomerative clustering, at each step merging the pair of clusters that gives the smallest increase in within-cluster variance. The result is displayed as a dendrogram, where the height at which two branches merge reflects the genetic dissimilarity between them.
- Similarity/Distance Metrics: Computes and visualizes the following matrices:
  - Cosine Similarity Matrix
  - Euclidean Distance Matrix
  - Pearson Correlation Matrix
  - Spearman Correlation Matrix
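As a rough illustration of how these four matrices relate to an STR profile table, here is a minimal sketch. It is not the repository's script: the toy repeat counts, the locus names, and the individual labels are all made up, and it assumes one row per individual and one numeric column per locus.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: rows are individuals, columns are STR loci
# (repeat counts). The values are invented for illustration only.
profiles = pd.DataFrame(
    {
        "locus_1": [12, 12, 15, 9],
        "locus_2": [8, 9, 11, 7],
        "locus_3": [21, 20, 17, 24],
        "locus_4": [14, 14, 10, 16],
    },
    index=["ind_A", "ind_B", "ind_C", "ind_D"],
)

X = profiles.to_numpy(dtype=float)
labels = profiles.index

# Cosine similarity is 1 minus the cosine distance between two profiles.
cosine_sim = pd.DataFrame(
    1.0 - squareform(pdist(X, metric="cosine")), index=labels, columns=labels
)

# Pairwise Euclidean distance between profiles.
euclidean = pd.DataFrame(
    squareform(pdist(X, metric="euclidean")), index=labels, columns=labels
)

# DataFrame.corr correlates columns, so transpose to correlate individuals.
pearson = profiles.T.corr(method="pearson")
spearman = profiles.T.corr(method="spearman")

print(cosine_sim.round(3))
print(euclidean.round(3))
```

Each result is a square, symmetric table indexed by individual, which is the shape a heatmap expects.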
Both analyses are designed to infer genetic relationships, visualize similarity, and highlight genetic variance between samples.
- Hierarchical Clustering: Constructs a dendrogram annotated with genetic distances between individuals.
- Similarity & Distance Matrices: Generates and visualizes key metrics such as cosine similarity, Euclidean distance, and Pearson/Spearman correlations.
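The clustering feature can be sketched with SciPy's hierarchical clustering routines. This is an illustrative example under assumptions, not the contents of clustering_script.py: the toy profiles and labels are invented, and the choice of two flat clusters is arbitrary.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Hypothetical STR repeat counts: one row per individual, one column per locus.
labels = ["ind_A", "ind_B", "ind_C", "ind_D"]
X = np.array(
    [
        [12, 8, 21, 14],
        [12, 9, 20, 14],
        [15, 11, 17, 10],
        [9, 7, 24, 16],
    ],
    dtype=float,
)

# Ward linkage on raw observations uses Euclidean distance.
# Each row of Z is one merge: (cluster_i, cluster_j, merge height, new size).
Z = linkage(X, method="ward")

# no_plot=True returns the tree layout without drawing anything;
# drop it (with matplotlib installed) to render the dendrogram.
tree = dendrogram(Z, labels=labels, no_plot=True)

# Cut the tree into at most two flat clusters (an arbitrary choice here).
clusters = fcluster(Z, t=2, criterion="maxclust")

print("Leaf order:", tree["ivl"])
print("Merge heights:", np.round(Z[:, 2], 3))
print("Cluster assignment:", dict(zip(labels, clusters)))
```

The third column of `Z` holds the merge heights, which are the distances a dendrogram annotation would display.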
pandas
numpy
scikit-learn
scipy
matplotlib
seaborn
- CSV File: Ensure your genetic data is in CSV format.
  - Example: synthatic_sTR.csv. Note that this file contains synthetic STR data for demonstration purposes.
- Change File Paths: Update the file paths in the .py scripts to match the location of your .csv files.
file_path = 'your_file.csv' # Replace with your file path
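Before running the scripts, it can help to confirm that the file loads as expected. The sketch below assumes a layout the source does not specify: the first column holds individual IDs and every other column is a numeric STR locus. An in-memory string stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd

# Stand-in for a real file with an assumed layout; in practice,
# pass file_path to pd.read_csv instead of the StringIO object.
csv_text = """individual,locus_1,locus_2,locus_3
ind_A,12,8,21
ind_B,12,9,20
ind_C,15,11,17
"""

df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Basic sanity checks: distance metrics need numeric columns and no gaps.
non_numeric = df.select_dtypes(exclude="number").columns.tolist()
missing = int(df.isna().sum().sum())

print(f"{df.shape[0]} individuals x {df.shape[1]} loci")
print("Non-numeric columns:", non_numeric)
print("Missing values:", missing)
```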
python clustering_script.py
This will generate a dendrogram saved as gene_hierarchy_with_distances.png.
python genetic_similarity_analysis.py
This will generate the following visualizations:
- Cosine Similarity Matrix: Cosine_Similarity_Matrix.png
- Euclidean Distance Matrix: Euclidean_Distance_Matrix.png
- Pearson Correlation Matrix: Pearson_Correlation_Matrix.png
- Spearman Correlation Matrix: Spearman_Correlation_Matrix.png
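For readers curious how such a PNG might be produced, here is a minimal sketch using plain matplotlib. It is an assumption about the general approach, not the code in genetic_similarity_analysis.py (which may well use seaborn). The `save_heatmap` helper, the matrix values, and the temporary output directory are all hypothetical.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np


def save_heatmap(matrix, labels, title, out_path):
    """Draw an annotated heatmap of a square matrix and save it as a PNG."""
    fig, ax = plt.subplots(figsize=(5, 4))
    image = ax.imshow(matrix, cmap="viridis")
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    # Write each cell's value on top of its colored square.
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            ax.text(j, i, f"{matrix[i, j]:.2f}",
                    ha="center", va="center", color="white")
    ax.set_title(title)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


# Made-up distance values for three hypothetical individuals.
labels = ["ind_A", "ind_B", "ind_C"]
matrix = np.array([[0.0, 1.4, 7.1],
                   [1.4, 0.0, 6.2],
                   [7.1, 6.2, 0.0]])

out_path = os.path.join(tempfile.mkdtemp(), "Euclidean_Distance_Matrix.png")
save_heatmap(matrix, labels, "Euclidean Distance Matrix", out_path)
print("Saved:", out_path)
```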
Dendrogram for hierarchical clustering saved as gene_hierarchy_with_distances.png.
Heatmaps for similarity and distance metrics saved as separate PNG files.