This repository provides implementations of two basic pairwise sequence alignment techniques: the Dot Plot and the Needleman-Wunsch algorithm. Both are widely used in bioinformatics to compare biological sequences such as DNA, RNA, or protein sequences.
The Needleman-Wunsch algorithm is a global alignment technique used to align entire sequences from end to end. It uses a dynamic programming approach to find the optimal alignment based on a scoring scheme.
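- Initialization: A scoring matrix is created with the sequences along the horizontal and vertical axes, plus one extra leading row and column. The first row and column are initialized with cumulative gap penalties, so the cell at distance k from the origin holds k times the gap penalty.
- Matrix Filling: Each remaining cell takes the maximum of three candidates: the diagonal neighbor plus the match score or mismatch penalty, the cell above plus the gap penalty, and the cell to the left plus the gap penalty. The bottom-right cell then holds the optimal alignment score.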
- Traceback: Starting from the bottom-right corner of the matrix, the algorithm traces back to the top-left corner to determine the alignment path.
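The three steps above can be sketched in plain Python. This is a minimal illustration, not the repository's `NW_alignment` module: the function name `needleman_wunsch`, its default scores, and the tie-breaking order in the traceback (diagonal first, then up, then left) are assumptions made for the example.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-2):
    """Global alignment sketch. Returns (score, aligned_seq1, aligned_seq2).

    Hypothetical helper for illustration; seq1 runs along the rows and
    seq2 along the columns of the scoring matrix.
    """
    n, m = len(seq1), len(seq2)

    # Initialization: first column and first row hold cumulative gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        # Score for aligning seq1[i-1] with seq2[j-1].
        return match if seq1[i - 1] == seq2[j - 1] else mismatch

    # Matrix filling: best of diagonal (match/mismatch), up (gap), left (gap).
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Traceback: walk from the bottom-right corner back to the top-left.
    aligned1, aligned2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned1.append(seq1[i - 1])
            aligned2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned1.append(seq1[i - 1])
            aligned2.append("-")
            i -= 1
        else:
            aligned1.append("-")
            aligned2.append(seq2[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(aligned1)), "".join(reversed(aligned2))
```

When several alignments share the optimal score, a different tie-breaking order in the traceback may return a different but equally good alignment.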
The scoring scheme for the Needleman-Wunsch algorithm includes:
- Match Score: Positive score when two characters are identical.
- Mismatch Penalty: Negative score when characters do not match.
- Gap Penalty: Negative score for introducing a gap in the alignment.
The optimal alignment is found by maximizing the alignment score.
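To make the scoring scheme concrete, the sketch below scores one already-gapped alignment column by column. The function `alignment_score` is a hypothetical helper, not part of the repository; its default values mirror the scoring scheme used in the usage example (match 5, mismatch -2, gap -4).

```python
def alignment_score(aligned1, aligned2, match_score=5, mismatch_penalty=-2, gap_penalty=-4):
    """Sum the per-column scores of two gapped sequences of equal length.

    Hypothetical helper for illustration; '-' marks a gap.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")

    total = 0
    for a, b in zip(aligned1, aligned2):
        if a == "-" or b == "-":
            total += gap_penalty        # a gap in either sequence
        elif a == b:
            total += match_score        # identical characters
        else:
            total += mismatch_penalty   # different characters
    return total


# One candidate alignment of CTATTGACGTA and CTATGAA:
# 7 matching columns and 4 gap columns give 7 * 5 + 4 * (-4) = 19.
print(alignment_score("CTATTGACGTA", "CTAT-GA---A"))  # 19
```

The algorithm searches over all such gapped arrangements and keeps one with the highest total.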
Typical applications of the Needleman-Wunsch algorithm:
- Comparing two complete sequences (e.g., aligning entire protein sequences).
- Studying evolutionary relationships by finding the best global alignment.
- Serving as a foundation for other alignment algorithms, such as Smith-Waterman.
The Dot Plot method is a simple graphical approach to comparing two sequences. It displays similarities between the sequences in matrix form, where each axis represents one of the sequences.
- The sequences are placed along the horizontal and vertical axes of a matrix.
- A dot is placed in the matrix at positions where the corresponding elements of the sequences are identical (or similar based on a threshold).
- The resulting pattern shows regions of similarity, such as diagonals indicating consecutive matches or repeating patterns.
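The construction above can be sketched as a small text-based dot plot. This is an illustration only and does not reproduce the repository's `plot_dot_plot` function: the helpers `dot_matrix` and `render` are hypothetical names, and the sketch marks exact identity only, without a similarity threshold.

```python
def dot_matrix(seq1, seq2):
    """Boolean matrix: rows follow seq2 (vertical), columns follow seq1 (horizontal).

    Hypothetical helper; a cell is True where the two characters are identical.
    """
    return [[a == b for a in seq1] for b in seq2]


def render(seq1, seq2):
    """Return the dot matrix as text, with '*' for a dot and '.' for an empty cell."""
    matrix = dot_matrix(seq1, seq2)
    lines = ["  " + " ".join(seq1)]  # header row labels the horizontal axis
    for label, row in zip(seq2, matrix):
        lines.append(label + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("CTATTGACGTA", "CTATGAA"))
```

In the printed grid, a run of `*` along a diagonal corresponds to a stretch of consecutive matches between the two sequences.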
Typical applications of the dot plot:
- Visual identification of repeating sequences.
- Locating regions of high similarity between sequences.
- Detecting inversions or translocations in genomic data.
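Regions of high similarity appear in a dot plot as long diagonal runs of dots. As a rough sketch of how such a region could be located programmatically rather than by eye, the hypothetical helper below finds the longest run of consecutive exact matches along any diagonal. It is an assumption-laden illustration and not a feature of this repository.

```python
def longest_diagonal(seq1, seq2):
    """Return (length, start_in_seq1, start_in_seq2) of the longest diagonal run.

    Hypothetical helper: a diagonal run is a stretch of consecutive exact
    matches. Returns (0, 0, 0) when the sequences share no character.
    """
    best = (0, 0, 0)
    prev = [0] * (len(seq2) + 1)  # run lengths ending in the previous row
    for i in range(1, len(seq1) + 1):
        curr = [0] * (len(seq2) + 1)
        for j in range(1, len(seq2) + 1):
            if seq1[i - 1] == seq2[j - 1]:
                # Extend the run that ended one step up and to the left.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[0]:
                    best = (curr[j], i - curr[j], j - curr[j])
        prev = curr
    return best


length, start1, start2 = longest_diagonal("CTATTGACGTA", "CTATGAA")
print(length, start1, start2)  # 4 0 0  (the shared prefix "CTAT")
```

Detecting inversions would additionally require comparing one sequence against the reverse complement of the other, which this sketch does not attempt.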
- Clone the Repository:
    git clone git@github.com:joyou159/Pairwise-Sequence-Alignment.git
    cd Pairwise-Sequence-Alignment
- Install Dependencies:
    pip install -r requirements.txt
- Dot Plot
  - Use the provided function to generate a dot plot for two sequences:
      from generate_dot_plot import plot_dot_plot

      sequence1 = "CTATTGACGTA"
      sequence2 = "CTATGAA"
      plot_dot_plot(sequence1, sequence2)
- Needleman-Wunsch Algorithm
  - Plot the alignment scoring matrix and optionally save the alignment result:
      from NW_alignment import plot_alignment

      sequence1 = "CTATTGACGTA"
      sequence2 = "CTATGAA"
      scoring_scheme = {'match_score': 5, 'mismatch_penalty': -2, 'gap_penalty': -4}
      plot_alignment(sequence1, sequence2, scoring_scheme, "./alignment_result.txt")
- Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3), 443-453.
- Mount, D. W. (2004). Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor Laboratory Press.