Cython bindings and Python interface to MUSCLE v5, a highly efficient and accurate multiple sequence alignment software.
MUSCLE is widely-used software for making multiple alignments of biological sequences. Version 5 of MUSCLE achieves the highest scores on several benchmark tests and scales to thousands of sequences on a commodity desktop computer.
pyMUSCLE5 is a Python module that provides bindings to MUSCLE v5 using Cython. It directly interacts with the MUSCLE internals, which has the following advantages:
- single dependency: If your software or your analysis pipeline is distributed as a Python package, you can add pymuscle5 as a dependency to your project, and stop worrying about the MUSCLE binaries being properly set up on the end-user machine.
- no intermediate files: Everything happens in memory, in a Python object you fully control, so you don't have to invoke the MUSCLE CLI using a sub-process and temporary files. Sequences can be passed directly as strings or bytes, which avoids the overhead of formatting your input to FASTA for MUSCLE.
- no OpenMP: The original MUSCLE code uses OpenMP to parallelize embarrassingly parallel tasks. In pyMUSCLE5 the dependency on OpenMP has been removed in favor of the Python threading module for better portability.
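To illustrate the last point, here is a minimal sketch of the general pattern of spreading independent tasks over Python threads using only the standard library. It is not the actual pyMUSCLE5 code: the score_pair function and the input sequences are hypothetical placeholders, and real speedups depend on the worker releasing the GIL, which compiled Cython code is able to do.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def score_pair(pair):
    # Hypothetical stand-in for one independent unit of work:
    # count the positions at which two sequences are identical.
    a, b = pair
    return sum(x == y for x, y in zip(a, b))


# Toy input data (assumption, for illustration only)
sequences = [b"MTETA", b"MSETA", b"MTQTA"]
pairs = list(combinations(sequences, 2))

# Each pair is processed independently, so the tasks can simply be
# mapped over a pool of worker threads; results keep the input order.
with ThreadPoolExecutor(max_workers=2) as pool:
    scores = list(pool.map(score_pair, pairs))

print(scores)
```

Because no task depends on the result of another, the pool needs no locking or ordering logic beyond collecting the results.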
This library is in a very experimental stage at the moment, and consistency of the results across versions or platforms is not guaranteed yet.
At the moment pyMUSCLE5 is not available on PyPI. You can however install it directly from GitHub with:
$ pip install git+https://github.com/althonos/pymuscle5
Let's load some sequences from a FASTA file, use an Aligner to align the proteins together, and print the alignment in two-line FASTA format.
import os

import Bio.SeqIO
import pymuscle5

# Parse the FASTA records with Biopython
path = os.path.join("pymuscle", "tests", "data", "swissprot-halorhodopsin.faa")
records = list(Bio.SeqIO.parse(path, "fasta"))

# Wrap each record in a pymuscle5.Sequence (name and sequence as bytes)
sequences = [
    pymuscle5.Sequence(record.id.encode(), bytes(record.seq))
    for record in records
]

# Align the sequences
aligner = pymuscle5.Aligner()
msa = aligner.align(sequences)

# Print the alignment in two-line FASTA format
for seq in msa.sequences:
    print(f">{seq.name.decode()}")
    print(seq.sequence.decode())
# Same workflow, using scikit-bio instead of Biopython to read the FASTA file
import os

import skbio.io
import pymuscle5

path = os.path.join("pymuscle", "tests", "data", "swissprot-halorhodopsin.faa")
records = list(skbio.io.read(path, "fasta"))

# Expose each sequence as unsigned bytes before wrapping it
sequences = [
    pymuscle5.Sequence(record.metadata["id"].encode(), record.values.view('B'))
    for record in records
]

aligner = pymuscle5.Aligner()
msa = aligner.align(sequences)

# Print the alignment in two-line FASTA format
for seq in msa.sequences:
    print(f">{seq.name.decode()}")
    print(seq.sequence.decode())
With scikit-bio, we need to call the view method on record.values so that Cython can access the sequence as an array of unsigned char.
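Both examples above rely on a third-party library only to read the FASTA file. If that dependency is unwanted, a minimal reader can be sketched with the standard library alone. The read_fasta helper below is hypothetical and not part of pymuscle5; it assumes a well-formed FASTA input opened in binary mode and keeps only the first word of each header as the name.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs of bytes from a binary FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(b">"):
            # A new header: emit the previous record, if any
            if name is not None:
                yield name, b"".join(chunks)
            header = line[1:].split(None, 1)
            name = header[0] if header else b""
            chunks = []
        elif name is not None:
            # Sequence data may span several lines
            chunks.append(line)
    if name is not None:
        yield name, b"".join(chunks)


# Toy in-memory input (assumption, for illustration only)
demo = io.BytesIO(b">sp|P1|A first\nMTE\nTAA\n>sp|P2|B second\nMSI\n")
pairs = list(read_fasta(demo))
print(pairs)
```

Each (name, sequence) pair is already made of bytes, so it could be passed to pymuscle5.Sequence in the same way as in the examples above.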
Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.
Contributions are more than welcome! See CONTRIBUTING.md for more details.
This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.
This library is provided under the GNU General Public License v3.0.
The MUSCLE code was written by Robert Edgar and is distributed under the terms of the GPLv3 as well. See vendor/muscle/LICENSE for more information.
This project is in no way affiliated with, sponsored, or otherwise endorsed by the original MUSCLE authors. It was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.