We herein developed a bioinformatic protocol for the accurate identification, characterization, quantitation and annotation of MASP molecules.
From GENOME
To start, run your genome sequence through the EMBOSS getorf program. We recommend the following parameters:
getorf -minsize 120 -maxsize 20100 -find 1
The multiFASTA file produced by getorf is the input for the algorithm.
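As a minimal sketch, the ORF extraction step could also be scripted from Python. The `-sequence` and `-outseq` options are the standard getorf qualifiers for the input and output files. The helper names and the file names `genome.fasta` and `orfs.fasta` are hypothetical and not part of this tool.

```python
import shutil
import subprocess


def build_getorf_command(genome_fasta, orf_fasta):
    """Return the getorf argument list using the parameters recommended above."""
    return [
        "getorf",
        "-sequence", genome_fasta,   # input genome (nucleotide FASTA)
        "-outseq", orf_fasta,        # output multiFASTA with translated ORFs
        "-minsize", "120",
        "-maxsize", "20100",
        "-find", "1",
    ]


def run_getorf(genome_fasta, orf_fasta):
    """Run getorf if EMBOSS is installed; raise a clear error otherwise."""
    if shutil.which("getorf") is None:
        raise RuntimeError("getorf not found on PATH; install EMBOSS first.")
    subprocess.run(build_getorf_command(genome_fasta, orf_fasta), check=True)
```

Calling `run_getorf("genome.fasta", "orfs.fasta")` would then write the ORFs to `orfs.fasta`, which can be selected later as the protein multiFASTA input.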
From PROTEOME
To execute the algorithm, download the MASP-algorithm.zip file and save it in your Downloads folder. Unzip the file in your desired location, creating a folder named MASP-Algorithm that contains the Python script and two HMM matrices. Open a terminal and navigate to this folder, either manually or by right-clicking the folder and selecting “Open in Terminal” for convenience.
Make sure you download the ZIP folder containing the algorithm and its dependent files. After unzipping, do NOT delete anything inside the folder.
Then run:
python3 MASP-AnnotationAlgorithm.py
First, select the directory where all outputs will be stored (Fig.A). Next, select the multiFASTA file containing the protein sequences; this can also be the output file generated by the EMBOSS getorf program (Fig.B). Once this is done, a prompt will appear in the terminal asking for the name of the strain to be analyzed, which will be used in the names of the output files (Fig.C).
(A) Inside the extracted ZIP folder, three essential files are displayed: the main Python script and two HMM profiles. (B) Users can open the working directory in the terminal by right-clicking the folder and selecting "Open in Terminal", which automatically sets the terminal to the correct directory. (C) In the terminal, users execute the algorithm by running the Python script, specifying required inputs. (D) The user selects the output directory where all result files will be stored, with the chosen folder path displayed in the interface. The protein multiFASTA file containing the sequences to be analyzed is selected, with the file path shown at the bottom. (E) Terminal prompt for entering the name of the strain to be analyzed, enabling the customization of output files.
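Before selecting the multiFASTA file, it can help to confirm that it contains the expected number of records and no empty entries. The following is a minimal sketch of such a check; it is not part of the algorithm, and the function name `count_fasta_records` is hypothetical.

```python
def count_fasta_records(lines):
    """Count FASTA records and list the IDs of records with no residues."""
    records = 0
    empty = []
    header = None
    length = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if header is not None and length == 0:
                empty.append(header)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            records += 1
            length = 0
        else:
            length += len(line)
    # Close the last record in the file.
    if header is not None and length == 0:
        empty.append(header)
    return records, empty
```

For a file on disk, the function can be called as `count_fasta_records(open(path))`, where `path` points to the multiFASTA file to be analyzed.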
The algorithm will then begin running, displaying updates on its progress like this:
All N-terminal of the entered sequences were analyzed.
All C-terminal of the entered sequences were analyzed.
Analyzing and annotating MASP sequences
Searching for chimeric sequences
All sequences were classified
Selecting sequences according to hierarchical ranking
Finished selecting sequences according to hierarchical ranking
Once the classification and annotation of MASPs is finished, the following messages will be displayed on the terminal:
FASTA files generated by prediction of the algorithm.
GFF file generated: /home/user/Selected_Folder/MASP_Strain_XX_sequences.gff
The information has been stored in /home/user/Selected_Folder/README_Strain_XX.txt
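The GFF output can be inspected with a few lines of Python. The sketch below assumes the file follows the standard nine-column, tab-separated GFF layout; the feature types written by the algorithm are not documented here, so the code only counts whatever appears in the third column. The function name `count_feature_types` is hypothetical.

```python
import csv
from collections import Counter

# Standard GFF column names (assumed layout of the generated file).
GFF_COLUMNS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def count_feature_types(lines):
    """Count annotated features per type, skipping comments and malformed rows."""
    counts = Counter()
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(GFF_COLUMNS):
            continue
        record = dict(zip(GFF_COLUMNS, row))
        counts[record["type"]] += 1
    return counts
```

Passing an open file handle for the generated `.gff` file returns a `Counter` that maps each feature type to the number of annotated entries.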
To operate correctly, this tool requires the following libraries to be installed beforehand:
--> pandas, version 2.8.0 or later
--> PyHMMER, from https://pyhmmer.readthedocs.io/en/stable/
--> tkinter (the default Python interface to the Tk GUI toolkit)
--> Jinja2, which may also need to be installed depending on the computer: https://jinja.palletsprojects.com/en/stable/
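A quick way to confirm that these libraries are available before running the script is sketched below. It only checks that each module can be located by Python, not that its version is sufficient or that the import succeeds; the function name `find_missing_modules` is hypothetical.

```python
import importlib.util

# Import names of the libraries listed above.
REQUIRED_MODULES = ["pandas", "pyhmmer", "tkinter", "jinja2"]


def find_missing_modules(module_names):
    """Return the module names that cannot be located in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

Running `find_missing_modules(REQUIRED_MODULES)` returns an empty list when everything is installed, or the names of the modules that still need to be installed.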