UniRef gene family-level pangenome building and annotation

This tool provides a pipeline that annotates input genome sequences, clusters them into UniRef90/UniRef50 gene families, and clusters the remaining unknown coding sequences. The output is a ready-to-use PanPhlAn pangenome: it contains the contigs of all genomes in a multi-FASTA file, precomputed Bowtie2 indexes, and a pangenome TSV file mapping gene locations to contigs.

Pipeline

Prokka is run on each provided genome to annotate it.
Using the UniRef annotator and the UniRef DIAMOND database, the annotated sequences are assigned UniRef90 and UniRef50 IDs.
The remaining sequences (those not mapped by the UniRef annotator) are clustered together at the same thresholds (90% and 50% similarity). These clusters receive UniRef90_UNK and UniRef50_UNK (unknown) IDs.
Finally, the PanPhlAn pangenome is generated: the contigs of all genomes are concatenated, the TSV mapping file is written, and the Bowtie2 indexes are built.
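
The second and third steps amount to a two-tier ID assignment: a sequence keeps its UniRef hit when one exists, otherwise it takes an ID derived from its cluster. The sketch below illustrates this for the 90% level. The function name, the input dictionaries, the example UniRef ID and the `UniRef90_UNK-<representative>` suffix format are assumptions made for illustration; the pipeline's actual naming scheme may differ.

```python
def assign_family_ids(gene_ids, uniref_hits, clusters, level="UniRef90"):
    """Return {gene_id: family_id} for one identity level.

    gene_ids    -- iterable of gene identifiers (e.g. Prokka locus tags)
    uniref_hits -- {gene_id: UniRef ID} for genes mapped by the annotator
    clusters    -- {representative_gene_id: [member_gene_ids]} for the
                   unmapped genes (hypothetical clustering output)
    """
    # Invert the cluster table so each member points to its representative.
    representative_of = {
        member: rep for rep, members in clusters.items() for member in members
    }
    families = {}
    for gene in gene_ids:
        if gene in uniref_hits:
            families[gene] = uniref_hits[gene]
        else:
            # Assumed ID format: <level>_UNK-<cluster representative>.
            # A gene absent from every cluster is treated as a singleton.
            rep = representative_of.get(gene, gene)
            families[gene] = f"{level}_UNK-{rep}"
    return families


example = assign_family_ids(
    gene_ids=["g1", "g2", "g3"],
    uniref_hits={"g1": "UniRef90_P0A7G6"},  # illustrative ID
    clusters={"g2": ["g2", "g3"]},
)
```

Here `g1` keeps its UniRef ID, while `g2` and `g3` share one unknown family because they fall in the same cluster.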

Dependencies

The following Python packages are needed:

BioPython
bcbio-gff
gffutils

The following external tools must be installed and available on the PATH:

Prokka (https://github.com/tseemann/prokka)
MMseqs2 (https://github.com/soedinglab/MMseqs2)
DIAMOND (https://github.com/bbuchfink/diamond)
Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml)

In addition, the UniRef DIAMOND databases must be downloaded with the download_databases.py script.
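
Before a long run it can help to confirm that everything listed above is reachable. The following is a minimal sketch, not part of the repository: it assumes the usual import names (`Bio` for BioPython, `BCBio` for bcbio-gff) and the usual executable names (`mmseqs` for MMseqs2, `bowtie2-build` for index building). Adjust them if your installation differs.

```python
import importlib.util
import shutil

# Import names of the Python packages: BioPython is imported as "Bio",
# bcbio-gff as "BCBio".
PYTHON_MODULES = ["Bio", "BCBio", "gffutils"]

# Usual executable names (assumption; adjust to your installation).
EXECUTABLES = ["prokka", "mmseqs", "diamond", "bowtie2", "bowtie2-build"]


def missing_dependencies(modules=PYTHON_MODULES, executables=EXECUTABLES):
    """Return (missing_modules, missing_executables)."""
    # find_spec returns None when a top-level module cannot be located.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on the PATH.
    missing_tools = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_tools
```

Calling `missing_dependencies()` with no arguments returns two lists; both are empty when all dependencies are found.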

Usage

python panphlan_exporter.py --input [input_genomes_folder]          \
                            --output [output_pangenome_folder]      \
                            --db_path [path_to_UniRef_DIAMOND_databases]

--input [input_genomes_folder] should contain one FASTA file per genome. The script assumes that the file name is the genome name.
--output [output_pangenome_folder] is created if it does not exist.
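
Because genome names come from file names, it is worth checking the input folder for name clashes beforehand. The sketch below is an illustration only: the accepted extensions and the choice to drop the extension from the genome name are assumptions, and `list_genomes` is a hypothetical helper, not a function of the script.

```python
from pathlib import Path

# Assumed set of FASTA extensions; the script may accept others.
FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}


def list_genomes(input_folder):
    """Map genome name (file name without extension) to its FASTA path."""
    genomes = {}
    for path in sorted(Path(input_folder).iterdir()):
        if path.is_file() and path.suffix.lower() in FASTA_SUFFIXES:
            # Two files such as a.fa and a.fna would yield the same name.
            if path.stem in genomes:
                raise ValueError(f"duplicate genome name: {path.stem}")
            genomes[path.stem] = path
    return genomes
```

A folder holding `Escherichia_coli_K12.fna` would yield a single genome named `Escherichia_coli_K12`; non-FASTA files are ignored.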

Additional parameters can be provided:

-t or --tmp specifies another directory for temporary files. Default is the output folder.
-c or --clade_name specifies a prefix for the PanPhlAn output files. Ideally, use the full species name (e.g. Escherichia_coli). Default is panplhan_clade.
-n or --nprocs specifies the number of threads to use.
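
When the exporter is driven from another Python script, the options above can be assembled programmatically. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, the folder names are placeholders, and the script is assumed to be in the current working directory. Only the flags documented above are used.

```python
import subprocess
import sys


def build_command(input_dir, output_dir, db_path,
                  tmp=None, clade_name=None, nprocs=None):
    """Build the argument list for panphlan_exporter.py."""
    cmd = [sys.executable, "panphlan_exporter.py",
           "--input", str(input_dir),
           "--output", str(output_dir),
           "--db_path", str(db_path)]
    # Optional flags are only added when a value is given, so the
    # script's own defaults apply otherwise.
    if tmp is not None:
        cmd += ["--tmp", str(tmp)]
    if clade_name is not None:
        cmd += ["--clade_name", clade_name]
    if nprocs is not None:
        cmd += ["--nprocs", str(nprocs)]
    return cmd


# Placeholder paths, for illustration only.
cmd = build_command("genomes/", "pangenome/", "uniref_db/",
                    clade_name="Escherichia_coli", nprocs=8)
# To launch the pipeline: subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues with folder names that contain spaces.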

N.B.: If the output folder is already a PanPhlAn pangenome folder (containing the 8 or 9 files of a PanPhlAn pangenome: 1 fna, 1 pangenome tsv, 6 index files and 1 optional annotation file), the pangenome generated by the pipeline will extend the existing one.
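
To know in advance whether a run will extend an existing pangenome, the file counts given in the note can be checked by hand or with a short script. The sketch below is a heuristic, not the script's actual test: it assumes the index files carry the standard Bowtie2 extensions and that the optional annotation file is also a `.tsv` file.

```python
from pathlib import Path


def looks_like_pangenome(folder):
    """Heuristic check for an existing PanPhlAn pangenome folder."""
    folder = Path(folder)
    if not folder.is_dir():
        return False
    names = [p.name for p in folder.iterdir() if p.is_file()]
    fna = [n for n in names if n.endswith(".fna")]
    # Bowtie2 writes six index files: .1 to .4 plus .rev.1 and .rev.2
    # (.bt2 for small indexes, .bt2l for large ones).
    indexes = [n for n in names if n.endswith((".bt2", ".bt2l"))]
    tsv = [n for n in names if n.endswith(".tsv")]
    # One pangenome TSV, plus at most one annotation TSV (assumed format).
    return len(fna) == 1 and len(indexes) == 6 and len(tsv) in (1, 2)
```

If this returns True for the chosen output folder, expect the new genomes to be added to the pangenome already stored there.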

SegataLab/PanPhlAn_pangenome_exporter

Folders and files

Latest commit

History

Repository files navigation

UniRef genes families-level pangenome building and annotation

Pipeline

Dependencies :

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages