Auto-PSS-Genome (Automatic Positively Selected Sites Genome) is a Compi pipeline that automatically identifies positively selected amino acid sites (PSS) in complete genomes (FASTA files containing all coding sequences) using three different methods: CodeML, omegaMap, and FUBAR. A Docker image is available for this pipeline in this Docker Hub repository.
This process comprises the following steps:
- Use the GenomeFastScreen pipeline to quickly identify genes that likely show PSS.
- Apply the CheckCDS pipeline to the files that GenomeFastScreen failed to analyze, in order to try to convert them into valid CDS files.
- Reanalyze the converted files using GenomeFastScreen.
- Finally, perform a more detailed analysis of all the genes that likely show PSS using the IPSSA pipeline.
To use the Auto-PSS-Genome image, first create a directory in your local file system (auto_pss_genome_project in the example) with the following structure:
auto_pss_genome_project/
├── input
│   ├── 1.fasta
│   ├── 2.fasta
│   ├── .
│   ├── .
│   ├── .
│   └── n.fasta
│
├── global
│   └── global-reference-file.fasta
│
├── pss-genome-fs.params
├── ipssa-project.params
└── check-cds.params
Where:
- The input FASTA files to be analyzed must be placed in the auto_pss_genome_project/input directory.
- The global reference FASTA file for the GenomeFastScreen pipeline is optional; if used, it must be placed at auto_pss_genome_project/global/global-reference-file.fasta.
- The pss-genome-fs.params file is the Compi parameters file for the GenomeFastScreen pipeline.
- The ipssa-project.params file is the Compi parameters file for the IPSSA pipeline.
- The check-cds.params file is the Compi parameters file for the CheckCDS pipeline.
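Before running anything, it can help to confirm that the project directory matches the layout above. The following is a minimal sketch, not part of Auto-PSS-Genome: `check_project_layout` is a hypothetical helper name, and it only checks that the listed entries exist and that the input directory is not empty.

```shell
# Hypothetical helper (not shipped with Auto-PSS-Genome): report which of the
# expected project entries are missing from a project directory.
check_project_layout() {
  local project_dir="$1"
  local missing=0
  local entry

  # Required directory and parameter files, as listed in the structure above.
  for entry in input pss-genome-fs.params ipssa-project.params check-cds.params; do
    if [ ! -e "${project_dir}/${entry}" ]; then
      echo "missing: ${entry}" >&2
      missing=1
    fi
  done

  # The input directory must contain at least one file to analyze.
  if [ -z "$(ls -A "${project_dir}/input" 2>/dev/null)" ]; then
    echo "no input files found in input/" >&2
    missing=1
  fi

  # The global reference file is optional, so its absence is only noted.
  if [ ! -f "${project_dir}/global/global-reference-file.fasta" ]; then
    echo "note: no global reference file (optional)" >&2
  fi

  return "${missing}"
}
```

For example, `check_project_layout /path/to/auto_pss_genome_project` returns a non-zero status and prints the missing entries to standard error if the layout is incomplete.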
You can populate the Auto-PSS-Genome project directory, including sample Compi parameter files with default values, by running the following commands (you only need to set AUTO_PSS_GENOME_PD to the right path in your local file system):
AUTO_PSS_GENOME_PD=/path/to/auto_pss_genome_project
mkdir -p "${AUTO_PSS_GENOME_PD}"
docker run --user "$(id -u):$(id -g)" --rm -v "${AUTO_PSS_GENOME_PD}":/working_dir pegi3s/auto-pss-genome init-working-dir.sh /working_dir
Now, you should:
- Put the input FASTA files in the auto_pss_genome_project/input directory.
- If required, put the global reference FASTA file in the auto_pss_genome_project/global directory.
- Edit the parameters of the GenomeFastScreen pipeline in the pss-genome-fs.params file. Here it is mandatory to set reference_file (which must be the name of a file in the auto_pss_genome_project/input directory) and blast_type. Optionally, set the global_reference_file value (and remove the # at the beginning of the line).
- Edit the parameters of the CheckCDS pipeline in the check-cds.params file. Here you only need to provide the reference word (case insensitive) that appears in the sequence headers and identifies the reference sequences when trying to create valid CDS files.
- Check the values of the parameters of the IPSSA pipeline in the ipssa-project.params file. This file contains the default recommended values for this pipeline, which may need to be adjusted.
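The two mandatory GenomeFastScreen settings can be checked with a few lines of shell before launching the pipeline. This is a sketch under one assumption: that the parameters file uses one uncommented `name=value` entry per line. The helper names `get_param` and `check_fs_params` are hypothetical and are not part of the Docker image.

```shell
# Print the value of the first uncommented "name=value" line for a parameter.
# Assumes a simple name=value format; lines starting with '#' never match.
get_param() {
  sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | head -n 1
}

# Hypothetical check: reference_file must name a file in input/ and
# blast_type must be set in pss-genome-fs.params.
check_fs_params() {
  local project_dir="$1"
  local params="${project_dir}/pss-genome-fs.params"
  local reference_file blast_type

  reference_file="$(get_param "${params}" reference_file)"
  blast_type="$(get_param "${params}" blast_type)"

  if [ -z "${reference_file}" ] || [ ! -f "${project_dir}/input/${reference_file}" ]; then
    echo "reference_file must name a file in input/" >&2
    return 1
  fi
  if [ -z "${blast_type}" ]; then
    echo "blast_type is not set" >&2
    return 1
  fi
}
```

Running `check_fs_params "${AUTO_PSS_GENOME_PD}"` catches the most common configuration mistake (a reference file name that does not exist in the input directory) without starting any container.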
Once this structure and these files are ready, adapt and run the following commands to execute the entire pipeline. You only need to set AUTO_PSS_GENOME_PD to the right path in your local file system and COMPI_NUM_TASKS to the maximum number of parallel tasks that can be run. Note that the --host_working_dir parameter is mandatory and must point to the pipeline working directory in the host machine.
AUTO_PSS_GENOME_PD=/path/to/auto_pss_genome_project
COMPI_NUM_TASKS=6
docker run --rm -v /tmp:/tmp -v /var/run/docker.sock:/var/run/docker.sock -v "${AUTO_PSS_GENOME_PD}":/working_dir pegi3s/auto-pss-genome /compi run -o --logs /working_dir/logs --num-tasks ${COMPI_NUM_TASKS} -- --host_working_dir "${AUTO_PSS_GENOME_PD}" --compi_num_tasks ${COMPI_NUM_TASKS}
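When choosing COMPI_NUM_TASKS, one possible heuristic (an assumption of this sketch, not a recommendation from the pipeline authors) is to cap the requested value at the number of CPUs available on the host. The function name `cap_num_tasks` is hypothetical; its optional second argument overrides the detected CPU count.

```shell
# Hypothetical helper: print the requested number of tasks, capped at the
# number of available CPUs (detected with getconf unless given explicitly).
cap_num_tasks() {
  local requested="$1"
  local available="${2:-$(getconf _NPROCESSORS_ONLN)}"

  if [ "${requested}" -gt "${available}" ]; then
    echo "${available}"
  else
    echo "${requested}"
  fi
}
```

For instance, `COMPI_NUM_TASKS="$(cap_num_tasks 6)"` keeps the value at 6 on a host with six or more CPUs and lowers it otherwise.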
The sample data is available here. Download and uncompress it, and move to the directory named auto-pss-genome-m-haemophylum, where you will find:
- A directory called auto-pss-genome-project, which contains the structure described previously.
- A file called run.sh, which contains the following commands (where you should adapt the AUTO_PSS_GENOME_PD path) to test the pipeline:
AUTO_PSS_GENOME_PD=/path/to/auto-pss-genome-project
COMPI_NUM_TASKS=8
docker run --rm -v /tmp:/tmp -v /var/run/docker.sock:/var/run/docker.sock -v ${AUTO_PSS_GENOME_PD}:/working_dir --rm pegi3s/auto-pss-genome /compi run -o --logs /working_dir/logs --num-tasks ${COMPI_NUM_TASKS} -- --host_working_dir ${AUTO_PSS_GENOME_PD} --compi_num_tasks ${COMPI_NUM_TASKS}
- Running time: ≈ 11.5 hours with 50 parallel tasks on Ubuntu 18.04.2 LTS, 96 CPUs (AMD EPYC™ 7401 @ 2GHz), 1TB of RAM and SSD disk.
Since pegi3s/auto-pss-genome:1.11.0, several scripts are included to help prepare B+ submission files. Check out this section to discover how to use them.
To build the Docker image, compi-dk is required. Once you have it installed, simply run compi-dk build from the project directory to build the Docker image. The image will be created with the name specified in the compi.project file (i.e. pegi3s/auto-pss-genome:latest). This file also specifies the version of Compi that goes into the Docker image.
- H. López-Fernández; C. P. Vieira; P. Ferreira; P. Gouveia; F. Fdez-Riverola; M. Reboiro-Jato; J. Vieira (2021) On the identification of clinically relevant bacterial amino acid changes at the whole genome level using Auto-PSS-Genome. Interdisciplinary Sciences: Computational Life Sciences. Volume 13, pp. 334–343.