ARA (Automated Record Analysis): An automatic pipeline for exploration of SRA datasets with sequences as a query
- Docker
  - Please check out the Docker installation guide.

or

- Mamba package manager
  - Please check out the mamba or micromamba official installation guide.
  - We prefer mamba over conda since it is faster and uses libsolv to resolve dependencies efficiently.
  - conda can still be used to install the pipeline with the same commands described in the installation section.

Note: It is important to include the 'bioconda' channel in addition to the other channels, as indicated in the official manual. Use the following commands in the given order to configure the channels (one-time setup).
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict
The user can install the pipeline with either Docker or Mamba by following the steps below.
First, click the green "Code" button, then select "Download Zip" to download the contents of this repository. Once the download is complete, extract the zip file into the desired location before starting the setup. Please use the commands shown below to begin installing the pipeline.
Alternatively, the GitHub repository can be cloned using the options shown after clicking the "Code" button. Navigate into the folder with the cd ARA/ command before starting the setup.
Warning: Before starting any analysis with the pipeline, please make sure that the system has enough disk space available for the data you wish to retrieve and process from the SRA repository.
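The warning above can be turned into a quick pre-flight check. The sketch below is not part of ARA: the helper name has_free_space and the 2 GB figure are assumptions chosen for illustration, and the real requirement depends on the runs and the fraction of data you plan to download.

```bash
# Hypothetical helper (not part of ARA): succeeds when the filesystem that
# holds DIR has at least REQUIRED_KB kilobytes available.
has_free_space() {
    local dir="$1" required_kb="$2" available_kb
    # -P forces the single-line POSIX output format; -k reports 1024-byte blocks.
    available_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    # If df failed, available_kb is empty and the test below fails as well.
    [ "$available_kb" -ge "$required_kb" ] 2>/dev/null
}

# Example: require roughly 2 GB (an arbitrary figure) in the current directory.
if has_free_space . $((2 * 1024 * 1024)); then
    echo "Enough disk space available"
else
    echo "Insufficient disk space" >&2
fi
```

Run it from the directory where the results will be written, since that is the filesystem the check inspects.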
- Using Docker

Pull the latest image:
docker pull ghcr.io/maurya-anand/ara
Run the container and print the usage instructions:
docker run -it ghcr.io/maurya-anand/ara

or

- Using Mamba
cd ARA-main/
mamba env create --file requirements.yaml
mamba activate ara_env
perl setup.pl
Note: After installation, the virtual environment consumes approximately 1.5 GB of disk space. The installation was tested on "Ubuntu 20.04.4 LTS", "Ubuntu 22.04.1 LTS" and "Fedora 37" using the procedure mentioned above.
Please be patient because downloading and configuring the tools/modules may take several minutes. The warning messages that appear during the installation of certain Perl modules can be ignored by users.
Optional: The user can also add the current directory to PATH for ease of use. Run chmod +x ara.pl followed by export PATH="$(pwd):$PATH". Alternatively, the user is free to create a symbolic link, copy the executable to /bin/, or use any other method depending on their operating system.
Refer to the 'Troubleshooting' section in case of any installation-related issues.
- Docker
docker run -it ghcr.io/maurya-anand/ara perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa --output src/main/test/ --mode screen --config conf.txt

- Mamba environment
perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
To get full usage info: perl ara.pl --help
Note: The user can delete the contents of the results/ directory after testing the tool with the example above.
The configuration file conf.txt is generated automatically by the setup script during installation. It contains default parameters as well as the locations of the executable binaries of the tools incorporated in the pipeline.
The user can modify the default parameters in conf.txt and pass the file to the pipeline as an input. For example, the data_perc option sets the percentage of the dataset selected for analysis (default: 5). The user can provide any integer value between 1 and 100 to specify the desired percentage.
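Before editing conf.txt, a value intended for data_perc can be checked against the documented range. This is a minimal sketch, assuming only the rule stated above (an integer from 1 to 100); the function name is hypothetical and ARA itself may validate the value differently.

```bash
# Hypothetical helper (not part of ARA): accept only integers from 1 to 100.
is_valid_data_perc() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty input and anything non-numeric
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 100 ]
}

# Try a few candidate values.
for value in 5 100 0 101 7.5; do
    if is_valid_data_perc "$value"; then
        echo "$value: accepted"
    else
        echo "$value: rejected"
    fi
done
```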
Similarly, the user can choose between blastn and bowtie2 by setting the 'execute flag' to either 0 or 1 in the configuration file while leaving the rest of the parameters at their default values. By default, both tools are enabled, i.e. execute = 1.
The read_drop_perc_cutoff parameter in the conf.txt config file sets the cutoff for discarding a sample based on the percentage of reads dropped by Trimmomatic (by default, if more than 70% of the reads are dropped according to the Trimmomatic log, the sample fails the quality criteria and is not processed downstream). Please refer to the Trimmomatic documentation for more details about the parameters present in the config file.
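The arithmetic behind this cutoff can be illustrated as follows. This is a sketch of the rule as described here, not ARA's actual implementation: the function name and the way read counts are passed in are assumptions, and the cutoff is treated as an integer percentage.

```bash
# Hypothetical helper (not part of ARA): succeeds when a sample passes, i.e.
# the percentage of reads dropped during trimming does not exceed the cutoff.
# Arguments: total input reads, surviving reads, cutoff in percent (default 70).
passes_read_drop_cutoff() {
    local total="$1" surviving="$2" cutoff="${3:-70}"
    [ "$total" -gt 0 ] || return 1
    # dropped% <= cutoff  <=>  (total - surviving) * 100 <= cutoff * total
    # Integer arithmetic avoids floating-point rounding at the boundary.
    [ $(( (total - surviving) * 100 )) -le $(( cutoff * total )) ]
}

# 1000 reads in, 250 surviving: 75% dropped, which exceeds the default cutoff.
if passes_read_drop_cutoff 1000 250; then
    echo "sample passes"
else
    echo "sample fails the quality criteria"
fi
```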
Similarly, the minimum alignment rate is set by the alignment perc cutoff parameter under blastn and bowtie2 in the conf.txt configuration file (if the total alignment percentage is less than the threshold, the pipeline reports that the sample failed the quality criteria). More details about the parameters used in the conf.txt file can be found in the respective documentation of Blastn and Bowtie2.
By default, the pipeline uses a pre-built Kraken2 viral genomic database (release: 9/8/2022) from https://benlangmead.github.io/aws-indexes/k2. Users can provide their own database by changing the kraken2_db_path parameter in the conf.txt file.
Note: An example configuration config.txt is provided in the examples/ directory. If the user wishes to use an installation other than Bioconda, the user can install the required tools manually and specify the absolute paths of the executable binaries in the configuration.
- --input (mandatory) The user can provide input in any of the following ways:
  - A single SRA run accession. e.g.:
    perl ara.pl --input SRR12548227 --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
  - A list of run accessions in a text file (one run accession per line). e.g.:
    perl ara.pl --input example/list.txt --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
  - The SRA RunInfo exported directly from the NCBI-SRA web portal. Go to the SRA homepage and search for the desired keyword. Export the SraRunInfo.csv by clicking 'Send to' => File => RunInfo. e.g.:
    perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
- --sequences (mandatory) The user should provide a fasta file containing the query sequences.
- --output (optional) The output directory to store the results. By default, the output is stored in the results/ directory of the package. e.g.:
  perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa --output /src/main/test/
- --mode (optional) Choose one of the three modes to run the pipeline.
  - The screen mode is the default. It downloads only a fraction of the dataset per SRA run accession and analyses the file according to the given configuration.
  - The full mode executes the pipeline by downloading the complete fastq file per SRA run accession.
  - The both mode first screens a fraction of the data for samples that meet the minimum alignment cutoff from either 'bowtie2' or 'blastn', and then automatically performs the alignment on those samples by downloading the entire fastq file.

  e.g.:
  perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa --output /src/main/test/ --mode screen

  Note: There is a supporting summary mode that generates a unified alignment summary by examining the output files created by either screen mode or full mode. The summary mode should only be used when the user needs to recreate the summary stats from pre-existing results. The user must enter --mode summary along with the previously used command parameters to regenerate the summary.
- --config (optional) Pipeline configuration. By default, it uses the conf.txt generated by the setup script. e.g.:
  perl ara.pl --input example/SraRunInfo.csv --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa --output /src/main/test/ --mode screen --config conf.txt
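The text-file form of --input (one run accession per line) can be derived from an exported SraRunInfo.csv when only the accessions are needed. The sketch below assumes that the run accession is the first comma-separated column and that header rows start with "Run", which should be verified against your own export. The function name and the output file name are hypothetical.

```bash
# Hypothetical helper (not part of ARA): print the first column of a RunInfo
# CSV, skipping blank lines and header rows (first field equal to "Run").
runinfo_to_list() {
    awk -F',' '$1 != "" && $1 != "Run" { print $1 }' "$1"
}

# Usage (file names are examples):
#   runinfo_to_list example/SraRunInfo.csv > my_runs.txt
#   perl ara.pl --input my_runs.txt --sequences example/Arabidopsis_thaliana.TAIR10.ncrna.fa
```

Passing the CSV itself to --input remains the simpler route. A derived list is mainly useful for trimming the selection by hand before a run.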
The pipeline creates one folder per SRA run accession and generates results using the run accession as the prefix. The analysis of the screened fraction of the data is stored in the screening_results directory, whereas the analysis of the whole dataset is stored in the full_analyis_results directory.
An outline of the directory structure containing the results is shown below:
results/
`-- test/ (name derived from the input fasta sequence file)
    |-- test.screening.analysis.stats.sorted.by.alignment.txt (combined metadata and analysis report generated after processing all the SRA run accessions, sorted in decreasing order of total alignment percentage)
    |-- metadata/
    |   |-- test.metadata.txt (combined metadata downloaded from SRA)
    |   |-- test.metadata.screened.txt (list of SRA accessions that meet the filter criteria specified in the config)
    |   |-- SRA_RUN.run.metadata.txt (unprocessed metadata on a single SRA accession as retrieved from NCBI)
    |-- reference/
    |   |-- blastn_db/ (folder containing the blast database created from the input fasta sequence)
    |   |-- bowtie2_index/ (folder containing the bowtie index created from the input fasta sequence)
    |   |-- bowtie2_index.stdout.txt (stdout captured from bowtie2 index creation)
    |   `-- makeblastdb.stdout.txt (stdout captured from blastn database creation)
    `-- screening_results/ (similar structure for screening or full mode)
        |-- SRA_RUN/ (each SRA run accession is processed into a separate folder)
        |   |-- blastn/
        |   |   |-- SRA_RUN.blast.results.txt (output from NCBI Blastn)
        |   |   `-- blast.stats.txt (blastn overall alignment stats)
        |   |-- bowtie2/
        |   |   |-- SRA_RUN.bam (output from bowtie2)
        |   |   |-- alignment.stats.txt (bowtie2 stdout)
        |   |   `-- alignment.txt (bowtie2 overall alignment summary)
        |   |-- fastQC/
        |   |   |-- <Raw data FastQC report>
        |   |   |-- <Adapter trimmed FastQC report>
        |   |-- kraken2/
        |   |   |-- SRA_RUN.kraken (kraken2 standard classification table)
        |   |   |-- SRA_RUN.report (kraken2 classification report)
        |   |   `-- SRA_RUN.stdout.txt (kraken2 stdout)
        |   |-- raw_fastq/
        |   |   |-- <Downloaded single end or paired end fastq file(s)>
        |   |   |-- fastq_dump.stdout.txt
        |   |   |-- sra/
        |   |   `-- wget.full.sra.stdout.txt
        |   `-- trimmed_data/
        |       |-- <Adapter trimmed single end or paired end fastq file(s)>
        |       `-- SRA_RUN_trim_stdout_log.txt (trimmomatic stdout)
        `-- runlog.SRA_RUN.txt (complete run log of the pipeline per SRA run accession)
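Given the layout above, the run accessions that have already been processed can be listed from the per-run log files. This is a sketch that relies only on the runlog.SRA_RUN.txt naming shown in the outline. The helper name is hypothetical, and the results path is whatever was passed to --output.

```bash
# Hypothetical helper (not part of ARA): print the run accessions that have a
# run log (runlog.<accession>.txt) anywhere under the given results directory.
list_processed_runs() {
    find "$1" -type f -name 'runlog.*.txt' 2>/dev/null |
        sed -e 's|.*/runlog\.||' -e 's|\.txt$||' |
        sort -u
}

# Usage (path is an example):
#   list_processed_runs results/
```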
For a thorough understanding of the results of the third-party tools, take a look at the following documentation:
The table below provides a summary of the disk usage for different analyses conducted on varying dataset sizes. It demonstrates how disk usage can increase depending on the choice of the fraction of the dataset the user wishes to analyze.
RUN ACCESSION | 100% of dataset | 5% of dataset | 10% of dataset |
---|---|---|---|
SRR8392720 | 1.3G | 85M | 156M |
SRR7289585 | 1.4G | 150M | 288M |
SRR12548227 | 15M | 9.0M | 9.1M |
Sizes are reported in megabytes (M) or gigabytes (G).
- Errors related to mamba/conda environment:
  Since mamba is a drop-in replacement and uses the same commands and configuration options as conda, it's possible to swap almost all commands between conda and mamba.
  Use the conda list command to verify whether the packages mentioned in requirements.yaml are successfully installed into your environment.
  Note: The requirements.yaml provided in this package was exported from a mamba 0.25.0 installation running on Ubuntu 20.04.4 LTS.
  In case of any missing tool or conflicting dependencies in the environment, the user can try the conda search <tool name> or mamba repoquery search <tool name> command to find the supported version of the tool, and then install it manually by typing conda install <tool name> or mamba install <tool name> inside the environment. Please refer to the official troubleshooting guide for further help.
  Note: On macOS and Linux, the supported tools and their dependencies aren't always the same. Even when all of the requirements are completely aligned, the set of available versions isn't necessarily the same. Users may try setting up the environment using any of the supplementary requirements-*.txt files provided in the src/main/resources/ directory.
- Error installing Perl modules:
  Users must ensure that they have write permission to the /Users/*/.cpan/ or similar directory, and that CPAN is properly configured.
  You might need to define the PERLLIB/PERL5LIB environment variable if you see an error similar to the following:
  Can't locate My/Module.pm in @INC (@INC contains: ... ... .). BEGIN failed--compilation aborted.
  Note about MAKE: 'make' is an essential tool for building Perl modules. Please make sure that you have 'make' installed on your system. The setup script provided in this package uses 'cpan' to build the required Perl modules automatically.
  If the automatic setup provided in the package fails to install the required dependencies, you may need to install them manually with the command cpan install <module name> or by searching for the package on MetaCPAN.
  Additionally, some Perl modules can also be installed through mamba (e.g. the compatible version of the Perl module Config::Simple can be searched for with mamba repoquery search perl-config-simple).
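When narrowing down an installation problem, it can help to list exactly which executables and Perl modules are missing from the active environment. The helpers below are a sketch and not part of ARA. The executable names in the usage comment are assumptions based on the tools named in this README and may differ from the names recorded in your conf.txt.

```bash
# Hypothetical helpers (not part of ARA) for a quick post-install check.

# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Print every Perl module from the argument list that fails to load.
missing_perl_modules() {
    for module in "$@"; do
        perl -M"$module" -e 1 >/dev/null 2>&1 || echo "$module"
    done
}

# Usage, inside the activated environment (executable names are assumptions):
#   missing_tools blastn bowtie2 kraken2 fastqc trimmomatic
#   missing_perl_modules Config::Simple Parallel::ForkManager Log::Log4perl \
#       Getopt::Long Text::CSV Text::Unidecode
```

An empty result means everything in the argument list was found. Any name that is printed is a candidate for the manual installation steps described above.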
- Perl modules:
  - Config::Simple
  - Parallel::ForkManager
  - Log::Log4perl
  - Getopt::Long
  - Text::CSV
  - Text::Unidecode
- Tools:
If you use the ARA pipeline for your analysis, please cite the ARA article as follows:
Anand Maurya, Maciej Szymanski, Wojciech M Karlowski, ARA: a flexible pipeline for automated exploration of NCBI SRA datasets, GigaScience, Volume 12, 2023, giad067, https://doi.org/10.1093/gigascience/giad067
GigaDB: http://dx.doi.org/10.5524/102428
SciCrunch ID: RRID:SCR_023890
bio.tools ID: biotools:ara_automated_record_analysis
WorkflowHub.eu: 10.48546/workflowhub.workflow.546.1