A Nextflow DSL2 pipeline for the identification, quantification and analysis of ncRNAs (especially tRNA fragments) in small/miRNA-seq datasets
Many ncRNA identification pipelines rely on differential gene expression. This approach is useful but can fall short when portions of a differentially expressed ncRNA are degraded or absent, which reduces the power to detect differential expression. tsRNAsearch addresses this by combining a standard differential expression analysis with four additional methods for identifying fragmented ncRNAs, especially tRNA fragments.
Run the commands below. They create a conda environment and install the required tools inside it.
NOTE: please use conda 4.6 or greater
git clone https://github.com/GiantSpaceRobot/tsRNAsearch.git
conda env create -f tsRNAsearch/environment.yml
conda activate tsrnasearch_env
Rscript tsRNAsearch/bin/InstallPackages.R
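The conda version requirement noted above can be checked before creating the environment. The following is a minimal sketch, not part of tsRNAsearch: the helper names `version_ge` and `check_conda` are hypothetical, and the comparison assumes GNU `sort -V` is available.

```shell
# version_ge A B: succeed if dotted version A is >= B.
# ASSUMPTION: GNU coreutils `sort -V` (version sort) is installed.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_conda: compare the installed conda version with the 4.6 minimum.
# `conda --version` prints a line such as "conda 4.12.0"; field 2 is the version.
check_conda() {
    installed_version="$(conda --version 2>/dev/null | awk '{print $2}')"
    if [ -z "$installed_version" ]; then
        echo "conda not found on PATH" >&2
        return 1
    fi
    if version_ge "$installed_version" "4.6"; then
        echo "conda $installed_version meets the 4.6 minimum"
    else
        echo "conda $installed_version is older than 4.6" >&2
        return 1
    fi
}
```

Calling `check_conda` before `conda env create` gives an early, readable failure instead of an error partway through environment creation.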
Install Nextflow (you may want to move the nextflow executable to a directory on your PATH):
wget -qO- https://get.nextflow.io | bash
chmod +x nextflow
Example data are supplied so you can check that the pipeline is functioning correctly. Run the following commands from outside the tsRNAsearch directory.
Usage: Single file analysis:
nextflow run tsRNAsearch --species mouse --input_dir tsRNAsearch/ExampleData --output_dir Results
Usage: Multi file analysis:
nextflow run tsRNAsearch --species mouse --input_dir tsRNAsearch/ExampleData --output_dir Results
On finishing either the single or multi-file analysis, the pipeline will populate the Results directory with results files corresponding to each sample.
Usage: Group comparison analysis:
nextflow run tsRNAsearch --species mouse --input_dir tsRNAsearch/ExampleData --output_dir Results --layout tsRNAsearch/additional-files/Example_Layout.csv
On finishing the group comparison analysis, the pipeline will populate the Results directory with results files corresponding to each sample, a PDF summarising the comparison, and a directory containing comparison files (e.g. DESeq2 results).
Example of pipeline results
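A layout file for your own data can be drafted with a short script. The sketch below is illustrative only: it assumes the layout is a two-column, header-less CSV of `filename,group`, and that the group can be taken from the filename text before the first underscore. Neither assumption is confirmed here, so compare the output with tsRNAsearch/additional-files/Example_Layout.csv before using it. The function name `make_layout` is hypothetical.

```shell
# make_layout DIR: print one "filename,group" line per FASTQ file in DIR.
# ASSUMPTION: layout format is "filename,group" with no header row.
# ASSUMPTION: group = filename text before the first underscore.
make_layout() {
    for fastq_path in "$1"/*.fastq "$1"/*.fastq.gz "$1"/*.fq "$1"/*.fq.gz; do
        [ -e "$fastq_path" ] || continue      # skip patterns that matched nothing
        fastq_name="$(basename "$fastq_path")"
        printf '%s,%s\n' "$fastq_name" "${fastq_name%%_*}"
    done
}

# Example: make_layout MyData > My_Layout.csv
```

Grouping by filename prefix is only a convenience; edit the second column by hand if your file names do not encode the experimental group.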
- Slope algorithm
- Fisher's method for combining p-values
- Distribution algorithm
- Cleavage algorithm
- DESeq2
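For reference, Fisher's method in its standard textbook form combines k independent p-values into a single test statistic. Which p-values the pipeline combines, and how, is not described here, so treat this as background rather than as a description of Pvalue_generator.R.

```latex
X^2 = -2 \sum_{i=1}^{k} \ln p_i ,
\qquad
X^2 \sim \chi^2_{2k} \quad \text{under the null hypothesis}
```

As a worked example, two p-values of 0.05 and 0.10 give X^2 = -2 ln(0.005) ≈ 10.60 on 4 degrees of freedom, which corresponds to a combined p-value of about 0.031.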
- Execute standard small RNA-seq steps on each dataset
  - trim_galore (cutadapt and fastqc)
    - Adapter removal, read trimming, and quality check
  - STAR
    - Read alignment to a non-coding RNA database
  - featureCounts
    - Count raw reads
- Data processing
  - Data transformations/reformatting
- Run SlopeScore.R
- Run Pvalue_generator.R
- Run DistributionScore.R
- Run CleavageScore.R
- Run DESeq2
- Generate reports (PDFs, CSVs, and text reports)
- -h Print the usage and options information
- --species Analyse data against 'human', 'mouse', or 'rat'? {default: human}
- --input_dir Directory containing the files for analysis (file formats: FASTQ or gzipped FASTQ). Directory should have no other contents.
- --output_dir Output directory for the results files
- --layout CSV file containing file names and file groups (see example in additional-files/)
- --min_read_length Minimum read length {default: 16 bp}
- --all_plots Plot all features? true/false {default: false}
- --help Print help and exit
- --version Print version and exit
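Two of the constraints in the options above (the allowed `--species` values and the requirement that the input directory contain only FASTQ files) can be checked before launching a run. This is a minimal sketch, not part of the pipeline: the function names are hypothetical, and the accepted file extensions are an assumption.

```shell
# check_species NAME: succeed only for the species values listed above.
check_species() {
    case "$1" in
        human|mouse|rat) return 0 ;;
        *) echo "unsupported species: $1" >&2; return 1 ;;
    esac
}

# check_input_dir DIR: fail if DIR holds anything other than FASTQ files.
# ASSUMPTION: accepted extensions are .fastq, .fq, .fastq.gz and .fq.gz.
# Hidden files are not inspected, and an empty directory passes the check.
check_input_dir() {
    [ -d "$1" ] || { echo "not a directory: $1" >&2; return 1; }
    dir_status=0
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue
        case "$entry" in
            *.fastq|*.fq|*.fastq.gz|*.fq.gz) ;;
            *) echo "unexpected file: $entry" >&2; dir_status=1 ;;
        esac
    done
    return "$dir_status"
}

# Example:
# check_species mouse && check_input_dir tsRNAsearch/ExampleData && \
#   nextflow run tsRNAsearch --species mouse --input_dir tsRNAsearch/ExampleData --output_dir Results
```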
There are 625 tRNA genes in the human genome. Each tRNA species is encoded by multiple genes, many of which are identical or highly similar (e.g. Arginine). Correctly matching a read to its tRNA gene of origin is therefore difficult. Two methods have been implemented to address this:
- CD-HIT was used to extract representative sequences from all ncRNA sequences. The ncRNA STAR database was built using these representative sequences.
- The STAR output (SAM file) is processed so that all reads mapping to multiple database sequences from the same ncRNA species (e.g. a read that maps to multiple Proline-CCC genes) are collapsed into a single SAM entry. This entry is then treated as a single-mapping read and is not removed by downstream processing steps.
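The collapsing step described in the second point can be illustrated with a small awk sketch. This is not the pipeline's own code. It assumes that gene copies of one ncRNA species are named `<species>-<number>` in the reference name column (for example `Pro-CCC-1` and `Pro-CCC-2`), which is a hypothetical naming scheme, and it simply drops reads that map to more than one species instead of leaving them for a later filter. The function name `collapse_multimappers` is also hypothetical.

```shell
# collapse_multimappers FILE.sam: keep one SAM entry per read when all of
# its alignments hit the same ncRNA species.
# ASSUMPTION: reference names (column 3) look like "<species>-<copy number>".
# Sketch choice: reads mapping to several species are dropped here.
collapse_multimappers() {
    awk -F'\t' '
        /^@/ { print; next }                      # pass header lines through
        {
            species = $3
            sub(/-[0-9]+$/, "", species)          # strip the gene copy number
            if (!($1 in first_hit)) {
                first_hit[$1] = $0                # remember the first alignment
                read_species[$1] = species
                read_order[++n_reads] = $1        # preserve input order
            } else if (read_species[$1] != species) {
                multi_species[$1] = 1             # read spans several species
            }
        }
        END {
            for (i = 1; i <= n_reads; i++)
                if (!(read_order[i] in multi_species))
                    print first_hit[read_order[i]]
        }
    ' "$1"
}
```

Keeping only the first alignment is the simplest way to produce a single entry per read; a fuller version would probably also need to update the alignment's multi-mapping tags so that downstream tools treat it as uniquely mapped.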
You can add any species you want with relatively little effort. Click here for more information.
You can test if the R packages are installed correctly in the conda environment using:
# You must be in the conda env to run this
Rscript tsRNAsearch/bin/TestPackages.R
This will produce errors if any of the required packages cannot be found. If so, you may need to install the missing packages manually.
- Paul Donovan, PhD
- Natalie McHale
This project is licensed under the MIT License.