miRSeq data alignment workflow - integrates genetic information to assess the impact of variants on miRNA expression
Link to publication
The following python libraries must be installed: pysam and biopython (available through pip
).
The following programs must be installed: bowtie2, cutadapt, samtools. If the path of any of these programs is not in $PATH
, you can provide it directly to optimir (see optimir process -h
and optimir libprep -h
).
As of nov. 2018 OptimiR has been only used (successfuly) in a GNU/Linux environment. Works fine with python2 and python3.
You can install OptimiR on your system using the following:
python setup.py install
If you need to install locally, you can use the --prefix
parameter.
optimir process --fq /path/to/sample.fq(.gz) \
--dir_output /path/to/output/dir/ \
--vcf /path/to/genotypes.vcf (optional)
Different options are available to customize the reference library and the outputs. Use optimir process -h
to see all options. The user can provide a vcf file (and is recommanded to do so) with a list of variants, OptimiR will extract variants located on miRNAs and generate new alternative sequences of these miRNAs (a.k.a polymiRs) integrating the alternative version of the variant.
In dir_output, the following directories are created:
OptimiR_lib
: contains the alignment index generated by OptimiR, as well as a pickle directory containing python object (which allow the library to not be regenerated at each run if the inputed files did not change), and a fasta file which contains all the miRs and polymiRs sequences generated by OptimiROptimiR_tmp
: contains temporary files generated during OptimiR run (files after trimming, collapsing and mapping steps, as well as cutadapt and bowtie2 reports). The final alignment file is available in the directory3_PostProcess
in bam format. If--rmTempFiles
option is given to optimir, this directory will be deleted. Tip : If OptimiR is run on the same sample more than once andOptimiR_tmp
not deleted, the sample is not trimmed again, which saves computation time (except if option--trimAgain
is provided).OptimiR_Results
: contains the abundances tables for each miRs and polymiRs, and detailed counts of isomiRs. If genotypes are provided, a consistency table is also generated and indicate the consistency between the alignment on a polymiR and the genotypes of the variants involved. Annotation files are also generated to provide information on ambiguous alignments (reads mapping on several references) that could not be resolved by OptimiR, and another file gives an indication on the parental hairpins from which the reads are originating (computed from templated tailed nucleotides).
The vcf file is optional. All variants provided in the vcf file that are mapping on miRNAs will be integrated in new sequences generated by OptimiR. Reads will be allowed to align to these new alternative mIRNA sequences.
The genotype can also be provided by the user in the vcf file, in which case OptimiR will perform a genotype consistency step and retain only alignments on polymIRs that are consistent with the genotype. The sample name in the vcf file must match the name of the provided fastq (minus the extension of the fastq file)
When a vcf with genotypes is provided, 2 tables : consistency_table
and polymiRs_table
are generated in the outputs and contain summary alignments on each polymiR. In the polymiR table, a detail of alignments on each polymiR for each sample is provided, with the number of reads aligned on the sequences integrating the reference allele and alternative allele.
In the consistency table, consistent counts are reads that aligned on the polymiR sequence with the allele that match the genotype, while inconsistent counts are alignments that have been discarded because they didn't match the genotype. For samples with heterozygous genotype, reads can align on any version of the polymiR, in which case the Allele Specific Expression (ASE) of these polymiRs can be studied.
- Multi-allelic variants must be provided on two separate lines. Allele string must not contain a comma, and genotype format must have format 0/0, 0/1, 1/0 or 1/1 but no 0/2 and such... (btw: also works with phased variants format).
- Combination of indels, or indels and SNPs: if a SNP and an indel map on the same miRNA, or if several indels map on the same miRNA, combinations of these variants might not be generated, in which case a warning will be displayed during library preparation. Combinations of SNPs works fine.
- Variant positions must respect GRCh38 coordinates (when using miRBase 21 default gff3 file).
Two R scripts are available in the directory optimir/libs/R_plot/
: plot_ASE.R
and plot_isoDist.R
. These scripts are meant to be run on the files polymiRs_table.annot
and isomiRs_dist.annot
, respectively.
For polymiRs integrating genetic variants expressed by heterozygous carriers, both sequences integrating the reference or the alternative allele can be expressed. It might be of interset for the user to analyse the ASE of these polymiRs. For this we provide the script plot_ASE.R
that automatically gather counts on polymiRs expressed by heterozygotes, and generate a scatter plot representing for each sample the reads aligned on the polymiR integrating the reference allele vs the counts gathered by the sequence integrating the alternative allele.
miRNAs can undergo post transcriptional modifications that can affect the sequence of the mature miRNA. These modifications mostly affect the 3' extremity of the miRNA, and on a smaller scale the 5' end. A miRNA that does not appear to have undergone such modifications is called canonical, while miRNAs that are not canonical are isomiRs. The modifications observed are on each end of the isomiR can be:
- Trimming (Trim): a deletion of one or more nucleotides.
- Templated tailing (TA) : the addition of one or more nucleotides, that matches the nucleotides that originally surround the mature miRNA sequence in the primary miRNA.
- Non templated tailing (NTA) : the addition of one or more nucleotides, that does not match the nucleotides that originally surround the mature miRNA.
- Trimming followed by tailing (Trim+Tail): The deletion of one or more nucleotides followed by the addition of one or more nucleotides.
- Canonical : no modification.
The script plot_isoDist.R
allows the user to represent the distribution of these modifications in the processed dataset. By default, only the global distribution accross all miRNA is generated. The script can be easily tweaked by the user if there is a need to represent the distribution of individual miRNA.
All the outputs are in tsv format, and easy to process for downstream analyses.
First column contains miRNA ids, second column contains sample's counts. For each sample, 4 abundance files are generated:
miRs_and_polymiRs_abundances
: counts gathered by each miRNA. PolymiRs have counts detailed on each sequence (integrating reference/alternative allele). The sequences that integrate an alternative allele is written as follow : miR_name +'_'
+ rsID. For example the polymiR hsa-miR-4433b-3p integrating the alternative allele forrs12473206
has its counts accessible with the idhsa-miR-4433b_rs12473206
, while counts gathered by the sequence integrating the reference allele are on the line with the idhsa-miR-4433b-3p
.miRs_merged_abundances
: Same as above, except that counts of each polymiRs are merged under a single id. For example, counts gathered byhsa-miR-4433b-3p
andhsa-miR-4433b-3p_rs12473206
are summed and available with the idhsa-miR-4433b-3p
.polymiRs_specific_abundances
: same asmiRs_and_polymiRs_abundances
except that only polymiRs are presented in this table.isomiRs_specific_abundances
: counts gathered by each isomiR. Each isomiR is represented according to the isotype format (see below).
ISOTYPE FORMAT: miRNA-ID~[5'end modifications, 3'end modifications]
--------------
Modifications description:
+N : templated tailing with nucleotide N
+n : non templated tailing with nucleotide N
-x : trimming of x nucleotides
Example of alignments on hsa-miR-4433b-3p, with their isotype
ccagaaaCAGGAGTGGGGGGTGGGACGTaaggagg : reference mature sequence, with surrounding nucleotides in pri-miRNA
---------------------
mature miRNA
AACAGGAGTGGGGGGTGGGACGT :hsa-miR-4433b-3p~[+AA,]
ACAGGAGTGGGGGGTGGGACGT :hsa-miR-4433b-3p~[+A,]
ACAGGAGTGGGGGGTGGGACG :hsa-miR-4433b-3p~[+A,-1]
CAGGAGTGGGGGGTGGGACGTA :hsa-miR-4433b-3p~[,+A]
CAGGAGTGGGGGGTGGGACGTt :hsa-miR-4433b-3p~[,+t]
CAGGAGTGGGGGGTGGGACGT :hsa-miR-4433b-3p~[,]
CAGGAGTGGGGGGTGGGACG :hsa-miR-4433b-3p~[,-1]
CAGGAGTGGGGGGTGGGACt :hsa-miR-4433b-3p~[,-2+t]
CAGGAGTGGGGGGTGGGAC :hsa-miR-4433b-3p~[,-2]
For each samples several complementary files can be generated (the user can control which ones with option --annot
):
consistency_table
: For each polymiR, reports all counts that are consistent with genotype, and the number of reads that were discarded because of inconsistency. If more than 5 counts or 1% of total reads were discarded, the flag "SUSPICIOUS" is added in theflag
column (the 0.01 default rate can be modified by the user with the option--consistentRate
). This threshold is set to detect misalignments on polymiRs that are not due to sequencing errors that might replicate a genetic variant.polymiR_table
: For each polymiR, report the number of reads gathered by the sequence integrating the reference allele, and the number of reads aligned on the sequence integrating the alternative allele.isomiRs_dist
: For each miRNA / polymiR, report the percentage of modifications on each end, the global percentage of canonical alignments, as well as the global read counts.inconsistents.sam
: file reporting all alignments that have been discarded because of an inconsistency between the provided genotype and the alignment on a polymiR.expressed_hairpins
: file reporting the primary miRNA and thus the locus from which a miRNA is transcribed. In miRBase, mature miRNAs with identical sequence can originate from different hairpins, and thus different genomic loci. When a read is aligned on such miRNAs with templated tailing, and if the nucleotides surrounding the mature miRNA are different in the potential parental hairpins, we use this information to deduce from which hairpin the miRNA might originate. When using this information, the user must be aware that templated tailing might not be due to imprecise cleavage from ribonucleases during the miRNA maturation, but could be due to terminal nucleotidyl transferase that add the same nucleotides to the mature miRNA end than those originally surrounding it in the hairpin. The reporting format of this information is detailed below.remaining_ambiguous
: When a read align on multiple references, we call it ambiguous, or cross-mapping. OptimiR has an internal scoring algorithm that allow it to choose the most probable alignment when possible. If tipically ~90% of ambiguous alignments are resolved, the remaining ones that could not be resolved are detailed in this file. For the remaining ambiguous, if a read aligns on n references, then each alignment receive a weight of 1/n. The reporting format of this information is detailed below.
Example:
-------
REFERENCE SAMPLE Expressed_hairpins Remaining_ambiguous
hsa-let-7a-3p 525 multiple[100.00%]
hsa-let-7a-5p 471517 hsa-let-7a-2[0.02%]/multiple[99.69%]/hsa-let-7a-3[0.18%]/hsa-let-7a-1[0.10%] hsa-let-7c-5p+hsa-let-7b-5p[2278(0.48%)]/hsa-let-7c-5p[27534(5.84%)]
hsa-let-7b-5p 3842888 hsa-let-7b[100.00%] hsa-let-7c-5p+hsa-let-7a-5p[2278(0.06%)]
hsa-let-7c-5p 51910 hsa-let-7c[100.00%] hsa-let-7a-5p[27534(53.04%)]/hsa-let-7a-5p+hsa-let-7b-5p[2278(4.39%)]
hsa-let-7f-5p 439810 hsa-let-7f-1[0.36%]/hsa-let-7f-2[24.53%]/multiple[75.11%]
Expressed_hairpins format: potential-hairpin-id[percentage of reads with templated tailing matching potential-hairpin-id]
If a miRNA has only one hairpin from which it originate, then it will always have 100% of its reads originating
from it. Ex: miRNA-hairpin-id[100%]
If a miRNA has more than one potential hairpin, each hairpin with matching templated tailing is reported, with
the percentage of total aligned reads reported in brackets.
If several hairpins are reported, they are separated with a '/'
symbol.
For alignments where the parental hairpin cannot be distinguished using templated tailing, these are reported
as 'multiple'
.
In the example, reads aligned on hsa-let-7f-5p
have 0.36% of aligned reads that are potentially originating
from the hairpin hsa-let-7f-1
, 24.53% from hsa-let-7f-2
, and the remaining 75.11% could originate from any of these
two hairpins.
We report here other matures miRNA on which reads ambiguously aligned. These alignments could not be resolved
using OptimiR's scoring algorithm to determine the most probable alignment.
If a read cross mapped on more than one other miRNA, then they are all reported and separated with a '+'
symbol.
In the example, 5.84% of reads aligned aligned on hsa-let-7a-5p
(which amount to 27,534 reads) also mapped
on hsa-let-7c-5p
. And 2,278 reads (or 0.48% of reads aligned to hsa-let-7a-5p
) aligned simulteanously
to hsa-let-7a-5p
, hsa-let-7b-5p
and hsa-let-7c-5p
.
This happens a lot to reads mapping on let-7
mature miRNAs, because their sequences are very similar.
The mirGFF3 format is a recently developed standard used in miRseq analysis (see the description of this format on the GitHub page). It should be noted that the current OptimiR output in mirGFF3 format will give mature miRNA ident as seqID, and might not be supported by mirtop yet. One can use the attribute optimiR_Expression to retrieve abundances computed by OptimiR in this format.
The VCF output file can be generated when variants are provided by the user in an initial VCF without genotype information. In such situations, using the option --vcf_out
, a VCF containing genotypes deduced by OptimiR on polymiRs will be produced. Be careful when using the results of this output, OptimiR is not a variant caller, these genotypes only reflect the expression level of both alleles of a polymiR, and do not necessarly reflect the DNA's genotype, as allelic imbalance is frequently observed for polymiRs (see OptimiR publication) results on allelic imbalance more additional informations).
If several samples must be processed, it is recommanded to provide the same output directory for all runs. The python script optimir_SUMMARY
with the output directory as a parameter will produce summary files for all runs (abundances tables and additional files).
Usage: optimir summarize --dir /path/to/Results/
We provide 2 very small fastq samples and a vcf file with OptimiR, available in the directory example
. In the main directory, 2 scripts allow the user to run OptimiR on these examples, either on a single sample (example_1.sh
) or on both samples sequentially (example_all.sh
).
To run the script, use ./example_1.sh
or ./example_all.sh
The script example_all.sh
has 3 steps:
- Prepare the alignment library ahead with the provided vcf. OptimiR will rely on this library to quantify reads from the 2 samples. Their genotypes for a set of variants loacted on miRNAs are included in the vcf file (
genotypes.vcf
). This step is mandatory when processing samples in parallel. - Run optimir sequentially on both samples. Since the library is already prepared and available in the provided output directory, OptimiR won't prepare it again and will align and quantify miRNAs using this library.
- Run optimir_SUMMARY on the output directory, in order to gather the results from both samples in single files (abundances and annotation tables).
optimir -h
will give:
usage: optimir [-h] {process,summarize,libprep} ...
OptimiR: A bioinformatics pipeline designed to detect and quantify miRNAs,
isomiRs and polymiRs from miRSeq data, & study the impact of genetic
variations on polymiRs' expression
optional arguments:
-h, --help show this help message and exit
OptimiR subcommands:
{process,summarize,libprep}
process Process a single fastq file with the OptimiR workflow
summarize Summarize results of previously processed samples
libprep Only prepare small RNA alignment library sequences
that integrates genetic variants (required before
processing multiple samples in parallel)
optimir process -h
will produce:
usage: optimir process [-h] -i FASTQ [-o OUTDIR] [-g VCF] [--seedLen SEEDLEN]
[--w5 WEIGHT5] [--scoreThresh SCORE_THRESH]
[--consistentRate INCONSISTENT_THRESHOLD]
[--rmTempFiles] [--annot ANNOT_FILES] [--gff_out]
[--vcf_out] [--adapt3 ADAPT3] [--adapt5 ADAPT5]
[--readMin READMIN] [--readMax READMAX]
[--bqThresh BQTHRESH] [--trimAgain]
[--maturesFasta MATURES] [--hairpinsFasta HAIRPINS]
[--gff3 GFF3] [--quiet] [--cutadapt CUTADAPT]
[--bowtie2 BOWTIE2] [--bowtie2_build BOWTIE2_BUILD]
[--samtools SAMTOOLS]
optional arguments:
-h, --help show this help message and exit
-i FASTQ, --fq FASTQ Full path of the sample fastq file (accepted formats
and extensions: fastq, fq and fq.gz)
-o OUTDIR, --dirOutput OUTDIR
Full path of the directory where output files are
generated [default: ./OptimiR_Results_Dir/]
-g VCF, --vcf VCF Full path of the vcf file with genotypes
--seedLen SEEDLEN Choose the alignment seed length used in option '-L'
by Bowtie2 [default: 17]
--w5 WEIGHT5 Choose the weight applied on events detected on the 5'
end of aligned reads [default: 4]
--scoreThresh SCORE_THRESH
Choose the threshold for alignment score above which
alignments are discarded [default: 9]
--consistentRate INCONSISTENT_THRESHOLD
Choose the rate threshold for inconsistent reads
mapped to a polymiR above which the alignment is
flagged as highly suspicious [default: 0.01]
--rmTempFiles Add this option to remove temporary files (trimmed
fastq, collapsed fastq, mapped reads, annotated bams
--annot ANNOT_FILES Control which annotation file is produced by adding
corresponding letter : 'h' for expressed_hairpins, 'p'
for polymiRs_table, 'i' for consistency_table, 'c' for
remaining_ambiguous, 's' for isomiRs_dist. Ex: '--
annot hpics' [default] will produce all of them
--gff_out Add this option to generate results in mirGFF3 format
[default : disabled]
--vcf_out Add this option to generate results in VCF format
[default : disabled]
--adapt3 ADAPT3 Define the 3' adaptor sequence (default is NEB &
ILLUMINA: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -a
TGGAATTCTCGGGTGCCAAGG -> hack: use -a to add adapter
sequences)
--adapt5 ADAPT5 Define the 5' adaptor sequence [default: None]
--readMin READMIN Define the minimum read length defined with option -m
in cutadapt [default: 15]
--readMax READMAX Define the maximum read length defined with option -M
in cutadapt [default: 27]
--bqThresh BQTHRESH Define the Base Quality threshold defined with option
-q in cutadapt [default: 28]
--trimAgain Add this option to trim files that have been trimmed
in a previous application. By default, when temporary
files are kept, trimmed files are reused. If you wish
to change a paramater used in the trimming step of the
workflow, this parameter is a must [default: disabled]
--maturesFasta MATURES
Path to the reference library containing mature miRNAs
sequences [default: miRBase 21]
--hairpinsFasta HAIRPINS
Path to the reference library containing pri-miRNAs
sequences [default: miRBase 21]
--gff3 GFF3 Path to the reference library containing miRNAs and
pri-miRNAs coordinates [default: miRBase v21, GRCh38
coordinates]
--quiet Add this option to remove OptimiR progression on
screen [default: disabled]
--cutadapt CUTADAPT Provide path to the cutadapt binary [default: from
$PATH]
--bowtie2 BOWTIE2 Provide path to the bowtie2 binary [default: from
$PATH]
--bowtie2_build BOWTIE2_BUILD
Provide path to the bowtie2 index builder binary
[default: from $PATH]
--samtools SAMTOOLS Provide path to the samtools binary [default: from
$PATH]
optimir summarize -h
will produce:
usage: optimir summarize [-h] --dir DIR
optional arguments:
-h, --help show this help message and exit
--dir DIR Full path of the directory containing results
optimir libprep -h
will produce:
usage: optimir libprep [-h] [-v VCF] [--maturesFasta MATURES]
[--hairpinsFasta HAIRPINS] [--gff3 GFF3] [-o OUTDIR]
[--bowtie2_build BOWTIE2_BUILD]
optional arguments:
-h, --help show this help message and exit
-v VCF, --vcf VCF Full path of the input VCF file.
--maturesFasta MATURES
Path to the reference library containing mature miRNAs
sequences [default: miRBase 21]
--hairpinsFasta HAIRPINS
Path to the reference library containing pri-miRNAs
sequences [default: miRBase 21]
--gff3 GFF3 Path to the reference library containing miRNAs and
pri-miRNAs coordinates [default: miRBase v21, GRCh38
coordinates]
-o OUTDIR, --dirOutput OUTDIR
Full path of the directory where output files are
generated [default: ./OptimiR_Results_Dir/]
--bowtie2_build BOWTIE2_BUILD
Provide path to the bowtie2 index builder binary
[default: from $PATH]
-
31/03/19 :
- Integrate scripts and add subcommands: process, summarize, libprep
- Added setup
- Fixed miGFF3 format
-
02/03/19 :
- Supports outputs in mirGFF3 format (see the mirtop/mirGFF3 GitHub page for more details on this format and the mirtop project)
- Supports outputs in VCF format summarizing the genotypes inferred by OptimiR