Skip to content

Bash wrapper modules for the remote analysis of RNA-Seq data, with persistency features.

License

Notifications You must be signed in to change notification settings

TCP-Lab/x.FASTQ

Repository files navigation

x.FASTQ

$ x.fastq        _____   _      ____   _____   ___
    __  __      |  ___| / \    / ___| |_   _| / _ \
    \ \/ /      | |_   / _ \   \___ \   | |  | | | |
     >  <    _  |  _| / ___ \   ___) |  | |  | |_| |
    /_/\_\  (_) |_|  /_/   \_\ |____/   |_|   \__\_\
             Bash modules for the remote analysis of
                                        RNA-Seq data

x.FASTQ is a suite of Bash scripts that wrap original and third-party software with the purpose of making RNA-Seq data analysis more accessible and automated.

First Principles and Background

x.FASTQ was originally written for the Endothelion project with the intention of abstracting our standard analysis pipeline for NGS transcriptional data. The main idea was to make the whole procedure more faster, scalable, but also affordable for wet collaborators without a specific bioinformatics background, even those possibly operating from different labs or departments. To meet these needs, we designed x.FASTQ to have the following specific features:

  • Remote operability: Given the typical hardware requirements needed for read alignment and transcript abundance quantification, x.FASTQ is assumed to be installed just on one or few remote servers accessible to all collaborators via SSH. Each x.FASTQ module, launched via CLI as a Bash command, will run in the background (&) and persistently (via nohup) so that the user is not bound to keep the connection active for the entire duration of the analysis, but only for job scheduling.
  • Standardization: Most x.FASTQ scripts are wrappers of lower-level applications commonly used as standard tools in RNA-Seq data analysis and widely appreciated for their performance (e.g., FastQC, BBDuk, STAR, RSEM).
  • Simplification: Scripts expose a limited number of options by making extensive use of default settings (suitable for the majority of standard RNA-Seq analyses) and taking charge of managing input and output data formats and their organization.
  • Automation: All scripts are designed to loop over sets of target files properly stored within the same directory.
  • Completeness: The tools provided by x.FASTQ allow for a complete workflow, from raw reads retrieval to count matrix generation.
  • No bioinformatics skills required: Each x.FASTQ module comes with an --help option providing extensive documentation. The only requirement for the user is a basic knowledge of the Unix shell and a SSH client installed on its local machine.
  • Reproducibility: although (still) not containerized, each x.FASTQ module is tightly versioned and designed to save detailed log files at each run. Also, utility functions are available to print complete version reports about x.FASTQ modules and dependencies (i.e., x.fastq -r and -d options, respectively).

Modules

Overview

x.FASTQ currently consists of 7 modules designed to be run directly by the end-user, each one of them addressing a precise step of a general pipeline for RNA-Seq data analysis, which goes from the retrieval of raw reads to the generation of the expression matrix.

  1. x.FASTQ is a cover-script that performs some general-utility tasks, such dependency check, symlink creation, and generation of version and disk usage reports;
  2. getFASTQ allows local downloading of NGS raw data in FASTQ format from ENA database;
  3. trimFASTQ uses BBDuk (from the BBTools suite) to remove adapter sequences and perform quality trimming;
  4. anqFASTQ uses STAR and RSEM to align reads and quantify transcript abundance, respectively;
  5. qcFASTQ is an interface for multiple quality-control tools, including FastQC and MultiQC;
  6. countFASTQ assembles counts from multiple samples/runs into a single TSV expression matrix, choosing among multiple metrics (TPM, FPKM, RSEM expected counts) and levels (gene or isoform); optionally, it injects experimental design information into the matrix heading and appends annotations regarding gene symbol, gene name, and gene type (Ensembl gene/transcript IDs are required for annotation);
  7. metaharvest fetches sample and series metadata from GEO and/or ENA databases, then it parses the retrieved metadata and saves a local copy of them as a CSV-formatted table (useful for both documentation and subsequent x.FASTQ analysis steps).

In addition, x.FASTQ includes a number of auxiliary scripts (written in Bash, R, or Python) that are not meant to be directly run by the end user, but are called by the main modules. Most of them are found in the workers subfolder.

  1. x.funx.sh contains variables and functions that need to be shared among (i.e., sourced by) all x.FASTQ modules;
  2. trimmer.sh is the actual trimming script, wrapped by trimFASTQ;
  3. assembler.R implements the matrix assembly procedure required by countFASTQ;
  4. pca_hc.R implements Principal Component Analysis and Hierarchical Clustering of samples as required by the qcfastq --tool=PCA ... option;
  5. fuse_csv.R is used by metaharvest to merge the cross-referenced metadata downloaded from both GEO and ENA databases;
  6. parse_series.R is used by metaharvest to extract metadata from a GEO-retrieved SOFT formatted family file;
  7. re_uniq.py is used to reduce redundancy when STAR and RSEM logs are displayed in console as anqFASTQ progress reports.

Common Features and Options

All suite modules enjoy some internal consistency:

  • upon running x.fastq.sh -l <target_path> from the local x.FASTQ repository directory, each x.FASTQ module can be invoked from any location on the remote machine using its fully lowercase name (provided that <target_path> is already included in $PATH);
  • each script launches in the background a persistent job (or a queue of jobs), the main statement of each module typically being a line of the form
    # MAIN STATEMENT
    nohup command_with_args >> "$log_file" 2>&1 &

Tip

As per the Unix shell:

  • nohup (no hangups) allows processes to keep running even upon user logout, as, for instance, when exiting an SSH session;
  • >> allows output to be redirected (and appended) somewhere other than the default ./nohup.out file;
  • 2>&1 is to redirect both standard output and standard error to the same location (i.e., the log file);
  • & at the end of the line, is to run the command in the background and get the shell prompt active again.
  • each module (except x.FASTQ and metaharvest) saves its own log file in inside the experiment-specific target directory using a common filename pattern, namely
    Z_<ScriptID>_<FastqID>_<DateStamp>.log
    Z_<ScriptID>_<StudyID>_<DateStamp>.log
    
    for sample-based or series-based logs, respectively (the leading 'Z_' is just to get all log files at the bottom of the list when ls -l);

Important

In the current implementation of x.FASTQ, filenames are very meaningful!

Each FASTQ file is required to have a name matching the regex pattern

^[a-zA-Z0-9]+[^a-zA-Z0-9]*.*\.fastq\.gz

i.e., beginning with an alphanumeric ID (usually an ENA run ID of this type (E|D|S)RR[0-9]{6,}) immediately followed by the extension .fastq.gz or separated from the remaining filename by an underscore or some other characters not in [a-zA-Z0-9]. Valid examples are GSM34636.fastq.gz, SRR19592966_1.fastq.gz, etc. This leading ID will be propagated to the names of the log files printed by the getFASTQ module and BBDuk (saved in Trim_stats subdirectory), as well as to all output files from FastQC (stored in FastQC_* subdirectories) and STAR/RSEM (i.e., Counts subfolders and all files contained therein). countFASTQ will then assume each RSEM output file being saved into a sample- or run-specific subdirectory, whose name will be used for count matrix heading. Similarly, but at a lower level, even MultiQC needs each STAR and RSEM output to be properly prefixed with a suitable sample or run ID to be correctly accounted for. Notice, however, that all this should occur spontaneously if FASTQs are downloaded from ENA database using the getFASTQ module.

In contrast, it is important for the user to manually name each project folder (i.e., each directory that will contain the entire set of FASTQ files from one single experiment) with a name uniquely assigned to the study (typically the related GEO Series ID GSE[0-9]+ or the ENA Project accession PRJ(E|D|N)[A-Z][0-9]+). Log files created by most of the x.FASTQ modules rely on the name of the target directory for the assignment of the StudyID label, and the same holds for the MultiQC HTML global report and the file name of the final expression matrix.

  • some common flags keep the same meaning across all modules (even if not all of them are always available):
    • -h | --help to display the script-specific help;
    • -v | --version to display the script-specific version;
    • -q | --quiet to run the script silently;
    • -p | --progress to see the progress of possibly ongoing processes;
    • -k | --kill to gracefully terminate possibly ongoing processes;
    • -a | --keep-all not to delete intermediate files upon script execution;
  • all modules are versioned according to the three-number Semantic Versioning system. x.fastq -r can be used to get a version report of all scripts along with the summary version of the whole x.FASTQ suite;
  • if -p is not followed by any other arguments, the script searches the current directory for log files from which to infer the progress of the latest namesake task;
  • with the -q option, scripts do not print anything on the screen other than possible error messages that stop the execution (i.e., fatal errors); logging activity is never disabled, though.

Usage

Assuming you have identified a study of interest from GEO (e.g., GSE138309), have already created a project folder somewhere (mkdir '<anyPath>'/GSE138309), and moved in there (cd '<anyPath>'/GSE138309), here are a couple of possible example workflows.

Important

Given the hard-coded background execution of most x.FASTQ modules, it is not possible (at present) to use these command sequences for creating automated pipelines. On the contrary, before launching each module, it is necessary to ensure that the previous one has successfully terminated.

Minimal Workflow

# Download FASTQs, align, quantify, and assemble a gene-level count matrix
getfastq -u GSE138309 > ./GSE138309_wgets.sh
getfastq GSE138309_wgets.sh
anqfastq .
countfastq .

Complete Workflow

# Download FASTQs in parallel and fetch GEO-ENA cross-referenced metadata
getfastq --urls GSE138309 > ./GSE138309_wgets.sh
getfastq --multi GSE138309_wgets.sh
metaharvest --geo --ena GSE138309 > GSE138309_meta.csv

# Trim and QC
qcfastq --out=FastQC_raw .
trimfastq .
qcfastq --out=FastQC_trim .

# Align, quantify, and QC
anqfastq .
qcfastq --tool=QualiMap .
qcfastq --tool=MultiQC .

# Clean up
rm *.fastq.gz

# Assemble an isoform-level count matrix with annotation and experimental design
groups=(Ctrl Ctrl Ctrl Treat Treat Treat)
countfastq --isoforms --names --design="${groups[*]}" --metric=expected_count .

# Explore samples through PCA
qcfastq --tool=PCA .

Note

The study chosen here as an example features non-interleaved (i.e., dual-file) paired-end (PE) reads, however x.FASTQ also supports single-ended (SE) and interleaved PE formats. In those cases, just add, respectively, [-s | --single-end] or [-i | --interleaved] options, when running trimFASTQ and anqFASTQ modules.

Installation

As already stressed, a working SSH client is the only local software requirement for using x.FASTQ, provided it has already been installed on some remote server machines by the system administrator. The procedure for installing x.FASTQ (and all its dependencies) on the server is here documented step by step.

Cloning

Clone x.FASTQ repository from GitHub

cd ~/.local/bin/
git clone git@github.com:Feat-FeAR/x.FASTQ.git

Symlinking (optional)

Create the symlinks in some $PATH directory (e.g., ~/.local/bin/) to enable global x.FASTQ visibility and easier module invocation.

cd x.FASTQ
./x.fastq.sh -l ..

Dependencies

Install and test the following software, as required by x.FASTQ.

  • Development Environments
    • Java
    • Python
    • R
    • Bioconductor Packages
      • BiocManager
      • PCAtools
      • org.Hs.eg.db
      • org.Mm.eg.db
    • CRAN Packages
      • gtools
      • stringi
  • Linux Tools
    • hostname
    • jq
    • figlet (optional)
  • QC Tools
    • FastQC
    • MultiQC
    • QualiMap
  • NGS Software
    • BBDuk
    • STAR
    • RSEM

Note

While the interpreters of the Development Environments need to be globally available (i.e., findable in $PATH), NGS Software just has to be locally present on the server. Paths to each NGS Software tool will be configured later by editing the install.paths file (see below). Finally, QC Tools allow both installation modes.

The following command sequence represents the standard installation procedure on an Arch/Manjaro system. For different Linux distributions, please refer to the installation guides of the different tools.

# Oracle Java (v.7 or higher)
yay -Syu jre jdk
java --version

# Python 3 (pip) and pipx
sudo pacman -Syu python
python --version
sudo pacman -Syu python-pip
pip --version
sudo pacman -Syu python-pipx
pipx --version

# R
sudo pacman -Syu r
R --version

# Bioconductor packages
R
# Within R
install.packages("BiocManager")
install.package("gtools")
BiocManager::install("PCAtools")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("org.Mm.eg.db")

# Sometimes the following PCAtools dependencies need to be manually installed...
install.packages("stringi", "reshape2")
# ...as well as the following AnnotationDbi one.
install.packages("RCurl")

# Return to the Linux CLI
q()
# core/inetutils package (for hostname utility)
sudo pacman -Syu inetutils

# JQ
sudo pacman -Syu jq

# Figlet (optional)
sudo pacman -Syu figlet

# FastQC
cd ~/.local/bin
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
unzip fastqc_v0.12.1.zip
cd FastQC
./fastqc --version

# MultiQC
pipx install multiqc
multiqc --version

# QualiMap
# Support still to be added... see the related issue #23.

# BBTools (for BBDuk)
cd ~/.local/bin
wget --content-disposition https://sourceforge.net/projects/bbmap/files/BBMap_39.01.tar.gz/download
tar -xvzf BBMap_39.01.tar.gz
cd bbmap
./stats.sh in=./resources/phix174_ill.ref.fa.gz

# STAR
cd ~
git clone git@github.com:alexdobin/STAR.git
sudo mv ./STAR /opt/STAR
# Optionally make '/opt/STAR/bin/Linux_x86_64_static/STAR' globally available.
STAR --version
# Just on the first run, download the latest Genome Assembly (FASTA) and related
# Gene Annotation (GTF), and generate the STAR-compliant genome index.
# NOTE: Unless otherwise specified, Ensembl will be used as the reference
#       standard for gene annotation.
cd /data/hg38star
sudo wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
sudo gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
sudo wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
sudo gunzip Homo_sapiens.GRCh38.110.gtf.gz
sudo mkdir index
sudo chmod 777 index/
sudo STAR --runThreadN 8 --runMode genomeGenerate \
    --genomeDir /data/hg38star/index \
    --genomeFastaFiles /data/hg38star/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    --sjdbGTFfile /data/hg38star/Homo_sapiens.GRCh38.110.gtf \
    --sjdbOverhang 100

# RSEM
cd ~
git clone git@github.com:deweylab/RSEM.git
cd RSEM
make
cd ..
sudo mv ./RSEM /opt/RSEM
# Optionally make '/opt/RSEM' globally available.
rsem-calculate-expression --version
# Just on the first run, build the RSEM reference using the Ensembl annotations
# already downloaded for STAR index generation.
cd /data/hg38star
sudo mkdir ref
sudo rsem-prepare-reference \
    --gtf /data/hg38star/Homo_sapiens.GRCh38.110.gtf \
    /data/hg38star/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    /data/hg38star/ref/human_ensembl

Tip

Use x.fastq -d to get a complete report about the current dependency status.

Editing install.paths

A text file named install.paths can be found in the config sub-directory and it is meant to store the paths that allow x.FASTQ to find the software and genome data it requires. Each install.paths entry has the following format:

hostname:tool_name:full_path

For a given hostname, each tool_name will be looked for in full_path, usually the directory containing the installed executable for the tool. Notably, multiple hostnames for the same tool are allowed to increase portability. Since each x.FASTQ installation will consider only those lines starting with the current $HOSTNAME, a single install.paths configuration file is needed to properly run x.FASTQ on different server machines.

Important

When editing the install.paths configuration file to adapt it to your server(s), just keep in mind that hostname can be retrieved using echo $HOSTNAME or the hostname command in Bash; tool_names need to be compliant with the names hardcoded in anqfastq.sh and x.funx.sh scripts (see next paragraph for a comprehensive list of them); full_paths are meant to be the absolute paths (i.e., realpath), without trailing slashes.

Here is a list of the possible tool_names that can be added to install.paths (according to _get_qc_tools and _get_seq_sw x.funx.sh functions) along with a brief explanation of the related paths as used by x.FASTQ (here string grep-ing is case-insensitive):

  • FastQC: path to the directory containing the fastqc executable file [required by qcfastq.sh]
  • MultiQC: path to the directory containing the multiqc symlink to the executable file [required by qcfastq.sh]
  • BBDuk: path to the directory containing the bbduk.sh executable file [required by trimmer.sh, trimfastq.sh]
  • Genome: path to the parent directory containing all the locally-stored genome data (e.g., FASTA genome assemblies, GTF gene annotation, STAR index, RSEM reference, ...) [used by x.fastq.sh --space option]
  • STAR: path to the directory containing the STAR executable file [required by anqfastq.sh]
  • S_index: path to the directory containing the STAR index, as specified by the --genomeDir parameter during STAR --runMode genomeGenerate first run [required by anqfastq.sh]
  • RSEM: path to the directory containing the rsem-calculate-expression executable file [required by anqfastq.sh]
  • R_ref: path to the directory containing the RSEM reference, including the reference_name used during its creation by rsem-prepare-reference (e.g., /data/hg38star/ref/human_ensembl) [required by anqfastq.sh]

Note

install.paths is the only file to edit when installing new dependency tools or moving to different server machines. Only NGS Software and QC Tools need to be specified here. However, all QC Tools can be run by x.FASTQ even if their path is unknown but they have been made globally available by $PATH inclusion. In addition, when BBDuk cannot be found by means of install.paths file, the standalone (and non-persistent) trimmer.sh script interactively prompts the user to input an alternative path runtime, in contrast to its wrapper (trimfastq.sh) that simply quits the program.

Changing Model Organism

Similar to what was done for Human, before the first run, you need to generate a new STAR genome index for the alternative model of interest, as well as the related reference for RSEM. For example, in the case of Mouse, you need to:

cd /data/mm39star/
sudo wget https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
sudo gunzip Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
sudo wget https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz
sudo gunzip Mus_musculus.GRCm39.112.gtf.gz

# STAR
sudo mkdir index
sudo chmod 777 index/
sudo STAR --runThreadN 8 --runMode genomeGenerate \
    --genomeDir /data/mm39star/index \
    --genomeFastaFiles /data/mm39star/Mus_musculus.GRCm39.dna.primary_assembly.fa \
    --sjdbGTFfile /data/mm39star/Mus_musculus.GRCm39.112.gtf \
    --sjdbOverhang 100

# RSEM
sudo mkdir ref
sudo rsem-prepare-reference \
    --gtf /data/mm39star/Mus_musculus.GRCm39.112.gtf \
    /data/mm39star/Mus_musculus.GRCm39.dna.primary_assembly.fa \
    /data/mm39star/ref/mouse_ensembl

Now, in order to align (and quantify) reads on the mouse genome, it will be enough to edit these two lines of the install.paths file:

hostname:S_index:/data/mm39star/index
hostname:R_ref:/data/mm39star/ref/mouse_ensembl

Message Of The Day (optional)

During alignment and quantification operations (i.e., when running anqFASTQ) x.FASTQ attempts to temporarily change the Message Of The Day (MOTD) contained in the /etc/motd file on the server machine in order to alert at login any other users of the massive occupation of computational resources. Warning and idle MOTDs can be customized by editing ./config/motd_warn and ./config/motd_idle text files, respectively. However, this feature is only effective if an /etc/motd file already exists and has write permissions for the user running x.FASTQ. So, to enable it, you preliminarily have to

sudo chmod 666 /etc/motd                         # if the file already exists
sudo touch /etc/motd; sudo chmod 666 /etc/motd   # if no file exists

Updating

To updated x.FASTQ just git pull the repo. Previous steps need to be repeated only when new script files or new dependencies are added.

Additional Notes

On Trimming

In its current implementation, trimFASTQ wraps BBDuk to perform a quite conservative trimming of the reads, based on three steps:

  1. Adapter trimming: adapters are automatically detected based on BBDuk's adapters.fa database and then right-trimmed using 23-to-11 base-long kmers allowing for one mismatch (i.e., Hamming distance = 1). See the KTrimmed stat in the log file.
  2. Quality trimming: is performed on both sides of each read using a quality score threshold trimq=10. See the QTrimmed stat in the log file.
  3. Length filtering: All reads shorter than 25 bases are discarded. See the Total Removed stat in the log file.

In general, it's best to do adapter-trimming first, then quality-trimming, because if you do quality-trimming first, sometimes adapters will be partially trimmed and become too short to be recognized as adapter sequences. For this reason, when you run BBDuk with both quality-trimming and adapter-trimming in the same run, it will do adapter-trimming first, then quality-trimming.

On top of that, it should be noted that, in case you are sequencing for counting applications (like differential gene expression RNA-seq analysis, ChIP-seq, ATAC-seq) read trimming is generally not required anymore when using modern aligners. For such studies local aligners or pseudo-aligners should be used. Modern local aligners (like STAR, BWA-MEM, HISAT2) will soft-clip non-matching sequences. Pseudo-aligners like Kallisto or Salmon will also not have any problem with reads containing adapter sequences. However, if the data are used for variant analyses, genome annotation or genome or transcriptome assembly purposes, read trimming is recommended, including both, adapter and quality trimming.

References

On STAR Aligner

STAR requires ~10 x GenomeSize bytes of RAM for both genome generation and mapping. For instance, the full human genome will require ~30 GB of RAM. There is an option to reduce it to 16GB, but it will not work with 8GB of RAM. However, the transcriptome size is much smaller, and 8GB should be sufficient.

References

Alexander Dobin's (author of STAR) posts.

STAR index is commonly generated using --sjdbOverhang 100 as a default value. This parameter does make almost no difference for reads longer than 50 bp. However, under 50 bp it is recommended to generate ad hoc indexes using --sjdbOverhang <readlength>-1, also considering that indexes for longer reads will work fine for shorter reads, but not vice versa.

References

Alexander Dobin's (author of STAR) posts.

STAR does not currently support PE interleaved FASTQ files. Check it out the related issue at alexdobin/STAR#686. One way to go about this is to deinterlace PE-interleaved-FASTQs first and then run x.FASTQ in the dual-file PE default mode.

References

On STAR-RSEM Coupling

RSEM, as well as other transcript quantification software, requires reads to be mapped to transcriptome. For this reason, x.FASTQ runs STAR with --quantMode TranscriptomeSAM option to output alignments translated into transcript coordinates in the Aligned.toTranscriptome.out.bam file (in addition to alignments in genomic coordinates in Aligned.*.sam/bam file).

Importantly, x.FASTQ runs STAR with --outSAMtype BAM Unsorted option, since if you provide RSEM a sorted BAM, RSEM will assume every read is uniquely aligned and converge very quickly... but the results are wrong!

References

Bo Li's (author of RSEM) posts.

On RSEM Quantification

Among quantification tools, the biggest and most meaningful distinction is between methods that attempt to properly quantify abundance, generally using a generative statistical model (e.g., RSEM, BitSeq, salmon, etc.), and those that try to simply count aligned reads (e.g., HTSeq and featureCounts). Generally, the former are more accurate than the latter at the gene level and can also offer transcript-level estimates if desired (while counting-based methods generally cannot).

However, in order to build such a probabilistic model, the fragment length distribution should be known. The fragment length refers to the physical molecule of D/RNA captured in the library prep stage and (partially!) sequenced by the sequencer. Using this information, the effective transcript lengths can be estimated, which have an effect on fragment assignment probabilities. With paired-end reads the fragment length distribution can be learned from the FASTQ files or the mappings of the reads, but for single-end data this cannot be done, so it is strongly recommended that the user provide the empirical values via the –fragment-length-mean and –fragment-length-sd options. This generally improves the accuracy of expression level estimates from single-end data, but, usually, the only way to get this information is through the BioAnalyzer results for the sequencing run. If this is not possible and the fragment length mean and SD are not provided, RSEM will not take a fragment length distribution into consideration. Nevertheless, it should be noted that the inference procedure is somewhat robust to these parameters; maximum likelihood estimates may change a little, but, in any case, the same distributional values will be applied in all samples and so, ideally, most results of misspecification will wash out in subsequent differential analysis.

References

Rob Patro's (author of salmon) posts.

Finally, [...] I don’t believe that model misspecification that may result due to not knowing the fragment length distribution will generally have enough of a deleterious effect on the probabilistic quantification methods to degrade their performance to the level of counting based methods. I would still argue to prefer probabilistic quantification (i.e., salmon) to read counting, even if you don’t know the fragment length distribution. As I mentioned above, it may change the maximum likelihood estimates a bit, but should do so across all samples, hopefully minimizing the downstream effects on differential analysis.

About

Bash wrapper modules for the remote analysis of RNA-Seq data, with persistency features.

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published