x.FASTQ
Bash modules for the remote analysis of RNA-Seq data
x.FASTQ is a suite of Bash scripts that wrap original and third-party software with the purpose of making RNA-Seq data analysis more accessible and automated.
x.FASTQ was originally written for the Endothelion project with the intention of abstracting our standard analysis pipeline for NGS transcriptional data. The main idea was to make the whole procedure faster and more scalable, but also affordable for wet-lab collaborators without a specific bioinformatics background, even those possibly operating from different labs or departments. To meet these needs, we designed x.FASTQ to have the following specific features:
- Remote operability: Given the typical hardware requirements needed for read alignment and transcript abundance quantification, x.FASTQ is assumed to be installed just on one or a few remote servers accessible to all collaborators via SSH. Each x.FASTQ module, launched via CLI as a Bash command, will run in the background (`&`) and persistently (via `nohup`), so that the user is not bound to keep the connection active for the entire duration of the analysis, but only for job scheduling.
- Standardization: Most x.FASTQ scripts are wrappers of lower-level applications commonly used as standard tools in RNA-Seq data analysis and widely appreciated for their performance (e.g., FastQC, BBDuk, STAR, RSEM).
- Simplification: Scripts expose a limited number of options by making extensive use of default settings (suitable for the majority of standard RNA-Seq analyses) and taking charge of managing input and output data formats and their organization.
- Automation: All scripts are designed to loop over sets of target files properly stored within the same directory.
- Completeness: The tools provided by x.FASTQ allow for a complete workflow, from raw reads retrieval to count matrix generation.
- No bioinformatics skills required: Each x.FASTQ module comes with an `--help` option providing extensive documentation. The only requirements for the user are a basic knowledge of the Unix shell and an SSH client installed on their local machine.
- Reproducibility: Although (still) not containerized, each x.FASTQ module is tightly versioned and designed to save detailed log files at each run. Also, utility functions are available to print complete version reports about x.FASTQ modules and dependencies (i.e., the `x.fastq -r` and `-d` options, respectively).
x.FASTQ currently consists of 7 modules designed to be run directly by the end user, each addressing a precise step of a general pipeline for RNA-Seq data analysis, from the retrieval of raw reads to the generation of the expression matrix.
- x.FASTQ is a cover-script that performs some general-utility tasks, such as dependency checking, symlink creation, and the generation of version and disk-usage reports;
- getFASTQ allows local downloading of NGS raw data in FASTQ format from the ENA database;
- trimFASTQ uses BBDuk (from the BBTools suite) to remove adapter sequences and perform quality trimming;
- anqFASTQ uses STAR and RSEM to align reads and quantify transcript abundance, respectively;
- qcFASTQ is an interface for multiple quality-control tools, including FastQC and MultiQC;
- countFASTQ assembles counts from multiple samples/runs into a single TSV expression matrix, choosing among multiple metrics (TPM, FPKM, RSEM expected counts) and levels (gene or isoform); optionally, it injects experimental design information into the matrix heading and appends annotations regarding gene symbol, gene name, and gene type (Ensembl gene/transcript IDs are required for annotation);
- metaharvest fetches sample and series metadata from GEO and/or ENA databases, then it parses the retrieved metadata and saves a local copy of them as a CSV-formatted table (useful for both documentation and subsequent x.FASTQ analysis steps).
In addition, x.FASTQ includes a number of auxiliary scripts (written in Bash, R, or Python) that are not meant to be directly run by the end user, but are called by the main modules. Most of them are found in the `workers` subfolder.
- `x.funx.sh` contains variables and functions that need to be shared among (i.e., sourced by) all x.FASTQ modules;
- `trimmer.sh` is the actual trimming script, wrapped by trimFASTQ;
- `assembler.R` implements the matrix assembly procedure required by countFASTQ;
- `pca_hc.R` implements Principal Component Analysis and Hierarchical Clustering of samples as required by the `qcfastq --tool=PCA ...` option;
- `fuse_csv.R` is used by metaharvest to merge the cross-referenced metadata downloaded from both GEO and ENA databases;
- `parse_series.R` is used by metaharvest to extract metadata from a GEO-retrieved SOFT-formatted family file;
- `re_uniq.py` is used to reduce redundancy when STAR and RSEM logs are displayed in console as anqFASTQ progress reports.
All suite modules enjoy some internal consistency:
- upon running `x.fastq.sh -l <target_path>` from the local x.FASTQ repository directory, each x.FASTQ module can be invoked from any location on the remote machine using its fully lowercase name (provided that `<target_path>` is already included in `$PATH`);
- each script launches in the background a persistent job (or a queue of jobs), the main statement of each module typically being a line of the form
  ```bash
  # MAIN STATEMENT
  nohup command_with_args >> "$log_file" 2>&1 &
  ```
Tip
As per the Unix shell:
- `nohup` (no hangups) allows processes to keep running even upon user logout, as, for instance, when exiting an SSH session;
- `>>` allows output to be redirected (and appended) somewhere other than the default `./nohup.out` file;
- `2>&1` is to redirect both standard output and standard error to the same location (i.e., the log file);
- `&`, at the end of the line, is to run the command in the background and get the shell prompt active again.
- each module (except x.FASTQ and metaharvest) saves its own log file inside the experiment-specific target directory using a common filename pattern, namely
  ```
  Z_<ScriptID>_<FastqID>_<DateStamp>.log
  Z_<ScriptID>_<StudyID>_<DateStamp>.log
  ```
  for sample-based or series-based logs, respectively (the leading 'Z_' is just to get all log files at the bottom of the list when `ls -l`);
Important
In the current implementation of x.FASTQ, filenames are very meaningful! Each FASTQ file is required to have a name matching the regex pattern
```
^[a-zA-Z0-9]+[^a-zA-Z0-9]*.*\.fastq\.gz
```
i.e., beginning with an alphanumeric ID (usually an ENA run ID of the type `(E|D|S)RR[0-9]{6,}`) that is either immediately followed by the `.fastq.gz` extension or separated from the rest of the filename by an underscore or some other character not in `[a-zA-Z0-9]`. Valid examples are `GSM34636.fastq.gz`, `SRR19592966_1.fastq.gz`, etc. This leading ID will be propagated to the names of the log files printed by the getFASTQ module and BBDuk (saved in the `Trim_stats` subdirectory), as well as to all output files from FastQC (stored in `FastQC_*` subdirectories) and STAR/RSEM (i.e., `Counts` subfolders and all files contained therein). countFASTQ will then assume that each RSEM output file is saved into a sample- or run-specific subdirectory, whose name will be used for the count matrix heading. Similarly, but at a lower level, even MultiQC needs each STAR and RSEM output to be properly prefixed with a suitable sample or run ID to be correctly accounted for. Notice, however, that all this should occur spontaneously if FASTQs are downloaded from the ENA database using the getFASTQ module.
In contrast, it is important for the user to manually name each project folder (i.e., each directory that will contain the entire set of FASTQ files from one single experiment) with a name uniquely assigned to the study (typically the related GEO Series ID `GSE[0-9]+` or the ENA Project accession `PRJ(E|D|N)[A-Z][0-9]+`). Log files created by most of the x.FASTQ modules rely on the name of the target directory for the assignment of the `StudyID` label, and the same holds for the MultiQC HTML global report and the file name of the final expression matrix.
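For illustration only (this snippet is not part of the x.FASTQ codebase), a filename can be checked against the pattern above and its leading ID extracted in plain Bash:

```bash
# Hypothetical helper: validate a FASTQ filename and extract its leading ID.
fname="SRR19592966_1.fastq.gz"
if [[ "$fname" =~ ^[a-zA-Z0-9]+[^a-zA-Z0-9]*.*\.fastq\.gz ]]; then
    id="${fname%%[!a-zA-Z0-9]*}"  # strip everything from the first non-alphanumeric char
    echo "Leading ID: $id"        # -> SRR19592966
else
    echo "Non-compliant filename: $fname" >&2
fi
```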
- some common flags keep the same meaning across all modules, even if not all of them are always available (see the usage sketch after this list):
  - `-h | --help` to display the script-specific help;
  - `-v | --version` to display the script-specific version;
  - `-q | --quiet` to run the script silently;
  - `-p | --progress` to see the progress of possibly ongoing processes;
  - `-k | --kill` to gracefully terminate possibly ongoing processes;
  - `-a | --keep-all` not to delete intermediate files upon script execution;
- all modules are versioned according to the three-number Semantic Versioning system; `x.fastq -r` can be used to get a version report of all scripts along with the summary version of the whole x.FASTQ suite;
- if `-p` is not followed by any other arguments, the script searches the current directory for log files from which to infer the progress of the latest namesake task;
- with the `-q` option, scripts do not print anything on the screen other than possible error messages that stop the execution (i.e., fatal errors); logging activity is never disabled, though.
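As a practical illustration of these conventions, a typical interactive session on the server might look like this (a hypothetical sequence, with module arguments abridged):

```bash
getfastq GSE138309_wgets.sh   # schedule the download jobs (they run in the background)
getfastq -p                   # infer job progress from the logs in the current directory
getfastq -k                   # ...or gracefully terminate the ongoing downloads
x.fastq -r                    # print the version report for the whole suite
```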
Assuming you have identified a study of interest from GEO (e.g., GSE138309), have already created a project folder somewhere (`mkdir '<anyPath>'/GSE138309`), and moved in there (`cd '<anyPath>'/GSE138309`), here are a couple of possible example workflows.
Important
Given the hard-coded background execution of most x.FASTQ modules, it is not possible (at present) to use these command sequences to create automated pipelines. Instead, before launching each module, it is necessary to ensure that the previous one has successfully terminated.
```bash
# Download FASTQs, align, quantify, and assemble a gene-level count matrix
getfastq -u GSE138309 > ./GSE138309_wgets.sh
getfastq GSE138309_wgets.sh
anqfastq .
countfastq .
```

```bash
# Download FASTQs in parallel and fetch GEO-ENA cross-referenced metadata
getfastq --urls GSE138309 > ./GSE138309_wgets.sh
getfastq --multi GSE138309_wgets.sh
metaharvest --geo --ena GSE138309 > GSE138309_meta.csv

# Trim and QC
qcfastq --out=FastQC_raw .
trimfastq .
qcfastq --out=FastQC_trim .

# Align, quantify, and QC
anqfastq .
qcfastq --tool=QualiMap .
qcfastq --tool=MultiQC .

# Clean up
rm *.fastq.gz

# Assemble an isoform-level count matrix with annotation and experimental design
groups=(Ctrl Ctrl Ctrl Treat Treat Treat)
countfastq --isoforms --names --design="${groups[*]}" --metric=expected_count .

# Explore samples through PCA
qcfastq --tool=PCA .
```
Note
The study chosen here as an example features non-interleaved (i.e., dual-file) paired-end (PE) reads; however, x.FASTQ also supports single-end (SE) and interleaved PE formats. In those cases, just add the `-s | --single-end` or `-i | --interleaved` option, respectively, when running the trimFASTQ and anqFASTQ modules.
As already stressed, a working SSH client is the only local software requirement for using x.FASTQ, provided the suite has already been installed on some remote server machine by the system administrator. The procedure for installing x.FASTQ (and all its dependencies) on the server is documented here step by step.
Clone the x.FASTQ repository from GitHub:
```bash
cd ~/.local/bin/
git clone git@github.com:Feat-FeAR/x.FASTQ.git
```

Create the symlinks in some `$PATH` directory (e.g., `~/.local/bin/`) to enable global x.FASTQ visibility and easier module invocation:
```bash
cd x.FASTQ
./x.fastq.sh -l ..
```
Install and test the following software, as required by x.FASTQ.
- Development Environments
  - Java
  - Python
  - R
- Bioconductor Packages
  - BiocManager
  - PCAtools
  - org.Hs.eg.db
  - org.Mm.eg.db
- CRAN Packages
  - gtools
  - stringi
- Linux Tools
  - hostname
  - jq
  - figlet (optional)
- QC Tools
  - FastQC
  - MultiQC
  - QualiMap
- NGS Software
  - BBDuk
  - STAR
  - RSEM
Note
While the interpreters of the Development Environments need to be globally available (i.e., findable in `$PATH`), NGS Software just has to be locally present on the server. Paths to each NGS Software tool will be configured later by editing the `install.paths` file (see below). Finally, QC Tools allow both installation modes.
The following command sequence represents the standard installation procedure on an Arch/Manjaro system. For other Linux distributions, please refer to the installation guides of the individual tools.
```bash
# Oracle Java (v.7 or higher)
yay -Syu jre jdk
java --version

# Python 3 (pip) and pipx
sudo pacman -Syu python
python --version
sudo pacman -Syu python-pip
pip --version
sudo pacman -Syu python-pipx
pipx --version

# R
sudo pacman -Syu r
R --version

# Bioconductor packages
R
# Within R
install.packages("BiocManager")
install.packages("gtools")
BiocManager::install("PCAtools")
BiocManager::install("org.Hs.eg.db")
BiocManager::install("org.Mm.eg.db")
# Sometimes the following PCAtools dependencies need to be manually installed...
install.packages(c("stringi", "reshape2"))
# ...as well as the following AnnotationDbi one.
install.packages("RCurl")
# Return to the Linux CLI
q()

# core/inetutils package (for the hostname utility)
sudo pacman -Syu inetutils

# JQ
sudo pacman -Syu jq

# Figlet (optional)
sudo pacman -Syu figlet

# FastQC
cd ~/.local/bin
wget https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.12.1.zip
unzip fastqc_v0.12.1.zip
cd FastQC
./fastqc --version

# MultiQC
pipx install multiqc
multiqc --version

# QualiMap
# Support still to be added... see the related issue #23.

# BBTools (for BBDuk)
cd ~/.local/bin
wget --content-disposition https://sourceforge.net/projects/bbmap/files/BBMap_39.01.tar.gz/download
tar -xvzf BBMap_39.01.tar.gz
cd bbmap
./stats.sh in=./resources/phix174_ill.ref.fa.gz

# STAR
cd ~
git clone git@github.com:alexdobin/STAR.git
sudo mv ./STAR /opt/STAR
# Optionally make '/opt/STAR/bin/Linux_x86_64_static/STAR' globally available.
STAR --version

# Just on the first run, download the latest Genome Assembly (FASTA) and related
# Gene Annotation (GTF), and generate the STAR-compliant genome index.
# NOTE: Unless otherwise specified, Ensembl will be used as the reference
#       standard for gene annotation.
cd /data/hg38star
sudo wget https://ftp.ensembl.org/pub/release-110/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
sudo gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
sudo wget https://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz
sudo gunzip Homo_sapiens.GRCh38.110.gtf.gz
sudo mkdir index
sudo chmod 777 index/
sudo STAR --runThreadN 8 --runMode genomeGenerate \
    --genomeDir /data/hg38star/index \
    --genomeFastaFiles /data/hg38star/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    --sjdbGTFfile /data/hg38star/Homo_sapiens.GRCh38.110.gtf \
    --sjdbOverhang 100

# RSEM
cd ~
git clone git@github.com:deweylab/RSEM.git
cd RSEM
make
cd ..
sudo mv ./RSEM /opt/RSEM
# Optionally make '/opt/RSEM' globally available.
rsem-calculate-expression --version

# Just on the first run, build the RSEM reference using the Ensembl annotations
# already downloaded for STAR index generation.
cd /data/hg38star
sudo mkdir ref
sudo rsem-prepare-reference \
    --gtf /data/hg38star/Homo_sapiens.GRCh38.110.gtf \
    /data/hg38star/Homo_sapiens.GRCh38.dna.primary_assembly.fa \
    /data/hg38star/ref/human_ensembl
```
Tip
Use `x.fastq -d` to get a complete report about the current dependency status.
A text file named `install.paths` can be found in the `config` sub-directory; it is meant to store the paths that allow x.FASTQ to find the software and genome data it requires. Each `install.paths` entry has the following format:
```
hostname:tool_name:full_path
```
For a given `hostname`, each `tool_name` will be looked for in `full_path`, usually the directory containing the installed executable for the tool. Notably, multiple hostnames for the same tool are allowed to increase portability: since each x.FASTQ installation will consider only those lines starting with the current `$HOSTNAME`, a single `install.paths` configuration file is enough to properly run x.FASTQ on different server machines.
Important
When editing the `install.paths` configuration file to adapt it to your server(s), just keep in mind that the hostname can be retrieved using `echo $HOSTNAME` or the `hostname` command in Bash; tool_names need to be compliant with the names hardcoded in the `anqfastq.sh` and `x.funx.sh` scripts (see the next paragraph for a comprehensive list of them); full_paths are meant to be absolute paths (i.e., `realpath`), without trailing slashes.
Here is a list of the possible `tool_name`s that can be added to `install.paths` (according to the `_get_qc_tools` and `_get_seq_sw` functions in `x.funx.sh`), along with a brief explanation of the related paths as used by x.FASTQ (here string grep-ing is case-insensitive):
- FastQC: path to the directory containing the `fastqc` executable file [required by `qcfastq.sh`]
- MultiQC: path to the directory containing the `multiqc` symlink to the executable file [required by `qcfastq.sh`]
- BBDuk: path to the directory containing the `bbduk.sh` executable file [required by `trimmer.sh`, `trimfastq.sh`]
- Genome: path to the parent directory containing all the locally-stored genome data (e.g., FASTA genome assemblies, GTF gene annotation, STAR index, RSEM reference, ...) [used by the `x.fastq.sh --space` option]
- STAR: path to the directory containing the `STAR` executable file [required by `anqfastq.sh`]
- S_index: path to the directory containing the STAR index, as specified by the `--genomeDir` parameter during the `STAR --runMode genomeGenerate` first run [required by `anqfastq.sh`]
- RSEM: path to the directory containing the `rsem-calculate-expression` executable file [required by `anqfastq.sh`]
- R_ref: path to the directory containing the RSEM reference, including the reference_name used during its creation by `rsem-prepare-reference` (e.g., `/data/hg38star/ref/human_ensembl`) [required by `anqfastq.sh`]
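For example, a minimal `install.paths` for a hypothetical host named `myserver`, with paths consistent with the installation steps shown above (all entries here are illustrative), might read:

```
myserver:FastQC:/home/user/.local/bin/FastQC
myserver:MultiQC:/home/user/.local/bin
myserver:BBDuk:/home/user/.local/bin/bbmap
myserver:Genome:/data
myserver:STAR:/opt/STAR/bin/Linux_x86_64_static
myserver:S_index:/data/hg38star/index
myserver:RSEM:/opt/RSEM
myserver:R_ref:/data/hg38star/ref/human_ensembl
```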
Note
`install.paths` is the only file to edit when installing new dependency tools or moving to different server machines. Only NGS Software and QC Tools need to be specified here. However, all QC Tools can be run by x.FASTQ even if their path is unknown, provided they have been made globally available through `$PATH` inclusion. In addition, when BBDuk cannot be found by means of the `install.paths` file, the standalone (and non-persistent) `trimmer.sh` script interactively prompts the user to enter an alternative path at runtime, in contrast to its wrapper (`trimfastq.sh`), which simply quits the program.
Similar to what was done for the human genome, before the first run you need to generate a new STAR genome index for the alternative model of interest, as well as the related reference for RSEM. For example, in the case of Mouse, you need to:
```bash
cd /data/mm39star/
sudo wget https://ftp.ensembl.org/pub/release-112/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
sudo gunzip Mus_musculus.GRCm39.dna.primary_assembly.fa.gz
sudo wget https://ftp.ensembl.org/pub/release-112/gtf/mus_musculus/Mus_musculus.GRCm39.112.gtf.gz
sudo gunzip Mus_musculus.GRCm39.112.gtf.gz

# STAR
sudo mkdir index
sudo chmod 777 index/
sudo STAR --runThreadN 8 --runMode genomeGenerate \
    --genomeDir /data/mm39star/index \
    --genomeFastaFiles /data/mm39star/Mus_musculus.GRCm39.dna.primary_assembly.fa \
    --sjdbGTFfile /data/mm39star/Mus_musculus.GRCm39.112.gtf \
    --sjdbOverhang 100

# RSEM
sudo mkdir ref
sudo rsem-prepare-reference \
    --gtf /data/mm39star/Mus_musculus.GRCm39.112.gtf \
    /data/mm39star/Mus_musculus.GRCm39.dna.primary_assembly.fa \
    /data/mm39star/ref/mouse_ensembl
```
Now, in order to align (and quantify) reads on the mouse genome, it will be enough to edit these two lines of the `install.paths` file:
```
hostname:S_index:/data/mm39star/index
hostname:R_ref:/data/mm39star/ref/mouse_ensembl
```
During alignment and quantification operations (i.e., when running anqFASTQ), x.FASTQ attempts to temporarily change the Message Of The Day (MOTD) contained in the `/etc/motd` file on the server machine in order to alert at login any other users of the massive occupation of computational resources. Warning and idle MOTDs can be customized by editing the `./config/motd_warn` and `./config/motd_idle` text files, respectively. However, this feature is only effective if an `/etc/motd` file already exists and has write permissions for the user running x.FASTQ. So, to enable it, you preliminarily have to:
```bash
sudo chmod 666 /etc/motd                        # if the file already exists
sudo touch /etc/motd; sudo chmod 666 /etc/motd  # if no file exists
```
To update x.FASTQ, just `git pull` the repo. The previous steps need to be repeated only when new script files or new dependencies are added.
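In practice, assuming the clone location used above:

```bash
cd ~/.local/bin/x.FASTQ
git pull
```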
In its current implementation, trimFASTQ wraps BBDuk to perform a quite conservative trimming of the reads, based on three steps (see the sketch after this list):
- Adapter trimming: adapters are automatically detected based on BBDuk's `adapters.fa` database and then right-trimmed using 23-to-11 base-long kmers allowing for one mismatch (i.e., Hamming distance = 1). See the KTrimmed stat in the log file.
- Quality trimming: performed on both sides of each read using a quality score threshold `trimq=10`. See the QTrimmed stat in the log file.
- Length filtering: all reads shorter than 25 bases are discarded. See the Total Removed stat in the log file.
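For reference, the three steps above correspond to a BBDuk call along these lines (a simplified sketch with illustrative file names; the exact command, including I/O and stats options, is assembled by `trimmer.sh`):

```bash
bbduk.sh in=reads_1.fastq.gz in2=reads_2.fastq.gz \
    out=trimmed_1.fastq.gz out2=trimmed_2.fastq.gz \
    ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 \
    qtrim=rl trimq=10 minlen=25
```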
In general, it's best to do adapter-trimming first, then quality-trimming, because if you do quality-trimming first, sometimes adapters will be partially trimmed and become too short to be recognized as adapter sequences. For this reason, when you run BBDuk with both quality-trimming and adapter-trimming in the same run, it will do adapter-trimming first, then quality-trimming.
On top of that, it should be noted that, in case you are sequencing for counting applications (like differential gene expression RNA-Seq analysis, ChIP-Seq, ATAC-Seq), read trimming is generally no longer required when using modern aligners. For such studies, local aligners or pseudo-aligners should be used. Modern local aligners (like STAR, BWA-MEM, HISAT2) will soft-clip non-matching sequences, and pseudo-aligners like Kallisto or Salmon will likewise have no problem with reads containing adapter sequences. However, if the data are used for variant analyses, genome annotation, or genome or transcriptome assembly purposes, read trimming is recommended, including both adapter and quality trimming.
References
- Brian Bushnell's (author of BBTools) post on SEQanswers forum.
- UC Davis Genome Center, DNA Technologies and Expression Analysis Core Laboratory's FAQ When should I trim my Illumina reads and how should I do it?
- Williams et al. 2016. Trimming of sequence reads alters RNA-Seq gene expression estimates. BMC Bioinformatics. 2016;17:103. Published 2016 Feb 25. doi:10.1186/s12859-016-0956-2.
STAR requires approximately 10 × GenomeSize bytes of RAM for both genome generation and mapping. For instance, the full human genome will require ~30 GB of RAM. There is an option to reduce this to 16 GB, but it will not work with 8 GB of RAM. However, the transcriptome size is much smaller, and 8 GB should be sufficient.
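If RAM is the limiting factor, the index can be built with a sparse suffix array, which is presumably the 16-GB option referred to above (a hedged sketch with placeholder file names; check the STAR manual for the exact behavior in your version):

```bash
# Roughly halves the RAM footprint at the cost of some mapping speed.
STAR --runMode genomeGenerate \
    --genomeDir ./index \
    --genomeFastaFiles genome.fa \
    --genomeSAsparseD 2 \
    --limitGenomeGenerateRAM 16000000000
```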
References
Alexander Dobin's (author of STAR) posts.
The STAR index is commonly generated using `--sjdbOverhang 100` as a default value. This parameter makes almost no difference for reads longer than 50 bp. However, under 50 bp it is recommended to generate ad hoc indexes using `--sjdbOverhang <readlength>-1`, also considering that indexes for longer reads will work fine for shorter reads, but not vice versa.
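To pick the right value, the read length can be checked directly from the FASTQ file; a minimal sketch, assuming uniform read lengths:

```bash
# Print the length of the first read in a gzipped FASTQ file.
zcat sample_1.fastq.gz | awk 'NR == 2 { print length($0); exit }'
# e.g., for 50-bp reads, build the index with --sjdbOverhang 49
```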
References
Alexander Dobin's (author of STAR) posts.
STAR does not currently support PE interleaved FASTQ files. Check out the related issue at alexdobin/STAR#686. One way to go about this is to deinterleave PE-interleaved FASTQs first and then run x.FASTQ in the default dual-file PE mode.
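The references below point to more robust solutions, but the core idea of deinterleaving can be sketched with standard Unix tools (assuming strictly alternating R1/R2 records, four lines each):

```bash
# Split an interleaved FASTQ into two mate files
# (requires Bash for the process substitution).
zcat interleaved.fastq.gz | paste - - - - - - - - \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads_1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads_2.fastq
```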
References
- Posts on Stack Overflow
- Posts on Biostar Forum
- GitHub Gist script `deinterleave_fastq.sh`
- SeqFu subprogram `seqfu deinterleave`
RSEM, as well as other transcript quantification software, requires reads to be mapped to the transcriptome. For this reason, x.FASTQ runs STAR with the `--quantMode TranscriptomeSAM` option to output alignments translated into transcript coordinates in the `Aligned.toTranscriptome.out.bam` file (in addition to alignments in genomic coordinates in the `Aligned.*.sam/bam` file). Importantly, x.FASTQ runs STAR with the `--outSAMtype BAM Unsorted` option, since if you provide RSEM with a sorted BAM, RSEM will assume every read is uniquely aligned and converge very quickly... but the results will be wrong!
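Schematically, the STAR-to-RSEM hand-off described above looks like this (an illustrative sketch with made-up file names; the actual calls live in `anqfastq.sh`):

```bash
# Align reads, emitting an unsorted genomic BAM plus a transcriptome-space BAM.
STAR --runMode alignReads \
    --genomeDir /data/hg38star/index \
    --readFilesIn sample_1.fastq.gz sample_2.fastq.gz \
    --readFilesCommand zcat \
    --quantMode TranscriptomeSAM \
    --outSAMtype BAM Unsorted

# Quantify on the transcriptome-space BAM (do NOT sort it first).
rsem-calculate-expression --paired-end --alignments \
    Aligned.toTranscriptome.out.bam \
    /data/hg38star/ref/human_ensembl sample
```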
References
Bo Li's (author of RSEM) posts.
Among quantification tools, the biggest and most meaningful distinction is between methods that attempt to properly quantify abundance, generally using a generative statistical model (e.g., RSEM, BitSeq, salmon, etc.), and those that try to simply count aligned reads (e.g., HTSeq and featureCounts). Generally, the former are more accurate than the latter at the gene level and can also offer transcript-level estimates if desired (while counting-based methods generally cannot).
However, in order to build such a probabilistic model, the fragment length distribution should be known. The fragment length refers to the physical molecule of D/RNA captured in the library prep stage and (partially!) sequenced by the sequencer. Using this information, the effective transcript lengths can be estimated, which have an effect on fragment assignment probabilities. With paired-end reads, the fragment length distribution can be learned from the FASTQ files or the mappings of the reads, but for single-end data this cannot be done, so it is strongly recommended that the user provide the empirical values via the `--fragment-length-mean` and `--fragment-length-sd` options. This generally improves the accuracy of expression level estimates from single-end data but, usually, the only way to get this information is through the BioAnalyzer results for the sequencing run. If this is not possible and the fragment length mean and SD are not provided, RSEM will not take a fragment length distribution into consideration. Nevertheless, it should be noted that the inference procedure is somewhat robust to these parameters; maximum likelihood estimates may change a little, but, in any case, the same distributional values will be applied in all samples, so, ideally, most effects of misspecification will wash out in subsequent differential analysis.
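For single-end data, then, the empirical values can be passed explicitly; a minimal sketch with placeholder numbers (to be replaced with the actual BioAnalyzer estimates):

```bash
# Single-end quantification with an empirical fragment-length distribution.
rsem-calculate-expression --alignments \
    --fragment-length-mean 250.0 \
    --fragment-length-sd 30.0 \
    Aligned.toTranscriptome.out.bam \
    /data/hg38star/ref/human_ensembl sample
```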
References
Rob Patro's (author of salmon) posts.
Finally, [...] I don’t believe that model misspecification that may result due to not knowing the fragment length distribution will generally have enough of a deleterious effect on the probabilistic quantification methods to degrade their performance to the level of counting based methods. I would still argue to prefer probabilistic quantification (i.e., salmon) to read counting, even if you don’t know the fragment length distribution. As I mentioned above, it may change the maximum likelihood estimates a bit, but should do so across all samples, hopefully minimizing the downstream effects on differential analysis.