Update STAR version + new options for STARsolo #5060

Merged
merged 36 commits into from
Feb 17, 2023
Changes from 32 commits
Commits
b52f63b
update STAR version
lldelisle Jan 16, 2023
1006e84
add GeneQuant + outSAMattributes
lldelisle Jan 16, 2023
f450b3e
remove dup test
lldelisle Jan 17, 2023
a7cb191
update tool version [no ci]
lldelisle Jan 17, 2023
2b72bb8
lint
lldelisle Jan 17, 2023
8f42952
add SAMattributes in macro as suggested by @wm75
lldelisle Jan 18, 2023
c3b1dc3
change Cell Ranger to Chromium chemistry
lldelisle Jan 18, 2023
45afe99
Update rg_rnaStarSolo.xml
pavanvidem Jan 20, 2023
e91e9f3
Update macros.xml
pavanvidem Jan 20, 2023
662ba6d
Update rg_rnaStar.xml
pavanvidem Jan 20, 2023
033cd4a
STAR: allow fasta.gz for reference
bernt-matthias Jan 26, 2023
3ca8610
Merge pull request #2 from pavanvidem/patch-4
lldelisle Jan 26, 2023
d374110
use up to date profile
lldelisle Jan 26, 2023
5fe1484
fix back to line
lldelisle Jan 26, 2023
ca87f87
increase GenomeGenerateRAM by @nagoue @bgruening @wm75
lldelisle Jan 27, 2023
b69a4b2
fix macro limits + extend to all starSOLO
lldelisle Jan 27, 2023
a34c128
use double getVar
lldelisle Jan 27, 2023
0103d32
compare params values with command line
lldelisle Jan 27, 2023
0d80c76
put default value in second getVar
lldelisle Jan 27, 2023
dba503d
add colnames to count file
lldelisle Feb 1, 2023
3ed614c
add outWig to STAR
lldelisle Feb 1, 2023
9c5210c
add outWig in STARsolo + compress bam
lldelisle Feb 1, 2023
443ebd2
remove section coverage
lldelisle Feb 1, 2023
ae0f3c7
add ftype in test
lldelisle Feb 1, 2023
065d90e
fix output matrix for new soloFeatures + add test
lldelisle Feb 1, 2023
b5142ff
Merge pull request #4 from lldelisle/outWig
lldelisle Feb 2, 2023
117ce39
solve #1777
lldelisle Jan 27, 2023
95831a6
put quantmode_output in GTFconditional thanks @bernt-matthias
lldelisle Feb 8, 2023
1ff30c2
Merge pull request #3 from lldelisle/solve1777
lldelisle Feb 8, 2023
19b6882
change default outSAMmapqUnique to 255 like in STAR
lldelisle Feb 8, 2023
20a2126
only use soloUMIfiltering when soloUMIdedup is 1MM_CR
lldelisle Feb 8, 2023
dfe88ea
enable to output filtered and raw matrices
lldelisle Feb 8, 2023
c6c4822
add forgotten requirement
lldelisle Feb 16, 2023
296660c
use @TOOL_VERSION@+galaxy@VERSION_SUFFIX@
lldelisle Feb 16, 2023
53faaa4
bump version of data_manager
lldelisle Feb 16, 2023
e3966b9
put back to MAPQ60 with remark
lldelisle Feb 16, 2023
182 changes: 166 additions & 16 deletions tools/rgrnastar/macros.xml
@@ -5,7 +5,8 @@
the index versions in sync, but you should manually adjust the +galaxy
version number. -->
Contributor

<tool id="rna_star_index_builder_data_manager" name="rnastar index versioned" tool_type="manage_data" version="@IDX_VERSION@" profile="19.05">

is what this comment refers to.
This PR should, because of the linked macros file, trigger deployment of a new version of the DM, too, so you need to bump the DM version to version="@IDX_VERSION@+galaxy1"

Contributor Author

Do you mean we should change the version because we changed the STAR version, or because of something else?

Contributor

I guess bumping the IDX_VERSION_SUFFIX does not hurt, but I do not understand why we should do it.

Contributor Author

Maybe to indicate that we use the STAR version @TOOL_VERSION@ instead of the STAR version @IDX_VERSION@ to build the index...

Contributor

@bernt-matthias @lldelisle Just to explain things: the reason for symlinking the macros file is to keep the IDX_VERSION in one place only so that when you update the tool wrapper to a STAR version that requires a newer index format, you'd automatically deploy also a DM that can create these indexes.

The "downside" is that any changes to the tool wrapper macros file will silently affect the DM. So in this case the next version of the DM will use the 2.7.10b version of star for building indexes. These should be identical to ones built with older versions, but it's good to bump the DM wrapper version to be able to trace things back.

Contributor

Seems legit.

Could a more expressive filter help here? For instance, we could store the STAR version that was used to create an entry in the data table, and then filter data table entries for a min (or max) required STAR version.
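
As a rough illustration of the filtering idea, here is a minimal Python sketch. It assumes a hypothetical `star_version` column in the data table and made-up entry names; neither exists in the current table. It also shows why the comparison has to be numeric: compared as plain strings, `2.7.10b` sorts before `2.7.4a`.

```python
import re


def star_version_key(version: str) -> tuple:
    """Hypothetical helper: turn a STAR version like '2.7.10b' into a sortable key."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)([a-z]*)", version)
    if match is None:
        raise ValueError(f"unrecognised STAR version: {version!r}")
    major, minor, patch, letter = match.groups()
    return (int(major), int(minor), int(patch), letter)


def filter_entries(entries, min_version=None, max_version=None):
    """Keep entries whose 'star_version' lies within the optional bounds (inclusive)."""
    kept = []
    for entry in entries:
        key = star_version_key(entry["star_version"])
        if min_version is not None and key < star_version_key(min_version):
            continue
        if max_version is not None and key > star_version_key(max_version):
            continue
        kept.append(entry)
    return kept


# Made-up data table rows; 'star_version' is the assumed extra column.
entries = [
    {"value": "hg38_old", "star_version": "2.7.1a"},
    {"value": "hg38_mid", "star_version": "2.7.4a"},
    {"value": "hg38_new", "star_version": "2.7.10b"},
]
selected = [e["value"] for e in filter_entries(entries, min_version="2.7.4a")]
```

With `min_version="2.7.4a"` only the two newer entries remain. A `max_version` bound can be passed the same way.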

Contributor Author

I don't think it is easy to add a new column to a table; this would require changing the table (which is what happened when we added the new 'genomeVersion' column)...

Contributor

Yes, that would require a new table. I'm also not so sure whether that would improve the situation much.
Min and max version checks in tool wrappers also need quite some discipline to maintain. The max check in particular doesn't work backwards: when a wrapper version is written, the max value is typically still unknown, so there's always at least one wrapper version that will display all newer index versions.
What would be comparably easy to do is to remove the symlink and have the DM use its own macro, which then needs to be maintained separately, but would maybe come with fewer surprises.

Anyway, I don't think this should hold back this PR any longer. If we want to decouple the DM from the tool wrapper, we should do it in its own PR where the decision will be more discoverable than as part of a giant PR like this one.

Contributor Author

I agree.

<!-- STAR version to be used -->
<token name="@VERSION@">2.7.8a</token>
<token name="@VERSION@">2.7.10b</token>
<token name="@PROFILE@">21.01</token>
<!-- STAR index version compatible with this version of STAR
This is the STAR version that introduced the index structure expected
by the current version.
@@ -19,7 +20,7 @@
<xml name="requirements">
<requirements>
<requirement type="package" version="@VERSION@">star</requirement>
<requirement type="package" version="1.9">samtools</requirement>
<requirement type="package" version="1.16.1">samtools</requirement>
<yield />
</requirements>
</xml>
@@ -35,7 +36,7 @@
</xml>

<xml name="index_selection" token_with_gene_model="0">
<param argument="--genomeDir" name="genomeDir" type="select"
<param argument="--genomeDir" type="select"
label="Select reference genome"
help="If your genome of interest is not listed, contact the Galaxy team">
<options from_data_table="@IDX_DATA_TABLE@">
@@ -55,8 +56,8 @@
<citation type="doi">10.1093/bioinformatics/bts635</citation>
</citations>
</xml>
<xml name="@SJDBOPTIONS@" token_optional="true">
<param argument="--sjdbGTFfile" type="data" format="gff3,gtf" label="Gene model (gff3,gtf) file for splice junctions" optional="@OPTIONAL@" help="Exon junction information for mapping splices"/>
<xml name="SJDBOPTIONS">
<param argument="--sjdbGTFfile" type="data" format="gff3,gtf" label="Gene model (gff3,gtf) file for splice junctions" optional="false" help="Exon junction information for mapping splices"/>
<param argument="--sjdbOverhang" type="integer" min="1" value="100" label="Length of the genomic sequence around annotated junctions" help="Used in constructing the splice junctions database. Ideal value is ReadLength-1"/>
</xml>
<xml name="dbKeyActions">
@@ -81,11 +82,16 @@
<token name="@TEMPINDEX@"><![CDATA[
## Create temporary index for custom reference
#if str($refGenomeSource.geneSource) == 'history':
#if $refGenomeSource.genomeFastaFiles.ext == "fasta"
ln -s '$refGenomeSource.genomeFastaFiles' refgenome.fa &&
#else
gunzip -c '$refGenomeSource.genomeFastaFiles' > refgenome.fa &&
#end if
mkdir -p tempstargenomedir &&
STAR
--runMode genomeGenerate
--genomeDir 'tempstargenomedir'
--genomeFastaFiles '${refGenomeSource.genomeFastaFiles}'
--genomeFastaFiles refgenome.fa
## Handle difference between indices with/without annotations
#if 'GTFconditional' in $refGenomeSource:
## GTFconditional exists only in STAR, but not STARsolo
@@ -109,6 +115,8 @@
--genomeSAindexNbases ${refGenomeSource.genomeSAindexNbases}
#end if
--runThreadN \${GALAXY_SLOTS:-4}
## in bytes
--limitGenomeGenerateRAM \$((\${GALAXY_MEMORY_MB:-31000} * 1000000))
&&
#end if
]]></token>
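
For readers less used to shell arithmetic, the `--limitGenomeGenerateRAM` expression in this hunk can be sketched in Python as follows. The function name is hypothetical; the 31000 MB fallback and the factor of 1000000 are taken from the command line in the diff. Note that one MB is treated as 10^6 bytes here, not 2^20.

```python
import os


def limit_genome_generate_ram(env=None, default_mb=31000):
    """Hypothetical helper mirroring $(( ${GALAXY_MEMORY_MB:-31000} * 1000000 ))."""
    env = os.environ if env is None else env
    # Like the shell's ${VAR:-default}: use the default when unset OR empty.
    memory_mb = int(env.get("GALAXY_MEMORY_MB") or default_mb)
    # STAR expects this limit in bytes.
    return memory_mb * 1_000_000


fallback_bytes = limit_genome_generate_ram({})
configured_bytes = limit_genome_generate_ram({"GALAXY_MEMORY_MB": "8000"})
```

Passing an explicit mapping instead of reading `os.environ` keeps the sketch easy to check without touching the real job environment.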
@@ -121,17 +129,15 @@
#else:
'${refGenomeSource.GTFconditional.genomeDir.fields.path}'
## Handle difference between indices with/without annotations
#if str($refGenomeSource.GTFconditional.GTFselect) == 'without-gtf':
#if $refGenomeSource.GTFconditional.sjdbGTFfile:
--sjdbOverhang $refGenomeSource.GTFconditional.sjdbOverhang
--sjdbGTFfile '${refGenomeSource.GTFconditional.sjdbGTFfile}'
#if str($refGenomeSource.GTFconditional.sjdbGTFfile.ext) == 'gff3':
--sjdbGTFtagExonParentTranscript Parent
#end if
#if str($refGenomeSource.GTFconditional.GTFselect) == 'without-gtf-with-gtf':
--sjdbOverhang $refGenomeSource.GTFconditional.sjdbOverhang
--sjdbGTFfile '${refGenomeSource.GTFconditional.sjdbGTFfile}'
#if str($refGenomeSource.GTFconditional.sjdbGTFfile.ext) == 'gff3':
--sjdbGTFtagExonParentTranscript Parent
#end if
#end if
#end if
]]></token>
#end if
]]></token>
<token name="@READSHANDLING@" ><![CDATA[
## Check that the input pairs are of the same type
## otherwise STARsolo will run for a long time and then error out.
@@ -161,8 +167,13 @@
@FASTQ_GZ_OPTION@
#end if
]]></token>
<token name="@LIMITS@" ><![CDATA[
--limitOutSJoneRead $getVar('algo.params.junction_limits.limitOutSJoneRead', $getVar('solo.junction_limits.limitOutSJoneRead', 1000))
--limitOutSJcollapsed $getVar('algo.params.junction_limits.limitOutSJcollapsed', $getVar('solo.junction_limits.limitOutSJcollapsed', 1000000))
--limitSjdbInsertNsj $getVar('algo.params.junction_limits.limitSjdbInsertNsj', $getVar('solo.junction_limits.limitSjdbInsertNsj', 1000000))
]]></token>
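
The nested `$getVar` calls in `@LIMITS@` amount to a two-step fallback: try the first parameter path, then the second, then a hard-coded default. The Python sketch below mimics that logic on plain nested dictionaries. It is not Cheetah's implementation, `get_var` is a hypothetical helper, and the assumption that the `algo.params` path belongs to one wrapper and the `solo` path to the other is inferred from the parameter names only.

```python
def get_var(namespace, dotted_name, default):
    """Resolve 'a.b.c' in nested dicts; return default if any part is missing."""
    current = namespace
    for part in dotted_name.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def limit_out_sj_one_read(namespace):
    """Two-step fallback, shaped like the first line of the @LIMITS@ token."""
    return get_var(
        namespace,
        "algo.params.junction_limits.limitOutSJoneRead",
        get_var(namespace, "solo.junction_limits.limitOutSJoneRead", 1000),
    )


# Illustrative parameter trees (assumed shapes, not dumped from Galaxy).
star_like = {"algo": {"params": {"junction_limits": {"limitOutSJoneRead": 5000}}}}
solo_like = {"solo": {"junction_limits": {"limitOutSJoneRead": 2000}}}
```

The inner call is evaluated first, so its result (or 1000) becomes the default of the outer lookup. This is why the default value only needs to appear in the second `getVar`.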
<xml name="ref_selection">
<param argument="--genomeFastaFiles" type="data" format="fasta" label="Select a reference genome" />
<param argument="--genomeFastaFiles" type="data" format="fasta,fasta.gz" label="Select a reference genome" />
<param argument="--genomeSAindexNbases" type="integer" min="2" max="16" value="14" label="Length of the SA pre-indexing string" help="Typically between 10 and 15. Longer strings will use much more memory, but allow faster searches. For small genomes, the parameter --genomeSAindexNbases must be scaled down to min(14, log2(GenomeLength)/2 - 1)"/>
</xml>
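
The help text for `--genomeSAindexNbases` gives the scaling rule `min(14, log2(GenomeLength)/2 - 1)`. A minimal Python sketch of that rule follows. The function name is hypothetical, and two details are assumptions rather than part of the help text: the result is rounded down to get an integer, and it is clamped to the parameter's `min="2"`.

```python
import math


def suggested_sa_index_nbases(genome_length: int) -> int:
    """Hypothetical helper applying min(14, log2(GenomeLength)/2 - 1)."""
    if genome_length < 1:
        raise ValueError("genome_length must be a positive number of bases")
    scaled = math.log2(genome_length) / 2 - 1
    # Assumption: round down, then respect the param's lower bound of 2.
    return max(2, min(14, math.floor(scaled)))


small_genome = suggested_sa_index_nbases(65_536)         # log2 is exactly 16
large_genome = suggested_sa_index_nbases(3_000_000_000)  # capped at 14
```

For a 65536-base reference the rule gives 16/2 - 1 = 7, while a genome of about 3 Gb hits the cap of 14, which is also the wrapper's default value.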
<xml name="stdio" >
@@ -245,4 +256,143 @@
<option value="None" >No adapter clipping</option>
</param>
</xml>
<xml name="common_SAM_attributes">
<option value="NH" selected="true">NH (number of reported alignments/hits for the read)</option>
<option value="HI" selected="true">HI (query hit index)</option>
<option value="AS" selected="true">AS (local alignment score)</option>
<option value="nM" selected="true">nM (number of mismatches per (paired) alignment)</option>
<option value="NM">NM (edit distance of the aligned read to the reference)</option>
<option value="MD">MD (string for mismatching positions)</option>
<option value="jM">jM (intron motifs for all junctions)</option>
<option value="jI">jI (1-based start and end of introns for all junctions)</option>
</xml>
<xml name="limits">
<section name="junction_limits" title="Junction Limits" expanded="false">
<param argument="--limitOutSJoneRead" type="integer" min="1" value="1000" label="Maximum number of junctions for one read (including all multimappers)" />
<param argument="--limitOutSJcollapsed" type="integer" min="1" value="1000000" label="Maximum number of collapsed junctions" />
<param argument="--limitSjdbInsertNsj" type="integer" min="0" value="1000000" label="Maximum number of inserts to be inserted into the genome on the fly." />
</section>
</xml>
<xml name="outCountActions">
<actions>
<action name="column_names" type="metadata" default="GeneID,Counts_unstrand,Counts_firstStrand,Counts_secondStrand" />
</actions>
</xml>
<xml name="outWig">
<conditional name="outWig">
<param name="outWigType" type="select" label="Compute coverage">
<option value="None">No coverage</option>
<option value="bedGraph">Yes in bedgraph format</option>
<option value="wiggle">Yes in wiggle format</option>
</param>
<when value="None">
<!-- This is necessary for the filtering of output -->
<param name="outWigStrand" type="hidden" value="false" />
</when>
<when value="bedGraph">
<expand macro="outWigParams"/>
</when>
<when value="wiggle">
<expand macro="outWigParams"/>
</when>
</conditional>
</xml>
<xml name="outWigParams">
<param name="outWigTypeSecondWord" type="select" label="Input for coverage">
<option value="">Default (everything that mapped)</option>
<option value="read_5p">signal from only 5’ of the 1st read</option>
<option value="read2">signal from only 2nd read</option>
</param>
<param argument="--outWigStrand" type="boolean" truevalue="Stranded" falsevalue="Unstranded" checked="true" label="collapse strands (unstranded coverage)" help="By default, the strands are separated."/>
<param argument="--outWigReferencesPrefix" type="text" value="-" label="prefix matching reference name" help="For example, set 'chr' if you mapped on an ensembl genome but you want to display on UCSC"/>
<param argument="--outWigNorm" type="boolean" truevalue="RPM" falsevalue="None" checked="true" label="Normalize coverage to million of mapped reads (RPM)"/>
</xml>
<token name="@OUTWIG@"><![CDATA[
#if str($outWig.outWigType) != 'None':
--outWigType '$outWig.outWigType' '$outWig.outWigTypeSecondWord'
--outWigStrand '$outWig.outWigStrand'
--outWigReferencesPrefix '$outWig.outWigReferencesPrefix'
--outWigNorm '$outWig.outWigNorm'
#end if
]]></token>
<token name="@OUTWIGOUTPUTS@"><![CDATA[
#if str($outWig.outWigType) == "bedGraph":
&& mv Signal.Unique.str1.out.bg Signal.Unique.str1.out
&& mv Signal.UniqueMultiple.str1.out.bg Signal.UniqueMultiple.str1.out
#if str($outWig.outWigStrand) == "Stranded":
&& mv Signal.Unique.str2.out.bg Signal.Unique.str2.out
&& mv Signal.UniqueMultiple.str2.out.bg Signal.UniqueMultiple.str2.out
#end if
#elif str($outWig.outWigType) == "wiggle":
&& mv Signal.Unique.str1.out.wig Signal.Unique.str1.out
&& mv Signal.UniqueMultiple.str1.out.wig Signal.UniqueMultiple.str1.out
#if str($outWig.outWigStrand) == "Stranded":
&& mv Signal.Unique.str2.out.wig Signal.Unique.str2.out
&& mv Signal.UniqueMultiple.str2.out.wig Signal.UniqueMultiple.str2.out
#end if
#end if
]]></token>
<xml name="outWigOutputs">
<data format="bedgraph" name="signal_unique_str1" label="${tool.name} on ${on_string}: Coverage Uniquely mapped strand 1" from_work_dir="Signal.Unique.str1.out">
<filter>outWig['outWigType'] != "None"</filter>
<expand macro="dbKeyActions" />
<change_format>
<when input="outWig.outWigType" value="wiggle" format="wig" />
</change_format>
</data>
<data format="bedgraph" name="signal_uniquemultiple_str1" label="${tool.name} on ${on_string}: Coverage Uniquely + Multiple mapped strand 1" from_work_dir="Signal.UniqueMultiple.str1.out">
<filter>outWig['outWigType'] != "None"</filter>
<expand macro="dbKeyActions" />
<change_format>
<when input="outWig.outWigType" value="wiggle" format="wig" />
</change_format>
</data>
<data format="bedgraph" name="signal_unique_str2" label="${tool.name} on ${on_string}: Coverage Uniquely mapped strand 2" from_work_dir="Signal.Unique.str2.out">
<filter>outWig['outWigType'] != "None" and outWig['outWigStrand']</filter>
<expand macro="dbKeyActions" />
<change_format>
<when input="outWig.outWigType" value="wiggle" format="wig" />
</change_format>
</data>
<data format="bedgraph" name="signal_uniquemultiple_str2" label="${tool.name} on ${on_string}: Coverage Uniquely + Multiple mapped strand 2" from_work_dir="Signal.UniqueMultiple.str2.out">
<filter>outWig['outWigType'] != "None" and outWig['outWigStrand']</filter>
<expand macro="dbKeyActions" />
<change_format>
<when input="outWig.outWigType" value="wiggle" format="wig" />
</change_format>
</data>
</xml>
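
To make the relation between `@OUTWIGOUTPUTS@` and the `outWigOutputs` macro easier to follow, here is a small Python sketch that lists the renames the token performs for a given parameter combination. The function name is hypothetical; the file names, extensions, and the `Stranded` check are copied from the token in this diff.

```python
def outwig_renames(out_wig_type, out_wig_strand):
    """Return (source, target) pairs matching the mv commands in @OUTWIGOUTPUTS@."""
    extensions = {"bedGraph": "bg", "wiggle": "wig"}
    if out_wig_type not in extensions:
        # 'None' (no coverage requested): nothing to rename.
        return []
    # The second strand is only handled when coverage is stranded.
    strands = ["str1", "str2"] if out_wig_strand == "Stranded" else ["str1"]
    renames = []
    for strand in strands:
        for kind in ("Unique", "UniqueMultiple"):
            target = f"Signal.{kind}.{strand}.out"
            renames.append((f"{target}.{extensions[out_wig_type]}", target))
    return renames


unstranded_bedgraph = outwig_renames("bedGraph", "Unstranded")
stranded_wiggle = outwig_renames("wiggle", "Stranded")
```

Stripping the extension gives each output a single `from_work_dir` path regardless of format, with `change_format` switching the datatype to `wig` when wiggle output is selected.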
<xml name="quantMode">
<conditional name="quantmode_output">
<param argument="--quantMode" type="select"
label="Per gene/transcript output"
help="STAR can provide analysis results not only with respect to the reference genome, but also with respect to genes and transcripts described by a gene model. Note: This functionality requires either the selection above of a cached index with a gene model, or a gene model provided alongside the index/reference genome in GTF or GFF3 format!">
<option value="-">No per gene or transcript output</option>
<option value="GeneCounts">Per gene read counts (GeneCounts)</option>
<option value="TranscriptomeSAM">Transcript-based BAM output (TranscriptomeSAM)</option>
<option value="TranscriptomeSAM GeneCounts">Both per gene read counts and transcript-based BAM output (TranscriptomeSAM GeneCounts)</option>
</param>
<when value="-" />
<when value="GeneCounts" />
<when value="TranscriptomeSAM">
<param argument="--quantTranscriptomeBan" type="boolean" truevalue="IndelSoftclipSingleend" falsevalue="Singleend"
label="Exclude alignments with indels or soft clipping from the transcriptome BAM output?"
help="You will need to exclude alignments with indels and soft-clipped bases from the transcriptome BAM output for compatibility with certain transcript quantification tools, most notably RSEM. If you are using a tool, like eXpress, that can deal with indels and soft-clipped bases, you can achieve better results by leaving this option disabled." />
</when>
<when value="TranscriptomeSAM GeneCounts">
<param argument="--quantTranscriptomeBan" type="boolean" truevalue="IndelSoftclipSingleend" falsevalue="Singleend"
label="Exclude alignments with indels or soft clipping from the transcriptome BAM output?"
help="You will need to exclude alignments with indels and soft-clipped bases from the transcriptome BAM output for compatibility with certain transcript quantification tools, most notably RSEM. If you are using a tool, like eXpress, that can deal with indels and soft-clipped bases, you can achieve better results by leaving this option disabled." />
</when>
</conditional>
</xml>
<xml name="quantModeNoGTF">
<conditional name="quantmode_output">
<param argument="--quantMode" type="select"
label="Per gene/transcript output">
<option value="-">No per gene or transcript output as no GTF was provided</option>
</param>
<when value="-" />
</conditional>
</xml>
</macros>