Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lint for unused input parameters #12724

Open
wants to merge 11 commits into
base: dev
Choose a base branch
from

Conversation

bernt-matthias
Copy link
Contributor

@bernt-matthias bernt-matthias commented Oct 15, 2021

This is first attempt to lint for input parameters that are not used:

  • in command or configfile
  • in output/filter
  • in /outputs//change_format/when

Edit: The following text is outdated:

Basic idea is to grep for "[^\w_]" + param_name + r"([^\w_]|$)", i.e. the placeholder delimited by something that can't belong to the placeholder.

Certainly this is very (maybe to) simplistic .. but it seems to work quite well :) .. But certainly it might be useful to check via some cheetah python lib if parameters are used...

I also experimented with stricter regular expressions (i.e. more specific delimiters), but there are to many cases to consider. In config files, for instance, the placeholders could be joined by any (non-placeholder) character anyway.

It might help

  • if we would be more strict with cheetah formatting (e.g. PEP8-ish) and
  • enforce ${param} syntax (instead of just $param) in more cases

Still there are cases like:

  • ${ ",".join( [ str( platform.platform_entry ) ] ) }
  • $getVar($sect + "excl_ival_type.excl_ival_type_sel")

TODO:

Additionally: added linting of configfiles

xref galaxyproject/tools-iuc#4085

TODO:

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:

And manually ran it on the IUC repo.

License

@bgruening
Copy link
Member

So easy? Super cool, this will help a lot! I was thinking we need to go through cheetah and hook into the templates but if a simple grep works - awesome!

@bernt-matthias
Copy link
Contributor Author

bernt-matthias commented Oct 17, 2021

So easy? Super cool, this will help a lot! I was thinking we need to go through cheetah and hook into the templates but if a simple grep works - awesome!

Yes, but can be improved. I started of with a regexp (\$\S+' +name + '\s') which was workung well. But it can only be applied to command an config files. Also the final \s had problems, needs more chars, at least [\s'}].

@bernt-matthias bernt-matthias changed the title first steps to lint for inputs missing from command block Lint for unused input parameters Oct 18, 2021
@bernt-matthias bernt-matthias force-pushed the topic/lint_inputs_missing_from_command branch from fd35f7e to 34be7a6 Compare October 20, 2021 09:54
@bernt-matthias bernt-matthias force-pushed the topic/lint_inputs_missing_from_command branch 4 times, most recently from 301e6ba to 5be1139 Compare November 23, 2021 09:22
bernt-matthias added a commit to bernt-matthias/tools-iuc that referenced this pull request Nov 26, 2021
I guess the validator should be sufficient, but maybe having this
additional if block can't hurt

will make the linter happy for galaxyproject/galaxy#12724
alternatively we could extend the linter such that
parameters that do not appear on the command line
bu have a (expression) validator? are OK??
bgruening pushed a commit to galaxyproject/tools-iuc that referenced this pull request Nov 27, 2021
I guess the validator should be sufficient, but maybe having this
additional if block can't hurt

will make the linter happy for galaxyproject/galaxy#12724
alternatively we could extend the linter such that
parameters that do not appear on the command line
bu have a (expression) validator? are OK??
@bernt-matthias bernt-matthias marked this pull request as draft December 1, 2021 16:05
@bernt-matthias
Copy link
Contributor Author

bernt-matthias commented Jan 24, 2022

Spend some time to improve this: Instead of a simple regex search for the parameter name the code searches now in the Compiled template.

The starting motivation was to avoid the problem of placeholders listed in comments (and removing comments seems not trivial), but also very short variable names (eg single letter) had false negatives.

The main advantage for searching in the compiled templates is that cheetah adds some code (VFN, VFFSL, rawExpr) that "ensures" to search only in uncommented code (of course if tool devs would use text matching the regular expressions in codes this would still lead to false negatives).
In theory there may be also possibilities in the cheetah lib (see)[https://sourceforge.net/p/cheetahtemplate/mailman/cheetahtemplate-discuss/thread/Pine.LNX.4.64.0803092139100.22574%40calrudd-server.calrudd.com/#msg18793519], but I don't dare to touch this (note that the devs also suggest to grep .. which is in my opinion quite complex).

I developed a series of regular expressions for discovering all syntax that I could think of. Unfortunatelly some cheetah expressions require quite "unspecific" (a bit less specific than I would find really good). So at the moment most of the specific ones are commented (if covered by the less specific ones).

I also checked the tool repos for which I had local clones and opened issues:

For completeness: found none in (W4M)[https://github.com/workflow4metabolomics/tools-metabolomics.git]

@bernt-matthias bernt-matthias force-pushed the topic/lint_inputs_missing_from_command branch 2 times, most recently from eeca2f2 to ee0e5c4 Compare January 24, 2022 20:43
@mvdbeek
Copy link
Member

mvdbeek commented Jan 25, 2022

Will this work for tools that use galaxy.json or the configfile to dump parameters (<inputs name="inputs" filename="inputs.json" />) or data managers that are just used in external scripts?

continue

if change_format:
lint_ctx.error(f"Param input [{param_name}] is only used in a change_format tag", line=param.sourceline, xpath=tool_xml.getpath(param))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would this be an error ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my understanding the change_format tag tells the tool: depending on the parameter param_name the generated command line creates output of a different format .. which is then set for the output dataset.

I can't think of a scenario in which the generated command line would create output of a different format without using the parameter param_name at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Will make this a warning. Found a use case https://github.com/ARTbio/tools-artbio/blob/667afd5a5408321a31cdebab75b378e694deab07/tools/justgzip/gzip.xml#L20 .. even if this might be better over there: ARTbio/tools-artbio#580

if change_format:
lint_ctx.error(f"Param input [{param_name}] is only used in a change_format tag", line=param.sourceline, xpath=tool_xml.getpath(param))
else:
lint_ctx.error(f"Param input [{param_name}] not found in command or configfiles.", line=param.sourceline, xpath=tool_xml.getpath(param))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be an info or maybe a warning, I think there are legitimate cases where that's desired.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we can maybe use an action to annotate the source in a PR so this doesn't get lost completely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion this is clearly an error .. a really bad one :). There is a parameter in the tool form that is supposed modify the behavior of the tool in a certain way. In the cases that are covered here this is not the case and users will become undesired results.

I manually checked all cases in the linked issues and found only a single case where I had to think twice if this might be OK (this was a boolean to agree on the license .. but it had no effect .. so also an error in my opinion).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we can maybe use an action to annotate the source in a PR so this doesn't get lost completely

This would be a good thing anyway. Do you have any specific ideas or inspiration how this could be implemented.

@mvdbeek mvdbeek modified the milestones: 22.01, 22.05 Jan 25, 2022
@bernt-matthias
Copy link
Contributor Author

Will this work for tools that use galaxy.json or the configfile to dump parameters () or data managers that are just used in external scripts?

Yes.

The code variable contains the concatenation of the content of the <command> and all <configfile> tags:

for tag in [".//command", ".//configfile"]:

This could be even stricter: actually we might/should check also if the names of all configfiles appears in the command (or other configfiles). Let me just add this to the new configfile linter :)

The inputs tag is covered by the following code:

if conf_inputs is not None: # in this it's really hard to know

We do the following checks:

  • for data parameters the data_style attribute needs to be set (since otherwise data parameters are excluded)
  • it needs to be used in the command or any on the config files

@bernt-matthias bernt-matthias marked this pull request as ready for review January 25, 2022 11:34
@bernt-matthias bernt-matthias force-pushed the topic/lint_inputs_missing_from_command branch 2 times, most recently from 0918e57 to 4d9d054 Compare September 14, 2022 15:16
@bernt-matthias
Copy link
Contributor Author

Will this work for tools that use galaxy.json

How can input parameters be used via galaxy.json?

@@ -1,5 +1,56 @@
import re

from Cheetah.Compiler import Compiler
from Cheetah.Template import Template
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/galaxyproject/galaxy/blob/dev/packages/tool_util/setup.cfg#L34 this needs to be updated to require Cheetah3 or galaxy-util[template].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint @jmchilton added Cheetah3

if change_format:
lint_ctx.error(f"Param input [{param_name}] is only used in a change_format tag", node=param)
else:
lint_ctx.error(f"Param input [{param_name}] not found in command or configfiles.", node=param)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error should only be used if the tool is broken with that change IMO. I don't think unused variables really indicate a tool is broken. I would make these warns or infos.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In particular the last case (the else branch) is clearly an error in my opinion. The parameter is used nowhere in the tool, i.e. the functionality of this parameter is broken (even if the tool still runs...). Just assume it would be the light switch in a car .. if it does not switch the lights on/off it's broken even if the car is still driving, or?

For the if branch I just can not imagine how the tool may produce a different format without using the parameter anywhere else except for the change_format tag.

But I'm fine with downgrading errors and warnings to warnings and infos (resp) if its for the possibility of FP due to getVar and varExists...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For getVar and varExists we could just bail out like for the case when a param file is present

@jmchilton
Copy link
Member

I would love to see this applied to tools-iuc and see if there are any false negatives and if it catches in any actual positives.

I also wonder if just compiling the Cheetah is actually going to be the biggest win here - does you get_code throw an exception on invalid cheetah and can we make that a linting we do also?

@bernt-matthias
Copy link
Contributor Author

I would love to see this applied to tools-iuc and see if there are any false negatives and if it catches in any actual positives.

Already did this in January, see #12724 (comment) many have been fixed already. Have not found a single FP so far (except for copy paste mistakes .. because I did the check semi automatic .. ). But this is only because tool developers do not use varExists and getVar .. which is the only option that I can think of to construct a FP case (in this case we should degrade it to warn/info).

FN are more difficult to answer.

I also wonder if just compiling the Cheetah is actually going to be the biggest win here - does you get_code throw an exception on invalid cheetah and can we make that a linting we do also?

That is an excellent idea. Maybe at least some errors can be caught, e.g. syntax errors. But some errors will only become apparent during execution. Lets see if I can add a test.

)
if action:
lint_ctx.warn(
f"Param input [{param_name}] {only}found in output actions, better use extended metadata.",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was expecting that you "complain" about this one .. :)

considering command, configfiles, filters, ...

add linting for configfiles
for finding variables
by searching in compiled cheetah templates
@bernt-matthias
Copy link
Contributor Author

Also note that this is implemented in a separate linter which can be enabled/disabled completely via planemo

@bernt-matthias bernt-matthias force-pushed the topic/lint_inputs_missing_from_command branch from 4d9d054 to 81d2ec4 Compare December 9, 2022 11:54
@bernt-matthias bernt-matthias force-pushed the topic/lint_inputs_missing_from_command branch from b2949d8 to d9655f5 Compare December 9, 2022 13:08
@bernt-matthias
Copy link
Contributor Author

I would love to see this applied to tools-iuc

So I did it again:

  • tools/allegro/allegro.xml
    • ERROR: Param input [opt_sexspecific] not found in command or configfiles.
  • tools/arriba/arriba_draw_fusions.xml
    • ERROR: Param input [ref_file] not found in command or configfiles.
    • ERROR: Param input [fixedScale] not found in command or configfiles.
  • tools/arriba/arriba.xml
    • ERROR: Param input [fixedScale] not found in command or configfiles.
  • tools/bcftools/bcftools_query.xml
    • ERROR: Param input [tsv] is only used in a change_format tag
  • tools/bedtools/closestBed.xml
    • ERROR: Param input [fu] not found in command or configfiles.
    • ERROR: Param input [fd] not found in command or configfiles.
    • ERROR: Param input [fu] not found in command or configfiles.
    • ERROR: Param input [fd] not found in command or configfiles.
    • ERROR: Param input [fu] not found in command or configfiles.
    • ERROR: Param input [fd] not found in command or configfiles.
  • tools/cat/cat_contigs.xml
    • WARNING: Param input [seqtype] found in output actions, better use extended metadata.
    • WARNING: Param input [seqtype] found in a label attribute, this is discouraged.
    • WARNING: Param input [sum_titles] only found in output actions, better use extended metadata.
    • WARNING: Param input [bin_col] only found in output actions, better use extended metadata.
  • tools/cat/cat_bins.xml
    • WARNING: Param input [seqtype] found in output actions, better use extended metadata.
    • WARNING: Param input [seqtype] found in a label attribute, this is discouraged.
    • WARNING: Param input [sum_titles] only found in output actions, better use extended metadata.
    • WARNING: Param input [bin_col] only found in output actions, better use extended metadata.
  • tools/disco/disco.xml
    • ERROR: Param input [PrintScaffolds] not found in command or configfiles.
  • tools/egsea/egsea.xml
    • ERROR: Param input [non_commercial_use] not found in command or configfiles.
  • tools/emboss_5/emboss_epestfind.xml
    • ERROR: Param input [threshold] not found in command or configfiles.
  • tools/emboss_5/emboss_textsearch.xml
    • ERROR: Param input [casesensitive] not found in command or configfiles.
  • tools/gatk4/gatk4_Mutect2.xml
    • WARNING: Param input [exclude_intervals] only used in a template macro, use a macro token instead.
    • WARNING: Param input [interval_exclusion_padding] only used in a template macro, use a macro token instead.
    • WARNING: Param input [exclude_intervals] only used in a template macro, use a macro token instead.
    • WARNING: Param input [interval_exclusion_padding] only used in a template macro, use a macro token instead.
    • WARNING: Param input [intervals] only used in a template macro, use a macro token instead.
    • WARNING: Param input [interval_padding] only used in a template macro, use a macro token instead.
    • WARNING: Param input [intervals] only used in a template macro, use a macro token instead.
    • WARNING: Param input [interval_padding] only used in a template macro, use a macro token instead.
    • WARNING: Param input [seqdict_sequence] only used in a template macro, use a macro token instead.
    • WARNING: Param input [seqdict_sequence] only used in a template macro, use a macro token instead.
    • WARNING: Param input [interval_exclusion_padding] only used in a template macro, use a macro token instead.
    • WARNING: Param input [interval_padding] only used in a template macro, use a macro token instead.
  • tools/genehunter_modscore/genehunter_modscore.xml
    • ERROR: Param input [extra_alg_mem] not found in command or configfiles.
  • tools/ggplot2/ggplot2_heatmap.xml
    • ERROR: Param input [sample_name_orientation] not found in command or configfiles.
    • ERROR: Param input [sample_name_orientation] not found in command or configfiles.
    • ERROR: Param input [sample_name_orientation] not found in command or configfiles.
    • ERROR: Param input [sample_name_orientation] not found in command or configfiles.
  • tools/ggplot2/ggplot_violin.xml
    • ERROR: Param input [xaxismin] not found in command or configfiles.
    • ERROR: Param input [xaxismax] not found in command or configfiles.
  • tools/hicexplorer/hicConvertFormat.xml
    • ERROR: Param input [storeAppliedCorrection] not found in command or configfiles.
    • ERROR: Param input [enforceInteger] not found in command or configfiles.
    • ERROR: Param input [storeAppliedCorrection] not found in command or configfiles.
    • ERROR: Param input [enforceInteger] not found in command or configfiles.
  • tools/homer/homer_findMotifsGenome.xml
    • ERROR: Param input [olen] not found in command or configfiles.
  • tools/iqtree/iqtree.xml
    • ERROR: Param input [seqtype] not found in command or configfiles.
  • tools/maker/maker.xml
    • ERROR: Param input [license_agreement] not found in command or configfiles.
  • tools/metaphlan/metaphlan.xml
    • ERROR: Param input [stat] not found in command or configfiles.
  • tools/miniprot/miniprot.xml
    • ERROR: Param input [query_batch_size] not found in command or configfiles.
  • tools/mothur/classify.tree.xml
    • ERROR: Param input [cutoff] not found in command or configfiles.
  • tools/mothur/cluster.xml
    • ERROR: Param input [fasta] not found in command or configfiles.
    • ERROR: Param input [fasta] not found in command or configfiles.
    • ERROR: Param input [fasta] not found in command or configfiles.
  • tools/mothur/chimera.ccode.xml
    • ERROR: Param input [mask] not found in command or configfiles.
  • tools/ncbi_datasets/datasets_gene.xml
    • ERROR: Param input [decompress] not found in command or configfiles.
  • tools/necat/necat.xml
    • ERROR: Param input [min_length] not found in command or configfiles.
    • ERROR: Param input [max_length] not found in command or configfiles.
    • ERROR: Param input [min_identity] not found in command or configfiles.
    • ERROR: Param input [min_aligned_length] not found in command or configfiles.
    • ERROR: Param input [max_overhang] not found in command or configfiles.
    • ERROR: Param input [min_coverage] not found in command or configfiles.
    • ERROR: Param input [max_coverage] not found in command or configfiles.
    • ERROR: Param input [max_diff_coverage] not found in command or configfiles.
    • ERROR: Param input [coverage_discard] not found in command or configfiles.
    • ERROR: Param input [bestn] not found in command or configfiles.
    • ERROR: Param input [coverage] not found in command or configfiles.
    • ERROR: Param input [identity_global_deviation1] not found in command or configfiles.
    • ERROR: Param input [identity_global_deviation2] not found in command or configfiles.
    • ERROR: Param input [overhang_global_deviation1] not found in command or configfiles.
    • ERROR: Param input [overhang_global_deviation2] not found in command or configfiles.
    • ERROR: Param input [identity_local_deviation1] not found in command or configfiles.
    • ERROR: Param input [identity_local_deviation2] not found in command or configfiles.
    • ERROR: Param input [overhang_local_deviation1] not found in command or configfiles.
    • ERROR: Param input [overhang_local_deviation2] not found in command or configfiles.
    • ERROR: Param input [identity_local_condition] not found in command or configfiles.
    • ERROR: Param input [local_low_coverage] not found in command or configfiles.
    • ERROR: Param input [min_length] not found in command or configfiles.
    • ERROR: Param input [min_identity] not found in command or configfiles.
    • ERROR: Param input [min_contig_length] not found in command or configfiles.
    • ERROR: Param input [max_spur_length] not found in command or configfiles.
    • ERROR: Param input [select_branch] not found in command or configfiles.
    • ERROR: Param input [read_min_length] not found in command or configfiles.
    • ERROR: Param input [ctg_min_length] not found in command or configfiles.
    • ERROR: Param input [ctg2ctg_min_identity] not found in command or configfiles.
    • ERROR: Param input [ctg2ctg_max_overhang] not found in command or configfiles.
    • ERROR: Param input [ctg2ctg_min_aligned_length] not found in command or configfiles.
    • ERROR: Param input [read2ctg_min_identity] not found in command or configfiles.
    • ERROR: Param input [read2ctg_max_overhang] not found in command or configfiles.
    • ERROR: Param input [read2ctg_min_aligned_length] not found in command or configfiles.
    • ERROR: Param input [read2ctg_min_coverage] not found in command or configfiles.
    • ERROR: Param input [min_contig_length] not found in command or configfiles.
    • ERROR: Param input [window_size] not found in command or configfiles.
    • ERROR: Param input [select_branch] not found in command or configfiles.
  • tools/polypolish/polypolish.xml
    • ERROR: Param input [min_depth] not found in command or configfiles.
    • ERROR: Param input [fraction_invalid] not found in command or configfiles.
    • ERROR: Param input [max_errors] not found in command or configfiles.
    • ERROR: Param input [fraction_valid] not found in command or configfiles.
  • tools/schicexplorer/scHicPlotClusterProfiles.xml
    • ERROR: Param input [maximalDistance] not found in command or configfiles.
  • tools/schicexplorer/scHicConsensusMatrices.xml
    • ERROR: Param input [chromosomes] not found in command or configfiles.

For my own reference

planemo ci_find_repos --exclude packages --exclude deprecated --exclude_from .tt_skip --output repository_list.txt
while read line
do                          
    >&2 echo $line; planemo lint  $line --skip help,tests,inputs,output,general,command,citations,tool_xsd,xml_order,stdio
done < repository_list.txt > lint.tx
cat lint.txt | grep -v Applying | egrep "ERROR|WARNING" -B 1 | grep -v '\-\-'| sed 's@Linting tool /home/berntm/projects/tools-iuc/@- [ ] @; s/^\.\./  - [ ]/'

@dannon dannon modified the milestones: 23.0, 23.1 Jan 10, 2023
@mvdbeek mvdbeek removed this from the 23.1 milestone Jun 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants