-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add microbiome analysis workflows #182
Conversation
Thanks, can you add a README, Changelog and dockstore.yml files ? (https://github.com/galaxyproject/iwc/blob/main/workflows/README.md#structure-of-the-directory) |
@mvdbeek did I miss something else ? Thanks a lot for helping me out :) |
@wm75 Can you help me revising and merging this PR |
…underscore in its name before
To do as discussed with @wm75 : 1- remove "latest" from workflow names |
@bebatut I have added the 5 workflows for the single samples run and the 4 workflows for the collection of samples run so in total 9 workflows with their test data, I have chosen the minimum size sample data which contain VFs, Contigs, etc. the maximum file size is 50Mb, but the other files are either in Bytes or Kbs |
@EngyNasr @bebatut I don't see much value in offering the single-sample workflows, when collection-based flavors exist that could be run with 1-element collections. Having the single-sample WFs published would just mean more maintenance and synchronization efforts. |
@EngyNasr can you please run some json reformatter tool over your workflows. Single-line JSON is just not very nice to review and prevents meaningful diffs. Use, e.g., |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments:
- At least some of your workflows are lacking a
release
attribute, which you need to add manually. - In the Preprocessing workflow, the conversion of fastq.gz to plain fastq, just for the purpose of filtering reads by their ID, is rather unfriendly for a user's quota.
There istoolshed.g2.bx.psu.edu/repos/iuc/seqtk/seqtk_subseq/1.3.1
which can filter compressed fastqs directly (and which is probably faster in all cases). It's downside is that it only keeps matching IDs, but can't discard them, or write both to separate files (liketoolshed.g2.bx.psu.edu/repos/peterjc/seq_filter_by_id/seq_filter_by_id/0.2.7
).
So if you want the non-host reads, you'll have to invert the action of Filter Tabular at the step before.
If you really need also the host reads as a separate file (which I'm not entirely convinced of), you would have to run Filter Tabular and seqtk subseq twice, but even that might still be better than the current way?
it was just the old way we used to do the analysis and we use these workflows in the current training material, thats why we wanted to have both as two versions of the workflow. but definitely they are useless now since the collection version does the same exact job, but it will never take a single file it has to always be a collection |
This tool should also work here: toolshed.g2.bx.psu.edu/repos/iuc/krakentools_extract_kraken_reads/krakentools_extract_kraken_reads/1.2+galaxy0 |
…sing, to be added once the PR of the tool update is merges
I need help in tests @wm75 @paulzierep @bebatut :
I noticed that the tool id for these tools is not like the rest of the tools e.g. toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_getfastabed/2.30.0+galaxy1 Is there a way to solve that for these tests to succeed? |
…with an update to Krakentool, and also pushing the latest updates to the workflows, still the planemo tests fails for the same reasons, I need help with that
@EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work. |
thank you so much :) |
@mvdbeek, Is it possible that it is also done for FILTER_EMPTY_DATASETS ? |
I think it was closed by mistake |
Looks like it worked and you only need to work on your test assertions. |
@EngyNasr two questions on the latest preprocessing version:
|
The Pathogen-Detection-Nanopore-All-Samples-Analysis WF needs much better annotations and input dataset labels to make it understandable what it is good for. Right now, understanding the purpose without knowing the tutorial is really hard. Even the README file only talks about VF genes, but many people wouldn't know what that is. |
…obiomeWorkflows update my branch with the main branch of IWC
Head branch was pushed to by a user without write access
…host as a user input, and updated the test accordingly
Head branch was pushed to by a user without write access
@bebatut @mvdbeek @wm75 Its greeeeeeeeeeen <3 <3 <3 finallyyyyyy Thanks a lot @bgruening Thank you all |
Shall we merge them, and add the bar charts as an update after the visualisation is fixed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's merge and improve in a follow-up PR
* adding microbiome analysis workflows to IWC with test data * adding Changelog, REadme and dockstore yml file * solving linting issues * solving linting error by correcting the file name since i forgot the underscore in its name before * applying all comments * adding workflows for the collection version * applying wolfgang comments, still removing the decompress tool is missing, to be added once the PR of the tool update is merges * updating the preprocessing workflow replacing the decompressing step with an update to Krakentool, and also pushing the latest updates to the workflows, still the planemo tests fails for the same reasons, I need help with that * renaming all inputs and outputs to include no spaces and used underscore instead and low caps to solve the linting issue, and added more description for the all samples readme file * changing the tested files to be the last file in the collection, to solve the testing issues, since the github tests always compare the tested file with the last file in the collection regardless of the file name * removing the word collection from the folder names, since all workflows in this PR work with collection no need to specify that in the folder name * testing another file for SNP workflow to check the testing error * updating the SNP workflow to solve the testing issue * Second attempt to solve SNP test error * attempt 1 for solving the taxonomy profiling test failure * solving taxonomy profiling testing error * updating the general readme file to include the training material of the workflows * using MinusB standard Kraken2 database for taxonomy profiling, trying to solve the issue of database size, explained in the readme that database can be changed based on the input datasets within Galaxy * correcting a file name * correcing a typo * reducing test datasets samples size, by keeping only reads that are used to detect virulence factor later on with the workflows, and that to make tests faster * updating the preprocessing workflow to remove hosts sequences in a more relaiable way, and trying to solve the gene based workflow testing error on github * updating workflows attempt 1 to solve current errors * removing filter failed datasets tool, attempt to solve the testing failures of genebased pathogen detection and preprocessing workflows * removing attribute from the genebased test * trial to solve gene based workflow github test failure * attempt 2 to solve the tests failer * updating some workflows * correcting a preprocessing workflow test output file * trying to solve genebased test failure * another attempt to solve the testing error which states in the log as if is something wrong with line1 in the workflow ga file * another trial * solving the taxonomy profiling testing failure due to the standardPF database size, where we use now a input parameter instead, thank you so much Avatar * updating workflows based on our latest paper update April 2024 * updating workflows to include the tools latest version in Galaxy eu * correcting a test file * updating all readme files for workflows * adding workflow comments, arranging the main folder and workflows names, to make it opened for any microbiome workflows * editing readme and workflows namings and tags based on Berenice's comments * solving linting problem * updating few typos * removing a test file which is no longer produced by the updated workflow, solving the linting issue * i missed to change the test file correctly the last push, here we go this time :D * updated workflow reports, and adding the total number of reads before and after host removal to the Multiqc output of the preprocessing workflow * solving the error of MultiQC including the host reads removal details, now the table is included and multiq runs correctly :) * adding a step in the 5th workflow to remove all failed datasets from the collections comming from genebased pathogen identification, which may happen when no contigs are found by metaFlye for some samples * add more filteration steps of removing failed and empty datasets in collections, to protect the workflow from the carry on error that might occur for samples with fewer number of contigs, VF genes and AMR genes found * leaving the optional input to minimap2 selectng the samples profile to the user to choose * correcting a comment typo in the preprocessing workflow * adding a 5in1 workflow named PathoGFAIR that groups all other 5 workflows of PathoGFAIR in one workflow * removing the 5 in 1 workflow and adding tags to main workflow outputs to help users track their histories * correcting typos in readmes * removing hashtags from tags to make them normal ones not promoted ones * Apply suggestions from code review Co-authored-by: Marius van den Beek <m.vandenbeek@gmail.com> * applying all comments from Marius, changing all Readme's accordingly, and changing the Allele based workflow to be general not specific to Nanopore * updating readmes to explain more what to trying workflow mean * adding Zenodo links to all fastq or fasta files * updating Zenodo with the correct test file names * updating a test file name in the gene based test * correcting another test data names * updating resulted test files with the updated file names * correcting from path to location for the zenodo tags and adding the file types * nothing changed I am just trying again * changing file type for fastq files from fastq to fastqsanger * using the same fastq test file in Allele based, gene based and taxonomy profiling * updating Preprocessing workflow, to take the reference genome of the host as a user input, and updated the test accordingly * removing bar chart tool, to add the reports visualisations later * replacing the testing reference genome since ce6 is giving an error * Thanks to Björn he solved the module issue, I am trying now, hope it works this time :) * 2 docker containers have been fixed, thanks to Björn --------- Co-authored-by: Marius van den Beek <m.vandenbeek@gmail.com>
I tried to reduce the test-data in this PR, hope it works.
Thanks a lot,
Engy <3