Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add microbiome analysis workflows #182

Merged
merged 71 commits into from
Jun 25, 2024

Conversation

EngyNasr
Copy link
Contributor

@EngyNasr EngyNasr commented Mar 7, 2023

I tried to reduce the test-data in this PR, hope it works.

Thanks a lot,
Engy <3

@mvdbeek
Copy link
Member

mvdbeek commented Mar 7, 2023

Thanks, can you add a README, Changelog and dockstore.yml files ? (https://github.com/galaxyproject/iwc/blob/main/workflows/README.md#structure-of-the-directory)

@EngyNasr
Copy link
Contributor Author

@mvdbeek did I miss something else ?

Thanks a lot for helping me out :)

@EngyNasr
Copy link
Contributor Author

@wm75 Can you help me revising and merging this PR
Thanks a lot <3

@EngyNasr
Copy link
Contributor Author

EngyNasr commented Mar 27, 2023

To do as discussed with @wm75 :

1- remove "latest" from workflow names
2- keep only one file to test for all the workflows, except the pre-processing
3- pre-processing tests we can use the txt file to compare content
4- make a readme inside each folder as well
5- make the changelog only inside each folder
6- remove the big docker.yml in the main folder and keep only the one in every folder

@EngyNasr
Copy link
Contributor Author

@bebatut I have added the 5 workflows for the single samples run and the 4 workflows for the collection of samples run so in total 9 workflows with their test data, I have chosen the minimum size sample data which contain VFs, Contigs, etc. the maximum file size is 50Mb, but the other files are either in Bytes or Kbs

@wm75
Copy link
Contributor

wm75 commented Jun 13, 2023

@EngyNasr @bebatut I don't see much value in offering the single-sample workflows, when collection-based flavors exist that could be run with 1-element collections. Having the single-sample WFs published would just mean more maintenance and synchronization efforts.
Or is there anything that can be achieved with the single-sample versions, that the collection-based versions don't do?

@wm75
Copy link
Contributor

wm75 commented Jun 13, 2023

@EngyNasr can you please run some json reformatter tool over your workflows. Single-line JSON is just not very nice to review and prevents meaningful diffs. Use, e.g., python3 -m json.tool collection-version/pathogen-detection-nanopore-pre-processing-collection/Pathogen-Detection-Nanopore-Pre-Processing-collection.ga collection-version/pathogen-detection-nanopore-pre-processing-collection/Pathogen-Detection-Nanopore-Pre-Processing-collection_pretty.ga or any other tool you like.

Copy link
Contributor

@wm75 wm75 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments:

  • At least some of your workflows are lacking a release attribute, which you need to add manually.
  • In the Preprocessing workflow, the conversion of fastq.gz to plain fastq, just for the purpose of filtering reads by their ID, is rather unfriendly for a user's quota.
    There is toolshed.g2.bx.psu.edu/repos/iuc/seqtk/seqtk_subseq/1.3.1 which can filter compressed fastqs directly (and which is probably faster in all cases). It's downside is that it only keeps matching IDs, but can't discard them, or write both to separate files (like toolshed.g2.bx.psu.edu/repos/peterjc/seq_filter_by_id/seq_filter_by_id/0.2.7).
    So if you want the non-host reads, you'll have to invert the action of Filter Tabular at the step before.
    If you really need also the host reads as a separate file (which I'm not entirely convinced of), you would have to run Filter Tabular and seqtk subseq twice, but even that might still be better than the current way?

@EngyNasr
Copy link
Contributor Author

@EngyNasr @bebatut I don't see much value in offering the single-sample workflows, when collection-based flavors exist that could be run with 1-element collections. Having the single-sample WFs published would just mean more maintenance and synchronization efforts. Or is there anything that can be achieved with the single-sample versions, that the collection-based versions don't do?

it was just the old way we used to do the analysis and we use these workflows in the current training material, thats why we wanted to have both as two versions of the workflow. but definitely they are useless now since the collection version does the same exact job, but it will never take a single file it has to always be a collection

@paulzierep
Copy link

This tool should also work here: toolshed.g2.bx.psu.edu/repos/iuc/krakentools_extract_kraken_reads/krakentools_extract_kraken_reads/1.2+galaxy0
Can use fastq.gz

…sing, to be added once the PR of the tool update is merges
@EngyNasr
Copy link
Contributor Author

EngyNasr commented Jun 22, 2023

I need help in tests @wm75 @paulzierep @bebatut :

I noticed that the tool id for these tools is not like the rest of the tools e.g. toolshed.g2.bx.psu.edu/repos/iuc/bedtools/bedtools_getfastabed/2.30.0+galaxy1

Is there a way to solve that for these tests to succeed?

…with an update to Krakentool, and also pushing the latest updates to the workflows, still the planemo tests fails for the same reasons, I need help with that
@mvdbeek
Copy link
Member

mvdbeek commented Jun 28, 2023

@EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work.

@EngyNasr
Copy link
Contributor Author

@EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work.

thank you so much :)

@EngyNasr
Copy link
Contributor Author

EngyNasr commented Jun 28, 2023

@EngyNasr once galaxyproject/planemo#1377 is merged and a new version is released it should work.

@mvdbeek, Is it possible that it is also done for FILTER_EMPTY_DATASETS ?

@bebatut
Copy link
Member

bebatut commented Jun 29, 2023

I think it was closed by mistake

@bebatut bebatut reopened this Jun 29, 2023
@mvdbeek
Copy link
Member

mvdbeek commented Jun 29, 2023

Looks like it worked and you only need to work on your test assertions.

@wm75
Copy link
Contributor

wm75 commented Jun 30, 2023

@EngyNasr two questions on the latest preprocessing version:

  • why is the host ID now not a WF parameter anymore? You could use the "Map parameter value" expression tool to map a user-selected species name to the required taxid.
  • is there a specific reason why you're using fastqc, when you're already running fastp? MultiQC can also visualize fastp JSON output dataset and it doesn't look too different from what two FastQC results (before and after trimming).

@wm75
Copy link
Contributor

wm75 commented Jun 30, 2023

The Pathogen-Detection-Nanopore-All-Samples-Analysis WF needs much better annotations and input dataset labels to make it understandable what it is good for. Right now, understanding the purpose without knowing the tutorial is really hard. Even the README file only talks about VF genes, but many people wouldn't know what that is.

@mvdbeek mvdbeek changed the title adding microbiome analysis workflows to IWC with test data Add microbiome analysis workflows Jun 20, 2024
@bebatut bebatut enabled auto-merge (squash) June 20, 2024 10:03
mvdbeek
mvdbeek previously approved these changes Jun 20, 2024
auto-merge was automatically disabled June 20, 2024 15:15

Head branch was pushed to by a user without write access

mvdbeek
mvdbeek previously approved these changes Jun 24, 2024
@mvdbeek mvdbeek enabled auto-merge (squash) June 24, 2024 11:59
auto-merge was automatically disabled June 24, 2024 13:19

Head branch was pushed to by a user without write access

@EngyNasr
Copy link
Contributor Author

@bebatut @mvdbeek @wm75 Its greeeeeeeeeeen <3 <3 <3 finallyyyyyy

Thanks a lot @bgruening

Thank you all

@EngyNasr
Copy link
Contributor Author

Shall we merge them, and add the bar charts as an update after the visualisation is fixed?

Copy link
Member

@bebatut bebatut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge and improve in a follow-up PR

@bebatut bebatut merged commit b8dac7d into galaxyproject:main Jun 25, 2024
9 checks passed
dannon pushed a commit to dannon/iwc that referenced this pull request Sep 10, 2024
* adding microbiome analysis workflows to IWC with test data

* adding Changelog, REadme and dockstore yml file

* solving linting issues

* solving linting error by correcting the file name since i forgot the underscore in its name before

* applying all comments

* adding workflows for the collection version

* applying wolfgang comments, still removing the decompress tool is missing, to be added once the PR of the tool update is merges

* updating the preprocessing workflow replacing the decompressing step with an update to Krakentool, and also pushing the latest updates to the workflows, still the planemo tests fails for the same reasons, I need help with that

* renaming all inputs and outputs to include no spaces and used underscore instead and low caps to solve the linting issue, and added more description for the all samples readme file

* changing the tested files to be the last file in the collection, to solve the testing issues, since the github tests always compare the tested file with the last file in the collection regardless of the file name

* removing the word collection from the folder names, since all workflows in this PR work with collection no need to specify that in the folder name

* testing another file for SNP workflow to check the testing error

* updating the SNP workflow to solve the testing issue

* Second attempt to solve SNP test error

* attempt 1 for solving the taxonomy profiling test failure

* solving taxonomy profiling testing error

* updating the general readme file to include the training material of the workflows

* using MinusB standard Kraken2 database for taxonomy profiling, trying to solve the issue of database size, explained in the readme that database can be changed based on the input datasets within Galaxy

* correcting a file name

* correcing a typo

* reducing test datasets samples size, by keeping only reads that are used to detect virulence factor later on with the workflows, and that to make tests faster

* updating the preprocessing workflow to remove hosts sequences in a more relaiable way, and trying to solve the gene based workflow testing error on github

* updating workflows attempt 1 to solve current errors

* removing filter failed datasets tool, attempt to solve the testing failures of genebased pathogen detection and preprocessing workflows

* removing attribute from the genebased test

* trial to solve gene based workflow github test failure

* attempt 2 to solve the tests failer

* updating some workflows

* correcting a preprocessing workflow test output file

* trying to solve genebased test failure

* another attempt to solve the testing error which states in the log as if is something wrong with line1 in the workflow ga file

* another trial

* solving the taxonomy profiling testing failure due to the standardPF database size, where we use now a input parameter instead, thank you so much  Avatar

* updating workflows based on our latest paper update April 2024

* updating workflows to include the tools latest version in Galaxy eu

* correcting a test file

* updating all readme files for workflows

* adding workflow comments, arranging the main folder and workflows names, to make it opened for any microbiome workflows

* editing readme and workflows namings and tags based on Berenice's comments

* solving linting problem

* updating few typos

* removing a test file which is no longer produced by the updated workflow, solving the linting issue

* i missed to change the test file correctly the last push, here we go this time :D

* updated workflow reports, and adding the total number of reads before and after host removal to the Multiqc output of the preprocessing workflow

* solving the error of MultiQC including the host reads removal details, now the table is included and multiq runs correctly :)

* adding a step in the 5th workflow to remove all failed datasets from the collections comming from genebased pathogen identification, which may happen when no contigs are found by metaFlye for some samples

* add more filteration steps of removing failed and empty datasets in collections, to protect the workflow from the carry on error that might occur for samples with fewer number of contigs, VF genes and AMR genes found

* leaving the optional input to minimap2 selectng the samples profile to the user to choose

* correcting a comment typo in the preprocessing workflow

* adding a 5in1 workflow named PathoGFAIR that groups all other 5 workflows of PathoGFAIR in one workflow

* removing the 5 in 1 workflow and adding tags to main workflow outputs to help users track their histories

* correcting typos in readmes

* removing hashtags from tags to make them normal ones not promoted ones

* Apply suggestions from code review

Co-authored-by: Marius van den Beek <m.vandenbeek@gmail.com>

* applying all comments from Marius, changing all Readme's accordingly, and changing the Allele based workflow to be general not specific to Nanopore

* updating readmes to explain more what to trying workflow mean

* adding Zenodo links to all fastq or fasta files

* updating Zenodo with the correct test file names

* updating a test file name in the gene based test

* correcting another test data names

* updating resulted test files with the updated file names

* correcting from path to location for the zenodo tags and adding the file types

* nothing changed I am just trying again

* changing file type for fastq files from fastq to fastqsanger

* using the same fastq test file in Allele based, gene based and taxonomy profiling

* updating Preprocessing workflow, to take the reference genome of the host as a user input, and updated the test accordingly

* removing bar chart tool, to add the reports  visualisations later

* replacing the testing reference genome since ce6 is giving an error

* Thanks to Björn he solved the module issue, I am trying now, hope it works this time :)

* 2 docker containers have been fixed, thanks to Björn

---------

Co-authored-by: Marius van den Beek <m.vandenbeek@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants