ADD: cluster and reverse complement correction for resfinder #38

Merged


@Vedanth-Ramji commented Apr 21, 2024

  • Resfinder has gene clusters which can't be passed through RGI using 'contig' mode.
  • Gene clusters were identified and were manually assigned ARO numbers.
  • 40 gene clusters present.
  • 9 genes in reverse complement form also present.
  • RC genes were all manually curated.
  • Delete get_data_path function

- Resfinder has gene clusters which can't be passed through RGI using 'contig' mode.
- Gene clusters were identified and were manually assigned ARO numbers.
- A separate file with manual curation for gene clusters and RCs was created, and their AROs were updated after concatenating RGI results with the genes not in RGI results.
- 40 gene clusters present.
- 4 genes in reverse complement form are also present. blaBIM-1_1_CP016446 and mph(D)_1_AB048591 were not found in CARD and were given parent ARO mappings. RGI correctly assigned ARO numbers to the other two.
- Corrected erm(X)_1_M36726 ARO mapping
@luispedro left a comment

Thanks.

It seems, however, that this could have been done without special-casing resfinder? Why is this different from the other manual curation files?

Review comment on argnorm/data/cluster_rc_correction/notes.md (outdated, resolved)
luispedro commented Apr 21, 2024

The code for actually running rgi correctly for resfinder (and the output thereof) is still missing as well.

- Resfinder consists of coding sequences. The data wasn't handled properly before because contig mode was used when passing coding sequences to RGI. Now the amino acid version of Resfinder is used with protein mode when running the database through RGI.
- Resfinder AA file is generated using biopython from nucleotide file (AA file not found online).
- 9 RC genes were found (previously 4 were found). All were manually curated as RC versions couldn't be translated properly into AA sequences.
- Documentation updated in changelog to reflect AA version being used for Resfinder and gene cluster handling
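
To illustrate why a reverse-complemented entry cannot be translated directly, here is a minimal standard-library sketch. It is not the code from this PR (which uses Biopython for translation and manual curation for the RC genes); the function names and the toy sequence are hypothetical. The check asks whether a sequence reads as a clean ORF on its given strand, or only after reverse-complementing.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq):
    """Translate frame 0 of a DNA sequence; unknown codons become 'X'."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def looks_like_cds(seq):
    """Heuristic: starts with M, ends with a stop, and has no internal stop."""
    protein = translate(seq)
    return (
        len(seq) % 3 == 0
        and protein.startswith("M")
        and protein.endswith("*")
        and "*" not in protein[:-1]
    )


def is_reverse_complemented(seq):
    """True if the sequence only reads as a CDS after reverse-complementing."""
    return not looks_like_cds(seq) and looks_like_cds(reverse_complement(seq))


# Toy example (made-up sequence, not a Resfinder entry).
forward = "ATGGCCAAATAA"
flipped = reverse_complement(forward)
print(translate(forward), translate(flipped), is_reverse_complemented(flipped))
```

The heuristic is deliberately strict: genes with alternative start codons such as GTG or TTG would fail `looks_like_cds` on both strands, so a check like this can only flag candidates for manual review.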
@Vedanth-Ramji commented

I just added the code for using the AA version of resfinder.

> It seems, however, that this could have been done without special-casing resfinder? Why is this different from the other manual curation files?

The other manual curation files are adding genes that are not present in the mapping tables (i.e. RGI can't map them automatically). The new 'correction' files are changing mappings that are already present in the mapping tables (i.e. correcting RGI as it is not detecting a gene cluster or reverse complement properly and mapping to a wrong ARO number).

@luispedro commented

> The other manual curation files are adding genes that are not present in the mapping tables (i.e. RGI can't map them automatically). The new 'correction' files are changing mappings that are already present in the mapping tables (i.e. correcting RGI as it is not detecting a gene cluster or reverse complement properly and mapping to a wrong ARO number).

I know this, but I don't see why it matters for my question. Why does the code need to special-case resfinder? Isn't this just a different way in which the files were generated, so the same code can handle both?

- Merged gene cluster & RC annotation with other manual curation annotations.
- Moved notes for gene cluster & RC annotation to README.md in manual curation directory.
- Removed hardcoded path in fna_to_faa in crude_db_harmonisation.py and made it general
Review comments on argnorm/lib.py (two threads, outdated, resolved)
@luispedro left a comment

I still want to remove the special casing of resfinder

- Note: the code that integrates manual curation now removes duplicate ARO mappings. This corrected a MEGARes annotation (GMGC10.027_903_362.EMRE) which had a one-to-many ARO mapping. Better manual curation for MEGARes will arrive in the version after v0.3.0, when MEGARes will be checked for CDSs, gene clusters and RC genes.
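
One way to picture the behaviour described in the note above is the following minimal sketch. It is an assumption about the general approach, not the actual argNorm implementation: the function name, the `(gene ID, ARO)` row layout, and all identifiers in the example are hypothetical. Manual curation entries override automatic ones, and each gene ends up with exactly one ARO, so the same code path covers both added genes and corrected mappings.

```python
def integrate_manual_curation(rgi_rows, manual_rows):
    """Combine automatic and manual (gene_id, aro) rows into one ARO per gene.

    Assumptions (hypothetical, for illustration only):
    - when RGI reports several AROs for a gene, the first one seen is kept;
    - a manual curation row always takes precedence over an automatic row.
    """
    merged = {}
    for gene_id, aro in rgi_rows:
        # setdefault keeps the first automatic hit and drops later duplicates.
        merged.setdefault(gene_id, aro)
    for gene_id, aro in manual_rows:
        # Manual curation both adds missing genes and corrects existing ones.
        merged[gene_id] = aro
    return merged


# Made-up identifiers; these are not real database entries or ARO numbers.
rgi_rows = [
    ("geneA", "ARO:0000001"),
    ("geneB", "ARO:0000002"),
    ("geneB", "ARO:0000003"),  # one-to-many mapping from the automatic step
]
manual_rows = [
    ("geneB", "ARO:0000009"),  # correction of an existing mapping
    ("geneC", "ARO:0000004"),  # gene missing from the automatic results
]
mapping = integrate_manual_curation(rgi_rows, manual_rows)
print(mapping)
```

Keying the result by gene ID is what guarantees a one-to-one mapping; whether to keep the first automatic hit or the best-scoring one is a separate design choice that this sketch does not try to settle.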
@luispedro merged commit de775cd into BigDataBiology:main Apr 25, 2024
6 checks passed
@Vedanth-Ramji deleted the resfinder_cluster_annotation branch April 25, 2024 10:39