Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Version 1.0.8 #31

Merged
merged 48 commits into from
Feb 23, 2024
Merged

Conversation

alephnull7
Copy link
Collaborator

@alephnull7 alephnull7 commented Feb 17, 2024

  • The first main change in the package consists of continued refactoring, specifically in treating related data as a composite object when possible. As a result, all GenBank data used within PACVr is now contained in a single object, both versions of the coverage data (GenomicAlignments::coverage() and the derivative from CovCalc()) are fields of the coverage object, and input parameters related to plotting are contained in plotSpecs. I am applying some object-oriented principles, such as in the validation of PACVr.complete() parameters in both getAnalysisSpecs() and getPlotSpecs(). However, the use cases of these objects are fairly simple, partially due to only one instance of each existing during the execution of PACVr.complete(), so I have not yet defined them in one of R's object-oriented systems like S3.
  • When running some test data, specifically an annotated GenBank file for NC_009143, I noticed the possibility and lack of handling for a feature with multiple qualifiers of the same name. Originally, I considered combining this information into a list. Due to some of the filtering operations done in the package involving regex matching, they are concatenated into a character string instead.
  • Experiences with other test data have led to note qualifiers being unnecessary for the execution of standard coverage analysis, and allowing the sample name in BAM file to match either VERSION or ACCESSION of GenBank file for verbose analysis.
  • When attempting to perform the IR presence test, if no feature matches are found, instead of leading to an error, this failure is indicated, and execution continues using the unpartitioned source as a single region to analyze.
  • Greater validation of the output parameter is performed, and PNG file support has been added.
  • Within getCovDepth(), lowCoverage in the summary table has been renamed lowCovWin_abs, and a statistic corresponding to lowCovWin_abs/regionLen named lowCovWin_relToRegionLen has been added. An additional row for <sample_name>_coverage.summary.regions.tsv has been added to the verbose output, Complete_genome, with coverage stats corresponding to the sums for regionLen and lowCovWin_abs. Note that the latter item is just a sum of the low coverage counts when each region is considered separately, not the low coverage count when considering the entire source. This aligns with the indicated requirements, but I wanted to mention this since the two sums typically differ. Since evenness is based upon the raw coverage data for sequences, this depth statistic for Complete_genome does use the coverage mean of the entire source in the calculation.

…`; removal of depreciated `PACVr.parseName()`; merger of `parseSource` and `PACVr.parseSource()`
@alephnull7 alephnull7 marked this pull request as draft February 20, 2024 21:18
@alephnull7 alephnull7 changed the title Refactored GenBank data flow; Refactored PACVr.visualizeWithRCircos(); Fix features with duplicate qualifier names; Expanded coverage summaries Version 1.0.8 Feb 22, 2024
@alephnull7 alephnull7 marked this pull request as ready for review February 22, 2024 23:20

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the general reduction of the number of variables (e.g., "gbkData", "analysisSpecs", and "plotSpecs" now contains various sub-variables, making it sufficient to only pass only these along instead of creating new variables)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementing the function PACVr.gbkData() is a great idea, especially the fact that the initially read data is cleaned quickly after parsing to avoid excessive memory usage (e.g., gc()).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this straightforward implementation of the calculation of the number of low coverage windows per region relative to the region size. dplyr for the win!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for implementing "combineDupQuals()". I had not even thought of the relevance of such a function, although I have not fully understood what it does.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sense behind "removeSmall" is as follows: if a region/gene/etc. were smaller than the window size specified by the user (default: 250 bp), then a coverage calculation may not make sense. Hence, the sizeThreshold should not be hard-coded but should take the value of windowSize if removeSmall=TRUE.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good improvements to handle cases when no input file specified or input file is empty.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, adding support for png-formatted output is helpful.

@michaelgruenstaeudl michaelgruenstaeudl merged commit 1dba485 into michaelgruenstaeudl:master Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants