Add pyani subsample command #135

Open
widdowquinn opened this issue Jun 21, 2019 · 0 comments
Labels
enhancement something we'd like pyani to do that it doesn't already interface issues related to how the user tells pyani to do something method the issue relates to how results are calculated

We could have a pyani subsample subcommand that can populate a new input directory of genomes. This would be useful when the total number of available genomes for analysis is large.

The kind of structure that the command could take would be along the lines of:

  • pyani subsample - basic subcommand
  • -n --num_genomes - total number of genomes to subsample
  • --balance_classes - if not set, the genomes are selected randomly from those in the input directory; if set, then an attempt is made to balance each class.
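
A minimal sketch of how the options listed so far might be wired up, assuming an `argparse` subparser layout. The function name `build_parser`, the help strings, and making `-n` required are illustrative assumptions, not part of pyani's existing interface:

```python
import argparse


def build_parser():
    """Build a hypothetical parser exposing a `subsample` subcommand."""
    parser = argparse.ArgumentParser(prog="pyani")
    subparsers = parser.add_subparsers(dest="command")

    # Hypothetical subcommand; option names follow the proposal above
    sub = subparsers.add_parser(
        "subsample", help="populate a new input directory with a subsample"
    )
    sub.add_argument(
        "-n",
        "--num_genomes",
        type=int,
        required=True,  # assumption: the proposal does not say whether -n is mandatory
        help="total number of genomes to subsample",
    )
    sub.add_argument(
        "--balance_classes",
        action="store_true",
        help="attempt to balance the number of genomes drawn from each class",
    )
    return parser


args = build_parser().parse_args(["subsample", "-n", "50", "--balance_classes"])
print(args.command, args.num_genomes, args.balance_classes)
```

Leaving `--balance_classes` as a plain flag keeps random selection as the default, which matches the "if not set" behaviour described above.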

The way balancing might work is as follows: say there are 200 genomes, and you want to subsample 50. If there are two classes with 100 members each, we'd want 25 from each - a random sampling within each class would be fine. But if there are two classes with 190 and 10 members, we could only balance up to 20 genomes (10 from each class) - so we'd either have to warn that a 50-genome outcome is unbalanced, or restrict the subsample to 10 randomly-selected genomes from each class. So we might want another argument:

  • --enforce_balance - which enforces equal numbers from each class. So if there are $k$ classes where the smallest class has $m$ members, the total number of genomes subsampled is $k \times m$.
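
The arithmetic behind the two-class example can be sketched as follows. The helper name `balanced_quota` and the mapping of class label to class size are hypothetical, and the sketch assumes the request is rounded down so that the total never exceeds the requested number:

```python
def balanced_quota(class_sizes, n):
    """Return (per_class, total) for an equal-sized draw from every class.

    class_sizes: mapping of class label -> number of genomes in that class
    n: requested subsample size
    """
    k = len(class_sizes)              # number of classes
    m = min(class_sizes.values())     # size of the smallest class
    per_class = min(n // k, m)        # cannot draw more than m from any class
    return per_class, per_class * k


print(balanced_quota({"A": 100, "B": 100}, 50))  # (25, 50)
print(balanced_quota({"A": 190, "B": 10}, 50))   # (10, 20)
```

In the second case the smallest class caps the draw at 10 per class, so a request for 50 yields at most $k \times m = 20$ genomes.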

This would provide three ways of getting a subsample of size $n$ from the original set:

  1. randomly subsample $n$ genomes (and hope you cover all your classes)
  2. make a best effort to balance classes, recognising that there may be some poorly-represented classes; this could be implemented as sampling without replacement from each class in turn until we reach $n$. To avoid systematic bias (and to avoid having to require $n > k$, where there are $k$ classes) we should shuffle the order of class-sampling at each round.
  3. enforce balancing: select the multiple of $k$ nearest to $n$ (call it $pk$) that does not exceed $k \times m$ (where $m$ is the smallest class size), and randomly subsample $p$ genomes within each class.
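
The three strategies above could be sketched roughly as below. This is an illustration under stated assumptions rather than pyani code: the function names are hypothetical, `classes` is assumed to be a dict mapping a class label to a list of genome identifiers, and strategy 3 rounds $n$ down to a multiple of $k$ instead of choosing the strictly nearest multiple.

```python
import random


def subsample_random(classes, n, rng):
    """Strategy 1: ignore class labels and draw n genomes at random."""
    pool = [g for members in classes.values() for g in members]
    return rng.sample(pool, min(n, len(pool)))


def subsample_best_effort(classes, n, rng):
    """Strategy 2: one genome per class per round, class order reshuffled each round."""
    # Pre-shuffle each class so that pop() is a draw without replacement
    remaining = {
        label: rng.sample(members, len(members))
        for label, members in classes.items()
    }
    chosen = []
    while len(chosen) < n and any(remaining.values()):
        order = [label for label, members in remaining.items() if members]
        rng.shuffle(order)  # avoid always favouring the same classes
        for label in order:
            if len(chosen) == n:
                break
            chosen.append(remaining[label].pop())
    return chosen


def subsample_enforced(classes, n, rng):
    """Strategy 3: draw p genomes from every class, with p * k <= min(n, k * m)."""
    k = len(classes)
    m = min(len(members) for members in classes.values())
    p = min(n // k, m)  # assumption: round down rather than to the nearest multiple
    return [g for members in classes.values() for g in rng.sample(members, p)]


# Hypothetical input: two classes with 190 and 10 members
classes = {
    "A": [f"A{i}" for i in range(190)],
    "B": [f"B{i}" for i in range(10)],
}
rng = random.Random(0)
print(len(subsample_random(classes, 50, rng)))       # 50
print(len(subsample_best_effort(classes, 50, rng)))  # 50 (all 10 from B, 40 from A)
print(len(subsample_enforced(classes, 50, rng)))     # 20 (10 from each class)
```

Passing an explicit `random.Random` instance keeps the subsample reproducible for a given seed, which may be useful if the selection needs to be repeated.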
@widdowquinn widdowquinn added the enhancement something we'd like pyani to do that it doesn't already label Jun 21, 2019
@widdowquinn widdowquinn self-assigned this Jun 21, 2019
@widdowquinn widdowquinn modified the milestones: 0.3.0, 0.3.1 May 28, 2020
@widdowquinn widdowquinn added interface issues related to how the user tells pyani to do something method the issue relates to how results are calculated labels May 29, 2020