Add `pyani subsample` command #135

widdowquinn · 2019-06-21T10:12:47Z

We could have a `pyani subsample` subcommand that can populate a new input directory of genomes. This would be useful when the total number of available genomes for analysis is large.

The command's structure could be along the lines of:

- `pyani subsample`: basic subcommand
- `-n`, `--num_genomes`: total number of genomes
- `--balance_classes`: if not set, the genomes are selected randomly from those in the input directory; if set, then an attempt is made to balance each class.
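As a rough illustration of the interface listed above, here is a minimal `argparse` sketch. It is not pyani's actual parser code: the `build_parser` function and the positional `indir`/`outdir` arguments are assumptions made for this example, and only `-n`/`--num_genomes` and `--balance_classes` come from the proposal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a hypothetical `subsample` subcommand."""
    parser = argparse.ArgumentParser(prog="pyani")
    subparsers = parser.add_subparsers(dest="command")

    sub = subparsers.add_parser(
        "subsample", help="populate a new input directory with a subsample"
    )
    # Hypothetical positionals: where genomes are read from and written to.
    sub.add_argument("indir", help="directory of available genomes")
    sub.add_argument("outdir", help="new input directory to populate")
    # Flags taken from the proposal.
    sub.add_argument(
        "-n", "--num_genomes", type=int, required=True,
        help="total number of genomes to subsample",
    )
    sub.add_argument(
        "--balance_classes", action="store_true",
        help="attempt to balance the subsample across classes",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["subsample", "-n", "50", "--balance_classes", "genomes", "subset"]
    )
    print(args.num_genomes, args.balance_classes)
```

With `action="store_true"`, leaving `--balance_classes` off gives the plain random selection described above.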

The way balancing might work is as follows: say there are 200 genomes, and you want to subsample 50. If there are two classes with 100 members each, we'd want 25 from each, and a random sample from each class would be fine. But if there are two classes with 190 and 10 members, we could only balance up to 20 genomes (10 from the group with 10, 10 from the group with 190). So we'd either have to warn that the outcome was unbalanced, or we'd only be able to return a balanced set of 10 randomly-selected genomes from each class. So we might want another argument:

- `--enforce_balance`: enforces equal numbers from each class. So if there are $k$ classes where the smallest class has $m$ members, the total number of genomes subsampled is $k \times m$.
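The $k \times m$ figure can be checked against the two cases in the example above. This is a small sketch only; the function name `enforced_total` and the dictionary of class sizes are assumptions made for illustration.

```python
from typing import Dict


def enforced_total(class_sizes: Dict[str, int]) -> int:
    """Largest fully balanced subsample: k classes times the smallest class size m."""
    k = len(class_sizes)
    m = min(class_sizes.values())
    return k * m


print(enforced_total({"A": 100, "B": 100}))  # 200
print(enforced_total({"A": 190, "B": 10}))   # 20
```

In the 190/10 case the limit of 20 is below the requested 50, which is the situation where the user would need to be warned or the subsample reduced.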

This would provide three ways of getting a subsample of size $n$ from the original set:

1. randomly subsample $n$ genomes (and hope you cover all your classes)
2. make a best effort to balance classes, recognising that there may be some poorly-represented classes; this could be implemented as sampling without replacement from each class in turn until we reach $n$. To avoid systematic bias (and to avoid having to require $n > k$, where there are $k$ classes), we should shuffle the order in which classes are sampled at each round.
3. enforce balancing: select the multiple of $k$ ($pk$) nearest to $n$ that is less than or equal to $k \times m$ (where $m$ is the smallest class size), and randomly subsample $p$ genomes within each class.
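The three strategies above could be sketched as follows. This is an illustration under stated assumptions, not an implementation from pyani: the function names are hypothetical, classes are assumed to arrive as a dictionary mapping a class label to a list of genome names, and ties when picking the "nearest multiple" are rounded up.

```python
import random
from typing import Dict, List


def subsample_random(classes: Dict[str, List[str]], n: int, rng: random.Random) -> List[str]:
    """Strategy 1: draw n genomes at random, ignoring class membership."""
    pool = [g for members in classes.values() for g in members]
    return rng.sample(pool, min(n, len(pool)))


def subsample_best_effort(classes: Dict[str, List[str]], n: int, rng: random.Random) -> List[str]:
    """Strategy 2: take one genome per class per round, without replacement."""
    # Shuffled copy of each class, so pop() yields a random unused member.
    remaining = {c: rng.sample(members, len(members)) for c, members in classes.items()}
    chosen: List[str] = []
    while len(chosen) < n and any(remaining.values()):
        order = [c for c, members in remaining.items() if members]
        rng.shuffle(order)  # new class order each round to avoid systematic bias
        for c in order:
            if len(chosen) == n:
                break
            chosen.append(remaining[c].pop())
    return chosen


def subsample_enforced(classes: Dict[str, List[str]], n: int, rng: random.Random) -> List[str]:
    """Strategy 3: the same number p of genomes from every class."""
    k = len(classes)
    m = min(len(members) for members in classes.values())
    # p*k is the multiple of k nearest to n (ties rounded up), capped at k*m.
    p = min(m, (2 * n + k) // (2 * k))
    return [g for members in classes.values() for g in rng.sample(members, p)]


if __name__ == "__main__":
    rng = random.Random(0)
    classes = {
        "A": [f"a{i}" for i in range(190)],
        "B": [f"b{i}" for i in range(10)],
    }
    best = subsample_best_effort(classes, 50, rng)
    print(len(best), sum(g.startswith("b") for g in best))  # 50 10
    print(len(subsample_enforced(classes, 50, rng)))         # 20
```

For the 190/10 example, the best-effort strategy returns all 50 genomes but only 10 from the small class, while the enforced strategy returns 20 genomes, 10 from each class.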

widdowquinn added the `enhancement` label (something we'd like pyani to do that it doesn't already) Jun 21, 2019

widdowquinn self-assigned this Jun 21, 2019

widdowquinn modified the milestones: 0.3.0, 0.3.1 May 28, 2020

widdowquinn added the `interface` (issues related to how the user tells pyani to do something) and `method` (the issue relates to how results are calculated) labels May 29, 2020
