HIBF improvement through saving ubiquitous only once #237
MitraDarja
started this conversation in
Ideas
There might be experiments where all bins contain a certain set of ubiquitous k-mers. These can be stored more efficiently in a simple lookup table. However, there might also be experiments with k-mers that are not present in all bins, but in all bins of one merged bin. This raises the question of whether such k-mers can be stored only once.
One possible way would be:
If a k-mer is contained in all bins of a merged bin, store it only on the level of the merged bin and not on any lower level. For the search this means: if a k-mer is found in a merged bin but in none of its bins on the lower level, it is treated as a ubiquitous k-mer and is therefore reported for all bins in the merged bin.
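The storage and search rule described above can be sketched with a toy model. This is only an illustration: plain Python sets stand in for the Bloom filters (so there are no false positives here), and the names `build_levels` and `query` are made up for this example and are not part of the HIBF code.

```python
def build_levels(lower_bins):
    """Split the k-mers of one merged bin into two levels.

    lower_bins: list of sets, one set of k-mers per lower bin.
    Returns (merged, lower): the merged bin holds the union of all k-mers,
    the lower bins hold everything except the ubiquitous k-mers.
    """
    ubiquitous = set.intersection(*lower_bins)  # present in every lower bin
    merged = set.union(*lower_bins)             # merged bin stores the union
    lower = [kmers - ubiquitous for kmers in lower_bins]
    return merged, lower


def query(kmer, merged, lower):
    """Return the indices of the lower bins that are reported for kmer."""
    if kmer not in merged:
        return []
    hits = [i for i, kmers in enumerate(lower) if kmer in kmers]
    # Found in the merged bin but in no lower bin: treat as ubiquitous.
    return hits if hits else list(range(len(lower)))


bins = [{"AAA", "CCC"}, {"AAA", "GGG"}, {"AAA"}]
merged, lower = build_levels(bins)
print(query("AAA", merged, lower))  # ubiquitous k-mer, reported for all bins
print(query("CCC", merged, lower))  # only in the first bin
```

With exact sets the rule is always correct. The accuracy question below arises only because the real lower bins can produce false positives.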
How would this approach impact the accuracy?
Let p_m be the probability of a false positive (FP) on the level where the merged bin is located, and p_l the probability of a FP on the lower level. The level containing the merged bin is called the merged level, and the one bin we are interested in here is called the merged bin. The bins inside the merged bin on the lower level are called lower bins.
If a k-mer is found as a true positive (TP) in the merged bin and as a FP in one of the lower bins, then only the bin with the FP reports the k-mer correctly; for all other bins the k-mer is reported as not present and is therefore a false negative (FN). This happens with a probability of p_m * p_l * (number of lower bins), as the probability of a FP in one lower bin is independent of the probability of a FP in another lower bin.
This probability grows quickly with the number of lower bins, so this approach on its own is not a good solution.
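To get a feeling for how fast this grows, the expression can be evaluated for a few bin counts. The sketch below computes the estimate from the post and, for comparison, the exact probability that at least one of n independent lower bins gives a FP, 1 - (1 - p_l)^n, which n * p_l approximates for small p_l. The function names and the values p_l = 0.05 and p_m = 1.0 are assumptions chosen only for illustration, not measurements.

```python
def post_estimate(p_m, p_l, n_lower):
    """Estimate from the post: p_m * p_l * (number of lower bins)."""
    return p_m * p_l * n_lower


def at_least_one_fp(p_l, n_lower):
    """Probability that at least one of n_lower independent bins gives a FP."""
    return 1.0 - (1.0 - p_l) ** n_lower


p_l = 0.05  # assumed FP rate on the lower level
for n_lower in (1, 8, 64):
    linear = post_estimate(1.0, p_l, n_lower)  # p_m set to 1.0 for comparison
    exact = at_least_one_fp(p_l, n_lower)
    print(f"n={n_lower:3d}  linear={linear:.3f}  exact={exact:.3f}")
```

For 64 lower bins and p_l = 0.05 the exact value is already above 0.9, while the linear expression exceeds 1 and can only serve as an approximation for small n * p_l.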
But maybe we can correct for this?
A k-mer is considered present in all lower bins if it is found in the merged bin and the number of lower bins reporting it is smaller than (number of lower bins) * p_l. This has the disadvantage that k-mers that are actually present in only a few lower bins are reported for all lower bins.
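The corrected rule can be written down as a small decision function. This is a minimal sketch under the assumption that the search already knows whether the merged bin reported the k-mer and which lower bins reported it; `report_bins` and its parameters are hypothetical names, not existing HIBF functions.

```python
def report_bins(found_in_merged, lower_hits, n_lower, p_l):
    """Decide which lower bins are reported for one k-mer.

    found_in_merged: True if the merged bin reports the k-mer.
    lower_hits: set of indices of lower bins that report the k-mer.
    n_lower: number of lower bins in the merged bin.
    p_l: FP probability on the lower level.
    """
    if not found_in_merged:
        return []
    # Fewer hits than expected from FPs alone: assume the k-mer is ubiquitous.
    if len(lower_hits) < n_lower * p_l:
        return list(range(n_lower))
    return sorted(lower_hits)


# 64 lower bins with p_l = 0.05 give a threshold of 3.2 hits.
print(len(report_bins(True, {5}, 64, 0.05)))   # one hit, reported for all 64 bins
print(report_bins(True, {1, 2, 3, 4}, 64, 0.05))  # four hits, reported as they are
```

The first call shows the disadvantage mentioned above: a k-mer that is truly stored in a single lower bin cannot be distinguished from a ubiquitous k-mer with one FP and is reported for every bin. Note also that whenever n_lower * p_l is below 1, the rule reduces to the uncorrected one, since only zero hits fall under the threshold.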
Alternatively, each merged bin could have its own lookup table, but this seems like a lot of overhead.
Any other ideas on how ubiquitous k-mers within a merged bin could be stored only once?