-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multichrom example #2665
base: main
Are you sure you want to change the base?
Multichrom example #2665
Conversation
Great! I'll take a closer look in a bit. |
Super cool. Good to know about helgrind, too; I will look into that too! :-> |
I guess this is relevant to #176, and might inform how we deal with multiple chromosomes. |
As I understand it, we could choose to make the two things tied together – to make the separate tree sequences begin/end at chromosome boundaries – but we don't have to do that, and there are pretty strong reasons not to do that, perhaps. (For example, you want to be able to keep multiple tree sequences to speed up even single-chromosome models; and also, chromosomes will be of very different lengths, whereas you want to divide the work of simplification as evenly as you can across your multiple tree sequences; and also, the number of chromosomes you are simulating will often not match the number of threads you have available for simplification.) So I would tentatively lean towards not connecting these two things. |
I think it's not exclusive. If you want to represent multiple chromosomes, then this gives a nice way to do it. But there's not reason why you couldn't also do it on a single chromosome, or indeed on part of a chromosome within a multi-chromosome set. But I do take you point that there's an elegancy about keeping the things completely independent. |
This is really about low-level C API stuff here for now - how we might (or not) surface this to higher level things like representing multiple chromosomes is something we should wait and see (which is I think what we're all saying!). |
5df5775
to
8830fbb
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2665 +/- ##
==========================================
- Coverage 89.76% 89.76% -0.01%
==========================================
Files 30 29 -1
Lines 29892 29886 -6
Branches 5803 5803
==========================================
- Hits 26832 26826 -6
Misses 1753 1753
Partials 1307 1307
Flags with carried forward coverage won't be shown. Click here to find out more. |
8830fbb
to
8ba0448
Compare
@petrelharp @bhaller - I realised there was an easy way to join the tables by appending all the edges to the first table, sorting and then squashing adjacent edges. This should be equivalent to the haploid_wright_fisher example now (but that'll need to be verified). |
1 similar comment
@petrelharp @bhaller - I realised there was an easy way to join the tables by appending all the edges to the first table, sorting and then squashing adjacent edges. This should be equivalent to the haploid_wright_fisher example now (but that'll need to be verified). |
Ah, probably need to set the sample flags for the final generation. |
One more update - I've marked the sample flags and updated the original WF example so that it also indexes the output, and I think the outputs are now identical up to node reordering:
and the second
We don't reorder the samples from 0 - n in the new version. If you do want this you can run simplify on it again and it'll renumber the output. (You could do this more cheaply too if you wanted to for compatibility purposes.) |
I think we can meaningfully compare the timings of this with haploid_wright_fisher now @petrelharp and @bhaller |
I think I know what's going on with the dissappointing scaling here. Here's an example from running the code:
The numbers for the multichromosome example are numbers of edges before and after simplification. Notice that each worker has about one-half the total number of edges: there were 20M edges for the whole simulation (2 edges * 10K individuals * 1K generations); and each of the 8 workers have 10M edges to deal with. This is because: suppose that a chromosome has been inherited in Now, what does simplify have to do under the hood? It's got a data structure that records the collection of ancestral segments that it's tracking; and then it goes through edges one at a time and deals with them. This means that if the cost of dealing with each edge is O(1), then our maximal speedup would be However, the work required by simplify's data structure does scale with the number of segments it is tracking, to some degree, since although simplify has to look at all those edges, it only has to actually do something when it encounters an edge that some of its segments move along. If this was a substantial contribution to runtime then we might expect better speedups. The things we'd need to know are (a) what portion of simplify's work is "do something to the segments" versus "look at the next edge"; and (b) once we divide up a chromosome into chunks, how much does that reduce the proportion of its edges through which a segment of ancestry passes? I'm not very clear on the point, but I suspect the answer to (b) is "not much"; if so, then there's no hope for a speedup. However, this approach would work just fine if each worker got literally one chromosome, since then edge breakpoints would usually (well, half the time) fall exactly on the divisions between workers. I may do a modification to the code here to have a look at that case, and see if we have wins there. |
Aha! That's great @petrelharp, it really clarifies things 👍 |
I'm thinking it's worth fixing up the multichrom example to merge in? @bhaller is enthusiastic about getting this into SLiM at some point. I think this just will require switching over |
OK, I'll fix it up when I get a chance. |
bdf4c9e
to
3eb2be0
Compare
I've updated this so that it's a clean build against upstream now @petrelharp. I think it needs a bit of documentation though - I don't really follow what the algorithm is doing. Could you update with some comments to help clarify that the model is, etc? |
3eb2be0
to
e51c6c2
Compare
I think the docs build is because of a stale cache. The simplest thing is probably to create a new PR. I've squashed my original commits down to one. |
I'm away on vacation for a couple of weeks, but:
|
Hi @benjeffery and @jeromekelleher. Do you really want to close this? It remains a useful example; I'm going to be consulting it carefully in the multichromosome work I'm doing in SLiM right now, in fact, and I can imagine other folks finding it interesting and useful too. I'm not saying you need to merge it necessarily, now or perhaps ever, but if it's closed, it seems to me that a fair bit of very useful knowledge and discussion will become much harder to find. Reopening so this comment doesn't get lost, but if you really want to close it, I won't get into a close/reopen war with you. :-> |
It's fairly esoteric knowledge @bhaller which I'm not sure is worth keeping a PR open in perpetuity for. We've learned the key lessons from it, I think, and moved on? |
But anybody who wants to actually use tskit in a multithreaded manner (which would be me, and very probably others, sooner or later) will continue to find it useful. Which is the point of an example, isn't it? Why purge it from the knowledge pool? |
If it's useful we should merge it I guess - I didn't think too hard about it tbh. Keeping it here in limbo isn't an answer. Does anyone want to have a look over and make suggestions for what would be needed to make this a useful documentation example? |
I nominate @petrelharp :->. In all seriousness, he knows the tskit ecosystem way better than I do, knows your doc system, and has been involved in this all along. Can I volunteer you, Peter? :-> |
I think the proposal is to add something to the docs, here. I could do that, sure. I'm imagining just talking through, like "we can have the tables for each chromosome share the same node and individual tables; here's how we do that". |
I can finish it up too @petrelharp, if you'd prefer to have a look through and post some review comments. |
Go for it! |
Stacked on #2663 (and #2619)@petrelharp @bhaller @molpopgen You might be interested in this: the new file
c/examples/multichrom_wright_fisher.c
gives a simple example of multi chromosome simulations with parallel simplify (most of the other stuff is from other in-flight PRs)This adds a simple example of running a multi chromosome forwards simulation and running the sort/simplify steps in parallel using pthreads. This was really just to make sure that the basic idea of share the node table across the different table collections would work, and that we were getting the details right with the new TSK_SIMPLIFY_NO_FILTER_NODES and TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS.
Running this through helgrind gives no errors (I think it's pretty good at picking up data races):
So, hooray, @molpopgen's ingenious idea works and with the infrastructure in #2663 and #2619 we can do it pretty easily!
Another thing I ended up doing here in 31c47e4 is to add a basic implementation of "delete rows", which is an operation we badly need in the API I think (mentioned here: #1034 (comment)). I think it's worth talking about that separately, so I'll open an issue for it.
We can probably let this sit for a while now anyway, and should try to clear up some other stuff first.