Switch to different finalisation mechanism #114
Conversation
This reverts commit a58c2b8.
#[cfg(profile_gc)]
#[derive(Debug)]
pub struct FinalizerInfo {
I guess this is more FinalizersInfo or GCProfileInfo or something like that?
Yes, good point. But this existed before; this commit has just moved it around. Shall I rename it?
I think renaming it would be a good idea, but you could do it in a subsequent PR. Up to you.
In 4d3bb73's commit message:
I don't understand the last sentence, as I think: a) the paragraph following it does seem to give a performance comparison vs. the previous version; and b) we have compared that previous version with single-threaded?
//
// This is considered a root for the Gc because the 'to-be-finalised' objects are otherwise
// unreachable and must be kept alive until their finaliser has run.
static FIN_Q: Mutex<VecDeque<Gc<()>>> = Mutex::new(VecDeque::new());
Doesn't this mean we're doing heap allocations during GC? I'm also not sure about the performance properties of VecDeque, particularly as I guess that we often shove huge numbers of objects in here. Update: ah, I see on line 225 that we ensure we don't enlarge it! What happens then?!
I also guess that we're going to be heavily punished for the Mutex here, as presumably parallel GC ends up locking/unlocking this heavily during particular parts of the GC cycle. I don't know if there's a good way of measuring that.
I don't think we should do this in this PR, but I think we could fairly easily do this in the next PR (sketched below):
- Each GC thread has a thread_local queue of objects to finalise. That could be a VecDeque, but (because of the "what happens if we run out of space in the VecDeque" problem) I think that might better be an mmaped block (or similar) under our control.
- At the end of the GC cycle (or when that local queue is full) we flush it to a global "queue of queues", which is something like Mutex<Vec<FinalisationQueue>>.

The aim here being that our number of locks goes from O(number_of_objects_to_finalise) to O(number_of_GC_threads).
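To make that concrete, here is a minimal sketch of the shape I have in mind. It is purely illustrative: the names (LOCAL_FIN_QUEUE, GLOBAL_FIN_QUEUES, flush_local_queue) are made up, object addresses are stored as plain usizes so the snippet stands alone, and a Vec stands in for the mmaped block:

```rust
use std::cell::RefCell;
use std::sync::Mutex;

// Hypothetical sketch only: names and types are illustrative, not Alloy's.
// Object addresses are stored as usize so the sketch compiles standalone;
// the real queue would hold something like Gc<()>.
type FinalisationQueue = Vec<usize>;

thread_local! {
    // Per-GC-thread buffer of objects awaiting finalisation. In the real
    // design this might be an mmaped block outside the GC heap so that it
    // can be grown safely; a Vec keeps the sketch short.
    static LOCAL_FIN_QUEUE: RefCell<FinalisationQueue> = RefCell::new(Vec::new());
}

// Global "queue of queues": one lock acquisition per GC thread per flush,
// rather than one per object to finalise.
static GLOBAL_FIN_QUEUES: Mutex<Vec<FinalisationQueue>> = Mutex::new(Vec::new());

// Called by a GC thread when it finds an object that needs finalising.
fn enqueue_for_finalisation(obj_addr: usize) {
    LOCAL_FIN_QUEUE.with(|q| q.borrow_mut().push(obj_addr));
}

// Called at the end of the GC cycle (or when the local queue is full): hand
// the whole local batch to the finaliser thread with a single locked push.
fn flush_local_queue() {
    LOCAL_FIN_QUEUE.with(|q| {
        let batch = std::mem::take(&mut *q.borrow_mut());
        if !batch.is_empty() {
            GLOBAL_FIN_QUEUES.lock().unwrap().push(batch);
        }
    });
}
```

The exact container doesn't matter; the point is that each GC thread batches locally and the global lock is only taken once per flush.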
Update: ah, I see on line 225 that we ensure we don't enlarge it! What happens then?!
We check whether we need to enlarge the queue on each allocation because this is the only time it would be safe to do so. I'll be the first to admit I don't like this, but I couldn't see another way around accidentally growing the ring buffer while inside the STW pause.
I also guess that we're going to be heavily punished for the Mutex here, as presumably parallel GC ends up locking/unlocking this heavily during particular parts of the GC cycle. I don't know if there's a good way of measuring that.
Profile data still shows that we spend a lot of time in futex_wait syscalls -- especially from within the GC. However, we're punished for this Mutex more because of contention between the finaliser thread and user mutator threads. The more concurrent mutator (application) threads we have, the worse this gets. However, this does not affect parallel collector threads (which do marking) because they don't lock/unlock this mutex (and Rust mutexes don't heap allocate anymore).
I agree your suggested approach is better. It's similar to what I had lined up (but my approach still suffers from the VecDeque grow problem). I'm not sure what you mean by this though:
I think that might better be an mmaped block (or similar) under our control.
Do you mean mmap a block of pointers to objects that need finalising using system allocator (i.e. outside of Boehm)? We'd still need to grow this block dynamically, and we also need Boehm to know about it so that it can treat it as a source of roots.
At the end of the GC cycle (or when that local queue is full) we flush it to a global "queue of queues", which is something like Mutex<Vec<FinalisationQueue>>.
Right, so we batch up finalisation in pre-defined chunks that get siphoned off to the finaliser thread once they reach some predetermined size?
We check whether we need to enlarge the queue on each allocation because this is the only time it would be safe to do so.
Right, so this isn't something we ideally want to do in the fast path.
Right, so we batch up finalisation in pre-defined chunks that get siphoned off to the finaliser thread once they reach some predetermined size?
Yep, that's the key idea!
Do you mean mmap a block of pointers to objects that need finalising using system allocator (i.e. outside of Boehm)?
Assuming we can teach Boehm about the mmaped block of pointers somehow, that would be my suggestion. But there are other conceivable ways which achieve the same outcome.
We'd still need to grow this block dynamically, and we also need Boehm to know about it so that it can treat it as a source of roots.
Yes.
Ok, with this + #114 (comment) I think we might be able to eliminate synchronisation around the FIN_Q. Ok to do this in the next PR?
Works for me.
pub fn process_fin_q() {
    loop {
        let object = FIN_Q.lock().unwrap().pop_front();
This is interesting: I think we only have 1 finalisation thread, right? But that thread still has to acquire a lock to do finalisation even though we know that no other thread can have gained that lock (outside of GC time).
If I've got that right, it suggests that we also want to move the "queue of queues" to the finalisation thread, so that it doesn't have to do any locking.
This is interesting: I think we only have 1 finalisation thread, right?
Correct
But that thread still has to acquire a lock to do finalisation even though we know that no other thread can have gained that lock (outside of GC time).
But the FIN_Q is like a producer / consumer, so we need to lock because this could be popping from the queue at the same time that new allocations on the mutator push to the queue. Maybe I've misunderstood your suggestion.
But the FIN_Q is like a producer / consumer, so we need to lock because this could be popping from the queue at the same time that new allocations on the mutator push to the queue. Maybe I've misunderstood your suggestion.
The finaliser thread will be stopped during GC so at run-time I don't think it can ever actually race? But we will still pay the lock/unlock costs even though we're not really doing anything that could ever race...
The finaliser thread will be stopped during GC so at run-time I don't think it can ever actually race? But we will still pay the lock/unlock costs even though we're not really doing anything that could ever race...
Doh... You're right! This hadn't occurred to me. If I can then malloc / mmap this queue by side-stepping the Boehm allocator, I might be able to grow it inside the STW pause and remove locking altogether.
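To sketch roughly what I mean (this is illustrative only: the names are made up, it assumes the libc crate on a Unix-like target, and it's not the code I intend to land): a fixed-capacity buffer mapped outside the Boehm heap, with a grow that is only ever called inside the STW pause, so neither the mutators nor the finaliser thread can observe it mid-resize. Boehm would still need to be told to scan the mapped region as a root (e.g. with something like GC_add_roots), which isn't shown here.

```rust
use std::ptr;

// Hypothetical sketch of an mmap-backed queue grown only during the STW
// pause, side-stepping the Boehm allocator. Names are illustrative, not
// Alloy's actual code.
struct MmapQueue {
    buf: *mut usize, // object addresses; the real queue would hold Gc<()>
    len: usize,
    cap: usize,
}

impl MmapQueue {
    unsafe fn map(cap: usize) -> *mut usize {
        let bytes = cap * std::mem::size_of::<usize>();
        let p = libc::mmap(
            ptr::null_mut(),
            bytes,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert!(p != libc::MAP_FAILED, "mmap failed");
        p as *mut usize
    }

    unsafe fn new(cap: usize) -> Self {
        MmapQueue { buf: Self::map(cap), len: 0, cap }
    }

    // Single producer / single consumer that never run at the same time (the
    // finaliser thread is stopped during GC), so no lock is taken here.
    unsafe fn push(&mut self, obj_addr: usize) {
        debug_assert!(self.len < self.cap, "grow() must be called during STW");
        *self.buf.add(self.len) = obj_addr;
        self.len += 1;
    }

    // Only safe to call inside the STW pause, when neither the mutators nor
    // the finaliser thread can observe the queue mid-resize.
    unsafe fn grow(&mut self) {
        let new_cap = self.cap * 2;
        let new_buf = Self::map(new_cap);
        ptr::copy_nonoverlapping(self.buf, new_buf, self.len);
        libc::munmap(
            self.buf as *mut libc::c_void,
            self.cap * std::mem::size_of::<usize>(),
        );
        self.buf = new_buf;
        self.cap = new_cap;
    }
}
```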
@jacob-hughes Please squash, but in so doing can you also address the question in #114 (comment) please?
The current off-thread finalisation mechanism is very slow. There are a few factors at play here, but overwhelmingly this is due to lock contention on heap allocation. Firstly, the BDWGC uses a clever optimisation[1] to avoid locking allocation in its entirety until at least two mutator threads have spun up. This means that by simply offloading finalisation to another thread, we incur *all* synchronisation costs inside the BDWGC which were not present when Alloy was used with single-threaded applications. Second, in the BDWGC, both registering a finaliser and adding it to the finalisation queue during a collection acquire the global allocator lock. This is expensive, and causes contention between the finaliser thread and any mutator or collector threads already running. This slowdown increases as the number of finalisers in the application grows. In programs with non-trivial finalisation, this is so slow that it is practically unusable.

This commit switches to the lesser-used but faster finalisation mechanism in the BDWGC: the disclaim API. With this approach, a 'disclaim' function (analogous to a finaliser) can be attached to an object without any additional locking, and it is run directly inside the sweeper. The disclaim approach only works with off-thread finalisation because it runs finalisers during the STW pause, where the allocator lock is already held; finalisers run on the mutator would therefore not be able to perform anything which interacts with the heap.

The table below shows the performance difference between this approach and single-threaded finalisation introduced in a58c2b8 (baseline) on the ReBench benchmark suite using som-rs + Alloy:

| Benchmark | Suite | Baseline | Disclaim |
| --- | --- | --- | --- |
| Bounce | micro | 698 | 753 |
| BubbleSort | micro | 1287 | 755 |
| DeltaBlue | macro | 483 | 375 |
| Dispatch | micro | 985 | 552 |
| Fannkuch | micro | 510 | 336 |
| Fibonacci | micro | 1738 | 1057 |
| FieldLoop | micro | 796 | 382 |
| GraphSearch | macro | 193 | 133 |
| IntegerLoop | micro | 1521 | 1007 |
| JsonSmall | macro | 555 | 578 |
| List | micro | 1390 | 859 |
| Loop | micro | 2216 | 1251 |
| Mandelbrot | micro | 1051 | 623 |
| NBody | macro | 705 | 585 |
| PageRank | macro | 613 | 297 |
| Permute | micro | 1173 | 669 |
| Queens | micro | 1052 | 589 |
| QuickSort | micro | 231 | 266 |
| Recurse | micro | 1231 | 858 |
| Richards | macro | Timeout | 1416 |
| Sieve | micro | Timeout | 1100 |
| Storage | micro | 81 | 149 |
| Sum | micro | 857 | 478 |
| Towers | micro | 1528 | 851 |
| TreeSort | micro | 382 | 299 |
| WhileLoop | micro | 1814 | 1130 |
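For readers unfamiliar with the disclaim API the commit message refers to, one way to use that machinery from Rust is via the 'finalized malloc' convenience layer in the BDWGC's gc_disclaim.h. The sketch below is illustrative only: the declarations are hand-written assumptions rather than Alloy's actual bindings, the commit itself may use the lower-level GC_register_disclaim_proc interface instead, and gc_disclaim.h should be treated as the authoritative source for the real signatures.

```rust
use std::ffi::c_void;

// Assumed layout of BDWGC's `struct GC_finalizer_closure` (gc_disclaim.h):
// a callback plus a client-data pointer that the sweeper invokes when the
// object is found to be unreachable.
#[repr(C)]
struct GcFinalizerClosure {
    proc_: unsafe extern "C" fn(obj: *mut c_void, client_data: *mut c_void),
    cd: *mut c_void,
}

extern "C" {
    // Enables the object kind used by GC_finalized_malloc.
    fn GC_init_finalized_malloc();
    // Allocates an object whose closure runs directly from the sweep, so no
    // extra allocator-lock acquisition is needed to register finalisation.
    fn GC_finalized_malloc(size: usize, fc: *const GcFinalizerClosure) -> *mut c_void;
}

// In a design like this PR's, the closure would not run the user's finaliser
// directly; it would push the object onto FIN_Q so the dedicated finalisation
// thread can process it outside the STW pause.
unsafe extern "C" fn on_unreachable(obj: *mut c_void, _cd: *mut c_void) {
    let _ = obj; // e.g. enqueue for the finaliser thread
}
```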
Force-pushed from 8fbd290 to 9868268.
Squashed.
I've removed this sentence because it was a bit confusing. You're right, all I was trying to say was that a direct comparison with single-threaded finalisation is not possible. We can infer that, though, from the comparison between this and the first off-thread finalisation approach in a58c2b8.
Maybe I'm an idiot, but a direct comparison is possible; it's just awkward to do?
We'd have to modify som-rs to pause interpreting every now and then and start processing finalisers on its main thread. I guess it is possible to do that, but we'd have to emulate the threshold that Boehm uses when it goes through and processes bits of finalisation when leaving an allocation call.
OK, that makes more sense --- thanks!
Fixed here: 324150c
Please squash.
I think this should stay as a separate commit that gets squashed when we rebase with upstream rustc. It's not related to this work and it's useful to keep track of.
Fixed here: 78bc4a0 (Again, I'm not sure this one should be squashed either.)
This failure is a little bit trickier; I'm debugging it in tryci. However, I can use it as an opportunity to change to
"bookworm" and "latest" are currently identical, so you can push a new commit to that effect, but please don't squash it yet.
Can I squash this to remove the top two commits?
Please squash.
Force-pushed from 78bc4a0 to 9868268.
Squashed (please hold off on merging for now).
Looking at this a bit more closely, I think I've hit some non-determinism around locking inside the standard library which is making these tests flaky. If it's ok with you, rather than try and fix this only to immediately remove it in the next PR, are you happy for me to push a couple of commits to this PR that address your suggestion here?
Closing in favour of #117
This is the first of a couple of PRs to make finalisation in Alloy faster. See the commit messages for details.