Switch to different finalisation mechanism #114
Conversation
This reverts commit a58c2b8.
#[cfg(profile_gc)]
#[derive(Debug)]
pub struct FinalizerInfo {
I guess this is more FinalizersInfo or GCProfileInfo or something like that?
Yes, good point. But this existed before; this commit has just moved it around. Shall I rename it?
I think renaming it would be a good idea, but you could do it in a subsequent PR. Up to you.
In 4d3bb73's commit message:
I don't understand the last sentence, as I think: a) the paragraph following it does seem to give a performance comparison vs. the previous version; and b) we have compared that previous version with single-threaded?
//
// This is considered a root for the Gc because the 'to-be-finalised' objects are otherwise
// unreachable and must be kept alive until their finaliser has run.
static FIN_Q: Mutex<VecDeque<Gc<()>>> = Mutex::new(VecDeque::new());
Doesn't this mean we're doing heap allocations during GC? I'm also not sure about the performance properties of VecDeque, particularly as I guess that we often shove huge numbers of objects in here. Update: ah, I see on line 225 that we ensure we don't enlarge it! What happens then?!
I also guess that we're going to be heavily punished for the Mutex here, as presumably parallel GC ends up locking/unlocking this heavily during particular parts of the GC cycle. I don't know if there's a good way of measuring that.
I don't think we should do this in this PR, but I think we could fairly easily do this in the next PR (sketched below):
- Each GC thread has a thread_local queue of objects to finalise. That could be a VecDeque, but (because of the "what happens if we run out of space in the VecDeque" problem) I think that might better be an mmaped block (or similar) under our control.
- At the end of the GC cycle (or when that local queue is full) we flush it to a global "queue of queues", which is something like Mutex<Vec<FinalisationQueue>>.

The aim here being that our number of locks goes from O(number_of_objects_to_finalise) to O(number_of_GC_threads).
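To make that concrete, here is a minimal sketch of the shape I have in mind. It is purely illustrative: the names (LOCAL_FIN_QUEUE, GLOBAL_FIN_QUEUES, flush_local_queue) are made up, object addresses are stored as plain usizes so the snippet stands alone, and a Vec stands in for the mmaped block:

```rust
use std::cell::RefCell;
use std::sync::Mutex;

// Hypothetical sketch only: names and types are illustrative, not Alloy's.
// Object addresses are stored as usize so the sketch compiles standalone;
// the real queue would hold something like Gc<()>.
type FinalisationQueue = Vec<usize>;

thread_local! {
    // Per-GC-thread buffer of objects awaiting finalisation. In the real
    // design this might be an mmaped block outside the GC heap so that it
    // can be grown safely; a Vec keeps the sketch short.
    static LOCAL_FIN_QUEUE: RefCell<FinalisationQueue> = RefCell::new(Vec::new());
}

// Global "queue of queues": one lock acquisition per GC thread per flush,
// rather than one per object to finalise.
static GLOBAL_FIN_QUEUES: Mutex<Vec<FinalisationQueue>> = Mutex::new(Vec::new());

// Called by a GC thread when it finds an object that needs finalising.
fn enqueue_for_finalisation(obj_addr: usize) {
    LOCAL_FIN_QUEUE.with(|q| q.borrow_mut().push(obj_addr));
}

// Called at the end of the GC cycle (or when the local queue is full): hand
// the whole local batch to the finaliser thread with a single locked push.
fn flush_local_queue() {
    LOCAL_FIN_QUEUE.with(|q| {
        let batch = std::mem::take(&mut *q.borrow_mut());
        if !batch.is_empty() {
            GLOBAL_FIN_QUEUES.lock().unwrap().push(batch);
        }
    });
}
```

The exact container doesn't matter; the point is that each GC thread batches locally and the global lock is only taken once per flush.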
Update: ah, I see on line 225 that we ensure we don't enlarge it! What happens then?!
We check whether we need to enlarge the queue on each allocation because this is the only time it would be safe to do so. I'll be the first to admit I don't like this, but I couldn't see another way around accidentally growing the ring buffer while inside the STW pause.
I also guess that we're going to be heavily punished for the Mutex here, as presumably parallel GC ends up locking/unlocking this heavily during particular parts of the GC cycle. I don't know if there's a good way of measuring that.
Profile data still shows that we spend a lot of time in futex_wait syscalls -- especially from within the GC. However, we're punished for this Mutex more because of contention between the finaliser thread and user mutator threads. The more concurrent mutator (application) threads we have, the worse this gets. However, this does not affect parallel collector threads (which do marking) because they don't lock/unlock this mutex (and Rust mutexes don't heap allocate anymore).
I agree your suggested approach is better. It's similar to what I had lined up (but my approach still suffers from the VecDeque grow problem). I'm not sure what you mean by this though:
I think that might better be an mmaped block (or similar) under our control.
Do you mean mmap a block of pointers to objects that need finalising using system allocator (i.e. outside of Boehm)? We'd still need to grow this block dynamically, and we also need Boehm to know about it so that it can treat it as a source of roots.
At the end of the GC cycle (or when that local queue is full) we flush it to a global "queue of queues", which is something like Mutex<Vec<FinalisationQueue>>.
Right, so we batch up finalisation in pre-defined chunks that get siphoned off to the finaliser thread once they reach some predetermined size?
We check whether we need to enlarge the queue on each allocation because this is the only time it would be safe to do so.
Right, so this isn't something we ideally want to do in the fast path.
Right, so we batch up finalisation in pre-defined chunks that get siphoned off to the finaliser thread once they reach some predetermined size?
Yep, that's the key idea!
Do you mean mmap a block of pointers to objects that need finalising using system allocator (i.e. outside of Boehm)?
Assuming we can teach Boehm about the mmaped block of pointers somehow, that would be my suggestion. But there are other conceivable ways which achieve the same outcome.
We'd still need to grow this block dynamically, and we also need Boehm to know about it so that it can treat it as a source of roots.
Yes.
Ok, with this + #114 (comment) I think we might be able to eliminate synchronisation around the FIN_Q. Ok to do this in the next PR?
Works for me.
pub fn process_fin_q() {
    loop {
        let object = FIN_Q.lock().unwrap().pop_front();
This is interesting: I think we only have 1 finalisation thread, right? But that thread still has to acquire a lock to do finalisation even though we know that no other thread can have gained that lock (outside of GC time).
If I've got that right, it suggests that we also want to move the "queue of queues" to the finalisation thread, so that it doesn't have to do any locking.
This is interesting: I think we only have 1 finalisation thread, right?
Correct
But that thread still has to acquire a lock to do finalisation even though we know that no other thread can have gained that lock (outside of GC time).
But the FIN_Q is like a producer / consumer, so we need to lock because this could be popping from the queue at the same time that new allocations on the mutator push to the queue. Maybe I've misunderstood your suggestion.
But the FIN_Q is like a producer / consumer, so we need to lock because this could be popping from the queue at the same time that new allocations on the mutator push to the queue. Maybe I've misunderstood your suggestion.
The finaliser thread will be stopped during GC so at run-time I don't think it can ever actually race? But we will still pay the lock/unlock costs even though we're not really doing anything that could ever race...
The finaliser thread will be stopped during GC so at run-time I don't think it can ever actually race? But we will still pay the lock/unlock costs even though we're not really doing anything that could ever race...
Doh... You're right! This hadn't occurred to me. If I can then malloc / mmap this queue by side-stepping the Boehm allocator, I might be able to grow it inside the STW pause and remove locking altogether.
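To sketch roughly what I mean (this is illustrative only: the names are made up, it assumes the libc crate on a Unix-like target, and it's not the code I intend to land): a fixed-capacity buffer mapped outside the Boehm heap, with a grow that is only ever called inside the STW pause, so neither the mutators nor the finaliser thread can observe it mid-resize. Boehm would still need to be told to scan the mapped region as a root (e.g. with something like GC_add_roots), which isn't shown here.

```rust
use std::ptr;

// Hypothetical sketch of an mmap-backed queue grown only during the STW
// pause, side-stepping the Boehm allocator. Names are illustrative, not
// Alloy's actual code.
struct MmapQueue {
    buf: *mut usize, // object addresses; the real queue would hold Gc<()>
    len: usize,
    cap: usize,
}

impl MmapQueue {
    unsafe fn map(cap: usize) -> *mut usize {
        let bytes = cap * std::mem::size_of::<usize>();
        let p = libc::mmap(
            ptr::null_mut(),
            bytes,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        assert!(p != libc::MAP_FAILED, "mmap failed");
        p as *mut usize
    }

    unsafe fn new(cap: usize) -> Self {
        MmapQueue { buf: Self::map(cap), len: 0, cap }
    }

    // Single producer / single consumer that never run at the same time (the
    // finaliser thread is stopped during GC), so no lock is taken here.
    unsafe fn push(&mut self, obj_addr: usize) {
        debug_assert!(self.len < self.cap, "grow() must be called during STW");
        *self.buf.add(self.len) = obj_addr;
        self.len += 1;
    }

    // Only safe to call inside the STW pause, when neither the mutators nor
    // the finaliser thread can observe the queue mid-resize.
    unsafe fn grow(&mut self) {
        let new_cap = self.cap * 2;
        let new_buf = Self::map(new_cap);
        ptr::copy_nonoverlapping(self.buf, new_buf, self.len);
        libc::munmap(
            self.buf as *mut libc::c_void,
            self.cap * std::mem::size_of::<usize>(),
        );
        self.buf = new_buf;
        self.cap = new_cap;
    }
}
```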
@jacob-hughes Please squash, but in so doing can you also address the question in #114 (comment) please?
The current off-thread finalisation mechanism is very slow. There are a few factors at play here, but overwhelmingly this is due to lock contention on heap allocation. Firstly, the BDWGC uses a clever optimisation[1] to avoid locking allocation in its entirety until at least two mutator threads have spun up. This means that by simply offloading finalisation to another thread, we incur *all* synchronisation costs inside the BDWGC which were not present when Alloy was used with single-threaded applications. Second, in the BDWGC, both registering a finaliser and adding it to the finalisation queue during a collection acquire the global allocator lock. This is expensive, and causes contention between the finaliser thread and any mutator or collector threads already running. This slowdown increases as the number of finalisers in the application grows. In programs with non-trivial finalisation, this is so slow that it is practically unusable.

This commit switches to the lesser-used but faster finalisation mechanism in the BDWGC: the disclaim API. With this approach, a 'disclaim' function (analogous to a finaliser) can be attached to an object without any additional locking, and it is run directly inside the sweeper. The disclaim approach only works with off-thread finalisation because it runs finalisers during the STW pause, where the allocator lock is already held; finalisers run on the mutator would therefore not be able to perform anything which interacts with the heap.

The table below shows the performance difference between this approach and single-threaded finalisation introduced in a58c2b8 (baseline) on the ReBench benchmark suite using som-rs + Alloy:

| Benchmark | Suite | Baseline | Disclaim |
| --- | --- | --- | --- |
| Bounce | micro | 698 | 753 |
| BubbleSort | micro | 1287 | 755 |
| DeltaBlue | macro | 483 | 375 |
| Dispatch | micro | 985 | 552 |
| Fannkuch | micro | 510 | 336 |
| Fibonacci | micro | 1738 | 1057 |
| FieldLoop | micro | 796 | 382 |
| GraphSearch | macro | 193 | 133 |
| IntegerLoop | micro | 1521 | 1007 |
| JsonSmall | macro | 555 | 578 |
| List | micro | 1390 | 859 |
| Loop | micro | 2216 | 1251 |
| Mandelbrot | micro | 1051 | 623 |
| NBody | macro | 705 | 585 |
| PageRank | macro | 613 | 297 |
| Permute | micro | 1173 | 669 |
| Queens | micro | 1052 | 589 |
| QuickSort | micro | 231 | 266 |
| Recurse | micro | 1231 | 858 |
| Richards | macro | Timeout | 1416 |
| Sieve | micro | Timeout | 1100 |
| Storage | micro | 81 | 149 |
| Sum | micro | 857 | 478 |
| Towers | micro | 1528 | 851 |
| TreeSort | micro | 382 | 299 |
| WhileLoop | micro | 1814 | 1130 |
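For readers unfamiliar with the disclaim API the commit message refers to, one way to use that machinery from Rust is via the 'finalized malloc' convenience layer in the BDWGC's gc_disclaim.h. The sketch below is illustrative only: the declarations are hand-written assumptions rather than Alloy's actual bindings, the commit itself may use the lower-level GC_register_disclaim_proc interface instead, and gc_disclaim.h should be treated as the authoritative source for the real signatures.

```rust
use std::ffi::c_void;

// Assumed layout of BDWGC's `struct GC_finalizer_closure` (gc_disclaim.h):
// a callback plus a client-data pointer that the sweeper invokes when the
// object is found to be unreachable.
#[repr(C)]
struct GcFinalizerClosure {
    proc_: unsafe extern "C" fn(obj: *mut c_void, client_data: *mut c_void),
    cd: *mut c_void,
}

extern "C" {
    // Enables the object kind used by GC_finalized_malloc.
    fn GC_init_finalized_malloc();
    // Allocates an object whose closure runs directly from the sweep, so no
    // extra allocator-lock acquisition is needed to register finalisation.
    fn GC_finalized_malloc(size: usize, fc: *const GcFinalizerClosure) -> *mut c_void;
}

// In a design like this PR's, the closure would not run the user's finaliser
// directly; it would push the object onto FIN_Q so the dedicated finalisation
// thread can process it outside the STW pause.
unsafe extern "C" fn on_unreachable(obj: *mut c_void, _cd: *mut c_void) {
    let _ = obj; // e.g. enqueue for the finaliser thread
}
```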
Force-pushed from 8fbd290 to 9868268.
Squashed.
I've removed this sentence because it was a bit confusing. You're right, all I was trying to say was that a direct comparison with single-threaded finalisation is not possible. We can infer that, though, from the comparison between this and the first off-thread finalisation approach in a58c2b8.
Maybe I'm an idiot, but a direct comparison is possible; it's just awkward to do?
We'd have to modify som-rs to pause interpreting every now and then and start processing finalisers on its main thread. I guess it is possible to do that, but we'd have to emulate the threshold that Boehm uses when it goes through and processes bits of finalisation when leaving an allocation call.
OK, that makes more sense --- thanks!
Fixed here: 324150c
Please squash.
I think this should stay as a separate commit that gets squashed when we rebase with upstream rustc. It's not related to this work and it's useful to keep track of.
Fixed here: 78bc4a0 (Again, I'm not sure this one should be squashed either.)
This failure is a little bit trickier; I'm debugging it in tryci. However, I can use it as an opportunity to change to
"bookworm" and "latest" are currently identical, so you can push a new commit to that effect, but please don't squash it yet.
Can I squash this to remove the top two commits?
Please squash.
Force-pushed from 78bc4a0 to 9868268.
Squashed (please hold off on merging for now).
Looking at this a bit more closely, I think I've hit some non-determinism around locking inside the standard library which is making these tests flaky. If it's ok with you, rather than try and fix this only to immediately remove it in the next PR, are you happy for me to push a couple of commits to this PR that address your suggestion here?
Closing in favour of #117
This is the first of a couple of PRs to make finalisation in Alloy faster. See the commit messages for details.