
Switch to different finalisation mechanism #114

Closed
wants to merge 2 commits

Conversation

jacob-hughes (Collaborator)

This is the first of a couple of PRs to make finalisation in Alloy faster. See the commit messages for details.


#[cfg(profile_gc)]
#[derive(Debug)]
pub struct FinalizerInfo {
ltratt (Member)

I guess this is more FinalizersInfo or GCProfileInfo or something like that?

jacob-hughes (Collaborator, author)

Yes, good point. This existed before, though; this commit has just moved it around. Shall I rename it?

ltratt (Member)

I think renaming it would be a good idea, but you could do it in a subsequent PR. Up to you.

ltratt (Member) commented Mar 29, 2024

In 4d3bb73's commit message:

The disclaim approach will only work with off-thread finalisation because it
runs finalisers during the STW pause where the allocator lock is already
acquired, thus finalisers run on the mutator would not be able to perform
anything which interacts with the heap. It is therefore not possible to compare
the performance of this approach with the performance of single-threaded
allocation.

I don't understand the last sentence, because I think: a) the paragraph following it does seem to give a performance comparison vs. the previous version; and b) we have compared that previous version with single-threaded?

//
// This is considered a root for the Gc because the 'to-be-finalised' objects are otherwise
// unreachable and must be kept alive until their finaliser has run.
static FIN_Q: Mutex<VecDeque<Gc<()>>> = Mutex::new(VecDeque::new());
ltratt (Member)

Doesn't this mean we're doing heap allocations during GC? I'm also not sure about the performance properties of VecDeque, particularly as I guess that we often shove huge numbers of objects in here. Update: ah, I see on line 225 that we ensure we don't enlarge it! What happens then?!

I also guess that we're going to be heavily punished for the Mutex here, as presumably parallel GC ends up locking/unlocking this heavily during particular parts of the GC cycle. I don't know if there's a good way of measuring that.

I don't think we should do this in this PR, but I think we could fairly easily do it in the next one:

  1. Each GC thread has a thread_local queue of objects to finalise. That could be a VecDeque but (because of the "what happens if we run out of space in the VecDeque" problem) I think that might better be an mmaped block (or similar) under our control.
  2. At the end of the GC cycle (or when that local queue is full) we flush it to a global "queue of queues", which is something like Mutex<Vec<FinalisationQueue>>.

The aim here being that our number of locks goes from O(number_of_objects_to_finalise) to O(number_of_GC_threads).
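
To make the shape of this concrete, here is a minimal sketch of the suggestion, assuming a plain Vec stands in for the mmaped block; LOCAL_FIN_Q, GLOBAL_FIN_QS and flush_local_fin_q are illustrative names, not Alloy code:

use std::cell::RefCell;
use std::sync::Mutex;

// Stand-in for the GC'd pointer type held in the real queue (Gc<()> in Alloy);
// a plain address keeps the sketch self-contained.
type FinObj = usize;

// Hypothetical per-GC-thread queue, filled during marking without any locking.
thread_local! {
    static LOCAL_FIN_Q: RefCell<Vec<FinObj>> = RefCell::new(Vec::with_capacity(1024));
}

// Hypothetical global "queue of queues": one lock acquisition per flush, so the
// number of locks is O(number_of_GC_threads) rather than O(objects_to_finalise).
static GLOBAL_FIN_QS: Mutex<Vec<Vec<FinObj>>> = Mutex::new(Vec::new());

// Called by a GC thread at the end of a cycle (or when its local queue is full).
fn flush_local_fin_q() {
    LOCAL_FIN_Q.with(|q| {
        let batch = std::mem::take(&mut *q.borrow_mut());
        if !batch.is_empty() {
            GLOBAL_FIN_QS.lock().unwrap().push(batch);
        }
    });
}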

jacob-hughes (Collaborator, author) commented Mar 29, 2024

Update: ah, I see on line 225 that we ensure we don't enlarge it! What happens then?!

We check whether we need to enlarge the queue on each allocation because this is the only time it would be safe to do so. I'll be the first to admit I don't like this, but I couldn't see another way to avoid accidentally growing the ring buffer while inside the STW pause.
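
Roughly, that check could look like the sketch below (illustration only, built on the FIN_Q static shown above; ensure_fin_q_capacity and the growth amount are made up):

// Illustrative helper called from the allocation path (i.e. outside the STW
// pause), which is the only point where letting the ring buffer reallocate is
// safe; the name ensure_fin_q_capacity is hypothetical.
fn ensure_fin_q_capacity() {
    let mut q = FIN_Q.lock().unwrap();
    // Reserve headroom now so that pushes made during the STW pause can never
    // trigger a grow (and hence a heap allocation) inside the collector.
    if q.len() == q.capacity() {
        q.reserve(q.len().max(64));
    }
}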

I also guess that we're going to be heavily punished for the Mutex here, as presumably parallel GC ends up locking/unlocking this heavily during particular parts of the GC cycle. I don't know if there's a good way of measuring that.

Profile data still shows that we spend a lot of time in futex_wait syscalls -- especially from within the GC. However, we're punished for this Mutex more because of contention between the finaliser thread and user mutator threads: the more concurrent mutator (application) threads we have, the worse this gets. This does not, however, affect parallel collector threads (which do marking), because they don't lock/unlock this mutex (and Rust mutexes don't heap allocate anymore).

I agree your suggested approach is better. It's similar to what I had lined up (but my approach still suffers from the VecDeque grow problem). I'm not sure what you mean by this though:

I think that might better be an mmaped block (or similar) under our control.

Do you mean mmap a block of pointers to objects that need finalising using system allocator (i.e. outside of Boehm)? We'd still need to grow this block dynamically, and we also need Boehm to know about it so that it can treat it as a source of roots.

At the end of the GC cycle (or when that local queue is full) we flush it to a global "queue of queues", which is something like Mutex<Vec<FinalisationQueue>>.

Right, so we batch up finalisation in pre-defined chunks that get siphoned off to the finaliser thread once they reach some predetermined size?

ltratt (Member)

We check whether we need to enlarge the queue on each allocation because this is the only time it would be safe to do so.

Right, so this isn't something we ideally want to do in the fast path.

Right, so we batch up finalisation in pre-defined chunks that get siphoned off to the finaliser thread once they reach some predetermined size?

Yep, that's the key idea!

Do you mean mmap a block of pointers to objects that need finalising using system allocator (i.e. outside of Boehm)?

Assuming we can teach Boehm about the mmaped block of pointers somehow, that would be my suggestion. But there are other conceivable ways which achieve the same outcome.

We'd still need to grow this block dynamically, and we also need Boehm to know about it so that it can treat it as a source of roots.

Yes.

jacob-hughes (Collaborator, author)

Ok, with this + #114 (comment) I think we might be able to eliminate synchronisation around the FIN_Q. Ok to do this in the next PR?

ltratt (Member)

Works for me.


pub fn process_fin_q() {
loop {
let object = FIN_Q.lock().unwrap().pop_front();
ltratt (Member)

This is interesting: I think we only have one finalisation thread, right? But that thread still has to acquire a lock to do finalisation even though we know that no other thread can have gained that lock (outside of GC time).

If I've got that right, it suggests that we also want to move the "queue of queues" to the finalisation thread, so that it doesn't have to do any locking.

jacob-hughes (Collaborator, author)

This is interesting: I think we only have one finalisation thread, right?

Correct

But that thread still has to acquire a lock to do finalisation even though we know that no other thread can have gained that lock (outside of GC time).

But the FIN_Q is a producer/consumer queue, so we need to lock because the finaliser thread could be popping from the queue at the same time that new allocations on the mutator push to it. Maybe I've misunderstood your suggestion.

ltratt (Member)

But the FIN_Q is a producer/consumer queue, so we need to lock because the finaliser thread could be popping from the queue at the same time that new allocations on the mutator push to it. Maybe I've misunderstood your suggestion.

The finaliser thread will be stopped during GC so at run-time I don't think it can ever actually race? But we will still pay the lock/unlock costs even though we're not really doing anything that could ever race...

jacob-hughes (Collaborator, author)

The finaliser thread will be stopped during GC so at run-time I don't think it can ever actually race? But we will still pay the lock/unlock costs even though we're not really doing anything that could ever race...

Doh... You're right! This hadn't occurred to me. If I can then malloc / mmap this queue by side-stepping the Boehm allocator, I might be able to grow it inside the STW pause and remove locking altogether.
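
A sketch of that idea, assuming a growable block of object addresses managed through std::alloc (and assuming that route reaches the system allocator rather than the GC heap); RawFinQ and its growth policy are illustrative, not Alloy code:

use std::alloc::{alloc, realloc, Layout};

// Illustrative queue of object addresses backed by the system allocator rather
// than the GC heap, so it can be grown inside the STW pause without taking the
// allocator lock. (Boehm would still need to be told about the block, e.g. by
// registering it as a root region, so the queued objects stay alive.)
struct RawFinQ {
    ptr: *mut usize,
    len: usize,
    cap: usize,
}

impl RawFinQ {
    fn push(&mut self, obj_addr: usize) {
        if self.len == self.cap {
            self.grow();
        }
        unsafe { self.ptr.add(self.len).write(obj_addr) };
        self.len += 1;
    }

    fn grow(&mut self) {
        let new_cap = if self.cap == 0 { 64 } else { self.cap * 2 };
        let new_layout = Layout::array::<usize>(new_cap).unwrap();
        let new_ptr = if self.cap == 0 {
            unsafe { alloc(new_layout) }
        } else {
            let old_layout = Layout::array::<usize>(self.cap).unwrap();
            unsafe { realloc(self.ptr as *mut u8, old_layout, new_layout.size()) }
        };
        assert!(!new_ptr.is_null(), "system allocation failed");
        self.ptr = new_ptr as *mut usize;
        self.cap = new_cap;
    }
}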

ltratt (Member) commented Mar 29, 2024

@jacob-hughes Please squash, but in so doing can you also address the question in #114 (comment) please?

The current off-thread finalisation mechanism is very slow. There are a few
factors at play here, but overwhelmingly this is due to lock contention on heap
allocation.

Firstly, the BDWGC uses a clever optimisation[1] to avoid locking allocation in
its entirety until at least two mutator threads have spun up. This means that
by simply offloading finalisation to another thread, we incur *all*
synchronisation costs inside the BDWGC which were not present when Alloy was
used with single-threaded applications.

Second, when a finaliser in the BDWGC is both registered and added to the
queue during a collection, it acquires the global allocator lock. This is
expensive, and causes contention between the finaliser thread and any mutator
or collector threads already running. This slowdown increases as the number of
finalisers in the application grows. In programs with non-trivial finalisation,
this is so slow that it is practically unusable.

This commit switches to the lesser-used but faster finalisation mechanism in
the BDWGC: the disclaim API. With this API, registering a 'disclaim' function
(analogous to a finaliser) for an object requires no additional locking, and
disclaim functions are run directly inside the sweeper.
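
For illustration, a sketch of how the relevant BDWGC entry point (declared in gc_disclaim.h) might be bound and used over FFI; the Rust-side names (GcDisclaimProc, run_finaliser) are assumptions rather than Alloy's actual bindings:

use std::ffi::{c_int, c_void};

// The sweeper invokes the registered disclaim function on each unreachable
// object of the given kind; returning non-zero asks the collector to keep the
// object rather than reclaim it this cycle.
type GcDisclaimProc = unsafe extern "C" fn(obj: *mut c_void) -> c_int;

extern "C" {
    // Declared in BDWGC's gc_disclaim.h.
    fn GC_register_disclaim_proc(kind: c_int, proc_: GcDisclaimProc, mark_unconditionally: c_int);
}

// Hypothetical disclaim callback that would run an object's finaliser (its
// drop glue) before the sweeper frees the memory.
unsafe extern "C" fn run_finaliser(obj: *mut c_void) -> c_int {
    let _ = obj; // ... invoke the object's drop glue here ...
    0 // allow the object to be reclaimed
}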

The disclaim approach will only work with off-thread finalisation
because it runs finalisers during the STW pause, where the allocator lock
is already acquired, so finalisers run on the mutator would not be
able to do anything that interacts with the heap. The table below
shows the performance difference between this approach and
single-threaded finalisation introduced in a58c2b8 (baseline) on the
ReBench benchmark suite using som-rs + Alloy:

  Benchmark     Variant    Suite       Time
  Bounce        baseline   micro       698
  Bounce        disclaim   micro       753
  BubbleSort    baseline   micro       1287
  BubbleSort    disclaim   micro       755
  DeltaBlue     baseline   macro       483
  DeltaBlue     disclaim   macro       375
  Dispatch      baseline   micro       985
  Dispatch      disclaim   micro       552
  Fannkuch      baseline   micro       510
  Fannkuch      disclaim   micro       336
  Fibonacci     baseline   micro       1738
  Fibonacci     disclaim   micro       1057
  FieldLoop     baseline   micro       796
  FieldLoop     disclaim   micro       382
  GraphSearch   baseline   macro       193
  GraphSearch   disclaim   macro       133
  IntegerLoop   baseline   micro       1521
  IntegerLoop   disclaim   micro       1007
  JsonSmall     baseline   macro       555
  JsonSmall     disclaim   macro       578
  List          baseline   micro       1390
  List          disclaim   micro       859
  Loop          baseline   micro       2216
  Loop          disclaim   micro       1251
  Mandelbrot    baseline   micro       1051
  Mandelbrot    disclaim   micro       623
  NBody         baseline   macro       705
  NBody         disclaim   macro       585
  PageRank      baseline   macro       613
  PageRank      disclaim   macro       297
  Permute       baseline   micro       1173
  Permute       disclaim   micro       669
  Queens        baseline   micro       1052
  Queens        disclaim   micro       589
  QuickSort     baseline   micro       231
  QuickSort     disclaim   micro       266
  Recurse       baseline   micro       1231
  Recurse       disclaim   micro       858
  Richards      baseline   macro       Timeout
  Richards      disclaim   macro       1416
  Sieve         baseline   micro       Timeout
  Sieve         disclaim   micro       1100
  Storage       baseline   micro       81
  Storage       disclaim   micro       149
  Sum           baseline   micro       857
  Sum           disclaim   micro       478
  Towers        baseline   micro       1528
  Towers        disclaim   micro       851
  TreeSort      baseline   micro       382
  TreeSort      disclaim   micro       299
  WhileLoop     baseline   micro       1814
  WhileLoop     disclaim   micro       1130
jacob-hughes (Collaborator, author)

Squashed.

I don't understand the last sentence, because I think: a) the paragraph following it does seem to give a performance comparison vs. the previous version; and b) we have compared that previous version with single-threaded?

I've removed this sentence because it was a bit confusing. You're right, all I was trying to say was that a direct comparison with single-threaded finalisation is not possible. We can infer it, though, from the comparison with the first off-thread finalisation approach in a58c2b8.

ltratt (Member) commented Apr 2, 2024

You're right, all I was trying to say was that a direct comparison with single-threaded finalisation is not possible. We can infer it, though, from the comparison with the first off-thread finalisation approach in a58c2b8.

Maybe I'm an idiot, but isn't a direct comparison possible, just awkward to do?

jacob-hughes (Collaborator, author)

Maybe I'm an idiot, but isn't a direct comparison possible, just awkward to do?

We'd have to modify som-rs to pause interpreting every now and then and process finalisers on its main thread. That is possible, I guess, but we'd have to emulate the threshold Boehm uses to decide how much finalisation to process on leaving an allocation call.

ltratt (Member) commented Apr 2, 2024

OK, that makes more sense --- thanks!

ltratt added this pull request to the merge queue Apr 2, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 2, 2024
jacob-hughes (Collaborator, author)

Fixed here: 324150c

ltratt (Member) commented Apr 2, 2024

Please squash.

jacob-hughes (Collaborator, author)

I think this should stay as a separate commit that gets squashed when we rebase with upstream rustc. It's not related to this work and it's useful to keep track of.

ltratt added this pull request to the merge queue Apr 2, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 2, 2024
jacob-hughes (Collaborator, author)

Fixed here: 78bc4a0 (Again I'm not sure this one should be squashed either)

ltratt added this pull request to the merge queue Apr 2, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 2, 2024
jacob-hughes (Collaborator, author)

This failure is a little trickier; I'm debugging it in tryci. However, I can use it as an opportunity to change to debian:latest in the buildbot config and then re-squash later, if you're ok with that?

ltratt (Member) commented Apr 2, 2024

"bookworm" and "latest" are currently identical, so you can push a new commit to that effect, but please don't squash it yet.

jacob-hughes (Collaborator, author)

Can I squash this to remove the top two commits?

ltratt (Member) commented Apr 2, 2024

Please squash.

jacob-hughes (Collaborator, author)

Squashed (please hold off on merging for now)

ltratt added this pull request to the merge queue Apr 2, 2024
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 2, 2024
jacob-hughes (Collaborator, author)

Looking at this a bit more closely, I think I've hit some non-determinism around locking inside the standard library which is making these tests flaky. If it's ok with you, rather than try to fix this only to immediately remove it in the next PR, are you happy for me to push a couple of commits to this PR that address your suggestion here?

jacob-hughes (Collaborator, author)

Closing in favour of #117
