
Switch to using the GC_finalize API for off-thread finalisation #22

Merged (4 commits into softdevteam:master, Aug 22, 2024)

Conversation

jacob-hughes (Author)

Implementation

Using an extension of the disclaim API proved difficult because of the need to re-mark objects pointed to by finalisable objects before a finaliser is run. This would have led to complex changes to the mark-phase for no real benefit.

Instead, by switching back to the GC_finalize mechanism but ensuring that finalisers are run off of mutator threads, we get this desired behaviour 'for free', and we can delete lots of code to do with keeping finalisable buffers around. With this approach, the list of objects which need to be finalised is instead accessed via a linked list threaded through the swept items in the heap. It is maintained by BDWGC without us having to make any additional changes.

This approach uses a simple condition variable to sleep the finaliser thread when there is no finalisation work to do. This is then potentially resumed at the end of a GC cycle if that cycle discovered finalisable objects.
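
Roughly, the finaliser thread looks like this (a sketch rather than the exact code; GC_should_invoke_finalizers and GC_invoke_finalizers are the standard BDWGC entry points, flzr_mtx / flzr_t_has_work are the mutex and condition variable added in this PR, and the thread entry point's name is illustrative):

  static void *finalizer_thread(void *arg) {
      for (;;) {
          pthread_mutex_lock(&flzr_mtx);
          while (GC_should_invoke_finalizers() == 0) {
              /* Sleep until a GC cycle signals that it discovered
                 finalisable objects. */
              pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
          }
          pthread_mutex_unlock(&flzr_mtx);
          /* Run the queued finalisers off the mutator threads. */
          GC_invoke_finalizers();
      }
      return NULL;
  }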

Performance

Though the previous buffered finalisation approach was unsound, it's still worth seeing how switching to GC_finalize compares against it, because these new changes have a profound impact on finalisation speed. To measure this, I ran Alloy on som-rs with 3 configurations for 30 iterations:

  1. Buffered finalisation: our old approach.
  2. Spin: our new GC_finalize approach where the finalisation thread simply spins waiting for work (see the sketch after this list).
  3. CondVar: our new GC_finalize approach where the finalisation thread sleeps when there is no more work to do (the loop sketched in the Implementation section above).
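
For reference, Spin's wait is just a busy loop on the same flag (again a sketch, not the exact code):

  /* Spin configuration (sketch): poll the finalisation flag with no
     mutex or condition variable, at the cost of burning a core while
     there is nothing to do. */
  for (;;) {
      while (GC_should_invoke_finalizers() == 0) { /* spin */ }
      GC_invoke_finalizers();
  }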

All configurations are run with finaliser elision disabled, so each benchmark ends up finalising huge numbers of objects -- in the 10s of millions. The results are shown below.

[Figure: som_rs_perf -- som-rs benchmark results for the three configurations]

For Buffered finalisation, DeltaBlue was never able to run to completion due to a bug. CondVar dramatically improves wall-clock time on QuickSort, BubbleSort, and Loop, but shows no statistically significant difference on most of the others.

However, it does cause a big regression on Sum. I think this is because of heap growth: CondVar ends up finalising about 30% fewer objects than Spin on this particular benchmark. The direct consequence is that there is on average 30% more floating garbage in the heap which cannot be reclaimed, which means each new GC cycle has to do more marking/sweeping (not to mention slower allocation due to free-list fragmentation).

I'm going to make an educated guess that the shortfall is down to the additional synchronisation costs of the condition variable (+mutex lock/unlock, +atomic load for the fin_q_empty check). In cases like Sum, where each GC cycle pushes huge numbers of finalisers to the queue, this additional synchronisation means the finaliser thread ends up getting further behind than if it were just allowed to spin -- though I must admit, that it results in 30% fewer objects being finalised in this case surprises me!

I think on balance we should keep the condition variable, as this makes it much easier for us to spin up multiple finaliser threads in the future if we want to parallelise some of this work.

Jake Hughes and others added 2 commits August 21, 2024 21:20
BDWGC is now a submodule in Alloy, so this step (which builds and runs
Alloy's test suite) will never actually test a BDWGC PR's changes.
Instead, it would just grab the current Alloy submodule commit.
We have now switched to using fast TLS support, so this old POSIX
get/set_specific support for tracking GC roots in TLS is no longer
required.
pthread_mutex_lock(&flzr_mtx);
while (GC_should_invoke_finalizers() == 0) {
    pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
}
pthread_mutex_unlock(&flzr_mtx);
ltratt (Member)

I think you might have a ping-pong situation here (warning: I might be wrong).

What I tend to do with condition variables (partly because I'm stupid and I find condvars hard) is associate a bool with the condvar. In essence, the condvar says "if true, there is work to do". Then I unset it in the "listening" thread so I end up with something like:

  bool work_to_do = false; // this is protected by flzr_mtx
  while (1) {
      pthread_mutex_lock(&flzr_mtx);
      while (!work_to_do) {
          pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
      }
      work_to_do = false; // unset while the mutex is still held
      pthread_mutex_unlock(&flzr_mtx);
      // do actual work
  }

At the moment, IIUC, you aren't doing the equivalent of work_to_do = false with the mutex locked, so I think you might end up in a situation where the finalisation thread keeps getting spuriously woken up even when there's no actual work to do.

Now, amongst the many caveats in my thinking is that I don't know how GC_invoke_finalizers synchronises with the GC thread(s). Maybe this is all handled elsewhere in a nice way.

ltratt (Member)

Er, I might be wrong about spurious wakeups: I think the problem is that you won't get woken up when you want.

Maybe this example in pizauth, which is very like what you've got here, might make sense of what I've said above. https://github.com/ltratt/pizauth/blob/master/src/server/notifier.rs#L33

jacob-hughes (Author)

In this case GC_should_invoke_finalizers() is a function that returns 0 or 1 depending on whether there's work to do. It is acting as the work_to_do bool here. If there are no finalisers to be run, it will always return 0.

GC_should_invoke_finalizers is essentially a wrapper around an atomic global, so calling it is thread-safe. It only gets set to 1 from inside a collection (where the finaliser threads are suspended anyway). It gets unset by the finaliser thread when there are no more finalisers to be run (again, this is done atomically). I think this is the root of your issue -- should it be unset while we hold the lock?

ltratt (Member)

> GC_should_invoke_finalizers is essentially a wrapper around an atomic global

Aha!

> should it be unset while we hold the lock

Yes I think so.
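
I.e. something like this (a rough sketch; clear_finalizers_pending is a hypothetical name for a helper that atomically clears the flag behind GC_should_invoke_finalizers):

  for (;;) {
      pthread_mutex_lock(&flzr_mtx);
      while (GC_should_invoke_finalizers() == 0) {
          pthread_cond_wait(&flzr_t_has_work, &flzr_mtx);
      }
      /* Hypothetical helper: unset the "work to do" flag while
         flzr_mtx is still held, so a signal from the GC can't slip
         in between the check and the clear. */
      clear_finalizers_pending();
      pthread_mutex_unlock(&flzr_mtx);
      GC_invoke_finalizers();
  }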

jacob-hughes (Author)

Try this: d281df7

jacob-hughes (Author)

I've just tested this: 56cc901. Let me know if you're happy and I'll squash.

ltratt (Member)

Can you quickly check if the performance of Sum is still the same or not with these changes?

jacob-hughes (Author)

I can, but even with 10 iterations it will take about 30 mins to run :)

ltratt (Member)

Just check with 1 or 2? It'll probably be fairly obvious if this has any chance of changing performance.

ltratt (Member)

OK, offline, we think we see a difference.

ltratt (Member) commented Aug 22, 2024

Please squash.

This was replaced with the GC_finalize mechanism in e093be9.
jacob-hughes (Author)

Squashed!

ltratt added this pull request to the merge queue on Aug 22, 2024
Merged via the queue into softdevteam:master with commit 74bb8e4 on Aug 22, 2024
19 of 367 checks passed