Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[coverage] Proactively collect isolates as they pause #595

Merged
merged 20 commits into from
Oct 7, 2024
Merged

Conversation

liamappelbe
Copy link
Contributor

@liamappelbe liamappelbe commented Oct 1, 2024

The old coverage collection flow was to wait for all isolates to reach a paused state, then collect coverage for all the isolate groups, then resume all the isolates. This caused a bug where isolate A is waiting for isolate B to exit, but isolate B is paused waiting to be resumed by package:coverage, so isolate A never enters the paused state, and the test deadlocks.

The basic idea of the new flow is to collect coverage for isolates as they pause, and immediately resume them, and keep doing this until all isolates have exited. This fixes the bug, but there are a whole lot of complications (see the details section).

The new flow is more streamlined, resulting in a modest performance boost (mainly because the _waitIsolatesPaused function was pretty inefficient). A small test suite that runs in 4.38 sec without coverage, used to run with coverage in 7.91 secs, but now runs in 6.38 secs.

This change doesn't affect Flutter, because Flutter sets waitPaused to false. They have their own isolate management logic.

Details

  • When the main isolate exits, the whole VM shuts down, regardless of whether there are other running isolates.
    • So we have to detect the main isolate, and delay resuming it until all other isolates have exited.
    • Unfortunately there's no official way of detecting the main isolate at the moment, and I don't want to block this important fix on [vm] Expose isolate's origin id (the id of the spawning isolate or (if no other isolate spawned it, it's own isolate id) sdk#56732
    • For now I'm assuming the main isolate is the first isolate reported by the VM service that has the name "main".
    • This should be pretty reliable. Even if a user names one of their isolates "main", it's unlikely to appear earlier in the list than the true main isolate.
    • Should also work in Flutter, but Flutter won't use this new flow anyway.
  • We only receive isolate start/pause/exit events from the VM service after we've started listening for them.
    • Since coverage is typically started after the tests are started, we will miss some events.
    • So we have to backfill the missing events by querying the isolate list.
  • The backfill is async, so there are all sorts of races that can happen between the event subscription and the backfill.
    • The simplest and most robust way to solve this problem is to buffer all the events until after the backfill. So I added a IsolateEventBuffer class.
  • The other issue with the backfill is that it can lead to duplicate events, because isolates can be started and paused after we start listening, but before we do the backfill.
    • This means we can get duplicate and out-of-order events.
    • So listenToIsolateLifecycleEvents cleans up the event stream.
  • The coverage counters are shared between all the isolates in a group.
    • In the current flow we only collect coverage for one isolate in a group. If we naively apply that logic in the new flow, we'll undercount coverage by only collecting coverage for the entire group when its first isolate exits.
    • If we instead omit this logic and collect coverage for all isolates, we'll overcount coverage (that's less problematic, but users have complained about it before).
    • So we need to collect coverage on the last isolate in the group, before resuming it.
    • IsolateGroupState tracks the running/paused isolates in a group
  • IsolatePausedListener ties all that infra together
    • It listens to the cleaned event stream and tracks which isolates are in which groups.
    • It calls an onIsolatePaused callback with a flag indicating whether that's the last isolate in the group.
    • It resumes all the paused isolates, but delays resuming main.
    • And it provides a future that completes when all isolates are exited.
  • The main collection routine is pretty much the same, but _waitIsolatesPaused has been removed and instead we use IsolatePausedListener to collect coverage for the last isolate in each group.

Fixes #520

Copy link

github-actions bot commented Oct 1, 2024

PR Health

Breaking changes ✔️
Package Change Current Version New Version Needed Version Looking good?
coverage Non-Breaking 1.9.2 1.10.0-wip 1.10.0 ✔️
Changelog Entry ✔️
Package Changed Files

Changes to files need to be accounted for in their respective changelogs.

Coverage ✔️
File Coverage
pkgs/coverage/lib/src/collect.dart 💚 87 %
pkgs/coverage/lib/src/isolate_paused_listener.dart 💚 98 %
pkgs/coverage/lib/src/util.dart 💚 98 %

This check for test coverage is informational (issues shown here will not fail the PR).

API leaks ✔️

The following packages contain symbols visible in the public API, but not exported by the library. Export these symbols or remove them from your publicly visible API.

Package Leaked API symbols
License Headers ✔️
// Copyright (c) 2024, the Dart project authors. Please see the AUTHORS file
// for details. All rights reserved. Use of this source code is governed by a
// BSD-style license that can be found in the LICENSE file.
Files
no missing headers

All source files should start with a license header.

Package publish validation ✔️
Package Version Status
package:cli_config 0.2.1-wip WIP (no publish necessary)
package:coverage 1.10.0-wip WIP (no publish necessary)
package:extension_discovery 2.1.0 already published at pub.dev
package:graphs 2.3.3-wip WIP (no publish necessary)
package:mime 2.0.0 already published at pub.dev
package:oauth2 2.0.4-wip WIP (no publish necessary)
package:source_map_stack_trace 2.1.3-wip WIP (no publish necessary)
package:unified_analytics 6.1.4 already published at pub.dev

Documentation at https://github.com/dart-lang/ecosystem/wiki/Publishing-automation.

Copy link
Contributor

@bkonyi bkonyi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far! Let me know when you're ready for a final review.

/// - Each callback will only be called once per isolate.
Future<void> listenToIsolateLifecycleEvents(
VmService service,
void Function(IsolateRef isolate) onIsolateStarted,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: consider creating a typedef for all these callback types. Maybe something like: typedef FutureOr<void> Function(IsolateRef) IsolateLifecycleCallback?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've made some typedefs, but it's important to distinguish the async callbacks from the sync ones. If all of them were allowed to be async, then ensuring they don't interleave would be more complicated.

@liamappelbe liamappelbe marked this pull request as ready for review October 3, 2024 07:14
@liamappelbe
Copy link
Contributor Author

Ready for review

@liamappelbe liamappelbe changed the title WIP: [coverage] Proactively collect isolates as they pause [coverage] Proactively collect isolates as they pause Oct 3, 2024
@liamappelbe
Copy link
Contributor Author

Verified this works on a real repo: dart-lang/native#1629

@liamappelbe liamappelbe merged commit d5d7bb5 into main Oct 7, 2024
19 checks passed
@liamappelbe liamappelbe deleted the fix520 branch October 7, 2024 20:55
@dcharkes
Copy link
Contributor

dcharkes commented Oct 7, 2024

This is great! Thanks @liamappelbe! 🚀

@HosseinYousefi
Copy link
Member

Yay! Thanks @liamappelbe 🎉

@liamappelbe
Copy link
Contributor Author

Once I publish this, it will be auto-rolled into flutter. So I'll hold off publishing for a few days to let Flutter 3.27 stabilize. This fix doesn't affect Flutter anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests with Isolate.run pause indefinitely.
4 participants