Process abort in state tests on some macOS versions with Rust 1.70 #6812

Closed
3 tasks done
teor2345 opened this issue Jun 4, 2023 · 11 comments · Fixed by #7834 or #7843
Assignees
Labels
A-concurrency Area: Async code, needs extra work to make it work properly. A-state Area: State / database changes C-security Category: Security issues I-crash Zebra crashes (without a panic) S-needs-triage Status: A bug report needs triage

Comments

@teor2345
Contributor

teor2345 commented Jun 4, 2023

Tasks

Motivation

Zebra's state tests are crashing on macOS across multiple PRs:

Running /Users/runner/work/zebra/zebra/target/release/deps/zebra_state-52d6a4509c1a5e9a
running 97 tests
test service::chain_tip::tests::vectors::chain_tip_change_is_initially_not_ready ... ok
test service::chain_tip::tests::vectors::current_best_tip_is_initially_empty ... ok
test service::chain_tip::tests::vectors::empty_latest_chain_tip_is_empty ... ok
test service::chain_tip::tests::prop::best_tip_is_latest_non_finalized_then_latest_finalized ... ok
2023-06-01T20:21:10.039357Z INFO Opened Zebra state cache at /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/zebra-state-v25-mainnet-T3MWk1
2023-06-01T20:21:10.039660Z INFO loaded Zebra state cache tip=None
2023-06-01T20:21:10.040026Z INFO Opened Zebra state cache at /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/zebra-state-v25-mainnet-n6VK4c
2023-06-01T20:21:10.040049Z INFO loaded Zebra state cache tip=None
2023-06-01T20:21:10.130630Z INFO Opened Zebra state cache at /var/folders/24/8k48jl6d249_n_qfxwsl6xvm0000gn/T/zebra-state-v25-mainnet-OSJIVt
2023-06-01T20:21:10.130704Z INFO loaded Zebra state cache tip=None
error: test failed, to rerun pass -p zebra-state --lib
Caused by:

process didn't exit successfully: /Users/runner/work/zebra/zebra/target/release/deps/zebra_state-52d6a4509c1a5e9a (signal: 6, SIGABRT: process abort signal)
Error: Process completed with exit code 101.

https://github.com/ZcashFoundation/zebra/actions/runs/5148228655/jobs/9269653348#step:15:3285
https://github.com/ZcashFoundation/zebra/actions/runs/5151593461/jobs/9276906631?pr=6810#step:15:3274
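The log above reports two different things: the test binary was killed by signal 6 (SIGABRT), and the CI step then finished with exit code 101. As a rough illustration of how a parent process on a Unix platform can tell those two outcomes apart, here is a minimal sketch using only the Rust standard library. The `ChildOutcome` enum and the `classify` function are hypothetical names invented for this example; they are not part of Zebra or cargo.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;

/// How a child process ended, as seen by its parent on Unix.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq, Eq)]
pub enum ChildOutcome {
    /// The process exited normally with code 0.
    Success,
    /// The process exited normally with a non-zero code.
    ExitCode(i32),
    /// The process was terminated by a signal, for example 6 for SIGABRT.
    Signal(i32),
    /// Neither an exit code nor a terminating signal is available.
    Unknown,
}

/// Classify an exit status. A process killed by a signal has no exit code,
/// so the signal is checked first.
pub fn classify(status: ExitStatus) -> ChildOutcome {
    if let Some(signal) = status.signal() {
        return ChildOutcome::Signal(signal);
    }
    match status.code() {
        Some(0) => ChildOutcome::Success,
        Some(code) => ChildOutcome::ExitCode(code),
        None => ChildOutcome::Unknown,
    }
}
```

In practice the `ExitStatus` would come from something like `std::process::Command::status()`. An abort surfaces as `Signal(6)` rather than as a panic message, which matches the "crashes (without a panic)" label on this issue.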

Investigation

This appears to be an environment issue, because it wasn't failing on the main branch when that code was originally merged:
https://github.com/ZcashFoundation/zebra/actions/runs/5144319422/jobs/9260464558

Runners started being updated to new software versions on June 2:
actions/runner-images#7660

The macOS and image versions on failing and successful runners are the same:

macOS 12.6.5 21G531

https://github.com/ZcashFoundation/zebra/actions/runs/5144319422/jobs/9260464558#step:1:4
https://github.com/ZcashFoundation/zebra/actions/runs/5151593461/jobs/9276906631?pr=6810#step:1:4

But the Rust versions are different:

stable-x86_64-apple-darwin unchanged - rustc 1.69.0 (84c898d65 2023-04-16)

https://github.com/ZcashFoundation/zebra/actions/runs/5144319422/jobs/9260464558#step:5:18

stable-x86_64-apple-darwin updated - rustc 1.70.0 (90c541806 2023-05-31) (from rustc 1.69.0 (84c898d65 2023-04-16))

https://github.com/ZcashFoundation/zebra/actions/runs/5151593461/jobs/9276906631?pr=6810#step:5:34

Teor can't reproduce this bug on their local machines with Rust 1.70, so it might be processor or OS-version dependent:

  • macOS Ventura 13.2 on M1 running aarch64 code
  • macOS Ventura 13.2 on M1 emulating x86_64 code
  • Linux on x86_64

Diagnosis

We might need to make Zebra compatible with Rust 1.70 and later on macOS 12 and earlier.

We could disable these tests until we do, because macOS is not a supported Zebra platform.
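One possible way to disable individual tests on macOS from within the code, rather than in the CI configuration, is a conditional `ignore` attribute. This is only a sketch of that option, not necessarily what was done for this ticket; `hypothetical_state_test` and `state_tests_disabled` are made-up names, and the test body is a placeholder.

```rust
/// Returns true when compiled for macOS, where the state tests would be skipped.
/// (Hypothetical helper, for illustration only.)
pub const fn state_tests_disabled() -> bool {
    cfg!(target_os = "macos")
}

// On macOS, the test is compiled but marked as ignored, so `cargo test`
// reports it as "ignored" instead of running it. On other platforms the
// attribute is not applied and the test runs normally.
#[test]
#[cfg_attr(
    target_os = "macos",
    ignore = "aborts with SIGABRT on some macOS versions, see #6812"
)]
fn hypothetical_state_test() {
    // Placeholder body standing in for a real state test.
    assert_eq!(2 + 2, 4);
}
```

Ignored tests can still be run on demand with `cargo test -- --ignored`, which would make it easy to re-check the crash after a compiler or runner update.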

Complex Code or Requirements

Crashes like this in the state are usually caused by RocksDB, state service shutdown, or concurrency bugs.
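To illustrate the general class of shutdown-ordering bug mentioned here, the following sketch shows one common pattern for avoiding it: background work is stopped and joined before the shared resource is torn down. This is a generic illustration, not Zebra's state service and not the RocksDB API; `FakeDb` and `run_and_shut_down` are hypothetical names, and the "database" is just an in-memory list.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in for a database handle shared with a background thread.
/// (Hypothetical type, for illustration only.)
#[derive(Debug)]
pub struct FakeDb {
    writes: Mutex<Vec<u64>>,
}

/// Writes `items` through a background worker, then shuts down in order:
/// close the channel, join the worker, and only then take the database apart.
pub fn run_and_shut_down(items: Vec<u64>) -> Vec<u64> {
    let db = Arc::new(FakeDb {
        writes: Mutex::new(Vec::new()),
    });
    let (sender, receiver) = mpsc::channel::<u64>();

    let worker_db = Arc::clone(&db);
    let worker = thread::spawn(move || {
        // The loop ends once every sender has been dropped.
        for item in receiver {
            worker_db.writes.lock().expect("lock poisoned").push(item);
        }
    });

    for item in items {
        sender.send(item).expect("worker is still running");
    }

    // Step 1: signal shutdown by closing the channel.
    drop(sender);
    // Step 2: wait for the worker, so nothing is still using the database.
    worker.join().expect("worker panicked");
    // Step 3: the worker's handle is gone, so this is the only reference left.
    let db = Arc::try_unwrap(db).expect("no other handles remain");
    db.writes.into_inner().expect("lock poisoned")
}
```

If step 2 were skipped, the worker could still be running while the process exits or the resource is released, which is the kind of race that can end in an abort instead of a clean panic.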

Testing

Our existing tests seem to reliably detect this bug.

@teor2345 teor2345 added C-bug Category: This is a bug S-needs-triage Status: A bug report needs triage P-Medium ⚡ C-security Category: Security issues I-crash Zebra crashes (without a panic) C-testing Category: These are tests A-state Area: State / database changes A-concurrency Area: Async code, needs extra work to make it work properly. labels Jun 4, 2023
@mpguerra mpguerra added this to Zebra Jun 4, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in Zebra Jun 4, 2023
@teor2345
Contributor Author

teor2345 commented Jun 4, 2023

@mpguerra this is technically not a release blocker because it's on macOS. But crashes can be a sign of concurrency bugs or memory corruption.

@teor2345 teor2345 added I-integration-fail Continuous integration fails, including build and test failures and removed C-bug Category: This is a bug C-testing Category: These are tests labels Jun 4, 2023
@teor2345 teor2345 changed the title Process abort in state tests on macOS Process abort in state tests on some macOS versions Jun 5, 2023
@teor2345 teor2345 changed the title Process abort in state tests on some macOS versions Process abort in state tests on some macOS versions with Rust 1.70 Jun 5, 2023
@teor2345 teor2345 added P-Low ❄️ and removed P-Medium ⚡ I-integration-fail Continuous integration fails, including build and test failures labels Jun 6, 2023
@teor2345
Contributor Author

We could split this ticket into:

  • build on macOS in CI
  • test on macOS in CI

Because it's only tests that use the state that are failing on macOS.

@mpguerra
Contributor

Hey team! Please add your planning poker estimate with Zenhub @arya2 @oxarbitrage @teor2345 @upbqdn

@teor2345
Contributor Author

@mpguerra I think this ticket should be split into two tickets: re-enabling builds, and re-enabling tests. They have different priorities and different estimates.

@oxarbitrage
Contributor

I just checked building on my macOS machine (Ventura 13.4.1): the build works fine with cargo build, and Zebra can then be started with cargo run. I think that always worked.

Then, to my surprise, when I ran cargo test, all tests passed! I am pretty sure this was not the case in the past, but it seems the problems are fixed.

I think it is worth a try to just re-enable what we disabled in CI in a draft PR and see what happens there.

@teor2345
Contributor Author

Sure! If it all works in CI with the latest Rust compiler, then we don't have to split the tickets or do anything complicated.

But I couldn't reproduce the CI bug on my local machine when I opened the ticket. So I wouldn't be surprised if it fails just in CI.

(I've also added branch protection rules to the ticket.)

@teor2345
Contributor Author

Building and testing also work locally on my macOS Ventura 13.6 M1 machine, with Rust 1.72.0.

Let's see how CI goes.

@teor2345
Contributor Author

Looks like that worked!

macOS seems to be the longest job though. Maybe that will go down once it builds on the main branch and its cache gets used.

If it doesn't, we can open a devops ticket for larger macOS runners:
https://docs.github.com/en/actions/using-github-hosted-runners/about-larger-runners/about-larger-runners#about-macos-larger-runners

@mergify mergify bot closed this as completed in #7834 Oct 26, 2023
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in Zebra Oct 26, 2023
@mpguerra
Contributor

I think I did "Admin: Add a macOS test branch protection rule" correctly
Screenshot 2023-10-26 at 14 53 49

@mpguerra mpguerra reopened this Oct 26, 2023
@mpguerra
Contributor

fixed by #7843

@mpguerra mpguerra linked a pull request Oct 26, 2023 that will close this issue
6 tasks
@mergify mergify bot closed this as completed in #7843 Oct 26, 2023
@teor2345
Contributor Author

> macOS seems to be the longest job though. Maybe that will go down once it builds on the main branch and its cache gets used.

It's down to 1 hour in the latest build, so I think we're fine here:
https://github.com/ZcashFoundation/zebra/actions/runs/6654300326/job/18082060711

Projects
Status: Done
3 participants