
Move heavy computation to a thread pool with a priority queue #6247

Merged: 8 commits into dev from simon/compute-jobs, Nov 19, 2024

Conversation

@SimonSapin (Contributor) commented Nov 7, 2024

These components can take non-trivial amounts of CPU time:

  • GraphQL parsing
  • GraphQL validation
  • Query planning
  • Schema introspection

In order to avoid blocking threads that execute asynchronous code, these components are now run (in their respective Rust implementations) in a new pool with as many threads as there are CPU cores. Previously we used Tokio’s spawn_blocking (https://docs.rs/tokio/latest/tokio/task/fn.spawn_blocking.html) for this purpose, but it appears to be intended for blocking I/O and uses up to 512 threads, so it isn’t a great fit for computation tasks.
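
The router’s actual compute_job module is not reproduced here; the following is a minimal sketch of the general pattern under assumed names (spawn_pool, execute, Job), showing how CPU-bound work can be handed to a fixed pool of dedicated threads and awaited from async code without tying up the async executor:

use std::num::NonZeroUsize;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

use tokio::sync::oneshot;

// Hypothetical job type: a boxed closure run on a pool thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Spawn `size` dedicated worker threads that pull jobs from a shared queue.
fn spawn_pool(size: NonZeroUsize) -> mpsc::Sender<Job> {
    let (sender, receiver) = mpsc::channel::<Job>();
    let receiver = Arc::new(Mutex::new(receiver));
    for _ in 0..size.get() {
        let receiver = Arc::clone(&receiver);
        thread::spawn(move || loop {
            // The mutex guard is dropped before the job runs.
            let job = match receiver.lock().unwrap().recv() {
                Ok(job) => job,
                Err(_) => break, // all senders dropped: shut down
            };
            job();
        });
    }
    sender
}

/// Run a CPU-bound closure on the pool and await the result from async code
/// without blocking the executor's threads.
async fn execute<T, F>(pool: &mpsc::Sender<Job>, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = oneshot::channel();
    // If the pool is gone, the job is dropped and the await below panics.
    let _ = pool.send(Box::new(move || {
        let _ = tx.send(f());
    }));
    rx.await.expect("compute job was dropped before completion")
}

In the PR itself, the plain channel used in this sketch is replaced by the ageing priority queue described below.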

Additionally, QueryPlannerPool is bypassed for the new planner since the new thread pool and queue fulfill the same role. This makes the new planner parallel by default (whereas the old pool defaults to size 1, making planning effectively sequential).

The first commit supersedes and closes #6122. The second (non-merge) commit supersedes and closes #6142.

The ageing priority algorithm is based on @garypen’s work in https://github.com/apollographql/ageing
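
For readers unfamiliar with the technique, here is a toy sketch of the ageing idea (an assumed shape for illustration, not the actual code from that repository or from this PR): jobs sit in per-priority queues, and each time a job is taken, one waiting job from every lower level is promoted so that low-priority work cannot starve.

use std::collections::VecDeque;

/// Number of priority levels; index 0 is the highest priority.
const LEVELS: usize = 4;

/// Toy ageing priority queue: pop takes from the highest non-empty level,
/// then promotes one waiting item from each lower level.
struct AgeingQueue<T> {
    levels: [VecDeque<T>; LEVELS],
}

impl<T> AgeingQueue<T> {
    fn new() -> Self {
        Self { levels: std::array::from_fn(|_| VecDeque::new()) }
    }

    fn push(&mut self, item: T, priority: usize) {
        self.levels[priority.min(LEVELS - 1)].push_back(item);
    }

    fn pop(&mut self) -> Option<T> {
        let popped = self.levels.iter_mut().find_map(|level| level.pop_front());
        // Ageing step: move the front item of each lower level up one level.
        for i in 1..LEVELS {
            if let Some(item) = self.levels[i].pop_front() {
                self.levels[i - 1].push_back(item);
            }
        }
        popped
    }

    fn queued_count(&self) -> usize {
        self.levels.iter().map(VecDeque::len).sum()
    }
}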

Fixes ROUTER-827


Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible (1)
  • Documentation (2) completed
  • Performance impact assessed and acceptable
  • Tests added and passing (3)
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@svc-apollo-docs (Collaborator) commented Nov 7, 2024

✅ Docs Preview Ready

No new or changed pages found.

@router-perf (bot) commented Nov 7, 2024

CI performance tests

  • connectors-const - Connectors stress test that runs with a constant number of users
  • const - Basic stress test that runs with a constant number of users
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled
  • enhanced-signature - Enhanced signature enabled
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • extended-reference-mode - Extended reference mode enabled
  • large-request - Stress test with a 1 MB request payload
  • no-tracing - Basic stress test, no tracing
  • reload - Reload test over a long period of time at a constant rate of users
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-local-metrics - Field stats that are generated from the router rather than FTV1
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload

Comment on lines 14 to 21
/// We generate backpressure in tower `poll_ready` when reaching this many queued items
// TODO: what’s a good number? should it be configurable?
const QUEUE_SOFT_CAPACITY: usize = 100;

// TODO: should this be configurable?
fn thread_pool_size() -> NonZeroUsize {
    std::thread::available_parallelism().expect("available_parallelism() failed")
}
Contributor Author (@SimonSapin):

Some open questions here

My opinion is that if we make something configurable just because we don’t know what a good value would be, most users won’t know either.

Contributor:

I agree. Better to try and think of a good default and only make it configurable (if ever) later.

Member:

Should it be available - 1 so we keep 1 core free to handle traffic? Or is it fine to rely on the OS scheduler to still let traffic go through the router?

Contributor Author (@SimonSapin):

My thinking with the initial PR was to rely on the OS scheduler, but minus one might be OK too. The downside is that, for example, minus one out of 2 available cores has a much bigger impact than minus one out of 32 cores.

Contributor:

Here's my proposal:

size: max(1, available - (ceiling(available/8)))

WORKINGS:
AVAILABLE: 1 POOL SIZE: 1
AVAILABLE: 2 POOL SIZE: 1
AVAILABLE: 3 POOL SIZE: 2
AVAILABLE: 4 POOL SIZE: 3
AVAILABLE: 5 POOL SIZE: 4
...
AVAILABLE: 8 POOL SIZE: 7
AVAILABLE: 9 POOL SIZE: 7
...
AVAILABLE: 16 POOL SIZE: 14
AVAILABLE: 17 POOL SIZE: 14
...
AVAILABLE: 32 POOL SIZE: 28

Tweaks on the basic approach are welcome, but it seems to offer reasonable scaling for query planning. We can always refine it later.
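
Rendered as Rust, the proposal amounts to something like the following (an illustration of the suggestion as written, not necessarily what was merged):

use std::num::NonZeroUsize;

/// Proposed sizing: reserve roughly one core in eight for serving traffic,
/// but never go below a single worker thread.
fn proposed_pool_size() -> NonZeroUsize {
    let available = std::thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1);
    let reserved = available.div_ceil(8); // ceiling(available / 8)
    let size = std::cmp::max(1, available - reserved);
    NonZeroUsize::new(size).expect("max(1, _) is non-zero")
}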

Because each task in the pool (by default just one) takes queries to plan from a queue in sequence, it is a parallelism bottleneck even in “new” mode.
Comment on lines +70 to +77
meter_provider()
    .meter("apollo/router")
    .u64_observable_gauge("apollo.router.compute_jobs.queued")
    .with_description(
        "Number of computation jobs (parsing, planning, …) waiting to be scheduled",
    )
    .with_callback(move |m| m.observe(queue().queued_count() as u64, &[]))
    .init()
Contributor Author (@SimonSapin):

This is a new gauge metric. apollo.router.query_planning.queued will only exist when the legacy planner is used. It is somewhat replaced by the new metric, but not exactly, since the new queue also contains parsing+validation jobs and introspection jobs.

Contributor:

This new metric should be documented.

Contributor Author (@SimonSapin):

I’ve added a short description in docs/source/reference/router/telemetry/instrumentation/standard-instruments.mdx. Are there other places to add it to?

Contributor:

I think that's the correct location.

@SimonSapin marked this pull request as ready for review November 8, 2024 13:16
@SimonSapin requested review from a team as code owners November 8, 2024 13:16
@garypen (Contributor) left a comment:

This looks great in general. Maybe we can find some time over the next couple of days to discuss how we figure out good default values for queue/pool sizes.


/// We generate backpressure in tower `poll_ready` when reaching this many queued items
// TODO: what’s a good number? should it be configurable?
const QUEUE_SOFT_CAPACITY: usize = 100;
Contributor:

I don't think it should be configurable initially. I think this queue represents the tradeoff between memory and losing time here vs being re-scheduled onto a different router, if one is available. If we are rejected from the queue here, then we know at least we have to spend the time/work to move the job elsewhere.

That's hard to quantify, but it's likely in the order of milliseconds. Perhaps we can workshop up a rough calculation based on this thinking?
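
For context on the mechanism being discussed, here is a hypothetical sketch of soft-capacity backpressure in a tower middleware; the wrapper type and the fn-pointer field are invented for illustration, and a production version would need a real wakeup strategy instead of waking itself immediately:

use std::task::{Context, Poll};

use tower::Service;

/// Hypothetical middleware: stop reporting readiness once the compute queue
/// exceeds its soft capacity, so upstream layers hold or shed load.
struct ComputeBackpressure<S> {
    inner: S,
    /// Current number of queued compute jobs (`queue().queued_count()` above).
    queued_count: fn() -> usize,
    /// The soft capacity under discussion (`QUEUE_SOFT_CAPACITY`).
    soft_capacity: usize,
}

impl<S, Req> Service<Req> for ComputeBackpressure<S>
where
    S: Service<Req>,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        if (self.queued_count)() >= self.soft_capacity {
            // Crude stand-in for a proper wakeup: ask to be polled again soon.
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        self.inner.poll_ready(cx)
    }

    fn call(&mut self, req: Req) -> Self::Future {
        self.inner.call(req)
    }
}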


Resolved review threads: apollo-router/src/compute_job.rs, apollo-router/src/ageing_priority_queue.rs
@garypen (Contributor) left a comment:

The remaining decision about QUEUE_SOFT_CAPACITY is problematic. It might be best understood as a function of the thread_pool_size() for now.

Proposal: 20 * thread_pool_size() as a starting point.

I'm approving this as is, but I'd like to know what you think of my sizing suggestions.
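
In code, that suggestion amounts to something like the following (a sketch of the proposal, reusing the thread_pool_size() function shown earlier; not necessarily what landed):

/// Suggested starting point: scale the soft capacity with the number of
/// worker threads instead of hard-coding 100.
fn queue_soft_capacity() -> usize {
    20 * thread_pool_size().get()
}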




#[tokio::test]
async fn test_parallelism() {
    if thread_pool_size().get() < 2 {
Contributor:

:)

@SimonSapin SimonSapin enabled auto-merge (squash) November 19, 2024 13:37
@SimonSapin (Contributor, Author) commented:

I’ve updated the changelog and enabled auto-merge

@SimonSapin merged commit 8e1928c into dev Nov 19, 2024
14 checks passed
@SimonSapin deleted the simon/compute-jobs branch November 19, 2024 13:52
@abernix mentioned this pull request Nov 26, 2024
BrynCooke added a commit that referenced this pull request Nov 26, 2024