Reintroduce query plan reuse during warmup #5255

Draft · wants to merge 87 commits into base: dev
Conversation

@Geal (Contributor) commented May 27, 2024

Fix #5160

TL;DR: This PR reintroduces the schema-aware query hashing algorithm, adds a configuration option to enable it for query plan reuse during warmup, and includes performance and reliability fixes around cache key generation.

Problem

The router implements a query hashing algorithm that takes the schema into account: if the parts of the schema relevant to a query do not change across a schema update (for example, a field is added to an unrelated type), the hash does not change either.
This is useful in two ways:

  • when updating the schema, query plans must be regenerated for existing queries. We can use the supergraph.query_planning.warmed_up_queries option to pregenerate query plans before switching traffic to the new schema, but that can mean planning a lot of queries, heavy CPU usage, and slow schema updates. The schema-aware hash lets us detect that a schema update does not affect a query, so we can reuse its query plan and avoid regenerating most of the query plans. This is the goal of the experimental_reuse_query_plans option.
  • in entity caching, we hash the query as part of the cache key, but that is not enough: we also need to check that the relevant parts of the schema did not change. For example, if the type of a field changes, we do not want to keep serving data with the wrong type. At the same time, we do not want to rebuild the entire cache from scratch on every schema update. With this query hashing algorithm, we can reuse most of the entity cache across schema updates.
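The intent of schema-aware hashing can be illustrated with a small sketch (all names here are hypothetical; the router's real implementation walks the GraphQL schema, not a flat map): only the schema definitions a query actually uses feed into the hash, so unrelated schema edits leave it unchanged.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Hypothetical sketch: hash the query together with only the schema
// definitions it uses, so an unrelated schema edit does not change
// the hash. Field resolution is simplified to a name lookup.
fn schema_aware_hash(query: &str, schema: &BTreeMap<&str, &str>, used_fields: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    query.hash(&mut hasher);
    for field in used_fields {
        // Only definitions relevant to this query feed the hash.
        if let Some(def) = schema.get(field) {
            field.hash(&mut hasher);
            def.hash(&mut hasher);
        }
    }
    hasher.finish()
}

fn main() {
    let query = "{ me { id } }";
    let mut schema = BTreeMap::new();
    schema.insert("me.id", "ID!");
    schema.insert("product.price", "Float");

    let before = schema_aware_hash(query, &schema, &["me.id"]);
    // An unrelated field changes: the hash stays stable.
    schema.insert("product.price", "Int");
    assert_eq!(before, schema_aware_hash(query, &schema, &["me.id"]));

    // A field the query uses changes: the hash changes.
    schema.insert("me.id", "String!");
    assert_ne!(before, schema_aware_hash(query, &schema, &["me.id"]));
    println!("ok");
}
```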

Proposed solution

An earlier version of this algorithm shipped in a router release with significant issues that broke query plan caching, so it was promptly deactivated. This PR reintroduces the algorithm with experimental_reuse_query_plans defaulting to do_not_reuse, making its use in the query planner cache opt-in. The query hash is generated in all cases and combined with the schema hash (a hash of the schema SDL bytes); when experimental_reuse_query_plans is set to reuse, only the query hash is used. When the option is set to measure, the router counts the query plans that could have been reused by comparing old and new query plans, the same way we compare the JS and Rust planners.
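The option has three states as described above; a configuration sketch might look like this (the exact YAML location under supergraph.query_planning is assumed from the warmed_up_queries option mentioned earlier, and the value descriptions are taken from the PR text):

```yaml
supergraph:
  query_planning:
    # Pre-plan this many recent queries when a new schema arrives.
    warmed_up_queries: 100
    # Assumed values, per the PR description:
    #   do_not_reuse (default) - always re-plan during warmup
    #   reuse                  - reuse plans whose schema-aware hash is unchanged
    #   measure                - count how many plans could have been reused
    experimental_reuse_query_plans: measure
```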
It also brings significant improvements to query planner cache handling: prefixes make the Redis key self-describing, less data is serialized when generating the cache key, and the Rust planner's configuration is now part of the cache key.
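A self-describing, prefixed cache key could look like the following sketch (the layout and prefix names are illustrative, not the router's actual key format): each component carries a label naming what it is, so a key can be inspected, and prefixes versioned, without deserializing anything.

```rust
// Hypothetical sketch of a self-describing cache key: every part is
// labeled, so "what went into this key" is visible in Redis itself.
fn cache_key(version: &str, schema_hash: &str, query_hash: &str, config_hash: &str) -> String {
    format!("plan:{version}:schema:{schema_hash}:query:{query_hash}:config:{config_hash}")
}

fn main() {
    let key = cache_key("1", "abc123", "def456", "789fff");
    assert_eq!(key, "plan:1:schema:abc123:query:def456:config:789fff");
    println!("{key}");
}
```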

TODO:

  • configuration option to reactivate the schema hash: the hash is now always generated, and by default the cache key uses both the query hash and the schema hash. If the experimental_reuse_query_plans option is enabled, the schema hash is left out
  • metric to measure the number of query plans that could have been reused during warmup
  • measure how many of the query plans that could have been reused are actually the same
  • check the federation version: for the in-memory cache we don't need to check it, because a running router instance only ever has one federation version
  • remove the custom Hash implementation for the caching key
  • use prefixes for each part of the redis cache key so they become self describing
  • remove JSON serialization
  • hash the Rust planner's config once a new version has been released (it implements Hash now)
  • configuration migration for the experimental_reuse_query_plans option from a boolean to an enum
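The last TODO item, migrating experimental_reuse_query_plans from a boolean to an enum, could be sketched as follows (the enum variants come from the PR description; the exact mapping from the old boolean is an assumption):

```rust
// Hypothetical sketch of the boolean-to-enum configuration migration:
// old configs used true/false; the migrated option has three states.
#[derive(Debug, PartialEq)]
enum ReuseQueryPlans {
    DoNotReuse,
    Reuse,
    Measure,
}

fn migrate(old: Option<bool>) -> ReuseQueryPlans {
    match old {
        // Assumed mapping: the old `true` opts into reuse.
        Some(true) => ReuseQueryPlans::Reuse,
        // `false` and "unset" both map to the default.
        Some(false) | None => ReuseQueryPlans::DoNotReuse,
    }
}

fn main() {
    assert_eq!(migrate(Some(true)), ReuseQueryPlans::Reuse);
    assert_eq!(migrate(None), ReuseQueryPlans::DoNotReuse);
    println!("ok");
}
```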

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.


@router-perf (bot) commented May 27, 2024

CI performance tests

  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • reload - Reload test over a long period of time at a constant rate of users
  • large-request - Stress test with a 1 MB request payload
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • const - Basic stress test that runs with a constant number of users
  • step - Basic stress test that steps up the number of users over time
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • demand-control-instrumented - A copy of the step test, but with demand control monitoring and metrics enabled
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • no-graphos - Basic stress test, no GraphOS.
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • xlarge-request - Stress test with 10 MB request payload
  • xxlarge-request - Stress test with 100 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • step-with-prometheus - A copy of the step test with the Prometheus metrics exporter enabled
  • demand-control-uninstrumented - A copy of the step test, but with demand control monitoring enabled

@svc-apollo-docs (Collaborator) commented Oct 18, 2024

✅ Docs Preview Ready

No new or changed pages found.

@abernix abernix requested a review from IvanGoncharov October 21, 2024 09:43
```rust
.update(serde_json::to_vec(&self.config_mode).expect("serialization should not fail"));
hasher.update(&*self.schema_id);
let mut hasher = StructHasher::new();
self.metadata.hash(&mut hasher);
```
@Geal (Contributor Author) commented on this diff:

should hash key names, not only values

Geal added a commit that referenced this pull request Nov 5, 2024
Fix #5160

This splits part of the work from #5255 to make it easier to merge.
This covers improvements and fixes to the query planner cache key from changes related to the query hashing algorithm and query plan reuse during warmup.

Fixed:
* use prefixes for each part of the redis cache key so they become self describing
* remove the custom Hash implementation for the cache key
* remove JSON serialization
* hash the Rust planner's config only once, not on every cache query

Co-authored-by: Ivan Goncharov <ivan.goncharov.ua@gmail.com>
Co-authored-by: Jeremy Lempereur <jeremy.lempereur@iomentum.com>
Co-authored-by: Gary Pennington <gary@apollographql.com>
Co-authored-by: Jesse Rosenberger <git@jro.cc>
Co-authored-by: Renée <renee.kooi@apollographql.com>
@Geal (Contributor Author) commented Nov 5, 2024

Now that #6206 is merged into dev, 59d1e2c merges those changes back here.

@Geal Geal changed the title Fix cache key hashing algorithm Reintroduce query plan reuse during warmup Nov 5, 2024
LongLiveCHIEF pushed a commit to StateFarmIns/router that referenced this pull request Nov 21, 2024
(same commit message as above, with apollographql#-prefixed issue references)
@IvanGoncharov IvanGoncharov marked this pull request as draft November 26, 2024 19:53
@IvanGoncharov (Member) commented:
Converting to draft due to internal discussion about the risks introduced by this PR.
We want to add "query hash stable under incremental schema changes" as a Router feature.
However, in its current form it is implemented outside of the Query Planner, with no mechanism to determine which parts of the schema the Query Planner actually uses.

For example, adding new directives that influence query planning without updating this code could produce a hash collision: a schema update changes the query plan, but the hash stays the same.

Labels: none yet
Projects: none yet
Development: successfully merging this pull request may close issue "Query plan cache key follow up"
9 participants