Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove expanded config files. Move entity-level hint computation into indexing job. #485

Merged
merged 27 commits into from
Jul 31, 2023

Conversation

marikomedlock
Copy link
Collaborator

@marikomedlock marikomedlock commented Jul 26, 2023

This PR has a lot of files, but the vast majority are trivial changes to config or test files (e.g. delete, rename, new index BQ dataset name). When reviewing, I recommend opening the "File filter" and un-checking ".json" and ".sql", which leaves ~30 code files.

Goals:

  1. Reduce the number of future PRs we have with lots of these trivial config file changes. After this PR, most changes to the generated index tables or schemas won't produce any config file changes.
  2. Reduce the maintenance burden per underlay by removing the second set of expanded config files.
  3. Remove the computed hints from source code. Now they live in the index dataset, same as all other pre-computed information.

Changes:

  • Moved the entity-level hint (e.g. person.gender, not modifiers) computation from the expand config step into its own indexing job.
    • The indexing job computes the hints the same as before -- by querying BQ directly for each attribute.
    • Prior to this change, it wrote the hints to to a new "expanded" version of the original config files. Now it writes them to a BQ table in the index dataset.
    • The hints API endpoint now queries this BQ table instead of reading from the expanded config file. This takes longer, but I'm not expecting this to be a frequently called endpoint.
  • Replaced the required EXPAND_CONFIG indexing command with an optional VALIDATE_CONFIG indexing command.
    • The validation command checks that the data types are set and match the expected data type, which is computed by scanning the source data the same way that EXPAND_CONFIG did.
    • The new validate command does not write out a new "expanded" set of config files, the way the previous expand command did.
  • Made attribute data types required in the config files.
    • Prior to this change, the expand config step looked up the data types automatically and set them. Now the user has to specify them. This is less convenient, but the new optional VALIDATE_CONFIG command looks up the data types and prints them out to make this a little easier.
  • Updated all underlays.
    • Set the data types for all attributes, copying them over from the "expanded" files.
    • Deleted the expanded directory. Now all we have are the "original" files.
    • Ran the new VALIDATE_CONFIG command successfully.
    • Partially re-indexed the broad and verily underlays (vumc and aou.test will need to re-index themselves). The only indexing change was a new job, so instead of re-running the entire indexing for all underlays, I copied the most recent index dataset and just re-ran indexing. This only picked up the new jobs. It would have been cleaner to re-run from scratch, but given I'm planning more indexing changes in the near future, was trying to cut down on time I spend re-running.
    • Smoke-tested the broad and verily underlays with a local server + UI.
  • Updated the config validation GHA.
    • Removed check that configs are expanded.
    • Updated the list of underlay pairs that we expect to be in-sync.
      • cms_synpuf broad/verily
      • sdd vumc/verily
      • sdd_refresh0323 vumc/verily
    • Removed aou_synthetic broad/verily pair because there are data type differences (DATE -> STRING) in the source datasets. Prior to this change, the original config files did not require data types to be specified so we could do this comparison. Now data types are required, so we cannot.
  • Fixed a bug in the BQ row processor when the columns may be returned in a different order than the request.

@marikomedlock marikomedlock merged commit ab29390 into main Jul 31, 2023
7 checks passed
@marikomedlock marikomedlock deleted the mm-move-hints-into-indexing-job branch July 31, 2023 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants