Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data versioning support #99

Merged
merged 9 commits into from
Aug 14, 2023
Merged

Data versioning support #99

merged 9 commits into from
Aug 14, 2023

Conversation

dogversioning
Copy link
Contributor

@dogversioning dogversioning commented Aug 8, 2023

This PR makes the following changes:

  • The aggregator will now look for a version param in the body of an upload, and use that to route files to a new version subfolder
    • This change basically impacts the entire site uploading workflow, so there's a :lot: of changes here, but it's mostly adding a new level of nesting
  • Unit testing refactoring was a bit of a headache for this, with various different vintages of the "cannonical" data model, so the domain logic-specific variables were moved to constants in conftest.
  • Related to the above - added isort so I didn't have to manually manipulate the same list five times, and added a pre-commit hook for it.
  • Added a script for migrating existing data. Since we're only managing ourselves for this, it's pretty bare bones by design.

Related library PR is here.

@@ -1,5 +1,6 @@
[project]
name = "aggregator"
requires-python = ">= 3.9"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little unrelated, but since isort pulled up a couple instances of typing reorgs, i pinned this version and removed all Dict/List typing in favor of 3.9+ dict/list.

tests/utils.py Outdated Show resolved Hide resolved
Copy link
Contributor Author

@dogversioning dogversioning Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent with this file:

  • It should change all existing S3 files to add .../000/..., matching studies that don't have a version currently.
  • It should modify state tracking dicts to include the same as a nested key
  • It shouldn't try to modify files again if it's already been applied

After this is applied, I will either remove it in a followon commit or change it so that it is not runnable unless you :really: try, just in case we want an easy template for a future version of a similar operation. I could also be convinced to move this to a different folder - perhaps ./migrations?

I've already run this script on the dev bucket. If you want, I can reset it by copying over prod and you can give it a run yourself. The initial copy is slow - if this wasn't a one and done I'd put a progress bar on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I don't need to run it. Sounds good - I can see some value as keeping it around, but you can just as easily add a line to some file here that's like Go grab scripts/migrate_versioning.py from revisions xyz to see an example of ... and not keep it current. Git lets us be ruthless 😄

pyproject.toml Outdated Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, I don't need to run it. Sounds good - I can see some value as keeping it around, but you can just as easily add a line to some file here that's like Go grab scripts/migrate_versioning.py from revisions xyz to see an example of ... and not keep it current. Git lets us be ruthless 😄

src/handlers/shared/functions.py Show resolved Hide resolved
)
dt = dt or datetime.now(timezone.utc)
data_package_metadata[target] = dt.isoformat()
version_metadata[target] = dt.isoformat()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the version for a target getting set to a date? That bumped me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so that comes back to this data structure (as an example from the unit tests):

                    EXISTING_VERSION: {
                        "version": "1.0",
                        "last_upload": "2023-02-24T15:08:06+00:00",
                        "last_data_update": "2023-02-24T15:08:07.771080+00:00",
                        "last_aggregation": "2023-02-24T15:08:07.771080+00:00",
                        "last_error": None,
                        "deleted": None,
                    }

Other than the transaction format version, these are always either datetimes or none.

But I can do two things:

  • Update the docstring to make this explicit
    One of:
    • Raise an error if you pass in whatever the final name of the versioning key is
    • Change the name of the function to make it explicit that this is for setting fields in a transaction that are datetimes, and allowlist the names of datetime fields (A thing that jamie has asked for, for example, which I may not do here but should leave the option open for, is having some kind of summary metadata about an upload in here, like number of rows)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't push for you doing anything, I think this was me just not being familiar with the data structures / flow here. And maybe the re-use of the increasingly common word version - is version_metadata really a better name than data_package_metadata here? Maybe another word like data_version_metadata.

But regardless, it's fine as is.

tests/utils.py Outdated Show resolved Hide resolved
@dogversioning dogversioning merged commit 69a7aaa into main Aug 14, 2023
2 checks passed
@dogversioning dogversioning deleted the mg/versioning branch February 13, 2024 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants