-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data versioning support #99
Conversation
@@ -1,5 +1,6 @@ | |||
[project] | |||
name = "aggregator" | |||
requires-python = ">= 3.9" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a little unrelated, but since isort pulled up a couple instances of typing reorgs, i pinned this version and removed all Dict/List typing in favor of 3.9+ dict/list.
3fc07f7
to
f8b88d1
Compare
scripts/migrate_versioning.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent with this file:
- It should change all existing S3 files to add .../000/..., matching studies that don't have a version currently.
- It should modify state tracking dicts to include the same as a nested key
- It shouldn't try to modify files again if it's already been applied
After this is applied, I will either remove it in a followon commit or change it so that it is not runnable unless you :really: try, just in case we want an easy template for a future version of a similar operation. I could also be convinced to move this to a different folder - perhaps ./migrations?
I've already run this script on the dev bucket. If you want, I can reset it by copying over prod and you can give it a run yourself. The initial copy is slow - if this wasn't a one and done I'd put a progress bar on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, I don't need to run it. Sounds good - I can see some value as keeping it around, but you can just as easily add a line to some file here that's like Go grab scripts/migrate_versioning.py from revisions xyz to see an example of ...
and not keep it current. Git lets us be ruthless 😄
scripts/migrate_versioning.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eh, I don't need to run it. Sounds good - I can see some value as keeping it around, but you can just as easily add a line to some file here that's like Go grab scripts/migrate_versioning.py from revisions xyz to see an example of ...
and not keep it current. Git lets us be ruthless 😄
src/handlers/shared/functions.py
Outdated
) | ||
dt = dt or datetime.now(timezone.utc) | ||
data_package_metadata[target] = dt.isoformat() | ||
version_metadata[target] = dt.isoformat() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the version for a target getting set to a date? That bumped me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that comes back to this data structure (as an example from the unit tests):
EXISTING_VERSION: {
"version": "1.0",
"last_upload": "2023-02-24T15:08:06+00:00",
"last_data_update": "2023-02-24T15:08:07.771080+00:00",
"last_aggregation": "2023-02-24T15:08:07.771080+00:00",
"last_error": None,
"deleted": None,
}
Other than the transaction format version, these are always either datetimes or none.
But I can do two things:
- Update the docstring to make this explicit
One of:- Raise an error if you pass in whatever the final name of the versioning key is
- Change the name of the function to make it explicit that this is for setting fields in a transaction that are datetimes, and allowlist the names of datetime fields (A thing that jamie has asked for, for example, which I may not do here but should leave the option open for, is having some kind of summary metadata about an upload in here, like number of rows)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't push for you doing anything, I think this was me just not being familiar with the data structures / flow here. And maybe the re-use of the increasingly common word version
- is version_metadata
really a better name than data_package_metadata
here? Maybe another word like data_version_metadata
.
But regardless, it's fine as is.
This PR makes the following changes:
isort
so I didn't have to manually manipulate the same list five times, and added a pre-commit hook for it.Related library PR is here.