Return of the OCR'd crash narrative #1568
Conversation
```
1. Start your local Vision Zero cluster (database + Hasura + editor).
2. Save a copy of the `env_template` file as `.env`, and fill in the details. Set the `BUCKET_ENV` variable to `dev` in order to safely run the S3 operations locally.
```
i decided to change the env var `ENV` to `BUCKET_ENV`, just to be super clear that only S3 operations are affected by this variable. For example, you could set `BUCKET_ENV` to `prod` and still point your hasura endpoint/secret to local, and vice-versa.
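A minimal sketch of what that separation implies — the bucket choice keys off `BUCKET_ENV` alone, independent of any Hasura settings. The bucket names below are hypothetical placeholders, not the project's real buckets:

```python
import os

def s3_bucket_name():
    """Choose the S3 bucket from BUCKET_ENV alone; Hasura endpoint/secret
    are configured separately. Bucket names are hypothetical placeholders."""
    bucket_env = os.getenv("BUCKET_ENV", "dev")
    if bucket_env == "prod":
        return "vision-zero-cr3-prod"  # hypothetical prod bucket name
    return "vision-zero-cr3-dev"  # hypothetical dev bucket name
```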
I like this change, what do you think about updating the `env_template` with the new var name?
```python
narrative_image = page.crop(bbox)

# uncomment to save a copy of the cropped narrative image
# narrative_image.save(f"temp/{cris_crash_id}_narrative_{cr3_version}.jpeg")
```
this is essential for debugging this, so i figured i'd leave it in here 🤷
```python
logger.debug("Extracting narrative text...")
narrative = None
narrative = image_to_string(narrative_image, lang="eng")
```
the old version of the script used a `try`/`except` block for this call, but it never errored for me, even when parsing some gnarly old garbage PDFs. my thinking was that for now it would be better to know if this breaks somehow, but if it becomes a nuisance we can easily revisit.
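The two behaviors under discussion can be sketched as one helper — `strict=True` lets failures propagate (the new script's preference), while `strict=False` restores the old swallow-and-continue behavior. `ocr_fn` is a stand-in for pytesseract's `image_to_string`; the helper itself is illustrative, not the script's code:

```python
def extract_text(image, ocr_fn, strict=True):
    """Run an OCR function over a cropped image.

    strict=True  -> exceptions propagate (fail loudly, the new behavior)
    strict=False -> the old try/except behavior: return None on failure
    """
    try:
        return ocr_fn(image)
    except Exception:
        if strict:
            raise
        return None
```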
```python
crash = todos[index]
errors.append([crash["cris_crash_id"], e])

if errors:
```
this is the same behavior as the crash diagram extractor
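That shared pattern — keep going on failure, record `[cris_crash_id, exception]` pairs, report at the end — can be sketched like this (`worker` stands in for the per-crash PDF handler; the loop shape is illustrative):

```python
def process_crashes(todos, worker):
    """Process each crash, collecting [cris_crash_id, exception] pairs for
    failures instead of aborting the whole run (the same pattern the crash
    diagram extractor uses)."""
    errors = []
    for crash in todos:
        try:
            worker(crash)
        except Exception as e:
            errors.append([crash["cris_crash_id"], e])
    return errors
```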
```python
return f"v1_{page_size}"

return f"v2_{page_size}"
width, height = page.size
```
i discovered this much easier method for getting the image size 👍
```python
"v1_large": (296, 3683, 2580, 5749),
"v2_small": (30, 791, 650, 1430),
"v2_large": (90, 3026, 2496, 5466),
}
```
it occurred to me that we might want to start storing this version in the database. another option would be to use the python PDF package to actually add metadata to the PDF itself indicating which version.
i spent quite a lot of time testing different ways to identify which PDF version is being used, including by parsing the CR3 form revision date using OCR. i learned two things: 1) there are many more form revisions than what we test for (at least 5 as far as i can tell) and 2) it is hard to reliably distinguish between them.
I have more thoughts on this if y'all want to chat CR3 versions. My short summary is that what we're doing now captures 99% of what we need.
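For reference, the lookup being discussed can be sketched as a plain dict keyed by version + page size. The coordinates are copied from the diff excerpt above (the dict there is truncated, so only the visible entries are reproduced); the helper function is illustrative, not the script's code:

```python
# Narrative bounding boxes (left, upper, right, lower) in pixels, keyed by
# CR3 form version and page size. Only the entries visible in the diff are
# reproduced here; the real dict may have more.
NARRATIVE_BBOXES = {
    "v1_large": (296, 3683, 2580, 5749),
    "v2_small": (30, 791, 650, 1430),
    "v2_large": (90, 3026, 2496, 5466),
}

def narrative_bbox(cr3_version):
    """Return the crop box for a version key like 'v2_large'."""
    try:
        return NARRATIVE_BBOXES[cr3_version]
    except KeyError:
        raise ValueError(f"no narrative bbox for CR3 version {cr3_version!r}")
```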
All the steps worked as expected, and I was able to see narratives OCR'd and verify that the narratives would not be overwritten. Thanks for sprucing this up and providing the SQL. Made testing a breeze. 🚢 🙌
nice! i watched the row count decrease as i ran the OCR script 🚀 Very cool to have the conditions for OCRing stored in a DB view this way.
```sql
)
execute procedure public.crashes_cris_set_old_investigator_narrative();

comment on function public.crashes_cris_set_old_investigator_narrative is 'Sets the
```
awesome - this comment answered my question before i could ask it! 🙏
+1 loving the function comments
Thanks for the test steps and the additions to the README.
I welcome the return of the OCR'd crash narrative
```sql
and investigator_narrative is NULL
and (
    investigator_narrative_ocr_processed_at is NULL
    or cr3_processed_at >= investigator_narrative_ocr_processed_at
```
so if we get a new CR3 that has been processed, we have to OCR the investigator narrative again?
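Restating the view's filter as a small Python predicate may make the logic easier to follow (a sketch; argument names are assumed, and both timestamps are assumed non-null when compared):

```python
def needs_narrative_ocr(investigator_narrative, ocr_processed_at, cr3_processed_at):
    """A crash qualifies for OCR when it has no narrative and either has
    never been OCR'd, or its CR3 was processed at or after the last OCR
    run (i.e., a newer PDF arrived since)."""
    if investigator_narrative is not None:
        return False
    return (
        ocr_processed_at is None
        or cr3_processed_at >= ocr_processed_at
    )
```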
```
3. Build and run the docker image. This will drop you into the docker image's shell:
```

```shell
docker compose build # <- do this once, and when dependencies change
docker compose run cris_import
```
🙏
```python
"investigator_narrative_ocr_processed_at": datetime.now(
    timezone.utc
).isoformat(),
"updated_by": "dts_automation",
```
should this be our standard naming convention going forward for automated tasks/etls making updates?
it would be nice to agree on a convention—i did feel 100% great about this and am very open to ideas.
you did feel 100% great? or are you looking for other ideas?
would you want to label anything updated by our scripts with `dts_automation`, or make it more specific with which script updated it? like `cris_import_script`
did not* feel 100% great. yeah, i was wondering if we see benefit in being specific about what task made the updates or just using the generic `dts_automation`.
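The payload fields under discussion can be sketched as a tiny helper, which would also make the attribution value a single parameter if a convention is settled on later. The key names mirror the snippet above; the helper itself is hypothetical:

```python
from datetime import datetime, timezone

def audit_fields(updated_by="dts_automation"):
    """Build the mutation fields that record when the OCR ran and which
    automation ran it (a sketch; only the key names come from the PR)."""
    return {
        "investigator_narrative_ocr_processed_at": datetime.now(
            timezone.utc
        ).isoformat(),
        "updated_by": updated_by,
    }
```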
```python
return

futures = []
with ProcessPoolExecutor(max_workers=cli_args.workers) as executor:
```
so this whole block is for creating multiple parallel processes to run the `extract_narrative_pdf` function concurrently for the crashes in the todos list?
yep!
Associated issues

Testing

… `change_log_crashes_cris` table data. To include just this one change log table in your replica, comment out `change_log_crashes_cris` right here in the helper script.

Now you will use the change log to backfill narratives that have been erased in recent weeks. Locate the commented-out `update` statement at the bottom of the `1727451510064_preserve_crash_narrative` up migration (here). Manually execute this update command in your sql client.

At this point there should be ~800 remaining crashes that are missing a narrative and need to be OCR'd. You can check this by querying the new SQL view:

… `./etl/cris_import` directory and rebuild the ETL docker image:

We need to set our environment to use the prod S3 bucket, because it contains all of the CR3 PDFs that we will process. The best way to do this is to save a copy of your existing `.env` file as `prod.local.env`, and set the `BUCKET_ENV` value to `prod` in this new file (this PR renames the `ENV` var to `BUCKET_ENV`). Make sure your Hasura endpoint is set to your local host.

Now it really is time to run the OCR script. Use this run command to run the OCR process with your new env file:

```shell
docker run -it --rm --env-file prod.local.env --network host -v $PWD:/app cris_import_cris_import ./cr3_ocr_narrative.py --verbose
```

… `crashes_edits` table:

… (`extract_2023_20240823135638635_99726_20240920_HAYSTRAVISWILLIAMSON`), and this extract is missing a large number of crash narratives, including for those crashes listed above. Let's make sure those narratives are not going to be overwritten, thanks to the new DB trigger.

Make sure your `BUCKET_ENV` is set to `dev` in your `.env` file, and run the CRIS import like so:

That's it—thank you for testing this PR 👍

Ship list

… `main` branch