Return of the OCR'd crash narrative #1568

Merged
merged 29 commits into from
Oct 3, 2024
Conversation

johnclary
Member

@johnclary johnclary commented Sep 26, 2024

Associated issues

Testing

  1. You will need a copy of the database that includes the change_log_crashes_cris table data. To include just this one change log table in your replica, comment out change_log_crashes_cris right here in the helper script.
# ./vision-zero helper script
TABLES_TO_IGNORE = [
    "public.change_log_crashes",
    # "public.change_log_crashes_cris",
    "public.change_log_units_cris",
    "public.change_log_people_cris",
    "public.change_log_people",
    "public.change_log_units",
    "public.change_log_people_edits",
    "public.change_log_units_edits",
    "public.change_log_crashes_edits",
]
  2. Now replicate as normal. This will download ~4 GB of data:
./vision-zero replicate-db
  3. Apply migrations and metadata:
hasura migrate apply
hasura metadata apply
  4. Now you will use the change log to backfill narratives that were erased in recent weeks. Locate the commented-out update statement at the bottom of the 1727451510064_preserve_crash_narrative up migration (here). Manually execute this update statement in your SQL client.

  5. At this point there should be ~800 remaining crashes that are missing a narrative and need to be OCR'd. You can check this by querying the new SQL view:

select count(*) from view_crash_narratives_ocr_todo;
  6. It's time to test the OCR script. Head to the ./etl/cris_import directory and rebuild the ETL Docker image:
docker compose build
  7. We need to set our environment to use the prod S3 bucket, because it contains all of the CR3 PDFs that we will process. The best way to do this is to save a copy of your existing .env file as prod.local.env and set the BUCKET_ENV value to prod in this new file (this PR renames the env var to BUCKET_ENV). Make sure your Hasura endpoint is set to your local host.

  8. Now it really is time to run the OCR script. Use this run command to run the OCR process with your new env file:

docker run -it  --rm --env-file prod.local.env --network host -v $PWD:/app cris_import_cris_import ./cr3_ocr_narrative.py --verbose
  9. Nice! You can inspect the new narratives by querying the crashes_edits table:
select
    id,
    updated_at,
    updated_by,
    investigator_narrative
from
    crashes_edits
where
    investigator_narrative is not null;
  10. Now we will test the CRIS import and make sure that our narrative-preserving trigger is working as expected. First, let's inspect a batch of crashes that should have investigator narratives, observing that the narrative is populated for all records:
select
    investigator_narrative
from
    crashes
where
    cris_crash_id in (18510430, 18548351, 18553517, 18553518, 18553519, 18553520, 18553522, 18553525, 18553527, 18553532, 18566225, 18569950, 18578152, 18600315, 18613803, 18615918, 18634346, 18640454, 18640455, 18945007, 19208347);
  11. I have placed a CRIS extract in the dev S3 inbox for testing (extract_2023_20240823135638635_99726_20240920_HAYSTRAVISWILLIAMSON), and this extract is missing a large number of crash narratives, including for the crashes listed above. Let's make sure those narratives are not overwritten, thanks to the new DB trigger.

Make sure your BUCKET_ENV is set to dev in your .env file, and run the CRIS import like so:

docker run -it  --rm --env-file .env --network host -v $PWD:/app cris_import_cris_import ./cris_import.py --csv --s3-download
  12. Excellent! Now query the DB again and make sure these crash narratives are still intact:
select
    investigator_narrative
from
    crashes
where
    cris_crash_id in (18510430, 18548351, 18553517, 18553518, 18553519, 18553520, 18553522, 18553525, 18553527, 18553532, 18566225, 18569950, 18578152, 18600315, 18613803, 18615918, 18634346, 18640454, 18640455, 18945007, 19208347);

That's it! Thank you for testing this PR 👍


Ship list

  • Check migrations for any conflicts with latest migrations in main branch
  • Confirm Hasura role permissions for necessary access
  • Code reviewed
  • Product manager approved

@johnclary johnclary added the WIP Work in progress label Sep 26, 2024
@johnclary johnclary changed the base branch from main to jc-download-db-data September 30, 2024 15:38

1. Start your local Vision Zero cluster (database + Hasura + editor).

2. Save a copy of the `env_template` file as `.env`, and fill in the details. Set the `BUCKET_ENV` variable to `dev` in order to safely run the S3 operations locally.
Member Author

@johnclary johnclary Sep 30, 2024

i decided to change the env var ENV to BUCKET_ENV, just to be super clear that only S3 operations are affected by this variable. For example, you could set BUCKET_ENV to prod and still point your hasura endpoint/secret to local, and vice-versa.
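For example, a hypothetical prod.local.env could look like this (the Hasura variable names and values below are assumptions; check your env_template for the real names):

```shell
# prod.local.env (hypothetical example): S3 operations hit prod,
# while Hasura still points at the local stack
BUCKET_ENV=prod
# assumption: your env_template may name these variables differently
HASURA_GRAPHQL_ENDPOINT=http://localhost:8084
HASURA_GRAPHQL_ADMIN_SECRET=hasurapassword
```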

Member

I like this change. What do you think about updating the env_template with the new var name?

narrative_image = page.crop(bbox)

# uncomment to save a copy of the cropped narrative image
# narrative_image.save(f"temp/{cris_crash_id}_narrative_{cr3_version}.jpeg")
Member Author

this is essential for debugging this, so i figured i'd leave it in here 🤷


logger.debug("Extracting narrative text...")
narrative = None
narrative = image_to_string(narrative_image, lang="eng")
Member Author

the old version of the script used a try/except block for this call, but it never errored for me, even when parsing some gnarly old garbage PDFs. my thinking was that for now it would be better to know if this breaks somehow, but if it becomes a nuisance we can easily revisit.
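If we do revisit, a minimal sketch of the guarded variant could look like this (the OCR callable is injected purely to keep the sketch self-contained; the real script calls pytesseract's image_to_string directly):

```python
def ocr_narrative(narrative_image, ocr_fn):
    """Run OCR on a cropped narrative image, returning None on failure.

    ocr_fn stands in for pytesseract.image_to_string; it is injected
    here only so this sketch is self-contained and testable.
    """
    try:
        return ocr_fn(narrative_image, lang="eng")
    except Exception:
        # old-script behavior: swallow the error and move on to the next crash
        return None
```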

crash = todos[index]
errors.append([crash["cris_crash_id"], e])

if errors:
Member Author

this is the same behavior as the crash diagram extractor

return f"v1_{page_size}"

return f"v2_{page_size}"
width, height = page.size
Member Author

i discovered this much easier method for getting the image size 👍

"v1_large": (296, 3683, 2580, 5749),
"v2_small": (30, 791, 650, 1430),
"v2_large": (90, 3026, 2496, 5466),
}
Member Author

it occurred to me that we might want to start storing this version in the database. another option would be to use the python PDF package to actually add metadata to the PDF itself indicating which version.

i spent quite a lot of time testing different ways to identify which PDF version is being used, including by parsing the CR3 form revision date using OCR. i learned two things: 1) there are many more form revisions than what we test for (at least 5 as far as i can tell) and 2) it is hard to reliably distinguish between them.

I have more thoughts on this if y'all want to chat CR3 versions. My short summary is that what we're doing now captures 99% of what we need.
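To illustrate the page-size approach (the dimensions and ratio below are hypothetical, not the actual thresholds the script uses):

```python
def detect_cr3_version(width, height):
    """Classify a rendered CR3 page by its pixel dimensions.

    Hypothetical sketch of the page-size heuristic described above;
    the cutoffs here are illustrative, not the script's real values.
    """
    page_size = "large" if width >= 2000 else "small"
    # assumption (illustrative): v2 forms render with a taller
    # height-to-width ratio than v1 forms
    if height / width > 2.0:
        return f"v2_{page_size}"
    return f"v1_{page_size}"
```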

Contributor

@mddilley mddilley left a comment

All the steps worked as expected, and I was able to see narratives OCR'd and verify that the narratives would not be overwritten. Thanks for sprucing this up and providing the SQL. Made testing a breeze. 🚢 🙌

Contributor

nice! i watched the row count decrease as i ran the OCR script 🚀 Very cool to have the conditions for OCRing stored in a DB view this way.

)
execute procedure public.crashes_cris_set_old_investigator_narrative();

comment on function public.crashes_cris_set_old_investigator_narrative is 'Sets the
Contributor

awesome - this comment answered my question before i could ask it! 🙏

Contributor

+1 loving the function comments

Member

@chiaberry chiaberry left a comment

Thanks for the test steps and the additions to the README.

I welcome the return of the OCR'd crash narrative

@johnclary johnclary changed the base branch from jc-download-db-data to main October 2, 2024 20:56
and investigator_narrative is NULL
and (
investigator_narrative_ocr_processed_at is NULL
or cr3_processed_at >= investigator_narrative_ocr_processed_at
Contributor

so if we get a new CR3 that was processed, we have to OCR the investigator narrative again?


3. Build and run the docker image. This will drop you into the docker image's shell:

```shell
$ docker compose build # <- only need to do this once
$ docker compose run cris_import
docker compose build # <- do this once, and when dependencies change
Contributor

🙏

"investigator_narrative_ocr_processed_at": datetime.now(
timezone.utc
).isoformat(),
"updated_by": "dts_automation",
Contributor

should this be our standard naming convention going forward for automated tasks/etls making updates?

Member Author

it would be nice to agree on a convention—i did feel 100% great about this and am very open to ideas.

Member

you did feel 100% great? or are you looking for other ideas?
would you want to label anything updated by our scripts with dts_automation, or make it more specific with which script updated it? like cris_import_script

Member Author

did not* feel 100% great. yeah, i was wondering if we see benefit in being specific about what task made the updates or just using the generic dts_automation.

return

futures = []
with ProcessPoolExecutor(max_workers=cli_args.workers) as executor:
Contributor

so this whole block is for creating multiple parallel processes to run the extract_narrative_pdf function concurrently for the crashes in the todos list?

Member Author

yep!
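In rough outline, that pattern looks like this (a simplified sketch: extract_narrative_stub stands in for the real extract_narrative_pdf work, and the executor class is parameterized only so the sketch is easy to test with threads):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed

def extract_narrative_stub(cris_crash_id):
    # stand-in for the real per-crash PDF fetch + OCR work
    return cris_crash_id, f"narrative for {cris_crash_id}"

def process_todos(todos, workers=4, executor_cls=ProcessPoolExecutor):
    """Fan the todo list out across workers, collecting per-crash errors."""
    results, errors = [], []
    with executor_cls(max_workers=workers) as executor:
        # map each future back to its crash id so failures can be attributed
        future_to_crash = {
            executor.submit(extract_narrative_stub, crash_id): crash_id
            for crash_id in todos
        }
        for future in as_completed(future_to_crash):
            try:
                results.append(future.result())
            except Exception as e:
                errors.append([future_to_crash[future], e])
    return results, errors
```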

4 participants