Return of the OCR'd crash narrative #1568
Conversation
```
1. Start your local Vision Zero cluster (database + Hasura + editor).
2. Save a copy of the `env_template` file as `.env`, and fill in the details. Set the `BUCKET_ENV` variable to `dev` in order to safely run the S3 operations locally.
```
i decided to change the env var `ENV` to `BUCKET_ENV`, just to be super clear that only S3 operations are affected by this variable. For example, you could set `BUCKET_ENV` to `prod` and still point your hasura endpoint/secret to local, and vice-versa.
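A minimal sketch of what that separation implies — the bucket choice keys off `BUCKET_ENV` alone, independent of any Hasura settings. The bucket names below are hypothetical placeholders, not the project's real buckets:

```python
import os

def s3_bucket_name():
    """Choose the S3 bucket from BUCKET_ENV alone; Hasura endpoint/secret
    are configured separately. Bucket names are hypothetical placeholders."""
    bucket_env = os.getenv("BUCKET_ENV", "dev")
    if bucket_env == "prod":
        return "vision-zero-cr3-prod"  # hypothetical prod bucket name
    return "vision-zero-cr3-dev"  # hypothetical dev bucket name
```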
I like this change, what do you think about updating the `env_template` with the new var name?
```python
narrative_image = page.crop(bbox)

# uncomment to save a copy of the cropped narrative image
# narrative_image.save(f"temp/{cris_crash_id}_narrative_{cr3_version}.jpeg")
```
this is essential for debugging this, so i figured i'd leave it in here 🤷
```python
logger.debug("Extracting narrative text...")
narrative = None
narrative = image_to_string(narrative_image, lang="eng")
```
the old version of the script used a `try`/`except` block for this call, but it never errored for me, even when parsing some gnarly old garbage PDFs. my thinking was that for now it would be better to know if this breaks somehow, but if it becomes a nuisance we can easily revisit.
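The two behaviors under discussion can be sketched as one helper — `strict=True` lets failures propagate (the new script's preference), while `strict=False` restores the old swallow-and-continue behavior. `ocr_fn` is a stand-in for pytesseract's `image_to_string`; the helper itself is illustrative, not the script's code:

```python
def extract_text(image, ocr_fn, strict=True):
    """Run an OCR function over a cropped image.

    strict=True  -> exceptions propagate (fail loudly, the new behavior)
    strict=False -> the old try/except behavior: return None on failure
    """
    try:
        return ocr_fn(image)
    except Exception:
        if strict:
            raise
        return None
```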
```python
crash = todos[index]
errors.append([crash["cris_crash_id"], e])

if errors:
```
this is the same behavior as the crash diagram extractor
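That shared pattern — keep going on failure, record `[cris_crash_id, exception]` pairs, report at the end — can be sketched like this (`worker` stands in for the per-crash PDF handler; the loop shape is illustrative):

```python
def process_crashes(todos, worker):
    """Process each crash, collecting [cris_crash_id, exception] pairs for
    failures instead of aborting the whole run (the same pattern the crash
    diagram extractor uses)."""
    errors = []
    for crash in todos:
        try:
            worker(crash)
        except Exception as e:
            errors.append([crash["cris_crash_id"], e])
    return errors
```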
```python
return f"v1_{page_size}"

return f"v2_{page_size}"
width, height = page.size
```
i discovered this much easier method for getting the image size 👍
```python
"v1_large": (296, 3683, 2580, 5749),
"v2_small": (30, 791, 650, 1430),
"v2_large": (90, 3026, 2496, 5466),
}
```
it occurred to me that we might want to start storing this version in the database. another option would be to use the python PDF package to actually add metadata to the PDF itself indicating which version.
i spent quite a lot of time testing different ways to identify which PDF version is being used, including by parsing the CR3 form revision date using OCR. i learned two things: 1) there are many more form revisions than what we test for (at least 5 as far as i can tell) and 2) it is hard to reliably distinguish between them.
I have more thoughts on this if y'all want to chat CR3 versions. My short summary is that what we're doing now captures 99% of what we need.
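For reference, the lookup being discussed can be sketched as a plain dict keyed by version + page size. The coordinates are copied from the diff excerpt above (the dict there is truncated, so only the visible entries are reproduced); the helper function is illustrative, not the script's code:

```python
# Narrative bounding boxes (left, upper, right, lower) in pixels, keyed by
# CR3 form version and page size. Only the entries visible in the diff are
# reproduced here; the real dict may have more.
NARRATIVE_BBOXES = {
    "v1_large": (296, 3683, 2580, 5749),
    "v2_small": (30, 791, 650, 1430),
    "v2_large": (90, 3026, 2496, 5466),
}

def narrative_bbox(cr3_version):
    """Return the crop box for a version key like 'v2_large'."""
    try:
        return NARRATIVE_BBOXES[cr3_version]
    except KeyError:
        raise ValueError(f"no narrative bbox for CR3 version {cr3_version!r}")
```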
All the steps worked as expected, and I was able to see narratives OCR'd and verify that the narratives would not be overwritten. Thanks for sprucing this up and providing the SQL. Made testing a breeze. 🚢 🙌
nice! i watched the row count decrease as i ran the OCR script 🚀 Very cool to have the conditions for OCRing stored in a DB view this way.
```sql
)
execute procedure public.crashes_cris_set_old_investigator_narrative();

comment on function public.crashes_cris_set_old_investigator_narrative is 'Sets the
```
awesome - this comment answered my question before i could ask it! 🙏
+1 loving the function comments
Thanks for the test steps and the additions to the README.
I welcome the return of the OCR'd crash narrative
```sql
and investigator_narrative is NULL
and (
    investigator_narrative_ocr_processed_at is NULL
    or cr3_processed_at >= investigator_narrative_ocr_processed_at
```
so if we get a new CR3 that has been processed, we have to OCR the investigator narrative again?
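Restating the view's filter as a small Python predicate may make the logic easier to follow (a sketch; argument names are assumed, and both timestamps are assumed non-null when compared):

```python
def needs_narrative_ocr(investigator_narrative, ocr_processed_at, cr3_processed_at):
    """A crash qualifies for OCR when it has no narrative and either has
    never been OCR'd, or its CR3 was processed at or after the last OCR
    run (i.e., a newer PDF arrived since)."""
    if investigator_narrative is not None:
        return False
    return (
        ocr_processed_at is None
        or cr3_processed_at >= ocr_processed_at
    )
```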
```
3. Build and run the docker image. This will drop you into the docker image's shell:
```

```shell
docker compose build # <- do this once, and when dependencies change
docker compose run cris_import
```
🙏
```python
"investigator_narrative_ocr_processed_at": datetime.now(
    timezone.utc
).isoformat(),
"updated_by": "dts_automation",
```
should this be our standard naming convention going forward for automated tasks/etls making updates?
it would be nice to agree on a convention—i did feel 100% great about this and am very open to ideas.
you did feel 100% great? or are you looking for other ideas?
would you want to label anything updated by our scripts with `dts_automation`, or make it more specific with which script updated it? like `cris_import_script`
did not* feel 100% great. yeah, i was wondering if we see benefit in being specific about what task made the updates or just using the generic `dts_automation`.
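The payload fields under discussion can be sketched as a tiny helper, which would also make the attribution value a single parameter if a convention is settled on later. The key names mirror the snippet above; the helper itself is hypothetical:

```python
from datetime import datetime, timezone

def audit_fields(updated_by="dts_automation"):
    """Build the mutation fields that record when the OCR ran and which
    automation ran it (a sketch; only the key names come from the PR)."""
    return {
        "investigator_narrative_ocr_processed_at": datetime.now(
            timezone.utc
        ).isoformat(),
        "updated_by": updated_by,
    }
```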
```python
return

futures = []
with ProcessPoolExecutor(max_workers=cli_args.workers) as executor:
```
so this whole block is for creating multiple parallel processes to run the `extract_narrative_pdf` function concurrently for the crashes in the todos list?
yep!
Associated issues

Testing

… `change_log_crashes_cris` table data. To include just this one change log table in your replica, comment out `change_log_crashes_cris` right here in the helper script.

Now you will use the change log to backfill narratives that have been erased in recent weeks. Locate the commented-out `update` statement at the bottom of the `1727451510064_preserve_crash_narrative` up migration (here). Manually execute this update command in your sql client.

At this point there should be ~800 remaining crashes that are missing a narrative and need to be OCR'd. You can check this by querying the new SQL view:

… `./etl/cris_import` directory and rebuild the ETL docker image:

We need to set our environment to use the prod S3 bucket, because it contains all of the CR3 PDFs that we will process. The best way to do this is to save a copy of your existing `.env` file as `prod.local.env`, and set the `BUCKET_ENV` value to `prod` in this new file (this PR renames the `ENV` var to `BUCKET_ENV`). Make sure your Hasura endpoint is set to your local host.

Now it really is time to run the OCR script. Use this run command to run the OCR process with your new env file:

```shell
docker run -it --rm --env-file prod.local.env --network host -v $PWD:/app cris_import_cris_import ./cr3_ocr_narrative.py --verbose
```

… `crashes_edits` table:

… (`extract_2023_20240823135638635_99726_20240920_HAYSTRAVISWILLIAMSON`), and this extract is missing a large number of crash narratives, including for those crashes listed above. Let's make sure those narratives are not going to be overwritten, thanks to the new DB trigger.

Make sure your `BUCKET_ENV` is set to `dev` in your `.env` file, and run the CRIS import like so:

That's it—thank you for testing this PR 👍

Ship list

… `main` branch