This will set up two SQS queues and two corresponding Lambda functions to process their items.
### send_pdfs_to_textract_queue

Items in this queue represent PDFs in the existing `pdf_document_bucket` bucket defined in `terraform.tfvars`. The body of each item should be the filename (without the bucket path) of the document; this is probably the DocketEntryId. Additionally, you can define a message attribute, `docket_number`, as a string value. If defined, it will be added to the JSON document of the final file.
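For illustration, here is a minimal boto3 sketch of enqueueing one document. The example body and attribute values are placeholders, not values from this repo:

```python
# Minimal sketch: enqueue one PDF for Textract processing.
# The body and docket_number values below are illustrative only.
import boto3

sqs = boto3.resource("sqs")
queue = sqs.get_queue_by_name(QueueName="send_pdfs_to_textract_queue")

queue.send_message(
    # Filename (no bucket path) of an object in pdf_document_bucket;
    # assumed here to be the DocketEntryId.
    MessageBody="c0ffee00-1234-5678-9abc-def012345678",
    MessageAttributes={
        "docket_number": {
            "DataType": "String",
            "StringValue": "21-12345",  # optional; copied into the final JSON
        }
    },
)
```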
### textract_complete_queue

This queue is automatically populated with items from the SNS topic `textract_notification_topic`, to which Textract sends a message when the asynchronous processing of a PDF has completed. You should not need to manually send items to this queue.
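For reference, a sketch of what a consumer of this queue receives: each SQS record body is an SNS envelope whose `Message` field contains Textract's completion notification (fields such as `JobId`, `Status`, and `DocumentLocation`). The handler below is illustrative, not the deployed function:

```python
# Illustrative handler for textract_complete_queue records.
import json

def handler(event, context):
    for record in event["Records"]:
        # SNS envelope (unless raw message delivery is enabled on the subscription)
        envelope = json.loads(record["body"])
        # Textract's completion payload
        notification = json.loads(envelope["Message"])
        job_id = notification["JobId"]
        status = notification["Status"]              # e.g. "SUCCEEDED" or "FAILED"
        location = notification["DocumentLocation"]  # S3Bucket / S3ObjectName of the source PDF
        print(f"Textract job {job_id} for {location['S3ObjectName']}: {status}")
```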
After the process is finished, documents are placed in the `efcms-textract-result-documents` bucket created by Terraform.
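A hypothetical snippet for spot-checking one result; the object key layout inside the bucket is an assumption:

```python
# Assumed layout: one JSON object per processed document, keyed by filename.
import json
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="efcms-textract-result-documents",
    Key="c0ffee00-1234-5678-9abc-def012345678.json",  # assumed key
)
document = json.loads(obj["Body"].read())
print(document.get("docket_number"))  # present only if the attribute was set
```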
AWS resources are managed by Terraform. From within the `terraform` directory, running the following should work as expected:
- `terraform init` to initialize
- `terraform plan` to see what's going to happen
- `terraform apply` to push changes to AWS
- `terraform destroy` to pull it down
Adding items to the `send_pdfs_to_textract_queue` will kick off the process.
The Textract scanning process produces a DynamoDB table that holds the link between the JSON documents in S3 and the docket entries in Dynamo.
The Terraform in `update_ddb_terraform` will deploy an SQS queue and a subscribed Lambda function that updates items in Dawson's DynamoDB table as items show up in the queue.
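As a rough sketch of the shape of that Lambda (the key schema, attribute names, and environment variable name are assumptions about Dawson's setup, not the deployed code):

```python
# Rough sketch of the update Lambda: apply each queue item to the
# destination table. Key schema and attribute names are assumed.
import json
import os

import boto3

table = boto3.resource("dynamodb").Table(os.environ["DDB_DESTINATION_TABLE_NAME"])

def handler(event, context):
    for record in event["Records"]:
        item = json.loads(record["body"])
        table.update_item(
            Key={"pk": item["pk"], "sk": item["sk"]},  # assumed key schema
            UpdateExpression="SET documentContents = :c",
            ExpressionAttributeValues={":c": item["documentContents"]},
        )
```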
To deploy:
- Set the Terraform variable `ddb_destination_table_name` to the correct Dynamo table.
- Ensure AWS credentials are available in your environment.
> cd update_ddb_terraform
> terraform init
> terraform plan
Then, assuming the above looks good:
> terraform apply
To populate the queue:
The script in `scripts/send_document_contents_sqs.py` will read from the Dynamo table `textract-completed-ocr-efcms`, set as `SOURCE_TABLE_NAME` in the script.
After confirming `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are set in the running environment, executing this script will start scanning the source table and sending batches of SQS items to the queue. During the migration, this took about 3.5 hours.
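For orientation, a condensed sketch of what such a script does, per the description above: scan the source table and send batches of up to 10 messages (the SQS batch maximum). The queue URL and item shape are assumptions:

```python
# Condensed sketch: scan the source table, forward items in batches of 10.
import json

import boto3

SOURCE_TABLE_NAME = "textract-completed-ocr-efcms"
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/update-ddb-queue"  # assumed

table = boto3.resource("dynamodb").Table(SOURCE_TABLE_NAME)
sqs = boto3.client("sqs")

scan_kwargs = {}
while True:
    page = table.scan(**scan_kwargs)
    items = page["Items"]
    # SQS send_message_batch accepts at most 10 entries per call.
    for i in range(0, len(items), 10):
        entries = [
            {"Id": str(n), "MessageBody": json.dumps(item, default=str)}
            for n, item in enumerate(items[i:i + 10])
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```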