
ETL for parsing and publishing email attachments to S3 for emergency mgmt integration #1

Merged
27 commits merged into main from etl-maximo-geo-emergency-mgmt-integration on Mar 11, 2024

Conversation

@frankhereford (Member) commented Feb 20, 2024

Intent

This PR provides the ETL side of the work on this issue. Look for a sibling PR in the airflow repo in the very near future.

Testing

There is no harm in running this script as many times as you want. Here's the testing routine:

  • Create an env file, using the credentials stored in 1PW under the entry Maximo Geo Integrations
    • This is a new user / policy combination which has the minimum required permissions in IAM
  • docker compose up
  • Check the emergency-mgmt-recd-data bucket: the attachments should be filed in a date-labeled folder as well as in a static location.
  • You can also send a new email to the address if you want to test that out too. I won't publish that address here, but you can get it from me on Slack or find it in the AWS console.

@Charlie-Henry left a comment

One gotcha about list_objects_v2 that has tripped me up in the past. Other than that, a couple of suggestions.

🗺️
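The specific gotcha isn't spelled out in this thread; the usual one is that list_objects_v2 returns at most 1,000 keys per response, so listings need to follow continuation tokens (boto3's paginator handles this). A minimal sketch, assuming that's the gotcha meant here, with placeholder bucket and prefix names:

import boto3

s3 = boto3.client("s3")

def list_all_keys(bucket, prefix=""):
    # Collect every key under a prefix; a bare list_objects_v2 call
    # would silently stop after the first 1,000 results.
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys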

etl/parse_email_save_attachment/parse_messages.py (outdated)

# Create a new folder with the current date and time in Central Time
central = pytz.timezone("US/Central")
folder_name = datetime.now(central).strftime("%Y%m%d-%H%M%S")

@Charlie-Henry:

Would it be possible to use the email's received date for the names of these archived folders? That way, if we stopped getting emails for some reason, it wouldn't appear that we're still getting new data. I'm not sure how often this will run or how often we're getting emails, but it should also reduce the number of duplicate entries.

@johnclary (Member):

i always get nervous when i see a non-UTC timestamp without a timezone offset. could we get by on UTC here instead of fussing with timezones?

@mddilley commented Feb 28, 2024:

This is the comment I'm most curious about: I have the same question. Here is what I'm seeing in S3 when running this several times in a row; all of these contain the same data. I could be missing a detail about the process, but I like Charlie's idea to create folders based on a timestamp tied to what was received, so the ETL would know to skip the upload if there isn't new data.
[Screenshot from 2024-02-28 11:19 AM: S3 folder listing showing several date-stamped folders containing the same data]

@frankhereford (Member, Author):

Great questions y'all, let me take them one by one:

Charlie's question

@Charlie-Henry asks if we can use the email's received date. I'm not sure we can: the team opted in a recent refinement meeting not to parse these emails upon receipt, but rather to reuse the VZ pattern and parse them via an ETL. As a result, it's possible and even likely that we will end up parsing the same email multiple times. If the email's sent timestamp were a unique identifier of an email, then we'd be able to use it, but otherwise I don't think we can ensure that we don't receive two emails at the same time. I do realize that it's very unlikely.

That said, I do recognize the value in being able to see something that indicates we're not processing new information and are just churning over the same email again and again, so I am going to extend the program to include a snippet of the sha256 of the raw email we received. That snippet won't change between runs on the same input, so we'll have an indication in the file listing of what's happening. I think this serves the spirit of the request, just without using the timestamp.
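A minimal sketch of what that could look like, assuming the raw email bytes are already in hand (the eight-character truncation is an arbitrary illustrative choice, not what the actual commit does):

import hashlib

def email_fingerprint(email_content_bytes: bytes) -> str:
    # A short, stable snippet of the sha256 of the raw message; it only
    # changes when the upstream email itself changes.
    return hashlib.sha256(email_content_bytes).hexdigest()[:8]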

John's concern

@johnclary points out the challenge of leaving a timestamp without an indication of its timezone. He's totally right, and I am going to adjust the code to use UTC; further, the directory name will include an indication that the timestamp is UTC. 💯
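Something along these lines, with the exact name format being illustrative:

from datetime import datetime, timezone

# UTC timestamp, labeled as such in the folder name itself,
# e.g. "20240228-171907-UTC".
folder_name = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S-UTC")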

Mike's question

@mddilley shares the same question as Charlie and observes that we are getting the same output from multiple runs. Based on how we scoped this, that is by design: we'll get the same output until a new email is received from our upstream source. However, Mike offers a great idea based on what Charlie suggested: using a fingerprint to know we've already processed some data. In lieu of the timestamp found in the email, I am going to hash the whole email payload and store that in S3 along with the extracted attachments. Using that, I can enhance the program to exit early if it determines that it has already seen certain data.
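A sketch of the early-exit idea, assuming a marker object per email digest; the bucket name comes from the testing notes above, but the key layout is a placeholder, not the repo's actual scheme:

import hashlib

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "emergency-mgmt-recd-data"  # from the testing notes; "processed/" layout is hypothetical

def already_processed(email_content_bytes: bytes) -> bool:
    # Early-exit check: a marker object exists only if this exact
    # payload has been processed before.
    digest = hashlib.sha256(email_content_bytes).hexdigest()
    try:
        s3.head_object(Bucket=BUCKET, Key=f"processed/{digest}")
        return True
    except ClientError as error:
        if error.response["Error"]["Code"] == "404":
            return False
        raise

def mark_processed(email_content_bytes: bytes) -> None:
    # Drop the marker after a successful run so the next run can skip.
    digest = hashlib.sha256(email_content_bytes).hexdigest()
    s3.put_object(Bucket=BUCKET, Key=f"processed/{digest}", Body=b"")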

Great suggestions and questions - thanks y'all!

A member asked:

This is following the same pattern as the EMS/AFD/VZ email processing; does the VZ setup also potentially process the same email multiple times?

@frankhereford (Member, Author):

This is following the same pattern as the EMS/AFD/VZ email processing; does the VZ setup also potentially process the same email multiple times?

I believe it would, yes, but we didn't give it a ton of thought because for those we know the interval at which we're supposed to receive the data, so we only run it once a day. If it does build up cruft, it does so at a slower rate than this one would if geo were to ask for it to run very frequently.

@frankhereford commented Feb 28, 2024:

Y'all's suggestions and concerns here were 💯 🎯. Thank you so much for them. Here's what's changed:

  • Use UTC not Central and note it: 16178ba
  • Validate 4 headers to increase our confidence that the sender is who they say they are: c8c6328 (a sketch of this check follows the list)
  • Use sha256 to encode some state in the directories of files processed. Exit early if we've seen an email before. fb41324 & 5ca2a3b / 019277d
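To illustrate the header-validation bullet: the real checks live in c8c6328, which isn't shown in this thread, so the header names and expected values below are placeholders rather than the actual configuration:

from email.message import EmailMessage

# Placeholder header names and values; the four headers actually
# checked in c8c6328 may differ.
EXPECTED_HEADERS = {
    "From": "sender@example.com",
    "Return-Path": "<sender@example.com>",
    "To": "inbox@example.com",
    "X-Mailer": "Maximo",
}

def headers_look_valid(parsed_email: EmailMessage) -> bool:
    # Require every expected header to match exactly before trusting
    # the message.
    return all(
        (parsed_email.get(name) or "").strip() == value
        for name, value in EXPECTED_HEADERS.items()
    )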

Thanks again, and let me know if I can do anything else on these 🙏

parsed_email = email.message_from_bytes(email_content_bytes, policy=policy.default)

# Create a temporary directory to store attachments
temp_dir = tempfile.mkdtemp()


love the temp_dirs! I forget how useful these are


def parse_email_from_s3(email_content):
# Decode the string into bytes
email_content_bytes = email_content.encode("utf-8")


It might be a good idea to check the sender's email address to verify it's from Maximo, so junk emails landing in the inbox won't cause any havoc.

@frankhereford (Member, Author):

This is a great idea, and we should do this via the emergency-mgmt-data email filter that defines this inbox in AWS. As soon as we deliver this to the geo team and they put it to use, we'll be able to do that, if not before. I'll be on the lookout for the sender's address to be specified. 👍

@frankhereford (Member, Author) commented Feb 28, 2024

Thank you all for your great suggestions. I think I've addressed everything, but LMK if you see anything that got by. This work improved a ton with y'all's help!! 🙏

@chiaberry (Member):

@frankhereford, your comment about the ARM images reminded me: should there be an issue to add a GitHub workflow to this repo to build and push to Docker Hub? If not, then a note in the README that says "updates should be manually pushed to Docker Hub", or something phrased better.

@frankhereford (Member, Author):

@chiaberry wrote:

your comment about the ARM images reminded me: should there be an issue to add a GitHub workflow to this repo to build and push to Docker Hub? If not, then a note in the README that says "updates should be manually pushed to Docker Hub", or something phrased better

I think that's a great idea. I'm going to try to drop in another workflow we've built to do that and maybe get this taken care of right now; otherwise, I will write up an issue. Thank you, Chia.

@frankhereford (Member, Author):

I've added a GitHub workflow that builds the image and pushes it to Docker Hub for both x64 and ARM64 architectures. This includes an update to the production-tagged image, so any airflow testing will be working from the latest code as of this morning.

@chiaberry (Member):

Thanks for adding the GitHub workflow to this PR, and for addressing our feedback. Does David need to test this, or have you gotten a sample dataset from him to test with?

Do you think any of the improvements to this ETL should be done to the EMS/AFD ETL as well, since this was based on that work? cc: @johnclary

@frankhereford (Member, Author) commented Mar 5, 2024

Great idea from @chiaberry - put the email address that we have picked for this in 1PW (dev vault) and reference that in the README.md. 🙏

Note to self, it can be found in 1PW here: Emergency Mgmt / Geo / Maximo data ingest email address

@mddilley self-requested a review on March 6, 2024 at 18:02
@mddilley left a comment:

I reviewed all the new commits, and they look spot on to me! Thanks for taking this way beyond the ETL that it was spun off from. I think Chia's idea to improve the AFD/EMS ETL in the same way is great. 🚢 🙏

One last suggestion: I think it would be great to add a link to this repo to the 1Password entry with the email address this ETL accepts. 🔑

@frankhereford (Member, Author):

One last suggestion: I think it would be great to add a link to this repo to the 1Password entry with the email address this ETL accepts. 🔑

Great idea - thank you Mike & Chia both. You can find it here:

Emergency Mgmt / Geo / Maximo data ingest email address

@frankhereford merged commit 2fc172e into main on Mar 11, 2024 (1 check passed)
@frankhereford deleted the etl-maximo-geo-emergency-mgmt-integration branch on March 11, 2024 at 14:55