-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ETL for parsing and publishing email attachments to S3 for emergency mgmt integration #1
ETL for parsing and publishing email attachments to S3 for emergency mgmt integration #1
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One gotcha about list_objects_v2 which tripped me up in the past. Other than that, a couple of suggestions.
🗺️
|
||
# Create a new folder with the current date and time in Central Time | ||
central = pytz.timezone("US/Central") | ||
folder_name = datetime.now(central).strftime("%Y%m%d-%H%M%S") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to use the email's received date for the name of these archived folders so if we stopped getting emails for some reason it wouldn't appear that we're still getting new data? I'm not sure how often this will run and how often we're getting emails but it should also reduce the number of duplicate entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i always get nervous when i see a non-UTC timestamp without a timezone offset. could we get by on UTC here instead of fussing with timezones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the comment that I'm most curious about: I have the same question. Here is what I'm seeing in S3 when running this several times in a row - all these contain the same data. I could be missing a detail about the process, but I like Charlie's idea to create folders based off a timestamp tied to what has been received so the ETL would know to skip the upload if there isn't new data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great questions y'all, let me take them one by one:
Charlie's question
@Charlie-Henry asks if we can use the email's received date. I'm not sure we can -- the team opted in a recent reining meeting to not parse these emails upon receipt, but rather to reuse the VZ pattern and parse them via an ETL. As a result, it's possible and likely that we will end up parsing the same email multiple times. If the email's sent timestamp was a unique identifier of an email, then we'd be able to use it, but otherwise, I don't think we can ensure that we don't receive two emails at the same time. I do realize that it's very unlikely..
That said, I totally do recognize the value in being able to see something that indicates we're not processing new information and just churning over the same email over and over, so I am going to extend the program to include a snippet of the sha256 of the raw email we received. That will end up not changing from time to time when this runs on the same input, so we'll have an indication in the file listing of what's happening. I think this will serve the spirit of the request, just not using the timestamp.
John's concern
@johnclary recognizes the challenges of leaving a timestamp without an indication of what timezone it is. He's totally right, and I am going to adjust the code to use UTC, and further, it will include an indication in the directory name that the timestamp is UTC. 💯
Mike's question
@mddilley shares the same question as Charlie and he observes that we are getting the same output from multiple runs. Based on how we scoped this, this is by design - we'll get the same output until a new email is received from our upstream source. However, Mike offers a great idea based on what Charlie suggested -- using a finger print to know we've already processed some data. In lieu of the timestamp found in the email, I am going to hash the whole email payload, and store that in S3 along with the extracted attachments. Using that, I can enhance the program to escape early if it determines that it's already seen certain data.
Great suggestions and questions - thanks y'all!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is following the same pattern as the EMS/AFD/VZ email processing, does the the VZ setup also potentially process the same email multiple times?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is following the same pattern as the EMS/AFD/VZ email processing, does the the VZ setup also potentially process the same email multiple times?
I believe it would yes, but we didn't give it a ton of thought because for those, we know the interval that we're supposed to receive the data, so we only run it once a day. If it does build up cruft, it does it a slower rate than this one would if geo were to ask for it to run real frequently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Y'all's suggestions and concerns here were 💯 🎯. Thank you so much for them. Here's what's changed:
- Use UTC not Central and note it: 16178ba
- Validate 4 headers to increase our confidence that the sender is who they say they are: c8c6328
- Use sha256 to encode some state in the directories of files processed. Exit early if we've seen an email before. fb41324 & 5ca2a3b / 019277d
Thanks again, and let me know if I can do anything else on these 🙏
parsed_email = email.message_from_bytes(email_content_bytes, policy=policy.default) | ||
|
||
# Create a temporary directory to store attachments | ||
temp_dir = tempfile.mkdtemp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love the temp_dir's! I forget how useful these are
|
||
def parse_email_from_s3(email_content): | ||
# Decode the string into bytes | ||
email_content_bytes = email_content.encode("utf-8") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a good idea to check the sender's email address to verify it's from maximo so if we got junk emails in there it won't cause any havoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great idea, and we should do this via the emergency-mgmt-data
email filter that defines this inbox in AWS. As soon as we deliver this to the geo team and they put it to use, we'll be able to do that, if not before. I'll be on the lookout for the sender's address to be specified. 👍
Thank you all for your great suggestions. I think I've addressed everything, but LMK if you see anything that got by. This work improved a ton with y'all's help!! 🙏 |
@frankhereford your comment about the ARM images reminded me, should there be an issue to add a github workflow to this repo to build and push to dockerhub? If not, then a note in the readme that says "updates should be manually pushed to dockerhub" or something phrased better |
@chiaberry wrote:
I think that's a great idea. I'm going to try to drop in another work flow we've built to do that and maybe get this taken care of right now, otherwise, I will write up an issue. Thank you Chia. |
I've added a GitHub workflow that builds the image and pushes it to docker hub for both X64 and ARM64 architectures. This includes an update to the |
Thanks for adding the github workflow to this PR, and addressing our feedback. Does David need to test this, or have you gotten a sample dataset to test from him? Do you think any of the improvements to this ETL should be done to the EMS/AFD ETL as well, since this was based on that work? cc: @johnclary |
Great idea from @chiaberry - put the email address that we have picked for this in 1PW (dev vault) and reference that in the Note to self, it can be found in 1PW here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed all the new commits, and they look spot on to me! Thanks for taking this way beyond the ETL that is was spun-off from. I think Chia's idea to improve the AFD/EMS etl in the same way is great. 🚢 🙏
One last suggestion - i think it would be great to add a link to this repo to the 1Password entry with the email this ETL accepts. 🔑
Great idea - thank you Mike & Chia both. You can find it here:
|
Intent
This PR intends to provide for the ETL side of work on this issue. Look for a sibling PR in the airflow repo in the very near future.
Testing
There is no harm in running this script as many times as you want. Here's the testing routine:
env
file, using the credentials stored in 1PW under the entryMaximo Geo Integrations
docker compose up
emergency-mgmt-recd-data
bucket for the attachments to be filed in a date-labeled folder as well as in a static location.