Add RedditMap Data and Scripts #17

Open: wants to merge 4 commits into base: main

Conversation

freeformflow

@19mangatj informed me that Virginia authorized us to add RedditMap materials to the IHOP repository. This pull request adds RedditMap data as well as supporting scripts to ready that data for use in its application. I've taken care to not disturb the existing conventions while maintaining RedditMap's publish pipeline. In addition to this pull request, I will need to provide the AWS access key out-of-band so the new GitHub action will be properly authorized.

I added RedditMap data to the directory /data/redditmap. This contains the RedditMap data and only the data.

I added publish code to the directory /scripts/redditmap. This contains a README overview and the needed Node.js code to complete the publish task.

I added a new GitHub action publish-redditmap. When you push to the branch publish-redditmap, that action will prepare a Node.js environment and sync the RedditMap data API with the local directory.
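As a rough illustration of what a directory-to-storage sync involves, here is a minimal sketch. It assumes the sync is a mirror (the remote ends up matching the local directory) and that each side can be summarized as a map from key to content hash. The function name `planSync` and the sample keys and hashes are hypothetical and are not taken from the scripts in this pull request.

```javascript
// Hypothetical helper: names and data shapes are illustrative assumptions,
// not the actual publish code in /scripts/redditmap.
// Each argument is a Map of object key -> content hash.
function planSync(localFiles, remoteObjects) {
  const uploads = [];
  const deletes = [];

  for (const [key, hash] of localFiles) {
    // Upload anything missing remotely or whose content differs.
    if (remoteObjects.get(key) !== hash) uploads.push(key);
  }

  for (const key of remoteObjects.keys()) {
    // Remove remote objects that no longer exist locally.
    if (!localFiles.has(key)) deletes.push(key);
  }

  return { uploads: uploads.sort(), deletes: deletes.sort() };
}

// Made-up sample state for demonstration only.
const local = new Map([
  ["metadata/a.json", "hash-1"],
  ["metadata/b.json", "hash-2"],
]);
const remote = new Map([
  ["metadata/a.json", "hash-1"],
  ["metadata/b.json", "stale"],
  ["metadata/c.json", "hash-3"],
]);

console.log(planSync(local, remote));
```

With this sample state, the plan uploads `metadata/b.json` (changed) and deletes `metadata/c.json` (no longer present locally), while `metadata/a.json` is left alone.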

@ginic
Collaborator

ginic commented May 8, 2023

Thanks @freeformflow!

Unfortunately, GitHub won't let me see the full diff because there are too many changed files, so I'm going to have to awkwardly put notes in comments.

To start, I have a couple questions:

  • Will having this GitHub workflow in both forks break any of the data pushing to the same AWS bucket? What if the repositories get out of sync with each other? Maybe the scripts/redditmap/ensemble.yaml should point to separate buckets by default?
  • Is there any way to avoid keeping all the data/redditmap/metadata json files as separate files tracked in git? One of the major benefits of using Data Version Control (DVC), as we've done with other parts of the project, is that data artifacts become programmatically reproducible and the entire end-to-end pipeline is versioned, rather than managing each output file's version. Ideally, I'd like to have the data needed for the website become part of those pipelines as well, but we might have to address this later.

@ginic
Collaborator

ginic commented May 8, 2023

Line 17 of scripts/redditmap/README.md ends in an unfinished sentence:

> After you commit changes to our mainline branch, main, you'll want to cut a release and publish changes for the world to see. We've setup a GitHub action to make that easy for you. The publish task will sync storage in our infrastructure so that it looks exactly like what's in that directory. So any

@ginic
Collaborator

ginic commented May 8, 2023

In scripts/redditmap/image.mjs, there's an in-code reference to the bucket https://ihopmeag.s3.us-east-2.amazonaws.com/reddit_images. I made this public so that @19mangatj could test changes to the code, but I'd like for it not to be public, so we don't risk getting spammed with GET requests and costing a lot of money. Is there a way to require AWS credentials for that bucket as well?

@ginic
Collaborator

ginic commented May 8, 2023

Could you please put an "Add" note to the changelog briefly describing the new scripts and data?

@freeformflow
Author

> Will having this GitHub workflow in both forks break any of the data pushing to the same AWS bucket? What if the repositories get out of sync with each other? Maybe the scripts/redditmap/ensemble.yaml should point to separate buckets by default?

My understanding is that Jasmine would like to move RedditMap data curation to the IHOP repository entirely, and we will deprecate the iDPI repositories that have housed previous versions of the pipeline. However, as I understand it, we'd like to use infrastructure under the iDPI AWS account to host the RedditMap application and host data specific to RedditMap. We should confirm with @19mangatj and @chandrn7.

> Is there any way to avoid keeping all the data/redditmap/metadata json files as separate files tracked in git? One of the major benefits of using Data Version Control (DVC), as we've done with other parts of the project, is that data artifacts become programmatically reproducible and the entire end-to-end pipeline is versioned, rather than managing each output file's version. Ideally, I'd like to have the data needed for the website become part of those pipelines as well, but we might have to address this later.

I agree that dealing with the individual files is difficult. We'd like to ultimately serve them as individual files to benefit from fine-grained HTTP caching, but we can handle all sorts of preprocessing scenarios. That includes managing them in a different form and isolating them into smaller files just prior to publishing to iDPI infrastructure. I'd need to know more about the IHOP output to alter the publishing script to accomplish that goal.
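To make the "isolate into smaller files just prior to publishing" idea concrete, here is a minimal sketch. It assumes the combined form is an array of records that each carry a string `id`; that shape, the function name `splitIntoFiles`, and the output path pattern are all hypothetical, since the actual IHOP output format is not settled in this thread.

```javascript
// Hypothetical sketch: the record shape ({ id, ... }) and the path pattern
// are assumptions, not the actual IHOP output or RedditMap layout.
// Returns a Map of output path -> JSON body, one entry per record.
function splitIntoFiles(records, directory) {
  const files = new Map();
  for (const record of records) {
    if (typeof record.id !== "string" || record.id.length === 0) {
      throw new Error("Each record needs a non-empty string id");
    }
    // One small JSON document per record, so each can be cached independently.
    const path = `${directory}/${encodeURIComponent(record.id)}.json`;
    files.set(path, JSON.stringify(record));
  }
  return files;
}

// Made-up placeholder records for demonstration only.
const combined = [
  { id: "example-a", label: "A" },
  { id: "example-b", label: "B" },
];

const files = splitIntoFiles(combined, "data/redditmap/metadata");
for (const [path, body] of files) {
  console.log(path, body);
}
```

Returning a map instead of writing to disk keeps the splitting step easy to test; a publish script could then write or upload each entry.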

> Line 17 of scripts/redditmap/README.md ends in an unfinished sentence:

I pushed an edit that corrects that sentence.

> In scripts/redditmap/image.mjs, there's an in-code reference to the bucket https://ihopmeag.s3.us-east-2.amazonaws.com/reddit_images. I made this public so that @19mangatj could test changes to the code, but I'd like for it not to be public, so we don't risk getting spammed with GET requests and costing a lot of money. Is there a way to require AWS credentials for that bucket as well?

I pushed an edit that removed that file entirely. That was a temporary script. We copied the screenshots over to the iDPI bucket so that iDPI could host the images, but we determined it would be too costly to serve those images in the RedditMap application for now. Over time, we may want to work toward restoring image updates as part of the data pipeline work, but it's not a priority.

The RedditMap application no longer needs public access to the IHOP bucket. You might want to speak with Jasmine to confirm, but as far as the application is concerned, you can restore the bucket access control. If you have any specific S3 configuration questions, I'd be happy to help there, too.
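For reference, one common way to stop anonymous reads is S3's Block Public Access setting. The sketch below only builds the request input for the `PutPublicAccessBlock` operation; the bucket name and region are read from the URL quoted above, the helper name `publicAccessBlockInput` is hypothetical, and whether all four flags suit the IHOP bucket is an assumption to confirm with the bucket owner.

```javascript
// Sketch only: builds the input for S3's PutPublicAccessBlock operation.
// The helper name is hypothetical; enabling all four flags is an assumption,
// not a confirmed configuration for the IHOP bucket.
function publicAccessBlockInput(bucket) {
  return {
    Bucket: bucket,
    PublicAccessBlockConfiguration: {
      BlockPublicAcls: true,
      IgnorePublicAcls: true,
      BlockPublicPolicy: true,
      RestrictPublicBuckets: true,
    },
  };
}

const input = publicAccessBlockInput("ihopmeag");
console.log(JSON.stringify(input, null, 2));

// With the AWS SDK for JavaScript v3 installed, this object could be sent as:
//   import { S3Client, PutPublicAccessBlockCommand } from "@aws-sdk/client-s3";
//   const client = new S3Client({ region: "us-east-2" });
//   await client.send(new PutPublicAccessBlockCommand(input));
```

After a change like this, reads would need signed requests from credentials that are allowed to access the bucket.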

@freeformflow
Author

> Could you please put an "Add" note to the changelog briefly describing the new scripts and data?

I've drafted some notes and asked Jasmine to confirm. I'll add these notes to the changelog once Jasmine approves.

@ginic
Collaborator

ginic commented May 10, 2023

Thanks, @freeformflow! Everything looks good to me! I'll approve and @19mangatj can merge it in when she's ready.
