Merge pull request #15 from MITLibraries/pr4-ci-aws
PR4 - Add CI and AWS terraform
ghukill authored Oct 16, 2023
2 parents 6cad0bf + c782c8c commit 7e4a5e1
Showing 8 changed files with 196 additions and 16 deletions.
28 changes: 28 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,28 @@
# To get started with Dependabot version updates, you'll need to specify which
# package ecosystems to update and where the package manifests are located.
# Please see the documentation for all configuration options:
# https://docs.github.com/github/administering-a-repository/configuration-options-for-dependency-updates

version: 2
updates:
  # Maintain dependencies for GitHub Actions
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "daily"

  # Maintain dependencies for application
  - package-ecosystem: "pip"
    directory: "/"
    schedule:
      interval: "weekly"
    reviewers:
      - "MITLibraries/dataeng"

  # Maintain dependencies for Docker
  - package-ecosystem: "docker"
    directory: "/"
    schedule:
      interval: "weekly"
    reviewers:
      - "MITLibraries/dataeng"
39 changes: 39 additions & 0 deletions .github/pull-request-template.md
@@ -0,0 +1,39 @@
### What does this PR do?

Describe the overall purpose of the PR changes. Doesn't need to be as specific as the
individual commits.

### Helpful background context

Describe any additional context beyond what the PR accomplishes if it is likely to be
useful to a reviewer.

Delete this section if it isn't applicable to the PR.

### How can a reviewer manually see the effects of these changes?

Explain how to see the proposed changes in the application if possible.

Delete this section if it isn't applicable to the PR.

### Includes new or updated dependencies?

YES | NO

### What are the relevant tickets?

Include links to Jira Software and/or Jira Service Management tickets here.

### Developer

- [ ] All new ENV is documented in README (or there is none)
- [ ] Stakeholder approval has been confirmed (or is not needed)

### Code Reviewer

- [ ] The commit message is clear and follows our guidelines
  (not just this pull request message)
- [ ] There are appropriate tests covering any new functionality
- [ ] The documentation has been updated or is unnecessary
- [ ] The changes have been verified
- [ ] New dependencies are appropriate or there were no changes
7 changes: 7 additions & 0 deletions .github/workflows/ci.yml
@@ -0,0 +1,7 @@
name: CI
on: push
jobs:
  test:
    uses: mitlibraries/.github/.github/workflows/python-shared-test.yml@main
  lint:
    uses: mitlibraries/.github/.github/workflows/python-shared-lint.yml@main
24 changes: 24 additions & 0 deletions .github/workflows/dev-build.yml
@@ -0,0 +1,24 @@
### This is the Terraform-generated dev-build.yml workflow for the browsertrix-harvester-dev app repository ###
### If this is a Lambda repo, uncomment the FUNCTION line at the end of the document ###
### If the container requires any additional pre-build commands, uncomment and edit ###
### the PREBUILD line at the end of the document. ###
name: Dev Container Build and Deploy
on:
  workflow_dispatch:
  pull_request:
    branches:
      - main
    paths-ignore:
      - '.github/**'

jobs:
  deploy:
    name: Dev Container Deploy
    uses: mitlibraries/.github/.github/workflows/ecr-shared-deploy-dev.yml@main
    secrets: inherit
    with:
      AWS_REGION: "us-east-1"
      GHA_ROLE: "browsertrix-harvester-gha-dev"
      ECR: "browsertrix-harvester-dev"
      # FUNCTION: ""
      # PREBUILD:
21 changes: 21 additions & 0 deletions .github/workflows/prod-promote.yml
@@ -0,0 +1,21 @@
### This is the Terraform-generated prod-promote.yml workflow for the browsertrix-harvester-prod repository. ###
### If this is a Lambda repo, uncomment the FUNCTION line at the end of the document. ###
name: Prod Container Promote
on:
  workflow_dispatch:
  release:
    types: [published]

jobs:
  deploy:
    name: Prod Container Promote
    uses: mitlibraries/.github/.github/workflows/ecr-shared-promote-prod.yml@main
    secrets: inherit
    with:
      AWS_REGION: "us-east-1"
      GHA_ROLE_STAGE: browsertrix-harvester-gha-stage
      GHA_ROLE_PROD: browsertrix-harvester-gha-prod
      ECR_STAGE: "browsertrix-harvester-stage"
      ECR_PROD: "browsertrix-harvester-prod"
      # FUNCTION: ""

24 changes: 24 additions & 0 deletions .github/workflows/stage-build.yml
@@ -0,0 +1,24 @@
### This is the Terraform-generated stage-build.yml workflow for the browsertrix-harvester-stage app repository ###
### If this is a Lambda repo, uncomment the FUNCTION line at the end of the document ###
### If the container requires any additional pre-build commands, uncomment and edit ###
### the PREBUILD line at the end of the document. ###
name: Stage Container Build and Deploy
on:
  workflow_dispatch:
  push:
    branches:
      - main
    paths-ignore:
      - '.github/**'

jobs:
  deploy:
    name: Stage Container Deploy
    uses: mitlibraries/.github/.github/workflows/ecr-shared-deploy-stage.yml@main
    secrets: inherit
    with:
      AWS_REGION: "us-east-1"
      GHA_ROLE: "browsertrix-harvester-gha-stage"
      ECR: "browsertrix-harvester-stage"
      # FUNCTION: ""
      # PREBUILD:
48 changes: 43 additions & 5 deletions Makefile
@@ -49,24 +49,62 @@ black-apply:
ruff-apply:
	pipenv run ruff check --fix .

# CLI commands
docker-shell:
	pipenv run harvester-dockerized docker-shell

# Docker commands
dist-local:
	docker build -t $(ECR_NAME_DEV):latest .

-# Testing commands
-test-harvest-local:
+# Test local harvest
+run-harvest-local:
	pipenv run harvester-dockerized --verbose harvest \
	--crawl-name="homepage" \
	--config-yaml-file="/browsertrix-harvester/tests/fixtures/lib-website-homepage.yaml" \
	--metadata-output-file="/crawls/collections/homepage/homepage.xml" \
	--num-workers 4 \
	--btrix-args-json='{"--maxPageLimit":"15"}'

# Test Dev1 harvest
run-harvest-dev:
	CRAWL_NAME=test-harvest-ecs-$(DATETIME); \
	aws ecs run-task \
	--cluster timdex-dev \
	--task-definition timdex-browsertrixharvester-dev \
	--launch-type="FARGATE" \
	--region us-east-1 \
	--network-configuration '{"awsvpcConfiguration": {"subnets": ["subnet-0488e4996ddc8365b","subnet-022e9ea19f5f93e65"], "securityGroups": ["sg-044033bf5f102c544"]}}' \
	--overrides '{"containerOverrides": [ {"name":"browsertrix-harvester", "command": ["--verbose", "harvest", "--crawl-name", "'$$CRAWL_NAME'", "--config-yaml-file", "/browsertrix-harvester/tests/fixtures/lib-website-homepage.yaml", "--metadata-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'$$CRAWL_NAME'.xml", "--wacz-output-file", "s3://timdex-extract-dev-222053980223/librarywebsite/'$$CRAWL_NAME'.wacz", "--num-workers", "2"]}]}'

# Test local URL content parsing
test-parse-url-content:
	pipenv run harvester parse-url-content \
	--wacz-input-file="tests/fixtures/example.wacz" \
	--url="https://example.com/hello-world"

### Terraform-generated Developer Deploy Commands for Dev environment ###
dist-dev: ## Build docker container (intended for developer-based manual build)
	docker build --platform linux/amd64 \
	-t $(ECR_URL_DEV):latest \
	-t $(ECR_URL_DEV):`git describe --always` \
	-t $(ECR_NAME_DEV):latest .

publish-dev: dist-dev ## Build, tag and push (intended for developer-based manual publish)
	docker login -u AWS -p $$(aws ecr get-login-password --region us-east-1) $(ECR_URL_DEV)
	docker push $(ECR_URL_DEV):latest
	docker push $(ECR_URL_DEV):`git describe --always`

### Terraform-generated manual shortcuts for deploying to Stage. This requires ###
### that ECR_NAME_STAGE, ECR_URL_STAGE, and FUNCTION_STAGE environment ###
### variables are set locally by the developer and that the developer has ###
### authenticated to the correct AWS Account. The values for the environment ###
### variables can be found in the stage-build.yml caller workflow. ###
dist-stage: ## Only use in an emergency
	docker build --platform linux/amd64 \
	-t $(ECR_URL_STAGE):latest \
	-t $(ECR_URL_STAGE):`git describe --always` \
	-t $(ECR_NAME_STAGE):latest .

publish-stage: ## Only use in an emergency
	docker login -u AWS -p $$(aws ecr get-login-password --region us-east-1) $(ECR_URL_STAGE)
	docker push $(ECR_URL_STAGE):latest
	docker push $(ECR_URL_STAGE):`git describe --always`
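
Review note: the `publish-dev` and `publish-stage` targets pass the ECR password with `-p`, which Docker warns can expose it through the process list. A minimal sketch of the `--password-stdin` alternative follows. The `ECR_URL_DEV` value and the `ecr_login` helper are hypothetical, introduced only for illustration; the account ID is a placeholder, not the project's real one.

```shell
# Hypothetical repository URL; 123456789012 is a placeholder account ID.
ECR_URL_DEV="123456789012.dkr.ecr.us-east-1.amazonaws.com/browsertrix-harvester-dev"

# Strip the repository path to get the registry host that docker login expects.
registry="${ECR_URL_DEV%%/*}"
echo "$registry"

# Hypothetical helper: pipe the token to docker login instead of passing it with -p.
ecr_login() {
  aws ecr get-login-password --region us-east-1 \
    | docker login --username AWS --password-stdin "$1"
}

# Usage (requires AWS credentials and a running Docker daemon):
# ecr_login "$registry"
```

Inside a Makefile recipe each `$` in the shell expansion would need to be doubled, as the existing targets already do for `$$(aws ecr ...)`.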
21 changes: 10 additions & 11 deletions README.md
@@ -42,13 +42,22 @@ make lint

#### Local Test Crawl
```shell
-make test-harvest-local
+make run-harvest-local
```

This Make command kicks off a harvest in a local Docker container. It illustrates several ways a harvest can be configured: pointing to a configuration YAML by local or S3 filepath, setting an output metadata file, and passing miscellaneous browsertrix arguments to the crawler that are not explicitly defined as CLI parameters in this app.

The argument `--metadata-output-file="/crawls/collections/homepage/homepage.xml"` instructs the harvest to parse metadata records from the crawl, which are written to the container, and should then be available on the _host_ machine at: `output/crawls/collections/homepage/homepage.xml`.
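
As a quick check after a local crawl, the container path can be mapped to its host location. This is a minimal sketch, assuming the container's `/crawls` directory appears under `output/` on the host as described above; the variable names are illustrative only.

```shell
# Container-side path passed to --metadata-output-file.
container_path="/crawls/collections/homepage/homepage.xml"

# Assumption: container paths are visible on the host under ./output.
host_path="output${container_path}"
echo "$host_path"

# Report whether the metadata file has been written yet.
if [ -f "$host_path" ]; then
  echo "metadata file found"
else
  echo "metadata file not found; the crawl may still be running"
fi
```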

#### Remote Test Crawl

```shell
make run-harvest-dev
```
* AWS credentials must be set in the calling context
* Kicks off an ECS Fargate task in Dev1
* WACZ file and metadata file are written to S3 at `timdex-extract-dev-222053980223/librarywebsite/test-harvest-ecs-<TIMESTAMP>.xml|wacz`
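
The two S3 locations can be derived from the crawl name. This is a minimal sketch that only prints them: the `crawl_name` value is a hypothetical placeholder for the `test-harvest-ecs-<TIMESTAMP>` name the Make target generates, and the commented `aws s3 ls` call assumes AWS credentials are available.

```shell
bucket="timdex-extract-dev-222053980223"
prefix="librarywebsite"
# Hypothetical placeholder; substitute the crawl name generated for your run.
crawl_name="test-harvest-ecs-EXAMPLE"

# Print the expected metadata (.xml) and archive (.wacz) locations.
for ext in xml wacz; do
  uri="s3://${bucket}/${prefix}/${crawl_name}.${ext}"
  echo "$uri"
  # aws s3 ls "$uri"   # uncomment to confirm the object exists
done
```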

## CLI commands

### Main
@@ -261,16 +270,6 @@ An example record from an XML output file looks like this:
</records>
```

-## Convenience Make Commands
-
-### Local Test Crawl
-
-```shell
-make test-harvest-local
-```
-* Performs a crawl using the container mounted config YAML `/browsertrix-harvest/tests/fixtures/lib-website-homepage.yaml`
-* Metadata is written to container directory `/crawls/collections/homepage/homepage.xml`, which is mounted and available in the local `output/` folder
-
## Troubleshooting

### Cannot read/write from S3 for a LOCAL docker container harvest