Merge pull request #313 from MITLibraries/TIMX-412-establish-feature-…
…flagging-pathways

Timx 412 establish feature flagging pathways
jonavellecuerdo authored Dec 9, 2024
2 parents ed33a71 + b859e80 commit 8e09243
Showing 8 changed files with 948 additions and 714 deletions.
1,277 changes: 655 additions & 622 deletions Pipfile.lock

Large diffs are not rendered by default.

183 changes: 97 additions & 86 deletions README.md
@@ -2,19 +2,13 @@

TIMDEX Pipeline Lambdas is a collection of lambdas used in the TIMDEX Ingest Pipeline.

## Required env variables

- `TIMDEX_ALMA_EXPORT_BUCKET_ID`: the name of the Alma SFTP export S3 bucket, set by Terraform on AWS.
- `TIMDEX_S3_EXTRACT_BUCKET_ID`: the name of the TIMDEX pipeline S3 bucket, set by Terraform on AWS.
- `WORKSPACE`: set to `dev` for local development, set by Terraform on AWS.

## Format Input Handler

Takes input JSON (usually from EventBridge although it can be passed to a manual Step Function execution), and returns reformatted JSON matching the expected input data needed for the remaining steps in the TIMDEX pipeline Step Function.

### Input fields
### Event Fields

#### Required fields
#### Required

- `next-step`: The next step of the pipeline to be performed, must be one of `["extract", "transform", "load"]`. Determines which task run commands will be generated as output from the format lambda.
- `run-date`: Must be in one of the formats ["yyyy-mm-dd", "yyyy-mm-ddThh:mm:ssZ"]. The provided date is used in the input/output file naming scheme for all steps of the pipeline.
@@ -24,17 +18,17 @@ Takes input JSON (usually from EventBridge although it can be passed to a manual
- `source`: Short name for the source repository, must match one of the source names configured for use in transform and load apps. The provided source is passed to the transform and load app CLI commands, and is also used in the input/output file naming scheme for all steps of the pipeline.
- *Note*: if the provided source is "aspace" or "dspace", a method option is passed to the harvest command (if starting at the extract step) so that the "get" harvest method is used instead of the default "list" method used for all other sources. This is required because ArchivesSpace inexplicably provides incomplete OAI-PMH responses using the "list" method, and DSpace@MIT needs to skip some records by ID, which can only be done using the "get" method.

#### Required OAI-PMH harvest fields
#### Required for OAI-PMH Harvest

- `oai-pmh-host`: *required if next-step is extract via OAI-PMH harvest*, not needed otherwise. Should be the base OAI-PMH URL for the source repository.
- `oai-metadata-format`: *required if next-step is extract via OAI-PMH harvest*, optional otherwise. The metadata prefix to use for the OAI-PMH harvest command, must match an available metadata prefix provided by the `oai-pmh-host` (see source repository OAI-PMH documentation for details).

#### Optional fields
#### Optional Fields

- `oai-set-spec`: optional, only used when limiting the OAI-PMH record harvest to a single set from the source repository.
- `verbose`: optional, if provided with value `"true"` (case-insensitive) will pass the `--verbose` option (debug level logging) to all pipeline task run commands.
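
A minimal sketch of checking the two constrained fields described above before sending an event. `check_event` and its helper are hypothetical and not part of this repository (which has its own validation in `lambdas/config.py`); only the allowed `next-step` values and the two `run-date` formats are taken from the field descriptions.

```python
from datetime import datetime

# Values taken from the field descriptions above.
VALID_NEXT_STEPS = ("extract", "transform", "load")
RUN_DATE_FORMATS = ("%Y-%m-%d", "%Y-%m-%dT%H:%M:%SZ")


def _matches_format(value: str, fmt: str) -> bool:
    """Return True if value parses with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def check_event(event: dict) -> list[str]:
    """Hypothetical helper: list problems found in an event (empty if none)."""
    problems = []
    next_step = event.get("next-step")
    if next_step not in VALID_NEXT_STEPS:
        problems.append(
            f"next-step must be one of {list(VALID_NEXT_STEPS)}, got {next_step!r}"
        )
    run_date = str(event.get("run-date", ""))
    # Note: strptime is somewhat lenient, e.g. it accepts single-digit months.
    if not any(_matches_format(run_date, fmt) for fmt in RUN_DATE_FORMATS):
        problems.append(f"run-date {run_date!r} matches neither supported format")
    return problems


print(check_event({"next-step": "extract", "run-date": "2022-03-10T16:30:23Z"}))
print(check_event({"next-step": "index", "run-date": "03/10/2022"}))
```

The first call reports no problems; the second reports one problem per field.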

### Example format input with all fields
### Example Format Input Event

```json
{
@@ -49,9 +43,11 @@ Takes input JSON (usually from EventBridge although it can be passed to a manual
}
```

### Example format output from the above input
### Example Format Input Result

Note: the output will vary slightly depending on the provided `source`, as these sometimes require different command logic. See test cases for input/output representations of all expected logic.
Note: This result is based on the previous example.

The output will vary slightly depending on the provided `source`, as these sometimes require different command logic. See test cases for input/output representations of all expected logic.

```json
{
@@ -72,101 +68,116 @@ Note: the output will vary slightly depending on the provided `source`, as these
]
}
}

```

## Ping Handler

Useful for testing and little else.

### Example Ping input

`{}`

### Example Ping output

`pong`

## Developing locally

<https://docs.aws.amazon.com/lambda/latest/dg/images-test.html>

### Makefile commands for installation and dependency management

```bash
make install # installs with dev dependencies
make test # runs tests and outputs coverage report
make lint # runs code linting, quality, and safety checks
make update # updates dependencies
```

### Build the container
### Example Ping Event

```bash
make dist-dev
```

### Run the default handler for the container

```bash
docker run -e TIMDEX_ALMA_EXPORT_BUCKET_ID=alma-bucket-name -e TIMDEX_S3_EXTRACT_BUCKET_ID=timdex-bucket-name -e WORKSPACE=dev -p 9000:8080 timdex-pipeline-lambdas-dev:latest
```

### POST to the container

Note: running this with next-step transform or load involves an actual S3 connection and is thus tricky to test locally. Better to push the image to Dev1 and test there.

```bash
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{
  "next-step": "extract",
  "run-date": "2022-03-10T16:30:23Z",
  "run-type": "daily",
  "source": "YOURSOURCE",
  "verbose": "true",
  "oai-pmh-host": "https://YOUR-OAI-SOURCE/oai",
  "oai-metadata-format": "oai_dc",
  "oai-set-spec": "YOUR-SET-SPEC"
}'
```json
{}
```

### Observe output
### Example Ping Result

```json
{
  "run-date": "2022-03-10",
  "run-type": "daily",
  "source": "YOURSOURCE",
  "verbose": true,
  "next-step": "transform",
  "extract": {
    "extract-command": [
      "--host=https://YOUR-OAI-SOURCE/oai",
      "--output-file=s3://timdex-bucket-name/YOURSOURCE/YOURSOURCE-2022-03-09-daily-extracted-records-to-index.xml",
      "--verbose",
      "harvest",
      "--metadata-format=oai_dc",
      "--set-spec=YOUR-SET-SPEC",
      "--from-date=2022-03-09"
    ]
  }
}
pong
```

### Run a different handler in the container
## Development

* To preview a list of available Makefile commands: `make help`
* To install with dev dependencies: `make install`
* To update dependencies: `make update`
* To run unit tests: `make test`
* To lint the repo: `make lint`

The Makefile also includes account-specific `dist`, `publish`, and `update-format-lambda` commands.

The `update-format-lambda` command is required any time an image that contains a change to the format function is published to the ECR, so that the Format Input Lambda uses the updated code.

GitHub Actions is configured to update the Lambda function with every push to the `main` branch.

### Running Locally with Docker

- Build the container

```bash
make dist-dev
```

- Run the default handler for the container

```bash
docker run -e TIMDEX_ALMA_EXPORT_BUCKET_ID=alma-bucket-name \
-e TIMDEX_S3_EXTRACT_BUCKET_ID=timdex-bucket-name \
-e WORKSPACE=dev \
-p 9000:8080 timdex-pipeline-lambdas-dev:latest
```

- POST to the container
Note: running this with next-step transform or load involves an actual S3 connection and is thus tricky to test locally. Better to push the image to Dev1 and test there.

```bash
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{
  "next-step": "extract",
  "run-date": "2022-03-10T16:30:23Z",
  "run-type": "daily",
  "source": "YOURSOURCE",
  "verbose": "true",
  "oai-pmh-host": "https://YOUR-OAI-SOURCE/oai",
  "oai-metadata-format": "oai_dc",
  "oai-set-spec": "YOUR-SET-SPEC"
}'
```

- Observe output
```json
{
  "run-date": "2022-03-10",
  "run-type": "daily",
  "source": "YOURSOURCE",
  "verbose": true,
  "next-step": "transform",
  "extract": {
    "extract-command": [
      "--host=https://YOUR-OAI-SOURCE/oai",
      "--output-file=s3://timdex-bucket-name/YOURSOURCE/YOURSOURCE-2022-03-09-daily-extracted-records-to-index.xml",
      "--verbose",
      "harvest",
      "--metadata-format=oai_dc",
      "--set-spec=YOUR-SET-SPEC",
      "--from-date=2022-03-09"
    ]
  }
}
```
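
The same POST can be issued from Python instead of `curl`. The following is a rough sketch using only the standard library; `build_invocation` and `invoke` are hypothetical names, and the URL is the local invocation endpoint from the `curl` command above. It assumes the container from the earlier step is already running on port 9000.

```python
import json
import urllib.request

# Local invocation endpoint, copied from the curl command above.
LOCAL_INVOKE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"


def build_invocation(
    event: dict, url: str = LOCAL_INVOKE_URL
) -> urllib.request.Request:
    """Hypothetical helper: build a POST request carrying the event as JSON."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def invoke(event: dict) -> object:
    """Hypothetical helper: send the event to the running container.

    Requires the container to be listening locally; returns the decoded
    JSON response (a dict for the format handler).
    """
    with urllib.request.urlopen(build_invocation(event), timeout=30) as response:
        return json.loads(response.read())


# Example usage (needs the running container, so it is left commented out):
# print(invoke({"next-step": "extract", "run-date": "2022-03-10T16:30:23Z",
#               "run-type": "daily", "source": "YOURSOURCE"}))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running container.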

### Running a Specific Handler Locally with Docker
You can call any handler you copy into the container (see Dockerfile) by name as part of the `docker run` command.

```bash
docker run -p 9000:8080 timdex-pipeline-lambdas-dev:latest lambdas.ping.lambda_handler
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
```

Should result in `pong` as the output.
## Environment Variables

### Required

```shell
TIMDEX_ALMA_EXPORT_BUCKET_ID=### The name of the Alma SFTP export S3 bucket, set by Terraform on AWS.
TIMDEX_S3_EXTRACT_BUCKET_ID=### The name of the TIMDEX pipeline S3 bucket, set by Terraform on AWS.
WORKSPACE=### Set to `dev` for local development; this will be set to `stage` and `prod` in those environments by Terraform on AWS.
```
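
As an illustration only, a local script could check that the required variables are set before invoking a handler. `missing_env_vars` is a hypothetical helper, not something this repository provides; the variable names are the three listed above.

```python
import os

# Required variable names, taken from the list above.
REQUIRED_ENV_VARS = (
    "TIMDEX_ALMA_EXPORT_BUCKET_ID",
    "TIMDEX_S3_EXTRACT_BUCKET_ID",
    "WORKSPACE",
)


def missing_env_vars(env=None) -> list[str]:
    """Hypothetical helper: return required names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


# Passing a plain dict avoids touching the real process environment.
print(missing_env_vars({"WORKSPACE": "dev"}))
```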

### Optional

## Makefile Use for AWS
```shell
ETL_VERSION=### Version number of the TIMDEX ETL infrastructure. This can be used to align application behavior with the requirements of other applications in the TIMDEX ETL pipeline.
```

The Makefile includes account specific "dist" "publish" and "update-format-lambda" commands.

"Update-format-lambda" is required anytime an image is published to the ECR that contains a change to the format function in order for the format-lambda to use the updated code.

The github action updates this every push to main no matter what changes are made right now.
44 changes: 41 additions & 3 deletions lambdas/commands.py
@@ -72,10 +72,24 @@ def generate_transform_commands(
    input_data: dict,
    run_date: str,
    timdex_bucket: str,
) -> dict:
) -> dict[str, list[dict]]:
    """Generate task run command for TIMDEX transform."""
    files_to_transform: list[dict] = []
    # NOTE: FEATURE FLAG: branching logic will be removed after v2 work is complete
    etl_version = config.get_etl_version()
    match etl_version:
        case 1:
            return _etl_v1_generate_transform_commands_method(
                extract_output_files, input_data, run_date, timdex_bucket
            )
        case 2:
            return _etl_v2_generate_transform_commands_method()


# NOTE: FEATURE FLAG: branching logic + method removed after v2 work is complete
def _etl_v1_generate_transform_commands_method(
    extract_output_files: list[str], input_data: dict, run_date: str, timdex_bucket: str
) -> dict[str, list[dict]]:
    files_to_transform: list[dict] = []
    source = input_data["source"]
    transform_output_prefix = helpers.generate_step_output_prefix(
        source, run_date, input_data["run-type"], "transform"
@@ -96,14 +110,33 @@ def generate_transform_commands(
        ]

        files_to_transform.append({"transform-command": transform_command})

    return {"files-to-transform": files_to_transform}


# NOTE: FEATURE FLAG: branching logic + method removed after v2 work is complete
def _etl_v2_generate_transform_commands_method() -> dict[str, list[dict]]:
    raise NotImplementedError


def generate_load_commands(
    transform_output_files: list[str], run_type: str, source: str, timdex_bucket: str
) -> dict:
    """Generate task run command for loading records into OpenSearch."""
    # NOTE: FEATURE FLAG: branching logic will be removed after v2 work is complete
    etl_version = config.get_etl_version()
    match etl_version:
        case 1:
            return _etl_v1_generate_load_commands_method(
                transform_output_files, run_type, source, timdex_bucket
            )
        case 2:
            return _etl_v2_generate_load_commands_method()


# NOTE: FEATURE FLAG: branching logic + method removed after v2 work is complete
def _etl_v1_generate_load_commands_method(
    transform_output_files: list[str], run_type: str, source: str, timdex_bucket: str
) -> dict:
    if run_type == "daily":
        files_to_index = []
        files_to_delete = []
@@ -169,3 +202,8 @@ def generate_load_commands(
    return {
        "failure": "Something unexpected went wrong. Please check input and try again."
    }


# NOTE: FEATURE FLAG: branching logic + method removed after v2 work is complete
def _etl_v2_generate_load_commands_method() -> dict:
    raise NotImplementedError
10 changes: 10 additions & 0 deletions lambdas/config.py
@@ -1,5 +1,6 @@
import logging
import os
from typing import Literal

GIS_SOURCES = ["gismit", "gisogm"]
INDEX_ALIASES = {
@@ -51,6 +52,15 @@ def configure_logger(
)


# NOTE: FEATURE FLAG: function will be removed after v2 work is complete
def get_etl_version() -> Literal[1, 2]:
    etl_version = int(os.environ.get("ETL_VERSION", "1"))
    if etl_version not in [1, 2]:
        message = f"ETL_VERSION '{etl_version}' not supported"
        raise ValueError(message)
    return etl_version  # type: ignore[return-value]
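
To see how this function treats different `ETL_VERSION` values, the sketch below copies it (with the return annotation loosened to `int`) and runs it against a few inputs. `try_version` is a hypothetical helper written for this illustration; note that it mutates `os.environ`.

```python
import os
from typing import Optional


def get_etl_version() -> int:
    """Copy of the function in this diff, with a plain int return annotation."""
    etl_version = int(os.environ.get("ETL_VERSION", "1"))
    if etl_version not in [1, 2]:
        message = f"ETL_VERSION '{etl_version}' not supported"
        raise ValueError(message)
    return etl_version


def try_version(raw: Optional[str]) -> str:
    """Hypothetical helper: set (or unset) ETL_VERSION and report the outcome."""
    if raw is None:
        os.environ.pop("ETL_VERSION", None)
    else:
        os.environ["ETL_VERSION"] = raw
    try:
        return f"ok: {get_etl_version()}"
    except ValueError as exc:
        return f"ValueError: {exc}"


for raw in (None, "1", "2", "3", "v2"):
    print(repr(raw), "->", try_version(raw))
```

An unset variable falls back to version 1. Both an unsupported number such as `3` and a non-numeric value such as `v2` raise `ValueError`, although in the second case the error comes from `int()` and carries its message rather than the "not supported" one.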


def validate_input(input_data: dict) -> None:
    """Validate input to the lambda function.
2 changes: 0 additions & 2 deletions pyproject.toml
@@ -23,8 +23,6 @@ select = ["ALL", "PT"]

ignore = [
# default
"ANN101",
"ANN102",
"COM812",
"D107",
"N812",
2 changes: 2 additions & 0 deletions tests/conftest.py
@@ -13,6 +13,8 @@ def _test_env(monkeypatch):
    monkeypatch.setenv("TIMDEX_ALMA_EXPORT_BUCKET_ID", "test-alma-bucket")
    monkeypatch.setenv("TIMDEX_S3_EXTRACT_BUCKET_ID", "test-timdex-bucket")
    monkeypatch.setenv("WORKSPACE", "test")
    # NOTE: FEATURE FLAG: remove after v2 work is complete
    monkeypatch.setenv("ETL_VERSION", "1")


@pytest.fixture(autouse=True)
2 changes: 1 addition & 1 deletion tests/test_commands.py
@@ -272,7 +272,7 @@ def test_generate_load_commands_full_with_deletes_logs_error(caplog):
    ) in caplog.record_tuples


def test_generate_load_command_unexpected_input():
def test_generate_load_commands_unexpected_input():
    transform_output_files = [
        "alma/alma-2022-01-02-full-transformed-records-to-index.json"
    ]