Carbon is a tool for loading data into Symplectic Elements. Carbon retrieves records from the Data Warehouse, normalizes the data and writes it to XML files, and uploads those XML files to the Elements FTP server. It is used to create and run the following feed types:
`people`
: Provides data for the HR Feed.

`articles`
: Provides data for the Publications Feed.
Please refer to the mitlib-tf-workloads-carbon repository for the deployment configuration.
For more information on the Carbon application, please refer to our internal documentation on Confluence.
This flowchart depicts the data flow from MIT's Data Warehouse, through the application's in-memory buffered streams, to the final output: an XML file on the Elements FTP server.
```mermaid
flowchart TB
    subgraph ext-source[Database]
        mit-dwrhs[(MIT Data Warehouse)]
    end
    subgraph in-memory[Application In-memory]
        direction TB
        rec-generator([Query Results Generator])
        subgraph piped[Piped Read-Write Buffer]
            buffered-writer([Buffered Writer])
            buffered-reader([Buffered Reader])
        end
        ftps-client((FTPS Client))
    end
    subgraph elements-ftp[Elements FTP server]
        direction TB
        xml-file([Feed XML file])
    end
    mit-dwrhs -->|Fetch query results| rec-generator
    rec-generator -->|Yield records one at a time, <br> transform each record into a normalized XML string, <br> and pass it to the write buffer| piped
    buffered-writer -.->|Pipe contents to read buffer| buffered-reader
    buffered-reader -->|Read buffer acts as the data feed for the XML file on the FTP server| ftps-client
    ftps-client -->|Stream contents from read buffer to an XML file on FTP server| xml-file
```
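The piped read-write buffer in the middle of the diagram keeps memory usage flat: records are written into one end of a pipe while the FTPS client streams the other end to the server, so the full feed never has to fit in memory. The following is a minimal sketch of that pattern, assuming an already-connected `ftplib.FTP_TLS` client; it is illustrative, not Carbon's actual implementation, and the function name and parameters are hypothetical:

```python
import ftplib
import os
import threading
from collections.abc import Iterator


def stream_feed(records: Iterator[str], ftps: ftplib.FTP_TLS, remote_path: str) -> None:
    """Pipe normalized XML strings straight to an FTPS upload, one record at a time."""
    read_fd, write_fd = os.pipe()

    def write_records() -> None:
        # Writer side: consume the generator and feed the pipe.
        with os.fdopen(write_fd, "wb") as writer:
            for record in records:
                writer.write(record.encode("utf-8"))

    thread = threading.Thread(target=write_records)
    thread.start()
    # Reader side: storbinary reads from the pipe and streams to the server.
    # Error handling is omitted for brevity.
    with os.fdopen(read_fd, "rb") as reader:
        ftps.storbinary(f"STOR {remote_path}", reader)
    thread.join()
```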
- To install with dev dependencies: `make install`
- To update dependencies: `make update`
- To lint the repo: `make lint`
The Data Warehouse runs on an older version of Oracle that necessitates the "thick" mode of `python-oracledb`, which requires the Oracle Instant Client library (this app was developed with version 21.9.0.0.0).
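For reference, enabling thick mode is a one-time call at startup. This is a minimal sketch, not Carbon's actual startup code; it assumes `ORACLE_LIB_DIR` (described in Required Env below) points at the Instant Client directory:

```python
import os

import oracledb

# Load the Oracle Instant Client to switch python-oracledb into thick mode.
# This must be called once, before any connection is created.
oracledb.init_oracle_client(lib_dir=os.environ["ORACLE_LIB_DIR"])
```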
The test suite uses SQLite, so you can develop and test without connecting to the Data Warehouse.
- Run `make test` to run unit tests.
- Export AWS credentials for the `Dev1` environment. For local runs, the `AWS_DEFAULT_REGION` environment variable must also be set.
- Create a `.env` file at the root folder of the Carbon repo, and set the required environment variables described in Required Env. Note: The host for the Data Warehouse is different when connecting from outside of AWS (which uses Cloudconnector). For assistance, reach out to the Data Warehouse team.
- Connect to an approved VPN client.
- Follow the steps relevant to the machine you are running:
  - If you are on a machine that cannot run Oracle Instant Client, follow the steps outlined in With Docker. When running the application locally, skip the `make publish-dev` step, as it is not necessary to publish the container image to ECR. Note: As of this writing, Apple M1 Macs cannot run Oracle Instant Client.
  - If you are on a machine that can run Oracle Instant Client, follow the steps outlined in Without Docker.
- The data retrieved by the Carbon application contains personally identifiable information (PII), so downloading the files is not recommended. However, if examining the files created by Carbon is absolutely necessary for testing purposes, this can be done on your local machine via a Docker container. For more information, please refer to the Confluence document: How to download files from an application that connects to the Data Warehouse. Note: Any downloaded files or `.env` files must be deleted immediately after testing is complete.
- Run `make dependencies` to download the Oracle Instant Client from S3.
- Run `make dist-dev` to build the Docker container image.
- Run `make publish-dev` to push the Docker container image to ECR for the `Dev1` environment.
- Run any `make` commands for testing the application. In the Makefile, the names of the relevant make commands contain the suffix '-with-docker'.
- Download Oracle Instant Client (basiclite is sufficient) and set the `ORACLE_LIB_DIR` env variable.
- Run any `make` commands for testing the application. In the Makefile, the relevant make commands are those without the '-with-docker' suffix.
The application can be run as an ECS task. Any runs that require a connection to the Data Warehouse must be executed as a task in the `Stage-Workloads` environment because Cloudconnector is not enabled in `Dev1`. This requires building and publishing the Docker container image to ECR for `Stage-Workloads`.
- Export AWS credentials for the `stage` environment. The `ECR_NAME_STAGE` and `ECR_URL_STAGE` environment variables must also be set. The values correspond to the 'Repository name' and 'URI' indicated on ECR for the container image, respectively.
- Run `make dist-stage` to build the Docker container image.
- Run `make publish-stage` to push the Docker container image to ECR for the `stage` environment.
- Run any `make` commands for testing the application. In the Makefile, the names of the relevant make commands contain the suffix '-with-ecs-stage' (e.g., `run-connection-tests-with-ecs-stage`).
For an example, see Connecting to the Data Warehouse.
In the AWS Organization, we have an automated pipeline from `Dev1` --> `Stage-Workloads` --> `Prod-Workloads`, handled by GitHub Actions.
When a PR is merged into the `main` branch, GitHub Actions builds a new container image. The container image is tagged with `latest` and the shortened hash of the commit that merges the PR to `main`, then uploaded to the ECR repository in `Stage-Workloads`. An EventBridge scheduled event periodically triggers the Fargate task to run; the task uses the latest image from the ECR registry.
Tagging a release on the `main` branch will promote a copy of the `latest` container from `Stage-Workloads` to `Prod-Workloads`.
The password for the Data Warehouse is updated each year. To verify that the updated password works, run the connection tests for Carbon: when executed with the flag `--run_connection_tests`, Carbon tests its connections to the Data Warehouse and the Elements FTP server (a sketch of the Data Warehouse check appears after the steps below).
- Export AWS credentials for the `stage` environment. The `ECR_NAME_STAGE` and `ECR_URL_STAGE` environment variables must also be set. The values correspond to the 'Repository name' and 'URI' indicated on ECR for the container image, respectively.
- Run `make install`.
- Run `make run-connection-tests-with-ecs-stage`.
- View the logs from the ECS task run on CloudWatch.
  - On CloudWatch, select the `carbon-ecs-stage` log group.
  - Select the most recent log stream.
  - Verify that the following log is included: `Successfully connected to the Data Warehouse: <VERSION NUMBER>`
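For context, the Data Warehouse half of that check amounts to little more than opening a connection and reporting the server version. This is an illustrative sketch, not Carbon's actual test code; the function name and parameters are hypothetical:

```python
import oracledb


def check_data_warehouse(user: str, password: str, dsn: str) -> None:
    # A successful connect proves the credentials (e.g., a rotated password) work;
    # connection.version is the Oracle server version seen in the log line above.
    with oracledb.connect(user=user, password=password, dsn=dsn) as connection:
        print(f"Successfully connected to the Data Warehouse: {connection.version}")
```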
WORKSPACE="dev" # Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform.
FEED_TYPE="people" # Type of feed, either "people" or "articles".
DATAWAREHOUSE_CLOUDCONNECTOR_JSON='{"USER": "<VALID_DATAWAREHOUSE_USERNAME>", "PASSWORD": "<VALID_DATAWAREHOUSE_PASSWORD>", "HOST": "<VALID_DATAWAREHOUSE_HOST>", "PORT": "<VALID_DATAWAREHOUSE_PORT>", "PATH": "<VALID_DATAWAREHOUSE_ORACLE_SID>", "CONNECTION_STRING": "<VALID_DATAWAREHOUSE_CONNECTION_STRING>"}' # JSON formatted string of key/value pairs for the MIT Data Warehouse connection.
SYMPLECTIC_FTP_JSON='{"SYMPLECTIC_FTP_HOST": "<VALID_ELEMENTS_FTP_HOST>", "SYMPLECTIC_FTP_PORT": "<VALID_ELEMENTS_FTP_PORT>", "SYMPLECTIC_FTP_USER": "<VALID_ELEMENTS_FTP_USER>", "SYMPLECTIC_FTP_PASS": "<VALID_ELEMENTS_FTP_PASSWORD>"}' # A JSON formatted string of key/value pairs for connecting to the Symplectic Elements FTP server.
SYMPLECTIC_FTP_PATH="<FTP_FILE_DIRECTORY>/<FEED_TYPE>.xml" # Full XML file path that is uploaded to the Symplectic Elements FTP server.
SNS_TOPIC="<VALID_SNS_TOPIC_ARN>" # SNS topic ARN used for sending email notifications.
LOG_LEVEL="INFO" # The log level for the 'carbon' application. Defaults to 'INFO' if not set.
ORACLE_LIB_DIR="<PATH>" # The directory containing the Oracle Instant Client library.
SENTRY_DSN="<SENTRY_DSN>" # If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
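Note that the two connection variables are JSON strings rather than sets of individual variables. As an illustration (not Carbon's actual code) of how such values can be consumed:

```python
import json
import os

# Parse the JSON-formatted connection settings from the environment.
dw = json.loads(os.environ["DATAWAREHOUSE_CLOUDCONNECTOR_JSON"])
ftp = json.loads(os.environ["SYMPLECTIC_FTP_JSON"])

dsn = dw["CONNECTION_STRING"]  # or build one from HOST, PORT, and PATH
print(f"Data Warehouse DSN: {dsn}; FTP host: {ftp['SYMPLECTIC_FTP_HOST']}")
```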