
Run StreamSets Job on an Ephemeral Engine on Kubernetes

This project provides an example of how to use the StreamSets Platform SDK to automate the process of running a StreamSets Job on a "just-in-time" engine deployment on Kubernetes.

This deployment pattern can help minimize the expense of long-running and under-utilized StreamSets engines.

The Python script in this project performs the following steps:

  • Clones a StreamSets Kubernetes Deployment from a pre-existing template and assigns a unique engine label to the new deployment.

  • Starts the deployment, which causes an engine with the unique label to be deployed on Kubernetes.

  • Assigns the unique engine label to the Job intended to run on the engine.

  • Starts the Job, which will run on the newly deployed engine.

  • Waits for the Job to complete.

  • Tears down the engine and deletes the Deployment.

This example assumes the use of WebSocket Tunneling, which simplifies the process of cloning a deployment. If you must use Direct Engine REST APIs instead, the same "just-in-time" deployment pattern can be used, but with the added requirement of configuring ingress for the cloned deployment. See the project here for an example of using the StreamSets SDK to automate deploying engines on Kubernetes with Direct Engine REST APIs.
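The steps above can be sketched as a single function. This is a hand-written sketch, not the project's actual code: `sch` is assumed to behave like the SDK's `ControlHub` object, the method names (`deployments.get`, `start_deployment`, `start_job`, `delete_deployment`) follow the SDK v6.x style but should be checked against the SDK documentation, and `clone_deployment` plus the label attributes are placeholders for whatever the real script uses.

```python
import time

def run_job_on_ephemeral_engine(sch, template_deployment_id, new_name,
                                job_id, engine_label, max_wait_secs=600):
    """Sketch of the ephemeral-engine flow. `sch` is assumed to behave like
    streamsets.sdk.ControlHub; method names are assumptions, not verified API."""
    template = sch.deployments.get(deployment_id=template_deployment_id)

    # Clone the template and give the clone a unique engine label.
    # (clone_deployment is a placeholder, not a confirmed SDK method.)
    deployment = sch.clone_deployment(template, name=new_name)
    deployment.engine_labels = [engine_label]
    sch.start_deployment(deployment)   # engine comes up on Kubernetes

    # Bind the Job to the new engine by label, then start it.
    job = sch.jobs.get(job_id=job_id)
    job.data_collector_labels = [engine_label]
    sch.start_job(job)

    # Poll until the Job goes INACTIVE or the timeout expires.
    # (A real script would refresh the Job's status from Control Hub
    # on each iteration.)
    deadline = time.time() + max_wait_secs
    while time.time() < deadline:
        if job.status == 'INACTIVE':
            break
        time.sleep(10)
    else:
        raise TimeoutError('Job did not complete within max wait time')

    # Tear down: stop the engine and delete the Deployment.
    sch.delete_deployment(deployment)
    return job.status
```

The unique label is what ties the whole flow together: it is set on the cloned deployment before the engine starts and on the Job before it runs, so Control Hub schedules the Job onto exactly that engine.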

Prerequisites

  • A Python 3.9+ environment with the StreamSets Platform SDK v6.0+ module installed. This example was tested using Python 3.11.5 and StreamSets SDK v6.4.

  • StreamSets API Credentials

  • An active StreamSets Kubernetes Environment with an online Kubernetes Agent.

  • A StreamSets Kubernetes Deployment that this project will clone at runtime (see below for details).

The Python script enforces a maximum wait time for the Job to complete; adjust that timeout to fit your environment.
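The wait logic amounts to a bounded polling loop. A minimal stand-alone sketch, with made-up constant names rather than the script's actual variables:

```python
import time

MAX_WAIT_SECONDS = 600        # hypothetical name; tune for your longest jobs
POLL_INTERVAL_SECONDS = 10

def wait_until(condition, max_wait=MAX_WAIT_SECONDS,
               interval=POLL_INTERVAL_SECONDS):
    """Poll `condition` until it returns True or the deadline passes.

    Returns True if the condition was met in time, False on timeout.
    """
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False
```

The script itself prints a "Waiting for Job to complete..." line on each poll, as seen in the command-line output below.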

Running the Example

  • Clone this project to your local machine

  • Create a file named sdk-env.sh in the project's private directory containing your API credentials as quoted strings, with no spaces around the equals signs, like this:

     export CRED_ID="esdgew……193d2"
     export CRED_TOKEN="eyJ0…………J9."
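The Python script can then pick these up from the environment. A small sketch of that step (the variable names come from the file above; the function name and error handling are my own, and the commented-out `ControlHub` call reflects the SDK's documented `credential_id`/`token` constructor arguments):

```python
import os

def load_credentials():
    """Read the API credentials exported by private/sdk-env.sh.

    Fails fast with a clear message if either variable is missing.
    """
    try:
        cred_id = os.environ['CRED_ID']
        cred_token = os.environ['CRED_TOKEN']
    except KeyError as missing:
        raise SystemExit(f'Missing environment variable: {missing} '
                         '(did you source private/sdk-env.sh?)')
    return cred_id, cred_token

# The script can then connect, e.g. (SDK v6.x style):
#   from streamsets.sdk import ControlHub
#   sch = ControlHub(credential_id=cred_id, token=cred_token)
```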
    
  • Select a Job to run. For example, I'll use a Job that performs a batch load from SQL Server to Snowflake, with a pipeline like this:

[Screenshot: pipeline]

  • Select an existing StreamSets Kubernetes Deployment, or create a new one, to serve as a template. Make sure the deployment includes the stage libraries needed to run the pipeline; in my case, the JDBC and Snowflake stage libraries. For this pattern, set the deployment's "desired instances" to one with autoscaling disabled, so that a single engine is deployed, and size the engine's CPU and memory for running only a single pipeline at a time.

In my example, I'll use a Deployment named deployment-template:

[Screenshot: deployment-template]

  • Execute the project's top level shell script using a command of the form:

$ ./run-streamsets-job-on-ephemeral-engine-on-k8s.sh <deployment_to_clone_id> <new_deployment_name> <job_id> <engine_label>

For example, I'll specify the ID of my template deployment, the name of the new deployment I want to create ("ephemeral-1"), the Job ID, and a globally unique engine label ("ephemeral-label-1") which will bind the Job to the engine:

$ ./run-streamsets-job-on-ephemeral-engine-on-k8s.sh \
    19dc63bb-5911-4e9e-b71c-8a6d6e29a9c7:8030c2e9-1a39-11ec-a5fe-97c8d4369386 \
    ephemeral-1 \
    9caf20bc-dd88-4665-8ef2-10140e7a5417:8030c2e9-1a39-11ec-a5fe-97c8d4369386 \
    ephemeral-label-1
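Because the engine label must be globally unique across runs, one option is to generate it at invocation time rather than hard-coding it. This is a suggestion, not something the project does for you; the function name is made up:

```python
import uuid

def make_engine_label(prefix='ephemeral'):
    """Generate an engine label unlikely to collide with any other
    deployment's labels, e.g. 'ephemeral-3f2a9c1b7d4e'."""
    return f'{prefix}-{uuid.uuid4().hex[:12]}'
```

The generated label would then be passed as the fourth argument to the shell script.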

As the script runs, you should see a new deployment enter the Activating state:

[Screenshot: deployment in Activating state]

After a minute or two it should transition to an Active state:

[Screenshot: deployment in Active state]

Once the deployment is Active, you should see that a new engine has registered with Control Hub:

[Screenshot: registered engine]

And then the Job should start:

[Screenshot: running Job]

When the Job completes, the engine and the deployment will be deleted.

Note that even though the engine and deployment are deleted, the Job's history retains the full metrics of the run:

[Screenshots: Job history and metrics]

Command-Line Output

Here is the command-line output from running the script:

% ./run-streamsets-job-on-ephemeral-engine-on-k8s.sh \
 19dc63bb-5911-4e9e-b71c-8a6d6e29a9c7:8030c2e9-1a39-11ec-a5fe-97c8d4369386 \
 ephemeral-1 \
 9caf20bc-dd88-4665-8ef2-10140e7a5417:8030c2e9-1a39-11ec-a5fe-97c8d4369386 \
 ephemeral-label-1
2024-09-17 11:25:24 ----
2024-09-17 11:25:24 Run StreamSets Job on Ephemeral Kubernetes Deployment
2024-09-17 11:25:24 ----
2024-09-17 11:25:24 Source Deployment's ID: 19dc63bb-5911-4e9e-b71c-8a6d6e29a9c7:8030c2e9-1a39-11ec-a5fe-97c8d4369386
2024-09-17 11:25:24 New Deployment's name: ephemeral-1
2024-09-17 11:25:24 Job ID: 9caf20bc-dd88-4665-8ef2-10140e7a5417:8030c2e9-1a39-11ec-a5fe-97c8d4369386
2024-09-17 11:25:24 Engine Label: ephemeral-label-1
2024-09-17 11:25:24 ----
2024-09-17 11:25:24 Connecting to Control Hub
2024-09-17 11:25:26 Found Job 'SQLServer to Snowflake'
2024-09-17 11:25:27 Found source Deployment 'deployment-template'
2024-09-17 11:25:27 Cloning Deployment
2024-09-17 11:25:27 Setting the new Deployment's engine label
2024-09-17 11:25:27 Starting Deployment
2024-09-17 11:26:22 Deployment is ACTIVE
2024-09-17 11:26:23 Engine is online
2024-09-17 11:26:23 ----
2024-09-17 11:26:23 Setting the Job's engine label
2024-09-17 11:26:23 Starting the Job
2024-09-17 11:26:39 Job status is ACTIVE
2024-09-17 11:26:39 Waiting for Job to complete...
2024-09-17 11:26:39 Waiting for Job to complete...
2024-09-17 11:26:49 Waiting for Job to complete...
2024-09-17 11:26:59 Waiting for Job to complete...
2024-09-17 11:27:09 Waiting for Job to complete...
2024-09-17 11:27:19 Waiting for Job to complete...
2024-09-17 11:27:29 Job completed successfully
2024-09-17 11:27:29 Job status is INACTIVE
2024-09-17 11:27:29 ----
2024-09-17 11:27:29 Stopping engine and deleting Deployment
2024-09-17 11:28:23 ----
2024-09-17 11:28:23 Done
