
Simple MLOps with Airflow, MLflow, and PyCaret

1. Introduction

A toy project to help me learn MLOps. What does it do?

  • create_db: Creates a simple MySQL database from a single data source (CSV). This code runs on AWS EC2 (a minimal sketch follows this list).
  • training_pipeline: Extracts data from the remote MySQL table (above), performs simple preprocessing, and trains baseline ML models. The training results, parameters, and datasets are recorded and stored with MLflow so they can be traced back later.
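
For illustration, here is a minimal sketch of what the create_db step could look like: loading a CSV into MySQL with pandas and SQLAlchemy. The file path, table name, and connection URI are placeholders, not the repo's actual values.

# Hypothetical sketch of create_db: load a CSV file into a MySQL table.
# The CSV path, table name, and connection URI below are placeholders.
import pandas as pd
from sqlalchemy import create_engine


def create_db(csv_path: str, table_name: str, connection_uri: str) -> None:
    """Read a CSV file and write its contents to a MySQL table."""
    df = pd.read_csv(csv_path)
    engine = create_engine(connection_uri)  # e.g. "mysql+pymysql://user:pass@host:3306/dbname"
    df.to_sql(table_name, con=engine, if_exists="replace", index=False)


if __name__ == "__main__":
    create_db("data/source.csv", "raw_data", "mysql+pymysql://user:pass@host:3306/mlops")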

2. How to run with Airflow

My Airflow DAG includes the following tasks (a minimal DAG sketch follows the list):

  • print_task and setup_config_task: Print and save the execution_date, and set up the config for data preprocessing and training.
  • load_task: Connect to the remote MySQL database and save the dataset as Parquet.
  • preprocess_task: Perform data preprocessing and save the preprocessed data separately.
  • train_task: Train classification models with PyCaret and track the runs with MLflow. Save the best-performing model for deployment.
  • clean_task: Clean up caches and temporary files.
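
For reference, here is a minimal sketch of how such a DAG could be wired together with BashOperator tasks. The script names, dag_id, and schedule are assumptions for illustration; the actual DAG lives in dags/dag_training_bash and may differ.

# Illustrative DAG with the tasks listed above (Airflow 2.x style).
# Script names, dag_id, and schedule are assumptions, not the repo's exact code.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="training_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually from the UI or CLI
    catchup=False,
) as dag:
    print_task = BashOperator(task_id="print_task", bash_command="echo {{ ds }}")
    setup_config_task = BashOperator(task_id="setup_config_task", bash_command="python setup_config.py")
    load_task = BashOperator(task_id="load_task", bash_command="python load_data.py")
    preprocess_task = BashOperator(task_id="preprocess_task", bash_command="python preprocess.py")
    train_task = BashOperator(task_id="train_task", bash_command="python train.py")
    clean_task = BashOperator(task_id="clean_task", bash_command="rm -rf tmp/ __pycache__/")

    print_task >> setup_config_task >> load_task >> preprocess_task >> train_task >> clean_task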

To run training_pipeline with Airflow, proceed as follows:

Step 1: Go to training_pipeline

cd mlflow_pipeline

Step 2: Build and run the Docker image

docker compose build && docker compose up

Step 3: Go inside the Docker container

docker ps

Copy the container ID (or name) and run:

docker exec -it container_id bash

Where container_id is your container ID (or name).

Step 4: Set up Airflow inside the Docker container

export AIRFLOW_HOME=$(pwd) && airflow db init

Where AIRFLOW_HOME is the environment variable that tells Airflow where to keep its files. Your AIRFLOW_HOME is set to training_pipeline/. You can check it with:

echo $AIRFLOW_HOME

Step 5: Check existing DAGs and create an Airflow user

The command below should return the DAG defined in dags/dag_training_bash:

airflow dags list

Airflow requires you to create an account to access the webserver UI:

airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin

You can check if your account is created successfully by running:

airflow users list

Step 6: Run the Airflow scheduler and webserver

airflow scheduler

Open another terminal tab, go inside your Docker container, and run:

airflow webserver -p 8080

Go to http://localhost:8080/. You can track your DAGs and their runs here, trigger a DAG manually, and check the results.

Please refer to the Airflow documentation [source] to read about Airflow concepts and usage.

3. How to run with MLflow only

If you just want to run MLflow instead of the whole Airflow DAG, do as follows:

Step 1: Go to training_pipeline

cd mlflow_pipeline

Step 2: Build the Docker image

docker build -t mlflow-pipeline -f Dockerfile .

Step 3: Run the MLflow project

mlflow run . -P baseline_sample=<float> -P task=<string> -P target=<string> --experiment-name <name>

Where:

  • baseline_sample is the fraction of the data used for the baseline run (default 0.1)
  • task is 'classification' or 'regression'
  • target is the name of the target feature
  • --experiment-name is the name of the MLflow experiment for this run

Example:

mlflow run . -P baseline_sample=0.1 -P task='classification' -P target='click' --experiment-name pycaret_experiment
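
Behind these parameters, the MLproject entry point could look roughly like the sketch below. The file names, sampling logic, and classification-only branch are assumptions; the repo's actual training script may differ.

# Rough sketch of a training entry point driven by the parameters above.
# Only the classification branch is shown; paths and names are assumptions.
import argparse

import pandas as pd
from pycaret.classification import compare_models, save_model, setup


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline_sample", type=float, default=0.1)
    parser.add_argument("--task", type=str, default="classification")
    parser.add_argument("--target", type=str, required=True)
    args = parser.parse_args()

    # Sample a fraction of the preprocessed data for a quick baseline run.
    data = pd.read_parquet("data/preprocessed.parquet").sample(frac=args.baseline_sample, random_state=42)

    # log_experiment=True lets PyCaret log params, metrics, and artifacts to MLflow.
    setup(data=data, target=args.target, log_experiment=True,
          experiment_name="pycaret_experiment", session_id=42)
    best_model = compare_models()          # train and rank several baseline models
    save_model(best_model, "best_model")   # persist the best-performing pipeline


if __name__ == "__main__":
    main()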

Step 4: Check the recorded results

Note: When running against a local tracking URI, MLflow mounts the host system's tracking directory (e.g., a local mlruns directory) inside the container so that metrics, parameters, and artifacts logged during project execution are accessible afterwards [source]. In our case, the mlruns folder is mounted back to the local directory (not stored inside the Docker image).

Simply run mlflow ui on the command line to access the tracking UI.

mlflow ui
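
If you prefer to inspect runs programmatically rather than through the UI, something like the following works; the experiment name here matches the example above and is an assumption.

# Optional: query the recorded runs in Python instead of the UI.
# "pycaret_experiment" matches the example above; adjust to your experiment name.
import mlflow

experiment = mlflow.get_experiment_by_name("pycaret_experiment")
runs = mlflow.search_runs(experiment_ids=[experiment.experiment_id])

# `runs` is a pandas DataFrame with one row per run, including logged params and metrics.
print(runs[["run_id", "status", "start_time"]].head())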

4. Limitations

This is a toy project for me to learn Airflow and MLflow at a basic level. I'm aware that industry deployments are more complicated and sophisticated than what is shown in this repo.

A data pipeline with a proper ETL process should be built to handle data loading from MySQL. Additional MLOps components, such as Feast and CI/CD tooling, could also improve the quality of this project.

5. References

  • MLOps Crash Course [source]: Helpful and practical course on MLOps.
  • Airflow doc [source]
  • MLflow doc [source]
