A toy project to help me learn MLOps. What does it do?
- `create_db`: Creates a simple MySQL database from a single data source (CSV); a rough sketch of this step is shown below the list. This code runs on AWS EC2.
- `training_pipeline`: Extracts data from the remote MySQL table (above), performs simple preprocessing, and trains baseline ML models. The training results, params, and datasets are recorded and stored with MLFlow so they can be traced back when needed.
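As a rough illustration, the `create_db` step could look something like the sketch below; the CSV path, connection string, and table name are placeholders, not the actual code in this repo:

```python
# Hypothetical sketch of the create_db step: load a CSV and write it to MySQL.
# The file path, credentials, and table name are placeholders.
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_csv("data/raw_data.csv")  # assumed CSV location

# Connection string format: mysql+pymysql://user:password@host:port/database
engine = create_engine("mysql+pymysql://user:password@ec2-host:3306/mlops_db")

# Write the dataframe as a table, replacing it if it already exists.
df.to_sql("raw_data", con=engine, if_exists="replace", index=False)
```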
My Airflow DAG includes the following tasks (a rough sketch of how they could be wired is shown after the list):
- `print_task` and `setup_config_task`: Print and save the `execution_date`, and set up the config for data preprocessing and training.
- `load_task`: Access the remote MySQL database and save the dataset in parquet.
- `preprocess_task`: Perform data preprocessing and save the preprocessed data separately.
- `train_task`: Train a classification model using PyCaret and track it with MLFlow. Save the best-performing model for deployment.
- `clean_task`: Clean the caches and temp files.
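The DAG file itself is not reproduced here, but a minimal sketch of how these tasks could be wired with `BashOperator` looks roughly like this (the script names and schedule are assumptions):

```python
# Hypothetical sketch of dags/dag_training_bash.py; the actual scripts and
# schedule in the repo may differ.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dag_training_bash",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually from the UI or CLI
    catchup=False,
) as dag:
    print_task = BashOperator(task_id="print_task", bash_command="echo {{ ds }}")
    setup_config_task = BashOperator(task_id="setup_config_task", bash_command="python setup_config.py")
    load_task = BashOperator(task_id="load_task", bash_command="python load_data.py")
    preprocess_task = BashOperator(task_id="preprocess_task", bash_command="python preprocess.py")
    train_task = BashOperator(task_id="train_task", bash_command="python train.py")
    clean_task = BashOperator(task_id="clean_task", bash_command="rm -rf tmp/ __pycache__/")

    print_task >> setup_config_task >> load_task >> preprocess_task >> train_task >> clean_task
```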
To run `training_pipeline` with Airflow, do as follows:
Step 1: Go to `training_pipeline`

```bash
cd mlflow_pipeline
```
Step 2: Build and deploy the Docker image

```bash
docker compose build && docker compose up
```
Step 3: Go inside the Docker container

```bash
docker ps
```

Copy the container id (or name) and run:

```bash
docker exec -it container_id bash
```

where `container_id` is replaced by your container id.
Step 4: Set up Airflow inside the Docker container

```bash
export AIRFLOW_HOME=$(pwd) && airflow db init
```

`AIRFLOW_HOME` is the environment variable Airflow uses to locate its home directory. Your `AIRFLOW_HOME` is set to `training_pipeline/`. You can check it by:

```bash
echo $AIRFLOW_HOME
```
Step 5: Check existing DAGs and create an Airflow user

The command below should return the DAG defined in `dags/dag_training_bash`:

```bash
airflow dags list
```

Airflow requires you to create a simple account to access the webserver UI:

```bash
airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin
```

You can check that your account was created successfully by running:

```bash
airflow users list
```
Step 6: Run the Airflow scheduler and webserver

```bash
airflow scheduler
```

Open another terminal tab, go inside your Docker container, and run:

```bash
airflow webserver -p 8080
```

Go to http://localhost:8080/. You can keep track of your DAGs and runs here; you can also trigger the DAG and check the results.
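Besides the UI, you can also trigger a run programmatically. The sketch below uses Airflow 2's stable REST API with the admin account created in Step 5; it assumes the basic-auth API backend is enabled and that the DAG id is `dag_training_bash`, so treat it as an illustration rather than the exact call for this repo:

```python
# Hypothetical example: trigger a DAG run through Airflow's REST API.
# Assumes basic-auth is enabled for the API and the DAG id is dag_training_bash.
import requests

response = requests.post(
    "http://localhost:8080/api/v1/dags/dag_training_bash/dagRuns",
    auth=("admin", "admin"),  # the user created in Step 5
    json={"conf": {}},        # optional run configuration
)
print(response.status_code, response.json())
```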
Please refer to the Airflow docs [source] to read about Airflow's concepts and usage.
If you just want to run MLFlow instead of the whole Airflow DAG, do as follows:

Step 1: Go to `training_pipeline`

```bash
cd mlflow_pipeline
```
Step 2: Build the Docker image

```bash
docker build -t mlflow-pipeline -f Dockerfile .
```
Step 3: Run the MLFlow project

```bash
mlflow run . -P baseline_sample=<float> -P task=<string> -P target=<string> --experiment-name <name>
```

Where:
- `baseline_sample` is the fraction of the data used for running the baseline (default 0.1)
- `task` is 'classification' or 'regression'
- `target` is the name of the target feature
- `--experiment-name` is the name of this experiment run
Example:

```bash
mlflow run . -P baseline_sample=0.1 -P task='classification' -P target='click' --experiment-name pycaret_experiment
```
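For illustration, an entry point that consumes these parameters could look roughly like the sketch below. The function names come from PyCaret's classification module, but the file paths, argument wiring, and session id are assumptions rather than the actual script in this repo:

```python
# Hypothetical sketch of an MLproject entry point; the real script may differ.
# For simplicity this sketch only handles task='classification'.
import argparse

import pandas as pd
from pycaret.classification import compare_models, save_model, setup

parser = argparse.ArgumentParser()
parser.add_argument("--baseline_sample", type=float, default=0.1)
parser.add_argument("--task", type=str, default="classification")
parser.add_argument("--target", type=str, required=True)
args = parser.parse_args()

# Load the dataset produced by the load/preprocess steps (assumed path).
df = pd.read_parquet("data/preprocessed.parquet")
df = df.sample(frac=args.baseline_sample, random_state=42)

# PyCaret's built-in MLFlow integration logs params, metrics, and artifacts.
setup(data=df, target=args.target, log_experiment=True,
      experiment_name="pycaret_experiment", session_id=42)

best_model = compare_models()         # train and rank baseline models
save_model(best_model, "best_model")  # persist the best-performing model
```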
Step 4: Check the recorded results
Note: When running against a local tracking URI, MLflow mounts the host system's tracking directory (e.g., a local `mlruns` directory) inside the container so that metrics, parameters, and artifacts logged during project execution are accessible afterwards [source]. In our case, the `mlruns` folder is mounted back to the local directory (not stored inside the Docker image).
Simply call `mlflow ui` from the command line to access the tracking UI:

```bash
mlflow ui
```
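If you prefer to inspect the recorded runs programmatically instead of through the UI, a minimal sketch with MLFlow's Python API is shown below; it assumes the local `./mlruns` store and the experiment name used in the example above:

```python
# Query logged runs programmatically from the local mlruns directory.
import mlflow

mlflow.set_tracking_uri("file:./mlruns")

# Returns a pandas DataFrame with params, metrics, and run metadata.
runs = mlflow.search_runs(experiment_names=["pycaret_experiment"])
print(runs[["run_id", "status", "start_time"]].head())
```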
This is a toy project for me to learn Airflow and MLFlow at a basic level. I'm aware that industry deployment is more complicated and sophisticated than what is shown in this repo.
A data pipeline with a proper ETL process should be built to handle data loading from MySQL. Also, more MLOps components such as Feast and CI/CD tooling could be used to improve the quality of this project.