This project develops, deploys, and maintains a machine learning model for multiclass disease classification of chest X-rays. It uses a Machine Learning Operations (MLOps) framework to ensure seamless development, deployment, and continuous monitoring of the model, and follows best practices for reproducibility, modularity, and scalability.
- Multiclass Thoracic Disease Classification
- Introduction
- Dataset Information
- Project Workflow
- Git Repo and Project Structure
- Installation
- DVC (Data Version Control)
- Google Cloud Computing
- Integration of Docker and Airflow
- Data Extraction
- Schema and Statistics Generation
- Anomalies Detection
- Augmentation and Pre-processing
- Data Pipeline
- Airflow DAGs
- Optimized Pipeline
- Data Versioning
- Model Pipeline
- CI/CD
- Bias Detection
- Contributors
Thoracic diseases, such as pneumonia, emphysema, and cardiomegaly, are significant health concerns globally, affecting millions of people each year. In the United States alone, respiratory diseases contribute to a high percentage of hospitalizations and healthcare costs. Accurate diagnosis of these conditions is crucial for effective treatment, yet traditional diagnostic methods rely heavily on manual analysis of medical images like chest X-rays. This process can be time-consuming and subject to human error, especially in areas with limited access to trained radiologists.
Motivated by these challenges, we chose this topic for our MLOps project. By leveraging the principles of MLOps, we have developed an end-to-end machine learning pipeline for automated thoracic disease classification using chest X-rays. Our solution is built to enhance the accuracy and efficiency of non-invasive diagnostic methods, aiming to assist healthcare providers with reliable, timely, and scalable diagnostic support.
By automating the process of classifying multiple thoracic diseases from chest X-ray images, this approach has the potential to alleviate the burden on healthcare professionals, reduce diagnostic errors, and provide faster results, especially in underserved areas where radiologists may not always be available.
This project integrates best practices in MLOps to ensure that our model is not only accurate but also easy to maintain and deploy, making it suitable for real-world clinical use.
The project utilizes the NIH ChestX-ray14 dataset, one of the largest publicly available chest X-ray image datasets for research purposes. The dataset is provided by the National Institutes of Health (NIH) Clinical Center and consists of 112,120 frontal-view chest X-ray images from 30,805 unique patients. The dataset is annotated with labels for 14 different thoracic disease categories.
- Dataset Name: NIH ChestX-ray14
- Source: National Institutes of Health (NIH) Clinical Center
- Link to Dataset: NIH Clinical Center ChestX-ray Dataset
- Domain: Medical Imaging (Radiology)
- Task: Multiclass disease classification from chest X-ray images
The 14 disease categories are:
- Atelectasis
- Cardiomegaly
- Effusion
- Infiltration
- Mass
- Nodule
- Pneumonia
- Pneumothorax
- Consolidation
- Edema
- Emphysema
- Fibrosis
- Pleural Thickening
- Hernia
This is the basic project flow; it will be updated after the final system architecture is designed.
Project Structure
├── .dvc
│ ├── config # Configuration file for DVC (Data Version Control)
│ ├── .gitignore # Specifies files/folders related to DVC to be ignored by Git
├── .github/workflows
│ ├── model_pipeline.yaml # YAML file responsible for implementing CI/CD via GitHub Actions
├── data
│ ├── raw_notebooks # Directory containing exploratory notebooks
│ │ ├── Pre-processing.ipynb # A visual example of how the pre-processing steps are executed on sample data
│ ├── tests
│ │ ├── test_get_data_from_gcp.py # Unit tests for fetching data from GCP
│ │ ├── test_schema_generation.py # Unit tests for schema generation
│ │ ├── test_anomaly_detection.py # Unit tests for anomaly detection
│ │ ├── test_preprocessing.py # Unit tests for the pre-processing and augmentation steps
├── assets
│ ├── images #contains images required for the project
├── config
│ ├── __init__.py
├── Dags
│ ├── Src
│ │ ├── Data # Directory containing DVC pointer files for the dataset
│ │ | ├── sampled_data.dvc # .dvc file produced when the sampled_data image folder is pushed to GCP using DVC
│ │ | ├── sampled_data_entry.csv.dvc # .dvc file produced when the CSV file containing labels and image indices is pushed to GCP using DVC
│ │ ├── get_data_from_gcp.py # Fetches the data from the remote GCP bucket, where it was pushed using DVC
│ │ ├── cloud_data_management.py # Integrates all the scripts to assist in Airflow execution
│ │ ├── schema_generation.py # Generates the statistics and schema for the data
│ │ ├── anomaly_detection.py # Uses the schema and other metrics to detect anomalies
│ │ ├── preprocessing.py # Applies augmentation to overcome class imbalance and then performs the pre-processing steps
├── logs/scheduler
├── model
│ ├── model_with_grid_search.py # Defines the model architecture, the training and validation process, and best-model selection
│ ├── bias_detection.py # Script for bias detection
│ ├── evaluate.py # Script for evaluating the model's performance
├── .dvcignore # Specifies files/folders that should be ignored by DVC
├── .gitignore # Specifies files/folders that should be ignored by Git
├── dockerfile # Dockerfile to build the Docker image for the data pipeline
├── Dockerfilemodel # Dockerfile to build the Docker image for the model pipeline
├── model_requirements.txt # Python dependencies required for the model pipeline
├── README.md # Project documentation (this file)
├── docker-compose.yaml # Docker Compose file to set up and run multi-container Docker applications
├── requirements.txt # Python dependencies required for running the project locally
├── sampled_data.dvc # .dvc file produced when the sampled_data image folder is pushed to GCP using DVC
├── sampled_data_entry.csv.dvc # .dvc file produced when the CSV file containing labels and image indices is pushed to GCP using DVC
This project requires Python version 3.8 or higher. Please ensure that the correct Python version is installed on your device. The project is compatible with Windows, Linux, and macOS operating systems.
Before you start, please make sure that the following are already installed on your machine; if not, please install them:
- Git
- Docker
- Airflow
- DVC
- Python 3
- Pip
Follow the steps below:
- Clone the repository onto your local machine:
git clone https://github.com/SatwikReddySripathi/Multiclass_Disease_Classification.git
- Add credentials: the .json key files for the GCP bucket and Drive should be stored in the config folder
- Build the Docker image:
In Mac:
docker compose build
In Windows:
docker-compose build
- Run the docker image:
docker compose up
- To view Airflow DAGs on the web server, visit http://localhost:8080 and log in with the credentials:
user: airflow password: airflow
- Run the DAG by clicking on the play button on the right side of the window
- To stop the Docker containers, press Ctrl+C in the terminal to kill all the containers
DVC is a crucial open-source tool for tracking changes in the data throughout the machine learning pipeline, which makes the data reproducible and traceable over time. DVC keeps Git repositories minimal and clutter-free by storing meta-information separately from the data itself, and its seamless integration with Git makes it possible to handle code, data, and models efficiently. This approach guarantees that project states can be readily rebuilt and streamlines collaboration. DVC's emphasis on data versioning is essential for keeping machine learning workflows reliable and intact. In this project, the local data is uploaded to GCP using DVC, which provides remote storage of the data.
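For illustration only, here is a minimal sketch of how DVC-tracked data could be read back programmatically with the `dvc.api` Python interface; the file path and access pattern below are assumptions, not the project's actual workflow.

```python
import dvc.api

# Hypothetical example: read a DVC-tracked CSV from the remote configured
# for this repository (the path is an assumption based on the .dvc files).
REPO = "https://github.com/SatwikReddySripathi/Multiclass_Disease_Classification.git"
PATH = "sampled_data_entry.csv"

# dvc.api.read fetches the file contents for the repo's current revision
# from the DVC remote (the GCS bucket) without pulling the whole dataset.
csv_text = dvc.api.read(PATH, repo=REPO, mode="r")
print(csv_text.splitlines()[0])  # show the header row
```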
Google Cloud Platform (GCP) is a suite of cloud computing services offered by Google that provides modular services for computing, data storage, data analytics, and machine learning, alongside a set of management tools. For this project, you must create a service account in the Google Cloud console (https://console.cloud.google.com/) and create a GCP bucket to store the data and models used by the project.
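As a rough sketch (the bucket name and file paths are placeholders, not the project's real configuration), uploading a local file to a GCS bucket with a service-account key might look like this:

```python
from google.cloud import storage

# Placeholder values -- substitute your own key path and bucket name.
KEY_PATH = "config/service_account.json"
BUCKET_NAME = "my-disease-classification-bucket"

# Authenticate explicitly with the service-account JSON key stored in config/.
client = storage.Client.from_service_account_json(KEY_PATH)
bucket = client.bucket(BUCKET_NAME)

# Upload a local file to the bucket.
blob = bucket.blob("data/sampled_data_entry.csv")
blob.upload_from_filename("sampled_data_entry.csv")
print(f"Uploaded to gs://{BUCKET_NAME}/{blob.name}")
```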
The code essential to execute Airflow is located in the docker-compose.yaml
file. We can deliver our data pipeline with the necessary dependencies installed by using Docker and containerization. This makes it platform independent, so our data pipeline will work effectively on Windows, Mac, or Linux.
get_data_from_gcp.py
is the module responsible for fetching the data from the GCP bucket and storing the images and metadata. It also compresses the images and stores the images and metadata in a pickle file.
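A minimal sketch of what such an extraction step might look like, assuming a hypothetical bucket layout and output path (not the project's actual values):

```python
import io
import pickle
from google.cloud import storage
from PIL import Image

# Hypothetical configuration -- adjust to your own bucket and paths.
BUCKET_NAME = "my-disease-classification-bucket"
PREFIX = "sampled_data/"            # folder of X-ray images in the bucket
OUTPUT_PICKLE = "data/images.pkl"

client = storage.Client.from_service_account_json("config/service_account.json")
bucket = client.bucket(BUCKET_NAME)

records = []
for blob in bucket.list_blobs(prefix=PREFIX):
    if not blob.name.lower().endswith(".png"):
        continue
    raw = blob.download_as_bytes()
    # Compress by resizing and re-encoding as JPEG in memory.
    image = Image.open(io.BytesIO(raw)).convert("L").resize((224, 224))
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=85)
    records.append({"image_index": blob.name.split("/")[-1],
                    "image_bytes": buffer.getvalue()})

# Store the compressed images together with their metadata in a pickle file.
with open(OUTPUT_PICKLE, "wb") as f:
    pickle.dump(records, f)
```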
This module automates data validation, schema generation, and statistics collection for the data pipeline. In schema_generation.py
, TensorFlow Data Validation (TFDV) is used to ensure data integrity, consistency, and quality across training, evaluation, and serving datasets. By defining a standardized schema, this step enables data drift detection, bias assessment, and robust data handling, forming a critical part of our end-to-end machine learning pipeline.
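A rough sketch of how TFDV can generate statistics and infer a schema from tabular metadata (the DataFrame columns here are illustrative assumptions):

```python
import pandas as pd
import tensorflow_data_validation as tfdv

# Hypothetical metadata frame: image index plus disease labels.
df = pd.DataFrame({
    "Image Index": ["00000001_000.png", "00000002_000.png"],
    "Finding Labels": ["Cardiomegaly", "Effusion|Infiltration"],
})

# Compute descriptive statistics over the dataset.
stats = tfdv.generate_statistics_from_dataframe(df)

# Infer a schema (expected types, domains, presence) from the statistics.
schema = tfdv.infer_schema(stats)
tfdv.display_schema(schema)

# Persist the schema so later pipeline stages can validate against it.
tfdv.write_schema_text(schema, "schema.pbtxt")
```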
The anomaly_detection.py
script is a utility within this project that performs data validation and anomaly detection. It ensures data consistency by loading and analyzing datasets against a predefined schema using TensorFlow Data Validation (TFDV). The script loads data from a pickle file, applies schema validation, and reports anomalies, aiding in data quality management. It checks for a variety of data issues, including image formats, label correctness, class distribution, and schema conformity, making it an indispensable tool for validating the format and quality of the medical imaging data.
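A minimal sketch of validating freshly computed statistics against a previously saved schema with TFDV (the pickle and schema paths are assumptions carried over from the sketches above):

```python
import pickle
import pandas as pd
import tensorflow_data_validation as tfdv

# Load the metadata produced earlier in the pipeline (hypothetical path),
# keeping only tabular columns for statistics generation.
with open("data/images.pkl", "rb") as f:
    records = pickle.load(f)
df = pd.DataFrame(records).drop(columns=["image_bytes"], errors="ignore")

# Load the schema written by the schema-generation step.
schema = tfdv.load_schema_text("schema.pbtxt")

# Compute statistics for the current data and validate them against the schema.
stats = tfdv.generate_statistics_from_dataframe(df)
anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)

if anomalies.anomaly_info:
    tfdv.display_anomalies(anomalies)
else:
    print("No anomalies detected.")
```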
The preprocessing.py
script enhances data quality and increases the diversity of the medical image dataset, which supports effective model training. It handles key tasks like loading, resizing, grayscale conversion, and applying CLAHE (Contrast Limited Adaptive Histogram Equalization) to improve contrast. These steps enhance image clarity, making the images more suitable for model interpretation, which is especially critical in medical imaging. An augmentation generator also applies transformations like rotations and zooms, expanding the dataset and helping the model generalize better.
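As a sketch of these steps (the parameter values and file name are illustrative, not the project's tuned settings):

```python
import cv2
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def preprocess_image(path, size=(224, 224)):
    """Load an X-ray, convert to grayscale, resize, and apply CLAHE."""
    image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    image = cv2.resize(image, size)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    image = clahe.apply(image)
    return image.astype(np.float32) / 255.0  # scale to [0, 1]

# Augmentation generator: small rotations, shifts, and zooms to expand the dataset.
augmenter = ImageDataGenerator(rotation_range=10, zoom_range=0.1,
                               width_shift_range=0.05, height_shift_range=0.05)

image = preprocess_image("sample_xray.png")            # hypothetical file
batch = image.reshape((1, *image.shape, 1))            # add batch/channel dims
augmented = next(augmenter.flow(batch, batch_size=1))  # one augmented sample
```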
This is our data pipeline, which includes the following key components:
In Airflow, a DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code. Below is the structure of the DAG being executed:
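As a rough sketch of how such a DAG might be wired in code (the task names and callables are placeholders standing in for the project's actual modules under Dags/Src):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder task functions -- in this project they would call the modules
# under Dags/Src (get_data_from_gcp, schema_generation, and so on).
def extract_data(**kwargs): ...
def generate_schema(**kwargs): ...
def detect_anomalies(**kwargs): ...
def preprocess_images(**kwargs): ...
def upload_data(**kwargs): ...

with DAG(
    dag_id="chest_xray_data_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,   # triggered manually from the Airflow UI
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="data_extraction", python_callable=extract_data)
    t2 = PythonOperator(task_id="schema_generation", python_callable=generate_schema)
    t3 = PythonOperator(task_id="anomaly_detection", python_callable=detect_anomalies)
    t4 = PythonOperator(task_id="image_preprocessing", python_callable=preprocess_images)
    t5 = PythonOperator(task_id="upload_data", python_callable=upload_data)

    # Linear dependency chain mirroring the stages described in this section.
    t1 >> t2 >> t3 >> t4 >> t5
```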
The optimized data pipeline in Airflow efficiently processes data through a series of stages. It starts with Data Extraction to gather raw data, followed by Data Schema validation to ensure consistency. Anomaly Detection then identifies outliers, improving data quality. Image Preprocessing prepares images through resizing and transformations, aligning them with model requirements. Finally, Upload Data stores the processed data for downstream applications.
The Gantt chart shows an efficient timeline, with Data Extraction occupying the longest duration, followed by swift transitions through subsequent stages. This design enables rapid, accurate data processing, supporting a scalable and robust end-to-end workflow.
- Initialize dvc in the directory
dvc init
- Set up remote GCP connection.
dvc remote add -d myremote gs://<bucket>/<path>
- Set the .json credential path
dvc remote modify myremote credentialpath <GOOGLE-KEY-JSON-PATH>
In this pipeline, the image classification model is built. As part of the pipeline, the architecture is designed and training and validation take place. A grid search technique is also used to pick the best hyperparameters and improve the performance of the model. As the outcome of the pipeline, the best model (after validation, bias detection, and mitigation) is pushed to the GCP registry.
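A simplified sketch of how a grid search over hyperparameters could work (the architecture, grid values, and dummy data are illustrative assumptions, not the project's actual configuration):

```python
from itertools import product
import numpy as np
import tensorflow as tf

def build_model(learning_rate, dropout, num_classes=14):
    """Small CNN for multi-label chest X-ray classification (illustrative only)."""
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(num_classes, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Dummy data so the sketch runs end to end; the real pipeline uses the
# preprocessed X-ray images and their 14 disease labels.
x = np.random.rand(16, 224, 224, 1).astype("float32")
y = np.random.randint(0, 2, size=(16, 14)).astype("float32")
train_ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(8)
val_ds = train_ds

# Illustrative grid; the actual search space may differ.
grid = {"learning_rate": [1e-3, 1e-4], "dropout": [0.3, 0.5]}

best_model, best_acc = None, 0.0
for lr, dr in product(grid["learning_rate"], grid["dropout"]):
    model = build_model(lr, dr)
    history = model.fit(train_ds, validation_data=val_ds, epochs=2, verbose=0)
    val_acc = max(history.history["val_accuracy"])
    if val_acc > best_acc:
        best_model, best_acc = model, val_acc

best_model.save("best_model.keras")  # candidate for pushing to the GCP registry
```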
Triggering of the model pipeline is handled as part of continuous integration and deployment using GitHub Actions. model_pipeline.yaml
is responsible for triggering the pipeline and executing every task in sequence to run the model pipeline. The model pipeline is triggered whenever a new commit is pushed to the repository, continuously ensuring that the model pipeline stays up to date.
Detecting bias is a vital step to ensure that the model is fair to every demographic group present in the data; this is done using bias_detection.py
. Tools like TFDV are used to slice the data into meaningful demographic slices, and each slice is evaluated with the best model to ensure the model is not biased towards any particular group.
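A rough sketch of slice-based evaluation (the demographic column, toy data, and metric are illustrative; the actual script may rely on TFDV/TFMA slicing utilities instead):

```python
import numpy as np
import pandas as pd

# Hypothetical evaluation frame: one row per image with predictions attached.
eval_df = pd.DataFrame({
    "Patient Gender": ["M", "F", "F", "M", "F", "M"],
    "label":      [1, 0, 1, 1, 0, 0],
    "prediction": [1, 0, 0, 1, 0, 1],
})

# Evaluate the model separately on each demographic slice.
for gender, slice_df in eval_df.groupby("Patient Gender"):
    acc = np.mean(slice_df["label"] == slice_df["prediction"])
    print(f"Accuracy for gender={gender}: {acc:.2f}")

# Flag large gaps between slices as potential bias to investigate and mitigate.
accs = eval_df.groupby("Patient Gender").apply(
    lambda s: np.mean(s["label"] == s["prediction"]))
print("Max accuracy gap across slices:", accs.max() - accs.min())
```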