This project extends our earlier entry in the Kaggle competition "Multi Class Prediction of Obesity Risk", where we placed within the top 5%. It improves the models and productionizes the pipeline using best practices learned in class MGSC-695-076. For security reasons, no access keys are shared in this repository.
Tech Stack: Apache Kafka, MLflow, Azure ML, VS Code, Poetry, AutoGluon, H2O, PyCaret, FLAML, PandasAI, Docker, Streamlit, Postman, FastAPI, SHAP
- Data Source: Original Kaggle CSV data split into Model Development and Hold-Off datasets.
- Live Data Simulation: Used Apache Kafka for simulating real-time data feeds.
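The two data bullets above (a Model Development / Hold-Off split, then replaying data as a live feed) can be sketched with the standard library alone. This is a minimal illustration, not the project's actual pipeline: the function names, the 20% hold-off fraction, the toy rows, and the use of `NObeyesdad` as the label column are all assumptions, and `send` is a stand-in for whatever Kafka producer call the project uses.

```python
import json
import random
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Row = Dict[str, str]


def stratified_holdoff_split(
    rows: List[Row], label_key: str, holdoff_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[Row], List[Row]]:
    """Split rows into (development, hold-off), keeping class proportions."""
    by_class: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    development: List[Row] = []
    holdoff: List[Row] = []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        n_holdoff = round(len(group) * holdoff_fraction)
        holdoff.extend(group[:n_holdoff])
        development.extend(group[n_holdoff:])
    return development, holdoff


def replay_as_messages(rows: Iterable[Row], send: Callable[[bytes], None]) -> int:
    """Serialize each row to JSON bytes and pass it to `send`, one message per row."""
    count = 0
    for row in rows:
        send(json.dumps(row, sort_keys=True).encode("utf-8"))
        count += 1
    return count


# Toy data: two classes with 10 rows each (illustrative values only).
rows = [
    {"id": str(i), "NObeyesdad": "Normal_Weight" if i % 2 else "Obesity_Type_I"}
    for i in range(20)
]
development, holdoff = stratified_holdoff_split(rows, label_key="NObeyesdad")

outbox: List[bytes] = []  # stand-in for a Kafka producer's send method
sent = replay_as_messages(holdoff, outbox.append)
print(len(development), len(holdoff), sent)
```

Passing the sender in as a callable keeps the replay logic testable without a running broker; in a real setup the callable would wrap the producer and topic of choice.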
- Workspace Configuration: Established Azure ML Workspace with RBAC.
- Team Roles: Assigned roles for Data Science, Data Engineering, ML Engineering, and Governance.
- Comprehensive Analysis:
  - Univariate Analysis: Leveraged PandasAI for detailed insights.
  - Bivariate Analysis: Used pairplots and interaction plots.
  - Dimensionality Reduction: Applied PCA with KMediansClustering.
- Feature Engineering: Enhanced performance based on EDA insights.
- Normalization and Scaling: Ensured optimal feature scaling.
- Missing Data Handling: Applied appropriate strategies for missing data.
- Poetry Integration: Managed dependencies for reproducibility.
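To make the normalization and missing-data bullets above concrete, here is a small standard-library sketch of median imputation followed by z-score standardization. The README does not say which strategies were actually used, so both choices, the helper names, and the sample values are assumptions for illustration.

```python
import statistics
from typing import List, Optional


def impute_median(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Scale values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


weights_kg = [60.0, None, 80.0, 100.0]  # hypothetical column with one gap
filled = impute_median(weights_kg)
scaled = standardize(filled)
print(filled)
```

In a real pipeline the median, mean, and standard deviation would be computed on the development data only and then reused on the hold-off data, so that no information leaks from the evaluation set.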
- State-of-the-Art Models:
  - Custom models like XGBoost, LightGBM, CatBoost.
- Hyperparameter Tuning: Used Optuna for optimization.
- AutoML Exploration:
  - Explored PyCaret, AutoGluon, and H2O for benchmarking.
- Advanced Techniques: Stacked models, Isolation Forest, custom loss functions.
- MLflow & Azure MLflow Integration:
  - Tracked global and local metrics, target distribution.
- SHAP Analysis: Utilized SHAP values for explainability and error analysis.
- Containerization: Used FastAPI and Docker.
- Azure Deployment: Azure Container Instances, planned Kubernetes.
- Conversion to Azure Scripts:
  - Converted Jupyter notebooks to Python scripts for Azure jobs.
- Azure Pipelines: CI/CD with GitHub Actions and Azure Container Registry.
- Streamlit Application: User-friendly interface integrated with APIs.
- Monitoring Strategy: Drift detection, automated endpoint management.
- UI-Based Experiments: For learning purposes, additionally ran experiments with Azure ML Designer (UI) and SDK v2.
- Cross-Validation: Ensured model generalizability.
- Model Governance: Versioning, lineage tracking, compliance.
- Scalability and Optimization: Performance tests, scalability checks.
- Feedback Loop: Integrated feedback for continuous improvement.
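The cross-validation bullet above does not say which scheme was used. For a multi-class target with uneven class sizes, stratified k-fold is a common choice, and the sketch below shows the idea with the standard library only. The function name and the toy labels are hypothetical; a library implementation would normally be used in practice.

```python
import random
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_k_fold(
    labels: Sequence[str], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training_indices, validation_indices) for each of k folds.

    Indices of each class are shuffled and dealt round-robin across the folds,
    so every fold sees roughly the same class mix as the full dataset.
    """
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for held_out in range(k):
        validation = sorted(folds[held_out])
        training = sorted(
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        )
        yield training, validation


labels = ["a"] * 10 + ["b"] * 5  # toy, imbalanced labels
splits = list(stratified_k_fold(labels, k=5))
print(len(splits))
```

Each index appears in exactly one validation fold, so averaging a metric over the folds uses every row once for evaluation.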
- Main: For Final Product [Owner - Team]
- Experiments: For ML Experiments and tracking [Owners - Arham, Krishan]
- ArchDevelopment: For CI/CD [Owner - Nandani]
- Streamlit: For front end [Owner - Nandani]
- Data Engineering: For Kafka Streaming [Owner - Yash]
- Backup: For Backup [Owner - Aasna, Mahrukh]
- Data Analysis/Model Training: Python, Jupyter Notebooks
- Experiment Tracking: MLflow
- Model Building: PyCaret, LightGBM, XGBoost, CatBoost
- Hyperparameter Optimization: Optuna
- Containerization: Docker
- Realtime Data Streaming: Kafka
- Version Control and CI/CD: Git, GitHub Actions
- Cloud Deployment: Azure Machine Learning, Azure Blob Storage
- User Interface: Streamlit
- Dependency and Environment Management: Poetry
- Python 3.8+
- Poetry
- Docker
- Azure Account
- Kafka
- Clone the Repository

      git clone https://github.com/McGill-MMA-EnterpriseAnalytics/Multi-Class-Prediction-of-Obesity-Risk.git
      cd Multi-Class-Prediction-of-Obesity-Risk
- Install Dependencies

      poetry install
- Set Up Environment Variables

  Create a `.env` file in the root directory and add the necessary environment variables. Example:

      AZURE_SUBSCRIPTION_ID=your_subscription_id
      AZURE_RESOURCE_GROUP=your_resource_group
      AZURE_WORKSPACE_NAME=your_workspace_name
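Before running anything against Azure, it can help to check that the `.env` file contains the three variables named in the example. The sketch below parses a `KEY=value` file with the standard library and reports missing keys. The helper names are hypothetical and the demo writes a throwaway file in a temporary directory; a library such as python-dotenv is often used for the loading step instead.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

# Keys taken from the example .env in this README.
REQUIRED_KEYS = (
    "AZURE_SUBSCRIPTION_ID",
    "AZURE_RESOURCE_GROUP",
    "AZURE_WORKSPACE_NAME",
)


def parse_env_file(path: Path) -> Dict[str, str]:
    """Read KEY=value pairs, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]


# Demo with a deliberately incomplete file.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# Azure settings\n"
        "AZURE_SUBSCRIPTION_ID=your_subscription_id\n"
        "AZURE_RESOURCE_GROUP='your_resource_group'\n",
        encoding="utf-8",
    )
    config = parse_env_file(env_path)
    missing = missing_keys(config)

print(missing)
```

Failing fast on a missing key gives a clearer error than a later authentication failure deep inside a deployment script.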
- Start Docker

  Ensure Docker is running on your machine. Build and run the Docker containers:

      docker-compose up --build
- Run Streamlit Application

      streamlit run Streamlit/app.py
- Run Jupyter Notebooks

  Start Jupyter Lab to run and explore notebooks:

      poetry run jupyter lab
- Azure ML Deployment
  - Configure your Azure workspace by setting up the necessary resources.
  - Use the provided Azure scripts to deploy models and services:

        poetry run python deploy/deploy_to_azure.py
- CI/CD Setup
  - Ensure GitHub Actions are configured correctly.
  - Push changes to the repository to trigger CI/CD pipelines:

        git add .
        git commit -m "Your commit message"
        git push origin main
- Model Monitoring: Utilize integrated monitoring tools to track model performance and detect drift.
- Endpoint Management: Automated endpoint management to ensure availability and performance.
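The monitoring bullet above mentions drift detection without naming a method. One widely used option for a single numeric feature is the Population Stability Index (PSI), sketched below with the standard library. This is an assumed technique, not necessarily what the project's monitoring uses; the function name, bin count, and sample data are illustrative.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    epsilon: float = 1e-6,
) -> float:
    """Compare two samples of one feature using equal-width bins.

    Bin edges come from the expected (reference) sample. Values outside that
    range are clamped into the first or last bin. Empty bins are floored at
    `epsilon` so the logarithm stays defined.
    """
    low, high = min(expected), max(expected)
    width = (high - low) / bins
    if width == 0:
        width = 1.0  # constant reference sample: everything lands in one bin

    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            index = min(max(index, 0), bins - 1)
            counts[index] += 1
        total = len(values)
        return [max(count / total, epsilon) for count in counts]

    expected_shares = bin_shares(expected)
    actual_shares = bin_shares(actual)
    return sum(
        (a - e) * math.log(a / e) for e, a in zip(expected_shares, actual_shares)
    )


reference = [i / 100 for i in range(100)]  # training-time feature values
shifted = [value + 0.3 for value in reference]  # simulated drifted feed

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted > psi_same)
```

A common rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, though suitable thresholds depend on the feature and should be validated on real traffic.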
Our solution targets healthcare providers for early identification of at-risk patients, public health officials for data-driven policy making, and insurance companies for premium adjustment based on individual risk. The economic impact includes significant healthcare cost savings and revenue generation from tailored wellness programs.
This project is an effort by the team to tackle the global health crisis of obesity by employing advanced data science and machine learning techniques, aiming to make a significant impact in the healthcare sector.
- Product Manager - Aasna
- Machine Learning Engineer - Arham
- ML Ops - Krishan
- Data Engineer - Yash
- Cloud SME - Nandani
- Business Analyst - Mahrukh