This project predicts whether a person is suffering from asthma, a respiratory disease. A Streamlit-based application classifies the person into one of three categories: suffering from mild asthma, suffering from moderate or high asthma, or not suffering from asthma.
The application takes as input whether the person has symptoms such as tiredness, runny nose, nasal congestion, and difficulty in breathing, along with age group and gender, and predicts the likelihood of the disease.
The Kaggle CSV dataset below is used as the reference. It contains 30,000+ rows, which are processed to obtain a 3,000-row processed CSV file, asthma_detection.csv.
Dataset link: https://www.kaggle.com/datasets/deepayanthakur/asthma-disease-prediction
The following processing is performed on this dataset:
- re-creation of a new asthma output column
- SMOTE (Synthetic Minority Oversampling Technique) to manage class imbalance
- feature scaling and feature engineering
The resulting processed data, asthma_detection.csv, is used to train the model.
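The idea behind the SMOTE step can be illustrated with a simplified sketch. This is not the notebook's code: the function name `smote_like_oversample`, its parameters, and the toy samples are hypothetical, and only the standard library is used. In practice a library such as imbalanced-learn (`imblearn.over_sampling.SMOTE`) is the usual choice.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (the core SMOTE idea).
    Simplified illustration only, not the notebook's implementation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: four minority-class samples with two features each.
minority_samples = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.9, 1.9]]
new_samples = smote_like_oversample(minority_samples, n_new=6)
print(len(new_samples))
```

Because each synthetic point lies on a segment between two real minority samples, the new rows stay inside the region already occupied by that class instead of simply duplicating existing rows.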
The entire model training workflow is shown in the Asthma_detection_ML.ipynb file. Please refer to it along with the final dataset, asthma_detection.csv.
The project follows a structured methodology, from the data preprocessing pipeline through model training, evaluation, and deployment:
-
Data Preprocessing and feature engineering: The following data preprocessing and feature engineering steps are performed:
- removal of missing values and duplicates, oversampling
- reverse encoding and label encoding
- correlation test and matrices
- Outlier detection and removal
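Some of the steps listed above (dropping missing values and duplicates, label encoding, and outlier removal) might look roughly like the following pandas sketch. The column names, the toy values, and the 1.5 * IQR rule are assumptions for illustration; the notebook may use a different schema and a different outlier criterion.

```python
import pandas as pd

# Hypothetical toy frame; column names are illustrative, not the dataset's real schema.
df = pd.DataFrame({
    "Tiredness": [1, 0, 1, 1, None, 0, 1],
    "Age": [25, 40, 25, 33, 51, 29, 400],
    "Severity": ["Mild", "None", "Mild", "Moderate", "None", "Mild", "None"],
})

# Drop rows with missing values, then exact duplicate rows.
clean = df.dropna().drop_duplicates().reset_index(drop=True)

# Label-encode the target with an explicit, readable mapping.
severity_codes = {"None": 0, "Mild": 1, "Moderate": 2}
clean["Severity"] = clean["Severity"].map(severity_codes)

# Remove outliers in a numeric column with the 1.5 * IQR rule (an assumed criterion).
q1, q3 = clean["Age"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["Age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = clean[in_range].reset_index(drop=True)

print(clean)
```

An explicit mapping dictionary keeps the encoded classes in a known order, which makes it easy to translate predictions back into readable labels later.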
-
Exploratory Data Analysis (EDA): After data preprocessing, the next step is exploratory data analysis using plotting libraries such as matplotlib, pandas, seaborn and plotly. The following plots were produced in this step:
- Pie chart of old-age people suffering and not suffering from asthma
- Histogram of tiredness vs asthma category
- Violin category plot of the likelihood of different disorders with asthma
- count plot of all classes to detect class imbalance
- Box plot for outlier detection
(Refer to the output folder for these images and graph observations, as well as the output of the web application created with Streamlit.) In addition, the following graphs are plotted during model training and evaluation:
- confusion matrix and classification report for random forest model
- confusion matrix and classification report for SVM model
- comparison charts for the SVM and random forest models
-
Model Training and evaluation: Two machine learning models, random forest and support vector machine (SVM), are trained on the processed input data:
- random forest accuracy: 85%
- support vector machine accuracy: 84%
10-fold cross-validation is then performed on the random forest model, giving a final average cross-validated accuracy of 84% with a deviation of 2%.
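As a rough sketch of how a 10-fold cross-validation like this can be run with scikit-learn: the synthetic three-class data below stands in for asthma_detection.csv, and the hyperparameters are assumptions, so the printed numbers will not match the accuracies reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class stand-in for asthma_detection.csv (assumption, not the real data).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Hyperparameters are illustrative; the notebook's settings may differ.
model = RandomForestClassifier(n_estimators=50, random_state=42)

# With a classifier and an integer cv, scikit-learn uses stratified k-fold splits.
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
mean_acc, std_acc = scores.mean(), scores.std()

print(f"mean accuracy: {mean_acc:.2f} +/- {std_acc:.2f}")
```

Reporting the mean together with the standard deviation across folds shows how much the accuracy depends on the particular train/test split.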
This random forest model is then loaded into the Streamlit application using the joblib library.
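Saving a trained model with joblib and loading it back could look like the sketch below. The file name `asthma_rf.joblib`, the temporary directory, and the synthetic training data are all hypothetical; the actual file name used by the application is not stated here.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption), used only to have a fitted model to save.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3, random_state=0
)
model = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# "asthma_rf.joblib" is a hypothetical file name.
model_path = os.path.join(tempfile.mkdtemp(), "asthma_rf.joblib")
joblib.dump(model, model_path)

# In the web application, only this loading step would be needed.
loaded = joblib.load(model_path)
same = bool((loaded.predict(X) == model.predict(X)).all())
print(same)
```

The model should be loaded with a scikit-learn version compatible with the one used for training, since serialized estimators are not guaranteed to work across versions.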
-
Inference: The model is deployed as a Streamlit web application that detects asthma from the input features.
- Joblib: for saving and loading the random forest model
- Scikit-learn: for machine learning processing and operations
- Matplotlib: for plotting and visualizing the detection results
- Pandas: for data manipulation
- NumPy: for efficient numerical operations
- Seaborn: for advanced data visualizations
- Plotly: for 3D data visualizations
- Streamlit: for creating the GUI of the web application
-
Clone the Repository:
git clone url_to_this_repository
-
Install Dependencies:
pip install -r requirements.txt
-
Run the Model:
streamlit run main.py
-
View Results: The application lets you predict whether a person is suffering from asthma based on the input features.