- Built a model that takes the cell nucleus features of a breast tumor as input and predicts whether the tumor is benign or malignant.
- The model is trained on the Breast Cancer Wisconsin (Diagnostic) dataset from UCI, obtained via Kaggle. It contains 569 samples, each a set of features computed from a digitized image of a breast mass.
- Five different models were trained on the data. K-fold cross-validation was used to check for overfitting, and a final trained Support Vector Machine (SVM) model was used to build the predictor.
To install the web framework requirements, run `pip install -r requirements.txt`.
The following changes were made to the data to make it usable for a model:
- Removed the column containing null values.
- Counted the malignant vs. benign tumors.
- Encoded the categorical variables as numerical values so they can be used in the ML model.
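The cleaning steps above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the column names (`id`, `diagnosis`, `Unnamed: 32`) are guesses at the Kaggle CSV layout, the tiny in-memory frame stands in for the real file, and the `M`/`B` to `1`/`0` mapping is one plausible encoding rather than necessarily the one used here.

```python
import pandas as pd

# Tiny stand-in for the Kaggle CSV. The column names are assumptions
# about that file's layout, not taken from this project.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, 12.45, 20.57],
    "Unnamed: 32": [float("nan")] * 4,
})

# Remove any column that consists entirely of null values.
df = df.dropna(axis=1, how="all")

# Count malignant (M) vs. benign (B) samples.
class_counts = df["diagnosis"].value_counts()
print(class_counts)

# Encode the categorical label as a number (assumed mapping: M -> 1, B -> 0).
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0})
```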
Various analyses were performed on the dataset and the models. Below are a few highlights.
`StandardScaler` was used to standardize each feature by removing the mean and scaling to unit variance. The data was split into training and test sets, with 20% held out for testing.
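A minimal sketch of the split and scaling step, assuming scikit-learn. The copy of the Wisconsin dataset bundled with scikit-learn stands in for the CSV, and `random_state=0` is an arbitrary choice. Fitting the scaler on the training portion only is an assumption made here to keep test-set statistics out of training; the project may have applied the two steps in a different order.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# scikit-learn's bundled copy of the Wisconsin dataset stands in for the CSV.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% of the samples for testing; random_state is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```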
Five different models were tried and evaluated based on their metrics:
- Logistic Regression
- K-Nearest Neighbor
- Decision Tree Classifier
- Random Forest Method
- Support Vector Machines
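The comparison of these five models could look roughly like the sketch below, assuming scikit-learn. All hyperparameters are library defaults apart from `max_iter` and `random_state`, which are choices made for this example and not the project's settings, so the printed accuracies will not necessarily match the ones reported in this README.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of the Wisconsin dataset, split 80/20 and standardized.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Illustrative settings only; the project's hyperparameters are not known.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

# Fit each model and record its accuracy on the held-out test set.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy = {test_accuracy[name]:.2%}")
```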
The SVM model outperformed the other approaches on the test and validation sets.
- Random Forest: Accuracy = 94.73%
- Decision Tree: Accuracy = 93.85%
- Logistic Regression: Accuracy = 96.49%
- K-NN: Accuracy = 95.61%
- SVM: Accuracy = 98.24%
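To check that a single test-set figure is not a lucky split, k-fold cross-validation can be run on the SVM. The sketch below assumes scikit-learn, uses `k = 10` with shuffling (both assumptions, since the number of folds is not stated here), and wraps the scaler and classifier in a pipeline so that scaling is refit inside each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The pipeline refits the scaler on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), SVC())

# k = 10 and random_state = 0 are assumptions made for this example.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
fold_scores = cross_val_score(pipeline, X, y, cv=cv)

print(f"mean accuracy = {fold_scores.mean():.2%}, std = {fold_scores.std():.2%}")
```

A small spread across folds, with a mean close to the test accuracy, suggests the model is not simply overfitting one particular split.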
The final model is a trained SVM that accepts nucleus features from the user and predicts whether the tumor is malignant or benign. The final model can be downloaded from `svm_model.pkl`.
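As a rough illustration of how a pickled model like this can be saved, reloaded, and used for a prediction, the sketch below trains a stand-in SVM, writes it to a temporary `svm_model.pkl`, and loads it back. It assumes the file was produced with Python's `pickle` module and that scaling is bundled with the classifier; neither is confirmed here, and the real file may expect inputs in a different form.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Train a stand-in model; the project's own svm_model.pkl is not used here.
data = load_breast_cancer()
X, y = data.data, data.target
model = make_pipeline(StandardScaler(), SVC()).fit(X, y)

# Save to a temporary file and load it back, as a consumer of the file would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "svm_model.pkl"
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(model_path, "rb") as f:
        loaded = pickle.load(f)

# One row of 30 nucleus features stands in for the user's input.
sample = X[:1]
label_index = loaded.predict(sample)[0]

# In scikit-learn's bundled copy, label 0 is malignant and 1 is benign.
prediction = str(data.target_names[label_index])
print(prediction)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.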