This project implements a Gaussian Naive Bayes classifier to detect breast cancer based on the Breast Cancer Wisconsin dataset. The classifier is built using both scikit-learn's GaussianNB model and a custom Gaussian Naive Bayes implementation.
The Naive Bayes algorithm is a probabilistic classifier based on Bayes' Theorem, assuming that the features are conditionally independent given the class (the "naive" assumption). Gaussian Naive Bayes is the special case that assumes each continuous feature follows a Gaussian (normal) distribution.
The dataset used in this project is the Breast Cancer Wisconsin dataset, available in scikit-learn. It contains 569 instances, each described by 30 real-valued features (mean radius, mean texture, etc.) computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The target variable is binary:
- 0: Malignant (cancerous)
- 1: Benign (non-cancerous)
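As a quick sanity check, a minimal snippet like the following can be used to confirm the dataset's shape and class labels (variable names are illustrative):

```python
from sklearn.datasets import load_breast_cancer

# Load the Breast Cancer Wisconsin dataset bundled with scikit-learn
data = load_breast_cancer()

print(data.data.shape)         # (569, 30) -> 569 instances, 30 features
print(data.target_names)       # ['malignant' 'benign'] -> 0 = malignant, 1 = benign
print(data.feature_names[:3])  # e.g. 'mean radius', 'mean texture', ...
```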
Naive Bayes is based on Bayes' Theorem, which calculates the posterior probability of a class \( C \) given a set of features \( X = \{x_1, x_2, \ldots, x_n\} \):

\[
P(C \mid X) = \frac{P(X \mid C)\, P(C)}{P(X)}
\]

Where:
- \( P(C \mid X) \): Posterior probability of class \( C \) given the feature vector \( X \)
- \( P(X \mid C) \): Likelihood of feature vector \( X \) given class \( C \)
- \( P(C) \): Prior probability of class \( C \)
- \( P(X) \): Evidence (total probability of \( X \))
In Gaussian Naive Bayes, we assume that the likelihood \( P(x_i \mid C) \) of each feature is a Gaussian (normal) distribution, parameterized by the mean and variance of that feature in the training data for each class.
For a given feature \( x \) with mean \( \mu \) and variance \( \sigma^2 \), the probability density function is given by:

\[
P(x \mid C) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]
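As a quick illustration (the values below are chosen arbitrarily), the formula can be checked numerically against `scipy.stats.norm`:

```python
import numpy as np
from scipy.stats import norm

mu, var, x = 14.0, 4.0, 15.5     # arbitrary example values
sigma = np.sqrt(var)

# Gaussian PDF written out exactly as in the formula above
manual = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
print(np.isclose(manual, norm(loc=mu, scale=sigma).pdf(x)))  # True
```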
The Gaussian Naive Bayes algorithm multiplies the per-feature likelihoods for each class by that class's prior and predicts the class with the highest posterior probability, as written below.
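Concretely, since \( P(X) \) is the same for every class, the predicted class is the one that maximizes the product of the prior and the per-feature Gaussian likelihoods:

\[
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
\]

In practice, implementations usually sum log-probabilities instead of multiplying raw probabilities to avoid numerical underflow.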
- Load the Breast Cancer dataset using scikit-learn's `load_breast_cancer()` function.
- Create a Pandas DataFrame containing the features and the target.
- Split the data into training (80%) and testing (20%) sets using `train_test_split`, as sketched below.
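A minimal sketch of this data preparation step (the `random_state` value is an assumption for reproducibility; the 80/20 split matches the description above):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the dataset and wrap features + target in a DataFrame
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Split into 80% training / 20% testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df[data.feature_names], df["target"],
    test_size=0.2, random_state=42
)
```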
The custom implementation of Gaussian Naive Bayes performs the following steps (see the sketch after this list):
- Calculate mean and variance for each feature within each class on the training data.
- Calculate prior probabilities based on the frequency of each class in the training set.
- Predict class for new data by computing the posterior probability using Bayes' Theorem and Gaussian PDF.
- Evaluate the accuracy of the custom implementation by comparing predictions to true labels.
- Accuracy is calculated as the proportion of correct predictions out of the total number of predictions.
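A minimal sketch of such a custom implementation, working in log-space for numerical stability (the class and method names here are illustrative, not the project's actual code):

```python
import numpy as np

class CustomGaussianNB:
    """Illustrative Gaussian Naive Bayes classifier."""

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # Per-class mean, variance, and prior estimated from the training data
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def _log_likelihood(self, X):
        # Log of the Gaussian PDF, summed over features, for each class
        X = np.asarray(X, dtype=float)[:, np.newaxis, :]           # (n, 1, d)
        log_pdf = -0.5 * (np.log(2 * np.pi * self.var_)
                          + (X - self.mean_) ** 2 / self.var_)     # (n, k, d)
        return log_pdf.sum(axis=2)                                  # (n, k)

    def predict(self, X):
        # Posterior (up to a constant) = log prior + summed log likelihood
        log_posterior = self.log_prior_ + self._log_likelihood(X)
        return self.classes_[np.argmax(log_posterior, axis=1)]

    def score(self, X, y):
        # Accuracy: proportion of correct predictions
        return float(np.mean(self.predict(X) == np.asarray(y)))
```

Its accuracy can then be compared directly against scikit-learn's `GaussianNB` on the same split, e.g. `CustomGaussianNB().fit(X_train, y_train).score(X_test, y_test)`.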
- Train the model using `GaussianNB()` from scikit-learn, as sketched below.
- Evaluate the model's accuracy on the test data.
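A minimal sketch of the scikit-learn baseline, reusing the `X_train`/`X_test` split from the data preparation step above:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Fit scikit-learn's Gaussian Naive Bayes on the training split
model = GaussianNB()
model.fit(X_train, y_train)

# Evaluate accuracy on the held-out test split
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```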