- Introduction
- Data Collection
- Data Processing
- Exploratory Data Analysis (EDA) & Statistics
- Key Visualizations
- Feature Observations
- Feature Selection
- Train & Test Data
- Logistic Regression
- Decision Tree
- Conclusion
The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper “The Use of Multiple Measurements in Taxonomic Problems” as an example of linear discriminant analysis.
The dataset was collected from Kaggle [https://www.kaggle.com/uciml/iris]. It contains 3 species — setosa, versicolor and virginica — with 50 samples each. The features are sepal length, sepal width, petal length and petal width, all measured in cm.
- Loaded the data.
# Load the Iris CSV dataset
import pandas as pd

iris_data = pd.read_csv('../data/iris.csv')
- Let’s group the data by species and do some descriptive statistics:
# Groupby Species for descriptive statistics
iris_data.groupby('species').describe().T
- The count row shows that there are 50 samples for each species.
Setosa:
- Average sepal length is 5cm
- Average sepal width is 3cm
- Average petal length is 1.5cm
- Average petal width is 0.25cm
Versicolor:
- Average sepal length is 6cm
- Average sepal width is 2.8cm
- Average petal length is 4.26cm
- Average petal width is 1.32cm
Virginica:
- Average sepal length is 6.6cm
- Average sepal width is 3cm
- Average petal length is 6cm
- Average petal width is 2cm
From the statistics above:
- Based on petal length, we can easily separate the three species: Setosa (1.5 cm), Versicolor (4.26 cm) and Virginica (6 cm).
- Based on petal width, Setosa (0.25 cm) is clearly separated from Versicolor (1.32 cm) and Virginica (2 cm).
- Sepal width looks similar for all three species — Setosa (3 cm), Versicolor (2.8 cm) and Virginica (3 cm).
- Based on sepal length, there are only small differences between the three species (5 cm, 6 cm and 6.6 cm). Since sepal width looks similar across all species, we can drop that feature.
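The per-species averages above can be reproduced programmatically. A minimal sketch — it uses scikit-learn's bundled copy of the Iris data instead of the notebook's `../data/iris.csv`, so the column names differ but the values are the same:

```python
import pandas as pd
from sklearn.datasets import load_iris

# scikit-learn ships the same 150-sample Iris data as the Kaggle CSV
iris = load_iris(as_frame=True)
df = iris.frame
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

# Per-species feature means in cm (matches the descriptive statistics above)
means = df.groupby('species')[iris.feature_names].mean().round(2)
print(means)

# Gap between the largest and smallest species mean, per feature:
# petal length spreads the species far more than sepal width does
spread = means.max() - means.min()
print(spread)
```

The `spread` series makes the feature-selection argument quantitative: petal length varies by roughly 4 cm across species, while sepal width varies by well under 1 cm.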
# Split the full dataset into 70% training and 30% testing
from sklearn.model_selection import train_test_split

train, test = train_test_split(iris_data, test_size=0.3)
print(train.shape)
print(test.shape)
# Feature subsets: petals only and sepals only
petal = iris_data[['petal_length', 'petal_width', 'species']]
sepal = iris_data[['sepal_length', 'sepal_width', 'species']]

# Petal train/test split
train_p, test_p = train_test_split(petal, test_size=0.3, random_state=0)
train_x_p = train_p[['petal_length', 'petal_width']]
train_y_p = train_p.species
test_x_p = test_p[['petal_length', 'petal_width']]
test_y_p = test_p.species

# Sepal train/test split
train_s, test_s = train_test_split(sepal, test_size=0.3, random_state=0)
train_x_s = train_s[['sepal_length', 'sepal_width']]
train_y_s = train_s.species
test_x_s = test_s[['sepal_length', 'sepal_width']]
test_y_s = test_s.species
# Logistic regression on the petal and sepal feature sets
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Logistic Regression using Petals is:', metrics.accuracy_score(test_y_p, prediction))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Logistic Regression using Sepals is:', metrics.accuracy_score(test_y_s, prediction))
- The accuracy of the Logistic Regression using Petals is: 0.9777777777777777
- The accuracy of the Logistic Regression using Sepals is: 0.8222222222222222
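A confusion matrix shows *where* the sepal-based model loses its accuracy. A minimal sketch, using scikit-learn's bundled Iris data (rather than the notebook's `iris_data` frame) so it runs standalone; the exact counts depend on the split:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

iris = load_iris(as_frame=True)
X = iris.data[['sepal length (cm)', 'sepal width (cm)']]  # sepal features only
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Rows = true species, columns = predicted species
cm = confusion_matrix(y_test, model.predict(X_test))
print(pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names))
```

Typically the off-diagonal counts concentrate in the versicolor/virginica cells — those two species overlap heavily in sepal space, which is why sepal-only accuracy trails the petal-only model.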
# Decision tree on the same feature sets
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Decision Tree using Petals is:', metrics.accuracy_score(test_y_p, prediction))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Decision Tree using Sepals is:', metrics.accuracy_score(test_y_s, prediction))
- The accuracy of the Decision Tree using Petals is: 0.9555555555555556
- The accuracy of the Decision Tree using Sepals is: 0.6444444444444445
- From both models, I can confirm that the petal features give higher accuracy than the sepal features.
- This is further supported by the correlation heatmap, which shows a stronger correlation between petal length and petal width than between sepal length and sepal width.
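The heatmap claim can be checked numerically. A minimal sketch, again using scikit-learn's bundled Iris data so it is self-contained:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# Pearson correlation matrix over the four features
corr = iris.data.corr()

petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']
sepal_corr = corr.loc['sepal length (cm)', 'sepal width (cm)']
print(f'petal length vs petal width: {petal_corr:.2f}')  # ~0.96
print(f'sepal length vs sepal width: {sepal_corr:.2f}')  # ~-0.12
```

Petal length and width are almost perfectly correlated, while sepal length and width are essentially uncorrelated — consistent with the heatmap and with petals being the more discriminative feature pair.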