- Introduction
- Data Collection
- Data Processing
- Exploratory Data Analysis (EDA) & Statistics
- Key Visualizations
- Feature Observations
- Feature Selection
- Train & Test Data
- Logistic Regression
- Decision Tree
- Conclusion
The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper “The Use of Multiple Measurements in Taxonomic Problems” as an example of linear discriminant analysis.
The dataset was collected from Kaggle [https://www.kaggle.com/uciml/iris]. It contains 3 species — setosa, versicolor and virginica — with 50 samples each. The features are sepal length, sepal width, petal length and petal width, all measured in cm.
- Loaded the data.
# Load the Iris CSV dataset
import pandas as pd

iris_data = pd.read_csv('../data/iris.csv')
- Let’s group the data by species and do some descriptive statistics:
# Groupby Species for descriptive statistics
iris_data.groupby('species').describe().T
- The count row shows that there are 50 samples for each species.
Setosa:
- Average sepal length is 5cm
- Average sepal width is 3cm
- Average petal length is 1.5cm
- Average petal width is 0.25cm
Versicolor:
- Average sepal length is 6cm
- Average sepal width is 2.8cm
- Average petal length is 4.26cm
- Average petal width is 1.32cm
Virginica:
- Average sepal length is 6.6cm
- Average sepal width is 3cm
- Average petal length is 6cm
- Average petal width is 2cm
From the statistics above:
- Based on petal length, we can easily separate the three species: Setosa (1.5 cm), Versicolor (4.26 cm) and Virginica (6 cm).
- Based on petal width, Setosa (0.25 cm) is clearly separated from Versicolor (1.32 cm) and Virginica (2 cm).
- Sepal width looks similar for all three species — Setosa (3 cm), Versicolor (2.8 cm) and Virginica (3 cm).
- Based on sepal length, there are only small differences between the three species (5 cm, 6 cm and 6.6 cm). Since sepal width looks similar across all species, we can drop that feature.
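The per-species averages above can be reproduced programmatically. A minimal sketch — it uses scikit-learn's bundled copy of the Iris data instead of the notebook's `../data/iris.csv`, so the column names differ but the values are the same:

```python
import pandas as pd
from sklearn.datasets import load_iris

# scikit-learn ships the same 150-sample Iris data as the Kaggle CSV
iris = load_iris(as_frame=True)
df = iris.frame
df['species'] = df['target'].map(dict(enumerate(iris.target_names)))

# Per-species feature means in cm (matches the descriptive statistics above)
means = df.groupby('species')[iris.feature_names].mean().round(2)
print(means)

# Gap between the largest and smallest species mean, per feature:
# petal length spreads the species far more than sepal width does
spread = means.max() - means.min()
print(spread)
```

The `spread` series makes the feature-selection argument quantitative: petal length varies by roughly 4 cm across species, while sepal width varies by well under 1 cm.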
# Split the full dataset into 70% training and 30% testing
from sklearn.model_selection import train_test_split

train, test = train_test_split(iris_data, test_size=0.3)
print(train.shape)
print(test.shape)
# Feature subsets: petals only and sepals only
petal = iris_data[['petal_length', 'petal_width', 'species']]
sepal = iris_data[['sepal_length', 'sepal_width', 'species']]

# Petal train/test split
train_p, test_p = train_test_split(petal, test_size=0.3, random_state=0)
train_x_p = train_p[['petal_length', 'petal_width']]
train_y_p = train_p.species
test_x_p = test_p[['petal_length', 'petal_width']]
test_y_p = test_p.species

# Sepal train/test split
train_s, test_s = train_test_split(sepal, test_size=0.3, random_state=0)
train_x_s = train_s[['sepal_length', 'sepal_width']]
train_y_s = train_s.species
test_x_s = test_s[['sepal_length', 'sepal_width']]
test_y_s = test_s.species
# Logistic regression on the petal and sepal feature sets
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

model = LogisticRegression()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Logistic Regression using Petals is:', metrics.accuracy_score(test_y_p, prediction))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Logistic Regression using Sepals is:', metrics.accuracy_score(test_y_s, prediction))
- The accuracy of the Logistic Regression using Petals is: 0.9777777777777777
- The accuracy of the Logistic Regression using Sepals is: 0.8222222222222222
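A confusion matrix shows *where* the sepal-based model loses its accuracy. A minimal sketch, using scikit-learn's bundled Iris data (rather than the notebook's `iris_data` frame) so it runs standalone; the exact counts depend on the split:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

iris = load_iris(as_frame=True)
X = iris.data[['sepal length (cm)', 'sepal width (cm)']]  # sepal features only
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Rows = true species, columns = predicted species
cm = confusion_matrix(y_test, model.predict(X_test))
print(pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names))
```

Typically the off-diagonal counts concentrate in the versicolor/virginica cells — those two species overlap heavily in sepal space, which is why sepal-only accuracy trails the petal-only model.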
# Decision tree on the same feature sets
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(train_x_p, train_y_p)
prediction = model.predict(test_x_p)
print('The accuracy of the Decision Tree using Petals is:', metrics.accuracy_score(test_y_p, prediction))

model.fit(train_x_s, train_y_s)
prediction = model.predict(test_x_s)
print('The accuracy of the Decision Tree using Sepals is:', metrics.accuracy_score(test_y_s, prediction))
- The accuracy of the Decision Tree using Petals is: 0.9555555555555556
- The accuracy of the Decision Tree using Sepals is: 0.6444444444444445
- From both models, I can confirm that the petal features give higher accuracy than the sepal features.
- This is further supported by the correlation heatmap, which shows a stronger correlation between petal length and petal width than between sepal length and sepal width.
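The heatmap claim can be checked numerically. A minimal sketch, again using scikit-learn's bundled Iris data so it is self-contained:

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)

# Pearson correlation matrix over the four features
corr = iris.data.corr()

petal_corr = corr.loc['petal length (cm)', 'petal width (cm)']
sepal_corr = corr.loc['sepal length (cm)', 'sepal width (cm)']
print(f'petal length vs petal width: {petal_corr:.2f}')  # ~0.96
print(f'sepal length vs sepal width: {sepal_corr:.2f}')  # ~-0.12
```

Petal length and width are almost perfectly correlated, while sepal length and width are essentially uncorrelated — consistent with the heatmap and with petals being the more discriminative feature pair.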