This post walks through two examples built in a Google Colab notebook. The first covers the exploratory data analysis (EDA) steps needed to preprocess the data properly before building our own models. The second analyzes the distribution of the Iris data set using the available plotting libraries.
First of all, we need both data sets in our Colab workspace: data_price_cars.csv and Iris.csv. With the first data set we will perform each step of the EDA using the following commands.
import pandas as pd
df = pd.read_csv("data_price_cars.csv")
# To display the top 5 rows
df.head(5)
#Remove irrelevant columns
df = df.drop(['Engine Fuel Type', 'Market Category', 'Vehicle Style', 'Popularity', 'Number of Doors', 'Vehicle Size'], axis=1)
#Rename columns
df = df.rename(columns={"Engine HP": "HP", "Engine Cylinders": "Cylinders", "Transmission Type": "Transmission", "Driven_Wheels": "Drive Mode","highway MPG": "MPG-H", "city mpg": "MPG-C", "MSRP": "Price" })
#Remove duplicated rows
duplicate_rows_df = df[df.duplicated()]
df = df.drop_duplicates()
# Dropping the missing values.
df = df.dropna()
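Before dropping rows, it can help to count how many duplicates and missing values there are, so you know what each step removes. The following is a minimal sketch on a toy frame; the values are invented for illustration and are not taken from data_price_cars.csv.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the car data; the values are invented for illustration.
toy = pd.DataFrame({
    "Make": ["BMW", "BMW", "Audi", "Fiat"],
    "Engine HP": [335.0, 335.0, 300.0, np.nan],
    "MSRP": [46135, 46135, 40650, 29450],
})

# Rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Number of missing values in each column.
n_missing = toy.isnull().sum()

# Same two cleaning steps as above: drop duplicates, then drop incomplete rows.
cleaned = toy.drop_duplicates().dropna()

print("duplicates:", n_duplicates)
print(n_missing)
print("shape after cleaning:", cleaned.shape)
```

Comparing the shape before and after cleaning is a quick check that no more rows were lost than expected.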
Next, we detect the outliers with box plots from the seaborn library so that we can remove them from the data set.
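As a rough sketch of the detection step, a box plot of a single numeric column shows outliers as isolated points beyond the whiskers. The toy frame below is an assumption made for illustration; in the notebook you would pass a column of df such as df["Price"] instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the cleaned car data; the values are invented for illustration.
toy = pd.DataFrame({"Price": [20000, 22000, 25000, 27000, 30000, 250000]})

# By default the whiskers extend to 1.5 * IQR beyond the quartiles,
# so the last value is drawn as a separate point (an outlier candidate).
ax = sns.boxplot(x=toy["Price"])
ax.set_title("Price")

# In Colab the figure is rendered at the end of the cell;
# in a plain script, call plt.show() to display it.
```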
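We then remove them using the interquartile range (IQR) rule: any row with a numeric value more than 1.5 * IQR below the first quartile (Q1) or above the third quartile (Q3) is dropped.
#Remove outliers (quartiles are computed on the numeric columns only)
num = df.select_dtypes(include="number")
Q1, Q3 = num.quantile(0.25), num.quantile(0.75)
IQR = Q3 - Q1
df = df[~((num < (Q1 - 1.5 * IQR)) | (num > (Q3 + 1.5 * IQR))).any(axis=1)]
df.shape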
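To see which rows the IQR rule flags, the filter can be wrapped in a small helper and tried on a toy frame. The function name iqr_outlier_mask and the data are hypothetical, introduced only for this sketch.

```python
import pandas as pd

def iqr_outlier_mask(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Return True for rows with at least one numeric value outside the IQR fences."""
    num = df.select_dtypes(include="number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    outside = (num < (q1 - k * iqr)) | (num > (q3 + k * iqr))
    return outside.any(axis=1)

# Toy data, invented for illustration; one price is far above the rest.
toy = pd.DataFrame({
    "Make": ["A", "B", "C", "D", "E", "F"],
    "Price": [20000, 22000, 25000, 27000, 30000, 250000],
})

mask = iqr_outlier_mask(toy)
filtered = toy[~mask]
print(toy[mask])       # rows flagged as outliers
print(filtered.shape)  # shape after removing them
```

Here Q1 is 22750 and Q3 is 29250, so the fences sit at 13000 and 39000 and only the 250000 row falls outside them.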
Finally, we can analyze our data using histograms, heat maps or scatterplots. For example:
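The three plot types can be sketched as follows. The toy frame reuses the renamed column names (HP, MPG-H, Price), but its values are invented for illustration; in the notebook you would use the cleaned df instead.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the cleaned car data; the values are invented for illustration.
toy = pd.DataFrame({
    "Make": ["A", "A", "B", "B", "C", "C"],
    "HP": [150, 180, 200, 240, 300, 320],
    "MPG-H": [38, 35, 33, 30, 26, 24],
    "Price": [20000, 23000, 26000, 31000, 42000, 45000],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: how the horsepower values are distributed.
axes[0].hist(toy["HP"], bins=5)
axes[0].set_xlabel("HP")

# Heat map: pairwise correlation between the numeric columns.
corr = toy.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="BrBG", ax=axes[1])

# Scatter plot: horsepower against price.
axes[2].scatter(toy["HP"], toy["Price"])
axes[2].set_xlabel("HP")
axes[2].set_ylabel("Price")

fig.tight_layout()
```

The heat map is usually the quickest way to spot which numeric columns move together, while the scatter plot shows the shape of one such relationship in detail.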
In the second example, unlike the first, we will analyze the probability distributions of the Iris data using the different libraries available. Here are some examples:
General data:
Probability Distribution:
Box Plots:
Violin Plots:
Scatter Plots:
Pair Plots: