Data Visualization of the ImmoWeb Project Data Sets
The real estate company "ImmoEliza" wants to create a machine learning model to predict real estate sale prices in Belgium. This repo contains all the files used in this project, including the raw dataset, the cleaned dataset, .ipynb files with the code for both cleaning and analysis/visualization, and a presentation of the results of the analysis.
This project was a collaborative effort between four members of the Bouwman2 promotion at BeCode, Brussels, in October 2020. The team consisted of Davy Mariko, Manasa Devinoolu, Sara Silvente, and Naomi Thiru.
pip install pandas
pip install numpy
pip install more_itertools
The dataset is a collection of properties for sale across all regions of Belgium, to be used to build a machine learning model that predicts real estate sale prices for the real estate company ImmoEliza.
The data was scraped from various Belgian real-estate websites by the members of the Bouwman2 promotion of BeCode in September 2020. The raw dataset had 93068 rows and 22 columns.
The variables in this dataset are:
'source', 'hyperlink', 'locality', 'postcode', 'house_is',
'property_subtype', 'price', 'sale', 'rooms_number', 'area',
'kitchen_has', 'furnished', 'open_fire', 'terrace', 'terrace_area',
'garden', 'garden_area', 'land_surface', 'land_plot_surface',
'facades_number', 'swimming_pool_has', 'building_state'
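A minimal sketch of loading and inspecting the raw data with pandas (the file name raw_data.csv is illustrative, not the actual file in this repo):

import pandas as pd

# Load the raw scraped dataset (file name is illustrative)
data_raw = pd.read_csv('raw_data.csv')

# Check the shape (93068 rows x 22 columns) and the column names listed above
print(data_raw.shape)
print(data_raw.columns.tolist())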
Various data cleaning operations were performed on the dataset, using pandas and numpy within Jupyter Notebooks.
import numpy as np
import pandas as pd
Each of the 22 columns was processed using functions created specifically for their contents. An example of such a function, used to clean the 'open_fire' column, is as follows:
def process_open_fire_col(data_new):
    # Work on a copy of the 'open_fire' column of the working DataFrame
    dt = data_new['open_fire'].convert_dtypes()
    # Normalize the values to lowercase strings
    dt = dt.astype(str).str.lower()
    # Encode false as 0, true as 1, and missing/unknown markers as 2
    dt = dt.map({'false': 0, 'true': 1, 'nan': 2, '0': 2})
    dt = dt.fillna(2)
    return dt.astype(int)
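The processed column is then written back into the working DataFrame; a minimal usage sketch (assuming data_new holds the dataset being cleaned):

# Replace the raw column with the encoded version (0/1/2)
data_new['open_fire'] = process_open_fire_col(data_new)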
The following issues were handled as described:
Null Values and None
All null and 'None' values in the dataset were replaced with the value -999.
True/False Values
All true values were replaced with the value 1, and all false values were replaced with the value 0.
Duplicates
The dataset contained duplicate rows, which were dropped.
Blank spaces and special characters
All blank spaces and special characters were replaced with an underscore. All the values in the dataset were set to lowercase.
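A minimal sketch of these cleaning steps with pandas (assuming data_new is the working DataFrame, as in the function above; the notebooks apply these operations per column and may differ in detail):

import re
import numpy as np

# Replace nulls and literal 'None' entries with -999
data_new = data_new.replace('None', np.nan).fillna(-999)

# Encode booleans: True -> 1, False -> 0
data_new = data_new.replace({True: 1, False: 0})

# Drop duplicate rows
data_new = data_new.drop_duplicates()

# Lowercase string values and replace blanks and special characters with underscores
for col in data_new.select_dtypes(include='object').columns:
    data_new[col] = data_new[col].map(
        lambda v: re.sub(r'[^a-z0-9]+', '_', v.lower()) if isinstance(v, str) else v)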
The cleaned dataset has 43342 rows and 24 columns.
The presentation contains results of the data visualization and an interpretation of the analysis.
Using matplotlib and seaborn, the cleaned dataset was visualized to observe the correlations between the variables and the target variable.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
In order to conduct analysis on the real estate dataset, we identified the target variable as 'price', and used this to determine its correlations with the other variables in the dataset.
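A minimal sketch of this kind of correlation analysis (assuming data_clean is the cleaned DataFrame and reusing the imports above; the exact figures are in the notebook and presentation):

# Correlation of every numeric variable with the target 'price'
corr = data_clean.corr(numeric_only=True)
print(corr['price'].sort_values(ascending=False))

# Heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title("Correlations with the target variable 'price'")
plt.show()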
The interpretation of our results is outlined in the presentation file.
The dataset required a large amount of cleaning. Apart from null values, the dataset contained other unsuitable values, which were categorized as null for the sake of this analysis. External data was integrated into the dataset to provide additional location information, such as 'city' and 'region'.
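A hedged sketch of such a merge (the lookup file postcodes.csv and its column names are illustrative, not the actual external source used):

# Hypothetical lookup table mapping Belgian postcodes to city and region
locations = pd.read_csv('postcodes.csv')  # columns: postcode, city, region

# Attach 'city' and 'region' to each property via its postcode
data_new = data_new.merge(locations, on='postcode', how='left')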
Establishing the correct datatypes to ensure a smooth workflow with the data was also a challenge, and NaN values had to be handled before the data could be passed to the visualization libraries.
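A minimal sketch of that kind of handling (illustrative; the notebooks may do this differently):

# Coerce a column read as text to numeric, turning invalid entries into NaN
data_new['price'] = pd.to_numeric(data_new['price'], errors='coerce')

# Drop rows with missing values in the columns being plotted
plot_data = data_new.dropna(subset=['price', 'area'])
sns.scatterplot(data=plot_data, x='area', y='price')
plt.show()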