Skip to content

Latest commit

 

History

History
138 lines (80 loc) · 8.84 KB

README.md

File metadata and controls

138 lines (80 loc) · 8.84 KB

Real Estate: Predicting House Prices

Members

Asmaa Ahmed

Chayda Adora

Leonardo Aleixo

Saber Chowdhury

Xiao Wang

Overview of Project

This project’s objective is to analyze the housing market sector in King County, USA between May 2014 and May 2015 and create a machine learning model which is able to predict prices and visualizations to provide insights on house sales.

Project Presentation Link (Google Slides):

https://docs.google.com/presentation/d/1FSh0O_s-OfNs4f0mQAv_XbYGDBrPot1RKguuH9DE7T0/edit?usp=sharing

Tableau Dashboard Link:

https://public.tableau.com/app/profile/leonardo.aleixo/viz/Ptest/Story1

Why Supervised Machine Learning?

The housing market sector offers a variety of data where advanced regression techniques can be applied to predict price. As the data is known and the target outcome (price) is a continuous variable, Supervised Machine Learning is the appropriate approach.

OLS Regression Model:

OLS_Regression

Data Source

Name: House Prices

Origin: Kaggle – Ames Housing dataset (kc_house_data.csv)

Publication: House Prices - Advanced Regression Techniques | Kaggle

Questions that will be answered with the data:

  • Predict housing market price

  • Differences of frequencies in sale by season

  • Correlation between:

    • Property sales frequency and price

    • Property size and price

    • Year the property was built and price

    • Location and price

Machine Learning

Description of data preprocessing:

In this project, we examined the quality of the raw data, the following steps has been achieved prior to feeding the data into the model:

  • Overview of the dataset
  • Handling Null Values
  • Handling missing values & duplicate rows
  • Standardization of the numerical data while handling Categorical Variables

Importing the necessary libraries, including the following: Pandas_ For handling structured data, Scikit Learn_ For machine learning, NumPy_ For linear algebra and mathematics, Seaborn_ For data visualization, Standard Scaler-For scaling and normalizing numerical data.

For the preliminary features, all available features were used that might affect the model including the date, number of bedrooms, bathrooms, floors, etc. An additional feature was added to include the month and the year to see if this would improve the model. The id column was dropped as it affects the model accuracy since it shows a high predictive value while trying to predict the price.

Data was split into training and tests sets as follows:

  • Training: The training data consists of 21436 examples of houses with 21 features describing different aspects of the house. The training data is what is used to “teach” the models.
  • Testing: The test data set consists of 21436 examples with the same number of features as the training data. The test data set excludes the sale price because this is the dependent value (what we are trying to predict).
  • Our training and testing setup is using a 70/30 train-test split ratio.

The Supervised Machine Learning model was chosen because the target outcome is already known (house price).Three different models were run and compared against each other for the accuracy outcome: Linear Regression, Decision Tree, and Random Forest.

  1. Linear Regression returned an accuracy score of 0.68.
  2. Decision Tree returned accuracy score of 0.75.
  3. Random Forest returned accuracy score of 0.86.

regressorforest_score

regressorforest_head

One of the limitations encountered with Random Forest is getting a score of 0.53 when using only the following parameters: max_dept =2 & Random state =79; however, after adding the hyper parameters such as n_estimators of 100, criterion as mse, max_depth as 100 then the model score significantly increased to 0.86. The runtime for the model was originally 32 minutes, but was decreased to 22 minutes by changing the n_jobs parameter from the default of 1 to -1, which means all processors run parallel in the backend.

Cleaning the data:

Dataset has 21 columns, including the id for each house, and over 21K rows. There were no null values, but we did find duplicate rows by house id, which would indicate either the house was sold twice in this year or this was a duplicate entry. Found some duplicates and confirmed all feature details were the same including the closing date so these are duplicate entries. We decided to drop these rows as they only totalled 177. Also, an additional feature was created to show number of bedrooms by the home_size. Correlation heatmap indicates that home_size and price had high correlation, as well as a high correlation between home_size and above_size.

correlation_heatmap

Cleaned data is exported to csv and loaded into a PostgreSQL database.

Database ERD

  • House Database Flowchart created with QuickDBD: Raw Data Table:

ERD_multi_table

The dataset for this project is simple; it was not necessary to design the database with multiple tables. A table "condition" was created to link the descriptive values for the condition column and connected to the main table as a foreign key. The schema.sql file has code for creating an inner join to query the database and return a joined table with condition description. Example: JOIN Joined table

Database is hosted in pgAdmin; screenshot of data loaded to house_data_clean table (via SQLAlchemy): DatabaseTable

Connection String to RDS:

Connection_String

Connect cleaned house data to SQL:

Connection_to_SQL

Tableau Dashboard

Interact with the data by choosing different parameters for house size (see table below) and Rank of Units Sold (1st-70th place). tableau_dashboard

Market Share by Home Size

Displays market share of the properties between May 2014 and May 2015 by home sizes, which have been divided into groups based on size in square feet. The medium-sized houses were most popular with 42.18%of the market share, and the small and large houses taking a combined market share of 49.97%.

House Size

Sales by Quarter

Displays the total sales volume by Quarter.The 3rd quarter in 2014 has recorded the highest sales volume, hitting $3.16 billion, and then dropping to $2.50 billion in the 4th quarter. This could indicate that this is the time of year where the house prices are most competitive. Further, a Quarter on Quarter comparison of the 2nd quarter between 2014 and 2015 shows a 25% decrease in sales volume, which should be considered significant.

Sales Ranking Map

Displays the sales ranking map based on the zip code of different areas in King County. The number on each area displays the ranking in terms of the number of houses sold between 2014 and 2015. Shoreline to the north of Seattle, has bagged 3 spots – 1st, 3rd and 5th out of the top 5 ranking and averaging 574 houses sold, and Hobart is the runner-up with 587 houses sold. No. 4 is the East side of Kirkland, closing 571 deals. On the other side of the coin, Carnation and Maury Island shows the least number of housed sold, averaging slightly over a 100 in the same period.

Other visualizations in Tableau Story:

House sales by condition, closing date, grading system, location, and comparison to ML models.

Software/Tools:

Python, Pandas, Google Colab, Jupyter Notebook, PostgreSQL, Tableau, Supervised Machine Learning