Asmaa Ahmed
Chayda Adora
Leonardo Aleixo
Saber Chowdhury
Xiao Wang
This project’s objective is to analyze the housing market in King County, USA, between May 2014 and May 2015, to build a machine learning model that predicts house prices, and to create visualizations that provide insights into house sales.
https://docs.google.com/presentation/d/1FSh0O_s-OfNs4f0mQAv_XbYGDBrPot1RKguuH9DE7T0/edit?usp=sharing
https://public.tableau.com/app/profile/leonardo.aleixo/viz/Ptest/Story1
The housing market offers a variety of data to which advanced regression techniques can be applied to predict price. Because the data is labelled and the target outcome (price) is a continuous variable, supervised machine learning with regression models is the appropriate approach.
Name: House Prices
Origin: Kaggle – Ames Housing dataset (kc_house_data.csv)
Publication: House Prices - Advanced Regression Techniques | Kaggle
- Predict housing market price
- Differences in sales frequency by season
- Correlation between:
  - Property sales frequency and price
  - Property size and price
  - Year the property was built and price
  - Location and price
In this project, we examined the quality of the raw data. The following steps were completed before feeding the data into the model:
- Overview of the dataset
- Handling Null Values
- Handling missing values & duplicate rows
- Standardization of the numerical data while handling Categorical Variables
The necessary libraries were imported, including: pandas for handling structured data, scikit-learn for machine learning, NumPy for linear algebra and mathematics, Seaborn for data visualization, and StandardScaler (part of scikit-learn) for scaling and normalizing numerical data.
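The checks listed above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's notebook; the column names (price, bedrooms, home_size) and all values are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up sample standing in for the raw data (values are illustrative only).
df = pd.DataFrame({
    "price": [300000.0, 450000.0, 700000.0, 520000.0],
    "bedrooms": [2, 3, 4, 3],
    "home_size": [900, 1500, 2600, 1800],
})

# Overview of the dataset: number of rows/columns and column types.
n_rows, n_cols = df.shape
column_types = df.dtypes

# Null values per column; a total of zero means nothing needs imputing.
null_counts = df.isnull().sum()

# Standardize the numeric predictors to zero mean and unit variance.
numeric_cols = ["bedrooms", "home_size"]
scaled = StandardScaler().fit_transform(df[numeric_cols])
```

In practice the scaler would normally be fitted on the training portion only and then applied to the test portion, so that no information from the test set leaks into training.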
For the preliminary features, all available features that might affect the model were used, including the date, number of bedrooms, bathrooms, floors, etc. An additional feature was derived from the date to capture the month and the year of sale, to see if this would improve the model. The id column was dropped because it is only an identifier: it showed a high apparent predictive value for price, which distorts the model's accuracy.
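A minimal sketch of this feature step is shown below, assuming the sale date is stored as a string. The date format, the new column names (sale_year, sale_month) and the sample rows are assumptions for illustration, not taken from the project code.

```python
import pandas as pd

# Made-up rows; the date format is an assumption about the raw file.
df = pd.DataFrame({
    "id": [1, 2],
    "date": ["20141013T000000", "20141209T000000"],
    "price": [250000.0, 540000.0],
    "bedrooms": [3, 3],
})

# Parse the sale date, then derive the year and month as separate features.
df["date"] = pd.to_datetime(df["date"], format="%Y%m%dT%H%M%S")
df["sale_year"] = df["date"].dt.year
df["sale_month"] = df["date"].dt.month

# Drop the identifier (and the raw date) from the predictors; price is the target.
features = df.drop(columns=["id", "date", "price"])
target = df["price"]
```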
Data was split into training and test sets as follows:
- Training: The training data is drawn from the 21436 cleaned examples of houses, each with 21 columns describing different aspects of the house. The training data is what is used to “teach” the models.
- Testing: The test data is drawn from the same 21436 examples and has the same features as the training data. The sale price is withheld from the models at prediction time because it is the dependent value (what we are trying to predict); it is used only to score the predictions.
- Our training and testing setup uses a 70/30 train-test split ratio, so 21436 is the combined size of both sets, not the size of each.
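A 70/30 split of this kind can be sketched with scikit-learn's train_test_split. The synthetic data and the random_state value below are assumptions for the example only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and the price target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)

# test_size=0.3 holds out 30% of the rows; the remaining 70% are used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=79
)
```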
Supervised machine learning was chosen because the target outcome (house price) is already known. Three different models were run and their scores compared against each other: Linear Regression, Decision Tree, and Random Forest. Because price is continuous, these scores measure goodness of fit rather than classification accuracy.
- Linear Regression returned a score of 0.68.
- Decision Tree returned a score of 0.75.
- Random Forest returned a score of 0.86.
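The comparison above could be reproduced in outline as follows. This is a sketch on synthetic data, so its scores have no relation to the figures reported here. It assumes the scores come from each estimator's score method, which for scikit-learn regressors returns the R² coefficient of determination; the README does not state which metric was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a mildly non-linear target (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 50 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=5, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=79
)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=79),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=79),
}

# Fit each model on the training set and record its R^2 on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
```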
One limitation encountered with Random Forest was a score of only 0.53 when using just max_depth=2 and random_state=79. After setting further hyperparameters (n_estimators=100, criterion="mse", max_depth=100), the model score increased significantly to 0.86. The runtime for the model was originally 32 minutes, but was reduced to 22 minutes by changing the n_jobs parameter from its default (a single job) to -1, which runs the work in parallel on all available processors.
The dataset has 21 columns, including the id for each house, and over 21K rows. There were no null values, but we did find rows that share a house id, which indicates either that the house was sold twice during the year or that the row is a duplicate entry. We confirmed that all feature details, including the closing date, were identical for these rows, so they are duplicate entries. We decided to drop them as they only totalled 177. An additional feature was also created to show the number of bedrooms relative to home_size. The correlation heatmap indicates a high correlation between home_size and price, as well as between home_size and above_size.
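The duplicate check, the extra feature, and the correlation matrix behind the heatmap might look like the sketch below. The sample rows and the feature name bedrooms_per_size are hypothetical; only home_size and price are names used in this README.

```python
import pandas as pd

# Made-up rows; id 2 appears twice with identical details.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "date": ["2014-05-02", "2014-06-10", "2014-06-10", "2014-07-01", "2015-01-15"],
    "price": [300000.0, 450000.0, 450000.0, 700000.0, 520000.0],
    "bedrooms": [2, 3, 3, 4, 3],
    "home_size": [900, 1500, 1500, 2600, 1800],
})

# Rows whose id appears more than once: either a resale or a duplicate entry.
repeated_ids = df[df.duplicated(subset="id", keep=False)]

# Rows identical in every column are true duplicates; keep the first copy only.
df_clean = df[~df.duplicated(keep="first")].reset_index(drop=True)

# Extra feature: bedrooms relative to the size of the home.
df_clean["bedrooms_per_size"] = df_clean["bedrooms"] / df_clean["home_size"]

# Pairwise Pearson correlations, the matrix a heatmap would be drawn from.
corr = df_clean[["price", "home_size", "bedrooms"]].corr()
```

Passing corr to seaborn.heatmap would render the matrix as a heatmap.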
The cleaned data is exported to CSV and loaded into a PostgreSQL database.
- House Database Flowchart created with QuickDBD: Raw Data Table:
The dataset for this project is simple; it was not necessary to design the database with multiple tables. A table "condition" was created to link the descriptive values for the condition column and connected to the main table as a foreign key. The schema.sql file has code for creating an inner join to query the database and return a joined table with condition description. Example:
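A join of this kind might look like the following sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for PostgreSQL, and the column names (condition_id, description), the condition labels, and the sample rows are hypothetical rather than taken from schema.sql.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE condition (
    condition_id INTEGER PRIMARY KEY,
    description  TEXT NOT NULL
);
CREATE TABLE house_data_clean (
    id           INTEGER PRIMARY KEY,
    price        REAL,
    condition_id INTEGER REFERENCES condition (condition_id)
);
INSERT INTO condition VALUES (3, 'Average'), (4, 'Good');
INSERT INTO house_data_clean VALUES (1, 300000, 3), (2, 450000, 4), (3, 700000, 3);
""")

# Inner join: each house row is returned with its condition description.
rows = conn.execute("""
SELECT h.id, h.price, c.description
FROM house_data_clean AS h
INNER JOIN condition AS c ON h.condition_id = c.condition_id
ORDER BY h.id;
""").fetchall()
conn.close()
```

Because it is an inner join, any house whose condition value has no matching row in the condition table would be left out of the result.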
The database is managed through pgAdmin; screenshot of data loaded to the house_data_clean table (via SQLAlchemy):
Interact with the data by choosing different parameters for house size (see table below) and Rank of Units Sold (1st-70th place).
Displays the market share of properties sold between May 2014 and May 2015 by home size, with houses divided into groups based on size in square feet. Medium-sized houses were the most popular with 42.18% of the market share, while small and large houses took a combined market share of 49.97%.
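A grouping like this can be approximated in pandas as sketched below. The bin edges, group labels, and sample sizes are hypothetical, since the actual size groups are defined in the dashboard, and the sketch assumes market share means share of the number of sales.

```python
import pandas as pd

# Made-up home sizes in square feet.
sizes = pd.Series([800, 1200, 1600, 1900, 2400, 3100, 4200, 1500], name="home_size")

# Hypothetical bin edges; each interval includes its upper edge.
bins = [0, 1000, 2000, 3000, float("inf")]
labels = ["Small", "Medium", "Large", "Extra large"]
size_group = pd.cut(sizes, bins=bins, labels=labels)

# Share of sales in each group, as a percentage of all sales.
market_share = size_group.value_counts(normalize=True).mul(100).round(2)
```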
Displays the total sales volume by quarter. The 3rd quarter of 2014 recorded the highest sales volume at $3.16 billion, which then dropped to $2.50 billion in the 4th quarter. This could indicate that this is the time of year when house prices are most competitive. Further, a year-on-year comparison of the 2nd quarter of 2014 with the 2nd quarter of 2015 shows a 25% decrease in sales volume, which should be considered significant.
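The quarterly totals and the same-quarter comparison can be computed along these lines. The sales rows are made up, so the resulting figures are illustrative and unrelated to the dashboard values.

```python
import pandas as pd

# Made-up sales; each row is one closed sale.
sales = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-05-10", "2014-06-20", "2014-08-01", "2014-11-15", "2015-04-03"]
    ),
    "price": [400000.0, 600000.0, 500000.0, 300000.0, 800000.0],
})

# Total sales volume per calendar quarter.
quarter = sales["date"].dt.to_period("Q")
volume = sales.groupby(quarter)["price"].sum()

# Percentage change between the same quarter in consecutive years.
q2_2014 = volume[pd.Period("2014Q2", freq="Q")]
q2_2015 = volume[pd.Period("2015Q2", freq="Q")]
pct_change = (q2_2015 - q2_2014) / q2_2014 * 100
```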
Displays the sales ranking map based on the zip codes of different areas in King County. The number on each area is its rank by number of houses sold between 2014 and 2015. Shoreline, to the north of Seattle, takes 3 of the top 5 spots (1st, 3rd and 5th), averaging 574 houses sold, and Hobart is the runner-up with 587 houses sold. No. 4 is the east side of Kirkland, with 571 sales. At the other end of the ranking, Carnation and Maury Island show the fewest houses sold, averaging slightly over 100 in the same period.
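A ranking of this kind reduces to counting sales per zip code and ranking the counts. The sketch below uses placeholder zip labels (Z1, Z2, Z3) and made-up rows.

```python
import pandas as pd

# Made-up sales, one row per house sold; zip labels are placeholders.
sales = pd.DataFrame({"zipcode": ["Z1", "Z2", "Z1", "Z3", "Z1", "Z2"]})

# Number of houses sold in each zip code.
units_sold = sales.groupby("zipcode").size().rename("units_sold")

# Rank 1 is the zip code with the most sales; ties share the lowest rank.
ranking = units_sold.rank(method="min", ascending=False).astype(int).rename("rank")

table = pd.concat([units_sold, ranking], axis=1).sort_values("rank")
```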
House sales by condition, closing date, grading system, location, and comparison to ML models.
Python, Pandas, Google Colab, Jupyter Notebook, PostgreSQL, Tableau, Supervised Machine Learning