Skip to content

Determining the best model to predict house prices in Kings County, Seattle.

Notifications You must be signed in to change notification settings

Yeshi341/Kings-County-House-Prices-Prediction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Kings County Housing Bake-off


This project conducts an analysis on the Seattle, Washington, Kings County Houses attributes data to create a model that will make the best prediction of the price of a house. House prices depend on a lot of factors like the house characteristics, economic environment, housing market, neighborhood, neighboring houses, proximity to amenities etc. According to an article on Opendoor the depending variables can be boiled down to eight critical factors; neighborhood comps, location, home size and usable space, age and condition, upgrades and updates, local market, economic indicators and interest rates.The project will use intuition, statistics and python to build the model.

Data

The data used in this project is based on the following 21 key descriptions of houses sold in Kings County of Seattle, Washington. Data can be found on kaggle at this link

  • id - unique ID for a house
  • date - Date day house was sold
  • price - Price is the prediction target
  • bedrooms - Number of bedrooms
  • bathrooms - Number of bathrooms
  • sqft_living - square footage of the home
  • sqft_lot - square footage of the lot
  • floors - Total floors (levels) in house
  • waterfront - Whether house has a view to a waterfront
  • view - Number of times house has been viewed
  • condition - How good the condition is (overall)
  • grade - overall grade given to the housing unit, based on King County grading system
  • sqft_above - square footage of house (apart from basement)
  • sqft_basement - square footage of the basement
  • yr_built - Year when house was built
  • yr_renovated - Year when house was renovated
  • zipcode - zip code in which house is located
  • lat - Latitude coordinate
  • long - Longitude coordinate
  • sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
  • sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

The training dataset on which the model will be built has information on 17290 houses in the Kings County. Here is a heatmap visualization, that tells you the degree of correlation between the variables. The darker the red the stronger the relationship and darker the blue the lesser the relationship. A strong correlation between the features and the target variable price is good while a strong relationship of one feature with another is not.

Heatmap

Method

The model building process will follow the steps below:

    1. Exploratory Analysis: Understanding the data before model creation is important. Exploratory Data Analysis or EDA is done by looking at the distribution of data for each of the variables, checking the kind of values present in the dataset. Identifying any variable relationship is also performed here. Much can be learned by also performing descriptive analysis on the data.
    1. Data Cleaning: Any issues spotted during the EDA process is addressed at this step. Issues like missing values, changing data types etc
    1. Feature Engineering: The independent variables can be used to build new features that can explain the target variable better and help in builing better models. Feature Engineering can be achieved by manually creating features based on intuition or through automated processes using sklearn.
    1. Train-Test Split: Train-Test Split is an important step which will reduce the risk of overfitting the model to the dataset. By splitting it to a train and test test, the models can be built using the train set and applied to the test set to predict and test the model strength.
    1. Feature Selection: While running models, selecting features is an important aspect that differentiates one model from the other. Feature Selection can be achieved by manaual excluding or including features, as well as using various feature selection option like the filter method, wrapper method or the Embedded Method.
    1. Model Evaluation: Model evaluation can be done on the bases of the magnitude of the Root Mean Squared Error. The smaller the RMSE, the better the model
    1. Final Model: The final model will be used to predict the target variable or house prices on the hold out data.

Results

Four models were created selecting various feature subsets. The RMSE scores from the test predictions for each of the models were

Model RMSE Score
Baseline Model 159552.87
Log Baseline Model 655070.65
Manual Feat Model 195675.71
KBest Model 189733.51
RFE 158484.36

Conclusion

Based on the results, the RFE or Recursive Feature Elimination model performed the best with the least RMSE score.

For more Information


You can view the full analysis in for this project in the Jupyter Notebook

Repository Structure


├── README.md                           <- The top-level README for this project.
│
├── data                                <- Data folder
│   ├── housing_preds_Lhamu.csv         <- Project predictions on holdout set
│   ├── kc_house_data_test_features.csv <- Project holdout dataset on which predictions are made
│   └── kc_house_data_train.csv         <- Project holdout dataset on which models are trained.
│
├── notebooks                           <- Jupyter Notebooks
│   ├── Bakeoff_modeling_process.ipynb  <- Project analysis explained with documentation
│   ├── KC.ipynb                        <- 
│   └── Predict_holdout.ipynb           <- Project prediction process with final model
│
├── images                              <- Visuals created in analysis and images used in repo 
│
└── pickles                             <- pickled objects
    ├── lm2.pkl                         <- Linear Regression model
    ├── model.pickle                    <- Best performing model selected to project predictions
    ├── model_features.pickle           <- Set of features selected for the best performing model
    └── scaler.pkl                      <- Scaler object

About

Determining the best model to predict house prices in Kings County, Seattle.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published