enescatagan/house-price-prediction


house-price-prediction

This project aims to develop a Machine Learning model to predict California housing prices. The model predicts the median housing price of a district, helping determine whether investing in that area is worthwhile.

Project Organization

├── data
│   └── housing.csv             <- Dataset from Kaggle.
├── images                      <- Images for visualization.
├── models                      <- Trained models.
├── notebooks
│   ├── preparation_notebooks   <- Notebooks needed to produce the model: data preparation, pipeline creation, parameter tuning, etc.
│   ├── testing_notebooks       <- Scratch notebooks for quick tests and experiments.
│   └── main.ipynb              <- Main notebook.
├── requirements.txt            <- Requirements file, generated with `pip freeze > requirements.txt`.
└── Readme.md                   <- Project explanation, notes, etc.

Project Objectives

  • Analyze and preprocess the dataset.
  • Train different regression models and compare their performance.
  • Select the 3 to 5 best-performing models and tune their hyperparameters.
  • Select the 2 or 3 best-performing tuned models, combine them into an ensemble, and compare the results.
  • Choose the best model among these.
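
The comparison step in the objectives above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the data is synthetic (`make_regression` stands in for the prepared housing features), the four candidate models are an arbitrary sample, and the helper name `cv_rmse` is hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Assumption: synthetic data standing in for the preprocessed housing dataset.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Assumption: an illustrative subset of candidate regressors.
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=42),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=42),
}


def cv_rmse(model, X, y, folds=5):
    """Return the mean cross-validated RMSE (lower is better)."""
    scores = cross_val_score(
        model, X, y, scoring="neg_root_mean_squared_error", cv=folds
    )
    # scikit-learn reports the negated RMSE, so flip the sign back.
    return -scores.mean()


results = {name: cv_rmse(model, X, y) for name, model in candidates.items()}

# Rank models from lowest to highest RMSE and keep the best ones for tuning.
ranking = sorted(results, key=results.get)
top_models = ranking[:2]

for name in ranking:
    print(f"{name}: RMSE = {results[name]:.2f}")
```

Ranking by cross-validated RMSE rather than a single train/test split makes the comparison less sensitive to how the data happens to be divided.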

Design and Implementation Details

  • Supervised Learning: The model is trained with labeled examples.
  • Regression Task: The model predicts a continuous value, the median house price.
  • Data Preprocessing: The dataset is prepared by handling missing data, treating outliers, and applying feature engineering, transformation, and extraction steps.
  • Model Selection: 14 different regression models are trained and their performance is compared.
  • Hyperparameter Tuning: The hyperparameters of the 5 best-performing models are tuned.
  • Final Model Selection: After ensembling GradientBoostingRegressor and LGBMRegressor and comparing the results, LGBMRegressor was chosen as the final model.

Results and Improvement Recommendations

  • RMSE values for the LGBMRegressor model on the training, test, and validation sets are reported in main.ipynb.
    • The model overfits slightly, but the test and validation scores are acceptable.
  • Data augmentation and further hyperparameter tuning are recommended to improve the model.
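
For reference, RMSE and a simple overfitting check can be computed as in the sketch below. The function names `rmse` and `overfit_ratio` are hypothetical, and the numbers are made up for illustration; they are not the project's results, which are in main.ipynb.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = ((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sum(squared_errors) / len(y_true))


def overfit_ratio(train_rmse, holdout_rmse):
    """Ratio of held-out RMSE to training RMSE.

    A value near 1 suggests little overfitting; a much larger value means
    the model fits the training data far better than unseen data.
    """
    return holdout_rmse / train_rmse


# Made-up values for illustration only.
y_train_true = [100.0, 200.0, 300.0]
y_train_pred = [110.0, 190.0, 300.0]
y_test_true = [150.0, 250.0, 350.0]
y_test_pred = [165.0, 240.0, 335.0]

train_score = rmse(y_train_true, y_train_pred)
test_score = rmse(y_test_true, y_test_pred)
print(f"train RMSE: {train_score:.2f}")
print(f"test RMSE: {test_score:.2f}")
print(f"overfit ratio: {overfit_ratio(train_score, test_score):.2f}")
```

Because RMSE is expressed in the same units as the target, the gap between training and test scores can be read directly as a difference in typical price error.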

How to Use

  1. Navigate to the project directory.
  2. Install the necessary dependencies by running `pip install -r requirements.txt`.
  3. Open the notebooks directory, then open main.ipynb and run it. This runs the whole project and can take some time.