
This repository contains a project I completed for an NTU course titled CB4247 Statistics & Computational Inference to Big Data. In this project, I applied regression and machine learning techniques to predict house prices in India.


nlawira/india-house-rent-prediction


India's House Prices Prediction

Preface

This self-initiated project improves on the project I submitted for the CB4247 Statistics & Computational Inference to Big Data module at NTU. After my initial submission was graded, I sought my lecturer's feedback and incorporated it into this version. This repository contains the project's code and report and showcases my machine learning, data analysis, pre-processing, Python, and report-writing skills.

Project Overview

This self-initiated project aims to:

  • Apply data analysis and visualization techniques to analyze a real-world dataset.
  • Train machine learning algorithms on the chosen dataset, including Ordinary Least Squares Regression, Random Forest Regressor, and XGBoost Regressor.
  • Conduct ANOVA and residual analysis to test the validity of Ordinary Least Squares (OLS) regression assumptions.
  • Evaluate each algorithm's performance via metrics, including mean absolute error, root mean squared error, and R².
  • Identify critical variables via feature importance.
  • Develop a robust and accurate model by combining high-performing algorithms.
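The three evaluation metrics named above can be sketched in plain Python. This is a minimal illustration, not the code used in the project: the function names and the sample rent values are assumptions made up for the example.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2_score(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical values for illustration only (not taken from the dataset).
actual = [10000, 20000, 30000, 40000]
predicted = [12000, 18000, 33000, 39000]

print(mean_absolute_error(actual, predicted))
print(root_mean_squared_error(actual, predicted))
print(r2_score(actual, predicted))
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of how much a few badly predicted listings contribute to the overall error.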

This project used a dataset containing house prices in India for OLS regression analysis and for training various machine learning models. The dataset is available on Kaggle. Before regression and training, I pre-processed the data and conducted a thorough preliminary analysis, removing outliers to reduce noise and engineering features to preserve the information in the dataset. Afterward, OLS regression was performed on the dataset, followed by residual analysis to verify the assumptions made by OLS. Next, various machine learning models were trained on the dataset. The best-performing models were selected, and feature importance analysis was conducted on them to determine whether feature selection could improve their performance. Finally, the top-performing models were combined, and the combined model was given a final test on the dataset.
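One simple way to combine several regressors is to average their predictions with equal weights. The sketch below assumes that approach purely for illustration. The README does not state how the models were combined, and the helper name, model labels, and numbers are all hypothetical.

```python
def average_predictions(model_predictions):
    """Equal-weight average of per-sample predictions from several models.

    model_predictions maps a model label to its list of predictions;
    every list must cover the same samples in the same order.
    """
    n_models = len(model_predictions)
    return [
        sum(sample) / n_models
        for sample in zip(*model_predictions.values())
    ]


# Hypothetical predictions for two listings (not real model output).
predictions = {
    "catboost": [15000.0, 42000.0],
    "xgboost": [17000.0, 40000.0],
    "random_forest": [16000.0, 44000.0],
}

combined = average_predictions(predictions)
print(combined)
```

Averaging tends to smooth out the individual models' errors when those errors are not strongly correlated. Weighted averaging or stacking are common alternatives.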

The graphs below are taken from the analysis I conducted in this project.

image Figure 1 Distribution graphs of Rent

image Figure 2 Heatmap of numerical variables' correlations

image Figure 3 Residual analysis of the OLS regression model

image Figure 4 Root mean squared errors of trained machine learning models

image Figure 5 Feature importance graph of the CatBoost model

Conclusion

The OLS regression analysis reveals that an OLS linear model does not fit the dataset well, with an R² of 56.06% and an RMSE of 3.54×10⁴. The preliminary and residual analyses show why: the dataset exhibits non-linearity and contains outliers. Among the machine learning models, the CatBoost, Gradient Boosting, LightGBM, Random Forest, and XGBoost regressors performed best, with high R² values and low errors. However, the feature importance analysis conducted on these models showed that feature selection did not improve their performance. This conclusion was further supported by the adjusted R² values, which decreased after feature selection. Nonetheless, these five models were combined and tested on the original data, yielding a reasonably good result with an R² of 77.54% and an RMSE of 2.53×10⁴. Finally, the report gives recommendations for future projects to improve the performance of machine learning models trained on this dataset.
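The role of adjusted R² in the feature selection result can be illustrated with the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of samples and p the number of features. The sample counts and R² values below are hypothetical and are not the project's figures.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R²: R² penalized for the number of features used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical scenario: dropping 5 of 10 features lowers R² from 0.80 to 0.78.
before = adjusted_r2(0.80, n_samples=101, n_features=10)
after = adjusted_r2(0.78, n_samples=101, n_features=5)

print(before, after)
```

Because adjusted R² rewards a model for using fewer features, a value that still falls after feature selection suggests the removed features carried useful signal.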

License

Protected under the MIT License. See LICENSE for more information.

Contact me

Thank you so much for visiting my repository! I sincerely hope my project gives you some insight into regression analysis, machine learning models, and report writing! 😄 If you would like me to explain my project further, or want to contact me for any other reason, you can email me or connect with me on LinkedIn using the links below!

LinkedIn Email
