This project predicts housing prices using several regression models. The dataset describes houses by geographical location, median age, number of rooms, population, and other features.
The dataset contains the following columns:
- `longitude`
- `latitude`
- `housing_median_age`
- `total_rooms`
- `total_bedrooms`
- `population`
- `households`
- `median_income`
- `median_house_value`
- `ocean_proximity`
Data Preprocessing:
- Handling Missing Values:
  - Dropped rows with missing values in the `total_bedrooms` column.
- Log Transformation:
  - Applied log transformation to `total_rooms`, `total_bedrooms`, `population`, and `households` to reduce skewness.
- One-Hot Encoding:
  - Converted the categorical `ocean_proximity` column to dummy variables.
- Feature Scaling:
  - Scaled the features using `StandardScaler`.
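
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny DataFrame holds made-up values, `np.log1p` is an assumed choice of log transform (the steps only say "log transformation"), and excluding `median_house_value` from scaling is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the housing data (values are made up).
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, 1106.0, np.nan, 235.0],
    "population": [322.0, 2401.0, 496.0, 558.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "median_income": [8.3, 8.3, 7.3, 5.6],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
    "median_house_value": [452600.0, 358500.0, 352100.0, 341300.0],
})

# 1. Drop rows with a missing total_bedrooms value.
df = df.dropna(subset=["total_bedrooms"]).copy()

# 2. Log-transform the skewed count columns.
#    log1p computes log(1 + x); using it instead of a plain log is an assumption.
skewed_cols = ["total_rooms", "total_bedrooms", "population", "households"]
df[skewed_cols] = np.log1p(df[skewed_cols])

# 3. One-hot encode the categorical column into dummy variables.
df = pd.get_dummies(df, columns=["ocean_proximity"], dtype=float)

# 4. Separate the target and standardize the features (zero mean, unit variance).
X = df.drop(columns=["median_house_value"])
y = df["median_house_value"]
X_scaled = StandardScaler().fit_transform(X)
```

In a train/test workflow, the scaler would usually be fitted on the training split only and then applied to the test split, so that test-set statistics do not leak into training.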
Exploratory Data Analysis:
- Plotted histograms for numerical columns before and after log transformation.
- Created a heatmap to visualize the correlation between features.
- Used scatter plots to understand geographical distribution.
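
One possible way to produce the three kinds of plots listed above is sketched below. The data is randomly generated for illustration only, plain matplotlib is an assumed plotting library (the project may have used something else for the heatmap), and colouring the scatter plot by `median_house_value` is likewise an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data with a right-skewed total_rooms column (illustrative only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "longitude": rng.uniform(-124, -114, n),
    "latitude": rng.uniform(32, 42, n),
    "total_rooms": rng.lognormal(7.5, 0.8, n),
    "median_income": rng.lognormal(1.2, 0.4, n),
})
df["median_house_value"] = 40000 * df["median_income"] + rng.normal(0, 20000, n)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histograms of one skewed column before and after the log transform.
axes[0, 0].hist(df["total_rooms"], bins=30)
axes[0, 0].set_title("total_rooms (raw)")
axes[0, 1].hist(np.log1p(df["total_rooms"]), bins=30)
axes[0, 1].set_title("total_rooms (log1p)")

# Correlation heatmap drawn from the pairwise correlation matrix.
corr = df.corr()
im = axes[1, 0].imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
axes[1, 0].set_xticks(range(len(corr.columns)))
axes[1, 0].set_xticklabels(corr.columns, rotation=90)
axes[1, 0].set_yticks(range(len(corr.columns)))
axes[1, 0].set_yticklabels(corr.columns)
fig.colorbar(im, ax=axes[1, 0])

# Geographical scatter plot, coloured by house value.
sc = axes[1, 1].scatter(df["longitude"], df["latitude"],
                        c=df["median_house_value"], s=8, cmap="viridis")
axes[1, 1].set_xlabel("longitude")
axes[1, 1].set_ylabel("latitude")
fig.colorbar(sc, ax=axes[1, 1], label="median_house_value")

fig.tight_layout()
```

Calling `fig.savefig(...)` with a path of your choice, or `plt.show()` under an interactive backend, renders the figure.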
Models:
- Linear Regression
- Ridge Regression
- Lasso Regression
- Random Forest Regressor
- Gradient Boosting Regressor
- XGBoost Regressor
- Decision Tree Regressor
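
As a rough sketch, the models listed above could be set up and fitted with scikit-learn as shown here. The synthetic data from `make_regression`, the 80/20 split, and all hyperparameters are assumptions made for illustration, not the settings used in this project. XGBoost is a separate package, so it is added only if it is installed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the preprocessed housing features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hyperparameters here are illustrative defaults, not tuned values.
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=1.0),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    "Decision Tree": DecisionTreeRegressor(random_state=0),
}

# XGBoost is optional: include it only when the package is available.
try:
    from xgboost import XGBRegressor
    models["XGBoost"] = XGBRegressor(random_state=0)
except ImportError:
    pass

# fit() returns the estimator itself, so this maps each name to a fitted model.
fitted = {name: model.fit(X_train, y_train) for name, model in models.items()}
```

Keeping the models in a dictionary makes it easy to loop over them later and compute the same metrics for each one.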
The models were evaluated using Root Mean Squared Error (RMSE) and R² score.
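
For reference, both metrics can be computed by hand as in the sketch below. The helper names `rmse` and `r_squared` are hypothetical; `sklearn.metrics` offers `mean_squared_error` and `r2_score` for the same purpose. RMSE is expressed in the units of the target (here, house value), while R² is the fraction of the target's variance that the model explains.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Small worked example with made-up numbers.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print(round(rmse(y_true, y_pred), 4))       # 0.7071
print(round(r_squared(y_true, y_pred), 3))  # 0.9
```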
| Model | RMSE | R² Score |
|---|---|---|
| Linear Regression | 67775.13 | 0.664 |
| Ridge Regression | 67742.03 | 0.664 |
| Lasso Regression | 67699.92 | 0.665 |
| Random Forest | 48516.43 | 0.828 |
| Gradient Boosting | 56816.72 | 0.764 |
| XGBoost | 49541.45 | 0.821 |
| Decision Tree | 67766.99 | 0.664 |
Random Forest Regressor performed best, with the lowest RMSE (48516.43) and the highest R² score (0.828). XGBoost came a close second (RMSE 49541.45, R² 0.821).