- Table of Contents
⚠️ Frameworks and Libraries- 📁 Datasets
- ℹ️ Source
- 📊 Data description
- 📈 Visualising data
- 📖 Data Preprocessing
- 🔗 Download
- 🔑 Prerequisites
- 🚀 Installation
- 💡 How to Run
- 📂 Directory Tree
- 👏 And it's done!
- 🙋 Citation
- ❤️ Owner
- 👀 License
- SKLearn: Simple and efficient tools for predictive data analysis
- Joblibs: Joblib is optimized to be fast and robust on large data in particular and has specific optimizations for numpy arrays.
- Matplotlib : Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
- Numpy: Caffe-based Single Shot-Multibox Detector (SSD) model used to detect faces
- Pandas: pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
This dataset is a modified version of the California Housing dataset available from Luís Torgo’s page (University of Porto). Luís Torgo obtained it from the StatLib repository (which is closed now). The dataset may also be downloaded from StatLib mirrors.
This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).
longitude latitude housing_median_age total_rooms total_bedrooms \
0 -122.23 37.88 41.0 880.0 129.0
1 -122.22 37.86 21.0 7099.0 1106.0
2 -122.24 37.85 52.0 1467.0 190.0
3 -122.25 37.85 52.0 1274.0 235.0
4 -122.25 37.85 52.0 1627.0 280.0
population households median_income median_house_value ocean_proximity
0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 496.0 177.0 7.2574 352100.0 NEAR BAY
3 558.0 219.0 5.6431 341300.0 NEAR BAY
4 565.0 259.0 3.8462 342200.0 NEAR BAY
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
longitude latitude housing_median_age total_rooms \
count 20640.000000 20640.000000 20640.000000 20640.000000
mean -119.569704 35.631861 28.639486 2635.763081
std 2.003532 2.135952 12.585558 2181.615252
min -124.350000 32.540000 1.000000 2.000000
25% -121.800000 33.930000 18.000000 1447.750000
50% -118.490000 34.260000 29.000000 2127.000000
75% -118.010000 37.710000 37.000000 3148.000000
max -114.310000 41.950000 52.000000 39320.000000
total_bedrooms population households median_income \
count 20433.000000 20640.000000 20640.000000 20640.000000
mean 537.870553 1425.476744 499.539680 3.870671
std 421.385070 1132.462122 382.329753 1.899822
min 1.000000 3.000000 1.000000 0.499900
25% 296.000000 787.000000 280.000000 2.563400
50% 435.000000 1166.000000 409.000000 3.534800
75% 647.000000 1725.000000 605.000000 4.743250
max 6445.000000 35682.000000 6082.000000 15.000100
median_house_value
count 20640.000000
mean 206855.816909
std 115395.615874
min 14999.000000
25% 119600.000000
50% 179700.000000
75% 264725.000000
max 500001.000000
A typical performance measure for a regression problems is the Root Mean Square Error(RMSE). It gives an idea of how much error the system typically makes in it’s prediction, with a higher weight for large errors.
Equation of RMSE :
Data pre-processing is an important step for the creation of a machine learning model. Initially, data may not be clean or in the required format for the model which can cause misleading outcomes. In pre-processing of data, we transform data into our required format. It is used to deal with noises, duplicates, and missing values of the dataset. Data pre-processing has the activities like importing datasets, splitting datasets, attribute scaling, etc. Preprocessing of data is required for improving the accuracy of the model.
The dataset is now available here !
All the dependencies and required libraries are included in the file requirements.txt
See here
The Code is written in Python 3.7. If you don’t have Python installed you can find it here. If you are using a lower version of Python you can upgrade using the pip package, ensuring you have the latest version of pip. To install the required packages and libraries, run this command in the project directory after cloning the repository:
- Clone the repo
git clone https://github.com/Chaganti-Reddy/House-Price-Prediction.git
- Change your directory to the cloned repo
cd House-Price-Prediction
- Now, run the following command in your Terminal/Command Prompt to install the libraries required
python3 -m virtualenv my_env
source my_env/bin/activate
pip3 install -r requirements.txt
- Open terminal. Go into the cloned project directory and type the following command:
python3 Housing.py
├── datasets
│ └── housing
│ ├── housing.csv
│ └── housing.tgz
├── Housing.ipynb
├── images
│ ├── 1.jpeg
│ ├── attribute_histogram_plots.png
│ ├── bad_visualization_plot.png
│ ├── better_visualization_plot.png
│ ├── california_housing_prices_plot.png
│ ├── california.png
│ ├── housing_prices_scatterplot.png
│ ├── income_vs_house_value_scatterplot.png
│ └── scatter_matrix_plot.png
├── my_model.pkl
├── Readme.md
└── requirements.txt
Feel free to mail me for any doubts/query :email: chagantivenkataramireddy1@gmail.com
You are allowed to cite any part of the code or our dataset. You can use it in your Research Work or Project. Remember to provide credit to the Maintainer Chaganti Reddy by mentioning a link to this repository and her GitHub Profile.
Follow this format:
- Author's name - Chaganti Reddy
- Date of publication or update in parentheses.
- Title or description of document.
- URL.
Made with ❤️ by Chaganti Reddy
MIT © Chaganti Reddy