This project predicts NYC property values from their distance to train stations and from historical housing sales data covering 2003 to 2015. Starting from a dataset of over 900,000 sales records, it focuses on single-family dwellings (Building Code A) and combines regression models with geographic data to estimate future property values.
- Data cleaning and preprocessing pipeline for housing sales data.
- Integration of geographic data using the Nominatim API to retrieve location details.
- Machine learning models for property value prediction (Linear Regression, Random Forest).
- Feature engineering with building code, price change, and other key attributes.
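The station-distance feature can be illustrated with a great-circle calculation. The sketch below is an assumption about how such a distance might be computed, not the project's actual code: the function names `haversine_m` and `nearest_station_distance_m` are hypothetical, and the coordinates are illustrative values rather than rows from the project's datasets.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearest_station_distance_m(prop, stations):
    """Distance in meters from a property (lat, lon) to the closest station.

    `stations` is an iterable of (lat, lon) pairs (hypothetical input shape).
    """
    return min(
        haversine_m(prop[0], prop[1], s_lat, s_lon)
        for s_lat, s_lon in stations
    )


# Illustrative coordinates only, not taken from the project's data
example_stations = [(40.7527, -73.9772), (40.7506, -73.9935)]
print(round(nearest_station_distance_m((40.7484, -73.9857), example_stations)))
```

The result is in meters, which matches the unit of the `station_dist` input used by the prediction function.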
- Clone the repository:
git clone https://github.com/aaronmpuga/property-value-prediction.git
cd property-value-prediction
- Set up a virtual environment and install the dependencies (run from inside the cloned property-value-prediction directory):
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
- Set up the Nominatim API for location data (if applicable):
- Use the free Nominatim API to get geographic coordinates for each property.
- Note: the free Nominatim API processes only one request per second. If you need longitude and latitude values for a large number of properties, consider a paid geocoding API to process the data faster.
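The one-request-per-second limit above can be respected with a small throttled loop. This is a minimal sketch using only the standard library, not the project's own geocoding code: `fetch_nominatim` and `geocode_all` are hypothetical names, the `User-Agent` value is a placeholder you should replace with something that identifies your application, and the `fetch`, `sleep`, and `clock` parameters exist only so the loop can be exercised without network access.

```python
import json
import time
import urllib.parse
import urllib.request

NOMINATIM_URL = "https://nominatim.openstreetmap.org/search"


def fetch_nominatim(address):
    """Query the public Nominatim search endpoint; return the parsed JSON list."""
    query = urllib.parse.urlencode({"q": address, "format": "json", "limit": 1})
    request = urllib.request.Request(
        f"{NOMINATIM_URL}?{query}",
        # Placeholder value: Nominatim asks clients to identify themselves
        headers={"User-Agent": "property-value-prediction-example"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def geocode_all(addresses, fetch=fetch_nominatim, min_interval=1.0,
                sleep=time.sleep, clock=time.monotonic):
    """Return {address: (lat, lon) or None}, spacing calls by min_interval seconds."""
    results = {}
    last_call = None
    for address in addresses:
        if last_call is not None:
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)  # stay at or below one request per interval
        last_call = clock()
        matches = fetch(address)
        if matches:
            results[address] = (float(matches[0]["lat"]), float(matches[0]["lon"]))
        else:
            results[address] = None  # no match found for this address
    return results
```

Calling `geocode_all(["<street address>, Queens, NY"])` would return a dictionary keyed by address. At one request per second, a few thousand addresses already take over an hour, which is the practical reason to consider a paid service for the full dataset.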
To test the prediction models, follow the steps below:
- Download the random forest regression models folder to your local device.
- Download the predict.py Python script to your local device.
- In your IDE, open predict.py and add this code at the end of the file:
curr_price = 300000  # Current property value input by the user
years = 5  # How many years ahead the user wants to predict
b_code = "A1"  # Building classification code of the user's property
borough = "Queens"  # The borough the property is in
station_dist = 500  # Distance of the property from the station in meters
predicted_value = predict_property_value(curr_price, years, b_code, borough, station_dist)
print(f"The predicted property value in {years} years is: {predicted_value}")
- curr_price, years, b_code, borough and station_dist are example inputs to the function. Change these values to match the property you want to predict.
- The raw_data folder is reserved for files and data sourced online at the start of the project from sites such as Kaggle and data.ny.gov.
- The Annulized_Rolling_Sale_Update folder, taken from NYC Open Data, holds the Excel files containing the property sales data across all NYC boroughs from 2003 to 2015.
- The zip_borough.csv file, taken from Kaggle, was used to create the borough column in the housing sales dataframe by mapping each property's ZIP code to its corresponding borough name.
- The MTA_Subway_Stations.csv file taken from data.gov held the train station data used throughout the project
- The interim_data folder contains modified or cleaned datasets created from the initial data in the raw_data folder or from other csv files in the same folder.
- The processed_data folder contains the final csv file datasets that contain all the data necessary for their respective purposes
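The ZIP-code-to-borough mapping described above can be sketched as a simple lookup. The project performs this step on a dataframe; the version below swaps in the standard library `csv` module so it runs without extra dependencies. The column names (`zip`, `borough`, `ZIP CODE`, `SALE PRICE`) and the helper names are assumptions for illustration, and the inline data stands in for zip_borough.csv.

```python
import csv
import io


def load_zip_to_borough(csv_file, zip_col="zip", borough_col="borough"):
    """Build a {zip_code: borough_name} lookup from an open CSV file."""
    reader = csv.DictReader(csv_file)
    return {row[zip_col].strip(): row[borough_col].strip() for row in reader}


def add_borough(sales_rows, zip_to_borough, zip_col="ZIP CODE"):
    """Attach a 'borough' key to each sales row; None when the ZIP is unknown."""
    for row in sales_rows:
        row["borough"] = zip_to_borough.get(str(row[zip_col]).strip())
    return sales_rows


# In-memory stand-in for zip_borough.csv (hypothetical column names)
sample_csv = io.StringIO("zip,borough\n11101,Queens\n10001,Manhattan\n")
mapping = load_zip_to_borough(sample_csv)

sales = [
    {"ZIP CODE": "11101", "SALE PRICE": "450000"},
    {"ZIP CODE": "99999", "SALE PRICE": "1"},  # ZIP not present in the lookup
]
print(add_borough(sales, mapping))
```

Rows whose ZIP code is missing from the lookup get `None`, which makes them easy to filter out or inspect during cleaning.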
- Expand the model to include additional building codes (e.g., Code B and D).
- Implement more sophisticated geographic feature extraction using APIs with higher rate limits.
- Explore deep learning models for improved predictions.
- Develop a web-based application for real-time property value prediction.