Car prices in Egypt have been chaotic over the past couple of years, especially in the used car market. In this project, I developed a pipeline that scrapes data for nearly 30K cars listed for sale on Hatla2ee.com (one of the biggest Egyptian used car marketplaces), stores it in Amazon S3 as a data lake, and loads it into an Amazon Redshift data warehouse, with Airflow running the pipeline daily. Next, I applied some analytics to the data to prepare it for feeding into a neural network model that predicts a car's price from its main features (brand, model, class, km, transmission, and fuel type). I then created a RESTful API with Flask to deploy the model, and finally developed a web application with Plotly Dash that provides an interactive user interface for predicting car prices.
I used the BeautifulSoup library to scrape all the used car data on hatla2ee.com. First, I scraped the cars using fuel type as a search filter, which avoided having to load every single car page just to get the fuel data (fuel type was the only field not shown on the listing page, so it would otherwise require loading each car's page). Then I scraped the data again using the car body search filter. I could not apply both the fuel and body filters at the same time, because doing so returned only about 10% of the data; it turned out that not all cars have body type information, so they simply don't appear under that filter. I scraped the cars that do have body type information and used them to build a data set of the available body types for each model, which can then be used to fill in the body type for cars of the same model that lack it.
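The loop below is a minimal sketch of this per-filter scraping strategy; the search URL pattern, filter values, and CSS selectors are illustrative assumptions, not the ones actually used by the scraper:

```python
# Minimal sketch of scraping listings per fuel-type filter.
# NOTE: the search URL pattern and CSS selectors below are hypothetical.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://eg.hatla2ee.com/en/car/search"  # hypothetical endpoint
FUEL_TYPES = ["benzine", "diesel", "natural-gas", "electric"]  # assumed filter values

def scrape_fuel_type(fuel: str) -> list[dict]:
    """Walk every result page for one fuel-type filter."""
    cars, page = [], 1
    while True:
        resp = requests.get(f"{BASE_URL}/{fuel}/page/{page}", timeout=30)
        soup = BeautifulSoup(resp.text, "html.parser")
        rows = soup.select("div.car-listing")  # hypothetical listing selector
        if not rows:
            break  # ran out of result pages for this filter
        for row in rows:
            cars.append({
                "title": row.select_one(".car-title").get_text(strip=True),
                "price": row.select_one(".car-price").get_text(strip=True),
                "fuel": fuel,  # known from the filter, so no detail-page visit needed
            })
        page += 1
    return cars

all_cars = [car for fuel in FUEL_TYPES for car in scrape_fuel_type(fuel)]
```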
The technologies I used in developing the pipeline are:
- Scraping: BeautifulSoup
- Cloud: AWS
- Data Lake: Amazon S3
- Data Warehouse: Amazon Redshift
- Workflow Orchestration: Airflow
- API development: Flask
- Web app development: Plotly Dash
Airflow manages two main pipelines: `full_load` and `incremental_load`. The `full_load` pipeline runs once to initialize the database by scraping the bulk corpus from the car website. The `incremental_load` pipeline, on the other hand, runs daily to scrape newly listed cars and to update existing cars whose price has been changed by the seller, using the `fingerprint` column to track price changes.
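As a rough illustration, the daily pipeline can be wired up like this; the three task callables are placeholder stubs for the real steps (scrape the site, stage files to S3, upsert into Redshift):

```python
# Minimal sketch of the incremental_load DAG (Airflow 2.x).
# The three callables are placeholder stubs for the real pipeline steps.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def scrape_new_cars(**context): ...        # scrape newly listed cars
def upload_to_s3(**context): ...           # stage raw data in the S3 data lake
def upsert_into_redshift(**context): ...   # insert new rows / update changed prices

with DAG(
    dag_id="incremental_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # runs once a day
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape_new_cars", python_callable=scrape_new_cars)
    stage = PythonOperator(task_id="upload_to_s3", python_callable=upload_to_s3)
    load = PythonOperator(task_id="upsert_into_redshift", python_callable=upsert_into_redshift)

    scrape >> stage >> load  # stage to S3 before touching Redshift
```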
- The `cars_data` table contains the main scraped car information. Its `fingerprint` column is a combination of the `car_id` and `price` columns, used as a signature to detect when a seller has changed a car's price and the record needs to be updated (see the sketch after this list).
- The `car_body_data` table contains information about the body of the car for each car brand/model.
- The `car_class_data` table contains the classes available for each model.
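One way the `fingerprint` signature can work is sketched below with a hypothetical `make_fingerprint` helper; the actual derivation in the project may differ:

```python
# Hypothetical sketch of deriving the fingerprint from car_id and price.
import hashlib

def make_fingerprint(car_id: str, price: int) -> str:
    """Hash the car_id/price pair into a stable signature; any price
    change produces a new fingerprint, flagging the row for an update."""
    return hashlib.md5(f"{car_id}|{price}".encode()).hexdigest()

old = make_fingerprint("12345", 850_000)
new = make_fingerprint("12345", 820_000)  # seller lowered the price
assert old != new  # mismatch tells incremental_load to update the row
```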
All the analytics are covered in detail in the notebook directory; I'll show the main charts here.
I built a neural network of three dense layers and trained it on the data; the model's accuracy was around 92%. More details about building the model and preparing the data are in the notebook directory. After training, I saved the model as a pickle file along with the car data frame and the fitted scaler, to be used later by the prediction API and the web application.
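For context, a network of that shape can be sketched in Keras roughly as follows; the random arrays are placeholders for the real prepared features, and the actual training code lives in the notebooks:

```python
# Minimal sketch of a three-dense-layer price regressor in Keras.
# The random arrays stand in for the encoded/scaled features.
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

n_features = 6  # brand, model, class, km, transmission, fuel (encoded)
X_raw = np.random.rand(500, n_features)  # placeholder feature matrix
y = np.random.rand(500) * 1_000_000      # placeholder prices

scaler = StandardScaler()
X = scaler.fit_transform(X_raw)

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),  # predicted price
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X, y, epochs=50, validation_split=0.2, verbose=0)

# Persist the artifacts the prediction API and web app load at startup
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
```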
This is the interface of the web application. The user selects the car properties and clicks Predict to see the results; the application also shows statistics of car prices for popular brands and models. Here is a video demo showing the web app and how to use it: https://youtu.be/xCKlSArHJvQ
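A stripped-down Dash version of that Predict flow might look like this; only two inputs are shown, and the `/predict` endpoint URL is a hypothetical placeholder for the Flask API:

```python
# Minimal sketch of the Predict flow in Dash; the inputs and API URL are
# illustrative placeholders, not the app's real layout.
import requests
from dash import Dash, html, dcc, Input, Output, State

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="brand", options=["Toyota", "Hyundai", "Kia"]),  # assumed options
    dcc.Input(id="km", type="number", placeholder="Kilometers"),
    html.Button("Predict", id="predict-btn"),
    html.Div(id="result"),
])

@app.callback(
    Output("result", "children"),
    Input("predict-btn", "n_clicks"),
    State("brand", "value"),
    State("km", "value"),
    prevent_initial_call=True,
)
def predict(n_clicks, brand, km):
    # Forward the selected features to the Flask prediction API (hypothetical URL)
    resp = requests.post("http://localhost:5000/predict", json={"brand": brand, "km": km})
    return f"Estimated price: {resp.json()['price']:,.0f} EGP"

if __name__ == "__main__":
    app.run(debug=True)
```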