Car prices in Egypt have been chaotic over the past couple of years, especially in the used car market. In this project, I developed a pipeline that scrapes data for nearly 30K cars listed for sale on Hatla2ee.com (one of the biggest Egyptian used car marketplaces), stores it in Amazon S3 as a data lake, and loads it into an Amazon Redshift data warehouse, with Airflow running the pipeline daily. Next, I applied some analytics to the data to prepare it for feeding into a neural network model that predicts a car's price from its main features (brand, model, class, km, transmission, and fuel type). I then created a RESTful API with Flask to deploy the model, and finally developed a web application with Plotly Dash that provides an interactive user interface for predicting car prices.
I used the BeautifulSoup library to scrape all the used car data on hatla2ee.com. First, I scraped the cars using fuel type as a search filter, which avoided having to load every single car page just to get the fuel data (fuel type was the only field not shown on the listing page, so it would otherwise require loading each car's page). Then I scraped the data again using the car body search filter. I could not apply both the fuel and body filters at the same time, because doing so returned only about 10% of the data; it turned out that not all cars have body type information, so they simply don't appear under that filter. I scraped the cars that do have body type information and used them to build a data set of the available body types for each model, which can then be used to fill in the body type for cars of the same model that lack it.
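The loop below is a minimal sketch of this per-filter scraping strategy; the search URL pattern, filter values, and CSS selectors are illustrative assumptions, not the ones actually used by the scraper:

```python
# Minimal sketch of scraping listings per fuel-type filter.
# NOTE: the search URL pattern and CSS selectors below are hypothetical.
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://eg.hatla2ee.com/en/car/search"  # hypothetical endpoint
FUEL_TYPES = ["benzine", "diesel", "natural-gas", "electric"]  # assumed filter values

def scrape_fuel_type(fuel: str) -> list[dict]:
    """Walk every result page for one fuel-type filter."""
    cars, page = [], 1
    while True:
        resp = requests.get(f"{BASE_URL}/{fuel}/page/{page}", timeout=30)
        soup = BeautifulSoup(resp.text, "html.parser")
        rows = soup.select("div.car-listing")  # hypothetical listing selector
        if not rows:
            break  # ran out of result pages for this filter
        for row in rows:
            cars.append({
                "title": row.select_one(".car-title").get_text(strip=True),
                "price": row.select_one(".car-price").get_text(strip=True),
                "fuel": fuel,  # known from the filter, so no detail-page visit needed
            })
        page += 1
    return cars

all_cars = [car for fuel in FUEL_TYPES for car in scrape_fuel_type(fuel)]
```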
The technologies I used in developing the pipeline are:
- Scraping: BeautifulSoup
- Cloud: AWS
- Data Lake: Amazon S3
- Data Warehouse: Amazon Redshift
- Workflow Orchestration: Airflow
- API development: Flask
- Web app development: Plotly Dash
Airflow manages two main pipelines: `full_load` and `incremental_load`. The `full_load` pipeline runs once to initialize the database by scraping the bulk corpus from the car website. The `incremental_load` pipeline, on the other hand, runs daily to scrape newly listed cars and to update existing cars whose price has been changed by the seller, using the `fingerprint` column to track price changes.
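As a rough illustration, the daily pipeline can be wired up like this; the three task callables are placeholder stubs for the real steps (scrape the site, stage files to S3, upsert into Redshift):

```python
# Minimal sketch of the incremental_load DAG (Airflow 2.x).
# The three callables are placeholder stubs for the real pipeline steps.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def scrape_new_cars(**context): ...        # scrape newly listed cars
def upload_to_s3(**context): ...           # stage raw data in the S3 data lake
def upsert_into_redshift(**context): ...   # insert new rows / update changed prices

with DAG(
    dag_id="incremental_load",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",  # runs once a day
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape_new_cars", python_callable=scrape_new_cars)
    stage = PythonOperator(task_id="upload_to_s3", python_callable=upload_to_s3)
    load = PythonOperator(task_id="upsert_into_redshift", python_callable=upsert_into_redshift)

    scrape >> stage >> load  # stage to S3 before touching Redshift
```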
- The `cars_data` table contains the main scraped car information. Its `fingerprint` column is a combination of the `car_id` and `price` columns, used as a signature to detect when a seller has changed a car's price and the record needs to be updated (see the sketch after this list).
- The `car_body_data` table contains information about the body of the car for each car brand/model.
- The `car_class_data` table contains the classes available for each model.
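One way the `fingerprint` signature can work is sketched below with a hypothetical `make_fingerprint` helper; the actual derivation in the project may differ:

```python
# Hypothetical sketch of deriving the fingerprint from car_id and price.
import hashlib

def make_fingerprint(car_id: str, price: int) -> str:
    """Hash the car_id/price pair into a stable signature; any price
    change produces a new fingerprint, flagging the row for an update."""
    return hashlib.md5(f"{car_id}|{price}".encode()).hexdigest()

old = make_fingerprint("12345", 850_000)
new = make_fingerprint("12345", 820_000)  # seller lowered the price
assert old != new  # mismatch tells incremental_load to update the row
```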
All the analytics are covered in detail in the notebook directory; I'll show the main charts here.
I built a neural network of three dense layers and trained it on the data; the model's accuracy was around 92%. More details about building the model and preparing the data are in the notebook directory. After training, I saved the model as a pickle file along with the car data frame and the fitted scaler, to be used later by the prediction API and the web application.
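For context, a network of that shape can be sketched in Keras roughly as follows; the random arrays are placeholders for the real prepared features, and the actual training code lives in the notebooks:

```python
# Minimal sketch of a three-dense-layer price regressor in Keras.
# The random arrays stand in for the encoded/scaled features.
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow import keras

n_features = 6  # brand, model, class, km, transmission, fuel (encoded)
X_raw = np.random.rand(500, n_features)  # placeholder feature matrix
y = np.random.rand(500) * 1_000_000      # placeholder prices

scaler = StandardScaler()
X = scaler.fit_transform(X_raw)

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),  # predicted price
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.fit(X, y, epochs=50, validation_split=0.2, verbose=0)

# Persist the artifacts the prediction API and web app load at startup
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)
```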
This is the interface of the web application. The user selects the car properties and clicks Predict to see the results; the application also shows statistics of car prices for popular brands and models. Here is a video demo showing the web app and how to use it: https://youtu.be/xCKlSArHJvQ
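A stripped-down Dash version of that Predict flow might look like this; only two inputs are shown, and the `/predict` endpoint URL is a hypothetical placeholder for the Flask API:

```python
# Minimal sketch of the Predict flow in Dash; the inputs and API URL are
# illustrative placeholders, not the app's real layout.
import requests
from dash import Dash, html, dcc, Input, Output, State

app = Dash(__name__)
app.layout = html.Div([
    dcc.Dropdown(id="brand", options=["Toyota", "Hyundai", "Kia"]),  # assumed options
    dcc.Input(id="km", type="number", placeholder="Kilometers"),
    html.Button("Predict", id="predict-btn"),
    html.Div(id="result"),
])

@app.callback(
    Output("result", "children"),
    Input("predict-btn", "n_clicks"),
    State("brand", "value"),
    State("km", "value"),
    prevent_initial_call=True,
)
def predict(n_clicks, brand, km):
    # Forward the selected features to the Flask prediction API (hypothetical URL)
    resp = requests.post("http://localhost:5000/predict", json={"brand": brand, "km": km})
    return f"Estimated price: {resp.json()['price']:,.0f} EGP"

if __name__ == "__main__":
    app.run(debug=True)
```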