
Property Sales Dashboard Using Real Data Crawled from Real Estate Website

This repository implements a dashboard built on data scraped from Imovirtual, a Portuguese real estate website offering homes, apartments, and other properties for sale and rent. Using MongoDB as the database, it crawls the raw data, cleans it, and prepares it for use in the dashboard.

Both the dashboard and the crawling scripts are implemented in Python. The dashboard uses the Dash and Dash Bootstrap Components frameworks. The scraper uses Requests, asynchronous requests with HTTPX, and BeautifulSoup.

By setting the environment variables, the script can store the scraped raw data in three different destinations: MongoDB (using pymongo), an AWS S3 bucket as a JSON file (using boto3), or local storage as a JSON file.

Below is a quick demonstration of the dashboard.

It's possible to run the dashboard using Docker Compose.

Table of Contents

  • How it works
  • How to run this project
  • MongoDB Local Backup
  • Further Improvements

How it works

The project is divided into two blocks that can work separately, each inside the src folder:

  • Data Ingestion: Responsible for crawling the data, consolidating it in the database while avoiding duplicates, and preparing the data for use in the dashboard.
  • Dashboard: Uses the cleaned and filtered data from the database.

Data Ingestion

All the code related to data ingestion is inside the src/ingestion folder. It creates three collections in MongoDB:

  • A raw collection for storing the crawled data.
  • A consolidated collection that maintains a historical record of the data.
  • A dashboard collection used by the dashboard.

The raw data produced by the crawler can be stored in the MongoDB raw collection, in an AWS S3 bucket as a JSON file, or locally as a JSON file, depending on the settings in the .env file.
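A minimal sketch of how this storage dispatch could look. The helper name save_raw_data, the file paths, and the AWS_BUCKET_NAME variable are assumptions for illustration; only the USE_STORAGE_* flags and MongoDB settings come from the project's .env example.

import json
import os

import boto3
from pymongo import MongoClient


def save_raw_data(records: list[dict]) -> None:
    # Hypothetical helper: writes the crawled records to every enabled destination.
    if os.getenv("USE_STORAGE_MONGO") == "True":
        client = MongoClient(os.getenv("MONGO_HOST"), int(os.getenv("MONGO_PORT", "27017")))
        client[os.getenv("MONGO_DATABASE", "db")]["raw"].insert_many(records)

    if os.getenv("USE_STORAGE_AWS_S3") == "True":
        # The bucket name variable is an assumption, not part of the documented .env.
        boto3.client("s3").put_object(
            Bucket=os.environ["AWS_BUCKET_NAME"],
            Key="raw_data.json",
            Body=json.dumps(records),
        )

    if os.getenv("USE_STORAGE_LOCAL") == "True":
        with open("raw_data.json", "w") as f:
            json.dump(records, f)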

The consolidation process creates a new collection in MongoDB and removes duplicates from the raw data. It compares the raw data with the consolidated collection, inserting only new advertisements and updating advertisements that are no longer available on the website. Entries are deduplicated using the unique ID that the website assigns to each advertisement. The goal of consolidation is to maintain a historical record of advertisements, even after they are no longer available on the website.
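A rough sketch of the upsert-by-ID idea with pymongo; the collection and field names ("raw", "consolidated", "ad_id") are illustrative, not the project's actual schema.

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb", 27017)
db = client["db"]

# Upsert every raw advertisement into the consolidated collection, keyed by the
# ID retrieved from the website, so re-crawled ads are updated instead of duplicated.
operations = [
    UpdateOne({"ad_id": doc["ad_id"]}, {"$set": doc}, upsert=True)
    for doc in db["raw"].find({}, {"_id": 0})
]
if operations:
    db["consolidated"].bulk_write(operations)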

For the data used in the dashboard, a new collection is created. This pipeline extracts data from one of the previous collections (raw or consolidated) and filters and transforms it so that it is ready for use in the dashboard.

For this specific website, it was possible to use asynchronous requests. An initial request retrieves the pagination information for the search; from it, a block of page URLs is constructed and fetched asynchronously, and the data is then extracted from the responses.
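The sketch below illustrates that strategy with HTTPX and BeautifulSoup; the search URL, query parameter, and CSS selector are placeholders, not the site's real structure.

import asyncio

import httpx
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.imovirtual.com/..."  # placeholder search URL


async def fetch_pages(total_pages: int) -> list[str]:
    # Fetch all result pages concurrently once the page count is known.
    async with httpx.AsyncClient(timeout=30) as client:
        responses = await asyncio.gather(
            *(client.get(SEARCH_URL, params={"page": n}) for n in range(1, total_pages + 1))
        )
    return [r.text for r in responses]


def parse_listings(html: str) -> list[dict]:
    # Selectors and fields are illustrative; extract whatever the pipeline needs.
    soup = BeautifulSoup(html, "html.parser")
    return [{"title": tag.get_text(strip=True)} for tag in soup.select("article h3")]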

Dashboard

All the data used in the dashboard is loaded directly from the MongoDB collection designed for it. The folder and file structure inside src/dash was designed so that the data is loaded from MongoDB only once, with all components importing it from the same source file.

To build the graphs and manipulate the data, it uses Plotly and Pandas.
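A minimal sketch of that load-once pattern, assuming the dashboard collection is named "dashboard" and contains a price field (both assumptions).

import pandas as pd
import plotly.express as px
from pymongo import MongoClient

# Loaded once at import time; every Dash component imports this module and
# reuses the same DataFrame instead of querying MongoDB again.
client = MongoClient("mongodb", 27017)
df = pd.DataFrame(list(client["db"]["dashboard"].find({}, {"_id": 0})))


def price_histogram():
    # Example figure built from the shared DataFrame.
    return px.histogram(df, x="price", nbins=50, title="Price distribution")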

How to run this project

This section explains how to run the project.

All the steps here are intended for a bash terminal.

The project setup uses pyenv and poetry.

As mentioned before, this project operates in two blocks, and it is possible to run both of them independently. Using Docker Compose, you can run the dashboard locally connected to the database. The Docker Compose setup includes an entrypoint that populates the database with the JSON file located at scripts/data/data.json if the collection does not already exist in MongoDB.
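The real entrypoint is a shell script, but the seeding logic it performs is roughly equivalent to the Python below (the collection name is an assumption).

import json

from pymongo import MongoClient

client = MongoClient("mongodb", 27017)
db = client["db"]

# Seed the database from the bundled JSON file only if the collection is missing.
if "dashboard" not in db.list_collection_names():
    with open("scripts/data/data.json") as f:
        db["dashboard"].insert_many(json.load(f))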

1 - Clone the repo locally:

git clone https://github.com/lealre/crawler-to-dash.git

2 - Access the project directory:

cd crawler-to-dash

To run the project properly, it is necessary to create the .env file in the root of the project. An example can be found in .env-example. The default configuration for connecting to MongoDB is:

MONGO_HOST='mongodb'
MONGO_PORT=27017
MONGO_DATABASE='db'

Dash with Docker

After completing steps 1 and 2, and with the .env file configured:

Build the image:

docker compose build

Start the containers:

docker compose up

Access the dashboard at:

http://localhost:8051/

NOTE: It may be necessary to make the script ./entrypoint.sh executable before building the container:

chmod +x entrypoint.sh

Local Setup

After completing steps 1 and 2, and with the .env file configured:

3 - Set the Python version with pyenv:

pyenv local 3.12.2

4 - Create the virtual environment:

poetry env use 3.12.2

5 - Activate the virtual environment:

poetry shell

6 - Install dependencies:

poetry install

7 - Run the data pipeline, from crawling to dashboard data:

task crawl_to_dash

It is also possible to run each step separately:

To run just the crawler:

task crawl

Depending on the flags set in the .env file, the crawler stores the data in one or more of the following destinations:

USE_STORAGE_LOCAL=False
USE_STORAGE_MONGO=False
USE_STORAGE_AWS_S3=False

To run just the data consolidation:

task consolidate

To generate the dashboard data:

task dash_data

MongoDB Local Backup

The script mongo_backup.sh dumps the database from the MongoDB container to local storage. It uses the paths and container name specified in the .env file.

The script first dumps the database content to a file inside the container, and then copies the dump file from the container to local storage.

Make the script executable:

chmod +x mongo_backup.sh

Execute the backup script:

./mongo_backup.sh

The .env file should contain the following variables:

CONTAINER_NAME="container-name"
BACKUP_PATH="/path/in/container"
LOCAL_BACKUP_PATH="/local/path/to/export"

Further Improvements

  • Improve the dashboard's style, including font size, colors, and callback interactions.
