This repository implements a dashboard built on data scraped from Imovirtual, a Portuguese real estate website offering homes, apartments, and other properties for sale and rent. Using MongoDB as the database, the project crawls the raw data, cleans it, and prepares it for use in the dashboard.
Both the dashboard and the scripts to crawl the data were implemented using Python. The dashboard uses the Dash and Dash Bootstrap Components frameworks. To scrape the data, it uses Requests, asynchronous requests with HTTPX, and BeautifulSoup.
By setting the environment variables, the script can store the scraped raw data in three different destinations: MongoDB (using pymongo), an AWS S3 bucket as a JSON file (using boto3), or local storage as a JSON file.
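As a rough illustration of how this selection might work, the sketch below dispatches the scraped records based on the storage flags defined in the .env file (USE_STORAGE_MONGO, USE_STORAGE_AWS_S3, USE_STORAGE_LOCAL, shown later in this README). Function, bucket, and path names are hypothetical; the actual logic lives in src/ingestion.

```python
import json
import os

import boto3
from pymongo import MongoClient


def store_raw_data(records: list[dict]) -> None:
    """Hypothetical sketch: send the scraped records to the storages enabled in .env."""
    if os.getenv("USE_STORAGE_MONGO", "False") == "True":
        client = MongoClient(os.getenv("MONGO_HOST", "localhost"), int(os.getenv("MONGO_PORT", "27017")))
        client[os.getenv("MONGO_DATABASE", "db")]["raw"].insert_many(records)  # collection name assumed
    if os.getenv("USE_STORAGE_AWS_S3", "False") == "True":
        boto3.client("s3").put_object(
            Bucket=os.getenv("AWS_S3_BUCKET", "my-bucket"),  # bucket variable name assumed
            Key="raw/data.json",
            Body=json.dumps(records),
        )
    if os.getenv("USE_STORAGE_LOCAL", "False") == "True":
        with open("data/raw.json", "w") as f:  # local path assumed
            json.dump(records, f)
```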
Below is a quick demonstration of the dashboard.
It's possible to run the dashboard using Docker Compose.
The project is divided into two blocks that can work separately, each inside the src folder:
- Data Ingestion: Responsible for crawling the data, consolidating it in the database while avoiding duplicates, and preparing the data for use in the dashboard.
- Dashboard: Uses the cleaned and filtered data from the database.
All the code related to data ingestion is inside the src/ingestion folder. It creates three collections in MongoDB:
- A raw collection for storing the crawled data.
- A consolidated collection that maintains a historical record of the data.
- A dashboard collection used by the dashboard.
The raw data, which comes from the crawler, can be stored in MongoDB, in an AWS S3 bucket as a JSON file, or locally as a JSON file, depending on the settings in the .env file.
The consolidation process creates a new collection in MongoDB and removes duplicate values from the raw data. It compares the raw data with the consolidated collection, allowing only new advertisements to be inserted and updating advertisements that are no longer available on the website. It filters unique entries using the same ID retrieved from the website for each advertisement. The goal of consolidation is to maintain a historical record of advertisements, even if they are no longer available on the website.
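A minimal sketch of that upsert logic with pymongo could look like the following; collection and field names are assumptions, not the project's actual schema.

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb", 27017)
db = client["db"]
raw, consolidated = db["raw"], db["consolidated"]  # collection names are assumptions

# Drop MongoDB's own _id so the website's advertisement ID ("id") stays the deduplication key
raw_docs = [{k: v for k, v in d.items() if k != "_id"} for d in raw.find({})]
raw_ids = [d["id"] for d in raw_docs]

if raw_docs:
    # Insert new advertisements and refresh the ones already consolidated
    consolidated.bulk_write(
        [UpdateOne({"id": d["id"]}, {"$set": {**d, "available": True}}, upsert=True) for d in raw_docs]
    )

# Keep the historical record, but flag ads that are no longer on the website
consolidated.update_many({"id": {"$nin": raw_ids}}, {"$set": {"available": False}})
```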
For the data used in the dashboard, a new collection is created. This pipeline extracts data from one of the previous collections (raw or consolidated), filters and transforms it so that it is ready for use in the dashboard.
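As an illustration only, a pandas-based version of this step might read the consolidated collection, clean a few fields, and rewrite the dashboard collection; the field names used here are assumptions.

```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb", 27017)
db = client["db"]

# Extract from the consolidated collection (the raw one could be used instead)
df = pd.DataFrame(db["consolidated"].find({}, {"_id": 0}))

# Filter and transform: hypothetical cleaning of price values
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df = df.dropna(subset=["price"])
df = df[df["price"] > 0]

# Load the result into the collection consumed by the dashboard
db["dashboard"].delete_many({})
if not df.empty:
    db["dashboard"].insert_many(df.to_dict("records"))
```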
For this specific website, it was possible to use asynchronous requests. An initial request retrieves the pagination information for the search; from that, a block of page URLs is constructed and requested asynchronously. After all the requests complete, the data is extracted.
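The sketch below outlines that flow with HTTPX and BeautifulSoup. The search URL and the two parsing helpers are placeholders, since the site-specific selectors are omitted here.

```python
import asyncio

import httpx
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.imovirtual.com/..."  # placeholder search URL


def get_last_page(soup: BeautifulSoup) -> int:
    """Placeholder: read the number of result pages from the first response."""
    return 3


def extract_ads(soup: BeautifulSoup) -> list[dict]:
    """Placeholder: pull the advertisement fields out of one result page."""
    return []


async def crawl() -> list[dict]:
    async with httpx.AsyncClient(timeout=30) as client:
        # Initial request only retrieves the pagination information
        first = await client.get(SEARCH_URL)
        last_page = get_last_page(BeautifulSoup(first.text, "html.parser"))

        # Build the block of page URLs and request them concurrently
        urls = [f"{SEARCH_URL}?page={n}" for n in range(1, last_page + 1)]
        responses = await asyncio.gather(*(client.get(u) for u in urls))

    # Data extraction happens after all responses arrive
    return [ad for r in responses for ad in extract_ads(BeautifulSoup(r.text, "html.parser"))]


ads = asyncio.run(crawl())
```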
All the data used in the dashboard is loaded directly from the MongoDB collection designed for it. The folder and file structure inside src/dash was designed to load the data from MongoDB only once, and the components import the data from the same source file.
To build the graphs and manipulate the data, the dashboard uses Plotly and Pandas.
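A minimal sketch of that pattern, with hypothetical module and column names, is a single data module that queries MongoDB at import time and is then imported by every component:

```python
# src/dash/data.py (hypothetical module): queried once, imported everywhere
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb", 27017)
df = pd.DataFrame(client["db"]["dashboard"].find({}, {"_id": 0}))

# A component then builds its figure from the shared DataFrame, e.g.:
# from data import df
import plotly.express as px

fig = px.histogram(df, x="price")  # column name is an assumption
```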
This section explains how to run the project.
All the steps here are intended for a bash terminal.
The project setup uses pyenv and poetry.
As mentioned before, this project operates in two blocks, and it is possible to run both of them independently. Using Docker Compose, you can run the dashboard locally connected to the database. The Docker Compose setup includes an entrypoint that populates the database with the JSON file located at scripts/data/data.json if the collection does not already exist in MongoDB.
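Conceptually, the seeding done by that entrypoint amounts to something like the sketch below; the collection name is an assumption, while the JSON path comes from the repository.

```python
import json

from pymongo import MongoClient

client = MongoClient("mongodb", 27017)
db = client["db"]

# Populate MongoDB from the bundled JSON only if the collection does not exist yet
if "dashboard" not in db.list_collection_names():  # collection name assumed
    with open("scripts/data/data.json") as f:
        db["dashboard"].insert_many(json.load(f))
```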
1 - Clone the repo locally:
git clone https://github.com/lealre/crawler-to-dash.git
2 - Access the project directory:
cd crawler-to-dash
To run this properly, it is necessary to create the .env file in the root of the project. An example can be found in .env-example. The default configuration to connect with MongoDB is:
MONGO_HOST='mongodb'
MONGO_PORT=27017
MONGO_DATABASE='db'
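For reference, these are the values a pymongo client would typically use to connect; the snippet below is only a sketch of how they might be read, not the project's actual settings code.

```python
import os

from pymongo import MongoClient

# Read the connection settings from the environment, falling back to the defaults above
client = MongoClient(
    host=os.getenv("MONGO_HOST", "mongodb"),
    port=int(os.getenv("MONGO_PORT", "27017")),
)
db = client[os.getenv("MONGO_DATABASE", "db")]
```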
After completing steps 1 and 2, and with the .env file configured:
Build the image:
docker compose build
Start the container:
docker compose up
Access the local host:
http://localhost:8051/
NOTE: It may be necessary to make the script ./entrypoint.sh executable before building the container:
chmod +x entrypoint.sh
After completing steps 1 and 2, and with the .env file configured:
3 - Set the Python version with pyenv:
pyenv local 3.12.2
4 - Create the virtual environment:
poetry env use 3.12.2
5 - Activate the virtual environment:
poetry shell
6 - Install dependencies:
poetry install
7 - Run the data pipeline, from crawling to dashboard data:
task crawl_to_dash
It is also possible to run each step separately:
To run just the crawler:
task crawl
Depending on the .env variables below, the data will be stored in one or more of the possible destinations:
USE_STORAGE_LOCAL=False
USE_STORAGE_MONGO=False
USE_STORAGE_AWS_S3=False
To run just the data consolidation:
task consolidate
To generate the dash data:
task dash_data
The script mongo_backup.sh dumps the database to local storage as a file. It uses the paths and container name specified in the .env file.
The script first dumps the database content to a file inside the container, and it then copies the dump file from the container to the local storage.
Make the script executable:
chmod +x mongo_backup.sh
Execute the backup script:
./mongo_backup.sh
The .env file should contain the following variables:
CONTAINER_NAME="container-name"
BACKUP_PATH="/path/in/container"
LOCAL_BACKUP_PATH="/local/path/to/export"
- Improve the dashboard's style, including font size, colors, and callback interactions.