This repository implements a dashboard built on data scraped from Imovirtual, a Portuguese real estate website offering homes, apartments, and other properties for sale and rent. Using MongoDB as the database, the project crawls the raw data, cleans it, and prepares it for use in the dashboard.
Both the dashboard and the scripts to crawl the data were implemented using Python. The dashboard uses the Dash and Dash Bootstrap Components frameworks. To scrape the data, it uses Requests, asynchronous requests with HTTPX, and BeautifulSoup.
By setting the environment variables, the script can store the scraped raw data in three different destinations: MongoDB (using pymongo), an AWS S3 bucket as a JSON file (using boto3), or local storage as a JSON file.
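As a rough illustration of how this selection might work, the sketch below dispatches the scraped records based on the storage flags defined in the .env file (USE_STORAGE_MONGO, USE_STORAGE_AWS_S3, USE_STORAGE_LOCAL, shown later in this README). Function, bucket, and path names are hypothetical; the actual logic lives in src/ingestion.

```python
import json
import os

import boto3
from pymongo import MongoClient


def store_raw_data(records: list[dict]) -> None:
    """Hypothetical sketch: send the scraped records to the storages enabled in .env."""
    if os.getenv("USE_STORAGE_MONGO", "False") == "True":
        client = MongoClient(os.getenv("MONGO_HOST", "localhost"), int(os.getenv("MONGO_PORT", "27017")))
        client[os.getenv("MONGO_DATABASE", "db")]["raw"].insert_many(records)  # collection name assumed
    if os.getenv("USE_STORAGE_AWS_S3", "False") == "True":
        boto3.client("s3").put_object(
            Bucket=os.getenv("AWS_S3_BUCKET", "my-bucket"),  # bucket variable name assumed
            Key="raw/data.json",
            Body=json.dumps(records),
        )
    if os.getenv("USE_STORAGE_LOCAL", "False") == "True":
        with open("data/raw.json", "w") as f:  # local path assumed
            json.dump(records, f)
```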
Below is a quick demonstration of the dashboard.
It's possible to run the dashboard using Docker Compose.
The project is divided into two blocks that can work separately, each inside the src folder:
- Data Ingestion: Responsible for crawling the data, consolidating it in the database while avoiding duplicates, and preparing the data for use in the dashboard.
- Dashboard: Uses the cleaned and filtered data from the database.
All the code related to data ingestion is inside the src/ingestion folder. It creates three collections in MongoDB:
- A raw collection for storing the crawled data.
- A consolidated collection that maintains a historical record of the data.
- A dashboard collection used by the dashboard.
The raw data, which comes from the crawler, can be stored in MongoDB, in an AWS S3 bucket as a JSON file, or locally as a JSON file, depending on the settings in the .env file.
The consolidation process creates a new collection in MongoDB and removes duplicate values from the raw data. It compares the raw data with the consolidated collection, allowing only new advertisements to be inserted and updating advertisements that are no longer available on the website. It filters unique entries using the same ID retrieved from the website for each advertisement. The goal of consolidation is to maintain a historical record of advertisements, even if they are no longer available on the website.
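A minimal sketch of that upsert logic with pymongo could look like the following; collection and field names are assumptions, not the project's actual schema.

```python
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb", 27017)
db = client["db"]
raw, consolidated = db["raw"], db["consolidated"]  # collection names are assumptions

# Drop MongoDB's own _id so the website's advertisement ID ("id") stays the deduplication key
raw_docs = [{k: v for k, v in d.items() if k != "_id"} for d in raw.find({})]
raw_ids = [d["id"] for d in raw_docs]

if raw_docs:
    # Insert new advertisements and refresh the ones already consolidated
    consolidated.bulk_write(
        [UpdateOne({"id": d["id"]}, {"$set": {**d, "available": True}}, upsert=True) for d in raw_docs]
    )

# Keep the historical record, but flag ads that are no longer on the website
consolidated.update_many({"id": {"$nin": raw_ids}}, {"$set": {"available": False}})
```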
For the data used in the dashboard, a new collection is created. This pipeline extracts data from one of the previous collections (raw or consolidated), filters and transforms it so that it is ready for use in the dashboard.
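As an illustration only, a pandas-based version of this step might read the consolidated collection, clean a few fields, and rewrite the dashboard collection; the field names used here are assumptions.

```python
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb", 27017)
db = client["db"]

# Extract from the consolidated collection (the raw one could be used instead)
df = pd.DataFrame(db["consolidated"].find({}, {"_id": 0}))

# Filter and transform: hypothetical cleaning of price values
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df = df.dropna(subset=["price"])
df = df[df["price"] > 0]

# Load the result into the collection consumed by the dashboard
db["dashboard"].delete_many({})
if not df.empty:
    db["dashboard"].insert_many(df.to_dict("records"))
```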
For this specific website, it was possible to use asynchronous requests. An initial request retrieves the pagination information for the search; from that, a block of page URLs is constructed and requested asynchronously. After all the requests complete, the data is extracted.
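The sketch below outlines that flow with HTTPX and BeautifulSoup. The search URL and the two parsing helpers are placeholders, since the site-specific selectors are omitted here.

```python
import asyncio

import httpx
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.imovirtual.com/..."  # placeholder search URL


def get_last_page(soup: BeautifulSoup) -> int:
    """Placeholder: read the number of result pages from the first response."""
    return 3


def extract_ads(soup: BeautifulSoup) -> list[dict]:
    """Placeholder: pull the advertisement fields out of one result page."""
    return []


async def crawl() -> list[dict]:
    async with httpx.AsyncClient(timeout=30) as client:
        # Initial request only retrieves the pagination information
        first = await client.get(SEARCH_URL)
        last_page = get_last_page(BeautifulSoup(first.text, "html.parser"))

        # Build the block of page URLs and request them concurrently
        urls = [f"{SEARCH_URL}?page={n}" for n in range(1, last_page + 1)]
        responses = await asyncio.gather(*(client.get(u) for u in urls))

    # Data extraction happens after all responses arrive
    return [ad for r in responses for ad in extract_ads(BeautifulSoup(r.text, "html.parser"))]


ads = asyncio.run(crawl())
```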
All the data used in the dashboard is loaded directly from the MongoDB collection designed for it. The folder and file structure inside src/dash was designed to load the data from MongoDB only once, and the components import the data from the same source file.
To build the graphs and manipulate the data, the dashboard uses Plotly and Pandas.
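A minimal sketch of that pattern, with hypothetical module and column names, is a single data module that queries MongoDB at import time and is then imported by every component:

```python
# src/dash/data.py (hypothetical module): queried once, imported everywhere
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb", 27017)
df = pd.DataFrame(client["db"]["dashboard"].find({}, {"_id": 0}))

# A component then builds its figure from the shared DataFrame, e.g.:
# from data import df
import plotly.express as px

fig = px.histogram(df, x="price")  # column name is an assumption
```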
This section explains how to run the project.
All the steps here are intended for a bash terminal.
The project setup uses pyenv and poetry.
As mentioned before, this project operates in two blocks, and it is possible to run both of them independently. Using Docker Compose, you can run the dashboard locally connected to the database. The Docker Compose setup includes an entrypoint that populates the database with the JSON file located at scripts/data/data.json if the collection does not already exist in MongoDB.
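Conceptually, the seeding done by that entrypoint amounts to something like the sketch below; the collection name is an assumption, while the JSON path comes from the repository.

```python
import json

from pymongo import MongoClient

client = MongoClient("mongodb", 27017)
db = client["db"]

# Populate MongoDB from the bundled JSON only if the collection does not exist yet
if "dashboard" not in db.list_collection_names():  # collection name assumed
    with open("scripts/data/data.json") as f:
        db["dashboard"].insert_many(json.load(f))
```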
1 - Clone the repo locally:
git clone https://github.com/lealre/crawler-to-dash.git
2 - Access the project directory:
cd crawler-to-dash
To run this properly, it is necessary to create the .env file in the root of the project. An example can be found in .env-example. The default configuration to connect with MongoDB is:
MONGO_HOST='mongodb'
MONGO_PORT=27017
MONGO_DATABASE='db'
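For reference, these are the values a pymongo client would typically use to connect; the snippet below is only a sketch of how they might be read, not the project's actual settings code.

```python
import os

from pymongo import MongoClient

# Read the connection settings from the environment, falling back to the defaults above
client = MongoClient(
    host=os.getenv("MONGO_HOST", "mongodb"),
    port=int(os.getenv("MONGO_PORT", "27017")),
)
db = client[os.getenv("MONGO_DATABASE", "db")]
```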
After completing steps 1 and 2, and with the .env file configured:
Build the image:
docker compose build
Start the container:
docker compose up
Access the local host:
http://localhost:8051/
NOTE: It may be necessary to make the script ./entrypoint.sh executable before building the container:
chmod +x entrypoint.sh
After completing steps 1 and 2, and with the .env file configured:
3 - Set the Python version with pyenv:
pyenv local 3.12.2
4 - Create the virtual environment:
poetry env use 3.12.2
5 - Activate the virtual environment:
poetry shell
6 - Install dependencies:
poetry install
7 - Run the data pipeline, from crawling to dashboard data:
task crawl_to_dash
It is also possible to run each step separately:
To run just the crawler:
task crawl
Depending on the .env variables below, the data will be stored in one or more of the possible destinations:
USE_STORAGE_LOCAL=False
USE_STORAGE_MONGO=False
USE_STORAGE_AWS_S3=False
To run just the data consolidation:
task consolidate
To generate the dash data:
task dash_data
The script mongo_backup.sh dumps the database to local storage as a file. It uses the paths and container name specified in the .env file.
The script first dumps the database content to a file inside the container, and it then copies the dump file from the container to the local storage.
Make the script executable:
chmod +x mongo_backup.sh
Execute the backup script:
./mongo_backup.sh
The .env file should contain the following variables:
CONTAINER_NAME="container-name"
BACKUP_PATH="/path/in/container"
LOCAL_BACKUP_PATH="/local/path/to/export"
- Improve the dashboard's style, including font size, colors, and callback interactions.