This project involves setting up a data engineering pipeline to collect, store, process, and analyze Property and Locality data using Hadoop, Docker, MySQL, Tailscale, and Selenium.
- Objective: Analyze Property and Locality data to derive meaningful insights.
- Scope: Collect data through web scraping, store in HDFS, process using Hadoop, and analyze with MySQL.
Here is a demo video of the Pipeline:
- Step: Installed Ubuntu on VirtualBox for each VM.
- Action: Configured each VM with necessary packages including Docker and Docker Compose.
- Step: Installed and configured Tailscale on all VMs.
- Action: Created a secure virtual network to enable communication between VMs.
- Step: Initialized Docker Swarm on the master node and joined worker nodes.
- Action: Used Docker Swarm for container orchestration.
- Step: Installed Selenium and Chrome WebDriver.
- Action: Developed scripts to scrape Property and Locality data from various websites.
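  A minimal sketch of such a scraping script is shown below; the URL and CSS selectors are placeholders rather than the actual sites targeted by the project.

```python
# Minimal Selenium sketch (hypothetical URL and selectors; the real scripts
# target the actual property/locality listing sites).
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chrome without a display
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/property-listings")  # placeholder URL
    rows = []
    for card in driver.find_elements(By.CSS_SELECTOR, ".listing-card"):
        rows.append({
            "title": card.find_element(By.CSS_SELECTOR, ".title").text,
            "price": card.find_element(By.CSS_SELECTOR, ".price").text,
            "locality": card.find_element(By.CSS_SELECTOR, ".locality").text,
        })
    with open("property_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "locality"])
        writer.writeheader()
        writer.writerows(rows)
finally:
    driver.quit()
```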
- Step: Set up HDFS on the ready-to-go Big Data cluster (Hadoop + Hadoop Streaming + Spark + PySpark + Jupyter Notebook) running on Docker and Docker Swarm, provided by Prof. Dr.-Ing. Binh Vu. Change into the `spark cluster` folder (`cd spark cluster`) and follow its README.md to begin the setup.
- Action: Stored scraped data in HDFS with appropriate partitioning and replication.
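  As an illustration, the scraped CSVs can be pushed into HDFS with a small helper like the one below; the HDFS paths, date-based partition layout, and replication factor of 2 are assumptions rather than the project's exact settings.

```python
# Minimal sketch: upload scraped CSVs into HDFS via the hdfs CLI.
# Paths, partition layout, and replication factor are illustrative assumptions.
import subprocess
from datetime import date

LOCAL_FILE = "property_data.csv"
HDFS_DIR = f"/data/property/dt={date.today().isoformat()}"  # partition by scrape date

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR])
run(["hdfs", "dfs", "-put", "-f", LOCAL_FILE, HDFS_DIR])
run(["hdfs", "dfs", "-setrep", "-w", "2", HDFS_DIR])  # replicate across the workers
```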
- Step: Developed and executed Hadoop jobs for data cleaning and transformation.
- Action: Used MapReduce for distributed processing.
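  Because the cluster image ships with Hadoop Streaming, a cleaning pass can be written as a plain Python mapper that reads stdin and writes stdout. The sketch below assumes a simple `title,price,locality` CSV layout, which is an illustration rather than the project's exact schema.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper sketch: drop malformed rows and
# normalise the price field. The title,price,locality layout is assumed.
import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split(",")
    if len(parts) != 3:
        continue                              # skip malformed rows
    title, price, locality = (p.strip() for p in parts)
    price = price.lstrip("₹$")                # strip currency symbols
    if not price.replace(".", "", 1).isdigit():
        continue                              # skip rows without a numeric price
    # Key by locality so duplicate records group together at the reducer.
    print(f"{locality}\t{title},{price}")
```

  A matching `reducer.py` (for example, deduplicating records per locality) completes the job, which is then submitted with the Hadoop Streaming jar, roughly as `hadoop jar .../hadoop-streaming-*.jar -input /data/property -output /data/property_clean -mapper mapper.py -reducer reducer.py -files mapper.py,reducer.py` (paths assumed).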
- Step: Conducted data read/write operations while intentionally shutting down a worker node.
- Action: Verified system resilience and fault tolerance.
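  One way to script this check, sketched below with an assumed HDFS path, is to count the records before and after stopping a worker and confirm that the data remains fully readable.

```python
# Minimal sketch of the resilience check: count records before and after a
# worker node is taken down. The HDFS path is an assumed placeholder.
import subprocess

HDFS_PATH = "/data/property_clean/part-*"

def count_hdfs_records(path: str) -> int:
    """Stream the files with `hdfs dfs -cat` and count the lines."""
    out = subprocess.run(
        ["hdfs", "dfs", "-cat", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

before = count_hdfs_records(HDFS_PATH)
input("Shut down one worker VM (or drain it from the Swarm), then press Enter...")
after = count_hdfs_records(HDFS_PATH)
print(f"records before: {before}, after: {after}, match: {before == after}")
```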
- Step: Created a relational database schema in MySQL.
- Action: Developed scripts to ingest data from HDFS to MySQL.
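  A compact sketch of this ingest path is given below; the single-table schema, credentials, and local CSV path (e.g. pulled from HDFS beforehand with `hdfs dfs -get`) are simplifying assumptions.

```python
# Minimal sketch: create a simplified schema and load cleaned records into MySQL.
# Table layout, credentials, and file path are illustrative assumptions.
import csv
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost", user="pipeline", password="secret", database="property_db"
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS property (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        price DECIMAL(12, 2),
        locality VARCHAR(128)
    )
""")

with open("cleaned_property_data.csv", newline="") as f:
    rows = [(r["title"], r["price"], r["locality"]) for r in csv.DictReader(f)]

cur.executemany(
    "INSERT INTO property (title, price, locality) VALUES (%s, %s, %s)", rows
)
conn.commit()
cur.close()
conn.close()
```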
- Step: Developed SQL queries to extract insights from the database.
- Action: Generated graphs and tables to present the results.
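  As an example of the kind of query and chart produced, the sketch below computes the average price per locality using the simplified schema assumed above.

```python
# Minimal sketch: average price per locality, rendered as a bar chart.
# Table and column names follow the simplified schema assumed earlier.
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("mysql+mysqlconnector://pipeline:secret@localhost/property_db")

df = pd.read_sql(
    """
    SELECT locality, AVG(price) AS avg_price
    FROM property
    GROUP BY locality
    ORDER BY avg_price DESC
    LIMIT 10
    """,
    engine,
)

df.plot.bar(x="locality", y="avg_price", legend=False)
plt.ylabel("Average price")
plt.tight_layout()
plt.savefig("avg_price_by_locality.png")
```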
- Note: A special thank you to Prof. Dr.-Ing. Binh Vu for providing the ready-to-go Spark cluster image used in this project.