This project involves setting up a data engineering pipeline to collect, store, process, and analyze Property and Locality data using Hadoop, Docker, MySQL, Tailscale, and Selenium.
- Objective: Analyze Property and Locality data to derive meaningful insights.
- Scope: Collect data through web scraping, store in HDFS, process using Hadoop, and analyze with MySQL.
Here is a demo video of the Pipeline:
- Step: Installed Ubuntu on VirtualBox for each VM.
- Action: Configured each VM with necessary packages including Docker and Docker Compose.
- Step: Installed and configured Tailscale on all VMs.
- Action: Created a secure virtual network to enable communication between VMs.
- Step: Initialized Docker Swarm on the master node and joined worker nodes.
- Action: Used Docker Swarm for container orchestration.
- Step: Installed Selenium and Chrome WebDriver.
- Action: Developed scripts to scrape Property and Locality data from various websites.
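  A minimal sketch of such a scraping script is shown below; the URL and CSS selectors are placeholders rather than the actual sites targeted by the project.

```python
# Minimal Selenium sketch (hypothetical URL and selectors; the real scripts
# target the actual property/locality listing sites).
import csv
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run Chrome without a display
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://example.com/property-listings")  # placeholder URL
    rows = []
    for card in driver.find_elements(By.CSS_SELECTOR, ".listing-card"):
        rows.append({
            "title": card.find_element(By.CSS_SELECTOR, ".title").text,
            "price": card.find_element(By.CSS_SELECTOR, ".price").text,
            "locality": card.find_element(By.CSS_SELECTOR, ".locality").text,
        })
    with open("property_data.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price", "locality"])
        writer.writeheader()
        writer.writerows(rows)
finally:
    driver.quit()
```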
- Step: Set up HDFS on the ready-to-go Big Data cluster (Hadoop + Hadoop Streaming + Spark + PySpark + Jupyter Notebook) running on Docker and Docker Swarm, provided by Prof. Dr.-Ing. Binh Vu. Change into the `spark cluster` folder (`cd spark cluster`) and follow its README.md to begin the setup.
- Action: Stored scraped data in HDFS with appropriate partitioning and replication.
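  As an illustration, the scraped CSVs can be pushed into HDFS with a small helper like the one below; the HDFS paths, date-based partition layout, and replication factor of 2 are assumptions rather than the project's exact settings.

```python
# Minimal sketch: upload scraped CSVs into HDFS via the hdfs CLI.
# Paths, partition layout, and replication factor are illustrative assumptions.
import subprocess
from datetime import date

LOCAL_FILE = "property_data.csv"
HDFS_DIR = f"/data/property/dt={date.today().isoformat()}"  # partition by scrape date

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR])
run(["hdfs", "dfs", "-put", "-f", LOCAL_FILE, HDFS_DIR])
run(["hdfs", "dfs", "-setrep", "-w", "2", HDFS_DIR])  # replicate across the workers
```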
- Step: Developed and executed Hadoop jobs for data cleaning and transformation.
- Action: Used MapReduce for distributed processing.
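  Because the cluster image ships with Hadoop Streaming, a cleaning pass can be written as a plain Python mapper that reads stdin and writes stdout. The sketch below assumes a simple `title,price,locality` CSV layout, which is an illustration rather than the project's exact schema.

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop Streaming mapper sketch: drop malformed rows and
# normalise the price field. The title,price,locality layout is assumed.
import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split(",")
    if len(parts) != 3:
        continue                              # skip malformed rows
    title, price, locality = (p.strip() for p in parts)
    price = price.lstrip("₹$")                # strip currency symbols
    if not price.replace(".", "", 1).isdigit():
        continue                              # skip rows without a numeric price
    # Key by locality so duplicate records group together at the reducer.
    print(f"{locality}\t{title},{price}")
```

  A matching `reducer.py` (for example, deduplicating records per locality) completes the job, which is then submitted with the Hadoop Streaming jar, roughly as `hadoop jar .../hadoop-streaming-*.jar -input /data/property -output /data/property_clean -mapper mapper.py -reducer reducer.py -files mapper.py,reducer.py` (paths assumed).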
- Step: Conducted data read/write operations while intentionally shutting down a worker node.
- Action: Verified system resilience and fault tolerance.
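  One way to script this check, sketched below with an assumed HDFS path, is to count the records before and after stopping a worker and confirm that the data remains fully readable.

```python
# Minimal sketch of the resilience check: count records before and after a
# worker node is taken down. The HDFS path is an assumed placeholder.
import subprocess

HDFS_PATH = "/data/property_clean/part-*"

def count_hdfs_records(path: str) -> int:
    """Stream the files with `hdfs dfs -cat` and count the lines."""
    out = subprocess.run(
        ["hdfs", "dfs", "-cat", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())

before = count_hdfs_records(HDFS_PATH)
input("Shut down one worker VM (or drain it from the Swarm), then press Enter...")
after = count_hdfs_records(HDFS_PATH)
print(f"records before: {before}, after: {after}, match: {before == after}")
```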
- Step: Created a relational database schema in MySQL.
- Action: Developed scripts to ingest data from HDFS to MySQL.
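  A compact sketch of this ingest path is given below; the single-table schema, credentials, and local CSV path (e.g. pulled from HDFS beforehand with `hdfs dfs -get`) are simplifying assumptions.

```python
# Minimal sketch: create a simplified schema and load cleaned records into MySQL.
# Table layout, credentials, and file path are illustrative assumptions.
import csv
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost", user="pipeline", password="secret", database="property_db"
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS property (
        id INT AUTO_INCREMENT PRIMARY KEY,
        title VARCHAR(255),
        price DECIMAL(12, 2),
        locality VARCHAR(128)
    )
""")

with open("cleaned_property_data.csv", newline="") as f:
    rows = [(r["title"], r["price"], r["locality"]) for r in csv.DictReader(f)]

cur.executemany(
    "INSERT INTO property (title, price, locality) VALUES (%s, %s, %s)", rows
)
conn.commit()
cur.close()
conn.close()
```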
- Step: Developed SQL queries to extract insights from the database.
- Action: Generated graphs and tables to present the results.
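  As an example of the kind of query and chart produced, the sketch below computes the average price per locality using the simplified schema assumed above.

```python
# Minimal sketch: average price per locality, rendered as a bar chart.
# Table and column names follow the simplified schema assumed earlier.
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine("mysql+mysqlconnector://pipeline:secret@localhost/property_db")

df = pd.read_sql(
    """
    SELECT locality, AVG(price) AS avg_price
    FROM property
    GROUP BY locality
    ORDER BY avg_price DESC
    LIMIT 10
    """,
    engine,
)

df.plot.bar(x="locality", y="avg_price", legend=False)
plt.ylabel("Average price")
plt.tight_layout()
plt.savefig("avg_price_by_locality.png")
```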
- Note: A special thank you to Prof. Dr.-Ing. Binh Vu for providing the ready-to-go Spark cluster image used in this project.