This project demonstrates an ETL (Extract, Transform, Load) pipeline for processing and analyzing taxi trip data, orchestrated with Mage AI and transformed with Pandas. The pipeline extracts data from APIs and local files, transforms it into analytical tables, and loads it into Google BigQuery for visualization in Looker Studio.
Key highlights:
- Designed to handle large datasets (1.4 million rows of taxi trip data).
- Built for efficient data processing on a cost-effective Google Compute Engine virtual machine (VM).
- Uses a modular architecture that splits pipeline nodes by table, reducing memory pressure and cost.
The ETL process is divided into three main stages:
- Extract: Taxi trip data and location lookup tables are collected from APIs and source files (see the loader sketch after this list).
- Transform: Data is processed into normalized tables, including:
  - Fact table
  - Dimension tables (Date Time, Drop-off Location, Passenger Count, Payment Type, Trip Distance, Pickup Location, Rate Code)
- Load: Processed tables are loaded into BigQuery for analysis.
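As an illustration of the extract stage, a Mage data-loader block might look like the sketch below, which follows Mage's standard loader template. The URL and the `tpep_*` column names are assumptions (they match the common NYC yellow-taxi schema), not necessarily this project's exact source.

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_taxi_data(*args, **kwargs) -> pd.DataFrame:
    # Hypothetical source URL; substitute the actual taxi trip dataset.
    url = 'https://example.com/yellow_tripdata_2024-01.csv'
    response = requests.get(url)
    response.raise_for_status()

    # Parse timestamps at load time so downstream blocks receive datetimes.
    return pd.read_csv(
        io.StringIO(response.text),
        parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
    )


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```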
Because of the large dataset (1.4 million rows) and the limited resources of a low-cost VM:
- Pipeline splitting: Each table in the transformation step is built in its own node to reduce peak memory usage (see the transformer sketch below).
- This approach minimizes cost while keeping performance predictable as the pipeline scales.
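As a sketch of one such per-table node, the transformer below builds only the Date Time dimension and returns just that table, so the block's memory footprint stays small. The `tpep_*` column names and derived fields are assumptions based on the common yellow-taxi schema.

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def build_datetime_dim(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Keep only the columns this dimension needs; the rest can be freed.
    datetime_dim = (
        df[['tpep_pickup_datetime', 'tpep_dropoff_datetime']]
        .drop_duplicates()
        .reset_index(drop=True)
    )

    # Derive date parts once here instead of at query time in BigQuery.
    pickup = datetime_dim['tpep_pickup_datetime']
    datetime_dim['pickup_hour'] = pickup.dt.hour
    datetime_dim['pickup_day'] = pickup.dt.day
    datetime_dim['pickup_month'] = pickup.dt.month
    datetime_dim['pickup_weekday'] = pickup.dt.weekday

    # Surrogate key that the fact table joins on.
    datetime_dim['datetime_id'] = datetime_dim.index

    return datetime_dim
```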
- Mage AI: Pipeline orchestration.
- Google Cloud Platform (GCP):
  - Compute Engine: For running the pipeline.
  - BigQuery: For data storage and analytics.
  - Looker Studio: For data visualization.
- Pandas: For data transformation.
- Compute Engine VM: Hosts the pipeline execution.
- Mage AI: Manages the workflow and scheduling of tasks.
- BigQuery: Receives the transformed data for analytics (see the exporter sketch after this list).
- Looker Studio: Connects to BigQuery for visualization and dashboards.
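The load step can follow Mage's stock BigQuery exporter template, sketched below; the table ID and the `io_config.yaml` profile name are placeholders to be replaced with the project's own values.

```python
from os import path

from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from mage_ai.settings.repo import get_repo_path
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_bigquery(df: DataFrame, **kwargs) -> None:
    # Placeholder table ID: '<project>.<dataset>.<table>'.
    table_id = 'your-project.taxi_analytics.datetime_dim'
    config_path = path.join(get_repo_path(), 'io_config.yaml')

    BigQuery.with_config(ConfigFileLoader(config_path, 'default')).export(
        df,
        table_id,
        if_exists='replace',  # overwrite each run; use 'append' for incremental loads
    )
```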
- Data is available in BigQuery for querying.
- Dashboards in Looker Studio provide interactive analytics and insights (https://lookerstudio.google.com/s/pGa41KkFdh4).
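For ad-hoc checks outside Looker Studio, the loaded tables can also be queried from Python with the official `google-cloud-bigquery` client. The query below is illustrative only; the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

# Uses application-default credentials and the default GCP project.
client = bigquery.Client()

# Example analytical query: average fare by pickup hour (placeholder names).
sql = """
    SELECT d.pickup_hour, AVG(f.fare_amount) AS avg_fare
    FROM `your-project.taxi_analytics.fact_table` AS f
    JOIN `your-project.taxi_analytics.datetime_dim` AS d
      ON f.datetime_id = d.datetime_id
    GROUP BY d.pickup_hour
    ORDER BY d.pickup_hour
"""

df = client.query(sql).to_dataframe()  # needs the pandas/db-dtypes extras installed
print(df.head())
```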