Note
This project uses Apache Airflow to orchestrate an ETL (Extract, Transform, Load) process on data from the Stack Overflow API. The primary objective is to determine the most prominent tags on Stack Overflow for the current month. The workflow fetches data from the API, transforms it with Pandas, and loads the processed data into an Amazon S3 bucket.
- Data warehouse knowledge
- Database design
- ETL
- Tech/Framework used
  - SQL
  - Python SDK for AWS (boto3)
  - Airflow: automate the pipeline
- Project Introduction
- In this project, my aim is to apply the knowledge I've acquired about Airflow. The data used in this ETL process is sourced from the Stack Overflow API, accessible here. The extraction revolves around the question, "What are the most prominent tags on Stack Overflow this month?" The workflow orchestrated by Airflow involves:
- Fetching data from the Stack Overflow API endpoint
- Employing Pandas for data transformation
- Loading the processed data into an Amazon S3 Bucket
- The simple pipeline is described below.
- Hands-on
CREATE AWS S3 BUCKET
- Log in to AWS, search for S3, and select S3 (Scalable Storage in the Cloud).
- From the S3 dashboard, click Create bucket.
- Enter a bucket name and choose an AWS region.
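The same bucket can also be created programmatically with boto3 instead of through the console. A minimal sketch, assuming a placeholder bucket name and region (not the project's actual values):

```python
import boto3

# Placeholder bucket name and region -- replace with your own values.
BUCKET_NAME = "stackoverflow-etl-demo"
REGION = "ap-southeast-1"

s3_client = boto3.client("s3", region_name=REGION)

# us-east-1 is the default location and must not be passed as a LocationConstraint.
if REGION == "us-east-1":
    s3_client.create_bucket(Bucket=BUCKET_NAME)
else:
    s3_client.create_bucket(
        Bucket=BUCKET_NAME,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )
```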
FETCH DATA FROM API ENDPOINT TO AWS S3 BUCKET
- Install all the required libraries (requests, pandas, boto3, and Apache Airflow).
- Create a function get_stackoverflow_data() to fetch the data using the requests library (a sketch follows after this list).
- ti stands for task instance. It is used to call xcom_push and xcom_pull. XCom stands for cross-communication, which lets tasks share data with each other.
- xcom_push pushes data to XCom storage on the task instance.
- xcom_pull pulls data from XCom storage on the task instance.
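A minimal sketch of what get_stackoverflow_data() might look like. The endpoint shown is the public Stack Exchange API's /2.3/tags method sorted by popularity; the exact parameters and the XCom key are my assumptions, not taken from the project:

```python
import requests


def get_stackoverflow_data(ti):
    """Fetch popular Stack Overflow tags and push the raw payload to XCom."""
    # Assumed endpoint and parameters; restricting the result to the current
    # month is assumed to happen here or in the transform step.
    url = "https://api.stackexchange.com/2.3/tags"
    params = {
        "site": "stackoverflow",
        "order": "desc",
        "sort": "popular",
        "pagesize": 100,
    }
    response = requests.get(url, params=params)
    response.raise_for_status()

    # Push the raw JSON items to XCom so the next task can pull them.
    ti.xcom_push(key="stackoverflow_data", value=response.json()["items"])
```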
TRANSFORM DATA
- Create a function transform_data() to remove all the redundant columns.
- After transforming, push the data back to XCom storage on the task instance for the next task, as in the sketch below.
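A sketch of transform_data() under the same assumptions; the upstream task id, the XCom keys, and the columns treated as redundant are placeholders:

```python
import pandas as pd


def transform_data(ti):
    """Pull the raw payload from XCom, drop redundant columns, push the result back."""
    raw_items = ti.xcom_pull(task_ids="get_stackoverflow_data", key="stackoverflow_data")

    df = pd.DataFrame(raw_items)

    # Assumed set of useful columns; everything else is treated as redundant.
    columns_to_keep = [c for c in ("name", "count") if c in df.columns]
    df = df[columns_to_keep]

    # Push the cleaned records back to XCom for the load task.
    ti.xcom_push(key="transformed_data", value=df.to_dict(orient="records"))
```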
LOAD TRANSFORMED DATA TO AWS S3 BUCKET
- Boto3 is the Python SDK for AWS, which allows us to read, create, delete, and update AWS resources from Python code.
- Create a temporary file, put the transformed data in it, and then push it to AWS S3 (see the sketch below).
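A sketch of the load step using tempfile and boto3; the bucket name, object key, and upstream task id are placeholders:

```python
import csv
import tempfile

import boto3

BUCKET_NAME = "stackoverflow-etl-demo"   # placeholder, not the project's bucket
S3_KEY = "stackoverflow/top_tags.csv"    # placeholder object key


def load_to_s3(ti):
    """Write the transformed records to a temporary CSV file and upload it to S3."""
    records = ti.xcom_pull(task_ids="transform_data", key="transformed_data")

    s3_client = boto3.client("s3")

    # Put the transformed data into a temporary file, then push that file to S3.
    with tempfile.NamedTemporaryFile(mode="w", newline="", suffix=".csv") as tmp:
        writer = csv.DictWriter(tmp, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
        tmp.flush()
        s3_client.upload_file(tmp.name, BUCKET_NAME, S3_KEY)
```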
AIRFLOW DATA PIPELINE DAG
- DAG default arguments: specify default properties for every task, such as owner, retries, retry_delay...
- DAG instance: we create a DAG object to nest our tasks into and pass all basic properties such as dag_id, schedule_interval...
- DAG tasks and dependencies: this is where tasks are created using BashOperator or PythonOperator. Tasks and task flow are specified here, and dependencies are set with the ">>" operator, as in the sketch below.
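A minimal sketch of how these pieces might fit together, reusing the callables sketched above and assuming Airflow 2.x; dag_id, start_date, and schedule_interval are placeholders rather than the project's actual values:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# DAG default arguments: default properties applied to every task.
default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# DAG instance: the object our tasks are nested into.
with DAG(
    dag_id="stackoverflow_etl",
    default_args=default_args,
    start_date=datetime(2023, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:

    # DAG tasks: one PythonOperator per step of the pipeline.
    extract = PythonOperator(
        task_id="get_stackoverflow_data",
        python_callable=get_stackoverflow_data,
    )
    transform = PythonOperator(
        task_id="transform_data",
        python_callable=transform_data,
    )
    load = PythonOperator(
        task_id="load_to_s3",
        python_callable=load_to_s3,
    )

    # Dependencies: extract, then transform, then load.
    extract >> transform >> load
```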
- Go to localhost where the Airflow UI is running, log in, and check for the created DAG.
- Dark green means the DAG ran successfully and all tasks completed.