This repository implements an ETL pipeline for Reddit data using Airflow and AWS cloud services.
What the pipeline does:
- Extract Reddit data through the Reddit API.
- Load the data to an S3 bucket.
- Transform the data using AWS Glue.
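The extract and load steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: it assumes PRAW as the Reddit client and boto3 for S3, and the field list, function names, and the bucket and key arguments are all hypothetical. The Glue transformation runs in AWS and is not shown.

```python
import csv
import io

# Hypothetical selection of post fields; the repository may keep a different set.
POST_FIELDS = ("id", "title", "score", "num_comments", "author", "created_utc")


def extract_posts(client_id, client_secret, user_agent, subreddit, limit=100):
    """Fetch the day's top posts from a subreddit (assumes the PRAW library)."""
    import praw  # imported lazily so the rest of the sketch runs without it

    reddit = praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )
    posts = reddit.subreddit(subreddit).top(time_filter="day", limit=limit)
    # Keep only the selected attributes; str() turns the author object into a name.
    return [
        {field: str(getattr(post, field)) for field in POST_FIELDS}
        for post in posts
    ]


def posts_to_csv(posts):
    """Serialize a list of post dicts to CSV text, ignoring unknown keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=POST_FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(posts)
    return buffer.getvalue()


def load_to_s3(csv_text, bucket, key):
    """Upload the CSV text to S3 (assumes boto3 and configured AWS credentials)."""
    import boto3  # imported lazily for the same reason as praw

    boto3.client("s3").put_object(
        Bucket=bucket, Key=key, Body=csv_text.encode("utf-8")
    )
```

In an Airflow DAG, each of these functions would typically be wrapped in its own task so that a failed upload can be retried without calling the Reddit API again.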
- Clone the repository:
git clone https://github.com/manuelandersen/reddit-pipeline.git
- Create a virtual environment (optional but recommended):
python3 -m venv venv
source venv/bin/activate
- Install the dependencies:
pip install -r requirements.txt
- Rename the configuration file:
mv config/config.conf.example config/config.conf
Warning: make sure to add the credentials you need to the new config/config.conf file.
- Build and run the Docker container:
docker compose up -d --build
- Open the Airflow web UI:
In your browser, go to http://localhost:8080. You will see the DAGs and can run them from there.
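If a DAG fails right away, a blank credential in config/config.conf is a likely cause. The sketch below lists settings that were left empty. It assumes the file uses INI syntax readable by Python's configparser; the function name and the section and key names in the usage note are hypothetical, not taken from this repository.

```python
import configparser


def find_empty_settings(config_text):
    """Return 'section.key' names for every setting whose value is blank."""
    # Interpolation is disabled so that '%' characters in secrets are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return [
        f"{section}.{key}"
        for section in parser.sections()
        for key, value in parser.items(section)
        if not value.strip()
    ]
```

For example, calling `find_empty_settings(open("config/config.conf").read())` on a file with a hypothetical `[api_keys]` section and a blank `reddit_secret_key` entry would return `["api_keys.reddit_secret_key"]`.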