⚽ Football Transfermarkt Data Pipeline

📋 Table of contents

Introduction
Problem description
Architecture Overview
Dashboard
Usage
Further Improvements

🏗️ Architecture

📖 Introduction

This repository represents my final project for the Data Engineer Zoomcamp. It aims to analyze a large dataset of football data scraped from the page transfermarkt.es.

❓ Problem Description

The project aims to answer several questions regarding players and their statistics. The focus is on market value, goals, and assists that players have accumulated over the years of their football careers.

🏗️ Architecture Overview

We use 🧙Mage as an orchestrator for the whole pipeline, mounted inside a 🐳Docker container.

📥 Data Source

Data is inside a Kaggle dataset, and consists in 9 csv files:

apearances.csv (124.22 MB)
club_games.csv (8.45 MB)
clubs.csv (96.06 MB)
competitions.csv (7.47 MB)
game_events.csv (75.46 MB)
game_lineups.csv (244.38 MB)
games.csv (19.88 MB)
player_valuations.csv (15.8 MB)
players.csv (10.33 MB)

📥 Data Ingestion

Batch: for the data ingestion we use the Kaggle API to extract all 9 CSV files into a 🧙Mage block to prepare it for the data lake.
Batch: for the data warehouse, we use 🧙Mage to download the data from the Google Cloud Storage bucket and prepare it for loading.

💾 Data Storage

Both the data lake and the data warehouse are managed by Terraform as a way to learn IaC.

Google cloud storage: a Google Cloud Storage bucket was created using Terraform. Data is store as Parquet files to reduce memory consumption. All datasets were stored as whole file, except for game_lineups, wich was partitioned by year and month.
Big query: a BigQuery database was created using terraform.

🔄 Transformations

We use dbt cloud to manage all the transformations of the data inside the data warehouse. We clean and join the apearances and the player_valuations files to get the aggregations by year of the statistics that we care about.

📊 Dashboard

Warning

Since i was using the free credits of Google Cloud, all the information was deleted and the report is no longer available.

Here is the report. And here is a quick look at how it looks:

🔧 Usage

You can refer to this video where we provide an explicit tour of the project and how it works.

If you prefer instructions, check the README of the docs section.

🚀 Further Improvements

Add unit test, and integrate them into the CI/CD pipeline (besides the dbt ones).
Add a Makefile.

Name		Name	Last commit message	Last commit date
Latest commit History 69 Commits
.github/workflows		.github/workflows
dbt		dbt
docs		docs
football-pipeline		football-pipeline
terraform		terraform
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
TODO.md		TODO.md
docker-compose.yml		docker-compose.yml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

⚽ Football Transfermarkt Data Pipeline

📋 Table of contents

🏗️ Architecture

📖 Introduction

❓ Problem Description

🏗️ Architecture Overview

📥 Data Source

📥 Data Ingestion

💾 Data Storage

🔄 Transformations

📊 Dashboard

🔧 Usage

🚀 Further Improvements

About

Releases

Packages

Languages

License

manuelandersen/football-pipeline

Folders and files

Latest commit

History

Repository files navigation

⚽ Football Transfermarkt Data Pipeline

📋 Table of contents

🏗️ Architecture

📖 Introduction

❓ Problem Description

🏗️ Architecture Overview

📥 Data Source

📥 Data Ingestion

💾 Data Storage

🔄 Transformations

📊 Dashboard

🔧 Usage

🚀 Further Improvements

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages