This project was the final assignment for the Hands-On Advanced Analytics with Apache Spark course, a 5-week training focused on big data technologies. The project was completed at FII Practic.
- Python
- PySpark
- Jupyter Notebook
- The dataset contained roughly 3.5 million entries (3,549,246).
- The primary objective of the project was to clean the dataset, addressing both inconsistencies intentionally introduced by our trainers and inconsistencies of the kind found in real-world data.
- Upon completion of the cleaning process, we performed data aggregation.
For detailed tasks, please refer to the Tasks document.