This is an MSSQL Data Warehouse and ETL implementation on a specially formatted Water Quality dataset from DEFRA, UK.
A data warehouse is a central repository of information that can be analyzed to make more informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence (https://aws.amazon.com/what-is/data-warehouse).
This repository is about a data warehouse project carried out using an ETL (extract, transform, load) process on a specially formatted WaterQuality dataset from the Department for Environment, Food & Rural Affairs (DEFRA), UK. The dataset is provided as an MS Access (.accdb) file containing 17 tables, each of which had to be exported into an individual CSV file.
The data warehouse consists of a staging table, nine (9) dimension tables, and one fact table. Among the dimension tables is an extended Time table to aid time-based BI analysis. The data warehouse was created in a Microsoft SQL Server 2019 database environment. The source dataset was exported into CSV files and imported into corresponding tables in the database using the SQL Server Management Studio (SSMS) Import Wizard, while the main ETL process was carried out in a Jupyter Notebook (Python environment) connected to the data warehouse through a pyodbc connection and cursor.
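The pyodbc connection described above can be sketched as follows. The server name, database name (`WaterQualityDW`), staging table name, and ODBC driver version are assumptions here, not taken from the project; adjust them to your own setup.

```python
# Minimal sketch of the Jupyter-to-SQL-Server connection used for the ETL.
# Server, database, table, and driver names below are assumptions.

def build_connection_string(server: str, database: str,
                            driver: str = "ODBC Driver 17 for SQL Server") -> str:
    """Assemble a trusted-connection ODBC string for SQL Server."""
    return (
        f"DRIVER={{{driver}}};"
        f"SERVER={server};"
        f"DATABASE={database};"
        "Trusted_Connection=yes;"
    )

conn_str = build_connection_string("localhost", "WaterQualityDW")
print(conn_str)

# In the notebook, the connection and cursor are opened like this
# (requires pyodbc and a reachable SQL Server instance):
#
# import pyodbc
# conn = pyodbc.connect(conn_str)
# cursor = conn.cursor()
# cursor.execute("SELECT COUNT(*) FROM StagingWaterQuality")  # assumed table name
# print(cursor.fetchone()[0])
```

The transform and load steps then run T-SQL statements through the same cursor, committing with `conn.commit()` after each batch.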
Finally, SQL queries were run on the data warehouse star schema using the project questions to gain insights into the data.
These are the objectives of the project:
- To design a data warehouse in a Microsoft SQL Server database environment for the WaterQuality dataset to enable analysis.
- To implement an ETL process and demonstrate its use cases, especially in the transform and load phases.
- To demonstrate the use of a Python environment to interact with the data warehouse.
The following insights were sought from the dataset:
- The list of water sensors measured by type of sensor by month
- The number of sensor measurements collected by type of sensor by week
- The number of measurements made by location by month
- The average number of pH measurements by year
- The average value of Nitrate measurements by locations by year
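As a sketch of how one of these questions translates into a query against the star schema, here is the average Nitrate value by location and year, written as a T-SQL string the notebook could execute through pyodbc. The fact and dimension table/column names (`FactMeasurement`, `DimLocation`, `DimTime`, `DimDeterminand`, and their keys) are illustrative assumptions, not the project's actual names.

```python
# Hypothetical star-schema query for "average Nitrate value by location by
# year". All table and column names are assumptions for illustration.

NITRATE_BY_LOCATION_YEAR = """
SELECT
    l.LocationName,
    t.Year,
    AVG(f.MeasurementValue) AS AvgNitrate
FROM FactMeasurement AS f
JOIN DimLocation    AS l ON f.LocationKey    = l.LocationKey
JOIN DimTime        AS t ON f.TimeKey        = t.TimeKey
JOIN DimDeterminand AS d ON f.DeterminandKey = d.DeterminandKey
WHERE d.DeterminandName = 'Nitrate'
GROUP BY l.LocationName, t.Year
ORDER BY l.LocationName, t.Year;
"""

print(NITRATE_BY_LOCATION_YEAR.strip())

# Run through the same pyodbc cursor as the rest of the ETL:
# rows = cursor.execute(NITRATE_BY_LOCATION_YEAR).fetchall()
```

The other project questions follow the same pattern: join the fact table to the relevant dimensions, filter on the dimension attributes, and aggregate by the extended Time table's month, week, or year columns.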
- Here is the Jupyter Notebook (Python environment) used to carry out the data cleaning and ETL.
- For reference purposes, here are all the T-SQL scripts used throughout the project.
- For code comparison, here is the Oracle SQL equivalent of the Jupyter Notebook mentioned above.
- If you need to see exactly how I implemented it in Oracle DW, see this repository.
Enjoy!