As the global business landscape increasingly shifts towards data-driven decision-making and artificial intelligence, gathering and structuring data in an organized manner is crucial for every organization. With the technologies at our disposal, building data pipelines and data storage has become fast and easy. But it is not always economically viable, especially for growing businesses and startups: immediate investment in high-end software or cloud solutions may not be feasible, and in some cases the workload simply does not warrant large-scale software solutions and can be handled effectively with in-house, open-source tools.
In this project I illustrate how to implement automated data pipelines with open-source tools. The aim is to demonstrate how simple solutions and proactive initiatives let us start gathering data from day one and offer valuable insights for strategic decision-making.
The basic idea behind these pipelines is simple. Once a data storage area is determined, a scripting tool performs all the data manipulation and ingestion tasks, and a task scheduler automates the execution of those scripts. The stack used here is:
- Python as the scripting language, with a few libraries (e.g. pyodbc/SQLAlchemy for communicating with the database, smtplib (SMTP) for emailing, etc.)
- SQL Server as the database
- Windows Task Scheduler as the automation tool
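Taken together, every pipeline script in this setup reduces to the same building blocks: connect to SQL Server, run some SQL, and notify stakeholders. The sketch below is a minimal illustration of those blocks only; the server, database, SMTP host, and email addresses are placeholders, not the project's actual configuration.

```python
import smtplib
from email.message import EmailMessage

import pyodbc

# Placeholder connection details -- substitute your own server and database.
CONN_STR = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_sql_server;DATABASE=my_datamart;Trusted_Connection=yes;"
)

def run_sql(statement):
    """Open a connection, run one SQL statement, and commit."""
    conn = pyodbc.connect(CONN_STR)
    try:
        conn.cursor().execute(statement)
        conn.commit()
    finally:
        conn.close()

def notify(subject, body, to="stakeholders@example.com"):
    """Send a plain-text status email over SMTP (addresses are placeholders)."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "pipeline@example.com"
    msg["To"] = to
    msg.set_content(body)
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)
```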
In this project I implement two different types of data pipelines:
- One in which data is loaded daily and incrementally from one database to another (project_demo_1).
- One in which data is loaded from a file and upserted into a database table (project_demo_2).
In this project the requirement is to incrementally load a table from a transactional data source into a data mart that serves all data science and BI needs. Because both sources live on the same server, a stored procedure was written to move the data from source to destination. A stage table was created where data is first loaded and manipulated before being pushed into the destination table. The Python script connects to the database using the pyodbc library and executes the stored procedure. On success, the script emails the stakeholders the number of records processed and the load time; if the process is aborted for any reason, a failure notification is emailed to the stakeholders as well. The finished Python script is converted into an executable with PyInstaller, and that executable is scheduled to run daily at a specified time. The flowchart below describes the overall process flow, and a condensed sketch of the script follows the file links.
Link to Project 1 files:
- Python File: load_main_table_daily_type1.ipynb
- SQL SP: load_main_table_daily_type1.sql
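The condensed sketch below mirrors the flow described above. It assumes the stored procedure shares its name with the linked SQL file and SELECTs back the number of rows it moved, and it reuses the hypothetical `CONN_STR`/`notify` helpers from the earlier sketch; the real notebook may differ in these details.

```python
import time

import pyodbc

# Hypothetical shared helpers from the earlier sketch.
from pipeline_utils import CONN_STR, notify

def run_daily_incremental_load():
    start = time.time()
    conn = pyodbc.connect(CONN_STR)
    try:
        cursor = conn.cursor()
        # Assumed contract: the procedure moves the new rows through the stage
        # table into the destination and SELECTs the processed row count back.
        cursor.execute("EXEC dbo.load_main_table_daily_type1")
        rows_processed = cursor.fetchone()[0]
        conn.commit()
        elapsed = time.time() - start
        notify("Daily incremental load succeeded",
               f"{rows_processed} records processed in {elapsed:.1f} seconds.")
    except Exception as exc:
        # Any failure is reported to stakeholders before the script exits.
        notify("Daily incremental load FAILED", f"Load aborted: {exc}")
        raise
    finally:
        conn.close()

if __name__ == "__main__":
    run_daily_incremental_load()
```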
In this project, the goal is to load data from files into a destination database. A Python script first checks whether a file is available; if it is, the script reads the file and performs all the necessary data transformations. To load the data into the database, the script first truncates the stage table, then uses the SQLAlchemy library with pandas' to_sql function (rather than a plain cursor) to load the data into the stage table, which improves load performance. Finally, the script calls a stored procedure that upserts the data from the stage table into the destination table. After processing, the file is moved to an archive folder. On success, the script emails the stakeholders the number of records processed and the load time; if the process is aborted for any reason, a failure notification is emailed to the stakeholders as well. The finished Python script is converted into an executable with PyInstaller, and that executable is scheduled to run daily at a specified time. The flowchart below describes the overall process flow, and a condensed sketch of the script follows the file links.
Link to Project 2 files:
- Python File: Load_Environmental_Data.ipynb
- SQL SP: Load_Environmental_Data_SP.sql
- Pyinstaller: Pyinstaller.ipynb
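The condensed sketch below follows the steps described above. The file paths, SQLAlchemy URL, and stage table name are placeholders, and the upsert procedure is assumed to share its name with the linked SQL file; the real notebook holds the actual values and transformations.

```python
import shutil
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder paths and connection URL -- not the project's actual values.
INCOMING = Path(r"C:\pipelines\incoming\environmental_data.csv")
ARCHIVE_DIR = Path(r"C:\pipelines\archive")
ENGINE = create_engine(
    "mssql+pyodbc://my_sql_server/my_datamart"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
    fast_executemany=True,  # bulk-style inserts make to_sql much faster
)

def load_environmental_file():
    if not INCOMING.exists():
        return  # no file available today, nothing to do

    df = pd.read_csv(INCOMING)
    # ... data cleaning / transformations would go here ...

    with ENGINE.begin() as conn:
        # Truncate the stage table, reload it, then let the procedure upsert
        # from stage into the destination table.
        conn.execute(text("TRUNCATE TABLE dbo.stage_environmental"))
        df.to_sql("stage_environmental", conn, schema="dbo",
                  if_exists="append", index=False)
        conn.execute(text("EXEC dbo.Load_Environmental_Data_SP"))

    # Move the processed file out of the incoming folder.
    shutil.move(str(INCOMING), str(ARCHIVE_DIR / INCOMING.name))
```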
Note: Screenshots showing how to set up Task Scheduler can be found in the img folder under the names task_schd_#.
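Before a script is handed to Task Scheduler, it is frozen into a standalone executable with PyInstaller. PyInstaller also exposes a programmatic entry point, so the conversion itself can be scripted, roughly as below; the script name is a placeholder for the .py file exported from the notebook, and the actual options used are in Pyinstaller.ipynb.

```python
import PyInstaller.__main__

# Build a single-file .exe that Task Scheduler can point at directly.
# "load_environmental_data.py" stands in for the exported notebook script.
PyInstaller.__main__.run([
    "load_environmental_data.py",
    "--onefile",                        # bundle everything into one executable
    "--name", "Load_Environmental_Data",
])
```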
The result is a set of smooth, low-to-no-cost data pipelines that keep data flowing into a data lake/data mart or any other storage area, where it can then be used for data science and BI tasks. We do need to make sure the pipelines are scheduled in their dependency order, and they will require ongoing maintenance: proactive action may be needed if the data volume suddenly grows beyond what was expected.