This project is an ETL (Extract, Transform, Load) service designed to process child care provider data from multiple sources and consolidate it into a single SQLite database.
- Python 3.7 or higher
- pip (Python package installer)
- Clone this repository:
  `git clone https://github.com/your-username/child-care-data-etl.git`
  `cd child-care-data-etl`
- Install the required packages:
  `pip install -r requirements.txt`
- Ensure your Excel file is named "Technical Exercise Data.xlsx" and is in the project root directory.
- Run the ETL script:
  `python etl.py`
- The script will process the data and create a SQLite database named `child_care_data.db` in the project directory.
- I use DBeaver to browse the resulting SQLite database, but you can use any database browsing tool you choose (a quick Python check is also shown after this list).
- Run the testing script:
  `python -m unittest test_etl.py`
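If you don't have a GUI database tool handy, a few lines of Python can confirm the load worked. This is only a sanity-check sketch; the `providers` table name is an assumption and should be adjusted to whatever `etl.py` actually creates.

```python
import sqlite3

conn = sqlite3.connect("child_care_data.db")
try:
    # List the tables the ETL created.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    print("Tables:", [t[0] for t in tables])

    # Row count for a hypothetical "providers" table.
    count = conn.execute("SELECT COUNT(*) FROM providers").fetchone()[0]
    print("Provider rows:", count)
finally:
    conn.close()
```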
While developing this ETL service, several tradeoffs were made to balance functionality, time constraints, and simplicity:
- SQLite Database: I used SQLite for simplicity, but for a production environment, a more robust database system like PostgreSQL would be more appropriate.
- Error Handling: The current implementation has basic error handling. In a production system, more comprehensive error handling and logging would be crucial.
- Data Validation: I implemented basic data cleaning and validation, such as deduplication and standardizing the formats of phone numbers and addresses (a sketch of this kind of normalization follows this list). More rigorous validation checks could be added to ensure data integrity. Deviation from instructions: I chose to remove the first_name and last_name fields in favor of contact_name. In my experience, there is no universal way to parse first and last names reliably; names come in so many formats that it is best to preserve the original value as a whole entity. There is a great article on this problem here if you want to check it out.
- Data Quality: I perform basic data quality checks, but with more time I would add automated QA using Great Expectations to verify things like column nullability, column uniqueness, and whether column values fall within an expected range (a lightweight version of these checks is sketched after this list).
- Geocoding: The current geocoding implementation is basic and may not handle all address formats. I use a free external API (the TomTom Geocoding API), but the free tier does have limits. A more sophisticated geocoding service could improve accuracy without incurring additional cost (a sketch of the API call appears after this list). IMPORTANT NOTE: I include the API key in this repository's .env file so you don't have to create an API account just to test this repo. THIS IS BAD PRACTICE, and should never be done. I created the key with a dummy email address and no identifying information, and it will be automatically destroyed one week from its creation date.
- Performance: For large datasets, the current row-by-row processing might be slow. Batch processing or more efficient data processing libraries such as pandas could improve performance (a batched-load sketch follows this list).
- Testing: The current version has only limited unit test coverage. Implementing a comprehensive test suite would be a priority for a production-ready system (an example test case follows this list).
Given additional time, these areas would be prime candidates for improvement.
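The sketches below illustrate a few of the points above. First, the kind of phone-number standardization mentioned under Data Validation. `normalize_phone` is a hypothetical helper, not necessarily how `etl.py` does it, and it assumes US 10-digit numbers.

```python
import re
from typing import Optional

def normalize_phone(raw: Optional[str]) -> Optional[str]:
    """Normalize assorted phone formats to XXX-XXX-XXXX (US numbers assumed)."""
    if not raw:
        return None
    digits = re.sub(r"\D", "", raw)           # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # drop a leading US country code
    if len(digits) != 10:
        return None                           # unexpected format: leave for manual review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

# e.g. "(555) 123 4567" -> "555-123-4567"
```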
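The Data Quality checks would ideally live in Great Expectations, but the same idea can be expressed directly against a pandas DataFrame as a lightweight stand-in. The column names (`provider_id`, `capacity`) and the 0-500 range below are assumptions, not the real schema.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list:
    """Return a list of failed checks; an empty list means the frame passed."""
    failures = []
    if df["provider_id"].isna().any():
        failures.append("provider_id contains nulls")
    if df["provider_id"].duplicated().any():
        failures.append("provider_id is not unique")
    if not df["capacity"].dropna().between(0, 500).all():
        failures.append("capacity has values outside the expected 0-500 range")
    return failures
```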
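For Geocoding, a call to the TomTom geocoding endpoint might look roughly like this. The endpoint path, response shape, and the `TOMTOM_API_KEY` variable name are from memory and should be verified against TomTom's documentation.

```python
import os

import requests
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # expects TOMTOM_API_KEY in the .env file (variable name is an assumption)

def geocode(address: str):
    """Return (lat, lon) for an address, or None if nothing was found."""
    url = "https://api.tomtom.com/search/2/geocode/{}.json".format(
        requests.utils.quote(address)
    )
    resp = requests.get(
        url, params={"key": os.environ["TOMTOM_API_KEY"], "limit": 1}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None
    pos = results[0]["position"]
    return pos["lat"], pos["lon"]
```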
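For Performance, one option is to replace row-by-row inserts with chunked writes via pandas. The `providers` table name here is hypothetical.

```python
import sqlite3

import pandas as pd

def load_in_batches(df: pd.DataFrame, db_path: str = "child_care_data.db") -> None:
    """Write a DataFrame to SQLite in chunks instead of one INSERT per row."""
    with sqlite3.connect(db_path) as conn:
        # "providers" is a hypothetical table name; adjust to the real schema.
        df.to_sql("providers", conn, if_exists="append", index=False, chunksize=500)
```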
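And for Testing, a unit test for the hypothetical `normalize_phone` helper sketched above might look like this (it assumes the helper lives in `etl.py`).

```python
import unittest

from etl import normalize_phone  # assumes the helper above lives in etl.py

class TestNormalizePhone(unittest.TestCase):
    def test_strips_punctuation_and_spaces(self):
        self.assertEqual(normalize_phone("(555) 123 4567"), "555-123-4567")

    def test_rejects_too_few_digits(self):
        self.assertIsNone(normalize_phone("12345"))

if __name__ == "__main__":
    unittest.main()
```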
For a more scalable and robust ETL process in the long term, I would consider the following strategies:
- Schema Flexibility: Implement a schema-on-read approach to handle varying file formats and changing schemas (a small sketch follows this list).
- Cloud-based Infrastructure: Utilize cloud services (e.g., AWS S3, Azure Data Lake) for scalable storage and processing.
- Workflow Orchestration: Use tools like Apache Airflow for scheduling and monitoring ETL jobs (a minimal DAG sketch follows this list).
- Data Quality Framework: Implement automated data quality checks using tools like Great Expectations.
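To make the Schema Flexibility point concrete, a schema-on-read loader might normalize whatever columns it recognizes at read time rather than requiring a fixed input schema. The alias map below is illustrative only.

```python
import pandas as pd

# Known source-column aliases mapped to canonical names (illustrative only);
# the canonical schema is applied when a file is read, not enforced up front.
COLUMN_ALIASES = {
    "provider name": "provider_name",
    "facility name": "provider_name",
    "phone": "phone_number",
    "phone number": "phone_number",
}

def read_source(path: str) -> pd.DataFrame:
    """Load a CSV or Excel source and rename whatever columns we recognize."""
    df = pd.read_excel(path) if path.endswith(".xlsx") else pd.read_csv(path)
    df.columns = [str(c).strip().lower() for c in df.columns]
    return df.rename(columns=COLUMN_ALIASES)
```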
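And for Workflow Orchestration, a minimal Airflow DAG wrapping the ETL could look roughly like this. `run_etl` is a hypothetical entry point around `etl.py`'s main logic, and the `schedule` argument is named `schedule_interval` on older Airflow 2.x releases.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from etl import run_etl  # hypothetical entry point wrapping etl.py's main logic

with DAG(
    dag_id="child_care_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # named schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    run = PythonOperator(task_id="run_etl", python_callable=run_etl)
```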
Thank you for your time and consideration!
This project is licensed under the MIT License - see the LICENSE file for details.