This project is an ETL (Extract, Transform, Load) service designed to process child care provider data from multiple sources and consolidate it into a single SQLite database.
- Python 3.7 or higher
- pip (Python package installer)
- Clone this repository:
  `git clone https://github.com/your-username/child-care-data-etl.git`
  `cd child-care-data-etl`
- Install the required packages:
  `pip install -r requirements.txt`
- Ensure your Excel file is named "Technical Exercise Data.xlsx" and is in the project root directory.
- Run the ETL script:
  `python etl.py`
- The script will process the data and create a SQLite database named `child_care_data.db` in the project directory.
- I use DBeaver to browse the resulting SQLite database, but you can use any database browsing tool you choose (a quick Python check is also shown after this list).
- Run the testing script:
  `python -m unittest test_etl.py`
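If you don't have a GUI database tool handy, a few lines of Python can confirm the load worked. This is only a sanity-check sketch; the `providers` table name is an assumption and should be adjusted to whatever `etl.py` actually creates.

```python
import sqlite3

conn = sqlite3.connect("child_care_data.db")
try:
    # List the tables the ETL created.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    print("Tables:", [t[0] for t in tables])

    # Row count for a hypothetical "providers" table.
    count = conn.execute("SELECT COUNT(*) FROM providers").fetchone()[0]
    print("Provider rows:", count)
finally:
    conn.close()
```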
While developing this ETL service, several tradeoffs were made to balance functionality, time constraints, and simplicity:
- SQLite Database: I used SQLite for simplicity, but for a production environment, a more robust database system like PostgreSQL would be more appropriate.
- Error Handling: The current implementation has basic error handling. In a production system, more comprehensive error handling and logging would be crucial.
- Data Validation: I implemented basic data cleaning and validation, such as deduplication and standardizing the formats of phone numbers and addresses (a sketch of this kind of normalization follows this list). More rigorous validation checks could be added to ensure data integrity. Deviation from instructions: I chose to remove the first_name and last_name fields in favor of contact_name. In my experience, there is no universal way to parse first and last names reliably; names come in so many formats that it is best to preserve the original value as a whole entity. There is a great article on this problem here if you want to check it out.
- Data Quality: I perform basic data quality checks, but with more time I would add automated QA using Great Expectations to verify things like column nullability, column uniqueness, and whether column values fall within an expected range (a lightweight version of these checks is sketched after this list).
- Geocoding: The current geocoding implementation is basic and may not handle all address formats. I use a free external API (the TomTom Geocoding API), but the free tier does have limits. A more sophisticated geocoding service could improve accuracy without incurring additional cost (a sketch of the API call appears after this list). IMPORTANT NOTE: I include the API key in this repository's .env file so you don't have to create an API account just to test this repo. THIS IS BAD PRACTICE, and should never be done. I created the key with a dummy email address and no identifying information, and it will be automatically destroyed one week from its creation date.
- Performance: For large datasets, the current row-by-row processing might be slow. Batch processing or more efficient data processing libraries such as pandas could improve performance (a batched-load sketch follows this list).
- Testing: The current version has only limited unit test coverage. Implementing a comprehensive test suite would be a priority for a production-ready system (an example test case follows this list).
Given additional time, these areas would be prime candidates for improvement.
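The sketches below illustrate a few of the points above. First, the kind of phone-number standardization mentioned under Data Validation. `normalize_phone` is a hypothetical helper, not necessarily how `etl.py` does it, and it assumes US 10-digit numbers.

```python
import re
from typing import Optional

def normalize_phone(raw: Optional[str]) -> Optional[str]:
    """Normalize assorted phone formats to XXX-XXX-XXXX (US numbers assumed)."""
    if not raw:
        return None
    digits = re.sub(r"\D", "", raw)           # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]                   # drop a leading US country code
    if len(digits) != 10:
        return None                           # unexpected format: leave for manual review
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"

# e.g. "(555) 123 4567" -> "555-123-4567"
```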
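The Data Quality checks would ideally live in Great Expectations, but the same idea can be expressed directly against a pandas DataFrame as a lightweight stand-in. The column names (`provider_id`, `capacity`) and the 0-500 range below are assumptions, not the real schema.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list:
    """Return a list of failed checks; an empty list means the frame passed."""
    failures = []
    if df["provider_id"].isna().any():
        failures.append("provider_id contains nulls")
    if df["provider_id"].duplicated().any():
        failures.append("provider_id is not unique")
    if not df["capacity"].dropna().between(0, 500).all():
        failures.append("capacity has values outside the expected 0-500 range")
    return failures
```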
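For Geocoding, a call to the TomTom geocoding endpoint might look roughly like this. The endpoint path, response shape, and the `TOMTOM_API_KEY` variable name are from memory and should be verified against TomTom's documentation.

```python
import os

import requests
from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # expects TOMTOM_API_KEY in the .env file (variable name is an assumption)

def geocode(address: str):
    """Return (lat, lon) for an address, or None if nothing was found."""
    url = "https://api.tomtom.com/search/2/geocode/{}.json".format(
        requests.utils.quote(address)
    )
    resp = requests.get(
        url, params={"key": os.environ["TOMTOM_API_KEY"], "limit": 1}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json().get("results", [])
    if not results:
        return None
    pos = results[0]["position"]
    return pos["lat"], pos["lon"]
```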
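For Performance, one option is to replace row-by-row inserts with chunked writes via pandas. The `providers` table name here is hypothetical.

```python
import sqlite3

import pandas as pd

def load_in_batches(df: pd.DataFrame, db_path: str = "child_care_data.db") -> None:
    """Write a DataFrame to SQLite in chunks instead of one INSERT per row."""
    with sqlite3.connect(db_path) as conn:
        # "providers" is a hypothetical table name; adjust to the real schema.
        df.to_sql("providers", conn, if_exists="append", index=False, chunksize=500)
```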
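And for Testing, a unit test for the hypothetical `normalize_phone` helper sketched above might look like this (it assumes the helper lives in `etl.py`).

```python
import unittest

from etl import normalize_phone  # assumes the helper above lives in etl.py

class TestNormalizePhone(unittest.TestCase):
    def test_strips_punctuation_and_spaces(self):
        self.assertEqual(normalize_phone("(555) 123 4567"), "555-123-4567")

    def test_rejects_too_few_digits(self):
        self.assertIsNone(normalize_phone("12345"))

if __name__ == "__main__":
    unittest.main()
```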
For a more scalable and robust ETL process in the long term, I would consider the following strategies:
- Schema Flexibility: Implement a schema-on-read approach to handle varying file formats and changing schemas (a small sketch follows this list).
- Cloud-based Infrastructure: Utilize cloud services (e.g., AWS S3, Azure Data Lake) for scalable storage and processing.
- Workflow Orchestration: Use tools like Apache Airflow for scheduling and monitoring ETL jobs (a minimal DAG sketch follows this list).
- Data Quality Framework: Implement automated data quality checks using tools like Great Expectations.
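To make the Schema Flexibility point concrete, a schema-on-read loader might normalize whatever columns it recognizes at read time rather than requiring a fixed input schema. The alias map below is illustrative only.

```python
import pandas as pd

# Known source-column aliases mapped to canonical names (illustrative only);
# the canonical schema is applied when a file is read, not enforced up front.
COLUMN_ALIASES = {
    "provider name": "provider_name",
    "facility name": "provider_name",
    "phone": "phone_number",
    "phone number": "phone_number",
}

def read_source(path: str) -> pd.DataFrame:
    """Load a CSV or Excel source and rename whatever columns we recognize."""
    df = pd.read_excel(path) if path.endswith(".xlsx") else pd.read_csv(path)
    df.columns = [str(c).strip().lower() for c in df.columns]
    return df.rename(columns=COLUMN_ALIASES)
```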
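And for Workflow Orchestration, a minimal Airflow DAG wrapping the ETL could look roughly like this. `run_etl` is a hypothetical entry point around `etl.py`'s main logic, and the `schedule` argument is named `schedule_interval` on older Airflow 2.x releases.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from etl import run_etl  # hypothetical entry point wrapping etl.py's main logic

with DAG(
    dag_id="child_care_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # named schedule_interval on Airflow < 2.4
    catchup=False,
) as dag:
    run = PythonOperator(task_id="run_etl", python_callable=run_etl)
```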
Thank you for your time and consideration!
This project is licensed under the MIT License - see the LICENSE file for details.