This is a TFX example pipeline.
The flow between the TFX components is depicted in the graph below. The components do the following:
- `ExampleGen`: Data is ingested into the pipeline and split into `Train` and `Eval` sets.
- `StatisticsGen`, `SchemaGen`, and `ExampleValidator`: Data validation and anomaly detection.
- `Transform`: Transformation and preprocessing of the data.
- `Tuner` and `Trainer`: An estimator is trained with tuned (or untuned) hyperparameters.
- `Resolver` and `Evaluator`: Model analysis of the trained model. The model is assigned `BLESSED` or `UNBLESSED` depending on the evaluation metric threshold(s).
- `InfraValidator`: Model infrastructure validation, to guarantee the model is mechanically fine and to prevent bad models from being pushed.
- `Pusher`: Acts on the model validation outcomes. If a model is deemed `BLESSED`, it is pushed for serving.
```mermaid
stateDiagram-v2
    direction LR
    [*] --> ExampleGen
    ExampleGen --> StatisticsGen
    StatisticsGen --> SchemaGen
    StatisticsGen --> ExampleValidator
    ExampleGen --> Transform
    SchemaGen --> Transform
    Transform --> Tuner
    SchemaGen --> Tuner
    SchemaGen --> Trainer
    Transform --> Trainer
    Tuner --> Trainer
    Trainer --> Resolver
    ExampleGen --> Evaluator
    Trainer --> Evaluator
    Resolver --> Evaluator
    ExampleGen --> InfraValidator
    Trainer --> InfraValidator
    Trainer --> Pusher
    Evaluator --> Pusher
    InfraValidator --> Pusher
    Pusher --> [*]
```
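For orientation, the graph above translates fairly directly into component wiring in `src/pipeline/pipeline.py`. Below is a minimal sketch of the first few components using TFX's public API; the function name, paths, and argument values are illustrative, not the project's exact code:

```python
from tfx import v1 as tfx


def create_pipeline(pipeline_name: str, pipeline_root: str,
                    data_root: str) -> tfx.dsl.Pipeline:
    """Wires a few of the components from the graph above into a pipeline."""
    # Ingest the CSV data and split it into Train/Eval sets.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)

    # Compute statistics over the ingested examples.
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs['examples'])

    # Infer a schema from the statistics.
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs['statistics'])

    # Check incoming data for anomalies against the schema.
    example_validator = tfx.components.ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'])

    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen, example_validator],
    )
```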
## Folder structure
```
.
├── data/                          # Data folder
├── notebooks/                     # Example TFX notebooks
├── outputs/                       # Local runs outputs folder
├── schema/                        # Custom defined schema
├── src/
│   ├── data/
│   │   └── data.csv               # Source data
│   ├── models/                    # Directory of ML model definitions
│   │   ├── estimator_model/
│   │   │   ├── constants.py       # Defines constants of the model
│   │   │   ├── model_test.py      # Model test file
│   │   │   └── model.py           # DNN model using TF estimator
│   │   ├── keras_model/
│   │   │   ├── constants.py       # Defines constants of the model
│   │   │   ├── model_test.py      # Model test file
│   │   │   └── model.py           # DNN model using Keras
│   │   ├── features_test.py
│   │   ├── features.py
│   │   ├── preprocessing_test.py  # Preprocessing test file
│   │   └── preprocessing.py       # Defines preprocessing job using TF Transform
│   ├── pipeline/                  # Directory of pipeline definition
│   │   ├── configs.py             # Defines common constants for pipeline runners
│   │   └── pipeline.py            # Defines TFX components and a pipeline
│   ├── utils/                     # Directory of utils/helper functions
│   ├── data_validation.ipynb      # Data validation notebook
│   ├── local_runner.py            # Runner for local orchestration
│   └── model_analysis.ipynb       # Model analysis notebook
├── .dockerignore
├── .gitignore
├── Dockerfile
├── requirements.in                # Environment requirements
├── requirements.txt               # Compiled requirements
└── README.md
```
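As an example of what lives in these files: `preprocessing.py` defines the `preprocessing_fn` that the `Transform` component executes. A minimal sketch of such a function (the feature names here are made up, not the project's actual features):

```python
import tensorflow_transform as tft


def preprocessing_fn(inputs):
    """tf.Transform callback: maps raw feature tensors to transformed ones."""
    outputs = {}
    # Scale a (hypothetical) numeric feature to zero mean and unit variance.
    outputs['age_xf'] = tft.scale_to_z_score(inputs['age'])
    # Build and apply a vocabulary for a (hypothetical) categorical feature.
    outputs['city_xf'] = tft.compute_and_apply_vocabulary(inputs['city'])
    return outputs
```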
There are some files with `_test.py` in their name. These are unit tests of the pipeline, and it is recommended to add more unit tests as you implement your own pipelines. You can run a unit test by supplying the module name of the test file with the `-m` flag. You can usually get the module name by deleting the `.py` extension and replacing `/` with `.`. For example:
```sh
# cd into src folder
$ cd src

# Run test file
$ python -m models.features_test
```
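When adding your own tests, the existing `_test.py` files can serve as a template; they follow the standard `tf.test` pattern. A minimal sketch (the function under test, `transformed_name`, is a placeholder):

```python
import tensorflow as tf

from models import features


class FeaturesTest(tf.test.TestCase):

    def test_transformed_name(self):
        # Hypothetical check: transformed feature names get a suffix.
        self.assertEqual('age_xf', features.transformed_name('age'))


if __name__ == '__main__':
    tf.test.main()
```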
## Set up environment

This pipeline runs on Python 3.8.10. To create an environment for running it locally, follow these steps:
```sh
# Make sure you have the right Python activated

# Create virtual environment
$ python -m venv .venv

# Upgrade pip
$ .venv/bin/pip install --upgrade pip

# Install requirements
$ .venv/bin/pip install -r requirements.txt

# Activate environment
$ source .venv/bin/activate
```
This should set up your local environment, and you should be good to go running the pipeline and notebook scripts locally.
In case you need to update any requirements (e.g., bump a package version or add new packages), follow these steps:
```sh
# Delete the virtual environment (to make sure all old dependencies are removed)
# Make sure you have activated the right Python version and have pip-tools installed
# Update requirements.in with new package versions/added packages

# Compile requirements
$ pip-compile requirements.in

# Redo steps in section `Set up environment`
```
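For illustration, `requirements.in` only lists top-level dependencies; `pip-compile` resolves and pins the full dependency tree into `requirements.txt`. A hypothetical example:

```
# Top-level dependencies only; pip-compile pins the full tree in requirements.txt.
# (The version here is illustrative, not the project's actual pin.)
tfx==1.8.0
```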
When you want to run the pipeline using the `local_runner.py` script, simply run:

```sh
$ python src/local_runner.py
```
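For context, `local_runner.py` essentially hands the pipeline definition to TFX's local orchestrator. A rough sketch, assuming a `create_pipeline` helper like the one sketched earlier (the config constants and paths are placeholders):

```python
from tfx import v1 as tfx

from pipeline import configs, pipeline


def run():
    # Run the pipeline with TFX's local (in-process) orchestrator.
    tfx.orchestration.LocalDagRunner().run(
        pipeline.create_pipeline(
            pipeline_name=configs.PIPELINE_NAME,
            pipeline_root='outputs',  # local runs land in outputs/
            data_root='data',
        ))


if __name__ == '__main__':
    run()
```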
Build the image using the `docker build` command:

```sh
$ docker build -t tfx-pipeline:latest .
```
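If you are curious what goes into the image, the `Dockerfile` is roughly of this shape (a sketch under assumptions, not the project's exact file):

```dockerfile
FROM python:3.8.10-slim

WORKDIR /app

# Install pinned dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the pipeline source; data/ is mounted at runtime (see the run command below).
COPY src/ ./src/

ENTRYPOINT ["python", "src/local_runner.py"]
```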
Now that the image has been built, it is time to run a container from it. You can do so with the following command:

```sh
$ docker run -it \
    -v $PWD/test_dir/data/:/app/data/ \
    --rm tfx-pipeline:latest
```