
TensorFlow Pipelines

Training pipeline

The TensorFlow training pipeline can be found in training/pipeline.py. The training component train_tensorflow_model is the main training component and contains the implementation of a TensorFlow Keras model. This component can then be wrapped in a custom kfp ContainerOp from google-cloud-pipeline-components, which submits a Vertex Training job and adds flexibility for machine_type, replica_count, accelerator_type, and other machine configuration options.
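
As a rough sketch of this wrapping (the exact helper and parameter names depend on your google-cloud-pipeline-components version, and the import path of train_tensorflow_model below is a placeholder):

```python
# Sketch only: converting the training component into a Vertex custom training job.
from google_cloud_pipeline_components.v1.custom_job import utils

# Placeholder import path for the component defined in training/pipeline.py.
from pipelines.tensorflow.training.pipeline import train_tensorflow_model

# Wrap the lightweight KFP component so it runs as a Vertex Training job
# with explicit machine configuration.
custom_train_job_op = utils.create_custom_training_job_from_component(
    train_tensorflow_model,
    display_name="train-tensorflow-model",
    machine_type="n1-standard-8",
    replica_count=1,
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
)
```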

Data

The input data is split into three parts in BigQuery and stored in Google Cloud Storage:

  • 80% of the input data is used for model training
  • 10% of the input data is used for model validation
  • 10% of the input data is used for model testing/evaluation
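
As an illustrative sketch only (not necessarily how this pipeline implements the split), a deterministic 80/10/10 split can be expressed in BigQuery with a hash-based filter; the table and column names below are placeholders:

```python
# Illustrative only: deterministic 80/10/10 split in BigQuery using a hashed row id.
# `row_id` and the table reference are placeholders, not names from the pipeline code.
SPLIT_QUERY = """
SELECT
  *,
  CASE
    WHEN ABS(MOD(FARM_FINGERPRINT(CAST(row_id AS STRING)), 10)) < 8 THEN 'TRAIN'
    WHEN ABS(MOD(FARM_FINGERPRINT(CAST(row_id AS STRING)), 10)) = 8 THEN 'VALIDATE'
    ELSE 'TEST'
  END AS split
FROM `my-project.my_dataset.source_table`
"""
```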

Model Architecture

The architecture of the example TensorFlow Keras model is shown below:

TensorFlow Model Architecture

  • Input layer: there is one input node for each of the 7 features used in the example:
    • dayofweek
    • hourofday
    • trip_distance
    • trip_miles
    • trip_seconds
    • payment_type
    • company
  • Pre-processing layers
    • Categorical encoding for categorical features is done using TensorFlow's StringLookup layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup). New/unknown values are handled using this layer's default parameters.
      • The feature payment_type is one-hot encoded. New/unknown categories are assigned to a one-hot encoded array with zeroes everywhere.
      • The feature company is ordinal encoded. New/unknown categories are assigned to zero.
    • Normalization for the numerical features (dayofweek, hourofday, trip_distance, trip_miles, trip_seconds)
  • Dense layers
    • One Dense layer with 64 units whose activation function is ReLU.
    • One Dense layer with 32 units whose activation function is ReLU.
  • Output layer
    • One Dense layer with 1 unit where no activation is applied (this is because the example is a regression problem)
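
As a minimal sketch, the architecture described above could be expressed with the Keras functional API roughly as follows (vocabularies, normalization statistics, and layer details are placeholders; the authoritative implementation is in train.py):

```python
import tensorflow as tf

# Minimal sketch of the described architecture. Vocabularies and
# normalization statistics are placeholders; see train.py for the real code.
CATEGORICAL = ["payment_type", "company"]
NUMERICAL = ["dayofweek", "hourofday", "trip_distance", "trip_miles", "trip_seconds"]

inputs = {n: tf.keras.Input(shape=(1,), name=n, dtype=tf.string) for n in CATEGORICAL}
inputs.update({n: tf.keras.Input(shape=(1,), name=n, dtype=tf.float32) for n in NUMERICAL})

# payment_type: one-hot encoded via StringLookup.
payment_encoded = tf.keras.layers.StringLookup(
    vocabulary=["Cash", "Credit Card", "Mobile"], output_mode="one_hot"
)(inputs["payment_type"])

# company: ordinal (integer) encoded via StringLookup, then cast to float
# so it can be concatenated with the other features.
company_encoded = tf.keras.layers.StringLookup(
    vocabulary=["company_a", "company_b"], output_mode="int"
)(inputs["company"])
company_encoded = tf.keras.layers.Lambda(
    lambda t: tf.cast(t, tf.float32)
)(company_encoded)

# Numerical features: scalar normalization (in practice the mean/variance
# would be adapted on the training data rather than hard-coded).
normalized = [
    tf.keras.layers.Normalization(axis=None, mean=0.0, variance=1.0)(inputs[n])
    for n in NUMERICAL
]

x = tf.keras.layers.Concatenate()([payment_encoded, company_encoded] + normalized)
x = tf.keras.layers.Dense(64, activation="relu")(x)
x = tf.keras.layers.Dense(32, activation="relu")(x)
outputs = tf.keras.layers.Dense(1)(x)  # linear output for the regression target

model = tf.keras.Model(inputs=inputs, outputs=outputs)
```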

Model hyperparameters

You can specify different hyperparameters through the model_params argument of train_tensorflow_model, including:

  • Batch size
  • No. of epochs to check for early stopping
  • Learning rate
  • Number of hidden units and type of activation function in each layer
  • Loss function
  • Optimization method
  • Evaluation metrics
  • Whether you want early stopping

For a comprehensive list of options for the above hyperparameters, see the docstring in train.py.
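
For illustration only, a model_params dictionary might look like the following; the key names shown here are assumptions, and the authoritative list is in the train.py docstring:

```python
# Illustrative only -- the supported keys and their exact names are
# documented in the train.py docstring.
model_params = {
    "batch_size": 100,
    "epochs": 5,
    "learning_rate": 0.001,
    "hidden_units": [(64, "relu"), (32, "relu")],
    "loss_fn": "MeanSquaredError",
    "optimizer": "Adam",
    "metrics": ["RootMeanSquaredError"],
    "early_stopping_epochs": 3,
}
```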

Model artifacts

A number of different model artifacts/objects are created by the training of the TensorFlow model. With these files, you can load the model into a new script (without any of the original training code) and run it, or resume training from exactly where you left off. For more information, see the TensorFlow guide on saving and loading models.

Model and metrics artifacts produced by the training component
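
For example, assuming the training step exports a standard Keras SavedModel to Cloud Storage, it can be reloaded without any of the original training code (the path below is a placeholder):

```python
import tensorflow as tf

# Placeholder GCS path; the real location is produced by the pipeline run.
MODEL_DIR = "gs://your-bucket/path/to/saved_model"

# Reload the trained model without any of the original training code.
model = tf.keras.models.load_model(MODEL_DIR)
model.summary()

# The model can now be used for prediction, or compiled and fit again to
# resume training from where it left off.
```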

Model test/evaluation

Once the model is trained, it is used to generate challenger predictions for evaluation purposes. By default, the pipeline uses the predict_tensorflow_model component, which expects a single CSV file of test data. However, if you are working with larger test datasets, it is more efficient to replace it with ModelBatchPredictOp, a prebuilt component provided by Google, to avoid crashes caused by insufficient memory.
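
A rough sketch of that swap is shown below; the project, bucket, and URI values are placeholders, training_job is a hypothetical upstream training task in the pipeline definition, and the exact parameter set depends on your google-cloud-pipeline-components version:

```python
# Sketch only: using Vertex batch prediction inside the KFP pipeline definition.
from google_cloud_pipeline_components.v1.batch_predict_job import ModelBatchPredictOp

batch_predict = ModelBatchPredictOp(
    project="my-project",
    location="europe-west2",
    job_display_name="tensorflow-challenger-predictions",
    model=training_job.outputs["model"],      # hypothetical model artifact from the training step
    gcs_source_uris=[test_data_jsonl_uri],    # placeholder: JSONL test data exported earlier
    instances_format="jsonl",
    predictions_format="jsonl",
    gcs_destination_output_uri_prefix="gs://my-bucket/challenger-predictions",
    machine_type="n1-standard-4",
    starting_replica_count=1,
    max_replica_count=2,
)
```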

Distribution strategy

In deep learning, it is common to use GPUs, which utilise a large number of simple cores allowing parallel computing through thousands of threads at a time, to train complicated neural networks on massive datasets. For optimisation tasks, it is often better to use CPUs.

There is a variable, distribute_strategy, in the TensorFlow training pipeline that allows you to set the distribution strategy. You have three options:

Value    Description
single   Uses a GPU if a GPU device of the requested kind is available; otherwise falls back to the CPU.
mirror   Typically used for training on one machine with multiple GPUs.
multi    Implements synchronous distributed training across multiple machines, each with potentially multiple GPUs.
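
A rough illustration of how such a setting typically maps onto tf.distribute strategies is shown below; the mapping in the actual training code may differ in detail:

```python
import tensorflow as tf

def get_distribution_strategy(distribute_strategy: str) -> tf.distribute.Strategy:
    """Illustrative mapping only; the training code's mapping may differ."""
    if distribute_strategy == "multi":
        # Synchronous training across several machines, each with potentially multiple GPUs.
        return tf.distribute.MultiWorkerMirroredStrategy()
    if distribute_strategy == "mirror":
        # One machine, multiple GPUs.
        return tf.distribute.MirroredStrategy()
    # "single": one GPU if any is available, otherwise the CPU.
    device = "/gpu:0" if tf.config.list_physical_devices("GPU") else "/cpu:0"
    return tf.distribute.OneDeviceStrategy(device=device)

strategy = get_distribution_strategy("single")
with strategy.scope():
    # Build and compile the Keras model inside the strategy scope.
    ...
```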

Prediction pipeline

The TensorFlow prediction pipeline can be found in prediction/pipeline.py.

The rationale for exporting the data twice (once as a CSV file and once as a JSONL file) is that the CSV file is passed to the generate_statistics component (which uses the function tfdv.generate_statistics_from_csv), while the JSONL file is used when calling the ModelBatchPredictOp component for batch prediction.
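
For instance, the statistics-generation step conceptually reduces to a call like the one below (paths are placeholders):

```python
import tensorflow_data_validation as tfdv

# Placeholder path; in the pipeline this comes from the CSV export step.
CSV_URI = "gs://my-bucket/prediction-input/data.csv"

# The CSV export feeds TFDV statistics generation...
stats = tfdv.generate_statistics_from_csv(data_location=CSV_URI)

# ...while the JSONL export is what ModelBatchPredictOp consumes
# (gcs_source_uris=[...], instances_format="jsonl").
```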