This tutorial uses the Taxi Trips Dataset provided by the City of Chicago.
Note: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.
Read more about the dataset in Google BigQuery. Explore the full dataset in the BigQuery UI.
The goal is to train a binary classification model that predicts whether a passenger leaves a tip of 20% or more (the target label) based on the taxi ride information.
Our analysis of the dataset shows that over 50% of trips have the payment type Cash, and that the majority of cash payments have no recorded tip. We believe this is because tips on cash payments were not properly recorded, so the dataset is somewhat incomplete for cash payments.
This naturally affects any trained model. Because cash rides almost always show no tip, they are easy to classify and raise the overall accuracy, so the accuracy for non-cash payments will be somewhat lower than the overall accuracy. At the same time, predictions for cash payments are less reliable than those for other payment types, since their labels may not reflect the tips actually given.
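The kind of check described above can be sketched with awk on a CSV extract. This is only an illustration: the column names (payment_type, fare, tips), the inline sample rows, and the reading of the target label as "tip is at least 20% of the fare" are assumptions, not details taken from the processed dataset.

```shell
# Made-up sample rows for illustration only (not real taxi data).
# Assumed columns: payment_type,fare,tips
sample_csv="$(mktemp)"
cat > "$sample_csv" <<'EOF'
payment_type,fare,tips
Cash,10.00,0.00
Cash,8.50,0.00
Credit Card,12.00,3.00
Credit Card,20.00,2.00
EOF

# Per payment type: row count, rows with no tip, and rows where the
# tip is at least 20% of the fare (the assumed target label).
summary="$(awk -F, 'NR > 1 {
    n[$1]++
    if ($3 == 0) zero[$1]++
    if ($2 > 0 && $3 >= 0.2 * $2) label[$1]++
  }
  END {
    for (t in n) printf "%s: rows=%d no_tip=%d tip_ge_20pct=%d\n", t, n[t], zero[t], label[t]
  }' "$sample_csv" | sort)"

echo "$summary"
rm -f "$sample_csv"
```

On real data, a large no_tip count concentrated in the Cash group would match the pattern described above.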
The dataset in BigQuery is in a raw format. We have processed the dataset to prepare it for model training.
We have two different sizes of the dataset:
small
: 10K samples for training and 2K samples for evaluation. This dataset is a good size for local training and debugging your code.
big
: 1M samples for training and 200K samples for evaluation. This dataset is best used in cloud training due to its size.
You can download either dataset (or both) by running the following script:
To download both datasets:
source ./download-taxi.sh /path/to/output/directory
To download the big dataset only:
source ./download-taxi.sh /path/to/output/directory big
To download the small dataset only:
source ./download-taxi.sh /path/to/output/directory small
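Note: Using source (rather than executing the script directly) runs the script in your current shell, so the environment variables it exports remain available afterward.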
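To see why source matters here, the following sketch compares running a script in a child process with sourcing it. The throwaway script and the variable DEMO_TAXI_PATH are hypothetical stand-ins and are not part of download-taxi.sh. The sketch assumes bash.

```shell
# Create a throwaway script that exports a variable (hypothetical stand-in
# for download-taxi.sh; DEMO_TAXI_PATH is an illustrative name).
demo_script="$(mktemp)"
cat > "$demo_script" <<'EOF'
export DEMO_TAXI_PATH=/tmp/taxi-demo
EOF

# Run in a child process: the export happens in the child and is lost.
unset DEMO_TAXI_PATH
bash "$demo_script"
after_run="${DEMO_TAXI_PATH:-unset}"

# Source into the current shell: the export persists.
source "$demo_script"
after_source="${DEMO_TAXI_PATH:-unset}"

echo "after running:  $after_run"
echo "after sourcing: $after_source"
rm -f "$demo_script"
```

Only the sourced invocation leaves the variable visible in the calling shell.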
Upon completion, the selected datasets are downloaded and the corresponding environment variables are set and exported:
- Local Datasets for Training:
TAXI_TRAIN_BIG
: Local path to the big training dataset
TAXI_EVAL_BIG
: Local path to the big evaluation dataset
TAXI_TRAIN_SMALL
: Local path to the small training dataset
TAXI_EVAL_SMALL
: Local path to the small evaluation dataset
- GCS Datasets for Training:
GCS_TAXI_BIG
: GCS path to the big training+evaluation dataset
GCS_TAXI_TRAIN_BIG
: GCS path to the big training dataset
GCS_TAXI_EVAL_BIG
: GCS path to the big evaluation dataset
GCS_TAXI_SMALL
: GCS path to the small training+evaluation dataset
GCS_TAXI_TRAIN_SMALL
: GCS path to the small training dataset
GCS_TAXI_EVAL_SMALL
: GCS path to the small evaluation dataset
- Prediction Datasets:
TAXI_PREDICTION_DICT_NDJSON
: a Newline Delimited JSON file with 3 examples, represented as dictionaries
TAXI_PREDICTION_LIST_NDJSON
: a Newline Delimited JSON file with 3 examples, represented as lists
Note: Each line in a Newline Delimited JSON file is a JSON object or list.
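Because the variables that get set depend on which datasets were selected, it can help to check them after sourcing the script. The helper below is a minimal sketch: check_taxi_vars is a hypothetical name, not something download-taxi.sh provides, and it relies on bash indirect expansion.

```shell
# Report which of the documented variables are set in the current shell.
# check_taxi_vars is an illustrative helper; ${!name} requires bash.
check_taxi_vars() {
  local name
  for name in TAXI_TRAIN_BIG TAXI_EVAL_BIG TAXI_TRAIN_SMALL TAXI_EVAL_SMALL \
              GCS_TAXI_BIG GCS_TAXI_TRAIN_BIG GCS_TAXI_EVAL_BIG \
              GCS_TAXI_SMALL GCS_TAXI_TRAIN_SMALL GCS_TAXI_EVAL_SMALL \
              TAXI_PREDICTION_DICT_NDJSON TAXI_PREDICTION_LIST_NDJSON; do
    if [ -n "${!name:-}" ]; then
      echo "$name=${!name}"
    else
      echo "$name is not set"
    fi
  done
}

check_taxi_vars
```

Each variable is printed either with its value or with a note that it is not set.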