This guide will help you get started with training PyTorch models using Google Cloud Storage (GCS) and Google Cloud's AI Platform.
- Google Cloud Platform (GCP) account, plus a service account for Cloud Storage access
- Google Cloud SDK installed and configured
- Docker installed on your local machine
- Weights & Biases (wandb) account
- Create a new Google Cloud Project or select an existing one.
- Enable the following APIs:
  - Google Cloud Storage
  - AI Platform Training & Prediction
  - Container Registry
(Pro tip: you may want to enable only these APIs so they're easier to find in the console. A command-line alternative is sketched below.)
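If you prefer the CLI, the same APIs can be enabled with gcloud. This is a hedged sketch: the service names are the standard ones for Cloud Storage, AI Platform Training & Prediction, Vertex AI (used later by `gcloud ai custom-jobs`), and Container Registry, and `GCP_PROJECT` is assumed to be exported as shown further down.

```bash
# Enable the required APIs from the command line (assumes GCP_PROJECT is exported)
gcloud services enable \
  storage.googleapis.com \
  ml.googleapis.com \
  aiplatform.googleapis.com \
  containerregistry.googleapis.com \
  --project="$GCP_PROJECT"
```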
Also consider installing Oh My Zsh for git branch highlighting in your shell: https://ohmyz.sh/
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Add to .zshrc
plugins=(git)
export GCP_PROJECT=your-project-id
export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name
export WANDB_KEY=your-wandb-api-key
- Create a new GCS bucket:
./create_bucket.sh
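The repo's `create_bucket.sh` isn't reproduced here; a minimal equivalent, assuming the `GCP_PROJECT` and `GOOGLE_CLOUD_BUCKET_NAME` variables exported above (the `us-central1` region is an assumption), would be:

```bash
#!/usr/bin/env bash
# Sketch of what create_bucket.sh boils down to: create a regional bucket in the project
gsutil mb -p "$GCP_PROJECT" -l us-central1 "gs://$GOOGLE_CLOUD_BUCKET_NAME"
```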
Authenticate with a service account key: create a service account with the necessary permissions (Storage Admin, Storage Object Creator, etc.) in the Google Cloud Console, generate a JSON key for it, and download it as `google-service-account-key.json`. Then set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point at this file; it is used later by gcsfuse and the Google Cloud client libraries.
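A minimal sketch of wiring up the key (the path below is a placeholder; adjust it to wherever you saved the file):

```bash
# Point Application Default Credentials at the downloaded service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/google-service-account-key.json

# Optionally authenticate gcloud itself as the same service account
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"
```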
- Using the Google Cloud Console:
  - Navigate to your bucket
  - Click "Upload files" or drag and drop files into the browser
- Using the `gsutil` command-line tool (gsutil ships with the Google Cloud SDK; `pip install google-cloud-storage` is only needed if you also want the Python client library):
  gsutil cp [LOCAL_FILE_PATH] gs://[BUCKET_NAME]/
Example - recursive copy to a bucket (GCS has no real directories, so the destination "folders" are created implicitly by the copy):
gsutil -m cp -r -p /media/2TB/celebvhq/35666/images/* gs://$GOOGLE_CLOUD_BUCKET_NAME/celebvhq/35666/images/
- To upload an entire directory:
gsutil -m cp -r [LOCAL_DIRECTORY_PATH] gs://[BUCKET_NAME]/
Example:
gsutil -m cp -r ./training_data gs://my-pytorch-bucket/
Google Container Registry (GCR) is a private container image registry that runs on Google Cloud. Here's how to set it up for your project:
First, ensure that the Container Registry API is enabled for your project:
- Go to the Google Cloud Console.
- Select your project.
- Go to "APIs & Services" > "Dashboard".
- Click on "+ ENABLE APIS AND SERVICES" at the top.
- Search for "Container Registry API" and enable it.
To push images to GCR, you need to configure Docker to authenticate with Google Cloud:
gcloud auth application-default login
gcloud auth configure-docker
The configure-docker command registers gcloud as a Docker credential helper in Docker's configuration file, allowing you to push and pull images from GCR.
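Once Docker is authenticated, you can tag and push an image by hand. The `build.sh` script later in this guide automates this; the image name below just mirrors the one used there:

```bash
# Tag a locally built image for GCR and push it (assumes GCP_PROJECT is exported)
docker tag pytorch-training:latest "gcr.io/$GCP_PROJECT/pytorch-training:latest"
docker push "gcr.io/$GCP_PROJECT/pytorch-training:latest"
```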
GCR can host your images in multiple locations. The main options are:
- gcr.io (United States)
- us.gcr.io (United States)
- eu.gcr.io (European Union)
- asia.gcr.io (Asia)
Choose the location closest to where you'll be running your training jobs for optimal performance.
Keeping your training environment up-to-date is crucial for optimal performance and compatibility. Vertex AI provides pre-built containers that are regularly updated with the latest PyTorch versions and dependencies. Here's how to upgrade your Docker container:
Visit the Vertex AI pre-built containers page to see the latest available versions.
Modify your Dockerfile to use the latest base image. For PyTorch with CUDA support, update the first line:
FROM us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.X-Y.py3Z:latest
Replace X, Y, and Z with the latest version numbers from the Vertex AI documentation.
For example, to use PyTorch 2.2 with Python 3.10:
FROM us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-2.py310:latest
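If you want to sanity-check the tag before rebuilding, you can pull the base image directly (this just reuses the image URI above):

```bash
# Pre-pull the Vertex AI base image to confirm the tag exists before a full rebuild
docker pull us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-2.py310:latest
```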
- Place your PyTorch training code in a GitHub repository.
- Update the `GITHUB_REPO` and `BRANCH_NAME` in the `job_config.yaml` file.
To train models and view training statistics, you'll need to assign the following roles to your Google Cloud account or service account:
- AI Platform Training and Prediction:
  - `roles/ml.admin`: Full access to AI Platform Training and Prediction resources.
  - `roles/ml.developer`: Permission to submit training jobs and view job details.
- Compute Engine:
  - `roles/compute.viewer`: Permission to view Compute Engine resources (for CPU usage statistics).
- Cloud Storage:
  - `roles/storage.objectAdmin`: Full control of GCS objects.
- Container Registry:
  - `roles/containerregistry.ServiceAgent`: Permission to push and pull Docker images.
To assign these roles:
- Go to the IAM & Admin section in the Google Cloud Console.
- Click on "Add" to add a new member or edit an existing one.
- Enter the user's email or service account.
- Add the following roles:
- AI Platform Admin
- AI Platform Developer
- Compute Viewer
- Storage Object Admin
- Container Registry Service Agent
These roles will allow you to:
- Submit and manage training jobs
- View job details and logs
- Access CPU usage statistics
- Manage storage objects in your GCS bucket
- Push and pull Docker images to/from Container Registry
Note: It's a best practice to follow the principle of least privilege. If you're working in a team or production environment, consider creating custom roles with only the necessary permissions.
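The same bindings can also be added from the command line. This is a hedged sketch: the service-account email is hypothetical, and it grants exactly the roles listed above.

```bash
# Hypothetical service account; substitute your own user or service account email
SA="trainer@your-project-id.iam.gserviceaccount.com"

# Grant each role listed above, one binding at a time
for role in roles/ml.admin roles/ml.developer roles/compute.viewer \
            roles/storage.objectAdmin roles/containerregistry.ServiceAgent; do
  gcloud projects add-iam-policy-binding "$GCP_PROJECT" \
    --member="serviceAccount:$SA" --role="$role"
done
```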
Run the `build.sh` script to build and push your Docker image:
./build.sh
This script will:
- Build a Docker image with your PyTorch code
- Tag the image with an incremented version
- Push the image to Google Container Registry
- Update the `job_config.yaml` file with the new image URI
The new image should then show up in Container Registry's artifacts with the version number bumped.
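`build.sh` isn't reproduced in full here; a simplified sketch of what it does, with illustrative version handling (the real script increments the version automatically), looks like this:

```bash
#!/usr/bin/env bash
# Simplified sketch of build.sh: build, push, and point job_config.yaml at the new image
set -euo pipefail

VERSION="v1.0.2"   # illustrative; the real script bumps this automatically
IMAGE_URI="gcr.io/$GCP_PROJECT/$IMAGE_NAME:$VERSION"

docker build -t "$IMAGE_URI" .
docker push "$IMAGE_URI"

# Rewrite the imageUri line in the job config (GNU sed syntax; on macOS use sed -i '')
sed -i "s|imageUri:.*|imageUri: '$IMAGE_URI'|" job_config.yaml
```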
Use the `push-job.sh` script to submit your training job:
./push-job.sh
This script creates a custom job on Google Cloud AI Platform using the configuration in `job_config.yaml`, shown below (a sketch of the underlying gcloud call follows the config):
workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
    # machineType: n1-standard-32
    # acceleratorType: NVIDIA_TESLA_V100
    # machineType: a2-ultragpu-1g
    # acceleratorType: NVIDIA_A100_80GB
    # acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: 'gcr.io/kommunityproject/pytorch-training:v1.0.1'
    env:
      - name: GCS_BUCKET_NAME
        value: gs://jp-ai-experiments
      - name: BRANCH_NAME
        value: feat/ada-fixed4
      - name: GITHUB_REPO
        value: https://github.com/johndpope/imf.git
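`push-job.sh` itself isn't reproduced here; a simplified sketch of the call it likely wraps (the region and display name are assumptions):

```bash
#!/usr/bin/env bash
# Simplified sketch of push-job.sh: submit a Vertex AI custom job from job_config.yaml
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name="pytorch-training" \
  --config=job_config.yaml
```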
To view the training progress and CPU usage:
- Go to the AI Platform Jobs page in the Google Cloud Console.
- Find your job in the list and click on it.
- In the job details page, you can see:
- Job status and duration
- Logs from your training script
- Resource utilization graphs, including CPU usage
You can also use the `gcloud` command-line tool to get job information:
gcloud ai custom-jobs describe JOB_ID
Replace `JOB_ID` with your actual job ID.
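If your gcloud config doesn't have a default Vertex AI region set, pass it explicitly; you can also stream the job logs. The region below is an assumption:

```bash
# Describe the job and stream its logs (replace JOB_ID and the region as needed)
gcloud ai custom-jobs describe JOB_ID --region=us-central1
gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1
```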
- `Dockerfile`: Defines the Docker image for your PyTorch training environment.
- `build.sh`: Builds and pushes the Docker image, and updates `job_config.yaml`.
- `job_config.yaml`: Configuration file for the AI Platform training job.
- `push-job.sh`: Submits the training job to AI Platform.
- Modify the `Dockerfile` to include any additional dependencies your project requires.
- Adjust the `job_config.yaml` file to change machine types, accelerators, or environment variables.
- Update the `build.sh` script if you need to modify the image building process.
- If you encounter permission issues, make sure you have the necessary IAM roles assigned to your Google Cloud account.
- For issues with the Docker image, check the build logs and ensure all dependencies are correctly installed.
- If the training job fails, review the job logs in the Google Cloud Console for error messages.
- If you can't see CPU usage statistics, ensure you have the Compute Viewer role assigned.
- Google Cloud AI Platform Documentation
- PyTorch Documentation
- Weights & Biases Documentation
- Google Cloud IAM Documentation
For more detailed information or support, please refer to the official documentation of each tool or service used in this project.
You should be able to run/boot your Docker container locally with the `run.sh` script (a sketch appears at the end of this section); it should connect to and mount your Cloud Storage bucket.
If your training data isn't sensitive, consider making the bucket public by granting read access to allUsers (see the command below).
To mount Cloud Storage locally, install gcsfuse: https://cloud.google.com/storage/docs/gcsfuse-install
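Making a bucket world-readable is a one-liner (only do this for non-sensitive data; the bucket name is a placeholder):

```bash
# Grant public read access to every object in the bucket
gsutil iam ch allUsers:objectViewer gs://your-bucket-name
```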
export GCP_PROJECT=kommunityproject
export IMAGE_NAME="pytorch-training"
export GCS_BUCKET_NAME="gs://jp-ai-experiments"
export BRANCH_NAME="feat/ada-fixed4"
export GITHUB_REPO="https://github.com/johndpope/imf.git"
If you want to use a publicly available bucket, pass --anonymous-access to gcsfuse in start.sh, as in the snippet below:
# In start.sh: mount the GCS bucket inside the container.
# NOTE: gcsfuse expects a bare bucket name (no gs:// prefix), so strip it if $GCS_BUCKET_NAME includes one.
echo "Mounting GCS bucket: $GCS_BUCKET_NAME to $MOUNT_POINT"

# Authenticated mount using the service account key (disable this when using a public bucket):
# gcsfuse --debug_fuse --implicit-dirs --key-file=$GOOGLE_APPLICATION_CREDENTIALS $GCS_BUCKET_NAME $MOUNT_POINT

# Anonymous mount for a publicly readable bucket:
# echo "Using publicly available bucket"
gcsfuse --debug_fuse --implicit-dirs --anonymous-access $GCS_BUCKET_NAME $MOUNT_POINT
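`run.sh` itself isn't shown in this guide; a minimal sketch of booting the image locally with the variables above (the image tag is an assumption, and --privileged is needed so gcsfuse can mount FUSE inside the container):

```bash
#!/usr/bin/env bash
# Minimal sketch of run.sh: boot the training image locally.
# --privileged (or --cap-add SYS_ADMIN --device /dev/fuse) is required for gcsfuse.
docker run --rm -it --privileged \
  -e GCS_BUCKET_NAME="$GCS_BUCKET_NAME" \
  -e BRANCH_NAME="$BRANCH_NAME" \
  -e GITHUB_REPO="$GITHUB_REPO" \
  -e WANDB_KEY="$WANDB_KEY" \
  "gcr.io/$GCP_PROJECT/$IMAGE_NAME:latest"
```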