This guide will help you get started with training PyTorch models using Google Cloud Storage (GCS) and Google Cloud's AI Platform.
- Google Cloud Platform (GCP) account, plus a service account for Cloud Storage access
- Google Cloud SDK installed and configured
- Docker installed on your local machine
- Weights & Biases (wandb) account
- Create a new Google Cloud Project or select an existing one.
- Enable the following APIs:
  - Google Cloud Storage
  - AI Platform Training & Prediction
  - Container Registry
(Pro tip: you may want to enable only these APIs so they're easier to find in the console. A command-line alternative is sketched below.)
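If you prefer the CLI, the same APIs can be enabled with gcloud. This is a hedged sketch: the service names are the standard ones for Cloud Storage, AI Platform Training & Prediction, Vertex AI (used later by `gcloud ai custom-jobs`), and Container Registry, and `GCP_PROJECT` is assumed to be exported as shown further down.

```bash
# Enable the required APIs from the command line (assumes GCP_PROJECT is exported)
gcloud services enable \
  storage.googleapis.com \
  ml.googleapis.com \
  aiplatform.googleapis.com \
  containerregistry.googleapis.com \
  --project="$GCP_PROJECT"
```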
Also consider installing Oh My Zsh for git branch highlighting in your shell: https://ohmyz.sh/
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Add to .zshrc
plugins=(git)
export GCP_PROJECT=your-project-id
export GOOGLE_CLOUD_BUCKET_NAME=your-bucket-name
export WANDB_KEY=your-wandb-api-key
- Create a new GCS bucket:
./create_bucket.sh
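The repo's `create_bucket.sh` isn't reproduced here; a minimal equivalent, assuming the `GCP_PROJECT` and `GOOGLE_CLOUD_BUCKET_NAME` variables exported above (the `us-central1` region is an assumption), would be:

```bash
#!/usr/bin/env bash
# Sketch of what create_bucket.sh boils down to: create a regional bucket in the project
gsutil mb -p "$GCP_PROJECT" -l us-central1 "gs://$GOOGLE_CLOUD_BUCKET_NAME"
```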
Authenticate with a service account key: create a service account with the necessary permissions (Storage Admin, Storage Object Creator, etc.) in the Google Cloud Console, generate a JSON key for it, and download it as `google-service-account-key.json`. Then set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point at this file; it is used later by gcsfuse and the Google Cloud client libraries.
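A minimal sketch of wiring up the key (the path below is a placeholder; adjust it to wherever you saved the file):

```bash
# Point Application Default Credentials at the downloaded service account key
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/google-service-account-key.json

# Optionally authenticate gcloud itself as the same service account
gcloud auth activate-service-account --key-file="$GOOGLE_APPLICATION_CREDENTIALS"
```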
- Using the Google Cloud Console:
  - Navigate to your bucket
  - Click "Upload files" or drag and drop files into the browser
- Using the `gsutil` command-line tool (gsutil ships with the Google Cloud SDK; `pip install google-cloud-storage` is only needed if you also want the Python client library):
  gsutil cp [LOCAL_FILE_PATH] gs://[BUCKET_NAME]/
Example - recursive copy to a bucket (GCS has no real directories, so the destination "folders" are created implicitly by the copy):
gsutil -m cp -r -p /media/2TB/celebvhq/35666/images/* gs://$GOOGLE_CLOUD_BUCKET_NAME/celebvhq/35666/images/
- To upload an entire directory:
gsutil -m cp -r [LOCAL_DIRECTORY_PATH] gs://[BUCKET_NAME]/
Example:
gsutil -m cp -r ./training_data gs://my-pytorch-bucket/
Google Container Registry (GCR) is a private container image registry that runs on Google Cloud. Here's how to set it up for your project:
First, ensure that the Container Registry API is enabled for your project:
- Go to the Google Cloud Console.
- Select your project.
- Go to "APIs & Services" > "Dashboard".
- Click on "+ ENABLE APIS AND SERVICES" at the top.
- Search for "Container Registry API" and enable it.
To push images to GCR, you need to configure Docker to authenticate with Google Cloud:
gcloud auth application-default login
gcloud auth configure-docker
The configure-docker command registers gcloud as a Docker credential helper in Docker's configuration file, allowing you to push and pull images from GCR.
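Once Docker is authenticated, you can tag and push an image by hand. The `build.sh` script later in this guide automates this; the image name below just mirrors the one used there:

```bash
# Tag a locally built image for GCR and push it (assumes GCP_PROJECT is exported)
docker tag pytorch-training:latest "gcr.io/$GCP_PROJECT/pytorch-training:latest"
docker push "gcr.io/$GCP_PROJECT/pytorch-training:latest"
```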
GCR can host your images in multiple locations. The main options are:
- gcr.io (United States)
- us.gcr.io (United States)
- eu.gcr.io (European Union)
- asia.gcr.io (Asia)
Choose the location closest to where you'll be running your training jobs for optimal performance.
Keeping your training environment up-to-date is crucial for optimal performance and compatibility. Vertex AI provides pre-built containers that are regularly updated with the latest PyTorch versions and dependencies. Here's how to upgrade your Docker container:
Visit the Vertex AI pre-built containers page to see the latest available versions.
Modify your Dockerfile to use the latest base image. For PyTorch with CUDA support, update the first line:
FROM us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.X-Y.py3Z:latest
Replace X, Y, and Z with the latest version numbers from the Vertex AI documentation.
For example, to use PyTorch 2.2 with Python 3.10:
FROM us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-2.py310:latest
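If you want to sanity-check the tag before rebuilding, you can pull the base image directly (this just reuses the image URI above):

```bash
# Pre-pull the Vertex AI base image to confirm the tag exists before a full rebuild
docker pull us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.2-2.py310:latest
```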
- Place your PyTorch training code in a GitHub repository.
- Update the `GITHUB_REPO` and `BRANCH_NAME` in the `job_config.yaml` file.
To train models and view training statistics, you'll need to assign the following roles to your Google Cloud account or service account:
- AI Platform Training and Prediction:
  - `roles/ml.admin`: Full access to AI Platform Training and Prediction resources.
  - `roles/ml.developer`: Permission to submit training jobs and view job details.
- Compute Engine:
  - `roles/compute.viewer`: Permission to view Compute Engine resources (for CPU usage statistics).
- Cloud Storage:
  - `roles/storage.objectAdmin`: Full control of GCS objects.
- Container Registry:
  - `roles/containerregistry.ServiceAgent`: Permission to push and pull Docker images.
To assign these roles:
- Go to the IAM & Admin section in the Google Cloud Console.
- Click on "Add" to add a new member or edit an existing one.
- Enter the user's email or service account.
- Add the following roles:
- AI Platform Admin
- AI Platform Developer
- Compute Viewer
- Storage Object Admin
- Container Registry Service Agent
These roles will allow you to:
- Submit and manage training jobs
- View job details and logs
- Access CPU usage statistics
- Manage storage objects in your GCS bucket
- Push and pull Docker images to/from Container Registry
Note: It's a best practice to follow the principle of least privilege. If you're working in a team or production environment, consider creating custom roles with only the necessary permissions.
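The same bindings can also be added from the command line. This is a hedged sketch: the service-account email is hypothetical, and it grants exactly the roles listed above.

```bash
# Hypothetical service account; substitute your own user or service account email
SA="trainer@your-project-id.iam.gserviceaccount.com"

# Grant each role listed above, one binding at a time
for role in roles/ml.admin roles/ml.developer roles/compute.viewer \
            roles/storage.objectAdmin roles/containerregistry.ServiceAgent; do
  gcloud projects add-iam-policy-binding "$GCP_PROJECT" \
    --member="serviceAccount:$SA" --role="$role"
done
```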
Run the `build.sh` script to build and push your Docker image:
./build.sh
This script will:
- Build a Docker image with your PyTorch code
- Tag the image with an incremented version
- Push the image to Google Container Registry
- Update the `job_config.yaml` file with the new image URI
The new image should then show up in Container Registry's artifacts with the version number bumped.
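`build.sh` isn't reproduced in full here; a simplified sketch of what it does, with illustrative version handling (the real script increments the version automatically), looks like this:

```bash
#!/usr/bin/env bash
# Simplified sketch of build.sh: build, push, and point job_config.yaml at the new image
set -euo pipefail

VERSION="v1.0.2"   # illustrative; the real script bumps this automatically
IMAGE_URI="gcr.io/$GCP_PROJECT/$IMAGE_NAME:$VERSION"

docker build -t "$IMAGE_URI" .
docker push "$IMAGE_URI"

# Rewrite the imageUri line in the job config (GNU sed syntax; on macOS use sed -i '')
sed -i "s|imageUri:.*|imageUri: '$IMAGE_URI'|" job_config.yaml
```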
Use the `push-job.sh` script to submit your training job:
./push-job.sh
This script creates a custom job on Google Cloud AI Platform using the configuration in `job_config.yaml`, shown below (a sketch of the underlying gcloud call follows the config):
workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
    # machineType: n1-standard-32
    # acceleratorType: NVIDIA_TESLA_V100
    # machineType: a2-ultragpu-1g
    # acceleratorType: NVIDIA_A100_80GB
    # acceleratorCount: 1
  replicaCount: 1
  containerSpec:
    imageUri: 'gcr.io/kommunityproject/pytorch-training:v1.0.1'
    env:
      - name: GCS_BUCKET_NAME
        value: gs://jp-ai-experiments
      - name: BRANCH_NAME
        value: feat/ada-fixed4
      - name: GITHUB_REPO
        value: https://github.com/johndpope/imf.git
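`push-job.sh` itself isn't reproduced here; a simplified sketch of the call it likely wraps (the region and display name are assumptions):

```bash
#!/usr/bin/env bash
# Simplified sketch of push-job.sh: submit a Vertex AI custom job from job_config.yaml
gcloud ai custom-jobs create \
  --region=us-central1 \
  --display-name="pytorch-training" \
  --config=job_config.yaml
```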
To view the training progress and CPU usage:
- Go to the AI Platform Jobs page in the Google Cloud Console.
- Find your job in the list and click on it.
- In the job details page, you can see:
- Job status and duration
- Logs from your training script
- Resource utilization graphs, including CPU usage
You can also use the `gcloud` command-line tool to get job information:
gcloud ai custom-jobs describe JOB_ID
Replace `JOB_ID` with your actual job ID.
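If your gcloud config doesn't have a default Vertex AI region set, pass it explicitly; you can also stream the job logs. The region below is an assumption:

```bash
# Describe the job and stream its logs (replace JOB_ID and the region as needed)
gcloud ai custom-jobs describe JOB_ID --region=us-central1
gcloud ai custom-jobs stream-logs JOB_ID --region=us-central1
```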
- `Dockerfile`: Defines the Docker image for your PyTorch training environment.
- `build.sh`: Builds and pushes the Docker image, and updates `job_config.yaml`.
- `job_config.yaml`: Configuration file for the AI Platform training job.
- `push-job.sh`: Submits the training job to AI Platform.
- Modify the `Dockerfile` to include any additional dependencies your project requires.
- Adjust the `job_config.yaml` file to change machine types, accelerators, or environment variables.
- Update the `build.sh` script if you need to modify the image building process.
- If you encounter permission issues, make sure you have the necessary IAM roles assigned to your Google Cloud account.
- For issues with the Docker image, check the build logs and ensure all dependencies are correctly installed.
- If the training job fails, review the job logs in the Google Cloud Console for error messages.
- If you can't see CPU usage statistics, ensure you have the Compute Viewer role assigned.
- Google Cloud AI Platform Documentation
- PyTorch Documentation
- Weights & Biases Documentation
- Google Cloud IAM Documentation
For more detailed information or support, please refer to the official documentation of each tool or service used in this project.
You should be able to run/boot your Docker container locally with the `run.sh` script (a sketch appears at the end of this section); it should connect to and mount your Cloud Storage bucket.
If your training data isn't sensitive, consider making the bucket public by granting read access to allUsers (see the command below).
To mount Cloud Storage locally, install gcsfuse: https://cloud.google.com/storage/docs/gcsfuse-install
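Making a bucket world-readable is a one-liner (only do this for non-sensitive data; the bucket name is a placeholder):

```bash
# Grant public read access to every object in the bucket
gsutil iam ch allUsers:objectViewer gs://your-bucket-name
```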
export GCP_PROJECT=kommunityproject
export IMAGE_NAME="pytorch-training"
export GCS_BUCKET_NAME="gs://jp-ai-experiments"
export BRANCH_NAME="feat/ada-fixed4"
export GITHUB_REPO="https://github.com/johndpope/imf.git"
If you want to use a publicly available bucket, pass --anonymous-access to gcsfuse in start.sh, as in the snippet below:
# In start.sh: mount the GCS bucket inside the container.
# NOTE: gcsfuse expects a bare bucket name (no gs:// prefix), so strip it if $GCS_BUCKET_NAME includes one.
echo "Mounting GCS bucket: $GCS_BUCKET_NAME to $MOUNT_POINT"

# Authenticated mount using the service account key (disable this when using a public bucket):
# gcsfuse --debug_fuse --implicit-dirs --key-file=$GOOGLE_APPLICATION_CREDENTIALS $GCS_BUCKET_NAME $MOUNT_POINT

# Anonymous mount for a publicly readable bucket:
# echo "Using publicly available bucket"
gcsfuse --debug_fuse --implicit-dirs --anonymous-access $GCS_BUCKET_NAME $MOUNT_POINT
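`run.sh` itself isn't shown in this guide; a minimal sketch of booting the image locally with the variables above (the image tag is an assumption, and --privileged is needed so gcsfuse can mount FUSE inside the container):

```bash
#!/usr/bin/env bash
# Minimal sketch of run.sh: boot the training image locally.
# --privileged (or --cap-add SYS_ADMIN --device /dev/fuse) is required for gcsfuse.
docker run --rm -it --privileged \
  -e GCS_BUCKET_NAME="$GCS_BUCKET_NAME" \
  -e BRANCH_NAME="$BRANCH_NAME" \
  -e GITHUB_REPO="$GITHUB_REPO" \
  -e WANDB_KEY="$WANDB_KEY" \
  "gcr.io/$GCP_PROJECT/$IMAGE_NAME:latest"
```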