INSTALLATION.md


Cloud Learning Platform Installation

1. Prerequisites

Access

  • Access to the following GitHub repos:
  • Read-only access to aitutor-dev buckets (for ML model export and Firestore import)
  • A new project for installation with Project Owner (roles/owner)
  • A domain or subdomain that you control and can create DNS records for (not needed for CEs)

Quotas

You'll need the following quotas in your preferred zone:

  • 48 vCPU
  • 4 x T4 GPUs

Version

Confirm with the Learning Platform team which release version of each repo to use.

Tools

Install the following tools:
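The specific tool list should come from the Learning Platform team; based on the commands used later in this guide, a quick check that the expected tools are on your PATH might look like:

```shell
# Tool list inferred from the commands used later in this guide;
# confirm the exact set and versions with the team
for tool in gcloud gsutil terraform kubectl skaffold helm git; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool"
  else
    echo "MISSING: $tool"
  fi
done
```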

2. Project Bootstrap

You will use a provided Terraform module to perform the following:

  • Bootstrap a project in your organization
  • Create a Terraform service account and Terraform state bucket in the project for further Terraform scripts

From your workstation:

export PROJECT_ID=<your-project-id>
export DOMAIN_NAME=<your-org-domain>
export REGION=<your-region>
export ZONE=<your-zone>
export PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
export BILLING_ACCOUNT="$(gcloud beta billing projects describe ${PROJECT_ID} | grep billingAccountName \
    | tr / ' ' | cut -f3 -d' ')"

export CLP_VERSION=<clp_tag>
export DEMO_VERSION=<demo_tag>   # e.g. v2.0.0-beta12.7-demo
git clone https://github.com/GPS-Solutions/cloud-learning-platform.git
cd cloud-learning-platform
git checkout $CLP_VERSION
cd terraform/stages/project_bootstrap/
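For reference, the BILLING_ACCOUNT pipeline above extracts the account ID from `describe` output shaped like `billingAccountName: billingAccounts/XXXXXX-XXXXXX-XXXXXX`. A minimal illustration on a sample line:

```shell
# Sample line in the shape emitted by `gcloud beta billing projects describe`
sample='billingAccountName: billingAccounts/0A1B2C-3D4E5F-6A7B8C'
# tr turns the / into a space; cut then takes the third space-separated field
echo "$sample" | tr / ' ' | cut -f3 -d' '
# → 0A1B2C-3D4E5F-6A7B8C
```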

Log in to your project:

gcloud auth login

You will need to also create ADC credentials pointed towards an org admin account for use by the Terraform client.

gcloud auth application-default login

Verify that the credentials were written:

cat ~/.config/gcloud/application_default_credentials.json

Set gcloud to point to your project (if it isn't already):

gcloud config list
gcloud config set project ${PROJECT_ID}

Run the following to set Terraform variables:

# Pass variables to terraform using environment prefix TF_VAR_
export TF_VAR_project_id=${PROJECT_ID}
export TF_VAR_billing_account=${BILLING_ACCOUNT}
export TF_VAR_region=${REGION}
export TF_VAR_zone=${ZONE}
export TF_VAR_bucket_region_or_multiregion="US"
export TF_VAR_org_domain_name=${DOMAIN_NAME}
export TF_VAR_add_project_owner=true

Now that you're logged in, initialize Terraform and run terraform apply to see the expected changes. Inspect the planned changes before typing yes when prompted.

terraform init
terraform apply

Ensure that a bucket with the same name as the project has been created:

gsutil ls -p $PROJECT_ID

There should also be a jump host VM in the project:

gcloud compute instances list

Summary

In this section you successfully created the following:

  • A bucket to capture future terraform state and prepare for CI/CD
  • A Terraform service account with the required permissions
  • A jump host to perform the rest of the installation

3. Main Infrastructure Installation - GKE Cluster

Copy the bootstrap Terraform state to the tfstate bucket

gsutil cp ../project_bootstrap/terraform.tfstate gs://"${PROJECT_ID}"-tfstate/env/bootstrap/terraform.tfstate

Enable deletion protection for the jump host

gcloud compute instances update jump-host --deletion-protection --project="${PROJECT_ID}"

SCP startup script to jump host

gcloud compute scp ../scripts/bastion_startup.sh jump-host:~ --zone=${ZONE} --tunnel-through-iap --project="${PROJECT_ID}"

Log on to the jump host using IAP

gcloud compute ssh jump-host --zone=${ZONE} --tunnel-through-iap --project=${PROJECT_ID}

Run the startup script (takes 10-20 min)

source ~/bastion_startup.sh

Start tmux on the jump host

tmux is preferred so that work is not lost when a session disconnects (https://tmuxcheatsheet.com/). To reconnect to an existing session, run tmux attach.

tmux

Git Clone CLP repos

git config --global user.email "you@example.com"
git config --global user.name "Your Name"
git config --global credential.https://github.com.username "username"

git config --global credential.helper store
git clone https://github.com/GPS-Solutions/cloud-learning-platform.git

Set Project ID and other variables

export PROJECT_ID=<your-project-id>
export LDAP=<your-ldap>
export GITHUB_ID=<your-github-id>
export REGION=<your-region>
export ZONE=<your-zone>

Authenticate to Google Cloud

gcloud auth login
gcloud auth application-default login

Create and download service key for terraform account

This service account is used for several reasons:

  • It ensures a consistent experience, as users coming to this process may have varying permissions
  • Creating Firebase resources in Terraform requires a service account because of API limitations
  • It does half the work of setting up CI/CD for this project to consume upstream changes

export SA_KEY_FILE=~/clp-terraform-cicd-key.json
gcloud iam service-accounts keys create ${SA_KEY_FILE} \
--iam-account=terraform-cicd@${PROJECT_ID}.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=${SA_KEY_FILE}

Define input variables for Terraform

export TF_VAR_project_id=${PROJECT_ID}
export TF_VAR_region=${REGION}
export TF_VAR_firestore_region="us-central"
export TF_VAR_gke_cluster_zones=${ZONE}
export TF_VAR_github_owner=${GITHUB_ID}
export TF_VAR_api_domain="${PROJECT_ID}-api"
export TF_VAR_web_app_domain="${PROJECT_ID}"
export TF_VAR_ckt_app_domain="${PROJECT_ID}-ckt"
export TF_VAR_github_ref="refs/tags/${DEMO_VERSION}"

Set up the frontend app domains and whitelisted users

These variables have been defaulted for Argolis projects

export TF_VAR_cert_issuer_email="${LDAP}@google.com"
export TF_VAR_org_domain_name="${LDAP}.altostrat.com"
export TF_VAR_base_domain="cloudpssolutions.com"
export TF_VAR_ai_tutor_whitelist_domains="google.com"
export TF_VAR_ai_tutor_whitelist_emails="${LDAP}@google.com,admin@${LDAP}.altostrat.com"
export TF_VAR_ckt_whitelist_domains="google.com"
export TF_VAR_ckt_whitelist_emails="${LDAP}@google.com,admin@${LDAP}.altostrat.com"
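For reference, the frontend and API hostnames are composed as <app_domain>.<base_domain>. With a hypothetical example project ID, the backend API endpoint used later in this guide comes out as:

```shell
PROJECT_ID="my-clp-demo"                  # hypothetical example value
TF_VAR_api_domain="${PROJECT_ID}-api"
TF_VAR_base_domain="cloudpssolutions.com"
echo "https://${TF_VAR_api_domain}.${TF_VAR_base_domain}"
# → https://my-clp-demo-api.cloudpssolutions.com
```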

Terraform Module Initialization

Now change directories to demo_environment and initialize the Terraform module:

pushd cloud-learning-platform
git checkout "${CLP_VERSION}"
cd terraform/stages/demo_environment

terraform init -backend-config="bucket=${PROJECT_ID}-tfstate"
terraform plan | grep -e "#"
# Firestore may only be initialized once
FIRESTORE_INIT="-var=firebase_init=false"
if [[ $(gcloud alpha firestore databases list --project="${PROJECT_ID}" --quiet | grep -c uid) == 0 ]]; then
  FIRESTORE_INIT="-var=firebase_init=true"
fi

terraform apply ${FIRESTORE_INIT} --auto-approve
popd

Summary

In this section you successfully created the following via Terraform:

  • Firebase Base Apps
  • GKE Cluster for backends and GCS buckets
  • Ingress and other Service Accounts and Secrets on the Cluster

4. Firebase Setup

Follow Firebase setup Instructions

Firestore Index Creation

Check out the backend repo and select the release version you'd like to deploy, matching the frontends you just deployed. Then deploy the needed indexes to Firestore. Make sure the database import has completed first.

cd cloud-learning-platform
export GCP_PROJECT=$PROJECT_ID
echo "Your current GCP Project ID is: $GCP_PROJECT"

cd utils
PYTHONPATH=../common/src python firestore_indexing.py
cd ..

5. GKE Backend Deployment

We will now run a series of skaffold commands to build the necessary containers in cloud build and deploy them to the GKE cluster to power the backend services.

First, connect to the GKE cluster you've already provisioned:

gcloud container clusters get-credentials $GCP_PROJECT-$REGION --region $REGION --project $GCP_PROJECT

kubectx and kubens are handy tools to easily switch between Kubernetes clusters and namespaces.

Return to the repo root. Make sure you have the version you desire checked out.

cd "$(git rev-parse --show-toplevel)"
echo "Your current git branch is: $(git branch --show-current)"

export GCP_PROJECT=$PROJECT_ID
export PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
echo "Your current GCP Project ID is: "$PROJECT_ID

export BACKEND_API=https://$PROJECT_ID-api.cloudpssolutions.com
# GIT_RELEASE=$(git describe --tags --abbrev=0 --exact-match)
GIT_SHA=$(git rev-parse HEAD)

Run the following to get Firebase API key (Web API key):

KEY_NAME=$(gcloud alpha services api-keys list --filter="displayName='Browser key (auto created by Firebase)'" --format="value(name)")
export FIREBASE_API_KEY=$(gcloud alpha services api-keys get-key-string $KEY_NAME --format="value(keyString)")
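As a quick sanity check (Google-issued API keys start with the AIza prefix), you can verify the value looks right before proceeding:

```shell
# Google API keys begin with "AIza"; anything else suggests the
# displayName filter above matched the wrong key (or none at all)
case "$FIREBASE_API_KEY" in
  AIza*) echo "FIREBASE_API_KEY looks valid" ;;
  *)     echo "Unexpected value: '$FIREBASE_API_KEY'; re-check the key name filter" ;;
esac
```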

Set environment variables:

export IS_DEVELOPMENT=false
export IS_CLOUD_LOGGING_ENABLED=true
export RELEASE_VERSION=$CLP_VERSION
export SKAFFOLD_BUILD_CONCURRENCY=0

Deploy each set of services, one set at a time. Deployments can take over 10 minutes to complete and may need several tries due to transient build failures.

NOTE: Make sure gcloud is set to the proper project and your Kubeconfig is set to the appropriate cluster. Make sure your user account is also set as Application Default Credentials so skaffold and helm have the appropriate access.

You can watch the logs of your builds in Cloud Build as well as streaming to your command line.

echo $GCP_PROJECT $PROJECT_ID $GIT_SHA $CLP_VERSION
# Deploy backend microservices
skaffold run -p custom --default-repo=gcr.io/$PROJECT_ID -l commit=$GIT_SHA -m v3_backends --tag $CLP_VERSION

Eventually you should see that all the containers are built and skaffold is starting to deploy resources. You can also watch the pods deploy by running this in another terminal session:

kubectl get po

# or if you have `watch`
watch kubectl get po

Eventually you will see the deployments stabilize.

6. Scaling the cluster

To save on cost, it may be desirable to reduce GCP spend when the application is not being used or evaluated. This is achieved primarily by turning down the GKE cluster and turning off the backend. Please note that this pathway is only somewhat tested: you should re-test your user journeys each time you turn the cluster back up.

Turning Down

  1. For each node pool in the console:
  • Disable auto scaling, Click Save
  • Set nodes = 0, Click Save
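The console steps above can also be scripted with gcloud. A sketch, assuming the cluster name used in section 5 and a placeholder node pool name (list your real pools first):

```shell
CLUSTER="${PROJECT_ID}-${REGION}"   # cluster name as provisioned in section 5
# <pool-name> is a placeholder; enumerate the real pools with:
#   gcloud container node-pools list --cluster="$CLUSTER" --region="$REGION"
gcloud container clusters update "$CLUSTER" --region "$REGION" \
    --node-pool "<pool-name>" --no-enable-autoscaling
gcloud container clusters resize "$CLUSTER" --region "$REGION" \
    --node-pool "<pool-name>" --num-nodes 0 --quiet
```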

Turning Up

  1. Turn on Auto-scaling for both pools (min 1, max 8)
  2. Change the number of nodes for both pools to 1-4 (the autoscaler will even it out)
  3. Let all services turn on

Use kubectl get pods to monitor the status of pods: ContainerCreating means a pod is starting, while Pending means it is waiting for resources (e.g. a GPU node).
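To quickly surface pods that have not yet reached Running, you can filter the kubectl output (a small sketch; column 3 of kubectl get pods --no-headers is the STATUS field):

```shell
# Print the name and status of every pod not yet in the Running state
kubectl get pods --no-headers | awk '$3 != "Running" {print $1, $3}'
```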