Minor validation and docs update (#695)
- Enforce that the cluster name is between [1, 19] characters, preventing
empty or overly long cluster names from being propagated down to IRSA role
creation, where they cause role creation to fail because the resulting role
name exceeds 64 characters
- Highlight which deployment add-on steps can be skipped when following
the Terraform deployment guides


**Testing:**
- Tested cluster name length validation via empty string,
"ack-sagemaker-controller-irsa-tf-vanilla-uwiqgwfq-ap-southeast-1", and
"ack-sagemaker-controller-irsa-tf-vanilla-uwiqgwfqu3-ap-southeast-1"

By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache 2.0 license.
rrrkharse authored Apr 24, 2023
1 parent b014c04 commit bba1bc9
Showing 9 changed files with 114 additions and 22 deletions.
2 changes: 1 addition & 1 deletion Makefile
@@ -40,7 +40,7 @@ install-jq:
sudo apt-get install jq -y

install-terraform:
$(eval TERRAFORM_VERSION:=1.2.7)
$(eval TERRAFORM_VERSION:=1.4.5)
curl "https://releases.hashicorp.com/terraform/$(TERRAFORM_VERSION)/terraform_$(TERRAFORM_VERSION)_linux_amd64.zip" -o "terraform.zip"
unzip -o -q terraform.zip
sudo install -o root -g root -m 0755 terraform /usr/local/bin/terraform
5 changes: 5 additions & 0 deletions deployments/cognito-rds-s3/terraform/variables.tf
@@ -2,6 +2,11 @@
variable "cluster_name" {
description = "Name of cluster"
type = string

validation {
condition = length(var.cluster_name) > 0 && length(var.cluster_name) <= 19
error_message = "The cluster name must be between [1, 19] characters"
}
}
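For illustration (not part of this file), an over-long cluster name now fails at plan time with the message above. A minimal sketch, assuming the command is run from this deployment's terraform directory and that any other required variables are also supplied:

```bash
cd deployments/cognito-rds-s3/terraform
terraform plan -var 'cluster_name=this-cluster-name-is-far-too-long' -var 'cluster_region=us-west-2'
# Expect an error similar to:
#   Invalid value for variable: The cluster name must be between [1, 19] characters
```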

variable "cluster_region" {
5 changes: 5 additions & 0 deletions deployments/cognito/terraform/variables.tf
@@ -2,6 +2,11 @@
variable "cluster_name" {
description = "Name of cluster"
type = string

validation {
condition = length(var.cluster_name) > 0 && length(var.cluster_name) <= 19
error_message = "The cluster name must be between [1, 19] characters"
}
}

variable "cluster_region" {
5 changes: 5 additions & 0 deletions deployments/rds-s3/terraform/variables.tf
@@ -2,6 +2,11 @@
variable "cluster_name" {
description = "Name of cluster"
type = string

validation {
condition = length(var.cluster_name) > 0 && length(var.cluster_name) <= 19
error_message = "The cluster name must be between [1, 19] characters"
}
}

variable "cluster_region" {
5 changes: 5 additions & 0 deletions deployments/vanilla/terraform/variables.tf
@@ -2,6 +2,11 @@
variable "cluster_name" {
description = "Name of cluster"
type = string

validation {
condition = length(var.cluster_name) > 0 && length(var.cluster_name) <= 19
error_message = "The cluster name must be between [1, 19] characters"
}
}

variable "cluster_region" {
39 changes: 35 additions & 4 deletions website/content/en/docs/add-ons/load-balancer/guide.md
@@ -10,6 +10,8 @@ This tutorial shows how to expose Kubeflow over a load balancer on AWS.

Follow this guide only if you are **not** using `Cognito` as the authentication provider in your deployment. Cognito-integrated deployment is configured with the AWS Load Balancer controller by default to create an ingress-managed Application Load Balancer and exposes Kubeflow via a hosted domain.

> Note: For Terraform deployments, steps that should be skipped are marked with a note below.
## Background

Kubeflow does not offer a generic solution for connecting to Kubeflow over a Load Balancer because this process is highly dependent on your environment and cloud provider. On AWS, we use the [AWS Load Balancer (ALB) controller](https://kubernetes-sigs.github.io/aws-load-balancer-controller/), which satisfies the Kubernetes [Ingress resource](https://kubernetes.io/docs/concepts/services-networking/ingress/) to create an [Application Load Balancer](https://docs.aws.amazon.com/elasticloadbalancing/latest/application/introduction.html) (ALB). When you create a Kubernetes `Ingress`, an ALB is provisioned that load balances application traffic.
@@ -37,8 +39,15 @@ This guide assumes that you have:

## Create Load Balancer


#### Setup for Manifest deployments

If you prefer to create a load balancer using automated scripts, you **only** need to follow the steps in the [automated script section](#automated-script). You can read the following sections in this guide to understand what happens when you run the automated script or to walk through all of the steps manually.

#### Setup for Terraform deployments

Follow the manual steps below.

### Create domain and certificates

You need a registered domain and TLS certificate to use HTTPS with Load Balancer. Since your top level domain (e.g. `example.com`) can be registered at any service provider, for uniformity and taking advantage of the integration provided between Route53, ACM, and Application Load Balancer, you will create a separate [subdomain](https://en.wikipedia.org/wiki/Subdomain) (e.g. `platform.example.com`) to host Kubeflow and a corresponding hosted zone in Route53 to route traffic for this subdomain. To get TLS support, you will need certificates for both the root domain (`*.example.com`) and subdomain (`*.platform.example.com`) in the region where your platform will run (your EKS cluster region).
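The remaining steps of this guide walk through creating these resources; a condensed CLI sketch, assuming example domain names and DNS validation, looks roughly like this:

```bash
# Hypothetical values — substitute your own domains and cluster region.
export ROOT_DOMAIN=example.com
export SUBDOMAIN=platform.example.com
export CLUSTER_REGION=us-west-2

# Hosted zone that will route traffic for the Kubeflow subdomain.
aws route53 create-hosted-zone --name ${SUBDOMAIN} --caller-reference $(date +%s)

# Wildcard certificates for the root domain and the subdomain, in the cluster region.
aws acm request-certificate --domain-name "*.${ROOT_DOMAIN}" --validation-method DNS --region ${CLUSTER_REGION}
aws acm request-certificate --domain-name "*.${SUBDOMAIN}" --validation-method DNS --region ${CLUSTER_REGION}
```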
@@ -86,7 +95,9 @@ If you choose DNS validation for the validation of the certificates, you will be
```bash
printf 'certArn='$certArn'' > awsconfigs/common/istio-ingress/overlays/https/params.env
```
### Configure Load Balancer controller
### Configure Load Balancer Controller

> Important: Skip this step if you are using a Terraform deployment since the AWS Load Balancer Controller is installed by default unless you set `enable_aws_load_balancer_controller = false`.

Set up resources required for the Load Balancer controller:

@@ -103,6 +114,7 @@ Set up resources required for the Load Balancer controller:
```
- `kubernetes.io/role/internal-elb`. Add this tag only to private subnets.
- `kubernetes.io/role/elb`. Add this tag only to public subnets.
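If your subnets are missing these tags, they can be added with the AWS CLI. A sketch with placeholder subnet IDs:

```bash
# Substitute the private and public subnet IDs used by your cluster.
aws ec2 create-tags --resources subnet-0123456789abcdef0 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1   # private subnet
aws ec2 create-tags --resources subnet-0fedcba9876543210 \
  --tags Key=kubernetes.io/role/elb,Value=1            # public subnet
```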

1. The Load balancer controller uses [IAM roles for service accounts](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html)(IRSA) to access AWS services. An OIDC provider must exist for your cluster to use IRSA. Create an OIDC provider and associate it with your EKS cluster by running the following command if your cluster doesn’t already have one:
```bash
eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} --region ${CLUSTER_REGION} --approve
@@ -113,15 +125,30 @@ Set up resources required for the Load Balancer controller:
export LBC_POLICY_ARN=$(aws iam create-policy --policy-name $LBC_POLICY_NAME --policy-document file://awsconfigs/infra_configs/iam_alb_ingress_policy.json --output text --query 'Policy.Arn')
eksctl create iamserviceaccount --name aws-load-balancer-controller --namespace kube-system --cluster ${CLUSTER_NAME} --region ${CLUSTER_REGION} --attach-policy-arn ${LBC_POLICY_ARN} --override-existing-serviceaccounts --approve
```

1. Configure the parameters for [load balancer controller](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/common/aws-alb-ingress-controller/base/params.env) with the cluster name.
```bash
printf 'clusterName='$CLUSTER_NAME'' > awsconfigs/common/aws-alb-ingress-controller/base/params.env
```

### Build Manifests and deploy components
Run the following command to build and install the components specified in the Load Balancer [kustomize](https://github.com/awslabs/kubeflow-manifests/blob/main/deployments/add-ons/load-balancer/kustomization.yaml) file.
### Install Load Balancer Controller

> Important: Skip this step if you are using a Terraform deployment since the AWS Load Balancer Controller is installed by default unless you set `enable_aws_load_balancer_controller = false`.

Run the following command to build and install the components specified in the Load Balancer controller [kustomize](https://github.com/awslabs/kubeflow-manifests/blob/main/awsconfigs/common/aws-alb-ingress-controller/base/kustomization.yaml) file.

```bash
while ! kustomize build deployments/add-ons/load-balancer | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 30; done
kustomize build awsconfigs/common/aws-alb-ingress-controller/base | kubectl apply -f -
kubectl wait --for condition=established crd/ingressclassparams.elbv2.k8s.aws
kustomize build awsconfigs/common/aws-alb-ingress-controller/base | kubectl apply -f -
```
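Before moving on, you can check that the controller is up. A quick check, assuming the default deployment name and labels:

```bash
kubectl wait --for=condition=Available deployment/aws-load-balancer-controller -n kube-system --timeout=120s
kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller
```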

### Create Ingress

Create an ingress that will use the certificate you specified in `certArn`.

```bash
kustomize build awsconfigs/common/istio-ingress/overlays/https | kubectl apply -f -
```
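The ALB takes a few minutes to provision. One way to watch for its address, assuming the ingress is created in the `istio-system` namespace as in this repository's manifests:

```bash
kubectl get ingress -n istio-system --watch
```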

### Update the domain with ALB address
@@ -140,6 +167,8 @@ while ! kustomize build deployments/add-ons/load-balancer | kubectl apply -f -;

### Automated script

> Important: Terraform deployment users should not follow these Automated setup instructions and should follow the [Manual setup instructions](#create-load-balancer).

1. Install dependencies for the script
```bash
cd tests/e2e
@@ -198,6 +227,8 @@ while ! kustomize build deployments/add-ons/load-balancer | kubectl apply -f -;

## Clean up

> Important: Terraform deployment users should not follow these clean up steps and should manually delete resources created while following the [Manual setup instructions](#create-load-balancer).

To delete the resources created in this guide, run the following commands from the root of your repository:
> Note: Make sure that you have the configuration file created by the script in `tests/e2e/utils/load_balancer/config.yaml`. If you did not use the script, plug in the name, ARN, or ID of the resources that you created in the configuration file by referring to the sample in Step 4 of the [previous section](#automated-script).
```bash
25 changes: 20 additions & 5 deletions website/content/en/docs/add-ons/storage/efs/guide.md
@@ -6,6 +6,8 @@ weight = 10

This guide describes how to use Amazon EFS as Persistent storage on top of an existing Kubeflow deployment.

> Note: For Terraform deployments, steps that should be skipped are marked with a note below.
## 1.0 Prerequisites
For this guide, we assume that you already have an EKS Cluster with Kubeflow installed. The EFS CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. See the [deployment options]({{< ref "/docs/deployment" >}}) and [general prerequisites]({{< ref "/docs/deployment/vanilla/guide.md" >}}) for more information.

@@ -37,9 +39,18 @@ export CLAIM_NAME=<efs-claim>

## 2.0 Set up EFS

#### Setup for Manifest deployments

You can either use Automated or Manual setup to set up the resources required. If you choose the manual route, you get another choice between **static and dynamic provisioning**, so pick whichever suits you. On the other hand, for the automated script we currently only support **dynamic provisioning**. Whichever combination you pick, be sure to continue picking the appropriate sections through the rest of this guide.

#### Setup for Terraform deployments

Follow the Manual setup to set up the resources required. As part of the Manual setup, you get another choice between **static and dynamic provisioning**, so pick whichever suits you.

### 2.1 [Option 1] Automated setup

> Important: Terraform deployment users should not follow these Automated setup instructions and should follow the [Manual setup instructions](#22-option-2-manual-setup).
The script automates all the manual resource creation steps but is currently only available for the **Dynamic Provisioning** option.
It performs the required cluster configuration, creates an EFS file system, and creates a storage class for dynamic provisioning. Once done, move to section 3.0.
1. Run the following commands from the `tests/e2e` directory:
@@ -80,7 +91,11 @@ If you prefer to manually setup each component then you can follow this manual g
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
```

#### 1. Install the EFS CSI driver
#### 1. Driver install and IAM configuration

> Important: Skip this step if you are using a Terraform deployment, since the EFS CSI driver is installed by default unless you set `enable_aws_efs_csi_driver = false`.
##### 1.1 Install the EFS CSI driver
We recommend installing the EFS CSI Driver v1.5.4 directly from the [aws-efs-csi-driver GitHub repo](https://github.com/kubernetes-sigs/aws-efs-csi-driver) as follows:

```bash
@@ -95,7 +110,7 @@ NAME ATTACHREQUIRED PODINFOONMOUNT MODES AGE
efs.csi.aws.com false false Persistent 5d17h
```

#### 2. Create the IAM Policy for the CSI driver
##### 1.2. Create the IAM Policy for the CSI driver
The CSI driver's service account (created during installation) requires IAM permission to make calls to AWS APIs on your behalf. Here, we will be annotating the Service Account `efs-csi-controller-sa` with an IAM Role which has the required permissions.

1. Download the IAM policy document from GitHub as follows.
@@ -129,15 +144,15 @@ eksctl create iamserviceaccount \
kubectl describe -n kube-system serviceaccount efs-csi-controller-sa
```

#### 3. Manually create an instance of the EFS filesystem
#### 2. Manually create an instance of the EFS filesystem
Please refer to the official [AWS EFS CSI Document](https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-filesystem) for detailed instructions on creating an EFS filesystem.

> Note: For this guide, we assume that you are creating your EFS Filesystem in the same VPC as your EKS Cluster.
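If you prefer the CLI, a minimal sketch of creating a file system and a mount target follows; the subnet and security group IDs are placeholders, and the official document above covers the full security group setup:

```bash
# Create the file system and capture its id.
file_system_id=$(aws efs create-file-system \
  --region $CLUSTER_REGION \
  --performance-mode generalPurpose \
  --encrypted \
  --tags Key=Name,Value=kubeflow-efs \
  --query 'FileSystemId' --output text)

# Create one mount target per subnet used by your nodes.
aws efs create-mount-target \
  --file-system-id $file_system_id \
  --subnet-id subnet-0123456789abcdef0 \
  --security-groups sg-0123456789abcdef0
```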
#### Choose between dynamic and static provisioning
In the following section, you have to choose between setting up [dynamic provisioning](https://kubernetes.io/docs/concepts/storage/dynamic-provisioning/) or setting up static provisioning.

#### 4. [Option 1] Dynamic provisioning
#### 3. [Option 1] Dynamic provisioning
1. Use the `$file_system_id` you recorded in section 3 above or use the AWS Console to get the filesystem id of the EFS file system you want to use. Now edit the `dynamic-provisioning/sc.yaml` file by replacing `<YOUR_FILE_SYSTEM_ID>` with your `fs-xxxxxx` file system id. You can also change it using the following command:
```bash
file_system_id=$file_system_id yq e '.parameters.fileSystemId = env(file_system_id)' -i $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/sc.yaml
@@ -161,7 +176,7 @@ kubectl apply -f $GITHUB_STORAGE_DIR/efs/dynamic-provisioning/pvc.yaml

Note: The `StorageClass` is a cluster-scoped resource, which means we only need to do this step once per cluster.

#### 4. [Option 2] Static Provisioning
#### 3. [Option 2] Static Provisioning
Using [this sample](https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/multiple_pods), we provided the required spec files in the sample subdirectory. However, you can create the PVC another way.

1. Use the `$file_system_id` you recorded in section 3 above or use the AWS Console to get the filesystem id of the EFS file system you want to use. Now edit the last line of the static-provisioning/pv.yaml file to specify the `volumeHandle` field to point to your EFS filesystem. Replace `$file_system_id` if it is not already set.
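If you prefer to script this edit in the same style as the dynamic provisioning example, something like the following may work; the `.spec.csi.volumeHandle` path is an assumption about the sample `pv.yaml` layout:

```bash
file_system_id=$file_system_id yq e '.spec.csi.volumeHandle = env(file_system_id)' -i $GITHUB_STORAGE_DIR/efs/static-provisioning/pv.yaml
```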
24 changes: 20 additions & 4 deletions website/content/en/docs/add-ons/storage/fsx-for-lustre/guide.md
@@ -6,6 +6,8 @@ weight = 20

This guide describes how to use Amazon FSx as Persistent storage on top of an existing Kubeflow deployment.

> Note: For Terraform deployments, steps that should be skipped are marked with a note below.
## 1.0 Prerequisites
For this guide, we assume that you already have an EKS Cluster with Kubeflow installed. The FSx CSI Driver can be installed and configured as a separate resource on top of an existing Kubeflow deployment. See the [deployment options]({{< ref "/docs/deployment" >}}) and [general prerequisites]({{< ref "/docs/deployment/vanilla/guide.md" >}}) for more information.

@@ -36,9 +38,19 @@ export CLAIM_NAME=<fsx-claim>
```

## 2.0 Setup FSx for Lustre

#### Setup for Manifest deployments

You can either use Automated or Manual setup. We currently only support **Static provisioning** for FSx.

#### Setup for Terraform deployments

Follow the Manual setup. We currently only support **Static provisioning** for FSx.

### 2.1 [Option 1] Automated setup

> Important: Terraform deployment users should not follow these Automated setup instructions and should follow the [Manual setup instructions](#22-option-2-manual-setup).
The script automates all the manual resource creation steps but is currently only available for the **Static Provisioning** option.
It performs the required cluster configuration, creates an FSx file system, and creates a storage class for static provisioning. Once done, move to section 3.0.
1. Run the following commands from the `tests/e2e` directory:
@@ -74,7 +86,11 @@ The script applies some default values for the file system name, performance mode
### 2.2 [Option 2] Manual setup
If you prefer to manually set up each component, you can follow this manual guide.

#### 1. Install the FSx CSI Driver
#### 1. Driver install and IAM configuration

> Important: Skip this step if you are using a Terraform deployment, since the FSx CSI driver is installed by default unless you set `enable_aws_fsx_csi_driver = false`.
##### 1. Install the FSx CSI Driver
We recommend installing the FSx CSI Driver v0.9.0 directly from the [aws-fsx-csi-driver GitHub repository](https://github.com/kubernetes-sigs/aws-fsx-csi-driver) as follows:

```bash
@@ -89,7 +105,7 @@ NAME ATTACHREQUIRED PODINFOONMOUNT MODES AGE
fsx.csi.aws.com false false Persistent 14s
```

#### 2. Create the IAM Policy for the CSI Driver
##### 2. Create the IAM Policy for the CSI Driver
The CSI driver's service account (created during installation) requires IAM permission to make calls to AWS APIs on your behalf. Here, we will be annotating the Service Account `fsx-csi-controller-sa` with an IAM Role which has the required permissions.

1. Create the policy using the json file provided as follows:
@@ -117,12 +133,12 @@ eksctl create iamserviceaccount \
kubectl describe -n kube-system serviceaccount fsx-csi-controller-sa
```

#### 3. Create an instance of the FSx Filesystem
#### 2. Create an instance of the FSx Filesystem
Please refer to the official [AWS FSx CSI documentation](https://docs.aws.amazon.com/fsx/latest/LustreGuide/getting-started-step1.html) for detailed instructions on creating an FSx filesystem.

Note: For this guide, we assume that you are creating your FSx Filesystem in the same VPC as your EKS Cluster.
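For reference, a minimal CLI sketch of creating a scratch file system; the subnet and security group IDs are placeholders and the capacity and deployment type are illustrative:

```bash
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --lustre-configuration DeploymentType=SCRATCH_2 \
  --tags Key=Name,Value=kubeflow-fsx
```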

#### 4. Static provisioning
#### 3. Static provisioning
[Using this sample from official Kubeflow Docs](https://www.kubeflow.org/docs/distributions/aws/customizing-aws/storage/#amazon-fsx-for-lustre)

1. Use the AWS Console to get the filesystem id of the FSx volume you want to use. You could also use the following command to list all the volumes available in your region. Either way, make sure that `file_system_id` is set.