
Merge pull request #20 from mozilla/issue-19
Various stability enhancements
motin authored Aug 2, 2019
2 parents c996054 + dd5a2f0 commit 741c5e7
Showing 6 changed files with 104 additions and 67 deletions.
111 changes: 75 additions & 36 deletions deployment/gcp/README.md
@@ -19,7 +19,7 @@ For the remainder of these instructions, you are assumed to be in the `deploymen
export PROJECT="foo-sandbox"
```

## Provision GCP Resources
## (One time) Provision GCP Resources

### Configure the GCP Project

@@ -31,7 +31,7 @@ export PROJECT="foo-sandbox"

### Setup GKE Cluster

The following command will create a zonal GKE cluster with [preemptible](https://cloud.google.com/preemptible-vms/) [n1-highcpu-16](https://cloud.google.com/compute/all-pricing) nodes ($0.1200/node/h).
The following command will create a zonal GKE cluster with [n1-highcpu-16](https://cloud.google.com/compute/all-pricing) nodes ($0.5672/node/h) with [IP-Alias enabled](https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips#creating_a_new_cluster_with_ip_aliases) (makes it a bit easier to connect to managed Redis instances from the cluster).
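As a rough sanity check on the hourly rates quoted above, a back-of-envelope cost sketch (the node count and crawl duration here are made-up examples, and GCP prices drift over time):

```shell
# 15 n1-highcpu-16 nodes running a 24h crawl at the quoted on-demand rate
awk 'BEGIN { printf "%.2f\n", 15 * 24 * 0.5672 }'   # -> 204.19 (USD)
# the same crawl on preemptible nodes
awk 'BEGIN { printf "%.2f\n", 15 * 24 * 0.12 }'     # -> 43.20 (USD)
```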

You may want to adjust fields within `./start_gke_cluster.sh` where appropriate such as:
- num-nodes, min-nodes, max-nodes
@@ -42,6 +42,12 @@ You may want to adjust fields within `./start_gke_cluster.sh` where appropriate
./start_gke_cluster.sh crawl1
```

Note: For testing, you can use [preemptible](https://cloud.google.com/preemptible-vms/) nodes ($0.1200/node/h) instead:

```
./start_gke_cluster.sh crawl1 --preemptible
```

### Fetch kubernetes cluster credentials for use with `kubectl`

```
@@ -50,17 +56,24 @@ gcloud container clusters get-credentials crawl1

This allows subsequent `kubectl` commands to interact with our cluster (using the context `gke_{PROJECT}_{ZONE}_{CLUSTER_NAME}`).

## Build and push Docker image to GCR
## (Optional) Configure sentry credentials

(Optional) If none of [the pre-built OpenWPM Docker images](https://hub.docker.com/r/openwpm/openwpm/tags) is sufficient:
Set the Sentry DSN as a kubectl secret (change `foo` below):
```
cd ../openwpm-crawler/OpenWPM; docker build -t gcr.io/$PROJECT/openwpm .; cd -
gcloud auth configure-docker
docker push gcr.io/$PROJECT/openwpm
kubectl create secret generic sentry-config \
--from-literal=sentry_dsn=foo
```
Remember to change the `crawl.yaml` to point to `image: gcr.io/$PROJECT/openwpm`.

## Allow the cluster to access AWS S3
To run crawls without Sentry, remove the following from the crawl config after it has been generated (in a later step):
```
- name: SENTRY_DSN
valueFrom:
secretKeyRef:
name: sentry-config
key: sentry_dsn
```

## (One time) Allow the cluster to access AWS S3

Make sure that your AWS credentials are stored in `~/.aws/credentials` as per:

@@ -75,12 +88,40 @@ Then run:
./aws_credentials_as_kubectl_secrets.sh
```

## Build and push Docker image to GCR

(Optional) If none of [the pre-built OpenWPM Docker images](https://hub.docker.com/r/openwpm/openwpm/tags) is sufficient:
```
cd ../openwpm-crawler/OpenWPM; docker build -t gcr.io/$PROJECT/openwpm .; cd -
gcloud auth configure-docker
docker push gcr.io/$PROJECT/openwpm
```
Remember to change the `crawl.yaml` to point to `image: gcr.io/$PROJECT/openwpm`.

## Deploy the Redis server used for the work queue

Launch a 1GB Basic tier Google Cloud Memorystore for Redis instance ($0.049/GB/hour):
```
gcloud redis instances create crawlredis --size=1 --region=us-central1 --redis-version=redis_4_0
```

Launch a temporary redis-box pod in the cluster, used to interact with the Redis instance above:
```
kubectl apply -f redis.yaml
kubectl apply -f redis-box.yaml
```

Use the output of the following command:
```
gcloud redis instances describe crawlredis --region=us-central1
```
... to set the corresponding env var:

```
export REDIS_HOST=10.0.0.3
```

(See https://cloud.google.com/memorystore/docs/redis/connecting-redis-instance for more information.)

## Adding sites to be crawled to the queue

Create a comma-separated site list as per:
@@ -117,32 +158,25 @@ cd ../../; python -m utilities.get_sampled_sites; cd -

Since each crawl is unique, you need to configure your `crawl.yaml` deployment configuration. We have provided a template to start from:
```
cp crawl.tmpl.yaml crawl.yaml
envsubst < ./crawl.tmpl.yaml > crawl.yaml
```

- Update `crawl.yaml`. This may include:
- spec.parallelism
- spec.containers.image
- spec.containers.env
- spec.containers.resources
Use of `envsubst` has already replaced `$REDIS_HOST` with the value of the env var set previously, but you may still want to adapt `crawl.yaml`:
- spec.parallelism
- spec.containers.image
- spec.containers.env
- spec.containers.resources

Note: A useful naming convention for `CRAWL_DIRECTORY` is `YYYY-MM-DD_description_of_the_crawl`.
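The convention above can be generated mechanically; the description suffix here is just a placeholder:

```shell
# date-prefixed crawl directory name; "_top1k_baseline" is an example description
export CRAWL_DIRECTORY="$(date +%Y-%m-%d)_top1k_baseline"
echo "$CRAWL_DIRECTORY"
```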

## (Optional) Configure sentry credentials
### Scale up the cluster before running the crawl

Set the Sentry DSN as a kubectl secret (change `foo` below):
```
kubectl create secret generic sentry-config \
--from-literal=sentry_dsn=foo
```
Some nodes, including the master node, can become temporarily unavailable during cluster auto-scaling operations. When a larger crawl is started, this can cause disruptions for a couple of minutes after the crawl has started.

To avoid this, set the number of nodes (to, say, 15) before starting the crawl:

To run crawls without Sentry, remove the following from the crawl config:
```
- name: SENTRY_DSN
valueFrom:
secretKeyRef:
name: sentry-config
key: sentry_dsn
gcloud container clusters resize crawl1 --num-nodes=15
```

## Start the crawl
@@ -159,10 +193,9 @@ Note that for the remainder of these instructions, `metadata.name` is assumed to

#### Queue status

Open a temporary instance and launch redis-cli:
Launch redis-cli:
```
kubectl attach temp -c temp -i -t || kubectl run --generator=run-pod/v1 -i --tty temp --image redis --command "/bin/bash"
redis-cli -h redis
kubectl exec -it redis-box -- sh -c "redis-cli -h $REDIS_HOST"
```

Current length of the queue:
@@ -180,7 +213,7 @@ Contents of the queue:
lrange crawl-queue 0 -1
```

#### OpenWPM progress and logs
#### Crawl progress and logs

Check out the [GCP GKE Console](https://console.cloud.google.com/kubernetes/workload)

@@ -210,17 +243,23 @@ kubectl describe job openwpm-crawl

The crawl data will end up in Parquet format in the S3 bucket that you configured.

### Clean up created pods, services and local artifacts
### Clean up created pods, instances and local artifacts

```
kubectl delete -f redis.yaml
kubectl delete -f crawl.yaml
kubectl delete pod temp
gcloud redis instances delete crawlredis --region=us-central1
kubectl delete -f redis-box.yaml
```

### Decrease the size of the cluster while it is not in use

While the cluster has autoscaling activated, and thus should scale down when not in use, it can sometimes be slow to do this or fail to do this adequately. In these instances, it is a good idea to go to `Clusters -> crawl1 -> default-pool -> Edit` and set the number of instances to 0 or 1 manually. It will still scale up when the next crawl is executed.
While the cluster has auto-scaling activated and thus should scale down when not in use, it can sometimes be slow to do so or fail to scale down fully. In these instances, it is a good idea to set the number of nodes to 0 or 1 manually:

```
gcloud container clusters resize crawl1 --num-nodes=1
```

It will still auto-scale up when the next crawl is executed.

### Deleting the GKE Cluster

5 changes: 5 additions & 0 deletions deployment/gcp/crawl.tmpl.yaml
@@ -5,6 +5,7 @@ metadata:
spec:
# adjust for parallelism
parallelism: 100
backoffLimit: 10000 # to avoid crawls failing due to sporadic worker crashes
template:
metadata:
name: openwpm-crawl
@@ -27,6 +28,8 @@ spec:
key: aws_secret_access_key
- name: NUM_BROWSERS
value: '1'
- name: REDIS_HOST
value: '$REDIS_HOST'
- name: REDIS_QUEUE_NAME
value: 'crawl-queue'
- name: CRAWL_DIRECTORY
@@ -56,6 +59,8 @@ spec:
# these are taken at face value by the autoscaler, so they should match the actual
# resources required by any single instance/container as closely as possible
# see: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
# tip: observe `kubectl top nodes` during auto-scaled crawls to get an idea of how
# resources are being utilized
requests:
cpu: 750m
limits:
13 changes: 13 additions & 0 deletions deployment/gcp/redis-box.yaml
@@ -0,0 +1,13 @@
apiVersion: v1
kind: Pod
metadata:
name: redis-box
labels:
app: redis-box
spec:
containers:
- name: redis-box
image: redis:4
# avoids starting the redis-server
command: ["tail"]
args: ["-f", "/dev/null"]
26 changes: 0 additions & 26 deletions deployment/gcp/redis.yaml

This file was deleted.

7 changes: 4 additions & 3 deletions deployment/gcp/start_gke_cluster.sh
@@ -2,10 +2,11 @@
set -e

if [[ $# -lt 1 ]]; then
echo "Usage: start_gke_cluster.sh cluster_name" >&2
echo "Usage: start_gke_cluster.sh cluster_name [additional_args...]" >&2
exit 1
fi
CLUSTER_NAME=$1
ADDITIONAL_ARGS="${*:2}"

gcloud container clusters create $CLUSTER_NAME \
--zone us-central1-f \
@@ -17,5 +18,5 @@ gcloud container clusters create $CLUSTER_NAME \
--min-nodes=0 \
--max-nodes=30 \
--enable-autoscaling \
--min-cpu-platform="Intel Broadwell" \
--preemptible
--enable-ip-alias \
--min-cpu-platform="Intel Broadwell" $ADDITIONAL_ARGS
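To see how the new `ADDITIONAL_ARGS="${*:2}"` line forwards extra flags, here is a small self-contained sketch (this relies on bash's `${*:offset}` slicing; the flags shown are just examples):

```shell
demo() {
  CLUSTER_NAME=$1
  # everything after the cluster name, joined into one string
  ADDITIONAL_ARGS="${*:2}"
  echo "name=$CLUSTER_NAME extra=$ADDITIONAL_ARGS"
}

demo crawl1 --preemptible --num-nodes=5
# -> name=crawl1 extra=--preemptible --num-nodes=5
```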
9 changes: 7 additions & 2 deletions deployment/load_site_list_into_redis.sh
@@ -1,6 +1,11 @@
#!/usr/bin/env bash
set -e

if [[ "$REDIS_HOST" == "" ]]; then
echo "The env var \$REDIS_HOST needs to be set with the IP/hostname of the managed Redis instance" >&2
exit 1
fi

if [[ $# -lt 2 ]]; then
echo "Usage: load_site_list_into_redis.sh redis_queue_name site_list_csv" >&2
exit 1
@@ -19,7 +24,7 @@ echo "DEL $REDIS_QUEUE_NAME:processing" >> joblist.txt
# awk #1 = Add the RPUSH command with the site value within single quotes
cat "$SITE_LIST_CSV" | sed '1!G;h;$!d' | sed "s/'/\\\'/g" | awk -F ',' 'FNR > 0 {print "RPUSH '$REDIS_QUEUE_NAME' '\''"$1","$2"'\''"}' >> joblist.txt

kubectl cp joblist.txt redis-master:/tmp/joblist.txt
kubectl exec redis-master -- sh -c "cat /tmp/joblist.txt | redis-cli --pipe"
kubectl cp joblist.txt redis-box:/tmp/joblist.txt
kubectl exec redis-box -- sh -c "cat /tmp/joblist.txt | redis-cli -h $REDIS_HOST --pipe"

rm joblist.txt
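A dry run of the list-building pipeline above on a made-up two-line CSV (no Redis needed) shows what ends up in `joblist.txt`; note that the first `sed` reverses the line order:

```shell
REDIS_QUEUE_NAME=crawl-queue
printf '1,example.com\n2,example.org\n' > sites.csv
# sed '1!G;h;$!d' reverses the lines; the second sed escapes single quotes;
# awk wraps each rank,site pair in an RPUSH command
cat sites.csv | sed '1!G;h;$!d' | sed "s/'/\\\'/g" \
  | awk -F ',' 'FNR > 0 {print "RPUSH '$REDIS_QUEUE_NAME' '\''"$1","$2"'\''"}' > joblist.txt
cat joblist.txt
# -> RPUSH crawl-queue '2,example.org'
#    RPUSH crawl-queue '1,example.com'
```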
