Various stability enhancements #20

Merged
merged 10 commits into from
Aug 2, 2019
111 changes: 75 additions & 36 deletions deployment/gcp/README.md
@@ -19,7 +19,7 @@ For the remainder of these instructions, you are assumed to be in the `deploymen
export PROJECT="foo-sandbox"
```

## Provision GCP Resources
## (One time) Provision GCP Resources

### Configure the GCP Project

@@ -31,7 +31,7 @@ export PROJECT="foo-sandbox"

### Setup GKE Cluster

The following command will create a zonal GKE cluster with [preemptible](https://cloud.google.com/preemptible-vms/) [n1-highcpu-16](https://cloud.google.com/compute/all-pricing) nodes ($0.1200/node/h).
The following command will create a zonal GKE cluster with [n1-highcpu-16](https://cloud.google.com/compute/all-pricing) nodes ($0.5672/node/h) with [IP-Alias enabled](https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips#creating_a_new_cluster_with_ip_aliases) (makes it a bit easier to connect to managed Redis instances from the cluster).

You may want to adjust fields within `./start_gke_cluster.sh` where appropriate such as:
- num-nodes, min-nodes, max-nodes
@@ -42,6 +42,12 @@ You may want to adjust fields within `./start_gke_cluster.sh` where appropriate
./start_gke_cluster.sh crawl1
```

Note: For testing, you can use [preemptible](https://cloud.google.com/preemptible-vms/) nodes ($0.1200/node/h) instead:

```
./start_gke_cluster.sh crawl1 --preemptible
```

### Fetch Kubernetes cluster credentials for use with `kubectl`

```
@@ -50,17 +56,24 @@ gcloud container clusters get-credentials crawl1

This allows subsequent `kubectl` commands to interact with our cluster (using the context `gke_{PROJECT}_{ZONE}_{CLUSTER_NAME}`)
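
The context name follows a fixed pattern, so it can be composed from the project, zone and cluster name. A minimal sketch, assuming the zone `us-central1-f` used by `./start_gke_cluster.sh`; `gke_context_name` is a hypothetical helper, not part of this repository:

```shell
# Hypothetical helper: compose the kubectl context name that
# `gcloud container clusters get-credentials` registers for a GKE cluster.
gke_context_name() {
  local project="$1" zone="$2" cluster="$3"
  printf 'gke_%s_%s_%s\n' "$project" "$zone" "$cluster"
}

gke_context_name "foo-sandbox" "us-central1-f" "crawl1"
# prints: gke_foo-sandbox_us-central1-f_crawl1
```

If you work with several clusters, something like `kubectl config use-context "$(gke_context_name "$PROJECT" us-central1-f crawl1)"` should switch back to this one.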

## Build and push Docker image to GCR
## (Optional) Configure sentry credentials

(Optional) If none of [the pre-built OpenWPM Docker images](https://hub.docker.com/r/openwpm/openwpm/tags) is sufficient:
Set the Sentry DSN as a kubectl secret (change `foo` below):
```
cd ../openwpm-crawler/OpenWPM; docker build -t gcr.io/$PROJECT/openwpm .; cd -
gcloud auth configure-docker
docker push gcr.io/$PROJECT/openwpm
kubectl create secret generic sentry-config \
--from-literal=sentry_dsn=foo
```
Remember to change the `crawl.yaml` to point to `image: gcr.io/$PROJECT/openwpm`.

## Allow the cluster to access AWS S3
To run crawls without Sentry, remove the following from the crawl config after it has been generated below:
```
- name: SENTRY_DSN
  valueFrom:
    secretKeyRef:
      name: sentry-config
      key: sentry_dsn
```

## (One time) Allow the cluster to access AWS S3

Make sure that your AWS credentials are stored in `~/.aws/credentials` as per:

@@ -75,12 +88,40 @@ Then run:
./aws_credentials_as_kubectl_secrets.sh
```

## Build and push Docker image to GCR

(Optional) If none of [the pre-built OpenWPM Docker images](https://hub.docker.com/r/openwpm/openwpm/tags) is sufficient:
```
cd ../openwpm-crawler/OpenWPM; docker build -t gcr.io/$PROJECT/openwpm .; cd -
gcloud auth configure-docker
docker push gcr.io/$PROJECT/openwpm
```
Remember to change the `crawl.yaml` to point to `image: gcr.io/$PROJECT/openwpm`.

## Deploy the redis server which we use for the work queue

Launch a 1GB Basic tier Google Cloud Memorystore for Redis instance ($0.049/GB/hour):
```
gcloud redis instances create crawlredis --size=1 --region=us-central1 --redis-version=redis_4_0
```

Deploy a temporary `redis-box` pod to the cluster, which we use to interact with the Redis instance above:
```
kubectl apply -f redis.yaml
kubectl apply -f redis-box.yaml
```

Read the instance's `host` value from the output of:
```
gcloud redis instances describe crawlredis --region=us-central1
```
... to set the corresponding env var:

```
export REDIS_HOST=10.0.0.3
```

(See https://cloud.google.com/memorystore/docs/redis/connecting-redis-instance for more information.)
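
To avoid copying the IP address by hand, the `host` field can be extracted from the describe output. A minimal sketch, assuming the default YAML-style output contains a top-level `host:` line; `extract_redis_host` is a hypothetical helper and the sample output below is abridged and illustrative only:

```shell
# Hypothetical helper: read `gcloud redis instances describe` output on stdin
# and print the value of its top-level `host:` field.
extract_redis_host() {
  awk '$1 == "host:" { print $2; exit }'
}

# Sample input for illustration only (assumed shape, abridged).
sample_describe_output='host: 10.0.0.3
name: projects/foo-sandbox/locations/us-central1/instances/crawlredis
port: 6379'

REDIS_HOST="$(printf '%s\n' "$sample_describe_output" | extract_redis_host)"
export REDIS_HOST
echo "REDIS_HOST=$REDIS_HOST"
```

Against a real instance, pipe the actual command into the helper: `gcloud redis instances describe crawlredis --region=us-central1 | extract_redis_host`. Passing `--format='value(host)'` to `gcloud` should give the same value without any parsing.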

## Adding sites to be crawled to the queue

Create a comma-separated site list as per:
@@ -117,32 +158,25 @@ cd ../../; python -m utilities.get_sampled_sites; cd -

Since each crawl is unique, you need to configure your `crawl.yaml` deployment configuration. We have provided a template to start from:
```
cp crawl.tmpl.yaml crawl.yaml
envsubst < ./crawl.tmpl.yaml > crawl.yaml
```

- Update `crawl.yaml`. This may include:
- spec.parallelism
- spec.containers.image
- spec.containers.env
- spec.containers.resources
Use of `envsubst` has already replaced `$REDIS_HOST` with the value of the env var set previously, but you may still want to adapt `crawl.yaml`:
- spec.parallelism
- spec.containers.image
- spec.containers.env
- spec.containers.resources

Note: A useful naming convention for `CRAWL_DIRECTORY` is `YYYY-MM-DD_description_of_the_crawl`.
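
`envsubst` replaces unset variables with an empty string, so it is worth checking `REDIS_HOST` before rendering and checking the result afterwards. A minimal sketch; both function names are hypothetical and not part of this repository:

```shell
# Hypothetical guard: refuse to render the template while REDIS_HOST is empty.
require_redis_host() {
  if [ -z "${REDIS_HOST:-}" ]; then
    echo "REDIS_HOST is not set; run the export step first" >&2
    return 1
  fi
}

# Hypothetical check: succeeds if the given file still contains a literal
# $REDIS_HOST placeholder, i.e. the substitution did not happen.
has_unsubstituted_placeholder() {
  grep -q -F '$REDIS_HOST' "$1"
}
```

A possible way to combine them: `require_redis_host && envsubst < ./crawl.tmpl.yaml > crawl.yaml`, then treat a successful `has_unsubstituted_placeholder crawl.yaml` as an error. With GNU `envsubst`, passing `'$REDIS_HOST'` as an argument limits substitution to that one variable, which protects any other `$` references in the template.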

## (Optional) Configure sentry credentials
### Scale up the cluster before running the crawl

Set the Sentry DSN as a kubectl secret (change `foo` below):
```
kubectl create secret generic sentry-config \
--from-literal=sentry_dsn=foo
```
Some nodes, including the master node, can become temporarily unavailable during cluster auto-scaling operations. When larger new crawls are started, this can cause disruptions for a couple of minutes after the crawl has started.

To avoid this, set the number of nodes (to, say, 15) before starting the crawl:

To run crawls without Sentry, remove the following from the crawl config:
```
- name: SENTRY_DSN
  valueFrom:
    secretKeyRef:
      name: sentry-config
      key: sentry_dsn
gcloud container clusters resize crawl1 --num-nodes=15
```

## Start the crawl
@@ -159,10 +193,9 @@ Note that for the remainder of these instructions, `metadata.name` is assumed to

#### Queue status

Open a temporary instance and launch redis-cli:
Launch redis-cli:
```
kubectl attach temp -c temp -i -t || kubectl run --generator=run-pod/v1 -i --tty temp --image redis --command "/bin/bash"
redis-cli -h redis
kubectl exec -it redis-box -- sh -c "redis-cli -h $REDIS_HOST"
```

Current length of the queue:
@@ -180,7 +213,7 @@ Contents of the queue:
lrange crawl-queue 0 -1
```

#### OpenWPM progress and logs
#### Crawl progress and logs

Check out the [GCP GKE Console](https://console.cloud.google.com/kubernetes/workload)

@@ -210,17 +243,23 @@ kubectl describe job openwpm-crawl

The crawl data will end up in Parquet format in the S3 bucket that you configured.

### Clean up created pods, services and local artifacts
### Clean up created pods, instances and local artifacts

```
kubectl delete -f redis.yaml
kubectl delete -f crawl.yaml
kubectl delete pod temp
gcloud redis instances delete crawlredis --region=us-central1
kubectl delete -f redis-box.yaml
```

### Decrease the size of the cluster while it is not in use

While the cluster has autoscaling activated, and thus should scale down when not in use, it can sometimes be slow to do this or fail to do this adequately. In these instances, it is a good idea to go to `Clusters -> crawl1 -> default-pool -> Edit` and set the number of instances to 0 or 1 manually. It will still scale up when the next crawl is executed.
While the cluster has auto-scaling activated, and thus should scale down when not in use, it can sometimes be slow to do this or fail to do this adequately. In these instances, it is a good idea to set the number of nodes to 0 or 1 manually:

```
gcloud container clusters resize crawl1 --num-nodes=1
```

It will still auto-scale up when the next crawl is executed.

### Deleting the GKE Cluster

5 changes: 5 additions & 0 deletions deployment/gcp/crawl.tmpl.yaml
@@ -5,6 +5,7 @@ metadata:
spec:
# adjust for parallelism
parallelism: 100
backoffLimit: 10000 # to avoid crawls failing due to sporadic worker crashes
template:
metadata:
name: openwpm-crawl
@@ -27,6 +28,8 @@ spec:
key: aws_secret_access_key
- name: NUM_BROWSERS
value: '1'
- name: REDIS_HOST
value: '$REDIS_HOST'
- name: REDIS_QUEUE_NAME
value: 'crawl-queue'
- name: CRAWL_DIRECTORY
@@ -56,6 +59,8 @@ spec:
# these are taken at face value by the autoscaler, so they should match actual
# resources required by any single instance/container as closely as possible
# see: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler
# tip: observe `kubectl top nodes` during auto-scaled crawls to get an idea of how
# resources are being utilized
requests:
cpu: 750m
limits:
13 changes: 13 additions & 0 deletions deployment/gcp/redis-box.yaml
@@ -0,0 +1,13 @@
apiVersion: v1
kind: Pod
metadata:
  name: redis-box
  labels:
    app: redis-box
spec:
  containers:
    - name: redis-box
      image: redis:4
      # avoids starting the redis-server
      command: ["tail"]
      args: ["-f", "/dev/null"]
26 changes: 0 additions & 26 deletions deployment/gcp/redis.yaml

This file was deleted.

7 changes: 4 additions & 3 deletions deployment/gcp/start_gke_cluster.sh
@@ -2,10 +2,11 @@
set -e

if [[ $# -lt 1 ]]; then
echo "Usage: start_gke_cluster.sh cluster_name" >&2
echo "Usage: start_gke_cluster.sh cluster_name additional_args" >&2
exit 1
fi
CLUSTER_NAME=$1
ADDITIONAL_ARGS="${*:2}"

gcloud container clusters create $CLUSTER_NAME \
--zone us-central1-f \
@@ -17,5 +18,5 @@ gcloud container clusters create $CLUSTER_NAME \
--min-nodes=0 \
--max-nodes=30 \
--enable-autoscaling \
--min-cpu-platform="Intel Broadwell" \
--preemptible
--enable-ip-alias \
--min-cpu-platform="Intel Broadwell" $ADDITIONAL_ARGS
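
The script collects the extra arguments into one string and expands it unquoted, which relies on word splitting and would break an argument that contains spaces. A minimal sketch of forwarding them with `"$@"` instead; `print_create_args` is a hypothetical helper that only prints the argument list (one per line) rather than calling `gcloud`, and it includes only a subset of the real flags:

```shell
# Hypothetical helper: print the gcloud arguments one per line.
print_create_args() {
  local cluster_name="$1"
  shift
  # "$@" keeps every remaining argument as a separate word, even with spaces.
  printf '%s\n' container clusters create "$cluster_name" --enable-ip-alias "$@"
}

print_create_args crawl1 --preemptible
```

In the real script the same idea would mean ending the `gcloud container clusters create` call with `"${@:2}"` in place of `$ADDITIONAL_ARGS`.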
9 changes: 7 additions & 2 deletions deployment/load_site_list_into_redis.sh
@@ -1,6 +1,11 @@
#!/usr/bin/env bash
set -e

if [[ "$REDIS_HOST" == "" ]]; then
echo "The env var REDIS_HOST needs to be set with the IP/hostname of the managed Redis instance" >&2
exit 1
fi

if [[ $# -lt 2 ]]; then
echo "Usage: load_site_list_into_redis.sh redis_queue_name site_list_csv" >&2
exit 1
@@ -19,7 +24,7 @@ echo "DEL $REDIS_QUEUE_NAME:processing" >> joblist.txt
# awk #1 = Add the RPUSH command with the site value within single quotes
cat "$SITE_LIST_CSV" | sed '1!G;h;$!d' | sed "s/'/\\\'/g" | awk -F ',' 'FNR > 0 {print "RPUSH '$REDIS_QUEUE_NAME' '\''"$1","$2"'\''"}' >> joblist.txt

kubectl cp joblist.txt redis-master:/tmp/joblist.txt
kubectl exec redis-master -- sh -c "cat /tmp/joblist.txt | redis-cli --pipe"
kubectl cp joblist.txt redis-box:/tmp/joblist.txt
kubectl exec redis-box -- sh -c "cat /tmp/joblist.txt | redis-cli -h $REDIS_HOST --pipe"

rm joblist.txt
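
The job-list pipeline in `load_site_list_into_redis.sh` can be tried in isolation, without a cluster. A minimal sketch that mirrors its three stages; `build_joblist` is a hypothetical helper, and it passes the queue name to `awk` with `-v` instead of splicing it into the program text:

```shell
# Hypothetical helper: print one RPUSH command per line of a "rank,site" CSV.
build_joblist() {
  local queue="$1" csv="$2"
  # sed #1 reverses the line order, sed #2 escapes single quotes,
  # awk wraps each "rank,site" pair in an RPUSH command.
  sed '1!G;h;$!d' "$csv" \
    | sed "s/'/\\\'/g" \
    | awk -F ',' -v q="$queue" '{ print "RPUSH " q " '\''" $1 "," $2 "'\''" }'
}
```

For a file containing `1,example.com` and `2,example.org`, `build_joblist crawl-queue sites.csv` emits the `RPUSH` line for `2,example.org` first, because the first `sed` stage reverses the input.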