Update GCP docs #59

Merged
merged 7 commits into from
Jan 25, 2024
22 changes: 22 additions & 0 deletions buildstockbatch/gcp/README.md
@@ -0,0 +1,22 @@
# BuildStock Batch on GCP

![Architecture diagram](/buildstockbatch/gcp/arch.svg)

BuildStock Batch runs on GCP in a few phases:

* Locally
- Build a Docker image that includes OpenStudio and BuildStock Batch.
- Push the Docker image to GCP Artifact Registry.
- Run sampling and split the generated buildings + upgrades into batches.
- Collect all the required input files (including downloading weather files)
and upload them to a Cloud Storage bucket.
- Kick off the Batch and Cloud Run jobs (described below), and wait for them to finish.

* In GCP Batch
- Run a job where each task runs one batch of simulations.
GCP Batch uses the Docker image to run OpenStudio on Compute Engine VMs.
- Raw output files are written to the bucket in Cloud Storage.

* In Cloud Run
- Run a job for post-processing steps. Also uses the Docker image.
- Aggregated output files are written to the bucket in Cloud Storage.
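
The batching step in the local phase can be sketched roughly as follows. This is only an illustration under stated assumptions: `split_into_batches`, the building IDs, and the upgrade values are hypothetical, and the actual splitting logic in BuildStock Batch may differ.

```python
from itertools import product


def split_into_batches(building_ids, upgrade_ids, n_batches):
    """Split every (building, upgrade) pair into at most n_batches lists.

    Hypothetical helper for illustration; not the library's implementation.
    """
    # One simulation per building/upgrade combination.
    sims = list(product(building_ids, upgrade_ids))
    # Ceiling division, so that no more than n_batches batches are produced.
    batch_size = max(1, -(-len(sims) // n_batches))
    return [sims[i:i + batch_size] for i in range(0, len(sims), batch_size)]


# 6 buildings x (baseline + 1 upgrade) = 12 simulations, split across 4 tasks.
batches = split_into_batches(range(1, 7), [None, 0], n_batches=4)
for task_index, batch in enumerate(batches):
    print(task_index, batch)
```

Each resulting batch would correspond to one task of the GCP Batch job described above.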
1 change: 1 addition & 0 deletions buildstockbatch/gcp/arch.svg
15 changes: 3 additions & 12 deletions buildstockbatch/gcp/gcp.py
@@ -5,16 +5,7 @@
~~~~~~~~~~~~~~~
This class contains the object & methods that allow for usage of the library with GCP Batch.

Architecture overview (these steps are split between GcpBatch and DockerBatchBase):
- Build a Docker image that includes OpenStudio and BuildStock Batch.
- Push the Docker image to GCP Artifact Registry.
- Run sampling, and split the generated buildings into batches.
- Collect all the required input files (including downloading weather files)
and upload them to Cloud Storage.
- Run a job on GCP Batch where each task runs one batch of simulations.
Uses the Docker image to run OpenStudio on Compute Engine VMs.
- Run a Cloud Run job for post-processing steps. Also uses the Docker image.
- Output files are written to a bucket in Cloud Storage.
See the README for an overview of the architecture.

:author: Robert LaThanh, Natalie Weires
:copyright: (c) 2023 by The Alliance for Sustainable Energy
@@ -468,7 +459,7 @@ def show_jobs(self):
"""
# GCP Batch job that runs the simulations
if job := self.get_existing_batch_job():
logger.info("Batch job")
logger.info("--------------- Batch job ---------------")
logger.info(f" Name: {job.name}")
logger.info(f" UID: {job.uid}")
logger.info(f" Status: {job.status.state.name}")
@@ -490,7 +481,7 @@ def show_jobs(self):
status = "Running"
if last_execution.completion_time:
status = "Completed"
logger.info("Post-processing Cloud Run job")
logger.info("----- Post-processing Cloud Run job -----")
logger.info(f" Name: {job.name}")
logger.info(f" Status of latest run ({last_execution.name}): {status}")
logger.debug(f"Full job info:\n{job}")
4 changes: 2 additions & 2 deletions buildstockbatch/gcp/main.tf
@@ -4,10 +4,10 @@
# terraform init
#
# To see what changes will be applied:
# terraform plan
# terraform plan -var="gcp_project=myproject"
#
# To apply those changes:
# terraform apply
# terraform apply -var="gcp_project=myproject"
#
# Optionally set variables:
# terraform apply -var="gcp_project=myproject" -var="bucket_name=mybucket" -var="region=us-east1"
70 changes: 53 additions & 17 deletions docs/installation.rst
@@ -246,39 +246,75 @@ Google Cloud Platform

Shared, one-time GCP setup
..........................
One-time GCP setup shared by all users.
One-time GCP setup that can be shared by multiple users.

1. If needed, create a GCP Project. The following steps will occur in that project.
2. `Create a repository`_ in Artifact Registry (to store Docker images).
3. `Create a Google Cloud Storage Bucket`_ (that will store simulation and postprocessing output).
Alternatively, each user can create and use their own bucket.
4. Create a Service Account. Alternatively, each user can create their own service account, or each
user can install the `gcloud CLI`_. The following documentation will assume use of a Service
2. Set up the following resources in your GCP project. You can either do this manually or
using Terraform.

* **Option 1**: Manual setup

* `Create a Google Cloud Storage Bucket`_ (that will store simulation and postprocessing output).
Alternatively, each user can create and use their own bucket.
* `Create a repository`_ in Artifact Registry (to store Docker images).
This is expected to be in the same region as the storage bucket.

* **Option 2**: Terraform

* Install `Terraform`_
* From the buildstockbatch/gcp/ directory, run the following with your chosen GCP project and region.
You can optionally specify the names of the storage bucket and Artifact Registry repository. See
``main.tf`` for more details.

::

terraform init
terraform apply -var="gcp_project=PROJECT" -var="region=REGION"

3. Optionally, create a shared Service Account. Alternatively, each user can create their own service account,
or each user can install the `gcloud CLI`_. The following documentation will assume use of a Service
Account.

.. _Create a repository:
https://cloud.google.com/artifact-registry/docs/repositories/create-repos
.. _Create a Google Cloud Storage Bucket:
https://cloud.google.com/storage/docs/creating-buckets
.. _gcloud CLI: https://cloud.google.com/sdk/docs/install
.. _Terraform: https://developer.hashicorp.com/terraform/tutorials/aws-get-started/install-cli


Per-developer setup
...................
One-time setup that each developer needs to do on the workstation from which they'll launch and
Per-user setup
..............
One-time setup that each user needs to do on the workstation from which they'll launch and
manage BuildStockBatch runs.

1. `Install Docker`_. This is needed by the script to manage Docker images (pull, push, etc).
1. Install `Docker`_. This is needed by the script to manage Docker images (pull, push, etc).
2. Get BuildStockBatch and set up a Python environment for it using the :ref:`python` instructions
above (i.e., create a Python virtual environment, activate the venv, and install buildstockbatch
to it).
3. Download/Clone ResStock or ComStock.
4. Create and download a `Service Account Key`_ for GCP authentication.
4. Set up GCP authentication

* **Option 1**: Create and download a `Service Account Key`_.

* Add the location of the key file as an environment variable; e.g.,
``export GOOGLE_APPLICATION_CREDENTIALS="$HOME/path/to/service-account-key.json"``. This can be
done at the command line (in which case it must be repeated in every shell session that
will run BuildStockBatch, because it only takes effect for that session), or added to a
shell startup script (in which case it will be available to all shell sessions).

* **Option 2**: Install the `Google Cloud CLI`_ and run the following:

::

gcloud config set project PROJECT
gcloud auth application-default login

gcloud auth login
gcloud auth configure-docker REGION-docker.pkg.dev


* Add the location of the key file as an environment variable; e.g.,
``export GOOGLE_APPLICATION_CREDENTIALS="$HOME/path/to/service-account-key.json"``. This can be
done at the command line (in which case it must be repeated in every shell session that
will run BuildStockBatch, because it only takes effect for that session), or added to a
shell startup script (in which case it will be available to all shell sessions).
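
Before launching a run, it can help to confirm that the environment variable actually points at a key file. The following is a minimal sketch, assuming the Service Account Key option is used; `find_credentials` is a hypothetical helper, and it only checks that the file exists, not that GCP accepts the key.

```python
import os
from pathlib import Path


def find_credentials(environ=os.environ):
    """Return the service account key path if it is set and exists, else None."""
    raw = environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not raw:
        return None
    # expanduser() handles a leading "~" that the shell left unexpanded.
    path = Path(raw).expanduser()
    return path if path.is_file() else None


key_path = find_credentials()
if key_path is None:
    print("GOOGLE_APPLICATION_CREDENTIALS is unset or does not point to a file")
else:
    print(f"Using service account key: {key_path}")
```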

.. _Install Docker: https://www.docker.com/get-started/
.. _Docker: https://www.docker.com/get-started/
.. _Service Account Key: https://cloud.google.com/iam/docs/keys-create-delete
.. _Google Cloud CLI: https://cloud.google.com/sdk/docs/install-sdk
28 changes: 14 additions & 14 deletions docs/project_defn.rst
@@ -270,11 +270,10 @@ using `GCP Batch <https://cloud.google.com/batch>`_ and `Cloud Run <https://clou
buildstock run locally, on Eagle, or on AWS cannot save to GCP.

* ``job_identifier``: A unique string that starts with an alphabetical character,
is up to 48 characters long, and only has letters, numbers or hyphens.
is up to 48 characters long, and only has lowercase letters, numbers or hyphens.
This is used to name the GCP Batch and Cloud Run jobs to be created and
differentiate them from other jobs.
* ``project``: The GCP Project ID in which the batch will be run and of the Artifact Registry
(where Docker images are stored).
* ``project``: The GCP Project ID in which the job will run.
* ``service_account``: Optional. The service account email address to use when running jobs on GCP.
Default: the Compute Engine default service account of the GCP project.
* ``gcs``: Configuration for project data storage on GCP Cloud Storage.
@@ -287,7 +286,9 @@ using `GCP Batch <https://cloud.google.com/batch>`_ and `Cloud Run <https://clou
may help. Default: 40 MiB

* ``region``: The GCP region in which the job will be run and the region of the Artifact Registry.
* ``batch_array_size``: Number of tasks to divide the simulations into. Max: 10000.
(e.g. ``us-central1``)
* ``batch_array_size``: Number of tasks to divide the simulations into. Tasks with fewer than 100
simulations each are recommended, especially when using spot instances. Max: 10,000.

to avoid losing too many simulations (or anything related to that) when preemption happens? maybe adding something like this to justify the recommendation on < 100 sims per task?

Collaborator Author

Added.

* ``parallelism``: Optional. Maximum number of tasks that can run in parallel. If not specified,
uses `GCP's default behavior`_ (the lesser of ``batch_array_size`` and `job limits`_).
Parallelism is also limited by Compute Engine quotas and limits (including vCPU quota).
@@ -298,29 +299,28 @@ using `GCP Batch <https://cloud.google.com/batch>`_ and `Cloud Run <https://clou
repository.
* ``job_environment``: Optional. Specifies the computing requirements for each simulation.

* ``vcpus``: Number of CPUs to allocate for running each simulation. Default: 1.
* ``memory_mib``: Amount of RAM memory needed for each simulation in MiB. Default: 1024.
For large multifamily buildings this works better if set to 2048.
* ``vcpus``: Optional. Number of CPUs to allocate for running each simulation. Default: 1.
* ``memory_mib``: Optional. Amount of RAM to allocate for each simulation in MiB.
For large multifamily buildings this works better if set to 2048. Default: 1024.
* ``boot_disk_mib``: Optional. Extra boot disk size in MiB for each task. This affects how
large the boot disk will be (see the `Batch OS environment docs`_ for details) of the
machine(s) running simulations (which is the disk used by simulations). This will likely need
to be set to at least 2,048 if more than 8 simulations will be run in parallel on the same
machine (i.e., when vCPUs per machine_type ÷ vCPUs per sim > 8). Default: None (which should
result in a 30 GB boot disk according to the docs linked above).
* ``machine_type``: GCP Compute Engine machine type to use. If omitted, GCP Batch will
* ``machine_type``: Optional. GCP Compute Engine machine type to use. If omitted, GCP Batch will
choose a machine type based on the requested vCPUs and memory. If set, the machine type
should have at least as many resources as requested for each simulation above. If it is
large enough, multiple simulations will be run in parallel on the same machine. Usually safe
to leave unset.
* ``use_spot``: true or false. This tells the project whether to use
`Spot VMs <https://cloud.google.com/spot-vms>`_ for data simulations, which can reduce
costs by up to 91%. Default: false
* ``use_spot``: Optional. Whether to use `Spot VMs <https://cloud.google.com/spot-vms>`_

maybe we should change the default to true here? thoughts?

Collaborator Author

I think I prefer a default of False, because 1) that's the default when creating GCP jobs directly and 2) I think it's best for the default to be the most reliable option, at least as long as using spot instances at scale continues to require manual retries sometimes.

for data simulations, which can reduce costs by up to 91%. Default: false
* ``postprocessing_environment``: Optional. Specifies the Cloud Run computing environment for
postprocessing.

* ``cpus``: `Number of CPUs`_ to use. Default: 2.
* ``memory_mib``: `Amount of RAM`_ needed in MiB. 2048 MiB per CPU is recommended. Default:
4096.
* ``cpus``: Optional. `Number of CPUs`_ to use. Default: 2.
* ``memory_mib``: Optional. `Amount of RAM`_ needed in MiB. 2048 MiB per CPU is recommended.

probably saying "at least 2048 MiB per CPU recommended"?

do we want to mention anything about the guardrails here?

Default: 4096.
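
A few of the constraints listed above can be checked before submitting a job. This is a minimal sketch under stated assumptions, not BuildStockBatch's own validation: `check_gcp_settings` and `sims_per_machine` are hypothetical helpers, the pattern assumes the leading character must also be lowercase, and the lower bound of 1 for ``batch_array_size`` is an assumption.

```python
import re

# Pattern derived from the job_identifier rule above: starts with a letter,
# at most 48 characters, only lowercase letters, digits, or hyphens.
JOB_ID_RE = re.compile(r"[a-z][a-z0-9-]{0,47}")
MAX_BATCH_ARRAY_SIZE = 10_000


def check_gcp_settings(job_identifier, batch_array_size, n_simulations):
    """Return a list of warnings for a few of the documented constraints."""
    problems = []
    if not JOB_ID_RE.fullmatch(job_identifier):
        problems.append(f"job_identifier {job_identifier!r} is not valid")
    if not 1 <= batch_array_size <= MAX_BATCH_ARRAY_SIZE:
        problems.append("batch_array_size must be between 1 and 10000")
    else:
        # Ceiling division: the largest number of simulations in any task.
        sims_per_task = -(-n_simulations // batch_array_size)
        if sims_per_task >= 100:
            problems.append(
                f"{sims_per_task} simulations per task; fewer than 100 is recommended"
            )
    return problems


def sims_per_machine(machine_vcpus, vcpus_per_sim=1):
    """Simulations that fit in parallel on one machine (vCPUs / vCPUs per sim)."""
    return machine_vcpus // vcpus_per_sim


print(check_gcp_settings("national01", 10_000, 550_000))
print(check_gcp_settings("National_01", 20_000, 550_000))
# 32 is more than 8, so boot_disk_mib would likely need to be raised.
print(sims_per_machine(32))
```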

.. _GCP's default behavior: https://cloud.google.com/python/docs/reference/batch/latest/google.cloud.batch_v1.types.TaskGroup
.. _job limits: https://cloud.google.com/batch/quotas
27 changes: 21 additions & 6 deletions docs/run_sims.rst
@@ -117,16 +117,15 @@ on S3 and queryable in Athena.
Google Cloud Platform
~~~~~~~~~~~~~~~~~~~~~

Running a batch on GCP is done by calling the ``buildstock_gcp`` command line
tool.
Run a project on GCP by calling the ``buildstock_gcp`` command line tool.

.. command-output:: buildstock_gcp --help
:ellipsis: 0,8

The first time you run ``buildstock_gcp`` it may take several minutes,
especially over a slower internet connection, because it downloads and builds a Docker image.

GCP Specific Project configuration
GCP specific project configuration
..................................

For the project to run on GCP, you will need to add a ``gcp`` section to your config
@@ -136,12 +135,13 @@ file, something like this:

gcp:
job_identifier: national01
# The project, Artifact Registry repo, and GCS bucket must already exist.
project: myorg_project
region: us-central1
artifact_registry:
repository: buildstockbatch
repository: buildstockbatch-docker
gcs:
bucket: mybucket
bucket: buildstockbatch
prefix: national01_run01
use_spot: true
batch_array_size: 10000
@@ -154,18 +154,33 @@ You can optionally override the ``job_identifier`` from the command line
quickly assign a new ID with each run without updating the config file.
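
One way to generate a fresh ID for each run is to append a timestamp to a fixed prefix. The sketch below is only an illustration: `timestamped_job_id` is a hypothetical helper, not part of BuildStockBatch, and it enforces only the 48-character limit from the ``job_identifier`` rules.

```python
from datetime import datetime, timezone


def timestamped_job_id(prefix, now=None):
    """Build a job identifier such as 'national01-20240125-153000'."""
    now = now or datetime.now(timezone.utc)
    # Lowercase the result, since identifiers may only use lowercase letters.
    job_id = f"{prefix}-{now:%Y%m%d-%H%M%S}".lower()
    if len(job_id) > 48:
        raise ValueError(f"{job_id!r} is longer than 48 characters")
    return job_id


print(timestamped_job_id("national01"))
```

The printed value could then be passed as the optional ``job_identifier`` argument on the ``buildstock_gcp`` command line.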


List existing jobs
Show existing jobs
..................

Run ``buildstock_gcp your_project_file.yml [job_identifier] --show_jobs`` to see the existing
jobs matching the project specified. This can show you whether a previously-started job
has completed, is still running, or has already been cleaned up.


Post-processing only
.....................

If ``buildstock_gcp`` is interrupted after the simulations are kicked off (i.e. the Batch job is
running), the simulations will finish, but post-processing will not be started. You can run only
the post-processing steps later with the ``--postprocessonly`` flag.


Cleaning up after yourself
..........................

When the simulations and postprocessing are complete, run ``buildstock_gcp
your_project_file.yml [job_identifier] --clean``. This will clean up all the GCP resources that
were created to run the specified project, other than files in Cloud Storage. If the project is
still running, it will be cancelled. Your output files will still be available in GCS.

You can clean up files in Cloud Storage from the `GCP Console`_.

If you make code changes between runs, you may want to occasionally clean up the Docker
images created for each run with ``docker image prune``.

.. _GCP Console: https://console.cloud.google.com/storage/browser