Merge pull request #106 from jajimer/cloud
Sinergym 1.6.0
AlejandroCN7 committed Dec 28, 2021
2 parents 9ae18e7 + ca17e5a commit ff03a80
Showing 94 changed files with 216 additions and 1,128 deletions.
4 changes: 4 additions & 0 deletions Dockerfile
@@ -3,6 +3,10 @@
ARG UBUNTU_VERSION=18.04
FROM ubuntu:${UBUNTU_VERSION}

# Configure tzdata so it does not ask interactively for the geographic area
ENV TZ=Europe/Kiev
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

# Arguments for the EnergyPlus version (the default values below are used if not specified)
ARG ENERGYPLUS_VERSION=9.5.0
ARG ENERGYPLUS_INSTALL_VERSION=9-5-0
11 changes: 3 additions & 8 deletions README.md
@@ -19,8 +19,7 @@ The main functionalities of Sinergym are the following:
have been developed by our team in order to easily test these environments
with deep reinforcement learning algorithms.
- **Google Cloud Integration**. If you have a Google Cloud account and you want to
use your infrastructure with Sinergym, a complete set of functionality
has been designed to facilitate this work.
use your infrastructure with Sinergym, we give you some details on how to do it.
- **Mlflow tracking server**. [Mlflow](https://mlflow.org/) is an open source platform for the machine
learning lifecycle. This can be used with a Google Cloud remote server (if you have a Google Cloud account)
or with a local store. This will help you manage and store the runs and artifacts generated in an orderly
@@ -165,10 +164,6 @@ Notice that a folder will be created in the working directory after creating the

## Google Cloud Platform support

In this project, a RESTful API for **Google Cloud** has been designed and developed so that Google Cloud infrastructure can be used directly, writing experiment definitions on our personal computer.

<div align="center">
<img src="images/Sinergym_cloud_API.png" width=100%><br><br>
</div>

For more information about this functionality, please, visit our documentation [here](https://jajimer.github.io/sinergym/build/html/pages/gcloudAPI.html)
For more information about this functionality, please, visit our documentation [here](https://jajimer.github.io/sinergym/build/html/pages/gcloudAPI.html).
134 changes: 0 additions & 134 deletions cloud_manager.py

This file was deleted.

Binary file modified docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/build/doctrees/index.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/pages/deep-reinforcement-learning.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/pages/gcloudAPI.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/pages/installation.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/pages/introduction.doctree
Binary file not shown.
Binary file not shown.
@@ -87,7 +87,7 @@ How to use
****************

You can try your own experiments and benefit from this functionality. `sinergym/examples/DRL_usage.py <https://github.com/jajimer/sinergym/blob/main/examples/DRL_usage.py>`__
is an example of how to use it.
is an example of how to use it. You can also run DRL_battery.py directly from your local computer, specifying the ``--tensorboard`` flag in the execution.

The most important things to keep in mind when you try your own experiments are:

@@ -96,7 +96,7 @@
* Callbacks can be concatenated in a ``CallbackList`` instance from Stable Baselines 3.
* The neural network will not train until you execute the ``model.learn()`` method. This is where you
specify the training ``timesteps``, ``callbacks`` and ``log_interval``, as we commented for the algorithm types (on- and off-policy).
* ``DRL_usage.py`` requires some extra arguments to be executed, such as ``-env`` and ``-ep``.
* ``DRL_usage.py`` or ``DRL_battery.py`` require some extra arguments to be executed, such as ``-env`` and ``-ep``.

Code example:

84 changes: 15 additions & 69 deletions docs/build/html/_sources/pages/gcloudAPI.rst.txt
@@ -1,21 +1,12 @@
#########################
Sinergym Google Cloud API
#########################
###########################
Sinergym with Google Cloud
###########################

In this project, a RESTful API for gcloud has been designed and developed so that Google Cloud infrastructure can be used directly, writing experiment definitions on our personal computer.
In this project, we have defined some functionality based on the gcloud Python API in `sinergym/utils/gcloud.py`. Our aim is to configure a Google Cloud account and combine it with Sinergym easily.
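As a hedged illustration of that kind of helper (a minimal sketch assuming the ``google-api-python-client`` package; ``list_instances`` is an example name, not necessarily a function that ``sinergym/utils/gcloud.py`` actually provides):

.. code:: python

    from googleapiclient import discovery

    def list_instances(project, zone):
        """List the Compute Engine instances of a project and zone."""
        # Credentials are taken from the environment set up with the gcloud SDK.
        service = discovery.build('compute', 'v1')
        result = service.instances().list(project=project, zone=zone).execute()
        return result.get('items', [])

    print([vm['name'] for vm in list_instances('sinergym', 'europe-west1-b')])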

.. image:: /_static/Sinergym_cloud_API.png
:width: 1000
:alt: Sinergym cloud API diagram
:align: center

From our personal computer, we send a list of experiments we want to be executed in Google Cloud, using the **cloud_manager.py** script for that purpose. An instance will be created for every experiment defined.
Each VM sends Mlflow logs to the **Mlflow tracking server**. On the other hand, Sinergym output and Tensorboard output are sent to a **Google Cloud Bucket** (see :ref:`Remote Tensorboard log`), to an **Mlflow artifact** (see :ref:`Mlflow tracking server set up`),
and/or to **local VM storage**, depending on the experiment configuration.

When an instance has finished its job, the container **auto-removes** its host instance from Google Cloud Platform if the experiment has been configured with this option. If an instance is the last one in the MIG, the container auto-removes the empty MIG too.
The main idea is to construct a **virtual machine** (VM) using **Google Compute Engine** (GCE) in order to execute our **Sinergym container** on it. At the same time, this remote container will update a Google Cloud Bucket with experiment results and an Mlflow tracking server with artifacts, if we configure the experiment with those options.

.. warning:: Don't try to remove an instance inside a MIG directly using the Google Cloud REST API; the removal needs to be executed through the MIG to work. Some other problems (like wrong REST API documentation) have been solved in our API. We recommend you use this API directly.
When an instance has finished its job, the container **auto-removes** its host instance from Google Cloud Platform if the experiment has been configured with this option.
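Since instances belonging to a MIG should not be deleted directly, here is a minimal sketch of that removal done through the group (assuming the ``google-api-python-client`` package; project, zone and instance names are placeholders):

.. code:: python

    from googleapiclient import discovery

    service = discovery.build('compute', 'v1')

    # Remove an instance THROUGH its managed instance group (MIG),
    # instead of deleting the instance resource directly.
    service.instanceGroupManagers().deleteInstances(
        project='sinergym',                     # placeholder project id
        zone='europe-west1-b',
        instanceGroupManager='sinergym-group',  # placeholder MIG name
        body={'instances': [
            'zones/europe-west1-b/instances/sinergym-group-xxxx'  # placeholder
        ]}).execute()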

Let’s see a detailed explanation below.

@@ -246,60 +237,13 @@ To use this container in our machine you only have to do:
:alt: GCE VM container usage.
:align: center

And now you can execute your own experiments in Google Cloud! If you are interested in using our API specifically for gcloud (automated experiments using remote container generation), please visit our section :ref:`Executing API`.

****************
Executing API
****************

Our objective is to define a set of experiments and execute each of them automatically in its own Google Cloud remote container. For this purpose, *cloud_manager.py* has been created in the repository root. This file must be used on our local computer:

.. literalinclude:: ../../../cloud_manager.py
:language: python

This script uses the following parameters:

- ``--project_id`` or ``-id``: Your Google Cloud project id must be specified.
- ``--zone`` or ``-zo``: Zone for your project (default is *europe-west1-b*).
- ``--template_name`` or ``-tem``: Template used to generate VM clones, defined in your project previously (see :ref:`4. Create your VM or MIG`).
- ``--group_name`` or ``-group``: The instance group name you want. All instances inside the MIG will have this name concatenated with a random string.
- ``--experiment_commands`` or ``-cmds``: List of experiment definitions in Python command format (for information about the format, see :ref:`Receiving experiments in remote containers`).

Here is an example bash command to execute the script:

.. code:: sh

    $ python cloud_manager.py \
        --project_id sinergym \
        --zone europe-west1-b \
        --template_name sinergym-template \
        --group_name sinergym-group \
        --experiment_commands \
        'python3 DRL_battery.py --environment Eplus-5Zone-hot-discrete-v1 --episodes 2 --algorithm DQN --logger --log_interval 1 --seed 58 --evaluation --eval_freq 1 --eval_length 1 --tensorboard gs://experiments-storage/tensorboard_log --remote_store --auto_delete' \
        'python3 DRL_battery.py --environment Eplus-5Zone-hot-continuous-v1 --episodes 3 --algorithm PPO --logger --log_interval 300 --seed 52 --evaluation --eval_freq 1 --eval_length 1 --tensorboard gs://experiments-storage/tensorboard_log --remote_store --mlflow_store --auto_delete'

This example generates only two machines inside an instance group in your Google Cloud Platform, because you have defined two experiments. If you define more experiments, more machines will be created by the API.
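As a hedged sketch of the MIG creation performed in step 1 of the list below (assuming the ``google-api-python-client`` package; the names are the example values above, and the template URL form is an assumption):

.. code:: python

    from googleapiclient import discovery

    service = discovery.build('compute', 'v1')

    # Create a MIG from the template, sized to the number of experiment commands.
    service.instanceGroupManagers().insert(
        project='sinergym',
        zone='europe-west1-b',
        body={
            'name': 'sinergym-group',
            'instanceTemplate': 'global/instanceTemplates/sinergym-template',
            'targetSize': 2,  # one instance per experiment
        }).execute()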

This script does the following:

1. Counts the commands in the ``--experiment_commands`` parameter and generates a Managed Instance Group (MIG) with the same size.
2. Waits for **step 1** to finish.
3. If the *experiments-storage* Bucket doesn't exist, the script creates one to store experiment results (if you want another name you have to change it in the script); otherwise the existing one is used (see the bucket sketch after the notes below).
4. Looks for the instance names generated randomly by Google Cloud once the MIG is created (waiting for instance generation if the instances haven't been created yet).
5. Adds the ``--group_name`` option to each experiment command so that each container knows its own MIG (useful to auto-remove them).
6. Looks for the *container id* of each instance. This process waits for the containers to initialize, since an instance initializes earlier than its inner container (this could take several minutes).
7. Sends each experiment command to the container of each instance using an SSH connection (in parallel).

.. note:: Because this is a real-time process, some actions (container startup, instance listing and others) could take time. In those cases, the API waits for one process to finish before executing the next (when necessary).

.. note:: This script uses the gcloud API in the background. The methods developed and used for these tasks can be seen in `sinergym/sinergym/utils/gcloud.py <https://github.com/jajimer/sinergym/blob/main/sinergym/utils/gcloud.py>`__ or in :ref:`API reference`.
Remember to configure your Google Cloud account correctly before using this functionality.
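As a minimal sketch of the bucket create-or-reuse logic described in step 3 above (assuming the ``google-cloud-storage`` package):

.. code:: python

    from google.cloud import storage

    client = storage.Client(project='sinergym')

    # Reuse the bucket if it already exists, otherwise create it.
    bucket = client.lookup_bucket('experiments-storage')
    if bucket is None:
        bucket = client.create_bucket('experiments-storage')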
And now you can execute your own experiments in Google Cloud! For example, you can enter a remote container with *gcloud ssh* and execute *DRL_battery.py* for the experiment you want.

********************************************
Receiving experiments in remote containers
Executing experiments in remote containers
********************************************

This script, called *DRL_battery.py*, will be allocated in every remote container and is used to interpret the experiment commands exposed above by *cloud_manager.py* (``--experiment_commands``):
This script, called *DRL_battery.py*, will be allocated in every remote container and is used to execute experiments and combine them with a **Google Cloud Bucket**, **Mlflow artifacts**, **auto-remove**, etc.:

.. literalinclude:: ../../../DRL_battery.py
:language: python
@@ -323,11 +267,13 @@ The list of parameters is pretty large. Let's see it:
- ``--seed`` or ``-sd``: Seed for training; random components of the process will be reproducible.
- ``--remote_store`` or ``-sto``: Determines whether the Sinergym output and the Tensorboard log (when a local path is specified and not a remote bucket path) will be sent to a common resource (Bucket); otherwise they will be kept in remote container storage only.
- ``--mlflow_store`` or ``-mlflow``: Determines whether the Sinergym output and the Tensorboard log (when a local path is specified and not a remote bucket path) will be sent to an Mlflow Artifact; otherwise they will be kept in remote container storage only.
- ``--group_name`` or ``-group``: Added by *cloud_manager.py* automatically. It specifies which MIG the host instance belongs to.
- ``--group_name`` or ``-group``: It specifies which MIG the host instance belongs to; this is important if ``--auto_delete`` is activated.
- ``--auto_delete`` or ``-del``: If this parameter is specified, the remote instance will be auto-removed when its job has finished.

- **algorithm hyperparameters**: Execute ``python DRL_battery.py --help`` for more information.

.. warning:: For correct ``auto_delete`` functionality, please use MIGs instead of individual instances.
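As an illustrative sketch of what the ``--remote_store`` upload amounts to (assuming the ``google-cloud-storage`` package; the paths and names are placeholders, not Sinergym's actual implementation):

.. code:: python

    from google.cloud import storage

    def upload_to_bucket(local_path, bucket_name, blob_name):
        """Upload a local file (e.g. zipped Sinergym output) to a Bucket."""
        client = storage.Client()
        blob = client.bucket(bucket_name).blob(blob_name)
        blob.upload_from_filename(local_path)

    # Placeholder paths, for illustration only:
    upload_to_bucket('Eplus-env-output.zip', 'experiments-storage',
                     'experiments/Eplus-env-output.zip')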

This script does the following:

1. Sets an appropriate name for the experiment, following this format: ``<algorithm>-<environment_name>-episodes<episodes_int>-seed<seed_value>(<experiment_date>)`` (see the naming sketch after this list).
@@ -341,7 +287,7 @@
9. Sets up the Tensorboard logger callback if it has been specified.
10. Trains with the environment.
11. If ``--remote_store`` has been specified, saves all outputs in the Google Cloud Bucket. If ``--mlflow_store`` has been specified, saves all outputs in the Mlflow run artifact.
12. Auto-deletes the remote container in Google Cloud Platform if the script has been called from **cloud_manager.py** and the ``--auto_delete`` parameter has been specified.
12. Auto-deletes the remote container in Google Cloud Platform when the ``--auto_delete`` parameter has been specified.
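A minimal sketch of the naming convention in step 1 (the exact date format is an assumption for illustration):

.. code:: python

    from datetime import datetime

    algorithm, environment = 'DQN', 'Eplus-5Zone-hot-discrete-v1'
    episodes, seed = 2, 58

    # <algorithm>-<environment_name>-episodes<episodes_int>-seed<seed_value>(<experiment_date>)
    experiment_date = datetime.today().strftime('%Y-%m-%d_%H:%M')  # assumed format
    name = f'{algorithm}-{environment}-episodes{episodes}-seed{seed}({experiment_date})'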

Container permissions for bucket storage output
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -360,15 +306,15 @@ Hence, it is **necessary** to **set up this service account** and give privilege
    $ export GOOGLE_CLOUD_CREDENTIALS=PROJECT_PATH/google-storage.json

In short, we create a new service account called **storage-account**. Then, we grant this account the *roles/owner* permission. The next step is to create a key file (json) called **google-storage.json** in our project root (gitignore will ignore this file in the remote repository).
Finally, we export this file in **GOOGLE_CLOUD_CREDENTIALS** so that the gcloud SDK knows it has to use that token to authenticate.
Finally, we export this file in **GOOGLE_CLOUD_CREDENTIALS** on our local computer, so that the gcloud SDK knows it has to use that token to authenticate.
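As a quick way to check that the key works (a hedged sketch assuming the ``google-cloud-storage`` package; this is an illustration, not part of Sinergym):

.. code:: python

    from google.cloud import storage

    # Authenticate explicitly with the service-account key created above.
    client = storage.Client.from_service_account_json('google-storage.json')

    # If the account has the right privileges, this lists your buckets.
    print([bucket.name for bucket in client.list_buckets()])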

***********************
Remote Tensorboard log
***********************

In the ``--tensorboard`` parameter we have to specify a **local path** or a **Bucket path**.

If we specify a **local path**, Tensorboard logs will be stored in the remote container's storage. If you have specified ``remote_store`` or ``mlflow_store``, these logs will be sent to those remote storages when the experiment finishes.
If we specify a **local path**, Tensorboard logs will be stored in the remote container's storage. If you have specified ``--remote_store`` or ``--mlflow_store``, these logs will be sent to those remote storages when the experiment finishes.
One of the strengths of Tensorboard is the ability to see the data in real time as the training is running. Thus, it is recommended to set the **bucket path** directly in ``--tensorboard`` in order to send that information
as the training generates it (see `this issue <https://github.com/ContinualAI/avalanche/pull/628>`__ for more information). In our project we use *gs://experiments-storage/tensorboard_log*, but you can use whatever you want.
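As a hedged sketch with Stable Baselines 3 (whether a *gs://* path works directly depends on your Tensorboard/TensorFlow installation, as the issue above discusses):

.. code:: python

    import gym
    import sinergym
    from stable_baselines3 import DQN

    env = gym.make('Eplus-5Zone-hot-discrete-v1')

    # Point Tensorboard logging straight at the bucket path.
    model = DQN('MlpPolicy', env,
                tensorboard_log='gs://experiments-storage/tensorboard_log')
    model.learn(total_timesteps=10000)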

2 changes: 1 addition & 1 deletion docs/build/html/_sources/pages/installation.rst.txt
@@ -120,5 +120,5 @@ Cloud Computing
****************

You can run your experiments in the Cloud too. We are using `Google Cloud <https://cloud.google.com/>`__ in order to make it possible. Our team aims to set up
a managed instance group (`MIG <https://cloud.google.com/compute/docs/instance-groups/getting-info-about-migs?hl=es-419>`__) in which to execute our Sinergym container.
an account in which to execute our Sinergym container with **remote storage** and **Mlflow tracking**.
For more details about installation and getting the Google Cloud SDK ready to run your experiments, visit our section :ref:`Preparing Google Cloud`.
3 changes: 1 addition & 2 deletions docs/build/html/_sources/pages/introduction.rst.txt
@@ -30,8 +30,7 @@ The main functionalities of *sinergym* are the following:
have been developed by our team in order to easily test these environments
with deep reinforcement learning algorithms.
- **Google Cloud Integration**. If you have a Google Cloud account and you want to
use your infrastructure with Sinergym, a complete set of functionality
has been designed to facilitate this work.
use your infrastructure with Sinergym, we give you some details on how to do it.
- **Mlflow tracking server**. `Mlflow <https://mlflow.org/>`__ is an open source platform for the machine
learning lifecycle. This can be used with a Google Cloud remote server (if you have a Google Cloud account)
or with a local store. This will help you manage and store the runs and artifacts generated in an orderly
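As a brief, hedged sketch of the Mlflow tracking mentioned in the bullet above (the tracking-server address is a placeholder; without ``set_tracking_uri()``, Mlflow writes to a local ``./mlruns`` store):

.. code:: python

    import mlflow

    # Placeholder address: point this at your remote tracking server.
    mlflow.set_tracking_uri('http://1.2.3.4:5000')

    with mlflow.start_run(run_name='DQN-Eplus-5Zone-hot-discrete-v1'):
        mlflow.log_param('episodes', 2)
        mlflow.log_metric('mean_reward', -1500.0)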