
Sinergym Google Cloud API


In this project, a REST-based API for gcloud has been designed and developed so that the Google Cloud infrastructure can be used directly, writing the experiment definitions on our personal computer.

Sinergym cloud API diagram

From our personal computer, we send a list of experiments we want to be executed in Google Cloud, using the cloud_manager.py script for that purpose. An instance will be created for every experiment defined. Each VM sends MLflow logs to the MLflow tracking server. On the other hand, Sinergym output and Tensorboard output are sent to a Google Cloud Bucket (see Remote Tensorboard log), to an MLflow artifact (see Mlflow tracking server set up) and/or to local VM storage, depending on the experiment configuration.


When an instance has finished its job, the container auto-removes its host instance from Google Cloud Platform if the experiment has been configured with this option. If the instance is the last one in the MIG, the container auto-removes the empty MIG too.


Warning


Don't try to remove an instance that belongs to a MIG directly using the Google Cloud REST API; the removal needs to be executed through the MIG to work. Some other problems (like incorrect REST API documentation) have also been solved in our API. We recommend you use this API directly.
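As a hedged illustration (this is not Sinergym code), the correct pattern is to go through the instanceGroupManagers endpoint of the Compute API rather than the plain instance one. The helper below only builds the request body; the project, zone and instance names are made up:

```python
# Illustrative sketch only: deleting an instance that belongs to a MIG goes
# through instanceGroupManagers.deleteInstances, not instances.delete.
# All names below are hypothetical.
def build_delete_instances_body(project, zone, instance_names):
    """Build the request body for instanceGroupManagers().deleteInstances()."""
    return {
        'instances': [
            'projects/{}/zones/{}/instances/{}'.format(project, zone, name)
            for name in instance_names
        ]
    }

body = build_delete_instances_body(
    'sinergym', 'europe-west1-b', ['sinergym-group-abcd'])

# With an authenticated client from googleapiclient.discovery.build(
# 'compute', 'v1'), the call would look roughly like:
# service.instanceGroupManagers().deleteInstances(
#     project='sinergym',
#     zone='europe-west1-b',
#     instanceGroupManager='sinergym-group',
#     body=body).execute()
```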


Sinergym with Google Cloud


In this project, we have defined some functionality based on the gcloud Python API in sinergym/utils/gcloud.py. Our aim is to make it easy to configure a Google Cloud account and combine it with Sinergym.


The main idea is to construct a virtual machine (VM) using Google Cloud Engine (GCE) in order to execute our Sinergym container on it. At the same time, this remote container will update a Google Cloud Bucket with experiment results and the MLflow tracking server with artifacts, if we configure the experiment with those options.
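One decision the experiment configuration has to make is whether the Tensorboard log is already remote (a gs:// bucket path) or local, and therefore needs uploading when the experiment finishes. A minimal sketch of that check, with a hypothetical helper name that is not part of Sinergym:

```python
# Hypothetical helper (not part of Sinergym) sketching the configuration
# check: a tensorboard path that already points to a gs:// bucket does not
# need to be uploaded again when the experiment finishes.
def needs_upload(tensorboard_path, remote_store):
    """Return True when a local tensorboard log must be sent to the bucket."""
    if not remote_store or not tensorboard_path:
        return False
    return not tensorboard_path.startswith('gs://')

print(needs_upload('./tensorboard_log', True))                         # local path: upload it
print(needs_upload('gs://experiments-storage/tensorboard_log', True))  # already remote
```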


When an instance has finished its job, the container auto-removes its host instance from Google Cloud Platform if the experiment has been configured with this option.

Let's see a detailed explanation below.

Preparing Google Cloud


5. Init your VM (see GCE VM container usage).

And now you can execute your own experiments in Google Cloud! For example, you can enter the remote container with gcloud ssh and execute DRL_battery.py for the experiment you want. If you are interested in using our API specifically for gcloud (automated experiments using remote container generation), please visit the Executing API section.


Executing API


Our objective is to define a set of experiments and execute each of them automatically in its own Google Cloud remote container. For this purpose, cloud_manager.py has been created in the repository root. This file must be used on our local computer:

import argparse
from time import sleep
from pprint import pprint
import sinergym.utils.gcloud as gcloud
from google.cloud import storage
import google.api_core.exceptions

parser = argparse.ArgumentParser(
    description='Process for run experiments in Google Cloud')
parser.add_argument(
    '--project_id',
    '-id',
    type=str,
    dest='project',
    help='Your Google Cloud project ID.')
parser.add_argument(
    '--zone',
    '-zo',
    type=str,
    default='europe-west1-b',
    dest='zone',
    help='service Engine zone to deploy to.')
parser.add_argument(
    '--template_name',
    '-tem',
    type=str,
    default='sinergym-template',
    dest='template_name',
    help='Name of template previously created in gcloud account to generate VM copies.')
parser.add_argument(
    '--group_name',
    '-group',
    type=str,
    default='sinergym-group',
    dest='group_name',
    help='Name of instance group (MIG) that will be created during experimentation.')
parser.add_argument(
    '--experiment_commands',
    '-cmds',
    default=['python3 ./algorithm/DQN.py -env Eplus-demo-v1 -ep 1'],
    nargs='+',
    dest='commands',
    help='List of commands for DRL_battery.py you want to execute remotely.')

args = parser.parse_args()

print('Init Google cloud service API...')
service = gcloud.init_gcloud_service()

print('Init Google Cloud Storage Client...')
client = gcloud.init_storage_client()

# Create instance group
n_experiments = len(args.commands)
print('Creating instance group (MIG) for experiments ({} instances)...'.format(
    n_experiments))
response = gcloud.create_instance_group(
    service=service,
    project=args.project,
    zone=args.zone,
    size=n_experiments,
    template_name=args.template_name,
    group_name=args.group_name)
pprint(response)

# Wait for the machines to be fully created.
print(
    '{0} status is {1}.'.format(
        response['operationType'],
        response['status']))
if response['status'] != 'DONE':
    response = gcloud.wait_for_operation(
        service,
        args.project,
        args.zone,
        operation=response['id'],
        operation_type=response['operationType'])
pprint(response)
print('MIG created.')

# If the bucket exists it will be used, else it will be created by the API
print('Looking for experiments storage...')
try:
    bucket = gcloud.get_bucket(client, bucket_name='experiments-storage')
    print(
        'Bucket {} found, this storage will be used when experiments finish.'.format(
            bucket.name))
except(google.api_core.exceptions.NotFound):
    print('No bucket found in your Google account, generating a new one...')
    bucket = gcloud.create_bucket(
        client,
        bucket_name='experiments-storage',
        location='EU')


# List VM names
print('Looking for instance names... (waiting for them to be visible too)')
# Sometimes, although the instance group insert status is DONE, it isn't
# visible to the API yet. Hence, we have to wait with a loop...
instances = []
while len(instances) < n_experiments:
    instances = gcloud.list_instances(
        service=service,
        project=args.project,
        zone=args.zone,
        base_instances_names=args.group_name)
    sleep(3)
print(instances)
# The number of machines should be the same as the number of commands

# Processing commands and adding the group id to the petition
for i in range(len(args.commands)):
    args.commands[i] += ' --group_name ' + args.group_name

# Execute a command in the container inside every VM
print('Sending commands to every container VM... (waiting for the container inside each VM to be ready too)')
for i, instance in enumerate(instances):
    container_id = None
    # Obtain the container id inside the VM
    while not container_id:
        container_id = gcloud.get_container_id(instance_name=instance)
        sleep(5)
    # Execute the command in the container
    gcloud.execute_remote_command_instance(
        container_id=container_id,
        instance_name=instance,
        experiment_command=args.commands[i])
    print(
        'command {} has been sent to instance {} (container: {}).'.format(
            args.commands[i],
            instance,
            container_id))

print('All VM\'s are working correctly, see Google Cloud Platform Console.')

This script uses the following parameters:

  • --project_id or -id: Your Google Cloud project id must be specified.

  • --zone or -zo: Zone for your project (default is europe-west1-b).

  • --template_name or -tem: Template used to generate VM clones, defined in your project previously (see 4. Create your VM or MIG).

  • --group_name or -group: Instance group name you want. All instances inside the MIG will have this name concatenated with a random string.

  • --experiment_commands or -cmds: List of experiment definitions using the python command format (for information about its format, see Receiving experiments in remote containers).

Here is an example bash code to execute the script:

$ python cloud_manager.py \
    --project_id sinergym \
    --zone europe-west1-b \
    --template_name sinergym-template \
    --group_name sinergym-group \
    --experiment_commands \
    'python3 DRL_battery.py --environment Eplus-5Zone-hot-discrete-v1 --episodes 2 --algorithm DQN --logger --log_interval 1 --seed 58 --evaluation --eval_freq 1 --eval_length 1 --tensorboard gs://experiments-storage/tensorboard_log --remote_store --auto_delete' \
    'python3 DRL_battery.py --environment Eplus-5Zone-hot-continuous-v1 --episodes 3 --algorithm PPO --logger --log_interval 300 --seed 52 --evaluation --eval_freq 1 --eval_length 1 --tensorboard gs://experiments-storage/tensorboard_log --remote_store --mlflow_store --auto_delete'

This example generates only two machines inside an instance group in your Google Cloud Platform, because two experiments have been defined. If you define more experiments, more machines will be created by the API.


This script does the following:

  1. Count the commands in the --experiment_commands parameter and generate a Managed Instance Group (MIG) with the same size.

  2. Wait for step 1 to finish.

  3. If the experiments-storage Bucket doesn't exist, the script creates one called experiments-storage to store the experiment results (if you want another name, you have to change it in the script); otherwise, the current one is used.

  4. Look for the instance names generated randomly by Google Cloud once the MIG is created (waiting for the instances to be generated if they haven't been created yet).

  5. Add the --group_name option to each experiment command so that each container knows which MIG it belongs to (useful for auto-removing them).

  6. Look for the container id of each instance. This process waits until the containers are initialized, since an instance initializes earlier than its inner container (this could take several minutes).

  7. Send each experiment command to the container of each instance using an SSH connection (in parallel).

Note


Because this is a real-time process, some actions (container initialization, instance listing and others) could take time. In those cases, the API waits for one process to finish before executing the next (when necessary).
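The waiting pattern can be sketched like this (a generic example, not the actual API code; the poll callable and the status values are illustrative):

```python
import time

# Generic sketch of the pattern: poll an operation until its status is DONE,
# sleeping between attempts, with a timeout as a safety net.
def wait_for(poll, interval=0.01, timeout=5.0):
    """Call poll() repeatedly until the returned response reports DONE."""
    deadline = time.monotonic() + timeout
    response = poll()
    while response['status'] != 'DONE':
        if time.monotonic() > deadline:
            raise TimeoutError('operation did not finish in time')
        time.sleep(interval)
        response = poll()
    return response

# Usage with a fake operation that finishes on the third poll:
states = iter(['PENDING', 'RUNNING', 'DONE'])
result = wait_for(lambda: {'status': next(states)})
print(result)  # {'status': 'DONE'}
```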


Note


This script uses the gcloud API in the background. The methods developed and used for these tasks can be seen in sinergym/sinergym/utils/gcloud.py or in the API reference. Remember to configure your Google Cloud account correctly before using this functionality.


Receiving experiments in remote containers

This script, called DRL_battery.py, will be allocated in every remote container. It is used to execute the experiment commands sent by cloud_manager.py (--experiment_commands) and to combine them with the Google Cloud Bucket, MLflow artifacts, auto-remove functionality, etc.:

from stable_baselines3.common.logger import configure
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3.common.callbacks import CallbackList

# ... (full script in the repository; the most relevant fragments follow)

parser.add_argument(
    '--remote_store',
    '-sto',
    action='store_true',
    dest='remote_store',
    help='Determine if sinergym output will be sent to a Google Cloud Storage Bucket.')
parser.add_argument(
    '--mlflow_store',
    '-mlflow',
    action='store_true',
    dest='mlflow_store',
    help='Determine if sinergym output will be sent to a mlflow artifact storage')
parser.add_argument(
    '--group_name',
    '-group',
    type=str,
    dest='group_name',
    help='This field indicates the instance group name')
parser.add_argument(
    '--auto_delete',
    '-del',
    action='store_true',
    dest='auto_delete',
    help='If it is a GCE instance and this flag is active, that instance will be removed from GCP.')
parser.add_argument('--learning_rate', '-lr', type=float, default=.0007)
parser.add_argument('--gamma', '-g', type=float, default=.99)

# ...

# Check if MLFLOW_TRACKING_URI is defined
if os.environ.get('MLFLOW_TRACKING_URI') is not None:
    # Check ping to server
    mlflow_ip = os.environ.get(
        'MLFLOW_TRACKING_URI').split('/')[-1].split(':')[0]
    # If server is not valid, set the default local path for mlflow
    response = os.system('ping -c 1 ' + mlflow_ip)
    if response != 0:
        mlflow.set_tracking_uri('file://' + os.getcwd() + '/mlruns')

# MLflow track
with mlflow.start_run(run_name=name):
    # Log experiment params
    mlflow.log_param('evaluation-length', args.eval_length)
    mlflow.log_param('log-interval', args.log_interval)
    mlflow.log_param('seed', args.seed)
    mlflow.log_param('remote-store', bool(args.remote_store))
    mlflow.log_param('learning_rate', args.learning_rate)
    mlflow.log_param('n_steps', args.n_steps)

    # ... (environment creation and model training)
    model.save(env.simulator._env_working_dir_parent + '/' + name)

    # If mlflow artifact store is active
    if args.mlflow_store:
        # Send output and tensorboard to mlflow artifacts
        mlflow.log_artifacts(
            local_dir=env.simulator._env_working_dir_parent,
            artifact_path=name + '/')
        if args.evaluation:
            mlflow.log_artifacts(
                local_dir='best_model/' + name + '/',
                artifact_path='best_model/' + name + '/')
        # If tensorboard is active (in local) we should send it to mlflow
        if args.tensorboard and 'gs://experiments-storage' not in args.tensorboard:
            mlflow.log_artifacts(
                local_dir=args.tensorboard + '/' + name + '/',
                artifact_path=os.path.abspath(args.tensorboard).split('/')[-1] + '/' + name + '/')

    # Store all results if the remote_store flag is True (Google Cloud Bucket
    # for experiments)
    if args.remote_store:
        # Initiate Google Cloud client
        client = gcloud.init_storage_client()
        # Send output to the common Google Cloud resource
        gcloud.upload_to_bucket(
            client,
            src_path=env.simulator._env_working_dir_parent,
            dest_bucket_name='experiments-storage',
            dest_path=name)
        if args.evaluation:
            gcloud.upload_to_bucket(
                client,
                src_path='best_model/' + name + '/',
                dest_bucket_name='experiments-storage',
                dest_path='best_model/' + name + '/')
        # If tensorboard is active (in local) we should send it to the bucket
        if args.tensorboard and 'gs://experiments-storage' not in args.tensorboard:
            gcloud.upload_to_bucket(
                client,
                src_path=args.tensorboard + '/' + name + '/',
                dest_bucket_name='experiments-storage',
                dest_path=os.path.abspath(args.tensorboard).split('/')[-1] + '/' + name + '/')

# End mlflow run
mlflow.end_run()

# If it is a Google Cloud VM and the experiment flag auto_delete has been
# activated, shutdown the remote machine when it ends
if args.group_name and args.auto_delete:
    token = gcloud.get_service_account_token()
    gcloud.delete_instance_MIG_from_container(args.group_name, token)

  • --seed or -sd: Seed for training; random components of the process will be able to be recreated.

  • --remote_store or -sto: Determines if Sinergym output and the Tensorboard log (when a local path is specified and not a remote bucket path) will be sent to a common resource (Bucket), else they will be allocated in the remote container's memory only.

  • --mlflow_store or -mlflow: Determines if Sinergym output and the Tensorboard log (when a local path is specified and not a remote bucket path) will be sent to an MLflow artifact, else they will be allocated in the remote container's memory only.

  • --group_name or -group: Added by cloud_manager.py automatically. It specifies to which MIG the host instance belongs; this is important if --auto_delete is activated.

  • --auto_delete or -del: When this parameter is specified, the remote instance will be auto-removed when its job has finished.

  • algorithm hyperparameters: Execute python DRL_battery.py --help for more information.
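The MLFLOW_TRACKING_URI fallback shown in the script relies on extracting the host from the tracking URI before pinging it. That parsing step, isolated below (the helper name is ours, not Sinergym's):

```python
# Isolated sketch of the URI parsing used before pinging the MLflow server:
# take the last path segment of the URI and drop the port, leaving the host.
def mlflow_host(tracking_uri):
    """Return the host part of a URI like 'http://10.0.0.5:5000'."""
    return tracking_uri.split('/')[-1].split(':')[0]

print(mlflow_host('http://10.0.0.5:5000'))  # 10.0.0.5
```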