Safe Reinforcement Learning Algorithms

HCOPE
Safe Exploration
Off Policy Evaluation
Solving side effects

HCOPE (High-Confidence Off-Policy Evaluation.)

Python Implementation of HCOPE lower bound evaluation as given in the paper: Thomas, Philip S., Georgios Theocharous, and Mohammad Ghavamzadeh. "High-Confidence Off-Policy Evaluation." AAAI. 2015.

Requirements

PyTorch
Numpy
Matplotlib
scipy
gym

Running Instructions

Modify the environment in the main function, choosing from OpenAI gym. (Currently the code works for discrete action spaces)
Run python hcope.py

Notes

The file policies.py contains the policy used in the code. Modify the policy to suit your needs in this file.
To reproduce the graph given in the original paper explaining the long tail problem of importance sampling, use the

visualize_IS_distribution()

method. Also, a graph of distribution of Importance sampling ratio is created which nicely explains the high variance of the simple IS estimator.

All the values required for offpolicy estimation are initialized in the HCOPE class initialization.
Currently the estimator policy is defined as a gaussian noise(mean,std_dev) added to the behavior policy for estimator policy initialization in the function setup_e_policy(). The example in paper uses policies differing by natural gradient. But, this works as well.
To estimate c*, I use the BFGS method which does not require computing hessian or first order derivative.
The hcope_estimator() method also implements a sanity check, by computing the discriminant of the quadratic in parameter delta(confidence). If it does not satisfy the basic constraints, the program prints the bound predicted is of zero confidence.
The random variables are implemented using simple importance sampling. Per-decision importance sampling might lead to better bounds and is to be explored.
A bilayer MLP policy is used for general problems.
Results:
Output format:

Safe exploration in continuous action spaces.

Paper: Safe Exploration in Continuous Action Spaces - Dalal et al.

Running Instructions

Go inside safe_exploration folder
First learn the safety function by collecting experiences
python learn_safety_function.py
Now using the learned safety function, add the path of these learned torch weights in the train_safe_explorer.py file. After that:
python train_safe_explorer.py
This enables agent to learn while following the safety constraints.

Results

Safe exploration in a case where constraint is on crossing the right lane marker.

Instability is observed in safe exploration using this method. Here constraint is activated going left through the center of the road.(0.3)

Explanation

Linear Safety Signal Model

Safety Layer via Analytical Optimization

Action Correction

Importance Sampling

Implementation of:

Simple Importance Sampling
Per-Decision Importance Sampling
Normalized Per-Decision Importance Sampling (NPDIS) Estimator
Weighted Importance Sampling (WIS) Estimator
Weighted Per-Decision Importance Sampling (WPDIS) Estimator
Consistent Weighted Per-Decision Importance Sampling (CWPDIS) Estimator

Comparision of different importance sampling estimators:

Image is taken from phD thesis of P.Thomas:
Links: https://people.cs.umass.edu/~pthomas/papers/Thomas2015c.pdf

Side Effects

Penalizing side effects using relative reachability

Code - https://github.com/hari-sikchi/safeRL/tree/safe_recovery/side_effects

Added a simple example for calculating side effects as given towards the end of paper

The relative reachability measure

Paper: Penalizing side effects using stepwise relative reachability - Krakovna et al.

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
HCOPE		HCOPE
MADRaS @ 129ee03		MADRaS @ 129ee03
importance_sampling		importance_sampling
results		results
safe-grid-gym @ 6a080f9		safe-grid-gym @ 6a080f9
safe_exploration		safe_exploration
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE.txt		LICENSE.txt
README.md		README.md
README.md~		README.md~

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Safe Reinforcement Learning Algorithms

HCOPE (High-Confidence Off-Policy Evaluation.)

Requirements

Running Instructions

Notes

Safe exploration in continuous action spaces.

Running Instructions

Results

Explanation

Importance Sampling

Side Effects

Penalizing side effects using relative reachability

About

Releases

Packages

Languages

License

Santara/safeRL

Folders and files

Latest commit

History

Repository files navigation

Safe Reinforcement Learning Algorithms

HCOPE (High-Confidence Off-Policy Evaluation.)

Requirements

Running Instructions

Notes

Safe exploration in continuous action spaces.

Running Instructions

Results

Explanation

Importance Sampling

Side Effects

Penalizing side effects using relative reachability

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages