
Commit

docs: refine typo in docs
Gaiejj authored Aug 22, 2023
2 parents ff0fd62 + addafb1 commit 35c0804
Showing 16 changed files with 116 additions and 67 deletions.
35 changes: 29 additions & 6 deletions Makefile
@@ -16,6 +16,8 @@ default: install
check_pip_install = $(PYTHON) -m pip show $(1) &>/dev/null || (cd && $(PYTHON) -m pip install $(1) --upgrade)
check_pip_install_extra = $(PYTHON) -m pip show $(1) &>/dev/null || (cd && $(PYTHON) -m pip install $(2) --upgrade)

# Installations

install:
$(PYTHON) -m pip install -vvv .

@@ -26,6 +28,24 @@ install-editable:

install-e: install-editable # alias

docs-install:
$(call check_pip_install_extra,pydocstyle,pydocstyle[toml])
$(call check_pip_install,doc8)
$(call check_pip_install,sphinx)
$(call check_pip_install,sphinx-autoapi)
$(call check_pip_install,sphinx-autobuild)
$(call check_pip_install,sphinx-copybutton)
$(call check_pip_install,sphinx-autodoc-typehints)
$(call check_pip_install_extra,sphinxcontrib-spelling,sphinxcontrib-spelling pyenchant)
$(PYTHON) -m pip install -r docs/requirements.txt

pytest-install:
$(call check_pip_install,pytest)
$(call check_pip_install,pytest-cov)
$(call check_pip_install,pytest-xdist)

# Benchmark

multi-benchmark:
cd safepo/multi_agent && $(PYTHON) benchmark.py --total-steps 10000000 --experiment benchmark

@@ -64,13 +84,16 @@ test-benchmark: install-editable multi-test-benchmark single-test-benchmark plot

benchmark: install-editable multi-benchmark single-benchmark plot eval

pytest-install:
$(call check_pip_install,pytest)
$(call check_pip_install,pytest-cov)
$(call check_pip_install,pytest-xdist)

pytest: pytest-install
cd tests && \
$(PYTHON) -m pytest --verbose --color=yes --durations=0 \
--cov="../safepo" --cov-config=.coveragerc --cov-report=xml --cov-report=term-missing \
$(PYTESTOPTS) .
$(PYTESTOPTS) .

# Documentation

docs: docs-install
$(PYTHON) -m sphinx_autobuild --watch $(PROJECT_PATH) --open-browser docs/source docs/build

spelling: docs-install
$(PYTHON) -m sphinx_autobuild -b spelling docs/source docs/build
1 change: 1 addition & 0 deletions docs/requirements.txt
@@ -7,3 +7,4 @@ sphinx-design
moviepy
pygame
sphinx_github_changelog
sphinxcontrib-spelling
2 changes: 1 addition & 1 deletion docs/source/algorithms/comparision.rst
@@ -13,7 +13,7 @@ We have compared the following algorithms:
- ``CPO``: `OpenAI Baselines: Safety Starter Agents <https://github.com/openai/safety-starter-agents>`_, `RL Safety Algorithms <https://github.com/SvenGronauer/RL-Safety-Algorithms>`_
- ``FOCOPS``: `Original Implementation <https://github.com/ymzhang01/focops>`_

We compared those alforithms in 12 tasks from `Safety-Gymnasium <https://github.com/PKU-Alignment/safety-gymnasium>`_,
We compared those algorithms in 12 tasks from `Safety-Gymnasium <https://github.com/PKU-Alignment/safety-gymnasium>`_,
they are:

- ``SafetyPointButton1-v0``
6 changes: 3 additions & 3 deletions docs/source/algorithms/curve.rst
@@ -4,7 +4,7 @@ Training Curves
Safe reinforcement learning algorithms are designed to achieve high reward while satisfying the safety constraint.
In this section, we evaluate the performance of SafePO's algorithms on the various environments in `Safety-Gymnasium <https://github.com/PKU-Alignment/safety-gymnasium>`_.

Single Agent
Single-Agent
------------

First order
@@ -97,8 +97,8 @@ Second order

</iframe>

Muilti-Agent
------------
Multi-Agent
-----------

.. tab-set::

14 changes: 7 additions & 7 deletions docs/source/algorithms/first_order.rst
@@ -32,7 +32,7 @@ Implementation Details

.. note::

All experiemnts are ran under total 1e7 steps, while in the `Doggo <https://www.safety-gymnasium.com/en/latest/components_of_environments/agents.html>`_ agent, 1e8 steps are used.
All experiments are ran under total 1e7 steps, while in the `Doggo <https://www.safety-gymnasium.com/en/latest/components_of_environments/agents.html>`_ agent, 1e8 steps are used.
This setting is the same as `Safety-Gym <https://www.google.com.hk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwjevqzswM-AAxXZtlYBHVFlDOAQFnoECBIQAQ&url=https%3A%2F%2Fopenai.com%2Fresearch%2Fsafety-gym&usg=AOvVaw2bTv-b9BBuC-4eDmkFZPr3&opi=89978449>`_

Environment Wrapper
@@ -81,8 +81,8 @@ of observations, rewards and costs:
Lagrangian Multiplier
~~~~~~~~~~~~~~~~~~~~~

Lagreangian-based alforithms use ``Lagrangian Multiplier`` to control the safety
constraint. The ``Lagrangian Multiplier`` is an intergrated part of
Lagrangian-based algorithms use ``Lagrangian Multiplier`` to control the safety
constraint. The ``Lagrangian Multiplier`` is an Integrated part of
SafePO.
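
As a rough illustration of the idea only (this is not SafePO's actual API; the class name ``SimpleLagrange``, the default learning rate, and the update rule here are assumptions made for the sketch), a Lagrangian multiplier can be kept non-negative and pushed up whenever the observed episode cost exceeds the cost limit:

.. code:: python

    import torch

    class SimpleLagrange:
        """Illustrative Lagrangian multiplier for a cost constraint."""

        def __init__(self, cost_limit: float, lr: float = 0.035) -> None:
            self.cost_limit = cost_limit
            # Optimize an unconstrained parameter; clamping keeps lambda >= 0.
            self.lagrangian_param = torch.nn.Parameter(torch.zeros(1))
            self.optimizer = torch.optim.Adam([self.lagrangian_param], lr=lr)

        @property
        def multiplier(self) -> float:
            return torch.clamp(self.lagrangian_param, min=0.0).item()

        def update(self, mean_episode_cost: float) -> None:
            # Gradient ascent on the constraint violation: lambda grows when
            # the observed cost exceeds the limit and shrinks otherwise.
            loss = -self.lagrangian_param * (mean_episode_cost - self.cost_limit)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

The resulting multiplier is then used to weight the cost term in the policy update, as the key points below describe.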

Some key points:
@@ -132,9 +132,9 @@ We provide how ``SafePO`` implements the two stage projection:

.. tab-item:: CUP

CUP first make a PPO update to imporve the policy reward.
CUP first make a PPO update to improve the policy reward.
Then it projects the policy back to the safe set.
We will foccus on the projection part.
We will focus on the projection part.

- Get the cost advantage from buffer and prepare training data.

@@ -149,7 +149,7 @@
shuffle=True,
)
- Update the policy by using cost adavantage and kl divergence.
- Update the policy by using cost advantage and kl divergence.

.. code:: python
@@ -177,7 +177,7 @@ We provide how ``SafePO`` implements the two stage projection:
distribution, old_distribution_b
).sum(-1, keepdim=True)
- Then, update the policy by using cost adavantage and kl divergence.
- Then, update the policy by using cost advantage and kl divergence.

.. code:: python
6 changes: 3 additions & 3 deletions docs/source/algorithms/lag.rst
@@ -32,7 +32,7 @@ Implement Details

.. note::

All experiemnts are ran under total 1e7 steps, while in the `Doggo <https://www.safety-gymnasium.com/en/latest/components_of_environments/agents.html>`_ agent, 1e8 steps are used.
All experiments are ran under total 1e7 steps, while in the `Doggo <https://www.safety-gymnasium.com/en/latest/components_of_environments/agents.html>`_ agent, 1e8 steps are used.
This setting is the same as `Safety-Gym <https://www.google.com.hk/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&ved=2ahUKEwjevqzswM-AAxXZtlYBHVFlDOAQFnoECBIQAQ&url=https%3A%2F%2Fopenai.com%2Fresearch%2Fsafety-gym&usg=AOvVaw2bTv-b9BBuC-4eDmkFZPr3&opi=89978449>`_

Environment Wrapper
@@ -81,8 +81,8 @@ of observations, rewards and costs:
Lagrangian Multiplier
~~~~~~~~~~~~~~~~~~~~~

Lagreangian-based alforithms use ``Lagrangian Multiplier`` to control the safety
constraint. The ``Lagrangian Multiplier`` is an intergrated part of
Lagrangian-based algorithms use ``Lagrangian Multiplier`` to control the safety
constraint. The ``Lagrangian Multiplier`` is an Integrated part of
SafePO.

Some key points:
2 changes: 1 addition & 1 deletion docs/source/api/buffer.rst
@@ -3,7 +3,7 @@ Buffer

.. currentmodule:: safepo.common.buffer

Single Agent Buffer
Single-Agent Buffer
-------------------

.. autoclass:: VectorizedOnPolicyBuffer
2 changes: 1 addition & 1 deletion docs/source/api/env.rst
@@ -3,7 +3,7 @@ Environment Maker

.. currentmodule:: safepo.common.env

Single Agent Environment
Single-Agent Environment
------------------------

MuJoCo Environment
2 changes: 1 addition & 1 deletion docs/source/api/logger.rst
@@ -1,7 +1,7 @@
Logger
======

Simple usage
Simple Usage
------------

.. code-block:: python
5 changes: 5 additions & 0 deletions docs/source/conf.py
@@ -6,6 +6,7 @@
# -- Project information -----------------------------------------------------
# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
import pathlib
import os
import sys


@@ -33,6 +34,10 @@
'sphinx_design',
]

if not os.getenv('READTHEDOCS', None):
extensions.append('sphinxcontrib.spelling')

source_suffix = {'.rst': 'restructuredtext', '.md': 'markdown'}
templates_path = ['_templates']
exclude_patterns = []

2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -41,7 +41,7 @@ One line to run SafePO benchmark:
make benchmark
Then you can check the runs in ``safepo/runs``. After that, you can check the
results (eavluation outcomes, training curves) in ``safepo/results``.
results (evaluation outcomes, training curves) in ``safepo/results``.


.. toctree::
22 changes: 21 additions & 1 deletion docs/spelling_wordlist.txt → docs/source/spelling_wordlist.txt
@@ -18,7 +18,7 @@ algos
config
configs
timestep
timesteps
steps
rollout
GAE
PPO
@@ -405,3 +405,23 @@ Unsqueeze
rescales
affinely
rescales
eval
dir
cpu
tensorboard
rollout
benchmarking
conda
num
rnn
probs
randn
csv
hyperparameters
reproducibility
Dimensionality
Normalizer
Stooke
pkl
serializable
subclasses
4 changes: 2 additions & 2 deletions docs/source/usage/eval.rst
@@ -16,7 +16,7 @@ This will evaluate the model in the last checkpoint of the training, and save th
Training Curve Plotter
----------------------

Training curves reveal the episodic reward and cost overtime, which is usefull to evaluate the performance of the algorithms.
Training curves reveal the episodic reward and cost overtime, which is useful to evaluate the performance of the algorithms.

suppose you have ran the training script in `algorithms training <./train.html>`_ and saved the training log in `safepo/runs/ppo_lag_exp`, then you can plot the training curve by running:

@@ -27,7 +27,7 @@ suppose you have ran the training script in `algorithms training <./train.html>`
.. note::

This plotter is also suitable for mmulti-agent algorithms plotting. However, in experiment we found that
This plotter is also suitable for multi-agent algorithms plotting. However, in experiment we found that
the cost value training curve of multi-agent safe and unsafe algorithms are largely different, which makes the
plot not very clear. So we recommend to plot the multi-agent training curve by running the plotter in ``safepo/multi_agent/plot_for_benchmark``.

10 changes: 5 additions & 5 deletions docs/source/usage/implement.rst
@@ -18,28 +18,28 @@ To verify the correctness of the classic RL algorithms, we provide the performan
</iframe>


Intergrated Safe RL Pipeline
Integrated Safe RL Pipeline
----------------------------

SafePO's classic RL algorithms are integrated with the Safe RL pipeline, though they make no use of the constraint.
You can customize the Safe RL algorithms based on the classic RL algorithms.

Breifly, the ``PPO`` in SafePO has the following characteristics, which are also suitable for other customization of safe RL algorithms.
Briefly, the ``PPO`` in SafePO has the following characteristics, which are also suitable for other customization of safe RL algorithms.

- ``VectorizedOnPolicyBuffer``: A vectorized buffer supporting cost adavantage estimation.
- ``VectorizedOnPolicyBuffer``: A vectorized buffer supporting cost advantage estimation.
- ``ActorVCritic``: A actor-critic network supporting cost value estimation.
- ``Lagrange``: A lagrangian multiplier for constraint violation control.

Beyond the above characteristics, the ``PPO`` in SafePO also provides a training pipeline for data collection and training.
You can customize new alforithms based on it.
You can customize new algorithms based on it.

Next we will provide a detailed example to show how to customize the ``PPO`` algorithm to ``PPO-Lag`` algorithm.

Example: PPO-Lag
----------------

The Lagrangian multiplier is a useful tool to control the constraint violation in the Safe RL algorithms.
Classic RL algorithms combined with the Lagrangian multiplier are exellent baselines for Safe RL algorithms.
Classic RL algorithms combined with the Lagrangian multiplier are trustworthy baselines for Safe RL algorithms.
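
To make the customization concrete, the sketch below shows one common way to fold the Lagrangian multiplier into a clipped PPO policy loss (the function name and signature are assumptions for illustration and do not mirror SafePO's actual code): the cost advantage is subtracted with weight lambda and the mixture is rescaled by 1/(1 + lambda) to keep the loss magnitude stable.

.. code:: python

    import torch

    def ppo_lag_policy_loss(
        ratio: torch.Tensor,       # pi_theta(a|s) / pi_old(a|s)
        adv_reward: torch.Tensor,  # reward advantage estimates
        adv_cost: torch.Tensor,    # cost advantage estimates
        multiplier: float,         # current Lagrangian multiplier value
        clip: float = 0.2,
    ) -> torch.Tensor:
        # Mix reward and cost advantages with the multiplier, then rescale
        # so the gradient scale stays comparable as lambda grows.
        advantage = (adv_reward - multiplier * adv_cost) / (1.0 + multiplier)
        surrogate = torch.min(
            ratio * advantage,
            torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * advantage,
        )
        return -surrogate.mean()

Everything else in the pipeline (data collection, value learning, and the multiplier update) can typically stay as in the classic ``PPO`` implementation.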

.. note::

10 changes: 5 additions & 5 deletions docs/source/usage/make.rst
@@ -1,10 +1,10 @@
Efficient Commands
==================

To help users quickly reporduce our results,
To help users quickly reproduce our results,
we provide a command line tool for easy installation, benchmarking, and evaluation.

One line benchmark running
One Line Benchmark Running
--------------------------

First, create a conda environment with Python 3.8.
@@ -20,16 +20,16 @@ Then, run the following command to install SafePO and run the full benchmark:
make benchmark
This command will install SafePO in editable mode and excute the training process parallelly.
This command will install SafePO in editable mode and execute the training process of all algorithms on all environments.
After the training process is finished, it will evaluate the trained policies and generate the benchmark results,
including training curves and evaluation rewards and costs.

Simple benchmark running
Simple Benchmark Running
------------------------

The full benchmark is time-consuming.
To verify the performance of SafePO, we provide a simple benchmark command,
which runs all alforithms on sampled environments and evaluate the trained policies.
which runs all algorithms on sampled environments and evaluate the trained policies.

.. code-block:: bash