This repository provides the reference implementation for the paper RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning, written by the paper's authors. Unlike the commonly used discounted sum of rewards, RVI-SAC uses the average reward as its objective, as shown below (strictly speaking, the objective also includes an entropy term; please refer to the paper for details).
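For reference, the standard average-reward objective (shown here without the entropy term; the paper uses an entropy-augmented version) is:

$$
\rho(\pi) = \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right]
$$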
The average reward is a more natural objective than the discounted sum of rewards for continuing tasks (e.g., locomotion tasks), where episodes continue indefinitely, and using it in place of the discounted reward can be expected to improve performance. Our algorithm, RVI-SAC, is a novel method that combines the average-reward objective with Soft Actor-Critic.
This research has been accepted at ICML 2024.
- Make sure you have `poetry` installed on your system. If you don't have it yet, you can install it by following the instructions here.
- Run the following command to set up the environment using `poetry`:

  ```bash
  poetry install
  ```
- RVI-SAC (proposed method)
- Soft Actor-Critic (Original Implementation: here)
- ARO-DDPG (Original Implementation: here)
Hyperparameters are managed by Hydra; see `config.yaml` for details.
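To inspect the fully composed configuration without launching a run, Hydra's standard `--cfg` flag can be used (a sketch assuming `experiments/main.py` is a regular Hydra entry point with defaults set for every config group):

```bash
# Print the composed job configuration and exit (standard Hydra flag)
poetry run python3 experiments/main.py --cfg job
```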
For example, to run RVI-SAC on `Ant-v4` with seed 0 (parameters are overridden on the command line using Hydra's syntax):

```bash
poetry run python3 experiments/main.py \
    algo=rvi_sac \
    env=Ant-v4 \
    seed=0
```
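Because the entry point uses Hydra, sweeps over several settings can be launched with the standard `--multirun`/`-m` flag. This is a sketch using generic Hydra syntax and example environment names, not a command documented by this repository:

```bash
# Sweep over two environments and three seeds with Hydra multirun
poetry run python3 experiments/main.py -m \
    algo=rvi_sac \
    env=Ant-v4,HalfCheetah-v4 \
    seed=0,1,2
```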