Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients

PyTorch implementation of the Stochastic Weighted Twin Delayed Deep Deterministic Policy Gradient (SWTD3) algorithm. Note that the TD3 implementation is heavily based on the original authors' PyTorch implementation of TD3. If you use our code or data, please cite the paper.

The algorithm is tested on MuJoCo and Box2D continuous control tasks.

Computing Infrastructure

The following computing infrastructure was used to produce the results.

Hardware/Software	Model/Version
Operating System	Ubuntu 18.04.5 LTS
CPU	AMD Ryzen 7 3700X 8-Core Processor
GPU	Nvidia GeForce RTX 2070 SUPER
CUDA	11.1
Python	3.8.5
PyTorch	1.8.1
OpenAI Gym	0.17.3
MuJoCo	1.50
Box2D	2.3.10
NumPy	1.19.4

Usage

usage: main.py [-h] [--policy POLICY] [--env ENV] [--seed SEED] [--gpu GPU]
               [--start_time_steps N] [--buffer_size BUFFER_SIZE]
               [--eval_freq N] [--max_time_steps N] [--exploration_noise G]
               [--batch_size N] [--discount G] [--tau G] [--policy_noise G]
               [--noise_clip G] [--policy_freq N] [--save_model]
               [--load_model LOAD_MODEL]
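
As an illustration of the usage line above, the following minimal sketch builds the command lines for a small multi-seed sweep. The helper `build_command`, the environment ID `HalfCheetah-v2`, and the seed range are assumptions chosen for the example; only the flags `--policy`, `--env`, and `--seed` come from the usage text.

```python
import shlex
import sys


def build_command(env, seed, policy="SWTD3", extra=()):
    """Return the argv list for one training run of main.py (hypothetical helper)."""
    return [
        sys.executable, "main.py",
        "--policy", policy,
        "--env", env,
        "--seed", str(seed),
        *extra,
    ]


# Hypothetical sweep: the environment ID and the seed range are examples only.
commands = [build_command("HalfCheetah-v2", seed) for seed in range(3)]

for argv in commands:
    # Print instead of executing. To start a run, call
    # subprocess.run(argv, check=True) from the repository root.
    print(shlex.join(argv[1:]))
```

Printing the commands first makes it easy to check the flags before committing to a long training run.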

Arguments

optional arguments:
  -h, --help            show this help message and exit
  --policy POLICY       Algorithm (default: SWTD3)
  --env ENV             OpenAI Gym environment name
  --seed SEED           Seed number for PyTorch, NumPy and OpenAI Gym (default: 0)
  --gpu GPU             GPU ordinal for multi-GPU computers (default: 0)
  --start_time_steps N  Number of exploration time steps sampling random actions (default: 1000)
  --buffer_size BUFFER_SIZE
                        Size of the experience replay buffer (default: 1000000)
  --eval_freq N         Evaluation period in number of time steps (default: 1000)
  --max_time_steps N    Maximum number of steps (default: 1000000)
  --exploration_noise G Std of Gaussian exploration noise
  --batch_size N        Batch size (default: 256)
  --discount G          Discount factor for reward (default: 0.99)
  --tau G               Learning rate in soft/hard updates of the target networks (default: 0.005)
  --policy_noise G      Noise added to target policy during critic update
  --noise_clip G        Range to clip target policy noise
  --policy_freq N       Frequency of delayed policy updates
  --save_model          Save model and optimizer parameters
  --load_model LOAD_MODEL
                        Model load file name; if empty, does not load
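
To make the roles of `--tau`, `--policy_noise`, and `--noise_clip` concrete, here is a minimal sketch of the two standard TD3 operations they control: the soft target-network update and target policy smoothing. It uses plain Python lists instead of PyTorch tensors so that it runs without dependencies. It is not the repository's code and does not show the SWTD3-specific weighting. The function names and the values `0.2`, `0.5`, and `max_action=1.0` are assumptions for illustration; only `tau = 0.005` is a documented default.

```python
import random


def soft_update(target, online, tau):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t for w, w_t in zip(online, target)]


def clip(x, low, high):
    """Clamp x to the closed interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(action, policy_noise, noise_clip, max_action, rng):
    """Add clipped Gaussian noise to a target action, then clip to the action range."""
    noise = clip(rng.gauss(0.0, policy_noise), -noise_clip, noise_clip)
    return clip(action + noise, -max_action, max_action)


# tau = 0.005 is the documented default; tau = 1.0 would be a hard update (full copy).
target = soft_update(target=[0.0, 0.0], online=[1.0, -2.0], tau=0.005)
print(target)

# policy_noise=0.2 and noise_clip=0.5 are illustrative values, not documented defaults.
rng = random.Random(0)
smoothed = smoothed_target_action(
    0.9, policy_noise=0.2, noise_clip=0.5, max_action=1.0, rng=rng
)
print(smoothed)
```

With a small `tau`, the target weights move only slightly toward the online weights at each update, while the noise clip bounds how far the smoothed action can deviate from the target policy's output.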

Bibtex

@misc{https://doi.org/10.48550/arxiv.2109.11788,
  doi = {10.48550/ARXIV.2109.11788},
  url = {https://arxiv.org/abs/2109.11788},
  author = {Saglam, Baturay and Mutlu, Furkan Burak and Cicek, Dogan Can and Kozat, Suleyman Serdar},
  keywords = {Machine Learning (cs.LG), Artificial Intelligence (cs.AI), Machine Learning (stat.ML), FOS: Computer and information sciences},
  title = {Parameter-free Reduction of the Estimation Bias in Deep Reinforcement Learning for Deterministic Policy Gradients},
  publisher = {arXiv},
  year = {2021},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
