- Use actor-critic as the RL framework
- Fix a reward that was always negative, which, combined with log probabilities, caused convergence to the wrong policy
Benchmarks will come soon; do not use the trained models from this release.