
1.6 Reinforcement Learning

Marc Juchli edited this page Apr 21, 2018 · 24 revisions

Statistical approaches have long been the preferred choice for optimizing order placement in limit order books. While statistics emphasizes inference about the process that generated the data, machine learning emphasizes prediction of some variable of interest from that same data.

                      Machine Learning
                             |
        ---------------------|---------------------
       |                     |                     |
   Supervised           Unsupervised           Reinforcement
   (Task driven,        (Data driven,          (Learning to act
   Regression or        Clustering)            in environment)
   Classification)

Reinforcement learning makes it possible to solve problems that involve sequential decision making. That is, when a decision made in a system affects future decisions and eventually the outcome, the aim is to learn the optimal sequence of decisions. Such a system typically operates under limited supervision: it is known what we want to optimize, but not which actions are required to achieve it. Reinforcement learning learns by maximizing rewards while carrying out a task through a sequence of actions; it then evaluates the outcome and updates its strategy accordingly [1].

This process can be regarded as end-to-end learning, where every component of the system is involved and influences the produced result. The advantage is that the underlying learning algorithm improves its strategy according to the very value that the system used as a suggestion from the learned strategy. Unlike supervised learning techniques, which are often modelled such that the predicted values do not directly tell the model how to change its parameters, such an end-to-end learning environment comes in handy in the context of order execution.
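The idea of learning a sequence of decisions from rewards alone can be sketched with tabular Q-learning on a hypothetical toy version of the order placement problem (the environment dynamics, costs, and fill probability below are illustrative assumptions, not the setup studied in this work): an agent must sell one unit within three steps and, at each step, either submits a market order (guaranteed fill, pays the spread) or waits with a passive order (may be filled at a better price, but risks a forced, worse fill when time runs out).

```python
import random

random.seed(0)

ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.1   # learning rate, discount, exploration
ACTIONS = ["market", "wait"]

# Q-table: steps_remaining -> action -> estimated value
Q = {t: {a: 0.0 for a in ACTIONS} for t in range(4)}

def step(t, action):
    """Toy environment (assumed dynamics): returns (reward, next_t, done)."""
    if action == "market":
        return -1.0, t - 1, True            # cross the spread: pay 1 tick
    if t - 1 == 0:                          # time ran out: forced market order
        return -2.0, 0, True                # worse price under time pressure
    filled = random.random() < 0.4          # passive order filled 40% of the time
    return (0.0, t - 1, True) if filled else (0.0, t - 1, False)

def choose(t):
    """Epsilon-greedy action selection."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[t][a])

for episode in range(5000):
    t, done = 3, False
    while not done:
        a = choose(t)
        r, t2, done = step(t, a)
        target = r if done else r + GAMMA * max(Q[t2].values())
        Q[t][a] += ALPHA * (target - Q[t][a])   # Q-learning update
        t = t2

print({t: {a: round(v, 2) for a, v in Q[t].items()} for t in (1, 2, 3)})
```

No one tells the agent which action is correct at any step; it only observes the eventual costs, yet the learned values come to reflect the trade-off: with two or more steps left, waiting is worth the fill chance, while on the last step the market order avoids the forced worse fill.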

A standard reinforcement pipeline is as follows:

Observation -> State estimation -> Modelling & Prediction -> Action
     ∧                                                         |
     |                                                         |
      ---------------------------------------------------------

The learning process simply appends a reward stage:

                                              ---------- Reward ----------
                                             |                            |
                                             v                            |
Observation -> State estimation -> Modelling & Prediction -> Action -> Evaluation
     ∧                                                         |
     |                                                         |
      ---------------------------------------------------------
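The reward-augmented pipeline above can be written as a generic training loop. The environment and agent below are hypothetical stand-ins chosen purely to make each stage concrete; a real agent would update model parameters in the evaluation step rather than merely recording the reward.

```python
class TinyEnv:
    """Toy environment (assumption for illustration): a counter the agent
    must drive to zero, paying -1 per step until it gets there."""
    def reset(self):
        self.n = 5
        return self.n                            # raw observation
    def step(self, action):
        self.n -= action                         # 0 = hold, 1 = decrement
        done = self.n == 0
        return self.n, (0.0 if done else -1.0), done

class TinyAgent:
    def estimate_state(self, observation):       # State estimation
        return observation                       # here the observation IS the state
    def predict(self, state):                    # Modelling & Prediction
        return 1 if state > 0 else 0
    def evaluate(self, reward):                  # Evaluation: reward fed back
        self.last_reward = reward                # a real agent updates its model here

env, agent = TinyEnv(), TinyAgent()
obs, done, total = env.reset(), False, 0.0
while not done:
    state = agent.estimate_state(obs)            # Observation -> State estimation
    action = agent.predict(state)                # State -> Action
    obs, reward, done = env.step(action)         # act in the environment
    agent.evaluate(reward)                       # Evaluation closes the reward loop
    total += reward

print(total)
```

Each iteration of the `while` loop is one pass around the diagram: the arrow from Action back to Observation is the next call to `env.step`, and the Reward edge is the value handed to `agent.evaluate`.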

[Figure: RL overview]


[1] http://rll.berkeley.edu/deeprlcourse/