
1.6 Reinforcement Learning


Statistical approaches have long been the preferred choice for optimizing order placement in limit order books. While statistics emphasizes inference of the process that generated the data, machine learning emphasizes prediction of some future variable from that same data.

                      Machine Learning
                             |
        ---------------------|---------------------
       |                     |                     |
   Supervised           Unsupervised           Reinforcement
   (Task driven,        (Data driven,          (Learning to act
   Regression or        Clustering)            in environment)
   Classification)

Reinforcement learning makes it possible to solve problems that involve sequential decision making. That is, when a decision made in a system affects future decisions and eventually the outcome, the aim is to learn the optimal sequence of decisions. Typically, such a system operates under limited supervision: we know what we want to optimize but not which actions are required to achieve it. Reinforcement learning learns by maximizing rewards while performing a task with a sequence of actions; it then evaluates the outcome and updates the strategy accordingly [1]. This process can be regarded as end-to-end learning, where every component of the system is involved and influences the produced result. The advantage is that the underlying learning algorithm improves its strategy according to the very value that the system used as a suggestion from the learned strategy. Unlike supervised learning techniques, which are often modelled such that the predicted values do not directly tell the model how to change its parameters, such an end-to-end learning setup comes in handy in the context of order execution.
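
One common way to make this "evaluate the outcome and update the strategy" step concrete is a tabular Q-learning update. The following is a minimal Python sketch; the states, actions and reward are hypothetical and this is not necessarily the method used in this project:

# Hypothetical tabular action-value estimates Q[state][action];
# the states and actions are placeholders, not taken from the project.
Q = {"s0": {"buy": 0.0, "wait": 0.0},
     "s1": {"buy": 0.0, "wait": 0.0}}

ALPHA = 0.1   # learning rate
GAMMA = 0.99  # discount factor

def update(state, action, reward, next_state):
    # Evaluate the outcome of the chosen action and adjust the strategy:
    # move Q(s, a) towards the observed reward plus the discounted value
    # of the best action available in the next state.
    best_next = max(Q[next_state].values())
    Q[state][action] += ALPHA * (reward + GAMMA * best_next - Q[state][action])

# One observed transition: taking "buy" in "s0" yielded a reward of 1.0
update("s0", "buy", reward=1.0, next_state="s1")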

A standard reinforcement learning pipeline looks as follows:

Observation -> State estimation -> Modelling & Prediction -> Action
     ∧                                                         |
     |                                                         |
      ---------------------------------------------------------

The learning process simply appends a reward stage:

                                              ---------- Reward ----------
                                             |                            |
                                             v                            |
Observation -> State estimation -> Modelling & Prediction -> Action -> Evaluation
     ∧                                                         |
     |                                                         |
      ---------------------------------------------------------

"Reinforcement learning ca be naturally integrated with artificial neural networks to obtain high-quality generalization" [1].

The standard reinforcement learning pipeline described above can be simplified by replacing the state estimation and modelling components with a single perception component:

Observation -> Perception -> Action
     ∧                          |
     |                          |
      --------------------------

Again, the learning process appends a reward stage:

                     ------- Reward -------
                    |                      |
                    v                      |
Observation -> Perception -> Action -> Evaluation
     ∧                          |
     |                          |
      --------------------------
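
To make the perception stage more concrete, the sketch below (using NumPy, with hypothetical feature and action counts) maps a raw observation vector directly to action probabilities with a single linear layer and a softmax; a neural network would simply replace the linear map:

import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 4   # size of the observation vector (assumption)
N_ACTIONS = 3    # size of the discrete action set (assumption)

# The "perception" replaces explicit state estimation and modelling:
# a parameterised function maps the raw observation to action preferences.
W = rng.normal(scale=0.1, size=(N_ACTIONS, N_FEATURES))

def perceive(observation):
    """Map an observation to a probability distribution over actions."""
    logits = W @ observation
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

observation = rng.normal(size=N_FEATURES)
probs = perceive(observation)
action = rng.choice(N_ACTIONS, p=probs)

The evaluation stage would then use the observed reward to adjust W, for instance with a policy-gradient update.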

Components

RL Overview

A reinforcement learning problem is traditionally defined in the context of two main components, Environment and Agent, whose interface consists of the following (a minimal code sketch is given after the list):

  • S is the set of possible states, s_t ∈ S,
  • A(s_t) is the set of actions available in state s_t at time step t, a_t ∈ A(s_t), and
  • r_t ∈ R is the reward generated at time step t.
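
A minimal sketch of this interface in Python (the states, actions and rewards below are placeholders, not the limit order book environment studied here):

import random

class Environment:
    """Toy environment exposing the interface above: S, A(s_t) and r_t."""

    def __init__(self):
        self.states = ["s0", "s1", "s2"]   # S: the set of possible states
        self.state = "s0"                  # the current state s_t

    def actions(self, state):
        # A(s_t): the actions available in the given state
        return ["left", "right"]

    def step(self, action):
        # Apply the action, return the reward r_{t+1} and next state s_{t+1}
        reward = 1.0 if action == "right" else 0.0
        self.state = random.choice(self.states)
        return reward, self.state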

With the interface provided above, we can define the interaction process between agent and environment (sketched in code after the list). Assuming discrete time steps t = 0, 1, 2, ...

  1. The agent observes a state s_t ∈ S
  2. and produces an action at time step t: a_t ∈ A(s_t),
  3. which leads to a reward r_{t+1} ∈ R and the next state s_{t+1}.
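
Reusing the hypothetical Environment sketch (and its random import) from above, this interaction over discrete time steps could look as follows, with a uniformly random choice standing in for the policy:

env = Environment()
state = env.state                          # 1. observe s_t

for t in range(10):
    available = env.actions(state)         # actions A(s_t) available in s_t
    action = random.choice(available)      # 2. produce a_t (random placeholder policy)
    reward, state = env.step(action)       # 3. receive r_{t+1} and move to s_{t+1}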

During this process, as the agent aims to maximise its future reward, it consults a policy: a mapping that gives the probability of taking action a when in state s.

Hence, the policy at time step t, π_t, is a mapping from states to action probabilities built up from the agent's experience, and therefore

π_t(s, a) is the probability that a_t = a when s_t = s.

The optimal policy, commonly denoted π*, maximises the expected future reward and defines the agent's behaviour.
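
Such a stochastic policy can be represented directly as a table of action probabilities per state; a minimal sketch with hypothetical states and actions:

import random

# pi(s, a): probability of taking action a in state s
policy = {
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.5, "right": 0.5},
}

def sample_action(state):
    """Draw a_t according to pi_t(s, a) for the observed state s_t."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action("s0")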

Environment

Agent

Learning approaches

Bellman equation

Value iteration

https://stackoverflow.com/questions/37370015/what-is-the-difference-between-value-iteration-and-policy-iteration

Policy iteration

Action-value function approximation


[1] http://rll.berkeley.edu/deeprlcourse/