1.6 Reinforcement Learning

Marc Juchli edited this page Apr 24, 2018 · 24 revisions

This section first describes what Reinforcement Learning is and highlights how it differs from other machine learning paradigms. We briefly reason why this technique might be an appropriate choice for the task of optimising order placement. Then a basic understanding of Markov Decision Processes is provided, after which we explain the interaction between the Reinforcement Learning components, followed by a description of their properties.

Introduction

Reinforcement Learning is a specific learning approach in the Machine Learning field that aims to solve problems involving sequential decision making. When a decision made in a system affects future decisions and eventually the outcome, the aim of reinforcement learning is to learn the optimal sequence of decisions.

                      Machine Learning
                             |
        ---------------------|---------------------
       |                     |                     |
   Supervised           Unsupervised           Reinforcement
   (Task driven,        (Data driven,          (Learning to act
   Regression or        Clustering)            in environment)
   Classification)

Typically, such a system operates under limited supervision: it is known what we want to optimise, but not which actions are required to do so. Reinforcement learning learns by maximising rewards while carrying out a task as a sequence of actions; it then evaluates the outcome and updates the strategy accordingly [1]. This process can be regarded as end-to-end learning, where every required component of a system is involved and influences the produced result. Supervised learning techniques, by contrast, are often modelled such that the predicted values do not directly tell the model how to change its parameters, which is why an end-to-end learning setup comes in handy in the context of order execution.

For optimising order placement in limit order books, statistical approaches have long been the preferred choice. While statistics emphasises inference about a process, machine learning emphasises predicting the future value of some variable. Although the ability to influence a non-trivial procedure using predictions might be tempting, supervised machine learning approaches require complex intermediate steps to let the learner interact with the process of buying and selling assets. Reinforcement learning, as mentioned before, makes it possible to model the execution end-to-end, with the learner improving upon the outcome of the submitted orders.

A standard reinforcement learning pipeline is as follows:

Observation -> State estimation -> Modelling & Prediction -> Action

Components

[Figure: RL overview]

A reinforcement learning problem is traditionally defined in terms of two main components: Environment and Agent. Their interfaces are:

  • S is the set of possible states, s_t ∈ S,
  • A(s_t) is the set of actions available in the state at time step t, a_t ∈ A(s_t), and
  • r_t ∈ R is the reward generated at time step t.

With the interfaces above, we can define the interaction between an agent and its environment. Assuming discrete time steps t = 0, 1, 2, ..., at each step:

  1. The agent observes a state s_t ∈ S
  2. and produces an action at time step t: a_t ∈ A(s_t)
  3. which leads to a reward r_{t+1} ∈ R and the next state s_{t+1}
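
The three steps above can be sketched as a simple loop. This is only a minimal illustration under stated assumptions: the `ToyEnvironment` class, its reward values and the uniformly random agent are hypothetical and not part of the original text. They merely stand in for the interfaces S, A(s_t) and r_t.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: states 0..4 on a line, goal at state 4."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def actions(self, state):
        # A(s_t): moving left is not available at the left edge.
        return [+1] if state == 0 else [-1, +1]

    def step(self, action):
        # Apply a_t and return (r_{t+1}, s_{t+1}, done).
        self.state += action
        done = self.state == self.n_states - 1
        reward = 1.0 if done else -0.1  # illustrative reward values
        return reward, self.state, done


def run_episode(env, rng, max_steps=100):
    """Roll out one episode with a uniformly random agent."""
    state = env.reset()                              # 1. observe s_t
    trajectory = []
    for _ in range(max_steps):
        action = rng.choice(env.actions(state))      # 2. pick a_t from A(s_t)
        reward, next_state, done = env.step(action)  # 3. receive r_{t+1}, s_{t+1}
        trajectory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trajectory


trajectory = run_episode(ToyEnvironment(), random.Random(0))
```

Each entry of `trajectory` is one transition (s_t, a_t, r_{t+1}, s_{t+1}). A learning agent would replace the random choice in step 2 with a policy that is updated from these transitions.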

During this process, as the agent aims to maximise its future reward, it consults a policy: a map that gives the probability of taking action a when in state s.

Hence, the policy at time step t, π_t, is a mapping from states to action probabilities that results from the agent's experience, and therefore,

π_t(s,a) is the probability that a_t=a when s_t=s.
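
A stochastic policy of this kind can be sketched as a lookup table. The following is a minimal sketch with assumptions: the state names `s0`, `s1`, the action names `a0`, `a1`, and the helper functions are hypothetical and chosen only for illustration. Each row holds π(s, a) for one state, so its entries must be non-negative and sum to 1.

```python
import random

# Hypothetical tabular policy: policy[s][a] = pi(s, a).
policy = {
    "s0": {"a0": 0.8, "a1": 0.2},
    "s1": {"a0": 0.0, "a1": 1.0},  # deterministic in state s1
}


def is_valid_policy(policy, tol=1e-9):
    """Check that every row is a probability distribution over actions."""
    return all(
        all(p >= 0.0 for p in probs.values())
        and abs(sum(probs.values()) - 1.0) <= tol
        for probs in policy.values()
    )


def sample_action(policy, state, rng):
    """Draw a_t = a with probability pi(s_t, a)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
action = sample_action(policy, "s0", rng)
```

Learning then amounts to adjusting the table entries from experience, so that actions leading to higher reward become more probable.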

The optimal policy, conventionally denoted π*, defines the agent's behaviour.

Environment

Agent

Learning approaches

Bellman equation

Value iteration

https://stackoverflow.com/questions/37370015/what-is-the-difference-between-value-iteration-and-policy-iteration

Policy iteration

Action-value function approximation

NOTES

The learning process simply appends a reward stage to the standard pipeline:

                                              ---------- Reward ----------
                                             |                            |
                                             v                            |
Observation -> State estimation -> Modelling & Prediction -> Action -> Evaluation
     ∧                                                         |
     |                                                         |
      ---------------------------------------------------------

"Reinforcement learning can be naturally integrated with artificial neural networks to obtain high-quality generalization" [1].

The previously described standard reinforcement learning pipeline is simplified by replacing the state estimation and modelling components with a single perception component:

Observation -> Perception -> Action
     ∧                          |
     |                          |
      --------------------------

For this simplified pipeline, the learning process again appends a reward stage:

                     ------- Reward -------
                    |                      |
                    v                      |
Observation -> Perception -> Action -> Evaluation
     ∧                          |
     |                          |
      --------------------------

[1] http://rll.berkeley.edu/deeprlcourse/