1.6 Reinforcement Learning

Marc Juchli edited this page Apr 24, 2018 · 24 revisions

This section first describes what Reinforcement Learning is and how it differs from other machine learning paradigms. We briefly explain why this technique might be an appropriate choice for the task of optimising order placement. Then, a basic understanding of Markov Decision Processes is provided, after which we explain the interaction between the Reinforcement Learning components, followed by a description of their properties.

Introduction

Reinforcement Learning is a learning approach within the Machine Learning field that aims to solve problems involving sequential decision making. When a decision made in a system affects future decisions and eventually the outcome, reinforcement learning aims to learn the optimal sequence of decisions.

                      Machine Learning
                             |
        ---------------------|---------------------
       |                     |                     |
   Supervised           Unsupervised           Reinforcement
   (Task driven,        (Data driven,          (Learning to act
   Regression or        Clustering)            in environment)
   Classification)

In reinforcement learning there is no supervision; instead, an agent learns by maximising rewards. The feedback received while carrying out a task with a sequence of actions might be delayed by several time steps, and hence the agent might spend some time exploring until it finally reaches the goal and can update its strategy accordingly.

This process can be regarded as end-to-end learning and differs from other machine learning paradigms. In supervised learning, for example, the algorithm learns from specific situations that are each presented together with the right action to take, and from there it tries to generalise. In addition, the data in reinforcement learning problems is not independent and identically distributed (i.i.d.). While exploring, the agent might in fact miss some parts of the environment that are important for learning the optimal behaviour. Hence, time is crucial, as the agent must explore as many parts of the environment as possible to be able to take the appropriate actions. [3]

For optimising order placement in limit order books, statistical approaches have long been the preferred choice. While statistics emphasises inference about a process, machine learning emphasises the prediction of the future with respect to some variable. Although influencing a non-trivial procedure using predictions might be tempting, supervised learning approaches require complex intermediate steps in order to let the learner interact with the process of buying and selling assets. Reinforcement learning, as mentioned before, makes it possible to model the execution process as a pipeline in which the learner improves based on the outcome of the submitted orders. [1]

A generic end-to-end training pipeline is as follows:

Observation -> State estimation -> Modelling & Prediction -> Action

Since we are working with financial systems, let us assume we want to buy and sell stocks on a stock exchange. In reinforcement learning terms, the trader is represented as an agent and the exchange is the environment. The details of the environment do not have to be known, as it is regarded as a black box. The agent's purpose is to observe the state of the environment, for example the current price of a stock. The agent then makes an estimate about the observed state and decides which action to take next: buy or sell. The action is then sent to the environment, which determines whether this was a good or bad choice, for example whether we made a profit or a loss.

Markov Decision Process (MDP)

A process such as the one sketched above can be formalised as a Markov Decision Process. An MDP is a 5-tuple <S, A, P, R, γ> where:

  1. S is the finite set of possible states, with s_t ∈ S being the state at time step t.
  2. A(s_t) is the set of actions available in state s_t, with a_t ∈ A(s_t) being the action taken at time step t.
  3. P is the state transition model p(s_{t+1} | s_t, a_t), which describes how the environment state changes depending on the current state s_t and the action a_t.
  4. R is the reward model p(r_{t+1} | s_t, a_t), which describes the immediate reward that the agent receives from the environment after performing action a_t in the current state s_t.
  5. γ is the discount factor, which determines the importance of future rewards.
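To make the tuple more concrete, the following is a minimal sketch of a toy MDP in Python. The two states, the action names, the probabilities and the rewards are all invented for illustration and are not taken from the order placement problem. The reward model is simplified to a fixed reward per state-action pair.

```python
import random

# Hypothetical toy MDP: all names and numbers are illustrative assumptions.
STATES = ["low", "high"]                                      # S
ACTIONS = {"low": ["wait", "buy"], "high": ["wait", "sell"]}  # A(s)
GAMMA = 0.9                                                   # discount factor

# Transition model p(s' | s, a): maps (s, a) to {s': probability}.
TRANSITIONS = {
    ("low", "wait"): {"low": 0.7, "high": 0.3},
    ("low", "buy"): {"low": 0.4, "high": 0.6},
    ("high", "wait"): {"low": 0.3, "high": 0.7},
    ("high", "sell"): {"low": 0.6, "high": 0.4},
}

# Reward model, simplified here to one fixed reward per (s, a).
REWARDS = {
    ("low", "wait"): 0.0,
    ("low", "buy"): -1.0,
    ("high", "wait"): 0.0,
    ("high", "sell"): 2.0,
}


def sample_next_state(state, action, rng=random):
    """Draw s_{t+1} from p(. | s_t, a_t)."""
    dist = TRANSITIONS[(state, action)]
    next_states = list(dist)
    weights = [dist[s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]
```

Storing the models as dictionaries only works for small, finite state and action sets, which matches the finite set S assumed in the definition.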

Interaction

[Figure: RL Overview]

A reinforcement learning problem is commonly defined with the help of two main components: Environment and Agent. With the interfaces provided above (MDP), we can define an interaction process between an agent and environment. Assuming discrete time steps: t=0, 1, 2, ...

  1. The agent observes a state s_t ∈ S
  2. and produces an action at time step t: a_t ∈ A(s_t)
  3. which leads to a reward r_{t+1} ∈ R and the next state s_{t+1}

During this process, and as the agent aims to maximise its future reward, the agent consults a policy, which dictates which action to take given a particular state.
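The interaction steps can be sketched as a simple loop. This is only an illustration under stated assumptions: `ToyEnvironment`, its random-walk price, the reward rule and `random_policy` are hypothetical stand-ins, not the environment or agent used for order placement.

```python
import random


class ToyEnvironment:
    """Hypothetical black-box environment: the price follows a seeded random walk."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.price = 100.0

    def reset(self):
        self.price = 100.0
        return self.price  # initial state s_0

    def step(self, action):
        """Apply action a_t and return (s_{t+1}, r_{t+1})."""
        change = self.rng.choice([-1.0, 1.0])
        self.price += change
        # Assumed reward rule: "buy" profits from a rise, "sell" from a fall.
        reward = change if action == "buy" else -change
        return self.price, reward


def random_policy(state, rng):
    """Placeholder policy that ignores the state."""
    return rng.choice(["buy", "sell"])


def run_episode(env, policy, steps=10, seed=1):
    rng = random.Random(seed)
    state = env.reset()                   # agent observes s_0
    rewards = []
    for _ in range(steps):
        action = policy(state, rng)       # agent produces a_t
        state, reward = env.step(action)  # environment returns s_{t+1}, r_{t+1}
        rewards.append(reward)
    return rewards


rewards = run_episode(ToyEnvironment(), random_policy)
print(sum(rewards))
```

The agent only ever sees states and rewards, which reflects the black-box view of the environment.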

Policy

A policy is a function that can be either deterministic or stochastic. A stochastic policy is a probability distribution over actions given the state, whereas a deterministic policy is a mapping function π: S → A, where S is the set of possible states and A is the set of possible actions.

The stochastic policy at time step t, π_t, is a mapping from states to action probabilities that results from the agent's experience; π_t(s, a) is the probability that a_t = a when s_t = s.

The optimal policy is oftentimes denoted as π*.
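The difference between the two kinds of policy can be illustrated with a short sketch. The price threshold of 100 and the probabilities are arbitrary assumptions chosen for the example, and the state is assumed to be a single price.

```python
import random

ACTIONS = ["buy", "sell"]


def deterministic_policy(state):
    """Maps each state to exactly one action (hypothetical threshold rule)."""
    return "buy" if state < 100.0 else "sell"


def stochastic_policy(state):
    """Returns the probability of each action in the given state."""
    p_buy = 0.8 if state < 100.0 else 0.2
    return {"buy": p_buy, "sell": 1.0 - p_buy}


def sample_action(state, rng):
    """Draw a_t according to the stochastic policy."""
    probs = stochastic_policy(state)
    weights = [probs[a] for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


print(deterministic_policy(99.0))                # always "buy"
print(sample_action(99.0, random.Random(0)))     # "buy" with probability 0.8
```

A deterministic policy can be seen as the special case of a stochastic policy that puts probability 1 on a single action.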

Reward

The goal is that the agent learns how to select actions such that it maximises its future reward when submitting them to the environment. We rely on the standard assumption that future rewards are discounted by a factor of γ per time step, in the sense that the total discounted reward of an episode amounts to

r_1 + γ r_2 + γ^2 r_3 + ... + γ^{T-1} r_T

Hence we define the future discounted return at time t as

G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{T-t-1} γ^k r_{t+k+1}

, where T is the length of the episode (which can be infinity if there is no maximum length for the episode).

The discount factor serves two purposes: it prevents the total reward from going to infinity (for bounded rewards and 0 ≤ γ < 1; with γ = 1 the sum is only guaranteed to be finite for episodes of finite length), and it controls the agent's preference between immediate rewards and rewards potentially received in the future. [4]
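As a small worked example, the discounted return can be computed from a list of rewards. Writing G_t for the return from time step t, the sketch below uses the recursion G_t = r_{t+1} + γ G_{t+1}; the function name and the reward values are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma^k * rewards[k], where rewards[0] is r_{t+1}."""
    g = 0.0
    # Work backwards through the episode: G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
print(discounted_return(rewards, 0.0))  # only the immediate reward: 1.0
print(discounted_return(rewards, 1.0))  # undiscounted sum: 3.0
```

A small γ makes the agent short-sighted, while a γ close to 1 weights future rewards almost as much as immediate ones.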

Environment

There are two types of environments:

  • Deterministic environment: both the state transition model and the reward model are deterministic functions. In this setup, if the agent repeats a given action a_t in a given state s_t, the result will always be the same next state s_{t+1} and reward r_{t+1}.
  • Stochastic environment: there is uncertainty about the outcome of taking an action a_t in state s_t, as the next state s_{t+1} and the received reward r_{t+1} might differ each time.

Deterministic environments are, in general, easier to solve as the agent learns to improve the policy without uncertainties in the MDP.
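The two environment types can be contrasted with a pair of step functions. This is a sketch on a made-up one-dimensional state; the action names "up" and "down", the reward rule and the `slip` probability are assumptions for illustration.

```python
import random


def deterministic_step(state, action):
    """The same (state, action) always yields the same (next_state, reward)."""
    next_state = state + (1 if action == "up" else -1)
    reward = float(next_state)  # assumed reward: the value of the new state
    return next_state, reward


def stochastic_step(state, action, rng, slip=0.2):
    """With probability `slip`, the opposite action is executed instead."""
    if rng.random() < slip:
        action = "down" if action == "up" else "up"
    return deterministic_step(state, action)


print(deterministic_step(0, "up"))  # always (1, 1.0)

rng = random.Random(0)
outcomes = {stochastic_step(0, "up", rng) for _ in range(200)}
print(outcomes)  # more than one distinct outcome for the same state and action
```

In the stochastic case the agent has to estimate expected outcomes from repeated experience, which is one reason such environments tend to be harder to solve.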

Agent

The goal of the agent is to solve the MDP by finding the optimal policy, which means finding the sequence of actions that maximises the total received reward. There are various approaches to do so, which are commonly categorised as follows:

  • Value-Based Agent: the agent evaluates the states in the state space and the policy is implicit, i.e. the value function tells the agent how good each action is in a particular state and the agent chooses the best one.
  • Policy-Based Agent: instead of representing the value function inside the agent, the policy is represented explicitly. The agent searches directly for the optimal policy, which in turn enables it to act optimally.
  • Actor-Critic Agent: this agent is both value-based and policy-based. It stores both the policy and an estimate of how much reward it gets from each state.
  • Model-Based Agent: the agent tries to build a model of how the environment works and then plans to obtain the best possible behaviour.
  • Model-Free Agent: the agent does not try to understand the environment, i.e. it does not try to model its dynamics. Instead, it goes directly to the policy and/or value function, using experience to figure out how to behave in order to obtain the most reward.
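To illustrate what an implicit policy means for a value-based, model-free agent, here is a minimal sketch using a table of action values. The update shown is the tabular Q-learning rule, which is one common choice and is not prescribed by the text above; the state names, action names, learning rate `alpha` and discount `gamma` are assumptions.

```python
from collections import defaultdict

# Hypothetical action-value table Q[(state, action)], defaulting to 0.
Q = defaultdict(float)
ACTIONS = ["buy", "sell"]


def greedy_action(state):
    """Implicit policy of a value-based agent: pick the highest-valued action."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])


def q_learning_update(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One model-free update from a single experienced transition."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    target = reward + gamma * best_next
    Q[(state, action)] += alpha * (target - Q[(state, action)])


q_learning_update("s0", "sell", reward=1.0, next_state="s1")
print(greedy_action("s0"))  # "sell", since Q[("s0", "sell")] is now 0.1
```

No transition model is learned here: the agent only uses the transitions it actually experienced, which is what makes the approach model-free.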

Learning approaches

Bellman equation

Value iteration

https://stackoverflow.com/questions/37370015/what-is-the-difference-between-value-iteration-and-policy-iteration

Policy iteration

Action-value function approximation

NOTES

The learning process simply appends a reward stage:

                                              ---------- Reward ----------
                                             |                            |
                                             v                            |
Observation -> State estimation -> Modelling & Prediction -> Action -> Evaluation
     ∧                                                         |
     |                                                         |
      ---------------------------------------------------------

"Reinforcement learning can be naturally integrated with artificial neural networks to obtain high-quality generalization" [1].

The previously described standard reinforcement learning pipeline is simplified by replacing the state estimation and modelling components with a perception component:

Observation -> Perception -> Action
     ∧                          |
     |                          |
      --------------------------

The learning process simply appends a reward stage:

                     ------- Reward -------
                    |                      |
                    v                      |
Observation -> Perception -> Action -> Evaluation
     ∧                          |
     |                          |
      --------------------------

[1] http://rll.berkeley.edu/deeprlcourse/

[2] https://www.wikiwand.com/en/Markov_decision_process

[3] https://towardsdatascience.com/reinforcement-learning-demystified-36c39c11ec14

[4] https://medium.com/@m.alzantot/deep-reinforcement-learning-demysitifed-episode-2-policy-iteration-value-iteration-and-q-978f9e89ddaa