1.6 Reinforcement Learning

This section first aims to describe what Reinforcement Learning is and to highlight its differences from other machine learning paradigms. We briefly reason why this particular technique might be an appropriate choice for the task of optimising order placement. Then, a basic understanding of Markov Decision Processes is provided, after which we explain the interaction between the Reinforcement Learning components, followed by a description of their properties.

Introduction

Reinforcement Learning is a learning approach within the Machine Learning field which aims to solve problems that involve sequential decision making. Whenever a decision made in a system affects future decisions and eventually the outcome, reinforcement learning aims to learn the optimal sequence of decisions.

                      Machine Learning
                             |
        ---------------------|---------------------
       |                     |                     |
   Supervised           Unsupervised           Reinforcement
   (Task driven,        (Data driven,          (Learning to act
   Regression or        Clustering)            in environment)
   Classification)

In reinforcement learning there is no supervision; instead, an agent learns by maximising rewards. The feedback retrieved while carrying out a task with a sequence of actions might be delayed over several time steps, and hence the agent might spend some time exploring until it finally reaches the goal and can update its strategy accordingly.

This process can be regarded as end-to-end learning and differs from other machine learning paradigms. In supervised learning, for example, the algorithm is presented with a specific situation together with the correct action to take, and from there it tries to generalise a model. In addition, in reinforcement learning problems the data is not independent and identically distributed (i.i.d.). While exploring, the agent might in fact miss important parts of the environment that are needed to learn the optimal behaviour. Hence, time is crucial: the agent must explore as many parts of the environment as possible in order to take the appropriate actions. [3]

For optimising order placement in limit order books, statistical approaches have long been the preferred choice. While statistics emphasises inference about a process, machine learning emphasises prediction of the future with respect to some variable. Although the ability to influence a non-trivial procedure by means of predictions might be tempting, supervised machine learning approaches demand complex intermediate steps in order to let the learner interact with the process of buying and selling assets. Reinforcement learning, as mentioned before, provides the capability to model the execution pipeline directly, whereby the learner improves upon the outcome of the submitted orders. [1]

A generic end-to-end training pipeline is as follows:

Observation -> State estimation -> Modelling & Prediction -> Action

Since we are working with financial systems, let us assume we want to buy and sell stocks on a stock exchange. In reinforcement learning terms, the trader is represented as an agent and the exchange is the environment. The details of the environment do not have to be known, as it is rather regarded as a black box. The agent's purpose is to observe the state of the environment, say for example the current price of a stock. The agent then makes estimates about the situation of the observed state and decides which action to take next – buy or sell. The action is then sent to the environment, which determines whether this was a good or bad choice, for example whether we made a profit or a loss.

Markov Decision Process (MDP)

A process such as the one sketched above can be formalised as a Markov Decision Process. An MDP is a 5-tuple <S, A, P, R, γ> where:

  1. S is the finite set of possible states, with s_t ∈ S the state at time step t,
  2. A(s_t) is the set of actions available in state s_t, with a_t ∈ A(s_t) the action taken at time step t,
  3. p(s_{t+1} | s_t, a_t) is the state transition model that describes how the environment state changes, depending on the current state s_t and the action a_t,
  4. p(r_{t+1} | s_t, a_t) is the reward model that describes the immediate reward the agent receives from the environment after performing action a_t in state s_t,
  5. γ ∈ [0, 1] is the discount factor, which determines the importance of future rewards.
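
Such a tuple can also be written down directly in code. The following is a minimal sketch of a tabular MDP; the states, actions and numbers are purely illustrative and not taken from the order placement setting:

```python
# Minimal tabular MDP sketch (illustrative values only).
states = ["s0", "s1"]
actions = ["buy", "sell"]

# State transition model: p[(s, a)] maps to a distribution over next states.
transition_model = {
    ("s0", "buy"):  {"s0": 0.2, "s1": 0.8},
    ("s0", "sell"): {"s0": 0.9, "s1": 0.1},
    ("s1", "buy"):  {"s0": 0.0, "s1": 1.0},
    ("s1", "sell"): {"s0": 0.7, "s1": 0.3},
}

# Reward model: expected immediate reward for taking action a in state s.
reward_model = {
    ("s0", "buy"): 0.0, ("s0", "sell"): 1.0,
    ("s1", "buy"): 2.0, ("s1", "sell"): 0.0,
}

gamma = 0.95  # discount factor
```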

Interaction

(Figure: RL overview – the agent–environment interaction loop.)

A reinforcement learning problem is commonly defined with the help of two main components: environment and agent. With the interfaces provided above (MDP), we can define an interaction process between the agent and the environment, assuming discrete time steps t = 0, 1, 2, ...

  1. The agent observes a state s_t ∈ S
  2. and produces an action at time step t: a_t ∈ A(s_t),
  3. which leads to a reward r_{t+1} ∈ R and the next state s_{t+1}.

During this process, as the agent aims to maximise its future reward, it consults a policy, which dictates which action to take given a particular state.
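
This interaction can be sketched as a simple loop. The snippet below is a minimal illustration; the environment object and its reset()/step() interface returning the next state, reward and a termination flag are assumptions made for this sketch, not a prescribed API:

```python
# Minimal agent-environment interaction loop (hypothetical environment interface).
def run_episode(env, policy, max_steps=100):
    state = env.reset()                               # observe the initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                        # a_t chosen according to the policy
        next_state, reward, done = env.step(action)   # environment yields r_{t+1}, s_{t+1}
        total_reward += reward
        state = next_state
        if done:                                      # episode ended (terminal state reached)
            break
    return total_reward
```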

Policy

A policy is a function that can be either deterministic or stochastic. A stochastic policy is a distribution over actions given states, π(a | s), and a deterministic policy is a mapping π: S → A, where S is the set of possible states and A is the set of possible actions.

The stochastic policy at time step t, π_t, is a mapping from states to action probabilities as a result of the agent's experience; therefore, π_t(s, a) is the probability that a_t = a when s_t = s.
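
To make the distinction concrete, here is a minimal sketch of both policy types over the illustrative states and actions introduced above (the probabilities are made up):

```python
import random

# Deterministic policy: a direct mapping from state to action.
deterministic_policy = {"s0": "buy", "s1": "sell"}

# Stochastic policy: pi(s, a) is the probability of taking action a in state s.
stochastic_policy = {
    "s0": {"buy": 0.7, "sell": 0.3},
    "s1": {"buy": 0.1, "sell": 0.9},
}

def sample_action(policy, state):
    """Draw an action from a stochastic policy pi(. | state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action(stochastic_policy, "s0")
```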

Reward

The goal is for the agent to learn how to select actions such that its future reward is maximised when submitting them to the environment. We rely on the standard assumption that future rewards are discounted by a factor of γ per time step. Hence we define the future discounted return at time t as

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0}^{T-t-1} γ^k r_{t+k+1},

where T is the length of the episode (which can be infinity if there is no maximum length for the episode).

The discount factor serves two purposes: it prevents the total reward from going to infinity (provided 0 ≤ γ < 1), and it allows controlling the agent's preference between immediate rewards and rewards potentially received in the future. [4]
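
As a small illustration, the discounted return of a finite reward sequence can be computed directly (the reward values below are made up):

```python
# Compute the discounted return R_t = sum_k gamma^k * r_{t+k+1} for a finite episode.
def discounted_return(rewards, gamma):
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

# Rewards observed after time step t (illustrative values).
print(discounted_return([1.0, 0.0, 2.0], gamma=0.95))  # 1.0 + 0.95 * 0.0 + 0.95**2 * 2.0
```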

Value Functions

The value of a state s indicates how good or bad a state is for the agent to be in, measured by the expected total reward for an agent starting from this state. Hence we introduce the value function, which depends on the policy π from which the agent chooses its actions:

V^π(s) = E[ R_t | s_t = s ]

Among all value functions there is an optimal value function, which has higher values than any other value function for all states:

V^*(s) = max_π V^π(s),   for all s ∈ S

Furthermore, the optimal policy π^* can be derived as

π^* = argmax_π V^π(s),   for all s ∈ S

In addition to the value of a state with respect to the expected total reward to be achieved, we might also be interested in a value which determines how good it is to be in a certain state s and take a certain action a. To get there we first introduce the Q-function, which takes a state-action pair and returns a real value, Q: S × A → ℝ:

Q^π(s, a) = E[ R_t | s_t = s, a_t = a ]

Finally, we define the optimal action-value function (or optimal Q-function) as the maximum expected return achievable after seeing some state s and then taking some action a. That is,

Q^*(s, a) = max_π E[ R_t | s_t = s, a_t = a, π ],

with the policy π mapping the states to either actions or distributions over actions.

The relationship between the optimal value function and the optimal action-value function is, as already hinted by their names, easily obtained as

V^*(s) = max_a Q^*(s, a),

and thus the optimal policy for state s can be derived by choosing the action a that gives the maximum value:

π^*(s) = argmax_a Q^*(s, a)
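
As a small illustration of these last two relations, given a tabular Q-function (a made-up dictionary over the toy states and actions used earlier), the state value and the greedy policy can be read off directly:

```python
# Illustrative optimal action values Q*[(s, a)].
Q = {
    ("s0", "buy"): 1.2, ("s0", "sell"): 0.4,
    ("s1", "buy"): 0.1, ("s1", "sell"): 2.3,
}

def state_value(Q, state, actions):
    """V*(s) = max_a Q*(s, a)."""
    return max(Q[(state, a)] for a in actions)

def greedy_action(Q, state, actions):
    """pi*(s) = argmax_a Q*(s, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

print(state_value(Q, "s0", ["buy", "sell"]))    # 1.2
print(greedy_action(Q, "s0", ["buy", "sell"]))  # 'buy'
```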

Environment

There are two types of environments:

  • Deterministic environment: both the state transition model and the reward model are deterministic functions. In this setup, if the agent in a given state s_t repeats a given action a_t, the result will always be the same next state s_{t+1} and reward r_{t+1}.

  • Stochastic environment: there is uncertainty about the outcome of taking an action a_t in state s_t, as the next state s_{t+1} and the received reward r_{t+1} might not be the same each time.

Deterministic environments are, in general, easier to solve as the agent learns to improve the policy without uncertainties in the MDP.

Agent

The goal of the agent is to solve the MDP by finding the optimal policy, that is, the sequence of actions which maximises the total received reward. However, there are various approaches to do so, which are commonly categorised [4] as follows:

  • Value Based Agent: the agent starts off with a random value function and then finds a new (improved) value function in an iterative process, until reaching the optimal value function. As stated before (see value function) one can easily derive the optimal policy from the optimal value function.

  • Policy Based Agent: the agent starts off with a random policy, then finds the value function of that policy and derives a new (improved) policy based on the previous value function, and so on. Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). As stated before (see value function), given a policy, one can derive the value function.

  • Actor-Critic Agent: the agent is a combination of a value-based and a policy-based agent. Both the policy and the value of each state are stored.

  • Model-Based Agent: the agent attempts to approximate the environment using a model. It then suggests the best possible behaviour.

  • Model-Free Agent: the agent builds a policy purely from experience, from which it derives how to behave optimally in order to collect the most possible reward.

Value iteration

This work makes use of the value-based agent approach, which relies on the action-value function obeying an important identity known as the Bellman equation.

The intuition is the following: if the optimal action-value Q^*(s', a') of the state s' at the next time step t+1 was known for all possible actions a', then the optimal strategy is to select the action a' which maximises the expected value of r + γ Q^*(s', a'),

Q^*(s, a) = E_{s'}[ r + γ max_{a'} Q^*(s', a') | s, a ]

The idea behind the many reinforcement learning algorithms which follow this iterative value approach is to estimate the action-value function by using the Bellman equation as an iterative update,

Q_{i+1}(s, a) = E[ r + γ max_{a'} Q_i(s', a') | s, a ].

Value iteration algorithms converge to the optimal action-value function, Q_i → Q^* as i → ∞ [5].
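
The following is a minimal sketch of this iterative update applied to the tabular toy MDP defined earlier (transition_model, reward_model and gamma are the illustrative structures from the MDP sketch above):

```python
# Tabular value iteration on the action-value function (Bellman update).
def q_value_iteration(states, actions, transition_model, reward_model, gamma,
                      n_iterations=100):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(n_iterations):
        Q_new = {}
        for s in states:
            for a in actions:
                # Expectation over next states of r + gamma * max_a' Q_i(s', a').
                expected_future = sum(
                    p * max(Q[(s_next, a_next)] for a_next in actions)
                    for s_next, p in transition_model[(s, a)].items()
                )
                Q_new[(s, a)] = reward_model[(s, a)] + gamma * expected_future
        Q = Q_new
    return Q

Q_star = q_value_iteration(states, actions, transition_model, reward_model, gamma)
```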

Q-Learning

One of the reinforcement learning algorithms used in this work is known as Q-learning. The name derives from the application of the previously presented Q-function, more specifically its use within the Bellman equation, which is iteratively updated.

The general idea is to learn an estimate of the optimal action-value function directly from experienced transitions (s_t, a_t, r_{t+1}, s_{t+1}), using the Bellman equation as an update rule with a learning rate α and without requiring a model of the environment:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
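
A minimal tabular Q-learning sketch under these assumptions follows; the environment interface, the ε-greedy exploration scheme and the hyperparameter values are illustrative choices, not specifics of this work:

```python
import random

def q_learning(env, states, actions, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (illustrative sketch)."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection: explore with probability epsilon.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Move Q(s, a) towards the target r + gamma * max_a' Q(s', a').
            target = reward + gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```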

Deep Reinforcement Learning

As stated in [1], "Reinforcement learning can be naturally integrated with artificial neural networks to obtain high-quality generalization". The term generalization refers to the action-value function and the fact that this value is estimated for each state separately, which becomes impractical for the large state spaces that oftentimes occur in real-world scenarios. A function approximator is therefore a common choice to estimate the action-value function instead. Typically a linear function approximator is used, but in deep reinforcement learning one uses a non-linear function approximator, particularly a neural network with weights θ. The action-value function therefore becomes

Q(s, a; θ) ≈ Q^*(s, a).
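
A minimal sketch of such a non-linear approximator, here written as a small fully connected network in PyTorch (the library choice, layer sizes and state dimension are assumptions made for illustration):

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a; theta) by outputting one value per action for a given state."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # shape: (batch, n_actions)

# Greedy action for a single (illustrative) state vector of dimension 4.
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.zeros(1, 4)
action = q_net(state).argmax(dim=1).item()
```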

In terms of the previously described reinforcement learning pipeline, the use of a function approximator simplifies the process by replacing the state estimation and modelling components with a perception component:

Observation -> Perception -> Action

Deep Q-Network


[1] http://rll.berkeley.edu/deeprlcourse/

[2] https://www.wikiwand.com/en/Markov_decision_process

[3] https://towardsdatascience.com/reinforcement-learning-demystified-36c39c11ec14

[4] https://medium.com/@m.alzantot/deep-reinforcement-learning-demysitifed-episode-2-policy-iteration-value-iteration-and-q-978f9e89ddaa

[5] Richard Sutton and Andrew Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.