1.6 Reinforcement Learning
This section first describes what Reinforcement Learning is and highlights its differences from other machine learning paradigms. We briefly reason why this particular technique might be an appropriate choice for the task of optimising order placement. Then, a basic understanding of Markov Decision Processes is provided, after which we explain the interaction between the Reinforcement Learning components, followed by a description of their properties.
Reinforcement Learning is a specific learning approach in the Machine Learning field which aims to solve problems that involve sequential decision making. When a decision made in a system affects future decisions and eventually the outcome, the aim of reinforcement learning is to learn the optimal sequence of decisions.
                     Machine Learning
                             |
        ---------------------|---------------------
        |                    |                     |
   Supervised           Unsupervised         Reinforcement
  (Task driven,         (Data driven,       (Learning to act
  Regression or          Clustering)         in environment)
  Classification)
Typically, such a system is subject to limited supervision: it is known what we want to optimise, but not which actions are required to do so. Reinforcement learning learns by maximising rewards while carrying out a task with a sequence of actions, then evaluates the outcome and updates the strategy accordingly. This process can be regarded as end-to-end learning, where every required component of a system is involved and influences the produced result. Unlike supervised learning techniques, which are oftentimes modelled such that the predicted values do not directly give a suggestion to the model on how to change its parameters, such an end-to-end learning environment comes in handy in the context of order execution. [1]
For optimising order placement in limit order books, statistical approaches have long been the preferred choice. While statistics emphasises inference about a process, machine learning emphasises prediction of the future with respect to some variable. Although the capability of influencing a non-trivial procedure using predictions might be tempting, supervised machine learning approaches require complex intermediate steps in order to let the learner interact within the process of buying and selling assets. Reinforcement learning, as mentioned before, provides the capabilities to model the execution process pipeline directly, whereby the learner improves upon the outcome of the submitted orders. [1]
A generic end-to-end training pipeline is as follows:
Observation -> State estimation -> Modelling & Prediction -> Action
Since we are working with financial systems, let us assume we want to buy and sell stocks on a stock exchange. In reinforcement learning terms, the trader is represented as an agent and the exchange is the environment. The details of the environment do not have to be known, as it is regarded as a black box. The agent's purpose is to observe the state of the environment: say, for example, the current price of a stock. The agent then makes estimates about the situation of the observed state and decides which action to take next: buy or sell. The action is then sent to the environment, which determines whether this was a good or bad choice, for example whether we made a profit or a loss.
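To make these roles concrete, here is a minimal sketch of the agent-environment split in Python. The environment, the random price movement, the decision rule, and the profit-based reward are illustrative assumptions and not part of the original description:

```python
import random

class ExchangeEnvironment:
    """Toy black-box environment: the agent only observes the current price."""

    def __init__(self, start_price=100.0):
        self.price = start_price

    def observe(self):
        # The observable state: here, simply the current stock price.
        return self.price

    def step(self, action):
        # The environment evolves on its own; to the agent it is a black box.
        old_price = self.price
        self.price += random.uniform(-1.0, 1.0)
        # Reward: profit if we bought before a rise or sold before a fall.
        return self.price - old_price if action == "buy" else old_price - self.price

class TraderAgent:
    """Toy agent that decides whether to buy or sell from the observed state."""

    def act(self, state):
        # Placeholder decision rule; a learning agent would improve this over time.
        return "buy" if state < 100.0 else "sell"

env = ExchangeEnvironment()
agent = TraderAgent()
state = env.observe()        # observe the state of the environment
action = agent.act(state)    # decide which action to take next
reward = env.step(action)    # the environment judges the choice (profit or loss)
print(state, action, reward)
```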
A process such as the one sketched above can be formalised as a Markov Decision Process (MDP). An MDP is a tuple of 5 elements:
- S is the finite set of possible states, with s_t ∈ S the state at some time step t.
- A(s_t) is the set of actions available in the state at time step t, whereas a_t ∈ A(s_t) is the action taken at that step.
- p(s_{t+1} | s_t, a_t) is the state transition model that describes how the environment state changes, depending on the action a_t and the current state s_t.
- p(r_{t+1} | s_t, a_t) is the reward model that describes the immediate reward value that the agent receives from the environment after performing an action in the current state s_t.
- γ is the discount factor, which determines the importance of future rewards.
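As a small, purely illustrative sketch (the states, actions, probabilities and rewards below are invented and not part of the text), the five elements of such an MDP can be written out directly:

```python
# Illustrative MDP spelled out as its five elements (S, A, p, r, gamma).

S = ["low_price", "high_price"]                      # finite set of states
A = {"low_price": ["buy", "wait"],                   # actions available per state
     "high_price": ["sell", "wait"]}

# State transition model p(s_{t+1} | s_t, a_t)
p_transition = {
    ("low_price", "buy"):   {"low_price": 0.3, "high_price": 0.7},
    ("low_price", "wait"):  {"low_price": 0.6, "high_price": 0.4},
    ("high_price", "sell"): {"low_price": 0.5, "high_price": 0.5},
    ("high_price", "wait"): {"low_price": 0.4, "high_price": 0.6},
}

# Reward model p(r_{t+1} | s_t, a_t), reduced here to an expected immediate reward
r_reward = {
    ("low_price", "buy"):   1.0,
    ("low_price", "wait"):  0.0,
    ("high_price", "sell"): 1.5,
    ("high_price", "wait"): 0.0,
}

gamma = 0.95  # discount factor: importance of future rewards
```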
A reinforcement learning problem is traditionally defined within the context of two main components: Environment and Agent.
With the interfaces provided above, we can define an interaction process between an agent and environment. Assuming discrete time steps: t=0, 1, 2, ...
- The agent observes a state s_t ∈ S
- and produces an action at time step t: a_t ∈ A(s_t),
- which leads to a reward r_{t+1} ∈ R and the next state s_{t+1}.
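A minimal, self-contained sketch of this discrete-time loop might look as follows; the toy transition and reward rules inside env_step are assumptions made purely for illustration:

```python
import random

states = ["low_price", "high_price"]
actions = {"low_price": ["buy", "wait"], "high_price": ["sell", "wait"]}

def env_step(s_t, a_t):
    """Black-box environment: returns (r_{t+1}, s_{t+1}) for a state-action pair."""
    s_next = random.choice(states)                    # toy transition model
    r_next = 1.0 if a_t in ("buy", "sell") else 0.0   # toy reward model
    return r_next, s_next

s_t = "low_price"
for t in range(5):
    a_t = random.choice(actions[s_t])    # the agent produces an action a_t ∈ A(s_t)
    r_next, s_next = env_step(s_t, a_t)  # the environment returns r_{t+1} and s_{t+1}
    print(f"t={t}: s_t={s_t}, a_t={a_t}, reward={r_next}, next state={s_next}")
    s_t = s_next
```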
During this process, and as the agent aims to maximise its future reward, the agent consults a policy: a map that gives the probabilities of taking action a when in state s.
Hence, the policy at time step t, π_t, is a mapping from states to action probabilities as a result of the agent's experience, and therefore π_t(s, a) is the probability that a_t = a when s_t = s.
The optimal policy π* defines the agent's behaviour.
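For illustration, such a stochastic policy can be represented as a table of action probabilities per state from which the agent samples its next action; the states and probabilities below are assumed, not taken from the text:

```python
import random

# pi[s][a] approximates π_t(s, a): the probability of taking action a in state s.
pi = {
    "low_price":  {"buy": 0.8, "wait": 0.2},
    "high_price": {"sell": 0.6, "wait": 0.4},
}

def sample_action(policy, s_t):
    """Draw a_t so that P(a_t = a | s_t = s) equals policy[s_t][a]."""
    acts, probs = zip(*policy[s_t].items())
    return random.choices(acts, weights=probs, k=1)[0]

print(sample_action(pi, "low_price"))   # returns "buy" with probability 0.8
```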
The learning process appends a reward stage to the end-to-end pipeline described above:
                                             ------------ Reward ------------
                                             |                              |
                                             v                              |
Observation -> State estimation -> Modelling & Prediction -> Action -> Evaluation
    ∧                                                          |
    |                                                          |
    ------------------------------------------------------------
"Reinforcement learning ca be naturally integrated with artificial neural networks to obtain high-quality generalization" [1].
The previously described standard reinforcement learning pipeline is simplified by replacing the state estimation and modelling components with a perception component:
Observation -> Perception -> Action
    ∧                          |
    |                          |
    ----------------------------
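As a hedged sketch of what such a perception component could look like, the observation can be mapped directly to action probabilities by a small parameterised function; a single linear layer with a softmax, with made-up feature sizes and action labels, is used here purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative perception: one linear layer mapping an observation vector
# (e.g. a few recent prices) straight to action probabilities via a softmax.
n_features, n_actions = 4, 2                       # assumed: 4 features, buy/sell
W = rng.normal(size=(n_actions, n_features)) * 0.1
b = np.zeros(n_actions)

def perception(observation):
    logits = W @ observation + b
    exp = np.exp(logits - logits.max())            # softmax, numerically stable
    return exp / exp.sum()

observation = np.array([101.2, 100.8, 100.5, 100.9])   # toy price observations
action_probs = perception(observation)
action = int(np.argmax(action_probs))                  # 0 = buy, 1 = sell (assumed)
print(action_probs, action)
```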
Again, the learning process simply appends a reward stage:
                   -------- Reward ---------
                   |                       |
                   v                       |
Observation -> Perception -> Action -> Evaluation
    ∧                          |
    |                          |
    ----------------------------