1.6 Reinforcement Learning
This section first describes what Reinforcement Learning is and highlights how it differs from other machine learning paradigms. We briefly reason why this particular technique might be an appropriate choice for the task of optimising order placement. Then, a basic understanding of Markov Decision Processes is provided, after which we explain the interaction between the Reinforcement Learning components, followed by a description of their properties.
Reinforcement Learning is a learning approach within the Machine Learning field that aims to solve problems involving sequential decision making. Whenever a decision made in a system affects future decisions and eventually the outcome, the aim of reinforcement learning is to learn the optimal sequence of decisions.
                        Machine Learning
                                |
        ------------------------|------------------------
        |                       |                       |
   Supervised              Unsupervised           Reinforcement
  (Task driven,           (Data driven,          (Learning to act
   Regression or            Clustering)           in environment)
   Classification)
Typically, such a system is subject to limited supervision: it is known what we want to optimise, but not which actions are required to achieve it. Reinforcement learning learns by maximising rewards while performing a task as a sequence of actions, then evaluates the outcome and updates the strategy accordingly. This process can be regarded as end-to-end learning, where every required component of a system is involved and influences the produced result. Unlike supervised learning techniques, which are oftentimes modelled such that the predicted values do not directly suggest to the model how to change its parameters, such an end-to-end learning environment comes in handy in the context of order execution. [1]
For optimising order placement in limit order books, statistical approaches have long been the preferred choice. While statistics emphasises the inference of a process, machine learning emphasises the prediction of the future with respect to some variable. Although the capability of influencing a non-trivial procedure using predictions might be tempting, supervised machine learning approaches demand complex intermediate steps in order to let the learner interact within the process of buying and selling assets. Reinforcement learning, as mentioned before, provides the capability to model the execution process pipeline, whereby the learner improves upon the outcome of the submitted orders. [1]
A generic end-to-end training pipeline is as follows:
Observation -> State estimation -> Modelling & Prediction -> Action
An agent observes the state of some environment whose details do not have to be known, as the environment is regarded as a black box. The agent then makes estimates about the situation of the observed state and decides which action to take next. The action is then sent to the environment, which determines whether this was a good or bad choice.
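As a rough illustration, the pipeline above can be sketched in Python as a chain of functions. The function names and the trivial environment below are hypothetical placeholders; the actual environment is treated as a black box:

```python
import random

def observe(environment):
    # The environment is a black box; the agent only sees what it exposes.
    return environment["visible_price"]

def estimate_state(observation):
    # Condense the raw observation into an estimate of the situation.
    return {"price_is_high": observation > 100}

def model_and_predict(state_estimate):
    # Decide which action looks most promising given the estimate.
    return "sell" if state_estimate["price_is_high"] else "buy"

def evaluate(environment, action):
    # Only the environment knows whether the action was a good or bad choice.
    environment["visible_price"] += random.randint(-5, 5)
    return "good" if action == "buy" else "bad"

environment = {"visible_price": 98}
observation = observe(environment)
action = model_and_predict(estimate_state(observation))
print(action, evaluate(environment, action))
```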
A reinforcement learning problem is traditionally defined in the context of two main components, the environment and the agent, whose interfaces are:
- S is the set of possible states, s_t ∈ S,
- A(s_t) is the set of actions available in the state at time step t, a_t ∈ A(s_t), whereas
- r_t ∈ R is the reward generated at time step t.
With the interfaces provided above, we can define an interaction process between the agent and the environment, assuming discrete time steps t = 0, 1, 2, ... (a minimal code sketch follows the list below):
- The agent observes a state s_t ∈ S
- and produces an action at time step t: a_t ∈ A(s_t)
- which leads to a reward r_{t+1} ∈ R and the next state s_{t+1}.
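The sketch below illustrates this interaction loop in Python. The states, available actions and reward rule of the small environment are made up purely for illustration:

```python
import random

# S: the set of possible states; A(s): the actions available in state s.
STATES = ["low", "mid", "high"]
ACTIONS = {"low": ["hold", "buy"], "mid": ["hold", "buy", "sell"], "high": ["hold", "sell"]}

def step(state, action):
    """Hypothetical environment dynamics: return reward r_{t+1} and next state s_{t+1}."""
    next_state = random.choice(STATES)
    reward = 1.0 if action in ("buy", "sell") else 0.0
    return reward, next_state

state = random.choice(STATES)                 # s_0
for t in range(5):                            # discrete time steps t = 0, 1, 2, ...
    action = random.choice(ACTIONS[state])    # a_t in A(s_t)
    reward, next_state = step(state, action)  # r_{t+1}, s_{t+1}
    print(t, state, action, reward, next_state)
    state = next_state
```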
During this process, and as the agent aims to maximise its future reward, the agent consults a policy: a map that gives the probabilities of taking action a when in state s. Hence, the policy at time step t, π_t, is a mapping from states to action probabilities learned from the agent's experience; therefore, π_t(s,a) is the probability that a_t = a when s_t = s.
The optimal policy, commonly denoted π*, maximises the expected future reward and defines the agent's behaviour.
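Concretely, a policy can be represented as a table of action probabilities per state. The sketch below, with made-up states, actions and probabilities, samples a_t from π_t(s_t, ·) and also shows a greedy choice, as one might take under a policy that has already been learned:

```python
import random

# pi[s][a] = probability of taking action a when in state s (made-up values).
pi = {
    "spread_wide":   {"place_limit": 0.8, "place_market": 0.2},
    "spread_narrow": {"place_limit": 0.3, "place_market": 0.7},
}

def sample_action(policy, state):
    """Draw a_t according to the probabilities pi_t(s, a)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

def greedy_action(policy, state):
    """Pick the most probable action, as a learned (near-)optimal policy would."""
    return max(policy[state], key=policy[state].get)

print(sample_action(pi, "spread_wide"))
print(greedy_action(pi, "spread_narrow"))
```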
During learning, a reward stage is appended to this pipeline:
     ------------------------------ Reward ---------------------------------
     |                                                                     |
     v                                                                     |
Observation -> State estimation -> Modelling & Prediction -> Action -> Evaluation
     ∧                                                          |
     |                                                          |
     ------------------------------------------------------------
"Reinforcement learning ca be naturally integrated with artificial neural networks to obtain high-quality generalization" [1].
The previously described standard reinforcement pipeline gets simplified by replacing state estimation and modelling components by a perception:
Observation -> Perception -> Action
     ∧                          |
     |                          |
     ----------------------------
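As a sketch, such a perception component could be any parametrised function mapping an observation directly to action preferences, for example a single linear layer followed by a softmax. The feature vector, action names and weights below are arbitrary assumptions for illustration:

```python
import math
import random

ACTIONS = ["buy", "hold", "sell"]

def perception(observation, weights):
    """Map raw observation features directly to action probabilities (softmax)."""
    scores = [sum(w * x for w, x in zip(row, observation)) for row in weights]
    exp_scores = [math.exp(s) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

# Arbitrary observation features and randomly initialised weights.
observation = [0.5, -0.2, 1.0]
weights = [[random.uniform(-1, 1) for _ in observation] for _ in ACTIONS]

probs = perception(observation, weights)
action = ACTIONS[probs.index(max(probs))]
print(probs, action)
```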
The learning process simply appends a reward stage:
     --------------- Reward ----------------
     |                                     |
     v                                     |
Observation -> Perception -> Action -> Evaluation
     ∧                          |
     |                          |
     ----------------------------
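The reward fed back from the evaluation stage is what drives the update of the agent. As one concrete and commonly used instance of such an update (not necessarily the method adopted later in this work), the sketch below applies a tabular Q-learning rule to a toy environment with made-up states, actions and rewards:

```python
import random
from collections import defaultdict

STATES = ["s0", "s1"]
ACTIONS = ["a0", "a1"]
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.2

Q = defaultdict(float)  # Q[(state, action)] -> estimated future reward

def step(state, action):
    """Toy black-box environment: returns reward and next state."""
    reward = 1.0 if action == "a1" else 0.0
    return reward, random.choice(STATES)

state = random.choice(STATES)
for t in range(1000):
    # Epsilon-greedy action selection: mostly exploit, sometimes explore.
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    reward, next_state = step(state, action)
    # Q-learning update: move the estimate towards reward + discounted best future value.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
    state = next_state

print({k: round(v, 2) for k, v in Q.items()})
```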