3.3 RL Environment (Market Maker)

Marc Juchli edited this page Apr 27, 2018 · 2 revisions

The environment E, labelled ctc-marketmaker-v0, is a child class of ctc-executioner-v0, which in turn is a child of gym.Env. This environment is a market maker simulator that extends the capabilities of the execution simulator by buying and selling simultaneously. This section provides an overview of the environment by highlighting its differences from the executioner environment.

To make use of this environment:

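import gym
import gym_ctc_marketmaker  # importing the package makes "ctc-marketmaker-v0" available to gym

env = gym.make("ctc-marketmaker-v0")
env.setOrderbook(orderbook)  # orderbook: an order book loaded beforehand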

Overview

With every step taken by the agent, a chain of tasks will be processed:

  1. The agent selects an action a and passes it to the environment.
  2. An internal state s (defined as ActionState) is constructed. It is derived either from the previous state or, if a new epoch has started, from the order book.
  3. Two orders (Order) are then created, one for each side (buy and sell), according to the remaining inventory and time horizon the agent has left for both ongoing executions. The specified actions are then taken.
  4. Both orders are sent to the match engine, which attempts to execute them in their current, independent order book states (from which the agent's state was derived). Matching continues until either the runtime of the current step is consumed or the total inventory is filled.
  5. For each of the submitted orders, matching results in no execution, a partial execution, or a full execution. Whatever the outcome, a reward can be derived, alongside the next state (again derived from the order book) and whether the epoch is done.
  6. The values of both ongoing executions are then stored in memory and returned to the agent so that it can take another step.
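
The chain of tasks above can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the project's implementation: `Execution`, `match` and `step` are hypothetical names, the match engine is reduced to a list of `(price, size)` trades, and the epoch is treated as done once both inventories are filled (the runtime limit is left out).

```python
from dataclasses import dataclass


@dataclass
class Execution:
    """Bookkeeping for one ongoing execution (hypothetical stand-in for the real state)."""
    inventory: float            # quantity still to be filled
    filled_qty: float = 0.0     # quantity executed so far
    filled_value: float = 0.0   # sum of price * quantity over all fills

    def vwap(self):
        # Volume weighted average price of the fills, 0 if nothing was filled.
        return self.filled_value / self.filled_qty if self.filled_qty else 0.0


def match(side, limit_price, qty, trades):
    """Toy match engine: fill against trades that cross the limit price."""
    fills = []
    for price, size in trades:
        if qty <= 0:
            break
        crosses = price <= limit_price if side == "buy" else price >= limit_price
        if crosses:
            take = min(qty, size)
            fills.append((price, take))
            qty -= take
    return fills


def step(buy, sell, buy_limit, sell_limit, trades):
    """One step: place a buy and a sell order, match both independently, derive reward and done."""
    for side, execution, limit in (("buy", buy, buy_limit), ("sell", sell, sell_limit)):
        for price, take in match(side, limit, execution.inventory, trades):
            execution.filled_qty += take
            execution.filled_value += price * take
            execution.inventory -= take
    if buy.filled_qty == 0 or sell.filled_qty == 0:
        reward = 0.0  # nothing bought or nothing sold yet
    else:
        reward = sell.vwap() - buy.vwap()
    done = buy.inventory <= 0 and sell.inventory <= 0
    return reward, done
```

Calling `step` repeatedly with the same two `Execution` objects mirrors how the values of both ongoing executions are carried from one step to the next.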

For more details, see 3.2.

State

The observation state depends on the chosen feature configuration (see Feature Engineering), resulting in some state s ∈ R^d.

Action

The action space is discrete, with a size equal to the number of limit levels squared (L^2). Each action a ∈ Z represents a pair of independent limit levels, one for the buy order and one for the sell order, segmented in $0.10 steps. The action space is configurable; the default implementation has size 101^2 = 10201, derived from limit levels ranging from -50 to +50 (101 levels in total), squared. Negative limit levels place the order deep in the book, whereas positive limit levels place it on the opposing side of the book. Thus, at each time step t the agent selects an action a_t from the set of legal actions A = {l_min, ..., l_max}, where l_min is the most negative limit level for both sides (l_min_buy, l_min_sell) and l_max is the most positive limit level for both sides (l_max_buy, l_max_sell).
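
To make the size of the action space concrete, the following sketch maps a flat action index to a pair of limit levels and back. The function names are hypothetical, and the ordering (buy level varies slowest, sell level fastest) is an assumption made for illustration; the environment may enumerate the pairs differently.

```python
# 101 limit levels from -50 to +50, as in the default configuration.
LEVELS = list(range(-50, 51))


def decode_action(a, levels=LEVELS):
    """Map a flat action index to a (buy_level, sell_level) pair.

    Assumes row-major ordering: the buy level is the quotient,
    the sell level the remainder.
    """
    n = len(levels)
    if not 0 <= a < n * n:
        raise ValueError(f"action {a} outside [0, {n * n})")
    buy_idx, sell_idx = divmod(a, n)
    return levels[buy_idx], levels[sell_idx]


def encode_action(buy_level, sell_level, levels=LEVELS):
    """Inverse of decode_action under the same ordering assumption."""
    return levels.index(buy_level) * len(levels) + levels.index(sell_level)
```

Under this ordering, index 0 corresponds to the deepest listing on both sides and index 10200 to the most aggressive listing on both sides.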

Reward

The reward is defined as the difference between the volume weighted average price (VWAP) received for the sell order and the VWAP paid for the buy order at some time step t. That is,

, whereas p_T is the best market price at execution time step t=T.

If the VWAP of either order is 0, meaning that nothing has been bought or sold over the course of the running execution process, a reward of 0 is given.
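
A minimal sketch of the reward as described in prose, assuming each order's executions are available as `(price, quantity)` pairs. The helper names `vwap` and `reward` are hypothetical, and the best market price p_T is not modelled here.

```python
def vwap(fills):
    """Volume weighted average price of (price, quantity) fills; 0 if nothing was filled."""
    total_qty = sum(qty for _, qty in fills)
    if total_qty == 0:
        return 0.0
    return sum(price * qty for price, qty in fills) / total_qty


def reward(buy_fills, sell_fills):
    """Sell VWAP minus buy VWAP, or 0 if either side has not traded yet."""
    buy_vwap, sell_vwap = vwap(buy_fills), vwap(sell_fills)
    if buy_vwap == 0 or sell_vwap == 0:
        return 0.0
    return sell_vwap - buy_vwap
```

For example, buying 2 units at 99 and selling 1 unit at 100 plus 3 units at 102 gives a sell VWAP of 101.5 and a reward of 2.5, while an empty fill list on either side gives a reward of 0.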