Python 3.6.5
I would recommend creating a virtual environment to avoid dependency issues.
You can create a virtual environment using Virtualenv if it is not already installed in your current Python interpreter. The current dependencies are listed in `requirements-cpu.txt` (or the GPU equivalent) and can be installed with the following commands.
```
pip3 install virtualenv
python3 -m virtualenv env
source env/bin/activate
pip install -r requirements-cpu.txt
```
The equivalent requirements for GPU support are inside `requirements-gpu.txt`.
We are currently working on optimizing the distribution of funds between two assets. You run `python main.py [source type]`, where the source types are as follows (an example invocation is shown after the list):
- `markov` - Markov memory 1 and a fixed asset with return rate 0
- `markov2` - Markov memory 2 and a fixed asset with return rate 0
- `iid` - IID uniform random variable and a fixed asset with return rate 0
- `mix` - Markov memory 1 and an IID uniform random variable
- `real` - Real data
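For example, to train and evaluate against the Markov memory-1 source:

```
python main.py markov
```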
This will populate the contents of a Q table and display the result of following the policy obtained over 100 intervals of testing data. An overview of the code architecture is found below.
- `main.py` - This is where you can specify episodes and initial investment
- `enviorment.py` - Here you can manipulate parameters of the trading environment such as the action space, observation space, reward function and done flag
- `Q_table.py` - This class defines the learning rate, gamma and epsilon parameters, and contains the action-selection and learning methods
- `utils.py` - A series of methods used to import, generate and manipulate data for training
- `/Matlab/` - A collection of Matlab scripts that have been used to determine quantizers, empirical distributions and policies
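For orientation, the learning logic is standard tabular Q-learning. Below is a minimal sketch of an epsilon-greedy agent of the kind `Q_table.py` encapsulates; the class and method names here are illustrative, not the repo's actual API.

```python
import numpy as np

class QTableSketch:
    """Illustrative tabular Q-learning agent (not the repo's actual Q_table.py API)."""

    def __init__(self, n_obs, n_actions, lr=0.1, gamma=0.95, epsilon=0.1):
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        # Rows index observations (e.g. quantized previous returns),
        # columns index allocation levels (0%, 10%, ..., 100%).
        self.q = np.zeros((n_obs, n_actions))

    def choose_action(self, obs):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.q.shape[1])
        return int(np.argmax(self.q[obs]))

    def learn(self, obs, action, reward, next_obs):
        # One-step Q-learning update toward the bootstrapped target.
        target = reward + self.gamma * np.max(self.q[next_obs])
        self.q[obs, action] += self.lr * (target - self.q[obs, action])
```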
Below are some observations thus far. Each example is derived with an initial investment of $100 for 10 episodes.
When the input source is an IID random variable uniformly distributed on [-1, 1] in steps of 0.2, the Q table is populated as follows. Note that there is no observation space, as the return rates are independent.
Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
None | 0.013 | -0.016 | -0.11 | -0.038 | -0.009 | -0.038 | 0.004 | 0.001 | -0.038 | -0.087 | -0.123 |
This indicates that the best option is to invest 0% of capital into the IID stock at any given time, consistent with the source having zero expected return.
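In each of these experiments the reported policy is simply the greedy read-out of the table: the column with the largest Q value in each row. A minimal sketch, assuming the table is stored as a NumPy array (an assumption for illustration, not necessarily how `Q_table.py` stores it):

```python
import numpy as np

# Row from the IID [-1, 1] table above; a single row because there is only one observation.
q = np.array([[0.013, -0.016, -0.11, -0.038, -0.009, -0.038, 0.004, 0.001, -0.038, -0.087, -0.123]])
allocations = np.linspace(0.0, 1.0, 11)   # 0%, 10%, ..., 100% of capital
policy = allocations[q.argmax(axis=1)]    # greedy action per observation
print(policy)                             # [0.] -> keep everything in the fixed asset
```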
When the input source is an IID random variable uniformly distributed on [0, 0.3], the Q table is populated as follows.
Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
None | 1923.39 | 2243.24 | 2064.73 | 2121.45 | 2028.12 | 2122.95 | 2088.61 | 2155.73 | 2096.5 | 2124.11 | 4931.78 |
This indicates that the best option is to invest all capital into the IID stock at any given time, which is expected since the source's mean per-step return is 0.15.
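To make the comparison concrete, here is a small simulation sketch of how capital evolves when a fixed fraction tracks an IID uniform return source and the rest sits in the zero-return asset. The helper name and defaults are mine for illustration; the repo's `utils.py` may generate its training data differently.

```python
import numpy as np

def simulate_fixed_fraction(fraction, low, high, steps=100, capital=100.0, seed=0):
    """Evolve capital when `fraction` of it is invested in an IID uniform(low, high)
    return source each step and the remainder earns a return rate of 0."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        r = rng.uniform(low, high)       # risky asset's return this step
        capital *= 1.0 + fraction * r    # unallocated capital is unchanged
    return capital

# Uniform on [-1, 1]: zero mean return, so investing mostly adds downside risk.
print(simulate_fixed_fraction(1.0, -1.0, 1.0))
# Uniform on [0, 0.3]: every draw is non-negative, so full allocation compounds fastest.
print(simulate_fixed_fraction(1.0, 0.0, 0.3))
```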
Here the input source is Markovian with the following transition probabilities.
If the possible return rates for the 3 states are -0.2, 0.0, 0.2 the Q table is populated as follows.
Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Prev Value: 0 | 218.074 | 111.159 | 127.127 | 105.276 | 112.809 | 102.041 | 98.145 | 97.792 | 113.613 | 95.798 | 106.162 |
Prev Value: 0.2 | 181.137 | 191.67 | 261.255 | 187.992 | 172.182 | 179.056 | 223.266 | 186.9 | 182.004 | 217.171 | 629.632 |
Prev Value: -0.2 | 244.331 | 58.172 | 62.42 | 48.51 | 56.312 | 55.855 | 50.747 | 63.276 | 77.343 | 52.846 | 55.282 |
This yields the policy that, when the previous value is:
- 0 -> Do not invest
- -0.2 -> Do not invest
- 0.2 -> Invest 100% of capital into the stock
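A rough sketch of how such a 3-state Markov return source can be sampled to generate training data. The transition matrix below is a placeholder chosen only for illustration; the actual probabilities are those referenced above, not these values.

```python
import numpy as np

RETURNS = np.array([-0.2, 0.0, 0.2])    # return rate attached to each state
P = np.array([                          # placeholder row-stochastic matrix, NOT the
    [0.4, 0.4, 0.2],                    # probabilities used in the experiment above
    [0.3, 0.4, 0.3],
    [0.2, 0.4, 0.4],
])

def sample_markov_returns(n_steps, start_state=1, seed=0):
    """Draw a memory-1 Markov sequence of return rates."""
    rng = np.random.default_rng(seed)
    state, path = start_state, []
    for _ in range(n_steps):
        state = rng.choice(len(RETURNS), p=P[state])
        path.append(RETURNS[state])
    return np.array(path)
```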
If the return rates for the 3 possible states are -0.1, 0, 0.4, the Q table below is produced.
Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Prev Value: -0.1 | 559.623 | 62.7048 | 60.9816 | 68.2782 | 61.0308 | 66.2113 | 58.6067 | 56.0653 | 64.5082 | 60.6457 | 56.6935 |
Prev Value: 0 | 69.0304 | 72.4873 | 64.9642 | 58.6928 | 61.7887 | 561.144 | 70.8124 | 69.7554 | 66.7248 | 63.1553 | 64.2676 |
Prev Value: 0.4 | 159.824 | 196.479 | 200.312 | 181.142 | 205.622 | 200.464 | 184.247 | 178.793 | 196.372 | 172.393 | 1281.63 |
This yields the policy that, when the previous value is:
- 0 -> Invest 50% of capital into the stock
- -0.1 -> Do not invest
- 0.4 -> Invest 100% of capital into the stock
These results are consistent with the optimal investment decisions derived in the `./Matlab/` scripts, which indicates that the Q-learning is working effectively.
Using the script `./Matlab/ModelFitting.m`, the one-step transition matrix for IBM is computed over the IBM return rates (excluding the last 1000 samples, which are held out for testing).
However, this quantization is generic and arbitrarily chosen. Consider now repeating the experiment, but determining the quantization of our training data using the Lloyd-Max algorithm on the real-data training set.
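For readers unfamiliar with the procedure, here is a rough Python sketch of the two steps the Matlab scripts perform: fitting a scalar Lloyd-Max quantizer to the training returns, then estimating an empirical one-step transition matrix over the quantized states. Function names, initialization and defaults are mine for illustration, not the repo's.

```python
import numpy as np

def lloyd_max(samples, n_levels=3, iters=100):
    """Fit an MSE-optimal scalar quantizer by alternating nearest-neighbour
    partitioning and centroid updates (the Lloyd-Max iteration)."""
    levels = np.quantile(samples, np.linspace(0.1, 0.9, n_levels))   # initial codebook
    for _ in range(iters):
        thresholds = (levels[:-1] + levels[1:]) / 2                  # decision boundaries
        idx = np.digitize(samples, thresholds)                       # region of each sample
        levels = np.array([samples[idx == k].mean() if np.any(idx == k) else levels[k]
                           for k in range(n_levels)])                # centroid update
    return levels, thresholds

def empirical_transition_matrix(states, n_states):
    """Count one-step transitions between quantized state indices and normalise rows."""
    counts = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1                                      # avoid division by zero
    return counts / row_sums
```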
Applying this quantization in `./Matlab/ModelFitting.m`, we generate the following transition matrix. Training a Q agent on a dataset generated according to this empirical distribution and applying the policy obtained to the testing data yields the following result.
Repeating this experiment with the Microsoft data, we first obtain a Lloyd-Max quantizer as shown below. Afterwards we can obtain an empirical approximation of the one-step transition matrix according to this quantizer, with respect to the MSFT data excluding the most recent 1000 entries. Using this quantizer and transition matrix to generate training data for the Q agent, the policies obtained have the following result on our testing data.

To better understand what Markov order (if any) best represents the conditional dependence of past return rates, we experiment with manipulating the observations seen by the Q-learning agent. To start, we consider the Microsoft data. Below is the populated Q table when this stock is treated as IID. Here, the agent has no access to previous return rates, so the observation is always the same.
Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
None | 39.2431 | 39.5498 | 39.2918 | 40.3238 | 39.5636 | 48.7792 | 39.8656 | 39.8663 | 39.4475 | 39.746 | 39.4284 |
This policy implies investing 90% of assets into the stock at any given time step. Applying this policy over the testing data yields the following result.
Now, if we allow the Q-learning agent to observe the previous return rate as `-1, 0, 1`, corresponding to being in the negative, neutral and positive interval as per the uniform quantization over the training data, we observe the following Markov order-1 policy from the Q-learning agent.

Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Prev "Down" | 8.69687 | 8.416 | 8.76448 | 8.32668 | 8.06351 | 8.44431 | 8.61975 | 19.2718 | 9.00001 | 7.86183 | 7.99984 |
Prev "Nuetral" | 21.6007 | 21.3949 | 21.1213 | 21.2595 | 21.0212 | 21.3045 | 20.5782 | 21.1772 | 20.8734 | 28.6595 | 21.2023 |
Prev "Positive" | 8.626 | 8.10509 | 18.6383 | 8.12764 | 8.08328 | 8.12089 | 7.71774 | 8.01588 | 8.39773 | 8.66198 | 8.14364 |
This Markov policy yields the following performance over the testing data.
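As an aside, the observation fed to the agent in these experiments can be thought of as the last one or two quantized returns (with memory 0 reproducing the IID setting). Below is a hedged sketch of such an encoding; it is not necessarily how `enviorment.py` implements its observation space.

```python
import numpy as np

def quantize_return(r, thresholds):
    """Map a raw return to -1 / 0 / 1 ("Down" / "Neutral" / "Positive")."""
    return int(np.digitize(r, thresholds)) - 1   # thresholds = [t_low, t_high]

def observation(history, memory, thresholds):
    """Observation for the Q agent: the last `memory` quantized returns.
    memory=0 gives a constant (IID-style) observation."""
    if memory == 0:
        return ()
    return tuple(quantize_return(r, thresholds) for r in history[-memory:])

# Example: with thresholds [-0.01, 0.01], a history ending in a drop then a rise
print(observation([0.002, -0.03, 0.02], memory=2, thresholds=[-0.01, 0.01]))  # (-1, 1)
```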
Allowing the Q agent to look at the previous two return rates yields the following Q table.
Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Prev Seq: "Down", "Down" | 1.46329 | 1.34375 | 1.31343 | 1.51526 | 1.36004 | 1.47882 | 4.51021 | 1.35624 | 1.63156 | 1.34464 | 1.29957 |
Prev Seq: "Down", "Nuetral" | 4.35364 | 4.07422 | 3.88701 | 4.19443 | 4.04607 | 3.83908 | 4.51836 | 4.09829 | 10.6058 | 3.95363 | 4.04416 |
Prev Seq: "Down", "Positive" | 1.16335 | 1.42665 | 1.21431 | 1.37276 | 1.42155 | 1.41948 | 1.36941 | 4.42286 | 1.15609 | 1.05923 | 1.19335 |
Prev Seq: "Nuetral", "Down" | 2.88785 | 2.68302 | 3.32496 | 2.92446 | 2.70594 | 2.6983 | 2.91641 | 2.71681 | 7.94342 | 3.07691 | 2.84623 |
Prev Seq: "Nuetral", "Nuetral" | 15.2012 | 15.0627 | 22.1063 | 14.9272 | 14.9772 | 15.1129 | 14.9588 | 14.7254 | 14.772 | 15.2999 | 15.2734 |
Prev Seq: "Nuetral", "Postive" | 7.55047 | 3.08106 | 3.07143 | 3.17324 | 2.909 | 3.29652 | 2.78571 | 3.24299 | 3.09982 | 2.82638 | 3.24957 |
Prev Seq: "Positive", "Down" | 4.64262 | 1.45491 | 1.49116 | 1.44225 | 1.37647 | 1.39101 | 1.6392 | 1.17189 | 1.3111 | 1.53721 | 1.38324 |
Prev Seq: "Positive", "Nuetral" | 3.74981 | 4.25076 | 4.28741 | 9.85611 | 4.12424 | 4.65483 | 4.37102 | 4.29442 | 3.81727 | 4.28817 | 3.76588 |
Prev Seq: "Positive", "Positive" | 1.22974 | 1.00224 | 1.16756 | 1.04358 | 1.34929 | 1.13015 | 1.25551 | 3.65231 | 0.896386 | 1.18419 | 1.17435 |
Applying this Markov-2 policy to the testing set yields the following result.
Interestingly enough, the second-order memory performed worse. This is due to the Q-learning agent not being able to adequately explore all states: with memory 2 there are 9 observations instead of 3, so for the same number of episodes each state-action pair receives far fewer updates. In order to allow the agent to derive a good Markov memory-2 policy, we must increase the number of training episodes.

Conducting the equivalent experiment for the IBM data yields the following results. When treated as IID, the following Q table is derived.

Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
None | 43.8113 | 43.8479 | 43.6593 | 44.0324 | 44.544 | 44.5033 | 43.8169 | 44.1369 | 43.9382 | 54.4591 | 44.3226 |
Applying this policy to the testing data yields the following result.
Moving to Markov memory 1, we obtain the following Q table.

Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Prev "Down" | 10.2676 | 10.2352 | 11.1674 | 10.8175 | 10.6088 | 11.0861 | 10.3039 | 22.9233 | 9.90889 | 10.1815 | 9.52493 |
Prev "Nuetral" | 23.2557 | 23.2025 | 23.4107 | 23.7157 | 23.1266 | 23.4601 | 31.6622 | 24.0064 | 22.8477 | 23.2966 | 23.6219 |
Prev "Positive" | 7.81495 | 7.97147 | 8.75302 | 8.76693 | 8.49904 | 8.13567 | 8.53647 | 8.06919 | 8.06033 | 8.29538 | 20.0326 |
This yields the following testing results.
The difference between the IID test and the Markov-1 test is that the Markov-1 policy seems to be less "risky": it loses less money at the beginning of the testing period but gains less towards the end. The two policies appear to be very similar.

Moving forward to Markov order 2, we obtain the following Q table.

Distribution into Source | 0% | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|
Prev Seq: "Down", "Down" | 2.01617 | 2.08723 | 1.95979 | 6.864 | 1.78769 | 1.70395 | 1.87764 | 2.22568 | 1.64296 | 1.81577 | 2.00709 |
Prev Seq: "Down", "Nuetral" | 5.53366 | 5.52372 | 5.12495 | 5.09426 | 12.9182 | 5.1679 | 5.40856 | 4.4439 | 4.89715 | 4.97982 | 5.47335 |
Prev Seq: "Down", "Positive" | 1.18736 | 1.34069 | 1.44999 | 1.51097 | 1.20899 | 1.38332 | 4.57586 | 1.24534 | 1.38879 | 1.46921 | 1.37788 |
Prev Seq: "Nuetral", "Down" | 3.6141 | 3.97473 | 3.66598 | 3.66956 | 3.88415 | 9.45307 | 3.82271 | 3.77226 | 3.98146 | 3.84407 | 3.50747 |
Prev Seq: "Nuetral", "Nuetral" | 16.3345 | 16.0558 | 15.8404 | 16.3348 | 15.6925 | 16.4471 | 16.2041 | 16.3338 | 16.1826 | 16.1871 | 24.4097 |
Prev Seq: "Nuetral", "Postive" | 3.13525 | 2.72395 | 2.83029 | 3.04262 | 8.08926 | 2.84946 | 3.14385 | 3.14187 | 2.75471 | 2.78054 | 2.84228 |
Prev Seq: "Positive", "Down" | 1.49701 | 5.84822 | 1.38522 | 1.58129 | 1.71697 | 1.74671 | 1.58833 | 1.83914 | 1.59212 | 1.92212 | 1.68669 |
Prev Seq: "Positive", "Nuetral" | 4.31702 | 3.85078 | 10.0363 | 3.83889 | 3.75574 | 3.99425 | 3.21814 | 4.12956 | 3.47642 | 3.56835 | 3.76362 |
Prev Seq: "Positive", "Positive" | 1.30869 | 1.09477 | 1.28534 | 1.12935 | 1.18675 | 0.935871 | 0.9024 | 0.860647 | 1.13079 | 1.10643 | 4.14827 |
This yields the following performance over our testing period.