Notes on the cda0 project
=========================
10/3/22:
* Spent approx 3 hr setting up & beginning the coding.
10/4/22:
* Time spent:
17:00 - 18:44 = 1:44
21:21 - 22:34 = 1:13
* Decided to go forward with the somewhat awkward and non-extensible approach of modeling observations to include the states of
exactly 3 neighbor vehicles. For a future version I will replace that with a more general approach that looks at the
roadway itself (analogous to an Atari agent viewing screen pixels rather than tracking a number of alien ships).
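A minimal sketch of what such a fixed-size observation vector could look like (the field names, counts, and use of
gym's Box space here are my assumptions for illustration, not the actual cda0 layout):

    import numpy as np
    from gym.spaces import Box

    # Assumed layout: ego state plus 3 neighbor vehicles, each contributing a
    # fixed block of values, all scaled by the env wrapper.
    EGO_FIELDS = 4          # e.g. lane id, downtrack dist, speed, prev accel
    NEIGHBOR_FIELDS = 3     # e.g. lane id, downtrack dist, speed
    NUM_NEIGHBORS = 3
    OBS_SIZE = EGO_FIELDS + NUM_NEIGHBORS * NEIGHBOR_FIELDS

    observation_space = Box(low=0.0, high=1.0, shape=(OBS_SIZE,), dtype=np.float32)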
10/5/22:
* Time 20:45 - 22:40 = 1:55
10/6/22:
* Time
10:33 - 12:39 = 2:06
21:53 - 23:03 = 1:10
10/7/22:
* Time
21:24 - 22:42 = 1:18
23:01 - 00:30 = 1:29
10/8/22:
* Time
17:04 - 18:25 = 1:21
19:03 - 19:27 = 0:24
10/9/22:
* Time 19:54 - 22:53, int 13 = 2:46
* Completed unit testing of all env logic except reward.
* Built initial reward logic.
10/10/22:
* time 10:48 - 11:15 = 0:27
10/11/22:
* Time
21:08 - 23:15 = 2:07
23:44 - 00:15 = 0:31
10/12/22:
* Time
08:14 - 09:40 = 1:26
14:12 - 14:57 = 0:45
15:27 - 15:56 = 0:29
* Fixed all unit test and run-time errors. Ready to begin training.
10/13/22:
* Time 22:08 - 01:00 = 2:52
* Still having occasional problems with observation_space violations
10/14/22:
* Time 23:08 - 01:59 = 2:51
* Created conda env "cda0" as a copy of rllib2 to make sure any project-specific changes aren't applied to the base env.
10/15/22:
* Time
11:59 - 12:34 = 0:35
22:44 - 00:20 = 1:36
* SUCCESS - first taste! got several trials to drive lane 0 beginning to end, forcing no steering.
Best tuning trial was 35915_00003; saving checkpoint 181.
* Ran new training job (7559b), with results under the cda0-l0-free dir, that starts agent in lane 0 but allows it to change
lanes. Most successful trial was 75599b_00005; using checkpoint 126. PPO lr = 2.8e-5, batch = 512.
* TODO: Need an inference program to run these successful models and capture their trajectories for viewing.
10/16/22:
* Time 21:05 - 23:09, int 20 = 1:44
* New tuning job (17669) under cda0-l01-free dir. This one randomly initializes episodes with the ego vehicle in either lane
0 or 1, but not 2. The neighbor vehicles still do not move. Two solutions got pretty close (00003 and 00002), but none
scored higher than low 1.8s for the mean reward.
* Installed pygame (with pip) into the conda env "cda0" to experiment with making a graphical display of the simulation.
Played with an example program enough to quickly understand how to do some basic graphics needed for my sim.
10/17/22:
* Time
14:16 - 14:48 = 0:32
17:10 - 18:36 = 1:26
22:32 - 23:21 = 0:49
* Fixed a problem in the env wrapper that was locking the ego lane to 0; also changed reward shape some, so all previous
training needs to be thrown away.
* Training again on random lane start (0 or 1 only) with no neighbor vehicles moving; trial ID = c2d96.
* Trial 0003 ran for a long time, but reward gradually increased the whole time, maxing out above 4! Not sure how
this is possible. LR = 0.00013, batch = 512. It has learned to shy away from speeds near the max, since it gets
punished for large accels there; also shy of speeds near zero, for same reason, but this is less of a fear.
Result is that accel oscillates between +3 and -3 m/s^2, with speeds going between 4% and 75% of max range.
So avg speed of the vehicle is really small, and it collects more reward points for staying on track.
* Trial 0007 was the only other one that succeeded, with a max reward around 1.9. LR = 0.000161, batch = 256.
* Increased penalty for jerk; decreased existential time step reward; made a more differentiable shape for penalty where
accel is threatening to take vehicle past either high or low speed limit, in order to minimize accel oscillations.
* Training again on random lane start (0 or 1); trial ID = 47fd7. None of the 18 trials converged.
10/18/22:
* Time 21:40 - 23:00 = 1:20
* Removed reward penalties for both the acceleration limits and trying to keep steering command near {-1, 0, 1}.
Ran tune (ID 10baa). Died in middle due to computer crash.
* Compared code to that used on 10/16 (commit cf6f) to find why suddenly nothing is learning to drive straight.
Only found two seemingly minor diffs in the reward structure (given that the new penalties are commented out).
I changed those back to the way they were on 10/16 and ran a new tuning run (b24d4).
10/19/22:
* Time
17:46 - 18:25 = 0:39
21:54 - 23:03 = 1:10
* Last night's run produced 3 winners (out of 20 trials). This is with the "old" reward structure, so just a baseline to
prove it can be trained.
* Trial 14 had mean reward = 1.87; LR = 5.54e-5, activation = relu, network = [300, 128, 64], batch = 2048. Solved
in 121 iters. Its min reward was slightly < 0, however, so not ideal.
* Trial 16 had mean reward = 1.88 with similar min; LR = 3.42e-5, activation = relu, network = [300, 128, 64], batch = 512.
* Trial 19 had mean reward = 1.89 with similar min; LR = 2.11e-5, activation = relu, network = [256, 256], batch = 512.
Use checkpoint 163.
* Reviewing these results, I realized the reward was broken for lane change penalty, so fixed it.
* Another run (f7b05) applied these changes. Found several successful trials.
* Trial 19 had mean reward = 2.02 with similar min; LR = 1.36e-5, activation = tanh, network = [300, 128, 64], batch = 256.
Use checkpoint 154.
* Trial 18 had mean reward = 1.89 with min ~1.3; LR = 2.44e-5, activation = tanh, network = [300, 128, 64], batch = 1024.
* Trial 15 had mean reward = 1.83 with an unimpressive min; LR = 5.60e-5, activation = relu, network = [256, 256], batch = 512.
Use checkpoint 126.
* Trial 4 had mean reward = 1.86 with min ~1.3; LR = 3.39e-5, activation = relu, network = [300, 128, 64], batch = 2048.
Use checkpoint 97.
10/20/22:
* Time
01:12 - 01:33 = 0:21
17:15 - 18:31 = 1:16
19:54 - 22:00 = 2:06
* Added the LCC penalty back into the reward method. Made tuning run 211ad, all with a 3-layer network.
* Trial 6 mean reward = 2.01, LR = 1.04e-4, activation = tanh, batch = 1024
* Trial 11 mean reward = 1.75, LR = 1.99e-4, activation = tanh, batch = 1024
* Trial 14 mean reward = 1.75, LR = 1.37e-5, activation = tanh, batch = 128; min reward above +1.5
* Trial 18 mean reward = 1.76, LR = 1.67e-5, activation = tanh, batch = 128, min reward above +1.6
* Run 211ad showed
* LC cmd penalty worked really well at keeping the LC command very close to zero
* Accel was also really close to zero on trial 6, but averaged slightly positive, so speed rose throughout without ever
reaching the speed limit. Trial 6 got only 0.796 completion reward, but gathered lots of time step rewards throughout.
* Trial 11 had big accel, but its total reward was ~1.8 vs 2.2 for trial 6. So avg time step reward of ~0.09 is too high.
* There is a strong desire to drive in lane 0, whether the initial lane is 0 or 1; LC occurs immediately if needed.
* Max completion reward is only ~0.83 for the fastest possible travel time. Needs to be increased.
* Added scaling of the action_space in the wrapper class to keep the NN output accel limited to [-1, 1] (it was in [-3, 3]);
see the sketch at the end of this entry. Modified the reward shape a bit to better emphasize completing the course as fast as possible.
* New tuning run (b5410) to finalize work on lane change issues.
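A rough sketch of the action scaling mentioned above (function and constant names are mine; the real wrapper code
may differ):

    import numpy as np

    MAX_ACCEL = 3.0   # m/s^2, assumed physical accel limit used by the env

    def rescale_action(nn_action):
        # The policy network emits values in [-1, 1]; map the accel element back
        # onto the physical range before handing it to the underlying environment.
        accel_cmd = float(np.clip(nn_action[0], -1.0, 1.0)) * MAX_ACCEL
        lane_chg_cmd = float(np.clip(nn_action[1], -1.0, 1.0))
        return accel_cmd, lane_chg_cmd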
10/21/22 Fri:
* Time
16:52 - 19:37, int 23 = 2:22
20:12 - 21:12 = 1:00
* One run from yesterday (b5410) succeeded, which was trial 1. Mean reward = 2.11, LR = 4.99e-5, activation = tanh, batch = 1024.
Actions were well behaved, as desired, but accels were all small and tended to slow the car down to get more time step rewards.
* Changed reward limits from [-1, 1] to [-2, 2], since the completion reward was being greatly clipped.
* Reduced jerk penalty mult from 0.01 to 0.005.
* Reduced time step reward from 0.005 to 0.002 (it was contributing >0.5 to the total episode reward)
* Added penalty for exceeding speed limit (and increased obs space upper limit for speed substantially to allow an excess).
* Run fe601 with these changes produced no successful trials. After observing one slightly promising run, made these changes:
* Increased gamma from 0.99 to 0.999.
* Reduced LR range a bit.
* Added a HP to choose model's post_fcnet_activation between relu & tanh (was formerly fixed at relu).
* Another run with the above changes had no success either. So I removed the speed limit penalty and created run 25ee5.
* During training, I continued to write the graphics code, but in separate copies of the files: inference.py,
simple_highway_with_ramp.py, using the suffix "_new" on each one, so it won't affect the ongoing training.
10/22/22 Sat:
* Time
10:18 - 11:00 = 0:42
11:34 - 12:55 = 1:21
13:50 - 15:01 = 1:11
17:40 - 18:03 = 0:23
19:55 - 21:25 = 1:30
* Runs from last night (92d01) that look promising are 3 (LR 9.36e-5, tanh/tanh, batch 1024) and 12 (LR 3.50e-5, relu/tanh, batch 512).
* Run 3 ran close to 0 speed. When it finished, it collected 0 completion reward because it took 1176 time steps!
* Added penalty for slow speeds (normalized speed < 0.5), slightly increased penalty for jerk and slightly reduced penalty for
lane change command.
* Increased number of most_recent iterations to evaluate for stopping, and didn't stop if max reward is close to success threshold.
* Finished the graphics code for the initial roadway display and integrated it.
* Realized a MAJOR PROBLEM I have had: the training was always starting in lane 0. Also, since the vehicle initial conditions set by
reset() were pretty limited (speed and location), it rarely saw experiences at downtrack locations or at high speeds.
Therefore, I modified reset() to randomize these, and the initial lane ID, over the full range of possible experiences. It
is not clear to me how to get Ray Tune to pass in random values for each episode (when reset() is called), so for now I'll
depend on reset() to handle it. I've added a "training" config option that opens up these random ranges; if it is False or
undefined, the initial conditions will be as they were before.
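A hedged sketch of that reset() behavior (attribute and config names are assumptions):

    def reset(self):
        if self.config.get("training", False):
            # Training: randomize initial conditions over the full range of experiences.
            self.ego_lane  = int(self.prng.integers(0, self.num_start_lanes))
            self.ego_dist  = float(self.prng.uniform(0.0, self.lane_length))
            self.ego_speed = float(self.prng.uniform(0.0, self.max_speed))
        else:
            # Default / inference: the narrower initial conditions used previously.
            self.ego_lane, self.ego_dist = 0, 0.0
            self.ego_speed = 0.5 * self.max_speed
        return self._build_obs()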
10/23/22 Sun:
* Time
15:11 - 15:30 = 0:19
20:16 - 21:03 = 0:47
* Finally got one that seems to have learned: 10965 trial 10, used LR 5.80e-5, network [200, 100, 20] and output activation = tanh.
However, this also did not perform well. Accelerations are all over the place, and LC commands are as well. Several inference
runs failed to complete even half of the track.
* Added remaining code to the Graphics class in the inference program to do rudimentary display of ego vehicle driving on the track.
* Started a new training run using the DDPG algo instead of PPO (which I had been using thus far). It seemed to produce some
good results quickly, but the rewards never grew enough. Started playing with the network structures more.
* Some of the iterations are showing min rewards in the -200 to -300 range - how is this possible? Apparently, due to lots
of accumulated low-speed penalties.
10/24/22:
* time
13:34 - 15:07 = 1:33
19:57 - 22:31, int 15 = 2:19
* DDPG run from last night found some success! Run 41f25, trial 3 used actor network of [256, 128] and LR = 9.6e-6, with critic
network of [100, 16] and LR = 3.8e-4. Inference run in lane 0 stayed there with small lane chg cmds the whole way, and gradually
accelerated to max speed with no jerk penalties anywhere! Running checkpoint 500 (after which the mean reward dropped a bit).
Full run captured total episode reward of 1.54 taking 85 time steps. Another inference run started with a low speed (0.16 scaled)
and performed similarly, but, because of the low speed penalties at beginning, its total episode reward was only 0.34.
* Trial 14 from that run also performed really well in inference, using actor network of [100, 16] and LR = 6.6e-6, and
critic network of [128, 32] and LR = 9.6e-4.
* Completed testing the graphics update method. Had to modify the env wrapper to get access to both the scaled and unscaled obs.
* Added upper speed limit penalty and ran new training runs with DDPG, but not getting success. Max rewards tend to settle around 1.3
while min rewards settle around -10 to -20, with means in the -5 to -10 range. Maybe this settling is because noise gets removed
too early?
* Started another run with:
* Reduced upper speed limit penalty (0.2 at 1.2*speed limit)
* Much larger noise decay schedule (from 90k timesteps to 900k), plus random noise for 20k timesteps.
* Longer trials, up to 900 iters.
10/25/22:
* Time
19:00 - 19:58 = 0:58
21:45 - 23:22 = 1:37
* Last night's DDPG run did not succeed either. From the four best ones, I see that the actor performed best with the largest
network ([400, 100]), implying that maybe bigger would be better. Also, it seemed to prefer LR ~2e-5. The critic didn't seem
to care as much about either of these params, so it is probably good with a smaller network.
* Created a new DDPG run, notably with a much larger replay buffer. Before, the default 50k experiences was used; now I am using
1M experiences. Also adjusted some HPs a bit. Still no good.
* Run ac6fb: it seems that adding the upper speed limit penalty is causing it not to learn, so I removed that penalty and made no
other changes for this run. This produced at least 2 successful trials! It has therefore dawned on me that the problem is the
magnitude of the penalties being imposed for approaching the speed limits. They are way too big for a per-time step penalty,
considering the other penalties are O(0.001) and these are O(0.1), especially when the offending situation is not possible to
get out of in a single time step. The negative reward piles up very quickly, discouraging any learning.
* Reduced the low & high speed penalties by about 2 orders of magnitude for new run fad55.
10/26/22:
* Time 20:04 - 21:12 = 1:08
* A few runs scored decent mean rewards (~0.9) very early, then they gradually dropped as the episodes went on. However, their
early checkpoints perform pretty well.
* It is willing to accept a penalty of ~0.003 for low speed and ln chg cmd (each). But it doesn't like a large jerk
penalty at all. Seems to be willing to accept the low speed penalty in order to pick up more existence reward (0.005).
* In another run that started much faster, it was willing to accept a high speed penalty of up to 0.014 for the entire
run of 100 steps, in order to pick up 1.14 points for completing the run fast.
* Changed rewards so that
* Reduced existence reward to 0.003 (was 0.005)
* jerk penalty maxes out at 0.006 (was much higher)
* low speed penalty maxes out at 0.02 (was 0.01)
* high speed penalty at 1.2x speed limit is 0.02 (was 0.01)
* Reduced success threshold to 1.1 (was 1.2)
10/27/22:
* Time
09:07 - 09:33 = 0:26
11:53 - 12:25 = 0:32
14:29 - 15:07 = 0:38
16:29 - 16:51 = 0:22
21:00 - 22:47 = 1:47
* Got a few runs whose mean reward peak came very early, and only hit ~0.2. Common to accept lots of ongoing penalty for low
speeds, but kept very gradual & small accels.
* Reduced jerk penalty. Also removed the time step existence reward and reshaped the completion reward to drop off faster for
slow traversal.
* Fixed defect in steps_since_reset counter initialization, which was causing success reward to be less than it should. Also
set up new run with wider exploration of actor LR and noise params.
* Added condition to stop logic in case mean reward is low but max is above threshold; if the mean is near the min, then stop it.
* Turned off all lane change command penalties and ignore LC command coming into step() so it only has to learn about speed
control.
Still got no good results. This time each of the min, mean & max reward curves was almost flat, except for some noise, with
the mean centered around 0.2, which is way below what it should be. Running inference on one of the more successful trials,
I see that it is on the gas all the way, maxing out high speed penalties, which accumulated to ~2x the completion reward.
This doesn't make sense. I therefore think it is learning that max speed is the best policy because each episode begins
at a random downtrack location, so some of them are gathering close to 1.5 completion points for going very fast for only
a few time steps. Therefore, I changed the reset() method to always initialize training runs at the very beginning of
the track so it has to drive the entire track to get a completion reward.
* Debugging statements reveal that incoming accel actions, even at the very beginning of a training trial, are
highly correlated throughout the episode - mostly in the 0.8 - 1.2 m/s^2 range! But it is completing episodes
regularly very early. It quickly learns to max out accel (+3.0) and minimize number of time steps, and never gets
to explore what happens below the speed limit.
* Replaced the OU noise model with Gaussian using sigma = 1.0 m/s^2 at the beginning (gradually annealed). I confirmed that
it initially produces accels all over the place. However, it still quickly learns to push the accel to the max throughout
the episode.
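This switch is the kind of thing RLlib's exploration_config controls; a hedged example of what the DDPG setting
might look like (the scale values are illustrative):

    config["exploration_config"] = {
        "type": "GaussianNoise",
        "random_timesteps": 20000,   # purely random actions at the very start
        "stddev": 1.0,               # initial sigma, before annealing
        "initial_scale": 1.0,
        "final_scale": 0.1,
        "scale_timesteps": 900000,   # anneal the noise scale over ~900k steps
    }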
10/28/22:
* Time
19:28 - 21:41 int 15 = 1:58
* Walked through a few series of manually specified accelerations through episodes in the new environment loop tester. It
appears all the rewards make sense. Further, running a full episode at speed just below the speed limit gives a much
better episode reward than going full throttle all the way through.
* Changed the hidden layer activations from relu to tanh for both actor & critic. No improvement.
* Reduced the lower limit of the actor LR range. Started to see some hope at the low end. Need to go down into the e-7 area.
* Turned down the Gaussian noise sigma from 1.0 to 0.1. No noticeable effect on the reward plots.
* Inference on the best one of these (episode reward max ~0.8) showed that it seems to be learning to reduce its
accelerations (held close to 0.2), but it still lets speeds max out, so maybe slowing down the LR more will allow
it to discover the sweet spot.
* Reduced both actor & critic LRs even more.
* TIMING: on laptop battery, one episode of 145 iterations on 1 worker/1 env took 2:51.
1 worker/4 env took 3:01 on battery (took 3:09 on AC power)
10/29/22:
* Time
09:00 - 09:18 = 0:18
11:29 - 12:08 = 0:39
16:15 - 18:23 = 2:08
* Runs from last night with much lower LRs kept the max reward steady > 2, and even mean rewards stayed steady instead of
dropping, but still < -1, since the mins never improved. It does appear that gamma = 0.999 is important, and a more
narrow range of LRs is the sweet spot.
* More tuning with tau. Doesn't make a big difference. Nor does adding more Gaussian noise.
* Switched to TD3 algorithm, using the defaults suggested in the algo manual (they aren't provided to copy).
* Took a break to train the racecar project, which is a simplified version of this, only trying to drive straight down
a lane as fast as possible, but while respecting the speed limit.
10/30/22:
* Time
15:08 - 16:09 = 1:01
16:48 - 17:53 = 1:05
21:33 - 22:56 = 1:23
* Racecar toy taught me that it is important to keep the cumulative amount of possible penalties (over the episode) on the
same order of magnitude as the completion reward. I had had them about 2 orders of magnitude larger. I also suspect that
it will be beneficial to just let the system train a lot longer than I have, even though there hasn't been any clear
progress in a few hundred iterations.
* Applying these lessons to the cda0 project brought some quick success for the limited case of just speed control with TD3.
* Critic was [256, 32] for all trials.
* Found that actor network of [128, 16] is definitely too small.
* Best performers had actor of [256, 32]. They all learned quickly and reward curves were smooth.
* Actor network of [512, 128, 32] struggled to learn, but some trials did well. Best ones were the lowest LRs
(1.2e-6 for actor and 4.7e-6 for critic)
* I was unable to get checkpoints to load for inference engine - error about differing state dict param sets,
even though I verified the network structures were specified the same.
* Max rewards were smaller than I had hoped (but I can't see exactly what's going on due to no inference).
* New training run using DDPG and using lane change control also (just lanes 0 & 1). Changed reset() to randomize the
vehicle's initial location anywhere along the lane instead of just at the beginning (it was learning to just come to a
stop to avoid the perpetual negative rewards).
* Five of the 15 trials appear to have succeeded! Run 8c03d using critic net [256,32]. Results are in the
cda0-l01-free dir.
* Trial 0 had a long & jagged learning curve, but got there. Actor [256, 32], actor LR = 2.5e-5, critic LR = 3.1e-5,
tau = 0.005.
* Trial 2 had actor [256, 32], actor LR = 8.2e-6, critic LR = 1.6e-5, tau = 0.005.
* Trial 10 had actor [256, 32], actor LR = 7.8e-7, critic LR = 1.1e-6, tau = 0.001.
* Trial 11 had the fastest learning curve, with actor [256, 32], actor LR = 9.6e-7, critic LR = 7.7e-5,
tau = 0.005. In inference it used modestly high accel all the time and ignored the speed penalty. Rewards ~0.94.
* Trial 12 had actor [256, 32], actor LR = 4.7e-6, critic LR = 1.8e-5, tau = 0.001.
* None of the trials with an actor net of [512, 64] was even close.
* Next run includes the following mods:
* Increased success threshold from 1.0 to 1.1, which is not reachable by going full throttle all the time.
* Increased minimum required iterations to 300 to ensure we have plenty of settling time.
* Added penalty for LC cmd values near +/-0.5.
* Doubled the penalty for high speed violation (previously maxed out at 0.001).
* Results of this run (be9bc) showed that the [200, 20] actor network could achieve success, but not as easily.
Also, inference of a couple winners showed they still prefer full speed and a lower episode reward.
* Next run (23f2e) doubled the high speed penalty again.
10/31/22 Mon:
* Time
16:18 - 17:00 = 1:42
20:50 - 21:15 = 0:25
* Results of 23f2e run:
* Several trials reached episode reward between 1.0 and 1.05 quickly, and mostly stayed there.
* None reached the success threshold of 1.1.
* Best trials were 4, 7, 9, 13. Others that reached 1.0 but had some downward spikes: 1, 8, 12
* It appears that probability of success favors the [256, 32] network over the [200, 20] actor; also the larger
tau (0.005) seems more favored. Also, as expected, a ratio of actor LR / critic LR ~0.1 seems best.
* Inference shows that these models still want to use a high accel and are very reluctant to change its value.
* New run with following changes (3b68a):
* Trying larger 2-layer networks (the 200 last time wanted higher accels)
* Eliminate the jerk penalty to encourage large changes in accel.
* Double the high speed penalty (it was maxing at 0.0041).
* Use OU noise to be more realistic in generating variations in accel.
* Results: successful trials have mean rewards plateauing ~0.8, despite max rewards consistently being > 1.4.
Inference run still accelerates to max speed and stays there. Lane chg commands remain close to 0, however.
* New run with following changes
* Tuning noise magnitude
* Tuning with larger actor network (512 nodes in first layer)
* Tuning with larger choice of critic network
* Doubled the high speed penalty again to give max of 0.016
11/1/22 Tue:
* Time
04:07 - 04:47 = 0:40
11:46 - 12:20 = 0:34
15:56 - 16:32 = 0:36
19:01 - 20:03 = 1:02
* Analysis of prev run:
* No cases were successful. However, 3 of them quickly reached mean reward ~0.5 and stayed there. Then
2 of them fairly quickly (~500k steps) reached mean reward ~0.2 then slowly climbed to reach 0.8 after
7M steps. It appears they would keep going with further training.
* The two promising ones had both networks at [512, 64], noise in use (for 3M steps), and a LR ratio of
actor/critic close to 0.1 with actor LR between 2e-7 and 5e-7. In each case their min reward stayed around
-1 and max stayed at 1.48.
* On inference, they both had learned that small positive accel is the answer, and kept the LC cmd quite
small as well. They did not learn to maximize speed within the safe range, however.
* The four that got almost as high results showed a similar LR ratio, and 3/4 had the larger critic network.
* Modified reward code to randomly cancel the episode completion bonus if high speed violation occurs; probability
of cancellation is proportional to the amount of excess speed involved in that time step.
* Similar to previous run, a couple cases gradually increased mean reward (max ~0.7). These were 0, 2, 3
(for a while, then dropped off). These all had actor network of [512, 64] and critic network of [768, 80]
and similar LRs: actor ~3e-7, critic ~2e-5.
* Inference on two of them showed same pattern of sticking to very small, positive accelerations, regardless
of initial speed, and letting it run into the high speed violation.
* Modified reset() method to change initial position of the vehicle during training. It had been allowed to start
anywhere along the route, but I feel that is encouraging it to go for the big score and ignore speed limits, and
therefore, not worry much about accel. As iterations progress, the initial position will gradually be squeezed
toward the beginning of the track, forcing it to train for longer episodes. Also adjusted the probability of
cancelling the completion bonus upward (to worst case 4%).
* Most cases resulted in flat mean reward curves plateauing at ~-0.2, so no good. Three reached positive
territory, however.
* All 3 best trials used actor of [512, 64] at LR between 2e-7 and 7e-7, and critic of [768, 80] at LR
between 2e-5 and 5e-5.
* The biggest peak mean reward (trial 14) was 0.3, but it eventually tailed off to < 0 (after 5M steps).
* The worst performing of the 3 (trial 12) peaked ~0.1, then quickly dropped to < 0 after 2.5M steps.
* Inference runs show that these didn't perform any better than the previous training run. They learned
to keep accel small, but have no idea that speeding up to the speed limit is advantageous, or that
slowing down if above it is good.
* I confirmed that actions coming into the step() method tend to cover the full range of possible values,
at least early in a training run.
* Modified reward code to add an accel bonus if recent (4 steps) avg speed > speed limit and avg accel over that
period is < 0, and vice versa for speeds below the speed limit (except for a deadband). Bonus increases with
larger speed difference from the limit and with larger acceleration magnitude.
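A rough sketch of that bonus logic (names, deadband, and gain are assumed, not the actual values):

    DEADBAND = 2.0   # m/s below the speed limit where no bonus applies (assumed)

    def accel_bonus(avg_speed, avg_accel, speed_limit, gain=0.01):
        err = avg_speed - speed_limit
        if err > 0.0 and avg_accel < 0.0:         # too fast and slowing down
            return gain * err * abs(avg_accel)
        if err < -DEADBAND and avg_accel > 0.0:   # too slow and speeding up
            return gain * (-err) * abs(avg_accel)
        return 0.0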
11/2/22:
* Time
10:37 - 10:59 = 0:22
12:51 - 13:38 = 0:47
19:51 - 20:32 = 0:41
* Analyzed run from last night:
* Pretty similar performance as before - 3 runs reached mean reward > 0, maxing ~0.3.
* Inference still shows a desire for accels very close to 0 and a slow change to it. However, one run
demonstrated a slight response to the high speed penalty, where speed got into that zone, then accels
turned negative and returned it below speed limit. It took many time steps, however.
* Modified reward code to double the probability of eliminating the completion reward if high speed, and
doubled the accel bonus value.
* Running inference on some very early checkpoints (2-20 iterations) shows that already the accels and
LC cmds are very small. This makes me wonder if there is a scaling problem.
* Adding print statement during training run shows that scaling is not a problem; all calcs seem proper.
It is just learning very quickly (in first 20 iters) to keep accels close to zero. I now suspect
that this may be due to too much smoothing by training in large batches. Also the small time step
may be having some smoothing impact.
* Next set of mods:
* Tuning with much smaller batch sizes (down to 8)
* Simplified accel bonus calcs to just be based on current time step, not history.
* Analyzed above run (0c04c):
* All cases had a mean reward > 0, but none of them exceeded 0.8 despite each one having a max > 1.3.
* Runs 0 & 1 peaked quickly then dropped rapidly. Batch sizes were 1024 and 128, respectively.
* Run 2 took the longest to peak (6M+ steps) but also had the lowest max (as low as 1.25 at 7M). Its
batch size = 128. LR among the smallest & largest at 1.2e-7 for actor and 9.3e-5 for critic.
* Run 3 also peaked fairly quickly and dropped off a lot. Batch = 128.
* Run 4 was a slow mover but peaked nicely. Batch = 16, actor LR 1.7e-7, one of the lowest.
* Run 7 was lowest mean peak (but highest max), and dropped away very quickly. The only batch = 8.
* Runs 2, 5, 4, 13 had the lowest actor LRs, and their critic LRs were widely different. They all
showed gradual peak then tail-off of mean, plus max started high (~1.55), dipped, then climbed again
until its end. The dip was lower for those whose means took longer to peak. Once means tail off,
the max climbs again. These had batch = 128, 16, 1024, 1024. It seems they may have continued to
improve with more time.
* Inference performed similar to previous runs - accel very small and slow to change, but they are
starting to see the correct directions to move.
* It doesn't appear that small batch size has a noticeable effect. Best bet seems to be LR ~1e-7 for
actor and 5e-5 to 9e-5 for critic, then let it run a lot longer.
* New run:
* Using new StopLong class for the stopper, which pretty much lets it run to max iterations unless
the max reward is a failure. Also extended max iterations to 2000.
* Magnified the noise.
* Tightened the LR ranges, and moved the actor lower and critic higher.
* Doubled reward bonus for correct accel action.
11/3/22:
* Time
10:31 - 10:59 = 0:28
18:00 - 19:17 = 1:17
21:13 - ?
* Analysis of last night's run
* Some runs got peak mean reward of close to 0.7, similar to previous. Most had max rewards >= 2.
* Cases 1 and 7 started very slowly, then gradually increased mean reward; the only two that didn't
drop off within the 2000 iterations. Peak value of mean was ~0.4. Also, their max values were on
the lower side (~1), then began to climb towards the end. They are ripe for additional training.
These both had batch = 8 and 2 of the lowest actor LRs (7e-8 and 9e-8), with the same critic LR
of 9.8e-5.
* Case 6 is also interesting, as possibly the best performing of the others, with peak rewards coming at
~8M steps.
* Inference results are similar to previous, however. Not satisfying.
* Mods for a new run:
* Check for even smaller LRs for actor, larger for critic. Also, throw in a couple really big ones.
* Try batch size = 4.
***** Can config params get passed into the env object for each run? Yes - the configs are passed in to
the init method, and it is called by each worker at the beginning of an iteration. Any values
will be held constant throughout all episodes of that iteration (just like extrinsic items are,
such as LR). This could be a means to schedule gradual changes in individual reward penalties or
bonuses, or even in environment dynamics, such as taking off training wheels (removing limits on
action outputs). It could also be used to randomize some env constants to effectively provide
data augmentation (e.g. change friction coefficients, control response times, control biases, etc).
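A minimal sketch of the mechanism (class and key names are assumptions): Ray passes the env_config dict into the
env constructor on each worker, so any value written there by the tuning script is visible to every episode of
that iteration.

    import gym
    from ray.rllib.env.env_context import EnvContext

    class HighwayEnv(gym.Env):                      # stand-in name for the real env class
        def __init__(self, config: EnvContext):
            # Values scheduled by the tuning script show up here.
            self.high_speed_penalty = config.get("high_speed_penalty", 0.016)
            self.accel_limit        = config.get("accel_limit", 3.0)   # "training wheels"
            self.training           = config.get("training", False)

    # Tuning-script side: these land in the EnvContext above.
    config["env_config"] = {"high_speed_penalty": 0.016, "training": True}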
* Analysis of above run (e9151):
* Case 2 looks maybe promising in late time steps - long, steady mean reward, probably the highest
mean and max reward near the 12.5M step mark. Actor LR = 1e-5, critic LR = 9e-5, batch = 8.
* Case 3 very similar to 2 but much less erratic reward plots. Actor LR = 4e-8, critic LR = 9e-5,
batch = 16.
* Case 5 no good but had an interesting, huge spike in both mean and max rewards at ~2M steps, then
terminated soon after. Is this where noise ended?
* None of the 3 cases with batch = 4 was able to hold a mean reward > 0. At least one had what I
believe is a good LR ratio.
* Inference runs (of case 2) show that it is definitely reacting to the high and low speed penalties
and/or accel rewards, but its reaction time is many seconds, not one or two time steps as I would
** like to see. Is it possible that the unused historical accel values in the obs vector are
contributing to that? Hard to believe, but maybe. No - experiment shows it is not.
** * Other things to try: play with size of completion reward (even make it zero at all times) or
at least the ratio of completion reward to penalty magnitudes (maybe they all need to be closer to 1?).
Try PPO or another algo to get more dynamic action response to changing conditions.
Try training with a slow-down zone in the road to force larger accels.
Took a few days off and played with the simpler Racecar project to get back to basics. This is in the
projects/cda0_copy dir. An even simpler variant, Car, is in the projects/copy2 dir, and in Github under
repo name simple-car. Used PPO as the training algo.
LESSONS LEARNED from simple-car:
* I was able to train a car to drive a straight lane without traffic, as fast as possible while obeying the
speed limit.
* Keeping time step rewards large enough to present noticeable impacts on derivatives is important (values
in O(1e-3) or O(1e-4) were not doing it). Final time step's completion reward was O(10) and incremental
values were O(0.01) for 150 time step episodes.
* Shaping all rewards/penalties as smoothly differentiable seems to be important. Using a quadratic function
of distance from the target value is better than piecewise linear (see the sketch after this list).
* Start with really small NN structure. I was getting good results with a [16, 4] FC network.
* HP tuning takes a LOT of time. Most of them don't have much impact, but LR is very sensitive.
* May need to get several trials of ~same LR before finding one that works with initial weights distro.
* May need to let training run a long time so it can gradually converge after several chaotic dips in
performance.
* Important to get a good balance of noise, which probably needs to be extended throughout most of the
trial.
* Be sure noise is explicitly turned off during inference runs! Using Ray algorithms may bring in unseen
config settings that turn it on by default (e.g. PPO's "explore" flag).
* Ray makes it very difficult to continue training from a Tuner checkpoint. For insights, see
* https://discuss.ray.io/t/save-and-reuse-checkpoints-in-ray-2-0-version/8169
* https://discuss.ray.io/t/correct-way-of-using-tuner-restore/8247
* https://discuss.ray.io/t/retraining-a-loaded-checkpoint-using-tuner-fit-with-different-config/7994/7
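An illustration of the quadratic-shaping lesson above (function name and constants are mine, not the Car/cda0 code):

    def speed_penalty(speed, target, max_penalty=0.02, halfwidth=5.0):
        # Quadratic in the distance from the target speed: smooth and differentiable
        # everywhere, unlike a piecewise-linear ramp, and capped at max_penalty.
        err = (speed - target) / halfwidth
        return min(max_penalty, max_penalty * err * err)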
11/19/22:
* Time recorded directly on dashboard
* Merged the reward and tuning code from the simple Car code into the cda0 code and ran a tuning run there in
attempt to duplicate the success of training a drive in a straight lane with no other traffic around. The
only difference is that the cda0 environment now involves all the other observation elements and a 2D action
vector (although the 2nd element is not yet used).
* Results: two of the first 4 trials achieved mean rewards > 9. Running them in inference (with no
noise) starting in either lane 0 or 1 showed good performance. However, it seemed to be a little
too afraid of jerk, and happier to go slowly and take a lower completion reward.
* The one trial that had random_start_dist turned off never went anywhere, so it seems this is essential.
* Trying a new run with the low speed penalty turned on and a little smaller jerk penalty.
* This was still really tame in terms of acceleration & jerk, so accepted some slower-than-desired
solutions, with very slow, smooth accels.
* New run with a lighter jerk penalty again (another 10x reduction), and with 2.5x more low-speed penalty.
* Trial 00006 performed beautifully! Smart accel at the beginning, and smoothly leveled off when speed
approached posted limit, then stayed there. Exactly what I wanted. This is saved as trial
PPO_SimpleHighwayRampWrapper_1656e_00006 under ray_results/cda0-l01-free, and against code committed
on 11/20, commit fe894d0.
*****
* Straight lane performance is now complete. Time to move on.
11/20/22:
* Time recorded in dashboard
* Ran the cda0 tuning code with all 3 lanes as starting options, meaning in lane 2 the agent has to learn to change
lanes in order to finish. Results are now being recorded in ray_results/cda0.
* Results were poor. None of the 15 trials got a mean reward > -8 or so, although the max reward was
consistently ~ +16.
* One trial aborted due to a Ray error, but was looking like it could possibly break out to higher ground
than the others. Its LR = 4.5e-5.
12/4/22:
* Time recorded in dashboard
* Narrowed the LR range a bit and increased the noise magnitude (stddev) from 0.3 to 0.5 (Gaussian).
12/5/22:
* Results from yesterday's run:
* Nothing got above mean reward of 0, but one trial came close, peaking twice around -8 before falling off.
Running inference on this one's peak checkpoint...
* Starting in lane 0, it ran smoothly, but at a nearly constant, small positive accel, thus hitting top
speed eventually, and taking a penalty of 0.4 per time step. Total score = 10.4, but it could have done
a lot better.
* Starting in lane 1, it immediately changed to lane 0, taking a 0.05 penalty for that (trivial), then
stayed there decisively. Accel performance was similar to lane 0.
* Starting in lane 2, it tried to make an illegal lane change in first time step, so crashed. This is
repeatable 5 times, but with slightly varying action outputs.
* Mods for next run:
* reset() printing lane selection to verify that it is training in lane 2. Verified it is choosing all 3
lanes randomly, so turned this off again to avoid clutter.
* Tuning with choice of random seeds, based on a recent article I read.
* Increased max LR quite a bit, since rewards tend to be small (O(1)), and therefore gradient is not large.
* Increased noise slightly, from 0.5 to 0.6 magnitude of sigma.
12/7/22:
* Results of prev run:
* Trial 0 had LR = 7.99e-6; it drives lane 2 all the way, but runs off the end rather than change lanes.
* Trial 7 had LR = 1.62e-5; it makes an immediate lane chg in lane 2, so goes off road.
* Trial 10 had LR = 1.9e-3
* I verified that both the raw & scaled obs vectors are correct.
* Accel is stubbornly small, positive throughout all runs, regardless of speed & reward.
* Ideas to try (from reviewing above history):
* Reduce size of NN
* Play with LR schedule
* Turn off random start distance during training (force it to complete full route every time)
* Train in only lane 2 to see if it can at least learn that one
* Turn off jerk penalty
* Play with noise magnitude
* Mods for next run:
* Forcing lane ID to be 2 always (in reset())
* Commented out jerk penalty
* Results:
* 4 trials had mean reward peaks very close to 0 (> -10), but dropped off fast afterward. These
had LR of 1.4e-4, 1.4e-4 and 5.0e-5.
* All had mean reward that stayed at -50 for a long time, then several started climbing fast
around 500k to 600k steps after max reward suddenly jumped from -50 to +10.
* No min reward ever exceeded -50.
* Running trial 3 inference made it to the end! It had several lane changes, which is fine.
* Accelerations were somewhat sporadic, but much more aggressive at times.
* Found a defect in geometry code that repositioned the vehicle backwards one step after making
the lane change.
* Mods for next run:
* Fixed distance calc during lane change. I can't believe this had a significant impact on training,
although it did make the distance signal discontinuous, which could have had some effect.
* Added tuning options for NN size (see the sketch at the end of this entry).
* Trial 3 performed well, with a mean reward hovering close to +10 for many iterations before finally
tanking. Inference runs on it, in lane 2, performed ideally, holding the speed limit and changing lane
toward the end of lane 2. When started in lane 1, it immediately tried to change lanes left too much.
This one had a NN structure of [64, 24] and LR = 7.8e-5.
* None of the others achieved a promising mean reward, so I don't feel I'm really finding the sweet spot.
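The NN-size tuning option mentioned above amounts to something like this (a sketch using Ray Tune search-space
helpers; the candidate sizes and LR range are illustrative):

    from ray import tune

    config["model"] = {
        "fcnet_hiddens": tune.choice([[64, 24], [128, 50], [64, 48, 8]]),
    }
    config["lr"] = tune.loguniform(1e-5, 1e-3)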
12/11/22:
* Mods for next run, which is still only training on lane 2 start:
* Fixed the NN structure at [64, 24] and tuned for noise magnitude (still decaying to 0.04 at 1M steps).
* Two very promising trials were terminated prematurely due to stop criteria taking slopes over way too
many iterations. I need to reduce this, and also print out more details when it decides to stop.
* Three of the first 11 trials peaked between 0 and +10 (mean reward); not great performance, but decent.
Inference on two of these shows good performance. Their LRs were 7.5e-5, 1.1e-4, 9.5e-4, and they
used noise magnitude (stddev) of 0.61, 0.25, 0.17, respectively. So I believe I have this lane change
nailed.
* Mods for next run:
* Reduced the avg_over_latest param from 300 to 60 iterations, because I believe this is iters, not
steps. I added better output on stop condition to help confirm this.
* Zeroing in tuning to LRs in the range of success above, as well as noise magnitudes in the lower end
of that range.
* Reinstate all 3 lanes as candidate initial conditions (mod to reset()).
12/12/22:
* Trials 6 & 8 peaked at mean reward ~5. Inference run shows it is pretty good, but willing to accept a lot
of speed penalty, with only very tiny adjustments to accel. They both performed extraneous lane change
to lane 0 if starting in 1 or 2, which cost an extra ~0.2 penalty point.
* Jerk penalty has been turned off for some time, so it is learning smooth acceleration from other means.
* Next run:
* Increased size of NN from [64, 24] to [64, 40]
* Turned off randomized start distance during training (forcing all starts to be at beginning of lane).
* The first 6 trials here were lousy (one peaked at 0 and one peaked at -5), indicating that maybe
randomized start distance is still needed, so I aborted the run.
* Next run:
* Increased num workers from 8 to 12, keeping the rollout fragment length = 200, so train batch size
increased to 2400 (see the config sketch after this list). Hoping things will move faster now.
* Turned the randomized start distance back on.
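In RLlib config terms, the worker change above is roughly (a sketch; only these keys are implied by the note):

    config.update({
        "num_workers": 12,
        "rollout_fragment_length": 200,
        "train_batch_size": 12 * 200,   # = 2400 steps per training batch
    })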
12/13/22:
* It doesn't appear that the randomized start distance had any particular effect, which was expected,
since it had already learned to travel the full length of the course.
* A couple of the trials peaked above 0, and one got to +10 for mean reward. However, inference on it
showed pretty loose speed control. It had no problem reaching way into the penalty areas for extended
periods. Although it did seem to sense it didn't want to be there, the jerk went negative as soon as
it entered the high speed area, but it was so small that the accel took a long time to reverse the
incursion. With jerk penalty off, I don't understand why it won't change accels more quickly.
* Next run:
* Based on successful straight-lane results on 11/19, with the more aggressive jerk performance, I am
going back to that [64, 48, 8] NN structure.
12/14/22:
* This run was worse than the previous. Only one trial reached above 0 mean reward, but stayed < 5.
It had LR = 8.8e-5 and noise stddev = 0.28.
* Like pretty much all trials to date, good or bad, the mean reward tends to climb to some peak,
usually around 400k to 600k steps, then it falls off, sometimes dramatically. I feel like LR
annealing is going to be important to get past this problem and allow things to keep learning.
* Next run:
* Figured out how to add LR annealing to the PPO params, so did that, going from 2e-4 to 2e-6 over
the first 800k steps. Did not try to make this a tuned param at this time.
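The LR annealing described above maps onto RLlib's lr_schedule parameter; a sketch with the breakpoints named in
the note (RLlib interpolates linearly between entries):

    config["lr_schedule"] = [
        [0,      2.0e-4],
        [800000, 2.0e-6],
    ]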
12/15/22:
* The chosen schedule did not perform well at all. 3 trials peaked just below 0, but all exhibited
some amount of pull-back after a fairly short peak. Trials that tended to stay the most flat
after peaking were 2, 4, 5, 7, 11, 13. There is no correlation here with noise magnitude, as these
span the full range allowed, as do the trials not listed here.
* Next run:
* Changed the LR schedule a bit, generally moving it to lower values (about 2x change), with an
extra breakpoint.
* AI: I cannot see the exact LR being used in the log for any given time step. It would be good to
add a print statement into the PPO code to make that visible.
12/16/22:
* Three trials peaked above 0 but below +10. Inference on 2 of them showed close to constant,
small accel throughout, but got decent rewards in lanes 0 & 1 (~6-9); but when starting in lane 2
it continually either slowed to a stop or drove off the end of the lane. No good! The third trial
performed about the same in lanes 0 & 1, but in lane 2 it always did an illegal lane change in
the first time step. Clearly, nobody has learned how to drive lane 2 here. This explains the
persistent plots of -50 to -80 in the min rewards arena. It never exceeds -50. Two of these
trials had large noise magnitude (0.48 and 0.46), seeming to indicate that larger noise is good.
* Next run:
* It appears that lane 2 performance was good when that was the only thing trained, but when the
other lanes were thrown in it never learned well. Possible solutions:
* Force the training to select lane 2 more often than a totally random choice.
* More iterations
* Larger NN to accommodate the additional stuff it needs to learn.
* Use more noise
* I will try to make the NN wider instead of deep. Changing from [64, 48, 8] to [128, 50].
* Increased noise magnitude to the range of [0.4, 0.6] and stretched out its schedule to fully
decay to 0.04 at 1.6M steps (which is typically where the latest runs have been ending).
* Modified reset() to train on lane 2 50% of the time and lanes 0 and 1 25% each (see the sketch after this list).
* Added stop condition in the StopLogic class to terminate if the mean reward degrades at least
x (currently using 25%) below its peak in a given trial.
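The lane-selection weighting in reset() could be as simple as this (a sketch; the actual code may differ):

    import numpy as np

    # 50% lane 2, 25% each for lanes 0 and 1 during training.
    lane = int(np.random.choice([0, 1, 2], p=[0.25, 0.25, 0.50]))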
12/17/22:
* The 6 trials complete so far are looking like better trends, steadily climbing toward zero,
but some of them are being cut off prematurely by a defective stopper, so I aborted.
* Next run:
* Fixed the stopper defect (in the new code I added yesterday).
* Adjusted the LR schedule so it doesn't drop off so fast, based on where the reward curve was
starting to flatten out.
* Adjusted the long-term noise to be a bit higher, and increased the initial range somewhat.
* Trials 2 and 8 peaked between +8 and +10 on mean rewards, and performed really well in
inference on all lanes. Noise on these was 0.65 and 0.41, respectively. Accels were a
little jerky at times, but there is no jerk penalty, so it's to be expected. Agent tends to
like the lower speeds a bit, so it would be worth increasing that penalty a little and
narrowing its deadband as well, but not a major problem. Also, the accels were more
aggressive than I've seen in recent past, so I guess the extra neurons made that possible.
If that's the case, then maybe a few more still could be more useful.
* This is run 53a0c, and I am leaving it stored in ~/ray_results/cda0-solo.
** * AI: tweak the completion reward to drop off a little faster with the number of time steps.
There is hardly a noticeable difference between 130 and 170 time steps, so not much motivation
to speed it up.
**** * I believe I have found what I'm looking for in the solo vehicle department. Time to move on
and build a version that can handle other traffic on the roadway.
** * Lesson: if the NN feels too small, try to add width before adding depth.
>>>>>
12/18/22:
DRIVING IN TRAFFIC
The code is essentially already in place to start running 3 neighbor vehicles on the track along with
the ego vehicle (AI agent). I made a few small changes, noted below, to turn that code on. It will
run 3 vehicles at constant speed in lane 1. That speed can be varied for each trial by Tune, as can
their starting downtrack distance. However, for the first run, I am leaving them constant. These
vehicles will always all drive the same speed, so they won't crash into each other. They will remain
2 vehicle lengths apart, which leaves a 1-vehicle bumper-to-bumper gap between them, not enough for
the ego vehicle to slide into without registering a crash. Hopefully, this will force the ego vehicle
to either speed up to get in front or slow down to get behind if it is trying to change lanes while
they are in the way. For now the ego vehicle will be started randomly, as before, in any lane and
at any location and speed. Therefore, it will often never see a conflict with the neighbors.
* Changed completion reward from parabolic to linear, and made it degrade to 0 sooner (300
steps vice the previous 600 steps).
* Changed penalties for failed episodes to give a little less weight to those that ended in
an off-road (-40) or stopping in the road (-30), while a crash with another vehicle is still
worth -50 points.
* Modified reset() to give the neighbor vehicles a non-zero starting speed and location, which
can be configured.
* First run, set neighbor speed constant at 29.1 m/s and neighbor location constant at 320 m, which
gives n3 the same travel distance to the merge point as a vehicle would have if starting at the
beginning of lane 2. This feels like reasonably good chances of forcing a merge crash situation.
I am staying with the same NN structure and other tuning params that were successful yesterday in the
solo vehicle training.
* First several trials did decently, peaking between -20 and -10.
* Five of the first trials died with an error.
* During this run I enhanced the inference program to display the neighbor vehicles as well.
It became obvious through this that they form a tiny target, so training episodes will
probably normally miss them altogether, thus the agent won't have much opportunity to learn
anything about deconfliction.
* Next run:
* Made the vehicles much longer (40 m, which is about 2x the length of a semi), and started
them farther apart, so that they will present a much larger barrier to lane change from
lane 2.
* I adjusted the long end of the noise schedule so that it doesn't die off so soon (now goes
to 0.1 at 2M steps).
* All trials progressed well, and all leveled out without any big drops. But their plateau
was around -18 to -8, so no successes. Some inference runs with the new graphics show that
40 m is too long for the vehicles in this scenario, as the 3 neighbors can completely
block the merge area.
* Next run:
* Reduced vehicle length to 20 m. Also reduced the neighbor initial spacing in reset()
from 4 lengths to 3. At this length & spacing they can block about half of the merge area.
* Realized that the randomized start distance is being limited on a schedule that expires
after 400k episodes. This is going to increase more slowly than the time step count, but
I increased the limit to 800k episodes to see what will happen.
* Fixed a defect in the crash detection logic.
12/20/22:
* Results here are similar to the previous run, with all trials ending in the -20 to -10
range.
* Next run:
* Increased NN size to [256, 64]
12/21/22:
* No improvement in performance. Most trials ended with mean reward between -20 and -10.
Two of them peaked around -7.
* Next run:
* Changed reset() calc of max_distance for randomized start distance. I confirmed that it
is based on episode counts, which seldom exceed 3000, but it was stretching it out over
800k episodes. I pulled that back in to schedule the reduction over 2000 episodes, so
that my later episodes will be forced to run virtually the whole track.
* Reduced the initial LR somewhat.
* Trying some 3-layer NN structures as a tuning variable.
* One trial reached -22 for mean reward, but the others slowly crept upwards in the low
-30s before being terminated after ~1.2M steps due to the max reward falling to -30. It
would be interesting to see one of these left to run for several million steps since it
is improving.
* I noticed that these trials are running beyond 8000 episodes, so the max_distance param
is getting shortened too fast.
* Next run:
* Changed the max_distance schedule to reduce over 8000 episodes.
12/22/22:
* A few trials did the typical plateau between -20 and -10. One peaked slightly above -10
but couldn't hold it. Several stayed in the -35 region the whole way, with max rewards
rapidly dropping down to -30. These tended to be the 3-layer models.
* Next run:
* Made a few adjustments to the stop logic to help struggling borderline cases continue.
* Changed tuner to select from larger 2-layer models.
* AI: I think what is really needed is some curriculum training where the model first gets
trained to navigate the track without traffic, then introduce traffic. I haven't yet
figured out how to save a checkpoint from a tuning run, which would be maybe needed to
do that.
12/23/22:
* Process died during trial 5. Of the ones complete, only one peaked above -15 and one
reached above -20.
* Next run:
* Added some rudimentary curriculum learning by using a new environment config param
to specify at what point neighbor vehicles will start being used (time step #). Prior
to that episode, they will stay in their initial positions like before I started using
them.
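A hedged sketch of that curriculum switch inside the env (names are assumptions):

    def _neighbors_active(self) -> bool:
        # Neighbor vehicles hold their initial positions until the configured
        # global time step count is reached; after that they drive normally.
        start_step = self.config.get("neighbor_start_step", 0)
        return self.total_steps_all_episodes >= start_step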
12/26/22:
* The best trial plateaued between -15 and -10. In inference, its best checkpoint showed
a LOT of lane changing.
* Next run
* Fixed a defect in reset() that was not turning on the neighbor vehicles for phase 2 of
the curriculum learning, so the agent never experienced neighbors in the previous run.
* Improved curriculum training capability by allowing definition of multiple phases in
the StopLogic class.
12/29/22:
* Didn't get any notably different results.
* Next run (96415)
* Added an arg to StopLogic to let a trial run to max iterations unless it is a winner.
Ran two trials like this.
* Early max rewards took a smooth slope downward from 10 to ~4, as before, but between
  3M and 4M steps it suddenly spiked up to +10 and stayed near there. Also, around that
  point the mean reward started getting less smooth. In one of the trials it had long
  bursts up above -5. These ran to 1800 iterations (~5.5M steps).
* Inference on the best checkpoint (training mean reward ~0) showed decent performance
in the straight lanes, but it kept doing illegal lane changes early in lane 2.
* Next run (a0c3c)
* Fixed a minor defect in reset() where it printed some statuses after they were cleared.
* Extended the max iterations from 1800 to 2400, since it still looked like there was
some progress being made at that point.
12/30/22:
* One of these two runs performed similarly to the previous "good" one, in that it peaked
  at mean reward = 0, but its fluctuations looked like it could have benefitted from
  more iterations.
* In the log file I notice that each trial stopped after a few hundred thousand time steps
  (400k and 580k, respectively), due to reaching the iteration limit. This never triggered
  the neighbor vehicles to turn on! Therefore, all of this training was for solo driving. I
  suspect that having 12 jobs running in parallel caused this problem: Ray sums the time
  steps from all workers to get the ~7M shown on the plot, but each worker only contributes
  about 1/12 of that, while the threshold to turn on the neighbors assumes each env object
  goes all the way to 7M time steps (see the arithmetic sketch below).
** * AI - if all I've been training is solo driving, why is it so hard to get good rewards?
Need to compare to successful solo training for HPs.
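* The arithmetic behind that mismatch, roughly (numbers are illustrative):
      # Ray reports the sum of steps across all rollout workers, but each env instance
      # only compares its own local counter against the curriculum threshold.
      num_workers = 12
      reported_total_steps = 7_000_000                          # what the plot shows
      steps_seen_per_env = reported_total_steps // num_workers  # ~583k per env
      neighbor_start_step = 1_200_000                           # env-local threshold
      print(steps_seen_per_env >= neighbor_start_step)          # False -> neighbors never start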
* Next run (2784c)
* Changed tuning program to only use 1 worker (was 12).
* Extended iteration limit from 2400 to 3000.
* Now that all time steps are happening on 1 worker, it is transitioning to using
  neighbor vehicles as expected, beginning at 1.2M steps.
* In both trials the max reward took a huge step down at 1.2M steps (to around -30); in
one of them it quickly recovered to around -5, but in the other it stayed at -30 for
the remainder.
* My num crash tracker was being reinitialized incorrectly, but there is reason to believe
  that no crashes have been detected at all, which is bothersome.
** * It really bugs me that a trial progresses at virtually the same speed whether it is
using 12 workers or 1. Each worker is assigned 1 cpu, 1 env and 0 gpu. The eval
worker has 1 cpu, 2 env and 1 gpu (the full enchilada). I need to spend time
playing with various combinations to understand how to improve performance.
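* For reference, the resource split described above corresponds roughly to this kind of
  RLlib config (dict-style keys as of Ray ~2.x); the values come from the note above, and
  everything else is an assumption. The evaluation worker (1 cpu, 2 envs, 1 gpu) is set
  up separately through the evaluation_* settings.
      resource_config = {
          "num_workers": 12,           # rollout workers (reduced to 1 in the 2784c run)
          "num_cpus_per_worker": 1,
          "num_envs_per_worker": 1,
          "num_gpus_per_worker": 0,
          "num_gpus": 1,               # GPU assigned to the local (learner) worker
      }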
* Next run
* Increased terminal LR (1e-6) to apply at 7M steps instead of 3M to be a little more
like the solo vehicle success on 12/17.
* Added a new tuning option to use a NN of [512, 64] as one of the options.
* Changed the final noise magnitude from 0.2 to 0.1 (still occurring at 4M steps) to
  be more like the solo vehicle success.
* The first litmus test needs to be that the rewards look acceptable at the 1.2M step
mark, indicating that it has learned to drive solo before adding the neighbors.
* Enhanced StopLogic to use a let_it_run flag for each phase, so that we don't waste
time if a trial can't achieve good solo driving first. I now have 3 curriculum
phases:
0 = 1M steps to learn solo driving without aborting
1 = 200k steps to allow stop logic to evaluate rewards and abort while
still driving solo
2 = xM more steps with neighbors in motion to learn driving in traffic.
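* A minimal sketch of how those phases and the per-phase let_it_run flag might be
  expressed, assuming StopLogic acts like a Ray Tune stopper; the boundaries and all
  names except let_it_run are placeholders.
      # Phase table: end_step marks where each phase ends (cumulative time steps).
      phases = [
          {"end_step": 1_000_000, "let_it_run": True},   # 0: learn solo driving, no aborts
          {"end_step": 1_200_000, "let_it_run": False},  # 1: still solo, aborts allowed
          {"end_step": 4_000_000, "let_it_run": False},  # 2: neighbors in motion
      ]

      def should_stop(total_steps: int, mean_reward: float, threshold: float = -15.0) -> bool:
          phase = next((p for p in phases if total_steps < p["end_step"]), phases[-1])
          if phase["let_it_run"]:
              return False                    # never abort during this phase
          return mean_reward < threshold      # otherwise apply the normal reward test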
* Moved from 1 gpu on the local worker and 0 on rollout workers to 0.25 on the local
  worker and 0.5 on the rollout worker to see if it changes overall trial time
  (current pace is close to 1M steps/hr). I immediately found this doesn't work, as
  Ray hung before any trials ever started, so I moved these configs back to the way
  they were before.
* Realized a defect in the phase management design: the min timesteps value is doing
  double duty as the phase boundary, so the step count within a phase can never exceed
  the current phase's min timesteps, and therefore an early stop is never triggered.
* Next run (10 trials, ID 3db63)
* Fixed the StopLogic defect by adding a phase_end_steps input to define the phase
  boundaries separately from the definition of min timesteps in each phase (sketched below).
* StopLogic had a section that multiplied the min timesteps by 1.2 if the max reward rose
  above the success threshold, which pushed it up to the phase 0 boundary. I removed this
  logic and increased the phase boundary a bit in case I want to bring that logic back.
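* A sketch of the separation that phase_end_steps provides; aside from that name, the
  identifiers and values are illustrative.
      phase_end_steps = [1_000_000, 1_200_000, 4_000_000]  # where each phase ends
      min_timesteps   = [  400_000,   100_000, 1_000_000]  # steps into a phase before a stop is allowed

      def stop_allowed(total_steps: int) -> bool:
          # Find the current phase from the boundaries, independent of min_timesteps.
          i = next((j for j, end in enumerate(phase_end_steps) if total_steps < end),
                   len(phase_end_steps) - 1)
          phase_start = 0 if i == 0 else phase_end_steps[i - 1]
          return (total_steps - phase_start) >= min_timesteps[i]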
1/1/23:
* All trials failed badly, with the max reward going steadily down from 10 to -2 at about
  600k steps and then staying there (70% did this; the others had bigger drops). The max
  starting distance gradually drops over the first 800k steps, which explains most of this
  behavior. Mean rewards were scattered between -40 and -28, but generally climbed well
  until 300k steps; some continued climbing (or dropped then came back) until 700k steps.
  Mins stayed clustered around -55. All trials stopped at 1M steps.
* Next run
* Added a tuning choice for NN size of [128, 50], which was used on 12/17 for successful
solo driving.
* Changed noise schedule to end (magnitude 0.1) at 1.6M instead of 4M, which is what gave
success for solo driving.
* Enhanced the reset() max_distance calc to allow an initial period with the full track
length before ramping it down. Initially set it at 200k steps before ramping begins.
* One trial had a max reward that stayed above 0, and continued past 1.2M steps, then
suddenly tanked. So it never reached phase 1, which begins at 1.3M. This trial used
a NN of [512, 64].
* All other trials stopped at 800k steps, showing mean reward growth through 200k then
gradually decreasing; max rewards stayed at 10 until 200k then gradually headed to
negative. Peak mean reward was as high as -20.
* Next run
* Changed randomized start distance calc so that it doesn't begin to ramp down until
700k steps, then takes until 1M steps to completely disappear. Hoping this will give
the reward enough time to become positive before the situations become more difficult.
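* A sketch of that revised schedule; the function name and track length are placeholders.
      # Hold the full track available until hold_until steps, then ramp the max start
      # distance linearly to zero by ramp_done steps (illustrative only).
      def max_start_distance(total_steps: int,
                             track_length: float = 2000.0,
                             hold_until: int = 700_000,
                             ramp_done: int = 1_000_000) -> float:
          if total_steps <= hold_until:
              return track_length
          frac = min((total_steps - hold_until) / (ramp_done - hold_until), 1.0)
          return (1.0 - frac) * track_length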
1/2/23:
* All trials stopped at 800k steps. Their mean rewards were climbing until 700k, then
headed down, while max rewards stayed at 10 until that point, then went down very
quickly. There were two groups, with the first group achieving a peak between -20 and
-10 (mean), and the second group running distinctly lower and peaking just below -20.
In the first group, the best 3 trials were all [128, 50] networks and had noise
magnitude between 0.48 and 0.65. The second group all had noise magnitude > 0.70.
* Next run (c8a85, 14 trials)
* Giving more chance of choosing a [128, 50] NN.
* Redefining phase 0 to be just random starting point, and extending it to 1M steps.
Then phase 1 will be gradually ramping down the starting distance, to 1.6M steps and
the phase will end at 1.7M steps. Then phase 2 will be neighbor vehicles for remainder
of 4M steps.
* In the first 3 trials (all [128, 50]) there is some indication of a major step change
  at 1.6M steps, causing the trial to go very badly and terminate at 1.7M.
* Had to kill this run in the middle of trial 5 due to shutting down for vacation. Results
  are still available; just run the tensorboard server again.
** * AI: considerations for next runs:
* the reward curve gradually flattens out as it progresses. I wonder if this is
  due to the LR reduction. Maybe leave the LR higher for longer to see if it helps.
  The LR tapers from 1e-4 to 1e-5 over the first 800k steps, where the reward slope
  is pretty steep. Then it stays at 1e-5 between 800k and 1.6M, where the reward
  slope is nearly flat and even goes a little negative. Then the LR drops again
  to 1e-6 over the next several million steps (current schedule sketched below).
* consider extending the noise out longer
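* The current LR schedule, written out as the piecewise-linear [timestep, value] pairs
  that several RLlib algorithms accept for lr_schedule; the 7M breakpoint is approximate.
      lr_schedule = [
          [        0, 1.0e-4],
          [  800_000, 1.0e-5],   # fast taper while the reward slope is still steep
          [1_600_000, 1.0e-5],   # hold, where the reward curve is going flat
          [7_000_000, 1.0e-6],   # slow final taper
      ]
      # Stretching the schedule (per the note above) means moving the 800k breakpoint
      # out, e.g. to 1.6M, so the LR stays higher while the reward is still climbing.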
2/15/23:
* Next run
* Set seed to a constant value, since varying it just creates an additional variable that
may be clouding the story of what works & doesn't.
* Stretched out the LR schedule, per above, so it doesn't hit 1e-5 until 1.6M steps.
After that it is the same as before.
* All trials looked similar to recent previous runs: the reward curve slope gradually
  decreases after a few hundred thousand steps, so that it is pretty flat after 1M, and
  nothing reached above a mean reward of -10.
2/19/23:
* Compared code between current branch (3-neighbors) and master, which was last committed around