Notes on the cda0 project
=========================
10/3/22:
* Spent approx 3 hr setting up & beginning the coding.
10/4/22:
* Time spent:
17:00 - 18:44 = 1:44
21:21 - 22:34 = 1:13
* Decided to go forward with the somewhat awkward and non-extensible approach of modeling observations to include the states of
exactly 3 neighbor vehicles. For a future version I will replace that with a more general approach that looks at the
roadway itself (analogous to an Atari agent viewing screen pixels rather than tracking a number of alien ships).
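A minimal sketch of what such a fixed-size observation vector could look like (the field names, counts, and use of
gym's Box space here are my assumptions for illustration, not the actual cda0 layout):

    import numpy as np
    from gym.spaces import Box

    # Assumed layout: ego state plus 3 neighbor vehicles, each contributing a
    # fixed block of values, all scaled by the env wrapper.
    EGO_FIELDS = 4          # e.g. lane id, downtrack dist, speed, prev accel
    NEIGHBOR_FIELDS = 3     # e.g. lane id, downtrack dist, speed
    NUM_NEIGHBORS = 3
    OBS_SIZE = EGO_FIELDS + NUM_NEIGHBORS * NEIGHBOR_FIELDS

    observation_space = Box(low=0.0, high=1.0, shape=(OBS_SIZE,), dtype=np.float32)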
10/5/22:
* Time 20:45 - 22:40 = 1:55
10/6/22:
* Time
10:33 - 12:39 = 2:06
21:53 - 23:03 = 1:10
10/7/22:
* Time
21:24 - 22:42 = 1:18
23:01 - 00:30 = 1:29
10/8/22:
* Time
17:04 - 18:25 = 1:21
19:03 - 19:27 = 0:24
10/9/22:
* Time 19:54 - 22:53, int 13 = 2:46
* Completed unit testing of all env logic except reward.
* Built initial reward logic.
10/10/22:
* time 10:48 - 11:15 = 0:27
10/11/22:
* Time
21:08 - 23:15 = 2:07
23:44 - 00:15 = 0:31
10/12/22:
* Time
08:14 - 09:40 = 1:26
14:12 - 14:57 = 0:45
15:27 - 15:56 = 0:29
* Fixed all unit test and run-time errors. Ready to begin training.
10/13/22:
* Time 22:08 - 01:00 = 2:52
* Still having occasional problems with observation_space violations
10/14/22:
* Time 23:08 - 01:59 = 2:51
* Created conda env "cda0" as a copy of rllib2 to make sure any project-specific changes aren't applied to the base env.
10/15/22:
* Time
11:59 - 12:34 = 0:35
22:44 - 00:20 = 1:36
* SUCCESS - first taste! got several trials to drive lane 0 beginning to end, forcing no steering.
Best tuning trial was 35915_00003; saving checkpoint 181.
* Ran new training job (7559b), with results under the cda0-l0-free dir, that starts agent in lane 0 but allows it to change
lanes. Most successful trial was 75599b_00005; using checkpoint 126. PPO lr = 2.8e-5, batch = 512.
* TODO: Need an inference program to run these successful models and capture their trajectories for viewing.
10/16/22:
* Time 21:05 - 23:09, int 20 = 1:44
* New tuning job (17669) under cda0-l01-free dir. This one randomly initializes episodes with the ego vehicle in either lane
0 or 1, but not 2. The neighbor vehicles still do not move. Two solutions got pretty close (00003 and 00002), but none
scored higher than low 1.8s for the mean reward.
* Installed pygame (with pip) into the conda env "cda0" to experiment with making a graphical display of the simulation.
Played with an example program enough to quickly understand how to do some basic graphics needed for my sim.
10/17/22:
* Time
14:16 - 14:48 = 0:32
17:10 - 18:36 = 1:26
22:32 - 23:21 = 0:49
* Fixed a problem in the env wrapper that was locking the ego lane to 0; also changed reward shape some, so all previous
training needs to be thrown away.
* Training again on random lane start (0 or 1 only) with no neighbor vehicles moving; trial ID = c2d96.
* Trial 0003 ran for a long time, but reward gradually increased the whole time, maxing out above 4! Not sure how
this is possible. LR = 0.00013, batch = 512. It has learned to shy away from speeds near the max, since it gets
punished for large accels there; also shy of speeds near zero, for same reason, but this is less of a fear.
Result is that accel oscillates between +3 and -3 m/s^2, with speeds going between 4% and 75% of max range.
So avg speed of the vehicle is really small, and it collects more reward points for staying on track.
* Trial 0007 was the only other one that succeeded, with a max reward around 1.9. LR = 0.000161, batch = 256.
* Increased penalty for jerk; decreased existential time step reward; made a more differentiable shape for penalty where
accel is threatening to take vehicle past either high or low speed limit, in order to minimize accel oscillations.
* Training again on random lane start (0 or 1); trial ID = 47fd7. None of the 18 trials converged.
10/18/22:
* Time 21:40 - 23:00 = 1:20
* Removed reward penalties for both the acceleration limits and trying to keep steering command near {-1, 0, 1}.
Ran tune (ID 10baa). Died in middle due to computer crash.
* Compared code to that used on 10/16 (commit cf6f) to find why suddenly nothing is learning to drive straight.
Only found two seemingly minor diffs in the reward structure (given that the new penalties are commented out).
I changed those back to the way they were on 10/16 and ran a new tuning run (b24d4).
10/19/22:
* Time
17:46 - 18:25 = 0:39
21:54 - 23:03 = 1:10
* Last night's run produced 3 winners (out of 20 trials). This is with the "old" reward structure, so just a baseline to
prove it can be trained.
* Trial 14 had mean reward = 1.87; LR = 5.54e-5, activation = relu, network = [300, 128, 64], batch = 2048. Solved
in 121 iters. Its min reward was slightly < 0, however, so not ideal.
* Trial 16 had mean reward = 1.88 with similar min; LR = 3.42e-5, activation = relu, network = [300, 128, 64], batch = 512.
* Trial 19 had mean reward = 1.89 with similar min; LR = 2.11e-5, activation = relu, network = [256, 256], batch = 512.
Use checkpoint 163.
* Reviewing these results, I realized the reward was broken for lane change penalty, so fixed it.
* Another run (f7b05) applied these changes. Found several successful trials.
* Trial 19 had mean reward = 2.02 with similar min; LR = 1.36e-5, activation = tanh, network = [300, 128, 64], batch = 256.
Use checkpoint 154.
* Trial 18 had mean reward = 1.89 with min ~1.3; LR = 2.44e-5, activation = tanh, network = [300, 128, 64], batch = 1024.
* Trial 15 had mean reward = 1.83 with an unimpressive min; LR = 5.60e-5, activation = relu, network = [256, 256], batch = 512.
Use checkpoint 126.
* Trial 4 had mean reward = 1.86 with min ~1.3; LR = 3.39e-5, activation = relu, network = [300, 128, 64], batch = 2048.
Use checkpoint 97.
10/20/22:
* Time
01:12 - 01:33 = 0:21
17:15 - 18:31 = 1:16
19:54 - 22:00 = 2:06
* Added the LCC penalty back into the reward method. Made tuning run 211ad, all with a 3-layer network.
* Trial 6 mean reward = 2.01, LR = 1.04e-4, activation = tanh, batch = 1024
* Trial 11 mean reward = 1.75, LR = 1.99e-4, activation = tanh, batch = 1024
* Trial 14 mean reward = 1.75, LR = 1.37e-5, activation = tanh, batch = 128; min reward above +1.5
* Trial 18 mean reward = 1.76, LR = 1.67e-5, activation = tanh, batch = 128, min reward above +1.6
* Run 211ad showed
* LC cmd penalty worked really well at keeping the LC command very close to zero
* Accel was also really close to zero on trial 6, but averaged slightly positive, so speed rose throughout without ever
reaching the speed limit. Trial 6 got only 0.796 completion reward, but gathered lots of time step rewards throughout.
* Trial 11 had big accel, but its total reward was ~1.8 vs 2.2 for trial 6. So avg time step reward of ~0.09 is too high.
* There is a strong desire to drive in lane 0, whether the initial lane is 0 or 1; LC occurs immediately if needed.
* Max completion reward is only ~0.83 for the fastest possible travel time. Needs to be increased.
* Added scaling of the action_space in the wrapper class to keep the NN output accel limited to [-1, 1] (it was in [-3, 3]);
see the sketch at the end of this entry. Modified the reward shape a bit to better emphasize completing the course as fast as possible.
* New tuning run (b5410) to finalize work on lane change issues.
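A rough sketch of the action scaling mentioned above (function and constant names are mine; the real wrapper code
may differ):

    import numpy as np

    MAX_ACCEL = 3.0   # m/s^2, assumed physical accel limit used by the env

    def rescale_action(nn_action):
        # The policy network emits values in [-1, 1]; map the accel element back
        # onto the physical range before handing it to the underlying environment.
        accel_cmd = float(np.clip(nn_action[0], -1.0, 1.0)) * MAX_ACCEL
        lane_chg_cmd = float(np.clip(nn_action[1], -1.0, 1.0))
        return accel_cmd, lane_chg_cmd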
10/21/22 Fri:
* Time
16:52 - 19:37, int 23 = 2:22
20:12 - 21:12 = 1:00
* One run from yesterday (b5410) succeeded, which was trial 1. Mean reward = 2.11, LR = 4.99e-5, activation = tanh, batch = 1024.
Actions were well behaved, as desired, but accels were all small and tended to slow the car down to get more time step rewards.
* Changed reward limits from [-1, 1] to [-2, 2], since the completion reward was being greatly clipped.
* Reduced jerk penalty mult from 0.01 to 0.005.
* Reduced time step reward from 0.005 to 0.002 (it was contributing >0.5 to the total episode reward)
* Added penalty for exceeding speed limit (and increased obs space upper limit for speed substantially to allow an excess).
* Run fe601 with these changes produced no successful trials. After observing one slightly promising run, made these changes:
* Increased gamma from 0.99 to 0.999.
* Reduced LR range a bit.
* Added a HP to choose model's post_fcnet_activation between relu & tanh (was formerly fixed at relu).
* Another run with the above changes had no success either. So I removed the speed limit penalty and created run 25ee5.
* During training, I continued to write the graphics code, but in separate copies of the files: inference.py,
simple_highway_with_ramp.py, using the suffix "_new" on each one, so it won't affect the ongoing training.
10/22/22 Sat:
* Time
10:18 - 11:00 = 0:42
11:34 - 12:55 = 1:21
13:50 - 15:01 = 1:11
17:40 - 18:03 = 0:23
19:55 - 21:25 = 1:30
* Runs from last night (92d01) that look promising are 3 (LR 9.36e-5, tanh/tanh, batch 1024) and 12 (LR 3.50e-5, relu/tanh, batch 512).
* Run 3 ran close to 0 speed. When it finished, it collected 0 completion reward because it took 1176 time steps!
* Added penalty for slow speeds (normalized speed < 0.5), slightly increased penalty for jerk and slightly reduced penalty for
lane change command.
* Increased number of most_recent iterations to evaluate for stopping, and didn't stop if max reward is close to success threshold.
* Finished the graphics code for the initial roadway display and integrated it.
* Realized a MAJOR PROBLEM I have had: the training was always starting in lane 0. Also, since the vehicle initial conditions set by
reset() were pretty limited (speed and location), it rarely saw experiences at downtrack locations or at high speeds.
Therefore, I modified reset() to randomize these, and the initial lane ID, over the full range of possible experiences. It
is not clear to me how to get Ray Tune to pass in random values for each episode (when reset() is called), so for now I'll
depend on reset() to handle it. I've added a "training" config option that opens up these random ranges; if it is False or
undefined, the initial conditions will be as they were before.
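A hedged sketch of that reset() behavior (attribute and config names are assumptions):

    def reset(self):
        if self.config.get("training", False):
            # Training: randomize initial conditions over the full range of experiences.
            self.ego_lane  = int(self.prng.integers(0, self.num_start_lanes))
            self.ego_dist  = float(self.prng.uniform(0.0, self.lane_length))
            self.ego_speed = float(self.prng.uniform(0.0, self.max_speed))
        else:
            # Default / inference: the narrower initial conditions used previously.
            self.ego_lane, self.ego_dist = 0, 0.0
            self.ego_speed = 0.5 * self.max_speed
        return self._build_obs()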
10/23/22 Sun:
* Time
15:11 - 15:30 = 0:19
20:16 - 21:03 = 0:47
* Finally got one that seems to have learned: 10965 trial 10, used LR 5.80e-5, network [200, 100, 20] and output activation = tanh.
However, this also did not perform well. Accelerations are all over the place, and LC commands are as well. Several inference
runs failed to complete even half of the track.
* Added remaining code to the Graphics class in the inference program to do rudimentary display of ego vehicle driving on the track.
* Started a new training run using the DDPG algo instead of PPO (which I had been using thus far). It seemed to produce some
good results quickly, but the rewards never grew enough. Started playing with the network structures more.
* Some of the iterations are showing min rewards in the -200 to -300 range - how is this possible? Apparently, due to lots
of accumulated low-speed penalties.
10/24/22:
* time
13:34 - 15:07 = 1:33
19:57 - 22:31, int 15 = 2:19
* DDPG run from last night found some success! Run 41f25, trial 3 used actor network of [256, 128] and LR = 9.6e-6, with critic
network of [100, 16] and LR = 3.8e-4. Inference run in lane 0 stayed there with small lane chg cmds the whole way, and gradually
accelerated to max speed with no jerk penalties anywhere! Running checkpoint 500 (after which the mean reward dropped a bit).
Full run captured total episode reward of 1.54 taking 85 time steps. Another inference run started with a low speed (0.16 scaled)
and performed similarly, but, because of the low speed penalties at beginning, its total episode reward was only 0.34.
* Trial 14 from that run also performed really well in inference, using actor network of [100, 16] and LR = 6.6e-6, and
critic network of [128, 32] and LR = 9.6e-4.
* Completed testing the graphics update method. Had to modify the env wrapper to get access to both the scaled and unscaled obs.
* Added upper speed limit penalty and ran new training runs with DDPG, but not getting success. Max rewards tend to settle around 1.3
while min rewards settle around -10 to -20, with means in the -5 to -10 range. Maybe this settling is because noise gets removed
too early?
* Started another run with:
* Reduced upper speed limit penalty (0.2 at 1.2*speed limit)
* Much larger noise decay schedule (from 90k timesteps to 900k), plus random noise for 20k timesteps.
* Longer trials, up to 900 iters.
10/25/22:
* Time
19:00 - 19:58 = 0:58
21:45 - 23:22 = 1:37
* Last night's DDPG run did not succeed either. From the four best ones, I see that the actor performed best with the largest
network ([400, 100]), implying that maybe bigger would be better. Also, it seemed to prefer LR ~2e-5. The critic didn't seem
to care as much about either of these params, so it is probably good with a smaller network.
* Created a new DDPG run, notably with a much larger replay buffer. Before, the default 50k experiences was used; now I am using
1M experiences. Also adjusted some HPs a bit. Still no good.
* Run ac6fb: it seems that adding the upper speed limit penalty is causing it not to learn, so I removed that penalty and made no
other changes for this run. This produced at least 2 successful trials! It has therefore dawned on me that the problem is the
magnitude of the penalties being imposed for approaching the speed limits. They are way too big for a per-time step penalty,
considering the other penalties are O(0.001) and these are O(0.1), especially when the offending situation is not possible to
get out of in a single time step. The negative reward piles up very quickly, discouraging any learning.
* Reduced the low & high speed penalties by about 2 orders of magnitude for new run fad55.
10/26/22:
* Time 20:04 - 21:12 = 1:08
* A few runs scored decent mean rewards (~0.9) very early, then they gradually dropped as the episodes went on. However, their
early checkpoints perform pretty well.
* It is willing to accept a penalty of ~0.003 for low speed and ln chg cmd (each). But it doesn't like a large jerk
penalty at all. Seems to be willing to accept the low speed penalty in order to pick up more existence reward (0.005).
* In another run that started much faster, it was willing to accept a high speed penalty of up to 0.014 for the entire
run of 100 steps, in order to pick up 1.14 points for completing the run fast.
* Changed rewards so that
* Reduced existence reward to 0.003 (was 0.005)
* jerk penalty maxes out at 0.006 (was much higher)
* low speed penalty maxes out at 0.02 (was 0.01)
* high speed penalty at 1.2x speed limit is 0.02 (was 0.01)
* Reduced success threshold to 1.1 (was 1.2)
10/27/22:
* Time
09:07 - 09:33 = 0:26
11:53 - 12:25 = 0:32
14:29 - 15:07 = 0:38
16:29 - 16:51 = 0:22
21:00 - 22:47 = 1:47
* Got a few runs whose mean reward peak came very early, and only hit ~0.2. Common to accept lots of ongoing penalty for low
speeds, but kept very gradual & small accels.
* Reduced jerk penalty. Also removed the time step existence reward and reshaped the completion reward to drop off faster for
slow traversal.
* Fixed defect in steps_since_reset counter initialization, which was causing success reward to be less than it should. Also
set up new run with wider exploration of actor LR and noise params.
* Added condition to stop logic in case mean reward is low but max is above threshold; if the mean is near the min, then stop it.
* Turned off all lane change command penalties and ignore LC command coming into step() so it only has to learn about speed
control.
Still got no good results. This time each of the min, mean & max reward curves was almost flat, except for some noise, with
the mean centered around 0.2, which is way below what it should be. Running inference on one of the more successful trials,
I see that it is on the gas all the way, maxing out high speed penalties, which accumulated to ~2x the completion reward.
This doesn't make sense. I therefore think it is learning that max speed is the best policy because each episode begins
at a random downtrack location, so some of them are gathering close to 1.5 completion points for going very fast for only
a few time steps. Therefore, I changed the reset() method to always initialize training runs at the very beginning of
the track so it has to drive the entire track to get a completion reward.
* Debugging statements reveal that incoming accel actions, even at the very beginning of a training trial, are
highly correlated throughout the episode - mostly in the 0.8 - 1.2 m/s^2 range! But it is completing episodes
regularly very early. It quickly learns to max out accel (+3.0) and minimize number of time steps, and never gets
to explore what happens below the speed limit.
* Replaced the OU noise model with Gaussian using sigma = 1.0 m/s^2 at the beginning (gradually annealed). I confirmed that
it initially produces accels all over the place. However, it still quickly learns to push the accel to the max throughout
the episode.
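This switch is the kind of thing RLlib's exploration_config controls; a hedged example of what the DDPG setting
might look like (the scale values are illustrative):

    config["exploration_config"] = {
        "type": "GaussianNoise",
        "random_timesteps": 20000,   # purely random actions at the very start
        "stddev": 1.0,               # initial sigma, before annealing
        "initial_scale": 1.0,
        "final_scale": 0.1,
        "scale_timesteps": 900000,   # anneal the noise scale over ~900k steps
    }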
10/28/22:
* Time
19:28 - 21:41 int 15 = 1:58
* Walked through a few series of manually specified accelerations through episodes in the new environment loop tester. It
appears all the rewards make sense. Further, running a full episode at speed just below the speed limit gives a much
better episode reward than going full throttle all the way through.
* Changed the hidden layer activations from relu to tanh for both actor & critic. No improvement.
* Reduced the lower limit of the actor LR range. Started to see some hope at the low end. Need to go down into the e-7 area.
* Turned down the Gaussian noise sigma from 1.0 to 0.1. No noticeable effect on the reward plots.
* Inference on the best one of these (episode reward max ~0.8) showed that it seems to be learning to reduce its
accelerations (held close to 0.2), but it still lets speeds max out, so maybe slowing down the LR more will allow
it to discover the sweet spot.
* Reduced both actor & critic LRs even more.
* TIMING: on laptop battery, one episode of 145 iterations on 1 worker/1 env took 2:51.
1 worker/4 env took 3:01 on battery (took 3:09 on AC power)
10/29/22:
* Time
09:00 - 09:18 = 0:18
11:29 - 12:08 = 0:39
16:15 - 18:23 = 2:08
* Runs from last night with much lower LRs kept the max reward steady > 2, and even mean rewards stayed steady instead of
dropping, but still < -1, since the mins never improved. It does appear that gamma = 0.999 is important, and a more
narrow range of LRs is the sweet spot.
* More tuning with tau. Doesn't make a big difference. Nor does adding more Gaussian noise.
* Switched to TD3 algorithm, using the defaults suggested in the algo manual (they aren't provided to copy).
* Took a break to train the racecar project, which is a simplified version of this, only trying to drive straight down
a lane as fast as possible, but while respecting the speed limit.
10/30/22:
* Time
15:08 - 16:09 = 1:01
16:48 - 17:53 = 1:05
21:33 - 22:56 = 1:23
* Racecar toy taught me that it is important to keep the cumulative amount of possible penalties (over the episode) on the
same order of magnitude as the completion reward. I had had them about 2 orders of magnitude larger. I also suspect that
it will be beneficial to just let the system train a lot longer than I have, even though there hasn't been any clear
progress in a few hundred iterations.
* Applying these lessons to the cda0 project brought some quick success for the limited case of just speed control with TD3.
* Critic was [256, 32] for all trials.
* Found that actor network of [128, 16] is definitely too small.
* Best performers had actor of [256, 32]. They all learned quickly and reward curves were smooth.
* Actor network of [512, 128, 32] struggled to learn, but some trials did well. Best ones were the lowest LRs
(1.2e-6 for actor and 4.7e-6 for critic)
* I was unable to get checkpoints to load for inference engine - error about differing state dict param sets,
even though I verified the network structures were specified the same.
* Max rewards were smaller than I had hoped (but I can't see exactly what's going on due to no inference).
* New training run using DDPG and using lane change control also (just lanes 0 & 1). Changed reset() to randomize the
vehicle's initial location anywhere along the lane instead of just at the beginning (it was learning to just come to a
stop to avoid the perpetual negative rewards).
* Five of the 15 trials appear to have succeeded! Run 8c03d using critic net [256,32]. Results are in the
cda0-l01-free dir.
* Trial 0 had a long & jagged learning curve, but got there. Actor [256, 32], actor LR = 2.5e-5, critic LR = 3.1e-5,
tau = 0.005.
* Trial 2 had actor [256, 32], actor LR = 8.2e-6, critic LR = 1.6e-5, tau = 0.005.
* Trial 10 had actor [256, 32], actor LR = 7.8e-7, critic LR = 1.1e-6, tau = 0.001.
* Trial 11 had the fastest learning curve, with actor [256, 32], actor LR = 9.6e-7, critic LR = 7.7e-5,
tau = 0.005. In inference it used modestly high accel all the time and ignored the speed penalty. Rewards ~0.94.
* Trial 12 had actor [256, 32], actor LR = 4.7e-6, critic LR = 1.8e-5, tau = 0.001.
* None of the trials with an actor net of [512, 64] was even close.
* Next run includes the following mods:
* Increased success threshold from 1.0 to 1.1, which is not reachable by going full throttle all the time.
* Increased minimum required iterations to 300 to ensure we have plenty of settling time.
* Added penalty for LC cmd values near +/-0.5.
* Doubled the penalty for high speed violation (previously maxed out at 0.001).
* Results of this run (be9bc) showed that the [200, 20] actor network could achieve success, but not as easily.
Also, inference of a couple winners showed they still prefer full speed and a lower episode reward.
* Next run (23f2e) doubled the high speed penalty again.
10/31/22 Mon:
* Time
16:18 - 17:00 = 1:42
20:50 - 21:15 = 0:25
* Results of 23f2e run:
* Several trials reached episode reward between 1.0 and 1.05 quickly, and mostly stayed there.
* None reached the success threshold of 1.1.
* Best trials were 4, 7, 9, 13. Others that reached 1.0 but had some downward spikes: 1, 8, 12
* It appears that probability of success favors the [256, 32] network over the [200, 20] actor; also the larger
tau (0.005) seems more favored. Also, as expected, a ratio of actor LR / critic LR ~0.1 seems best.
* Inference shows that these models still want to use a high accel and are very reluctant to change its value.
* New run with following changes (3b68a):
* Trying larger 2-layer networks (the 200 last time wanted higher accels)
* Eliminate the jerk penalty to encourage large changes in accel.
* Double the high speed penalty (it was maxing at 0.0041).
* Use OU noise to be more realistic in generating variations in accel.
* Results: successful trials have mean rewards plateauing ~0.8, despite max rewards consistently being > 1.4.
Inference run still accelerates to max speed and stays there. Lane chg commands remain close to 0, however.
* New run with following changes
* Tuning noise magnitude
* Tuning with larger actor network (512 nodes in first layer)
* Tuning with larger choice of critic network
* Doubled the high speed penalty again to give max of 0.016
11/1/22 Tue:
* Time
04:07 - 04:47 = 0:40
11:46 - 12:20 = 0:34
15:56 - 16:32 = 0:36
19:01 - 20:03 = 1:02
* Analysis of prev run:
* No cases were successful. However, 3 of them quickly reached mean reward ~0.5 and stayed there. Then
2 of them fairly quickly (~500k steps) reached mean reward ~0.2 then slowly climbed to reach 0.8 after
7M steps. It appears they would keep going with further training.
* The two promising ones had both networks at [512, 64], noise in use (for 3M steps), and a LR ratio of
actor/critic close to 0.1 with actor LR between 2e-7 and 5e-7. In each case their min reward stayed around
-1 and max stayed at 1.48.
* On inference, they both had learned that small positive accel is the answer, and kept the LC cmd quite
small as well. They did not learn to maximize speed within the safe range, however.
* The four that got almost as high results showed a similar LR ratio, and 3/4 had the larger critic network.
* Modified reward code to randomly cancel the episode completion bonus if high speed violation occurs; probability
of cancellation is proportional to the amount of excess speed involved in that time step.
* Similar to previous run, a couple cases gradually increased mean reward (max ~0.7). These were 0, 2, 3
(for a while, then dropped off). These all had actor network of [512, 64] and critic network of [768, 80]
and similar LRs: actor ~3e-7, critic ~2e-5.
* Inference on two of them showed same pattern of sticking to very small, positive accelerations, regardless
of initial speed, and letting it run into the high speed violation.
* Modified reset() method to change initial position of the vehicle during training. It had been allowed to start
anywhere along the route, but I feel that is encouraging it to go for the big score and ignore speed limits, and
therefore, not worry much about accel. As iterations progress, the initial position will gradually be squeezed
toward the beginning of the track, forcing it to train for longer episodes. Also adjusted the probability of
cancelling the completion bonus upward (to worst case 4%).
* Most cases resulted in flat mean reward curves plateauing at ~-0.2, so no good. Three reached positive
territory, however.
* All 3 best trials used actor of [512, 64] at LR between 2e-7 and 7e-7, and critic of [768, 80] at LR
between 2e-5 and 5e-5.
* The biggest peak mean reward (trial 14) was 0.3, but it eventually tailed off to < 0 (after 5M steps).
* The worst performing of the 3 (trial 12) peaked ~0.1, then quickly dropped to < 0 after 2.5M steps.
* Inference runs show that these didn't perform any better than the previous training run. They learned
to keep accel small, but have no idea that speeding up to the speed limit is advantageous, or that
slowing down if above it is good.
* I confirmed that actions coming into the step() method tend to cover the full range of possible values,
at least early in a training run.
* Modified reward code to add an accel bonus if recent (4 steps) avg speed > speed limit and avg accel over that
period is < 0, and vice versa for speeds below the speed limit (except for a deadband). Bonus increases with
larger speed difference from the limit and with larger acceleration magnitude.
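A rough sketch of that bonus logic (names, deadband, and gain are assumed, not the actual values):

    DEADBAND = 2.0   # m/s below the speed limit where no bonus applies (assumed)

    def accel_bonus(avg_speed, avg_accel, speed_limit, gain=0.01):
        err = avg_speed - speed_limit
        if err > 0.0 and avg_accel < 0.0:         # too fast and slowing down
            return gain * err * abs(avg_accel)
        if err < -DEADBAND and avg_accel > 0.0:   # too slow and speeding up
            return gain * (-err) * abs(avg_accel)
        return 0.0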
11/2/22:
* Time
10:37 - 10:59 = 0:22
12:51 - 13:38 = 0:47
19:51 - 20:32 = 0:41
* Analyzed run from last night:
* Pretty similar performance as before - 3 runs reached mean reward > 0, maxing ~0.3.
* Inference still shows a desire for accels very close to 0 and a slow change to it. However, one run
demonstrated a slight response to the high speed penalty, where speed got into that zone, then accels
turned negative and returned it below speed limit. It took many time steps, however.
* Modified reward code to double the probability of eliminating the completion reward if high speed, and
doubled the accel bonus value.
* Running inference on some very early checkpoints (2-20 iterations) shows that already the accels and
LC cmds are very small. This makes me wonder if there is a scaling problem.
* Adding print statement during training run shows that scaling is not a problem; all calcs seem proper.
It is just learning very quickly (in first 20 iters) to keep accels close to zero. I now suspect
that this may be due to too much smoothing by training in large batches. Also the small time step
may be having some smoothing impact.
* Next set of mods:
* Tuning with much smaller batch sizes (down to 8)
* Simplified accel bonus calcs to just be based on current time step, not history.
* Analyzed above run (0c04c):
* All cases had a mean reward > 0, but none of them exceeded 0.8 despite each one having a max > 1.3.
* Runs 0 & 1 peaked quickly then dropped rapidly. Batch sizes were 1024 and 128, respectively.
* Run 2 took the longest to peak (6M+ steps) but also had the lowest max (as low as 1.25 at 7M). Its
batch size = 128. LR among the smallest & largest at 1.2e-7 for actor and 9.3e-5 for critic.
* Run 3 also peaked fairly quickly and dropped off a lot. Batch = 128.
* Run 4 was a slow mover but peaked nicely. Batch = 16, actor LR 1.7e-7, one of the lowest.
* Run 7 was lowest mean peak (but highest max), and dropped away very quickly. The only batch = 8.
* Runs 2, 5, 4, 13 had the lowest actor LRs, and their critic LRs were widely different. They all
showed gradual peak then tail-off of mean, plus max started high (~1.55), dipped, then climbed again
until its end. The dip was lower for those whose means took longer to peak. Once means tail off,
the max climbs again. These had batch = 128, 16, 1024, 1024. It seems they may have continued to
improve with more time.
* Inference performed similar to previous runs - accel very small and slow to change, but they are
starting to see the correct directions to move.
* It doesn't appear that small batch size has a noticeable effect. Best bet seems to be LR ~1e-7 for
actor and 5e-5 to 9e-5 for critic, then let it run a lot longer.
* New run:
* Using new StopLong class for the stopper, which pretty much lets it run to max iterations unless
the max reward is a failure. Also extended max iterations to 2000.
* Magnified the noise.
* Tightened the LR ranges, and moved the actor lower and critic higher.
* Doubled reward bonus for correct accel action.
11/3/22:
* Time
10:31 - 10:59 = 0:28
18:00 - 19:17 = 1:17
21:13 - ?
* Analysis of last night's run
* Some runs got peak mean reward of close to 0.7, similar to previous. Most had max rewards >= 2.
* Cases 1 and 7 started very slowly, then gradually increased mean reward; the only two that didn't
drop off within the 2000 iterations. Peak value of mean was ~0.4. Also, their max values were on
the lower side (~1), then began to climb towards the end. They are ripe for additional training.
These both had batch = 8 and 2 of the lowest actor LRs (7e-8 and 9e-8), with the same critic LR
of 9.8e-5.
* Case 6 is also interesting, as possibly the best performing of the others, with peak rewards coming at
~8M steps.
* Inference results are similar to previous, however. Not satisfying.
* Mods for a new run:
* Check for even smaller LRs for actor, larger for critic. Also, throw in a couple really big ones.
* Try batch size = 4.
***** Can config params get passed into the env object for each run? Yes - the configs are passed in to
the init method, and it is called by each worker at the beginning of an iteration. Any values
will be held constant throughout all episodes of that iteration (just like extrinsic items are,
such as LR). This could be a means to schedule gradual changes in individual reward penalties or
bonuses, or even in environment dynamics, such as taking off training wheels (removing limits on
action outputs). It could also be used to randomize some env constants to effectively provide
data augmentation (e.g. change friction coefficients, control response times, control biases, etc).
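A minimal sketch of the mechanism (class and key names are assumptions): Ray passes the env_config dict into the
env constructor on each worker, so any value written there by the tuning script is visible to every episode of
that iteration.

    import gym
    from ray.rllib.env.env_context import EnvContext

    class HighwayEnv(gym.Env):                      # stand-in name for the real env class
        def __init__(self, config: EnvContext):
            # Values scheduled by the tuning script show up here.
            self.high_speed_penalty = config.get("high_speed_penalty", 0.016)
            self.accel_limit        = config.get("accel_limit", 3.0)   # "training wheels"
            self.training           = config.get("training", False)

    # Tuning-script side: these land in the EnvContext above.
    config["env_config"] = {"high_speed_penalty": 0.016, "training": True}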
* Analysis of above run (e9151):
* Case 2 looks maybe promising in late time steps - long, steady mean reward, probably the highest
mean and max reward near the 12.5M step mark. Actor LR = 1e-5, critic LR = 9e-5, batch = 8.
* Case 3 very similar to 2 but much less erratic reward plots. Actor LR = 4e-8, critic LR = 9e-5,
batch = 16.
* Case 5 no good but had an interesting, huge spike in both mean and max rewards at ~2M steps, then
terminated soon after. Is this where noise ended?
* None of the 3 cases with batch = 4 was able to hold a mean reward > 0. At least one had what I
believe is a good LR ratio.
* Inference runs (of case 2) show that it is definitely reacting to the high and low speed penalties
and/or accel rewards, but its reaction time is many seconds, not one or two time steps as I would
** like to see. Is it possible that the unused historical accel values in the obs vector are
contributing to that? Hard to believe, but maybe. No - experiment shows it is not.
** * Other things to try: play with size of completion reward (even make it zero at all times) or
at least the ratio of completion reward to penalty magnitudes (maybe they all need to be closer to 1?).
Try PPO or another algo to get more dynamic action response to changing conditions.
Try training with a slow-down zone in the road to force larger accels.
Took a few days off and played with the simpler Racecar project to get back to basics. This is in the
projects/cda0_copy dir. An even simpler variant, Car, is in the projects/copy2 dir, and in Github under
repo name simple-car. Used PPO as the training algo.
LESSONS LEARNED from simple-car:
* I was able to train a car to drive a straight lane without traffic, as fast as possible while obeying the
speed limit.
* Keeping time step rewards large enough to present noticeable impacts on derivatives is important (values
in O(1e-3) or O(1e-4) were not doing it). Final time step's completion reward was O(10) and incremental
values were O(0.01) for 150 time step episodes.
* Shaping all rewards/penalties as smoothly differentiable seems to be important. Using a quadratic function
of distance from the target value is better than piecewise linear (see the sketch after this list).
* Start with really small NN structure. I was getting good results with a [16, 4] FC network.
* HP tuning takes a LOT of time. Most of them don't have much impact, but LR is very sensitive.
* May need to get several trials of ~same LR before finding one that works with initial weights distro.
* May need to let training run a long time so it can gradually converge after several chaotic dips in
performance.
* Important to get a good balance of noise, which probably needs to be extended throughout most of the
trial.
* Be sure noise is explicitly turned off during inference runs! Using Ray algorithms may bring in unseen
config settings that turn it on by default (e.g. PPO's "explore" flag).
* Ray makes it very difficult to continue training from a Tuner checkpoint. For insights, see
* https://discuss.ray.io/t/save-and-reuse-checkpoints-in-ray-2-0-version/8169
* https://discuss.ray.io/t/correct-way-of-using-tuner-restore/8247
* https://discuss.ray.io/t/retraining-a-loaded-checkpoint-using-tuner-fit-with-different-config/7994/7
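An illustration of the quadratic-shaping lesson above (function name and constants are mine, not the Car/cda0 code):

    def speed_penalty(speed, target, max_penalty=0.02, halfwidth=5.0):
        # Quadratic in the distance from the target speed: smooth and differentiable
        # everywhere, unlike a piecewise-linear ramp, and capped at max_penalty.
        err = (speed - target) / halfwidth
        return min(max_penalty, max_penalty * err * err)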
11/19/22:
* Time recorded directly on dashboard
* Merged the reward and tuning code from the simple Car code into the cda0 code and ran a tuning run there in
attempt to duplicate the success of training a drive in a straight lane with no other traffic around. The
only difference is that the cda0 environment now involves all the other observation elements and a 2D action
vector (although the 2nd element is not yet used).
* Results: two of the first 4 trials achieved mean rewards > 9. Running them in inference (with no
noise) starting in either lane 0 or 1 showed good performance. However, it seemed to be a little
too afraid of jerk, and happier to go slowly and take a lower completion reward.
* The one trial that had random_start_dist turned off never went anywhere, so it seems this is essential.
* Trying a new run with the low speed penalty turned on and a little smaller jerk penalty.
* This was still really tame in terms of acceleration & jerk, so accepted some slower-than-desired
solutions, with very slow, smooth accels.
* New run with a lighter jerk penalty again (another 10x reduction), and with 2.5x more low-speed penalty.
* Trial 00006 performed beautifully! Smart accel at the beginning, and smoothly leveled off when speed
approached posted limit, then stayed there. Exactly what I wanted. This is saved as trial
PPO_SimpleHighwayRampWrapper_1656e_00006 under ray_results/cda0-l01-free, and against code committed
on 11/20, commit fe894d0.
*****
* Straight lane performance is now complete. Time to move on.
11/20/22:
* Time recorded in dashboard
* Ran the cda0 tuning code with all 3 lanes as starting options, meaning in lane 2 the agent has to learn to change
lanes in order to finish. Results are now being recorded in ray_results/cda0.
* Results were poor. None of the 15 trials got a mean reward > -8 or so, although the max reward was
consistently ~ +16.
* One trial aborted due to a Ray error, but was looking like it could possibly break out to higher ground
than the others. Its LR = 4.5e-5.
12/4/22:
* Time recorded in dashboard
* Narrowed the LR range a bit and increased the noise magnitude (stddev) from 0.3 to 0.5 (Gaussian).
12/5/22:
* Results from yesterday's run:
* Nothing got above mean reward of 0, but one trial came close, peaking twice around -8 before falling off.
Running inference on this one's peak checkpoint...
* Starting in lane 0, it ran smoothly, but at a nearly constant, small positive accel, thus hitting top
speed eventually, and taking a penalty of 0.4 per time step. Total score = 10.4, but it could have done
a lot better.
* Starting in lane 1, it immediately changed to lane 0, taking a 0.05 penalty for that (trivial), then
stayed there decisively. Accel performance was similar to lane 0.
* Starting in lane 2, it tried to make an illegal lane change in first time step, so crashed. This is
repeatable 5 times, but with slightly varying action outputs.
* Mods for next run:
* reset() printing lane selection to verify that it is training in lane 2. Verified it is choosing all 3
lanes randomly, so turned this off again to avoid clutter.
* Tuning with choice of random seeds, based on a recent article I read.
* Increased max LR quite a bit, since rewards tend to be small (O(1)), and therefore gradient is not large.
* Increased noise slightly, from 0.5 to 0.6 magnitude of sigma.
12/7/22:
* Results of prev run:
* Trial 0 had LR = 7.99e-6; it drives lane 2 all the way, but runs off the end rather than change lanes.
* Trial 7 had LR = 1.62e-5; it makes an immediate lane chg in lane 2, so goes off road.
* Trial 10 had LR = 1.9e-3
* I verified that both the raw & scaled obs vectors are correct.
* Accel is stubbornly small, positive throughout all runs, regardless of speed & reward.
* Ideas to try (from reviewing above history):
* Reduce size of NN
* Play with LR schedule
* Turn off random start distance during training (force it to complete full route every time)
* Train in only lane 2 to see if it can at least learn that one
* Turn off jerk penalty
* Play with noise magnitude
* Mods for next run:
* Forcing lane ID to be 2 always (in reset())
* Commented out jerk penalty
* Results:
* 4 trials had mean reward peaks very close to 0 (> -10), but dropped off fast afterward. These
had LR of 1.4e-4, 1.4e-4 and 5.0e-5.
* All had mean reward that stayed at -50 for a long time, then several started climbing fast
around 500k to 600k steps after max reward suddenly jumped from -50 to +10.
* No min reward ever exceeded -50.
* Running trial 3 inference made it to the end! It had several lane changes, which is fine.
* Accelerations were somewhat sporadic, but much more aggressive at times.
* Found a defect in geometry code that repositioned the vehicle backwards one step after making
the lane change.
* Mods for next run:
* Fixed distance calc during lane change. I can't believe this had a significant impact on training,
although it did make the distance signal discontinuous, which could have had some effect.
* Added tuning options for NN size (see the sketch at the end of this entry).
* Trial 3 performed well, with a mean reward hovering close to +10 for many iterations before finally
tanking. Inference runs on it, in lane 2, performed ideally, holding the speed limit and changing lane
toward the end of lane 2. When started in lane 1, it immediately tried to change lanes left too much.
This one had a NN structure of [64, 24] and LR = 7.8e-5.
* None of the others achieved a promising mean reward, so I don't feel I'm really finding the sweet spot.
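The NN-size tuning option mentioned above amounts to something like this (a sketch using Ray Tune search-space
helpers; the candidate sizes and LR range are illustrative):

    from ray import tune

    config["model"] = {
        "fcnet_hiddens": tune.choice([[64, 24], [128, 50], [64, 48, 8]]),
    }
    config["lr"] = tune.loguniform(1e-5, 1e-3)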
12/11/22:
* Mods for next run, which is still only training on lane 2 start:
* Fixed the NN structure at [64, 24] and tuned for noise magnitude (still decaying to 0.04 at 1M steps).
* Two very promising trials were terminated prematurely due to stop criteria taking slopes over way too
many iterations. I need to reduce this, and also print out more details when it decides to stop.
* Three of the first 11 trials peaked between 0 and +10 (mean reward); not great performance, but decent.
Inference on two of these shows good performance. Their LRs were 7.5e-5, 1.1e-4, 9.5e-4, and they
used noise magnitude (stddev) of 0.61, 0.25, 0.17, respectively. So I believe I have this lane change
nailed.
* Mods for next run:
* Reduced the avg_over_latest param from 300 to 60 iterations, because I believe this is iters, not
steps. I added better output on stop condition to help confirm this.
* Zeroing in tuning to LRs in the range of success above, as well as noise magnitudes in the lower end
of that range.
* Reinstate all 3 lanes as candidate initial conditions (mod to reset()).
12/12/22:
* Trials 6 & 8 peaked at mean reward ~5. Inference run shows it is pretty good, but willing to accept a lot
of speed penalty, with only very tiny adjustments to accel. They both performed extraneous lane change
to lane 0 if starting in 1 or 2, which cost an extra ~0.2 penalty point.
* Jerk penalty has been turned off for some time, so it is learning smooth acceleration from other means.
* Next run:
* Increased size of NN from [64, 24] to [64, 40]
* Turned off randomized start distance during training (forcing all starts to be at beginning of lane).
* The first 6 trials here were lousy (one peaked at 0 and one peaked at -5), indicating that maybe
randomized start distance is still needed, so I aborted the run.
* Next run:
* Increased num workers from 8 to 12, keeping the rollout fragment length = 200, so train batch size
increased to 2400 (see the config sketch after this list). Hoping things will move faster now.
* Turned the randomized start distance back on.
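In RLlib config terms, the worker change above is roughly (a sketch; only these keys are implied by the note):

    config.update({
        "num_workers": 12,
        "rollout_fragment_length": 200,
        "train_batch_size": 12 * 200,   # = 2400 steps per training batch
    })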
12/13/22:
* It doesn't appear that the randomized start distance had any particular effect, which was expected,
since it had already learned to travel the full length of the course.
* A couple of the trials peaked above 0, and one got to +10 for mean reward. However, inference on it
showed pretty loose speed control. It had no problem reaching way into the penalty areas for extended
periods. Although it did seem to sense it didn't want to be there, the jerk went negative as soon as
it entered the high speed area, but it was so small that the accel took a long time to reverse the
incursion. With jerk penalty off, I don't understand why it won't change accels more quickly.
* Next run:
* Based on successful straight-lane results on 11/19, with the more aggressive jerk performance, I am
going back to that [64, 48, 8] NN structure.
12/14/22:
* This run was worse than the previous. Only one trial reached above 0 mean reward, but stayed < 5.
It had LR = 8.8e-5 and noise stddev = 0.28.
* Like pretty much all trials to date, good or bad, the mean reward tends to climb to some peak,
usually around 400k to 600k steps, then it falls off, sometimes dramatically. I feel like LR
annealing is going to be important to get past this problem and allow things to keep learning.
* Next run:
* Figured out how to add LR annealing to the PPO params, so did that, going from 2e-4 to 2e-6 over
the first 800k steps. Did not try to make this a tuned param at this time.
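The LR annealing described above maps onto RLlib's lr_schedule parameter; a sketch with the breakpoints named in
the note (RLlib interpolates linearly between entries):

    config["lr_schedule"] = [
        [0,      2.0e-4],
        [800000, 2.0e-6],
    ]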
12/15/22:
* The chosen schedule did not perform well at all. 3 trials peaked just below 0, but all exhibited
some amount of pull-back after a fairly short peak. Trials that tended to stay the most flat
after peaking were 2, 4, 5, 7, 11, 13. There is no correlation here with noise magnitude, as these
span the full range allowed, as do the trials not listed here.
* Next run:
* Changed the LR schedule a bit, generally moving it to lower values (about 2x change), with an
extra breakpoint.
* AI: I cannot see the exact LR being used in the log for any given time step. It would be good to
add a print statement into the PPO code to make that visible.
12/16/22:
* Three trials peaked above 0 but below +10. Inference on 2 of them showed close to constant,
small accel throughout, but got decent rewards in lanes 0 & 1 (~6-9); but when starting in lane 2
it continually either slowed to a stop or drove off the end of the lane. No good! The third trial
performed about the same in lanes 0 & 1, but in lane 2 it always did an illegal lane change in
the first time step. Clearly, nobody has learned how to drive lane 2 here. This explains the
persistent plots of -50 to -80 in the min rewards arena. It never exceeds -50. Two of these
trials had large noise magnitude (0.48 and 0.46), seeming to indicate that larger noise is good.
* Next run:
* It appears that lane 2 performance was good when that was the only thing trained, but when the
other lanes were thrown in it never learned well. Possible solutions:
* Force the training to select lane 2 more often than a totally random choice.
* More iterations
* Larger NN to accommodate the additional stuff it needs to learn.
* Use more noise
* I will try to make the NN wider instead of deep. Changing from [64, 48, 8] to [128, 50].
* Increased noise magnitude to the range of [0.4, 0.6] and stretched out its schedule to fully
decay to 0.04 at 1.6M steps (which is typically where the latest runs have been ending).
* Modified reset() to train on lane 2 50% of the time and lanes 0 and 1 25% each (see the sketch after this list).
* Added stop condition in the StopLogic class to terminate if the mean reward degrades at least
x (currently using 25%) below its peak in a given trial.
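The lane-selection weighting in reset() could be as simple as this (a sketch; the actual code may differ):

    import numpy as np

    # 50% lane 2, 25% each for lanes 0 and 1 during training.
    lane = int(np.random.choice([0, 1, 2], p=[0.25, 0.25, 0.50]))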
12/17/22:
* The 6 trials complete so far are looking like better trends, steadily climbing toward zero,
but some of them are being cut off prematurely by a defective stopper, so I aborted.
* Next run:
* Fixed the stopper defect (in the new code I added yesterday).
* Adjusted the LR schedule so it doesn't drop off so fast, based on where the reward curve was
starting to flatten out.
* Adjusted the long-term noise to be a bit higher, and increased the initial range somewhat.
* Trials 2 and 8 peaked between +8 and +10 on mean rewards, and performed really well in
inference on all lanes. Noise on these was 0.65 and 0.41, respectively. Accels were a
little jerky at times, but there is no jerk penalty, so it's to be expected. Agent tends to
like the lower speeds a bit, so it would be worth increasing that penalty a little and
narrowing its deadband as well, but not a major problem. Also, the accels were more
aggressive than I've seen in recent past, so I guess the extra neurons made that possible.
If that's the case, then maybe a few more still could be more useful.
* This is run 53a0c, and I am leaving it stored in ~/ray_results/cda0-solo.
** * AI: tweak the completion reward to drop off a little faster with the number of time steps.
There is hardly a noticeable difference between 130 and 170 time steps, so not much motivation
to speed it up.
**** * I believe I have found what I'm looking for in the solo vehicle department. Time to move on
and build a version that can handle other traffic on the roadway.
** * Lesson: if the NN feels too small, try to add width before adding depth.
>>>>>
12/18/22:
DRIVING IN TRAFFIC
The code is essentially already in place to start running 3 neighbor vehicles on the track along with
the ego vehicle (AI agent). I made a few small changes, noted below, to turn that code on. It will
run 3 vehicles at constant speed in lane 1. That speed can be varied for each trial by Tune, as can
their starting downtrack distance. However, for the first run, I am leaving them constant. These
vehicles will always all drive the same speed, so they won't crash into each other. They will remain
2 vehicle lengths apart, which leaves a 1-vehicle bumper-to-bumper gap between them, not enough for
the ego vehicle to slide into without registering a crash. Hopefully, this will force the ego vehicle
to either speed up to get in front or slow down to get behind if it is trying to change lanes while
they are in the way. For now the ego vehicle will be started randomly, as before, in any lane and
at any location and speed. Therefore, it will often never see a conflict with the neighbors.
* Changed completion reward from parabolic to linear, and made it degrade to 0 sooner (300
steps vice the previous 600 steps).
* Changed penalties for failed episodes to give a little less weight to those that ended in
an off-road (-40) or stopping in the road (-30), while a crash with another vehicle is still
worth -50 points.
* Modified reset() to give the neighbor vehicles a non-zero starting speed and location, which
can be configured.
* First run, set neighbor speed constant at 29.1 m/s and neighbor location constant at 320 m, which
gives n3 the same travel distance to the merge point as a vehicle would have if starting at the
beginning of lane 2. This feels like reasonably good chances of forcing a merge crash situation.
I am staying with the same NN structure and other tuning params that were successful yesterday in the
solo vehicle training.
* First several trials did decently, peaking between -20 and -10.
* Five of the first trials died with an error.
* During this run I enhanced the inference program to display the neighbor vehicles as well.
It became obvious through this that they form a tiny target, so training episodes will
probably normally miss them altogether, thus the agent won't have much opportunity to learn
anything about deconfliction.
* Next run:
* Made the vehicles much longer (40 m, which is about 2x the length of a semi), and started
them farther apart, so that they will present a much larger barrier to lane change from
lane 2.
* I adjusted the long end of the noise schedule so that it doesn't die off so soon (now goes
to 0.1 at 2M steps).
* All trials progressed well, and all leveled out without any big drops. But their plateau
was around -18 to -8, so no successes. Some inference runs with the new graphics show that
40 m is too long for the vehicles in this scenario, as the 3 neighbors can completely
block the merge area.
* Next run:
* Reduced vehicle length to 20 m. Also reduced the neighbor initial spacing in reset()
from 4 lengths to 3. At this length & spacing they can block about half of the merge area.
* Realized that the randomized start distance is being limited on a schedule that expires
after 400k episodes. This is going to increase more slowly than the time step count, but
I increased the limit to 800k episodes to see what will happen.
* Fixed a defect in the crash detection logic.
12/20/22:
* Results here are similar to the previous run, with all trials ending in the -20 to -10
range.
* Next run:
* Increased NN size to [256, 64]
12/21/22:
* No improvement in performance. Most trials ended with mean reward between -20 and -10.
Two of them peaked around -7.
* Next run:
* Changed reset() calc of max_distance for randomized start distance. I confirmed that it
is based on episode counts, which seldom exceed 3000, but it was stretching it out over
800k episodes. I pulled that back in to schedule the reduction over 2000 episodes, so
that my later episodes will be forced to run virtually the whole track.
* Reduced the initial LR somewhat.
* Trying some 3-layer NN structures as a tuning variable.
* One trial reached -22 for mean reward, but the others slowly crept upwards in the low
-30s before being terminated after ~1.2M steps due to the max reward falling to -30. It
would be interesting to see one of these left to run for several million steps since it
is improving.
* I noticed that these trials are running beyond 8000 episodes, so the max_distance param
is getting shortened too fast.
* Next run:
* Changed the max_distance schedule to reduce over 8000 episodes.
12/22/22:
* A few trials did the typical plateau between -20 and -10. One peaked slightly above -10
but couldn't hold it. Several stayed in the -35 region the whole way, with max rewards
rapidly dropping down to -30. These tended to be the 3-layer models.
* Next run:
* Made a few adjustments to the stop logic to help struggling borderline cases continue.
* Changed tuner to select from larger 2-layer models.
* AI: I think what is really needed is some curriculum training where the model first gets
trained to navigate the track without traffic, then introduce traffic. I haven't yet
figured out how to save a checkpoint from a tuning run, which would be maybe needed to
do that.
12/23/22:
* Process died during trial 5. Of the ones complete, only one peaked above -15 and one
reached above -20.
* Next run:
* Added some rudimentary curriculum learning by using a new environment config param
to specify at what point neighbor vehicles will start being used (time step #). Prior
to that episode, they will stay in their initial positions like before I started using
them.
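A hedged sketch of that curriculum switch inside the env (names are assumptions):

    def _neighbors_active(self) -> bool:
        # Neighbor vehicles hold their initial positions until the configured
        # global time step count is reached; after that they drive normally.
        start_step = self.config.get("neighbor_start_step", 0)
        return self.total_steps_all_episodes >= start_step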
12/26/22:
* The best trial plateaued between -15 and -10. In inference, its best checkpoint showed
a LOT of lane changing.
* Next run
* Fixed a defect in reset() that was not turning on the neighbor vehicles for phase 2 of
the curriculum learning, so the agent never experienced neighbors in the previous run.
* Improved curriculum training capability by allowing definition of multiple phases in
the StopLogic class.
12/29/22:
* Didn't get any notably different results.
* Next run (96415)
* Added an arg to StopLogic to let a trial run to max iterations unless it is a winner.
Ran two trials like this.
* Early max rewards took a smooth slope downward from 10 to ~4, as before, but between
  3M and 4M steps it suddenly spiked up to +10 and stayed near there. Also, around that
  point the mean reward started getting less smooth. In one of the trials it had long
  bursts up above -5. These ran to 1800 iterations (~5.5M steps).
* Inference on the best checkpoint (training mean reward ~0) showed decent performance
in the straight lanes, but it kept doing illegal lane changes early in lane 2.
* Next run (a0c3c)
* Fixed a minor defect in reset() where it printed some statuses after they were cleared.
* Extended the max iterations from 1800 to 2400, since it still looked like there was
some progress being made at that point.
12/30/22:
* One of these two runs performed similarly to the previous "good" one, in that it peaked
  at mean reward = 0, but its fluctuations looked like it could have benefitted from
  more iterations.
* In the log file I notice that each trial stopped after a few hundred thousand time steps
  (400k and 580k, respectively), due to reaching the iteration limit. This never triggered
  the neighbor vehicles to turn on! Therefore, all of this training was for solo driving. I
  suspect that having 12 jobs running in parallel caused this problem: Ray sums the time
  steps from all workers to get the ~7M shown on the plot, but each worker only contributes
  about 1/12 of that, while the threshold to turn on the neighbors assumes each env object
  goes all the way to 7M time steps (see the arithmetic sketch below).
** * AI - if all I've been training is solo driving, why is it so hard to get good rewards?
Need to compare to successful solo training for HPs.
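* The arithmetic behind that mismatch, roughly (numbers are illustrative):
      # Ray reports the sum of steps across all rollout workers, but each env instance
      # only compares its own local counter against the curriculum threshold.
      num_workers = 12
      reported_total_steps = 7_000_000                          # what the plot shows
      steps_seen_per_env = reported_total_steps // num_workers  # ~583k per env
      neighbor_start_step = 1_200_000                           # env-local threshold
      print(steps_seen_per_env >= neighbor_start_step)          # False -> neighbors never start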
* Next run (2784c)
* Changed tuning program to only use 1 worker (was 12).
* Extended iteration limit from 2400 to 3000.
* Now that all time steps are happening on 1 worker, it is transitioning to using
  neighbor vehicles as expected, beginning at 1.2M steps.
* In both trials the max reward took a huge step down at 1.2M steps (to around -30); in
one of them it quickly recovered to around -5, but in the other it stayed at -30 for
the remainder.
* My num crash tracker was being reinitialized incorrectly, but there is reason to believe
  that no crashes have been detected at all, which is bothersome.
** * It really bugs me that a trial progresses at virtually the same speed whether it is
using 12 workers or 1. Each worker is assigned 1 cpu, 1 env and 0 gpu. The eval
worker has 1 cpu, 2 env and 1 gpu (the full enchilada). I need to spend time
playing with various combinations to understand how to improve performance.
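* For reference, the resource split described above corresponds roughly to this kind of
  RLlib config (dict-style keys as of Ray ~2.x); the values come from the note above, and
  everything else is an assumption. The evaluation worker (1 cpu, 2 envs, 1 gpu) is set
  up separately through the evaluation_* settings.
      resource_config = {
          "num_workers": 12,           # rollout workers (reduced to 1 in the 2784c run)
          "num_cpus_per_worker": 1,
          "num_envs_per_worker": 1,
          "num_gpus_per_worker": 0,
          "num_gpus": 1,               # GPU assigned to the local (learner) worker
      }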
* Next run
* Increased terminal LR (1e-6) to apply at 7M steps instead of 3M to be a little more
like the solo vehicle success on 12/17.
* Added a new tuning option to use a NN of [512, 64] as one of the options.
* Changed the final noise magnitude from 0.2 to 0.1 (still occurring at 4M steps) to
  be more like the solo vehicle success.
* The first litmus test needs to be that the rewards look acceptable at the 1.2M step
mark, indicating that it has learned to drive solo before adding the neighbors.
* Enhanced StopLogic to use a let_it_run flag for each phase, so that we don't waste
time if a trial can't achieve good solo driving first. I now have 3 curriculum
phases:
0 = 1M steps to learn solo driving without aborting
1 = 200k steps to allow stop logic to evaluate rewards and abort while
still driving solo
2 = xM more steps with neighbors in motion to learn driving in traffic.
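* A minimal sketch of how those phases and the per-phase let_it_run flag might be
  expressed, assuming StopLogic acts like a Ray Tune stopper; the boundaries and all
  names except let_it_run are placeholders.
      # Phase table: end_step marks where each phase ends (cumulative time steps).
      phases = [
          {"end_step": 1_000_000, "let_it_run": True},   # 0: learn solo driving, no aborts
          {"end_step": 1_200_000, "let_it_run": False},  # 1: still solo, aborts allowed
          {"end_step": 4_000_000, "let_it_run": False},  # 2: neighbors in motion
      ]

      def should_stop(total_steps: int, mean_reward: float, threshold: float = -15.0) -> bool:
          phase = next((p for p in phases if total_steps < p["end_step"]), phases[-1])
          if phase["let_it_run"]:
              return False                    # never abort during this phase
          return mean_reward < threshold      # otherwise apply the normal reward test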
* Moved from 1 gpu on the local worker and 0 on rollout workers to 0.25 on the local
  worker and 0.5 on the rollout worker to see if it changes overall trial time
  (current pace is close to 1M steps/hr). I immediately found this doesn't work, as
  Ray hung before any trials ever started, so I moved these configs back to the way
  they were before.
* Realized a defect in the phase management design: the min timesteps value is doing
  double duty as the phase boundary, so the step count within a phase can never exceed
  the current phase's min timesteps, and therefore an early stop is never triggered.
* Next run (10 trials, ID 3db63)
* Fixed the StopLogic defect by adding a phase_end_steps input to define the phase
  boundaries separately from the definition of min timesteps in each phase (sketched below).
* StopLogic had a section that multiplied the min timesteps by 1.2 if the max reward rose
  above the success threshold, which pushed it up to the phase 0 boundary. I removed this
  logic and increased the phase boundary a bit in case I want to bring that logic back.
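* A sketch of the separation that phase_end_steps provides; aside from that name, the
  identifiers and values are illustrative.
      phase_end_steps = [1_000_000, 1_200_000, 4_000_000]  # where each phase ends
      min_timesteps   = [  400_000,   100_000, 1_000_000]  # steps into a phase before a stop is allowed

      def stop_allowed(total_steps: int) -> bool:
          # Find the current phase from the boundaries, independent of min_timesteps.
          i = next((j for j, end in enumerate(phase_end_steps) if total_steps < end),
                   len(phase_end_steps) - 1)
          phase_start = 0 if i == 0 else phase_end_steps[i - 1]
          return (total_steps - phase_start) >= min_timesteps[i]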
1/1/23:
* All trials failed badly, with the max reward going steadily down from 10 to -2 at about
  600k steps and then staying there (70% did this; the others had bigger drops). The max
  starting distance gradually drops over the first 800k steps, which explains most of this
  behavior. Mean rewards were scattered between -40 and -28, but generally climbed well
  until 300k steps; some continued climbing (or dropped then came back) until 700k steps.
  Mins stayed clustered around -55. All trials stopped at 1M steps.
* Next run
* Added a tuning choice for NN size of [128, 50], which was used on 12/17 for successful
solo driving.
* Changed noise schedule to end (magnitude 0.1) at 1.6M instead of 4M, which is what gave
success for solo driving.
* Enhanced the reset() max_distance calc to allow an initial period with the full track
length before ramping it down. Initially set it at 200k steps before ramping begins.
* One trial had a max reward that stayed above 0, and continued past 1.2M steps, then
suddenly tanked. So it never reached phase 1, which begins at 1.3M. This trial used
a NN of [512, 64].
* All other trials stopped at 800k steps, showing mean reward growth through 200k then
gradually decreasing; max rewards stayed at 10 until 200k then gradually headed to
negative. Peak mean reward was as high as -20.
* Next run
* Changed randomized start distance calc so that it doesn't begin to ramp down until
700k steps, then takes until 1M steps to completely disappear. Hoping this will give
the reward enough time to become positive before the situations become more difficult.
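* A sketch of that revised schedule; the function name and track length are placeholders.
      # Hold the full track available until hold_until steps, then ramp the max start
      # distance linearly to zero by ramp_done steps (illustrative only).
      def max_start_distance(total_steps: int,
                             track_length: float = 2000.0,
                             hold_until: int = 700_000,
                             ramp_done: int = 1_000_000) -> float:
          if total_steps <= hold_until:
              return track_length
          frac = min((total_steps - hold_until) / (ramp_done - hold_until), 1.0)
          return (1.0 - frac) * track_length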
1/2/23:
* All trials stopped at 800k steps. Their mean rewards were climbing until 700k, then
headed down, while max rewards stayed at 10 until that point, then went down very
quickly. There were two groups, with the first group achieving a peak between -20 and
-10 (mean), and the second group running distinctly lower and peaking just below -20.
In the first group, the best 3 trials were all [128, 50] networks and had noise
magnitude between 0.48 and 0.65. The second group all had noise magnitude > 0.70.
* Next run (c8a85, 14 trials)
* Giving more chance of choosing a [128, 50] NN.
* Redefining phase 0 to be just random starting point, and extending it to 1M steps.
Then phase 1 will be gradually ramping down the starting distance, to 1.6M steps and
the phase will end at 1.7M steps. Then phase 2 will be neighbor vehicles for remainder
of 4M steps.
* In the first 3 trials (all [128, 50]) there is some indication of a major step change
  at 1.6M steps, causing the trial to go very badly and terminate at 1.7M.
* Had to kill this run in the middle of trial 5 due to shutting down for vacation. Results
  are still available; just run the tensorboard server again.
** * AI: considerations for next runs:
* the reward curve gradually flattens out as it progresses. I wonder if this is
  due to the LR reduction. Maybe leave the LR higher for longer to see if it helps.
  The LR tapers from 1e-4 to 1e-5 over the first 800k steps, where the reward slope
  is pretty steep. Then it stays at 1e-5 between 800k and 1.6M, where the reward
  slope is nearly flat and even goes a little negative. Then the LR drops again
  to 1e-6 over the next several million steps (current schedule sketched below).
* consider extending the noise out longer
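* The current LR schedule, written out as the piecewise-linear [timestep, value] pairs
  that several RLlib algorithms accept for lr_schedule; the 7M breakpoint is approximate.
      lr_schedule = [
          [        0, 1.0e-4],
          [  800_000, 1.0e-5],   # fast taper while the reward slope is still steep
          [1_600_000, 1.0e-5],   # hold, where the reward curve is going flat
          [7_000_000, 1.0e-6],   # slow final taper
      ]
      # Stretching the schedule (per the note above) means moving the 800k breakpoint
      # out, e.g. to 1.6M, so the LR stays higher while the reward is still climbing.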
2/15/23:
* Next run
* Set seed to a constant value, since varying it just creates an additional variable that
may be clouding the story of what works & doesn't.
* Stretched out the LR schedule, per above, so it doesn't hit 1e-5 until 1.6M steps.
After that it is the same as before.
* All trials looked similar to recent previous runs: the reward curve slope gradually
  decreases after a few hundred thousand steps, so that it is pretty flat after 1M, and
  nothing reached above a mean reward of -10.
2/19/23:
* Compared code between current branch (3-neighbors) and master, which was last committed around